TY - GEN
T1 - Classification of closely related sub-dialects of Arabic using support-vector machines
AU - Wray, Samantha
N1 - Funding Information:
Preliminary work on this research appeared as part of a doctoral dissertation (Wray, 2016). Funding for this research is gratefully acknowledged from the National Science Foundation under BCS-1533780, and the University of Arizona Graduate and Professional Student Council. An allocation of computer time is also gratefully acknowledged from the UofA Research Computing HPC and HTC, as well as HPC at New York University Abu Dhabi.
Publisher Copyright:
© LREC 2018 - 11th International Conference on Language Resources and Evaluation. All rights reserved.
PY - 2019
Y1 - 2019
AB - Colloquial dialects of Arabic can be roughly categorized into five groups based on relatedness and geographic location (Egyptian, North African/Maghrebi, Gulf, Iraqi, and Levantine), but given that all dialects utilize much of the same writing system and share overlapping features and vocabulary, dialect identification and text classification are no trivial tasks. Furthermore, text classification by dialect is often performed at a coarse-grained level into these five groups or a subset thereof, and there is little work on sub-dialectal classification. The current study utilizes an n-gram-based SVM to classify on a fine-grained sub-dialectal level, and compares it to methods used in dialect classification such as vocabulary pruning of shared items across dialects. Levantine is presented here as a test case, and results of 65% accuracy on a four-way classification experiment over sub-dialects of Levantine (Jordanian, Lebanese, Palestinian, and Syrian) are discussed. This paper also examines the possibility of leveraging existing mixed-dialectal resources to determine their sub-dialectal makeup by automatic classification.
KW - Language identification
KW - Text classification
KW - Validation of language resources
UR - http://www.scopus.com/inward/record.url?scp=85059891341&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85059891341&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85059891341
T3 - LREC 2018 - 11th International Conference on Language Resources and Evaluation
SP - 3671
EP - 3674
BT - LREC 2018 - 11th International Conference on Language Resources and Evaluation
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Piperidis, Stelios
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Hasida, Koiti
A2 - Mazo, Helene
A2 - Choukri, Khalid
A2 - Goggi, Sara
A2 - Mariani, Joseph
A2 - Moreno, Asuncion
A2 - Calzolari, Nicoletta
A2 - Odijk, Jan
A2 - Tokunaga, Takenobu
PB - European Language Resources Association (ELRA)
T2 - 11th International Conference on Language Resources and Evaluation, LREC 2018
Y2 - 7 May 2018 through 12 May 2018
ER -