TY - GEN
T1 - Transliteration of Arabizi into Arabic Orthography
T2 - EMNLP 2014 Workshop on Arabic Natural Language Processing, ANLP 2014
AU - Bies, Ann
AU - Song, Zhiyi
AU - Maamouri, Mohamed
AU - Grimes, Stephen
AU - Lee, Haejoong
AU - Wright, Jonathan
AU - Strassel, Stephanie
AU - Habash, Nizar
AU - Eskander, Ramy
AU - Rambow, Owen
N1 - Publisher Copyright: ©2014 Association for Computational Linguistics
PY - 2014
Y1 - 2014
AB - This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and nonstandard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, Arabic speakers in social media as well as discussion forums, SMS messaging and online chat often use a non-standard romanization called Arabizi. In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing expect Arabic script input. The corpus described in this paper is expected to support Arabic NLP by providing this resource.
UR - http://www.scopus.com/inward/record.url?scp=85122791870&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122791870&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85122791870
T3 - ANLP 2014 - EMNLP 2014 Workshop on Arabic Natural Language Processing, Proceedings
SP - 93
EP - 103
BT - ANLP 2014 - EMNLP 2014 Workshop on Arabic Natural Language Processing, Proceedings
A2 - Habash, Nizar
A2 - Vogel, Stephan
PB - Association for Computational Linguistics (ACL)
Y2 - 25 October 2014
ER -