Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus

Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee, Jonathan Wright, Stephanie Strassel, Nizar Habash, Ramy Eskander, Owen Rambow

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and nonstandard abbreviations are common; and non-linguistic content is written out, such as laughter, sound representations, and emoticons. This situation is exacerbated in the case of Arabic social media for two reasons. First, Arabic dialects, commonly used in social media, are quite different from Modern Standard Arabic phonologically, morphologically and lexically, and most importantly, they lack standard orthographies. Second, Arabic speakers in social media as well as discussion forums, SMS messaging and online chat often use a non-standard romanization called Arabizi. In the context of natural language processing of social media Arabic, transliterating from Arabizi of various dialects to Arabic script is a necessary step, since many of the existing state-of-the-art resources for Arabic dialect processing expect Arabic script input. The corpus described in this paper is expected to support Arabic NLP by providing this resource.

Original language: English (US)
Title of host publication: ANLP 2014 - EMNLP 2014 Workshop on Arabic Natural Language Processing, Proceedings
Editors: Nizar Habash, Stephan Vogel
Publisher: Association for Computational Linguistics (ACL)
Pages: 93-103
Number of pages: 11
ISBN (Electronic): 9781937284961
State: Published - 2014
Event: EMNLP 2014 Workshop on Arabic Natural Language Processing, ANLP 2014 - Doha, Qatar
Duration: Oct 25 2014 → …

Publication series

Name: ANLP 2014 - EMNLP 2014 Workshop on Arabic Natural Language Processing, Proceedings

Conference

Conference: EMNLP 2014 Workshop on Arabic Natural Language Processing, ANLP 2014
Country/Territory: Qatar
City: Doha
Period: 10/25/14 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
