A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models

Ali Shazal, Aiza Usman, Nizar Habash

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

While online Arabic is primarily written using the Arabic script, a Roman-script variety called Arabizi is often seen on social media. Although this representation captures the phonology of the language, it is not a one-to-one mapping with the Arabic script version. This issue is exacerbated by the fact that Arabizi on social media is Dialectal Arabic, which does not have a standard orthography. Furthermore, Arabizi tends to include a lot of code mixing between Arabic and English (or French). To map Arabizi text to Arabic script in the context of complete utterances, previously published efforts have split Arabizi detection and Arabic script targeting into two separate tasks. In this paper, we present the first effort on a unified model for Arabizi detection and transliteration into a code-mixed output with consistent Arabic spelling conventions, using a sequence-to-sequence deep learning model. Our best system achieves 80.6% word accuracy and 58.7% BLEU on a blind test set.
Original language: Undefined
Title of host publication: Proceedings of the Fifth Arabic Natural Language Processing Workshop
Place of Publication: Barcelona, Spain (Online)
Publisher: Association for Computational Linguistics
Pages: 167-177
Number of pages: 11
State: Published - Dec 1 2020
