Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Bashar Alhafni, Sarah Al-Towaity, Ziyad Fawzy, Fatema Nassar, Fadhl Eryani, Houda Bouamor, Nizar Habash

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.

Original language: English (US)
Title of host publication: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference
Editors: Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Publisher: Association for Computational Linguistics (ACL)
Pages: 42-54
Number of pages: 13
ISBN (Electronic): 9798891761322
State: Published - 2024
Event: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024 - Bangkok, Thailand
Duration: Aug 16 2024 → …

Publication series

Name: ArabicNLP 2024 - 2nd Arabic Natural Language Processing Conference, Proceedings of the Conference

Conference

Conference: 2nd Arabic Natural Language Processing Conference, ArabicNLP 2024
Country/Territory: Thailand
City: Bangkok
Period: 8/16/24 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Software
  • Linguistics and Language
