Unsupervised Arabic dialect segmentation for machine translation

Wael Salloum, Nizar Habash

Research output: Contribution to journal, Article, peer-review


Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich, and yet resource-poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.
Original language: English (US)
Pages (from-to): 223-248
Number of pages: 26
Journal: Natural Language Engineering
Issue number: 2
State: Published - Mar 23 2022


Keywords

  • Arabic dialects
  • Machine translation
  • Morphology
  • Unsupervised learning

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Language and Linguistics
  • Linguistics and Language

