Translate, predict or generate: Modeling rich morphology in statistical machine translation

Ahmed El Kholy, Nizar Habash

Research output: Contribution to conference › Paper › peer-review

Abstract

We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process, which maps source tokens to target tokens. Alternatively, these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a step separate from translation and generation. We focus on three morphological features that we demonstrate, through a manual error analysis, to be the most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-of-the-art baseline (phrase-based SMT) of almost 1% absolute BLEU on a medium-size training set. Our best configuration models the determiner as part of core translation, predicts gender and number separately, and handles the rest of the features through generation.

Original language: English (US)
Pages: 27-34
Number of pages: 8
State: Published - 2012
Event: 16th Annual Conference of the European Association for Machine Translation, EAMT 2012 - Trento, Italy
Duration: May 28 2012 - May 30 2012

Other

Other: 16th Annual Conference of the European Association for Machine Translation, EAMT 2012
Country/Territory: Italy
City: Trento
Period: 5/28/12 - 5/30/12

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software
