Abstract
We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process, which maps source tokens to target tokens. Alternatively, these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a step separate from translation and generation. We focus on three morphological features that a manual error analysis shows to be the most problematic for English-Arabic SMT: gender, number, and the determiner clitic. Our results show significant improvements over a state-of-the-art phrase-based SMT baseline of almost 1% absolute BLEU on a medium-sized training set. Our best configuration models the determiner as part of core translation, predicts gender and number in a separate step, and handles the remaining features through generation.
Original language | English (US) |
---|---|
Pages | 27-34 |
Number of pages | 8 |
State | Published - 2012 |
Event | 16th Annual Conference of the European Association for Machine Translation, EAMT 2012, Trento, Italy. Duration: May 28, 2012 → May 30, 2012 |
Other
Other | 16th Annual Conference of the European Association for Machine Translation, EAMT 2012 |
---|---|
Country/Territory | Italy |
City | Trento |
Period | 5/28/12 → 5/30/12 |
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Software