Abstract
We study challenges raised by the order of Arabic verbs and their subjects in statistical machine translation (SMT). We show that the boundaries of post-verbal subjects (VS) are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. In addition, VS constructions have highly ambiguous reordering patterns when translated to English, and these patterns are very different for matrix (main clause) VS and non-matrix (subordinate clause) VS. Based on this analysis, we propose a novel method for leveraging VS information in SMT: we reorder VS constructions into pre-verbal (SV) order for word alignment. Unlike previous approaches to source-side reordering, phrase extraction and decoding are performed using the original Arabic word order. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline. Limiting reordering to matrix VS yields further improvements.
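The alignment-only reordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the verb position and the subject span are already known (in the paper they come from a dependency parser) and that some external word aligner has produced links over the reordered sentence. The function names `reorder_vs_to_sv` and `project_alignment` and the toy tokens are hypothetical.

```python
from typing import List, Sequence, Tuple


def reorder_vs_to_sv(
    tokens: Sequence[str], verb_idx: int, subj_start: int, subj_end: int
) -> Tuple[List[str], List[int]]:
    """Move the post-verbal subject span [subj_start, subj_end) in front of the verb.

    Returns the reordered tokens and `order`, where order[new_pos] is the
    position of that token in the original sentence.
    """
    if not (0 <= verb_idx < subj_start < subj_end <= len(tokens)):
        raise ValueError("expected a verb followed by a non-empty subject span")
    order = (
        list(range(verb_idx))                    # material before the verb
        + list(range(subj_start, subj_end))      # subject, moved forward
        + list(range(verb_idx, subj_start))      # verb and anything up to the subject
        + list(range(subj_end, len(tokens)))     # rest of the sentence
    )
    return [tokens[i] for i in order], order


def project_alignment(
    links: Sequence[Tuple[int, int]], order: Sequence[int]
) -> List[Tuple[int, int]]:
    """Map (reordered_source_pos, target_pos) links back to original source positions."""
    return sorted((order[src], tgt) for src, tgt in links)


# Toy example with placeholder tokens: a verb, a two-word subject, an object.
source = ["V", "S1", "S2", "O"]
reordered, order = reorder_vs_to_sv(source, verb_idx=0, subj_start=1, subj_end=3)
# reordered is ["S1", "S2", "V", "O"]; order is [1, 2, 0, 3]

# Assume an aligner linked the SV-ordered source monotonically to the target.
links_on_reordered = [(0, 0), (1, 1), (2, 2), (3, 3)]
links_on_original = project_alignment(links_on_reordered, order)
# links_on_original is [(0, 2), (1, 0), (2, 1), (3, 3)]
```

Because the links are projected back onto the original positions, any later step that consumes them (phrase extraction, decoding) still sees the source sentence in its original VS order, which is the property the abstract highlights.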
Original language | English (US) |
---|---|
Pages (from-to) | 105-120 |
Number of pages | 16 |
Journal | Machine Translation |
Volume | 26 |
Issue number | 1-2 |
State | Published - Mar 2012 |
Keywords
- Dependency parsing
- Matrix subject
- Post-verbal subjects
- Reordering
- Statistical machine translation
- Subject detection
- VS
- Word alignment
ASJC Scopus subject areas
- Software
- Language and Linguistics
- Linguistics and Language
- Artificial Intelligence