Abstract
In this paper, we study the effect of different word-level preprocessing decisions for Arabic on statistical machine translation (SMT) quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.
Original language | English (US) |
---|---|
Pages | 49-52 |
Number of pages | 4 |
State | Published - 2006 |
Event | 2006 Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL 2006 - New York, United States. Duration: Jun 4 2006 → Jun 9 2006 |
Conference
Conference | 2006 Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL 2006 |
---|---|
Country/Territory | United States |
City | New York |
Period | 6/4/06 → 6/9/06 |
ASJC Scopus subject areas
- Linguistics and Language
- Language and Linguistics
- Computer Science Applications