Arabic preprocessing schemes for statistical machine translation

Nizar Habash, Fatiha Sadat

Research output: Contribution to conference, Paper, peer-reviewed

Abstract

In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in genre between training and test data.

Original language: English (US)
Pages: 49-52
Number of pages: 4
State: Published - 2006
Event: 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2006 - New York, United States
Duration: Jun 4 2006 → Jun 9 2006

Conference

Conference: 2006 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2006
Country/Territory: United States
City: New York
Period: 6/4/06 → 6/9/06

ASJC Scopus subject areas

  • Linguistics and Language
  • Language and Linguistics
  • Computer Science Applications
