Building an Arabic machine translation post-edited corpus: Guidelines and annotation

Wajdi Zaghouani, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor, Kemal Oflazer

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present our guidelines and annotation procedure to create a human corrected machine translated post-edited corpus for the Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure a high annotation agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement measures were performed to check the annotation quality. The created corpus of manual post-edited translations of English to Arabic articles is the largest to date for this language pair.

Original languageEnglish (US)
Title of host publicationProceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
EditorsNicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani
PublisherEuropean Language Resources Association (ELRA)
Pages1869-1876
Number of pages8
ISBN (Electronic)9782951740891
StatePublished - 2016
Event10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: May 23 2016May 28 2016

Publication series

NameProceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

Other

Other10th International Conference on Language Resources and Evaluation, LREC 2016
Country/TerritorySlovenia
CityPortoroz
Period5/23/165/28/16

Keywords

  • Annotation
  • Guidelines
  • Post-editing

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Building an Arabic machine translation post-edited corpus: Guidelines and annotation'. Together they form a unique fingerprint.

Cite this