Exploiting Arabic diacritization for high quality automatic annotation

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.

Original languageEnglish (US)
Title of host publicationProceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
EditorsNicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani
PublisherEuropean Language Resources Association (ELRA)
Pages4298-4304
Number of pages7
ISBN (Electronic)9782951740891
StatePublished - 2016
Event10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: May 23 2016May 28 2016

Publication series

NameProceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

Other

Other10th International Conference on Language Resources and Evaluation, LREC 2016
Country/TerritorySlovenia
CityPortoroz
Period5/23/165/28/16

Keywords

  • Annotation
  • Arabic
  • Diacritization
  • Morphology

ASJC Scopus subject areas

  • Linguistics and Language
  • Library and Information Sciences
  • Language and Linguistics
  • Education

Fingerprint

Dive into the research topics of 'Exploiting Arabic diacritization for high quality automatic annotation'. Together they form a unique fingerprint.

Cite this