ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus

Nizar Habash, David Palfreyman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates. We describe and discuss the various guidelines and pipeline processes we followed to create the annotations and quality check them. The annotations include spelling and grammar correction, morphological tokenization, Part-of-Speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All of the annotations are done on Arabic and English texts using consistent guidelines as much as possible, with tracked alignments among the different annotations, and to the original raw texts. For morphological tokenization, POS tagging, and lemmatization, we use existing automatic annotation tools followed by manual correction. We also present various measurements and correlations with preliminary insights drawn from the data and annotations. The publicly available ZAEBUC corpus and its annotations are intended to be the stepping stones for additional annotations.

Original language: English (US)
Title of host publication: 2022 Language Resources and Evaluation Conference, LREC 2022
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 79-88
Number of pages: 10
ISBN (Electronic): 9791095546726
State: Published - 2022
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: Jun 20 2022 to Jun 25 2022

Keywords

  • Annotated Corpus
  • Arabic
  • CEFR
  • English
  • Learner Corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education
