The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic

Dana Abdulrahim, Go Inoue, Latifa Shamsan, Salam Khalifa, Nizar Habash

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In recent years, the focus on developing natural language processing (NLP) tools for Arabic has shifted from Modern Standard Arabic to various Arabic dialects. Corpora of various sizes, representing different genres, have been created for a number of Arabic dialects. As far as Gulf Arabic is concerned, the Gumar Corpus (Khalifa et al., 2016) is the largest corpus to date that includes data representing the dialectal Arabic of the six Gulf Cooperation Council countries (Bahrain, Kuwait, Saudi Arabia, Qatar, United Arab Emirates, and Oman), particularly in the genre of “online forum novels”. In this paper, we present the Bahrain Corpus. Our objective is to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to different genres (folktales, comedy shows, plays, cooking shows, etc.). The corpus comprises 620K carefully curated words. We provide automatic morphological annotations of the full corpus using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K-word sample. We plan to make the annotated sample as well as the full corpus publicly available to support researchers interested in Arabic NLP.

Original language: English (US)
Title of host publication: 2022 Language Resources and Evaluation Conference, LREC 2022
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 2345-2352
Number of pages: 8
ISBN (Electronic): 9791095546726
State: Published - 2022
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: Jun 20 2022 → Jun 25 2022

Publication series

Name: 2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/Territory: France
City: Marseille
Period: 6/20/22 → 6/25/22

Keywords

  • Arabic
  • corpus
  • dialect
  • Gulf Arabic
  • morphology

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education
