Developing and using a pilot dialectal Arabic treebank

Mohamed Maamouri, Ann Bies, Tim Buckwalter, Mona Diab, Nizar Habash, Owen Rambow, Dalila Tabessi

Research output: Contribution to conference › Paper › peer-review

Abstract

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our preexisting MSA resources and the new dialectal corpus.

Original language: English (US)
Pages: 443-448
Number of pages: 6
State: Published - 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: May 22, 2006 to May 28, 2006

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
