Abstract
In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26, 000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our preexisting MSA resources and the new dialectal corpus.
Original language | English (US) |
---|---|
Pages | 443-448 |
Number of pages | 6 |
State | Published - 2006 |
Event | 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy Duration: May 22 2006 → May 28 2006 |
Other
Other | 5th International Conference on Language Resources and Evaluation, LREC 2006 |
---|---|
Country/Territory | Italy |
City | Genoa |
Period | 5/22/06 → 5/28/06 |
ASJC Scopus subject areas
- Education
- Library and Information Sciences
- Linguistics and Language
- Language and Linguistics