A categorial variation database for English

Nizar Habash, Bonnie Dorr

Research output: Contribution to conference, Paper, peer-review


We describe our approach to the construction and evaluation of a large-scale database called “CatVar” which contains categorial variations of English lexemes. Due to the prevalence of cross-language categorial variation in multilingual applications, our categorial-variation resource may serve as an integral part of a diverse range of natural language applications. Thus, the research reported herein overlaps heavily with that of the machine-translation, lexicon-construction, and information-retrieval communities. We apply the information-retrieval metrics of precision and recall to evaluate the accuracy and coverage of our database with respect to a human-produced gold standard. This evaluation reveals that the categorial database achieves a high degree of precision and recall. Additionally, we demonstrate that the database improves on the linkability of the Porter stemmer by over 30%.

Original language: English (US)
State: Published - 2003
Event: 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003 - Edmonton, Canada
Duration: May 27 2003 to Jun 1 2003



ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

