Unsupervised morphology-based vocabulary expansion

Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash, Owen Rambow

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary with a small amount of unlabeled training data.

Original languageEnglish (US)
Title of host publicationLong Papers
PublisherAssociation for Computational Linguistics (ACL)
Pages1349-1359
Number of pages11
ISBN (Print)9781937284725
DOIs
StatePublished - 2014
Event52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Baltimore, MD, United States
Duration: Jun 22 2014Jun 27 2014

Publication series

Name52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference
Volume1

Other

Other52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014
CountryUnited States
CityBaltimore, MD
Period6/22/146/27/14

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Fingerprint Dive into the research topics of 'Unsupervised morphology-based vocabulary expansion'. Together they form a unique fingerprint.

Cite this