Learning N-gram language models from uncertain data

Vitaly Kuznetsov, Hank Liao, Mehryar Mohri, Michael Riley, Brian Roark

Research output: Contribution to journal › Conference article › peer-review


We present a new algorithm for efficiently training n-gram language models on uncertain data, and illustrate its use for semi-supervised language model adaptation. We compute the probability that an n-gram occurs k times in the sample of uncertain data, and use the resulting histograms to derive a generalized Katz back-off model. We compare three approaches to semi-supervised adaptation of language models for speech recognition of selected YouTube video categories: (1) using just the one-best output from the baseline speech recognizer, (2) using samples from lattices with standard algorithms, or (3) using full lattices with our new algorithm. Unlike the other methods, our new algorithm yields models with solid improvements over the baseline on the full test set, and it achieves these gains without hurting performance on any individual video category. We show that categories with the most data yielded the largest gains. The algorithm has been released as part of the OpenGrm n-gram library [1].
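The central computation the abstract describes can be illustrated compactly. The Python sketch below is not the released OpenGrm implementation; it assumes, for illustration, that each utterance is given as a short list of alternative transcripts with posterior probabilities summing to 1 (e.g. pruned lattice paths) and that counts in different utterances are independent. Under those assumptions it builds, for each n-gram, a distribution over its occurrence count k and aggregates the fractional count-of-count histograms n_k = Σ_w P(count(w) = k) that a generalized Katz back-off model would discount against.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def utterance_count_dist(paths, n):
    """Distribution over each n-gram's count within one utterance.

    paths: list of (tokens, posterior) alternatives, posteriors
    summing to 1. Returns {ngram: {k: P(count = k)}} for k >= 1;
    the remaining mass sits implicitly on k = 0.
    """
    dist = defaultdict(lambda: defaultdict(float))
    for tokens, posterior in paths:
        counts = defaultdict(int)
        for g in ngrams(tokens, n):
            counts[g] += 1
        for g, k in counts.items():
            dist[g][k] += posterior
    return dist

def convolve(d1, d2, max_k):
    """Convolve two count distributions, capping counts at max_k."""
    out = defaultdict(float)
    for k1, p1 in d1.items():
        for k2, p2 in d2.items():
            out[min(k1 + k2, max_k)] += p1 * p2
    return out

def corpus_count_dists(corpus, n, max_k=8):
    """Combine per-utterance count distributions across a corpus.

    corpus: list of utterances, each a list of (tokens, posterior).
    Assumes counts in different utterances are independent.
    """
    total = {}  # ngram -> {k: P(count = k)}, including k = 0
    for paths in corpus:
        for g, d in utterance_count_dist(paths, n).items():
            d = dict(d)
            d[0] = max(0.0, 1.0 - sum(d.values()))  # mass on "absent"
            total[g] = convolve(total.get(g, {0: 1.0}), d, max_k)
    return total

def count_histograms(total):
    """Fractional count-of-counts: n_k = sum over n-grams of P(count = k)."""
    hist = defaultdict(float)
    for d in total.values():
        for k, p in d.items():
            if k > 0:
                hist[k] += p
    return hist

# Toy example: one utterance with two competing transcripts.
corpus = [[("the cat sat".split(), 0.7), ("the cat sang".split(), 0.3)]]
hist = count_histograms(corpus_count_dists(corpus, n=2))
print(dict(hist))  # {1: 2.0}: expected number of bigrams occurring exactly once
```

With certain (one-best) data every count distribution collapses to a point mass, so these histograms reduce to the usual integer counts-of-counts and standard Katz/Good-Turing discounting is recovered as a special case.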

Original language: English (US)
Pages (from-to): 2323-2327
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
State: Published - 2016
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: Sep 8, 2016 – Sep 16, 2016

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

