Addressing noise in multidialectal word embeddings

Alexander Erdmann, Nasser Zalmout, Nizar Habash

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Word embeddings are crucial to many natural language processing tasks. The quality of embeddings relies on large non-noisy corpora. Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools to maximize the informative content leveraged in each training sentence. Second, we analyze methods for representing disparate dialects in one embedding space, either by mapping individual dialects into a shared space or by learning a joint model of all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported in the task enable us to analyze our contributions’ effects on low- and high-frequency words. In addition to boosting performance by 2-53%, we specifically improve on noisy, low-frequency forms without compromising accuracy on high-frequency forms.

Original language: English (US)
Title of host publication: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 558-565
Number of pages: 8
ISBN (Electronic): 9781948087346
State: Published - 2018
Event: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018 - Melbourne, Australia
Duration: Jul 15 2018 to Jul 20 2018

Publication series

Name: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
Volume: 2

Conference

Conference: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Country: Australia
City: Melbourne
Period: 7/15/18 to 7/20/18

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics
