TY - GEN
T1 - Addressing noise in multidialectal word embeddings
AU - Erdmann, Alexander
AU - Zalmout, Nasser
AU - Habash, Nizar
N1 - Funding Information:
The second author was supported by the New York University Abu Dhabi Global PhD Student Fellowship program. We are also grateful for the support of the High Performance Computing Center at New York University Abu Dhabi.
Publisher Copyright:
© 2018 Association for Computational Linguistics
PY - 2018
Y1 - 2018
N2 - Word embeddings are crucial to many natural language processing tasks. The quality of embeddings relies on large non-noisy corpora. Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools to maximize the informative content leveraged in each training sentence. Second, we analyze methods for representing disparate dialects in one embedding space, either by mapping individual dialects into a shared space or learning a joint model of all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported in the task enable us to analyze our contributions’ effects on low and high frequency words. In addition to boosting performance between 2-53%, we specifically improve on noisy, low frequency forms without compromising accuracy on high frequency forms.
AB - Word embeddings are crucial to many natural language processing tasks. The quality of embeddings relies on large non-noisy corpora. Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools to maximize the informative content leveraged in each training sentence. Second, we analyze methods for representing disparate dialects in one embedding space, either by mapping individual dialects into a shared space or learning a joint model of all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported in the task enable us to analyze our contributions’ effects on low and high frequency words. In addition to boosting performance between 2-53%, we specifically improve on noisy, low frequency forms without compromising accuracy on high frequency forms.
UR - http://www.scopus.com/inward/record.url?scp=85062218690&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062218690&partnerID=8YFLogxK
U2 - 10.18653/v1/p18-2089
DO - 10.18653/v1/p18-2089
M3 - Conference contribution
AN - SCOPUS:85062218690
T3 - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
SP - 558
EP - 565
BT - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Y2 - 15 July 2018 through 20 July 2018
ER -