Statistical Uncertainty in Word Embeddings: GloVe-V

Andrea Vallebueno, Cassandra Handan-Nader, Christopher D. Manning, Daniel E. Ho

    Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

    Abstract

    Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe (Pennington et al., 2014), one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.
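To make the idea of testing word-pair similarities under uncertainty concrete, here is a minimal sketch. It is not the paper's analytical procedure: it assumes each word vector is summarized by a mean and a covariance matrix under a multivariate normal model, and it propagates that uncertainty to cosine similarity by Monte Carlo sampling. The function name `cosine_interval`, the synthetic means, and the diagonal covariances are all hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_interval(mu_a, cov_a, mu_b, cov_b, n_draws=2000, level=0.95, seed=0):
    """Point estimate and Monte Carlo interval for cosine similarity.

    Assumption: each word vector is distributed as N(mu, cov), and the
    two words are sampled independently of each other.
    """
    rng = np.random.default_rng(seed)
    a = rng.multivariate_normal(mu_a, cov_a, size=n_draws)
    b = rng.multivariate_normal(mu_b, cov_b, size=n_draws)

    # Row-wise cosine similarity between paired draws.
    dots = np.einsum("ij,ij->i", a, b)
    sims = dots / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

    alpha = 1.0 - level
    lower, upper = np.quantile(sims, [alpha / 2, 1 - alpha / 2])
    point = float(mu_a @ mu_b / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b)))
    return point, float(lower), float(upper)


# Synthetic example: identical mean vectors, different amounts of noise.
# A small covariance stands in for a frequent word, a large one for a rare word.
dim = 50
rng = np.random.default_rng(42)
mu_a, mu_b = rng.normal(size=dim), rng.normal(size=dim)

tight = cosine_interval(mu_a, 1e-4 * np.eye(dim), mu_b, 1e-4 * np.eye(dim))
wide = cosine_interval(mu_a, 0.25 * np.eye(dim), mu_b, 0.25 * np.eye(dim))

print("low-variance pair :", tight)
print("high-variance pair:", wide)
```

With point estimates alone, the two pairs above would look identical. The intervals show how much less the similarity can be trusted when the underlying vectors are noisy.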

    Original language: English (US)
    Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
    Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 9032-9047
    Number of pages: 16
    ISBN (Electronic): 9798891761643
    State: Published - 2024
    Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
    Duration: Nov 12 2024 - Nov 16 2024

    Conference

    Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
    Country/Territory: United States
    City: Hybrid, Miami
    Period: 11/12/24 - 11/16/24

    ASJC Scopus subject areas

    • Computational Theory and Mathematics
    • Computer Science Applications
    • Information Systems
    • Linguistics and Language
