Learning Embedding Representations in High Dimensions

Golara Ahmadi Azar, Melika Emami, Alyson Fletcher, Sundeep Rangan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Embeddings are a basic initial feature extraction step in many machine learning models, particularly in natural language processing. An embedding attempts to map data tokens to a low-dimensional space in which similar tokens are mapped to vectors that are close to one another under some metric in the embedding space. A basic question is how well such embeddings can be learned. To study this problem, we consider a simple probability model for discrete data in which there is some "true" but unknown embedding, and the correlation of the random variables is related to the similarity of the embeddings. Under this model, it is shown that the embeddings can be learned by a variant of the low-rank approximate message passing (AMP) method. The AMP approach enables precise predictions of the accuracy of the estimation in certain high-dimensional limits. In particular, the methodology provides insight into how the probability distribution depends on key parameters such as the number of samples per value, the frequency of the terms, and the strength of the embedding correlation. Our theoretical findings are validated by simulations on both synthetic data and real text data.
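
The kind of model described in the abstract can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not the paper's method: it assumes symmetric pairwise counts drawn from a Poisson channel whose rate grows with the inner product of hypothetical Gaussian embeddings, and it replaces the AMP estimator with a plain spectral (top-eigenvector) estimate. All function names and parameter values (`simulate_counts`, `base_rate`, `strength`, and so on) are illustrative and do not come from the paper.

```python
import numpy as np


def simulate_counts(n=300, d=2, base_rate=50.0, strength=3.0, seed=0):
    """Draw hypothetical embeddings and symmetric Poisson pair counts.

    Assumed model (illustrative only): the count for a token pair (i, j)
    is Poisson with rate base_rate * (1 + strength * <u_i, u_j> / sqrt(n)).
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, d))          # "true" but unknown embeddings
    rate = base_rate * (1.0 + strength * (U @ U.T) / np.sqrt(n))
    rate = np.clip(rate, 1e-6, None)         # Poisson rates must be positive
    upper = np.triu(rng.poisson(rate), k=1)  # sample each unordered pair once
    Z = upper + upper.T                      # symmetric counts, zero diagonal
    return U, Z


def spectral_embedding(Z, d, base_rate):
    """Estimate the embedding subspace from the top-d eigenvectors.

    This is a simple spectral stand-in for the AMP estimator.
    """
    n = Z.shape[0]
    # Remove the baseline rate from the off-diagonal entries.
    centered = Z - base_rate * (1.0 - np.eye(n))
    vals, vecs = np.linalg.eigh(centered)
    top = np.argsort(vals)[-d:]              # indices of the d largest eigenvalues
    return vecs[:, top]


def subspace_overlap(U, V):
    """Mean squared cosine of the principal angles between span(U) and span(V)."""
    Q, _ = np.linalg.qr(U)
    cosines = np.linalg.svd(Q.T @ V, compute_uv=False)
    return float(np.mean(cosines ** 2))


if __name__ == "__main__":
    U, Z = simulate_counts()
    V = spectral_embedding(Z, d=2, base_rate=50.0)
    print("subspace overlap:", subspace_overlap(U, V))
```

In this toy, lowering `strength` or `base_rate` shrinks the overlap, which matches the qualitative dependence on correlation strength and term frequency mentioned in the abstract. The precise accuracy predictions come from the AMP state evolution analysis, which this sketch does not implement.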

Original language: English (US)
Title of host publication: 2024 58th Annual Conference on Information Sciences and Systems, CISS 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350369298
State: Published - 2024
Event: 58th Annual Conference on Information Sciences and Systems, CISS 2024 - Princeton, United States
Duration: Mar 13 2024 to Mar 15 2024

Publication series

Name: 2024 58th Annual Conference on Information Sciences and Systems, CISS 2024

Conference

Conference: 58th Annual Conference on Information Sciences and Systems, CISS 2024
Country/Territory: United States
City: Princeton
Period: 3/13/24 to 3/15/24

Keywords

  • AMP
  • Embedding learning
  • Poisson channel
  • State Evolution

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Safety, Risk, Reliability and Quality
  • Control and Optimization
  • Modeling and Simulation
  • Computational Theory and Mathematics
