Tokenized and continuous embedding compressions of protein sequence and structure

Amy X. Lu, Wilson Yan, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, Nathan C. Frey

Research output: Contribution to journal › Article › peer-review

Abstract

Existing protein machine learning representations typically model either the sequence or structure distribution, with the other modality implicit. Here, we characterize an embedding of the joint distribution of protein sequence and structure by compressing the latent space of the protein folding model ESMFold. This provides mechanistic interpretability insights, as well as a flexible compressed representation. We term these CHEAP (compressed hourglass embedding adaptations of proteins) embeddings. In continuous compression schemes, the ESMFold latent space can be reduced by factors of 128× along the channel and 8× along the length while retaining structure information at <2 Å scale accuracy and performing competitively on protein function and localization benchmarks. In discrete compression schemes, we construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of an all-atom structure that can be obtained from the sequence alone. CHEAP democratizes representations captured by large models and can enable flexible downstream applications such as generation, search, and prediction.
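The abstract's compression factors can be made concrete with a small sketch. Below is a minimal PyTorch illustration of reducing an ESMFold-style latent of shape (batch, length, 1024) by 128× along the channel dimension and 8× along the length dimension; the specific layers used here (a linear channel projection followed by a strided convolution) are illustrative assumptions, not the published CHEAP hourglass architecture.

```python
import torch
import torch.nn as nn

class CompressionSketch(nn.Module):
    """Illustrative only: compress a (batch, length, 1024) latent by
    128x along channels and 8x along length. Layer choices are
    assumptions, not the CHEAP architecture from the paper."""

    def __init__(self, d_in: int = 1024, channel_factor: int = 128, length_factor: int = 8):
        super().__init__()
        self.d_out = d_in // channel_factor  # 1024 / 128 = 8 channels
        # Linear projection shrinks the channel dimension.
        self.channel_proj = nn.Linear(d_in, self.d_out)
        # Strided 1D convolution downsamples the sequence length.
        self.length_pool = nn.Conv1d(self.d_out, self.d_out,
                                     kernel_size=length_factor, stride=length_factor)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, length, d_in), e.g. an ESMFold trunk embedding
        x = self.channel_proj(latents)            # (batch, length, d_out)
        x = self.length_pool(x.transpose(1, 2))   # (batch, d_out, length // factor)
        return x.transpose(1, 2)                  # (batch, length // factor, d_out)

# Example: a 256-residue latent is compressed to shape (32, 8).
emb = torch.randn(1, 256, 1024)
print(CompressionSketch()(emb).shape)             # torch.Size([1, 32, 8])
```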

Original language: English (US)
Article number: 101289
Journal: Patterns
Volume: 6
Issue number: 6
State: Published - Jun 13, 2025

Keywords

  • model interpretability
  • neural compression
  • protein language models
  • protein structure tokenization
  • representation learning

ASJC Scopus subject areas

  • General Decision Sciences
