TY - JOUR
T1 - Tokenized and continuous embedding compressions of protein sequence and structure
AU - Lu, Amy X.
AU - Yan, Wilson
AU - Yang, Kevin K.
AU - Gligorijevic, Vladimir
AU - Cho, Kyunghyun
AU - Abbeel, Pieter
AU - Bonneau, Richard
AU - Frey, Nathan C.
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/6/13
Y1 - 2025/6/13
N2 - Existing protein machine learning representations typically model either the sequence or structure distribution, with the other modality implicit. Here, we characterize an embedding of the joint distribution of protein sequence and structure by compressing the latent space of the protein folding model ESMFold. This provides mechanistic interpretability insights, as well as a flexible compressed representation. We term these CHEAP (compressed hourglass embedding adaptations of proteins) embeddings. In continuous compression schemes, the ESMFold latent space can be reduced by factors of 128× along the channel and 8× along the length while retaining structure information at <2 Å scale accuracy and performing competitively on protein function and localization benchmarks. In discrete compression schemes, we construct a tokenized all-atom structure vocabulary that retains high reconstruction accuracy, thus introducing a tokenized representation of an all-atom structure that can be obtained from the sequence alone. CHEAP democratizes representations captured by large models and can enable flexible downstream applications such as generation, search, and prediction.
KW - model interpretability
KW - neural compression
KW - protein language models
KW - protein structure tokenization
KW - representation learning
UR - http://www.scopus.com/inward/record.url?scp=105007744182&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105007744182&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2025.101289
DO - 10.1016/j.patter.2025.101289
M3 - Article
AN - SCOPUS:105007744182
SN - 2666-3899
VL - 6
JO - Patterns
JF - Patterns
IS - 6
M1 - 101289
ER -