TY - GEN
T1 - SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
T2 - 18th European Conference on Computer Vision, ECCV 2024
AU - Ma, Nanye
AU - Goldstein, Mark
AU - Albergo, Michael S.
AU - Boffi, Nicholas M.
AU - Vanden-Eijnden, Eric
AU - Xie, Saining
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
AB - We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). The interpolant framework, which connects two distributions more flexibly than standard diffusion models, makes possible a modular study of the design choices impacting generative models built on dynamical transport: learning in discrete or continuous time, the objective function, the interpolant that connects the distributions, and deterministic or stochastic sampling. By carefully introducing the above ingredients, SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256×256 and 512×512 benchmarks using the exact same model structure, number of parameters, and GFLOPs. By exploring various diffusion coefficients, which can be tuned separately from learning, SiT achieves FID-50K scores of 2.06 and 2.62, respectively. Code is available here: https://github.com/willisma/SiT.
UR - http://www.scopus.com/inward/record.url?scp=85209882065&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85209882065&partnerID=8YFLogxK
DO - 10.1007/978-3-031-72980-5_2
M3 - Conference contribution
AN - SCOPUS:85209882065
SN - 9783031729799
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 23
EP - 40
BT - Computer Vision – ECCV 2024 – 18th European Conference, Proceedings
A2 - Leonardis, Aleš
A2 - Ricci, Elisa
A2 - Roth, Stefan
A2 - Russakovsky, Olga
A2 - Sattler, Torsten
A2 - Varol, Gül
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 29 September 2024 through 4 October 2024
ER -
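
The abstract describes learning a velocity field along an interpolant that connects noise to data, with sampling by integrating either a deterministic ODE or an SDE whose diffusion coefficient can be tuned after training. Below is a minimal sketch of that idea, not the authors' implementation (their code is at https://github.com/willisma/SiT); the linear interpolant x_t = (1 - t) * x0 + t * eps, the velocity-prediction loss, and the network signature model(x, t) are all illustrative assumptions.

import torch

# Training: regress the model onto the time derivative of the interpolant.
def interpolant_loss(model, x0):
    eps = torch.randn_like(x0)                       # noise endpoint of the interpolant
    # One random time per sample, broadcastable over the data dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1 - t) * x0 + t * eps                      # linear interpolant x_t
    target = eps - x0                                # d/dt x_t, the velocity target
    pred = model(xt, t.flatten())                    # assumed signature: model(x, t)
    return ((pred - target) ** 2).mean()

# Deterministic sampling: Euler-integrate the learned velocity from noise
# (t = 1) back toward data (t = 0).
@torch.no_grad()
def sample_ode(model, shape, steps=50, device="cpu"):
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        x = x - model(x, t) * dt                     # step toward t = 0
    return x

Stochastic sampling would replace the Euler step with an SDE step whose diffusion coefficient is chosen at sampling time, which is the knob the abstract says can be tuned separately from learning.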