TY - GEN
T1 - CiT: Curation in Training for Effective Vision-Language Data
T2 - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
AU - Xu, Hu
AU - Xie, Saining
AU - Huang, Po Yao
AU - Yu, Licheng
AU - Howes, Russell
AU - Ghosh, Gargi
AU - Zettlemoyer, Luke
AU - Feichtenhofer, Christoph
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring the similarity between their text embeddings and the embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
UR - http://www.scopus.com/inward/record.url?scp=85185878302&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85185878302&partnerID=8YFLogxK
U2 - 10.1109/ICCV51070.2023.01393
DO - 10.1109/ICCV51070.2023.01393
M3 - Conference contribution
AN - SCOPUS:85185878302
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 15134
EP - 15143
BT - Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2023 through 6 October 2023
ER -