CiT: Curation in Training for Effective Vision-Language Data

Hu Xu, Saining Xie, Po Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed-up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternatively selects relevant training data from the pool by measuring the similarity of their text embeddings and embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.

Original languageEnglish (US)
Title of host publicationProceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages10
ISBN (Electronic)9798350307184
StatePublished - 2023
Event2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: Oct 2 2023Oct 6 2023

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
ISSN (Print)1550-5499


Conference2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition


Dive into the research topics of 'CiT: Curation in Training for Effective Vision-Language Data'. Together they form a unique fingerprint.

Cite this