MoDE: CLIP Data Experts via Clustering

Jiawei Ma, Po Yao Huang, Saining Xie, Shang Wen Li, Luke Zettlemoyer, Shih Fu Chang, Wen Tau Yih, Hu Xu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web- crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation pre-cisely, the samples in one cluster should be semantically similar, but the number of data experts should still be rea-sonable for training and inference. As such, we consider the ontology in human language and propose to use fine- grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly include new data experts. The code is available here.

Original languageEnglish (US)
Title of host publicationProceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PublisherIEEE Computer Society
Pages26344-26353
Number of pages10
ISBN (Electronic)9798350353006
DOIs
StatePublished - 2024
Event2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: Jun 16 2024Jun 22 2024

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print)1063-6919

Conference

Conference2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/TerritoryUnited States
CitySeattle
Period6/16/246/22/24

Keywords

  • Data Clustering
  • Data Expert
  • Multi-Modal

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Fingerprint

Dive into the research topics of 'MoDE: CLIP Data Experts via Clustering'. Together they form a unique fingerprint.

Cite this