Adapting Grounded Visual Question Answering Models to Low Resource Languages

Ying Wang, Jonas Pfeiffer, Nicolas Carion, Yann LeCun, Aishwarya Kamath

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


While huge progress has been made on a variety of vision and language tasks in recent years, most major advances have been restricted to English because relevant training and evaluation datasets are scarce in other languages. A popular approach to closing this gap has been to pre-train on machine-translated multi-modal datasets or multi-lingual text-only datasets. This approach not only fails to exploit existing pre-trained state-of-the-art English multi-modal models, but is also not viable for low-resource languages, where translation quality is less reliable. We therefore propose xMDETR, a multi-lingual grounded vision-language model that adapts the state-of-the-art model MDETR to new languages without machine-translated data while keeping most of the pre-trained weights frozen. xMDETR leverages the mono-lingual pre-trained MDETR to achieve results competitive with the state of the art on xGQA, a standard multilingual VQA benchmark. It is also interpretable, providing bounding boxes for key phrases in the multi-lingual questions. Our method combines several architectural and data-driven techniques: training a new embedding space with a Masked Language Modeling (MLM) objective, code-switching, and adapters for efficient and modular training. We also explore contrastive losses to enforce the bridging of multi-modal and multi-lingual representations on multi-lingual multi-modal data, when available. We evaluate xMDETR on xGQA in both zero-shot and few-shot settings, improving results on Portuguese, Indonesian and Bengali, while remaining competitive on other languages.
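The abstract names code-switching as one of the data-driven techniques but does not spell out the procedure. As a rough illustration only, the sketch below shows one common form of word-level code-switching: each English word that appears in a bilingual lexicon is swapped for its translation with some probability. The function name `code_switch`, the `prob` and `seed` parameters, and the toy English-to-Portuguese lexicon are all assumptions made for this example, not the authors' implementation.

```python
import random


def code_switch(question, lexicon, prob=0.5, seed=None):
    """Swap words found in `lexicon` for their translation with probability `prob`.

    Assumes whitespace-tokenized input without punctuation attached to words;
    this is a simplification, not the procedure used in the paper.
    """
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    switched = []
    for token in question.split():
        key = token.lower()
        if key in lexicon and rng.random() < prob:
            switched.append(lexicon[key])
        else:
            switched.append(token)
    return " ".join(switched)


# Hypothetical toy English -> Portuguese lexicon, for illustration only.
TOY_LEXICON = {"dog": "cachorro", "red": "vermelho", "table": "mesa"}

if __name__ == "__main__":
    print(code_switch("is the dog under the red table", TOY_LEXICON, prob=0.5, seed=0))
```

With `prob=1.0` every word covered by the lexicon is replaced, and with `prob=0.0` the question is returned unchanged; intermediate values yield mixed-language questions whose token count matches the original.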

Original language: English (US)
Title of host publication: Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023
Publisher: IEEE Computer Society
Number of pages: 10
ISBN (Electronic): 9798350302493
State: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023 - Vancouver, Canada
Duration: Jun 18 2023 → Jun 22 2023

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516


Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2023

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
