While huge progress has been made on a variety of vision and language tasks in recent years, most major advances have been restricted to the English language due to the scarcity of relevant training and evaluation datasets in other languages. A popular approach to address this gap, has been to utilize machine-translated multi-modal datasets or multi-lingual text-only datasets for pre-training. This approach not only fails to exploit existing pre-trained state-of-the-art English multi-modal models, but also is not a viable solution for low-resource languages where translation quality is not as reliable. Therefore, we propose xMDETR, a multi-lingual grounded vision-language model based on the state-of-the-art model MDETR, by adapting it to new languages without machine-translated data, while also keeping most of the pre-trained weights frozen. xMDETR leverages mono-lingual pre-trained MDETR to achieve results competitive to state of the art on xGQA, a standard multilingual VQA benchmark. It is also interpretable, providing bounding boxes for key phrases in the multi-lingual questions. Our method utilizes several architectural as well as data-driven techniques such as training a new embedding space with a Masked Language Modeling (MLM) objective, code-switching, and adapters for efficient and modular training. We also explore contrastive losses to enforce the bridging of multi-modal and multi-lingual representations on multi-lingual multi-modal data, when available. We evaluate xMDETR on xGQA in both zero-shot and few-shot settings, improving results on Portuguese, Indonesian and Bengali, while remaining competitive on other languages.