Altogether: Image Captioning via Re-aligning Alt-text

Hu Xu, Po Yao Huang, Xiaoqing Ellen Tan, Ching Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen Tau Yih, Shang Wen Li, Saining Xie, Christoph Feichtenhofer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioners' training data (e.g., GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start with the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that the Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification.

Original language: English (US)
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 19302-19318
Number of pages: 17
ISBN (Electronic): 9798891761643
State: Published - 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: Nov 12 2024 → Nov 16 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 11/12/24 → 11/16/24

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
  • Linguistics and Language
