When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis

Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay

Research output: Contribution to journal › Article › peer-review

Abstract

The increasing availability of traffic videos recorded around the clock has great potential to increase the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing continuous footage from hundreds, if not thousands, of traffic cameras remains an extremely challenging task: current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detections, and require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt that generates structured responses for review and evaluation and enables fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and enables visual grounding by building upon off-the-shelf MLLMs. Our code will be made publicly available upon acceptance.
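
As a rough illustration of what a severity-based aggregation over videos of different lengths could look like, the Python sketch below splits a video into fixed-length clips and assigns the video the most severe label found among its clips. This is an assumption made for illustration only: the abstract does not specify the aggregation rule, the class names, or the clip length, so `SEVERITY`, `split_into_clips`, and `aggregate_by_severity` are hypothetical names and not part of SeeUnsafe. The per-clip labels are assumed to come from some MLLM classifier that is not shown here.

```python
from typing import List, Sequence, Tuple

# Hypothetical severity ordering (higher = more severe).
# The abstract does not list the actual classes used by SeeUnsafe.
SEVERITY = {"normal": 0, "near_miss": 1, "collision": 2}


def split_into_clips(num_frames: int, clip_len: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges that cover a video of any length."""
    if num_frames < 0 or clip_len <= 0:
        raise ValueError("num_frames must be >= 0 and clip_len must be > 0")
    return [
        (start, min(start + clip_len, num_frames))
        for start in range(0, num_frames, clip_len)
    ]


def aggregate_by_severity(clip_labels: Sequence[str]) -> str:
    """Assumed rule: the video-level label is the most severe clip-level label."""
    if not clip_labels:
        raise ValueError("at least one clip label is required")
    return max(clip_labels, key=lambda label: SEVERITY[label])


# Usage: a 250-frame video split into 100-frame clips, with made-up clip labels.
clips = split_into_clips(250, 100)
video_label = aggregate_by_severity(["normal", "near_miss", "normal"])
```

Under this assumed rule, one severe clip is enough to flag the whole video, which keeps short events in long recordings from being averaged away.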

Original language: English (US)
Article number: 108077
Journal: Accident Analysis and Prevention
Volume: 219
State: Published - Sep 2025

Keywords

  • Accident analysis framework
  • Explainable AI
  • Multimodal large language model
  • Traffic video classification

ASJC Scopus subject areas

  • Human Factors and Ergonomics
  • Safety, Risk, Reliability and Quality
  • Public Health, Environmental and Occupational Health
  • Law
