Vision-Language Models Can Identify Distracted Driver Behavior From Naturalistic Videos

Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar

    Research output: Contribution to journal › Article › peer-review


    Recognizing the activities that cause distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their generalization ability, efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning such as distracted driving activity recognition. Vision-language pretraining models such as CLIP have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP’s vision embedding supports zero-shot transfer and task-based finetuning, which can classify distracted activities from naturalistic driving video. Our results show that this framework offers state-of-the-art performance on zero-shot transfer, finetuning and video-based models for predicting the driver’s state on four public datasets. We propose frame-based and video-based frameworks developed on top of CLIP’s visual representation for distracted driving detection and classification tasks and report the results. Our code is available at
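The zero-shot idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that image and text embeddings have already been produced by some CLIP-style encoder, and it uses short hand-written vectors as stand-ins for them. The class names, the toy embeddings, the `logit_scale` value of 100, and the choice to average per-frame probabilities for a video are all assumptions made for illustration; the paper's video-based framework may work differently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Class probabilities for one frame from scaled cosine similarities.

    image_emb: embedding of one frame (assumed to come from an image encoder).
    text_embs: one embedding per class prompt (assumed to come from a text encoder).
    logit_scale: temperature factor; 100.0 is an assumed value, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    sims = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        sims.append(sum(a * b for a, b in zip(img, txt)))
    return softmax([logit_scale * s for s in sims])


def classify_video(frame_embs, text_embs, labels, logit_scale=100.0):
    """Average per-frame probabilities and return the most likely label.

    Averaging is a simple stand-in for a video-level model.
    """
    per_frame = [zero_shot_probs(f, text_embs, logit_scale) for f in frame_embs]
    count = len(per_frame)
    mean_probs = [sum(p[i] for p in per_frame) / count for i in range(len(labels))]
    best = max(range(len(labels)), key=mean_probs.__getitem__)
    return labels[best], mean_probs


# Hypothetical class names and toy 3-dimensional embeddings (illustration only).
LABELS = ["safe driving", "texting", "drinking"]
TEXT_EMBS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
FRAMES = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.1], [0.6, 0.5, 0.1]]

label, probs = classify_video(FRAMES, TEXT_EMBS, LABELS)
print(label, [round(p, 3) for p in probs])
```

In this toy setup no labeled training data is used: the class is chosen only by comparing each frame embedding against the prompt embeddings, which is the sense in which the transfer is zero-shot. Finetuning would instead update the encoder or a classifier head on annotated frames.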

    Original language: English (US)
    Pages (from-to): 1-15
    Number of pages: 15
    Journal: IEEE Transactions on Intelligent Transportation Systems
    State: Accepted/In press - 2024


    Keywords

    • Accidents
    • Adaptation models
    • CLIP
    • Data models
    • Distracted driving
    • Task analysis
    • Training
    • Vehicles
    • Videos
    • computer vision
    • embedding
    • vision-language model
    • zero-shot transfer

    ASJC Scopus subject areas

    • Automotive Engineering
    • Mechanical Engineering
    • Computer Science Applications

