TY - GEN
T1 - RLS3
T2 - 16th Annual ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS 2025, held as part of the CPS-IoT Week 2025
AU - Waite, Joshua R.
AU - Hasan, Md Zahid
AU - Liu, Qisai
AU - Jiang, Zhanhong
AU - Hegde, Chinmay
AU - Sarkar, Soumik
N1 - Publisher Copyright:
© 2025 ACM.
PY - 2025/5/7
Y1 - 2025/5/7
N2 - Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework that improves VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method uses the RL agent to manipulate objects within an indoor setting to create synthetic fine-tuning data that addresses specific vulnerabilities of the VLM. Specifically, the VLM's performance provides feedback to the RL agent, which generates informative data that efficiently fine-tunes the VLM on the targeted task (e.g., spatial reasoning). The key contribution of this work is a framework in which the RL agent serves as an informative data sampling tool that assists the VLM to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process at the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data gives us precise control over each scene and enables granular ground-truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, demonstrating the benefits of RL-guided data generation in vision-language tasks.
AB - Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework that improves VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method uses the RL agent to manipulate objects within an indoor setting to create synthetic fine-tuning data that addresses specific vulnerabilities of the VLM. Specifically, the VLM's performance provides feedback to the RL agent, which generates informative data that efficiently fine-tunes the VLM on the targeted task (e.g., spatial reasoning). The key contribution of this work is a framework in which the RL agent serves as an informative data sampling tool that assists the VLM to enhance performance and address task-specific vulnerabilities. By targeting the data sampling process at the weaknesses of the VLM, we can effectively train a more context-aware model. In addition, generating synthetic data gives us precise control over each scene and enables granular ground-truth captions. Our results show that the proposed data generation approach improves the spatial reasoning performance of VLMs, demonstrating the benefits of RL-guided data generation in vision-language tasks.
KW - self-improving sampling
KW - spatial reasoning
KW - synthetic data generation
KW - vision-language models
UR - http://www.scopus.com/inward/record.url?scp=105007295797&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105007295797&partnerID=8YFLogxK
U2 - 10.1145/3716550.3722033
DO - 10.1145/3716550.3722033
M3 - Conference contribution
AN - SCOPUS:105007295797
T3 - Proceedings of the ACM/IEEE 16th International Conference on Cyber-Physical Systems, ICCPS 2025, held as part of the CPS-IoT Week 2025
BT - Proceedings of the ACM/IEEE 16th International Conference on Cyber-Physical Systems, ICCPS 2025, held as part of the CPS-IoT Week 2025
PB - Association for Computing Machinery, Inc
Y2 - 6 May 2025 through 9 May 2025
ER -