Zero-Shot Object Navigation with Vision-Language Models Reasoning

Congcong Wen, Yisiyuan Huang, Hao Huang, Yanjia Huang, Shuaihang Yuan, Yu Hao, Hui Lin, Yu Shen Liu, Yi Fang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Object navigation is crucial for robots, but traditional methods require substantial training data and cannot be generalized to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, the Tree-of-Thought (ToT) reasoning and exploration module serves as the core component, using the ToT reasoning framework to select navigation frontiers during robot exploration. Compared to conventional frontier selection without reasoning, navigation using ToT reasoning involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on the PASTURE and RoboTHOR benchmarks demonstrate the outstanding performance of our model in L-ZSON, particularly in scenarios involving complex natural language as target instructions. Videos are available at https://vlt-lzson.github.io/.
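
The frontier selection with backtracking described in the abstract can be illustrated with a minimal sketch. This is not the authors' VLTNet implementation: it is a plain best-first exploration with backtracking, and the room graph, the `score` function (a stand-in for whatever ranking the reasoning module would produce), and the `is_goal` check are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List, Optional


def explore_with_backtracking(
    start: str,
    neighbors: Dict[str, List[str]],
    score: Callable[[str], float],
    is_goal: Callable[[str], bool],
) -> Optional[List[str]]:
    """Visit the highest-scoring unvisited frontier first and backtrack
    when a branch is exhausted. Returns the path to the goal, or None."""
    path = [start]
    visited = {start}
    while path:
        current = path[-1]
        if is_goal(current):
            return path
        candidates = [n for n in neighbors.get(current, []) if n not in visited]
        if not candidates:
            path.pop()  # dead end: backtrack to the previous location
            continue
        best = max(candidates, key=score)  # assumed stand-in for reasoning
        visited.add(best)
        path.append(best)
    return None


# Hypothetical toy environment: the target is assumed to be in the bedroom.
rooms = {"hall": ["kitchen", "bedroom"], "kitchen": ["pantry"]}
relevance = {"kitchen": 0.6, "bedroom": 0.5, "pantry": 0.1}

route = explore_with_backtracking(
    "hall",
    rooms,
    score=lambda room: relevance.get(room, 0.0),
    is_goal=lambda room: room == "bedroom",
)
print(route)
```

In this toy run the explorer first commits to the kitchen branch because it scores highest, exhausts it, and then backtracks to the hall before reaching the bedroom. A ToT-based selector would presumably replace the fixed `relevance` table with scores derived from model reasoning over several candidate paths.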

Original language: English (US)
Title of host publication: Pattern Recognition - 27th International Conference, ICPR 2024, Proceedings
Editors: Apostolos Antonacopoulos, Subhasis Chaudhuri, Rama Chellappa, Cheng-Lin Liu, Saumik Bhattacharya, Umapada Pal
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 389-404
Number of pages: 16
ISBN (Print): 9783031784552
DOIs
State: Published - 2025
Event: 27th International Conference on Pattern Recognition, ICPR 2024 - Kolkata, India
Duration: Dec 1 2024 → Dec 5 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15318 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 27th International Conference on Pattern Recognition, ICPR 2024
Country/Territory: India
City: Kolkata
Period: 12/1/24 → 12/5/24

Keywords

  • LLM Reasoning
  • Large Language Model (LLM)
  • Vision-Language Model (VLM)
  • Zero-shot Object Navigation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
