Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

1Polytechnic of Turin, 2University of Verona, 3U2IS, ENSTA Paris, Institut Polytechnique de Paris, 4Fondazione Bruno Kessler

Hallucination reduction: We introduce a novel Normalized-Entropy-based technique to quantify VLM perception uncertainty, along with a dedicated embodied VQA dataset (IDKVQA), in which responses can be Yes, No, or I Don't Know.
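As a minimal sketch of the idea, the normalized-entropy score can be computed from a VLM's answer distribution over the IDKVQA answer set; the probabilities and the threshold below are illustrative assumptions, not values from the paper.

import math

def normalized_entropy(probs):
    """Shannon entropy of an answer distribution, normalized to [0, 1] by log(K)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

# Illustrative VLM probabilities over {"Yes", "No", "I Don't Know"} (made-up numbers).
answer_probs = {"Yes": 0.70, "No": 0.20, "I Don't Know": 0.10}
score = normalized_entropy(list(answer_probs.values()))

THRESHOLD = 0.75  # hypothetical cut-off; an answer above it is treated as uncertain
print(f"normalized entropy = {score:.3f}, uncertain = {score >= THRESHOLD}")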

Natural Human-Robot Interaction: CoIN-Bench introduces, for the first time, template-free, open-ended, bidirectional human ↔ agent dialogues. We simulate user responses via a VLM with access to high-resolution images of target objects.

Self-Dialogue: Our method, AIUTA, operates in a zero-shot manner, generalizing across scenes and objects. Before taking any action, AIUTA engages in a Self-Dialogue between an on-board LLM and a VLM: the Self-Questioner module effectively reduces hallucinations and gathers more environmental information, while the Interaction Trigger minimizes agent-user interactions.
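The Self-Dialogue loop can be pictured roughly as follows. This is a sketch under the assumption that the LLM and VLM are exposed as two callables (llm_propose_questions and vlm_answer_with_probs, both hypothetical names, not the actual interfaces used in the paper), with the normalized-entropy filter from above deciding which answers are kept.

import math

ANSWERS = ("Yes", "No", "I Don't Know")

def normalized_entropy(probs):
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def self_questioner(initial_description, llm_propose_questions, vlm_answer_with_probs,
                    max_entropy=0.75):
    """Refine a scene description by asking the VLM attribute-specific questions."""
    kept = [initial_description]
    for question in llm_propose_questions(initial_description):
        answer, probs = vlm_answer_with_probs(question)  # probs aligned with ANSWERS
        # Discard answers the VLM is unsure about: explicit "I Don't Know" or high entropy.
        if answer != "I Don't Know" and normalized_entropy(probs) < max_entropy:
            kept.append(f"{question} {answer}.")
    return " ".join(kept)  # the refined detection description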


Teaser

Sketched episode of the proposed Collaborative Instance Navigation (CoIN) task. The human user (bottom left) provides a request ("Find the picture") in natural language. The agent has to locate the object within a completely unknown environment, interacting with the user only when needed via template-free, open-ended natural-language dialogue. Our method, Agent-user Interaction with UncerTainty Awareness (AIUTA), addresses this challenging task, minimizing user interactions by equipping the agent with two modules: a Self-Questioner and an Interaction Trigger, whose outputs are shown in the blue boxes along the agent’s path (① to ⑤) and whose inner workings are shown on the right. The Self-Questioner leverages a Large Language Model (LLM) and a Vision-Language Model (VLM) in a self-dialogue to first describe the agent’s observation and then extract additional relevant details, with a novel entropy-based technique to reduce hallucinations and inaccuracies, producing a refined detection description. The Interaction Trigger uses this refined description to decide whether to pose a question to the user (①, ③, ④), continue the navigation (②), or halt the exploration (⑤).


Abstract

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans.

To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting in which the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with the human.

We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently from the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner module initiates a self-dialogue within the agent to obtain a complete and accurate observation description, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input.

For evaluation, we introduce CoIN-Bench, a benchmark with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes.

Method

Method overview

Graphical depiction of AIUTA: the left shows its interaction cycle with the user, and the right provides an exploded view of our method. ① The agent receives an initial instruction I: "Find a c = <object category>". ② At each timestep t, a zero-shot policy π (VLFM), comprising a frozen object detection module, selects the optimal action a_t. ③ Upon detection, the agent runs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of observation O_t from a VLM. Then, a Self-Questioner module leverages an LLM to automatically generate attribute-specific questions for the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing S_refined. ⑤ The Interaction Trigger module then evaluates S_refined against the “facts” related to the target to determine whether to terminate the navigation (if the agent believes it has located the target object, ⑥) or to pose template-free, natural-language questions to a human ⑦, updating the “facts” based on the response ⑧.
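The interaction cycle in the figure can be summarized in a short Python sketch. Here llm_judge and llm_formulate_question are hypothetical helpers standing in for the actual prompts used by the Interaction Trigger, and the three-way verdict mirrors steps ⑤ to ⑧; this is not the paper's implementation.

from enum import Enum

class Verdict(Enum):
    HALT = "halt"          # refined description matches the facts: target found (⑥)
    CONTINUE = "continue"  # description contradicts the facts: keep exploring
    ASK_USER = "ask_user"  # facts are insufficient: query the human (⑦)

def interaction_trigger(s_refined, facts, llm_judge, llm_formulate_question):
    """Decide the next step by checking the refined description against the known facts."""
    verdict = llm_judge(description=s_refined, facts=facts)  # expected to return a Verdict
    question = None
    if verdict is Verdict.ASK_USER:
        question = llm_formulate_question(description=s_refined, facts=facts)
    return verdict, question

def update_facts(facts, question, user_answer):
    """Step ⑧: fold the user's free-form answer back into the list of facts."""
    return facts + [f"Q: {question} A: {user_answer}"]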


Video

BibTeX

      
        
@misc{taioli2025coin,
      title={{Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues}},
      author={Francesco Taioli and Edoardo Zorzi and Gianni Franchi and Alberto Castellini and Alessandro Farinelli and Marco Cristani and Yiming Wang},
      year={2025},
      eprint={2412.01250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.01250},
}