Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

1Polytechnic of Turin, 2University of Verona, 3U2IS, ENSTA Paris, 4Fondazione Bruno Kessler, 5University of Reykjavik

Hallucination reduction: We introduce a novel Normalized-Entropy-based technique to quantify VLM perception uncertainty, along with a dedicated embodied VQA dataset (IDKVQA), in which answers can be Yes, No, or I Don't Know. See the Python notebook and the IDKVQA dataset.
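
To make the idea concrete, here is a minimal sketch of how a normalized-entropy score could be computed from the VLM's answer distribution over Yes / No / I Don't Know; the probabilities and the threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalized_entropy(probs):
    """Shannon entropy of an answer distribution, normalized to [0, 1]
    by its maximum value log(K) for K candidate answers."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()              # ensure a valid probability distribution
    eps = 1e-12                              # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps))
    return float(entropy / np.log(len(probs)))

# Hypothetical VLM probabilities for the answers (Yes, No, I Don't Know)
# to a question such as "Is the wall light blue?".
answer_probs = [0.45, 0.40, 0.15]
score = normalized_entropy(answer_probs)

UNCERTAINTY_THRESHOLD = 0.8                  # illustrative value, not taken from the paper
if score > UNCERTAINTY_THRESHOLD:
    print(f"High uncertainty ({score:.2f}): fall back to 'I Don't Know'.")
else:
    print(f"Low uncertainty ({score:.2f}): keep the VLM's most likely answer.")
```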

Natural Human-Robot Interaction: CoIN-Bench introduces, for the first time, template-free, open-ended, bidirectional human ↔ agent dialogues. We simulate user responses via a VLM with access to high-resolution images of target objects. See the video for an example of a simulated interaction.
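
The snippet below sketches how a simulated user of this kind could be implemented: a VLM that sees high-resolution images of the target answers the agent's question in free-form text. The `vlm_answer` callable, the prompt wording, and the file names are assumptions for illustration, not the exact setup used in CoIN-Bench.

```python
def simulate_user_response(vlm_answer, agent_question, target_image_paths):
    """Answer the agent's open-ended question as the 'user' would (sketch).

    `vlm_answer(images, prompt) -> str` is a placeholder for any VLM call that
    accepts images plus a text prompt and returns free-form text.
    """
    prompt = (
        "You are a person who asked a robot to find a specific object in your home. "
        "The attached images show the exact target object. "
        "Answer the robot's question briefly and in natural language.\n"
        f"Robot's question: {agent_question}"
    )
    return vlm_answer(images=target_image_paths, prompt=prompt)

# Example with a stubbed VLM:
fake_vlm = lambda images, prompt: "Yes, it has a thin golden frame and hangs above the sofa."
print(simulate_user_response(fake_vlm, "Is the picture framed?", ["target_view_1.jpg"]))
```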

Self-Dialogue: Our method, AIUTA, operates in a zero-shot manner, generalizing across scenes and objects. Before taking any action, AIUTA engages in a Self-Dialogue between an on-board LLM and a VLM: the Self-Questioner module effectively reduces hallucinations and gathers more environmental information, while the Interaction Trigger minimizes agent-user interactions. See the Self-Questioner figure for an example!


Motivation

The ObjectGoal Nav task is under-specified: the input is only a category c, and the goal is to find any instance of that category. In the example below, which instance should the agent navigate to?

ObjectGoal example

On the other hand, the InstanceObjectGoal Nav task is often ambiguous and over-specified:

InstanceObjectGoal example

The description “A bed with cushions and a colored comforter, two lamps, and a carpet on the floor” could apply to all the beds shown in the image above.

Furthermore,

  1. Humans aim to minimize input: providing a long, detailed description before navigation is demanding and time-consuming. Would you want to give such a description?
  2. Agents should ask clarifying questions if necessary.

AIUTA: Our Method

We propose a novel embodied reasoning method for human-agent interaction. Our method integrates two key components: a Self-Questioner and an Interaction Trigger.

Specifically, the agent:

  1. Executes a self-questioning procedure to automatically gather more information from the environment (e.g., Is the wall light blue?), reduce hallucinations through a novel normalized-entropy-based technique, and generate a refined description of its observations.

    Our agent is equipped with a Large Language Model (LLM) and a Vision-Language Model (VLM). When environmental information is missing, the self-questioning mechanism is designed from the ground up to automatically leverage the LLM to generate the most informative question, and then use the VLM to extract the corresponding answer from the visual input (see the sketch after this list).

  2. Decides whether to ask a clarifying question to the user, continue navigating, or halt exploration, based on the refined description and the “facts” associated with the target.
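
The sketch below shows how such a Self-Questioner / Interaction Trigger loop could be wired together. The `llm` and `vlm` callables, the prompts, the number of rounds, and the threshold are placeholder assumptions (with `normalized_entropy` as in the earlier sketch), not the exact components or prompts used by AIUTA.

```python
def self_questioner(llm, vlm, image, initial_description,
                    num_rounds=3, uncertainty_threshold=0.8):
    """Refine a scene description through an LLM/VLM self-dialogue (sketch).

    Placeholders: `llm(prompt) -> str`, `vlm(image, question) -> (answer, answer_probs)`,
    and `normalized_entropy` from the earlier sketch.
    """
    description = initial_description
    for _ in range(num_rounds):
        # Ask the LLM for the single most informative follow-up question.
        question = llm(
            "Given this scene description, ask one yes/no question about a "
            f"missing or ambiguous attribute:\n{description}"
        )
        answer, probs = vlm(image, question)
        # Discard answers the VLM is too unsure about (hallucination filter).
        if normalized_entropy(probs) > uncertainty_threshold:
            answer = "I Don't Know"
        # Fold the verified attribute back into the description.
        description = llm(
            f"Update the description.\nDescription: {description}\n"
            f"Question: {question}\nAnswer: {answer}"
        )
    return description


def interaction_trigger(llm, refined_description, facts):
    """Decide whether to stop, keep navigating, or query the user (sketch)."""
    return llm(
        "Known facts about the target: " + "; ".join(facts) + "\n"
        f"Current observation: {refined_description}\n"
        "Reply with exactly one of: STOP, CONTINUE, ASK <question for the user>."
    )
```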

See the illustration below, based on a real example of our method in action.

Teaser

Sketched episode of the proposed Collaborative Instance Navigation (CoIN) task. The human user (bottom left) provides a request ("Find the picture") in natural language. The agent has to locate the object within a completely unknown environment, interacting with the user only when needed via template-free, open-ended natural-language dialogue. Our method, Agent-user Interaction with UncerTainty Awareness (AIUTA), addresses this challenging task, minimizing user interactions by equipping the agent with two modules: a Self-Questioner and an Interaction Trigger, whose output is shown in the blue boxes along the agent's path (① to ⑤), and whose inner workings are shown on the right. The Self-Questioner leverages a Large Language Model (LLM) and a Vision-Language Model (VLM) in a self-dialogue to initially describe the agent's observation, and then extract additional relevant details, with a novel entropy-based technique to reduce hallucinations and inaccuracies, producing a refined detection description. The Interaction Trigger uses this refined description to decide whether to pose a question to the user (①, ③, ④), continue the navigation (②), or halt the exploration (⑤).


Abstract

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans.

To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation via natural, template-free, open-ended dialogues with humans.

We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input.

For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes.

Method

Teaser

Graphical depiction of AIUTA: left shows its interaction cycle with the user, and right provides an exploded view of our method. ① The agent receives an initial instruction I: "Find a c = <object category>". ② At each timestep t, a zero-shot policy π (VLFM), comprising a frozen object detection module, selects the optimal action a_t. ③ Upon detection, the agent performs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of observation O_t from a VLM. Then, a Self-Questioner module leverages an LLM to automatically generate attribute-specific questions to the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing S_refined. ⑤ The Interaction Trigger module then evaluates S_refined against the "facts" related to the target, to determine whether to terminate the navigation (if the agent believes it has located the target object ⑥), or to pose template-free, natural-language questions to a human ⑦, updating the "facts" based on the response ⑧.
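
For readers who prefer code to diagrams, the following sketch mirrors steps ① to ⑧ above, reusing the `self_questioner` and `interaction_trigger` sketches from earlier. Every name here (`env`, `policy`, `detector`, `ask_user`, the observation fields) is a placeholder assumption, not the released implementation.

```python
def coin_episode(env, policy, detector, llm, vlm, ask_user, category, max_steps=500):
    """Outer CoIN loop (sketch): navigate, and run AIUTA whenever the target
    category is detected. All callables stand in for the real modules
    (e.g., a VLFM-style policy and a frozen open-vocabulary detector)."""
    facts = [f"The target is a {category}."]                       # ① initial instruction
    obs = env.reset()
    for t in range(max_steps):
        action = policy(obs, category)                             # ② zero-shot policy picks a_t
        obs = env.step(action)
        if not detector(obs.rgb, category):                        # ③ run AIUTA only on detection
            continue
        initial, _ = vlm(obs.rgb, "Describe the detected object and its surroundings.")  # ④
        refined = self_questioner(llm, vlm, obs.rgb, initial)      # ④ refined description S_refined
        verdict = interaction_trigger(llm, refined, facts)         # ⑤ evaluate against the facts
        if verdict.startswith("STOP"):                             # ⑥ agent believes target is found
            return True
        if verdict.startswith("ASK"):                              # ⑦ pose a question to the human
            reply = ask_user(verdict[len("ASK"):].strip())
            facts.append(reply)                                    # ⑧ update the facts
    return False
```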


I Don't Know VQA dataset - IDKVQA

Not every question has an answer. That's why we created the I Don't Know VQA dataset (IDKVQA), where answers can be Yes, No, or I Don't Know.

We describe the IDKVQA dataset extensively in the supplementary material and on the HF dataset page.
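
As a quick-start sketch, the dataset should be loadable with the Hugging Face `datasets` library; the repository identifier and field names below are placeholders (assumptions), so check the HF dataset page for the actual id and schema.

```python
from datasets import load_dataset

# Placeholder repository id and field names -- see the HF dataset page for the real ones.
ds = load_dataset("ORG_NAME/IDKVQA", split="test")

counts = {"Yes": 0, "No": 0, "I Don't Know": 0}
for sample in ds:
    counts[sample["answer"]] += 1        # assumed field name
print(counts)
```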

ObjectGoal example

Video

BibTeX

      
        
@misc{taioli2025coin,
      title={{Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues}},
      author={Francesco Taioli and Edoardo Zorzi and Gianni Franchi and Alberto Castellini and Alessandro Farinelli and Marco Cristani and Yiming Wang},
      year={2025},
      eprint={2412.01250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.01250}
}