Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

1Polytechnic of Turin, 2University of Verona, 3U2IS, ENSTA Paris, 4Fondazione Bruno Kessler, 5University of Reykjavik

Hallucination reduction: We introduce a novel Normalized-Entropy-based technique to quantify VLM perception uncertainty, along with a dedicated embodied VQA dataset (IDKVQA), in which answers can be Yes, No, or I Don't Know. See the Python notebook and the IDKVQA dataset.
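
To make the idea concrete, here is a minimal sketch of how a normalized-entropy score could be computed from the VLM's answer distribution over Yes / No / I Don't Know; the probabilities and the threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalized_entropy(probs):
    """Shannon entropy of an answer distribution, normalized to [0, 1]
    by its maximum value log(K) for K candidate answers."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()              # ensure a valid probability distribution
    eps = 1e-12                              # avoid log(0)
    entropy = -np.sum(probs * np.log(probs + eps))
    return float(entropy / np.log(len(probs)))

# Hypothetical VLM probabilities for the answers (Yes, No, I Don't Know)
# to a question such as "Is the wall light blue?".
answer_probs = [0.45, 0.40, 0.15]
score = normalized_entropy(answer_probs)

UNCERTAINTY_THRESHOLD = 0.8                  # illustrative value, not taken from the paper
if score > UNCERTAINTY_THRESHOLD:
    print(f"High uncertainty ({score:.2f}): fall back to 'I Don't Know'.")
else:
    print(f"Low uncertainty ({score:.2f}): keep the VLM's most likely answer.")
```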

Natural Human-Robot Interaction: CoIN-Bench introduces, for the first time, template-free, open-ended, bidirectional human ↔ agent dialogues. We simulate user responses via a VLM with access to high-resolution images of target objects. See the video for an example of a simulated interaction.
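
The snippet below sketches how a simulated user of this kind could be implemented: a VLM that sees high-resolution images of the target answers the agent's question in free-form text. The `vlm_answer` callable, the prompt wording, and the file names are assumptions for illustration, not the exact setup used in CoIN-Bench.

```python
def simulate_user_response(vlm_answer, agent_question, target_image_paths):
    """Answer the agent's open-ended question as the 'user' would (sketch).

    `vlm_answer(images, prompt) -> str` is a placeholder for any VLM call that
    accepts images plus a text prompt and returns free-form text.
    """
    prompt = (
        "You are a person who asked a robot to find a specific object in your home. "
        "The attached images show the exact target object. "
        "Answer the robot's question briefly and in natural language.\n"
        f"Robot's question: {agent_question}"
    )
    return vlm_answer(images=target_image_paths, prompt=prompt)

# Example with a stubbed VLM:
fake_vlm = lambda images, prompt: "Yes, it has a thin golden frame and hangs above the sofa."
print(simulate_user_response(fake_vlm, "Is the picture framed?", ["target_view_1.jpg"]))
```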

Self-Dialogue: Our method, AIUTA, operates in a zero-shot manner, generalizing across scenes and objects. Before taking any action, AIUTA engages in a Self-Dialogue between an on-board LLM and a VLM: the Self-Questioner module effectively reduces hallucinations and gathers more environmental information, while the Interaction Trigger minimizes agent-user interactions. See the Self-Questioner figure for an example!


Motivation

The ObjectGoal Nav task is under-specified: the input is only a category c, and the goal is to find any instance of that category. In the example below, which instance should the agent navigate to?

ObjectGoal example

On the other hand, the InstanceObjectGoal Nav task is often ambiguous and over-specified:

InstanceObjectGoal example

The description “A bed with cushions and a colored comforter, two lamps, and a carpet on the floor” could apply to all the beds shown in the image above.

Furthermore,

  1. Humans aim to minimize input: providing a long, detailed description before navigation is demanding and time-consuming. Would you want to give such a description?
  2. Agents should ask clarifying questions if necessary.

AIUTA: Our Method

We propose a novel embodied reasoning method for human-agent interaction. Our method integrates two key components: a Self-Questioner and an Interaction Trigger.

Specifically, the agent:

  1. Executes a self-questioning procedure to automatically gather more information from the environment (e.g., Is the wall light blue?), reduce hallucinations through a novel normalized-entropy-based technique, and generate a refined description of its observations.

    Our agent is equipped with a Large Language Model (LLM) and a Vision-Language Model (VLM). When environmental information is missing, the self-questioning mechanism is designed from the ground up to automatically leverage the LLM to generate the most informative question, and then use the VLM to extract the corresponding answer from the visual input (see the sketch after this list).

  2. Decides whether to ask a clarifying question to the user, continue navigating, or halt exploration, based on the refined description and the “facts” associated with the target.
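
The sketch below shows how such a Self-Questioner / Interaction Trigger loop could be wired together. The `llm` and `vlm` callables, the prompts, the number of rounds, and the threshold are placeholder assumptions (with `normalized_entropy` as in the earlier sketch), not the exact components or prompts used by AIUTA.

```python
def self_questioner(llm, vlm, image, initial_description,
                    num_rounds=3, uncertainty_threshold=0.8):
    """Refine a scene description through an LLM/VLM self-dialogue (sketch).

    Placeholders: `llm(prompt) -> str`, `vlm(image, question) -> (answer, answer_probs)`,
    and `normalized_entropy` from the earlier sketch.
    """
    description = initial_description
    for _ in range(num_rounds):
        # Ask the LLM for the single most informative follow-up question.
        question = llm(
            "Given this scene description, ask one yes/no question about a "
            f"missing or ambiguous attribute:\n{description}"
        )
        answer, probs = vlm(image, question)
        # Discard answers the VLM is too unsure about (hallucination filter).
        if normalized_entropy(probs) > uncertainty_threshold:
            answer = "I Don't Know"
        # Fold the verified attribute back into the description.
        description = llm(
            f"Update the description.\nDescription: {description}\n"
            f"Question: {question}\nAnswer: {answer}"
        )
    return description


def interaction_trigger(llm, refined_description, facts):
    """Decide whether to stop, keep navigating, or query the user (sketch)."""
    return llm(
        "Known facts about the target: " + "; ".join(facts) + "\n"
        f"Current observation: {refined_description}\n"
        "Reply with exactly one of: STOP, CONTINUE, ASK <question for the user>."
    )
```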

See the illustration below, based on a real example of our method in action.

Teaser

Sketched episode of the proposed Collaborative Instance Navigation (CoIN) task. The human user (bottom left) provides a request ("Find the picture") in natural language. The agent has to locate the object within a completely unknown environment, interacting with the user only when needed via template-free, open-ended natural-language dialogue. Our method, Agent-user Interaction with UncerTainty Awareness (AIUTA), addresses this challenging task, minimizing user interactions by equipping the agent with two modules: a Self-Questioner and an Interaction Trigger, whose output is shown in the blue boxes along the agent's path (① to ⑤), and whose inner workings are shown on the right. The Self-Questioner leverages a Large Language Model (LLM) and a Vision-Language Model (VLM) in a self-dialogue to initially describe the agent's observation, and then extract additional relevant details, with a novel entropy-based technique to reduce hallucinations and inaccuracies, producing a refined detection description. The Interaction Trigger uses this refined description to decide whether to pose a question to the user (①, ③, ④), continue the navigation (②), or halt the exploration (⑤).


Abstract

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans.

To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation via natural, template-free, open-ended dialogues with humans.

We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate observation description, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask a question to the human, continue, or halt navigation, minimizing user input.

For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes.

Method

Teaser

Graphical depiction of AIUTA: left shows its interaction cycle with the user, and right provides an exploded view of our method. ① The agent receives an initial instruction I: "Find a c = <object category>". ② At each timestep t, a zero-shot policy π (VLFM), comprising a frozen object detection module, selects the optimal action a_t. ③ Upon detection, the agent performs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of observation O_t from a VLM. Then, a Self-Questioner module leverages an LLM to automatically generate attribute-specific questions to the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing S_refined. ⑤ The Interaction Trigger module then evaluates S_refined against the "facts" related to the target, to determine whether to terminate the navigation (if the agent believes it has located the target object ⑥), or to pose template-free, natural-language questions to a human ⑦, updating the "facts" based on the response ⑧.
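
For readers who prefer code to diagrams, the following sketch mirrors steps ① to ⑧ above, reusing the `self_questioner` and `interaction_trigger` sketches from earlier. Every name here (`env`, `policy`, `detector`, `ask_user`, the observation fields) is a placeholder assumption, not the released implementation.

```python
def coin_episode(env, policy, detector, llm, vlm, ask_user, category, max_steps=500):
    """Outer CoIN loop (sketch): navigate, and run AIUTA whenever the target
    category is detected. All callables stand in for the real modules
    (e.g., a VLFM-style policy and a frozen open-vocabulary detector)."""
    facts = [f"The target is a {category}."]                       # ① initial instruction
    obs = env.reset()
    for t in range(max_steps):
        action = policy(obs, category)                             # ② zero-shot policy picks a_t
        obs = env.step(action)
        if not detector(obs.rgb, category):                        # ③ run AIUTA only on detection
            continue
        initial, _ = vlm(obs.rgb, "Describe the detected object and its surroundings.")  # ④
        refined = self_questioner(llm, vlm, obs.rgb, initial)      # ④ refined description S_refined
        verdict = interaction_trigger(llm, refined, facts)         # ⑤ evaluate against the facts
        if verdict.startswith("STOP"):                             # ⑥ agent believes target is found
            return True
        if verdict.startswith("ASK"):                              # ⑦ pose a question to the human
            reply = ask_user(verdict[len("ASK"):].strip())
            facts.append(reply)                                    # ⑧ update the facts
    return False
```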


I Don't Know VQA dataset - IDKVQA

Not every question has an answer. That's why we created the I Don't Know VQA dataset (IDKVQA), where answers can be Yes, No, or I Don't Know.

We describe the IDKVQA dataset extensively in the supplementary material and on the HF dataset page.
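
As a quick-start sketch, the dataset should be loadable with the Hugging Face `datasets` library; the repository identifier and field names below are placeholders (assumptions), so check the HF dataset page for the actual id and schema.

```python
from datasets import load_dataset

# Placeholder repository id and field names -- see the HF dataset page for the real ones.
ds = load_dataset("ORG_NAME/IDKVQA", split="test")

counts = {"Yes": 0, "No": 0, "I Don't Know": 0}
for sample in ds:
    counts[sample["answer"]] += 1        # assumed field name
print(counts)
```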

ObjectGoal example

Video

BibTeX

      
        
@misc{taioli2025coin,
      title={{Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues}},
      author={Francesco Taioli and Edoardo Zorzi and Gianni Franchi and Alberto Castellini and Alessandro Farinelli and Marco Cristani and Yiming Wang},
      year={2025},
      eprint={2412.01250},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.01250}
}