Language-driven instance object navigation assumes that
human users initiate the task by providing a detailed
description of the target instance to the embodied agent.
While this description is crucial for distinguishing the
target from visually similar instances in a scene,
providing it prior to navigation can be demanding for
humans.
To bridge this gap, we introduce
Collaborative Instance Object Navigation (CoIN), a
new task setting where the agent actively resolves
uncertainties about the target instance during navigation
through natural, template-free, open-ended dialogues with
the human.
We propose a novel training-free method,
Agent-user Interaction with UncerTainty Awareness
(AIUTA), which operates independently of the navigation
policy and handles human-agent interaction by
reasoning with Vision-Language Models (VLMs) and Large
Language Models (LLMs). First, upon object detection, a
Self-Questioner
model initiates a self-dialogue within the agent to obtain
a complete and accurate description of the observation,
using a novel uncertainty estimation technique. Then, an
Interaction Trigger module determines whether to
ask the human a question, continue navigating, or halt,
thereby minimizing user input.
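To make the pipeline concrete, the following is a minimal, runnable sketch of the Interaction Trigger decision logic; the thresholds, field names, and class names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    description: str     # attribute description refined via agent self-dialogue
    uncertainty: float   # estimated uncertainty of the description, in [0, 1]
    match_score: float   # agreement between observation and target description

ASK_THRESHOLD = 0.6    # above this uncertainty, query the human
HALT_THRESHOLD = 0.8   # above this match score, declare the target found

def interaction_trigger(obs: Observation) -> str:
    """Decide the agent's next step after the Self-Questioner has produced
    a refined observation: halt (target found), ask the human a question,
    or keep navigating. Asking is a last resort, to minimize user input."""
    if obs.match_score >= HALT_THRESHOLD and obs.uncertainty < ASK_THRESHOLD:
        return "halt"
    if obs.uncertainty >= ASK_THRESHOLD:
        return "ask"
    return "continue"

# A confident match halts navigation; an uncertain observation triggers a question.
print(interaction_trigger(Observation("a red ceramic mug on the desk", 0.2, 0.9)))  # halt
print(interaction_trigger(Observation("a mug, color unclear", 0.7, 0.5)))           # ask
```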
For evaluation, we introduce
CoIN-Bench, a benchmark built on a curated dataset of
challenging multi-instance scenarios. CoIN-Bench supports
both online evaluation with humans and reproducible
experiments with simulated user-agent interactions. On
CoIN-Bench, we show that AIUTA serves as a competitive
baseline, while existing language-driven instance
navigation methods struggle in complex multi-instance
scenes.
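As an illustration of the simulated user-agent interactions, below is a minimal sketch of a simulated user that answers the agent's questions solely from a ground-truth description of the target; the prompt wording and the `answer_fn` interface are assumptions, and in practice an LLM would play this role.

```python
from typing import Callable

def make_simulated_user(ground_truth: str,
                        answer_fn: Callable[[str], str]) -> Callable[[str], str]:
    """Build a reply function that answers the agent's questions using only
    the ground-truth description of the target instance."""
    def reply(question: str) -> str:
        prompt = (
            "You are a user who knows the target object:\n"
            f"{ground_truth}\n"
            "Answer the agent's question briefly, using only this knowledge.\n"
            f"Question: {question}"
        )
        return answer_fn(prompt)
    return reply

# Usage: plug any LLM call in as `answer_fn`; a trivial stub is shown here.
user = make_simulated_user("the blue ceramic mug on the kitchen counter",
                           answer_fn=lambda prompt: "It is blue and ceramic.")
print(user("What color is the mug?"))  # -> "It is blue and ceramic."
```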