Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

1Polytechnic of Turin, 2University of Verona, 3U2IS, ENSTA Paris, Institut Polytechnique de Paris, 4Fondazione Bruno Kessler

Sketched episode of the proposed Collaborative Instance Navigation (CoIN) task. The human user (bottom left) provides a request ("Find the picture") in natural language. The agent must locate the object within a completely unknown environment, interacting with the user only when needed via template-free, open-ended natural-language dialogue. Our method, Agent-user Interaction with UncerTainty Awareness (AIUTA), addresses this challenging task, minimizing user interactions by equipping the agent with two modules: a Self-Questioner and an Interaction Trigger, whose outputs are shown in the blue boxes along the agent’s path (① to ⑤) and whose inner workings are shown on the right. The Self-Questioner leverages a Large Language Model (LLM) and a Vision Language Model (VLM) in a self-dialogue to first describe the agent’s observation and then extract additional relevant details, using a novel entropy-based technique to reduce hallucinations and inaccuracies and produce a refined detection description. The Interaction Trigger uses this refined description to decide whether to pose a question to the user (①, ③, ④), continue the navigation (②), or halt the exploration (⑤).
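For concreteness, the entropy-based filtering mentioned above can be pictured roughly as in the minimal sketch below: the VLM is queried several times about the same attribute, and the attribute is kept only if the answers agree. Here `query_vlm`, the number of samples, and the entropy threshold are illustrative placeholders, not the actual AIUTA implementation or hyperparameters.

```python
from collections import Counter
import math


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution of VLM answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_attribute(question, query_vlm, n_samples=5, max_entropy=1.0):
    """Query the VLM several times; keep the attribute only if answers agree."""
    answers = [query_vlm(question) for _ in range(n_samples)]
    if answer_entropy(answers) <= max_entropy:
        # Consistent answers: keep the most frequent one for the refined description.
        return Counter(answers).most_common(1)[0][0]
    # High entropy: the answer is likely hallucinated or uncertain, so drop it.
    return None
```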


Abstract

Existing embodied instance goal navigation tasks, driven by natural language, assume that human users provide complete and nuanced instance descriptions before navigation begins, which can be impractical in the real world, as human instructions may be brief and ambiguous.

To bridge this gap, we propose a new task, Collaborative Instance Navigation (CoIN), featuring dynamic agent-human interaction during navigation to actively resolve uncertainties about the target instance through natural, template-free, open-ended dialogue.

To address CoIN, we propose a novel method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which leverages the perception capabilities of Vision Language Models (VLMs) and the reasoning capabilities of Large Language Models (LLMs). First, upon object detection, a Self-Questioner module initiates a self-dialogue with the VLM to obtain a complete and accurate observation description, while a novel uncertainty estimation technique mitigates inaccurate VLM perception. Then, an Interaction Trigger module determines whether to pose a question to the user, continue the navigation, or halt it, minimizing user input.
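The Self-Questioner's self-dialogue can be sketched, under stated assumptions, as the loop below: an LLM proposes attribute-specific questions about the detected object, and the VLM answers them against the current observation. `propose_questions` (the LLM side) and `query_vlm` (the VLM side) are assumed callables used only for illustration, not the paper's actual interfaces.

```python
def self_questioner(observation, initial_description, propose_questions, query_vlm):
    """Enrich an initial VLM scene description through LLM-generated questions."""
    refined = dict(initial_description)          # e.g. {"category": "picture"}
    # The LLM inspects the current description and proposes attribute-specific
    # questions (color, material, nearby objects, ...), answered by the VLM.
    for attribute, question in propose_questions(refined):
        answer = query_vlm(observation, question)
        if answer is not None:                   # e.g. after an uncertainty check
            refined[attribute] = answer
    return refined                               # the refined description S_refined
```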

For evaluation, we introduce CoIN-Bench, a benchmark supporting both real and simulated humans. AIUTA achieves performance competitive with state-of-the-art methods on instance navigation, demonstrating great flexibility in handling user input.

Method


Graphical depiction of AIUTA: the left shows its interaction cycle with the user, and the right provides an exploded view of our method. ① The agent receives an initial instruction I: "Find a c", where c = <object category>. ② At each timestep t, a zero-shot policy π (VLFM), comprising a frozen object detection module, selects the optimal action a_t. ③ Upon detection, the agent invokes the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of the observation O_t from a VLM. Then, the Self-Questioner module leverages an LLM to automatically generate attribute-specific questions for the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing S_refined. ⑤ The Interaction Trigger module then evaluates S_refined against the “facts” known about the target to determine whether to terminate the navigation (if the agent believes it has located the target object ⑥), or to pose a template-free, natural-language question to the human ⑦, updating the “facts” based on the response ⑧.
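The decision cycle in steps ⑤–⑧ can be summarized, purely as an illustrative sketch, by the routine below: the refined description is compared against the accumulated "facts" about the target, and the agent either halts, keeps exploring, or asks the user and updates the facts. `llm` and `ask_user` are hypothetical callables, and the prompts are placeholders rather than the paper's actual prompts.

```python
def interaction_trigger(s_refined, facts, llm, ask_user):
    """Return 'halt' or 'continue', possibly querying the user and updating the facts."""
    # The LLM judges whether the refined description matches the known facts.
    verdict = llm(
        f"Observation: {s_refined}\nFacts about the target: {facts}\n"
        "Answer with one word: match, mismatch, or uncertain."
    ).strip().lower()
    if verdict == "match":
        return "halt", facts        # step 6: the agent believes it found the target
    if verdict == "mismatch":
        return "continue", facts    # keep exploring; no need to query the user
    # Uncertain: generate a template-free question for the human (step 7).
    question = llm(
        f"Given the facts {facts} and the observation {s_refined}, "
        "write one short clarifying question for the user."
    )
    facts = {**facts, "user_answer": ask_user(question)}   # step 8: update the facts
    return "continue", facts
```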


Video