Hallucination reduction: We introduce a novel Normalized-Entropy-based technique to quantify VLM perception uncertainty, along with a dedicated embodied VQA dataset (IDKVQA), in which answers can be Yes, No, or I Don't Know.
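As a concrete illustration, below is a minimal sketch of how a normalized-entropy uncertainty score over the three IDKVQA answer options could be computed; the function name and the 0.75 threshold are our own illustrative choices, not values from the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of an answer distribution, scaled to [0, 1]
    by dividing by log(K), where K is the number of answer options."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

# Example: VLM probabilities over {Yes, No, I Don't Know}
answer_probs = [0.45, 0.35, 0.20]
score = normalized_entropy(answer_probs)  # near 1 => high uncertainty
is_uncertain = score > 0.75               # illustrative threshold
print(f"normalized entropy = {score:.3f}, uncertain = {is_uncertain}")
```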
Natural Human-Robot Interaction: CoIN-Bench introduces, for the first time, template-free, open-ended, bidirectional human ↔ agent dialogues. We simulate user responses via a VLM with access to high-resolution images of target objects.
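One way such a VLM-based simulated user could be realized is sketched below; the `vlm.generate` interface and the prompt are assumptions for illustration, not the paper's actual implementation.

```python
def simulated_user_answer(vlm, target_image, agent_question: str) -> str:
    """Answer a template-free agent question about the target instance,
    conditioning the VLM on a high-resolution image of that instance.
    (All names here are placeholders, not the authors' code.)"""
    prompt = (
        "You are a user guiding a robot to a specific object instance. "
        "Using only the attached photo of the object, answer the robot's "
        f"question in one short sentence.\nQuestion: {agent_question}"
    )
    return vlm.generate(images=[target_image], prompt=prompt)
```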
Self-Dialogue: Our method, AIUTA, operates in a zero-shot manner, generalizing across scenes and objects. Before taking any action, AIUTA engages in a Self-Dialogue between an on-board LLM and a VLM: the Self-Questioner module effectively reduces hallucinations and gathers more environmental information, while the Interaction Trigger minimizes agent-user interactions.
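A rough sketch of what such a Self-Dialogue loop might look like follows; the interfaces, prompts, and round limit are illustrative assumptions rather than the authors' code.

```python
def self_questioner(llm, vlm, observation, initial_description, max_rounds=5):
    """LLM interrogates a VLM about the current observation to refine
    the scene description before any user interaction (a sketch)."""
    description = initial_description
    for _ in range(max_rounds):
        # LLM proposes an attribute-specific question (e.g., color, material).
        question = llm.generate(
            f"Scene description: {description}\n"
            "Ask one question about an attribute that is still uncertain, "
            "or reply DONE if nothing is missing."
        )
        if question.strip() == "DONE":
            break
        # VLM answers from the actual observation; highly uncertain answers
        # (e.g., high normalized entropy) can be mapped to "I don't know".
        answer = vlm.generate(images=[observation], prompt=question)
        description = llm.generate(
            f"Current description: {description}\n"
            f"Q: {question}\nA: {answer}\n"
            "Rewrite the description, incorporating only confident facts."
        )
    return description  # S_refined
```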
Graphical depiction of AIUTA: the left shows its interaction cycle with the user, and the right provides an exploded view of our method. ① The agent receives an initial instruction I: "Find a c = <object category>". ② At each timestep t, a zero-shot policy π (VLFM), comprising a frozen object-detection module, selects the optimal action a_t. ③ Upon detection, the agent performs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of observation O_t from a VLM. Then, a Self-Questioner module leverages an LLM to automatically pose attribute-specific questions to the VLM, acquiring more information and refining the scene description with reduced attribute-level uncertainty, producing S_refined. ⑤ The Interaction Trigger module then evaluates S_refined against the "facts" related to the target, to determine whether to terminate the navigation (if the agent believes it has located the target object ⑥), or to pose template-free, natural-language questions to a human ⑦, updating the "facts" based on the response ⑧.
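To make step ⑤ concrete, the following sketch shows one possible form of the Interaction Trigger decision; the verdict labels and the `llm.generate` interface are hypothetical, not taken from the paper.

```python
def interaction_trigger(llm, refined_description, facts):
    """Compare S_refined against the known target "facts" and decide
    whether to stop, ask the user, or keep exploring (a sketch)."""
    verdict = llm.generate(
        f"Target facts: {facts}\n"
        f"Candidate object: {refined_description}\n"
        "Reply MATCH if the candidate satisfies all facts, MISMATCH if it "
        "contradicts them, or UNSURE plus one clarifying question otherwise."
    )
    if verdict.startswith("MATCH"):
        return "terminate", None       # step ⑥: stop navigation
    if verdict.startswith("UNSURE"):
        question = verdict[len("UNSURE"):].strip()
        return "ask_user", question    # step ⑦: query the human
    return "continue", None            # keep exploring with the policy
```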
@misc{taioli2025coin,
title={{Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues}},
author={Francesco Taioli and Edoardo Zorzi and Gianni Franchi and Alberto Castellini and Alessandro Farinelli and Marco Cristani and Yiming Wang},
year={2025},
eprint={2412.01250},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.01250},
}