Graphical depiction of AIUTA: the left panel shows its interaction cycle with the user, and the right panel provides an exploded view of our method. ① The agent receives an initial instruction $I$: "Find a $c$ = <object category>". ② At each timestep $t$, a zero-shot policy $\pi$ (VLFM), which includes a frozen object detection module, selects the optimal action $a_t$. ③ Upon detection, the agent runs the proposed AIUTA. Specifically, ④ the agent first obtains an initial scene description of the observation $O_t$ from a VLM. A Self-Questioner module then uses an LLM to automatically generate attribute-specific questions for the VLM; the answers add information and reduce attribute-level uncertainty, yielding the refined scene description $S_{refined}$. ⑤ The Interaction Trigger module then evaluates $S_{refined}$ against the “facts” known about the target and decides whether to terminate the navigation (if the agent believes it has located the target object ⑥) or to pose a template-free, natural-language question to the human ⑦, updating the “facts” with the response ⑧.
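The control flow in steps ①–⑧ can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `run_episode`, `Decision`, and every callable passed in are hypothetical, observations are plain strings, and the policy, detector, VLM, LLM, and human are replaced by toy stand-ins. In this sketch the agent simply resumes navigation after asking a question.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Decision:
    """Hypothetical output of the Interaction Trigger (step 5)."""
    kind: str                       # "stop", "ask", or "continue"
    question: Optional[str] = None  # set only when kind == "ask"


def run_episode(
    category: str,
    policy_step: Callable[[int], str],
    detect: Callable[[str, str], bool],
    describe: Callable[[str], str],
    self_question: Callable[[str, str], str],
    trigger: Callable[[str, List[str]], Decision],
    ask_human: Callable[[str], str],
    max_steps: int,
) -> Tuple[Optional[int], List[str]]:
    """Return (timestep at which navigation stopped, collected facts)."""
    facts: List[str] = []                       # "facts" about the target
    for t in range(max_steps):
        obs = policy_step(t)                    # step 2: act, get O_t
        if not detect(obs, category):           # step 3: no candidate yet
            continue
        s_init = describe(obs)                  # step 4: initial description
        s_refined = self_question(s_init, obs)  # step 4: refined description
        decision = trigger(s_refined, facts)    # step 5: compare with facts
        if decision.kind == "stop":             # step 6: target found
            return t, facts
        if decision.kind == "ask":              # steps 7-8: ask, update facts
            facts.append(ask_human(decision.question))
    return None, facts


# Toy stand-ins (all assumptions): each observation is a short caption.
observations = ["hallway", "blue mug on table", "sofa", "red mug on shelf"]


def toy_trigger(s_refined: str, facts: List[str]) -> Decision:
    if not facts:                               # nothing known yet: ask
        return Decision("ask", "What color is the mug?")
    if all(f in s_refined for f in facts):      # every fact matches: stop
        return Decision("stop")
    return Decision("continue")                 # mismatch: keep navigating


stop_step, facts = run_episode(
    category="mug",
    policy_step=lambda t: observations[t],
    detect=lambda obs, c: c in obs,
    describe=lambda obs: obs,
    self_question=lambda s, obs: s + " (refined)",
    trigger=toy_trigger,
    ask_human=lambda q: "red",
    max_steps=len(observations),
)
print(stop_step, facts)  # prints: 3 ['red']
```

In the toy run the agent asks about the color at the first detection (the blue mug), records the answer "red" as a fact, and stops only at the later observation whose refined description matches that fact.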