Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Abstract
Some vision-language models struggle to distinguish what could be shared from what has actually been grounded in dialogue, relying on static map cues rather than tracking dynamic grounding processes.
In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Our results show that providing authentic map images improves overall performance but shifts models toward over-predicting alignment, and that this improvement comes at the cost of degraded accuracy on non-aligned cases. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias is driven by task-relevant map content, not the visual channel. Calibration analysis and reference-chain tracking further suggest that models rely on static referential cues on the maps rather than tracking how grounding unfolds through dialogue history. We observe these patterns most clearly in Qwen3-VL-8B-Instruct and, to varying degrees, in four additional models from two architecture families. In models that exhibit the bias, map content, whether presented visually or textually, is treated as evidence of mutual understanding, conflating potential with established common ground.
Community
This paper investigates a subtle but important distinction in collaborative dialogue: whether vision-language models can tell apart what could be shared (from shared perception) versus what has been shared (through grounding in interaction). Using 13,077 annotated reference expressions from HCRC MapTask dialogues, we evaluate VLMs under controlled manipulations of dialogue context and map-information access.
A key finding is that providing authentic map images improves overall VLM performance but introduces a systematic bias toward over-predicting alignment between participants — models tend to assume interlocutors share the same interpretation simply because they share the same visual input. Interestingly, textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, suggesting the bias is driven by task-relevant content rather than the visual modality itself.
This has implications for anyone working on dialogue systems, grounded language understanding, or VLM evaluation: current models conflate perceptual access with communicative grounding, which is precisely the kind of error that matters in real collaborative settings. We'd be curious to hear thoughts on how this bias might be mitigated — whether through training objectives that explicitly model asymmetric information states, or through architectural changes that separate perceptual and discourse-level representations.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue (2026)
- MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue (2026)
- Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions (2026)
- Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs (2026)
- Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely (2026)
- Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery (2026)
- GAVEL: Grounded Caption Error Verification and Localization (2026)