
CVT-Bench Results Dashboard

Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Best F1 Score: 93.5% (Gemini 3.1 Pro · Image · BS=1)
Avg Viewpoint Degradation: -15.6% (across all model × BS combos)
Best Tracking Survival: 62.6% (Gemini 3.1 Pro · Image · BS=1)
Structured Formats Improve F1: 4 / 5 models (Text/SG beats Image at BS=1)
Structured Formats Improve F1: 5 / 5 models (Text/SG beats Image at BS=20)
🏆 Model Report Card: at-a-glance across all dimensions
Model | Episodic F1 (BS=1) | Long-Ctx F1 (BS≥10) | Rotation Resilience | SG Benefit | Tracking (BS=1) | Tracking (BS≥10) | Consistency (BS=1) | Consistency (BS≥10) | Context Stability
Gemini 3.1 Pro ~ ~
Kimi K2.5 ~ ~
Qwen 3.5 Plus ~ ~
Qwen 3.5 OS ~ ~
GPT-5.2 †BS=10
✓ = Strong ~ = Moderate ✗ = Weak
Threshold Definitions
Dimension ✓ Strong ~ Moderate ✗ Weak
Episodic / Long-Ctx F1 ≥ 80% ≥ 60% < 60%
Rotation Resilience < 5% drop < 15% drop ≥ 15% drop
SG Benefit > +2% > −2% ≤ −2%
Tracking (BS=1) ≥ 30% ≥ 10% < 10%
Tracking (BS≥10) ≥ 10% ≥ 3% < 3%
Consistency (BS=1 & BS≥10) ≥ 95% ≥ 80% < 80%
Context Stability < 5% drop < 15% drop ≥ 15% drop
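The thresholds above amount to a two-cut-point lookup per dimension. A minimal sketch of that mapping (the function name and structure are illustrative, not taken from the benchmark code):

```python
def rate(value, strong, moderate, higher_is_better=True):
    """Map a metric value to a report-card rating using two cut points."""
    if higher_is_better:
        if value >= strong:
            return "✓"
        if value >= moderate:
            return "~"
        return "✗"
    # For drop-style metrics (Rotation Resilience, Context Stability),
    # smaller drops are better, so the comparisons flip.
    if value < strong:
        return "✓"
    if value < moderate:
        return "~"
    return "✗"

# Episodic F1 thresholds: ✓ ≥ 80%, ~ ≥ 60%
print(rate(93.5, 80, 60))                         # ✓
# Rotation Resilience: ✓ < 5% drop, ~ < 15% drop
print(rate(6.1, 5, 15, higher_is_better=False))   # ~
```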
† GPT-5.2 evaluated at BS=10 (not BS=20) due to shared reasoning+output token budget constraints.
📋 Model Specifications: architecture, parameters, and API-measured input token counts
Model Architecture Total / Active Params Context Max Out Vision Source Peak Tokens % Context
Gemini 3.1 Pro Sparse MoE Undisclosed 1M 65K Native multimodal Closed 96,507 9.7%
GPT-5.2 Undisclosed Undisclosed 400K 128K Text + Image Closed 41,205×10 10.3%
Qwen 3.5 Plus DeltaNet+MoE 397B / 17B 262K→1M 81K Early fusion Closed 86,772 33.1%
Qwen 3.5 OS DeltaNet+MoE 397B / 17B 262K→1M 81K Early fusion Open 86,772 33.1%
Kimi K2.5 MoE+MLA 1T / 32B 256K 65K MoonViT (400M) Open 82,991 32.4%
⚠️ Context ≠ Capacity: Degradation Occurs Well Below Limits
At peak batch size, no model exceeds 34% of its context window. Gemini uses only 9.7% and GPT only 10.3%. Yet all models show catastrophic spatial degradation (Gemini F1: 96.6% → 57.9%, Qwen OS: 86.7% → 25.6%). This confirms failures stem from spatial reasoning limitations, not context overflow.
Peak Tokens = highest input tokens across all 3 modes (Image, Text-Only, Scene Graph) at max batch size, measured via each model's API. GPT-5.2 shares 128K budget between reasoning + output. ×10 GPT evaluated at BS=10 (not 20). Active params: routed+shared experts per token. Qwen context: 262K native, extensible to 1.01M via API.
📊 Benchmark Specifications: CVT-Bench dataset composition and statistics
100 scenes · 5,703 questions · 10 view angles · 320×240 image size
Property Value Details
Scene Composition
Sparse scenes (3–5 placed objects) 50 mean 3.9 tagged/scene, range 2–5 (min=2 after visibility filtering)
Dense scenes (7–10 objects) 50 mean 8.4 obj/scene, range 7–10
Objects per scene (all) 6.2 avg min 2, max 10
Question Statistics
Total questions 5,703 mean 57.0/scene, min 20, max 60
Manipulation questions (rotated views) 4,419 (77.5%) 9 rotation angles: 45°–360° + top-down
Standard questions (0° original view) 1,284 (22.5%) No mental rotation required
Occlusion Statistics
Scenes with occlusion 43 / 100 (43%) mean 13.2 occluded Qs/scene (max 43)
Occluded questions 1,315 (23.1%) Subject: 613 | Object: 620 | Both: 82
Visible questions 4,388 (76.9%) No occlusion involvement
Relationship Labels
Label distribution 4 directions Left 25.3% | Right 24.8% | Front 25.1% | Behind 24.8%
Multi-label questions 5,560 (97.5%) Most questions have 2+ spatial relations
Single-label questions 143 (2.5%) Only 1 spatial relation
Total relation labels 11,263 ~2.0 labels per question on average
View Angle Distribution
Original view (0°) 1,284 (22.5%) Overrepresented as baseline
Each rotated angle (45°–360° + top-down) ~491 each (8.6%)
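Since ~97.5% of questions carry multiple relation labels (~2.0 per question), the reported F1 is presumably a micro-averaged score over predicted vs. gold label sets per question. A minimal sketch under that assumption (the benchmark's actual scoring code may differ):

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-question spatial-relation label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"left", "behind"}, {"right"}]
pred = [{"left", "front"}, {"right"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```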
📌 Episodic Spatial Reasoning Performance: can MLLMs reason spatially in isolation?
Batch Comparison
Global F1 grouped by model × mode × batch size
🔄 Counterfactual Viewpoint Reasoning: does performance degrade with viewpoint deviation?
Rotation Lines
F1 vs Rotation Angle — "W-curve" dips visible at 90° and 270°
Compass
Viewpoint Compass — F1 on physical bearings
Model \ Angle 0° 45° 90° 135° 180° 225° 270° 315°
Gemini 3.1 Pro 96 89 87 89 96 89 87 89
Kimi K2.5 93 63 49 64 84 65 46 64
Qwen 3.5 Plus 95 78 75 77 95 78 75 78
Qwen 3.5 OS 96 76 72 73 95 78 73 77
GPT-5.2 96 77 72 77 97 78 72 77
Color scale: low → high F1. ⚡ = orthogonal angles (90°, 270°), where the "W-curve" dips.
📉 Context Length & Spatial State Degradation: how do sequential interaction and context length affect stability?
Attention Fatigue
F1 Decay over Prompt Depth
Consistency Drift
0°=360° Consistency — collapses under sequential load
Tracking
Tracking Survival + Consistency
Gemini 3.1 Pro: BS=1 92% → BS=20 75% (-16.7%)
Kimi K2.5: BS=1 72% → BS=20 53% (-19.4%)
Qwen 3.5 Plus: BS=1 85% → BS=20 51% (-33.4%)
Qwen 3.5 OS: BS=1 84% → BS=20 53% (-31.1%)
GPT-5.2: BS=1 85% → BS=10 86% (+1.7%)

Relational Tracking Survival (Streak ≥ 10) Table

Model BS Image Text-Only SceneGraph
Gemini 3.1 Pro 1 62.6% 57.2% 48.8%
GPT-5.2 † 10 19.2% 44.4% 3.7%
Qwen 3.5 Plus 1 6.7% 37.0% 15.8%
GPT-5.2 1 34.0% 40.4% 20.2%
Qwen 3.5 OS 1 5.7% 36.4% 20.5%
Gemini 3.1 Pro 20 2.4% 17.2% 5.4%
Kimi K2.5 1 7.4% 10.8% 8.4%
Kimi K2.5 20 0.0% 0.0% 0.0%
Qwen 3.5 OS 20 0.0% 0.0% 1.7%
Qwen 3.5 Plus 20 0.0% 0.0% 0.3%
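"Survival (Streak ≥ 10)" appears to measure the fraction of tracked relations whose run of consecutive correct judgments reaches at least 10 steps; whether the streak is leading or longest is not specified here, so this sketch uses the leading streak:

```python
def survival_rate(correct_seqs, threshold=10):
    """Fraction of per-relation correctness sequences whose leading
    streak of consecutive correct answers reaches `threshold`."""
    survived = 0
    for seq in correct_seqs:
        streak = 0
        for ok in seq:
            if not ok:
                break
            streak += 1
        if streak >= threshold:
            survived += 1
    return survived / len(correct_seqs)

seqs = [[True] * 12, [True] * 9 + [False], [True, False] * 6]
print(survival_rate(seqs))  # only the first sequence survives: 1/3
```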
📐 Effect of Representation Structure: Image vs Text-Only vs SceneGraph
Radar
Spatial F1 Octagon — multi-dimensional comparison
🔍 Scene Complexity & Occlusion: Sparse vs Dense, Visible vs Occluded
Density and Occlusion effect
Gemini 3.1 Pro: Dense 90% vs Sparse 94% (Δ=-4.3%) · Occluded 90% vs Visible 92% (Δ=-1.5%)
Kimi K2.5: Dense 72% vs Sparse 73% (Δ=-0.6%) · Occluded 68% vs Visible 73% (Δ=-5.3%)
Qwen 3.5 Plus: Dense 85% vs Sparse 84% (Δ=+1.5%) · Occluded 85% vs Visible 85% (Δ=+0.5%)
Qwen 3.5 OS: Dense 84% vs Sparse 83% (Δ=+1.0%) · Occluded 85% vs Visible 83% (Δ=+2.3% ⚠️ paradox)
GPT-5.2: Dense 85% vs Sparse 85% (Δ=-0.1%) · Occluded 85% vs Visible 84% (Δ=+0.4%)
⚠️ Failure Mode Analysis: mechanisms behind spatial reasoning failures
Failure Modes

Qualitative failure modes in CVT-Bench. (A) Orthogonal inversion; (B) Scene-graph mis-association; (C) Within-scene drift.

Universal Hard Cases

Montage of queries that all evaluated models fail under the specified setting.

A4 Error Correlation Across Models: pairwise agreement and shared failures across context length
Universally Hard Questions: 7.9% (449/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 71.7% 59.8% 11.9% 24.1% 4.2%
Qwen 3.5 Plus × Qwen 3.5 OS 66.5% 26.4% 40.1% 16.6% 16.9%
Kimi K2.5 × Qwen 3.5 Plus 63.2% 18.4% 44.7% 12.2% 24.7%
Qwen 3.5 Plus × GPT-5.2 62.4% 34.8% 27.6% 8.3% 29.3%
Qwen 3.5 OS × GPT-5.2 61.6% 34.5% 27.1% 8.8% 29.6%
Kimi K2.5 × Qwen 3.5 OS 59.6% 16.7% 42.8% 13.9% 26.5%
Kimi K2.5 × GPT-5.2 54.2% 24.4% 29.8% 6.2% 39.6%
Gemini 3.1 Pro × Qwen 3.5 OS 54.0% 40.6% 13.4% 43.3% 2.7%
Gemini 3.1 Pro × Qwen 3.5 Plus 53.4% 40.2% 13.2% 43.7% 2.9%
Gemini 3.1 Pro × Kimi K2.5 40.6% 27.6% 13.1% 56.3% 3.1%
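The pairwise columns decompose each model pair's per-question outcomes into four buckets; agreement is then both-correct plus both-wrong. A sketch assuming boolean per-question correctness vectors:

```python
def pair_stats(a, b):
    """Decompose two per-question correctness vectors (same length)
    into the four outcome buckets used in the A4 tables."""
    n = len(a)
    both_ok  = sum(x and y for x, y in zip(a, b)) / n
    both_bad = sum((not x) and (not y) for x, y in zip(a, b)) / n
    only_a   = sum(x and not y for x, y in zip(a, b)) / n
    only_b   = sum(y and not x for x, y in zip(a, b)) / n
    return {"agreement": both_ok + both_bad, "both_ok": both_ok,
            "both_bad": both_bad, "only_a": only_a, "only_b": only_b}

stats = pair_stats([True, True, False, False], [True, False, False, True])
print(stats["agreement"])  # 0.5
```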

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 66.1% +58.5%
RelAxis front-behind 3.6% 33.4% +29.8%
RelAxis left-right 4.1% 32.7% +28.6%
Density Dense 52.6% 61.5% +8.9%
View 135 8.6% 13.8% +5.2%
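The "Shift" column is simply an attribute's prevalence among universally hard questions minus its prevalence in the full dataset; a one-liner sketch (small discrepancies vs. the tables reflect rounding of the underlying counts):

```python
def attribute_shift(overall_pct, hard_pct):
    """Over-representation of an attribute value among universally hard questions."""
    return hard_pct - overall_pct

# Num_GT_Rels=1: 7.7% overall vs 66.1% among hard questions
print(round(attribute_shift(7.7, 66.1), 1))  # 58.4
```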
Universally Hard Questions: 19.3% (1102/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Qwen 3.5 Plus × Qwen 3.5 OS 80.3% 17.6% 62.8% 10.8% 8.9%
Kimi K2.5 × Qwen 3.5 OS 71.2% 4.0% 67.2% 6.3% 22.5%
Kimi K2.5 × Qwen 3.5 Plus 70.1% 4.4% 65.7% 5.9% 24.0%
Gemini 3.1 Pro × GPT-5.2 64.6% 35.8% 28.8% 10.7% 24.7%
Gemini 3.1 Pro × Qwen 3.5 Plus 57.0% 15.9% 41.1% 30.6% 12.4%
Gemini 3.1 Pro × Qwen 3.5 OS 56.5% 14.7% 41.8% 31.8% 11.7%
Gemini 3.1 Pro × Kimi K2.5 53.9% 5.4% 48.5% 41.1% 5.0%
Qwen 3.5 Plus × GPT-5.2 51.0% 20.0% 31.1% 8.4% 40.6%
Qwen 3.5 OS × GPT-5.2 49.6% 18.3% 31.3% 8.2% 42.3%
Kimi K2.5 × GPT-5.2 42.2% 6.5% 35.6% 3.8% 54.0%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 31.5% +23.8%
RelAxis front-behind 3.6% 16.5% +13.0%
RelAxis left-right 4.1% 15.0% +10.9%
View 225 8.6% 16.1% +7.5%
View 135 8.6% 13.4% +4.8%
Universally Hard Questions: 10.6% (605/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 81.0% 67.8% 13.2% 14.4% 4.6%
Qwen 3.5 Plus × GPT-5.2 72.3% 53.0% 19.3% 8.3% 19.4%
Qwen 3.5 OS × GPT-5.2 70.7% 51.1% 19.6% 8.0% 21.3%
Gemini 3.1 Pro × Qwen 3.5 Plus 69.5% 56.5% 13.0% 25.7% 4.8%
Gemini 3.1 Pro × Qwen 3.5 OS 68.0% 54.7% 13.3% 27.5% 4.5%
Qwen 3.5 Plus × Qwen 3.5 OS 64.6% 42.5% 22.1% 18.8% 16.6%
Kimi K2.5 × Qwen 3.5 Plus 51.4% 16.8% 34.6% 4.1% 44.5%
Kimi K2.5 × Qwen 3.5 OS 48.2% 14.1% 34.1% 6.8% 45.0%
Kimi K2.5 × GPT-5.2 42.7% 18.0% 24.7% 2.9% 54.4%
Gemini 3.1 Pro × Kimi K2.5 35.8% 19.5% 16.4% 62.7% 1.4%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 62.8% +55.1%
RelAxis left-right 4.1% 33.9% +29.8%
RelAxis front-behind 3.6% 28.9% +25.4%
Density Dense 52.6% 63.0% +10.4%
View 45 8.6% 12.2% +3.6%
Universally Hard Questions: 13.4% (766/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Kimi K2.5 × Qwen 3.5 Plus 74.8% 3.4% 71.3% 17.2% 8.0%
Qwen 3.5 Plus × Qwen 3.5 OS 73.9% 5.0% 68.9% 6.5% 19.6%
Kimi K2.5 × Qwen 3.5 OS 70.4% 7.8% 62.6% 12.8% 16.8%
Gemini 3.1 Pro × GPT-5.2 69.0% 51.6% 17.4% 7.3% 23.7%
Gemini 3.1 Pro × Kimi K2.5 50.7% 15.1% 35.6% 43.7% 5.5%
Gemini 3.1 Pro × Qwen 3.5 OS 49.5% 16.5% 33.0% 42.4% 8.1%
Gemini 3.1 Pro × Qwen 3.5 Plus 45.9% 8.1% 37.8% 50.8% 3.3%
Qwen 3.5 OS × GPT-5.2 41.6% 20.7% 20.8% 3.9% 54.6%
Kimi K2.5 × GPT-5.2 39.6% 17.8% 21.8% 2.9% 57.5%
Qwen 3.5 Plus × GPT-5.2 34.3% 10.5% 23.8% 0.9% 64.8%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 51.7% +44.0%
RelAxis left-right 4.1% 26.2% +22.1%
RelAxis front-behind 3.6% 25.5% +21.9%
Density Dense 52.6% 66.1% +13.5%
Occlusion occluded 23.1% 29.1% +6.1%
Universally Hard Questions: 6.4% (366/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 68.3% 56.7% 11.7% 23.1% 8.5%
Qwen 3.5 OS × GPT-5.2 67.9% 45.3% 22.7% 12.1% 20.0%
Qwen 3.5 Plus × Qwen 3.5 OS 66.6% 36.6% 30.0% 12.7% 20.7%
Qwen 3.5 Plus × GPT-5.2 63.3% 38.9% 24.4% 10.4% 26.3%
Gemini 3.1 Pro × Qwen 3.5 OS 62.8% 50.0% 12.8% 29.8% 7.4%
Gemini 3.1 Pro × Qwen 3.5 Plus 56.1% 42.6% 13.5% 37.2% 6.7%
Kimi K2.5 × Qwen 3.5 Plus 51.1% 10.4% 40.7% 10.0% 38.9%
Kimi K2.5 × Qwen 3.5 OS 47.4% 12.6% 34.8% 7.8% 44.7%
Kimi K2.5 × GPT-5.2 45.3% 15.5% 29.8% 4.9% 49.7%
Gemini 3.1 Pro × Kimi K2.5 35.8% 18.0% 17.8% 61.8% 2.4%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
View 135 8.6% 23.0% +14.3%
View 315 8.6% 22.1% +13.5%
View 225 8.6% 19.4% +10.8%
View 45 8.6% 19.1% +10.5%
Num_GT_Rels 1 7.7% 17.8% +10.1%
Universally Hard Questions: 16.0% (912/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Qwen 3.5 Plus × Qwen 3.5 OS 70.6% 12.5% 58.1% 12.3% 17.1%
Kimi K2.5 × Qwen 3.5 Plus 68.2% 3.5% 64.7% 10.5% 21.3%
Kimi K2.5 × Qwen 3.5 OS 66.4% 5.0% 61.4% 9.0% 24.6%
Gemini 3.1 Pro × Qwen 3.5 Plus 62.9% 10.7% 52.2% 22.9% 14.2%
Gemini 3.1 Pro × Qwen 3.5 OS 61.8% 12.5% 49.3% 21.1% 17.1%
Gemini 3.1 Pro × Kimi K2.5 61.7% 4.7% 57.1% 28.9% 9.3%
Qwen 3.5 OS × GPT-5.2 50.3% 18.5% 31.8% 11.1% 38.6%
Gemini 3.1 Pro × GPT-5.2 48.6% 19.6% 29.0% 14.0% 37.4%
Kimi K2.5 × GPT-5.2 45.7% 8.4% 37.3% 5.6% 48.7%
Qwen 3.5 Plus × GPT-5.2 45.0% 13.5% 31.5% 11.4% 43.6%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
View 225 8.6% 19.7% +11.1%
View 135 8.6% 19.4% +10.8%
View 315 8.6% 17.7% +9.0%
View 45 8.6% 16.7% +8.1%
Num_GT_Rels 1 7.7% 15.8% +8.1%
📊 Complete F1 Leaderboard
Model BS Image Text-Only SceneGraph Rot Drop SG Effect Consist.
Gemini 3.1 Pro 1 93.5% 91.6% 88.5% -6.1% -4.7% 100%
GPT-5.2 † 10 76.0% 86.3% 73.8% -13.0% -18.2% 99%
Qwen 3.5 Plus 1 70.8% 84.7% 81.0% -13.5% -4.4% 100%
GPT-5.2 1 78.7% 84.6% 81.5% -15.6% -4.3% 100%
Qwen 3.5 OS 1 69.2% 83.7% 84.7% -15.7% +1.3% 100%
Gemini 3.1 Pro 20 66.0% 74.9% 60.4% -21.3% -11.1% 97%
Kimi K2.5 1 69.1% 72.2% 73.9% -27.2% +2.1% 98%
Kimi K2.5 20 45.2% 52.8% 44.9% -14.2% -5.0% 58%
Qwen 3.5 OS 20 52.5% 52.6% 54.8% -14.8% +3.3% 88%
Qwen 3.5 Plus 20 53.4% 51.3% 56.2% -14.2% +6.7% 87%

🔬 Deep Analysis

Additional analyses from per-question evaluation

A1 Relation-Type Breakdown by Context Size: Left/Right vs Front/Behind spatial axes
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 65.2% 64.3% 93.7% +0.9% 1.5% 1.6%
Kimi K2.5 45.2% 30.2% 41.5% +15.0% 3.7% 3.7%
Qwen 3.5 Plus 61.5% 57.9% 80.5% +3.6% 3.8% 3.9%
Qwen 3.5 OS 61.6% 52.0% 78.3% +9.6% 3.5% 3.5%
GPT-5.2 66.5% 58.7% 86.4% +7.8% 5.2% 5.2%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 58.6% 50.0% 77.6% +8.6% 8.2% 8.2%
Kimi K2.5 40.4% 27.6% 48.2% +12.8% 13.5% 13.4%
Qwen 3.5 Plus 26.8% 23.6% 31.6% +3.2% 8.4% 8.4%
Qwen 3.5 OS 46.2% 30.3% 54.1% +15.9% 15.0% 15.0%
GPT-5.2 †10 63.4% 60.2% 88.7% +3.2% 4.0% 4.0%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 76.4% 73.5% 90.0% +2.9% 3.7% 3.7%
Kimi K2.5 34.4% 33.8% 38.5% +0.6% 2.7% 2.7%
Qwen 3.5 Plus 65.7% 59.5% 74.0% +6.2% 3.9% 3.8%
Qwen 3.5 OS 69.3% 61.5% 81.9% +7.8% 1.8% 1.9%
GPT-5.2 83.3% 69.4% 83.0% +13.9% 4.6% 4.8%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 53.2% 32.2% 62.1% +21.0% 11.5% 11.5%
Kimi K2.5 43.3% 23.6% 45.8% +19.7% 12.7% 12.3%
Qwen 3.5 Plus 51.3% 37.2% 57.2% +14.1% 14.2% 14.2%
Qwen 3.5 OS 44.4% 34.6% 56.2% +9.8% 18.8% 18.9%
GPT-5.2 †10 68.8% 65.3% 76.5% +3.5% 8.2% 8.4%
A2 Top-View vs Orbit-View Degradation: aerial perspective (2D logic) vs 3D orbit views across modes
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 95.7% 92.4% 95.7% +3.3%
Kimi K2.5 69.7% 45.0% 63.7% +18.7%
Qwen 3.5 Plus 83.6% 51.3% 83.3% +32.0%
Qwen 3.5 OS 84.1% 51.0% 84.9% +33.9%
GPT-5.2 92.3% 71.6% 92.6% +21.0%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 88.3% 54.5% 88.5% +34.0%
Kimi K2.5 45.6% 35.9% 30.3% -5.6%
Qwen 3.5 Plus 62.2% 49.7% 56.3% +6.6%
Qwen 3.5 OS 62.3% 48.9% 51.9% +3.0%
GPT-5.2 †10 91.6% 68.1% 90.7% +22.6%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 96.2% 89.4% 95.1% +5.7%
Kimi K2.5 50.5% 35.1% 50.7% +15.6%
Qwen 3.5 Plus 88.6% 73.1% 89.3% +16.2%
Qwen 3.5 OS 86.7% 70.9% 86.3% +15.4%
GPT-5.2 95.8% 78.0% 95.4% +17.4%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 91.0% 66.6% 91.4% +24.8%
Kimi K2.5 56.6% 42.2% 55.0% +12.8%
Qwen 3.5 Plus 36.8% 28.4% 35.4% +7.0%
Qwen 3.5 OS 63.7% 47.4% 59.5% +12.1%
GPT-5.2 †10 96.1% 81.2% 96.0% +14.8%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 98.1% 84.1% 93.9% +9.8%
Kimi K2.5 46.9% 33.3% 42.0% +8.7%
Qwen 3.5 Plus 84.0% 66.3% 84.6% +18.3%
Qwen 3.5 OS 89.9% 75.3% 88.7% +13.4%
GPT-5.2 96.1% 73.9% 92.7% +18.8%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 66.1% 56.6% 72.2% +15.6%
Kimi K2.5 47.1% 43.9% 46.5% +2.6%
Qwen 3.5 Plus 62.0% 53.7% 60.5% +6.8%
Qwen 3.5 OS 62.8% 51.4% 58.0% +6.6%
GPT-5.2 †10 98.2% 59.8% 95.9% +36.1%
A3 Open vs Closed Source Resilience: Qwen 3.5 Plus (closed) vs Qwen 3.5 OS (open)
Mode BS Plus (Closed) OS (Open) Δ (Plus−OS)
Image 1 64.2% 64.4% -0.2%
Image 20 53.8% 53.0% +0.8%
Text-Only 1 79.3% 77.1% +2.2%
Text-Only 20 31.3% 53.2% -21.9%
SceneGraph 1 73.5% 81.1% -7.6%
SceneGraph 20 56.5% 55.2% +1.3%

Per-View F1 (BS=1, Text-Only)

View Angle Plus OS Δ
0° 88.6% 86.7% +1.9%
45° 72.1% 69.3% +2.8%
90° ⚡ 68.5% 66.7% +1.8%
135° 70.9% 66.1% +4.8%
180° 88.7% 86.8% +1.9%
225° 71.8% 70.7% +1.1%
270° ⚡ 68.0% 66.9% +1.1%
315° 71.5% 69.4% +2.1%
top-down 89.3% 86.3% +3.0%