
CVT-Bench Results Dashboard

Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
Best F1 Score: 93.5% (Gemini 3.1 Pro · Image · BS=1)
Avg Viewpoint Degradation: -15.6% (across all model × BS combos)
Best Tracking Survival: 62.6% (Gemini 3.1 Pro · Image · BS=1)
Structured Formats Improve F1: 4 / 5 models (Text/SG beats Image at BS=1)
Structured Formats Improve F1: 5 / 5 models (Text/SG beats Image at BS=20)
🏆 Model Report Card: at-a-glance across all dimensions
Model | Episodic F1 (BS=1) | Long-Ctx F1 (BS≥10) | Rotation Resilience | SG Benefit | Tracking (BS=1) | Tracking (BS≥10) | Consistency (BS=1) | Consistency (BS≥10) | Context Stability
Gemini 3.1 Pro ~ ~
Kimi K2.5 ~ ~
Qwen 3.5 Plus ~ ~
Qwen 3.5 OS ~ ~
GPT-5.2 †BS=10
✓ = Strong ~ = Moderate ✗ = Weak
Threshold Definitions
Dimension ✓ Strong ~ Moderate ✗ Weak
Episodic / Long-Ctx F1 ≥ 80% ≥ 60% < 60%
Rotation Resilience < 5% drop < 15% drop ≥ 15% drop
SG Benefit > +2% > −2% ≤ −2%
Tracking (BS=1) ≥ 30% ≥ 10% < 10%
Tracking (BS≥10) ≥ 10% ≥ 3% < 3%
Consistency (BS=1 & BS≥10) ≥ 95% ≥ 80% < 80%
Context Stability < 5% drop < 15% drop ≥ 15% drop
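The thresholds above amount to a two-cut-point lookup per dimension. A minimal sketch of that mapping (the function name and structure are illustrative, not taken from the benchmark code):

```python
def rate(value, strong, moderate, higher_is_better=True):
    """Map a metric value to a report-card rating using two cut points."""
    if higher_is_better:
        if value >= strong:
            return "✓"
        if value >= moderate:
            return "~"
        return "✗"
    # For drop-style metrics (Rotation Resilience, Context Stability),
    # smaller drops are better, so the comparisons flip.
    if value < strong:
        return "✓"
    if value < moderate:
        return "~"
    return "✗"

# Episodic F1 thresholds: ✓ ≥ 80%, ~ ≥ 60%
print(rate(93.5, 80, 60))                         # ✓
# Rotation Resilience: ✓ < 5% drop, ~ < 15% drop
print(rate(6.1, 5, 15, higher_is_better=False))   # ~
```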
† GPT-5.2 evaluated at BS=10 (not BS=20) due to shared reasoning+output token budget constraints.
📋 Model Specifications: architecture, parameters, and API-measured input token counts
Model Architecture Total / Active Params Context Max Out Vision Source Peak Tokens % Context
Gemini 3.1 Pro Sparse MoE Undisclosed 1M 65K Native multimodal Closed 96,507 9.7%
GPT-5.2 Undisclosed Undisclosed 400K 128K Text + Image Closed 41,205×10 10.3%
Qwen 3.5 Plus DeltaNet+MoE 397B / 17B 262K→1M 81K Early fusion Closed 86,772 33.1%
Qwen 3.5 OS DeltaNet+MoE 397B / 17B 262K→1M 81K Early fusion Open 86,772 33.1%
Kimi K2.5 MoE+MLA 1T / 32B 256K 65K MoonViT (400M) Open 82,991 32.4%
⚠️ Context ≠ Capacity: Degradation Occurs Well Below Limits
At peak batch size, no model exceeds 34% of its context window. Gemini uses only 9.7% and GPT only 10.3%. Yet all models show catastrophic spatial degradation (Gemini F1: 96.6% → 57.9%, Qwen OS: 86.7% → 25.6%). This confirms failures stem from spatial reasoning limitations, not context overflow.
Peak Tokens = highest input tokens across all 3 modes (Image, Text-Only, Scene Graph) at max batch size, measured via each model's API. GPT-5.2 shares 128K budget between reasoning + output. ×10 GPT evaluated at BS=10 (not 20). Active params: routed+shared experts per token. Qwen context: 262K native, extensible to 1.01M via API.
📊 Benchmark Specifications: CVT-Bench dataset composition and statistics
100 scenes · 5,703 questions · 10 view angles · 320×240 image size
Property Value Details
Scene Composition
Sparse scenes (3–5 placed objects) 50 mean 3.9 tagged/scene, range 2–5 (min=2 after visibility filtering)
Dense scenes (7–10 objects) 50 mean 8.4 obj/scene, range 7–10
Objects per scene (all) 6.2 avg min 2, max 10
Question Statistics
Total questions 5,703 mean 57.0/scene, min 20, max 60
Manipulation questions (rotated views) 4,419 (77.5%) 9 rotation angles: 45°–360° + top-down
Standard questions (0° original view) 1,284 (22.5%) No mental rotation required
Occlusion Statistics
Scenes with occlusion 43 / 100 (43%) mean 13.2 occluded Qs/scene (max 43)
Occluded questions 1,315 (23.1%) Subject: 613 | Object: 620 | Both: 82
Visible questions 4,388 (76.9%) No occlusion involvement
Relationship Labels
Label distribution 4 directions Left 25.3% | Right 24.8% | Front 25.1% | Behind 24.8%
Multi-label questions 5,560 (97.5%) Most questions have 2+ spatial relations
Single-label questions 143 (2.5%) Only 1 spatial relation
Total relation labels 11,263 ~2.0 labels per question on average
View Angle Distribution
Original view (0°) 1,284 (22.5%) Overrepresented as baseline
Each rotated angle (45°–360° + top-down) ~491 each (8.6%)
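Since ~97.5% of questions carry multiple relation labels (~2.0 per question), the reported F1 is presumably a micro-averaged score over predicted vs. gold label sets per question. A minimal sketch under that assumption (the benchmark's actual scoring code may differ):

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-question spatial-relation label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"left", "behind"}, {"right"}]
pred = [{"left", "front"}, {"right"}]
print(round(micro_f1(gold, pred), 3))  # 0.667
```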
📌 Episodic Spatial Reasoning Performance: can MLLMs reason spatially in isolation?
Batch Comparison
Global F1 grouped by model × mode × batch size
🔄 Counterfactual Viewpoint Reasoning: does performance degrade with viewpoint deviation?
Rotation Lines
F1 vs Rotation Angle — "W-curve" dips visible at 90° and 270°
Compass
Viewpoint Compass — F1 on physical bearings
Model \ Angle 0° 45° 90° 135° 180° 225° 270° 315°
Gemini 3.1 Pro 96 89 87 89 96 89 87 89
Kimi K2.5 93 63 49 64 84 65 46 64
Qwen 3.5 Plus 95 78 75 77 95 78 75 78
Qwen 3.5 OS 96 76 72 73 95 78 73 77
GPT-5.2 96 77 72 77 97 78 72 77
Color scale: low → high F1. ⚡ = orthogonal angles (90°, 270°), where the "W-curve" dips.
📉 Context Length & Spatial State Degradation: how do sequential interaction and context length affect stability?
Attention Fatigue
F1 Decay over Prompt Depth
Consistency Drift
0°=360° Consistency — collapses under sequential load
Tracking
Tracking Survival + Consistency
Gemini 3.1 Pro: BS=1 92% → BS=20 75% (-16.7%)
Kimi K2.5: BS=1 72% → BS=20 53% (-19.4%)
Qwen 3.5 Plus: BS=1 85% → BS=20 51% (-33.4%)
Qwen 3.5 OS: BS=1 84% → BS=20 53% (-31.1%)
GPT-5.2: BS=1 85% → BS=10 86% (+1.7%)

Relational Tracking Survival (Streak ≥ 10) Table

Model BS Image Text-Only SceneGraph
Gemini 3.1 Pro 1 62.6% 57.2% 48.8%
GPT-5.2 † 10 19.2% 44.4% 3.7%
Qwen 3.5 Plus 1 6.7% 37.0% 15.8%
GPT-5.2 1 34.0% 40.4% 20.2%
Qwen 3.5 OS 1 5.7% 36.4% 20.5%
Gemini 3.1 Pro 20 2.4% 17.2% 5.4%
Kimi K2.5 1 7.4% 10.8% 8.4%
Kimi K2.5 20 0.0% 0.0% 0.0%
Qwen 3.5 OS 20 0.0% 0.0% 1.7%
Qwen 3.5 Plus 20 0.0% 0.0% 0.3%
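"Survival (Streak ≥ 10)" appears to measure the fraction of tracked relations whose run of consecutive correct judgments reaches at least 10 steps; whether the streak is leading or longest is not specified here, so this sketch uses the leading streak:

```python
def survival_rate(correct_seqs, threshold=10):
    """Fraction of per-relation correctness sequences whose leading
    streak of consecutive correct answers reaches `threshold`."""
    survived = 0
    for seq in correct_seqs:
        streak = 0
        for ok in seq:
            if not ok:
                break
            streak += 1
        if streak >= threshold:
            survived += 1
    return survived / len(correct_seqs)

seqs = [[True] * 12, [True] * 9 + [False], [True, False] * 6]
print(survival_rate(seqs))  # only the first sequence survives: 1/3
```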
📐 Effect of Representation Structure: Image vs Text-Only vs SceneGraph
Radar
Spatial F1 Octagon — multi-dimensional comparison
🔍 Scene Complexity & Occlusion: Sparse vs Dense, Visible vs Occluded
Density and Occlusion effect
Gemini 3.1 Pro: Dense 90% vs Sparse 94% (Δ=-4.3%) · Occluded 90% vs Visible 92% (Δ=-1.5%)
Kimi K2.5: Dense 72% vs Sparse 73% (Δ=-0.6%) · Occluded 68% vs Visible 73% (Δ=-5.3%)
Qwen 3.5 Plus: Dense 85% vs Sparse 84% (Δ=+1.5%) · Occluded 85% vs Visible 85% (Δ=+0.5%)
Qwen 3.5 OS: Dense 84% vs Sparse 83% (Δ=+1.0%) · Occluded 85% vs Visible 83% (Δ=+2.3% ⚠️ paradox)
GPT-5.2: Dense 85% vs Sparse 85% (Δ=-0.1%) · Occluded 85% vs Visible 84% (Δ=+0.4%)
⚠️ Failure Mode Analysis: mechanisms behind spatial reasoning failures
Failure Modes

Qualitative failure modes in CVT-Bench. (A) Orthogonal inversion; (B) Scene-graph mis-association; (C) Within-scene drift.

Universal Hard Cases

Montage of queries that all evaluated models fail under the specified setting.

A4 Error Correlation Across Models: pairwise agreement and shared failures across context length
Universally Hard Questions: 7.9% (449/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 71.7% 59.8% 11.9% 24.1% 4.2%
Qwen 3.5 Plus × Qwen 3.5 OS 66.5% 26.4% 40.1% 16.6% 16.9%
Kimi K2.5 × Qwen 3.5 Plus 63.2% 18.4% 44.7% 12.2% 24.7%
Qwen 3.5 Plus × GPT-5.2 62.4% 34.8% 27.6% 8.3% 29.3%
Qwen 3.5 OS × GPT-5.2 61.6% 34.5% 27.1% 8.8% 29.6%
Kimi K2.5 × Qwen 3.5 OS 59.6% 16.7% 42.8% 13.9% 26.5%
Kimi K2.5 × GPT-5.2 54.2% 24.4% 29.8% 6.2% 39.6%
Gemini 3.1 Pro × Qwen 3.5 OS 54.0% 40.6% 13.4% 43.3% 2.7%
Gemini 3.1 Pro × Qwen 3.5 Plus 53.4% 40.2% 13.2% 43.7% 2.9%
Gemini 3.1 Pro × Kimi K2.5 40.6% 27.6% 13.1% 56.3% 3.1%
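The pairwise columns decompose each model pair's per-question outcomes into four buckets; agreement is then both-correct plus both-wrong. A sketch assuming boolean per-question correctness vectors:

```python
def pair_stats(a, b):
    """Decompose two per-question correctness vectors (same length)
    into the four outcome buckets used in the A4 tables."""
    n = len(a)
    both_ok  = sum(x and y for x, y in zip(a, b)) / n
    both_bad = sum((not x) and (not y) for x, y in zip(a, b)) / n
    only_a   = sum(x and not y for x, y in zip(a, b)) / n
    only_b   = sum(y and not x for x, y in zip(a, b)) / n
    return {"agreement": both_ok + both_bad, "both_ok": both_ok,
            "both_bad": both_bad, "only_a": only_a, "only_b": only_b}

stats = pair_stats([True, True, False, False], [True, False, False, True])
print(stats["agreement"])  # 0.5
```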

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 66.1% +58.5%
RelAxis front-behind 3.6% 33.4% +29.8%
RelAxis left-right 4.1% 32.7% +28.6%
Density Dense 52.6% 61.5% +8.9%
View 135 8.6% 13.8% +5.2%
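The "Shift" column is simply an attribute's prevalence among universally hard questions minus its prevalence in the full dataset; a one-liner sketch (small discrepancies vs. the tables reflect rounding of the underlying counts):

```python
def attribute_shift(overall_pct, hard_pct):
    """Over-representation of an attribute value among universally hard questions."""
    return hard_pct - overall_pct

# Num_GT_Rels=1: 7.7% overall vs 66.1% among hard questions
print(round(attribute_shift(7.7, 66.1), 1))  # 58.4
```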
Universally Hard Questions: 19.3% (1102/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Qwen 3.5 Plus × Qwen 3.5 OS 80.3% 17.6% 62.8% 10.8% 8.9%
Kimi K2.5 × Qwen 3.5 OS 71.2% 4.0% 67.2% 6.3% 22.5%
Kimi K2.5 × Qwen 3.5 Plus 70.1% 4.4% 65.7% 5.9% 24.0%
Gemini 3.1 Pro × GPT-5.2 64.6% 35.8% 28.8% 10.7% 24.7%
Gemini 3.1 Pro × Qwen 3.5 Plus 57.0% 15.9% 41.1% 30.6% 12.4%
Gemini 3.1 Pro × Qwen 3.5 OS 56.5% 14.7% 41.8% 31.8% 11.7%
Gemini 3.1 Pro × Kimi K2.5 53.9% 5.4% 48.5% 41.1% 5.0%
Qwen 3.5 Plus × GPT-5.2 51.0% 20.0% 31.1% 8.4% 40.6%
Qwen 3.5 OS × GPT-5.2 49.6% 18.3% 31.3% 8.2% 42.3%
Kimi K2.5 × GPT-5.2 42.2% 6.5% 35.6% 3.8% 54.0%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 31.5% +23.8%
RelAxis front-behind 3.6% 16.5% +13.0%
RelAxis left-right 4.1% 15.0% +10.9%
View 225 8.6% 16.1% +7.5%
View 135 8.6% 13.4% +4.8%
Universally Hard Questions: 10.6% (605/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 81.0% 67.8% 13.2% 14.4% 4.6%
Qwen 3.5 Plus × GPT-5.2 72.3% 53.0% 19.3% 8.3% 19.4%
Qwen 3.5 OS × GPT-5.2 70.7% 51.1% 19.6% 8.0% 21.3%
Gemini 3.1 Pro × Qwen 3.5 Plus 69.5% 56.5% 13.0% 25.7% 4.8%
Gemini 3.1 Pro × Qwen 3.5 OS 68.0% 54.7% 13.3% 27.5% 4.5%
Qwen 3.5 Plus × Qwen 3.5 OS 64.6% 42.5% 22.1% 18.8% 16.6%
Kimi K2.5 × Qwen 3.5 Plus 51.4% 16.8% 34.6% 4.1% 44.5%
Kimi K2.5 × Qwen 3.5 OS 48.2% 14.1% 34.1% 6.8% 45.0%
Kimi K2.5 × GPT-5.2 42.7% 18.0% 24.7% 2.9% 54.4%
Gemini 3.1 Pro × Kimi K2.5 35.8% 19.5% 16.4% 62.7% 1.4%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 62.8% +55.1%
RelAxis left-right 4.1% 33.9% +29.8%
RelAxis front-behind 3.6% 28.9% +25.4%
Density Dense 52.6% 63.0% +10.4%
View 45 8.6% 12.2% +3.6%
Universally Hard Questions: 13.4% (766/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Kimi K2.5 × Qwen 3.5 Plus 74.8% 3.4% 71.3% 17.2% 8.0%
Qwen 3.5 Plus × Qwen 3.5 OS 73.9% 5.0% 68.9% 6.5% 19.6%
Kimi K2.5 × Qwen 3.5 OS 70.4% 7.8% 62.6% 12.8% 16.8%
Gemini 3.1 Pro × GPT-5.2 69.0% 51.6% 17.4% 7.3% 23.7%
Gemini 3.1 Pro × Kimi K2.5 50.7% 15.1% 35.6% 43.7% 5.5%
Gemini 3.1 Pro × Qwen 3.5 OS 49.5% 16.5% 33.0% 42.4% 8.1%
Gemini 3.1 Pro × Qwen 3.5 Plus 45.9% 8.1% 37.8% 50.8% 3.3%
Qwen 3.5 OS × GPT-5.2 41.6% 20.7% 20.8% 3.9% 54.6%
Kimi K2.5 × GPT-5.2 39.6% 17.8% 21.8% 2.9% 57.5%
Qwen 3.5 Plus × GPT-5.2 34.3% 10.5% 23.8% 0.9% 64.8%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
Num_GT_Rels 1 7.7% 51.7% +44.0%
RelAxis left-right 4.1% 26.2% +22.1%
RelAxis front-behind 3.6% 25.5% +21.9%
Density Dense 52.6% 66.1% +13.5%
Occlusion occluded 23.1% 29.1% +6.1%
Universally Hard Questions: 6.4% (366/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Gemini 3.1 Pro × GPT-5.2 68.3% 56.7% 11.7% 23.1% 8.5%
Qwen 3.5 OS × GPT-5.2 67.9% 45.3% 22.7% 12.1% 20.0%
Qwen 3.5 Plus × Qwen 3.5 OS 66.6% 36.6% 30.0% 12.7% 20.7%
Qwen 3.5 Plus × GPT-5.2 63.3% 38.9% 24.4% 10.4% 26.3%
Gemini 3.1 Pro × Qwen 3.5 OS 62.8% 50.0% 12.8% 29.8% 7.4%
Gemini 3.1 Pro × Qwen 3.5 Plus 56.1% 42.6% 13.5% 37.2% 6.7%
Kimi K2.5 × Qwen 3.5 Plus 51.1% 10.4% 40.7% 10.0% 38.9%
Kimi K2.5 × Qwen 3.5 OS 47.4% 12.6% 34.8% 7.8% 44.7%
Kimi K2.5 × GPT-5.2 45.3% 15.5% 29.8% 4.9% 49.7%
Gemini 3.1 Pro × Kimi K2.5 35.8% 18.0% 17.8% 61.8% 2.4%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
View 135 8.6% 23.0% +14.3%
View 315 8.6% 22.1% +13.5%
View 225 8.6% 19.4% +10.8%
View 45 8.6% 19.1% +10.5%
Num_GT_Rels 1 7.7% 17.8% +10.1%
Universally Hard Questions: 16.0% (912/5703 answered incorrectly by ALL models)
Model Pair Agreement Both ✓ Both ✗ Only M1 ✓ Only M2 ✓
Qwen 3.5 Plus × Qwen 3.5 OS 70.6% 12.5% 58.1% 12.3% 17.1%
Kimi K2.5 × Qwen 3.5 Plus 68.2% 3.5% 64.7% 10.5% 21.3%
Kimi K2.5 × Qwen 3.5 OS 66.4% 5.0% 61.4% 9.0% 24.6%
Gemini 3.1 Pro × Qwen 3.5 Plus 62.9% 10.7% 52.2% 22.9% 14.2%
Gemini 3.1 Pro × Qwen 3.5 OS 61.8% 12.5% 49.3% 21.1% 17.1%
Gemini 3.1 Pro × Kimi K2.5 61.7% 4.7% 57.1% 28.9% 9.3%
Qwen 3.5 OS × GPT-5.2 50.3% 18.5% 31.8% 11.1% 38.6%
Gemini 3.1 Pro × GPT-5.2 48.6% 19.6% 29.0% 14.0% 37.4%
Kimi K2.5 × GPT-5.2 45.7% 8.4% 37.3% 5.6% 48.7%
Qwen 3.5 Plus × GPT-5.2 45.0% 13.5% 31.5% 11.4% 43.6%

What makes these questions Universally Hard?

Top attribute shifts compared to overall dataset:
Attribute Value Overall % Hard % Shift
View 225 8.6% 19.7% +11.1%
View 135 8.6% 19.4% +10.8%
View 315 8.6% 17.7% +9.0%
View 45 8.6% 16.7% +8.1%
Num_GT_Rels 1 7.7% 15.8% +8.1%
📊 Complete F1 Leaderboard
Model BS Image Text-Only SceneGraph Rot Drop SG Effect Consist.
Gemini 3.1 Pro 1 93.5% 91.6% 88.5% -6.1% -4.7% 100%
GPT-5.2 † 10 76.0% 86.3% 73.8% -13.0% -18.2% 99%
Qwen 3.5 Plus 1 70.8% 84.7% 81.0% -13.5% -4.4% 100%
GPT-5.2 1 78.7% 84.6% 81.5% -15.6% -4.3% 100%
Qwen 3.5 OS 1 69.2% 83.7% 84.7% -15.7% +1.3% 100%
Gemini 3.1 Pro 20 66.0% 74.9% 60.4% -21.3% -11.1% 97%
Kimi K2.5 1 69.1% 72.2% 73.9% -27.2% +2.1% 98%
Kimi K2.5 20 45.2% 52.8% 44.9% -14.2% -5.0% 58%
Qwen 3.5 OS 20 52.5% 52.6% 54.8% -14.8% +3.3% 88%
Qwen 3.5 Plus 20 53.4% 51.3% 56.2% -14.2% +6.7% 87%

🔬 Deep Analysis

Additional analyses from per-question evaluation

A1 Relation-Type Breakdown by Context Size: Left/Right vs Front/Behind spatial axes
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 65.2% 64.3% 93.7% +0.9% 1.5% 1.6%
Kimi K2.5 45.2% 30.2% 41.5% +15.0% 3.7% 3.7%
Qwen 3.5 Plus 61.5% 57.9% 80.5% +3.6% 3.8% 3.9%
Qwen 3.5 OS 61.6% 52.0% 78.3% +9.6% 3.5% 3.5%
GPT-5.2 66.5% 58.7% 86.4% +7.8% 5.2% 5.2%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 58.6% 50.0% 77.6% +8.6% 8.2% 8.2%
Kimi K2.5 40.4% 27.6% 48.2% +12.8% 13.5% 13.4%
Qwen 3.5 Plus 26.8% 23.6% 31.6% +3.2% 8.4% 8.4%
Qwen 3.5 OS 46.2% 30.3% 54.1% +15.9% 15.0% 15.0%
GPT-5.2 †10 63.4% 60.2% 88.7% +3.2% 4.0% 4.0%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 76.4% 73.5% 90.0% +2.9% 3.7% 3.7%
Kimi K2.5 34.4% 33.8% 38.5% +0.6% 2.7% 2.7%
Qwen 3.5 Plus 65.7% 59.5% 74.0% +6.2% 3.9% 3.8%
Qwen 3.5 OS 69.3% 61.5% 81.9% +7.8% 1.8% 1.9%
GPT-5.2 83.3% 69.4% 83.0% +13.9% 4.6% 4.8%
Model L/R F1 F/B F1 Both F1 Gap (L/R−F/B) L/R Inv. F/B Inv.
Gemini 3.1 Pro 53.2% 32.2% 62.1% +21.0% 11.5% 11.5%
Kimi K2.5 43.3% 23.6% 45.8% +19.7% 12.7% 12.3%
Qwen 3.5 Plus 51.3% 37.2% 57.2% +14.1% 14.2% 14.2%
Qwen 3.5 OS 44.4% 34.6% 56.2% +9.8% 18.8% 18.9%
GPT-5.2 †10 68.8% 65.3% 76.5% +3.5% 8.2% 8.4%
A2 Top-View vs Orbit-View Degradation: aerial perspective (2D logic) vs 3D orbit views across modes
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 95.7% 92.4% 95.7% +3.3%
Kimi K2.5 69.7% 45.0% 63.7% +18.7%
Qwen 3.5 Plus 83.6% 51.3% 83.3% +32.0%
Qwen 3.5 OS 84.1% 51.0% 84.9% +33.9%
GPT-5.2 92.3% 71.6% 92.6% +21.0%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 88.3% 54.5% 88.5% +34.0%
Kimi K2.5 45.6% 35.9% 30.3% -5.6%
Qwen 3.5 Plus 62.2% 49.7% 56.3% +6.6%
Qwen 3.5 OS 62.3% 48.9% 51.9% +3.0%
GPT-5.2 †10 91.6% 68.1% 90.7% +22.6%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 96.2% 89.4% 95.1% +5.7%
Kimi K2.5 50.5% 35.1% 50.7% +15.6%
Qwen 3.5 Plus 88.6% 73.1% 89.3% +16.2%
Qwen 3.5 OS 86.7% 70.9% 86.3% +15.4%
GPT-5.2 95.8% 78.0% 95.4% +17.4%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 91.0% 66.6% 91.4% +24.8%
Kimi K2.5 56.6% 42.2% 55.0% +12.8%
Qwen 3.5 Plus 36.8% 28.4% 35.4% +7.0%
Qwen 3.5 OS 63.7% 47.4% 59.5% +12.1%
GPT-5.2 †10 96.1% 81.2% 96.0% +14.8%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 98.1% 84.1% 93.9% +9.8%
Kimi K2.5 46.9% 33.3% 42.0% +8.7%
Qwen 3.5 Plus 84.0% 66.3% 84.6% +18.3%
Qwen 3.5 OS 89.9% 75.3% 88.7% +13.4%
GPT-5.2 96.1% 73.9% 92.7% +18.8%
Model Original (0°) Orbit Avg Top View Δ(Top−Orbit)
Gemini 3.1 Pro 66.1% 56.6% 72.2% +15.6%
Kimi K2.5 47.1% 43.9% 46.5% +2.6%
Qwen 3.5 Plus 62.0% 53.7% 60.5% +6.8%
Qwen 3.5 OS 62.8% 51.4% 58.0% +6.6%
GPT-5.2 †10 98.2% 59.8% 95.9% +36.1%
A3 Open vs Closed Source Resilience: Qwen 3.5 Plus (closed) vs Qwen 3.5 OS (open)
Mode BS Plus (Closed) OS (Open) Δ (Plus−OS)
Image 1 64.2% 64.4% -0.2%
Image 20 53.8% 53.0% +0.8%
Text-Only 1 79.3% 77.1% +2.2%
Text-Only 20 31.3% 53.2% -21.9%
SceneGraph 1 73.5% 81.1% -7.6%
SceneGraph 20 56.5% 55.2% +1.3%

Per-View F1 (BS=1, Text-Only)

View Angle Plus OS Δ
0° 88.6% 86.7% +1.9%
45° 72.1% 69.3% +2.8%
90° ⚡ 68.5% 66.7% +1.8%
135° 70.9% 66.1% +4.8%
180° 88.7% 86.8% +1.9%
225° 71.8% 70.7% +1.1%
270° ⚡ 68.0% 66.9% +1.1%
315° 71.5% 69.4% +2.1%
top-down 89.3% 86.3% +3.0%