| Model | Episodic F1 (BS=1) | Long-Ctx F1 (BS≥10) | Rotation Resilience | SG Benefit | Tracking (BS=1) | Tracking (BS≥10) | Consistency (BS=1) | Consistency (BS≥10) | Context Stability |
|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | ✓ | ~ | ~ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Kimi K2.5 | ~ | ✗ | ✗ | ✓ | ~ | ✗ | ✓ | ✗ | ✗ |
| Qwen 3.5 Plus | ✓ | ✗ | ~ | ✗ | ✓ | ✗ | ✓ | ~ | ✗ |
| Qwen 3.5 OS | ✓ | ✗ | ✗ | ~ | ✓ | ✗ | ✓ | ~ | ✗ |
| GPT-5.2 † (BS=10) | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Dimension | ✓ Strong | ~ Moderate | ✗ Weak |
|---|---|---|---|
| Episodic / Long-Ctx F1 | ≥ 80% | ≥ 60% | < 60% |
| Rotation Resilience | < 5% drop | < 15% drop | ≥ 15% drop |
| SG Benefit | > +2% | > −2% | ≤ −2% |
| Tracking (BS=1) | ≥ 30% | ≥ 10% | < 10% |
| Tracking (BS≥10) | ≥ 10% | ≥ 3% | < 3% |
| Consistency (BS=1 & BS≥10) | ≥ 95% | ≥ 80% | < 80% |
| Context Stability | < 5% drop | < 15% drop | ≥ 15% drop |
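
The thresholds in the rubric above can be read as a small set of binning rules. The following is a minimal sketch of that mapping, not the authors' scoring code: the function names are hypothetical, all inputs are assumed to be percentages, and drops are assumed to be passed as positive magnitudes.

```python
# Hypothetical helpers that bin a metric into the rubric's three ratings.
# All values are percentages; names and signatures are illustrative only.

def rate_at_least(value: float, strong: float, moderate: float) -> str:
    """Higher is better: '✓' if value >= strong, '~' if >= moderate, else '✗'."""
    if value >= strong:
        return "✓"
    if value >= moderate:
        return "~"
    return "✗"


def rate_f1(f1: float) -> str:
    # Episodic / Long-Ctx F1: >= 80% strong, >= 60% moderate.
    return rate_at_least(f1, 80.0, 60.0)


def rate_tracking(acc: float, batched: bool) -> str:
    # Tracking: >= 30% / >= 10% at BS=1, >= 10% / >= 3% at BS>=10.
    return rate_at_least(acc, 10.0, 3.0) if batched else rate_at_least(acc, 30.0, 10.0)


def rate_consistency(pct: float) -> str:
    # Consistency: >= 95% strong, >= 80% moderate.
    return rate_at_least(pct, 95.0, 80.0)


def rate_drop(drop: float) -> str:
    """Rotation Resilience / Context Stability: drop given as a positive magnitude."""
    if drop < 5.0:
        return "✓"
    if drop < 15.0:
        return "~"
    return "✗"


def rate_sg_benefit(delta: float) -> str:
    """SG Benefit: signed change; > +2% strong, > -2% moderate, else weak."""
    if delta > 2.0:
        return "✓"
    if delta > -2.0:
        return "~"
    return "✗"


if __name__ == "__main__":
    print(rate_f1(82.5), rate_drop(6.1), rate_sg_benefit(-4.7))
```

Note that the boundaries are asymmetric: the F1, tracking, and consistency bins are inclusive at the threshold, while the drop and SG Benefit bins use strict inequalities, as written in the rubric.
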
| Model | Architecture | Total / Active Params | Context | Max Out | Vision | Source | Peak Tokens† | % Context |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | Sparse MoE | Undisclosed | 1M | 65K | Native multimodal | Closed | 96,507 | 9.7% |
| GPT-5.2 | Undisclosed | Undisclosed | 400K | 128K‡ | Text + Image | Closed | 41,205×10 | 10.3% |
| Qwen 3.5 Plus | DeltaNet+MoE | 397B / 17B | 262K→1M | 81K | Early fusion | Closed | 86,772 | 33.1% |
| Qwen 3.5 OS | DeltaNet+MoE | 397B / 17B | 262K→1M | 81K | Early fusion | Open | 86,772 | 33.1% |
| Kimi K2.5 | MoE+MLA | 1T / 32B | 256K | 65K | MoonViT (400M) | Open | 82,991 | 32.4% |
| Property | Value | Details |
|---|---|---|
| Scene Composition | | |
| Sparse scenes (3–5 placed objects) | 50 | mean 3.9 tagged objects/scene, range 2–5 (min = 2 after visibility filtering) |
| Dense scenes (7–10 objects) | 50 | mean 8.4 objects/scene, range 7–10 |
| Objects per scene (all) | 6.2 avg | min 2, max 10 |
| Question Statistics | | |
| Total questions | 5,703 | mean 57.0/scene, min 20, max 60 |
| Manipulation questions (rotated views) | 4,419 (77.5%) | 9 rotation angles: 45°–360° + top-down |
| Standard questions (0° original view) | 1,284 (22.5%) | No mental rotation required |
| Occlusion Statistics | | |
| Scenes with occlusion | 43 / 100 (43%) | mean 13.2 occluded questions/scene (max 43) |
| Occluded questions | 1,315 (23.1%) | Subject: 613; Object: 620; Both: 82 |
| Visible questions | 4,388 (76.9%) | No occlusion involvement |
| Relationship Labels | | |
| Label distribution | 4 directions | Left 25.3%; Right 24.8%; Front 25.1%; Behind 24.8% |
| Multi-label questions | 5,560 (97.5%) | Most questions have 2+ spatial relations |
| Single-label questions | 143 (2.5%) | Only 1 spatial relation |
| Total relation labels | 11,263 | ~2.0 labels per question on average |
| View Angle Distribution | | |
| Original view (0°) | 1,284 (22.5%) | Original view overrepresented as baseline |
| Each rotated angle (45°–360° + top-down) | ~491 each (8.6%) | |
| Model \ Angle | 0° | 45° | 90° | 135° | 180° | 225° | 270° | 315° |
|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 96 | 89 | 87 | 89 | 96 | 89 | 87 | 89 |
| Kimi K2.5 | 93 | 63 | 49 | 64 | 84 | 65 | 46 | 64 |
| Qwen 3.5 Plus | 95 | 78 | 75 | 77 | 95 | 78 | 75 | 78 |
| Qwen 3.5 OS | 96 | 76 | 72 | 73 | 95 | 78 | 73 | 77 |
| GPT-5.2 | 96 | 77 | 72 | 77 | 97 | 78 | 72 | 77 |
| Model | BS | Image | Text-Only | SceneGraph |
|---|---|---|---|---|
| Gemini 3.1 Pro | 1 | 62.6% | 57.2% | 48.8% |
| GPT-5.2 † | 10 | 19.2% | 44.4% | 3.7% |
| Qwen 3.5 Plus | 1 | 6.7% | 37.0% | 15.8% |
| GPT-5.2 | 1 | 34.0% | 40.4% | 20.2% |
| Qwen 3.5 OS | 1 | 5.7% | 36.4% | 20.5% |
| Gemini 3.1 Pro | 20 | 2.4% | 17.2% | 5.4% |
| Kimi K2.5 | 1 | 7.4% | 10.8% | 8.4% |
| Kimi K2.5 | 20 | 0.0% | 0.0% | 0.0% |
| Qwen 3.5 OS | 20 | 0.0% | 0.0% | 1.7% |
| Qwen 3.5 Plus | 20 | 0.0% | 0.0% | 0.3% |
Qualitative failure modes in CVT-Bench. (A) Orthogonal inversion; (B) Scene-graph mis-association; (C) Within-scene drift.
Montage of queries that all evaluated models fail under the specified setting.
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Gemini 3.1 Pro × GPT-5.2 | 71.7% | 59.8% | 11.9% | 24.1% | 4.2% |
| Qwen 3.5 Plus × Qwen 3.5 OS | 66.5% | 26.4% | 40.1% | 16.6% | 16.9% |
| Kimi K2.5 × Qwen 3.5 Plus | 63.2% | 18.4% | 44.7% | 12.2% | 24.7% |
| Qwen 3.5 Plus × GPT-5.2 | 62.4% | 34.8% | 27.6% | 8.3% | 29.3% |
| Qwen 3.5 OS × GPT-5.2 | 61.6% | 34.5% | 27.1% | 8.8% | 29.6% |
| Kimi K2.5 × Qwen 3.5 OS | 59.6% | 16.7% | 42.8% | 13.9% | 26.5% |
| Kimi K2.5 × GPT-5.2 | 54.2% | 24.4% | 29.8% | 6.2% | 39.6% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 54.0% | 40.6% | 13.4% | 43.3% | 2.7% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 53.4% | 40.2% | 13.2% | 43.7% | 2.9% |
| Gemini 3.1 Pro × Kimi K2.5 | 40.6% | 27.6% | 13.1% | 56.3% | 3.1% |
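
In the pairwise tables, Agreement equals Both ✓ plus Both ✗, and the four outcome columns sum to 100% up to rounding. The sketch below shows one way such a breakdown could be computed from per-question correctness flags. It assumes each question is scored as a single correct/incorrect boolean per model, which the tables do not spell out; the function name is hypothetical.

```python
# Illustrative pairwise agreement breakdown over per-question correctness.
# Assumption: each question yields one boolean (correct / incorrect) per model.

def pairwise_agreement(m1: list[bool], m2: list[bool]) -> dict[str, float]:
    """Return the outcome shares (in percent) for two models on the same questions."""
    if len(m1) != len(m2) or not m1:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(m1)
    both_ok = sum(a and b for a, b in zip(m1, m2))
    both_bad = sum((not a) and (not b) for a, b in zip(m1, m2))
    only_m1 = sum(a and not b for a, b in zip(m1, m2))
    only_m2 = sum(b and not a for a, b in zip(m1, m2))
    return {
        "agreement": 100 * (both_ok + both_bad) / n,
        "both_correct": 100 * both_ok / n,
        "both_wrong": 100 * both_bad / n,
        "only_m1": 100 * only_m1 / n,
        "only_m2": 100 * only_m2 / n,
    }


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    model_a = [True, True, False, False, True]
    model_b = [True, False, False, True, True]
    print(pairwise_agreement(model_a, model_b))
```

A high agreement value can come mostly from shared failures (Both ✗) rather than shared successes, which is why the two components are reported separately.
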
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| Num_GT_Rels | 1 | 7.7% | 66.1% | +58.5% |
| RelAxis | front-behind | 3.6% | 33.4% | +29.8% |
| RelAxis | left-right | 4.1% | 32.7% | +28.6% |
| Density | Dense | 52.6% | 61.5% | +8.9% |
| View | 135 | 8.6% | 13.8% | +5.2% |
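
The Shift column in the attribute tables is the difference, in percentage points, between an attribute value's share in the hard subset and its share in the full question set. The following minimal sketch computes that quantity; the record layout, the field names, and the way the hard subset is selected are assumptions made for illustration, not the paper's pipeline.

```python
# Illustrative computation of Overall %, Hard %, and Shift for one attribute value.
# Assumption: questions are dicts with an "id" key; `hard_ids` marks the hard subset.

def attribute_shift(
    questions: list[dict],
    hard_ids: set,
    attr: str,
    value,
) -> tuple[float, float, float]:
    """Return (overall %, hard %, shift in percentage points)."""
    hard = [q for q in questions if q["id"] in hard_ids]
    if not questions or not hard:
        raise ValueError("need a non-empty question set and hard subset")
    overall_pct = 100 * sum(q[attr] == value for q in questions) / len(questions)
    hard_pct = 100 * sum(q[attr] == value for q in hard) / len(hard)
    return overall_pct, hard_pct, hard_pct - overall_pct


if __name__ == "__main__":
    # Toy data: ids 0-4 are dense scenes, ids 5-9 are sparse scenes.
    qs = [{"id": i, "density": "Dense" if i < 5 else "Sparse"} for i in range(10)]
    print(attribute_shift(qs, {0, 1, 2, 5}, "density", "Dense"))
```

A large positive shift means the attribute value is overrepresented among hard questions relative to its base rate in the benchmark.
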
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Qwen 3.5 Plus × Qwen 3.5 OS | 80.3% | 17.6% | 62.8% | 10.8% | 8.9% |
| Kimi K2.5 × Qwen 3.5 OS | 71.2% | 4.0% | 67.2% | 6.3% | 22.5% |
| Kimi K2.5 × Qwen 3.5 Plus | 70.1% | 4.4% | 65.7% | 5.9% | 24.0% |
| Gemini 3.1 Pro × GPT-5.2 | 64.6% | 35.8% | 28.8% | 10.7% | 24.7% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 57.0% | 15.9% | 41.1% | 30.6% | 12.4% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 56.5% | 14.7% | 41.8% | 31.8% | 11.7% |
| Gemini 3.1 Pro × Kimi K2.5 | 53.9% | 5.4% | 48.5% | 41.1% | 5.0% |
| Qwen 3.5 Plus × GPT-5.2 | 51.0% | 20.0% | 31.1% | 8.4% | 40.6% |
| Qwen 3.5 OS × GPT-5.2 | 49.6% | 18.3% | 31.3% | 8.2% | 42.3% |
| Kimi K2.5 × GPT-5.2 | 42.2% | 6.5% | 35.6% | 3.8% | 54.0% |
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| Num_GT_Rels | 1 | 7.7% | 31.5% | +23.8% |
| RelAxis | front-behind | 3.6% | 16.5% | +13.0% |
| RelAxis | left-right | 4.1% | 15.0% | +10.9% |
| View | 225 | 8.6% | 16.1% | +7.5% |
| View | 135 | 8.6% | 13.4% | +4.8% |
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Gemini 3.1 Pro × GPT-5.2 | 81.0% | 67.8% | 13.2% | 14.4% | 4.6% |
| Qwen 3.5 Plus × GPT-5.2 | 72.3% | 53.0% | 19.3% | 8.3% | 19.4% |
| Qwen 3.5 OS × GPT-5.2 | 70.7% | 51.1% | 19.6% | 8.0% | 21.3% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 69.5% | 56.5% | 13.0% | 25.7% | 4.8% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 68.0% | 54.7% | 13.3% | 27.5% | 4.5% |
| Qwen 3.5 Plus × Qwen 3.5 OS | 64.6% | 42.5% | 22.1% | 18.8% | 16.6% |
| Kimi K2.5 × Qwen 3.5 Plus | 51.4% | 16.8% | 34.6% | 4.1% | 44.5% |
| Kimi K2.5 × Qwen 3.5 OS | 48.2% | 14.1% | 34.1% | 6.8% | 45.0% |
| Kimi K2.5 × GPT-5.2 | 42.7% | 18.0% | 24.7% | 2.9% | 54.4% |
| Gemini 3.1 Pro × Kimi K2.5 | 35.8% | 19.5% | 16.4% | 62.7% | 1.4% |
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| Num_GT_Rels | 1 | 7.7% | 62.8% | +55.1% |
| RelAxis | left-right | 4.1% | 33.9% | +29.8% |
| RelAxis | front-behind | 3.6% | 28.9% | +25.4% |
| Density | Dense | 52.6% | 63.0% | +10.4% |
| View | 45 | 8.6% | 12.2% | +3.6% |
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Kimi K2.5 × Qwen 3.5 Plus | 74.8% | 3.4% | 71.3% | 17.2% | 8.0% |
| Qwen 3.5 Plus × Qwen 3.5 OS | 73.9% | 5.0% | 68.9% | 6.5% | 19.6% |
| Kimi K2.5 × Qwen 3.5 OS | 70.4% | 7.8% | 62.6% | 12.8% | 16.8% |
| Gemini 3.1 Pro × GPT-5.2 | 69.0% | 51.6% | 17.4% | 7.3% | 23.7% |
| Gemini 3.1 Pro × Kimi K2.5 | 50.7% | 15.1% | 35.6% | 43.7% | 5.5% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 49.5% | 16.5% | 33.0% | 42.4% | 8.1% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 45.9% | 8.1% | 37.8% | 50.8% | 3.3% |
| Qwen 3.5 OS × GPT-5.2 | 41.6% | 20.7% | 20.8% | 3.9% | 54.6% |
| Kimi K2.5 × GPT-5.2 | 39.6% | 17.8% | 21.8% | 2.9% | 57.5% |
| Qwen 3.5 Plus × GPT-5.2 | 34.3% | 10.5% | 23.8% | 0.9% | 64.8% |
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| Num_GT_Rels | 1 | 7.7% | 51.7% | +44.0% |
| RelAxis | left-right | 4.1% | 26.2% | +22.1% |
| RelAxis | front-behind | 3.6% | 25.5% | +21.9% |
| Density | Dense | 52.6% | 66.1% | +13.5% |
| Occlusion | occluded | 23.1% | 29.1% | +6.1% |
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Gemini 3.1 Pro × GPT-5.2 | 68.3% | 56.7% | 11.7% | 23.1% | 8.5% |
| Qwen 3.5 OS × GPT-5.2 | 67.9% | 45.3% | 22.7% | 12.1% | 20.0% |
| Qwen 3.5 Plus × Qwen 3.5 OS | 66.6% | 36.6% | 30.0% | 12.7% | 20.7% |
| Qwen 3.5 Plus × GPT-5.2 | 63.3% | 38.9% | 24.4% | 10.4% | 26.3% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 62.8% | 50.0% | 12.8% | 29.8% | 7.4% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 56.1% | 42.6% | 13.5% | 37.2% | 6.7% |
| Kimi K2.5 × Qwen 3.5 Plus | 51.1% | 10.4% | 40.7% | 10.0% | 38.9% |
| Kimi K2.5 × Qwen 3.5 OS | 47.4% | 12.6% | 34.8% | 7.8% | 44.7% |
| Kimi K2.5 × GPT-5.2 | 45.3% | 15.5% | 29.8% | 4.9% | 49.7% |
| Gemini 3.1 Pro × Kimi K2.5 | 35.8% | 18.0% | 17.8% | 61.8% | 2.4% |
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| View | 135 | 8.6% | 23.0% | +14.3% |
| View | 315 | 8.6% | 22.1% | +13.5% |
| View | 225 | 8.6% | 19.4% | +10.8% |
| View | 45 | 8.6% | 19.1% | +10.5% |
| Num_GT_Rels | 1 | 7.7% | 17.8% | +10.1% |
| Model Pair | Agreement | Both ✓ | Both ✗ | Only M1 ✓ | Only M2 ✓ |
|---|---|---|---|---|---|
| Qwen 3.5 Plus × Qwen 3.5 OS | 70.6% | 12.5% | 58.1% | 12.3% | 17.1% |
| Kimi K2.5 × Qwen 3.5 Plus | 68.2% | 3.5% | 64.7% | 10.5% | 21.3% |
| Kimi K2.5 × Qwen 3.5 OS | 66.4% | 5.0% | 61.4% | 9.0% | 24.6% |
| Gemini 3.1 Pro × Qwen 3.5 Plus | 62.9% | 10.7% | 52.2% | 22.9% | 14.2% |
| Gemini 3.1 Pro × Qwen 3.5 OS | 61.8% | 12.5% | 49.3% | 21.1% | 17.1% |
| Gemini 3.1 Pro × Kimi K2.5 | 61.7% | 4.7% | 57.1% | 28.9% | 9.3% |
| Qwen 3.5 OS × GPT-5.2 | 50.3% | 18.5% | 31.8% | 11.1% | 38.6% |
| Gemini 3.1 Pro × GPT-5.2 | 48.6% | 19.6% | 29.0% | 14.0% | 37.4% |
| Kimi K2.5 × GPT-5.2 | 45.7% | 8.4% | 37.3% | 5.6% | 48.7% |
| Qwen 3.5 Plus × GPT-5.2 | 45.0% | 13.5% | 31.5% | 11.4% | 43.6% |
| Attribute | Value | Overall % | Hard % | Shift |
|---|---|---|---|---|
| View | 225 | 8.6% | 19.7% | +11.1% |
| View | 135 | 8.6% | 19.4% | +10.8% |
| View | 315 | 8.6% | 17.7% | +9.0% |
| View | 45 | 8.6% | 16.7% | +8.1% |
| Num_GT_Rels | 1 | 7.7% | 15.8% | +8.1% |
| Model | BS | Image | Text-Only | SceneGraph | Rot Drop | SG Effect | Consist. |
|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 1 | 93.5% | 91.6% | 88.5% | -6.1% | -4.7% | 100% |
| GPT-5.2 † | 10 | 76.0% | 86.3% | 73.8% | -13.0% | -18.2% | 99% |
| Qwen 3.5 Plus | 1 | 70.8% | 84.7% | 81.0% | -13.5% | -4.4% | 100% |
| GPT-5.2 | 1 | 78.7% | 84.6% | 81.5% | -15.6% | -4.3% | 100% |
| Qwen 3.5 OS | 1 | 69.2% | 83.7% | 84.7% | -15.7% | +1.3% | 100% |
| Gemini 3.1 Pro | 20 | 66.0% | 74.9% | 60.4% | -21.3% | -11.1% | 97% |
| Kimi K2.5 | 1 | 69.1% | 72.2% | 73.9% | -27.2% | +2.1% | 98% |
| Kimi K2.5 | 20 | 45.2% | 52.8% | 44.9% | -14.2% | -5.0% | 58% |
| Qwen 3.5 OS | 20 | 52.5% | 52.6% | 54.8% | -14.8% | +3.3% | 88% |
| Qwen 3.5 Plus | 20 | 53.4% | 51.3% | 56.2% | -14.2% | +6.7% | 87% |
Additional analyses from per-question evaluation
| Model | L/R F1 | F/B F1 | Both F1 | Gap (L/R−F/B) | L/R Inv. | F/B Inv. |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 65.2% | 64.3% | 93.7% | +0.9% | 1.5% | 1.6% |
| Kimi K2.5 | 45.2% | 30.2% | 41.5% | +15.0% | 3.7% | 3.7% |
| Qwen 3.5 Plus | 61.5% | 57.9% | 80.5% | +3.6% | 3.8% | 3.9% |
| Qwen 3.5 OS | 61.6% | 52.0% | 78.3% | +9.6% | 3.5% | 3.5% |
| GPT-5.2 | 66.5% | 58.7% | 86.4% | +7.8% | 5.2% | 5.2% |
| Model | L/R F1 | F/B F1 | Both F1 | Gap (L/R−F/B) | L/R Inv. | F/B Inv. |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 58.6% | 50.0% | 77.6% | +8.6% | 8.2% | 8.2% |
| Kimi K2.5 | 40.4% | 27.6% | 48.2% | +12.8% | 13.5% | 13.4% |
| Qwen 3.5 Plus | 26.8% | 23.6% | 31.6% | +3.2% | 8.4% | 8.4% |
| Qwen 3.5 OS | 46.2% | 30.3% | 54.1% | +15.9% | 15.0% | 15.0% |
| GPT-5.2 † (BS=10) | 63.4% | 60.2% | 88.7% | +3.2% | 4.0% | 4.0% |
| Model | L/R F1 | F/B F1 | Both F1 | Gap (L/R−F/B) | L/R Inv. | F/B Inv. |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 76.4% | 73.5% | 90.0% | +2.9% | 3.7% | 3.7% |
| Kimi K2.5 | 34.4% | 33.8% | 38.5% | +0.6% | 2.7% | 2.7% |
| Qwen 3.5 Plus | 65.7% | 59.5% | 74.0% | +6.2% | 3.9% | 3.8% |
| Qwen 3.5 OS | 69.3% | 61.5% | 81.9% | +7.8% | 1.8% | 1.9% |
| GPT-5.2 | 83.3% | 69.4% | 83.0% | +13.9% | 4.6% | 4.8% |
| Model | L/R F1 | F/B F1 | Both F1 | Gap (L/R−F/B) | L/R Inv. | F/B Inv. |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 53.2% | 32.2% | 62.1% | +21.0% | 11.5% | 11.5% |
| Kimi K2.5 | 43.3% | 23.6% | 45.8% | +19.7% | 12.7% | 12.3% |
| Qwen 3.5 Plus | 51.3% | 37.2% | 57.2% | +14.1% | 14.2% | 14.2% |
| Qwen 3.5 OS | 44.4% | 34.6% | 56.2% | +9.8% | 18.8% | 18.9% |
| GPT-5.2 † (BS=10) | 68.8% | 65.3% | 76.5% | +3.5% | 8.2% | 8.4% |
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 95.7% | 92.4% | 95.7% | +3.3% |
| Kimi K2.5 | 69.7% | 45.0% | 63.7% | +18.7% |
| Qwen 3.5 Plus | 83.6% | 51.3% | 83.3% | +32.0% |
| Qwen 3.5 OS | 84.1% | 51.0% | 84.9% | +33.9% |
| GPT-5.2 | 92.3% | 71.6% | 92.6% | +21.0% |
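
The Δ(Top−Orbit) column is the top-view score minus the orbit average. The sketch below shows that arithmetic under one assumption the tables do not state: that "Orbit Avg" is the unweighted mean over the rotated orbit views supplied by the caller. The function name and the view keys are hypothetical.

```python
# Illustrative Orbit Avg and Delta(Top - Orbit) from per-view scores (percent).
# Assumption: the orbit average is an unweighted mean over the listed orbit views.

def orbit_summary(
    score_by_view: dict[str, float],
    orbit_views: list[str],
    top_key: str = "top",
) -> tuple[float, float]:
    """Return (orbit average, top-view score minus orbit average)."""
    if not orbit_views:
        raise ValueError("orbit_views must not be empty")
    orbit_avg = sum(score_by_view[v] for v in orbit_views) / len(orbit_views)
    return orbit_avg, score_by_view[top_key] - orbit_avg


if __name__ == "__main__":
    # Toy scores, not taken from the tables.
    scores = {"0": 95.0, "45": 80.0, "90": 70.0, "135": 80.0, "180": 90.0, "top": 90.0}
    print(orbit_summary(scores, ["45", "90", "135", "180"]))
```

A positive delta indicates that the top-down view is easier for the model than the average rotated orbit view.
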
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 88.3% | 54.5% | 88.5% | +34.0% |
| Kimi K2.5 | 45.6% | 35.9% | 30.3% | -5.6% |
| Qwen 3.5 Plus | 62.2% | 49.7% | 56.3% | +6.6% |
| Qwen 3.5 OS | 62.3% | 48.9% | 51.9% | +3.0% |
| GPT-5.2 † (BS=10) | 91.6% | 68.1% | 90.7% | +22.6% |
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 96.2% | 89.4% | 95.1% | +5.7% |
| Kimi K2.5 | 50.5% | 35.1% | 50.7% | +15.6% |
| Qwen 3.5 Plus | 88.6% | 73.1% | 89.3% | +16.2% |
| Qwen 3.5 OS | 86.7% | 70.9% | 86.3% | +15.4% |
| GPT-5.2 | 95.8% | 78.0% | 95.4% | +17.4% |
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 91.0% | 66.6% | 91.4% | +24.8% |
| Kimi K2.5 | 56.6% | 42.2% | 55.0% | +12.8% |
| Qwen 3.5 Plus | 36.8% | 28.4% | 35.4% | +7.0% |
| Qwen 3.5 OS | 63.7% | 47.4% | 59.5% | +12.1% |
| GPT-5.2 † (BS=10) | 96.1% | 81.2% | 96.0% | +14.8% |
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 98.1% | 84.1% | 93.9% | +9.8% |
| Kimi K2.5 | 46.9% | 33.3% | 42.0% | +8.7% |
| Qwen 3.5 Plus | 84.0% | 66.3% | 84.6% | +18.3% |
| Qwen 3.5 OS | 89.9% | 75.3% | 88.7% | +13.4% |
| GPT-5.2 | 96.1% | 73.9% | 92.7% | +18.8% |
| Model | Original (0°) | Orbit Avg | Top View | Δ(Top−Orbit) |
|---|---|---|---|---|
| Gemini 3.1 Pro | 66.1% | 56.6% | 72.2% | +15.6% |
| Kimi K2.5 | 47.1% | 43.9% | 46.5% | +2.6% |
| Qwen 3.5 Plus | 62.0% | 53.7% | 60.5% | +6.8% |
| Qwen 3.5 OS | 62.8% | 51.4% | 58.0% | +6.6% |
| GPT-5.2 † (BS=10) | 98.2% | 59.8% | 95.9% | +36.1% |
| Mode | BS | Plus (Closed) | OS (Open) | Δ |
|---|---|---|---|---|
| Image | 1 | 64.2% | 64.4% | -0.2% |
| Image | 20 | 53.8% | 53.0% | +0.8% |
| Text-Only | 1 | 79.3% | 77.1% | +2.2% |
| Text-Only | 20 | 31.3% | 53.2% | -21.9% |
| SceneGraph | 1 | 73.5% | 81.1% | -7.6% |
| SceneGraph | 20 | 56.5% | 55.2% | +1.3% |
| View Angle | Plus | OS | Δ |
|---|---|---|---|
| 0° | 88.6% | 86.7% | +1.9% |
| 45° | 72.1% | 69.3% | +2.8% |
| 90° ⚡ | 68.5% | 66.7% | +1.8% |
| 135° | 70.9% | 66.1% | +4.8% |
| 180° | 88.7% | 86.8% | +1.9% |
| 225° | 71.8% | 70.7% | +1.1% |
| 270° ⚡ | 68.0% | 66.9% | +1.1% |
| 315° | 71.5% | 69.4% | +2.1% |
| Top view | 89.3% | 86.3% | +3.0% |