UniG2U-Bench

Do Unified Models Advance Multimodal Understanding?


Zimo Wen2,1, Boxiu Li2,1, Wanbo Zhang4, Junxiang Lei4, Xiaoyu Chen2, Yijia Fan1, Qi Zhang, Yifan Yang1,
Caihua Shan1, Yujiang Wang5, Lili Qiu1, Bo Li3, Ziwei Liu3, Yifei Shen1*

1Microsoft Research Asia
2Shanghai Jiao Tong University
3Nanyang Technological University
4Fudan University
5University of Oxford



Paper Dataset Code






Benchmark task gallery: seven reasoning regimes with representative subtasks.

- Real-world Applications: Collision, Google Map
- Geometry: Solid Geometry, Plane Geometry
- Physics: Mechanics, Optics
- Puzzles and Games: Jigsaw, Maze
- Chart & Table Reasoning: ChartQA
- Spatial Intelligence: Multi-step Reasoning, Motion Cam
- Perception Reasoning: Illusion Icon, Illusion Logo, Algorithmic, Spatial


Abstract


Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark for studying generation-to-understanding (G2U) across seven reasoning regimes that require varying degrees of implicit or explicit visual transformation. Extensive experiments on over 30 models reveal that generation does not yield universal gains: on most tasks, unified models underperform their corresponding base VLMs, and explicit generation-before-answering inference typically degrades performance relative to Direct inference. However, structured improvements consistently emerge in spatial and illusion-sensitive subtasks, where transformation-aware representations are beneficial. Moreover, G2U gain patterns are not random: tasks with similar reasoning structures and models sharing the same architecture or (more strongly) the same base model exhibit correlated behaviors, suggesting that generation–understanding coupling induces class-consistent inductive biases rather than isolated empirical fluctuations.



Overview


We introduce UniG2U-Bench, a comprehensive benchmark designed to systematically study the Generation-to-Understanding (G2U) dynamic in multimodal unified models.

1) Isolating the G2U Dynamic:
While traditional benchmarks evaluate understanding or generation in isolation, UniG2U explicitly investigates whether and when generative capabilities benefit visual understanding. We benchmark over 30 unified models across 7 distinct reasoning regimes (e.g., Spatial Intelligence, Fine-grained Discrimination, Puzzles & Games) that require varying degrees of implicit or explicit visual transformation.

2) Direct vs. Stepwise Inference Protocols:
UniG2U evaluates models under two distinct paradigms: Direct inference (answering directly based on unified representations) and Stepwise inference (explicitly generating intermediate visual artifacts before answering). This decoupled design helps localize whether performance shifts stem from joint training objectives or the utility of intermediate visual reasoning.
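A minimal sketch of the two protocols, assuming a hypothetical unified-model interface with answer() and generate() methods (the names are illustrative, not any specific model's actual API):

def direct_inference(model, image, question):
    # Direct: answer straight from the unified representation.
    return model.answer(images=[image], question=question)

def stepwise_inference(model, image, question):
    # Stepwise: first generate an intermediate visual artifact (e.g., an
    # auxiliary diagram or a transformed view), then answer conditioned
    # on both the original and the generated image.
    intermediate = model.generate(image=image, instruction=question)
    return model.answer(images=[image, intermediate], question=question)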

3) Quantifying "Generation Helps Understanding": ΔG2U
A key methodological contribution of UniG2U is isolating G2U gains by strictly comparing unified models against their foundational pure-understanding VLMs. To rigorously assess the impact of generation, we introduce ΔG2U, defined as the performance gap between a unified multimodal model (M_UM) and its corresponding base model (M_Base):

ΔG2U = Perf(M_UM) − Perf(M_Base)
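For illustration, a minimal sketch of this computation over per-category accuracies (category names and scores below are made up, not benchmark results; the leaderboard's Overall may additionally weight categories by instance counts):

def delta_g2u(perf_unified: dict[str, float], perf_base: dict[str, float]) -> float:
    # G2U offset: Overall of the unified model minus Overall of its base VLM.
    # Scores are accuracies in percent; Overall is a plain category average here.
    overall_um = sum(perf_unified.values()) / len(perf_unified)
    overall_base = sum(perf_base.values()) / len(perf_base)
    return overall_um - overall_base

perf_base = {"spatial": 23.4, "chart": 86.0, "stem": 34.0}   # M_Base (illustrative)
perf_um   = {"spatial": 26.4, "chart": 79.0, "stem": 41.0}   # M_UM (illustrative)
print(f"Delta_G2U = {delta_g2u(perf_um, perf_base):+.2f}")   # prints +1.00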



Key Insights:

  • 1) No Universal Gains (The Interference Dilemma):
    Simply integrating generative capabilities does not yield universal improvements. On most standard tasks, unified models actually underperform their corresponding base VLMs due to inherent objective interference between generative and discriminative training.
  • 2) Transformation-Aware Tasks Benefit (Structured Improvements):
    Despite overall drops, explicit generation yields consistent and distinct capability improvements on tasks that inherently require visual simulation or transformation, such as Spatial Intelligence and illusion-sensitive subtasks.
  • 3) The Double-Edged Sword of Stepwise Reasoning:
    The explicit generation-before-answering (stepwise) paradigm typically degrades performance relative to direct inference when the generated intermediate artifacts lack high alignment fidelity, exposing the vulnerability of current stepwise visual reasoning.

These findings demystify the complex relationship between generation and understanding. They reveal that while generation-understanding coupling induces class-consistent inductive biases (rather than random fluctuations), current unified architectures still face critical trade-offs. The results emphasize the need for better architectural decoupling and higher intermediate visual alignment to truly achieve "Generation Helps Understanding."


UniG2U Leaderboard

Main results on UniG2U across categories. All values are reported as percentages (×100). Baseline models (marked with *) are highlighted with a light gray background. ∆ denotes the performance gap to the respective baseline. Bold marks the best score in each category within each block.

Model | Real-world Apps | Math Reasoning | STEM | Puzzles & Games | Chart | Spatial Intel. | Perception Reasoning | Overall | ∆
qwen2.5-vl-7b* 40.25 23.00 34.00 25.73 86.00 23.40 39.42 34.45 | +0.00
OmniGen2(direct) 41.25 14.00 41.00 20.91 86.00 23.80 35.62 31.99 | -2.46
OmniGen2(stepwise) 41.25 13.00 41.50 21.91 79.00 26.40 34.52 31.87 | -2.58
UniWorld-V1(stepwise) 24.00 19.00 34.50 24.72 80.00 24.80 39.35 32.96 | -1.49
OneCAT-3b(direct) 39.50 17.00 32.00 22.05 76.68 25.20 34.55 31.15 | -3.30
OneCAT-3b(stepwise) 33.50 8.50 29.00 20.69 74.00 23.80 33.10 28.80 | -5.65
JavisGPT(direct) 32.00 21.00 26.50 23.72 78.60 21.60 35.52 30.72 | -3.73
uni-video(stepwise) 34.00 21.50 29.00 29.33 61.00 20.00 42.76 34.25 | -0.20
UniPic2(stepwise) 17.50 17.00 31.00 23.95 47.00 24.60 38.80 30.65 | -3.80
STAR-7B(stepwise) 34.00 14.50 32.50 23.56 79.00 25.20 38.56 32.68 | -1.77
Qwen2.5-VL-3B* 41.00 15.50 40.50 20.42 87.00 26.40 35.55 32.39 | +0.00
UAE(direct) 36.75 10.50 30.50 19.06 85.00 25.20 36.37 30.94 | -1.45
UAE(stepwise) 23.00 7.50 32.50 20.33 71.00 22.80 34.68 28.61 | -3.78
Ovis-U1(direct) 45.50 23.50 30.50 27.28 90.00 17.60 34.92 32.15 | -0.24
Ovis-U1(stepwise) 21.00 16.50 17.00 18.00 32.00 20.40 30.56 24.19 | -8.20
yi-6B-vl* 24.00 0.00 33.50 13.22 19.00 22.60 31.20 23.73 | +0.00
mio(direct) 18.00 0.00 31.00 12.85 15.00 23.40 25.49 20.70 | -3.03
mio(stepwise) 15.00 0.00 28.00 8.94 13.00 21.60 24.94 19.00 | -4.73
deepseek-v2* 11.00 4.50 22.00 18.02 74.00 17.40 35.07 25.86 | +0.00
Janus-Pro(direct) 27.50 7.50 19.00 22.49 29.00 24.20 35.08 27.39 | +1.53
AIA(direct) 13.50 8.50 31.00 23.95 33.00 26.60 36.34 28.65 | +2.79
LLaDA-Instruct* 27.25 1.50 21.00 20.83 9.00 23.40 13.78 17.05 | +0.00
MMaDA(direct) 28.75 5.00 26.50 16.01 14.00 25.00 22.74 21.09 | +4.04
Janus-1.3B* 13.50 2.50 29.00 12.10 50.00 21.80 23.52 20.36 | +0.00
FUDOK(direct) 17.50 1.00 21.00 9.31 22.00 29.00 15.52 16.40 | -3.96
Qwen3-VL-8B* 40.00 39.00 37.00 24.24 89.00 26.00 43.66 37.75 | +0.00
MammothModa2(direct) 33.50 0.00 36.50 12.10 91.00 21.80 39.11 29.97 | -7.78
llava-hf* 23.00 0.00 29.50 18.96 28.00 24.60 33.57 26.06 | +0.00
Emu3(direct) 16.50 1.00 20.50 13.37 67.00 23.20 24.78 21.46 | -4.60
llava-onevision* 43.00 10.50 44.00 24.48 78.00 25.40 37.13 33.35 | +0.00
X-Omni(direct) 33.00 9.50 29.50 26.24 83.00 27.00 35.31 31.63 | -1.72
bagel(direct) 38.00 29.00 37.50 23.77 92.00 28.40 39.96 35.84 | +2.49
bagel(stepwise) 32.75 30.00 40.50 31.58 91.00 29.00 37.29 36.10 | +2.75
TokenFlow-XL(direct) 32.00 21.00 26.50 23.72 78.60 21.60 35.52 30.72 | -2.63
show-o2(direct) 35.50 11.00 40.00 26.01 45.00 25.80 36.50 31.59 | -1.76
show-o2(stepwise) 33.00 9.00 40.00 26.39 35.00 20.40 28.11 26.59 | -6.76
ILLUME+(direct) 24.50 7.50 35.00 21.06 80.00 23.20 35.08 29.54 | -3.81
ILLUME+(stepwise) 20.50 2.00 35.50 19.53 58.00 21.00 34.05 27.13 | -6.22
gpt4o 45.25 27.00 34.50 30.73 58.00 36.00 43.73 38.96 | --
ovis2.5 44.00 32.50 43.00 23.13 89.00 27.80 42.28 37.51 | --
GPT-4o + GPT-image 32.75 27.00 30.50 32.72 60.00 27.00 40.54 35.44 | --
gemini pro + nano banana pro 71.50 85.00 91.00 55.31 68.00 38.40 61.12 60.80 | --
Qwen2.5-7b+Qwen-edit 33.75 17.00 31.50 27.79 80.00 21.60 38.56 32.96 | --

Experiment Analysis


Key Takeaways from UniG2U-Bench Analysis

Takeaway 1: On the majority of tasks, unified models perform worse than their base VL models.

Takeaway 2: Unified models show distinct capability improvements on Spatial Intelligence and Visual Illusions.

Takeaway 3: The generation-before-answering paradigm degrades performance across most logic-intensive tasks.

Takeaway 4: Generating intermediate images provides noticeable performance gains over direct output in transformation-intensive scenarios.

Takeaway 5: Task-level G2U gains are highly correlated: tasks sharing similar cognitive demands exhibit parallel performance trends when leveraging generation. Model-level behaviors are likewise driven by foundational representations: unified models sharing identical base weights show highly parallel generation-to-understanding effects.

Takeaway 6: The effectiveness of explicit generation strictly depends on alignment fidelity; intermediate images aid comprehension in high-alignment tasks (e.g., Perception Reasoning), whereas they propagate errors in structurally constrained domains.

Absolute results on selected Illusion and Spatial subtasks. All values are reported as percentages (×100). Baseline models (marked with *) are highlighted with a light gray background. GtA denotes the Generate-then-Answer (stepwise) protocol. Overall is computed as the average over the 5 shown subtasks, and the value in parentheses denotes the gap to the corresponding baseline within each model block. The best Overall within each block is bolded.

Model | Illusion: icon_shape | Illusion: in_shape | Spatial: MSR | Spatial: Attr. (Meas.) | Spatial: Motion (Cam.) | Overall (Δ)
llava-onevision* 6 18 17 26 20 17.40 (+0.00)
X-Omni(direct) 5 11 20 31 24 18.20 (+0.80)
bagel(direct) 2 18 26 31 26 20.60 (+3.20)
bagel(GtA) 2 17 25 36 24 20.80 (+3.40)
TokenFlow-XL(direct) 12 13 16 24 19 16.80 (-0.60)
show-o2(direct) 11 22 27 24 28 22.40 (+5.00)
show-o2(GtA) 6 18 30 15 23 18.40 (+1.00)
ILLUME+(direct) 4 20 28 29 14 19.00 (+1.60)
ILLUME+(GtA) 14 16 29 28 16 20.60 (+3.20)


Direct vs. Generate-then-Answer (GtA) accuracy on generation-friendly subtasks; each cell reports Direct / GtA. Generate-then-Answer shows consistent improvements in multi-step spatial reasoning regimes.

Model MSR Maze Sliding
OmniGen2 0.220 / 0.290 0.043 / 0.046 0.000 / 0.001
OneCAT-3B 0.320 / 0.390 0.084 / 0.074 0.112 / 0.131
Ovis-U1 0.120 / 0.270 0.134 / 0.003 0.334 / 0.007
mio 0.210 / 0.320 0.000 / 0.000 0.000 / 0.000
bagel 0.260 / 0.250 0.021 / 0.281 0.009 / 0.195
show-o2 0.270 / 0.300 0.142 / 0.178 0.137 / 0.183
ILLUME+ 0.280 / 0.290 0.069 / 0.129 0.040 / 0.000

Statistics

Evaluation of >30 unified multimodal models on UniG2U-Bench (3,000 instances, 7 regimes, 30 subtasks).
Core findings: unified models generally underperform base VLMs; GtA typically degrades performance; clear gains appear only in spatial/illusion/multi-step reasoning.

Model Performance Radar Chart (GtA Protocol)


Paper insight: Qwen2.5-7b (base VLM) leads in Geometry Reasoning & Physics; unified models (OmniGen2, UniWorld-V1, OneCAT-3b, etc.) show relative strength in Real-world Applications and Spatial Intelligence.

G2U Offset (Δ) Correlation Heatmaps

(a) Task-level correlation. (b) Correlations among base VL models. (c) Correlations among model architectures.

Paper finding: “Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases.”

Direct Inference vs. Generate-then-Answer (GtA)

Overall accuracy (%) for representative models: GtA typically degrades performance relative to Direct.

Per-task scatter: most points lie below the y = x line; only a few models/tasks benefit from explicit generation.


Error Analysis: Failure Case Taxonomy


We present a unified three-way taxonomy of failure cases in generation-augmented understanding (G2U). All examples are drawn from geometry, physics, charts, and factual QA tasks. Successful generation must satisfy three criteria:

1) Validity: the generated image is geometrically, physically, and quantitatively correct.
2) Relevance: the generated image bears a meaningful relation to the task.
3) Utility: the generated image adds operational structure that the downstream solver can actually use.

Violations of these criteria explain why generation often harms (or fails to help) understanding; each failure category below violates at least one of them (see the sketch after this paragraph). No “Wrong-to-Right” success cases were observed in the benchmark.
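A minimal sketch of how the three criteria induce the taxonomy (the field and function names are ours, for illustration only):

from dataclasses import dataclass

@dataclass
class GeneratedArtifact:
    valid: bool     # geometrically / physically / quantitatively correct
    relevant: bool  # meaningfully related to the task
    useful: bool    # adds operational structure the downstream solver can use

def classify(a: GeneratedArtifact) -> str:
    # Irrelevance dominates; an on-topic but invalid image is a capability
    # failure; a valid, on-topic image can still add no utility.
    if not a.relevant:
        return "Category III: Irrelevant Generation"
    if not a.valid:
        return "Category I: Capability Failure"
    if not a.useful:
        return "Category II: Surface-Relevant but Non-Utility Generation"
    return "Success: validity + relevance + utility all hold"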


Error Examples


Three representative failure modes observed when models attempt stepwise generation before answering.

Category I: Capability Failure

Generated diagrams are geometrically distorted, quantitatively invalid, or physically impossible. Downstream solver inherits or amplifies these errors.

Example 1. Original: find the length of BP. Generated (Bagel): geometrically inconsistent.

Example 2. Original: dihedral angle E-AC-B = 60°. Generated (Bagel): physically invalid.

Category II: Surface-Relevant but Non-Utility Generation

Diagram is “on-topic” yet adds zero computational benefit or missing operational structure.

Example 1. Original (Geometry). Generated (Bagel): correct but adds no utility.

Example 2. Original (Maze). Generated (GPT-4o + GPT-image): no valid connectivity/path.

Category III: Irrelevant Generation

Generated image has no meaningful relation to the task (chart data lost, factual QA triggers unnecessary drawing).

Example 1. Original (ChartQA). Generated (GPT-4o + GPT-image): not a structured chart.

Example 2. Original (Knowledge QA): “What is the value of Czechia?” Generated (STAR-7B): completely unrelated visual.

These cases demonstrate that generation alone does not guarantee improved understanding. Only when the intermediate visual satisfies validity + relevance + utility does G2U actually help.


Examples - Puzzles and Games

Jigsaw

Question: You are a unified vision-language model. You will be given: (1) a 2×2 reference image with the bottom-right cell hidden, and (2) two candidate patch images ("Candidate 0" and "Candidate 1"). Your job: - Mentally visualize placing each candidate into the bottom-right cell. - Compare which candidate yields the correct completion based on seam continuity, color/texture gradient, structural alignment, and global semantics. Output EXACTLY the following: 1) Brief analysis comparing the two candidates 2) One strict JSON object with your decision, wrapped as: <FINAL_ANSWER_JSON> {"choice": 0 or 1, "rationale": "≤30 words decisive cue"} </FINAL_ANSWER_JSON>

Ground truth: {"choice": 0}
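This prompt and the two below require the final answer inside sentinel tags (<FINAL_ANSWER_JSON> here, <ANSWER_JSON> for Maze and Sliding). A minimal extraction sketch (the tag names come from the prompts; the helper itself is ours):

import json
import re

def extract_tagged_json(response: str, tag: str):
    # Pull the JSON payload out of <TAG>...</TAG>, as the prompts require.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    return json.loads(match.group(1)) if match else None

reply = 'Candidate 0 continues the seam. <FINAL_ANSWER_JSON>{"choice": 0, "rationale": "seam continuity"}</FINAL_ANSWER_JSON>'
print(extract_tagged_json(reply, "FINAL_ANSWER_JSON"))  # {'choice': 0, 'rationale': 'seam continuity'}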

Maze

Question: You are a precise maze solver. SEMANTICS - Black squares: walls (impassable) - White squares: path (walkable) - Blue dot: start (the agent) - Green rectangular frame: goal (reaching any white cell inside the green frame counts as success) - Legal moves: up, down, left, right only. One cell per step; no diagonals, no jumps; never cross walls. OUTPUT FORMAT 1) Briefly describe your reasoning for the path. 2) Output the final move list as a JSON array of lowercase strings, wrapped as: <ANSWER_JSON>["right","down","left"]</ANSWER_JSON>

Ground truth: ["up", "up", "right", "right", "down", "down", "down", "down"]

Sliding

Question: You are a precise sliding puzzle solver. TASK - You will be given an INITIAL state of a 3x3 sliding puzzle. - The goal is to find the sequence of moves to solve the puzzle. SEMANTICS - The puzzle is a 3x3 grid with 8 colored tiles and one empty space. - The RED square represents the EMPTY space. - A "move" consists of sliding an adjacent colored tile INTO the empty (red) space. - Moves are named by the direction the COLORED TILE moves. OUTPUT FORMAT 1) Briefly describe your reasoning for the solution. 2) Output the final move list as a JSON array of lowercase strings, wrapped as: <ANSWER_JSON>["down","right","up"]</ANSWER_JSON>

Ground truth: ["left", "down", "left", "up", "up"]


Examples - Chart & Table Reasoning

ChartQA

Question: What is the value of Czechia?

ChartQA

Question: When does the gap between Child before age 5 and neonatal become largest?

ChartQA

Question: How much of the total investment received by Mexico was from the manufacturing industry as of September 2020?


Examples - Geometry

Auxsolidmath

Question: As shown in the diagram, the base ABCD is a square with side length 2, and the semicircular surface APD is perpendicular to the base ABCD. Point P is a moving point on the arc AD. Find the cosine value of the dihedral angle P-BC-D when the volume of the tetrahedron P-BCD is maximized.

Auxiliary Line Description: Take O as the midpoint of AD, and draw Ox parallel to AB. Take O as the origin, and let the lines along Ox, OD, and OP be the x-axis, y-axis, and z-axis, respectively, then establish the three-dimensional Cartesian coordinate system O-xyz.

Ground truth: 2√5/5

Auxsolidmath

Question: As shown in the diagram, in the cube ABCD – A₁B₁C₁D₁, point P is a moving point on segment AB₁ (including endpoints), and point Q is the midpoint of segment AC. Let θ be the angle between PQ and plane ACD₁. Determine the minimum value of cosθ.

Auxiliary Line Description: Take D as the origin, and let the lines along DA, DC, and DD₁ be the x-axis, y-axis, and z-axis, respectively, then establish the three-dimensional Cartesian coordinate system D-xyz.

Ground truth: 1/3

Geometry3k

Question: Find tan B

Geometry Diagram

Ground truth: 5


Examples - Perception Reasoning

Icon

Question: This image contains an icon integrated into a background, where elements of the background contribute to forming the icon. Identify the icon that is represented in the image by choosing exclusively among the following options: Animal, Face_Emoji, Music, Sport, Stationery, Vehicle, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the icon. You have to respond with a single word.

Ground truth: Animal

Logo

Question: This image contains a logo integrated into a background, where elements of the background contribute to forming the logo. Identify the logo that is represented in the image by choosing exclusively among the following options: Adidas, Amazon, Apple, Audi, BMW, Mercedes Benz, Facebook, Google, Instagram, Mcdonalds, Nasa, Nike, Olympics, Playstation, Puma, Reebok, Spotify, Starbucks, Tesla, Telegram, Ubuntu, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the logo. You have to respond with a single word.

Ground truth: Audi

In

Question: This image contains an icon integrated into a background, where elements of the background contribute to forming the icon. Identify the shape that is represented in the image by choosing exclusively among the following options: Airplane, Bicycle, Bird, Bottle, Car, Cat, Dog, Dolphin, Fork, Guitar, Mug, Panda, Paper_clip, Sailboat, Scooter, Teapot, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the icon. You have to respond with a single word.

Ground truth: Bicycle


Examples - Physics

Mechanics

Question: A one-euro coin is dropped from the Leaning Tower of Pisa and falls freely from rest. What is its position after 3.0 s? Ignore air resistance.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: -35m B: -44m C: -65m D: -51m

Ground truth: B
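For reference, the expected arithmetic is a single line of free-fall kinematics (our worked check, taking g ≈ 9.8 m/s² and downward as negative):

y = −(1/2) g t² = −(1/2)(9.8 m/s²)(3.0 s)² ≈ −44 m, i.e., option B.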

Mechanics

Question: Your cousin Throckmorton skateboards from rest down a curved, frictionless ramp. If we treat Throcky and his skateboard as a particle, he moves through a quarter-circle as shown in figure. Throcky and his skateboard have a total mass of 25.0 kg. Find his speed at the bottom of the ramp.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: 5.42m/s B: 7.67m/s C: 4.98m/s D: 9.02m/s

Ground truth: B
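For reference, energy conservation on a frictionless ramp gives v = √(2gR), independent of the 25.0 kg mass. With the quarter-circle radius R = 3.00 m (our assumption; the figure is not reproduced here, and this is the radius consistent with the answer):

v = √(2 · 9.8 m/s² · 3.00 m) ≈ 7.67 m/s, i.e., option B.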

Optics

Question: The light beam in figure strikes surface 2 at the critical angle. Determine the angle of incidence θ₁.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: 17.5° B: 19.3° C: 27.5° D: 42.3°

Ground truth: C


Examples - Real-world Applications

Vsp Collision

Question: As a professional navigation agent, your task is to analyze a map and determine the time needed for the car and the person passing the goal.

## Game Setup
- The game presents a fully observable map. There is a person, a car, and a goal on the map.
- The game further specifies the moving direction of the person and car ("up", "down", "left", "right").
- Your goal is to determine the time needed for the car and the person passing the goal.

The following figure shows what the player, the car, and the goal look like. Icon Image We provide an example to further illustrate the rules. Example Image

The car is moving left with speed 1.0 grid per second, and the person is moving up with speed 0.5 grid per second. In this provided example:
- The car is 2 grids away from the goal. Given its speed of 1.0 grid per second, the time needed is 2 / 1.0 = 2 seconds.
- The person is 1 grid away from the goal. Given its speed of 0.5 grid per second, the time needed is 1 / 0.5 = 2 seconds.

## Procedure and Output
Now you will answer for the following given map. To solve it, analyze the car and the person separately. Then, answer for them separately. For example:
Car: 2.0
Person: 2.0
means car and the person will need 2.0 seconds to pass the goal respectively. Do not output any extra content after the above aggregated output.

Please analyze and determine the time needed for the car and the person passing the goal:

The car is moving left with speed 1.0, and the person is moving down with speed 0.25. Answer:

Perception Map

Ground truth: 'gt_car': 3.0, 'gt_person': 4.0
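The task reduces to per-agent distance/speed arithmetic; a minimal sketch (the grid distances below are inferred from the ground truth, since the map image is not reproduced here):

def time_to_goal(distance_grids: float, speed_grids_per_s: float) -> float:
    # Time for an agent to pass the goal, in seconds.
    return distance_grids / speed_grids_per_s

print(f"Car: {time_to_goal(3, 1.0)}")      # 3.0  (matches gt_car)
print(f"Person: {time_to_goal(1, 0.25)}")  # 4.0  (matches gt_person)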

Vsp Google Map

Question: As a professional pathfinder, your task is to analyze a map and find a route from the starting location to the goal. Since coding is not within your skill set, your approach relies on logical reasoning of the map.

## Game Setup
- The game presents a fully observable map.
- The starting location is marked with blue "S", and the goal is marked with red "G".
- Your goal is to find a path from the starting location to the goal.

## Moving Rules
- The action plan involves moves in four directions: 'W' (West), 'E' (east), 'N' (north), or 'S' (south).
- Each move is along with distances. Distances are measured by how many crossroads passed.

We provide an example to further illustrate the rules.
Example Map
In this provided example:
- You are now at the southwest of the goal.
- If you move north by 1 crossroad, you will be at the west of the goal.
- If you move east by 4 crossroads, you will be at the goal.
- IMPORTANT: Please ignore the name of the street and avenue. The numbers in the name cannot be used to compute how many crossroads need to be passed.

## Procedure and Output
Now you will solve the given maze. To analyze the relative spatial relation between the starting point and the goal (for example, southwest). Then, output a path using the format <Direction>: <Number of crossroads passed>. For example:
<Output>
## North: 1
## East: 4
means move north by 1 crossroad, and move east by 4 crossroads.
<Output>
## South: 1
means move south by 1 crossroad. Do not output any extra content after the above aggregated output.

Please output path for the following map:

Comprehension Map

Ground truth: -1 1

Attentional Focusing

Question: How many people are wearing a white hat?

Attentional Focusing

Options:
A: 5 B: 4 C: 3 D: 6

Ground truth: B


Examples - Spatial Intelligence

Attribute(Appr)

Question: How many different houses are captured in total in these two pictures?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: 4 B: 2 C: 3 D: 5

Ground truth: D

Attribute(Meas)

Question: Which has a longer side perpendicular to the adjacent wall: the cabinet in Figure 1 or the table in Figure 2?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: The table in Figure 2 B: The same length C: The cabinet in Figure 1 D: Sometimes the cabinet in Figure 1, sometimes the table in Figure 2

Ground truth: A

Motion(Cam)

Question: The images are taken continuously from a first-person perspective. In which direction is your viewpoint rotating?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.

OPTION:
A: Down B: Left C: Right D: Up

Ground truth: B


Citation

@misc{wen2026unig2ubenchunifiedmodelsadvance,
      title={UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?}, 
      author={Zimo Wen and Boxiu Li and Wanbo Zhang and Junxiang Lei and Xiaoyu Chen and Yijia Fan and Qi Zhang and Yujiang Wang and Lili Qiu and Bo Li and Ziwei Liu and Caihua Shan and Yifan Yang and Yifei Shen},
      year={2026},
      eprint={2603.03241},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.03241}, 
}
			


This website is adapted from Panda-70M, and all website content is licensed under Creative Commons Attribution-NonCommercial 4.0 International License.