UniG2U-Bench
1Microsoft Research Asia
2Shanghai Jiao Tong University
3Nanyang Technological University
4Fudan University
5University of Oxford
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark for studying generation-to-understanding (G2U) across seven reasoning regimes that require varying degrees of implicit or explicit visual transformation. Extensive experiments on over 30 models reveal that generation does not yield universal gains: on most tasks, unified models underperform their corresponding base VLMs, and explicit generation-before-answering inference typically degrades performance relative to Direct inference. However, structured improvements consistently emerge in spatial and illusion-sensitive subtasks, where transformation-aware representations are beneficial. Moreover, G2U gain patterns are not random: tasks with similar reasoning structures and models sharing the same architecture or (more strongly) the same base model exhibit correlated behaviors, suggesting that generation–understanding coupling induces class-consistent inductive biases rather than isolated empirical fluctuations.
We introduce UniG2U-Bench, a comprehensive benchmark designed to systematically study the Generation-to-Understanding (G2U) dynamic in multimodal unified models.
1) Isolating the G2U Dynamic:
While traditional benchmarks evaluate understanding or generation in isolation, UniG2U explicitly investigates whether and when generative capabilities benefit visual understanding. We benchmark over 30 unified models across 7 distinct reasoning regimes (e.g., Spatial Intelligence, Fine-grained Discrimination, Puzzles & Games) that require varying degrees of implicit or explicit visual transformation.
2) Direct vs. Stepwise Inference Protocols:
UniG2U evaluates models under two distinct paradigms: Direct inference (answering directly based on unified representations) and Stepwise inference (explicitly generating intermediate visual artifacts before answering). This decoupled design helps localize whether performance shifts stem from joint training objectives or the utility of intermediate visual reasoning.
3) Quantifying "Generation Helps Understanding": ΔG2U
A key methodological contribution of UniG2U is isolating G2U gains by strictly comparing unified models against their foundational pure-understanding VLMs. To rigorously assess the impact of generation, we introduce ΔG2U, defined as the performance gap between a unified multimodal model (M_UM) and its corresponding base model (M_Base).
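Written out explicitly (with Score(·) denoting a model's accuracy on a task or category, as reported in the tables below), the metric is:

```latex
\Delta_{\mathrm{G2U}} = \mathrm{Score}\left(M_{\mathrm{UM}}\right) - \mathrm{Score}\left(M_{\mathrm{Base}}\right)
```

so ΔG2U > 0 indicates that generative training helped understanding relative to the base VLM, and ΔG2U < 0 indicates it hurt.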
Key Insights:
These findings demystify the complex relationship between generation and understanding. They reveal that while generation-understanding coupling induces class-consistent inductive biases (rather than random fluctuations), current unified architectures still face critical trade-offs. The results emphasize the need for better architectural decoupling and higher intermediate visual alignment to truly achieve "Generation Helps Understanding."
Main results on UniG2U across categories. All values are reported as percentages (×100). Baseline models (marked with *) are highlighted with a light gray background. ∆ denotes the performance gap to the respective baseline. Bold marks the best score in each category within each block.
| Model | Real-world Apps | Math Reasoning | STEM | Puzzles & Games | Chart | Spatial Intel. | Perception Reasoning | Overall | ∆ |
|---|---|---|---|---|---|---|---|---|---|
| qwen2.5-vl-7b* | 40.25 | 23.00 | 34.00 | 25.73 | 86.00 | 23.40 | 39.42 | 34.45 | +0.00 |
| OmniGen2(direct) | 41.25 | 14.00 | 41.00 | 20.91 | 86.00 | 23.80 | 35.62 | 31.99 | -2.46 |
| OmniGen2(stepwise) | 41.25 | 13.00 | 41.50 | 21.91 | 79.00 | 26.40 | 34.52 | 31.87 | -2.58 |
| UniWorld-V1(stepwise) | 24.00 | 19.00 | 34.50 | 24.72 | 80.00 | 24.80 | 39.35 | 32.96 | -1.49 |
| OneCAT-3b(direct) | 39.50 | 17.00 | 32.00 | 22.05 | 76.68 | 25.20 | 34.55 | 31.15 | -3.30 |
| OneCAT-3b(stepwise) | 33.50 | 8.50 | 29.00 | 20.69 | 74.00 | 23.80 | 33.10 | 28.80 | -5.65 |
| JavisGPT(direct) | 32.00 | 21.00 | 26.50 | 23.72 | 78.60 | 21.60 | 35.52 | 30.72 | -3.73 |
| uni-video(stepwise) | 34.00 | 21.50 | 29.00 | 29.33 | 61.00 | 20.00 | 42.76 | 34.25 | -0.20 |
| UniPic2(stepwise) | 17.50 | 17.00 | 31.00 | 23.95 | 47.00 | 24.60 | 38.80 | 30.65 | -3.80 |
| STAR-7B(stepwise) | 34.00 | 14.50 | 32.50 | 23.56 | 79.00 | 25.20 | 38.56 | 32.68 | -1.77 |
| Qwen2.5-VL-3B* | 41.00 | 15.50 | 40.50 | 20.42 | 87.00 | 26.40 | 35.55 | 32.39 | +0.00 |
| UAE(direct) | 36.75 | 10.50 | 30.50 | 19.06 | 85.00 | 25.20 | 36.37 | 30.94 | -1.45 |
| UAE(stepwise) | 23.00 | 7.50 | 32.50 | 20.33 | 71.00 | 22.80 | 34.68 | 28.61 | -3.78 |
| Ovis-U1(direct) | 45.50 | 23.50 | 30.50 | 27.28 | 90.00 | 17.60 | 34.92 | 32.15 | -0.24 |
| Ovis-U1(stepwise) | 21.00 | 16.50 | 17.00 | 18.00 | 32.00 | 20.40 | 30.56 | 24.19 | -8.20 |
| yi-6B-vl* | 24.00 | 0.00 | 33.50 | 13.22 | 19.00 | 22.60 | 31.20 | 23.73 | +0.00 |
| mio(direct) | 18.00 | 0.00 | 31.00 | 12.85 | 15.00 | 23.40 | 25.49 | 20.70 | -3.03 |
| mio(stepwise) | 15.00 | 0.00 | 28.00 | 8.94 | 13.00 | 21.60 | 24.94 | 19.00 | -4.73 |
| deepseek-v2* | 11.00 | 4.50 | 22.00 | 18.02 | 74.00 | 17.40 | 35.07 | 25.86 | +0.00 |
| Janus-Pro(direct) | 27.50 | 7.50 | 19.00 | 22.49 | 29.00 | 24.20 | 35.08 | 27.39 | +1.53 |
| AIA(direct) | 13.50 | 8.50 | 31.00 | 23.95 | 33.00 | 26.60 | 36.34 | 28.65 | +2.79 |
| LLaDA-Instruct* | 27.25 | 1.50 | 21.00 | 20.83 | 9.00 | 23.40 | 13.78 | 17.05 | +0.00 |
| MMaDA(direct) | 28.75 | 5.00 | 26.50 | 16.01 | 14.00 | 25.00 | 22.74 | 21.09 | +4.04 |
| Janus-1.3B* | 13.50 | 2.50 | 29.00 | 12.10 | 50.00 | 21.80 | 23.52 | 20.36 | +0.00 |
| FUDOK(direct) | 17.50 | 1.00 | 21.00 | 9.31 | 22.00 | 29.00 | 15.52 | 16.40 | -3.96 |
| Qwen3-VL-8B* | 40.00 | 39.00 | 37.00 | 24.24 | 89.00 | 26.00 | 43.66 | 37.75 | +0.00 |
| MammothModa2(direct) | 33.50 | 0.00 | 36.50 | 12.10 | 91.00 | 21.80 | 39.11 | 29.97 | -7.78 |
| llava-hf* | 23.00 | 0.00 | 29.50 | 18.96 | 28.00 | 24.60 | 33.57 | 26.06 | +0.00 |
| Emu3(direct) | 16.50 | 1.00 | 20.50 | 13.37 | 67.00 | 23.20 | 24.78 | 21.46 | -4.60 |
| llava-onevision* | 43.00 | 10.50 | 44.00 | 24.48 | 78.00 | 25.40 | 37.13 | 33.35 | +0.00 |
| X-Omni(direct) | 33.00 | 9.50 | 29.50 | 26.24 | 83.00 | 27.00 | 35.31 | 31.63 | -1.72 |
| bagel(direct) | 38.00 | 29.00 | 37.50 | 23.77 | 92.00 | 28.40 | 39.96 | 35.84 | +2.49 |
| bagel(stepwise) | 32.75 | 30.00 | 40.50 | 31.58 | 91.00 | 29.00 | 37.29 | 36.10 | +2.75 |
| TokenFlow-XL(direct) | 32.00 | 21.00 | 26.50 | 23.72 | 78.60 | 21.60 | 35.52 | 30.72 | -2.63 |
| show-o2(direct) | 35.50 | 11.00 | 40.00 | 26.01 | 45.00 | 25.80 | 36.50 | 31.59 | -1.76 |
| show-o2(stepwise) | 33.00 | 9.00 | 40.00 | 26.39 | 35.00 | 20.40 | 28.11 | 26.59 | -6.76 |
| ILLUME+(direct) | 24.50 | 7.50 | 35.00 | 21.06 | 80.00 | 23.20 | 35.08 | 29.54 | -3.81 |
| ILLUME+(stepwise) | 20.50 | 2.00 | 35.50 | 19.53 | 58.00 | 21.00 | 34.05 | 27.13 | -6.22 |
| gpt4o | 45.25 | 27.00 | 34.50 | 30.73 | 58.00 | 36.00 | 43.73 | 38.96 | -- |
| ovis2.5 | 44.00 | 32.50 | 43.00 | 23.13 | 89.00 | 27.80 | 42.28 | 37.51 | -- |
| GPT-4o + GPT-image | 32.75 | 27.00 | 30.50 | 32.72 | 60.00 | 27.00 | 40.54 | 35.44 | -- |
| gemini pro + nano banana pro | 71.50 | 85.00 | 91.00 | 55.31 | 68.00 | 38.40 | 61.12 | 60.80 | -- |
| Qwen2.5-7b+Qwen-edit | 33.75 | 17.00 | 31.50 | 27.79 | 80.00 | 21.60 | 38.56 | 32.96 | -- |
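The ∆ column can be reproduced directly from the Overall scores: within each block, ∆ = Overall(unified model) − Overall(base VLM). A quick sanity check in Python, using values copied from the table above:

```python
# Recompute the ∆ column of the main table:
# delta = Overall(unified) - Overall(base VLM) within the same block.

def delta_g2u(unified_overall: float, base_overall: float) -> float:
    return round(unified_overall - base_overall, 2)

# Spot-checks against the reported ∆ values:
assert delta_g2u(31.99, 34.45) == -2.46  # OmniGen2(direct) vs qwen2.5-vl-7b*
assert delta_g2u(28.61, 32.39) == -3.78  # UAE(stepwise) vs Qwen2.5-VL-3B*
assert delta_g2u(36.10, 33.35) == +2.75  # bagel(stepwise) vs llava-onevision*
```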
Key Takeaways from UniG2U-Bench Analysis
Takeaway 1: On the majority of tasks, unified models perform worse than their base VL models.
Takeaway 2: Unified models show distinct capability improvements on Spatial Intelligence and Visual Illusions.
Takeaway 3: The generation-before-answering paradigm degrades performance across most logic-intensive tasks.
Takeaway 4: Generating intermediate images provides noticeable performance gains over direct output in transformation-intensive scenarios.
Takeaway 5: Task-level G2U gains are highly correlated, as tasks sharing similar cognitive demands consistently exhibit parallel performance trends when leveraging generation. Similarly, model-level behaviors are strongly driven by foundational representations, with unified models sharing identical base weights demonstrating highly parallel generation-to-understanding effects.
Takeaway 6: The effectiveness of explicit generation strictly depends on alignment fidelity; intermediate images aid comprehension in high-alignment tasks (e.g., Perception Reasoning), whereas they propagate errors in structurally constrained domains.
Absolute results on selected Illusion and Spatial subtasks. All values are reported as percentages (×100). Baseline models (marked with *) are highlighted with a light gray background. Overall is computed as the average over the 5 shown subtasks, and the value in parentheses denotes the gap to the corresponding baseline within each model block. The best Overall within each block is bolded.
| Model | Illusion: icon_shape | Illusion: in_shape | Spatial: MSR | Spatial: Attr. (Meas.) | Spatial: Motion (Cam.) | Overall (Δ) |
|---|---|---|---|---|---|---|
| llava_onevision* | 6 | 18 | 17 | 26 | 20 | 17.40 (+0.00) |
| X-Omni(direct) | 5 | 11 | 20 | 31 | 24 | 18.20 (+0.80) |
| bagel(direct) | 2 | 18 | 26 | 31 | 26 | 20.60 (+3.20) |
| bagel(GtA) | 2 | 17 | 25 | 36 | 24 | 20.80 (+3.40) |
| TokenFlow-XL(direct) | 12 | 13 | 16 | 24 | 19 | 16.80 (-0.60) |
| show-o2(direct) | 11 | 22 | 27 | 24 | 28 | 22.40 (+5.00) |
| show-o2(GtA) | 6 | 18 | 30 | 15 | 23 | 18.40 (+1.00) |
| ILLUME+(direct) | 4 | 20 | 28 | 29 | 14 | 19.00 (+1.60) |
| ILLUME+(GtA) | 14 | 16 | 29 | 28 | 16 | 20.60 (+3.20) |
Direct vs. Generate-then-Answer (GtA) accuracy on generation-friendly subtasks; each cell reports Direct / GtA. Generate-then-Answer shows consistent improvements in multi-step spatial reasoning regimes.
| Model | MSR | Maze | Sliding |
|---|---|---|---|
| OmniGen2 | 0.220 / 0.290 | 0.043 / 0.046 | 0.000 / 0.001 |
| OneCAT-3B | 0.320 / 0.390 | 0.084 / 0.074 | 0.112 / 0.131 |
| Ovis-U1 | 0.120 / 0.270 | 0.134 / 0.003 | 0.334 / 0.007 |
| mio | 0.210 / 0.320 | 0.000 / 0.000 | 0.000 / 0.000 |
| bagel | 0.260 / 0.250 | 0.021 / 0.281 | 0.009 / 0.195 |
| show-o2 | 0.270 / 0.300 | 0.142 / 0.178 | 0.137 / 0.183 |
| ILLUME+ | 0.280 / 0.290 | 0.069 / 0.129 | 0.040 / 0.000 |
Evaluation of >30 unified multimodal models on UniG2U-Bench (3,000 instances, 7 regimes, 30 subtasks).
Core findings: unified models generally underperform their base VLMs; GtA typically degrades performance; clear gains appear only in spatial, illusion-sensitive, and multi-step reasoning tasks.
Paper insight: Qwen2.5-7b (the base VLM) leads in Geometry Reasoning & Physics; unified models (OmniGen2, UniWorld-V1, OneCAT-3b, etc.) show relative strength in Real-world Applications and Spatial Intelligence.
(a) Task-level correlation
(b) Correlations among base VL models.
(c) Correlations among model architectures.
Paper finding: “Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases.”
Overall accuracy (%) for representative models.
GtA typically degrades performance
Per-task scatter: most points lie below the y=x line.
Only a few models/tasks benefit from explicit generation.
We present a unified three-way taxonomy of failure cases in generation-augmented understanding (G2U). All examples are drawn from geometry, physics, charts, and factual QA tasks. Successful generation must satisfy three criteria: (1) validity, the generated visual is geometrically and physically correct; (2) relevance, it bears a meaningful relation to the task; and (3) utility, it adds operational structure that the downstream answer can exploit.
Violations of these criteria explain why generation often harms (or fails to help) understanding. No “Wrong-to-Right” success cases were observed in the benchmark.
Three representative failure modes observed when models attempt stepwise generation before answering.
Failure Mode 1 — Invalid generation. Generated diagrams are geometrically distorted, quantitatively invalid, or physically impossible; the downstream solver inherits or amplifies these errors.
- Geometry (“Find length of BP”): the image generated by Bagel is geometrically inconsistent with the original.
- Geometry (“Dihedral angle E-AC-B = 60°”): the image generated by Bagel is physically invalid.

Failure Mode 2 — Relevant but useless. The diagram is “on-topic” yet adds zero computational benefit or missing operational structure.
- Geometry: the image generated by Bagel is correct but adds no utility.
- Maze: the image generated by GPT-4o + GPT-image contains no valid connectivity/path.

Failure Mode 3 — Irrelevant generation. The generated image has no meaningful relation to the task (chart data lost; factual QA triggers unnecessary drawing).
- ChartQA: the image generated by GPT-4o + GPT-image is not a structured chart.
- Knowledge QA (“What is the value of Czechia?”): the image generated by STAR-7B is a completely unrelated visual.
These cases demonstrate that generation alone does not guarantee improved understanding. Only when the intermediate visual satisfies validity + relevance + utility does G2U actually help.
Question: You are a unified vision-language model. You will be given: (1) a 2×2 reference image with the bottom-right cell hidden, and (2) two candidate patch images ("Candidate 0" and "Candidate 1"). Your job: - Mentally visualize placing each candidate into the bottom-right cell. - Compare which candidate yields the correct completion based on seam continuity, color/texture gradient, structural alignment, and global semantics. Output EXACTLY the following: 1) Brief analysis comparing the two candidates 2) One strict JSON object with your decision, wrapped as: <FINAL_ANSWER_JSON> {"choice": 0 or 1, "rationale": "≤30 words decisive cue"} </FINAL_ANSWER_JSON>
Ground truth: {"choice": 0}
Question: You are a precise maze solver. SEMANTICS - Black squares: walls (impassable) - White squares: path (walkable) - Blue dot: start (the agent) - Green rectangular frame: goal (reaching any white cell inside the green frame counts as success) - Legal moves: up, down, left, right only. One cell per step; no diagonals, no jumps; never cross walls. OUTPUT FORMAT 1) Briefly describe your reasoning for the path. 2) Output the final move list as a JSON array of lowercase strings, wrapped as: <ANSWER_JSON>["right","down","left"]</ANSWER_JSON>
Ground truth: ["up", "up", "right", "right", "down", "down", "down", "down"]
Question: You are a precise sliding puzzle solver. TASK - You will be given an INITIAL state of a 3x3 sliding puzzle. - The goal is to find the sequence of moves to solve the puzzle. SEMANTICS - The puzzle is a 3x3 grid with 8 colored tiles and one empty space. - The RED square represents the EMPTY space. - A "move" consists of sliding an adjacent colored tile INTO the empty (red) space. - Moves are named by the direction the COLORED TILE moves. OUTPUT FORMAT 1) Briefly describe your reasoning for the solution. 2) Output the final move list as a JSON array of lowercase strings, wrapped as: <ANSWER_JSON>["down","right","up"]</ANSWER_JSON>
Ground truth: ["left", "down", "left", "up", "up"]
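The three prompts above all ask the model to wrap its machine-readable answer in a tag pair (`<ANSWER_JSON>` or `<FINAL_ANSWER_JSON>`). A small extraction helper of the kind an evaluation script might use (this is an illustrative sketch, not the benchmark's actual grader):

```python
import json
import re

def parse_answer_json(text: str, tag: str = "ANSWER_JSON"):
    """Extract and decode the JSON payload wrapped in <TAG>...</TAG>.

    Handles both the <ANSWER_JSON> move lists and the <FINAL_ANSWER_JSON>
    choice objects used in the prompts above.
    """
    m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, flags=re.DOTALL)
    if m is None:
        raise ValueError(f"no <{tag}> block found in model output")
    return json.loads(m.group(1))

moves = parse_answer_json('reasoning... <ANSWER_JSON>["right","down"]</ANSWER_JSON>')
print(moves)  # ['right', 'down']
```

Parsed move lists can then be compared against the ground-truth sequences (exact match), and parsed choice objects against the ground-truth `"choice"` field.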
Question: What is the value of Czechia?
Question: When does the gap between Child before age 5 and neonatal become largest?
Question: How much of the total investment received by Mexico was from the manufacturing industry as of September 2020?
Question: As shown in the diagram, the base ABCD is a square with side length 2, and the semicircular surface APD is perpendicular to the base ABCD. Point P is a moving point on the arc AD. Find the cosine value of the dihedral angle P-BC-D when the volume of the tetrahedron P-BCD is maximized.
Auxiliary Line Description: Take O as the midpoint of AD, and draw Ox parallel to AB. Take O as the origin, and let the lines along Ox, OD, and OP be the x-axis, y-axis, and z-axis, respectively, then establish the three-dimensional Cartesian coordinate system O-xyz.
Ground truth: 2√5/5
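As a numerical sanity check (not part of the benchmark itself), the ground truth can be verified in the O-xyz frame described above. The volume of P-BCD is maximized when P sits at the top of the semicircle, i.e. P = (0, 0, 1), with A(0, −1, 0), D(0, 1, 0), B(2, −1, 0), C(2, 1, 0); these coordinates follow the stated auxiliary construction:

```python
import math

# Coordinates in the O-xyz frame (square side 2, P at the semicircle apex):
B, C, P = (2.0, -1.0, 0.0), (2.0, 1.0, 0.0), (0.0, 0.0, 1.0)

def sub(u, v):   return tuple(a - b for a, b in zip(u, v))
def cross(u, v): return (u[1]*v[2] - u[2]*v[1],
                         u[2]*v[0] - u[0]*v[2],
                         u[0]*v[1] - u[1]*v[0])
def norm(u):     return math.sqrt(sum(a * a for a in u))

n = cross(sub(C, B), sub(P, B))        # normal of plane PBC
cos_dihedral = abs(n[2]) / norm(n)     # angle against the base plane (normal = z-axis)
assert math.isclose(cos_dihedral, 2 * math.sqrt(5) / 5)  # = 2√5/5 ≈ 0.894
```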
Question: As shown in the diagram, in the cube ABCD – A₁B₁C₁D₁, point P is a moving point on segment AB₁ (including endpoints), and point Q is the midpoint of segment AC. Let θ be the angle between PQ and plane ACD₁. Determine the minimum value of cosθ.
Auxiliary Line Description: Take D as the origin, and let the lines along DA, DC, and DD₁ be the x-axis, y-axis, and z-axis, respectively, then establish the three-dimensional Cartesian coordinate system D-xyz.
Ground truth: 1/3
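This ground truth is likewise checkable numerically in the D-xyz frame described (taking a unit cube, an assumption since the edge length is not stated): P = A + t(B₁ − A) = (1, t, t) for t ∈ [0, 1], Q = (0.5, 0.5, 0), and plane ACD₁ is x + y + z = 1 with normal (1, 1, 1):

```python
import math

def cos_theta(t: float) -> float:
    # P on AB1 of a unit cube: P = (1, t, t); Q = midpoint of AC = (0.5, 0.5, 0).
    pq = (0.5 - 1.0, 0.5 - t, 0.0 - t)
    n = (1.0, 1.0, 1.0)  # normal of plane ACD1 (x + y + z = 1)
    dot = sum(a * b for a, b in zip(pq, n))
    # sin(theta) = |PQ . n| / (|PQ| |n|) for the angle between a line and a plane
    sin_t = abs(dot) / (math.sqrt(sum(a * a for a in pq)) * math.sqrt(3.0))
    return math.sqrt(1.0 - sin_t * sin_t)

best = min(cos_theta(i / 1000) for i in range(1001))
assert abs(best - 1 / 3) < 1e-3  # minimum cos(theta) = 1/3, attained at t = 1
```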
Question: Find tan B
Ground truth: 5
Question: This image contains an icon integrated into a background, where elements of the background contribute to forming the icon. Identify the icon that is represented in the image by choosing exclusively among the following options: Animal, Face_Emoji, Music, Sport, Stationery, Vehicle, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the icon. You have to respond with a single word.
Ground truth: Animal
Question: This image contains a logo integrated into a background, where elements of the background contribute to forming the logo. Identify the logo that is represented in the image by choosing exclusively among the following options: Adidas, Amazon, Apple, Audi, BMW, Mercedes Benz, Facebook, Google, Instagram, Mcdonalds, Nasa, Nike, Olympics, Playstation, Puma, Reebok, Spotify, Starbucks, Tesla, Telegram, Ubuntu, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the logo. You have to respond with a single word.
Ground truth: Audi
Question: This image contains a shape integrated into a background, where elements of the background contribute to forming the shape. Identify the shape that is represented in the image by choosing exclusively among the following options: Airplane, Bicycle, Bird, Bottle, Car, Cat, Dog, Dolphin, Fork, Guitar, Mug, Panda, Paper_clip, Sailboat, Scooter, Teapot, Ocean, Origami, Forest, Cloud, Sand_dune, Medieval_Village, City, Underwater_ruins, Museum, Bazaar_market, Time_square. Provide your response by stating only the single, most accurate class name that represents the shape. You have to respond with a single word.
Ground truth: Bicycle
Question: A one-euro coin is dropped from the Leaning Tower of Pisa and falls freely from rest. What is its position after 3.0 s? Ignore air resistance.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: -35m B: -44m C: -65m D: -51m
Ground truth: B
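The arithmetic behind this ground truth is one line of kinematics (free fall from rest, downward taken as negative):

```python
# Free fall from rest: y(t) = -(1/2) * g * t^2
g, t = 9.8, 3.0
y = -0.5 * g * t**2
print(round(y))  # -44, matching option B (-44 m)
```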
Question: Your cousin Throckmorton skateboards from rest down a curved, frictionless ramp. If we treat Throcky and his skateboard as a particle, he moves through a quarter-circle as shown in figure. Throcky and his skateboard have a total mass of 25.0 kg. Find his speed at the bottom of the ramp.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: 5.42m/s B: 7.67m/s C: 4.98m/s D: 9.02m/s
Ground truth: B
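This answer follows from energy conservation on a frictionless ramp, v = √(2gR); the mass cancels. The ramp radius appears only in the figure, so R = 3.00 m is an assumption here (it is the value consistent with option B):

```python
import math

g = 9.8
R = 3.00  # ramp radius in meters -- read from the figure, an assumption here
# (1/2) m v^2 = m g R  =>  v = sqrt(2 g R); mass cancels.
v = math.sqrt(2 * g * R)
print(f"{v:.2f} m/s")  # 7.67 m/s, matching option B
```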
Question: The light beam in figure strikes surface 2 at the critical angle. Determine the angle of incidence θ₁.
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: 17.5° B: 19.3° C: 27.5° D: 42.3°
Ground truth: C
Question: As a professional navigation agent, your task is to analyze a map and determine the time needed for the car and the person passing the goal.
## Game Setup
- The game presents a fully observable map. There is a person, a car, and a goal on the map.
- The game further specifies the moving direction of the person and car ("up", "down", "left", "right").
- Your goal is to determine the time needed for the car and the person passing the goal.
The following figure shows what the person, the car, and the goal look like.
We provide an example to further illustrate the rules. 
The car is moving left with speed 1.0 grid per second, and the person is moving up with speed 0.5 grid per second. In this provided example:
- The car is 2 grids away from the goal. Given its speed of 1.0 grid per second, the time needed is 2 / 1.0 = 2 seconds.
- The person is 1 grid away from the goal. Given its speed of 0.5 grid per second, the time needed is 1 / 0.5 = 2 seconds.
## Procedure and Output
Now you will answer for the following given map. To solve it, analyze the car and the person separately. Then, answer for them separately. For example:
Car: 2.0
Person: 2.0
means car and the person will need 2.0 seconds to pass the goal respectively. Do not output any extra content after the above aggregated output.
Please analyze and determine the time needed for the car and the person passing the goal:
The car is moving left with speed 1.0, and the person is moving down with speed 0.25. Answer:
Ground truth: 'gt_car': 3.0, 'gt_person': 4.0
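The scoring rule for this subtask is plain division, time = distance / speed. A minimal sketch; the distances (3 cells for the car, 1 cell for the person) are read from the map image and are assumed here to be consistent with the stated ground truth:

```python
# Time to pass the goal = distance (grid cells) / speed (cells per second).
def time_to_goal(distance_cells: float, speed: float) -> float:
    return distance_cells / speed

# Distances assumed from the map for the ground-truth instance above:
assert time_to_goal(3, 1.0) == 3.0   # car, moving left at 1.0 cells/s
assert time_to_goal(1, 0.25) == 4.0  # person, moving down at 0.25 cells/s
```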
Question: As a professional pathfinder, your task is to analyze a map and find a route from the starting location to the goal. Since coding is not within your skill set, your approach relies on logical reasoning of the map.
## Game Setup
- The game presents a fully observable map.
- The starting location is marked with blue "S", and the goal is marked with red "G".
- Your goal is to find a path from the starting location to the goal.
## Moving Rules
- The action plan involves moves in four directions: 'W' (west), 'E' (east), 'N' (north), or 'S' (south).
- Each move is along with distances. Distances are measured by how many crossroads passed.
We provide an example to further illustrate the rules.
In this provided example:
- You are now at the southwest of the goal.
- If you move north by 1 crossroad, you will be at the west of the goal.
- If you move east by 4 crossroads, you will be at the goal.
- IMPORTANT: Please ignore the name of the street and avenue. The numbers in the name cannot be used to compute how many crossroads need to be passed.
## Procedure and Output
Now you will solve the given maze. First, analyze the relative spatial relation between the starting point and the goal (for example, southwest). Then, output a path using the format <Direction>: <Number of crossroads passed>. For example:
<Output>
## North: 1
## East: 4
means move north by 1 crossroad, and move east by 4 crossroads.
<Output>
## South: 1
means move south by 1 crossroad. Do not output any extra content after the above aggregated output.
Please output path for the following map:
Ground truth: -1 1
Question: How many people are wearing a white hat?
Options:
A: 5 B: 4 C: 3 D: 6
Ground truth: B
Question: How many different houses are captured in total in these two pictures?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: 4 B: 2 C: 3 D: 5
Ground truth: D
Question: Which has a longer side perpendicular to the adjacent wall: the cabinet in Figure 1 or the table in Figure 2?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: The table in Figure 2 B: The same length C: The cabinet in Figure 1 D: Sometimes the cabinet in Figure 1, sometimes the table in Figure 2
Ground truth: A
Question: The images are taken continuously from a first-person perspective. In which direction is your viewpoint rotating?
Please directly answer the question and provide the correct OPTION LETTER ONLY, e.g., A, B, C, D.
OPTION:
A: Down B: Left C: Right D: Up
Ground truth: B
@misc{wen2026unig2ubenchunifiedmodelsadvance,
title={UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?},
author={Zimo Wen and Boxiu Li and Wanbo Zhang and Junxiang Lei and Xiaoyu Chen and Yijia Fan and Qi Zhang and Yujiang Wang and Lili Qiu and Bo Li and Ziwei Liu and Caihua Shan and Yifan Yang and Yifei Shen},
year={2026},
eprint={2603.03241},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.03241},
}
This website is adapted from Panda-70M, and all website content is licensed under Creative Commons Attribution-NonCommercial 4.0 International License.