WildTableBench

Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Junzhe Huang¹, Xiaoxiao Sun², Yan Yang³, Yuxuan Hou⁴, Ruotian Zhang⁴, Sirui Li⁵, Hehe Fan⁴, Serena Yeung-Levy², Xin Yu⁶

¹University of Queensland · ²Stanford University · ³Australian National University · ⁴Zhejiang University · ⁵Murdoch University · ⁶University of Adelaide

Benchmark Example

[Figure: WildTableBench teaser example, a train schedule table]

Question: A person plans to experience the scenic train ride after 5:00 PM but before 7:00 PM. How many days fit this schedule in 2025?

Answer: 26

🟢 Gemini-3-Pro	✗ 18
⚫ GPT-5.2	✗ 2
🟤 Claude-Opus-4.6	✗ 4

Tags: Transportation · Color-based Counting · Multi-hop Reasoning · Visual Parsing

All three frontier models answer incorrectly, illustrating the difficulty of in-the-wild table understanding.
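This page does not describe how answers are scored. As a minimal sketch, assuming exact match on lightly normalized answer strings (an assumption, not the benchmark's documented protocol), the teaser example could be checked as follows. The helper names `normalize` and `is_correct` are hypothetical.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and drop thousands separators (assumed normalization)."""
    return answer.strip().lower().replace(",", "")


def is_correct(prediction: str, gold: str) -> bool:
    """Exact match after normalization; the real scoring rule may differ."""
    return normalize(prediction) == normalize(gold)


# Gold answer and model outputs copied from the teaser example above.
gold = "26"
predictions = {"Gemini-3-Pro": "18", "GPT-5.2": "2", "Claude-Opus-4.6": "4"}

# Per-model correctness and accuracy (%) on this single question.
results = {model: is_correct(pred, gold) for model, pred in predictions.items()}
accuracy = 100.0 * sum(results.values()) / len(results)
print(results, accuracy)
```

Under this assumed rule, none of the three predictions matches the gold answer.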

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored.

We introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories.

We evaluate 21 frontier proprietary and open-source multimodal foundation models. Only one model exceeds 50% accuracy; the remaining 20 score between 4.1% and 49.9%, revealing persistent weaknesses in structural perception and numerical reasoning.

Question Taxonomy

C1 · Cell-Level

C1-T Transcription
C1-L Cell Locating
C1-S Semantic Lookup
C1-F Excel Formula

C2 · Numerical

C2-B Basic Numerical
C2-R Ranking
C2-C Conditional Numerical
C2-M Multi-step Conditional

C3 · Verification

C3-V Value Verification
C3-A Aggregate Verification
C3-C Conditional Verification

C4 · Hypothetical

C4-R Row Operation
C4-M Value Modification
C4-H Hypothetical Condition

C5 · Color

C5-I Color Identification
C5-C Color-based Counting
C5-R Color-based Reasoning
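For analysis scripts, the taxonomy above can be held in a small lookup table. The sketch below is one possible encoding; the variable and function names are illustrative, and only the codes and labels come from this page.

```python
# Category -> {subtype code: subtype name}, copied from the taxonomy above.
TAXONOMY = {
    "C1 Cell-Level": {
        "C1-T": "Transcription",
        "C1-L": "Cell Locating",
        "C1-S": "Semantic Lookup",
        "C1-F": "Excel Formula",
    },
    "C2 Numerical": {
        "C2-B": "Basic Numerical",
        "C2-R": "Ranking",
        "C2-C": "Conditional Numerical",
        "C2-M": "Multi-step Conditional",
    },
    "C3 Verification": {
        "C3-V": "Value Verification",
        "C3-A": "Aggregate Verification",
        "C3-C": "Conditional Verification",
    },
    "C4 Hypothetical": {
        "C4-R": "Row Operation",
        "C4-M": "Value Modification",
        "C4-H": "Hypothetical Condition",
    },
    "C5 Color": {
        "C5-I": "Color Identification",
        "C5-C": "Color-based Counting",
        "C5-R": "Color-based Reasoning",
    },
}


def subtype_name(code: str) -> str:
    """Look up a subtype label such as 'Color-based Counting' from its code."""
    for subtypes in TAXONOMY.values():
        if code in subtypes:
            return subtypes[code]
    raise KeyError(f"unknown subtype code: {code}")


num_categories = len(TAXONOMY)
num_subtypes = sum(len(subtypes) for subtypes in TAXONOMY.values())
print(num_categories, num_subtypes, subtype_name("C5-C"))
```

The counts agree with the abstract: five categories and 17 subtypes.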

Leaderboard

Results on WildTableBench (928 questions, 402 images). Best score per column in bold.

Model types: Proprietary, Open · Thinking, Open · Instruct

Rank	Model	Type	C1 Cell	C2 Num.	C3 Verif.	C4 Hypo.	C5 Color	Overall
🥇 1	Gemini-3-Pro	Proprietary	62.8	71.9	71.6	75.3	55.8	67.9
🥈 2	Kimi-K2.5	Open · Thinking	47.6	50.6	67.2	55.1	37.5	49.9
🥉 3	Gemini-3-Flash	Proprietary	51.0	49.7	58.2	55.1	37.5	49.4
4	Seed-2.0-Pro	Proprietary	59.3	44.9	68.7	44.9	32.5	47.8
5	GPT-5.2	Proprietary	53.1	45.8	56.7	55.1	29.2	46.6
6	Claude-Opus-4.6	Proprietary	51.7	46.7	61.2	53.9	20.8	45.7
7	Claude-Sonnet-4.6	Proprietary	35.9	34.1	49.3	49.4	17.5	35.0
8	Qwen3-VL-235B-Thinking	Open · Thinking	37.9	33.5	41.8	39.3	26.7	34.7
9	Qwen3-VL-32B-Thinking	Open · Thinking	28.3	27.8	40.3	20.2	28.3	28.2
10	GPT-5-mini	Proprietary	20.7	24.3	41.8	33.7	18.3	25.3
11	Qwen3-VL-235B-Instruct	Open · Instruct	22.8	26.6	34.3	21.3	21.7	25.2
12	Qwen3-VL-32B-Instruct	Open · Instruct	24.8	22.5	41.8	27.0	18.3	24.5
13	GLM-4.6V	Open · Thinking	17.9	24.0	34.3	27.0	25.8	24.4
14	GPT-o3	Proprietary	17.2	11.4	32.8	10.1	15.0	14.8
15	Qwen3-VL-8B-Thinking	Open · Thinking	9.0	15.0	28.4	16.9	11.7	14.7
16	Qwen3-VL-8B-Instruct	Open · Instruct	10.3	8.7	17.9	9.0	9.2	9.9
17	Qwen3-VL-4B-Instruct	Open · Instruct	8.3	5.4	16.4	6.7	10.8	7.9
18	Qwen3-VL-4B-Thinking	Open · Thinking	4.8	5.4	25.4	5.6	9.2	7.7
19	Qwen3-VL-2B-Instruct	Open · Instruct	4.1	3.3	22.4	2.2	8.3	5.8
20	GPT-4o	Proprietary	4.1	3.9	23.9	0.0	6.7	5.7
21	Qwen3-VL-2B-Thinking	Open · Thinking	4.1	2.4	14.9	1.1	5.0	4.1
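One comparison the leaderboard supports directly is the Thinking versus Instruct variant of each Qwen3-VL size. The sketch below copies the Overall scores from the table and computes the difference for each size; the variable names are illustrative.

```python
# Overall accuracy (%) copied from the leaderboard, keyed by model size.
qwen3_vl_overall = {
    "235B": {"thinking": 34.7, "instruct": 25.2},
    "32B": {"thinking": 28.2, "instruct": 24.5},
    "8B": {"thinking": 14.7, "instruct": 9.9},
    "4B": {"thinking": 7.7, "instruct": 7.9},
    "2B": {"thinking": 4.1, "instruct": 5.8},
}

# Positive delta means the Thinking variant scored higher.
deltas = {
    size: round(scores["thinking"] - scores["instruct"], 1)
    for size, scores in qwen3_vl_overall.items()
}

for size, delta in deltas.items():
    print(f"Qwen3-VL-{size}: Thinking minus Instruct = {delta:+.1f} points")
```

In these numbers the Thinking variant scores higher at 8B and above and slightly lower at 4B and 2B.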

To submit your model results, please contact us via email.