WildTableBench

Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

Junzhe Huang¹, Xiaoxiao Sun², Yan Yang³, Yuxuan Hou⁴, Ruotian Zhang⁴, Sirui Li⁵, Hehe Fan⁴, Serena Yeung-Levy², Xin Yu⁶
¹University of Queensland  ·  ²Stanford University  ·  ³Australian National University  ·  ⁴Zhejiang University  ·  ⁵Murdoch University  ·  ⁶University of Adelaide

Benchmark Example

[Figure: WildTableBench teaser example, a train schedule table]

Question: A person plans to experience the scenic train ride after 5:00 PM but before 7:00 PM. How many days fit this schedule in 2025?

Answer: 26

Model answers:
Gemini-3-Pro: 18 ✗
GPT-5.2: 2 ✗
Claude-Opus-4.6: 4 ✗

Tags: Transportation · Color-based Counting · Multi-hop Reasoning · Visual Parsing

All three frontier models answer incorrectly, illustrating the difficulty of in-the-wild table understanding.
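
To make the comparison above concrete, here is a minimal sketch that checks each model's answer against the reference answer. It assumes exact match on integers purely for illustration; this page does not describe the benchmark's actual scoring procedure, and `is_correct` is a hypothetical helper, not part of any released code.

```python
# Reference answer and model predictions, copied from the teaser example above.
reference = 26
predictions = {"Gemini-3-Pro": 18, "GPT-5.2": 2, "Claude-Opus-4.6": 4}


def is_correct(predicted: int, expected: int) -> bool:
    """Hypothetical scorer: exact integer match (an assumption, not the official metric)."""
    return predicted == expected


# Map each model to whether its answer matches the reference.
results = {model: is_correct(pred, reference) for model, pred in predictions.items()}

for model, ok in results.items():
    print(f"{model}: {'correct' if ok else 'incorrect'}")
```

Under this assumed scorer, all three models are marked incorrect, matching the ✗ marks shown in the example.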

Abstract

Using multimodal foundation models to analyze table images is a high-value yet challenging application in consumer and enterprise scenarios. Despite its importance, current evaluations rely largely on structured-text tables or clean rendered images, leaving the visual complexity of in-the-wild table images underexplored.

We introduce WildTableBench, the first question-answering benchmark for naturally occurring table images from real-world settings. WildTableBench comprises 402 high-information-density table images collected from online forums and websites across diverse domains, together with 928 manually annotated and verified questions spanning 17 subtypes across five categories.

We evaluate 21 frontier proprietary and open-source multimodal foundation models. Only one model, Gemini-3-Pro (67.9%), exceeds 50% accuracy; the remaining 20 models score between 4.1% and 49.9%, revealing persistent weaknesses in structural perception and numerical reasoning.

Question Taxonomy

C1 · Cell-Level

C1-T Transcription
C1-L Cell Locating
C1-S Semantic Lookup
C1-F Excel Formula

C2 · Numerical

C2-B Basic Numerical
C2-R Ranking
C2-C Conditional Numerical
C2-M Multi-step Conditional

C3 · Verification

C3-V Value Verification
C3-A Aggregate Verification
C3-C Conditional Verification

C4 · Hypothetical

C4-R Row Operation
C4-M Value Modification
C4-H Hypothetical Condition

C5 · Color

C5-I Color Identification
C5-C Color-based Counting
C5-R Color-based Reasoning
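
The taxonomy above can be written down as a small lookup structure, which makes it easy to confirm the counts stated in the abstract (17 subtypes across five categories). This is only an illustrative sketch: the name `TAXONOMY` and the dictionary layout are assumptions, not the format of the released dataset.

```python
# Category -> {subtype code: subtype name}, transcribed from the taxonomy above.
# The variable name and layout are illustrative, not the dataset's schema.
TAXONOMY = {
    "C1 Cell-Level": {
        "C1-T": "Transcription",
        "C1-L": "Cell Locating",
        "C1-S": "Semantic Lookup",
        "C1-F": "Excel Formula",
    },
    "C2 Numerical": {
        "C2-B": "Basic Numerical",
        "C2-R": "Ranking",
        "C2-C": "Conditional Numerical",
        "C2-M": "Multi-step Conditional",
    },
    "C3 Verification": {
        "C3-V": "Value Verification",
        "C3-A": "Aggregate Verification",
        "C3-C": "Conditional Verification",
    },
    "C4 Hypothetical": {
        "C4-R": "Row Operation",
        "C4-M": "Value Modification",
        "C4-H": "Hypothetical Condition",
    },
    "C5 Color": {
        "C5-I": "Color Identification",
        "C5-C": "Color-based Counting",
        "C5-R": "Color-based Reasoning",
    },
}

num_categories = len(TAXONOMY)
num_subtypes = sum(len(subtypes) for subtypes in TAXONOMY.values())
print(num_categories, num_subtypes)  # 5 17
```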

Leaderboard

Results on WildTableBench (928 questions, 402 images); all scores are accuracy (%). Gemini-3-Pro achieves the best score in every column.

Model types: Proprietary, Open · Thinking, Open · Instruct

Rank Model Type C1 (Cell) C2 (Num.) C3 (Verif.) C4 (Hypo.) C5 (Color) Overall
🥇 1 Gemini-3-Pro Proprietary 62.8 71.9 71.6 75.3 55.8 67.9
🥈 2 Kimi-K2.5 Open · Thinking 47.6 50.6 67.2 55.1 37.5 49.9
🥉 3 Gemini-3-Flash Proprietary 51.0 49.7 58.2 55.1 37.5 49.4
4 Seed-2.0-Pro Proprietary 59.3 44.9 68.7 44.9 32.5 47.8
5 GPT-5.2 Proprietary 53.1 45.8 56.7 55.1 29.2 46.6
6 Claude-Opus-4.6 Proprietary 51.7 46.7 61.2 53.9 20.8 45.7
7 Claude-Sonnet-4.6 Proprietary 35.9 34.1 49.3 49.4 17.5 35.0
8 Qwen3-VL-235B-Thinking Open · Thinking 37.9 33.5 41.8 39.3 26.7 34.7
9 Qwen3-VL-32B-Thinking Open · Thinking 28.3 27.8 40.3 20.2 28.3 28.2
10 GPT-5-mini Proprietary 20.7 24.3 41.8 33.7 18.3 25.3
11 Qwen3-VL-235B-Instruct Open · Instruct 22.8 26.6 34.3 21.3 21.7 25.2
12 Qwen3-VL-32B-Instruct Open · Instruct 24.8 22.5 41.8 27.0 18.3 24.5
13 GLM-4.6V Open · Thinking 17.9 24.0 34.3 27.0 25.8 24.4
14 GPT-o3 Proprietary 17.2 11.4 32.8 10.1 15.0 14.8
15 Qwen3-VL-8B-Thinking Open · Thinking 9.0 15.0 28.4 16.9 11.7 14.7
16 Qwen3-VL-8B-Instruct Open · Instruct 10.3 8.7 17.9 9.0 9.2 9.9
17 Qwen3-VL-4B-Instruct Open · Instruct 8.3 5.4 16.4 6.7 10.8 7.9
18 Qwen3-VL-4B-Thinking Open · Thinking 4.8 5.4 25.4 5.6 9.2 7.7
19 Qwen3-VL-2B-Instruct Open · Instruct 4.1 3.3 22.4 2.2 8.3 5.8
20 GPT-4o Proprietary 4.1 3.9 23.9 0.0 6.7 5.7
21 Qwen3-VL-2B-Thinking Open · Thinking 4.1 2.4 14.9 1.1 5.0 4.1
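
As a quick sanity check on the leaderboard, the sketch below sorts the models by their Overall score and re-derives the headline claim that only one model exceeds 50% accuracy. The scores are transcribed from the Overall column above; the variable names are illustrative.

```python
# Overall accuracy (%) per model, transcribed from the Overall column above.
overall = {
    "Gemini-3-Pro": 67.9,
    "Kimi-K2.5": 49.9,
    "Gemini-3-Flash": 49.4,
    "Seed-2.0-Pro": 47.8,
    "GPT-5.2": 46.6,
    "Claude-Opus-4.6": 45.7,
    "Claude-Sonnet-4.6": 35.0,
    "Qwen3-VL-235B-Thinking": 34.7,
    "Qwen3-VL-32B-Thinking": 28.2,
    "GPT-5-mini": 25.3,
    "Qwen3-VL-235B-Instruct": 25.2,
    "Qwen3-VL-32B-Instruct": 24.5,
    "GLM-4.6V": 24.4,
    "GPT-o3": 14.8,
    "Qwen3-VL-8B-Thinking": 14.7,
    "Qwen3-VL-8B-Instruct": 9.9,
    "Qwen3-VL-4B-Instruct": 7.9,
    "Qwen3-VL-4B-Thinking": 7.7,
    "Qwen3-VL-2B-Instruct": 5.8,
    "GPT-4o": 5.7,
    "Qwen3-VL-2B-Thinking": 4.1,
}

# Sort descending by Overall score to recover the ranking.
ranked = sorted(overall.items(), key=lambda item: item[1], reverse=True)

# Split into models above 50% and the rest.
above_half = [model for model, score in ranked if score > 50.0]
rest = [score for _, score in ranked if score <= 50.0]

print(len(ranked), above_half, min(rest), max(rest))
# 21 ['Gemini-3-Pro'] 4.1 49.9
```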

To submit your model results, please contact us via email.