How is Visual Understanding different from Arena Rankings?

Arena Rankings show which model people prefer when comparing two outputs side by side. Visual Understanding shows which model actually gets the right answer on specific tasks. A model can rank highly in the Arena but still struggle with OCR or object counting. Use both together to make a better decision.

How is the score calculated?

Each model is given 67 image tasks and either passes or fails each one. The score is the percentage of tasks it passed.
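As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names and the band labels ("high", "mid", "low") are made up for this example. The thresholds come from the page's score key, and treating anything between 74% and 75% as the middle band is an assumption.

```python
def score_percent(passed: int, total: int) -> float:
    """Percentage of tasks passed, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 2)


def score_band(percent: float) -> str:
    """Map a score to a band using the score-key thresholds.

    Band names are hypothetical labels, not taken from the page.
    """
    if percent >= 75:
        return "high"
    if percent >= 40:
        return "mid"  # assumed to cover everything from 40% up to just under 75%
    return "low"


if __name__ == "__main__":
    pct = score_percent(53, 67)  # 53 passed out of 67 tasks
    print(pct, score_band(pct))
```

For example, 53 passes out of 67 tasks works out to 79.1%, which matches the top rows of the table.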

Can I test these models on my own images?

Many models link directly to their Playground page. Just click the model name to open it, then upload your own image and run it through open prompt, OCR, or other supported tasks. Some models in this table are included for comparison only and don't have a Playground page yet.

Vision Evals

Looking for rankings based on real user votes? See Arena Rankings.

See which AI vision models are best at reading text, counting objects, spotting defects, and understanding documents. Tested on real-world visual QA prompts by Roboflow.

73 models evaluated | 67 prompts per model

Score key: ≥75% | 40–74% | <40%

Rank	Model	Passed	Score	Time	Tokens	Provider
1	Gemini 3.5 Flash	53 / 67	79.1%	6.71s	1.4K	Google
1	Qwen3.5 35B A3B	53 / 67	79.1%	20.94s		Qwen
3	GPT-5.4	52 / 67	77.61%	7.16s	1.7K	OpenAI
3	Gemini 3.1 Pro	52 / 67	77.61%	13.23s	1.3K	Google
3	GPT-5.5	52 / 67	77.61%	30.12s	1.7K	OpenAI
3	Gemini 3 Flash	52 / 67	77.61%	74.95s	2.0K	Google
7	Qwen3.5 122B A10B	51 / 67	76.12%	1.77s	1.2K	Qwen
8	Gemini 3.1 Pro	50 / 66	75.76%	6.13s	1.1K	Google
9	GPT-5.4 Mini	50 / 67	74.63%	7.87s		OpenAI
9	Gemini 3 Flash	50 / 67	74.63%	9.85s	1.4K	Google
9	Claude Fable 5	50 / 67	74.63%	16.44s		Anthropic
9	Qwen 3.5 Plus	50 / 67	74.63%	21.49s	1.4K	Qwen
13	GPT-5 Mini	49 / 67	73.13%	11.72s	1.8K	OpenAI
14	Qwen3.5 27B	48 / 67	71.64%	1.98s	1.2K	Qwen
14	Qwen3.5 9B	48 / 67	71.64%	8.99s		Qwen
14	OpenAI O4 Mini	48 / 67	71.64%	10.27s	2.7K	OpenAI
17	GPT-4.1	47 / 67	70.15%	2.56s	977	OpenAI
17	Claude Sonnet 4.6	47 / 67	70.15%	4.24s	2.3K	Anthropic
17	Gemini 2.5 Pro	47 / 67	70.15%	11.87s	856	Google
17	ChatGPT-4o	47 / 67	70.15%	21.70s		OpenAI
21	Gemini 3.1 Flash-Lite	46 / 67	68.66%	1.86s	1.1K	Google
21	GPT-4.1 Mini	46 / 67	68.66%	2.54s	1.9K	OpenAI
21	Claude Sonnet 4	46 / 67	68.66%	21.26s		Anthropic
21	Gemma 4 26B A4B	46 / 67	68.66%	30.23s	531	Google
21	Qwen3.6 Plus	46 / 67	68.66%	34.17s	1.6K	Qwen
26	Claude Opus 4.8	45 / 67	67.16%	4.36s	2.2K	Anthropic
26	Claude Opus 4.7	45 / 67	67.16%	4.85s	2.6K	Anthropic
26	Mistral Medium 3.5	45 / 67	67.16%	11.48s		Mistral
26	Cosmos 3 Super	45 / 67	67.16%	15.31s		NVIDIA
26	Gemma 4 31B	45 / 67	67.16%	34.59s	467	Google
26	Llama 4 Scout 17B	45 / 67	67.16%	43.93s		Meta
32	Nemotron 3 Nano Omni 30B (A3B)	43 / 67	64.18%	12.60s		NVIDIA
32	GLM 4.6v	43 / 67	64.18%	19.25s	2.0K	Zhipu AI
32	Claude Opus 4.6	43 / 67	64.18%	23.35s	2.3K	Anthropic
32	OpenAI O1	43 / 67	64.18%	43.13s	1.5K	OpenAI
36	Gemma 4 12B	42 / 67	62.69%	6.88s		Google
36	Claude 3.5 Haiku	42 / 67	62.69%	18.36s		Anthropic
38	Cosmos Reason2 8B	41 / 67	61.19%	2.73s		NVIDIA
38	Gemini 2.0 Flash	41 / 67	61.19%	6.28s		Google
38	Qwen 3.5 4B	41 / 67	61.19%	6.49s		Qwen
41	Llama 4 Maverick 17B	40 / 67	59.7%	2.30s	2.4K	Meta
41	Claude Sonnet 4.5	40 / 67	59.7%	5.67s	2.3K	Anthropic
41	Gemini 2.0 Flash Lite	40 / 67	59.7%	5.76s		Google
41	Claude Opus 4.1	40 / 67	59.7%	7.09s	2.1K	Anthropic
41	Claude 3.7 Sonnet	40 / 67	59.7%	15.48s		Anthropic
46	Cosmos 3 Nano	39 / 67	58.21%	2.98s		NVIDIA
46	Claude Haiku 4.5	39 / 67	58.21%	3.15s	2.3K	Anthropic
46	GPT-5.4 Nano	39 / 67	58.21%	4.61s		OpenAI
46	GPT-5 Nano	39 / 67	58.21%	6.58s	2.7K	OpenAI
46	Cohere Aya Vision 32B	39 / 67	58.21%	20.33s		Cohere
46	Gemma 3 27B	39 / 67	58.21%	33.60s		Google
46	Qwen3.5 397B A17B	39 / 67	58.21%	56.61s	1.5K	Qwen
53	Mistral Medium 3	38 / 67	56.72%	2.85s	1.6K	Mistral
53	Claude Opus 4	38 / 67	56.72%	19.74s		Anthropic
55	Gemini 2.5 Flash	37 / 67	55.22%	24.91s	476	Google
56	Molmo2 8B	36 / 67	53.73%	1.14s		Ai2
56	LFM 2.5 VL 1.6B	36 / 67	53.73%	5.80s		Liquid AI
56	Gemini 2.5 Flash-Lite	36 / 67	53.73%	7.19s	301	Google
59	Qwen2.5 VL 7B Instruct	35 / 67	52.24%	47.64s		Qwen
59	Grok 4	35 / 67	52.24%	85.24s		xAI
61	Arcee.ai Spotlight	32 / 67	47.76%	15.86s		Arcee
62	Qwen 3.5 2B	31 / 67	46.27%	6.95s		Qwen
62	Phi 4 Multimodal	31 / 67	46.27%	17.14s		Microsoft
64	Grok 4.1 Fast	29 / 67	43.28%	23.08s		xAI
65	GPT-4.1 Nano	27 / 67	40.3%	2.36s	2.9K	OpenAI
65	SmolVLM2 2.2B	27 / 67	40.3%	4.51s		HuggingFace
67	Mistral Small 3.1 24B	26 / 67	38.81%	3.18s		Mistral
67	Qwen 3.5 0.8B	26 / 67	38.81%	4.55s		Qwen
69	Gemma 3 4B	25 / 67	37.31%	16.80s		Google
70	Cosmos Reason2 2B	24 / 67	35.82%	2.02s		NVIDIA
70	Cohere Aya Vision 8B	24 / 67	35.82%	12.18s		Cohere
70	Kimi K2.5	24 / 67	35.82%	14.81s	2.7K	Moonshot AI
73	Reka Edge	17 / 67	25.37%	1.63s		Reka

Token usage is the median total tokens (input + output; output includes reasoning) per task on this benchmark, measured from a single run at low temperature. Hover a cell for the input/output breakdown and estimated cost at current pricing. Figures appear only for models whose instrumented run covered at least 90% of tasks, so a sparse, biased median is never shown.
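The coverage rule above can be sketched as follows. This is only an illustration under stated assumptions: `median_tokens` and its parameters are hypothetical names, and a task with no recorded usage is represented as `None`.

```python
from statistics import median
from typing import Optional, Sequence


def median_tokens(
    per_task_totals: Sequence[Optional[int]],
    min_coverage: float = 0.9,
) -> Optional[float]:
    """Median total tokens per task, or None when coverage is too low.

    Each entry is input + output tokens for one task, or None if the
    instrumented run did not record usage for that task.
    """
    if not per_task_totals:
        return None
    measured = [t for t in per_task_totals if t is not None]
    coverage = len(measured) / len(per_task_totals)
    if coverage < min_coverage:
        return None  # too sparse: show no figure rather than a biased median
    return median(measured)


if __name__ == "__main__":
    # 61 of 67 tasks measured (about 91%) clears the 90% bar.
    print(median_tokens([1400] * 61 + [None] * 6))
    # 60 of 67 tasks measured (about 89.6%) does not, so nothing is shown.
    print(median_tokens([1400] * 60 + [None] * 7))
```

Returning `None` instead of `0` mirrors the page's behavior of leaving the cell empty when a model was not measured well enough.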

What is Visual Understanding?

Visual Understanding tests models on real image tasks like reading text from a photo, counting objects, spotting defects, and understanding documents. Every model gets the same tasks. The score is the share of tasks it got right. No human votes, no subjective judgment, just pass or fail.

Methodology

We gave each model the same image tasks and recorded whether it got each one right or wrong. The score is simply how many it got right. Every model gets the same tasks, so the scores are directly comparable.
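The rank column in the table looks consistent with standard competition ranking: models with the same score share a rank, and the next rank skips ahead (1, 1, 3, ...). Tied rows also appear to be ordered by response time. Both points are inferred from the table, not stated by the page, so the sketch below is an assumption and `competition_rank` is a hypothetical helper.

```python
from typing import List, Tuple

Row = Tuple[str, float, float]  # (model name, score in percent, seconds)


def competition_rank(rows: List[Row]) -> List[Tuple[int, str]]:
    """Rank by score descending; ties share a rank and are listed fastest first."""
    ordered = sorted(rows, key=lambda r: (-r[1], r[2]))
    ranked: List[Tuple[int, str]] = []
    for i, (name, score, _secs) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            rank = ranked[-1][0]  # same score as the previous row: reuse its rank
        else:
            rank = i + 1  # position in the sorted list, so ranks skip after ties
        ranked.append((rank, name))
    return ranked


if __name__ == "__main__":
    sample = [
        ("GPT-5.4", 77.61, 7.16),
        ("Gemini 3.5 Flash", 79.1, 6.71),
        ("Qwen3.5 35B A3B", 79.1, 20.94),
    ]
    for rank, name in competition_rank(sample):
        print(rank, name)
```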

Token usage & cost. Where shown, “output tokens” is the median per-prompt output count measured directly from each provider’s API response, and includes reasoning / thinking tokens, normalized across providers so the figure is comparable (for example, Gemini reports reasoning separately, and we add it into the output count). Input tokens include image tokens, which dominate and differ by model. “Est. cost / task” is that measured token usage multiplied by the model’s published per-1M pricing at the time of our last price sync, so it is an estimate on this benchmark, not a universal model cost. Figures come from a single evaluation run at low temperature; output for reasoning models can vary run to run. Models we haven’t measured (or that don’t expose token usage) show no token or cost figure rather than a zero.
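A minimal sketch of the cost estimate described above, assuming separate per-1M prices for input and output tokens. The `Usage` class, the function names, and the prices in the example are all hypothetical. They are not any provider's actual API fields or rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int          # includes image tokens
    output_tokens: int         # visible output as reported by the provider
    reasoning_tokens: int = 0  # nonzero only if the provider reports it separately


def normalized_output(usage: Usage) -> int:
    """Fold separately reported reasoning tokens into the output count."""
    return usage.output_tokens + usage.reasoning_tokens


def estimated_cost(
    usage: Usage,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Estimated cost of one task, in the same currency as the prices."""
    total = (
        usage.input_tokens * input_price_per_1m
        + normalized_output(usage) * output_price_per_1m
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Made-up numbers: 1,000 input tokens, 150 output plus 250 reasoning tokens,
    # priced at 2.00 per 1M input tokens and 8.00 per 1M output tokens.
    usage = Usage(input_tokens=1000, output_tokens=150, reasoning_tokens=250)
    print(estimated_cost(usage, 2.0, 8.0))
```

Because the estimate multiplies measured usage by a price snapshot, it shifts whenever either the pricing or the run-to-run token counts change.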

Last evaluated: June 26, 2026