How is Visual Understanding different from Arena Rankings?

Arena Rankings show which model people prefer when comparing two outputs side by side. Visual Understanding shows which model actually gets the right answer on specific tasks. A model can rank highly in the Arena but still struggle with OCR or object counting. Use both together to make a better decision.

How is the score calculated?

Each model is given the same 67 image tasks. Each task is graded pass or fail, and the score is the percentage of tasks the model passed.
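As a rough illustration of that arithmetic, here is a minimal sketch that assumes each task result is stored as a boolean. The `score` helper is hypothetical and is not Roboflow's actual scoring code.

```python
def score(results: list[bool]) -> float:
    """Return the percentage of tasks passed, rounded to two decimals."""
    if not results:
        raise ValueError("no results to score")
    # True counts as 1 and False as 0, so sum() is the number of passes.
    return round(100 * sum(results) / len(results), 2)


# Example: 56 passes out of 67 tasks.
results = [True] * 56 + [False] * 11
print(score(results))  # 83.58
```

Because every model is scored over the same 67 tasks, one extra correct answer moves the score by about 1.49 percentage points.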

Can I test these models on my own images?

Many models link directly to their Playground page. Just click the model name to open it, then upload your own image and run it through open prompt, OCR, or other supported tasks. Some models in this table are included for comparison only and don't have a Playground page yet.

Vision Evals

Looking for rankings based on real user votes? See Arena Rankings

See which AI vision models are best at reading text, counting objects, spotting defects, and understanding documents. Tested on real-world visual QA prompts by Roboflow.

70 models evaluated | 67 prompts per model

Score key: ≥75% | 40–74% | <40%
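The score key can be read as three bands. The sketch below is one possible interpretation: the key does not say how a score such as 74.63% is treated, so it assumes anything below 75 falls in the middle band. The band names and the `score_band` function are hypothetical.

```python
def score_band(percent: float) -> str:
    """Map a score percentage to a band from the score key (assumed cut-offs)."""
    if percent >= 75:
        return "high"    # ≥75%
    if percent >= 40:
        return "medium"  # 40–74% (assumed to include 74.01–74.99)
    return "low"         # <40%


for percent in (83.58, 74.63, 40.3, 38.81):
    print(percent, score_band(percent))
```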

Rank	Model	Passed	Score	Time	Provider
1	Gemini 3.5 Flash	56 / 67	83.58%	5.46s	Google
2	Gemini 3.1 Pro (Tools)	54 / 67	80.60%	30.65s	Google
3	Gemini 3.1 Pro	53 / 67	79.10%	14.05s	Google
3	Gemini 3 Flash (Tools)	53 / 67	79.10%	77.32s	Google
5	Qwen3.5 122B A10B	51 / 67	76.12%	1.60s	Qwen
5	GPT-5.4	51 / 67	76.12%	10.77s	OpenAI
5	OpenAI O4 Mini (Medium Reasoning)	51 / 67	76.12%	14.69s	OpenAI
5	GPT-5.5	51 / 67	76.12%	17.45s	OpenAI
9	GPT-5.4 Mini	50 / 67	74.63%	7.87s	OpenAI
9	Claude Fable 5	50 / 67	74.63%	16.44s	Anthropic
11	Claude Opus 4.8	49 / 67	73.13%	4.52s	Anthropic
11	Qwen3.5 397B A17B	49 / 67	73.13%	5.51s	Qwen
11	Claude Opus 4.7	49 / 67	73.13%	17.82s	Anthropic
11	Qwen3.6 Plus	49 / 67	73.13%	97.59s	Qwen
11	Gemma 4 31B	49 / 67	73.13%	273.07s	Google
16	Qwen3.5 27B	48 / 67	71.64%	1.97s	Qwen
16	Qwen3.5 9b	48 / 67	71.64%	8.99s	Qwen
18	ChatGPT-4o (Medium Reasoning)	47 / 67	70.15%	21.70s	OpenAI
19	Gemini 2.5 Pro	46 / 67	68.66%	10.62s	Google
19	GPT-5 Mini	46 / 67	68.66%	15.93s	OpenAI
19	GPT-4.1 Mini	46 / 67	68.66%	18.32s	OpenAI
19	Claude Sonnet 4	46 / 67	68.66%	21.26s	Anthropic
19	Gemma 4 26B A4B	46 / 67	68.66%	36.20s	Google
24	Gemini 3.1 Flash-Lite	45 / 67	67.16%	3.50s	Google
24	Qwen 3.5 Plus	45 / 67	67.16%	4.55s	Qwen
24	Gemini 3 Flash	45 / 67	67.16%	9.38s	Google
24	Mistral Medium 3.5	45 / 67	67.16%	11.48s	Mistral
24	Cosmos 3 Super	45 / 67	67.16%	15.31s	NVIDIA
24	Llama 4 Scout 17B	45 / 67	67.16%	43.93s	Meta
30	GPT-4.1	44 / 67	65.67%	13.09s	OpenAI
30	Kimi K2.5	44 / 67	65.67%	60.93s	Moonshot AI
32	Nemotron 3 Nano Omni 30B (A3B)	43 / 67	64.18%	12.60s	NVIDIA
32	Claude Opus 4.6	43 / 67	64.18%	14.44s	Anthropic
32	GLM 4.6v	43 / 67	64.18%	17.82s	Zhipu AI
35	Qwen3.5 35B A3B	42 / 67	62.69%	1.30s	Qwen
35	Gemma 4 12B	42 / 67	62.69%	6.88s	Google
35	Mistral Medium 3	42 / 67	62.69%	14.03s	Mistral
35	Claude 3.5 Haiku	42 / 67	62.69%	18.36s	Anthropic
39	Cosmos Reason2 8B	41 / 67	61.19%	2.73s	NVIDIA
39	Gemini 2.0 Flash	41 / 67	61.19%	6.28s	Google
39	Qwen 3.5 4B	41 / 67	61.19%	6.49s	Qwen
39	OpenAI O1	41 / 67	61.19%	32.09s	OpenAI
43	Gemini 2.0 Flash Lite	40 / 67	59.70%	5.76s	Google
43	GPT-5 Nano	40 / 67	59.70%	13.74s	OpenAI
43	Claude 3.7 Sonnet	40 / 67	59.70%	15.48s	Anthropic
46	Cosmos 3 Nano	39 / 67	58.21%	2.98s	NVIDIA
46	GPT-5.4 Nano	39 / 67	58.21%	4.61s	OpenAI
46	Llama 4 Maverick 17B	39 / 67	58.21%	17.10s	Meta
46	Cohere Aya Vision 32B	39 / 67	58.21%	20.33s	Cohere
46	Gemma 3 27B	39 / 67	58.21%	33.60s	Google
51	Claude Opus 4.1	38 / 67	56.72%	18.99s	Anthropic
51	Claude Opus 4	38 / 67	56.72%	19.74s	Anthropic
53	Gemini 2.5 Flash-Lite	37 / 67	55.22%	6.17s	Google
53	Gemini 2.5 Flash	37 / 67	55.22%	24.25s	Google
55	Molmo2 8B	36 / 67	53.73%	1.14s	Allen AI
55	LFM 2.5 VL 1.6B	36 / 67	53.73%	5.80s	Liquid AI
57	Qwen2.5 VL 7B Instruct	35 / 67	52.24%	47.64s	Qwen
57	Grok 4	35 / 67	52.24%	85.24s	xAI
59	Arcee.ai Spotlight	32 / 67	47.76%	15.86s	Arcee
60	Qwen 3.5 2B	31 / 67	46.27%	6.95s	Qwen
60	Phi 4 Multimodal	31 / 67	46.27%	17.14s	Microsoft
62	Grok 4.1 Fast	29 / 67	43.28%	23.08s	xAI
63	SmolVLM2 2.2B	27 / 67	40.30%	4.51s	HuggingFace
64	Qwen 3.5 0.8B	26 / 67	38.81%	4.55s	Qwen
65	Mistral Small 3.1 24B	25 / 67	37.31%	16.41s	Mistral
65	Gemma 3 4B	25 / 67	37.31%	16.80s	Google
67	Cosmos Reason2 2B	24 / 67	35.82%	2.02s	NVIDIA
67	Cohere Aya Vision 8B	24 / 67	35.82%	12.18s	Cohere
69	GPT-4.1 Nano	23 / 67	34.33%	21.91s	OpenAI
70	Reka Edge	17 / 67	25.37%	1.63s	Reka
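
The ranks in the table appear to follow standard competition ranking: models with the same pass count share a rank, and the next rank skips ahead (1, 2, 3, 3, 5). Within a tie, rows appear to be ordered by the time column, fastest first. Both rules are inferred from the table rather than stated by the source. A minimal sketch under those assumptions, with hypothetical `Row` and `rank_rows` names:

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    passed: int
    seconds: float


def rank_rows(rows: list[Row]) -> list[tuple[int, Row]]:
    """Competition ranking: rows tied on `passed` share the same rank."""
    # Most passes first; within a tie, the faster model first (assumed rule).
    ordered = sorted(rows, key=lambda r: (-r.passed, r.seconds))
    ranked: list[tuple[int, Row]] = []
    for position, row in enumerate(ordered, start=1):
        if ranked and ranked[-1][1].passed == row.passed:
            rank = ranked[-1][0]  # same pass count: reuse the previous rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, row))
    return ranked


rows = [
    Row("Gemini 3 Flash (Tools)", 53, 77.32),
    Row("Gemini 3.5 Flash", 56, 5.46),
    Row("Qwen3.5 122B A10B", 51, 1.60),
    Row("Gemini 3.1 Pro", 53, 14.05),
    Row("Gemini 3.1 Pro (Tools)", 54, 30.65),
]
for rank, row in rank_rows(rows):
    print(rank, row.model, f"{row.passed} / 67")
```

Run on these five rows from the table, the sketch assigns ranks 1, 2, 3, 3, 5, which matches the published ordering.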

What is Visual Understanding?

Visual Understanding tests models on real image tasks such as reading text from a photo, counting objects, spotting defects, and understanding documents. Every model gets the same tasks, and each answer is graded pass or fail. The score is the percentage of tasks passed, with no human votes and no subjective judgment.

Methodology

We gave each model the same 67 image tasks and recorded whether it got each one right or wrong. The score is the percentage of tasks it got right. Because every model gets the same tasks, the scores are directly comparable.
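
The shape of such an evaluation can be sketched as a loop that feeds every model the same task list and counts passes. Everything below is an assumption made for illustration: the source does not describe how answers are checked, so the `Task` type, the `is_correct` check, the example prompts and file names, and the stub models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    prompt: str
    image_path: str
    is_correct: Callable[[str], bool]  # hypothetical pass/fail check


# A model is treated as a function: (prompt, image_path) -> answer text.
Model = Callable[[str, str], str]


def evaluate(models: dict[str, Model], tasks: list[Task]) -> dict[str, int]:
    """Run every model on the same tasks and count how many each one passes."""
    passed: dict[str, int] = {}
    for name, model in models.items():
        passed[name] = sum(
            task.is_correct(model(task.prompt, task.image_path)) for task in tasks
        )
    return passed


tasks = [
    Task("How many bottles are on the shelf?", "shelf.jpg", lambda a: a.strip() == "12"),
    Task("What is the serial number?", "label.jpg", lambda a: "SN-4471" in a),
]
# Stub models that return fixed answers, standing in for real API calls.
models: dict[str, Model] = {
    "stub-a": lambda prompt, image_path: "12",
    "stub-b": lambda prompt, image_path: "SN-4471",
}
print(evaluate(models, tasks))
```

Keeping the task list fixed across models is what makes the pass counts comparable; only the model changes between runs.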

Last evaluated: June 10, 2026