Vision Evals
Looking for rankings based on real user votes? See Arena Rankings
See which AI vision models are best at reading text, counting objects, spotting defects, and understanding documents. Tested on real-world visual QA prompts by Roboflow.
What is Visual Understanding?
Visual Understanding tests models on real image tasks such as reading text from a photo, counting objects, spotting defects, and understanding documents. Every model gets the same tasks, and each answer is graded pass or fail. A model's score is the number of tasks it got right. No human votes, no subjective judgment.
Methodology
We gave every model the same image tasks and recorded whether it got each one right or wrong. A model's score is the number of tasks it got right. Because the task set is identical for all models, the scores are directly comparable.
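The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the benchmark's actual harness: the function names, the model names, and the layout of the results (one list of pass/fail booleans per model) are all assumptions.

```python
from typing import Dict, List, Tuple


def score(results: List[bool]) -> int:
    """Return how many tasks were passed (True = pass, False = fail)."""
    return sum(1 for passed in results if passed)


def leaderboard(results_by_model: Dict[str, List[bool]]) -> List[Tuple[str, int]]:
    """Score every model and sort from highest to lowest score.

    Scores are only comparable if every model ran the same tasks, so this
    hypothetical helper refuses to rank models with differing task counts.
    """
    task_counts = {len(results) for results in results_by_model.values()}
    if len(task_counts) > 1:
        raise ValueError("all models must be evaluated on the same set of tasks")
    scores = [(model, score(results)) for model, results in results_by_model.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical pass/fail results for two made-up models on three tasks.
example = {
    "model_a": [True, False, True],
    "model_b": [True, True, True],
}
print(leaderboard(example))  # [('model_b', 3), ('model_a', 2)]
```

Checking that every model has the same number of results is a cheap guard for the "same tasks" requirement; a real harness would more likely compare task identifiers than counts.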
Token usage & cost. Where shown, “output tokens” is the median per-prompt output token count, measured directly from each provider’s API response. It includes reasoning / thinking tokens and is normalized across providers so the figures are comparable (for example, Gemini reports reasoning tokens separately, so we add them to the output count). Input tokens include image tokens, which dominate the input and differ by model. “Est. cost / task” is that measured token usage multiplied by the model’s published per-1M-token pricing at the time of our last price sync, so it is an estimate for this benchmark, not a universal model cost. Figures come from a single evaluation run at low temperature; output token counts for reasoning models can vary from run to run. Models we haven’t measured (or that don’t expose token usage) show no token or cost figure rather than a zero.
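As a rough illustration of the arithmetic described above, the following Python sketch computes a median output token count and an estimated cost per task. Everything here is an assumption made for the example: the Usage record, the function names, the token counts, and the prices are hypothetical, and the use of the median for input tokens is a guess, since the text only specifies the median for output tokens.

```python
from statistics import median
from typing import List, NamedTuple, Optional


class Usage(NamedTuple):
    """Token usage for one prompt (hypothetical record layout)."""
    input_tokens: int          # includes image tokens
    output_tokens: int         # as reported by the provider
    reasoning_tokens: int = 0  # nonzero only when reported separately from output


def median_output_tokens(usages: List[Usage]) -> Optional[float]:
    """Median per-prompt output count, with separate reasoning tokens added in."""
    if not usages:
        return None  # unmeasured models get no figure rather than a zero
    return median(u.output_tokens + u.reasoning_tokens for u in usages)


def estimated_cost_per_task(
    usages: List[Usage],
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> Optional[float]:
    """Estimated cost of one task from token usage and per-1M-token prices."""
    if not usages:
        return None
    # Assumption: input usage is summarized by its median, like output usage.
    median_input = median(u.input_tokens for u in usages)
    median_output = median_output_tokens(usages)
    return (median_input * input_price_per_1m
            + median_output * output_price_per_1m) / 1_000_000


# Made-up usage for three prompts; the second reports reasoning separately.
usages = [Usage(1000, 100), Usage(1000, 50, 150), Usage(1000, 400)]
print(median_output_tokens(usages))                 # 200
print(estimated_cost_per_task(usages, 2.0, 10.0))   # about 0.004 with these made-up prices
```

Returning None for an empty measurement list mirrors the rule that unmeasured models show no figure instead of a zero.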
Last evaluated: June 26, 2026