Vision Evals
Video understanding benchmarks across two datasets: VideoNet (action recognition) and NVIDIA VANTAGE-Bench (semantic VQA and temporal localization).
What is Video Understanding?
Video Understanding tests whether models can identify actions, answer semantic questions, and localize events in video clips. Datasets: VideoNet (public action-recognition benchmark) and VANTAGE-Bench (NVIDIA surveillance/smart-spaces benchmark with multiple-choice VQA and temporal span prediction).
Methodology
VideoNet: zero-shot binary, multi-shot binary, and multiple-choice prompts, each graded pass/fail. VANTAGE-Bench: 100 VQA items graded by exact letter match (A–D), and 100 temporal localization items scored by mean temporal IoU (mIoU). Each localization item earns partial credit equal to the intersection over union of its predicted and ground-truth start/end spans, and the per-item scores are averaged to give mIoU.
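The two VANTAGE-Bench scoring rules can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own grader: the function names, the (start, end) tuple layout in seconds, and the whitespace and case normalization of answer letters are assumptions made for the example.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds; layout is an assumption


def letter_match(predicted: str, expected: str) -> bool:
    """Exact letter match for a multiple-choice item (A-D).

    Stripping whitespace and ignoring case is an assumption of this sketch.
    """
    return predicted.strip().upper() == expected.strip().upper()


def temporal_iou(pred: Span, truth: Span) -> float:
    """Intersection over union of two time spans, in the range [0, 1]."""
    p_start, p_end = pred
    t_start, t_end = truth
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(preds: Sequence[Span], truths: Sequence[Span]) -> float:
    """Average the per-item IoU over all localization items."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, t) for p, t in zip(preds, truths)) / len(preds)
```

For instance, a predicted span of 2 s to 6 s against a ground-truth span of 4 s to 8 s overlaps for 2 s out of a 6 s union, so that item earns one third of a point rather than zero.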
Token usage & cost. Where shown, “output tokens” is the median per-prompt output count measured directly from each provider’s API response, and includes reasoning / thinking tokens, normalized across providers so the figure is comparable (for example, Gemini reports reasoning separately, and we add it into the output count). Input tokens include image tokens, which dominate and differ by model. “Est. cost / task” is that measured token usage multiplied by the model’s published per-1M pricing at the time of our last price sync, so it is an estimate on this benchmark, not a universal model cost. Figures come from a single evaluation run at low temperature; output for reasoning models can vary run to run. Models we haven’t measured (or that don’t expose token usage) show no token or cost figure rather than a zero.
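As a rough sketch of how such a figure could be derived, the Python below normalizes output tokens, takes the median, and multiplies by per-1M prices. The `Usage` record, its field names, the use of the median for input tokens as well as output tokens, and the prices in the usage note are all hypothetical; the actual pipeline is not described in this much detail.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional, Sequence


@dataclass
class Usage:
    """Token counts for one prompt (hypothetical record, not a provider schema)."""
    input_tokens: int              # includes image tokens
    output_tokens: int
    reasoning_tokens: int = 0      # only meaningful if reported separately
    reasoning_included: bool = True  # False when the provider reports it separately


def normalized_output(u: Usage) -> int:
    """Output tokens with reasoning tokens counted exactly once."""
    if u.reasoning_included:
        return u.output_tokens
    return u.output_tokens + u.reasoning_tokens


def estimate_cost_per_task(
    usages: Sequence[Usage],
    input_price_per_m: float,
    output_price_per_m: float,
) -> Optional[float]:
    """Estimated cost per task from median token usage and per-1M prices.

    Returns None when nothing was measured, so callers can show no figure
    rather than a zero.
    """
    if not usages:
        return None
    med_in = median(u.input_tokens for u in usages)
    med_out = median(normalized_output(u) for u in usages)
    return (med_in * input_price_per_m + med_out * output_price_per_m) / 1_000_000
```

With made-up prices of 2.00 per 1M input tokens and 10.00 per 1M output tokens, a median of 1,000 input tokens and 150 normalized output tokens works out to 0.002 + 0.0015 = 0.0035 per task.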
Last evaluated: June 27, 2026