Which datasets are included?

VideoNet (raivn/VideoNet) for action recognition, and NVIDIA VANTAGE-Bench (nvidia/PhysicalAI-VANTAGE-Bench-Subset) for semantic VQA and temporal localization in surveillance and smart-spaces scenarios.

How is the score calculated?

Each model is given 400 sampled video prompts. VideoNet answers are graded pass/fail against the benchmark ground truth. VANTAGE VQA is graded by exact letter match. Temporal localization uses mean temporal IoU (mIoU): each prediction earns partial credit equal to the intersection-over-union of its predicted span with the ground-truth start/end span, rather than passing or failing against a single threshold. Because of this partial credit, the “Passed” totals in the results table can be fractional.
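As a rough illustration of how per-prompt grades could roll up into a headline score, the sketch below assumes the score is the plain sum of per-prompt credits divided by the number of prompts. That is consistent with the published totals but is not stated on this page. The function name `aggregate` and the sample credits are hypothetical.

```python
def aggregate(credits):
    """Sum per-prompt credits (each in [0, 1]) and return (passed, percent).

    Assumption: pass/fail and exact-match items contribute 0.0 or 1.0,
    and temporal localization items contribute their IoU as partial credit.
    """
    if not credits:
        raise ValueError("at least one credit is required")
    passed = sum(credits)
    percent = 100.0 * passed / len(credits)
    return passed, percent


# Hypothetical run of four prompts: two passes, one fail, one half-overlap span.
credits = [1.0, 0.0, 1.0, 0.5]
passed, percent = aggregate(credits)
print(f"{passed} / {len(credits)} = {percent:.2f}%")
```

Under this assumption a single fractional credit is enough to make the total non-integer, which is why a model can show a result such as 271.1 out of 400.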

Vision Evals


Video understanding benchmarks across two datasets: VideoNet (action recognition) and NVIDIA VANTAGE-Bench (semantic VQA and temporal localization).

7 models evaluated | 400 prompts per model

Score key: ≥75% | 40–74% | <40%

Rank	Model	Passed	Score	Time	Provider
1	Gemini 3.5 Flash	271.1 / 400	67.77%	12.00s	Google
2	Gemini 3.1 Pro	269.2 / 400	67.31%	15.13s	Google
3	GPT-5.5	235.1 / 400	58.78%	10.62s	OpenAI
4	Molmo2 8B	215.8 / 400	53.96%	7.85s	Ai2
5	Cosmos 3 Nano	205.4 / 400	51.36%	92.56s	NVIDIA
6	Cosmos 3 Super	202.3 / 400	50.57%	256.11s	NVIDIA
7	Qwen3-VL 32B	182.6 / 400	45.64%	21.06s	Alibaba

What is Video Understanding?

Video Understanding tests whether models can identify actions, answer semantic questions, and localize events in video clips. Datasets: VideoNet (public action-recognition benchmark) and VANTAGE-Bench (NVIDIA surveillance/smart-spaces benchmark with multiple-choice VQA and temporal span prediction).

Methodology

VideoNet: zero-shot binary, multi-shot binary, and multiple-choice prompts, each graded pass/fail. VANTAGE-Bench: 100 VQA items (exact letter match, A–D) and 100 temporal localization items scored by mean temporal IoU (mIoU), where each item earns partial credit equal to the intersection-over-union of its predicted and ground-truth start/end spans. Use the VideoNet / VANTAGE filter pills above to view each dataset on its own.
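The temporal IoU credit described above can be sketched as follows. This is a minimal example using the standard interval intersection-over-union definition. It assumes spans are (start, end) pairs in seconds with start ≤ end, and the function names `temporal_iou` and `mean_temporal_iou` are illustrative rather than taken from the benchmark code.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) time spans.

    Assumption: each span satisfies start <= end, in the same time unit.
    """
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_temporal_iou(pairs):
    """Average temporal IoU over a list of (prediction, ground_truth) pairs."""
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(temporal_iou(p, t) for p, t in pairs) / len(pairs)


# Hypothetical items: one perfect match, one miss, one partial overlap.
pairs = [
    ((3.0, 9.0), (3.0, 9.0)),    # identical spans
    ((0.0, 5.0), (5.0, 10.0)),   # touching but not overlapping
    ((0.0, 4.0), (2.0, 6.0)),    # 2 s overlap over a 6 s union
]
print(mean_temporal_iou(pairs))
```

A prediction that covers the right event but runs long still earns some credit here, which a fixed pass/fail IoU threshold would not give.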

Token usage & cost. Where shown, “output tokens” is the median per-prompt output count measured directly from each provider’s API response, and includes reasoning / thinking tokens, normalized across providers so the figure is comparable (for example, Gemini reports reasoning separately, and we add it into the output count). Input tokens include image tokens, which dominate and differ by model. “Est. cost / task” is that measured token usage multiplied by the model’s published per-1M pricing at the time of our last price sync, so it is an estimate on this benchmark, not a universal model cost. Figures come from a single evaluation run at low temperature; output for reasoning models can vary run to run. Models we haven’t measured (or that don’t expose token usage) show no token or cost figure rather than a zero.
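The cost estimate described above amounts to simple arithmetic, sketched below. All token counts and prices are made-up placeholders, the function names are hypothetical, and the use of a median for input tokens is an assumption (the page only says the output figure is a median).

```python
from statistics import median


def normalized_output_tokens(visible, reasoning=0, reasoning_reported_separately=False):
    """Return an output-token count that always includes reasoning tokens."""
    if reasoning_reported_separately:
        # Provider reports reasoning apart from output, so add it back in.
        return visible + reasoning
    # Provider already folds reasoning into its output count.
    return visible


def estimated_cost_per_task(input_tokens, output_tokens,
                            input_price_per_1m, output_price_per_1m):
    """Token usage multiplied by per-1M-token prices, in the price's currency."""
    total = input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    return total / 1_000_000


# Hypothetical per-prompt measurements: (visible output, separate reasoning).
per_prompt = [(120, 480), (100, 300), (150, 850)]
outputs = [normalized_output_tokens(v, r, reasoning_reported_separately=True)
           for v, r in per_prompt]
median_output = median(outputs)

# Placeholder figures: 10,000 input tokens, $1.00 / 1M input, $4.00 / 1M output.
cost = estimated_cost_per_task(10_000, median_output, 1.00, 4.00)
print(f"median output tokens: {median_output}, est. cost / task: ${cost:.4f}")
```

Because input tokens include image tokens, the input term usually dominates in a calculation like this, even when output is priced higher per token.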

Last evaluated: June 27, 2026
