Vision Evals


Open-vocabulary object detection benchmarks. Models predict bounding boxes for text queries on SaCo-Gold and COCO-100.

5 models evaluated | 250 queries per model

What is Detection?

Detection evals measure how well a model localizes objects in an image given a text query. Results combine three SaCo-Gold configurations (attributes, crowded, metaclip) with COCO-100, a 100-image subset of COCO val2017.

Methodology

For each (image, text-query) pair, the model returns bounding-box predictions with confidence scores. We compute positive micro-F1 (pmF1) at ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05, and average the results. The overall score is the average of the per-dataset pmF1 values, weighted by each dataset's query count.
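The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation code behind this page, and it rests on several assumptions: boxes are (x_min, y_min, x_max, y_max) tuples, predictions have already been filtered by confidence, predictions are matched to ground truth greedily in descending confidence order, and "positive" means only pairs with at least one ground-truth box are counted. The function names (iou, match_counts, pmf1, overall_score) are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), assumed format
Pred = Tuple[Box, float]                  # (box, confidence score)
Pair = Tuple[List[Pred], List[Box]]       # predictions and ground truth for one query

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(preds: List[Pred], gts: List[Box], thr: float) -> Tuple[int, int, int]:
    """Greedy one-to-one matching (an assumption). Returns (tp, fp, fn)."""
    unmatched = list(gts)
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= thr:
            unmatched.remove(best)   # each ground-truth box is matched at most once
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def pmf1(pairs: List[Pair]) -> float:
    """Micro-F1 over positive pairs, averaged across the IoU thresholds."""
    positives = [(p, g) for p, g in pairs if g]   # drop pairs with no ground truth
    f1_scores = []
    for thr in IOU_THRESHOLDS:
        tp = fp = fn = 0
        for preds, gts in positives:
            t, f, n = match_counts(preds, gts, thr)
            tp, fp, fn = tp + t, fp + f, fn + n
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def overall_score(per_dataset: Dict[str, Tuple[float, int]]) -> float:
    """Query-count-weighted average; values are (pmF1, query_count)."""
    total = sum(count for _, count in per_dataset.values())
    return sum(score * count for score, count in per_dataset.values()) / total
```

For example, a single prediction whose IoU with the only ground-truth box is 0.77 counts as a true positive at the six thresholds from 0.50 to 0.75 and as a miss at the remaining four, so its pmF1 under this sketch is 0.6. A real evaluator may use a different matching strategy (for instance optimal assignment), which can change the counts in crowded scenes.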

Last evaluated: June 10, 2026

Frequently Asked Questions