Vision Evals
Open-vocabulary object detection benchmarks. Models draw bounding boxes from text queries across SaCo-Gold and COCO-100.
5 models evaluated | 250 queries per model
What is Detection?
Detection evals measure how well models localize objects given a text query. We combine results from two datasets: SaCo-Gold (attributes, crowded, and metaclip configs) and COCO-100, a 100-image subset of COCO val2017.
Methodology
For each (image, text-query) pair, the model returns bounding-box predictions with confidence scores. We compute positive micro-F1 (pmF1) averaged over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05. The overall score is the average of the per-dataset pmF1 values, weighted by each dataset's query count.
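The scoring described above can be sketched roughly as follows. This is an illustrative approximation, not the benchmark's actual evaluation code. The greedy score-ordered matching, the handling of empty counts, and all names (`iou`, `count_matches`, `pm_f1`, `overall_score`) are assumptions made for the example. It also assumes every query passed in is a positive one, meaning it has at least one ground-truth box.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Pred = Tuple[Box, float]                  # (box, confidence score)
Query = Tuple[List[Pred], List[Box]]      # (predictions, ground-truth boxes)

# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds: List[Pred], gts: List[Box], thr: float) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one query at one IoU threshold.

    Assumption: predictions are matched greedily in descending score order,
    and each ground-truth box can be matched at most once.
    """
    unmatched = list(gts)
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_idx, best_iou = -1, thr
        for idx, gt in enumerate(unmatched):
            value = iou(box, gt)
            if value >= best_iou:
                best_idx, best_iou = idx, value
        if best_idx >= 0:
            unmatched.pop(best_idx)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def pm_f1(queries: List[Query]) -> float:
    """Micro-F1 over all queries, averaged across the IoU thresholds."""
    f1_values = []
    for thr in IOU_THRESHOLDS:
        tp = fp = fn = 0
        for preds, gts in queries:
            q_tp, q_fp, q_fn = count_matches(preds, gts, thr)
            tp, fp, fn = tp + q_tp, fp + q_fp, fn + q_fn
        denom = 2 * tp + fp + fn
        # Assumption: a threshold with no predictions and no ground truth scores 0.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


def overall_score(per_dataset: Dict[str, Tuple[float, int]]) -> float:
    """Query-count-weighted average of per-dataset pmF1 values.

    per_dataset maps a dataset name to (pmF1, number of queries).
    """
    total = sum(n for _, n in per_dataset.values())
    return sum(score * n for score, n in per_dataset.values()) / total
```

In this sketch, a single prediction whose IoU with the ground truth is 0.8 counts as a true positive at the seven thresholds up to 0.80 and as a miss at the remaining three, so it contributes a pmF1 of 0.7 on its own. Micro-averaging pools true positives, false positives, and false negatives across all queries before computing F1, so queries with many objects weigh more than queries with few.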
Last evaluated: June 10, 2026