SAM 3 vs Gemini 3.1 Pro + 1 other
Compare SAM 3, Gemini 3.1 Pro, and 1 other vision model side-by-side. Test these models on Object Detection in the Playground.
Compare these vision models live
Run the same image across every model that supports a task and compare their outputs side-by-side.
Detect and compare bounding boxes across models on the same image.
Model Overviews
Released on November 19th, 2025, Segment Anything 3 (SAM 3) is a zero-shot image segmentation model that “detects, segments, and tracks objects in images and videos based on concept prompts.” This model was developed by Meta as the third model in the Segment Anything series.
Unlike the earlier SAM models (Segment Anything and Segment Anything 2), SAM 3 accepts a text prompt such as “shipping container” and generates precise segmentation masks for every shipping container in the image. Each mask corresponds to the location of an object matched by the text prompt.
SAM 3 vs Gemini 3.1 Pro vs GPT-5.5 Comparison Table
| Property | SAM 3 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|
| Organization | Meta | Google | OpenAI |
| Category | closed | closed | closed |
| Modality | multimodal | multimodal | multimodal |
| Release Date | Nov 2025 | Feb 2026 | Apr 2026 |
| Context Window | — | 1.0M | 1.0M |
| Parameters | | | |
| License | Proprietary | Proprietary | Proprietary |
| Pricing per 1M tokens | | | |
| Input $/1M | | $2.00 | $5.00 |
| Output $/1M | | $12.00 | $30.00 |
| Vision Tasks | | | |
| Object Detection | Demo | Demo | Demo |
| Captioning | | Demo | Demo |
| Classification | | Demo | Demo |
| OCR | | Demo | Demo |
| Vision Language | | | |
| Visual Question Answering | | Demo | Demo |
| Instance Segmentation | | | |
| Promptable Concept Segmentation | Demo | | |
| Video Object Tracking | | | |
| Zero Shot Segmentation | | | |
| Model Features | | | |
| Foundation Vision | | | |
| LLMs with Vision Capabilities | | | |
| Multimodal Vision | | | |
| Zero-shot Detection | | | |
| Vision Evals (pass/fail results · 66 prompts; score key: ≥75%, 40–74%, <40%) | | | |
| Visual Understanding | | | |
| Overall Score | | 75.76% | 77.61% |
| Avg Response Time | | 6.13s | 30.12s |
| Median input tokens (incl. image tokens) | | 1.1K | 1.4K |
| Median output tokens | | 11 | 138 |
| Est. cost / task (on this benchmark) | | $0.0024 | $0.011 |
| Defect Detection | | 73.3% (11/15) | 86.7% (13/15) |
| Document Understanding | | 88.9% (8/9) | 88.9% (8/9) |
| Object Counting | | 44.4% (4/9) | 30% (3/10) |
| Object Understanding | | 92.9% (13/14) | 92.9% (13/14) |
| Spatial Understanding | | 73.7% (14/19) | 78.9% (15/19) |
| OCR | | | |
| Overall Score | | 89.52% | 81.22% |
| Avg Response Time | | 3.11s | 5.16s |
| Median input tokens (incl. image tokens) | | 1.1K | 105 |
| Median output tokens | | 12 | 83 |
| Est. cost / task (on this benchmark) | | $0.0024 | $0.0030 |
| Focused Scene OCR | | 94.9% (94/99) | 77.8% (77/99) |
| Handwritten Math | | 90% (9/10) | 40% (4/10) |
| License Plate Recognition | | 90% (27/30) | 93.3% (28/30) |
| Text Recognition | | 86.7% (26/30) | 83.3% (25/30) |
| VQA & Extraction | | 81.7% (49/60) | 86.7% (52/60) |
Output tokens (incl. reasoning) and est. cost / task are measured on this benchmark from a single low-temperature run, and shown only for models whose run covered at least 90% of prompts.
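As a rough sanity check on the Visual Understanding figures above, the short Python sketch below recomputes the overall score and the estimated cost per task for the two models that report scores. It rests on two assumptions that the page does not confirm: that the overall score is the pooled pass rate (total passes divided by total prompts), and that cost is token count times the listed per-1M-token price. The names `overall_score` and `estimated_cost` and the `"first"`/`"second"` labels are illustrative only and are not part of any published tooling.

```python
# (passed, total) per category, copied from the Visual Understanding rows,
# for the two models that report scores, in table order.
VISUAL_UNDERSTANDING = {
    "first": {
        "Defect Detection": (11, 15),
        "Document Understanding": (8, 9),
        "Object Counting": (4, 9),
        "Object Understanding": (13, 14),
        "Spatial Understanding": (14, 19),
    },
    "second": {
        "Defect Detection": (13, 15),
        "Document Understanding": (8, 9),
        "Object Counting": (3, 10),
        "Object Understanding": (13, 14),
        "Spatial Understanding": (15, 19),
    },
}


def overall_score(categories):
    """Pooled pass rate in percent (assumed aggregation: passes / prompts)."""
    passed = sum(p for p, _ in categories.values())
    total = sum(t for _, t in categories.values())
    return 100.0 * passed / total


def estimated_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars, assuming prices are quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for label, categories in VISUAL_UNDERSTANDING.items():
        print(label, f"{overall_score(categories):.2f}%")

    # Median token counts and prices from the table; "1.1K" and "1.4K" are
    # rounded in the source, so these results are only approximate.
    print(f"${estimated_cost(1100, 11, 2.00, 12.00):.4f}")
    print(f"${estimated_cost(1400, 138, 5.00, 30.00):.4f}")
```

Under these assumptions the pooled pass rates match the listed overall scores, and the cost estimates land close to the listed per-task costs. The small gap in cost is expected, because the table rounds token counts and the benchmark may average per-prompt costs rather than price the median prompt.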