Claude Sonnet 5 vs Grounded SAM

Compare Claude Sonnet 5 and Grounded SAM side-by-side.

These models don't share enough common tasks for a side-by-side demo. See the comparison table below for their capabilities.

Claude Sonnet 5 vs Grounded SAM: Overview

Claude Sonnet 5

Claude Sonnet 5 is a mid-tier large language model from Anthropic, released on June 30, 2026, as the latest model in the Sonnet series and a direct successor to Claude Sonnet 4.6. It is a hybrid reasoning model designed primarily for agentic workflows, software coding, and professional tasks. The model features a 1 million token context window, a 128k maximum output token limit, and runs adaptive thinking by default, giving API users fine-grained control over reasoning effort across five levels (low, medium, high, max, and extra-high). It uses an updated tokenizer shared with Opus 4.7 and later models, which produces approximately 30% more tokens for equivalent text compared to earlier Claude models. On benchmarks, Sonnet 5 scores 63.2% on agentic coding and 81.2% on OSWorld, narrowing the gap with Opus 4.8 while remaining at Sonnet-tier pricing.

The model supports text and image input with text output, and accepts tools including browsers and terminals for autonomous multi-step task execution. Anthropic's safety evaluations report that Sonnet 5 shows a lower rate of undesirable behaviors than Sonnet 4.6 and is generally safer in agentic contexts, with improved resistance to prompt injection and reduced sycophancy. Cybersecurity safeguards equivalent to those on Opus 4.7 and 4.8 are active, though Anthropic notes the model was not deliberately trained on cybersecurity tasks. The model is proprietary and API-only, with no open weights.

Grounded SAM

Grounded SAM is an open-vocabulary image segmentation model developed by IDEA Research, released in January 2024 under the Apache 2.0 license. It combines Grounding DINO, a zero-shot open-vocabulary object detector, with the Segment Anything Model to produce precise segmentation masks for objects identified through free-form text prompts. The two models are used sequentially: Grounding DINO localizes objects from a text query, and SAM generates the corresponding segmentation masks.
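
The two-stage flow described above can be sketched in a few lines. This is only an illustration of the sequencing (detect from text, then segment each detected box), not the actual Grounded SAM code: `grounded_segment`, `toy_detect`, `toy_segment`, and the box format are hypothetical names and conventions, and the toy stand-ins operate on a nested-list "image" rather than calling Grounding DINO or SAM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical conventions for this sketch only.
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive pixel box
Image = List[List[int]]           # toy grayscale image as nested lists
Mask = List[List[bool]]           # per-pixel mask, same shape as the image


@dataclass
class Segment:
    label: str
    box: Box
    mask: Mask


def grounded_segment(
    image: Image,
    prompt: str,
    detect: Callable[[Image, str], List[Tuple[str, Box]]],
    segment: Callable[[Image, Box], Mask],
) -> List[Segment]:
    """Stage 1: text prompt -> boxes. Stage 2: each box -> mask."""
    results = []
    for label, box in detect(image, prompt):
        results.append(Segment(label, box, segment(image, box)))
    return results


def toy_detect(image: Image, prompt: str) -> List[Tuple[str, Box]]:
    """Stand-in detector: one box around all non-zero pixels."""
    coords = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v > 0]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [(prompt, (min(xs), min(ys), max(xs) + 1, max(ys) + 1))]


def toy_segment(image: Image, box: Box) -> Mask:
    """Stand-in segmenter: non-zero pixels that fall inside the box."""
    x0, y0, x1, y1 = box
    return [[(x0 <= x < x1 and y0 <= y < y1 and v > 0)
             for x, v in enumerate(row)]
            for y, row in enumerate(image)]
```

In a real pipeline the two callables would wrap the detector and the segmenter; keeping them as parameters makes the hand-off between the stages (boxes in, masks out) explicit.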

Grounded SAM enables zero-shot instance segmentation without task-specific training data, making it applicable to domains where labeled segmentation data is scarce. It supports arbitrary text queries and can segment objects not represented in standard training sets. The model is commonly used in automated labeling pipelines, robotic perception, and domain-specific vision applications requiring open-vocabulary segmentation.

Claude Sonnet 5 vs Grounded SAM Comparison Table

Property	Claude Sonnet 5	Grounded SAM
Organization	Anthropic	IDEA Research
Category	closed	open
Modality	multimodal	multimodal
Release Date	Jun 2026	Jan 2024
Context Window	1.0M	—
Parameters
License	Proprietary	Apache 2.0
Pricing per 1M tokens
Input $/1M	$2.00
Output $/1M	$10.00
Vision Tasks
Vision Language
Captioning	Demo
Classification	Demo
Document Question Answering
Multi-Label Classification
Object Detection	Demo
OCR	Demo
Visual Question Answering	Demo
Zero Shot Segmentation
Model Features
Multimodal Vision
LLMs with Vision Capabilities
Zero-shot Detection
Vision Evals (pass/fail results, 67 prompts; score bands: ≥75%, 40–74%, <40%)
Visual Understanding
Overall Score	70.15%
Avg Response Time	3.90s
Median input tokens (incl. image tokens)	2.1K
Median output tokens	61
Est. cost / task (on this benchmark)	$0.0048
Defect Detection	73.3% (11/15)
Document Understanding	66.7% (6/9)
Object Counting	20% (2/10)
Object Understanding	92.9% (13/14)
Spatial Understanding	78.9% (15/19)
OCR
Overall Score	83.84%
Avg Response Time	2.77s
Median input tokens (incl. image tokens)	642
Median output tokens	64
Est. cost / task (on this benchmark)	$0.0019
Focused Scene OCR	88.9% (88/99)
Handwritten Math	50% (5/10)
License Plate Recognition	90% (27/30)
Text Recognition	80% (24/30)
VQA & Extraction	80% (48/60)

Output tokens (incl. reasoning) and est. cost / task are measured on this benchmark from a single low-temperature run, and are shown only for models whose run covered at least 90% of prompts.
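
As a sanity check, the figures in the table can be reproduced with simple arithmetic. The sketch below assumes that the per-task cost is roughly the median token counts multiplied by the listed per-million prices, and that each overall score is total passes divided by total prompts across the sub-categories; the page does not state its exact formulas, and the function names are illustrative.

```python
# Listed prices for Claude Sonnet 5, in USD per 1M tokens (from the table).
INPUT_USD_PER_M = 2.00
OUTPUT_USD_PER_M = 10.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one task from its token counts."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def overall_score(results: list[tuple[int, int]]) -> float:
    """Pooled pass rate in percent from (passed, total) pairs."""
    passed = sum(p for p, _ in results)
    total = sum(t for _, t in results)
    return 100.0 * passed / total


# (passed, total) per sub-category, copied from the table.
visual_understanding = [(11, 15), (6, 9), (2, 10), (13, 14), (15, 19)]
ocr = [(88, 99), (5, 10), (27, 30), (24, 30), (48, 60)]

# Median tokens from the table; 2.1K is taken as 2100 (an assumption,
# since the table value is rounded).
print(round(estimate_cost(2100, 61), 4))           # Visual Understanding
print(round(estimate_cost(642, 64), 4))            # OCR
print(round(overall_score(visual_understanding), 2))
print(round(overall_score(ocr), 2))
```

Under these assumptions the results agree with the table: about $0.0048 and $0.0019 per task, and pooled scores of 70.15% (47/67) and 83.84% (192/229).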