
Meta: SAM 3D Objects

SAM 3D Objects Overview

SAM 3D Objects is a 3D reconstruction model released on November 19, 2025 by Meta AI as part of the broader SAM 3 release. It extends the Segment Anything Model family from 2D segmentation into 3D object reconstruction, predicting geometry, texture, and spatial layout for individual objects from a single RGB image. Given an image together with a prompt identifying the object (a segmentation mask, point, or bounding box), the model outputs a full textured 3D mesh, without requiring multi-view captures, depth sensors, or known camera parameters.

SAM 3D Objects uses a two-stage transformer architecture: a coarse stage predicts 3D shape and object pose, and a refinement stage then adds texture and surface detail; DINOv2 encodes the input image. The model is trained with a human-and-model-in-the-loop data engine that combines synthetic 3D assets with real-image annotations, producing approximately 3.14 million mesh annotations across nearly 1 million images.

Meta released SAM 3D Objects alongside SAM 3D Body, a companion model for single-image human mesh recovery, and SAM 3D Artist Objects (SA-3DAO), a new evaluation benchmark built on artist-created 3D ground truth. The model is designed to work with SAM 3 in downstream applications including augmented reality, robotics, content creation, and visual effects, and it already powers the View in Room feature on Facebook Marketplace. SAM 3D Objects is released under the SAM 3 license; review the license terms before commercial use.
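Predicting spatial layout means the coarse stage places each reconstructed object in the scene, not only its shape. A common way to express such a pose is a similarity transform, x' = s·R·x + t, applied to every vertex of the canonical mesh. The sketch below assumes exactly that representation (a uniform scale s, a 3×3 rotation matrix R, and a translation t); SAM 3D Objects' actual pose parameterization and output format may differ, and `place_object` and `rotation_z` are hypothetical helpers, not part of any released API.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix about the z-axis; theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def place_object(vertices, scale, rotation, translation):
    """Apply x' = s * R @ x + t to each canonical-space vertex (assumed pose format)."""
    placed = []
    for v in vertices:
        # Matrix-vector product R @ v, written out for plain Python lists.
        rotated = [sum(rotation[i][j] * v[j] for j in range(3)) for i in range(3)]
        # Uniform scale, then translation into scene/camera coordinates.
        placed.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return placed


corner = [(1.0, 0.0, 0.0)]
print(place_object(corner, 2.0, rotation_z(math.pi / 2), (0.0, 0.0, 5.0)))
```

Rotating the point (1, 0, 0) by 90° about z, scaling by 2, and shifting 5 units along z gives approximately (0, 2, 5); the first coordinate carries a tiny floating-point residue from cos(π/2).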

SAM 3D Objects Details & Performance

Details


Vision Tasks

Instance Segmentation, 3D Reconstruction

Features

Foundation Vision


Alternatives to SAM 3D Objects

Other models worth comparing for similar use cases.

Meta
Segment Anything Model 2 (SAM 2)
SAM 2 is a real-time image and video segmentation model developed by Meta AI, released in July 2024 under the Apache 2.0 license. It extends the original Segment Anything Model to support video inputs by introducing a streaming memory architecture that maintains object state across frames, enabling consistent segmentation of objects through occlusion, motion, and scene changes. For image inputs, SAM 2 operates similarly to its predecessor, with improved mask quality and speed. SAM 2 accepts point, box, and mask prompts and produces object masks interactively or in a fully automated mode. Its memory architecture enables video segmentation at real-time speeds. SAM 2 is used in annotation pipelines, video analysis, robotic perception, and any application requiring high-quality promptable segmentation across both images and video.
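The streaming memory idea can be illustrated with two bounded first-in, first-out queues: one for the most recent frames and one for frames where the user supplied a prompt, so prompted frames are not evicted by ordinary tracking updates. This is a toy sketch, assuming placeholder memory entries rather than real SAM 2 feature maps; the `MemoryBank` class and its queue sizes are hypothetical and not SAM 2's API.

```python
from collections import deque


class MemoryBank:
    """Toy streaming memory with separate FIFO queues (hypothetical, not SAM 2's API)."""

    def __init__(self, num_recent, num_prompted):
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.recent = deque(maxlen=num_recent)
        self.prompted = deque(maxlen=num_prompted)

    def add(self, frame_idx, memory, prompted=False):
        """Store a per-frame memory entry in the appropriate queue."""
        queue = self.prompted if prompted else self.recent
        queue.append((frame_idx, memory))

    def context(self):
        """Return the memories the next frame would attend to, ordered by frame index."""
        return sorted([*self.prompted, *self.recent], key=lambda item: item[0])


bank = MemoryBank(num_recent=3, num_prompted=1)
bank.add(0, "mem0", prompted=True)  # user clicked on the object in frame 0
for t in range(1, 6):
    bank.add(t, f"mem{t}")          # tracking updates for frames 1..5
print([idx for idx, _ in bank.context()])  # [0, 3, 4, 5]
```

With three recent slots and one prompted slot, the prompt on frame 0 survives while frames 1 and 2 have been pushed out of the recent queue.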
Meta
Segment Anything Model (SAM)
The Segment Anything Model is a promptable image segmentation foundation model developed by Meta AI, released in April 2023 under the Apache 2.0 license. It introduces a general-purpose segmentation architecture trained on SA-1B, a dataset of over 1 billion masks across 11 million images collected using a data engine that leveraged the model itself. SAM accepts point, bounding box, and mask prompts and generates high-quality segmentation masks for any object in an image, including objects not seen during training. SAM achieves strong zero-shot performance across a wide range of segmentation tasks and domains. Its promptable interface makes it suitable as a building block for automated annotation, interactive segmentation tools, and integration with detection models such as Grounding DINO. SAM has been extended by subsequent works including SAM 2, SAM 3, and Grounded-SAM.
Meta
SAM 3
Released on November 19th, 2025, Segment Anything 3 (SAM 3) is a zero-shot segmentation model that “detects, segments, and tracks objects in images and videos based on concept prompts.” This model was developed by Meta as the third model in the Segment Anything series.

Unlike earlier SAM models (Segment Anything and Segment Anything 2), which are prompted with points, boxes, or masks, SAM 3 accepts short text prompts: given the prompt “shipping container”, it generates precise segmentation masks for every shipping container in an image, each mask corresponding to the location of an object matching the text prompt.
SAM-CLIP
SAM-CLIP is a unified vision foundation model introduced by researchers at Apple and the University of Illinois Urbana-Champaign in October 2023. It merges two popular vision foundation models, Meta's Segment Anything Model (SAM) and OpenAI's CLIP, into a single shared Vision Transformer backbone through a combination of multi-task learning, continual learning, and teacher-student distillation. The method requires only a small fraction of the original pretraining datasets and demonstrates that complementary capabilities from distinct foundation models can be consolidated without retraining from scratch, reducing the storage and compute cost of running both models at inference time. The resulting model retains SAM's zero-shot segmentation ability and CLIP's zero-shot classification and image-text retrieval, while introducing new capabilities the individual models lacked. SAM-CLIP establishes state-of-the-art results on zero-shot semantic segmentation across five benchmarks, improving mean IoU by 6.8 points on Pascal VOC and 5.9 points on COCO-Stuff over prior specialized models. The paper was accepted at the UniReps Workshop at NeurIPS 2023 and the eLVM Workshop at CVPR 2024. Apple has published the research but has not released model weights or inference code publicly.
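The mean IoU figures quoted for SAM-CLIP average, over classes, the ratio of pixels labeled as a class in both prediction and ground truth (intersection) to pixels labeled as that class in either one (union). A minimal sketch, assuming flat lists of integer class labels for a single image; benchmark implementations typically accumulate intersections and unions over the whole dataset and skip "void" pixels, which this toy version does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat per-pixel label lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so mIoU = 7/12 (about 0.583).
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

An improvement of "6.8 points" therefore refers to this average expressed as a percentage, for example moving from 50.0 to 56.8 mIoU.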
IDEA Research
Grounded SAM
Grounded SAM is an open-vocabulary image segmentation model developed by IDEA Research, released in January 2024 under the Apache 2.0 license. It combines Grounding DINO, a zero-shot open-vocabulary object detector, with the Segment Anything Model to produce precise segmentation masks for objects identified through free-form text prompts. The two models are used sequentially: Grounding DINO localizes objects from a text query, and SAM generates the corresponding segmentation masks. Grounded SAM enables zero-shot instance segmentation without task-specific training data, making it applicable to domains where labeled segmentation data is scarce. It supports arbitrary text queries and can segment objects not represented in standard training sets. The model is commonly used in automated labeling pipelines, robotic perception, and domain-specific vision applications requiring open-vocabulary segmentation.
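The sequential detect-then-segment flow can be sketched with the two models passed in as plain functions. Everything here is an assumption for illustration: `grounded_segment`, the `Detection` record, the toy detector, and the box-filling segmenter are hypothetical stand-ins, and the 0.35 box-score threshold is an illustrative default rather than a documented setting. In a real pipeline, `detect` would wrap Grounding DINO and `segment` would wrap a SAM predictor prompted with each box.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), pixel coordinates
Mask = List[List[int]]
Image = Sequence[Sequence[float]]


@dataclass
class Detection:
    box: Box
    phrase: str
    score: float


def grounded_segment(
    image: Image,
    text_query: str,
    detect: Callable[[Image, str], List[Detection]],
    segment: Callable[[Image, Box], Mask],
    box_threshold: float = 0.35,
) -> List[Tuple[Detection, Mask]]:
    """Stage 1: text -> boxes via the detector. Stage 2: each kept box -> mask."""
    kept = [d for d in detect(image, text_query) if d.score >= box_threshold]
    return [(d, segment(image, d.box)) for d in kept]


# --- Hypothetical stand-ins for the two models ---
def toy_detect(image, query):
    return [Detection((0, 0, 2, 2), query, 0.80), Detection((1, 1, 3, 3), query, 0.20)]


def box_fill_segment(image, box):
    """Pretend segmenter: marks every pixel inside the box."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(len(image[0]))]
            for y in range(len(image))]


image = [[0.0] * 4 for _ in range(4)]
results = grounded_segment(image, "cat", toy_detect, box_fill_segment)
print(len(results), sum(map(sum, results[0][1])))  # 1 4
```

Keeping the two stages behind plain function parameters mirrors the description above: the detector can be swapped without touching the segmenter, and vice versa.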
Meta
Mask R-CNN
Mask R-CNN is an instance segmentation model developed by Facebook AI Research (Meta), released in October 2017. It extends Faster R-CNN by adding a parallel branch that predicts binary segmentation masks for each detected object, independent of the classification and bounding box regression branches. A key contribution is RoIAlign, which replaces RoIPool with bilinear interpolation to preserve spatial correspondence between features and input pixels, significantly improving mask quality. Mask R-CNN achieves strong performance on the COCO instance segmentation benchmark and supports keypoint detection as an additional output head. It remains a foundational architecture in instance segmentation and is available through Meta's Detectron2 framework. The model is most appropriate for tasks requiring pixel-level object delineation, such as medical imaging, autonomous driving, and industrial inspection.
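The difference RoIAlign makes is easiest to see on a single-channel feature map. The sketch below, assuming a feature map stored as nested Python lists and an RoI already expressed in continuous feature-map coordinates (y0, x0, y1, x1), splits the RoI into output bins without rounding, bilinearly samples a small regular grid of points in each bin, and averages them. It is a teaching sketch rather than Detectron2's implementation; in practice an existing operator such as `torchvision.ops.roi_align` would be used.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map at fractional (y, x) from its four neighbors."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid grid
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_h, out_w, sampling=2):
    """Pool one RoI to an out_h x out_w grid without quantizing coordinates."""
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_h
    bin_w = (x_end - x_start) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Regularly spaced sample points inside bin (i, j), averaged.
            for sy in range(sampling):
                for sx in range(sampling):
                    yy = y_start + (i + (sy + 0.5) / sampling) * bin_h
                    xx = x_start + (j + (sx + 0.5) / sampling) * bin_w
                    total += bilinear_sample(feature, yy, xx)
            row.append(total / (sampling * sampling))
        output.append(row)
    return output


ramp = [[float(x) for x in range(4)] for _ in range(4)]  # value equals column index
print(roi_align(ramp, (0.0, 0.0, 2.0, 2.0), out_h=1, out_w=1))  # [[1.0]]
```

Because bilinear interpolation reproduces a linear ramp exactly, averaging the samples at x = 0.5 and x = 1.5 over the RoI from (0, 0) to (2, 2) returns 1.0, the ramp's mean over that region; RoIPool would instead snap the RoI and bin boundaries to integer cells.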

SAM 3D Objects License

Custom

License terms and commercial-use guidance for SAM 3D Objects.

License information is provided as a guide and is not legal advice.