Meta: Mask R-CNN

Mask R-CNN Overview

Mask R-CNN is an instance segmentation model developed by Facebook AI Research (now part of Meta) and published at ICCV in October 2017; the preprint appeared on arXiv in March 2017. It extends Faster R-CNN by adding a branch that predicts a binary segmentation mask for each detected object, running in parallel with the classification and bounding box regression branches. A key contribution is RoIAlign, which replaces the coarse coordinate quantization of RoIPool with bilinear interpolation at exactly computed sampling points. This preserves the spatial correspondence between features and input pixels and significantly improves mask quality.
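To make the RoIAlign idea concrete, here is a minimal pure-Python sketch of bilinear sampling over a region of interest. It is an illustration under stated assumptions, not the Detectron2 implementation: the function names `bilinear_sample` and `roi_align` are hypothetical, the feature map is a single-channel list of rows, pixel centers are assumed to sit at integer coordinates, and sample points in each bin are averaged.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at a real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point so the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool an RoI into an out_size x out_size grid without rounding its coordinates.

    roi is (y1, x1, y2, x2) in feature-map coordinates and may be fractional.
    Each output bin averages samples x samples bilinearly interpolated points.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, yy, xx)
            row.append(total / (samples * samples))
        output.append(row)
    return output


# A horizontal ramp: the feature value equals the x coordinate.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(ramp, (0.5, 0.5, 2.5, 2.5))
```

Because the RoI corners are never rounded, the pooled values track the fractional box position. RoIPool, by contrast, would first snap the box to integer cells, which shifts the pooled features relative to the input pixels.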

Mask R-CNN achieves strong performance on the COCO instance segmentation benchmark and supports keypoint detection as an additional output head. It remains a foundational architecture in instance segmentation and is available through Meta's Detectron2 framework. The model is most appropriate for tasks requiring pixel-level object delineation, such as medical imaging, autonomous driving, and industrial inspection.

Mask R-CNN Details & Performance

Details

Vision Tasks

Object Detection, Instance Segmentation, Keypoint Detection

Features

Foundation Vision

Performance

Arena Rankings

Not yet ranked in arena

Alternatives to Mask R-CNN

Other models worth comparing for similar use cases.

Meta
Segment Anything Model 2 (SAM 2)
SAM 2 is a real-time image and video segmentation model developed by Meta AI, released in July 2024 under the Apache 2.0 license. It extends the original Segment Anything Model to support video inputs by introducing a streaming memory architecture that maintains object state across frames, enabling consistent segmentation of objects through occlusion, motion, and scene changes. For image inputs, SAM 2 operates similarly to its predecessor with improved mask quality and speed. SAM 2 accepts point, box, and mask prompts and produces object masks interactively or in a fully automated mode. Its memory architecture enables video segmentation at real-time speeds. SAM 2 is used in annotation pipelines, video analysis, robotic perception, and any application requiring high-quality promptable segmentation across both images and video.
Meta
Segment Anything Model (SAM)
The Segment Anything Model is a promptable image segmentation foundation model developed by Meta AI, released in April 2023 under the Apache 2.0 license. It introduces a general-purpose segmentation architecture trained on SA-1B, a dataset of over 1 billion masks across 11 million images collected using a data engine that leveraged the model itself. SAM accepts point, bounding box, and mask prompts and generates high-quality segmentation masks for any object in an image, including objects not seen during training. SAM achieves strong zero-shot performance across a wide range of segmentation tasks and domains. Its promptable interface makes it suitable as a building block for automated annotation, interactive segmentation tools, and integration with detection models such as Grounding DINO. SAM has been extended by subsequent works including SAM 2, SAM 3, and Grounded-SAM.
Meta
Detectron2
Detectron2 is a computer vision model library developed by Facebook AI Research (Meta), released in September 2019. It serves as a comprehensive platform for object detection, instance segmentation, panoptic segmentation, keypoint detection, and DensePose, implemented in PyTorch. It is the successor to the original Detectron framework, which was written in Caffe2, and offers a more modular and extensible codebase designed for both research and production use. Detectron2 includes implementations of Faster R-CNN, Mask R-CNN, RetinaNet, Cascade R-CNN, Panoptic FPN, and several other architectures. Its modular design allows components such as backbones, necks, and heads to be swapped independently, making it widely used as a baseline framework in academic research. It supports training on COCO-format datasets and integrates with standard distributed training setups.
YOLOv8 Instance Segmentation
YOLOv8 Instance Segmentation is the segmentation variant of the YOLOv8 model developed by Ultralytics, released in January 2023 under the AGPL-3.0 license. It extends the standard YOLOv8 detection head with a mask prediction branch that generates pixel-level segmentation masks for each detected object using a prototype mask approach. This enables real-time instance segmentation within a single forward pass. YOLOv8 Instance Segmentation shares the same backbone and neck architecture as the base detection model and is available in the same size range. It is deployable through Roboflow Inference and supports fine-tuning on custom COCO-format segmentation datasets. It is suited for applications requiring both object localization and precise mask prediction at real-time speeds.
YOLO11
YOLO11 is an object detection and multi-task vision model developed by Ultralytics, released in September 2024 under the AGPL-3.0 license. It is the latest generation in the Ultralytics YOLO series and supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a single unified framework. YOLO11 introduces architectural refinements that improve accuracy while reducing parameter count compared to YOLOv8 at equivalent model sizes. YOLO11 is available in five model sizes from Nano to Extra Large and is deployable through the Ultralytics Python package, Roboflow Inference, and export formats including ONNX, TensorRT, and CoreML. It supports fine-tuning on custom datasets through the standard Ultralytics training API.
YOLOE
YOLOE (YOLO with Everything) is an open-vocabulary object detection and segmentation model developed by THU-MIG at Tsinghua University, released in March 2025 under the AGPL-3.0 license. It extends the YOLO architecture to support open-vocabulary detection through text and visual prompts, enabling the model to detect arbitrary object categories beyond a fixed training set without retraining. The design integrates prompt encoding directly into the YOLO framework while preserving real-time inference speed. YOLOE is evaluated on COCO and LVIS benchmarks and supports both closed-set and open-vocabulary detection modes. It is built on the Ultralytics codebase and maintains compatibility with standard YOLO training and export workflows. YOLOE is suited for applications requiring flexible, prompt-driven object detection where the target object vocabulary may change at inference time.
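
For comparison with Mask R-CNN's per-region mask branch, the prototype mask approach mentioned for YOLOv8 Instance Segmentation above can be sketched roughly as follows. This is a generic illustration of the idea, not the Ultralytics implementation: the function name `assemble_mask`, the half-open integer box format, and the 0.5 threshold are all assumptions. The model predicts a small set of image-wide prototype maps plus one coefficient vector per detection, and each instance mask is a sigmoid of the coefficient-weighted sum of prototypes, cropped to the detection box.

```python
import math


def assemble_mask(prototypes, coeffs, box, threshold=0.5):
    """Combine K prototype maps into one binary instance mask (hypothetical helper).

    prototypes: list of K maps, each a list of H rows with W floats.
    coeffs:     K per-detection mask coefficients.
    box:        (y1, x1, y2, x2) integer box, half-open, in prototype coordinates.
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    y1, x1, y2, x2 = box
    mask = [[0] * w for _ in range(h)]
    # Only pixels inside the detection box can belong to the instance.
    for y in range(max(y1, 0), min(y2, h)):
        for x in range(max(x1, 0), min(x2, w)):
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask[y][x] = 1 if prob > threshold else 0
    return mask


# Two toy prototypes: one responds on the left half, one on the right half.
left = [[2.0, 2.0, -2.0, -2.0], [2.0, 2.0, -2.0, -2.0]]
right = [[-2.0, -2.0, 2.0, 2.0], [-2.0, -2.0, 2.0, 2.0]]
left_instance = assemble_mask([left, right], [1.0, 0.0], (0, 0, 2, 4))
```

The prototypes are computed once per image and shared by all detections, which is what keeps this scheme fast. Mask R-CNN instead runs its mask head separately on the aligned features of each region.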

Mask R-CNN License

MIT

License terms and commercial-use guidance for Mask R-CNN.

License information is provided as a guide and is not legal advice.