Florence-2 vs Grounding DINO

Compare Florence-2 and Grounding DINO side-by-side.

These models don't share enough common tasks for a side-by-side demo. See the comparison table below for their capabilities.

Florence-2 vs Grounding DINO: Overview

Florence-2

Florence-2, introduced by Microsoft Research at CVPR 2024, is an open-source vision-language foundation model designed to unify diverse computer vision tasks within a single sequence-to-sequence framework. Unlike traditional models that specialize in one task, Florence-2 takes an image together with a text prompt that names the task and returns its answer as text. This covers captioning, object detection, segmentation, OCR, and region-based grounding, with spatial outputs such as bounding boxes encoded as location tokens in the output sequence. It comes in two sizes, Florence-2-base (~230M parameters) and Florence-2-large (~770M parameters), and is trained on FLD-5B, a dataset of ~126M images with ~5.4B annotations.

The model demonstrates strong zero-shot and fine-tuned performance, often rivaling larger vision-language systems while remaining lightweight and efficient. Its weights are publicly available under the MIT license, which makes it accessible for fine-tuning and deployment in applications like VQA, content tagging, accessibility, and research. Florence-2’s compact design, versatility, and openness position it as a practical alternative to larger proprietary multimodal models.
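Because Florence-2 returns every result as text, a detection answer has to be parsed back into boxes before it can be used. The sketch below shows one way that could look. It assumes the output is a string of labels each followed by four `<loc_N>` tokens, that coordinates are quantized into 1000 bins, and that a bin maps back to its centre; the function name `parse_detections` and the regular expression are illustrative, not part of any official API. In practice the model's own processor performs this step.

```python
import re

# Assumption: each detection is written as "label<loc_x1><loc_y1><loc_x2><loc_y2>".
DETECTION_PATTERN = re.compile(
    r"([^<>]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>"
)


def parse_detections(text, image_width, image_height, num_bins=1000):
    """Turn a location-token string into labelled pixel boxes.

    Hypothetical helper: num_bins=1000 and the bin-centre mapping are
    assumptions made for this sketch.
    """
    detections = []
    for match in DETECTION_PATTERN.finditer(text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(value) for value in match.groups()[1:])
        # Map each quantized bin index back to the centre of its pixel range.
        box = (
            (x1 + 0.5) * image_width / num_bins,
            (y1 + 0.5) * image_height / num_bins,
            (x2 + 0.5) * image_width / num_bins,
            (y2 + 0.5) * image_height / num_bins,
        )
        detections.append({"label": label, "box": box})
    return detections


if __name__ == "__main__":
    sample = "car<loc_100><loc_200><loc_500><loc_600>person<loc_0><loc_0><loc_999><loc_999>"
    for detection in parse_detections(sample, image_width=1000, image_height=500):
        print(detection)
```

Keeping boxes in the same token stream as captions and OCR text is what lets a single sequence-to-sequence model cover all of these tasks; the cost is this extra decoding step on the consumer side.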

Grounding DINO

Grounding DINO is an open-vocabulary object detection model developed by IDEA Research, released in March 2023 under the Apache 2.0 license. It extends the DINO transformer-based detector with grounded pre-training, enabling it to detect arbitrary objects described by free-form text queries rather than a fixed set of predefined categories. The model integrates a text encoder with a visual backbone through a feature fusion module that aligns language and visual representations at multiple scales.

Grounding DINO achieves strong zero-shot detection performance on COCO, LVIS, and ODinW benchmarks, and supports referring expression comprehension tasks. It is widely used as a foundation for open-vocabulary detection pipelines and as the detection backbone in systems such as Grounded-SAM. The model is particularly suited for applications requiring flexible, text-driven object localization across diverse domains.
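Text-driven detection means the category list is supplied at inference time as a string. The sketch below illustrates that workflow without loading the model: it builds a query from free-form labels and then keeps only confident detections. The lowercase, period-separated query format follows the convention commonly documented for Grounding DINO, but treat it as an assumption to check against the version you use. The helper names `build_text_query` and `keep_confident`, the detection dictionaries, and the 0.35 threshold are all hypothetical.

```python
def build_text_query(labels):
    """Join free-form labels into one lowercase, period-separated query.

    Assumption: the detector expects a form such as "cat . remote control ."
    """
    cleaned = []
    for label in labels:
        # Normalize case and drop any trailing period the caller included.
        text = label.strip().lower().rstrip(".").strip()
        if text:
            cleaned.append(text)
    if not cleaned:
        return ""
    return " . ".join(cleaned) + " ."


def keep_confident(detections, score_threshold=0.35):
    """Drop detections whose score is below the threshold.

    `detections` is a hypothetical list of dicts with "label" and "score" keys;
    0.35 is an illustrative value, not a recommended setting.
    """
    return [d for d in detections if d["score"] >= score_threshold]


if __name__ == "__main__":
    print(build_text_query(["Cat", "Remote Control."]))
    fake_output = [
        {"label": "cat", "score": 0.81},
        {"label": "remote control", "score": 0.22},
    ]
    print(keep_confident(fake_output))
```

Changing what the detector looks for is then a matter of editing the label list rather than retraining on a new fixed set of categories.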

Florence-2 vs Grounding DINO Comparison Table

Property        Florence-2    Grounding DINO
Organization    Microsoft     IDEA Research
Category        open          open
Modality        multimodal    vision
Release Date    Jun 2024      Mar 2023
Context Window
Parameters      230M-770M     172M-341M
License         MIT           Apache 2.0
Vision Tasks
Object Detection
Captioning
Instance Segmentation
OCR
Open Vocabulary Object Detection
Phrase Grounding
Region Proposal
Model Features
Foundation Vision
Zero-shot Detection