Roboflow

USTC: D-FINE

D-FINE Overview

D-FINE is a real-time object detection model introduced in October 2024 by researchers at the University of Science and Technology of China. It builds on the DETR family of transformer-based detectors by reformulating bounding box regression as a Fine-grained Distribution Refinement task. Rather than predicting box coordinates directly, D-FINE iteratively refines probability distributions over coordinate offsets across decoder layers, which provides finer localization granularity without adding inference cost. The architecture also replaces the encoder's CSP blocks with GELAN modules and inserts a Target Gating Layer after the decoder's cross-attention to reduce representational entanglement across queries. A second contribution, Global Optimal Localization Self-Distillation, transfers localization knowledge from refined deeper-layer predictions back to earlier decoder layers through internal self-distillation.
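
The refinement idea described above can be illustrated with a small sketch. It is not the D-FINE implementation: the uniform bin offsets, the zero starting logits, and every function name here are assumptions chosen for illustration. It shows only the general pattern of each decoder layer adding a residual to shared logits and reading an edge offset off the resulting distribution as an expected value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_offsets):
    """Expected offset under the distribution defined by the logits."""
    probs = softmax(logits)
    return sum(p * w for p, w in zip(probs, bin_offsets))


def refine_edge(initial_distance, box_size, layer_deltas, bin_offsets):
    """Refine one box edge across decoder layers (illustrative only).

    Each layer contributes a residual to the running logits; the edge
    distance is the initial distance plus the expected offset, scaled
    by the box size (an assumed normalization).
    """
    logits = [0.0] * len(bin_offsets)  # assumed starting point
    distances = []
    for delta in layer_deltas:
        logits = [a + b for a, b in zip(logits, delta)]
        offset = expected_offset(logits, bin_offsets)
        distances.append(initial_distance + box_size * offset)
    return distances


if __name__ == "__main__":
    # Hypothetical bins and per-layer residuals, not values from the paper.
    bins = [-0.5, -0.25, 0.0, 0.25, 0.5]
    deltas = [[0.0] * 5, [0.0, 0.0, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 1.0, 0.0]]
    print(refine_edge(40.0, 100.0, deltas, bins))
```

Because every layer updates the same logits rather than predicting a fresh box, later layers only need to make small corrections to the distribution left by earlier ones.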

D-FINE is released in five model sizes (Nano, Small, Medium, Large, and X), with D-FINE-L achieving 54.0% AP on the Microsoft COCO benchmark at 124 FPS on an NVIDIA T4 GPU, and D-FINE-X reaching 55.8% AP at 78 FPS. Pretraining on the Objects365 dataset further improves accuracy to 57.1% AP for the L variant and 59.3% AP for the X variant. The paper was accepted at ICLR 2025 as a Spotlight. Code and pretrained weights are released under the Apache 2.0 license, making the model suitable for commercial use.
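
For readers who think in latency rather than throughput, the FPS figures above can be converted to an approximate per-image time. The sketch below assumes the reported FPS was measured one image at a time, so that latency is simply the reciprocal of FPS; that assumption and the helper name are illustrative, not taken from the paper.

```python
def fps_to_latency_ms(fps):
    """Convert frames per second to milliseconds per frame.

    Assumes single-image (batch size 1) measurement, so latency = 1 / FPS.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # FPS values as quoted in the text for an NVIDIA T4 GPU.
    reported_fps = {"D-FINE-L": 124, "D-FINE-X": 78}
    for name, fps in reported_fps.items():
        print(f"{name}: {fps_to_latency_ms(fps):.1f} ms per image")
```

Under that assumption, 124 FPS corresponds to roughly 8.1 ms per image and 78 FPS to roughly 12.8 ms.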

D-FINE Details & Performance

Details

Vision Tasks: Object Detection

Features: Real-Time Vision

Alternatives to D-FINE

Other models worth comparing for similar use cases.

DEIM
DEIM is a training framework for DETR-based object detection models released in December 2024 by researchers at Intellindust AI Lab, City University of Hong Kong, Great Bay University, and Hefei Normal University. It enhances existing real-time DETR architectures by improving the matching used during training, enabling faster convergence and higher accuracy without modifying the inference architecture or adding computational overhead at deployment time. DEIM introduces two core techniques: Dense One-to-One (O2O) matching, which increases the number of positive samples per training image by adding extra targets through standard data augmentation, and Matchability-Aware Loss (MAL), which down-weights low-quality matches generated by the dense strategy. The paper was accepted at CVPR 2025.
When integrated with RT-DETR and D-FINE, DEIM consistently improves performance while reducing training time by up to 50%. Applied to RT-DETRv2, it achieves 53.2% AP with a single day of training on an NVIDIA 4090 GPU. The DEIM-enhanced models DEIM-D-FINE-L and DEIM-D-FINE-X achieve 54.7% and 56.5% AP at 124 and 78 FPS, respectively, on an NVIDIA T4 GPU. DEIM is released under the Apache 2.0 license. A successor, DEIMv2, was released in September 2025, adding DINOv3-based backbones and introducing ultra-lightweight variants (Pico, Femto, and Atto) for edge deployment.
RT-DETR
RT-DETR (Real-Time Detection Transformer) is an object detection model developed by Baidu, released in April 2023 under the Apache 2.0 license. It is the first transformer-based real-time object detector. It addresses the inference speed limitations of earlier DETR models with an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, so the model can process multi-scale features without the high computational overhead of standard transformer encoders.
RT-DETR achieves 53.1% AP on COCO at 108 FPS on an NVIDIA T4 GPU for the RT-DETR-R50 variant, outperforming comparably sized YOLO detectors at similar speeds. It maintains end-to-end inference without non-maximum suppression, simplifying deployment pipelines. RT-DETR established the baseline for real-time transformer detection and has been extended by subsequent works including RF-DETR and RT-DETRv2.
RF-DETR
RF-DETR is a real-time transformer-based object detection model developed by Roboflow, with code and weights first released in March 2025 under the Apache 2.0 license. It is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark, built on a DINOv2 vision transformer backbone with weight-sharing neural architecture search used to identify accuracy-latency trade-offs. The full family spans six sizes from Nano (30.5M parameters, 384×384 input) to 2XL (126.9M parameters, 880×880 input), with the accompanying research paper accepted to ICLR 2026.
RF-DETR is designed for strong domain adaptability, achieving state-of-the-art performance on RF100-VL, a benchmark measuring generalization to real-world object detection tasks across diverse domains. It is deployable through Roboflow Inference and supports fine-tuning on custom datasets, making it well suited for domain-specific applications with limited training data.
YOLO11
YOLO11 is an object detection and multi-task vision model developed by Ultralytics, released in September 2024 under the AGPL-3.0 license. It is the latest generation in the Ultralytics YOLO series and supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a single unified framework. YOLO11 introduces architectural refinements that improve accuracy while reducing parameter count compared to YOLOv8 at equivalent model sizes.
YOLO11 is available in five model sizes from Nano to Extra Large and is deployable through the Ultralytics Python package, Roboflow Inference, and export formats including ONNX, TensorRT, and CoreML. It supports fine-tuning on custom datasets through the standard Ultralytics training API.
YOLOv12
YOLOv12 is an attention-centric real-time object detection model developed by researchers at the University at Buffalo and the University of Chinese Academy of Sciences. The arXiv paper was published in February 2025, and the code is released under the AGPL-3.0 license. It introduces an Area Attention module that partitions feature maps into regions and applies self-attention within each region, reducing the quadratic cost of full self-attention while still capturing long-range dependencies. It also incorporates R-ELAN for improved feature aggregation and scaled residual connections for training stability.
YOLOv12-L achieves 54.0% AP on COCO, while the YOLOv12-N variant achieves 40.5% mAP at 1.62 ms latency on an NVIDIA T4 GPU. The model is built on the Ultralytics codebase, supporting detection, segmentation, and other standard YOLO tasks at competitive real-time speeds.
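
The saving from restricting attention to regions, as Area Attention does, can be seen with a rough count of query-key pairs. This is a back-of-the-envelope sketch rather than YOLOv12 code: it ignores channel dimensions and projection layers, and the feature-map size and the choice of four areas are assumptions for illustration.

```python
def full_attention_pairs(num_tokens):
    """Query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


def area_attention_pairs(num_tokens, num_areas):
    """Query-key pairs when tokens attend only within their own area.

    Assumes the tokens split evenly into equally sized areas.
    """
    if num_areas <= 0 or num_tokens % num_areas != 0:
        raise ValueError("num_tokens must divide evenly into num_areas")
    tokens_per_area = num_tokens // num_areas
    return num_areas * tokens_per_area * tokens_per_area


if __name__ == "__main__":
    tokens = 40 * 40  # hypothetical 40x40 feature map
    areas = 4         # hypothetical number of areas
    print(full_attention_pairs(tokens), area_attention_pairs(tokens, areas))
```

For a 40×40 feature map (1,600 tokens) split into 4 areas, the count drops from 2,560,000 to 640,000 pairs, a reduction by a factor equal to the number of areas.
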
RTMDet
RTMDet is a real-time object detection model developed by OpenMMLab, released in December 2022 under the GPL-3.0 license. It adopts a single-stage detection architecture with large-kernel depthwise convolution in both the backbone and neck, enabling it to capture long-range spatial dependencies without the computational cost of full self-attention. The model family spans from RTMDet-tiny to RTMDet-x, covering a wide range of speed-accuracy operating points.
RTMDet-x achieves 52.6% AP on COCO at 114 FPS on an NVIDIA 3090 GPU. The architecture supports instance segmentation and rotated object detection variants. RTMDet is included in the OpenMMLab ecosystem and is well suited for applications requiring fast, accurate detection with flexible model sizing.

D-FINE License

Apache 2.0

License terms and commercial-use guidance for D-FINE.

License information is provided as a guide and is not legal advice.