D-FINE is a real-time object detection model introduced in October 2024 by researchers at the University of Science and Technology of China. It builds on the DETR family of transformer-based detectors by reformulating bounding box regression as Fine-grained Distribution Refinement (FDR). Rather than predicting box coordinates directly, D-FINE iteratively refines probability distributions over coordinate offsets across decoder layers, which provides finer localization granularity without adding inference cost. The architecture also replaces the encoder's CSP blocks with GELAN modules and inserts a Target Gating Layer after the decoder's cross-attention to reduce representational entanglement across queries. A second contribution, Global Optimal Localization Self-Distillation (GO-LSD), uses the refined distributions predicted by deeper decoder layers as training targets for earlier layers, so localization knowledge is transferred within the model without a separate teacher network.
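The refinement idea described above can be illustrated with a small sketch. It is not the D-FINE implementation: the uniform bin values, the `scale` factor, and the hand-written per-layer residual logits are all assumptions chosen for illustration, and the function names are hypothetical. It shows only the general pattern, in which each decoder layer adds a residual to the running logits for one box edge and the edge offset is read out as the expectation of the resulting distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_values):
    """Expectation of the offset under the distribution given by the logits."""
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))


def refine_edge(initial_distance, scale, layer_residuals, bin_values):
    """Refine one box-edge distance layer by layer (illustrative only).

    Each layer contributes residual logits that are accumulated onto the
    previous layer's logits; the distance after each layer is the initial
    distance plus the scaled expected offset.
    """
    logits = [0.0] * len(bin_values)  # assumption: start from a uniform distribution
    distances = []
    for residual in layer_residuals:
        logits = [a + r for a, r in zip(logits, residual)]
        offset = expected_offset(logits, bin_values)
        distances.append(initial_distance + scale * offset)
    return distances


if __name__ == "__main__":
    # Hypothetical setup: 5 uniform bins (an assumption made for this sketch).
    bins = [-0.5, -0.25, 0.0, 0.25, 0.5]
    residuals = [
        [0.0, 0.0, 0.0, 0.0, 0.0],  # layer 1: no change
        [0.0, 0.0, 0.0, 1.0, 2.0],  # layer 2: shift mass toward positive offsets
        [0.0, 0.0, 0.0, 1.0, 2.0],  # layer 3: sharpen further
    ]
    for layer, d in enumerate(refine_edge(10.0, 4.0, residuals, bins), start=1):
        print(f"layer {layer}: edge distance = {d:.3f}")
```

Because each layer only adds a residual to the logits, later layers can make small corrections to an already reasonable distribution instead of re-predicting the edge from scratch.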
D-FINE is released in five model sizes (Nano, Small, Medium, Large, and X). On the Microsoft COCO benchmark, D-FINE-L achieves 54.0% AP at 124 FPS on an NVIDIA T4 GPU, and D-FINE-X reaches 55.8% AP at 78 FPS. Pretraining on the Objects365 dataset raises accuracy to 57.1% AP for the L variant and 59.3% AP for the X variant. The paper was accepted at ICLR 2025 as a Spotlight. Code and pretrained weights are released under the Apache 2.0 license, which permits commercial use.
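To put the reported figures in more practical terms, the short sketch below converts the quoted throughput into approximate per-image latency and computes the AP gain from Objects365 pretraining. The numbers are taken from the paragraph above; the conversion assumes one image per forward pass at steady state, and the dictionary and function names are hypothetical.

```python
# Figures quoted above: COCO AP, AP with Objects365 pretraining, FPS on an NVIDIA T4.
REPORTED = {
    "D-FINE-L": {"ap": 54.0, "ap_obj365": 57.1, "fps": 124},
    "D-FINE-X": {"ap": 55.8, "ap_obj365": 59.3, "fps": 78},
}


def latency_ms(fps):
    """Approximate per-image latency, assuming one image per forward pass."""
    return 1000.0 / fps


def pretraining_gain(entry):
    """AP improvement from Objects365 pretraining, in percentage points."""
    return round(entry["ap_obj365"] - entry["ap"], 1)


if __name__ == "__main__":
    for name, entry in REPORTED.items():
        print(
            f"{name}: ~{latency_ms(entry['fps']):.2f} ms per image, "
            f"+{pretraining_gain(entry)} AP with Objects365 pretraining"
        )
```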
License information is provided as a guide and is not legal advice.