RT-DETR Overview

RT-DETR (Real-Time Detection Transformer) is an object detection model developed by Baidu, released in April 2023 under the Apache 2.0 license. Its authors present it as the first real-time end-to-end transformer-based object detector. It addresses the slow inference of earlier DETR models with an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, so the model can process multi-scale features without the computational overhead of a standard transformer encoder.

The RT-DETR-R50 variant achieves 53.1% AP on COCO at 108 FPS on an NVIDIA T4 GPU, outperforming comparably sized YOLO detectors at similar speeds. Inference is end-to-end and needs no non-maximum suppression (NMS), which simplifies deployment pipelines. RT-DETR established the baseline for real-time transformer detection and has been followed by subsequent works including RT-DETRv2 and RF-DETR.
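To illustrate what NMS-free inference means in practice, here is a minimal sketch of DETR-style post-processing in plain Python. It assumes the model returns per-query class logits and normalized (cx, cy, w, h) boxes; the function names, the score threshold, and the top-k value are illustrative assumptions, not RT-DETR's actual API.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def postprocess(logits, boxes, img_w, img_h, score_threshold=0.5, top_k=300):
    """Select detections by score alone; no box-overlap suppression step.

    logits: list of per-query lists of class logits (hypothetical layout).
    boxes:  list of per-query normalized (cx, cy, w, h) boxes.
    """
    # Score every (query, class) pair independently.
    candidates = []
    for query_idx, query_logits in enumerate(logits):
        for class_id, logit in enumerate(query_logits):
            candidates.append((sigmoid(logit), class_id, query_idx))

    # Keep the highest-scoring pairs; one-to-one training is what makes
    # duplicate removal unnecessary, so nothing else is filtered here.
    candidates.sort(key=lambda item: item[0], reverse=True)
    detections = []
    for score, class_id, query_idx in candidates[:top_k]:
        if score < score_threshold:
            break
        detections.append(
            {
                "score": score,
                "class_id": class_id,
                "box": cxcywh_to_xyxy(boxes[query_idx], img_w, img_h),
            }
        )
    return detections
```

The whole pipeline is a sort and a threshold, with no IoU computation between boxes. A YOLO-style detector with a dense head would typically add an NMS pass after this point.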

RT-DETR Details & Performance

Vision Tasks

Object Detection

Features

Real-Time Vision

Alternatives to RT-DETR

Other models worth comparing for similar use cases.

RF-DETR
RF-DETR is a real-time transformer-based object detection model developed by Roboflow, with code and weights first released in March 2025 under the Apache 2.0 license. It is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark, built on a DINOv2 vision transformer backbone with weight-sharing neural architecture search used to identify accuracy-latency trade-offs. The full family spans six sizes from Nano (30.5M parameters, 384×384 input) to 2XL (126.9M parameters, 880×880 input), with the accompanying research paper accepted to ICLR 2026. RF-DETR is designed for strong domain adaptability, achieving state-of-the-art performance on RF100-VL, a benchmark measuring generalization to real-world object detection tasks across diverse domains. It is deployable through Roboflow Inference and supports fine-tuning on custom datasets, making it well suited for domain-specific applications with limited training data.
YOLOv8
YOLOv8 is an object detection and multi-task vision model developed by Ultralytics, released in January 2023 under the AGPL-3.0 license. It succeeds YOLOv5 and introduces an anchor-free detection head, a new C2f module for improved gradient flow, and a decoupled head that separates classification and regression tasks. These changes improve both accuracy and training efficiency compared to earlier Ultralytics models. YOLOv8 supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a unified codebase. It is available in five sizes from Nano to Extra Large and exports to ONNX, TensorRT, CoreML, and other formats. YOLOv8 is one of the most widely adopted detection models in production and is directly supported by Roboflow Inference for custom model training and deployment.
YOLO11
YOLO11 is an object detection and multi-task vision model developed by Ultralytics, released in September 2024 under the AGPL-3.0 license. At release it was the latest generation in the Ultralytics YOLO series, and it supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a single unified framework. YOLO11 introduces architectural refinements that improve accuracy while reducing parameter count compared to YOLOv8 at equivalent model sizes. YOLO11 is available in five model sizes from Nano to Extra Large and is deployable through the Ultralytics Python package, Roboflow Inference, and export formats including ONNX, TensorRT, and CoreML. It supports fine-tuning on custom datasets through the standard Ultralytics training API.
DEIM
DEIM is a training framework for DETR-based object detection models released in December 2024 by researchers at Intellindust AI Lab, City University of Hong Kong, Great Bay University, and Hefei Normal University. It enhances existing real-time DETR architectures by improving the matcher used during training, enabling faster convergence and higher accuracy without modifying the inference architecture or adding computational overhead at deployment time. DEIM introduces two core techniques: Dense One-to-One (O2O) matching, which increases the number of positive matches per target, and Matchability-Aware Loss (MAL), which down-weights low-quality matches generated by the dense strategy. The paper was accepted at CVPR 2025. When integrated with RT-DETR and D-FINE, DEIM consistently improves performance while reducing training time by up to 50%. Applied to RT-DETRv2, it achieves 53.2% AP with a single day of training on an NVIDIA 4090 GPU. DEIM-enhanced models including DEIM-D-FINE-L and DEIM-D-FINE-X achieve 54.7% and 56.5% AP at 124 and 78 FPS respectively on an NVIDIA T4 GPU. DEIM is released under the Apache 2.0 license. A successor, DEIMv2, was released in September 2025, adding DINOv3-based backbones and introducing ultra-lightweight variants (Pico, Femto, and Atto) for edge deployment.
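As a rough illustration of how a matchability-aware loss can down-weight low-quality matches, the sketch below computes a per-prediction loss in plain Python. The formula is not given on this page; the form used here (binary cross-entropy against a soft target of IoU raised to a power, with negatives weighted by the predicted confidence) is one reading of the DEIM paper and should be treated as an assumption, as should the function name and the default `gamma`.

```python
import math


def matchability_aware_loss(p, q, is_positive, gamma=1.5):
    """Sketch of a MAL-style classification loss for a single prediction.

    p: predicted class probability in (0, 1).
    q: IoU between the predicted box and its matched target box.
    is_positive: True if the query was matched to a target.
    gamma: focusing exponent (default value is an assumption).
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite

    if is_positive:
        # Soft target shrinks toward 0 as match quality (IoU) drops, so a
        # poorly localized match is not pushed toward high confidence.
        target = q ** gamma
        return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

    # Unmatched queries: penalize confident false positives more heavily.
    return -(p ** gamma) * math.log(1.0 - p)


if __name__ == "__main__":
    for p, q in [(0.9, 0.9), (0.1, 0.9), (0.9, 0.1), (0.1, 0.1)]:
        loss = matchability_aware_loss(p, q, is_positive=True)
        print(f"p={p:.1f} IoU={q:.1f} loss={loss:.3f}")
```

Under this form, a high-IoU match is rewarded for high confidence, while a low-IoU match is rewarded for low confidence. That is the behavior a dense matching strategy needs, since it produces many extra positives of varying quality.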
D-FINE
D-FINE is a real-time object detection model introduced in October 2024 by researchers at the University of Science and Technology of China. It builds on the DETR family of transformer-based detectors by reformulating bounding box regression as a Fine-grained Distribution Refinement task. Rather than predicting box coordinates directly, D-FINE iteratively refines probability distributions over coordinate offsets across decoder layers, which provides finer localization granularity without adding inference cost. The architecture also replaces the encoder's CSP blocks with GELAN modules and inserts a Target Gating Layer after the decoder's cross-attention to reduce representational entanglement across queries. A second contribution, Global Optimal Localization Self-Distillation, transfers localization knowledge from refined deeper-layer predictions back to earlier decoder layers through internal self-distillation. D-FINE is released in five model sizes (Nano, Small, Medium, Large, and X), with D-FINE-L achieving 54.0% AP on the Microsoft COCO benchmark at 124 FPS on an NVIDIA T4 GPU, and D-FINE-X reaching 55.8% AP at 78 FPS. Pretraining on the Objects365 dataset further improves accuracy to 57.1% AP for the L variant and 59.3% AP for the X variant. The paper was accepted at ICLR 2025 as a Spotlight. Code and pretrained weights are released under the Apache 2.0 license, making the model suitable for commercial use.
YOLOv9
YOLOv9 is a real-time object detection model developed by Chien-Yao Wang and Hong-Yuan Mark Liao at Academia Sinica, released in February 2024 under the GPL-3.0 license. It introduces Programmable Gradient Information (PGI), a mechanism that preserves complete input information through auxiliary reversible branches during training to address information loss in deep network layers. It also introduces the Generalized Efficient Layer Aggregation Network (GELAN), which achieves better parameter utilization compared to prior CSP-based designs. YOLOv9-C achieves 53.0% AP on COCO with 42% fewer parameters and 21% less computation than YOLOv7 AF at the same accuracy. YOLOv9-E achieves 55.6% AP. The model is deployable through Roboflow Inference and supports fine-tuning via the standard training pipeline in the official repository.

RT-DETR License

Apache 2.0

License terms and commercial-use guidance for RT-DETR.

License information is provided as a guide and is not legal advice.