Co-DETR Overview

Co-DETR (Co-Deformable-DETR) is an object detection model developed by researchers at Sense-X and OpenMMLab, released in November 2022. It improves on standard DETR-based detectors with a collaborative hybrid assignment training scheme: during training, the encoder is supervised by several auxiliary heads that use one-to-many label assignment, in addition to the primary decoder head that uses one-to-one assignment. The auxiliary heads are discarded at inference, so the extra supervision speeds up convergence and improves detection accuracy without adding inference cost.
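To make the difference between the two assignment styles concrete, here is a minimal toy sketch in plain Python. It is not Co-DETR's implementation: the IoU-threshold rule stands in for whatever assigners the auxiliary heads use, the brute-force search stands in for Hungarian matching, and all function names and boxes are hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def one_to_one(gts, preds):
    """Each ground truth gets exactly one prediction (brute force, toy sizes only)."""
    best = max(
        permutations(range(len(preds)), len(gts)),
        key=lambda perm: sum(iou(g, preds[p]) for g, p in zip(gts, perm)),
    )
    return {gi: [pi] for gi, pi in enumerate(best)}

def one_to_many(gts, preds, thr=0.5):
    """Each ground truth gets every prediction whose IoU reaches thr (assumed rule)."""
    return {
        gi: [pi for pi, p in enumerate(preds) if iou(g, p) >= thr]
        for gi, g in enumerate(gts)
    }

# Hypothetical ground-truth boxes and candidate predictions.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0, 0, 10, 10), (1, 1, 10, 10), (0, 0, 9, 9),
    (20, 20, 30, 30), (21, 20, 30, 30), (50, 50, 60, 60),
]

o2o = one_to_one(gts, preds)
o2m = one_to_many(gts, preds)
num_pos_o2o = sum(len(v) for v in o2o.values())
num_pos_o2m = sum(len(v) for v in o2m.values())
```

In this toy setup the one-to-many rule yields more positive samples than one-to-one matching, which is the kind of denser supervision the auxiliary heads are meant to provide.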

Co-DETR is evaluated on the COCO benchmark, where it achieves 59.5% AP when applied to DINO-Deformable-DETR with a Swin-L backbone. With a ViT-L backbone it reaches 66.0% AP on COCO test-dev, outperforming prior methods at comparable model scales. It is suitable for high-accuracy object detection tasks where training efficiency and peak performance on standard benchmarks are priorities.

Co-DETR Details & Performance

Vision Tasks

Object Detection

Alternatives to Co-DETR

Other models worth comparing for similar use cases.

RTMDet
RTMDet is a real-time object detection model developed by OpenMMLab, released in December 2022 under the GPL-3.0 license. It adopts a single-stage detection architecture with large-kernel depthwise convolution in both the backbone and neck, enabling it to capture long-range spatial dependencies without the computational cost of full self-attention. The model family spans from RTMDet-tiny to RTMDet-x, covering a wide range of speed-accuracy operating points. RTMDet-x achieves 52.6% AP on COCO at 114 FPS on an NVIDIA 3090 GPU. The architecture supports instance segmentation and rotated object detection variants. RTMDet is included in the OpenMMLab ecosystem and is well suited for applications requiring fast, accurate detection with flexible model sizing.
RT-DETR
RT-DETR (Real-Time Detection Transformer) is an object detection model developed by Baidu, released in April 2023 under the Apache 2.0 license. It is the first transformer-based real-time object detector. It addresses the inference speed limitations of earlier DETR models through an efficient hybrid encoder that decouples intra-scale interaction from cross-scale fusion, so the model can process multi-scale features without the high computational overhead of standard transformer encoders. RT-DETR achieves 53.1% AP on COCO at 108 FPS on an NVIDIA T4 GPU for the RT-DETR-R50 variant, outperforming comparably sized YOLO detectors at similar speeds. It maintains end-to-end inference without non-maximum suppression, simplifying deployment pipelines. RT-DETR established the baseline for real-time transformer detection and has been extended by subsequent works including RF-DETR and RT-DETRv2.
RF-DETR
RF-DETR is a real-time transformer-based object detection model developed by Roboflow, with code and weights first released in March 2025 under the Apache 2.0 license. It is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark, built on a DINOv2 vision transformer backbone with weight-sharing neural architecture search used to identify accuracy-latency trade-offs. The full family spans six sizes from Nano (30.5M parameters, 384×384 input) to 2XL (126.9M parameters, 880×880 input), with the accompanying research paper accepted to ICLR 2026. RF-DETR is designed for strong domain adaptability, achieving state-of-the-art performance on RF100-VL, a benchmark measuring generalization to real-world object detection tasks across diverse domains. It is deployable through Roboflow Inference and supports fine-tuning on custom datasets, making it well suited for domain-specific applications with limited training data.
DEIM
DEIM is a training framework for DETR-based object detection models released in December 2024 by researchers at Intellindust AI Lab, City University of Hong Kong, Great Bay University, and Hefei Normal University. It enhances existing real-time DETR architectures by improving the matching used during training, enabling faster convergence and higher accuracy without modifying the inference architecture or adding computational overhead at deployment time. DEIM introduces two core techniques: Dense One-to-One (O2O) matching, which keeps one-to-one matching but increases the number of positive samples per training image by adding extra targets through standard data augmentation, and Matchability-Aware Loss (MAL), which optimizes matches across quality levels to handle the low-quality matches that the dense strategy introduces. The paper was accepted at CVPR 2025. When integrated with RT-DETR and D-FINE, DEIM consistently improves performance while reducing training time by up to 50%. Applied to RT-DETRv2, it achieves 53.2% AP with a single day of training on an NVIDIA 4090 GPU. The DEIM-enhanced models DEIM-D-FINE-L and DEIM-D-FINE-X achieve 54.7% and 56.5% AP at 124 and 78 FPS respectively on an NVIDIA T4 GPU. DEIM is released under the Apache 2.0 license. A successor, DEIMv2, was released in September 2025, adding DINOv3-based backbones and introducing ultra-lightweight variants (Pico, Femto, and Atto) for edge deployment.
D-FINE
D-FINE is a real-time object detection model introduced in October 2024 by researchers at the University of Science and Technology of China. It builds on the DETR family of transformer-based detectors by reformulating bounding box regression as a Fine-grained Distribution Refinement task. Rather than predicting box coordinates directly, D-FINE iteratively refines probability distributions over coordinate offsets across decoder layers, which provides finer localization granularity without adding inference cost. The architecture also replaces the encoder's CSP blocks with GELAN modules and inserts a Target Gating Layer after the decoder's cross-attention to reduce representational entanglement across queries. A second contribution, Global Optimal Localization Self-Distillation, transfers localization knowledge from refined deeper-layer predictions back to earlier decoder layers through internal self-distillation. D-FINE is released in five model sizes (Nano, Small, Medium, Large, and X), with D-FINE-L achieving 54.0% AP on the Microsoft COCO benchmark at 124 FPS on an NVIDIA T4 GPU, and D-FINE-X reaching 55.8% AP at 78 FPS. Pretraining on the Objects365 dataset further improves accuracy to 57.1% AP for the L variant and 59.3% AP for the X variant. The paper was accepted at ICLR 2025 as a Spotlight. Code and pretrained weights are released under the Apache 2.0 license, making the model suitable for commercial use.
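The idea of refining a distribution over offsets, rather than regressing a coordinate directly, can be sketched for a single box edge as follows. This is a simplified illustration under stated assumptions, not D-FINE's code: the bins are uniform here, `scale` is a hypothetical stand-in for a box-size factor, and each decoder layer is assumed to add residual logits to the previous layer's logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_offset(logits, bin_values):
    """Expected offset under the distribution defined by the logits."""
    return sum(p * v for p, v in zip(softmax(logits), bin_values))

def refine_edge(initial_edge, initial_logits, residuals_per_layer, bin_values, scale):
    """Return the edge position after each layer's residual logit update."""
    logits = list(initial_logits)
    edges = []
    for residual in residuals_per_layer:
        # Each layer nudges the distribution instead of predicting a new coordinate.
        logits = [l + r for l, r in zip(logits, residual)]
        edges.append(initial_edge + scale * expected_offset(logits, bin_values))
    return edges

# Hypothetical setup: five uniform offset bins and two decoder layers.
bins = [-0.5, -0.25, 0.0, 0.25, 0.5]
edges = refine_edge(
    initial_edge=10.0,
    initial_logits=[0.0] * 5,
    residuals_per_layer=[[0, 0, 0, 1, 0], [0, 0, 0, 0, 2]],
    bin_values=bins,
    scale=4.0,
)
```

Because the edge is always computed from the initial position plus an expectation over bins, later layers only have to adjust the distribution, which is one way to read the "finer localization granularity" described above.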
DETR
DETR (Detection Transformer) is an end-to-end object detection model developed by Facebook Research (Meta), released in May 2020. It is one of the first models to eliminate hand-crafted components such as anchor generation and non-maximum suppression by framing object detection as a direct set prediction problem, solved with a transformer encoder-decoder architecture built on top of a CNN backbone. DETR achieves 42.0% AP on the COCO benchmark with a ResNet-50 backbone, performing comparably to a well-tuned Faster R-CNN at the time of release. Its attention-based design allows it to reason about global context and long-range dependencies within an image. DETR is primarily used as a research baseline and architectural reference, with subsequent works such as Deformable DETR and DINO building on its foundations to address its slower training convergence and limited small-object detection capability.
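Set prediction can be illustrated with a small matching sketch: every target is paired with exactly one query, and the remaining queries are trained to predict "no object", which is why no non-maximum suppression is needed. The code below is a toy under stated assumptions, not DETR's implementation: the cost uses only class probability and L1 box distance, brute force replaces the Hungarian algorithm, and all names and values are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None  # marker for queries left without a target

def match_cost(pred, target):
    """Lower is better: reward class probability, penalize L1 box distance."""
    l1 = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return -pred["probs"][target["label"]] + l1

def set_match(preds, targets):
    """Return, per prediction, the index of its matched target or NO_OBJECT."""
    best = min(
        permutations(range(len(preds)), len(targets)),
        key=lambda perm: sum(match_cost(preds[p], t) for p, t in zip(perm, targets)),
    )
    assigned = [NO_OBJECT] * len(preds)
    for target_idx, pred_idx in enumerate(best):
        assigned[pred_idx] = target_idx
    return assigned

# Hypothetical queries with boxes in normalized (cx, cy, w, h) form.
preds = [
    {"probs": {"cat": 0.1, "dog": 0.8}, "box": (0.7, 0.7, 0.2, 0.2)},
    {"probs": {"cat": 0.9, "dog": 0.05}, "box": (0.3, 0.3, 0.2, 0.2)},
    {"probs": {"cat": 0.6, "dog": 0.1}, "box": (0.32, 0.3, 0.2, 0.2)},
]
targets = [
    {"label": "cat", "box": (0.3, 0.3, 0.2, 0.2)},
    {"label": "dog", "box": (0.7, 0.7, 0.2, 0.2)},
]

assignment = set_match(preds, targets)
```

Here the third query is a near-duplicate of the second, yet it is left unmatched and would be supervised as "no object", so duplicates are suppressed by training rather than by post-processing.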

Co-DETR License

MIT

License terms and commercial-use guidance for Co-DETR.

License information is provided as a guide and is not legal advice.