YOLOS Overview

YOLOS (You Only Look at One Sequence) is a transformer-based object detection model widely distributed through Hugging Face Transformers, released in June 2021 under the MIT license. It applies a minimally adapted Vision Transformer to object detection by representing both the image and detection tokens as a flat sequence processed by standard multi-head self-attention, without convolutional components or feature pyramid networks. The architecture demonstrates that detection can be performed without region proposals or multi-scale feature fusion.

YOLOS achieves moderate performance on COCO relative to purpose-built detectors, with its primary contribution being a demonstration of the transferability of ViT pre-training to detection tasks. It is most appropriate for research contexts exploring transformer-based detection architectures and for scenarios where architectural simplicity is preferred over peak accuracy.

YOLOS Details & Performance

Vision Tasks

Object Detection

Features

Real-Time Vision

Usage

Past 30 Days

Not available

Not in Playground

Performance

Avg. Latency

Arena Rankings

Not yet ranked in arena

Alternatives to YOLOS

Other models worth comparing for similar use cases.

Meta
DETR
DETR (Detection Transformer) is an end-to-end object detection model developed by Facebook Research (Meta), released in May 2020. It is one of the first models to eliminate hand-crafted components such as anchor generation and non-maximum suppression by framing object detection as a direct set prediction problem, solved with a transformer encoder-decoder architecture built on top of a CNN backbone.DETR achieves 42.0% AP on the COCO benchmark with a ResNet-50 backbone, performing comparably to a well-tuned Faster R-CNN at the time of release. Its attention-based design allows it to reason about global context and long-range dependencies within an image. DETR is primarily used as a research baseline and architectural reference, with subsequent works such as Deformable DETR and DINO building on its foundations to address its slower training convergence and limited small-object detection capability.
Baidu
RT-DETR
RT-DETR (Real-Time Detection Transformer) is an object detection model developed by Baidu, released in April 2023 under the Apache 2.0 license. It is the first transformer-based real-time object detector, addressing the inference speed limitations of earlier DETR models through an efficient hybrid encoder that decouples intra-scale interaction and cross-scale fusion, enabling the model to process multi-scale features without the high computational overhead of standard transformer encoders.RT-DETR achieves 53.1% AP on COCO at 108 FPS on an NVIDIA T4 GPU for the RT-DETR-L variant, outperforming comparably sized YOLO detectors at similar speeds. It maintains end-to-end inference without non-maximum suppression, simplifying deployment pipelines. RT-DETR established the baseline for real-time transformer detection and has been extended by subsequent works including RF-DETR and RT-DETRv2.
RF-DETR
RF-DETR is a real-time transformer-based object detection model developed by Roboflow, with code and weights first released in March 2025 under the Apache 2.0 license. It is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark, built on a DINOv2 vision transformer backbone with weight-sharing neural architecture search used to identify accuracy-latency trade-offs. The full family spans six sizes from Nano (30.5M parameters, 384×384 input) to 2XL (126.9M parameters, 880×880 input), with the accompanying research paper accepted to ICLR 2026.RF-DETR is designed for strong domain adaptability, achieving state-of-the-art performance on RF100-VL, a benchmark measuring generalization to real-world object detection tasks across diverse domains. It is deployable through Roboflow Inference and supports fine-tuning on custom datasets, making it well suited for domain-specific applications with limited training data.
YOLOv8
YOLOv8 is an object detection and multi-task vision model developed by Ultralytics, released in January 2023 under the AGPL-3.0 license. It succeeds YOLOv5 and introduces an anchor-free detection head, a new C2f module for improved gradient flow, and a decoupled head that separates classification and regression tasks. These changes improve both accuracy and training efficiency compared to earlier Ultralytics models.YOLOv8 supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a unified codebase. It is available in five sizes from Nano to Extra Large and exports to ONNX, TensorRT, CoreML, and other formats. YOLOv8 is one of the most widely adopted detection models in production and is directly supported by Roboflow Inference for custom model training and deployment.
YOLO11
YOLO11 is an object detection and multi-task vision model developed by Ultralytics, released in September 2024 under the AGPL-3.0 license. It is the latest generation in the Ultralytics YOLO series and supports object detection, instance segmentation, image classification, pose estimation, and oriented bounding box detection within a single unified framework. YOLO11 introduces architectural refinements that improve accuracy while reducing parameter count compared to YOLOv8 at equivalent model sizes.YOLO11 is available in five model sizes from Nano to Extra Large and is deployable through the Ultralytics Python package, Roboflow Inference, and export formats including ONNX, TensorRT, and CoreML. It supports fine-tuning on custom datasets through the standard Ultralytics training API.
Google
OWL-ViT
OWL-ViT (Open-World Localization with Vision Transformers) is an open-vocabulary object detection model released in May 2022 by Google Research. It adapts a pretrained CLIP-style image-text model by removing the final pooling layer and attaching lightweight classification and box prediction heads to each Transformer output token, producing a detector capable of localizing arbitrary objects described by free-form text at inference time. Rather than being restricted to a fixed taxonomy such as the 80 categories in Microsoft COCO, OWL-ViT can detect object classes specified by a user's text query, including categories the model was never explicitly trained on.OWL-ViT accepts an image and a list of text queries as input, and produces bounding boxes with class assignments drawn from the supplied queries. It also supports one-shot image-conditioned detection, where a cropped image region is used as the query instead of text, allowing the model to find visually similar instances within a target scene. The model is released in multiple Vision Transformer sizes (ViT-B/32, ViT-B/16, ViT-L/14) and CLIP-pretrained variants, distributed through the Google Research scenic repository and Hugging Face under the Apache 2.0 license. A successor model, OWLv2, was released in June 2023, introducing the OWL-ST self-training recipe that scales training to over one billion pseudo-annotated examples and substantially improves detection performance on rare and long-tail categories while preserving the open-vocabulary interface.

YOLOS License

MIT

License terms and commercial-use guidance for YOLOS.

License information is provided as a guide and is not legal advice.