Google: Vision Transformer (ViT)

Vision Transformer (ViT) Overview

Vision Transformer is an image classification model developed by Google Research, first published in October 2020. It applies the transformer architecture directly to sequences of image patches without convolutional layers. Each image is divided into fixed-size patches, linearly projected into embeddings, and processed by a standard transformer encoder with multi-head self-attention. A classification token prepended to the patch sequence aggregates global image information for the final prediction.
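
The tokenization step described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the helper names (`patchify`, `project`, `build_tokens`, `sequence_length`) are hypothetical, the projection has no bias, and position embeddings are omitted for brevity.

```python
from typing import List

Vector = List[float]
Image = List[List[Vector]]  # height x width x channels


def patchify(image: Image, patch: int) -> List[Vector]:
    """Split an H x W x C image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat: Vector = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append all channel values
            patches.append(flat)
    return patches


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Linear projection; weight has shape (embed_dim x patch_dim)."""
    return [sum(w * v for w, v in zip(w_row, vec)) for w_row in weight]


def build_tokens(image: Image, patch: int, weight: List[Vector],
                 cls_token: Vector) -> List[Vector]:
    """Prepend the classification token to the projected patch embeddings."""
    embeddings = [project(p, weight) for p in patchify(image, patch)]
    return [list(cls_token)] + embeddings


def sequence_length(image_size: int, patch: int) -> int:
    """Number of tokens for a square image: one per patch, plus one for the class token."""
    return (image_size // patch) ** 2 + 1
```

For example, assuming a 224×224 input and 16×16 patches (illustrative values), `sequence_length(224, 16)` gives 14 × 14 + 1 = 197 tokens entering the encoder.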

When pre-trained on large datasets such as JFT-300M and fine-tuned on ImageNet, ViT achieves accuracy competitive with the state-of-the-art CNNs of the period. It performs best when pre-training data is abundant: because it lacks the inductive biases built into convolutions, it is less data-efficient than CNN-based classifiers on smaller datasets. ViT established the foundation for transformer-based vision architectures and has influenced a broad range of subsequent models.

Vision Transformer (ViT) Details & Performance

Details

Vision Tasks: Classification

Usage

Past 30 Days: Not available
Not in Playground

Performance

Arena Rankings: Not yet ranked in arena

Alternatives to Vision Transformer (ViT)

Other models worth comparing for similar use cases.

Azure
ResNet-50
ResNet-50 is a deep convolutional neural network architecture introduced in the 2015 paper "Deep Residual Learning for Image Recognition" by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun at Microsoft Research. It is part of the ResNet (Residual Network) family, which introduced residual connections: shortcut paths that allow gradients to bypass layers during training. These connections addressed the degradation problem that had previously limited the practical training of very deep networks. ResNet-50 is the 50-layer variant, with approximately 25.6 million parameters, structured as a sequence of bottleneck residual blocks consisting of 1×1, 3×3, and 1×1 convolutions.
ResNet-50 was trained on the ImageNet classification benchmark and achieved leading top-1 accuracy at release. Beyond classification, it became a widely used backbone feature extractor for downstream tasks including object detection (as the base network in Faster R-CNN, Mask R-CNN, and RetinaNet) and semantic and instance segmentation. Most current implementations in PyTorch torchvision, TensorFlow, and NVIDIA NGC use the ResNet-50 v1.5 variant, which relocates the stride-2 downsampling from the first 1×1 convolution to the 3×3 convolution within each bottleneck block, yielding approximately 0.5% higher top-1 accuracy than the original v1 formulation at a small throughput cost. ResNet-50 remains a common reference architecture in computer vision benchmarks and a standard backbone choice in detection and segmentation frameworks. The original Microsoft Research code is released under the MIT license.
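
The difference between the v1 and v1.5 bottleneck can be illustrated by tracing spatial sizes through a downsampling block. The sketch below is a hedged illustration only: the function names are hypothetical, the 56-pixel input is an assumed example value, and only the stride placement described above is modeled (no weights, normalization, or shortcut branch).

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_strides(variant: str, stride: int) -> Tuple[int, int, int]:
    """Strides for the (first 1x1, 3x3, last 1x1) convolutions of a bottleneck."""
    if variant == "v1":
        return (stride, 1, 1)   # downsample in the first 1x1 convolution
    if variant == "v1.5":
        return (1, stride, 1)   # downsample in the 3x3 convolution instead
    raise ValueError(f"unknown variant: {variant}")


def trace_sizes(size: int, variant: str, stride: int = 2) -> List[int]:
    """Spatial size after each of the three convolutions in the block."""
    s_first, s_mid, s_last = bottleneck_strides(variant, stride)
    after_first = conv_out(size, kernel=1, stride=s_first, padding=0)
    after_mid = conv_out(after_first, kernel=3, stride=s_mid, padding=1)
    after_last = conv_out(after_mid, kernel=1, stride=s_last, padding=0)
    return [after_first, after_mid, after_last]
```

With a 56-pixel input, `trace_sizes(56, "v1")` gives `[28, 28, 28]` and `trace_sizes(56, "v1.5")` gives `[56, 28, 28]`. Both variants end at the same resolution, but in v1.5 the 3×3 convolution reads the full-resolution feature map, which is consistent with the small throughput cost noted above.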
Meta
ResNet-34
ResNet-34 is a deep residual network for image classification introduced by Kaiming He et al. in December 2015. It is a medium-sized variant in the original ResNet family, designed for ImageNet-scale classification with 34 weighted layers (33 convolutional layers plus a final fully connected layer) organized into residual blocks using skip connections. These connections allow the model to learn residual mappings rather than full transformations, mitigating the vanishing gradient problem and enabling stable training of deeper architectures.
ResNet-34 achieves a top-5 error rate of 7.36% on the ImageNet validation set. It is widely used as a backbone for transfer learning across classification, detection, and segmentation tasks and remains a common baseline architecture in computer vision research. The model is available through Meta's torchvision library.
Meta
ResNet-32
ResNet-32 is a deep residual network for image classification introduced by Kaiming He et al. in December 2015. It is one of the smaller variants in the ResNet family, designed for classification on datasets such as CIFAR-10 and CIFAR-100 rather than ImageNet-scale tasks. Residual connections allow gradients to flow directly through skip connections, enabling training of significantly deeper networks than was previously practical.
ResNet-32 is commonly used in educational and research contexts as a lightweight classification baseline and as a starting point for fine-tuning on custom datasets with limited compute. Unlike the ImageNet variants (ResNet-18, -34, -50, -101, and -152), ResNet-32 is not among the ResNet constructors built into Meta's torchvision library, so it is usually built from a custom or third-party implementation. Larger ResNet variants such as ResNet-50 and ResNet-101 are more commonly used for production classification tasks on high-resolution imagery.
Google
MobileNetV2
MobileNetV2 is a lightweight image classification model developed by Google Research, released in January 2018 under the Apache 2.0 license. It introduces two key architectural innovations: inverted residuals, which expand the channel dimension within each bottleneck block before applying depthwise convolution, and linear bottlenecks, which omit the non-linearity after the final 1×1 projection to preserve information in low-dimensional spaces.
MobileNetV2 achieves competitive top-1 accuracy on ImageNet relative to its computational cost, making it practical for deployment on mobile devices and resource-constrained hardware. It is commonly used as a backbone for classification tasks and as a feature extractor in downstream detection and segmentation models through transfer learning. The architecture scales across a range of width and resolution multipliers, allowing developers to trade accuracy for latency based on deployment requirements.
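
The expand, depthwise, project structure of an inverted residual block can be made concrete by counting channels and convolution weights. This is a rough sketch under stated assumptions: the function names are hypothetical, the expansion factor of 6 and the 24-channel example are illustrative values, and batch normalization parameters and biases are ignored.

```python
from typing import List, Tuple


def inverted_residual_layers(c_in: int, c_out: int,
                             expansion: int) -> List[Tuple[str, int, int]]:
    """(layer description, input channels, output channels) for one block."""
    hidden = c_in * expansion
    return [
        ("1x1 expand, with non-linearity", c_in, hidden),
        ("3x3 depthwise, with non-linearity", hidden, hidden),
        ("1x1 project, linear (no activation)", hidden, c_out),
    ]


def inverted_residual_weights(c_in: int, c_out: int, expansion: int) -> int:
    """Convolution weight count for one block (no bias, no normalization)."""
    hidden = c_in * expansion
    expand = c_in * hidden        # 1x1 convolution: c_in -> hidden
    depthwise = 3 * 3 * hidden    # one 3x3 filter per hidden channel
    project = hidden * c_out      # 1x1 convolution: hidden -> c_out
    return expand + depthwise + project


def uses_skip_connection(c_in: int, c_out: int, stride: int) -> bool:
    """The residual shortcut only applies when input and output shapes match."""
    return stride == 1 and c_in == c_out
```

For a block with 24 input and output channels and an expansion factor of 6, the hidden width is 144 channels and `inverted_residual_weights(24, 24, 6)` counts 3,456 + 1,296 + 3,456 = 8,208 weights. The depthwise step is the cheapest of the three even though it operates on the widest tensor, which is what makes the expansion affordable.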
Meta
DINOv2
DINOv2 is a self-supervised vision foundation model released in April 2023 by Meta AI's FAIR lab. It produces general-purpose visual features that transfer to a wide range of downstream tasks (including image classification, semantic segmentation, depth estimation, and image retrieval) without requiring task-specific fine-tuning of the backbone. DINOv2 is trained on a curated dataset of 142 million images using a self-supervised objective that combines student-teacher self-distillation at the image level with masked image modeling at the patch level, extending the approach introduced in the original DINO.
The model family spans Vision Transformer sizes from ViT-S (21M parameters) to ViT-g (1.1B parameters), with the larger variants setting state-of-the-art results on linear-probing benchmarks for classification, segmentation, and dense prediction tasks at release. DINOv2 features can be used directly as frozen backbones, reducing the need for labeled training data in downstream applications. The model is primarily used as an image encoder rather than as a complete task-specific model, making it a common backbone choice for custom vision pipelines. DINOv2 code and pretrained weights are released under the Apache 2.0 license, which was adopted after an initial CC-BY-NC 4.0 release in response to community requests for commercial compatibility. A successor model, DINOv3, was released in August 2025 with further scaling and a new training technique called Gram anchoring.
Google
SigLIP
SigLIP is a vision-language model released in March 2023 by researchers at Google DeepMind. It adapts the CLIP image-text pretraining approach by replacing CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, which operates independently on each image-text pair rather than requiring a global view of all pairs in a batch. This change decouples the loss from batch size, enabling more memory-efficient training and improved performance at smaller batch sizes, a regime where softmax contrastive learning typically struggles. Despite this simplification, SigLIP matches or exceeds CLIP-style models on zero-shot image classification and image-text retrieval benchmarks when trained on comparable data.
SigLIP is distributed as an image encoder plus aligned text encoder, supporting zero-shot classification with arbitrary class vocabularies, image-text retrieval, and use as a frozen backbone in downstream vision-language models. Pretrained models are available at multiple Vision Transformer sizes and input resolutions, including 224, 256, 384, and 512 pixel inputs. SigLIP is released under the Apache 2.0 license by Google and is used as the vision encoder in Google's PaliGemma and PaliGemma 2. A successor, SigLIP 2, was released in February 2025 with multilingual support across 109 languages, improvements to localization and dense prediction, and two resolution handling variants (FixRes for backward-compatible fixed resolutions and NaFlex for native aspect ratio with variable sequence length).
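
The pairwise sigmoid loss used by SigLIP can be sketched in a few lines of plain Python. This is a minimal illustration assuming small embedding lists and fixed scalar `temperature` and `bias` values (in training these would be learned); the function names are hypothetical and this is not the released implementation.

```python
import math
from typing import List

Vector = List[float]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pairwise_sigmoid_loss(image_embs: List[Vector], text_embs: List[Vector],
                          temperature: float, bias: float) -> float:
    """Sum of independent binary terms over all image-text pairs in the batch."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, image in enumerate(images):
        for j, text in enumerate(texts):
            label = 1.0 if i == j else -1.0  # matching pairs sit on the diagonal
            logit = temperature * dot(image, text) + bias
            total += log_sigmoid(label * logit)  # depends on this pair only
    return -total / len(images)
```

Each term in the double loop depends only on its own pair, with no normalization across the batch as in a softmax. That is the property that decouples the loss from batch size.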

Vision Transformer (ViT) License

Apache 2.0

License terms and commercial-use guidance for Vision Transformer (ViT).

License information is provided as a guide and is not legal advice.