What license does Vision Transformer (ViT) use?

This model is released under the Apache License 2.0, a permissive open-source license that allows commercial use, modification, distribution, and patent use.

Can I use Vision Transformer (ViT) commercially?

Yes. Under the terms of the Apache 2.0 license, you can use this model for commercial purposes, including in proprietary products. When redistributing it, you must include a copy of the license, retain the existing copyright and attribution notices, and mark any files you have modified as changed.

Vision Transformer (ViT): Overview & Resources

Vision Transformer is an image classification model developed by Google Research, first published in October 2020. It applies the transformer architecture directly to sequences of image patches without convolutional layers. Each image is divided into fixed-size patches, linearly projected into embeddings, and processed by a standard transformer encoder with multi-head self-attention. A classification token prepended to the patch sequence aggregates global image information for the final prediction.
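The input pipeline described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names (`patchify`, `linear_project`, `embed_image`) and the toy sizes are assumptions made for this example, the weights are random rather than learned, and the position embeddings that the original model adds to each token are omitted for brevity.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append every channel value of this pixel
            patches.append(flat)
    return patches

def linear_project(vectors, weights):
    """Multiply each input vector by an (input_dim x embed_dim) weight matrix."""
    return [
        [sum(x * w for x, w in zip(vec, column)) for column in zip(*weights)]
        for vec in vectors
    ]

def embed_image(image, patch_size, weights, class_token):
    """Build the encoder input: the class token followed by the patch embeddings."""
    patch_embeddings = linear_project(patchify(image, patch_size), weights)
    return [class_token] + patch_embeddings

# Toy sizes chosen for illustration only (hypothetical, not the model's real configuration).
rng = random.Random(0)
HEIGHT = WIDTH = 8
CHANNELS, PATCH, EMBED_DIM = 3, 4, 6

image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(WIDTH)]
         for _ in range(HEIGHT)]
weights = [[rng.gauss(0, 0.02) for _ in range(EMBED_DIM)]
           for _ in range(PATCH * PATCH * CHANNELS)]
class_token = [0.0] * EMBED_DIM  # a learned vector in a real model; zeros here

tokens = embed_image(image, PATCH, weights, class_token)
print(len(tokens), len(tokens[0]))  # 5 6: four patches plus one class token, each of width 6
```

The same arithmetic scales to realistic sizes: a 224 x 224 image cut into 16 x 16 patches yields 14 x 14 = 196 patches, so the encoder would see 197 tokens once the classification token is prepended.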

When pre-trained on large datasets such as JFT-300M and fine-tuned on ImageNet, ViT achieves accuracy competitive with the state-of-the-art CNNs of the period. It performs best when pre-training data is abundant: because it lacks the inductive biases built into convolutions, it is less data-efficient than CNN-based classifiers on smaller datasets. ViT established the foundation for transformer-based vision architectures and has influenced a broad range of subsequent models.
