The Vision Transformer (ViT) is an image classification model developed by Google Research and first published in October 2020. It applies the transformer architecture directly to sequences of image patches, with no convolutional layers. Each image is divided into fixed-size patches, each patch is flattened and linearly projected into an embedding, and the resulting sequence is processed by a standard transformer encoder with multi-head self-attention. A learnable classification token prepended to the patch sequence aggregates global image information for the final prediction.
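The input pipeline described above can be illustrated with a minimal sketch in plain Python. It is not the reference implementation: the function names, the toy 8x8 image, the 4x4 patch size, and the embedding width of 6 are all assumptions chosen for readability, the projection weights are random rather than learned, and the position embeddings that a real ViT adds to the sequence are omitted.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append every channel value
            patches.append(patch)
    return patches


def linear_project(vectors, weights):
    """Multiply each vector of length d_in by a d_in x d_model weight matrix."""
    d_model = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_model)]
        for v in vectors
    ]


def build_encoder_input(image, patch_size, weights, class_token):
    """Return the token sequence: class token first, then the patch embeddings."""
    embeddings = linear_project(split_into_patches(image, patch_size), weights)
    return [list(class_token)] + embeddings


# Toy setup (all values hypothetical): 8x8 RGB image, 4x4 patches, width-6 embeddings.
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patch_dim = 4 * 4 * 3  # flattened patch length: height * width * channels
weights = [[rng.gauss(0, 0.02) for _ in range(6)] for _ in range(patch_dim)]
class_token = [0.0] * 6  # a learned parameter in a real model; zeros here

tokens = build_encoder_input(image, 4, weights, class_token)
print(len(tokens), len(tokens[0]))
```

With these toy sizes the image yields four patches, so the encoder would receive five tokens of width 6. After the encoder runs, the output at the first position (the classification token) is what feeds the prediction head.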
When pre-trained on large datasets such as JFT-300M and fine-tuned on ImageNet, ViT achieves accuracy competitive with the state-of-the-art CNNs of the period. It performs best when pre-training data is abundant: because it lacks the inductive biases of convolution, it is less data-efficient than CNN-based classifiers on smaller datasets. ViT laid the foundation for transformer-based vision architectures and has influenced a broad range of subsequent models.