SigLIP Overview

SigLIP is a vision-language model released in March 2023 by researchers at Google DeepMind. It adapts the CLIP image-text pretraining approach by replacing CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, which operates independently on each image-text pair rather than requiring a global view of all pairs in a batch. This change decouples the loss from batch size, enabling more memory-efficient training and improved performance at smaller batch sizes, a regime where softmax contrastive learning typically struggles. Despite this simplification, SigLIP matches or exceeds CLIP-style models on zero-shot image classification and image-text retrieval benchmarks when trained on comparable data.
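
The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and the `temperature` and `bias` defaults are placeholder values (in training they would be learned parameters). Each image-text pair is scored as its own binary classification problem, with label +1 for matching pairs and -1 for all others, so no normalization across the batch is needed.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch; pair (i, i) is the positive.

    temperature and bias are illustrative placeholders (assumptions),
    standing in for parameters that would be learned during training.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            # Each pair contributes an independent binary term.
            total += -log_sigmoid(label * logit)
    return total / len(imgs)

# Toy batch: two images whose embeddings match their own captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = sigmoid_loss(images, aligned_texts)
swapped_loss = sigmoid_loss(images, swapped_texts)
```

Because every term depends only on one image and one text, the sum can be computed in chunks, which is the property that loosens the tie between the loss and the batch size. A softmax contrastive loss would instead normalize each row and column over the whole batch.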

SigLIP is distributed as an image encoder plus aligned text encoder, supporting zero-shot classification with arbitrary class vocabularies, image-text retrieval, and use as a frozen backbone in downstream vision-language models. Pretrained models are available at multiple Vision Transformer sizes and input resolutions, including 224, 256, 384, and 512 pixel inputs. SigLIP is released under the Apache 2.0 license by Google and is used as the vision encoder in Google's PaliGemma and PaliGemma 2. A successor, SigLIP 2, was released in February 2025 with multilingual support across 109 languages, improvements to localization and dense prediction, and two resolution handling variants (FixRes for backward-compatible fixed resolutions and NaFlex for native aspect ratio with variable sequence length).

SigLIP Details & Performance

Details

Vision Tasks

Vision Language, Classification, Image Similarity, Image Tagging, Image Embedding

Features

Multimodal Vision, Foundation Vision

Alternatives to SigLIP

Other models worth comparing for similar use cases.

OpenAI
CLIP
OpenAI CLIP (Contrastive Language-Image Pretraining) is a vision-language model released in January 2021 by OpenAI. It jointly trains an image encoder and a text encoder to produce matching embeddings for image-caption pairs, using a contrastive objective over WebImageText (WIT), a dataset of 400 million image-text pairs collected from the public web. By learning to associate images with free-form text rather than a fixed set of class labels, CLIP produces a shared embedding space that enables zero-shot classification with arbitrary vocabularies at inference time.

CLIP supports zero-shot image classification by embedding candidate class labels as text and selecting the label whose embedding is closest to a given image's embedding. It is also widely used for image-text retrieval, as a frozen backbone in downstream vision-language models, and as a building block for content moderation, similarity search, and generative model guidance, notably as the text conditioning mechanism in early versions of Stable Diffusion. OpenAI released several CLIP variants built on different vision encoders, including ResNet and Vision Transformer backbones at multiple sizes and input resolutions, with ViT-L/14 at 336 pixels being the largest and most widely adopted. CLIP is distributed under the MIT license. The model has been widely influential as the basis for subsequent vision-language work (including SigLIP, OpenCLIP, and MetaCLIP) and remains a common reference baseline despite being released in 2021 and surpassed on many benchmarks by later models.
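
The zero-shot classification procedure described for CLIP (embed each candidate label as text, then pick the label whose embedding is closest to the image embedding) can be sketched as follows. This is an illustration only: the three-dimensional vectors are made-up stand-ins for real encoder outputs, the function names are hypothetical, and cosine similarity is assumed as the closeness measure.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs):
    """Return the best-matching label and the score for every label."""
    scores = {
        label: cosine_similarity(image_emb, emb)
        for label, emb in label_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumptions): a real system would obtain these
# from the text encoder and the image encoder respectively.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(image_emb, label_embs)
print(best_label)
```

Because the label set is just a dictionary of text embeddings, changing the class vocabulary means embedding new strings; nothing about the image side needs retraining.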
Meta
DINOv2
DINOv2 is a self-supervised vision foundation model released in April 2023 by Meta AI's FAIR lab. It produces general-purpose visual features that transfer to a wide range of downstream tasks (including image classification, semantic segmentation, depth estimation, and image retrieval) without requiring task-specific fine-tuning. DINOv2 is trained on a curated dataset of 142 million images using a self-supervised objective combining student-teacher distillation, masked image modeling, and an image-level contrastive loss, extending the approach introduced in the original DINO.

The model family spans Vision Transformer sizes from ViT-S (21M parameters) to ViT-g (1.1B parameters), with the larger variants setting state-of-the-art results on linear-probing benchmarks for classification, segmentation, and dense prediction tasks at release. DINOv2 features can be used directly as frozen backbones, reducing the need for labeled training data in downstream applications. The model is primarily used as an image encoder rather than as a complete task-specific model, making it a common backbone choice for custom vision pipelines. DINOv2 code and pretrained weights are released under the Apache 2.0 license, which was adopted after an initial CC-BY-NC 4.0 release in response to community requests for commercial compatibility. A successor model, DINOv3, was released in August 2025 with further scaling and a new training technique called Gram anchoring.
Google
PaliGemma 2
PaliGemma 2 is a vision-language model released in December 2024 by Google DeepMind. It pairs the SigLIP-So400m vision encoder with the Gemma 2 language model family, extending the original PaliGemma architecture with stronger language capabilities and a wider set of transfer benchmarks. The model is designed primarily as a fine-tuning base rather than a chat-optimized assistant. Google releases pretrained "PT" checkpoints intended for task-specific adaptation rather than direct out-of-the-box use.

PaliGemma 2 accepts an image paired with a text prompt and generates natural language output, supporting image captioning, visual question answering, optical character recognition, document understanding, object detection and segmentation (with appropriate fine-tuning), and a range of specialized vision-language tasks. The model is released at three parameter sizes (3B, 10B, and 28B), built on the Gemma 2 2B, 9B, and 27B language backbones. Each size is available at three input resolutions: 224, 448, and 896 pixels. Alongside the base PT checkpoints, Google released PaliGemma 2 Mix variants that have been tuned on a mixture of downstream tasks to provide stronger out-of-the-box performance for common applications such as OCR and document parsing. PaliGemma 2 is distributed under the Gemma license, a custom license from Google that permits commercial use subject to the terms of the Gemma Prohibited Use Policy.
Google
PaliGemma
PaliGemma is a vision-language model released in May 2024 by Google, built by pairing the SigLIP-So400m vision encoder with the Gemma 2B language model. It is designed primarily as a compact, transfer-friendly base model for fine-tuning to downstream vision-language tasks, rather than as a chat-optimized assistant. PaliGemma draws architectural inspiration from the PaLI-3 model at Google Research, applying a similar encoder-decoder approach at a smaller and more accessible parameter scale.

PaliGemma accepts an image together with a text prompt and generates text output, supporting image captioning, visual question answering, optical character recognition, object detection, referring expression segmentation, and a range of related vision-language tasks when fine-tuned on task-specific data. The model is released at three input resolutions (224, 448, and 896 pixels), with higher resolutions providing stronger performance on tasks requiring fine visual detail such as OCR and document understanding. Google released pretrained (PT) checkpoints intended as fine-tuning bases, along with Mix variants that have been fine-tuned on a mixture of downstream tasks for direct use without additional training. PaliGemma is distributed under the Gemma license, a custom license from Google that permits commercial use subject to the terms of the Gemma Prohibited Use Policy. It was succeeded by PaliGemma 2 in December 2024, which extends the architecture to larger Gemma 2 language backbones at 3B, 10B, and 28B parameter sizes.
Azure
Florence-2
Florence-2, introduced by Microsoft Research at CVPR 2024, is an open-source vision-language foundation model designed to unify diverse computer vision tasks within a single sequence-to-sequence framework. Unlike traditional models that specialize in specific tasks, Florence-2 accepts both images and text prompts and outputs text for tasks such as captioning, object detection, segmentation, OCR, and region-based grounding. It comes in two sizes, Florence-2-base (~230M parameters) and Florence-2-large (~770M parameters), and is trained on FLD-5B, a large dataset of ~126M images with ~5.4B annotations.

The model demonstrates strong zero-shot and fine-tuned performance, often rivaling larger vision-language systems while remaining lightweight and efficient. Released under the MIT license, all weights are publicly available, making it accessible for fine-tuning and deployment in applications like VQA, content tagging, accessibility, and research. Florence-2’s compact design, versatility, and openness position it as a practical alternative to larger proprietary multimodal models.
Google
Vision Transformer (ViT)
Vision Transformer is an image classification model developed by Google Research, first published in October 2020. It applies the transformer architecture directly to sequences of image patches without convolutional layers. Each image is divided into fixed-size patches, linearly projected into embeddings, and processed by a standard transformer encoder with multi-head self-attention. A classification token prepended to the patch sequence aggregates global image information for the final prediction.

When pre-trained on large datasets such as JFT-300M and fine-tuned on ImageNet, ViT achieves competitive accuracy with state-of-the-art CNNs of the period. It performs best when pre-training data is abundant, as the lack of convolutional inductive biases makes it less data-efficient than CNN-based classifiers on smaller datasets. ViT established the foundation for transformer-based vision architectures and has influenced a broad range of subsequent models.
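
The first step in the ViT description, dividing an image into fixed-size patches, can be sketched as below. This is a simplified illustration on a single-channel image stored as nested lists; the `patchify` helper is hypothetical, and the linear projection, classification token, and transformer encoder are deliberately left out.

```python
def patchify(image, patch_size):
    """Split a 2D image into non-overlapping, flattened square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4x4 toy image whose pixel values are 0..15 in row-major order.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, 2)

# Sequence length for a 224x224 input with 16x16 patches.
num_patches_224 = len(patchify([[0] * 224 for _ in range(224)], 16))
```

Each flattened patch becomes one token in the sequence the encoder sees, so the sequence length grows with the square of the input resolution for a fixed patch size.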

SigLIP License

Apache 2.0

License terms and commercial-use guidance for SigLIP.

License information is provided as a guide and is not legal advice.