SigLIP is a vision-language model released in March 2023 by researchers at Google DeepMind. It adapts the CLIP image-text pretraining approach by replacing CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, which treats each image-text pair as an independent binary match/non-match decision rather than normalizing over all pairs in a batch. This change decouples the loss from batch size, enabling more memory-efficient training and improved performance at smaller batch sizes, a regime where softmax contrastive learning typically struggles. Despite this simplification, SigLIP matches or exceeds CLIP-style models on zero-shot image classification and image-text retrieval benchmarks when trained on comparable data.
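The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the embeddings are toy vectors, and the `temperature` and `bias` defaults are illustrative assumptions (in training, both are learned parameters).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary losses over every image-text pair.

    Pair (i, j) is labeled +1 when i == j (a matching pair) and -1 otherwise.
    No softmax normalization over the batch is needed. The temperature and
    bias defaults are assumptions chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Average over the number of images in the batch.
    return -total / len(imgs)


# Toy batch: two images and two texts in a 2-D embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = pairwise_sigmoid_loss(images, aligned_texts)
swapped_loss = pairwise_sigmoid_loss(images, swapped_texts)
```

Because each term depends only on one image-text pair, the double loop can be split into chunks and evaluated piecewise, which is the property that makes the loss independent of a batch-wide normalization.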
SigLIP is distributed as an image encoder plus an aligned text encoder, supporting zero-shot classification with arbitrary class vocabularies, image-text retrieval, and use as a frozen backbone in downstream vision-language models. Pretrained models are available at multiple Vision Transformer sizes and input resolutions, including 224, 256, 384, and 512 pixel inputs. SigLIP is released by Google under the Apache 2.0 license and is used as the vision encoder in Google's PaliGemma and PaliGemma 2. A successor, SigLIP 2, was released in February 2025 with multilingual support across 109 languages, improvements to localization and dense prediction, and two resolution-handling variants (FixRes for backward-compatible fixed resolutions and NaFlex for native aspect ratio with variable sequence length).
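Zero-shot classification with an arbitrary class vocabulary can be illustrated as follows. This is a sketch under stated assumptions: the embeddings are hand-written toy vectors standing in for the outputs of the image and text encoders, `zero_shot_scores` is a hypothetical helper, and the `temperature` and `bias` values are illustrative rather than taken from any released checkpoint.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def zero_shot_scores(image_emb, label_embs, temperature=10.0, bias=-10.0):
    """Score one image against each candidate label independently.

    Each score is a separate match probability, so the scores are not
    forced to sum to 1 as they would be under a softmax.
    """
    return {
        label: sigmoid(temperature * cosine(image_emb, emb) + bias)
        for label, emb in label_embs.items()
    }


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

scores = zero_shot_scores(image_emb, label_embs)
best_label = max(scores, key=scores.get)
```

Adding or removing a candidate label changes only that label's entry, since no score depends on the rest of the vocabulary.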
License information is provided as a guide and is not legal advice.