CLIP Overview

OpenAI CLIP (Contrastive Language-Image Pretraining) is a vision-language model released in January 2021 by OpenAI. It jointly trains an image encoder and a text encoder to produce matching embeddings for image-caption pairs, using a contrastive objective over WebImageText (WIT), a dataset of 400 million image-text pairs collected from the public web. By learning to associate images with free-form text rather than a fixed set of class labels, CLIP produces a shared embedding space that enables zero-shot classification with arbitrary vocabularies at inference time.
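
The contrastive objective mentioned above can be illustrated with a small sketch. It assumes a symmetric cross-entropy over the batch similarity matrix, where the i-th image and i-th caption form the only positive pair. The function name `clip_style_loss`, the toy 2-d embeddings, and the fixed `temperature=0.07` are illustrative assumptions (in CLIP the temperature is a learned parameter); this is not OpenAI's training code.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities (hypothetical helper)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should concentrate on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same, applied to columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: captions aligned with their images vs. captions swapped.
matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairing is wrong, which is what pulls matching pairs together in the shared embedding space.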

CLIP supports zero-shot image classification by embedding candidate class labels as text and selecting the label whose embedding is closest, by cosine similarity, to a given image's embedding. It is also widely used for image-text retrieval, as a frozen backbone in downstream vision-language models, and as a building block for content moderation, similarity search, and generative model guidance; notably, its text encoder provides the text conditioning in early versions of Stable Diffusion. OpenAI released several CLIP variants built on different vision encoders, including ResNet and Vision Transformer backbones at multiple sizes and input resolutions, with ViT-L/14 at 336 pixels being the largest and among the most widely adopted. CLIP is distributed under the MIT license. The model has been widely influential as the basis for subsequent vision-language work, including SigLIP, OpenCLIP, and MetaCLIP, and it remains a common reference baseline even though later models have surpassed it on many benchmarks.
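
As a rough illustration of the zero-shot procedure described above, the sketch below picks the label whose text embedding has the highest cosine similarity to an image embedding. The 3-d vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` and the prompt strings are hypothetical names chosen for this example, not part of any CLIP library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs):
    """Return (best_label, scores); scores maps each label to its cosine similarity."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Assumed toy embeddings; a real pipeline would obtain these from the text encoder.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.1]  # assumed output of the image encoder

best, scores = zero_shot_classify(image_emb, label_embs)
print(best)  # a photo of a cat
```

Because the label set is just a dictionary of text embeddings, changing the vocabulary only requires embedding new strings; no retraining is involved.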

CLIP Details & Performance

Details

Vision Tasks

Classification, Image Similarity, Image Tagging, Image Embedding

Features

Zero-shot Detection, Foundation Vision, Multimodal Vision

Alternatives to CLIP

Other models worth comparing for similar use cases.

Google
SigLIP
SigLIP is a vision-language model released in March 2023 by researchers at Google DeepMind. It adapts the CLIP image-text pretraining approach by replacing CLIP's softmax-based contrastive loss with a pairwise sigmoid loss, which operates independently on each image-text pair rather than requiring a global view of all pairs in a batch. This change decouples the loss from batch size, enabling more memory-efficient training and improved performance at smaller batch sizes, a regime where softmax contrastive learning typically struggles. Despite this simplification, SigLIP matches or exceeds CLIP-style models on zero-shot image classification and image-text retrieval benchmarks when trained on comparable data.
SigLIP is distributed as an image encoder plus aligned text encoder, supporting zero-shot classification with arbitrary class vocabularies, image-text retrieval, and use as a frozen backbone in downstream vision-language models. Pretrained models are available at multiple Vision Transformer sizes and input resolutions, including 224, 256, 384, and 512 pixel inputs. SigLIP is released under the Apache 2.0 license by Google and is used as the vision encoder in Google's PaliGemma and PaliGemma 2. A successor, SigLIP 2, was released in February 2025 with multilingual support across 109 languages, improvements to localization and dense prediction, and two resolution handling variants (FixRes for backward-compatible fixed resolutions and NaFlex for native aspect ratio with variable sequence length).
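
To make the contrast with a softmax loss concrete, here is a minimal sketch of a pairwise sigmoid loss. Each image-text pair is treated as its own binary decision (match or non-match), so no normalization across the batch is needed. The function name `pairwise_sigmoid_loss` and the toy embeddings are assumptions for illustration; the `scale` and `bias` values stand in for parameters that are learned during training, and the embeddings are assumed to be unit-normalized already.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_sigmoid_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all image-text pairs (hypothetical helper)."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            sim = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == j else -1.0  # +1 for the true pair, -1 otherwise
            total += -log_sigmoid(label * (scale * sim + bias))
    return total / n

# Toy batch: captions aligned with their images vs. captions swapped.
matched = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)
```

Each term in the double loop depends only on one image and one text, which is why the loss can be computed in chunks without gathering the full similarity matrix on one device.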
Meta
DINOv2
DINOv2 is a self-supervised vision foundation model released in April 2023 by Meta AI's FAIR lab. It produces general-purpose visual features that transfer to a wide range of downstream tasks (including image classification, semantic segmentation, depth estimation, and image retrieval) without requiring task-specific fine-tuning. DINOv2 is trained on a curated dataset of 142 million images using a self-supervised objective combining student-teacher distillation, masked image modeling, and an image-level contrastive loss, extending the approach introduced in the original DINO.
The model family spans Vision Transformer sizes from ViT-S (21M parameters) to ViT-g (1.1B parameters), with the larger variants setting state-of-the-art results on linear-probing benchmarks for classification, segmentation, and dense prediction tasks at release. DINOv2 features can be used directly as frozen backbones, reducing the need for labeled training data in downstream applications. The model is primarily used as an image encoder rather than as a complete task-specific model, making it a common backbone choice for custom vision pipelines. DINOv2 code and pretrained weights are released under the Apache 2.0 license, which was adopted after an initial CC-BY-NC 4.0 release in response to community requests for commercial compatibility. A successor model, DINOv3, was released in August 2025 with further scaling and a new training technique called Gram anchoring.
SAM-CLIP
SAM-CLIP is a unified vision foundation model introduced by researchers at Apple and the University of Illinois Urbana-Champaign in October 2023. It merges two popular vision foundation models (Meta's Segment Anything Model, SAM, and OpenAI's CLIP) into a single shared Vision Transformer backbone through a combination of multi-task learning, continual learning, and teacher-student distillation. The method requires only a small fraction of the original pretraining datasets and demonstrates that complementary capabilities from distinct foundation models can be consolidated without retraining from scratch, reducing the storage and compute cost of running both models at inference time.
The resulting model retains SAM's zero-shot segmentation ability and CLIP's zero-shot classification and image-text retrieval, while introducing new capabilities the individual models lacked. SAM-CLIP establishes state-of-the-art results on zero-shot semantic segmentation across five benchmarks, improving mean IoU by 6.8 points on Pascal VOC and 5.9 points on COCO-Stuff over prior specialized models. The paper was accepted at the UniReps Workshop at NeurIPS 2023 and the eLVM Workshop at CVPR 2024. Apple has published the research but has not released model weights or inference code publicly.
Google
OWL-ViT
OWL-ViT (Open-World Localization with Vision Transformers) is an open-vocabulary object detection model released in May 2022 by Google Research. It adapts a pretrained CLIP-style image-text model by removing the final pooling layer and attaching lightweight classification and box prediction heads to each Transformer output token, producing a detector capable of localizing arbitrary objects described by free-form text at inference time. Rather than being restricted to a fixed taxonomy such as the 80 categories in Microsoft COCO, OWL-ViT can detect object classes specified by a user's text query, including categories the model was never explicitly trained on.
OWL-ViT accepts an image and a list of text queries as input, and produces bounding boxes with class assignments drawn from the supplied queries. It also supports one-shot image-conditioned detection, where a cropped image region is used as the query instead of text, allowing the model to find visually similar instances within a target scene. The model is released in multiple Vision Transformer sizes (ViT-B/32, ViT-B/16, ViT-L/14) and CLIP-pretrained variants, distributed through the Google Research scenic repository and Hugging Face under the Apache 2.0 license. A successor model, OWLv2, was released in June 2023, introducing the OWL-ST self-training recipe that scales training to over one billion pseudo-annotated examples and substantially improves detection performance on rare and long-tail categories while preserving the open-vocabulary interface.
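
The per-token matching described above can be sketched roughly as follows. Each output token is assumed to carry a class embedding and a predicted box; the token is scored against every text query, and it becomes a detection if its best score clears a threshold. The function `match_tokens_to_queries`, the plain dot-product-plus-sigmoid scoring, the threshold, and all toy values are assumptions for illustration; the actual model's heads and score calibration are more involved.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def match_tokens_to_queries(token_embs, token_boxes, query_embs, queries, threshold=0.5):
    """Assign each token to its best-scoring text query, dropping low scores."""
    detections = []
    for emb, box in zip(token_embs, token_boxes):
        # One independent score per query: sigmoid of the dot product.
        scores = [sigmoid(sum(a * b for a, b in zip(emb, q))) for q in query_embs]
        best = max(range(len(queries)), key=scores.__getitem__)
        if scores[best] >= threshold:
            detections.append({"label": queries[best], "score": scores[best], "box": box})
    return detections

# Toy inputs: three output tokens, two text queries (all values made up).
queries = ["a cat", "a remote control"]
query_embs = [[1.0, 0.0], [0.0, 1.0]]
token_embs = [[4.0, -4.0], [-4.0, 4.0], [-4.0, -4.0]]
token_boxes = [(10, 20, 110, 220), (150, 40, 210, 90), (0, 0, 5, 5)]  # (x0, y0, x1, y1)

detections = match_tokens_to_queries(token_embs, token_boxes, query_embs, queries)
for d in detections:
    print(d["label"], round(d["score"], 3), d["box"])
```

The third token scores low against both queries and is discarded, which mirrors how most output tokens in a real image correspond to background rather than a queried object.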
Azure
Florence-2
Florence-2, introduced by Microsoft Research at CVPR 2024, is an open-source vision-language foundation model designed to unify diverse computer vision tasks within a single sequence-to-sequence framework. Unlike traditional models that specialize in specific tasks, Florence-2 accepts both images and text prompts and outputs text for tasks such as captioning, object detection, segmentation, OCR, and region-based grounding. It comes in two sizes, Florence-2-base (~230M parameters) and Florence-2-large (~770M parameters), and is trained on FLD-5B, a large dataset of ~126M images with ~5.4B annotations.
The model demonstrates strong zero-shot and fine-tuned performance, often rivaling larger vision-language systems while remaining lightweight and efficient. Released under the MIT license, all weights are publicly available, making it accessible for fine-tuning and deployment in applications like VQA, content tagging, accessibility, and research. Florence-2's compact design, versatility, and openness position it as a practical alternative to larger proprietary multimodal models.

CLIP License

MIT

License terms and commercial-use guidance for CLIP.

License information is provided as a guide and is not legal advice.