Meta: Llama 3.2 Vision 11B

Released Sep 25, 2024
Llama 3.2 Community License
128,000-token context window
11B parameters
open · multimodal

Overview

Llama 3.2 Vision 11B, released by Meta on September 25, 2024, is the first mid-sized model in the Llama family with vision capabilities, supporting both text and image inputs with text-only outputs. It contains around 11 billion parameters (~10.6B) and features a 128,000-token context window, making it suitable for multimodal reasoning over long documents and image-text tasks. The model was trained on ~6 billion image–text pairs and has a knowledge cutoff of December 2023.

The model is available in a base and an instruction-tuned (“Vision-Instruct”) version, optimized for tasks like captioning, visual question answering, and image reasoning. It uses Grouped-Query Attention (GQA) for improved inference efficiency and scalability. While text-only tasks officially support multiple languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai), image+text tasks are supported only in English. Llama 3.2 Vision 11B is accessible through Hugging Face, Amazon Bedrock, Azure AI Foundry, NVIDIA NIM, and OCI, making it a widely deployable open-weight multimodal foundation model.
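As a quick illustration of the image+text interface, the sketch below runs a single visual question through the instruction-tuned checkpoint using the Hugging Face transformers Mllama integration. It assumes transformers 4.45 or later, approved access to the gated meta-llama/Llama-3.2-11B-Vision-Instruct repository, and a GPU with sufficient memory; the image URL is a placeholder, not part of the original description.

```python
# Minimal sketch: visual question answering with Llama 3.2 Vision 11B Instruct
# via Hugging Face transformers. Assumes transformers >= 4.45 and gated access
# to the meta-llama repository; the image URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load weights in bfloat16 and let accelerate place them across available GPUs.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works; this URL is only an example.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# The instruct model expects the chat format with an image slot followed by text.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Text-only output: generate a short answer and decode it.
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern covers captioning and image reasoning; only the text portion of the user message changes.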
