Llama 3.2 Vision 11B, released by Meta on September 25, 2024, is the first mid-sized Llama model with vision capabilities, accepting both text and image inputs and producing text-only outputs. It contains roughly 10.6 billion parameters (marketed as 11B) and offers a 128,000-token context window, making it suitable for multimodal reasoning over long documents and image–text tasks. The model was trained on roughly 6 billion image–text pairs and has a knowledge cutoff of December 2023.
The model ships in a base and an instruction-tuned (“Vision-Instruct”) variant, the latter optimized for tasks like captioning, visual question answering, and image reasoning. It uses Grouped-Query Attention (GQA) for improved inference efficiency and scalability. While text-only tasks officially support multiple languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai), image+text tasks are officially supported in English only. Llama 3.2 Vision 11B is accessible through Hugging Face, Amazon Bedrock, Azure AI Foundry, NVIDIA NIM, and OCI, making it a widely deployable open-weight multimodal foundation model.
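As an illustration of the image-in, text-out interface, below is a minimal inference sketch for the instruction-tuned variant using the Hugging Face transformers integration (Mllama support landed in transformers 4.45). The model ID follows the public Hugging Face naming, which requires granted access to the gated repository; the image URL and prompt are placeholders chosen for the example, not part of the original description.

```python
# Minimal sketch: run Llama 3.2 Vision 11B Instruct via Hugging Face transformers.
# Assumes transformers >= 4.45 and access to the gated meta-llama repository.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the ~10.6B weights to roughly 21 GB
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL -- substitute any local or remote image.
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)

# A single user turn combines an image and a text prompt; the model returns text only.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same checkpoint can also be served through the managed platforms listed above; the snippet simply shows the local open-weight path via Hugging Face.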