Meta: Llama 3.2 Vision 90B

Released Sep 25, 2024
Llama 3.2 Community License
128,000-token context
90B parameters
open weights
multimodal

Overview

Llama 3.2 Vision 90B, released by Meta AI on September 25, 2024, is the largest vision-capable model in the Llama 3.2 family. With about 90 billion parameters (~88.8B) and a 128,000-token context window, it is designed for high-performance multimodal reasoning over images and text, while producing only text outputs. The model was trained on ~6 billion image–text pairs and instruction-tuned (SFT + RLHF), with a knowledge cutoff of December 2023.
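The sketch below shows one way to run image-grounded question answering with this model locally. It is a minimal example using the Hugging Face transformers Mllama integration, assuming access to the gated meta-llama/Llama-3.2-90B-Vision-Instruct checkpoint, a local image file (photo.jpg here is a placeholder), and enough GPU memory for a ~90B model or a quantized variant:

    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor

    model_id = "meta-llama/Llama-3.2-90B-Vision-Instruct"

    # Load the weights in bfloat16 and shard them across available GPUs.
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # One image plus a text question; the model returns text only.
    image = Image.open("photo.jpg")
    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is happening in this image?"},
        ]}
    ]

    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(
        image, prompt, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output[0], skip_special_tokens=True))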

It powers tasks such as visual question answering, captioning, and image-grounded reasoning, and posts strong benchmark results against both open and proprietary models. The model officially supports English for multimodal (image + text) tasks, while text-only inputs extend to eight languages, including German, French, Hindi, and Spanish. Its size demands substantial compute, but it is accessible through cloud providers such as Amazon Bedrock, Oracle Cloud, and Azure AI Foundry. While highly capable, it emits only text and supports fewer languages for vision inputs than for text-only ones.
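For hosted access, the Amazon Bedrock Converse API accepts the same image-plus-text pattern. The snippet below is a sketch assuming the model is enabled in your AWS account; the model identifier shown (a cross-region inference profile ID) is an assumption, so check the Bedrock console for the exact ID in your region:

    import boto3

    client = boto3.client("bedrock-runtime", region_name="us-west-2")

    with open("photo.jpg", "rb") as f:
        image_bytes = f.read()

    # Model ID is an assumption: verify the identifier available
    # in your region before use.
    response = client.converse(
        modelId="us.meta.llama3-2-90b-instruct-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Describe this image in one sentence."},
            ],
        }],
        inferenceConfig={"maxTokens": 256, "temperature": 0.2},
    )

    print(response["output"]["message"]["content"][0]["text"])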

Performance

(Interactive charts omitted: Avg. Latency, Model Rankings, Supported Tasks.)