Pixtral-12B is a vision-language model released by Mistral AI in September 2024 under the Apache 2.0 license, designed to process interleaved text and images in a single context. It pairs a ~12-billion-parameter multimodal decoder with a custom-trained ~400-million-parameter vision encoder, supports a context window of up to 128k tokens, and accepts multiple images per input. The vision encoder processes images at their native resolution and aspect ratio, making the model flexible across diverse multimodal tasks.
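To make the multi-image, variable-aspect-ratio handling concrete, here is a minimal local-inference sketch using Hugging Face transformers' Llava integration with the community checkpoint `mistral-community/pixtral-12b`; the `[IMG]` prompt placeholders and file names are assumptions taken from the community model card and may need adjusting for your setup.

```python
# Minimal sketch: multi-image inference with Pixtral-12B via transformers.
# Assumes the "mistral-community/pixtral-12b" checkpoint and enough GPU
# memory (device_map="auto" requires the `accelerate` package).
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

# Two images of different sizes and aspect ratios in one prompt; the
# encoder processes each at its native resolution. One [IMG] placeholder
# per image (prompt format assumed from the community model card).
images = [Image.open("chart.png"), Image.open("receipt.jpg")]
prompt = "<s>[INST]Compare these two documents.\n[IMG][IMG][/INST]"

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))
```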
As Mistral’s first VLM, Pixtral-12B performs strongly not only on image-text reasoning benchmarks but also on text-only benchmarks, positioning it as a versatile alternative to models such as GPT-4V and LLaVA. Its open availability through Hugging Face and major cloud providers such as Amazon Bedrock and SageMaker makes it accessible for both research and production. Typical use cases include document analysis, visual QA, data extraction, and multimodal assistants that require both textual and visual understanding.
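For hosted use, a short sketch of a visual QA / data-extraction call through Mistral's own API is shown below, assuming the official `mistralai` Python client (v1) and the `pixtral-12b-2409` model id; the image URL is a placeholder, and Bedrock or SageMaker deployments would follow the same pattern through their respective SDKs.

```python
# Hedged sketch: hosted Pixtral-12B inference via the mistralai client.
# Assumes MISTRAL_API_KEY is set and the "pixtral-12b-2409" model id.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                # Mixed text + image content in a single user turn.
                {"type": "text", "text": "Extract the line items from this invoice as JSON."},
                {"type": "image_url", "image_url": "https://example.com/invoice.png"},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```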