Google

Google: Gemini 3.1 Flash-Lite

Gemini 3.1 Flash-Lite Overview

Gemini 3.1 Flash-Lite is a natively multimodal reasoning model from Google DeepMind in the Gemini 3 series, based on the Gemini 3 Pro architecture. It processes text, image, video, audio, and PDF inputs within a 1 million token context window and produces text output up to 64K tokens. The model targets high-volume, latency-sensitive workloads and supports visual question answering, image and document data extraction, content moderation, classification, translation, automated speech recognition, and agentic data pipelines. It exposes configurable thinking levels of minimal, low, medium, and high, which set the depth of internal reasoning applied per request and let developers balance response quality against cost and latency.

On benchmarks reported at launch, Gemini 3.1 Flash-Lite scores 86.9% on GPQA Diamond and 76.8% on the MMMU Pro multimodal benchmark, and reaches an Elo score of 1432 on the Arena.ai leaderboard. According to Artificial Analysis benchmarks, it produces a 2.5 times faster time to first answer token and a 45% increase in output speed relative to Gemini 2.5 Flash. It also shows improved instruction following, higher audio input quality for automated speech recognition tasks, and support for structured JSON output used in data extraction pipelines.

Gemini 3.1 Flash-Lite Interactive Demo

Gemini 3.1 Flash-Lite Details & Performance

Details

Resources

Vision Tasks

CaptioningClassificationDocument Question AnsweringImage TaggingMulti-Label ClassificationOCRObject DetectionVision LanguageVisual Question Answering

Features

Multimodal VisionLLMs with Vision Capabilities

Usage

Past 30 Days

Performance

Avg. Latency

Arena Rankings

Gemini 3.1 Flash-Lite Vision Evals

#24 of 70 models|

Pass/fail results across 67 image tasks

Overall Score67.16%across 67 eval prompts
Prompts Passed45 / 675 task categories
Avg Response Time3.50son eval prompts
Score key:≥75%40–74%<40%
CategoryPassedScore
Spatial Understanding15 / 19
78.9%
Document Understanding7 / 9
77.8%
Defect Detection11 / 15
73.3%
Object Understanding9 / 14
64.3%
Object Counting3 / 10
30%

Scores based on single evaluation run · Methodology

View all Vision Evals →

Gemini 3.1 Flash-Lite Pricing

Gemini 3.1 Flash-Lite costs $0.250 per 1M input tokens and $1.50 per 1M output tokens.

Input$0.250 / 1M tokens
Output$1.50 / 1M tokens
Cached input$0.025 / 1M tokens

Pricing updated Jun 21, 2026

Alternatives to Gemini 3.1 Flash-Lite

Other models worth comparing for similar use cases.

Google
Gemini 3.5 Flash
Gemini 3.5 Flash is a multimodal language model developed by Google DeepMind and released at Google I/O 2026. It is built on the Gemini 3 Flash reasoning foundation and introduces configurable thinking levels (minimal, low, medium, and high) that allow developers to tune the depth of internal reasoning before a response is generated. The model accepts text, image, video, audio, and PDF inputs and produces text output, with a 1 million token context window and up to 65,000 output tokens per request. It is natively multimodal, processing visual inputs alongside text to support tasks such as image captioning, classification, optical character recognition, object detection, and visual grounding, where the model references specific regions within an image or video frame.Its vision capabilities extend to interpreting UI screenshots, diagrams, charts, and real-world scenes, as well as understanding video and live frame sequences for activity and scene recognition. The model supports combined tool use, including Google Search, URL context, code execution, and custom functions, within a single request, and it uses reasoning context from previous turns when thought signatures are present in the conversation history, enabling persistent multi-turn reasoning chains. Gemini 3.5 Flash carries a knowledge cutoff of January 2026 and is available via the Gemini API, Google AI Studio, Google Antigravity, and the Gemini Enterprise Agent Platform.
Google
Gemma 4 26B A4B
Gemma 4 26B A4B is the Mixture-of-Experts variant in Google's Gemma 4 family, with 25.2B total parameters but only 3.8B active per token. Built from the same Gemini 3 research as the 31B dense sibling and released as open weights under the Apache 2.0 license, it supports a 256K token context window with text and image input and configurable thinking mode. The "A4B" in the name refers to its approximately 4B active parameters. The MoE design makes it significantly faster at inference than the dense 31B, running nearly as fast as a 4B-parameter model while delivering roughly 97% of the dense model's quality.For vision tasks, the 26B A4B shares the same multimodal capabilities as the 31B image understanding with variable aspect ratios and resolutions, and structured bounding box output for UI element detection. The tradeoff versus the 31B dense model is a small quality reduction in exchange for much faster inference and lower hardware requirements, fitting in 18GB of VRAM at 4-bit quantization. It ranked #6 among open models on the Arena AI text leaderboard at launch.
Google
Gemma 4 31B
Gemma 4 31B is the largest dense model in Google's Gemma 4 family, built from the same research as Gemini 3 and released as open weights under the Apache 2.0 license. It supports a 256K token context window with text and image input, configurable thinking mode for step-by-step reasoning, and multilingual support across 140+ languages. The unquantized model fits on a single 80GB GPU.For vision tasks, Gemma 4 31B supports image understanding with variable aspect ratios and resolutions, and can output structured bounding boxes for UI element detection, making it useful for document parsing and UI understanding. Compared to Gemma 3, it delivers stronger reasoning and multimodal performance. It is part of a four-size family alongside the 26B A4B MoE variant and two on-device models (E2B, E4B), with the 31B dense variant optimized for output quality and fine-tuning over inference speed.
Anthropic
Claude Sonnet 4.5
Claude Sonnet 4.5, released by Anthropic in September 2025, is the company’s most advanced Sonnet-series model, built for high-performance reasoning, coding, and long-horizon agentic workflows. It is a multimodal system that accepts both text and images, with a 200,000-token context window designed for handling large documents and extended interactions. Anthropic highlights its improvements in reliability, reduced sycophancy, and alignment, making it suitable for sustained enterprise use.The model delivers strong results in coding and autonomous workflows, achieving 61.4% on the OSWorld benchmark and leading performance on SWE-bench Verified. It introduces infrastructure features such as a memory tool (beta), checkpointing for Claude Code, parallel tool use, and tighter integration with VS Code. Compared to Opus, which targets broader reasoning, Sonnet 4.5 is optimized for structured, long-duration tasks. Positioned against leading offerings from OpenAI and Google, it is aimed at enterprise automation, software engineering, and research-intensive applications.
Qwen
Qwen3.6 35B A3B
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts (MoE) multimodal language model developed by the Qwen team at Alibaba Group. It carries 35 billion total parameters but activates only approximately 3 billion per forward pass via a learned routing mechanism, giving it the representational capacity of a large dense model at a fraction of the inference compute. The model is natively multimodal, processing images, documents, and video alongside text as a core architectural capability rather than an add-on. It supports a native context window of 262,144 tokens, extensible up to 1,010,000 tokens via YaRN. A key design feature is the unified thinking/non-thinking mode framework: users can switch between deliberate chain-of-thought reasoning and fast direct responses within a single model, and a "thinking preservation" option retains reasoning context across multi-turn agentic workflows to reduce redundant computation.The model is specifically optimized for agentic coding tasks, including repository-level reasoning, frontend workflow generation, multi-step tool use, and MCP (Model Context Protocol) integration. On SWE-bench Verified it scores 73.4%, on Terminal-Bench 2.0 it scores 51.5%, and on MCPMark it scores 37.0%. For vision-language tasks it achieves 92.0 on RefCOCO, 89.9 on OmniDocBench 1.5, and 83.7 on VideoMMMU. The model also supports Multi-Token Prediction (MTP) for speculative decoding. All Qwen3.6 open-weight models are released under the Apache 2.0 license.
Meta
Llama 4 Scout
Llama 4 Scout, released on April 5, 2025, is one of Meta AI’s first Llama 4 multimodal models, alongside Maverick. It accepts text + image inputs and produces text outputs, with a knowledge cutoff of August 2024. Scout is notable for its extremely large context window of 10 million tokens, making it well-suited for analyzing very long documents, extended conversations, or large codebases.Architecturally, Scout uses a Mixture-of-Experts (MoE) system with 16 experts, activating ~17B parameters per inference from a pool of ~109B total parameters, balancing capacity with efficiency. It officially supports 12 languages (including English, Arabic, French, Hindi, and Spanish), while offering multimodal reasoning for images (captioning, Q&A, recognition). Meta highlights that Scout can run on a single Nvidia H100 GPU, making it more accessible than larger-scale Llama 4 models. However, its output token limit is far smaller than its 10M input window, image input support is still constrained, and license restrictions apply for large-scale commercial deployments.

Gemini 3.1 Flash-Lite License

Proprietary

License terms and commercial-use guidance for Gemini 3.1 Flash-Lite.

License information is provided as a guide and is not legal advice.