Google: Gemini 2.5 Flash

Gemini 2.5 Flash Overview

Gemini 2.5 Flash, released on June 17, 2025, is Google DeepMind’s production-ready, efficiency-focused model in the Gemini 2.5 family. It is multimodal, accepting text, images, video, and audio as inputs, with text as the primary output format. The model accepts up to 1 million input tokens and produces up to 65K output tokens, enough to process very large contexts such as books, long video transcripts, or extensive datasets. Its knowledge cutoff is January 2025.

Designed as a price-performance leader, Gemini 2.5 Flash balances speed and reasoning power, making it suitable for everyday enterprise and developer use cases without the higher latency and cost of Pro models. It supports advanced workflows such as function calling, code execution, search grounding, URL context ingestion, and structured outputs. Its output length is still limited relative to its input capacity (65K versus 1 million tokens), and multimodal outputs (e.g. image or audio generation) remain restricted to specialized or preview variants.

Gemini 2.5 Flash Details & Performance

Details

Vision Tasks

Vision Language, Object Detection, Classification, OCR, Visual Question Answering, Captioning

Features

Foundation Vision, LLMs with Vision Capabilities, Multimodal Vision

Gemini 2.5 Flash Vision Evals

Visual Understanding

72 models · 67 tasks
This model: #54 of 72 · 55.22% pass rate · better than 25%
Score: 55.22% pass rate across 67 tasks
Speed: 24.91s avg response per task
Cost: $0.0005 / task ($0.300 in · $2.50 out / 1M tokens)
Tokens: 476 / task (294 in · 171 out)
Score key: ≥75% · 40–74% · <40%

Category | Passed | Score
Document Understanding | 8 / 9 | 88.9%
Object Understanding | 10 / 14 | 71.4%
Defect Detection | 9 / 15 | 60%
Spatial Understanding | 10 / 19 | 52.6%
Object Counting | 0 / 10 | 0%

50 models · 229 tasks
This model: #21 of 50 · 79.04% pass rate · better than 56%
Score: 79.04% pass rate across 229 tasks
Speed: 2.39s avg response per task
Cost: $0.0003 / task ($0.300 in · $2.50 out / 1M tokens)
Tokens: 372 / task (290 in · 81 out)
Score key: ≥75% · 40–74% · <40%

Category | Passed | Score
License Plate Recognition | 27 / 30 | 90%
Text Recognition | 24 / 30 | 80%
Handwritten Math | 8 / 10 | 80%
Focused Scene OCR | 79 / 99 | 79.8%
VQA & Extraction | 43 / 60 | 71.7%

Scores are based on a single evaluation run.
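
The headline score for each evaluation set appears to be a pooled pass rate (total tasks passed divided by total tasks) rather than an average of the category percentages. The following Python sketch checks that reading against the category counts listed above. The function name `pooled_pass_rate` and the dictionary names are illustrative only and are not part of any published evaluation tooling.

```python
def pooled_pass_rate(categories):
    """Return total passed / total tasks as a percentage.

    `categories` maps a category name to a (passed, total) pair.
    """
    passed = sum(p for p, _ in categories.values())
    total = sum(t for _, t in categories.values())
    if total == 0:
        raise ValueError("no tasks to score")
    return 100.0 * passed / total


# Category counts copied from the Visual Understanding results.
visual_understanding = {
    "Document Understanding": (8, 9),
    "Object Understanding": (10, 14),
    "Defect Detection": (9, 15),
    "Spatial Understanding": (10, 19),
    "Object Counting": (0, 10),
}

# Category counts from the second evaluation set (50 models, 229 tasks);
# the variable name is a placeholder because the set's title is not shown.
second_set = {
    "License Plate Recognition": (27, 30),
    "Text Recognition": (24, 30),
    "Handwritten Math": (8, 10),
    "Focused Scene OCR": (79, 99),
    "VQA & Extraction": (43, 60),
}

print(f"{pooled_pass_rate(visual_understanding):.2f}%")  # 55.22%
print(f"{pooled_pass_rate(second_set):.2f}%")            # 79.04%
```

Both results match the reported scores (37 of 67 and 181 of 229 tasks passed), which is consistent with the pooled interpretation.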

Gemini 2.5 Flash Pricing

Gemini 2.5 Flash costs $0.300 per 1M input tokens and $2.50 per 1M output tokens.

Input: $0.300 / 1M tokens
Output: $2.50 / 1M tokens
Cached input: $0.030 / 1M tokens

Pricing updated Jun 28, 2026
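
As a rough illustration of how these per-token prices translate into per-task cost, the sketch below multiplies token counts by the listed rates. The function name `estimate_cost` is hypothetical, and the treatment of cached tokens (billed at the cached rate and counted separately from regular input tokens) is an assumption, not confirmed billing behavior.

```python
# Prices in USD per 1M tokens, taken from the pricing list above.
INPUT_PRICE_PER_M = 0.300
OUTPUT_PRICE_PER_M = 2.50
CACHED_INPUT_PRICE_PER_M = 0.030


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the USD cost of one request.

    Assumption: `input_tokens` excludes cached tokens, which are
    billed separately at the cached-input rate.
    """
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M
        + cached_input_tokens * CACHED_INPUT_PRICE_PER_M
    )
    return total / 1_000_000


# Average token counts reported for the Visual Understanding tasks.
print(f"${estimate_cost(294, 171):.4f}")  # $0.0005
```

With the 294 input and 171 output tokens reported per Visual Understanding task, the estimate rounds to the $0.0005 per task shown in the evaluation summary.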

Price vs. performance

Estimated cost per task vs. Visual Understanding score, for this model and others ranked near it. Upper-left is the sweet spot (high quality, low cost).

Model | Score | Median tokens | Est. cost / task
Anthropic Claude Opus 4.6 | 64.2% | 2.3K | $0.014
Anthropic Claude Sonnet 4.5 | 59.7% | 2.3K | $0.0092
Anthropic Claude Opus 4.1 | 59.7% | 2.1K | $0.040
OpenAI GPT-5 Nano | 58.2% | 2.7K | $0.0003
Qwen Qwen3.5 397B A17B | 58.2% | 1.5K | $0.0006
Google Gemini 2.5 Flash (this model) | 55.2% | 476 | $0.0005
Google Gemini 2.5 Flash-Lite | 53.7% | 301 | $0.0000
MoonshotAI Kimi K2.5 | 35.8% | 2.7K | $0.0021

Alternatives to Gemini 2.5 Flash

Other models worth comparing for similar use cases.

Google
Gemini 3.1 Flash-Lite
Gemini 3.1 Flash-Lite is a natively multimodal reasoning model from Google DeepMind in the Gemini 3 series, based on the Gemini 3 Pro architecture. It processes text, image, video, audio, and PDF inputs within a 1 million token context window and produces text output up to 64K tokens. The model targets high-volume, latency-sensitive workloads and supports visual question answering, image and document data extraction, content moderation, classification, translation, automated speech recognition, and agentic data pipelines. It exposes configurable thinking levels of minimal, low, medium, and high, which set the depth of internal reasoning applied per request and let developers balance response quality against cost and latency.
On benchmarks reported at launch, Gemini 3.1 Flash-Lite scores 86.9% on GPQA Diamond and 76.8% on the MMMU Pro multimodal benchmark, and reaches an Elo score of 1432 on the Arena.ai leaderboard. According to Artificial Analysis benchmarks, it produces a 2.5 times faster time to first answer token and a 45% increase in output speed relative to Gemini 2.5 Flash. It also shows improved instruction following, higher audio input quality for automated speech recognition tasks, and support for structured JSON output used in data extraction pipelines.
Google
Gemma 4 26B A4B
Gemma 4 26B A4B is the Mixture-of-Experts variant in Google's Gemma 4 family, with 25.2B total parameters but only 3.8B active per token. Built from the same Gemini 3 research as the 31B dense sibling and released as open weights under the Apache 2.0 license, it supports a 256K token context window with text and image input and configurable thinking mode. The "A4B" in the name refers to its approximately 4B active parameters. The MoE design makes it significantly faster at inference than the dense 31B, running nearly as fast as a 4B-parameter model while delivering roughly 97% of the dense model's quality.
For vision tasks, the 26B A4B shares the same multimodal capabilities as the 31B: image understanding with variable aspect ratios and resolutions, and structured bounding box output for UI element detection. The tradeoff versus the 31B dense model is a small quality reduction in exchange for much faster inference and lower hardware requirements, fitting in 18GB of VRAM at 4-bit quantization. It ranked #6 among open models on the Arena AI text leaderboard at launch.
Qwen
Qwen3.6 35B A3B
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts (MoE) multimodal language model developed by the Qwen team at Alibaba Group. It carries 35 billion total parameters but activates only approximately 3 billion per forward pass via a learned routing mechanism, giving it the representational capacity of a large dense model at a fraction of the inference compute. The model is natively multimodal, processing images, documents, and video alongside text as a core architectural capability rather than an add-on. It supports a native context window of 262,144 tokens, extensible up to 1,010,000 tokens via YaRN. A key design feature is the unified thinking/non-thinking mode framework: users can switch between deliberate chain-of-thought reasoning and fast direct responses within a single model, and a "thinking preservation" option retains reasoning context across multi-turn agentic workflows to reduce redundant computation.
The model is specifically optimized for agentic coding tasks, including repository-level reasoning, frontend workflow generation, multi-step tool use, and MCP (Model Context Protocol) integration. On SWE-bench Verified it scores 73.4%, on Terminal-Bench 2.0 it scores 51.5%, and on MCPMark it scores 37.0%. For vision-language tasks it achieves 92.0 on RefCOCO, 89.9 on OmniDocBench 1.5, and 83.7 on VideoMMMU. The model also supports Multi-Token Prediction (MTP) for speculative decoding. All Qwen3.6 open-weight models are released under the Apache 2.0 license.

Gemini 2.5 Flash License

Proprietary

License terms and commercial-use guidance for Gemini 2.5 Flash.

License information is provided as a guide and is not legal advice.