Google: Gemini 3 Flash

Gemini 3 Flash Overview

Gemini 3 Flash is a proprietary multimodal large language model developed by Google through Google DeepMind, designed to deliver fast, cost-efficient reasoning across real-time products and developer workflows. Released in December 2025, it is the Flash-tier variant of the Gemini 3 family, balancing low latency with reasoning quality approaching Pro models.

The model accepts text, image, audio, and video input, with a context window of roughly one million input tokens and up to ~65k output tokens. It emphasizes rapid responses for coding, summarization, analysis, and agentic tasks, and exposes configurable “thinking levels” via the API to trade speed for deeper reasoning. Gemini 3 Flash is positioned as a high-throughput, production-ready model for scalable, interactive AI applications, and it serves as the default model in the Gemini app and in Google Search’s AI Mode.
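As a rough illustration of the “thinking levels” setting, the sketch below builds a request body without sending it. The field names (`generationConfig`, `thinkingConfig`, `thinkingLevel`) and the level names are assumptions based on Google's public Gemini API conventions, not something this page confirms, and `build_request` is a hypothetical helper. Check the current API reference before relying on any of them.

```python
import json

# Assumed level names; this page only says that "thinking levels" are configurable.
ASSUMED_LEVELS = ("minimal", "low", "medium", "high")


def build_request(prompt: str, thinking_level: str = "low") -> dict:
    """Build a generateContent-style request body (assumed shape, not sent anywhere)."""
    if thinking_level not in ASSUMED_LEVELS:
        raise ValueError(f"unknown thinking level: {thinking_level!r}")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # Assumed field names: lower levels favor speed, higher levels favor depth.
        "generationConfig": {"thinkingConfig": {"thinkingLevel": thinking_level}},
    }


if __name__ == "__main__":
    print(json.dumps(build_request("Summarize this invoice.", "high"), indent=2))
```

Keeping the level as a single parameter makes it easy to start low for latency-sensitive calls and raise it only for requests that need deeper reasoning.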

Gemini 3 Flash Interactive Demo

Upload an image to try the model. Supported formats: JPEG, PNG, GIF, WebP.

Gemini 3 Flash Details & Performance

Details

Vision Tasks

Captioning, Chart Question Answering, Classification, Document Question Answering, Image Tagging, Multi-Label Classification, OCR, Object Detection, Vision Language, Visual Question Answering

Features

Foundation Vision, LLMs with Vision Capabilities, Multimodal Vision

Gemini 3 Flash Vision Evals

Vision Evals is Roboflow's ground-truth benchmark: every model runs the same real-world samples across six vision tasks, and answers are scored against ground truth.

Evals updated July 24, 2026. Pricing updated July 27, 2026.

Overall score: 77.1% (#7 of 21)

Avg cost / sample: $0.0021 (#5 of 21)

Avg speed / sample: 4.3s (#2 of 21)

Avg tokens / sample: 1.7K

Strengths and weaknesses

Gemini 3 Flash averages 77.1% across the six Vision Evals tasks, ranking #7 of 21 models overall.

It leads the field in Data Extraction.

Its weakest showing relative to the field is OCR, where it ranks #18 of 21 at 87.6%, just below the field median of 88.8%.

At $0.0021 per sample it is the 5th cheapest of the 21 benchmarked models, and its average inference time of 4.3s per sample makes it the 2nd fastest.

Performance profile

Chart: Gemini 3 Flash compared with the field median on each task.

Field medians: Object Detection 56.0%, Counting 60.8%, Identification 84.4%, OCR 88.8%, Data Extraction 86.6%, Reasoning 73.9%.

Results by task

Task	Score	Rank	Cost / sample	Speed
Object Detection	38.6%	#16 of 21	$0.0031	5.9s
Counting	67.6%	#6 of 21	$0.0012	2.9s
Identification	93.8%	#5 of 21	$0.0009	2.2s
OCR	87.6%	#18 of 21	$0.0024	4.2s
Data Extraction	96.9%	#1 of 21	$0.0008	2.0s
Reasoning	78.3%	#7 of 21	$0.0017	3.6s
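The headline figure can be cross-checked against the per-task results. The sketch below is a minimal check, assuming the overall score is an unweighted mean of the six task scores; the page does not state the aggregation formula, and the function names are illustrative. It also computes each task's gap to the field median in percentage points.

```python
from statistics import mean

# Gemini 3 Flash task scores and field medians, copied from this page (percent).
SCORES = {
    "Object Detection": 38.6,
    "Counting": 67.6,
    "Identification": 93.8,
    "OCR": 87.6,
    "Data Extraction": 96.9,
    "Reasoning": 78.3,
}
MEDIANS = {
    "Object Detection": 56.0,
    "Counting": 60.8,
    "Identification": 84.4,
    "OCR": 88.8,
    "Data Extraction": 86.6,
    "Reasoning": 73.9,
}


def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of task scores (an assumed aggregation), to one decimal."""
    return round(mean(scores.values()), 1)


def gaps_to_median(scores: dict[str, float], medians: dict[str, float]) -> dict[str, float]:
    """Score minus field median per task, in percentage points."""
    return {task: round(scores[task] - medians[task], 1) for task in scores}


if __name__ == "__main__":
    print(overall_score(SCORES))  # 77.1
    for task, gap in gaps_to_median(SCORES, MEDIANS).items():
        print(f"{task}: {gap:+.1f}")
```

Under this assumption the mean reproduces the reported 77.1%. The gaps show the model below the field median on Object Detection (-17.4 points) and OCR (-1.2 points) and above it on the other four tasks.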

Price vs. performance

Score vs. cost

Overall benchmark score against estimated cost per sample. Upper-left is the sweet spot: high quality at low cost.

21 models on the current benchmark · scores and efficiency pooled across all six tasks · Gemini 3 Flash highlighted

Gemini 3 Flash scores come from a single evaluation run.

Gemini 3 Flash Pricing

Gemini 3 Flash costs $0.500 per 1M input tokens and $3.00 per 1M output tokens.

Input: $0.500 / 1M tokens

Output: $3.00 / 1M tokens

Cached input: $0.050 / 1M tokens

Pricing updated Jul 27, 2026
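To turn the per-million-token rates into a per-request estimate, a minimal sketch follows. The rates are copied from this section; `estimate_cost` is a hypothetical helper, and treating cached tokens as a subset of the input tokens is an assumption about how billing is split, not something this page states.

```python
# Prices in USD per 1M tokens, copied from the pricing section on this page.
INPUT_PER_M = 0.50
OUTPUT_PER_M = 3.00
CACHED_INPUT_PER_M = 0.05


def estimate_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    cached_input_tokens is assumed to be the part of input_tokens
    billed at the cached rate instead of the regular input rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_input = input_tokens - cached_input_tokens
    total = (
        fresh_input * INPUT_PER_M
        + cached_input_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 1,200 input tokens and 500 output tokens.
    print(f"${estimate_cost(1_200, 500):.4f}")  # $0.0021
```

Because output tokens cost six times as much as input tokens, response length usually dominates the bill for short prompts.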

Alternatives to Gemini 3 Flash

Other models worth comparing for similar use cases.

GPT-5 Mini

GPT-5 Mini, released by OpenAI on August 7, 2025, is a mid-tier variant of the GPT-5 family that balances cost, speed, and capability. It is multimodal, supporting both text and image inputs, and offers a substantial input context window of ~400,000 tokens with output lengths up to ~128,000 tokens. While less powerful than the full GPT-5, it inherits its safety tuning, instruction-following improvements, and multimodal reasoning, making it a practical choice for developers who need large context handling without the expense of premium models.

GPT-5 Mini is optimized for affordability while retaining strong reasoning performance. Benchmarks show it outperforming earlier models such as GPT-4o on many multimodal and medical VQA tasks, though it lags behind GPT-5 on the most complex problems. Ideal use cases include prototyping, scalable content generation, document analysis, and mid-range reasoning tasks where efficiency and context capacity matter more than top-tier accuracy.

Claude Sonnet 5

Claude Sonnet 5 is a mid-tier large language model from Anthropic, released on June 30, 2026, as the latest model in the Sonnet series and a direct successor to Claude Sonnet 4.6. It is a hybrid reasoning model designed primarily for agentic workflows, software coding, and professional tasks. The model features a 1 million token context window, a 128k maximum output token limit, and runs adaptive thinking by default, giving API users fine-grained control over reasoning effort across five levels (low, medium, high, max, and extra-high). It uses an updated tokenizer shared with Opus 4.7 and later models, which produces approximately 30% more tokens for equivalent text compared to earlier Claude models. On benchmarks, Sonnet 5 scores 63.2% on agentic coding and 81.2% on OSWorld, narrowing the gap with Opus 4.8 while remaining at Sonnet-tier pricing.

The model supports text and image input with text output, and accepts tools including browsers and terminals for autonomous multi-step task execution. Anthropic's safety evaluations report that Sonnet 5 shows a lower rate of undesirable behaviors than Sonnet 4.6 and is generally safer in agentic contexts, with improved resistance to prompt injection and reduced sycophancy. Cybersecurity safeguards equivalent to those on Opus 4.7 and 4.8 are active, though Anthropic notes the model was not deliberately trained on cybersecurity tasks. The model is proprietary and API-only, with no open weights.

Qwen3.6 35B A3B

Qwen3.6-35B-A3B is a sparse Mixture-of-Experts (MoE) multimodal language model developed by the Qwen team at Alibaba Group. It carries 35 billion total parameters but activates only approximately 3 billion per forward pass via a learned routing mechanism, giving it the representational capacity of a large dense model at a fraction of the inference compute. The model is natively multimodal, processing images, documents, and video alongside text as a core architectural capability rather than an add-on. It supports a native context window of 262,144 tokens, extensible up to 1,010,000 tokens via YaRN. A key design feature is the unified thinking/non-thinking mode framework: users can switch between deliberate chain-of-thought reasoning and fast direct responses within a single model, and a "thinking preservation" option retains reasoning context across multi-turn agentic workflows to reduce redundant computation.

The model is specifically optimized for agentic coding tasks, including repository-level reasoning, frontend workflow generation, multi-step tool use, and MCP (Model Context Protocol) integration. On SWE-bench Verified it scores 73.4%, on Terminal-Bench 2.0 it scores 51.5%, and on MCPMark it scores 37.0%. For vision-language tasks it achieves 92.0 on RefCOCO, 89.9 on OmniDocBench 1.5, and 83.7 on VideoMMMU. The model also supports Multi-Token Prediction (MTP) for speculative decoding. All Qwen3.6 open-weight models are released under the Apache 2.0 license.

Other Google Gemini Flash models

Other versions in the same family as Gemini 3 Flash.

Gemini 3.6 Flash, Gemini 3.5 Flash, Gemini 2.5 Flash

Gemini 3 Flash License

Proprietary

License terms and commercial-use guidance for Gemini 3 Flash.

This model is proprietary. The author retains all rights, and use of the model is governed by their specific terms of service or license agreement.

Commercial use depends on the terms set by the model author. Most proprietary commercial models require a paid subscription, API key, or per-call billing. Check the provider’s pricing and terms-of-service for details.

License information is provided as a guide and is not legal advice.

Frequently Asked Questions About Gemini 3 Flash Vision

Yes. Gemini 3 Flash accepts image input and handles OCR, data extraction, object counting, identification, visual reasoning, and object detection. On Roboflow's Vision Evals its strongest task is Data Extraction at 96.9% (#1 of 21). You can test it on your own image in the demo above.

Yes, Gemini 3 Flash can do OCR. On Vision Evals OCR, its transcriptions score 87.6% against the ground truth on average (#18 of 21). Pulling specific fields out of documents (data extraction) scores 96.9%.

Object detection is not its strength. On Vision Evals, Gemini 3 Flash scores 38.6% mAP@50 on object detection (#16 of 21) and 67.6% exact-match accuracy on object counting. For production counting or precise localization, pairing it with a specialized detector like RF-DETR or your own trained model in a Roboflow Workflow is usually more reliable: detect the objects, then count the detections.
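The "detect the objects, then count the detections" step can be sketched in a few lines. This is a minimal illustration, assuming a detector has already returned labeled detections with confidence scores; the `Detection` type, the `count_detections` helper, and the 0.5 threshold are hypothetical and do not reflect the output format of RF-DETR or Roboflow Workflows.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Detection:
    """One detector output; a hypothetical shape, not a specific library's type."""
    label: str
    confidence: float


def count_detections(detections: Iterable[Detection], min_confidence: float = 0.5) -> Counter:
    """Count detections per label, ignoring those below the confidence threshold."""
    return Counter(d.label for d in detections if d.confidence >= min_confidence)


if __name__ == "__main__":
    detections = [
        Detection("box", 0.91),
        Detection("box", 0.42),  # dropped: below the threshold
        Detection("pallet", 0.80),
    ]
    print(count_detections(detections))
```

Counting from explicit detections keeps each count traceable to a specific detected object, which is harder to audit when a language model returns only a number.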

On our benchmark's task mix, Gemini 3 Flash averages $0.0021 per sample at $0.50 per 1M input and $3.00 per 1M output tokens (#5 of 21 on cost), with an average speed of 4.3s per sample across the benchmark. Actual cost depends on your images and prompts.

On the overall Vision Evals ranking, Gemini 3 Flash sits #7 of 21 at 77.1%, just behind GPT-5.6 Sol (78.3%) and just ahead of GPT-5.6 Luna (76.8%). See the full side-by-side: Gemini 3 Flash vs GPT-5.6 Sol.