Google: Gemma 3 4B

Gemma 3 4B Overview

Gemma 3 4B, released on March 12, 2025, is a mid-sized member of Google DeepMind’s open-weight Gemma 3 family. It has about 4 billion parameters and is multimodal: it accepts text and image inputs and generates text output. Like the larger Gemma 3 models, it has a 128,000-token input context window and can generate up to about 8,192 output tokens, which lets it handle long documents and mixed text-and-image reasoning tasks.

The 4B variant is designed to balance efficiency and capability: it offers multilingual support across 140+ languages, strong summarization and reasoning performance, and compatibility with moderate hardware. Inference needs about 6.4 GB of VRAM in BF16, or less when quantized to 8-bit (~4.4 GB) or 4-bit (~3.4 GB), which puts it within reach of developers without large-scale infrastructure. It lags behind the 12B and 27B versions on the most complex reasoning and multimodal benchmarks, but its lower compute footprint makes it well suited to research, prototyping, and practical deployment where efficiency matters.
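As a rough illustration of how the memory figures above might guide a deployment choice, the sketch below picks the highest-precision mode whose quoted footprint fits a given GPU. The function name `pick_precision` and the `headroom_gb` margin (space reserved for activations and the KV cache) are assumptions made for this example, not part of any Gemma tooling, and real memory use depends on context length and runtime.

```python
# VRAM figures in GB as quoted in the overview; treat them as rough guides.
VRAM_GB = {"bf16": 6.4, "int8": 4.4, "int4": 3.4}


def pick_precision(available_gb, headroom_gb=1.0):
    """Return the highest-precision mode whose quoted footprint fits.

    headroom_gb is a hypothetical safety margin, not a measured value.
    Returns None when even the 4-bit footprint does not fit.
    """
    budget = available_gb - headroom_gb
    # Modes are checked from highest to lowest precision.
    for mode in ("bf16", "int8", "int4"):
        if VRAM_GB[mode] <= budget:
            return mode
    return None


if __name__ == "__main__":
    for gpu_gb in (4.0, 6.0, 8.0):
        print(gpu_gb, "GB ->", pick_precision(gpu_gb))
```

With the default 1 GB margin, an 8 GB card fits BF16, a 6 GB card falls back to 8-bit, and a 4 GB card fits none of the three modes.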

Gemma 3 4B Details & Performance

Details

Vision Tasks

Vision Language, OCR, Visual Question Answering, Captioning

Features

Multimodal Vision

Gemma 3 4B Vision Evals

Visual Understanding

72 models · 67 tasks
This model: #68 of 72 · 37.31% pass rate · better than 6%
Score: 37.31% pass rate across 67 tasks
Speed: 16.80 s average response per task
Cost: $0.050 in · $0.100 out / 1M tokens
Tokens: unavailable
Score key: ≥75% · 40–74% · <40%

Category | Passed | Score
Defect Detection | 9 / 15 | 60%
Document Understanding | 5 / 9 | 55.6%
Object Understanding | 6 / 14 | 42.9%
Spatial Understanding | 5 / 19 | 26.3%
Object Counting | 0 / 10 | 0%

50 models · 229 tasks
This model: #38 of 50 · 64.19% pass rate · better than 24%
Score: 64.19% pass rate across 229 tasks
Speed: 0.92 s average response per task
Cost: $0.0000 / task · $0.050 in · $0.100 out / 1M tokens
Tokens: 314 / task · 300 in · 12 out
Score key: ≥75% · 40–74% · <40%

Category | Passed | Score
License Plate Recognition | 26 / 30 | 86.7%
Text Recognition | 22 / 30 | 73.3%
Focused Scene OCR | 63 / 99 | 63.6%
VQA & Extraction | 35 / 60 | 58.3%
Handwritten Math | 1 / 10 | 10%
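
The headline pass rates can be cross-checked against the per-category counts. The sketch below pools the passed and total task counts from the two category tables, and computes the "better than" share as the fraction of models ranked below this one. That ranking formula is an assumption for illustration; the page does not state how the figure is derived, although it matches the reported values here.

```python
# (passed, total) per category, copied from the two evaluation tables.
visual_understanding = {
    "Defect Detection": (9, 15),
    "Document Understanding": (5, 9),
    "Object Understanding": (6, 14),
    "Spatial Understanding": (5, 19),
    "Object Counting": (0, 10),
}
second_benchmark = {
    "License Plate Recognition": (26, 30),
    "Text Recognition": (22, 30),
    "Focused Scene OCR": (63, 99),
    "VQA & Extraction": (35, 60),
    "Handwritten Math": (1, 10),
}


def overall_pass_rate(categories):
    """Pool all tasks and return (passed, total, percent passed)."""
    passed = sum(p for p, _ in categories.values())
    total = sum(t for _, t in categories.values())
    return passed, total, 100.0 * passed / total


def better_than(rank, n_models):
    """Assumed formula: percent of models ranked below the given rank."""
    return 100.0 * (n_models - rank) / n_models


if __name__ == "__main__":
    for name, table in (("first", visual_understanding),
                        ("second", second_benchmark)):
        passed, total, pct = overall_pass_rate(table)
        print(f"{name}: {passed} / {total} = {pct:.2f}%")
    print(f"better than {better_than(68, 72):.0f}% and {better_than(38, 50):.0f}%")
```

The pooled rates are task-weighted, so large categories such as Focused Scene OCR (99 tasks) move the overall score more than small ones such as Handwritten Math (10 tasks).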

Scores are based on a single evaluation run.

Gemma 3 4B Pricing

Gemma 3 4B costs $0.050 per 1M input tokens and $0.100 per 1M output tokens.

Input: $0.050 / 1M tokens
Output: $0.100 / 1M tokens
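
To show how the per-token prices translate into a per-task cost, here is a minimal sketch using the listed rates and the token counts reported in the evaluation section (300 input and 12 output tokens per task). The function name `cost_per_task` is made up for this example.

```python
# Listed prices in USD per one million tokens.
INPUT_USD_PER_M = 0.050
OUTPUT_USD_PER_M = 0.100


def cost_per_task(input_tokens, output_tokens):
    """Estimated USD cost of one request at the listed prices."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


if __name__ == "__main__":
    cost = cost_per_task(300, 12)
    print(f"${cost:.7f} per task")          # full precision
    print(f"${cost:.4f} at four decimals")  # the precision used on this page
```

At these token counts a task costs about $0.0000162, which is why it displays as $0.0000 when rounded to four decimal places.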

Pricing updated Jun 27, 2026

Price vs. performance

Estimated cost per task vs. Visual Understanding score, for this model and others ranked near it. Upper-left is the sweet spot (high quality, low cost).

6 of 7 models plotted · 1 not yet evaluated

Model | Score | Median tokens | Est. cost / task
Anthropic: Claude Opus 4.1 | 59.7% | 2.1K | $0.040
OpenAI: GPT-5 Nano | 58.2% | 2.7K | $0.0003
Qwen: Qwen3.5 397B A17B | 58.2% | 1.5K | $0.0006
Google: Gemini 2.5 Flash | 55.2% | 476 | $0.0005
Google: Gemini 2.5 Flash-Lite | 53.7% | 301 | $0.0000
Google: Gemma 3 4B (this model) | 37.3% | unavailable | unavailable
MoonshotAI: Kimi K2.5 | 35.8% | 2.7K | $0.0021

Alternatives to Gemma 3 4B

Other models worth comparing for similar use cases.

Google
Gemini 2.5 Flash-Lite
Gemini 2.5 Flash-Lite, released for general availability on July 22, 2025, is the most cost-efficient model in the Gemini 2.5 family, designed for high-volume and latency-sensitive tasks. It is multimodal, supporting text, images, video, audio, and PDFs as inputs, with text as its primary output. The model handles up to 1 million input tokens and generates outputs up to 64K tokens, making it suitable for large-scale document or media processing at low cost. It is built on a Sparse Mixture-of-Experts architecture with native multimodal support, though exact parameter counts are undisclosed. Flash-Lite offers the lowest usage cost among Gemini 2.5 models. It introduces developer controls for “thinking mode,” which let developers trade reasoning depth against efficiency. It also integrates native tools such as code execution, search grounding, and URL context. While strong on translation, classification, coding, and general multimodal reasoning, it lacks support for image or audio generation in its stable release and is less capable than Gemini 2.5 Flash or Pro on complex reasoning-heavy workflows.
Google
PaliGemma 2
PaliGemma 2 is a vision-language model released in December 2024 by Google DeepMind. It pairs the SigLIP-So400m vision encoder with the Gemma 2 language model family, extending the original PaliGemma architecture with stronger language capabilities and a wider set of transfer benchmarks. The model is designed primarily as a fine-tuning base rather than a chat-optimized assistant. Google releases pretrained "PT" checkpoints intended for task-specific adaptation rather than direct out-of-the-box use. PaliGemma 2 accepts an image paired with a text prompt and generates natural language output, supporting image captioning, visual question answering, optical character recognition, document understanding, object detection and segmentation (with appropriate fine-tuning), and a range of specialized vision-language tasks. The model is released at three parameter sizes (3B, 10B, and 28B), built on the Gemma 2 2B, 9B, and 27B language backbones. Each size is available at three input resolutions: 224, 448, and 896 pixels. Alongside the base PT checkpoints, Google released PaliGemma 2 Mix variants that have been tuned on a mixture of downstream tasks to provide stronger out-of-the-box performance for common applications such as OCR and document parsing. PaliGemma 2 is distributed under the Gemma license, a custom license from Google that permits commercial use subject to the terms of the Gemma Prohibited Use Policy.
Qwen
Qwen2.5 VL 7B Instruct
Qwen2.5-VL-7B-Instruct is a 7-billion-parameter vision-language model from Alibaba’s QwenLM team, released on January 26, 2025 under the Apache 2.0 license. It is the instruction-tuned 7B model in the Qwen2.5-VL family, designed to process multimodal inputs such as text, images, charts, documents, and video. The model can produce structured outputs, including JSON for structured content and bounding boxes for visual localization. Weights are publicly available on Hugging Face and GitHub, making it suitable for both research and applied multimodal use.
Qwen
Qwen3 VL 8B Instruct
Qwen3 VL 8B Instruct is an open-weight multimodal vision-language model developed by Qwen / Alibaba Cloud as part of the Qwen3-VL series, designed for instruction-following tasks that combine text with visual inputs such as images and video. Released around October 2025 under the Apache-2.0 license, it targets developers who need capable multimodal reasoning without the scale or cost of very large models. The model contains roughly 8.8 billion dense parameters and supports text, image, and video understanding with strong spatial perception, visual reasoning, and emerging visual agent abilities such as GUI interaction. A standout feature is its native ~256K token context window, extendable to around 1M tokens, enabling long-document reading and extended video comprehension. In today’s landscape, it balances openness, long-context capacity, and solid multimodal performance against heavier proprietary models. Typical applications include multimodal assistants, document and video analysis, visual question answering, and research or product prototyping where transparency and deployability matter.
HuggingFace
SmolVLM2
SmolVLM2 is a compact multimodal vision-language model developed by the Hugging Face TB Research team, released in February 2025 under the Apache 2.0 license. It is designed for efficient image and video understanding on resource-constrained hardware, with model variants ranging from 256M to 2.2B parameters. SmolVLM2 processes images, multi-image inputs, and video alongside text queries to generate text outputs for tasks including visual question answering, image captioning, and OCR. SmolVLM2 is designed for on-device and edge deployment, requiring substantially less GPU memory than comparable multimodal models. It supports standard fine-tuning pipelines via the Hugging Face transformers library and quantization through bitsandbytes. SmolVLM2 is suited for applications where a capable vision-language model is needed without full server-scale infrastructure.
Moondream 2
Moondream 2 is a small open-source vision-language model from Moondream, the company founded by Vikhyat Korrapati. It was first released in early 2024 and updated through mid-2025. At approximately 1.9 billion parameters, it is designed to run efficiently on consumer hardware such as laptops and edge devices while supporting a practical range of multimodal tasks. Moondream 2 combines a vision encoder based on SigLIP with a compact language backbone, trained for image understanding tasks rather than as a general chat model. The model accepts an image paired with a natural language prompt and produces text responses, supporting visual question answering, image captioning, and image-conditioned dialogue. Later Moondream 2 releases added object localization through a point API that returns coordinates for queried objects, along with improvements to OCR, counting, and document understanding. Moondream 2 is distributed under the Apache 2.0 license and is available through Hugging Face and the maintainer's distribution. Because the model is updated frequently, production deployments should pin to a specific revision rather than tracking the latest release. A successor model, Moondream 3 (Preview), was released in September 2025 with a 9B mixture-of-experts architecture and 2B active parameters, offering substantially stronger visual reasoning than Moondream 2 while retaining the efficiency-focused design. A referring expression segmentation extension to Moondream 3 was released in March 2026.

Gemma 3 4B License

Proprietary (Gemma Terms of Use, Google's custom license for the open-weight Gemma models)

License terms and commercial-use guidance for Gemma 3 4B.

License information is provided as a guide and is not legal advice.