Google: Gemma 4 12B

Gemma 4 12B Overview

Gemma 4 12B is an open-weight multimodal model from Google in the Gemma 4 family. It is intended for text and image understanding tasks such as visual question answering, OCR, captioning, and document understanding, with a smaller parameter footprint than the larger Gemma 4 variants.

This entry is linked to the Roboflow Playground vision evals so that it can be compared with other models. No runnable Playground workflow is configured yet, so this page serves as discovery and benchmark context rather than a way to run hosted inference.

Gemma 4 12B Details & Performance

Details

Vision Tasks

Vision Language, OCR, Visual Question Answering, Captioning

Features

Multimodal Vision

Usage

Past 30 Days: Not available (not in Playground)

Performance

Avg. Latency

Arena Rankings

Not yet ranked in arena

Gemma 4 12B Vision Evals

#35 of 70 models

Pass/fail results across 67 image tasks

Overall Score: 62.69% across 67 eval prompts
Prompts Passed: 42 / 67 across 5 task categories
Avg Response Time: 6.88s on eval prompts
Score key: ≥75%, 40–74%, <40%
Category                 Passed     Score
Document Understanding   8 / 9      88.9%
Object Understanding     11 / 14    78.6%
Defect Detection         11 / 15    73.3%
Spatial Understanding    11 / 19    57.9%
Object Counting          1 / 10     10%

Scores are based on a single evaluation run.
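
As a rough check on how the figures above fit together, the sketch below recomputes the overall score from the per-category pass counts. It assumes the overall score is simply total prompts passed divided by total prompts, which matches the reported numbers but is not confirmed by the page's methodology. The band names `high`, `mid`, and `low` are hypothetical labels for the three score-key ranges, and treating scores between 74% and 75% as `mid` is also an assumption.

```python
# Per-category results copied from the eval table: (prompts passed, prompts total).
results = {
    "Document Understanding": (8, 9),
    "Object Understanding": (11, 14),
    "Defect Detection": (11, 15),
    "Spatial Understanding": (11, 19),
    "Object Counting": (1, 10),
}


def score_band(pct: float) -> str:
    """Map a percentage onto the score key (band names are hypothetical)."""
    if pct >= 75:
        return "high"
    if pct >= 40:
        return "mid"  # assumption: anything below 75% and at least 40% is mid
    return "low"


def summarize(results: dict) -> tuple:
    """Return passed count, total count, overall percentage, and per-category percentages."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    overall = 100 * passed / total  # prompt-weighted, not a mean of category scores
    per_category = {name: 100 * p / t for name, (p, t) in results.items()}
    return passed, total, overall, per_category


passed, total, overall, per_category = summarize(results)
print(f"Overall: {passed} / {total} = {overall:.2f}%")
for name, pct in per_category.items():
    print(f"{name}: {pct:.1f}% ({score_band(pct)})")
```

Because the categories differ in size, this prompt-weighted figure is not the same as the plain average of the five category scores, so the two should not be compared directly.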

Gemma 4 12B License

Apache 2.0

License terms and commercial-use guidance for Gemma 4 12B.

License information is provided as a guide and is not legal advice.