Z.ai: GLM-OCR

GLM-OCR Overview

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture by Zhipu AI. The model combines a 0.4B-parameter CogViT visual encoder pre-trained on large-scale image-text data, a lightweight cross-modal connector with efficient token downsampling, and a 0.5B-parameter GLM language decoder, totaling 0.9B parameters. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. Training proceeds through four stages: visual encoder pretraining with MIM, CLIP, and distillation objectives; vision-language pretraining on document parsing, grounding, and VQA data; supervised fine-tuning on curated OCR datasets covering text, formula, table, and key information extraction; and full-task reinforcement learning to improve accuracy and structural consistency.
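The throughput benefit of Multi-Token Prediction can be illustrated with a small back-of-the-envelope sketch. It is idealized: it assumes every predicted token is accepted, and the value of 4 tokens per step is purely illustrative, since this overview does not state how many tokens GLM-OCR predicts per step. The function name `decode_steps` is hypothetical.

```python
import math


def decode_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Forward passes needed to emit num_tokens when each pass yields
    tokens_per_step tokens (idealized: every predicted token is kept)."""
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(num_tokens / tokens_per_step)


# Standard autoregressive decoding: one token per forward pass.
baseline = decode_steps(1000, 1)

# Illustrative MTP setting (assumed value, not taken from the overview).
with_mtp = decode_steps(1000, 4)

print(baseline, with_mtp)
```

Under these assumptions the number of decoder passes shrinks in proportion to the tokens emitted per step; real speedups depend on how often the extra predictions are correct.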

At the system level, GLM-OCR adopts a two-stage pipeline in which PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. This design enables robust handling of diverse document layouts including tables, formulas, and multi-column text. The model supports document parsing and targeted recognition tasks, producing structured outputs in Markdown, JSON, and LaTeX formats across more than 100 languages. GLM-OCR scores 94.62 on the OmniDocBench V1.5 benchmark, 94.0 on OCRBench, and 96.5 on UniMERNet for formula recognition.
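The two-stage flow described above (layout analysis first, then region-level recognition in parallel) might be sketched as follows. This is a minimal illustration of the control flow only: `Region`, `parse_page`, `fake_layout`, and `fake_recognize` are hypothetical names, and the stubs stand in for PP-DocLayout-V3 and the GLM-OCR recognizer rather than calling their real APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Region:
    kind: str                        # e.g. "text", "table", "formula"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def parse_page(
    page: str,
    detect_layout: Callable[[str], List[Region]],
    recognize_region: Callable[[str, Region], str],
    max_workers: int = 4,
) -> List[str]:
    """Stage 1: find regions. Stage 2: recognize each region in parallel."""
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input regions.
        return list(pool.map(lambda r: recognize_region(page, r), regions))


# Hypothetical stubs standing in for the layout model and the recognizer.
def fake_layout(page: str) -> List[Region]:
    return [Region("text", (0, 0, 100, 20)), Region("table", (0, 30, 100, 80))]


def fake_recognize(page: str, region: Region) -> str:
    return f"<{region.kind} from {page}>"


result = parse_page("page-1", fake_layout, fake_recognize)
print(result)
```

Because each region is recognized independently, a thread pool is one simple way to overlap the per-region calls; a production system could equally batch regions into a single model call.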

GLM-OCR Details & Performance

Details

Vision Tasks

OCR, Document Question Answering, Visual Question Answering, Vision Language, Chart Question Answering

Features

Multimodal Vision, LLMs with Vision Capabilities

Performance

GLM-OCR Vision Evals

Ranked #1 of 5 models

Pass/fail results across 229 image tasks

Overall Score: 87.34% across 229 eval prompts
Prompts Passed: 200 / 229 across 5 task categories
Avg Response Time: 1.00 s on eval prompts
Score key: ≥75% · 40–74% · <40%

Category                    Passed     Score
Handwritten Math            10 / 10    100%
Text Recognition            27 / 30    90%
License Plate Recognition   27 / 30    90%
Focused Scene OCR           87 / 99    87.9%
VQA & Extraction            49 / 60    81.7%
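
The overall score follows directly from the per-category pass counts listed above. As a quick consistency check, the short sketch below recomputes it, assuming the overall score is simply total passed divided by total prompts (the page's methodology is not spelled out here, so that is an assumption).

```python
# (passed, total) per category, copied from the results above.
results = {
    "Handwritten Math": (10, 10),
    "Text Recognition": (27, 30),
    "License Plate Recognition": (27, 30),
    "Focused Scene OCR": (87, 99),
    "VQA & Extraction": (49, 60),
}

passed = sum(p for p, _ in results.values())
total = sum(t for _, t in results.values())

# Assumed definition: overall score = prompts passed / prompts run.
overall = round(100 * passed / total, 2)

per_category = {
    name: round(100 * p / t, 1) for name, (p, t) in results.items()
}

print(passed, total, overall)
print(per_category)
```

The recomputed totals match the reported 200 / 229 prompts and 87.34% overall score.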

Scores are based on a single evaluation run.

GLM-OCR License

MIT

License terms and commercial-use guidance for GLM-OCR.

License information is provided as a guide and is not legal advice.