GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture by Zhipu AI. The model combines a 0.4B-parameter CogViT visual encoder pre-trained on large-scale image-text data, a lightweight cross-modal connector with efficient token downsampling, and a 0.5B-parameter GLM language decoder, totaling 0.9B parameters. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. Training proceeds through four stages: visual encoder pretraining with MIM, CLIP, and distillation objectives; vision-language pretraining on document parsing, grounding, and VQA data; supervised fine-tuning on curated OCR datasets covering text, formula, table, and key information extraction; and full-task reinforcement learning to improve accuracy and structural consistency.
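The throughput argument for Multi-Token Prediction can be illustrated with a back-of-the-envelope step count. This is a minimal sketch, not GLM-OCR's decoder: `decoding_steps` and `tokens_per_step` are hypothetical names, and the sketch assumes every token predicted in a step is accepted, which a real MTP implementation may not guarantee.

```python
import math


def decoding_steps(num_tokens: int, tokens_per_step: int = 1) -> int:
    """Decoder forward passes needed to emit `num_tokens` tokens.

    Assumption (not from the source): each step emits exactly
    `tokens_per_step` tokens and all of them are kept.
    """
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(num_tokens / tokens_per_step)


# Standard autoregressive decoding: one token per forward pass.
baseline = decoding_steps(1000, tokens_per_step=1)

# Idealized MTP with a hypothetical 4 tokens per forward pass.
mtp = decoding_steps(1000, tokens_per_step=4)

print(f"baseline steps: {baseline}, MTP steps: {mtp}")
```

Under these idealized assumptions the number of forward passes falls in proportion to the tokens emitted per step; actual speedups depend on how many predicted tokens are kept and on the per-step cost of the extra prediction heads.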
At the system level, GLM-OCR adopts a two-stage pipeline in which PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. This design enables robust handling of diverse document layouts, including tables, formulas, and multi-column text. The model supports document parsing and targeted recognition tasks, producing structured outputs in Markdown, JSON, and LaTeX formats across more than 100 languages. GLM-OCR scores 94.62 on the OmniDocBench V1.5 benchmark, 94.0 on OCRBench, and 96.5 on UniMERNet for formula recognition.
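The two-stage flow described above (layout analysis first, then parallel recognition of each region) might be sketched as follows. Everything here is an illustrative assumption: `Region`, `detect_layout`, `recognize_region`, and `parse_page` are hypothetical stand-ins, not the PP-DocLayout-V3 or GLM-OCR APIs, and the stub functions return placeholder data instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    kind: str     # e.g. "text", "table", "formula"
    bbox: tuple   # (x0, y0, x1, y1) in pixels
    order: int    # reading-order index assigned by layout analysis


def detect_layout(page):
    """Stage 1 (hypothetical stub): split a page into typed regions."""
    return [
        Region("text", (0, 0, 100, 40), 0),
        Region("table", (0, 50, 100, 120), 1),
        Region("formula", (0, 130, 100, 160), 2),
    ]


def recognize_region(page, region):
    """Stage 2 (hypothetical stub): recognize one cropped region."""
    return f"<{region.kind} @ {region.bbox}>"


def parse_page(page, max_workers=4):
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Regions are independent, so they can be recognized in parallel.
        # pool.map preserves input order, so outputs align with regions.
        outputs = list(pool.map(lambda r: recognize_region(page, r), regions))
    # Reassemble the page in reading order.
    ordered = sorted(zip(regions, outputs), key=lambda pair: pair[0].order)
    return "\n\n".join(text for _, text in ordered)


print(parse_page(page=None))
```

Because each region is recognized independently, a slow table or formula crop does not block the text regions around it; only the final reassembly step needs the reading order from layout analysis.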
| Category | Passed | Score |
|---|---|---|
| Handwritten Math | 10 / 10 | 100% |
| Text Recognition | 27 / 30 | 90% |
| License Plate Recognition | 27 / 30 | 90% |
| Focused Scene OCR | 87 / 99 | 87.9% |
| VQA & Extraction | 49 / 60 | 81.7% |
Scores are based on a single evaluation run.
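The Score column appears to be the pass rate (passed divided by total) rounded to one decimal place. A short sketch that recomputes it from the Passed column, assuming that definition; `results` and `pass_rate` are names introduced here for illustration:

```python
# (passed, total) pairs copied from the table above.
results = {
    "Handwritten Math": (10, 10),
    "Text Recognition": (27, 30),
    "License Plate Recognition": (27, 30),
    "Focused Scene OCR": (87, 99),
    "VQA & Extraction": (49, 60),
}


def pass_rate(passed: int, total: int) -> float:
    """Percentage of passed cases, rounded to one decimal place."""
    return round(100 * passed / total, 1)


for category, (passed, total) in results.items():
    print(f"{category}: {passed} / {total} = {pass_rate(passed, total)}%")
```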
License terms and commercial-use guidance for GLM-OCR.
License information is provided as a guide and is not legal advice.