Surya Overview

Surya is an OCR and document layout analysis toolkit developed by Vikram Paruchuri and maintained by Datalab, first released in January 2024 under the GPL-3.0 license. It supports text recognition in more than 90 languages, along with document layout detection, reading order prediction, table recognition, and equation detection, so structured information can be extracted from document images with a single toolkit.

Surya is designed to operate without cloud API dependencies, running fully on local hardware with support for CPU and GPU inference. It is commonly used for digitizing scanned documents, extracting text from PDFs with complex layouts, and building automated document processing pipelines.

Surya Details & Performance

Details

Vision Tasks: OCR

Features: Real-Time Vision

Alternatives to Surya

Other models worth comparing for similar use cases.

docTR
docTR (Document Text Recognition) is an open-source OCR toolkit developed by Mindee, with its initial public release in March 2021 under the Apache 2.0 license. It provides end-to-end document text recognition through a two-stage pipeline consisting of text detection and text recognition, both implemented as deep learning models. docTR supports multiple detection architectures including DBNet and LinkNet, and recognition architectures including CRNN and SAR, with both TensorFlow and PyTorch backends available.

docTR is designed for reading text in document images including scanned PDFs, photographs of printed documents, and forms. It handles multilingual text recognition across standard Latin-script languages and is deployable through Roboflow Inference. It is suited for document digitization pipelines, automated form processing, and applications requiring accurate structured text extraction from document images.
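
The two-stage flow described above (detect text boxes, then recognize each cropped box) can be sketched in a few lines. This is a minimal illustration, not docTR's API: run_two_stage_ocr, toy_detect, and toy_recognize are hypothetical names, and the toy functions stand in for real detection and recognition models so the sketch runs without any weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[int]]   # grayscale rows; a stand-in for a real image type
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


@dataclass
class Word:
    box: Box
    text: str


def crop(image: Image, box: Box) -> List[List[int]]:
    """Cut the region described by box out of the image."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in image[top:bottom]]


def run_two_stage_ocr(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[List[List[int]]], str],
) -> List[Word]:
    """Stage 1 finds text boxes; stage 2 reads each cropped box."""
    boxes = detect(image)
    # Sort top-to-bottom, then left-to-right, as a crude reading order.
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [Word(box=b, text=recognize(crop(image, b))) for b in boxes]


# Hypothetical toy stand-ins for the two models (assumptions for illustration).
def toy_detect(image: Image) -> List[Box]:
    return [(3, 0, 5, 1), (0, 0, 2, 1)]


def toy_recognize(patch: List[List[int]]) -> str:
    # Pretend each pixel value is a character code.
    return "".join(chr(v) for v in patch[0])


page = [[ord(c) for c in "hi ok"]]
words = run_two_stage_ocr(page, toy_detect, toy_recognize)
print([w.text for w in words])
```

Keeping the detector and recognizer as separate callables mirrors why toolkits of this kind let the two architectures be chosen independently.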
TrOCR
TrOCR (Transformer-based Optical Character Recognition) is an end-to-end OCR model released in September 2021 by Microsoft Research. It departs from the traditional two-stage OCR pipeline, which typically combines a CNN-based feature extractor with an RNN-based sequence decoder, by using a pure Transformer architecture composed of a pretrained image Transformer encoder and a pretrained text Transformer decoder. This approach later became standardized as the VisionEncoderDecoder pattern in Hugging Face Transformers.

TrOCR takes a cropped text line image as input and produces a sequence of output tokens, supporting printed, handwritten, and scene text recognition. The model is designed for use downstream of a separate text detection stage: it recognizes text in pre-cropped regions rather than detecting text locations in a full page. Microsoft released three size variants: TrOCR-small (62M parameters, DeiT-small encoder + MiniLM decoder), TrOCR-base (334M parameters, BEiT-base encoder + RoBERTa-large decoder), and TrOCR-large (558M parameters, BEiT-large encoder + RoBERTa-large decoder). Pretrained and fine-tuned checkpoints are available for printed text (on SROIE), handwritten text (on IAM), and scene text (on the standard scene text benchmarks) under the MIT license, distributed through the Microsoft unilm repository and Hugging Face. At release, TrOCR achieved state-of-the-art results across all three benchmark categories, and the model continues to be used as a baseline for handwritten text recognition.
Florence-2
Florence-2, introduced by Microsoft Research at CVPR 2024, is an open-source vision-language foundation model designed to unify diverse computer vision tasks within a single sequence-to-sequence framework. Unlike traditional models that specialize in specific tasks, Florence-2 accepts both images and text prompts and outputs text for tasks such as captioning, object detection, segmentation, OCR, and region-based grounding. It comes in two sizes, Florence-2-base (~230M parameters) and Florence-2-large (~770M parameters), and is trained on FLD-5B, a large dataset of ~126M images with ~5.4B annotations.

The model demonstrates strong zero-shot and fine-tuned performance, often rivaling larger vision-language systems while remaining lightweight and efficient. Released under the MIT license, all weights are publicly available, making it accessible for fine-tuning and deployment in applications like VQA, content tagging, accessibility, and research. Florence-2’s compact design, versatility, and openness position it as a practical alternative to larger proprietary multimodal models.
PaliGemma 2
PaliGemma 2 is a vision-language model released in December 2024 by Google DeepMind. It pairs the SigLIP-So400m vision encoder with the Gemma 2 language model family, extending the original PaliGemma architecture with stronger language capabilities and a wider set of transfer benchmarks. The model is designed primarily as a fine-tuning base rather than a chat-optimized assistant. Google releases pretrained "PT" checkpoints intended for task-specific adaptation rather than direct out-of-the-box use.

PaliGemma 2 accepts an image paired with a text prompt and generates natural language output, supporting image captioning, visual question answering, optical character recognition, document understanding, object detection and segmentation (with appropriate fine-tuning), and a range of specialized vision-language tasks. The model is released at three parameter sizes (3B, 10B, and 28B), built on the Gemma 2 2B, 9B, and 27B language backbones. Each size is available at three input resolutions: 224, 448, and 896 pixels. Alongside the base PT checkpoints, Google released PaliGemma 2 Mix variants that have been tuned on a mixture of downstream tasks to provide stronger out-of-the-box performance for common applications such as OCR and document parsing. PaliGemma 2 is distributed under the Gemma license, a custom license from Google that permits commercial use subject to the terms of the Gemma Prohibited Use Policy.
Google Vision OCR
Google Vision OCR, released as part of the Cloud Vision API’s general availability in February 2016, is a proprietary Google Cloud service for extracting text from images and documents. It supports common formats like JPEG, PNG, GIF, TIFF, and PDF, and provides two main modes: TEXT_DETECTION for short snippets and scene text, and DOCUMENT_TEXT_DETECTION for dense documents, which returns structured layout information with bounding boxes.

The service is not an LLM, so it has no token context window or parameter count. It performs OCR on printed text and some handwriting, and outputs detected text along with positional metadata, making it useful for digitizing scanned files, receipts, forms, and signs. However, complex layouts like tables often require downstream processing. Accessible via REST and RPC APIs, with client libraries in major languages, Google Vision OCR is widely used for document processing pipelines, archival, and accessibility applications.
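
To make the difference between the two modes concrete, the sketch below builds the JSON body for the REST images:annotate endpoint and switches the feature type between TEXT_DETECTION and DOCUMENT_TEXT_DETECTION. The helper name build_annotate_request and the placeholder image bytes are assumptions for illustration; authentication and the actual HTTP call are deliberately left out.

```python
import base64
import json

# REST endpoint for synchronous image annotation.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes: bytes, dense_document: bool) -> dict:
    """Build the request body; hypothetical helper, not part of any client library."""
    # Dense pages (scans, forms) use the document mode; short snippets
    # and scene text use the plain text mode.
    feature_type = "DOCUMENT_TEXT_DETECTION" if dense_document else "TEXT_DETECTION"
    return {
        "requests": [
            {
                # Inline image content is sent base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature_type}],
            }
        ]
    }


# Placeholder bytes stand in for a real image file read from disk.
body = build_annotate_request(b"\x89PNG placeholder bytes", dense_document=True)
print(json.dumps(body, indent=2))
```

The body would be sent as a POST to ANNOTATE_URL with valid credentials; only the feature type changes between the two modes.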
Qwen2.5 VL 7B Instruct
Qwen2.5-VL-7B-Instruct is a 7-billion parameter vision-language model from Alibaba’s Qwen team, released on January 26, 2025 under the Apache 2.0 license. It is the instruction-tuned variant of the 7B scale in the Qwen2.5-VL family, designed to process multimodal inputs such as text, images, charts, documents, and video. The model can produce structured outputs, including JSON for structured content and bounding boxes for visual localization. Weights are publicly available on Hugging Face, with code on GitHub, making it suitable for both research and applied multimodal use.
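
Structured JSON output with bounding boxes usually needs a small parsing step before it can feed a document pipeline. The sketch below assumes the model was prompted to reply with a JSON list of objects carrying a "bbox_2d" field ([x1, y1, x2, y2] in pixels) and a "label" field, optionally wrapped in a Markdown code fence. The field names, the sample reply, and the parse_regions helper are assumptions for illustration; the exact shape depends on the prompt used.

```python
import json
import re
from typing import List, NamedTuple

FENCE = "`" * 3  # Markdown code-fence marker that chat models often wrap JSON in


class Region(NamedTuple):
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def parse_regions(reply: str) -> List[Region]:
    """Extract labelled boxes from a JSON reply, with or without a code fence."""
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    regions = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]  # assumed key name; adjust to your prompt
        regions.append(Region(item.get("label", ""), x1, y1, x2, y2))
    return regions


# Hypothetical reply text, not real model output.
sample_reply = (
    FENCE + "json\n"
    '[{"bbox_2d": [12, 30, 180, 64], "label": "invoice number"}]\n'
    + FENCE
)
print(parse_regions(sample_reply))
```

In practice it is worth validating that each box lies within the image bounds before using it, since generated coordinates are not guaranteed to be well formed.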

Surya License

GPL-3.0

License terms and commercial-use guidance for Surya.

License information is provided as a guide and is not legal advice.