Qwen: Qwen3 VL 30B A3B Instruct

Qwen3 VL 30B A3B Instruct Overview

Qwen3 VL 30B A3B Instruct is an open-weight multimodal large language model developed by Alibaba as part of the Qwen family, built for instruction-following tasks that unify text generation with visual and video understanding. Released around October 2025 under the Apache-2.0 license, it targets efficient, high-fidelity vision-language reasoning across very long contexts.

The model accepts text and image inputs and produces text outputs, with strong performance in OCR, spatial reasoning, long-video understanding, and agentic or GUI-centric visual tasks. It uses a Mixture-of-Experts design (the "A3B" suffix refers to the roughly 3B active parameters): ~31.1B total parameters, of which ~3B are active per token. This is paired with Qwen3-VL’s unified multimodal stack (including Interleaved-MRoPE and DeepStack fusion) to process text, images, and video in a single architecture. OCR support covers 32 languages, which helps document workflows. With a native ~262K token context window (extendable further), it combines scale, inference efficiency, long-context support, and open weights in one multimodal model.
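The gap between total and active parameters comes from sparse routing: each token is sent to only a few experts. Below is a minimal toy sketch of top-k expert routing in plain Python. The expert count, the value of k, the scalar "experts", and the function names are all illustrative assumptions; none of them come from this page, and this is not Qwen3-VL's actual router.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by routing weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical setup: 4 toy experts, each scaling its input by a different factor.
EXPERTS = [lambda x, s=s: s * x for s in range(4)]

if __name__ == "__main__":
    logits = [0.0, 2.0, 1.0, -1.0]  # made-up router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_forward(1.0, EXPERTS, logits, k=2))
```

Only the two selected experts are evaluated for the token, which is the sense in which a sparse model can hold many parameters while computing with a small fraction of them per token.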

Qwen3 VL 30B A3B Instruct Details & Performance

Details

Vision Tasks

Vision Language, Object Detection, OCR, Visual Question Answering, Captioning

Features

LLMs with Vision Capabilities, Multimodal Vision

Qwen3 VL 30B A3B Instruct Pricing

Qwen3 VL 30B A3B Instruct costs $0.130 per 1M input tokens and $0.520 per 1M output tokens.

Input: $0.130 / 1M tokens
Output: $0.520 / 1M tokens

Pricing updated Jul 5, 2026
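To make the per-token rates concrete, here is a small sketch that estimates the cost of a single request from the listed prices. The function name and the example token counts are hypothetical; the sketch also assumes billing is linear in tokens, with no minimums, caching discounts, or separate image-token pricing, none of which this page specifies.

```python
# Listed rates in USD per 1M tokens, taken from the pricing above.
INPUT_USD_PER_M = 0.130
OUTPUT_USD_PER_M = 0.520

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD, assuming simple linear per-token billing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000

if __name__ == "__main__":
    # Hypothetical request: a long document prompt with a short answer.
    cost = estimate_cost_usd(input_tokens=200_000, output_tokens=2_000)
    print(f"Estimated cost: ${cost:.4f}")
```

Because output tokens cost four times as much as input tokens, long generations dominate the bill faster than long prompts do.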

Alternatives to Qwen3 VL 30B A3B Instruct

Other models worth comparing for similar use cases.

Qwen
Qwen3.6 35B A3B
Qwen3.6-35B-A3B is a sparse Mixture-of-Experts (MoE) multimodal language model developed by the Qwen team at Alibaba Group. It carries 35 billion total parameters but activates only approximately 3 billion per forward pass via a learned routing mechanism, giving it the representational capacity of a large dense model at a fraction of the inference compute. The model is natively multimodal, processing images, documents, and video alongside text as a core architectural capability rather than an add-on. It supports a native context window of 262,144 tokens, extensible up to 1,010,000 tokens via YaRN. A key design feature is the unified thinking/non-thinking mode framework: users can switch between deliberate chain-of-thought reasoning and fast direct responses within a single model, and a "thinking preservation" option retains reasoning context across multi-turn agentic workflows to reduce redundant computation. The model is specifically optimized for agentic coding tasks, including repository-level reasoning, frontend workflow generation, multi-step tool use, and MCP (Model Context Protocol) integration. On SWE-bench Verified it scores 73.4%, on Terminal-Bench 2.0 it scores 51.5%, and on MCPMark it scores 37.0%. For vision-language tasks it achieves 92.0 on RefCOCO, 89.9 on OmniDocBench 1.5, and 83.7 on VideoMMMU. The model also supports Multi-Token Prediction (MTP) for speculative decoding. All Qwen3.6 open-weight models are released under the Apache 2.0 license.
Qwen
Qwen3.6 27B
Qwen3.6-27B is a dense 27-billion-parameter multimodal language model developed by Alibaba's Qwen team and released on April 22, 2026. It combines a causal language model with an integrated vision encoder, supporting text, image, and video inputs natively. The architecture employs a hybrid attention design that interleaves Gated DeltaNet linear attention blocks with standard Gated Attention layers across 64 transformer layers with a hidden dimension of 5,120. Unlike Mixture-of-Experts variants in the Qwen3.6 family, all 27 billion parameters are active on every inference pass, simplifying deployment and quantization. The model supports a native context window of 262,144 tokens, extensible to approximately 1,010,000 tokens via YaRN scaling. It is released under the Apache 2.0 license with open weights available on Hugging Face and ModelScope. The model introduces two notable capabilities relative to prior Qwen releases: enhanced agentic coding support covering frontend workflows and repository-level reasoning, and a Thinking Preservation mechanism that retains chain-of-thought reasoning context across multi-turn conversation history to reduce redundant token generation in iterative agent sessions. It supports both a thinking mode for multi-step reasoning and a non-thinking mode for faster responses within a single model. On coding benchmarks, Qwen reports scores of 77.2 on SWE-bench Verified, 59.3 on Terminal-Bench 2.0, and 48.2 on SkillsBench. Vision capabilities include chart understanding (CharXiv RQ: 78.4), OCR (CC-OCR: 81.2), and video understanding (VideoMME with subtitles: 87.7).
Google
Gemini 2.5 Flash
Gemini 2.5 Flash, released on June 17, 2025, is Google DeepMind’s production-ready, efficiency-focused model in the Gemini 2.5 family. It is multimodal, accepting text, images, video, and audio as inputs, with text as the primary output format. The model supports 1 million input tokens and up to 65K output tokens, enabling it to process very large contexts such as books, long video transcripts, or extensive datasets. Its training knowledge extends to January 2025. Designed as a price-performance leader, Gemini 2.5 Flash balances speed and reasoning power, making it suitable for everyday enterprise and developer use cases without the higher latency and cost of Pro models. It supports advanced workflows like function calling, code execution, search grounding, URL context ingestion, and structured outputs. While efficient and scalable, output length is still limited compared to its input capacity, and multimodal outputs (e.g. image or audio generation) remain restricted to specialized or preview variants.
Meta
Llama 4 Scout
Llama 4 Scout, released on April 5, 2025, is one of Meta AI’s first Llama 4 multimodal models, alongside Maverick. It accepts text + image inputs and produces text outputs, with a knowledge cutoff of August 2024. Scout is notable for its extremely large context window of 10 million tokens, making it well-suited for analyzing very long documents, extended conversations, or large codebases. Architecturally, Scout uses a Mixture-of-Experts (MoE) system with 16 experts, activating ~17B parameters per inference from a pool of ~109B total parameters, balancing capacity with efficiency. It officially supports 12 languages (including English, Arabic, French, Hindi, and Spanish), while offering multimodal reasoning for images (captioning, Q&A, recognition). Meta highlights that Scout can run on a single Nvidia H100 GPU, making it more accessible than larger-scale Llama 4 models. However, its output token limit is far smaller than its 10M input window, image input support is still constrained, and license restrictions apply for large-scale commercial deployments.

Qwen3 VL 30B A3B Instruct License

Apache 2.0

License terms and commercial-use guidance for Qwen3 VL 30B A3B Instruct.

License information is provided as a guide and is not legal advice.