Qwen2.5-VL-7B-Instruct is a 7-billion-parameter vision-language model from Alibaba’s QwenLM team, released on January 26, 2025 under the Apache 2.0 license. It is the instruction-tuned 7B variant in the Qwen2.5-VL family and accepts multimodal inputs such as text, images, charts, documents, and video. The model can produce structured outputs, including JSON for extracted content and bounding boxes for visual localization. Weights are publicly available on Hugging Face and GitHub, making it suitable for both research and applied multimodal use.
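As a rough illustration of consuming the structured output described above, the sketch below parses a model reply into labeled boxes. It assumes the reply is a JSON list of objects with a `bbox_2d` key (`[x1, y1, x2, y2]` in pixels) and a `label` key, possibly wrapped in a Markdown code fence. That schema is an assumption for illustration, and `parse_detections` is a hypothetical helper, not part of any Qwen library.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_detections(raw: str) -> list[dict[str, Any]]:
    """Parse a reply assumed to hold a JSON list of {"bbox_2d": [...], "label": ...}."""
    # Replies are sometimes wrapped in a Markdown fence; unwrap it if present.
    match = FENCED_JSON.search(raw)
    payload = match.group(1) if match else raw
    items = json.loads(payload)
    if isinstance(items, dict):  # tolerate a single object instead of a list
        items = [items]

    detections = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("bbox_2d")  # assumed key name
        # Keep only boxes with four numbers and positive width and height.
        if (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
            and box[0] < box[2]
            and box[1] < box[3]
        ):
            detections.append({"label": item.get("label", ""), "box": tuple(box)})
    return detections
```

Dropping malformed boxes rather than raising keeps a single bad entry from discarding an otherwise usable reply; a stricter pipeline might log or reject those entries instead.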
| Category | Passed | Score |
|---|---|---|
| Document Understanding | 7 / 9 | 77.8% |
| Defect Detection | 9 / 15 | 60.0% |
| Spatial Understanding | 11 / 19 | 57.9% |
| Object Understanding | 8 / 14 | 57.1% |
| Object Counting | 0 / 10 | 0.0% |
Scores are based on a single evaluation run.
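The Score column is consistent with passed cases divided by total cases, rounded to one decimal place. A minimal sketch of that arithmetic follows, using the counts from the table; the pooled figure it computes is derived here and is not reported by the source.

```python
# (passed, total) per category, copied from the table above.
results = {
    "Document Understanding": (7, 9),
    "Defect Detection": (9, 15),
    "Spatial Understanding": (11, 19),
    "Object Understanding": (8, 14),
    "Object Counting": (0, 10),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


scores = {name: pass_rate(p, t) for name, (p, t) in results.items()}

# Pooled rate across all cases (a derived figure, not one the page reports).
total_passed = sum(p for p, _ in results.values())
total_cases = sum(t for _, t in results.values())
overall = pass_rate(total_passed, total_cases)

for name, score in scores.items():
    print(f"{name}: {score}%")
print(f"Pooled: {total_passed} / {total_cases} = {overall}%")
```

Because each category has fewer than 20 cases and the scores come from a single run, one case changing outcome moves a category score by roughly 5 to 11 percentage points.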
License terms and commercial-use guidance for Qwen2.5 VL 7B Instruct.
License information is provided as a guide and is not legal advice.