Qwen3-VL-2B-Instruct

Generate multimodal responses by interpreting both text and video, enabling vision-language understanding and content creation.

Model Properties

The pipeline processes image and text inputs using a vision encoder and a language model to generate contextualized outputs.

License name: Apache License 2.0
Number of parameters: 2B
Model Size: 2.18 GB

Technical Details

Image Input Size: [288, 512, 3]
Numerical Scheme: A16W4, symmetric, group-wise
Inference API: C++, Python
Vision Tokens Per Frame: 144
Context Length: 2048
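The image size, vision-token count, and context length listed above interact: every frame consumes a fixed share of the context window. The sketch below works through that arithmetic in Python. It assumes that [288, 512, 3] means height, width, channels, that each vision token covers one square pixel block, and that the 2048-token context is shared by vision tokens, prompt text, and generated text. None of these assumptions is stated on the card, and the function names are illustrative only.

```python
import math

# Values copied from the Technical Details list.
IMAGE_HEIGHT, IMAGE_WIDTH = 288, 512
VISION_TOKENS_PER_FRAME = 144
CONTEXT_LENGTH = 2048


def block_side(height: int, width: int, tokens: int) -> int:
    """Side length in pixels of the square block one vision token would cover.

    Assumption: tokens tile the frame evenly in square blocks.
    """
    pixels_per_token = (height * width) // tokens
    return math.isqrt(pixels_per_token)


def max_frames(context_length: int, tokens_per_frame: int, text_tokens: int) -> int:
    """Whole frames that fit once text_tokens are reserved for prompt and reply."""
    remaining = context_length - text_tokens
    return max(0, remaining // tokens_per_frame)


side = block_side(IMAGE_HEIGHT, IMAGE_WIDTH, VISION_TOKENS_PER_FRAME)
grid = (IMAGE_HEIGHT // side, IMAGE_WIDTH // side)
print(f"block side: {side} px, token grid: {grid[0]} x {grid[1]}")

for reserved in (0, 64, 256):
    frames = max_frames(CONTEXT_LENGTH, VISION_TOKENS_PER_FRAME, reserved)
    print(f"{reserved} text tokens reserved -> up to {frames} frames")
```

Under these assumptions the listed numbers are self-consistent (a 9 x 16 grid of 32-pixel blocks gives 144 tokens), and the frame budget shrinks as more of the context is reserved for text.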

Performance Metrics

First Load Time (s): 7.02
Time To First Token (s): 1.47
Tokens Per Second (TPS): 4.74
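These three figures can be combined into a rough end-to-end latency estimate. The Python sketch below assumes that time to first token covers the first generated token, that every later token arrives at the steady TPS rate, and that first load time is paid once on a cold start. The card does not describe its measurement method, so treat the result as an approximation; the function name and parameters are hypothetical.

```python
# Values copied from the Performance Metrics list.
FIRST_LOAD_SECONDS = 7.02
TIME_TO_FIRST_TOKEN_SECONDS = 1.47
TOKENS_PER_SECOND = 4.74


def estimate_response_seconds(new_tokens: int, cold_start: bool = False) -> float:
    """Approximate wall-clock time to generate new_tokens tokens.

    Assumption: the first token costs TIME_TO_FIRST_TOKEN_SECONDS and each
    remaining token costs 1 / TOKENS_PER_SECOND seconds.
    """
    if new_tokens <= 0:
        return 0.0
    total = TIME_TO_FIRST_TOKEN_SECONDS + (new_tokens - 1) / TOKENS_PER_SECOND
    if cold_start:
        total += FIRST_LOAD_SECONDS  # one-time model load
    return total


for n in (1, 50, 100):
    warm = estimate_response_seconds(n)
    cold = estimate_response_seconds(n, cold_start=True)
    print(f"{n:>3} tokens: {warm:6.2f} s warm, {cold:6.2f} s cold")
```

At this rate a 100-token reply takes on the order of twenty seconds, so short answers or streamed output are likely the practical choice on this configuration.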
