Qwen2-VL-2B-Instruct

Generate multimodal responses by interpreting both text and images, enabling vision-language understanding and content creation.

Model Properties

The pipeline processes image and text inputs using a vision encoder and language model to generate contextualized outputs.

License name: Apache License 2.0
Number of parameters: 2B
Model Size: 2.18 GB

Technical Details

Image Input Size: [336, 336, 3]
Numerical Scheme: A8W4, symmetric, channel-wise
Inference API: C++
Vision Tokens Per Frame: 144
Context Length: 2048
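Two of the figures above lend themselves to a short worked example: the token budget implied by the context length and the per-frame vision token count, and the weight half of the A8W4 symmetric, channel-wise scheme. The sketch below is illustrative only. The function names are hypothetical, the integer range [-7, 7] and round-to-nearest are assumptions rather than details from this page, and the 8-bit activation side is not shown.

```python
from typing import List, Tuple

# Values copied from the Technical Details list.
CONTEXT_LENGTH = 2048
VISION_TOKENS_PER_FRAME = 144


def text_token_budget(num_frames: int) -> int:
    """Tokens left for text (prompt plus output) after the image tokens.

    Assumes each frame costs exactly VISION_TOKENS_PER_FRAME tokens and that
    vision and text tokens share one context window.
    """
    used = num_frames * VISION_TOKENS_PER_FRAME
    if used > CONTEXT_LENGTH:
        raise ValueError("frames alone exceed the context length")
    return CONTEXT_LENGTH - used


def quantize_w4_channelwise(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Symmetric 4-bit quantization with one scale per output channel (row).

    Assumption: values map to the signed range [-7, 7] with no zero point.
    """
    qmax = 7  # symmetric signed 4-bit range, leaving -8 unused
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate float weights: w is roughly q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Under these assumptions, a single 336 x 336 image leaves 2048 - 144 = 1904 tokens for the text prompt and the generated answer. Keeping one scale per channel means a row with small weights is not forced onto the coarse grid of a row with large ones.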

Performance Metrics

First Load Time (s): 21.55
Text Time To First Token (s): 0.32
Image Time To First Token (s): 0.93
Tokens Per Second (TPS): 8.4
Time To First Token (s): 0.91
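The time-to-first-token and TPS figures above can be combined into a rough end-to-end latency estimate. This is a minimal sketch, assuming the first token arrives after the listed time to first token and every later token is produced at a steady TPS rate; the function name is hypothetical, and real latency will vary with prompt length and device.

```python
# Figures copied from the Performance Metrics list.
TEXT_TTFT_S = 0.32   # Text Time To First Token, in seconds
IMAGE_TTFT_S = 0.93  # Image Time To First Token, in seconds
TPS = 8.4            # tokens generated per second


def estimate_response_time(num_output_tokens: int, has_image: bool = True) -> float:
    """Approximate seconds needed to generate num_output_tokens tokens.

    Assumes a fixed time to first token followed by steady-rate decoding.
    """
    if num_output_tokens < 1:
        raise ValueError("need at least one output token")
    ttft = IMAGE_TTFT_S if has_image else TEXT_TTFT_S
    return ttft + (num_output_tokens - 1) / TPS
```

For example, an 85-token answer about an image comes out at roughly 0.93 + 84 / 8.4, or about 10.9 seconds, under this model. First load time (21.55 s) is a one-off cost and is left out of the estimate.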
Accuracy

Test: MMLU
Evaluation Metric: Accuracy
Full Precision Accuracy: 52
Post Quantization Accuracy: 47
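The cost of quantization on MMLU can be read off the two scores above. The sketch below assumes both scores are percentages on the same test set; the helper name is hypothetical.

```python
def quantization_drop(full_precision: float, quantized: float) -> tuple:
    """Return (absolute drop in points, relative drop as a fraction)."""
    if full_precision <= 0:
        raise ValueError("full-precision score must be positive")
    absolute = full_precision - quantized
    return absolute, absolute / full_precision


# MMLU scores from the Accuracy section: 52 full precision, 47 quantized.
absolute, relative = quantization_drop(52, 47)
```

With these inputs the drop is 5 points in absolute terms, or a little under 10% of the full-precision score.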

Explore More Models

GenAI Models
Florence-2-Base
Generate descriptive captions for input images, enabling visual recognition and image-to-text translation tasks
GenAI Models
Stable Diffusion 1.5
Generate high-quality images from textual descriptions by leveraging advanced deep learning techniques