Qwen2-VL-2B-Instruct

Generate multimodal responses by interpreting both text and images,
enabling vision-language understanding and content creation.

Model Properties

The pipeline processes image and text inputs using a vision
encoder and language model to generate contextualized outputs.

License name: Apache License 2.0
Number of parameters: 2B
Model Size: 2.18 GB

Technical Details

Image Input Size: [336, 336, 3]
Numerical Scheme: A8W4, symmetric, channel-wise
Vision Tokens Per Frame: 144
Operations: TBD
Context Length: 2048
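
The numerical scheme above can be illustrated with a small sketch. It assumes that "A8W4" denotes 8-bit activations and 4-bit weights, and that "symmetric, channel-wise" means each output channel gets its own scale with the zero point fixed at 0. The function names and the sample matrix are hypothetical and are not taken from the model's actual toolchain; only the weight side is shown.

```python
def quantize_symmetric_per_channel(weights, bits=4):
    """Quantize each row (one output channel) with its own scale.

    Symmetric means the zero point is fixed at 0, so only a scale is stored.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit values
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp to [-qmax, qmax].
        quantized.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer levels back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Hypothetical 2x3 weight matrix, chosen so the scales are exact in binary.
weights = [[1.75, -0.5, 0.25], [0.4375, 0.1875, -0.3125]]
q_weights, scales = quantize_symmetric_per_channel(weights)
restored = dequantize(q_weights, scales)
```

Per-channel scales let a row with small weights keep more resolution than a single tensor-wide scale would allow. Real weights rarely fall exactly on the 4-bit grid, so the restored values are generally only approximations of the originals.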

Performance Metrics

First Load Time (s): 17.1006
Text Time To First Token (s): 0.322963
Image Time To First Token (s): 0.93134
TPS (tokens per second): 8.15536
Accuracy

Test    Evaluation Metric    Full Precision Accuracy    Post Quantization Accuracy
MMLU    accuracy             52                         47
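
As a rough guide to what the latency figures imply, the sketch below estimates the end-to-end time for a response. It assumes the first token arrives at the reported time to first token and that each later token takes 1/TPS seconds at a constant rate. The page does not state how TPS was measured, so this is an approximation. The function name and the 100-token response length are hypothetical.

```python
def estimate_response_seconds(ttft_s, tokens_per_second, new_tokens):
    """Estimate total generation time: first token, then steady decoding."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    # The first token arrives at ttft_s; the rest are assumed to stream
    # at a constant tokens_per_second rate.
    return ttft_s + (new_tokens - 1) / tokens_per_second


# Figures reported in the Performance Metrics section.
TEXT_TTFT_S = 0.322963
IMAGE_TTFT_S = 0.93134
TPS = 8.15536

# Hypothetical response length of 100 tokens.
text_estimate = estimate_response_seconds(TEXT_TTFT_S, TPS, 100)
image_estimate = estimate_response_seconds(IMAGE_TTFT_S, TPS, 100)
print(f"text prompt:  {text_estimate:.1f} s")
print(f"image prompt: {image_estimate:.1f} s")
```

Under these assumptions decoding dominates the total, so a 100-token answer takes on the order of 12 to 13 seconds whether or not the prompt contains an image. The First Load Time is a one-time cost and is not included.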
