Qwen2-VL 2B

Generate multimodal responses by interpreting both text and images, enabling vision-language understanding and content creation

Model Properties

The pipeline processes image and text inputs using a vision encoder and language model to generate contextualized outputs

License name: Apache License 2.0
Number of parameters: 2B

Technical Details

Context Length: 2048
Image Input Size: [224, 224, 3]
Max Output Tokens: 64 vision tokens
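As a rough illustration of how the limits above might be checked before submitting a request, the sketch below validates an image shape and a token budget. The helper names are hypothetical and not part of any runtime API. It assumes the input shape is ordered height, width, channels, and that prompt and generated tokens share the 2048-token context; the listing does not confirm either assumption.

```python
# Values copied from the Technical Details list.
CONTEXT_LENGTH = 2048
IMAGE_INPUT_SHAPE = (224, 224, 3)  # assumed order: height, width, channels


def image_shape_ok(shape) -> bool:
    """Return True if the shape matches the listed image input size."""
    return tuple(shape) == IMAGE_INPUT_SHAPE


def raw_image_values(shape=IMAGE_INPUT_SHAPE) -> int:
    """Number of scalar values in one input image."""
    height, width, channels = shape
    return height * width * channels


def fits_context(prompt_tokens: int, reserved_tokens: int = 0) -> bool:
    """Check a prompt plus reserved tokens against the context length.

    Assumption: prompt and reserved (e.g. output) tokens count against
    the same 2048-token window.
    """
    return prompt_tokens + reserved_tokens <= CONTEXT_LENGTH


if __name__ == "__main__":
    print(image_shape_ok((224, 224, 3)))
    print(raw_image_values())
    print(fits_context(prompt_tokens=1500, reserved_tokens=256))
```

A channels-first array such as (3, 224, 224) would fail the shape check under this assumption and would need transposing first.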

Performance Metrics

FPS: 2.45
Text Prefill Per Sec: 400
Text Time To First Token Per Sec: 0.24
Image Time To First Token Per Sec: 0.4
TPS: 11
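These figures can be combined into a back-of-the-envelope response-time estimate. The sketch below assumes that the time-to-first-token values are in seconds and that TPS means generated tokens per second; the listing does not state units, so treat the result as indicative only. The function name is hypothetical.

```python
# Figures copied from the Performance Metrics list.
TEXT_TTFT_S = 0.24   # assumed unit: seconds
IMAGE_TTFT_S = 0.4   # assumed unit: seconds
TPS = 11             # assumed meaning: generated tokens per second


def estimate_latency_s(new_tokens: int, with_image: bool) -> float:
    """Rough time to produce new_tokens: first-token delay plus decoding."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    ttft = IMAGE_TTFT_S if with_image else TEXT_TTFT_S
    # The first token arrives after ttft; the rest stream at TPS.
    return ttft + (new_tokens - 1) / TPS


if __name__ == "__main__":
    print(round(estimate_latency_s(64, with_image=True), 2))
```

Under these assumptions, decoding speed dominates for anything beyond a few tokens, so the difference between the text and image first-token delays matters mostly for short replies.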

Explore More Models

GenAI Models
Florence-2-Base
Generate descriptive captions for input images, enabling visual recognition and image-to-text translation tasks
GenAI Models
Stable Diffusion 1.5
Generate high-quality images from textual descriptions by leveraging advanced deep learning techniques