Generate multimodal responses by interpreting both text and images, enabling vision-language understanding and content creation.
The pipeline processes image and text inputs using a vision encoder and language model to generate contextualized outputs.
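
The data flow described above can be illustrated with a minimal sketch. This is not the model's actual implementation: `MultimodalInput`, `encode_image`, `embed_text`, and `generate` are hypothetical names, and both the encoder and the language model are replaced by trivial stand-ins so that only the order of the stages is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalInput:
    """Hypothetical container for one image plus one text prompt."""
    image: List[List[float]]  # toy grayscale image: rows of pixel values
    prompt: str


def encode_image(image: List[List[float]]) -> List[float]:
    # Stand-in for a vision encoder: one mean-intensity "token" per image row.
    return [sum(row) / len(row) for row in image]


def embed_text(prompt: str) -> List[float]:
    # Stand-in for a text embedder: one number per whitespace-separated token.
    return [float(len(token)) for token in prompt.split()]


def generate(inputs: MultimodalInput) -> str:
    # 1. Encode the image into a sequence of visual tokens.
    image_tokens = encode_image(inputs.image)
    # 2. Embed the text prompt.
    text_tokens = embed_text(inputs.prompt)
    # 3. Concatenate both into one sequence for the language model.
    sequence = image_tokens + text_tokens
    # 4. Stand-in for decoding: a real language model would generate text
    #    conditioned on `sequence`; here we only report its composition.
    return (
        f"response conditioned on {len(image_tokens)} image tokens "
        f"and {len(text_tokens)} text tokens ({len(sequence)} total)"
    )


example = MultimodalInput(image=[[0.0, 1.0], [1.0, 1.0]], prompt="describe this image")
result = generate(example)
```

In a real deployment the two stand-ins would be replaced by the model's own vision encoder and language model; the sketch only fixes the order in which image and text are combined.
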
License name: Apache License 2.0

| Test | Evaluation Metric | Full Precision Accuracy | Post Quantization Accuracy |
|---|---|---|---|
| MMLU | Accuracy | 52 | 47 |
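
To put the quantization cost in the table in concrete terms, the short sketch below computes the absolute and relative accuracy drop. It assumes the reported scores are percentages; `quantization_drop` is a hypothetical helper name, not part of any published tooling.

```python
from typing import Tuple


def quantization_drop(full_precision: float, quantized: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = full_precision - quantized
    relative = absolute / full_precision * 100.0
    return absolute, relative


# MMLU scores taken from the table above (assumed to be percentages).
absolute_drop, relative_drop = quantization_drop(52, 47)
print(f"absolute drop: {absolute_drop} points, relative drop: {relative_drop:.1f}%")
```

For the MMLU row this gives a 5-point absolute drop, or roughly 9.6% relative to the full-precision score.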