# Multimodal Language Models

These models accept multimodal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.

## Example launch Command

```shell
# Replace the model path with any supported HuggingFace identifier or local checkpoint.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --host 0.0.0.0 \
  --port 30000
```
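
The launched server exposes an OpenAI-compatible API, so multimodal requests can be sent in the OpenAI Vision message format. The sketch below assumes the server above is running locally on port 30000 and that the image URL is publicly reachable; the prompt and image URL are placeholders.

```shell
# Minimal sketch: send one text + image request to the OpenAI-compatible endpoint.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
        ]
      }
    ],
    "max_tokens": 64
  }'
```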

## Supported models

The supported models are summarized in the table below.

If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression:

```
repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration
```

in the GitHub search bar.

| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
| Qwen-VL (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-vl` | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
| DeepSeek-VL2 | `deepseek-ai/deepseek-vl2` | `deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
| Janus-Pro (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | `janus-pro` | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture with separate visual encoding paths, enhancing performance in both tasks. |
| MiniCPM-V / MiniCPM-o | `openbmb/MiniCPM-V-2_6` | `minicpmv` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
| Llama 3.2 Vision (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
| LLaVA (v1.5 & v1.6) | e.g. `liuhaotian/llava-v1.5-13b` | `vicuna_v1.1` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
| LLaVA-NeXT (8B, 72B) | `lmms-lab/llava-next-72b` | `chatml-llava` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
| LLaVA-OneVision | `lmms-lab/llava-onevision-qwen2-7b-ov` | `chatml-llava` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
| Gemma 3 (Multimodal) | `google/gemma-3-4b-it` | `gemma-it` | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
| Kimi-VL (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | `kimi-vl` | Kimi-VL is a multimodal model that can understand and generate text from images. |
| Mistral-Small-3.1-24B | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | `mistral` | Mistral 3.1 is a multimodal model that can generate text from text or image inputs. It also supports tool calling and structured output. |
| Phi-4-multimodal-instruct | `microsoft/Phi-4-multimodal-instruct` | `phi-4-mm` | Multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. Currently, it supports only text and vision modalities in SGLang. |
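
If the correct chat template is not picked up automatically for a given model, it can be set explicitly with the `--chat-template` flag, using a name from the Chat Template column above. A minimal sketch, assuming the Qwen2.5-VL model from the table:

```shell
# Minimal sketch: launch a server and pin the chat template explicitly,
# using a template name from the "Chat Template" column.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-VL-7B-Instruct \
  --chat-template qwen2-vl \
  --host 0.0.0.0 \
  --port 30000
```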