{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenAI APIs - Vision\n", "\n", "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n", "This tutorial covers the vision APIs for vision language models.\n", "\n", "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](https://docs.sglang.ai/references/supported_models): \n", "- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) \n", "- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat) \n", "- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)\n", "- [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)\n", "- [openbmb/MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V)\n", "- [deepseek-ai/deepseek-vl2](https://huggingface.co/deepseek-ai/deepseek-vl2)\n", "\n", "As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server\n", "\n", "Launch the server in your terminal and wait for it to initialize.\n", "\n", "**Remember to add** `--chat-template llama_3_vision` **to specify the [vision chat template](https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template), otherwise, the server will only support text (images won’t be passed in), which can lead to degraded performance.**\n", "\n", "We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.test.test_utils import is_in_ci\n", "\n", "if is_in_ci():\n", " from patch import launch_server_cmd\n", "else:\n", " from sglang.utils import launch_server_cmd\n", "\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", "vision_process, port = launch_server_cmd(\n", " \"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n", " --chat-template=llama_3_vision\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using cURL\n", "\n", "Once the server is up, you can send test requests using curl or requests." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "curl_command = f\"\"\"\n", "curl -s http://localhost:{port}/v1/chat/completions \\\\\n", " -d '{{\n", " \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " \"messages\": [\n", " {{\n", " \"role\": \"user\",\n", " \"content\": [\n", " {{\n", " \"type\": \"text\",\n", " \"text\": \"What’s in this image?\"\n", " }},\n", " {{\n", " \"type\": \"image_url\",\n", " \"image_url\": {{\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n", " }}\n", " }}\n", " ]\n", " }}\n", " ],\n", " \"max_tokens\": 300\n", " }}'\n", "\"\"\"\n", "\n", "response = subprocess.check_output(curl_command, shell=True).decode()\n", "print_highlight(response)\n", "\n", "\n", "response = subprocess.check_output(curl_command, shell=True).decode()\n", "print_highlight(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Python Requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = f\"http://localhost:{port}/v1/chat/completions\"\n", "\n", "data = {\n", " \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " \"messages\": [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n", " },\n", " },\n", " ],\n", " }\n", " ],\n", " \"max_tokens\": 300,\n", "}\n", "\n", "response = requests.post(url, json=data)\n", "print_highlight(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using OpenAI Python Client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What is in this image?\",\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n", " },\n", " },\n", " ],\n", " }\n", " ],\n", " max_tokens=300,\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple-Image Inputs\n", "\n", "The server also supports multiple images and interleaved text and images if the model supports it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\",\n", " },\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n", " },\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"I have two very different images. They are not related at all. \"\n", " \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n", " },\n", " ],\n", " }\n", " ],\n", " temperature=0,\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(vision_process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chat Template\n", "\n", "As mentioned before, if you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text.\n", "\n", "We list popular vision models with their chat templates:\n", "\n", "- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.\n", "- [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) uses `qwen2-vl`.\n", "- [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) uses `gemma-it`.\n", "- [openbmb/MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V) uses `minicpmv`.\n", "- [deepseek-ai/deepseek-vl2](https://huggingface.co/deepseek-ai/deepseek-vl2) uses `deepseek-vl2`.\n", "- [LlaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.\n", "- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.\n", "- [Llama3-LLaVA-NeXT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.\n", "- [LLaVA-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.6-34b) uses `vicuna_v1.1`." ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }