{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sending Requests\n", "This notebook provides a quick-start guide to use SGLang in chat completions after installation.\n", "\n", "- For Vision Language Models, see [OpenAI APIs - Vision](../backend/openai_api_vision.ipynb).\n", "- For Embedding Models, see [OpenAI APIs - Embedding](../backend/openai_api_embeddings.ipynb) and [Encode (embedding model)](../backend/native_api.html#Encode-(embedding-model)).\n", "- For Reward Models, see [Classify (reward model)](../backend/native_api.html#Classify-(reward-model))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.test.test_utils import is_in_ci\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", "if is_in_ci():\n", " from patch import launch_server_cmd\n", "else:\n", " from sglang.utils import launch_server_cmd\n", "\n", "# This is equivalent to running the following command in your terminal\n", "\n", "# python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0\n", "\n", "server_process, port = launch_server_cmd(\n", " \"\"\"\n", "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --host 0.0.0.0\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using cURL\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import subprocess, json\n", "\n", "curl_command = f\"\"\"\n", "curl -s http://localhost:{port}/v1/chat/completions \\\n", " -H \"Content-Type: application/json\" \\\n", " -d '{{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{{\"role\": \"user\", \"content\": \"What is the capital of France?\"}}]}}'\n", "\"\"\"\n", "\n", "response = json.loads(subprocess.check_output(curl_command, shell=True))\n", "print_highlight(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Python Requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = f\"http://localhost:{port}/v1/chat/completions\"\n", "\n", "data = {\n", " \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n", "}\n", "\n", "response = requests.post(url, json=data)\n", "print_highlight(response.json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using OpenAI Python Client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " messages=[\n", " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", " ],\n", " temperature=0,\n", " max_tokens=64,\n", ")\n", "print_highlight(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "# Use stream=True for streaming 
responses\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " messages=[\n", " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", " ],\n", " temperature=0,\n", " max_tokens=64,\n", " stream=True,\n", ")\n", "\n", "# Handle the streaming output\n", "for chunk in response:\n", " if chunk.choices[0].delta.content:\n", " print(chunk.choices[0].delta.content, end=\"\", flush=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Native Generation APIs\n", "\n", "You can also use the native `/generate` endpoint with requests, which provides more flexiblity. An API reference is available at [Sampling Parameters](../references/sampling_params.md)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "response = requests.post(\n", " f\"http://localhost:{port}/generate\",\n", " json={\n", " \"text\": \"The capital of France is\",\n", " \"sampling_params\": {\n", " \"temperature\": 0,\n", " \"max_new_tokens\": 32,\n", " },\n", " },\n", ")\n", "\n", "print_highlight(response.json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests, json\n", "\n", "response = requests.post(\n", " f\"http://localhost:{port}/generate\",\n", " json={\n", " \"text\": \"The capital of France is\",\n", " \"sampling_params\": {\n", " \"temperature\": 0,\n", " \"max_new_tokens\": 32,\n", " },\n", " \"stream\": True,\n", " },\n", " stream=True,\n", ")\n", "\n", "prev = 0\n", "for chunk in response.iter_lines(decode_unicode=False):\n", " chunk = chunk.decode(\"utf-8\")\n", " if chunk and chunk.startswith(\"data:\"):\n", " if chunk == \"data: [DONE]\":\n", " break\n", " data = json.loads(chunk[5:].strip(\"\\n\"))\n", " output = data[\"text\"]\n", " print(output[prev:], end=\"\", flush=True)\n", " prev = len(output)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }