{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenAI APIs - Completions\n", "\n", "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n", "\n", "This tutorial covers the following popular APIs:\n", "\n", "- `chat/completions`\n", "- `completions`\n", "- `batches`\n", "\n", "Check out other tutorials to learn about [vision APIs](https://docs.sglang.ai/backend/openai_api_vision.html) for vision-language models and [embedding APIs](https://docs.sglang.ai/backend/openai_api_embeddings.html) for embedding models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server\n", "\n", "Launch the server in your terminal and wait for it to initialize." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.test.test_utils import is_in_ci\n", "\n", "if is_in_ci():\n", " from patch import launch_server_cmd\n", "else:\n", " from sglang.utils import launch_server_cmd\n", "\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", "\n", "server_process, port = launch_server_cmd(\n", " \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")\n", "print(f\"Server started on http://localhost:{port}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chat Completions\n", "\n", "### Usage\n", "\n", "The server fully implements the OpenAI API.\n", "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n", "You can also specify a custom chat template with `--chat-template` when launching the server." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " messages=[\n", " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", " ],\n", " temperature=0,\n", " max_tokens=64,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The chat completions API accepts OpenAI Chat Completions API's parameters. 
Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n", "\n", "Here is an example of a detailed chat completion request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[\n", "        {\n", "            \"role\": \"system\",\n", "            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n", "        },\n", "        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n", "        {\n", "            \"role\": \"assistant\",\n", "            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n", "        },\n", "        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n", "    ],\n", "    temperature=0.3,  # Lower temperature for more focused responses\n", "    max_tokens=128,  # Reasonable length for a concise response\n", "    top_p=0.95,  # Slightly higher for better fluency\n", "    presence_penalty=0.2,  # Mild penalty to avoid repetition\n", "    frequency_penalty=0.2,  # Mild penalty for more natural language\n", "    n=1,  # Single response is usually more stable\n", "    seed=42,  # Keep for reproducibility\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Streaming mode is also supported." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stream = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n", "    stream=True,\n", ")\n", "for chunk in stream:\n", "    if chunk.choices[0].delta.content is not None:\n", "        print(chunk.choices[0].delta.content, end=\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Completions\n", "\n", "### Usage\n", "\n", "The completions API is similar to the chat completions API, but it takes a plain `prompt` string instead of the `messages` parameter and applies no chat template." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"List 3 countries and their capitals.\",\n", "    temperature=0,\n", "    max_tokens=64,\n", "    n=1,\n", "    stop=None,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The completions API accepts the parameters of the OpenAI Completions API.
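One commonly used parameter is `stream`. The next cell is a minimal sketch that mirrors the chat streaming example above, assuming the server streams this endpoint the same way; note that here the deltas arrive in `choices[0].text` rather than `choices[0].delta.content`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: stream a completion and print text deltas as they arrive.\n", "stream = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"List 3 countries and their capitals.\",\n", "    temperature=0,\n", "    max_tokens=64,\n", "    stream=True,\n", ")\n", "for chunk in stream:\n", "    if chunk.choices[0].text is not None:\n", "        print(chunk.choices[0].text, end=\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "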
Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n", "\n", "Here is an example of a detailed completions request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"Write a short story about a space explorer.\",\n", "    temperature=0.7,  # Moderate temperature for creative writing\n", "    max_tokens=150,  # Longer response for a story\n", "    top_p=0.9,  # Balanced diversity in word choice\n", "    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n", "    presence_penalty=0.3,  # Encourage novel elements\n", "    frequency_penalty=0.3,  # Reduce repetitive phrases\n", "    n=1,  # Generate one completion\n", "    seed=123,  # For reproducible results\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structured Outputs (JSON, Regex, EBNF)\n", "\n", "For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API) for more details.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batches\n", "\n", "The Batches API for chat completions and completions is also supported. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).\n", "\n", "The batches APIs are:\n", "\n", "- `batches`\n", "- `batches/{batch_id}/cancel`\n", "- `batches/{batch_id}`\n", "\n", "Here is an example of a batch job for chat completions; a batch of completions requests works the same way.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = [\n", "    {\n", "        \"custom_id\": \"request-1\",\n", "        \"method\": \"POST\",\n", "        \"url\": \"/chat/completions\",\n", "        \"body\": {\n", "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "            \"messages\": [\n", "                {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n", "            ],\n", "            \"max_tokens\": 50,\n", "        },\n", "    },\n", "    {\n", "        \"custom_id\": \"request-2\",\n", "        \"method\": \"POST\",\n", "        \"url\": \"/chat/completions\",\n", "        \"body\": {\n", "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "            \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n", "            \"max_tokens\": 50,\n", "        },\n", "    },\n", "]\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "\n", "with open(input_file_path, \"w\") as f:\n", "    for req in requests:\n", "        f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", "    file_response = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_response = client.batches.create(\n", "    input_file_id=file_response.id,\n", "    endpoint=\"/v1/chat/completions\",\n", "    completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Batch job created with ID: {batch_response.id}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while batch_response.status not in [\"completed\", \"failed\", \"cancelled\"]:\n", "    time.sleep(3)\n", "    print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n", "    # Refresh the batch object to pick up the latest status\n", "    batch_response = 
client.batches.retrieve(batch_response.id)\n", "\n", "if batch_response.status == \"completed\":\n", "    print(\"Batch job completed successfully!\")\n", "    print(f\"Request counts: {batch_response.request_counts}\")\n", "\n", "    result_file_id = batch_response.output_file_id\n", "    file_response = client.files.content(result_file_id)\n", "    result_content = file_response.read().decode(\"utf-8\")\n", "\n", "    # Each line of the output file is a JSON object for one request\n", "    results = [\n", "        json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n", "    ]\n", "\n", "    for result in results:\n", "        print_highlight(f\"Request {result['custom_id']}:\")\n", "        print_highlight(f\"Response: {result['response']}\")\n", "\n", "    print_highlight(\"Cleaning up files...\")\n", "    # Delete only the server-side result file; file_response is just the downloaded content\n", "    client.files.delete(result_file_id)\n", "else:\n", "    print_highlight(f\"Batch job failed with status: {batch_response.status}\")\n", "    if hasattr(batch_response, \"errors\"):\n", "        print_highlight(f\"Errors: {batch_response.errors}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes a while for the batch job to complete. You can use these two APIs to retrieve the batch job status or to cancel it.\n", "\n", "1. `batches/{batch_id}`: Retrieve the batch job status.\n", "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n", "\n", "Here is an example of checking the batch job status." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(20):\n", "    requests.append(\n", "        {\n", "            \"custom_id\": f\"request-{i}\",\n", "            \"method\": \"POST\",\n", "            \"url\": \"/chat/completions\",\n", "            \"body\": {\n", "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "                \"messages\": [\n", "                    {\n", "                        \"role\": \"system\",\n", "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n", "                    },\n", "                    {\n", "                        \"role\": \"user\",\n", "                        \"content\": \"Write a detailed story about topic. 
Make it very long.\",\n", " },\n", " ],\n", " \"max_tokens\": 64,\n", " },\n", " }\n", " )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", " for req in requests:\n", " f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", " uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", " input_file_id=uploaded_file.id,\n", " endpoint=\"/v1/chat/completions\",\n", " completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "max_checks = 5\n", "for i in range(max_checks):\n", " batch_details = client.batches.retrieve(batch_id=batch_job.id)\n", "\n", " print_highlight(\n", " f\"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}\"\n", " )\n", " print_highlight(\n", " f\"Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}\"\n", " )\n", "\n", " time.sleep(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example to cancel a batch job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "import os\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(5000):\n", " requests.append(\n", " {\n", " \"custom_id\": f\"request-{i}\",\n", " \"method\": \"POST\",\n", " \"url\": \"/chat/completions\",\n", " \"body\": {\n", " \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " \"messages\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": f\"{i}: You are a helpful AI assistant\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": \"Write a detailed story about topic. Make it very long.\",\n", " },\n", " ],\n", " \"max_tokens\": 128,\n", " },\n", " }\n", " )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", " for req in requests:\n", " f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", " uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", " input_file_id=uploaded_file.id,\n", " endpoint=\"/v1/chat/completions\",\n", " completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "try:\n", " cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n", " print_highlight(f\"Cancellation initiated. 
Status: {cancelled_job.status}\")\n", " assert cancelled_job.status == \"cancelling\"\n", "\n", " # Monitor the cancellation process\n", " while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n", " time.sleep(3)\n", " cancelled_job = client.batches.retrieve(batch_job.id)\n", " print_highlight(f\"Current status: {cancelled_job.status}\")\n", "\n", " # Verify final status\n", " assert cancelled_job.status == \"cancelled\"\n", " print_highlight(\"Batch job successfully cancelled\")\n", "\n", "except Exception as e:\n", " print_highlight(f\"Error during cancellation: {e}\")\n", " raise e\n", "\n", "finally:\n", " try:\n", " del_response = client.files.delete(uploaded_file.id)\n", " if del_response.deleted:\n", " print_highlight(\"Successfully cleaned up input file\")\n", " if os.path.exists(input_file_path):\n", " os.remove(input_file_path)\n", " print_highlight(\"Successfully deleted local batch_requests.jsonl file\")\n", " except Exception as e:\n", " print_highlight(f\"Error cleaning up: {e}\")\n", " raise e" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }