{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenAI APIs - Completions\n",
    "\n",
    "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
    "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
    "\n",
    "This tutorial covers the following popular APIs:\n",
    "\n",
    "- `chat/completions`\n",
    "- `completions`\n",
    "- `batches`\n",
    "\n",
    "Check out other tutorials to learn about [vision APIs](https://docs.sglang.ai/backend/openai_api_vision.html) for vision-language models and [embedding APIs](https://docs.sglang.ai/backend/openai_api_embeddings.html) for embedding models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch A Server\n",
    "\n",
    "Launch the server in your terminal and wait for it to initialize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sglang.test.test_utils import is_in_ci\n",
    "\n",
    "if is_in_ci():\n",
    "    from patch import launch_server_cmd\n",
    "else:\n",
    "    from sglang.utils import launch_server_cmd\n",
    "\n",
    "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
    "\n",
    "server_process, port = launch_server_cmd(\n",
    "    \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")\n",
    "print(f\"Server started on http://localhost:{port}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chat Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "The server fully implements the OpenAI API.\n",
    "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n",
    "You can also specify a custom chat template with `--chat-template` when launching the server."
   ]
  },
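  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For illustration, a launch command with a custom chat template might look like the sketch below; `llama-3` is assumed here to be one of SGLang's registered chat template names, so check the chat template registry of your SGLang version for the available names and supported file formats:\n",
    "\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "    --host 0.0.0.0 --chat-template llama-3\n",
    "```"
   ]
  },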
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "\n",
    "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The chat completions API accepts the OpenAI Chat Completions API's parameters. Refer to the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
    "\n",
    "Here is an example of a detailed chat completion request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
    "        {\n",
    "            \"role\": \"assistant\",\n",
    "            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
    "    ],\n",
    "    temperature=0.3,  # Lower temperature for more focused responses\n",
    "    max_tokens=128,  # Reasonable length for a concise response\n",
    "    top_p=0.95,  # Slightly higher for better fluency\n",
    "    presence_penalty=0.2,  # Mild penalty to avoid repetition\n",
    "    frequency_penalty=0.2,  # Mild penalty for more natural language\n",
    "    n=1,  # Single response is usually more stable\n",
    "    seed=42,  # Keep for reproducibility\n",
    ")\n",
    "\n",
    "print_highlight(response.choices[0].message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Streaming mode is also supported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stream = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
    "    stream=True,\n",
    ")\n",
    "for chunk in stream:\n",
    "    if chunk.choices[0].delta.content is not None:\n",
    "        print(chunk.choices[0].delta.content, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "The Completions API is similar to the Chat Completions API, but without the `messages` parameter or chat templates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"List 3 countries and their capitals.\",\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    "    n=1,\n",
    "    stop=None,\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The completions API accepts the OpenAI Completions API's parameters. Refer to the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
    "\n",
    "Here is an example of a detailed completions request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"Write a short story about a space explorer.\",\n",
    "    temperature=0.7,  # Moderate temperature for creative writing\n",
    "    max_tokens=150,  # Longer response for a story\n",
    "    top_p=0.9,  # Balanced diversity in word choice\n",
    "    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n",
    "    presence_penalty=0.3,  # Encourage novel elements\n",
    "    frequency_penalty=0.3,  # Reduce repetitive phrases\n",
    "    n=1,  # Generate one completion\n",
    "    seed=123,  # For reproducible results\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Structured Outputs (JSON, Regex, EBNF)\n",
    "\n",
    "For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API) for more details."
   ]
  },
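  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of JSON-constrained generation (assuming your SGLang version supports the OpenAI-style `response_format` parameter; the schema and its name below are illustrative):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative JSON schema; the field names are placeholders for this example.\n",
    "json_schema = {\n",
    "    \"type\": \"object\",\n",
    "    \"properties\": {\n",
    "        \"name\": {\"type\": \"string\"},\n",
    "        \"population\": {\"type\": \"integer\"},\n",
    "    },\n",
    "    \"required\": [\"name\", \"population\"],\n",
    "}\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"user\", \"content\": \"Give me the information of the capital of France in JSON format.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=128,\n",
    "    # Constrain the output to the schema above.\n",
    "    response_format={\n",
    "        \"type\": \"json_schema\",\n",
    "        \"json_schema\": {\"name\": \"capital_info\", \"schema\": json_schema},\n",
    "    },\n",
    ")\n",
    "\n",
    "print_highlight(response.choices[0].message.content)"
   ]
  },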
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batches\n",
    "\n",
    "The Batches API is also supported for chat completions and completions. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).\n",
    "\n",
    "The batches APIs are:\n",
    "\n",
    "- `batches`\n",
    "- `batches/{batch_id}/cancel`\n",
    "- `batches/{batch_id}`\n",
    "\n",
    "Here is an example of a batch job for chat completions; completions work similarly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "requests = [\n",
    "    {\n",
    "        \"custom_id\": \"request-1\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [\n",
    "                {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n",
    "            ],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "    {\n",
    "        \"custom_id\": \"request-2\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "]\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    file_response = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_response = client.batches.create(\n",
    "    input_file_id=file_response.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Batch job created with ID: {batch_response.id}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "while batch_response.status not in [\"completed\", \"failed\", \"cancelled\"]:\n",
    "    print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n",
    "    time.sleep(3)\n",
    "    batch_response = client.batches.retrieve(batch_response.id)\n",
    "\n",
    "if batch_response.status == \"completed\":\n",
    "    print(\"Batch job completed successfully!\")\n",
    "    print(f\"Request counts: {batch_response.request_counts}\")\n",
    "\n",
    "    result_file_id = batch_response.output_file_id\n",
    "    file_response = client.files.content(result_file_id)\n",
    "    result_content = file_response.read().decode(\"utf-8\")\n",
    "\n",
    "    results = [\n",
    "        json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n",
    "    ]\n",
    "\n",
    "    for result in results:\n",
    "        print_highlight(f\"Request {result['custom_id']}:\")\n",
    "        print_highlight(f\"Response: {result['response']}\")\n",
    "\n",
    "    print_highlight(\"Cleaning up files...\")\n",
    "    # Only delete the result file ID since file_response is just content\n",
    "    client.files.delete(result_file_id)\n",
    "else:\n",
    "    print_highlight(f\"Batch job failed with status: {batch_response.status}\")\n",
    "    if hasattr(batch_response, \"errors\"):\n",
    "        print_highlight(f\"Errors: {batch_response.errors}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The batch job takes a while to complete. You can use these two APIs to retrieve its status or cancel it:\n",
    "\n",
    "1. `batches/{batch_id}`: Retrieve the batch job status.\n",
    "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n",
    "\n",
    "Here is an example of checking the batch job status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(20):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about a topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 64,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print_highlight(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "max_checks = 5\n",
    "for i in range(max_checks):\n",
    "    batch_details = client.batches.retrieve(batch_id=batch_job.id)\n",
    "\n",
    "    print_highlight(\n",
    "        f\"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}\"\n",
    "    )\n",
    "    print_highlight(\n",
    "        f\"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>\"\n",
    "    )\n",
    "\n",
    "    time.sleep(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example of cancelling a batch job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(5000):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about a topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 128,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print_highlight(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "try:\n",
    "    cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n",
    "    print_highlight(f\"Cancellation initiated. Status: {cancelled_job.status}\")\n",
    "    assert cancelled_job.status == \"cancelling\"\n",
    "\n",
    "    # Monitor the cancellation process\n",
    "    while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n",
    "        time.sleep(3)\n",
    "        cancelled_job = client.batches.retrieve(batch_job.id)\n",
    "        print_highlight(f\"Current status: {cancelled_job.status}\")\n",
    "\n",
    "    # Verify final status\n",
    "    assert cancelled_job.status == \"cancelled\"\n",
    "    print_highlight(\"Batch job successfully cancelled\")\n",
    "\n",
    "except Exception as e:\n",
    "    print_highlight(f\"Error during cancellation: {e}\")\n",
    "    raise e\n",
    "\n",
    "finally:\n",
    "    try:\n",
    "        del_response = client.files.delete(uploaded_file.id)\n",
    "        if del_response.deleted:\n",
    "            print_highlight(\"Successfully cleaned up input file\")\n",
    "        if os.path.exists(input_file_path):\n",
    "            os.remove(input_file_path)\n",
    "            print_highlight(\"Successfully deleted local batch_requests.jsonl file\")\n",
    "    except Exception as e:\n",
    "        print_highlight(f\"Error cleaning up: {e}\")\n",
    "        raise e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}