{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reasoning Parser\n",
"\n",
"SGLang supports parsing reasoning content out from \"normal\" content for reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).\n",
"\n",
"## Supported Models & Parsers\n",
"\n",
"| Model | Reasoning tags | Parser | Notes |\n",
"|---------|-----------------------------|------------------|-------|\n",
"| [DeepSeekR1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
"| [DeepSeekV3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |\n",
"| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
"| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
"| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |\n",
"| [GPT OSS](https://huggingface.co/openai/gpt-oss-120b) | `<\\|channel\\|>analysis<\\|message\\|>` … `<\\|end\\|>` | `gpt-oss` | N/A |\n",
"### Model-Specific Behaviors\n",
"\n",
"**DeepSeek-R1 Family:**\n",
"- DeepSeek-R1: No `<think>` start tag, jumps directly to thinking content\n",
"- DeepSeek-R1-0528: Generates both `<think>` start and `</think>` end tags\n",
"- Both are handled by the same `deepseek-r1` parser\n",
"\n",
"**DeepSeek-V3 Family:**\n",
"- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
"\n",
"**Qwen3 Family:**\n",
"- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates\n",
"- Qwen3-Thinking (e.g., Qwen3-235B-A22B-Thinking-2507): Use `qwen3` or `qwen3-thinking` parser, always thinks\n",
"\n",
"**Kimi:**\n",
"- Kimi: Uses special `◁think▷` and `◁/think▷` tags\n",
"\n",
"**GPT OSS:**\n",
"- GPT OSS: Uses special `<|channel|>analysis<|message|>` and `<|end|>` tags"
]
},
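{
"cell_type": "markdown",
"metadata": {},
"source": [
"For hybrid models, the thinking switch is passed to the chat template via `chat_template_kwargs` in the request body. A minimal sketch, assuming a server running a hybrid Qwen3 model and an OpenAI-compatible client like the one created below (for DeepSeek-V3.1, the key would be `thinking` instead of `enable_thinking`):\n",
"\n",
"```python\n",
"# Illustrative only: disable thinking for a hybrid Qwen3 model.\n",
"response = client.chat.completions.create(\n",
"    model=model_name,\n",
"    messages=messages,\n",
"    extra_body={\n",
"        \"chat_template_kwargs\": {\"enable_thinking\": False},\n",
"        \"separate_reasoning\": True,\n",
"    },\n",
")\n",
"```"
]
},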
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"### Launching the Server"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify the `--reasoning-parser` option."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from openai import OpenAI\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1 --log-level warning\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `--reasoning-parser` defines the parser used to interpret responses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### OpenAI Compatible API\n",
"\n",
"Using the OpenAI compatible API, the contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:\n",
"\n",
"- `reasoning_content`: The content of the CoT.\n",
"- `content`: The content of the final answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initialize OpenAI-like client\n",
"client = OpenAI(api_key=\"None\", base_url=f\"http://0.0.0.0:{port}/v1\")\n",
"model_name = client.models.list().data[0].id\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"What is 1+3?\",\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Non-Streaming Request"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_non_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=False, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True},\n",
")\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(response_non_stream.choices[0].message.reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(response_non_stream.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Streaming Request"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=True, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True},\n",
")\n",
"\n",
"reasoning_content = \"\"\n",
"content = \"\"\n",
"for chunk in response_stream:\n",
" if chunk.choices[0].delta.content:\n",
" content += chunk.choices[0].delta.content\n",
" if chunk.choices[0].delta.reasoning_content:\n",
" reasoning_content += chunk.choices[0].delta.reasoning_content\n",
"\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can buffer the reasoning content to the last reasoning chunk (or the first chunk after the reasoning content)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=True, # Non-streaming\n",
" extra_body={\"separate_reasoning\": True, \"stream_reasoning\": False},\n",
")\n",
"\n",
"reasoning_content = \"\"\n",
"content = \"\"\n",
"for chunk in response_stream:\n",
" if chunk.choices[0].delta.content:\n",
" content += chunk.choices[0].delta.content\n",
" if chunk.choices[0].delta.reasoning_content:\n",
" reasoning_content += chunk.choices[0].delta.reasoning_content\n",
"\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_content)\n",
"\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reasoning separation is enable by default when specify . \n",
"**To disable it, set the `separate_reasoning` option to `False` in request.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_non_stream = client.chat.completions.create(\n",
" model=model_name,\n",
" messages=messages,\n",
" temperature=0.6,\n",
" top_p=0.95,\n",
" stream=False, # Non-streaming\n",
" extra_body={\"separate_reasoning\": False},\n",
")\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(response_non_stream.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SGLang Native API "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"input = tokenizer.apply_chat_template(\n",
" messages,\n",
" tokenize=False,\n",
" add_generation_prompt=True,\n",
")\n",
"\n",
"gen_url = f\"http://localhost:{port}/generate\"\n",
"gen_data = {\n",
" \"text\": input,\n",
" \"sampling_params\": {\n",
" \"skip_special_tokens\": False,\n",
" \"max_new_tokens\": 1024,\n",
" \"temperature\": 0.6,\n",
" \"top_p\": 0.95,\n",
" },\n",
"}\n",
"gen_response = requests.post(gen_url, json=gen_data).json()[\"text\"]\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(gen_response)\n",
"\n",
"parse_url = f\"http://localhost:{port}/separate_reasoning\"\n",
"separate_reasoning_data = {\n",
" \"text\": gen_response,\n",
" \"reasoning_parser\": \"deepseek-r1\",\n",
"}\n",
"separate_reasoning_response_json = requests.post(\n",
" parse_url, json=separate_reasoning_data\n",
").json()\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(separate_reasoning_response_json[\"reasoning_text\"])\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(separate_reasoning_response_json[\"text\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Offline Engine API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sglang as sgl\n",
"from sglang.srt.parser.reasoning_parser import ReasoningParser\n",
"from sglang.utils import print_highlight\n",
"\n",
"llm = sgl.Engine(model_path=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
"input = tokenizer.apply_chat_template(\n",
" messages,\n",
" tokenize=False,\n",
" add_generation_prompt=True,\n",
")\n",
"sampling_params = {\n",
" \"max_new_tokens\": 1024,\n",
" \"skip_special_tokens\": False,\n",
" \"temperature\": 0.6,\n",
" \"top_p\": 0.95,\n",
"}\n",
"result = llm.generate(prompt=input, sampling_params=sampling_params)\n",
"\n",
"generated_text = result[\"text\"] # Assume there is only one prompt\n",
"\n",
"print_highlight(\"==== Original Output ====\")\n",
"print_highlight(generated_text)\n",
"\n",
"parser = ReasoningParser(\"deepseek-r1\")\n",
"reasoning_text, text = parser.parse_non_stream(generated_text)\n",
"print_highlight(\"==== Reasoning ====\")\n",
"print_highlight(reasoning_text)\n",
"print_highlight(\"==== Text ====\")\n",
"print_highlight(text)"
]
},
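{
"cell_type": "markdown",
"metadata": {},
"source": [
"`ReasoningParser` also exposes an incremental interface for streamed output. A minimal sketch, assuming `parse_stream_chunk` returns a `(reasoning_text, text)` pair per chunk like its non-streaming counterpart; here the finished output is re-fed in slices to mimic a stream:\n",
"\n",
"```python\n",
"stream_parser = ReasoningParser(\"deepseek-r1\")\n",
"reasoning_text, text = \"\", \"\"\n",
"for i in range(0, len(generated_text), 64):\n",
"    # Pretend each 64-character slice is a streamed delta.\n",
"    reasoning_chunk, text_chunk = stream_parser.parse_stream_chunk(\n",
"        generated_text[i : i + 64]\n",
"    )\n",
"    reasoning_text += reasoning_chunk or \"\"\n",
"    text += text_chunk or \"\"\n",
"```"
]
},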
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm.shutdown()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supporting New Reasoning Model Schemas\n",
"\n",
"For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/reasoning_parser.py` and specify the reasoning parser for new reasoning model schemas accordingly."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}