{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reasoning Parser\n",
    "\n",
    "SGLang supports parsing the reasoning content out from the \"normal\" content for reasoning models such as [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).\n",
    "\n",
    "## Supported Models\n",
    "\n",
    "Currently, SGLang supports the following reasoning models:\n",
    "- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d): The reasoning content is wrapped with `<think>` and `</think>` tags.\n",
    "- [QwQ](https://huggingface.co/Qwen/QwQ-32B): The reasoning content is wrapped with `<think>` and `</think>` tags."
   ]
  },
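  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of this tag scheme (the raw output string below is made up, not actual model output), the model interleaves its chain of thought and the final answer, and the parser splits them at the `</think>` tag. SGLang performs this separation for you, as shown in the rest of this notebook.\n",
    "\n",
    "```python\n",
    "# Made-up raw output in the DeepSeek R1 / QwQ tag format (illustrative only).\n",
    "raw_output = \"<think>\\nThe user asks for 1 + 3, which equals 4.\\n</think>\\nThe answer is 4.\"\n",
    "\n",
    "# A naive split at the closing tag mirrors what the reasoning parser extracts.\n",
    "reasoning, _, answer = raw_output.partition(\"</think>\")\n",
    "print(reasoning.replace(\"<think>\", \"\", 1).strip())  # chain of thought\n",
    "print(answer.strip())  # final answer\n",
    "```"
   ]
  },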
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage\n",
    "\n",
    "### Launching the Server"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Specify the `--reasoning-parser` option."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from openai import OpenAI\n",
    "from sglang.test.test_utils import is_in_ci\n",
    "\n",
    "if is_in_ci():\n",
    "    from patch import launch_server_cmd\n",
    "else:\n",
    "    from sglang.utils import launch_server_cmd\n",
    "\n",
    "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
    "\n",
    "\n",
    "server_process, port = launch_server_cmd(\n",
    "    \"python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --host 0.0.0.0 --reasoning-parser deepseek-r1\"\n",
    ")\n",
    "\n",
    "wait_for_server(f\"http://localhost:{port}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `--reasoning-parser` defines the parser used to interpret responses. Currently supported parsers include:\n",
    "\n",
    "- `deepseek-r1`: DeepSeek R1 series and QwQ (e.g. `deepseek-ai/DeepSeek-R1`, `Qwen/QwQ-32B`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OpenAI Compatible API\n",
    "\n",
    "Using the OpenAI compatible API, the contract follows the [DeepSeek API design](https://api-docs.deepseek.com/guides/reasoning_model) established with the release of DeepSeek-R1:\n",
    "\n",
    "- `reasoning_content`: The content of the CoT.\n",
    "- `content`: The content of the final answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize OpenAI-like client\n",
    "client = OpenAI(api_key=\"None\", base_url=f\"http://0.0.0.0:{port}/v1\")\n",
    "model_name = client.models.list().data[0].id\n",
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": \"What is 1+3?\",\n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Non-Streaming Request"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_non_stream = client.chat.completions.create(\n",
    "    model=model_name,\n",
    "    messages=messages,\n",
    "    temperature=0.6,\n",
    "    top_p=0.95,\n",
    "    stream=False,  # Non-streaming\n",
    "    extra_body={\"separate_reasoning\": True},\n",
    ")\n",
    "print_highlight(\"==== Reasoning ====\")\n",
    "print_highlight(response_non_stream.choices[0].message.reasoning_content)\n",
    "\n",
    "print_highlight(\"==== Text ====\")\n",
    "print_highlight(response_non_stream.choices[0].message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Streaming Request"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_stream = client.chat.completions.create(\n",
    "    model=model_name,\n",
    "    messages=messages,\n",
    "    temperature=0.6,\n",
    "    top_p=0.95,\n",
    "    stream=True,  # Streaming\n",
    "    extra_body={\"separate_reasoning\": True},\n",
    ")\n",
    "\n",
    "reasoning_content = \"\"\n",
    "content = \"\"\n",
    "for chunk in response_stream:\n",
    "    if chunk.choices[0].delta.content:\n",
    "        content += chunk.choices[0].delta.content\n",
    "    if chunk.choices[0].delta.reasoning_content:\n",
    "        reasoning_content += chunk.choices[0].delta.reasoning_content\n",
    "\n",
    "print_highlight(\"==== Reasoning ====\")\n",
    "print_highlight(reasoning_content)\n",
    "\n",
    "print_highlight(\"==== Text ====\")\n",
    "print_highlight(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, you can buffer the reasoning content so that it is returned in the last reasoning chunk (or the first chunk after the reasoning content) by setting `\"stream_reasoning\": False` in `extra_body`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_stream = client.chat.completions.create(\n",
    "    model=model_name,\n",
    "    messages=messages,\n",
    "    temperature=0.6,\n",
    "    top_p=0.95,\n",
    "    stream=True,  # Streaming\n",
    "    extra_body={\"separate_reasoning\": True, \"stream_reasoning\": False},\n",
    ")\n",
    "\n",
    "reasoning_content = \"\"\n",
    "content = \"\"\n",
    "for chunk in response_stream:\n",
    "    if chunk.choices[0].delta.content:\n",
    "        content += chunk.choices[0].delta.content\n",
    "    if chunk.choices[0].delta.reasoning_content:\n",
    "        # With stream_reasoning=False, the buffered reasoning arrives in a single chunk.\n",
    "        reasoning_content = chunk.choices[0].delta.reasoning_content\n",
    "\n",
    "print_highlight(\"==== Reasoning ====\")\n",
    "print_highlight(reasoning_content)\n",
    "\n",
    "print_highlight(\"==== Text ====\")\n",
    "print_highlight(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reasoning separation is enabled by default when `--reasoning-parser` is specified.\n",
    "**To disable it, set the `separate_reasoning` option to `False` in the request.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response_non_stream = client.chat.completions.create(\n",
    "    model=model_name,\n",
    "    messages=messages,\n",
    "    temperature=0.6,\n",
    "    top_p=0.95,\n",
    "    stream=False,  # Non-streaming\n",
    "    extra_body={\"separate_reasoning\": False},\n",
    ")\n",
    "\n",
    "print_highlight(\"==== Original Output ====\")\n",
    "print_highlight(response_non_stream.choices[0].message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### SGLang Native API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
    "input = tokenizer.apply_chat_template(\n",
    "    messages,\n",
    "    tokenize=False,\n",
    "    add_generation_prompt=True,\n",
    ")\n",
    "\n",
    "gen_url = f\"http://localhost:{port}/generate\"\n",
    "gen_data = {\n",
    "    \"text\": input,\n",
    "    \"sampling_params\": {\n",
    "        \"skip_special_tokens\": False,\n",
    "        \"max_new_tokens\": 1024,\n",
    "        \"temperature\": 0.6,\n",
    "        \"top_p\": 0.95,\n",
    "    },\n",
    "}\n",
    "gen_response = requests.post(gen_url, json=gen_data).json()[\"text\"]\n",
    "\n",
    "print_highlight(\"==== Original Output ====\")\n",
    "print_highlight(gen_response)\n",
    "\n",
    "parse_url = f\"http://localhost:{port}/separate_reasoning\"\n",
    "separate_reasoning_data = {\n",
    "    \"text\": gen_response,\n",
    "    \"reasoning_parser\": \"deepseek-r1\",\n",
    "}\n",
    "separate_reasoning_response_json = requests.post(\n",
    "    parse_url, json=separate_reasoning_data\n",
    ").json()\n",
    "print_highlight(\"==== Reasoning ====\")\n",
    "print_highlight(separate_reasoning_response_json[\"reasoning_text\"])\n",
    "print_highlight(\"==== Text ====\")\n",
    "print_highlight(separate_reasoning_response_json[\"text\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Offline Engine API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sglang as sgl\n",
    "from sglang.srt.reasoning_parser import ReasoningParser\n",
    "from sglang.utils import print_highlight\n",
    "\n",
    "llm = sgl.Engine(model_path=\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"deepseek-ai/DeepSeek-R1-Distill-Qwen-7B\")\n",
    "input = tokenizer.apply_chat_template(\n",
    "    messages,\n",
    "    tokenize=False,\n",
    "    add_generation_prompt=True,\n",
    ")\n",
    "sampling_params = {\n",
    "    \"max_new_tokens\": 1024,\n",
    "    \"skip_special_tokens\": False,\n",
    "    \"temperature\": 0.6,\n",
    "    \"top_p\": 0.95,\n",
    "}\n",
    "result = llm.generate(prompt=input, sampling_params=sampling_params)\n",
    "\n",
    "generated_text = result[\"text\"]  # Assume there is only one prompt\n",
    "\n",
    "print_highlight(\"==== Original Output ====\")\n",
    "print_highlight(generated_text)\n",
    "\n",
    "parser = ReasoningParser(\"deepseek-r1\")\n",
    "reasoning_text, text = parser.parse_non_stream(generated_text)\n",
    "print_highlight(\"==== Reasoning ====\")\n",
    "print_highlight(reasoning_text)\n",
    "print_highlight(\"==== Text ====\")\n",
    "print_highlight(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm.shutdown()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Supporting New Reasoning Model Schemas\n",
    "\n",
    "For future reasoning models, you can implement the reasoning parser as a subclass of `BaseReasoningFormatDetector` in `python/sglang/srt/reasoning_parser.py` and register it so that the new reasoning model schema can be selected by name. The current implementation is shown below, followed by a sketch of a custom detector."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "class DeepSeekR1Detector(BaseReasoningFormatDetector):\n",
    "    \"\"\"\n",
    "    Detector for DeepSeek-R1 model.\n",
    "    Assumes reasoning format:\n",
    "      (<think>)*(.*)</think>\n",
    "    Returns all the text before the </think> tag as `reasoning_text`\n",
    "    and the rest of the text as `normal_text`.\n",
    "\n",
    "    Args:\n",
    "        stream_reasoning (bool): If False, accumulates reasoning content until the end tag.\n",
    "            If True, streams reasoning content as it arrives.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, stream_reasoning: bool = False):\n",
    "        # DeepSeek-R1 is assumed to be reasoning until `</think>` token\n",
    "        super().__init__(\"<think>\", \"</think>\", True, stream_reasoning=stream_reasoning)\n",
    "        # https://github.com/sgl-project/sglang/pull/3202#discussion_r1950153599\n",
    "\n",
    "\n",
    "class ReasoningParser:\n",
    "    \"\"\"\n",
    "    Parser that handles both streaming and non-streaming scenarios for extracting\n",
    "    reasoning content from model outputs.\n",
    "\n",
    "    Args:\n",
    "        model_type (str): Type of model to parse reasoning from\n",
    "        stream_reasoning (bool): If False, accumulates reasoning content until complete.\n",
    "            If True, streams reasoning content as it arrives.\n",
    "    \"\"\"\n",
    "\n",
    "    DetectorMap: Dict[str, BaseReasoningFormatDetector] = {\n",
    "        \"deepseek-r1\": DeepSeekR1Detector\n",
    "    }\n",
    "\n",
    "    def __init__(self, model_type: str = None, stream_reasoning: bool = True):\n",
    "        if not model_type:\n",
    "            raise ValueError(\"Model type must be specified\")\n",
    "\n",
    "        detector_class = self.DetectorMap.get(model_type.lower())\n",
    "        if not detector_class:\n",
    "            raise ValueError(f\"Unsupported model type: {model_type}\")\n",
    "\n",
    "        self.detector = detector_class(stream_reasoning=stream_reasoning)\n",
    "\n",
    "    def parse_non_stream(self, full_text: str) -> StreamingParseResult:\n",
    "        \"\"\"Non-streaming call: one-time parsing\"\"\"\n",
    "        ret = self.detector.detect_and_parse(full_text)\n",
    "        return ret.reasoning_text, ret.normal_text\n",
    "\n",
    "    def parse_stream_chunk(self, chunk_text: str) -> StreamingParseResult:\n",
    "        \"\"\"Streaming call: incremental parsing\"\"\"\n",
    "        ret = self.detector.parse_streaming_increment(chunk_text)\n",
    "        return ret.reasoning_text, ret.normal_text\n",
    "```"
   ]
  },
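  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of this extension point: the model name, tag names, and the `\"my-model\"` key below are hypothetical, and the constructor arguments simply mirror `DeepSeekR1Detector` above; check the exact semantics of the positional flag against `BaseReasoningFormatDetector` in `python/sglang/srt/reasoning_parser.py` before adapting it.\n",
    "\n",
    "```python\n",
    "# Hypothetical example: a model that wraps its chain of thought in\n",
    "# <reasoning> ... </reasoning> tags.\n",
    "class MyModelDetector(BaseReasoningFormatDetector):\n",
    "    def __init__(self, stream_reasoning: bool = False):\n",
    "        # Same argument pattern as DeepSeekR1Detector: start tag, end tag,\n",
    "        # the positional flag, and the stream_reasoning option.\n",
    "        super().__init__(\n",
    "            \"<reasoning>\", \"</reasoning>\", True, stream_reasoning=stream_reasoning\n",
    "        )\n",
    "\n",
    "\n",
    "# Inside ReasoningParser, extend the DetectorMap class attribute so the new\n",
    "# schema can be selected by name, e.g. ReasoningParser(\"my-model\"):\n",
    "DetectorMap: Dict[str, BaseReasoningFormatDetector] = {\n",
    "    \"deepseek-r1\": DeepSeekR1Detector,\n",
    "    \"my-model\": MyModelDetector,\n",
    "}\n",
    "```"
   ]
  }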
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}