{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Offline Engine API\n", "\n", "SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n", "\n", "- Offline Batch Inference\n", "- Custom Server on Top of the Engine\n", "\n", "This document focuses on the offline batch inference, demonstrating four different inference modes:\n", "\n", "- Non-streaming synchronous generation\n", "- Streaming synchronous generation\n", "- Non-streaming asynchronous generation\n", "- Streaming asynchronous generation\n", "\n", "Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example working in a python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nest Asyncio\n", "Note that if you want to use **Offline Engine** in ipython or some other nested loop code, you need to add the following code:\n", "```python\n", "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Usage\n", "\n", "The engine supports [vlm inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states). \n", "\n", "Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Offline Batch Inference\n", "\n", "SGLang offline engine supports batch inference with efficient scheduling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# launch the offline engine\n", "import asyncio\n", "\n", "import sglang as sgl\n", "import sglang.test.doc_patch\n", "from sglang.utils import async_stream_and_merge, stream_and_merge\n", "\n", "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-streaming Synchronous Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " \"Hello, my name is\",\n", " \"The president of the United States is\",\n", " \"The capital of France is\",\n", " \"The future of AI is\",\n", "]\n", "\n", "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", "\n", "outputs = llm.generate(prompts, sampling_params)\n", "for prompt, output in zip(prompts, outputs):\n", " print(\"===============================\")\n", " print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming Synchronous Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n", "]\n", "\n", "sampling_params = {\n", " \"temperature\": 0.2,\n", " \"top_p\": 0.9,\n", "}\n", "\n", "print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n", "\n", "for prompt in prompts:\n", " print(f\"Prompt: {prompt}\")\n", " merged_output = stream_and_merge(llm, prompt, sampling_params)\n", " print(\"Generated text:\", merged_output)\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Non-streaming Asynchronous Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n", "]\n", "\n", "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", "\n", "print(\"\\n=== Testing asynchronous batch generation ===\")\n", "\n", "\n", "async def main():\n", " outputs = await llm.async_generate(prompts, sampling_params)\n", "\n", " for prompt, output in zip(prompts, outputs):\n", " print(f\"\\nPrompt: {prompt}\")\n", " print(f\"Generated text: {output['text']}\")\n", "\n", "\n", "asyncio.run(main())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming Asynchronous Generation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompts = [\n", " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n", " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n", " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n", "]\n", "\n", "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n", "\n", "print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n", "\n", "\n", "async def main():\n", " for prompt in prompts:\n", " print(f\"\\nPrompt: {prompt}\")\n", " print(\"Generated text: \", end=\"\", flush=True)\n", "\n", " # Replace direct calls to async_generate with our custom overlap-aware version\n", " async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n", " print(cleaned_chunk, end=\"\", flush=True)\n", "\n", " print() # New line after each prompt\n", "\n", "\n", "asyncio.run(main())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "llm.shutdown()" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }