{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LoRA Serving" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Arguments for LoRA Serving" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following server arguments are relevant for multi-LoRA serving:\n", "\n", "* `lora_paths`: A mapping from each adaptor's name to its path, in the form of `{name}={path} {name}={path}`.\n", "\n", "* `max_loras_per_batch`: Maximum number of adaptors used by each batch. This argument can affect the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to be 8.\n", "\n", "* `lora_backend`: The backend of running GEMM kernels for Lora modules. It can be one of `triton` or `flashinfer`, and set to `triton` by default. For better performance and stability, we recommend using the Triton LoRA backend. In the future, faster backend built upon Cutlass or Cuda kernels will be added.\n", "\n", "* `tp_size`: LoRA serving along with Tensor Parallelism is supported by SGLang. `tp_size` controls the number of GPUs for tensor parallelism. More details on the tensor sharding strategy can be found in [S-Lora](https://arxiv.org/pdf/2311.03285) paper.\n", "\n", "From client side, the user needs to provide a list of strings as input batch, and a list of adaptor names that each input sequence corresponds to." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage\n", "\n", "### Serving Single Adaptor" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.test.test_utils import is_in_ci\n", "\n", "if is_in_ci():\n", " from patch import launch_server_cmd\n", "else:\n", " from sglang.utils import launch_server_cmd\n", "\n", "from sglang.utils import wait_for_server, terminate_process\n", "\n", "import json\n", "import requests" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "server_process, port = launch_server_cmd(\n", " \"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", " --max-loras-per-batch 1 --lora-backend triton \\\n", " --disable-radix-cache\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = f\"http://127.0.0.1:{port}\"\n", "json_data = {\n", " \"text\": [\n", " \"List 3 countries and their capitals.\",\n", " \"AI is a field of computer science focused on\",\n", " ],\n", " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", " # The first input uses lora0, and the second input uses the base model\n", " \"lora_path\": [\"lora0\", None],\n", "}\n", "response = requests.post(\n", " url + \"/generate\",\n", " json=json_data,\n", ")\n", "print(f\"Output 0: {response.json()[0]['text']}\")\n", "print(f\"Output 1: {response.json()[1]['text']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Serving Multiple Adaptors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "server_process, port = launch_server_cmd(\n", " \"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n", " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n", " --max-loras-per-batch 2 --lora-backend triton \\\n", " --disable-radix-cache\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = f\"http://127.0.0.1:{port}\"\n", "json_data = {\n", " \"text\": [\n", " \"List 3 countries and their capitals.\",\n", " \"AI is a field of computer science focused on\",\n", " ],\n", " \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n", " # The first input uses lora0, and the second input uses lora1\n", " \"lora_path\": [\"lora0\", \"lora1\"],\n", "}\n", "response = requests.post(\n", " url + \"/generate\",\n", " json=json_data,\n", ")\n", "print(f\"Output 0: {response.json()[0]['text']}\")\n", "print(f\"Output 1: {response.json()[1]['text']}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Future Works\n", "\n", "The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently radix attention is incompatible with LoRA and must be manually disabled. 
Other features, including Unified Paging, the Cutlass backend, and dynamic loading/unloading, are still under development." ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }