{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# LoRA Serving"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Arguments for LoRA Serving"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following server arguments are relevant for multi-LoRA serving:\n",
"\n",
"* `lora_paths`: A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.\n",
"\n",
"* `max_loras_per_batch`: Maximum number of adapters that can be used in a single batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so it should be set to a smaller value when memory is scarce. Defaults to 8.\n",
"\n",
"* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. It can be either `triton` or `flashinfer`, and defaults to `triton`. For better performance and stability, we recommend the Triton LoRA backend. In the future, faster backends built upon Cutlass or CUDA kernels will be added.\n",
"\n",
"* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper.\n",
"\n",
"From the client side, the user provides a list of strings as the input batch and a list of adapter names, one per input sequence. A sketch combining these server arguments into a single launch command is shown below."
]
},
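{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how these arguments fit together, the command below combines `lora_paths` (in `{name}={path}` form), `max_loras_per_batch`, `lora_backend`, and `tp_size`. The `--tp-size 2` value is an assumption that two GPUs are available; the adapter name and path are the same ones used in the examples that follow. The command is only printed here, not executed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: combines the LoRA-serving flags described above into one\n",
"# launch command. --tp-size 2 assumes two GPUs; adjust to your hardware.\n",
"example_cmd = \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\\\n",
"    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\\\n",
"    --max-loras-per-batch 1 --lora-backend triton --tp-size 2 \\\\\n",
"    --disable-radix-cache\n",
"\"\"\"\n",
"print(example_cmd)"
]
},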
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usage\n",
"\n",
"### Serving a Single Adapter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang.test.test_utils import is_in_ci\n",
"\n",
"if is_in_ci():\n",
"    from patch import launch_server_cmd\n",
"else:\n",
"    from sglang.utils import launch_server_cmd\n",
"\n",
"from sglang.utils import wait_for_server, terminate_process\n",
"\n",
"import json\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
"    \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
"    --max-loras-per-batch 1 --lora-backend triton \\\n",
"    --disable-radix-cache\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://127.0.0.1:{port}\"\n",
"json_data = {\n",
"    \"text\": [\n",
"        \"List 3 countries and their capitals.\",\n",
"        \"AI is a field of computer science focused on\",\n",
"    ],\n",
"    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
"    # The first input uses lora0, and the second input uses the base model\n",
"    \"lora_path\": [\"lora0\", None],\n",
"}\n",
"response = requests.post(\n",
"    url + \"/generate\",\n",
"    json=json_data,\n",
")\n",
"print(f\"Output 0: {response.json()[0]['text']}\")\n",
"print(f\"Output 1: {response.json()[1]['text']}\")"
]
},
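{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond the generated text, the response payload carries additional fields whose exact set depends on the server version; pretty-printing it with the `json` module imported above is a quick way to inspect what the server returns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect the full response; fields other than 'text' vary by server version,\n",
"# so we print everything rather than assume a schema.\n",
"print(json.dumps(response.json(), indent=2))"
]
},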
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Serving Multiple Adapters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
"    \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
"    --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
"    lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
"    --max-loras-per-batch 2 --lora-backend triton \\\n",
"    --disable-radix-cache\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://127.0.0.1:{port}\"\n",
"json_data = {\n",
"    \"text\": [\n",
"        \"List 3 countries and their capitals.\",\n",
"        \"AI is a field of computer science focused on\",\n",
"    ],\n",
"    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
"    # The first input uses lora0, and the second input uses lora1\n",
"    \"lora_path\": [\"lora0\", \"lora1\"],\n",
"}\n",
"response = requests.post(\n",
"    url + \"/generate\",\n",
"    json=json_data,\n",
")\n",
"print(f\"Output 0: {response.json()[0]['text']}\")\n",
"print(f\"Output 1: {response.json()[1]['text']}\")"
]
},
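{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adapter requests can also be mixed with base-model requests in a single batch: as in the single-adapter example, passing `None` in `lora_path` routes that sequence to the base model. The sketch below sends one sequence each to `lora0`, `lora1`, and the base model; the third prompt is illustrative, and it assumes base-model sequences do not count toward `--max-loras-per-batch`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: mix both adapters and the base model (None) in one batch.\n",
"# The third prompt is illustrative; distinct adapters per batch are\n",
"# capped by --max-loras-per-batch (set to 2 for this server).\n",
"mixed_json_data = {\n",
"    \"text\": [\n",
"        \"List 3 countries and their capitals.\",\n",
"        \"AI is a field of computer science focused on\",\n",
"        \"Write one sentence about the Moon.\",\n",
"    ],\n",
"    \"sampling_params\": {\"max_new_tokens\": 32, \"temperature\": 0},\n",
"    # lora0 and lora1 for the first two inputs, base model for the third\n",
"    \"lora_path\": [\"lora0\", \"lora1\", None],\n",
"}\n",
"response = requests.post(url + \"/generate\", json=mixed_json_data)\n",
"for i, item in enumerate(response.json()):\n",
"    print(f\"Output {i}: {item['text']}\")"
]
},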
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Future Works\n",
"\n",
"The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, radix attention is incompatible with LoRA and must be manually disabled (hence `--disable-radix-cache` in the launch commands above). Other features, including Unified Paging, a Cutlass backend, and dynamic loading/unloading, are still under development."
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}