# LoRA Serving

SGLang enables the use of [LoRA adapters](https://arxiv.org/abs/2106.09685) with a base model. By incorporating techniques from [S-LoRA](https://arxiv.org/pdf/2311.03285) and [Punica](https://arxiv.org/pdf/2310.18547), SGLang can efficiently support multiple LoRA adapters for different sequences within a single batch of inputs.

## Arguments for LoRA Serving

The following server arguments are relevant for multi-LoRA serving:

* `lora_paths`: A mapping from each adapter's name to its path, in the form of `{name}={path} {name}={path}`.

* `max_loras_per_batch`: Maximum number of adapters used by each batch. This argument affects the amount of GPU memory reserved for multi-LoRA serving, so set it to a smaller value when memory is scarce. Defaults to 8.

* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. It can be either `triton` or `flashinfer`, and defaults to `triton`. For better performance and stability, we recommend the Triton LoRA backend. Faster backends built on Cutlass or CUDA kernels will be added in the future.

* `tp_size`: SGLang supports LoRA serving together with tensor parallelism. `tp_size` controls the number of GPUs used for tensor parallelism. More details on the tensor sharding strategy can be found in the [S-LoRA](https://arxiv.org/pdf/2311.03285) paper.
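
As a rough illustration of how these arguments map onto the launch command, the sketch below assembles the LoRA-related flags from a Python dictionary. The helper `build_lora_launch_args` is hypothetical and not part of SGLang; it only formats the `--lora-paths`, `--max-loras-per-batch`, and `--lora-backend` flags described above.

```python
import shlex


def build_lora_launch_args(lora_paths, max_loras_per_batch=8, lora_backend="triton"):
    """Format LoRA-related server flags as a list of CLI tokens.

    Hypothetical helper for illustration only; it is not an SGLang API.
    `lora_paths` maps each adapter name to its path.
    """
    if not lora_paths:
        raise ValueError("at least one adapter is required")
    if lora_backend not in ("triton", "flashinfer"):
        raise ValueError(f"unknown LoRA backend: {lora_backend}")

    # --lora-paths takes space-separated {name}={path} pairs.
    args = ["--lora-paths"]
    args += [f"{name}={path}" for name, path in lora_paths.items()]
    args += ["--max-loras-per-batch", str(max_loras_per_batch)]
    args += ["--lora-backend", lora_backend]
    return args


adapters = {
    "lora0": "algoprog/fact-generation-llama-3.1-8b-instruct-lora",
    "lora1": "Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16",
}
lora_args = build_lora_launch_args(adapters, max_loras_per_batch=2)
print(shlex.join(lora_args))
```

The printed string can be appended to a `python3 -m sglang.launch_server --model-path ...` command line.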

On the client side, the user provides a list of strings as the input batch, together with a list of adapter names specifying which adapter each input sequence uses.
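
A minimal sketch of how such a request body might be assembled, assuming the `/generate` payload fields used later on this page (`text`, `sampling_params`, `lora_path`). The helper `build_generate_payload` is hypothetical; it simply checks that the two lists line up before building the dictionary. A `None` entry selects the base model for that input.

```python
def build_generate_payload(prompts, adapters, max_new_tokens=32, temperature=0):
    """Pair each prompt with an adapter name (or None for the base model).

    Hypothetical helper for illustration only; it is not an SGLang API.
    """
    if len(prompts) != len(adapters):
        raise ValueError(
            f"got {len(prompts)} prompts but {len(adapters)} adapter names"
        )
    return {
        "text": list(prompts),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        # One entry per prompt, in the same order as "text".
        "lora_path": list(adapters),
    }


payload = build_generate_payload(
    ["List 3 countries and their capitals.", "AI is a field of computer science focused on"],
    ["lora0", None],
)
print(payload["lora_path"])
```

The resulting dictionary can be passed as the `json` argument of `requests.post` against the server's `/generate` endpoint.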

## Usage

### Serving a Single Adapter

In [None]:
from sglang.test.test_utils import is_in_ci

if is_in_ci():
    from patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

from sglang.utils import wait_for_server, terminate_process

import json
import requests

In [None]:
server_process, port = launch_server_cmd(
 """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
 --max-loras-per-batch 1 --lora-backend triton \
 --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

In [None]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses the base model
    "lora_path": ["lora0", None],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

In [None]:
terminate_process(server_process)

### Serving Multiple Adapters

In [None]:
server_process, port = launch_server_cmd(
 """
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
 --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \
 lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \
 --max-loras-per-batch 2 --lora-backend triton \
 --disable-radix-cache
"""
)

wait_for_server(f"http://localhost:{port}")

In [None]:
url = f"http://127.0.0.1:{port}"
json_data = {
    "text": [
        "List 3 countries and their capitals.",
        "AI is a field of computer science focused on",
    ],
    "sampling_params": {"max_new_tokens": 32, "temperature": 0},
    # The first input uses lora0, and the second input uses lora1
    "lora_path": ["lora0", "lora1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()[0]['text']}")
print(f"Output 1: {response.json()[1]['text']}")

In [None]:
terminate_process(server_process)

## Future Work

The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently, radix attention is incompatible with LoRA and must be disabled manually (with `--disable-radix-cache`, as in the launch commands above). Other features, including Unified Paging, a Cutlass backend, and dynamic loading/unloading, are still under development.