# SGLang Router

The SGLang Router is a standalone Rust module that enables data parallelism across SGLang instances, providing high-performance request routing and advanced load balancing. It supports multiple load-balancing policies (cache-aware, power-of-two, random, and round-robin) and can also act as a specialized load balancer for prefill-decode disaggregated serving architectures.
## Documentation

- User Guide: https://docs.sglang.ai/advanced_features/router.html
## Quick Start

### Prerequisites

**Rust and Cargo:**

```bash
# Install rustup (Rust installer and version manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Follow the installation prompts, then reload your shell
source $HOME/.cargo/env

# Verify installation
rustc --version
cargo --version
```

**Python with pip installed**
### Installation

#### Option A: Build and Install Wheel (Recommended)

```bash
# Install build dependencies
pip install setuptools-rust wheel build

# Build the wheel package
python -m build

# Install the generated wheel
pip install dist/*.whl

# One-liner for development (rebuild + install)
python -m build && pip install --force-reinstall dist/*.whl
```

#### Option B: Development Mode

```bash
# Currently broken
pip install -e .
```

⚠️ **Warning**: Editable installs may suffer performance degradation. Use wheel builds for performance testing.
### Basic Usage

```bash
# Build Rust components
cargo build
```

#### Using the Rust Binary Directly (Alternative to Python)

```bash
# Build the Rust binary
cargo build --release

# Launch router with worker URLs in regular mode
./target/release/sglang-router \
    --worker-urls http://worker1:8000 http://worker2:8000

# Or use cargo run
cargo run --release -- \
    --worker-urls http://worker1:8000 http://worker2:8000
```

#### Launch Router with Python (Original Method)

```bash
# Launch router with worker URLs
python -m sglang_router.launch_router \
    --worker-urls http://worker1:8000 http://worker2:8000
```
#### Launch Router with Worker URLs in Prefill-Decode Mode

Prefill and decode URLs must be provided in the following format:

- `http://<ip>:<port>` for decode nodes
- `http://<ip>:<port> <bootstrap-port>` for prefill nodes, where the bootstrap port is optional

```bash
# Using the Rust binary directly
./target/release/sglang-router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:30001 9001 \
    --prefill http://127.0.0.2:30002 9002 \
    --prefill http://127.0.0.3:30003 9003 \
    --prefill http://127.0.0.4:30004 9004 \
    --decode http://127.0.0.5:30005 \
    --decode http://127.0.0.6:30006 \
    --decode http://127.0.0.7:30007 \
    --host 0.0.0.0 \
    --port 8080

# Or using the Python launcher
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --prefill http://127.0.0.1:30001 9001 \
    --prefill http://127.0.0.2:30002 9002 \
    --prefill http://127.0.0.3:30003 9003 \
    --prefill http://127.0.0.4:30004 9004 \
    --decode http://127.0.0.5:30005 \
    --decode http://127.0.0.6:30006 \
    --decode http://127.0.0.7:30007 \
    --host 0.0.0.0 \
    --port 8080
```
## Configuration

### Logging

Enable structured logging with optional file output:

```python
from sglang_router import Router

# Console logging (default)
router = Router(worker_urls=["http://worker1:8000", "http://worker2:8000"])

# File logging enabled
router = Router(
    worker_urls=["http://worker1:8000", "http://worker2:8000"],
    log_dir="./logs"  # Daily log files created here
)
```

Set the log level with the `--log-level` flag (see the user guide linked above for details).
### Metrics

A Prometheus metrics endpoint is available at `127.0.0.1:29000` by default.

```bash
# Custom metrics configuration
python -m sglang_router.launch_router \
    --worker-urls http://localhost:8080 http://localhost:8081 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9000
```
### Retries and Circuit Breakers

Retries (regular router) are enabled by default with exponential backoff and jitter. You can tune them via the CLI:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://localhost:8080 http://localhost:8081 \
    --retry-max-retries 3 \
    --retry-initial-backoff-ms 100 \
    --retry-max-backoff-ms 10000 \
    --retry-backoff-multiplier 2.0 \
    --retry-jitter-factor 0.1
```

Circuit breaker defaults protect workers and auto-recover. Tune thresholds and timeouts:

```bash
python -m sglang_router.launch_router \
    --worker-urls http://localhost:8080 http://localhost:8081 \
    --cb-failure-threshold 5 \
    --cb-success-threshold 2 \
    --cb-timeout-duration-secs 30 \
    --cb-window-duration-secs 60
```
Behavior summary:

- Closed → Open after N consecutive failures (`failure-threshold`)
- Open → HalfOpen after the timeout (`timeout-duration-secs`)
- HalfOpen → Closed after M consecutive successes (`success-threshold`)
- Any failure in HalfOpen reopens the circuit immediately
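The transitions above can be sketched as a small state machine. This is an illustrative Python toy, not the router's actual (Rust) implementation; the thresholds mirror the `--cb-*` flags, and time is passed in explicitly to keep the sketch deterministic:

```python
class CircuitBreaker:
    """Toy Closed/Open/HalfOpen state machine for the behavior above."""

    def __init__(self, failure_threshold=5, success_threshold=2, timeout_secs=30):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_secs = timeout_secs
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow(self, now):
        """Open -> HalfOpen once the timeout elapses; reject while Open."""
        if self.state == "open" and now - self.opened_at >= self.timeout_secs:
            self.state = "half_open"
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                # HalfOpen -> Closed after M consecutive successes.
                self.state = "closed"
                self.failures = 0
        else:
            # A success resets the consecutive-failure count.
            self.failures = 0

    def record_failure(self, now):
        # Any failure in HalfOpen, or N consecutive failures in Closed,
        # opens the circuit.
        if self.state == "half_open" or self.failures + 1 >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
            self.failures = 0
        else:
            self.failures += 1
```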
Retry predicate (regular router): retry on 408/429/500/502/503/504; any other status is returned immediately. Backoff and jitter are applied between attempts.
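The interaction of the `--retry-*` flags can be sketched as follows. This is an assumed formula for illustration (exponential growth capped at the max, plus proportional random jitter); the router's exact jitter computation may differ:

```python
import random

def backoff_ms(attempt, initial=100, multiplier=2.0,
               max_backoff=10000, jitter=0.1, rng=random):
    """Delay in ms before retry `attempt` (0-based).

    Defaults mirror the --retry-* flag values shown above.
    """
    # Exponential growth, capped at max_backoff.
    base = min(initial * multiplier ** attempt, max_backoff)
    # Add up to `jitter * base` of random jitter to avoid thundering herds.
    return base + rng.uniform(0, jitter * base)

# Base delays (jitter disabled) for attempts 0..3: 100, 200, 400, 800 ms.
```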
### Request ID Tracking

Track requests across distributed systems with configurable headers:

```bash
# Use custom request ID headers
python -m sglang_router.launch_router \
    --worker-urls http://localhost:8080 \
    --request-id-headers x-trace-id x-request-id
```

Default headers: `x-request-id`, `x-correlation-id`, `x-trace-id`, `request-id`
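The lookup order matters: the first configured header that carries a value wins, and a fresh ID is generated when none match. A hypothetical helper sketching that behavior (not the router's actual code):

```python
import uuid

DEFAULT_HEADERS = ("x-request-id", "x-correlation-id", "x-trace-id", "request-id")

def resolve_request_id(headers, candidates=DEFAULT_HEADERS):
    """Return the first matching header value, else generate a fresh ID.

    `headers` is a dict with lower-cased header names as keys.
    """
    for name in candidates:
        value = headers.get(name)
        if value:
            return value
    # No tracking header supplied: mint a new ID for this request.
    return uuid.uuid4().hex
```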
## Advanced Features

### Kubernetes Service Discovery

Automatic worker discovery and management in Kubernetes environments.

#### Basic Service Discovery

```bash
python -m sglang_router.launch_router \
    --service-discovery \
    --selector app=sglang-worker role=inference \
    --service-discovery-namespace default
```
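The `--selector` flag takes space-separated `key=value` tokens that become Kubernetes label matchers. A hypothetical parser illustrating the format (the router's actual parsing lives in its own sources):

```python
def parse_selector(pairs):
    """Parse `key1=value1 key2=value2` tokens into a label dict."""
    labels = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key] = value
    return labels

# parse_selector(["app=sglang-worker", "role=inference"])
# -> {"app": "sglang-worker", "role": "inference"}
```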
#### PD (Prefill-Decode) Mode

For disaggregated prefill/decode routing:

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system

# With separate routing policies:
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --prefill-policy cache_aware \
    --decode-policy power_of_two \
    --service-discovery \
    --prefill-selector app=sglang component=prefill \
    --decode-selector app=sglang component=decode \
    --service-discovery-namespace sglang-system

# In the LWS case, e.g. TP16 (1 leader pod, 1 worker pod),
# select only the leader pods:
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill role=leader \
    --decode-selector app=sglang component=decode role=leader \
    --service-discovery-namespace sglang-system
```
#### Kubernetes Pod Configuration

**Prefill Server Pod:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-prefill-1
  labels:
    app: sglang
    component: prefill
  annotations:
    sglang.ai/bootstrap-port: "9001"  # Optional: bootstrap port
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
    - containerPort: 8000  # Main API port
    - containerPort: 9001  # Optional: bootstrap port
```

**Decode Server Pod:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sglang-decode-1
  labels:
    app: sglang
    component: decode
spec:
  containers:
  - name: sglang
    image: lmsys/sglang:latest
    ports:
    - containerPort: 8000
```
#### RBAC Configuration

Namespace-scoped (recommended):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sglang-router
  namespace: sglang-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: sglang-system
  name: sglang-router
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sglang-router
  namespace: sglang-system
subjects:
- kind: ServiceAccount
  name: sglang-router
  namespace: sglang-system
roleRef:
  kind: Role
  name: sglang-router
  apiGroup: rbac.authorization.k8s.io
```
#### Complete PD Example

```bash
python -m sglang_router.launch_router \
    --pd-disaggregation \
    --policy cache_aware \
    --service-discovery \
    --prefill-selector app=sglang component=prefill environment=production \
    --decode-selector app=sglang component=decode environment=production \
    --service-discovery-namespace production \
    --host 0.0.0.0 \
    --port 8080 \
    --prometheus-host 0.0.0.0 \
    --prometheus-port 9090
```
## Command Line Arguments Reference

### Service Discovery

- `--service-discovery`: Enable Kubernetes service discovery
- `--service-discovery-port`: Port for worker URLs (default: 8000)
- `--service-discovery-namespace`: Kubernetes namespace to watch
- `--selector`: Label selectors for regular mode (format: `key1=value1 key2=value2`)

### PD Mode

- `--pd-disaggregation`: Enable Prefill-Decode disaggregated mode
- `--prefill`: Initial prefill server (format: `URL BOOTSTRAP_PORT`)
- `--decode`: Initial decode server URL
- `--prefill-selector`: Label selector for prefill pods
- `--decode-selector`: Label selector for decode pods
- `--policy`: Routing policy (`cache_aware`, `random`, `power_of_two`, `round_robin`)
- `--prefill-policy`: Separate routing policy for prefill nodes (optional, overrides `--policy` for prefill)
- `--decode-policy`: Separate routing policy for decode nodes (optional, overrides `--policy` for decode)
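To make the flag shapes concrete, here is a hypothetical `argparse` sketch of how the repeatable PD-mode flags can be interpreted; the router's real CLI parsing may differ in details:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--pd-disaggregation", action="store_true")
# Each --prefill takes a URL plus an optional bootstrap port.
parser.add_argument("--prefill", nargs="+", action="append", default=[])
parser.add_argument("--decode", action="append", default=[])
parser.add_argument("--policy", default="cache_aware",
                    choices=["cache_aware", "random", "power_of_two", "round_robin"])

args = parser.parse_args([
    "--pd-disaggregation",
    "--prefill", "http://127.0.0.1:30001", "9001",
    "--prefill", "http://127.0.0.2:30002",
    "--decode", "http://127.0.0.3:30003",
])
# args.prefill == [["http://127.0.0.1:30001", "9001"], ["http://127.0.0.2:30002"]]
```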
## Development

### Build Process

```bash
# Build Rust project
cargo build

# Build Python binding (see the Installation section above)
```

**Note**: When modifying Rust code, you must rebuild the wheel for changes to take effect.
## Troubleshooting

**VSCode Rust Analyzer Issues:**

Set `rust-analyzer.linkedProjects` to the absolute path of `Cargo.toml`:

```json
{
  "rust-analyzer.linkedProjects": ["/workspaces/sglang/sgl-router/Cargo.toml"]
}
```
## CI/CD Pipeline

The continuous integration pipeline includes comprehensive testing, benchmarking, and publishing:

### Build & Test

- **Build Wheels**: Uses `cibuildwheel` for manylinux x86_64 packages
- **Build Source Distribution**: Creates a source distribution for pip fallback
- **Rust HTTP Server Benchmarking**: Performance testing of router overhead
- **Basic Inference Testing**: End-to-end validation through the router
- **PD Disaggregation Testing**: Benchmark and sanity checks for prefill-decode load balancing

### Publishing

- **PyPI Publishing**: Wheels and source distributions are published only when the version changes in `pyproject.toml`
- **Container Images**: Docker images are published using `/docker/Dockerfile.router`
## Features

- **High Performance**: Rust-based routing with connection pooling and optimized request handling
- **Advanced Load Balancing**: Multiple algorithms, including:
  - **Cache-Aware**: Intelligent routing based on cache locality for optimal performance
  - **Power of Two**: Chooses the less loaded of two randomly selected workers
  - **Random**: Distributes requests randomly across available workers
  - **Round Robin**: Sequential distribution across workers in rotation
- **Prefill-Decode Disaggregation**: Specialized load balancing for separated prefill and decode servers
- **Service Discovery**: Automatic Kubernetes worker discovery and health management
- **Monitoring**: Comprehensive Prometheus metrics and structured logging
- **Scalability**: Handles thousands of concurrent connections with efficient resource utilization
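As an illustration of the power-of-two policy listed above, here is a minimal sketch assuming a simple in-memory load counter per worker; the router's real implementation is in Rust and tracks load differently:

```python
import random

def power_of_two_choice(loads, rng=random):
    """Pick the less loaded of two randomly sampled workers.

    `loads` maps worker URL -> current outstanding request count.
    """
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

loads = {"http://w1:8000": 5, "http://w2:8000": 2, "http://w3:8000": 7}
choice = power_of_two_choice(loads)  # one of the three workers
```

Sampling two candidates and keeping the less loaded one avoids the cost of scanning every worker while still strongly biasing traffic away from hot spots.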