SGLang Router
The SGLang Router is a high-performance request distribution system that routes inference requests across multiple SGLang runtime instances. It features cache-aware load balancing, fault tolerance, and support for advanced deployment patterns including data parallelism and prefill-decode disaggregation.
Key Features
- Cache-Aware Load Balancing: Optimizes cache utilization while maintaining balanced load distribution
- Multiple Routing Policies: Choose from random, round-robin, cache-aware, or power-of-two policies
- Fault Tolerance: Automatic retry and circuit breaker mechanisms for resilient operation
- Dynamic Scaling: Add or remove workers at runtime without service interruption
- Kubernetes Integration: Native service discovery and pod management
- Prefill-Decode Disaggregation: Support for disaggregated serving load balancing
- Prometheus Metrics: Built-in observability and monitoring
Installation
pip install sglang-router
Quick Start
To see all available options:
python -m sglang_router.launch_server --help # Co-launch router and workers
python -m sglang_router.launch_router --help # Launch router only
Deployment Modes
The router supports three primary deployment patterns:
- Co-launch Mode: Router and workers launch together (simplest for single-node deployments)
- Separate Launch Mode: Router and workers launch independently (best for multi-node setups)
- Prefill-Decode Disaggregation: Specialized mode for disaggregated serving
Mode 1: Co-launch Router and Workers
This mode launches both the router and multiple worker instances in a single command. It's the simplest deployment option and replaces the --dp-size argument of SGLang Runtime.
# Launch router with 4 workers
python -m sglang_router.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--dp-size 4 \
--host 0.0.0.0 \
--port 30000
Sending Requests
Once the server is ready, send requests to the router endpoint:
import requests
# Using the /generate endpoint
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {
        "temperature": 0.7,
        "max_new_tokens": 100
    }
}
response = requests.post(url, json=data)
print(response.json())
# OpenAI-compatible endpoint
url = "http://localhost:30000/v1/chat/completions"
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
}
response = requests.post(url, json=data)
print(response.json())
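Because the router exposes an OpenAI-compatible API, the standard openai Python client can also be pointed at it. A minimal sketch, assuming the openai package is installed and the router does not enforce an API key (the key value below is a placeholder):
from openai import OpenAI

# Point the client at the router instead of api.openai.com
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)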
Mode 2: Separate Launch Mode
This mode is ideal for multi-node deployments where workers run on different machines.
Step 1: Launch Workers
On each worker node:
# Worker node 1
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000
# Worker node 2
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8001
Step 2: Launch Router
On the router node:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--host 0.0.0.0 \
--port 30000 \
--policy cache_aware # or random, round_robin, power_of_two
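Before launching the router, it can help to confirm that every worker is reachable. A small polling sketch, assuming the workers expose SGLang's /health endpoint (adjust worker_urls and the retry budget to your deployment):
import time
import requests

worker_urls = ["http://worker1:8000", "http://worker2:8001"]

# Poll each worker's health endpoint until it responds or we give up
for url in worker_urls:
    for attempt in range(30):
        try:
            if requests.get(f"{url}/health", timeout=2).status_code == 200:
                print(f"{url} is ready")
                break
        except requests.RequestException:
            pass
        time.sleep(5)
    else:
        raise RuntimeError(f"{url} did not become healthy in time")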
Mode 3: Prefill-Decode Disaggregation
This advanced mode separates prefill and decode operations for optimized performance:
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 9000 \
--prefill http://prefill2:8001 9001 \
--decode http://decode1:8002 \
--decode http://decode2:8003 \
--prefill-policy cache_aware \
--decode-policy round_robin
Understanding --prefill Arguments
The --prefill flag accepts URLs with optional bootstrap ports:
- --prefill http://server:8000 - No bootstrap port
- --prefill http://server:8000 9000 - Bootstrap port 9000
- --prefill http://server:8000 none - Explicitly no bootstrap port
Policy Inheritance in PD Mode
The router intelligently handles policy configuration for prefill and decode nodes:
- Only --policy specified: Both prefill and decode nodes use this policy
- --policy and --prefill-policy specified: Prefill nodes use --prefill-policy, decode nodes use --policy
- --policy and --decode-policy specified: Prefill nodes use --policy, decode nodes use --decode-policy
- All three specified: Prefill nodes use --prefill-policy, decode nodes use --decode-policy (the main --policy is ignored)
Example with mixed policies:
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://prefill1:8000 \
--prefill http://prefill2:8000 \
--decode http://decode1:8001 \
--decode http://decode2:8001 \
--policy round_robin \
--prefill-policy cache_aware # Prefill uses cache_aware and decode uses round_robin from --policy
PD Mode with Service Discovery
For Kubernetes deployments with separate prefill and decode server pools:
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server tier=gpu \
--decode-selector app=decode-server tier=cpu \
--service-discovery-namespace production \
--prefill-policy cache_aware \
--decode-policy round_robin
Dynamic Scaling
The router supports runtime scaling through REST APIs:
Adding Workers
# Launch a new worker
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30001
# Add it to the router
curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001"
Removing Workers
curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001"
Note: When using cache-aware routing, removed workers are cleanly evicted from the routing tree and request queues.
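The same scaling endpoints can be called from Python, for example from an autoscaling hook. A sketch using the /add_worker and /remove_worker routes shown above:
import requests

ROUTER = "http://localhost:30000"
worker_url = "http://127.0.0.1:30001"

# Register a freshly launched worker with the router
resp = requests.post(f"{ROUTER}/add_worker", params={"url": worker_url})
print(resp.status_code, resp.text)

# Later, remove it from the pool before shutting it down
resp = requests.post(f"{ROUTER}/remove_worker", params={"url": worker_url})
print(resp.status_code, resp.text)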
Fault Tolerance
The router includes comprehensive fault tolerance mechanisms:
Retry Configuration
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--retry-max-retries 3 \
--retry-initial-backoff-ms 100 \
--retry-max-backoff-ms 10000 \
--retry-backoff-multiplier 2.0 \
--retry-jitter-factor 0.1
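These flags describe an exponential backoff with jitter. The following sketch shows how the configured values translate into per-attempt delays; it is an illustration of the parameters, not the router's internal code:
import random

initial_ms, max_ms = 100, 10_000   # --retry-initial-backoff-ms / --retry-max-backoff-ms
multiplier, jitter = 2.0, 0.1      # --retry-backoff-multiplier / --retry-jitter-factor
max_retries = 3                    # --retry-max-retries

backoff = initial_ms
for attempt in range(1, max_retries + 1):
    # Apply up to +/-10% random jitter so retries from many clients spread out
    delay = backoff * (1 + random.uniform(-jitter, jitter))
    print(f"attempt {attempt}: wait ~{delay:.0f} ms")
    backoff = min(backoff * multiplier, max_ms)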
Circuit Breaker
Protects against cascading failures:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--cb-failure-threshold 5 \
--cb-success-threshold 2 \
--cb-timeout-duration-secs 30 \
--cb-window-duration-secs 60
Behavior:
- A worker is marked unhealthy after --cb-failure-threshold consecutive failures
- It returns to service after --cb-success-threshold successful health checks
- The circuit breaker can be disabled with --disable-circuit-breaker
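Conceptually, the circuit breaker is a small state machine over recent outcomes. A simplified sketch of that idea using the thresholds above (illustrative only, not the router's implementation):
class CircuitBreaker:
    """Open after N consecutive failures, close again after M successes."""

    def __init__(self, failure_threshold=5, success_threshold=2):
        self.failure_threshold = failure_threshold   # --cb-failure-threshold
        self.success_threshold = success_threshold   # --cb-success-threshold
        self.failures = 0
        self.successes = 0
        self.open = False  # open == worker is excluded from routing

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.successes += 1
            if self.open and self.successes >= self.success_threshold:
                self.open = False  # worker returns to service
        else:
            self.successes = 0
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True   # worker marked unhealthy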
Routing Policies
The router supports multiple routing strategies:
1. Random Routing
Distributes requests randomly across workers.
--policy random
2. Round-Robin Routing
Cycles through workers in order.
--policy round_robin
3. Power of Two Choices
Samples two workers and routes to the less loaded one.
--policy power_of_two
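In pseudocode terms, the policy samples two candidate workers and keeps the less loaded one. A sketch, assuming an in-memory map of worker loads (names are illustrative):
import random

def power_of_two(loads):
    # Sample two distinct workers and route to the less loaded one
    a, b = random.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

print(power_of_two({"http://worker1:8000": 7, "http://worker2:8001": 3}))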
4. Cache-Aware Load Balancing (Default)
The most sophisticated policy that combines cache optimization with load balancing:
--policy cache_aware \
--cache-threshold 0.5 \
--balance-abs-threshold 32 \
--balance-rel-threshold 1.0001
How It Works
- Load Assessment: Checks whether the system is balanced
  - Imbalanced if: (max_load - min_load) > balance_abs_threshold AND max_load > balance_rel_threshold * min_load
- Routing Decision:
  - Balanced system: Uses cache-aware routing
    - Routes to the worker with the highest prefix match if the match ratio > cache_threshold
    - Otherwise routes to the worker with the most available cache capacity
  - Imbalanced system: Uses shortest-queue routing to the least busy worker
- Cache Management:
  - Maintains an approximate radix tree per worker
  - Periodically evicts LRU entries based on --eviction-interval-secs and --max-tree-size
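The decision flow above can be summarized in a few lines of Python. A simplified sketch, assuming per-worker load counts, prefix-match ratios, and free cache capacity are already known (function and argument names are illustrative, not the router's API):
def pick_worker(loads, prefix_match, cache_free,
                cache_threshold=0.5, abs_threshold=32, rel_threshold=1.0001):
    # loads: {worker: in-flight requests}, prefix_match: {worker: match ratio 0..1}
    # cache_free: {worker: available cache capacity}
    max_load, min_load = max(loads.values()), min(loads.values())
    imbalanced = (max_load - min_load) > abs_threshold and \
                 max_load > rel_threshold * min_load
    if imbalanced:
        # Fall back to shortest-queue routing
        return min(loads, key=loads.get)
    best = max(prefix_match, key=prefix_match.get)
    if prefix_match[best] > cache_threshold:
        return best                                   # strong prefix hit: reuse cache
    return max(cache_free, key=cache_free.get)        # otherwise most free cache

print(pick_worker(
    loads={"w1": 10, "w2": 12},
    prefix_match={"w1": 0.8, "w2": 0.1},
    cache_free={"w1": 1000, "w2": 5000},
))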
Data Parallelism Aware Routing
Enables fine-grained control over data parallel replicas:
--dp-aware \
--api-key your_api_key # Required for worker authentication
This mode coordinates with SGLang's DP controller for optimized request distribution across data parallel ranks.
Configuration Reference
Core Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| --host | str | 127.0.0.1 | Router server host address |
| --port | int | 30000 | Router server port |
| --worker-urls | list | [] | Worker URLs for separate launch mode |
| --policy | str | cache_aware | Routing policy (random, round_robin, cache_aware, power_of_two) |
| --max-concurrent-requests | int | 64 | Maximum concurrent requests (rate limiting) |
| --request-timeout-secs | int | 600 | Request timeout in seconds |
| --max-payload-size | int | 256MB | Maximum request payload size |
Cache-Aware Routing Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --cache-threshold | float | 0.5 | Minimum prefix match ratio for cache routing (0.0-1.0) |
| --balance-abs-threshold | int | 32 | Absolute load difference threshold |
| --balance-rel-threshold | float | 1.0001 | Relative load ratio threshold |
| --eviction-interval-secs | int | 60 | Seconds between cache eviction cycles |
| --max-tree-size | int | 16777216 | Maximum nodes in routing tree |
Fault Tolerance Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --retry-max-retries | int | 3 | Maximum retry attempts per request |
| --retry-initial-backoff-ms | int | 100 | Initial retry backoff in milliseconds |
| --retry-max-backoff-ms | int | 10000 | Maximum retry backoff in milliseconds |
| --retry-backoff-multiplier | float | 2.0 | Backoff multiplier between retries |
| --retry-jitter-factor | float | 0.1 | Random jitter factor for retries |
| --disable-retries | flag | False | Disable retry mechanism |
| --cb-failure-threshold | int | 5 | Failures before circuit opens |
| --cb-success-threshold | int | 2 | Successes to close circuit |
| --cb-timeout-duration-secs | int | 30 | Circuit breaker timeout duration |
| --cb-window-duration-secs | int | 60 | Circuit breaker window duration |
| --disable-circuit-breaker | flag | False | Disable circuit breaker |
Prefill-Decode Disaggregation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --pd-disaggregation | flag | False | Enable PD disaggregated mode |
| --prefill | list | [] | Prefill server URLs with optional bootstrap ports |
| --decode | list | [] | Decode server URLs |
| --prefill-policy | str | None | Routing policy for prefill nodes (overrides --policy) |
| --decode-policy | str | None | Routing policy for decode nodes (overrides --policy) |
| --worker-startup-timeout-secs | int | 300 | Timeout for worker startup |
| --worker-startup-check-interval | int | 10 | Interval between startup checks |
Kubernetes Integration
| Parameter | Type | Default | Description |
|---|---|---|---|
| --service-discovery | flag | False | Enable Kubernetes service discovery |
| --selector | list | [] | Label selector for workers (key1=value1 key2=value2) |
| --prefill-selector | list | [] | Label selector for prefill servers in PD mode |
| --decode-selector | list | [] | Label selector for decode servers in PD mode |
| --service-discovery-port | int | 80 | Port for discovered pods |
| --service-discovery-namespace | str | None | Kubernetes namespace to watch |
| --bootstrap-port-annotation | str | sglang.ai/bootstrap-port | Annotation for bootstrap ports |
Observability
| Parameter | Type | Default | Description |
|---|---|---|---|
| --prometheus-port | int | 29000 | Prometheus metrics port |
| --prometheus-host | str | 127.0.0.1 | Prometheus metrics host |
| --log-dir | str | None | Directory for log files |
| --log-level | str | info | Logging level (debug, info, warning, error, critical) |
| --request-id-headers | list | None | Custom headers for request tracing |
CORS Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| --cors-allowed-origins | list | [] | Allowed CORS origins |
Advanced Features
Kubernetes Service Discovery
Automatically discover and manage workers in Kubernetes:
Standard Mode
python -m sglang_router.launch_router \
--service-discovery \
--selector app=sglang-worker env=prod \
--service-discovery-namespace production \
--service-discovery-port 8000
Prefill-Decode Disaggregation Mode
python -m sglang_router.launch_router \
--pd-disaggregation \
--service-discovery \
--prefill-selector app=prefill-server env=prod \
--decode-selector app=decode-server env=prod \
--service-discovery-namespace production
Note: The --bootstrap-port-annotation (default: sglang.ai/bootstrap-port) is used to discover bootstrap ports for prefill servers in PD mode. Prefill pods should have this annotation set to their bootstrap port value.
Prometheus Metrics
Expose metrics for monitoring:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--prometheus-port 29000 \
--prometheus-host 0.0.0.0
Metrics available at http://localhost:29000/metrics
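The endpoint returns plain Prometheus exposition text, so it can be inspected directly from Python, for example:
import requests

# Fetch the raw Prometheus metrics text from the router
metrics = requests.get("http://localhost:29000/metrics").text
for line in metrics.splitlines():
    if not line.startswith("#"):   # skip HELP/TYPE comment lines
        print(line)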
Request Tracing
Enable request ID tracking:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--request-id-headers x-request-id x-trace-id
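With request ID headers configured, a client can attach its own ID so requests can be correlated end to end. A sketch using the x-request-id header configured above and the /generate endpoint shown earlier:
import uuid
import requests

request_id = str(uuid.uuid4())
response = requests.post(
    "http://localhost:30000/generate",
    json={"text": "Hello", "sampling_params": {"max_new_tokens": 16}},
    headers={"x-request-id": request_id},  # propagated for tracing
)
print(request_id, response.status_code)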
Troubleshooting
Common Issues
- Workers not connecting: Ensure workers are fully initialized before starting the router. Use --worker-startup-timeout-secs to increase the wait time.
- High latency: Check whether cache-aware routing is causing imbalance. Try adjusting --balance-abs-threshold and --balance-rel-threshold.
- Memory growth: Reduce --max-tree-size or decrease --eviction-interval-secs for more aggressive cache cleanup.
- Circuit breaker triggering frequently: Increase --cb-failure-threshold or extend --cb-window-duration-secs.
Debug Mode
Enable detailed logging:
python -m sglang_router.launch_router \
--worker-urls http://worker1:8000 http://worker2:8001 \
--log-level debug \
--log-dir ./router_logs