Environment Variables
SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.
Note: SGLang uses two prefixes for environment variables: SGL_ and SGLANG_. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.
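Because settings are split across two prefixes, it can help to print every matching variable before starting a server. The following is a minimal sketch that uses only the Python standard library; the helper name `collect_sglang_env` is hypothetical and is not part of SGLang.

```python
import os

# The two prefixes named in the note above.
PREFIXES = ("SGLANG_", "SGL_")


def collect_sglang_env(environ=None):
    """Return variables whose names start with SGLANG_ or SGL_, sorted by name."""
    source = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(source.items())
        if name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    # Print the SGLang-related settings visible to this process.
    for name, value in collect_sglang_env().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to check in isolation.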
General Configuration
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_MODELSCOPE | Enable using models from ModelScope | false |
| SGLANG_HOST_IP | Host IP address for the server | 0.0.0.0 |
| SGLANG_PORT | Port for the server | auto-detected |
| SGLANG_LOGGING_CONFIG_PATH | Custom logging configuration path | Not set |
| SGLANG_DISABLE_REQUEST_LOGGING | Disable request logging | false |
| SGLANG_HEALTH_CHECK_TIMEOUT | Timeout for health check in seconds | 20 |
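Environment variables are always strings, so numeric and boolean settings have to be converted before they are handed to a child process. The sketch below builds such an environment for the general settings; `build_server_env` is a hypothetical helper, and spelling booleans as "true"/"false" is an assumption based on the defaults listed in the table.

```python
import os


def build_server_env(overrides, base=None):
    """Merge overrides into a copy of the base environment as strings."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if isinstance(value, bool):
            # Assumption: booleans are spelled "true"/"false", as in the table.
            env[name] = "true" if value else "false"
        else:
            env[name] = str(value)
    return env


env = build_server_env(
    {
        "SGLANG_HOST_IP": "127.0.0.1",
        "SGLANG_HEALTH_CHECK_TIMEOUT": 60,
        "SGLANG_DISABLE_REQUEST_LOGGING": True,
    }
)
# Pass `env` to whatever starts the server, for example
# subprocess.Popen(cmd, env=env), where `cmd` is your usual launch command.
```

Copying the base environment rather than mutating `os.environ` keeps the overrides scoped to the child process.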
Performance Tuning
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_ENABLE_TORCH_INFERENCE_MODE | Control whether to use torch.inference_mode | false |
| SGLANG_ENABLE_TORCH_COMPILE | Enable torch.compile | true |
| SGLANG_SET_CPU_AFFINITY | Enable CPU affinity setting (often set to 1 in Docker builds) | 0 |
| SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN | Allows the scheduler to overwrite longer context length requests (often set to 1 in Docker builds) | 0 |
| SGLANG_IS_FLASHINFER_AVAILABLE | Control FlashInfer availability check | true |
| SGLANG_SKIP_P2P_CHECK | Skip P2P (peer-to-peer) access check | false |
| SGL_CHUNKED_PREFIX_CACHE_THRESHOLD | Sets the threshold for enabling chunked prefix caching | 8192 |
| SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION | Enable RoPE fusion in fused MLA (Multi-head Latent Attention) | 1 |
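The defaults above spell boolean flags in two ways: true/false for some variables and 1/0 for others. A minimal sketch of reading such a flag follows, assuming both spellings are accepted case-insensitively; `read_flag` is a hypothetical helper, and SGLang's own parsing may differ.

```python
import os

# Assumption: these are the spellings treated as "enabled".
TRUTHY = {"1", "true"}


def read_flag(name, default, environ=None):
    """Return True when the variable (or its default) is a truthy spelling."""
    source = os.environ if environ is None else environ
    return source.get(name, default).strip().lower() in TRUTHY


# Defaults are passed explicitly, mirroring the table above.
skip_p2p = read_flag("SGLANG_SKIP_P2P_CHECK", "false", environ={})
rope_fusion = read_flag("SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION", "1", environ={})
```

Any other value, including an empty string, is treated as disabled in this sketch.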
DeepGEMM Configuration (Advanced Optimization)
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGL_ENABLE_JIT_DEEPGEMM | Enable Just-In-Time compilation of DeepGEMM kernels | "true" |
| SGL_JIT_DEEPGEMM_PRECOMPILE | Enable precompilation of DeepGEMM kernels | "true" |
| SGL_JIT_DEEPGEMM_COMPILE_WORKERS | Number of workers for parallel DeepGEMM kernel compilation | 4 |
| SGL_IN_DEEPGEMM_PRECOMPILE_STAGE | Indicator flag used during the DeepGEMM precompile script | "false" |
| SGL_DG_CACHE_DIR | Directory for caching compiled DeepGEMM kernels | ~/.cache/deep_gemm |
| SGL_DG_USE_NVRTC | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | "0" |
| SGL_USE_DEEPGEMM_BMM | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | "false" |
Memory Management
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DEBUG_MEMORY_POOL | Enable memory pool debugging | false |
| SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION | Clip max new tokens estimation for memory planning | Not set |
| SGLANG_DETOKENIZER_MAX_STATES | Maximum states for detokenizer | Default value based on system |
| SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK | Disable checks for memory imbalance across Tensor Parallel ranks | Not set (defaults to enabled check) |
Model-Specific Options
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_USE_AITER | Use the AITER optimized implementation | false |
| SGLANG_INT4_WEIGHT | Enable INT4 weight quantization | false |
| SGLANG_MOE_PADDING | Enable MoE padding (sets padding size to 128 if value is 1, often set to 1 in Docker builds) | 0 |
| SGLANG_FORCE_FP8_MARLIN | Force using FP8 MARLIN kernels even if other FP8 kernels are available | false |
| SGLANG_ENABLE_FLASHINFER_GEMM | Use FlashInfer kernels when running blockwise FP8 GEMM on Blackwell GPUs | false |
| SGLANG_SUPPORT_CUTLASS_BLOCK_FP8 | Use Cutlass kernels when running blockwise FP8 GEMM on Hopper or Blackwell GPUs | false |
| SGLANG_CUTLASS_MOE | Use Cutlass FP8 MoE kernel on Blackwell GPUs | false |
Distributed Computing
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_BLOCK_NONZERO_RANK_CHILDREN | Control blocking of non-zero rank children processes | 1 |
| SGL_IS_FIRST_RANK_ON_NODE | Indicates if the current process is the first rank on its node | "true" |
| SGLANG_PP_LAYER_PARTITION | Pipeline parallel layer partition specification | Not set |
Testing & Debugging (Internal/CI)
These variables are primarily used for internal testing, continuous integration, or debugging.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_IS_IN_CI | Indicates if running in CI environment | false |
| SGLANG_AMD_CI | Indicates running in AMD CI environment | 0 |
| SGLANG_TEST_RETRACT | Enable retract decode testing | false |
| SGLANG_RECORD_STEP_TIME | Record step time for profiling | false |
| SGLANG_TEST_REQUEST_TIME_STATS | Test request time statistics | false |
| SGLANG_CI_SMALL_KV_SIZE | Use small KV cache size in CI | Not set |
Profiling & Benchmarking
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_TORCH_PROFILER_DIR | Directory for PyTorch profiler output | /tmp |
| SGLANG_PROFILE_WITH_STACK | Set with_stack option (bool) for PyTorch profiler (capture stack trace) | true |
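When profiler output should go somewhere other than /tmp, one approach is to create a dedicated directory and point SGLANG_TORCH_PROFILER_DIR at it before the server starts. This is a minimal sketch: `prepare_profiler_dir` is a hypothetical helper, and creating the directory ahead of time is a precaution assumed here, not a documented requirement.

```python
import os
from pathlib import Path


def prepare_profiler_dir(path, environ=None):
    """Create the output directory and record it in the given environment."""
    target = os.environ if environ is None else environ
    out = Path(path).expanduser()
    # Assumption: creating the directory up front is harmless if it already exists.
    out.mkdir(parents=True, exist_ok=True)
    target["SGLANG_TORCH_PROFILER_DIR"] = str(out)
    return out
```

Called without `environ`, the helper updates the current process environment, which child processes started afterwards will inherit.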
Storage & Caching
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SGLANG_DISABLE_OUTLINES_DISK_CACHE | Disable Outlines disk cache | true |