Environment Variables

SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.

Note: SGLang uses two prefixes for environment variables, `SGL_` and `SGLANG_`, likely for historical reasons. The prefixes are not interchangeable: each setting is recognized only under the name listed in the tables below. Future versions might consolidate them.
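
Because settings are split across the two prefixes, one way to audit a deployment is to list every variable in the process environment whose name starts with either prefix. The following is a minimal Python sketch; the helper name `sglang_env_vars` is illustrative and not part of SGLang.

```python
import os

# Hypothetical helper (not part of SGLang): match either documented prefix.
PREFIXES = ("SGLANG_", "SGL_")


def sglang_env_vars(environ=None):
    """Return sorted (name, value) pairs whose name starts with SGL_ or SGLANG_."""
    environ = os.environ if environ is None else environ
    return sorted(
        (name, value)
        for name, value in environ.items()
        if name.startswith(PREFIXES)
    )


if __name__ == "__main__":
    # Print the SGLang-related variables visible to this process.
    for name, value in sglang_env_vars():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check against a known set of names.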

General Configuration

Environment Variable	Description	Default Value
`SGLANG_USE_MODELSCOPE`	Enable using models from ModelScope	`false`
`SGLANG_HOST_IP`	Host IP address for the server	`0.0.0.0`
`SGLANG_PORT`	Port for the server	auto-detected
`SGLANG_LOGGING_CONFIG_PATH`	Custom logging configuration path	Not set
`SGLANG_DISABLE_REQUEST_LOGGING`	Disable request logging	`false`
`SGLANG_HEALTH_CHECK_TIMEOUT`	Timeout for health check in seconds	`20`
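
These settings are read from the environment of the server process, so they have to be in place before that process starts. The sketch below builds such an environment in Python and hands it to a child process. The helper names `build_env` and `launch_with_env` are hypothetical, the values are examples only, and the command line in the comment is an assumption about how the server is started in a given deployment.

```python
import os
import subprocess


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    # Environment values must be strings, so convert numbers explicitly.
    env.update({name: str(value) for name, value in overrides.items()})
    return env


def launch_with_env(command, overrides):
    """Start `command` as a child process that sees the overridden variables."""
    return subprocess.Popen(command, env=build_env(overrides))


# Example usage (not executed here; the model path is a placeholder):
#
#   proc = launch_with_env(
#       ["python", "-m", "sglang.launch_server", "--model-path", "<model-path>"],
#       {
#           "SGLANG_DISABLE_REQUEST_LOGGING": "true",
#           "SGLANG_HEALTH_CHECK_TIMEOUT": 60,
#       },
#   )
#   proc.wait()
```

Copying the parent environment first keeps unrelated variables such as `PATH` intact, and the overrides affect only the child process.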

Performance Tuning

Environment Variable	Description	Default Value
`SGLANG_ENABLE_TORCH_INFERENCE_MODE`	Control whether to use torch.inference_mode	`false`
`SGLANG_ENABLE_TORCH_COMPILE`	Enable torch.compile	`true`
`SGLANG_SET_CPU_AFFINITY`	Enable CPU affinity setting (often set to `1` in Docker builds)	`0`
`SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN`	Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds)	`0`
`SGLANG_IS_FLASHINFER_AVAILABLE`	Control FlashInfer availability check	`true`
`SGLANG_SKIP_P2P_CHECK`	Skip P2P (peer-to-peer) access check	`false`
`SGL_CHUNKED_PREFIX_CACHE_THRESHOLD`	Sets the threshold for enabling chunked prefix caching	`8192`
`SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION`	Enable RoPE fusion in fused Multi-head Latent Attention (MLA)	`1`

DeepGEMM Configuration (Advanced Optimization)

Environment Variable	Description	Default Value
`SGL_ENABLE_JIT_DEEPGEMM`	Enable Just-In-Time compilation of DeepGEMM kernels	`"true"`
`SGL_JIT_DEEPGEMM_PRECOMPILE`	Enable precompilation of DeepGEMM kernels	`"true"`
`SGL_JIT_DEEPGEMM_COMPILE_WORKERS`	Number of workers for parallel DeepGEMM kernel compilation	`4`
`SGL_IN_DEEPGEMM_PRECOMPILE_STAGE`	Indicator flag used during the DeepGEMM precompile script	`"false"`
`SGL_DG_CACHE_DIR`	Directory for caching compiled DeepGEMM kernels	`~/.cache/deep_gemm`
`SGL_DG_USE_NVRTC`	Use NVRTC (instead of Triton) for JIT compilation (Experimental)	`"0"`
`SGL_USE_DEEPGEMM_BMM`	Use DeepGEMM for Batched Matrix Multiplication (BMM) operations	`"false"`
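
Several defaults in this table are quoted strings such as `"true"` or `"0"`, because environment variables always carry string values. The sketch below shows one way such a flag could be interpreted in Python. The accepted spellings (`true` and `1`, case-insensitive) are an assumption made for illustration, and `read_bool_flag` is a hypothetical helper; SGLang's own parsing may differ.

```python
import os

# Assumption: these spellings mean "enabled"; anything else means "disabled".
TRUTHY = {"true", "1"}


def read_bool_flag(name, default="false", environ=None):
    """Interpret a string-valued environment variable as a boolean."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, default)
    return value.strip().lower() in TRUTHY
```

For example, `read_bool_flag("SGL_ENABLE_JIT_DEEPGEMM", default="true")` yields `True` when the variable is unset, which mirrors the default listed in the table.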

Memory Management

Environment Variable	Description	Default Value
`SGLANG_DEBUG_MEMORY_POOL`	Enable memory pool debugging	`false`
`SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION`	Clip max new tokens estimation for memory planning	Not set
`SGLANG_DETOKENIZER_MAX_STATES`	Maximum states for detokenizer	Default value based on system
`SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK`	Disable checks for memory imbalance across Tensor Parallel ranks	Not set (defaults to enabled check)

Model-Specific Options

Environment Variable	Description	Default Value
`SGLANG_USE_AITER`	Use the AITER optimized implementation	`false`
`SGLANG_INT4_WEIGHT`	Enable INT4 weight quantization	`false`
`SGLANG_MOE_PADDING`	Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds)	`0`
`SGLANG_FORCE_FP8_MARLIN`	Force using FP8 MARLIN kernels even if other FP8 kernels are available	`false`
`SGLANG_ENABLE_FLASHINFER_GEMM`	Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs	`false`
`SGLANG_SUPPORT_CUTLASS_BLOCK_FP8`	Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs	`false`
`SGLANG_CUTLASS_MOE`	Use Cutlass FP8 MoE kernel on Blackwell GPUs	`false`

Distributed Computing

Environment Variable	Description	Default Value
`SGLANG_BLOCK_NONZERO_RANK_CHILDREN`	Control blocking of child processes on non-zero ranks	`1`
`SGL_IS_FIRST_RANK_ON_NODE`	Indicates if the current process is the first rank on its node	`"true"`
`SGLANG_PP_LAYER_PARTITION`	Pipeline parallel layer partition specification	Not set

Testing & Debugging (Internal/CI)

These variables are primarily used for internal testing, continuous integration, or debugging.

Environment Variable	Description	Default Value
`SGLANG_IS_IN_CI`	Indicates if running in CI environment	`false`
`SGLANG_AMD_CI`	Indicates running in AMD CI environment	`0`
`SGLANG_TEST_RETRACT`	Enable retract decode testing	`false`
`SGLANG_RECORD_STEP_TIME`	Record step time for profiling	`false`
`SGLANG_TEST_REQUEST_TIME_STATS`	Test request time statistics	`false`
`SGLANG_CI_SMALL_KV_SIZE`	Use small KV cache size in CI	Not set

Profiling & Benchmarking

Environment Variable	Description	Default Value
`SGLANG_TORCH_PROFILER_DIR`	Directory for PyTorch profiler output	`/tmp`
`SGLANG_PROFILE_WITH_STACK`	Set `with_stack` option (bool) for PyTorch profiler (capture stack trace)	`true`
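
When redirecting profiler output away from `/tmp`, it can help to make sure the target directory exists before the server starts. The sketch below does that in Python. `prepare_profiler_dir` is a hypothetical helper, and whether SGLang creates a missing directory on its own is not stated in this document, so the explicit `mkdir` is a precaution rather than a documented requirement.

```python
import os
from pathlib import Path


def prepare_profiler_dir(path, environ=None):
    """Create `path` if needed and point SGLANG_TORCH_PROFILER_DIR at it."""
    environ = os.environ if environ is None else environ
    directory = Path(path).expanduser().resolve()
    # parents=True creates intermediate directories; exist_ok avoids an error on reuse.
    directory.mkdir(parents=True, exist_ok=True)
    environ["SGLANG_TORCH_PROFILER_DIR"] = str(directory)
    return directory
```

The variable has to be visible to the server process, so this should run before the server is launched, or the updated mapping should be passed as the child process's environment.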

Storage & Caching

Environment Variable	Description	Default Value
`SGLANG_DISABLE_OUTLINES_DISK_CACHE`	Disable Outlines disk cache	`true`
