Transformers fallback in SGLang
SGLang can fall back to models that are available in Transformers. This works for most decoder-style language models, and support for vision-language models is coming soon!
Example launch command
By default, SGLang uses its own model implementation if one is available. Otherwise, it falls back to the Transformers implementation. You can also force the fallback by setting impl to transformers.
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --impl transformers
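The same setting can be used with SGLang's offline Engine API. The following is a minimal sketch, assuming the offline Engine accepts the same impl server argument as the --impl flag above; treat it as illustrative rather than canonical.

import sglang as sgl

if __name__ == "__main__":
    # Assumption: Engine forwards `impl` like the --impl command-line flag.
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.2-1B-Instruct",
        impl="transformers",  # force the Transformers fallback
    )

    prompts = ["The capital of France is"]
    sampling_params = {"temperature": 0, "max_new_tokens": 16}

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, output["text"])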
Supported features
Quantization
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the Quantization page for more information about supported quantization in SGLang.
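For example, a pre-quantized checkpoint can be served through the fallback just like an unquantized one. The sketch below is only illustrative: the checkpoint name is an example, and the impl keyword carries the same assumption as above.

import sglang as sgl

# Example pre-quantized (AWQ) checkpoint; any quantization method supported by
# SGLang (except GGUF) should work through the fallback.
llm = sgl.Engine(
    model_path="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    impl="transformers",
)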
Remote code
This fallback also means that any model on the Hugging Face Hub that works in Transformers with trust_remote_code=True, and that correctly implements attention, can be used in production!
A model just needs the following two things:
from transformers import PreTrainedModel
from torch import nn
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):  # <- kwargs are required
        ...
        # Dispatch to the attention backend selected via config._attn_implementation.
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...

class MyModel(PreTrainedModel):
    _supports_attention_backend = True
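As a hedged usage sketch, such a custom model could then be served through the fallback with remote code enabled. The repository name below is hypothetical, and the impl keyword carries the same assumption as earlier.

import sglang as sgl

# Hypothetical remote-code model repository; trust_remote_code is needed
# because the modeling code lives on the Hub rather than inside transformers.
llm = sgl.Engine(
    model_path="my-org/my-custom-model",
    impl="transformers",
    trust_remote_code=True,
)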
Here is what happens in the background:
- The config is loaded.
- The MyModel Python class is loaded from the auto_map (see the sketch after this list), and we check that the model _supports_attention_backend.
- The TransformersModel backend is used. See /srt/models/transformers, which leverages self.config._attn_implementation = "sglang", thus the need to use ALL_ATTENTION_FUNCTIONS.
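For reference, here is a minimal sketch of what the auto_map entry in a remote-code model's config.json typically looks like, written as a Python dict for illustration; the module and class names are hypothetical.

# Mirrors the "auto_map" section of a remote-code model's config.json;
# it tells transformers which classes to load when trust_remote_code=True.
auto_map = {
    "AutoConfig": "configuration_mymodel.MyModelConfig",
    "AutoModelForCausalLM": "modeling_mymodel.MyModel",
}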
That's it!