# Runtime examples

Most of the examples below require you to start a server in a separate terminal before you can run them. Please see the code of each example for detailed instructions.
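
A server for these examples can typically be launched with a command like `python -m sglang.launch_server --model-path <model> --port 30000`. As a minimal sketch of querying a running server's native `/generate` endpoint (the port, prompt, and sampling parameters here are illustrative):

```python
import requests

# Assumes an SGLang server is already running locally; the port is illustrative.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
print(response.json()["text"])
```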

## Native API

- `lora.py`: An example of how to use LoRA adapters.
- `multimodal_embedding.py`: An example of how to perform multimodal embedding.
- `openai_batch_chat.py`: An example of how to process batch requests for chat completions.
- `openai_batch_complete.py`: An example of how to process batch requests for text completions.
- `openai_chat_with_response_prefill.py`: An example that demonstrates how to prefill a response using the OpenAI API by enabling the `continue_final_message` parameter. When enabled, the final (partial) assistant message is removed and its content is used as a prefill, so the model continues that message rather than starting a new turn (see the sketch after this list). See Anthropic's prefill example for more context.
- `reward_model.py`: An example of how to extract scores from a reward model.
- `vertex_predict.py`: An example of how to deploy a model to Vertex AI.
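
As a rough sketch of the response-prefill pattern from `openai_chat_with_response_prefill.py` (the model name and port are illustrative, and passing `continue_final_message` through `extra_body` is an assumption about the OpenAI-compatible endpoint):

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
    messages=[
        {"role": "user", "content": "Name three primary colors."},
        # The partial assistant message below serves as the prefill; the model
        # continues this message instead of starting a new turn.
        {"role": "assistant", "content": "Here are three primary colors:\n1."},
    ],
    # Non-standard parameters are passed to the server via extra_body.
    extra_body={"continue_final_message": True},
)
print(response.choices[0].message.content)
```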

## Engine

The `engine` folder contains examples that show how to use the Offline Engine API for common workflows (see the sketch after this list).

- `custom_server.py`: An example of how to deploy a custom server.
- `embedding.py`: An example of how to extract embeddings.
- `launch_engine.py`: An example of how to launch the Engine.
- `offline_batch_inference_eagle.py`: An example of how to perform speculative decoding with EAGLE.
- `offline_batch_inference_torchrun.py`: An example of how to perform inference using torchrun.
- `offline_batch_inference_vlm.py`: An example of how to use VLMs with the engine.
- `offline_batch_inference.py`: An example of how to use the engine to perform inference on a batch of prompts.
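
A minimal sketch of the Offline Engine API as used in `offline_batch_inference.py` (the model path and sampling parameters are illustrative):

```python
import sglang as sgl

if __name__ == "__main__":
    # Create the offline engine; no separate server process is needed.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

    # generate() accepts a batch of prompts and returns one output dict per prompt.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])

    llm.shutdown()
```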

## Hidden States

The `hidden_states` folder contains examples of how to extract hidden states using SGLang (see the sketch after the list below). Please note that this might degrade throughput due to CUDA graph rebuilding.

- `hidden_states_engine.py`: An example of how to extract hidden states using the Engine API.
- `hidden_states_server.py`: An example of how to extract hidden states using the Server API.
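
A rough sketch of extracting hidden states through the Engine API; the exact flag and field names here (`enable_return_hidden_states`, `return_hidden_states`, and the `meta_info` key) are assumptions, so please refer to `hidden_states_engine.py` for the authoritative version:

```python
import sglang as sgl

if __name__ == "__main__":
    # Flag name is an assumption; it enables returning hidden states per request.
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
        enable_return_hidden_states=True,
    )

    outputs = llm.generate(
        ["The capital of France is"],
        {"temperature": 0, "max_new_tokens": 8},
        return_hidden_states=True,  # per-request toggle; an assumption
    )
    # Hidden states are expected alongside other per-request metadata.
    print(outputs[0]["meta_info"]["hidden_states"])

    llm.shutdown()
```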

## Multimodal

SGLang supports multimodal inputs for various model architectures. The `multimodal` folder contains examples showing how to use URLs, files, or encoded data to make requests to multimodal models, as sketched below. Examples include querying the Llava-OneVision model (image, multi-image, video), Llava-backed Qwen-Llava and Llama3-Llava models (image, multi-image), and Mistral AI's Pixtral (image, multi-image).
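
A minimal sketch of sending an image by URL through the OpenAI-compatible chat endpoint (the model name, port, and image URL are illustrative):

```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="lmms-lab/llava-onevision-qwen2-7b-ov",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                # Images can also be passed as local files or base64-encoded data.
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```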

## Token In, Token Out

The `token_in_token_out` folder shows how to perform inference where we provide token IDs as input and receive token IDs as output (see the sketch below).

- `token_in_token_out_{llm|vlm}_{engine|server}.py`: Shows how to perform the token-in, token-out workflow for an LLM/VLM using either the engine or the native API.
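
A rough sketch of the token-in, token-out workflow with the engine; the `skip_tokenizer_init` flag and the `output_ids` field are assumptions here, so please refer to the scripts in `token_in_token_out` for the authoritative version:

```python
import sglang as sgl
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model name

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    # skip_tokenizer_init (an assumption) makes the engine consume and
    # produce token IDs directly instead of text.
    llm = sgl.Engine(model_path=MODEL, skip_tokenizer_init=True)

    input_ids = tokenizer("The capital of France is").input_ids
    outputs = llm.generate(
        input_ids=[input_ids],  # tokens in
        sampling_params={"temperature": 0, "max_new_tokens": 16},
    )
    output_ids = outputs[0]["output_ids"]  # tokens out; field name is an assumption
    print(tokenizer.decode(output_ids))

    llm.shutdown()
```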