# Runtime examples
Most of the examples below require you to start a server in a separate terminal before you can run them. Please see the code for detailed instructions.
## Native API
- `lora.py`: An example of how to use LoRA adapters.
- `multimodal_embedding.py`: An example of how to perform multimodal embedding.
- `openai_batch_chat.py`: An example of how to process batch requests for chat completions (see the minimal client sketch after this list).
- `openai_batch_complete.py`: An example of how to process batch requests for text completions.
- `openai_chat_with_response_prefill.py`: An example of how to prefill a response using the OpenAI API.
- `reward_model.py`: An example of how to extract scores from a reward model.
- `vertex_predict.py`: An example of how to deploy a model to Vertex AI.
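The `openai_*` examples all talk to a running SGLang server through its OpenAI-compatible endpoint. A minimal client sketch, assuming a server was launched in a separate terminal and that the model path and port below (which are illustrative, not prescribed by these examples) match your setup:

```python
# Minimal client sketch for the OpenAI-compatible endpoint of a running
# SGLang server, launched separately with something like:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
# The model path and port are illustrative assumptions; adjust them to your setup.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "List three countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```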
## Engine
The `engine` folder contains examples that show how to use the Offline Engine API for common workflows.
- `custom_server.py`: An example of how to deploy a custom server.
- `embedding.py`: An example of how to extract embeddings.
- `launch_engine.py`: An example of how to launch the Engine (a minimal sketch follows this list).
- `offline_batch_inference_eagle.py`: An example of how to perform speculative decoding using EAGLE.
- `offline_batch_inference_torchrun.py`: An example of how to perform inference using torchrun.
- `offline_batch_inference_vlm.py`: An example of how to use VLMs with the Engine.
- `offline_batch_inference.py`: An example of how to use the Engine to perform inference on a batch of examples.
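As a quick orientation, here is a minimal sketch of the Offline Engine API in the spirit of `launch_engine.py` and `offline_batch_inference.py`; the model path is an illustrative assumption:

```python
# Minimal sketch of the Offline Engine API; see launch_engine.py and
# offline_batch_inference.py for the complete examples.
import sglang as sgl


def main():
    # The model path is an illustrative assumption; use any model you have access to.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])
    llm.shutdown()


if __name__ == "__main__":
    main()
```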
## Hidden States
The `hidden_states` folder contains examples of how to extract hidden states using SGLang. Please note that this might degrade throughput due to CUDA graph rebuilding.
- `hidden_states_engine.py`: An example of how to extract hidden states using the Engine API (a minimal sketch follows this list).
- `hidden_states_server.py`: An example of how to extract hidden states using the Server API.
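A minimal sketch in the spirit of `hidden_states_engine.py`; the flag and field names below are assumptions taken from that example and may differ across SGLang versions, so treat the file itself as the source of truth:

```python
# Sketch of extracting hidden states with the Engine API, following
# hidden_states_engine.py. The enable_return_hidden_states flag and the
# meta_info["hidden_states"] field are assumptions taken from that example
# and may differ across versions.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",  # illustrative model
    enable_return_hidden_states=True,
)
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 16}
outputs = llm.generate(
    ["The capital of France is"],
    sampling_params,
    return_hidden_states=True,
)
# In this example's output format, the per-token hidden states are expected
# under meta_info.
print(outputs[0]["meta_info"]["hidden_states"])
llm.shutdown()
```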
## LLaVA-OneVision
SGLang supports LLaVA-OneVision with single-image, multi-image, and video inputs. The `llava_onevision` folder shows how to use it.
## Token In, Token Out
The `token_in_token_out` folder shows how to perform inference where we provide token ids as input and receive token ids as output.
- `token_in_token_out_{llm|vlm}_{engine|server}.py`: Shows how to perform the token-in, token-out workflow for an LLM/VLM using either the Engine or the native API (see the sketch below).
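A minimal sketch in the spirit of `token_in_token_out_llm_engine.py`, assuming `skip_tokenizer_init` makes the Engine accept and return token ids; the model path and output field name are assumptions taken from that example and may vary by version:

```python
# Sketch of the token-in, token-out workflow with the offline Engine,
# following token_in_token_out_llm_engine.py. With skip_tokenizer_init the
# engine takes token ids as input and returns token ids, so we tokenize and
# detokenize ourselves. The model path and the "output_ids" field are
# assumptions taken from that example.
import sglang as sgl
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = sgl.Engine(model_path=model_path, skip_tokenizer_init=True)

input_ids = tokenizer("The capital of France is")["input_ids"]
sampling_params = {"temperature": 0, "max_new_tokens": 16}
outputs = llm.generate(input_ids=[input_ids], sampling_params=sampling_params)

# The engine returns token ids here; decode them with our own tokenizer.
print(tokenizer.decode(outputs[0]["output_ids"]))
llm.shutdown()
```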