# Runtime examples

Most of the examples below require you to start a server in a separate terminal before you can run them. Please see the code for detailed instructions.

## Native API

- `lora.py`: An example of how to use LoRA adapters.
- `multimodal_embedding.py`: An example of how to perform multimodal embedding.
- `openai_batch_chat.py`: An example of how to process batch requests for chat completions.
- `openai_batch_complete.py`: An example of how to process batch requests for text completions.
- `openai_chat_with_response_prefill.py`: An example of how to prefill a response using the OpenAI API.
- `reward_model.py`: An example of how to extract scores from a reward model.
- `vertex_predict.py`: An example of how to deploy a model to Vertex AI.
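
As a minimal sketch of the common pattern behind these scripts (the model name, port, and generation settings are placeholder assumptions; each script documents its own requirements), a server started with `python -m sglang.launch_server --model-path <model> --port 30000` can be queried through the OpenAI-compatible API:

```python
# Minimal sketch: query a locally running SGLang server through its
# OpenAI-compatible endpoint. Assumes the server was started separately, e.g.
#   python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --port 30000
# The model name and port are placeholders; adjust them to your setup.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder; must match the served model
    messages=[{"role": "user", "content": "List three countries and their capitals."}],
    temperature=0.0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```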

## Engine

The `engine` folder contains examples that show how to use the Offline Engine API for common workflows (a short sketch follows the list below).

- `custom_server.py`: An example of how to deploy a custom server.
- `embedding.py`: An example of how to extract embeddings.
- `launch_engine.py`: An example of how to launch the Engine.
- `offline_batch_inference_eagle.py`: An example of how to perform speculative decoding using EAGLE.
- `offline_batch_inference_torchrun.py`: An example of how to perform inference using torchrun.
- `offline_batch_inference_vlm.py`: An example of how to use VLMs with the engine.
- `offline_batch_inference.py`: An example of how to use the engine to perform inference on a batch of examples.
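
As a rough illustration of that workflow (the model path and sampling parameters below are placeholder assumptions; `offline_batch_inference.py` is the maintained version):

```python
# Minimal sketch of offline batch inference with the SGLang Engine.
# The model path and sampling parameters are placeholders.
import sglang as sgl

def main():
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])
    llm.shutdown()

if __name__ == "__main__":
    main()
```

The Engine runs in-process, so these examples do not need a separate server terminal.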

## Hidden States

The `hidden_states` folder contains examples of how to extract hidden states using SGLang. Please note that this may degrade throughput due to CUDA graph rebuilding. A rough sketch follows the list below.

- `hidden_states_engine.py`: An example of how to extract hidden states using the Engine API.
- `hidden_states_server.py`: An example of how to extract hidden states using the Server API.
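
As a very rough, unverified sketch of the Engine-side flow (the `return_hidden_states` flag and the `meta_info` field below are assumptions; `hidden_states_engine.py` is the authoritative reference for the exact interface in your SGLang version):

```python
# Rough sketch only: extracting hidden states alongside generated text.
# The return_hidden_states flag and the "hidden_states" field in meta_info are
# assumptions; check hidden_states_engine.py for the exact interface.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model
    return_hidden_states=True,  # assumed flag enabling hidden-state output
)
sampling_params = {"temperature": 0.0, "max_new_tokens": 8}
outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    # Assumed location of the per-token hidden states in the output dict.
    hidden_states = output["meta_info"].get("hidden_states")
    print(output["text"], None if hidden_states is None else len(hidden_states))
llm.shutdown()
```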

## LLaVA-NeXT

SGLang supports LLaVA-OneVision with single-image, multi-image, and video inputs. The `llava_onevision` folder shows how to do this.
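
As a hedged sketch of the single-image chat case against a running server (the model name, port, and image URL are placeholders; the scripts in `llava_onevision` cover multi-image and video as well):

```python
# Minimal sketch: send a single-image chat request to a running SGLang server
# through its OpenAI-compatible API. Assumes a LLaVA-OneVision server was
# started separately; the model name, port, and image URL are placeholders.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lmms-lab/llava-onevision-qwen2-7b-ov",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```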

## Token In, Token Out

The `token_in_token_out` folder shows how to perform inference where token IDs are provided as input and returned as output (see the sketch after the list below).

- `token_in_token_out_{llm|vlm}_{engine|server}.py`: Shows how to perform the token-in, token-out workflow for an LLM or VLM using either the Engine or the native API.
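
As a rough, unverified sketch of the LLM + Engine variant (the `skip_tokenizer_init` flag, the `input_ids` argument, and the `output_ids` field are assumptions; `token_in_token_out_llm_engine.py` is the authoritative reference):

```python
# Rough sketch only: token-in, token-out with the offline Engine.
# skip_tokenizer_init, the input_ids argument, and the "output_ids" field are
# assumptions; check token_in_token_out_llm_engine.py for the exact interface.
import sglang as sgl
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Tokenize outside the engine and pass raw token IDs in.
input_ids = [tokenizer.encode("The capital of France is")]

llm = sgl.Engine(model_path=model_path, skip_tokenizer_init=True)
sampling_params = {"temperature": 0.0, "max_new_tokens": 8}
outputs = llm.generate(input_ids=input_ids, sampling_params=sampling_params)

for output in outputs:
    # Assumed field holding the generated token IDs.
    out_ids = output["output_ids"]
    print(out_ids, tokenizer.decode(out_ids))
llm.shutdown()
```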