# SGLang Engine
SGLang provides a direct inference engine without the need for an HTTP server. Common use cases include:
- Offline Batch Inference
- Embedding Generation
- Custom Server
- Token-In-Token-Out for RLHF
- Inference Using FastAPI
## Examples
### Offline Batch Inference
In this example, we launch an SGLang engine and feed it a batch of inputs for inference. If you provide a very large batch, the engine schedules the requests intelligently so they are processed efficiently without running out of memory (OOM).
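A minimal sketch of this pattern is shown below; the model path, prompts, and sampling parameters are illustrative placeholders rather than the exact values used in offline_batch_inference.py.

```python
# Minimal offline batch inference sketch; model path and prompts are illustrative.
import sglang as sgl


def main():
    # Create the engine in-process; no HTTP server is launched.
    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

    prompts = [
        "Hello, my name is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # The engine batches and schedules these requests internally.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt!r}\nGenerated: {output['text']!r}\n")

    llm.shutdown()


if __name__ == "__main__":
    main()
```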
### Embedding Generation
In this example, we launch an SGLang engine and feed a batch of inputs for embedding generation.
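A rough sketch of this flow, assuming an embedding-capable model (the model name below is only a placeholder) and the engine's `is_embedding` flag and `encode()` method:

```python
# Embedding generation sketch; the embedding model name is illustrative.
import sglang as sgl


def main():
    # is_embedding=True starts the engine in embedding mode.
    llm = sgl.Engine(
        model_path="Alibaba-NLP/gte-Qwen2-1.5B-instruct",
        is_embedding=True,
    )

    prompts = ["Hello, my name is", "The capital of France is"]

    # encode() returns one result per prompt, each carrying an embedding vector.
    outputs = llm.encode(prompts)
    for prompt, output in zip(prompts, outputs):
        print(f"{prompt!r} -> embedding dim {len(output['embedding'])}")

    llm.shutdown()


if __name__ == "__main__":
    main()
```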
### Custom Server
This example demonstrates how to create a custom server on top of the SGLang Engine. We use Sanic as an example. The server supports both non-streaming and streaming endpoints.
#### Steps
1. Install Sanic:

   ```bash
   pip install sanic
   ```

2. Run the server:

   ```bash
   python custom_server.py
   ```

3. Send requests:

   ```bash
   curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}'
   curl -X POST http://localhost:8000/generate_stream -H "Content-Type: application/json" -d '{"prompt": "The Transformer architecture is..."}' --no-buffer
   ```

   This sends both a non-streaming and a streaming request to the server.
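For orientation, a Sanic server along these lines might look roughly like the sketch below. The endpoint names mirror the curl commands above, while the model path, sampling parameters, and the shape of streamed chunks (dicts carrying a `text` field) are assumptions rather than details taken from custom_server.py.

```python
# Rough Sanic server sketch wrapping the SGLang engine; details are assumptions.
import sglang as sgl
from sanic import Sanic
from sanic.response import json as json_response

app = Sanic("sglang_custom_server")
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = {"temperature": 0.8, "top_p": 0.95}


@app.post("/generate")
async def generate(request):
    # Non-streaming: wait for the full completion, then return it as JSON.
    prompt = request.json["prompt"]
    result = await llm.async_generate(prompt, sampling_params)
    return json_response({"text": result["text"]})


@app.post("/generate_stream")
async def generate_stream(request):
    # Streaming: forward chunks to the client as the engine produces them.
    prompt = request.json["prompt"]
    response = await request.respond(content_type="text/plain")
    generator = await llm.async_generate(prompt, sampling_params, stream=True)
    async for chunk in generator:
        await response.send(chunk["text"])
    await response.eof()


if __name__ == "__main__":
    # single_process=True (Sanic 22.9+) keeps the engine in a single process.
    app.run(host="0.0.0.0", port=8000, single_process=True)
```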
### Token-In-Token-Out for RLHF
In this example, we launch an SGLang engine, feed tokens as input and generate tokens as output.
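A hedged sketch of the token-in-token-out flow is shown below. The `skip_tokenizer_init=True` flag and the `output_ids` field on each result are assumptions about the engine's token-level API, and the model path is a placeholder.

```python
# Token-in-token-out sketch for RLHF-style pipelines.
# skip_tokenizer_init and the "output_ids" field are assumed API details.
import sglang as sgl
from transformers import AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def main():
    # With skip_tokenizer_init=True the engine accepts and returns token IDs.
    llm = sgl.Engine(model_path=MODEL, skip_tokenizer_init=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    prompts = ["The capital of France is", "The future of AI is"]
    input_ids = [tokenizer.encode(p) for p in prompts]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

    outputs = llm.generate(input_ids=input_ids, sampling_params=sampling_params)
    for prompt, output in zip(prompts, outputs):
        # The generated token IDs (assumed to live under "output_ids") can be
        # fed directly into an RLHF trainer; decode here only for inspection.
        print(prompt, "->", tokenizer.decode(output["output_ids"]))

    llm.shutdown()


if __name__ == "__main__":
    main()
```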
### Inference Using FastAPI
This example demonstrates how to create a FastAPI server that uses the SGLang engine for text generation.
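A minimal sketch in the spirit of fastapi_engine_inference.py; the model path, request schema, and port are illustrative rather than taken from that file.

```python
# Minimal FastAPI wrapper sketch around the SGLang engine.
import sglang as sgl
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")


class GenerateRequest(BaseModel):
    prompt: str


@app.post("/generate")
async def generate(req: GenerateRequest):
    # async_generate keeps the event loop free while the engine works.
    result = await llm.async_generate(req.prompt, {"temperature": 0.8, "top_p": 0.95})
    return {"text": result["text"]}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```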