# Offline Engine API

SGLang provides a direct inference engine that runs without an HTTP server. This suits use cases where an HTTP server would add unnecessary complexity or overhead. There are two general use cases:

- Offline Batch Inference
- Custom Server on Top of the Engine

This document focuses on offline batch inference and demonstrates four inference modes:

- Non-streaming synchronous generation
- Streaming synchronous generation
- Non-streaming asynchronous generation
- Streaming asynchronous generation

You can also build a custom server on top of the SGLang offline engine. A detailed example, written as a Python script, is available in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).



## Nest Asyncio
If you use the **Offline Engine** in IPython or in any other code that already runs an event loop (a nested loop), add the following code first:
```python
import nest_asyncio

nest_asyncio.apply()
```
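
To illustrate why the patch is needed, here is a minimal standard-library sketch (it does not use SGLang). It shows that `asyncio.run` refuses to start when an event loop is already running, which is the situation inside IPython or a notebook. The names `inner` and `outer` are hypothetical and exist only for this illustration.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are already inside a running event loop here, just like code
    # executed in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # attempts to start a second, nested loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return f"RuntimeError: {exc}"
    return "no error"


result = asyncio.run(outer())
print(result)
```

Calling `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted.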

## Advanced Usage

The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).

Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases.

## Offline Batch Inference

The SGLang offline engine supports batch inference with efficient scheduling.

```python
# Launch the offline engine.
import asyncio

import sglang as sgl
import sglang.test.doc_patch
from sglang.utils import async_stream_and_merge, stream_and_merge

llm = sgl.Engine(model_path="qwen/qwen2.5-0.5b-instruct")
```

### Non-streaming Synchronous Generation

```python
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

# Pass the whole list at once; one output is returned per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print("===============================")
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```
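
Offline batch runs often need their results saved rather than printed. The following is a minimal sketch that writes prompt and output pairs as JSON Lines. It assumes only what the loop above shows: each output is a dict with a `"text"` key. The helper `write_jsonl` and the stand-in data are hypothetical and are not part of SGLang.

```python
import io
import json


def write_jsonl(prompts, outputs, fp):
    """Write one JSON object per line, pairing each prompt with its text."""
    for prompt, output in zip(prompts, outputs):
        record = {"prompt": prompt, "text": output["text"]}
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in data shaped like the outputs above (assumption: dicts with "text").
demo_prompts = ["The capital of France is"]
demo_outputs = [{"text": " Paris."}]

# An in-memory buffer keeps the sketch self-contained; a real run could pass
# a file opened with open("results.jsonl", "w", encoding="utf-8") instead.
buffer = io.StringIO()
write_jsonl(demo_prompts, demo_outputs, buffer)
jsonl_text = buffer.getvalue()
print(jsonl_text, end="")
```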

### Streaming Synchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {
    "temperature": 0.2,
    "top_p": 0.9,
}

print("\n=== Testing synchronous streaming generation with overlap removal ===\n")

for prompt in prompts:
    print(f"Prompt: {prompt}")
    # stream_and_merge streams the chunks and returns the merged text.
    merged_output = stream_and_merge(llm, prompt, sampling_params)
    print("Generated text:", merged_output)
    print()
```
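
The "overlap removal" mentioned above can be sketched in plain Python. This is an illustration of the general technique under the assumption that consecutive streamed chunks may repeat text that was already received. It is not necessarily how `stream_and_merge` is implemented, and the names `trim_overlap` and `merge_chunks` are hypothetical.

```python
def trim_overlap(existing_text, new_chunk):
    """Drop the longest prefix of new_chunk that is already a suffix of existing_text."""
    max_possible = min(len(existing_text), len(new_chunk))
    for size in range(max_possible, 0, -1):
        if existing_text.endswith(new_chunk[:size]):
            return new_chunk[size:]
    return new_chunk  # no overlap found


def merge_chunks(chunks):
    """Concatenate chunks while removing text repeated at chunk boundaries."""
    merged = ""
    for chunk in chunks:
        merged += trim_overlap(merged, chunk)
    return merged


# Each chunk repeats the tail of the previous one.
overlapping = ["Hello, my", "my name is", " is Alice"]
# Each chunk contains the full text generated so far.
cumulative = ["The", "The capital", "The capital of France"]

merged_overlapping = merge_chunks(overlapping)
merged_cumulative = merge_chunks(cumulative)
print(merged_overlapping)
print(merged_cumulative)
```

The same function handles both partially overlapping chunks and cumulative chunks, because in each case the repeated part is a prefix of the new chunk and a suffix of the merged text.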

### Non-streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous batch generation ===")


async def main():
    # Await the whole batch; one output is returned per prompt.
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")


asyncio.run(main())
```

### Streaming Asynchronous Generation

```python
prompts = [
    "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
    "Provide a concise factual statement about France’s capital city. The capital of France is",
    "Explain possible future trends in artificial intelligence. The future of AI is",
]

sampling_params = {"temperature": 0.8, "top_p": 0.95}

print("\n=== Testing asynchronous streaming generation (no repeats) ===")


async def main():
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        print("Generated text: ", end="", flush=True)

        # Use the overlap-aware helper instead of calling async_generate directly.
        async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):
            print(cleaned_chunk, end="", flush=True)

        print()  # New line after each prompt


asyncio.run(main())
```
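
The loop above streams one prompt at a time. When the text does not need to be printed as it arrives, several streams can be consumed concurrently and collected in order. The sketch below shows that pattern with `asyncio.gather`. It uses a hypothetical stand-in generator, `fake_stream`, in place of a real engine call, so it runs without SGLang; whether this pattern pays off with the actual engine is an assumption to verify.

```python
import asyncio


async def fake_stream(prompt):
    """Hypothetical stand-in for a streaming call; yields text pieces."""
    for piece in (" one", " two", " three"):
        await asyncio.sleep(0)  # hand control back to the loop between pieces
        yield piece


async def collect(prompt):
    """Consume one stream fully and return the prompt plus its generated text."""
    pieces = []
    async for piece in fake_stream(prompt):
        pieces.append(piece)
    return prompt + "".join(pieces)


async def run_all(prompts):
    # gather runs the coroutines concurrently and returns the results
    # in the same order as the prompts.
    return await asyncio.gather(*(collect(p) for p in prompts))


results = asyncio.run(run_all(["A:", "B:"]))
print(results)
```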

```python
# Shut down the engine once all generation is finished.
llm.shutdown()
```