Quick Start
This document helps Ascend developers quickly set up LLM inference serving with SGLang on Ascend NPUs. See the official documentation for more information.
Overview
SGLang is a fast serving framework for LLMs and VLMs. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.
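To give a taste of that frontend language, here is a minimal sketch of SGLang's structured-generation DSL. It assumes a server is already running on localhost:8000 (as launched in the next section); the function name `qa`, the variable name `answer`, and `max_tokens=64` are illustrative choices, not fixed by this guide:

```python
import sglang as sgl


@sgl.function
def qa(s, question):
    # Build one chat turn and let the model fill in the answer.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))


# Point the frontend language at a running SGLang server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8000"))

state = qa.run(question="What is the capital of France?")
print(state["answer"])
```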
Launching a Service with SGLang
The following example shows how to launch a simple chat-completion service with SGLang.

Start a server:
```bash
# Launch the SGLang server on NPU
python -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct \
    --device npu --port 8000 --attention-backend ascend \
    --host 0.0.0.0 --trust-remote-code
```
Once the server has started successfully, you should see log output similar to the following:
```
INFO: Started server process [89394]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:40106 - "GET /get_model_info HTTP/1.1" 200 OK
Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
INFO: 127.0.0.1:40108 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
```
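You can also confirm the server is reachable from Python before sending requests. This is a minimal sketch using the `requests` package (an extra dependency); it queries the `/get_model_info` endpoint that also appears in the log above:

```python
import requests

# Quick readiness check against the running server.
resp = requests.get("http://localhost:8000/get_model_info", timeout=5)
print(resp.status_code, resp.json())
```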
Test it with curl:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
You should see a response similar to:
1{"id":"3f2f1aa779b544c19f01c08b803bf4ef","object":"chat.completion","created":1759136880,"model":"qwen/qwen2.5-0.5b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":36,"total_tokens":44,"completion_tokens":8,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
Verifying Inference with SGLang
The following code shows how to run an offline inference check with SGLang:
```python
# example.py
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The Independence Day of the United States is",
        "The capital of Germany is",
        "The full form of AI is",
    ]

    # Create an offline engine on the NPU with the Ascend attention backend;
    # model_path points at a local copy of the model weights (adjust as needed).
    llm = sgl.Engine(
        model_path="/Qwen2.5/Qwen2.5-0.5B-Instruct",
        device="npu",
        attention_backend="ascend",
    )

    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}

    # Generate completions for all prompts in one batch.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

    # Release the engine's resources when done.
    llm.shutdown()


if __name__ == "__main__":
    main()
```
Run example.py; if each prompt produces generated text, SGLang is installed and working correctly.
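For a more reproducible smoke test, you can tighten the sampling parameters before calling generate. This sketch reuses the `llm` engine from example.py; `temperature`, `top_k`, `max_new_tokens`, and `stop` are standard SGLang sampling fields:

```python
# Greedy, bounded decoding for a deterministic check.
sampling_params = {
    "temperature": 0,      # pick the most likely token at every step
    "top_k": 1,
    "max_new_tokens": 32,
    "stop": ["\n\n"],      # cut generation at the first blank line
}
outputs = llm.generate(["The capital of Germany is"], sampling_params)
print(outputs[0]["text"])
```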