Quick Start

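Note

Before reading this guide, make sure you have set up the Ascend environment and installed SGLang by following the installation tutorial.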

This tutorial helps Ascend developers get started quickly with SGLang × Ascend for LLM inference serving. For more information, see the official documentation.

Overview

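SGLang is a fast serving framework for LLMs and VLMs. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.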

Launching a Service with SGLang

The following example shows how to launch a simple chat generation service with SGLang.

Start a server:

# Launch the SGLang server on NPU
python -m sglang.launch_server --model Qwen/Qwen2.5-0.5B-Instruct \
    --device npu --port 8000 --attention-backend ascend \
    --host 0.0.0.0 --trust-remote-code
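
If the server has to be started from a script rather than an interactive shell, the same command can be assembled as an argument list. The sketch below is only an illustration: `build_launch_command` is a hypothetical helper, not part of SGLang, and it reuses exactly the flags from the command above without adding any new ones.

```python
import shlex
import sys


def build_launch_command(model, port=8000, host="0.0.0.0"):
    """Assemble the launch command from this guide as an argv list.

    Hypothetical helper for illustration; every flag is copied from the
    shell command shown in this section.
    """
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model", model,
        "--device", "npu",
        "--port", str(port),
        "--attention-backend", "ascend",
        "--host", host,
        "--trust-remote-code",
    ]


cmd = build_launch_command("Qwen/Qwen2.5-0.5B-Instruct")
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.Popen` would start the server as a child process; using an argument list rather than a single string avoids shell quoting problems.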

After a successful start, you will see log output similar to the following:

INFO:     Started server process [89394]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     127.0.0.1:40106 - "GET /get_model_info HTTP/1.1" 200 OK
Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
INFO:     127.0.0.1:40108 - "POST /generate HTTP/1.1" 200 OK
The server is fired up and ready to roll!
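
Model loading can take a while, so a script that talks to the server usually needs to wait until it responds. A minimal sketch of such a wait loop follows. It polls the `/get_model_info` endpoint that appears in the log above; treating a 200 response from that endpoint as a readiness signal is an assumption of this example, and `probe` and `wait_for_server` are hypothetical helper names.

```python
import http.client
import time
import urllib.request


def probe(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_for_server(base_url, timeout=300.0, interval=2.0, probe_fn=probe):
    """Poll <base_url>/get_model_info until it succeeds or `timeout` elapses.

    Returns True once the probe succeeds and False if the deadline passes.
    The endpoint choice is an assumption taken from the startup log.
    """
    url = base_url.rstrip("/") + "/get_model_info"
    deadline = time.monotonic() + timeout
    while True:
        if probe_fn(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_server("http://localhost:8000")` before sending the first request lets a script block until the server is reachable. The `probe_fn` parameter exists only so the loop can be exercised without a running server.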

Test with curl:

curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen/qwen2.5-0.5b-instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'

You will see a response similar to the following:

{"id":"3f2f1aa779b544c19f01c08b803bf4ef","object":"chat.completion","created":1759136880,"model":"qwen/qwen2.5-0.5b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The capital of France is Paris.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":36,"total_tokens":44,"completion_tokens":8,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
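
The same request can be sent from Python, which also makes it easy to pull the assistant's reply out of the JSON. The following is a sketch using only the standard library. The helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical, the URL and payload mirror the curl command, and the field names are taken from the sample response above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request that mirrors the curl command in this guide."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return (assistant text, total token count) from a response body."""
    data = json.loads(response_body)
    message = data["choices"][0]["message"]
    return message["content"], data["usage"]["total_tokens"]


def chat(base_url, model, user_message, timeout=60.0):
    """Send one user message and return the parsed reply (needs a running server)."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(resp.read().decode("utf-8"))


# Trimmed copy of the sample response above, used to exercise the parser offline.
sample = (
    '{"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":"The capital of France is Paris."},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":36,"total_tokens":44,"completion_tokens":8}}'
)
text, total_tokens = extract_reply(sample)
print(text, total_tokens)
```

With the server from this guide running, `chat("http://localhost:8000", "qwen/qwen2.5-0.5b-instruct", "What is the capital of France?")` would perform the same call as the curl command.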

Verifying Inference with SGLang

The following code shows how to verify inference with SGLang:

# example.py
import torch

import sglang as sgl


def main():
    # Prompts to complete.
    prompts = [
        "Hello, my name is",
        "The Independence Day of the United States is",
        "The capital of Germany is",
        "The full form of AI is",
    ] * 1

    # Create an offline engine on the NPU with the Ascend attention backend.
    llm = sgl.Engine(model_path="/Qwen2.5/Qwen2.5-0.5B-Instruct", device="npu", attention_backend="ascend")

    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}

    # Generate one completion per prompt and print each result.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == '__main__':
    main()

Run example.py to test. If it prints generated text for the prompts, SGLang is installed correctly.
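
Instead of checking the printed output by eye, the check can be automated. The sketch below assumes, as in example.py, that `outputs` is a list of dicts with a `"text"` key, one per prompt. `check_outputs` is a hypothetical helper, and the data here is a stand-in rather than real model output.

```python
def check_outputs(prompts, outputs):
    """Return the prompts whose output is missing or has empty text."""
    failed = []
    for index, prompt in enumerate(prompts):
        output = outputs[index] if index < len(outputs) else None
        text = output.get("text", "") if isinstance(output, dict) else ""
        if not text.strip():
            failed.append(prompt)
    return failed


# Stand-in data shaped like the `outputs` list in example.py (not real model output).
prompts = ["Hello, my name is", "The capital of Germany is"]
outputs = [{"text": "<generated text>"}, {"text": ""}]

failed = check_outputs(prompts, outputs)
print("OK" if not failed else f"Empty output for: {failed}")
```

In example.py, the same function could be called right after `llm.generate(...)`; an empty result list would indicate that every prompt produced text.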