Performance Benchmark

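This document describes the benchmarking methodology for vllm-ascend, which is intended to evaluate its performance across a variety of workloads. To stay consistent with vLLM, we use the benchmark scripts provided by the vllm project.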

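Benchmark coverage: we measure offline end-to-end latency and throughput, as well as online serving performance at fixed QPS. For more details, see the vllm-ascend benchmark scripts.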
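
To make "fixed QPS" concrete, the sketch below generates request send times for a target rate. It assumes Poisson arrivals (exponentially distributed gaps between requests), which is a common choice for load generators; whether the benchmark script uses exactly this scheme is an assumption, and `arrival_offsets` is a hypothetical helper, not part of vllm or vllm-ascend. With `qps = inf`, every request is sent at time zero, so requests queue behind each other before their first token is produced.

```python
import random


def arrival_offsets(num_requests, qps, seed=0):
    """Return send times (seconds from start) for num_requests at a target QPS.

    Assumption: arrivals follow a Poisson process, so the gaps between
    consecutive requests are exponentially distributed with mean 1 / qps.
    """
    if qps == float("inf"):
        # No pacing: every request is sent immediately.
        return [0.0] * num_requests
    rng = random.Random(seed)
    offsets = []
    now = 0.0
    for _ in range(num_requests):
        offsets.append(now)
        now += rng.expovariate(qps)  # gap until the next request
    return offsets


if __name__ == "__main__":
    for rate in (1.0, 4.0, 16.0, float("inf")):
        times = arrival_offsets(200, rate)
        print(f"qps {rate}: last request sent at {times[-1]:.1f} s")
```

Under this model the average gap is `1 / qps` seconds, so 200 requests at QPS 1 take roughly 200 seconds just to send.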

1. Run the Docker container

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.9.2rc1
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

2. Install dependencies

cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt

3. (Optional) Prepare model weights

For faster runs, we recommend downloading the model in advance:

modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct

You can also replace all model paths in the json files with your local path:

[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "your local model path",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
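
If there are several test cases, editing each path by hand is error-prone. The following is a minimal sketch that rewrites the `model` field in one config file. It assumes the file is a JSON list shaped like the example above, and that any object whose key ends in `parameters` may carry a `model` entry; `set_local_model_path` is a hypothetical helper, not part of vllm-ascend.

```python
import json
from pathlib import Path


def set_local_model_path(config_path, local_model_path):
    """Point every test case in a benchmark config at a local model directory.

    Returns the number of "model" entries that were rewritten.
    """
    path = Path(config_path)
    tests = json.loads(path.read_text(encoding="utf-8"))
    changed = 0
    for test in tests:
        for key, params in test.items():
            # Assumption: only "*parameters" objects hold a "model" key.
            if key.endswith("parameters") and isinstance(params, dict) and "model" in params:
                params["model"] = local_model_path
                changed += 1
    path.write_text(json.dumps(tests, indent=2) + "\n", encoding="utf-8")
    return changed
```

A call such as `set_local_model_path("my-tests.json", "/data/models/llama")` (both arguments are placeholders) updates the file in place, so keep a copy of the original if you need to restore it.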

4. Run the benchmark script

Run the benchmark script:

bash benchmarks/scripts/run-performance-benchmarks.sh

After about 10 minutes, the output looks like this:

online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  212.77    
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              0.94      
Output token throughput (tok/s):         204.66    
Total Token throughput (tok/s):          405.16    
---------------Time to First Token----------------
Mean TTFT (ms):                          104.14    
Median TTFT (ms):                        102.22    
P99 TTFT (ms):                           153.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.78     
Median TPOT (ms):                        38.70     
P99 TPOT (ms):                           48.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.46     
Median ITL (ms):                         36.96     
P99 ITL (ms):                            75.03     
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.55     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              2.76      
Output token throughput (tok/s):         600.24    
Total Token throughput (tok/s):          1188.27   
---------------Time to First Token----------------
Mean TTFT (ms):                          115.62    
Median TTFT (ms):                        109.39    
P99 TTFT (ms):                           169.03    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.48     
Median TPOT (ms):                        52.40     
P99 TPOT (ms):                           69.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.47     
Median ITL (ms):                         43.95     
P99 ITL (ms):                            130.29    
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  47.82     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.18      
Output token throughput (tok/s):         910.62    
Total Token throughput (tok/s):          1802.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          128.50    
Median TTFT (ms):                        128.36    
P99 TTFT (ms):                           187.87    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.60     
Median TPOT (ms):                        77.85     
P99 TPOT (ms):                           165.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.72     
Median ITL (ms):                         54.84     
P99 ITL (ms):                            289.63    
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  41.26     
Total input tokens:                      42659     
Total generated tokens:                  43545     
Request throughput (req/s):              4.85      
Output token throughput (tok/s):         1055.44   
Total Token throughput (tok/s):          2089.40   
---------------Time to First Token----------------
Mean TTFT (ms):                          3394.37   
Median TTFT (ms):                        3359.93   
P99 TTFT (ms):                           3540.93   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.28     
Median TPOT (ms):                        64.19     
P99 TPOT (ms):                           97.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           56.62     
Median ITL (ms):                         55.69     
P99 ITL (ms):                            82.90     
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens:  42659
Total num output tokens:  43545
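
The throughput figures in each serving report can be cross-checked against the raw counts. The sketch below parses the `Label: value` lines of a report and recomputes the three throughput numbers as counts divided by the benchmark duration. The sample text is copied from the `qps 1` report above; `parse_report` and `derived_throughput` are hypothetical helpers, and the formulas are an assumption that happens to agree with the reported values up to rounding of the printed duration.

```python
import re

SAMPLE = """\
Successful requests:                     200
Benchmark duration (s):                  212.77
Total input tokens:                      42659
Total generated tokens:                  43545
Request throughput (req/s):              0.94
Output token throughput (tok/s):         204.66
Total Token throughput (tok/s):          405.16
"""


def parse_report(text):
    """Turn 'Label: value' lines into a {label: float} dict; skip other lines."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9]+(?:\.[0-9]+)?)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


def derived_throughput(metrics):
    """Recompute throughput from raw counts and the benchmark duration."""
    duration = metrics["Benchmark duration (s)"]
    generated = metrics["Total generated tokens"]
    total_tokens = metrics["Total input tokens"] + generated
    return {
        "req/s": metrics["Successful requests"] / duration,
        "output tok/s": generated / duration,
        "total tok/s": total_tokens / duration,
    }


if __name__ == "__main__":
    for name, value in derived_throughput(parse_report(SAMPLE)).items():
        print(f"{name}: {value:.2f}")
```

Small differences in the last digit are expected, because the report prints the duration rounded to two decimals.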

The result json files are generated in the benchmark/results directory. These files contain detailed benchmark results for further analysis.

.
|-- latency_llama8B_tp1.json
|-- serving_llama8B_tp1_qps_1.json
|-- serving_llama8B_tp1_qps_16.json
|-- serving_llama8B_tp1_qps_4.json
|-- serving_llama8B_tp1_qps_inf.json
`-- throughput_llama8B_tp1.json
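
For further analysis, the serving result files can be collected into one table. The following is a minimal sketch, assuming each `serving_*.json` file is a flat JSON object. The key names used as defaults (`request_throughput`, `mean_ttft_ms`) are assumptions about the result schema; open one of your own result files and adjust `keys` to match. `summarize_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_results(results_dir, keys=("request_throughput", "mean_ttft_ms")):
    """Collect selected metrics from every serving_*.json file in a directory.

    Keys missing from a file are reported as None rather than raising.
    """
    rows = {}
    for path in sorted(Path(results_dir).glob("serving_*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        rows[path.stem] = {key: data.get(key) for key in keys}
    return rows


if __name__ == "__main__":
    # The directory name follows the path mentioned in the text above.
    for name, metrics in summarize_results("benchmark/results").items():
        print(name, metrics)
```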