Quantization Guide

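Model quantization lowers the numerical precision of a model's weights and activations. This reduces model size and compute requirements, which saves memory and speeds up inference.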
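To make the memory saving concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 weight quantization. It illustrates the general idea only: the function name is hypothetical, and this is not the algorithm that modelslim or vLLM Ascend uses.

```python
import numpy as np


def quantize_per_tensor_int8(w):
    """Symmetric int8 quantization with one scale for the whole tensor."""
    max_abs = float(np.abs(w).max())
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 512)).astype(np.float32)

q, scale = quantize_per_tensor_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantize to measure the error

print("float32 bytes:", w.nbytes)
print("int8 bytes:   ", q.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The int8 tensor takes a quarter of the float32 storage, and the rounding error of each element is bounded by half the scale.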

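Since version 0.9.0rc2, vLLM Ascend has supported quantization as an experimental feature. Users can enable it by specifying --quantization ascend. Currently, only the Qwen and DeepSeek model series have been thoroughly tested. Support for more quantization algorithms and models is planned.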

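Install modelslim

To quantize a model, users should install ModelSlim, the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, with compression as its core technology, and it is built on the Ascend platform.

Currently, only one specific modelslim tag, modelslim-VLLM-8.1.RC1.b020_001, works with vLLM Ascend. Do not install any other version until the mainline version of modelslim supports vLLM Ascend.

Install modelslim: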

git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate

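Quantize the model

Taking DeepSeek-V2-Lite as an example, you only need to download the model and then run the conversion command shown below. For more information, see the modelslim documentation on DeepSeek W8A8 dynamic quantization.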

cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
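The --w_bit 8, --a_bit 8 and --is_dynamic True flags appear to correspond to the W8A8 dynamic scheme named above: 8-bit weights, 8-bit activations, and activation scales computed at runtime rather than calibrated in advance. The NumPy sketch below illustrates that general idea under this assumption. The function names are hypothetical, and it is not modelslim's implementation or the Ascend kernel.

```python
import numpy as np


def quantize_rows_int8(x):
    """Symmetric int8 quantization with one scale per row."""
    max_abs = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_dynamic_linear(x, w_q, w_scale):
    """Linear layer with pre-quantized weights and runtime-quantized inputs.

    x: (tokens, in) float32, w_q: (out, in) int8, w_scale: (out, 1) float32.
    """
    # "Dynamic": the per-token activation scales are computed here, per call.
    x_q, x_scale = quantize_rows_int8(x)
    # Integer matmul, accumulated in int32 to avoid overflow.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Rescale by the per-token and per-output-channel scales.
    return acc.astype(np.float32) * x_scale * w_scale.T


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 4 tokens
w = rng.standard_normal((32, 64)).astype(np.float32)  # 32 output channels

w_q, w_scale = quantize_rows_int8(w)  # done once, offline
y = w8a8_dynamic_linear(x, w_q, w_scale)
ref = x @ w.T

print("max abs difference vs float32:", float(np.abs(y - ref).max()))
```

Weights are quantized once at conversion time, while each batch of activations gets fresh scales, so no calibration pass over activations is needed in this sketch.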

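Note

You can also download the quantized model that we uploaded. Please note that these weights should be used for testing only. For example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8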

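Once the conversion completes, two important files are generated.

config.json. Make sure it does not contain a quantization_config field.
quant_model_description.json. All information about the converted weights is recorded in this file.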
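Both conditions can be checked with a few lines of Python before the model is loaded. The helper below is a hypothetical convenience and not part of modelslim or vLLM Ascend. It relies only on the two file names and the quantization_config field mentioned above.

```python
import json
from pathlib import Path


def check_quantized_dir(model_dir):
    """Return a list of problems found in a converted model directory."""
    root = Path(model_dir)
    problems = []

    config_path = root / "config.json"
    if not config_path.is_file():
        problems.append("config.json is missing")
    else:
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # The guide requires that this field is absent after conversion.
        if "quantization_config" in config:
            problems.append("config.json contains a quantization_config field")

    if not (root / "quant_model_description.json").is_file():
        problems.append("quant_model_description.json is missing")

    return problems


# Usage: pass the directory given to --save_directory, for example
#   print(check_quantized_dir("/path/to/quantized_model"))
# An empty list means both checks passed.
```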

The complete set of files for the converted model is listed below:

.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
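A quick way to confirm that no weight shard was lost while copying the directory is to compare the index file against the files on disk. The sketch below assumes the index follows the usual Hugging Face sharded-safetensors layout, with a top-level "weight_map" object mapping tensor names to shard file names. That layout is an assumption here, not something the modelslim documentation cited above confirms, and the helper name is hypothetical.

```python
import json
from pathlib import Path

INDEX_NAME = "quant_model_weight_w8a8_dynamic.safetensors.index.json"


def missing_shards(model_dir, index_name=INDEX_NAME):
    """Return shard file names listed in the index but absent from model_dir."""
    root = Path(model_dir)
    index = json.loads((root / index_name).read_text(encoding="utf-8"))
    # Assumed layout: {"weight_map": {"<tensor name>": "<shard file>", ...}}
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]


# Usage:
#   print(missing_shards("/path/to/quantized_model"))
# An empty list means every referenced shard is present.
```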

Run the model

You can now run the quantized model with vLLM Ascend. Examples of offline and online inference follow.

Offline inference

import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Online inference

# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
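Once the server is up, it can be queried through vLLM's OpenAI-compatible completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on the default address http://localhost:8000, and the helper names are hypothetical. Adjust BASE_URL if you passed --host or --port.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"    # assumption: default vllm serve address
MODEL_NAME = "deepseek-v2-lite-w8a8"  # must match --served-model-name


def build_completion_request(prompt, max_tokens=32):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage, with the server from the command above running:
#   print(complete("The future of AI is"))
```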

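FAQs

1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you have specified the ascend quantization method. Second, check that your model was converted with the modelslim-VLLM-8.1.RC1.b020_001 version of modelslim. If it still does not work, please submit an issue, since some new models may need to be adapted.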

2. How to solve the error "Could not locate the configuration_deepseek.py"?

Convert DeepSeek series models with the modelslim-VLLM-8.1.RC1.b020_001 version of modelslim. This version fixes the missing configuration_deepseek.py error.