# td

**Repository Path**: ukSir/td

## Basic Information

- **Project Name**: td
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-05-09
- **Last Updated**: 2026-05-09

## README

---
license: mit
language:
  - en
  - zh
hardware: NPU
---


# Qwen3.6-27B on vLLM-Ascend 0.18.0rc1

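## 1. Introduction

This document records the quick deployment and verification results for `Qwen3.6-27B` on `vLLM-Ascend 0.18.0rc1`. The overall deployment procedure follows the official `Qwen3.5-27B` tutorial. For `Qwen3.6-27B`, two additional points were verified:

- The MTP method changes to `qwen3_next_mtp`
- The optional request-level setting `preserve_thinking`

According to the model configuration, `Qwen3.6-27B` currently still runs through the `Qwen3_5ForConditionalGeneration` inference path. It is therefore largely compatible with the `Qwen3.5-27B` deployment procedure and can be migrated and verified quickly.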

Download locations:

- Weights (ModelScope): <https://modelscope.cn/models/Qwen/Qwen3.6-27B>
- Weights (HuggingFace): <https://huggingface.co/Qwen/Qwen3.6-27B>
- Docker image (vLLM-Ascend 0.18.0rc1): `quay.io/ascend/vllm-ascend:v0.18.0rc1`

Reference documentation:

- <https://docs.vllm.ai/projects/ascend/zh-cn/v0.18.0/tutorials/models/Qwen3.5-27B.html>
- <https://docs.vllm.ai/projects/ascend/zh-cn/v0.18.0/developer_guide/Design_Documents/ACL_Graph.html>

## 2. Verification Environment

| Component | Version |
| --- | --- |
| `vllm-ascend` | `0.18.0rc1` |
| `vllm` | `0.18.0+empty` |
| `transformers` | `4.57.6` |
| `torch-npu` | `2.9.0.post1+gitee7ba04` |

- NPU: `2` logical devices
- Model path: `/mnt/weight/Qwen3.6-27B`
- Service port: `8000`

## 3. Starting the Service

Before starting, you can check whether the port is already in use:

```bash
ss -lntp | grep ':8000 ' || true
```

Launch command that passed verification:

```bash
export ASCEND_RT_VISIBLE_DEVICES=14,15
export VLLM_USE_MODELSCOPE=true
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export TASK_QUEUE_ENABLE=1

vllm serve /mnt/weight/Qwen3.6-27B \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 1 \
  --tensor-parallel-size 2 \
  --seed 1024 \
  --served-model-name qwen3.6-27b \
  --max-num-seqs 32 \
  --max-model-len 133000 \
  --max-num-batched-tokens 8096 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2,"disable_padded_drafter_batch":false}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY","cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96]}' \
  --additional-config '{"enable_cpu_binding":true}'
```
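
Model loading and ACL Graph capture can take a while, so it may help to wait until the API responds before sending any requests. The sketch below polls `/v1/models` on the service port. The helper name `wait_for_server`, the retry count, and the polling interval are illustrative assumptions, not part of the verified procedure.

```bash
# Hypothetical helper: poll the OpenAI-compatible endpoint until it answers.
# Usage: wait_for_server URL [MAX_ATTEMPTS] [INTERVAL_SECONDS]
wait_for_server() {
  url=$1
  max_attempts=${2:-120}
  interval=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failures, --max-time: do not hang
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "server ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$interval"
  done
  echo "server not ready after $max_attempts attempt(s)" >&2
  return 1
}

# Example (run after launching `vllm serve` in another shell):
# wait_for_server http://127.0.0.1:8000/v1/models
```

With the defaults assumed here, the helper gives up after roughly ten minutes plus the per-request timeouts; adjust both values to the startup time observed in your environment.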

## 4. Smoke Test

Basic checks:

```bash
curl -sf http://127.0.0.1:8000/v1/models
curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "用一句中文说明 TCP 和 UDP 的核心区别。"}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
```

Results:

- `/v1/models` returns `200`
- `/v1/chat/completions` returns `200`
- The `reasoning` field is returned correctly

`preserve_thinking` was also verified to take effect as an optional setting. Example:

```json
{
  "extra_body": {
    "chat_template_kwargs": {
      "preserve_thinking": true
    }
  }
}
```

Observed behavior:

- With `preserve_thinking=false`: `prompt_tokens=71`
- With `preserve_thinking=true`: `prompt_tokens=100`
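
The `extra_body` wrapper in the JSON above is the OpenAI Python client's way of passing non-standard fields. When calling the HTTP endpoint directly with `curl`, vLLM's OpenAI-compatible server takes `chat_template_kwargs` as a top-level field of the request body. The following is a minimal sketch under that assumption. The helper name `build_payload` and the message content are illustrative, and this single-turn payload is not the request that produced the token counts above.

```bash
# Hypothetical helper: print a chat request body with preserve_thinking set.
# $1 must be the JSON literal true or false.
build_payload() {
  cat <<EOF
{
  "model": "qwen3.6-27b",
  "messages": [
    {"role": "user", "content": "Summarize the difference between TCP and UDP."}
  ],
  "max_tokens": 128,
  "chat_template_kwargs": {"preserve_thinking": $1}
}
EOF
}

# Example (requires the running service on port 8000):
# build_payload true | curl -sf http://127.0.0.1:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @-
```

To compare the two settings, send the same multi-turn conversation once with `true` and once with `false`, then compare `usage.prompt_tokens` in the two responses.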

## 5. Performance Reference

Test conditions: `8k input / 1k output / concurrency=8`, run twice in succession. The figures below are from the second run.

| Metric | Value |
| --- | --- |
| `duration` | `121.858 s` |
| `request_throughput` | `0.263 req/s` |
| `output_throughput` | `262.601 tok/s` |
| `total_token_throughput` | `2363.408 tok/s` |
| `mean_ttft_ms` | `3036.530 ms` |
| `median_ttft_ms` | `1563.622 ms` |
| `p99_ttft_ms` | `11261.911 ms` |
| `mean_tpot_ms` | `26.849 ms` |
| `median_tpot_ms` | `26.519 ms` |
| `p99_tpot_ms` | `35.933 ms` |
| `spec_decode_acceptance_rate` | `93.297%` |
| `spec_decode_acceptance_length` | `2.866` |

For load testing, specify the tokenizer explicitly:

```bash
--tokenizer /mnt/weight/Qwen3.6-27B
```

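## 6. Accuracy Evaluation

`EvalScope` was used to run `3` rounds of accuracy evaluation on `AIME26`.

| Metric | Value |
| --- | --- |
| Dataset | `AIME26` |
| Evaluation tool | `EvalScope` |
| Rounds | `3` |
| Samples per round | `30` |
| Average score | `93.3` |
| Best round | `round3` |
| Best score | `96.7` |
| Reference score | `94.1` |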


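## 7. Notes

`qwen3_next_mtp` is the easiest place to trip up in the current environment.

If you only change the `qwen3_5_mtp` setting used for `Qwen3.5-27B` to `qwen3_next_mtp` but keep the default automatic graph-bucket configuration, the service may fail during ACL Graph capture. The observed failure signature is:

- Key error: `KeyError: 90`
- Location: `/vllm-workspace/vllm-ascend/vllm_ascend/attention/attention_v1.py`
- Final error: `RuntimeError: Engine core initialization failed`

The cause is not the weights or the `transformers` version. With `ACL Graph + MTP + num_speculative_tokens > 1`, the default graph buckets may not cover the actual decode cases.

The workaround that works in the current environment is to specify the capture sizes explicitly: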

```text
cudagraph_capture_sizes=[3,6,9,12,...,96]
```
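
The explicit list consists of the multiples of 3 up to 96, which matches `num_speculative_tokens + 1 = 3` tokens per request across up to `--max-num-seqs 32` requests. Assuming that relationship is what the buckets need to cover (an inference from the numbers in the launch command, not a documented rule), the list can be generated rather than typed by hand. `KeyError: 90` would be consistent with this reading, since 90 = 30 × 3. The function name `capture_sizes` is illustrative.

```bash
# Hypothetical helper: print capture sizes as a JSON array.
# Usage: capture_sizes NUM_SPECULATIVE_TOKENS MAX_NUM_SEQS
capture_sizes() {
  step=$(( $1 + 1 ))          # tokens per request in one decode step (assumed)
  out=""
  n=1
  while [ "$n" -le "$2" ]; do
    # Append n * step, inserting a comma before every entry except the first
    out="${out:+$out,}$(( n * step ))"
    n=$(( n + 1 ))
  done
  printf '[%s]\n' "$out"
}

capture_sizes 2 32
```

With `num_speculative_tokens=2` and `max-num-seqs=32` this prints the same list, `[3,6,9,...,96]`, that appears in the launch command. If either value changes, the list would need to be regenerated and the service re-verified.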

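In addition, `preserve_thinking` is a request-level parameter, so the service launch command does not need to change. For multi-turn agent scenarios, verify its effect and enable it as needed. For pure-throughput scenarios, keep the default (off).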