
Qwen/Qwen3-235B-A22B-Instruct-2507

Flagship Qwen3 MoE instruct model with 235B total and 22B active parameters, tuned for high-quality text generation.

MoE · 235B total / 22B active parameters · 262,144 context · vLLM 0.10.0+ · text

Overview

Qwen3-235B-A22B-Instruct-2507 is the flagship instruct MoE model in the Qwen3 series, with 235B total parameters and 22B active per token. This guide covers deploying it efficiently with vLLM on NVIDIA and AMD GPUs.
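
As a back-of-envelope sizing check (a sketch, not an official requirement), weight memory alone is roughly parameter count times bytes per parameter; KV cache and activations come on top, which is why the minimum VRAM figures below exceed these numbers:

```shell
# Approximate weight-only memory for 235B parameters:
# 2 bytes/param for BF16, 1 byte/param for FP8.
awk 'BEGIN {
  params = 235e9
  printf "BF16 weights: %.0f GB\n", params * 2 / 1e9
  printf "FP8 weights:  %.0f GB\n", params * 1 / 1e9
  printf "per-GPU (BF16, tp=4): %.1f GB\n", params * 2 / 4 / 1e9
}'
```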

Prerequisites

NVIDIA CUDA (pip)

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto

AMD ROCm (pip)

Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. Use the Docker flow if your environment is incompatible.

uv venv
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
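
If the wheel fails to install or import, compare your interpreter and glibc versions against the note above. A quick check (the glibc line is skipped quietly on systems without `ldd`):

```shell
# The ROCm wheel needs Python 3.12
python3 -c 'import sys; print("python %d.%d" % sys.version_info[:2])'
# glibc must be >= 2.35; first line of ldd --version reports it
command -v ldd >/dev/null && ldd --version | head -n 1 || true
```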

Deployment Configurations

BF16 on MI300X/MI325X/MI355X (4 GPUs)

HIP_VISIBLE_DEVICES="4,5,6,7" \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --trust-remote-code \
  -tp 4 \
  --disable-log-requests \
  --swap-space 32 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 32768 \
  --max-model-len 32768 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8

FP8 on MI300X/MI325X/MI355X (4 GPUs)

HIP_VISIBLE_DEVICES="4,5,6,7" \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --trust-remote-code \
  -tp 4 \
  --disable-log-requests \
  --swap-space 16 \
  --distributed-executor-backend mp \
  --max-num-batched-tokens 32768 \
  --max-model-len 32768 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8

Client Usage

To load-test the running server with synthetic random prompts, use vLLM's built-in benchmark client:

vllm bench serve \
  --model "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8" \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --request-rate 10000 \
  --num-prompts 16 \
  --ignore-eos \
  --trust-remote-code
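
For an ordinary interactive request, the server also exposes an OpenAI-compatible HTTP API. A minimal sketch, assuming the default port 8000 and that the "model" field matches the name passed to vllm serve:

```shell
# Query the OpenAI-compatible endpoint of a running vllm serve instance.
# Adjust host, port, and model name to match your deployment.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    "messages": [{"role": "user", "content": "Summarize MoE models in one line."}],
    "max_tokens": 128
  }'
```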

Troubleshooting

  • Tune --max-num-batched-tokens and --max-model-len to fit memory constraints on smaller nodes.
  • If the server OOMs at startup, lower --gpu-memory-utilization (e.g. 0.8) and disable prefix caching with --no-enable-prefix-caching.
  • On AMD GPUs, set VLLM_ROCM_USE_AITER=1 for best performance.
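
To reason about --gpu-memory-utilization, multiply it by the card's VRAM; vLLM budgets weights plus KV cache within that fraction. Illustrative arithmetic (192 GB is the MI300X figure; substitute your GPU):

```shell
# Memory budget vLLM targets at utilization 0.8 on a 192 GB GPU
awk 'BEGIN { printf "target: %.1f GB of 192 GB\n", 192 * 0.8 }'
```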

Configuration Matrix
Variant   Precision   Min VRAM   Notes
Default   BF16        564 GB     Full-precision BF16; requires 4x H200 or 8x MI300X/MI325X/MI355X
FP8       FP8         240 GB     Qwen official FP8 checkpoint for improved efficiency on SM90+
NVFP4     NVFP4       141 GB     NVIDIA NVFP4 quantized weights for Blackwell GPUs