
meta-llama/Llama-4-Scout-17B-16E-Instruct

Llama 4 Scout 17B-16E MoE model with NVIDIA FP8/FP4 variants; fits on a single GPU with quantization.

MoE · 109B total / 17B active · 10,485,760-token context · vLLM 0.12.0+ · text

Overview

Llama 4 Scout is Meta's MoE model with 17B active parameters across 16 experts (109B total). NVIDIA provides FP8 and FP4 quantized variants. With FP4 quantization, the model fits on a single B200 GPU — making it one of the most accessible MoE models.
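
As a rough sanity check on the sizing in the Configuration Matrix below: weight memory scales with the 109B total parameter count, not the 17B active parameters. A back-of-the-envelope estimate in Python, assuming weights dominate and ignoring KV cache and activation overhead:

# Weights-only memory estimate for a 109B-parameter model.
# Real deployments need extra headroom for KV cache, activations,
# and runtime buffers, so the Configuration Matrix figures are higher.
TOTAL_PARAMS = 109e9

bytes_per_param = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = TOTAL_PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB of weights")
# BF16: ~203 GiB, FP8: ~102 GiB, NVFP4: ~51 GiB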

Prerequisites

  • Hardware: 1x B200 (FP4), 1x H100 (FP8), or 4x H100-class GPUs (BF16; see the Configuration Matrix below)
  • vLLM >= 0.12.0
  • CUDA Driver >= 575
  • Docker with NVIDIA Container Toolkit (recommended)
  • License: Must agree to Meta's Llama 4 Scout Community License
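
The client example below assumes an OpenAI-compatible vLLM server is already running locally. A minimal launch sketch for the FP8 variant on a single GPU (the reduced context length is illustrative; serving the full 10M-token window requires far more KV-cache memory):

# Single-GPU FP8 serving; swap the model ID for the NVFP4 variant on Blackwell.
vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
    --max-model-len 8192 \
    --tensor-parallel-size 1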

Client Usage

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
# vLLM does not check the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain MoE models briefly."}],
)
print(response.choices[0].message.content)

Troubleshooting

FP4 only works on Blackwell: FP4 quantization requires compute capability 10.0 (B200/GB200). Use FP8 on Hopper.
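
A quick way to check whether the current GPU can take the FP4 path is to read its compute capability with PyTorch; a minimal sketch (10.0 is Blackwell B200/GB200, 9.0 is Hopper H100):

import torch

# Compute capability as a (major, minor) tuple for GPU 0.
# (10, 0) = Blackwell B200/GB200 -> FP4 and FP8 both work.
# (9, 0)  = Hopper H100          -> use the FP8 variant instead.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("NVFP4 supported" if major >= 10 else "Use FP8 or BF16 on this GPU")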

TP=1 recommended for best throughput: keeping tensor parallelism at TP=1 maximizes throughput per GPU. Increase TP to 2/4/8 when per-request latency matters more than aggregate throughput; see the example below.
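
For example, a latency-oriented launch (an illustrative command line, not from the original guide) would raise the TP degree:

vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --tensor-parallel-size 4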

Configuration Matrix

Variant   Precision   Min VRAM   Notes
Default   BF16        262 GB     Full precision BF16
FP8       FP8         131 GB     NVIDIA FP8 quantization for Hopper and Blackwell; fits on 1x H100
NVFP4     NVFP4       65 GB      NVIDIA NVFP4 quantized weights for Blackwell GPUs