
meta-llama/Llama-4-Scout-17B-16E-Instruct

Llama 4 Scout 17B-16E MoE model with NVIDIA FP8/FP4 variants; fits on a single GPU with quantization.

MoE · 109B total / 17B active · 10,485,760-token context · vLLM 0.12.0+ · text

Overview

Llama 4 Scout is Meta's MoE model with 17B active parameters across 16 experts (109B total). NVIDIA provides FP8 and FP4 quantized variants. With FP4 quantization, the model fits on a single B200 GPU — making it one of the most accessible MoE models.
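
As a rough sanity check on the sizing in the Configuration Matrix below: weight memory scales with the 109B total parameter count, not the 17B active parameters. A back-of-the-envelope estimate in Python, assuming weights dominate and ignoring KV cache and activation overhead:

# Weights-only memory estimate for a 109B-parameter model.
# Real deployments need extra headroom for KV cache, activations,
# and runtime buffers, so the Configuration Matrix figures are higher.
TOTAL_PARAMS = 109e9

bytes_per_param = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = TOTAL_PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB of weights")
# BF16: ~203 GiB, FP8: ~102 GiB, NVFP4: ~51 GiB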

Prerequisites

  • Hardware: 1x B200 (FP4), 1x H100 (FP8), or 4x H100-class GPUs (BF16; see the Configuration Matrix below)
  • vLLM >= 0.12.0
  • CUDA Driver >= 575
  • Docker with NVIDIA Container Toolkit (recommended)
  • License: Must agree to Meta's Llama 4 Scout Community License
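
The client example below assumes an OpenAI-compatible vLLM server is already running locally. A minimal launch sketch for the FP8 variant on a single GPU (the reduced context length is illustrative; serving the full 10M-token window requires far more KV-cache memory):

# Single-GPU FP8 serving; swap the model ID for the NVFP4 variant on Blackwell.
vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \
    --max-model-len 8192 \
    --tensor-parallel-size 1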

Client Usage

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
# vLLM does not check the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="nvidia/Llama-4-Scout-17B-16E-Instruct-FP8",
    messages=[{"role": "user", "content": "Explain MoE models briefly."}],
)
print(response.choices[0].message.content)

Troubleshooting

FP4 only works on Blackwell: FP4 quantization requires compute capability 10.0 (B200/GB200). Use FP8 on Hopper.
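
A quick way to check whether the current GPU can take the FP4 path is to read its compute capability with PyTorch; a minimal sketch (10.0 is Blackwell B200/GB200, 9.0 is Hopper H100):

import torch

# Compute capability as a (major, minor) tuple for GPU 0.
# (10, 0) = Blackwell B200/GB200 -> FP4 and FP8 both work.
# (9, 0)  = Hopper H100          -> use the FP8 variant instead.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("NVFP4 supported" if major >= 10 else "Use FP8 or BF16 on this GPU")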

TP=1 recommended for best throughput: keeping tensor parallelism at TP=1 maximizes throughput per GPU. Increase TP to 2/4/8 when per-request latency matters more than aggregate throughput; see the example below.
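
For example, a latency-oriented launch (an illustrative command line, not from the original guide) would raise the TP degree:

vllm serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --tensor-parallel-size 4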

Configuration Matrix

Variant   Precision   Min VRAM   Notes
Default   BF16        262 GB     Full precision BF16
FP8       FP8         131 GB     NVIDIA FP8 quantization for Hopper and Blackwell; fits on 1x H100
NVFP4     NVFP4       65 GB      NVIDIA NVFP4 quantized weights for Blackwell GPUs