Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X

5.6x throughput improvement over baseline autoregressive serving
90 tok/s β†’ 508 tok/s on the same hardware, same model, zero quality loss


Performance

Throughput Scaling

Throughput scaling chart showing 90 to 508 tok/s

Head-to-Head: DFlash vs Autoregressive

| Concurrency | Autoregressive (baseline) | DFlash st=2 (this config) | Speedup |
|---|---|---|---|
| 8 users | 90.4 tok/s | 127.1 tok/s | 1.4x |
| 12 users | 125.1 tok/s | 192.8 tok/s | 1.5x |
| 16 users | — | 250.8 tok/s | — |
| 24 users | — | 379.0 tok/s | — |
| 32 users | — | 507.6 tok/s | 5.6x |

The autoregressive baseline runs with max_num_seqs=8 (see the optimization table below), so it was not measured beyond 12 users; the headline 5.6x compares DFlash at 32 users against the 8-user baseline.

All measurements: no prefix cache, warmed server, 512 max tokens, temperature=0, prompts from a diverse reasoning benchmark set. Latency is flat at ~30s regardless of concurrency.

Per-User Latency

Latency stays flat as concurrency scales

| Concurrent users | Mean latency | P95 latency | Per-user tok/s |
|---|---|---|---|
| 8 | 31.0 s | 31.3 s | 15.9 |
| 16 | 30.8 s | 31.1 s | 15.7 |
| 24 | 30.0 s | 30.4 s | 15.8 |
| 32 | 30.7 s | 31.0 s | 15.9 |

Latency does not degrade as concurrency increases. Each user gets a consistent ~15.8 tok/s regardless of how many others are being served.
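
As a sanity check, per-user rate times concurrency reproduces the aggregate throughput numbers to within rounding:

```python
# Aggregate throughput = per-user rate x concurrency, using the measured
# per-user rates from the latency table above.
per_user = {8: 15.9, 16: 15.7, 24: 15.8, 32: 15.9}
aggregate = {users: round(users * rate, 1) for users, rate in per_user.items()}
# aggregate[32] -> 508.8, matching the measured 507.6 tok/s
```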


What is this?

A production-ready serving configuration for moonshotai/Kimi-K2.6 using DFlash speculative decoding with the z-lab/Kimi-K2.5-DFlash draft model, optimized for AMD MI300X GPUs.

This is not a new model β€” it's an optimized serving recipe. The model weights are unchanged. Output quality is identical to standard autoregressive serving.

Three optimizations that delivered 5.6x

Optimization journey from 90 to 508 tok/s

| Optimization | Before | After | Impact |
|---|---|---|---|
| NUMA balancing | Enabled | Disabled | Removed memory-access bottleneck across NUMA domains |
| DFlash spec tokens | 8 | 2 | Acceptance rate 16% → 50%; DFlash went from net-negative to net-positive |
| max_num_seqs | 8 | 32 | Linear throughput scaling: each slot adds ~15.8 tok/s |

Hardware

Hardware and software stack

| Component | Specification |
|---|---|
| GPU | 8x AMD Instinct MI300X |
| GPU architecture | CDNA 3 (gfx942) |
| VRAM per GPU | 192 GB HBM3 |
| Total VRAM | 1,536 GB (1.5 TB) |
| System RAM | ~2 TB |
| Storage | 14 TB NVMe, model on local disk |
| Runtime | vLLM v0.19.2 (ROCm nightly) |
| ROCm version | 6.x |

Model Specifications

| | Target model | Draft model |
|---|---|---|
| Name | moonshotai/Kimi-K2.6 | z-lab/Kimi-K2.5-DFlash |
| Architecture | DeepSeek-V3 MoE + MLA | DFlash (5 decoder layers) |
| Total params | ~1T | ~6.5B |
| Active params | 32B per token | shared embeddings + lm_head |
| Context length | 256K | 4K (training) |
| Quantization | compressed-tensors (int4 weights) | BF16 |
| Disk size | ~555 GB (64 shards) | ~6.5 GB |
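
As a rough consistency check on the disk size (back-of-envelope only, ignoring sharding overhead):

```python
# ~1T params at int4 (0.5 bytes/param) is ~500 GB; quantization scales and
# the BF16 embedding/lm_head plausibly account for the remaining ~55 GB.
params = 1.0e12
int4_gb = params * 0.5 / 1e9   # 500.0 GB
```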

Quick Start

1. Download models

```shell
# Target model (~555 GB)
huggingface-cli download moonshotai/Kimi-K2.6 --local-dir /models/Kimi-K2.6

# Draft model (~6.5 GB)
huggingface-cli download z-lab/Kimi-K2.5-DFlash --local-dir /models/Kimi-K2.5-DFlash
```

2. Configure

Edit configs/production.env:

```shell
MODEL_DIR=/models/Kimi-K2.6
DRAFT_MODEL_DIR=/models/Kimi-K2.5-DFlash
```

3. Disable NUMA balancing (required)

```shell
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```

4. Launch

```shell
./serve.sh
```

The server takes ~5 minutes to load. Once it is ready:

```shell
curl http://localhost:8262/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.6-amd-dflash",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis"}],
    "max_tokens": 512,
    "temperature": 0
  }'
```
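
The same request from Python, using only the standard library (the endpoint and model name match the defaults used throughout this README):

```python
import json
from urllib import request

# Equivalent of the curl call above against the OpenAI-compatible endpoint.
payload = {
    "model": "kimi-k2.6-amd-dflash",
    "messages": [{"role": "user", "content": "Explain the Riemann hypothesis"}],
    "max_tokens": 512,
    "temperature": 0,
}

def chat(base_url: str = "http://localhost:8262/v1") -> str:
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```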

5. Benchmark

```shell
# Single-shot throughput benchmark
python3 payload/benchmark_multi_turn.py \
  --base-url http://localhost:8262/v1 \
  --model kimi-k2.6-amd-dflash \
  --sessions 32 --turns-per-session 1 \
  --max-tokens 512

# Compare against autoregressive baseline:
# Launch without DFlash (remove --speculative-config, set --block-size 1)
# and run the same benchmark
```

How DFlash Works

```
Standard Autoregressive          DFlash Speculative (st=2)
=======================          =========================

Step 1: Generate token 1         Step 1: Draft predicts tokens 1,2
Step 2: Generate token 2         Step 2: Target verifies both in ONE pass
Step 3: Generate token 3           -> If both accepted: got 2 tokens for ~1 step
Step 4: Generate token 4           -> If only token 1 accepted: got 1 token
...                              Step 3: Draft predicts tokens 3,4
                                 Step 4: Target verifies...

4 tokens = 4 forward passes      4 tokens = 2-3 forward passes
```

The draft model (Kimi-K2.5-DFlash, 6.5 GB) is ~85x smaller than the target and runs in <1% of the target's compute time. When its predictions match the target's (45-67% acceptance at st=2, ~50% on average), the extra tokens come essentially for free.
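
The verify-and-accept loop can be sketched as follows (a toy illustration with stand-in draft/target callables, not the vLLM implementation; the bonus token emitted on full acceptance is omitted for brevity):

```python
# One greedy speculative step at st=2. `draft` proposes st cheap tokens;
# `target` returns its own greedy choice at each proposed position from a
# single forward pass. Matching prefix tokens are accepted; the first
# mismatch is replaced by the target's token and ends the step.
def speculative_step(draft, target, ctx, st=2):
    proposed = draft(ctx, st)          # st draft tokens
    verified = target(ctx, proposed)   # target's token at each position
    emitted = []
    for d, t in zip(proposed, verified):
        emitted.append(t)              # target's token is always correct
        if d != t:                     # first mismatch ends the step
            break
    return emitted                     # 1..st tokens per target pass
```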

Why st=2 instead of st=8?

Acceptance rate comparison: st=8 vs st=2

The public drafter was trained for K2.5, not K2.6. The model mismatch causes acceptance to drop sharply at later positions:

| Spec tokens | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4-7 | Avg acceptance | Net effect |
|---|---|---|---|---|---|---|---|
| 2 | 64% | 34% | — | — | — | 49% | +40% throughput |
| 8 | 64% | 34% | 18% | 9% | <3% | 16% | -20% throughput |

At st=8, the target model wastes compute verifying 6 tokens that will almost certainly be rejected. At st=2, every verification step has a ~50% chance of yielding a free token.
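
A back-of-envelope model makes the same point numerically (assuming independent per-position acceptance and counting the token the target contributes each step): raw tokens per pass barely differ between st=2 and st=8, while verification cost scales with st.

```python
# Expected tokens emitted per verify pass: each pass yields at least one
# token; draft position i only adds a token if all earlier positions were
# accepted. Rates are the per-position acceptance numbers from the table.
def expected_tokens(rates):
    exp, survive = 1.0, 1.0
    for r in rates:
        survive *= r
        exp += survive
    return exp

st2 = expected_tokens([0.64, 0.34])                            # ~1.86
st8 = expected_tokens([0.64, 0.34, 0.18, 0.09] + [0.03] * 4)   # ~1.90
# st=8 verifies 4x more positions for ~2% more tokens per pass.
```

Under this rough model, each verified position yields ~0.93 tokens at st=2 but only ~0.24 at st=8, consistent with the observed net-negative effect.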


ROCm Patches

DFlash requires nine patches to run on ROCm with MLA attention. They are applied automatically at container startup by patches/patch_dflash_rocm.py and fall into four groups:

  1. Add non-causal attention support to the AITER flash-attention backend
  2. Force the TRITON_MLA backend for the target model when the DFlash draft uses standard attention
  3. Add an IS_CAUSAL parameter to the Triton unified attention kernels
  4. Relax causal assertions in the DFlash verification path
All patches are idempotent and track upstream vllm-project/vllm#39930.


Configuration Reference

```shell
# configs/production.env — all tunable parameters

NUM_SPECULATIVE_TOKENS=2      # DFlash draft tokens per step
MAX_NUM_SEQS=32               # Max concurrent decode sequences
MAX_NUM_BATCHED_TOKENS=32768  # Max tokens per scheduler step
MAX_MODEL_LEN=262144          # Max context length (256K)
GPU_MEMORY_UTILIZATION=0.90   # Fraction of VRAM vLLM may use (weights + KV cache)
BLOCK_SIZE=16                 # Required for DFlash + MLA
ENFORCE_EAGER=true            # Compiled mode provides no gain
MOE_BACKEND=aiter             # AMD's optimized MoE kernels
```
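
For reference, this is roughly how a launcher might translate those variables into vLLM CLI flags (a hedged sketch: the actual serve.sh and the exact speculative-config schema may differ; the flag names shown are standard vLLM options):

```python
import json

# Hypothetical env-to-flags mapping; serve.sh may assemble these differently.
env = {
    "NUM_SPECULATIVE_TOKENS": "2",
    "MAX_NUM_SEQS": "32",
    "MAX_NUM_BATCHED_TOKENS": "32768",
    "MAX_MODEL_LEN": "262144",
    "GPU_MEMORY_UTILIZATION": "0.90",
    "BLOCK_SIZE": "16",
}

spec_config = {
    "model": "/models/Kimi-K2.5-DFlash",
    "num_speculative_tokens": int(env["NUM_SPECULATIVE_TOKENS"]),
}

args = [
    "vllm", "serve", "/models/Kimi-K2.6",
    "--max-num-seqs", env["MAX_NUM_SEQS"],
    "--max-num-batched-tokens", env["MAX_NUM_BATCHED_TOKENS"],
    "--max-model-len", env["MAX_MODEL_LEN"],
    "--gpu-memory-utilization", env["GPU_MEMORY_UTILIZATION"],
    "--block-size", env["BLOCK_SIZE"],
    "--enforce-eager",
    "--speculative-config", json.dumps(spec_config),
]
```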

Known Constraints

| Constraint | Root cause | Workaround |
|---|---|---|
| max_num_batched_tokens capped at 32768 | AITER MoE kernel grid overflow at 384 experts × large batches | Stay at 32768 |
| K2.5 drafter acceptance ~50% | Model version mismatch (drafter trained for K2.5) | Train a K2.6-specific drafter (see below) |

FP8 KV Cache: 901 tok/s (updated numbers)

FP8 KV cache halves KV memory (8 bits vs 16 bits per element). Measured KV capacity: 2,469,568 tokens, up from 1,230,368 with BF16 (2.01x). The extra headroom enables max_num_seqs=64, pushing aggregate throughput to 901 tok/s, 1.77x over the BF16 peak.

Head-to-Head: BF16 vs FP8 KV

| Concurrent users | BF16 KV (seqs=32) | FP8 KV (seqs=64) | Speedup |
|---|---|---|---|
| 8 | 127.1 tok/s | — | — |
| 16 | 250.8 tok/s | — | — |
| 24 | 379.0 tok/s | — | — |
| 32 | 507.6 tok/s | 394.6 tok/s | 0.78x |
| 48 | — | 593.6 tok/s | — |
| 64 | — | 900.9 tok/s | 1.77x |

At matched concurrency (c=32), FP8 is ~22% slower per slot due to dynamic scale computation overhead. But FP8 enables 2x more concurrent sequences, and aggregate throughput at c=64 is 1.77x the BF16 peak.
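
The trade-off in numbers, taken straight from the table above:

```python
# Per-slot penalty vs aggregate gain for FP8 KV.
bf16_c32, fp8_c32, fp8_c64 = 507.6, 394.6, 900.9

per_slot_penalty = 1 - fp8_c32 / bf16_c32   # ~0.22: 22% slower per slot at c=32
aggregate_gain = fp8_c64 / bf16_c32         # ~1.77x at c=64
```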

The FP8 scale problem (and fix)

The Kimi-K2.6 checkpoint has no pre-computed FP8 KV scales. Without them, vLLM defaults to scale=1.0, which clips KV values in FP8 E4M3 range and produces degenerate output (vllm#13133, vllm#27364).

Our fix: a runtime patch to the MLA `do_kv_cache_update` path that computes scales dynamically from each batch's actual KV data using a running-max approach. The scale converges after the first few requests and stays stable. Calibration with 200 diverse prompts (51K tokens) confirmed the converged scale range: 0.026-0.068.
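
The idea can be sketched in a few lines (an illustrative reconstruction of the approach, not the patch's actual code; 448 is the largest representable value in FP8 E4M3):

```python
# Running-max FP8 KV scale: track the largest |KV| value seen so far and
# map it onto the E4M3 range. The max only grows, so the scale converges
# after the first few requests, as observed in calibration.
E4M3_MAX = 448.0

class RunningKVScale:
    def __init__(self) -> None:
        self.amax = 0.0

    def update(self, kv_values) -> float:
        batch_amax = max(abs(v) for v in kv_values)
        self.amax = max(self.amax, batch_amax)
        return self.amax / E4M3_MAX   # scale used to quantize this batch
```

With the calibrated scale range of 0.026-0.068, this corresponds to observed KV magnitudes of roughly 12-30.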

The 384-expert AITER crash does NOT affect FP8 KV β€” that's a MoE-side issue triggered only at max_num_batched_tokens > 32768. FP8 KV is purely attention-side.

Quick start: FP8 KV

```shell
./serve.sh configs/production-fp8kv.env
```

Configs

| Config | KV dtype | MoE backend | max_num_seqs | Throughput |
|---|---|---|---|---|
| production.env | BF16 | AITER | 32 | 508 tok/s |
| production-fp8kv.env | FP8 | AITER | 64 | 901 tok/s |

Training a K2.6-Matched DFlash Drafter

The public drafter (z-lab/Kimi-K2.5-DFlash) was trained for K2.5 and gets ~50% acceptance on K2.6. A K2.6-matched drafter should reach 60-80% acceptance, making num_speculative_tokens=8 viable and roughly doubling per-slot throughput.

Architecture

The drafter is a 6-layer Qwen3-based decoder (~1.2B trainable params) that:

  • Shares embeddings and LM head with the target (frozen)
  • Reads hidden states from 6 target layers: [1, 12, 24, 35, 47, 58]
  • Projects concatenated target hidden states through an FC layer
  • Uses block-causal attention (block_size=16 for training, 8 for inference)

The config is at configs/kimi-k2.6-dflash-draft.json β€” identical to K2.5-DFlash since the architectures match.
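
Block-causal attention here means full attention within a block plus causal attention across blocks. A minimal mask sketch (illustrative only, assuming that standard definition):

```python
# True where query position i may attend to key position j: allowed iff
# j's block index does not exceed i's block index.
def block_causal_mask(seq_len: int, block_size: int):
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = block_causal_mask(4, 2)   # positions 0-1 are block 0, 2-3 are block 1
```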

Training pipeline

```shell
# Full pipeline: set up SpecForge, regenerate data with K2.6, train drafter
./train-drafter.sh

# Skip regeneration if data exists
./train-drafter.sh --skip-regen

# Skip setup + regeneration, just train
./train-drafter.sh --skip-setup
```

The pipeline uses SpecForge and runs three phases:

  1. Setup: Clone SpecForge, prepare PerfectBlend dataset (~1.16M samples)
  2. Regenerate: Run prompts through K2.6 to get target-distribution responses (hours)
  3. Train: 6-epoch DFlash training on 8x MI300X (3-6 days)

Serving with matched drafter

```shell
# After training completes:
./serve.sh configs/production-fp8kv-matched.env
```

Expected performance with matched drafter

| Metric | K2.5 drafter (current) | K2.6 drafter (matched) |
|---|---|---|
| Acceptance rate (st=2) | ~50% | ~75% |
| Acceptance rate (st=8) | ~16% | ~65% |
| Best spec tokens | 2 | 8 |
| Per-slot tok/s | 15.8 | ~25 |
| Aggregate at seqs=64 | 901 tok/s | ~1600 tok/s |
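
The projected aggregate follows directly from the per-slot estimate (assuming the linear per-slot scaling observed so far holds at seqs=64):

```python
# 64 slots at ~25 tok/s each under a matched drafter.
slots, per_slot_matched = 64, 25.0
projected = slots * per_slot_matched   # ~1600 tok/s
```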

Optimization Roadmap

| Optimization | Expected throughput | Status |
|---|---|---|
| BF16 KV, K2.5 drafter, seqs=32 | 508 tok/s | Done |
| FP8 KV, K2.5 drafter, seqs=64 | 901 tok/s | Done (updated numbers) |
| K2.6-matched DFlash drafter | ~800 tok/s at seqs=32 | Training pipeline ready |
| FP8 KV + matched drafter, seqs=64 | ~1600 tok/s | Needs matched drafter |
| DDTree draft trees | +35% on matched drafter | Research (arXiv 2604.12989) |

Repository Structure

```
kimi-k26dflash/
├── README.md                       # This file
├── serve.sh                        # Server launch (pass config as arg)
├── validate-fp8.sh                 # FP8 KV validation + benchmark
├── train-drafter.sh                # K2.6 DFlash drafter training pipeline
├── Dockerfile.kimi26-dflash        # Patch-at-build Docker image
├── build-kimi26-dflash.sh          # Docker build helper
├── configs/
│   ├── production.env              # BF16 KV, 508 tok/s (current)
│   ├── production-fp8kv.env        # FP8 KV, seqs=64, 901 tok/s
│   ├── production-fp8kv-safe.env   # FP8 KV + Triton MoE fallback
│   ├── production-fp8kv-matched.env # FP8 KV + matched drafter, ~1600 tok/s
│   └── kimi-k2.6-dflash-draft.json # DFlash drafter architecture config
├── patches/
│   └── patch_dflash_rocm.py        # 9 ROCm patches (idempotent)
├── launchers/
│   ├── kimi26-vllm-dflash.sh       # Standard launcher
│   └── kimi26-vllm-dflash-sweep.sh # Parameter sweep
├── payload/
│   ├── benchmark_multi_turn.py     # Multi-turn benchmark tool
│   ├── calibrate_kv_scales.py      # FP8 KV scale calibration
│   └── preshard_kimi26.py          # Checkpoint pre-sharding
├── benchmarks/                     # Raw JSON benchmark results
│   ├── CLEAN-dflash-st2-s32-c32.json   # 508 tok/s
│   ├── CLEAN-dflash-st2-s24-c24.json   # 379 tok/s
│   └── ...
└── docs/
    ├── kimi-k2.6-250-toks-achieved-2026-04-21.md
    ├── kimi-k2.6-acceptance-rate-analysis-2026-04-21.md
    └── kimi-k2.6-dflash-execution-playbook-2026-04-21.md
```

Citation

If you use this configuration:

```bibtex
@misc{kimi-k26-dflash-mi300x-2026,
  title={Kimi K2.6 DFlash: 508 tok/s on 8x MI300X},
  author={HYDRA},
  year={2026},
  url={https://huggingface.co/hydra/kimi-k26-dflash-mi300x}
}
```

Acknowledgments
