Kimi K2.6 + DFlash: 508 tok/s on 8x MI300X
5.6x throughput improvement over baseline autoregressive serving
90 tok/s → 508 tok/s on the same hardware, same model, zero quality loss
Performance
Throughput Scaling
Head-to-Head: DFlash vs Autoregressive
| Concurrent users | Autoregressive (baseline) | DFlash st=2 (this config) | Speedup |
|---|---|---|---|
| 8 users | 90.4 tok/s | 127.1 tok/s | 1.4x |
| 12 users | 125.1 tok/s | 192.8 tok/s | 1.5x |
| 16 users | — | 250.8 tok/s | — |
| 24 users | — | 379.0 tok/s | — |
| 32 users | — | 507.6 tok/s | 5.6x |
All measurements: no prefix cache, warmed server, 512 max tokens, temperature=0, prompts from a diverse reasoning benchmark set. Latency is flat at ~30s regardless of concurrency. The 5.6x figure compares DFlash at 32 users (507.6 tok/s) against the autoregressive baseline at its original max_num_seqs=8 (90.4 tok/s).
Per-User Latency
| Concurrent users | Mean latency | P95 latency | Per-user tok/s |
|---|---|---|---|
| 8 | 31.0s | 31.3s | 15.9 |
| 16 | 30.8s | 31.1s | 15.7 |
| 24 | 30.0s | 30.4s | 15.8 |
| 32 | 30.7s | 31.0s | 15.9 |
Latency does not degrade as concurrency increases. Each user gets a consistent ~15.8 tok/s regardless of how many others are being served.
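The two tables above are internally consistent: aggregate throughput is just concurrency times the per-user rate. A quick sanity check, with figures copied from the tables in this section:

```python
# Sanity check: aggregate throughput == concurrency x per-user rate.
# All figures are copied from the benchmark tables above.
per_user = {8: 15.9, 16: 15.7, 24: 15.8, 32: 15.9}       # per-user tok/s
aggregate = {8: 127.1, 16: 250.8, 24: 379.0, 32: 507.6}  # measured tok/s

for users, rate in per_user.items():
    predicted = users * rate
    measured = aggregate[users]
    # Prediction lands within ~1% of the measurement in every case
    assert abs(predicted - measured) / measured < 0.01, (users, predicted, measured)
    print(f"{users:2d} users: {predicted:6.1f} predicted vs {measured:6.1f} measured")
```

This is what "linear throughput scaling" means in practice: each decode slot contributes a fixed ~15.8 tok/s, so aggregate throughput is set almost entirely by how many slots you can run.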
What is this?
A production-ready serving configuration for moonshotai/Kimi-K2.6 using DFlash speculative decoding with the z-lab/Kimi-K2.5-DFlash draft model, optimized for AMD MI300X GPUs.
This is not a new model — it's an optimized serving recipe. The model weights are unchanged. Output quality is identical to standard autoregressive serving.
Three optimizations that delivered 5.6x
| What | Before | After | Impact |
|---|---|---|---|
| NUMA balancing | Enabled | Disabled | Removed memory access bottleneck across NUMA domains |
| DFlash spec tokens | 8 | 2 | Acceptance rate: 16% → 50%. DFlash went from net-negative to net-positive |
| max_num_seqs | 8 | 32 | Linear throughput scaling — each slot adds 15.8 tok/s |
Hardware
| Component | Specification |
|---|---|
| GPU | 8x AMD Instinct MI300X |
| GPU Architecture | CDNA 3 (gfx942) |
| VRAM per GPU | 192 GB HBM3 |
| Total VRAM | 1,536 GB (1.5 TB) |
| System RAM | ~2 TB |
| Storage | NVMe (14 TB), model on local disk |
| Runtime | vLLM v0.19.2 ROCm nightly |
| ROCm Version | 6.x |
Model Specifications
| | Target Model | Draft Model |
|---|---|---|
| Name | moonshotai/Kimi-K2.6 | z-lab/Kimi-K2.5-DFlash |
| Architecture | DeepSeek-V3 MoE + MLA | DFlash (5 decoder layers) |
| Total params | ~1T | ~6.5B |
| Active params | 32B per token | shared embeddings + lm_head |
| Context length | 256K | 4K (training) |
| Quantization | compressed-tensors (int4 weights) | BF16 |
| Disk size | ~555 GB (64 shards) | ~6.5 GB |
Quick Start
1. Download models
# Target model (~555 GB)
huggingface-cli download moonshotai/Kimi-K2.6 --local-dir /models/Kimi-K2.6
# Draft model (~6.5 GB)
huggingface-cli download z-lab/Kimi-K2.5-DFlash --local-dir /models/Kimi-K2.5-DFlash
2. Configure
Edit configs/production.env:
MODEL_DIR=/models/Kimi-K2.6
DRAFT_MODEL_DIR=/models/Kimi-K2.5-DFlash
3. Disable NUMA balancing (required)
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
4. Launch
./serve.sh
Server takes ~5 minutes to load. Once ready:
curl http://localhost:8262/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "kimi-k2.6-amd-dflash",
"messages": [{"role": "user", "content": "Explain the Riemann hypothesis"}],
"max_tokens": 512,
"temperature": 0
}'
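The same request can be issued from Python with only the standard library. A minimal sketch — the endpoint, port, and model name mirror the curl example above; the actual POST of course requires the server to be running:

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible chat completion payload (matches the curl example)."""
    return {
        "model": "kimi-k2.6-amd-dflash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

def chat(prompt: str, base_url: str = "http://localhost:8262/v1") -> str:
    """POST the request to the server and return the generated text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Explain the Riemann hypothesis")  # needs the server from step 4 running
```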
5. Benchmark
# Single-shot throughput benchmark
python3 payload/benchmark_multi_turn.py \
--base-url http://localhost:8262/v1 \
--model kimi-k2.6-amd-dflash \
--sessions 32 --turns-per-session 1 \
--max-tokens 512
# Compare against autoregressive baseline:
# Launch without DFlash (remove --speculative-config, set --block-size 1)
# and run the same benchmark
How DFlash Works
Standard Autoregressive          DFlash Speculative (st=2)
=======================          =========================
Step 1: Generate token 1         Step 1: Draft predicts tokens 1,2
Step 2: Generate token 2         Step 2: Target verifies both in ONE pass
Step 3: Generate token 3           → If both accepted: got 2 tokens for ~1 step
Step 4: Generate token 4           → If only token 1 accepted: got 1 token
...                              Step 3: Draft predicts tokens 3,4
                                 Step 4: Target verifies...

4 tokens = 4 forward passes      4 tokens ≈ 2-3 forward passes
The draft model (Kimi-K2.5-DFlash, 6.5 GB) is ~85x smaller than the target. It runs in <1% of the target's compute time. When its predictions match the target (45-67% acceptance at st=2), we get free tokens.
Why st=2 instead of st=8?
The public drafter was trained for K2.5, not K2.6. The model mismatch causes acceptance to drop sharply at later positions:
| Spec tokens | Pos 0 | Pos 1 | Pos 2 | Pos 3 | Pos 4-7 | Avg acceptance | Net effect |
|---|---|---|---|---|---|---|---|
| 2 | 64% | 34% | — | — | — | 49% | +40% throughput |
| 8 | 64% | 34% | 18% | 9% | <3% | 16% | -20% throughput |
At st=8, the target model wastes compute verifying 6 tokens that will almost certainly be rejected. At st=2, every verification step has a ~50% chance of yielding a free token.
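This trade-off can be reproduced with a simple chain-acceptance model: a draft token at position i survives only if every earlier draft position was also accepted, and each verification pass always yields one bonus token from the target itself. A sketch — the per-position rates are the measured values from the table above; the cost framing is a simplification:

```python
def expected_tokens_per_pass(accept_rates):
    """Expected tokens emitted per target forward pass under chain acceptance.

    A draft token at position i is kept only if all earlier positions were
    accepted; the target's verification pass always adds one bonus token.
    """
    expected, chain = 1.0, 1.0  # start at 1.0 for the bonus token
    for rate in accept_rates:
        chain *= rate           # P(all positions up to and including this one accepted)
        expected += chain
    return expected

# Measured per-position acceptance from the table above (pos 4-7 taken as 3%)
st2 = expected_tokens_per_pass([0.64, 0.34])
st8 = expected_tokens_per_pass([0.64, 0.34, 0.18, 0.09, 0.03, 0.03, 0.03, 0.03])
print(f"st=2: {st2:.2f} tokens/pass, 3 positions verified")
print(f"st=8: {st8:.2f} tokens/pass, 9 positions verified")
```

The result: st=8 yields ~1.90 tokens per pass versus ~1.86 for st=2 — an almost invisible gain in accepted tokens — while tripling the number of positions the target must verify each pass. That asymmetry is exactly why st=8 ends up net-negative here.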
ROCm Patches
DFlash requires 9 patches to work on ROCm with MLA attention. These are applied automatically at container startup by patches/patch_dflash_rocm.py. The patches:
- Add non-causal attention support to AITER flash attention backend
- Force TRITON_MLA backend for target model when DFlash draft uses standard attention
- Add IS_CAUSAL parameter to Triton unified attention kernels
- Relax causal assertions in the DFlash verification path
All patches are idempotent and track upstream vllm-project/vllm#39930.
Configuration Reference
# configs/production.env β all tunable parameters
NUM_SPECULATIVE_TOKENS=2 # DFlash draft tokens per step
MAX_NUM_SEQS=32 # Max concurrent decode sequences
MAX_NUM_BATCHED_TOKENS=32768 # Max tokens per scheduler step
MAX_MODEL_LEN=262144 # Max context length (256K)
GPU_MEMORY_UTILIZATION=0.90 # Fraction of VRAM for KV cache
BLOCK_SIZE=16 # Required for DFlash + MLA
ENFORCE_EAGER=true # Compiled mode provides no gain
MOE_BACKEND=aiter # AMD's optimized MoE kernels
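For orientation, here is how these env values might map onto a vLLM command line. This is illustrative only — the flag names are standard vLLM engine options, but the exact composition in serve.sh (and the contents of the speculative config) may differ:

```python
# Illustrative mapping from configs/production.env to vLLM CLI flags.
# serve.sh may compose the actual command differently; the speculative-config
# payload below is an assumption, not a copy of the real launcher.
import json

cfg = {
    "NUM_SPECULATIVE_TOKENS": 2,
    "MAX_NUM_SEQS": 32,
    "MAX_NUM_BATCHED_TOKENS": 32768,
    "MAX_MODEL_LEN": 262144,
    "GPU_MEMORY_UTILIZATION": 0.90,
    "BLOCK_SIZE": 16,
}

spec = {
    "model": "/models/Kimi-K2.5-DFlash",  # draft model path (hypothetical layout)
    "num_speculative_tokens": cfg["NUM_SPECULATIVE_TOKENS"],
}

args = [
    "vllm", "serve", "/models/Kimi-K2.6",
    "--max-num-seqs", str(cfg["MAX_NUM_SEQS"]),
    "--max-num-batched-tokens", str(cfg["MAX_NUM_BATCHED_TOKENS"]),
    "--max-model-len", str(cfg["MAX_MODEL_LEN"]),
    "--gpu-memory-utilization", str(cfg["GPU_MEMORY_UTILIZATION"]),
    "--block-size", str(cfg["BLOCK_SIZE"]),
    "--enforce-eager",
    "--speculative-config", json.dumps(spec),
]
print(" ".join(args))
```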
Known Constraints
| Constraint | Root cause | Workaround |
|---|---|---|
| max_num_batched_tokens capped at 32768 | AITER MoE kernel grid overflow at 384 experts × large batch | Stay at 32768 |
| K2.5 drafter acceptance ~50% | Model version mismatch (trained for K2.5) | Train K2.6-specific drafter (see below) |
FP8 KV Cache: 901 tok/s (updated numbers)
FP8 KV cache halves KV memory (8-bit vs 16-bit per element). Measured capacity: 2,469,568 tokens (up from 1,230,368 with BF16) = 2.01x. This enables max_num_seqs=64, pushing aggregate throughput to 901 tok/s β 1.77x over the BF16 baseline.
Head-to-Head: BF16 vs FP8 KV
| Concurrent users | BF16 KV (seqs=32) | FP8 KV (seqs=64) | Speedup |
|---|---|---|---|
| 8 | 127.1 tok/s | β | β |
| 16 | 250.8 tok/s | β | β |
| 24 | 379.0 tok/s | β | β |
| 32 | 507.6 tok/s | 394.6 tok/s | 0.78x |
| 48 | β | 593.6 tok/s | β |
| 64 | β | 900.9 tok/s | 1.77x |
At matched concurrency (c=32), FP8 is ~22% slower per slot due to dynamic scale computation overhead. But FP8 enables 2x more concurrent sequences, and aggregate throughput at c=64 is 1.77x the BF16 peak.
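The quoted ratios follow directly from the numbers in the tables above:

```python
# Arithmetic behind the quoted ratios (all values copied from the tables above).
bf16_peak, fp8_c32, fp8_c64 = 507.6, 394.6, 900.9   # aggregate tok/s
kv_bf16, kv_fp8 = 1_230_368, 2_469_568              # KV-cache token capacity

print(f"FP8 vs BF16 at c=32:  {fp8_c32 / bf16_peak:.2f}x")  # 0.78x per-slot penalty
print(f"FP8 c=64 vs BF16 peak: {fp8_c64 / bf16_peak:.2f}x") # 1.77x aggregate win
print(f"KV capacity gain:      {kv_fp8 / kv_bf16:.2f}x")    # 2.01x capacity
```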
The FP8 scale problem (and fix)
The Kimi-K2.6 checkpoint has no pre-computed FP8 KV scales. Without them, vLLM defaults to scale=1.0, which clips KV values in FP8 E4M3 range and produces degenerate output (vllm#13133, vllm#27364).
Our fix: a runtime patch to the MLA do_kv_cache_update that computes scales dynamically from each batch's actual KV data using a running-max approach. The scale converges after the first few requests and stays stable. Calibration with 200 diverse prompts (51K tokens) confirmed the converged scale range: 0.026β0.068.
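The running-max idea can be sketched as follows. This is an illustration of the mechanism, not the actual patch — the class and method names are invented, and real KV data is a GPU tensor rather than a list; the only fixed constant is 448, the largest magnitude FP8 E4M3 can represent:

```python
# Sketch of running-max FP8 KV scale estimation (illustrative, not the real patch).
# FP8 E4M3 represents magnitudes up to 448, so scale = running_amax / 448 maps
# every KV value seen so far into range without clipping.
E4M3_MAX = 448.0

class RunningKVScale:
    def __init__(self):
        self.running_amax = 0.0

    def update(self, kv_values):
        """Fold one batch's KV values (here: a flat list) into the running max."""
        batch_amax = max(abs(v) for v in kv_values)
        self.running_amax = max(self.running_amax, batch_amax)
        return self.running_amax / E4M3_MAX  # current dequant scale

est = RunningKVScale()
for batch in ([0.4, -9.1, 3.2], [12.0, -20.0], [5.0, 18.5]):
    scale = est.update(batch)
print(f"converged scale: {scale:.3f}")  # 20/448 ~ 0.045, inside the 0.026-0.068 band
```

Because the running max only ever grows, the scale is monotonically non-decreasing and stops moving once representative KV magnitudes have been seen — which matches the observed behavior of convergence after the first few requests.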
The 384-expert AITER crash does NOT affect FP8 KV β that's a MoE-side issue triggered only at max_num_batched_tokens > 32768. FP8 KV is purely attention-side.
Quick start: FP8 KV
./serve.sh configs/production-fp8kv.env
Configs
| Config | KV dtype | MoE backend | max_num_seqs | Throughput |
|---|---|---|---|---|
| production.env | BF16 | AITER | 32 | 508 tok/s |
| production-fp8kv.env | FP8 | AITER | 64 | 901 tok/s |
Training a K2.6-Matched DFlash Drafter
The public drafter (z-lab/Kimi-K2.5-DFlash) was trained for K2.5 and gets ~50% acceptance on K2.6. A K2.6-matched drafter should reach 60-80% acceptance, making num_speculative_tokens=8 viable and roughly doubling per-slot throughput.
Architecture
The drafter is a 6-layer Qwen3-based decoder (~1.2B trainable params) that:
- Shares embeddings and LM head with the target (frozen)
- Reads hidden states from 6 target layers:
[1, 12, 24, 35, 47, 58] - Projects concatenated target hidden states through an FC layer
- Uses block-causal attention (block_size=16 for training, 8 for inference)
The config is at configs/kimi-k2.6-dflash-draft.json β identical to K2.5-DFlash since the architectures match.
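Block-causal attention means a query token attends to every token in its own block plus all earlier blocks — full attention within a block, causal across blocks. A minimal mask sketch:

```python
def block_causal_mask(seq_len: int, block_size: int):
    """mask[q][k] is True when query position q may attend to key position k.

    Attention is unrestricted within a block and causal across blocks:
    q attends to k iff k's block index does not exceed q's block index.
    """
    return [
        [(k // block_size) <= (q // block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = block_causal_mask(seq_len=6, block_size=2)
assert mask[0][1]      # same block: token 0 sees token 1 (non-causal within block)
assert mask[3][0]      # earlier blocks are always visible
assert not mask[1][4]  # later blocks are never visible
```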
Training pipeline
# Full pipeline: setup SpecForge, regenerate data with K2.6, train drafter
./train-drafter.sh
# Skip regeneration if data exists
./train-drafter.sh --skip-regen
# Skip setup + regen, just train
./train-drafter.sh --skip-setup
The pipeline uses SpecForge and runs three phases:
- Setup: Clone SpecForge, prepare PerfectBlend dataset (~1.16M samples)
- Regenerate: Run prompts through K2.6 to get target-distribution responses (hours)
- Train: 6-epoch DFlash training on 8x MI300X (3-6 days)
Serving with matched drafter
# After training completes:
./serve.sh configs/production-fp8kv-matched.env
Expected performance with matched drafter
| Metric | K2.5 drafter (current) | K2.6 drafter (matched) |
|---|---|---|
| Acceptance rate (st=2) | ~50% | ~75% |
| Acceptance rate (st=8) | ~16% | ~65% |
| Best spec tokens | 2 | 8 |
| Per-slot tok/s | 15.8 | ~25 |
| Aggregate at seqs=64 | 901 | ~1600 |
Optimization Roadmap
| Optimization | Expected throughput | Status |
|---|---|---|
| BF16 KV, K2.5 drafter, seqs=32 | 508 tok/s | Done |
| FP8 KV, K2.5 drafter, seqs=64 | 901 tok/s | Done (updated numbers) |
| K2.6 matched DFlash drafter | ~800 tok/s at seqs=32 | Training pipeline ready |
| FP8 KV + matched drafter, seqs=64 | ~1600 tok/s | Needs matched drafter |
| DDTree draft trees | +35% on matched drafter | Research (arXiv 2604.12989) |
Repository Structure
kimi-k26dflash/
├── README.md                          # This file
├── serve.sh                           # Server launch (pass config as arg)
├── validate-fp8.sh                    # FP8 KV validation + benchmark
├── train-drafter.sh                   # K2.6 DFlash drafter training pipeline
├── Dockerfile.kimi26-dflash           # Patch-at-build Docker image
├── build-kimi26-dflash.sh             # Docker build helper
├── configs/
│   ├── production.env                 # BF16 KV, 508 tok/s (current)
│   ├── production-fp8kv.env           # FP8 KV, seqs=64, 901 tok/s
│   ├── production-fp8kv-safe.env      # FP8 KV + Triton MoE fallback
│   ├── production-fp8kv-matched.env   # FP8 KV + matched drafter, ~1600 tok/s
│   └── kimi-k2.6-dflash-draft.json    # DFlash drafter architecture config
├── patches/
│   └── patch_dflash_rocm.py           # 9 ROCm patches (idempotent)
├── launchers/
│   ├── kimi26-vllm-dflash.sh          # Standard launcher
│   └── kimi26-vllm-dflash-sweep.sh    # Parameter sweep
├── payload/
│   ├── benchmark_multi_turn.py        # Multi-turn benchmark tool
│   ├── calibrate_kv_scales.py         # FP8 KV scale calibration
│   └── preshard_kimi26.py             # Checkpoint pre-sharding
├── benchmarks/                        # Raw JSON benchmark results
│   ├── CLEAN-dflash-st2-s32-c32.json  # 508 tok/s
│   ├── CLEAN-dflash-st2-s24-c24.json  # 379 tok/s
│   └── ...
└── docs/
    ├── kimi-k2.6-250-toks-achieved-2026-04-21.md
    ├── kimi-k2.6-acceptance-rate-analysis-2026-04-21.md
    └── kimi-k2.6-dflash-execution-playbook-2026-04-21.md
Citation
If you use this configuration:
@misc{kimi-k26-dflash-mi300x-2026,
title={Kimi K2.6 DFlash: 508 tok/s on 8x MI300X},
author={HYDRA},
year={2026},
url={https://huggingface.co/hydra/kimi-k26-dflash-mi300x}
}
Acknowledgments
- Moonshot AI for Kimi K2.6
- Z-Lab for the DFlash drafter and framework
- vLLM project for the serving engine
- AMD ROCm for MI300X software stack and AITER kernels
- Hot Aisle for compute