# MiniMax M2.7 — JANGTQ4 (MLX)
TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine. Near-bf16 quality at ~25% of bf16 disk.
## Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| Total Parameters | 228.7 B |
| Active per Token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Codebook size | 16 entries (4-bit) |
| Disk | ~113 GB |
| Context length | 192 K tokens |
| Chat template | Always-reasoning (`<think>` opened at assistant start) |
## What is JANGTQ4?
JANGTQ (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed uint32 weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# rotate-input-once math).
JANGTQ4 uses a 16-entry Lloyd-Max codebook per routed expert tensor, which captures the weight distribution near-losslessly. Quality approaches bf16 at ~25% of bf16 disk and runs at the full JANGTQ decode speed. Pick this profile when RAM permits and you want the closest quality to bf16 on Apple Silicon; pick JANGTQ (2-bit) for the smallest footprint.
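The two ingredients above — a Lloyd-Max codebook fitted per tensor, and a rotate-input-once Hadamard matmul — can be illustrated with a minimal NumPy sketch. This is not the Metal kernel (no uint32 packing, no per-expert gather); it only shows why a 16-entry codebook on Hadamard-rotated weights can reproduce `W @ x` without ever expanding back to an affine tensor:

```python
import numpy as np

def lloyd_max_codebook(w, k=16, iters=50):
    """Fit a k-entry scalar codebook to a weight tensor (1-D Lloyd/k-means)."""
    c = np.quantile(w, (np.arange(k) + 0.5) / k)  # init at evenly spaced quantiles
    for _ in range(iters):
        idx = np.abs(w[:, None] - c[None, :]).argmin(axis=1)  # nearest centroid
        for j in range(k):
            if (idx == j).any():
                c[j] = w[idx == j].mean()  # Lloyd update: centroid = cell mean
    return c, idx

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n = 2^m)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 64
W = rng.normal(size=(32, d))                              # one expert weight tile
H = hadamard(d)                                           # symmetric, H @ H = I
codebook, idx = lloyd_max_codebook((W @ H).ravel(), k=16) # offline: rotate + fit 4-bit codebook
idx = idx.reshape(W.shape)

# Runtime path: rotate the input ONCE, then accumulate dot products against
# codebook lookups -- the rotations cancel, so this approximates W @ x directly.
x = rng.normal(size=d)
y = codebook[idx] @ (H @ x)
y_ref = W @ x
rel_err = np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
```

In the real format the indices are packed 8-per-uint32 and the codebook lives in a small sidecar, but the math is the same: `(W H)(H x) = W x`, so quantization error is the only approximation.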
## JANGTQ vs JANGTQ4 vs bf16
| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | ~57 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | — |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
## Bit Allocation
| Component | Bits | Format |
|---|---|---|
| Routed expert MLP (gate / up / down) | 4 | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (nn.QuantizedLinear, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized nn.Linear |
| RMSNorms / RoPE / biases | fp16 | Unquantized |
The routed experts hold ~98 % of the parameters and are the natural compression target. Everything else stays at 8-bit affine (or fp16), keeping the quality-critical hot path at much higher precision.
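The average bits/param figures follow directly from this allocation. A quick back-of-envelope check (the ~98 % routed share is from this card; the small fp16 remainder is ignored):

```python
routed_share = 0.98          # fraction of params in routed experts (from this card)
other_share = 1 - routed_share

def avg_bits(routed_bits, other_bits=8):
    """Weighted average over the two dominant weight classes."""
    return routed_share * routed_bits + other_share * other_bits

jangtq4 = avg_bits(4)   # 4.08 -- close to the ~4.10 reported
jangtq2 = avg_bits(2)   # 2.12 -- close to the ~2.15 reported
```

The small gap to the reported ~4.10 / ~2.15 is consistent with per-tensor codebook and Hadamard-sign sidecar overhead.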
## Important Settings
MiniMax M2.7 is an always-reasoning model. The chat template
unconditionally opens `<think>` at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | Required — temp=0 can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |
Strip `<think>…</think>` from the response before using the final answer.
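One way to do that client-side — a minimal sketch that also handles a generation that stopped mid-reasoning and never emitted the closing tag:

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks (and any unterminated <think> tail)."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # if generation stopped mid-reasoning, drop the unterminated block too
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_think("<think>chain of thought</think>The answer is 42."))
# -> The answer is 42.
```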
## Usage
This model requires the `jang-tools` loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).
```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
out = generate(model, tokenizer, prompt, max_tokens=600,
               temperature=1.0, verbose=True)
```
### Swift — Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.
## What's In This Repo
| File | Role |
|---|---|
| `model-*.safetensors` (117 shards, ~113 GB) | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (minimax_m2, weight_format=mxtq, mxtq_bits=4) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png` | Branding asset |
## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)
```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```
`<think>` and `<tool_call>` are non-special tokens by design — the
application layer parses them. The Osaurus and vmlx `CapabilityDetector`
reads this block verbatim and wires up the qwen3 reasoning parser + minimax
tool parser automatically, so streamed responses route `reasoning_content`
and `tool_calls` into the OpenAI-compatible SSE fields instead of leaking
into `content`.
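Assuming the capabilities block is read as plain JSON, the routing decision a client makes can be sketched as follows. This is illustrative only — the `split_stream` helper and its field names mimic the OpenAI-style SSE shape described above, not the actual Osaurus/vmlx API:

```python
import json

# the capabilities stamp, as it would be read from jang_config.json
CAPS = json.loads("""{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_thinking": true
}""")

def split_stream(text: str, caps=CAPS):
    """Route raw model text into OpenAI-style response fields (sketch only)."""
    if caps["supports_thinking"] and text.startswith("<think>"):
        body = text[len("<think>"):]
        reasoning, _, answer = body.partition("</think>")
        return {"reasoning_content": reasoning, "content": answer}
    return {"content": text}

out = split_stream("<think>check units</think>It weighs 3 kg.")
# out["reasoning_content"] == "check units"
# out["content"] == "It weighs 3 kg."
```

A real streaming implementation would do this incrementally per token chunk; the point here is only that the parser choice is driven entirely by the stamped capabilities, not by model-specific client code.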
## License
MIT — see LICENSE.
## Credits
Created by Jinho Jang — [email protected]
Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.