# MiniMax M2.7 — JANGTQ4 (MLX)
TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine. Near-bf16 quality at ~25% of bf16 disk.
## Model Details
| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| Total Parameters | 228.7 B |
| Active per Token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Codebook size | 16 entries (4-bit) |
| Disk | ~113 GB |
| Context length | 192 K tokens |
| Chat template | Always-reasoning (`<think>` opened at assistant start) |
## What is JANGTQ4?
JANGTQ (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed uint32 weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# rotate-input-once math).
JANGTQ4 uses a 16-entry Lloyd-Max codebook per routed expert tensor, which captures the weight distribution near-losslessly. Quality approaches bf16 at ~25% of bf16 disk and runs at the full JANGTQ decode speed. Pick this profile when RAM permits and you want the closest quality to bf16 on Apple Silicon; pick JANGTQ (2-bit) for the smallest footprint.
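The two ingredients above — a Lloyd-Max codebook fitted per tensor, and a rotate-input-once Hadamard matmul — can be illustrated with a minimal NumPy sketch. This is not the Metal kernel (no uint32 packing, no per-expert gather); it only shows why a 16-entry codebook on Hadamard-rotated weights can reproduce `W @ x` without ever expanding back to an affine tensor:

```python
import numpy as np

def lloyd_max_codebook(w, k=16, iters=50):
    """Fit a k-entry scalar codebook to a weight tensor (1-D Lloyd/k-means)."""
    c = np.quantile(w, (np.arange(k) + 0.5) / k)  # init at evenly spaced quantiles
    for _ in range(iters):
        idx = np.abs(w[:, None] - c[None, :]).argmin(axis=1)  # nearest centroid
        for j in range(k):
            if (idx == j).any():
                c[j] = w[idx == j].mean()  # Lloyd update: centroid = cell mean
    return c, idx

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n = 2^m)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 64
W = rng.normal(size=(32, d))                              # one expert weight tile
H = hadamard(d)                                           # symmetric, H @ H = I
codebook, idx = lloyd_max_codebook((W @ H).ravel(), k=16) # offline: rotate + fit 4-bit codebook
idx = idx.reshape(W.shape)

# Runtime path: rotate the input ONCE, then accumulate dot products against
# codebook lookups -- the rotations cancel, so this approximates W @ x directly.
x = rng.normal(size=d)
y = codebook[idx] @ (H @ x)
y_ref = W @ x
rel_err = np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref)
```

In the real format the indices are packed 8-per-uint32 and the codebook lives in a small sidecar, but the math is the same: `(W H)(H x) = W x`, so quantization error is the only approximation.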
## JANGTQ vs JANGTQ4 vs bf16
| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | ~57 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | — |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
## Bit Allocation
| Component | Bits | Format |
|---|---|---|
| Routed expert MLP (gate / up / down) | 4 | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (nn.QuantizedLinear, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized nn.Linear |
| RMSNorms / RoPE / biases | fp16 | Unquantized |
The routed experts hold ~98 % of the parameters and are the natural compression target. Everything else stays at 8-bit affine (or fp16), keeping the quality-critical hot path at much higher precision.
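The average bits/param figures follow directly from this allocation. A quick back-of-envelope check (the ~98 % routed share is from this card; the small fp16 remainder is ignored):

```python
routed_share = 0.98          # fraction of params in routed experts (from this card)
other_share = 1 - routed_share

def avg_bits(routed_bits, other_bits=8):
    """Weighted average over the two dominant weight classes."""
    return routed_share * routed_bits + other_share * other_bits

jangtq4 = avg_bits(4)   # 4.08 -- close to the ~4.10 reported
jangtq2 = avg_bits(2)   # 2.12 -- close to the ~2.15 reported
```

The small gap to the reported ~4.10 / ~2.15 is consistent with per-tensor codebook and Hadamard-sign sidecar overhead.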
## Important Settings
MiniMax M2.7 is an always-reasoning model. The chat template
unconditionally opens `<think>` at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | Required — temp=0 can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |
Strip `<think>…</think>` from the response before using the final answer.
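One way to do that client-side — a minimal sketch that also handles a generation that stopped mid-reasoning and never emitted the closing tag:

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks (and any unterminated <think> tail)."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # if generation stopped mid-reasoning, drop the unterminated block too
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_think("<think>chain of thought</think>The answer is 42."))
# -> The answer is 42.
```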
## Usage
This model requires the `jang-tools` loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).
```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
out = generate(model, tokenizer, prompt, max_tokens=600,
               temperature=1.0, verbose=True)
```
### Swift — Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.
## What's In This Repo
| File | Role |
|---|---|
| `model-*.safetensors` (117 shards, ~113 GB) | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (minimax_m2, weight_format=mxtq, mxtq_bits=4) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png` | Branding asset |
## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)
```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```
`<think>` and `<tool_call>` are non-special tokens by design — the
application layer parses them. The Osaurus and vmlx `CapabilityDetector`
reads this block verbatim and wires up the qwen3 reasoning parser + minimax
tool parser automatically, so streamed responses route `reasoning_content`
and `tool_calls` into the OpenAI-compatible SSE fields instead of leaking
into `content`.
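Assuming the capabilities block is read as plain JSON, the routing decision a client makes can be sketched as follows. This is illustrative only — the `split_stream` helper and its field names mimic the OpenAI-style SSE shape described above, not the actual Osaurus/vmlx API:

```python
import json

# the capabilities stamp, as it would be read from jang_config.json
CAPS = json.loads("""{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_thinking": true
}""")

def split_stream(text: str, caps=CAPS):
    """Route raw model text into OpenAI-style response fields (sketch only)."""
    if caps["supports_thinking"] and text.startswith("<think>"):
        body = text[len("<think>"):]
        reasoning, _, answer = body.partition("</think>")
        return {"reasoning_content": reasoning, "content": answer}
    return {"content": text}

out = split_stream("<think>check units</think>It weighs 3 kg.")
# out["reasoning_content"] == "check units"
# out["content"] == "It weighs 3 kg."
```

A real streaming implementation would do this incrementally per token chunk; the point here is only that the parser choice is driven entirely by the stamped capabilities, not by model-specific client code.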
## License
MIT — see LICENSE.
## Credits
Created by Jinho Jang — [email protected]
Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.