Osaurus AI

MiniMax M2.7 — JANGTQ4 (MLX)

TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine. Near-bf16 quality at ~25% of bf16 disk.

Website: OsaurusAI


Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| Total Parameters | 228.7 B |
| Active per Token | ~1.4 B |
| Profile | JANGTQ4 |
| Format | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| Avg bits/param | ~4.10 |
| Codebook size | 16 entries (4-bit) |
| Disk | ~113 GB |
| Context length | 192 K tokens |
| Chat template | Always-reasoning (`<think>` opened at assistant start) |

What is JANGTQ4?

JANGTQ (JANG TurboQuant) is a codebook-based quantization format for MoE models on Apple Silicon. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a small codebook, and accumulate dot products against a Hadamard-rotated input (QuIP# rotate-input-once math).

JANGTQ4 uses a 16-entry Lloyd-Max codebook per routed expert tensor, which captures the weight distribution near-losslessly. Quality approaches bf16 at ~25% of bf16 disk and runs at the full JANGTQ decode speed. Pick this profile when RAM permits and you want the closest quality to bf16 on Apple Silicon; pick JANGTQ (2-bit) for the smallest footprint.
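
To make the 16-entry Lloyd-Max idea concrete, here is a minimal 1-D sketch in NumPy: fit 16 centroids to a weight tensor with Lloyd's algorithm (alternate nearest-centroid assignment and centroid re-estimation), then quantize every weight to a 4-bit index into that codebook. This is illustrative only — it is not the jang-tools fitter and omits the Hadamard rotation and per-tensor packing described above.

```python
import numpy as np

def lloyd_max_codebook(w, n_entries=16, iters=50):
    """Fit a codebook to the empirical weight distribution (Lloyd's algorithm)."""
    flat = w.ravel()
    # Seed centroids at evenly spaced quantiles so every bin starts populated.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, n_entries))
    for _ in range(iters):
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)  # assign
        for k in range(n_entries):                                       # update
            members = flat[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids, idx.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
codebook, indices = lloyd_max_codebook(w)   # indices all fit in 4 bits (0..15)
w_hat = codebook[indices]                   # dequantized approximation
```

With 16 levels fitted to a roughly Gaussian weight distribution, the relative reconstruction error is on the order of 10%, which is why this profile lands near bf16 quality.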

JANGTQ vs JANGTQ4 vs bf16

| | JANGTQ (2-bit) | JANGTQ4 | bf16 |
|---|---|---|---|
| Disk | ~57 GB | ~113 GB | ~457 GB |
| Routed expert bits | 2 | 4 | 16 |
| Codebook size | 4 entries | 16 entries | n/a |
| Avg bits/param | ~2.15 | ~4.10 | 16 |
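
The disk figures above follow directly from total parameter count × average bits per parameter. A back-of-envelope check (my arithmetic, not a repo script — the small residual gap versus the table comes from GB-vs-GiB rounding plus shard index and sidecar overhead):

```python
def approx_size_gib(params, bits_per_param):
    """Disk bytes ≈ params × bits / 8, reported in GiB."""
    return params * bits_per_param / 8 / 2**30

PARAMS = 228.7e9  # total parameter count from the table
for name, bits in [("JANGTQ (2-bit)", 2.15), ("JANGTQ4", 4.10), ("bf16", 16.0)]:
    print(f"{name}: ~{approx_size_gib(PARAMS, bits):.0f} GiB")
```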

Bit Allocation

| Component | Bits | Format |
|---|---|---|
| Routed expert MLP (gate / up / down) | 4 | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (`nn.QuantizedLinear`, `group_size=64`) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized `nn.Linear` |
| RMSNorms / RoPE / biases | fp16 | Unquantized |

The routed experts account for roughly 98% of the parameters and are the natural compression target. Everything else stays at 8-bit affine or fp16, so the quality-critical hot path keeps near-full precision.
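
For reference, the 8-bit affine scheme used on the non-expert tensors can be sketched like this: per group of 64 weights, store a scale and an offset, and round each weight to an 8-bit code. This mirrors the `group_size=64` layout named in the table but is a NumPy illustration, not the MLX `nn.QuantizedLinear` implementation.

```python
import numpy as np

def affine_quant(w, bits=8, group_size=64):
    """Per-group affine: q = round((w - lo) / scale); dequant is q * scale + lo."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0          # guard against constant groups
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, scale, lo

rng = np.random.default_rng(1)
w = rng.normal(scale=0.05, size=(128, 128)).astype(np.float32)
q, scale, lo = affine_quant(w)
w_hat = (q * scale + lo).reshape(w.shape)   # reconstruction error well under 1%
```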

Important Settings

MiniMax M2.7 is an always-reasoning model. The chat template unconditionally opens <think> at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | 1.0 | Required — temp=0 can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| max_tokens | ≥ 8192 | Give reasoning room to converge |

Strip <think>…</think> from the response before using the final answer.
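
One simple way to do that stripping (the tag names come from the chat template; the regex approach is my suggestion, not a repo utility):

```python
import re

def strip_think(text: str) -> str:
    """Remove the <think>…</think> reasoning block, keeping only the answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

strip_think("<think>step by step reasoning</think>Final answer.")
```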

Usage

This model requires the jang-tools loader — stock mlx_lm.load() does not recognize weight_format: mxtq. The loader applies Metal kernel monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block Hadamard, router compile, QKV fusion).

```shell
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
out = generate(model, tokenizer, prompt, max_tokens=600,
               temperature=1.0, verbose=True)
```

Swift — Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from jang_config.json and route through the MiniMaxJANGTQModel class. Just load the repo — no extra flags.

What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` (117 shards, ~113 GB) | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 capabilities stamp (reasoning=qwen3, tool=minimax) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=4`) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png` | Branding asset |

Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

<think> and <tool_call> are non-special tokens by design — the application layer parses them. Osaurus and vmlx CapabilityDetector read this block verbatim and wire the qwen3 reasoning parser + minimax tool parser automatically, so streamed responses route reasoning_content and tool_calls into the OpenAI-compatible SSE fields instead of leaking into content.
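
A minimal sketch of the application-layer split this implies — route text inside `<think>…</think>` to `reasoning_content`, text inside `<tool_call>…</tool_call>` to `tool_calls`, and everything else to `content`. The function name and the single-pass string approach are illustrative; the real Osaurus/vmlx parsers work incrementally over streamed tokens.

```python
import re

def split_stream(text):
    """Route tagged spans into OpenAI-style response fields."""
    fields = {"reasoning_content": "", "tool_calls": "", "content": ""}
    target = "content"
    # Split on the tags themselves; the capturing group keeps them as delimiters.
    for part in re.split(r"(</?think>|</?tool_call>)", text):
        if part == "<think>":
            target = "reasoning_content"
        elif part == "<tool_call>":
            target = "tool_calls"
        elif part in ("</think>", "</tool_call>"):
            target = "content"
        else:
            fields[target] += part
    return fields
```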

License

MIT — see LICENSE.

Credits

Created by Jinho Jang ([email protected])

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.
