MOSS-TTS-Nano-100M - INT8 (shared external data)
Dynamic-int8-quantized version of OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX, re-exported with shared external weight data so that the prefill and decode_step graphs can mmap the same 105 MB blob instead of duplicating weights. Tuned for ONNX Runtime Mobile on Android; runs faster than real time on Snapdragon 8 Gen 3.
Audio codec is unchanged from upstream. Use this repo for the language model graphs and pull the codec from OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX.
What's here
| File | Size | Purpose |
|---|---|---|
| moss_tts_prefill.onnx | 478 KB | global LM, full-context prefill |
| moss_tts_decode_step.onnx | 489 KB | global LM, autoregressive single step (KV cache) |
| moss_tts_local_fixed_sampled_frame.onnx | 744 KB | local LM with baked-in sampling (top-p / top-k / rep penalty) |
| moss_tts_global_shared_int8.data | 111 MB | int8 weights shared by both prefill + decode_step |
| moss_tts_local_fixed_sampled_frame_int8.data | 85 MB | int8 weights for the local sampler |
| tokenizer.model | 471 KB | SentencePiece (unchanged from upstream) |
| browser_poc_manifest.json | 503 KB | prompt templates + 18 builtin voices (preencoded RVQ codes) |
| tts_browser_onnx_meta.json | 4 KB | I/O metadata (op names, KV layout) |
| Total | ~198 MB | (vs 640 MB upstream fp32 LM dir) |
Trade-off vs the fp32 source
| Metric | fp32 (upstream) | int8 (this) | Δ |
|---|---|---|---|
| Download | 640 MB (LM) | ~196 MB (LM) | −69 % |
| RTF on x86 4-thread CPU | 0.35 | 0.25 | −29 % |
| RTF on Snapdragon 8 Gen 3 | 0.60 | 0.36 | −40 % |
| Time-to-first-audio (S8G3, streaming) | n/a | 123 ms warm | new |
| Spectral envelope MAE vs fp32 | 0 dB (self) | ~2 dB above fp32 noise floor | acceptable |
Quality verification details are in the parent project's M10 / M14 reports. The int8 codec was investigated separately and rolled back: Conv-only quantization slowed it 2× with a negligible size win (see the M9 report).
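The Δ column can be reproduced from the two value columns. The sketch below assumes the usual definition of RTF (synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time); this README does not spell the definition out, and the helper names are illustrative only.

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (assumed definition): compute time / audio duration."""
    return synthesis_seconds / audio_seconds


def relative_change_pct(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (fp32, int8) pairs copied from the table above.
rows = {
    "Download (MB)": (640, 196),
    "RTF on x86 4-thread CPU": (0.35, 0.25),
    "RTF on Snapdragon 8 Gen 3": (0.60, 0.36),
}

for name, (fp32, int8) in rows.items():
    print(f"{name}: {relative_change_pct(fp32, int8):+.0f} %")
```

Under that definition, an RTF of 0.36 means roughly 3.6 s of compute for 10 s of synthesized audio.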
Quick test (Python ORT)
from onnxruntime import InferenceSession
from huggingface_hub import snapshot_download
# Download both repos (LM int8 + upstream codec fp32)
lm_dir = snapshot_download("REALBITS/MOSS-TTS-Nano-100M-ONNX-int8")
codec_dir = snapshot_download("OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX",
                              allow_patterns=["moss_audio_tokenizer_decode_*",
                                              "*.json"])
# Then drive prefill → loop {sampler → decode_step} → codec.decode_full
# (or codec.decode_step for streaming) per the upstream onnx_tts_runtime.py
The simplest end-to-end runner that knows about external-data files and
the autoregressive loop is the upstream
onnx_tts_runtime.py.
Point it at the two snapshot dirs (LM + codec) and synthesis runs unchanged.
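Because the graphs reference their weights as external data resolved relative to the .onnx file, a missing .data file tends to surface as a load-time error rather than a download error. A small pre-flight check along these lines may help; the file names come from the "What's here" table, while the helper `missing_lm_files` is a hypothetical name and not part of this repo or of upstream.

```python
from pathlib import Path

# File names copied from the "What's here" table.
LM_FILES = [
    "moss_tts_prefill.onnx",
    "moss_tts_decode_step.onnx",
    "moss_tts_local_fixed_sampled_frame.onnx",
    "moss_tts_global_shared_int8.data",
    "moss_tts_local_fixed_sampled_frame_int8.data",
    "tokenizer.model",
    "browser_poc_manifest.json",
    "tts_browser_onnx_meta.json",
]


def missing_lm_files(lm_dir):
    """Return the expected LM files that are not present in `lm_dir`."""
    root = Path(lm_dir)
    return [name for name in LM_FILES if not (root / name).is_file()]


# Usage (with `lm_dir` from snapshot_download above):
#   missing = missing_lm_files(lm_dir)
#   if missing:
#       raise FileNotFoundError(f"incomplete snapshot, missing: {missing}")
```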
On-device example
This repo is the default INT8 variant for the Fictures MOSS-TTS-Nano Android prototype. That implementation:
- Downloads from this repo + the upstream codec repo
- Uses ONNX Runtime Mobile + onnxruntime-extensions-android (for the on-device SentencePiece tokenizer wrapped as a custom-op ONNX graph)
- Streams audio per-frame via decode_step.onnx (49 named state tensors)
- Achieves 123 ms warm time-to-first-audio on Snapdragon 8 Gen 3
How it was quantized
onnxruntime.quantization.quantize_dynamic with
use_external_data_format=True, then a small post-processing step that
hashes the resulting .data files, identifies the byte-identical pair
(prefill + decode_step share the underlying global transformer weights),
keeps one copy as moss_tts_global_shared_int8.data, and rewrites both
graphs' external_data location attributes.
Repro: see
_dryrun/09_quantize_shared.py
in the parent project.
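The post-processing step described above can be sketched roughly as follows. This is not the parent project's script; it is a minimal illustration under stated assumptions: the function names are hypothetical, the quantization call itself is omitted, and only top-level graph initializers are rewritten. The hashing helpers need only the standard library; `point_graph_at` additionally requires the `onnx` package.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .data blobs are not read into RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_groups(paths):
    """Group files by content hash; return only groups with more than one file."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[sha256_of(p)].append(Path(p))
    return [group for group in by_hash.values() if len(group) > 1]


def point_graph_at(graph_path, old_name, new_name):
    """Rewrite external_data 'location' entries from old_name to new_name."""
    import onnx  # imported lazily so the hashing helpers work without it

    # Load the graph structure only; the weight blob stays on disk.
    model = onnx.load(str(graph_path), load_external_data=False)
    for tensor in model.graph.initializer:
        if tensor.data_location != onnx.TensorProto.EXTERNAL:
            continue
        for entry in tensor.external_data:
            if entry.key == "location" and entry.value == old_name:
                entry.value = new_name
    onnx.save(model, str(graph_path))
```

After `identical_groups` reports a byte-identical pair, one file would be renamed to the shared name, the other deleted, and `point_graph_at` called on both graphs. Byte-identical blobs imply identical tensor offsets, which is why only the location entry needs to change.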
License + attribution
Apache 2.0, inherited from upstream OpenMOSS/MOSS-TTS-Nano.
If you use this in a paper or product, please cite the original MOSS team (see the upstream repo's CITATION).
Known limitations
- English / Chinese / Japanese only: same languages as upstream; no extra training data added.
- No microphone voice cloning out of the box. Use one of the 18 builtin voices in browser_poc_manifest.json (preencoded RVQ codes) or fetch the codec encoder from upstream and encode your own reference audio to RVQ codes first.
- Codec stays fp32. Conv-only static int8 quantization on the codec was tried and rolled back (~2× slower per audio second on both x86 and ARM due to per-Conv Q/DQ overhead). The 42 MB upstream codec is fast enough.
- No NPU acceleration assumptions. This release targets ORT-CPU. NNAPI works but gives marginal wins on this graph (~9 %) due to dynamic shapes; QNN would need fixed-shape AOT compilation.