MOSS-TTS-Nano-100M — INT8 (shared external data)

Dynamic-int8-quantized version of OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX, re-exported with shared external weight data so that the prefill and decode_step graphs can mmap the same 105 MB blob instead of each carrying its own copy of the weights. Tuned for ONNX Runtime Mobile on Android: on Snapdragon 8 Gen 3 it runs faster than real time (RTF 0.36, see the trade-off table below).

Audio codec is unchanged from upstream. Use this repo for the language model graphs and pull the codec from OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX.

What's here

| File | Size | Purpose |
| --- | --- | --- |
| `moss_tts_prefill.onnx` | 478 KB | global LM, full-context prefill |
| `moss_tts_decode_step.onnx` | 489 KB | global LM, autoregressive single step (KV cache) |
| `moss_tts_local_fixed_sampled_frame.onnx` | 744 KB | local LM with baked-in sampling (top-p / top-k / rep penalty) |
| `moss_tts_global_shared_int8.data` | 111 MB | int8 weights shared by both prefill + decode_step |
| `moss_tts_local_fixed_sampled_frame_int8.data` | 85 MB | int8 weights for the local sampler |
| `tokenizer.model` | 471 KB | SentencePiece (unchanged from upstream) |
| `browser_poc_manifest.json` | 503 KB | prompt templates + 18 builtin voices (preencoded RVQ codes) |
| `tts_browser_onnx_meta.json` | 4 KB | I/O metadata (op names, KV layout) |
| Total | ~198 MB | (vs 640 MB upstream fp32 LM dir) |

Trade-off vs the fp32 source

| Metric | fp32 (upstream) | int8 (this) | Δ |
| --- | --- | --- | --- |
| Download | 640 MB (LM) | ~196 MB (LM) | −69 % |
| RTF on x86 4-thread CPU | 0.35 | 0.25 | −29 % |
| RTF on Snapdragon 8 Gen 3 | 0.60 | 0.36 | −40 % |
| Time-to-first-audio (S8G3, streaming) | n/a | 123 ms warm | new |
| Spectral envelope MAE vs fp32 | 0 dB (self) | ~2 dB above fp32 noise floor | acceptable |

Quality verification details are in the parent project's M10 and M14 reports. An int8 codec was investigated separately and rolled back: Conv-only quantization made it about 2× slower with a negligible size win (see the M9 report).
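The Δ column can be re-derived from the fp32 and int8 columns. The sketch below assumes the usual definition of RTF (synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time); the helper names are illustrative and not part of this repo.

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: compute time per second of audio (assumed definition)."""
    return synthesis_seconds / audio_seconds


def percent_change(before, after):
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (fp32, int8) pairs copied from the trade-off table.
rows = {
    "Download (MB)": (640, 196),
    "RTF on x86 4-thread CPU": (0.35, 0.25),
    "RTF on Snapdragon 8 Gen 3": (0.60, 0.36),
}

for label, (fp32, int8) in rows.items():
    print(f"{label}: {percent_change(fp32, int8)} %")

# At RTF 0.36, ten seconds of speech need about 3.6 s of compute.
print(f"compute for 10 s of audio at RTF 0.36: {0.36 * 10.0:.1f} s")
```

Under this definition, lower is better, which is why the int8 rows carry a negative Δ.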

Quick test (Python ORT)

from onnxruntime import InferenceSession
from huggingface_hub import snapshot_download

# Download both repos (LM int8 + upstream codec fp32)
lm_dir = snapshot_download("REALBITS/MOSS-TTS-Nano-100M-ONNX-int8")
codec_dir = snapshot_download("OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX",
                              allow_patterns=["moss_audio_tokenizer_decode_*",
                                              "*.json"])

# Then drive prefill → loop {sampler → decode_step} → codec.decode_full
# (or codec.decode_step for streaming) per the upstream onnx_tts_runtime.py

The simplest end-to-end runner that knows about external-data files and the autoregressive loop is the upstream onnx_tts_runtime.py. Point it at the two snapshot dirs (LM + codec) and synthesis runs unchanged.
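If you only want to confirm that the graphs load before wiring up the full loop, something like the following may help. It is a minimal sketch, not the upstream runner: the file names come from the table in "What's here", while the helper names and the thread count are assumptions. ONNX Runtime resolves external data relative to the directory of the `.onnx` file, so the `.data` files must sit next to the graphs, as they do in the snapshot directory.

```python
from pathlib import Path

# File names taken from the "What's here" table.
LM_GRAPHS = (
    "moss_tts_prefill.onnx",
    "moss_tts_decode_step.onnx",
    "moss_tts_local_fixed_sampled_frame.onnx",
)
LM_WEIGHTS = (
    "moss_tts_global_shared_int8.data",
    "moss_tts_local_fixed_sampled_frame_int8.data",
)


def missing_lm_files(lm_dir):
    """Return the expected graph/weight files that are absent from lm_dir."""
    root = Path(lm_dir)
    return [name for name in LM_GRAPHS + LM_WEIGHTS if not (root / name).is_file()]


def load_lm_sessions(lm_dir, num_threads=4):
    """Open one CPU session per LM graph, keyed by file name."""
    import onnxruntime as ort  # lazy import: the file check above works without ORT

    missing = missing_lm_files(lm_dir)
    if missing:
        raise FileNotFoundError(f"missing from {lm_dir}: {missing}")
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = num_threads  # assumed value, tune per device
    return {
        name: ort.InferenceSession(
            str(Path(lm_dir) / name),
            sess_options=opts,
            providers=["CPUExecutionProvider"],
        )
        for name in LM_GRAPHS
    }


def describe(sessions):
    """Print each graph's input names, types, and shapes as reported by ORT."""
    for name, sess in sessions.items():
        print(name)
        for arg in sess.get_inputs():
            print(f"  {arg.name}: {arg.type} {arg.shape}")
```

Calling `describe(load_lm_sessions(lm_dir))` after the download above lists the actual input names, which is a quick way to cross-check `tts_browser_onnx_meta.json`.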

On-device example

This repo is the default INT8 variant for the Fictures MOSS-TTS-Nano Android prototype. That implementation:

- Downloads from this repo + the upstream codec repo
- Uses ONNX Runtime Mobile + onnxruntime-extensions-android (for the on-device SentencePiece tokenizer wrapped as a custom-op ONNX graph)
- Streams audio per-frame via decode_step.onnx (49 named state tensors)
- Achieves 123 ms warm time-to-first-audio on Snapdragon 8 Gen 3
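Per-frame streaming with named state tensors comes down to feeding each step's updated state back in as the next step's input. The sketch below shows that pattern in Python with a toy step function. It is not the Android implementation, and the tensor names (`codes`, `audio`, `state_in`, `state_out`) are hypothetical; the real names are listed in the graph's I/O metadata.

```python
def stream_frames(step, codes_per_frame, state, state_map):
    """Yield one audio chunk per frame while threading state between calls.

    step:       callable mapping a dict of named inputs to a dict of named outputs
    state:      dict of input name -> current state tensor
    state_map:  dict of output name -> input name that receives it on the next call
    """
    for codes in codes_per_frame:
        outputs = step({"codes": codes, **state})
        # Carry every updated state tensor forward under its input name.
        state = {
            state_map[out_name]: value
            for out_name, value in outputs.items()
            if out_name in state_map
        }
        yield outputs["audio"]


def toy_step(inputs):
    """Stand-in for a model call: keeps a running sum as its only state."""
    total = inputs["state_in"] + inputs["codes"]
    return {"audio": total * 2, "state_out": total}


chunks = list(
    stream_frames(toy_step, [1, 2, 3], {"state_in": 0}, {"state_out": "state_in"})
)
print(chunks)
```

With ONNX Runtime, `step` could wrap `session.run(None, feeds)` and zip the results with the names from `session.get_outputs()`. Yielding each chunk as soon as it is ready is what allows playback to start before synthesis finishes.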

How it was quantized

The graphs were quantized with onnxruntime.quantization.quantize_dynamic using use_external_data_format=True. A small post-processing step then hashes the resulting .data files, identifies the byte-identical pair (prefill and decode_step share the underlying global transformer weights), keeps one copy as moss_tts_global_shared_int8.data, and rewrites the external_data location attributes in both graphs to point at it.

Repro: see _dryrun/09_quantize_shared.py in the parent project.
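For readers without access to the parent project, the steps described above might look roughly like this. It is a sketch under stated assumptions, not the actual 09_quantize_shared.py: the helper names are made up, only top-level graph initializers are rewritten, and `quantize_dynamic` is called with default settings apart from the external-data flag.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def quantize_to_int8(src_onnx, dst_onnx):
    """Dynamic quantization with weights written to an external data file."""
    from onnxruntime.quantization import quantize_dynamic  # lazy import

    quantize_dynamic(str(src_onnx), str(dst_onnx), use_external_data_format=True)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .data blobs are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_groups(paths):
    """Group files by content hash; keep only groups with two or more members."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[sha256_of(p)].append(Path(p))
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


def repoint_external_data(onnx_path, new_location):
    """Point every externally stored top-level initializer at new_location."""
    import onnx  # lazy import: only this step needs the onnx package

    # load_external_data=False loads the graph but leaves the weights on disk.
    model = onnx.load(str(onnx_path), load_external_data=False)
    for tensor in model.graph.initializer:
        if tensor.data_location == onnx.TensorProto.EXTERNAL:
            for entry in tensor.external_data:
                if entry.key == "location":
                    entry.value = new_location
    onnx.save(model, str(onnx_path))
```

A driver would quantize each graph, call `identical_groups` on the produced `.data` files, rename one member of the matching pair to `moss_tts_global_shared_int8.data`, delete the other, and call `repoint_external_data` on both graphs. Comparing hashes rather than file sizes guards against two blobs that happen to be the same length but differ in content.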

License + attribution

Apache 2.0, inherited from upstream OpenMOSS/MOSS-TTS-Nano.

If you use this in a paper or product, please cite the original MOSS team (see the upstream repo's CITATION).

Known limitations

- English / Chinese / Japanese only. These are the same languages as upstream; no extra training data was added.
- No microphone voice cloning out of the box. Use one of the 18 builtin voices in browser_poc_manifest.json (preencoded RVQ codes), or fetch the codec encoder from upstream and encode your own reference audio to RVQ codes first.
- Codec stays fp32. Conv-only static int8 quantization on the codec was tried and rolled back (~2× slower per audio second on both x86 and ARM due to per-Conv Q/DQ overhead). The 42 MB upstream codec is fast enough.
- No NPU acceleration is assumed. This release targets ORT-CPU. NNAPI works but gives only marginal wins on this graph (~9 %) because of dynamic shapes; QNN would need fixed-shape AOT compilation.
