MOSS-TTS-Nano-100M - INT8 (shared external data)

Dynamic-int8-quantized version of OpenMOSS-Team/MOSS-TTS-Nano-100M-ONNX, re-exported with shared external weight data so the prefill and decode_step graphs can mmap the same 105 MB blob instead of duplicating weights. Tuned for ONNX Runtime Mobile on Android, where it runs faster than real time on Snapdragon 8 Gen 3.

Audio codec is unchanged from upstream. Use this repo for the language model graphs and pull the codec from OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX.

What's here

| File | Size | Purpose |
| --- | --- | --- |
| moss_tts_prefill.onnx | 478 KB | global LM, full-context prefill |
| moss_tts_decode_step.onnx | 489 KB | global LM, autoregressive single step (KV cache) |
| moss_tts_local_fixed_sampled_frame.onnx | 744 KB | local LM with baked-in sampling (top-p / top-k / rep penalty) |
| moss_tts_global_shared_int8.data | 111 MB | int8 weights shared by both prefill + decode_step |
| moss_tts_local_fixed_sampled_frame_int8.data | 85 MB | int8 weights for the local sampler |
| tokenizer.model | 471 KB | SentencePiece (unchanged from upstream) |
| browser_poc_manifest.json | 503 KB | prompt templates + 18 builtin voices (preencoded RVQ codes) |
| tts_browser_onnx_meta.json | 4 KB | I/O metadata (op names, KV layout) |
| Total | ~198 MB | vs 640 MB upstream fp32 LM dir |

Trade-off vs the fp32 source

| Metric | fp32 (upstream) | int8 (this) | Δ |
| --- | --- | --- | --- |
| Download | 640 MB (LM) | ~196 MB (LM) | −69 % |
| RTF on x86 4-thread CPU | 0.35 | 0.25 | −29 % |
| RTF on Snapdragon 8 Gen 3 | 0.60 | 0.36 | −40 % |
| Time-to-first-audio (S8G3, streaming) | n/a | 123 ms warm | new |
| Spectral envelope MAE vs fp32 | 0 dB (self) | ~2 dB above fp32 noise floor | acceptable |
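For readers new to the metric: real-time factor (RTF) is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below recomputes the Δ column from the table values. The helper names are illustrative and not part of this repo.

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: compute time spent per second of generated audio."""
    return synthesis_seconds / audio_seconds


def relative_change_pct(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from the trade-off table: (fp32 upstream, int8 this repo).
rows = {
    "Download (MB)": (640.0, 196.0),
    "RTF, x86 4-thread CPU": (0.35, 0.25),
    "RTF, Snapdragon 8 Gen 3": (0.60, 0.36),
}
deltas = {name: relative_change_pct(a, b) for name, (a, b) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.0f} %")
```

Read this way, an RTF of 0.36 means that 10 s of speech takes roughly 3.6 s to synthesize.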

Quality verification details are in the parent project's M10 / M14 reports. The int8 codec was investigated separately and rolled back: Conv-only quantization slowed it 2× with a negligible size win (see the M9 report).

Quick test (Python ORT)

from onnxruntime import InferenceSession
from huggingface_hub import snapshot_download

# Download both repos (LM int8 + upstream codec fp32)
lm_dir = snapshot_download("REALBITS/MOSS-TTS-Nano-100M-ONNX-int8")
codec_dir = snapshot_download("OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano-ONNX",
                              allow_patterns=["moss_audio_tokenizer_decode_*",
                                              "*.json"])

# Then drive prefill → loop {sampler → decode_step} → codec.decode_full
# (or codec.decode_step for streaming) per the upstream onnx_tts_runtime.py

The simplest end-to-end runner that knows about external-data files and the autoregressive loop is the upstream onnx_tts_runtime.py. Point it at the two snapshot dirs (LM + codec) and synthesis runs unchanged.
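If you prefer to open the graphs yourself, keep in mind that ONNX Runtime resolves external-data references relative to the .onnx file, so each .data blob has to sit in the same directory as the graph that uses it. The following is a minimal sketch, not the upstream runner: `missing_lm_files` and `open_lm_sessions` are hypothetical helper names, and the CPU provider choice is an assumption.

```python
from pathlib import Path

# File names as listed in this repo. Each graph expects its .data blob
# in the same directory, because the graph references it by relative name.
LM_GRAPHS = {
    "moss_tts_prefill.onnx": "moss_tts_global_shared_int8.data",
    "moss_tts_decode_step.onnx": "moss_tts_global_shared_int8.data",
    "moss_tts_local_fixed_sampled_frame.onnx":
        "moss_tts_local_fixed_sampled_frame_int8.data",
}


def missing_lm_files(lm_dir):
    """Return the sorted names of graph/weight files absent from lm_dir."""
    root = Path(lm_dir)
    needed = set(LM_GRAPHS) | set(LM_GRAPHS.values())
    return sorted(name for name in needed if not (root / name).is_file())


def open_lm_sessions(lm_dir):
    """Open one CPU InferenceSession per LM graph (hypothetical helper)."""
    from onnxruntime import InferenceSession  # imported lazily on purpose

    missing = missing_lm_files(lm_dir)
    if missing:
        raise FileNotFoundError(f"incomplete snapshot, missing: {missing}")
    root = Path(lm_dir)
    # Passing a file path (not raw bytes) lets ORT locate the external data.
    return {
        name: InferenceSession(str(root / name),
                               providers=["CPUExecutionProvider"])
        for name in LM_GRAPHS
    }
```

Calling `open_lm_sessions(lm_dir)` with the snapshot directory from the quick test should yield three sessions. Their input and output names still have to be read from the sessions or from tts_browser_onnx_meta.json.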

On-device example

This repo is the default INT8 variant for the Fictures MOSS-TTS-Nano Android prototype. That implementation:

  • Downloads from this repo + the upstream codec repo
  • Uses ONNX Runtime Mobile + onnxruntime-extensions-android (for the on-device SentencePiece tokenizer wrapped as a custom-op ONNX graph)
  • Streams audio per-frame via decode_step.onnx (49 named state tensors)
  • Achieves 123 ms warm time-to-first-audio on Snapdragon 8 Gen 3
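The per-frame streaming pattern can be illustrated in Python with a small generator. This is a control-flow sketch only: the three callables are hypothetical wrappers around the ONNX graphs, the end-of-speech flag is an assumption, and the real tensor names and state layout should be taken from tts_browser_onnx_meta.json rather than from this example.

```python
from typing import Any, Callable, Dict, Iterator, Tuple

State = Dict[str, Any]  # stand-in for the KV-cache state carried between steps


def stream_frames(
    prefill: Callable[[], Tuple[Any, State]],
    sample_frame: Callable[[Any], Tuple[Any, bool]],
    decode_step: Callable[[Any, State], Tuple[Any, State]],
    max_frames: int,
) -> Iterator[Any]:
    """Yield one frame of codes per step so audio playback can start early.

    All three callables are hypothetical wrappers around the ONNX graphs:
    prefill() -> (hidden, state); sample_frame(hidden) -> (frame, is_end);
    decode_step(frame, state) -> (hidden, new_state).
    """
    hidden, state = prefill()
    for _ in range(max_frames):
        frame, is_end = sample_frame(hidden)
        if is_end:
            break
        yield frame  # the caller can hand this to the codec right away
        # Feed the step's output state back in as the next step's input.
        hidden, state = decode_step(frame, state)
```

Because the generator yields before running the next decode step, the first frame is available after one prefill and one sampler call, which is the part of the pipeline that time-to-first-audio measures.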

How it was quantized

onnxruntime.quantization.quantize_dynamic with use_external_data_format=True, then a small post-processing step that hashes the resulting .data files, identifies the byte-identical pair (prefill + decode_step share the underlying global transformer weights), keeps one copy as moss_tts_global_shared_int8.data, and rewrites both graphs' external_data location attributes.
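The post-processing step described above can be sketched roughly as follows. This is not the project's 09_quantize_shared.py: the function names are hypothetical, and the sketch assumes all external tensors are top-level graph initializers. Because the duplicate files are byte-identical, offsets and lengths stay valid and only the location entry needs to change.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight blobs are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_identical(data_paths):
    """Group paths by content hash; return only groups with duplicates."""
    groups = defaultdict(list)
    for path in data_paths:
        groups[sha256_of(path)].append(Path(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


def repoint_external_data(graph_path, old_name, new_name):
    """Rewrite external_data 'location' entries from old_name to new_name."""
    import onnx  # lazy import: only needed for this step

    # Load the graph structure only; the weight blob itself stays on disk.
    model = onnx.load(str(graph_path), load_external_data=False)
    for tensor in model.graph.initializer:
        if tensor.data_location != onnx.TensorProto.EXTERNAL:
            continue
        for entry in tensor.external_data:
            if entry.key == "location" and entry.value == old_name:
                entry.value = new_name
    onnx.save(model, str(graph_path))
```

A typical use would be to run `find_identical` over the quantizer's .data outputs, rename one file of the matching pair to moss_tts_global_shared_int8.data, delete the other, and call `repoint_external_data` on both graphs.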

Repro: see _dryrun/09_quantize_shared.py in the parent project.

License + attribution

Apache 2.0, inherited from upstream OpenMOSS/MOSS-TTS-Nano.

If you use this in a paper or product, please cite the original MOSS team (see the upstream repo's CITATION).

Known limitations

  • English / Chinese / Japanese only: same languages as upstream; no extra training data added.
  • No microphone voice cloning out of the box. Use one of the 18 builtin voices in browser_poc_manifest.json (preencoded RVQ codes) or fetch the codec encoder from upstream and encode your own reference audio to RVQ codes first.
  • Codec stays fp32. Conv-only static int8 quantization on the codec was tried and rolled back (~2× slower per audio second on both x86 and ARM due to per-Conv Q/DQ overhead). The 42 MB upstream codec is fast enough.
  • No NPU acceleration assumptions. This release targets ORT-CPU. NNAPI works but gives marginal wins on this graph (~9 %) due to dynamic shapes; QNN would need fixed-shape AOT compilation.