Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model

Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between two states. In LISTENING, it consumes one encoder-output chunk per step and emits either KEEP_SILENCE or TEXT_BEGIN. In SPEAKING, it autoregressively generates a text turn until TEXT_END, then returns to LISTENING for the next chunk.

This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.

Model Details

Model name: Mini-Omni3
Task: Streaming audio-conditioned text generation (audio in, text out)
Audio encoder: Qwen2.5-Omni audio tower (chunk-wise)
Audio framing: 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
Decoding states: LISTENING (emits KEEP_SILENCE / TEXT_BEGIN) and SPEAKING (emits text until TEXT_END)
Default sampling: temperature 0.3, top-k 3
Default max new tokens: 4096 per session
License: Apache-2.0
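
The audio framing numbers above can be sanity-checked with a short sketch. Only the 16 kHz rate, the 6400-sample chunk, and the 10 frames per chunk come from this card; the helper names `padded_length`, `num_chunks`, and `frame_duration_ms` are illustrative and are not part of the Mini-Omni3 codebase.

```python
# Constants taken from the "Audio framing" entry above.
SAMPLE_RATE = 16_000       # Hz
CHUNK_SAMPLES = 6_400      # 0.4 s at 16 kHz
FRAMES_PER_CHUNK = 10      # encoder-output frames per chunk


def padded_length(num_samples: int) -> int:
    """Round a sample count up to the next multiple of CHUNK_SAMPLES."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    # Ceiling division written with integer arithmetic only.
    return -(-num_samples // CHUNK_SAMPLES) * CHUNK_SAMPLES


def num_chunks(num_samples: int) -> int:
    """Number of 0.4-second chunks after padding."""
    return padded_length(num_samples) // CHUNK_SAMPLES


def frame_duration_ms() -> float:
    """Audio duration covered by one encoder-output frame, in milliseconds."""
    return 1000 * CHUNK_SAMPLES / SAMPLE_RATE / FRAMES_PER_CHUNK
```

Under these numbers, one second of audio (16000 samples) pads to 19200 samples, which is three chunks, and each encoder-output frame covers 40 ms.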

Repository Contents

Mini-Omni3/
├── model-00001-of-00004.safetensors      # LM weights, sharded (≈4 GB each)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json          # Shard index consumed by safetensors loader
├── config.json                           # Top-level model config
├── generation_config.json                # Generation defaults
├── model_config.yaml                     # GPT config consumed by Config.from_file
├── hyperparameters.yaml                  # Training-time hyperparameters (reference)
├── tokenizer.json                        # Tokenizer
├── tokenizer_config.json
├── MiniOmni3_ChunkwisedEncoder.pth       # Audio encoder weights (Qwen2.5-Omni audio tower)
└── qwen25OmniConfig/                     # Audio-encoder config (nested: thinker_config.audio_config)
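
After downloading, it can be useful to confirm that every shard named in the index is present. The sketch below assumes that model.safetensors.index.json follows the usual Hugging Face sharded-checkpoint layout, with a `weight_map` object mapping tensor names to shard file names; `missing_shards` is a hypothetical helper, not a function shipped with Mini-Omni3.

```python
import json
from pathlib import Path


def missing_shards(
    checkpoint_dir: str,
    index_name: str = "model.safetensors.index.json",
) -> list[str]:
    """Return shard files referenced by the index but absent on disk.

    Assumes the index has a "weight_map" dict of tensor name -> shard file.
    """
    root = Path(checkpoint_dir)
    with (root / index_name).open(encoding="utf-8") as f:
        index = json.load(f)
    # Many tensors share a shard, so deduplicate the file names.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

An empty list from `missing_shards("checkpoints")` would indicate that all shards listed in the index were found.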

Intended Use

Mini-Omni3 is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.

Quick Start

Installation

git clone https://github.com/xzf-thu/Mini-Omni3.git  # TODO: confirm repo URL
cd Mini-Omni3
conda create -n mini-omni3 python=3.10 -y
conda activate mini-omni3
pip install -r requirements.txt

Download the checkpoint

From the Mini-Omni3 project root, pull the weights into checkpoints/:

from huggingface_hub import snapshot_download

snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")

snapshot_download is the recommended path — it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid git clone of the HF repo or the web "Download" button if you want your run reflected in the stats.

Python Usage

from src.miniomni3.generate.run import run_inference

run_inference(
    checkpoint_dir="checkpoints",
    audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
    device="cuda:0",                    # or "mps" / "cpu"
)

For interactive use, omit audio_paths and run_inference will prompt for an audio path each round:

run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")

Streaming Protocol

A single session looks like:

[system prompt tokens]
  ┌─── LISTENING ───┐
  │ AUDIO_BEGIN PAD*10 ASSISTANT  →  KEEP_SILENCE          (keep listening)
  │ AUDIO_BEGIN PAD*10 ASSISTANT  →  TEXT_BEGIN EMOTION    (start replying)
  └─────────────────┘
  ┌─── SPEAKING ────┐
  │ … text tokens … TEXT_END                                (reply finished)
  └─────────────────┘
  ┌─── LISTENING ───┐  (next audio chunk)
  …

The model is trained to emit at most one TEXT_BEGIN per audio chunk. Each assistant turn begins with TEXT_BEGIN, followed by an emotion token, the reply tokens, and TEXT_END. A KEEP_SILENCE output means the model chose not to respond to that chunk and remains in the LISTENING state.
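
The protocol above can be read as a two-state machine over the model's outputs. The following is a minimal sketch that groups a stream of output tokens into turns. It assumes tokens are available as plain strings named after the special tokens in this section; the real model works with token ids, and `Turn` and `parse_session` are illustrative names rather than Mini-Omni3 APIs.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# String stand-ins for the special tokens (assumption: real ids differ).
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


@dataclass
class Turn:
    chunk_index: int                      # 0-based chunk that triggered the reply
    emotion: Optional[str] = None         # first token after TEXT_BEGIN
    text: List[str] = field(default_factory=list)


def parse_session(outputs: Iterable[str]) -> List[Turn]:
    """Group a session's output tokens into assistant turns."""
    turns: List[Turn] = []
    current: Optional[Turn] = None        # None means the LISTENING state
    chunk_index = -1
    for token in outputs:
        if current is None:
            # LISTENING: exactly one decision token per audio chunk.
            chunk_index += 1
            if token == KEEP_SILENCE:
                continue
            if token != TEXT_BEGIN:
                raise ValueError(f"unexpected token while listening: {token!r}")
            current = Turn(chunk_index)
        elif token == TEXT_END:
            # SPEAKING finished: store the turn and return to LISTENING.
            turns.append(current)
            current = None
        elif current.emotion is None:
            current.emotion = token
        else:
            current.text.append(token)
    if current is not None:
        raise ValueError("session ended before TEXT_END")
    return turns
```

For example, the stream KEEP_SILENCE, KEEP_SILENCE, TEXT_BEGIN, an emotion token, two text tokens, TEXT_END parses to a single turn attached to the third chunk.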

Training Summary

Evaluation

Limitations

The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
The model expects 16 kHz mono audio. Inputs are loaded with whisper.load_audio, which resamples non-conforming audio, and are then padded to 0.4-second boundaries before encoding.
Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
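
The last two limitations can be illustrated with a small sketch: grouping encoder frames into chunks of 10 while dropping a trailing partial chunk, and computing how much audio must have arrived before the decision for a given chunk. The function names are hypothetical and the code is not taken from the Mini-Omni3 repository; it assumes the decision for a chunk is made once that chunk is complete.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

FRAMES_PER_CHUNK = 10   # encoder-output frames per chunk, from this card
CHUNK_MS = 400          # 0.4 seconds per chunk, from this card


def full_chunks(frames: Sequence[T], size: int = FRAMES_PER_CHUNK) -> List[Sequence[T]]:
    """Split frames into complete chunks; a trailing partial chunk is dropped."""
    count = len(frames) // size
    return [frames[i * size:(i + 1) * size] for i in range(count)]


def earliest_decision_ms(chunk_index: int) -> int:
    """Milliseconds of audio needed before the decision for a 0-based chunk."""
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    return (chunk_index + 1) * CHUNK_MS
```

With 25 encoder frames, `full_chunks` keeps two chunks and discards the last five frames; the earliest possible decision comes after 400 ms of audio.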

Citation

@misc{xie_miniomni3,
  title  = {Mini-Omni3: Streaming Audio-In, Text-Out Conversational Modeling},
  author = {Zhifei Xie and collaborators},
  year   = {2026},
  note   = {Preprint in preparation}
}

Acknowledgements

Mini-Omni3 builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.

Model size: 3B params (Safetensors)
Tensor type: F32
