Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out

Translates spoken English into spoken Russian with streaming output over WebSocket.

Input is currently limited to English (a constraint of the NeMo ASR model), while the output language depends on what TranslateGemma (NMT) and XTTSv2 (TTS) support. You can change the target language accordingly.

Architecture

Audio Input → ASR (ONNX) → NMT (GGUF) → TTS (ONNX) → Audio Output
  (PCM16)   Conformer RNN-T  TranslateGemma  XTTSv2     (PCM16)

ASR: NVIDIA NeMo FastConformer RNN-T (cache-aware streaming, ONNX)
NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
TTS: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24kHz output

See ARCHITECTURE.md for detailed design documentation.
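
The stage chain above can be pictured as workers connected by bounded queues. The sketch below is illustrative only: `fake_asr`, `fake_nmt`, `fake_tts`, and `run_pipeline` are hypothetical stand-ins, not the repository's classes, and the queue size of 4 is arbitrary. It only shows how a full downstream queue slows the upstream stage (backpressure).

```python
import asyncio

_STOP = object()  # sentinel marking the end of the stream


async def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = await inbox.get()
        if item is _STOP:
            await outbox.put(_STOP)  # propagate shutdown downstream
            return
        await outbox.put(fn(item))  # waits while outbox is full (backpressure)


# Hypothetical placeholders for the real ASR, NMT and TTS models.
def fake_asr(pcm):
    return pcm.decode("ascii")


def fake_nmt(text):
    return f"<ru>{text}</ru>"


def fake_tts(text):
    return text.encode("utf-8")


async def run_pipeline(chunks, maxsize=4):
    # Four queues: audio in, transcripts, translations, audio out.
    queues = [asyncio.Queue(maxsize=maxsize) for _ in range(4)]
    workers = [
        asyncio.create_task(stage(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate([fake_asr, fake_nmt, fake_tts])
    ]

    async def feed():
        for chunk in chunks:
            await queues[0].put(chunk)
        await queues[0].put(_STOP)

    feeder = asyncio.create_task(feed())
    results = []
    while (item := await queues[-1].get()) is not _STOP:
        results.append(item)
    await asyncio.gather(feeder, *workers)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline([b"hello", b"world"])))
```

In the real system each stage runs model inference instead of a string transform, so the blocking work would live in threads or an executor rather than directly in the event loop.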

Requirements

Python 3.10+
Model files:
- ASR: NeMo FastConformer RNN-T ONNX model directory
- NMT: TranslateGemma 4B GGUF file
- TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio

Installation

pip install -r requirements.txt

System Dependencies

# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2

Usage

Start the Server

When running on CPU, use --tts-int8-gpt.
Use at least an 8-core CPU (e.g., m8a.2xlarge) with the defaults --nmt-n-threads 2 and --tts-threads-gpt 1.
On a 16-core CPU (e.g., m8a.4xlarge), increase --nmt-n-threads to 4 and --tts-threads-gpt to 2 for smoother processing.

python app.py \
  --asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
  --nmt-gguf-path models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir models/tts/xttsv2-onnx/ \
  --tts-vocab-path models/tts/xttsv2-onnx/vocab.json \
  --tts-mel-norms-path models/tts/xttsv2-onnx/mel_stats.npy \
  --tts-ref-audio-path audio_ref/male_stewie.mp3 \
  --tts-int8-gpt \
  --host 0.0.0.0 \
  --port 8765

CLI Options

Flag	Default	Description
`--asr-onnx-path`	(required)	ASR ONNX model directory
`--asr-chunk-ms`	10	ASR audio chunk duration (ms)
`--asr-sample-rate`	16000	ASR expected sample rate
`--nmt-gguf-path`	(required)	NMT GGUF model file
`--nmt-n-threads`	2	NMT CPU threads
`--tts-model-dir`	(required)	TTS ONNX model directory
`--tts-vocab-path`	(required)	TTS BPE vocab.json
`--tts-mel-norms-path`	(required)	TTS mel_stats.npy
`--tts-ref-audio-path`	(required)	TTS reference speaker audio
`--tts-language`	ru	TTS target language code
`--tts-int8-gpt`	False	Use INT8 quantized GPT
`--tts-threads-gpt`	1	TTS GPT ONNX threads
`--tts-chunk-size`	20	TTS AR tokens per vocoder chunk
`--audio-queue-max`	256	Audio input queue max size
`--text-queue-max`	64	Text queue max size
`--tts-queue-max`	16	NMT→TTS text queue max size
`--audio-out-queue-max`	32	Audio output queue max size
`--host`	0.0.0.0	Server bind host
`--port`	8765	Server port

Python Client

Captures microphone audio and plays back translated speech:

pip install -r requirements_client.txt
python clients/python_client.py --uri ws://localhost:8765

Web Client

TBD

WebSocket Protocol

Direction	Type	Format	Description
Client→	Binary	PCM16	Raw audio at declared sample rate
Client→	Text	JSON	`{"action": "start", "sample_rate": 16000}`
Client→	Text	JSON	`{"action": "stop"}`
→Client	Binary	PCM16	Synthesized audio at 24kHz
→Client	Text	JSON	`{"type": "transcript", "text": "..."}`
→Client	Text	JSON	`{"type": "translation", "text": "..."}`
→Client	Text	JSON	`{"type": "status", "status": "started"}`
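
A minimal client-side sketch of the message table above. The helper names (`start_message`, `stop_message`, `floats_to_pcm16`, `classify`, `translate`) are hypothetical, and several details are assumptions rather than documented behavior: little-endian byte order for PCM16, the server closing the connection after `stop`, and use of the third-party `websockets` package. See `clients/python_client.py` for the actual client.

```python
import json
import struct


def start_message(sample_rate: int = 16000) -> str:
    """Control message that opens a session at the given sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})


def stop_message() -> str:
    """Control message that ends the session."""
    return json.dumps({"action": "stop"})


def floats_to_pcm16(samples) -> bytes:
    """Convert samples in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))


def classify(message):
    """Split a server message into (kind, payload) following the table above."""
    if isinstance(message, (bytes, bytearray)):
        return "audio", bytes(message)  # synthesized PCM16 at 24 kHz
    event = json.loads(message)
    kind = event.get("type", "unknown")
    payload = event.get("status") if kind == "status" else event.get("text")
    return kind, payload


async def translate(uri: str, pcm_chunks) -> None:
    """Send all audio, then print what comes back (simplified, not truly streaming)."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(uri) as ws:
        await ws.send(start_message())
        for chunk in pcm_chunks:
            await ws.send(chunk)  # bytes are sent as binary frames
        await ws.send(stop_message())
        async for message in ws:  # ends when the server closes the connection
            kind, payload = classify(message)
            print(kind, len(payload) if kind == "audio" else payload)
```

A real-time client would send microphone audio and receive results concurrently, for example with two asyncio tasks sharing the same connection, instead of sending everything first.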

Docker

docker build -t streaming-translation .
docker run -p 8765:8765 \
  -v /path/to/models:/models \
  streaming-translation \
  --asr-onnx-path /models/asr/ \
  --nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir /models/xtts/ \
  --tts-vocab-path /models/xtts/vocab.json \
  --tts-mel-norms-path /models/xtts/mel_stats.npy \
  --tts-ref-audio-path /models/reference.wav

Project Structure

streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── Dockerfile
├── models/
│   ├── asr/
│   │   └── nemo-cache-aware-streaming-560ms-onnx/
│   ├── nmt/
│   │   ├── translategemma-4b-it-q8_0-gguf/
│   │   └── translategemma-4b-it-q4_k_m-gguf/
│   └── tts/
│       └── xttsv2-onnx/
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py            # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py      # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py        # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── streaming_tts.py            # StreamingTTS wrapper
│   │   ├── xtts_streaming_pipeline.py  # Full TTS pipeline
│   │   ├── xtts_onnx_orchestrator.py   # GPT-2 AR + vocoder
│   │   ├── xtts_tokenizer.py           # BPE tokenizer
│   │   └── zh_num2words.py             # Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client

TTS Threading Update (v2 Refactor)

The TTS integration has been revised to match the 3-thread ASR model.

Previous design

Both GPT-2 AR generation and HiFi-GAN vocoding ran inside a single synthesize_stream() call that was dispatched to the shared ThreadPoolExecutor:

[orchestrator asyncio loop]
    └─ run_in_executor ──► synthesize_stream()
                               ├─ GPT-2 AR loop  (blocking)
                               └─ HiFi-GAN       (blocking)

This meant the executor slot was held for the entire TTS inference duration, blocking NMT dispatches and delivering audio only after full-segment synthesis.

New design

Two dedicated daemon threads decouple GPT generation from vocoding:

text ──► [TTS-GPT Thread]  ──latent batches──►  [TTS-Vocoder Thread] ──► audio
            BPE + AR loop                          HiFi-GAN + crossfade

The vocoder starts producing audio as soon as the first gpt_chunk_size (default 20) AR tokens are generated, rather than waiting for the full segment.

New CLI flags

Flag	Default	Description
`--tts-text-queue-max`	8	Max segments in TTS text input queue
`--tts-latent-queue-max`	4	Max latent batches in TTS-GPT→Vocoder queue

See ARCHITECTURE.md for the full concurrency diagram and queue map.
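
The two-thread handoff described in this section can be sketched with the standard `queue` and `threading` modules. Everything model-related here is a placeholder: `gpt_worker` splits text on whitespace instead of running the BPE + AR loop, `vocoder_worker` uppercases strings instead of running HiFi-GAN with crossfade, and `synthesize` is a hypothetical driver. Only the queue bounds (8 and 4) and the chunk size (20) are taken from the defaults listed above.

```python
import queue
import threading

_STOP = None  # sentinel: no more work


def gpt_worker(text_q, latent_q, chunk_size):
    """Stand-in for the TTS-GPT thread: emit 'latents' in batches of chunk_size."""
    while (text := text_q.get()) is not _STOP:
        batch = []
        for token in text.split():  # placeholder for the BPE + AR loop
            batch.append(token)
            if len(batch) == chunk_size:
                latent_q.put(batch)  # hand off early; blocks if the vocoder lags
                batch = []
        if batch:
            latent_q.put(batch)  # flush the final partial batch
    latent_q.put(_STOP)


def vocoder_worker(latent_q, audio_out):
    """Stand-in for the TTS-Vocoder thread: turn each batch into 'audio'."""
    while (batch := latent_q.get()) is not _STOP:
        audio_out.append(" ".join(batch).upper())  # placeholder for HiFi-GAN


def synthesize(segments, chunk_size=20):
    text_q = queue.Queue(maxsize=8)    # mirrors --tts-text-queue-max
    latent_q = queue.Queue(maxsize=4)  # mirrors --tts-latent-queue-max
    audio_out = []
    threads = [
        threading.Thread(target=gpt_worker, args=(text_q, latent_q, chunk_size), daemon=True),
        threading.Thread(target=vocoder_worker, args=(latent_q, audio_out), daemon=True),
    ]
    for t in threads:
        t.start()
    for segment in segments:
        text_q.put(segment)
    text_q.put(_STOP)
    for t in threads:
        t.join()
    return audio_out


if __name__ == "__main__":
    print(synthesize(["a b c d e"], chunk_size=2))
```

Because the vocoder thread consumes batches as they arrive, the first piece of audio is available after one batch rather than after the whole segment, which is the latency benefit this refactor targets.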

LICENSE and COPYRIGHT

This repository is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

✅ Research and academic use
✅ Personal experimentation
✅ Open-source contributions
❌ Commercial applications
❌ Production deployment
❌ Monetized services

By: Patrick Lumbantobing

Copyright © VertoX-AI

Citation

If you use this system in your research, please cite:

@misc{vertoxai2026streamingspeechtranslation,
  title={Streaming Speech Translation — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}

Acknowledgments

NVIDIA for the NeMo cache-aware streaming ASR model
istupakov for the ONNX reference
Google for the TranslateGemma NMT model
Coqui for XTTSv2
