# Parakeet TDT 0.6B v3 - ONNX INT8
ONNX INT8-quantized version of NVIDIA's Parakeet TDT 0.6B v3 for browser inference.
## Attribution

This model is a converted version of nvidia/parakeet-tdt-0.6b-v3 by NVIDIA.

- **Original model:** NVIDIA Parakeet TDT 0.6B v3
- **Original license:** CC-BY-4.0
- **Modifications:** converted from NeMo to ONNX format, quantized to INT8
## Usage

Optimized for ONNX Runtime Web with the WASM backend.
- Works in all browsers (no WebGPU required)
- ~890MB total download
- Fallback for devices without WebGPU
## Files

| File | Size | Description |
|---|---|---|
| `encoder-int8.onnx` | 1.4MB | Encoder model graph |
| `encoder-int8.onnx.data` | 838MB | Encoder weights (MatMul/Gemm INT8, Conv FP32) |
| `decoder_joint-int8.onnx` | 52MB | Decoder + joiner |
| `vocab.txt` | 92KB | Tokenizer vocabulary |
## Conversion

Converted using the NeMo toolkit and PyTorch 2.4 with weight-only dynamic quantization:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "encoder-temp.onnx",
    "encoder-int8.onnx",
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul", "Gemm"],  # skip Conv ops
    use_external_data_format=True,
    extra_options={
        "WeightSymmetric": True,
        "MatMulConstBOnly": True,
    },
)
```
## Important Notes

- **Only MatMul/Gemm ops are quantized** - Conv ops remain FP32 to avoid creating `ConvInteger` ops, which are not supported by the ONNX Runtime Web WASM backend.
- **Weight-only quantization** - activations stay FP32; only weights are INT8.
## License
CC-BY-4.0 - © NVIDIA Corporation