Parakeet TDT 0.6B v3 - ONNX INT8

ONNX INT8-quantized version of NVIDIA's Parakeet TDT 0.6B v3 for browser inference.

Attribution

This model is a converted version of nvidia/parakeet-tdt-0.6b-v3 by NVIDIA.

Original Model: NVIDIA Parakeet TDT 0.6B v3
Original License: CC-BY-4.0
Modifications: Converted from NeMo to ONNX format, quantized to INT8

Usage

Optimized for ONNX Runtime Web with WASM backend.

  • Works on all browsers (no WebGPU required)
  • ~890MB total download
  • Fallback for devices without WebGPU

Files

File                      Size    Description
encoder-int8.onnx         1.4MB   Encoder model graph
encoder-int8.onnx.data    838MB   Encoder weights (MatMul/Gemm INT8, Conv FP32)
decoder_joint-int8.onnx   52MB    Decoder + joiner
vocab.txt                 92KB    Tokenizer vocabulary
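
As a quick sanity check, the file sizes listed above (as quoted, so roughly) account for the "~890MB total download" figure in the Usage section:

```python
# Sizes in MB as quoted in the table above (92KB ≈ 0.092MB).
sizes_mb = {
    "encoder-int8.onnx": 1.4,
    "encoder-int8.onnx.data": 838,
    "decoder_joint-int8.onnx": 52,
    "vocab.txt": 0.092,
}
total = sum(sizes_mb.values())
print(f"{total:.0f} MB")  # ≈ 891 MB
```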

Conversion

Converted using NeMo Toolkit + PyTorch 2.4 with weight-only dynamic quantization:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "encoder-temp.onnx",            # FP32 encoder exported from NeMo
    "encoder-int8.onnx",
    weight_type=QuantType.QInt8,
    op_types_to_quantize=['MatMul', 'Gemm'],  # skip Conv ops (see notes below)
    use_external_data_format=True,  # keep large weights in the .onnx.data file
    extra_options={
        'WeightSymmetric': True,    # symmetric INT8, zero point fixed at 0
        'MatMulConstBOnly': True,   # only quantize MatMuls with a constant B input
    }
)

Important Notes

  • Only MatMul/Gemm ops are quantized. Conv ops remain FP32 to avoid emitting ConvInteger ops, which the ONNX Runtime Web WASM backend does not support.
  • Weight-only quantization: activations stay FP32; only the weights are stored as INT8.
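
The weight-only, symmetric scheme described above can be illustrated with a minimal NumPy sketch. This is illustrative only, not ONNX Runtime's actual kernel: per-tensor symmetric INT8 uses a single scale, a zero point fixed at 0, and the weight is dequantized back to FP32 for the MatMul while activations stay FP32 throughout.

```python
import numpy as np

def quantize_symmetric_int8(w: np.ndarray):
    """Per-tensor symmetric INT8: scale maps the largest |weight| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)  # stand-in for a MatMul weight
q, scale = quantize_symmetric_int8(w)

# Activations stay FP32; the INT8 weight is dequantized before the MatMul.
x = rng.standard_normal((1, 4)).astype(np.float32)
y_fp32 = x @ w
y_int8 = x @ dequantize(q, scale)

print(np.max(np.abs(y_fp32 - y_int8)))  # small quantization error
```

Because the zero point is fixed at 0 (WeightSymmetric), dequantization is a single multiply by the scale, which is what makes this cheap to fuse into the MatMul at inference time.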

License

CC-BY-4.0 - © NVIDIA Corporation
