Generation rate

#1
by NGC404 - opened

Hi, can you tell me how much the generation speed has increased in your model?

We'll keep you updated. The model still needs to be modified for CUDA graph compatibility and to run on GPU.

If you are using CPU for now, we advise the m8a family of AWS EC2 instances.

Quick update: we're working on the branch feat/cudagraph-compat-talker-localtalker-codecdecoder and are currently at ~500 ms latency on an A10G (EC2 g5.xlarge) for 1 s of audio (~2x real-time, i.e. RTFx ≈ 2).
This is still very rough, as many small ops are still falling back to CPU.
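
For context, the real-time factor quoted here is just audio duration divided by wall-clock generation time. A minimal sketch of measuring it, assuming a hypothetical generate(text) function that returns the duration of the produced audio in seconds:

```python
import time

def measure_rtf(generate, text: str) -> float:
    """Real-time factor: seconds of audio produced per second of compute."""
    start = time.perf_counter()
    audio_seconds = generate(text)  # hypothetical: returns generated audio length in seconds
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed  # ~2.0 here: 1 s of audio in ~500 ms
```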

We are working to eliminate the remaining unnecessary ops and will merge to main once the numerical output is also stable.

Current average processing times per prefill/step/local-step, and per 4 chunk_frames for the codec decoder:

  • speaker_encoder mean=5.99 ms (runs once per reset)
  • text_embed_proj mean=0.23 ms (runs at the text token rate, perhaps around 6 Hz, lower than the speech token rate, so x6 for 1 s of audio ≈ 1.38 ms)
  • talker_prefill mean=9.73 ms (runs once per reset)
  • talker_step mean=8.01 ms (x12.5 per 1 s = 100.13 ms)
  • local_prefill mean=1.45 ms (x12.5 per 1 s = 18.13 ms)
  • local_step mean=1.21 ms (x14 x12.5 per 1 s = 211.75 ms)
  • local_lm_head mean=0.35 ms (x15 x12.5 per 1 s = 65.63 ms)
  • codec_decoder mean=33.14 ms (x3.125 per 1 s for 4 chunk_frames, i.e., 12.5/4 = 103.56 ms)
    --> total ~500.58 ms per 1 s of audio (excluding the once-per-reset speaker_encoder and talker_prefill)
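
To make the arithmetic explicit: the total sums only the per-step ops, each scaled by its call rate per second of audio (12.5 speech frames/s, 14 local steps and 15 head calls per frame, one codec-decoder call per 4 frames); the once-per-reset ops are excluded. A quick sketch reproducing it:

```python
# (mean latency in ms, calls per 1 s of audio at 12.5 frames/s)
ops = {
    "text_embed_proj": (0.23, 6),          # ~6 Hz text token rate
    "talker_step":     (8.01, 12.5),
    "local_prefill":   (1.45, 12.5),
    "local_step":      (1.21, 14 * 12.5),
    "local_lm_head":   (0.35, 15 * 12.5),
    "codec_decoder":   (33.14, 12.5 / 4),  # one call per 4 chunk_frames
}
total = sum(mean * rate for mean, rate in ops.values())
print(f"total ~{total:.2f} ms per 1 s of audio")  # ~500.57, matching ~500.58 up to rounding
```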

It has been merged to main, and you can use it for real-time streaming on at least a g6.xlarge (L4) or g5.xlarge (A10G).
More detailed profiling will be provided.
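
For anyone wiring this up themselves, below is a minimal sketch of opening one of the ONNX files with onnxruntime's CUDA execution provider. The file name is a placeholder, not necessarily what the repo ships, and the enable_cuda_graph option requires the fixed shapes discussed further down:

```python
import onnxruntime as ort

# "talker_step.onnx" is a placeholder name for illustration only.
sess = ort.InferenceSession(
    "talker_step.onnx",
    providers=[
        ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),  # needs static shapes
        "CPUExecutionProvider",  # fallback for unsupported ops
    ],
)
```

Note that CUDA graph capture in onnxruntime also requires binding inputs and outputs to fixed GPU buffers via I/O binding.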

We will try to remove the remaining dependency on transformers for the text tokenizer.
Further optimization of the ops is also under consideration.
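
As an illustration of what dropping that dependency could look like, the standalone tokenizers package can load a tokenizer file directly, without pulling in transformers; whether this model's tokenizer is available as a tokenizer.json is an assumption here:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path in the repo
ids = tok.encode("Hello, world!").ids        # token ids without importing transformers
```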

How many concurrent streams can run on a g6.xlarge (L4) and a g5.xlarge (A10G)?

The major ONNX files use a fixed batch axis for CUDA graph compatibility. The axis could be made dynamic, but only without CUDA graphs, or we could ship separate files for multiple batch sizes. Since all of these are freely usable, and to the best of our knowledge this is the only real streaming TTS that can produce high-quality output with voice cloning, you can use batch size 1 for now. We will provide more options for concurrent usage through our company VertoX, stay tuned.
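
To check whether a given ONNX file has a fixed or dynamic batch axis, the input shapes can be inspected with the onnx package; the file name is again a placeholder:

```python
import onnx

model = onnx.load("talker_step.onnx")  # placeholder file name
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # an integer batch dim is fixed; a named dim is dynamic
```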
