Generation rate

#1
by NGC404 - opened

Hi, can you tell me how much the generation speed has increased in your model?

We'll keep you updated. The model still needs to be modified for CUDA graph compatibility and to run on GPU.

If you are using CPU for now, we advise the m8a family of AWS EC2 instances.

Quick update: we're working on the branch feat/cudagraph-compat-talker-localtalker-codecdecoder and are currently at ~500 ms latency on an A10G (EC2 g5.xlarge) for 1 s of audio (~2x real-time, i.e. RTFx ≈ 2).
This is still very rough, as many small ops are still falling back to CPU.
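
For context, the real-time factor quoted here is just audio duration divided by wall-clock generation time. A minimal sketch of measuring it, assuming a hypothetical generate(text) function that returns the duration of the produced audio in seconds:

```python
import time

def measure_rtf(generate, text: str) -> float:
    """Real-time factor: seconds of audio produced per second of compute."""
    start = time.perf_counter()
    audio_seconds = generate(text)  # hypothetical: returns generated audio length in seconds
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed  # ~2.0 here: 1 s of audio in ~500 ms
```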

We are working to eliminate the remaining unnecessary ops and will merge to main once the numerical output is also stable.

Current average processing times per prefill/step/local-step, and per 4 chunk_frames for the codec decoder:

  • speaker_encoder mean=5.99 ms (runs once per reset)
  • text_embed_proj mean=0.23 ms (runs at the text token rate, perhaps around 6 Hz, lower than the speech token rate, so x6 for 1 s of audio ≈ 1.38 ms)
  • talker_prefill mean=9.73 ms (runs once per reset)
  • talker_step mean=8.01 ms (x12.5 per 1 s = 100.13 ms)
  • local_prefill mean=1.45 ms (x12.5 per 1 s = 18.13 ms)
  • local_step mean=1.21 ms (x14 x12.5 per 1 s = 211.75 ms)
  • local_lm_head mean=0.35 ms (x15 x12.5 per 1 s = 65.63 ms)
  • codec_decoder mean=33.14 ms (x3.125 per 1 s for 4 chunk_frames, i.e., 12.5/4 = 103.56 ms)
    --> total ~500.58 ms per 1 s of audio (excluding the once-per-reset speaker_encoder and talker_prefill)
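
To make the arithmetic explicit: the total sums only the per-step ops, each scaled by its call rate per second of audio (12.5 speech frames/s, 14 local steps and 15 head calls per frame, one codec-decoder call per 4 frames); the once-per-reset ops are excluded. A quick sketch reproducing it:

```python
# (mean latency in ms, calls per 1 s of audio at 12.5 frames/s)
ops = {
    "text_embed_proj": (0.23, 6),          # ~6 Hz text token rate
    "talker_step":     (8.01, 12.5),
    "local_prefill":   (1.45, 12.5),
    "local_step":      (1.21, 14 * 12.5),
    "local_lm_head":   (0.35, 15 * 12.5),
    "codec_decoder":   (33.14, 12.5 / 4),  # one call per 4 chunk_frames
}
total = sum(mean * rate for mean, rate in ops.values())
print(f"total ~{total:.2f} ms per 1 s of audio")  # ~500.57, matching ~500.58 up to rounding
```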

It has been merged to main, and you can use it for real-time streaming on at least a g6.xlarge (L4) or g5.xlarge (A10G).
More detailed profiling will be provided.
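
For anyone wiring this up themselves, below is a minimal sketch of opening one of the ONNX files with onnxruntime's CUDA execution provider. The file name is a placeholder, not necessarily what the repo ships, and the enable_cuda_graph option requires the fixed shapes discussed further down:

```python
import onnxruntime as ort

# "talker_step.onnx" is a placeholder name for illustration only.
sess = ort.InferenceSession(
    "talker_step.onnx",
    providers=[
        ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),  # needs static shapes
        "CPUExecutionProvider",  # fallback for unsupported ops
    ],
)
```

Note that CUDA graph capture in onnxruntime also requires binding inputs and outputs to fixed GPU buffers via I/O binding.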

We will try to remove the remaining dependency on transformers for the text tokenizer.
Further optimization of the ops is also under consideration.
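
As an illustration of what dropping that dependency could look like, the standalone tokenizers package can load a tokenizer file directly, without pulling in transformers; whether this model's tokenizer is available as a tokenizer.json is an assumption here:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path in the repo
ids = tok.encode("Hello, world!").ids        # token ids without importing transformers
```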

How many concurrent streams can run on a g6.xlarge (L4) and a g5.xlarge (A10G)?

The major ONNX files use a fixed batch axis for CUDA graph compatibility. The axis could be made dynamic, but only without CUDA graphs, or we could ship separate files for multiple batch sizes. Since all of these are freely usable, and to the best of our knowledge this is the only real streaming TTS that can produce high-quality output with voice cloning, you can use batch size 1 for now. We will provide more options for concurrent usage through our company VertoX, stay tuned.
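
To check whether a given ONNX file has a fixed or dynamic batch axis, the input shapes can be inspected with the onnx package; the file name is again a placeholder:

```python
import onnx

model = onnx.load("talker_step.onnx")  # placeholder file name
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # an integer batch dim is fixed; a named dim is dynamic
```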
