Audio / Text-to-Speech (xtts-v2)

Hello,

My XTTS v2 clone sounds American and stutters on long sentences. How can I get a UK accent and stable speech?

I am using XTTS v2 for an AI tutor project. My reference voice is a British English teacher; I pass that clip as speaker_wav and set language="en".

Issues

  1. The output often sounds like US English rather than British English.

  2. On long or run-on sentences, the output sometimes repeats words or gets stuck.

Setup

  • OS: Windows

  • Python: 3.11

  • TTS: 0.22.0

  • Model: xtts-v2 (local)

  • Inference: language="en", speaker_wav=<british_teacher.wav>

  • Data: ~75 clean clips of the same speaker (mono WAV)
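For reference, my inference call looks roughly like this (a sketch using the Coqui TTS 0.22.0 Python API; the file paths are placeholders for my actual data):

```python
def synthesize(text: str, out_path: str = "out.wav") -> str:
    """Zero-shot XTTS v2 synthesis from a single British reference clip."""
    from TTS.api import TTS  # requires `pip install TTS`

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav="british_teacher.wav",  # my reference recording
        language="en",  # I only see "en"; I cannot find an "en-gb" option
        file_path=out_path,
    )
    return out_path
```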

What I tried

  • Multiple short reference clips from the same speaker

  • Clean audio with no music or noise

  • Added full stops and commas, and split long text into shorter sentences
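Concretely, this is roughly how I split the text before synthesis (a simplistic regex splitter; abbreviations like "Dr." will be split incorrectly):

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text at sentence-ending punctuation, then pack consecutive
    sentences into chunks of at most max_chars characters, so XTTS
    never sees one very long input."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

I then synthesize each chunk separately and concatenate the resulting audio.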

Questions

  1. Is there a way to force a UK accent in XTTS v2 (for example, a language code like en-gb), or is the accent taken only from the reference audio?

  2. How much reference audio is recommended for better accent retention?

  3. What settings help avoid stutters on long sentences? Is there a best practice for sentence splitting with XTTS v2?

  4. Would a small fine-tune on my British dataset help more than zero-shot cloning? If so, which training recipe should I follow?


If you’re particular about accents, it might be easier to use a TTS other than XTTS-v2…
