VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance (Submitted to ICASSP 2025)
Authors
- Jiheum Yeom quilava1234@snu.ac.kr
- Heeseung Kim gmltmd789@snu.ac.kr
- Jooyoung Choi jy_choi@snu.ac.kr
- Che Hyun Lee saga1214@snu.ac.kr
- Nohil Park pnoil2588@snu.ac.kr
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
Abstract
When parameter-efficient finetuning via LoRA is applied to speaker-adaptive text-to-speech models, adaptation performance may decline compared to fully finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker-adaptive text-to-speech system reinforced with autoguidance to enhance speaker adaptation performance, reducing the gap against fully finetuned models. We carefully explore various ways of strengthening autoguidance and identify the strategy that works best. As a result, VoiceGuider shows robust adaptation performance, especially on extreme out-of-domain speech data. We provide audio samples on our demo page.
All generated voices were resampled to 16 kHz and normalized to -27 dB for fair comparison.
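As a rough illustration of the level normalization, the sketch below scales a waveform so that its RMS level sits at -27 dB relative to full scale. This is an assumption: the page does not say whether -27 dB refers to RMS, peak, or a loudness measure, and the helper names `normalize_rms_db` and `rms_db` are hypothetical. Resampling to 16 kHz is not shown and would normally be done with an audio library.

```python
import math

def normalize_rms_db(samples, target_db=-27.0):
    """Scale samples so their RMS level equals target_db (dB re full scale 1.0).

    Assumption: the level is measured as RMS; the page does not specify this.
    """
    if not samples:
        return []
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # pure silence: no gain can change its level
    gain = 10.0 ** (target_db / 20.0) / rms
    return [x * gain for x in samples]

def rms_db(samples):
    """Return the RMS level of non-silent samples in dB re full scale 1.0."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(rms)

# One second of a 440 Hz tone at 16 kHz, used here as stand-in audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
leveled = normalize_rms_db(tone)
```

Bringing every system's output to the same level removes loudness as a confound when the samples are compared by ear.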
Comparison between Full-finetuning (UnitSpeech) and LoRA-tuning (VoiceTailor) on Different Domains of Datasets
LibriTTS: In-domain dataset used for pretraining. Reference speakers are randomly chosen from the test set.
Transcript: But Polly couldn’t speak and if Jasper hadn’t caught her just in time, she would have tumbled over backward from the stool, Phronsie and all!
Reference | GT | UnitSpeech | VoiceTailor |
---|---|---|---|
VCTK: Out-of-domain dataset not used for pretraining.
Transcript: We decided we would go for a specialist inside centre.
Reference | GT | UnitSpeech | VoiceTailor |
---|---|---|---|
GigaSpeech: A more extreme out-of-domain dataset. It often contains data collected from in-the-wild sources such as YouTube.
Transcript: Tired of eating the same old food, fancy a taste of some of the finest ingredients money can buy.
Reference | GT | UnitSpeech | VoiceTailor |
---|---|---|---|
Adaptive Text-to-Speech Model Comparison on GigaSpeech
Transcript: That was of course until the steam powered loom came into the picture. With weaving technology rapidly improving, his father’s loom business would crumble finding himself crushed by the industrial revolution.
Reference | GT | XTTS v2 | CosyVoice | UnitSpeech | VoiceTailor | VoiceGuider |
---|---|---|---|---|---|---|
Transcript: This video would not have been possible if it wasn’t for our friends at acorns and investment app with over six million users that makes investing as easy as spending.
Reference | GT | XTTS v2 | CosyVoice | UnitSpeech | VoiceTailor | VoiceGuider |
---|---|---|---|---|---|---|
Transcript: And your heart skips a beat, and you’re like a dead for a millisecond or something. um, yeah, clearly not true.
Reference | GT | XTTS v2 | CosyVoice | UnitSpeech | VoiceTailor | VoiceGuider |
---|---|---|---|---|---|---|
Ablation Studies
Number of training iterations for the Inferior Model used for Autoguidance
Transcript: But if you want to shell out and see for yourself, you’ll have to head down to australia is at peninsula.
Reference | Iteration 0 | Iteration 100 (default) | Iteration 200 | Iteration 300 | Iteration 400 | Iteration 500 |
---|---|---|---|---|---|---|
Transcript: Caffeine has been shown to boost metabolism by up to eleven percent and dramatically increase fat burning potential.
Reference | Iteration 0 | Iteration 100 (default) | Iteration 200 | Iteration 300 | Iteration 400 | Iteration 500 |
---|---|---|---|---|---|---|
Rank $r$ for Autoguidance Model
Transcript: Giving is most fulfilling when you donate your money or time to a cause you’re passionate about.
Reference | $r = 1$ (default) | $r = 2$ | $r = 4$ | $r = 8$ |
---|---|---|---|---|
Transcript: In order to boost public interest in space exploration and hopefully increase nasa’s budget.
Reference | $r = 1$ (default) | $r = 2$ | $r = 4$ | $r = 8$ |
---|---|---|---|---|
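For context on what the rank $r$ controls, the sketch below counts the trainable parameters that a LoRA update adds to one linear layer: the update is a product of a $d_{out} \times r$ matrix and an $r \times d_{in}$ matrix, so the count grows linearly with $r$. The 256-by-256 layer size is a hypothetical value chosen only for illustration and is not taken from the model used here.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of a rank-r LoRA update on a d_out x d_in weight.

    The update is B @ A with B of shape (d_out, r) and A of shape (r, d_in).
    """
    return r * (d_in + d_out)

# Hypothetical layer size, for illustration only.
D_IN, D_OUT = 256, 256
full_count = D_IN * D_OUT

for r in (1, 2, 4, 8):
    count = lora_param_count(D_IN, D_OUT, r)
    print(f"r={r}: {count} LoRA parameters vs {full_count} in the full weight")
```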
Autoguidance Scale $\gamma_a$
Transcript: And brought them both to multi billion dollar valuations.
Reference | $\gamma_a = 0.0$ | $\gamma_a = 0.33$ | $\gamma_a = 0.66$ | $\gamma_a = 1.0$ (default) | $\gamma_a = 1.33$ |
---|---|---|---|---|---|
Transcript: Waiting for carbs mistakes and the game was adjourned to be resumed next day.
Reference | $\gamma_a = 0.0$ | $\gamma_a = 0.33$ | $\gamma_a = 0.66$ | $\gamma_a = 1.0$ (default) | $\gamma_a = 1.33$ |
---|---|---|---|---|---|
Upper Guidance Interval $t_{hi}$
Transcript: Always do a proper analysis but regardless these rules of thumb can come in really handy in saving you from analyzing every single deal that you’ve come across.
Reference | $t_{hi} = 0.9$ | $t_{hi} = 0.8$ | $t_{hi} = 0.7$ | $t_{hi} = 0.6$ (default) | $t_{hi} = 0.5$ |
---|---|---|---|---|---|
Transcript: And the nine scientific benefits to having a daily cup of joe.
Reference | $t_{hi} = 0.9$ | $t_{hi} = 0.8$ | $t_{hi} = 0.7$ | $t_{hi} = 0.6$ (default) | $t_{hi} = 0.5$ |
---|---|---|---|---|---|
Lower Guidance Interval $t_{lo}$
Transcript: So i’ll just get right into it. The first is an easy one that’s actually in my artist of life workbook, is to list ten things that you love about yourself
Reference | $t_{lo} = 0.1$ (default) | $t_{lo} = 0.2$ | $t_{lo} = 0.3$ | $t_{lo} = 0.4$ | $t_{lo} = 0.5$ |
---|---|---|---|---|---|
Transcript: It can also help to commit to a consistent donation over time and to focus on the less glamorous but essential needs the charity may have.
Reference | $t_{lo} = 0.1$ (default) | $t_{lo} = 0.2$ | $t_{lo} = 0.3$ | $t_{lo} = 0.4$ | $t_{lo} = 0.5$ |
---|---|---|---|---|---|
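The autoguidance scale $\gamma_a$ and the guidance interval $[t_{lo}, t_{hi}]$ ablated above can be read together through the minimal sketch below. It assumes one common way of writing autoguidance: the main model's prediction is pushed away from the inferior model's prediction by a factor $\gamma_a$, and only at timesteps inside the interval. The exact formulation, the time convention, and the function name `guided_prediction` are assumptions for illustration and are not taken from this page. The default arguments mirror the defaults listed in the tables.

```python
def guided_prediction(main_out, inferior_out, t,
                      gamma_a=1.0, t_lo=0.1, t_hi=0.6):
    """Combine main and inferior model outputs at timestep t.

    Assumed form: main + gamma_a * (main - inferior) when t_lo <= t <= t_hi,
    and the unguided main output otherwise.
    """
    if not (t_lo <= t <= t_hi):
        return list(main_out)  # outside the interval: no guidance
    return [m + gamma_a * (m - w) for m, w in zip(main_out, inferior_out)]

# Toy two-dimensional outputs standing in for model predictions.
main = [1.0, 2.0]
inferior = [0.5, 1.0]

inside = guided_prediction(main, inferior, t=0.3)   # guidance applied
outside = guided_prediction(main, inferior, t=0.8)  # guidance skipped
no_scale = guided_prediction(main, inferior, t=0.3, gamma_a=0.0)
```

Under this reading, $\gamma_a = 0.0$ reduces to the unguided main model, and narrowing the interval limits guidance to fewer sampling steps.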