VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance (Submitted to ICASSP 2025)

Authors

Abstract

When parameter-efficient finetuning via LoRA is applied to speaker-adaptive text-to-speech models, adaptation performance may decline compared to fully finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker-adaptive text-to-speech system reinforced with autoguidance to enhance speaker adaptation performance, reducing the gap to fully finetuned models. We explore various ways of strengthening autoguidance and select the best-performing strategy. As a result, VoiceGuider shows robust adaptation performance, especially on extreme out-of-domain speech data. We provide audio samples on this demo page.

All generated voices were resampled to 16 kHz and normalized to -27 dB for fair comparison.
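
The level normalization mentioned above can be sketched as follows. This is only an illustration: it assumes that -27 dB refers to the RMS level relative to full scale, which this page does not specify, and the helper names `rms_dbfs` and `normalize_rms` are hypothetical. Resampling is left out.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in [-1, 1], in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        raise ValueError("cannot measure the level of a silent signal")
    return 10.0 * math.log10(mean_square)


def normalize_rms(samples, target_db=-27.0):
    """Scale samples so that their RMS level equals target_db (assumed measure)."""
    gain = 10.0 ** ((target_db - rms_dbfs(samples)) / 20.0)
    return [s * gain for s in samples]


# Hypothetical input: one second of a 220 Hz sine sampled at 16 kHz.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
normalized = normalize_rms(tone)
```

Bringing every system to the same level keeps loudness differences from influencing the listening comparison.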

Comparison between Full-finetuning (UnitSpeech) and LoRA-tuning (VoiceTailor) on Different Domains of Datasets

LibriTTS: In-domain dataset used for pretraining. Reference speakers are randomly chosen from the test set.

Transcript: But Polly couldn’t speak and if Jasper hadn’t caught her just in time, she would have tumbled over backward from the stool, Phronsie and all!

Reference GT UnitSpeech VoiceTailor

VCTK: Out-of-domain dataset not used for pretraining.

Transcript: We decided we would go for a specialist inside centre.

Reference GT UnitSpeech VoiceTailor

GigaSpeech: A more extreme out-of-domain dataset. It often contains data collected from in-the-wild sources such as YouTube.

Transcript: Tired of eating the same old food, fancy a taste of some of the finest ingredients money can buy.

Reference GT UnitSpeech VoiceTailor




Adaptive Text-to-Speech Model Comparison on GigaSpeech

Transcript: That was of course until the steam powered loom came into the picture. With weaving technology rapidly improving, his father’s loom business would crumble finding himself crushed by the industrial revolution.

Reference GT XTTS v2 CosyVoice UnitSpeech VoiceTailor VoiceGuider

Transcript: This video would not have been possible if it wasn’t for our friends at acorns and investment app with over six million users that makes investing as easy as spending.

Reference GT XTTS v2 CosyVoice UnitSpeech VoiceTailor VoiceGuider

Transcript: And your heart skips a beat, and you’re like a dead for a millisecond or something. um, yeah, clearly not true.

Reference GT XTTS v2 CosyVoice UnitSpeech VoiceTailor VoiceGuider




Ablation Studies

Number of training iterations for the Inferior Model used for Autoguidance

Transcript: But if you want to shell out and see for yourself, you’ll have to head down to australia is at peninsula.

Reference Iteration 0 Iteration 100 (default) Iteration 200 Iteration 300 Iteration 400 Iteration 500

Transcript: Caffeine has been shown to boost metabolism by up to eleven percent and dramatically increase fat burning potential.

Reference Iteration 0 Iteration 100 (default) Iteration 200 Iteration 300 Iteration 400 Iteration 500


Rank $r$ for Autoguidance Model

Transcript: Giving is most fulfilling when you donate your money or time to a cause you’re passionate about.

Reference $r = 1$ (default) $r = 2$ $r = 4$ $r = 8$

Transcript: In order to boost public interest in space exploration and hopefully increase nasa’s budget.

Reference $r = 1$ (default) $r = 2$ $r = 4$ $r = 8$
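
For context on the rank ablation, the number of trainable parameters in a LoRA update grows linearly with $r$. The sketch below uses the standard LoRA factorization of a weight update into two matrices of shapes (d_out, r) and (r, d_in). The layer size is a hypothetical example and does not come from the VoiceGuider architecture.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of a rank-r LoRA update for one d_out x d_in weight."""
    return r * (d_in + d_out)


# Hypothetical 256 x 256 linear layer, evaluated at the ranks in the ablation.
d_in, d_out = 256, 256
full_count = d_in * d_out
counts = {r: lora_param_count(d_in, d_out, r) for r in (1, 2, 4, 8)}
```

With these example dimensions, rank 1 trains 512 parameters for the layer, compared with 65536 for the full weight matrix.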


Autoguidance Scale $\gamma_a$

Transcript: And brought them both to multi billion dollar valuations.

Reference $\gamma_a = 0.0$ $\gamma_a = 0.33$ $\gamma_a = 0.66$ $\gamma_a = 1.0$ (default) $\gamma_a = 1.33$

Transcript: Waiting for carbs mistakes and the game was adjourned to be resumed next day.

Reference $\gamma_a = 0.0$ $\gamma_a = 0.33$ $\gamma_a = 0.66$ $\gamma_a = 1.0$ (default) $\gamma_a = 1.33$
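
One way to read the scale values above is as an extrapolation away from the inferior model's prediction. The sketch below assumes the guided output is `main + gamma_a * (main - inferior)`, so that $\gamma_a = 0.0$ recovers the unguided model. This page does not state the exact parameterization, and the function name `autoguide` is hypothetical.

```python
def autoguide(main_pred, inferior_pred, gamma_a=1.0):
    """Push the main prediction away from the inferior one (assumed form)."""
    return [m + gamma_a * (m - i) for m, i in zip(main_pred, inferior_pred)]


# Toy predictions standing in for per-dimension model outputs.
main_pred = [1.0, 2.0]
inferior_pred = [0.5, 2.5]

unguided = autoguide(main_pred, inferior_pred, gamma_a=0.0)
guided = autoguide(main_pred, inferior_pred, gamma_a=1.0)
```

Under this assumption, larger values of $\gamma_a$ move the output further along the direction in which the main model disagrees with the inferior model.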


Upper Guidance Interval $t_{hi}$

Transcript: Always do a proper analysis but regardless these rules of thumb can come in really handy in saving you from analyzing every single deal that you’ve come across.

Reference $t_{hi} = 0.9$ $t_{hi} = 0.8$ $t_{hi} = 0.7$ $t_{hi} = 0.6$ (default) $t_{hi} = 0.5$

Transcript: And the nine scientific benefits to having a daily cup of joe.

Reference $t_{hi} = 0.9$ $t_{hi} = 0.8$ $t_{hi} = 0.7$ $t_{hi} = 0.6$ (default) $t_{hi} = 0.5$


Lower Guidance Interval $t_{lo}$

Transcript: So i’ll just get right into it. The first is an easy one that’s actually in my artist of life workbook, is to list ten things that you love about yourself

Reference $t_{lo} = 0.1$ (default) $t_{lo} = 0.2$ $t_{lo} = 0.3$ $t_{lo} = 0.4$ $t_{lo} = 0.5$

Transcript: It can also help to commit to a consistent donation over time and to focus on the less glamorous but essential needs the charity may have.

Reference $t_{lo} = 0.1$ (default) $t_{lo} = 0.2$ $t_{lo} = 0.3$ $t_{lo} = 0.4$ $t_{lo} = 0.5$
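
The two interval ablations can be summarized as a gate on the guidance scale. The sketch below assumes a diffusion time $t$ in $[0, 1]$ and inclusive bounds, with guidance active only for $t_{lo} \le t \le t_{hi}$. The defaults are the values marked "(default)" on this page. The function name and the step grid are hypothetical.

```python
def guidance_scale_at(t, gamma_a=1.0, t_lo=0.1, t_hi=0.6):
    """Return gamma_a inside the guidance interval and 0.0 outside it."""
    return gamma_a if t_lo <= t <= t_hi else 0.0


# Hypothetical sampling grid: which steps receive guidance.
timesteps = [0.95, 0.75, 0.55, 0.35, 0.15, 0.05]
scales = [guidance_scale_at(t) for t in timesteps]
```

Outside the interval the sampler would fall back to the unguided prediction, so only the steps at 0.55, 0.35, and 0.15 are guided in this example.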