NoiseVC Voice Conversion Demos

Abstract

Zero-shot voice conversion (VC) is a challenging task that transfers the voice of a target utterance onto a source utterance without losing the linguistic content, where both the source and target speakers are unseen during training. Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with vector quantization (VQ) or instance normalization (IN) can disentangle content from audio and achieve successful conversion. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so audio quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that disentangles content using VQ and Contrastive Predictive Coding (CPC) together. Additionally, we perform Noise Augmentation to further enhance the disentanglement capability. We conduct several experiments and demonstrate that NoiseVC has strong disentanglement ability without a large sacrifice in quality.
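To make the two ingredients named in the abstract concrete, here is a minimal, illustrative sketch of a vector-quantization lookup combined with additive noise on the input frames. It is not the NoiseVC implementation: the codebook values, the frame values, the helper names `quantize` and `add_noise`, and the choice of Gaussian noise are all assumptions made for illustration, since this page does not specify how Noise Augmentation is performed.

```python
import random


def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector (squared L2)."""
    indices = []
    for frame in frames:
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(dists.index(min(dists)))
    return indices


def add_noise(frames, sigma, rng):
    """Add zero-mean Gaussian noise to every value (assumed form of augmentation)."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


# Hypothetical 2-dimensional codebook and input frames, chosen for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8]]

rng = random.Random(0)
clean_codes = quantize(frames, codebook)
noisy_codes = quantize(add_noise(frames, 0.05, rng), codebook)
```

In this toy setting the noise is small relative to the distance between codebook entries, so the clean and noisy frames map to the same discrete codes. That is the intuition behind pairing quantization with noise: the discrete codes are meant to be stable under perturbations that do not change the content.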

Unseen speakers

Neither the source nor the target speakers were seen during training. Audio samples are randomly picked from the evaluation dataset.

Female to Male

| Source (F) | Target (M) | Conversion |
| --- | --- | --- |
| p343 | p260 | AutoVC, VQVC+, NoiseVC (Ours) |
| p269 | p316 | AutoVC, VQVC+, NoiseVC (Ours) |
| p308 | p275 | AutoVC, VQVC+, NoiseVC (Ours) |
| p299 | p227 | AutoVC, VQVC+, NoiseVC (Ours) |
| p303 | p326 | AutoVC, VQVC+, NoiseVC (Ours) |

Male to Female

| Source (M) | Target (F) | Conversion |
| --- | --- | --- |
| p275 | p299 | AutoVC, VQVC+, NoiseVC (Ours) |
| p326 | p303 | AutoVC, VQVC+, NoiseVC (Ours) |
| p316 | p343 | AutoVC, VQVC+, NoiseVC (Ours) |
| p260 | p269 | AutoVC, VQVC+, NoiseVC (Ours) |
| p227 | p308 | AutoVC, VQVC+, NoiseVC (Ours) |

Female to Female

| Source (F) | Target (F) | Conversion |
| --- | --- | --- |
| p362 | p229 | AutoVC, VQVC+, NoiseVC (Ours) |
| p314 | p308 | AutoVC, VQVC+, NoiseVC (Ours) |
| p303 | p262 | AutoVC, VQVC+, NoiseVC (Ours) |
| p351 | p343 | AutoVC, VQVC+, NoiseVC (Ours) |
| p361 | p282 | AutoVC, VQVC+, NoiseVC (Ours) |

Male to Male

| Source (M) | Target (M) | Conversion |
| --- | --- | --- |
| p227 | p326 | AutoVC, VQVC+, NoiseVC (Ours) |
| p260 | p275 | AutoVC, VQVC+, NoiseVC (Ours) |

Seen speakers

Both the source and target speakers were seen during training. Audio samples are randomly picked from the evaluation dataset.

Female to Male

| Source (F) | Target (M) | Conversion |
| --- | --- | --- |
| p313 | p278 | AutoVC, VQVC+, NoiseVC (Ours) |
| p283 | p263 | AutoVC, VQVC+, NoiseVC (Ours) |
| p249 | p287 | AutoVC, VQVC+, NoiseVC (Ours) |
| p339 | p270 | AutoVC, VQVC+, NoiseVC (Ours) |
| p305 | p272 | AutoVC, VQVC+, NoiseVC (Ours) |

Male to Female

| Source (M) | Target (F) | Conversion |
| --- | --- | --- |
| p274 | p313 | AutoVC, VQVC+, NoiseVC (Ours) |
| p243 | p336 | AutoVC, VQVC+, NoiseVC (Ours) |
| p245 | p239 | AutoVC, VQVC+, NoiseVC (Ours) |
| p270 | p249 | AutoVC, VQVC+, NoiseVC (Ours) |
| p232 | p305 | AutoVC, VQVC+, NoiseVC (Ours) |

Female to Female

| Source (F) | Target (F) | Conversion |
| --- | --- | --- |
| p283 | p249 | AutoVC, VQVC+, NoiseVC (Ours) |
| p234 | p313 | AutoVC, VQVC+, NoiseVC (Ours) |
| p239 | p339 | AutoVC, VQVC+, NoiseVC (Ours) |
| p231 | p336 | AutoVC, VQVC+, NoiseVC (Ours) |
| p305 | p257 | AutoVC, VQVC+, NoiseVC (Ours) |

Male to Male

| Source (M) | Target (M) | Conversion |
| --- | --- | --- |
| p274 | p263 | AutoVC, VQVC+, NoiseVC (Ours) |
| p243 | p272 | AutoVC, VQVC+, NoiseVC (Ours) |
| p245 | p360 | AutoVC, VQVC+, NoiseVC (Ours) |
| p278 | p270 | AutoVC, VQVC+, NoiseVC (Ours) |
| p232 | p287 | AutoVC, VQVC+, NoiseVC (Ours) |