Official implementation of VQMIVC: One-Shot (Any-to-Any) Voice Conversion, presented at Interspeech 2021, plus an online demo!
This paper proposes a speech representation disentanglement framework for one-shot (any-to-any) voice conversion, which converts between arbitrary speakers given only a single target-speaker utterance as reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding, and mutual information (MI) is introduced as the correlation metric during training: reducing the inter-dependencies among content, speaker, and pitch representations in an unsupervised manner yields proper disentanglement.
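For intuition, the MI minimization can be sketched with a CLUB-style upper-bound estimator (the paper uses a variational CLUB, vCLUB). The snippet below is a minimal illustration only, not the repository's actual module; the class name, layer sizes, and architecture are all made up for this sketch.

```python
# Minimal sketch of a CLUB-style MI upper bound between two representations
# x and y (e.g. content and speaker embeddings). Hypothetical names; the
# repository's actual estimator may differ in architecture and details.
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    def __init__(self, x_dim, y_dim, hidden_dim):
        super().__init__()
        # Variational approximation q(y|x), a diagonal Gaussian over y.
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU(),
                                nn.Linear(hidden_dim, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, y_dim), nn.Tanh())

    def loglikeli(self, x, y):
        # Maximized w.r.t. the estimator so q(y|x) tracks the true conditional.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, x, y):
        # Minimized w.r.t. the encoders: positive-pair likelihood minus the
        # average likelihood of shuffled (negative) pairs across the batch.
        mu, logvar = self.mu(x), self.logvar(x)
        positive = -((y - mu) ** 2) / logvar.exp()
        negative = -((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2) / logvar.exp().unsqueeze(1)
        return (positive.sum(dim=1) - negative.sum(dim=2).mean(dim=1)).mean() / 2
```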
Many thanks to ericguizzo & AK391!
Python 3.6 is used. Optionally, install apex to speed up training. The remaining requirements are listed in 'requirements.txt' and can be installed with:
pip install -r requirements.txt
ParallelWaveGAN is used as the vocoder, so please install ParallelWaveGAN first in order to try the pre-trained models:
python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path}
For example:
python convert_example.py -s test_wavs/p225_038.wav -r test_wavs/p334_047.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt
The converted wav is saved in the 'converted' directory.
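If you want to run the vocoding step yourself, the parallel_wavegan package exposes a model loader and an inference method. A minimal sketch, where the checkpoint path, mel input, and sample rate are placeholders (match them to your own vocoder config):

```python
# Sketch: turning a converted mel-spectrogram back into a waveform with the
# ParallelWaveGAN package. The checkpoint path and `mel` are placeholders,
# not files shipped with this repo.
import torch
import soundfile as sf
from parallel_wavegan.utils import load_model

model = load_model("vocoder/checkpoint-400000steps.pkl")  # hypothetical path
model.remove_weight_norm()
model.eval()

mel = torch.randn(500, 80)  # (frames, n_mels) converted mel, placeholder
with torch.no_grad():
    wav = model.inference(mel).view(-1)
# The sample rate must match the vocoder's training config.
sf.write("converted/out.wav", wav.cpu().numpy(), 16000)
```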
Put the VCTK corpus under the 'Dataset/' directory.
Split the training/testing speakers and extract features (mel + lf0):
python preprocess.py
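For reference, the extracted features are a mel-spectrogram and log-F0 (lf0) per utterance. A minimal sketch of that kind of extraction, assuming librosa for the mel and pyworld for F0; the file path and parameter values here are illustrative, not necessarily those used in preprocess.py:

```python
# Sketch of mel + lf0 extraction for one utterance. Parameter values are
# illustrative; the actual settings live in preprocess.py and its config.
import numpy as np
import librosa
import pyworld

wav, sr = librosa.load("Dataset/VCTK-Corpus/wav48/p225/p225_001.wav", sr=16000)

# Log mel-spectrogram, transposed to (frames, n_mels).
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
logmel = np.log(np.maximum(mel, 1e-10)).T

# F0 with WORLD (DIO + StoneMask refinement), then log-scale voiced frames.
x = wav.astype(np.float64)
f0, t = pyworld.dio(x, sr, frame_period=10.0)
f0 = pyworld.stonemask(x, f0, t, sr)
lf0 = np.zeros_like(f0)
lf0[f0 > 0] = np.log(f0[f0 > 0])  # unvoiced frames stay 0
```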
Training with mutual information minimization (MIM):
python train.py use_CSMI=True use_CPMI=True use_PSMI=True
Training without MIM:
python train.py use_CSMI=False use_CPMI=False use_PSMI=False
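The three flags toggle the pairwise MI terms: content-speaker (CSMI), content-pitch (CPMI), and pitch-speaker (PSMI). Schematically, the total objective could gate them as below; the loss names and weight are hypothetical and shown only to illustrate what the flags switch on and off, not copied from train.py:

```python
def total_loss(recon_loss, vqcpc_loss, mi_cs, mi_cp, mi_ps,
               use_CSMI=True, use_CPMI=True, use_PSMI=True, lambda_mi=1.0):
    """Combine the base losses with whichever MI penalties the flags enable."""
    loss = recon_loss + vqcpc_loss
    if use_CSMI:  # content <-> speaker MI
        loss = loss + lambda_mi * mi_cs
    if use_CPMI:  # content <-> pitch MI
        loss = loss + lambda_mi * mi_cp
    if use_PSMI:  # pitch <-> speaker MI
        loss = loss + lambda_mi * mi_ps
    return loss
```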
Put the PWG vocoder under the 'vocoder/' directory.
Inference with model trained with MIM:
python convert.py checkpoint=checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt
Inference with model trained without MIM:
python convert.py checkpoint=checkpoints/useCSMIFalse_useCPMIFalse_usePSMIFalse_useAmpTrue/model.ckpt-500.pt
If you use this code in your research, please star our repo and cite our paper:
@inproceedings{wang21n_interspeech,
author={Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
title={{VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion}},
year=2021,
booktitle={Proc. Interspeech 2021},
pages={1344--1348},
doi={10.21437/Interspeech.2021-283}
}