A survey of voice conversion based on non-parallel data

Pengcheng LI; Xulong ZHANG; Jianzong WANG; Ning CHENG; Jing XIAO

doi:10.11959/j.issn.2096-0271.2024011

Research | Updated: 2024-06-03

- A survey of voice conversion based on non-parallel data
- Big Data Research, 2024, Vol. 10, No. 3, pp. 65-81
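- Author affiliations:
  
  1. Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, Guangdong 518063, China
  2. University of Science and Technology of China, Hefei, Anhui 230026, China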
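- About the authors:
  
  Pengcheng LI (1999- ), male, is a master's student at the University of Science and Technology of China and an algorithm engineer at Ping An Technology (Shenzhen) Co., Ltd. His main research interests include speech synthesis, voice conversion, and speech security.
  
  Xulong ZHANG (1988- ), male, Ph.D., is a senior algorithm researcher at Ping An Technology (Shenzhen) Co., Ltd. He received his Ph.D. in computer science from Fudan University. His main research interests include speech synthesis, voice conversion, music information retrieval, and the application of machine learning and deep learning methods in artificial intelligence. He serves as an external supervisor at the Research Institute of Tsinghua University in Shenzhen and at the Institute of Advanced Technology, University of Science and Technology of China. He is a member of IEEE, the Chinese Association of Automation, and the China Computer Federation, and a member of the Technical Committee on Federated Data and Federated Intelligence. In 2023 he was selected for the youth program of the Shanghai Oriental Talents Plan.
  
  Jianzong WANG (1983- ), male, Ph.D., is deputy chief engineer, senior AI director, general manager of the Federated Learning Technology Department, and dean of the Intelligent Finance Frontier Technology Research Institute at Ping An Technology (Shenzhen) Co., Ltd. He was a postdoctoral researcher in artificial intelligence at the University of Florida and received his Ph.D. through a joint program between Rice University and Huazhong University of Science and Technology. He is a senior member of the China Computer Federation, a member of its Big Data Expert Committee, and deputy director of the Technical Committee on Federated Data and Federated Intelligence of the Chinese Association of Automation. His main research interests include large models, federated learning, and deep learning.
  
  Ning CHENG (1981- ), male, Ph.D., is a senior AI expert at Ping An Technology (Shenzhen) Co., Ltd. He received his Ph.D. from the Institute of Automation, Chinese Academy of Sciences. His work focuses on AI algorithms and their application to speech processing and natural language processing. He has published more than 50 papers in leading international conferences and journals on big data, machine learning, and artificial intelligence, and has filed more than 100 invention patent applications.
  
  Jing XIAO (1972- ), male, Ph.D., received his doctorate from Carnegie Mellon University and is a National Distinguished Expert. He is the technical lead of the National New Generation Artificial Intelligence Open Innovation Platform for Inclusive Finance, a member of the Shenzhen Municipal Committee of the CPPCC, and a member of the Shenzhen Decision-Making Advisory Committee. He also serves as vice chair of the Shenzhen chapter of the China Computer Federation, vice president of the Guangdong Society of Artificial Intelligence and Robotics, president of the Shenzhen Artificial Intelligence Industry Association, and vice president of the Shenzhen Association for Artificial Intelligence, and is a visiting professor at Tsinghua University, Shanghai Jiao Tong University, Tongji University, and other universities. He has long worked on artificial intelligence and big data analysis and mining. He previously held senior R&D management positions at Epson's US research institute and at Microsoft, and is now chief scientist of Ping An Group, responsible for AI research and development and its application in finance, healthcare, smart cities, and other fields, where his team has set several benchmarks for the intelligent operation of traditional industries. He has published 249 academic papers, holds 101 granted US patents and 155 Chinese invention patents, and has participated in or led 8 national-level projects. For his contributions to technological innovation and application, he received the 2018 China Patent Award, the 2019 Wu Wenjun AI Outstanding Contribution Award, the first prize of the 2020 Wu Wenjun AI Science and Technology Progress Award, the first prize of the 2020 Shanghai Science and Technology Progress Award, recognition as one of China's top ten AI figures of 2020, the 2021 Shenzhen May 1st Labor Medal, and the 2022 Shenzhen Most Beautiful Science and Technology Worker honor.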
- Funding:
  
  The Key Research and Development Program of Guangdong Province, "New Generation Artificial Intelligence" major special project (2021B0101400003)
- DOI: 10.11959/j.issn.2096-0271.2024011
  Chinese Library Classification (CLC) number: TP391
- Published online: 2024-05
  
  Print publication: 2024-05-15

Pengcheng LI, Xulong ZHANG, Jianzong WANG, et al. A survey of voice conversion based on non-parallel data[J]. Big Data Research, 2024, 10(3): 65-81. DOI: 10.11959/j.issn.2096-0271.2024011.

Abstract

Voice conversion is a research topic in speech processing and artificial intelligence. Its goal is to change the timbre of speech while preserving the content of the source speech, so that the result sounds as if it were spoken by a target speaker, while also maintaining the quality and naturalness of the converted speech. Voice conversion based on non-parallel data is currently a popular research direction: models are trained on non-parallel multi-speaker speech datasets and can perform many-to-many and any-to-any voice conversion. This paper provides a comprehensive summary and analysis of recent work on voice conversion based on non-parallel data. First, it outlines early voice conversion techniques based on parallel corpora and their limitations. Then, it introduces and compares the various approaches to voice conversion based on non-parallel data. Finally, it summarizes voice conversion technology and discusses its outlook.
