A survey of expressive speech synthesis

Haobin TANG; Xulong ZHANG; Jianzong WANG; Ning CHENG; Jing XIAO

doi:10.11959/j.issn.2096-0271.2022082

Research | Updated: 2024-06-03

- Title (Chinese): 表现性语音合成综述
- Title (English): A survey of expressive speech synthesis
- Big Data Research, 2023, vol. 9, no. 6, pp. 53-71
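- Author affiliations:
  
  1. Ping An Technology (Shenzhen) Co., Ltd., Shenzhen 518063, Guangdong, China
  2. University of Science and Technology of China, Hefei 230026, Anhui, China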
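- Author biographies:
  
  Haobin TANG (1999- ), male, is a master's student at the University of Science and Technology of China and an algorithm engineer at Ping An Technology (Shenzhen) Co., Ltd. His main research interests include artificial intelligence, speech recognition, and speech synthesis.
  Xulong ZHANG (1988- ), male, PhD, is a senior algorithm researcher at Ping An Technology (Shenzhen) Co., Ltd. His main research interests include speech synthesis, voice conversion, music information retrieval, and the application of machine learning and deep learning methods in artificial intelligence.
  Jianzong WANG (1983- ), male, PhD, is deputy chief engineer, senior artificial intelligence director, and general manager of the Federated Learning Technology Department at Ping An Technology (Shenzhen) Co., Ltd. He was a postdoctoral researcher in artificial intelligence at the University of Florida, USA, and is a senior member of the China Computer Federation and a member of its Big Data Expert Committee. His main research interests include federated learning and artificial intelligence.
  Ning CHENG (1981- ), male, PhD, is a senior expert algorithm researcher at Ping An Technology and a senior engineer at the Institute of Software, Chinese Academy of Sciences. His main research interests include speech recognition, speech synthesis, and natural language processing.
  Jing XIAO (1972- ), male, PhD, is chief scientist of Ping An Group of China, recipient of the 2019 Wu Wenjun Artificial Intelligence Outstanding Contribution Award, and vice chair of the Shenzhen chapter of the China Computer Federation. His main research interests include computer graphics, autonomous driving, 3D display, medical diagnosis, and federated learning.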
- Funding:
  
  The Key Research and Development Program of Guangdong Province, "New Generation Artificial Intelligence" major special project (2021B0101400003)
- DOI: 10.11959/j.issn.2096-0271.2022082
  CLC number: TP391
- Published online: 2023-11
  
  Print publication: 2023-11-15
Haobin TANG, Xulong ZHANG, Jianzong WANG, et al. A survey of expressive speech synthesis[J]. Big Data Research, 2023, 9(6): 53-71. DOI: 10.11959/j.issn.2096-0271.2022082.

Abstract

Speech synthesis is an active research topic spanning speech, language, and machine learning. It aims to synthesize intelligible and natural speech from a given text and is widely used in industry. Naturalness is one of its central goals, yet current systems still leave considerable room for improvement in emotion, prosody, and related aspects. This paper presents a comprehensive survey of expressive speech synthesis, with the aim of better understanding the current state of research and future trends. It summarizes, compares, and analyzes recent emotion-based and prosody-based work on expressive speech synthesis. It first introduces the traditional approaches to ordinary speech synthesis and their bottlenecks. It then introduces expressive speech synthesis and describes the gains it brings to the naturalness of synthesized speech in terms of emotion and prosody. Finally, it offers an outlook on and a summary of expressive speech synthesis.

Keywords
