Abstract
In recent years, the demand for news broadcasting has grown with the explosion of information. An automatic news broadcasting system based on deep-learning text-to-speech (TTS) technology can overcome the working-hour limitations and errors of manual broadcasting. However, most existing speech synthesis systems cannot switch speakers in real time, nor do they address a series of problems specific to Chinese news broadcasting scenarios. In this paper, we propose a multi-speaker Chinese news broadcasting system with switchable timbres, trained on CNews, a Chinese news corpus we constructed. The system uses a CPM module to convert Chinese text into pinyin phonemes more accurately, and a timbre encoder to build multi-speaker timbre feature embeddings. To handle the long texts common in news, the acoustic model improves on Tacotron2 by adopting Discretized Graves attention as its attention mechanism, which reduces the amount of audio data required during training and better extracts information from the context. A HiFi-GAN vocoder replaces the original WaveNet to generate time-domain waveforms, shortening synthesis time and improving the quality of the synthesized speech. Experiments show that, compared with Tacotron2, the system can flexibly change the target timbre according to a reference speech. Moreover, it can synthesize speech with the prosody and style of the target presenter from limited training data, with better naturalness and faster inference, making it suitable for real-time news broadcasting.
Data Availability
The AISHELL3 and Aidatatang_200zh datasets are openly available in public repositories. The CNews dataset is available from the authors on request.
Notes
https://github.com/wiseman/py-webrtcvad

Here \({x}_{k}\) denotes the feature vector of the input voice source, which in WebRTC's VAD algorithm is specifically the corresponding subband energy, and \({r}_{k}\) represents the set of parameters determining the mean \({u}_{z}\) and variance \(\sigma\) of the Gaussian distribution. z = 0 corresponds to the probability that a non-speech segment is recognized, and z = 1 to the probability that a speech segment is recognized.
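The note above describes a two-hypothesis Gaussian model: a frame's subband energy \({x}_{k}\) is scored under a non-speech (z = 0) and a speech (z = 1) Gaussian, and the more likely hypothesis wins. The following is a minimal stdlib-only sketch of that likelihood-ratio decision; the single Gaussians per hypothesis and the means/variances are illustrative placeholders, whereas WebRTC's actual VAD uses adaptive Gaussian mixtures over six subbands.

```python
import math

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Evaluate the univariate Gaussian density N(x; mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_speech(subband_energy: float,
              noise_params: tuple = (0.5, 0.2),    # illustrative (u_0, sigma^2) for z = 0
              speech_params: tuple = (2.0, 0.5)    # illustrative (u_1, sigma^2) for z = 1
              ) -> bool:
    """Classify one subband-energy feature x_k by likelihood ratio.

    Returns True when p(x | z=1) > p(x | z=0), i.e. when the speech
    hypothesis is more likely than the non-speech hypothesis.
    """
    p_noise = gaussian_pdf(subband_energy, *noise_params)
    p_speech = gaussian_pdf(subband_energy, *speech_params)
    return p_speech > p_noise
```

In practice one would call the py-webrtcvad binding directly (`webrtcvad.Vad(mode).is_speech(frame, sample_rate)` on 10/20/30 ms PCM frames); the sketch only exposes the per-feature decision rule the footnote explains.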
References
Allen J, Hunnicutt S, Carlson R et al (1979) MITalk-79: The 1979 MIT text-to-speech system. J Acoust Soc Am 65(S1):S130–S130
Arik S O, Chrzanowski M, Coates A et al (2017) Deep Voice: Real-time Neural Text-to-Speech. Proceedings of the 34th International Conference on Machine Learning JMLR.org, pp 195–204
Arik S, Diamos G, Gibiansky A et al (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Battenberg E, Skerry-Ryan R, Mariooryad S et al (2019) Location-relative attention mechanisms for robust long-form speech synthesis. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6194–6198. https://doi.org/10.1109/ICASSP40776.2020.9054106
Beliaev S, Rebryk Y, Ginsburg B (2020) TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model. https://doi.org/10.48550/arXiv.2005.05514
Chen X (2006) Sound and human hearing (声音与人耳听觉). China Radio and Television Press
Chorowski J, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. Computer Science 10(4):429–439
Coker CH (1976) A model of articulatory dynamics and control. Proc IEEE 64(4):452–460
Devlin J, Chang M W, Lee K et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186
Elias I, Zen H, Shen J et al (2020) Parallel Tacotron: Non-autoregressive and controllable TTS. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5709–5713
Elias I, Zen H, Shen J et al (2021) Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. https://doi.org/10.48550/arXiv.2103.14574
Ephraim Y, Malah D (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2):443–445
Graves A (2013) Generating Sequences With Recurrent Neural Networks. Computer Science. https://doi.org/10.48550/arXiv.1308.0850
Hideki K (2006) STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoust Sci Technol 27(6):349–353
Hoßfeld T, Heegaard PE, Varela M et al (2016) QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Qual User Exp 1(1):1–23
Huang Z, Li H, Lei M (2020) DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech. ArXiv abs/2010.15311
Kingma D, Ba J (2014) Adam: A method for stochastic optimization. https://doi.org/10.48550/arXiv.1412.6980
Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67(3):971–995
Kong J, Kim J, Bae J (2020) Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst 33:17022–17033
Kumar K, Kumar R, Boissiere T D, Gestin L, Teoh W Z, Sotelo J et al (2019) Melgan: generative adversarial networks for conditional waveform synthesis
Kumar A, Kumar S, Ganesan R A (2021) Efficient Human-Quality Kannada TTS using Transfer Learning on NVIDIA's Tacotron2. International Conference on Electronics, Computing and Communication Technologies, pp. 01–06. https://doi.org/10.1109/CONECCT52877.2021.9622581
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lee K, Park K, Kim D (2021) STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. https://doi.org/10.48550/arXiv.2103.09474
Lim D, Jang W, Gyeonghwan O et al (2020) JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment. https://doi.org/10.48550/arXiv.2005.07799
Lu Y, Dong M, Chen Y (2019) Implementing prosodic phrasing in Chinese end-to-end speech synthesis. 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7050–7054
Melara RD, Marks LE (1990) Interaction among auditory dimensions: timbre, pitch, and loudness. Percept Psychophys 48:169–178. https://doi.org/10.3758/BF03207084
Mikolov T, Karafiát M, Burget L et al (2010) Recurrent neural network based language model. INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association
Morise M, Yokomori F, Ozawa K (2016) WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Trans Inf Syst 99(7):1877–1884
Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9(5–6):453–467
Olive J (1977) Rule synthesis of speech from dyadic units. ICASSP ‘77. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp 568–570
Oord A, Dieleman S, Zen H et al (2016) WaveNet: A Generative Model for Raw Audio. https://doi.org/10.48550/arXiv.1609.03499
Pan J, Yin X, Zhang Z, Liu S, Zhang Y, Ma Z, Wang Y (2020) A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis. Proceedings of the international conference on acoustics, speech and signal processing, pp 6689–6693
Prenger R, Valle R, Catanzaro B (2019) Waveglow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
Ren Y, Hu C, Qin T et al (2020) FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2006.04558
Ren Y, Ruan Y, Tan X et al (2019) Fastspeech: Fast, robust and controllable text to speech. In NeurIPS
Salimans T, Karpathy A, Chen X et al (2017) Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. https://doi.org/10.48550/arXiv.1701.05517
Sang DV, Thu LX (2021) FastTacotron: A fast, robust and controllable method for speech synthesis. 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp 1–5. https://doi.org/10.1109/MAPR53640.2021.9585267
Seeviour P, Holmes J, Judd M (1976) Automatic generation of control signals for a parallel formant speech synthesizer. IEEE International Conference on Acoustics, Speech, & Signal Processing, pp 690–693
Shadle CH, Damper RI (2002) Prospects for articulatory synthesis: A position paper
Shen J, Jia Y, Chrzanowski M et al (2020) Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Shen J, Pang R, Weiss RJ et al (2017) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4779–4783
Silva A, Gomes MM, Costa C et al (2020) Intelligent personal assistants: A systematic literature review. Exp Syst Appl 147:113193. https://doi.org/10.1016/j.eswa.2020.113193
Sutskever I, Vinyals O, Le Q V (2014) Sequence to sequence learning with neural networks. Advances in neural information processing systems, pp. 3104–3112
Tan X, Qin T, Soong F, et al (2021) A Survey on Neural Speech Synthesis. https://doi.org/10.48550/arXiv.2106.15561
Thangthai A, Thatphithakkul S, Thangthai K et al (2020) TSynC-3miti: Audiovisual Speech Synthesis Database from Found Data. 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K (2013) Speech Synthesis Based on Hidden Markov Models. Proc IEEE 101(5):1234–1252
Valle R, Shih K, Prenger R et al (2020) Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. https://doi.org/10.48550/arXiv.2005.05957
Vasquez S, Lewis M (2019) MelNet: A Generative Model for Audio in the Frequency Domain. https://doi.org/10.48550/arXiv.1906.01083
Wan L, Wang Q, Papir A et al (2018) Generalized End-to-End Loss for Speaker Verification. http://arxiv.org/abs/1710.10467
Wang Y, Skerry-Ryan R J, Stanton D et al (2017) Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, pp. 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
Wei P, Peng K, Chen J (2018) ClariNet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations
Wei P, Peng K, Gibiansky A, et al (2017) Deep voice 3: 2000-speaker neural text-to-speech. Proc. ICLR, pp 214–217
Yang F, Yang S, Zhu P, Yan P, Xie L (2019) Improving mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. 2019 IEEE automatic speech recognition and understanding workshop (ASRU), pp 208–213
Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis. Trans Inst Electron Inf Commun Eng 83(3):2099–2107
Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51:1039–1064. https://doi.org/10.1016/j.specom.2009.04.004
Zeng Z, Wang J, Cheng N, Xia T, Xiao J (2020) AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment. in Proc. ICASSP, pp 6714–6718
Zhang Y, Deng L, Wang Y (2020) Unified Mandarin TTS Front-end Based on Distilled BERT Model. https://doi.org/10.48550/arXiv.2012.15404
Zhang JX, Ling ZH, Dai LR (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. IEEE. https://doi.org/10.1109/ICASSP.2018.8462020
Zhang H, Sproat R, Ng AH et al (2019) Neural Models of Text Normalization for Speech Applications. Comput Linguist 45:1–49. https://doi.org/10.1162/COLI_a_00349
Zhang H, Yuan T, Chen J et al (2022) PaddleSpeech: An easy-to-use all-in-one speech toolkit. ArXiv abs/2205.12007
Zhang C, Zhang S, Zhong H (2019) A prosodic mandarin text-to-speech system based on Tacotron. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp 165–169
Zhu X (2019) Emerging champions in the digital economy. Springer, Singapore
Funding
This study is supported by the Fundamental Research Funds for the Central Universities and the National Key R&D Program of China (Grant No. 2022YFC3302100). This paper is also the research result of a collaborative project between INSTEC Technology Co., Ltd., INSTEC Import & Export Co., Ltd., and Communication University of China.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, W., Lian, Y., Chai, J. et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2. Multimed Tools Appl 82, 46905–46937 (2023). https://doi.org/10.1007/s11042-023-15279-z