Abstract
In recent years, the demand for news broadcasting has grown with the explosion of information. An automatic news broadcasting system based on deep-learning text-to-speech (TTS) technology can overcome the working-hour limitations and errors of manual broadcasting. However, most existing speech synthesis systems cannot switch speakers in real time, nor do they address a series of problems specific to Chinese news broadcasting scenarios. In this paper, we propose a multi-speaker Chinese news broadcasting system with switchable timbres, trained on CNews, a Chinese news corpus we constructed. The system uses a CPM module to convert Chinese text into pinyin phonemes more accurately, and a timbre encoder to build multi-speaker timbre feature embeddings. To handle the long texts common in news, the acoustic model improves on Tacotron2 by adopting Discretized Graves attention as its attention mechanism, which reduces the amount of audio data required during training and better extracts information from the context. A HiFi-GAN vocoder replaces the original WaveNet to generate time-domain waveforms, shortening synthesis time and improving the quality of the synthesized speech. Experiments show that, compared with Tacotron2, the system can flexibly change the target timbre according to a reference speech. Moreover, it can synthesize speech with the prosody and style of the target presenter from limited training data, with better naturalness and faster inference, making it suitable for real-time news broadcasting.
Data Availability
The AISHELL3 and Aidatatang_200zh datasets are openly available in public repositories. The CNews dataset is available from the authors on request.
Notes
https://github.com/wiseman/py-webrtcvad

Here \({x}_{k}\) denotes the feature vector of the input voice source, which in WebRTC's VAD algorithm is specifically the corresponding subband energy, and \({r}_{k}\) represents the set of parameters determining the mean \({u}_{z}\) and variance \(\sigma\) of the Gaussian distribution. z = 0 corresponds to the probability that a non-speech segment is recognized, and z = 1 to the probability that a speech segment is recognized.
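The note above describes a two-hypothesis Gaussian model: a frame's subband energy \({x}_{k}\) is scored under a non-speech (z = 0) and a speech (z = 1) Gaussian, and the more likely hypothesis wins. The following is a minimal stdlib-only sketch of that likelihood-ratio decision; the single Gaussians per hypothesis and the means/variances are illustrative placeholders, whereas WebRTC's actual VAD uses adaptive Gaussian mixtures over six subbands.

```python
import math

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Evaluate the univariate Gaussian density N(x; mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def is_speech(subband_energy: float,
              noise_params: tuple = (0.5, 0.2),    # illustrative (u_0, sigma^2) for z = 0
              speech_params: tuple = (2.0, 0.5)    # illustrative (u_1, sigma^2) for z = 1
              ) -> bool:
    """Classify one subband-energy feature x_k by likelihood ratio.

    Returns True when p(x | z=1) > p(x | z=0), i.e. when the speech
    hypothesis is more likely than the non-speech hypothesis.
    """
    p_noise = gaussian_pdf(subband_energy, *noise_params)
    p_speech = gaussian_pdf(subband_energy, *speech_params)
    return p_speech > p_noise
```

In practice one would call the py-webrtcvad binding directly (`webrtcvad.Vad(mode).is_speech(frame, sample_rate)` on 10/20/30 ms PCM frames); the sketch only exposes the per-feature decision rule the footnote explains.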
References
Allen J, Hunnicutt S, Carlson R et al (1979) MITalk-79: The 1979 MIT text-to-speech system. J Acoust Soc Am 65(S1):S130–S130
Arik S O, Chrzanowski M, Coates A et al (2017) Deep Voice: Real-time Neural Text-to-Speech. Proceedings of the 34th International Conference on Machine Learning JMLR.org, pp 195–204
Arik S, Diamos G, Gibiansky A et al (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Battenberg E, Skerry-Ryan R, Mariooryad S et al (2019) Location-relative attention mechanisms for robust long-form speech synthesis. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6194–6198. https://doi.org/10.1109/ICASSP40776.2020.9054106
Beliaev S, Rebryk Y, Ginsburg B (2020) TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model. https://doi.org/10.48550/arXiv.2005.05514
Chen X (2006) Sound and human hearing (声音与人耳听觉). China Radio and Television Press
Chorowski J, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. Computer Science 10(4):429–439
Coker CH (1976) A model of articulatory dynamics and control. Proc IEEE 64(4):452–460
Devlin J, Chang M W, Lee K et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186
Elias I, Zen H, Shen J et al (2020) Parallel Tacotron: Non-autoregressive and controllable TTS. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5709–5713
Elias I, Zen H, Shen J et al (2021) Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. https://doi.org/10.48550/arXiv.2103.14574
Ephraim Y, Malah D (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2):443–445
Graves A (2013) Generating Sequences With Recurrent Neural Networks. Computer Science. https://doi.org/10.48550/arXiv.1308.0850
Hideki K (2006) STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoust Sci Technol 27(6):349–353
Hoßfeld T, Heegaard PE, Varela M et al (2016) QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Qual User Exp 1(1):1–23
Huang Z, Li H, Lei M (2020) DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech. ArXiv abs/2010.15311
Kingma D, Ba J (2014) Adam: A method for stochastic optimization. https://doi.org/10.48550/arXiv.1412.6980
Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67(3):971–995
Kong J, Kim J, Bae J (2020) Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst 33:17022–17033
Kumar K, Kumar R, Boissiere T D, Gestin L, Teoh W Z, Sotelo J et al (2019) Melgan: generative adversarial networks for conditional waveform synthesis
Kumar A, Kumar S, Ganesan R A (2021) Efficient Human-Quality Kannada TTS using Transfer Learning on NVIDIA's Tacotron2. International Conference on Electronics, Computing and Communication Technologies, pp. 01–06. https://doi.org/10.1109/CONECCT52877.2021.9622581
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Lee K, Park K, Kim D (2021) STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. https://doi.org/10.48550/arXiv.2103.09474
Lim D, Jang W, Gyeonghwan O et al (2020) JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment. https://doi.org/10.48550/arXiv.2005.07799
Lu Y, Dong M, Chen Y (2019) Implementing prosodic phrasing in Chinese end-to-end speech synthesis. 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7050–7054
Melara RD, Marks LE (1990) Interaction among auditory dimensions: timbre, pitch, and loudness. Percept Psychophys 48:169–178. https://doi.org/10.3758/BF03207084
Mikolov T, Karafiát M, Burget L et al (2010) Recurrent neural network based language model. INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association
Morise M, Yokomori F, Ozawa K (2016) WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Trans Inf Syst 99(7):1877–1884
Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9(5–6):453–467
Olive J (1977) Rule synthesis of speech from dyadic units. ICASSP ‘77. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp 568–570
Oord A, Dieleman S, Zen H et al (2016) WaveNet: A Generative Model for Raw Audio. https://doi.org/10.48550/arXiv.1609.03499
Pan J, Yin X, Zhang Z, Liu S, Zhang Y, Ma Z, Wang Y (2020) A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis. Proceedings of the international conference on acoustics, speech and signal processing, pp 6689–6693
Prenger R, Valle R, Catanzaro B (2019) Waveglow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143
Ren Y, Hu C, Qin T et al (2020) FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2006.04558
Ren Y, Ruan Y, Tan X et al (2019) Fastspeech: Fast, robust and controllable text to speech. In NeurIPS
Salimans T, Karpathy A, Chen X et al (2017) Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. https://doi.org/10.48550/arXiv.1701.05517
Sang DV, Thu LX (2021) FastTacotron: A fast, robust and controllable method for speech synthesis. 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp 1–5. https://doi.org/10.1109/MAPR53640.2021.9585267
Seeviour P, Holmes J, Judd M (1976) Automatic generation of control signals for a parallel formant speech synthesizer. IEEE International Conference on Acoustics, Speech, & Signal Processing, pp 690–693
Shadle CH, Damper RI (2002) Prospects for articulatory synthesis: A position paper
Shen J, Jia Y, Chrzanowski M et al (2020) Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling
Shen J, Pang R, Weiss RJ et al (2017) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4779–4783
Silva A, Gomes MM, Costa C et al (2020) Intelligent personal assistants: A systematic literature review. Exp Syst Appl 147:113193. https://doi.org/10.1016/j.eswa.2020.113193
Sutskever I, Vinyals O, Le Q V (2014) Sequence to sequence learning with neural networks. Advances in neural information processing systems, pp. 3104–3112
Tan X, Qin T, Soong F, et al (2021) A Survey on Neural Speech Synthesis. https://doi.org/10.48550/arXiv.2106.15561
Thangthai A, Thatphithakkul S, Thangthai K et al (2020) TSynC-3miti: Audiovisual Speech Synthesis Database from Found Data. 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K (2013) Speech Synthesis Based on Hidden Markov Models. Proc IEEE 101(5):1234–1252
Valle R, Shih K, Prenger R et al (2020) Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. https://doi.org/10.48550/arXiv.2005.05957
Vasquez S, Lewis M (2019) MelNet: A Generative Model for Audio in the Frequency Domain. https://doi.org/10.48550/arXiv.1906.01083
Wan L, Wang Q, Papir A et al (2018) Generalized End-to-End Loss for Speaker Verification. http://arxiv.org/abs/1710.10467
Wang Y, Skerry-Ryan R J, Stanton D et al (2017) Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, pp. 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452
Wei P, Peng K, Chen J (2018) ClariNet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations
Wei P, Peng K, Gibiansky A, et al (2017) Deep voice 3: 2000-speaker neural text-to-speech. Proc. ICLR, pp 214–217
Yang F, Yang S, Zhu P, Yan P, Xie L (2019) Improving mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. 2019 IEEE automatic speech recognition and understanding workshop (ASRU), pp 208–213
Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis. Trans Inst Electron Inf Commun Eng 83(3):2099–2107
Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51:1039–1064. https://doi.org/10.1016/j.specom.2009.04.004
Zeng Z, Wang J, Cheng N, Xia T, Xiao J (2020) AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment. in Proc. ICASSP, pp 6714–6718
Zhang Y, Deng L, Wang Y (2020) Unified Mandarin TTS Front-end Based on Distilled BERT Model. https://doi.org/10.48550/arXiv.2012.15404
Zhang JX, Ling ZH, Dai LR (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. IEEE. https://doi.org/10.1109/ICASSP.2018.8462020
Zhang H, Sproat R, Ng AH et al (2019) Neural Models of Text Normalization for Speech Applications. Comput Linguist 45:1–49. https://doi.org/10.1162/COLI_a_00349
Zhang H, Yuan T, Chen J et al (2022) PaddleSpeech: An easy-to-use all-in-one speech toolkit. ArXiv abs/2205.12007
Zhang C, Zhang S, Zhong H (2019) A prosodic mandarin text-to-speech system based on Tacotron. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp 165–169
Zhu X (2019) Emerging champions in the digital economy. Springer, Singapore
Funding
This study is supported by the Fundamental Research Funds for the Central Universities and the National Key R&D Program of China (Grant No. 2022YFC3302100). This paper is also the research result of a collaborative project between INSTEC Technology Co., Ltd., INSTEC Import & Export Co., Ltd., and Communication University of China.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, W., Lian, Y., Chai, J. et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2. Multimed Tools Appl 82, 46905–46937 (2023). https://doi.org/10.1007/s11042-023-15279-z