Multi-speaker Chinese news broadcasting system based on improved Tacotron2

Published in: Multimedia Tools and Applications

Abstract

In recent years, the demand for news broadcasting has grown with the explosion of information. An automatic news broadcasting system based on deep-learning text-to-speech technology can overcome the working-time limitations and reading errors of manual broadcasting. However, most existing speech synthesis systems cannot switch speakers in real time and do not address a number of problems specific to Chinese news broadcasting scenarios. In this paper, we propose a multi-speaker Chinese news broadcasting system with switchable timbres, trained on CNews, a Chinese news corpus we constructed. The system uses the CPM module to convert Chinese text into pinyin phonemes more accurately, and a timbre encoder to construct multi-speaker timbre feature embeddings. To handle the long texts typical of news, the acoustic model improves on Tacotron2 by adopting Discretized Graves attention, which reduces the amount of audio data needed during training and better extracts information from context. A HiFi-GAN vocoder replaces the original WaveNet for generating time-domain waveforms, reducing synthesis time and improving the quality of the synthesized voice. Experiments show that, compared with Tacotron2, the system can flexibly change the target timbre according to a reference speech. Trained on limited data, it also synthesizes speech with the prosody and style of the target presenter, with better naturalness and faster inference, making it suitable for real-time news broadcasting.
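
As an illustration of the grapheme-to-phoneme front end described above, the sketch below converts Chinese text into tone-numbered pinyin syllables with the pypinyin package (note 5). It is a hedged baseline sketch, not the paper's CPM module, which the abstract says performs this conversion more accurately; the function name is an illustrative assumption.

```python
# Baseline Chinese-to-pinyin front end using pypinyin (note 5).
# Illustrative sketch only; not the paper's CPM module.
from pypinyin import Style, lazy_pinyin


def text_to_pinyin_phonemes(text: str) -> list:
    """Convert Chinese text to tone-numbered pinyin syllables, e.g. 'xin1 wen2'."""
    # Style.TONE3 appends the tone number to each syllable (xin1, wen2, ...);
    # neutral_tone_with_five marks the neutral tone explicitly as tone 5.
    return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)


if __name__ == "__main__":
    print(text_to_pinyin_phonemes("新闻播报系统"))
    # Expected output: ['xin1', 'wen2', 'bo1', 'bao4', 'xi4', 'tong3']
```

Naive per-character conversion like this mishandles polyphonic characters (e.g. 长 as chang2 or zhang3), which is the kind of error a more accurate converter, such as the paper's CPM module, aims to avoid.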


Data Availability

The AISHELL-3 and aidatatang_200zh datasets are openly available in public repositories. The CNews dataset is available from the authors on request.

Notes

  1. https://huggingface.co/bert-base-chinese

  2. https://github.com/NVIDIA/tacotron2

  3. https://github.com/wiseman/py-webrtcvad. The voice activity detection is based on a two-class Gaussian model, \(P({x}_{k}\mid z,{r}_{k})=\frac{1}{\sqrt{2\pi {\sigma }_{z}^{2}}}\exp \left(-\frac{{({x}_{k}-{u}_{z})}^{2}}{2{\sigma }_{z}^{2}}\right)\), where \({x}_{k}\) denotes the feature vector of the input voice source (in WebRTC's VAD algorithm, the corresponding subband energy) and \({r}_{k}\) represents the set of parameters determining the mean \({u}_{z}\) and variance \({\sigma }_{z}\) of the Gaussian distribution; z = 0 yields the probability that a non-speech segment is recognized, and z = 1 the probability that a speech segment is recognized. A usage sketch follows these notes.

  4. https://asrt.ailemon.me

  5. https://pypi.org/project/pypinyin/

  6. https://github.com/NVIDIA/waveglow

  7. https://github.com/kan-bayashi/PytorchWaveNetVocoder

  8. https://github.com/descriptinc/melgan-neurips

  9. https://github.com/jik876/hifi-gan
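
As referenced in note 3, the following is a minimal sketch of frame-level voice activity detection using the py-webrtcvad package's public API; the aggressiveness mode, 30 ms frame length, and silent test input are illustrative assumptions, not the paper's preprocessing settings.

```python
# Frame-level voice activity detection with py-webrtcvad (note 3).
# Mode, frame length, and the silent test signal are illustrative assumptions.
import webrtcvad

SAMPLE_RATE = 16000  # py-webrtcvad accepts 8000, 16000, 32000, or 48000 Hz
FRAME_MS = 30        # frames must be 10, 20, or 30 ms of 16-bit mono PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # samples per frame * 2 bytes

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)


def speech_flags(pcm: bytes) -> list:
    """Classify each complete frame of raw PCM as speech (True) or non-speech."""
    return [
        vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)
    ]


if __name__ == "__main__":
    silence = b"\x00\x00" * (SAMPLE_RATE * FRAME_MS // 1000) * 5  # 5 silent frames
    print(speech_flags(silence))  # expected: [False, False, False, False, False]
```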

References

  1. Allen J, Hunnicutt S, Carlson R et al (1979) MITalk-79: The 1979 MIT text-to-speech system. J Acoust Soc Am 65(S1):S130–S130

  2. Arik SO, Chrzanowski M, Coates A et al (2017) Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning, pp 195–204

  3. Arik S, Diamos G, Gibiansky A et al (2017) Deep Voice 2: Multi-Speaker Neural Text-to-Speech

  4. Battenberg E, Skerry-Ryan R, Mariooryad S et al (2020) Location-relative attention mechanisms for robust long-form speech synthesis. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6194–6198. https://doi.org/10.1109/ICASSP40776.2020.9054106

  6. Beliaev S, Rebryk Y, Ginsburg B (2020) TalkNet: Fully-Convolutional Non-Autoregressive Speech Synthesis Model. https://doi.org/10.48550/arXiv.2005.05514

  7. Chen X (2006) Sound and human hearing (in Chinese). China Radio and Television Press

  8. Chorowski J, Bahdanau D, Serdyuk D, Cho K, Bengio Y (2015) Attention-based models for speech recognition. Advances in Neural Information Processing Systems 28

  9. Coker CH (1976) A model of articulatory dynamics and control. Proc IEEE 64(4):452–460

  10. Devlin J, Chang MW, Lee K et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186

  11. Elias I, Zen H, Shen J et al (2021) Parallel Tacotron: Non-Autoregressive and Controllable TTS. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5709–5713

  12. Elias I, Zen H, Shen J et al (2021) Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. https://doi.org/10.48550/arXiv.2103.14574

  13. Ephraim Y, Malah D (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2):443–445

  14. Graves A (2013) Generating Sequences With Recurrent Neural Networks. Computer Science. https://doi.org/10.48550/arXiv.1308.0850

  15. Kawahara H (2006) STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoust Sci Technol 27(6):349–353

  16. Hoßfeld T, Heegaard PE, Varela M et al (2016) QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Qual User Exp 1(1):1–23

  17. Huang Z, Li H, Lei M (2020) DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech. arXiv abs/2010.15311

  18. Kingma D, Ba J (2014) Adam: A method for stochastic optimization. https://doi.org/10.48550/arXiv.1412.6980

  19. Klatt DH (1980) Software for a cascade/parallel formant synthesizer. J Acoust Soc Am 67(3):971–995

  20. Kong J, Kim J, Bae J (2020) Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv Neural Inf Process Syst 33:17022–17033

  21. Kumar K, Kumar R, Boissiere TD, Gestin L, Teoh WZ, Sotelo J et al (2019) MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems 32

  22. Kumar A, Kumar S, Ganesan R A (2021) Efficient Human-Quality Kannada TTS using Transfer Learning on NVIDIA's Tacotron2. International Conference on Electronics, Computing and Communication Technologies, pp. 01–06. https://doi.org/10.1109/CONECCT52877.2021.9622581

  23. Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  24. Lee K, Park K, Kim D (2021) STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. https://doi.org/10.48550/arXiv.2103.09474

  25. Lim D, Jang W, Gyeonghwan O et al (2020) JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment. https://doi.org/10.48550/arXiv.2005.07799

  26. Lu Y, Dong M, Chen Y (2019) Implementing prosodic phrasing in Chinese end-to-end speech synthesis. 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7050–7054

  27. Melara RD, Marks LE (1990) Interaction among auditory dimensions: timbre, pitch, and loudness. Percept Psychophys 48:169–178. https://doi.org/10.3758/BF03207084

  28. Mikolov T, Karafiát M, Burget L et al (2010) Recurrent neural network based language model. INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association

  29. Morise M, Yokomori F, Ozawa K (2016) WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Trans Inf Syst 99(7):1877–1884

  30. Moulines E, Charpentier F (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 9(5–6):453–467

  31. Olive J (1977) Rule synthesis of speech from dyadic units. ICASSP ‘77. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp 568–570

  32. Oord A, Dieleman S, Zen H et al (2016) WaveNet: A Generative Model for Raw Audio. https://doi.org/10.48550/arXiv.1609.03499

  33. Pan J, Yin X, Zhang Z, Liu S, Zhang Y, Ma Z, Wang Y (2020) A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis. Proceedings of the international conference on acoustics, speech and signal processing, pp 6689–6693

  34. Prenger R, Valle R, Catanzaro B (2019) Waveglow: A Flow-based Generative Network for Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621. https://doi.org/10.1109/ICASSP.2019.8683143

  35. Ren Y, Hu C, Qin T et al (2020) FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2006.04558

  36. Ren Y, Ruan Y, Tan X et al (2019) FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32

  37. Salimans T, Karpathy A, Chen X et al (2017) PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. https://doi.org/10.48550/arXiv.1701.05517

  38. Sang DV, Thu LX (2021) FastTacotron: A Fast, Robust and Controllable Method for Speech Synthesis. 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp 1–5. https://doi.org/10.1109/MAPR53640.2021.9585267

  39. Seeviour P, Holmes J, Judd M (1976) Automatic generation of control signals for a parallel formant speech synthesizer. IEEE International Conference on Acoustics, Speech, & Signal Processing, pp 690–693

  40. Shadle CH, Damper RI (2002) Prospects for articulatory synthesis: A position paper

  41. Shen J, Jia Y, Chrzanowski M et al (2020) Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

  42. Shen J, Pang R, Weiss RJ et al (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4779–4783

  43. Silva A, Gomes MM, Costa C et al (2020) Intelligent personal assistants: A systematic literature review. Exp Syst Appl 147:113193. https://doi.org/10.1016/j.eswa.2020.113193

  44. Sutskever I, Vinyals O, Le Q V (2014) Sequence to sequence learning with neural networks. Advances in neural information processing systems, pp. 3104–3112

  45. Tan X, Qin T, Soong F, et al (2021) A Survey on Neural Speech Synthesis. https://doi.org/10.48550/arXiv.2106.15561

  46. Thangthai A, Thatphithakkul S, Thangthai K et al (2020) TSynC-3miti: Audiovisual Speech Synthesis Database from Found Data. 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)

  47. Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K (2013) Speech Synthesis Based on Hidden Markov Models. Proc IEEE 101(5):1234–1252

  48. Valle R, Shih K, Prenger R et al (2020) Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis. https://doi.org/10.48550/arXiv.2005.05957

  49. Vasquez S, Lewis M (2019) MelNet: A Generative Model for Audio in the Frequency Domain. https://doi.org/10.48550/arXiv.1906.01083

  51. Wan L, Wang Q, Papir A et al (2018) Generalized End-to-End Loss for Speaker Verification. http://arxiv.org/abs/1710.10467

  52. Wang Y, Skerry-Ryan R J, Stanton D et al (2017) Tacotron: Towards End-to-End Speech Synthesis. Interspeech 2017, pp. 4006–4010. https://doi.org/10.21437/Interspeech.2017-1452

  53. Ping W, Peng K, Chen J (2018) ClariNet: Parallel wave generation in end-to-end text-to-speech. International Conference on Learning Representations

  54. Ping W, Peng K, Gibiansky A et al (2017) Deep Voice 3: 2000-speaker neural text-to-speech. Proc ICLR, pp 214–217

  55. Yang F, Yang S, Zhu P, Yan P, Xie L (2019) Improving mandarin end-to-end speech synthesis by self-attention and learnable Gaussian bias. 2019 IEEE automatic speech recognition and understanding workshop (ASRU), pp 208–213

  56. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T (1999) Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis. Trans Inst Electron Inf Commun Eng 83(3):2099–2107

  57. Zen H, Tokuda K, Black AW (2009) Statistical parametric speech synthesis. Speech Commun 51:1039–1064. https://doi.org/10.1016/j.specom.2009.04.004

  58. Zeng Z, Wang J, Cheng N, Xia T, Xiao J (2020) AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment. in Proc. ICASSP, pp 6714–6718

  59. Zhang Y, Deng L, Wang Y (2020) Unified Mandarin TTS Front-end Based on Distilled BERT Model. https://doi.org/10.48550/arXiv.2012.15404

  60. Zhang JX, Ling ZH, Dai LR (2018) Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/ICASSP.2018.8462020

  61. Zhang H, Sproat R, Ng AH et al (2019) Neural Models of Text Normalization for Speech Applications. Comput Linguist 45:1–49. https://doi.org/10.1162/COLI_a_00349

  62. Zhang H, Yuan T, Chen J et al (2022) PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit. arXiv abs/2205.12007

  63. Zhang C, Zhang S, Zhong H (2019) A prosodic mandarin text-to-speech system based on Tacotron. In: 2019 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp 165–169

  64. Zhu X (2019) Emerging champions in the digital economy. Springer, Singapore

Download references

Funding

This study is supported by the Fundamental Research Funds for the Central Universities and by the National Key R&D Program of China (Grant No. 2022YFC3302100). This paper is also the research result of a collaborative project between INSTEC Technology Co., Ltd., INSTEC Import & Export Co., Ltd., and Communication University of China.

Author information

Corresponding author

Correspondence to Wei Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhao, W., Lian, Y., Chai, J. et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2. Multimed Tools Appl 82, 46905–46937 (2023). https://doi.org/10.1007/s11042-023-15279-z
