Formosa Speech in the Wild Corpus for Improving Taiwanese Mandarin Speech-Enabled Human-Computer Interaction

  • Yuan-Fu Liao
  • Yung-Hsiang Shawn Chang
  • Yu-Chen Lin
  • Wu-Hua Hsu
  • Matus Pleva
  • Jozef Juhar

Abstract

Mandarin in Taiwan differs notably from other variants of Mandarin in lexical use and accent. However, from an investment perspective, it remains debatable whether general-purpose Mandarin speech recognition (MSR) systems are sufficient to support human-computer interaction in Taiwan. To address this question, we established the Formosa (an ancient name given to Taiwan by the Portuguese) Speech in the Wild (FSW) project (Liao 2018) to (1) collect large-scale Taiwanese Mandarin speech to boost the development of Taiwanese-specific MSR technology, and (2) host a Formosa Speech Recognition (FSR) challenge (Liao 2018) to promote the corpus and to evaluate the performance of available Taiwanese-specific MSR systems. The FSW project has focused on transcribing spontaneous Taiwanese Mandarin speech selected from real-life, multi-genre broadcast radio speech provided by Taiwan’s National Education Radio (2018). We plan to publicly release about 3000 hours of speech data at the end of 2019. FSR-2018 (Liao 2018), the culmination of FSW’s events in 2018, featured a Taiwanese broadcast Mandarin speech recognition evaluation campaign using the released corpora. The challenge was also an official activity (Liao 2018) of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) [22]. At the end of 2018, the first four volumes of the FSW corpus, NER-Trs-Vol1∼4, totaling 610.2 hours of speech data, were released to support two events: the Formosa Grand Challenge, Talk to AI (FGC) (Ministry of Science and Technology Taiwan 2018) (Dec. 2017 ∼ Mar. 2019) and the FSR-2018 challenge (Liao 2018) (June 2018 ∼ Nov. 2018), which had 147 and 27 participating teams, respectively. For FSR-2018, 30 recognition results on the final-test set were submitted by 16 teams. The evaluation results revealed that the best Taiwanese-specific MSR system achieved an 8.1% Chinese character error rate (CER).
For reference, iFlyTek’s (ISCSLP 2018) and Google’s (2018) commercial MSR systems, which were not optimized for this task, achieved 18.8% and 20.6% CERs, respectively. Taken together, we argue that a Taiwanese-specific MSR system is necessary for improving the performance of Taiwanese Mandarin speech-enabled human-computer interaction.
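The CER figures above are computed in the standard way: the character-level Levenshtein (edit) distance between the recognizer output and the reference transcript, divided by the number of reference characters. The following is a minimal illustrative sketch of that computation (not code from the project):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level Levenshtein distance
    divided by the length of the reference string."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[len(hyp)] / max(len(ref), 1)

# One deleted character out of a 4-character reference gives CER 0.25:
# cer("明天天氣", "明天氣") == 0.25
```

For Chinese, tokenizing by character (rather than by word) sidesteps word-segmentation ambiguity, which is why CER rather than word error rate is the conventional metric for Mandarin ASR.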

Keywords

Taiwanese Mandarin speech corpus · Taiwanese Mandarin speech recognition · Evaluation & benchmarks · Deep neural networks

Notes

Acknowledgements

This research was funded by Taiwan’s Ministry of Science and Technology (MOST 106-3011-F-027-006, 107-3011-F-027-003, 106-2221-E-027-128, 107-2221-E-027-102, 108-2221-E-027-067, 107-2911-I-027-501 and 108-2911-I-027-501), by the Slovak Research and Development Agency - APVV SK-TW-2017-0005, and by the Cultural and Educational Grant Agency project KEGA 009TUKE-4/2019 and the Scientific Grant Agency project VEGA 1/0511/17, both financed by the Ministry of Education, Science, Research and Sport of the Slovak Republic.

This work was made possible with content contributed by National Education Radio, Taiwan. The authors also wish to thank … for English proofreading.

References

  1. Boersma, P., & Weenink, D. (2018). Praat: doing phonetics by computer. http://www.fon.hum.uva.nl/praat/. Accessed 2019-01-28.
  2. Bu, H., Du, J., Na, X., Wu, B., Zheng, H. (2018). AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2017 (pp. 1–5). https://doi.org/10.1109/ICSDA.2017.8384449.
  3. Chan, W., Jaitly, N., Le, Q., Vinyals, O. (2016). Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 4960–4964). https://doi.org/10.1109/ICASSP.2016.7472621.
  4. Chang, H.J., Chao, W.C., Lo, T.H., Chen, B. (2018). NTNU speech recognition system at FSR 2018. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/1W2T76fyUj4mSFcKa7Z2kieVZMmdYVoWf. Accessed 2019-01-20.
  5. Chang, Y.H.S., Liao, Y.F., Wang, S.M., Wang, J.H., Wang, S.Y., Chen, J.W., Chen, Y.D. (2017). Development of a large-scale Mandarin radio speech corpus. In 2017 IEEE International Conference on Consumer Electronics - Taiwan, ICCE-TW 2017 (pp. 359–360). https://doi.org/10.1109/ICCE-China.2017.7991144.
  6. Chen, L.H., Hu, C.K., Hung, L.J., Lin, C.W. (2018). Towards a robust Taiwanese Mandarin automatic speech recognition system with the Kaldi toolkit. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15p5T43Qb3XVGkQbPTlH1MlhpdUa-Z-nL. Accessed 2019-01-20.
  7. Chiu, C.C., Sainath, T.N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R.J., Rao, K., Gonina, E., Jaitly, N., Li, B., Chorowski, J., Bacchiani, M. (2018). State-of-the-art speech recognition with sequence-to-sequence models. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 4774–4778). https://doi.org/10.1109/ICASSP.2018.8462105.
  8. Du, J., Na, X., Liu, X., Bu, H. (2018). AISHELL-2: transforming Mandarin ASR research into industrial scale. arXiv:1808.10583. Accessed 2019-01-20.
  9. ELRA Catalogue. (2006). Taiwan Mandarin Speecon database – ELRA catalogue. http://catalogue.elra.info/en-us/repository/browse/ELRA-S0212/. Accessed 2019-01-18.
  10. ESPnet. (2018). ESPnet: end-to-end speech processing toolkit. https://github.com/espnet/espnet. Accessed 2019-01-26.
  11. Facebook Research. (2018). wav2letter: Facebook AI Research automatic speech recognition toolkit. https://github.com/facebookresearch/wav2letter. Accessed 2019-01-28.
  12. Ghahremani, P., BabaAli, B., Povey, D., Riedhammer, K., Trmal, J., Khudanpur, S. (2014). A pitch extraction algorithm tuned for automatic speech recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  13. Ghahremani, P., Manohar, V., Hadian, H., Povey, D., Khudanpur, S. (2017). Investigation of transfer learning for ASR using LF-MMI trained neural networks. In ASRU 2017. http://www.danielpovey.com/files/2017_asru_transfer_learning.pdf. Accessed 2019-01-26.
  14. Google. (2018). Cloud Speech-to-Text. https://cloud.google.com/speech-to-text/. Accessed 2019-01-26.
  15. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., Ng, A.Y. (2014). Deep Speech: scaling up end-to-end speech recognition. arXiv:1412.5567.
  16. Hori, T., Cho, J., Watanabe, S. (2018). End-to-end speech recognition with word-based RNN language models. arXiv:1808.02608.
  17. Hsu, W.H. (2018). A preliminary study on speaker diarization for automatic transcription of broadcast radio speech. Ph.D. thesis, National Taipei University of Technology. https://ir.lib.ntut.edu.tw/wSite/ct?mp=ntut&xItem=71271&ctNode=447. Accessed 2019-01-13.
  18. Huang, C., & Chen, K. (1998). Academia Sinica balanced corpus of Modern Chinese. http://ckip.iis.sinica.edu.tw/CKIP/engversion/20corpus.htm. Accessed 2019-01-27.
  19. Huang, C.R. (2009). Tagged Chinese Gigaword version 2.0. http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T14. Accessed 2019-01-26.
  20. Hung, H.T. (2018). The AlexHT system for FSR challenge. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15hjrTipVW0QOxb_tp_UAV9C2evdORdLu. Accessed 2019-01-20.
  21. iFlyTek. (2018). iFLYTEK Open Platform — China’s first artificial intelligence open platform for mobile internet and intelligent hardware developers. http://global.xfyun.cn/. Accessed 2019-01-19.
  22. ISCSLP. (2018). ISCSLP 2018 - the 11th International Symposium on Chinese Spoken Language Processing. http://iscslp2018.org/. Accessed 2019-01-26.
  23. Jaitly, N., & Hinton, G.E. (2013). Vocal tract length perturbation (VTLP) improves speech recognition. In ICML Workshop on Deep Learning for Audio, Speech and Language. https://pdfs.semanticscholar.org/3de0/616eb3cd4554fdf9fd65c9c82f2605a17413.pdf. Accessed 2019-01-26.
  24. Kaldi-ASR. (2018). Kaldi speech recognition toolkit. https://github.com/kaldi-asr/kaldi. Accessed 2019-01-27.
  25. Kanda, N., Fujita, Y., Nagamatsu, K. (2018). Lattice-free state-level minimum Bayes risk training of acoustic models.
  26. KingLine Data Center. (2018). KingLine Data Center. http://kingline.speechocean.com/. Accessed 2019-01-26.
  27. KingLine Data Center. (2018). Taiwanese and English mixed speech recognition corpus (mobile) - sentences - 1026 speakers. http://kingline.speechocean.com/exchange.php?id=14927&act=view. Accessed 2019-01-27.
  28. KingLine Data Center. (2018). Taiwanese speech recognition corpus (desktop) - conversation - 300 speakers. http://kingline.speechocean.com/exchange.php?id=19262&act=view. Accessed 2019-01-27.
  29. KingLine Data Center. (2018). Taiwanese speech recognition corpus (desktop) - sentences - 204 speakers. http://kingline.speechocean.com/exchange.php?act=view&id=1548. Accessed 2019-01-27.
  30. KingLine Data Center. (2018). Taiwanese speech recognition corpus (mobile) - conversation - 300 speakers. http://kingline.speechocean.com/exchange.php?id=19228&act=view. Accessed 2019-01-27.
  31. KingLine Data Center. (2018). Taiwanese speech recognition corpus (mobile) - sentences - 5232 speakers. http://kingline.speechocean.com/exchange.php?id=766&act=view. Accessed 2019-01-27.
  32. Ko, T., Peddinti, V., Povey, D., Khudanpur, S. (2015). Audio augmentation for speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 3586–3589). https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf. Accessed 2019-01-26.
  33. Lee, H.S., Chen, K.Y., Tsao, Y., Wang, H.M. (2018). The AS Kaldi-based Taiwanese Mandarin ASR system for FSR-2018. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15gsMr_ZtT6Wuotz8-T9Gysj-7-tJz4Mw. Accessed 2019-01-20.
  34. Liang, H.B., & Wang, Y.R. (2018). The NCTU ASR system for Formosa Speech Recognition Challenge 2018. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15inv3RHf9bTxwhwqrXwWbqNcxAfDgoxl. Accessed 2019-01-20.
  35. Liao, Y.F. (2018). Call for FSR-2018 participants - ISCSLP 2018. http://iscslp2018.org/CFParticipants.html. Accessed 2019-01-26.
  36. Liao, Y.F. (2018). Formosa speech in the wild corpus. https://sites.google.com/speech.ntut.edu.tw/fsw/home/corpus. Accessed 2019-01-26.
  37. Liao, Y.F. (2018). Formosa speech in the wild project - GitLab server. https://speech.nchc.org.tw. Accessed 2019-01-27.
  38. Liao, Y.F. (2018). Formosa speech recognition challenge 2018. https://sites.google.com/speech.ntut.edu.tw/fsw/home/challenge. Accessed 2019-01-26.
  39. Liao, Y.F. (2018). Formosa speech recognition challenge 2018. https://sites.google.com/speech.ntut.edu.tw/fsw/home/workshop. Accessed 2019-01-26.
  40. Liao, Y.F. (2018). Formosa speech recognition recipe. https://github.com/yfliao/kaldi/tree/master/egs/formosa. Accessed 2019-01-28.
  41. Liao, Y.F. (2018). Formosa speech recognition recipe. https://github.com/kaldi-asr/kaldi/tree/master/egs/formosa. Accessed 2019-01-28.
  42. Liao, Y.F. (2018). Kaldi pull request #2474 - formosa_speech recipe and database for Taiwanese Mandarin speech recognition. https://github.com/kaldi-asr/kaldi/pull/2474. Accessed 2019-01-28.
  43. Liao, Y.F., Chang, Y.H.S., Wang, S.Y., Chen, J.W., Wang, S.M., Wang, J.H. (2018). A progress report of the Taiwan Mandarin radio speech corpus project. In 2017 20th Conference of the Oriental Chapter of the International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques, O-COCOSDA 2017 (pp. 1–6). https://doi.org/10.1109/ICSDA.2017.8384450.
  44. Liao, Y.F., Hsu, W.H., Lin, Y.C., Chang, Y.H.S., Pleva, M. (2018). Formosa speech recognition challenge 2018: data, plan and baselines. IEEE.
  45. Linguistic Data Consortium. (1996). CALLFRIEND Mandarin Chinese-Taiwan dialect. https://catalog.ldc.upenn.edu/LDC96S56. Accessed 2019-01-27.
  46. Linguistic Data Consortium. (1998). Taiwanese Putonghua speech and transcripts. https://catalog.ldc.upenn.edu/LDC98S72. Accessed 2019-01-27.
  47. Linguistic Data Consortium, University of Pennsylvania. (2008). Linguistic Data Consortium webpage. https://www.ldc.upenn.edu/. Accessed 2019-01-26.
  48. Lu, M.P., & Chen, C.P. (2018). NSYSU team for the Formosa speech recognition challenge 2018. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15ndD-mwfM3JZ0DX_6BfArxSdb5J1dDYQ. Accessed 2019-01-20.
  49. Manohar, V., Hadian, H., Povey, D., Khudanpur, S. (2018). Semi-supervised training of acoustic models using lattice-free MMI. In ICASSP 2018 (pp. 4844–4848).
  50. Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J. (2011). RNNLM - recurrent neural network language modeling toolkit. In Proceedings of ASRU 2011 (pp. 196–201). http://www.fit.vutbr.cz/imikolov/rnnlm/rnnlm-demo.pdf. Accessed 2019-01-20.
  51. Milivojević, Z., Savić, N., Brodić, D. (2017). Three-parametric cubic interpolation for estimating the fundamental frequency of the speech signal. Computing and Informatics, 36(2), 449–469.
  52. Ministry of Science and Technology Taiwan. (2018). Formosa grand challenge, talk to AI. https://fgc.stpi.narl.org.tw/activity/techai2018. Accessed 2019-01-26.
  53. Mozilla. (2013). Common Voice. https://voice.mozilla.org/zh-TW. Accessed 2019-01-27.
  54. Mozilla. (2018). Project DeepSpeech. https://github.com/mozilla/DeepSpeech. Accessed 2019-01-28.
  55. National Education Radio. (2018). National Education Radio. https://www.ner.gov.tw/english. Accessed 2019-01-20.
  56. National Statistics Taiwan. (2010). 2010 population and housing census. https://www.stat.gov.tw/public/Attachment/21081884771.pdf. Accessed 2019-01-13.
  57. Phonetics Laboratory, University of Pennsylvania. (2018). The Penn Phonetics Lab forced aligner. https://babel.ling.upenn.edu/phonetics/old_website_2015/p2fa/index.html. Accessed 2019-01-26.
  58. Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohamadi, M., Khudanpur, S. (2018). Semi-orthogonal low-rank matrix factorization for deep neural networks. In Interspeech 2018 (pp. 3743–3747). https://doi.org/10.21437/Interspeech.2018-1417. http://www.danielpovey.com/files/2018_interspeech_tdnnf.pdf.
  59. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., Veselý, K. (2011). The Kaldi speech recognition toolkit. In ASRU 2011. http://kaldi.sf.net/.
  60. Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., Khudanpur, S. (2016). Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 2751–2755). https://doi.org/10.21437/Interspeech.2016-595.
  61. Sak, H., Senior, A.W., Rao, K., Beaufays, F. (2015). Fast and accurate recurrent neural network acoustic models for speech recognition. In INTERSPEECH 2015, Dresden, Germany (pp. 1468–1472). http://www.isca-speech.org/archive/interspeech_2015/i15_1468.html.
  62. SpeechOcean. (2018). Speech data services - text data and image data services - speech datasets database. http://en.speechocean.com/. Accessed 2019-01-26.
  63. Steering Committee for the Test of Proficiency-Huayu. (2018). The test of Chinese as a foreign language (TOCFL). https://www.sc-top.org.tw/english/eng_index.php. Accessed 2019-01-26.
  64. Tan, T., Qian, Y., Hu, H., Zhou, Y., Ding, W., Yu, K. (2018). Adaptive very deep convolutional residual network for noise robust speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 26(8), 1393–1405. https://doi.org/10.1109/TASLP.2018.2825432.
  65. Tang, H., Lu, L., Kong, L., Gimpel, K., Livescu, K., Dyer, C., Smith, N.A., Renals, S. (2017). End-to-end neural segmental models for speech recognition. IEEE Journal on Selected Topics in Signal Processing, 11(8), 1254–1264. https://doi.org/10.1109/JSTSP.2017.2752462.
  66. The Association for Computational Linguistics and Chinese Language Processing. (2000). Database. http://www.aclclp.org.tw/corp.php. Accessed 2019-01-27.
  67. The Association for Computational Linguistics and Chinese Language Processing. (2018). TCC300 corpus. http://www.aclclp.org.tw/use_mat.php#tcc300edu. Accessed 2019-01-27.
  68. The Association for Computational Linguistics and Chinese Language Processing. (2018). The Association for Computational Linguistics and Chinese Language Processing. http://www.aclclp.org.tw. Accessed 2019-01-26.
  69. The European Language Resources Association. (2018). ELRA-ELDA: the evaluations and language resources distribution agency. http://www.elra.info/en/. Accessed 2019-01-26.
  70. Wang, H.C., Seide, F., Tseng, C.Y., Lee, L.S. (2000). MAT-2000 – design, collection, and validation of a Mandarin 2000-speaker telephone speech database. In InterSpeech (pp. 3–6).
  71. Wang, H.M. (2003). MATBN 2002: a Mandarin Chinese broadcast news corpus. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 10(2), 219–236.
  72. Wells, J.C. (1995). Computer-coding the IPA: a proposed extension of SAMPA. https://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf. Accessed 2019-01-26.
  73. Wikipedia. (2018). Languages of Taiwan. http://www.ethnologue.com/show_country.asp?name=Taiwan. Accessed 2019-01-27.
  74. Wikipedia. (2018). Taiwanese Mandarin. https://en.wikipedia.org/wiki/Taiwanese_Mandarin. Accessed 2019-01-20.
  75. Wu, M.C., Chen, W.Y., Misbullah, A. (2018). Established a Taiwanese speech recognition system for Formosa speech recognition challenge 2018. In Formosa Speech Recognition Challenge Workshop. https://drive.google.com/file/d/15kKTG_w_jbx20vW1s_6rBAScXMAHYxd-.
  76. Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., Zweig, G. (2017). The Microsoft 2016 conversational speech recognition system. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (pp. 5255–5259). https://doi.org/10.1109/ICASSP.2017.7953159.
  77. Xu, H., Povey, D., Mangu, L., Zhu, J. (2011). Minimum Bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language, 25(4), 802–828. https://doi.org/10.1016/j.csl.2011.03.001.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan
  2. Taipei City, Taiwan
  3. Department of English, National Taipei University of Technology, Taipei, Taiwan
  4. Department of Electronics and Multimedia Communications, KEMT FEI, Technical University of Kosice, Kosice, Slovakia