Abstract
Because search spaces and efficiency demands differ across ASR applications, state-of-the-art confidence measures and their decoding frameworks are heterogeneous among keyword spotting, domain-specific recognition and LVCSR. Inspired by the success of replacing the word lattice with a phone-level language model in discriminative training, this work proposes the auxiliary normalization graph, which is constructed to model the observation probability in hypothesis-posterior-based confidence measures. In this way, modelling of the normalizing term becomes independent of the original search space, and the confidence measures can be grouped into a unified framework. Experiments on three typical ASR applications show that the proposed unified confidence measure framework achieves performance comparable to that of the separately optimized system on each task.
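As a rough illustration of the idea in the abstract, a hypothesis-posterior confidence can be written as the ratio of the hypothesis joint score p(O|W)P(W) to the observation probability p(O). The sketch below assumes that p(O) is approximated by summing path scores from a separate normalization graph; the function names, the list-of-path-scores input and the clipping to 1.0 are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sum_exp(log_values):
    """Numerically stable log(sum(exp(v) for v in log_values))."""
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))


def posterior_confidence(hyp_log_joint, norm_path_log_scores):
    """Return exp(log p(O, W) - log p(O)), clipped to at most 1.0.

    hyp_log_joint: log p(O | W) + log P(W) for the decoded hypothesis.
    norm_path_log_scores: hypothetical log scores of the paths in a
        normalization graph, summed to approximate log p(O).
    """
    log_evidence = log_sum_exp(norm_path_log_scores)
    # Assumption: the normalization graph is separate from the search space,
    # so the ratio is not guaranteed to stay below 1 and is clipped here.
    return min(1.0, math.exp(hyp_log_joint - log_evidence))


if __name__ == "__main__":
    paths = [math.log(0.2), math.log(0.3), math.log(0.5)]
    print(posterior_confidence(math.log(0.2), paths))
```

Because the denominator depends only on the normalization graph, the same function could in principle be reused whether the hypothesis comes from a keyword spotter, a grammar decoder or an LVCSR decoder.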
K. Yu—This work was supported by the Shanghai Sailing Program No. 16YF1405300, the China NSFC projects (No. 61573241 and No. 61603252) and the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China. Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
Notes
1. LVCSR-based KWS is not included in the discussion because it is mostly a matter of improving acoustic model performance and the keyword indexing algorithm. Besides, its computational burden makes it unsuitable for resource-limited scenarios.
2. In languages such as English, by contrast, the mapping is many-to-many.
3. Grammar language model based decoding is used, because the in-domain and out-of-domain evaluations discussed in Sect. 2.2 are similar for grammar and class based models.
4. The comparison between CMs in the CTC and HMM frameworks was conducted in previous research [15]; all the comparisons below are within the CTC framework.
5. As in [14], the acoustic model is a small one intended for embedded applications. Therefore, computation time is comparable across all the above portions in the three tasks.
References
Weintraub, M.: Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system. In: 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 463–466. IEEE (1993)
Woodland, P.C., Odell, J.J., Valtchev, V., Young, S.J.: Large vocabulary continuous speech recognition using HTK. In: 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II–125. IEEE (1994)
Ward, W., Issar, S.: A class based language model for speech recognition. In: 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-1996, Conference Proceedings, vol. 1, pp. 416–418. IEEE (1996)
Vasserman, L., Haynor, B., Aleksic, P.: Contextual language model adaptation using dynamic classes. In: 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 441–446. IEEE (2016)
Cleveland, J., Thakur, D., Dames, P., Phillips, C., Kientz, T., Daniilidis, K., Bergstrom, J., Kumar, V.: Automated system for semantic object labeling with soft-object recognition and dynamic programming segmentation. IEEE Trans. Autom. Sci. Eng. 14(2), 820–833 (2017)
Hakkani-Tür, D., Béchet, F., Riccardi, G., Tur, G.: Beyond ASR 1-best: using word confusion networks in spoken language understanding. Comput. Speech Lang. 20(4), 495–514 (2006)
Hu, W., Qian, Y., Soong, F.K.: A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL). In: INTERSPEECH, pp. 1886–1890 (2013)
Chen, G., Parada, C., Heigold, G.: Small-footprint keyword spotting using deep neural networks. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4087–4091. IEEE (2014)
Young, S.R.: Detecting misrecognitions and out-of-vocabulary words. In: 1994 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-1994, vol. 2, pp. II–21. IEEE (1994)
Wessel, F., Schluter, R., Macherey, K., Ney, H.: Confidence measures for large vocabulary continuous speech recognition. IEEE Trans. Speech Audio Process. 9(3), 288–298 (2001)
Rose, R.C., Juang, B.-H., Lee, C.-H.: A training procedure for verifying string hypotheses in continuous speech recognition. In: 1995 International Conference on Acoustics, Speech, and Signal Processing, ICASSP-1995, vol. 1, pp. 281–284. IEEE (1995)
Chen, S.F., Kingsbury, B., Mangu, L., Povey, D., Saon, G., Soltau, H., Zweig, G.: Advances in speech transcription at IBM under the DARPA EARS program. IEEE Trans. Audio Speech Lang. Process. 14(5), 1596–1608 (2006)
Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., Khudanpur, S.: Purely sequence-trained neural networks for ASR based on lattice-free MMI. In: Interspeech 2016 (2016)
Chen, Z., Deng, W., Xu, T., Yu, K.: Phone synchronous decoding with CTC lattice. In: Interspeech 2016, pp. 1923–1927 (2016). http://dx.doi.org/10.21437/Interspeech.2016-831
Chen, Z., Zhuang, Y., Yu, K.: Confidence measures for CTC-based phone synchronous decoding. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE (2017)
Chen, Z., Zhuang, Y., Qian, Y., Yu, K.: Phone synchronous speech recognition with CTC lattices. IEEE/ACM Trans. Audio Speech Lang. Process. 25(1), 86–97 (2017)
Stolcke, A., et al.: SRILM - an extensible language modeling toolkit. In: Interspeech 2002 (2002)
Chen, I.-F., Ni, C., Lim, B.P., Chen, N.F., Lee, C.-H.: A novel keyword+LVCSR-filler based grammar network representation for spoken keyword search. In: 2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 192–196. IEEE (2014)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chen, Z., Qian, Y., Yu, K. (2017). A Unified Confidence Measure Framework Using Auxiliary Normalization Graph. In: Sun, Y., Lu, H., Zhang, L., Yang, J., Huang, H. (eds) Intelligence Science and Big Data Engineering. IScIDE 2017. Lecture Notes in Computer Science, vol 10559. Springer, Cham. https://doi.org/10.1007/978-3-319-67777-4_11
DOI: https://doi.org/10.1007/978-3-319-67777-4_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67776-7
Online ISBN: 978-3-319-67777-4
eBook Packages: Computer Science, Computer Science (R0)