Abstract
In this paper, neural networks are applied as feature extractors for a speech recognition system and a speaker verification system. Long-temporal features with delta coefficients and mean and variance normalization are applied when the neural-network-based feature extraction is trained jointly with a neural-network-based voice activity detector and a neural-network-based acoustic model for speech recognition. In speaker verification, the acoustic model is replaced with a score computation. The performance of our speech recognition system was evaluated on the British English speech corpus WSJCAM0, and the performance of our speaker verification system was evaluated on our Czech speech corpus.
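The abstract mentions delta coefficients together with mean and variance normalization as part of the feature pipeline. As an illustrative sketch only (not the authors' implementation), the two steps can be realized as follows; the window width `N = 2` and the per-utterance normalization are common defaults assumed here, not values taken from the paper:

```python
import numpy as np

def delta(features, N=2):
    """Delta coefficients via the standard regression formula over +/- N frames."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        deltas[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)
        ) / denom
    return deltas

def mvn(features, eps=1e-8):
    """Per-utterance mean and variance normalization of each feature dimension."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: 100 frames of hypothetical 40-dimensional spectral features
feats = np.random.randn(100, 40)
d1 = delta(feats)                    # first-order deltas
d2 = delta(d1)                       # second-order (delta-delta)
stacked = np.hstack([feats, d1, d2])  # 120-dimensional feature vectors
normalized = mvn(stacked)             # zero mean, unit variance per dimension
```

The normalized, stacked vectors would then be the input to the neural-network feature extractor described in the paper.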
Acknowledgments
This research was supported by the Ministry of Culture of the Czech Republic, project No. DF12P01OVV022.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zelinka, J., Vaněk, J., Müller, L. (2015). Neural-Network-Based Spectrum Processing for Speech Recognition and Speaker Verification. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science(), vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1