Abstract
This paper deals with neural network-based estimation of articulatory features for Czech, intended for use in automatic phonetic segmentation and automatic speech recognition. In our current approach, multi-layer perceptron networks extract the articulatory features via a non-linear mapping from standard acoustic features computed from the speech signal. We analysed the suitability of various acoustic features and the optimum length of the temporal context at the network input. The temporal context is represented by a context window created from stacked feature vectors; its optimum length was identified within the range of 9 to 21 frames. With mel-log filter-bank features we obtained a frame-level accuracy of 90.5% on average across all articulatory feature classes; the highest classification rate, 95.3%, was achieved for the voicing class.
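The abstract's context window of stacked feature vectors can be illustrated with a short sketch. The paper does not publish its feature-stacking code; the snippet below is a minimal NumPy illustration of the general technique, with the function name `stack_context` and the 24-coefficient filter-bank dimension chosen here for the example, not taken from the paper.

```python
import numpy as np

def stack_context(features, window=9):
    """Stack each frame with its neighbours into a context window.

    features : (T, D) array of per-frame acoustic features
    window   : odd number of frames in the context window
    returns  : (T, window * D) array of stacked feature vectors
    """
    assert window % 2 == 1, "context window length should be odd"
    half = window // 2
    # Repeat the first/last frame at the edges so every frame has full context.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    T, D = features.shape
    # Each row concatenates the current frame with `half` frames on each side.
    stacked = np.stack([padded[t:t + window].ravel() for t in range(T)])
    return stacked

# Example: 100 frames of 24 mel-log filter-bank coefficients, stacked with
# an 11-frame context window, give 264-dimensional MLP input vectors.
X = np.random.randn(100, 24)
print(stack_context(X, window=11).shape)  # (100, 264)
```

Stacked vectors of this form, for window lengths between 9 and 21 frames, are what the abstract describes feeding to the MLP input layer.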
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Mizera, P., Pollak, P. (2015). Improved Estimation of Articulatory Features Based on Acoustic Features with Temporal Context. In: Král, P., Matoušek, V. (eds.) Text, Speech, and Dialogue. TSD 2015. Lecture Notes in Computer Science, vol. 9302. Springer, Cham. https://doi.org/10.1007/978-3-319-24033-6_63
Print ISBN: 978-3-319-24032-9
Online ISBN: 978-3-319-24033-6