Abstract
The spectral effects of vocal tract length (VTL) differences are one reason why today’s speaker-independent automatic speech recognition (ASR) systems achieve lower recognition rates than speaker-dependent ones. With certain types of filter banks, the VTL-related effects can be described as a translation in subband-index space. In this paper, nonlinear translation-invariant transformations that were originally proposed in the field of pattern recognition are investigated for their applicability to speaker-independent ASR tasks. It is shown that combining different types of such transformations yields features that are more robust against VTL changes than the standard mel-frequency cepstral coefficients, and that these features come close to the performance of vocal tract length normalization without any adaptation to individual speakers.
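To make the idea of translation invariance in subband-index space concrete, the following minimal sketch implements a rapid-transform-style butterfly (pairwise sums and absolute differences, applied recursively). It is an illustration only: the abstract does not specify which transformations the paper combines, and the names `rapid_transform`, `cyclic_shift`, and `subband_energies`, as well as the numeric values, are hypothetical. The sketch also assumes cyclic shifts and a power-of-two number of subbands, whereas a real VTL-induced translation is not cyclic.

```python
def rapid_transform(x):
    """Recursive sum / absolute-difference butterfly.

    Assumes len(x) is a power of two. The output is unchanged when the
    input is cyclically shifted, because each stage maps a cyclic shift
    of its input to cyclic shifts of its two half-length outputs.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    a, b = x[:half], x[half:]
    sums = [p + q for p, q in zip(a, b)]        # first half of the stage output
    diffs = [abs(p - q) for p, q in zip(a, b)]  # second half of the stage output
    return rapid_transform(sums) + rapid_transform(diffs)


def cyclic_shift(x, k):
    """Rotate the list x to the right by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)


# Hypothetical subband energies for one frame (made-up values).
subband_energies = [0, 1, 4, 2, 3, 0, 0, 0]
reference = rapid_transform(subband_energies)

# Every cyclic translation along the subband index gives the same features.
for shift in range(len(subband_energies)):
    assert rapid_transform(cyclic_shift(subband_energies, shift)) == reference
```

The invariance comes at a cost: different inputs that are not translations of each other can also map to the same output, which is one plausible motivation for combining several such transformations, as the abstract describes.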
This work has been supported by the German Research Foundation under Grant No. ME1170/2-1.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Müller, F., Mertins, A. (2010). Robust Features for Speaker-Independent Speech Recognition Based on a Certain Class of Translation-Invariant Transformations. In: Solé-Casals, J., Zaiats, V. (eds) Advances in Nonlinear Speech Processing. NOLISP 2009. Lecture Notes in Computer Science, vol 5933. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11509-7_15
DOI: https://doi.org/10.1007/978-3-642-11509-7_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11508-0
Online ISBN: 978-3-642-11509-7
eBook Packages: Computer Science (R0)