Robust Features for Speaker-Independent Speech Recognition Based on a Certain Class of Translation-Invariant Transformations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 5933))

Abstract

The spectral effects of vocal tract length (VTL) differences are one reason why today's speaker-independent automatic speech recognition (ASR) systems achieve lower recognition rates than speaker-dependent ones. With certain types of filter banks, the VTL-related effects can be described by a translation in subband-index space. In this paper, nonlinear translation-invariant transformations that were originally proposed in the field of pattern recognition are investigated for their applicability in speaker-independent ASR tasks. It is shown that combining different types of such transformations leads to features that are more robust against VTL changes than the standard mel-frequency cepstral coefficients, and that these features nearly match the performance of vocal tract length normalization without any adaptation to individual speakers.
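To make the idea of translation invariance in subband-index space concrete, the following is a minimal sketch of one classical transformation of this kind from the pattern-recognition literature, the rapid transform, which applies a sum and absolute-difference butterfly recursively. It is an illustration under stated assumptions, not the feature extraction used in the paper: the function names, the toy subband vector, and the use of a cyclic shift as a stand-in for the VTL-induced translation are all hypothetical. A real spectrum shifts non-cyclically, so boundary effects would have to be handled separately.

```python
def rapid_transform(x):
    """Recursive sum / absolute-difference butterfly (rapid transform).

    The output is unchanged when the input is cyclically shifted.
    The input length is assumed to be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    half = n // 2
    # Pair each element with the one half a period away.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [abs(x[i] - x[i + half]) for i in range(half)]
    # A cyclic shift of x cyclically shifts both halves, so the
    # invariance follows by induction over the recursion depth.
    return rapid_transform(sums) + rapid_transform(diffs)


def cyclic_shift(x, k):
    """Return x rotated to the right by k positions."""
    k %= len(x)
    return list(x[-k:]) + list(x[:-k]) if k else list(x)


# Hypothetical subband energies for one frame (illustrative values only).
subbands = [0.0, 1.0, 4.0, 2.0, 0.5, 0.0, 0.0, 0.0]
shifted = cyclic_shift(subbands, 2)

features = rapid_transform(subbands)
features_shifted = rapid_transform(shifted)
print(features == features_shifted)
```

The absolute value in the difference branch is what makes the transformation nonlinear; it also discards information, which is one motivation for combining several transformations of this type rather than relying on a single one.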

This work has been supported by the German Research Foundation under Grant No. ME1170/2-1.




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Müller, F., Mertins, A. (2010). Robust Features for Speaker-Independent Speech Recognition Based on a Certain Class of Translation-Invariant Transformations. In: Solé-Casals, J., Zaiats, V. (eds) Advances in Nonlinear Speech Processing. NOLISP 2009. Lecture Notes in Computer Science, vol 5933. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11509-7_15

  • DOI: https://doi.org/10.1007/978-3-642-11509-7_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11508-0

  • Online ISBN: 978-3-642-11509-7

  • eBook Packages: Computer Science, Computer Science (R0)
