Improved Data Modeling for Text-Dependent Speaker Recognition Using Sub-Band Processing

  • R.A. Finan
  • R.I. Damper
  • A.T. Sapeluk
Abstract

A growing body of recent work documents the potential benefits of sub-band processing over wideband processing in automatic speech recognition and, less usually, speaker recognition. It is often found that the sub-band approach delivers performance improvements (especially in the presence of noise), but not always so. This raises the question of precisely when and how sub-band processing might be advantageous, which is difficult to answer because there is as yet only a rudimentary theoretical framework guiding this work. We describe a simple sub-band speaker recognition system designed to facilitate experimentation aimed at increasing understanding of the approach. This splits the time-domain speech signal into 16 sub-bands using a bank of second-order filters spaced on the psychophysical mel scale. Each sub-band has its own separate cepstral-based recognition system, the outputs of which are combined using the sum rule to produce a final decision. We find that sub-band processing leads to worthwhile reductions in both the verification and identification error rates relative to the wideband system, decreasing the identification error rate from 3.33% to 0.56% and equal error rate for verification by approximately 50% for clean speech. The hypothesis is advanced that, unlike the wideband system, sub-band processing effectively constrains the free parameters of the speaker models to be more uniformly deployed across frequency: as such, it offers a practical solution to the bias/variance dilemma of data modeling. Much remains to be done to explore fully the new paradigm of sub-band processing. Accordingly, several avenues for future work are identified. In particular, we aim to explore the hypothesis of a practical solution to the bias/variance dilemma in more depth.
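The two structural ideas in the abstract, band centres spaced on the mel scale and sum-rule fusion of per-band outputs, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the mel formula is one common analytic approximation, the 0 to 4000 Hz range is chosen only for illustration, and the function names and per-band scores are hypothetical. The filters themselves and the cepstral recognisers are not modelled.

```python
import math


def hz_to_mel(f_hz):
    # One widely used analytic approximation to the mel scale (assumed here;
    # the abstract does not say which formula the authors used).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_centres(n_bands, f_low_hz, f_high_hz):
    """Return n_bands centre frequencies (Hz) equally spaced in mel,
    strictly between f_low_hz and f_high_hz."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (n_bands + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_bands)]


def sum_rule_identify(band_scores):
    """Sum-rule fusion: add each speaker's score over all sub-bands and
    pick the speaker with the largest total.

    band_scores is a list (one entry per sub-band) of dicts mapping a
    speaker label to a score where higher means a better match
    (hypothetical, assumed comparable across bands).
    """
    totals = {}
    for scores in band_scores:
        for speaker, score in scores.items():
            totals[speaker] = totals.get(speaker, 0.0) + score
    best = max(totals, key=totals.get)
    return best, totals


# 16 bands, as in the abstract; the frequency range is an assumption.
centres = mel_spaced_centres(16, 0.0, 4000.0)

# Three hypothetical sub-band recognisers scoring two speakers.
bands = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.3, "b": 0.7},
    {"a": 0.7, "b": 0.3},
]
best, totals = sum_rule_identify(bands)
```

In this toy case one band prefers speaker "b", but the summed evidence favours "a". That is the sense in which fusion lets reliable bands outvote an unreliable one.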

Keywords: speaker recognition, linear prediction, Fletcher-Allen principle, information fusion, bias-variance dilemma

Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • R.A. Finan (1)
  • R.I. Damper (2)
  • A.T. Sapeluk (1)

  1. School of Engineering, University of Abertay Dundee, Scotland
  2. Image, Speech and Intelligent Systems (ISIS) Research Group, Department of Electronics and Computer Science, University of Southampton, Hants, UK
