
International Journal of Speech Technology

Volume 10, Issue 2–3, pp 95–107

Speaker distinguishing distances: a comparative study

  • Ananth N. Iyer (email author)
  • Uchechukwu O. Ofoegbu
  • Robert E. Yantorno
  • Brett Y. Smolenski
Article

Abstract

Speaker discrimination is a vital aspect of speaker recognition applications such as speaker identification, verification, clustering, indexing and change-point detection. These tasks are usually performed using distance-based approaches that compare speaker models or features from homogeneous speaker segments to determine whether or not they belong to the same speaker. Several distance measures and features have been examined for these applications; however, no single distance or feature has been reported to perform optimally for all applications in all conditions. In this paper, a thorough analysis is made to determine the behavior of some frequently used distance measures, as well as features, in distinguishing speakers for different data lengths. Measures studied include the Mahalanobis distance, Kullback-Leibler (KL) distance, T² statistic, Hellinger distance, Bhattacharyya distance, Generalized Likelihood Ratio (GLR), Levene distance, L2 and L distances. The features investigated are the Mel-Scale Frequency Cepstral Coefficients (MFCCs), Linear Predictive Cepstral Coefficients (LPCCs), Line Spectral Pairs (LSPs) and Log Area Ratios (LARs). Generally, a larger data size for each speaker results in better speaker differentiating capability, as more information can be taken into account. However, in some applications, such as segmentation of telephone data, speakers change frequently, making it impossible to obtain large speaker-consistent utterances (especially when speaker change-points are unknown). A metric is defined for determining the probability of speaker discrimination error obtainable for each distance measure using each feature set, and the effect of data size on this probability is observed.
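To make the distance-based comparison concrete, the following is a minimal sketch of four of the listed measures, assuming each speaker segment is summarised by a single full-covariance Gaussian fitted to its feature frames. The abstract does not give the paper's exact definitions, so the standard textbook forms are used here; the function names, the averaged covariance in the Mahalanobis distance, and the symmetrised form of the KL distance are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian(frames):
    """Estimate mean and full covariance from an (n_frames, n_dims) feature array."""
    frames = np.asarray(frames, dtype=float)
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def mahalanobis(m1, c1, m2, c2):
    """Mahalanobis distance between the means, using the averaged covariance (assumption)."""
    d = m1 - m2
    avg = 0.5 * (c1 + c2)
    return float(np.sqrt(d @ np.linalg.solve(avg, d)))

def kl_symmetric(m1, c1, m2, c2):
    """Symmetrised KL divergence KL(p||q) + KL(q||p); the log-determinant terms cancel."""
    d = m1 - m2
    k = len(m1)
    inv1, inv2 = np.linalg.inv(c1), np.linalg.inv(c2)
    trace_term = np.trace(inv2 @ c1) + np.trace(inv1 @ c2) - 2 * k
    mean_term = d @ (inv1 + inv2) @ d
    return float(0.5 * (trace_term + mean_term))

def bhattacharyya(m1, c1, m2, c2):
    """Bhattacharyya distance between two Gaussians."""
    d = m1 - m2
    avg = 0.5 * (c1 + c2)
    # slogdet returns (sign, log|det|); only the log-determinant is needed here.
    _, logdet_avg = np.linalg.slogdet(avg)
    _, logdet1 = np.linalg.slogdet(c1)
    _, logdet2 = np.linalg.slogdet(c2)
    mean_term = 0.125 * (d @ np.linalg.solve(avg, d))
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet1 + logdet2))
    return float(mean_term + cov_term)

def hellinger(m1, c1, m2, c2):
    """Hellinger distance sqrt(1 - BC), where BC = exp(-Bhattacharyya distance)."""
    bc = np.exp(-bhattacharyya(m1, c1, m2, c2))
    return float(np.sqrt(max(0.0, 1.0 - bc)))
```

With short segments the covariance estimate from `fit_gaussian` becomes noisy or singular, which is one plausible reason the choice of distance matters more as the data length shrinks.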
Furthermore, simple distance-based speaker identification and clustering systems are developed, and the performance of each distance and feature for various data sizes is evaluated on these systems to illustrate the importance of choosing the appropriate distance and feature for each application. Results show that for tasks which do not involve any limitation of data length, such as speaker identification, the Kullback-Leibler distance with the MFCCs yields the highest speaker differentiation performance, which is comparable to results obtained using more complex state-of-the-art speaker identification systems. Results also indicate that the Hellinger and Bhattacharyya distances with the LSPs yield the best performance for small data sizes.
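A simple distance-based identification system of the kind described could look roughly like the sketch below: each enrolled speaker is represented by one Gaussian model, and a test segment is assigned to the speaker whose model is nearest. This is an illustrative assumption, not the authors' system. The names `gaussian_model`, `kl_symmetric` and `identify` are hypothetical, the symmetrised KL divergence is one common reading of "KL distance", and the synthetic random frames merely stand in for real MFCC features.

```python
import numpy as np

def gaussian_model(frames):
    """Summarise an (n_frames, n_dims) feature array as (mean, full covariance)."""
    frames = np.asarray(frames, dtype=float)
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def kl_symmetric(model_a, model_b):
    """Symmetrised KL divergence between two Gaussian models."""
    (m1, c1), (m2, c2) = model_a, model_b
    d = m1 - m2
    k = len(m1)
    inv1, inv2 = np.linalg.inv(c1), np.linalg.inv(c2)
    trace_term = np.trace(inv2 @ c1) + np.trace(inv1 @ c2) - 2 * k
    return float(0.5 * (trace_term + d @ (inv1 + inv2) @ d))

def identify(test_frames, enrolled, distance=kl_symmetric):
    """Return the label of the enrolled model closest to the test segment."""
    test_model = gaussian_model(test_frames)
    return min(enrolled, key=lambda name: distance(test_model, enrolled[name]))

# Synthetic stand-in data: two hypothetical speakers with different feature means.
rng = np.random.default_rng(0)
enrolled = {
    "spk_a": gaussian_model(rng.normal(loc=0.0, size=(500, 4))),
    "spk_b": gaussian_model(rng.normal(loc=3.0, size=(500, 4))),
}
short_segment = rng.normal(loc=3.0, size=(50, 4))  # drawn like spk_b
guess = identify(short_segment, enrolled)
print(guess)
```

Passing a different callable as `distance` lets the same loop compare measures, and varying the number of test frames gives a rough way to probe the dependence on data length that the abstract describes.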

Keywords

Speaker discrimination · Distances · Speaker identification · Speaker clustering


Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Ananth N. Iyer (1), email author
  • Uchechukwu O. Ofoegbu (2)
  • Robert E. Yantorno (3)
  • Brett Y. Smolenski (4)

  1. Conversay, Redmond, USA
  2. Montgomery College, Rockville, USA
  3. Temple University, Philadelphia, USA
  4. RADC, Rome, USA
