Manifold learning based speaker dependent dimension reduction for robust text independent speaker verification

International Journal of Speech Technology

Abstract

Speaker verification has been studied widely from several points of view, including accuracy, robustness and real-time operation. Recent studies have turned toward improving feature stability and robustness. In this paper we study the effect of nonlinear, manifold-based dimensionality reduction on feature robustness. Manifold learning is a popular recent approach to nonlinear dimensionality reduction. Its algorithms rest on the idea that each data point can be described as a function of only a few parameters, and they attempt to uncover these parameters in order to find a low-dimensional representation of the data. Among the manifold-based dimension reduction approaches, we applied the widely used isometric mapping (Isomap) algorithm. Because speaker verification compares the input utterance with the model of the claimed client, a speaker-dependent feature transformation should help in deciding the identity of the speaker. Our first contribution is therefore to apply Isomap dimension reduction in the speaker-dependent context and to compare its performance with two other widely used approaches, namely principal component analysis and factor analysis. Our other contribution is to perform the nonlinear transformation in a speaker-dependent framework. We evaluated this approach in a GMM-based speaker verification framework on the TFarsdat telephone speech dataset under different noise types and SNRs, and the evaluations show reliability and robustness even at low SNRs. The results also show better performance for the proposed Isomap approach than for the other approaches.
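
The Isomap procedure named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `isomap`, the parameters `n_neighbors` and `n_components`, and the toy arc data are assumptions made for illustration, and the abstract does not state the neighbourhood size or target dimension used in the paper. The sketch follows the three standard Isomap steps (neighbourhood graph, shortest-path geodesic distances, classical multidimensional scaling) and uses the Floyd-Warshall recurrence for brevity, which costs O(n^3) and is only practical for small point sets.

```python
import numpy as np


def isomap(X, n_neighbors=5, n_components=2):
    """Illustrative Isomap: kNN graph -> geodesic distances -> classical MDS."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 1: neighbourhood graph. Keep only the edges from each point
    # to its k nearest neighbours (column 0 of the sort is the point itself).
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(n), nn.shape[1])
    G[rows, nn.ravel()] = D[rows, nn.ravel()]
    G = np.minimum(G, G.T)  # make the graph undirected

    # Step 2: geodesic distances as shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighbourhood graph is disconnected; increase n_neighbors")

    # Step 3: classical MDS on the geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (G ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)  # guard against small negatives
    return eigvecs[:, order] * np.sqrt(lam)


# Toy usage: points on a 270-degree arc are unrolled onto one coordinate.
theta = np.linspace(0.0, 1.5 * np.pi, 60)
arc = np.column_stack([np.cos(theta), np.sin(theta)])
embedding = isomap(arc, n_neighbors=2, n_components=1)
```

In a speaker-dependent setting such as the one the abstract describes, one plausible reading is that a separate embedding is fitted on each client's feature vectors. That reading is an assumption here, and mapping unseen test frames into a fitted embedding would additionally need an out-of-sample extension, which this sketch does not provide.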

Author information

Corresponding author

Correspondence to Davood Zabihzadeh.

About this article

Cite this article

Zabihzadeh, D., Moattar, M.H. Manifold learning based speaker dependent dimension reduction for robust text independent speaker verification. Int J Speech Technol 17, 271–280 (2014). https://doi.org/10.1007/s10772-014-9228-6

  • DOI: https://doi.org/10.1007/s10772-014-9228-6
