i-Vectors in speech processing applications: a survey

Abstract

Many methods have been proposed over time in the domain of speech recognition, such as Gaussian mixture models (GMM), GMM with a universal background model (the GMM-UBM framework), and joint factor analysis. i-Vector subspace modeling is one of the more recent methods and has become the state-of-the-art technique in this domain. Its main benefit is that it models both intra-domain and inter-domain variabilities in the same low-dimensional space. In this survey, we present a comprehensive collection of research work related to i-vectors since their inception. Some recent trends of using i-vectors in combination with other approaches are also discussed. The application of i-vectors in various fields of speech recognition, such as speaker, language, and accent recognition, is also presented. This paper should serve as a good starting point for anyone interested in working with i-vectors for speech processing in general. We conclude the paper with a brief discussion of the future of i-vectors.

Notes

  1. https://catalog.ldc.upenn.edu/LDC93S1.

  2. https://catalog.ldc.upenn.edu/LDC97S62.

  3. https://ivectorchallenge.nist.gov.

References

  1. Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.1109/ICASSP.2003.1202761.

  2. Adami, A. G. (2007). Modeling prosodic differences for speaker recognition. Speech Communications, 49(4), 277–291. doi:10.1016/j.specom.2007.02.005.

    Article  Google Scholar 

  3. Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.1007/978-3-642-25020-0_32.

  4. Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.1109/ICASSP.2014.6854353.

  5. Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.1109/ICASSP.2012.6288990.

  6. Aronowitz, H., & Rendel, A. (2014). Domain adaptation for text dependent speaker verification. INTERSPEECH 2014, 15th annual conference of the international speech communication Association, Singapore. Retrieved September 14–18, 2014, pp. 1337–1341. http://www.isca-speech.org/archive/interspeech_2014/i14_1337.html.

  7. Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.1109/ICASSP.2013.6639089.

  8. Bahari, M. H., McLaren, M., Hamme, H. V., & van Leeuwen, D. A. (2014). Speaker age estimation using i-vectors. Engineering Applications of AI, 34, 99–108. doi:10.1016/j.engappai.2014.05.003.

    Google Scholar 

  9. Behravan, H., Hautamäki, V., & Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013, pp. 79–83. http://www.isca-speech.org/archive/interspeech_2013/i13_0079.html.

  10. Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.1016/j.specom.2014.

    Article  Google Scholar 

  11. Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105.

  12. Bousquet, P., Matrouf, D., & Bonastre, J. (2011). Intersession compensation and scoring methods in the i-vectors space for speaker recognition. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 485–488. http://www.isca-speech.org/archive/interspeech_2011/i11_0485.html.

  13. Brümmer, N., Strasheim, A., Hubeika, V., Matejka, P., Burget, L., & Glembek, O. (2009). Discriminative acoustic language recognition via channel-compensated GMM statistics. INTERSPEECH 2009, 10th annual conference of the international speech communication association, Brighton. Retrieved September 6–10, 2009. pp. 2187–2190. http://www.isca-speech.org/archive/interspeech_2009/i09_2187.html.

  14. Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.1109/ICASSP.2011.5947437.

  15. Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6.

    Article  Google Scholar 

  16. Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.1007/978-3-642-25449-9-22.

    Google Scholar 

  17. Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.1109/ICASSP.2013.6639174.

  18. Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.1109/ICASSP.2010.5495068.

  19. Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.1109/ICASSP.2014.6854875.

  20. Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.1109/ICASSP.2012.6288885.

  21. Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490.

  22. Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture.

  23. Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.1109/TASL.2007.902758.

    Article  Google Scholar 

  24. Dehak, N., Kenny, P., & Dumouchel, P. (2007b) Continuous prosodic features and formant modeling with joint factor analysis for speaker verification. INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. Retrieved August 27–31, 2007. pp 1234–1237. http://www.isca-speech.org/archive/interspeech_2007/i07_1234.html.

  25. Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19.

  26. Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.1109/ICASSP.2011.5947363.

  27. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.1109/TASL.2010.2064307.

    Article  Google Scholar 

  28. Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. A., & Dehak, R. (2011c). Language recognition via i-vectors and dimensionality reduction. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 857–860. http://www.isca-speech.org/archive/interspeech_2011/i11_0857.html.

  29. DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4.

  30. DeMarco, A., & Cox, S. J. (2013). Native accent classification via I-vectors and speaker compensation fusion. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 1472–1476. http://www.isca-speech.org/archive/interspeech_2013/i13_1472.html.

  31. Dupuy, G., Rouvier, M., Meignier, S., & Estève, Y. (2012). i-Vectors and ILP clustering adapted to cross-show speaker diarization. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 2174–2177. http://www.isca-speech.org/archive/interspeech_2012/i12_2174.html.

  32. Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.1109/ICASSP.2010.5495632.

  33. Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11)

  34. Gaida, C., Lange, P., Petrick, R., Proba, P., Malatawy, A., & Suendermann-Oeft, D. (2014). Comparing open-source speech recognition toolkits. http://suendermann.com/su/pdf/oasis2014.pdf.

  35. Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length nNormalization in speaker recognition systems. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence Retrieved August 27–31, 2011. pp. 249–252. http://www.isca-speech.org/archive/interspeech_2011/i11_0249.html.

  36. Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.1109/ICASSP.2012.6288859.

  37. Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.1109/ICASSP.2014.6853888.

  38. Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.1007/978-3-319-13623-3.

  39. Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.1109/ICASSP.2009.4960519.

  40. Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.1109/ICASSP.2011.5947358.

  41. Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.1109/ICASSP.2014.6854359.

  42. González, D. M., Plchot, O., Burget, L., Glembek, O., & Matejka, P. (2011). Language recognition in iVectors space. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 861–864. http://www.isca-speech.org/archive/interspeech_2011/i11_0861.html.

  43. Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.1109/ICASSP.2014.6854823.

  44. Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.1109/ICASSP.2013.6639154.

  45. Hautamäki, V., Cheng, Y., Rajan, P., & Lee, C. (2013). Minimax i-vector extractor for short duration speaker verification. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 3708–3712. http://www.isca-speech.org/archive/interspeech_2013/i13_3708.html.

  46. Huang, Z., Cheng, Y., Li, K., Hautamäki, V., & Lee, C. (2013). A blind segmentation approach to acoustic event detection based on i-vector. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 2282–2286. http://www.isca-speech.org/archive/interspeech_2013/i13_2282.html.

  47. Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.1109/ICASSP.2006.1659988.

  48. Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221.

  49. Jiang, Y., Lee, K., Tang, Z., Ma, B., Larcher, A., & Li, H. (2012), PLDA modeling in i-vector and supervector space for speaker verification. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 1680–1683. http://www.isca-speech.org/archive/interspeech_2012/i12_1680.html.

  50. Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., & Mason, M. (2011). i-Vector based speaker recognition on short utterances. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 2341–2344. http://www.isca-speech.org/archive/interspeech_2011/i11_2341.html.

  51. Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.1109/ICASSP.2012.6288988.

  52. Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33.

  53. Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://www.isca-speech.org/archive/interspeech_2013/i13_2465.html.

  54. Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.

    Article  Google Scholar 

  55. Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.

    Article  Google Scholar 

  56. Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.1109/ASRU.2011.6163922.

  57. Karanasou, P., Wang, Y., Gales, M. J. F., & Woodland, P. C. (2014). Adaptation of deep neural network acoustic models using factorised i-Vectors. INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. Retrieved September 14–18, 2014. pp. 2180–2184. http://www.isca-speech.org/archive/interspeech_2014/i14_2180.html.

  58. Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13.

  59. Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.1109/TASL.2006.881693.

    Article  Google Scholar 

  60. Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.1109/TASL.2007.894527.

    Article  Google Scholar 

  61. Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.1109/TASL.2008.925147.

    Article  Google Scholar 

  62. Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.1109/ICASSP.2013.6639151.

  63. Kockmann, M., Burget, L., & Cernocký, J. (2010). Brno University of Technology System for Interspeech 2010 Paralinguistic Challenge. In: INTERSPEECH 2010, 11th annual conference of the international speech communication association, Makuhari, Chiba. September 26–30, 2010, pp 2822–2825. http://www.isca-speech.org/archive/interspeech_2010/i10_2822.html

  64. Kockmann, M., Ferrer, L., Burget, L., & Cernocký, J. (2011). iVector fusion of prosodic and cepstral features for speaker verification. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence,. August 27–31, 2011, pp 265–268. http://www.isca-speech.org/archive/interspeech_2011/i11_0265.html

  65. Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5.

  66. Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.1109/ICASSP.2012.6288986

  67. Larcher, A., Bonastre, J., Fauve, B. G. B., Lee, K., Lévy, C., Li, H., et al. (2013). ALIZE 3.0: Open source toolkit for state-of-the-art speaker recognition. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp 2768–2772, http://www.isca-speech.org/archive/interspeech_2013/i13_2768.html

  68. Le,V. B., Mella, O., & Fohr, D. (2007). Speaker diarization using normalized cross likelihood ratio. In: INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. August 27–31, 2007, pp. 1869–1872, http://www.isca-speech.org/archive/interspeech_2007/i07_1869.html

  69. Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.1109/ICASSP.2012.6288858

  70. Lei, Y., Burget, L., & Scheffer, N. (2012b). Bilinear factor analysis for ivector based speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp 1588–1591, http://www.isca-speech.org/archive/interspeech_2012/i12_1588.html

  71. Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.1109/ICASSP.2013.6638976

  72. Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.1109/ICASSP.2014.6854360.

  73. Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.1109/ICASSP.2014.6853887

  74. Li, M., & Liu, W. (2014). Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp 1120–1124, http://www.isca-speech.org/archive/interspeech_2014/i14_1120.html

  75. Li, M., Zhang, X., Yan, Y., & Narayanan, S. S. (2011). Speaker verification using sparse representations on total variability i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp 2729–2732, http://www.isca-speech.org/archive/interspeech_2011/i11_2729.html

  76. Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.1109/ICASSP.2012.6288857

  77. Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2011). Evaluation of i-vector speaker recognition systems for forensic application. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 21–24, http://www.isca-speech.org/archive/interspeech_2011/i11_0021.html

  78. Mariooryad, S., & Busso, C. (2014). Compensating for speaker or lexical variabilities in speech for emotion recognition. Speech Communication, 57(0):1–12. doi:10.1016/j.specom.2013.07.011, http://www.sciencedirect.com/science/article/pii/S0167639313001015

  79. Martinez, D., Burget, L., Ferrer, L., & Scheffer, N. (2012). iVector-based prosodic system for language identification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4861–4864. doi:10.1109/ICASSP.2012.6289008

  80. Martinez, D., Lleida, E., Ortega, A., & Miguel, A. (2013). Prosodic features and formant modeling for an ivector-based language recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6847–6851. doi:10.1109/ICASSP.2013.6638988

  81. Martinez, D., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., & Lleida, E. (2014). Unscented transform for ivector-based noisy speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4042–4046. doi:10.1109/ICASSP.2014.6854361

  82. Matejka, P., Glembek, O., Castaldo, F., Alam, M., Plchot, O., Kenny, P., et al. (2011). Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4828–4831. doi:10.1109/ICASSP.2011.5947436

  83. McLaren, M., & van Leeuwen, D. (2011a). Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5456–5459, DOI 10.1109/ICASSP.2011.5947593.

  84. McLaren, M., & van Leeuwen, D. (2012a). Gender-independent speaker recognition using source normalisation. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4373–4376. doi:10.1109/ICASSP.2012.6288888

  85. McLaren, M., & van Leeuwen, D. (2012b). Source-normalized LDA for Robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 755–766. doi:10.1109/TASL.2011.2164533.

    Article  Google Scholar 

  86. McLaren, M., & van Leeuwen, D. A. (2011b). To weight or not to weight: source-normalised LDA for speaker recognition using i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2709–2712, http://www.isca-speech.org/archive/interspeech_2011/i11_2709.html

  87. Meignier, S., & Merlin, T. (2010). LIUM SpkDiarization: An open source toolkit for diarization. In: CMU SPUD workshop (Vol. 2010)

  88. Novoselov, S., Pekhovsky, T., Simonchik, K., & Shulipa, A. (2014). RBM-PLDA subsystem for the NIST i-vector challenge. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp 378–382, http://www.isca-speech.org/archive/interspeech_2014/i14_0378.html

  89. Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247. doi:10.1109/5.237532.

    Article  Google Scholar 

  90. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, iEEE Catalog No.: CFP11SRW-USB.

  91. Reynolds, D. A. (1992). A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. dissertation, Georgia Institute of Technology.

  92. Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital Signal Processing, 10(13), 19–41. doi:10.1006/dspr.1999.0361.

    Article  Google Scholar 

  93. Rouvier, M., & Favre, B. (2014). Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers? In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp. 3007–3011, http://www.isca-speech.org/archive/interspeech_2014/i14_3007.html

  94. Rouvier, M., Dupuy, G., Gay, P., el Khoury, E., Merlin, T., & Meignier, S. (2013). An open-source state-of-the-art toolbox for broadcast news diarization. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp. 1477–1481, http://www.isca-speech.org/archive/interspeech_2013/i13_1477.html

  95. Sadjadi, S. O., Slaney, M., & Heck, L. (2013). Msr identity toolbox v1.0: A matlab toolbox for speaker-recognition research. Speech and Language Processing Technical Committee Newsletter http://research.microsoft.com/apps/pubs/default.aspx?id=205119

  96. Sarkar, A. K., Matrouf, D., Bousquet, P., & Bonastre, J. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2662–2665, http://www.isca-speech.org/archive/interspeech_2012/i12_2662.html

  97. Sarkar, S., & Rao, K. S. (2014). A novel boosting algorithm for improved i-vector based speaker verification in noisy environments. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 671–675, http://www.isca-speech.org/archive/interspeech_2014/i14_0671.html

  98. Segbroeck, M. V., Travadi, R., & Narayanan, S. S. (2014a) UBM fused total variability modeling for language identification. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3027–3031, http://www.isca-speech.org/archive/interspeech_2014/i14_3027.html

  99. Segbroeck, M. V., Travadi, R., Vaz, C., Kim, J., Black, M. P., Potamianos, A., et al. (2014b). Classification of cognitive load from speech using an i-vector framework. in: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 751–755, http://www.isca-speech.org/archive/interspeech_2014/i14_0751.html

  100. Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 225–229. doi:10.1109/ICASSP.2014.6853591

  101. Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Odyssey 2010: the speaker and language recognition workshop, Brno, June 28–July 1, 2010, p. 6

  102. Senoussaoui, M., Kenny, P., Brümmer, N., de Villiers, E., & Dumouchel, P. (2011). Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 25–28, http://www.isca-speech.org/archive/interspeech_2011/i11_0025.html

  103. Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D. A., & Glass, J. R. (2011). Exploiting intra-conversation variability for speaker diarization. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 945–948, http://www.isca-speech.org/archive/interspeech_2011/i11_0945.html

  104. Silovsky, J., & Prazak, J. (2012). Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4193–4196. doi:10.1109/ICASSP.2012.6288843

  105. Simonchik, K., Pekhovsky, T., Shulipa, A., & Afanasyev, A. (2012). Supervized mixture of PLDA models for cross-channel speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 1684–1687, http://www.isca-speech.org/archive/interspeech_2012/i12_1684.html

  106. Sizov, A., el Khoury, E., Kinnunen, T., Wu, Z., & Marcel, S. (2015). Joint speaker verification and antispoofing in the i-vector space. IEEE Transactions on Information Forensics and Security, 10(4), 821–832. doi:10.1109/TIFS.2015.2407362.

  107. Soufifar, M., Kockmann, M., Burget, L., Plchot, O., Glembek, O., & Svendsen, T. (2011). iVector approach to phonotactic language recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2913–2916, http://www.isca-speech.org/archive/interspeech_2011/i11_2913.html

  108. Travadi, R., Segbroeck, M. V., & Narayanan, S. S. (2014). Modified-prior i-vector estimation for language identification of short duration utterances. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3037–3041, http://www.isca-speech.org/archive/interspeech_2014/i14_3037.html

  109. Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251. doi:10.1016/0167-6393(93)90095-3, http://www.sciencedirect.com/science/article/pii/0167639393900953

  110. Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363

  111. Villalba, J., & Lleida, E. (2013). Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6763–6767. doi:10.1109/ICASSP.2013.6638971

  112. Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. The Journal of the Acoustical Society of America, 51(6B), 2044–2056. doi:10.1121/1.1913065, http://scitation.aip.org/content/asa/journal/jasa/51/6B/10.1121/1.1913065

  113. Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In: IEEE Odyssey 2006: the speaker and language recognition workshop, pp. 1–5. doi:10.1109/ODYSSEY.2006.248084

  114. Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2230–2233, http://www.isca-speech.org/archive/interspeech_2012/i12_2230.html

  115. Yin, S. C., Kenny, P., & Rose, R. (2006). Experiments in speaker adaptation for factor analysis based speaker verification. In: IEEE Odyssey 2006: the speaker and language recognition workshop, pp. 1–6. doi:10.1109/ODYSSEY.2006.248130

  116. Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., et al. (2006). The HTK book (for HTK version 3.4).

  117. Yu, C., Liu, G., Hahm, S., & Hansen, J. (2014). Uncertainty propagation in front end factor analysis for noise robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4017–4021. doi:10.1109/ICASSP.2014.6854356

  118. Zheng, R., Zhang, C., Zhang, S., & Xu, B. (2014). Variational Bayes based i-vector for speaker diarization of telephone conversations. In: IEEE international conference on acoustics, speech and signal processing, ICASSP, Florence. May 4–9, 2014, pp. 91–95. doi:10.1109/ICASSP.2014.6853564

  119. Zhuang, X., Tsakalidis, S., Wu, S., Natarajan, P., Prasad, R., & Natarajan, P. (2012). Compact audio representation for event detection in consumer media. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2089–2092, http://www.isca-speech.org/archive/interspeech_2012/i12_2089.html

Acknowledgments

The authors wish to acknowledge UNICEF India and the DST, Government of India, for the funding provided under their FIST scheme, which greatly aided in the work reported herein.

Author information

Corresponding author

Correspondence to Pulkit Verma.

About this article

Cite this article

Verma, P., Das, P.K. i-Vectors in speech processing applications: a survey. Int J Speech Technol 18, 529–546 (2015). https://doi.org/10.1007/s10772-015-9295-3

Keywords

  • Speech processing
  • Feature extraction
  • JFA
  • Factor analysis
  • i-Vectors
  • PLDA