Abstract
Mel frequency cepstral coefficients (MFCCs) have been the most predominantly used spectral features in many a speech-based application. It was primarily introduced to address speech recognition and was later adopted for various other applications such as speaker recognition and emotion recognition. Several findings, in recent times, suggest that Mel-scale filterbank, which is primarily inspired by human perception phenomenon, may not be the most optimum one for speaker recognition. Working in the same direction, this study attempts optimization of filterbank design for text-dependent speaker verification. Motivated by the success of evolutionary computations in the related fields, an evolutionary algorithm is used to carry out this optimization process. This brings into effect data-driven learning of the design parameters and is hypothesized to yield filterbanks which would suit the specific task of speaker-phrase discrimination. The filterbanks have been optimized for the task of text-dependent speaker verification in general, and also for specific cases of speakers and phrases. The proposed filterbank results in relative equal error rate reduction of up to 39.41% with respect to the baseline MFCCs.











References
Hansen, J.H.; Hasan, T.: Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process. Mag. 32(6), 74–99 (2015)
Tirumala, S.S.; Shahamiri, S.R.; Garhwal, A.S.; Wang, R.: Speaker identification features extraction methods: a systematic review. Expert Syst. Appl. 90, 250–271 (2017)
Mahmood, A.; Alsulaiman, M.; Muhammad, G.: Automatic speaker recognition using multi-directional local features (mdlf). Arab. J. Sci. Eng. 39(5), 3799–3811 (2014)
Aronowitz, H.: Text dependent speaker verification using a small development set. In: Odyssey 2012-The Speaker and Language Recognition Workshop (2012)
Al-Kaltakchi, M.T.; Woo, W.L.; Dlay, S.; Chambers, J.A.: Evaluation of a speaker identification system with and without fusion using three databases in the presence of noise and handset effects. EURASIP J. Adv. Signal Process. 2017, 80 (2017)
Al-Kaltakchi, M.T.; Woo W.L.; Dlay, S.S.; Chambers, J.A.: Comparison of I-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments. In: 2017 25th European Signal Processing Conference (EUSIPCO), pp. 533–537. IEEE (2017)
Larcher, A.; Lee, K.A.; Ma, B.; Li, H.: Text-dependent speaker verification: classifiers, databases and RSR2015. Speech Commun. 60, 56–77 (2014)
Al-Kaltakchi, M.T.; Woo, W.L.; Dlay, S.S.; Chambers, J.A.: Multi-dimensional I-vector closed set speaker identification based on an extreme learning machine with and without fusion technologies. In 2017 Intelligent Systems Conference (IntelliSys), pp. 1141–1146. IEEE (2017)
Hanilçi, C.; Çeliktaş, H.: Turkish text-dependent speaker verification using i-vector/PLDA approach. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. IEEE (2018).
Kenny, P.: Joint factor analysis of speaker and session variability: theory and algorithms. CRIM, Montreal (Report) CRIM-06/08-13, 14, 28–29 (2005)
Kenny, P.; Stafylakis, T.; Ouellet, P.; Alam, M.J.: JFA-based front ends for speaker recognition. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1705–1709. IEEE (2014).
Kanagasundaram, A.; Vogt, R.; Dean, D.B.; Sridharan, S.; Mason, M.W.: I-vector based speaker recognition on short utterances. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, International Speech Communication Association (ISCA), pp. 2341–2344 (2011).
Larcher, A.; Bonastre, J.F.; Mason, J.: Reinforced temporal structure information for embedded utterance-based speaker recognition. In: Interspeech (2008)
Ali, H.; Tran, S.N.; Benetos, E.; Garcez, A.S.D.A.: Speaker recognition with hybrid features from a deep belief network. Neural Comput. Appl. 29(6), 13–19 (2018)
Zeinali, H.; Sameti, H.; Burget, L.: Text-dependent speaker verification based on i-vectors, neural networks and hidden Markov models. Comput. Speech Lang. 46, 53–71 (2017)
Liu, Y.; Qian, Y.; Chen, N.; Fu, T.; Zhang, Y.; Yu, K.: Deep feature for text-dependent speaker verification. Speech Commun. 73, 1–13 (2015)
Variani, E.; Lei, X.; McDermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J.: Deep neural networks for small footprint text-dependent speaker verification. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052–4056. IEEE (2014)
Bhattacharya, G.; Alam, M.J.; Stafylakis, T. Kenny, P.: Deep neural network based text-dependent speaker recognition: preliminary results. In: Odyssey Speak. Lang. Recognit. Work, pp. 9–15 (2016)
Heigold, G.; Moreno, I.; Bengio, S.; Shazeer, N.: End-to-end text-dependent speaker verification. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115-5119. IEEE (2016).
Sadjadi, S.O.; Hansen, J.H.: Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification. Speech Commun. 72, 138–148 (2015)
Lippmann, R.P.: Speech recognition by machines and humans. Speech Commun. 22(1), 1–15 (1997)
Charbuillet, C.; Gas, B.; Chetouani, M.; Zarader, J.L.: Filter bank design for speaker diarization based on genetic algorithms. In: 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 1, I-I. IEEE (2006)
Pinheiro, H.N.; Neto, F.M.; Oliveira, A.L.; Ren, T.I.; Cavalcanti, G.D.; Adami, A.G.: Optimizing speaker-specific filter banks for speaker verification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5350–5354. IEEE (2017)
Miyajima, C.; Watanabe, H.; Tokuda, K.; Kitamura, T.; Katagiri, S.: A new approach to designing a feature extractor in speaker identification based on discriminative feature extraction. Speech Commun. 35(3–4), 203–218 (2001)
Charbuillet, C.; Gas, B.; Chetouani, M.; Zarader, J.L.: Optimizing feature complementarity by evolution strategy: application to automatic speaker verification. Speech Commun. 51(9), 724–731 (2009)
Vignolo, L.D.; Prasanna, S.M.; Dandapat, S.; Rufiner, H.L.; Milone, D.H.: Feature optimization for stress recognition in speech. Pattern Recognit. Lett. 84, 1–7 (2016)
Chittaragi, N.B.; Prakash, A.; Koolagudi, S.G.: Dialect identification using spectral and prosodic features on single and ensemble classifiers. Arab. J. Sci. Eng. 43(8), 4289–4302 (2018)
Dey, S.; Motlicek, P.; Madikeri, S.; Ferras, M.: Template-matching for text-dependent speaker verification. Speech Commun. 88, 96–105 (2017)
Vignolo, L.D.; Rufiner, H.L.; Milone, D.H.; Goddard, J.C.: Evolutionary cepstral coefficients. Appl. Soft Comput. 11(4), 3419–3428 (2011)
Vignolo, L.D.; Rufiner, H.L.; Milone, D.H.; Goddard, J.C.: Evolutionary splines for cepstral filterbank optimization in phoneme classification. EURASIP J. Adv. Signal Process. 2011, 8 (2011)
Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)
Deb, K.: An introduction to genetic algorithms. Sadhana 24(4–5), 293–315 (1999)
Al-Salami, N.M.: Evolutionary algorithm definition. Am. J. Eng. Appl. Sci. 2(4), 789–795 (2009)
Goldberg, D.E.: Genetic Algorithms. Pearson Education, New delhi (2006)
Young, S.J.; Young, S.: The HTK Hidden Markov Model Toolkit: Design and Philosophy, p. 28. University of Cambridge, Department of Engineering, Cambridge (1993)
Gallardo, L.F.: Human and Automatic Speaker Recognition Over Telecommunication Channels. Springer, Berlin (2015)
Lei, H.; Lopez, E.: Mel, linear, and antimel frequency cepstral coefficients in broad phonetic regions for telephone speaker recognition. In: Tenth Annual Conference of the International Speech Communication Association (2009)
Zeinali, H.; Sameti, H.; Burget, L.: HMM-based phrase-independent i-vector extractor for text-dependent speaker verification. IEEE ACM Trans. Audio Speech Lang. Process. 25(7), 1421–1435 (2017)
Chen, N.; Qian, Y.; Yu, K.: Multi-task learning for text-dependent speaker verification. In: Sixteenth annual conference of the international speech communication association (2015)
Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S.: X-vectors: robust DNN embeddings for speaker recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. IEEE (2018)
Laskar, M.A.; Laskar, R.H.: Integrating DNN–HMM technique with hierarchical multi-layer acoustic model for text-dependent speaker verification. Circuits Syst. Signal Process. 38, 1531–5878 (2019)
Acknowledgements
The authors would like to extend their gratitude to the members of the Speech and Image Processing Lab of the National Institute of Technology Silchar for their encouragement and support.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Laskar, M.A., Laskar, R.H. Filterbank Optimization for Text-Dependent Speaker Verification by Evolutionary Algorithm Using Spline-Defined Design Parameters. Arab J Sci Eng 44, 9703–9718 (2019). https://doi.org/10.1007/s13369-019-04090-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13369-019-04090-4