Speaker recognition with hybrid features from a deep belief network
Learning representations from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach that combines learned features with MFCC features for the speaker recognition task and can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing audio data into vectors of audio word counts. These fixed-length vectors represent audio scripts of different lengths, which makes it easier to train a classifier. Our experiments show that audio word count vectors generated from a mixture of DBN features at different layers outperform MFCC features. We achieve further improvement by combining the audio word count vectors with the MFCC features.
Keywords: Deep belief networks; Deep learning; Mel-frequency cepstral coefficients
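The core idea above — mapping a variable-length sequence of frame-level features to a fixed-length vector of audio word counts — can be illustrated with a small vector-quantization sketch. The paper does not specify the quantizer, so this example assumes a simple k-means codebook (Lloyd's algorithm) over synthetic 13-dimensional frames standing in for MFCC or DBN-layer features; all names and sizes here are illustrative, not the authors' implementation.

```python
import numpy as np

def build_codebook(frames, n_words=16, n_iter=20, seed=0):
    """Learn an 'audio word' codebook by clustering pooled frame-level
    features with a tiny k-means (assumed quantizer, not from the paper)."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), n_words, replace=False)]
    for _ in range(n_iter):
        # assign every frame to its nearest centroid
        d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for k in range(n_words):
            members = frames[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

def audio_word_counts(centroids, frames):
    """Quantize a variable-length frame sequence into a fixed-length
    normalized histogram of audio word counts."""
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    counts = np.bincount(d.argmin(axis=1), minlength=len(centroids))
    return counts / counts.sum()

# Synthetic stand-ins: pooled training frames and two utterances of
# different lengths (120 and 80 frames, 13 coefficients each).
rng = np.random.default_rng(1)
train_frames = rng.normal(size=(400, 13))
clip_a = rng.normal(size=(120, 13))
clip_b = rng.normal(size=(80, 13))

centroids = build_codebook(train_frames, n_words=16)
va = audio_word_counts(centroids, clip_a)
vb = audio_word_counts(centroids, clip_b)
```

However long the input utterance is, the resulting histogram has one entry per codebook word, so a standard fixed-input classifier (e.g. an SVM) can be trained directly on these vectors.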
The authors would like to thank Nasir Ahmad, University of Engineering and Technology Peshawar, Pakistan, and Tillman Weyde, City University London, for their useful feedback during this work.
Hazrat Ali is grateful for funding from the Erasmus Mundus Strong Ties Grant. Emmanouil Benetos was supported by the UK AHRC-funded project 'Digital Music Lab - Analysing Big Music Data', Grant No. AH/L01016X/1, and is supported by a UK RAEng Research Fellowship, Grant No. RF/128. Hazrat and Son contributed equally to the paper.