Neural Computing and Applications

Volume 29, Issue 6, pp 13–19

Speaker recognition with hybrid features from a deep belief network

  • Hazrat Ali
  • Son N. Tran
  • Emmanouil Benetos
  • Artur S. d’Avila Garcez
Original Article


Learning representations from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach that combines learned features and MFCC features for the speaker recognition task and can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts. These vectors represent audio scripts of different lengths, which makes it easier to train a classifier. Our experiments show that the audio word count vectors generated from a mixture of DBN features at different layers outperform the MFCC features. A further improvement is achieved by combining the audio word count vectors with the MFCC features.
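The fixed-length representation described above can be illustrated with a minimal bag-of-audio-words sketch. This is an assumption-laden stand-in, not the authors' exact pipeline: random vectors play the role of per-frame DBN or MFCC features, a k-means codebook quantizes each frame to its nearest "audio word", and every recording, whatever its length, is summarized by a histogram of codeword counts.

```python
import numpy as np
from sklearn.cluster import KMeans

def audio_word_counts(frames, codebook):
    """Quantize frame-level features and count codeword occurrences.

    frames: (n_frames, n_dims) array of per-frame features (e.g. DBN
    activations or MFCCs); n_frames varies from recording to recording.
    Returns a fixed-length count vector of size codebook.n_clusters.
    """
    words = codebook.predict(frames)  # nearest codeword per frame
    return np.bincount(words, minlength=codebook.n_clusters)

rng = np.random.default_rng(0)
# Stand-in frame features pooled from training audio (real use: DBN features).
train_frames = rng.normal(size=(500, 13))

# Learn a 32-word codebook over all training frames.
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(train_frames)

# Two "recordings" of different lengths map to equal-length count vectors.
short_clip = rng.normal(size=(40, 13))
long_clip = rng.normal(size=(300, 13))
v1 = audio_word_counts(short_clip, codebook)
v2 = audio_word_counts(long_clip, codebook)
```

Because `v1` and `v2` have the same dimensionality regardless of clip length, they can be fed directly to a standard classifier such as an SVM, and the MFCC-based statistics can simply be concatenated alongside them for the combined representation.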


Keywords: Deep belief networks, Deep learning, Mel-frequency cepstral coefficients



The authors would like to thank Nasir Ahmad, University of Engineering and Technology, Peshawar, Pakistan, and Tillman Weyde, City University London, for their useful feedback during this work.

Hazrat Ali is grateful for funding from the Erasmus Mundus Strong Ties Grant. Emmanouil Benetos was supported by the UK AHRC-funded project 'Digital Music Lab - Analysing Big Music Data' (Grant No. AH/L01016X/1) and is supported by a UK RAEng Research Fellowship (Grant No. RF/128). Hazrat and Son contributed equally to the paper.



Copyright information

© The Natural Computing Applications Forum 2016

Authors and Affiliations

  • Hazrat Ali (1)
  • Son N. Tran (2)
  • Emmanouil Benetos (2, 3)
  • Artur S. d’Avila Garcez (2)

  1. Department of Electrical Engineering, COMSATS Institute of Information Technology, Abbottabad, Pakistan
  2. Department of Computer Science, City University London, London, UK
  3. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
