A Gender-Aware Deep Neural Network Structure for Speech Recognition

  • Toktam Zoughi
  • Mohammad Mehdi Homayounpour
Research Paper


Deep neural networks (DNNs) have recently attracted a great deal of interest among speech recognition researchers. DNN training is computationally expensive due to the model's large number of parameters, and performance can be improved by pre-training; however, DNN learning suffers when pre-training is inefficient. This paper proposes a new pre-training method that exploits both gender and phoneme information for speech recognition: speaker gender information is used alongside phoneme information to construct acoustic models more precisely. The new approach, named gender-aware deep Boltzmann machine (GADBM), is used for DNN pre-training. GADBM exploits this additional information, which improves recognition accuracy; to this end, the overall structure of the deep Boltzmann machine (DBM) is changed to accommodate the additional information. Experimental results on the TIMIT dataset show that the proposed method outperforms the deep belief network and the DBM on the phone recognition task. In addition, parameter tuning further improves the model's performance.
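The abstract describes injecting gender labels into generative pre-training. As a minimal, hedged illustration of that idea (not the paper's actual GADBM architecture, which modifies the DBM structure itself), the sketch below appends a one-hot gender code to the visible layer of a binary restricted Boltzmann machine and trains it with one step of contrastive divergence (CD-1); all layer sizes and the synthetic data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v, W, b, c, lr=0.05):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # Positive phase: hidden activations given the (gender-augmented) data.
    h_prob = sigmoid(v @ W + c)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step back to visibles and up again.
    v_recon = sigmoid(h_samp @ W.T + b)
    h_recon = sigmoid(v_recon @ W + c)
    # Parameter updates from the CD-1 gradient estimate.
    W += lr * (v.T @ h_prob - v_recon.T @ h_recon) / len(v)
    b += lr * (v - v_recon).mean(axis=0)
    c += lr * (h_prob - h_recon).mean(axis=0)
    return ((v - v_recon) ** 2).mean()  # reconstruction error, for monitoring

# Toy data: 32-dim binarized "acoustic" frames plus a 2-bit one-hot gender code.
n, d_feat, d_hid = 200, 32, 16
feats = (rng.random((n, d_feat)) < 0.3).astype(float)
gender = np.eye(2)[rng.integers(0, 2, n)]   # one-hot male/female label
v = np.hstack([feats, gender])              # gender-augmented visible layer

W = 0.01 * rng.standard_normal((d_feat + 2, d_hid))
b = np.zeros(d_feat + 2)
c = np.zeros(d_hid)

errs = [cd1_step(v, W, b, c) for _ in range(50)]
print(f"reconstruction error: {errs[0]:.3f} -> {errs[-1]:.3f}")
```

In a layer-wise pre-training pipeline, the hidden units learned this way would initialize the first DNN layer, so the learned features are conditioned on gender from the start; the paper's GADBM goes further by restructuring the DBM itself.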


Keywords: Speech recognition · Deep neural networks · Deep Boltzmann machine · Pre-training


Copyright information

© Shiraz University 2019

Authors and Affiliations

  1. Laboratory for Intelligent Multimedia Processing (LIMP), Computer Engineering and IT Department, Amirkabir University of Technology, Tehran, Iran