Skip to main content

The long short-term memory based on i-vector extraction for conversational speech gender identification approach


Stress causes a speaker’s voice characteristics to be changed. Emotional stress alters a person’s speech pattern such that it is distributed non-normally along the temporal dimension. Thus, the methods for identifying the gender of a non-stressed speaker were no longer effective in recognizing the gender of a speaker in stressful conditions. To address this issue, a new gender identification framework is proposed. We leveraged i-vector for capturing gender information on each speech segment. Then the long short-term memory dynamically handled all speech temporal context features and learned the long-term dependency from the input. We evaluated the effectiveness, in terms of accuracy and the number of iterations to saturate, of the proposed method by comparing it with the baseline methods in their respective abilities to identify the speaker’s gender from conversations with different durations. By learning the gender information encoded in long-term dependencies, our proposed method outperforms the baseline methods and is able to correctly identify the speaker’s gender in all conversation types.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8


  1. Kanervisto A, Vestman V, Sahidullah Md, Hautamaki V, Kinnunen T (2017) Effects of gender information in text-independent and text-dependent speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA

  2. Jayasankar T, Vinothkumar K, Vijayaselvi A (2017) Automatic gender identification in speech recognition by genetic algorithm. Appl Math Inf Sci 11(3):907–913

    Article  Google Scholar 

  3. Jayasankar T, Vinothkumar K, Vijayaselvi A (2013) Gender-dependent emotion recognition based on HMMs and SPHMMs. Int J Speech Technol 16(2):133–141

    Article  Google Scholar 

  4. Shaqra FA, Duwairi R, Al-Ayyoub M (2019) Recognizing emotion from speech based on age and gender using hierarchical models. Proced Comput Sci 151:37–44

    Article  Google Scholar 

  5. Bisio I, Delfino A, Lavagetto F, Marchese M, Sciarrone A (2013) Gender-driven emotion recognition through speech signals for ambient intelligence applications. IEEE Trans Emerg Top Comput 1(2):224–257

    Google Scholar 

  6. Zhang L, Wang L, Dang J, Guo L, Yu Q (2018) Gender-aware CNN-BLSTM for speech emotion recognition. In: International conference on artificial neural networks (ICANN), Rhodes, Greece

  7. Lester-Smith RA, Brad HS (2016) The effects of physiological adjustments on the perceptual and acoustical characteristics of vibrato as a model of vocal tremor. J Acoust Soc Am 140(5):3827–3833

    Article  Google Scholar 

  8. Archana GS, Malleswari M (2015) Gender identification and performance analysis of speech signals. In: Global conference on communication technologies (GCCT), Thuckalay, India

  9. Ramdinmawii E, Mittal VK (2016) Gender identification from speech signal by examining the speech production characteristics. In: International conference on signal processing and communication (ICSC), Noida, India

  10. Gupta M, Bharti SS, Agarwal S (2016) Support vector machine based gender identification using voiced speech frames. In: International conference on parallel, distributed and grid computing (PDGC), Waknaghat, India

  11. Levitan SI, Mishra T, Bangalore S (2016) Automatic identification of gender from speech. In: Proceeding of speech prosody

  12. Hansen JHL, Patil S (2007) Speech under stress: analysis, modeling and recognition. In: Müller C (eds) Speaker classification I. Lecture notes in computer science, vol 4343. Springer, Berlin

  13. Godin KW, Hansen JHL (2015) Physical task stress and speaker variability in voice quality. EURASIP J Audio Speech Music Process 29:2015

    Google Scholar 

  14. Marten J (2005) Culture, gender and the recognition of the basic emotions. Psychologia 2005(48):306–316

    Article  Google Scholar 

  15. Grzybowska J, Ziolko M (2015) I-vectors in gender recognition from telephone speech. In: Proceedings of the twenty-first national conference on applications of mathematics in biology and medicine, pp 57–62

  16. Wang M, Chen Y, Tang Z, Zhang E (2015) I-vector based speaker gender recognition. In: IEEE advanced information technology, electronic and automation control conference (IAEAC), Chongqing, China

  17. Xia R, Liu Y (2012) Using i-vector space model for emotion recognition. INTERSPEECH, Portland

    Google Scholar 

  18. Gomes J, EL-Sharkawy M (2015) i-vector algorithm with Gaussian mixture model for efficient speech emotion recognition. In: International conference on computational science and computational intelligence (CSCI), Las Vegas, NV, USA

  19. Xia R, Liu Y (2016) DBN-ivector framework for acoustic emotion recognition. INTERSPEECH, San Francisco

    Book  Google Scholar 

  20. Verma P, Das PK (2015) i-Vectors in speech processing applications: a survey. Int J Speech Technol 18(4):529–546

    Article  Google Scholar 

  21. Coutinho E, Schuller B (2017) Shared acoustic codes underlie emotional communication in music and speech-evidence from deep transfer learning. PLoS One 12:6

    Google Scholar 

  22. Hansen JHL (1999) Composer, SUSAS LDC99S78. Web download. [Sound Recording]. Linguistic Data Consortium, Philadelphia

    Google Scholar 

  23. Hansen JHL (1999) Composer, SUSAS transcripts LDC99T33. [Sound Recording]. Linguistic Data Consortium, Philadelphia

    Google Scholar 

  24. Son HH (2017) Toward a proposed framework for mood recognition using LSTM recurrent neuron network. Proced Comput Sci 109:1028–1034

    Article  Google Scholar 

  25. Son G, Kwon S, Park N (2019) Gender classification based on the non-lexical cues of emergency calls with recurrent neural networks (RNN). Symmetry 11(4):15

    Article  Google Scholar 

  26. Nammous MK, Saeed K (2019) Natural language processing: speaker, language, and gender identification with LSTM, advanced computing and systems for security, advances in intelligent systems and computing. Springer, Singapore, p 883

    Google Scholar 

  27. Serizel R, Bisot V, Essid S, Richard G (2017) Acoustic features for environmental sound analysis. Computational analysis of sound scenes and events. Springer, Berlin, pp 71–101

    Google Scholar 

  28. Rabiner LR, Schafer RW (2010) Theory and applications of digital speech processing. Pearson, Upper Saddle River

    Google Scholar 

  29. Zewoudie AW, Luque J, Hernandi J (2018) The use of long-term features for GMM-and i-vector-based speaker diarization systems. EURASIP J Audio Speech Music Process 14:2018

    Google Scholar 

  30. Zazo R, Nidadavolu PS, Chen N, Gonzalez-Rodriguez J, Dehak N (2018) Age estimation in short speech utterances based on LSTM recurrent neural networks. IEEE Access 6:22524–22530

    Article  Google Scholar 

  31. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

    Article  Google Scholar 

  32. Alsulami B, Dauber E, Harang RE, Mancoridis S, Greenstadt R (2017) Source code authorship attribution using long short-term memory based networks. In: 22nd European symposium on research in computer security (ESORICS), Oslo, Norway

  33. Dwarampudi M, Reddy NVS (2019) Effects of padding on LSTMs and CNNs. arXiv:1903.07288v1 [cs.LG]

  34. Mesnil G, He X, Deng L, Bengio Y (2013) Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. INTERSPEECH, Lyon

    Google Scholar 

  35. Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: Proceedings of the 30th international conference on machine learning, vol 28, no 3, pp 1310–1318

  36. Sak H, Senior A, Beaufays F (2014) Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv:1402.1128v1 [cs.NE]

  37. Yokoyama N, Azuma D, Tsukiyama S (2016) An efficient Gaussian mixture reduction to two components. In: The 20th workshop on synthesis and system integration of mixed information technologies, Kyoto, Japan, pp 236–241

Download references


I would like to thank the Tamura Laboratory, which supported us in this work. Thank you to the LDC for allowing us access to the SUSAS database. Thank you to Dr. Lindsey R. Tate for editing the grammatical error in this manuscript.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Barlian Henryranu Prasetio.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Verify currency and authenticity via CrossMark

Cite this article

Prasetio, B.H., Tamura, H. & Tanno, K. The long short-term memory based on i-vector extraction for conversational speech gender identification approach. Artif Life Robotics 25, 233–240 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI:


  • Gender recognition
  • MFCC
  • LTSM
  • i-Vector
  • Conversational speech