Depression Detection by Person’s Voice

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2021)

Abstract

In this work, a machine learning algorithm is proposed to detect depression from a person's voice. A Transformer encoder network is considered and compared with top baseline approaches. Low-level features are extracted from audio recordings and then augmented to overcome the small size of the available dataset. The Transformer network achieves a recognition accuracy of 73.51% on the DAIC-WOZ database, which compares favourably with the accuracies of 65.85% and 66.35% obtained by traditional approaches.
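
The pipeline summarised in the abstract (low-level features, augmentation, Transformer encoder) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which augmentation or which encoder configuration is used, so the Gaussian-noise augmentation, the identity query/key/value projections, the mean pooling and all names below (`add_noise`, `self_attention`, `mean_pool`, the toy `frames`) are assumptions made for illustration only. The attention step is the standard scaled dot-product self-attention found in Transformer encoders.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_noise(frames, sigma, rng):
    """Hypothetical augmentation: add Gaussian noise to every feature value."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


def self_attention(frames):
    """Scaled dot-product self-attention with Q = K = V = frames.

    A real encoder layer would apply learned projections, several heads,
    residual connections and a feed-forward block; all are omitted here.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Each output frame is a weighted average of all input frames.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        )
    return attended


def mean_pool(frames):
    """Collapse a sequence of frames into one fixed-size vector."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


# Toy input: 4 time frames with 3 low-level features each (made-up values).
frames = [
    [0.1, 0.4, -0.2],
    [0.3, 0.1, 0.0],
    [-0.5, 0.2, 0.6],
    [0.0, -0.1, 0.2],
]

rng = random.Random(0)
augmented = add_noise(frames, sigma=0.01, rng=rng)  # extra training example
attended = self_attention(frames)
pooled = mean_pool(attended)  # would feed a depressed / non-depressed classifier
```

Because every attended frame is a convex combination of the input frames, the output keeps the input's shape and value range; a classifier head on `pooled` would then produce the binary decision.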

Notes

  1. https://github.com/IliaZenkov/transformer-cnn-emotion-recognition

Acknowledgement

The work of Ilya Makarov was supported by the Russian Science Foundation under grant 22-11-00323 and performed at HSE University, Moscow, Russia.

Corresponding author

Correspondence to Ilya Makarov.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zavorina, E., Makarov, I. (2022). Depression Detection by Person’s Voice. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_21

  • DOI: https://doi.org/10.1007/978-3-031-16500-9_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16499-6

  • Online ISBN: 978-3-031-16500-9

  • eBook Packages: Computer Science, Computer Science (R0)
