Intelligent System for Identifying Emotions in Audio Recordings Using Mel Spectrograms

  • ARTIFICIAL INTELLIGENCE
  • Published in: Journal of Computer and Systems Sciences International

Abstract

A neural network architecture is proposed for identifying human emotions in audio recordings. The emotions considered are fear, joy, sadness, anger, calm, and a neutral state. Data from an open library of recordings are used for training. The psychophysical properties of an audio recording are preserved by converting the audio file into a spectrogram image on the mel scale (a mel spectrogram). The mel scale is an empirically established logarithmic dependence of the pitch perceived by the human ear on the frequency of the sound. Standard methods for classifying images are then applied, including convolutional layers (sliding-window multiplication of the matrix of pixel values by kernel matrices, possibly with a reduction of the image dimensions).
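The pipeline outlined above (audio file → log-scaled mel spectrogram image → convolutional classifier) can be sketched in a few lines of Python. The sketch below is illustrative only: the libraries (librosa and PyTorch), the sample rate, the number of mel bands, the clip length, and the network layout are assumptions of the sketch, not the configuration reported by the authors.

# Minimal sketch: audio -> log-mel spectrogram -> small CNN over the spectrogram image.
import librosa
import numpy as np
import torch
import torch.nn as nn

# Six emotion classes named in the abstract.
EMOTIONS = ["fear", "joy", "sadness", "anger", "calm", "neutral"]

def audio_to_mel(path: str, sr: int = 22050, n_mels: int = 128, duration: float = 3.0) -> torch.Tensor:
    """Load an audio file and return a fixed-size log-mel spectrogram tensor (1, n_mels, time)."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    # Pad short clips so every spectrogram has the same width.
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # logarithmic (dB) scale
    return torch.from_numpy(mel_db).unsqueeze(0).float()

class EmotionCNN(nn.Module):
    """Small CNN: convolution and pooling over the spectrogram image, then a linear classifier."""
    def __init__(self, n_classes: int = len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size features regardless of clip length
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Usage with a hypothetical file name:
# logits = EmotionCNN()(audio_to_mel("sample.wav").unsqueeze(0))
# print(EMOTIONS[logits.argmax(dim=1).item()])

In practice such a network would be trained on batches of spectrograms with a cross-entropy loss over the six emotion classes.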



Author information


Corresponding author

Correspondence to V. V. Makarov.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by A. Mazurov


About this article

Cite this article

Derevyagin, L.A., Makarov, V.V., Tsurkov, V.I. et al. Intelligent System for Identifying Emotions on Audio Recordings Using Chalk Spectrograms. J. Comput. Syst. Sci. Int. 61, 407–412 (2022). https://doi.org/10.1134/S1064230722030042
