
Multimedia emotion prediction using movie script and spectrogram

Published in: Multimedia Tools and Applications

Abstract

This article proposes a multimedia emotion-prediction approach that uses movie scripts together with spectrograms derived from speech. First, a variety of information is extracted from the textual dialogue in scripts for emotion prediction. In addition, spectrograms transformed from the speech signal help to identify subtle representations of emotions that are difficult to predict from the script alone; accent in particular aids prediction, because it is an important means of expressing emotional state in speech. On the basis of the emotion keywords in the scripts and spectrograms, emotion words with a similar tendency are analyzed. Emotion-candidate keywords are extracted from the text data using morphological analysis, and representative emotion keywords are selected through Word2Vec_ARSP. The emotion keywords and the speech data from the last part of the dialogue are extracted and converted into images, and this multimedia information is fed to the input layer of a convolutional neural network (CNN). We propose a multi-modal method that extracts and predicts emotions more efficiently by mixing and learning integrated multimedia information: the characters' speech and background sounds, together with the dialogue that directly expresses the emotional situation of the context. To improve the accuracy of emotion prediction from multimedia information in movies, we build a CNN-based system for training, testing, and prediction with this multi-modal method. The proposed multi-modal system compensates through the spectrogram for emotions that cannot be predicted from certain parts of the text. Prediction accuracy improves by 20.9% compared to using text information alone and by 6.7% compared to using voice information alone.
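As an illustration of the text pipeline described above, the sketch below extracts emotion-candidate keywords from a line of dialogue via morphological analysis and ranks them against seed emotion words. NLTK, gensim, the seed list, and the cosine-similarity ranking are illustrative assumptions: the paper's Word2Vec_ARSP selection step is not detailed in the abstract, so plain Word2Vec similarity stands in for it here.

```python
# Illustrative sketch only: NLTK + gensim stand-ins for the paper's
# morphological analysis and Word2Vec_ARSP keyword selection.
import nltk
from gensim.models import Word2Vec

EMOTION_SEEDS = ["joy", "anger", "sadness", "fear", "surprise"]  # hypothetical seed set

def candidate_keywords(dialogue: str) -> list[str]:
    """Morphological analysis: keep content words (nouns, verbs, adjectives, adverbs)."""
    tokens = nltk.word_tokenize(dialogue.lower())   # requires the 'punkt' tokenizer data
    tagged = nltk.pos_tag(tokens)
    return [w for w, tag in tagged if tag.startswith(("NN", "VB", "JJ", "RB"))]

def representative_keywords(candidates: list[str], model: Word2Vec, top_k: int = 3) -> list[str]:
    """Rank candidates by cosine similarity to the seed emotion words."""
    scored = []
    for word in set(candidates):
        if word not in model.wv:
            continue  # skip out-of-vocabulary tokens
        score = max((model.wv.similarity(word, seed)
                     for seed in EMOTION_SEEDS if seed in model.wv),
                    default=0.0)
        scored.append((score, word))
    return [word for _, word in sorted(scored, reverse=True)[:top_k]]
```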
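The speech-to-image step can be sketched as follows. The abstract does not name the exact transform, so this example renders a log-scaled mel spectrogram with librosa and matplotlib; the file paths, figure size, and mel settings are assumptions for illustration.

```python
# Illustrative sketch: convert a speech clip into a spectrogram image
# suitable for a CNN input. Transform and sizes are assumed, not the
# paper's exact configuration.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def speech_to_spectrogram(wav_path: str, png_path: str) -> None:
    y, sr = librosa.load(wav_path, sr=None)           # keep native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)  # power mel spectrogram
    mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale for contrast
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    librosa.display.specshow(mel_db, sr=sr, ax=ax)
    ax.set_axis_off()                                 # save the image only, no axes
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```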
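Finally, a minimal two-branch network shows how text and spectrogram inputs can be fused in a single CNN, in the spirit of the multi-modal method above. The layer sizes, the 64x64 single-channel image shape, the 100-dimensional keyword-embedding input, and the six emotion classes are all illustrative assumptions; the paper's actual architecture is not given in the abstract.

```python
# Illustrative sketch of multi-modal fusion: one convolutional branch for
# spectrogram images, one dense branch for keyword embeddings, merged
# before the emotion classifier. All shapes/sizes are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

NUM_EMOTIONS = 6  # hypothetical number of emotion classes

spec_in = keras.Input(shape=(64, 64, 1), name="spectrogram")
x = layers.Conv2D(32, 3, activation="relu")(spec_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

text_in = keras.Input(shape=(100,), name="keyword_embedding")
t = layers.Dense(64, activation="relu")(text_in)

merged = layers.concatenate([x, t])  # multi-modal fusion point
out = layers.Dense(NUM_EMOTIONS, activation="softmax")(merged)

model = keras.Model(inputs=[spec_in, text_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```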




Author information


Corresponding author

Correspondence to Jin-Su Kim.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kim, JS. Multimedia emotion prediction using movie script and spectrogram. Multimed Tools Appl 80, 34535–34551 (2021). https://doi.org/10.1007/s11042-020-08777-x

