Abstract
This article proposes a multimedia emotion-prediction approach that combines movie scripts with spectrograms derived from speech. First, a variety of information is extracted from the textual dialogue in scripts for emotion prediction. Spectrograms generated from the speech signal then help capture subtle cues for emotions that are difficult to predict from scripts alone. Accent also helps, because it is an important means of expressing emotional states in speech. Together, these sources are used to analyze emotion words with a similar tendency, based on the emotion keywords in scripts and on spectrograms. Candidate emotion keywords are extracted from the text using morphological analysis, and representative emotion keywords are selected through Word2Vec_ARSP. The emotion keywords and the speech data from the last part of each dialogue are extracted and converted into images, and this multimedia information serves as the input layer of a convolutional neural network (CNN). We propose a multi-modal method that extracts and predicts emotions more efficiently by combining and learning from integrated multimedia information: the character’s speech and background sounds, as well as dialogue that directly expresses the emotional situation of the context. To improve the accuracy of emotion prediction from multimedia information in movies, the proposed system uses a CNN for learning, testing, and prediction. The multi-modal system uses the spectrogram to compensate for emotions that cannot be predicted from certain parts of the text. Prediction accuracy improves by 20.9% and 6.7% compared with using only text information and only voice information, respectively.
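As a rough illustration of the speech-to-image step described in the abstract, the sketch below turns the last part of a waveform into a grayscale log-magnitude spectrogram using a short-time Fourier transform. It is not the author's implementation: the abstract does not state the preprocessing settings, so the frame length, hop size, sample rate, tail duration, the synthetic test tone, and the helper names `tail`, `log_spectrogram`, and `to_image` are all assumptions made for this example.

```python
import numpy as np


def tail(signal, sample_rate, seconds):
    """Keep only the last `seconds` of the waveform (assumed 'last part of the dialogue')."""
    n = int(sample_rate * seconds)
    return signal[-n:]


def log_spectrogram(signal, frame_len=256, hop=128):
    """Return a (frames x frequency bins) log-magnitude spectrogram.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)


def to_image(spec):
    """Rescale a spectrogram to 0..255 so it can be treated as a grayscale image."""
    lo, hi = spec.min(), spec.max()
    scaled = (spec - lo) / (hi - lo) if hi > lo else np.zeros_like(spec)
    return (scaled * 255).astype(np.uint8)


# Synthetic stand-in for a speech clip: a 2-second 440 Hz tone at 8 kHz (assumed values).
sample_rate = 8000
t = np.arange(sample_rate * 2) / sample_rate
speech = np.sin(2 * np.pi * 440 * t)

# Spectrogram image of the final second, ready to be fed to an image model.
image = to_image(log_spectrogram(tail(speech, sample_rate, 1.0)))
```

The resulting 2-D `uint8` array has one row per time frame and one column per frequency bin, which is the kind of image-like input a CNN can consume. A real pipeline would load recorded speech rather than a synthetic tone and would likely tune the frame settings to the audio.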
About this article
Cite this article
Kim, JS. Multimedia emotion prediction using movie script and spectrogram. Multimed Tools Appl 80, 34535–34551 (2021). https://doi.org/10.1007/s11042-020-08777-x
DOI: https://doi.org/10.1007/s11042-020-08777-x