1 Introduction

Captioning is the reference assistive tool for hearing impairment. Captions are based on speech subtitling but include additional information such as sound effects, speaker identification, and other essential non-speech features. Captions are the “audio” for the deaf and hard of hearing. Special regulations have been issued to guarantee their application and quality, considering factors such as synchronism, presentation speed, or accuracy, among others [1]. Pre-recorded captioning is the standard mode of captioning movies. Pre-recorded captions are produced after the movie has been created and are carefully checked for accuracy using specific software frameworks that ease tasks such as video file editing, audio frame localization, caption editing, or preview. Speech recognition technologies based on deep learning significantly reduce speech captioning time by automatically proposing the corresponding transcript for voice frames [2].

This study aims to evaluate other deep learning technologies that could be added to these frameworks to ease the task of music captioning. The capacity of music to generate emotions is widely used in movie soundtracks [3] as a support to the narrative [4, 5].

For example, the meaning of a wordless scene in which a character is seen from behind looking out of a window is changed by a few seconds of happy, sad, or frightening music. Accessible captioning must include music information whenever it is important for understanding the plot, with a text summarizing the type of music, the sensation transmitted, or the identification of the piece, e.g., “(Horror Music)”. The professional responsible for captioning the film decides when the music should be captioned and which feeling the author intended to convey. Deep learning technologies that detect significant musical fragments and propose the corresponding musical emotion could contribute to automating this task.

The investigation of emotion is a field of neuroscience research that began only in recent decades, and much remains to be discovered. Since the end of the last century, research has been developed based on two basic paradigms: the categorical model and the dimensional model of emotion [6].

The categorical model of emotion presupposes the existence of a limited number of basic, innate, and universal emotions. The Ekman model is the best known and considers seven basic emotions: fear, sadness, anger, happiness, surprise, disgust, and contempt [7]. Subsequent studies have reduced this set to four “basic” emotions: happiness, sadness, fear, and anger [8]. The dimensional model of emotion, on the other hand, states that emotions may be represented in a continuous space, generally of two or three dimensions. The hybrid “Circumplex model of affect” proposes that all affective states arise from cognitive interpretations of central neural sensations that are the product of two independent dimensions: one related to valence (positive/negative stimuli) and one related to arousal (activation) [9]. Discrete emotions would then be subjective psychological “labels” that can be identified with points in this continuous Valence-Arousal space. In both models, it has been widely assumed that emotions are the subjective representations of primary neural circuits, basic for survival, that have evolved from the earliest complex animals [10]. Emotion would be a primitive adaptive mechanism triggered by stimuli critical for survival, prompting action. Thus, some authors consider that music could activate biologically important emotional circuits for processing sounds [11,12,13]. This primitive origin would explain the immediacy and universality of musical emotion with respect to the basic emotions of happiness, sadness, and fear, which are the most identifiable in musical excerpts when expressed with intensity [6, 14, 15]. The recognition of these basic emotions in music is consistent among listeners from the same culture [16] and among listeners from different cultures [17, 18].

In addition, this recognition is immediate, occurring in less than two seconds, with a simple chord or a few notes, when the music expresses the basic emotions of happiness, sadness, or fear [11, 14, 15]. In [15], the authors found average times of 483 ms, 1446 ms, 1737 ms, and 1261 ms for correctly recognising the happy, sad, scary, and peaceful excerpts, respectively. In [14], using a set of very short musical clips averaging 1.6 seconds, the authors showed that experimental subjects categorised the emotions associated with these clips correctly and with great precision, and that in some cases as little as 250 milliseconds from the start of the music was enough to distinguish sad music from happy music. When these emotions are expressed less intensely, or for other emotions, consensus among listeners decreases significantly [6].

One of the problems that neuroscientists encountered in these studies was the choice of musical stimuli. The authors of [6, 15] created standard scientific musical databases, rigorously validated for musical emotion research, which have become a reference and which are based precisely on movie soundtracks, as this is music composed to transmit powerful emotional stimuli. The musical fragments are labelled with the perceived basic emotions of joy, sadness, and fear, plus a fourth emotion, peacefulness/tenderness, which is not considered a primary emotion but is easily identified as a perceived musical emotional state. In these studies, evaluators are instructed to rate perceived emotion (the emotion the music intends to represent) rather than induced emotion (the emotion actually felt). However, the border between the two is very diffuse, and empirical studies show great similarity between them [6].

The relationship between musical parameters and emotion is also attracting much interest. Many studies, generally focused on the basic emotions of joy, sadness, and fear, show that mode, tempo, register, dynamics, articulation, and timbre are the parameters that most affect musical emotion and that these parameters operate additively (see Table 18). The relative importance of these parameters varies for each emotion; for example, mode is extremely important for happiness and sadness, and articulation for fear [14, 19, 20].

The ability of music to induce emotions has also given rise, within computer science and affective computing, to a field of research dedicated to identifying the characteristics of music that generate different emotional states. This field, called Music Emotion Recognition (MER), has attracted considerable interest in recent years, mainly due to the boom in music streaming platforms and automatic music recommenders [21,22,23]. MER is based on the analysis of low- or medium-level characteristics of music, obtained from digital audio samples using the techniques of a closely related field of research, Music Information Retrieval (MIR). According to the review performed in [21], the first article in this field was published in 2003 [24]. In that work, the authors proposed a system for classifying songs into four emotional categories: happiness, sadness, anger, and fear, based on two musical characteristics, tempo (fast or slow) and articulation (staccato or legato). Since then, many studies on music emotion classification algorithms have been published.

The typical development scheme of an MER model comprises three steps: selection and labelling of digital musical samples (ground truth), selection and extraction of features from the digital audio samples, and application of supervised machine learning to map features to emotions, as sketched below. Each phase entails significant limitations [21, 25].
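As an illustration of this classical pipeline, the following sketch extracts a few hand-crafted features with Librosa and maps them to emotions with an SVM; the file names, labels, and chosen features are hypothetical and do not correspond to any particular study.

```python
# Minimal sketch of the classical three-step MER pipeline described above
# (hand-crafted feature extraction followed by a supervised classifier).
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path, sr=16000):
    """Step 2: extract a small hand-crafted feature vector from one audio file."""
    y, sr = librosa.load(path, sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)       # rough tempo estimate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre descriptors
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # pitch-class energy
    return np.hstack([tempo, mfcc.mean(axis=1), chroma.mean(axis=1)])

# Step 1: a labelled ground truth is assumed to exist (hypothetical files/labels).
files = ["happy_01.mp3", "sad_01.mp3", "fear_01.mp3"]
labels = ["happiness", "sadness", "fear"]

# Step 3: map features to emotions with a supervised classifier (here an SVM).
X = np.array([extract_features(f) for f in files])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))
```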

First, there is a lack of public, consensus-based, and adequately validated datasets. In general, MER datasets are labelled in variable and poorly controlled environments, without prior training of the evaluators or control of the evaluation process; the evaluators, or even the labels themselves, may change throughout the evaluation. For example, commonly used datasets such as the Million Song Dataset [26], MTurk [27], or MagnaTagATune [28] are the result of free annotation open to any user.

Second, regarding features, there is no agreement on which audio features are significant for capturing musical emotion, nor certainty about the validity of the algorithms used to extract them. Tools like Librosa, Essentia, or MirToolBox allow the extraction of a large amount of audio information, from which it is difficult to choose the significant parameters. Thus, different feature sets grouping many of these characteristics have been used to establish predictive models [21, 25]. Still, it is unclear whether the audio features used are sufficiently relevant to the problem [25].

1.1 Related work

Classification algorithms such as Gaussian Mixture Models (GMM), K-Nearest Neighbour (KNN), Support Vector Machines (SVM), and Support Vector Regression (SVR) are generally used [29,30,31], with SVM being the classifier that obtains the best results [25, 32]. The review in [21] indicates that the highest accuracy achieved in emotion classification was 69.5%, considering five emotional categories. In [25], the authors compared different results using SVM to classify musical fragments into the four quadrants of the “Circumplex model of affect”, obtaining accuracies of up to 76.4%.

Another difficulty is choosing the length of the musical segments to be evaluated. In a song a few minutes long, the emotional content can fluctuate over time. Usually, the song is divided into small segments, and the emotion is detected for each segment to obtain more accurate results. For example, the typical segmentation length for popular music is 25-30 seconds [21]. For classical music, optimal results were obtained with lengths of 8-16 seconds (lengths of 4, 8, 16, and 32 seconds were tested) [33].

Recently, models based on feature selection have been replaced by Convolutional Neural Network (CNN) models with promising results. The success of neural networks in image recognition has aroused interest in applying these networks using, as input images, the spectrograms obtained from audio samples, such as Short-Time Fourier Transform (STFT) or Mel spectrograms. The authors of [34] used a CNN as a novel approach to music genre classification using MFCC (Mel Frequency Cepstral Coefficients) spectra, showing that CNNs have great potential for extracting features from audio samples. In the review [21], the best accuracy obtained with CNNs was 69.5%. In [35], the authors benchmarked the latest CNN architectures proposed for musical genre classification. The results showed that the simplest architectures applied to short musical fragments (about 3 seconds) obtained the best results.

However, the field of automatic MER is wide open, and the accuracy rates mentioned above are still far from good enough for automatic emotional labelling and captioning. In general, MER approaches are based on combining multiple audio parameters and machine learning techniques without considering the main characteristics of the problem, which are rooted in human musical perception and emotion. Computational models seem anchored in the labyrinth of MIR and machine learning algorithms, distancing themselves from the neuroscientific foundations of musical emotion perception.

Table 1 Emotion distribution in musical fragments

1.2 Objectives and hypothesis

In this study, the authors aimed to address the problem of emotion detection in movie soundtracks, taking neuroscientific results on emotion perception as the basis of the approach.

Hence, as a first approach to a music captioning tool, it was decided to develop an automatic classification model to extract emotions from film music, based on the following decisions, drawn from the conclusions mentioned above and supported by neuroscientific studies:

  • Use a basic classification of the emotions of happiness, sadness, and fear (expressed to an intense degree), which are the best recognized in music, with consensus among subjects, and which are also the ones of greatest interest in movie soundtracks.

  • Consider musical segments of 2 seconds, enough time to generate immediate musical emotion.

  • Use CNN models, as there is no agreement on which audio characteristics are significant for capturing musical emotion; CNN models make it possible to work without prior feature selection.

  • Use the scientific musical datasets, Film Music Excerpts [6] and Musical Excerpts [15], as they are the only ones based on the film musical genre, labelled with scientific rigour in terms of emotion from the field of neuroscience.

In summary, the authors propose a novel approach based on the latest neuroscientific evidence, and the results show an improvement over the state of the art.

The remainder of this paper proceeds as follows. Section 2 outlines the dataset used in the study. Section 3 discusses the classification methods of the research, and the two experimentation sets are defined. Section 4 presents the results obtained and a discussion about them. Finally, the paper ends with a summary of research findings, limitations and concluding remarks.

2 Materials and methods

2.1 Dataset description

It was decided to use only scientifically validated musical samples corresponding to intensely expressed emotions, which is the target of music captioning. Thus, the standard scientific musical databases mentioned above [6, 15] were included in the research. The musical excerpts of both datasets are based on movie music and were labelled in a controlled experimental environment with different cross-tests to obtain validated results.

The Musical Excerpts dataset comprises 40 excerpts composed specifically in the film music genre [15]. The fragments are classified into four emotions (10 per emotion): happiness, sadness, threat, and peacefulness, with recognition rates of 99%, 84%, 72%, and 94%, respectively. The average duration is 12.5 seconds. The copyright owner of these excerpts is Bernard Bouchard, and their use is permitted [15]. All 40 excerpts were included in the dataset used in this research.

The Film Music Excerpts dataset comprises a first set of 360 musical excerpts from 60 film soundtracks [6]. The excerpts offer examples of the emotions of happiness, sadness, fear, anger, and peacefulness, expressed at high and moderate intensity. These excerpts were scored by experimental participants who evaluated the expressed emotion and its intensity on a scale from 1 to 7. In total, 94 fragments were selected. The selection criterion was to include only excerpts scored \(\ge 6\) in the intensity of the expressed emotion; with lower scores, the consensus among participants on the expressed emotion decreased. The average duration of these fragments is 16 seconds. In total, 30 excerpts for happiness, 21 for fear, 24 for sadness, and 19 for peacefulness were selected.

The 10 peacefulness excerpts from [15] and the 19 tenderness excerpts from [6] were combined into a single group under the label of peacefulness. As the results in [6] and [15] show, peacefulness and tenderness are very close emotions and overlap in the continuous Valence-Arousal space, with low arousal and positive valence.

Fig. 1 Dataset description (musical fragments and samples)

The selected musical fragments generated 976 samples of two seconds’ duration. According to the neuroscientific results mentioned above, two seconds is enough time to generate immediate musical emotion (Table 1).

The format of the samples was MP3, with an original sampling rate of 44.1 kHz. The samples were down-sampled to 16 kHz and divided into two-second samples (see Fig. 1). Reducing the sampling rate was found not to affect the results while improving processing time (see Table 4).
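The following sketch illustrates this preprocessing with Librosa, assuming a single excerpt file (the file name is a placeholder): the audio is resampled to 16 kHz on loading and then sliced into non-overlapping two-second samples.

```python
# Minimal sketch of the preprocessing described above: load an excerpt,
# resample it to 16 kHz, and slice it into non-overlapping 2-second samples.
import librosa
import numpy as np

SR = 16000                    # target sampling rate (down-sampled from 44.1 kHz)
SAMPLE_SECONDS = 2            # duration of each training sample

y, sr = librosa.load("excerpt.mp3", sr=SR)   # hypothetical file; librosa resamples on load

samples_per_clip = SR * SAMPLE_SECONDS
n_clips = len(y) // samples_per_clip
clips = np.reshape(y[:n_clips * samples_per_clip], (n_clips, samples_per_clip))
print(clips.shape)            # (n_clips, 32000)
```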

To be able to apply CNN models, frequency spectrograms were used. For each 2-second sample, and using the Python Librosa library, three types of spectrograms were generated: STFT (frequency spectrograms), Mel (frequency spectrograms converted to the Mel scale), and Chromagram or Constant-Q Transform (frequencies represented on a logarithmic scale, corresponding to the different notes and octave bands C1, C2, C3, C4, etc.). Overlapping windows of 512 samples (a length corresponding to about 32 milliseconds at 16 kHz) were considered, with an overlap of 50% (see Figs. 2 and 3). It is noteworthy that the STFT spectrogram could be considered analogous to the Fourier analysis performed in the ear at the level of the basilar membrane within the cochlea, the Mel spectrogram to the non-linear human perception of frequencies, and the Constant-Q Transform (CQT) spectrogram to the relative perception of the relationships between frequencies [36, 37].
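A minimal Librosa sketch of this spectrogram extraction is shown below, using a 512-sample window with a hop length of 256 samples (50% overlap); the decibel conversion and the default CQT parameters are assumptions, since the exact settings behind Table 2 are not stated.

```python
# Sketch of the spectrogram extraction described above, for one 2-second clip
# (`clips` comes from the preprocessing sketch). Defaults are used where the
# text does not specify a parameter.
import librosa
import numpy as np

SR, N_FFT, HOP = 16000, 512, 256     # 512-sample window (32 ms at 16 kHz), 50% overlap
clip = clips[0]

stft = np.abs(librosa.stft(clip, n_fft=N_FFT, hop_length=HOP))          # (257, 126)
mel = librosa.feature.melspectrogram(y=clip, sr=SR, n_fft=N_FFT,
                                     hop_length=HOP, n_mels=128)        # (128, 126)
cqt = np.abs(librosa.cqt(clip, sr=SR, hop_length=HOP))                  # (84, 126) with defaults

# Converting to a decibel scale before feeding the CNN is a common (assumed) choice.
stft_db = librosa.amplitude_to_db(stft, ref=np.max)
mel_db = librosa.power_to_db(mel, ref=np.max)
cqt_db = librosa.amplitude_to_db(cqt, ref=np.max)
```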

Fig. 2 STFT, Mel and CQT spectrograms corresponding to the six eighth-notes of the first measure of the excerpt in Fig. 3

This dataset was used in all experiments. Table 2 details the size of the input data corresponding to each type of spectrogram, considering two seconds of audio at 16 kHz and sliding windows of 512 samples with 50% overlap.
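As a cross-check on Table 2 (assuming centred frames, a hop length of 256 samples, and a 512-point FFT), the time and frequency dimensions follow directly from the preprocessing parameters:

$$\begin{aligned} \text{frames} = \frac{2 \times 16000}{256} + 1 = 126, \qquad \text{STFT bins} = \frac{512}{2} + 1 = 257, \end{aligned}$$

while the Mel and CQT representations compress the frequency axis to 128 and 82 bins, respectively.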

Fig. 3 Sad excerpt score from [15], recorded with a digital synthesiser set to piano timbre

Table 2 Input data size for each type of spectrogram
Fig. 4 CNN model for experimentation 1

Fig. 5 Training and validation datasets

3 Classification methods

3.1 Experimentation 1

This first experimentation sought to develop a basic CNN model that would achieve recognition rates in line with the state of the art, in order to determine the most suitable type of spectrogram for classifying the basic emotions of happiness, sadness, and fear.

The CNN model used was a standard convolutional network with few layers, with the typical structure of 3x3 filters + BatchNormalization + ReLU activation + 2x2 MaxPooling, inspired by [22] and adapted to obtain a classification based on four labels corresponding to the emotions of happiness, sadness, fear, and peacefulness. Several tests were conducted to refine the model with the three sets of STFT, Mel, and CQT spectrograms, adding and adjusting layers and hyperparameters.

Figure 4 summarizes the resulting model. The convolutional stage comprises five convolutions with 3x3 filters and ReLU activation; BatchNormalization is performed after each convolution, and 2x2 max pooling every two convolutions. The result, resized to a one-dimensional vector, is processed in the fully connected stage by two dense layers of 300 and 150 neurons with ReLU activation, followed by a final output layer with softmax activation. Dropout is applied to these dense layers to avoid overfitting (Fig. 5).
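A minimal Keras sketch of this architecture is given below; the filter counts and dropout rate are not stated in the text (they correspond to Fig. 4 and Table 3), so the values used here are assumptions.

```python
# Hedged Keras sketch of the experimentation-1 CNN described above.
# Filter counts and the dropout rate are assumptions.
from tensorflow.keras import layers, models

def build_cnn_exp1(input_shape=(82, 126, 1), n_classes=4):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    filters = [32, 32, 64, 64, 128]                  # assumed filter counts
    for i, f in enumerate(filters):                  # five 3x3 convolutions
        model.add(layers.Conv2D(f, (3, 3), padding="same"))
        model.add(layers.BatchNormalization())       # BN after each convolution
        model.add(layers.Activation("relu"))
        if i % 2 == 1:                               # 2x2 max pooling every two convolutions
            model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(300, activation="relu"))
    model.add(layers.Dropout(0.3))                   # assumed dropout rate
    model.add(layers.Dense(150, activation="relu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```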

Table 3 summarizes the parameters with which the best results were obtained and the parameters selected for the next experiments.

Once this base model was established, the different types of spectrograms were re-evaluated. For the model evaluation on each set of spectrograms, k-fold cross-validation was used (with k=10). k-fold cross-validation partitions the data into k separate sets, generally created randomly as k disjoint subsets of similar size. The training process is repeated as many times as there are subsets, until every subset has been used to train and to validate the network [38].
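The following sketch shows this evaluation loop with scikit-learn’s KFold; the array names, batch size, and epoch count are assumptions.

```python
# Hedged sketch of the 10-fold cross-validation procedure described above.
# X holds the stacked spectrograms and y the one-hot emotion labels (assumed names).
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, val_idx in kfold.split(X):
    model = build_cnn_exp1()                          # model sketched above
    model.fit(X[train_idx], y[train_idx],
              epochs=50, batch_size=32, verbose=0)    # assumed training settings
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    accuracies.append(acc)
print("mean accuracy:", np.mean(accuracies))
```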

3.2 Experimentation 2

The second experimentation is based on [35], which reports a benchmark of the most representative state-of-the-art CNN models for music genre classification and is of particular interest because it shows that the simplest architectures achieved the best results with short musical segments. From this work, the authors selected a standard CNN architecture and adapted it to classify the basic emotions of happiness, sadness, and fear plus the musical emotion of peacefulness. This model, called CNN-4, was tuned with the hyperparameters defined in experimentation 1 (Table 3), and CQT spectrograms were used as input data. The model was also adapted to classify only the three basic emotions of happiness, sadness, and fear; in this case, it is referred to as the CNN-3 model.

Table 3 Tuned hyperparameters (tested and selected)
Fig. 6 CNN model for experimentation 2

Two variants of CNN-4, incorporating ResNet and Inception modules, were also considered: (1) CNN + ResNet and (2) CNN + Inception. In the CNN + ResNet model, each convolutional layer was replaced by a ResNet block (with the same number of filters and filter size). In CNN + Inception, the first convolutional layer was composed of parallel convolutions with different filter sizes.
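The exact block designs are not detailed in the text; the following sketch shows one common form of residual block of the kind that could replace a convolutional layer in the CNN + ResNet variant.

```python
# Hedged sketch of a ResNet-style block: two convolutions with a skip connection.
# This is one common design, not necessarily the exact block used by the authors.
from tensorflow.keras import layers

def resnet_block(x, filters, kernel_size=(3, 3)):
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:                 # match channel count if needed
        shortcut = layers.Conv2D(filters, (1, 1), padding="same")(shortcut)
    y = layers.Add()([y, shortcut])                   # residual (skip) connection
    return layers.Activation("relu")(y)
```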

Figure 6 summarizes the CNN-4 model. The model is very similar to the model developed in the previous experimentation but deeper. The convolutional stage comprises eight convolutions with 3x3 filters and ReLU activation, with BatchNormalization and 2x2 max pooling performed after each convolution. The result, resized to a one-dimensional vector, is processed in a fully connected layer of 512 neurons with ReLU activation, followed by a final output layer with softmax activation. Dropout is applied to the dense layer to reduce overfitting. The training dataset for this model consisted of the 976 CQT spectrograms based on the 2-second audio fragments with 16 kHz sampling and size 82x126. The CNN-3 model was the same but with the output reduced to the three emotions of happiness, sadness, and fear.

To evaluate both models, k-fold cross-validation was first used (k=10). The evaluation was carried out with 50 epochs and repeated with 100 epochs, which gave better results. Subsequently, the samples were divided into a training set (75%, 732 samples) and a validation set (25%, 244 samples), ensuring that samples in the validation set belonged to musical fragments not included in the training set (see Fig. 5). This operation was repeated four times to obtain different combinations of musical fragments in the training and validation sets; it is referred to as the training/validation phase, and its objective was to evaluate the generalizability of the models.
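A fragment-wise split of this kind can be sketched with scikit-learn’s GroupShuffleSplit, where `groups` holds the parent-fragment identifier of each 2-second sample; the array names and the `build_cnn4` builder are hypothetical.

```python
# Hedged sketch of the fragment-wise train/validation split described above,
# ensuring samples from the same musical fragment never end up in both sets.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
for train_idx, val_idx in splitter.split(X, y, groups=groups):
    model = build_cnn4()                              # hypothetical CNN-4 builder (Fig. 6)
    model.fit(X[train_idx], y[train_idx], epochs=100, batch_size=32, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    print("validation accuracy:", acc)
```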

The entire process was applied to both CNN-4 and CNN-3 models.

Finally, k-fold cross-validation (k=10) was also applied to the variant models CNN + ResNet and CNN + Inception.

4 Results

Regarding experimentation 1, the results shown in Tables 4 and 5 were in line with the state of the art. The best results were obtained with the CQT spectrograms; the results with Mel and STFT spectrograms were slightly lower, and the training time was much longer with Mel and especially with STFT spectrograms.

The training time increases with input size: STFT inputs (257x126) require approximately double the time of Mel inputs (128x126), which in turn require approximately double the time of CQT inputs (82x126). CQT spectrograms reduce dimensionality, and hence the computational resources required, while keeping the main characteristics of the musical sample (Figs. 2 and 3).

Table 4 Experimentation 1: Mean Accuracy (MA) results and Processing times per spectrogram (PT)
Table 5 Experimentation 1: Results for CNN Model with CQT

Table 6 shows the average F1-scores obtained with CQT, Mel, and STFT spectrograms in classifying the different emotions. Again, the results with CQT spectrograms were better than with Mel or STFT spectrograms.

Based on these results, CQT spectrograms were selected for Experimentation 2.

Results for the CNN-4 model are summarized in Tables 7 and 8. The cross-validation results (0.825 mean accuracy) improved on experimentation 1 (0.789 mean accuracy), but the results obtained in the training/validation phase were much lower (0.58 mean accuracy), showing that the CNN-4 model does not generalize well (Tables 9, 10, and 11).

Results for the CNN-3 model are summarized in Tables 12 and 13. The cross-validation results (0.920 mean accuracy) improved compared to the CNN-4 model (0.825 mean accuracy). This improvement is also observed in the generalization capacity, reaching a mean accuracy of 0.79 (compared to 0.59 with CNN-4) in the training/validation phase.

The CNN-3 model shows similar precision and recall scores, that is, similar ability not to classify negative samples as positive and to recognize all positive examples.

Tables 9 and 10 show the results obtained with the CNN + ResNet and CNN + Inception models with k-fold cross-validation (k=10, 100 epochs).

Table 14 compares the k-fold cross-validation results obtained with the CNN + ResNet and CNN + Inception models; both showed worse results than the CNN-4 model.

Finally, Table 11 shows the k-fold cross-validation results obtained with CNN + ResNet and Mel spectrograms, which again are worse than those of the CNN-4 model with CQT spectrograms. This test was performed to check whether a deeper neural network could perform better with a higher-dimensional spectrogram.

Table 6 Experimentation 1: Average results per emotion (F1 score)

4.1 Statistical analysis

In any empirical scientific work, when an experiment is repeated under conditions that are indistinguishable to the researcher, it is very common for the results to show some variability; this is known as experimental error. Therefore, in any experimental scientific study, it is crucial to compare and evaluate the characteristics of the different sets of samples and the results obtained. In this research, following the steps defined in [39], the results have been validated from a statistical standpoint, thereby reducing the influence of experimental error or possible randomness.

Table 7 Experimentation 2: Results for CNN-4 Model
Table 8 Experimentation 2: Results for CNN-4 Model
Table 9 Experimentation 2: Results for CNN-4 Model + ResNet Model
Table 10 Experimentation 2: Results for CNN-4 Model + Inception Model
Table 11 Experimentation 2: Results for CNN-4 Model + ResNet + Mel Spectrogram
Table 12 Experimentation 2: Results for CNN-3 Model
Table 13 Experimentation 2: Results for CNN-3 Model

Figure 7 includes the scatter plot, box plot, analysis-of-means plot, and residual plot associated with the results. The scatter plot describes the behaviour of the set of samples obtained for each classifier through a point cloud. The box plot allows, through simple visual inspection, an approximate idea of the central tendency (through the median), the dispersion (through the interquartile range), the symmetry of the distribution (through the symmetry of the plot), and the possible outliers of each classifier. The central line within each box describes the location of the sample median, and the mean is represented by a cross. The graph also includes a notch for the median, the width of which roughly indicates the 95% confidence interval. In the analysis-of-means plot, all the models are compared together with the overall mean and the 95% decision limits. The samples outside the decision limits, CNN-3 and CNN-4 + Inception, are significantly different from the overall mean. Finally, the residual plot shows the residuals obtained for each of the alternatives. The residuals are equal to the percentage-correct values minus the mean of the group from which they come, and they show that the variability within each alternative is approximately the same.

Fig. 7 Scatter and box plots (left) and residuals and analysis-of-means plot (right)

Table 14 Average Accuracy results of cross-validation in the different models evaluated

In the box plot, the different boxes show asymmetry in the distribution of the samples. In this case, the widths of the median notches, for a 95% confidence interval, are not similar, which suggests a statistically significant difference between the medians at this confidence level.

Hence, in order to compare the different models, it is necessary to check whether there are significant differences between the variances of the populations; therefore, a variance check was performed. The three statistics displayed in Table 15 test the null hypothesis that the standard deviations of the results within each of the five models are the same. Since the smallest of the p-values is less than 0.05, there is a statistically significant difference between the standard deviations at the 95.0% confidence level. This violates one of the important assumptions underlying the analysis of variance and invalidates most standard statistical tests (e.g., ANOVA, the Analysis of Variance method).
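The specific statistics reported in Table 15 are not named in the text; the following sketch illustrates such a variance check with two common tests from SciPy, applied to the per-fold accuracies of the five models (the array names are assumptions).

```python
# Hedged sketch of the variance check described above. Each array holds the
# per-fold cross-validation accuracies of one model (assumed names).
from scipy import stats

samples = [acc_cnn4, acc_cnn3, acc_resnet, acc_inception, acc_resnet_mel]

levene_stat, levene_p = stats.levene(*samples)        # robust to non-normality
bartlett_stat, bartlett_p = stats.bartlett(*samples)  # assumes normality
print(f"Levene p = {levene_p:.4f}, Bartlett p = {bartlett_p:.4f}")
# A p-value below 0.05 indicates significantly different variances, which rules
# out standard ANOVA and motivates the Kruskal-Wallis test used next.
```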

Table 15 Variance check
Table 16 Kruskal-Wallis Test

Once it has been determined that there is a statistically significant difference between the variances, the Kruskal-Wallis test is the most appropriate method for comparing populations whose distributions are not normal [40]. It is the non-parametric alternative to the F-test for testing the equality of the medians of a group of populations. The reason for using the median is that it is robust, i.e., not very sensitive to atypical data, whereas the mean is very sensitive. If the distribution is normal, the mean and median coincide, but if there is a discrepancy between the two, the median is preferable. Therefore, in the absence of normality, the relevant hypothesis tests are those on the median rather than the mean.

The Kruskal-Wallis test, shown in Table 16, tests the null hypothesis of equality of the medians across the five models. The data from all columns are first combined and ranked from smallest to largest; the median corresponds to the observation at position \((N + 1)/2\) in the ranked order. The mean rank, i.e., the average of the ranks of all observations within each sample, is then calculated for each column. Since the p-value is less than 0.05, there is a statistically significant difference between the medians at a confidence level of 95.0%. In Fig. 7, the box-and-whisker plot shows which medians are significantly different from each other (each box has a median notch).
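In SciPy, this test is a one-liner; the following sketch applies it to the per-fold accuracies of the five models (the array names are assumptions).

```python
# Hedged sketch of the Kruskal-Wallis test described above, applied to the
# per-fold accuracies of the five models compared in Table 16 (assumed names).
from scipy import stats

h_stat, p_value = stats.kruskal(acc_cnn4, acc_cnn3, acc_resnet,
                                acc_inception, acc_resnet_mel)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
# p < 0.05 rejects the null hypothesis of equal medians across the five models.
```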

Finally, the quality metrics show that the results obtained with CNN-3, as well as being higher on average, are significantly different from and not homogeneous with those of the other models. This allows the researchers to favour this approach with confidence in its fitness for this problem.

5 Discussion

The different experiments carried out show, on the one hand, that the CQT spectrograms, which best represent the relationships between musical tones from the point of view of human perception, offer the best results when used as input data to the CNN model. In addition, the processing time they require is much lower than that of the other spectrograms (see Table 4).

Table 17 Confusion matrix generated in the training/validation phase of CNN-4

Moreover, the different experiments carried out show that CNN models with a simpler architecture, relatively deep (eight convolutional layers) and built from convolutions with a simple structure, offer better results than more complex models that include, for example, ResNet or Inception blocks (deeper networks require larger datasets for effective training, and our dataset had a limited size, which could explain the worse results in the case of the ResNet or Inception networks).

The architecture of the CNN model follows this structure:

$$\begin{aligned} \text{Filter}(3\times 3) + \text{BN} + \text{AR} + \text{MaxPooling}(2\times 2) \end{aligned}$$
(1)

where BN is Batch Normalization and AR is the ReLU activation.

Thus, the CNN model is the one that obtains the best results, and the classification improves when only the three basic emotions of happiness, sadness, and fear are considered, discarding the emotion of peacefulness. As mentioned, peacefulness is not considered a basic emotion but is frequently perceived as a musical emotion and is characterized by musical parameters partly shared with sadness (see Table 18). Therefore, as observed in [15], in human perception peacefulness tends to be confused more often, particularly with sadness, than the basic emotions of happiness or fear. The same tendency is observed in the CNN-4 model, as can be seen in the example confusion matrix in Table 17, where sadness and peacefulness are confused more often than happiness or fear; this explains the better performance of the CNN-3 model, which only includes the basic emotions of happiness, sadness, and fear (Table 18).

Table 18 Emotion and musical parameters

Indeed, the CNN-3 model classifies the three basic emotions of happiness, sadness, and fear, which are the most interesting from the point of view of characterizing music in movie soundtracks, as these emotions are the ones most used in music to support the development of the dramatic action.

The review of current research in automatic emotion recognition [23], published in 2022, shows that the best accuracy obtained with CNNs is 0.695. Table 14 shows that the CNN-4 model achieves 0.82 accuracy and the CNN-3 model 0.92. Finally, in Table 19, the authors compare the mean accuracy values per emotion obtained with the CNN-3 model against the results obtained in [15] with experimental participants during the elaboration of the Musical Excerpts dataset used in this research. These results show that the CNN-3 results are close to those obtained by human perception.

Table 19 Comparison of results with CNN-3 model versus [15]

6 Conclusions

The capacity of music to generate intense emotions is widely used in movie soundtracks, especially regarding happiness, sadness, and fear, to support the dramatic plot. The objective of this study was to evaluate deep learning technologies that could be added to pre-recorded movie captioning frameworks to ease the task of music captioning for accessibility purposes. In contrast to MER approaches that combine multiple audio parameters and machine learning techniques in a somewhat arbitrary way, the authors addressed the problem of emotion detection considering the latest neuroscientific evidence (only happiness, sadness, and fear expressed with intensity are consistently and universally recognized by listeners, and immediately, in less than two seconds) and using only scientifically labelled film music datasets.

Taking the results into account, it can be concluded that CQT spectrograms combined with a simple CNN architecture result in an efficient emotion classification model for 2-second musical audio fragments representative of the intense basic emotions of happiness, sadness, and fear. These are precisely the most interesting emotions to identify for movie music captioning, and the model approaches the results of neuroscientific experiments with human subjects.

In addition, compared to other models, it has the great advantage of not requiring prior selection of the characteristics of the audio samples, which makes it easier to apply.

It also shows that movie music can be automatically classified based on basic emotions. This paves the way for accessible, automatic captioning of music, which could automatically identify the emotional intent of the different segments of the movie soundtrack. Building on these results, the next steps will consider the automatic analysis of the whole film soundtrack, detecting musical fragments of intense emotion that should possibly be captioned, marking the location of these segments, and proposing the corresponding emotion.

As future work, it is proposed to apply the designed models to complete or partial film soundtracks to detect segments of maximum emotional intensity.

Finally, the authors want to point out the importance of combining neuroscience, musical theory, and computational models in this type of study.