1 Introduction

Music holds profound significance in human culture and exerts a powerful influence on a global scale. Spanning classical melodies, contemporary pop, traditional folk tunes, and cutting-edge electronic compositions, music embodies a vast spectrum of styles and serves as a medium for inspiration and emotional expression [1]. Beyond its artistic essence, music functions as a multifaceted tool for entertainment, social cohesion, cultural preservation, and psychological well-being. In recent decades, propelled by rapid advances in multimedia technology, music has become increasingly accessible and has integrated seamlessly into daily life as a source of leisure and a means of emotional regulation [2]. Central to the essence of music is its capacity to evoke and convey emotion. Whether eliciting joy, melancholy, anticipation, or tranquility, music has a remarkable ability to produce emotional resonance in listeners, offering a medium for emotional exploration and catharsis. Consequently, the study of emotion in music has become a pivotal area of research at the intersection of music psychology and computer science.

However, despite significant advancements, emotion classification tasks in music analysis still confront several challenges. Traditional methods often rely solely on audio data for classification, overlooking the rich potential offered by other multimodal sources such as lyrics, music videos, and social media comments [3]. Moreover, existing classification models may struggle to effectively capture the complexity of emotional expressions and nuances inherent in music, given the multidimensional nature of emotions, which extend beyond simple binary distinctions like pleasure or sadness.

In response to these challenges, multimodal approaches offer a promising avenue for enhancing emotion classification tasks in music analysis. By harnessing insights from diverse multimodal data sources, including audio, text, images, and videos, researchers can cultivate a more holistic understanding of the emotional landscapes embedded within music [4]. Multimodal music emotion classification not only enhances classification accuracy but also enables a nuanced portrayal of the intricate emotional tapestry woven within musical compositions. This advancement holds transformative potential, empowering music recommendation systems to tailor experiences to the emotional needs of listeners, while also fostering applications in domains such as healthcare, psychology, and advertising.

In traditional research on music emotion classification, researchers rely extensively on audio feature extraction techniques to obtain information from audio signals for emotion classification. Common traditional techniques include Mel-Frequency Cepstral Coefficients (MFCCs) and the Spectral Centroid. MFCCs are a classical audio feature representation widely used in music and speech processing [5]; they mimic the way the human ear perceives sound to extract spectral information from audio signals. The Spectral Centroid indicates the center of mass of the spectrum and is commonly used to characterize the brightness of the audio [6]; this feature is often employed in music emotion classification to help distinguish timbral variations under different emotional states. In addition to these traditional techniques, several deep learning models have attracted significant attention. Convolutional neural networks (CNNs), for example, are used not only in image processing but also extensively in audio processing [7]; they effectively capture local features in audio signals, making them highly valuable for music emotion classification. Recurrent neural networks (RNNs), which specialize in handling time-series data, help capture time-related features in audio signals, particularly when describing emotional changes over the course of a piece of music.
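
For concreteness, the following minimal sketch shows how these two handcrafted features are commonly extracted with the librosa library; the file name and the mean/standard-deviation pooling are illustrative choices rather than a reference to any specific published pipeline.

```python
import librosa
import numpy as np

# Load an audio file as a mono waveform (file name is a placeholder).
y, sr = librosa.load("song.wav", sr=22050, mono=True)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape: (13, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape: (1, n_frames)

# Frame-level features are typically summarized (e.g., mean/std) before being
# fed to a classical classifier such as an SVM.
feature_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                                 centroid.mean(axis=1)])
print(feature_vector.shape)  # (27,)
```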

Furthermore, the use of pretrained deep learning models such as BERT or GPT has led to significant improvements in music emotion classification [8]. These models have undergone extensive training on text data and possess strong semantic understanding and representation-learning capabilities, which contribute to a better understanding of the emotional information in music [9]. However, these approaches do not account for the emotional consistency that typically exists between lyrics and melody. Moreover, music carries natural structural information, and these approaches ignore its inherent structure (such as the verse–chorus organization), which is highly informative and, we argue, necessary for music emotion analysis [10].

Drawing from these insights, we introduce the MMD-MII (Multilayered Music Decomposition and Multimodal Integration Interaction) model, a multimodal framework designed to improve music emotion recognition and analysis. Our model incorporates the inherent structural elements of music, focusing on the verse and chorus sections, while enabling interaction between modalities during processing. Upon input, music undergoes cross-processing, in which audio and lyrics interact to maintain emotional coherence. In addition, we establish a hierarchical framework based on the verse–chorus structure of music, analyzing the chorus section separately to extract more precise emotional representations. The overarching objective of the MMD-MII model is to integrate audio, lyrics, and other multimodal data sources at multiple levels, while considering the intrinsic structure of music, so as to improve music emotion recognition and analysis. Through interactive processing and hierarchical analysis, the model captures emotional information in music more accurately and caters to the diverse emotional needs of audiences. The MMD-MII model also holds potential for applications in fields such as music recommendation, advertising, and psychology.

The article primarily makes three contributions:

  • This article first introduces a hierarchical music analysis framework for analyzing the structure of music. Then, it constructs a novel multimodal interaction framework, extracting the current emotion vector at each time step, and further fusing and updating these emotion vectors to ensure emotional consistency among various modalities.

  • The MMD-MII model not only integrates multimodal data but also places a specific focus on the intrinsic structure of music, including aspects like the verse and chorus. Through the hierarchical framework, we can more accurately extract and analyze emotions in different parts of the music, facilitating a deeper understanding of emotional expression in music.

  • The MMD-MII model introduces emotion vectors and designs emotion LSTM cell units to effectively capture emotional information in music, especially when dealing with datasets featuring four different emotion labels. This provides a more accurate and in-depth approach to emotion analysis.

The rest of this paper is organized as follows: Sect. 2 reviews related work, Sect. 3 introduces our proposed method, Sect. 4 presents the experiments, and Sect. 5 concludes the paper.

2 Related Work

2.1 Research on Lyric Text Processing

In the research of music emotion classification, lyric text processing is a crucial step that helps models better understand emotional information within music. Within this research domain, various deep learning models have emerged for processing lyric text, aimed at enhancing the performance of music emotion classification. These models include BERT, GPT-3.5, XLNet, RoBERTa, and DistilBERT.

BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional pretrained model that, through deep learning, can delve deeper into the understanding of emotional content within the lyric text [11]. GPT-3.5, with its enormous parameter count, excels in text generation and comprehension, offering robust support for lyric emotion analysis and lyric generation [12]. XLNet employs a different pretraining approach, aiding in capturing a more comprehensive view of dependencies in text and providing additional angles for emotional understanding [13]. RoBERTa represents an enhancement of BERT, achieved through a larger dataset and extended training duration, resulting in improved performance and more precise emotional representations [14]. Furthermore, DistilBERT is a lightweight version of BERT, offering computational efficiency while still performing well in lyric text processing [14].

These deep learning models provide a diverse set of tools for lyric text processing, enabling the automatic extraction of emotional features from lyrics without manual intervention. Researchers can choose an appropriate model based on the task requirements and computational resources to enhance the accuracy and performance of music emotion classification. These models play a pivotal role in the emotional analysis of lyric text.
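
As a simple illustration of such automatic extraction, the snippet below scores a lyric line with the Hugging Face transformers pipeline API; the default English sentiment model is only a coarse, binary proxy for the finer-grained emotion categories discussed later.

```python
from transformers import pipeline

# Loads a default pretrained English sentiment model on first use.
sentiment = pipeline("sentiment-analysis")
print(sentiment("Tears stream down your face when you lose something you cannot replace"))
# Example output: [{'label': 'NEGATIVE', 'score': 0.99...}]
```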

2.2 Research Based on Audio Feature Extraction

In the research of music emotion classification, melodic audio processing plays a crucial role in extracting melodic information from audio signals to enhance emotion classification performance. Several novel audio feature extraction methods have emerged in this research domain [15]. For instance, Deep Chroma is a deep learning-based audio feature extraction method that focuses on capturing harmony, chords, and melodic information in audio. This model employs a combination of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to analyze the temporal features of audio signals and extract melodic information from music [16]. Deep learning methods like Deep Chroma have the advantage of automatic feature learning, aiding in a more accurate capture of emotional elements within audio. In addition, WaveNet, originally designed for audio waveform generation, has proven to be valuable in audio analysis. WaveNet can model audio signals at high resolutions, capturing fine-grained audio features and melodic variations, providing more informative features for music emotion classification [17]. Furthermore, transfer learning methods have gained prominence in audio processing. They involve the use of pretrained audio models such as VGGish or OpenL3, which offer significant performance improvements in music emotion classification. These models have undergone extensive pretraining on large audio datasets and can automatically extract audio features, eliminating the need for manual feature engineering [18].
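
As an example of this transfer-learning style of extraction, the sketch below obtains frame-level embeddings from the pretrained OpenL3 model; the file name and embedding size are illustrative, and our own pipeline instead uses VGGish (Sect. 3.1).

```python
import openl3
import soundfile as sf

# Load a short audio clip (placeholder file name).
audio, sr = sf.read("clip.wav")

# Frame-level embeddings from a model pretrained on a large audio corpus;
# no manual feature engineering is required.
emb, timestamps = openl3.get_audio_embedding(audio, sr,
                                             content_type="music",
                                             embedding_size=512)
print(emb.shape)  # (n_frames, 512)
```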

These novel audio feature extraction methods offer diverse tools for melodic audio processing, enabling researchers to better understand and analyze emotional elements within music. They allow for a more accurate capture of emotional information within audio signals, thereby improving the performance of music emotion classification. Researchers can select appropriate models based on task requirements and dataset characteristics, unlocking greater potential in the field of music emotion classification [19].

2.3 Multimodal Models for Music Emotion Classification

In the field of music emotion classification, multimodal models have made significant progress in handling different modalities of data, including audio, lyrics, images, and more. For example, MuSeNet is designed to fuse audio and lyric information to more accurately capture emotions in music [20]. MuSeNet employs a deep neural network structure capable of processing both audio and text data. The model consists of two key components: a multimodal encoder and an adaptive module. The multimodal encoder is used to extract feature representations from audio and lyrics, while the adaptive module dynamically learns the weights between different modalities to achieve better performance. MuSeNet’s uniqueness lies in its ability to effectively integrate information from different modalities, thus improving the accuracy and performance of music emotion classification. Another model, FusionNet, combines audio and lyric information [21]. It uses deep convolutional neural networks (CNN) and recurrent neural networks (RNN) structures to process different modal data. This model also introduces fusion strategies to gradually combine modality information for music emotion classification. In addition, MuSeCAR combines audio, lyrics, and emotions to gain a more comprehensive understanding of music emotions [22]. This model incorporates deep learning and knowledge graph techniques by combining multimodal data with an emotional knowledge graph, enabling deeper analysis of emotional content. What sets MuSeCAR apart is its ability not only to predict emotions but also to explain the reasons behind those emotions, providing a more in-depth emotional analysis.

These multimodal models provide powerful tools for music emotion classification, with the potential to more accurately capture emotional elements in music by integrating information from different modalities, thereby enhancing the performance of emotion classification. The ongoing development and innovation of these models will further drive research and applications in the field of music emotion classification.

3 Methodology

We propose the MMD-MII multimodal music emotion classification model. Firstly, we utilize VGGish to extract audio features and ALBERT to extract lyric features. We then introduce the inherent structure of music (verse and chorus) into the overall model framework, enabling interactions between modalities during processing. The goal is to enhance music emotion recognition and analysis. The model diagram is shown in Fig. 1. Once music is input, it passes through a module called the “cross-processing” module. Within this module, audio and lyrics interact to ensure emotional consistency. Simultaneously, we employ a hierarchical framework based on the theory of music’s verse and chorus. When the music reaches the chorus section, we process it separately, extracting more accurate emotional representations.
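
To make this flow concrete, the following toy PyTorch sketch mirrors the segment-by-segment processing described above; the dimensions, the GRUCell standing in for the cross-processing module, the chorus branch, and the pooling are all illustrative stand-ins rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ToyMMDMII(nn.Module):
    """Structural sketch of Fig. 1 with toy stand-ins for VGGish, ALBERT,
    and the cross-processing module (dimensions are illustrative)."""
    def __init__(self, audio_dim=128, text_dim=768, emo_dim=64, n_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, emo_dim)   # stands in for VGGish features
        self.text_proj = nn.Linear(text_dim, emo_dim)     # stands in for ALBERT features
        self.cross = nn.GRUCell(2 * emo_dim, emo_dim)     # stands in for cross-processing
        self.chorus_head = nn.Linear(emo_dim, emo_dim)    # separate treatment of the chorus
        self.classifier = nn.Linear(emo_dim, n_classes)   # four emotion classes

    def forward(self, segments):
        # segments: list of (audio_feat [1, audio_dim], text_feat [1, text_dim], is_chorus)
        emotion = torch.randn(1, self.cross.hidden_size)  # random initial emotion vector
        pooled = []
        for audio, text, is_chorus in segments:
            pair = torch.cat([self.audio_proj(audio), self.text_proj(text)], dim=-1)
            emotion = self.cross(pair, emotion)           # lyric-melody interaction
            pooled.append(self.chorus_head(emotion) if is_chorus else emotion)
        return self.classifier(torch.stack(pooled).mean(dim=0))

# Forward pass over one verse segment and one chorus segment with random features.
model = ToyMMDMII()
segs = [(torch.randn(1, 128), torch.randn(1, 768), False),
        (torch.randn(1, 128), torch.randn(1, 768), True)]
print(model(segs).shape)  # torch.Size([1, 4])
```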

Fig. 1 Overall flow chart of the model

3.1 VGGish Module

VGGish is a widely employed deep learning model in the field of audio, specialized for audio feature extraction and audio content analysis [23]. Firstly, VGGish adopts a convolutional neural network (CNN) architecture, bearing some similarities to the VGG models used in the visual domain. The core task of this model is to extract high-level audio features from audio spectrograms, which can be utilized for various audio analysis tasks. Secondly, VGGish takes short time segments of audio signals as input and employs convolution and pooling layers to analyze these audio features. The model's output is a fixed-length vector representing a high-level representation of the audio segment. These embedding vectors can be used in subsequent tasks such as audio classification, music emotion analysis, and environmental sound recognition.
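
As a minimal sketch, VGGish embeddings can be obtained from the publicly released TF-Hub checkpoint as shown below; the random waveform is a placeholder for a real 16 kHz mono clip, and our training code wraps the model differently.

```python
import numpy as np
import tensorflow_hub as hub

# Public VGGish release; expects mono 16 kHz float32 samples in [-1.0, 1.0].
vggish = hub.load("https://tfhub.dev/google/vggish/1")

waveform = np.random.uniform(-1.0, 1.0, size=16000 * 3).astype(np.float32)  # 3 s placeholder
embeddings = vggish(waveform)   # one 128-D embedding per ~0.96 s frame
print(embeddings.shape)         # (num_frames, 128)
```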

VGGish plays a crucial role in our model by providing essential support for the processing and feature extraction of audio data. It enables our model to better understand and analyze audio content, thus enhancing both its performance and versatility. Figure 2 depicts the flowchart of the VGGish Module.

Fig. 2 The VGGish structural unit

3.2 ALBERT Module

In our framework, ALBERT is applied to extract features from lyrics text, facilitating the model's enhanced comprehension and analysis of lyrical content [7]. The subsequent points elucidate ALBERT’s pivotal role in extracting lyric features:

Initially, ALBERT undergoes pretraining on extensive textual datasets, enabling it to acquire comprehensive representations of textual information. This pretrained model is adept at extracting general text characteristics, encompassing those pertinent to lyrics text.

Subsequently, when lyrics are fed into the ALBERT model, they are converted into text embedding vectors. These high-dimensional feature representations encode the semantic content and structural composition of the lyrics and typically carry rich semantic information.

Furthermore, ALBERT exhibits context awareness, discerning intricate relationships between words and their surrounding context. This contextual comprehension is paramount for processing lyrics, which often carry implicit meanings and subtle emotional shades.

Moreover, in our model, ALBERT's capabilities extend beyond mere semantic analysis. It actively integrates contextual information and emotional nuances from lyrics text, thereby enriching the model's understanding of the lyrical content's emotional depth and complexity.

In summary, ALBERT serves as a fundamental component of our framework for extracting features from lyrics, strengthening the model's ability to comprehend and analyze the emotional dimensions of music lyrics.

ALBERT is a deep learning model used for extracting features from lyrics text. Through pretraining, it can convert lyrics text into meaningful high-dimensional feature representations, which contribute to a better understanding and analysis of the emotional content in music lyrics. This provides strong support for music emotion classification tasks. Figure 3 displays the network structure of the ALBERT model.
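
A hedged sketch of this feature extraction step with the transformers library is given below; the "albert-base-v2" checkpoint and the mean pooling are illustrative choices and not necessarily the exact configuration used in our experiments.

```python
import torch
from transformers import AlbertTokenizer, AlbertModel

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
albert = AlbertModel.from_pretrained("albert-base-v2")

lyrics = "And in the end, the love you take is equal to the love you make"
inputs = tokenizer(lyrics, return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden = albert(**inputs).last_hidden_state  # (1, seq_len, 768)

lyric_feature = hidden.mean(dim=1)               # (1, 768) lyric feature vector
```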

Fig. 3 The basic ALBERT network structure

3.3 Cross-Processing Module

The cross-processing module is designed to handle multimodal information: it facilitates interaction between the lyrics and the melody and outputs an emotion feature vector that is jointly constrained by the melody and lyric streams. This emotion feature vector is then passed to the next lyric–melody pair to ensure emotional consistency between melody and lyrics. The specific structure is depicted in Fig. 4.

Fig. 4 The cross-processing module network structure

In Fig. 4, we can see that at any given moment, the input to the Emotion–LSTM includes not only the lyric–melody pair but also a vector referred to as the "emotion vector." This vector is determined by the previous interaction between lyric and melody (the first emotion vector is randomly initialized). When the current lyric–melody pair enters the paired Emotion–LSTMs, each produces an emotion vector. These two emotion vectors then interact to create a new emotion vector that fuses information from both channels. This design ensures that, under this shared vector, the two modalities remain in an emotional state consistent with the previous lyric–melody pair and prevents the emotions of the two channels from evolving independently.
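
The sketch below illustrates this interaction pattern; nn.LSTMCell stands in for the Emotion–LSTM of Sect. 3.4, and the gated averaging used to fuse the two candidate emotion vectors is an illustrative choice rather than the exact fusion rule.

```python
import torch
import torch.nn as nn

class CrossProcessingSketch(nn.Module):
    """Sketch of Fig. 4: each lyric-melody pair enters two paired cells together
    with the previous emotion vector; each stream yields an emotion candidate,
    and the candidates are fused into the vector handed to the next pair."""
    def __init__(self, audio_dim, text_dim, emo_dim):
        super().__init__()
        self.audio_cell = nn.LSTMCell(audio_dim + emo_dim, emo_dim)
        self.text_cell = nn.LSTMCell(text_dim + emo_dim, emo_dim)
        self.fuse = nn.Linear(2 * emo_dim, emo_dim)

    def forward(self, audio_feat, text_feat, emotion, audio_state, text_state):
        ea, audio_state = self._step(self.audio_cell, audio_feat, emotion, audio_state)
        et, text_state = self._step(self.text_cell, text_feat, emotion, text_state)
        gate = torch.sigmoid(self.fuse(torch.cat([ea, et], dim=-1)))
        new_emotion = gate * ea + (1 - gate) * et   # interaction of the two candidates
        return new_emotion, audio_state, text_state

    @staticmethod
    def _step(cell, feat, emotion, state):
        h, c = cell(torch.cat([feat, emotion], dim=-1), state)
        return h, (h, c)                            # h acts as the stream's emotion candidate

# One step with random features; the first emotion vector is randomly initialized.
module = CrossProcessingSketch(audio_dim=128, text_dim=768, emo_dim=64)
emotion = torch.randn(1, 64)
a_state = (torch.zeros(1, 64), torch.zeros(1, 64))
t_state = (torch.zeros(1, 64), torch.zeros(1, 64))
emotion, a_state, t_state = module(torch.randn(1, 128), torch.randn(1, 768),
                                   emotion, a_state, t_state)
```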

3.4 Emotion–LSTM Model

The Emotion–LSTM model is an enhancement of the traditional LSTM architecture that incorporates emotion vectors as inputs. It is designed to investigate how emotional weight should be allocated between lyrical content and melody [7]. To this end, a two-polarity emotion vector is introduced that partitions emotions into distinct regions: the upper segment of the vector corresponds to positive emotions such as joy and euphoria, the middle portion to neutral emotions, and the lower section to increasingly melancholic and negative emotions. The Emotion–LSTM cell unit, depicted in Fig. 5, is the cornerstone of this framework. Incorporating the emotion vector enables the model to capture and comprehend the intricate emotional nuances conveyed through music. Notably, the datasets used in this research comprise four emotional classes, covering not only the polar states of sadness and happiness but also two additional nuanced categories: calm and healing.

Fig. 5 The Emotion–LSTM network structure

As depicted in Fig. 5, the Cum–Sum process involves the integration of both the historical emotion vector and the current input emotion vector, akin to a single interaction between past and present emotions. The historical emotion vector primarily aims to retain intense emotional information, while the current emotion vector focuses on refreshing relatively weaker emotions. The resultant emotion vector undergoes continuous iteration alongside the historical emotion vector. This iterative process facilitates the determination of which emotional levels within the song should be retained and which previously utilized emotions necessitate updating. The hierarchical emotion vector maintains its connection with the cell state “C” and the hidden state “h” within the Long Short-Term Memory (LSTM) framework, exerting a guiding influence on updates to the cell state “C” and the hidden state “h.” The specific operations will be elucidated and elaborated upon through subsequent mathematical formulations.

Input gate:

$${i}_{t}=\sigma ({W}_{i}{F}_{t}+{U}_{i}{h}_{t-1}+{b}_{i})$$

Forget gate:

$${f}_{t}=\sigma ({W}_{f}{F}_{t}+{U}_{f}{h}_{t-1}+{b}_{f})$$

Output gate:

$${o}_{t}=\sigma ({W}_{o}{F}_{t}+{U}_{o}{h}_{t-1}+{b}_{o})$$

Candidate cell state:

$${\widetilde{C}}_{t}={\text{tanh}}({W}_{c}{F}_{t}+{U}_{c}{h}_{t-1}+{b}_{c})$$

Emotion forget gate (negative emotion part):

$${\widetilde{f}}_{t}^{[0:l/2]}=\overline{{\text{cumsum}}}\left({\text{softmax}}\left({E}_{t-1}\left[0:\frac{l}{2}\right]\right)\right)$$

Emotion forget gate (positive emotion part):

$${\widetilde{f}}_{t}^{[l/2:l]}=\overline{{\text{cumsum}} }\left({\text{softmax}}\left({E}_{t-1}\left[\frac{l}{2}:l\right]\right)\right)$$

Emotion forget gate:

$${\widetilde{f}}_{t}={\text{concat}}\left[{\widetilde{f}}_{t}^{\left[0:\frac{l}{2}\right]},{\widetilde{f}}_{t}^{\left[\frac{l}{2}:l\right]}\right]$$

Emotion input gate (negative emotion part):

$${\widetilde{i}}_{t}^{[0:l/2]}=\overline{{\text{cumsum}}}\left({\text{softmax}}\left({E}_{t-1}\left[0:\frac{l}{2}\right]\right)\right)$$

Emotion input gate (positive emotion part):

$${\widetilde{i}}_{t}^{[l/2:l]}=\overline{{\text{cumsum}} }\left({\text{softmax}}\left({E}_{t-1}\left[\frac{l}{2}:l\right]\right)\right)$$

Emotion input gate:

$${\widetilde{i}}_{t}={\text{concat}}\left[{\widetilde{i}}_{t}^{\left[0:\frac{l}{2}\right]},{\widetilde{i}}_{t}^{\left[\frac{l}{2}:l\right]}\right]$$

Emotion interaction state:

$${w}_{t}={\widetilde{f}}_{t}\cdot {\widetilde{i}}_{t}$$

Update memory unit:

$${C}_{t}={w}_{t}\cdot ({f}_{t}\cdot {C}_{t-1}+{i}_{t}\cdot {\widetilde{C}}_{t})+({\widetilde{f}}_{t}-{w}_{t})\cdot {C}_{t-1}+({\widetilde{i}}_{t}-{w}_{t})\cdot {\widetilde{C}}_{t}$$

In this context, ‘t’ represents the current time step and ‘t − 1’ the previous one. ‘\({F}_{t}\)’ represents the features input at time step ‘t’ (audio or text), and ‘\({E}_{t-1}\)’ is the emotion vector carried over from the previous step. The emotion vector for audio is determined and updated jointly by ‘h’ and ‘C’ within the Emotion–LSTM to model emotional intensity; the emotion vector for lyrics can be mapped through a neural network or an emotion dictionary and is fixed after embedding. All ‘b’ terms are bias terms, and ‘σ’ denotes the sigmoid activation function. As in a standard LSTM, the input, forget, and output gates are given by the first three formulas. The candidate cell state ‘\({\widetilde{C}}_{t}\)’ carries hierarchical information; unlike in a standard LSTM, the cell state ‘\({C}_{t}\)’ is updated hierarchically under the guidance of the emotion vector, following the new rule given above.
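
For reference, the following PyTorch sketch re-implements the update rules above. Two points are assumptions on our part because they are not spelled out in the equations: the overlined cumsum is read as the complementary cumulative sum (1 − cumsum), and the hidden state is updated as in a standard LSTM, h_t = o_t ⊙ tanh(C_t).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cumsum_bar(x):
    # Assumed reading of the overlined cumsum: complementary cumulative sum.
    return 1.0 - torch.cumsum(x, dim=-1)

class EmotionLSTMCellSketch(nn.Module):
    """Sketch of the Emotion-LSTM cell: standard gates plus emotion forget/input
    gates derived from the previous emotion vector E_{t-1} (hidden size = l)."""
    def __init__(self, input_dim, l):
        super().__init__()
        self.l = l
        # A single linear layer computes W{i,f,o,c} F_t + U{i,f,o,c} h_{t-1} + b.
        self.gates = nn.Linear(input_dim + l, 4 * l)

    def forward(self, F_t, h_prev, C_prev, E_prev):
        z = self.gates(torch.cat([F_t, h_prev], dim=-1))
        i_t, f_t, o_t, c_hat = z.chunk(4, dim=-1)
        i_t, f_t, o_t = torch.sigmoid(i_t), torch.sigmoid(f_t), torch.sigmoid(o_t)
        c_hat = torch.tanh(c_hat)                            # candidate cell state

        half = self.l // 2
        neg, pos = E_prev[..., :half], E_prev[..., half:]    # negative / positive halves
        f_tilde = torch.cat([cumsum_bar(F.softmax(neg, dim=-1)),
                             cumsum_bar(F.softmax(pos, dim=-1))], dim=-1)  # emotion forget gate
        i_tilde = torch.cat([cumsum_bar(F.softmax(neg, dim=-1)),
                             cumsum_bar(F.softmax(pos, dim=-1))], dim=-1)  # emotion input gate
        w_t = f_tilde * i_tilde                              # emotion interaction state

        C_t = (w_t * (f_t * C_prev + i_t * c_hat)
               + (f_tilde - w_t) * C_prev
               + (i_tilde - w_t) * c_hat)                    # hierarchical cell-state update
        h_t = o_t * torch.tanh(C_t)                          # assumed standard LSTM output
        return h_t, C_t

# One step with toy tensors: input feature dim 192, emotion/hidden size l = 64.
cell = EmotionLSTMCellSketch(input_dim=192, l=64)
h = C = torch.zeros(1, 64)
E = torch.randn(1, 64)  # emotion vector from the previous interaction
h, C = cell(torch.randn(1, 192), h, C, E)
```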

4 Experiment

4.1 Datasets

The experimental section of this study involves two music datasets: the DEAM dataset (Database for Emotional Analysis of Music) and the FMA dataset (Free Music Archive). Both datasets are widely used in music emotion analysis and music research, providing valuable resources and materials for our study.

The DEAM Dataset is a multimodal music dataset designed to assist researchers in delving deeper into the relationship between music and emotions [24]. It comprises a substantial amount of music audio, emotional annotations, and associated metadata. The emotional annotation segment of this dataset meticulously records the emotional states within the audio, allowing researchers to analyze and understand the subtle variations in emotional expression within music compositions. The multimodal nature of the DEAM Dataset, encompassing both audio and emotional annotations, offers profound insights into the realm of music emotion analysis.

The FMA Dataset is a widely used music dataset, featuring a vast collection of audio tracks and related metadata [25]. It offers music of various genres and styles, spanning a wide range from classical to popular music. The accessibility and open nature of the FMA Dataset make it an ideal choice for music classification, music recommendation, and music research. Researchers can access audio materials from the FMA Dataset for experimentation and analysis to support their music research projects.

4.2 Experimental Environment

Our experimental setup consists of an Intel i7-13650 CPU, an NVIDIA RTX 4090 GPU, and 32 GB of memory. The software environment comprises the CUDA 11.6 general-purpose computing architecture, the cuDNN 9.0 GPU acceleration library, and the PyTorch deep learning framework.

The tool we employed for vocal separation is the open-source program Spleeter. Strictly speaking, Spleeter is itself a pretrained model, implemented in TensorFlow and built on a U-Net architecture. In addition, we use the open-source program FFmpeg for sentence-level alignment between lyrics and melody. FFmpeg is a cross-platform tool for recording, converting, and streaming digital audio and video, and it also serves as an audio and video encoder.
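
The snippet below sketches how both tools can be invoked from Python: Spleeter's documented API separates vocals from accompaniment, and FFmpeg (called through subprocess) converts a stem to 16 kHz mono WAV; the paths and parameters are illustrative and do not reproduce the exact alignment step.

```python
import subprocess
from spleeter.separator import Separator

# Pretrained two-stem model (vocals / accompaniment).
separator = Separator("spleeter:2stems")
separator.separate_to_file("song.mp3", "separated/")
# Writes separated/song/vocals.wav and separated/song/accompaniment.wav.

# Convert the vocal stem to 16 kHz mono WAV (the format expected by VGGish).
subprocess.run(["ffmpeg", "-y", "-i", "separated/song/vocals.wav",
                "-ar", "16000", "-ac", "1", "separated/song/vocals_16k.wav"],
               check=True)
```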

4.3 Evaluation Metrics

This experiment used accuracy rate, precision rate (P), recall rate (R), and F1 score to evaluate the model’s performance.

Accuracy (accuracy rate): accuracy measures the overall correctness of the model’s predictions. It is the ratio of correctly predicted samples to the total number of samples in the dataset.

$${\text{Accuracy}}=\frac{\text{Number of correct }{\text{p}}{\text{redictions}}}{\text{Total number of }{\text{s}}{\text{amples}}}$$

Recall (sensitivity or true positive rate): Recall calculates the proportion of positive samples that are correctly identified by the model. It measures the ability of the model to find all the positive samples.

$${\text{Recall}}=\frac{\text{True Positives}}{\text{True Positives }+\text{False Negatives}}$$

Precision (precision rate): Precision calculates the proportion of the model’s positive predictions that are correct. It measures the model’s ability to avoid false positives.

$${\text{Precision}}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}}$$

F-score (F1 score): the F-score is the harmonic mean of precision and recall. It is useful when both precision and recall are important and a balance between their contributions is desired.

$$F1=2\times \frac{{\text{Precision}}\times {\text{Recall}}}{{\text{Precision}}+{\text{Recall}}}$$
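
These four metrics can be computed directly with scikit-learn, as in the short sketch below; the toy labels and the macro averaging are illustrative choices.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder predictions over the four emotion classes.
y_true = ["happy", "sad", "calm", "healing", "happy", "sad"]
y_pred = ["happy", "sad", "calm", "sad",     "happy", "calm"]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                   average="macro",
                                                   zero_division=0)
print(f"Acc {acc:.3f}  P {prec:.3f}  R {rec:.3f}  F1 {f1:.3f}")
```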

4.4 Experimental Details

In the DEAM dataset, we selected a subset of songs with relatively high play counts, assuming that these songs exhibit a higher degree of emotional consistency with the provided emotion labels. We chose four emotion labels: happiness, sadness, healing, and calm, and collected approximately 6000 songs. After further filtering based on factors such as song length, audio quality, and language, we retained around 4280 songs as our candidate dataset. In the case of the FMA dataset, we collected approximately 5320 songs and eventually narrowed it down to 4605 songs for our candidate dataset after applying similar selection criteria. In Table 1, the specific partitioning of the dataset is presented.

Table 1 The size of the dataset in each emotion category

The vast majority of songs follow a natural structure known as the verse–chorus structure. The verse–chorus structure typically involves dividing a song into two primary sections: one section known as the “Verse,” which serves to establish the song’s background, and the other section referred to as the “Chorus,” which is responsible for emphasizing and expressing the emotions of the song. We have conducted detailed annotations of the verses and choruses in the music dataset.

4.5 Main Results

In Table 2, we can see that the use of multimodal methods (combining multiple data modalities) generally outperforms single-modal methods. The superiority of multimodal methods lies in their ability to integrate information from different data sources, thus enabling a more comprehensive and in-depth understanding of emotional expression in music. These methods exhibit greater adaptability in the field of music emotion analysis, given the complexity and diversity of emotions in music, which encompass aspects such as sound, lyrics, and melody. Notably, our multimodal method, “Ours,” excels on both the DEAM and FMA datasets. Taking the DEAM dataset as an example, our method exhibits an improvement of nearly 2 percentage points in accuracy compared to other methods, with significantly higher precision, recall, and F1 scores. These substantial performance improvements clearly demonstrate the superiority of multimodal methods, and highlight the leading position of our multimodal method in the field of music emotion analysis.

Table 2 The performance of the music emotion recognition task using our proposed framework and different baseline methods

This table provides compelling evidence in favor of employing multimodal methods for music emotion analysis. Multimodal methods not only deliver superior performance but also offer researchers and practitioners a deeper and more comprehensive insight, aiding in a better understanding of emotional expression in music [33, 34]. This is of significant value for research and applications in the field of music [34].

4.6 Ablation Experiments

In Table 3, we conduct a series of ablation experiments on the DEAM and FMA datasets to assess the importance of the different structural branches, comparing variants that use only the verse module or only the chorus module against our full model. Clear distinctions can be observed: on the DEAM dataset, our full method achieves an accuracy of 49.68%, while using only the verse module or only the chorus module yields 46.73% and 47.53%, respectively. On the FMA dataset, our full approach achieves an accuracy of 49.54%, whereas the verse-only and chorus-only variants yield 46.22% and 47.52%, respectively. These comparisons show that combining both structural branches outperforms either branch alone, with an accuracy improvement of roughly two to three percentage points. The content of the table is visualized in Fig. 6.

Table 3 Ablation comparison of different metrics and our model for different music levels in the music emotion recognition task
Fig. 6 Comparison of different metrics across models

In Table 4, we present the results of the Emotion–LSTM ablation experiments on the DEAM and FMA datasets. These results compare the performance of different recurrent units, including GRU, BiGRU, LSTM, BiLSTM, and Emotion–LSTM, in terms of accuracy, precision, recall, and F1 score. Emotion–LSTM performs best overall, indicating that introducing emotion vectors and letting historical and current emotions interact is highly effective for music emotion classification. Specifically, Emotion–LSTM achieves higher accuracy, precision, recall, and F1 score in these experiments, suggesting that it classifies music emotions more accurately while maintaining a better balance and avoiding over- or underfitting.

Table 4 Ablation comparison of different recurrent units (GRU, BiGRU, LSTM, BiLSTM, and Emotion–LSTM) in the music emotion recognition task

Figure 7 provides a visual representation of the table’s content, emphasizing the potential of the Emotion–LSTM model in the field of music emotion classification. It offers an effective approach to emotional analysis and lays a strong foundation for future research and applications. The introduction of emotion vectors and their interaction contributes to a more comprehensive and accurate understanding of the emotional elements present in music.

Fig. 7 Comparison of different metrics across models

In Table 5, we also investigate the influence of the number of layers in the cross-processing module on the experimental results, analyzing performance on the music emotion datasets for one to six layers. The results show that performance does not increase linearly with the number of layers: the model peaks at three layers, and moving away from three layers in either direction leads to a drop in performance. This is likely because too few layers cannot capture the complexity of musical emotion, whereas too many layers make the model overly complex and prone to overfitting. This finding emphasizes the importance of carefully selecting the number of layers when designing deep learning models. Furthermore, the effect of the layer count may differ across music emotion tasks, so in practice researchers and practitioners should choose the number of layers according to the specific task requirements and dataset characteristics to achieve the best trade-off between performance and efficiency. Figure 8 visualizes the contents of the table.

Table 5 The effect of the number of layers of the cross-processing module on the experimental results
Fig. 8 Comparison of different metrics across different numbers of layers

5 Conclusion

In this study, we introduced the MMD-MII multimodal music emotion classification model, which integrates audio and lyric data while considering the inherent structure of music, including verses and choruses. Through experiments, we verified its effectiveness in capturing emotional information within music, showing promise for applications in music recommendation, advertising, psychology, and related fields.

However, our model still has limitations. Firstly, its performance may be constrained by the quality and diversity of input data. Additionally, despite incorporating music's inherent structure, there may be challenges in adapting to different music genres. These variations in emotional content and composition across genres can impact the model's accuracy and generalization. To address these challenges, future research will focus on enhancing the diversity and quality of training data and exploring techniques for genre-specific adaptation. We aim to improve the model's robustness and applicability across a broader range of music genres. Furthermore, we plan to expand its applications to real-world scenarios, including personalized music recommendations, advertising, and emotional therapy. Our research will continue to explore innovative methods and technologies to advance multimodal music emotion analysis.

In conclusion, the MMD-MII model represents a significant advancement in the field of music emotion classification. Despite the challenges and room for improvement, we believe that this research will provide valuable insights and methodologies for future multimodal music emotion analysis and related applications, ultimately contributing to a better music experience and emotional support for both individuals and society as a whole.