1 Introduction

There is growing interest in studying music as an embodied phenomenon in musicology, as a multimodal phenomenon in music psychology, and as multimedia in music technology. This paper addresses challenges in collecting and evaluating multimodal music datasets that, in various ways, represent different data and related metadata about music performance and perception.

Music analysis has evolved from focusing on symbolic representations (using various score-based data types, such as MIDI and MusicXML) to incorporating subsymbolic representations (using data forms that capture complex information through continuous and distributed patterns in the time or frequency domain). While direct analysis from audio has been prominent in recent decades, it is crucial to consider the potential benefits of integrating symbolic representations as an intermediary step. This approach can enhance the depth and accuracy of music analysis, aligning with contemporary advocacy for a balanced integration of symbolic and subsymbolic methods (Lartillot, Submitted). Over the last decades, there has been a growing interest in incorporating embodied elements in music analysis [20, 34, 38, 39, 43, 56, 57]. This has led to datasets containing video, motion capture, physiological data, and various types of brain imagery. These data types are challenging to capture and analyze on their own. It is even harder to synchronize and analyze them together. Still, if we aim to uncover the “essence” of music, it is necessary to create datasets that fully encapsulate musical experiences. These data should, in turn, be connected to relevant contextual information about the creators, temporal and spatial contexts, events, and instruments involved.

Fig. 1 Multimodal Representation Levels: Illustration depicting the hierarchical structure of multimodality across music research domains. From left to right: signals at the physical level include physical images, motion, sound, electromagnetic radiation within the visible spectrum, and physical text. These signals undergo digitization (images, motion capture, audio, video, text files, MIDI), facilitating computational processing and analysis. At the cognitive level, the signals are interpreted, capturing the nuanced human understanding of multimodal information through vision, hearing, etc.

We are currently exploring methods for collecting, pre-processing, storing, publishing, archiving, and evaluating multimodal music datasets. This paper contributes to the current landscape by defining the concept of a multimodal music dataset, categorizing existing datasets, and addressing the drawbacks and advantages of modalities for specific music-processing tasks. Additionally, we highlight the benefits of pre-processing techniques in analyzing and publishing datasets, addressing the challenges associated with different data types. By doing so, we aim to facilitate interdisciplinary collaboration and empower researchers to navigate the complexities of multimodal music analysis.

2 Multimodal music data definitions

Searching for “music & multimodality” in online databases reveals that the term is used differently between—and even within—disciplines. To further discuss specific definitions, we suggest a model with three levels: physical, digital, and cognitive (Fig. 1). The physical level refers to natural phenomena (such as sound and light waves) that can be captured and digitized using various sensors. The cognitive level covers human sensory modalities (audition, vision, etc.) and also includes data that provide context and interpretative meaning beyond direct sensory perception, such as personal memories or emotional states. These can also be digitally represented in various ways based on capturing information from human senses (e.g., eye tracking) or cognitive processes (e.g., brain imagery techniques).

Various disciplines use the term multimodality differently. We work on the intersection between musicology, psychology, and technology and frequently experience gaps in terminology, theoretical foundations, and methodological approaches among researchers studying music. In this paper, we are particularly concerned about developing solutions that work for all three disciplines—and beyond—and can invite and inspire cross-disciplinary collaboration.

Parcalabescu et al. [73] suggest the following definition of multimodality in machine learning systems: “A machine learning task is multimodal when inputs or outputs are represented differently or are composed of distinct types of atomic units of information.” Inspired by this definition, we suggest that a multimodal music dataset can be defined as “diverse data types that offer complementary insights for a specific music processing task, regardless of source, format, or perceptual characteristics.” This definition is based on a review of existing ones, which we will discuss in the following sections.

2.1 Existing definitions

The term multimodal is commonly used in the literature to describe complex music datasets, considering both human-oriented and computational perspectives. Human-oriented modalities are commonly used in the music psychology literature, relating to sensory channels in perception and cognition [34, 56]. Audition and vision may be the most commonly studied modalities, but other modalities also shape musical experiences. Conversely, computer-oriented modality is more commonly used in the music technology literature, particularly within the sound and music computing (SMC) and music information retrieval (MIR) communities [1, 3, 23, 69, 71, 83]. Here, modality is often used to describe datasets with multiple data types like text, audio, and video. This perspective highlights the diversity of information sources and their transformation into digital forms for analysis and manipulation.

Multimedia is a related—yet different—term than multimodal. Multimedia denotes the fusion of various media or data types within a dataset. Many music datasets contain video files with embedded audio streams and can, therefore, also be classified as multimedia. Such a multimedia dataset would arguably also be multimodal since it incorporates both auditory and visual modalities. Other data types could also be included, such as images, textual information (e.g., lyrics), and musical notation.

Sometimes, relating a physical signal to a media type and a modality is easy, such as the relationship between sound waves in the air, audio files on a hard drive, and the human auditory system. Other times, it is more challenging. For example, musical notation can materialize as dots on a sheet of paper that can be digitized as a photo of the score or converted into symbolic form in a MIDI file or MusicXML. Scores are typically perceived through the visual system, but they could also be performed and, therefore, could arguably also relate to audition, motion, or the vestibular system. Musical scores can be seen as a representation of a performance, “impregnated” in musical knowledge [62], or as representations of sound actions [47]. In any case, multimodal music datasets capture the intricate dimensions of musical experiences, facilitating a holistic exploration of music.

Reviewing relevant literature, we have found four types of definitions for multimodal music datasets:

  • Sensory Modalities. This group of publications describes a (music) dataset as multimodal when its information relates to any of the human sensory modalities. In music psychology and neuroscience, modality signifies the mode through which human senses perceive information, as seen in [6]. These modalities encompass auditory, visual, gustatory, olfactory, haptic, and even the perception of balance (vestibular).

  • Communication Modalities. Some publications use multimodality to describe different communication methods: visual, linguistic, spatial, aural, and gestural [4]. Multimodal music datasets within this context may incorporate audio, gestures, and written documents, such as lyrics [5].

  • Multimedia. Multimodal data is often used interchangeably with multimedia. For instance, Essid et al. [26] discuss “media modalities,” encompassing synchronized audio rigs, cameras, inertial measurement devices, and depth sensors. Similarly, Groux and Manzolli [40] use multimedia in the context of EEG signal processing, the SiMS music server, and real-time visualizers. In another case, a music dataset was labeled multimodal due to various machine sensors providing different inputs [14]. The term multimedia encompasses diverse technological components and recording techniques, extending beyond traditional communication modes to include the simultaneous use of different media forms in a single presentation, such as images, music, and captions.

  • No Definition. More often than not, multimodality is mentioned but not defined [73]. In these instances, researchers refer to modalities as multimodal information [27, 68, 72, 75] or simply “data types.” This covers audio, video, images, and lyrics [1, 3, 23, 69, 71], as well as extracted features, body motion, physiological measurements (eye gaze, EEG, EDR, ECG, EMG, respiration), content, context [23], symbolic scores [84], MIDI [90], and even depth, thermal, and IMU data [33, 41]. Sometimes, datasets feature “additional multimodal information” [68], like album covers, video clip links, and expert notes. Some argue that multimodality involves blending diverse ways of representing information, such as visual elements, auditory components, and text [77]. Notably, there are instances where the music itself is referenced as a modality [31, 65, 95].

Despite its frequent and diverse usage, multimodality is often not explicitly defined, prompting a reconsideration of these definitions in light of evolving research and technological capabilities.

2.2 Reconsidering existing definitions

Various modalities enrich our understanding of music’s meaning and structure, underlining the importance of multimodal datasets for music analysis. While sensory modalities are often discussed, they do not fully capture the complexity of music processing and dataset descriptions. Recent research has challenged the traditional view that each sensory modality operates independently. Studies have shown significant interactions between different sensory modalities, which can influence how sensory information is processed in the brain [82]. For example, visual cues can affect how we perceive sounds and vice versa. This insight is particularly relevant for multimodal music analysis, as it underscores the importance of considering how different types of sensory information (such as that captured by audio and video) can be integrated to enhance our understanding and analysis of music.

A comprehensive definition of multimodality must consider how modern machines can analyze music more accurately and precisely than humans. Machines can perform pitch tracking, timing analysis, and music transcription with details beyond human auditory perception. Machine systems can also use technical information from encoding and compression schemes to support their analysis, which may enable a deeper level of analysis than humans can achieve. However, a system that performs well on one specific task does not necessarily perform well on another type of task. This fragmented expertise means that, in many areas, machines are still far from matching human performance, which seamlessly integrates multiple aspects of music understanding. Therefore, while machines offer precision and depth in certain tasks, integrating human insights remains crucial for a holistic understanding of music. Machines can improve their results by incorporating human information into their analysis chains, leveraging the complementary strengths of both human intuition and machine accuracy.

Similarly, communication modalities, such as speech, gesture, and facial expressions, are commonly used to study human interaction and expression. However, these modalities focus on relatively discrete and context-specific signals, while music involves a diverse array of elements, including melody, harmony, rhythm, and timbre, which often interact in intricate ways. This complexity necessitates specialized multimodal approaches for music processing, which go beyond the simpler structures typically addressed by communication modalities. Technological advancements allow music analysis beyond traditional modalities, with digital tools and algorithms broadening interpretation possibilities. While communication modalities offer insights into human interaction, understanding multimodal music data requires recognizing and integrating the complexities of music through careful consideration of its various dimensions and modalities.

Fig. 2 Illustration of how data can be seen as the same or different modality, depending on the task. The left side shows two music score types (on paper and as digital images) and two audio representations (an audio file and a MIDI file). On the right, two tasks (composer style recognition and melody extraction) determine whether the information is considered one modality (equal sign) or not (not equal sign).

Simonetta et al. [83] define modality in MIR as the specific way of digitizing music information, citing examples such as audio, lyrics, symbolic scores, album covers, etc. This definition is particularly useful for applications within MIR, emphasizing the technical processing of multiple data modalities and the digitization process. However, focusing solely on these technical aspects within MIR overlooks the need to integrate these modalities into various musical tasks and maintain flexibility in multimodal analysis. This flexibility is crucial as it allows MIR methods to be tailored to different music research contexts, ensuring their effectiveness across domains.

In parts of the machine learning literature, multimodality refers to distinct information units in inputs and outputs of a model or system [73]. This involves different data types from various input channels, even when the communication method or pathway changes during data transfer. In music processing tasks, inputs include audio signals, text data like lyrics or metadata, and visual elements such as album covers or music videos. Outputs may include genre labels, sentiment scores, or musical notation generated from audio input. It is important to note that while datasets may be constructed from diverse sources, consideration is given to the purpose and use of the dataset, whether it is for only one processing task (e.g., genre classification) or multiple (e.g., genre classification and emotion recognition). Parcalabescu et al. [73] highlight the task-dependent nature of multimodality, which emerges when a task requires information from multiple sources for its solution.

Adopting definitions of multimodal music datasets, such as those centered around human sensory modalities and digitization-centric perspectives, proves inadequate for capturing the complexity of today’s music research scene. Limiting definitions to human sensory channels restricts the definitions’ scope, neglecting the multitude of data types and encoding mechanisms present. Similarly, the MIR emphasis on digitization provides a narrow framework that falls short of conceptual depth. Therefore, we need a definition that transcends human sensory modalities and embraces broader computational processes to facilitate a comprehensive understanding of multimodal music datasets.

2.3 Proposing a new definition

We propose to define multimodality as the deliberate integration of varied information sources tailored to specific tasks. Figure 2 illustrates how various data representations can belong to the same or different modalities, depending on whether they provide complementary information to the task. For instance, a hand-written score offers insights about a composer beyond those of its digital representation. This may be significant for identifying the composer but is less important for automated melody transcription.

Fig. 3 Dataset distribution across music processing tasks. This horizontal bar chart shows the number of datasets per task. Bars represent specific tasks, sorted by dataset count, with task categories indicated by color patterns. Numerical values show the approximate availability of data samples for each dataset category.

Fig. 4 Bar plot illustrating the distribution of dataset sizes across various music genres. Each bar represents a dataset, sorted in descending order by the number of pieces. Bars are differentiated by genre: various, pop, non-western, classical, folk, jazz, and datasets with unavailable genre information. Bar heights are logarithmically scaled on the y-axis. The exact number of pieces is displayed on each bar. The plot highlights the diversity in dataset sizes and their genre categorizations in music research. Datasets that provide no size information were excluded.

In our definition, information units can be either physical signals or their digital representations, such as audio, lyrics, musical notation, and visual elements like album covers and music videos. Regardless of whether humans or machines process this information, our definition remains flexible for different types of analysis. This means that both human-centered and machine-centered approaches can be applied to these information units, making them versatile for various research and application purposes.

3 Existing multimodal music datasets

There is a growing, yet still limited, number of multimodal music datasets available [83]. Figure 3 provides an overview of those discussed in this paper, illustrating their availability for different processing tasks. In addition, Fig. 4 offers an overview of their size distribution and genre diversity. These counts are intended to provide an overview of the datasets and their distribution across different music processing tasks, highlighting the relative scale and comprehensiveness of each dataset. This list is non-exhaustive; a more complete and updated repository is available online [17].

Our dataset categorization is based on high-level music processing tasks, aligning with the macro-tasks suggested by Schedl et al. [79, 83], with additional enrichments to suit our requirements. Categories include categorization-oriented, synchronization-oriented, similarity-oriented, time-dependent representation-oriented, and multi-task datasets, each harnessing multiple modalities. We highlight the multimodal nature of each dataset, emphasizing enhanced task performance through modality integration while acknowledging the significance each modality holds for interpretability [60]. We also address the benefits and drawbacks of modality combinations for each task.

3.1 Categorization-oriented

Datasets in this category are tailored to optimize the effectiveness of various music categorization (and regression) tasks.

3.1.1 Emotion/affect recognition

Music emotion recognition is one of the most studied multimodal music processing tasks [83] and has gained significant attention in the research community. It encompasses both classification tasks, where emotions are categorized into discrete labels, and regression tasks, where continuous emotional dimensions such as valence and arousal are predicted. This dual nature of music emotion recognition allows for a comprehensive analysis of emotional responses to music, leveraging various modalities. Multimodal datasets used in emotion-centric tasks, such as CAL500 [88], combine machine-extracted audio features with human emotion annotations. Additional datasets, including those from [35, 45, 51, 54, 87, 89], incorporate labels, lyrics, and participant information, such as demographic or emotional preference information. Integrating lyrics with audio data provides additional context, enhancing emotion recognition accuracy. However, combining audio and text for emotion recognition presents challenges due to data heterogeneity and a semantic gap between textual descriptions and acoustic features, requiring careful pre-processing and fusion techniques. This is relevant to both sentiment analysis and perceived emotion.
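To make the fusion challenge concrete, the sketch below illustrates one simple late-fusion approach for continuous emotion (valence) regression: one regressor is trained per modality and their predictions are averaged. The feature dimensions and data are synthetic placeholders, not drawn from any of the cited datasets, and scikit-learn is assumed to be available.

```python
# Minimal late-fusion sketch for valence regression from audio and lyric features.
# The feature dimensions and toy data below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_songs = 200
audio_feats = rng.normal(size=(n_songs, 40))   # e.g., per-song MFCC/chroma statistics
lyric_feats = rng.normal(size=(n_songs, 300))  # e.g., averaged word embeddings per song
valence = rng.uniform(-1, 1, size=n_songs)     # continuous emotion target

Xa_tr, Xa_te, Xl_tr, Xl_te, y_tr, y_te = train_test_split(
    audio_feats, lyric_feats, valence, test_size=0.2, random_state=0)

# Train one regressor per modality, then average their predictions (late fusion).
audio_model = Ridge(alpha=1.0).fit(Xa_tr, y_tr)
lyric_model = Ridge(alpha=1.0).fit(Xl_tr, y_tr)
fused_pred = 0.5 * audio_model.predict(Xa_te) + 0.5 * lyric_model.predict(Xl_te)
print("fused prediction shape:", fused_pred.shape)
```

Late fusion sidesteps some heterogeneity issues because each modality is modeled in its own space, at the cost of ignoring cross-modal interactions that feature-level fusion could capture.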

The NAVER Music Dataset [48] combines audio with spectrogram images to analyze music emotions. PMEmo [96] goes further, incorporating physiological data. However, this adds complexity to the task of integrating these different modalities effectively, as it requires sophisticated methods to synchronize and interpret them. The IMAC [91] and EmoMV [86] datasets utilize audio-visual integration, showing superior performance in binary classification and retrieval tasks, but such integration grapples with high-dimensionality and synchronization issues, impacting computational efficiency and accuracy of emotion recognition.

3.1.2 Genre classification and auto-tagging

Combining audio and lyrics has significantly improved genre classification, as shown in studies like [66, 67]. Orio et al. [72] merged audio features with genre tags and web-extracted data. Schindler and Rauber [80] enhanced genre classification by integrating audio and video. Large datasets like MSD-I and MuMu [71] mix audio, images, and multi-label genre annotations, boosting classification models. LMD-aligned [90] uses six modalities: audio features, lyrics, symbolic data (e.g., instrument counts, tempo), model-based features (e.g., semantic descriptors), album cover images, and playlists. Each modality offers unique insights, but multimodal fusion also introduces challenges like complex fusion methods, synchronization issues, and higher computational demands. Auto-tagging assigns descriptive labels to music content using datasets like MTG-Jamendo [12], which include audio recordings and category annotations (genres, instruments, moods/themes, and top-50 charts). While descriptive labels improve accuracy, they may miss subtleties and variations in the music. Genre recognition is a specific case of tag recognition, creating a hierarchical relationship. Genre recognition classifies music into pre-defined categories, while auto-tagging uses a broader range of descriptors for comprehensive analysis. This perspective underscores the interconnectedness of assigning genre labels and detailed tags, enhancing music classification and retrieval.

3.1.3 Musical gesture classification

Classification of music-related body motion, actions, and gestures has advanced significantly through multimodal data integration. Gan et al. [30] emphasize the effectiveness of multimodal fusion in this domain. Collections like those by Chang et al. [15] and Sarasua et al. [78] showcase successful modality integration, including audio, video, motion capture, physiological data (EMG, ECG, etc.), textual descriptors of the data, and MIDI. Each modality brings unique information; for instance, audio captures sound nuances, video offers visual cues for identification, EMG and IMU sensors provide insights into muscle activity and motion, images enhance visual context, text annotations add semantic information, and MIDI data includes musical notation tied to sound-producing actions. Such a multimodal approach allows for robust musical gesture classification models.

3.1.4 Singing voice analysis

This task involves analyzing vocal traits to improve singer categorization. The Vocal92 dataset [22] combines both singing and speech modalities by including a cappella solo singing and spoken recordings from 92 singers. This multimodal approach enhances accuracy by capturing nuances such as pitch and timbre in singing and contextual information in speech. While speech and singing share vocal traits, they also exhibit distinct differences like pitch and rhythm, providing complementary data for more robust singer recognition.

3.2 Time-dependent representation-oriented

Datasets here capture temporal evolution in music for tasks like source separation and transcription. Modalities such as audio, MIDI, and label descriptions of musical content form a comprehensive foundation for understanding musical dynamics.

3.2.1 Source separation

Source separation involves isolating individual sound sources from a mix, crucial for various applications such as music production, remixing, and audio analysis. TRIOS [28] comprises separated tracks from chamber music trios with time-aligned MIDI scores, facilitating score-informed audio source separation. Including MIDI data provides a synchronized reference for the musical notes, aiding in the precise separation of individual instruments. MUSIC-21 [97] emphasizes audiovisual performances, while CocoChorales [94] includes audio, MIDI, and note annotations, enhancing the dataset’s utility for tasks like source separation by offering multiple modalities and fine-grained musical information.

3.2.2 Piano tutoring

The dataset by Benetos et al. [9] combines audio recordings with manual and automated MIDI transcriptions, offering a resource for piano tutoring research. Audio captures expressive nuances, dynamics, and timbre, while MIDI transcriptions provide symbolic representations of musical notes and timing. This integration can enhance piano tutoring systems by delivering an audio experience alongside structured symbolic representations, facilitating learning. However, aligning audio signals with MIDI notes requires complex pre-processing and synchronization techniques.

3.2.3 Music segmentation and structure analysis

Music segmentation and structure analysis are complementary but distinct tasks. Music segmentation, a binary classification task, focuses on identifying structural boundaries. Datasets by Cheng et al. [16] and Hargreaves et al. [42] provide annotated audio recordings that capture expressive nuances and precise symbolic representations. Challenges include differing annotation interpretations and audio mismatches, requiring careful synchronization. In contrast, music structure analysis identifies relationships and hierarchies among musical parts. Gregorio et al. [37] offer audio and MIDI data for analyzing jazz improvisation, helping to understand how segments form larger structures. Despite the enriched data, challenges remain in establishing accurate hierarchies and aligning them with audio.

3.2.4 Music transcription

Music transcription involves converting audio recordings, or other forms of musical representation, into musical notation. Camera-PrIMuS aids diverse transcription methods with real music staves featuring various distortions and formats [13]. The Dataset of Norwegian Hardanger Fiddle Recordings offers beat annotations alongside audio and note annotations, crucial for beat tracking and beat-aware transcription research [24, 52, 53]. MIREX and MIREX-multi-f0 datasets provide audio and beat annotations for extracting pitch information [7, 44]. MAPS offers audio recordings and MIDI files for piano transcription [25], while a dataset for polyphonic piano transcription includes audio and MIDI data from polyphonic recordings [76]. N20EM supports lyric transcription with audio, video, and IMU data [41], and CocoChorales combines audio, MIDI, and note annotation data for music transcription tasks [94]. One major challenge of combining MIDI, note annotations, and audio is the variability in MIDI note timing and pitch accuracy, nuanced annotations, and audio complexities, which can lead to transcription discrepancies. Robust synchronization and sophisticated algorithms are needed to leverage modalities effectively and minimize errors while preserving fidelity.
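As an illustration of how such transcription discrepancies can be quantified, the sketch below compares estimated notes against a reference using note-level precision and recall, assuming the mir_eval package is available; the note lists are invented for illustration.

```python
# Sketch of evaluating note-level transcription against a reference with mir_eval.
# Intervals are (onset, offset) pairs in seconds, pitches in Hz; values are made up.
import numpy as np
import mir_eval

ref_intervals = np.array([[0.50, 1.00], [1.00, 1.50], [1.50, 2.20]])
ref_pitches   = np.array([440.0, 493.88, 523.25])            # A4, B4, C5
est_intervals = np.array([[0.52, 0.98], [1.03, 1.49], [1.55, 2.10]])
est_pitches   = np.array([440.0, 493.88, 554.37])            # last note mis-transcribed

precision, recall, f1, overlap = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05)  # 50 ms tolerance reflects timing variability in annotations
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```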

3.2.5 Instrument performance analysis

Perez-Carrillo et al. [75] introduced a dataset for evaluating algorithms related to guitar fingering and plucking controls. It includes audio recordings, 3D motion data, and information from the musical score (such as note onsets or ground truth for the plucking parameters). Audio captures sonic nuances, while other modalities provide details about physical aspects, instrument-specific actions, and the musical context. Such a comprehensive dataset enhances the accuracy of estimation algorithms and facilitates a deeper exploration of the interplay between musicians and their instruments in guitar performance analysis.

3.3 Music similarity-oriented

These datasets aim to study relationships and similarities between musical pieces.

3.3.1 Song retrieval

This task involves using similarity-matching strategies for song identification and retrieval. The Million Song Dataset (MSD) [10] has been important for various identification purposes. Correya et al. [19] used MSD for cover identification, demonstrating its effectiveness in distinguishing different versions of songs. Additionally, MSD is a valuable resource for audio and lyrics retrieval, enhancing retrieval system accuracy [95]. MSD500 [93] is designed for tag-based song retrieval, incorporating audio and descriptive tags for more nuanced searches. Integrating such multimodal information in MSD500 has shown superior performance and advanced music retrieval techniques. However, tag-based song retrieval that integrates metadata with audio content may face challenges because subjective tagging naturally varies and can be inconsistent in practice. This variability can lead to inconsistencies in how songs are categorized or retrieved within the system, potentially limiting its effectiveness in accurately capturing user preferences.

3.3.2 Music exploration and discovery

This task enhances music discovery by combining recommendation strategies with exploration opportunities, balancing familiarity with the excitement of new content. Music4All-Onion [69] stands out for integrating audio and text and refining content-based recommendations with nuanced and personalized suggestions. Watanabe et al. [92] introduce a dataset featuring lyrics, audio, and artist IDs, achieving good performance in the task of query-by-blending, encouraging user interaction. The dataset supports users in exploring and manipulating these integrated musical components to generate new queries. Additionally, Poltronieri et al. [77] propose a dataset of audio and melodic and harmonic annotations, enriching content-based music similarity exploration. Variability in how users of a music exploration system interpret textual descriptions of music content complicates the use of this information for recommendations, since individual differences lead to varied preferences and expectations.

3.4 Music generation

Music generation does not fit into the analysis-focused categories mentioned above. It is a creative process focused on crafting new musical content, including melodies, harmonies, rhythms, or complete compositions. The dataset in [40] is used to generate music based on brain activity. In [29], datasets such as URMP [58], AtinPiano, and MUSIC [97] are used for music generation from videos, showcasing the efficacy of combining audio and video. MusicCaps [2] takes a novel approach, generating music from textual descriptions (such as “an upbeat bass beat accompanying a reverberated guitar riff”) and expanding creative possibilities through multimodality.

3.5 Multi-task

Similar to foundational models for various music processing tasks [31], several multimodal music datasets are designed to address multiple objectives. These datasets cover more than one macro task and incorporate diverse modalities, such as audio, video, text, music scores, MIDI, and annotations, facilitating a broad range of MIR tasks. For instance, FMA [21], RWC [36], and DALI [68] are some examples, supporting tasks spanning audio analysis, music genre classification, and semantic music retrieval. ENST-Drums [32] enables event classification, drum track transcription, and polyphonic music extraction. Essid et al. [26] focus on synchronization, time-based movement representation, and dance movement recognition.

The URMP dataset [58] provides a comprehensive collection with audio, video, scores (MIDI and PDF), and annotations, suitable for diverse tasks including music transcription and performance analysis. MedleyDB [11] enhances melody transcription and genre classification capabilities with audio recordings, genre labels, and note annotations. Additionally, the CompMusic Indian art music dataset and the Saraga music research corpus [74] contribute to the multi-task category with rich annotations for Indian raga classification, supporting melodic motif identification and time-dependent pattern detection. The Turkish makam dataset introduces innovative methods for tonic identification in audio recordings [81], enhancing analysis in expressive and intonation domains. Finally, the Song Describer dataset (SDD) [63] supports tasks like music captioning, text-to-music generation, and music-language retrieval, featuring paired audio and caption data for multimodal learning.

4 Multimodal music dataset pre-processing

Preparing multimodal music datasets involves handling various data types concurrently. Unlike unimodal datasets that focus on a single data type, multimodal datasets require integrated techniques for multimodal analysis. This section explores the primary strategies for pre-processing multimodal music datasets.

4.1 Crossmodal alignment

Establishing connections across modalities is a key challenge in multimodal dataset creation. Developing techniques for aligning audio, scores, text, and visual data based on shared features is essential for holistic analyses. This alignment is crucial for tasks like source separation and audio-to-score synchronization [83]. Cross-referencing metadata from multiple sources can help verify the accuracy of the information and identify discrepancies.

Clearly labeling and documenting different versions or performances of the same piece of music can maintain the integrity of the dataset [78]. Using advanced audio fingerprinting, which is now more robust to tempo and pitch variations, ensures audio files match metadata accurately. This technology, though challenging for transcription, offers efficient and precise indexing of music data, reducing errors in dataset management. By integrating these methods, consistency between audio and metadata can be maintained, supporting reliable music retrieval and analysis [70]. When legal constraints require pre-extracted features, providing detailed documentation of the feature extraction process can maintain transparency [18]. Encouraging community contributions through platforms can also facilitate collaborative annotation efforts, helping identify and rectify errors in the dataset.
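As one possible illustration of crossmodal alignment, the sketch below aligns a symbolic score to an audio recording by comparing chroma features with dynamic time warping (DTW). It assumes the librosa and pretty_midi packages; the file names are placeholders rather than references to any particular dataset.

```python
# Sketch of chroma-based audio-to-score alignment via dynamic time warping (DTW),
# assuming librosa and pretty_midi; file paths below are hypothetical placeholders.
import librosa
import pretty_midi

hop = 512
y, sr = librosa.load("performance.wav")               # hypothetical audio recording
audio_chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)

pm = pretty_midi.PrettyMIDI("score.mid")              # hypothetical symbolic score
score_chroma = pm.get_chroma(fs=sr / hop)             # match the audio frame rate
score_chroma = librosa.util.normalize(score_chroma, axis=0)

# DTW over cosine distance yields a warping path mapping score frames to audio frames.
D, wp = librosa.sequence.dtw(X=score_chroma, Y=audio_chroma, metric="cosine")
score_times = wp[:, 0] * hop / sr
audio_times = wp[:, 1] * hop / sr
# The path is returned from end to start, so the last entry is the first aligned pair.
print("first aligned frame pair (seconds):", score_times[-1], audio_times[-1])
```

The resulting warping path can then be used to transfer score-level annotations (e.g., note onsets or section labels) onto the audio timeline.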

4.2 Handling missing data across modalities

Multimodal datasets often exhibit variations in data availability across modalities due to restrictions on public access to specific types of data. Despite these challenges, multimodal music processing can remain effective even when certain data types are unavailable. Effective pre-processing involves addressing missing data to prevent biases and maintain dataset integrity [23]. Several strategies can be employed to handle missing data. Interpolation and imputation techniques, such as those described by Perez-Carrillo et al. [75], estimate missing values based on patterns in the available data.

In cases where missing data cannot be effectively imputed or synthesized, excluding entries with incomplete data may be necessary. However, this approach should be used cautiously to avoid significantly reducing the dataset’s size [61]. Designing the dataset with redundancy across modalities can also ensure robustness, as other modalities can provide enough context to mitigate the impact of missing data [6].
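The following minimal sketch illustrates one imputation strategy, filling missing values in a single modality's feature matrix from patterns in the available data using k-nearest-neighbour imputation; scikit-learn is assumed, and the feature matrix is synthetic.

```python
# Minimal sketch of imputing missing per-song features in one modality using
# KNN imputation; the feature matrix below is synthetic, not from a real dataset.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
motion_feats = rng.normal(size=(100, 8))            # e.g., motion-capture summary features
motion_feats[rng.random((100, 8)) < 0.1] = np.nan   # simulate 10% missing values

imputer = KNNImputer(n_neighbors=5)
motion_filled = imputer.fit_transform(motion_feats)
print("remaining NaNs:", np.isnan(motion_filled).sum())
```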

4.3 Integrated feature extraction

Feature extraction is crucial for numerical representation across different modalities, supporting tasks like genre classification [72] and emotion recognition [89]. While aligning sampling rates is often emphasized, strict uniformity across all modalities is not always necessary and may vary by application; flexible sampling rates can still maintain temporal coherence. Techniques such as canonical correlation analysis (CCA) or deep learning methods are employed to project features from different modalities into unified representation spaces, capturing intermodal correlations effectively [49]. These integrated approaches facilitate a comprehensive analysis by harmonizing data from varied sources, ensuring that essential aspects across different modalities are captured.
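A minimal sketch of the CCA idea is shown below, projecting synthetic audio and video features into a shared space where their correlation can be inspected; scikit-learn is assumed, and the features are random stand-ins for real extractions.

```python
# Sketch of projecting audio and video features into a shared space with canonical
# correlation analysis (CCA); the synthetic features stand in for real extractions.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
shared = rng.normal(size=(300, 5))                     # latent structure both views share
audio_feats = shared @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))
video_feats = shared @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(300, 30))

cca = CCA(n_components=5)
cca.fit(audio_feats, video_feats)
audio_proj, video_proj = cca.transform(audio_feats, video_feats)

# Correlation of the first canonical pair indicates how well the two views align.
print("first-pair correlation:", np.corrcoef(audio_proj[:, 0], video_proj[:, 0])[0, 1])
```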

4.4 Consistent data normalization

Normalizing data ensures that different modalities, which may have varying scales and magnitudes, contribute fairly to the analysis [83]. Consistent data normalization prevents any single modality from dominating the results due to differences in magnitude, fostering a balanced and comprehensive understanding of the multimodal dataset. Normalization ensures consistency and comparability across diverse datasets and modalities, enhancing the robustness of feature extraction and interpretation. While some analytical methods can manage normalization internally, applying normalization techniques ensures a standardized approach, which is critical for the accuracy and reliability of multimodal analysis.
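The short sketch below illustrates per-modality normalization before concatenation, so that a large-magnitude modality does not dominate the fused representation; the feature matrices are synthetic and scikit-learn is assumed.

```python
# Sketch of normalizing each modality separately so that no single modality
# dominates a concatenated feature vector; the feature matrices are synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
audio_feats = rng.normal(loc=0, scale=1e3, size=(50, 12))   # large-magnitude features
lyric_feats = rng.normal(loc=0, scale=1e-2, size=(50, 6))   # small-magnitude features

audio_scaled = StandardScaler().fit_transform(audio_feats)
lyric_scaled = StandardScaler().fit_transform(lyric_feats)
fused = np.hstack([audio_scaled, lyric_scaled])             # balanced contributions
print(fused.std(axis=0).round(2))                           # roughly unit variance per column
```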

4.5 Coordinated dataset splitting

For machine learning-oriented music processing tasks, dividing the dataset into training, validation, and test sets requires careful consideration to maintain label distribution and multimodal coherence [95]. Stratified splitting ensures that each set maintains the same proportion of labels as the original dataset, preserving the statistical properties across splits [55]. Maintaining temporal consistency is crucial for datasets with temporal data to avoid disrupting the sequence and is essential for tasks like audio-to-score synchronization and video analysis [8].

Ensuring multimodal integrity is vital, meaning that all modalities for a given piece of data must be included in the same split to avoid undermining multimodal analyses [33]. Additionally, the balanced distribution of various attributes (e.g., genres, emotions) across splits prevents any set from being biased towards specific attributes, which is important for training robust models [64].
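One way to combine stratification with multimodal integrity is a group-aware stratified split, in which every excerpt of a piece (and hence all of its modalities) stays in the same fold. The sketch below assumes scikit-learn 1.0 or later for StratifiedGroupKFold and uses synthetic piece identifiers and labels.

```python
# Sketch of a split that keeps every excerpt of a piece in the same fold while
# roughly preserving label proportions; requires scikit-learn >= 1.0. Data are synthetic.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(4)
n_excerpts = 400
piece_ids = rng.integers(0, 100, size=n_excerpts)    # 100 pieces, several excerpts each
genres = piece_ids % 4                                # one genre label per piece
features = rng.normal(size=(n_excerpts, 16))          # stand-in for any modality

splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(features, genres, groups=piece_ids):
    overlap = set(piece_ids[train_idx]) & set(piece_ids[test_idx])
    print("pieces leaking across the split:", len(overlap))  # should be 0 per fold
    break
```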

5 Multimodal music dataset construction and evaluation

This section presents an integrated approach to constructing and evaluating multimodal music datasets, outlining essential considerations, best practices, and criteria to ensure the development of high-quality datasets suitable for various multimodal music processing tasks.

Creating a multimodal music dataset is a meticulous process aimed at ensuring quality, relevance, and applicability to various research tasks. Here, we outline some criteria for developing a comprehensive multimodal music dataset, providing an overview of the current state-of-the-art. Furthermore, assessing a multimodal music dataset’s quality, usefulness, and applicability is crucial for ensuring its value to the research community and beyond. We also outline criteria for evaluating the efficacy of a constructed multimodal music dataset.

Despite proposing rigorous criteria for dataset construction and evaluation, it is essential to remember that adopting “good enough” practices [46] acknowledges the practical constraints and real-world applications of multimodal music research. Balancing academic rigor with pragmatic considerations ensures that datasets are not only scientifically robust but also feasible for implementation in diverse real-life scenarios.

5.1 Diversity and representation

While datasets like MTG Jamendo [12] and DALI [68] encompass diverse music genres and tasks, achieving this breadth requires substantial computational and time resources. Such comprehensive datasets are exceptional rather than common in dataset creation.

Typically, datasets are designed to address specific research questions or applications, focusing on particular musical genres, historical periods, or cultural contexts. For instance, the Hardanger Fiddle Dataset [24] and genre-specific pattern recognition studies [85] illustrate this targeted approach. However, the creation of a multimodal music dataset should involve the deliberate consideration of its potential future applications. This includes ensuring clear metadata, documentation of data transformations, and adherence to data-sharing standards. Doing so enables future researchers to extract maximum value from the dataset for various unforeseen research questions and applications.

5.2 Data quality and consistency

High-quality data is essential for meaningful multimodal music analysis. This includes ensuring optimal and consistent data representations, formats, and sampling rates [70]. Recordings should have a high sampling rate and minimal noise [25]. Evaluating multimodal music datasets involves both technical and perceptual assessments. Technical measures include bitrate and sampling rate [21]. Listener studies are equally important, providing subjective ratings from diverse groups to assess perceived quality. Additionally, high-quality datasets should enhance the listening experience and provide insights into musical structures, capturing elements such as harmony, rhythm, and timbre, along with detailed annotations [54].
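A basic technical check of this kind can be automated. The sketch below scans a folder of recordings and flags files below a target sampling rate or with an unexpected channel count, assuming the soundfile package; the directory layout and thresholds are placeholders.

```python
# Sketch of a technical quality check over a folder of recordings, assuming the
# soundfile package; the directory path and 44.1 kHz target are illustrative choices.
from pathlib import Path
import soundfile as sf

MIN_RATE = 44_100
for wav_path in Path("dataset/audio").glob("*.wav"):   # hypothetical dataset layout
    info = sf.info(str(wav_path))
    if info.samplerate < MIN_RATE:
        print(f"{wav_path.name}: {info.samplerate} Hz is below the target rate")
    if info.channels < 2:
        print(f"{wav_path.name}: mono recording, check against the dataset specification")
```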

5.3 Annotation and ground truth

Accurate annotation is crucial for supervised learning in multimodal music datasets [88]. Annotations, including genre labels, sentiment scores, and artist information, serve as “ground truth” for tasks like genre classification [36], emotion recognition [50], and audio signal separation [12]. However, “ground truth” is subjective and culturally specific, reflecting human interpretations influenced by cultural contexts, individual perceptions, and domain expertise. Hence, while annotations form the foundation for multimodal music processing tasks, their subjective nature must be acknowledged and interpreted accordingly.

To achieve high-quality annotations, combining crowdsourcing with expert validation is effective [59]. Crowdsourcing platforms can handle initial annotation tasks, with data curators validating a subset to ensure high standards. An iterative annotation and review process improves quality over time without initial high costs.
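Agreement between crowd annotations and the expert-validated subset can be monitored with a chance-corrected statistic such as Cohen's kappa, as in the brief sketch below; scikit-learn is assumed and the label sequences are illustrative.

```python
# Sketch of checking agreement between a crowd annotator and an expert on a
# validated subset using Cohen's kappa; the label sequences below are illustrative.
from sklearn.metrics import cohen_kappa_score

crowd_labels  = ["happy", "sad", "happy", "calm", "sad", "happy", "calm", "sad"]
expert_labels = ["happy", "sad", "calm",  "calm", "sad", "happy", "calm", "happy"]

kappa = cohen_kappa_score(crowd_labels, expert_labels)
print(f"Cohen's kappa on the validation subset: {kappa:.2f}")
# Low agreement can flag label categories that need clearer annotation guidelines.
```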

Open-source tools for data alignment reduce reliance on expensive proprietary software. Automated preprocessing detects and flags data inconsistencies [83]. Machine learning models can aid anomaly detection during manual reviews [58]. Collaborating with the research community promotes sharing best practices and resources to tackle dataset construction challenges effectively.

5.4 Modality interplay

An effective multimodal dataset should enable researchers to explore the relationships between different modalities, such as audio, text, and visual components. The dataset should provide sufficient data to study correlations between modalities. For example, the RWC Music Database [36] and the DALI dataset [68] provide linked audio, scores, and lyrics, facilitating multimodal analysis.

5.5 Generalization and robustness

A high-quality multimodal music dataset should facilitate model generalization to unseen data, encompassing variations within each modality, such as diverse timbres, lyrical themes, and visual styles. There are no established diversity metrics that could be used to quantify these aspects for each modality’s content, so these variations are typically evaluated qualitatively through the judgment of the dataset creation team members. This qualitative assessment considers the dataset’s breadth and richness of musical genres, lyrical topics, artistic styles, and other modality-specific attributes. Performance on established benchmark tasks should be tested and reported, with datasets like MedleyDB [11] often used for source separation tasks, providing benchmark results for comparison.

5.6 Usability and accessibility

It is crucial to design the dataset with scalability and ease of use in mind. The dataset should be easily accessible, with clear documentation and user-friendly interfaces. Accessibility can be quantified by the dataset’s availability (e.g., hosted on public repositories) and the comprehensiveness of its documentation [18]. The dataset should allow for future expansions and updates, with mechanisms for adding new data, such as standardized formats and modular data structures.
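One way to support such modular expansion is a simple machine-readable manifest that links every modality of a piece and records its license. The sketch below shows one possible layout; the field names and paths are hypothetical rather than an established standard.

```python
# Sketch of a machine-readable manifest linking all modalities of one piece,
# easing future additions; field names and paths are hypothetical examples.
import json

entry = {
    "piece_id": "piece_0001",
    "modalities": {
        "audio": "audio/piece_0001.wav",
        "video": "video/piece_0001.mp4",
        "score": "scores/piece_0001.musicxml",
        "motion_capture": "mocap/piece_0001.csv",
    },
    "annotations": {"genre": "folk", "emotion": "calm"},
    "license": "CC-BY-4.0",
}
with open("manifest.jsonl", "a") as f:   # one JSON object per line, append-only
    f.write(json.dumps(entry) + "\n")
```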

5.7 Real-world impact

Assessing a dataset’s impact goes beyond its influence within specific research fields and instead focuses on its broader implications and usability across various applications. The quality of a dataset can be evaluated by its ability to stimulate interdisciplinary research collaborations [12] and support a wide range of projects [19]. Moreover, evaluating a dataset’s practical applications in domains such as music recommendation systems, educational tools, and music analysis software provides insight into its real-world utility. This includes its effectiveness in facilitating new discoveries, enhancing educational resources, and empowering innovative applications across different user communities.

Understanding a dataset’s scalability and adaptability is crucial. While metrics like academic impact are relevant, the true measure of impact lies in its ability to transcend disciplinary boundaries and contribute to societal and cultural advancements through accessible and impactful applications.

5.8 Legal constraints and limitations

Understanding legal constraints and copyright is crucial in multimodal dataset construction. Comprehensive research into copyright laws across relevant jurisdictions is essential. Legal experts specializing in intellectual property can assist in navigating complex landscapes and ensuring compliance. Open licenses like Creative Commons (CC0, CC-BY) facilitate legal sharing and reuse [19]. For proprietary data, negotiating licenses for academic and research purposes with clearly defined terms is important to prevent legal disputes.

To comply with privacy laws (e.g., GDPR), removing personally identifiable information and applying fair use principles are recommended [18]. Rights management tools help manage permissions and limitations associated with datasets. Collaboration with institutional repositories and digital libraries supports handling copyrighted materials. Detailed dataset documentation, including legal status, licenses, transformations, and fair use analysis, enhances transparency [18]. Clear guidelines on data usage, sharing, and citation prevent unintentional legal violations.

6 Discussion

Navigating the construction and evaluation of multimodal music datasets reveals both promising avenues and persistent challenges. For example, there should be flexibility in dataset usage, allowing researchers to explore diverse applications and maximize the utility of available resources. While datasets may be categorized hierarchically, there are instances where a dataset can serve a completely different task than initially intended. For instance, the Hardanger fiddle dataset [53], primarily used for music transcription, contains additional emotional information that has not been fully utilized in previous studies. CocoChorales [94] contains information useful for automated music composition. Similarly, Vocal92 could be used for speech-to-song generation [22], MTG-Jamendo for genre-based recommendation systems and semantic music search [12], and the dataset of Benetos et al. could be used for gesture recognition for expressive playing analysis [9].

One of the primary challenges is implementing multimodal fusion approaches that effectively address the specific requirements and nuances of music processing tasks. This includes managing feature contributions, selecting appropriate modalities, and navigating the complex dimensions of multimodal data.

Semantic alignment between different modalities remains a complex task, necessitating further research to refine techniques and understand contextual nuances. Collaboration across various domains is crucial in ensuring that multimodal datasets are comprehensive, accurate, and accessible. Professionals spanning musicology, psychology, and technology must work together to overcome these challenges.

7 Conclusion

In response to the communication gaps between music research disciplines regarding multimodality, our paper introduces a novel perspective on multimodal music datasets that bridges musicology, music psychology, and music technology. Multimodal music datasets are pivotal in advancing various music processing tasks and enriching our understanding of musical experiences. A streamlined framework may simplify the complexities associated with signal types, digitization processes, and data interpretation by humans and machines. This framework facilitates interdisciplinary collaboration and empowers researchers to explore diverse applications beyond traditional boundaries.

Our categorization scheme for multimodal datasets highlights strategic combinations of modalities and their implications for specific music-processing tasks. This enables researchers to make informed decisions when leveraging datasets across different domains, fostering innovation in music analysis. At the same time, we emphasize adopting “good enough practices” in dataset construction and evaluation, prioritizing pragmatic solutions that balance thoroughness with feasibility, and ensuring datasets are fit for purpose without excessive complexity.

Moving forward, prioritizing collaboration among musicologists, music psychologists, and music technologists is essential. Each group brings unique perspectives and expertise, which are essential for overcoming complex research challenges in multimodal music processing. Musicologists and psychologists, as primary data collectors and domain experts, play a pivotal role in shaping dataset construction and ensuring relevance to real-world music contexts. However, they often do not have the skills or interest to create large multimodal music datasets ready for machine learning. This is where fruitful collaborations with technologists can help move the field(s) forward. We advocate for adaptable and nuanced tools capable of accommodating the complexity and diversity inherent in musical expression rather than striving for universal solutions that may oversimplify or exclude critical aspects of music.