1 Introduction

There is growing interest in studying music as an embodied phenomenon in musicology, as a multimodal phenomenon in music psychology, and as multimedia in music technology. This paper addresses challenges in collecting and evaluating multimodal music datasets that, in various ways, represent different data and related metadata about music performance and perception.

Music analysis has evolved from focusing on symbolic representations (using various score-based data types, such as MIDI and MusicXML) to incorporating subsymbolic representations (using data forms that capture complex information through continuous and distributed patterns in the time or frequency domain). While direct analysis from audio has been prominent in recent decades, it is crucial to consider the potential benefits of integrating symbolic representations as an intermediary step. This approach can enhance the depth and accuracy of music analysis, aligning with contemporary advocacy for a balanced integration of symbolic and subsymbolic methods (Lartillot, Submitted). Over the last decades, there has been a growing interest in incorporating embodied elements in music analysis [20, 34, 38, 39, 43, 56, 57]. This has led to datasets containing video, motion capture, physiological data, and various types of brain imagery. These data types are challenging to capture and analyze on their own. It is even harder to synchronize and analyze them together. Still, if we aim to uncover the “essence” of music, it is necessary to create datasets that fully encapsulate musical experiences. These data should, in turn, be connected to relevant contextual information about the creators, temporal and spatial contexts, events, and instruments involved.

Fig. 1 Multimodal Representation Levels: Illustration depicting the hierarchical structure of multimodality across music research domains. From left to right: signals at the physical level include physical images, motion, sound, electromagnetic radiation within the visible spectrum, and physical text. These signals undergo digitization (images, motion capture, audio, video, text files, MIDI), facilitating computational processing and analysis. At the cognitive level, the signals are interpreted, capturing the nuanced human understanding of multimodal information through vision, hearing, etc.

We are currently exploring methods for collecting, pre-processing, storing, publishing, archiving, and evaluating multimodal music datasets. This paper contributes to the current landscape by defining the concept of a multimodal music dataset, categorizing existing datasets, and addressing the drawbacks and advantages of modalities for specific music-processing tasks. Additionally, we highlight the benefits of pre-processing techniques in analyzing and publishing datasets, addressing the challenges associated with different data types. By doing so, we aim to facilitate interdisciplinary collaboration and empower researchers to navigate the complexities of multimodal music analysis.

2 Multimodal music data definitions

Searching for “music & multimodality” in online databases reveals that the term is used differently between—and even within—disciplines. To further discuss specific definitions, we suggest a model with three levels: physical, digital, and cognitive (Fig. 1). The physical level refers to natural phenomena (such as sound and light waves) that can be captured and digitized using various sensors. The cognitive level covers human sensory modalities (audition, vision, etc.) and also includes data that provide context and interpretative meaning beyond direct sensory perception, such as personal memories or emotional states. These can also be digitally represented in various ways based on capturing information from human senses (e.g., eye tracking) or cognitive processes (e.g., brain imagery techniques).

Various disciplines use the term multimodality differently. We work on the intersection between musicology, psychology, and technology and frequently experience gaps in terminology, theoretical foundations, and methodological approaches among researchers studying music. In this paper, we are particularly concerned about developing solutions that work for all three disciplines—and beyond—and can invite and inspire cross-disciplinary collaboration.

Parcalabescu et al. [73] suggest the following definition of multimodality in machine learning systems: “A machine learning task is multimodal when inputs or outputs are represented differently or are composed of distinct types of atomic units of information.” Inspired by this definition, we suggest that a multimodal music dataset can be defined as “diverse data types that offer complementary insights for a specific music processing task, regardless of source, format, or perceptual characteristics.” This definition is based on a review of existing ones, which we will discuss in the following sections.

2.1 Existing definitions

The term multimodal is commonly used in the literature to describe complex music datasets, considering both human-oriented and computational perspectives. Human-oriented modalities are commonly used in the music psychology literature, relating to sensory channels in perception and cognition [34, 56]. Audition and vision may be the most commonly studied modalities, but other modalities also shape musical experiences. Conversely, computer-oriented modality is more commonly used in the music technology literature, particularly within the sound and music computing (SMC) and music information retrieval (MIR) communities [1, 3, 23, 69, 71, 83]. Here, modality is often used to describe datasets with multiple data types like text, audio, and video. This perspective highlights the diversity of information sources and their transformation into digital forms for analysis and manipulation.

Multimedia is a related—yet different—term than multimodal. Multimedia denotes the fusion of various media or data types within a dataset. Many music datasets contain video files with embedded audio streams and can, therefore, also be classified as multimedia. Such a multimedia dataset would arguably also be multimodal since it incorporates both auditory and visual modalities. Other data types could also be included, such as images, textual information (e.g., lyrics), and musical notation.

Sometimes, relating a physical signal to a media type and a modality is easy, such as the relationship between sound waves in the air, audio files on a hard drive, and the human auditory system. Other times, it is more challenging. For example, musical notation can materialize as dots on a sheet of paper that can be digitized as a photo of the score or converted into symbolic form in a MIDI file or MusicXML. Scores are typically perceived through the visual system, but they could also be performed and, therefore, could arguably also relate to audition, motion, or the vestibular system. Musical scores can be seen as a representation of a performance, “impregnated” in musical knowledge [62], or as representations of sound actions [47]. In any case, multimodal music datasets capture the intricate dimensions of musical experiences, facilitating a holistic exploration of music.

Reviewing relevant literature, we have found four types of definitions for multimodal music datasets:

  • Sensory Modalities. This group of publications describes a (music) dataset as multimodal when its information relates to any of the human sensory modalities. In music psychology and neuroscience, modality signifies the mode through which human senses perceive information, as seen in [6]. These modalities encompass auditory, visual, gustatory, olfactory, haptic, and even the perception of balance (vestibular).

  • Communication Modalities. Some publications use multimodality to describe different communication methods: visual, linguistic, spatial, aural, and gestural [4]. Multimodal music datasets within this context may incorporate audio, gestures, and written documents, such as lyrics [5].

  • Multimedia. Multimodal data is often used interchangeably with multimedia. For instance, Essid et al. [26] discuss “media modalities,” encompassing synchronized audio rigs, cameras, inertial measurement devices, and depth sensors. Similarly, Groux and Manzolli [40] use multimedia in the context of EEG signal processing, the SiMS music server, and real-time visualizers. In another case, a music dataset was labeled multimodal due to various machine sensors providing different inputs [14]. The term multimedia encompasses diverse technological components and recording techniques, extending beyond traditional communication modes to include the simultaneous use of different media forms in a single presentation, such as images, music, and captions.

  • No Definition. More often than not, multimodality is mentioned but not defined [73]. In these instances, researchers refer to modalities as multimodal information [27, 68, 72, 75] or simply “data types.” This covers audio, video, images, and lyrics [1, 3, 23, 69, 71], as well as extracted features, body motion, physiological measurements (eye gaze, EEG, EDR, ECG, EMG, respiration), content, context [23], symbolic scores [84], MIDI [90], and even depth, thermal, and IMU data [33, 41]. Sometimes, datasets feature “additional multimodal information” [68], like album covers, video clip links, and expert notes. Some argue that multimodality involves blending diverse ways of representing information, such as visual elements, auditory components, and text [77]. Notably, there are instances where the music itself is referenced as a modality [31, 65, 95].

Despite its frequent and diverse usage, multimodality is often not explicitly defined, prompting a reconsideration of these definitions in light of evolving research and technological capabilities.

2.2 Reconsidering existing definitions

Various modalities enrich our understanding of music’s meaning and structure, underlining the importance of multimodal datasets for music analysis. While sensory modalities are often discussed, they do not fully capture the complexity of music processing and dataset descriptions. Recent research has challenged the traditional view that each sensory modality operates independently. Studies have shown significant interactions between different sensory modalities, which can influence how sensory information is processed in the brain [82]. For example, visual cues can affect how we perceive sounds and vice versa. This insight is particularly relevant for multimodal music analysis, as it underscores the importance of considering how different types of sensory information (such as that captured by audio and video) can be integrated to enhance our understanding and analysis of music.

A comprehensive definition of multimodality must consider how modern machines can analyze music more accurately and precisely than humans. Machines can perform pitch tracking, timing analysis, and music transcription with details beyond human auditory perception. Machine systems can also use technical information from encoding and compression schemes to support their analysis, which may enable a deeper level of analysis than humans can achieve. However, a system that performs well on one specific task does not necessarily perform well on another type of task. This fragmented expertise means that, in many areas, machines are still far from matching human performance, which seamlessly integrates multiple aspects of music understanding. Therefore, while machines offer precision and depth in certain tasks, integrating human insights remains crucial for a holistic understanding of music. Machines can improve their results by incorporating human information into their analysis chains, leveraging the complementary strengths of both human intuition and machine accuracy.

Similarly, communication modalities, such as speech, gesture, and facial expressions, are commonly used to study human interaction and expression. However, these modalities focus on relatively discrete and context-specific signals, while music involves a diverse array of elements, including melody, harmony, rhythm, and timbre, which often interact in intricate ways. This complexity necessitates specialized multimodal approaches for music processing, which go beyond the simpler structures typically addressed by communication modalities. Technological advancements allow music analysis beyond traditional modalities, with digital tools and algorithms broadening interpretation possibilities. While communication modalities offer insights into human interaction, understanding multimodal music data requires recognizing and integrating the complexities of music through careful consideration of its various dimensions and modalities.

Fig. 2 Illustration of how data can be seen as the same or different modality, depending on the task. The left side shows two music score types (on paper and as digital images) and two audio representations (an audio file and a MIDI file). On the right, two tasks (composer style recognition and melody extraction) determine whether the information is considered one modality (equal sign) or not (not equal sign).

Simonetta et al. [83] define modality in MIR as the specific way of digitizing music information, citing examples such as audio, lyrics, symbolic scores, album covers, etc. This definition is particularly useful for applications within MIR, emphasizing the technical processing of multiple data modalities and the digitization process. However, focusing solely on these technical aspects within MIR overlooks the need to integrate these modalities into various musical tasks and maintain flexibility in multimodal analysis. This flexibility is crucial as it allows MIR methods to be tailored to different music research contexts, ensuring their effectiveness across domains.

In parts of the machine learning literature, multimodality refers to distinct information units in inputs and outputs of a model or system [73]. This involves different data types from various input channels, even when the communication method or pathway changes during data transfer. In music processing tasks, inputs include audio signals, text data like lyrics or metadata, and visual elements such as album covers or music videos. Outputs may include genre labels, sentiment scores, or musical notation generated from audio input. It is important to note that while datasets may be constructed from diverse sources, consideration is given to the purpose and use of the dataset, whether it is for only one processing task (e.g., genre classification) or multiple (e.g., genre classification and emotion recognition). Parcalabescu et al. [73] highlight the task-dependent nature of multimodality, which emerges when a task requires information from multiple sources for its solution.

Adopting definitions of multimodal music datasets, such as those centered around human sensory modalities and digitization-centric perspectives, proves inadequate for capturing the complexity of today’s music research scene. Limiting definitions to human sensory channels restricts the definitions’ scope, neglecting the multitude of data types and encoding mechanisms present. Similarly, the MIR emphasis on digitization provides a narrow framework that falls short of conceptual depth. Therefore, we need a definition that transcends human sensory modalities and embraces broader computational processes to facilitate a comprehensive understanding of multimodal music datasets.

2.3 Proposing a new definition

We propose to define multimodality as the deliberate integration of varied information sources tailored to specific tasks. Figure 2 illustrates how various data representations can belong to the same or different modalities, depending on whether they provide complementary information to the task. For instance, a hand-written score offers insights about a composer beyond those of its digital representation. This may be significant for identifying the composer but is less important for automated melody transcription.

Fig. 3 Dataset distribution across music processing tasks. This horizontal bar chart shows the number of datasets per task. Bars represent specific tasks, sorted by dataset count, with task categories indicated by color patterns. Numerical values show the approximate availability of data samples for each dataset category.

Fig. 4 Bar plot illustrating the distribution of dataset sizes across various music genres. Each bar represents a dataset, sorted in descending order by the number of pieces. Bars are differentiated by genre: various, pop, non-western, classical, folk, jazz, and datasets with unavailable genre information. Bar heights are logarithmically scaled on the y-axis. The exact number of pieces is displayed on each bar. The plot highlights the diversity in dataset sizes and their genre categorizations in music research. Datasets that provide no size information were excluded.

In our definition, information units can be either physical signals or their digital representations, such as audio, lyrics, musical notation, and visual elements like album covers and music videos. Regardless of whether humans or machines process this information, our definition remains flexible for different types of analysis. This means that both human-centered and machine-centered approaches can be applied to these information units, making them versatile for various research and application purposes.

3 Existing multimodal music datasets

There is a growing, yet still limited, number of multimodal music datasets available [83]. Figure 3 provides an overview of those discussed in this paper, illustrating their availability for different processing tasks. In addition, Fig. 4 offers an overview of their size distribution and genre diversity. These counts are intended to provide an overview of the datasets and their distribution across different music processing tasks, highlighting the relative scale and comprehensiveness of each dataset. This list is non-exhaustive; a more complete and updated repository is available online [17].

Our dataset categorization is based on high-level music processing tasks, aligning with the macro-tasks suggested by Schedl et al. [79, 83], with additional enrichments to suit our requirements. Categories include categorization-oriented, synchronization-oriented, similarity-oriented, time-dependent representation-oriented, and multi-task datasets, each harnessing multiple modalities. We highlight the multimodal nature of each dataset, emphasizing enhanced task performance through modality integration while acknowledging the significance each modality holds for interpretability [60]. We also address the benefits and drawbacks of modality combinations for each task.

3.1 Categorization-oriented

Datasets in this category are tailored to optimize the effectiveness of various music categorization (and regression) tasks.

3.1.1 Emotion/affect recognition

Music emotion recognition is one of the most studied multimodal music processing tasks [83] and has gained significant attention in the research community. It encompasses both classification tasks, where emotions are categorized into discrete labels, and regression tasks, where continuous emotional dimensions such as valence and arousal are predicted. This dual nature of music emotion recognition allows for a comprehensive analysis of emotional responses to music, leveraging various modalities. Multimodal datasets used in emotion-centric tasks, such as CAL500 [88], combine machine-extracted audio features with human emotion annotations. Additional datasets, including those from [35, 45, 51, 54, 87, 89], incorporate labels, lyrics, and participant information, such as demographic or emotional preference information. Integrating lyrics with audio data provides additional context, enhancing emotion recognition accuracy. However, combining audio and text for emotion recognition presents challenges due to data heterogeneity and a semantic gap between textual descriptions and acoustic features, requiring careful pre-processing and fusion techniques. This is relevant to both sentiment analysis and perceived emotion.
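To make the fusion challenge concrete, the sketch below illustrates one simple late-fusion approach for continuous emotion (valence) regression: one regressor is trained per modality and their predictions are averaged. The feature dimensions and data are synthetic placeholders, not drawn from any of the cited datasets, and scikit-learn is assumed to be available.

```python
# Minimal late-fusion sketch for valence regression from audio and lyric features.
# The feature dimensions and toy data below are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_songs = 200
audio_feats = rng.normal(size=(n_songs, 40))   # e.g., per-song MFCC/chroma statistics
lyric_feats = rng.normal(size=(n_songs, 300))  # e.g., averaged word embeddings per song
valence = rng.uniform(-1, 1, size=n_songs)     # continuous emotion target

Xa_tr, Xa_te, Xl_tr, Xl_te, y_tr, y_te = train_test_split(
    audio_feats, lyric_feats, valence, test_size=0.2, random_state=0)

# Train one regressor per modality, then average their predictions (late fusion).
audio_model = Ridge(alpha=1.0).fit(Xa_tr, y_tr)
lyric_model = Ridge(alpha=1.0).fit(Xl_tr, y_tr)
fused_pred = 0.5 * audio_model.predict(Xa_te) + 0.5 * lyric_model.predict(Xl_te)
print("fused prediction shape:", fused_pred.shape)
```

Late fusion sidesteps some heterogeneity issues because each modality is modeled in its own space, at the cost of ignoring cross-modal interactions that feature-level fusion could capture.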

The NAVER Music Dataset [48] combines audio with spectrogram images to analyze music emotions. PMEmo [96] goes further, incorporating physiological data. However, this adds complexity to the task of integrating these different modalities effectively, as it requires sophisticated methods to synchronize and interpret them. The IMAC [91] and EmoMV [86] datasets utilize audio-visual integration, showing superior performance in binary classification and retrieval tasks, but such integration grapples with high-dimensionality and synchronization issues, impacting computational efficiency and accuracy of emotion recognition.

3.1.2 Genre classification and auto-tagging

Combining audio and lyrics has significantly improved genre classification, as shown in studies like [66, 67]. Orio et al. [72] merged audio features with genre tags and web-extracted data. Schindler and Rauber [80] enhanced genre classification by integrating audio and video. Large datasets like MSD-I and MuMu [71] mix audio, images, and multi-label genre annotations, boosting classification models. LMD-aligned [90] uses six modalities: audio features, lyrics, symbolic data (e.g., instrument counts, tempo), model-based features (e.g., semantic descriptors), album cover images, and playlists. Each modality offers unique insights, but multimodal fusion also introduces challenges like complex fusion methods, synchronization issues, and higher computational demands. Auto-tagging assigns descriptive labels to music content using datasets like MTG-Jamendo [12], which include audio recordings and category annotations (genres, instruments, moods/themes, and top-50 charts). While descriptive labels improve accuracy, they may miss subtleties and variations in the music. Genre recognition is a specific case of tag recognition, creating a hierarchical relationship. Genre recognition classifies music into pre-defined categories, while auto-tagging uses a broader range of descriptors for comprehensive analysis. This perspective underscores the interconnectedness of assigning genre labels and detailed tags, enhancing music classification and retrieval.

3.1.3 Musical gesture classification

Classification of music-related body motion, actions, and gestures has advanced significantly through multimodal data integration. Gan et al. [30] emphasize the effectiveness of multimodal fusion in this domain. Collections like those by Chang et al. [15] and Sarasua et al. [78] showcase successful modality integration, including audio, video, motion capture, physiological data (EMG, ECG, etc.), textual descriptors of the data, and MIDI. Each modality brings unique information; for instance, audio captures sound nuances, video offers visual cues for identification, EMG and IMU sensors provide insights into muscle activity and motion, images enhance visual context, text annotations add semantic information, and MIDI data includes musical notation tied to sound-producing actions. Such a multimodal approach allows for robust musical gesture classification models.

3.1.4 Singing voice analysis

This task involves analyzing vocal traits to improve singer categorization. The Vocal92 dataset [22] combines both singing and speech modalities by including a cappella solo singing and spoken recordings from 92 singers. This multimodal approach enhances accuracy by capturing nuances such as pitch and timbre in singing and contextual information in speech. While speech and singing share vocal traits, they also exhibit distinct differences like pitch and rhythm, providing complementary data for more robust singer recognition.

3.2 Time-dependent representation-oriented

Datasets here capture temporal evolution in music for tasks like source separation and transcription. Modalities such as audio, MIDI, and label descriptions of musical content form a comprehensive foundation for understanding musical dynamics.

3.2.1 Source separation

Source separation involves isolating individual sound sources from a mix, crucial for various applications such as music production, remixing, and audio analysis. TRIOS [28] comprises separated tracks from chamber music trios with time-aligned MIDI scores, facilitating score-informed audio source separation. Including MIDI data provides a synchronized reference for the musical notes, aiding in the precise separation of individual instruments. MUSIC-21 [97] emphasizes audiovisual performances, while CocoChorales [94] includes audio, MIDI, and note annotations, enhancing the dataset’s utility for tasks like source separation by offering multiple modalities and fine-grained musical information.

3.2.2 Piano tutoring

The dataset by Benetos et al. [9] combines audio recordings with manual and automated MIDI transcriptions, offering a resource for piano tutoring research. Audio captures expressive nuances, dynamics, and timbre, while MIDI transcriptions provide symbolic representations of musical notes and timing. This integration can enhance piano tutoring systems by delivering an audio experience alongside structured symbolic representations, facilitating learning. However, aligning audio signals with MIDI notes requires complex pre-processing and synchronization techniques.

3.2.3 Music segmentation and structure analysis

Music segmentation and structure analysis are complementary but distinct tasks. Music segmentation, a binary classification task, focuses on identifying structural boundaries. Datasets by Cheng et al. [16] and Hargreaves et al. [42] provide annotated audio recordings that capture expressive nuances and precise symbolic representations. Challenges include differing annotation interpretations and audio mismatches, requiring careful synchronization. In contrast, music structure analysis identifies relationships and hierarchies among musical parts. Gregorio et al. [37] offer audio and MIDI data for analyzing jazz improvisation, helping to understand how segments form larger structures. Despite the enriched data, challenges remain in establishing accurate hierarchies and aligning them with audio.

3.2.4 Music transcription

Music transcription involves converting audio recordings, or other forms of musical representation, into musical notation. Camera-PrIMuS aids diverse transcription methods with real music staves featuring various distortions and formats [13]. The Dataset of Norwegian Hardanger Fiddle Recordings offers beat annotations alongside audio and note annotations, crucial for beat tracking and beat-aware transcription research [24, 52, 53]. MIREX and MIREX-multi-f0 datasets provide audio and beat annotations for extracting pitch information [7, 44]. MAPS offers audio recordings and MIDI files for piano transcription [25], while a dataset for polyphonic piano transcription includes audio and MIDI data from polyphonic recordings [76]. N20EM supports lyric transcription with audio, video, and IMU data [41], and CocoChorales combines audio, MIDI, and note annotation data for music transcription tasks [94]. One major challenge of combining MIDI, note annotations, and audio is the variability in MIDI note timing and pitch accuracy, nuanced annotations, and audio complexities, which can lead to transcription discrepancies. Robust synchronization and sophisticated algorithms are needed to leverage modalities effectively and minimize errors while preserving fidelity.
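As an illustration of how such transcription discrepancies can be quantified, the sketch below compares estimated notes against a reference using note-level precision and recall, assuming the mir_eval package is available; the note lists are invented for illustration.

```python
# Sketch of evaluating note-level transcription against a reference with mir_eval.
# Intervals are (onset, offset) pairs in seconds, pitches in Hz; values are made up.
import numpy as np
import mir_eval

ref_intervals = np.array([[0.50, 1.00], [1.00, 1.50], [1.50, 2.20]])
ref_pitches   = np.array([440.0, 493.88, 523.25])            # A4, B4, C5
est_intervals = np.array([[0.52, 0.98], [1.03, 1.49], [1.55, 2.10]])
est_pitches   = np.array([440.0, 493.88, 554.37])            # last note mis-transcribed

precision, recall, f1, overlap = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05)  # 50 ms tolerance reflects timing variability in annotations
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```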

3.2.5 Instrument performance analysis

Perez-Carrillo et al. [75] introduced a dataset for evaluating algorithms related to guitar fingering and plucking controls. It includes audio recordings, 3D motion data, and information from the musical score (such as note onsets or ground truth for the plucking parameters). Audio captures sonic nuances, while other modalities provide details about physical aspects, instrument-specific actions, and the musical context. Such a comprehensive dataset enhances the accuracy of estimation algorithms and facilitates a deeper exploration of the interplay between musicians and their instruments in guitar performance analysis.

3.3 Music similarity-oriented

These datasets aim to study relationships and similarities between musical pieces.

3.3.1 Song retrieval

This task involves using similarity-matching strategies for song identification and retrieval. The Million Song Dataset (MSD) [10] has been important for various identification purposes. Correya et al. [19] used MSD for cover identification, demonstrating its effectiveness in distinguishing different versions of songs. Additionally, MSD is a valuable resource for audio and lyrics retrieval, enhancing retrieval system accuracy [95]. MSD500 [93] is designed for tag-based song retrieval, incorporating audio and descriptive tags for more nuanced searches. Integrating such multimodal information in MSD500 has shown superior performance and advanced music retrieval techniques. However, tag-based song retrieval that integrates metadata with audio content may face challenges because subjective tagging naturally varies and can be inconsistent in practice. This variability can lead to inconsistencies in how songs are categorized or retrieved within the system, potentially limiting its effectiveness in accurately capturing user preferences.

3.3.2 Music exploration and discovery

This task enhances music discovery by combining recommendation strategies with exploration opportunities, balancing familiarity with the excitement of new content. Music4All-Onion [69] stands out for integrating audio and text and refining content-based recommendations with nuanced and personalized suggestions. Watanabe et al. [92] introduce a dataset featuring lyrics, audio, and artist IDs, achieving good performance in the task of query-by-blending, encouraging user interaction. The dataset supports users in exploring and manipulating these integrated musical components to generate new queries. Additionally, Poltronieri et al. [77] propose a dataset of audio and melodic and harmonic annotations, enriching content-based music similarity exploration. Variability in how users of a music exploration system interpret textual descriptions of music content complicates the use of this information for recommendations, since individual differences lead to varied preferences and expectations.

3.4 Music generation

Music generation does not fit into the analysis-focused categories mentioned above. It is a creative process focused on crafting new musical content, including melodies, harmonies, rhythms, or complete compositions. The dataset in [40] is used to generate music based on brain activity. In [29], datasets such as URMP [58], AtinPiano, and MUSIC [97] are used for music generation from videos, showcasing the efficacy of combining audio and video. MusicCaps [2] takes a novel approach, generating music from textual descriptions (such as “an upbeat bass beat accompanying a reverberated guitar riff”) and expanding creative possibilities through multimodality.

3.5 Multi-task

Similar to foundational models for various music processing tasks [31], several multimodal music datasets are designed to address multiple objectives. These datasets cover more than one macro task and incorporate diverse modalities, such as audio, video, text, music scores, MIDI, and annotations, facilitating a broad range of MIR tasks. For instance, FMA [21], RWC [36], and DALI [68] are some examples, supporting tasks spanning audio analysis, music genre classification, and semantic music retrieval. ENST-Drums [32] enables event classification, drum track transcription, and polyphonic music extraction. Essid et al. [26] focus on synchronization, time-based movement representation, and dance movement recognition.

The URMP dataset [58] provides a comprehensive collection with audio, video, scores (MIDI and PDF), and annotations, suitable for diverse tasks including music transcription and performance analysis. MedleyDB [11] enhances melody transcription and genre classification capabilities with audio recordings, genre labels, and note annotations. Additionally, the CompMusic Indian art music dataset and the Saraga music research corpus [74] contribute to the multi-task category with rich annotations for Indian raga classification, supporting melodic motif identification and time-dependent pattern detection. The Turkish makam dataset introduces innovative methods for tonic identification in audio recordings [81], enhancing analysis in expressive and intonation domains. Finally, the Song Describer dataset (SDD) [63] supports tasks like music captioning, text-to-music generation, and music-language retrieval, featuring paired audio and caption data for multimodal learning.

4 Multimodal music dataset pre-processing

Preparing multimodal music datasets involves handling various data types concurrently. Unlike unimodal datasets that focus on a single data type, multimodal datasets require integrated techniques for multimodal analysis. This section explores the primary strategies for pre-processing multimodal music datasets.

4.1 Crossmodal alignment

Establishing connections across modalities is a key challenge in multimodal dataset creation. Developing techniques for aligning audio, scores, text, and visual data based on shared features is essential for holistic analyses. This alignment is crucial for tasks like source separation and audio-to-score synchronization [83]. Cross-referencing metadata from multiple sources can help verify the accuracy of the information and identify discrepancies.

Clearly labeling and documenting different versions or performances of the same piece of music can maintain the integrity of the dataset [78]. Using advanced audio fingerprinting, which is now more robust to tempo and pitch variations, ensures audio files match metadata accurately. This technology, though challenging for transcription, offers efficient and precise indexing of music data, reducing errors in dataset management. By integrating these methods, consistency between audio and metadata can be maintained, supporting reliable music retrieval and analysis [70]. When legal constraints require pre-extracted features, providing detailed documentation of the feature extraction process can maintain transparency [18]. Encouraging community contributions through platforms can also facilitate collaborative annotation efforts, helping identify and rectify errors in the dataset.
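As one possible illustration of crossmodal alignment, the sketch below aligns a symbolic score to an audio recording by comparing chroma features with dynamic time warping (DTW). It assumes the librosa and pretty_midi packages; the file names are placeholders rather than references to any particular dataset.

```python
# Sketch of chroma-based audio-to-score alignment via dynamic time warping (DTW),
# assuming librosa and pretty_midi; file paths below are hypothetical placeholders.
import librosa
import pretty_midi

hop = 512
y, sr = librosa.load("performance.wav")               # hypothetical audio recording
audio_chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop)

pm = pretty_midi.PrettyMIDI("score.mid")              # hypothetical symbolic score
score_chroma = pm.get_chroma(fs=sr / hop)             # match the audio frame rate
score_chroma = librosa.util.normalize(score_chroma, axis=0)

# DTW over cosine distance yields a warping path mapping score frames to audio frames.
D, wp = librosa.sequence.dtw(X=score_chroma, Y=audio_chroma, metric="cosine")
score_times = wp[:, 0] * hop / sr
audio_times = wp[:, 1] * hop / sr
# The path is returned from end to start, so the last entry is the first aligned pair.
print("first aligned frame pair (seconds):", score_times[-1], audio_times[-1])
```

The resulting warping path can then be used to transfer score-level annotations (e.g., note onsets or section labels) onto the audio timeline.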

4.2 Handling missing data across modalities

Multimodal datasets often exhibit variations in data availability across modalities due to restrictions on public access to specific types of data. Despite these challenges, multimodal music processing can remain effective even when certain data types are unavailable. Effective pre-processing involves addressing missing data to prevent biases and maintain dataset integrity [23]. Several strategies can be employed to handle missing data. Interpolation and imputation techniques, such as those described by Perez-Carrillo et al. [75], estimate missing values based on patterns in the available data.

In cases where missing data cannot be effectively imputed or synthesized, excluding entries with incomplete data may be necessary. However, this approach should be used cautiously to avoid significantly reducing the dataset’s size [61]. Designing the dataset with redundancy across modalities can also ensure robustness, as other modalities can provide enough context to mitigate the impact of missing data [6].
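The following minimal sketch illustrates one imputation strategy, filling missing values in a single modality's feature matrix from patterns in the available data using k-nearest-neighbour imputation; scikit-learn is assumed, and the feature matrix is synthetic.

```python
# Minimal sketch of imputing missing per-song features in one modality using
# KNN imputation; the feature matrix below is synthetic, not from a real dataset.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(1)
motion_feats = rng.normal(size=(100, 8))            # e.g., motion-capture summary features
motion_feats[rng.random((100, 8)) < 0.1] = np.nan   # simulate 10% missing values

imputer = KNNImputer(n_neighbors=5)
motion_filled = imputer.fit_transform(motion_feats)
print("remaining NaNs:", np.isnan(motion_filled).sum())
```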

4.3 Integrated feature extraction

Feature extraction is crucial for numerical representation across different modalities, supporting tasks like genre classification [72] and emotion recognition [89]. While aligning sampling rates is often emphasized, strict uniformity across all modalities is not always necessary and may vary by application; flexible sampling rates can still maintain temporal coherence. Techniques such as canonical correlation analysis (CCA) or deep learning methods are employed to project features from different modalities into unified representation spaces, capturing intermodal correlations effectively [49]. These integrated approaches facilitate a comprehensive analysis by harmonizing data from varied sources, ensuring that essential aspects across different modalities are captured.
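A minimal sketch of the CCA idea is shown below, projecting synthetic audio and video features into a shared space where their correlation can be inspected; scikit-learn is assumed, and the features are random stand-ins for real extractions.

```python
# Sketch of projecting audio and video features into a shared space with canonical
# correlation analysis (CCA); the synthetic features stand in for real extractions.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
shared = rng.normal(size=(300, 5))                     # latent structure both views share
audio_feats = shared @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))
video_feats = shared @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(300, 30))

cca = CCA(n_components=5)
cca.fit(audio_feats, video_feats)
audio_proj, video_proj = cca.transform(audio_feats, video_feats)

# Correlation of the first canonical pair indicates how well the two views align.
print("first-pair correlation:", np.corrcoef(audio_proj[:, 0], video_proj[:, 0])[0, 1])
```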

4.4 Consistent data normalization

Normalizing data ensures that different modalities, which may have varying scales and magnitudes, contribute fairly to the analysis [83]. Consistent data normalization prevents any single modality from dominating the results due to differences in magnitude, fostering a balanced and comprehensive understanding of the multimodal dataset. Normalization ensures consistency and comparability across diverse datasets and modalities, enhancing the robustness of feature extraction and interpretation. While some analytical methods can manage normalization internally, applying normalization techniques ensures a standardized approach, which is critical for the accuracy and reliability of multimodal analysis.
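The short sketch below illustrates per-modality normalization before concatenation, so that a large-magnitude modality does not dominate the fused representation; the feature matrices are synthetic and scikit-learn is assumed.

```python
# Sketch of normalizing each modality separately so that no single modality
# dominates a concatenated feature vector; the feature matrices are synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
audio_feats = rng.normal(loc=0, scale=1e3, size=(50, 12))   # large-magnitude features
lyric_feats = rng.normal(loc=0, scale=1e-2, size=(50, 6))   # small-magnitude features

audio_scaled = StandardScaler().fit_transform(audio_feats)
lyric_scaled = StandardScaler().fit_transform(lyric_feats)
fused = np.hstack([audio_scaled, lyric_scaled])             # balanced contributions
print(fused.std(axis=0).round(2))                           # roughly unit variance per column
```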

4.5 Coordinated dataset splitting

For machine learning-oriented music processing tasks, dividing the dataset into training, validation, and test sets requires careful consideration to maintain label distribution and multimodal coherence [95]. Stratified splitting ensures that each set maintains the same proportion of labels as the original dataset, preserving the statistical properties across splits [55]. Maintaining temporal consistency is crucial for datasets with temporal data to avoid disrupting the sequence and is essential for tasks like audio-to-score synchronization and video analysis [8].

Ensuring multimodal integrity is vital, meaning that all modalities for a given piece of data must be included in the same split to avoid undermining multimodal analyses [33]. Additionally, the balanced distribution of various attributes (e.g., genres, emotions) across splits prevents any set from being biased towards specific attributes, which is important for training robust models [64].
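One way to combine stratification with multimodal integrity is a group-aware stratified split, in which every excerpt of a piece (and hence all of its modalities) stays in the same fold. The sketch below assumes scikit-learn 1.0 or later for StratifiedGroupKFold and uses synthetic piece identifiers and labels.

```python
# Sketch of a split that keeps every excerpt of a piece in the same fold while
# roughly preserving label proportions; requires scikit-learn >= 1.0. Data are synthetic.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(4)
n_excerpts = 400
piece_ids = rng.integers(0, 100, size=n_excerpts)    # 100 pieces, several excerpts each
genres = piece_ids % 4                                # one genre label per piece
features = rng.normal(size=(n_excerpts, 16))          # stand-in for any modality

splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(features, genres, groups=piece_ids):
    overlap = set(piece_ids[train_idx]) & set(piece_ids[test_idx])
    print("pieces leaking across the split:", len(overlap))  # should be 0 per fold
    break
```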

5 Multimodal music dataset construction and evaluation

This section presents an integrated approach to constructing and evaluating multimodal music datasets, outlining essential considerations, best practices, and criteria to ensure the development of high-quality datasets suitable for various multimodal music processing tasks.

Creating a multimodal music dataset is a meticulous process aimed at ensuring quality, relevance, and applicability to various research tasks. Here, we outline some criteria for developing a comprehensive multimodal music dataset, providing an overview of the current state-of-the-art. Furthermore, assessing a multimodal music dataset’s quality, usefulness, and applicability is crucial for ensuring its value to the research community and beyond. We also outline criteria for evaluating the efficacy of a constructed multimodal music dataset.

Despite proposing rigorous criteria for dataset construction and evaluation, it is essential to remember that adopting “good enough” practices [46] acknowledges the practical constraints and real-world applications of multimodal music research. Balancing academic rigor with pragmatic considerations ensures that datasets are not only scientifically robust but also feasible for implementation in diverse real-life scenarios.

5.1 Diversity and representation

While datasets like MTG Jamendo [12] and DALI [68] encompass diverse music genres and tasks, achieving this breadth requires substantial computational and time resources. Such comprehensive datasets are exceptional rather than common in dataset creation.

Typically, datasets are designed to address specific research questions or applications, focusing on particular musical genres, historical periods, or cultural contexts. For instance, the Hardanger Fiddle Dataset [24] and genre-specific pattern recognition studies [85] illustrate this targeted approach. However, the creation of a multimodal music dataset should involve the deliberate consideration of its potential future applications. This includes ensuring clear metadata, documentation of data transformations, and adherence to data-sharing standards. Doing so enables future researchers to extract maximum value from the dataset for various unforeseen research questions and applications.

5.2 Data quality and consistency

High-quality data is essential for meaningful multimodal music analysis. This includes ensuring optimal and consistent data representations, formats, and sampling rates [70]. Recordings should have a high sampling rate and minimal noise [25]. Evaluating multimodal music datasets involves both technical and perceptual assessments. Technical measures include bitrate and sampling rate [21]. Listener studies are equally important, providing subjective ratings from diverse groups to assess perceived quality. Additionally, high-quality datasets should enhance the listening experience and provide insights into musical structures, capturing elements such as harmony, rhythm, and timbre, along with detailed annotations [54].
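A basic technical check of this kind can be automated. The sketch below scans a folder of recordings and flags files below a target sampling rate or with an unexpected channel count, assuming the soundfile package; the directory layout and thresholds are placeholders.

```python
# Sketch of a technical quality check over a folder of recordings, assuming the
# soundfile package; the directory path and 44.1 kHz target are illustrative choices.
from pathlib import Path
import soundfile as sf

MIN_RATE = 44_100
for wav_path in Path("dataset/audio").glob("*.wav"):   # hypothetical dataset layout
    info = sf.info(str(wav_path))
    if info.samplerate < MIN_RATE:
        print(f"{wav_path.name}: {info.samplerate} Hz is below the target rate")
    if info.channels < 2:
        print(f"{wav_path.name}: mono recording, check against the dataset specification")
```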

5.3 Annotation and ground truth

Accurate annotation is crucial for supervised learning in multimodal music datasets [88]. Annotations, including genre labels, sentiment scores, and artist information, serve as “ground truth” for tasks like genre classification [36], emotion recognition [50], and audio signal separation [12]. However, “ground truth” is subjective and culturally specific, reflecting human interpretations influenced by cultural contexts, individual perceptions, and domain expertise. Hence, while annotations form the foundation for multimodal music processing tasks, their subjective nature must be acknowledged and interpreted accordingly.

To achieve high-quality annotations, combining crowdsourcing with expert validation is effective [59]. Crowdsourcing platforms can handle initial annotation tasks, with data curators validating a subset to ensure high standards. An iterative annotation and review process improves quality over time without initial high costs.
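Agreement between crowd annotations and the expert-validated subset can be monitored with a chance-corrected statistic such as Cohen's kappa, as in the brief sketch below; scikit-learn is assumed and the label sequences are illustrative.

```python
# Sketch of checking agreement between a crowd annotator and an expert on a
# validated subset using Cohen's kappa; the label sequences below are illustrative.
from sklearn.metrics import cohen_kappa_score

crowd_labels  = ["happy", "sad", "happy", "calm", "sad", "happy", "calm", "sad"]
expert_labels = ["happy", "sad", "calm",  "calm", "sad", "happy", "calm", "happy"]

kappa = cohen_kappa_score(crowd_labels, expert_labels)
print(f"Cohen's kappa on the validation subset: {kappa:.2f}")
# Low agreement can flag label categories that need clearer annotation guidelines.
```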

Open-source tools for data alignment reduce reliance on expensive proprietary software. Automated preprocessing detects and flags data inconsistencies [83]. Machine learning models can aid anomaly detection during manual reviews [58]. Collaborating with the research community promotes sharing best practices and resources to tackle dataset construction challenges effectively.

5.4 Modality interplay

An effective multimodal dataset should enable researchers to explore the relationships between different modalities, such as audio, text, and visual components. The dataset should provide sufficient data to study correlations between modalities. For example, the RWC Music Database [36] and the DALI dataset [68] provide linked audio, scores, and lyrics, facilitating multimodal analysis.

5.5 Generalization and robustness

A high-quality multimodal music dataset should facilitate model generalization to unseen data, encompassing variations within each modality, such as diverse timbres, lyrical themes, and visual styles. There are no established diversity metrics that could be used to quantify these aspects for each modality’s content, so these variations are typically evaluated qualitatively through the judgment of the dataset creation team members. This qualitative assessment considers the dataset’s breadth and richness of musical genres, lyrical topics, artistic styles, and other modality-specific attributes. Performance on established benchmark tasks should be tested and reported, with datasets like MedleyDB [11] often used for source separation tasks, providing benchmark results for comparison.

5.6 Usability and accessibility

It is crucial to design the dataset with scalability and ease of use in mind. The dataset should be easily accessible, with clear documentation and user-friendly interfaces. Accessibility can be quantified by the dataset’s availability (e.g., hosted on public repositories) and the comprehensiveness of its documentation [18]. The dataset should allow for future expansions and updates, with mechanisms for adding new data, such as standardized formats and modular data structures.
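One way to support such modular expansion is a simple machine-readable manifest that links every modality of a piece and records its license. The sketch below shows one possible layout; the field names and paths are hypothetical rather than an established standard.

```python
# Sketch of a machine-readable manifest linking all modalities of one piece,
# easing future additions; field names and paths are hypothetical examples.
import json

entry = {
    "piece_id": "piece_0001",
    "modalities": {
        "audio": "audio/piece_0001.wav",
        "video": "video/piece_0001.mp4",
        "score": "scores/piece_0001.musicxml",
        "motion_capture": "mocap/piece_0001.csv",
    },
    "annotations": {"genre": "folk", "emotion": "calm"},
    "license": "CC-BY-4.0",
}
with open("manifest.jsonl", "a") as f:   # one JSON object per line, append-only
    f.write(json.dumps(entry) + "\n")
```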

5.7 Real-world impact

Assessing a dataset’s impact goes beyond its influence within specific research fields and instead focuses on its broader implications and usability across various applications. The quality of a dataset can be evaluated by its ability to stimulate interdisciplinary research collaborations [12] and support a wide range of projects [19]. Moreover, evaluating a dataset’s practical applications in domains such as music recommendation systems, educational tools, and music analysis software provides insight into its real-world utility. This includes its effectiveness in facilitating new discoveries, enhancing educational resources, and empowering innovative applications across different user communities.

Understanding a dataset’s scalability and adaptability is crucial. While metrics like academic impact are relevant, the true measure of impact lies in its ability to transcend disciplinary boundaries and contribute to societal and cultural advancements through accessible and impactful applications.

5.8 Legal constraints and limitations

Understanding legal constraints and copyright is crucial in multimodal dataset construction. Comprehensive research into copyright laws across relevant jurisdictions is essential. Legal experts specializing in intellectual property can assist in navigating complex landscapes and ensuring compliance. Open licenses like Creative Commons (CC0, CC-BY) facilitate legal sharing and reuse [19]. For proprietary data, negotiating licenses for academic and research purposes with clearly defined terms is important to prevent legal disputes.

To comply with privacy laws (e.g., GDPR), removing personally identifiable information and applying fair use principles are recommended [18]. Rights management tools help manage permissions and limitations associated with datasets. Collaboration with institutional repositories and digital libraries supports handling copyrighted materials. Detailed dataset documentation, including legal status, licenses, transformations, and fair use analysis, enhances transparency [18]. Clear guidelines on data usage, sharing, and citation prevent unintentional legal violations.

6 Discussion

Navigating the construction and evaluation of multimodal music datasets reveals both promising avenues and persistent challenges. For example, there should be flexibility in dataset usage, allowing researchers to explore diverse applications and maximize the utility of available resources. While datasets may be categorized hierarchically, there are instances where a dataset can serve a completely different task than initially intended. For instance, the Hardanger fiddle dataset [53], primarily used for music transcription, contains additional emotional information that has not been fully utilized in previous studies. CocoChorales [94] contains information useful for automated music composition. Similarly, Vocal92 could be used for speech-to-song generation [22], MTG-Jamendo for genre-based recommendation systems and semantic music search [12], and the dataset of Benetos et al. could be used for gesture recognition for expressive playing analysis [9].

One of the primary challenges is implementing multimodal fusion approaches that effectively address the specific requirements and nuances of music processing tasks. This includes managing feature contributions, selecting appropriate modalities, and navigating the complex dimensions of multimodal data.

Semantic alignment between different modalities remains a complex task, necessitating further research to refine techniques and understand contextual nuances. Collaboration across various domains is crucial in ensuring that multimodal datasets are comprehensive, accurate, and accessible. Professionals spanning musicology, psychology, and technology must work together to overcome these challenges.

7 Conclusion

In response to the communication gaps between music research disciplines regarding multimodality, our paper introduces a novel perspective on multimodal music datasets that bridges musicology, music psychology, and music technology. Multimodal music datasets are pivotal in advancing various music processing tasks and enriching our understanding of musical experiences. A streamlined framework may simplify the complexities associated with signal types, digitization processes, and data interpretation by humans and machines. This framework facilitates interdisciplinary collaboration and empowers researchers to explore diverse applications beyond traditional boundaries.

Our categorization scheme for multimodal datasets highlights strategic combinations of modalities and their implications for specific music-processing tasks. This enables researchers to make informed decisions when leveraging datasets across different domains, fostering innovation in music analysis. At the same time, we emphasize adopting “good enough practices” in dataset construction and evaluation, prioritizing pragmatic solutions that balance thoroughness with feasibility, and ensuring datasets are fit for purpose without excessive complexity.

Moving forward, prioritizing collaboration among musicologists, music psychologists, and music technologists is essential. Each group brings unique perspectives and expertise, which are essential for overcoming complex research challenges in multimodal music processing. Musicologists and psychologists, as primary data collectors and domain experts, play a pivotal role in shaping dataset construction and ensuring relevance to real-world music contexts. However, they often do not have the skills or interest to create large multimodal music datasets ready for machine learning. This is where fruitful collaborations with technologists can help move the field(s) forward. We advocate for adaptable and nuanced tools capable of accommodating the complexity and diversity inherent in musical expression rather than striving for universal solutions that may oversimplify or exclude critical aspects of music.