1 Introduction

Music mashup creation is a composition practice that leverages existing audio recordings. It entails recombining two or more pre-recorded musical audio recordings as a means of creative expression [1]. The practice is strongly linked to the various sub-genres of electronic dance music (EDM) and the role of the DJ [1]. Girl Talk, The Kleptones, and Danger Mouse are popular mashup artists.

Mashup creation is typically confined to technology-fluent composers, as it requires expertise ranging from understanding musical structure to navigating and retrieving musical audio from large datasets. Both industry and academia have been designing tools that aid musicians, producers, and lay-users in exploring the virtually infinite possibilities of digital music mashup creation. These tools streamline the time-consuming search for compatible musical audio and remove the need for advanced knowledge of music theory, practice, and digital signal processing. In this context, lay-users can engage in creative tasks, and professional musicians and producers can devote more time to creative experimentation.

Computational mashup creation primarily tackles two challenges: (1) retrieving compatible musical audio from a dataset and (2) transforming musical audio signals to “force” their compatibility (e.g., beat alignment or pitch shifting). Our article focuses primarily on the underlying methods of the former, commonly referred to as content-based retrieval within music information retrieval (MIR). Its application to music mashup creation has been identified as one of the grand challenges of the community [3] within the area of creative MIR [4].

Representative state-of-the-art applications for music mashup creation can be grouped into two main categories: (1) rule-based models with hand-crafted features [5,6,7,8] and (2) end-to-end machine learning models [9, 10]. Rule-based models typically adopt perceptual and formal mid-level descriptors (e.g., dissonance, key relatedness, and spectral flatness) to represent musical audio signals in a search space from which compatible mashups are retrieved. Thanks to their transparency and “generality” (i.e., style agnosticism), rule-based models are strong candidates for collaborative creative tools, as they afford a considerable degree of user customization. While reducing the virtually infinite space of musical audio recombinations by selecting compatible mashups, they are not constrained by overly pronounced stylistic idiosyncrasies. Perceptual evaluations of existing rule-based models have robustly shown that the proposed mashups feature harmonically, rhythmically, and spectrally compatible musical audio. However, their computational performance has been identified as a significant limitation at scale, due to the expensive brute-force search required to create mashups from large musical audio datasets.

Existing end-to-end machine learning models for music mashup creation rely on large datasets of loops to train deep neural networks that infer the characteristics of compatible loops without the need for hand-crafted features [9, 10]. Due to the lack of annotated multi-track mashup datasets, pipelines for extracting positive examples of compatible loops from existing music have been proposed [9, 10]. While end-to-end machine learning models can account for yet unknown or non-systematized musical characteristics implicit in the signal and the mashup practice, they can suffer from a lack of generalization, as they are intrinsically linked to the style of the training corpus. Furthermore, little adaptation to user preferences (besides the content curation of the training data) can be accommodated during generation.

Andersen and Knees’s [4] findings from in-depth interviews with expert users working creatively with electronic music show that a musical interface leveraging MIR technology should play the role of a collaborative machine. Fundamental to the human creator is the ability of the collaborative machine to assess, criticize, and occasionally oppose, thus promoting surprise and serendipity in recommendation and retrieval. Furthermore, it should consider the creator’s individuality by allowing degrees of control over the recommendation process. The authors highlight that successful imitation is not enough for the musical interface to be recognized as “intelligent”; it should go beyond the traditional music and audio retrieval requirements for consumer or entertainment needs. Creative users of MIR technology in a music production environment are willing to evaluate a large part of the returned items to get inspired and find their personal best.

This article addresses the above limitations regarding scalability and user preference by proposing a computer-aided, diverse, user-customizable music mashup creation model operating on a large dataset of loops. In greater detail, our task can be defined as a search problem that must account for (1) the musical audio dataset scalability; (2) the musical audio sample compatibility and (3) diversity at the harmonic, rhythmic, and spectral levels; and (4) customizable user preferences. It should support large (or variable-sized) loop datasets, and the search results should account for optimal recombination, according to formal and perceptual music principles (e.g., low degrees of dissonance, key affinity, and metrical alignment), while equally enabling diverse results to accommodate user preferences.

In the context of our work, users include professional (or musically trained) producers and composers, as well as lay-users. The proposed model is concerned with the fundamental attributes of musical structure, which can be leveraged either to promote interfaces that support the creative flow of producers and composers or to allow lay-users to experience and actively engage in creative music practices. In the latter case, the model’s intelligence allows lay-users or non-musically trained people to surpass the steep learning curve of music theory and ear training needed to create musical mashups. In the former case, while trained musicians can achieve these tasks, the model streamlines the search for optimal mashups, a time-consuming task when adopting mid-size to large loop datasets.

Evolutionary multimodal optimization is a class of algorithms that tackles the four requirements above. It typically embeds parallel and efficient search capabilities that optimize a given function to locate diverse solutions within a search space [11]. A prominent algorithm for multimodal optimization is the artificial immune system (AIS) opt-aiNet [12], which has been applied in creative MIR, namely in computer-aided musical orchestration [13, 14]. These works perform an efficient optimization search across a large corpus of musical instrument notes to find diverse orchestrations matching a reference sound’s timbre.

Our work adopts AIS opt-aiNet for searching compatible and diverse loop mashups from large loop datasets. Three main objective criteria drive compatibility: harmonic compatibility [15], rhythmic compatibility [16], and spectral balance [17], broadly following the metrics proposed in Mixmash [6]. Diversity accounts for the thorough and concurrent exploration of different optimal matches across the search space. For example, the resulting optimal loop recombinations can feature different keys, tempi, timbres (e.g., derived from different instrumentation), and microtiming deviations (e.g., swing feel). Furthermore, the user can customize the importance of the criteria in the objective compatibility function, biasing the search towards different musical structural elements. In contrast to existing models driven by computationally expensive brute-force (BF) search methods (e.g., [5] and [6]), we aim to provide a computer-aided tool that enables an efficient search on a large user-curated loop dataset while promoting a diverse set of optimal mashups. The model was implemented in Pure Data [18] and Python as prototype applications named Mixmash-AIS.

User preferences are intrinsic to the proposed computational model using AIS opt-aiNet, seen as an intelligent collaborative machine or virtual agent that aids users in selecting from a reduced set of optimal mashups. Therefore, our proposal does not aim at fully automating the process of mashup creation, but rather at aiding the user in navigating the search space. In this context, the model promotes (1) a local search in multiple locations of the search space, which guarantees the retrieval of several optimal mashups, whose (2) diversity is enforced, as distances in the search space indicate perceptual relatedness along the harmonic, rhythmic, and spectral dimensions. Furthermore, (3) user parameterization can bias the search to privilege preferred musical structural elements (e.g., it can guarantee optimal rhythmic compatibility without accounting for harmonic and timbral qualities).

We evaluate our model using objective and subjective (i.e., perceptual) criteria. Three objective criteria are assessed: (1) the quality of the recombination (i.e., compatibility), (2) the diversity of the resulting mashups, and (3) the computational performance. The objective criteria are compared against a standard (unimodal) genetic algorithm (GA) and a brute-force (BF) approach, following the objective evaluation procedures proposed in [13]. The main contributions of this article beyond our previous work [19] include (1) the inclusion of a spectral balance criterion in the objective evaluation function, (2) the adoption of weights in the objective evaluation function defining the importance of the harmonic, rhythmic, and spectral compatibility criteria to account for customizable user preferences, (3) the possibility of expanding the loop recombination to virtually any number of overlapping layers, and (4) a perceptual evaluation of our model through a listening test to validate the objective evaluation function.

The remainder of this article is structured as follows. Section 2 surveys related work on computational mashup creation and evolutionary optimization for musical audio recombination. Section 3 details the overview of the model, namely the audio features adopted, feature extraction, and the optimization search with AIS opt-aiNet. Sections 4 and 5 outline the evaluation procedure of the proposed model and the results, respectively. Finally, Section 6 presents the conclusions of our study and areas for future work.

2 Related work

This section surveys work related to the problem of computational mashup creation at scale along two fundamental lines of research: computational mashup creation and evolutionary multimodal optimization. Section 2.1 describes representative methods for computational mashup creation that adopt rule-based and end-to-end machine learning approaches. The rule-based approaches adopt hand-crafted descriptors extracted from musical audio to represent the signal’s rhythmic, harmonic, and timbral characteristics. Section 2.2 then addresses evolutionary multimodal optimization applied to musical manifestations, namely musical audio recombination within computer-aided orchestration using AIS opt-aiNet.

2.1 Computational music mashup creation

Early computational mashup creation focused on rhythmic-only features related to the temporal arrangement of two or more musical tracks [20, 21]. Lee et al. [7] concentrated on rhythmic matching, adopting tempo as an input parameter to the system and employing beat matching by stretching the beats through a phase vocoder. Today, this strategy proliferates in commercial software such as Traktor, Mashup 2, and Mixxx. To perform the rhythmic alignment of musical audio tracks, Davies et al. [5] compute beat and downbeat tracking based on a combined kick and snare drum onset detection function, while assuming a constant tempo and time signature across the entire duration of a musical track.

Advances in computational mashup creation pursued multi-attribute models, notably accounting for harmonic compatibility, commonly referred to as “harmonic mixing” [6, 20]. Multiple strategies have been adopted to measure harmonic compatibility: key affinity (or distances) in the circle of fifths; cosine similarity between chroma vector representations [5]; sensory dissonance [23]; and a combination of dissonance and perceptual relatedness indicators from Tonal Interval Vector (TIV) distances [6, 20]. The latter metric has been shown to outperform the remaining harmonic compatibility metrics by perceptually aligning with human judgments and by its computational efficiency, promoting harmonic mixing at scale [24].

Timbre has been addressed in computational mashup creation to balance the spectrum across multiple regions [5] or as a strategy to find audio content which occupies the same spectral region [6]. Davies et al. [5] proposed a spectral balance metric that privileges (overlapping) mashups resulting in flat spectral representations. Maçãs et al. [6] compute the cosine distance between Mel-frequency cepstral coefficients (MFCC) to aid users in selecting musical loops with controlled degrees of timbral similarity via a graph-based visualization of a large loop dataset.

In contrast to the above models, which broadly pursued mashups resulting from the vertical recombination of musical tracks, Lee et al. [7] and Harrison et al. [25] extend the problem to the use of short fragments of musical audio in both the vertical and horizontal dimensions. In other words, their models consider both overlapping musical audio recombinations and their continuation in time.

To the best of our knowledge, Chen et al. [9] and Huang et al. [10] proposed the only end-to-end machine learning models for music mashup creation using deep neural networks. The former system, named Neural Loop Combiner, shows that a convolutional neural network trained on a dataset of hip-hop loops outperforms a Siamese network in creating musical mashups. A large dataset of loops with ground-truth compatibility annotations is needed to train these models. Due to the nonexistence of such datasets, Chen et al. [9] proposed a pipeline to extract loops from existing music and obtain positive examples of compatible loops. Once the neural network is trained, the model is somewhat limited to the style of the training dataset, and little adaptation to user preferences can be accommodated.

Huang et al. [10] have extended the Neural Loop Combiner model with novel techniques for generating training data without human labels through musical source separation [26, 27]. Isolated vocal, bass, drum, and other parts allow training the network with more controlled examples in terms of instrumentation pairings and directly estimating the compatibility of groups of stems, instead of learning a representation space for embedding the stems. The source musical material from which the loops were extracted goes beyond the hip-hop genre of Neural Loop Combiner to allow for greater generalization, as musical compatibility can be style-specific. Furthermore, they developed two novel deep neural network architectures (PreMixNet and PostMixNet) trained in a self- and semi-supervised way.

2.2 Evolutionary optimization for musical audio loops recombination

Evolutionary algorithms are a class of artificial intelligence methods greatly motivated by optimization processes inspired by natural phenomena, such as natural selection, species migration, bird swarms, human culture, and ant colonies [28]. Evolutionary optimization algorithms can be defined by two main criteria: modality (unimodality and multimodality) and the number of objective criteria to optimize (single objective and multiobjective).

The modality in evolutionary algorithms denotes optimization strategies seeking solutions for one global optimum (unimodal) or multiple local optima (multimodal) in a single algorithm run across several iterations. Multimodal evolutionary algorithms usually account for the population diversity, resulting from a comprehensive exploration of the search space [29].

Single- and multiobjective optimization differ in the objective search strategy applied: single-objective optimization finds optimal solutions to a single objective function, whereas multiobjective optimization accounts for problems with conflicting objectives and no single optimal solution [30].

Our work concentrates on multimodal, single-objective search optimization, notably the AIS opt-aiNet algorithm. De Castro and Timmis [12] presented AIS opt-aiNet to solve multimodal, single-objective optimization problems. The algorithm evolves a population of cells towards a set of optimal and diverse solutions to a problem. It employs the immunological concepts of clonal proliferation, mutation, and suppression to establish a network of inhibitory interactions modeled on the immune system. In other words, AIS opt-aiNet integrates global and local search to find optimal solutions while ensuring their diversity. Furthermore, the algorithm presents two additional important features: the automatic determination of the population size and a defined convergence criterion.

AIS has been adopted in music-related problems that typically require optimization strategies to find optimal and diverse solutions from a large-scale pool of candidates. Navarro et al. [31] adopt AIS opt-aiNet to generate chord progressions in the symbolic music domain; for each new chord, a diverse yet optimal pool of chords is proposed to the user as good candidates for extending the progression. Lampropoulos, Sotiropoulos, and Tsihrintzis [32] use an AIS negative selection algorithm for music recommendation; their model has been shown to outperform state-of-the-art support vector machine models in providing high-quality music recommendations that efficiently describe the sub-space of user preferences. The closest proposal to our model is by Caetano et al. [13], who applied AIS opt-aiNet to the problem of computer-aided orchestration, i.e., the search for instrumental note sample combinations that match the timbre of a reference sound, and showed the importance of the method in promoting diverse solutions with optimal quality.

3 Mixmash-AIS

In Mixmash-AIS, we adopt AIS opt-aiNet to promote the efficient computational search for musical mashups resulting from the recombination of compatible musical audio loops, referred to in AIS opt-aiNet as network cells. A mashup p results from the combination of loops li ∈ A, where i ∈ {1, ..., L} is the index of the loop in the dataset A, which contains a total of L musical loops. A mashup p results from overlapping its component loops; therefore, it can be understood as a combination of multiple loop layers o. While a virtually infinite number of loop layers can be added to a mashup p = {o1, ..., ou}, where u is the total number of loop layers, musical textures typically include three or four layers [34].

Each loop li is represented in a feature space, detailed in Section 3.1, where distances between loops capture their compatibility: the smaller the distance, the greater the compatibility. Therefore, optimal compatible mashups result from minimizing an evaluation function Ep in the feature space, which we define in Section 3.2. Diversity is guaranteed by pursuing a combination of global and local search in exploring the feature space, resulting from the iterative clonal mutation and selection of mashup candidates (see Fig. 1). The search for a diverse set of compatible loop mashups is detailed in Section 3.3.

Fig. 1 Illustration of the search (left) and feature (right) spaces in Mixmash-AIS. We reduced the search and feature spaces to two dimensions for enhanced visualization. The search space represents the recombination of two musical audio loops, and the feature space represents harmonic H and rhythmic R compatibility.

Figure 2 shows the architecture of Mixmash-AIS. A user-curated dataset A of musical audio loops li is the content adopted to create musical mashups. The feature extraction algorithm computes harmonic Ti(k), rhythmic ri(q), and spectral bi(s) representations for each dataset loop li, which are stored in a feature dataset. AIS opt-aiNet is then adopted to search for multiple compatible and diverse mashups by evolving a random initial population. Finally, a set of optimal mashups results from overlaying the mashup component loops.

Fig. 2 The architecture of Mixmash-AIS. Rectangular blocks denote processing functions. Solid and dashed arrows indicate the audio and control flow of information between processing modules, respectively.

3.1 Feature extraction and dataset

The feature extraction module is responsible for creating three representations that capture the harmonic Ti(k), rhythmic ri(q), and spectral bi(s) content of each musical audio loop li. From the feature representations, we apply distance metrics to compute indicators of harmonic H, rhythmic R, and spectral S compatibility between musical audio loops li.

Equation (1) computes a TIV Ti(k) to represent the harmonic content of an audio loop li. Ti(k) is a 12-dimensional vector computed as the DFT of a chroma vector ci(m). The use of the DFT of chroma vectors from musical audio has been shown to provide indicators of dissonance and perceptual relatedness with greater accuracy than chroma vectors [20]. Furthermore, the representation is invariant to the timbral differences of instrument sounds and has been shown to outperform existing representations in finding good harmonic mixes from musical audio [24].

$$ \begin{array}{@{}rcl@{}} T_{i}(k)&=& w_{a}(k) \sum\limits_{m=0}^{M-1} \bar{c}_{i}(m) \, e^{\frac{-j2 \pi km}{M}} \quad , \quad k \in \mathbb{Z} \quad , \\ \text{with} \quad \bar{c}_{i}(m) &=& \frac{c_{i}(m)}{\sum\limits_{m=0}^{M-1}c_{i}(m)}\quad , \end{array} $$
(1)

where M = 12 is the dimension of the input vector and wa(k) = {3, 8, 11.5, 15, 14.5, 7.5} are weights derived from empirical ratings of dyad consonance, used to adjust the contribution of each dimension k of the DFT space [20]. We set 1 ≤ k ≤ 6 for Ti(k), since the remaining coefficients are symmetric. Ti(k) uses \(\bar {c}_{i}(m)\), which is ci(m) normalized by the DC component, to allow the representation and comparison of different hierarchical levels of tonal pitch, such as chords and keys, which ultimately relate to different time scales or variable-duration loops.
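The TIV computation translates directly into a few lines of NumPy. The following is a minimal sketch of (1), assuming a pre-computed 12-bin chroma vector with non-zero energy; the function and variable names are ours, not part of the Mixmash-AIS codebase.

```python
import numpy as np

# Empirical dyad-consonance weights w_a(k) for DFT bins k = 1..6 (Section 3.1).
W_A = np.array([3, 8, 11.5, 15, 14.5, 7.5])

def tonal_interval_vector(chroma):
    """Compute T_i(k) per (1): normalize the chroma vector by its DC
    component, take its DFT, and keep the weighted bins k = 1..6
    (the remaining coefficients are symmetric)."""
    c_bar = np.asarray(chroma, dtype=float)
    c_bar = c_bar / c_bar.sum()      # normalization by the DC component
    spectrum = np.fft.fft(c_bar)     # DFT of the normalized chroma vector
    return W_A * spectrum[1:7]       # complex-valued, 6 bins = 12 dimensions
```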

Following Bernardes et al. [20, 35], we adopt the Ti(k) space to compute the harmonic compatibility H between two given loops l1 and l2 using (2), which combines the dissonance D and perceptual distance P metrics shown in (3) and (4), respectively. The lower the value of H, the higher the degree of harmonic compatibility between the two audio loops. Fernández [24] has shown that the harmonic compatibility indicator H captures human preferences for a mashup to a higher degree than the remaining harmonic compatibility metrics [17, 23].

$$ \begin{array}{@{}rcl@{}} H_{l_{1},l_{2}}&=&D_{l_{1},l_{2}} \cdot P_{l_{1},l_{2}} \end{array} $$
(2)
$$ \begin{array}{@{}rcl@{}} D_{l_{1},l_{2}}&=&1-\frac{\left\lVert a_{1} T_{1}(k) + a_{2} T_{2}(k)\right\rVert}{(a_{1} + a_{2}) \left\lVert w_{a}(k)\right\rVert} \end{array} $$
(3)

where a1 and a2 are the amplitudes of T1(k) and T2(k), respectively.

$$ P_{l_{1},l_{2}}=\sqrt{\sum\limits_{k=1}^{6} \lvert T_{1}(k) - T_{2}(k)\rvert^{2}} $$
(4)
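Continuing the NumPy sketch above, the harmonic compatibility computation of (2)-(4) can be illustrated as follows. Note that the norm placement in the dissonance term follows our reading of (3) as the amplitude-weighted TIV mix normalized by the norm of the weight vector, so this is an illustration under that assumption rather than a reference implementation.

```python
def harmonic_compatibility(T1, T2, a1=1.0, a2=1.0):
    """H = D * P per (2); lower values indicate higher compatibility.
    T1 and T2 are complex TIVs (bins 1..6); a1 and a2 are amplitudes."""
    # Dissonance (3): one minus the norm of the amplitude-weighted TIV
    # mix, normalized by the norm of the weight vector w_a(k).
    D = 1 - np.linalg.norm(a1 * T1 + a2 * T2) / ((a1 + a2) * np.linalg.norm(W_A))
    # Perceptual distance (4): Euclidean distance between the two TIVs.
    P = np.linalg.norm(T1 - T2)
    return D * P
```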

A rhythmic histogram ri(q) [16], with q = 60 bins, is adopted to represent the rhythmic content of a musical loop as amplitude modulations. The representation derives from rhythmic patterns [36], a time-invariant matrix representation of loudness fluctuations in the 24 critical Bark bands of the human hearing range. The fundamental difference is that the rhythmic histogram ri(q) accumulates all 24 critical bands, resulting in a single vector of 60 frequency modulation bins in the [0, 600] beats-per-minute (bpm) range [16]. Our motivation to adopt rhythmic histograms ri(q), instead of the more common rhythmic pattern representation, is to minimize pitch or spectral differences in the similarity computation, namely in light of the typically single-instrument nature of musical loops used in production settings [36, 37].

We adopt a two-stage extraction process to compute a rhythmic histogram ri(q). First, we apply a short-time Fourier transform and group the resulting frequency bands by loudness sensation. The resulting spectral representation is then converted into a time-invariant modulation frequency spectrum over the 24 critical Bark bands by reapplying the Fourier transform. High amplitude values in the rhythmic histogram ri(q) denote a recurrent period in the musical audio. Figure 3 shows a rhythmic histogram with four predominant peaks at multiples of the tempo (126 bpm).

Fig. 3 Rhythmic histogram of an audio loop with a stable tempo of 126 bpm. Peaks at integer multiples of 126 bpm indicate the presence of active pulses on binary subdivisions of the tempo and a straight rhythmic feeling.

Rhythmic compatibility R is computed as the angular distance between rhythmic histograms r1(q) and r2(q) representing two musical loops l1 and l2, such that:

$$ R_{l_{1},l_{2}} = \arccos\frac{r_{1}(q) \cdot r_{2}(q)}{\|r_{1}(q)\| \, \|r_{2}(q)\|} $$
(5)
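The angular distance of (5) reduces to the arccosine of the cosine similarity between the two histograms. A minimal sketch, assuming 60-bin non-negative rhythmic histograms as NumPy arrays:

```python
def rhythmic_compatibility(r1, r2):
    """Angular distance between two rhythmic histograms, per (5).
    For non-negative histograms the result lies in [0, pi/2];
    lower values indicate higher rhythmic compatibility."""
    cos_sim = np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))  # clip guards float rounding
```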

Inspired by the key role of equalization in audio mixing [5], we propose a measure of spectral balance S to ensure a dispersed Bark spectral energy distribution in the resulting loop mashup. The use of Bark spectra bi(s) aims to capture the subjective perception of loudness across the 24 critical bands of human hearing [38]. To ensure some separation between two given loops l1 and l2, the spectral balance S computes the distance between their Bark spectrum centroids \(\hat b_{i}\), given by (7), such that:

$$ S_{l_{1},l_{2}} = \begin{cases} 0 & \text{if} \quad \lvert \hat{b}_{1} - \hat{b}_{2} \rvert \geq 1 \\ 1 & \text{otherwise} \end{cases}. $$
(6)
$$ \hat b_{i} = \frac{{\sum}_{s=0}^{S-1} b_{i}(s) \cdot s}{{\sum}_{s=0}^{S-1} b_{i}(s)} $$
(7)
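Equations (6) and (7) also translate directly into code. A minimal sketch, assuming 24-band Bark spectra as NumPy arrays and continuing the snippets above:

```python
def bark_centroid(b):
    """Bark spectrum centroid per (7): the amplitude-weighted mean band."""
    s = np.arange(len(b))
    return np.sum(b * s) / np.sum(b)

def spectral_balance(b1, b2):
    """Spectral balance penalty per (6): 1 when both loops' Bark centroids
    fall within the same critical band, 0 otherwise (lower is better)."""
    return 1.0 if abs(bark_centroid(b1) - bark_centroid(b2)) < 1 else 0.0
```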

Promoting spectral separation fosters the auditory segregation of the loops in a mashup [39], a known good practice for polyphonic textures within music composition [34]. However, larger spectral separation does not necessarily imply a “better” musical texture. Ultimately, while it is advisable to maintain a minimal degree of spectral separation, voice separability is a subjective and creative decision that should be left to the user across the multiple optimal solutions provided by the model. In this context, we only penalize loop recombinations within the same critical bandwidth; all remaining cases are considered valid candidates. Adopting a penalty of one for loops within the same critical band roughly aligns with the maximum distance values of the remaining harmonic and rhythmic criteria in Ep. Furthermore, by privileging loops that occupy different critical bands, we promote a more uniform spectral energy distribution and foster reduced sensory dissonance across loops, as the auditory roughness between spectral peaks in different critical bands is residual [40].

3.2 Evaluation function

To ensure high compatibility values in a diverse space of solutions, AIS opt-aiNet assesses a population of musical loop mashups p at each iteration g. The population evolves across multiple iterations by minimizing a cost function Ep. A two-step process is adopted to compute an objective evaluation value per mashup p. First, we define Hp, Rp, and Sp as the sums of all pairwise distances across the component loops li in the mashup p. This step allows a virtually infinite number of overlapping layers to be considered by the model. A total of u(u − 1)/2 unique pairwise values (i.e., the combinations of two elements without repetition) per harmonic, rhythmic, and spectral representation are summed using (2), (5), and (6), respectively, where u is the total number of loops li in a mashup p per criterion.

We normalize the Hp, Rp, and Sp metrics by the total number u(u − 1)/2 of pairwise elements per criterion. This normalization strategy allows musical audio loop recombinations where some of the criteria may not be relevant. For example, if we aim to create a mashup including a tonal and a percussive loop, the harmonic compatibility criterion Hp may not be relevant and should thus be excluded. Figure 4 illustrates the criteria computation for three overlapping layers: two layers are retrieved from a dataset A1 featuring loops with tonal content, and a third layer is retrieved from a dataset A2 with percussive content.

Fig. 4 Illustration of the threefold harmonic, rhythmic, and spectral compatibility criteria computation for a mashup p with three overlapping musical loops. Two tonal-content loops (l3 and l9) are retrieved from dataset A1. A percussive loop l20 is retrieved from dataset A2. Per criterion, all unique pairwise combinations (without repetition) are linearly combined and divided by the number of pairs u(u − 1)/2. In the harmonic criterion, the loop l20 ∈ A2 is excluded due to its percussive-only content.

Second, we apply (8) to combine the resulting harmonic Hp, rhythmic Rp, and spectral balance Sp compatibility. Furthermore, a high penalty is applied to mashups p that include repeating loops li, such that:

$$ E_{p} = \alpha H_{p} + \beta R_{p} + \gamma S_{p} + F_{p} \quad , $$
(8)

where Fp = 0 if no duplicate loop indexes i are found in p and Fp = 50 otherwise. α, β, and γ are weights that balance the relative importance of the metrics and serve as customizable parameters that bias the search towards user preferences across harmonic, rhythmic, and spectral compatibility.
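Putting the pieces together, the evaluation function of (8) can be sketched as below, reusing the compatibility functions above. The loop record structure (a dict with an index and per-criterion features) is our illustrative choice, not the article’s implementation.

```python
from itertools import combinations

def evaluate_mashup(loops, alpha=0.4, beta=0.4, gamma=0.2):
    """E_p per (8). Each element of `loops` is a dict with an 'index' and
    'tiv', 'rhythm', and 'bark' features (hypothetical field names).
    H_p, R_p, and S_p average the u(u-1)/2 unique pairwise values."""
    pairs = list(combinations(loops, 2))
    H = np.mean([harmonic_compatibility(x['tiv'], y['tiv']) for x, y in pairs])
    R = np.mean([rhythmic_compatibility(x['rhythm'], y['rhythm']) for x, y in pairs])
    S = np.mean([spectral_balance(x['bark'], y['bark']) for x, y in pairs])
    indexes = [x['index'] for x in loops]
    F = 0 if len(set(indexes)) == len(indexes) else 50  # duplicate-loop penalty
    return alpha * H + beta * R + gamma * S + F
```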

3.3 Search algorithm

The immunological operations in AIS opt-aiNet — cloning, mutation, and affinity suppression — evolve an initial random population towards compatible and diverse loop mashups in the immune network. Maintenance of compatibility is assured by the evaluation function Ep and leveraged by the cloning and mutation operators, which optimize the population of mashup candidates across multiple regions of the multimodal search space at each algorithm iteration g. Valleys (or local minima) in the multimodal search space indicate optimal mashup candidates p. Figure 5 shows a flowchart diagram of the AIS opt-aiNet algorithm used in Mixmash-AIS.

Fig. 5 Artificial immune system opt-aiNet flowchart diagram adopted in Mixmash-AIS.

The AIS opt-aiNet algorithm instantiates a random population of mashups p. In greater detail, to define an initial population, random index numbers i for the component loops li of each mashup p are generated. The initial number of mashups in the population (i.e., the population size or number of network cells) is not critical, as the algorithm includes mechanisms for automatically adjusting the population size via affinity suppression and population expansion. Cloning creates N offspring cells per mashup in the population, which are identical copies of their parent cell; each clone thus comprises the parent and its N offspring. The offspring cells undergo an operation of somatic mutation to become variations of their parent. In other words, mutation determines whether a given loop li in a mashup p is changed. The probability of a given loop li within a mashup p being mutated is inversely proportional to the mashup’s evaluation value. Following [13], we adopt (9) to define the mutation probability of a given loop li within a mashup p.

$$ \chi = \exp{(-\delta\hat{E})} , $$
(9)

where δ = 1.2 is a constant (as proposed in [13]) and \(\hat {E}\) is the evaluation value of a given mashup p normalized to the [0,1] range. For each loop li in the mashup p, a random value in the [0,1] range determines whether the audio loop index i is mutated (Fig. 6): a loop index i is mutated (i.e., replaced) if the random value is ≤ χ. If the condition holds, a random loop index from the dataset A is fetched.
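The mutation step can be illustrated as follows; the function name is ours, and the normalized evaluation value is assumed to be pre-computed from the current population.

```python
import random

def mutate_mashup(loop_indexes, e_norm, dataset_size, delta=1.2):
    """Somatic mutation of a clone: each loop index is replaced by a random
    dataset index with probability chi = exp(-delta * e_norm), per (9),
    where e_norm is the mashup's evaluation value normalized to [0, 1]."""
    chi = np.exp(-delta * e_norm)
    return [random.randrange(dataset_size) if random.random() <= chi else i
            for i in loop_indexes]
```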

Fig. 6 Mutation probability for a loop li within a mashup p as a function of the normalized mashup evaluation value Ep. Randomly generated values below the function enable the mutation of a given loop within a mashup.

The clonal selection performs an elitist optimization of the population to retain the highest-compatible mashup per clone. To this end, all clone mashups are evaluated using (8), and the mashup with the smallest evaluation value Ep per clone is retained in the population. Then, the population’s average evaluation \(\overline E_{g}\) at a given iteration g is computed to assess whether the mashup optimization has stabilized. The population is said to have stabilized if the average error v in (10) is ≤ .001. The average error v is the absolute value of one minus the ratio between the previous iteration’s average evaluation \(\overline E_{g-1}\) and the current iteration’s average evaluation \(\overline E_{g}\). Once the population stabilizes, the algorithm proceeds to the affinity suppression of candidate solutions, followed by population expansion.

$$ v = \left\lvert 1 - \frac{\overline E_{g-1}}{\overline E_{g}} \right\rvert $$
(10)
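The stabilization test of (10) reduces to a one-line comparison, e.g.:

```python
def has_stabilized(e_avg_prev, e_avg_curr, eps=0.001):
    """Population stabilization test per (10): v = |1 - E_{g-1}/E_g| <= eps."""
    return abs(1 - e_avg_prev / e_avg_curr) <= eps
```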

AIS opt-aiNet adopts the suppression operator to maintain diversity by excluding mashups p with high affinity, i.e., below a given distance threshold t in the feature space. Pairwise mashup affinity in the feature space is computed as the angular distance between the concatenated vectors 〈Tp(k), rp(q), bp(s)〉 expressing the harmonic, rhythmic, and spectral content of the (overlapping) component loops of a mashup p. To compute a unique vector representing the overlapping loops of a mashup p = {o1, ..., ou}, we apply (11), (12), and (13) to compute the amplitude-weighted combination of harmonic vectors Tp(k), the linear combination of rhythmic histograms rp(q), and the linear combination of Bark spectral representations bp(s).

$$ \begin{array}{@{}rcl@{}} T_{p}(k) &=& \sum\limits_{o=1}^{u} a_{o} T_{o}(k) \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} r_{p}(q) &=& \sum\limits_{o=1}^{u} r_{o}(q) \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} b_{p}(s) &=& \sum\limits_{o=1}^{u} b_{o}(s) \end{array} $$
(13)

The suppression operator within AIS opt-aiNet retains a single locally optimal mashup, the one minimizing the value of Ep in (8), and excludes all population mashups at a distance smaller than the threshold t (i.e., high-affinity mashups). By excluding similar mashups p from the immune network, we ensure diversity in the population. After suppression, the remaining mashups in the immune network are memory cells.
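A greedy sketch of the suppression step follows. It assumes each mashup’s concatenated feature vector 〈Tp(k), rp(q), bp(s)〉 is available as a real-valued array (e.g., with the complex TIV unpacked into real and imaginary parts); the function names are ours.

```python
def angular_distance(v1, v2):
    """Angular distance between two real-valued feature vectors (cf. (5))."""
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

def suppress(population, t=0.5):
    """Affinity suppression: keep the best-evaluated mashup of each
    neighborhood and drop any mashup closer than the threshold t to an
    already retained one. `population` is a list of
    (E_p, feature_vector, mashup) tuples; returns the memory cells."""
    memory = []
    for e_p, vec, mashup in sorted(population, key=lambda cell: cell[0]):
        if all(angular_distance(vec, kept_vec) >= t
               for _, kept_vec, _ in memory):
            memory.append((e_p, vec, mashup))
    return memory
```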

AIS opt-aiNet includes two stopping criteria; whenever one is met, the iterative method stops and the population is returned. The first criterion is a user-defined maximum number of iterations \(\hat g\). The second criterion is population stabilization, i.e., when the number of memory cells remains equal between two consecutive iterations. If the number of memory cells does not stabilize, a percentage d = 40% of random network cells is appended to the population to expand the immune network’s capacity to explore the space further.

Finally, a population of compatible and diverse mashups is output upon convergence. Mashups are presented to the user ranked in ascending order of their evaluation value Ep, i.e., from the most to the least compatible mashup. Each mashup p is synthesized by overlapping its component audio loops li, retrieved from the loop dataset A given their indexes i. This straightforward playback procedure does not account for the metrical alignment of musical loops, nor does the rhythmic compatibility metric R capture the temporal alignment of loops; rather, it inspects the periodicity of loop events. Therefore, we either assume the dataset loops start on the strongest metrical accent and have no residual silence, as is the case for the vast majority of professional loop collections, or a manual alignment of the overlapping loops must be performed.

4 Model evaluation

As shown in previous studies [5, 23], the compatibility of musical loops in a mashup is fundamental to user enjoyment. However, diversity in mashup creation is equally important in promoting multiple solutions from which users can select, considering their personal preferences. Therefore, an application to aid users in creating mashups should provide multiple and perceptually different solutions.

In this context, we use objective and perceptual measures to evaluate our model. We adopt objective measures to evaluate the (1) compatibility, (2) diversity, and (3) computational performance of the proposed model, Mixmash-AIS, compared to a standard genetic algorithm (GA), Mixmash-GA, and the original brute-force (BF) approach, Mixmash-BF. All models use the same feature space, which results from the combined harmonic H, rhythmic R, and spectral S compatibility criteria. More importantly, they adopt the same evaluation function Ep in (8) to assess the compatibility of a mashup. A perceptual test aims to validate the objective evaluation function Ep in Mixmash-AIS.

For the evaluation, we implemented the above three models as prototype applications in Pure Data [18]. Their source code is available as Supplementary Material to this article. Furthermore, all musical mashups adopted in the evaluation are also shared. Mixmash-GA adopts a standard GA with uniform crossover (70% probability), uniform mutation (20% probability), roulette wheel selection, and elitism (top 5% of individuals). Similar to Mixmash-AIS, Mixmash-GA outputs mashups ordered by their compatibility values Ep.

Ideally, Mixmash-AIS should reach a level of compatibility in the highest-compatible mashups similar to that of Mixmash-BF. Moreover, we expect Mixmash-AIS to outperform Mixmash-GA in compatibility and diversity: standard GAs can prematurely converge to a single local optimum, which is not guaranteed to be the global (most compatible) optimum, whereas AIS opt-aiNet returns multiple (diverse) local optima.

We adopt two hip-hop loop datasets for evaluating Mixmash-AIS and the baseline Mixmash-GA and Mixmash-BF models. Dataset A1 includes 551 loops featuring tonal content (without percussive instruments) across a large array of electronic music instrumental sounds. Dataset A2 includes 170 loops featuring percussive instrument content. Both datasets were collected from the extensive set of Apple Loops, which are commonly distributed with proprietary Apple Digital Audio Workstations (DAWs), such as GarageBand and Logic. Due to the proprietary nature of the datasets, the audio loop content cannot be shared with the article but can be freely downloaded from the DAWs mentioned above. The metadata identifying the list of audio loops adopted in both datasets and all descriptor analysis data per loop can be found in the Supplementary Material to this article.

The motivation to adopt Apple datasets featuring hip-hop loops only is twofold. First, it challenges the assessment of diversity within a single style; including multiple styles in the dataset would increase the diversity of the dataset by design and, expectedly, the diversity of the mashups. Second, the Apple Loops dataset is representative of the musical audio production content a producer or composer would adopt in a real-world creative context. Studies assessing the adoption of cross-style, personal, and multi-style datasets should be conducted to further validate this claim (please refer to Section 6, where we detail future research in this domain).

The lack of curated datasets with some balance across structural music dimensions (e.g., key, tempo, and instrumentation) is a major drawback in this task. After analyzing several styles within the Apple Loops collection, hip-hop was the one closest to a uniform distribution in harmonic, rhythmic, and spectral content (shown in Fig. 7 and discussed in Section 4).

Fig. 7 Histogram distributions of the harmonic, rhythmic, and spectral criteria for the two musical audio loop datasets A1 and A2 used in the study evaluation. Distributions (a), (b), and (c) result from the complete analysis of the harmonic ci(m), rhythmic ri(q), and spectral bi(s) content, respectively, of dataset A1, which features 551 loops with tonal content. Distributions (d) and (e) result from the complete analysis of the rhythmic ri(q) and spectral bi(s) content, respectively, of dataset A2, which features 170 loops with percussive content. Globally, the distributions denote some uniformity, ensuring an unbiased distribution across the search space.

The loops range from 5 to 24 s in duration and feature diverse tempi (or bpm) and multiple instruments within a large set of spectral regions, roughly in the [40, 10000] Hz range. Figure 7 shows the harmonic ci(m), rhythmic ri(q), and spectral bi(s) representation histograms for the entire loop collection in datasets A1 (images (a), (b), and (c)) and A2 (images (d) and (e)). For an unbiased evaluation, the dataset histograms should result in uniform distributions, as these denote a comprehensive exploration of the search and feature spaces. While the harmonic histogram in Fig. 7(a) shows a quite uniform distribution, the remaining rhythmic (images (b) and (d)) and spectral (images (c) and (e)) histograms for both datasets A1 and A2 show a slight prevalence towards the lower bpm and Bark bands, respectively. This somewhat expected bias towards the more musical range of the bpm and Bark scales is still quite balanced: most consecutive bins denote minor differences, and all bins have some magnitude, enabling a comprehensive exploration of the search and feature spaces.

The objective evaluation adopts dataset A1 and targets the generation of two-layered loop mashups with tonal content. The subjective evaluation included tonal and percussive loops from both the A1 and A2 datasets. A three-layered mashup is adopted, where two layers are retrieved from the tonal-content loops in dataset A1 and a percussive loop is retrieved from dataset A2. To assess the recombination between the percussive loop and the remaining tonal-content loops, we do not inspect harmonic compatibility in Ep; all remaining criteria are considered. The illustration of the criteria computation in Fig. 4 reflects the procedure adopted in the perceptual three-layered loop mashup evaluation.

We defined the AIS opt-aiNet and GA parameters to use an initial population of 20 cells and a maximum of 200 iterations. Furthermore, we set the number of clones generated per network cell in AIS opt-aiNet to 10 and the affinity threshold to t = .5. While the weights in (8) aim primarily to provide user control over the search criteria in the feature space, for the evaluation we set them to slightly privilege harmonic and rhythmic compatibility over spectral balance: α = .4 (harmonic compatibility), β = .4 (rhythmic compatibility), and γ = .2 (spectral balance). These weight values resulted from experimentation with the model using datasets A1 and A2 and follow early evidence on the importance of each criterion [24].

4.1 Objective evaluation of compatibility

To objectively assess and compare compatibility across all models under evaluation, we average the evaluation function values Ep of the 10 highest-compatible mashups, thus providing an average indicator of model compatibility. The smaller the average compatibility value, the better the model complies with the objective criteria in (8), which we aim to minimize. Furthermore, we compute 10 runs of the AIS opt-aiNet and GA models to capture the inter-run deviation. Deviations in the optimization convergence of the models are expected to be more noticeable in the GA. While not guaranteeing similar results at each run, the AIS opt-aiNet operators, namely affinity suppression and population expansion, provide mechanisms that tend to evolve the population towards similar convergence results. The BF approach computes the same results at each run, as it exhaustively inspects all loop mashup combinations.

4.2 Objective evaluation of diversity

To measure the diversity between two given mashups p in the three models under evaluation, we adopt the cosine distance between the concatenated vectors 〈Tp(k), rp(q), bp(s)〉 per mashup p. The objective diversity of a population is given by the median across all unique pairwise mashup distances in the 10 highest-compatible set of each model. A total of 45 pairwise distance values are considered per model.
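As a sketch, the diversity indicator can be computed as below, assuming one concatenated real-valued feature vector per mashup and continuing the NumPy snippets above:

```python
from itertools import combinations

def population_diversity(feature_vectors):
    """Median cosine distance across all unique pairs of mashup feature
    vectors (45 pairs for the 10 highest-compatible mashups)."""
    dists = [1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
             for u, v in combinations(feature_vectors, 2)]
    return float(np.median(dists))
```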

The use of the cosine distance across the multiple harmonic H, rhythmic R, and spectral S dimensions is directly inspired by the affinity suppression operator adopted in our implementation of the AIS opt-aiNet (please refer to Section 3.3). The adopted distance metric results in distance values close to zero for perceptually similar mashups p and higher distance values for less perceptually similar mashups. Therefore, the larger the cosine distance between mashups p, the greater their perceptual difference, which we adopt as an indicator of diversity. In other words, the larger their distances in the feature spaces, the more perceptual diversity in the population.

In the harmonic feature space Tp(k), the cosine distance reflects the retention of common tones in the mashups’ harmonic content [15, 41]. In the rhythmic feature space rp(q), the cosine distance gradually increases from rhythmic periodicities at the same tempo and subdivision, to less similar rhythmic structures within the same tempo (or metrical structure, such as double- or half-tempo rhythmic structures), and, finally, to rhythmic periodicities that are not multiples of the tempo [16]. In the spectral feature space bp(s), the cosine distance inspects the alignment of the Bark spectral representations of the mashups, a common metric to capture timbral similarity [6]. In the rhythmic and spectral feature spaces, the adoption of the cosine distance accounts for perceptual similarity while disregarding amplitude differences.

4.3 Evaluating computational performance

The computational performance of the models is instrumental to the task of computational mashup creation at scale due to the combinatorial explosion of loop recombinations. The complexity of the problem depends on the size L of the user-curated loop dataset A. For example, a BF approach on dataset A1, including 551 loops, results in L(L − 1)/2 = 151,525 unique combinations for mashups with two overlapping loops.
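The combination count follows directly from L(L − 1)/2, e.g.:

```python
from math import comb

# Unique two-loop mashups a brute-force search must score for dataset A1:
assert comb(551, 2) == 551 * 550 // 2 == 151_525
```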

The associated computational cost of each iteration of the AIS opt-aiNet and GA under evaluation can be defined as \(\mathcal {O}(LV)\), where L is the current population size and V is the length of the combined rhythmic, harmonic, and spectral representation vectors. The affinity suppression in AIS opt-aiNet has an additional computational cost of \(\mathcal {O}(L^{2}V)\). The BF approach does not feature multiple iterations, and its computational cost can be defined as \(\mathcal {O}(L^{2}V)\). The user defines the initial population in the AIS opt-aiNet and GA; in our evaluation, we adopt a population size of 20, which remains stable in the GA and is dynamically adapted at each iteration of the AIS opt-aiNet. In the BF approach, the population equals the total number of loops L in the dataset A.

These costs indicate significant computational gains when adopting the GA over the AIS opt-aiNet, since in the latter the affinity suppression adds complexity. However, both AIS opt-aiNet and GA suggest substantial gains compared to the BF approach, although these depend on their ability to converge. Therefore, to assess the performance of the models in the real-world scenario of the Apple Loops collection, we computed the average CPU usage over 10 runs. Furthermore, we also report the number of iterations and the population size at convergence, which is particularly relevant for AIS opt-aiNet due to the dynamic behavior of its population size at each iteration.

4.4 Perceptual evaluation

We conducted a perceptual test to explore the relationship between user enjoyment of a mashup and the evaluation function Ep in (8), which objectively assesses the compatibility of a mashup p. We hypothesized that user enjoyment would correlate with the compatibility evaluation function Ep. We conducted an online listening test to perceptually assess the compatibility of the resulting mashups from Mixmash-AIS. Our test design is based on the procedure reported in [17] for assessing the relationship between an objective mashability metric and user enjoyment; in our case, the mashability metric is replaced by the evaluation function Ep.

In total, 103 participants completed the listening test. Targeted participants included college students (undergraduate and graduate levels) and a balanced number of participants across musical training (51.2% with versus 48.8% without musical training). As the enjoyment of musical mashups can be appreciated irrespective of musical training, we did not use it as a criterion for selecting participants. However, we ensured that the participants understood the term “music mashup” by explicitly informing them about the practice. No participant reported listening disabilities. To allow the participants to familiarize themselves with the procedure of the listening test, namely the interface, and to set the playback volume to a comfortable level, each participant undertook a short training phase before completing the test. Participants were asked to use high-quality headphones, and informed consent was obtained. Furthermore, participants were free to withdraw at any point and were not paid to take the listening test.

The listening test included 10 mashups randomly selected from 14 output mashups from Mixmash-AIS. Mashups overlap three audio loops and have an average duration of 10 s. To prevent order effects, the 10 musical mashups were presented in a different order per participant, using a uniform random distribution. Participants rated their enjoyment of each mashup on a 7-point Likert scale, where 1 corresponds to low enjoyment and 7 to high enjoyment.

5 Results

Figure 8 plots descriptive statistics for the three objective criteria under evaluation ((1) compatibility, (2) diversity, and (3) computational performance) for the AIS opt-aiNet, GA, and BF Mixmash models. We ran the former two models 10 times to account for inter-run deviations, computed as the standard deviation across all runs. The BF approach performs identically in every run, as it computes the compatibility of all possible loop dataset combinations; therefore, we ran it only once. Tables 1, 2, and 3 present the results for the objective criteria under evaluation per run. The results report indicators from the 10 highest-compatible mashups per model.

Fig. 8 Boxplots showing descriptive statistics of the objective evaluation criteria results for the three computational loop mashup creation models. The comparison includes the 10 highest-compatible mashups proposed by the following models: artificial immune system (AIS) opt-aiNet, genetic algorithm (GA), and brute-force (BF) approach. Three objective criteria are evaluated: the compatibility (left plot) between overlapping loops within a mashup, which is to be minimized; the diversity (middle plot), which assesses the perceptual affinity between the resulting set of proposed mashups and is to be maximized; and, finally, the computational performance, which inspects the runtime CPU cost of the models until they terminate.

Table 1 Objective evaluation of compatibility and diversity across 10 runs of the Mixmash-AIS model using an artificial immune system (AIS) opt-aiNet algorithm
Table 2 Objective evaluation of compatibility and diversity across 10 runs of the Mixmash-GA model using a genetic algorithm
Table 3 Objective evaluation of compatibility and diversity in a brute force approach (exhaustive search of all pairwise loop mashup matches) in Mixmash

By comparing the average compatibility values Ep of the AIS opt-aiNet (1.368 ± .114) and GA (2.694 ± .195) models, we can observe that the AIS opt-aiNet outperforms the standard GA in finding compatible mashups, i.e., smaller (and optimal) values in the search space. These results reinforce the importance of the multimodality of the optimization in AIS opt-aiNet to guarantee a comprehensive search across the space, which typically secures both global and local optima. Conversely, the population of the GA tends to converge to a single region, which is not guaranteed to contain the global optimum. The average compatibility of the BF approach (.774) is lower than that of the AIS opt-aiNet and GA, thus presenting a set of more compatible mashups among its 10 highest-compatible mashups. However, AIS opt-aiNet excludes perceptually similar mashups with high affinity, retaining only the most compatible one, whereas BF outputs all mashups irrespective of their affinity. The latter assertion is verified by the BF approach’s lower median affinity (distance) value compared to the AIS opt-aiNet, which indicates that the 10 highest-compatible mashups found by BF have smaller diversity.

The objective diversity of the three models, AIS opt-aiNet (1.624 ± .044), GA (.4379 ± .229), and BF (1.494), denotes a clear advantage of the AIS opt-aiNet in promoting perceptually different mashups among the 10 highest-compatible mashups, since their global distances in the feature space are more spread out than those of the remaining models. The GA performs very poorly in diversity, as it typically converges most mashups towards a single region of the search space. The smaller the median affinity value, the closer the candidate solutions are in the feature space and, consequently, the less perceptually different the output mashups are from each other. Reduced diversity in the output mashup collection restricts the creative MIR interface from acting as a collaborative machine [4], which neither promotes degrees of surprise and serendipity nor accounts for user preferences.

By comparing the computational performance of the three models, given by the CPU time to return a set of mashup solutions (i.e., the set of optimal mashup solutions in AIS opt-aiNet or GA and the complete set of pairwise comparisons in the BF approach), we observe clear gains in the average CPU time of the GA (1428 ± 275 ms) and AIS opt-aiNet (6708 ± 3280 ms) compared to the CPU time of the BF approach (151,523 ms). The GA could be optimized even further, as no stopping criteria have been defined for it. If no diversity is required, the GA is the best optimization strategy due to its performance efficiency while ensuring similar yet compatible mashups. The BF approach presents an obviously high computational cost in a combinatorial explosion problem. One could argue that, once the evaluation of the complete set of loop combinations is pre-computed, the results could be stored and continuously retrieved at runtime. However, this strategy would fail when adopting three or more overlapping loops, as their features and distance metrics would have to be recomputed. Finally, by inspecting the varying iteration count at convergence in the AIS opt-aiNet (13.2 ± 1.988) and the resulting average compatibility values in each run, we can observe the capacity of the algorithm’s stopping criteria to assess optimal convergence conditions.

To examine the listening test results, we first inspected the average perceptual ratings per mashup, defining the level of enjoyment. To assess the relationship between user enjoyment and the objective mashup compatibility evaluation function, we computed the Pearson correlation coefficient between the two sets of data. The scatter plot in Fig. 9 shows the linear relationship between the two variables. A Pearson correlation coefficient of r = .99 (p < .001) indicates a statistically significant correlation between the objective cost function Ep and user enjoyment for our 10 mashups. These results endorse our hypothesis and validate the overall effectiveness of the objective evaluation function Ep in capturing the compatibility of recombined loops. A coefficient of determination R2 = .98 indicates a strong fit between the linear regression model and the variables.

Fig. 9 Correlation and linear regression analysis between the objective evaluation function values Ep and the enjoyment ratings from the perceptual listening test for 10 Mixmash-AIS mashups. A statistically significant relationship (r = .99, p < .001) has been found between the variables under analysis.

6 Conclusions

6.1 Summary and original contribution

We proposed Mixmash-AIS, a multimodal optimization model for musical loop recombination at scale. It adopts the AIS opt-aiNet algorithm to retrieve compatible and diverse mashups while addressing the scalability issues of state-of-the-art BF solutions for computational music mashup creation. To automatically assess the compatibility of mashups of musical audio loops, we proposed an objective function that accounts for three fundamental musical dimensions: harmony, rhythm, and timbre. In greater detail, the objective evaluation function inspects harmonic compatibility in the Tonal Interval Vector space [6, 20], rhythmic compatibility using the time-invariant histogram proposed in [16], and spectral balance, which promotes the recombination of loops occupying different critical Bark bands.
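
Read as pseudocode, the objective can be seen as a weighted combination of the three criteria; the sketch below is illustrative, with hypothetical weight names, and the exact formulation is the one defined earlier in the article:

```python
def mashup_cost(h, r, s, w_h=1.0, w_r=1.0, w_s=1.0):
    """Illustrative weighted combination of the three criteria:
    h = harmonic distance in the Tonal Interval Vector space,
    r = rhythmic distance between time-invariant histograms,
    s = spectral-balance penalty across critical Bark bands.
    Lower Ep means a more compatible mashup; the weights let a
    user bias the search towards, e.g., harmonic compatibility."""
    return w_h * h + w_r * r + w_s * s
```

Raising w_h, for instance, biases the retained optima towards harmonically closer loop combinations, which connects to the user-adjustable weights discussed below.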

An objective comparison of AIS opt-aiNet against a standard GA and a BF approach in proposing loop mashups demonstrates the superiority of AIS opt-aiNet in finding local and global optimal mashups, closely matching the compatibility values of the BF approach. Furthermore, AIS opt-aiNet promotes greater diversity than the GA and BF approaches. Finally, the GA and AIS opt-aiNet achieve significant computational performance gains compared with the BF approach. A perceptual listening test was conducted and significantly validated the hypothesis that the proposed objective evaluation function in Mixmash-AIS captures user enjoyment of a mashup (r = .99, p < .001).

In promoting a diverse set of optimal mashups, Mixmash-AIS can account for personal preferences and stylistic traits, which are fundamental to production-based MIR interfaces [4]. Not only do the operational mechanics of AIS opt-aiNet accommodate greater diversity, but the weights regulating the importance of each harmonic, rhythmic, and spectral criterion within the objective evaluation function also allow users to bias the search towards their preferences.

6.2 Towards Mixmash-AIS model applications

Existing implementations of the Mixmash-AIS model (found in the complementary materials to this article) lack a graphical user interface (GUI) offering an intuitive strategy adapted to the two main identified target user groups: professional musicians, namely producers and composers, and lay-users. This section envisions user scenarios for the Mixmash-AIS model according to these two target user groups. To this end, we define how the model can contribute to or promote a creative flow within each target user group, and what usability criteria are relevant to the design of the interaction and interface with the model.

The search for good matches for creating mashups or mixes between musical audio loops is a common task in the workflow of professional musicians. The ear training and music theory background of these users typically allow them to critically assess the compatibility between any two (or more) given musical audio loops. However, browsing and creatively experimenting with large-scale datasets is very time-consuming due to the sparse formal annotations of the musical audio content (e.g., formal musical structure descriptions such as key, chord, and meter annotations); instead, high-level semantic labels (e.g., style and instrumentation) are typically adopted. The Mixmash-AIS model streamlines this process by proposing several optimal mashups, thus removing the need for a manual search for good matches at the formal levels of the harmonic, rhythmic, and spectral domains. While these elements are not fully formalized in the literature, our experiments have validated the evaluation function Ep in capturing these traits when proposing different mashups.

Furthermore, the diversity of the proposed Mixmash-AIS model, stemming from local searches in multiple perceptually different locations of the search space, allows it to accommodate user preferences, which are central to the workflow of musicians [42]. Many dimensions relevant to the decision of what musical audio to include in a particular musical context, or reflecting the stylistic idiosyncrasies of the user, are excluded from our objective evaluation function. In this context, the model's capacity to propose diverse optimal mashups is important to promote a workflow with a human-in-the-loop strategy (REF). Mixmash-AIS is an intelligent collaborative assistant that filters or reduces the intractable set of combinations between all audio loops to a smaller set from which the user can choose.

A typical scenario where the Mixmash-AIS model can be applied is the professional musicians' workflow, namely assisting DJs in preparing their musical sets. The model can adopt both musical audio loops and full musical audio tracks. To a given extent, it could even propose which spectral components to exclude to maximize compatibility (i.e., proposing changes to the signal that could improve the evaluation function value Ep), which typically aligns with the strategies adopted by DJs when mixing two tracks (e.g., retaining the bass from one track and the lead parts from a different track). Another recurrent scenario where Mixmash-AIS can enhance the professional musicians' creative workflow is finding a loop or musical audio sample for an existing project (e.g., a song). The model can easily be adapted to find a musical loop or track to be recombined with an existing musical context (e.g., finding a harmonization for a pre-recorded vocal track). Different graphical user interfaces for standalone applications or plugins for existing audio production software shall be studied for each scenario. Design strategies to promote serendipitous searches in large musical audio loop datasets have been discussed in [6] and can benefit from the diverse set of loops proposed by the model.
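
A sketch of the adaptation described above, where an existing track is held fixed as the musical context and candidate loops are ranked by the pairwise cost; the function names here are hypothetical, not part of the released implementation:

```python
def rank_candidates(context_features, candidates, cost_fn, k=10):
    """Hold an existing track fixed and rank a dataset of loops by
    their pairwise cost Ep against it, returning the k most
    compatible candidates for the user to audition."""
    scored = sorted(candidates,
                    key=lambda loop: cost_fn(context_features, loop))
    return scored[:k]
```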

The choice of the musical audio (loop) dataset is entirely up to the user and can be adjusted to their musical preferences or the musical context of a particular project. The style-agnostic search at the formal level of musical structure compatibility in the Mixmash-AIS model can accommodate any musical style and any cross-combination of styles. The curation of the dataset is fundamental to accommodating the user's musical context and preferences. The model could also be easily adapted to work with large-scale online datasets, such as Freesound [43]. The size of the dataset is an important consideration: very small datasets may result in a small pool of optimal and insufficiently diverse mashups, and small datasets (e.g., about 10 to 20 musical audio loops or tracks) can feasibly be browsed manually. The model shows greater potential in dealing with larger datasets (e.g., about 100 or more musical audio loops or tracks).

In the context of lay-users, the Mixmash-AIS model can be understood as a strategy to promote participation in music creation by those lacking musical training, who can adopt the model to select compatible musical audio loops or tracks. The model can fill a gap in knowledge or tools that would otherwise exclude non-musically trained users from participating in musical creation. Such strategies can provide a new context for a particular user's favorite music within entertainment-oriented music consumption scenarios. The use of small (personal) datasets or playlists can be envisioned here as a typical scenario; therefore, diversity is less important than finding optimal compatibility to promote new listening experiences.

6.3 Future work

In the future, the most pressing research direction is in-depth user studies to assess and validate the Mixmash-AIS model in the musical practice of both professional musicians and lay-users. To conduct such studies, we shall design, implement, and compare multiple interfaces adopting Mixmash-AIS. The collected feedback can assess the usability of the interfaces and the degree to which they assist a user engaged in creative work; furthermore, such an evaluation can inform the model's algorithmic design. An important aspect to consider in the study is the possibility of users adopting their own musical collections within their projects, which would shed light on the impact of dataset size and provide insight into the ideal number of optimal diverse mashups to present to the user.