1 Introduction

Music mashup creation is a composition practice that leverages existing audio preservation mechanisms. It entails recombining two or more pre-recorded musical audio recordings as a means for creative endeavor [1]. The practice is strongly linked to the various sub-genres of electronic dance music (EDM) and the role of the DJ [1]. Girl Talk, The Kleptones, and Danger Mouse are popular mashup artists.

Mashup creation is typically confined to technology-fluent composers. It requires expertise ranging from understanding musical structure to navigating and retrieving musical audio from large datasets. Both industry and academia have been designing tools that aid musicians, producers, and lay-users in exploring the virtually infinite possibilities of digital music mashup creation. These tools streamline the time-consuming search for compatible musical audio and overcome the need for advanced music theory, practice, and digital signal processing knowledge. In this context, lay-users can engage in creative tasks, and professional musicians and producers can devote more time to creative experimentation.

Computational mashup creation primarily tackles two challenges: (1) retrieving compatible musical audio from a dataset and (2) transforming musical audio signals to “force” their compatibility (e.g., beat alignment or pitch shifting). Our article focuses primarily on the underlying methods of the former processing approach, which is commonly referred to as content-based retrieval within music information retrieval (MIR). Its application to music mashup creation has been identified as one of the grand challenges of the community [3] within the area of creative MIR [4].

Representative state-of-the-art applications for music mashup creation can be grouped into two main categories: (1) rule-based models with hand-crafted features [5,6,7,8] and (2) end-to-end machine learning models [9, 10]. Rule-based models typically adopt perceptual and formal mid-level descriptors (e.g., dissonance, key relatedness, and spectral flatness) to represent musical audio signals in a search space where compatible mashups are retrieved. Their transparency and “generality” (i.e., style-agnostic nature) make rule-based models strong candidates for collaborative creative tools, as they afford a considerable degree of user customization. While reducing the virtually infinite space of musical audio recombinations by selecting compatible mashups, they are not constrained by overly pronounced stylistic idiosyncrasies. Perceptual evaluations of existing rule-based models have consistently shown that the proposed mashups feature harmonically, rhythmically, and spectrally compatible musical audio. However, their computational performance has been identified as a significant limitation at scale, since creating mashups from large musical audio datasets requires an expensive brute-force search.

Existing end-to-end machine learning models for music mashup creation using deep neural networks rely on a large dataset of loops to train models that infer the characteristics of compatible loops without the need for hand-crafted features [9, 10]. Due to the lack of annotated multi-track mashup datasets, pipelines for extracting positive compatible loop examples from existing music have been proposed [9, 10]. While end-to-end machine learning models can account for yet unknown or non-systematized musical characteristics implicit in the signal and the mashup practice, they can potentially suffer from a lack of generalization, as they are intrinsically linked to the style of the training corpus. Furthermore, little adaptation to user preferences (besides the content curation of the training data) can be accommodated during generation.

Andersen and Knees’ [4] findings from in-depth interviews with expert users working creatively with electronic music show that a musical interface leveraging MIR technology should play the role of a collaborative machine. Fundamental to the human creator is the ability of the collaborative machine to assess, criticize, and occasionally oppose, thus promoting surprise and serendipity in recommendation and retrieval. Furthermore, it should consider the creator’s individuality by allowing degrees of control over the recommendation process. The authors highlight that successful imitation is not enough for the musical interface to be recognized as “intelligent.” It should go beyond the traditional music and audio retrieval requirements for consumer or entertainment needs. Creative users of MIR technology in a music production environment are willing to evaluate a large part of the returned items to get inspired and find the personal best.

This article addresses the above limitations regarding scalability and user preference by proposing a computer-aided, diverse, user-customizable music mashup creation model from a large dataset of loops. In greater detail, our task can be defined as a search problem that must account for (1) the musical audio dataset scalability; (2) the musical audio sample compatibility and (3) diversity at the harmonic, rhythmic, and spectral levels; and (4) customizable user preferences. It should guarantee the adoption of large (or variable-sized) loop datasets, and the search results should account for optimal recombination according to formal and perceptual music principles (e.g., low degrees of dissonance, key affinity, and metrical alignment) while equally enabling diverse results to accommodate user preferences.

In the context of our work, users account for professional (or musically trained) producers and composers, and lay-users. The proposed model is concerned with the musical structure’s fundamental attributes that can be leveraged to either promote interfaces for supporting the creative flow of producers and composers or allow lay-users to experience and actively engage in music creative practices. In the latter case, the model’s intelligence allows lay-users or non-musically trained people to surpass the steep learning curve of musical theory and ear training needed to create musical mashups. In the former case, while trained musicians can achieve these tasks, the model streamlines the search for optimal mashups, a time-consuming task when adopting mid to large loop datasets.

Evolutionary multimodal optimization is a class of algorithms that tackle the four above requirements. It typically embeds parallel and efficient search capabilities that optimize a given function to locate diverse solutions within a search space [11]. A prominent algorithm for multimodal optimization is the artificial immune system (AIS) opt-aiNet [12]. AIS opt-aiNet has been applied in creative MIR, namely in computer-aided musical orchestration [13, 14]. The latter works perform an efficient optimization search across a large musical instrumental note corpus to find diverse orchestrations matching a reference sound’s timbre.

Our work adopts AIS opt-aiNet for searching compatible and diverse loop mashups from large loop datasets. Three main objective criteria drive compatibility: harmonic compatibility [15], rhythmic compatibility [16], and spectral balance [17], broadly following the metrics proposed in Mixmash [6]. Diversity accounts for the thorough and concurrent exploration of different optimal matches across the search space. For example, resulting optimal loop recombinations can feature different key, tempo, timbre (e.g., driven from different instrumentation), and microtiming deviations (e.g., swing feeling). Furthermore, the user can customize the importance of the criteria in the objective compatibility function, biasing the search towards different musical structural elements. Conversely to existing models driven by computationally expensive brute-force (BF) search methods (e.g., [5] and [6]), we aim to provide a computer-aided tool that enables an efficient search on a large user-curated loop dataset while promoting a diverse set of optimal mashups. The model was implemented in Pure Data [18] and Python as prototype applications named Mixmash-AIS.

User preferences are intrinsic to the computational model proposed using the AIS opt-aiNet, seen as an intelligent collaborative machine or virtual agent that aids users in selecting from a reduced set of optimal mashups. Therefore, our proposal does not aim at fully automating the process of mashup creation but rather at aiding the user in navigating the search space. In this context, the model promotes (1) a local search in multiple locations of the search space, which guarantees the retrieval of several optimal mashups, whose (2) diversity is enforced, as distances in the search space indicate perceptual relatedness at the harmonic, rhythmic, and spectral dimensions. Furthermore, (3) user parameterization can bias the search to privilege preferential music structural elements (e.g., it can guarantee optimal rhythmic compatibility without accounting for harmonic and timbral qualities).

We evaluate our model using objective and subjective (i.e., perceptual) criteria. Three objective criteria are assessed: (1) the quality of the recombination (i.e., compatibility), (2) the diversity of the resulting mashups, and (3) the computational performance. The objective criteria are compared against a standard (unimodal) genetic algorithm (GA) and brute-force (BF) approach, following the objective evaluation procedures proposed in [13]. The main contributions of this article beyond those in our previous work [19] include (1) the inclusion of spectral balance criteria in the objective evaluation function, (2) the adoption of weights in the objective evaluation function defining the importance of each harmonic, rhythmic, and spectral compatibility criteria to account for customizable user preferences, (3) the possibility to expand the loop recombination to virtually any number of overlapping layers, and (4) a perceptual evaluation of our model by a listening test to validate the objective evaluation function.

The remainder of this article is structured as follows. Section 2 surveys related work on computational mashup creation and evolutionary optimization for musical audio recombination. Section 3 details the overview of the model, namely the audio features adopted, feature extraction, and the optimization search with AIS opt-aiNet. Sections 4 and 5 outline the evaluation procedure of the proposed model and the results, respectively. Finally, Section 6 presents the conclusions of our study and areas for future work.

2 Related work

This section surveys related work to the problem of computational mashup creation at scale along two fundamental lines of research: computational mashup creation and evolutionary multimodal optimization. We describe in Section 2.1 representative methods for computational mashup creation that adopt rule-based and end-to-end machine learning approaches. The rule-based approaches adopt hand-crafted descriptions extracted from musical audio to represent the signal’s rhythmic, harmonic, and timbre characteristics. Then, we address in Section 2.2 evolutionary multimodal optimization applied to musical manifestations, namely musical audio recombination within computer-aided orchestration using AIS opt-aiNet.

2.1 Computational music mashup creation

Early computational mashup creation focused on rhythmic-only features related to the temporal arrangement between two or more musical tracks [20, 21]. Lee et al. [7] concentrated on rhythmic matching, which adopts tempo as an input parameter for the system and employs beat matching by stretching the beats through a phase vocoder. Today, this strategy is widespread in commercial software such as Traktor, Mashup 2, and Mixxx. To perform a rhythmic alignment of musical audio tracks, Davies et al. [5] compute beat and downbeat tracking based on a combined kick and snare drum onset detection function while assuming a constant tempo and time signature across the entire duration of a musical track.

Advances in computational mashup creation pursued multi-attribute models, notably accounting for harmonic compatibility, commonly referred to as “harmonic mixing” [6, 20]. Multiple strategies have been adopted to measure harmonic compatibility: key affinity (or distances) in the circle of fifths; cosine similarity between chroma vector representations [5]; sensory dissonance [23]; and a combination of dissonance and perceptual relatedness indicators from Tonal Interval Vector (TIV) distances [6, 20]. The latter metric has been shown to outperform the remaining harmonic compatibility metrics by perceptually aligning with human judgments and by its computational efficiency, promoting harmonic mixing at scale [24].

Timbre has been addressed in computational mashup creation to balance the spectrum across multiple regions [5] or as a strategy to find audio content which occupies the same spectral region [6]. Davies et al. [5] proposed a spectral balance metric that privileges (overlapping) mashups resulting in flat spectral representations. Maçãs et al. [6] compute the cosine distance between Mel-frequency cepstral coefficients (MFCC) to aid users in selecting musical loops with controlled degrees of timbral similarity via a graph-based visualization of a large loop dataset.

Conversely to the above models, which broadly pursued mashups resulting from the vertical recombination of musical tracks, Lee et al. [7] and Harrison et al. [25] extend the problem to the use of short fragments of musical audio in both vertical and horizontal dimensions. In other words, their model considers both overlapping musical audio recombination and their continuation in time.

To the best of our knowledge, Chen et al. [9] and Huang et al. [10] proposed the only end-to-end machine learning models for music mashup creation using deep neural networks. The former system, named Neural Loop Combiner, shows that a convolutional neural network trained on a dataset of hip-hop loops outperforms a Siamese network in creating musical mashups. A large dataset of loops with ground truth compatibility annotations is needed to train these models. Due to the nonexistence of such datasets, Chen et al. [9] proposed a pipeline to extract loops from existing music and obtain positive examples of compatible loops. Once the neural network is trained, the model is somewhat limited to the style of the training dataset, and little adaptation to user preferences can be accommodated.

Huang et al. [10] have extended the Neural Loop Combiner model with novel techniques for generating training data without human labels through musical source separation [26, 27]. Isolated vocal, bass, drum, and other parts allow training the network with more controlled examples in terms of instrumentation pairings and directly estimate the compatibility of groups of stems instead of learning a representation space for embedding the stems. The source musical material from which the loops were extracted goes beyond the hip-hop genre in Neural Loop Combiner to allow for greater generalization as musical compatibility can be style-specific. Furthermore, they developed two novel deep neural network architectures (PreMixNet and PostMixNet) trained in a self- and semi-supervised way.

2.2 Evolutionary optimization for musical audio loops recombination

Evolutionary algorithms are a class of artificial intelligence methods greatly motivated by optimization processes inspired by natural phenomena, such as natural selection, species migration, bird swarms, human culture, and ant colonies [28]. Evolutionary optimization algorithms can be defined by two main criteria: modality (unimodality and multimodality) and the number of objective criteria to optimize (single objective and multiobjective).

The modality in evolutionary algorithms denotes optimization strategies seeking solutions for one global optimum (unimodal) or multiple local optima (multimodal) in a single algorithm run across several iterations. Multimodal evolutionary algorithms usually account for the population diversity, resulting from a comprehensive exploration of the search space [29].

Single and multiobjective optimization differ in terms of the objective search strategy applied. Single objective finds optimal solutions to a single objective function, whereas multiobjective accounts for problems with conflicting objectives with no single optimal solution [30].

Our work concentrates on multimodal and single-objective search optimization, notably the AIS opt-aiNet algorithm. De Castro and Timmis [12] presented the AIS opt-aiNet algorithm to solve multimodal and single-objective optimization problems. The algorithm can evolve a population of cells towards a set of optimal and diverse solutions to a problem. It employs immunological concepts of clonal proliferation, mutation, and repression to establish a network of inhibitors in the immune system network. In other words, AIS opt-aiNet integrates global and local search to find optimal solutions while ensuring their diversity. Furthermore, the algorithm presents two additional important features: the automatic determination of the population size and a defined convergence criterion.

The AIS has been adopted in music-related problems that typically require optimization strategies to find optimal and diverse solutions from a large-scale pool of candidates. Navarro et al. [31] adopt an AIS opt-aiNet to generate chord progressions in the symbolic music domain. For each new chord, a diverse yet optimal pool of chords is proposed to the user as good candidates for extending a chord progression. Lampropoulos, Sotiropoulos, and Tsihrintzis [32] use an AIS negative selection algorithm for music recommendation. Their model has been shown to outperform state-of-the-art support vector machine models in providing high-quality music recommendations that efficiently describe the sub-space of user preferences. The closest proposal to our model is by Caetano et al. [13]. They adopted the AIS opt-aiNet to the problem of computer-aided orchestration, i.e., the search for instrumental note sample combinations that match the timbre of a reference sound. They showed the importance of the method in promoting diverse solutions with optimal quality.

3 Mixmash-AIS

In Mixmash-AIS, we adopt AIS opt-aiNet to promote the efficient computational search for musical mashups (referred to in AIS opt-aiNet as network cells) resulting from the recombination of compatible musical audio loops. A mashup p results from the combination of loops \(l_{i}^{A}\), where \(i \in \{1,\ldots,L\}\) is the index of the loop in the dataset A, with a total of L musical loops. A mashup p results from overlapping its component loops; therefore, it can be understood as a combination of multiple loop layers o. While a virtually infinite number of loop layers can be added to a mashup \(p = \{o_{1},\ldots,o_{u}\}\), where u is the total number of loop layers, musical textures typically include three or four layers [34].
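To give a sense of the search space implied by this definition, the following sketch counts the candidate mashups that an exhaustive search would have to evaluate. It assumes that a mashup is an unordered set of u distinct loops drawn from a single dataset; the function name and the dataset size are hypothetical and serve only as an illustration.

```python
import math


def search_space_size(num_loops: int, num_layers: int) -> int:
    """Number of unordered mashups of `num_layers` distinct loops
    drawn from a dataset of `num_loops` loops (assumption: no repeats)."""
    return math.comb(num_loops, num_layers)


if __name__ == "__main__":
    # Hypothetical dataset of 1000 loops combined in three layers.
    print(search_space_size(1000, 3))  # 166167000
```

Under these assumptions, the count grows combinatorially with both the dataset size and the number of layers, which is why an exhaustive evaluation quickly becomes impractical.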

Each loop li is represented in a feature space, detailed in Section 3.1, where distances between loops capture their compatibility. The smaller the distance, the greater their compatibility. Therefore, optimal compatible mashups result from minimizing an evaluation function Ep in the feature space, which we define in Section 3.2. Diversity is guaranteed by pursuing a combination of global and local search in exploiting the feature space, resulting from iterative clonal mutation and selection of mashup candidates (see Fig. 1). The search for a diverse set of compatible loop mashups is detailed in Section 3.3.

Fig. 1

Illustration of search (left) and feature (right) spaces in Mixmash-AIS. We reduced the search and feature spaces to two dimensions for enhanced visualization. The search space represents the recombination of two musical audio loops, and the feature space represents harmonic H and rhythmic R compatibility

Figure 2 shows the architecture of Mixmash-AIS. A user-curated dataset A of musical audio loops li is the content adopted to create musical mashups. The feature extraction algorithm defines harmonic Ti(k), rhythmic ri(q), and spectral bi(s) representations for each dataset loop li, which are stored in a feature dataset. The AIS opt-aiNet is then adopted to search for multiple compatible and diverse mashups by evolving a random initial population. Finally, a set of optimal mashups results from overlaying the mashup component loops.

Fig. 2

The architecture of Mixmash-AIS. Rectangular blocks denote processing functions. Solid and dashed arrows indicate audio and control flow of information between processing modules, respectively

3.1 Feature extraction and dataset

The feature extraction module is responsible for creating three representations that capture the harmonic Ti(k), rhythmic ri(q), and spectral bi(s) content of each musical audio loop li. From the feature representations, we apply distance metrics to compute indicators of harmonic H, rhythmic R, and spectral S compatibility between musical audio loops li.

Equation (1) computes a TIV Ti(k) to represent the harmonic content of an audio loop li. Ti(k) is a 12-dimensional vector computed as the DFT of a chroma vector ci(m). The DFT of chroma vectors from musical audio has been shown to provide indicators of dissonance and perceptual relatedness with greater accuracy than chroma vectors [20]. Furthermore, the representation is invariant to timbral differences of instrument sounds and has been shown to outperform existing representations in finding good harmonic mixes from musical audio [24].

$$ T_{i}(k) = w_{a}(k) \sum\limits_{m=0}^{M-1} \bar{c}_{i}(m)\, e^{\frac{-j2 \pi km}{M}}, \quad k \in \mathbb{Z}, \quad \text{with} \quad \bar{c}_{i}(m) = \frac{c_{i}(m)}{\sum\limits_{m=0}^{M-1}c_{i}(m)} \quad , $$

where M = 12 is the dimension of the input vector and wa(k) = {3,8,11.5,15,14.5,7.5} are weights derived from empirical ratings of dyad consonance, used to adjust the contribution of each dimension k of the DFT space [20]. We set k to 1 ≤ k ≤ 6 for Ti(k), since the remaining coefficients are symmetric. Ti(k) uses \(\bar {c}_{i}(m)\), which is ci(m) normalized by the DC component, to allow the representation and comparison of different hierarchical levels of tonal pitch, such as chords and keys, which ultimately relate to different time scales or variable duration loops.
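As an illustration of (1), the sketch below computes a TIV from a 12-bin chroma vector with NumPy. The function and constant names are hypothetical, and returning a zero vector for a silent (all-zero) chroma is an assumption of this sketch rather than part of the model.

```python
import numpy as np

# Weights w_a(k) for k = 1..6, as listed in the text.
W_A = np.array([3.0, 8.0, 11.5, 15.0, 14.5, 7.5])


def tiv(chroma):
    """Weighted DFT coefficients k = 1..6 of a chroma vector
    normalized by its sum (the DC component)."""
    chroma = np.asarray(chroma, dtype=float)
    total = chroma.sum()
    if total == 0:
        # Assumption: a silent loop maps to the origin of the space.
        return np.zeros(6, dtype=complex)
    c_bar = chroma / total
    # np.fft.fft uses the kernel exp(-2j*pi*k*m/M), matching (1).
    spectrum = np.fft.fft(c_bar)
    return W_A * spectrum[1:7]


if __name__ == "__main__":
    c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # pitch classes C, E, G
    print(np.round(np.abs(tiv(c_major)), 3))
```

Because of the normalization, scaling the chroma vector by a constant leaves the resulting TIV unchanged.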

Following Bernardes et al. [20, 35], we adopt the Ti(k) space for computing the harmonic compatibility H between two given loops l1 and l2 using (2), which combines the dissonance D and perceptual distance P metrics shown in (3) and (4), respectively. The lower the values of H, the higher the degree of harmonic compatibility between two audio loops li. Fernández [24] has shown that the harmonic compatibility H indicator perceptually captures human preferences of a mashup to a higher degree than remaining harmonic compatibility metrics [17, 23].

$$ \begin{array}{@{}rcl@{}} H_{l_{1},l_{2}}&=&D_{l_{1},l_{2}} \cdot P_{l_{1},l_{2}} \end{array} $$
$$ \begin{array}{@{}rcl@{}} D_{l_{1},l_{2}}&=&1-\frac{\lVert a_{1} T_{1}(k) + a_{2} T_{2}(k) \rVert}{(a_{1} + a_{2}) \lVert w_{a}(k) \rVert} \end{array} $$

where a1 and a2 are the amplitudes of T1(k) and T2(k), respectively.

$$ P_{l_{1},l_{2}}=\sqrt{\sum\limits_{k=1}^{6} \lvert T_{1}(k) - T_{2}(k)\rvert^{2}} $$
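The harmonic compatibility indicator can be sketched as follows, taking two precomputed TIVs (complex vectors with six coefficients) as input. The sketch assumes that the dissonance term is computed from vector norms, which its scalar value implies; the function names and the default unit amplitudes are hypothetical.

```python
import numpy as np

# Weights w_a(k) for k = 1..6, as listed in the text.
W_A = np.array([3.0, 8.0, 11.5, 15.0, 14.5, 7.5])


def dissonance(t1, t2, a1=1.0, a2=1.0):
    """Dissonance D of the amplitude-weighted mix of two TIVs
    (assumption: norms in numerator and denominator)."""
    combined = a1 * np.asarray(t1) + a2 * np.asarray(t2)
    return 1.0 - np.linalg.norm(combined) / ((a1 + a2) * np.linalg.norm(W_A))


def perceptual_distance(t1, t2):
    """Euclidean distance P between two TIVs."""
    diff = np.asarray(t1) - np.asarray(t2)
    return float(np.sqrt(np.sum(np.abs(diff) ** 2)))


def harmonic_compatibility(t1, t2, a1=1.0, a2=1.0):
    """H = D * P; lower values indicate higher compatibility."""
    return dissonance(t1, t2, a1, a2) * perceptual_distance(t1, t2)
```

In this sketch, two loops with identical TIVs yield P = 0 and therefore H = 0, the most compatible case.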

A rhythmic histogram ri(q) [16], with q = 60 bins, is adopted to represent the rhythmic content of a musical loop as amplitude modulations. The representation derives from rhythmic patterns [36], a time-invariant matrix representation of loudness fluctuations in the 24 critical Bark bands of the human hearing range. Their fundamental difference is that the rhythmic histogram ri(q) sums the 24 critical bands per modulation frequency bin, resulting in a vector of 60 modulation frequency bins in the [0,600] beats per minute (bpm) range [16]. The motivation to adopt rhythmic histograms ri(q) in our work, instead of the more common rhythmic pattern representation, is to minimize pitch or spectral differences in the similarity computation, namely in light of the typical single-instrument nature of musical loops used in production settings [36, 37].

We adopt a two-stage extraction process to compute a rhythmic histogram ri(q). First, we group the frequency bands by loudness sensation, using a short-time Fourier transform. The resulting spectral representation is then converted into a time-invariant modulation frequency spectrum over the 24 critical Bark bands by reapplying the Fourier transform. High amplitude values in the rhythmic histogram ri(q) denote a recurrent period in the musical audio. Figure 3 shows the rhythmic histogram with four predominant peaks at multiples of the tempo (126 bpm).

Fig. 3

Rhythmic histogram of an audio loop with a stable tempo of 126 bpm. Peaks at integer multiples of 126 bpm indicate the presence of active pulses on binary subdivisions of the tempo and a straight rhythmic feeling.

Rhythmic compatibility R is computed as the angular distance between rhythmic histograms r1(q) and r2(q) representing two musical loops l1 and l2, such that:

$$ R_{l_{1},l_{2}} = \arccos\frac{r_{1}(q) \cdot r_{2}(q)}{\|r_{1}(q)\| \|r_{2}(q)\|} $$
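A minimal sketch of this angular distance is given below, assuming the rhythmic histograms are already available as equal-length arrays. The function name is hypothetical, and returning π/2 when one histogram is all zeros is an assumption made to avoid a division by zero.

```python
import numpy as np


def rhythmic_compatibility(r1, r2):
    """Angular distance (radians) between two rhythmic histograms."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    denom = np.linalg.norm(r1) * np.linalg.norm(r2)
    if denom == 0:
        # Assumption: treat an empty histogram as orthogonal to any other.
        return float(np.pi / 2)
    # Clip guards against rounding errors slightly outside [-1, 1].
    cosine = np.clip(np.dot(r1, r2) / denom, -1.0, 1.0)
    return float(np.arccos(cosine))
```

Since histogram values are non-negative, the distance in this sketch lies in the [0, π/2] range, with 0 denoting identical modulation profiles up to a scale factor.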

Inspired by the key role of equalization in audio mixing [5], we propose a measure of spectral balance S to ensure a dispersed Bark spectral energy distribution in the resulting loop mashup. The use of Bark spectra bi(s) aims to capture the subjective perception of loudness across the 24 critical bands of human hearing [38]. To ensure some separation between two given loops l1 and l2, spectral balance S computes the distance between their Bark spectrum centroid \(\hat b_{i}\) given by (7), such that:

$$ S_{l_{1},l_{2}} = \begin{cases} 0 & \text{if} \quad \lvert \hat{b}_{1} - \hat{b}_{2} \rvert \geq 1 \\ 1 & \text{otherwise} \end{cases} $$
$$ \hat b_{i} = \frac{\sum_{s=0}^{S-1} b_{i}(s) \cdot s}{\sum_{s=0}^{S-1} b_{i}(s)} $$

Promoting spectral separation fosters auditory segregation of the loops in a mashup [39], which is a known good practice for polyphonic textures within music composition [34]. However, larger spectral separation does not necessarily imply a “better” musical texture. Ultimately, while advisable to maintain a minimal degree of spectral separation, voice separability is a subjective and creative decision that should be considered by the user across the multiple optimal solutions provided by the model. In this context, we only penalize loop recombinations within the same critical bandwidth. The remaining cases are all considered valid candidates. Adopting a penalty of one for loops within the same critical band roughly aligns with the maximum distance values of remaining harmonic and rhythmic criteria in Ep. Furthermore, by privileging loops that occupy different critical bands, we promote a more uniform spectral energy distribution and foster a reduced sensory dissonance across loops as the auditory roughness between spectral peaks in different critical bands is residual [40].
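The spectral balance penalty can be sketched as follows, assuming each loop is summarized by a non-negative Bark spectrum indexed from band 0. The function names are hypothetical, and raising an error for an all-zero spectrum is a choice of this sketch.

```python
import numpy as np


def bark_centroid(bark_spectrum):
    """Energy-weighted mean band index of a Bark spectrum."""
    b = np.asarray(bark_spectrum, dtype=float)
    total = b.sum()
    if total == 0:
        raise ValueError("Bark spectrum has no energy")
    return float(np.dot(b, np.arange(b.size)) / total)


def spectral_balance(b1, b2):
    """0 if the centroids are at least one critical band apart, else 1."""
    separation = abs(bark_centroid(b1) - bark_centroid(b2))
    return 0.0 if separation >= 1.0 else 1.0
```

The binary outcome reflects the design choice described above: only loops whose centroids fall within the same critical bandwidth are penalized, and any larger separation is treated as equally valid.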

3.2 Evaluation function

To ensure high compatibility values in a diverse space of solutions, AIS opt-aiNet assesses a population of musical loop mashups p at each iteration g. The population evolves across multiple iterations by minimizing a cost function Ep. A two-step process is adopted to compute an objective evaluation value per mashup p. First, we define Hp, Rp, and Sp as the sum of all pairwise distances across the component loops li in the mashup p. This step allows a virtually infinite number of overlapping layers to be considered by the model. A total of u(u − 1)/2 unique pairwise values (i.e., resulting from the combinations of two elements without repetition) per harmonic, rhythmic, and spectral representation are summed using (2), (5), and (6), respectively, where u is the total number of loops li in a mashup p per criterion.

We normalize the Hp, Rp, and Sp metrics by the total number of u(u − 1)/2 pairwise elements per criterion. This normalization strategy accommodates loop recombinations in which some of the criteria may not be relevant. For example, if we aim to create a mashup including a tonal and a percussive loop, the harmonic compatibility criterion Hp may not be relevant, and thus should be excluded. Figure 4 illustrates the criteria computation for three overlapping layers. Two layers are retrieved from a dataset A1 featuring loops with tonal content, and a third layer is retrieved from a dataset A2 with percussive content.

Fig. 4

Illustration of the threefold harmonic, rhythmic, and spectral compatibility criteria computation for a mashup p with three overlapping musical loops. Two tonal-content loops (l3 and l9) are retrieved from dataset A1. A percussive loop l20 is retrieved from dataset A2. Per criterion, all unique pairwise combinations (without repetition) are linearly combined and divided by the number of pairs, u(u − 1)/2. In the harmonic criterion, the \(l_{20}^{A_{2}}\) loop is excluded due to its percussive-only content

Second, we apply (8) to combine the resulting harmonic Hp, rhythmic Rp, and spectral balance Sp compatibility. Furthermore, a high penalty is applied to mashups p that include repeating loops li, such that:

$$ E_{p} = \alpha H_{p} + \beta R_{p} + \gamma S_{p} + F_{p} \quad , $$

where Fp = 0 if no duplicate loop indexes i are found in p and Fp = 50 otherwise. α, β, and γ are weights that balance the relative importance of the metrics and act as customizable parameters that bias the search towards user preferences regarding harmonically, rhythmically, and spectrally compatible loops.
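The two-step evaluation can be sketched as below. The pairwise distance functions are passed in as callables over loop indexes, and the optional `tonal_loops` argument (a hypothetical name) restricts the harmonic criterion to loops with tonal content, as in the Fig. 4 example. Returning zero for a criterion with fewer than two eligible loops is an assumption of this sketch.

```python
from itertools import combinations

DUPLICATE_PENALTY = 50.0  # F_p when a loop index repeats


def mean_pairwise(loops, distance):
    """Average of `distance` over all unique pairs of loops."""
    pairs = list(combinations(loops, 2))
    if not pairs:
        return 0.0  # Assumption: a criterion with no pairs adds no cost.
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def evaluate_mashup(loops, h_dist, r_dist, s_dist,
                    alpha=1.0, beta=1.0, gamma=1.0, tonal_loops=None):
    """Weighted cost E_p of a mashup given as a list of loop indexes."""
    harmonic = loops if tonal_loops is None else [
        i for i in loops if i in tonal_loops]
    h_p = mean_pairwise(harmonic, h_dist)
    r_p = mean_pairwise(loops, r_dist)
    s_p = mean_pairwise(loops, s_dist)
    f_p = DUPLICATE_PENALTY if len(set(loops)) < len(loops) else 0.0
    return alpha * h_p + beta * r_p + gamma * s_p + f_p
```

Setting a weight to zero removes the corresponding criterion from the search, which corresponds to the user-driven bias described above (e.g., rhythm-only matching with α = γ = 0).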

3.3 Search algorithm

The immunological operations in AIS opt-aiNet — cloning, mutation, and affinity suppression — evolve an initial random population towards compatible and diverse loop mashups in the immune network. Maintenance of compatibility is assured by the evaluation function Ep and leveraged by the cloning and mutation operators, which optimize the population of mashup candidates across multiple regions of the multimodal search space at each algorithm iteration g. Valleys (or local minima) in the multimodal search space indicate optimal mashup candidates p. Figure 5 shows a flowchart diagram of the AIS opt-aiNet algorithm used in Mixmash-AIS.

Fig. 5

Artificial immune system opt-aiNet flowchart diagram adopted in Mixmash-AIS

The AIS opt-aiNet algorithm instantiates a random population of mashups p. In greater detail, to define an initial population, random index i numbers for the component loops li of each mashup p are generated. The initial number of mashups in the population (i.e., the population size or the number of network cells) is not relevant, as the algorithm includes mechanisms for automatically adjusting the population size via affinity suppression and population expansion. Cloning creates a number N of offspring clone cells per mashup in the population, which are identical copies of their parent cell. Each clone includes the parent and its N offspring. The offspring clones undergo an operation of somatic mutation to become a variation of their parent. In other words, mutation determines whether a given loop li in a mashup p should be changed. The probability of a given loop li within a mashup p being mutated decreases as the mashup p evaluation value increases. Following [13], we adopt (9) to define the mutation probability of a given loop li within a mashup p.

$$ \chi = \exp{(-\delta\hat{E})} , $$

where δ = 1.2 (as proposed in [13]) is a constant and \(\hat {E}\) is the evaluation value of a given mashup p normalized to the [0,1] range. For each loop li in the mashup p, a random value in the [0,1] range determines whether an audio loop index i is mutated (Fig. 6). A loop index i is mutated (i.e., replaced) if the random value is ≤ χ. If the condition holds, a random loop index from the dataset A is fetched.
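The cloning and mutation steps can be sketched as follows, with a mashup represented as a list of loop indexes. The function names are hypothetical, and the sketch assumes the replacement index is drawn uniformly from the whole dataset, so it may occasionally coincide with the index it replaces.

```python
import math
import random

DELTA = 1.2  # constant delta, as proposed in [13]


def mutation_probability(e_norm, delta=DELTA):
    """chi = exp(-delta * E_hat), with E_hat normalized to [0, 1]."""
    return math.exp(-delta * e_norm)


def mutate(mashup, e_norm, dataset_size, rng=random):
    """Replace each loop index with probability chi."""
    chi = mutation_probability(e_norm)
    return [rng.randrange(dataset_size) if rng.random() <= chi else i
            for i in mashup]


def clone_and_mutate(mashup, e_norm, dataset_size, n_clones, rng=random):
    """Return the parent followed by its mutated offspring."""
    offspring = [mutate(mashup, e_norm, dataset_size, rng)
                 for _ in range(n_clones)]
    return [list(mashup)] + offspring
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which can help when comparing parameter settings.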

Fig. 6

Mutation probability for a loop li within a mashup p as a function of the normalized mashup evaluation Ep value. Randomly generated values below the function enable the mutation of a given loop within a mashup

The clonal selection performs an elitist optimization of the population to retain the most compatible mashup per clone. To this end, all clone mashups are evaluated using (8), and the mashup with the smallest evaluation value Ep per clone is retained in the population. Then, the population’s average evaluation \(\overline E_{g}\) at a given iteration g is computed to assess whether the mashup optimization has stabilized. The population is said to have stabilized if the average error v in (10) is ≤ .001. The average error v is the absolute value of unity minus the ratio between the average evaluation value of the previous iteration \(\overline E_{g-1}\) and that of the current iteration \(\overline E_{g}\). Once the population stabilizes, the algorithm proceeds to affinity suppression of candidate solutions, followed by population expansion.

$$ v = \left\lvert 1 - \frac{\overline E_{g-1}}{\overline E_{g}} \right\rvert \tag{10} $$
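The stabilization test in (10) amounts to a one-line check. The sketch below is a hedged illustration: the function names are hypothetical, and it assumes the current average evaluation is non-zero.

```python
def average_error(prev_mean, curr_mean):
    """Eq. (10): v = |1 - mean(E_{g-1}) / mean(E_g)|; curr_mean must be non-zero."""
    return abs(1.0 - prev_mean / curr_mean)


def has_stabilized(prev_mean, curr_mean, tolerance=0.001):
    """The population is considered stable when v <= .001 (threshold from the text)."""
    return average_error(prev_mean, curr_mean) <= tolerance
```

For instance, average evaluations of 1.3680 and 1.3685 in consecutive iterations (arbitrary illustrative values) give v ≈ 0.0004, so the population would be treated as stabilized.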

AIS opt-aiNet adopts the suppression operator to exclude mashups p with high affinity (i.e., below a given distance threshold t in the feature space) and thus maintain diversity. Pairwise mashup affinity in the feature space is computed as the angular distance between the concatenated vectors 〈Tp(k), rp(q), bp(s)〉 expressing the harmonic, rhythmic, and spectral content of the (overlapping) component loops of each mashup p. To compute a unique vector representing the overlapping loops of a mashup p = {o1,...,ou}, we apply (11), (12), and (13) to compute the amplitude-weighted combination of harmonic vectors Tp(k), the linear combination of rhythmic histograms rp(q), and the linear combination of Bark spectral representations bp(s).

$$ T_{p}(k) = \sum\limits_{o=1}^{u} a_{o} T_{o}(k) \tag{11} $$
$$ r_{p}(q) = \sum\limits_{o=1}^{u} r_{o}(q) \tag{12} $$
$$ b_{p}(s) = \sum\limits_{o=1}^{u} b_{o}(s) \tag{13} $$
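Equations (11) to (13) can be sketched in Python as follows. This is an illustrative simplification, not the article's implementation: each loop is assumed to be a tuple `(amplitude, T, r, b)` of real-valued lists of equal length per feature, and the function names are hypothetical.

```python
def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equally sized vectors."""
    length = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(length)]


def combine_loops(loops):
    """Build the concatenated mashup vector <T_p, r_p, b_p>.

    `loops` is a list of (amplitude, T, r, b) tuples (an assumed layout).
    T is amplitude-weighted as in (11); r and b are plain sums as in (12), (13).
    """
    amplitudes = [a for a, _, _, _ in loops]
    ones = [1.0] * len(loops)
    t_p = weighted_sum([t for _, t, _, _ in loops], amplitudes)
    r_p = weighted_sum([r for _, _, r, _ in loops], ones)
    b_p = weighted_sum([b for _, _, _, b in loops], ones)
    return t_p + r_p + b_p
```

The returned list is the single feature vector per mashup on which pairwise distances are then computed.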

The suppression operator within AIS opt-aiNet retains a single local optimum mashup that minimizes the value of Ep in (8) and excludes all population mashups at a smaller distance than the threshold t (i.e., high-affinity mashups). By excluding similar mashups p from the immune network, we ensure diversity in the population. After suppression, the remaining mashups in the immune network are memory cells.
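One plausible reading of the suppression operator is a greedy pass over the population sorted by evaluation value. The sketch below assumes cosine distance as the affinity measure and a population stored as `(evaluation, feature_vector)` pairs; both choices, and the function names, are assumptions for illustration rather than the article's exact procedure.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero real-valued vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def suppress(population, threshold):
    """Keep the best mashup of each neighborhood of radius `threshold`.

    `population` is a list of (evaluation, feature_vector) pairs. Mashups are
    visited from the smallest (best) evaluation upwards, and a mashup survives
    only if it lies at least `threshold` away from every survivor so far.
    """
    survivors = []
    for evaluation, vector in sorted(population, key=lambda cell: cell[0]):
        if all(cosine_distance(vector, kept) >= threshold for _, kept in survivors):
            survivors.append((evaluation, vector))
    return survivors
```

Visiting candidates in ascending order of Ep guarantees that, among mutually close mashups, the one with the smallest evaluation value is the one retained as a memory cell.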

The AIS opt-aiNet includes two stopping criteria. Whenever one of them is met, the iterative method stops and the population is returned. The first criterion is the user-defined maximum number of iterations \(\hat g\). The second criterion is population stabilization, which is reached once the number of memory cells remains equal between two consecutive iterations. If the number of memory cells does not stabilize, a percentage d = 40% of random network cells is appended to the population to expand the capacity of the immune network to explore the space further.
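The two stopping criteria and the expansion step can be sketched as below. The rounding of the 40% expansion (here, rounding up), the list-of-indices mashup representation, and all names are assumptions introduced for this illustration.

```python
import math
import random


def should_stop(iteration, max_iterations, previous_cells, current_cells):
    """Stop at the iteration limit or when the memory-cell count is unchanged."""
    return iteration >= max_iterations or previous_cells == current_cells


def expand(population, dataset_size, loops_per_mashup, d=0.4, rng=random):
    """Append d * len(population) random mashups (rounded up, an assumption)."""
    n_new = math.ceil(d * len(population))
    newcomers = [
        [rng.randrange(dataset_size) for _ in range(loops_per_mashup)]
        for _ in range(n_new)
    ]
    return population + newcomers
```

For a population of 10 memory cells, this expansion appends 4 random mashups before the next iteration.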

Finally, a population of compatible and diverse mashups is output upon convergence. Mashups are presented to the user in ascending order of their evaluation Ep value, i.e., from the most to the least compatible mashup. Each mashup p is synthesized by overlapping its component audio loops li, retrieved from the loop dataset A given their index i. This straightforward playback procedure does not account for the metrical alignment of musical loops, nor does the rhythmic compatibility metric R capture the temporal loop alignment. Instead, R inspects the periodicity of the loop events. Therefore, we either assume that the dataset loops start on the strongest metrical accent and have no residual silence (as in the case of the vast majority of professional loop collections), or a manual alignment of the overlapping loops must be performed.

4 Model evaluation

As shown in previous studies [5, 23], the compatibility of musical loops in a mashup is fundamental to user enjoyment. However, diversity in mashup creation is equally important in promoting multiple solutions from which users can select, considering their personal preferences. Therefore, an application to aid users in creating mashups should provide multiple and perceptually different solutions.

In this context, we use objective and perceptual measures to evaluate our model. We adopt objective measures to evaluate the (1) compatibility, (2) diversity, and (3) computational performance of the proposed model, Mixmash-AIS, compared to a standard genetic algorithm (GA), Mixmash-GA, and the original brute force (BF), Mixmash-BF. All models use the same feature space, which results from the combined harmonic H, rhythmic R, and spectral S compatibility criteria. More importantly, they adopt the same evaluation function Ep in (8) to assess the compatibility of a mashup. A perceptual test aims to validate the objective evaluation function Ep in Mixmash-AIS.

For the evaluation, we implemented the above three models as prototype applications in Pure Data [18]. Their source code is available as Supplementary Material to this article. Furthermore, all musical mashups adopted in the evaluation are also shared. Mixmash-GA adopts a standard GA with uniform crossover with 70% probability, uniform mutation with 20% probability, roulette wheel selection, and elitism (top 5% of individuals). Similar to Mixmash-AIS, Mixmash-GA outputs mashups ordered by their compatibility Ep values.

Ideally, Mixmash-AIS should reach a similar level of compatibility in the highest-compatible mashups as the Mixmash-BF. Moreover, we expect Mixmash-AIS to outperform Mixmash-GA in compatibility and diversity. Standard GAs can prematurely converge to the same local optimum, which is not guaranteed to be the global (compatible) optimum, whereas AIS opt-aiNet returns multiple (diverse) local optima.

We adopt two hip-hop loop datasets for evaluating Mixmash-AIS and the baseline Mixmash-GA and Mixmash-BF models. Dataset A1 includes 551 loops featuring tonal content (without percussive instruments) across a large array of electronic music instrumental sounds. Dataset A2 includes 170 loops featuring percussive instrument content. Both datasets were collected from the extensive set of Apple Loops, which are commonly distributed with proprietary Apple Digital Audio Workstations (DAW), such as GarageBand and Logic (see Footnote 8). Due to the proprietary nature of the datasets, the audio loop content cannot be shared with the article but can be freely downloaded from the DAWs mentioned above. The metadata identifying the list of audio loops adopted in both datasets and all descriptor analysis data per loop can be found in the Supplementary Material to this article.

The motivation to adopt Apple datasets featuring hip-hop loops only is twofold. First, it aims to challenge the assessment of diversity within a style. Including multiple styles in the dataset would increase by design the diversity of the dataset and, expectedly, the diversity in the mashups. Second, the Apple loops dataset is representative of the musical audio production content a producer or composer would adopt in a real-case creative context. Studies assessing the adoption of cross-style, personal, and multiple stylistic datasets should be conducted to further validate this claim (please refer to Section 6, where we detail some future research in this domain).

The lack of curated datasets with some balance across structural music dimensions (e.g., key, tempo, instrumentation) is a major drawback in this task. After analyzing several styles within the Apple loop collection, hip-hop was the one closest to a uniform distribution in harmonic, rhythmic, and spectral content (which we show in Fig. 7 and discuss below). In the context of this study, we adopted the Apple loops dataset as it is very close to what a producer or composer would adopt in a real-case scenario.

Fig. 7

Histogram distributions of the harmonic, rhythmic, and spectral criteria for two musical audio loop datasets A1 and A2 used in the study evaluation. Distributions (a), (b), and (c) result from the entire analysis of the harmonic ci(m), rhythmic ri(q), and spectral bi(s) content, respectively, of dataset A1, which features 551 loops with tonal content. Distributions (d) and (e) result from the complete analysis of the rhythmic ri(q) and spectral bi(s) content, respectively, of dataset A2, which features 170 loops with percussive content. Globally, the distributions denote some uniformity, ensuring an unbiased distribution across the search space

The loops range from 5 to 24 seconds in duration and feature diverse tempi (in bpm) and multiple instruments across a large set of spectral regions, roughly in the [40,10000] Hz range. Figure 7 shows the harmonic ci(m), rhythmic ri(q), and spectral bi(s) representation histograms for the entire loop collection in datasets A1 (images (a), (b), and (c)) and A2 (images (d) and (e)). For an unbiased evaluation, dataset histograms should result in uniform distributions, as they denote a comprehensive exploration of the search and feature spaces. While the harmonic histogram in Fig. 7(a) shows a quite uniform distribution, the rhythmic histograms (images (b) and (d)) and the spectral histograms (images (c) and (e)) for both datasets A1 and A2 have a slight prevalence towards the lower bpm and Bark bands, respectively. Despite this somewhat expected bias towards the more musically common regions of the bpm and Bark scale ranges, the distributions remain quite balanced. Most consecutive bins denote minor differences and all bins have some magnitude, enabling a comprehensive exploration of the search and feature spaces.

The objective evaluation adopts the dataset A1 and targets the generation of two-layered loop mashups with tonal content. The subjective evaluation included tonal and percussive loops from both A1 and A2 datasets. A threefold layered mashup is adopted, where two layers are retrieved from the tonal-content loops in dataset A1 and a percussive loop from dataset A2. To assess the recombination between the percussive loop and the remaining tonal-content loops, we do not inspect harmonic compatibility in Ep. All remaining criteria are considered. The illustration of the criteria computation in Fig. 4 demonstrates the procedure adopted in the perceptual threefold loop mashup evaluation.

We have defined the AIS opt-aiNet and GA parameters to use an initial population of 20 cells and a maximum number of 200 iterations. Furthermore, we set the number of clone generations for each network cell in AIS opt-aiNet to 10, and the affinity threshold t = .5. While the weights in (8) aim primarily to provide user control over the search criteria in the feature space, for the context of the evaluation, we set the weights to slightly privilege harmonic and rhythmic compatibility over spectral balance by setting α = .4 (harmonic compatibility), β = .4 (rhythmic compatibility), and γ = .2 (spectral balance). The definition of the weights resulted from experimentation with the model using the datasets A1 and A2 and follows some early evidence on the importance of each criterion [24].

4.1 Objective evaluation of compatibility

To objectively assess and compare the compatibility in all models under evaluation, we average the evaluation function values Ep of the 10 highest-compatible mashups, thus providing an average indicator of the model compatibility. The smaller the average compatibility value, the better it complies with the objective criteria in (8) that we aim to minimize. Furthermore, we compute 10 runs of the AIS opt-aiNet and GA models to capture the inter-run deviation. Deviations in the optimization convergence of the models are expected to be more noticeable in the GA. While not guaranteeing similar results at each run, the AIS opt-aiNet algorithm operators, namely affinity suppression and population expansion, provide mechanics that tend to evolve the population towards similar convergence results. The BF approach computes identical results at each run, as it exhaustively inspects all loop mashup combinations.

4.2 Objective evaluation of diversity

To measure diversity between two given mashups p in the three models under evaluation, we adopt the cosine distance between the concatenated vectors 〈Tp(k),rp(q),bp(s)〉 of each mashup p. The objective diversity of a population is given by the median across all unique pairwise mashup distances from the 10 highest-compatible set in each model. A total of 45 pairwise distance values are considered per model.
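The diversity measure can be sketched as the median over all unique pairs, which for 10 mashups yields the 45 distances mentioned above. The sketch assumes real-valued feature vectors given as Python lists; the function names are hypothetical.

```python
import itertools
import math
import statistics


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero real-valued vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def diversity(vectors):
    """Median cosine distance over all unique pairs of mashup vectors."""
    distances = [
        cosine_distance(u, v) for u, v in itertools.combinations(vectors, 2)
    ]
    return statistics.median(distances)
```

Using the median rather than the mean makes the indicator less sensitive to a few near-duplicate or highly distant pairs.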

The use of the cosine distance across the multiple harmonic H, rhythmic R, and spectral S dimensions is directly inspired by the affinity suppression operator adopted in our implementation of the AIS opt-aiNet (please refer to Section 3.3). The adopted distance metric results in distance values close to zero for perceptually similar mashups p and higher distance values for less perceptually similar mashups. Therefore, the larger the cosine distance between mashups p, the greater their perceptual difference, which we adopt as an indicator of diversity. In other words, the larger their distances in the feature spaces, the more perceptual diversity in the population.

In the harmonic feature space Tp(k), the cosine distance indicates the retaining of common tones in the mashups’ harmonic content [15, 41]. In the rhythmic feature space rp(q), the cosine distance gradually increases from rhythmic periodicities at the same tempo and subdivision to less identical rhythmic structures within the same tempo (or metrical structure, such as double or half tempo rhythmic structures) and, finally, to rhythmic periodicities that are not multiples of the tempo [16]. In the spectral feature space bp(s), the cosine distance inspects the alignment of the Bark spectral representation of the mashups, a common metric to capture timbral similarity [6]. In the rhythmic and spectral feature spaces, the adoption of cosine distance accounts for the perceptual similarity while disregarding amplitude differences.

4.3 Evaluating computational performance

The computational performance of the models is instrumental to the task of computational mashup creation at scale due to the combinatorial explosion of loop recombinations. The complexity of the problem depends on the size L of the user-curated loop dataset A. For example, a BF approach in the dataset A1, including 551 loops, results in 151525 unique combinations for mashups with two overlapping loops.
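The combinatorial growth can be checked with a short Python sketch. The function name is hypothetical, and the sketch only counts unordered loop combinations drawn from a single dataset.

```python
import math


def count_mashups(dataset_size, layers):
    """Number of unique mashups with `layers` overlapping loops from one dataset."""
    return math.comb(dataset_size, layers)


two_layer = count_mashups(551, 2)    # 551 * 550 / 2 = 151525
three_layer = count_mashups(551, 3)  # 551 * 550 * 549 / 6 = 27729075
```

Moving from two to three overlapping loops in dataset A1 raises the number of candidates from 151525 to 27729075, which illustrates why an exhaustive search scales poorly.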

The associated computational cost of each iteration for the AIS opt-aiNet and GA under evaluation can be defined as \(\mathcal {O}(LV)\), where L is the current population size, and V is the length of the combined rhythmic, harmonic, and spectral representation vectors. The affinity suppression in AIS opt-aiNet has an additional computational cost of \(\mathcal {O}(L^{2}V)\). The BF approach does not feature multiple iterations, and its computational cost can be defined as \(\mathcal {O}(L^{2}V)\). The user defines the initial population in the AIS opt-aiNet and GA. In our evaluation, we adopt a population size of 20, which remains stable in size in the GA and is dynamically adapted at each iteration of the AIS opt-aiNet. In the BF approach, the population equals the total number of loops L in the dataset A.

These costs indicate a lower computational cost for the GA than for the AIS opt-aiNet, as affinity suppression adds complexity in the latter model. However, both AIS opt-aiNet and GA suggest substantial gains compared to the BF approach, although these gains depend on their ability to converge. Therefore, to assess the performance of the models in the real-case scenario of the Apple Loops collection, we computed the average CPU usage over 10 runs. Furthermore, we also report the number of iterations and the population size at convergence, which is particularly relevant for AIS opt-aiNet due to the dynamic behavior of its population size at each iteration.

4.4 Perceptual evaluation

We conducted a perceptual test to explore the relationship between user enjoyment of a mashup and the evaluation function Ep in (8), which objectively assesses the compatibility of a mashup p. We raised the hypothesis that user enjoyment would be correlated with the compatibility evaluation function Ep. We conducted an online listening test to perceptually assess the compatibility of the resulting mashups from Mixmash-AIS. Our test design is based on the procedure reported in [17] assessing the relationship between an objective mashability metric and user enjoyment. In our case, the mashability metric is replaced by the evaluation function Ep.

In total, 103 participants completed the listening test. Targeted participants included college students (undergraduate and graduate levels) and a balanced number of participants across musical training (51.2% with versus 48.8% without training). As musical mashups can be enjoyed irrespective of musical training, we did not use this as a criterion for selecting participants. However, we ensured that the participants understood the term “music mashup” by explicitly informing them about the practice. No participant reported hearing impairments. To allow the participants to familiarize themselves with the procedure of the listening test, namely the interface, and to set the playback volume to a comfortable level, each participant undertook a short training phase before completing the test. Participants were asked to use high-quality headphones and provided informed consent. Furthermore, participants were free to withdraw at any point and were not paid to complete the listening test.

The listening test included 10 mashups randomly selected from 14 output mashups from Mixmash-AIS. Mashups overlap three audio loops and have an average duration of 10 s. To prevent order effects, the 10 musical mashups were presented in a different order per participant, using a uniform random distribution. Participants rated the mashup enjoyment using a 7-point Likert scale, where 1 corresponds to low enjoyment and 7 to high enjoyment.

5 Results

Figure 8 plots descriptive statistics for the three objective criteria under evaluation, namely (1) compatibility, (2) diversity, and (3) computational performance, for the AIS opt-aiNet, GA, and BF Mixmash models. We ran the two former models 10 times to account for the inter-run deviations, computed as the standard deviation between all runs. The BF approach performs equally in every run, as it computes the compatibility of all possible loop dataset combinations. Therefore, we only ran the model once. Tables 1, 2, and 3 present the results for the objective criteria under evaluation per run. The results report indicators from the 10 highest-compatible mashups per model.

Fig. 8

Boxplots showing descriptive statistics of the objective evaluation criteria results for three computational loop mashup creation models. The comparison includes the 10 highest-compatible mashups proposed by the following models: artificial immune system (AIS) opt-aiNet, genetic algorithm (GA), and brute force (BF) approach. Three objective criteria are evaluated. Compatibility (left plot) between overlapping loops within a mashup is to be minimized. Diversity (middle plot) assesses the perceptual affinity between the resulting set of proposed mashups and is to be maximized. Finally, computational performance (right plot) inspects the runtime CPU cost of the models until termination

Table 1 Objective evaluation of compatibility and diversity across 10 runs of the Mixmash-AIS model using an artificial immune system (AIS) opt-aiNet algorithm
Table 2 Objective evaluation of compatibility and diversity across 10 runs of the Mixmash-GA model using a genetic algorithm
Table 3 Objective evaluation of compatibility and diversity in a brute force approach (exhaustive search of all pairwise loop mashup matches) in Mixmash

By comparing the average compatibility values Ep between the AIS opt-aiNet (1.368 ± .114) and GA (2.694 ± .195) models, we can observe that the AIS opt-aiNet outperforms the standard GA in finding compatible mashups, as it reaches smaller (i.e., closer to optimal) values in the search space. These results reinforce the importance of the optimization multimodality in AIS opt-aiNet to guarantee a comprehensive search across the space, which typically secures both global and local optima. Conversely, the population of the GA tends to converge to the same region, which is not guaranteed to contain the global optimum. The average evaluation value of the BF approach (.774) is lower than those of the AIS opt-aiNet and GA, thus presenting a set of more compatible mashups in its 10 highest-compatible mashups. However, AIS opt-aiNet excludes perceptually similar mashups with high affinity, only retaining the most compatible mashup among them. Conversely, BF outputs all mashups irrespective of their affinity. The latter assertion is verified by the lower median affinity value of the BF approach compared to the AIS opt-aiNet, which indicates that the 10 highest-compatible mashups found by the BF are less diverse.

The objective diversity of the three models, AIS opt-aiNet (1.624 ± .044), GA (.4379 ± .229), and BF (1.494), denotes a clear advantage of the AIS opt-aiNet in promoting perceptually different mashups in the 10 highest-compatible mashups, since their global distances in the feature space are more spread than those of the remaining models. The GA performs very poorly in diversity as it typically converges most mashups towards a single region in the search space. The smaller the affinity median value, the closer the candidate solutions are in the feature space. Consequently, the mashup outputs are not as perceptually different from each other. Reduced diversity in the output mashup collection restricts the ability of the creative MIR interface to act as a collaborative machine [4], as it neither promotes degrees of surprise and serendipity nor accounts for user preferences.

By comparing the computational performance of the three models given by the CPU time to return a set of mashup solutions (i.e., the set of optimal mashup solutions in AIS opt-aiNet or GA and the complete set of pairwise comparisons in the BF approach), we observe gains in the average CPU time of GA (1428 ± 275 ms) and AIS opt-aiNet (6708 ± 3280 ms) compared to the CPU time of the BF (151,523 ms). The GA could be further optimized, as no stopping criteria have been defined for it. If no diversity is required, the GA is the best optimization strategy due to its performance efficiency while ensuring similar yet compatible mashups. The BF approach presents an obviously high computational cost in a combinatorial explosion problem. We could argue that once the evaluation of the complete set of loop combinations is pre-computed, we could store the results and retrieve them continuously at runtime. However, this strategy would fail when adopting three or more overlapping loops, as their features and distance metrics would have to be recomputed. By inspecting the varying iteration count in the AIS opt-aiNet and the resulting average compatibility values in each run (13.2 ± 1.988), we can observe the capacity of the stopping criteria in the algorithm to assess optimal convergence conditions.

To examine the listening test results, we first inspected the average perceptual ratings per mashup, defining the level of enjoyment. To assess the relationship between user enjoyment and the objective mashup compatibility evaluation function, we computed the Pearson correlation coefficient, which measures the linear correlation between two sets of data. The scatter plot in Fig. 9 shows the linear relationship between the two variables. A Pearson correlation coefficient of r = .99 (p < .001) indicates a statistically significant correlation between the objective cost function Ep and user enjoyment for our 10 mashups. These results support our hypothesis and validate the overall effectiveness of the objective evaluation function Ep in capturing the compatibility of recombined loops. A coefficient of determination R² = .98 indicates a strong fit between the linear regression model and the variables.
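For reference, the Pearson coefficient can be computed as in the sketch below. The function name is hypothetical and no data from the listening test is reproduced here; the check at the end only illustrates that, for a simple linear regression, the coefficient of determination equals the squared correlation, which is consistent with r = .99 and R² = .98.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# In simple linear regression, R^2 is the square of r.
r_reported = 0.99
r_squared = round(r_reported ** 2, 2)
```

Here `pearson_r` would take the per-mashup evaluation values Ep and the corresponding average enjoyment ratings as its two inputs.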

Fig. 9

Correlation and linear regression analysis between the objective evaluation function values Ep and the enjoyment ratings from the perceptual listening test for 10 Mixmash-AIS mashups. A statistically significant relationship (r = .99,p < .001) has been found between the variables under analysis

6 Conclusions

6.1 Summary and original contribution

We proposed Mixmash-AIS, a multimodal musical loop mashup optimization model for loop recombination at scale. It adopts the AIS opt-aiNet algorithm to promote compatible and diverse mashups while addressing the scalability issues in state-of-the-art BF solutions for computational music mashup creation. To automatically assess the compatibility of mashups of musical audio loops, we proposed an objective function that accounts for three fundamental musical dimensions: harmony, rhythm, and timbre. In greater detail, the objective evaluation function inspects the harmonic compatibility in the Tonal Interval Vector space [6, 20], the rhythmic compatibility using a time-invariant histogram proposed in [16], and the spectral balance, which promotes loop recombination in different critical Bark bands.

An objective comparison of AIS opt-aiNet to standard GA and BF approaches in proposing loop mashups denotes the primacy of the AIS opt-aiNet in finding local and global optimal mashups, closely matching the compatibility values of the BF approach. Furthermore, the AIS opt-aiNet promotes greater diversity than the GA and BF approaches. Finally, GA and AIS opt-aiNet have significant computational performance gains compared to the BF approach. A perceptual listening test was conducted and validated, with statistical significance, the hypothesis that the proposed objective evaluation function in Mixmash-AIS captures user enjoyment of a mashup (r = .99, p < .001).

In promoting a diverse set of optimal mashups, the Mixmash-AIS can account for personal preferences and stylistic traits, which are fundamental to production-based MIR interfaces [4]. Not only do the operational mechanics of the AIS opt-aiNet accommodate greater diversity, but the weights regulating the importance of each harmonic, rhythmic, and spectral criterion within the objective evaluation function also allow users to bias the search towards their preferences.

6.2 Towards Mixmash-AIS model applications

Existing implementations of the Mixmash-AIS model (found in the Supplementary Material to this article) lack a graphical user interface (GUI) promoting an intuitive strategy adapted to the two main identified target user groups: professional musicians, namely producers and composers, and lay-users. This section envisions user scenarios for the Mixmash-AIS model according to these two target user groups. To this end, we define how the model can contribute to or promote a creative flow within the target user groups, and what usability criteria are relevant in the design of the interaction and interface with the model.

The search for good matches for creating mashups or mixes between musical audio loops is a common task in the workflow of professional musicians. The ear training and music theory background of these users typically allow them to critically assess the compatibility between any two (or more) given musical audio loops. However, browsing and creatively experimenting with large-scale datasets is very time-consuming due to the poor formal annotation of the musical audio content (e.g., formal musical structural descriptions such as key, chord, and meter annotations). Instead, high-level semantic labels (e.g., style and instrumentation) are typically adopted. The Mixmash-AIS model streamlines this process by proposing several optimal mashups, thus excluding the need for a manual search for good matches at the formal levels of the harmonic, rhythmic, and spectral domains. While these elements are not fully formalized in the literature, our evaluation has validated our evaluation function Ep in capturing these traits when proposing different mashups.

Furthermore, the diversity of the proposed Mixmash-AIS model, stemming from the local searches in multiple perceptually different locations of the search space, allows accommodating user preferences, which are central to the workflow of musicians [42]. A significant number of dimensions relevant to the decision of what musical audio to include in a particular musical context, or reflecting the stylistic idiosyncrasies of the user, are excluded from our objective evaluation function. In this context, the model’s capacity to propose diverse optimal mashups is important to promote a workflow where a human-in-the-loop strategy exists (REF). Mixmash-AIS is an intelligent collaborative assistant that filters or reduces the intractable set of combinations between all audio loops to a smaller set from which the user can choose.

A typical scenario where the Mixmash-AIS model can be applied is in the professional musicians’ workflow, namely assisting DJs in preparing their musical sets. The model can adopt both musical audio loops and full musical audio tracks. To a given extent, it could even propose which spectral components to exclude to maximize compatibility (i.e., proposing changes to the signal that could enhance the evaluation function value Ep), which typically aligns with the strategies adopted by DJs when mixing two tracks (e.g., retaining the bass from one track and the lead parts from a different track). Another recurrent scenario where Mixmash-AIS can enhance the professional musicians’ creative workflow is finding a loop or musical audio sample for an existing project (e.g., a song). The model can easily be adapted to find a musical loop or track to be recombined with an existing musical context (e.g., finding a harmonization for a pre-recorded vocal track). Different graphical user interfaces for standalone applications or plugins for existing audio production software shall be studied for each scenario. Design strategies to promote serendipitous searches in large musical audio loop datasets have been discussed in [6] and can gain from the diverse set of loops proposed by the model.

The choice of the musical audio (loop) dataset is left entirely to the user and can be adjusted to their musical preferences or the musical context of a particular project. The agnostic search at the formal level of musical structure compatibility of the Mixmash-AIS model can accommodate any musical style and any cross-combination of styles. The curation of the dataset is fundamental to accommodate the user’s musical context and preferences. The model could also be easily adapted to work with large-scale datasets online, such as Freesound [43]. The size of the dataset is an important consideration in the model. Very small datasets may result in a small pool of optimal and not too diverse mashups. Furthermore, small datasets (e.g., about 10 to 20 musical audio loops or tracks) can potentially be manually browsed. The model shows greater potential in dealing with larger datasets (e.g., about 100 or more musical audio loops or tracks).

In the context of lay-users, the Mixmash-AIS model can be understood as a strategy to promote participation in music creation for those lacking musical training, who can adopt the model to select compatible musical audio loops or tracks. The model can fill a gap in knowledge or tools that could exclude non-musically trained users from participating in musical creation. These strategies can provide a new context to the favorite music of a particular user within entertainment music consumption scenarios. The use of small (personal) datasets or playlists can be envisioned here as a typical scenario. In such a scenario, diversity is less important than finding optimal compatibility to promote new listening experiences.

6.3 Future work

In the future, the most pressing research to be considered is in-depth user studies to assess and validate the Mixmash-AIS model in the musical practice of both professional musicians and lay-users. To conduct such studies, we shall consider user experiences to design, implement, and compare multiple interfaces adopting the Mixmash-AIS. Collected feedback can assess the usability of the interfaces and the degree to which they assist a user engaged in creative work. Furthermore, such evaluation can inform the model’s algorithmic design. An important aspect to be considered in the study is the possibility of users adopting their own musical collections within their projects, which would shed light on the impact of dataset size and provide some insight into the ideal number of optimal diverse mashups to be presented to the user.