1 Introduction

Researchers in psychology, neuroscience, musicology, and engineering have long tried to find quantitative mathematical models of perceptual attributes of music. However, human perception can hardly be systematized into a fixed set of disjoint categories. In fact, music similarity, expressiveness, emotion, and genre, to name a few, are elusive terms that defy a shared and unambiguous definition [1,2,3]. Furthermore, seeking a general consensus might be considered an ill-defined problem, as these aspects of music appreciation are distinctively subjective and strongly dependent on one’s personal experience, music education, background, and culture [4,5,6]. Our perception of a musical performance depends on the complex interaction between multiple interrelated conceptual layers, and no single aspect can be gauged in a vacuum: melody, harmony, rhythm, loudness, timbre, time signature, and tempo all play a joint role in the holistic perception of music and may affect how a musical piece is experienced [7]. Nevertheless, while an all-encompassing model for music perception may seem a long way off, researchers have understandably resorted to a divide-and-conquer strategy to address such a multifaceted problem.

This work focuses on the long-studied aspect of rhythm complexity. Over the past few decades, numerous mathematical models have been proposed in the literature [8,9,10,11,12,13,14,15,16,17], all aimed at assessing the complexity of a pattern of rhythmic events. These algorithms ultimately provide an explicit mapping from a symbolic representation of the rhythm to a scalar value meant to quantify the degree of complexity that a human listener would perceive. These methods, however, are only able to partially model rhythm complexity. Indeed, while most listeners are able to assess the complexity of a given musical piece, an experienced musician writing or performing music can make use of a controlled degree of complexity to great artistic effect.

Data-driven methods have recently proven to be a powerful and expressive tool for multimedia generation. For example, deep learning techniques have been successfully applied to image [18,19,20], text [21,22,23], speech [24,25,26], and music generation [27,28,29,30,31,32]. However, the mappings provided by deep generative models for musical applications are typically implicit and lack interpretability of the underlying generative factors. In fact, they often fall short on two crucial aspects: controllability and interactivity [33]. Nonetheless, several attribute-controlled methods have recently been proposed in the literature. The hierarchical architecture of MusicVAE [29], besides achieving state-of-the-art performance in modeling long-term musical sequences, enables latent vector arithmetic manipulation to produce new samples with the desired characteristics. Following the work of [29], Roberts et al. [28] explored possible interactive applications of latent morphing via the interpolation of up to four melodies or drum excerpts. In [31], Gillick et al. introduced GrooVAE, a seq2seq recurrent variational information bottleneck model capable of generating expressive drum performances. The model, trained on the Groove MIDI Dataset, was designed to tackle several drum-related tasks, including humanization, groove transfer, infilling, and translating tapping into full drum patterns. Furthermore, Engel et al. [34] showed that it is possible to learn a posteriori latent constraints that enable the use of unconditional models to generate outputs with the desired attributes. Hadjeres et al. [35] proposed a novel geodesic latent space regularization to control continuous or discrete attributes, such as the number of musical notes to be played, and applied it to the monophonic soprano parts of J.S. Bach chorales. In [36], Brunner et al. presented a recurrent variational autoencoder complemented with a softmax classifier that predicts the music style from the latent encoding of input symbolic representations extracted from MIDI; the authors thus performed style transfer between two music sequences by swapping the style codes. Tan and Herremans [37] took inspiration from Fader Networks [38] and proposed a model that allows music attributes (such as arousal) to be continuously manipulated via independent “sliding faders.” This was achieved by learning, via Gaussian Mixture VAEs [39], separate latent spaces in which high-level attributes can be inferred from low-level representations. More recently, Pati and Lerch [40] proposed a simple regularization method that monotonically embeds perceptual attributes of monophonic melodies, including rhythm complexity, in the latent space of a variational autoencoder. The authors later investigated the impact of different latent space disentanglement methods on the music generation process of controllable models [41]. Finally, it is worth mentioning that recent commercial products, such as Apple’s Logic Pro Drummer, offer some degree of control over the rhythm complexity of an automated polyphonic drumming performance.

Against the backdrop of such a rich literature corpus, the contribution of this article is twofold. First, we propose a novel complexity measure that is specifically designed for drum patterns belonging to the Western musical tradition. To the best of our knowledge, this constitutes the first attempt at designing a proper polyphonic rhythm complexity measure. We validate the proposed algorithm via a perceptual experiment conducted with human listeners and show a high degree of agreement between measured complexity and subjective evaluations. Second, we present a latent vector model capable of learning a compact representation of drum patterns that enables fine-grained and explicit control over perceptual attributes of the generated rhythms. Specifically, we encode the newly proposed complexity measure in the latent space of a recurrent variational autoencoder inspired by [29, 31] and modified to enable single-knob manipulation of the target attribute. The resulting model can generate new and realistic drum patterns at the desired degree of complexity and provides an interpretable and fully-navigable latent representation that appears topologically structured according to the chosen rhythm complexity measure.

The remainder of this article is organized as follows. In Section 2, we provide an overview of the relevant literature and existing techniques for measuring the rhythm complexity of monophonic patterns. In Section 3, we propose a novel polyphonic complexity measure. In Section 4, we describe the dataset of drum patterns utilized in the present study. In Section 5, we provide the details of the listening test conducted to validate the proposed complexity measure and present the results. In Section 6, we outline the proposed latent rhythm complexity model. In Section 7, we evaluate the performance of the proposed model on the tasks of attribute-controlled drum pattern generation and output complexity manipulation. Finally, Section 8 concludes this work.

2 Background on rhythm complexity

Over the years, several rhythm complexity measures have been proposed in the literature [42,43,44]. Rather than considering the music signal as a raw waveform, most of the existing methods rely on an intermediate symbolic representation of rhythm, such as that produced by an ideal onset detector [45]. For a given tatum, it is customary to derive a discrete-time binary sequence of onsets distributed across a finite number of pulses. On this grid, ones correspond to onsets and zeros correspond to silence, as depicted in Fig. 1. A pulse refers to the smallest metrical unit meaningfully subdividing the main beat and represents one of the discrete-time locations within a binary pattern, each of which can be assigned either a one or a zero. Therefore, a rhythm complexity measure can be thought of as a (nonlinear) function \(f_{\mathrm {p}}: \{0,1\}^{M\times N} \rightarrow \mathbb {R}\) that, given the matrix representation of a polyphonic rhythm with N pulses and M voices, yields a real-valued scalar.

Fig. 1 Various representations of the same polyphonic drum pattern: raw audio waveform (top), drum sheet music (center), and symbolic binary representation (bottom)

Many rhythm complexity measures have been based on the concept of syncopation [8, 13, 14, 15, 47], i.e., the placement of accents and stresses meant to disrupt the regular flow of rhythm. Others, such as [10, 12, 48], are measures of irregularity with respect to a uniform meter, and several rely on the statistical properties of inter-onset intervals [9, 16, 42, 49]. Moreover, some authors have investigated entropy [50], subpattern dependencies [11], predictive coding [17], and the amount of data compression achievable [12, 51] in order to quantify the complexity of a rhythmic sequence. However, to the best of our knowledge, previous work almost entirely concerns monophonic patterns (\(M=1\)) rather than polyphonic rhythms (\(M>1\)).

Notably, [52, 53] explore the adaptation of Toussaint’s metrical complexity measure [13] and the Longuet-Higgins and Lee measure [8], respectively, to the polyphonic case. In [52], rhythm complexity estimates are used to rank MIDI files in a database. In [53], complexity measures are used to drive an interactive music system. Crucially, however, [52, 53] consider each drum-kit voice independently of the others before pooling the results, thus disregarding the interaction between voices. Furthermore, the authors provide no validation of the proposed methods against the results of a subjective evaluation campaign conducted with human listeners.

3 A novel rhythm complexity measure for polyphonic drum patterns

Drawing from the rich literature discussed in Section 2, the simplest design for a proper polyphonic complexity measure \(f_\mathrm {p}\) would first entail computing the complexity of each voice \(x_m\) in an M-voice pattern \(\mathbf {x} = [x_1[n],\ldots , x_M[n]]^\top\) separately, using one of the many state-of-the-art monophonic rhythm complexity measures \(f(\cdot )\). Then, the overall complexity can be obtained as the linear combination

$$\begin{aligned} f_\mathrm {p}(\mathbf {x}) := \sum\limits_{m=1}^{M} w_m f\left( x_m[n]\right) . \end{aligned}$$
(1)

However, such a naive approach is bound to provide a poor complexity model as it does not take into consideration the interplay between voices that are instead meant to complement each other.

Instead, we propose to compute the linear combination of the monophonic complexity of groups of voices selected from those that are often found to create interlocked rhythmic phrases in the drumming style typical of contemporary Western music. Indeed, our assumption is that grouping multiple voices together makes it possible to better capture the perceptual rhythm complexity of polyphonic patterns, as it is arguably determined by the joint interaction of multiple sources that play a certain role only in relation to others.

Given a subset of binary voices \(x_1[n],\ldots ,x_L[n]\) out of the M voices in \(\mathbf {x}\in \{0,1\}^{M\times N}\), the kth group can be defined as

$$\begin{aligned} g_k[n] := \bigvee\limits_{\ell =1}^L x_{\ell }[n] \end{aligned}$$
(2)

Namely, \(g_k[n]=1\) if and only if at least one of the L grouped voices has an onset at pulse n. Otherwise, \(g_k[n]=0\). Applying (2) to all K groups, the given pattern \(\mathbf {x}\) yields an augmented matrix representation \(\mathbf {g} = [g_1[n],\ldots , g_K[n]]^\top\) of size \({K\times N}\), where possibly \(K\gg M\). Hence, the proposed polyphonic complexity measure is given by

$$\begin{aligned} f_\mathrm {p}(\mathbf {g}) := \sum\limits_{k=1}^K w_k f(g_k[n]) \end{aligned}$$
(3)

where the weights \(w_k\) can be, e.g., either set to 1/K (yielding a simple average) or determined via (possibly non-negative) linear regression against the subjective results of a large-scale listening test.

We empirically determine the grouping reported in Table 1. Most notably, bass and snare drums are merged into a single group (\(k=1\)). Together, they constitute the backbone of contemporary Western drumming practices, especially in the rock and pop genres. Therefore, their relationship cannot be wholly conceptualized if they are considered disjointedly. Likewise, high and low toms are merged into group \(k=6\). For their part, closed and open hi-hat appear by themselves in their respective groups (\(k=2\) and \(k=3\)), and we include an auxiliary group (\(k=4\)) to account for those rhythms in which the hi-hats follow a regular pattern regardless of the pedal action. For instance, let us consider a 1-bar pattern where open and closed hi-hat alternate, as depicted in Fig. 2. The closed hi-hat (\(k=2\)) is always off-beat and is thus likely to be assigned high complexity. However, the joint rhythm consists of a regular sequence of semiquavers, making for a rather easy-to-conceptualize pattern. Similarly, an auxiliary group (\(k=9\)) is introduced for crash and ride cymbals, as the former is often used to accent patterns played mostly on the latter. Ultimately, by measuring the complexity of joint patterns and individual voices, we expect to regularize the overall complexity estimate, accounting for both regularity and novelty.

Table 1 Proposed drum-kit voice groups \(g_k[n]\) used for the computation of rhythm complexity as given in (3)
Fig. 2 Exemplary 1-bar hi-hat pattern. Taken by itself, the Closed Hi-Hat group (\(k=2\)) is characterized by high syncopation, all onsets being off-beat. Functionally, however, the combined pattern played on the hi-hat (\(k=4\)) is likely to be perceived as steady and regular

In this study, in order to quantify the complexity of each group, we adopt Toussaint’s metrical complexity measure [13]. For completeness, a detailed presentation of [13] is given in the Appendix. However, the proposed method does not intrinsically rely on any particular choice of \(f(\cdot )\), and a different monophonic measure may be used for each group of voices independently of the others.
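As an illustration, the following Python sketch implements the voice grouping of (2) and the weighted sum of (3) for a 9-voice, 32-pulse binary pattern. It is a minimal sketch under several assumptions: the monophonic measure is a simplified rendition of Toussaint's metrical complexity with an illustrative weight hierarchy (the exact formulation adopted in this work is detailed in the Appendix), and the GROUPS dictionary reproduces only the subset of Table 1 explicitly discussed above. Function names are our own and should not be read as the reference implementation.

```python
import numpy as np

# Illustrative hierarchical metrical weights for one 4/4 bar of 16 pulses;
# a 2-bar pattern simply tiles this template twice.
BAR_WEIGHTS = np.array([5, 1, 2, 1, 3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 2, 1])

def metrical_complexity(voice):
    """Simplified Toussaint-style metrical complexity of a binary voice."""
    n = len(voice)
    weights = np.tile(BAR_WEIGHTS, n // len(BAR_WEIGHTS))
    onsets = voice.astype(bool)
    k = int(onsets.sum())
    if k == 0:
        return 0.0
    metricity = weights[onsets].sum()
    max_metricity = np.sort(weights)[::-1][:k].sum()  # best placement of k onsets
    return float(max_metricity - metricity)

# Partial set of voice groups following the text of Section 3; the complete
# grouping is given in Table 1. Voice indices follow the GMD mapping of Section 4.
GROUPS = {
    1: [0, 1],   # bass + snare drum
    2: [2],      # closed hi-hat
    3: [3],      # open hi-hat
    4: [2, 3],   # compound hi-hat pattern
    9: [7, 8],   # ride + crash cymbals
}

def polyphonic_complexity(x, weights=None, f=metrical_complexity):
    """Eq. (3): weighted sum of monophonic complexities over voice groups."""
    if weights is None:
        weights = {k: 1.0 / len(GROUPS) for k in GROUPS}  # uniform weighting
    total = 0.0
    for k, voices in GROUPS.items():
        g_k = np.any(x[voices, :], axis=0).astype(int)    # Eq. (2): logical OR
        total += weights[k] * f(g_k)
    return total
```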

4 Groove MIDI Dataset

The Groove MIDI Dataset (GMD) was released by the authors of [31] and contains 13.6 h of drum recordings performed by professional and amateur drummers on an electronic drum set. The dataset contains audio files, MIDI transcriptions, and metadata, including time signature and tempo. Whereas the original recordings are of variable length, we limit our study to 2-bar scores. GMD comprises a total of 22619 2-bar scores, 97% of which are in 4/4 time. Filtering out other time signatures yields a total of 21940 samples. The General MIDI standard for drum kits assigns a distinct note number to each percussion instrument. We apply the reduction strategy proposed in [29, 31] to map the 22 drum classes included in GMD onto nine canonical voices: Bass drum (0), Snare drum (1), Closed Hi-Hat (2), Open Hi-Hat (3), High Floor Tom (4), Low-Mid Tom (5), High Tom (6), Ride Cymbal (7), Crash Cymbal (8). We assume a tatum corresponding to a sixteenth note, regardless of tempo. This yields a total of 32 pulses every two bars. Thus, we quantize every MIDI event to the closest pulse in order to produce a symbolic representation in the form of a real-valued matrix with \(M=9\) rows and \(N=32\) columns. Finally, we obtain a binary matrix by discarding the velocity information of each onset and replacing all rhythmic events with non-zero velocity with ones.
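A rough sketch of this preprocessing step is given below. It relies on the pretty_midi library, assumes a constant tempo and that all drum notes sit in the first instrument track, and uses a deliberately partial pitch-to-voice map (PITCH_TO_VOICE); the actual reduction of [29, 31] covers all 22 GMD drum classes.

```python
import numpy as np
import pretty_midi

# Partial General-MIDI-pitch -> canonical-voice map (illustrative only).
PITCH_TO_VOICE = {36: 0, 38: 1, 42: 2, 46: 3, 43: 4, 47: 5, 50: 6, 51: 7, 49: 8}

def midi_to_binary(path, m=9, n=32):
    """Quantize a 2-bar 4/4 drum MIDI clip to an M x N binary onset matrix."""
    pm = pretty_midi.PrettyMIDI(path)
    bpm = pm.get_tempo_changes()[1][0]       # first tempo value, in BPM
    tatum = (60.0 / bpm) / 4.0               # sixteenth-note grid, in seconds
    x = np.zeros((m, n), dtype=np.uint8)
    for note in pm.instruments[0].notes:
        voice = PITCH_TO_VOICE.get(note.pitch)
        if voice is None or note.velocity == 0:
            continue
        pulse = int(round(note.start / tatum))   # snap onset to nearest pulse
        if 0 <= pulse < n:
            x[voice, pulse] = 1                  # drop velocity, keep onset only
    return x
```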

5 Perceptual evaluation

5.1 Experimental setup

The polyphonic complexity measure proposed in Section 3 is validated via a listening test involving eight MIDI patterns sampled from GMD. The evaluation corpus was obtained by synthesizing audio clips from MIDI files in order to control the quantization and velocity of the test patterns. In fact, the audio recordings included in GMD contain agogic and dynamic accents, which are traditionally excluded from the evaluation of rhythm complexity. First, we quantized every onset to the nearest semiquaver, and the corresponding velocity values were all set to 80. Then, the resulting patterns were repeated four times to create 8-bar sequences, synthesized to wav files using a library of realistic drum samples distributed with Ableton Live 9 Lite, and finally presented to human listeners via an online form. Akin to the five-point category-judgment scales of the Absolute Category Rating method included in the ITU-T Recommendation P.808 [54], testers were asked to provide a subjective assessment of the perceived rhythm complexity on a scale of 1 (lowest complexity) to 5 (highest complexity). The eight test samples were selected as follows. First, we filtered all three folds of GMD to gather a pool of candidate patterns. Specifically, we discarded all rhythms having either fewer than three voices (to make sure to evaluate proper polyphonic patterns) or fewer than eight pulses where at least one onset is present (to exclude overly sparse temporal sequences). Then, in order to ensure an even representation across the whole range of complexity values, we evaluated (3) for every candidate pattern using uniform weights \(w_k=1\), \(k=1,\ldots ,K\). We selected eight uniformly spaced target complexity values by sampling the range between the minimum and maximum complexity thus obtained. Hence, we extracted the eight drum patterns whose complexities were closest to the target ones. During the test, the order in which the patterns were presented to the user was randomized, and the name of each file was replaced with a string of random characters. In order to minimize the experimenter-expectancy effect, no audio examples were provided to the subjects before the test. Indeed, manually selecting a number of clips that aligned with the authors’ a priori notion of rhythm complexity could have led to confirmation bias. Instead, we opted for an experimental setup in which all test clips were presented on the same web page, and users were allowed to listen to all patterns and possibly modify previous assessments before submitting the final evaluation results. A total of 24 people took part in the experiment, mainly from a pool of university students and researchers from the Music and Acoustic Engineering program at Politecnico di Milano, Italy. All test subjects are thus expected to have some degree of familiarity with basic music theory concepts.
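The selection procedure can be summarized by the following sketch, which assumes a list of candidate binary patterns and reuses the hypothetical polyphonic_complexity function sketched in Section 3 (with a uniform weighting policy).

```python
import numpy as np

def select_test_patterns(patterns, n_targets=8):
    """Pick patterns whose complexity is closest to uniformly spaced targets."""
    # Keep proper polyphonic, non-sparse rhythms only.
    pool = [x for x in patterns
            if (x.sum(axis=1) > 0).sum() >= 3      # at least three active voices
            and (x.sum(axis=0) > 0).sum() >= 8]    # at least eight active pulses
    scores = np.array([polyphonic_complexity(x) for x in pool])
    targets = np.linspace(scores.min(), scores.max(), n_targets)
    return [pool[int(np.abs(scores - t).argmin())] for t in targets]
```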

5.2 Results

Figure 3 shows the correlation between the proposed rhythm complexity measure and the scores attributed to each pattern by the test subjects. Blue circles represent the average perceptual complexity assigned by the users to each of the eight drum patterns. Blue vertical lines, instead, represent the standard deviation for each given sample. The red dashed line represents the linear regression model with complexity measures as covariates and average subjective assessments as dependent variables. Using a uniform weighting policy for all voice groups, the data show a Pearson linear correlation coefficient of 0.9541 and, correspondingly, a Spearman rank correlation coefficient of 0.9762, indicating a strong monotonic relationship. Furthermore, the simple linear model \(y = 0.034\,f_\mathrm {p}(\mathbf {g}) + 1.35\) can fit the average user scores with a coefficient of determination of \(R^2\approx 0.91\).

Fig. 3 Correlation between the proposed rhythm complexity measure with uniform weights \(w_k=1,\,\forall k\), and the average score assigned by human listeners to each of the eight drum patterns

As previously mentioned in Section 3, perceptually informed group weights \(w_1,\ldots ,w_K\) may be determined from the collected subjective assessments. Although the small sample size involved in the present experiment does not allow for a robust linear regression and is likely to lead to overfitting, we empirically found that setting \(w_1=3\) for the bass and snare drum group (\(k=1\)) and \(w_4=w_9=1/3\) for the compound hi-hat (\(k=4\)) and cymbals (\(k=9\)) groups leads to a Pearson coefficient of 0.983, corresponding to a linear model \(y = 0.033\,f'_\mathrm {p}(\mathbf {g}) + 0.89\) with \(R^2\approx 0.97\).

The results presented in this section indicate a high degree of agreement between subjective assessments and the proposed complexity measure. Going forward, however, additional tests on curated datasets may be needed to confirm the applicability of the voice groups identified in Section 3 to genres such as jazz and heavy metal that are characterized by peculiar drumming techniques. These experiments are left for future work.

6 Proposed attribute-controlled drum patterns generation model

6.1 Deep generative architecture

In this section, we present a new attribute-controlled generative model that enables fine-grained modeling of musical sequences conditioned on high-level features such as rhythm complexity. The deep generative model is based on the hierarchical recurrent \(\beta\)-VAE architecture of MusicVAE [29], and it is augmented with two auxiliary loss terms meant to regularize and disentangle the latent space, respectively.

As in [31], the recurrent encoder \(q_{\phi }(\mathbf {z}|\mathbf {x})\) comprises a stack of two bidirectional layers, each with 512 LSTM cells. The forward and backward hidden states obtained by processing an input sequence \(\mathbf {x}\in \{0,1\}^{M\times N}\) are concatenated into a single 1024-dimensional vector, before being fed to two parallel fully-connected layers. The first layer outputs the locations of the latent distribution \(\varvec{\mu }\in \mathbb {R}^{H}\), where \(H=256\). The second layer, equipped with a softplus activation function, yields the scale parameters \(\varvec{\sigma }\in \mathbb {R}_{\ge 0}^{H}\).
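A minimal PyTorch sketch of such an encoder is given below; the class name and tensor layout (batch, pulses, voices) are our own assumptions, and details such as dropout and weight initialization are omitted.

```python
import torch
import torch.nn as nn

class DrumEncoder(nn.Module):
    """Bidirectional LSTM encoder q_phi(z|x) producing mu and sigma."""
    def __init__(self, m=9, hidden=512, latent=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=m, hidden_size=hidden,
                            num_layers=2, bidirectional=True, batch_first=True)
        self.fc_mu = nn.Linear(2 * hidden, latent)
        self.fc_sigma = nn.Linear(2 * hidden, latent)

    def forward(self, x):
        # x: (batch, N, M) float tensor holding the binary drum pattern.
        _, (h, _) = self.lstm(x)
        # Concatenate the last layer's forward and backward hidden states
        # into a single 1024-dimensional vector.
        h_cat = torch.cat([h[-2], h[-1]], dim=-1)
        mu = self.fc_mu(h_cat)
        sigma = nn.functional.softplus(self.fc_sigma(h_cat))  # non-negative scales
        return mu, sigma
```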

We implement a hierarchical LSTM decoder \(p_{\theta }(\mathbf {x}|\mathbf {z})\) composed of a high-level conductor network and a bottom-layer RNN decoder. Namely, both the conductor and the decoder are two-layer unidirectional LSTM networks with 256 hidden cells and tanh activations. The output layer of the conductor has size 128 and that of the decoder has M sigmoid units, as many as the number of drum voices.

The input sequence \(\mathbf {x}\) is split into \(S=8\) non-overlapping sections of size \(M\times N/S\). The conductor network, whose goal is to model the long-term character of the entire sequence, outputs S embedding vectors which are in turn used to initialize the hidden states of the lower-level decoder.

The latent code \(\mathbf {z}\in \mathbb {R}^{H}\) is sampled from the multivariate Gaussian posterior \(q_{\phi }(\mathbf {z}|\mathbf {x})\) parameterized by \(\varvec{\mu }\) and \(\varvec{\sigma }\). Then, it is passed through a fully-connected layer followed by a tanh activation function to compute the initial states of the conductor network. For each of the S segments, the 128-dimensional embedding vector yielded by the conductor is in turn passed through a shared fully-connected layer to initialize the hidden states of the lower-level decoder. The concatenation of the previous output and the current embedding vector serves as input for the decoder to produce the next section. The decoder autoregressively generates S sections, which are then concatenated into the complete output sequence.

As commonly done for \(\beta\)-VAEs, the base model is trained by minimizing the following objective [55]

$$\begin{aligned} \mathcal {L}_{\mathrm {VAE}} = -\mathbb {E}\bigl [\log p_{\theta }(\mathbf {x}|\mathbf {z})\bigr ] + \beta D_\mathrm {KL}\bigl (q_{\phi }(\mathbf {z}|\mathbf {x})\,||\,p(\mathbf {z})\bigr ), \end{aligned}$$
(4)

where \(D_\mathrm {KL}(\,\cdot \,||\,\cdot \,)\) denotes the Kullback-Leibler divergence (KLD) and the real-valued parameter \(\beta <1\) favors reconstruction quality over enforcing a standard normal distribution in the latent space [21].
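In practice, assuming Bernoulli-distributed binary outputs and the usual reparameterization trick, the objective in (4) can be computed as in the following sketch; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_logits, mu, sigma, beta=0.25):
    """Eq. (4): negative Bernoulli log-likelihood plus beta-weighted KL term."""
    # x: float tensor of zeros and ones; x_logits: raw decoder outputs.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
    kld = 0.5 * torch.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * torch.log(sigma))
    return recon + beta * kld

def reparameterize(mu, sigma):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * torch.randn_like(sigma)
```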

6.2 Latent space regularization

The deep generative model described in Section 6.1, despite having proven able to produce coherent long-term musical sequences, is unaware of the perceptual aspects of target rhythms. To incorporate this information into our latent vector model, similarly to [36, 40, 56], we propose a multi-objective learning approach. Namely, we force the base model to jointly learn the rhythm complexity of input patterns while minimizing the classic \(\beta\)-VAE loss function given in (4). Our goal is to regularize the latent space in a way that would allow for continuous navigation and semantic exploration of the learned model. This is achieved by including the following auxiliary loss function

$$\begin{aligned} \mathcal {L}_\mathrm {reg} = \text {MSE}(f_\mathrm {p}(\mathbf {g}),\, z_i), \end{aligned}$$
(5)

where \(f_\mathrm {p}(\mathbf {g})\) is a polyphonic rhythm complexity measure such as the one described in Section 3, and \(z_i\in \mathbb {R}\) is the ith element of the latent code \(\mathbf {z}\). This way, we are effectively constraining the ith latent space dimension to become topologically structured according to the behavior of the target perceptual measure. Hence, sampling latent codes along this dimension allows for the explicit manipulation of the complexity of the output patterns in a way close to human understanding.

Since the latent vectors are encouraged to follow a multivariate standard normal distribution \(\mathcal {N}(\mathbf {0}, \mathbf {I})\) by the KLD term in (4), we standardize the complexity values of training data by subtracting the sample mean and dividing by the standard deviation. This yields the zero-mean and unit-variance complexity distribution shown in Fig. 4, which is compatible with being encoded into the ith univariate component of \(\mathbf {z}\). Finally, we use the training data statistics thus obtained to apply the same standardization at inference time.
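A minimal sketch of the resulting regularization term, assuming pre-computed training-set statistics and batched tensors, could read as follows.

```python
import torch

def attribute_regularization(z, complexity, c_mean, c_std, i=0):
    """Eq. (5): MSE between the standardized complexity and the i-th latent code."""
    target = (complexity - c_mean) / c_std   # standardize with training statistics
    return torch.mean((z[:, i] - target) ** 2)
```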

Fig. 4 Histogram of the proposed rhythm complexity measure evaluated on all 2-bar training patterns in GMD. Complexity values are standardized to obtain a distribution with zero mean and unit variance. For comparison, the probability density function of a standard normal distribution \(\mathcal {N}(0, 1)\) is overlaid in red

6.3 Latent space disentanglement

Having regularized \(z_i\) as described in the previous section, there is as yet no guarantee that information regarding rhythm complexity has not also been encoded into the other latent space dimensions. In fact, rhythm complexity is typically measured by gauging onset locations, which ultimately carry most of the same information the base model is trying to encode in the \(H\) latent dimensions. In particular, we would like the remaining \(H-1\) dimensions of the latent space to be relatively invariant with respect to changes in input complexity. Indeed, explicit and interpretable control over the desired output behavior becomes infeasible when multiple latent variables are redundant and affect the same aspects of the overall rhythm complexity model. In the context of feature learning for generative applications, such a desirable property is often referred to as latent space disentanglement [57].

Notably, \(\beta\)-VAE was originally introduced to favor disentanglement [55]. However, this is mainly achieved for large values of \(\beta\), as later observed in [58]. Therefore, inspired by prior work on predictability minimization [59, 60] and attribute manipulation by means of sliding faders [37, 38, 40], we propose to augment the base model with an auxiliary adversarial loss term promoting latent space disentanglement. Let \(\mathbf {z}_\star \in \mathbb {R}^{H-1}\) be the vector of all remaining latent variables in \(\mathbf {z}\) except for \(z_i\). We define an adversarial regressor \(\hat{f}_\mathrm {p}(\mathbf {z}_\star )\) tasked with estimating the input rhythm complexity \(f_\mathrm {p}(\mathbf {g})\) by minimizing the following loss function

$$\begin{aligned} \mathcal {L}_\mathrm {adv} = \text {MSE}( f_\mathrm {p}(\mathbf {g}),\, \hat{f}_{\mathrm {p}}(\mathbf {z}_\star )). \end{aligned}$$
(6)

In order to reduce the amount of information regarding \(f_\mathrm {p}(\mathbf {g})\) that is embedded into \(\mathbf {z}_\star\), we connect the encoder and the regressor via a gradient reversal layer (GRL) [60] that flips the sign of the gradients during backpropagation. Being trained adversarially against the regressor, the encoder thus learns a latent representation \(\mathbf {z}_\star\) that is minimally sensitive to the input complexity.

In this study, we implement the adversarial regressor as a feed-forward neural network with two hidden layers with 128 units followed by ReLU activations and a linear output layer yielding a scalar value. The block diagram of the complete model is depicted in Fig. 5.
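A possible PyTorch realization of the GRL and of the adversarial regressor is sketched below; the class names and the way the regularized dimension is dropped from the latent code are our own choices rather than the reference implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialRegressor(nn.Module):
    """Estimates the input complexity from the non-regularized latent codes."""
    def __init__(self, latent=256, hidden=128, i=0):
        super().__init__()
        self.i = i
        self.net = nn.Sequential(
            nn.Linear(latent - 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, z):
        # Drop the regularized dimension z_i, reverse gradients, then regress.
        z_star = torch.cat([z[:, :self.i], z[:, self.i + 1:]], dim=1)
        return self.net(GradReverse.apply(z_star)).squeeze(-1)
```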

Fig. 5 Proposed latent vector model for attribute-controlled drum pattern generation

6.4 Model training

The proposed latent vector model is optimized for 300 epochs using Adam [61], a batch size of 128, and the following compound objective function

$$\begin{aligned} \mathcal {L} := \mathcal {L}_{\mathrm {VAE}} + \alpha \,\mathcal {L}_{\mathrm {reg}} + \gamma \,\mathcal {L}_{\mathrm {adv}}, \end{aligned}$$
(7)

where \(\alpha\) and \(\gamma\) are scalar weights for the attribute-regularization and adversarial terms, respectively.

The learning rate is set to \(10^{-3}\) and exponentially decreased to \(10^{-5}\) with a decay rate of 0.99. We set the regularization weight \(\alpha =1\) for the entire training. Conversely, \(\beta\) and \(\gamma\) are annealed during early training to let the model first focus on pattern reconstruction and only later on structuring the latent representation. Namely, we set \(\beta =10^{-4}\) and \(\gamma =10^{-6}\) for the first 40 epochs. Throughout the following 250 epochs, \(\beta\) is linearly increased up to 0.25 with a step of \(10^{-3}\) per epoch, and \(\gamma\) is increased up to 0.05 with a step of \(2\cdot 10^{-4}\). Furthermore, we randomly apply teacher forcing on the recurrent decoder with a probability of 50%.
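For reference, the annealing schedule and the compound objective of (7) can be sketched as follows; the values are taken from the text, while the training-loop variables in the trailing comment (and the helper functions they call, introduced in earlier sketches) are hypothetical.

```python
def annealed_weights(epoch):
    """Linear warm-up of beta and gamma after an initial 40-epoch burn-in."""
    if epoch < 40:
        return 1e-4, 1e-6                      # (beta, gamma)
    steps = epoch - 40
    beta = min(1e-4 + 1e-3 * steps, 0.25)      # up to 0.25, step 1e-3 per epoch
    gamma = min(1e-6 + 2e-4 * steps, 0.05)     # up to 0.05, step 2e-4 per epoch
    return beta, gamma

# Compound objective of Eq. (7) inside the training loop (alpha = 1):
# loss = beta_vae_loss(x, x_logits, mu, sigma, beta) \
#        + 1.0 * attribute_regularization(z, c, c_mean, c_std) \
#        + gamma * torch.mean((c - regressor(z)) ** 2)
```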

7 Performance evaluation

In this section, we evaluate the proposed latent vector model on several attribute-controlled generation tasks. In Section 7.1, we discuss the effects of latent space regularization. In Section 7.2, we show that the proposed adversarial component is effectively disentangling the latent representation. In Section 7.3, we investigate the capability of the proposed method to alter the rhythm complexity of input patterns in a controlled way. Finally, in Section 7.4, we test the model on the task of attribute-controlled generation from randomly sampled latent vectors.

7.1 Latent space regularization

In Fig. 6, we depict the latent vectors obtained by encoding GMD test data. For the sake of visualization, we only plot two latent dimensions, i.e., \(z_i\) and \(z_l\). In this example, we regularize the first dimension, i.e., \(i=0\), and choose \(l=127\). The color assigned to each point represents the rhythm complexity measured on the respective input patterns using (3): brighter colors correspond to higher complexity. The clearly noticeable color gradient indicates that the complexity values have been monotonically encoded along the regularized dimension. Furthermore, the latent complexity distribution appears to be continuously navigable from low to high values by traversing the latent space toward the positive direction of \(z_0\).

Fig. 6 Latent rhythm complexity distribution of input drum patterns

In Fig. 7, instead, the coloring is determined according to the rhythm complexity measured on the output patterns generated by the decoder. While the variational decoding process appears to affect the output measures, we may notice that the overall complexity distribution retains a high degree of agreement with that of the input data.

Fig. 7 Latent rhythm complexity distribution of generated drum patterns

7.2 Latent space disentanglement

The adversarial component introduced in Section 6.3 is meant to penalize any leakage of information regarding rhythm complexity into the non-regularized latent space dimensions. To assess the effectiveness of the proposed method, we conduct a simple ablation study. We train two generative models: the first is implemented as described in Section 6; the second follows the same specifications except for the exclusion of the adversarial term from the loss function in (7).

Quantifying latent space entanglement is a challenging task, as unwanted redundancy and the intertwining of latent variables might not follow a simple and easy-to-identify behavior. Therefore, similarly to the approach proposed in [62], we measure latent space entanglement via nonlinear regression. We define a nonlinear regressor \(r(\cdot )\) meant to estimate the measured complexity \(f_\mathrm {p}(\mathbf {g})\) from the non-regularized latent code portion \(\mathbf {z}_\star\). These codes are obtained by passing GMD data through the two pre-trained generative models under consideration. For the sake of simplicity, let us denote with \(\mathbf {z}_\star\) the partial codes from the proposed model and with \(\tilde{\mathbf {z}}_\star\) the ones from the baseline without the adversarial term.

We implement each regressor as a two-layer feed-forward neural network with 128 units and ReLU activations. The two networks are optimized using \(\mathbf {z}_\star\) and \(\tilde{\mathbf {z}}_\star\) extracted from the training data and are then evaluated on the test fold of GMD. We argue that a lower regression performance corresponds to a more disentangled latent representation.

The regressor \(r(\tilde{\mathbf {z}}_\star )\) achieves a coefficient of determination \(R^2=0.5\), quantifying the proportion of the variation in test data complexity that is predictable from the independent variables \(\tilde{\mathbf {z}}_\star\). This clearly suggests that, without the proposed adversarial component, a non-negligible amount of information regarding rhythm complexity leaks into the non-regularized latent space and can thus be predicted. Conversely, the coefficient of determination of \(r(\mathbf {z}_\star )\) drops to \(R^2=0.1\) when including the adversarial loss term, revealing that the latent space has been effectively disentangled. Ultimately, this enables intuitive control over the output complexity, which can now be altered in a fader-like fashion [38] simply by varying the scalar value of \(z_i\).
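The probing procedure can be sketched as follows; scikit-learn is used here purely for brevity (a convenience assumption), and any two-layer ReLU regressor trained with a comparable setup would do.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

def entanglement_r2(z_star_train, c_train, z_star_test, c_test):
    """Fit a small MLP on the non-regularized codes and report the test R^2."""
    probe = MLPRegressor(hidden_layer_sizes=(128, 128), activation='relu',
                         max_iter=1000, random_state=0)
    probe.fit(z_star_train, c_train)
    return r2_score(c_test, probe.predict(z_star_test))
```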

7.3 Rhythm complexity manipulation

In this section, we demonstrate how the proposed model allows for a fine-grained manipulation of target attributes of the generated samples. In particular, we encode each rhythm \(\mathbf {x}\) in the test fold of GMD and extract the corresponding latent vectors \(\mathbf {z}\). Then, we fix \(\mathbf {z}_\star\) and let \(z_i\) vary according to \(z_i + j \Delta z\), where \(j\in \mathbb {Z}\) and \(\Delta z=0.5\). For each new latent code obtained this way, we task the decoder to generate the corresponding pattern. Hence, we compare the complexity of the unaltered output with that of the newly generated patterns. Figure 8 shows the violin plot for \(j\in [-5,5]\), depicting for each shift \(j\Delta z\) the distribution of the resulting changes in the complexity of the decoder output for all samples in the test set. Remarkably, we obtain a Pearson correlation coefficient of 0.90 between the desired and resulting complexity increments.
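The traversal experiment can be sketched as follows; the encoder and decoder call signatures, as well as the binarize() helper (which we assume thresholds the decoder probabilities at 0.5), are hypothetical, and polyphonic_complexity refers to the measure sketched in Section 3.

```python
import numpy as np

def traverse_complexity(encoder, decoder, x, i=0, delta=0.5, j_range=range(-5, 6)):
    """Shift the regularized latent dimension and track the output complexity change."""
    mu, _ = encoder(x)                      # use the posterior mean as reference code
    # binarize(): hypothetical helper thresholding decoder probabilities at 0.5.
    ref = polyphonic_complexity(binarize(decoder(mu)))
    deltas = []
    for j in j_range:
        z = mu.clone()
        z[:, i] += j * delta                # move only along the regularized axis
        deltas.append(polyphonic_complexity(binarize(decoder(z))) - ref)
    return np.array(deltas)
```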

Fig. 8 Violin plot of the measured output complexity increments as a function of \(j\Delta z\)

By keeping \(\mathbf {z}_\star\) fixed throughout the experiment, we expect the generated rhythms to remain as similar as possible to the original one. However, the more the target complexity is altered, the greater the deviation from the original pattern. To support these claims, we compute the average Hamming distance \(\mathcal {H}(s_0, s_j)\) between the unaltered output pattern (\(j=0\)) and the ones generated with the desired complexity increment \(j\Delta z\). Namely, we convert each drum voice into a string of ones and zeros and measure the number of single-character edits needed to change one pattern into the other. Arguably, a higher Hamming distance indicates a more significant modification of the original output pattern. In Fig. 9, we show \(\mathcal {H}(s_0, s_j)\) as a function of \(j\Delta z\) with \(\Delta z = 0.1\). Notably, the average distance monotonically increases as the target complexity increment moves away from zero. In fact, the patterns with the least amount of complexity manipulation appear to be the most similar to the reference rhythm, with an average of approximately 7.4 edits per sample. Conversely, the maximum distance is achieved for \(j\Delta z=2.5\), where we observe an average of 21.3 edits per sample.
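For binary patterns of equal size, this distance reduces to counting mismatching cells, as in the following minimal sketch.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of pulse-level edits needed to turn binary pattern a into pattern b."""
    return int(np.sum(a.astype(bool) != b.astype(bool)))
```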

Fig. 9 Average Hamming distance as a function of \(j\Delta z\). As the target complexity varies, attribute-controlled rhythms show a monotonically increasing degree of dissimilarity with respect to the unaltered output pattern

7.4 Attribute-controlled generation

Finally, we evaluate the proposed latent vector model in a purely generative mode. We sample 1000 random latent codes from \(\mathcal {N}(\mathbf {0}, \mathbf {I})\) and task the decoder to autonomously produce new patterns. We let \(z_0\) vary from \(-2.576\) to \(2.576\), thus accounting for the complexity of 99% of the samples in the GMD test fold. Hence, we compute the correlation between \(z_0\) and the complexity of the newly generated patterns. Figure 10 shows a clear linear relationship between desired and output rhythm complexity. Despite a Pearson correlation coefficient of 0.9163, however, we notice the tendency of the system to reduce the output complexity with respect to the target \(z_0\) value. This, in turn, is confirmed by the slope of the best-fit linear regression model \(y \approx 0.78\,z_0 - 0.18\) being less than one. Moreover, this trend is accompanied by an increase in the output complexity variance as \(z_0\) increases. These phenomena might be explained by considering that the training fold of GMD consists of data from spontaneous drumming performances and offers a limited representation of high-complexity patterns. As a result, the decoder may have been biased toward generating more conventional lower-complexity rhythms. Nevertheless, we argue that, in a practical application, this effect may be straightforwardly compensated for by incorporating a suitable multiplicative factor into \(z_0\), thus reestablishing an identity-like mapping between desired and measured complexity.

Fig. 10 Correlation between \(z_0\) and the rhythm complexity of drum patterns generated from randomly-sampled latent vectors

8 Conclusions and future work

In this article, we presented a novel latent rhythm complexity model designed for polyphonic drum patterns in the style of contemporary Western music. The proposed framework is based on a multi-objective learning paradigm in which variational autoencoding is supplemented by two additional loss terms, one for latent space regularization and the other targeting disentangled representations. In particular, the model is simultaneously tasked with predicting and embedding the value of a given musical attribute along one of its latent dimensions. This way, the ensuing latent space is encouraged to become semantically structured according to the target high-level feature, thus enabling straightforward interpretation and intuitive navigation. Moreover, we showed that decoding the latent representations thus obtained grants explicit control over the complexity of newly generated drum patterns. To achieve this, we introduced a new polyphonic rhythm complexity measure. To the best of our knowledge, the present work constitutes the first attempt at defining a proper complexity measure for polyphonic rhythmic patterns. The proposed measure was validated through a perceptual experiment which showed a high degree of correlation between measured complexity and that assessed by human listeners, as indicated by a Pearson coefficient above 0.95. Our method, being based on the linear combination of state-of-the-art monophonic measures applied to groups of functionally related drum voices, allows for great flexibility when it comes to measuring and weighting the contribution of individual (groups of) voices and may serve as a starting point for future research.

Endowing machines with an explicit understanding of perceptual features of music has the potential to enrich the capabilities of many AI-driven creative applications, including assisted music composition and attribute-controlled music generation. Besides, our work proves that regularizing a latent vector model according to target perceptual attributes may structure the resulting latent representations in a humanly interpretable way. Therefore, this approach might readily complement those applications involving the semantic exploration of musical content, such as music database navigation, recommender systems, and playlist generation.

Future work entails a large-scale survey to further validate the promising results presented in this article. This way, it would also be possible to determine the optimal complexity measure for each group of voices and derive a set of perceptually informed parameters for the proposed method. Moreover, examining the interplay between different yet related voices is not a concept solely pertaining to drums. In fact, we envision an adaptation of the proposed method to encompass, e.g., string quartets or four-part harmony chorales in which distinct voices are clearly identifiable and yet cannot be fully modeled independently of the others. Finally, building upon the existing work on the perception of monophonic rhythm complexity, the present study focuses on fixed-length quantized binary patterns. This means that only onset locations are considered, whereas dynamics, accents, time signature, and temporal deviations smaller than a tatum are disregarded. Although this choice is motivated by a divide-and-conquer modeling approach that regards these aspects of rhythm to be (at least partially) independent of each other, the validity of these assumptions is yet to be proved. In fact, one may argue that the temporal distribution of agogic and dynamic accents is likely to affect the complexity of a rhythmic pattern beyond the simple location of its onsets. Similarly, ditching quantization in favor of a continuous-time representation would fundamentally change the definition of syncopation, which could in turn entail a range of different psychoacoustic effects. Ultimately, these compelling questions remain open and must become the foundation of future research on the perception of rhythm.