1 Introduction

To better understand how artificial intelligence organizes information according to emergent concepts similar to those used by humans (as opposed to simply memorizing answers thanks to sheer processing power), we perform several tests on a neural network (NN) designed to reproduce (and interpolate between) music tracks.

More specifically, our study proceeds in two steps:

  1. Consider a pre-trained NN as a black box, open it and see if we can extract patterns from the way it has encoded information, so that we can begin to make some basic sense of its inner workings.

  2. If we find such patterns, ask whether this organization can be compared with pre-defined quantities used by humans to describe emergent concepts.

Other authors have shown how to take specific NNs and extract from them variables that are meaningful to humans for various types of data, showing for instance which neurons fire specifically when a CNN is shown images of ski resorts [1], asking whether some neurons’ outputs encapsulate concepts such as conserved quantities in mechanics problems [2], finding the number of independent variables in a physical system [3], or quantifying the type and degree of symmetry in paintings [4].

To focus on a different type of data, we choose the following NN as a testing ground: Google Magenta’s MusicVAE [5], a Variational Auto-Encoder (VAE) [6] which uses a 512-dimensional latent space to represent a few bars of music. We remind the reader that a VAE is a deep NN trained to produce outputs that closely match the input data. For this, it uses layers adapted to the data type at hand: dense layers for basic cases, convolution layers for images, recurrent layers for sequences such as music. Other examples of exploring the latent space of a VAE trained on music can be found in Refs. [7,8,9,10,11,12,13].

If one were to use only the reconstruction error as the loss function, the NN could simply learn to memorize each individual input. To avoid this, the VAE includes two related modifications: one of the intermediate layers (called the latent space) is made stochastic, and the loss function is modified to balance the effects of that stochastic layer’s probability distribution against the reconstruction error.

At inference time, data are fed into the VAE from the “encoder” side (the part of the NN that is upstream of the latent space, i.e., to the left in Fig. 1). Each individual data input produces a probability distribution defined as a multivariate Gaussian, with the means and variances of each of the 512 dimensions given by 1024 neurons in this layer, see Fig. 1. The 512 dimensions are called latent dimensions.

Fig. 1

A schematic depiction of the task performed by MusicVAE: a piece of music is encoded into a (multivariate Gaussian) distribution in a 512-dimensional latent space. That distribution can then be sampled from and decoded back to produce a similar “avatar” piece of music. In practice, the model we use focuses on monophonic melodies and uses recurrent networks in both its encoder and decoder

To “decode” this latent encoding back into the original data space (and hopefully produce a close match to the original input), a sample is taken from the random distribution for this particular input, and inference is run through the second (decoder) part of the NN (the one downstream from the latent space, i.e., to the right in Fig. 1). For the stochastic property of the layer to play its role and spread each input’s encoding, thereby ensuring continuity and avoiding memorization, the VAE is encouraged to keep each data point’s encoded probability distribution close to a common prior—the identity multivariate Gaussian in the present case. This is achieved by redefining the full loss function as the sum of the reconstruction error and a measure of the distance (namely, the Kullback–Leibler divergence) between the latent distribution encoding a given input and the target prior Footnote 1.
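To make this balance concrete, the sketch below spells out the two terms for a diagonal Gaussian posterior and a standard normal prior. It is a minimal numerical illustration, not MusicVAE’s actual TensorFlow implementation; the weight “beta” and the placeholder “reconstruction_error” are generic stand-ins of our own.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL divergence between the diagonal Gaussian N(mu, diag(sigma^2))
    encoding one input and the standard normal prior N(0, I),
    summed over the 512 latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

def vae_loss(reconstruction_error, mu, sigma, beta=1.0):
    """Full loss = reconstruction term + (weighted) KL term, the trade-off
    discussed in the text; beta is a generic balancing hyperparameter."""
    return reconstruction_error + beta * kl_to_standard_normal(mu, sigma)

def sample_latent(mu, sigma, rng=None):
    """Reparameterized sample from the encoded distribution, used before decoding."""
    if rng is None:
        rng = np.random.default_rng()
    return mu + sigma * rng.standard_normal(mu.shape)
```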

The specific neurons we are interested in analyzing are the 512 that define the latent space distribution’s central value for each music input, and which we call latent variables: our aim is to study what musical information they represent. For most of this paper, we are not interested in the 512 neurons that produce the variances, except insofar as they help us learn about the neurons encoding the 512 central values.

In practice, we ask:

  1. Can the latent dimensions be ordered by relevance/importance?

  2. How many latent dimensions are really necessary for the VAE to perform well, i.e., can some latent dimensions be classified as irrelevant?

  3. Can some of the relevant latent dimensions be singled out as particularly important?

  4. Do these most relevant latent dimensions correspond to concepts of musicality as understood by humans?

Minimizing the size of latent spaces, and the number of units in an NN in general, may be useful for designing nimbler models that are faster to train. Yet the implications of this work go beyond practical simplifications and beyond applications to music. They should be understood in the broader context of studies showing that some neurons in an NN encode information that is highly correlated with concepts developed by humans to describe the same data, be it music, painting, physical laws or images; such results help us better understand how NNs (and humans) might learn and encode information. We believe that the ability to extract patterns of organization within NNs, and to pinpoint neurons that encode specific information, provides another step toward Explainable AI. This could also have practical implications for ethics, for instance, as pruning specific neurons could help remove unwanted biases.

In Sect. 2, we illustrate how MusicVAE works on the first 2 bars of the melody “Twinkle, twinkle, little star”. This allows us to introduce the structure of the latent space, hinting at a division of latent dimensions into two sets: relevant and irrelevant.

In Sect. 3, we take a large sample of musical tunes to show that this ordering of latent dimensions according to relevance carries through for music tracks beyond “Twinkle, twinkle, little star”.

In Sect. 4, we show which latent dimensions encode the information of the human-defined quantities of rhythm and pitch.

In Sect. 5, we illustrate how sequences of random notes are encoded in the latent space and use this to test our assumptions about the difference between relevant dimensions and irrelevant dimensions.

In Sect. 6, we show how the analysis carries over to chunks of 16 bars of music and discuss the concept of melody.

2 “Twinkle, twinkle, little star” and MusicVAE

To introduce some basic ideas, we focus as an example on the melody “Twinkle, twinkle, little star”, starting from its sheet music, or rather its piano-roll representation, shown at the top of Fig. 2, where the x-axis represents time and the y-axis encodes frequency on the logarithmic scale of MIDI notes (equivalent to numbering the corresponding keys on a piano keyboard from left to right, including black keys).

As for any input, “Twinkle, twinkle” gets encoded into a vector \(\mu _{[1,\ldots ,512]}^{\text{twinkle}}\) of 512 central values, and a vector \(\sigma _{[1,\ldots ,512]}^{\text{twinkle}}\) of 512 standard deviations, defining a 512-dimensional Gaussian distribution in latent space.

Sampling from this 512-dimensional distribution and passing it through the decoder part of MusicVAE yields back another 2-bar note sequence that is similar but not necessarily identical to the original track, such as the one in the lower plot in Fig. 2.
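For readers who want to reproduce this encode/sample/decode round trip, here is a rough sketch using Magenta’s Python interface. It assumes the published “cat-mel_2bar_big” checkpoint and the TrainedModel class as we recall them; the file paths are placeholders and method signatures may differ between Magenta versions, so treat this as a starting point rather than verbatim code from the paper.

```python
# Sketch only: paths are placeholders, and the Magenta API may vary by version.
import note_seq
from magenta.models.music_vae import configs
from magenta.models.music_vae.trained_model import TrainedModel

model = TrainedModel(
    configs.CONFIG_MAP['cat-mel_2bar_big'],                    # 2-bar monophonic melody model
    batch_size=1,
    checkpoint_dir_or_path='/path/to/cat-mel_2bar_big.ckpt')   # placeholder path

twinkle = note_seq.midi_file_to_note_sequence('twinkle_2bars.mid')  # placeholder file

# encode() returns a sampled z together with the 512 means and standard deviations
z, mu, sigma = model.encode([twinkle])

# Decoding the central values should reproduce the input closely;
# decoding the sample z yields a similar but not identical "avatar" melody.
reconstruction = model.decode(mu, length=32)[0]
variation = model.decode(z, length=32)[0]
```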

Fig. 2

Top: The first two bars of the melody for “Twinkle, twinkle, little star” with note frequency encoded as MIDI note, i.e., the number of the corresponding key on the piano, counted from left to right. Bottom: The result of decoding a random sample from the encoded distribution for “Twinkle, twinkle”

Given the stochastic nature of the VAE, the output may change between two inferences on the same input. Figure 3 illustrates this in the case of our “Twinkle, twinkle” example. The top plot depicts the standard deviations \(\sigma _{[1,\ldots ,512]}^{\text{twinkle}}\) of the multivariate Gaussian distribution for the first 100 of the 512 latent dimensions, sorted from smallest \(\sigma \) to largest. The middle plot shows the first 100 central values \(\mu _{[1,\ldots ,512]}^\text{twinkle}\) in the same order.

Fig. 3

Latent encoding for the note sequence in Fig. 2. We only show the 100 dimensions with the smallest variance, in order. Standard deviations are at the top, means in the middle, and a random sample from the distribution at the bottom. The vertical red dashed line hints at a possible split between relevant latent dimensions (to the left) and irrelevant latent dimensions (to the right)

Together, these lists of 512 \(\mu \)’s and 512 \(\sigma \)’s entirely specify a multivariate Gaussian distribution from which we can sample and decode to obtain variations on the original input (“Twinkle, twinkle”): the bottom plot in Fig. 3 represents a sample drawn from this distribution. Once decoded, this sample yields the melody at the bottom of Fig. 2. On the other hand, decoding the central values themselves (i.e., the values in the middle plot of Fig. 3) yields back the original version of “Twinkle, twinkle, little star”, i.e., the piano-roll at the top of Fig. 2 Footnote 2.

From the top plot of Fig. 3, we can see that the encoding of “Twinkle, twinkle” is very precisely specified (small values of \(\sigma \)) along the first few dimensions in this ordering, indicating that MusicVAE uses these dimensions to encode a lot of important information about this particular piece of music, whereas the dimensions with large spreads could be approximated by standard Gaussians with \(\sigma \approx 1\) and \(\mu \approx 0\), as can be seen in the top two plots of Fig. 3. What remains to be seen is whether this is also the case for other music tracks beyond “Twinkle, twinkle”, and whether the latent dimensions remain in the same order of relevance from song to song.

3 The structure of MusicVAE’s latent space

For the application described here, we used MusicVAE’s 2-bar model, which considers monophonic sequences of notes quantized down to 16th notes: this yields 32 discrete time stamps at which a note can start. As for the pitches, MIDI files specify \(2^7 = 128\) discrete MIDI notes, i.e., slightly more than the 88 keys on a piano, to which we need to add the possibility of starting a silence and the possibility of holding a frequency for the next interval, i.e., 130 discrete possibilities. With this quantization, there are about \(4 \times 10^{67}\) possible 2-bar note sequences, whereas a latent space of 512 dimensions—even if these dimensions were perceptrons (i.e., binary neurons)—can describe over \(2^{512} \approx 10^{154}\) possibilities and is thus over-dimensioned for describing all possible note sequences, let alone all the ones that can be considered music to a human ear. The subspace of note sequences that can be called “music” is even smaller.
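These orders of magnitude are easy to verify; the short computation below reproduces the counting of the previous paragraph.

```python
from math import log10

# 130 token choices (128 MIDI pitches, plus "start a silence" and "hold") at each
# of the 32 sixteenth-note slots in 2 bars:
n_sequences = 130 ** 32
print(f"possible 2-bar sequences ~ 10^{log10(n_sequences):.1f}")        # ~ 10^67.6

# Even 512 binary units would already index far more states than that:
n_binary_states = 2 ** 512
print(f"states of 512 binary units ~ 10^{log10(n_binary_states):.1f}")  # ~ 10^154.1
```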

The question is: how many latent dimensions do we really need in order to describe music?

To begin answering this question, we can ask whether the pattern observed in Fig. 3 is true for other pieces of music, i.e., if one latent dimension is specified with great precision (small spread) for one music piece, will that also be true for other tracks? The answer is a resounding “YES”.

To establish this, we pick at random from the 100,000 tracks in the Lakh MIDI Dataset [15]. We use about 10,000 tracks, from which the standard algorithm in MusicVAE extracts 5 melodies each, yielding 50,000 melodies. We then encode these melodies and study their encodings in latent space Footnote 3.

While we could order the latent dimensions for a given music track as in Fig. 3, this may not be ideal, as each music track may produce a slightly different ordering of the \(\sigma \) values. We could instead order the latent dimensions according to the width of each latent dimension averaged over music tracks. Yet, it is logical to think that what matters is not simply whether the width is large on average, but whether it is larger than the typical central value for that same latent dimension.

For instance, latent dimensions which typically have \(\mu \approx 0\) and \(\sigma \approx 1\) are dimensions for which the training did not require a trade-off between the reconstruction loss and the KL divergence, and so the training produced a situation where this dimension essentially remains close to the default values of \(\mu = 0\) and \(\sigma = 1\) as enforced by the KL divergence with a normal prior: for these specific dimensions, the posterior is equal to the prior, i.e., it has collapsed [16]. Such latent dimensions play little role in the reconstruction or generation of data, and will therefore be called “irrelevant” and can be identified by \(|\mu |\ll \sigma \).

On the other hand, latent dimensions which encode relevant information will have required a trade-off between the reconstruction term and the KL term in the loss during training and, therefore, will typically produce \(\sigma < 1\) as well as nonzero values of \(\mu \ne 0\) for many music tracks (though not necessarily for all).

In summary, a latent dimension which encodes a song with \(|\mu |\gtrsim \sigma \) is one that needs to be very precisely specified in order to reproduce that music track, i.e., it has minimized the reconstruction error at the cost of the KL term. If this is true for most music tracks, we will label this latent dimension as “relevant”.

In fact, looking across a large number of music tracks, we can compare the variability of the central value across tracks to the average of the widths, and thus order the latent dimensions k according to the value of:

$$\begin{aligned} \text {relevance}_k = \frac{\text {std}(\mu _{k})}{\text {mean}(\sigma _{k})} . \end{aligned}$$
(1)

Selecting latent dimensions with relevance above a given cut-off should allow us to separate relevant from irrelevant latent dimensions, but this first requires setting the cut-off. From Fig. 4, we can see one latent dimension with particularly high relevance (\(> 15\)), then 36 more dimensions with relevance larger than 0.5, then one latent dimension with relevance around 0.5, and finally a group of hundreds of latent dimensions with relevance below 0.5. Taking a cut-off of about 0.5 in relevance thus separates the latent dimensions into two groups: those with high relevance (above 0.5) and those with lower relevance (below 0.5). In this Figure, we can also see that the separation between relevant and irrelevant is well-defined as soon as we have about 100 encodings to average over, while the exact ordering of the 37 relevant dimensions is quite robust once we have gathered over 1000 encodings.
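In code, the estimator of Eq. (1) and the resulting split are straightforward. The sketch below assumes the encoder outputs for a sample of melodies have been stacked into arrays "mus" and "sigmas" of shape (n_melodies, 512); the variable names are ours, and the 0.5 cut-off is the one chosen in this section.

```python
import numpy as np

def relevance(mus, sigmas):
    """Eq. (1): spread of the central values across melodies divided by the
    average width, computed independently for each latent dimension.
    mus, sigmas: arrays of shape (n_melodies, 512)."""
    return mus.std(axis=0) / sigmas.mean(axis=0)

# Hypothetical usage with the cut-off of 0.5 chosen from Fig. 4:
# rel = relevance(mus, sigmas)
# order = np.argsort(rel)[::-1]               # dimensions sorted by decreasing relevance
# relevant_dims = np.flatnonzero(rel > 0.5)   # 37 dimensions for the 2-bar model
# irrelevant_dims = np.flatnonzero(rel <= 0.5)
```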

Fig. 4

Latent dimension relevance calculated by applying the statistical estimator defined in Eq. (1) on samples of 10 to 10,000 melodies. The chosen cut-off of 0.5 is depicted as a horizontal dashed black line

As a consequence of our definition of relevance, one can expect that ablating an irrelevant dimension (setting it to zero, which is close to its typical central value) will have very little effect on the accuracy of the music reproduction, whereas setting a relevant dimension to zero will drastically impact accuracy.

We can see in Fig. 5 that ablating all 475 irrelevant latent dimensions has very little effect, whereas ablating one of the first two relevant latent dimensions noticeably decreases the accuracy. Here we use a “realistic” metric of accuracy, an intersection over union of non-trivial tokens: we count how many non-trivial tokens are common between input and output, divided by the number of instants at which either the input or the output contains a non-trivial token Footnote 4.
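Footnote 4 gives the exact definition used for this metric; the sketch below is our reading of it, assuming each 2-bar sequence is represented as a length-32 token sequence (one token per sixteenth note) in which note-on events are the non-trivial tokens, as opposed to rest and hold tokens.

```python
def iou_accuracy(input_tokens, output_tokens, trivial=('rest', 'hold')):
    """Intersection over union of non-trivial tokens: matching note-on tokens,
    divided by the number of time steps where either sequence has a note-on."""
    intersection, union = 0, 0
    for a, b in zip(input_tokens, output_tokens):
        a_on, b_on = a not in trivial, b not in trivial
        if a_on or b_on:
            union += 1
            if a_on and b_on and a == b:
                intersection += 1
    return intersection / union if union else 1.0
```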

Fig. 5

Effect of ablating different latent dimensions on a realistic measure of accuracy (the intersection over union of non-trivial note onset tokens)

We can check this by looking at how much the central values of each melody vary along these same dimensions. In the lower plot of Fig. 6, we depict the central values for a random set of tracks in our database, using the same 100 latent coordinates as the top plot, which depicts the song-to-song variability of the width of each latent dimension. The first thing to notice is that, for all dimensions with a wide spread (the ones with \(\sigma \approx 1\) in the upper plot of the same Figure, i.e., to the right of the vertical dashed red line), the central value is always close to 0: all note sequences are encoded with the same unit Gaussian along these dimensions. Hence, we have hundreds of irrelevant dimensions in latent space that carry very little musical information: more specifically, 475 such dimensions (give or take 1, depending on the definition of the cut-off from Fig. 4).

Fig. 6

Boxplot of standard deviations (top) and central values (bottom) for the 2-bar model run on 2000 random tracks from our dataset

On the other hand, for the relevant dimensions, i.e., the 37 latent dimensions to the left of the vertical dashed red line in Fig. 6, which have narrow distributions, we find that the central value fluctuates from track to track: these are the dimensions that contain the actual information about music. Not only are the central values of the relevant dimensions well-specified for each song, but they also vary a lot from song to song.

For our dataset, the correlation matrix in latent space is shown in Fig. 7: the 37 relevant latent dimensions we identified in Fig. 6 are quite uncorrelated. This is the subspace we will be most interested in for the remainder of this paper Footnote 5.

Fig. 7

Pearson correlations of central values for the first 100 latent dimensions, between melodies extracted from a random sample of real music tracks

4 Latent variables for pitch and rhythm

In this Section, we attempt to disentangle how MusicVAE stores the information about music’s most fundamental qualities: rhythm and pitch.

For the present study, we do not need to delve into technical details of the definitions of “rhythm”, “pitch” or “melody”: suffice it to say that pitch is related to the notes’ basic frequencies (or alternatively, their location on the keyboard, with high-pitched notes to the right and lower frequencies to the left), independently of the moment/time they are played or of their relation to each other. On the other hand, rhythm is related to the temporal structure of music, i.e., the duration and arrangement of the succession of notes in time, independently of their pitch/frequency (i.e., independently of which key on the keyboard they correspond to). Finally, melody refers to relations or comparisons between the pitches of various notes played at different times in the sequence.

What the reader needs to know beyond this basic distinction is that musicologists have constructed dozens of variables to quantify various qualities of music, called “music features” Footnote 6: for monophonic music (no more than a single note played at any given time), these can be categorized under one of the three umbrellas of rhythm, pitch and melody. This implies that rhythm, pitch and melody are not usually thought of as uni-dimensional quantities by a human listener Footnote 7.

We use the Python music21 library Footnote 8 to extract human-defined music features. To compute correlations that capture nonlinear dependencies, we use the correlation coefficient defined in the Python phik library [17].
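As an illustration of this step, here is a sketch that assumes the latent central values and the extracted feature values have already been gathered into a single pandas DataFrame with hypothetical column names; the φk correlation block between latent variables and music features can then be obtained directly from the phik library.

```python
import pandas as pd
import phik  # noqa: F401  (importing phik registers the .phik_matrix() DataFrame accessor)

# Hypothetical layout: one row per melody, columns 'z1'..'z37' for the relevant latent
# central values and 'R...', 'P...', 'M...' for the music21 rhythm/pitch/melody features.
df = pd.read_csv('latents_and_features.csv')  # placeholder file

latent_cols = [c for c in df.columns if c.startswith('z')]
feature_cols = [c for c in df.columns if c[0] in 'RPM']

# All columns are numerical, so declare them as interval variables for phik's binning
corr = df.phik_matrix(interval_cols=list(df.columns))

# The block compared in Fig. 9: latent variables versus human-defined features
latent_vs_features = corr.loc[latent_cols, feature_cols]
```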

The result is shown in Fig. 8. We can see that rhythm features (starting with the letter R at the bottom right in the Figure) are heavily correlated among themselves, as are some subsets of pitch features (from P11 to P15). Other pitch and melody features also form subsets of highly correlated features, but tend to correlate with other groups as well, even groups starting with a different letter (R or P).

Fig. 8

Pearson correlations between human-defined music features in our dataset. The first letter of each feature indicates the type: R for rhythm, P for pitch and M for melody

Within our set of music tracks, we can compute correlations between latent dimension central values and human-defined music features, as displayed in Fig. 9. The salient points to notice are as follows:

  1. The first relevant latent variable correlates more with several rhythm music features than any other relevant variable does.

  2. The second relevant latent variable correlates more with several pitch music features than any other relevant variable does.

Fig. 9

Nonlinear phik correlations between human-defined music features and latent variables, with latent variables sorted by relevance. We can easily pick out the first relevant dimension as being heavily correlated with many rhythm features and the second most relevant with many pitch features

Figure 10 explicitly displays the high correlations between the most relevant latent variable and rhythm features. In Fig. 11, we can check that the second most relevant latent variable is (non-monotonically) correlated with several pitch features.

Fig. 10

Most relevant latent variable against the value of several rhythm features. The red curves show local regression fits of latent variables as a function of the human-defined music features

Fig. 11

Second most relevant latent variable against the value of several pitch features. The red curves show local regression fits

Whereas humans require several music features to describe rhythm (respectively pitch), it seems as though MusicVAE condenses all of these into a single latent variable that encapsulates a lot of the information that we would intuitively qualify as rhythm (respectively pitch).

This is akin to what happens in other studies of emergent concepts in NNs, where single neurons quantify complicated concepts that may seem intuitive to humans but would be hard to describe explicitly [1,2,3]. Beyond the difference in analytic power afforded by its nonlinearity, the NN also seems to confirm that these concepts are very relevant for correctly reproducing music (as indicated by the relevance of the corresponding latent dimensions), and it in fact suggests that rhythm is even more relevant than pitch in describing 2 bars of music Footnote 9.

5 Random sequences of notes

MusicVAE has been trained on more than a million tracks of music of various genres, and it encodes these tracks using essentially the 37 relevant dimensions. We also saw that MusicVAE mostly uses 2 latent dimensions to represent rhythm and pitch. We can now ask what happens if we provide an input that is not real music: where does it get encoded in latent space? Are rhythm, pitch or the 37 relevant dimensions enough to distinguish music from noise?

5.1 Generating the data

We create 50,000 note sequences by switching on a random number of notes. Given that 32 sixteenth notes fill exactly two bars in 4/4 time, we draw the number of note-onset events from a uniform distribution over the integers from 2 to 32. Notes can only be switched on at discrete intervals (every sixteenth note), and we make sure that there is always exactly one note being played at any given time Footnote 10.

The pitch for each note is selected from a uniform distribution over the integers from 30 to 100 in MIDI notation. This means that the pitches of the various notes in the sequence are uncorrelated and, as such, follow no melodic structure.
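A possible implementation of this generation procedure is sketched below; the token representation and the choice to force a note onset on the first sixteenth note (so that a note is always sounding) are our own assumptions, not necessarily those of the original code.

```python
import numpy as np

def random_sequence(rng, n_steps=32, low=30, high=101):
    """One random 2-bar sequence as 32 tokens: a MIDI pitch where a note is switched
    on, 'hold' elsewhere, so that exactly one note is sounding at any time."""
    n_onsets = int(rng.integers(2, 33))  # uniform number of note-on events, 2..32
    # Assumption: always start a note on the first step, then scatter the rest at random
    onsets = {0} | set(rng.choice(np.arange(1, n_steps), size=n_onsets - 1, replace=False))
    return [int(rng.integers(low, high)) if t in onsets else 'hold'  # pitches uniform in 30..100
            for t in range(n_steps)]

rng = np.random.default_rng(0)
random_dataset = [random_sequence(rng) for _ in range(50_000)]
```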

5.2 Distinguishing music and noise using individual latent dimensions

In Fig.  12, we show the histograms for the excitation of the four most relevant latent variables we identified above, in the case of the 50,000 melodies extracted from real music, and the 50,000 random note sequences we generated.

Fig. 12

Values of the four most relevant latent variables in order of relevance for real music and random sequences

In the second plot from the left in Fig. 12, we see that the pitch encoding (the second most relevant latent variable) of our random sequences differs from that of real music, as could be expected given the very broad distribution of pitches in our random sequences, whereas real melodies have narrower pitch ranges.

As for rhythm, on the other hand (leftmost plot in Fig. 12), the distribution for our random note sequences does not seem to differ much from that of real music. This could be because we are only looking at 2 bars of music, admittedly a short sequence of notes, or because looking at a single latent dimension at a time is not enough to distinguish real music from random notes.

5.3 Counting excitations

Instead of looking at individual latent variables for each encoding, we first consider all latent variables at once and ask whether it is possible to distinguish real music from random notes without even having identified relevant and irrelevant latent variables. We could do this at the level of samples drawn from each encoded note sequence, but that would introduce more randomness, so we stick to the central values here. In practice, since we are asking whether the VAE can be used as a detector of music versus other sounds, we might as well directly use the central values, which the VAE provides in a forward pass.

The result is shown in Fig. 13, where we see that, on average, random note sequences excite the latent dimensions more than real music does, as expected since the random sequences are outliers that do not belong to the training distribution modeled by the VAE.

Fig. 13

RMS of central values over all 512 latent dimensions for real music versus random notes

Since we know which latent dimensions encode the most relevant information about music, we can go one step further and refine this analysis by performing separate averages over the 37 most relevant latent dimensions and over the remaining 475, as shown in Fig. 14. There we see that the average excitation of the relevant latent dimensions is larger than 1 for random notes, whereas it is slightly lower than 1 for real music. For the irrelevant latent dimensions, the split is around 0.1, again with real music producing less excitation than random notes Footnote 11.
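The corresponding statistic is simple to compute; the sketch below assumes "mus" holds the encoded central values (shape (n_sequences, 512)) and that "relevant_dims"/"irrelevant_dims" come from the relevance cut-off of Sect. 3. The thresholds in the final comment are simply read off the split described above, not a tuned classifier.

```python
import numpy as np

def rms_split(mus, relevant_dims, irrelevant_dims):
    """Per-sequence RMS of the central values, computed separately over the
    37 relevant and the 475 irrelevant latent dimensions (cf. Fig. 14)."""
    rms_rel = np.sqrt((mus[:, relevant_dims] ** 2).mean(axis=1))
    rms_irr = np.sqrt((mus[:, irrelevant_dims] ** 2).mean(axis=1))
    return rms_rel, rms_irr

# A crude music-versus-noise flag based on the observed split:
# rms_rel, rms_irr = rms_split(mus, relevant_dims, irrelevant_dims)
# looks_like_music = (rms_rel < 1.0) & (rms_irr < 0.1)
```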

Fig. 14

RMS of central values for real music versus random notes, split according to latent dimension relevance

6 Looking for melody (variables)

As we can see from Fig. 9, we have not found a variable that can be conclusively said to encapsulate the melody information in the 2-bar case, at least not independently from rhythm: the second relevant dimension was correlated with many melody (M) features, but even more so with rhythm (R) features. This could be because 2 bars of music are not enough to give a strong melodic signal, and we therefore turn to the 16-bar case.

6.1 Specifics of the 16-bar case

For the 16-bar case, looking for two well-separated sets of latent dimensions in the relevance analysis of Fig. 15 leads us to choose a cut-off of about 2 in relevance. We see that the 16-bar case requires about twice as many relevant dimensions as the 2-bar case: precisely 77 relevant dimensions, of which four are highly relevant (relevance > 50).

Fig. 15

Latent dimension relevance estimated for a given number of encoded melodies for the 16-bar case, with chosen cut-off depicted as a horizontal dashed black line

Figure 16 collects the information that was presented in the previous sections for the case of 2 bars of music, but applied to sequences of 16 bars. We can pick out relevant dimensions 1 and 2, which are heavily correlated with pitch features, and relevant dimensions 3 and 4, which are heavily correlated with rhythm features. This exhausts the set of 4 highly relevant dimensions identified in Fig. 15.

Looking for latent variables that represent melody, we notice that the third and fourth most relevant latent dimensions are correlated with melody but, as in the 2-bar case, they are mostly correlated with rhythm (see Fig. 9). Possible melody latent dimensions show up much later in Fig. 16, at relevance positions 27, 49, 63, 65 and 70. This might indicate that MusicVAE does not rely strongly on melody to organize its understanding of music, and that melody either appears as a consequence of the more basic concepts of pitch and rhythm or only plays a secondary role.

Fig. 16

Same plots as above, but for the 16-bar case

7 Conclusion

We have studied how a VAE trained on a million music tracks organizes its 512-dimensional latent space into hundreds of “irrelevant dimensions” that are barely used to encode music, and a few dozen “relevant dimensions” that actually encode musical information. It does make sense that the VAE only uses a fraction of its 512 dimensions for music, as even the space of random note sequences would not require such a large latent space.

Returning to the case of real music, and in particular 2-bar melodies, we found that MusicVAE uses 37 relevant dimensions to actually encode musical information; of these, the first two in order of importance can be clearly identified as corresponding to rhythm and pitch.

Indeed, we have shown that several quantities defined in the literature to describe rhythm are correlated almost exclusively with the most relevant dimension, which nonlinearly encodes this complex information into a single real variable. The same occurred for pitch with the second most relevant latent variable, but not for melody, which does not appear to be encoded independently from rhythm for such short music tracks.

Moving on to chunks of 16 bars of music, we saw that MusicVAE uses 77 relevant dimensions to encode music, with most pitch information encoded nonlinearly into the two most relevant dimensions. The next two relevant dimensions in order of importance encode several rhythm features. Latent dimensions dedicated to melody only show up much further down the relevance ordering.

Whereas previous approaches have focused on enforcing a linear mapping of the human-defined quantities onto the latent space, we suggest that the nonlinear change of representation is what allows the VAE to extract “principal coordinates” that diagonalize the problem, thereby simplifying and extending the human-defined variables.