Exploring Variational Auto-Encoder Architectures, Configurations, and Datasets for Generative Music Explainable AI

Generative AI models for music and the arts in general are increasingly complex and hard to understand. The field of eXplainable AI (XAI) seeks to make complex and opaque AI models such as neural networks more understandable to people. One approach to making generative AI models more understandable is to impose a small number of semantically meaningful attributes on generative AI models. This paper contributes a systematic examination of the impact that different combinations of Variational Auto-Encoder models (MeasureVAE and AdversarialVAE), configurations of latent space in the AI model (from 4 to 256 latent dimensions), and training datasets (Irish folk, Turkish folk, Classical, and pop) have on music generation performance when 2 or 4 meaningful musical attributes are imposed on the generative model. To date there have been no systematic comparisons of such models at this level of combinatorial detail. Our findings show that MeasureVAE has better reconstruction performance than AdversarialVAE, which has better musical attribute independence. Results demonstrate that MeasureVAE was able to generate music across music genres with interpretable musical dimensions of control, and performs best with low complexity music such as pop and rock. We recommend that a 32 or 64 dimensional latent space is optimal for 4 regularised dimensions when using MeasureVAE to generate music across genres. Our results are the first detailed comparisons of configurations of state-of-the-art generative AI models for music and can be used to help select and configure AI models, musical features, and datasets for more understandable generation of music.


Introduction
Music generation is a key use of AI for the arts, and is arguably one of the earliest forms of AI art. However, contemporary generative music models rely increasingly on complex Machine Learning models [1][2][3] such as neural networks [4,5] and deep learning techniques [6][7][8][9] which are difficult for people to understand and control. This makes it hard to use such models in real-world music making contexts as they are generally inaccessible to musicians or anyone besides their creator.
Making AI models more understandable to users is the focus of the rapidly expanding research field of eXplainable AI (XAI) [10]. One approach to making machine learning models more understandable is to expose elements of the models to people in semantically meaningful ways, for example using latent space regularisation [11] to impose semantically meaningful dimensions on the latent space. To date there has been very little research on the applicability and use of XAI for the arts. Indeed, there is a lack of research on what configurations of generative AI models and datasets are more, or less, amenable to explanation.
This paper takes a first step towards understanding the link between the explanation and performance of AI models for the arts by examining what effect different AI model architectures, configurations, and training datasets have on the performance of generative AI models that have some explainable features.

Related Work
The field of eXplainable AI (XAI) [10] explores how complex and difficult to understand AI models such as neural nets can be made more understandable to people. Approaches to increasing the explainability of AI models include generating understandable explanations of AI model behaviour, structuring and labelling complex AI models to make them more understandable, and approximating the behaviour of complex models with less complex and more understandable models (ibid.). An important element of XAI is the interpretability of an AI model, which we take to mean the "ability to explain or to provide the meaning in understandable terms to a human" [12]. Unfortunately the concept of explanation is ambiguous and variously defined [13]. In the Machine Learning (ML) literature, XAI often refers to making the reasons behind ML decisions more comprehensible to humans. For example, the majority of XAI research has examined how to explain the decisions of ML classification and prediction models - see [12] for a thorough survey. There also exists a broader view of the concept of explainability, which we follow, in which "explainability encompasses everything that makes ML models transparent and understandable, also including information about the data, performance, etc." [14]. In this paper we are specifically concerned with how to make AI models more interpretable for people so that they can better control the generative aspects of the AI model. Approaches for explaining AI models are most often tied to specific AI models and data types. There is an emerging set of approaches which are not tied to specific AI models or data types, referred to as agnostic explanators [12]. However, agnostic approaches such as LIME [15] are concerned with building an explanation model to explain the classification/prediction of an AI model, whereas in this paper we focus on making the content of an AI model itself more interpretable so that the model can be better controlled for music generation.
To date most XAI research has been concerned with goal-directed domains where task efficiency and transparency are key factors. For example, generating explanations for why an AI model made a medical diagnosis [16] or how the AI models in self-driving cars work [17,18]. However, there has been little research on how XAI could be used in more creative domains such as the Arts [19]. This lack of explainability typically limits the use of AI models for the Arts to the creator of the AI model and severely limits their use by artists and practitioners. Of the limited research on XAI for the Arts, [20] explores the presentation of visual cues between mappings in the latent space of an AI model, and [21] researches the visualisation of levels of mutual trust between an AI system and musicians in music making. This leaves many open research questions on the use of XAI for the Arts, ranging from questions about the explainability of different AI models and datasets to how to design user interfaces to navigate and manipulate explanations of generative AI models.
Taking music as a key form of artistic endeavour, this paper explores explainable AI for music. Musical problems addressed by AI models include composition, interpretation, improvisation, and accompaniment [22]. In this paper we focus on a core use of AI for music - music composition or generation, otherwise known as generative music. Music itself has a multi-level structure that "ranges from timbre and sound through notes, chords, rhythmic patterns, harmonic patterns (e.g., cadences), melodic motifs, themes, sections, etc." [23]. As such, generative AI models range in purpose from generating monophonic sequences of notes (referred to as a melody), to polyphonic melodies, multivoice polyphony, accompaniment to a melody (counterpoint or chord progression), and association of a melody with a chord progression [24]. However, sequencing longer term structures such as themes and sections by integrating short-term and long-term machine learning for music generation remains an open research challenge [23], which is problematic given that the "long-term and/or hierarchical structure of the music plays an important role" [2]. Applications of generative AI range from polyphonic classical music generation in the style of Bach [7] to monophonic Irish Folk music generation [25,26], and include composition applications such as musical inpainting to generate a melody to fill in the musical gap between two melodies [27] and musical interpolation to generate a set of melodies which incrementally move from one melody to another [28]. However, the complex nature of these generative models means that people often need some technical expertise and knowledge of these algorithms in order to use and adapt them effectively. This makes such approaches difficult for people, especially non-experts, to understand and manipulate.
In this paper we focus on the explainability of the AI model itself and its output. In particular, we examine how semantically meaningful labels can be applied to properties of AI models to provide the opportunity for users to interpret and understand some aspects of the model and its generated output. To date there have been few comparisons of the performance of explainable generative models for music. For example, research has compared the performance of a novel Convolutional-Variational Neural Network for music generation to other Neural Networks [29] in terms of the Information Rate of generated music - a measure of musical structure. However, such comparisons are made across models rather than across configurations of the models themselves, and do not examine a range of semantically meaningful features. We aim to compare the effect of meaningful labels on AI models in different configurations and with different datasets.
To reduce the complexity of these combinatorial analyses we select the core music generation task of generating monophonic melodies. In this way we contribute the first in-depth analysis of how different AI model architectures and datasets affect music generation when explainable attributes are used. Future work can build on our findings to compare the effects of explainable attributes on more complex polyphonic melody generation and later on accompaniment and association. By taking this approach we improve the field's understanding of state-of-the-art deep learning generative models to help inform future generative model development and refinement: understanding where we are today informs where we might go in the future.

Latent Spaces for Music Generation
AI models for music generation range from probability based models such as Markov Chains [30,31] through to the deep learning techniques explored in this paper [27,32]. Probabilistic approaches typically offer more controllable music generation with lower computational and dataset requirements, but their outputs are often less novel than those achieved by deep learning approaches. A wide variety of deep learning generative models of music have been developed in recent years [24] and have been demonstrated to generate convincing musical outputs [1][2][3]. Briot et al. [24] provide a thorough survey of deep learning architectures and models used for music generation including Variational Auto-Encoders (VAE), Restricted Boltzmann Machines (RBM), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Generative Adversarial Networks (GAN), Reinforcement Learning (RL), and compound architectures of these approaches. As noted by [32], two of the most popular deep learning approaches to generative AI are Generative Adversarial Networks (GANs) [33] and Variational Auto-Encoders (VAEs) [34]. In this paper we examine VAEs as they have been demonstrated to be capable of creative tasks including music generation [35,36], music inpainting [27], and music interpolation [28]. Moreover, whilst comparisons of VAE approaches have to date examined image generation in terms of computation time and re-generation accuracy [37], there has been no systematic comparison of VAEs for music generation, nor in terms of interpretable features. Some recent VAE systems have exposed the latent space of generative music models to users [20,27,38-40] as a way for users to navigate the latent space to generate music. These approaches offer increased control of the AI models [40] and increased structure and labelling of the models [27], both of which contribute to increasing explainability. Given the research interest in making latent spaces more explainable we explore what effect different AI
model configurations and training datasets have on one of these approaches when explainable attributes are applied. In this paper we explore these questions for the popular VAE architecture, which shows promise as a deep learning approach to music generation [28]. A VAE architecture consists of i) an encoder which encodes training data into ii) a multidimensional latent space, which is used by iii) a decoder which decodes data from the latent space to generate data in the style of the training data, as illustrated in Fig. 1. Modifying values of the latent space dimensions will have an effect on the generated data. The challenge for explainable VAE data generation is how to offer users meaningful control of the generated data given that the latent space is the result of unsupervised learning with no meaningful structure. There are two main approaches to attribute-based control of generative models: unsupervised disentanglement learning, and supervised regularisation methods [11]. However, unsupervised disentanglement necessarily requires some post-training analysis to identify the possible meaning of the disentangled dimensions (ibid.).
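The encoder/latent/decoder data flow described above can be sketched with toy linear maps. This is purely illustrative: random weights stand in for trained networks, and the dimensions (24-dimensional data, 4-dimensional latent space) are chosen for readability, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "encoder" and "decoder"; random weights stand in for trained networks.
W_enc = rng.normal(size=(4, 24))   # 24-dim data -> 4-dim latent statistics
W_dec = rng.normal(size=(24, 4))   # 4-dim latent -> 24-dim reconstruction

def encode(x):
    mu = W_enc @ x
    log_var = np.zeros_like(mu)    # fixed unit variance for simplicity
    return mu, log_var

def reparameterise(mu, log_var):
    # z = mu + sigma * eps, the standard VAE sampling step
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    return W_dec @ z

x = rng.normal(size=24)            # a stand-in data point
mu, log_var = encode(x)
z = reparameterise(mu, log_var)
x_hat = decode(z)

# Moving along one latent dimension changes the generated output,
# which is what attribute-based control exploits.
z_shift = z.copy()
z_shift[0] += 1.0
assert not np.allclose(decode(z), decode(z_shift))
```

The last two lines illustrate the point made above: any change to a latent dimension propagates to the decoded output, but without regularisation there is no guarantee the change is semantically meaningful.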

Research Questions
As outlined in previous sections, there are many approaches to music generation using deep learning models, and each year new models are added to the repertoire of music generation systems. However, to date there has been no systematic analysis of how different training datasets and AI model architectures might impact the performance of XAI models for music. For example, to date the only experiments on the effect of latent space dimensionality on model performance have been conducted on images [41]. Our core Research Question is: What effect do different AI model architectures, configurations, and training data have on the performance of generative AI models for music with explainable features?
No research has been undertaken to establish these effects to date.In answering this question we help researchers to understand the properties of state-of-the-art generative music architectures and so help to build a baseline from which to explore many more model features and generative architectures.
This paper begins to address the core Research Question by systematically asking the following questions about the performance of VAE generative models with explainable features:
RQ1 What is the effect of VAE model architectures on performance?
RQ2 What effect do the musical features imposed on the latent space have on performance?
RQ3 What effect does the size of the latent space have on performance?
RQ4 What effect do training datasets have on performance?

Methods
Following [27], which demonstrates that VAEs are successful in generating short pieces of monophonic music, we restrict our music generation to monophonic measures of music represented by 24 characters. Each character can represent a musical note, a note continuation, or a rest.
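As an illustration of this representation, the sketch below expands note events into a fixed-length 24-token measure. The token names ("__" for continuation, "r" for rest) and the helper itself are our assumptions for illustration, not the paper's exact vocabulary.

```python
def encode_measure(events, length=24):
    """Expand (pitch, duration) events into a fixed-length token sequence.

    Illustrative sketch: each note contributes one onset token (e.g. "A3")
    plus duration-1 continuation tokens "__"; the remainder is padded with
    rest tokens "r".
    """
    tokens = []
    for pitch, duration in events:
        tokens.append(pitch)                      # onset token
        tokens.extend(["__"] * (duration - 1))    # continuation tokens
    tokens.extend(["r"] * (length - len(tokens))) # pad with rests
    return tokens[:length]

# Three notes of 6, 6, and 12 time steps fill one 24-step measure.
measure = encode_measure([("A3", 6), ("G5", 6), ("C4", 12)])
assert len(measure) == 24
assert measure[0] == "A3"
```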

Candidate AI Models
As a first step in understanding what effect explainable features have on the performance of generative AI model architectures we compare two representative VAE generative music models: MeasureVAE [27] and AdversarialVAE [32]. Both approaches build on a VAE architecture (Section 2.1) to generate music but differ in terms of how musically semantic information is applied to the music generation, with MeasureVAE imposing regularised dimensions on the latent space and AdversarialVAE adding control attributes to the decoder.

MeasureVAE
The popular MeasureVAE implementation [27,42] has been demonstrated to be "successful in modeling individual measures of music" [27]. MeasureVAE uses a bi-directional recurrent neural network (RNN) for the encoder, and a combination of two uni-directional RNNs and linear stacks for the decoder [27]. The generated music can be varied by modifying the values of the dimensions in the latent space, but the relationships between the dimensions and the music produced are not meaningful to people. To improve the explainability of MeasureVAE we can apply latent space regularisation (LSR) [43] when training the VAE. LSR has been widely used to allow more user controlled generation of images [44] and music [42,45]. Following [11,42] we use LSR to force a small number of dimensions of the latent space to represent specific musical attributes (see Section 4.3); these regularised dimensions are the explainable features of the MeasureVAE model which increase the explainability of the generative model. Fig. 2 illustrates the VAE architecture with 4 regularised dimensions in the latent space. See [27] for details of the MeasureVAE model architecture.
In MeasureVAE, which is a typical VAE encoder-decoder architecture [11], data points x in a high-dimensional space X are mapped to a low-dimensional latent space Z by the encoder, where latent vectors are denoted z. The latent vectors z are mapped back to the data space X via the decoder. The latent vector z is treated as a random variable, and the generation process is defined by the sampling processes z ∼ p(z) and x ∼ p_θ(x | z), where p_θ(x | z) is the decoder parameterised by θ and p(z) is the prior distribution over the latent space Z, as per variational inference. The encoder is represented by q_ϕ(z | x), the posterior parameterised by ϕ. In this context, the loss function is defined as in [11]:

L = L_R + L_KLD    (1)

where the first term, L_R, is the reconstruction loss, the L2 norm between the original data vector x and its reconstruction x̂, and the second term, L_KLD, is the KL-divergence regularisation typical of VAEs.
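The two terms of this loss can be sketched numerically as follows, assuming a diagonal Gaussian posterior so the KL term has its usual closed form; the beta weighting parameter is included for illustration (the exact weighting used in training is configured separately).

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Sketch of the VAE loss L = L_R + beta * L_KLD.

    x, x_hat        -- original data vector and its reconstruction
    mu, log_var     -- mean and log-variance of the Gaussian posterior q(z|x)
    """
    # Reconstruction term: squared L2 norm between input and reconstruction.
    l_rec = np.sum((x - x_hat) ** 2)
    # Closed-form KL divergence between N(mu, sigma^2) and the standard
    # normal prior N(0, I).
    l_kld = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return l_rec + beta * l_kld
```

With a perfect reconstruction and a posterior equal to the prior (mu = 0, log_var = 0), both terms vanish and the loss is zero.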
To apply the latent space regularisation in the context of MeasureVAE, firstly an attribute distance matrix D_a ∈ R^{m×m} is defined, where m is the number of training examples in a mini-batch, as in [11]:

D_a(i, j) = a(x_i) − a(x_j)    (2)

where x_i and x_j represent arbitrary data vectors and a(·) is the calculation of an attribute for the data vector x. Then, another distance matrix D_r ∈ R^{m×m} is calculated for the regularised dimension r of the latent vectors z:

D_r(i, j) = z_i^r − z_j^r    (3)

where z_i^r and z_j^r are the r-th dimension values of the arbitrary latent vectors z_i and z_j. Lastly, the additional loss term for the latent space regularisation is defined with the following equation, as in [11]:

L_r = MAE(tanh(δ D_r) − sgn(D_a))    (4)

which is added to the VAE loss in Equation 1. Here MAE is the mean absolute error, tanh is the hyperbolic tangent, sgn is the sign function and δ is a parameter that controls the spread of the posterior. Due to this additional term, increasing or decreasing relationships between the calculated attributes for x_i and x_j are similarly reflected in the relationship between z_i^r and z_j^r. The code for this MeasureVAE implementation based on [11] can be found here 3 .
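This regularisation term can be sketched for a single regularised dimension as follows (a numpy sketch of the pairwise distance matrices D_a and D_r and the MAE term; function and variable names are illustrative):

```python
import numpy as np

def lsr_loss(z_r, attributes, delta=10.0):
    """Latent space regularisation loss for one regularised dimension.

    z_r        -- values of the regularised latent dimension for a mini-batch
    attributes -- the corresponding musical attribute values a(x)
    delta      -- spread parameter for the tanh
    """
    # Pairwise signed-distance matrices over the mini-batch (Equations 2 and 3).
    D_a = attributes[:, None] - attributes[None, :]
    D_r = z_r[:, None] - z_r[None, :]
    # MAE between the soft sign of latent distances and the hard sign of
    # attribute distances (Equation 4): encourages monotonic alignment
    # between the regularised dimension and the attribute.
    return np.mean(np.abs(np.tanh(delta * D_r) - np.sign(D_a)))
```

When the regularised dimension is ordered the same way as the attribute the loss approaches zero; when the ordering is reversed it is large, which is exactly the monotonicity pressure described above.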

AdversarialVAE
Fig. 2 Variational Auto-Encoder with Latent Space Regularisation of 4 Dimensions

The AdversarialVAE [32] uses a one-layer bidirectional Gated Recurrent Unit (GRU) for the encoder followed by linear layers (MLP) for the mean and variance of the variational sampling at the latent space bottleneck, and a two-layer GRU for the decoder followed by a linear layer (MLP) which, in contrast to MeasureVAE, uses both the latent space and additional control attributes to generate the music. The latent space itself has an adversarial classifier-discriminator added which "induces the encoder to remove the attribute information from the latent vector" (ibid.) as illustrated in Fig. 3. In contrast to MeasureVAE, where specific dimensions of the latent space are tied to semantic musical features, music generation with the AdversarialVAE is controlled by musical attributes fed to the decoder - these are the explainable features of the AdversarialVAE model.
The AdversarialVAE model as defined in [32], similar to MeasureVAE, has an additional loss term on top of the original VAE loss defined in Equation 1. The additional loss term here, denoted L_D, is adversarial and belongs to a separate architecture: a classifier-discriminator consisting of linear layers with tanh activations, except for the last layer where a sigmoid activation is used as per the classification task. The objective of this classifier-discriminator is to determine the value of a musical attribute at discretely defined levels, given the latent vector z of a musical sequence, by learning a probability distribution s_ψ, where ψ parameterises the classifier-discriminator.
To construct L_D, firstly N musical attributes are defined. Then, based on the training data, each attribute is quantised into K bins (specifically, K = 8 in this study), where µ-law compression is used as in [32] to obtain an equal number of samples in each bin given the characteristics of the training data. The label of each sample in the training data is one-hot encoded according to the quantised bin that the sample belongs to. Considering N musical metrics and K discrete levels, each sample yields a matrix B ∈ R^{N×K} for the target musical attributes, which is the output of the classifier-discriminator network. As per the adversarial objective defined in [32], the encoder tries to prevent the classifier-discriminator from predicting the correct targets for the musical attributes, therefore the targets from the perspective of the encoder are defined as 1 − B, where 1 is the matrix of ones as per the one-hot encoding.
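The equal-count binning and one-hot target construction can be sketched as follows for a single attribute. Simple quantile binning is used here as a stand-in for the µ-law-compressed binning of [32] (both aim for roughly equal sample counts per bin); the function names are illustrative.

```python
import numpy as np

def quantise_attribute(values, k=8):
    """Assign each attribute value to one of k equal-count bins and
    one-hot encode the bin labels.

    Quantile-based stand-in for the mu-law binning in [32]: internal bin
    edges are placed at the 1/k, 2/k, ... quantiles of the data.
    """
    edges = np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])
    bins = np.digitize(values, edges)   # bin index in [0, k-1] per sample
    one_hot = np.eye(k)[bins]           # one row of the target matrix B per sample
    return bins, one_hot

values = np.arange(80, dtype=float)     # toy attribute values
bins, B = quantise_attribute(values)

# The adversarial target for the encoder flips the one-hot rows: 1 - B.
targets_for_encoder = 1.0 - B
```

With uniformly spread toy values, each of the 8 bins receives exactly 10 samples, matching the equal-count intent of the original scheme.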
Given the B and 1 − B matrices, L_D is defined as the classification loss of the classifier-discriminator s_ψ over latent vectors sampled from the posterior q_ϕ(z | x), where ϕ is the parameterised encoder architecture, ψ is the parameterised classifier-discriminator, and q_ϕ(z | x) is the posterior distribution denoting the encoder, in accordance with the notation in Equation 1.
The overall loss is then the VAE loss of Equation 1 combined with the adversarial loss term L_D, following a similar notation as above. See [32] for details of the AdversarialVAE model architecture and the repository here 4 for the implementation of AdversarialVAE based on [32].

Datasets
Generative AI music models are typically trained and evaluated on one musical dataset such as monophonic Irish folk melodies [26] which have been used to train and test MeasureVAE [27].However, as noted in [46], different musical genres have different musical features which may have an impact on the performance of a generative AI model and potentially its explainability.
Fig. 3 Variational Auto-Encoder with Adversarial Classifier and Decoder Control Attributes

In this paper we use the frequently used Irish Folk dataset [26] and compare and contrast this with datasets of Turkish folk music, pop music, and classical music as described in this section. Table 1 presents key features of the datasets used, including their musical features from Section 4.3.

Irish Folk dataset
The Irish Folk dataset contains 20,000 monophonic Irish folk melodies [26] 5 , from which 5.6m notes are extracted for these experiments. The dataset has the highest note range and density of the music used in these experiments, meaning that it is the most complex musically. It is also by far the largest dataset used in these experiments and is commonly used in generative AI research.

Turkish Makam dataset
The Turkish Makam dataset [47] 6 , as used in [48], consists of approximately 2,200 musical scores of Turkish makam music, a form of Turkish folk music. This results in approximately 755k musical notes of monophonic folk songs. The Turkish Makam dataset has similarly high mean note density, note range, and rhythmic complexity to the Irish Folk music dataset, suggesting similarly high musical complexity. The Turkish Makam dataset is the smallest dataset used in these experiments.

Muse Bach dataset
MuseData 7 consists of Baroque to early classical music, including both monophonic and polyphonic instrumental pieces. Given the wide range of styles contained in MuseData we selected all pieces by Bach, which we refer to as the Muse Bach dataset.

Lakh Clean dataset
The Lakh dataset [49] 8 contains 176,581 unique MIDI files. For this experiment we use a subset of the Lakh dataset - the Clean MIDI (sub)dataset, which contains pieces by 2,199 artists. The distribution of the genres in the Clean MIDI dataset is: 33% Pop, 32% Rock, 13% Jazz and Blues, 10% R&B, and 12% other, providing a dataset of contemporary popular music. Almost 7k monophonic clips were extracted from these pieces, resulting in approximately 1.7m notes, which we refer to as the Lakh Clean dataset. This dataset has the lowest mean note range and rhythmic complexity of the datasets used in these experiments, suggesting that it contains some of the least musically complex music.

Data Preparation
Each dataset was converted into a measure-based ABC format using the midi2ABC functions in EasyABC 9 . Each measure is represented by 24 characters including note names, continuation tokens, and rest tokens. As the VAE models in this experiment work with monophonic melodies, single line melodies were extracted from the datasets using EasyABC. All musical instruments were then separated into separate files and any remaining chords were converted into single notes based on the chord's highest pitch.
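The final chord-reduction step above amounts to keeping only the top voice of any chord. A trivial sketch, assuming pitches are given as MIDI note numbers:

```python
def chord_to_top_note(chord_pitches):
    """Reduce a chord to a single note by taking its highest pitch,
    as in the monophonic data preparation step (MIDI note numbers)."""
    return max(chord_pitches)

# A C-major triad (C4, E4, G4 as MIDI 60, 64, 67) collapses to G4 (67).
assert chord_to_top_note([60, 64, 67]) == 67
```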

Musical Features
There are many musical features that could be imposed on music generation. For example, the popular jSymbolic [50] offers analysis of 246 unique musical features. In this research we use a subset of the most frequently used features [19] to select the following musical attributes:
• Note Density (ND) - the number of notes in a measure;
• Note Range (NR) - the highest minus lowest pitch in a measure;
• Rhythmic Complexity (RC) - how syncopated a musical measure is [51];
• Average Interval Jump (AIJ) - the average of the absolute difference between adjacent notes in a measure.
These features cover both rhythmical properties (ND and RC) and melodic properties (NR and AIJ).They are used to i) characterise the musical properties of the datasets (Section 4.2) used to train the AI models; and ii) as attributes of control of music generation -because the features have some musical meaning they serve to increase the explainability of the AI models.
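Three of these features can be computed directly from the pitches in a measure. A minimal sketch assuming MIDI pitch numbers; Rhythmic Complexity is omitted since it requires beat-level syncopation analysis [51]:

```python
def note_density(pitches):
    """ND: the number of notes in the measure."""
    return len(pitches)

def note_range(pitches):
    """NR: highest minus lowest pitch in the measure."""
    return max(pitches) - min(pitches)

def average_interval_jump(pitches):
    """AIJ: mean absolute difference between adjacent notes."""
    jumps = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return sum(jumps) / len(jumps)

measure = [60, 64, 67, 62]   # MIDI pitches of the notes in one measure
assert note_density(measure) == 4
assert note_range(measure) == 7
assert average_interval_jump(measure) == (4 + 3 + 5) / 3
```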

Experimental Setting: Comparing VAE Model Architectures
MeasureVAE and AdversarialVAE were compared using the Irish Folk dataset for training to explore RQ1. Of the 20,000 monophonic Irish folk melodies, 14,000 were used as the training set, 3,000 as the test set, and 3,000 as the validation set. Models were compared in terms of the evaluation metrics outlined in Section 5.1. The learning rate was set to 1e-4 (optimised using Adam [52]). Both models were trained on a GPU for a total of 50 iterations with a batch size of 64 for all data. α = 0.1, β = 0.1, γ = 0.2 were used in the VAE loss function. As MeasureVAE uses 256 latent space dimensions whereas AdversarialVAE uses 128, testing was undertaken with both 128 and 256 dimensions [41]. Both models had 4 musical features imposed on them (from Section 4.3).
We use musical measures for generative output and training in keeping with state-of-the-art music generation research [11,27] and typical of the musical elements used in current generative AI tasks.Each measure is represented by 24 characters which include note names such as A3, G5, and so on, continuation tokens, and rest tokens.

Evaluation Metrics
We evaluate the AI models in terms of standard measures of:
• Reconstruction Accuracy - how well the model can reconstruct any given input; we aim to maximise this. It is calculated by comparing the input melody with the generated melody, and averaging this over the whole dataset:

Accuracy = (1 / (N M)) Σ_{i=1}^{N} Σ_{j=1}^{M} Check(x_{i,j}, x̂_{i,j})

where x_i is the input sequence, x̂_i is the reconstructed sequence, N is the number of samples in the dataset and M is the sequence length. The Check function compares two corresponding elements in the original and the reconstructed sequences, returning 1 if they are equal and 0 otherwise.
• Reconstruction Efficiency - how well the model generates music with respect to the characteristics of its training dataset and the provided input sequence when musical parameters are changed; we aim to maximise this. To calculate this measure, we split our data into two subsets where an attribute a_r ≥ 0 and a_r < 0. Then, we calculate the mean latent vectors z̄_a and z̄_{a0} for each of these subsets, respectively. This procedure provides a general picture of latent vectors with respect to the presence of the musical attribute. Then, using these mean vectors, for each sample in our data we take its latent vector z and apply the following interpolation:

z' = z + µ (z̄_a − z̄_{a0})

where µ ∈ {−0.5, −0.4, . . . , 0.4, 0.5} with 11 possible values. The resulting vectors z' are decoded into generated music sequences, and for each generated music sequence x̂ and input music sequence x we check the cosine similarity between the sequences:

sim(x, x̂) = (x · x̂) / (‖x‖ ‖x̂‖)

Each musical attribute is interpolated separately, and then the average similarity of each interpolation is calculated.
• Attribute Independence - how resilient an attribute is to change by other attributes; we aim to maximise this. It is calculated by taking the maximum Spearman's correlation coefficient between the attribute value a(x) and each dimension of the latent space z_d [53]. These correlation coefficients are then averaged over all of the musical attributes.
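The accuracy and similarity computations above can be sketched in numpy, with token sequences represented as integer arrays. This is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def reconstruction_accuracy(originals, reconstructions):
    """Fraction of token positions where the reconstruction matches the
    input, averaged over the dataset; the elementwise comparison plays
    the role of the Check function."""
    originals = np.asarray(originals)
    reconstructions = np.asarray(reconstructions)
    return np.mean(originals == reconstructions)

def cosine_similarity(x, x_hat):
    """Cosine similarity between an input sequence and a generated
    sequence, both as numeric vectors."""
    return np.dot(x, x_hat) / (np.linalg.norm(x) * np.linalg.norm(x_hat))

# One 4-token sequence with one mismatched position: accuracy 3/4.
assert reconstruction_accuracy([[1, 2, 3, 4]], [[1, 2, 3, 5]]) == 0.75
```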
Implementations of the Reconstruction Accuracy and Attribute Independence are included in these repositories 10 11 as in [11] and [32].

Reconstruction Accuracy
Table 2 shows the Reconstruction Accuracy scores for 128 and 256 latent space dimensions for MeasureVAE and AdversarialVAE. Results show that MeasureVAE's reconstruction accuracy outperformed AdversarialVAE in both the 128 and 256 dimension configurations, achieving a high of 99.6% for the validation set with 256 dimensions. Both models performed better with 256 dimensions than with 128 dimensions. This may be because the higher number of dimensions makes it easier to decompress the latent space. Inspecting Figs. 5 and 6 suggests that the two models produce different sounding music to each other for increased Note Range (b), with MeasureVAE producing a measure with larger changes between the notes. Furthermore, in increasing the Note Range MeasureVAE also increased the Average Interval Jump in Fig. 5b, whereas AdversarialVAE produced a measure which shifted most of the original melody upwards in pitch except for the final note, which was shifted down to produce the required increase in Note Range in Fig. 6b. This difference is illustrated by calculating the Spearman's correlation r between the musical attribute values. In this case we see that the correlation between the increase in Note Range and the generated measure's AIJ is r = 0.286 for MeasureVAE (weak correlation) and r = 0.154 for AdversarialVAE (no correlation), i.e. AIJ increases weakly with increases in NR for MeasureVAE but not for AdversarialVAE.
For Note Density (c), MeasureVAE produces music with a higher Note Density than AdversarialVAE when an increase in the Note Density attribute is applied to the generation. Interestingly, MeasureVAE achieves the increased Note Density by adding an upward run of notes to the measure, which also increases the Rhythmic Complexity, whereas AdversarialVAE's increase in Note Density reduces Rhythmic Complexity compared to the original. Calculating Spearman's correlation again, we see that in this case of increasing Note Density the correlation of increased ND to output RC is r = 0.341 for MeasureVAE (weak correlation) and r = 0.178 for AdversarialVAE (no correlation). This suggests that RC increases weakly when ND is increased with MeasureVAE, but not with AdversarialVAE.
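The Spearman correlations reported here can be reproduced with a small rank-correlation helper. This is a simplified version without tie correction (in practice scipy.stats.spearmanr is the usual choice); it computes the Pearson correlation of the ranks:

```python
import numpy as np

def spearman_r(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Simplified sketch with no tie correction, adequate for attribute
    values without many repeats.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    ra = rank_a - rank_a.mean()
    rb = rank_b - rank_b.mean()
    return float(np.sum(ra * rb) / np.sqrt(np.sum(ra ** 2) * np.sum(rb ** 2)))

# Any monotonically increasing relationship gives r = 1, regardless of scale.
assert spearman_r(np.array([1.0, 2.0, 3.0, 4.0]),
                  np.array([10.0, 20.0, 25.0, 100.0])) == 1.0
```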
In contrast to NR and ND, both AI models generate similar music to each other for increased Rhythmic Complexity (d) and also for increased Average Interval Jump (e). Note Density had the highest Attribute Independence for both models and for both 128 and 256 dimensions. This may be because ND is a measure of the number of notes in a measure, meaning that it is easier to distinguish compared to other metrics such as Rhythmic Complexity which relies on the ability to differentiate between different musical beat types. Rhythmic Complexity showed the largest difference between MeasureVAE (0.878) and AdversarialVAE (0.943) for 128 dimensions, whereas Average Interval Jump shows the largest difference between MeasureVAE (0.765) and AdversarialVAE (0.875) for 256 dimensions. The higher independence of AdversarialVAE's attributes may be due to the use of the adversarial classifier-discriminator to impose musical attributes and the additional phase in the training process that optimises its decisions rather than trying to lower the loss function's value.

In this work we are interested in contributing towards understanding how generative models which create music in given styles can be better interpreted and manipulated by users. To this end we now explore the performance of MeasureVAE in more detail as it has higher Reconstruction Accuracy and Reconstruction Efficiency than AdversarialVAE (Section 5).

Attribute Independence
In this section we examine the impact that different configurations of musical (explainable) features (RQ2), sizes of latent spaces (RQ3), and different training datasets (RQ4) might have on the performance and explainability of MeasureVAE. To examine this systematically we undertook a combinatorial experiment examining the effect of musical dataset (n=4), number of latent dimensions (n=7), and number of regularised musical attributes (n=2) on evaluation metrics (Section 6.1):
• Datasets - Muse Bach, Lakh Clean, Turkish Makam, and Irish Folk datasets (Section 4.2) - to compare a range of musical genres;
• Latent dimensions - 4, 8, 16, 32, 64, 128, and 256 - to capture a typical range of latent space sizes;
• Regularised dimensions - 2 or 4 - using musical features (Section 4.3) in the latent space - ND&RC, NR&AIJ, or ND&NR&RC&AIJ.
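Crossing the three attribute sets with every dataset and latent size yields 84 training conditions. A sketch of the grid (names follow the paper; that every attribute set is crossed with every dataset and latent size is our reading of the setup):

```python
from itertools import product

datasets = ["Muse Bach", "Lakh Clean", "Turkish Makam", "Irish Folk"]
latent_dims = [4, 8, 16, 32, 64, 128, 256]
regularised_sets = [("ND", "RC"), ("NR", "AIJ"), ("ND", "NR", "RC", "AIJ")]

# One MeasureVAE training run per (dataset, latent size, regularised attributes):
# 4 datasets x 7 latent sizes x 3 attribute sets = 84 conditions.
conditions = list(product(datasets, latent_dims, regularised_sets))
```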
For each combination of the above we trained a MeasureVAE model for 25 epochs. We use Adam [52] as the optimizer of the model with learning rate = 1e-5, β1 = 0.9999, and ϵ = 1e-8. The model is trained on a single RTX 6000 GPU following a similar setting to [42], taking on average 2.5 hours per epoch.
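A single Adam update with the hyperparameters stated above can be sketched in pure Python. This is an illustration of the optimiser's update rule, not the authors' training code; β2 is not reported, so the common default of 0.999 is assumed, and the gradient value is a placeholder:

```python
def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9999, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, bias-corrected, then a scaled step on parameter theta."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v


# Placeholder scalar parameter and gradient for one illustrative step.
theta, m, v = 0.5, 0.0, 0.0
theta, m, v = adam_step(theta, grad=1.0, m=m, v=v, t=1)
```

Note that the reported β1 = 0.9999 gives a much longer gradient memory than Adam's usual default of 0.9, which smooths updates heavily over the course of training.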

Evaluation Metrics
We evaluate the combinations of datasets, latent space dimensions, and regularised dimensions in terms of standard measures of:
• Reconstruction Accuracy - how well the model can reconstruct a given input - as in Section 5.1.
• Loss - loss scores are calculated as the sum of the VAE loss (KL-divergence and reconstruction loss, typical of VAE architectures [54]) and the loss of the latent space regularisation [11,19,42] - as in Section 4.1. We aim to minimise this.
• Attribute Interpretability - how well a musical attribute can be predicted using only one LSR dimension in the latent space [11,42] - we aim to maximise this. We suggest that higher Interpretability scores contribute to better explainability as they indicate less entangled semantic dimensions cf. [55].

Results
Tables 5 and 6 show the results for the combinatorial experiment, including datasets (Muse Bach, Lakh Clean, Turkish Makam, Irish Folk), latent dimensions (4, 8, 16, 32, 64, 128, 256), regularised dimensions (ND, NR, RC, AIJ), Loss and Reconstruction Accuracy scores (Table 5), and musical attribute Interpretability scores (Table 6). Fig. 7 illustrates the Reconstruction Accuracy for MeasureVAE. MeasureVAE performed least well for the Turkish Makam dataset, which had the highest average Loss scores and lowest Reconstruction Accuracy scores. This may be due to the higher complexity of the music in the dataset, with high mean Note Density and Rhythmic Complexity and lower Average Interval Jump than other datasets. Moreover, similarly poor Reconstruction Accuracy and Loss scores were found for the Irish Folk dataset, which also has high mean ND, NR, and RC. The poor performance on the Turkish Makam dataset may also be due to it being the smallest of the datasets used in this experiment, or to the complex tonal features of Turkish Makam music [48] which may not be captured by the musical metrics used in this experiment. The results in Table 5 suggest that 2 regularised dimensions perform better than 4 for Loss and Reconstruction Accuracy scores, with the pair of ND&RC regularised dimensions performing better than NR&AIJ for both. The results also indicate that Reconstruction Accuracy and Loss scores improve up to a 32 dimensional latent space and plateau for larger latent spaces. The results also indicate that different datasets have different Interpretability scores for different musical attributes, though it is not possible at this stage to say whether these Interpretability scores are good or not cf. [55].
Number of regularised dimensions. For ease of inspection, Fig. 9 illustrates the average Interpretability scores for all attributes for each dataset with 4 and 2 regularised dimensions (ND&RC). As suggested in Fig. 9 and detailed in Table 6, Interpretability scores for each regularised attribute were in general higher for 2 regularised dimensions than 4, which is to be expected as it is easier to achieve successful and linearly independent regularisation in fewer dimensions. The exceptions to this are the Rhythmic Complexity Interpretability scores for the Muse Bach, Lakh Clean, and Irish Folk datasets. For the Muse Bach and Lakh Clean datasets the mean RC Interpretability scores were equal. The Irish Folk dataset's mean Rhythmic Complexity Interpretability scores are marginally higher for 4 regularised dimensions than 2, due to the performance for latent spaces of 64, 128, and 256 dimensions. Inspecting the Interpretability scores for each dimension, the data suggests that 2 regularised dimensions perform best compared to 4 regularised dimensions for Note Density and Average Interval Jump Interpretability scores. The higher mean Interpretability scores for 2 regularised dimensions versus 4 may be due to only regularising 2 dimensions rather than the nature of the regularisation itself. The poor performance of Rhythmic Complexity for the Irish Folk dataset may be a reflection of the higher Note Density, Note Range, and Rhythmic Complexity means for the Irish Folk dataset, or it may be a reflection of the larger dataset size.

Attribute Interpretability
Interpretability score performance. Rankings of Interpretability scores for datasets are not consistent across the number of dimensions in the latent space. For example, for 4 regularised dimensions the highest Rhythmic Complexity Interpretability score with 16 latent space dimensions is for the Lakh Clean dataset (0.974), whereas for 32 latent space dimensions the highest RC Interpretability score is for the Muse Bach dataset (0.985). For 4 regularised dimensions the highest scoring attribute Interpretability scores are consistent for 64, 128, and 256 dimensions, e.g. the Muse Bach dataset has the highest Note Range Interpretability scores for latent spaces with 64, 128, and 256 dimensions. For 2 regularised dimensions there is no consistently highest ranked attribute for Interpretability across the range of latent space dimensions.
Optimal configurations. Given that high Reconstruction Accuracy scores and low Loss scores are reached at a 32 dimensional latent space for both 2 and 4 regularised dimensions, and that rankings of Interpretability scores stabilise at 64 dimensions and above, the results suggest that a 32 or 64 dimensional latent space would be optimal when applying MeasureVAE across genres, as it minimises latent space size and Loss whilst maximising Reconstruction Accuracy and providing similar Interpretability scores to higher dimensional spaces. However, careful selection of latent space size is recommended when MeasureVAE is to be used to generate specific genres of music. For example, 16 or even 8 latent dimensions are likely to be optimal for Irish Folk music generation with 2 regularised dimensions, given that its best Interpretability performance is with an 8 dimensional latent space.
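The "smallest latent space past the plateau" reasoning above can be expressed as a simple selection rule. A sketch with clearly hypothetical Reconstruction Accuracy values (not the scores from Table 5):

```python
def smallest_plateau_dim(acc_by_dim, tolerance=0.01):
    """Return the smallest latent size whose accuracy is within
    `tolerance` of the best accuracy achieved at any size."""
    best = max(acc_by_dim.values())
    for dim in sorted(acc_by_dim):
        if acc_by_dim[dim] >= best - tolerance:
            return dim


# Hypothetical accuracies per latent dimension, rising then plateauing.
acc = {4: 0.61, 8: 0.74, 16: 0.86, 32: 0.95, 64: 0.96, 128: 0.96, 256: 0.96}
choice = smallest_plateau_dim(acc)  # 32 under these made-up numbers
```

In practice the choice would also weigh Loss and per-genre Interpretability scores, as discussed above, rather than Reconstruction Accuracy alone.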
Dataset performance. Taking the average Interpretability score across all latent space sizes, results suggest that for 4 regularised dimensions the Irish Folk dataset has the highest average Interpretability scores for ND and NR, whereas Muse Bach has the highest for RC and AIJ. It is worth noting that the Irish Folk dataset itself has the highest mean ND and NR and the second highest mean RC (Table 1), whereas the Muse Bach dataset has the second lowest mean RC and the lowest mean AIJ, suggesting that there is not a correlation between the musical attributes of the datasets and the average Interpretability scores of the MeasureVAE models. In terms of lowest average Interpretability scores, the Turkish Makam dataset has the lowest for ND, the Irish Folk dataset the lowest for RC, the Muse Bach and Turkish Makam datasets equally have the lowest for NR, and the Lakh Clean dataset has the lowest average AIJ Interpretability score.
For 2 regularised dimensions, the Turkish Makam dataset has the highest average ND and RC Interpretability scores, whereas Lakh Clean has the lowest ND, and the Irish Folk dataset has the lowest RC. Note that the Irish Folk dataset has the lowest Rhythmic Complexity Interpretability scores for both 2 and 4 regularised dimensions and for all sizes of latent spaces. Whilst the Irish Folk dataset has a high mean Rhythmic Complexity, this does not explain its poor RC Interpretability score, as the Turkish Makam dataset has the highest mean RC and also the highest RC Interpretability score for 2 regularised dimensions, suggesting that there is not a correlation between the musical attributes of the datasets and the Interpretability scores of the MeasureVAE models.
Fig. 9 and Table 6 indicate some anomalies in the Interpretability scores. For 2 regularised dimensions, there is an outlying Rhythmic Complexity Interpretability score for 128 latent dimensions. For 4 regularised dimensions in a 16 dimensional latent space there are outlying Interpretability scores for the Muse Bach dataset (ND and NR) and the Turkish Makam dataset (ND and RC). On investigation of the data no obvious reasons for these outlying results emerged. Instead, these anomalies suggest potentially inconsistent performance of MeasureVAE across datasets and latent dimension sizes, necessitating careful selection of latent space size for a given musical style.

Conclusions
This is the first time that two VAE models with semantic features for control of music generation have been systematically compared in terms of performance, latent space features, musical attributes, and training datasets. In doing this we help researchers to understand the properties of state-of-the-art generative models, and so help to inform generative model research and design by providing a detailed analysis of current systems. We found that MeasureVAE has higher Reconstruction Accuracy and Reconstruction Efficiency than AdversarialVAE but lower musical Attribute Independence (Section 5).
The results also show that MeasureVAE is capable of generating music across folk, pop, rock, jazz and blues, R&B, and classical music styles, and performs best with lower complexity musical styles such as pop and rock. Furthermore, results show that MeasureVAE was able to generate music across these genres with interpretable musical dimensions of control.
The MeasureVAE generated output was found to have different musical Interpretability scores for different datasets, but there was not a correlation between the musical features of the datasets and the related Interpretability scores of the generated music. For 4 regularised dimensions, the Irish Folk dataset has the highest average Interpretability scores for Note Density and Note Range, whereas Muse Bach has the highest Rhythmic Complexity and Average Interval Jump Interpretability scores. Interpretability metrics were in general higher when only two dimensions of the latent space were regularised. Similarly, Loss and Reconstruction Accuracy scores were better for two regularised dimensions than four. These findings are to be expected as it is easier to achieve successful and linearly independent regularisation in fewer dimensions. For Loss and Reconstruction Accuracy scores, MeasureVAE performed better with the pair of Note Density and Rhythmic Complexity regularised dimensions than when trained with Note Range and Average Interval Jump regularised dimensions. This may be because MeasureVAE is better at generating the tonal and rhythmic aspects of the music which are captured by ND and RC.
In terms of recommendations for use, results suggest that a 32 or 64 dimensional latent space would be optimal when using MeasureVAE to generate music across a range of genres, as this minimises latent space size whilst maximising reconstruction performance and providing similar Interpretability scores to those offered by higher dimensional spaces. However, careful selection of latent space size is required for generation of specific genres of music. For example, Irish Folk music may be optimally generated with a 16 or even 8 dimensional latent space.
These results show that when explainable features are added to the MeasureVAE system it performs well across genres of music generation. For XAI and the Arts more broadly, our approach demonstrates how complex AI models can be compared and explored in order to identify optimal configurations for the required styles of music generation. The work also demonstrates the complex relationships between datasets, explainable attributes, and AI model music generation performance. This complex relationship has some wider implications for generative AI models. For example, it highlights the bias built into models which makes them more amenable to certain datasets than others - a key concern of Human-Centred AI [56]. In our case the structure of MeasureVAE biased it towards lower complexity musical styles such as pop and rock at the expense of more complex forms of music such as Turkish Makam, which it is worth noting are more marginalised forms of music.
The research presented here is a first step and is limited in scope. Future research needs to explore the effect that other genres and datasets, dataset sizes, musical attributes, and training regimes have on the performance of explainable AI models. This would provide a more in-depth analysis of the landscape of generative AI models from which to inform future AI model research and design. For example, we chose two sets of musical attributes to use in this experiment based on frequently used attributes in research papers, but the utility of musical attributes to control music generation very much depends on the context of use. We also need to compare a wider range of generative models and explainability techniques across datasets and musical attributes to identify which combinations of explainable AI model and dataset offer the best generative performance for the musical features of interest to musicians - for example, using information dynamic measures to compare generative models following [29]. It would also be important to examine longer-term music generation such as song structure generation, e.g. [57], and to use subjective listening tests to better understand the quality of the music generated (ibid.). Exploring how the robustness and interpretability of the models tested could be improved, for example following [58], would be especially important for real-time music generation settings such as live performance. Moreover, it would be useful more broadly to explore how the evaluation approach deployed in this paper could be applied to other domains such as image generation. For example, comparative evaluations of image generating VAEs [37] could be undertaken to compare interpretable features as we have done in this paper, or our approach could be applied to comparing the effect of different interpretable features on the robustness of image generation [58] instead of music generation.
Finally, we need to start to explore how the explainable features of the models tested in this paper could be used to make more interactive generative systems that move beyond being an empirical tool for researching AI models to become more of a creative tool used in musical practice and performance. As a first step we will build the findings of this research into audio plugins which can be embedded into musicians' musical tool chains as part of their artistic practice, starting with a MIDI music processor [59].

Figs. 5 and 6
Example outputs of the MeasureVAE and AdversarialVAE models respectively for 256 latent dimensions: (a) shows the input notes for the AI model, and (b) to (e) show the melody produced by the model after interpolating one musical attribute at µ = +0.3 - (b) Note Range increase; (c) Note Density increase; (d) Rhythmic Complexity increase; (e) Average Interval Jump increase.

Table 1
Summary statistics of the datasets.

Table 2
The Reconstruction Accuracy of MeasureVAE and AdversarialVAE models on training, test and validation data from Irish Folk dataset.
Fig. 4
Comparative Reconstruction Efficiency for MeasureVAE and AdversarialVAE with 128 and 256 latent dimensions, as summarised in Table 3.

Table 3
The Mean and Standard Deviation of Reconstruction Efficiency for MeasureVAE and AdversarialVAE models with 128 and 256 latent dimensions.

Table 4
The Attribute Independence of MeasureVAE and AdversarialVAE models for Note Range, Note Density, Rhythmic Complexity and Average Interval Jump Attributes.

Table 5
Loss and Reconstruction Accuracy scores for MeasureVAE.

Table 6
Interpretability scores for MeasureVAE. Bold indicates the higher score between 2 and 4 regularised dimensions.