Anticipation-RNN: enforcing unary constraints in sequence generation, with application to interactive music generation

Recurrent neural networks (RNNs) are now widely used on sequence generation tasks due to their ability to learn long-range dependencies and to generate sequences of arbitrary length. However, their left-to-right generation procedure gives a potential user only limited control, which makes them unsuitable for interactive and creative uses such as interactive music generation. This article introduces a novel architecture called anticipation-RNN, which retains the strengths of RNN-based generative models while allowing user-defined unary constraints to be enforced. We demonstrate its efficiency on the task of generating melodies satisfying unary constraints in the style of the soprano parts of the J.S. Bach chorale harmonizations. Sampling using the anticipation-RNN is of the same order of complexity as sampling from the traditional RNN model. This fast and interactive generation of musical sequences opens ways to devise real-time systems that could be used for creative purposes.


Introduction
Recently, a number of powerful generative models on symbolic music have been proposed [14]. Although they now perform well on a variety of musical datasets, from monophonic folk music [29] to polyphonic Bach chorales [19], these models tend to face similar limitations: they do not provide musically interesting ways for a user to interact with them. Most of the time, only an input seed can be specified to condition the model: once the generation is finished, the user can only accept the result or regenerate new musical content. We believe that this restriction hinders creativity since the user does not play an active part in the music creation process.
Generation in these generative models is often performed from left to right; recurrent neural networks (RNNs) [10] are generally used to estimate the probability of generating the next musical event, and generation is done by iteratively sampling one musical event after another. This left-to-right modeling seems natural since music unfolds through time, and this holds both for monophonic [7,29] and polyphonic [5,19] music generation tasks. However, this does not match real compositional practice, since composition is mostly done in an iterative and non-sequential way [2]. As a simple example, one may want to generate a melody that ends on a specific note, but generating such melodies while staying in the learned style (i.e., sampling the melodies with the correct probabilities) is in general a non-trivial problem when generation is performed from left to right. This problem has been solved when the generative model is a Markov model [22] but remains hard when considering arbitrary RNNs.
In order to solve issues raised by the left-to-right sampling scheme, approaches based on MCMC methods have been proposed, in the context of monophonic sequences with shallow models [25] or on polyphonic musical pieces using deeper models [12,13]. Although these MCMC methods can generate musically convincing sequences while enforcing many user-defined constraints, the generation process is generally orders of magnitude slower than the simpler left-to-right generation scheme. This can prevent, for instance, using these models in real-time settings.
Another related approach is the one proposed in [18], where the authors address the problem of enforcing deterministic constraints on the output sequences. Their approach relies on performing gradient descent on a regularized objective that takes into account the number of constraints that are violated in the output sequence. They start from a pre-trained unconstrained model and then "nudge" its weights until it produces a valid sequence. Although their model is able to handle a wide range of constraints (such as requiring the output sequence to belong to a context-free language), it enforces these constraints using a costly procedure, namely stochastic gradient descent (SGD). Sequences are generated using the deterministic argmax decoding procedure while our sampling scheme is non-deterministic, which we believe can be a desired feature in the context of interactive music generation. The approach of [17] is similar to the latter approach in the sense that the authors enforce constraints via gradient descent. However, since they rely on convolutional restricted Boltzmann machines, their sampling scheme is no longer deterministic. Their method is a way to sample polyphonic music with some imposed high-level structure (repetitions, patterns), which is imposed through the prescription of some predefined autocorrelation matrix.
The particularity of our approach is that it focuses on a smaller subset of constraints, namely unary constraints, which allows our sampling scheme to be faster since the proposed model takes into account the set of constraints during the training phase instead of the generation phase.
Apart from the approaches cited above, the problem of generating sequences while enforcing user-defined constraints is rarely considered in the general machine learning literature, but it is of crucial importance when devising interactive generative models. In this article, we propose a neural network architecture called anticipation-RNN, which is capable of generating in the style learned from a database while enforcing user-defined unary constraints. Our model relies on two stacked RNNs: a constraint-RNN going from right to left, whose aim is to take into account future constraints, and a token-RNN going from left to right that generates the final output sequence. This architecture is very general and works with any RNN implementation. Furthermore, the generation process is fast as it only requires two neural network calls per musical event.
Even though the proposed architecture is composed of two RNNs going in opposite directions, it should not be confused with bidirectional RNN (BRNN) architectures [26], which are commonly used either to summarize an entire sequence as in [24] or in the context of supervised learning [11]. Although there have been attempts to use BRNNs in an unsupervised setting [4], these methods are intrinsically based on an MCMC sampling procedure, which makes them much slower than our proposed method. The idea of integrating future information to improve left-to-right generation using RNNs has been considered in the Variational Bi-LSTM architecture [28] and in the Twin Networks architecture [27]. The aim of these architectures is to regularize the hidden states of the RNNs so that they better model the data distribution. Although these ideas may appear similar to the ones developed in this paper, these two approaches do not consider the generation of sequences under constraints but are methods to improve existing RNN architectures.
The plan for this article is the following: in Sect. 2, we precisely state the problem we consider, and Sect. 3 describes the proposed architecture together with an adapted training procedure. Finally, we demonstrate experimentally the efficiency of our approach on the dataset of the chorale melodies by J.S. Bach in Sect. 4. In Sect. 5, we discuss the generality of our approach and future developments.
Code is available at https://github.com/Ghadjeres/Anticipation-RNN, and the musical examples presented in this article can be listened to on the accompanying Web site: https://sites.google.com/view/anticipation-rnn-examples/accueil.

Statement of the problem
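We consider an i.i.d. dataset $D := \{s = (s_1, \ldots, s_N) \in A^N\}$ of sequences of tokens $s_t \in A$ of arbitrary length $N$ over a vocabulary $A$. We are interested in probabilistic models over sequences $p(s)$ such that

$$p(s) = \prod_{t=1}^{N} p(s_t \mid s_{<t}), \tag{1}$$

where

$$s_{<t} = \begin{cases} (s_1, \ldots, s_{t-1}) & \text{for } t > 1, \\ \emptyset & \text{for } t = 1. \end{cases} \tag{2}$$

This means that the generative model $p(s)$ over sequences is defined using the conditional probabilities $p(s_t \mid s_{<t})$ only.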
Generation with this generative model is performed iteratively by sampling $s_t$ from $p(s_t \mid s_{<t})$ for $t = 1, \ldots, N$, where $N$ is arbitrary. Due to their simplicity and their efficiency, recurrent neural networks (RNNs) are used to model the conditional probability distributions $p(s_t \mid s_{<t})$: they make it possible to reuse the same neural network over the different time steps by introducing a hidden state vector that summarizes the previous observations we condition on. More precisely, writing $f$ for the RNN, $\mathrm{in}_t$ for its input, $\mathrm{out}_{t+1}$ for its output and $h_t$ for its hidden state at time $t$, we have

$$\mathrm{out}_{t+1},\, h_{t+1} = f(\mathrm{in}_t, h_t)$$

for all time indices $t$. When $\mathrm{in}_t = s_t$, the vector $\mathrm{out}_{t+1}$ is used to define $p(s_{t+1} \mid s_{<t+1})$ for all time indices $t$ without the need to take as an input the entire sequence history $s_{<t+1}$. Although this approach is successful in many applications, such a model can only be conditioned on the past, which prevents some possible creative uses: we can easily fix the beginning $s_{<t}$ of a sequence and generate a continuation $s_{\geq t} = (s_t, \ldots, s_N)$, but it becomes more intricate to fix the end $s_{\geq t}$ of a sequence and ask the model to generate a beginning.
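The left-to-right scheme above can be illustrated with a minimal Python sketch. The function `toy_rnn_step` is a hypothetical stand-in for the RNN $f$ (it is not a trained network and is not taken from the released code); it only shows how the hidden state carries the history, how sampling proceeds one token at a time, and how the log-probability of a sequence is the sum of the conditional log-probabilities as in (1).

```python
import math
import random

VOCAB = ["C4", "D4", "E4", "__"]  # illustrative vocabulary, an assumption
START = -1                        # id fed to the model before the first token


def toy_rnn_step(token_id, hidden):
    """Hypothetical stand-in for f: maps (in_t, h_t) to (out_{t+1}, h_{t+1}).

    The output is a probability distribution over VOCAB for the next token.
    """
    new_hidden = (31 * hidden + token_id + 2) % 97   # toy state update
    weights = [1 + (new_hidden + 7 * k) % 5 for k in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights], new_hidden


def sample_sequence(length, rng):
    """Sample token ids one after another, from left to right."""
    hidden, token_id, seq = 0, START, []
    for _ in range(length):
        probs, hidden = toy_rnn_step(token_id, hidden)
        token_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        seq.append(token_id)
    return seq


def log_prob(seq):
    """log p(s) as the sum over t of log p(s_t | s_<t), following Eq. (1)."""
    hidden, token_id, total = 0, START, 0.0
    for s_t in seq:
        probs, hidden = toy_rnn_step(token_id, hidden)
        total += math.log(probs[s_t])
        token_id = s_t
    return total
```

Because each step only needs the previous token and the hidden state, the cost of sampling is one call to `toy_rnn_step` per generated token.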
We now write $p_-(s)$ for the probability of a sequence $s$ when no constraint is set. For simplicity of notation, we will suppose that we only generate sequences of fixed length $N$ and denote by $S := A^N$ the set of all sequences over $A$. The aim of this article is to be able to enforce any set $C$ of unary constraints given by

$$C = \{(i, c_i)\}_{i \in I},$$

where $I$ is the set of constrained time indexes and $c_i \in A$ the value of the constrained note at time index $i$. Ideally, we want to sample constrained sequences with the "correct" probabilities. If we denote by $p_+(s \mid C)$ the probability of a sequence $s$ in the constrained model conditioned on a set of constraints $C$, this means that we want, for every set of constraints $C$:

$$p_+(s \mid C) = 0 \quad \forall s \notin S_+(C), \tag{6}$$

$$p_+(s \mid C) = \alpha\, p_-(s) \quad \forall s \in S_+(C), \tag{7}$$

where

$$S_+(C) := \{s \in S : s_i = c_i \ \ \forall (i, c_i) \in C\}, \qquad \alpha := \Big(\sum_{s \in S_+(C)} p_-(s)\Big)^{-1}.$$

In other words, each set of constraints $C$ defines a subset $S_+(C)$ of $S$ from which we want to sample using the probabilities (up to a normalization factor) given by $p_-$. However, sampling from $S_+(C)$ using the acceptance-rejection sampling method is not efficient due to the arbitrary number of constraints. Exact sampling from $S_+(C)$ is possible when the conditional probability distributions are modeled using models such as Markov models but is intractable in general. In the case of Markov models, this problem can in fact be solved exactly even for more complex constraints on the space of sequences, such as imposing the equality or the difference between two sequence symbols $s_i$ and $s_j$. Generalizations of this problem to other types of constraints are discussed in Sect. 5.
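To make the target distribution concrete, the following sketch computes $p_+(\cdot \mid C)$ exactly by brute-force enumeration on a tiny example. The unconstrained model `p_minus` is a hand-written first-order chain with made-up numbers, chosen purely for illustration; the vocabulary and the 0-based indexing of constraints are likewise assumptions of this sketch.

```python
import itertools

VOCAB = ["A", "B", "C"]  # illustrative vocabulary

# Made-up parameters of a toy unconstrained model p_minus.
INIT = {"A": 0.5, "B": 0.3, "C": 0.2}
TRANS = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"A": 0.4, "B": 0.2, "C": 0.4},
    "C": {"A": 0.7, "B": 0.2, "C": 0.1},
}


def p_minus(seq):
    """Probability of a sequence under the toy unconstrained model."""
    prob = INIT[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= TRANS[prev][cur]
    return prob


def constrained_distribution(length, constraints):
    """Exact p_plus(. | C) on S_plus(C), by enumerating all sequences.

    `constraints` maps a 0-based time index to the imposed token.
    """
    allowed = [
        s for s in itertools.product(VOCAB, repeat=length)
        if all(s[i] == c for i, c in constraints.items())
    ]
    normalizer = sum(p_minus(s) for s in allowed)   # equals 1 / alpha
    return {s: p_minus(s) / normalizer for s in allowed}
```

The enumeration visits every sequence of the given length, so its cost grows exponentially with the sequence length; this is only workable for toy sizes and illustrates why a dedicated model is needed.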

The model
The problem when trying to enforce a constraint $c := (i, c_i)$ is that imposing such a constraint on time index $i$ "twists" the conditional probability distributions $p_-(s_t \mid s_{<t})$ for $t < i$. However, the direct computation of $p_-(s_t \mid s_{<t}, s_i = c_i)$ (using Bayes' rule when only $p_-(s_t \mid s_{<t})$ is known) is computationally expensive.
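To see where the cost comes from, the Bayes' rule computation can be written out for a single constraint $(i, c_i)$ and $t < i$. The derivation below is a sketch that uses only the factorization (1):

```latex
\begin{align*}
p_-(s_t \mid s_{<t}, s_i = c_i)
  &= \frac{p_-(s_i = c_i \mid s_{<t}, s_t)\; p_-(s_t \mid s_{<t})}
          {\sum_{s'_t \in A} p_-(s_i = c_i \mid s_{<t}, s'_t)\; p_-(s'_t \mid s_{<t})}, \\
p_-(s_i = c_i \mid s_{<t}, s_t)
  &= \sum_{s_{t+1}, \ldots, s_{i-1}}
     \ \prod_{k=t+1}^{i} p_-(s_k \mid s_{<k}) \Big|_{s_i = c_i}.
\end{align*}
```

The inner sum ranges over $|A|^{\,i-t-1}$ intermediate sequences, each requiring its own RNN evaluations, so the cost grows exponentially with the distance between the current position and the constraint.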
The idea to overcome this issue is to introduce a neural network in order to summarize the set of constraints $C$. To this end, we introduce an additional token NC (no constraint) to $A$ indicating that no unary constraint is set at a given position. By doing this, we can rewrite the set $C$ as a sequence $c = (c_1, \ldots, c_N)$ where $c_i \in A \cup \{\mathrm{NC}\}$. We then introduce an RNN called constraint-RNN in order to summarize the sequence of all constraints. This RNN goes backward (from $c_N$ to $c_1$), and all its outputs are used to condition a second RNN called token-RNN.
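Rewriting a set of constraints as a sequence over the vocabulary extended with NC can be sketched as follows. The function name and the use of 0-based indices are assumptions made for this illustration.

```python
NC = "NC"  # the additional "no constraint" token


def constraints_to_sequence(constraints, length):
    """Turn a set C = {(i, c_i)} into a sequence (c_1, ..., c_N).

    `constraints` maps a 0-based time index to the imposed token;
    every unconstrained position receives the NC token.
    """
    sequence = [NC] * length
    for index, token in constraints.items():
        if not 0 <= index < length:
            raise IndexError(f"constraint index {index} outside the sequence")
        sequence[index] = token
    return sequence
```

For instance, imposing D4 on the first position and D5 on the fourth position of a five-step sequence gives `["D4", "NC", "NC", "D5", "NC"]`.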
This architecture, called anticipation-RNN since the token-RNN is conditioned on what may come next, is depicted in Fig. 1. We denote by $(o_1, \ldots, o_N)$ the output sequence of the constraint-RNN (for notational simplicity, we reversed the sequence numbering: the first output of the constraint-RNN is $o_N$ in our notation). The aim of the output vector $o_t$ is to summarize all information about constraints from time $t$ up to the end of the sequence. This vector is then concatenated to the input $s_{t-1}$ of the token-RNN at time index $t$, whose aim is to predict $s_t$.
Basically, this amounts to modeling the conditional probability distribution $p_+(s \mid c)$ using the following factorization:

$$p_+(s \mid c) = \prod_{t} p_+(s_t \mid s_{<t}, c_{\geq t}), \tag{8}$$

where $c_{\geq t}$ is defined similarly as in (2). Our approach differs from the approaches using Markov models in the sense that we directly train the conditional probability distribution (8) rather than trying to sample sequences in $S_+(C)$ using $p_-$: we want our probabilistic model to be able to directly enforce hard constraints.
The anticipation-RNN thus takes as an input both a sequence of tokens $(s_0, \ldots, s_{N-1})$ and a sequence of constraints $(c_1, \ldots, c_N)$ and has to predict the shifted sequence $(s_1, \ldots, s_N)$. The only requirement here is that the constraints have to be coherent with the sequence:

$$c_t = \mathrm{NC} \ \text{ or } \ c_t = s_t \quad \forall t.$$

Since we want our model to be able to deal with any unary constraints, we consider the dataset $D_+$ of couples of token sequences and constraint sequences such that

$$D_+ := \{(s, m(s)) : s \in D,\ m \in \{0, 1\}^N\}, \tag{9}$$

where $\{0, 1\}^N$ is the set of all binary masks: the sequence of constraints $m(s)$ is then defined as the sequence $(c_1, \ldots, c_N)$ with

$$c_i = \begin{cases} s_i & \text{if } m_i = 1, \\ \mathrm{NC} & \text{if } m_i = 0. \end{cases}$$

It is important to note that this model not only is able to handle unary constraints, but can also include additional metadata information about the sequence of tokens whose changes we have to anticipate. Indeed, by including such temporal information in the $c$ variables, this model can then learn to anticipate how to generate the tokens that will lead to a sequence complying with the provided metadata in a smooth way. These metadata can be musically relevant features such as the current key or mode, or the position of the cadences as it is done in [13].
This sampling procedure is fast since it only needs two RNN passes on the sequence.This modeling is thus particularly well suited for the real-time interactive generation of music.Furthermore, once the output of the constraint-RNN o is computed, sampling techniques usually applied in sequence generation tasks such as beam search [6,30] can be used without additional computing costs.
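The two-pass data flow can be sketched in plain Python. The functions `constraint_cell` and `token_cell` are hypothetical, untrained stand-ins for the two recurrent networks (the paper uses LSTMs), so the generated tokens will not satisfy the constraints here; the sketch only shows the backward constraint pass, the forward sampling pass, and the count of recurrent calls.

```python
import random

VOCAB = ["C4", "D4", "E4", "__"]  # illustrative vocabulary, an assumption
NC = "NC"


def constraint_cell(c_t, hidden):
    """Toy stand-in for one constraint-RNN step; returns the new state."""
    code = 0 if c_t == NC else VOCAB.index(c_t) + 1
    return (5 * hidden + code) % 101


def token_cell(prev_token, o_t, hidden):
    """Toy stand-in for one token-RNN step conditioned on o_t.

    Returns a distribution over VOCAB and the new hidden state.
    """
    prev_code = 0 if prev_token is None else VOCAB.index(prev_token) + 1
    new_hidden = (7 * hidden + 3 * prev_code + o_t) % 103
    weights = [1 + (new_hidden + 11 * k) % 6 for k in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights], new_hidden


def generate(constraints, rng):
    """Run the backward constraint pass, then sample from left to right."""
    n, calls = len(constraints), 0
    # Backward pass: o[t] depends only on constraints[t:].
    o, hidden = [0] * n, 0
    for t in reversed(range(n)):
        hidden = constraint_cell(constraints[t], hidden)
        o[t] = hidden
        calls += 1
    # Forward pass: the token model sees the previous token and o[t].
    seq, prev, hidden = [], None, 0
    for t in range(n):
        probs, hidden = token_cell(prev, o[t], hidden)
        prev = rng.choices(VOCAB, weights=probs, k=1)[0]
        seq.append(prev)
        calls += 1
    return seq, o, calls
```

In this sketch, a constraint placed on the last position changes every entry of `o`, including `o[0]`, which is how information about a future constraint becomes available at the first generation step. The backward pass is computed once, so the total is two recurrent calls per position.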

Dataset preprocessing
We evaluated our architecture on the dataset of the melodies from the four-part chorale harmonizations by J.S. Bach. This dataset is available in the music21 Python package [8], and we extracted the soprano parts from all 402 chorales that are in 4/4. In order to encode these monophonic sequences, we used the melodico-rhythmic encoding described in [13]. In this encoding, time is quantized using a sixteenth note as the smallest subdivision (each beat is divided into four equal parts). On each of these subdivisions, the real name of the note is used as a token if it is the subdivision on which the note is played; otherwise, an additional hold token is used to indicate that the current note is held. A "rest" token is also used in order to handle rests. An example of a melody encoded in this way is displayed in Fig. 2.
The advantage of using such an encoding is that it makes it possible to encode a monophonic musical sequence using only one sequence of tokens. Furthermore, it does not rely on the traditional MIDI pitch encoding but on the real note names: among other benefits, this makes it possible to generate music sheets which are immediately readable and understandable by a musician and contain no spelling mistakes. From a machine learning perspective, this has the effect of implicitly taking into account the current key and not throwing away this important piece of information. The model is thus more capable of generating coherent musical phrases. A simple example is that this encoding helps to distinguish between an E# and an F by considering them as two different notes. Indeed, these two notes would appear in contexts that are in different keys (in C# major or F# minor in the first case, in C major or F major in the second case, for instance).
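The encoding described above can be sketched as a small function. The spelling `"__"` for the hold token and the `(name, duration_in_beats)` input format are assumptions of this illustration; rests are left out because the text does not specify how a rest lasting several subdivisions is tokenized.

```python
HOLD = "__"  # assumed spelling of the hold token


def encode_melody(notes, subdivisions_per_beat=4):
    """Encode (note_name, duration_in_beats) pairs at sixteenth-note resolution.

    Each note contributes its name on the subdivision where it starts,
    followed by hold tokens for the remaining subdivisions of its duration.
    """
    tokens = []
    for name, beats in notes:
        exact = beats * subdivisions_per_beat
        steps = round(exact)
        if steps < 1 or abs(exact - steps) > 1e-9:
            raise ValueError(f"duration {beats} does not fit the time grid")
        tokens.append(name)
        tokens.extend([HOLD] * (steps - 1))
    return tokens
```

A quarter note D4 followed by two eighth notes F#4 and G4 becomes `["D4", "__", "__", "__", "F#4", "__", "G4", "__"]`: a single token sequence carries both pitch and rhythm.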
We also perform data augmentation by transposing all sequences in all possible keys as long as the transposed sequence lies within the original voice range.We end up with an alphabet of tokens A of size 125.

Implementation details
We used a two-layer stacked LSTM [15] for both the constraint-RNN and the token-RNN using the PyTorch [23] deep learning framework. Both LSTM networks have 256 units, and the constraint tokens $c_i$ and the input tokens $s_i$ are embedded using the same embedding of size 20. Sequences are padded with START and END symbols so that the model can learn when to start and when to finish. We add dropout on the input and between the LSTM layers and discuss the effect of the choice of these hyperparameters in Sect. 4.3. We found that adding dropout on the input is crucial and set this value to 20%.

Fig. 1 Anticipation-RNN architecture. The aim is to predict $(s_1, \ldots, s_N)$ given $(c_1, \ldots, c_N)$ and $(s_0, \ldots, s_{N-1})$

Fig. 2 Melodico-rhythmic encoding of the first bar of the melody of Fig. 8a. Each note name such as D4 or F#4 is considered as a single token
We fixed the length of the training subsequences to 20 beats, which means that, using the encoding described in Sect. 4.1, we consider sequences of tokens of size 80. The network is trained to minimize the categorical cross-entropy between the true token at position 40 and its prediction. For each training sequence, we sample the binary masks $m(s)$ of (9) by uniformly sampling a masking ratio $p \in [0, 1]$ and then setting each unary constraint with probability $p$.
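The mask sampling step can be sketched as follows; the function name is hypothetical and the snippet is not taken from the released code. A ratio is drawn uniformly, and each position is then revealed as a constraint independently with that probability.

```python
import random

NC = "NC"


def sample_constraints(tokens, rng):
    """Build a constraint sequence m(s) that is coherent with `tokens`.

    A masking ratio is drawn uniformly, then each position keeps its
    true token as a constraint with that probability and is NC otherwise.
    """
    ratio = rng.random()
    return [tok if rng.random() < ratio else NC for tok in tokens]
```

Drawing the ratio first, rather than using one fixed probability, exposes the model during training to sequences with very few constraints as well as to sequences that are almost fully constrained.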
We perform stochastic gradient descent using the Adam algorithm [16] using the default settings provided by PyTorch for ten epochs with a batch size of 256.In this setting, our best model achieves a validation accuracy of 92.9% with a validation loss of 0.22.These figures are of course highly dependent on our modeling choices such as the number of subdivisions per beat, the preprocessing of our corpus as well as the way we sampled the binary masks.
The sampling procedure is then done iteratively from left to right by sampling the token at time $t$ according to the probabilities given by $p_+(s_t \mid s_{<t}, o_t)$, where $s_{<t}$ is the sequence of previously generated tokens and $o_t$ the output of the constraint-RNN at position $t$.

Enforcing the constraints
We first check that the proposed architecture is able to enforce unary constraints; namely, that it fulfills the requirement (6).
In order to evaluate this property, we compute the proportion of constraints that are enforced for various sets of constraints $C$. We chose for these sets different "kinds" of constraints, from constraints that are in the "style of the corpus" to constraints that are totally "out-of-style." More precisely, we considered:
- $C_1$: the beginning and the ending of an existing chorale melody (first five bars of the chorale melody "Wer nur den lieben Gott läßt walten" with two ablated bars),
- $C_2$: the beginning and the ending of the same chorale melody, but where the ending has been transposed to a distant key (from G minor to C# minor),
- $C_3$: constraints forcing the model to make "big" leaps (chorale melodies tend to be composed of conjunct melodic motions),
- $C_4$: a chromatic ascending scale,
- $C_5$: random notes every eighth note,
- $C_6$: the same random notes as above, but every quarter note.
These sets of unary constraints are displayed in Fig. 3. We measure the influence of the amount of dropout that we use in our models (dropout between the LSTM layers) on the following task: for each set of constraints $C_i$ and for each model, we generate 1000 sequences using $p_+(\cdot \mid C_i)$ and compute the percentage of constrained notes that are sampled correctly. We report the results in Table 1.
These results show that for all sets of constraints that define a "possibly-in-style" musical constraint (constraint sets $C_1$ to $C_4$), the model manages to enforce the constraints efficiently, even when such constraints could not be encountered in the original dataset (constraint sets $C_2$ and $C_4$). On the contrary, for truly out-of-style constraints (constraint sets $C_5$ and $C_6$), the model performs poorly on the task of enforcing these constraints. We do not think that this is a drawback of the model, since its aim is to generate melodies in the style of the corpus, which is made impossible when constrained with these incoherent constraints.
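The evaluation metric, the percentage of constrained notes that are sampled correctly, can be sketched as below. The function name and the list-of-tokens data layout are assumptions of this illustration.

```python
NC = "NC"


def constraint_accuracy(samples, constraints):
    """Percentage of constrained positions reproduced by the samples.

    `samples` is a list of generated token sequences; `constraints` is
    a sequence of the same length with NC on unconstrained positions.
    """
    constrained = [i for i, c in enumerate(constraints) if c != NC]
    if not samples or not constrained:
        raise ValueError("need at least one sample and one constraint")
    hits = sum(
        seq[i] == constraints[i] for seq in samples for i in constrained
    )
    return 100.0 * hits / (len(samples) * len(constrained))
```

With two samples and two constrained positions, a result of 75.0 means that three of the four constrained notes were sampled as requested.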
Table 1 also reveals the non-trivial effects of the choice of the amount of dropout of the models upon their performance on this task.

Anticipation capabilities
While the preceding section demonstrated that the anticipation-RNN architecture is able to enforce a wide variety of sets of unary constraints, in this section we explore the role of the constraint-RNN and in particular how it learns to "propagate" the constraints backward, making the token-RNN able to anticipate what will come next.
For this, we evaluate how the constrained model deviates from the unconstrained model. We compare the constrained model $p_+$ on the same set of constraints $C$ with its unconstrained counterpart $p_-$. The latter is obtained by conditioning the model of Fig. 1 on a sequence of constraints in the special case where no constraint is set: the sequence of constraints is $(\mathrm{NC}, \ldots, \mathrm{NC})$.
More precisely, for a set of constraints $C$, we quantify how the probability distributions $p_+(\cdot \mid s_{<t}, C)$ differ from the probability distributions $p_-(\cdot \mid s_{<t})$ by computing how dissimilar they are. We chose as a measure of dissimilarity [3,9] the Jensen-Shannon divergence [1,20], which is defined by

$$\mathrm{JS}(p \,\|\, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m),$$

where $m = \frac{p + q}{2}$, with KL denoting the Kullback-Leibler divergence

$$\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}.$$

The Jensen-Shannon divergence has the property of being symmetric and bounded (and thus always well defined), and its square root satisfies the triangle inequality [21], which is an important feature compared to other divergences. In Fig. 4, we plot the evolution of the Jensen-Shannon divergence between the two distributions $p_+(\cdot \mid s_{<t}, C)$ and $p_-(\cdot \mid s_{<t})$ during generation for different sets of constraints $C$. We generated 1000 sequences using the constrained model and computed the average Jensen-Shannon divergence between the two models for each time step. We then averaged the values over each beat in order not to take into account the intra-beat variations. Indeed, due to the encoding we chose as well as to the singularity of the musical data we considered, patterns of oscillations appear: both models agree in putting much of their probability mass on the hold symbol on the second sixteenth note of each beat, since the soprano parts in Bach chorales are mostly composed of half notes, quarter notes and eighth notes. This is independent of the presence or absence of constraints, so the constrained and unconstrained models make similar predictions on these time steps, resulting in a low divergence. This plot confirms that the constraints are propagated backward in time and that the token-RNN is not only able to enforce constraints but also able to anticipate how to do so.

Fig. 3 Sets of constraints $C_i$ described in Sect. 4.3, for $i$ ranging from 1 to 5. In this particular figure, rests denote the absence of constraints

Table 1 Percentage of correctly sampled constrained notes for different models $p_+$ differing only by the amount of dropout they use and constrained on different sets of constraints
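The Jensen-Shannon divergence used in this section can be computed with a few lines of Python. This is a minimal sketch for discrete distributions given as lists of probabilities over the same vocabulary, using natural logarithms; terms with zero probability contribute nothing to the sums.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, using the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

The mixture is positive wherever either distribution is, so the divergence stays finite even when the constrained model puts all of its mass on a single token; with natural logarithms its maximum value is $\log 2$.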
We now illustrate this feature on a specific example. Figure 5 shows the evolution of $p_+(s_t \mid s_{<t}, C_3)$ and $p_-(s_t \mid s_{<t})$ during generation. It is interesting to note that the conditional probability distributions returned by $p_+(s_t \mid s_{<t}, C_3)$ are more concentrated on specific values than the ones returned by $p_-(s_t \mid s_{<t})$. The concentration of all the probability mass of $p_+(s_t \mid s_{<t}, C_3)$ on constrained notes confirms, on this specific example, that the proposed architecture has learned to enforce hard unary constraints.
In order to understand the effect of the constraints, we plot the difference between the two distributions of Fig. 5 for each time step in Fig. 6. This highlights the fact that the probability mass distribution of $p_+$ is "shifted upward" a few beats in advance when the next unary constraint is higher than the current note and "downward" in the opposite case.

Sampling with the correct probabilities
We now check whether sampling using $p_+$ fulfills the requirement (7). This means that for any set of constraints $C$, the ratio between the probabilities of two sequences in $S_+(C)$ is identical whether the probabilities are computed using the unconstrained model $p_-(\cdot)$ or using the constrained model $p_+(\cdot \mid C)$. We introduce the set of constraints $C_0$ consisting of a single constrained note.
For a given set of constraints $C$, we generated 500 sequences and verified that the requirement (6) is fulfilled for all of these sequences (i.e., all constraints are enforced). In order to check the fulfillment of the requirement (7), we plot for each sequence $s$ its probability in the constrained model $p_+(s)$ (defined as in Eq. 1) as a function of $p_-(s)$ in logarithmic space. We compute these probabilities using (8), but only keep the time steps on which notes are not constrained. The resulting plots are shown in Fig. 7. Table 2 quantifies the extent to which the two distributions are proportional on the subsets $S_+(C_i)$ for different sets of constraints $C_i$ and for different models.
The translation in logarithmic space indicates the proportionality between the two distributions as desired.
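The link between proportionality and translation is that $p_+(s \mid C) = \alpha\, p_-(s)$ implies $\ln p_+(s \mid C) = \ln p_-(s) + \ln \alpha$, a line of slope one shifted by $\ln \alpha$. The sketch below fits such a line by ordinary least squares; the text does not state how the linear interpolation was computed, so the fitting method is an assumption, and the function name is hypothetical.

```python
import math


def log_log_fit(p_minus_values, p_plus_values):
    """Least-squares line through the points (ln p_minus, ln p_plus).

    Returns (slope, intercept). A slope close to one indicates that the
    two models are proportional on the sampled sequences.
    """
    xs = [math.log(v) for v in p_minus_values]
    ys = [math.log(v) for v in p_plus_values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

When the constrained probabilities are an exact multiple of the unconstrained ones, the fitted slope is one and the intercept is the logarithm of the multiplicative factor.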
The conclusion is that our model is able to correctly enforce all constraints for sets of constraints that are plausible with respect to the training dataset (Sect. 4.3) and that on these specific sets of constraints, our sampling procedure respects the relative probabilities between the sequences. In other words, the anticipation-RNN is able to sample with the correct probabilities a subset of sequences defined by a set of unary constraints.

Musical examples
We end this section with a discussion of some generated constrained sequences. Figure 8 shows examples of the enforcement and the propagation of the constraints for the set of constraints $C_3$: even if generation is done from left to right, the model is able to generate compelling musical phrases while enforcing the constraints. In particular, we see that the model is able to "anticipate" the moment when it has to "go" from a low-pitched note to a high-pitched one and vice versa. The use of the melodico-rhythmic encoding makes it possible to impose only that a note should be played at a given time, without specifying its rhythm. It is interesting to note that such a wide melodic contour (going from a D4 to a D5 and then going back to a D4 in only two bars) is unusual for a chorale melody. Nonetheless, the proposed model is able to generate a convincing Bach-like chorale melody. The three displayed examples show that there is great variability in the generated solutions: even when constrained on the same set of constraints, the generated melodies have distinct characteristics such as, for example, the key they are in or where cadences could be.
Similarly to [13], we provide a plug-in for the MuseScore music score editor, which makes it possible to call the anticipation-RNN in an intuitive way.

Conclusion
We presented the anticipation-RNN, a simple but efficient way to generate sequences in a learned style while enforcing unary constraints.This method is general and can be used to improve many existing RNN-based generative models.Contrary to other approaches, we teach the model to learn to enforce hard constraints at training time.We believe that this approach is a first step toward the generation of musical sequences subjected to more complex constraints.
The constrained generation procedure is fast since it requires only 2N RNN calls, where N is the length of the generated sequence; as it does not require extensive computational resources and provides an interesting user-machine interaction, we think that this architecture paves the way to the development of creative real-time composition software.We also think that this fast sampling could be used jointly with MCMC methods in order to provide fast initializations.
Our approach can be seen as a general way to condition RNN models on time-dependent metadata. Indeed, the variable $c$ in (8) is not restricted to the value of the unary constraints, but can contain more information such as the location of the cadences or the current key. We successfully applied the anticipation-RNN in this setting and report that it manages to enforce these interesting and natural musical constraints in a smooth way while staying in the style of the training corpus. Future work will aim at handling other types of constraints (imposing the rhythm of the sequences, enforcing the equality between two notes or introducing soft constraints) and developing responsive user interfaces so that all the possibilities offered by this architecture can be used by a wide audience.

Fig. 6 Difference between $p_+(s_t \mid s_{<t})$ and $p_-(s_t \mid s_{<t})$ as a function of $t$ during the generation of the melody displayed in Fig. 8a. Beats on which a constraint is set are circled. We see that between beats 9-13, the probability mass of the constrained model $p_+$ is shifted upward (compared to the probability distribution given by the unconstrained model $p_-$) in order to enforce the unary constraint D5 set at beat 13. From beats 13-17, the situation is reversed: the probability mass of the constrained model $p_+$ is shifted downward in order to enforce the unary constraint D4 on beat 17

Table 2 (caption, continued) The closer the values are to one, the better the requirement (7) is achieved

Fig. 4 Plot of the evolution of the Jensen-Shannon divergence $\mathrm{JS}(p_+(s_t \mid s_{i<t}, C_i) \,\|\, p_-(s_t \mid s_{i<t}))$ between the constrained model $p_+$ and the unconstrained model $p_-$ during generation for two sets of constraints $C_1$ and $C_3$. Each point represents the average value of the

Fig. 5 Plot of $p(s_t \mid s_{<t})$ as a function of $t$ during the generation of the melody displayed in Fig. 8a in the constrained and unconstrained cases. Beats on which a constraint is set are circled. a Constrained case: $p = p_+(\cdot \mid C_3)$. b Unconstrained case: $p = p_-$

Fig. 7 Point plots in logarithmic scale of $\ln p_+(s \mid C_i)$ (y-axis) versus $p_-(s)$ (x-axis) on a set of 500 sequences generated using $p_+(s \mid C_i)$, for $C_0$ and $C_3$. The identity map is displayed in red and the linear

Fig. 8 Examples of generated sequences in the style of the soprano parts of the J.S. Bach chorales. All examples are subject to the same set of unary constraints $C_3$, which is indicated using green notes (color figure online)

Table 2 Slopes of the linear interpolations displayed in Fig. 7 for different models and different sets of constraints $C_i$