# Expressive Performance Rendering with Probabilistic Models

## Abstract

We present YQX, a probabilistic performance rendering system based on Bayesian network theory. It models dependencies between score and performance and predicts performance characteristics using information extracted from the score. We discuss the basic system that won the Rendering Contest RENCON 2008 and then present several extensions, two of which aim to incorporate the current performance context into the prediction, resulting in more stable and consistent predictions. Furthermore, we describe the first steps towards a multilevel prediction model: Segmentation of the work, decomposition of tempo trajectories, and combination of different prediction models form the basis for a hierarchical prediction system. The algorithms are evaluated and compared using two very large data sets of human piano performances: 13 complete Mozart sonatas and the complete works for solo piano by Chopin.

## 3.1 Introduction

Expressive performance modelling is the task of automatically generating an expressive rendering of a music score such that the performance produced sounds both musical and natural. This is done by first modelling the score or certain structural and musical characteristics of it. Then the score model is projected onto performance trajectories (for timing, dynamics, etc.) by a predictive model typically learned from a large set of example performances.

Unlike models in, for instance, rule-based or case-based approaches, the probabilistic performance model is regarded as a conditional multivariate probability distribution. The models differ in the way the mapping between score and performance model is achieved. Gaussian processes [31], hierarchical hidden Markov models [10], and Bayesian networks [38] are some of the techniques used.

Aside from the central problem of mapping the score to the performance, the main challenges in the process are acquisition and annotation of suitable example performances and the evaluation of the results. The data must encompass both precise performance data and score information and must be sufficiently large to be statistically representative. The level of precision required cannot yet be achieved through analysis of audio data, which leaves computer-controlled pianos, such as the Bösendorfer CEUS or the Yamaha Disklavier, as the main data source. For the training of our system, we have available two datasets recorded on such a device: 13 complete Mozart sonatas, performed by the Viennese pianist R. Batik in 1990, and the complete works for solo piano by Chopin, performed by the Russian pianist N. Magaloff in several live performances at the Vienna Konzerthaus in 1989.

Judging expressivity in terms of “humanness” and “naturalness” is a highly subjective task. The only scientific environment for comparing models according to such criteria is the annual Performance Rendering Contest RENCON [11], which offers a platform for presenting and evaluating, via listener ratings, state-of-the-art performance modelling systems. Alternatively, rendering systems can be evaluated automatically by measuring the similarity between rendered and real performances of a piece. This, however, is problematic: In some situations, small differences may make the result sound unintuitive and completely unmusical, whereas in other situations, a rendering may be reasonable despite huge differences.

In this chapter, we discuss a prototypical performance rendering system and its different stages: The basic system was entered successfully into the RENCON 2008 rendering contest. Several extensions have been developed that shed light on the problems and difficulties of probabilistic performance rendering.

## 3.2 Related Work

Systems can be compared in terms of two main components: the score representation and the learning and prediction model. The way expressive directives given in the score are rendered also makes a considerable difference in rendered performances, but this is beyond the scope of this chapter.

Score models – i.e., representations of the music and its structure – may be based either (1) on a sophisticated music theory such as Lerdahl and Jackendoff’s Generative Theory of Tonal Music (GTTM) [15] and Narmour’s Implication–Realization (IR) model [22] or (2) simply on basic features capturing some local score characteristics (see, e.g., [7, 9, 10, 31, 35]). Many current models work with a combination of computationally inexpensive descriptive score features and particular structural aspects – mainly phrasal information or simplifications thereof – that are calculated via musicological models. Examples are the models of Arcos and de Mántaras [1], who partially apply the GTTM, and our system [4, 38], which implements parts of the IR model to approximate the phrasal structure of the score.

Regarding the learning and prediction models used, three different categories can be distinguished [36]: case-based reasoning (CBR), rule extraction, and probabilistic approaches. Case-based approaches use a database of example performances of music segments. New segments are played imitating stored ones on the basis of a distance metric between score models. Prototypical case-based performance models are *SaxEx* [1] and *Kagurame Phase II* [29]. In [32, 37], a structurally similar system is described that is based on a hierarchical phrase segmentation of the music score. The results are exceedingly good, but the approach is limited to small-scale experiments, as the problem of algorithmic phrase detection is still not solved in a satisfactory way. Dorard et al. [2] used Kernel methods to connect their score model to a corpus of performance worms, aiming to reproduce the style of certain performers.

Rule-based systems use a matching process to map score features directly to performance modifications. Widmer [35] developed an inductive rule learning algorithm that automatically extracts performance rules from piano performances; it discovered a small set of rules that cover a surprisingly large amount of expressivity in the data. Our YQX system uses some of these rules in combination with a probabilistic approach. Ramirez et al. [25] followed a similar approach using inductive logic programming to learn performance rules for jazz saxophone from audio recordings. Perez et al. [24] used a similar technique on violin recordings. The well-known KTH rule system was first introduced in [28] and has been extended in more than 20 years of research. A comprehensive description is given in [7]. The *Director Musices* system is an implementation of the KTH system that allows for expressive performance rendering of musical scores. The rules in the system refer to low-level musical situations and theoretical concepts and relate them to predictions of timing, dynamics, and articulation.

The performance model in probabilistic approaches is regarded as a multivariate probability distribution onto which the score model is mapped. The approaches differ in how the mapping is achieved. The Naist model [31] applies Gaussian processes to fit a parametric output function to the training performances. YQX [4] uses Bayesian network theory to model the interaction between score and performance. In addition, a small set of note-level rules adapted from Widmer’s rule-based system are applied to further enhance musical quality. Grindlay and Helmbold first proposed a hidden Markov model (HMM) [9], which was later extended to form a hierarchical HMM [10], the advantage of which is that phrase information is coded into the structure of the model. All approaches mentioned above learn a monophonic performance model, predict the melody voice of the piece, and, in the rendering, synchronize the accompaniment according to the expressivity in the lead voice. Kim et al. [13] proposed a model of three sub-models: local expressivity models for the two outer voices (highest and lowest pitch of any given onset) and a harmony model for the inner voices.

Mazzola follows a different concept, building on a complex mathematical theory of musical structure [16]. The *Rubato* system [17, 19] is an implementation of this model.

## 3.3 The Data

Probabilistic models are usually learned from large sets of example data. For expressive performance modelling, the data must provide information on what was played (score information) and how it was played (performance information). The richness of available score information limits the level of sophistication of the score model: The more score information is provided, the more detailed a score model can be calculated. Piano performances can be described adequately by three dimensions: tempo, loudness, and articulation (our current model ignores pedalling). However, the information cannot be extracted from audio recordings at the necessary level of precision. This leaves a computer-controlled piano, such as the Bösendorfer CEUS (or the earlier version, the Bösendorfer SE) or a Yamaha Disklavier, as the only possible data source. This, of course, poses further problems. The number of recordings made on such devices is very small. Since such instruments are not normally used in recording studios or in public performances, the majority of available recordings stem from a scientific environment and do not feature world-class artists.

For our experiments, we use two unique data collections: the *Magaloff Corpus* and a collection of Mozart piano sonatas. In Spring 1989, Nikita Magaloff, a Russian–Georgian pianist famous for his Chopin interpretations, performed the entire work of Chopin for solo piano that was published during Chopin’s lifetime (op. 1–64) in six public appearances at the Vienna Konzerthaus. Although the technology was fairly new at the time (first prototype in 1983, official release 1985 [20]), all six concerts were played and recorded on a Bösendorfer SE, precisely capturing every single keystroke and pedal movement. This was probably the first time the Bösendorfer SE was used to such an extent. The collected data is presumably the most comprehensive single-performer corpus ever recorded. The data set comprises more than 150 pieces with over 320,000 performed notes. We scanned the sheet music of all pieces and transformed it into machine-readable, symbolic scores in musicXML [27] format using the optical music recognition software SharpEye.^{1} The MIDI data from the recordings were then matched semi-automatically to the symbolic scores. The result is a completely annotated corpus containing precisely measured performance data for almost all notes Chopin has ever written for solo piano.^{2} Flossmann et al. [5] provided a comprehensive overview of the corpus, its construction, and results of initial exploratory studies of aspects of Magaloff’s performance style.

Overview of the data corpora

| | Magaloff corpus | Mozart corpus |
|---|---|---|
| Pieces/movements | 155 | 39 |
| Score notes | 328,800 | 100,689 |
| Performed notes | 335,542 | 104,497 |
| Playing time | 10 h 7 m 52 s | 3 h 57 m 40 s |

## 3.4 Score and Performance Model

As indicated above, our rendering system is based on a score model comprising simple score descriptors (the *features*) and a musicological model – the Implication–Realization model by Narmour. Performances are characterized in three dimensions: tempo, loudness, and articulation. The way the performance characteristics (the *targets*) are defined has a large impact on the quality of the rendered pieces.

The prediction is done note by note for the melody voice of the piece only. In the Mozart sonatas, we manually annotated the melody voice in all pieces. In the case of the Chopin data, we assume that the highest pitch at any given time is the melody voice of the piece. Clearly, this very simple heuristic does not always hold true, but in the case of Chopin, it is correct often enough to be justifiable.
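The highest-pitch heuristic used for the Chopin data can be sketched in a few lines of Python. The `(onset, pitch)` pair representation is an assumption for illustration, not the corpus format:

```python
def melody_voice(notes):
    """Select the melody voice with the highest-pitch heuristic:
    for each score onset, keep only the note with the highest pitch.
    `notes` is a list of (onset_in_beats, midi_pitch) pairs."""
    by_onset = {}
    for onset, pitch in notes:
        if onset not in by_onset or pitch > by_onset[onset]:
            by_onset[onset] = pitch
    return [(onset, by_onset[onset]) for onset in sorted(by_onset)]
```

For example, a chord `(0, 60)`, `(0, 72)` followed by a single note `(1, 64)` reduces to the melody `(0, 72)`, `(1, 64)`.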

### 3.4.1 Performance Targets

Tempo in musical performances usually refers to a combination of two aspects: (1) the current tempo of a performance that evolves slowly and changes according to ritardandi or accelerandi; (2) the tempo of individual notes, often referred to as *local timing*, i.e., local deviations from the current tempo, used to emphasize single notes through delay or anticipation. Tempo is often measured in absolute beats per minute. We define the tempo relative to interonset intervals (IOI), i.e., the time between two successive notes. A performed IOI that is longer than prescribed by the score and the current tempo implies a slowing down, while a shorter IOI implies a speeding up. Thus, the description is independent of the absolute tempo and focuses on changes.

Loudness is not measured in absolute terms but relative to the overall loudness of the performance. Articulation describes the amount of legato between two successive notes: The smaller the audible gap between two successive notes, the more legato the first one becomes; the larger the gap, the more staccato.

**Timing:** The timing of a performance is measured in *interonset intervals* (IOIs), i.e., the time between two successive notes. The *IOI ratio* of a note relates the nominal score IOI and the performance IOI to the subsequent note. This indicates whether the next onset occurred earlier or later than prescribed by the score and thus also whether the previous note was shortened or lengthened. Let *s*_{i} and *s*_{i+1} be two successive score notes, *p*_{i} and *p*_{i+1} the corresponding notes in the performance, *ioi*_{i,i+1}^{s} the score IOI and *ioi*_{i,i+1}^{p} the performance IOI of the two notes,^{3} *l*_{s} the duration of the complete piece in beats, and *l*_{p} the length of the performance. The IOI ratio *ioiR*_{i} of *s*_{i} is then defined as
$$ioiR_{i} = \log \frac{ioi_{i,i+1}^{p} \ast l_{s}}{ioi_{i,i+1}^{s} \ast l_{p}}.$$
Normalizing both score and performance IOIs to fractions of the complete score and performance makes this measure independent of the actual tempo. The logarithm is used to scale the values to a range symmetrical around zero, where

*ioiR*_{i} > 0 indicates a prolonged IOI, i.e., a tempo slower than notated, and *ioiR*_{i} < 0 indicates a shortened IOI, i.e., a tempo faster than notated.

**Split tempo and timing:** It can be beneficial to divide the combined tempo into current tempo and local timing. The current tempo is defined as the lower-frequency components of the IOI ratio time series. A simple way of calculating the low-frequency component is to apply a windowed moving average function to the curve. The residual is considered the local timing. Let *ioiR*_{i} be the IOI ratio of note *s*_{i} and *n* ∈ ℕ (usually 5 ≤ *n* ≤ 10); the current tempo *ct*_{i} of the note *s*_{i} is then calculated by
$$ct_{i} = \frac{1}{n}\sum\limits_{j=i-\frac{n-1}{2}}^{i+\frac{n-1}{2}} ioiR_{j}.$$
The residual high-frequency content can be considered the local timing *lt*_{i}, which indicates that a note is played faster or slower with respect to the current tempo:
$$lt_{i} = \frac{ioiR_{i} - ct_{i}}{ct_{i}}.$$

**Loudness:** The loudness, also referred to as velocity,^{4} of a performance is described as the ratio between the loudness of a note and the mean loudness of the performance. Again, the logarithm is used to scale the values to a range symmetrical around zero, with values above 0 being louder than average and those below 0 softer than average. Let *mvel*_{i} be the MIDI velocity of note *s*_{i} and *N* the number of performed notes. The loudness *vel*_{i} is then calculated by
$$vel_{i} = \log \frac{mvel_{i}}{\frac{1}{N}\sum_{j} mvel_{j}}.$$

**Articulation:** Articulation measures the amount of legato between two notes, i.e., the ratio of the gap between them in a performance and the gap between them in the score. Let *ioi*_{i,i+1}^{s} and *ioi*_{i,i+1}^{p} be the score and performance IOIs between the successive notes *s*_{i} and *s*_{i+1}, and *dur*_{i}^{s} and *dur*_{i}^{p} the nominal score duration and the played duration of *s*_{i}, respectively. The articulation *art*_{i} of a note *s*_{i} is defined as
$$art_{i} = \frac{ioi_{i,i+1}^{s} \ast dur_{i}^{p}}{dur_{i}^{s} \ast ioi_{i,i+1}^{p}}.$$
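The four target definitions above can be sketched in Python under a hypothetical list-based representation (score onsets in beats, performance onsets in seconds). Function names are illustrative, and edge cases (zero current tempo, grace notes, chords) are ignored:

```python
import math

def ioi_ratios(score_onsets, perf_onsets):
    """IOI ratio per note: log of the performance IOI over the score IOI,
    each normalized by the total performance/score length, so the measure
    is independent of the absolute tempo."""
    l_s = score_onsets[-1] - score_onsets[0]   # score length in beats
    l_p = perf_onsets[-1] - perf_onsets[0]     # performance length in seconds
    return [math.log((perf_onsets[i + 1] - perf_onsets[i]) * l_s /
                     ((score_onsets[i + 1] - score_onsets[i]) * l_p))
            for i in range(len(score_onsets) - 1)]

def current_tempo(ioi_r, n=5):
    """Low-frequency component: centred moving average of (odd) width n."""
    h = (n - 1) // 2
    return [sum(ioi_r[max(0, i - h):i + h + 1]) /
            len(ioi_r[max(0, i - h):i + h + 1]) for i in range(len(ioi_r))]

def local_timing(ioi_r, ct):
    """Residual relative to the current tempo: (ioiR_i - ct_i) / ct_i."""
    return [(r - c) / c for r, c in zip(ioi_r, ct)]

def loudness(mvel):
    """Log ratio of each note's MIDI velocity to the mean velocity."""
    mean = sum(mvel) / len(mvel)
    return [math.log(v / mean) for v in mvel]

def articulation(score_onsets, score_durs, perf_onsets, perf_durs):
    """art_i = (score IOI * performed duration) / (score duration * perf IOI)."""
    return [((score_onsets[i + 1] - score_onsets[i]) * perf_durs[i]) /
            (score_durs[i] * (perf_onsets[i + 1] - perf_onsets[i]))
            for i in range(len(score_onsets) - 1)]
```

With score onsets `[0, 1, 2]` and performed onsets `[0.0, 0.6, 1.0]`, the first IOI ratio is positive (slower than notated) and the second negative (faster), as the definitions require.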

### 3.4.2 Score Features

As briefly mentioned above, there are basically two ways of modelling a musical score: using (1) sophisticated musicological models, such as implementations of the GTTM [15] or Narmour’s Implication–Realization model [22], and (2) feature-based descriptors of the musical content. We use a combination of both approaches.

**Rhythmic features** describe the relations of score durations of successive notes and their rhythmic context. In our system, we use:

**Duration ratio:** Let *dur*_{i} be the score duration of note *s*_{i} measured in beats; the duration ratio *durR*_{i} is then defined by
$$durR_{i} = \frac{dur_{i}}{dur_{i+1}}.$$

**Rhythmic context:** The score durations of notes *s*_{i−1}, *s*_{i}, and *s*_{i+1} are sorted and assigned three different labels: short (s), neutral (n), and long (l). When a rest immediately before (and/or after) *s*_{i} is longer than half the duration of *s*_{i−1} (and/or *s*_{i+1}), the respective labels are replaced with (−). The rhythmic context *rhyC*_{i} of *s*_{i} is then one of the 20 possible label triplets.^{5}

**Melodic features:** Melodic features describe the melodic content of the score, mainly pitch intervals and contours.

**Pitch interval:** The interval to the next score note, measured in semitones. The values are cut off at −13 and +13 so that all intervals greater than one octave are treated identically.

**Pitch contour:** The series of pitch intervals is smoothed to determine the distance of a score note to the next maximum or minimum pitch in the melody. The smoothing is needed to avoid choosing a local minimum/maximum.

**IR features:** One category of features is based on Narmour’s Implication–Realization model of melodic expectation [22]. The theory constitutes an alternative to Schenkerian analysis and is focused more on cognitive aspects than on musical analysis. A short overview is given in Sect. 3.4.3. We use the labels assigned to each melody note and the distance of a melody note to the nearest point of closure as score features.

**Harmonic consonance:** Harmonic features describe perceptual aspects related to melodic consonance. Using Temperley’s key profiles [30], we automatically determine the most likely local harmony given the pitches at a particular onset. The consonance of a note within an estimated harmony is judged using the key profiles proposed by Krumhansl and Kessler [14].
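The simpler of these descriptors are easy to compute directly. A Python sketch with an assumed list-based note representation follows; the rest-handling rule for the rhythmic context and the smoothing step for the pitch contour are omitted, so this is a simplified illustration rather than the full feature set:

```python
def duration_ratio(durs):
    """durR_i = dur_i / dur_{i+1} for successive score durations (in beats)."""
    return [durs[i] / durs[i + 1] for i in range(len(durs) - 1)]

def pitch_interval(pitches, clip=13):
    """Interval to the next note in semitones, clipped to +/-13 so that
    all intervals greater than one octave are treated identically."""
    return [max(-clip, min(clip, pitches[i + 1] - pitches[i]))
            for i in range(len(pitches) - 1)]

def rhythmic_context(durs):
    """Label each inner note's (prev, cur, next) durations as short (s),
    neutral (n), or long (l).  The rest-replacement rule is omitted."""
    labels = []
    for i in range(1, len(durs) - 1):
        triple = durs[i - 1:i + 2]
        uniq = sorted(set(triple))
        if len(uniq) == 1:
            lab = {uniq[0]: 'n'}
        elif len(uniq) == 2:
            lab = {uniq[0]: 's', uniq[1]: 'l'}
        else:
            lab = {uniq[0]: 's', uniq[1]: 'n', uniq[2]: 'l'}
        labels.append(''.join(lab[d] for d in triple))
    return labels
```

For instance, durations `[1, 2, 4]` yield the context label `snl`, and an interval of 15 semitones is clipped to 13.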

### 3.4.3 Narmour’s Implication–Realization (IR) Model

The Implication-Realization (IR) model proposed by Narmour [22, 23] is a cognitively motivated model of musical structure. It tries to describe explicitly the patterns of listener expectation with respect to the continuation of the melody. It applies the principles of Gestalt theory to melody perception, an approach introduced by Meyer [18]. The model describes both the continuation implied by particular melodic intervals and the extent to which this (expected) continuation is actually realized by the following interval. Grachten [8] provides a short introduction to the processes involved.

Two main principles of the theory concern the direction and size of melodic intervals. (1) Small intervals imply an interval in the same registral direction, and large intervals imply a change in registral direction. (2) A small interval implies a following similarly sized interval, and a large interval implies a smaller interval. Based on these two principles, melodic patterns, or *structures*, can be identified that either satisfy or violate the implications predicted by the principles. Figure 3.1 shows eight such structures: process (P), duplication (D), intervallic duplication (ID), intervallic process (IP), registral process (VP), reversal (R), intervallic reversal (IR), and registral reversal (VR). The Process structure, for instance, satisfies both registral and intervallic implications. Intervallic Process satisfies the intervallic difference principle but violates the registral implication.

A further central concept of the theory is *closure*, which refers to situations in which listeners might expect a caesura. In the IR model, closure can be evoked in several dimensions of the music: intervallic progression, metrical position, rhythm, and harmony. The accumulated degrees of closure in each dimension constitute the perceived overall closure at any point in the score. Occurrences of strong closure may coincide with a more commonly used concept of closure in music theory that refers to the completion of a musical entity, for example, a phrase. Hence, calculating the distance of each note to the nearest point of closure can provide a segmentation of a piece similar to phrasal analysis.
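Once points of closure have been identified, the distance-to-closure feature is a one-liner. In this sketch, closure points are assumed to be given as note indices (a hypothetical representation; the chapter does not specify one):

```python
def distance_to_closure(n_notes, closure_points):
    """For each note index, the distance (in notes) to the nearest point
    of closure; local maxima of this curve suggest segment interiors and
    zeros mark the phrase-like boundaries."""
    return [min(abs(i - c) for c in closure_points) for i in range(n_notes)]
```

For a five-note melody with closure at the first and last notes, the curve rises to the middle and falls again, mimicking a phrase arch.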

## 3.5 YQX: The Simple Model

YQX models the dependency of the performance on the score as a simple Bayesian network. Discrete score features (**Q**) are associated with discrete probability tables, while continuous score features (**X**) are modelled by Gaussian distributions. The predicted performance characteristics, the targets (**Y**), are continuously valued and conditioned on the set of discrete and continuous features. Figure 3.2 shows the general layout. The semantics is that of a linear Gaussian model [21]. This implies that the case of a continuous distribution parenting a continuous distribution is implemented by making the mean of the child distribution linearly dependent on the value of the condition. Sets are hereafter denoted by bold letters, and vectors are indicated by variables with superscribed arrows.

Each target *y* is modelled as a conditional distribution *P*(*y* | **Q**, **X**). Following the linear Gaussian model, this is a Gaussian distribution \(\mathcal{N}(y;\mu ,{\sigma }^{2})\) with the mean μ varying linearly with **X**. Given specific values **Q** = **q** and \(\mathbf{X} =\overrightarrow{ x}\) (treating the real-valued set of continuous score features as a vector):
$$\mu = {d}_{\mathbf{q}} +\overrightarrow{{k}}_{\mathbf{q}} \cdot \overrightarrow{ x}.$$
The parameters *d*_{q} and \(\overrightarrow{{k}}_{\mathbf{q}}\) are estimated from the data by least-squares linear regression. The average residual error of the regression is the variance σ^{2} of the distribution. Thus, we collect all instances in the data that share the same combination of discrete feature values and build a joint probability distribution of the continuous features and targets of these instances. This implements the conditioning on the discrete features **Q**. The linear dependency of the mean of the target distribution on the values of the continuous features introduces the conditioning on **X**. This constitutes the training phase of the model.

Performance prediction is done note by note. The score features of a note are entered into the network as evidence \(\overrightarrow{x}\) and **q**. The instantiation of the discrete features determines the appropriate probability table and the parameterization *d*_{q} and \(\overrightarrow{{k}}_{\mathbf{q}}\), and the continuous features are used to calculate the mean of the target distribution μ. This value is used as the prediction for the specific note. As the targets are independent, we create models and predictions for each target separately.
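The training and prediction steps can be sketched as a minimal Python mock-up with one continuous feature per note. The flat `(q, x, y)` example format and the function names are illustrative assumptions, not the original system's interface:

```python
from collections import defaultdict

def train_yqx(examples):
    """Fit one linear model per discrete context q (linear Gaussian
    semantics): mu = d_q + k_q * x, estimated by least squares.
    `examples` is a list of (q, x, y) with q hashable, x and y scalar."""
    groups = defaultdict(list)
    for q, x, y in examples:
        groups[q].append((x, y))
    params = {}
    for q, pts in groups.items():
        n = len(pts)
        sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
        sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
        denom = n * sxx - sx * sx
        k = (n * sxy - sx * sy) / denom if denom else 0.0  # slope k_q
        d = (sy - k * sx) / n                              # intercept d_q
        params[q] = (d, k)
    return params

def predict_yqx(params, q, x):
    """Prediction = mean of the target distribution for context q."""
    d, k = params[q]
    return d + k * x
```

Training on the points (0, 1), (1, 3), (2, 5) in a single context recovers d = 1, k = 2, and the prediction for x = 3 is 7.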

### 3.5.1 Quantitative Evaluation of YQX

We evaluated the model using the datasets described in Sect. 3.3: the complete Chopin piano works played by N. Magaloff and the 13 complete Mozart piano sonatas played by R. Batik. The Mozart data were split into two datasets – fast movements and slow movements – as they might reflect different interpretational concepts that would also be reproduced in the predictions. Similarly, we also show the results for the Chopin data by category (ballades, nocturnes, etc.^{6}). The quality of a predicted performance is measured by Pearson’s correlation coefficient between the predicted curve and the curve calculated from the training data.
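The evaluation measure is standard; a self-contained Pearson correlation between two expression curves (equal length, non-constant) might look like:

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length curves."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

A predicted curve that is a positively scaled copy of the measured one scores 1.0; an inverted one scores −1.0.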

The table below gives the correlations between predicted and real performance for YQX. The targets shown are IOI ratio (*IOI*), loudness (*vel*), articulation (*art*), local timing (*timing*), current tempo (*tempo*), and reassembled IOI ratio (*IOI (r)*); the column *IOI (r)* shows the best combined prediction for each dataset.

| | IOI | Vel | Art | Timing | Tempo | IOI (r) |
|---|---|---|---|---|---|---|
| Mozart fast | 0.46 | 0.42 | 0.49 | 0.43 | 0.39 | 0.46 |
| Mozart slow | 0.48 | 0.41 | 0.39 | 0.48 | 0.35 | 0.48 |
| Chopin | 0.22 | 0.16 | 0.33 | 0.15 | 0.18 | 0.22 |
| Ballades | 0.33 | 0.17 | 0.40 | 0.12 | 0.37 | 0.33 |
| Etudes | 0.17 | 0.15 | 0.17 | 0.09 | 0.20 | 0.16 |
| Mazurkas | 0.23 | 0.14 | 0.29 | 0.20 | 0.13 | 0.23 |
| Nocturnes | 0.17 | 0.17 | 0.33 | 0.14 | 0.11 | 0.17 |
| Pieces | 0.20 | 0.15 | 0.35 | 0.17 | 0.14 | 0.19 |
| Polonaises | 0.20 | 0.16 | 0.32 | 0.13 | 0.14 | 0.20 |
| Preludes | 0.20 | 0.15 | 0.33 | 0.15 | 0.16 | 0.21 |
| Scherzi | 0.33 | 0.23 | 0.26 | 0.16 | 0.30 | 0.33 |
| Sonatas | 0.16 | 0.14 | 0.32 | 0.12 | 0.20 | 0.16 |
| Waltzes | 0.35 | 0.16 | 0.29 | 0.22 | 0.35 | 0.35 |

The first observation is that the Chopin data generally show lower prediction quality, which implies that these data are much harder to predict than the Mozart pieces. This is probably due to the much higher variation in the performance characteristics for which the score features must account. Second, the loudness curves seem harder to predict than the tempo curves, a problem also observed in previous experiments with the model (see [4] and [37]). Third, articulation seems to be easier to predict than tempo (with the exception of the slow Mozart movements and the Chopin scherzi, mazurkas, and waltzes, for which articulation was harder to predict than tempo). The Chopin categories show huge differences in the prediction quality for tempo (the scherzi being the hardest to predict and the waltzes the easiest), suggesting that there are indeed common interpretational characteristics within each category.

Predicting the IOI ratio by combining the predictions for local timing and current tempo seems moderately successful. Only in some cases is the best combined prediction better than the best prediction for the separate components. It must be noted though that the combined predictions used the same set of features for both local timing and current tempo. Due to the extremely high number of possible combinations involved, experiments to find the two feature sets that lead to the best combined prediction have not yet been conducted.

### 3.5.2 Qualitative Evaluation of YQX

All quantitative evaluations of performances face the same problem: Although similarities between the predicted and the original curves can be measured to a certain degree, there is no computational way of judging the aesthetic qualities, or the degree of naturalness of expression, of a performance. The only adequate measure of quality is human judgement. The annual rendering contest RENCON [11] offers a scientific platform on which performance rendering systems can be compared and rated by the audience.

The system YQX participated in RENCON08, which was hosted alongside the ICMPC10 in Sapporo. Entrants to the “autonomous section” were required to render two previously unknown pieces (composed specifically for the competition) without any audio feedback from the system and within the time frame of 1 h. Four contestants entered the autonomous section and competed for three awards: The Rencon award was to be given to a winner selected by audience vote (both through web and on-site voting), the Rencon technical award was to be given to the entrant judged most interesting from a technical point of view, and finally the Rencon Murao Award was to be given to the entrant that most impressed the composer Prof. T. Murao. YQX won all three prizes. While this is no proof of the absolute quality of the model, it does give some evidence that the model is able to capture and reproduce certain aesthetic qualities of music performance. A video of YQX performing at RENCON08 can be seen at http://www.cp.jku.at/projects/yqx/yqx_cvideo2.flv.^{7}

## 3.6 YQX: The Enhanced Dynamic Model

The predictions of the basic YQX system are note-wise; each prediction depends only on the score features at that particular score onset. In a real performance, this is of course not the case: Typically, changes in dynamics or tempo evolve gradually. Clearly, this necessitates awareness of the surrounding expressive context.

To incorporate this context, the network is extended with an arc connecting the target in time-step *t* − 1 to the target in time-step *t*. Figure 3.3 shows the unfolded network. This should lead to smoother and more consistent performances with less abrupt changes and, ideally, to an increase in the overall prediction quality.

The context-aware prediction can be done in two different ways: (1) Using the previous target simply as an additional parent probability distribution to the current target allows optimization with respect to one preceding prediction. Minimal adaptation has to be made to the algorithm (see Sect. 3.6.1). (2) Using an adaptation of the Viterbi decoding in hidden Markov models results in a predicted series that is optimal with respect to the complete piece (see Sect. 3.6.2).

### 3.6.1 YQX with Local Maximisation

In the first approach, we add the previous target (*y*_{t−1}) to the target *y*_{t} as an additional feature that we calculate from the performance data. In the training process, the joint distribution of the continuous features, the target *y*_{t}, and the target in the previous time-step *y*_{t−1} given the discrete score features – in mathematical terms \(P({y}_{t-1},{y}_{t},\overrightarrow{{x}}_{t}\vert {\mathbf{q}}_{t})\) – is estimated. This alters the conditional distribution of the target *y*_{t} to \(P({y}_{t}\vert \mathbf{Q},\mathbf{X},{y}_{t-1}) = \mathcal{N}(y;\mu ,{\sigma }^{2})\), with the mean μ now depending linearly on both the continuous features \(\overrightarrow{{x}}_{t}\) and the previous target *y*_{t−1}.^{8}

The prediction phase is equally straightforward. As in the simple model, the mean of \(P({y}_{t}\vert {\mathbf{q}}_{t},\overrightarrow{{x}}_{t},{y}_{t-1})\) is used as the prediction for the score note in time-step *t*. This is the value with the highest local probability.
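The locally maximising prediction step can be sketched as conditioning a Gaussian on the previous prediction. The sketch assumes the joint distribution over (*y*_{t−1}, *y*_{t}) has already been reduced to a bivariate Gaussian (the 2-D reduction, which folds in the continuous features, and the helper names are assumptions for illustration):

```python
def condition_gaussian(mu, cov, idx, value):
    """Condition a 2-D Gaussian (mu = [m0, m1], cov = 2x2 nested list) on
    component `idx` taking `value`; returns (mean, var) of the other."""
    other = 1 - idx
    k = cov[other][idx] / cov[idx][idx]          # regression coefficient
    mean = mu[other] + k * (value - mu[idx])
    var = cov[other][other] - k * cov[idx][other]
    return mean, var

def predict_sequence(mu, cov, y0, n):
    """Note-by-note prediction: each step conditions the joint Gaussian
    over (y_prev, y_cur) on the previous prediction and emits the mean."""
    preds, prev = [], y0
    for _ in range(n):
        m, _ = condition_gaussian(mu, cov, 0, prev)  # index 0 = y_{t-1}
        preds.append(m)
        prev = m
    return preds
```

With a zero-mean joint and correlation 0.5, a starting value of 2 decays geometrically (1.0, 0.5, ...), illustrating how the previous target pulls each prediction toward the context.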

### 3.6.2 YQX with Global Maximisation

The second approach drops the concept of a linear Gaussian model completely. In the training phase, the joint conditional distributions \(P({y}_{t-1},{y}_{t},\overrightarrow{{x}}_{t}\vert {\mathbf{q}}_{t})\) are estimated as before, but no linear regression parameters need to be calculated. The aim is to construct a sequence of predictions that maximizes the conditional probability of the performance given the score features with respect to the complete history of predictions made up to that point.

This is calculated similarly to the Viterbi decoding in hidden Markov models, which tries to find the best explanation for the observed data [12]. Aside from the fact that the roles of evidence nodes and query nodes are switched, the main conceptual difference is that – unlike the HMM setup, which uses tabular distributions – our approach must deal with continuous distributions. This rules out the dynamic programming algorithm usually applied and calls for an analytical solution, which we present below. As in the Viterbi algorithm, the calculation is done in two steps: a forward and a backward sweep. In the forward movement, the most probable target is calculated relative to the previous time-step. In the backward movement, knowing the final point of the optimal path, the sequence of predictions is found by backtracking through all time-steps.

### The Forward Calculation

Let *t* be the index of a time-step, and *N* the number of data points in a piece. Further, let α_{t} be the probability distribution over the target values *y*_{t} that concludes the optimal path from time-steps 1 to *t*. By means of a recursive formula, α(*y*_{t}) can be calculated for all time-steps of the unfolded network^{9}:
$$\alpha ({y}_{t}) = p({y}_{t}\vert \overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\,\max\limits_{{y}_{t-1}}\left[p({y}_{t-1}\vert {y}_{t},\overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\,\alpha ({y}_{t-1})\right].$$
Intuitively, α(*y*_{t−1}) assigns to each possible target value *y*_{t−1} in time-step *t* − 1 the probability of being part of the optimal path. We can then calculate, for each target value *y*_{t} in time-step *t*, the predecessor that yields the highest probability for this specific *y*_{t} of being on the optimal path. In the backward movement, we start with the most probable final point of the path (the mean of the last α) and then backtrack to the beginning by choosing the best predecessors. As we cannot calculate the maximum over all *y*_{t−1} ∈ ℝ directly, we need an analytical way of calculating α(*y*_{t}) from α(*y*_{t−1}), which we derive below. We will also show that α(*y*_{t}) remains Gaussian through all time-steps. This is particularly important because we rely on the parametric representation using mean and variance.

We hereafter use the distribution \(p({y}_{t-1}\vert {y}_{t},\overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\) ∝ \(\mathcal{N}({y}_{t-1};{\mu }_{t-1},{\sigma }_{t-1}^{2})\) that can be calculated via conditioning from the joint conditional distribution \(p({y}_{t-1},{y}_{t},\overrightarrow{{x}}_{t}\vert {\mathbf{q}}_{t})\) that is estimated in the training of the model. For details as to how this is done see, for instance, [26]. Anticipating our proof that the α(*y*_{t}) are Gaussian, we refer to the mean and variance as μ_{α, t} and σ^{2}_{α, t}.
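This conditioning step can be illustrated with a small sketch (our own, not code from the original system): for a bivariate Gaussian joint over (*y*_{t − 1}, *y*_{t}), the conditional p(*y*_{t − 1} | *y*_{t}) has a mean that is *linear* in *y*_{t} and a variance that does not depend on *y*_{t} at all, which is what the forward calculation exploits.

```python
import numpy as np

def condition_on_second(mu, S, y_t):
    """Mean and variance of p(y_prev | y_t) for a 2-D Gaussian joint.

    mu: length-2 mean vector (y_prev, y_t); S: 2x2 covariance matrix.
    """
    mu_prev, mu_t = mu
    var_prev, cov = S[0, 0], S[0, 1]
    var_t = S[1, 1]
    cond_mean = mu_prev + (cov / var_t) * (y_t - mu_t)  # linear in y_t: a + b*y_t
    cond_var = var_prev - cov**2 / var_t                # independent of y_t
    return cond_mean, cond_var
```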

Assuming that α(*y*_{t − 1}) is Gaussian, the result of the product in brackets is a Gaussian \(\mathcal{N}({y}_{t-1};{\mu }_{{_\ast}},{\sigma }_{{_\ast}}^{2})\) with

\[ {\sigma }_{{_\ast}}^{2} ={ \left (\frac{1}{{\sigma }_{t-1}^{2}} + \frac{1}{{\sigma }_{\alpha,t-1}^{2}}\right )}^{-1},\qquad {\mu }_{{_\ast}} = {\sigma }_{{_\ast}}^{2}\left (\frac{{\mu }_{t-1}}{{\sigma }_{t-1}^{2}} + \frac{{\mu }_{\alpha,t-1}}{{\sigma }_{\alpha,t-1}^{2}}\right ), \]

and a normalizing constant *z* that is Gaussian in both means of the factors:

\[ z = \frac{1}{\sqrt{2\pi ({\sigma }_{t-1}^{2} + {\sigma }_{\alpha,t-1}^{2})}}\exp \left (-\frac{{({\mu }_{t-1} - {\mu }_{\alpha,t-1})}^{2}}{2({\sigma }_{t-1}^{2} + {\sigma }_{\alpha,t-1}^{2})}\right ).\qquad (3.6) \]

In the recursion, *z* is to be multiplied with a Gaussian distribution over *y*_{t}. Hence, *z* must be transformed into a distribution over the same variable. Note that the mean μ_{t − 1} is dependent on *y*_{t} due to the conditioning of \(p({y}_{t-1}\vert {y}_{t},\overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\) on *y*_{t}; the dependency is linear, so we can write μ_{t − 1} = *a* + *b* ⋅ *y*_{t}. By finding the *y*_{t} such that the exponent in Eq. 3.6 equals 0, we can construct the mean μ_{z} and variance σ_{z}^{2} of *z*:

\[ {\mu }_{z} = \frac{{\mu }_{\alpha,t-1} - a}{b},\qquad {\sigma }_{z}^{2} = \frac{{\sigma }_{t-1}^{2} + {\sigma }_{\alpha,t-1}^{2}}{{b}^{2}}. \]

As *z* is independent of *y*_{t − 1}, it is not affected by the calculation of the maximum:

\[ {\max }_{{y}_{t-1}}\left [p({y}_{t-1}\vert {y}_{t},\overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\cdot \alpha ({y}_{t-1})\right ] = z\cdot {\max }_{{y}_{t-1}}\mathcal{N}({y}_{t-1};{\mu }_{{_\ast}},{\sigma }_{{_\ast}}^{2}) = \frac{z}{\sqrt{2\pi {\sigma }_{{_\ast}}^{2}}}. \]

What remains is the multiplication of *z*, now a Gaussian over *y*_{t}, with the distribution \(p({y}_{t}\vert \overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\), whose mean and variance we denote by μ_{t} and σ_{t}^{2}. This distribution is Gaussian by design, and hence the remaining product again results in a Gaussian and a normalizing constant. As the means of both factors are fixed, the normalizing constant in this case is a single factor. The mean μ_{α, t} and variance σ^{2}_{α, t} of α(*y*_{t}) follow:

\[ {\sigma }_{\alpha,t}^{2} ={ \left (\frac{1}{{\sigma }_{z}^{2}} + \frac{1}{{\sigma }_{t}^{2}}\right )}^{-1},\qquad {\mu }_{\alpha,t} = {\sigma }_{\alpha,t}^{2}\left (\frac{{\mu }_{z}}{{\sigma }_{z}^{2}} + \frac{{\mu }_{t}}{{\sigma }_{t}^{2}}\right ). \]

Thus, α(*y*_{t}) is Gaussian in *y*_{t}, assuming that α(*y*_{t − 1}) is Gaussian. Since α(*y*_{1}) is Gaussian, it follows that α(*y*_{t}) is Gaussian for 1 ≤ *t* ≤ *N*. This equation shows that the mean and variance of α(*y*_{t}) can be computed recursively using the mean μ_{α, t − 1} and variance σ^{2}_{α, t − 1} of α(*y*_{t − 1}). The parameters of \({\alpha }_{{y}_{1}}\) are equal to \({\mu }_{{y}_{1}}\) and \({\sigma }_{{y}_{1}}^{2}\), which are the mean and the variance of the distribution \(p({y}_{1}\vert \overrightarrow{{x}}_{1},{\mathbf{q}}_{1})\), and are estimated from the data.
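One forward step of this recursion can be sketched in a few lines. The sketch below is ours, not the original implementation; it assumes the linear parameterization μ_{t − 1} = a + b ⋅ y_{t} of the conditional mean, with conditional variance `var_cond`, and combines the resulting Gaussian z over y_{t} with p(y_{t} | x_{t}, q_{t}) = N(μ_{t}, σ_{t}²) via precision addition.

```python
def forward_step(mu_alpha_prev, var_alpha_prev, a, b, var_cond, mu_t, var_t):
    """One forward step: compute mean and variance of alpha(y_t).

    (mu_alpha_prev, var_alpha_prev): parameters of alpha(y_{t-1});
    (a, b, var_cond): conditional p(y_{t-1}|y_t) has mean a + b*y_t;
    (mu_t, var_t): parameters of p(y_t | x_t, q_t).
    """
    # z rewritten as a Gaussian over y_t (exponent of Eq. 3.6 is zero
    # where a + b*y_t equals the previous alpha mean)
    mu_z = (mu_alpha_prev - a) / b
    var_z = (var_cond + var_alpha_prev) / b**2
    # product of two Gaussians over y_t: precisions add, means are blended
    var_alpha = 1.0 / (1.0 / var_z + 1.0 / var_t)
    mu_alpha = var_alpha * (mu_z / var_z + mu_t / var_t)
    return mu_alpha, var_alpha
```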

### The Backward Calculation

Once the parameters μ_{α, t} and σ^{2}_{α, t} of α(*y*_{t}) are known for 1 ≤ *t* ≤ *N*, the optimal sequence *y*_{1}, *…*, *y*_{N} can be calculated by backtracking from the most probable end point:

\[ \hat{{y}}_{N} = {\mu }_{\alpha,N},\qquad \hat{{y}}_{t-1} =\arg {\max }_{{y}_{t-1}}\left [p({y}_{t-1}\vert \hat{{y}}_{t},\overrightarrow{{x}}_{t},{\mathbf{q}}_{t})\cdot \alpha ({y}_{t-1})\right ] = {\sigma }_{{_\ast}}^{2}\left (\frac{{\mu }_{t-1}(\hat{{y}}_{t})}{{\sigma }_{t-1}^{2}} + \frac{{\mu }_{\alpha,t-1}}{{\sigma }_{\alpha,t-1}^{2}}\right ). \]
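The backward sweep can be sketched as follows (our own sketch, under the same assumed parameterization as above: the conditional mean of y_{t − 1} given y_{t} is a + b ⋅ y_{t} with variance `var_cond`):

```python
def backtrack(alphas, params):
    """Backward sweep: recover the optimal target sequence.

    alphas: list of (mu_alpha, var_alpha), one pair per time-step 1..N;
    params: list of (a, b, var_cond) for the transitions into steps 2..N.
    """
    y = [0.0] * len(alphas)
    y[-1] = alphas[-1][0]                      # most probable final point
    for t in range(len(alphas) - 1, 0, -1):
        a, b, var_cond = params[t - 1]
        mu_alpha, var_alpha = alphas[t - 1]
        # best predecessor = mean of the Gaussian product (mu_star)
        var_star = 1.0 / (1.0 / var_cond + 1.0 / var_alpha)
        y[t - 1] = var_star * ((a + b * y[t]) / var_cond + mu_alpha / var_alpha)
    return y
```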

### 3.6.3 Quantitative Evaluation

Correlations between predicted and real performance for the basic YQX and the locally and globally optimized models. The targets shown are IOI ratio (*IOI*), loudness (*vel*), articulation (*art*), local timing (*timing*), current tempo (*tempo*), and reassembled IOI ratio (*IOI (r)*)

| Corpus | Model | IOI | Vel | Art | Timing | Tempo | IOI (r) |
|---|---|---|---|---|---|---|---|
| Mozart fast | YQX | 0.46 | 0.42 | 0.49 | 0.43 | 0.39 | 0.46 |
| | Local | 0.44 | 0.41 | 0.48 | 0.42 | 0.43 | 0.44 |
| | Global | 0.39 | 0.37 | 0.37 | 0.32 | 0.43 | 0.39 |
| Mozart slow | YQX | 0.48 | 0.41 | 0.39 | 0.48 | 0.35 | 0.48 |
| | Local | 0.46 | 0.39 | 0.38 | 0.48 | 0.42 | 0.47 |
| | Global | 0.46 | 0.35 | 0.23 | 0.44 | 0.34 | 0.46 |
| Chopin | YQX | 0.22 | 0.16 | 0.33 | 0.15 | 0.18 | 0.22 |
| | Local | 0.21 | 0.14 | 0.14 | 0.15 | 0.16 | 0.20 |
| | Global | 0.23 | 0.15 | 0.14 | 0.16 | 0.22 | 0.23 |
| Ballades | YQX | 0.33 | 0.17 | 0.40 | 0.12 | 0.37 | 0.33 |
| | Local | 0.36 | 0.17 | 0.39 | 0.12 | 0.30 | 0.25 |
| | Global | 0.38 | 0.19 | 0.36 | 0.12 | 0.46 | 0.38 |
| Etudes | YQX | 0.17 | 0.15 | 0.17 | 0.09 | 0.20 | 0.16 |
| | Local | 0.14 | 0.14 | 0.16 | 0.09 | 0.17 | 0.14 |
| | Global | 0.22 | 0.15 | 0.15 | 0.13 | 0.26 | 0.23 |
| Mazurkas | YQX | 0.23 | 0.14 | 0.29 | 0.20 | 0.13 | 0.23 |
| | Local | 0.22 | 0.14 | 0.28 | 0.22 | 0.13 | 0.21 |
| | Global | 0.23 | 0.13 | 0.27 | 0.20 | 0.19 | 0.24 |
| Nocturnes | YQX | 0.17 | 0.17 | 0.33 | 0.14 | 0.11 | 0.17 |
| | Local | 0.17 | 0.11 | 0.32 | 0.14 | 0.17 | 0.16 |
| | Global | 0.20 | 0.18 | 0.31 | 0.15 | 0.14 | 0.18 |
| Pieces | YQX | 0.20 | 0.15 | 0.35 | 0.17 | 0.14 | 0.19 |
| | Local | 0.22 | 0.12 | 0.33 | 0.12 | 0.16 | 0.18 |
| | Global | 0.23 | 0.14 | 0.33 | 0.17 | 0.25 | 0.26 |
| Polonaises | YQX | 0.20 | 0.16 | 0.32 | 0.13 | 0.14 | 0.20 |
| | Local | 0.18 | 0.19 | 0.32 | 0.13 | 0.15 | 0.16 |
| | Global | 0.22 | 0.19 | 0.31 | 0.14 | 0.20 | 0.23 |
| Preludes | YQX | 0.20 | 0.15 | 0.33 | 0.15 | 0.16 | 0.21 |
| | Local | 0.19 | 0.11 | 0.31 | 0.15 | 0.22 | 0.18 |
| | Global | 0.22 | 0.14 | 0.28 | 0.14 | 0.23 | 0.22 |
| Scherzi | YQX | 0.33 | 0.23 | 0.26 | 0.16 | 0.30 | 0.33 |
| | Local | 0.34 | 0.18 | 0.26 | 0.15 | 0.32 | 0.31 |
| | Global | 0.34 | 0.18 | 0.25 | 0.13 | 0.36 | 0.34 |
| Sonatas | YQX | 0.16 | 0.14 | 0.32 | 0.12 | 0.20 | 0.16 |
| | Local | 0.17 | 0.12 | 0.32 | 0.12 | 0.18 | 0.15 |
| | Global | 0.21 | 0.15 | 0.32 | 0.09 | 0.28 | 0.22 |
| Waltzes | YQX | 0.35 | 0.16 | 0.29 | 0.22 | 0.35 | 0.35 |
| | Local | 0.37 | 0.18 | 0.28 | 0.23 | 0.31 | 0.14 |
| | Global | 0.38 | 0.24 | 0.29 | 0.22 | 0.44 | 0.38 |

For the Chopin data (both the complete set and the individual categories), the prediction quality for tempo increases in all cases and for loudness in some cases. Prediction quality for articulation decreases relative to the original model under both local and global optimization. This is not surprising, because articulation is a local phenomenon that does not benefit from long-term modelling. The same holds for the timing, i.e., the local tempo component: in most cases, neither local nor global optimization improves the prediction quality. The current tempo – the low-frequency component of the IOI ratio – on the other hand, does benefit from optimizing the prediction globally with respect to the performance context: the prediction quality increases in all cases (the biggest gain, almost 80%, is registered in the mazurkas).

Surprisingly, the Mozart data paint a different picture: None of the performance targets (with the exception of the current tempo prediction for the fast movements) benefits from including the performance context into the predictions. Previous experiments [4] showed that, given a specific, fixed set of features, local or global optimization improves the prediction quality. However, given the freedom of choosing the best set of features for each particular target (which is the evaluation setup we chose here), feature sets exist with which the original, simple model outperforms the enhanced versions in terms of average correlation.
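The evaluation measure used throughout the table is the correlation between the predicted curve and the curve extracted from the human performance. A minimal sketch of this measure (using NumPy; input data is illustrative):

```python
import numpy as np

def correlation(predicted, performed):
    """Pearson correlation between a predicted and a performed curve."""
    p = np.asarray(predicted, dtype=float)
    h = np.asarray(performed, dtype=float)
    return float(np.corrcoef(p, h)[0, 1])
```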

### 3.6.4 Qualitative Evaluation

As an illustration, consider a movement of the Mozart sonata *K*. 280. The original YQX algorithm exhibits small fluctuations that are largely uncorrelated with the human performance. These result in small but noticeable irregularities in the rendered performance and, in contrast to the human performance, which is far from yielding a flat curve, make the result sound inconsistent rather than lively and natural. The globally optimized YQX eliminates them at the expense of flattening out some of the (musically meaningful) spikes. The correlation for the movement was improved by 57.2%, from 0.29 (YQX) to 0.456 (YQX global).

## 3.7 Further Extensions

### 3.7.1 Note-Level Rules

- **Staccato rule:** If two successive notes (not exceeding a certain duration) have the same pitch, and the second of the two is longer, then the first note is played staccato. In our implementation, the predicted articulation is substituted with a fixed small value, usually around 0.15, which amounts to 15% of the duration in the score in terms of the current performance tempo.
- **Delay next rule:** If two notes of the same length are followed by a longer note, the last note is played with a slight delay. The IOI ratio of the middle note of a triplet satisfying the condition is calculated by taking the average of the two preceding notes and adding a fixed amount.
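The two rules can be sketched as a post-processing pass over the predicted values. The field names, the duration threshold, and the delay amount below are illustrative assumptions (the text does not fix them), and the delay rule implements one plausible reading of "the average of the two preceding notes":

```python
def apply_note_rules(notes, staccato_value=0.15, max_dur=2.0, delay=0.05):
    """Post-process predicted expression values with two note-level rules.

    notes: list of dicts with keys 'pitch', 'duration' (score duration),
    'articulation' and 'ioi_ratio' (predicted values). Modified in place.
    """
    for i in range(len(notes) - 1):
        cur, nxt = notes[i], notes[i + 1]
        # Staccato rule: repeated pitch, second note longer -> first is staccato
        if (cur["pitch"] == nxt["pitch"] and nxt["duration"] > cur["duration"]
                and cur["duration"] <= max_dur):
            cur["articulation"] = staccato_value
    for i in range(len(notes) - 2):
        a, b, c = notes[i], notes[i + 1], notes[i + 2]
        # Delay-next rule: two equally long notes followed by a longer one ->
        # lengthen the middle note's IOI ratio (average of the two notes
        # preceding the long note, plus a fixed amount), delaying the onset
        # of the long note
        if a["duration"] == b["duration"] and c["duration"] > b["duration"]:
            b["ioi_ratio"] = (a["ioi_ratio"] + b["ioi_ratio"]) / 2 + delay
    return notes
```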

### 3.7.2 Combined Tempo and Timing Model

As discussed briefly in Sect. 3.5.1, it seems reasonable to split the tempo curve into a high- and a low-frequency component (the local and global tempo), predict the two separately, and reassemble a tempo prediction from the two curves. Considering the effect of global optimization, as discussed in Sect. 3.6.3, it also seems appropriate to use the basic model for the local timing predictions and the global optimization algorithm for the current tempo predictions.

An obvious extension to the experiments already presented would be to use different feature sets for the two components. In previous studies [3], we have discovered a relation between the size of the context a feature describes and its prediction quality for global and local tempo changes. The low-frequency components of certain features that are calculated, for instance, via a windowed moving average, are more suitable for global tempo prediction than are the high-frequency components, and vice versa for local tempo changes. Preliminary experiments that integrate this concept in the YQX algorithms show a slight quality increase (around 5*%*) for the current tempo and, consequently, for the combined IOI ratio target.
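The split into a low-frequency "current tempo" and a high-frequency "local timing" component can be sketched with a centered moving average (window size and edge-padding strategy are our assumptions, not specified in the text):

```python
import numpy as np

def split_tempo(ioi_ratios, window=5):
    """Split an IOI-ratio curve into a smooth current-tempo component and a
    residual local-timing component; their sum reassembles the input."""
    x = np.asarray(ioi_ratios, dtype=float)
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")               # avoid edge shrinkage
    tempo = np.convolve(padded, kernel, mode="valid")  # low-frequency part
    timing = x - tempo                                 # high-frequency residual
    return tempo, timing
```

By construction, `tempo + timing` reproduces the original curve, so the two components can be predicted separately and reassembled.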

Also, global tempo trends in classical music are highly correlated with the phrase structure of a piece. This fact is often discussed in research on models of expressivity, such as the kinematic models introduced by Friberg and Sundberg [6] and by Todd [33]. Instead of training a model on the tempo curve of a complete piece, a promising approach would thus be to train and predict phrases or phrase-like segments of the score. A possible, albeit simplistic, implementation would assume that tempo and loudness follow an approximately parabolic trend – soft and slow at the beginning and end of a phrase, faster and louder in the middle. A performance would then be created by combining the local tempo predictions made by a probabilistic model with a segment-wise parabolic global tempo. To refine the segment-wise predictions of global tempo, any kind of more sophisticated model could be used – a probabilistic system, a parametric model, or a case-based one (as in [37]).
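A minimal sketch of such a segment-wise parabolic arc (the `depth` parameter and the multiplicative formulation are our assumptions):

```python
import numpy as np

def parabolic_arc(n, depth=0.1):
    """Parabolic tempo factor for a phrase of n notes: slower at the phrase
    boundaries, faster in the middle."""
    t = np.linspace(-1.0, 1.0, n)     # normalized position within the phrase
    return 1.0 + depth * (1.0 - t**2) # peaks at the phrase center
```

The local timing predictions of the probabilistic model would then be superimposed on the concatenation of such per-segment arcs.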

### 3.7.3 Dynamic Bayesian Networks

Both models presented above, the Bayesian reasoning of YQX and the context-aware dynamic YQX, are subclasses of the general complex of Bayesian networks. The obvious generalization of the models is towards a dynamic Bayesian network (DBN). The main differences lie in (1) the network layout and (2) the way the model is trained.

The network layout is extended by a discrete hidden node M that mediates between the target and the discrete score features *Q*_{1}, *…*, *Q*_{n}. This mitigates the sparsity problem caused by the huge number of possible combinations of values for the discrete features. The values of M are not known in advance; only the number of discrete states that the node can be in is fixed. The conditional probability distribution of M given the parent nodes *Q*_{1}, *…*, *Q*_{n} is estimated in the training process of the model. The training itself is done by maximizing the log-likelihood of the predicted values with an expectation-maximization algorithm [21].

However, the most significant difference is that instead of feeding the complete piece into the model at once, DBNs work on short segments. In theory, any trend common to all or most of the segments should also be recognizable in the predicted curves. Given a segmentation of musical works into musically meaningful fragments – ideally phrases – the network should be able to reproduce patterns of tempo or loudness that are common across phrases.

## 3.8 Conclusion

Automatic synthesis of expressive music is a very challenging task. Of particular difficulty is the evaluation of a system, as one cannot judge the aesthetic quality of a performance by numbers. The only adequate measure of quality is human judgement. The rendering system presented passed this test in the RENCON08 and therefore constitutes a baseline for our current research. The two extensions we devised incorporate the current performance context into predictions. This proved useful for reproducing longer-term trends in the data at the expense of local expressivity.

We consider this a work in progress. There is still a long way to go to a machine-generated performance that sounds profoundly musical. The main goal in the near future will be to further develop the idea of a multilevel system comprising several sub-models, each specialized on a different aspect of performance – global trends and local events. Segmentation of the input pieces will also play a significant role, as this reflects the inherently hierarchical structure of music performance.

## Footnotes

- 1.
- 2.
Some of the posthumously published works were played as encores but have not yet been included in the dataset.

- 3.
The unit of the duration does not matter in this case, as it cancels out with the unit of the complete duration of the performance.

- 4.
Computer-controlled pianos measure loudness by measuring the velocity at which a hammer strikes a string.

- 5.
In the case of two equally long durations, we only discriminate between long and neutral. Hence, there are no situations labelled *lsl*, *sls*, *ssl*, etc., only *lnl*, *nln*, *nnl*, etc., which reduces the number of combinations used.

- 6.
The category *Pieces* comprises Rondos (op. 1, op. 5, op. 16), Variations op. 12, Bolero op. 19, Impromptus (op. 36, op. 51), Tarantelle op. 43, Allegro de Concert op. 46, Fantaisie op. 49, Berceuse op. 57, and Barcarolle op. 61.

- 7.
The performed piece “My Nocturne,” a piano piece in a Chopin-like style, was composed by Prof. Tadahiro Murao specifically for the competition.

- 8.
The construct \((\overrightarrow{x},{y}_{t-1})\) is a concatenation of the vector \(\overrightarrow{x}\) and the value *y*_{t − 1}, leading to a new vector of dimension \(dim(\overrightarrow{x}) + 1\).

- 9.
We use α(*y*_{t}) and *p*(*y*_{t}) as abbreviations of α(*Y*_{t} = *y*_{t}) and *p*(*Y*_{t} = *y*_{t}), respectively.

## Notes

### Acknowledgements

We express our gratitude to Mme Irène Magaloff for her generous permission to use the unique resource that is the Magaloff Corpus for our research. This work is funded by the Austrian National Research Fund FWF via grants TRP 109-N23 and Z159 (“Wittgenstein Award”). The Austrian Research Institute for Artificial Intelligence acknowledges financial support from the Austrian Federal Ministries BMWF and BMVIT.

### References

- 1. Arcos J, de Mántaras R (2001) An interactive CBR approach for generating expressive music. J Appl Intell 27(1):115–129
- 2. Dorard L, Hardoon D, Shawe-Taylor J (2007) Can style be learned? A machine learning approach towards “performing” as famous pianists. In: Proceedings of music, brain & cognition workshop – the neural information processing systems 2007 (NIPS 2007), Whistler
- 3. Flossmann S, Grachten M, Widmer G (2008) Experimentally investigating the use of score features for computational models of expressive timing. In: Proceedings of the 10th international conference on music perception and cognition 2008 (ICMPC ’08), Sapporo
- 4. Flossmann S, Grachten M, Widmer G (2009) Expressive performance rendering: introducing performance context. In: Proceedings of the 6th sound and music computing conference 2009 (SMC ’09), Porto, pp 155–160
- 5. Flossmann S, Goebl W, Grachten M, Niedermayer B, Widmer G (2010) The Magaloff project: an interim report. J New Music Res 39(4):363–377
- 6. Friberg A, Sundberg J (1999) Does music performance allude to locomotion? A model of final ritardandi derived from measurements of stopping runners. J Acoust Soc Am 105(3):1469–1484
- 7. Friberg A, Bresin R, Sundberg J (2006) Overview of the KTH rule system for musical performance. Adv Cognit Psychol 2(2–3):145–161
- 8. Grachten M (2006) Expressivity-aware tempo transformations of music performances using case based reasoning. Ph.D. thesis, Pompeu Fabra University, Barcelona
- 9. Grindlay GC (2005) Modeling expressive musical performance with Hidden Markov models. Master’s thesis, University of California, Santa Cruz
- 10. Grindlay G, Helmbold D (2006) Modeling, analyzing, and synthesizing expressive piano performance with graphical models. Mach Learn 65(2–3):361–387
- 11. Hashida M (2008) RENCON – Performance Rendering Contest for computer systems. http://www.renconmusic.org/. Accessed Sep 2008
- 12. Juang BH, Rabiner LR (1991) Hidden Markov models for speech recognition. Technometrics 33(3):251–272
- 13. Kim TH, Fukayama S, Nishimoto T, Sagayama S (2010) Performance rendering for polyphonic piano music with a combination of probabilistic models for melody and harmony. In: Proceedings of the 7th sound and music computing conference 2010 (SMC ’10), Barcelona
- 14. Krumhansl CL, Kessler EJ (1982) Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychol Rev 89:334–368
- 15. Lerdahl F, Jackendoff R (1983) A generative theory of tonal music. The MIT Press, Cambridge
- 16. Mazzola G (2002) The topos of music – geometric logic of concepts, theory, and performance. Birkhäuser Verlag, Basel
- 17. Mazzola G (2006) Rubato software. http://www.rubato.org
- 18. Meyer L (1956) Emotion and meaning in music. University of Chicago Press, Chicago
- 19. Milmeister G (2006) The Rubato composer music software: component-based implementation of a functorial concept architecture. Ph.D. thesis, Universität Zürich, Zürich
- 20. Moog RA, Rhea TL (1990) Evolution of the keyboard interface: the Boesendorfer 290 SE recording piano and the Moog multiply-touch-sensitive keyboards. Comput Music J 14(2):52–60
- 21. Murphy K (2002) Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California, Berkeley
- 22. Narmour E (1990) The analysis and cognition of basic melodic structures: the implication–realization model. University of Chicago Press, Chicago
- 23. Narmour E (1992) The analysis and cognition of melodic complexity: the implication–realization model. University of Chicago Press, Chicago
- 24. Perez A, Maestre E, Ramirez R, Kersten S (2008) Expressive Irish fiddle performance model informed with bowing. In: Proceedings of the international computer music conference 2008 (ICMC ’08), Belfast
- 25. Ramirez R, Hazan A, Gòmez E, Maestre E (2004) Understanding expressive transformations in saxophone jazz performances using inductive machine learning. In: Proceedings of the sound and music computing international conference 2004 (SMC ’04), Paris
- 26. Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. The MIT Press, Cambridge. www.GaussianProcess.org/gpml
- 27. Recordare (2003) MusicXML definition. http://www.recordare.com/xml.html
- 28. Sundberg J, Askenfelt A, Frydén L (1983) Musical performance: a synthesis-by-rule approach. Comput Music J 7:37–43
- 29. Suzuki T (2003) The second phase development of case based performance rendering system “Kagurame”. In: Working notes of the IJCAI-03 RENCON workshop, Acapulco, pp 23–31
- 30. Temperley D (2007) Music and probability. MIT Press, Cambridge
- 31. Teramura K, Okuma H, et al (2008) Gaussian process regression for rendering music performance. In: Proceedings of the 10th international conference on music perception and cognition 2008 (ICMPC ’08), Sapporo
- 32. Tobudic A, Widmer G (2006) Relational IBL in classical music. Mach Learn 64(1–3):5–24
- 33. Todd NPM (1992) The dynamics of dynamics: a model of musical expression. J Acoust Soc Am 91:3540–3550
- 34. Widmer G (2002) Machine discoveries: a few simple, robust local expression principles. J New Music Res 31(1):37–50
- 35. Widmer G (2003) Discovering simple rules in complex data: a meta-learning algorithm and some surprising musical discoveries. Artif Intell 146(2):129–148
- 36. Widmer G, Goebl W (2004) Computational models of expressive music performance: the state of the art. J New Music Res 33(3):203–216
- 37. Widmer G, Tobudic A (2003) Playing Mozart by analogy: learning multi-level timing and dynamics strategies. J New Music Res 32(3):259–268
- 38. Widmer G, Flossmann S, Grachten M (2009) YQX plays Chopin. AI Mag 30(3):35–48