Introduction

In music, combinations of tones form chords, and chords progress in somewhat regular ways. These regularities have been studied as harmony theory [1,2,3,4], which is employed in artificial intelligence applications for music such as structural analysis [5, 6], recommendation [7], and music generation [8,9,10].

However, such textbook theories are too restrictive to represent diverse musical styles. Considering that these theories were derived from the experience of existing music, we are interested in whether a machine can induce the inherent regularities of music by statistical learning. Unlike direct applications such as music generation systems [11, 12], the data-oriented, statistical acquisition of chord functions has rarely been investigated [13,14,15,16,17]. It has been reported that statistically induced regularities agree considerably with known textbook theories, and such approaches are expected to uncover fine-grained patterns that are characteristic of particular genres or composers [13,14,15,16,17].

These previous works, however, required chord labeling or grouping as a pre-process, including chord segmentation [13, 16], modulation segmentation [14, 16], and chord annotation with scale degrees or Berklee chords [14, 15, 17]. Although harmony analyses in textbooks are based on given chord names, the correspondence between chord names and surface pitch events in scores is not trivial and varies with musical style. Thus, obtaining consistent Berklee annotations [18] or key assignments [19] has been found to be more difficult than generally expected. In addition, pre-processed modulation detection or chord degree assignment ignores the mutual dependency between chord functions and keys. For example, the G major chord is the dominant in the C major key, while it becomes the tonic in the G major key; thus, the same chord may follow different regularities of progression depending on tonality [2]. Based on this knowledge, previous works on chord function identification have simplified the problem by removing the dependency on keys in advance. However, we regard these functional changes of chords as important features of music that should also be captured in data-oriented statistical models.

We present a model that automatically learns chord categories and segments surface notes into those categories, without relying on pre-defined chord symbols or key assignments, by utilizing an extension of the hidden semi-Markov model (HSMM), in which chord categories are learned as hidden states that predict the next chord in a chord sequence. Just as textbook theory required the concept of chord functions to explain chord progressions, we consider the characteristics of chords and the progressions between them to be important musical features worth learning from data.

Although our future goal is to apply the data-driven model to a variety of compositions to reveal the characteristics of individual composers, we start with J. S. Bach’s four-part chorales in this study. We avoid heuristic pre-processing and apply only operations based directly on musical notation; in particular, we transpose pieces so that they have no key signatureFootnote 1 and ignore octave positions.

Technically, the task is a kind of sequential tagging and is thus close to unsupervised part-of-speech (POS) tag induction in natural language processing (NLP). We employ the unsupervised neural HMM, one of the prominent models proposed for POS tag induction [20]. The most advantageous feature of the neural HMM is that additional contexts can easily be embedded. We customize the model and its additional contexts to suit music. Furthermore, to make the model accord with metrical structures, we extend it to a neural hidden semi-Markov model (HSMM)Footnote 2.

Following previous works [15, 17], we use perplexity as the evaluation metric. Experiments show that our model appropriately segments and classifies surface pitch-classes, most notably the model with the smallest perplexity. Additional contexts with neural network modeling are shown to be effective in improving perplexity. In addition, we show that the transitions between categories reflect differences of tonality when counted separately for major, minor, and dorian pieces.

This paper is organized as follows. In Sect. “Related Work”, we review related studies. We introduce the proposed model in Sect. “Unsupervised Neural Hidden Semi-Markov Model for Chord Classification”. Then, we show the experimental results in Sect. “Experiments”. Finally, we summarize our contributions in Sect. “Conclusion”.

Related Work

Unsupervised learning of harmony or chord progressions has been studied through clustering methods [13, 14], HMMs [15,16,17], and Probabilistic Context-Free Grammars (PCFGs) [15]. These studies aimed at acquiring progression rules not from possibly subjective human annotation but through data-driven analysis.

Similarities have been found between statistically induced clusters and the chord functions of textbook theories when models with the same number of clusters as the textbook were compared [14,15,16]. Especially in HMM-based models, not only the obtained clusters but also their state transition properties were found to resemble the known chord functions [16, 17]. However, selecting an optimal number of clusters is not trivial. White and Quinn [16] proposed a methodology for finding the best number of states for HMMs: they applied k-medoids clustering over hundreds of HMMs with different initial parameterizations, and then regarded a number of hidden states N as appropriate if it showed a higher silhouette width (i.e., clearer cluster boundaries) than \(N-1\) and \(N + 1\). Uehara et al. [17] proposed another, tandem approach to find the number of hidden states and detect modulation segments. On the other hand, Jacoby et al. [14] investigated the balance of the model by introducing the optimal complexity-accuracy curve rather than fixing a particular number of hidden states. Similarly, Tsushima et al. [15] evaluated generative models by perplexity, a common metric for estimating the generalization performance of a statistical model, and found that a larger number of hidden states leads to a better score. Following the latter approaches, we train our model with multiple numbers of hidden states and evaluate them by perplexity.

The preceding studies focused solely on harmonic structure; in other words, the duration of chords was ignored. However, a recent work proposed supervised learning of a combined model of harmony and rhythm and reported the efficacy of rhythmic information [21]. We also incorporate metrical information into our unsupervised learning through a semi-Markov model. In addition to employing a semi-Markov model, we adopt an extension with neural networks based on the neural HMM. An important strength of the neural HMM is the seamless integration of additional contexts. Tran et al. [20] introduced two additional contexts: an embedded feature of preceding observations computed by a Long Short-Term Memory (LSTM) [27], and morphological information via character convolutional neural networks. Despite its simple framework, the neural HMM outperformed even the highly polished Bayesian mixture model [22] and the hierarchical Pitman-Yor process HMM [23]. Additional contexts in the neural HMM can be used in the same manner in our unsupervised neural HSMM.

Unlike the hybrid DNN (Deep Neural Network)-HMM [24], which converts a pre-trained GMM (Gaussian Mixture Model)-HMM into a DNN-HMM, or the tandem model, which feeds features from a supervised DNN [25] into an HMM, the neural HMM is seamless and fully unsupervised. Our extension, the unsupervised neural HSMM, also differs from the Recurrent HSMM, which uses a bi-LSTM to reduce the error of the variational approximation [26].

Unsupervised Neural Hidden Semi-Markov Model for Chord Classification

Hidden Semi-Markov Model

Fig. 1 Hidden semi-Markov model

We aim to present a model that predicts chord segments, chord categories, and chord progressions simultaneously. To this end, we employ the hidden semi-Markov model (HSMM), an extension of the hidden Markov model (HMM), in which a Markov chain of hidden states lies behind an observable sequence. The notion of the duration of a hidden state is introduced by the “semi-Markov” extension [28, 31,32,33]. While there are various ways to model the duration of a state, one of the following three models is usually used for computational efficiency [34]: the Explicit duration HMM [28, 31], the Variable transition HMM [32], and the Residential-time HMM [33]. We select the Residential-time HMM [33, 34]Footnote 3 since its computational complexity is the smallest of the threeFootnote 4.

The Residential-time HMM assumes that a hidden state transition is independent of the duration of the previous hidden state and that the duration of a hidden state is a discrete random variable. Under this assumption, the hidden state transition is described as followsFootnote 5:

$$\begin{aligned} P(Q_t = (j,\tau ')\,|\,Q_{t-1} = (i,\tau )) = {\left\{ \begin{array}{ll} a_{i,(j,\tau ')} &{} \text {if } \tau = 1 \text { (transition)} \\ 1(\tau ' = \tau - 1) &{} \text {if } \tau > 1 \text { (decrement)} \end{array}\right. } \end{aligned}$$

where \(i, j \in \{0, 1, ..., S\}\) are state indices, S is the number of hidden states, \(\tau \in \{1, 2, \dots , D\}\) is the discrete duration time, and D is the maximum duration of a hidden state. We call a particular value of i or j a hidden state index. While a hidden state index itself is just a serial number of an H(S)MM hidden state, each hidden state is expected to correspond to a chord category after training. Furthermore, \(a_{i, (j, \tau ')}\) can be decomposed into a transition probability and a duration probability as follows.

$$\begin{aligned} a_{i, (j, \tau ')} = a_{ij} p_{j\tau '} \end{aligned}$$

Note that a hidden state changes to another only when the remaining duration \(\tau = 1\). Therefore, the duration probability determines a hidden state duration, and the self-transition probability \(a_{ii}\) is always zero.
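The transition rule above can be made concrete with a small simulation sketch. The array shapes and the sampling helper below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

# Minimal sketch (assumed layout): S hidden states, maximum duration D.
# A[i, j]  : transition probability a_ij (self-transitions fixed to zero)
# P[j, d-1]: duration probability p_{j,d} of the *newly entered* state j
def step_state(A, P, i, tau, rng=np.random.default_rng(0)):
    """One step of the Residential-time HMM hidden chain (i, tau) -> (j, tau')."""
    if tau > 1:                       # decrement branch: stay in state i, count down
        return i, tau - 1
    j = rng.choice(len(A), p=A[i])    # transition branch: leave i (A[i, i] == 0)
    new_tau = rng.choice(P.shape[1], p=P[j]) + 1   # draw a duration for state j
    return j, new_tau
```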

As shown in the graphical representation of the HSMM (Fig. 1a), each hidden state (\(z_t = i\)) can produce multiple observations. When a hidden state changes to another one (j), the model calculates the duration probability of the next state, \(p_{j\tau }\), as well as the transition probability \(a_{ij}\). However, there can be multiple possible combinations of hidden state and duration. In Fig. 1b, the number of hidden states is 3 and the maximum duration of a hidden state is 2; when the hidden state index is \(i = 0\) and the (remaining) duration is \(\tau = 1\) at time step t, the possible conditions at the previous time step \(t - 1\) are: \((i=0, \tau =2)\), \((i=1, \tau =1)\), and \((i=2, \tau =1)\). Note that if the current remaining duration is \(\tau = 1\), the hidden state must change to another one (dashed lines) at the next time step. On the other hand, if \(\tau > 1\), the hidden state continues into the next time step (solid lines).

Framework

Fig. 2 Analysis of BWV294 (in the evaluation set) by the proposed model (8-state HSMM). The key is transposed to have no key signature. i, j are hidden state indices, uniquely determined by selecting the maximum-likelihood sequence with the Viterbi algorithm after training. A token index k corresponds to a particular pitch observation; the lookup table of token indices is given in Table 4 in the Appendix. The residential time \(\tau\) represents the remaining duration of a hidden state i, which decreases by 1 within the same hidden state. Time steps increment by sixteenth-note segments. The notation \(z_{[13:16]} = 2\) means that the hidden state index is \(i = 2\) on time steps 13–16. \(a_{ij}\) is a transition probability, and \(a_{2,3}\) is that from hidden state index 2 to 3. \(b_{ik}\) is an emission probability, and \(b_{2,12}\) is that of token index 12 emitted from hidden state index 2. \(p_{j\tau }\) is a duration probability, and \(p_{3,2}\) means that the next hidden state index is 3 with a duration of 2

We expect hidden states to represent chord categories, which are not given a priori and are thus learned in an unsupervised manner. Under the Markov property assumption, hidden states (chord categories) are learned so as to govern the progression to the forthcoming chord category. We argue that an appropriate set of chord symbols varies with the targeted pieces, and thus we adopt unsupervised learning.

We give a more detailed description of the proposed framework with the example in Fig. 2. Since an HMM (HSMM) takes a discrete time series, we set one time step per sixteenth note, which is the minimum note duration in the corpus apart from a few exceptional 32nd notes. We obtain a pitch-class vector for each segmentFootnote 6. For example, at time step 25, the pitch names of the notes contained in the segment are (C, D, G), and thus the corresponding pitch-class vector is (0, 2, 7).

We build a vocabulary of observations, called tokens, from every combination of the four-part pitch-classes. For example, the token index for (0, 2, 7) is set to 16. The whole vocabulary is shown in the Appendix (Table 4). As shown in the example, observations may include passing tones. We expect a hidden state to work as a chord category behind these raw pitch-class vectors. Looking at time step 25 again, we can see that the hidden state index is 7. Here, the hidden state lasts from time step 25 to 28 and covers three types of pitch-class vectors, (0, 2, 7), (2, 7, 11), and (2, 5, 7, 11), all of which can be interpreted as part of a G major chord.
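To make the segmentation concrete, the following rough sketch extracts one pitch-class set per sixteenth-note segment with music21 (which hosts the corpus used later). The exact extraction details and the example piece `bach/bwv66.6` are assumptions for illustration, not the authors' pipeline.

```python
from music21 import corpus

def pitch_class_sequence(piece, step=0.25):
    """One pitch-class set per 16th-note segment (0.25 quarterLength)."""
    flat = piece.chordify().flatten().notesAndRests
    n_steps = int(piece.highestTime / step)
    seq = [tuple() for _ in range(n_steps)]          # one entry per segment
    for el in flat:
        if not el.isChord:                           # skip rests
            continue
        pcs = sorted({p.pitchClass for p in el.pitches})
        start = int(el.offset / step)
        end = int((el.offset + el.quarterLength) / step)
        for t in range(start, max(end, start + 1)):
            if t < n_steps:                          # merge anything sounding here
                seq[t] = tuple(sorted(set(seq[t]) | set(pcs)))
    return seq

piece = corpus.parse('bach/bwv66.6')                  # a corpus piece, for illustration
print(pitch_class_sequence(piece)[:8])                # first eight pitch-class sets
```

Each resulting pitch-class set would then be mapped to a token index via a lookup table such as Table 4.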

A hidden state emits a token with probability \(b_{ik}\), where i is a hidden state index and k is a token index. In the case of time step 25, the pitch-class vector (0, 2, 7) (whose token index is 16) is emitted with probability \(b_{7,16}\) from the hidden state indexed 7. A hidden state changes to another one with probability given by the transition probability \(a_{ij}\) multiplied by the duration probability \(p_{j\tau }\), where i is the index of the preceding hidden state, and j and \(\tau\) are the index and duration of the forthcoming hidden state, respectively. The probability of the transition from hidden state 7 to hidden state 3 at time steps 28 to 29 is then \(a_{7,3}p_{3,8}\).

Note again that hidden states and their durations are not given a priori but are obtained by unsupervised learning. More precisely, we first optimize the parameters that determine the transition, duration, and emission categorical distributions so that the tuned model gives a higher likelihood to the targeted pieces, and then obtain the best hidden state sequence under the model. We give a detailed description of the neural network architecture used in the neural HSMM in the following section.

Architecture of Neural Hidden Semi-Markov Model

Unlike the conventional HSMM, which holds its categorical parameters simply as matrices, the neural HSMM is equipped with neural network components that compute the transition, duration, and emission categorical distributions. As mentioned in the previous section, the role of each distribution is as follows.

  • Transition distribution: the probability of transition from a hidden state to another one, which corresponds to transition of chord categories.

  • Duration distribution: the probability of duration of a hidden state, which corresponds to duration of a chord category.

  • Emission distribution: the probability for a hidden state to emit a particular token. In our case, this is the probability for a chord category to emit a particular pitch-class vector.

Besides these three distributions, there is a special case of the transition distribution, i.e., the initial state distribution, described in Sect. “Initial Hidden State Probability”.

The same graphical representation (Fig. 1a) applies to neural HSMMs, but the categorical distributions are obtained as outputs of neural networks that can employ additional musical contexts in their calculation. We describe these networks and additional contexts in the following subsections: hidden state transition probability (“Hidden State Transition Probability”), initial state probability (“Initial Hidden State Probability”), duration probability (“Duration Probability”), and emission probability (“Emission Probability”)Footnote 7.

Hidden State Transition Probability

Fig. 3 The network architecture for calculating transition probabilities \(a_{ij}\). \({\varvec{s}}_i\) are the hidden state embeddings. \({\varvec{r}}^{histo}\) is an additional context, the pitch-class histogram. \({\varvec{h}}_t\) is another additional context, the embedded feature of preceding observations from the LSTM. \({\varvec{o}}_{16}\), \({\varvec{o}}_1\), and \({\varvec{o}}_7\) are observation embeddings associated with the observed tokens \(x_{t-1} = 16\), \(x_t = 1\), and \(x_{t+1} = 7\)

We denote a hidden state at time step t as \(z_t\) and its state index as i or j. Given the number of hidden states S, each hidden state i has a transition distribution over the next hidden state j, denoted as \(a_{ij}\), so that \(\sum _{j} a_{ij} = 1\). Note that, in the HSMM (Residential-time HMM), the self-transition \(a_{ii}\) is set to zero, since the duration of a hidden state is managed not by the transition distribution but by the duration distribution. Instead of treating \(a_{ij}\) as a single learnable value, we prepare a function that calculates it with neural network components as follows.

$$\begin{aligned} a_{ij} = P(z_{t+1} = j|z_t = i) = \mathrm{softmax}_j(\mathrm{MLP_{3}}([{{\varvec{s}}}_i; {{\varvec{c}}_t}])) \end{aligned}$$
(1)

where \(\mathrm{MLP_3}\) is a 3-layer multi-layer perceptron that has one input layer, two hidden layers, and one output layer, with a hyperbolic tangent (\({\tanh }\)) activation function after each hidden layerFootnote 8. The output of this function is normalized by a softmaxFootnote 9 to satisfy the condition \(\sum _{j} a_{ij} = 1\). The output layer size of the transition \(\mathrm{MLP_3}\) in (1) is one smaller than the number of states (\(S - 1\)), since self-transitions are not allowed. Since \(a_{ij}\) corresponds to the transition probability from hidden state i to j, the input for the network includes a feature of hidden state i. We provide a hidden state embedding \({{\varvec{s}}}_i\), a learnable vector associated with hidden state index i. This hidden state embedding is also used jointly in the networks for duration and emission probability (described in Sects. “Duration Probability” and “Emission Probability”) and would help capture the relationships between hidden states (chord categories) and emissions (observed pitch-class vectors).
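A minimal PyTorch sketch of Eq. (1) is given below. The layer widths, the context dimension, and the mapping of the \(S-1\) outputs onto the off-diagonal entries are assumptions for illustration; the context vector \({{\varvec{c}}}_t\) is defined in the next paragraphs.

```python
import torch
import torch.nn as nn

class TransitionNet(nn.Module):
    """Sketch of Eq. (1): a_ij = softmax_j(MLP_3([s_i; c_t])). Sizes are assumed."""
    def __init__(self, num_states, state_dim, context_dim, hidden_dim=64):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, state_dim)       # s_i
        self.mlp = nn.Sequential(                                   # 3-layer MLP, tanh
            nn.Linear(state_dim + context_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_states - 1),                  # no self-transition
        )
        self.num_states = num_states

    def forward(self, context):                   # context c_t: (context_dim,)
        S = self.num_states
        s = self.state_emb.weight                                   # (S, state_dim)
        logits = self.mlp(torch.cat([s, context.expand(S, -1)], dim=-1))
        probs = torch.softmax(logits, dim=-1)                       # (S, S-1)
        # scatter the S-1 probabilities into an S x S matrix with a zero diagonal
        a = torch.zeros(S, S)
        for i in range(S):
            a[i, torch.arange(S) != i] = probs[i]
        return a                                                    # each row sums to 1
```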

Taking further advantage of the neural HMM, we introduce two types of additional contexts for the calculation of the hidden state transition probability, namely a pitch-class histogram and an embedded feature of preceding pitch-class vectors, and feed their concatenation \({{\varvec{c}}}_t\) to (1).

  • Additional context 1: Pitch-class histogram (\(\mathbf{HISTO}\)) The first additional context is a pitch-class histogram, i.e., the frequency of occurrence of each pitch-class calculated over an entire phrase, which represents tonality. We split a piece into chord sequences at each fermata (point d’orgue), which works as a full-stop marker of the lyrics, and then sum up the duration of each pitch-class to obtain a pitch-class histogram of the sequence. Even though the main tonality of a piece can easily be distinguished by the key signatureFootnote 10, there can be local modulations, and thus we employ such a histogram as indirect information about the local tonality. We obtain the feature of the pitch-class histogram of a chord sequence, \({{\varvec{r}}}^{\textit{histo}}\), as follows.

    $$\begin{aligned} {{\varvec{r}}}^{\textit{histo}} = \mathrm{MLP_{2}}({{\varvec{v}}}^{\textit{histo}}) \end{aligned}$$
    (2)

    where \({{\varvec{v}}}^{\textit{histo}} \in {{\mathbb {R}}}^{12}\) is the raw pitch-class histogram of a sequence.

  • Additional context 2: Embedded feature of preceding observations by the LSTM (\(\mathbf{LSTM}\)) The second additional context is an embedded feature of preceding observations computed by a Long Short-Term Memory (LSTM) [27], based on the idea from [20]. The LSTM is a variant of the Recurrent Neural Network (RNN) that recursively consumes observations and is thus suitable as a feature function for sequential data. In our case, it yields a feature of the temporal sequence of pitch-class vectors from \(t=1\) up to the current time step t. At each time step, we input the observation embedding \({{\varvec{o}}}_k\) associated with the observation \({{\varvec{v}}}^{\textit{pitch}}_k\) at tFootnote 11 to the LSTM, as in (3) and (4).

    $$\begin{aligned} {{\varvec{o}}}_k&= \mathrm{tanh}(\mathrm{MLP_2} ({{\varvec{v}}}^{\textit{pitch}}_k)) \end{aligned}$$
    (3)
    $$\begin{aligned} {{\varvec{h}}}_t&= \mathrm{LSTM}({{\varvec{o}}}_k, {{\varvec{h}}}_{t-1}) \end{aligned}$$
    (4)

    In the calculation of an observation embedding (3), \({{\varvec{v}}}^{\textit{pitch}}_k\) is a binary pitch-class vector, i.e., a 12-dimensional vector of 1/0 entries. For example, if the pitch-class vector is (2, 5, 7, 11), the corresponding binary pitch-class vector is (0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1). We expect binary pitch-class vectors to convey the raw information of observations, rather than symbolized token indices; they are also used in the emission distribution network described in Sect. “Emission Probability”.

Finally, we concatenate the two contexts and obtain the context vector as (5), where [; ] denotes vector concatenation. This context vector is applied to the transition probability network (1).

$$\begin{aligned} {{\varvec{c}}}_t = [{{\varvec{r}}}^{\textit{histo}};\,{{\varvec{h}}}_t] \end{aligned}$$
(5)
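The two contexts of Eqs. (2)–(5) can be sketched in PyTorch as below. Layer sizes are assumptions, and the class name is hypothetical; the returned vector corresponds to \({{\varvec{c}}}_t\) fed to the transition network above.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Sketch of Eqs. (2)-(5): c_t = [r_histo; h_t]. Widths are assumed."""
    def __init__(self, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.histo_mlp = nn.Sequential(nn.Linear(12, hidden_dim), nn.Tanh(),
                                       nn.Linear(hidden_dim, emb_dim))           # Eq. (2)
        self.obs_mlp = nn.Sequential(nn.Linear(12, hidden_dim), nn.Tanh(),
                                     nn.Linear(hidden_dim, emb_dim), nn.Tanh())  # Eq. (3)
        self.lstm = nn.LSTM(emb_dim, emb_dim, batch_first=True)                  # Eq. (4)

    def forward(self, histogram, binary_pitch_seq):
        # histogram: (12,) raw pitch-class histogram of the phrase
        # binary_pitch_seq: (T, 12) binary pitch-class vectors up to time t
        r_histo = self.histo_mlp(histogram)                  # (emb_dim,)
        o = self.obs_mlp(binary_pitch_seq).unsqueeze(0)      # (1, T, emb_dim)
        h, _ = self.lstm(o)                                  # (1, T, emb_dim)
        h_t = h[0, -1]                                       # feature of the last step
        return torch.cat([r_histo, h_t], dim=-1)             # Eq. (5): c_t
```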

Figure 3 illustrates the network for calculating the transition probability \(a_{ij}\) described in this section. Note that the transition probability thus varies dynamically with the contexts in the neural HSMM, whereas it is static in conventional models. Therefore, although we train a model without distinguishing major/minor pieces or local modulations, we expect the additional contexts to help adjust the transition probabilities automatically.

Initial Hidden State Probability

Since the first pitch-class vector at time step 1 has no preceding observation, a special distribution for the initial hidden state is provided. We set an external initial hidden state \(z_0\) at time step 0 that has no emission and a duration of 1, to meet the initial boundary condition of the HSMM. The initial hidden state probability \(\rho _i\) is calculated as follows. This is the only component that remains almost the same as in the conventional model, except that it is normalized by the \(\mathrm{softmax}\) function.

$$\begin{aligned} \rho _i = P(z_0 = i) = \mathrm{softmax}_i({{\varvec{\pi }}}) \end{aligned}$$

where i is a hidden state index, \({{\varvec{\pi }}} \in {{\mathbb {R}}}^{S}\) is the learnable weight vector, and S is the number of hidden states.

Duration Probability

The duration probability determines how long hidden state i (i.e., a chord category) resides in the same state. The network for calculating it is as follows.

$$\begin{aligned} p_{i\tau }&= \mathrm{softmax}_{\tau }(\mathrm{MLP_3}([{{\varvec{s}}}_i;\,{{\varvec{r}}}^{\textit{beat}}_t])) \end{aligned}$$
(6)
$$\begin{aligned} {{\varvec{r}}}^{\textit{beat}}_t&= \mathrm{MLP_2}([v^{\textit{timesig}};\, v^{\textit{beat}}_t]) \end{aligned}$$
(7)
  • Additional context 3: Beat position (\(\mathbf{BEAT}\)) In (6), \({{\varvec{r}}}^{\textit{beat}}_t\) is an additional context for the calculation of the duration probability, which is expected to provide metrical information. \([v^{\textit{timesig}}; v^{\textit{beat}}_t]\) is a 2-dimensional vector consisting of the numerator of the time signature and the beat position. In this study, \(v^{\textit{timesig}}\) is constant \((=4)\) since we only treat four-four time (4/4) pieces, while \(v^{\textit{beat}}_t\) takes values in \(\{1.0, 1.25, \cdots , 4.5, 4.75\}\).

The output layer size of the duration \(\mathrm{MLP_3}\) in (6) is 16, the maximum duration length, which corresponds to a whole note. The hidden state embedding \({\varvec{s}}_i\) used in the transition probability network is used here again and is expected to help jointly learn the transition and duration distributions of hidden state i.
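A corresponding PyTorch sketch of Eqs. (6)–(7) follows; only the maximum duration of 16 is taken from the text, while the layer widths and the beat-feature size are assumptions.

```python
import torch
import torch.nn as nn

class DurationNet(nn.Module):
    """Sketch of Eqs. (6)-(7): p_{i,tau} from [s_i; r_beat_t]. Widths are assumed."""
    def __init__(self, num_states, state_dim, max_dur=16, beat_dim=8, hidden=64):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, state_dim)
        self.beat_mlp = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                      nn.Linear(hidden, beat_dim))          # Eq. (7)
        self.mlp = nn.Sequential(nn.Linear(state_dim + beat_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, max_dur))                # Eq. (6)

    def forward(self, timesig, beat):             # e.g. timesig=4.0, beat=2.25
        r_beat = self.beat_mlp(torch.tensor([timesig, beat]))
        s = self.state_emb.weight                                # (S, state_dim)
        x = torch.cat([s, r_beat.expand(s.size(0), -1)], dim=-1)
        return torch.softmax(self.mlp(x), dim=-1)                # (S, max_dur)
```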

Emission Probability

We employ a discrete HSMM in this study; therefore, each observation is associated with a token index \(k \in V\) in the vocabulary, as mentioned in the example in Sect. “Framework”. The probability \(b_{ik}\) that hidden state i yields token index k is calculated as follows, based on [20].

$$\begin{aligned}&b_{ik} = P(x_t = k|z_t = i) = \mathrm{softmax}_k({{\varvec{s}}}_{i}^{\textsf {T}}\,{{\varvec{o}}}_k + l_k) \nonumber \\&\quad = \frac{\exp {({{\varvec{s}}}_{i}^{\textsf {T}}\,{{\varvec{o}}}_k + l_k)}}{\sum _{k'}{\exp {({{\varvec{s}}}_{i}^{\textsf {T}}\,{{\varvec{o}}}_{k'} + l_{k'}})}} \end{aligned}$$
(8)
  • Additional context 4: Observation embedding from binary pitch-class vector (\(\mathbf{PITCH}\)) We use the same observation embedding \({{\varvec{o}}}_k\) as the one used in the context for the transition probability (3) for the calculation of the emission probability (8). Instead of employing weight matrices or \(\mathrm{MLP}\)s, the emission probability is learned as the dot product of a hidden state embedding and an observation embedding. The bias value \(l_k \in {{\mathbb {R}}}\) helps account for the frequency of each chord. In contrast to the conventional discrete HMM, which cannot use raw information about an observation once it is converted into a token index, our model retains that information through the observation embedding.
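A sketch of Eq. (8) in PyTorch is given below; the embedding sizes and the way the vocabulary's binary pitch-class vectors are stored are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmissionNet(nn.Module):
    """Sketch of Eq. (8): b_ik = softmax_k(s_i^T o_k + l_k). Sizes are assumed."""
    def __init__(self, num_states, state_dim, binary_pitch_vocab, hidden=64):
        super().__init__()
        # binary_pitch_vocab: float tensor (V, 12), one binary vector per token
        self.register_buffer('vocab', binary_pitch_vocab)
        self.state_emb = nn.Embedding(num_states, state_dim)
        self.obs_mlp = nn.Sequential(nn.Linear(12, hidden), nn.Tanh(),
                                     nn.Linear(hidden, state_dim), nn.Tanh())  # Eq. (3)
        self.bias = nn.Parameter(torch.zeros(binary_pitch_vocab.size(0)))      # l_k

    def forward(self):
        o = self.obs_mlp(self.vocab)                       # (V, state_dim): o_k
        logits = self.state_emb.weight @ o.T + self.bias   # (S, V): s_i^T o_k + l_k
        return torch.softmax(logits, dim=-1)               # row i: P(token k | state i)
```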

Training

Optimization

It is known that there is no analytical solution for an optimal parameterization of an HMM that maximizes the likelihood of the observed sequence [28]. Although the widely used Baum-Welch re-estimation algorithm (a kind of expectation-maximization algorithm) can obtain a locally optimal parameterization, it may get stuck in bad local optima when the likelihood surface is complex [28]. Alternatively, since this is an optimization problem, we can utilize gradient-based methods to maximize the likelihood, i.e., the marginal probability \(\sum ^N_{n=1} \ln P(\mathbf{x}^{(n)}_{1:T})\), where \(\mathbf{x}^{(n)}_{1:T}\) is an observed sequence [28,29,30]. We employ the well-known dynamic programming procedure, the forward algorithm for H(S)MMs, to calculate the marginal probability [28, 34].

Neural network models are generally trained by gradient-based optimizers together with backpropagation for gradient computation. Recently, along with the success of neural networks, efficient optimizers have also been proposed. By implementing the HSMM as a neural network, we can naturally utilize gradient-based optimization [29]. In this study, we use RAdam [35], a recent optimizer that adjusts the learning rate automatically.
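The overall training procedure can be sketched as a standard gradient loop, minimizing the negative log marginal likelihood returned by the forward algorithm. Here `log_marginal` is a hypothetical stand-in for the scaled forward pass sketched in the next subsection, and RAdam is assumed to be available as `torch.optim.RAdam` in recent PyTorch versions.

```python
import torch

def train(model, batches, log_marginal, epochs=500, lr=1e-3):
    """Minimal sketch: maximize sum_n ln P(x^(n)_{1:T}) by gradient ascent."""
    opt = torch.optim.RAdam(model.parameters(), lr=lr)    # RAdam [35]
    for epoch in range(epochs):
        for batch in batches:                             # mini-batches of phrases
            loss = -log_marginal(model, batch)            # negative log marginal likelihood
            opt.zero_grad()
            loss.backward()                               # backprop through the DP
            opt.step()
```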

Forward Algorithm for Hidden Semi-Markov Models

To conduct the gradient-based optimization, we calculate the marginal probability of an observed sequence, \(\ln P(\mathbf{x}_{1:T})\), by the forward algorithm, which marginalizes over the possible paths of states and durations (as illustrated in Fig. 1b). We give the forward algorithm for the Residential-time HMM below.

To explain the duration of hidden states, we borrow the notation used by Yu et al. [34].

  • \(t_1:t_2]\): a hidden state starts no later than \(t_1\) and ends at time \(t_2\).

Among variants of HSMMs, the Residential-Time HMM does not allow the self-transition, and a transition to another hidden state is required when the “remaining” duration is \(\tau = 1\). On the other hand, when \(\tau > 1\), the duration just decrements at each time step.

\(\alpha _{t}(j, \tau )\) denotes the forward probability that the hidden state at time t is j with remaining duration \(\tau\), and that the output tokens from time 1 to t, i.e., \(\mathbf{x}_{1:t}\), are observed. It is decomposed as follows.

$$\begin{aligned} \alpha _{t}(j, \tau )&= P(z_{t:t + \tau -1]} = j, \mathbf{x}_{1:t}) \\&= \alpha _{t-1}(j, \tau + 1) P(x_t=k|z_t=j)\\&\quad~+ P(\tau |z_t=j) P(x_t=k|z_t=j) \\&\qquad \sum _{i\backslash {}j} \alpha _{t-1}(i, 1)P(z_t=j|z_{t-1}=i) \\&= \alpha _{t-1}(j, \tau + 1) b_{jk} + p_{j\tau } b_{jk} \sum _{i\backslash {}j} \alpha _{t-1}(i, 1) a_{ij} \end{aligned}$$

where \(P(\tau |z_t = j) = p_{j\tau }\) is the duration probabilityFootnote 12, \(P(x_t = k|z_t = j) = b_{jk}\) is the emission probability, and \(P(z_t = j|z_{t-1} = i) = a_{ij}\) is the transition probability. Note that when the state at time t is j, there are two possibilities: (i) the hidden state at time \(t - 1\) is also j and its remaining duration there is \(\tau + 1\); (ii) the hidden state at time \(t - 1\) is \(i~(i \ne j)\) with remaining duration 1, and it then transitions to hidden state j at time t.

We apply scaling by replacing \(\alpha _{t}(j, \tau )\) with the corresponding conditional probability, as in the HMM case [34], and obtain a modified forward algorithm.

$$\begin{aligned} {\hat{\alpha }}_{t}(j, \tau )&= P(z_{t:t + \tau -1]} = j | \mathbf{x}_{1:t}) = \frac{\alpha _{t}(j, \tau )}{P(x_1, \dots , x_t)} \nonumber \\ C_t&= P(x_t|x_1, \dots , x_{t-1}) \nonumber \\ P(x_1, \dots , x_t)&= \prod _{t'}^{t} C_{t'} \end{aligned}$$
(9)
$$\begin{aligned} C_t {{\hat{\alpha }}_{t}(j, \tau )}&= \frac{\alpha _{t}(j, \tau )}{P(x_1, \dots , x_{t-1})} \nonumber \\&= {{\hat{\alpha }}_{t-1}(j, \tau + 1)} b_{jk} + p_{j\tau } b_{jk} \sum _{i\backslash {}j} {{\hat{\alpha }}_{t-1}(i, 1)} a_{ij} \end{aligned}$$
(10)

As can be seen from (9), the marginal probability is obtained as \(P(\mathbf{x}_{1:T}) = P(x_1, \dots , x_T) = \prod _{t'}^{T} C_{t'}\), where T is the length of the entire observation sequence; this means we can train the model by executing the forward algorithm alone, without the backward algorithm usually used in the EM algorithm. Note that \(C_t\) is obtained by summing (10) over all j and \(\tau\), since \(\sum _j \sum _{\tau } {\hat{\alpha }}_t(j, \tau ) = 1\). Unlike HMMs, we set the initial probability so as to yield the initial hidden state \({z_0}\), which emits no observation, in accordance with the initial boundary condition: \({\alpha }_0 (i, 1) = P(z_0=i)\) and \({\alpha }_0 (i, \tau ) = 0\) for \(\tau > 1\) [34]. Therefore, \({\alpha }_0 = {\hat{\alpha }}_0\) and \(C_0 = 1.0\) at \(t = 0\).
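The scaled recursion (9)–(10) with the boundary condition above can be written as the following NumPy sketch; the variable names and the array layout (duration d stored at index d−1) are assumptions for illustration, and in the neural HSMM the distributions would be recomputed from their networks at each step.

```python
import numpy as np

def forward_log_marginal(rho, A, P, B, tokens):
    """Scaled forward algorithm for the Residential-time HMM (sketch).
    rho: (S,) initial state probs; A: (S,S) transitions with zero diagonal;
    P: (S,D) duration probs (index d-1 = duration d); B: (S,V) emission probs;
    tokens: list of observed token indices.  Returns ln P(x_{1:T}) = sum_t ln C_t."""
    S, D = P.shape
    alpha = np.zeros((S, D))
    alpha[:, 0] = rho                 # t = 0: alpha_0(i,1) = rho_i, C_0 = 1
    log_marginal = 0.0
    for k in tokens:
        new = np.zeros((S, D))
        ended = alpha[:, 0]           # states whose remaining duration was 1
        for tau in range(D):          # index tau <-> remaining duration tau+1
            cont = alpha[:, tau + 1] if tau + 1 < D else 0.0   # decrement branch
            trans = P[:, tau] * (A.T @ ended)                  # enter a new state j
            new[:, tau] = B[:, k] * (cont + trans)
        C_t = new.sum()               # normalizer C_t, since sum of hat-alpha = 1
        alpha = new / C_t
        log_marginal += np.log(C_t)
    return log_marginal
```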

Experiments

In this section, we show the experimental results. We first describe the dataset (Sect. “Dataset”) and the experimental setups (Sect. “Experimental Setups”). Thereafter, we show the evaluation results by perplexity in Sect. “Evaluation by Perplexity”. We give a qualitative analysis of the obtained chord categories and their progressions in Sect. “Qualitative Analysis for Induced Clusters”. Finally, we show a detailed exemplification and discussion of BWV267 with reference to a human annotation in Sect. “Discussion on an Analysis by the Model”.

Dataset

We use J. S. Bach’s four-part chorales in the Riemenschneider numbering system (1-371) from the Music21 Corpus [37] as our dataset. First, we removed duplicated pieces according to the analysis by Dahn [38]. We retain only four-four time (4/4) piecesFootnote 13, so the number of pieces is 290 (Table 1). We regard each phrase as an independent sequence; however, when assigning phrases to the training, evaluation, and testing sets, we randomize over pieces (not over phrases), since one piece may contain similar phrases.

Table 1 The statistics of the dataset

Experimental Setups

In this research, we disregard differences of absolute pitch: we transpose all major pieces to C major and all minor pieces to a minor. Some pieces in the corpus are written in the dorian scale without a key signature. From the viewpoint of modern tonality, we regard them as d minor; however, we do not shift them to a minor. Local modulations are shifted in the same way, so as to preserve the relative positions of constituent notes. Finally, we ignore octave differences in pitch events and chord inversions.

We assign token indices (k) to the pitch-class sets whose accumulated duration covers 95% of the dataset; the remaining chords are merged into “\(\textsf {Others}\)”. The obtained vocabulary size was 80, including “\(\textsf {Rest}\)”.
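One way to build such a vocabulary is sketched below; treating each sixteenth-note segment as one duration unit and the exact cut-off handling are assumptions, not the authors' exact procedure.

```python
from collections import Counter

def build_vocabulary(pc_sequences, coverage=0.95):
    """Sketch: keep the pitch-class sets covering 95% of the accumulated duration,
    and merge the rest into 'Others'. Each sequence item is one 16th-note segment."""
    dur = Counter(seg for seq in pc_sequences for seg in seq)   # 1 segment = 1 unit
    total = sum(dur.values())
    vocab, covered = {}, 0
    for seg, n in dur.most_common():        # most frequent pitch-class sets first
        if covered / total >= coverage:
            break
        vocab[seg] = len(vocab)
        covered += n
    vocab['Others'] = len(vocab)            # catch-all token for the rest
    return vocab
```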

In training, we set the mini-batch size to 8. We train models for up to 500 epochs; however, training is stopped when the lowest loss on the evaluation set has not been updated for 20 epochs. We apply dropout [36] to each hidden layer of the MLPs, which randomly ignores a fraction of neurons (12.5% in our case) during training to avoid over-fitting.

Evaluation by Perplexity

Evaluation Metric

Following previous works [15, 17], we use perplexity as an evaluation metric, defined by the following equation.

$$\begin{aligned} {\mathcal {P}} = \exp {\left( -\frac{1}{T}\ln P(x_1, \dots x_T;\,{{\varvec{\theta }}})\right) } \end{aligned}$$
(11)

where \((x_1, \dots , x_T)\) is a sequence of output tokens and \({{\varvec{\theta }}}\) is the set of model parameters. A smaller perplexity on test data indicates better generalization performance, since perplexity corresponds to the geometric average of the inverse probability; it is thus commonly used for evaluating probabilistic models. We examine the performance for multiple numbers of hidden states, from 3 to 16. We conduct the experiments with three different random seeds \(\{0, 1, 2\}\) for each number of hidden states and report the averaged scores in Sect. “Evaluation by Perplexity”.
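In other words, Eq. (11) exponentiates the per-step negative log-likelihood; a minimal sketch, reusing the forward-algorithm routine above:

```python
import numpy as np

def perplexity(log_marginal, T):
    """Eq. (11): perplexity of a length-T sequence from its log marginal likelihood."""
    return np.exp(-log_marginal / T)

# e.g. pp = perplexity(forward_log_marginal(rho, A, P, B, tokens), len(tokens))
```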

Perplexity Scores

Fig. 4 Averaged perplexities by three trials with random seeds of \(\{0, 1, 2\}\) on the testing set

In addition to the neural HSMM, we implement a baseline model that represents the probabilities simply as learnable weight vectors or matrices with softmax output layers. The baseline model is thus an almost “non-neural” HSMM tuned by the same gradient-based optimizer. We compare the proposed model with this baseline to see how effective the elaborated neural network components are. In Fig. 4, we can see that the neural models considerably outperformed the baselines.

The ablation study in Table 2 also shows the efficacy of the additional contexts. Removing any of them, i.e., the pitch-class histogram (\(\mathbf{-HISTO}\)), the embedded feature of preceding observations by the LSTM (\(\mathbf{-LSTM}\)), the beat position (\(\mathbf{-BEAT}\)), or the observation embedding from binary pitch-class vectors (\(\mathbf{-PITCH}\)), degraded the perplexity. Among them, removingFootnote 14 the observation representation by pitch-classes (\(\mathbf{-PITCH}\)) led to a significant drop. Since we do not employ MLPs in the calculation of the emission probability but instead force the model to learn the relationships between hidden states and vocabulary entries directly, the raw pitch-class information would help the learning.

Table 2 Ablation studies. Averaged perplexities by three trials with random seeds of \(\{0, 1, 2\}\) on the testing set. The bold numbers are the best scores for each number of hidden states

Qualitative Analysis for Induced Clusters

Fig. 5 Comparison between (top) the neural HSMM with the best evaluation perplexity among the three random seeds, (middle) the neural HSMM with the worst evaluation perplexity, and (bottom) the baseline HSMM with the best evaluation perplexity. The bar charts show the top three emissions for each hidden state

In this section, we give a qualitative analysis and discussion of the induced chord clusters. We focus on the eight-state models, since we have transposed the pieces to have no key signature and assume that each cluster roughly corresponds to a triad on the diatonic scale, or otherwise to rests.

Induced Clusters and Model’s Perplexities

We investigate whether the better-scoring models yield clearer clusters in the obtained examples. We show the emission probabilities of the best neural HSMM, the worst neural HSMM, and the best baseline model in Fig. 5. Note that we executed three training trials for each model with different random seeds from \(\{0, 1, 2\}\). Here, the “best” model is the one whose evaluation perplexity is the smallest among the three trials.

We observe that the clusters obtained by the best-scoring neural HSMM mainly consist of the chords of the diatonic scales (C major and a minor) and their commonly used borrowed chords, such as the D major chord. We show the top three emissions for each hidden state index in Fig. 5 and summarize them in Table 3. For readability, we name each chord category after its most frequent output chord. Although the most frequent chord is representative of the category, the category does not correspond to a unique chord but to a set of emission probabilities; therefore, we add a “hat” to the name. The Â cluster may emerge for two reasons: the common use of the Picardy third, or its role as the dominant of the dorian-mode pieces. It is worth noting that seventh chords and passing chords are merged into appropriate clusters, e.g., C.D.E.G in the same hidden state as C.E.G (state3), and D.F.G.B with D.G.B (state7).

Table 3 Chord categories obtained by the best-scoring 8-state neural HSMM. Each chord category is named after the chord name of the top emission for its hidden state

Even with the same neural HSMM, the worst-perplexity model (middle of Fig. 5) appears less appropriate than the best model. For example, C.E.A is mixed with C.F.A and E.G\(\sharp\)(A\(\flat\)).B in state7, and C.E.G (state4) and the passing chords around it (state0) are separated. Although we can choose a model by perplexity, we admit that the difficulty of reaching the global optimum remains even with the efficient gradient-based optimizer.

The best baseline model (evaluation perplexity 9.92) is still worse than the worst neural HSMM (9.18). Not only did the best baseline model score a worse perplexity, it also possessed more miscellaneous clusters than the neural HSMMs. However, we can observe that chords of the same tonality, such as C.E.G and D.F.G.B in state3, or D.G.B and D.F\(\sharp\)(G\(\flat\)).A in state5, tend to be merged into the same category in the baseline HSMM.

Hidden State Transitions

Fig. 6 Counts of hidden state transitions on (top) major pieces, (middle) minor pieces, and (bottom) dorian pieces. Sequences of hidden states are computed by the Viterbi algorithm. The chord category labels given after the hidden state indices are defined in Table 3

Since chord functions lie in the regularity of chord progressions, we try to find them by investigating the transitions between the obtained chord categories. We count the hidden state transitions separately for major, minor, and dorian pieces to see whether the model appropriately reflects the differences of tonality in its state transitions. Note that, instead of examining the hidden state transition probabilities directly, we count the number of transitions after decoding the hidden state sequences with the Viterbi algorithm, since the transition probabilities change with the contexts; a minimal counting sketch is shown below.
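In this sketch, the input is assumed to hold one Viterbi-decoded, run-length-collapsed state sequence per phrase, grouped by tonality; the variable names are hypothetical.

```python
from collections import Counter

def count_transitions(state_sequences):
    """Count (i, j) hidden-state transitions from decoded sequences (sketch)."""
    counts = Counter()
    for states in state_sequences:            # e.g. [7, 3, 2, 7, 3] for one phrase
        counts.update(zip(states, states[1:]))  # consecutive state pairs
    return counts

# Counting majors, minors, and dorian pieces separately, as in Fig. 6:
# major_counts = count_transitions(viterbi_paths['major'])
```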

We show the hidden state transition properties of the best-scoring neural HSMM in Fig. 6 (its emission probabilities are shown at the top of Fig. 5). In major pieces, the tendency {state4:F̂, state2:d̂} (subdominant) \(\rightarrow\) state7:Ĝ (dominant) \(\rightarrow\) state3:Ĉ (tonic) is noticeable. In minor pieces, the same tendency suggests the presence of the relative major key, which is consistent with previous works [13, 17]. In addition, a strong transition from state5:Ê (dominant) \(\rightarrow\) state6:â (tonic) is observed. Unlike in major pieces, the transition from state2:d̂ (subdominant) \(\rightarrow\) state5:Ê (dominant) increases in minor pieces, since it corresponds to the subdominant \(\rightarrow\) dominant transition in a minor. Finally, in dorian pieces, transitions from state1:Â (dominant) \(\rightarrow\) state2:d̂ (tonic) are observed. Unlike in C major and a minor pieces, the transition from state7:Ĝ no longer shows a strong tendency to proceed to state3:Ĉ but tends to proceed to state1:Â instead, corresponding to the subdominant \(\rightarrow\) dominant progression.

Discussion on an Analysis by the Model

In this section, we discuss the adequacy of our model based on an obtained result. We select BWV267 from our testing set, since a human analysis of it is publicly available [37]Footnote 15. Although our aim is not to reproduce the human analysis, we consult it as one possible interpretation of the local modulations and chord annotations. Note that we normalized the score to have no key signature, and thus the main key became C major, while the original key was G major. We show excerpts of the result in Fig. 7Footnote 16. According to the human analysis [37], there are local modulations to the {F major, G major, d minor, g minor} keys in this piece.

Fig. 7 Chord classification by the neural HSMM on BWV267 (Excerpt)

In this example, we can see that the model found an effective set of clusters covering the chords that appear in a piece with local modulations. For example, in the G major section (phrase No. 4, time steps 21–32), we observed the D̂ cluster (state0), which is a borrowed chord from the viewpoint of C major. This D̂ cluster was used again in the g minor section (phrase No. 7, time steps 7–32). Interestingly, g minor chords, i.e., D.G.A\(\sharp\)(B\(\flat\)) (\(k=21\)) and D.F.G.A\(\sharp\)(B\(\flat\)) (\(k=60\)), were classified into the same cluster as Ĝ (state7), since the g minor chord and the G major chord share the same dominant (state0: D̂). Similarly, in the F major section (phrase No. 4, time steps 1–20), C major dominant seventh chords, i.e., C.E.G.A\(\sharp\)(B\(\flat\)) (\(k=24\)), were classified into the same cluster as C major chords (state3: Ĉ). According to Schoenberg, local modulations basically remain in closely related keys and can thus be analyzed with altered chords [3]. In the case of the g/G chords and C/C7 chords described above, these alterations (g or C7) do not change the probable progression of chords, and thus they are classified into the same function as the corresponding diatonic chord (G or C).

In the longest modulation section, in d minor (phrase No. 6), we could see that the tonic chord cluster (state3: Ĉ) did not appear and that the d minor chord cluster (state2: d̂) tended to proceed to A major chords (state1: Â).

Since we ignored chord inversions in this study, we basically could not capture the differences between them. We admit that this needs further consideration and leave it as future work. In addition, we observed more “\(\textsf {Others}\)” tokens (\(k = 79\)) in modulation sections, which hindered the analysis. The “\(\textsf {Others}\)” or “\(\textsf {Unknown}\)” token has been introduced in discrete HMMs to avoid a useless increase in the dimension of the emission distribution. However, once an observation is converted into “\(\textsf {Others}\)”, the raw information about it (in our case, the pitch-class vector) is lost. More effective treatment of infrequent observations, or a more direct representation of the emission distribution, is another important piece of future work.

Conclusion

In this paper, we aimed to obtain the regularities of chord progressions by data-oriented unsupervised learning. Chord functions were introduced to explain typical progressions a posteriori, and music is therefore not constrained by such functions. For this reason, we have been interested in letting machines find statistically plausible chord functions.

Harmony analysis consists of the following four processes: (i) defining an appropriate set of chord labels (e.g., chord degrees or Berklee chords in traditional approaches), (ii) determining the scopes of chords and their labels from the surface structure of a piece, (iii) identifying the sequence of chords based on the determined labels, and (iv) finally, analyzing the labeled chord sequence by examining the transitions between chords, i.e., chord functions. Our objective is to let computers execute these processes without relying on human annotations or pre-defined labels. In our experiments, we only pre-processed the source scores by shifting keys to either C major or a minor and by ignoring octave differences in pitch events. This pre-processing can be executed automatically, without relying on external music feature extraction tools.

We regarded the states of a hidden Markov model (HMM) as representing chord categories, and we employed an extended HMM with the notion of hidden state duration, called the hidden semi-Markov model (HSMM). Furthermore, we utilized neural networks for calculating the categorical distributions so as to embed the pitch-class distribution, preceding chord sequences, beat information, and so on, as additional contexts.

The experimental results showed the efficacy of such contexts in terms of perplexity, in comparison with the baseline and the ablation studies (Sect. “Evaluation by Perplexity”). We obtained clear chord clusters with the best-scoring proposed model, in which seventh chords and borrowed chords were classified into appropriate clusters (Sect. “Induced Clusters and Model’s Perplexities”). We have also shown a detailed exemplification on BWV267 (Sect. “Discussion on an Analysis by the Model”).

As Hugo Riemann [2] remarked, “In the change of these function (tonic, dominant, and sub-dominant) lies the essence of modulation”. According to this view, the distinction of keys should not be given prior to the chord analysis; rather, modulations should be externalized as a result of the analysis. We followed this aphorism and avoided independent modulation detection. Instead, we characterized each hidden state by counting the number of transitions, dividing the experimental results into major, minor, and dorian pieces. In this regard, we have not detected keys automatically, but we found, in an unsupervised manner, that the distribution of state transitions was consistent with known classical theory (Sect. “Hidden State Transitions”).

As future work, we plan to improve the proposed methodology to incorporate modulation detection integrated with the chord analysis. We also intend to apply our model to pieces that may not be faithful to classical musicology. Further extensions are needed to overcome the out-of-vocabulary issue described in Sect. “Discussion on an Analysis by the Model” for pieces that involve more elaboration than four-part chorales. In addition, there are usually few pieces in a particular genre by a specific composer. In such a case, the learned transition and emission probabilities are expected to provide good initial values (pre-trained information) for fine-tuning, which is known to be effective when a sufficient amount of training data is not available. For this reason, we used Bach’s four-part chorales, which constitute a substantial set of coherently structured pieces composed at the beginning of tonal music. We hope that future work on extensive music analysis by machines will contribute to expanding the scope of discovering cultural evolution in music.