1 Introduction

Dynamic time warping (DTW) is a method of optimally aligning two distinct time series of generally different lengths. In addition to the alignment, DTW computes a score indicating the similarity of the two sequences. This ability to quantify the similarity between time series led to the application of DTW in automatic speech recognition (ASR) systems several decades ago [1, 2]. It has remained popular in this field, with more recent developments reported in [3] and [4].

DTW has also found application in fields related to ASR. For example, it has been used successfully in keyword spotting and information retrieval (IR) systems [5–7]. To accomplish IR, sub-sequences in a speech signal that match a template with a certain degree of time warping are detected. The direct approach to keyword spotting has recently been extended by training a convolutional neural network (CNN) to emulate the template matching performed by DTW, thereby providing a substantial computational advantage [8, 9].

In the related task of acoustic pattern discovery, DTW can be allowed to consider multiple local alignments between speech signals during the overall search [10]. In this way, DTW can find similar segment pairs in speech audio, followed by a clustering step [11]. The resulting cluster labels are used to train hidden Markov models (HMMs).

In an effort to improve performance, several variations of DTW have been proposed since its inception. For example, a one-against-all index (OAI) for each time series under consideration is proposed in [4]. The OAI is subsequently used to weight the corresponding DTW alignment score in a speech recognition system.

Another modification of DTW which was reported to improve performance is the parametric derivative dynamic time warping (DDTW) that was applied to hierarchical clustering of UCR Time Series Classification Archive data [12]. Parametric DDTW combines the scores produced by DTW and by DDTW to provide a final similarity measure. A similar weighted modification of DTW has been proposed in [13].

Finally, DTW has also been applied to the direct matching of points along the best alignment for use in a signature verification system [14]. A stability function is subsequently applied, and the resulting score is used as a similarity measure.

We describe a modification of DTW and demonstrate its improved performance when used as a similarity measure to cluster speech segments. Our DTW modification exploits the asynchronous temporal structure of features extracted from speech. Related work has considered such feature trajectories by training a separate HMM for each mel frequency cepstral coefficient (MFCC) feature dimension [15]. That work reports improvements in both phoneme and word recognition. The clustering of speech segments also has several useful applications in ASR [16–18]. Recently, it has been particularly useful in the automatic discovery of sub-word units [19, 20].

Section 2 reviews the standard formulation of DTW and Section 3 describes our proposed modification. Section 4 presents the evaluation tools we employ and Section 5 describes the data we use for experimentation. Section 6 presents an experimental evaluation of the proposed method. Section 7 discusses the results and concludes the paper.

2 Classical dynamic time warping

We consider speech segments as temporal sequences of multidimensional feature vectors in the Euclidean space. Sequences are of arbitrary and generally different length, but all vectors are of equal dimension. The DTW algorithm recursively determines the best alignment between two such vector time series by minimizing a cumulative path cost that is commonly based on Euclidean distances between time-aligned vectors [2, 21].

Consider N such sequences \(\mathbf{X}_{i}\), i=1,2,…,N, each composed of \(T_{i}\) feature vectors, as defined in Eq. 1.

$$ \mathbf{X}_{i} =\lbrace \mathbf{x}_{i1},\mathbf{x}_{i2},\ldots,\mathbf{x}_{iT_{i}} \rbrace, \ \ \ i=1,2,\ldots,N $$
(1)

Each feature vector \(\mathbf{x}_{it}\) has m dimensions, as indicated in Eq. 2.

$$ \mathbf{x}_{it} = \left\langle x_{it}^{(1)},x_{it}^{(2)},\ldots,x_{it}^{(m)} \right\rangle,\ \ \ t=1,2,\ldots,T_{i} $$
(2)

Two sequences \(\mathbf{X}_{i}\) and \(\mathbf{X}_{j}\) are aligned by constructing a \(T_{i}\)-by-\(T_{j}\) distance matrix \(D_{ij}(p,q)\) whose entries contain the distances \(d(\mathbf{x}_{ip},\mathbf{x}_{jq})\). Typical choices for d are the Euclidean distance and the Manhattan distance. A matrix of minimum accumulated distances \(\gamma_{ij}(p,q)\) is then constructed by considering all paths from \(D_{ij}(1,1)\) to \(D_{ij}(p,q)\). Subject to the local and global path constraints, \(\gamma_{ij}(p,q)\) is computed recursively according to the principle of dynamic programming, as shown in Eq. 3 [2].

$$ \begin{aligned} \gamma_{ij}(p,q) = D_{ij}(p,q) \,+ \min \left\lbrace \gamma_{ij}(p-1,q-1),\right. \\\left. \gamma_{ij}(p-1,q), \gamma_{ij}(p,q-1)\right\rbrace \end{aligned} $$
(3)

The similarity \(\text{DTW}(\mathbf{X}_{i},\mathbf{X}_{j})\) between vector sequences \(\mathbf{X}_{i}\) and \(\mathbf{X}_{j}\) is then given by Eq. 4. Here, K is the length of the optimal path from \(D_{ij}(1,1)\) to \(D_{ij}(T_{i},T_{j})\) and is used to normalise the similarity value.

$$ {\begin{aligned} \text{DTW}(\mathbf{X}_{i},\mathbf{X}_{j})=\frac{1}{K}\gamma_{ij} \left(T_{i},T_{j} \right) \end{aligned}} $$
(4)
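
As a minimal sketch of Eqs. 3 and 4, the recursion might be written in Python as shown below. The function names (euclidean, dtw, dtw_normalised) are illustrative and do not come from the original work. The sketch assumes the three predecessor steps of Eq. 3, applies no global path constraint, and, as a further assumption, breaks ties between equal-cost predecessors in favour of the shorter path when counting K.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw(x, y, dist=euclidean):
    """Return (accumulated cost, path length K) for vector sequences x and y.

    Follows the recursion of Eq. 3; row 0 and column 0 are padding so that
    the first real cell (1, 1) has a single valid predecessor.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # Each cell stores (minimum accumulated distance, length of that path).
    gamma = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = (0.0, 0)
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            # Cheapest predecessor; equal costs resolve to the shorter path.
            cost, k = min(gamma[p - 1][q - 1], gamma[p - 1][q], gamma[p][q - 1])
            gamma[p][q] = (dist(x[p - 1], y[q - 1]) + cost, k + 1)
    return gamma[n][m]

def dtw_normalised(x, y, dist=euclidean):
    """Path-length-normalised score of Eq. 4."""
    cost, k = dtw(x, y, dist)
    return cost / k
```

For example, dtw([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]]) aligns the repeated middle vector at no cost along a path of length four.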

This standard formulation of dynamic time warping will in the remainder of the paper be referred to as classical DTW. Figure 1 shows the classical DTW alignment between two different sequences of 21-dimensional spectral feature vectors representing the same sound uttered by different speakers. These spectral features are obtained by straightforward binning of the short-time power spectra. To avoid clutter, the alignment of just four of the feature vectors is shown.

Fig. 1 Alignment by classical DTW of spectral features extracted from the triphone b-aa+dx as uttered by (a) male speaker mrfk0 and (b) female speaker fdml0 in the TIMIT corpus

3 Feature trajectory DTW (FTDTW)

We define a feature trajectory \(X_{i}^{(l)}\) as the time series obtained when considering the l-th element of each feature vector in a sequence Xi, as shown in Eq. 5.

$$ X_{i}^{(l)} =\left\lbrace x_{i1}^{(l)}, x_{i2}^{(l)},\ldots,x_{iT_{i}}^{(l)} \right\rbrace, \ \ \ l = 1,2,\ldots,m $$
(5)

Hence, \(X_{i}^{(l)}\) is a one-dimensional time series for feature l. We now calculate the similarity of two feature vector sequences by applying classical DTW to each corresponding pair of feature trajectories, and subsequently normalise the sum, as shown in Eq. 6.

$$ \text{FTDTW}(\mathbf{X}_{i}, \mathbf{X}_{j})=\frac{1}{\beta} \sum\limits_{l=1}^{m} \text{DTW} \left( X_{i}^{(l)},X_{j}^{(l)} \right) $$
(6)

where \(\beta =\sqrt{\sum \nolimits_{l=1}^{m} K_{l}^{2}}\), \(K_{l}\) is the length of the optimal path for the l-th pair of feature trajectories and DTW(·,·) is non-normalised classical DTW.

As an illustration, we repeat the alignment of the two speech segments shown in Fig. 1 using FTDTW. Figure 2a identifies seven features from each of the four feature vectors shown in Fig. 1a. Figure 2b demonstrates how each of these seven features aligns with the second speech segment. The features themselves are the same as those illustrated in Fig. 1. For the illustrated example, application of Eq. 6 involves 21 separate alignments, each between corresponding feature trajectories, as also indicated in Fig. 2. The resulting 21 scores are summed and normalised by β. Figure 2 illustrates how, in contrast to classical DTW, FTDTW does not require features that are coincident in time in one segment to align with features that are also coincident in time in the other segment. Finally, we note that, because each of the m DTW alignments in the summation on the right-hand side of Eq. 6 is computed independently, the FTDTW computation can easily be parallelised over m processors or cores. This provides a computational advantage over classical DTW, which involves the alignment of vector sequences and is not so easily parallelised.
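
The computation of Eq. 6 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: the names dtw_1d and ftdtw are hypothetical, the local distance between scalar features is assumed to be the absolute difference, and ties between equal-cost predecessors are assumed to resolve to the shorter path.

```python
import math

def dtw_1d(a, b):
    """Non-normalised DTW between two scalar series; returns (cost, path length)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # Each cell stores (accumulated distance, path length); index 0 is padding.
    gamma = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = (0.0, 0)
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            cost, k = min(gamma[p - 1][q - 1], gamma[p - 1][q], gamma[p][q - 1])
            gamma[p][q] = (abs(a[p - 1] - b[q - 1]) + cost, k + 1)
    return gamma[n][m]

def ftdtw(x, y):
    """FTDTW score of Eq. 6 for two sequences of m-dimensional vectors."""
    m = len(x[0])
    total, sum_sq_len = 0.0, 0
    for l in range(m):
        # Feature trajectory l: the l-th element of every vector (Eq. 5).
        cost, k = dtw_1d([v[l] for v in x], [v[l] for v in y])
        total += cost
        sum_sq_len += k * k
    # beta is the root of the summed squared path lengths.
    return total / math.sqrt(sum_sq_len)
```

Each iteration of the loop in ftdtw touches only its own pair of trajectories, so the iterations could be distributed over separate workers without changing the result.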

Fig. 2 Alignment by FTDTW of spectral features extracted from the triphone b-aa+dx as uttered by (a) male speaker mrfk0 and (b) female speaker fdml0 in the TIMIT corpus

4 Evaluation

We evaluate the effectiveness of our proposed modification to DTW by using it to compute similarities between speech segments, and then using these similarities to perform agglomerative hierarchical clustering [22, 23]. We cluster speech segments corresponding to triphones extracted from the TIMIT corpus as well as isolated digits extracted from the Spoken Arabic Digit Dataset (SADD). Since phonetic alignments are provided for the former and word alignments for the latter, the ground truth is available. Hence, we can use external metrics to quantify the quality of the resulting clusters [24]. We chose the F-measure and normalised mutual information (NMI) as metrics for cluster evaluation in our experiments [25, 26]. These two metrics represent two commonly used categories of external evaluation measures, namely set-matching-based measures and information-theoretic measures. The F-measure was chosen because it is a widely used set-matching-based measure for the evaluation of clustering and classification systems [27]. The NMI is a popular choice among the information-theoretic clustering evaluation measures [28].

4.1 Agglomerative hierarchical clustering

In agglomerative hierarchical clustering (AHC), the agglomeration of data objects (speech segments in the case of our experimental evaluation) is initialised by the assumption that each object is the sole occupant of its own cluster. A binary tree referred to as a dendrogram is created by successively merging the closest cluster pairs until a single cluster remains [29]. We use the popular Ward method to quantify inter-cluster similarity [30]. The input to the AHC algorithm is a symmetric N×N proximity matrix populated by the values of DTW(·,·) or FTDTW(·,·), and the output consists of R clusters.
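
The merging procedure can be sketched in Python as below. The sketch is an assumption about one way to apply Ward's criterion to a precomputed proximity matrix, namely the Lance-Williams update; the text does not state which implementation was used. The function name ward_ahc is hypothetical, and the search for the closest pair is deliberately naive, so the sketch is suitable only for small N.

```python
import math

def ward_ahc(dist, n_clusters):
    """Cluster N objects from a symmetric N x N distance matrix.

    Returns a list of N integer labels. Clusters are merged greedily and
    inter-cluster distances are refreshed with the Lance-Williams update
    for Ward linkage (an assumed choice of update rule).
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}  # active cluster id -> object indices
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > n_clusters:
        # Closest pair of active clusters (keys are always (smaller, larger)).
        (s, t), d_st = min(d.items(), key=lambda kv: kv[1])
        ns, nt = len(members[s]), len(members[t])
        merged = members.pop(s) + members.pop(t)
        new_d = {}
        for v, objs in members.items():
            nv = len(objs)
            d_vs = d[(min(v, s), max(v, s))]
            d_vt = d[(min(v, t), max(v, t))]
            new_d[(v, next_id)] = math.sqrt(
                ((nv + ns) * d_vs ** 2 + (nv + nt) * d_vt ** 2 - nv * d_st ** 2)
                / (ns + nt + nv)
            )
        # Drop distances involving the two merged clusters, then add the new ones.
        d = {key: val for key, val in d.items() if s not in key and t not in key}
        d.update(new_d)
        members[next_id] = merged
        next_id += 1
    labels = [0] * n
    for label, objs in enumerate(members.values()):
        for i in objs:
            labels[i] = label
    return labels
```

In the setting described above, dist would hold the pairwise DTW(·,·) or FTDTW(·,·) values and n_clusters would be set to R.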

4.2 F-measure

The F-measure is based on the quantities precision (PR) and recall (RE). Precision indicates the degree to which a cluster is dominated by a particular class, while recall indicates the degree to which a particular class is concentrated in a specific cluster. Precision and recall are defined in Eqs. 7 and 8 respectively.

$$ \text{PR}(r,v)=\frac{n_{rv}}{n_{r}} $$
(7)
$$ \text{RE}(r,v)=\frac{n_{rv}}{n_{v}} $$
(8)

Here, \(n_{rv}\) indicates the number of objects of class v in cluster r; \(n_{r}\) and \(n_{v}\) indicate the number of objects in cluster r and class v respectively. The F-measure (F) is given in Eq. 9.

$$ F(r,v)=\frac{2 \times \text{RE}(r,v) \times \text{PR}(r,v)}{\text{RE}(r,v)+\text{PR}(r,v)} $$
(9)

When the clusters are perfect, \(n_{rv}=n_{r}=n_{v}\), and hence, F(r,v)=1.
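
Eqs. 7 to 9 can be computed from two parallel label lists, as in the following Python sketch. The function names are illustrative. Note that the text defines F(r,v) only for a single cluster and class pair; the aggregate in overall_f, a class-size-weighted mean of the best F(r,v) found for each class, is one common convention and is an assumption here, not necessarily the aggregation used in the experiments.

```python
from collections import Counter

def pairwise_f(clusters, classes):
    """Return {(r, v): F(r, v)} for every cluster r and class v that co-occur."""
    n_r = Counter(clusters)                  # objects per cluster
    n_v = Counter(classes)                   # objects per class
    n_rv = Counter(zip(clusters, classes))   # objects of class v in cluster r
    scores = {}
    for (r, v), count in n_rv.items():
        precision = count / n_r[r]   # Eq. 7
        recall = count / n_v[v]      # Eq. 8
        scores[(r, v)] = 2 * recall * precision / (recall + precision)  # Eq. 9
    return scores

def overall_f(clusters, classes):
    """Assumed aggregate: class-size-weighted mean of each class's best F(r, v)."""
    scores = pairwise_f(clusters, classes)
    total = len(classes)
    return sum(
        size / total * max(f for (r, v), f in scores.items() if v == cls)
        for cls, size in Counter(classes).items()
    )
```

Pairs with \(n_{rv}=0\) are omitted from the dictionary because both precision and recall are zero for them.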

4.3 Normalised mutual information

Normalised mutual information (NMI) is defined in terms of the following two sets:

  • The set of R clusters \(\mathbf{G}=\lbrace G_{1},G_{2},\ldots,G_{R}\rbrace\), and

  • The set of V classes \(\mathbf{C}=\lbrace C_{1},C_{2},\ldots,C_{V}\rbrace\) representing ground truth.

NMI is based on the mutual information I(G,C) between classes and clusters [26, 31]. On its own, the mutual information does not penalise an increase in the number of clusters, and therefore, it is normalised by a factor based on the cluster entropy H(G) and class entropy H(C). These entropies measure cluster and class cohesiveness respectively. The NMI criterion is given in Eq. 10.

$$ \text{NMI}(\mathbf{G},\mathbf{C})=\frac{2I(\mathbf{G},\mathbf{C})}{\left[ H(\mathbf{G}) + H(\mathbf{C})\right]} $$
(10)

The mutual information I(G,C) and the entropies H(G) and H(C) are given in Eqs. 11, 12 and 13 respectively.

$$ I(\mathbf{G},\mathbf{C})=\sum\limits_{r \in \mathbf{G}} \sum\limits_{v \in \mathbf{C}} P(G_{r} \cap C_{v}) \log \frac{P(G_{r} \cap C_{v})}{P(G_{r})P(C_{v})} $$
(11)

In Eq. 11, \(P(G_{r})\), \(P(C_{v})\) and \(P(G_{r} \cap C_{v})\) are the probabilities of a segment belonging to cluster \(G_{r}\), class \(C_{v}\) and the intersection of \(G_{r}\) and \(C_{v}\) respectively.

$$ H(\mathbf{G})= -\sum\limits_{r \in \mathbf{G}}P(G_{r}) \log P(G_{r}) $$
(12)
$$ H(\mathbf{C})= -\sum\limits_{v \in \mathbf{C}}P(C_{v}) \log P(C_{v}) $$
(13)

It can be shown that I(G,C), and hence NMI, is zero when the clustering is random with respect to class membership, and that NMI achieves a maximum of 1.0 for perfect clustering [31].
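
A short Python sketch of Eqs. 10 to 13 is given below. It estimates all probabilities as relative frequencies over the N segments and uses the standard definition of mutual information, in which each logarithmic term is weighted by the joint probability \(P(G_{r} \cap C_{v})\). The function name nmi and the return value chosen for the degenerate case of zero total entropy are assumptions made for illustration.

```python
import math
from collections import Counter

def nmi(clusters, classes):
    """Normalised mutual information between cluster and class label lists."""
    n = len(clusters)
    p_g = {r: c / n for r, c in Counter(clusters).items()}
    p_c = {v: c / n for v, c in Counter(classes).items()}
    p_gc = {rv: c / n for rv, c in Counter(zip(clusters, classes)).items()}
    # Mutual information; pairs with zero joint probability contribute nothing.
    mi = sum(p * math.log(p / (p_g[r] * p_c[v])) for (r, v), p in p_gc.items())
    h_g = -sum(p * math.log(p) for p in p_g.values())  # cluster entropy, Eq. 12
    h_c = -sum(p * math.log(p) for p in p_c.values())  # class entropy, Eq. 13
    if h_g + h_c == 0.0:
        # Assumed convention: one cluster and one class count as a perfect match.
        return 1.0
    return 2.0 * mi / (h_g + h_c)  # Eq. 10
```

The base of the logarithm cancels in the ratio of Eq. 10, so the natural logarithm is used throughout.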

5 Data

Our first set of experiments uses speech segments taken from the TIMIT speech corpus [32]. TIMIT has been chosen because it includes accurate time-aligned phonetic transcriptions, meaning that both phonetic labels and their start/end times are known. As our desired clusters, we use triphones, which are phones in specific left and right contexts [33]. We consider triphones that occur at least 20 times and at most 25 times in the corpus. This leads to an evenly balanced set of 8772 speech segments, which also corresponds approximately to the number of segments in our second set of experiments.

For comparison and confirmation purposes, we performed a second set of experiments using the Spoken Arabic Digit Dataset (SADD) [34]. SADD consists of 8800 utterances already parametrised as 13-dimensional MFCCs. The utterances were spoken by 44 male and 44 female Arabic speakers. Each utterance in the SADD corresponds to a single Arabic digit and will therefore be considered to be a single segment in our experiments. Each digit (0 to 9) was uttered ten times by each speaker.

A third set of experiments is based on 10 independent subsets of speech segments drawn from the TIMIT SI and SX utterances, irrespective of occurrence frequency. This better represents the unbalanced distribution of triphones that may be expected in unconstrained speech. Table 1 summarises the datasets used in each of the three sets of experiments.

Table 1 Datasets used for experimental evaluation

We considered two feature vector parametrisations popular in the field of speech processing, namely mel frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) coefficients [35, 36]. For the former, log frame energy was appended to the first 12 MFCCs to produce a 13-dimensional feature vector. The first and second differentials (velocity and acceleration) were subsequently added to produce the final 39-dimensional MFCC feature vector. For the latter, 13 PLP coefficients were considered, to which velocity and acceleration were added, again resulting in a 39-dimensional feature vector. One such feature vector was extracted for each 10 ms frame of speech, where consecutive frames overlapped by 5 ms. All TIMIT feature vectors were computed using HTK [37]. SADD provides pre-computed MFCC features, and hence, PLP features were not used in the associated experiments.

6 Experiments

To evaluate the performance of feature trajectory DTW (FTDTW) as an alternative to classical DTW as a similarity measure, we will employ it to perform AHC of the speech segments described in Section 5. The quality of the automatically determined clusters will be determined using the F-measure and in several cases also NMI.

In a first set of experiments, we cluster dataset 1 (Table 1).

Figure 3 reflects the clustering performance in terms of (a) the F-measure and (b) NMI, when using MFCCs as features. Both the F-measure and NMI are plotted as a function of the number of clusters. Note that the F-measure continues to decline as the number of clusters exceeds 1200.

Fig. 3 Clustering performance for dataset 1 when using MFCC features in terms of (a) F-measure and (b) NMI

Figure 3a and b show that FTDTW improves on the performance of the classical DTW in this clustering task in terms of both F-measure and NMI. Especially in terms of F-measure, this improvement is substantial.

A corresponding set of experiments using PLP features was carried out for dataset 1, and the results are shown in Fig. 4. The same trends seen for MFCCs in Fig. 3 are observed, with substantial improvements particularly in terms of F-measure.

Fig. 4 Clustering performance for dataset 1 when using PLP features in terms of (a) F-measure and (b) NMI

In a second set of experiments, we clustered dataset 2 (Table 1) which consists of isolated Arabic digits. Figure 5 indicates the clustering performance, both in terms of F-measure and NMI for this dataset. Again, we observe that FTDTW outperforms the classical DTW in terms of both F-measure and NMI in practically all cases.

Fig. 5 Clustering performance for dataset 2 in terms of (a) F-measure and (b) NMI

In a third and final set of experiments, we considered dataset 3 (Table 1). The 10 independent subsets of the TIMIT training set each contained between 12034 and 12495 triphone segments. In contrast to the experiments for dataset 1, all triphone tokens were considered irrespective of occurrence frequency. Furthermore, the number of clusters was chosen to be 2394, a figure which corresponds to the number of triphone types with more than 10 occurrences in the data. A single number of clusters, rather than a range as presented in Figs. 3, 4 and 5, has been used here in order to make the required computations practical. Figure 6 presents the clustering performance for each of the 10 subsets in terms of F-measure. We observe that FTDTW achieves an improvement over classical DTW in all cases. A paired t test indicated p<0.0001, and hence, the improvements are statistically highly significant. Similar improvements were observed in terms of NMI.

Fig. 6 Clustering performance for the 10 independent subsets of dataset 3 in terms of F-measure

7 Discussion and conclusions

The experiments in Section 6 have applied our modified DTW algorithm (FTDTW) to the clustering of speech segments. Our experiments show consistent and statistically significant improvement over the classical DTW baseline for both MFCC and PLP parametrisations and across three datasets. We conclude that FTDTW is more effective as a similarity measure for speech signals than the classical DTW.

Because the classical DTW operates on a feature-vector by feature-vector basis, it enforces absolute temporal synchrony between the feature trajectories. In contrast, FTDTW does not impose this synchrony constraint, but aligns feature trajectories independently on a pair-by-pair basis. Since FTDTW is observed to lead to better clusters in our experiments, we conclude that the strict temporal synchrony imposed by the classical DTW is counter-productive in the case of speech signals. We further speculate that segments of speech that human listeners would regard as similar also exhibit such differing time-scale warping among the feature trajectories. It remains to be seen whether this decoupling of the feature trajectories is advantageous for signals other than speech.

Finally, and noting that it is not a focus of this paper, we may consider the maxima observed in the F-measure in Figs. 3 and 4, and in both the F-measure and NMI in Fig. 5. A peak in the quality of the clusters as a function of the number of clusters may be taken to indicate the best estimate of the ‘true’ number of clusters in the data. For the experiments using the MFCC parametrisation of dataset 1 (Fig. 3), we see that an optimum in the F-measure is reached at 501 and 421 clusters for FTDTW and classical DTW respectively. The ‘true’ number of clusters corresponds to the number of triphone types in dataset 1, which is 404. Hence, both DTW formulations over-estimate the number of clusters. A similar tendency is seen for the PLP parametrisation of the same dataset, where the F-measure peaks at 439 and 559 clusters for classical DTW and FTDTW respectively, and also for dataset 2 in Fig. 5.

Although the ground truth is known, the class definitions (triphones for datasets 1 and 3 and isolated digits for dataset 2) may be called into question. In particular, although all tokens of a triphone correspond to acoustic segments from the same phone within the same left and right contexts, there are many other possible sources of systematic variability, such as the accent of the speaker. Hence, it may be reasonable to expect that a larger number of clusters is needed to optimally model the data. To determine whether this is the case, the clusters should be used to train acoustic models for an ASR system. The quality of different clusterings of the data can then be compared by comparing the performance of the resulting ASR systems. We intend to address this question in ongoing work.