1 Introduction

Text-to-speech synthesis (TTS) is the technique of generating intelligible speech from a given text. Applications of TTS have grown from early systems which aid the visually impaired, to in-car navigation systems, e-book readers, spoken dialog systems, communicative robots, singing speech synthesizers, and speech-to-speech translation systems [1].

More recently, TTS systems have moved from the task of producing intelligible voices to the more difficult challenge of generating voices in multiple languages, with different styles and emotions [2]. Despite this progress, obstacles remain unresolved, such as improving the overall quality of the voices, and some researchers are striving to create TTS systems that mimic natural human voices more closely.

The statistical methods for TTS, which arose in the late 1990s, have grown in popularity [3], particularly those based on Hidden Markov Models (HMMs). HMMs are known for their flexibility in changing speaker characteristics, their low footprint, and their capacity to produce average voices. HMMs had previously been used extensively in speech recognition, the inverse task of TTS, where they proved successful at providing a robust representation of the main events into which speech can be segmented [4], using efficient parameter estimation algorithms.

More than twenty statistical speech synthesis implementations have been developed for different languages from around the world; see, for example, [5–16] for some recent publications. Every implementation for a new language, or for one of its dialects, requires adapting the HMM-related algorithms to incorporate the language's linguistic specifications and making a series of decisions regarding the type of HMM, the decision trees, and the training conditions.

In this paper, we present our implementation of a statistical parametric speech synthesis system based on HMMs, together with long short-term memory neural networks used as postfilters to improve its spectral quality.

The rest of this paper is organized as follows: Sect. 2 provides some details of an HMM-based speech synthesis system and in Sect. 3, long short-term memory neural networks are briefly described. Section 4 gives the proposed system and the experiments carried out in order to test the postfilter. Section 5 presents and discusses the results and objective evaluations conducted, and finally, some conclusions are given in Sect. 6.

2 Speech Synthesis Based on HMM

An HMM is a Markov process with unobserved or hidden states. The states themselves emit observations according to certain probability distributions.

In Fig. 1, a left-to-right HMM is represented: from each state, transitions can occur to the same state or to the next one on the right, but not in the reverse direction. Here, \(p_{ij}\) denotes the probability of a transition from state i to state j, and \(O_{k}\) denotes the observation emitted in state k.

Fig. 1. Left-to-right example of an HMM with three states
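
As an illustration (not taken from the paper), the transition matrix of a three-state left-to-right HMM such as the one in Fig. 1 could look as follows; the probability values are arbitrary:

```python
import numpy as np

# Transition probabilities p_ij for a 3-state left-to-right HMM:
# only self-loops and transitions to the next state are allowed.
p = np.array([
    [0.6, 0.4, 0.0],   # state 1: stay or move to state 2
    [0.0, 0.7, 0.3],   # state 2: stay or move to state 3
    [0.0, 0.0, 1.0],   # state 3: final (absorbing) state
])
assert np.allclose(p.sum(axis=1), 1.0)  # each row is a probability distribution
```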

In HMM-based speech synthesis, speech waveforms can be reasonably reconstructed from a sequence of acoustic parameters learnt and emitted as vectors from the HMM states [1]. A typical implementation of this model uses observation vectors comprising the pitch (f0), the mel-frequency cepstral coefficients (MFCCs), and their delta and delta-delta features, so that the dynamic characteristics of speech are adequately modeled. A common tool for building such HMM-based speech systems is HTS [17], which we also use in this paper.
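
The delta and delta-delta features are obtained by differencing the static parameter trajectories. A minimal sketch is given below, assuming a simple (-0.5, 0, 0.5) first-order difference window rather than the exact regression windows used by HTS:

```python
import numpy as np

def add_deltas(c):
    """c: (n_frames, n_coeffs) static features (e.g., f0, energy, MFCCs)."""
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])          # first derivative
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = 0.5 * (padded_d[2:] - padded_d[:-2])     # second derivative
    return np.hstack([c, delta, delta2])              # (n_frames, 3 * n_coeffs)
```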

In order to improve the quality of the results, some researchers have recently experimented with postfiltering stages, in which the parameters obtained from HTS voices are enhanced using deep generative architectures [18–21], for example Restricted Boltzmann Machines, deep belief networks, bidirectional associative memories, and recurrent neural networks (RNNs).

In the next section, we present our proposal to incorporate long short-term memory recurrent neural networks in order to improve the quality of HMM-based speech synthesis.

3 Long Short-Term Memory Recurrent Neural Networks

Among the many new algorithms developed to improve speech-related tasks such as speech recognition, several groups of researchers have experimented with Deep Neural Networks (DNNs), with encouraging results. Deep learning, based on several kinds of neural networks with many hidden layers, has achieved interesting results in many machine learning and pattern recognition problems. The disadvantage of such networks is that they cannot directly model the dependence of each parameter vector in a sequence on the preceding ones, something which is desirable in order to imitate human speech production. One way to address this problem is to use RNNs [22, 23], in which some of the neurons feed back to earlier layers or to themselves, forming a kind of memory that retains information about previous states.

An extended kind of RNN, which can store information over long or short time intervals, was presented in [24] and is called long short-term memory (LSTM). LSTMs have recently been used successfully in speech recognition, giving the lowest recorded error rates on the TIMIT database [25], as well as in other speech recognition applications [26]. The storage and use of long-term and short-term information is potentially significant for many applications, including speech processing, non-Markovian control, and music composition [24].

In an RNN, output vector sequences \(\mathbf {y}=\left( y_{1},y_{2},\dots ,y_{T} \right) \) are computed from input vector sequences \(\mathbf {x}=\left( x_{1},x_{2},\dots ,x_{T}\right) \) and hidden vector sequences \(\mathbf {h}=\left( h_{1},h_{2},\dots ,h_{T} \right) \) by iterating Eqs. 1 and 2 for \(t=1\) to T [22]:

$$\begin{aligned} h_{t}=\mathcal {H}\left( \mathbf {W}_{xh}x_{t}+\mathbf {W}_{hh}h_{t-1}+b_{h} \right) \end{aligned}$$
(1)
$$\begin{aligned} y_{t}=\mathbf {W}_{hy}h_{t}+b_{y} \end{aligned}$$
(2)

where \(\mathbf {W}_{ij}\) is the weight matrix between layers i and j, \(b_{k}\) is the bias vector of layer k, and \(\mathcal {H}\) is the activation function of the hidden nodes, usually a sigmoid function \(f:\mathbb {R}\rightarrow \mathbb {R},f(t)=\frac{1}{1+e^{-t}}\).
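
A minimal NumPy sketch of this recurrence, with the weight matrices, bias vectors, and sigmoid activation as defined above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_forward(x, W_xh, W_hh, W_hy, b_h, b_y):
    """x: (T, d_in) input sequence; returns the (T, d_out) output sequence."""
    T = x.shape[0]
    h = np.zeros(W_hh.shape[0])
    y = []
    for t in range(T):
        h = sigmoid(W_xh @ x[t] + W_hh @ h + b_h)   # Eq. 1
        y.append(W_hy @ h + b_y)                    # Eq. 2
    return np.array(y)
```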

Each cell in the hidden layers of an LSTM has additional gates that control what is stored: an input gate, a forget gate, an output gate and a cell activation, so that values can be retained over long or short time spans. These gates are implemented according to the following equations:

$$\begin{aligned} i_{t}=\sigma \left( \mathbf {W}_{xi}x_{t}+\mathbf {W}_{hi}h_{t-1}+\mathbf {W}_{ci}c_{t-1}+b_{i} \right) \end{aligned}$$
(3)
$$\begin{aligned} f_{t}=\sigma \left( \mathbf {W}_{xf}x_{t}+\mathbf {W}_{hf}h_{t-1}+\mathbf {W}_{cf}c_{t-1}+b_{f}\right) \end{aligned}$$
(4)
$$\begin{aligned} c_{t}=f_{t}c_{t-1}+i_{t}\tanh \left( \mathbf {W}_{xc}x_{t}+\mathbf {W}_{hc}h_{t-1}+b_{c} \right) \end{aligned}$$
(5)
$$\begin{aligned} o_{t}=\sigma \left( \mathbf {W}_{xo}x_{t}+\mathbf {W}_{ho}h_{t-1}+\mathbf {W}_{co}c_{t}+b_{o} \right) \end{aligned}$$
(6)
$$\begin{aligned} h_{t}=o_{t}\tanh \left( c_{t} \right) \end{aligned}$$
(7)

where \(\sigma \) is the sigmoid function, i is the input gate activation vector, f the forget gate activation vector, o the output gate activation vector, and c the cell state vector. \(\mathbf {W}_{mn}\) are the weight matrices connecting each cell to the gate vectors.
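
The following NumPy sketch implements one time step of Eqs. 3–7. Treating the cell-to-gate (peephole) weights \(\mathbf {W}_{ci}\), \(\mathbf {W}_{cf}\) and \(\mathbf {W}_{co}\) as element-wise vectors is an assumption borrowed from the usual peephole LSTM formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts holding the matrices and biases of Eqs. 3-7."""
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])  # Eq. 3
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])  # Eq. 4
    c = f * c_prev + i * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])    # Eq. 5
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c + b["o"])       # Eq. 6
    h = o * np.tanh(c)                                                         # Eq. 7
    return h, c
```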

4 Description of the System

Often, the voices resulting from the HTS system differ noticeably from the original voices used to create them. It is possible to reduce the gap between natural and artificial voices through additional learning applied directly to the data [18]. In our proposal, we use aligned utterances of the natural voice and of the synthetic voice produced by the HTS system to establish a correspondence between frames.

Given a sentence spoken with the natural voice and also synthesized with the HTS voice, we extract a representation consisting of one coefficient for f0, one coefficient for energy, and 39 MFCC coefficients per frame, using the Ahocoder system [27]. The inputs to the LSTM network are the MFCC parameters of each frame of the sentence synthesized with the HTS voice, while the outputs are the MFCC parameters of the natural voice for the same sentence. In this way, the alignment gives an exact correspondence between the vectors of each HTS utterance and those of the natural utterance.
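
As an illustration of how frame-aligned training pairs of this kind can be assembled, the sketch below warps the HTS and natural MFCC matrices onto a common time axis with dynamic time warping; using DTW (here via librosa) is an assumption on our part, since the text above only states that the utterances are aligned frame by frame.

```python
import numpy as np
import librosa

def aligned_pairs(mfcc_hts, mfcc_nat):
    """mfcc_hts, mfcc_nat: arrays of shape (n_frames, 39) for the same sentence."""
    # Warping path between the two utterances (librosa expects features in rows).
    _, wp = librosa.sequence.dtw(X=mfcc_hts.T, Y=mfcc_nat.T, metric="euclidean")
    wp = wp[::-1]                      # the path is returned from end to start
    x = mfcc_hts[wp[:, 0]]             # network input: HTS frames
    y = mfcc_nat[wp[:, 1]]             # regression target: natural-voice frames
    return x, y
```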

Hence, each LSTM network attempts to solve the regression problem of transforming the parameters of the artificial voice into those of the natural voice. This allows a further improvement in the quality of newly synthesized HTS utterances, using the network to refine the synthetic parameters so that they more closely resemble those of a natural voice. Figure 2 outlines the proposed system.

Fig. 2. Proposed system. HTS and natural utterances are aligned frame by frame

4.1 Corpus Description

The CMU_Arctic databases were constructed at the Language Technologies Institute at Carnegie Mellon University. They are phonetically balanced, cover several US English speakers, and were designed for unit-selection speech synthesis research.

The databases consist of around 1150 utterances selected from out-of-copyright texts from Project Gutenberg. The databases include US English male and female speakers. A detailed report on the structure and content of the database and the recording conditions is available in the Language Technologies Institute Tech Report CMU-LTI-03-177 [28]. Four of the available voices were selected: BDL (male), CLB (female), RMS (male) and SLT (female).

4.2 Experiments

Each voice was parameterized, and the resulting set of vectors was divided into training, validation, and testing sets. The amount of data available for each voice is shown in Table 1. Although all speakers utter the same phrases, the differences in length are due to variations in each speaker's speech rate.

Table 1. Amount of data (vectors) available for each voice in the databases

The LSTM networks for each voice had three hidden layers, with 200, 160 and 200 units, respectively.
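
As a concrete illustration of such a network, the following is a minimal sketch with the layer sizes given above (200, 160 and 200 LSTM units). Keras is used here only for illustration and is not necessarily the toolkit used for the experiments; the optimizer, loss, and the commented training call (with the hypothetical arrays x_hts and y_nat) are likewise assumptions.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

DIM = 39  # MFCC coefficients per frame, as described in Sect. 4

model = Sequential([
    LSTM(200, return_sequences=True, input_shape=(None, DIM)),
    LSTM(160, return_sequences=True),
    LSTM(200, return_sequences=True),
    Dense(DIM),  # frame-wise regression to the natural-voice MFCCs
])
model.compile(optimizer="adam", loss="mse")

# x_hts, y_nat: arrays of shape (n_utterances, n_frames, DIM), hypothetical names.
# model.fit(x_hts, y_nat, epochs=50, validation_split=0.1)
```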

To determine the improvement in the quality of the synthetic voices, several objective measures were used. These measures have been applied in recent speech synthesis experiments and were found to be reliable in measuring the quality of synthesized voices [29, 30]:

  • Mel Cepstral Distortion (MCD): between two parameterized utterances \(v^{\text{ targ }}\) and \(v^{\text{ ref }}\), excluding silent phonemes, it is computed following Eq. 8 [31]

    $$\begin{aligned} \text{ MCD }\left( v^{\text{ targ }},v^{\text{ ref }}\right) =\frac{\alpha }{T}\sum _{t=0}^{T-1}\sqrt{\sum _{d=s}^{D}\left( v_{d}^{\text{ targ }}(t)-v_{d}^{\text{ ref }}(t) \right) ^{2}} \end{aligned}$$
    (8)

    where \(\alpha =\frac{10\sqrt{2}}{\ln 10}\), T is the number of frames in each utterance, D is the total number of parameters in each vector, and s is the index of the first coefficient included in the sum; a small numerical sketch of Eq. 8 is given after this list.

  • MFCC trajectory and spectrogram visualization: inspection of these figures allows a simple visual comparison of the similarity between the synthesized and natural voices.
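
The following is a minimal numerical sketch of Eq. 8, under the assumption that the sum starts at the first cepstral coefficient (s = 1, i.e., excluding the energy term); the function and variable names are ours.

```python
import numpy as np

ALPHA = 10.0 * np.sqrt(2.0) / np.log(10.0)  # the constant alpha in Eq. 8

def mcd(v_targ, v_ref, start=1):
    """MCD between two aligned (T, D) parameter matrices.

    Silent frames are assumed to have been removed beforehand; start is the
    index s of the first coefficient included in the sum.
    """
    diff = v_targ[:, start:] - v_ref[:, start:]
    return ALPHA * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```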

These measures were applied to the test set after being processed with the LSTM networks, and the results were compared with those of the HTS voices. The results and analysis are shown in the following section.

5 Results and Analysis

For each synthesized voice produced with HTS and processed with the LSTM networks, the MCD results are shown in Table 2. It can be seen that this measure improved for all voices when they were processed with the LSTM networks.

This shows the ability of these networks to learn the particular regression problem of each voice.

Table 2. MCD between HTS and natural voices, and between LSTM postfiltering and natural voices

Fig. 3. Evolution of MCD improvement of the LSTM postfilter during the training epochs

Fig. 4. Illustration of the enhancement of the fifth mel-cepstral coefficient trajectory by the LSTM postfilter

Fig. 5. Comparison of spectrograms

The largest MCD improvement with the LSTM postfiltering was obtained for CLB (11.2 %) and the smallest for RMS (1 %). Figure 3 shows how the MCD evolves with the training epochs for each voice. All HTS voices except one were improved in MCD by the LSTM neural network postfilter after the first 50 epochs of training.

The differences in the number of epochs required to reach convergence in each case are notable. This can be explained by the difference in MCD between the HTS and the natural voices: the gap varies from voice to voice, and the larger it is, the more epochs the LSTM network needs to model the regression function between them.

An example of the parameters generated by HTS and of the enhancement achieved by the LSTM postfilter is shown in Fig. 4. It can be seen that the LSTM postfilter fits the MFCC trajectory better than the HTS base system does.

Figure 5 shows a comparison of three spectrograms of the utterance “Will we ever forget it?”: (a) the original voice, (b) HTS, and (c) the LSTM-postfilter-enhanced voice. The HTS spectrogram usually shows bands at higher frequencies that are not present in the natural voice; the LSTM postfilter helps to smooth them out, bringing the result closer to the spectrogram of the original voice.

6 Conclusions

We have presented a new proposal to improve the quality of HMM-based synthetic voices with LSTM networks. The method shows how to improve an artificial voice so that it mimics the original natural voice more closely in terms of its spectral characteristics.

We evaluated the proposed LSTM postfilter using four voices, two male and two female, and the results show that all of them improved in their spectral features, as reflected by the MCD measurements and by the comparison of spectrograms and MFCC trajectories.

The improvement in MCD of the HTS voices with respect to the original voices was observed from the first training epochs of the LSTM neural network, but convergence to a minimum distance took many more epochs. Given the extensive time required to train each epoch, further work should explore new network configurations or training conditions that reduce training time.

Future work will include the exploration of new representations of speech signals, hybrid neural networks, and fundamental frequency enhancement with LSTM postfilters.