1 Introduction

In recent years, voice conversion (VC), a technique for changing specific information in the speech of a source speaker into that of a target speaker while retaining the linguistic information, has been garnering much attention in speech signal processing. VC techniques have been applied to various tasks, such as speech enhancement [1], emotion conversion [2], speaking assistance [3], and other applications [4,5]. Most related work in VC focuses not on f0 conversion but on the conversion of spectral features, and we follow that convention in this article as well.

Various statistical approaches to VC have been studied so far, including those discussed in [6,7]. Among these approaches, the Gaussian mixture model (GMM)-based mapping method [8] is widely used, and a number of improvements have been proposed. Toda et al. [9] introduced dynamic features and the global variance (GV) of the converted spectra over a time sequence. Helander et al. [10] proposed transforms based on partial least squares (PLS) to prevent the over-fitting problem encountered in standard multivariate regression. There have also been approaches that do not require parallel data since they use a GMM adaptation technique [11], an eigenvoice GMM [12,13], or a probabilistic integration model [14]. Other statistical approaches have also been proposed; Jian et al. [15] used canonical correlation analysis for VC, and Takashima et al. [16] proposed a VC technique using exemplar-based non-negative matrix factorization (NMF).

However, most conventional VC methods, including the GMM-based approaches, rely on 'shallow' voice conversion based on linear (or piecewise linear) transformation; that is, the source speech is converted either directly in the original feature space or with a shallow architecture that has only a few hidden layers. To capture the characteristics of speech more precisely, it is necessary to have a deeper non-linear architecture with more hidden layers. The behavior of the vocal tract is generally non-linear, so non-linear voice conversion is more compatible with human speech. Examples of non-linear VC methods based on neural networks (NNs) were proposed by Narendranath et al. [17] and Desai et al. [18]. In the GMM-based approaches, the conversion is achieved so as to maximize the conditional probability calculated from a joint probability of the source and target speech, which is trained beforehand. NN-based approaches, on the other hand, directly train the conditional probability that converts the feature vector of a source speaker into that of a target speaker. It is often reported that such a discriminative approach performs better than a generative approach, such as a GMM, in speech recognition and synthesis as well as in VC [19,20]. For these reasons, NN-based approaches achieve relatively high performance if the training samples are carefully prepared [18].

These approaches often suffer from over-smoothing or over-fitting problems. GMM-based approaches represent acoustic features using multiple Gaussian distributions, which are estimated by averaging observations with similar context descriptions during training. Therefore, the outputs of the GMM are distributed near the modes (means) of the Gaussians, which causes over-smoothing. Furthermore, over-fitting arises when more Gaussian mixtures are used in an attempt to estimate the observed distribution precisely. In NN-based approaches, the model is often over-fitted due to its complexity, exaggerating small fluctuations in unseen data when the amount of training data is insufficient relative to the number of parameters.

In order to alleviate the over-smoothing effect in GMM-based methods, several approaches have been proposed, such as the global variance model [21], a minimizing-divergence model [22], and post-filtering [23]. An exemplar-based VC system using non-negative matrix factorization (NMF) has also been proposed to tackle the over-smoothing problem [16,24]. In our earlier work [25], we proposed a VC technique that copes with the over-fitting problem in NN-based approaches by combining speaker-dependent restricted Boltzmann machines (RBMs) [26] (or deep belief nets (DBNs) [27]), which capture high-order features in an unsupervised manner, with a concatenating NN. It has been reported that these graphical models are better at representing the distribution of high-dimensional observations with cross-dimension correlations than GMMs in speech synthesis [28] and in speech recognition [29]. Since Hinton et al. introduced an effective training algorithm for DBNs in 2006 [27], the use of deep learning has rapidly spread in the field of signal processing, including speech signal processing. RBMs (or DBNs) have been used, for example, for hand-written character recognition [27], 3-D object recognition [30], machine transliteration [31], and so on.

In this paper, we extend our earlier work in [25] to systematically capture time information as well as latent (deep) relationships between a source speaker’s and a target speaker’s features in a single network. This is accomplished by combining speaker-dependent conditional restricted Boltzmann machines (CRBMs) and a concatenating NN.

A CRBM is a non-linear probabilistic model used to represent time-series data that consists of three factors: (i) an undirected model between binary latent variables and the current visible variables, (ii) a directed model from the previous visible variables to the current visible variables, and (iii) a directed model from the previous visible variables to the latent variables. In our approach, we first train two exclusive CRBMs for the source and the target speakers independently using segmented training data prepared for each speaker, then train an NN using the projected features, and finally fine-tune the networks as a single network for VC. Because the training data for the source speaker's CRBM include various phonemes particular to the speaker, the speaker-dependent network tries to capture the abstractions that maximally express the training data, which contain abundant speaker-individuality information and less phoneme-related information. Furthermore, the network captures time-series features through the directed models (ii) and (iii), enabling it to discover temporal correlations at the same time. Therefore, we expect that feature conversion conducted in such time-related, individuality-emphasized, high-order spaces is much easier than conversion in the original spectrum-based space.

Similar research can be found in [32] and [33]. Wu et al. employed a CRBM to capture the linear and non-linear relationships between the source and the target features [32]. Chen et al. used an RBM to model the joint spectral distribution instead of the conventional joint-density GMM [33]. Unlike these approaches, which are based on joint models, our method trains two exclusive (C)RBMs, one for each speaker, aiming to capture speaker-specific, conversion-friendly features. We discuss the differences between these approaches and the proposed method in Section 4.

The rest of the article is organized as follows. In Section 2, we briefly review the fundamental techniques (RBMs and CRBMs) before explaining our method. The proposed VC system is presented in Section 3, and we compare the proposed method with existing related work in Section 4. We describe the various experiments and VC results in Section 5, and we conclude the article in Section 6.

2 Preliminaries

Our voice conversion system uses CRBMs to capture high-order, conversion-friendly features. An RBM, the fundamental model underlying the CRBM, was first introduced as a method of representing binary-valued data [34,35]; it later came to be used for real-valued data (such as acoustic features) in the form known as the Gaussian-Bernoulli RBM (GBRBM) [36]. However, it has been reported that the original GBRBM had some difficulties because the training of its parameters was unstable [27,37,38]. Later, an improved learning method for the GBRBM was proposed by Cho et al. [39] to overcome these difficulties. We briefly review RBMs and CRBMs in this section, introducing their improved versions.

2.1 RBM

An RBM is an undirected graphical model that defines the distribution of visible variables with binary hidden (latent) variables. In the literature on the improved GBRBM [39], the joint probability p(v,h) of real-valued visible units \(\boldsymbol {v} = [v_{1}, \cdots, v_{I}]^{T}, v_{i} \in \mathbb {R}\) and binary-valued hidden units \(\boldsymbol {h}=[h_{1},\cdots,h_{J}]^{T}, h_{j} \in \{0,1\}\) is defined as follows:

$$\begin{array}{@{}rcl@{}} p(\boldsymbol {v},\boldsymbol {h}) &=& \frac{1}{Z} e^{-E(\boldsymbol {v},\boldsymbol {h})} \end{array} $$
((1))
$$\begin{array}{@{}rcl@{}} E(\boldsymbol {v},\boldsymbol {h}) &=& \left\| \frac{\boldsymbol {v} - \boldsymbol {b} }{2 \boldsymbol {\sigma}} \right\|^{2} - {\boldsymbol {c}}^{T} \boldsymbol {h} - \left(\frac{\boldsymbol {v}}{\boldsymbol {\sigma}^{2}} \right)^{T} \boldsymbol{W} \boldsymbol {h} \end{array} $$
((2))
$$\begin{array}{@{}rcl@{}} Z &=& \sum_{\boldsymbol {v},\boldsymbol {h}} e^{-E(\boldsymbol {v},\boldsymbol {h})}, \end{array} $$
((3))

where ∥·∥² denotes the L2 norm. \(\boldsymbol {W} \in \mathbb {R}^{I \times J}\), \(\boldsymbol {\sigma } \in \mathbb {R}^{I \times 1}\), \(\boldsymbol {b} \in \mathbb {R}^{I \times 1}\), and \(\boldsymbol {c} \in \mathbb {R}^{J \times 1}\) are the model parameters of the GBRBM, indicating the weight matrix between visible units and hidden units, the standard deviations associated with the Gaussian visible units, a bias vector of the visible units, and a bias vector of the hidden units, respectively. The fraction bar in Equation 2 denotes element-wise division.

Because there are no connections between visible units or between hidden units, the conditional probabilities p(h|v) and p(v|h) form simple equations as follows:

$$\begin{array}{@{}rcl@{}} p(h_{j}=1|\boldsymbol {v}) &=& \mathcal{S} \left(c_{j} + \left(\frac{\boldsymbol {v}}{\boldsymbol {\sigma}^{2}}\right)^{T} \boldsymbol{W}_{:j} \right) \end{array} $$
((4))
$$\begin{array}{@{}rcl@{}} p(v_{i}=v|\boldsymbol {h}) &=& \mathcal{N} \left(v \ | \ b_{i} + \boldsymbol{W}_{i:} \boldsymbol {h}, {\sigma_{i}^{2}} \right), \end{array} $$
((5))

where \(\boldsymbol{W}_{:j}\) and \(\boldsymbol{W}_{i:}\) denote the jth column vector and the ith row vector of \(\boldsymbol{W}\), respectively. \(\mathcal {S}(\cdot)\) and \(\mathcal {N}(\cdot | \mu, \sigma ^{2})\) indicate an element-wise sigmoid function and a Gaussian probability density function with mean μ and variance σ², respectively.
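
As an illustrative sketch (not part of the original formulation), the conditional distributions in Equations 4 and 5 can be written in a few lines of Python/NumPy; the parameter shapes follow the definitions above, and all names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of Equations 4 and 5 for a GBRBM with parameters
# W (I x J), b (I,), c (J,), and sigma2 (I,) the per-dimension variances.
def p_h_given_v(v, W, c, sigma2):
    # Eq. 4: probability that each hidden unit is on
    return sigmoid(c + (v / sigma2) @ W)

def sample_v_given_h(h, W, b, sigma2, rng=np.random):
    # Eq. 5: visible units are Gaussian with mean b + W h and variance sigma2
    mean = b + W @ h
    return mean + np.sqrt(sigma2) * rng.standard_normal(mean.shape)
```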

For parameter estimation, the log-likelihood of a collection of visible units \(\mathcal {L} = \log \prod _{n} p(\boldsymbol {v}_{n})\) is used as an evaluation function. Differentiating partially with respect to each parameter, we obtain:

$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial \boldsymbol{W}_{ij}} &=& \left\langle \frac{v_{i} h_{j}}{{\sigma_{i}^{2}}} \right\rangle_{\text{data}} - \left\langle \frac{v_{i} h_{j}}{{\sigma_{i}^{2}}} \right\rangle_{\text{model}} \end{array} $$
((6))
$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial b_{i}} &=& \left\langle \frac{v_{i}}{{\sigma_{i}^{2}}} \right\rangle_{\text{data}} - \left\langle \frac{v_{i}}{{\sigma_{i}^{2}}} \right\rangle_{\text{model}} \end{array} $$
((7))
$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial c_{j}} &=& \left\langle h_{j} \right\rangle_{\text{data}} - \left\langle h_{j} \right\rangle_{\text{model}}, \end{array} $$
((8))

where 〈·〉data and 〈·〉model indicate expectations with respect to the data and the model distributions, respectively. However, it is generally difficult to compute the second term, so, in practice, the expectation over reconstructed data 〈·〉recon computed by Equations 4 and 5 is typically used instead (contrastive divergence) [27].

In the improved GBRBM, the variance parameter \({\sigma _{i}^{2}}\) is re-parameterized as \({\sigma _{i}^{2}} = e^{z_{i}}\) so as to constrain the variance to be positive and to stabilize the training of the parameters. Under this modification, the gradient with respect to z_i becomes:

$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial z_{i}} = \ & e^{-z_{i}} \left\langle \frac{(v_{i} - b_{i})^{2}}{2} - v_{i} \boldsymbol{W}_{i:} \boldsymbol {h} \right\rangle_{\text{data}} \\ & - e^{-z_{i}} \left\langle \frac{(v_{i} - b_{i})^{2}}{2} - v_{i} \boldsymbol{W}_{i:} \boldsymbol {h} \right\rangle_{\text{model}}. \end{array} $$
((9))

Using Equations 6, 7, 8, and 9, each parameter can be updated by stochastic gradient descent with a fixed learning rate and a momentum term.
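
A rough, hypothetical sketch of a single contrastive-divergence (CD-1) step for the improved GBRBM is given below, reusing the helper functions from the previous sketch; the model expectations are approximated by a one-step reconstruction, and details such as momentum and mini-batching are omitted.

```python
import numpy as np

# Hypothetical single CD-1 update for the improved GBRBM (Eqs. 6-9),
# with the model expectations approximated by a one-step reconstruction.
def cd1_update(v0, params, lr=0.01):
    W, b, c, z = params['W'], params['b'], params['c'], params['z']
    sigma2 = np.exp(z)                          # variance re-parameterization

    h0 = p_h_given_v(v0, W, c, sigma2)          # data-driven hidden probabilities
    v1 = b + W @ h0                             # Gaussian mean used as the reconstruction
    h1 = p_h_given_v(v1, W, c, sigma2)

    # Gradients (Eqs. 6-8): data term minus reconstruction term
    dW = np.outer(v0 / sigma2, h0) - np.outer(v1 / sigma2, h1)
    db = v0 / sigma2 - v1 / sigma2
    dc = h0 - h1

    # Gradient for the log-variance parameter z (Eq. 9)
    dz = np.exp(-z) * ((0.5 * (v0 - b) ** 2 - v0 * (W @ h0))
                       - (0.5 * (v1 - b) ** 2 - v1 * (W @ h1)))

    # Gradient ascent on the log-likelihood with a fixed learning rate
    for name, g in zip(['W', 'b', 'c', 'z'], [dW, db, dc, dz]):
        params[name] = params[name] + lr * g
    return params
```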

2.2 CRBM

A CRBM, proposed by Taylor et al. [40], is an extension of the RBM that is suitable for the representation of time-series data. In addition to the undirected model used in an RBM, a CRBM also employs directed models from a collection of previous visible units \(\mathcal {V}'^{(t)} = \left \{\boldsymbol {v}^{(p)}\right \}_{p=t-P}^{t-1}, \ \boldsymbol {v}^{(p)}=\left [v_{1}^{(p)},\cdots,v_{I}^{(p)}\right ]^{T}, v_{i}^{(p)} \in \mathbb {R}\) to the binary hidden units \(\boldsymbol {h}^{(t)}=\left [h_{1}^{(t)},\cdots,h_{J}^{(t)}\right ]^{T}, h_{j}^{(t)} \in \{0,1\}\) and to the current visible units \(\boldsymbol {v}^{(t)}=\left [v_{1}^{(t)},\cdots,v_{I}^{(t)}\right ]^{T}, v_{i}^{(t)} \in \mathbb {R}\) at the current frame t, where P is the number of previous frames taken into account. In this model, there are three types of parameters to be estimated: \(\boldsymbol {W}_{v_{p}'v} \in \mathbb {R}^{I \times I}\) (a directed weight matrix from \(\boldsymbol{v}^{(t-p)}\) to \(\boldsymbol{v}^{(t)}\)), \(\boldsymbol {W}_{v'_{p}h} \in \mathbb {R}^{I \times J}\) (a directed weight matrix from \(\boldsymbol{v}^{(t-p)}\) to \(\boldsymbol{h}^{(t)}\)), and \(\boldsymbol {W}_{\textit {vh}} \in \mathbb {R}^{I \times J}\) (an undirected weight matrix between \(\boldsymbol{v}^{(t)}\) and \(\boldsymbol{h}^{(t)}\)). These weights are estimated using contrastive divergence in a similar manner to an RBM, by maximizing the likelihood \(\mathcal {L} = \log \prod _{t} p\left (\boldsymbol {v}^{(t)} | \mathcal {V}'^{(t)}\right)\), where:

$$\begin{array}{@{}rcl@{}} p\left(\boldsymbol {v}^{(t)}|\mathcal{V}'^{(t)}\right) &=& \frac{1}{Z} \sum_{\boldsymbol {h}^{(t)}} e^{-E\left(\boldsymbol {v}^{(t)},\boldsymbol {h}^{(t)}|\mathcal{V}'^{(t)}\right)} \end{array} $$
((10))
$$\begin{array}{@{}rcl@{}} Z &=& \sum_{\boldsymbol {v}^{(t)},\boldsymbol {h}^{(t)}} e^{-E\left(\boldsymbol {v}^{(t)},\boldsymbol {h}^{(t)}|\mathcal{V}'^{(t)}\right)}. \end{array} $$
((11))

Inspired by the improved learning method for the GBRBM, we define the energy function E in this paper as follows:

$$\begin{array}{@{}rcl@{}} &&E\left(\boldsymbol {v}^{(t)},\boldsymbol {h}^{(t)}|\mathcal{V}'^{(t)}\right) \end{array} $$
((12))
$$\begin{array}{@{}rcl@{}} && = \left\| \frac{\boldsymbol {v}^{(t)} - \boldsymbol {b}'^{(t)}}{2 \boldsymbol {\sigma}} \right\|^{2} - {\boldsymbol {c}'^{(t)}}^{T} \boldsymbol {h}^{(t)} - \left(\frac{\boldsymbol {v}^{(t)}}{\boldsymbol {\sigma}^{2}} \right)^{T} \boldsymbol{W}_{vh} \boldsymbol {h}^{(t)} \\ && \boldsymbol {b}'^{(t)} = \boldsymbol {b} + \sum_{p} \boldsymbol{W}_{v'_{p}v}^{T} \boldsymbol {v}^{(t-p)} \end{array} $$
((13))
$$\begin{array}{@{}rcl@{}} && \boldsymbol {c}'^{(t)} = \boldsymbol {c} + \sum_{p} \boldsymbol{W}_{v'_{p}h}^{T} \boldsymbol {v}^{(t-p)}. \end{array} $$
((14))

We obtain the following partial derivatives of the log-likelihood \(\mathcal{L}\):

$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial \left(\boldsymbol{W}_{v_{p}'v}\right)_{i'i}} &=& \left\langle \frac{v_{i}^{(t)} v^{(t-p)}_{i'}}{{\sigma_{i}^{2}}} \right\rangle_{\text{data}} - \left\langle \frac{v^{(t)}_{i} v^{(t-p)}_{i'}}{{\sigma_{i}^{2}}} \right\rangle_{\text{model}} \end{array} $$
((15))
$$\begin{array}{@{}rcl@{}} \frac{\partial \mathcal{L}}{\partial \left(\boldsymbol{W}_{v_{p}'h}\right)_{i'j}} &=& \left\langle {v^{(t-p)}_{i'}} h^{(t)}_{j} \right\rangle_{\text{data}} - \left\langle v^{(t-p)}_{i'} h^{(t)}_{j} \right\rangle_{\text{model}} \end{array} $$
((16))

The other parameters related to the undirected model (\(\boldsymbol{W}_{vh}\), \(\boldsymbol{b}\), \(\boldsymbol{c}\), and \(\boldsymbol{\sigma}\) (or \(\boldsymbol{z}\))) are also calculated from Equations 6, 7, 8, and 9 by proper substitution of variables. Once the parameters are estimated, forward inference (the conditional probability of \(\boldsymbol{h}^{(t)}\) given \(\boldsymbol{v}^{(t)}\) and \(\mathcal {V}'^{(t)}\)) and backward inference (the conditional probability of \(\boldsymbol{v}^{(t)}\) given \(\boldsymbol{h}^{(t)}\) and \(\mathcal {V}'^{(t)}\)) are respectively written as:

$$\begin{aligned} &p\left(h^{(t)}_{j} = 1|\boldsymbol{v}^{(t)},\mathcal{V}'^{(t)}\right) \\ &= \mathcal{S} \left(c_{j} + {\boldsymbol {v}^{(t)}}^{T} \boldsymbol{W}_{{vh}_{:j}} + \sum_{p} {\boldsymbol {v}^{(t-p)}}^{T} \boldsymbol{W}_{{v_{p}'h}_{:j}} \right) \end{aligned} $$
((17))
$$\begin{aligned} &p\left(v^{(t)}_{i}=v|\boldsymbol {h}^{(t)},\mathcal{V}'^{(t)}\right)\\ &= \mathcal{N} \left(v \ | \ b_{i} + {\boldsymbol {h}^{(t)}}^{T} \boldsymbol{W}^{T}_{{vh}_{i:}} + \sum_{p} {\boldsymbol {v}^{(t-p)}}^{T} \boldsymbol{W}_{{v_{p}'v}_{:i}}, {\sigma_{i}^{2}} \right), \end{aligned} $$
((18))
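
For concreteness, the following hypothetical sketch (reusing the sigmoid helper above) computes the dynamic biases of Equations 13 and 14 and the inferences of Equations 17 and 18 for P=1; the matrix names are placeholders for W_vh, W_v'v, and W_v'h.

```python
# Hypothetical sketch of CRBM inference with P = 1 previous frame.
# W (I x J) plays the role of W_vh, Wvv (I x I) of W_v'v, and
# Wvh_prev (I x J) of W_v'h.
def crbm_dynamic_biases(v_prev, b, c, Wvv, Wvh_prev):
    b_dyn = b + Wvv.T @ v_prev        # Eq. 13
    c_dyn = c + Wvh_prev.T @ v_prev   # Eq. 14
    return b_dyn, c_dyn

def crbm_forward(v_t, v_prev, W, b, c, Wvv, Wvh_prev):
    # Eq. 17: probability of each hidden unit being on
    _, c_dyn = crbm_dynamic_biases(v_prev, b, c, Wvv, Wvh_prev)
    return sigmoid(c_dyn + v_t @ W)

def crbm_backward_mean(h_t, v_prev, W, b, c, Wvv, Wvh_prev):
    # Mean of the Gaussian in Eq. 18
    b_dyn, _ = crbm_dynamic_biases(v_prev, b, c, Wvv, Wvh_prev)
    return b_dyn + W @ h_t
```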

3 Proposed voice conversion

In general, the less phonological information and the more individuality-emphasized features a source input carries for a speaker, the easier it is to convert the source features into target features. This paper proposes voice conversion using such features.

Figure 1 shows an overview of our proposed voice conversion system where we set P=1. In our approach, we independently train CRBMs for each speaker beforehand, as shown in Figure 1a. Variables \(\boldsymbol{x}^{(t)}\) and \(\boldsymbol{y}^{(t)}\) (\(\boldsymbol{x}^{(t-1)}\) and \(\boldsymbol{y}^{(t-1)}\)) represent acoustic feature vectors (e.g., visible units in a CRBM), such as Mel-frequency cepstral coefficients (MFCCs), at frame t (at frame t−1) for a source speaker and a target speaker, respectively.

Figure 1

A flow chart of the proposed voice conversion system. (a) CRBMs for a source speaker (below) and a target speaker. (b) Our proposed voice conversion architecture combining the two pre-trained speaker-dependent CRBMs with a concatenating NN.

For the source speaker, for instance, the parameter matrix \(\boldsymbol{W}_{xh}\) is estimated, along with \(\boldsymbol {W}_{x^{\prime } h}\) and \(\boldsymbol {W}_{x^{\prime }x}\), so as to maximize the probability of T chained training samples \(p(\boldsymbol {x}) = \prod _{t=1}^{T} p\left (\boldsymbol {x}^{(t)} | \boldsymbol {x}^{(t-1)}\right)\). Using these matrices, an input vector \(\boldsymbol{x}^{(t)}\) at frame t, given the previous vector \(\boldsymbol{x}^{(t-1)}\), is projected into the speaker-dependent latent space that captures speaker individualities. The latent features \(\boldsymbol {h}_{x}^{(t)}\) can be calculated using a mean-field approximation as follows:

$$\begin{array}{@{}rcl@{}} \boldsymbol {h}_{x}^{(t)} = \mathcal{S} \left(\boldsymbol{W}_{{xh}} {\boldsymbol {x}^{(t)}} + \boldsymbol{W}_{{x'h}} {\boldsymbol {x}^{(t-1)}} + \boldsymbol {c}_{x} \right) \end{array} $$
((19))

from Equation 17, where \(\boldsymbol{c}_x\) is the bias vector of forward inference for the source speaker. Because each unit in the hidden vector \(\boldsymbol {h}_{x}^{(t)}\) is conditionally independent of the others (due to the nature of an RBM), it captures the common characteristics in the visible units. The training data usually include various phonemes but unvarying speaker-specific features; thus, we expect the extracted features in \(\boldsymbol {h}_{x}^{(t)}\) to represent speaker-individual information. Since we estimate the time-related matrices \(\boldsymbol {W}_{x^{\prime }h}\) and \(\boldsymbol {W}_{x^{\prime }x}\) jointly with the static term \(\boldsymbol{W}_{xh}\) using the training data, as shown in Equation 12, they capture time-related information, and \(\boldsymbol{W}_{xh}\) can focus on capturing the remaining static information. This means that the features obtained in the hidden units \(\boldsymbol {h}_{x}^{(t)}\) also help to capture time-related speaker individualities. The same discussion applies to the target speaker, and the hidden vector for the target \(\boldsymbol{y}^{(t)}\) is obtained in the same manner as in Equation 19:

$$\begin{array}{@{}rcl@{}} \boldsymbol {h}_{y}^{(t)} = \mathcal{S} \left(\boldsymbol{W}_{{yh}} {\boldsymbol {y}^{(t)}} + \boldsymbol{W}_{{y'h}} {\boldsymbol {y}^{(t-1)}} + \boldsymbol {c}_{y} \right) \end{array} $$
((20))

where c y is a bias vector for the target speaker.

In our approach, we convert such individuality-emphasized features (from \(\boldsymbol {h}_{x}^{(t)}\) to \(\boldsymbol {h}_{y}^{(t)}\)) using an NN that has L+2 layers (L is the number of hidden layers; typically, L is 0 or 1), as shown in Figure 1b. To train the NN, we use the parallel training set \(\left \{\boldsymbol {x}_{t},\boldsymbol {y}_{t}\right \}_{t=0}^{T'}\), where \(T'\) is the number of frames of the parallel data. During the training stage of the NN, the projected vectors of the source speaker's acoustic features \(\boldsymbol {h}_{x}^{(t)}\) are used as inputs, and the projected vectors of the corresponding target speaker's features \(\boldsymbol {h}_{y}^{(t)}\) are used as outputs. The weight parameters of the NN \(\{\boldsymbol {W}_{l}, \boldsymbol {d}_{l} \}_{l=0}^{L}\) are estimated so as to minimize the error between the output \(\eta \left (\boldsymbol {h}_{x}^{(t)}\right)\) and the target vector \(\boldsymbol {h}_{y}^{(t)}\), as is typical for an NN. Once the weight parameters are estimated, an input vector \(\boldsymbol {h}_{x}^{(t)}\) is converted to:

$$\begin{array}{@{}rcl@{}} \eta\left(\boldsymbol {h}_{x}^{(t)}\right) &=& \bigodot_{l=0}^{L} \eta_{l}\left(\boldsymbol {h}_{x}^{(t)}\right) \end{array} $$
((21))
$$\begin{array}{@{}rcl@{}} \eta_{l}\left(\boldsymbol {h}_{x}^{(t)}\right) &=& \mathcal{S}\left(\boldsymbol{W}_{l} \boldsymbol {h}_{x}^{(t)} + \boldsymbol {d}_{l}\right) \end{array} $$
((22))

where \(\bigodot _{l=0}^{L}\) denotes the composition of L+1 functions. For instance, \(\bigodot _{l=0}^{1} \eta _{l}(\boldsymbol {z}) = \mathcal {S}(\boldsymbol {W}_{1} \mathcal {S}(\boldsymbol {W}_{0} \boldsymbol {z} + \boldsymbol {d}_{0}) + \boldsymbol {d}_{1})\) for a NN with one hidden layer.
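
A minimal sketch of the concatenating NN in Equations 21 and 22 is given below, assuming lists Ws and ds that hold the weight matrices and bias vectors of the L+1 sigmoid layers (the sigmoid helper is defined in the earlier sketch).

```python
# Hypothetical sketch of Eqs. 21-22: the source latent vector h_x is passed
# through L+1 sigmoid layers to estimate the target latent vector h_y.
def nn_convert(h_x, Ws, ds):
    h = h_x
    for W_l, d_l in zip(Ws, ds):      # composition over l = 0, ..., L (Eq. 21)
        h = sigmoid(W_l @ h + d_l)    # Eq. 22
    return h
```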

To convert the output of the NN to the acoustic features of the target speaker, we simply use backward inference of a CRBM using Equation 18, resulting in:

$$\begin{array}{@{}rcl@{}} &&p\left(\boldsymbol {y}^{(t)}|\boldsymbol {h}_{y}^{(t)},\boldsymbol {y}^{(t-1)}\right) \\ &&\;= \mathcal{N}\left(\boldsymbol {y}|\boldsymbol{W}^{T}_{yh} \boldsymbol {h}_{y}^{(t)} + \boldsymbol{W}_{y'y} \boldsymbol {y}^{(t-1)} + \boldsymbol {b}_{y}, \boldsymbol {\sigma}^{2}_{y}\right) \end{array} $$
((23))

where \(\boldsymbol{b}_y\) and \(\boldsymbol{\sigma}_y\) are the bias vector and the standard deviations of backward inference for the target speaker, respectively. Generalizing and summarizing the above discussion, the voice conversion function of our method from a source acoustic vector \(\boldsymbol{x}^{(t)}\) to a target vector \(\boldsymbol{y}^{(t)}\) at frame t, given the previous vectors \(\mathcal {X}'^{(t)} = \left \{ \boldsymbol {x}^{(t-p)}\right \}_{p=1}^{P}\) and \(\mathcal {Y}'^{(t)} = \left \{ \boldsymbol {y}^{(t-p)}\right \}_{p=1}^{P}\), is written as:

$$\begin{array}{@{}rcl@{}} \boldsymbol {y}^{(t)} = \bigodot_{k=0}^{L+2} f_{(k)} \big(\boldsymbol{W}_{(k)} \boldsymbol {x}^{(t)} + \boldsymbol {a}_{(k)} (\mathcal{X}'^{(t)}, \mathcal{Y}'^{(t)}) \big) \end{array} $$
((24))

where W (k) and \(\boldsymbol {a}_{(k)} \left (\mathcal {X}'^{(t)}, \mathcal {Y}'^{(t)}\right)\) denote elements of a set of our model parameters \(\Theta = \{ \mathcal {W} \cup \mathcal {A} \}\):

$$\begin{array}{@{}rcl@{}} \mathcal{W} &=& \left\{\boldsymbol{W}_{(k)}\right\}_{k=0}^{L+2} \end{array} $$
((25))
$$\begin{array}{@{}rcl@{}} &=& \left\{\boldsymbol{W}_{xh}, \boldsymbol{W}_{0}, \cdots, \boldsymbol{W}_{L}, {\boldsymbol{W}_{yh}}^{T}\right\} \end{array} $$
((26))
$$\begin{array}{@{}rcl@{}} \mathcal{A} &=& \left\{ \boldsymbol {a}_{(k)} (\mathcal{X}'^{(t)}, \mathcal{Y}'^{(t)}) \right\}_{k=0}^{L+2} \end{array} $$
((27))
$$\begin{array}{@{}rcl@{}} &=& \left\{ \boldsymbol {c}_{x} + \sum_{p} \boldsymbol{W}_{{x'h}} {\boldsymbol {x}^{(t-p)}}, \boldsymbol {d}_{0},\right. \end{array} $$
((28))
$$\begin{array}{@{}rcl@{}} &&\left.\cdots, \boldsymbol {d}_{L}, \boldsymbol {b}_{y} + \sum_{p} \boldsymbol{W}_{y'y} \boldsymbol {y}^{(t-p)} \right\}, \end{array} $$
((29))

and \(\left \{f_{(k)}\right \}_{k=0}^{L+2} = \{ \mathcal {S}, \mathcal {S}, \cdots, \mathcal {S}, \mathcal {I} \}\), where \(\mathcal{I}\) indicates an identity function.
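
Putting the pieces together, Equation 24 with P=1 can be sketched as the composition of the source CRBM's forward inference, the concatenating NN, and the target CRBM's backward inference, taking the Gaussian mean as the converted vector. The dictionary keys and parameter shapes below are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical end-to-end sketch of Eq. 24 with P = 1.
# Assumed shapes (for illustration only): Wxh, Wxh_prev are (hidden x source-dim),
# Wyh is (hidden x target-dim), Wyy_prev is (target-dim x target-dim).
def convert_frame(x_t, x_prev, y_prev, src, nn, tgt):
    # Source-speaker latent features (Eq. 19)
    h_x = sigmoid(src['Wxh'] @ x_t + src['Wxh_prev'] @ x_prev + src['cx'])
    # Latent-to-latent conversion with the concatenating NN (Eqs. 21-22)
    h_y = nn_convert(h_x, nn['Ws'], nn['ds'])
    # Target-speaker backward inference (mean of Eq. 23)
    return tgt['Wyh'].T @ h_y + tgt['Wyy_prev'] @ y_prev + tgt['by']
```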

The conversion function shown in Equation 24 implies a dynamic model of an (L+4)-layer network with sigmoid activation functions. Therefore, regarding it as a recurrent neural network (RNN), we can fine-tune each parameter of the entire network by back-propagation through time (BPTT) [41] using the acoustic parallel data. Specifically, each parameter is re-updated so as to minimize the total error ε in a gradient-descent-based approach, which is defined as:

$$\begin{array}{@{}rcl@{}} \epsilon = \sum_{1 \le t \le T} \epsilon^{(t)} = \frac{1}{2} \sum_{1 \le t \le T} \left(\boldsymbol {y}^{(t)} - \boldsymbol {\nu}^{(t)}\right)^{2}, \end{array} $$
((30))

where ν (t) denotes the output of RNN at frame t. The gradient with respect to θ, which is a parameter in the highest recursive hidden layer, for instance, can be written as follows:

$$\begin{array}{@{}rcl@{}} \frac{\partial \epsilon}{\partial \theta} &=& \sum_{1 \le t \le T} \frac{\partial \epsilon^{(t)}}{\partial \theta} \end{array} $$
((31))
$$\begin{array}{@{}rcl@{}} \frac{\partial \epsilon^{(t)}}{\partial \theta} &=& \sum_{1 \le k \le t} \left(\frac{\partial \epsilon^{(t)}}{\partial \boldsymbol {h}_{y}^{(t)}} \frac{\partial \boldsymbol {h}_{y}^{(t)}}{\partial \boldsymbol {h}_{y}^{(k)}} \frac{\partial^{+} \boldsymbol {h}_{y}^{(k)}}{\partial \theta} \right) \end{array} $$
((32))
$$\begin{array}{@{}rcl@{}} \frac{\partial \boldsymbol {h}_{y}^{(t)}}{\partial \boldsymbol {h}_{y}^{(k)}} &=& \prod_{t \ge i > k} \frac{\partial \boldsymbol {h}_{y}^{(i)}}{\partial \boldsymbol {h}_{y}^{(i-1)}} \end{array} $$
((33))
$$\begin{array}{@{}rcl@{}} &=& \prod_{t \ge i > k} \boldsymbol{W}_{y'y} \left(1- \mathcal{S}\left(\boldsymbol {h}_{y}^{(i-1)}\right)\right), \end{array} $$
((34))

where \(\frac {\partial ^{+} \boldsymbol {h}^{(k)}}{\partial \theta }\) refers to the immediate partial derivative of the hidden units h (k) with respect to θ (i.e., h (k−1) is regarded as a constant with respect to θ).

As Equation 24 indicates, we need the current acoustic vector of the source speaker and the previous vectors of both the source and the target speakers to estimate the target speaker's current acoustic vector. However, the true previous vectors of the target speaker are never available at conversion time, so, in practice, we iteratively use the last converted (estimated) vectors as the previous target vectors, starting from a zero vector. We confirmed through preliminary experiments that this approach works well.
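
This run-time procedure can be sketched as the following loop, which feeds back the previously converted frame as the unknown previous target vector, starting from a zero vector; convert_frame is the hypothetical mapping of Equation 24 sketched earlier.

```python
import numpy as np

# Sketch of the run-time conversion loop described above (P = 1 assumed).
def convert_utterance(X, src, nn, tgt, target_dim):
    # X: (T, source_dim) sequence of source frames
    y_prev = np.zeros(target_dim)
    x_prev = np.zeros(X.shape[1])
    Y = []
    for x_t in X:
        y_t = convert_frame(x_t, x_prev, y_prev, src, nn, tgt)
        Y.append(y_t)
        x_prev, y_prev = x_t, y_t   # re-use the converted frame at the next step
    return np.array(Y)
```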

Meanwhile, a conventional GMM-based approach [9] with M Gaussian mixtures converts the source features x as:

$$\begin{array}{@{}rcl@{}} {}&& \boldsymbol {y} = \sum_{m=1}^{M} P\left(m|\boldsymbol {x}\right) \left(\boldsymbol{\Sigma}_{yx}^{(m)} \boldsymbol{\Sigma}_{xx}^{(m)-1} \left(\boldsymbol {x} - \boldsymbol {\mu}_{x}^{(m)}\right) + \boldsymbol {\mu}_{y}^{(m)} \right) \end{array} $$
((35))
$$\begin{array}{@{}rcl@{}} {}&& P(m|\boldsymbol {x}) = \frac{w^{(m)} \mathcal{N}\left(\boldsymbol {x};\boldsymbol {\mu}_{x}^{(m)},\boldsymbol{\Sigma}_{xx}^{(m)}\right)}{\sum_{m=1}^{M} w^{(m)} \mathcal{N}\left(\boldsymbol {x};\boldsymbol {\mu}_{x}^{(m)},\boldsymbol{\Sigma}_{xx}^{(m)}\right)} \end{array} $$
((36))

where \(w^{(m)}\), \(\boldsymbol {\mu }_{\cdot }^{(m)}\), and \(\boldsymbol {\Sigma }_{\cdot }^{(m)}\) are the weight, the corresponding mean vectors, and the corresponding covariance matrices of the mth mixture, respectively. Equation 35 shows that the GMM-based conversion is an additive model of piecewise linear functions. Our approach, using Equation 24, is instead based on the composition of multiple different non-linear functions fed with time-series data. Therefore, we expect our composite model to represent more complex relationships than the conventional GMM-based method and other static network approaches [18,25].
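
For reference, a hypothetical sketch of the GMM-based mapping of Equations 35 and 36 is given below, using SciPy's multivariate normal density; variable names and shapes are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical sketch of the GMM-based conversion (Eqs. 35-36).
# w: (M,) mixture weights; mu_x, mu_y: (M, D) means; Sxx, Syx: (M, D, D) covariance blocks.
def gmm_convert(x, w, mu_x, mu_y, Sxx, Syx):
    # Posterior probability of each mixture given the source frame (Eq. 36)
    lik = np.array([w[m] * multivariate_normal.pdf(x, mu_x[m], Sxx[m])
                    for m in range(len(w))])
    post = lik / lik.sum()
    # Mixture-weighted piecewise linear mapping (Eq. 35)
    y = np.zeros_like(mu_y[0])
    for m in range(len(w)):
        y += post[m] * (Syx[m] @ np.linalg.solve(Sxx[m], x - mu_x[m]) + mu_y[m])
    return y
```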

4 Related work

It is worth comparing our method with the conventional method proposed by Wu et al. [32], which also employs a CRBM for VC. Figure 2 compares the graphical models of the three methods. Wu's method directly uses a CRBM to estimate the target features y (t) from the input x (t) along with the latent features h (t), capturing the linear and non-linear relationships between the source and the target features (Figure 2b). On the other hand, our method (Figure 2c) uses two CRBMs, one for the source speaker and one for the target speaker, to obtain their latent features \(\boldsymbol {h}_{x}^{(t)}\) and \(\boldsymbol {h}_{y}^{(t)}\), capturing time-related information (from frame t−1 to frame t). Connecting the latent features using an NN, the entire conversion network of our method consequently forms a deep architecture. Our previous approach [25] has a deep network similar to that of the proposed method (Figure 2a); the difference is that the proposed method additionally involves time-related relationships in the network.

Figure 2

Model structures of the related systems. (a) Our earlier work, speaker-dependent RBM, (b) CRBM proposed in [32], and (c) our proposed method, speaker-dependent CRBM.

Since the acoustic signals we target are time-series data, a model that captures time-related information is expected to provide better performance.

5 Experiments

5.1 Conditions

In our experiments, we conducted voice conversion using the ATR Japanese speech database [42], comparing our method (speaker-dependent conditional restricted Boltzmann machines, denoted 'SD-CRBM') with four methods: the well-known GMM-based approach ('GMM'), conventional NN-based voice conversion [18] ('NN'), our previous work [25] ('SD-RBM'), and, as a reference, a recurrent neural network with randomly initialized weights ('RNN'). In order to evaluate our method under various conditions, we tested male-to-female (the source and the target speakers are identified as MMY and FTK in the database, respectively), female-to-female (FKN and FTK), and male-to-male (MMY and MHT) conversion.

For the input vectors, we calculated 24-dimensional MFCC features from 513-dimensional STRAIGHT spectra [43], using the filter theory [44] to decode the MFCCs back to STRAIGHT spectra in the synthesis stage. Each speech signal was sampled at 12 kHz and windowed with a 25-ms Hamming window every 10 ms. Unlike in our previous work [25], we processed the obtained MFCCs with zero component analysis (ZCA) whitening [38], which we confirmed worked better than no whitening, especially for 'NN'. The parallel data of the source/target speakers, aligned by dynamic programming, were created from 216 word utterances in the dataset and were used for training each method (note that the two CRBMs for 'SD-CRBM' and the two RBMs for 'SD-RBM' can be trained without parallel data, although we used the same parallel training data for the CRBMs and the RBMs in this research).
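
The ZCA whitening step can be sketched as the standard recipe below; this is a generic illustration, and the regularization constant eps is an assumption rather than a value taken from this work.

```python
import numpy as np

# Standard ZCA whitening sketch for a matrix X of MFCC frames (T x D).
def zca_whiten(X, eps=1e-5):
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    U, S, _ = np.linalg.svd(cov)
    W_zca = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # inverse square root of cov
    return Xc @ W_zca, mean, W_zca
```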

The network-based approaches (‘SD-CRBM’, ‘SD-RBM’, ‘NN’, and ‘RNN’) were trained using gradient descent with a learning rate of 0.01 and momentum of 0.9, with the number of epochs being 400. The parameters of ‘NN’ and ‘RNN’ were initialized randomly. All the network-based methods had four layers including an input layer, two hidden layers, and an output layer. Other configurations, such as the number of hidden units, will be discussed in the following section.
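
The gradient-descent update with momentum used for the network-based methods corresponds to the standard rule sketched below (a generic illustration, not the authors' code).

```python
# Generic SGD-with-momentum update corresponding to the settings above
# (learning rate 0.01, momentum 0.9).
def momentum_step(param, grad, velocity, lr=0.01, mom=0.9):
    velocity = mom * velocity - lr * grad   # accumulate a decaying velocity
    return param + velocity, velocity
```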

For the GMM-based approach, we used diagonal covariance matrices without global variance and dynamic features.

For the objective evaluation, 15 sentences (about 60 s long) that were not included in the training data were arbitrarily selected from the database (identified as SDA01∼SDA15). We used Mel-cepstral distortion (MCD) to measure how close the converted vectors are to the target vectors in Mel-cepstral space. The MCD is defined as follows:

$$ \text{MCD}\left[dB\right] = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{24} \left(\boldsymbol {c}_{d} - \boldsymbol {c}_{d}'\right)^{2}} $$
((37))

where \(c_d\) and \(c_d'\) denote the dth dimension of the original target MFCC and of the converted MFCC, respectively. The smaller the MCD value, the closer the converted spectra are to the target spectra. We calculated the MCD for each frame of the evaluation data and averaged the values for the final evaluation.
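
The per-frame MCD of Equation 37 and its average can be computed as in the following sketch; the array shapes are assumptions for illustration.

```python
import numpy as np

# Sketch of the Mel-cepstral distortion (Eq. 37), averaged over frames.
# C_target and C_converted are (T, 24) arrays of target and converted MFCCs.
def mel_cepstral_distortion(C_target, C_converted):
    diff = C_target - C_converted
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()   # MCD in dB
```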

For the subjective evaluation, ABX listening tests were conducted in which nine participants listened to five pairs of converted speech signals (from a development set, which was used for determining the model parameters): one produced using our approach ('SD-CRBM') and one produced by one of the other methods ('SD-RBM', 'NN', 'RNN', or 'GMM'), along with the original target speech signal (generated by analysis-by-synthesis). We evaluated the models trained using N=5,000 or N=20,000 training frames. The participants then selected the better of each pair in terms of speaker identity (how well they could recognize the speaker from the converted speech) and speech quality (how clear and natural the converted speech sounded).

5.2 Determining appropriate parameters

In this section, we report preliminary experiments in which we tested models with different hyperparameters to determine appropriate values. All models were trained using N=20,000 frames from the male-to-female training data and evaluated using a development set of five sentences (identified as SDA16∼SDA20 in the database) that were not included in either the training set or the test set.

5.2.1 Network-based methods

Here, we examine how performance changes with the number of hidden units J in each hidden layer for the four network-based methods ('SD-CRBM', 'SD-RBM', 'NN', and 'RNN'). In this preliminary experiment, three architectural patterns were tested, with J=24, 48, and 72. We used L=0, which forms a four-layer network for all methods (for example, when J=48 is used, the numbers of units in 'NN' from the input/source layer to the output/target layer are 24, 48, 48, and 24, in order). For 'SD-CRBM', we set P=1 (a delay of 1 for 'RNN' as well), which means only one previous frame is taken into account.

Figure 3 compares the averaged MCD obtained for each architecture. As shown in Figure 3, our method 'SD-CRBM' performed the best of all the methods in each case. The interesting point is that the more hidden units the network has, the better 'SD-CRBM' and 'RNN' perform, while it is the other way around for 'SD-RBM' and 'NN'. This is considered to be due to over-fitting to the training data for 'SD-RBM' and 'NN' when the number of parameters is large (e.g., J=72), whereas 'SD-CRBM' and 'RNN' still required additional parameters to fit models that capture time-series data.

Figure 3

Changing hidden units. The values show averaged Mel-cepstral distortion with varying numbers of hidden units J for all network-based methods (N=20,000).

For the remaining experiments in this paper, the best architectures for each method were used, i.e., J=24 for ‘SD-RBM’ and ‘NN’, and J = 72 for ‘SD-CRBM’ and ‘RNN’.

5.2.2 The number of previous frames

We further investigated the performance of our method 'SD-CRBM' with J=72 hidden units, changing the number of previous frames in the CRBM as P=1,2,3,4,5. The evaluation results are shown in Figure 4, which gives the averaged MCD obtained in each case. As shown in Figure 4, increasing the number of previous frames did not necessarily yield better performance. One reason is that the source vectors neighboring the current one contain similar information, so only a few source vectors are required to estimate the current target vector. The poor performance with a larger number of previous frames (e.g., P=4) is therefore likely caused by parameter estimation becoming more difficult as the redundant parameters increase.

Figure 4

Changing previous frames. The values show averaged Mel-cepstral distortion with varying numbers of previous frames P to be taken into account for our method (N=20,000).

In the remaining experiments, we used P=1, which provided the best performance in the preliminary experiment.

5.2.3 GMM-based method

For the GMM-based voice conversion ('GMM'), we evaluated five settings for the number of mixtures (8, 16, 32, 64, and 128) to determine an appropriate value. Figure 5 shows the averaged MCDs over the development set for the GMM with these numbers of mixtures. As shown in the figure, the GMM with 64 mixtures performed the best. Therefore, we used 64 mixtures for 'GMM' in the evaluation experiments described in Section 5.3.

Figure 5

Changing mixtures. The values show averaged Mel-cepstral distortion with varying numbers of mixtures M for GMM method (N=20,000).

5.3 Evaluation

In this section, we evaluate our method (‘SD-CRBM’) comparing it with four methods (‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’) using objective and subjective criteria for each pair of speakers, by changing the number of training frames as N=5,000, 10,000, 20,000, and 40,000.

5.3.1 Results

Figures 6, 7, and 8 summarize the experimental results on the test data, comparing the methods with respect to the objective criterion for male-to-female, female-to-female, and male-to-male voice conversion, respectively. As shown in these figures, the MCD decreased as the amount of training data increased in most cases (regardless of gender or method). Furthermore, our approach outperformed the other methods in every case except for N=20,000 in the male-to-male experiment.

Figure 6

Male-to-female voice conversion results. The values show averaged Mel-cepstral distortion for each method with varying amounts of training data.

Figure 7

Female-to-female voice conversion results. The values show averaged Mel-cepstral distortion for each method with varying amounts of training data.

Figure 8

Male-to-male voice conversion results. The values show averaged Mel-cepstral distortion for each method with varying amounts of training data.

Figures 9 and 10 show the results of the subjective evaluation comparing the methods in terms of speaker identity and speech quality, respectively, when N=5,000 training samples were used. Figures 11 and 12 show the corresponding results for N=20,000 training samples. We also list the p values produced by pairwise t-tests for each experiment in Tables 1 and 2 (speaker identity) and in Tables 3 and 4 (speech quality). As shown in Figures 11 and 12, our method obtained a higher mean preference score than each of the other methods in terms of both speaker identity and speech quality. However, as shown in Tables 1, 2, 3, and 4, we could not, unfortunately, obtain a significant difference between our method and the other methods in some cases (e.g., 'NN' with respect to (w.r.t.) speaker identity, and 'SD-RBM' and 'RNN' w.r.t. speech quality when N=20,000 training frames were used). We obtained significant differences at a significance level of 0.1 in the other cases.

Figure 9

Subjective preference scores w.r.t. speaker identity (in case N=5,000 ). Our method ‘SD-CRBM’ was compared to four other methods: ‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’.

Figure 10

Subjective preference scores w.r.t. speech quality (in case N=5,000 ). Our method ‘SD-CRBM’ was compared to four other methods: ‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’.

Figure 11

Subjective preference scores w.r.t. speaker identity (in case N=20,000 ). Our method ‘SD-CRBM’ was compared to four other methods: ‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’.

Figure 12

Subjective preference scores w.r.t. speech quality (in case N=20,000 ). Our method ‘SD-CRBM’ was compared to four other methods: ‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’.

Table 1 p values between our method and each method w.r.t. speaker identity in case N=5,000
Table 2 p values between our method and each method w.r.t. speaker identity in case N=20,000
Table 3 p values between our method and each method w.r.t. speech quality in case N=5,000
Table 4 p values between our method and each method w.r.t. speech quality in case N=20,000

5.3.2 Discussion

In terms of the objective criterion, our approach ('SD-CRBM') outperformed the other methods, including the popular GMM-based voice conversion method, in most cases. In terms of the subjective criteria as well, we obtained significantly better performance than each of the other methods in terms of speaker identity and/or speech quality (specifically, in terms of both speaker identity and speech quality for 'GMM', in terms of speech quality only for 'NN', and in terms of speaker identity only for 'SD-RBM' and 'RNN'). We attribute the improvement to the fact that our time-involving, high-order conversion system using CRBMs is able to capture and convert the abstractions of speaker individualities better than the other methods. In particular, as shown in Figures 6, 7, and 8, our approach achieved high performance in terms of the MCD criterion. This is because the CRBMs captured time-series data more appropriately and alleviated estimation errors.

One interesting point is that 'NN' and 'RNN', whose weight parameters were randomly initialized, produced unstable performance (e.g., the MCD of 'NN' increased even as the number of training frames increased from 10,000 to 20,000 in male-to-female conversion, and the MCD of 'RNN' increased as the number of training frames changed from 20,000 to 40,000 in male-to-male conversion). This is presumably caused by falling into poor local minima when starting from randomly initialized weights. Figure 13 shows some of the converged weights in the network, comparing 'RNN' and 'SD-CRBM', where the 'SD-CRBM' weights were pre-trained using speaker-dependent CRBMs and a concatenating NN, followed by fine-tuning as an RNN. As shown in Figure 13, the weights in 'RNN' were noisy and largely uninterpretable, whereas the weights in 'SD-CRBM' had a sparse structure and meaningful bases. In general, the acoustic feature vector at the previous frame (v^{(t−1)}) is very similar to the feature vector at the current frame (v^{(t)}); therefore, we expect the conversion matrix from v^{(t−1)} to v^{(t)} to be close to an identity matrix. The recurrent weights obtained by our approach, shown in Figure 13a-1, support this expectation.

Figure 13

Estimated weights of the pre-trained RNN ( · -1) and the randomly-initialized RNN ( · -2). After 400 epochs (N=40,000,J=72, male-to-female). (a) The weights from the previous target vector to the current target vector \(\boldsymbol {W}_{y^{\prime }y}\), (b) the weights from the second hidden layer to the current target vector W yh , (c) the weights from the current source vector to the first hidden layer W xh , and (d) the weights from the first hidden layer to the second hidden layer W 0.

6 Conclusion

We presented a voice conversion method that combines speaker-dependent CRBMs and an NN to extract speaker-individual information for speech conversion. Through experiments, we confirmed that our approach is effective, especially in terms of MCD, compared with the well-known conventional GMM-based approach, an NN-based approach, and our own previous work, SD-RBM (as well as a recurrent neural network used as a reference), regardless of the genders of the conversion pair.

We also conducted ABX experiments for the subjective evaluation. The results showed that the performance of our method was not always significantly different from that of NN, RNN, and SD-RBM; however, it did perform significantly better than these methods in terms of either speaker identity or speech quality. In future work, we will improve our method so that it obtains better results in terms of perceptual quality.