1 Introduction

Statistical Voice Conversion (SVC) is a promising technique that enables a user to flexibly synthesize several kinds of speech. While keeping the linguistic content and environmental conditions unchanged, the goal of SVC is to modify speaker individuality; i.e., the source speaker’s voice is transformed to sound like that of the target speaker [7]. There are several applications within the concept of voice conversion, such as converting speech from an impaired to a normal voice [11], from normal speech to singing [47], from electro-laryngeal to normal speech [44], etc.

Over the years, voice conversion frameworks have mostly focused on spectral conversion between source and target speakers [8, 21]. Statistical parametric approaches, such as the Gaussian mixture model (GMM) [59] and exemplar-based non-negative matrix factorization [1, 63], have been successful at linear transformation of the spectral information. Nonlinear transformation approaches, such as hidden Markov models (HMMs) [49], deep belief networks (DBNs) [45], and restricted Boltzmann machines (RBMs) [46], have also been shown to model the relationship between source and target features more accurately; DBNs and RBMs have been used to replace the GMM in modeling the distribution of spectral envelopes [36]. However, the speech parameters produced by these models tend to be over-smoothed, which affects the similarity and quality of the generated speech. To cope with these problems, some approaches attempt to reduce the difference between natural and converted speech parameters by using global variance [59], the modulation spectrum [57], dynamic kernel partial least squares regression [20], or generative adversarial networks [28]. Even though these techniques achieve some improvements, the naturalness of the converted voice still deteriorates compared to the source speaker. Therefore, improving the quality of the converted voice remains a challenging research question.

There are four main factors that degrade the quality of SVC: 1) the speech parameters (i.e., vocoder features), 2) the mapping function between the source and target speakers, 3) the learning model, and 4) the vocoder synthesis quality. To address these factors, feed-forward deep neural networks (FF-DNNs) have been proposed as an acoustic modeling solution in different research areas [19, 50, 66]. FF-DNNs have shown their ability to extract high-level, complex abstractions and data representations from large volumes of supervised and unsupervised data [43], and they achieve significant improvements in various machine learning areas, including the ability to model high-dimensional acoustic parameters [61] and the availability of multi-task learning [64]. In this article, we predict acoustic features using a FF-DNN; these features are then passed to a vocoder to generate the converted speech waveform. Thus, both the vocoder and the FF-DNN model can be used to improve the converted acoustic parameters.

A vocoder (also called a speech analysis/synthesis system) is another important component of various speech synthesis applications such as Text-To-Speech (TTS) synthesis [16], voice conversion [31], and singing synthesizers [29]. Although there are several different types of vocoders, they follow the same main strategy: the analysis stage converts the speech waveform into a set of parameters which separately represent the vocal-fold excitation signal and the vocal-tract filter transfer function that shapes this excitation, whereas in the synthesis stage the entire parameter set is used to reconstruct the original speech signal. Hu et al. [23] present an experimental comparison of a wide range of important vocoder types. Although most of these vocoders are successful in synthesizing speech, the sound quality of the synthesized voices is still perceptibly degraded compared to natural speech. The reason is either inaccurate estimation of the vocoder parameters, which loses some important excitation/spectral details, or the fact that these vocoders are typically computationally intensive. Nevertheless, various studies in voice conversion still employ some of these vocoders [23], such as STRAIGHT [5, 53, 59], mixed excitation [34], the Harmonic plus Noise Model [35], glottal source modeling [6], or even more complex end-to-end acoustic models like adaptive WaveNet [54] or Tacotron [62]. Consequently, simple and uniform vocoders that handle all speech sounds and voice qualities (e.g. creaky voice) in a unified way are still missing in SVC. Therefore, it is still worthwhile to develop advanced vocoders for achieving high-quality converted speech.

In our recent work on statistical parametric speech synthesis, we proposed a novel continuous vocoder using a continuous fundamental frequency (contF0) in combination with the Maximum Voiced Frequency (MVF), which was shown to improve performance under a FF-DNN compared to hidden Markov model based TTS [2]. The advantage of the continuous vocoder in this scenario is that its parameters are simpler to model than those of traditional vocoders with discontinuous F0. However, the effectiveness of the continuous vocoder in SVC has not yet been confirmed. Thus, in this article we develop a solution to achieve higher sound quality and conversion accuracy while the SVC remains computationally efficient.

Unlike the methods referenced above, the proposed structure involves two major technical developments. First, we build a voice conversion framework that consists of a FF-DNN and a continuous vocoder to automatically estimate the mapping relationship between the parameters of the source and target speakers. Second, we apply a geometric approach to spectral subtraction (GA-SS) to improve the signal-to-noise ratio of the converted speech and enhance the noise robustness of our vocoder. For the first time, we study the interaction between continuous parameters and FF-DNN based voice conversion. We expect the new voice conversion model to give high-quality synthesized speech compared to the source voice.

This paper is organized as follows: In Section 2, we propose the novel idea of continuous vocoder based voice conversion. In Section 3, experimental conditions and error metrics are addressed. We report the objective and subjective evaluation results in Section 4. Section 5 gives the conclusion and discussion.

2 Proposed conversion methodology

2.1 Speaker-adaptive continuous vocoder

To construct our proposed SVC method, we adopt a continuous F0 (contF0) estimator together with the maximum voiced frequency (MVF), as these are the base features of our continuous vocoder and allow it to synthesize high-quality speech. The continuous vocoder was designed to overcome the shortcomings of discontinuity in the speech parameters and the computational complexity of modern vocoders. Our proposed vocoder is presented in Fig. 1, and its algorithms are briefly explained below.

Fig. 1. Workflow of the continuous vocoder

2.1.1 contF0: F0 estimation algorithm

In recent years, there has been a rising trend of assuming that continuous F0 observations are also present in unvoiced regions, and various modelling schemes have been proposed along these lines. It was found in [27] that a continuous F0 creates more expressive F0 contours with HMMs. Zhang et al. [67] introduced an approach to improve the modeling of a piece-wise continuous F0 trajectory with voicing strength and voiced/unvoiced decisions for HMM-based TTS.

The contF0 estimator used in this vocoder follows the approach proposed by Garner et al. [18], which is able to track fast changes. According to Fig. 2, the algorithm starts by splitting the speech signal into overlapping frames. Each windowed frame is then used to calculate the autocorrelation function. Identifying a peak between two frequencies and calculating its variance are the essential inputs to the Kalman smoother, which produces the final sequence of continuous pitch estimates with no voiced/unvoiced decision.

Fig. 2. Flowchart of the continuous pitch estimation algorithm
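To make the flow in Fig. 2 concrete, the following minimal Python sketch tracks a continuous pitch contour using a per-frame autocorrelation peak followed by a forward Kalman pass. It is illustrative only: the actual estimator of Garner et al. [18] models the measurement variance differently and applies a full forward–backward smoother; the function name and hyperparameters below are our own.

```python
import numpy as np

def contf0_sketch(x, fs, frame_len=0.025, hop=0.005, f0_min=80.0, f0_max=300.0):
    """Simplified continuous-F0 tracker: autocorrelation peak per frame,
    then a scalar Kalman pass over the raw estimates (no V/UV decision)."""
    N, H = int(frame_len * fs), int(hop * fs)
    win = np.hanning(N)
    lag_min, lag_max = int(fs / f0_max), int(fs / f0_min)

    raw_f0, raw_var = [], []
    for start in range(0, len(x) - N, H):
        frame = x[start:start + N] * win
        ac = np.correlate(frame, frame, mode='full')[N - 1:]
        ac /= (ac[0] + 1e-12)                      # normalized autocorrelation
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        raw_f0.append(fs / lag)
        # crude per-frame uncertainty: weak autocorrelation peak -> large variance
        raw_var.append(1.0 / (ac[lag] ** 2 + 1e-3))

    # scalar Kalman pass (random-walk model) -> smooth continuous contour
    f0 = np.empty(len(raw_f0))
    est, p, q = raw_f0[0], 1.0, 10.0               # state, state variance, process noise
    for i, (z, r) in enumerate(zip(raw_f0, raw_var)):
        p += q                                     # predict
        k = p / (p + r)                            # Kalman gain
        est += k * (z - est)                       # update with measurement z
        p *= (1.0 - k)
        f0[i] = est
    return f0
```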

Besides, during the analysis phase, the Glottal Closure Instant (GCI) algorithm [15] is used to find the glottal period boundaries of individual cycles in the voiced parts of the inverse-filtered residual signal. From these pitch cycles, a Principal Component Analysis (PCA) residual is built, which is used in the synthesis phase (as shown in Fig. 1) to yield better speech quality than excitation pulses. In a previous study, Tóth and Csapó [60] showed that the contF0 contour can be approximated better with HMMs and deep neural networks (DNNs) than a traditional discontinuous F0. An example of contF0 estimation on a female speech sample is shown in Fig. 3, compared with the DIO algorithm [41], one of the most successful discontinuous F0 estimators.

Fig. 3. Example of F0 estimated by the contF0 algorithm (blue) and the DIO algorithm (red). Sentence: “Author of the danger trail, Philip Steels, etc.”, from a female speaker

2.1.2 MVF: Maximum voiced frequency algorithm

During the production of voiced sounds, the MVF is the spectral boundary separating the low-frequency periodic and high-frequency aperiodic components. The MVF has been used in various speech models [12, 17, 56] that yield noticeably better quality in synthesized speech. Our vocoder follows the algorithm proposed by [13], which has the potential to discriminate harmonicity, exploits both amplitude and phase spectra, and uses the maximum likelihood criterion to derive the MVF estimate. The performance of this algorithm has previously been assessed by comparing it with two state-of-the-art methods, namely the Peak-to-Valley (P2V) measure used in [56] and the Sinusoidal Likeness Measure (SLM) [17]. Based on the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), the algorithm proposed by [13] objectively outperforms both the P2V and SLM methods. Moreover, a substantial improvement was also observed over these state-of-the-art techniques in a subjective listening test using male, female, and child speech.

The method consists of the following steps. First, a four-period-long Hanning window is applied to exhibit a good peak structure. Then, the frequencies of the spectral peaks are detected using a standard peak-picking function. In the third step, the amplitude spectrum, phase coherence, and harmonic-to-noise ratio are extracted for each harmonic candidate; these convey the relevant statistics for the maximum likelihood decision. Finally, a time-smoothing step is applied to the obtained MVF trajectory in order to remove unwanted spurious values. An example spectrogram of a natural waveform with the MVF contour is shown in Fig. 4. The duration of this sentence is about 3 s, and it was sampled at 16 kHz with 16-bit quantization. It is windowed by a 25 ms Hanning window shifted by 5 ms. The thresholds for pitch tracking are set from 80 to 300 Hz. The MVF parameter thus models the voicing information: for unvoiced sounds, the MVF is low (around 1 kHz), while for voiced sounds, the MVF is high (above 4 kHz).

Fig. 4. Example of spectrogram of the natural waveform and MVF contour (blue). Sentence: “Author of the danger trail, Philip Steels, etc.”, from a female speaker
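For illustration, the simplified Python sketch below estimates a per-frame MVF by walking up the harmonics of a given F0 and stopping where the harmonic peak no longer dominates the local spectral floor. This thresholded-harmonicity rule is a crude stand-in for the maximum-likelihood criterion of [13], not a reimplementation of it; the function name, the 6 dB threshold, and the 1 kHz floor are our own choices.

```python
import numpy as np

def mvf_sketch(frame, fs, f0, hnr_threshold_db=6.0):
    """Rough MVF stand-in: return the frequency of the last harmonic of f0
    whose spectral peak stands sufficiently above the surrounding floor."""
    f0 = max(f0, 60.0)                              # guard against very low contF0
    win = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * win))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    mvf = 1000.0                                    # floor used for unvoiced-like frames
    k = 1
    while k * f0 < fs / 2 - f0:
        idx = np.argmin(np.abs(freqs - k * f0))     # bin closest to the k-th harmonic
        peak = spec[max(idx - 2, 0):idx + 3].max()
        floor = np.median(spec[max(idx - 20, 0):idx + 21]) + 1e-12
        if 20.0 * np.log10(peak / floor) < hnr_threshold_db:
            break                                   # harmonic no longer dominates -> stop
        mvf = k * f0
        k += 1
    return mvf
```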

2.1.3 MGC: Mel-generalized cepstral algorithm

In our recent studies [3, 9], a simple spectral model represented by a 24th-order MGC was used [58]. Although several vocoders based on this simple algorithm have been developed, they are not able to synthesize natural-sounding speech. The main problem is that the estimate is affected by time-varying components that are difficult to remove. Therefore, more advanced spectral estimation methods might increase the quality of the synthesized speech.

In [39], an accurate and temporally stable spectral envelope estimation method called CheapTrick was proposed. CheapTrick consists of three steps: an F0-adaptive Hanning window, smoothing of the power spectrum, and spectral recovery in the quefrency domain. In the modified version of the continuous vocoder, the CheapTrick algorithm with a 60th-order MGC representation and α = 0.58 (Fs = 16 kHz) is used to achieve high-quality spectral estimation. A comparison of the spectral envelopes obtained with the standard MGC and with CheapTrick is shown in Fig. 5, which illustrates how the continuous vocoder behaves after adopting a more accurate spectral envelope estimation technique than the previous MGC-based system.

Fig. 5. Example of the signal spectrum of a voiced segment (green) with the spectral shape (spectral envelope) estimates obtained with standard MGC (red) and CheapTrick (blue)
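As an illustration of this analysis step, the sketch below extracts a CheapTrick spectral envelope and converts it to a 60-coefficient mel-cepstral representation, assuming the pyworld and pysptk Python packages (which are not necessarily the implementation used in this work); the order and α = 0.58 follow the values quoted above, and the conversion uses γ = 0 (mel-cepstrum).

```python
import numpy as np
import pyworld   # Python wrapper of the WORLD vocoder (CheapTrick lives here)
import pysptk    # SPTK bindings for (mel-)cepstral analysis

def extract_mgc_sketch(x, fs=16000, order=59, alpha=0.58):
    """Sketch of the spectral analysis step: CheapTrick envelope, then a
    60-coefficient mel-cepstral representation per frame."""
    x = x.astype(np.float64)
    f0, t = pyworld.dio(x, fs)                     # coarse F0 needed by CheapTrick
    f0 = pyworld.stonemask(x, f0, t, fs)           # F0 refinement
    sp = pyworld.cheaptrick(x, f0, t, fs)          # smoothed power spectral envelope
    mgc = np.apply_along_axis(pysptk.sp2mc, 1, sp, order, alpha)
    return mgc                                     # shape: (num_frames, order + 1)
```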

2.1.4 Synthesis algorithm

It was shown in [14] that the PCA-based residual yields better speech quality than pulse-noise excitation. Therefore, the voiced excitation in the continuous vocoder is composed of PCA residuals overlap-added pitch-synchronously based on the contF0. The voiced excitation is then lowpass filtered frame by frame at the frequency given by the MVF parameter, while white noise is used at frequencies above the actual MVF value. The voiced and unvoiced excitations are added together. Finally, a Mel generalized-log spectrum approximation (MGLSA) filter [24] is used to synthesize speech from the excitation and the MGC parameter stream.
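A minimal Python sketch of this synthesis procedure is given below. It assumes per-frame contF0, MVF, and MGC streams plus a single PCA residual cycle, and uses pysptk’s MLSA synthesis filter (the γ = 0 special case of the MGLSA filter [24]); the overlap-add details, the time-envelope shaping described next, and the function names are simplifications of our own.

```python
import numpy as np
from scipy.signal import butter, lfilter
import pysptk

def synthesize_sketch(contf0, mvf, mgc, pca_residual, fs=16000, hop_len=80, alpha=0.58):
    """Illustrative synthesis loop: pitch-synchronous PCA residual (voiced band),
    white noise above the MVF, and mel-cepstral filtering of the mixed excitation."""
    n = len(contf0) * hop_len
    voiced = np.zeros(n)

    # 1) pitch-synchronous overlap-add of the PCA residual, driven by contF0
    pos = 0.0
    while pos + len(pca_residual) < n:
        i = min(int(pos) // hop_len, len(contf0) - 1)
        voiced[int(pos):int(pos) + len(pca_residual)] += pca_residual
        pos += fs / max(contf0[i], 50.0)           # advance by one pitch period

    # 2) frame-wise band mixing at the MVF: lowpassed residual + highpassed noise
    noise = np.random.randn(n)
    excitation = np.zeros(n)
    for i in range(len(contf0)):
        lo, hi = i * hop_len, (i + 1) * hop_len
        fc = np.clip(mvf[i], 500.0, fs / 2 - 500.0) / (fs / 2)
        excitation[lo:hi] = (lfilter(*butter(4, fc, btype='low'), voiced[lo:hi])
                             + lfilter(*butter(4, fc, btype='high'), noise[lo:hi]))

    # 3) spectral shaping from the MGC stream (MLSA filter, gamma = 0 case of MGLSA)
    b = np.apply_along_axis(pysptk.mc2b, 1, mgc, alpha)
    synthesizer = pysptk.synthesis.Synthesizer(
        pysptk.synthesis.MLSADF(order=mgc.shape[1] - 1, alpha=alpha), hop_len)
    return synthesizer.synthesis(excitation, b)
```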

In a recent study, we applied various time envelopes to shape the high-frequency component (above the MVF) of the excitation by estimating the envelope of the PCA residual, which helps achieve a closer approximation to natural speech. In this work, we also apply a time-domain envelope to the voiced and unvoiced excitation to make it more similar to the residual of natural speech. This technique, using a Hilbert envelope, brings out some hidden information more efficiently and fits a curve that approximately matches the peaks of the residual frame, as shown in Fig. 6. The analysis and synthesis steps of the latest version of the continuous vocoder are summarized in Fig. 7.

Fig. 6. Illustration of the performance of the time envelope. “unvoiced_frame” is the excitation signal consisting of white noise, whereas “resid_pca” is the result of applying PCA on the voiced excitation frames

Fig. 7. Steps of the continuous vocoder. X represents the input waveform, Fs represents the sampling frequency, and Y represents the synthesized speech
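The Hilbert-envelope shaping mentioned above (cf. Fig. 6) can be sketched in a few lines of Python; the snippet assumes scipy and per-frame signals, and the peak normalization is our own simplification.

```python
import numpy as np
from scipy.signal import hilbert

def apply_time_envelope(excitation_frame, residual_frame):
    """Shape an excitation frame with the Hilbert envelope of the PCA residual,
    so the excitation follows the amplitude contour of the natural residual peaks."""
    envelope = np.abs(hilbert(residual_frame))     # analytic-signal magnitude
    envelope /= (envelope.max() + 1e-12)           # normalize to [0, 1]
    return excitation_frame * envelope
```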

2.2 Training a model based on FF-DNN

In [10, 38], neural network based SVC was shown to achieve higher conversion performance than the GMM alternative. In this work, a FF-DNN is used to model the transformation between source and target speech features, as shown in Fig. 8. It consists of 6 feed-forward hidden layers, each with 1024 units applying a non-linear function to the previous layer’s representation, followed by a linear activation function at the output layer. These layers perform the following transformation

$$ y_i(x)=f\left(\sum_{j=0}^{M_i} W_{ij}\, x_j + b_i\right) $$
(1)
Fig. 8. A general schematic diagram of the proposed method based on FF-DNN

where \( M_i \) is the number of units in layer \( i \), \( x = (x_1, \dots, x_n) \) is the input feature vector, \( y = (y_1, \dots, y_k) \) is the output vector, \( W \) is the connection weight matrix between two layers, \( b \) is the bias vector, and \( f(\cdot) \) denotes an activation function, which is defined as:

$$ f(x)=\begin{cases}\dfrac{e^{2x}-1}{e^{2x}+1}, & \text{in the hidden layers}\\[4pt] x, & \text{in the output layer}\end{cases} $$
(2)

The FF-DNN is trained to minimize the mean squared error between the target output \( y \) and the predicted output \( \hat{y} \):

$$ E=\frac{1}{n}\sum \limits_{i=1}^n{\left({y}_i-{\hat{y}}_i\right)}^2 $$
(3)

Hence, at conversion time, input features are propagated forward through the FF-DNN with these estimated parameters to produce the corresponding output parameters.
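For reference, a minimal PyTorch sketch of such a network is shown below (the experiments in this paper used the Merlin toolkit, see Sec. 3.2, not this code). The input/output dimensionality of 62 (60 MGC + contF0 + MVF per frame) is a hypothetical choice for illustration; the hidden-layer sizes, tanh activations, linear output, MSE loss, and SGD settings follow Eqs. (1)–(3) and Sec. 3.2.

```python
import torch
import torch.nn as nn

# Hypothetical feature dimensions: MGC (60) + contF0 (1) + MVF (1) per frame.
IN_DIM = OUT_DIM = 62

# Six hidden layers of 1024 tanh units (Eq. 2) and a linear output layer (Eq. 1).
layers, prev = [], IN_DIM
for _ in range(6):
    layers += [nn.Linear(prev, 1024), nn.Tanh()]
    prev = 1024
layers.append(nn.Linear(prev, OUT_DIM))          # linear activation at the output
model = nn.Sequential(*layers)

criterion = nn.MSELoss()                         # Eq. (3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.3)

def train_step(src_feats, tgt_feats):
    """One SGD step on a minibatch of time-aligned source/target frame vectors."""
    optimizer.zero_grad()
    loss = criterion(model(src_feats), tgt_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```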

2.3 Voice conversion model

The framework of the proposed SVC system is shown in Fig. 9. It consists of feature processing, training, and conversion-synthesis steps. The MVF, contF0, and MGC parameters are extracted from the source and target voices using the analysis function of the continuous vocoder, and a training process based on a FF-DNN is applied to construct the conversion function.

Fig. 9. Flowchart of the proposed SVC algorithm

The purpose of the conversion function is to map the training features of the source speaker \( X={\left\{{x}_i\right\}}_{i=1}^I \) to the corresponding training features of the target speaker \( Y={\left\{{y}_j\right\}}_{j=1}^J \). The X and Y vector sequences are time-aligned frame by frame with the Dynamic Time Warping (DTW) algorithm [48, 52], since the two recordings differ in duration. DTW derives a nonlinear mapping between the two sequences that minimizes the overall distance D(X, Y) between the source and target speakers.
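A minimal sketch of such an alignment is given below; it implements the classical dynamic-programming DTW recursion with a Euclidean local cost and returns the frame-pair path. The function name is our own, and practical systems typically add slope constraints omitted here.

```python
import numpy as np

def dtw_align(X, Y):
    """Classical DTW between source frames X (I x d) and target frames Y (J x d);
    returns the frame-index pairs used to align source and target vectors."""
    I, J = len(X), len(Y)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)    # local distances
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],           # insertion
                                               D[i, j - 1],           # deletion
                                               D[i - 1, j - 1])       # match
    # backtrack the optimal warping path
    path, i, j = [], I, J
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```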

Then, the time-aligned acoustic feature sequences of both speakers are used to train the conversion function, which predicts the target features from the features of the source speaker. Finally, the converted \( \widehat{\mathrm{contF0}} \), \( \widehat{\mathrm{MVF}} \), and \( \widehat{\mathrm{MGC}} \) parameters are synthesized into the converted speech waveform by the synthesis function of the continuous vocoder.

2.4 Reducing unwanted frequencies

The goal of this section is to remove or reduce the level of unwanted high-frequency components in the converted features that may be generated during the training or conversion phases. Therefore, we apply the GA-SS approach proposed by [65] in order to improve the quality of the converted speech signal. This approach consistently outperforms other conventional spectral subtraction methods, particularly at low SNRs. Besides, GA-SS is more suitable for our work because of its simplicity and low computational cost. GA-SS is applied to each frame signal f(n) by letting y(n) = f(n) + e(n) be the sampled speech signal with the estimation error e(n), assuming that the first 3 frames are noise/silence. Taking the short-time Fourier transform of y(n),

$$ Y\left({w}_k\right)=F\left({w}_k\right)+E\left({w}_k\right) $$
(4)

where \( w_k = 2\pi k/N \), \( k = 0, 1, 2, \dots, N-1 \), and \( N \) is the frame length in samples. Then, we can rewrite Eq. (4) in polar form as

$$ {A}_Y{e}^{j{\theta}_Y}={A}_F{e}^{j{\theta}_F}+{A}_E{e}^{j{\theta}_E} $$
(5)

where A and θ are the magnitude and phase of the frame spectra, respectively. Taking into account the trigonometric relations in Eq. (5), the gain function \( H_G \) can be derived so that it is always real and positive [65]:

$$ H_G=\frac{A_F}{A_Y}=\sqrt{\frac{1-\cos^2\left(\theta_Y-\theta_E\right)}{1-\cos^2\left(\theta_F-\theta_E\right)}}\,,\qquad (H_G \ge 0) $$
(6)

The enhanced magnitude spectrum of the signal is then obtained by

$$ {\hat{A}}_F={H}_G\ast {A}_Y $$
(7)

Using the inverse discrete Fourier transform of \( \hat{A}_F\, e^{j\theta_Y} \), the enhanced frame signal \( \hat{f}(n) \) can be obtained.
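The snippet below sketches this enhancement step on a noisy waveform. Note that it replaces the geometric gain of Eq. (6) with conventional power spectral subtraction (with the noise spectrum estimated from the first frames, as assumed above); it is therefore a simplified stand-in for GA-SS [65], not a reimplementation of it.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction_sketch(y, fs=16000, n_noise_frames=3):
    """Simplified stand-in for the GA-SS step: power spectral subtraction with
    the noise spectrum estimated from the first frames (assumed noise/silence)."""
    f, t, Y = stft(y, fs=fs, nperseg=400, noverlap=320)       # 25 ms frames, 5 ms hop
    noise_pow = np.mean(np.abs(Y[:, :n_noise_frames]) ** 2, axis=1, keepdims=True)
    sig_pow = np.abs(Y) ** 2
    gain = np.sqrt(np.maximum(sig_pow - noise_pow, 0.0) / (sig_pow + 1e-12))
    Y_hat = gain * Y                                          # keep the noisy phase
    _, y_hat = istft(Y_hat, fs=fs, nperseg=400, noverlap=320)
    return y_hat
```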

To clarify the effects of this approach, white Gaussian noise was added to the natural and synthetic speech waveforms. The amount of noise is specified by the signal-to-noise ratio (SNR) in the range of −20 to 10 dB. The root mean square (RMS) error was calculated over 20 sentences selected randomly from each speaker; the smaller the RMS value, the better the performance. The overall RMS error obtained as a function of the SNR between the clean speech sample (natural or synthesized) and the noisy one (the same speech sample, with noise added) is shown in Fig. 10. The results suggest that the RMS for the synthesized signal with the GA-SS approach is smaller and closer to the natural signal than without GA-SS. Nevertheless, the differences were very small. Still, adding this approach as an extra step to our proposed model does help to some extent in improving the overall sound quality, especially in noisy conditions.

Fig. 10. Influence of the GA-SS approach on the average RMS error. We present the average RMS error over 20 synthesized sentences per speaker. “SLT” is an American English female speaker, whereas “BDL” and “JMK” are American and Canadian English male speakers, respectively

3 Experimental conditions

In order to evaluate the performance of the suggested voice conversion framework, a database containing a few hours of speech from several speakers was required to give indicative results. The datasets are described in more detail in the first part of this section, while the training settings and error metrics are defined afterwards.

3.1 Datasets

Three English speakers were chosen from the CMU-ARCTIC database [32], denoted BDL (American English, male), JMK (Canadian English, male), and SLT (American English, female), each consisting of 1132 sentences. The speech in this database was recorded at a 16 kHz sampling rate with 16-bit linear quantization. 90% of the sentences were used for training, while the rest were used for testing and evaluating the SVC. SLT, BDL, and JMK were each used as both source and target speakers. To examine the statistical behavior of our proposed model, we considered only cross-gender (“male-to-female” and “female-to-male”) conversions in this experiment, as we did not observe much difference between the converted and target speech in intra-gender (“male-to-male” and “female-to-female”) conversions. Hence, four SVC experiments are carried out for evaluation:

  • SLT to BDL

  • BDL to SLT

  • SLT to JMK

  • JMK to SLT

3.2 FF-DNN settings

A hyperbolic tangent activation function was applied in the hidden layers; its outputs lie in the range (−1, 1), and it can yield lower error rates and faster convergence than a logistic sigmoid function. A fixed learning rate of 0.002 was chosen for the first 15 epochs, with an initial momentum of 0.3; after 10 epochs, the momentum was increased to 0.9, and the learning rate was subsequently halved regularly. The FF-DNN used in this work was implemented in the open source Merlin toolkit for speech synthesis [68] with some modifications, and the training procedures were conducted on a high performance NVidia Titan X GPU. Weights and biases were initialized with small nonzero values and optimized with stochastic gradient descent to minimize the mean squared error between the predictions and the acoustic features of the training set.
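One possible reading of this schedule, written as a small helper purely for illustration (this is not the exact Merlin recipe), is:

```python
def lr_momentum_schedule(epoch, base_lr=0.002):
    """Hedged reading of the schedule above: momentum 0.3 for the first 10 epochs,
    then 0.9; learning rate fixed for 15 epochs, then halved every epoch after."""
    momentum = 0.3 if epoch <= 10 else 0.9
    lr = base_lr if epoch <= 15 else base_lr * (0.5 ** (epoch - 15))
    return lr, momentum
```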

3.3 Error measurement metrics

It is well known that the most reliable method for evaluating speech quality is a subjective listening test. However, there are various issues related to subjective testing: it can be expensive and time consuming, and it is often hard to find a sufficient number of suitable volunteers [22, 51]. For that reason, it is useful in this work to run objective tests in addition to listening tests. At the same time, finding a meaningful objective metric is always a challenge when evaluating speech quality, similarity, and intelligibility: a single metric may be suitable for some systems but not for all, depending on factors such as the speed, complexity, or accuracy of the speech models. Speaker types and environmental conditions should also be taken into account when choosing these metrics. Therefore, a range of objective speech quality and intelligibility measures is considered to evaluate the proposed model. The results were averaged over the test utterances for each speaker. The following seven evaluation metrics were used:

a) Weighted Spectral Slope (WSS) [30]: The algorithm first decomposes the frame signal into a set of frequency bands, and the intensities within each critical band are measured. Then, a weighted distance between the measured slopes of the log-critical-band spectra is computed

$$ WSS=\frac{1}{N}\sum \limits_{j=1}^N\left(\frac{\sum_{i=1}^K{W}_{i,j}{\left({Y}_{i,j}-{X}_{i,j}\right)}^2}{\sum_{i=1}^K{W}_{i,j}}\right) $$
(8)

where \( N \) is the number of frames in the utterance and \( K \) is the number of sub-bands. \( W_{i,j} \), \( X_{i,j} \), and \( Y_{i,j} \) denote the weight, the spectral slope of the target speech signal, and the spectral slope of the converted speech signal, respectively, at the \( i \)th frequency band and \( j \)th frame.

b) Log-Likelihood Ratio (LLR) [51]: A distance measure that can be calculated from the linear prediction coefficient (LPC) vectors of the target and converted speech. The segmental LLR is

$$ LLR=\frac{1}{N}\sum \limits_{i=1}^N\log \left(\frac{a_{y,i}^T{R}_{x,i}{a}_{y,i}}{a_{x,i}^T{R}_{x,i}{a}_{x,i}}\right) $$
(9)

where \( a_x \), \( a_y \), and \( R_x \) are the LPC vector of the target signal frame, the LPC vector of the converted signal frame, and the autocorrelation matrix of the target speech signal, respectively.

c) Itakura-Saito (IS) [25]: Another distance measure computed from the LPC vectors

$$ IS=\frac{1}{N}\sum \limits_{i=1}^N\frac{\sigma_{x,i}^2}{\sigma_{y,i}^2}\left(\frac{a_{y,i}^T{R}_{x,i}{a}_{y,i}}{a_{x,i}^T{R}_{x,i}{a}_{x,i}}\right)+\log \left(\frac{\sigma_{y,i}^2}{\sigma_{x,i}^2}\right)-1 $$
(10)

where \( {\sigma}_x^2 \) and \( {\sigma}_y^2 \) are the LPC all-pole gains of the target and converted signal frames, respectively.

d) Log Spectral Distortion (LSD): Defined as the squared difference between the logarithms of the spectral envelopes of the target X(f) and converted Y(f) speech signals at N frequency points

$$ LSD=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\operatorname{mean}\left(\log X(f_i)-\log Y(f_i)\right)^2} $$
(11)
e) Normalized Covariance Metric (NCM) [37]: Based on the Speech Transmission Index (STI) [55], it uses the covariance coefficient r of the Hilbert envelopes of the target and converted frame signals

$$ NCM=\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\sum_{i=1}^{K} W_{i,j}\,\log\dfrac{r_{i,j}^2}{1-r_{i,j}^2}}{\sum_{i=1}^{K} W_{i,j}}\right) $$
(12)

where W is the weight vector applied to the STI of K bands and can be found by the articulation index [4].

f) Frequency-weighted segmental SNR (fwSNRseg) [37]: Similarly to Eq. (12), fwSNRseg can be estimated by

$$ fwSNR_{seg}=\frac{1}{N}\sum_{j=1}^{N}\left(\frac{\sum_{i=1}^{K} W_{i,j}\,\log\dfrac{X_{i,j}^2}{X_{i,j}^2-Y_{i,j}^2}}{\sum_{i=1}^{K} W_{i,j}}\right) $$
(13)

where \( X_{i,j}^2 \) and \( Y_{i,j}^2 \) are the critical-band magnitude spectra of the target and converted frame signals, respectively, at the \( i \)th frequency band and \( j \)th frame, \( K \) is the number of bands, and \( W \) is the weight vector defined in [4].

g) Mel-Cepstral Distortion (MCD) [33]: Based on the Euclidean distance between the target and converted frame vectors that describe the global spectral characteristics.

$$ MCD=\frac{1}{N}\sum \limits_{j=1}^N\sqrt{\sum \limits_{i=1}^K{\left({x}_{i,j}-{y}_{i,j}\right)}^2} $$
(14)

where \( x_{i,j} \) and \( y_{i,j} \) are the \( i \)th cepstral coefficients of the \( j \)th frame of the target and converted speech signals, respectively.
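For concreteness, minimal implementations of two of these measures, following the paper’s formulations in Eqs. (11) and (14), are sketched below; the exact averaging in Eq. (11) is ambiguous, so the LSD function encodes one reasonable reading, and no additional scaling constant is applied to the MCD.

```python
import numpy as np

def mcd(x_mgc, y_mgc):
    """Mel-Cepstral Distortion as in Eq. (14): mean Euclidean distance between
    target and converted cepstral vectors (frames along the first axis)."""
    return np.mean(np.linalg.norm(x_mgc - y_mgc, axis=1))

def lsd(x_spec, y_spec, eps=1e-12):
    """Log Spectral Distortion, one reading of Eq. (11): RMS of the per-frame mean
    squared log-spectral difference between target and converted envelopes."""
    d = np.log(x_spec + eps) - np.log(y_spec + eps)
    return np.sqrt(np.mean(np.mean(d ** 2, axis=1)))
```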

4 Evaluation results and discussion

The experimental evaluation has two main goals. First, it aims to evaluate the quality of the generated speech with respect to naturalness. The second goal is to evaluate how similar the converted speech is to the target speaker. Therefore, a reference (baseline) system with high-quality performance is required to demonstrate the effectiveness of the proposed methodology. Since the WORLD vocoder [42] is a high-quality speech synthesis system suitable for real-time applications and performs better than several other high-quality vocoders (such as STRAIGHT) [40], we use it as our state-of-the-art baseline within SVC. We did not use WaveNet or Tacotron based neural vocoders as a baseline in the experiment because our proposed vocoder is a source-filter based system, not an end-to-end acoustic model. For the VC experiments using the WORLD vocoder, we used the same FF-DNN architecture as for the proposed vocoder (see Sec. 2.2 and 3.2). We synthesized 20 utterances for each of the speaker-pair conversions, which means that 80 sentences are available for evaluation.

4.1 Objective evaluation

Here, we show the results for the error metrics presented in subsection 3.3. For all empirical measures, the calculation is done frame by frame, and a lower value indicates better performance, except for the fwSNRseg measure (where a higher value is better). The results were averaged, and the best value in each column of Table 1 is bold faced.

Table 1: Average scores on the converted speech signals for each speaker-pair conversion

The findings in Table 1 show that the baseline does not match the performance of our proposed model; in other words, the results reported in Table 1 strongly support the use of the proposed vocoder for SVC. In particular, the fwSNRseg values between converted and target speech frames using the proposed method with the continuous vocoder are higher than those using the baseline method. The WORLD vocoder is better only for the SLT-to-JMK speaker conversion.

The comparison of the spectral envelope of one speech frame converted by the proposed method is given in Fig. 11. The converted spectral envelope is plotted along with the source and the target. It may be observed that the converted spectral envelope is in general more similar to the target than to the source. Even though these two trajectories seem similar, the converted envelope is moderately smoothed compared with the target one, which can affect the quality of the converted speech. It can also be seen in Fig. 12 that the converted contF0 trajectories generated by the proposed method follow the same shape as the target, confirming the similarity between them and providing better F0 predictions. Similarly, Fig. 13 shows that the proposed framework produces converted speech whose MVF is more similar to the target trajectories than to the source ones.

Fig. 11. Example of one shorter segment /e/ from the natural source, target, and converted spectral envelopes using the proposed method. Sentence: “Gad, your letter came just in time”

Fig. 12. Example of the natural source, target, and converted contF0 trajectories using the proposed method. Sentence: “From that moment his friendship for Belize turns to hatred and jealousy”

Fig. 13. Example of the natural source, target, and converted MVF contours using the proposed method. Sentence: “Gregson shoved back his chair and rose to his feet”

As a result, these experiments show that the proposed model with the continuous vocoder is competitive for the SVC task, and in most measures superior to the reference WORLD model.

4.2 Subjective evaluation

To demonstrate the efficiency of our proposed model, we conducted two different perceptual listening tests. First, in order to evaluate the similarity of the converted speech to a reference target voice (the natural target voice), we performed a web-based MUSHRA-like (MUlti-Stimulus test with Hidden Reference and Anchor) listening test [26]. The advantage of MUSHRA is that it enables the evaluation of multiple samples in a single trial without breaking the task into many pairwise comparisons, and it is a standard method for speech synthesis evaluations. Within the MUSHRA test we compared four variants of each sentence: 1) source, 2) target, 3) converted speech using the high-quality baseline (WORLD) vocoder, and 4) converted speech using the proposed (continuous) vocoder. Listeners were asked to rate how similar each variant was to the natural reference (i.e., the target speaker), without considering speech quality, from 0 (highly dissimilar) to 100 (highly similar). From the testing set of Sec. 4.1, 12 utterances were randomly chosen and presented in a randomized order (different for each participant). Altogether, 48 utterances were included in the MUSHRA test (4 types × 12 sentences).

Second, in order to evaluate the overall quality and naturalness of the synthesized speech from both the proposed and baseline systems, a Mean Opinion Score (MOS) test was carried out. In the MOS test we compared three variants of each sentence: 1) target, 2) converted speech using the baseline (WORLD) vocoder, and 3) converted speech using the proposed (continuous) vocoder. The listeners had to rate the naturalness of each stimulus, from 0 (highly unnatural) to 100 (highly natural). The same 12 sentences were used as in the MUSHRA test. Altogether, 36 utterances were included in the MOS test (3 types × 12 sentences).

Before the test, listeners were asked to listen to an example from the male speaker to adjust the volume. Nineteen participants aged 23–40 (mean age: 30 years) took part in the online listening test; 12 were male and 7 were female. On average, the MUSHRA test took 13 min, while the MOS test took 12 min. The listening test samples are available online.

The MUSHRA similarity scores of the listening test are presented in Fig. 14. It can be seen that both systems achieve almost the same similarity to the target voice across all gender combinations. This means that our proposed model has successfully converted the source voice to the target voice in the cross-gender cases. In the case of the SLT-to-BDL conversion, the difference between the baseline and the proposed system is statistically significant (Mann-Whitney-Wilcoxon rank-sum test, 95% confidence level), while the other differences between the baseline and the proposed system are not significant.

Fig. 14. MUSHRA scores for the similarity question. A higher value means larger similarity to the target speaker. Errorbars show the bootstrapped 95% confidence intervals

Additionally, Fig. 15 shows the results of the MOS test. Both the baseline and proposed systems achieved low naturalness scores compared to the target speaker, showing that the listeners clearly differentiated the utterances resulting from voice conversion. It can also be seen that the listeners slightly preferred the baseline system over the proposed one. However, none of these differences are statistically significant (Mann-Whitney-Wilcoxon rank-sum test, 95% confidence level).

Fig. 15. MOS scores for the naturalness question. A higher value means better overall quality. Errorbars show the bootstrapped 95% confidence intervals

As the final result of the listening tests investigating similarity to the target speaker and overall quality, we can conclude that the proposed continuous vocoder performed well within the SVC framework, as it is on par with voice conversion using the WORLD vocoder.

5 Conclusions

In this paper, we proposed a new approach to statistical voice conversion using a feed-forward deep neural network. The main idea was to integrate the continuous vocoder, which provides an advanced model of the excitation signal, into the SVC framework by converting its contF0, MVF, and spectral features within a statistical conversion function. The advantage of this vocoder is that it does not require a voiced/unvoiced decision, which means that alignment errors between voiced and unvoiced segments are avoided in SVC. Its simplicity and flexibility therefore allow us to easily construct a voice conversion framework using a FF-DNN.

Using a variety of measurements, the strengths and weaknesses of the proposed method for different speakers were highlighted. In the objective experiments, the performance of the proposed system (using the continuous vocoder) was superior in most cases to that of the reference system (using the WORLD vocoder). Moreover, two listening tests were performed to evaluate the effectiveness of the proposed method. The similarity test showed that the reference and proposed systems are both similar to the target speaker, confirming the findings of the objective evaluations. No significant differences from the reference system were found in the quality (MOS) test. This means that the proposed approach is capable of converting speech with naturalness and perceptual intelligibility comparable to the reference system.

Plans for future research involve, first, adding a Harmonics-to-Noise Ratio as a new parameter to the analysis, statistical learning, and synthesis steps in order to further reduce the buzziness caused by vocoding. Secondly, it would be interesting to investigate the effectiveness of applying a mixture density recurrent network, using a bi-directional long short-term memory (Bi-LSTM) network, in SVC to further improve the perceptual quality of the converted speech.