Utterance copy (or copy synthesis) consists of imitating a human voice with a synthesizer [1] and has important clinical applications [2]. For example, specialists use utterance copy to artificially reproduce the voices of patients who cannot speak normally due to trauma, disease, or surgery, and the estimated set of parameters helps them better understand the related problems [24]. The task is as follows: given a target utterance (a sentence, word, or phoneme spoken by the person of interest), find the set of parameters that, when used as the input of a synthesizer, generates an artificial voice that resembles the target one. This task can be done manually, by trial and error, or automatically.

This paper presents a genetic algorithm called newGASpeech that performs the utterance copy task through a process of analysis-by-synthesis [2]. The obtained results are compared with the ones produced by the baseline WinSnoori [1], which, to the best of the authors’ knowledge, is the only freely available software that automatically estimates Klatt input parameters for utterance copy. Another related work is Procsy [5], but its current version requires additional input files, such as phonetic transcriptions, that are not readily available.

The results presented in this work are for both synthetic speech, which is obtained with a text-to-speech (TTS) system, and natural speech, for male and female speakers. Due to the difficulty of an objective evaluation of the synthetic voices, several complementary figures of merit were adopted, namely: the log-spectral distance (\(D_{LE}\)), signal-to-noise ratio (SNR) [6], root-mean-square error (RMSE), Perceptual Evaluation of Speech Quality (PESQ) [7], and P.563, a single-ended method for objective speech quality [8].

This work is organized as follows: background and a literature review about Klatt are described in the “Background” section. The methodology and the customized genetic algorithm to perform utterance copy are explained in the “Methods” and “Genetic algorithm for utterance copy” sections, respectively. The results and conclusions are presented in the “Results and discussion” and “Conclusions” sections.

Klatt synthesizer

Klatt [9, 10] is a formant-based synthesizer adopted in many speech studies (e.g., [2, 11]) because most of its inputs are closely related to physical parameters. This leads to a high degree of interpretability, which is essential in some studies of the acoustic correlates of voice quality, such as male/female speech conversion and simulation of breathiness, roughness, and vocal fry. The Klatt synthesizer has been used to mimic natural speech [12, 13] as well as pathological voices [2]. There are of course other synthesis techniques, together with methods for obtaining the associated input parameters, such as [14]. However, the role of each input parameter is sometimes not as easy to interpret as for Klatt, and alternative synthesizers seem less popular.

The Klatt synthesizer uses a production mechanism based on a source-filter model [9, 15] that models the vocal tract through a linear filter, with a set of time-varying resonators in parallel and/or in cascade. There are several versions of the Klatt synthesizer, but the most popular ones are Klatt80 [9], KLSYN88 [10], and a modified version of Klatt80 by Jon Iles [16]. Our system uses KLSYN88 (Fig. 1), which has 48 parameters. All these parameters are detailed in [10]; only a brief description is provided here.

Fig. 1 The KLSYN88 formant synthesizer adapted from [10]

Only 41 KLSYN88 parameters are effectively used here. Of the remaining seven, six are assumed to be zero (not shown in Fig. 1) and SQ belongs to a voicing source that is not adopted in this work (Fig. 1, orange rectangle). The KLGLOTT88 voicing source was used; it comprises F0 (fundamental frequency), AV (amplitude of voicing), OQ (open quotient), FL (flutter), and DI (diplophonia). The TL and AH parameters are also part of the voicing source and control an extra tilt of the voicing spectrum and the amplitude of aspiration, respectively [10]. Five resonators in cascade are needed to simulate laryngeal sounds through the formant frequencies F1 to F5 and their bandwidths B1 to B5, respectively. The resonators in parallel model fricative sounds through the parameters A2F to A6F and their bandwidths B2F to B6F, with AF controlling the amplitude of the frication source.

The baseline WinSnoori is free software for speech analysis that generates the Klatt parameters from an input speech waveform file. It performs utterance copy through a speech analysis algorithm. Its Klatt synthesizer is the modified version of Klatt80 by Jon Iles, which uses a configuration of 41 parameters to produce a speech frame.

The synthetic target speech files used in this study were generated by DECtalk [17], a text-to-speech (TTS) system produced by Fonix Corporation that internally uses KLSYN88 as the backend synthesizer. The voices generated by DECtalk are highly intelligible, and a demo version of the system was provided to LaPS (Signal Processing Laboratory of the Federal University of Pará) for research purposes. Because DECtalk generates speech using KLSYN88, it is very useful for our research: in this case, the “correct” Klatt input parameters are known, which is not true for natural speech.

More specifically, for each utterance specified by an input text, DECtalk generates an output file having 18 parameters that can be mapped to the 13 parameters of the HLSyn synthesizer [18] input file through a script developed in Java by our group and called DEC2HLSyn. The HLSyn can then be used to generate the input file of the KLSYN88, having the 48 necessary parameters to perform speech synthesis with Klatt. Generally speaking, HLSyn is a “high-level” synthesizer that expands its 13 input parameters into the 48 parameters of the “low-level” KLSYN88 synthesizer [18].


In this study, 240 sentences [19] were submitted to the DECtalk TTS to produce synthetic voices for 6 different speakers (3 males and 3 females). The sentences were grouped into male and female categories and histograms of all parameters were generated. From the histograms, it was possible to identify that DECtalk varies only 25 parameters (shown in black in Fig. 1) out of the 41 aforementioned, regardless of the speaker, while the others are kept at constant values. Parameters FL, DF1, DB1, and A6F have zero as their constant value. The other constant parameters have their values listed in Table 1, and the parameters that vary over time are listed in Table 2. As in [10], only five resonators are adopted in this work, such that F6 and B6 are not used. The suggested range of parameter values defined in [10] is expanded by DECtalk for parameters AF, B1, FNP, and FNZ [20].

Table 1 Klatt parameters with constant values different from zero
Table 2 25 KLSYN88’s changing parameters

The KLSYN88 input file is formed by several rows, each one representing a speech frame and holding a combination of values for the 25 varying parameters, which is used in this work to synthesize the speech for that frame (Fig. 2).

Fig. 2 KLSYN88 input file

The production of a KLSYN88 input file needs considerable time if performed manually. As discussed, the goal of newGASpeech is, given a waveform file with the target spoken utterance, to automatically estimate the input parameters of Klatt synthesizer. It uses a GA based on the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [21] with the Root Mean Squared Error (RMSE) as fitness function to evaluate the solution. The time to execute newGASpeech on a long utterance is significant, but the process is feasible because it can be done offline and does not demand time from a specialist to perform a manual utterance copy.

Genetic algorithm for utterance copy

The analysis-by-synthesis process begins by segmenting the input target voice file into frames of approximately 5 ms and obtaining information through a configuration file as shown in Fig. 3 (step 1). For each frame, a full run of GA is executed (several iterations). The parameters to synthesize the frame compose a chromosome that has its fitness calculated after using its corresponding parameters to feed Klatt. The fitness is the RMSE calculated between the target and synthesized signal frames. Care has to be exercised to properly re-initialize the synthesizer’s state (memory of its digital resonators, etc.) along the whole process that executes several GAs. After evaluating the population, a rank is assigned to each individual and those with better ranks are selected to undergo crossover and mutation. As a result, a new population is generated and, in turn, undergoes evaluation, selection, crossover, and mutation steps again. The whole process is repeated until the algorithm reaches one of the stopping criteria [22] (step 2). The best individuals for each frame compose a Klatt input file which is synthesized and outputs a synthetic voice file that aims at mimicking the target voice (steps 3 to 6).
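The evaluation step of this loop can be sketched as follows. This is a minimal illustration, not the newGASpeech implementation: the function names are hypothetical, `synthesize` is a stand-in callable for the Klatt synthesizer, and the frame length of 71 samples is the value the paper reports for a sample rate of 11025 Hz. The NSGA-II ranking, selection, crossover, and mutation steps are deliberately left out.

```python
import numpy as np

FRAME_LEN = 71  # samples per frame reported in the paper (fs = 11025 Hz)


def split_into_frames(signal, frame_len=FRAME_LEN):
    """Cut a 1-D signal into consecutive frames, dropping a trailing partial frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)


def rmse(target_frame, synth_frame):
    """Root-mean-square error between two equally long frames (lower is better)."""
    diff = np.asarray(target_frame, dtype=float) - np.asarray(synth_frame, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def evaluate_population(population, target_frame, synthesize):
    """Return one RMSE fitness value per chromosome.

    `synthesize` is a hypothetical callable standing in for the Klatt
    synthesizer: it maps a parameter vector to a frame of samples.
    """
    return [rmse(target_frame, synthesize(chrom)) for chrom in population]
```

In a full run, `evaluate_population` would be called once per generation and per frame, with the synthesizer state reset as described above before each evaluation.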

Fig. 3 Description of the newGASpeech framework

In most cases, the speech signal does not change abruptly from one frame to another. In order to improve convergence and suggest continuity with respect to the previous population, the framework has the option of running Interframe [22]. In this case, the best individuals of the previous frame are copied to initialize the population of the next frame (recall that a new GA-population is initialized for each frame).
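A possible way to seed a new population in the spirit of Interframe is sketched below. The function name, the uniform random initialization of the remaining individuals, and the list-of-integers chromosome layout are assumptions made for illustration; the 10 % default mirrors the fraction reported in the experimental configuration.

```python
import random


def seed_next_population(prev_population, prev_fitness, bounds,
                         pop_size, elite_fraction=0.10, rng=None):
    """Build the initial population of the next frame.

    The best chromosomes of the previous frame (lowest RMSE) are copied,
    and the remaining slots are filled with random integer genes drawn
    within each parameter's (lower, upper) bounds.
    """
    rng = rng or random.Random()
    n_elite = min(int(round(elite_fraction * pop_size)), len(prev_population))
    # Lower RMSE is better, so sort ascending by fitness.
    ranked = sorted(zip(prev_fitness, prev_population), key=lambda pair: pair[0])
    elite = [list(chrom) for _, chrom in ranked[:n_elite]]
    randoms = [[rng.randint(lo, hi) for lo, hi in bounds]
               for _ in range(pop_size - n_elite)]
    return elite + randoms
```

Copying rather than referencing the elite chromosomes keeps later mutations in the new frame from altering the stored solution of the previous one.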

Coding and decoding of chromosome

The population is composed of individuals whose chromosomes have genes that represent the Klatt synthesizer input parameters of a frame. Genes are grouped in two sections, \(F_v\) and \(T_v\), that store the parameters of the voicing source and the vocal tract, respectively. Figures 4 and 5 illustrate all genes that may compose the \(F_v\) and \(T_v\) sections, respectively.

Fig. 4 Structure of the voicing source section

Fig. 5 Structure of the vocal tract section of a chromosome

In addition to the sections mentioned above, the chromosome also has a voicing gene, which indicates whether the chromosome represents a voiced or an unvoiced segment. If the segment is unvoiced, the F0 and AV parameters are zero; otherwise, the sound is identified as voiced and both have nonzero values.
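The role of the voicing gene during decoding can be illustrated with a short sketch. The dictionary-based representation and the function name are hypothetical; only the rule that F0 and AV are zero for unvoiced segments comes from the text.

```python
def decode_chromosome(voiced, source_genes, tract_genes):
    """Merge the voicing gene and the two gene sections into one parameter dict.

    `source_genes` and `tract_genes` are dicts mapping Klatt parameter names
    to integer values (an assumed layout, chosen for readability).
    """
    params = dict(source_genes)
    params.update(tract_genes)
    if not voiced:
        # Unvoiced segment: F0 and AV are forced to zero.
        params["F0"] = 0
        params["AV"] = 0
    return params
```

For example, `decode_chromosome(False, {"F0": 943, "AV": 60}, {"F1": 500})` keeps F1 but zeroes F0 and AV; the numeric values here are illustrative only.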

Look-ahead mechanism

Usually, the speech signal synthesized by Klatt is composed of several frames. A frame should not be treated independently because it may influence the following frames: a configuration that is good for the current frame may still have a negative impact on the signal of the frames that follow. The Look-Ahead mechanism [20] evaluates the synthesis of the current frame configuration together with the \(n_f\) following frames, which greatly increases the search space of the problem. Figure 6 illustrates the chromosome structure.

Fig. 6 Division of chromosome into sections

\(B_v\) is the voicing gene, and \(F_v\), \(T_v\), and \(L_a\) are the voicing source, vocal tract, and Look-Ahead sections, respectively. \(B_v\), \(F_v\), and \(T_v\) were addressed previously. The \(L_a\) section stores the Look-Ahead frames, and its size depends on the number \(n_f\) of frames needed ahead to solve the problem. The experiments in this study showed that estimating the 25 parameters mentioned in the “Methods” section requires at least \(n_f = 2\) Look-Ahead frames, with all frames having the same weight, equal to one. Thus, the \(L_a\) section contains at least \(B_v\), \(F_v\), and \(T_v\) for the next 2 frames.

This minimum number of Look-Ahead frames is based on the time interval \(t_0\) between the impulses that generate a voiced excitation. The number of samples \(T_0\) corresponding to \(t_0\) seconds is \(T_0 = t_0 f_s\), where \(f_s\) is the sample rate in Hz. For any periodic signal, the fundamental frequency \(f_0 = 1/t_0\) (in Hz) is the reciprocal of the fundamental period \(t_0\). To obtain \(T_0\) as a function of the parameter F0, it remains to note that Klatt uses an integer parameter \(\text{F0} = 10 f_0\) to represent \(f_0\) [10]. Therefore, \(T_0 = f_s/(\text{F0}/10)\) is the approximate number of samples separating voicing impulses. In this research, \(f_s = 11025\) Hz and each frame is represented by 71 samples. Figure 7 shows the voicing source signal over 4 frames, where N is the current frame. In this case, frame N has an F0 value of 943 and, according to the \(T_0\) equation, the next impulse will occur 116.9 samples after the beginning of the period in this frame, in other words in frame N+2.

Fig. 7 Impulses of a voiced excitation

The average F0 value in the experiments was 975.5 for male voiced frames (\(T_0 \approx 113\) samples); therefore, the F0 value set for the current frame only takes effect two frames later, and the following frames are as important as frame N. For female speakers, the average F0 value was greater (≈1595.5). Applying the \(T_0\) equation gives approximately 69.1 samples until the next impulse. This indicates that female speakers require only the N+1 Look-Ahead frame, because the F0 value directly affects the next frame.
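The arithmetic above can be checked with a few lines of code. The function `pitch_period_samples` implements the \(T_0\) equation from the text; `lookahead_frames` is only an assumed rule of thumb (the number of 71-sample frames spanned by one pitch period) that happens to agree with the values reported for male and female speakers, not a rule stated by the authors.

```python
import math

FS = 11025       # sample rate in Hz, from the text
FRAME_LEN = 71   # samples per frame, from the text


def pitch_period_samples(f0_param, fs=FS):
    """T0 = fs / (F0 / 10): samples between voicing impulses.

    `f0_param` is the Klatt F0 parameter, i.e. ten times f0 in Hz.
    """
    return fs / (f0_param / 10.0)


def lookahead_frames(f0_param, fs=FS, frame_len=FRAME_LEN):
    """Assumed heuristic: frames spanned by one pitch period, rounded up."""
    return math.ceil(pitch_period_samples(f0_param, fs) / frame_len)


for f0 in (943, 975.5, 1595.5):
    print(f0, round(pitch_period_samples(f0), 1), lookahead_frames(f0))
```

With these inputs, the pitch periods are about 116.9, 113.0, and 69.1 samples, and the heuristic yields 2, 2, and 1 Look-Ahead frames, respectively.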

Dimensionality of the search space

Each Klatt parameter has its own possible range of values [10], with these values restricted to be integer numbers. For a single frame, the size of the search space S is given by:

$$ S = \prod_{n=1}^{N_{p}}\left(U_{n}-L_{n} + 1\right) $$

where \(N_p\) is the number of parameters to be estimated and \(U_n\) and \(L_n\) are, respectively, the upper and lower bounds of the consecutive integer values of parameter n. For example, if the framework is estimating \(N_p = 25\) parameters for one frame and each has \(U_n - L_n = 50\), i.e., 51 distinct values, ∀n, the search space is \(S = 51^{25} \approx 5 \times 10^{42}\). The search space gets even larger when it includes the Look-Ahead frames, and its size is then \({S}^{({{n}_{f}}+1)}\). Hence, procedures such as GA are an important tool to address this problem.
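Because Python integers have arbitrary precision, the size of the search space can be computed exactly. The sketch below evaluates the product formula for a list of (lower, upper) bounds and applies the exponent for the Look-Ahead frames; the function name and the uniform bounds in the example are illustrative assumptions.

```python
from math import prod


def search_space_size(bounds, n_lookahead=0):
    """S = prod(U_n - L_n + 1), raised to (n_f + 1) for n_f look-ahead frames."""
    single_frame = prod(upper - lower + 1 for lower, upper in bounds)
    return single_frame ** (n_lookahead + 1)


# 25 parameters, each with 51 consecutive integer values (illustrative bounds).
bounds = [(0, 50)] * 25
print(f"{search_space_size(bounds):.2e}")
print(f"{search_space_size(bounds, n_lookahead=2):.2e}")
```

The first value is on the order of \(10^{42}\); with two Look-Ahead frames the exponent triples, which shows why exhaustive search is out of reach.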

Results and discussion

The utterance copy experiments had two goals: first, to assess the GA convergence with respect to the dimensionality of the search space and, second, to compare the synthesized male and female voices obtained by newGASpeech and WinSnoori using natural and synthetic voices as targets. The main motivation for using both synthetic and natural voices was to have finer control of the experiments in the former case, given that the correct values of the input parameters are known, and to test our system with natural voices of unknown speakers in the latter. The DECtalk voices were from six American male and female speakers, and the words are listed in Table 3.

Table 3 Male and female words list—synthetic voices

The natural voices were taken from the TIDIGITS corpus [23]. They are pronunciations, by six American male and female speakers, of the digits two, four, six, eight, and nine, totaling 30 speech signals.

The newGASpeech was configured as shown in Table 4. The crossover and mutation rates are adaptive and may be decremented by 0.01 at each generation until the minimum rate of 0.01 is reached. However, the rates are only decreased if the population presents diversity. The Interframe option was used, and the best 10 % of the chromosomes from the previous frame were copied to initialize part of the next frame population. For the following simulations, the GA was configured to always estimate the 25 varying parameters (“Methods” section).
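The adaptive rate schedule can be summarized in a small sketch. How newGASpeech measures population diversity is not specified here, so the function simply receives that decision as a boolean; the function name is hypothetical.

```python
def adapt_rate(rate, population_is_diverse, step=0.01, minimum=0.01):
    """Decrease a crossover or mutation rate by `step`, never below `minimum`.

    The rate is left unchanged when the population lacks diversity.
    """
    if not population_is_diverse:
        return rate
    # Rounding avoids accumulating floating-point drift over many generations.
    return max(minimum, round(rate - step, 10))
```

Calling this once per generation for each rate reproduces the described behavior: a steady decrement while the population remains diverse, and a floor of 0.01.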

Table 4 newGASpeech configuration

Files generated by DECtalk and taken from the TIDIGITS corpus were used as input to both newGASpeech and WinSnoori. The target and synthetic signals were then aligned according to their cross-correlation, and the results were evaluated using the following metrics: SNR, RMSE, \(D_{LE}\), PESQ, and P.563. It should be noted that none of these metrics correlates perfectly with subjective evaluations. However, informal listening tests indicated a very good match between the overall result of the four metrics and a MOS-like subjective evaluation.
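One common way to align two signals by cross-correlation is sketched below with NumPy. It is an assumption about how such an alignment could be done, not a description of the authors' procedure: the lag that maximizes the cross-correlation is taken as the offset, and both signals are trimmed to their overlap.

```python
import numpy as np


def best_lag(target, synth):
    """Lag (in samples) by which `synth` must be delayed to match `target`."""
    target = np.asarray(target, dtype=float)
    synth = np.asarray(synth, dtype=float)
    xcorr = np.correlate(target, synth, mode="full")
    # Index 0 of the full correlation corresponds to lag -(len(synth) - 1).
    return int(np.argmax(xcorr)) - (len(synth) - 1)


def align(target, synth):
    """Remove the estimated offset and trim both signals to the overlap."""
    target = np.asarray(target, dtype=float)
    synth = np.asarray(synth, dtype=float)
    lag = best_lag(target, synth)
    if lag >= 0:
        target = target[lag:]   # target starts later: drop its leading samples
    else:
        synth = synth[-lag:]    # synth starts later: drop its leading samples
    n = min(len(target), len(synth))
    return target[:n], synth[:n]
```

For long utterances, an FFT-based correlation would be faster than `np.correlate`, at the cost of a slightly longer implementation.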

The SNR value should be as large as possible, indicating that the signal power is much higher than the “noise” power, whereas for RMSE and \(D_{LE}\), the smaller the better. PESQ and P.563 assign scores ranging from 1 to 5, with 1 being “bad quality” and 5 “excellent quality”. An important difference between these last two is that P.563 is single-ended, i.e., it does not use the target file when assigning a score. This is interesting in utterance copy because the other four metrics require a reasonable time alignment between the target and synthetic signals, while P.563 does not. However, P.563 is not appropriate for speech signals shorter than 3 s [8] and, because of that, it was not used to evaluate the signals from TIDIGITS.
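As an illustration, the SNR between aligned signals can be computed with the usual textbook definition, treating the difference between target and synthetic signals as noise. This definition is an assumption, since the exact formula from [6] is not reproduced here. The helper `p563_applicable` simply encodes the 3 s minimum duration mentioned above; both names are hypothetical.

```python
import numpy as np


def snr_db(target, synth):
    """SNR in dB: 10*log10(target power / error power).

    Identical signals give zero error power, for which NumPy returns inf
    (with a runtime warning).
    """
    target = np.asarray(target, dtype=float)
    noise = target - np.asarray(synth, dtype=float)
    return float(10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))


def p563_applicable(n_samples, fs, min_seconds=3.0):
    """True if the signal is long enough for P.563 to be meaningful."""
    return n_samples / fs >= min_seconds
```

A signal whose error amplitude is one tenth of the target amplitude yields 20 dB, which gives a feel for the scale of the reported values.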

Dimensionality of the search space

To evaluate how dimensionality influences the fitness function (RMSE), an option was developed to inform, frame by frame and for each parameter, a restricted range of values around the correct one. Experiments were performed using five words uttered by one of the male speakers shown in Table 3, and the newGASpeech configuration was as follows: a population of 1000 individuals, 300 generations, and the same initial crossover and mutation rates specified in Table 4. The ranges spanned ±2, 4, 8, 16, and 32 % around the correct parameter values of each frame.
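A restricted range of this kind could be derived as in the sketch below. The rounding direction (outward to the nearest integers) and the clipping to the parameter's full range are assumptions, as is the function name.

```python
import math


def restricted_bounds(correct_value, percent, lower, upper):
    """Integer range of +/- `percent` % around `correct_value`.

    The result is clipped to the parameter's full [lower, upper] range.
    """
    delta = abs(correct_value) * percent / 100.0
    lo = max(lower, math.floor(correct_value - delta))
    hi = min(upper, math.ceil(correct_value + delta))
    return lo, hi
```

Note that a parameter whose correct value is zero keeps a single admissible value under any percentage, so it contributes a factor of one to the search space size.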

In this experiment, the largest search space dimension occurs when 25 parameters are estimated and their values are restricted to a variation of ±32 % around the correct ones. The variable Z denotes this dimension and was calculated by Eq. 1. Its value is approximately \(6.86 \times 10^{102}\), including the dimension of the 2 required Look-Ahead frames. The other dimensions of the search space, for ±2, 4, 8, and 16 % variations of the parameter values, were normalized by Z. Figure 8 illustrates the average RMSE calculated for all words. It can be seen that the search space is huge and that the RMSE only decreases significantly when the search space is reduced by several orders of magnitude.

Fig. 8 Average RMSE for five words of the speaker Paul

Experiments with synthetic target voices

Figures 9 and 10 show the results for synthetic speech using boxplots, for male and female speakers. In all tests, the result obtained by newGASpeech was better than that of WinSnoori for all speakers according to the chosen figures of merit. The largest difference in favor of newGASpeech was observed for the speaker Wendy, with a lower RMSE median (91.6 %) and higher PESQ (28.4 %), SNR (99.6 %), and P.563 (78.3 %) medians, although her \(D_{LE}\) was not the lowest (91.2 %). These percentages reflect the variation with respect to the baseline. Female and male speakers obtained similar results when using the GA (Table 5), except for the SNR average, which was 96.8 % higher for female than for male speakers.

Fig. 9 (a) PESQ, (b) SNR, (c) P.563, (d) RMSE, and (e) \(D_{LE}\) values for male speakers

Fig. 10 (a) PESQ, (b) SNR, (c) P.563, (d) RMSE, and (e) \(D_{LE}\) values for female speakers

Table 5 Medians for newGASpeech (GA) and WinSnoori (Wsno) by voice type for male speakers

Experiments with natural target voices

In the experiments using natural target voices, as seen in Figs. 11 and 12, the GA obtained voices with higher PESQ and SNR, and lower RMSE and \(D_{LE}\), than the baseline. Results for female speakers were better than for male speakers. The female Speaker 2 presents the best PESQ median value (3.19), although her newGASpeech RMSE (40.21) and \(D_{LE}\) (0.028) were not the lowest and her SNR (21.45) was not the highest when compared to the other speakers.

Fig. 11 (a) PESQ, (b) SNR, (c) RMSE, and (d) \(D_{LE}\) values for male speakers

Fig. 12 (a) PESQ, (b) SNR, (c) RMSE, and (d) \(D_{LE}\) values for female speakers

Comparing the results obtained with synthetic and natural target voices (Table 5), newGASpeech achieved RMSE reductions of 43.5 % and 35.7 % for female and male speakers, respectively, although the PESQ also decreased. The SNR increased for male and decreased for female voices, and the \(D_{LE}\) increased significantly for both (≈88 % on average). The P.563 value is considered high for the results with synthetic target voices, with an average value of 4.4 for both male and female speakers, but of course the results with natural speech are more important.

WinSnoori was outperformed in all experiments. Comparing the baseline results for synthetic versus natural target voices, the PESQ median decreased by ≈14 % for female speakers and increased by the same percentage for male speakers. The SNR increased slightly and, as with the GA, the major differences were the reduction of the RMSE (≈75 % on average for male and female speakers) and the increase of the \(D_{LE}\) (≈86 % for both).

Percentage error of the newGASpeech estimated parameters for the synthetic voices

Given that the correct values were known, for all experiments with male and female synthetic voices (Table 3) using GA, the percentage error (PE) was calculated as

$$ PE = 100 \times {\left|\frac{\hat{p_{k}}-p_{k}}{p_{k}}\right|} \% $$

for all 25 estimated parameters, where \(\hat{p}_{k}\) is the Klatt parameter value estimated by the GA and \(p_{k}\) is its target value. The goal is to observe, in the input parameter space, which parameters have the best and the worst estimates.
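The PE formula translates directly into code. In the sketch below, the averaging helper skips frames whose target value is zero, because PE is undefined there; this handling, like the function names, is an assumption rather than something stated in the paper.

```python
def percentage_error(estimated, target):
    """PE = 100 * |(estimated - target) / target|, in percent."""
    if target == 0:
        raise ValueError("PE is undefined for a zero target value")
    return 100.0 * abs((estimated - target) / target)


def mean_percentage_error(estimated_track, target_track):
    """Average PE over frames, skipping frames with a zero target (assumption)."""
    errors = [percentage_error(e, t)
              for e, t in zip(estimated_track, target_track) if t != 0]
    if not errors:
        raise ValueError("no frame with a nonzero target value")
    return sum(errors) / len(errors)
```

For instance, an estimate of 490 for a target of 500 corresponds to a PE of 2 %.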

For each word, the PE average was calculated for voiced frames to avoid silence and low-energy unvoiced frames. Female and male speakers had seven parameters with an average PE less than 10 % and this same number of parameters with average error in the range between 10 % and 50 %. These parameters belong to the voicing source and cascade branch and are listed in Table 6.

Table 6 Parameters with percentage error below 50 %

The parameters with the smallest PE in Table 6 are those that impact the synthesized speech the most. For example, the parameters belonging to the parallel branch, responsible for the production of fricative sounds, had PE > 100 %, indicating that these parameters did not strongly influence the distortion of the generated signal. In contrast, F1 was the parameter with the highest accuracy for male speakers (98 %), given its strong impact on the generated speech. Curiously, for all speakers, AH and TL were in the same PE range.

Conclusions


This work presented the current version of the newGASpeech framework, which is based on GA and performs utterance copy through a process of analysis-by-synthesis. The obtained results were compared with the ones produced by the baseline WinSnoori.

The proposed software significantly outperformed WinSnoori with respect to five objective figures of merit: RMSE, SNR, spectral distortion, PESQ, and P.563 scores. The results were obtained for synthetic and natural speech files covering both male and female speakers. Compared to previous results, the RMSE decreased by 72.8 % and 64.8 % for male and female speakers, respectively.

This is ongoing work with improvements still to be made. For example, after the systems are properly tuned, a systematic subjective evaluation will be conducted. Another aspect that was taken into account in the design of the current experiments is that breadth, instead of depth, was prioritized. Future experiments will adopt a larger amount of speech data, especially relatively long sentences.

Another important direction for future work is to evaluate the results according to the input parameters themselves (their time evolution, dynamic range, etc.). Because the problem has a many-to-one mapping, a solution may match the target speech well when only the synthetic speech is taken into account and still have a relatively poor set of input parameters.

The newGASpeech will be made freely available for users of Klatt-based utterance copy applications. The final goal is to provide an easy-to-use solution that focuses on utterance copy for health applications. Despite being more than three decades old, the Klatt synthesizer is still a popular formant synthesizer, and this motivates the development of associated tools.