1 Introduction

In 1970, Fant [1] established the Source-Filter model, a classical acoustic modeling method that provided a promising approach to speech synthesis research. The Source-Filter model represents speech as the combination of a source and a linear acoustic filter, corresponding to the vocal cords and the vocal tract (soft palate, tongue, nasal cavity, oral cavity, etc.), respectively.

The electroglottograph (EGG) records the electrical impedance across the glottis, collected by electrodes placed on the throat, and can reflect vocal cord movement. When the vocal cords close, the contact area between the two cords reaches its maximum, which leads to the lowest resistance and the highest collected voltage in the EGG. Conversely, when the vocal cords open, the collected voltage is at its lowest [2]. Figure 1 illustrates the waveform of an EGG signal sample. Phases 1, 2, 3 and 4 represent the closing phase, maximum contact, the opening phase, and the open-but-no-contact phase, respectively. Based on the periodic change in the amplitude of the EGG signal during speaking, we can mark pitch and obtain the source information, as researched by Hussein [3].

Because the EGG is collected directly from the throat, speech analysis based on the EGG has two definite advantages: (1) it is not affected by mechanical vibration and noise, so it is suitable for ultra-high-noise environments; (2) it accurately reflects the vibration state of the vocal cords.

Fig. 1 The waveform of an EGG signal. Phase 1: the closing phase; Phase 2: maximum contact; Phase 3: opening phase; Phase 4: open but no contact phase

As the EGG signal corresponds closely to speaking, much research has been carried out on EGG. Aiming to explore the characteristics of EGG signals, Paul showed that features such as gender, vowel, and phonatory register can be extracted from the EGG signal [4]. Lu discussed the relationship between EGG and emotions [5]. Alberto evaluated EGG signal variability by combined amplitude-speed analysis [6]. As for utilizing EGG signals, Chen proposed a speech emotional feature extraction method based on EGG [7]. Michal Borsky used the EGG signal as a feature type and investigated its performance in a voice quality classification task [8]. Sunil Kumar put forward a robust method to detect glottal activity using the phase of the EGG signal [9]. Liu compared parametrization methods of EGG signals for distinguishing between phonation types [10]. Lebacq analyzed the dynamics of vocal onset through the shape of EGG signals [11]. Filipa studied the immediate effects of using a flow ball device for voice exercises, which can assist voice training [12].

Focusing on speech synthesis research, traditional speech synthesis aims to recover speech from the information in the raw speech and comprises waveform synthesis methods, rule-based synthesis methods, and parameter-based synthesis methods. Waveform synthesis mainly refers to editing and concatenating waveforms, which has limited performance. Rule-based synthesis produces speech through phonetic rules; PSOLA, a representative algorithm of this class, performs waveform splicing and prosody control [13]. However, this method requires a large sound library, making it difficult to apply on portable devices. Parameter-based synthesis mainly refers to speech synthesis from acoustic features. Many well-known synthesis systems implement this method, such as the Klatt series-parallel formant synthesizer [14], LPC [15], LSP [16], and LMA [17]. Additionally, char2wav [18], STRAIGHT [19], WORLD [20], Vocaine [21], Mel-spectrum-based models [22] and others have also achieved good results.

Speech synthesis with deep learning seeks to convert text to speech, mainly by extracting deep features. Oord [23] proposed a deep neural network model named WaveNet for generating raw audio waveforms. In 2017, Baidu put forward Deep Voice [24], which replaced the traditional pipeline with neural networks at different levels and applied the WaveNet model in the final speech synthesis module. Another widely used model, Tacotron [25], was proposed by Google; it is an end-to-end generative text-to-speech model that directly learns the mapping between text and speech pairs. Later, Baidu launched Deep Voice 2 [26] and Deep Voice 3 [27] to improve on the previous generations. Several improvements based on Tacotron have since been put forward to tackle different problems, in particular the specific characteristics of other languages. For Japanese text-to-speech (TTS), Yasuda [28] added self-attention to Tacotron to capture the long-term dependencies of pitch accents. Liu [29] designed a distillation loss function to modify the feature loss and proposed a teacher-student training scheme based on Tacotron to solve the exposure bias problem. To improve naturalness and tackle prosodic problems in Mandarin TTS, several solutions have also been proposed. Yang [30] proposed SAG-Tacotron, which replaces the CBHG encoder of Tacotron with a self-attention-based one and utilizes a learnable Gaussian bias to enhance localness modeling and counteract self-attention's tendency to disperse the attention distributions. Another popular direction is to design an extra front-end, realized by Lu [31], who proposed a text enhancing method and tried to leverage previous phrasing models and a larger text database at the same time, and by Pan [32], who set up a unified front-end to solve polyphone disambiguation and prosody word prediction. However, as all the speech synthesis methods mentioned above are text-to-speech, they are not suitable for generating personalized speech in our application scenario.

Attending to the inherent advantages of EGG, we have collected EGG and speech signals simultaneously and built a database named the Chinese Dual-mode Emotional Speech Database (CDESD [33]), which provides a basis for EGG research, especially for Mandarin speech. In our previous study, we proved that EGG signals can be used for text content recognition [34]. Two long Chinese sentences with different contents often differ in vocal cord movement, which is reflected in the EGG signal. Thus, it is reasonable to convert the EGG signal into one sentence under the condition of a limited set of content classes. We extracted the fundamental frequency (\(F_0\)), the relative difference of \(F_0\) (\(diffF_0\)), and the log short-term energy (logE) from the EGG signal in every frame to form a feature vector sequence, and fed it into a 3-layer bidirectional LSTM (Long Short-Term Memory [35]) to convert the sequence into one of 20 classes of sentences with different contents. This work achieved text recognition over a limited set of sentence classes from the EGG signal and provides support for our task of speech synthesis with EGG signals.

Based on our previous study, which proved that the EGG signal can be classified into text under limited-content scenarios, we now propose a framework to synthesize personalized speech from EGG signals. Compared to speech signals, EGG signals have the following two advantages in our application scenarios: (1) clean EGG signals can be collected in high-noise circumstances where clean speech signals cannot; (2) they can enable patients who retain vocal cord vibration but have lost the ability to produce voice to speak again. It must be highlighted that, to our knowledge, this is the first attempt to incorporate EGG signals into Tacotron-2.

This paper is organized as follows. In Section 2, our methods and materials are introduced. In Section 3, we discuss the results of our models and the comparative experiments we have conducted. Section 4 draws conclusions and points out expected future work.

2 Methods and materials

2.1 Methods

This paper proposes a framework that applies only the EGG signal for speech synthesis in a scenario with a limited number of content categories. Our framework consists of a text content recognition model driven by the EGG signal and a speech synthesis model driven by the text and the EGG signal. Figure 2 shows the overall structure of the framework. The text content recognition model obtains the text recognition result from the EGG signal input, which is essential for speech synthesis. The speech synthesis model then produces speech from the EGG signal and its text recognition result. Based on the modified Tacotron-2, three features (fundamental frequency, spectrum envelope, and aperiodic parameter) are generated from the text recognition result. To utilize the information contained in EGG signals and synthesize personalized speech, we choose WORLD as the vocoder and put forward a fine-grained fundamental frequency modification method to obtain the modified fundamental frequency. Finally, speech is synthesized by the WORLD vocoder from the three features.

Fig. 2 The flow chart of our framework

2.1.1 Text content recognition model with the EGG signal

In our previous study [34], we set up a text content recognition model for obtaining the text recognition result from the EGG signal, so we only briefly introduce the method here.

Figure 3 shows the structure of our text content recognition model, which consists of an EGG feature extraction module to obtain the feature vector sequence from the whole EGG signal and a recognition network to get the recognition result of the feature vector sequence.

Fig. 3 The structure of our text content recognition model

As Fig. 3 depicts, the EGG feature extraction module consists of three parts: a voiced-segment extraction part, a feature extraction part, and a smoothing part. When processing EGG signals, we first extract voiced frames from the EGG signal to avoid the influence of unvoiced segments. Considering that the pitch contour differs greatly between two long sentences, we choose three parameters as features: the fundamental frequency (\(F_0\)), the relative first-order difference of \(F_0\) (\(diffF_0\)), and the log short-term energy (logE). Among them, \(F_0\) is a commonly used parameter to characterize vocal cord vibration. The extraction of \(F_0\) is based on the periodic change in the amplitude of the EGG signal and is estimated by the autocorrelation method as follows:

$$\begin{array}{*{20}c} F_{0}=\frac{f_{s}}{\underset{\frac{f_{s}}{f_{max}}\le k \le \frac{f_{s}}{f_{min}}}{\arg \max }\sum _{m=0}^{N-1-k}x_{EGG}(m)x_{EGG}(m+k)}\end{array}$$
(1)

where \(f_s\) is the sampling rate. \(f_{max}\) and \(f_{min}\) are the maximum and minimum of the \(F_0\), respectively.
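As a minimal sketch of equation (1), the following Python snippet estimates the \(F_0\) of one voiced EGG frame by locating the autocorrelation peak within the lag range set by \(f_{min}\) and \(f_{max}\); the default search bounds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def estimate_f0(frame, fs, f_min=60.0, f_max=500.0):
    """Estimate F0 of one voiced EGG frame via the autocorrelation peak (eq. 1).

    `f_min` and `f_max` are assumed search bounds, not values from the paper.
    """
    n = len(frame)
    k_lo = int(np.ceil(fs / f_max))                    # smallest candidate lag
    k_hi = min(int(np.floor(fs / f_min)), n - 1)       # largest candidate lag
    # Autocorrelation r(k) = sum_m x(m) * x(m + k) for each candidate lag k
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(k_lo, k_hi + 1)])
    k_best = k_lo + int(np.argmax(r))                  # lag maximizing the autocorrelation
    return fs / k_best                                 # F0 = fs / argmax_k r(k)
```

For a 16 kHz frame, for example, this searches lags from about 32 samples (500 Hz) up to about 266 samples (60 Hz).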

\(diffF_0\) indicates the change of \(F_0\) over time and is calculated as equation (2). The log short-term energy (logE) is included to characterize the stress distribution of the EGG signal.

$$\begin{array}{*{20}c} diffF_0(i)=\frac{F_{0}(i+1)-F_{0}(i)}{F_{0}(i)}\end{array}$$
(2)

where \(F_{0}(i)\) and \(diffF_0(i)\) are respectively \(F_0\) and the relative first-order difference of \(F_0\) at the frame i.

As the \(F_0\) extraction method can produce erroneous values [36], \(F_0\) smoothing is required. We adopt the smoothing method with bidirectional searching proposed by Jun et al. [37], which combines interpolation and mean filtering according to a normalized \(F_0\) in every segment. Compared with the traditional median filter, bidirectional searching achieves better \(F_0\) smoothing performance in our experiments.

Through the feature extraction module, a feature vector sequence is prepared, which contains 504 frames with 3 features at every time step. Next, we feed the feature vector sequence into our recognition network. Figure 4 illustrates the structure of the recognition network.

Fig. 4 The structure of the recognition network

As Fig. 4 shows, our recognition network consists of an encoder and a classifier. The encoder extracts contextual information from the feature vector sequence and generates a contextual vector. The classifier then converts the contextual vector into an index that retrieves the sentence from the 20-class content dictionary. Owing to the superior performance of LSTMs in sequence processing tasks, we select a three-layer Bi-LSTM [38] as our encoder, which proved effective in our comparative experiment. In the encoder, the forward and backward outputs are concatenated to obtain the final encoded vector. The classifier then generates a probability vector by feeding the contextual vector into a fully connected layer followed by a softmax operator.
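The recognition network described above can be sketched in PyTorch as follows; the hidden size is an assumption for illustration, as the exact value is not restated here, and this is not the original training code.

```python
import torch
import torch.nn as nn

class EGGTextRecognizer(nn.Module):
    """3-layer Bi-LSTM encoder + fully connected classifier over 20 sentence classes."""

    def __init__(self, n_features=3, hidden=128, n_classes=20):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, 504 frames, 3 features)
        _, (h_n, _) = self.encoder(x)     # h_n: (num_layers * 2, batch, hidden)
        # Concatenate the last layer's forward and backward final states
        context = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        # Probability vector over the 20 content classes (use raw logits with
        # CrossEntropyLoss during training)
        return torch.softmax(self.classifier(context), dim=-1)
```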

2.1.2 Speech synthesis model with the EGG signal and the text

After the text content recognition model, we have obtained the corresponding text from each EGG signal. The next step is to synthesize speech from the text and the EGG signal. Our speech synthesis model consists of the same EGG feature extraction module, used to extract \(F_0\), and a Chinese speech synthesis model to synthesize speech. Additionally, to incorporate personal characteristics into the synthesized speech using EGG signals, we propose a fine-grained fundamental frequency modification method. Figure 5 shows the overall flow chart of our speech synthesis model.

Fig. 5 The structure of the speech synthesis model

The principle and details of the EGG feature extraction module have been discussed in Section 2.1.1; we reuse it here to obtain the \(F_0\) used in the fine-grained fundamental frequency modification method. The Chinese speech synthesis module consists of three parts: the text frontend, the acoustic model, and the vocoder. We introduce these parts, as well as our proposed \(F_0\) modification method, below.

Text Frontend

The function of the text frontend is to classify the input characters and encode them into a limited number of classes to reduce the difficulty of training the acoustic model. For the Mandarin TTS task, this means converting a sequence of Chinese characters into smaller text modeling units, such as a sequence of pinyin or phones.

To explore which text modeling unit is optimal between pinyin and phone, we conduct comparative experiments on each. Specifically, the pypinyin library is used to convert Chinese characters into pinyin, and a pinyin-to-phones dictionary is set up from the mapping between pinyin and phone sequences in the Chinese speech synthesis dataset. Compared with 4103 classes of Chinese characters, the text frontend reduces the number of classes to only 1609 for pinyin and 263 for phones.
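A minimal sketch of the pinyin branch of such a frontend is given below. It uses the pypinyin package mentioned above; the `pinyin2phones` dictionary is a hypothetical mapping standing in for the one built from the dataset's pinyin-phone alignments.

```python
from pypinyin import lazy_pinyin, Style

def text_to_units(text, pinyin2phones=None):
    """Convert Chinese characters to pinyin with tone digits, then optionally to phones.

    `pinyin2phones` is a hypothetical dict, e.g. {"zhong1": ["zh", "ong1"]},
    built from the alignments shipped with the speech synthesis dataset.
    """
    pinyins = lazy_pinyin(text, style=Style.TONE3)   # e.g. ["ni3", "hao3"]
    if pinyin2phones is None:
        return pinyins                               # pinyin modeling units
    phones = []
    for p in pinyins:
        phones.extend(pinyin2phones.get(p, [p]))     # fall back to the pinyin itself
    return phones                                    # phone modeling units
```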

Acoustic Model

The function of the acoustic model is to establish the mapping between text modeling units and acoustic features. Our acoustic model is based on Tacotron-2 [39] and modified with depthwise separable convolution to decrease the model size: we replace all the convolutions in the Tacotron-2 model with depthwise separable convolutions.

Fig. 6 The structure of the acoustic model

Tacotron-2 is an end-to-end TTS model. Its core is the acoustic model, shown in Fig. 6, which comprises an encoder, a decoder, and a post-processing module (PostNet). In the encoder, the text modeling unit sequence is converted into a dense vector through character embedding. This dense vector is passed through a 3-layer 1-dimensional convolution to simulate the language model and then fed into a 2-layer Bi-LSTM to obtain the encoded vector.

In the decoder, an attention weight vector is calculated based on the encoder output. Two types of attention, content attention and position attention, are applied here. Content attention measures the correlation between the decoder's hidden vector at a given time step and the encoder's hidden vector at each time step, while position attention measures the correlation between the decoder's attention weight vector at a given time step and the encoder's hidden vector at each time step. Attention scores are calculated by fully connected layers, as defined in the following equation:

$$\begin{array}{*{20}c} e_{ij} = score(s_{i},ca_{i-1},h_{j})=v_{a}^{T}tanh(Ws_{i}+Vh_{j}+Uf_{i}+b)\end{array}$$
(3)

where W, V, U and b are parameters learned by the fully connected layers. \(s_i\) is the hidden state of the decoder at the current time step i. \(h_j\) is the hidden state of the encoder at time step j. \(ca_i\) denotes the accumulation of the attention weight vectors \(a_j\), calculated by equation (4), and \(f_i\) is obtained by convolving \(ca_i\), as shown in equation (5). The attention weight vector \(a_j\) in equation (4) is composed of the attention weight coefficients, \(a_{j}=[a_{j1},a_{j2},\cdots ,a_{jS}]\). By calculating \(ca_i\), the attention network acquires the attention information that has already been learned, so that the model can avoid repeating unexpected speech.

$$\begin{array}{*{20}c} ca_{i}=\sum _{j=1}^{i-1}a_{j}\end{array}$$
(4)
$$\begin{array}{*{20}c} f_{i}=F*ca_{i}\end{array}$$
(5)

After calculating the attention score \(e_{ij}\), the attention weight coefficient \(a_{ij}\) is obtained by the softmax operation in equation (6). Finally, the output of the attention module, the context vector \(c_i\), is generated by accumulating the products of \(a_{ij}\) and the encoder hidden states \(h_j\), as shown in equation (7).

$$\begin{array}{*{20}c} a_{ij}=\frac{exp(e_{ij})}{\sum _{k=1}^{s}exp(e_{ik})}\end{array}$$
(6)
$$\begin{array}{*{20}c} c_{i}=\sum _{j=1}^{S}a_{ij}h_{j}\end{array}$$
(7)
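Equations (3)-(7) can be illustrated with the following sketch of one attention step. The layer sizes, filter count, and kernel length are assumptions chosen for clarity, not the exact Tacotron-2 hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentPositionAttention(nn.Module):
    """One step of the combined content + position attention of equations (3)-(7)."""

    def __init__(self, dec_dim, enc_dim, attn_dim=128, n_filters=32, kernel=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)    # W s_i
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)    # V h_j
        self.U = nn.Linear(n_filters, attn_dim, bias=True)   # U f_i + b
        self.conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2)  # f_i = F * ca_i
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_i, h, ca_i):
        # s_i: (B, dec_dim), h: (B, S, enc_dim), ca_i: (B, S) cumulative weights
        f_i = self.conv(ca_i.unsqueeze(1)).transpose(1, 2)           # (B, S, n_filters)
        e = self.v_a(torch.tanh(self.W(s_i).unsqueeze(1)             # eq. (3): scores
                                + self.V(h) + self.U(f_i))).squeeze(-1)
        a = F.softmax(e, dim=-1)                                     # eq. (6): weights
        c = torch.bmm(a.unsqueeze(1), h).squeeze(1)                  # eq. (7): context
        return c, a
```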

The inputs to the pre-processing network (PreNet) are acoustic features, and the teacher forcing criterion is used in the training phase. The output of PreNet and the context vector calculated at the previous decoding time step are fed into the 2-layer LSTM decoder. Meanwhile, a new context vector is generated from the decoder output combined with the attention weights of the previous decoding time step; this process forms a cycle. The final output is predicted by a linear projection of the concatenation of the decoder output and the context vector. There are two forms of output: one is the acoustic features, and the other is the stop-token probability, the latter being a binary classification that determines whether the decoding process ends. Besides, the acoustic features of p frames (\(p>1\)) are predicted at each time step to speed up computation and reduce memory consumption.

In PostNet, 5 convolutional layers and a residual connection are combined to refine the predicted acoustic features.

We design the loss of our acoustic model to include the following four parts: (a) The mean square error between target acoustic features \(y_{target,i}\) and predicted ones without post-processing \(y_{prev,i}\). (b) The mean square error between target acoustic features and predicted features with post-processing \(y_{post,i}\). (c) The cross-entropy loss between the one-hot vector of the target stop token \(S_{target}\) and the probability vector of the predicted stop token \(S_{prediction}\). (d) The \(L_2\) regularization loss (\(\lambda =10^{-6}\)). The loss function is defined as equation (8):

$$\begin{array}{*{20}c} loss=MSE(y_{target,i},y_{prev,i}) + MSE(y_{target,i},y_{post,i})+CE(S_{target},S_{prediction})+ \lambda \sum _{j=1}^{p}w_{j}^{2}\end{array}$$
(8)
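A sketch of the composite loss in equation (8) is given below. It uses a binary cross-entropy with logits for the stop token and sums the squared weights explicitly for the \(L_2\) term; in practice the regularization could equally be delegated to the optimizer's weight decay, so this is an illustration rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def acoustic_loss(y_target, y_prev, y_post, stop_target, stop_logits, model, lam=1e-6):
    """Composite loss of equation (8): two MSE terms, stop-token cross-entropy, L2."""
    mse_prev = F.mse_loss(y_prev, y_target)                 # (a) before PostNet
    mse_post = F.mse_loss(y_post, y_target)                 # (b) after PostNet
    stop_ce = F.binary_cross_entropy_with_logits(stop_logits, stop_target)  # (c)
    l2 = sum((w ** 2).sum() for w in model.parameters())    # (d) sum_j w_j^2
    return mse_prev + mse_post + stop_ce + lam * l2
```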

Additionally, to reduce the size of our model, we modify the baseline Tacotron-2 introduced above by replacing the regular convolutions with depthwise separable convolutions. The depthwise separable convolution originated from Xception [40] and MobileNet [41]. It is widely used as a substitute for regular convolution to reduce the number of parameters and the model size, and sometimes enables the model to converge faster [42].

Figure 7 illustrates the principle of the regular convolution. Concretely, for the 2-dimensional regular convolution, each convolution kernel squeezes all channels of the input feature map into a one-channel output feature map. Then all one-channel output feature maps are concatenated into an output with a given number of channels.

Fig. 7 The principle of the regular convolution

The 2-dimensional depthwise separable convolution comprises a depthwise convolution and a pointwise convolution. In the depthwise convolution phase, each convolution kernel operates on a single channel of the input feature map to generate \(C_M\) (channel multiplier, \(C_M\ge 1\)) output channels. In the pointwise convolution phase, all the channels of the depthwise convolution output are fed into a 1 \(\times\) 1 pointwise convolution to generate a one-channel output. Finally, all one-channel output feature maps are combined to produce an output with the given number of channels. Figure 8 depicts the principle of the depthwise separable convolution.

Fig. 8 The principle of the depthwise separable convolution

According to Fig. 8, the number of parameters of the depthwise separable convolution is calculated as the following equation.

$$\begin{array}{*{20}c} N_{sc}=k_{h} \times k_{w} \times C_{in} \times C_{M}+C_{in} \times C_{out} \times C_{M}\end{array}$$
(9)

With the regular convolution requiring \(N_{rc}=k_{h} \times k_{w} \times C_{in} \times C_{out}\) parameters, the ratio between the two methods is shown in equation (10), which shows that the depthwise separable convolution has far fewer parameters than the regular convolution.

$$\begin{array}{*{20}c} \frac{N_{sc}}{N_{rc}}=\frac{C_{M}}{C_{out}}+\frac{C_{M}}{k_{h} \times k_{w}}\end{array}$$
(10)

All the convolutions in Tacotron-2 are 1-dimensional. To apply the depthwise separable convolution, we expand all the feature maps with a height dimension. Concretely, we change the shape from the 2-dimensional form \(T \times W:T \times d\) to the 3-dimensional form \(H\times W\times C:1 \times T \times d\). Afterward, we set \(k_h\) to 1, \(k_w\) to the number of time steps, and \(C_{out}\) to the dimension of the features at each time step. Finally, we remove the height dimension to recover the 2-dimensional form (\(H \times W \times C:1 \times T\times d\rightarrow T \times W:T \times d\)).
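For the 1-dimensional case, a depthwise separable convolution can be sketched as below. This is an illustrative PyTorch version with an assumed kernel size; it is not the exact layer configuration used in our model.

```python
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise conv (each input channel convolved separately, C_M outputs per channel)
    followed by a 1x1 pointwise conv that mixes channels, cf. equation (9)."""

    def __init__(self, c_in, c_out, kernel_size=5, channel_multiplier=2):
        super().__init__()
        self.depthwise = nn.Conv1d(c_in, c_in * channel_multiplier, kernel_size,
                                   padding=kernel_size // 2, groups=c_in)
        self.pointwise = nn.Conv1d(c_in * channel_multiplier, c_out, kernel_size=1)

    def forward(self, x):          # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

# Parameter check against equations (9)-(10), counting weights only (no biases):
conv = DepthwiseSeparableConv1d(c_in=512, c_out=512, kernel_size=5, channel_multiplier=2)
n_weights = sum(p.numel() for p in conv.parameters() if p.dim() > 1)
# Regular Conv1d would need 5 * 512 * 512 = 1,310,720 weights; this layer uses
# 5*512*2 + 512*2*512 = 529,408, i.e. a ratio of 2/512 + 2/5, matching equation (10).
```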

Vocoder

The function of the vocoder is to generate speech from acoustic features. WaveNet, the default vocoder in Tacotron-2, is used to synthesize the original speech. However, to synthesize personalized speech with the aid of the EGG signal, we require a vocoder that takes \(F_0\) as input, so we choose WORLD in our framework. The WORLD vocoder generates speech from the \(F_0\), the spectrum envelope, and the aperiodic parameter. Among them, \(F_0\) serves as the periodic excitation and the aperiodic parameter as the aperiodic excitation, together constituting the mixed excitation signal e(n). The spectrum envelope simulates the resonance of the vocal tract through the minimum-phase response h(n). The synthesized speech signal is obtained by the convolution of these two signals. Figure 9 illustrates the principle of the WORLD vocoder.

Fig. 9 The principle of the WORLD vocoder
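A minimal analysis/synthesis sketch with the WORLD vocoder is shown below, assuming the pyworld Python bindings and the soundfile package; the file names are hypothetical and the snippet is an illustration of the vocoder interface rather than our exact pipeline.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("sample.wav")                  # hypothetical input file (mono)
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs)                      # fundamental frequency contour
sp = pw.cheaptrick(x, f0, t, fs)               # spectrum envelope
ap = pw.d4c(x, f0, t, fs)                      # aperiodic parameter

# Any modification of f0 (e.g., our coarse/fine-grained adjustment) would happen here
y = pw.synthesize(f0, sp, ap, fs)              # resynthesized waveform
sf.write("resynthesized.wav", y, fs)
```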

Fine-grained fundamental frequency modification method

To utilize the speaker's characteristics contained in the EGG signal and synthesize personalized speech, a fine-grained fundamental frequency modification method is proposed. We design both a parallel path, in which the EGG signal corresponds to the text, and a non-parallel path, in which the EGG signal conveys different content from the text. Figure 10 depicts the principle of our fine-grained fundamental frequency modification method.

Fig. 10 The principle of the fine-grained fundamental frequency modification method

For the non-parallel path, as the waveforms of \({F_0}_{EGG}\) and \({F_0}_{feature}\) differ, we apply a coarse-grained fundamental frequency modification to synthesize personalized speech. Since the average value of \(F_0\) describes the pitch of the speaker, we calculate the ratio of the average \(F_0\) of the EGG signal to that of the original \(F_0\) feature, as equation (11) shows, and use this ratio as the modification scale to adjust the \(F_0\) and spectrum envelope of the acoustic features predicted by Tacotron-2 point by point. The adjustment equations are defined as follows.

$$\begin{array}{*{20}c} R=\frac{\bar{F_0}_{EGG}}{\bar{F_0}_{feature}}\end{array}$$
(11)
$$\begin{array}{*{20}c} {F_0}_{personalized}(i)=R\times {F_0}_{feature}(i)\end{array}$$
(12)
$$\begin{array}{*{20}c} {Spec}_{personalized}(i,k)={Spec}_{feature}(i,[\frac{k}{R}])\end{array}$$
(13)

where R is the coarse-grained adjustment ratio. \(\bar{F_0}_{EGG}\) and \(\bar{F_0}_{feature}\) are the average \(F_0\) of the EGG signal and of the original synthesized speech, respectively. \({F_0}_{feature}(i)\) and \({F_0}_{personalized}(i)\) are the \(i\)-th frames of the \(F_0\) of the original synthesized speech and of the newly synthesized, adjusted speech, respectively. \({Spec}_{feature}(i,k)\) and \({Spec}_{personalized}(i,k)\) are the \(i\)-th frame and \(k\)-th frequency sampling point of the spectrum envelope of the original synthesized speech and of the newly synthesized, adjusted speech, respectively.

By adjusting the overall range of \(F_0\) and re-sampling the frequency axis of the spectrum envelope, we can synthesize personalized speech containing the speaker's characteristics with the WORLD vocoder.
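The coarse-grained adjustment of equations (11)-(13) can be sketched in NumPy as follows. The frame and bin indexing conventions are assumptions, and linear interpolation stands in for the integer index \([\frac{k}{R}]\) in equation (13).

```python
import numpy as np

def coarse_grained_modify(f0_feature, spec_feature, f0_egg):
    """Scale F0 by the ratio of mean F0s (eq. 11-12) and resample the spectrum
    envelope along the frequency axis by the same ratio (eq. 13)."""
    R = f0_egg[f0_egg > 0].mean() / f0_feature[f0_feature > 0].mean()  # eq. (11)

    f0_personalized = R * f0_feature                                   # eq. (12)

    n_frames, n_bins = spec_feature.shape
    src_bins = np.arange(n_bins) / R                                   # k / R, eq. (13)
    spec_personalized = np.empty_like(spec_feature)
    for i in range(n_frames):
        spec_personalized[i] = np.interp(src_bins, np.arange(n_bins), spec_feature[i])
    return f0_personalized, spec_personalized
```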

For the parallel path, as \({F_0}_{EGG}\) and \({F_0}_{feature}\) correspond to the same content, their waveforms are similar and can be aligned. We therefore put forward the fine-grained fundamental frequency modification method to adjust \({F_0}_{feature}\) in more detail and synthesize personalized speech. Owing to the obvious differences in sampling rate and duration between the EGG signal and the original synthesized speech, \({F_0}_{EGG}\) often mismatches \({F_0}_{feature}\). So we first apply dynamic time warping to \({F_0}_{EGG}\) to obtain \({F_0}_{EGG, aligned}\), which shares the same length and zero segments as \({F_0}_{feature}\). Then we apply the coarse-grained fundamental frequency modification to obtain \({F_0}_{coarse-grained}\). To further adjust \({F_0}_{coarse-grained}\) so that it imitates the changes of \({F_0}_{EGG}\) over time, we conduct the \(F_0\) fine-grained modification, which generates a specific ratio r(i) indicating the relationship between this time step and the overall range and modifies the \(F_0\) at every time step, defined as follows.

$$\begin{array}{*{20}c} r(i)=\frac{{F_0}_{EGG}(i)}{\bar{F_0}_{EGG}}\end{array}$$
(14)
$$\begin{array}{*{20}c} {F_0}_{personalized}(i)=r(i)\times {F_0}_{coarse-grained}(i)\end{array}$$
(15)

where r(i) is the fine-grained adjustment ratio, \(\bar{F_0}_{EGG}\) is the average \(F_0\) of the EGG signal, and \({F_0}_{coarse-grained}(i)\) and \({F_0}_{personalized}(i)\) are the \(i\)-th frames of \({F_0}_{coarse-grained}\) and of the \(F_0\) of the newly synthesized speech with fine-grained adjustments, respectively.

By applying the fine-grained modification, we rely more heavily on \({F_0}_{EGG}\). We obtain a personalized \(F_0\) that not only shares the same overall range as \({F_0}_{EGG}\) but also imitates its changes over time, which promises to retain more personalized characteristics such as tone and stress.
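Assuming \({F_0}_{EGG}\) has already been aligned to \({F_0}_{feature}\) by dynamic time warping, the fine-grained step of equations (14)-(15) reduces to a per-frame rescaling; the sketch below is illustrative and keeps unvoiced frames unchanged.

```python
import numpy as np

def fine_grained_modify(f0_coarse, f0_egg_aligned):
    """Per-frame adjustment of eq. (14)-(15); f0_egg_aligned is assumed to share the
    same length and zero (unvoiced) segments as the feature-side F0."""
    mean_egg = f0_egg_aligned[f0_egg_aligned > 0].mean()
    r = np.where(f0_egg_aligned > 0, f0_egg_aligned / mean_egg, 1.0)  # eq. (14)
    return r * f0_coarse                                              # eq. (15)
```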

2.2 Materials

The dataset for our text content recognition model is the Chinese Dual-mode Emotional Speech Database (CDESD [33]). This dataset was built by the Pattern Recognition and Human Intelligence Laboratory of the Department of Electronics and Information Engineering at Beihang University and was collected from 20 speakers aged 21 to 23 (13 men, 7 women). The dataset contains 11366 speech samples and corresponding EGG samples, covering 20 classes of sentences with different contents, which are the outputs of the classifier. In the experiment, 80% of the dataset is used as the training set and the rest as the validation set.

The dataset for Chinese speech synthesis is the Biaobei Chinese female voice dataset (Footnote 1), which is widely used for the Mandarin TTS task. The dataset was recorded by a 20-year-old woman with a lively, clear voice. The total duration is about 12 hours and the sampling rate is 48 kHz.

3 Experiments, results and discussions

3.1 Text content classification model

Figure 11(a) and (b) show the loss and accuracy of our text content classification model at every epoch. The best result on the validation set occurs at epoch 52, where the accuracy reaches 91.12%. This result is based on the following conditions: (a) choosing the 3-layer Bi-LSTM as the encoder, (b) including all three features, the \(F_0\), the relative first-order difference of \(F_0\), and the log short-term energy logE, and (c) choosing bidirectional smoothing as the smoothing method. The promising recognition accuracy provides strong support for speech synthesis based on the classified text.

Fig. 11 Loss and accuracy of the recognition model

To determine the best conditions for our text content classification model, we design three series of comparative experiments. The first explores which encoder is the most effective in extracting contextual information. We choose commonly used encoders as baselines, including CNN, Bi-GRU, and LSTM, to highlight the superiority of the Bi-LSTM. Additionally, to determine how many layers of Bi-LSTM perform best, we also conduct experiments with different numbers of layers. The results are listed in Table 1 and suggest that the 3-layer Bi-LSTM is the best encoder. This result proves the effectiveness of our encoder and provides guidance for choosing an encoder with an appropriate number of parameters.

Table 1 The comparative experiments among different encoders

The second comparative experiment explores whether every selected feature contributes to improving the accuracy of the recognition network and which combination is best. We try different combinations of features, with the original EGG signal as the baseline. The results in Table 2 indicate that using all three features is more effective than any other combination; that is, all three features help improve recognition. Among them, \(F_0\) proves to have an essential influence on the result, which accords with our expectation that \(F_0\) directly reflects the characteristics of the speaker's vocal cord vibration.

Table 2 The Acc results among different feature selection strategies

The third comparative experiment explores which smoothing method is optimal. We set \(F_0\) without any smoothing as the baseline and compare the bidirectional smoothing method with the traditional median filter. The results in Table 3 suggest that bidirectional smoothing achieves better results than the other methods.

Table 3 The Acc results among different \(F_0\) smoothing methods

The best experiment in this section shows the satisfying performance of our text content recognition model with the EGG signal, which lays a strong basis for the research on speech synthesis with the EGG signal and the text. Moreover, the comparative experiments show that the combination of the three conditions, namely the encoder, feature, and smoothing method selection, contributes to the best results of our model.

3.2 Speech synthesis model

Figure 12(a), (b), (c) and (d) show the total loss and the first three loss terms defined in equation (8) at every iteration of the acoustic model.

Fig. 12 Losses of the acoustic model

In our experiment, we set the batch size to 32, the total number of training iterations to 2M, and the initial and final learning rates to 1e-3 and 1e-5, respectively, with exponential decay every 4000 iterations. The optimizer is Adam [45]. After about 200k iterations, the loss curve converges to a very low value, which indicates a satisfying performance of the model. Comparing Fig. 12(b) and (c), the loss of the acoustic features with post-processing is much lower than that without post-processing, which proves the effectiveness of the post-processing module [46]. Figure 12(d) shows that the model has learned at which time step to stop generating the predicted acoustic features.
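As an illustration of this schedule, the sketch below derives a per-step decay factor from the stated 1e-3 to 1e-5 range over 2M iterations with a decay event every 4000 iterations; the exact decay constant is not reported in the paper, so this is an assumption rather than the original configuration.

```python
import torch

def make_optimizer(model, init_lr=1e-3, final_lr=1e-5,
                   total_iters=2_000_000, decay_every=4_000):
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)
    n_steps = total_iters // decay_every                    # 500 decay events
    gamma = (final_lr / init_lr) ** (1.0 / n_steps)         # ~0.9908 per event
    # StepLR lowers the learning rate by `gamma` every `decay_every` iterations,
    # assuming scheduler.step() is called once per training iteration.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_every, gamma=gamma)
    return optimizer, scheduler
```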

To evaluate the sound quality of the synthesized speech, objective and subjective tests are conducted. We choose Mel cepstral distortion (MCD) [47] for the objective test, as MCD is a common indicator of spectral performance. For the subjective test, the mean opinion score (MOS) is computed. We choose SAG-Tacotron [30] as a representative of state-of-the-art performance, as it improves the naturalness of speech without a complex front-end, corresponding to our goal.

3.2.1 Objective test

In the objective test, we set the original Tacotron-2 as the baseline and compare our model with SAG-Tacotron. We test our model with and without the fine-grained fundamental frequency modification method to explore the influence of EGG in speech synthesis. Dynamic time warping (DTW) is applied to align the frames of the predicted Mel spectrums with the ground truth.
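For reference, one common definition of MCD over DTW-aligned mel-cepstral frames is sketched below. This is the conventional formula, conventionally excluding the 0th (energy) coefficient; the paper does not state the exact variant used, so the details are assumptions.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """MCD in dB between DTW-aligned mel-cepstra of shape (frames, dims)."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]                     # drop the 0th coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()                                  # average over frames
```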

Table 4 shows the MCD results of the different methods. As a lower MCD indicates better spectral performance, our model outperforms Tacotron-2 with a decrease of 0.14, and our fine-grained fundamental frequency modification method improves the quality of the speech. The performance of our model, Tacotron-2+DSC with modification, is comparable with the state of the art.

Table 4 The MCD evaluation of different acoustic models

3.2.2 Subjective test

For the mean opinion score (MOS) measurement, we set up 5 evaluation sets, each containing 5 different sentences, and invite 20 listeners, 10 men and 10 women aged 18 to 40, to randomly choose samples and rate their quality on a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. Table 5 shows the performance of the different methods. According to the participants’ feedback, our model improves on the original Tacotron-2 with a gain of 0.42 and achieves a score comparable with SAG-Tacotron. When combined with the fine-grained fundamental frequency modification method, we obtain a higher score of 3.94, which proves that our modification is effective. Besides, the low variance indicates the robustness of our model.

Table 5 Mean opinion scores (MOS) with 95% confidence intervals

3.2.3 Comparative experiments

We conduct two series of comparative experiments to determine the best acoustic model, focusing on the text modeling unit and the selection of the depthwise separable convolution parameters. For the Mandarin TTS task, the choice of text modeling unit is essential, so we explore both pinyin and phone to determine which is optimal. Figure 13 shows the alignment between the encoder and decoder. Comparing Fig. 13(a) with (b), the alignment curve of phone modeling is nearly a straight line, while that of pinyin modeling is messy. This shows that, for this dataset, choosing phone as the text modeling unit is much better than pinyin, because the number of phone classes is much smaller than that of pinyin, let alone Chinese characters.

Fig. 13 The alignment between the encoder and decoder with different text modeling units

Table 6 shows the comparative MOS results for the two text modeling units. The result again confirms the better performance of phone modeling. However, the quality of the synthesized speech is worse than the ground truth. This may be because phone modeling treats the transition between two consecutive phones from two different Chinese characters the same as that within a Chinese character, which makes the synthesized speech less fluent and natural.

Table 6 The MOS under different text modeling units

The other comparative experiment explores whether the modification of the acoustic model works. Selecting the original Tacotron and Tacotron-2 as baselines, we seek the best \(C_M\) for our Tacotron-2 revised with depthwise separable convolution. Table 7 shows the comparative results in two respects: (a) the MOS of the synthesized speech and (b) the model size of the acoustic model. The conclusion is that Tacotron-2 is much better than Tacotron. Moreover, the model revised with the depthwise separable convolution structure synthesizes better speech than the original, while its model size is much smaller.

Table 7 The MOS and model size of different acoustic models

Both aiming at natural prosody in an end-to-end Mandarin speech synthesis system, the state-of-the-art performance is achieved by SAG-Tacotron [30]. As shown in Tables 4 and 5, our model achieves performance comparable with SAG-Tacotron in both the objective and subjective tests. And as Table 7 shows, our model strikes a balance in the trade-off between the quality of the synthesized speech and the model size.

3.3 Fine-grained fundamental frequency modification method

Fig. 14 The \(F_0\) extracted from the EGG signal

Figure 14 shows the \(F_0\) extracted from the EGG signal. For the non-parallel path, Fig. 15(a) shows the \(F_0\) extracted from the features predicted by Tacotron-2. As \({F_0}_{EGG}\) contains personalized features of the speech, it can be used to adjust the original \(F_0\) and synthesize personalized speech. The adjustment ratio R is computed by equation (11). Figure 15(c) is the spectrum of the original synthesized speech. Figure 15(b) and (d) show the \(F_0\) and the spectrum of the adjusted synthesized speech, respectively. Figure 15(b) shows that the average \(F_0\) is closer to that of the EGG signal, which means the pitch of the newly synthesized speech is similar to the speaker's. Figure 15(d) shows the frequency-axis resampling, i.e., the corresponding change of the fundamental frequency and the harmonics.

Fig. 15 The \(F_0\) and spectrum of the original and personalized speech

For the parallel path, \({F_0}_{EGG}\) is first aligned to the \(F_0\) from the features by dynamic time warping. Then the fine-grained fundamental frequency modification method is applied to adjust \(F_0\) at a more detailed level. The comparison of \(F_0\) between the original and personalized speech is shown in Fig. 16. It indicates that not only has the average \(F_0\) been modified to fit the speaker's tone, but the trend of \(F_0\) has also been adjusted according to \({F_0}_{EGG}\), which carries the stress information into the final personalized speech.

Fig. 16 The comparison of \(F_0\) between the original and personalized speech

To assess the voice quality of the personalized speech, a series of subjective evaluations is conducted. Table 8 shows that, for the non-parallel path, the coarse-grained modified speech obtains a mean opinion score (MOS) of 3.94, which proves that the EGG signal contributes to improving the naturalness of \(F_0\) and synthesizing personalized speech. Compared with the state-of-the-art Mandarin TTS system, SAG-Tacotron [30], our method achieves a better MOS. For the parallel path, the MOS of the fine-grained modified speech is slightly lower than that of the original speech, possibly because of residual errors in the alignment. However, it must be pointed out that the fine-grained modified speech conveys the stress of the speaker, as the listeners reported. So the fine-grained fundamental frequency modification method still proves to add more detailed information to the final speech, and its performance is promising once the alignment problem is solved.

Table 8 The MOS of personalized speeches

The results of our experiments indicate that utilizing EGG signals makes the personalized synthesized speech more consistent with the speaker's characteristics. For the non-parallel path, the coarse-grained fundamental frequency modification method achieves a higher MOS of 3.94 than the original speech. For the parallel path, the fine-grained fundamental frequency modification method makes the final speech include the stress information of the speaker.

4 Conclusions

In this paper, a speech synthesis framework with EGG signals based on a modified Tacotron-2 is proposed for use in extreme environments where speech signals can hardly be collected. This framework consists of a text content recognition model and a speech synthesis model. To synthesize personalized speech, we propose a fine-grained fundamental frequency modification method.

The text content recognition model converts each EGG signal sample into the corresponding text from a category of contents. This model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The comparative experiments show the following: (1) the 3-layer Bi-LSTM gains higher accuracy than the other recognition models we tested; (2) all three features contribute to the result, and their combination is the most effective; (3) the smoothing method with bidirectional searching achieves better results than traditional methods.

The speech synthesis model synthesizes personalized speech from the corresponding text and EGG signals. Our model achieves results comparable to the state of the art in terms of both MCD and MOS. It obtains a mean opinion score (MOS) of 3.87 with a relatively small model size and, with the aid of EGG signals, synthesizes personalized speech with an MOS of 3.94 that is more consistent with the speaker's characteristics. The comparative experiments show that: (1) in terms of text modeling units, phone is much better than pinyin; (2) Tacotron-2 with depthwise separable convolution (channel multiplier = 2) is better than the other acoustic models considering both the quality of the synthesized speech and the model size.

The expected future work is as follows. For the text content recognition model, the dataset will be expanded to more content classes to obtain a more general result. For the speech synthesis procedure, a better acoustic model will be explored to increase speech quality and support other applications; for instance, further modifications can be explored for deployment on portable devices. Spiking neural networks (SNNs) [48, 49], the third generation of neural networks, comprise spiking neurons. The addition of a temporal dimension for information encoding in SNNs yields new insight into the dynamics of the human brain and can lead to compact representations of large neural networks [50]. As such, SNNs have great potential for solving complicated time-dependent pattern recognition problems defined by time series, so applying SNNs to speech synthesis is a fascinating future direction. As the development of SNNs controlling mobile robots is one of the modern challenges in computational neuroscience and artificial intelligence [51], further motivation may arise when associating the TTS task with neuromorphic computing [52,53,54]. For example, when dealing with TTS tasks on large-scale datasets, multicompartment emulation is an essential step toward enhancing the biological realism of neuromorphic systems and further understanding the computational power of neurons [55]. Finally, for the fine-grained fundamental frequency modification method, a more proper alignment method will be explored.