Introduction

Voice conversion (VC) is a digital signal processing technology that analyzes and reconstructs acoustic features [1]-[4]. It makes the converted speech sound like a target speaker while keeping the linguistic content unchanged before and after conversion. Voice conversion has a wide range of applications in personalized speech synthesis [5], film and television dubbing, speaker identity anonymization [6], and data augmentation. With the continuous development of deep learning, voice conversion has achieved significant improvements in both speech naturalness and similarity to the target speaker’s voice. For example, voice conversion methods based on generative adversarial networks [7, 8] have extended good conversion performance to non-parallel datasets [9, 10]. Meanwhile, encoder–decoder voice conversion frameworks [11, 12], which combine vector quantization [13, 14] and instance normalization [15], have been investigated extensively to disentangle speech content from speaker identity and have improved the generalization ability of voice conversion models. However, these methods perform well only when the voices of both the source and target speakers are clean. Ubiquitous ambient noise in real-world scenarios severely degrades the quality of the converted speech. Therefore, many approaches have been proposed to remove the background sounds of the input speech before voice conversion.

Some studies have shown that a pre-trained noise-robust automatic speech recognition model can extract speech content from noisy inputs to alleviate the negative impact of background sounds [16, 17]. Another option is to preprocess the noisy speech with a denoising module, typically an advanced speech enhancement method [18, 19], and then feed the denoised speech into the downstream voice conversion task [20]. However, such separate processing inevitably introduces speech distortions and thus reduces the quality of the converted speech. As an alternative to removing background sounds, Xie et al. proposed multitask learning to preserve them [21]. Specifically, this method involves two tasks: one converts the input speech into the target speech, and the other reconstructs the background sounds. By jointly optimizing the training objectives of the two tasks, the model converts speech while preserving the background-sound information. The work in [21] improves the speech separation (SS) task by considering phase information, but the model does not consider the relationship between the background sounds and the enhanced speech during training. Furthermore, in the aforementioned research, background sounds are often discarded as noise, overlooking the fact that in certain application scenarios (such as audiobooks and dubbed movies), background sounds also carry valuable information.

To address the significant degradation in speech quality when converting noisy speech, and to allow background sounds to be flexibly preserved, a noise-robust voice conversion model is proposed that jointly trains speech separation and voice conversion, where a user can freely choose to retain or remove the background sounds. The model consists of an optimized speech separation module and a voice conversion module. In the speech separation module, a dual-decoder method based on the Deep Complex Convolution Recurrent Network (DCCRN) [22] uses two decoders to estimate the speech and the background sounds, respectively. A bridge module [23] is introduced to capture hidden information in the denoised speech and the noise through information exchange. The voice conversion module uses VQMIVC [24] as the backbone network and combines a cycle loss with the original mutual information loss to strengthen feature disentanglement. A unified loss function is used to train the overall model, alleviating the performance degradation caused by speech distortions that arise when the modules are trained separately. Experimental results show that the proposed method significantly improves the quality of voice conversion and can effectively preserve or remove background sounds in noisy environments.

The contributions of this paper are summarized as follows:

  (1) A background-sound-controllable, noise-robust voice conversion method is proposed, which achieves flexible control of background sounds in noisy environments by jointly training a speech separation module (which separates the background noise from the clean speech) and a voice conversion module.

  (2) A bridge module is introduced to capture the hidden information in the denoised speech and the background sounds, which reduces the coupling between them and provides high-quality speech for the subsequent voice conversion task.

  (3) The cycle loss and the mutual information loss are combined to optimize the voice conversion module, which further improves the disentanglement of speech content, speaker identity, and pitch, and improves the quality of the converted speech.

The remainder of this article is organized as follows: Sect. “Background-controllable voice conversion model” introduces the details of the proposed model and the configuration of the loss function. Sect. “Experimental settings” describes the experimental setup, including the dataset, the evaluation metrics, and the baseline methods. Sect. “Experimental results and analysis” analyzes and summarizes the results of the comparative experiments and the ablation studies, together with a visual analysis of audio examples. Finally, Sect. “Conclusion” presents the conclusion.

Background-controllable voice conversion model

The model consists of a speech separation module and a voice conversion module, as shown in Fig. 1. The noisy speech is first fed into the speech separation module to obtain the denoised speech and the background sounds. The denoised speech is then used as the input to the voice conversion module. Finally, depending on the application scenario, either clean converted speech or converted speech with preserved background sounds can be obtained. The remainder of this section introduces the details of each module.

Fig. 1
figure 1

The framework of the proposed model

Speech separation module

In the Deep Noise Suppression Challenge 2020 [25], DCCRN demonstrated state-of-the-art performance. The model has a simple structure and low computational complexity, and the complex convolutions it uses capture the phase information of speech signals well. Therefore, DCCRN is used as the backbone network of the speech separation module, on top of which a dual-decoder speech separation model is proposed to promote the separation of speech and background sounds. As shown in Fig. 2, the speech separation model consists of a speech branch and a background-sound branch. The noisy speech waveform is preprocessed into a Mel spectrogram that serves as the input to the network. The two branches share a complex encoder and two complex long short-term memory (LSTM) layers. Two independent complex decoders decode the speech and the background sounds, respectively, and a bridge module models implicit feature fusion between the two branches. Finally, the estimated denoised-speech and background-sound Mel spectrograms are obtained and serve as inputs to the downstream voice conversion task. The complex encoder, complex LSTM, and bridge module are described below.

Fig. 2
figure 2

The diagram of the network structure for speech separation

Complex encoder

The complex encoder includes complex 2D convolutional blocks, complex batch normalization layers, and complex parametric rectified linear unit (PReLU) [26] activation functions. Complex convolution can be viewed as a convolution operation in the Fourier domain, where signals are represented as complex numbers with real and imaginary parts. By converting the input signal and the convolution kernel into complex representations, the convolution operation can be transformed into a dot-product operation, thereby improving computational efficiency. The calculation process is illustrated in Fig. 3.

Fig. 3
figure 3

The schematic of the complex domain performing 2D real convolutions

Specifically, suppose that the complex convolution filter is \(Y = {Y_r} + j{Y_i}\), the real matrix \({Y_r}\) represents the real part of the complex convolution kernel, and \({Y_i}\) represents the imaginary part of the complex convolution kernel. At the same time, the complex matrix is defined as \(X = {X_r} + j{X_i}\), and the complex convolution operation formula is obtained as follows:

$$ X \otimes Y = \left( {{X_r} * {Y_r} - {X_i} * {Y_i}} \right) + j\left( {{X_r} * {Y_i} + {X_i} * {Y_r}} \right)$$
(1)

where \( \otimes\) denotes complex convolution and \( *\) denotes real convolution.
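As a concrete illustration, Eq. (1) can be realized with two real-valued 2D convolutions that play the roles of \({Y_r}\) and \({Y_i}\). The following PyTorch sketch is an assumption for illustration only (the class name ComplexConv2d and the default kernel settings are not the authors' released code):

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex 2D convolution following Eq. (1):
    (Xr*Yr - Xi*Yi) + j(Xr*Yi + Xi*Yr).
    Illustrative sketch, not the paper's exact implementation."""
    def __init__(self, in_ch, out_ch, kernel_size=(3, 3), stride=(2, 1), padding=(1, 2)):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # real-part kernel Yr
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)  # imaginary-part kernel Yi

    def forward(self, x_r, x_i):
        # Real part:      Xr*Yr - Xi*Yi
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        # Imaginary part: Xr*Yi + Xi*Yr
        out_i = self.conv_i(x_r) + self.conv_r(x_i)
        return out_r, out_i
```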

The complex batch normalization layer and the complex PReLU activation function follow the implementation in [27]. The complex PReLU activation function was originally designed to satisfy the Cauchy–Riemann equations, so it applies the activation to the real and imaginary parts separately, as shown in Eq. (2). The structure of the complex decoder is essentially the same as that of the complex encoder, so it is not described further here.

$$ CPReLU(X) = PReLU({X_r}) + j\,PReLU({X_i})$$
(2)

Complex LSTM

Similar to complex convolution, the complex LSTM replaces the real-valued operation with a complex-valued one. Given the complex input X, the complex LSTM is implemented as follows:

$$\begin{gathered} {F_{rr}} = LSTM_r\left( {X_r} \right);\quad {F_{ir}} = LSTM_r\left( {X_i} \right) \hfill \\ {F_{ri}} = LSTM_i\left( {X_r} \right);\quad {F_{ii}} = LSTM_i\left( {X_i} \right) \hfill \\ \end{gathered}$$
(3)
$${F_{out}} = \left( {{F_{rr}} - {F_{ii}}} \right) + j\left( {{F_{ri}} - {F_{ir}}} \right)$$
(4)

In Eq. (3), \(LSTM_r\) and \(LSTM_i\) denote the real and imaginary LSTM components of the traditional LSTM, \({F_{rr}}\) is the output of \(LSTM_r\) applied to \({X_r}\) (the other terms are defined analogously), and \({F_{out}}\) in Eq. (4) is the output feature of the complex LSTM layer.
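A minimal sketch of Eqs. (3)–(4), assuming two standard real-valued LSTMs act as the real and imaginary components (the class name and the hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ComplexLSTM(nn.Module):
    """Complex LSTM per Eqs. (3)-(4): two real LSTMs applied to the real and
    imaginary inputs, recombined as (Frr - Fii) + j(Fri - Fir).
    Illustrative sketch only."""
    def __init__(self, input_size, hidden_size=1024):
        super().__init__()
        self.lstm_r = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_r
        self.lstm_i = nn.LSTM(input_size, hidden_size, batch_first=True)  # LSTM_i

    def forward(self, x_r, x_i):
        f_rr, _ = self.lstm_r(x_r)   # LSTM_r(Xr)
        f_ir, _ = self.lstm_r(x_i)   # LSTM_r(Xi)
        f_ri, _ = self.lstm_i(x_r)   # LSTM_i(Xr)
        f_ii, _ = self.lstm_i(x_i)   # LSTM_i(Xi)
        out_r = f_rr - f_ii          # Eq. (4), real part
        out_i = f_ri - f_ir          # Eq. (4), imaginary part
        return out_r, out_i
```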

Bridge module

The bridge module integrates different types of features. Specifically, it consists of two parts: information exchange and information integration. The information exchange part exchanges the feature representations of each branch with those of the other branch, so that features learned in different branches can be transmitted and shared. The information integration part integrates the exchanged feature representations and generates the final prediction. The bridge module used in this paper has a structure similar to that of the complex encoder, except that the kernel size of the complex 2D convolutional layer is 1×1, as shown in Fig. 4. By adding a bridge module between the two branches, the denoised speech and the background sounds share the intermediate information of each layer, and the multi-layer interactive signals improve the accuracy of predicting the speech and the background sounds.

Fig. 4
figure 4

The diagram of the bridge module
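The sketch below shows one plausible realization of the bridge module, reusing the ComplexConv2d sketch from the complex-encoder subsection with a 1×1 kernel. The concatenation-based fusion rule is an assumption, since the paper only specifies the 1×1 complex convolutional structure:

```python
import torch
import torch.nn as nn

class BridgeModule(nn.Module):
    """Sketch of the bridge module: features from the speech branch and the
    background-sound branch are exchanged (concatenated) and integrated with
    1x1 complex convolutions. The fusion rule is an assumption."""
    def __init__(self, channels):
        super().__init__()
        # ComplexConv2d is the sketch defined earlier; kernel size 1x1 as in the text.
        self.fuse_speech = ComplexConv2d(2 * channels, channels, kernel_size=(1, 1),
                                         stride=(1, 1), padding=(0, 0))
        self.fuse_noise = ComplexConv2d(2 * channels, channels, kernel_size=(1, 1),
                                        stride=(1, 1), padding=(0, 0))

    def forward(self, speech, noise):
        # speech / noise are (real, imag) feature pairs from the two branches
        s_r, s_i = speech
        n_r, n_i = noise
        cat_r = torch.cat([s_r, n_r], dim=1)
        cat_i = torch.cat([s_i, n_i], dim=1)
        speech_out = self.fuse_speech(cat_r, cat_i)  # information returned to the speech branch
        noise_out = self.fuse_noise(cat_r, cat_i)    # information returned to the noise branch
        return speech_out, noise_out
```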

Voice conversion module

The voice conversion module uses VQMIVC as the backbone network and includes three encoders and one decoder, as shown in Fig. 5. The source speech is preprocessed into a Mel spectrogram and the log F0 (denoted by lf0 throughout the paper), which are fed into the content encoder, the pitch encoder, and the speaker encoder to obtain the content representation, the pitch representation, and the speaker embedding, respectively. Mutual information is used to measure the correlation between the encoder outputs, and disentanglement of content, pitch, and speaker identity is achieved by minimizing their mutual information. The decoder takes the encoder outputs as input and generates a reconstructed Mel spectrogram. The reconstructed Mel spectrogram is then used for a second encoding round, as shown by the dashed line in Fig. 5.

Fig. 5
figure 5

The structure of the module for voice conversion
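The overall data flow of the conversion module, including the second encoding round used for the cycle loss, can be sketched as follows. The encoder and decoder internals are placeholders rather than the actual VQMIVC components, and reusing the original lf0 for the second round is an assumption:

```python
import torch.nn as nn

class VoiceConversionModule(nn.Module):
    """Skeleton of the VQMIVC-style conversion module: three encoders, one decoder,
    and a second encoding round on the reconstructed Mel spectrogram (dashed line
    in Fig. 5). Illustrative sketch with placeholder sub-modules."""
    def __init__(self, content_enc, pitch_enc, speaker_enc, decoder):
        super().__init__()
        self.content_enc, self.pitch_enc = content_enc, pitch_enc
        self.speaker_enc, self.decoder = speaker_enc, decoder

    def forward(self, mel, lf0):
        z_c = self.content_enc(mel)           # content representation
        z_p = self.pitch_enc(lf0)             # pitch representation
        z_s = self.speaker_enc(mel)           # speaker embedding
        mel_rec = self.decoder(z_c, z_p, z_s)  # first reconstruction M'
        # Second round: re-encode the reconstruction for the cycle-consistency loss.
        z_c2 = self.content_enc(mel_rec)
        z_p2 = self.pitch_enc(lf0)             # assumed: same lf0 input reused
        z_s2 = self.speaker_enc(mel_rec)
        mel_cyc = self.decoder(z_c2, z_p2, z_s2)  # second reconstruction M_hat
        return mel_rec, mel_cyc, (z_c, z_p, z_s)
```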

In this paper, a cycle loss is introduced and combined with the mutual information loss to further improve feature disentanglement. The idea of the cycle loss was proposed by CycleGAN-VC [28]: the model converts the source speech into the target speech, converts the result back to the style of the source speech, and computes a cycle-consistency loss from the difference between the two conversions, which adjusts the model parameters and improves the quality of voice conversion. In this paper, the reconstructed Mel spectrogram obtained from the first encoder–decoder round is fed into the encoders again, and the cycle-consistency loss between the two reconstructed Mel spectrograms promotes information disentanglement.

Loss function

The loss function used to train the model consists of two parts: the loss of the speech separation module and the loss of the voice conversion module, as shown in Eq. (5).

$$ L = {L_{SS}} + {L_{VC}} $$
(5)

The training loss for the speech separation module

The training loss of the speech separation module includes two parts: the loss of the speech branch and the loss of the noise branch. The speech branch computes the L1 loss between the Mel spectrograms of the estimated speech and the clean speech, while the noise branch computes the L1 loss between the Mel spectrograms of the estimated noise and the actual noise. The two terms are then added together, as shown in Eq. (6).

$$ {L_{SS}} = \frac{1}{T}\sum\limits_{t = 1}^T {\left( {{{\left\| {C{D_E}\left( {{M_{t\_n}}} \right) - {M_{t\_c}}} \right\|}_1} + {{\left\| {C{D_N}\left( {{M_{t\_n}}} \right) - \left( {{M_{t\_n}} - {M_{t\_c}}} \right)} \right\|}_1}} \right)} $$
(6)

where \(C{D_E}\) and \(C{D_N}\) denote the speech-branch decoder and the noise-branch decoder, respectively, \({M_{t\_n}}\) denotes the Mel spectrogram of the noisy speech, \({M_{t\_c}}\) denotes the Mel spectrogram of the clean speech, and \(T\) denotes the number of utterances. The final value of \({L_{SS}}\) is the average of the losses over the \(T\) utterances.
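A minimal PyTorch sketch of Eq. (6), assuming the decoder outputs are Mel spectrograms batched over utterances so that the 1/T average is handled by the batch mean:

```python
import torch
import torch.nn.functional as F

def speech_separation_loss(est_speech_mel, est_noise_mel, noisy_mel, clean_mel):
    """L_SS from Eq. (6): L1 loss on the speech branch plus L1 loss on the noise
    branch, where the noise target is the (noisy - clean) Mel spectrogram.
    Illustrative sketch; averaging over utterances is done by the batch mean."""
    speech_loss = F.l1_loss(est_speech_mel, clean_mel)            # || CD_E(M_n) - M_c ||_1
    noise_loss = F.l1_loss(est_noise_mel, noisy_mel - clean_mel)  # || CD_N(M_n) - (M_n - M_c) ||_1
    return speech_loss + noise_loss
```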

The training loss for the voice conversion module

The training loss of the voice conversion module consists of the vector quantization loss \({L_{VQ}}\), the contrastive predictive coding loss \({L_{CPC}}\), the mutual information loss \({L_{MI}}\), the reconstruction loss \({L_{REC}}\), and the cycle loss \({L_{cyc}}\), where \({L_{VQ}}\), \({L_{CPC}}\), \({L_{MI}}\), and \({L_{REC}}\) follow the definitions in VQMIVC; please refer to [24] for details. The cycle loss is shown in Eq. (7).

$$ {L_{cyc}} = \frac{1}{T}\sum\limits_{t = 1}^T {{{\left\| {{M_t}^{\prime} - {{\hat M}_t}} \right\|}_1}}$$
(7)

where \({M_t}^{\prime}\) denotes the Mel spectrogram predicted after the first round of encoding and decoding, and \({\hat M_t}\) denotes the Mel spectrogram predicted in the second round, which uses the first-round estimate as its input.

The training loss of the voice conversion module is shown in Eq. (8).

$${L_{VC}} = {L_{VQ}} + {L_{CPC}} + \alpha {L_{MI}} + \beta {L_{cyc}} + \gamma {L_{REC}} $$
(8)

where \(\alpha \geqslant 0\), \(\beta \geqslant 0\), and \(\gamma \geqslant 0\) are the weights of the objective function. In the experiments of this paper, these hyperparameters are set to \(\alpha = 10^{-2}\), \(\beta = 5\), and \(\gamma = 10\). The value of \(\alpha\) is consistent with the optimal value obtained experimentally in [24].
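Putting Eqs. (7) and (8) together, the conversion loss can be sketched as follows. \({L_{VQ}}\), \({L_{CPC}}\), and \({L_{MI}}\) are assumed to be computed exactly as in VQMIVC [24] and are passed in as precomputed terms, and the reconstruction target is assumed to be the input Mel spectrogram:

```python
import torch.nn.functional as F

def vc_loss(mel_rec, mel_cyc, target_mel, l_vq, l_cpc, l_mi,
            alpha=1e-2, beta=5.0, gamma=10.0):
    """Eq. (8) with the weights used in this paper. l_vq, l_cpc and l_mi are the
    VQMIVC losses computed elsewhere; only the cycle and reconstruction terms
    are spelled out here. Illustrative sketch."""
    l_cyc = F.l1_loss(mel_cyc, mel_rec)     # Eq. (7): L1 between the two reconstructions
    l_rec = F.l1_loss(mel_rec, target_mel)  # reconstruction loss (target assumed to be the input Mel)
    return l_vq + l_cpc + alpha * l_mi + beta * l_cyc + gamma * l_rec
```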

Experimental settings

Datasets

The proposed model was trained and tested on a mixture of the MUSDB18-train dataset [29] and the CSTR-VCTK dataset [30]. The MUSDB18-train dataset contains 100 complete music tracks of different genres, each with separate drum, bass, vocal, and other stems. The CSTR-VCTK dataset contains nearly 44 h of clean speech clips from 109 English speakers with different accents, each clip lasting 4–9 s. Audio segments are randomly extracted from the MUSDB18-train dataset as background sounds, cropped to the same length as the clean speech, and added to the CSTR-VCTK utterances at signal-to-noise ratios of -5 dB, 0 dB, 5 dB, 10 dB, and 20 dB. The mixed speech clips are divided into training and testing sets at a ratio of 9:1, and 10% of the speech of each speaker in the training set is held out as a validation set for cross-validation during training. Since there is no overlap between the speakers in the training and testing sets, the testing set can be used to verify the zero-shot conversion performance of the proposed model. In the acoustic feature extraction stage, all audio files are downsampled to 16 kHz, and the 80-dim Mel spectrogram and log F0 are extracted from the clean and noisy speech using Librosa, with a frame length of 25 ms and a frame shift of 10 ms. During testing, two male and two female speakers are randomly selected, and conversion is performed in both directions for each pair. Forty utterances per speaker pair are selected for subjective and objective evaluation.
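As an illustration of the data preparation described above, the following sketch mixes a background clip into a clean utterance at a target SNR and extracts the 80-dim Mel spectrogram with Librosa. The file names are hypothetical and the exact mixing procedure is an assumption:

```python
import numpy as np
import librosa

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise clip into clean speech at a target SNR (in dB).
    The noise is tiled/cropped to the speech length and rescaled."""
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Feature settings from the paper: 16 kHz audio, 80-dim Mel spectrogram,
# 25 ms frames (400 samples) with a 10 ms shift (160 samples).
clean, _ = librosa.load("vctk_utterance.wav", sr=16000)  # hypothetical file names
noise, _ = librosa.load("musdb_clip.wav", sr=16000)
noisy = mix_at_snr(clean, noise, snr_db=5)
mel = librosa.feature.melspectrogram(y=noisy, sr=16000, n_fft=400,
                                     hop_length=160, win_length=400, n_mels=80)
log_mel = np.log(mel + 1e-9)
```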

Training settings

The network of the speech separation module includes one encoder and two decoders, where the channel numbers of the encoder are set to {16, 32, 64, 128, 256}. Because skip connections are used, the channel numbers of the decoders are double those of the corresponding encoder layers. The encoder and decoders each consist of six complex convolutional modules, with a kernel size of [3, 3], a stride of [2, 1], and a padding of [1, 2]. The real and imaginary parts of the complex LSTM each have 1024 hidden units. The hyperparameters of the voice conversion module follow the VQMIVC model. The network is trained with the Adam optimizer: the learning rate increases from 1e-6 to 1e-3 over the first 15 epochs and is halved every 100 epochs after epoch 200, for a total of 500 epochs with a batch size of 8.
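The learning-rate schedule can be expressed, for example, with a per-epoch LambdaLR in PyTorch. The linear warm-up shape and the exact epoch at which the first halving occurs are assumptions, since the paper only states the end points:

```python
import torch

def lr_lambda(epoch, base_lr=1e-3, warmup_start=1e-6, warmup_epochs=15):
    """Multiplier of base_lr: linear warm-up from 1e-6 to 1e-3 over the first
    15 epochs, constant until epoch 200, then halved every 100 epochs.
    Interpolation details are assumptions."""
    if epoch < warmup_epochs:
        lr = warmup_start + (base_lr - warmup_start) * epoch / warmup_epochs
    elif epoch < 200:
        lr = base_lr
    else:
        lr = base_lr * 0.5 ** ((epoch - 200) // 100 + 1)
    return lr / base_lr

model = torch.nn.Linear(80, 80)  # placeholder for the full separation + conversion model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(500):
    # ... one training epoch with batch size 8 ...
    scheduler.step()
```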

Baselines

Six advanced models are selected as baselines in the experiments. The first four, AutoVC, VQMIVC, VAE-CN-C, and SEGAN-VC, are trained on clean/noisy speech datasets and generate clean converted speech in the conversion phase; their results are compared with those of our model without added background sounds. The latter two, MULTI-TASK-VC and the upper-bound model, are trained only on noisy speech datasets and generate converted speech with background sounds; their results are compared with those of our model with retained background sounds. The descriptions and specific parameter configurations of the models are listed as follows.

AutoVC [31]: This model is the first to implement zero-shot voice conversion by decoupling speech content and speaker identity through carefully designed information bottlenecks. In the experiments, we set the bottleneck dimension to 32, and the settings of the remaining parameters are the same as those in the original paper.

VQMIVC [24]: Mutual information is used as a correlation measure to improve the disentanglement of speech content, speaker identity, and pitch. A 256-dim Mel spectrum and 1-dim log F0 are extracted from clean speech as the inputs of the model.

VAE-CN-C [17]: This is the best-performing model among those proposed in [12]. Based on the AdaINVC model [32], the content encoder and the speaker encoder are modified and a domain adversarial training module is introduced, so that the model can perform noise-robust voice conversion. The input of the model is clean speech and its noisy counterpart, and the output is clean speech with the target speaker’s characteristics.

SEGAN-VC [33]: SEGAN operates on speech in the waveform domain and proposes an end-to-end speech enhancement framework based on a generative adversarial model. In this baseline, the SEGAN model is cascaded with the VC module proposed in this paper.

MULTI-TASK-VC [21]: Yao et al. propose an end-to-end framework via multi-task learning that sequentially stacks a source separation module, a bottleneck feature extraction module, and a VC module. The source separation task is performed by a deep complex convolution recurrent network optimized with the power-law compressed phase-aware (PLCPA) loss and an asymmetric loss, and a unified reconstruction loss is used to train the model to improve the quality of voice conversion.

Upper bound model: In this model, the high-performing VQMIVC model is used, the clean source speaker speech is converted, and the conversion result is superimposed with the original background sounds. The original background sounds are randomly chosen from the MUSDB18-train dataset and matched with the background noise added to the source speaker's speech.

Evaluation metrics

In this paper, the following subjective and objective metrics are used to evaluate the voice conversion methods.

Subjective evaluation metric

The Mean Opinion Score (MOS) [34] is used as the subjective evaluation metric for the naturalness and similarity of the converted speech. Naturalness mainly evaluates the fluency and completeness of the content of the converted speech, while similarity compares the characteristics of the converted speech with the target speaker’s voice. MOS is rated on a five-point scale, with higher scores indicating higher speech quality and clearer speech content. In these experiments, ten test subjects with a background in speech signal processing and a sensitivity to sound rate the experimental results. The resulting MOS score is a value between 1 and 5, obtained by averaging the subjective ratings of all test subjects for each group of experimental results.

Objective evaluation metric

Four objective evaluation metrics are selected: Mel Cepstral Distortion (MCD), Pearson Correlation Coefficient (PCC) [35], Perceptual Evaluation of Speech Quality (PESQ) [36], and Short-Time Objective Intelligibility (STOI) [37]. MCD measures the difference between two acoustic feature sequences by computing the distance, typically Euclidean, between the Mel Frequency Cepstral Coefficients (MFCCs) of the two speech signals. In general, the smaller the MCD, the higher the similarity between the two speech signals. The calculation is shown in Eq. (9).

$${\text{MCD = }}\frac{10}{{\ln 10}}\sqrt {2\sum\limits_{n = 1}^N {{{\left( {{m_{c,n}} - {m_{t,n}}} \right)}^2}} } $$
(9)

where N denotes the number of Mel-cepstral coefficients, and \({m_{c,n}}\) and \({m_{t,n}}\) denote the n-th Mel coefficient of the converted speech and of the target speech, respectively.
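For reference, Eq. (9) can be computed per frame and averaged over an utterance as follows; frame alignment between the converted and target sequences (e.g., by dynamic time warping) is assumed to have been done beforehand:

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """MCD as in Eq. (9), averaged over aligned frames.
    Inputs are (frames, N) arrays of Mel-cepstral coefficients."""
    diff = mc_converted - mc_target
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```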

The Pearson correlation coefficient between the F0 of the source speech and that of the converted speech effectively evaluates changes in speech content and intonation; a higher F0-PCC indicates higher consistency of F0 variation between the converted speech and the source speech. The PESQ algorithm computes a speech quality score from the differences between the original speech sample and the processed sample. The PESQ score ranges from -0.5 to 4.5, and a higher value indicates better auditory quality of the tested speech.

STOI measures how intelligible a speech signal remains under interference. Compared with other metrics, STOI pays more attention to the characteristics of human hearing and better simulates the perception of speech by the human auditory system. The STOI score ranges from 0 to 1, and a higher score indicates higher speech intelligibility.
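PESQ and STOI can be computed with the third-party pesq and pystoi Python packages; the sketch below assumes time-aligned, equal-length 16 kHz signals and hypothetical file names:

```python
import librosa
from pesq import pesq       # third-party package: https://pypi.org/project/pesq/
from pystoi import stoi     # third-party package: https://pypi.org/project/pystoi/

# Hypothetical file names; both signals must be time-aligned, equal-length, 16 kHz.
clean, sr = librosa.load("clean_reference.wav", sr=16000)
estimated, _ = librosa.load("estimated_speech.wav", sr=16000)

pesq_score = pesq(sr, clean, estimated, "wb")            # wide-band PESQ, roughly -0.5 to 4.5
stoi_score = stoi(clean, estimated, sr, extended=False)  # STOI, 0 to 1
print(f"PESQ: {pesq_score:.2f}, STOI: {stoi_score:.3f}")
```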

The experimental results were subjectively evaluated by 30 test subjects, including 20 professionals engaged in speech-related research and 10 ordinary testers. Statistical analysis was conducted on thousands of generated utterances for each experiment, and all subjective evaluations are reported with 95% confidence intervals. The objective metrics MCD, F0-PCC, PESQ, and STOI are all averages over the generated speech described above.

Experimental results and analysis

Analysis of subjective evaluation results

In this group of experiments, except for the first two models, which are trained on the clean speech dataset, the other models are trained on the noisy speech dataset with a signal-to-noise ratio of 5 dB. During conversion, the source speaker's speech is noisy with a signal-to-noise ratio of 5 dB, while the target speaker's speech is clean. Two male and two female speakers are selected, and male–male, male–female, female–female, and female–male conversions are performed. Twenty pairs of utterances are randomly selected from each group for subjective evaluation, and the reported results are averaged over these 20 pairs, as shown in Table 1.

Table 1 Subjective evaluation results of the methods

“Ours_c” in Table 1 denotes the proposed method when the conversion result is clean speech; it is compared with the first four models in Table 1. “Ours_n” denotes the proposed method when the converted result preserves the background sounds; it is compared with the last two models in the table. As shown in Table 1, the performance of the voice conversion models trained on the clean speech dataset (the first two models in the table) degrades significantly due to the noisy source speech. Since the SEGAN-VC model uses the same VC module as the one proposed in this paper, its results indirectly reflect the superiority of the proposed dual-branch speech separation module, which effectively separates noise and speech. Compared with the advanced noise-robust voice conversion model MULTI-TASK-VC, the proposed method improves the naturalness and similarity of the converted speech by 0.08 and 0.14, respectively, indicating that the cycle loss used in the conversion stage strengthens feature disentanglement. The results also show that same-gender conversion achieves better quality than cross-gender conversion.

Analysis of objective evaluation results

The proposed model and the six baseline models are evaluated with the objective metrics MCD, F0-PCC, PESQ, and STOI. In this group of objective evaluations, MCD and F0-PCC mainly measure the difference between the converted speech and the ground truth. The ground truth here is obtained by adding the same background sound as in the noisy source speaker's speech to the clean target speaker's speech, whose content is identical to that of the noisy source speech. PESQ and STOI evaluate the speech separation stage by estimating the difference between the estimated speech and the clean speech. Since “Ours_c” and “Ours_n” use the same speech separation module, their PESQ and STOI results are identical. The experimental parameter settings and training data are the same as in Sect. “Analysis of subjective evaluation results”, and the objective evaluation results are shown in Table 2.

Table 2 Objective evaluation results of the methods

As shown in Table 2, the two variants of the proposed framework outperform the baseline models in both MCD and F0-PCC, objectively demonstrating the good voice conversion performance of the proposed method. The objective metrics of the “Ours_n” model are slightly lower than those of “Ours_c”, which is understandable: the added background sounds mask some details of the clean speech and thus deteriorate the metrics. In addition, compared with the MULTI-TASK-VC method, which also uses the DCCRN framework as the backbone of its speech separation module, the proposed model achieves improvements in both PESQ and STOI. The experimental results demonstrate that the proposed dual-decoder speech separation module, by introducing a bridge module, effectively improves the separation accuracy between clean speech and background sounds and provides high-quality speech input for the downstream voice conversion task, thereby improving the overall performance of the model.

Ablation studies

To further verify the effectiveness of each component in the proposed framework, ablation experiments are conducted. Three sets of ablations are set up: without the cycle-consistency loss (w/o \({L_{cyc}}\)), without the mutual information loss (w/o \({L_{MI}}\)), and without the noise branch of the speech separation module (w/o \(C{D_N}\)). The remaining experimental settings are the same as in Sect. “Analysis of subjective evaluation results”. The results are shown in Table 3.

Table 3 Results of the subjective and objective evaluation in the ablation studies

Experimental results show that the full model outperforms the three ablated models in both subjective and objective metrics. Among them, the w/o \({L_{MI}}\) model achieves the lowest naturalness and similarity of the three ablations, indicating that the mutual information loss effectively disentangles speech content, pitch, and speaker identity. The \({L_{cyc}}\) loss further encourages information disentanglement, avoids speech distortion caused by information leakage, and improves the accuracy of the reconstructed speech. When the noise branch of the speech separation module is removed, PESQ, STOI, and related metrics deteriorate severely, although naturalness and similarity are slightly higher than those of w/o \({L_{MI}}\), indicating that without the noise branch the model focuses more on training the conversion module. The ablation experiments confirm the effectiveness of the proposed model for background-sound-controllable, noise-robust voice conversion.

Results with unseen noises

This section experimentally verifies the conversion performance of the proposed model under different signal-to-noise ratios and under unseen noise environments. The unseen noise is taken from the NOISEX-92 noise corpus [38]: factory1, hfchannel, and pink noise are selected and randomly added to the clean speech, with the SNRs set in the same way as for the seen noise. The experimental results are shown in Table 4.

Table 4 Results of the subjective and objective evaluation of the converted speech with different SNRs and unseen noises

This experiment is conducted in the scenario where the source speaker's speech is noisy and the target speaker's speech is clean. To facilitate comparison of the results, the converted speech is cleaned by removing the background noise. First, the model is evaluated with noises seen in the training data. As the signal-to-noise ratio increases, both the naturalness and MCD metrics improve, since extracting clean speech from high-SNR signals is easier than from low-SNR signals. Second, the performance of the model is evaluated under unseen noise. The results show that, under the same SNR conditions, the subjective and objective metrics of the converted speech with unseen noise are slightly lower than those with seen noise. However, compared with the baseline models in Sects. “Analysis of subjective evaluation results” and “Analysis of objective evaluation results”, the proposed model still performs better.

Visualization

To intuitively demonstrate the performance of the speech separation module and the voice conversion module of the proposed model, this section uses speech waveforms and spectrograms to visually analyze speech examples from the two modules. The visualization results for the speech separation module are shown in Fig. 6.

Fig. 6
figure 6

Visualization of the results in the speech separation module

Figure 6A–E shows waveforms and their corresponding spectrograms from male speakers, and Fig. 6F–J shows those from female speakers. From top to bottom, they show the original noisy speech, the clean speech estimated by the dual-branch speech separation model, the ground-truth clean speech, the estimated background noise, and the ground-truth background noise, respectively. From the visualization, both the estimated clean speech and the estimated background noise are highly similar to the ground truth in both waveform and spectrogram, which shows that the proposed dual-branch speech separation module effectively separates clean speech from background noise.

Figure 7 shows the spectrograms of speech examples from the voice conversion module. In the experiments, noise is added to the source speaker’s speech at an SNR of 5 dB, while the target speaker’s speech is clean. The figure shows two conversion scenarios, male-to-female (first row) and female-to-male (second row). From left to right, the panels show the source speaker’s speech, the target speaker’s speech, the converted speech with background noise removed, and the converted speech with background noise retained. The regions circled with black boxes show that the proposed model performs well in the details of pitch conversion. Comparing Fig. 7a and c shows that, while performing speaker identity conversion, the proposed model keeps the speech content consistent before and after conversion. These results demonstrate that combining the cycle-consistency loss and the mutual information loss effectively improves the conversion and significantly improves the quality of the converted speech.

Fig. 7
figure 7

Visualization of speech examples in the voice conversion module

Conclusion

This paper proposes a noise-robust voice conversion model with controllable background sound, which can flexibly retain or remove background sounds and obtain high-quality converted speech. The model includes an optimized speech separation module and a voice conversion module. The speech separation module adopts a dual-decoder design that decodes the denoised speech and the noise, respectively. The bridge module promotes the information interaction between the denoised speech and the background sound, alleviates gradient problems during training, and improves the accuracy of speech and background-sound prediction. The voice conversion module adopts a multi-encoder structure, introduces the cycle loss, and combines it with the mutual information loss to improve feature disentanglement. In addition, jointly training the speech separation module and the voice conversion module alleviates the speech distortion caused by the training mismatch between the modules. Subjective and objective evaluation results show that the converted speech obtained by the proposed method has higher naturalness and speaker similarity than the baseline methods, achieving results comparable with a voice conversion model trained on a clean dataset. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.

However, our work still has limitations. Currently, we cannot achieve cross-lingual voice conversion because the feature representations extracted during model training are not expressive enough. In future research, we will focus on exploring more expressive feature representations to achieve cross-lingual voice conversion, which will further improve the practicality of the model.