WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion

Voice conversion (VC) is the task of converting the speech of a source speaker to the voice style of a target speaker while preserving the linguistic information of the source speech. Existing VC methods require a separate vocoder because they output mel-spectrograms. Therefore, VC performance varies depending on the vocoder, and noisy speech can be generated due to problems such as train-test mismatch. In this paper, we propose a speech and fundamental frequency consistent raw audio voice conversion method called WaveVC. WaveVC does not require a separate vocoder because it performs VC directly on raw audio, so it is unaffected by vocoder performance. In addition, WaveVC uses a speech loss and an F0 loss to preserve content information and generate F0-consistent results. WaveVC shows high performance in both many-to-many and any-to-any VC, and the converted samples are available online.


Introduction
Style transfer is applied in various fields, including vision tasks [1]. In the field of speech signal processing in particular, voice conversion (VC) is the task of converting the speech of a source speaker to the voice style of a target speaker while preserving the linguistic information of the source speech. VC has the potential to be utilized in various fields such as movie dubbing, singing voice conversion [2], and speaking aids [3]. Conventional VC methods generally require parallel data, i.e., utterances of the same sentence by different speakers. However, obtaining such parallel data is very difficult in practice, which is a clear limitation. For this reason, various methods using non-parallel data for VC have recently been explored.
VC methods using non-parallel data can be roughly divided into autoencoder-based and GAN-based [4,5] approaches. Most recent autoencoder-based methods are composed of a content encoder, a speaker encoder, and a decoder. These methods are trained to reconstruct the input data and perform conversion in a zero-shot manner. The content encoder erases the style of the source speaker while keeping the linguistic information in the utterance. In contrast, the speaker encoder extracts only the style of the target speaker regardless of the utterance. AutoVC [6] applied zero-shot learning to VC for the first time and can handle unseen speakers not used for training. AdaIN-VC [7] does not simply concatenate the style of the target speaker extracted by the speaker encoder but injects the style through adaptive instance normalization [8,9]. AutoVC-F0 [10] uses the F0 information of the source speaker to generate a natural-sounding F0. Again-VC [11] uses only one encoder, without a separate speaker encoder, unlike other autoencoder-based methods. However, these zero-shot learning-based methods have the limitation that the bottleneck structure must be carefully designed to disentangle content and style well.
Recently, GAN-based methods such as StarGANv2-VC [12] have shown high-quality VC performance using adversarial training and perceptual losses. However, StarGANv2-VC has a crucial limitation: it cannot handle unseen target speakers. Therefore, many efforts are being made on GAN-based any-to-any VC [13-15] that can handle both unseen source speakers and unseen target speakers. Since both the aforementioned autoencoder-based methods and the GAN-based methods output acoustic features such as mel-spectrograms, vocoders such as MelGAN [16], Parallel WaveGAN [17], and HiFi-GAN [18] are needed to convert the mel-spectrogram into a raw waveform. Using a vocoder can cause problems such as noisy speech generation due to issues like train-test mismatch [19]. As a result, the quality of the generated waveform depends on the vocoder. Therefore, models that synthesize speech without a vocoder, such as WaveNet [20] and Parallel WaveNet [21], have been studied. In the VC task, NVC-Net [22] avoids the vocoder by directly generating raw audio, but this does not guarantee high VC performance.
In this paper, we propose a speech and fundamental frequency consistent raw audio VC method called WaveVC. Because WaveVC performs VC directly on raw audio, it is not affected by vocoder performance. WaveVC employs a speech loss and an F0 loss to preserve the content of the source speech and generate F0-consistent speech. In addition, the F0 feature is concatenated with the content embedding vector to generate natural-sounding speech. Our main contributions are summarized as follows: (1) Because WaveVC performs VC directly on raw audio, no additional vocoder is required. (2) WaveVC uses two additional losses, a speech consistency loss and an F0 consistency loss, to preserve content information and generate fundamental frequency consistent results, and F0 information is used to create natural voices. (3) WaveVC shows high VC performance in both many-to-many and any-to-any VC, and the converted samples are available on the web demo page.

WaveVC
WaveVC mainly consists of a content encoder E_c, a decoder G, a speaker encoder E_s, three discriminators D_i for i = 1, 2, 3 that are used for different temporal resolutions, and an F0 extraction network F. The overall architecture of WaveVC is shown in Figure 1.
Content encoder. Since the input of the content encoder E_c is a raw audio waveform, the content encoder consists of one input 1D convolutional layer, four downsampling blocks, and two subsequent 1D convolutional layers with GELU activation [23]. Each downsampling block consists of four residual blocks and a 1D convolutional layer. Each residual block has a 1D dilated convolutional layer with a gated-tanh nonlinearity and a residual skip connection. Each downsampling block reduces the temporal resolution of its input by a factor of four. In total, the content encoder E_c reduces the temporal resolution of the source waveform x by a factor of 256, and L2 normalization is applied to the content embedding vector.
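A minimal PyTorch sketch of this block structure is given below. The channel widths, kernel sizes, dilation rates, and the 4-channel output are assumptions; only the block layout and the overall 256x downsampling follow the description above.

```python
# Sketch of the content encoder structure (hypothetical sizes, 256x downsampling).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilated = nn.Conv1d(channels, 2 * channels, kernel_size=3,
                                 dilation=dilation, padding=dilation)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # Gated-tanh nonlinearity with a residual skip connection.
        a, b = self.dilated(x).chunk(2, dim=1)
        return x + self.out(torch.tanh(a) * torch.sigmoid(b))

class DownBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.res = nn.Sequential(*[ResBlock(in_ch, 3 ** i) for i in range(4)])
        # Stride-4 convolution: each block lowers the temporal resolution 4x.
        self.down = nn.Conv1d(in_ch, out_ch, kernel_size=8, stride=4, padding=2)

    def forward(self, x):
        return self.down(self.res(x))

class ContentEncoder(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256, 512)):
        super().__init__()
        self.inp = nn.Conv1d(1, channels[0], kernel_size=7, padding=3)
        self.blocks = nn.Sequential(*[DownBlock(channels[i], channels[i + 1])
                                      for i in range(4)])
        self.post = nn.Sequential(
            nn.GELU(), nn.Conv1d(channels[-1], channels[-1], kernel_size=7, padding=3),
            nn.GELU(), nn.Conv1d(channels[-1], 4, kernel_size=7, padding=3))

    def forward(self, wav):                        # wav: (B, 1, T), T divisible by 256
        c = self.post(self.blocks(self.inp(wav)))  # (B, 4, T / 256)
        return c / (c.norm(dim=1, keepdim=True) + 1e-8)  # L2-normalized content embedding
```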
Speaker encoder. Unlike the content encoder E_c, the speaker encoder E_s uses a mel-spectrogram as the input. The speaker encoder E_s consists of five residual blocks and a global average pooling layer, and a 512-dimensional vector is produced regardless of the input length by removing the temporal dimension. A mean vector µ and a standard deviation vector σ are produced by separate fully-connected layers. Finally, the speaker embedding vector z is obtained with the reparameterization trick [24] as z ∼ µ + σ ⊙ ϵ, where ϵ ∼ N(0, I).

F0 extraction network. A pre-trained JDC network [25], composed of convolutional layers and a bidirectional LSTM, is used as the F0 extraction network F to extract the fundamental frequency information. The JDC network uses the mel-spectrogram as the input and outputs the fundamental frequency. Only the convolutional part F_conv of the JDC network is used for F0 feature extraction, and the F0 feature f is defined as f = F_conv(x).
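Below is a minimal sketch of the reparameterization step described above. The feature dimension and layer names are assumptions; only the 512-dimensional embedding and the z = µ + σ ⊙ ϵ sampling follow the text.

```python
# Sketch of the speaker-embedding reparameterization (hypothetical layer names).
import torch
import torch.nn as nn

class SpeakerHead(nn.Module):
    def __init__(self, feat_dim=512, emb_dim=512):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, emb_dim)
        self.to_logvar = nn.Linear(feat_dim, emb_dim)

    def forward(self, pooled):                 # pooled: (B, feat_dim) after global average pooling
        mu = self.to_mu(pooled)
        sigma = torch.exp(0.5 * self.to_logvar(pooled))
        eps = torch.randn_like(sigma)          # eps ~ N(0, I)
        z = mu + sigma * eps                   # reparameterization trick
        return z, mu, sigma
```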
Decoder. The decoder G is constructed as an inversion of the content encoder E_c. The decoder G takes as input the concatenation of the content embedding vector c_s = E_c(x_s) and the F0 feature f_s = F_conv(x_s) of the source speech x_s. The decoder G consists of four upsampling blocks instead of downsampling blocks. Each upsampling block contains a 1D transposed convolutional layer and four residual blocks. The 1D transposed convolutional layer increases the temporal resolution of the upsampling block's input by a factor of four. Also, unlike the residual blocks of the content encoder E_c, the residual blocks of the decoder G use the speaker embedding vector as a conditional input; z_s and z_t are used as the speaker embedding vectors of the source speech and the target speech, respectively.
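A minimal sketch of the decoder's conditional upsampling path is given below. The conditioning mechanism (an additive projection of the speaker embedding) and all tensor sizes are assumptions; the stride-4 transposed convolution and the four conditional residual blocks per upsampling block follow the description above.

```python
# Sketch of one decoder upsampling block with speaker conditioning (hypothetical sizes).
import torch
import torch.nn as nn

class CondResBlock(nn.Module):
    def __init__(self, channels, emb_dim=512):
        super().__init__()
        self.cond = nn.Linear(emb_dim, 2 * channels)   # speaker embedding as conditional input
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=3, padding=1)
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, h, z):
        a, b = (self.conv(h) + self.cond(z).unsqueeze(-1)).chunk(2, dim=1)
        return h + self.out(torch.tanh(a) * torch.sigmoid(b))

class UpBlock(nn.Module):
    def __init__(self, in_ch, out_ch, emb_dim=512):
        super().__init__()
        # Stride-4 transposed convolution: each block raises the temporal resolution 4x.
        self.up = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=8, stride=4, padding=2)
        self.res = nn.ModuleList([CondResBlock(out_ch, emb_dim) for _ in range(4)])

    def forward(self, h, z):
        h = self.up(h)
        for block in self.res:
            h = block(h, z)
        return h

# Decoder input: concatenate the content embedding c_s with the F0 feature f_s along the
# channel dimension, then pass through four UpBlocks conditioned on the speaker embedding z.
```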
Discriminator. As in MelGAN [16], the three discriminators D_i for i = 1, 2, 3 use mel-spectrograms computed with three different window sizes as the input. The window sizes are set to 1024, 512, and 256. The output vector size of each discriminator equals the number of speakers in the training data. Each discriminator then determines, by binary classification per speaker, whether the input speech belongs to the corresponding speaker.
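The sketch below illustrates how the three discriminator inputs could be computed. The hop lengths and the number of mel bins are assumptions; the window sizes of 1024, 512, and 256 come from the text, and the 24 kHz sampling rate from the implementation details.

```python
# Sketch of the multi-resolution mel-spectrogram inputs for the discriminators.
import torch
import torchaudio

def multi_resolution_mels(wav, sample_rate=24000, win_lengths=(1024, 512, 256)):
    """Return one log mel-spectrogram per discriminator window size."""
    mels = []
    for win in win_lengths:
        transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=win, win_length=win,
            hop_length=win // 4, n_mels=80)
        mels.append(torch.log(transform(wav) + 1e-5))
    return mels  # each discriminator D_i receives one of these as input
```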

Training objectives
WaveVC aims to accomplish speech and F0 consistent raw audio VC. The losses used to achieve this goal are explained below. In the following, the reconstructed speech is denoted x̂ = G(c_s, z_s, f_s) and the converted speech is denoted x̃ = G(c_s, z_t, f_s).
Adversarial loss. Let y_s and y_t be the speaker ID labels of the source speech x_s and the target speech x_t, respectively. Each discriminator D_i is trained through binary classification to distinguish whether an input is real speech of the corresponding label's speaker. Conversely, the content encoder E_c, the speaker encoder E_s, and the decoder G are trained so that the converted speech is indistinguishable from real speech by the discriminators D_i.

Speech loss. The speech loss is used to ensure that the content of the converted speech is maintained. It is composed of the differences, measured with the L1 norm ∥·∥_1, between the source speech and the converted speech and between the source speech and the reconstructed source speech. Here, A is a pre-trained automatic speech recognition (ASR) network used to extract convolutional speech features from the source speech and the converted speech; a joint CTC-attention VGG-BLSTM network [26] is employed as the pre-trained ASR network.

F0 loss. The F0 loss is used to generate fundamental frequency consistent results. The final output of the F0 extraction network is used as the predicted fundamental frequency. The F0 loss is computed from the differences in the normalized fundamental frequency, denoted F̂(·), between the source speech and the converted speech and between the source speech and the reconstructed source speech.

Reconstruction loss. The reconstruction loss is composed of two parts to improve perceptual quality: a feature matching loss [16] and a spectral loss [22]. The feature matching loss is computed from the feature maps of the discriminators D_i, where D_i^j(·) denotes the j-th feature map of the i-th discriminator and N_D is the number of discriminators. The spectral loss is computed from mel-spectrograms with different FFT sizes, where T(·, w) denotes the transformation to a log mel-spectrogram with an FFT size of w and ∥·∥_2 denotes the L2 norm; w is set to 2048, 1024, and 512. The reconstruction loss combines the feature matching loss and the spectral loss.

Content loss. The content loss encourages the content embedding vector of the converted speech to match the content embedding vector of the source speech.

KL loss. The KL loss [27] is a constraint that pulls the distribution of the speaker embedding vector toward a standard normal distribution, where D_KL(·∥·) denotes the KL divergence and p(z_s|x_s) is the output distribution of E_s(x_s).
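A plausible LaTeX sketch of these loss terms under the above definitions is given below; the exact formulations (in particular the adversarial loss and the notation x̂, x̃, F̂) are assumptions based on the descriptions, not the paper's exact equations.

```latex
% Sketch of the loss terms as described above; formulations are assumptions.
\begin{align}
\mathcal{L}_{adv} &= \mathbb{E}_{x_t}\Big[\sum_i \log D_i(x_t; y_t)\Big]
                   + \mathbb{E}_{x_s, x_t}\Big[\sum_i \log\big(1 - D_i(\tilde{x}; y_t)\big)\Big] \\
\mathcal{L}_{asr} &= \big\lVert A(x_s) - A(\tilde{x}) \big\rVert_1
                   + \big\lVert A(x_s) - A(\hat{x}) \big\rVert_1 \\
\mathcal{L}_{f0}  &= \big\lVert \hat{F}(x_s) - \hat{F}(\tilde{x}) \big\rVert_1
                   + \big\lVert \hat{F}(x_s) - \hat{F}(\hat{x}) \big\rVert_1 \\
\mathcal{L}_{fm}  &= \frac{1}{N_D}\sum_{i}\sum_{j}
                     \big\lVert D_i^{j}(x_s) - D_i^{j}(\hat{x}) \big\rVert_1 \\
\mathcal{L}_{spec}&= \sum_{w \in \{2048,\, 1024,\, 512\}}
                     \big\lVert T(x_s, w) - T(\hat{x}, w) \big\rVert_2 \\
\mathcal{L}_{rec} &= \mathcal{L}_{fm} + \mathcal{L}_{spec} \\
\mathcal{L}_{con} &= \big\lVert E_c(x_s) - E_c(\tilde{x}) \big\rVert_1 \\
\mathcal{L}_{kl}  &= D_{KL}\big(p(z_s \mid x_s)\,\Vert\, \mathcal{N}(0, I)\big)
\end{align}
```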
Full objective. The full generator loss is a weighted combination of the above losses, where λ_adv, λ_asr, λ_f0, λ_rec, λ_con, and λ_kl are hyperparameters for each loss. The discriminators are trained with only the adversarial loss L_adv.
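A plausible form of the full generator objective, assuming a simple weighted sum of the terms above:

```latex
% Assumed weighted-sum form of the full generator objective.
\begin{equation}
\mathcal{L}_{G} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{asr}\mathcal{L}_{asr}
                + \lambda_{f0}\mathcal{L}_{f0} + \lambda_{rec}\mathcal{L}_{rec}
                + \lambda_{con}\mathcal{L}_{con} + \lambda_{kl}\mathcal{L}_{kl}
\end{equation}
```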

Datasets
For a fair performance comparison, the baseline methods and our proposed method are trained with the VCTK dataset [28], which contains 44 hours of utterances from 109 speakers. As in NVC-Net [22], six speakers are held out as unseen speakers. The utterances of the remaining 103 speakers are randomly partitioned into a training set (90%) and a test set (10%).

Implementation details
For training, all data are downsampled to 24 kHz and randomly clipped to 38,540 samples (approximately 1.5 seconds) every epoch; random clipping and random scaling are employed as data augmentation. We train for a total of 500 epochs using the Adam optimizer with β_1 = 0.5, β_2 = 0.9, and a learning rate of 0.0001. The hyperparameters of the full loss are set to λ_adv = 1, λ_asr = 5, λ_f0 = 2.5, λ_rec = 10, λ_con = 10, and λ_kl = 0.02. The pre-trained networks mentioned in StarGANv2-VC [12] are employed as the F0 extraction network and the ASR network. AdaIN-VC [7], Again-VC [11], and NVC-Net [22] are employed as baseline methods for comparison with WaveVC. AdaIN-VC and Again-VC are trained with the dataset described in Section 3.1 using the official code. Unlike WaveVC and the other baseline methods, Again-VC is trained on the dataset downsampled to 22,050 Hz, as specified in its reference. NVC-Net is reimplemented and trained with PyTorch [29].
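A minimal sketch of the optimizer and loss-weight configuration described above; the generator module below is only a placeholder standing in for E_c, E_s, and G.

```python
# Sketch of the training configuration (placeholder generator module).
import torch
import torch.nn as nn

loss_weights = {"adv": 1.0, "asr": 5.0, "f0": 2.5, "rec": 10.0, "con": 10.0, "kl": 0.02}

generator = nn.Conv1d(1, 1, kernel_size=1)   # placeholder for the actual networks
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))
```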

Subjective quality assessment
A mean opinion score (MOS) evaluation is conducted on naturalness and similarity metrics to evaluate VC performance. The naturalness metric is scored from 1 to 5 by evaluating whether the converted speech contains noise and distortion. The similarity metric is scored from 1 to 5 according to how similar the converted voice is to the target speaker. The MOS evaluation is performed on 20 samples each for the seen-to-seen and unseen-to-unseen cases by a total of 20 participants. The MOS results are summarized in Table 1.
AdaIN-VC and Again-VC not only convert poorly but also sometimes fail to generate data; they appear unable to cover the large number of speakers seen during training. WaveVC shows a 0.72 to 0.8 higher naturalness score and a 0.46 to 0.68 higher similarity score than NVC-Net in the seen-to-seen case. In particular, WaveVC's F2M and F2F conversions show naturalness scores similar to the ground truth.
In the unseen-to-unseen case, WaveVC achieves a 0.44 to 0.80 higher naturalness score and a 0.34 to 0.56 higher similarity score than NVC-Net. These scores indicate that WaveVC performs speech and fundamental frequency consistent VC. Consequently, WaveVC not only performs adversarial raw audio VC but also improves performance by concatenating the fundamental frequency feature with the content embedding vector and applying the speech loss and the F0 loss.

Objective quality assessment
An MBNet [30]-based predicted mean opinion score (pMOS) evaluation and a Wav2Vec2 [31]-based character error rate (CER) evaluation are also performed on the sampled data for the objective quality assessment. The results are summarized in Table 2. AdaIN-VC and Again-VC show lower pMOS and higher CER than NVC-Net and WaveVC. This means that the speech converted by AdaIN-VC and Again-VC is degraded and inaccurate because the utterance information of the source speech is not properly preserved. NVC-Net shows better values than AdaIN-VC and Again-VC but still falls clearly behind the proposed WaveVC. WaveVC achieves a pMOS of 3.32 and a CER of 0.037 in the seen-to-seen case, and a pMOS of 2.99 and a CER of 0.051 in the unseen-to-unseen case, which are the highest pMOS and the lowest CER in both cases. These results show that WaveVC not only generates high-quality speech but also preserves the utterance information of the source speech well.

Ablation study
The ablation study is performed to examine how the ASR (speech) loss and the F0 loss affect the converted speech; the results are summarized in Table 3. Applying the ASR loss decreases the CER values, indicating that the ASR loss helps to preserve the utterance information of the source speech. On the other hand, pMOS increases when the F0 loss is applied: preserving the fundamental frequency information and injecting it at conversion time leads to higher-quality converted speech.

Conclusions
In this paper, we proposed an adversarial raw audio VC method called WaveVC, which does not require a separate vocoder because it performs VC directly on raw audio. In addition, the proposed WaveVC achieves speech and fundamental frequency consistent VC by concatenating the fundamental frequency feature with the content embedding vector and adding two losses: the speech loss and the F0 loss. To compare the performance of WaveVC with other VC methods, we conducted a MOS evaluation of the naturalness and similarity of the VC results. WaveVC not only outperformed the competing VC methods but also showed a level of naturalness similar to the ground truth. In addition, in the objective quality assessments, pMOS and CER, WaveVC showed significantly better performance than the other VC methods.

Fig. 1
Fig. 1 The overall architecture of WaveVC. The solid lines are the paths used in both training and inference, and the dotted lines are used only during training.

Table 1
The MOS results on VC methods

Table 3
The experimental results of the ablation study