1 Introduction

Music is a blend of vocal and instrumental sounds that expresses and evokes emotion through a combination of melody, rhythm, and harmony. The goal of singing voice separation systems is to separate previously mixed vocal and instrumental components and achieve a deep understanding of each. Once sufficiently mature, such systems will have applications in bilateral cochlear implants [15], fundamental frequency estimation [7], beat tracking in the presence of dominant voices [49], and karaoke music production, as well as in any other system that relies on lyric, instrument, or chord recognition. Other potential applications include melody extraction/annotation [5, 34], assessment of singing ability [18], automatic lyrics recognition/matching [23, 44], singing visualization [19], and singer identification [20].

The separation of singing voices has long been acknowledged as a difficult task. In recent years, researchers have focused on data-driven machine learning approaches to separate voices from polyphonic music. These systems generally rely on the two-dimensional time–frequency magnitude spectrogram of the audio signal, which allows convolutional neural networks to be applied to audio-related tasks.

A high-level outline of our proposed singing voice source separation process is presented in Fig. 1. We propose a long short-term memory (LSTM)-based high-resolution representation network that extracts a final feature map from the input spectrogram. This feature map is used to generate a time–frequency soft mask, which is multiplied with the input spectrogram to produce the predicted spectrograms. The predicted spectrograms are then transformed back into signals corresponding to the vocal and accompaniment tracks using the inverse short-time Fourier transform.

Fig. 1

A basic overview of our singing voice source separation method
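As a concrete illustration of this pipeline, the following sketch shows how a trained mask-prediction model (here a placeholder `model` callable, our assumption) could be wired between the forward and inverse short-time Fourier transforms; the FFT size, hop length, and sampling rate are illustrative choices, not values prescribed by the paper.

```python
import numpy as np
import librosa

def separate(mixture_path, model, sr=16000, n_fft=1022, hop=256):
    """Apply a trained soft-mask model to a mixture and reconstruct both sources."""
    y, _ = librosa.load(mixture_path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)   # n_fft=1022 yields 512 frequency bins
    mag, phase = np.abs(stft), np.angle(stft)
    mask = model(mag)                                      # soft mask in [0, 1], same shape as mag
    vocal = librosa.istft(mag * mask * np.exp(1j * phase), hop_length=hop)
    accomp = librosa.istft(mag * (1.0 - mask) * np.exp(1j * phase), hop_length=hop)
    return vocal, accomp
```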

The proposed HR-LSTM network can be employed not only for singing voice separation but also for many other sequence-to-sequence learning problems, such as speech recognition, time series prediction, machine translation, and question answering. Standard deep neural networks can only be applied to problems whose inputs and outputs are encoded as vectors of fixed dimensionality. In contrast, this LSTM-based HRNet can process not only single data points such as images but also entire sequences of data such as speech and video.

Resolution is important for audio analysis using time–frequency representations because the pixel correlations and harmonic structure in a spectrogram carry the unique characteristics of an acoustic signal. We used HRNet, a proven method, to preserve the spatial resolution of the feature maps throughout the network. Temporal information is captured using the LSTM network because musical information is globally correlated along the temporal axis. The consideration of spatiotemporal information without losing resolution is the main characteristic of our proposed method and is responsible for its improved music source separation results.

The proposed network was tested using three publicly available datasets (DSD100, MIR-1K, and Pansori). The DSD100 and MIR-1K datasets were used to test how the system separates two music sources, singing voice and accompaniment, while the Korean traditional music Pansori dataset was used to test how the system separates two different singing voices as well as a drum sound. To confirm the utility of the HR-LSTM, we also propose a new singing voice separation dataset (referred to as NISVS). We mixed the DSD100 and NISVS datasets and report our results accordingly.

The proposed network's performance was evaluated using the median signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR), each measured in decibels (dB). The proposed HR-LSTM outperformed the current state of the art when tested on the combined DSD100 and NISVS dataset. Combining these two datasets during training allows the network to predict spectrograms that are closer to the ground truth at test time. The addition of the Pansori dataset, together with the incorporation of the NISVS dataset into DSD100, allowed our model to achieve better separation results than when it was tested only against the DSD100 and MIR-1K datasets.

The major contributions of our study can be summarized as follows:

  1.

    We propose the new HR-LSTM model, which combines the relatively new HRNet for high-resolution representation learning (originally devised for image processing problems) with a well-known long short-term memory (LSTM) module that captures the temporal features of the acoustic signal.

  2.

    We tested our proposed model against various state-of-the-art methods and achieved improvements over the state of the art in certain cases.

  3.

    We propose the new synthetic NISVS dataset and mix it with the real DSD100 dataset for training. Mixing the two datasets during training improves the test results compared with not mixing them. This experiment indicates that mixing synthetic and real data during training can improve test-phase accuracy on real data in the source separation domain, as shown in Tables 3 and 5.

  4.

    HRNet has not previously been studied in the source separation community. One of the primary reasons for the network's performance is therefore HRNet itself, which maintains the high-resolution feature map of the spectrogram instead of recovering it from a low-resolution one, as explained in the third-to-last paragraph of the "Related Work" section and the last sentence of the "High-Resolution Representation Learning" section. This is the main contribution of our paper.

2 Related Work

The study of singing voice separation has a long history. The earliest algorithms emphasized the pitch and frequency content of the audio signal, using statistical methods to separate mixed sources. Independent component analysis (ICA) [13], nonnegative matrix factorization (NMF) [16], and sparse component analysis (SCA) [6] were developed for blind source separation [3, 21]. Each of these methods is predicated on the idea that time-series data can be projected onto a new set of axes using a statistical technique. The authors of [32] first separated the singing voice from the music accompaniment using nonnegative matrix partial co-factorization (NMPCF); the separated singing voice was subsequently used to estimate the pitches and reconstruct the singing voice's spectrum. The authors of [31] attempted to separate music and voice by first identifying the periodically repeating segment in a mixture and then separating the repeated signal from the mix.

Today, these methods have been supplanted by deep neural networks capable of outperforming previous approaches to source separation. To extract the latent information from an audio input, deep neural networks often use a hierarchical architecture and a nonlinear approximation function to estimate the independent music sources from the combined signal. The authors of [42, 46, 50] applied this deep learning-based separation approach. For single-channel source separation, a fully convolutional denoising auto-encoder (CDAE) was presented in [8]; there, the researchers explored whether a CDAE could learn the spectral-temporal filters and properties associated with a source. The authors of [24] and [25] similarly used multichannel audio input to train a DNN, focusing primarily on the spectral properties of a single frame; to separate the source spectra, they employed a fully connected network and a 2D Mel spectrogram. The authors of [35] suggested a modified group delay (MOD-GD) function intended to improve the performance of existing algorithms by incorporating previously neglected phase spectrogram information. A DNN was used by the authors of [2] for supervised speech training that enhanced speech intelligibility in noisy environments.

Some researchers have used waveform representations of music for source separation in order to preserve the audio signal's phase information. Preserving the sinusoidal audio information [4] and managing the memory requirements calls for a data-driven strategy when working with waveform audio representations. The U-Net-based architecture employed in [38] resampled the features at various time scales; through the incorporation of source additivity into the output layer, upsampling, and context-aware prediction, this modified U-Net design outperformed its peers. Similarly, the authors of [9, 22] used an encoder-decoder approach to address multi-speaker, multi-channel audio source separation. The authors of [27, 30, 33] also used waveform representations to isolate speech from noisy signals, while the authors of [1, 11, 39, 40], instead of using waveform representations, divided the spectrogram into multiple sub-bands based on frequency range to generate the time–frequency masks. As the spectrogram patterns differ across frequency bands, applying a different convolutional filter in each band proved critical to boosting the performance of these source separation systems.

None of the aforementioned studies on music information retrieval preserves a high-resolution representation of the spectrogram. HRNet has recently proven successful at human pose estimation, object detection, and semantic segmentation [45]. Accordingly, HRNet has supplanted encoder-decoder-based networks designed to recover high-resolution representations from low-resolution ones. Because HRNet maintains high resolution instead of recovering it, its features are highly precise and semantically strong, and the correlation between different Mel bins of the spectrogram is preserved in succeeding layers. In this paper, we sought to determine whether maintaining spectrogram features, rather than recovering them in subsequent layers (as is done by other deep neural networks), results in enhanced performance in separating singing voices. To further improve performance at this task, we blended the HRNet with an LSTM block. While blending architectures often increases model complexity in an undesirable manner, we combined the two in a unified architecture similar to that of the authors of [39].

Existing singing voice separation datasets are limited in terms of both size and musical variety. To advance our study and the field more broadly, we created a labeled dataset for the singing voice separation problem based on data from a Nepali reality television singing competition. Although the proposed dataset is synthetic, it preserves the character of real-world music samples, so HR-LSTM performance was unaffected on real-world test samples.

To further improve the system's performance, the training samples of the proposed dataset were mixed with the training samples of DSD100, and the test results are reported independently. As deep learning architectures are only effective when provided with adequate and appropriate data, we compared our method against prior state-of-the-art alternatives using various publicly available datasets.

2.1 The New Nepal Idol Singing Voice Separation Dataset

Idols is a franchise reality television singing competition created by British television producer Simon Fuller and developed by Fremantle [47]. Nepal Idol is a Nepali reality television singing competition that is part of the Idols franchise. In the Nepal Idol competition, each contestant must pass four selection rounds (audition, theatre, piano, and gala) before proceeding to the final round. Our Nepal Idol singing voice separation dataset (NISVS) was generated using recordings from the audition round, in which contestants must perform without any instrumental accompaniment. This allows us to establish ground truth sources for each singer's voice. Likewise, to obtain ground truth sources for the instruments, we downloaded Nepali instrumental sounds similar to those used in later rounds from YouTube. The downloaded instrumental sounds and the contestant voices were trimmed to equal length and mixed together to obtain the mixed signals. Because the mixtures are constructed synthetically, the instruments and singing voices are not properly aligned and are uncorrelated with each other. Even so, such synthetic data can be added to real data to increase the number of samples during the training phase; synthetic data generated through various augmentation techniques have already been used successfully in the image and audio domains. Likewise, constructing our synthetic NISVS dataset and adding it to the real training data helps increase accuracy on the real dataset during the test phase. More specifically, we merge the well-aligned real training data of DSD100 with the uncorrelated training data of the NISVS dataset. The test-phase results for both real (DSD100) and synthetic (NISVS) samples are reported in Tables 3 and 5 of the experimental section; they show that mixing real and synthetic data during training improves test-phase accuracy compared with not mixing them.
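A minimal sketch of how one such synthetic mixture can be assembled, assuming librosa and soundfile for audio I/O and a 16 kHz mono format; the file paths, sampling rate, and function name are illustrative and are not part of the dataset's released tooling.

```python
import librosa
import soundfile as sf

def make_mixture(vocal_path, instrumental_path, out_path, sr=16000):
    """Trim a vocal and an instrumental recording to equal length and sum them."""
    vocal, _ = librosa.load(vocal_path, sr=sr, mono=True)
    accomp, _ = librosa.load(instrumental_path, sr=sr, mono=True)
    n = min(len(vocal), len(accomp))   # enforce equal length for the two ground truth sources
    vocal, accomp = vocal[:n], accomp[:n]
    mixture = vocal + accomp           # simple additive mix; no temporal alignment is enforced
    sf.write(out_path, mixture, sr)
    return vocal, accomp, mixture
```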

In total, we collected 95 sources, including singing voice, accompaniment, and mixture. Seventy of these were placed in the training set, while the remaining 25 were placed in the test set. The dataset was preprocessed to ensure that the singing voice and accompaniment recordings were precisely the same length. The audio samples were of varying length, ranging from 10 to 78 s, with an average duration of 32.79 s in the training set; in the test set, samples ranged from 13 to 75 s, with an average duration of 34.15 s. The training data comprised 2296 s of recordings in total, while the test data comprised 854 s. Consistent with the format of the DSD100 dataset [26], the two sources and the mixture recordings that comprise our NISVS dataset were kept in different folders. The details of the NISVS dataset are presented in Table 1, and log spectrogram visualizations of the two ground truth sources along with their mixture are shown in Fig. 2.

Table 1 The NISVS dataset in detail
Fig. 2

Log spectrogram visualization of the NISVS a accompaniments, b vocal, and c mixture

2.2 High-Resolution Representation Learning

The most well-known encoder-decoder based networks (e.g., hourglass, U-Net, DeconvNet, SegNet) were designed to recover a high-resolution representation from a low-resolution one by upsampling the feature map. The distinguishing feature of these networks is that they connect multi-resolution convolutions in series, with the result that the representations are weak due to a loss of location sensitivity. HRNet solves this problem by connecting multi-resolution convolutions in parallel. HRNet was originally developed as a backbone network and is currently among the best-performing networks for human pose estimation, object detection, and semantic segmentation [45]. The features obtained from HRNet are highly precise and semantically strong because the network maintains high resolution instead of recovering it from a low-resolution representation. The HRNet architecture connects high-to-low-resolution convolutions in parallel, with repeated fusion, instead of in series.

The HRNet architecture is made up of four blocks, each of which represents a multi-resolution stage that connects high-to-low and low-to-high resolutions in parallel. Starting with high resolution in the first block, the network gradually adds high-to-low resolution streams one by one. Each new stage consists of the resolutions from the previous stages plus one additional lower resolution, and the parallel multi-resolution streams are linked. Multi-resolution features are created by fusing these resolutions together. Figure 4 depicts the multi-resolution fusion layer during the low-to-high and high-to-low processes. Readers are encouraged to consult the original HRNet paper [45] for the full details of its processing.

2.3 LSTM Blocks

Recurrent neural networks (RNNs) are powerful tools for sequence learning in a diverse array of fields, from speech recognition to image captioning. They can also be used in control theory for non-fragile \({\mathcal{H}}_{\infty }\) synchronization. Bidirectional associative memory inertial neural networks have recently been proposed for discrete-time synchronization by combining continuous-time inertial neural networks with conventional first-order bidirectional associative memory neural networks [36]. Similarly, a synchronization controller has been proposed to handle controller gain fluctuations in [37]. To exploit the power of RNNs, we propose an LSTM block that receives \(N\) feature maps as input and provides \(N+1\) feature maps as output. The LSTM block is made up of a \(1\times 1\) convolution that reduces the number of feature maps to one. The two-dimensional feature vectors from this single feature map are converted into one-dimensional feature vectors by the LSTM layer. There are two LSTM layers, each containing 128-dimensional memory units; these memory units represent the one-dimensional feature vector of the spectrogram. The final layer of the LSTM block is a feedforward linear layer that converts the number of LSTM units back to the input frequency dimension.

This type of recurrent structure has the advantage of capturing context information from nearby frames of the mixed spectrogram. The context information is important for capturing the temporal structure of the mixed signal, from which the network can memorize longer dependencies and thus improve the singing voice separation results. The LSTM block is adopted in every stage of HRNet (as depicted in Fig. 5) to obtain an LSTM-based multi-scale feature map. The features obtained from the HRNet are effective at modeling the local structure of the spectrogram because they follow the CNN structure, whereas the features from the LSTM block capture global information by covering the entire frequency range at once. The concatenation of these multi-scale local and global features is well suited to separating singing voices. The architecture of the LSTM block is described in Fig. 3.

Fig. 3

The LSTM block architecture
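The following PyTorch sketch reflects our reading of this block: a \(1\times 1\) convolution squeezes the \(N\) input feature maps to one, two 128-unit LSTM layers run along the time axis, and a linear layer maps back to the frequency dimension before the result is concatenated with the input. The tensor layout and class name are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class LSTMBlock(nn.Module):
    """Sketch of the LSTM block in Sect. 2.3 for feature maps of shape (batch, channels, freq, time)."""
    def __init__(self, in_channels, freq_bins, hidden=128):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, 1, kernel_size=1)          # reduce N feature maps to one
        self.lstm = nn.LSTM(input_size=freq_bins, hidden_size=hidden,
                            num_layers=2, batch_first=True)              # two 128-unit LSTM layers
        self.proj = nn.Linear(hidden, freq_bins)                         # back to the input frequency size

    def forward(self, x):
        b, c, f, t = x.shape
        z = self.squeeze(x)                   # (b, 1, f, t)
        z = z.squeeze(1).permute(0, 2, 1)     # (b, t, f): one frequency vector per time frame
        z, _ = self.lstm(z)                   # (b, t, hidden)
        z = self.proj(z)                      # (b, t, f)
        z = z.permute(0, 2, 1).unsqueeze(1)   # (b, 1, f, t): one extra feature map
        return torch.cat([x, z], dim=1)       # N -> N + 1 feature maps
```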

2.4 Combining LSTM with HRNet

Prior studies on source separation [39, 41] have suggested that the blending of two networks, particularly convolutional neural networks and LSTMs, can increase audio source separation accuracy. These blended networks achieve state-of-the-art results when tested against various publicly available datasets. Inspired by this technique, in this paper we aimed to adapt an HRNet for use with an LSTM to perform singing voice separation.

The input to our HR-LSTM was the mixed magnitude spectrogram of size \(F\times T\times 1\), in which \(F=512\) denotes the frequency axis, \(T=64\) denotes the time axis, and 1 is the spectral channel. The HR-LSTM consists of four branches (branch 1–4) that compute a high-resolution spectrogram in the branch 1 sub-network in parallel with lower-resolution spectrograms in the branch 2, branch 3, and branch 4 sub-networks. Similar to ResNet-50, each branch consists of four residual units with skip connections [10]. The output feature map of the last residual block in each branch is passed into the LSTM block with 128-dimensional memory units. The spectrogram feature maps obtained from the LSTM block and the residual block are concatenated, yielding an LSTM-based multi-scale feature map in all four branches, as illustrated in Fig. 5. These concatenated LSTM-based multi-scale feature maps capture the local and global features that are useful for effectively separating the singing voices. After the feature maps are obtained, a fusion layer aggregates the information from the high-, medium-, and low-resolution feature maps by fusing downsampled and upsampled features.

Figure 4 illustrates the multi-resolution fusion layer that shares information across the different resolutions. Feature maps going from low to high resolution are enlarged using bilinear upsampling, whereas feature maps going from high to low resolution are reduced using convolutions with a stride of 2. The final feature map of each branch is obtained by summing all the downsampled and upsampled features, resulting in a high-resolution representation of the mixed spectrogram. This final high-resolution representation of the spectrogram feature map provides a better trade-off between time and frequency resolution.

Fig. 4

Multi-resolution fusion layer during low-to-high and high-to-low process
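A minimal PyTorch sketch of this fusion step, assuming branch outputs ordered from high to low resolution with channel counts \(C\), \(2C\), \(4C\), and \(8C\); the exact convolution and normalization details of the original HRNet fusion are simplified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseToBranch(nn.Module):
    """Fuse multi-resolution feature maps into the resolution of one target branch (cf. Fig. 4)."""
    def __init__(self, channels, target):
        super().__init__()
        self.target = target
        self.convs = nn.ModuleList()
        for i, c_in in enumerate(channels):
            if i < target:
                # higher-resolution branch: repeated stride-2 convolutions to shrink it
                ops = []
                for step in range(target - i):
                    c_out = channels[target] if step == target - i - 1 else c_in
                    ops.append(nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1))
                    c_in = c_out
                self.convs.append(nn.Sequential(*ops))
            elif i > target:
                # lower-resolution branch: 1x1 conv, then bilinear upsampling in forward()
                self.convs.append(nn.Conv2d(c_in, channels[target], kernel_size=1))
            else:
                self.convs.append(nn.Identity())

    def forward(self, feats):
        size = feats[self.target].shape[-2:]
        out = 0
        for i, (f, conv) in enumerate(zip(feats, self.convs)):
            f = conv(f)
            if i > self.target:
                f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
            out = out + f                      # sum of all downsampled and upsampled features
        return out
```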

We added the LSTM block to the HRNet at a point just prior to the downsampling and upsampling being performed, allowing the block to capture the global structure and the HRNet to model the fine local structure of the input mixed spectrogram. The architecture of HR-LSTM is shown in Fig. 5.

Fig. 5

The architecture of HR-LSTM

Given an input mixed spectrogram of size \(F\times T\times 1\), HR-LSTM's first branch produces an output with a resolution of \(F\times T\times (C+1)\), where \(C=32\) is the number of channels obtained from HRNet and 1 is the feature channel obtained from the LSTM. Similarly, the HR-LSTM's second, third, and fourth branches produce output spectrograms of resolution \(\frac{F\times T}{2}\times \left(2C+1\right)\), \(\frac{F\times T}{4}\times \left(4C+1\right)\), and \(\frac{F\times T}{8}\times \left(8C+1\right)\), respectively. All HR-LSTM branches maintain these resolutions throughout the process. The multi-resolution feature maps obtained from the low-to-high and high-to-low processes are fused via the fusion layer to obtain \(C\), \(2C\), \(4C\), and \(8C\) feature maps for the respective branches. The features with \(2C\), \(4C\), and \(8C\) channels are bilinearly upsampled by factors of \(2\), \(4\), and \(8\) to obtain feature maps of the same size. Finally, as shown in Fig. 6, all features are concatenated to obtain the final feature map of size \(F\times T\times (C+2C+4C+8C)\).

Fig. 6

The multi-resolution outputs were upsampled by a factor of 2, 4, and 8 to obtain the final feature map
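The aggregation in Fig. 6 can be sketched as follows, assuming the four branch outputs are PyTorch tensors of shape (batch, channels, frequency, time) at full, 1/2, 1/4, and 1/8 resolution; the variable names are ours.

```python
import torch
import torch.nn.functional as F

def aggregate(b1, b2, b3, b4):
    """Upsample the 2C, 4C, and 8C branch outputs to F x T and concatenate with the C branch."""
    size = b1.shape[-2:]                                    # target F x T resolution
    ups = [F.interpolate(b, size=size, mode="bilinear", align_corners=False)
           for b in (b2, b3, b4)]                           # upsample by factors 2, 4, and 8
    return torch.cat([b1, *ups], dim=1)                     # (batch, C + 2C + 4C + 8C, F, T)
```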

2.5 Loss Function

The error for an \(L2\) loss is amplified by the squared differences between the predicted and ground truth spectrograms. Therefore, in this work, the \(L_{1,1}\) norm was used to minimize the absolute difference between the target and predicted spectrograms, as it is resistant to outliers in the data, which helps to effectively ignore outliers in the spectrogram; this norm has already been studied in various types of source separation problems [1, 29, 40]. The HR-LSTM's time–frequency output mask for the \(i{\text{th}}\) source spectrogram is represented by \(M_{i}\). The input mixed spectrogram \(X\) of size \(F \times T \times 1\) is multiplied with \(M_{i}\) to obtain the predicted spectrogram. The loss for the \(i{\text{th}}\) source minimizes the absolute difference between the predicted spectrogram and the \(i{\text{th}}\) ground truth spectrogram \(Y_{i}\) of the music source, and is given by

$$ {\text{Loss}}_{{i{\text{th}}}} = \parallel Y_{i} - X \odot M_{i} \parallel_{1,1} $$
(1)

where \(\odot\) denotes element-wise multiplication, \(\parallel \cdot \parallel_{1,1}\) is the \(L_{1,1}\) norm, and \(M_{i}\) represents the time–frequency mask for the \(i{\text{th}}\) music source. The prediction of the HR-LSTM prior to the application of the time–frequency mask is represented by \(P_{{1{\text{st}}}}\) for the singing voice and by \(P_{{2{\text{nd}}}}\) for the accompaniment. The time–frequency mask \({\text{Mask}}_{M}\) is defined as

$$ {\text{Mask}}_{M} = \frac{{\left| {P_{{1{\text{st}}}} } \right|}}{{\left| {P_{{1{\text{st}}}} } \right| + \left| {P_{{2{\text{nd}}}} } \right|}}, $$
(2)

The output of the network for singing voice and accompaniment is given by

$$ {\text{singing}}\,{\text{voice}}\left( {P_{{1{\text{st}}}} } \right) = {\text{Mask}}_{M} \odot X, $$
(3)
$$ {\text{Accompaniment}}\left( {P_{{2{\text{nd}}}} } \right) = (1 - {\text{Mask}}_{M} ) \odot X , $$
(4)

Equation (1) represents loss for a single source spectrogram. Accordingly, the total loss of the network for \(N\) music sources is defined as

$$ {\text{Loss}}_{N} = \mathop \sum \limits_{i = 1}^{N} {\text{Loss}}_{{i{\text{th}}}} $$
(5)
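For the two-source case, Eqs. (1)–(5) can be sketched as follows, assuming PyTorch tensors holding magnitude spectrograms; the small epsilon added to the mask denominator is our numerical safeguard and is not part of the paper's formulation.

```python
import torch.nn.functional as F

def mask_and_loss(p1, p2, mixture, y_vocal, y_accomp, eps=1e-8):
    """Soft mask from the two network outputs (Eq. 2), masked mixtures (Eqs. 3-4),
    and the summed L1,1 loss over both sources (Eqs. 1 and 5)."""
    mask = p1.abs() / (p1.abs() + p2.abs() + eps)             # Eq. (2)
    est_vocal = mask * mixture                                # Eq. (3)
    est_accomp = (1.0 - mask) * mixture                       # Eq. (4)
    loss = F.l1_loss(est_vocal, y_vocal, reduction="sum") + \
           F.l1_loss(est_accomp, y_accomp, reduction="sum")   # Eqs. (1) and (5)
    return loss, est_vocal, est_accomp
```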

2.6 Evaluation Measures

The HR-LSTM network was evaluated using the popular metrics of the median signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR), each measured in decibels (dB), consistent with the BSS-Eval metrics [43]. BSS-Eval assumes that a predicted source \(\hat{Z}_{{{\text{predicted}}}}\) is composed of four independent components, as given in Eq. (6).

$$ \hat{Z}_{{{\text{predicted}}}} = Z_{{{\text{target}}}} + e_{{{\text{interf}}}} + e_{{{\text{noise}}}} + e_{{{\text{artif}}}} , $$
(6)

where \(\hat{Z}_{{{\text{predicted}}}}\) is any source predicted by the network, \(Z_{{{\text{target}}}}\) is the ground truth source, and \(e_{{{\text{interf}}}}\), \(e_{{{\text{noise}}}}\), and \(e_{{{\text{artif}}}}\) are error terms for interference, noise, and artifacts [43]. Calculating these evaluation measures requires knowledge of the ground truth signals, divided into short window segments a few seconds long. SIR reflects how much of the other sources can be heard in the estimated source, whereas SAR describes the amount of unwanted artifacts between the true and predicted sources. SDR is used as a general indicator of the effectiveness of the source separation system. Equations (7), (8), and (9) define the SDR, SIR, and SAR ratios between the predicted and ground truth signals.

$$ {\text{SDR}} = 10\log_{10} \frac{{||Z_{{{\text{target}}}} ||^{2} }}{{||e_{{{\text{artif}}}} + e_{{{\text{interf}}}} + e_{{{\text{noise}}}} ||^{2} }}, $$
(7)
$$ {\text{SIR}} = 10\log_{10} \frac{{||Z_{{{\text{target}}}} ||^{2} }}{{||e_{{{\text{interf}}}} ||^{2} }} , $$
(8)
$$ {\text{SAR}} = 10\log_{10} \frac{{||Z_{{{\text{target}}}} + e_{{{\text{interf}}}} ||^{2} }}{{||e_{{{\text{artif}}}} ||^{2} }} , $$
(9)
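In practice these metrics can be computed with an existing BSS-Eval implementation; the sketch below uses mir_eval (our choice of toolkit, not stated in the paper) and takes the median over short windows as described above.

```python
import numpy as np
from mir_eval.separation import bss_eval_sources

def median_bss_eval(references, estimates, win_samples):
    """Median SDR/SIR/SAR over short windows; arrays have shape (n_sources, n_samples)."""
    sdrs, sirs, sars = [], [], []
    n = references.shape[1]
    for start in range(0, n - win_samples + 1, win_samples):
        ref = references[:, start:start + win_samples]
        est = estimates[:, start:start + win_samples]
        sdr, sir, sar, _ = bss_eval_sources(ref, est)   # BSS-Eval decomposition of Eq. (6)
        sdrs.append(sdr); sirs.append(sir); sars.append(sar)
    return np.median(sdrs, axis=0), np.median(sirs, axis=0), np.median(sars, axis=0)
```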

3 Experiments

The HR-LSTM was tested against four different source separation datasets. The experiments on the DSD100, MIR-1K, and NISVS datasets tested singing voice separation, while the Korean traditional Pansori dataset was used to test the system's ability to separate the three sources of drum, drummer voice, and singer voice [1]. DSD100 and MIR-1K are publicly available datasets, while NISVS is the dataset we created. Our experimental configurations across all four datasets were the same as those of [1], except that we increased the number of iterations to 400,000. We also performed an experiment mixing the training data of DSD100 and MIR-1K; mixing the two datasets achieves slightly better results than not mixing them.

4 Datasets

Since our proposed NISVS dataset was described in Sect. 2.1, we briefly review the DSD100, MIR-1K, and Pansori datasets here. Pansori music, which emerged in South Korea, has been registered by UNESCO as an intangible cultural heritage. In this type of music, the singer explains the actions of characters and expresses their feelings during a stage performance. The Pansori dataset used in [1] consists of three sources to separate: drum, drummer voice, and singer's voice. The drum and drummer voice recur throughout the entire song, repeating every 0.5–3 s. The mixed samples in the Pansori dataset were synthetically created by mixing drum, drummer voice, and singer's voice with white noise. The drum and drummer voice sources were first removed from the original Pansori songs and saved in separate folders, establishing the ground truth source for the singer's voice. The drum and drummer's voice in Pansori music contain only percussive elements, with no harmonic content. Thus, the synthetic Pansori samples used only during the training phase can successfully separate the sources of real Pansori music in the test phase [1].

The DSD100 dataset, consisting of 100 full-track songs, was originally designed by SiSEC [26]. The dataset consists of an evenly distributed variety of musical genres and styles. Although there are four different sources in the original DSD100 dataset—bass, drum, other, and vocal—the DSD100 dataset used in our experiment was adapted for the singing voice separation task by mixing the bass, drum, and other sources into ‘accompaniment.’

The MIR-1K dataset contains 1000 song clips with voice and accompaniment captured in the left and right channels at a sampling rate of 16 kHz. The annotation files of the MIR-1K dataset contain additional information, including pitch contours in semitones, indices and types of unvoiced frames, lyrics, and vocal/non-vocal segments. Each clip ranges from 4 to 13 s, and the total length of the dataset is 133 min. The songs were performed by eight women and eleven men, most of whom had no formal music training. To ensure a fair comparison, we selected the 175 clips sung by one male (abjones) and one female (amy) as the training set; the remaining 825 clips were used for testing our source separation system. Additional details concerning these datasets are shown in Table 2.

Table 2 Details of the DSD100, MIR-1K, and Pansori datasets

4.1 Results of Testing Using the DSD100 Dataset

We merged our proposed NISVS dataset with the training data of DSD100 and report the median SDR values on the DSD100 test set for HRNet trained on DSD100 only, HR-LSTM trained on DSD100 only, and HR-LSTM trained on DSD100 plus NISVS. The DSD100 dataset played two roles: (1) samples with four independent sources had those sources separated, and (2) samples with two independent sources, vocals and accompaniment, had the singing voice separated. As our experiment was designed specifically for singing voice separation, the three sources of bass, drum, and other were blended. Because deep neural networks require large datasets for training, mixing DSD100 with our NISVS dataset slightly improved the median SDR values of the HR-LSTM trained on the combined dataset over those of the model trained on DSD100 alone. With the mixed training data, our HR-LSTM separated vocals 0.04 dB better than other state-of-the-art networks, including MMDenseLSTM [39], although its accompaniment score was 0.16 dB lower. Moreover, our HRNet and HR-LSTM achieved results comparable to other current algorithms when only the DSD100 dataset was used (Table 3).

Table 3 Median SDR values in decibel (dB) for singing voice separation on DSD100 dataset

Other state-of-the-art algorithms, such as MMDenseNet [40], MMDenseLSTM [39], and PSHN (4-Stack) [1], use multi-band spectrogram input to predict the respective music sources. These algorithms use parallel network architectures to extract features from each band and concatenate the output feature maps of the parallel networks to estimate the final sources. MMDenseLSTM, an improved version of MMDenseNet, is still among the best at separating the accompaniment, scoring 0.16 dB higher than our HR-LSTM trained on DSD100 plus NISVS. With respect to vocal separation, however, our HR-LSTM outperformed all existing methods. Similarly, the authors of [29] created a fully convolutional hourglass network that used a single-band spectrogram to extract features using a top-down and bottom-up approach. While both SH-4stack [29] and HR-LSTM use a single-band spectrogram, the former was 0.90 dB and 0.43 dB less accurate for vocals and accompaniments, respectively. BLEND [41] is similar to our method in that it merged two neural network architectures (feed-forward and recurrent) and combined their outputs using Wiener filtering, though it performed worse than the proposed method. NUG [24] estimated the source spectra by combining a covariance matrix with a deep neural network, though it achieved only 4.55 dB for vocals and 8.90 dB for accompaniments. DeepNMF [17], a conventional audio source separation method that utilizes a nonnegative deep architecture, achieved only 2.75 dB for vocals and 8.90 dB for accompaniments.

4.2 Results of Testing Using the MIR-1K Dataset

HRNet and HR-LSTM were tested without mixing our NISVS dataset with the MIR-1K [12] dataset, as the MIR-1K dataset contains enough audio samples for training (825). Moreover, the musical genres in our proposed NISVS dataset are too different from those contained in the MIR-1K dataset.

Performance on the MIR-1K dataset is reported for comparison with other state-of-the-art algorithms using GNSDR, GSIR, and GSAR for both singing voice and accompaniment. Global normalized SDR (GNSDR), global SIR (GSIR), and global SAR (GSAR) are calculated as weighted means of the NSDR, SIR, and SAR, respectively, based on the BSS-Eval metrics [28, 43]. Table 4 presents the experimental results achieved by our HR-LSTM and HRNet. As assessed by GSAR and GNSDR, the HR-LSTM and HRNet exceeded all other baseline architectures at separating the singing voice and accompaniment. Specifically, HR-LSTM performed best as assessed by GNSDR (for accompaniment separation) and GSAR (for singing voice separation), while HRNet performed best as assessed by GSAR for accompaniment separation. PSHN (4-Stack) [1] was still best at separating singing voices, with a GNSDR of 10.83 dB and a GSIR of 16.54 dB; it also separated accompaniment well as assessed by GSIR, outperforming our HR-LSTM by 0.09 dB.

Table 4 GNSDR, GSIR, and GSAR values in decibel (dB) for singing voice separation using the MIR-1K dataset

4.3 Results of Testing Using the NISVS Dataset

Table 5 shows the results of three experiments conducted with the NISVS dataset. The HR-LSTM network trained on the mixed audio of the DSD100 and NISVS datasets performed well during testing; it helps, of course, that the music in these two datasets is similar in genre. In the first experiment, the HRNet, with its multi-resolution convolutions connected in parallel, was trained only on our NISVS dataset and achieved 17.59 dB for vocals and 8.04 dB for accompaniments.

Table 5 Median SDR values in decibel (dB) for singing voice separation using our developed NISVS dataset

The second experiment used the blended LSTM block and HRNet (HR-LSTM), which increased performance by 1.23 dB for vocals and 0.78 dB for accompaniments over the HRNet trained only on our NISVS dataset. The LSTM block improved accuracy for two reasons: first, incorporating the LSTM block before every downsampling and upsampling operation captures the global structure alongside the fine local features of the input mixed spectrogram; second, the LSTM treats the feature map of the spectrogram as sequential data along the time axis, capturing the long-range dependencies present in the mixed spectrogram.

The third experiment tested the HR-LSTM on the blended DSD100 and NISVS dataset. This mixed dataset resulted in the most accurate model performance, reaching 19.46 dB for vocals and 8.85 dB for accompaniments.

We also visualize the predicted and ground truth spectrograms for one of the test audio samples (singing voice and accompaniment) of the NISVS dataset. As Table 5 shows, mixing the real DSD100 dataset with the synthetic NISVS dataset gives better accuracy than not mixing them, so the visualization of the test result in Fig. 7 was produced with the HR-LSTM model trained on the mixed DSD100 and NISVS datasets. The visualization shows that the predicted spectrograms for singing voice and accompaniment are close to the ground truth and hence that the music sources are separated successfully.

Fig. 7

The results of comparison between ground truth and predicted spectrograms in one of our test set. a Ground truth for accompaniment, b ground truth for singing voice, c predicted accompaniment, d predicted singing voice

4.4 Results of Testing Using the Pansori Dataset

Table 6 shows the outcomes of our study, as well as a comparison to the baseline [1], using the Pansori dataset. The Pansori dataset was originally published in [1], in which a parallel stacked hourglass network (PSHN) with four architectural variations was constructed: PSHN (1-Stack), PSHN (2-Stack), PSHN (3-Stack), and PSHN (4-Stack). The PSHN (4-Stack) performed best, at 15.97 dB for drum, 12.86 dB for drummer voice, and 16.12 dB for singer voice; its excellent performance resulted from passing the masks estimated in the intermediate stages of the parallel hourglass module into the next module. The HRNet and HR-LSTM architectures designed in this paper surpassed even the PSHN (4-Stack) in accuracy. The HR-LSTM exceeded the HRNet, along with all variants of the baseline, at drum, drummer voice, and singer voice separation. This outperformance is attributed to the parallel connection of multi-resolution convolutions in our HRNet and to the fusion of the HRNet with an LSTM block. Even the HRNet without the LSTM block performed well compared to the baseline, achieving 15.83 dB for drum, 12.91 dB for drummer voice, and 16.25 dB for singer voice (0.14 dB less for drum, 0.05 dB more for drummer voice, and 0.03 dB more for singer voice than the PSHN (4-Stack)).

Table 6 Median SDR values in decibel (dB) for the Pansori source separation dataset

5 Discussion

The research in [17] makes use of nonnegative deep networks, in which the nonnegative parameters are produced by unfolding NMF iterations and untying their parameters for source separation based on nonnegative factorization. Similarly, the work in [24] combines deep neural networks with spatial covariance matrices to perform source separation. In addition, [48] proposes an algorithm called MLRR that learns subspaces using online dictionary learning. All of these methods [17, 24, 48] use the common approach of matrix factorization, which is the conventional, older approach to source separation and falls far behind our deep learning-based method. Moreover, we compared our proposed HR-LSTM network with other deep learning-based methods [1, 29, 39, 40, 41]. The works in [1, 39, 40] use a multi-band spectrogram as input and feed each band into a separate network, whereas we use a single-band spectrogram, which makes our architecture less complex. In addition, the key advantage of our method over all of the methods in Tables 3, 4, and 6 is that it preserves the high-resolution representation of the spectrogram rather than recovering it from a low-resolution one; maintaining the spectrogram in this way yields highly precise and semantically strong features. Mixing our proposed NISVS dataset with the publicly available DSD100 dataset during training is another advantage of our method for improving accuracy on the test data compared with other methods in the literature.

6 Conclusion

We developed an LSTM- and HRNet-based high-resolution representation learning method to perform the singing voice separation task. HR-LSTM connects multi-resolution convolutions in parallel instead of in series, which helps maintain the resolution of the spectrogram throughout the whole process. In our unified design, the blend of HRNet and LSTM blocks receives mixed spectrogram representations as input and predicts a mask for each source. The predicted mask is then multiplied with the input spectrogram to obtain an estimated spectrogram, which is transformed back into a signal using the inverse short-time Fourier transform. The signal-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifacts ratio (SAR) values, in decibels (dB), were used to measure the system's accuracy. We validated the HR-LSTM architecture using four datasets: NISVS, DSD100, MIR-1K, and the Korean traditional music (Pansori) dataset. Our experiments confirmed that the developed HR-LSTM outperforms state-of-the-art networks at singing voice separation when the DSD100 dataset is used and performs comparably when the MIR-1K dataset is used. To further boost performance, we combined the DSD100 and NISVS training datasets, and the test results were presented separately. We anticipate that the newly developed NISVS dataset will assist future researchers working on the voice separation problem, just as our HRNet will prove useful in applications that require voice separation.