1 Introduction

Pulmonary auscultation is one of the oldest techniques used in the diagnosis of the respiratory system. It is considered a safe, non-invasive, and cost-effective clinical method for monitoring the overall condition of the lungs and surrounding respiratory organs (Bardou et al. 2018; Andrès et al. 2018). Through a stethoscope, the sound of air moving into and out of the lungs during breathing can be auscultated through the chest wall, allowing a physiotherapist to identify pulmonary diseases such as asthma, pneumonia, or bronchiectasis (BRON) (Andrès et al. 2018; Pramono et al. 2019). According to the World Health Organization (WHO) report in 2017 (World Health Organization 2017a), more than 235 million people suffer from asthma worldwide. In addition, chronic obstructive pulmonary disease (COPD) is expected to be the third leading cause of death by 2030 (World Health Organization 2017b).

Lung sounds are either normal or abnormal. An abnormality in the auscultated sound usually indicates an inflammation, infection, obstruction, or fluid in the lungs. There are several types of abnormal (adventitious) lung sounds that are superimposed on normal sounds, including wheezes, stridor, rhonchi, and crackles (Sarkar et al. 2015; Andrès et al. 2018). Wheezes are high-pitched continuous sounds of more than 400 Hz that last for more than 80 ms and resemble a whistle. These sounds are due to an inflammation/narrowing of the bronchial tubes (Pramono et al. 2019). Similarly, stridor sounds are high-pitched sounds of more than 500 Hz lasting for over 250 ms. They usually originate from laryngeal or tracheal stenosis (Pasterkamp et al. 1997). Rhonchi are low-pitched continuous sounds similar to snoring, with frequencies below 200 Hz. They usually arise from fluid or mucus filling up the bronchial tubes (Sovijarvi et al. 2000). Crackles are discontinuous clicking or rattling sounds that are either fine (short duration) or coarse (long duration). These sounds are an indication of pneumonia or heart failure (Reichert et al. 2008). Other respiratory sounds include coughing, snoring, and squawking.

In general, lung sounds are acoustic signals with frequencies ranging between 100 Hz and 2 kHz (Gross et al. 2000). However, the human ear is sensitive to waves of 20 Hz to 20 kHz (Rosen and Howell 2011). With the traditional manual stethoscope, many diseases could be misdiagnosed or go undetected because the corresponding respiratory sounds cannot be heard. Thus, the auscultation process loses important information, carried by lower-frequency waves, about the condition of the respiratory organs. In addition, the diagnosis of pulmonary diseases is usually affected by the quality of the tool, the experience of the physician, and the surrounding environment (Shi et al. 2019). Therefore, the electronic stethoscope has gradually emerged as a replacement for traditional diagnostic tools. It can store lung sounds as signals on a computer, allowing medical doctors to investigate these signals through time-frequency analysis and interpret them better (Shi et al. 2019; Gurung et al. 2011). Furthermore, recent advances in signal processing and artificial intelligence assist clinicians in decision making when diagnosing respiratory diseases through lung sounds.

Numerous studies have covered the use of machine/deep learning algorithms in automatic respiratory disease identification and lung sound classification. In machine learning, many models have been utilized, including support vector machines (SVMs) (Jin et al. 2014), k-nearest neighbors (KNNs) (Serbes et al. 2013), naive Bayes classifiers (Naves et al. 2016), and artificial neural networks (ANNs) (Orjuela-Cañón et al. 2014). However, despite achieving high levels of performance, these methods require an additional feature extraction step for features such as time-domain and time-frequency-domain features (Chen et al. 2019), the Hilbert-Huang transform (HHT) (Serbes et al. 2013), mel-frequency cepstral coefficients (MFCCs) (Bahoura 2009), wavelet transform coefficients (Kahya et al. 2006; Orjuela-Cañón et al. 2014), and higher-order statistics (HOS) (Naves et al. 2016). Recently, deep learning algorithms have emerged that do not require any prior feature extraction procedure. These methods reduce the human error introduced by conventional algorithms, which may be due to patient-specific details and other data variations from patient to patient. In addition, deep learning has outperformed these methods in disease identification and lung sound classification (Jayalakshmy and Sudha 2020; Demir et al. 2020). In Aykanat et al. (2017) and Bardou et al. (2018), researchers utilized convolutional neural networks (CNNs) to classify respiratory sounds and showed that the CNN model achieved higher accuracy than typical machine learning models (i.e., SVM and KNN). Furthermore, in Messner et al. (2020), the authors designed a convolutional recurrent neural network (RNN) utilizing long short-term memory (LSTM) cells for multi-channel lung sound classification, achieving high levels of accuracy. Additionally, the combination of a CNN and an RNN into one model has been investigated in several studies and for various applications (Xuan et al. 2019; Passricha and Aggarwal 2019; Dubey et al. 2019).

Table 1 Demographic information of the subjects

1.1 Our contribution

In this paper, a study is conducted to investigate the ability of deep learning, illustrated by deep convolutional neural networks and long short-term memory units, in recognizing multiple pulmonary diseases from lung sounds signals (Fig. 1). The signals were obtained from recordings of electronic stethoscopes at a local hospital in Irbid, Jordan, in combination with a publicly available data-set. The recordings represent signals from patients suffering from asthma, pneumonia, BRON, COPD, and heart failure (HF) along with control (normal) patients. Each signal goes initially into a preprocessing procedure to ensure the best possible input to the deep learning network. The preprocessing steps include wavelet smoothing, displacement removal, and normalization. A CNN and bidirectional LSTM network (CNN + BDLSTM) was designed for the training and classification processes to extract information from both spatial and temporal domains of the signals. The training followed a tenfold cross-validation scheme to allow the maximum possible amount of data within the training model and to cover the whole data-set in the prediction process. Several evaluation metrics were used to evaluate the recognition of diseases using CNN and LSTM networks individually as well as a combination of both networks.

The main contribution of this work is the implementation of a BDLSTM network in addition to the normal CNN feature extraction approach. This adds to the learning efficiency of the network by extracting time-domain features from signals. To the best of our knowledge, deep learning models for lung sounds classification have usually been designed using a single neural network approach. Therefore, along with the proposed combined model (CNN + BDLSTM), the disease recognition ability was tested for the network when operating individually as either CNN or BDLSTM. Furthermore, the majority of studies in the literature have used signal spectrograms as 2-dimensional (2D) images, which increases the load on the system during the training and classification processes of the model. In contrast, the proposed study utilizes 1-dimensional (1D) signals with only a small portion (5 s) of the lung sound recordings. In addition, a novel stethoscope-based lung sound data-set was collected locally. This allowed for the inclusion of more types of pulmonary diseases, including asthma and HF, and provided a better analysis of the performance of the deep learning models over a wider range of lung sound characteristics. It is worth noting that this work does not implement any data augmentation techniques, which are considered less preferable in clinical studies. Instead, the locally recorded data-set was used to balance the classes, along with a weight-modified classification layer at the end of the trained model. Unlike the majority of previous studies, where features are manually extracted or the models are built with large neural networks, the proposed model was developed to ensure high levels of performance while remaining as simple as possible for use in clinical settings. The developed model accepts short lung sound signals (5 s) as input, which does not require considerable memory or computational overhead.

Fig. 1 The complete procedure followed in the proposed study

Fig. 2 Examples of lung sound signals from normal subjects and from patients with five types of respiratory diseases: a normal, b asthma, c pneumonia, d BRON, e COPD, f HF

2 Material and methods

2.1 Subjects

The selected signals were acquired from locally recorded lung sounds in addition to a publicly available data-set. The decision to combine two data-sets was to incorporate more patients with lung sounds corresponding to a wider range of respiratory diseases. The detailed demographic information of patients included in both data-sets is provided in Table 1.

The first group of signals was acquired locally at King Abdullah University Hospital, Jordan University of Science and Technology, Irbid, Jordan. The study was approved by the institutional review board (IRB: 35/117/218) at King Abdullah University Hospital. In addition, all participants reviewed the procedure of the study and provided written consent prior to any clinical examinations. The data-set included 103 participants (62 M, 41 F) of all age groups, of whom 35 had no respiratory abnormalities (normal), while 68 had pulmonary diseases including asthma, pneumonia, BRON, COPD, and HF. The acquisition protocol of lung sounds was performed by two professional thoracic clinicians. Each participant was asked to relax in a supine position prior to the recording of their breathing cycle sounds. The sounds were recorded using a single-channel electronic stethoscope (3M Littmann\(^{\circledR }\) model 3200) placed on either the upper, middle, or lower left/right chest wall locations. The electronic device provides built-in ambient and frictional noise reduction technology; however, a few recordings had slight movement artifacts. All signals were sampled at a frequency of 4 kHz and band-limited to a frequency range of 20 Hz to 2 kHz.

The second group of signals was obtained from the publicly available 2017 Int. Conf. on Biomedical Health Informatics (ICBHI’17) challenge online database (Rocha et al. 2017). The data-set includes lung sounds recorded by the School of Health Sciences, University of Aveiro (ESSUA) research team at the Respiratory Research and Rehabilitation Laboratory (Lab3R), ESSUA and at Hospital Infante D. Pedro, Aveiro, Portugal. In addition, another team from the Aristotle University of Thessaloniki (AUTH) and the University of Coimbra (UC) acquired respiratory sounds at the Papanikolaou General Hospital, Thessaloniki, and the General Hospital of Imathia (Health Unit of Naousa), Greece. The data-set covered 126 subjects with various types of respiratory diseases including pneumonia, BRON, and COPD. Each patient's lung sounds were recorded for a duration varying between 10 and 90 s, and all signals were re-sampled at a sampling frequency of 4 kHz.

In this study, only 110 patients were selected from the ICBHI’17 data-set. These patients were selected to complement the locally recorded data-set described earlier. In addition, it is worth noting that each recording from both data-sets was limited to 5 s so that it contains a complete respiratory cycle. No further data segmentation was applied to the recordings of either data-set. The normal breathing range in adults is 12–18 breaths per minute (Barrett et al. 2016); therefore, a single respiratory cycle (inspiration and expiration) for the slower breathing scenario takes 5 s, while the faster breathing scenario lasts up to 2 s (Lapi et al. 2014; Nuckowska et al. 2019). Thus, a segment of 5 s ensures coverage of both breathing rates without adding further signal data per patient that may cause extra complexity for the model. This approach has been widely applied in the literature (Zhang et al. 2016; Chen et al. 2016). Figure 2 shows examples of the 5 s segmented recordings corresponding to normal subjects and patients with respiratory diseases.
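The segment length can be checked with a short calculation. The sketch below is written in Python for illustration only (the study itself used MATLAB), and its function names are hypothetical. It converts a breathing rate into a cycle duration and counts the samples in one 5 s segment at the 4 kHz sampling rate reported above.

```python
SAMPLING_RATE_HZ = 4000   # sampling frequency reported for both data-sets
SEGMENT_SECONDS = 5       # segment length kept per recording


def cycle_duration_s(breaths_per_minute):
    """Duration of one respiratory cycle (inspiration + expiration) in seconds."""
    return 60.0 / breaths_per_minute


def samples_per_segment(rate_hz=SAMPLING_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Number of samples in one fixed-length segment."""
    return rate_hz * seconds


slowest = cycle_duration_s(12)       # cycle length at 12 breaths per minute
n_samples = samples_per_segment()    # samples in one 5 s segment
# Every rate in the normal adult range yields a cycle that fits in the segment.
fits = all(cycle_duration_s(r) <= SEGMENT_SECONDS for r in range(12, 19))
```

The resulting 20,000 samples per segment agree with the input dimensionality of the network described in Sect. 2.3.1.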

2.2 Preprocessing

As with any other electronically recorded biological signal, lung sound recordings are disturbed by acoustic noise arising from ambient sounds, background talking, electronic interference, or any displacement of the stethoscope (Emmanouilidou and Elhilal 2013). Therefore, it is highly important to ensure that the signals are smoothed and pre-processed prior to any feature extraction procedure. Several preprocessing steps were followed in the current study including:

  • 1D wavelet smoothing

  • Displacement artifact removal

  • z-score normalization

The following subsections provide a brief description of each preprocessing step.

2.2.1 1D wavelet smoothing

Wavelet transform (WT) has been widely used for the analysis of non-stationary signals. In contrast to the Fourier transform (FT), the signal is decomposed into a group of wavelets instead of complex sinusoids. The basic idea behind WT is to analyze a signal using translated and dilated versions of a mother wavelet (\(\varphi (t)\)) (Martínez et al. 2004). Mathematically, the translated and dilated wavelet is described as:

$$\begin{aligned} \varphi _{ab}(t) = |a|^{-1/2}\varphi \left( \frac{t-b}{a}\right) \end{aligned}$$

where t corresponds to the time instance, a is the dilation (scale) parameter, and b is the translation parameter.

WT is either continuous or discrete. In the continuous wavelet transform (CWT), the dilation and translation vary continuously, which increases the computational complexity. On the other hand, the discrete wavelet transform (DWT) operates with discretely sampled wavelets while preserving both the frequency and the location information in time (Saxena et al. 2002). Therefore, it is more efficient to analyze signals using the DWT than the CWT.

In DWT, the most commonly used orthonormal wavelets are those of the Daubechies (db) family. Besides the selection of a mother wavelet to decompose a signal, the thresholding function as well as the level of decomposition have to be specified. A well-known 1-dimensional (1D) DWT variant is the maximal overlap discrete wavelet transform (MODWT) (Chandra et al. 2018; Cornish et al. 2006). It differs from the basic DWT in being a highly redundant transform that retains, at each decomposition level, the values that the DWT would discard by down-sampling. The MODWT uses a pair of high-pass and low-pass filters to decompose an infinite sequence as:

$$\begin{aligned} \bar{W}_{k,t}= & {} \sum _{l=0}^{L_{k}-1} \bar{p}_{k,l}X_{t-l} \end{aligned}$$
$$\begin{aligned} \bar{V}_{k,t}= & {} \sum _{l=0}^{L_{k}-1} \bar{g}_{k,l}X_{t-l} \end{aligned}$$

where \(\bar{W}_{k,t}\) is the wavelet coefficient, \(\bar{V}_{k,t}\) is the scaling coefficient, \(\bar{p}_{k,l}\) and \(\bar{g}_{k,l}\) are the high-pass and low-pass filters, respectively, \(X_{t}\) is the infinite sequence, and k is the level of decomposition. A detailed explanation of the MODWT implementation is given in Cornish et al. (2006).
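As a minimal illustration of the two filtering sums above, the following Python sketch computes a single MODWT level. It makes two assumptions that are not taken from the study: it uses the two-tap Haar filters rather than db5 to keep the example short, and it treats the finite signal as periodic (circular filtering) in place of an infinite sequence. The function name is hypothetical.

```python
# Level-1 MODWT Haar filters (the DWT Haar filters rescaled by 1/sqrt(2)).
HIGH_PASS = [0.5, -0.5]   # wavelet filter
LOW_PASS = [0.5, 0.5]     # scaling filter


def modwt_level1(x):
    """Return (wavelet, scaling) coefficients using circular filtering."""
    n = len(x)
    taps = len(HIGH_PASS)
    wavelet = [sum(HIGH_PASS[l] * x[(t - l) % n] for l in range(taps)) for t in range(n)]
    scaling = [sum(LOW_PASS[l] * x[(t - l) % n] for l in range(taps)) for t in range(n)]
    return wavelet, scaling


signal = [2.0, 4.0, 1.0, 3.0, 5.0]
w, v = modwt_level1(signal)
# No down-sampling: both outputs keep the length of the input,
# and the signal energy is split between the two coefficient sets.
energy_in = sum(s * s for s in signal)
energy_out = sum(c * c for c in w) + sum(c * c for c in v)
```

For the Haar case, adding the wavelet and scaling coefficients sample by sample returns the original signal, which makes the redundancy of the transform easy to see.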

In this work, wavelet smoothing was based on a level-4 MODWT with soft thresholding and a db5 mother wavelet. The smoothing was performed using the wden() function of the MATLAB R2020a signal processing toolbox.

2.2.2 Displacement artifact removal

Any displacement of the electronic stethoscope causes a wave-shaped low-frequency signal on top of the useful signal. Therefore, it is essential to ensure that the signals are not contaminated by any of these shapes (Zheng et al. 2020a). To achieve this, a local polynomial regression smoother (LOESS) function was utilized for its speed and high performance in removing such effects. In LOESS, the signal is fitted with a weighted least-squares function, where points closer to the fitted line receive higher weights and vice versa. The weight function is given as:

$$\begin{aligned} w(x) = (1-|d|^{3})^{3} \end{aligned}$$

where d is the distance of each point to the fitted curve, scaled to be between 0 and 1. Furthermore, a robust version of the algorithm (rLOESS) assigns zero weight to points lying outside six mean absolute deviations. A complete mathematical explanation of these two methods can be found in Cleveland (1979).
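The weight function can be evaluated directly. The short Python sketch below (the function name is hypothetical) shows how the weight falls from 1 at zero distance to 0 at the edge of the scaled range; it covers only the weighting step, not the full LOESS fit.

```python
def tricube_weight(d):
    """LOESS weight (1 - |d|^3)^3 for a scaled distance d with |d| <= 1."""
    d = abs(d)
    if d >= 1.0:
        return 0.0   # points at or beyond the edge of the window get no weight
    return (1.0 - d ** 3) ** 3


weights = [tricube_weight(d) for d in (0.0, 0.5, 1.0)]
```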

2.2.3 Z-score normalization

After the aforementioned signal preprocessing steps, it is essential to ensure that each signal is z-score normalized. After z-score normalization, the time-domain signal no longer has a wide dynamic range between its values. In other words, no larger trends in the signal dominate the smaller ones (Zhang et al. 2017; Yannick et al. 2019). The normalized signal z therefore exhibits a mean value of 0 and a standard deviation of 1, and is obtained from the original signal x, with mean \(\mu\) and standard deviation \(\sigma\), as follows:

$$\begin{aligned} z = \frac{x - \mu }{\sigma } \end{aligned}$$

Having a clear signal with no trend variations across time maintains a better performance in deep learning algorithms (Yannick et al. 2019).
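A minimal Python sketch of this normalization step is given below. It assumes the population standard deviation; the study does not state which estimator was used, so this choice and the function name are illustrative only.

```python
import statistics


def zscore(signal):
    """Return the signal shifted to zero mean and scaled to unit standard deviation."""
    mu = statistics.fmean(signal)
    sigma = statistics.pstdev(signal)   # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("a constant signal cannot be z-score normalized")
    return [(x - mu) / sigma for x in signal]


# Arbitrary sample values, for illustration only.
normalized = zscore([0.2, -0.1, 0.4, 0.9, -0.4])
```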

2.3 Training and classification

The selected model for training and classification is a combination of a 1D convolutional neural network and a bidirectional long short-term memory network (CNN + BDLSTM). The CNN extracts features regarding the overall spatial dimensionality of the signal. The LSTM, on the other hand, captures features according to variations in the time domain. Combining both networks usually yields better performance in training a model to predict signals based on their spatial and time-domain characteristics (Zheng et al. 2020b). Furthermore, the performance of the model was evaluated for the combined network as well as when operating individually as CNN or BDLSTM. The following subsections describe the layers of each network prior to the training process (Fig. 3).

2.3.1 1D CNN architecture

CNN is one of the most commonly used artificial neural networks for feature extraction and classification. It is considered a feed-forward network with translational and rotational invariance for analyzing visual imagery (Radzi and Khalil-Hani 2011; Schmidhuber 2015). In a CNN, multiple dot products (convolutions) are applied to a signal \(x_{n}^{0} = [x_{1}, x_{2}, \ldots , x_{N}]\), where N is the total number of points, as per the following equation:

$$\begin{aligned} c_{n}^{uj} = h_a\left( b_{j}+\sum _{m=1}^{M}w_{m}^{j}x_{n+m-1}^{j}\right) \end{aligned}$$

where u is the layer index, \(h_a\) is the activation function, \(b_j\) is the bias of the jth feature map, M is the kernel size, and \(w_{m}^{j}\) is the mth filter weight of the jth feature map.
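The convolution equation above can be sketched for a single filter as follows. This Python example assumes a ReLU activation and no zero-padding, so only positions where the kernel fully overlaps the signal are computed; the function names and the numerical values are illustrative.

```python
def relu(value):
    """Rectified linear unit, used here as the activation h_a."""
    return max(0.0, value)


def conv1d_feature_map(x, weights, bias, activation=relu):
    """Apply one 1D filter to x without padding (valid positions only)."""
    m = len(weights)   # kernel size M
    return [
        activation(bias + sum(weights[k] * x[n + k] for k in range(m)))
        for n in range(len(x) - m + 1)
    ]


feature_map = conv1d_feature_map([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], bias=0.5)
clipped = conv1d_feature_map([1.0, 2.0, 3.0, 4.0], [1.0, -1.0], bias=0.0)
```

In the second call every weighted sum is negative, so the ReLU replaces each output with zero.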

The architecture of the CNN developed herein (Fig. 4) included a set of 1D convolutional, batch normalization (BN), rectified linear unit (ReLU), and max-pooling layers. Initially, a 1D input layer was used to accept input data of dimensionality [20,000,1]. Then, a set of three 1D convolutional layers (Conv1D) was used, each with a kernel size of [32,1], 16 filters, and a stride of [1,1]. Each Conv1D layer was followed by a BN layer and a ReLU, except for the last Conv1D layer, which was followed first by a max-pooling layer. The BN layer guarantees normalized inputs across the filters on each mini-batch during training, whereas the ReLU sets a threshold that replaces negative values with zero. Furthermore, 30\(\%\) dropout was added after the first two ReLU layers to prevent over-fitting of the model. The max-pooling layer was designed with a kernel size of [2,1] and a stride of [2,1] to reduce the dimensionality of the extracted features as well as the computational complexity. The network was implemented using MATLAB R2020a deep learning toolbox.
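To give a sense of how these layer settings shape the data, the Python sketch below tracks the signal length through the three convolutional layers and the max-pooling layer. It assumes no zero-padding, which the text does not specify; with "same" padding the length would stay at 20,000 until pooling. The helper name is hypothetical.

```python
def output_length(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


length = 20000                    # input samples per recording
for _ in range(3):                # three Conv1D layers, kernel 32, stride 1
    length = output_length(length, kernel=32, stride=1)
after_conv = length
after_pool = output_length(after_conv, kernel=2, stride=2)   # max pooling
```

Under this assumption each convolution shortens the signal by 31 samples and the pooling layer roughly halves it, with 16 feature maps at every stage.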

2.3.2 LSTM units

LSTM is a type of recurrent neural network (RNN) that is structured around a main functioning cell. The cell is connected to three units, namely the input (i), output (o), and forget (f) gates. The cell is responsible for managing the temporal information flow within the network, while the gates control the flow of information into and out of the whole unit (Hochreiter and Schmidhuber 1997; Fernandez et al. 2014). An LSTM network can process information in the forward direction (unidirectional) or in both the forward and backward directions (bidirectional), as shown in Fig. 5. The latter is most commonly used for signal analysis where time is considered a critical factor in the learning process.

Mathematically, the output of the main functioning cell C at any point of time t is given as:

$$\begin{aligned} C_t = f_t C_{t-1} + i_t c_t \end{aligned}$$

where \(f_t\) is the forget gate activation, \(i_t\) is the input gate activation, and \(c_t\) is the input to the main cell. The hidden-unit activation \(h_t\) is then obtained from the cell output as:

$$\begin{aligned} h_t = o_t \tanh (C_t) \end{aligned}$$

where \(o_t\) is the output gate activation. Each gate is defined using a sigmoid function \(\sigma ()\), and the cell input using a hyperbolic tangent, based on the following equations:

$$\begin{aligned} i_t= & {} \sigma (W_{xi}x_t + W_{hi}h_{t-1} + W_{ci}C_{t-1} + b_i) \end{aligned}$$
$$\begin{aligned} f_t= & {} \sigma (W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}C_{t-1} + b_f) \end{aligned}$$
$$\begin{aligned} o_t= & {} \sigma (W_{xo}x_t + W_{ho}h_{t-1}+ W_{co}C_{t} + b_o) \end{aligned}$$
$$\begin{aligned} c_t= & {} \tanh (W_{xc}x_t + W_{hc}h_{t-1} + b_c) \end{aligned}$$

where \(W_{x*}\) are the input-to-gate weights, \(W_{h*}\) are the hidden-to-hidden weights, and \(W_{c*}\) are the peephole weights.
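The gate equations can be traced with a scalar example. The Python sketch below performs one step of a peephole LSTM for a single hidden unit; it assumes that the peephole terms and the output nonlinearity act on the cell state C, which is one reading of the notation above. The dictionary layout and function names are illustrative, not the study's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, C_prev, W, b):
    """One peephole LSTM step for a scalar input and a single hidden unit.

    W and b hold scalar weights and biases keyed like the subscripts in
    the equations, e.g. W["xi"], W["hi"], W["ci"], b["i"].
    """
    i_t = sigmoid(W["xi"] * x_t + W["hi"] * h_prev + W["ci"] * C_prev + b["i"])
    f_t = sigmoid(W["xf"] * x_t + W["hf"] * h_prev + W["cf"] * C_prev + b["f"])
    c_t = math.tanh(W["xc"] * x_t + W["hc"] * h_prev + b["c"])   # input to the cell
    C_t = f_t * C_prev + i_t * c_t                               # updated cell state
    o_t = sigmoid(W["xo"] * x_t + W["ho"] * h_prev + W["co"] * C_t + b["o"])
    h_t = o_t * math.tanh(C_t)                                   # hidden-unit output
    return h_t, C_t


keys = ["xi", "hi", "ci", "xf", "hf", "cf", "xo", "ho", "co", "xc", "hc"]
W = {k: 0.0 for k in keys}   # all-zero weights make the step easy to verify by hand
b = {k: 0.0 for k in "ifoc"}
h_t, C_t = lstm_step(x_t=1.0, h_prev=0.0, C_prev=1.0, W=W, b=b)
```

With all weights and biases at zero, every gate equals 0.5 and the cell input is zero, so the cell state is simply halved at each step.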

To process the information in the forward and backward directions, the bidirectional LSTM output \(y_t\) is defined as:

$$\begin{aligned} y_t = W_{\overrightarrow{h}y} \overrightarrow{h^N} + W_{\overleftarrow{h}y} \overleftarrow{h^N} + b_y \end{aligned}$$

where \(\overrightarrow{h^N}\) and \(\overleftarrow{h^N}\) are the hidden layers output in the forward and backward directions, respectively, for all N levels of stack.

In this work, a total of 100 hidden-units were utilized within a bidirectional LSTM model. Therefore, a total of 200 units in both directions were utilized during training. At the end of the LSTM network, an additional dropout layer of 20\(\%\) was added to reduce over-fitting of the model. The network was designed using MATLAB R2020a deep learning toolbox.

2.3.3 Training parameters

The training followed a tenfold cross-validation scheme to ensure that the whole data-set was covered in the prediction process. A mini-batch size of 64 was selected and training ran for 5 epochs. The solver was based on stochastic gradient descent with momentum (SGDM) optimization (Qian 1999). The initial learning rate was set to the default value of 0.01, with an L2-regularization of 0.0001.
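A tenfold split of this kind can be sketched as follows. The Python example below uses a plain shuffled split without stratification and a placeholder sample count of 100; neither detail is taken from the study, and the function name is hypothetical.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(kfold_indices(n_samples=100, k=10))
tested = sorted(idx for _, test in splits for idx in test)
```

Each sample appears in exactly one test fold, so the predictions collected over the ten folds cover the whole data-set.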

To handle data imbalance, a weight-modified classification layer was added to the end of the CNN-BDLSTM network. This layer is capable of handling the sum of squares error (SSE) loss (Xu et al. 2014; Ali et al. 2015) when data labels are not uniformly distributed or equally split. The weight of each class \(w_{d}\) was determined by the following:

$$\begin{aligned} w_{d} = 1 - \frac{N_{d}}{N_{t}} \end{aligned}$$

where \(N_{d}\) and \(N_{t}\) are the number of samples per class and in total, respectively.
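As a small worked example of the class-weight formula, the Python sketch below computes the weights for a set of hypothetical class counts; the counts are illustrative and are not the study's data.

```python
def class_weights(counts):
    """Map each class to w_d = 1 - N_d / N_t."""
    total = sum(counts.values())
    return {label: 1.0 - n / total for label, n in counts.items()}


# Hypothetical class counts, for illustration only.
weights = class_weights({"normal": 50, "asthma": 30, "COPD": 20})
```

Classes with fewer samples receive larger weights, so their errors contribute more to the loss.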

Fig. 3 Convolutional neural network and bidirectional long short-term memory (CNN + BDLSTM) model architecture

Fig. 4 The structure of the convolutional neural network (CNN) designed in the proposed study

Fig. 5 The structure of the bidirectional long short-term memory (BDLSTM) designed in the proposed study

2.4 Performance evaluation

To evaluate the performance of the developed model, several metrics were included in this study to analyze the classification confusion matrix. The confusion matrix was generated sequentially after every fold, and all evaluation metrics were calculated from the overall confusion matrix after the tenfold cross-validation of the training/classification scheme. The first parameter is Cohen’s kappa \(\kappa\) (Cohen 1960), which is a strong indicator of the degree of agreement between the original labels and predicted ones given as:

$$\begin{aligned} \kappa = \frac{P_0 - P_c}{1 - P_c} \end{aligned}$$

where \(P_0\) is the proportion of observed agreement and \(P_c\) is the proportion of agreement expected by chance.
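The kappa calculation can be sketched from a confusion matrix as shown below. The Python example assumes that rows hold the original labels and columns the predicted ones, and the two-class matrix is hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: original, cols: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_observed = sum(confusion[i][i] for i in range(k)) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the marginal proportions, summed over classes.
    p_chance = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical two-class confusion matrix, for illustration only.
kappa = cohen_kappa([[45, 5], [10, 40]])
```

Here the observed agreement is 0.85 and the chance agreement is 0.50, which gives a kappa of 0.70.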

Furthermore, the standard evaluation metrics were obtained from the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) including the accuracy, sensitivity, specificity, precision, and F1-score. These metrics are given by:

$$\begin{aligned} Accuracy= & {} \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
$$\begin{aligned} Sensitivity= & {} \frac{TP}{TP + FN} \end{aligned}$$
$$\begin{aligned} Specificity= & {} \frac{TN}{TN + FP} \end{aligned}$$
$$\begin{aligned} Precision= & {} \frac{TP}{TP + FP} \end{aligned}$$
$$\begin{aligned} F1= & {} \frac{2TP}{2TP + FP + FN} \end{aligned}$$
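For a multi-class problem, these counts are commonly obtained per class in a one-vs-rest manner; the study does not spell out this step, so the Python sketch below is an assumption about the procedure. The three-class matrix is hypothetical, with rows as original labels and columns as predictions.

```python
def one_vs_rest_metrics(confusion, cls):
    """Per-class metrics from a confusion matrix (rows: original, cols: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                          # missed samples of cls
    fp = sum(confusion[i][cls] for i in range(k)) - tp     # other classes predicted as cls
    tn = total - tp - fn - fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical three-class confusion matrix, for illustration only.
matrix = [[8, 1, 1], [2, 7, 1], [0, 0, 10]]
metrics = one_vs_rest_metrics(matrix, cls=0)
```

Averaging each metric over the classes then gives one summary value per model.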

3 Results

The preprocessing results are shown in Fig. 6, where only a 2.5 s segment is displayed for visual clarity. The developed algorithm successfully removes the disturbances introduced into the signal by surrounding noise sources. In addition, the very low-frequency components, which carry any displacement or movement artifacts, are extracted in part (b). The final signal prior to the training and classification process is shown in part (d) after z-score normalization.

Fig. 6 The preprocessing of a selected lung sound signal (2.5 s segment) showing: a original signal, b MODWT wavelet smoothing, c rLOESS displacement removal, d z-score normalization

Fig. 7 The confusion matrix and per-class precision percentage in respiratory diseases recognition using: a BDLSTM, b CNN, c CNN + BDLSTM

Table 2 Performance comparison, expressed in \(\%\), of different neural networks (in the order BDLSTM/CNN/CNN + BDLSTM) based on tenfold cross validation

3.1 Performance of the proposed model

The proposed method was implemented entirely in MATLAB R2020a (Mathworks, Inc.). The experiments were conducted on an Intel processor (i7-9700) with 32 GB of RAM. The training process was performed on an NVIDIA GeForce GTX 1070 graphics processing unit (GPU) with 8 GB of display memory (VRAM). Each fold was trained for 2 minutes, yielding a total of around 20 minutes to complete the whole training/classification scheme. The per-patient class prediction took less than a second under the aforementioned machine specifications.

To evaluate the performance of the algorithm, the tenfold confusion matrices of the original versus predicted diseases are shown in Fig. 7 for the CNN, BDLSTM, and CNN + BDLSTM neural networks. The highest average precision of the classification process was obtained for the CNN + BDLSTM model with 98.85\(\%\), split among the classes as 98.80\(\%\), 95.60\(\%\), 98.80\(\%\), 100\(\%\), 99.00\(\%\), and 100\(\%\) for normal, asthma, pneumonia, BRON, COPD, and HF, respectively. The BDLSTM and CNN, when operating separately, had average precision values of 92.15\(\%\) and 96.88\(\%\), respectively, both lower than that of the combined network.

Furthermore, using these confusion matrices, the performance metrics described in Sect. 2.4 were extracted and evaluated. The complete evaluation metrics per class are shown in Table 2. In addition, the table shows the average value of each metric for the three deep learning models. The agreement between the predicted diseases and the original ones (Cohen’s \(\kappa\)) reached an average value of 98.26\(\%\), with an accuracy of 99.62\(\%\), using the CNN + BDLSTM model. The highest sensitivity values were obtained for the recognition of pneumonia, with 93.98\(\%\), 100\(\%\), and 100\(\%\) for the BDLSTM, CNN, and CNN + BDLSTM models, respectively. In addition, the prediction of this disease had specificity values of 99.16\(\%\), 98.63\(\%\), and 99.85\(\%\). It is worth noting that HF had the highest classification performance, as all of its signals were predicted correctly. Overall, the classification had sensitivity values of 90.06\(\%\), 93.02\(\%\), and 98.43\(\%\) for the BDLSTM, CNN, and CNN + BDLSTM networks, respectively.

The performance of the proposed CNN + BDLSTM model is reported relative to other state-of-the-art studies in the literature (Table 3). The summary table covers the five most recent studies, published between 2017 and 2020, that implemented deep learning for lung sounds classification. These studies used stethoscope recordings from the ICBHI’17 database. Each study utilized different approaches for processing the recordings, such as signal segmentation and data augmentation. Most of these studies required a preprocessing step that converts the signals into their corresponding spectrogram images as an input to the deep learning model. Models such as SVM, CNN, and a VGG network combined with a bidirectional gated recurrent unit (VGG-BDGRU) were used in these studies, and their performance metrics, including the accuracy, sensitivity, and specificity, are reported accordingly. The proposed approach had the highest levels of accuracy relative to the other models. However, it is worth noting that each study used a different number of recordings to classify a different number of classes (diseases or lung sounds).

Table 3 Summary table of recent studies found in literature for the use of machine/deep learning approaches in lung sounds classification

4 Discussion

In this study, an investigation was carried out on the use of deep learning models, as illustrated by the combination of CNN and BDLSTM neural networks, in identifying pulmonary diseases. The developed model achieved high levels of performance (sensitivity/specificity of 98.43\(\%\)/99.69\(\%\)), which paves the way towards implementing deep learning in clinical settings.

4.1 Preparation of lung sounds

As shown in Fig. 6, the preprocessing steps ensured the use of an improved version of the lung sound signals within the layers of the training network. In practice, the recorded raw acoustic signal includes several types of unwanted acoustic components such as acoustic noise, displacement noise, cardiac sounds, and background sounds. Thus, the training of deep learning models may be negatively affected by such disturbances of the useful signal. Furthermore, most studies found in the literature implement data augmentation techniques within their proposed approaches as an important preprocessing step. However, this may lead to an unstable model due to the creation of unrealistic signal recordings. In this study, no data augmentation techniques were applied; instead, the data-set was balanced by the use of the locally recorded data-set. In addition to balancing the classes within the model, it allowed for the inclusion of a larger set of subjects and disease types.

4.2 Analysis of the CNN + BDLSTM network

Instead of using only a single CNN or RNN model, as is common in the literature, the proposed network architecture combines both models into a single structure. Therefore, the spatial representation of the signals, as well as the temporal dynamics, were extracted as feature vectors by the network itself. The CNN provided spatial features through its convolutional filters, whereas the BDLSTM further extracted temporal features through its memory cells. This is a clear advantage over conventional machine learning approaches, which require external feature extraction from signals for both spatial and temporal dynamics. The CNN model included three convolutional layers followed by a max-pooling layer. The features embedded within each filter were used as inputs to the bidirectional LSTM to learn temporal features in both the forward and backward directions. It has been shown previously that a bidirectional LSTM performs better for time-domain signal classification problems (Graves et al. 2005; Fraiwan and Alkhodari 2020).

The high disease recognition performance of the developed deep learning model suggests that it could be an important tool in clinical settings. As shown in Fig. 7, BRON and HF lung sound signals were detected perfectly. This suggests that both diseases have their own signal characteristics in both the spatial and temporal representations. Furthermore, it is worth noting that several signals from the COPD patients were misclassified as either normal or asthma. However, the detection error for every disease is low (i.e., ±0.36), as calculated from the standard deviation of the averaged accuracy in Table 2.

To further illustrate the performance of the best-performing network (i.e., CNN + BDLSTM), Fig. 8 shows three correctly classified signals and three misclassified signals from the normal, asthma, and pneumonia patients. For the correctly identified signals (a, c, and e), the model was confident in its decision, with probabilities of more than 89%. However, for the wrongly classified signals (b, d, and f), the model had probabilities of less than 60% when making the decisions.

Fig. 8 Examples of three correctly classified signals and three misclassified signals, along with the prediction probabilities, using the best-performing deep learning model (CNN + BDLSTM)
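The prediction probabilities discussed above can be read as a confidence score for each decision. The following is a minimal sketch of that idea, assuming the final classification layer produces one raw score per class that is normalized with a softmax function (an assumption, as the output layer is not detailed in this section). The function names, the example scores, and the 60% review threshold are hypothetical and chosen only to mirror the probabilities reported for Fig. 8.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits, class_names, review_threshold=0.60):
    """Return (label, probability, needs_review) for one recording.

    review_threshold is a hypothetical cut-off: predictions whose top
    probability falls below it are flagged for manual review.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best], probs[best] < review_threshold

# Illustrative subset of classes and made-up scores (not model outputs).
CLASSES = ["normal", "asthma", "pneumonia"]
confident = predict_with_confidence([4.0, 0.5, 0.2], CLASSES)
uncertain = predict_with_confidence([1.0, 0.8, 0.3], CLASSES)
```

Under these assumptions, a recording with one dominant score yields a high top probability, whereas a recording with similar scores across classes yields a low one and is flagged, which matches the pattern seen for the misclassified signals.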

Furthermore, the type-1/type-2 errors show that COPD was the most frequently misclassified class under the best-performing network, the CNN + BDLSTM (Fig. 7c). For type-1 errors, it had misclassifications with the normal, asthma, and pneumonia classes. For type-2 errors, on the other hand, it had misclassifications with the normal, asthma, and BRON classes. It is worth noting that HF lung sounds had no errors of either type.
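As an illustration of how per-class type-1 and type-2 errors can be read from a confusion matrix, the sketch below assumes the usual convention in which type-1 errors are false positives (recordings of other classes predicted as the class) and type-2 errors are false negatives (recordings of the class predicted as something else). The class subset and all counts are hypothetical and are not the values of Fig. 7.

```python
def per_class_errors(confusion, class_names):
    """Count false positives (type-1) and false negatives (type-2) per class.

    confusion[i][j] is the number of recordings whose true class is i
    and whose predicted class is j.
    """
    n = len(class_names)
    errors = {}
    for k, name in enumerate(class_names):
        # Column k without the diagonal: other classes predicted as class k.
        false_pos = sum(confusion[i][k] for i in range(n) if i != k)
        # Row k without the diagonal: class k predicted as another class.
        false_neg = sum(confusion[k][j] for j in range(n) if j != k)
        errors[name] = {"type1": false_pos, "type2": false_neg}
    return errors

# Hypothetical counts for illustration only.
CLASSES = ["normal", "asthma", "COPD", "HF"]
CONFUSION = [
    [50, 0, 1, 0],
    [0, 40, 1, 0],
    [1, 2, 30, 0],
    [0, 0, 0, 20],
]
errors = per_class_errors(CONFUSION, CLASSES)
```

In this made-up example, COPD accumulates errors of both types while HF has none, which is the kind of pattern described above.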

In addition, previous works in the literature implemented several data augmentation techniques to balance the training process. In contrast, the work presented herein does not follow this approach, as augmenting signals is generally not recommended in clinical research. Instead, the addition of a locally recorded data-set, along with the weight-modified classification layers, ensured a more balanced training process. This led to high levels of accuracy relative to recent state-of-the-art studies (Table 3), with training data that are much more reliable for clinical investigation.
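One common way to realize a weight-modified classification layer is a class-weighted cross-entropy loss, in which errors on rarer classes are penalized more heavily. The sketch below assumes inverse-frequency weights; this weighting scheme, the function names, and the class counts are assumptions for illustration and are not necessarily the scheme used in this study.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count) for name, count in class_counts.items()}

def weighted_cross_entropy(probs, true_label, weights):
    """Loss for one recording: -w[true] * log(p[true])."""
    return -weights[true_label] * math.log(probs[true_label])

# Hypothetical class counts, not the data-set sizes of this study.
COUNTS = {"normal": 100, "asthma": 50, "COPD": 50}
weights = inverse_frequency_weights(COUNTS)

# The same predicted probability (0.7) for the true class costs twice as
# much when the true class is half as frequent.
loss_common = weighted_cross_entropy(
    {"normal": 0.7, "asthma": 0.2, "COPD": 0.1}, "normal", weights
)
loss_rare = weighted_cross_entropy(
    {"normal": 0.2, "asthma": 0.7, "COPD": 0.1}, "asthma", weights
)
```

With such weights, under-represented classes contribute more to the loss per recording, which counteracts class imbalance without creating synthetic signals.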

4.3 Clinical relevance

Clinically, the proposed study supports accurate recognition of respiratory diseases from lung sounds. Unlike the traditional stethoscope, where diseases are diagnosed manually, electronic lung sound recordings combined with a deep learning predictive model can reduce errors in disease detection. Therefore, many clinical decisions can be positively affected, helping to prevent further development of the diseases. Furthermore, although manual diagnosis may be correct in some circumstances, it is clinically desirable to build a model that is capable of detecting small variations in signals across patients. Many basic approaches, including threshold levels or feature extraction, have been implemented. However, they may be strongly affected by patient-specific variation within the same disease. Thus, a deep learning model that is capable of learning automatically from a large number of features could enrich the diagnosis process and act as a supportive decision maker in clinical settings.

5 Conclusion and future work

In this paper, a deep learning model based on convolutional neural networks (CNNs) and bidirectional long short-term memory (BDLSTM) was used to classify lung sounds. The classification of lung sounds into multiple respiratory diseases using this model had an overall average accuracy of 99.62% with a Cohen’s kappa value of 98.26%. This study paves the way towards implementing trained deep learning models in clinical settings to assist clinicians in decision making.
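For reference, Cohen's kappa corrects the observed agreement between predicted and true labels for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from a confusion matrix; the two-class counts are hypothetical and serve only to illustrate the calculation.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of recordings on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical two-class example: 85% observed agreement, 50% expected by chance.
kappa = cohen_kappa([[45, 5], [10, 40]])
```

In this example the observed agreement is 0.85 and the chance agreement is 0.50, so kappa is approximately 0.70, lower than the raw accuracy because part of the agreement is attributable to chance.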

Future work will focus on increasing the size of the data-set to include more subjects and a wider range of diseases, such as COVID-19. This will improve the credibility of the proposed model. Although the current classification model achieves high performance metrics, it may be further improved by adjusting both the preprocessing techniques and the training structure.