1 Introduction

In recent years, lung diseases have become the third leading cause of death globally (Lehrer 2018). According to World Health Organization (WHO) statistics, the five main lung diseases (Moussavi 2006) are tuberculosis, lung cancer, chronic obstructive pulmonary disease (COPD), asthma, and acute lower respiratory tract infection (LRTI). These diseases are responsible for the deaths of more than 3 million people each year worldwide (Chang and Cheng 2008; Chang and Lai 2010). They place a severe burden on the overall healthcare system and, at the same time, profoundly affect the lives of the general population. As with other serious diseases, prevention is key to limiting their impact, and diagnosis and treatment at early stages are considered decisive factors in containing the negative effects of these deadly conditions. Auscultation of the lungs using a stethoscope is the traditional and most widely used diagnostic method, regularly employed by specialists and general practitioners to form an initial assessment of the condition of the respiratory system.

The sound of the lungs can be normal or abnormal. An irregularity in the auscultated sound typically denotes lung fluid, infection, inflammation, or blockage (Chang and Lai 2010; Sengupta et al. 2016). Many different types of anomalous (adventitious) lung sounds, such as wheezes, stridor, rhonchi, and crackles, are superimposed on the normal sounds. Wheezes are whistle-like, high-pitched continuous sounds lasting longer than 80 ms (Naves et al. 2016); they are caused by irritation or constriction of the bronchial tubes. Similarly, stridor sounds are high-pitched waves lasting longer than 250 ms and exceeding 500 Hz; they typically develop as a result of tracheal or laryngeal stenosis. Rhonchi are low-pitched, continuous, snoring-like sounds with frequencies under 200 Hz (Naves et al. 2016; Bardou et al. 2018; Palaniappan et al. 2014); they typically arise when the bronchial passages become filled with liquid or mucus. Crackles are abrupt clicking or rattling noises that can be either fine (short duration) or coarse (long duration) (Palaniappan et al. 2014); they are a sign of heart failure or pneumonia. Coughing, snoring, and squawking are other respiratory noises. Lung sounds are acoustic waves with frequencies typically between 100 Hz and 2 kHz, whereas the human ear is sensitive only to waves between 20 Hz and 20 kHz (Rocha et al. 2017; Aykanat et al. 2017). Because the classic manual stethoscope cannot pick up the respiratory sounds characteristic of numerous ailments, these conditions may be misdiagnosed or go unnoticed (Serbes et al. 2013). As a result, crucial information about the health of the respiratory organs carried by lower-frequency waves is lost during the auscultation procedure (Bahoura 2009). Additionally, the quality of the tool, the expertise of the physician, and the setting can all affect the diagnosis of lung disorders. Consequently, electronic stethoscopes have progressively emerged to take the place of conventional diagnostic equipment (Bahoura 2009; Icer and Gengec 2014). They can record lung sounds as signals on a computer, enabling medical professionals to analyze these signals more accurately using time–frequency analysis. Furthermore, recent developments in artificial intelligence and signal processing help clinicians make decisions when identifying respiratory disorders through lung sounds (Icer and Gengec 2014; Jin et al. 2014).
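To make these duration and frequency criteria concrete, the toy sketch below tags a sound event from its duration and peak frequency. The thresholds follow the text above; treating "high-pitched" as above roughly 400 Hz for wheezes is our own assumption for illustration, not a clinical rule.

```python
# Toy illustration of the duration/frequency criteria described above.
# Assumption: "high-pitched" for wheezes is approximated here as > 400 Hz;
# real auscultation analysis is far more nuanced than these rules.
def label_adventitious(duration_ms: float, peak_hz: float) -> str:
    if peak_hz > 500 and duration_ms > 250:
        return "stridor"   # high-pitched, > 500 Hz, longer than 250 ms
    if peak_hz > 400 and duration_ms > 80:
        return "wheeze"    # whistle-like, continuous, longer than 80 ms
    if peak_hz < 200:
        return "rhonchi"   # snoring-like, below 200 Hz
    return "other"

print(label_adventitious(duration_ms=120, peak_hz=600))  # -> "wheeze"
```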

Moreover, physicians employ complementary strategies such as oxygen saturation (SpO2) measurement using plethysmography, spirometry, and arterial blood gas analysis, but lung sound auscultation remains vital due to its simplicity and low cost (Sengupta et al. 2016). Traditionally, lung sounds are collected by auscultation with a stethoscope (Lehrer 2018; Moussavi 2006; Chang and Cheng 2008). This method is noninvasive, quick, and harmless to the patient, but it may lead to a wrong diagnosis if the physician is not well trained in its use (Lehrer 2018; Moussavi 2006). Lung sounds are also non-stationary, which makes their analysis and recognition complex (Moussavi 2006). It is therefore necessary to develop automatic recognition systems that overcome the limitations of the traditional techniques and ensure more efficient clinical diagnosis (Chang and Lai 2010; Sengupta et al. 2016; Naves et al. 2016). In general, lung sounds fall into two types: normal, when the lung has no respiratory disorder, and adventitious, when it does (Lehrer 2018; Moussavi 2006; Chang and Cheng 2008; Chang and Lai 2010; Sengupta et al. 2016; Naves et al. 2016). Respiratory disorders have become a common problem in all parts of the world. Smoking is their most common cause (Lehrer 2018; Moussavi 2006), but they can also be caused by genetics and environmental exposure (Chang and Cheng 2008). Adventitious lung sounds fall into many categories, including fine crackles, coarse crackles, polyphonic wheezes, monophonic wheezes, squawks, stridor, and pleural rub (Lehrer 2018; Moussavi 2006; Chang and Cheng 2008; Chang and Lai 2010; Sengupta et al. 2016; Naves et al. 2016; Bardou et al. 2018).

In recent years, a large number of research approaches have been developed and evaluated for the automatic detection and classification of lung abnormalities from auscultation sounds. Researchers have also proposed many feature extraction techniques used in combination with different machine learning (ML) algorithms for lung sound classification (Bardou et al. 2018; Palaniappan et al. 2014; Rocha et al. 2017; Aykanat et al. 2017). These techniques are mainly applied to build models that find better representations for large-scale unlabeled data (Palaniappan et al. 2014; Rocha et al. 2017; Aykanat et al. 2017; Serbes et al. 2013). Feature-based techniques are commonly used to create automatic systems for classifying lung sounds (Rocha et al. 2017).

Deep learning (DL) techniques have advanced rapidly and demonstrated very promising results in medical applications such as disease detection and classification (Bahoura 2009). DL has several advantages over ML: it extracts features automatically, and DL methods are more generic and mitigate the limitations of traditional ML-based methods. DL-based methods applied in recent years to the classification of respiratory abnormalities and pathologies from lung auscultation records have produced very promising results (Bahoura 2009; Icer and Gengec 2014). However, to work well, DL networks must undergo an extensive training process on a huge dataset, which requires considerable time and powerful computational resources. This makes it challenging to deploy DL frameworks on wearable devices and mobile platforms, which have limited computational resources. Several approaches classify lung sounds using CNNs and compare them with feature-based approaches; for instance, Bardou et al. (2018) used Mel-frequency cepstral coefficient (MFCC) statistics extracted from the signals in a first handcrafted-feature approach, local binary patterns extracted from spectrograms in a second approach, and a purpose-designed convolutional neural network (CNN) in a third.

Researchers have also combined machine learning algorithms with handcrafted features, for example feeding Mel-frequency cepstral coefficient (MFCC) features to a support-vector machine (SVM) or feeding spectrogram images to a convolutional neural network (CNN). The SVM is the most common classifier used with handcrafted features for audio classification, and its results are often used to benchmark CNNs. Different classification scenarios can be compared, such as healthy versus pathological classification; rale, rhonchus, and normal sound classification; singular respiratory sound type classification; and audio type classification with all sound types (Aykanat et al. 2017). Recently, a method was proposed for the automatic detection of pulmonary diseases (PDs) from lung sound (LS) signals, in which the LS signal modes were evaluated using an empirical wavelet transform with fixed boundary points. Time-domain (Shannon entropy) and frequency-domain (peak amplitude and peak frequency) handcrafted features were extracted from each mode, and machine learning classifiers, including the support-vector machine, random forest, extreme gradient boosting, and the light gradient boosting machine (LGBM), were then used to detect PDs automatically from these features. The performance of these features is promising and can be extended to multi-class scenarios such as normal versus asthma, normal versus pneumonia, normal versus chronic obstructive pulmonary disease (COPD), and normal versus pneumonia versus asthma versus COPD classification schemes (Tripathy et al. 2022).
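As a concrete illustration of this handcrafted-feature pipeline, the sketch below extracts MFCC statistics with librosa and trains an SVM with scikit-learn. The file names, labels, sampling rate, and SVM settings are illustrative assumptions, not the configurations of the cited studies.

```python
# Hedged sketch of a handcrafted-feature pipeline: MFCC statistics + SVM.
# Paths, labels, sampling rate, and SVM settings are illustrative
# assumptions, not the configurations of the cited studies.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_stats(path, sr=4000, n_mfcc=13):
    """Load a recording and summarize each MFCC over time (mean and std)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical dataset manifest (placeholder file names and labels).
files = ["rec_001.wav", "rec_002.wav", "rec_003.wav", "rec_004.wav"]
labels = ["normal", "wheeze", "normal", "wheeze"]

X = np.stack([mfcc_stats(f) for f in files])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0)
clf = SVC(kernel="rbf", C=1.0)   # a common default SVM configuration
clf.fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```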

In addition, homogeneous ensemble learning methods have recently been applied to multi-class classification of respiratory diseases to enhance system performance. Such systems can cover a wide range of conditions, including healthy, asthma, pneumonia, heart failure, bronchiectasis or bronchitis, and chronic obstructive pulmonary disease (Fraiwan et al. 2021b). Another family of hybrid methods combines two different deep learning algorithms in one architecture, which can robustly enhance the recognition of pulmonary diseases from electronically recorded lung sounds. These approaches, however, rely on several preprocessing steps to obtain smoother and less noisy signals, such as wavelet smoothing, displacement artifact removal, and z-score normalization. The two-stage deep learning architectures are mainly based on combining convolutional neural networks with bidirectional long short-term memory units (Fraiwan et al. 2021c).

1.1 Our contribution

In this work, we present a study investigating the ability of three deep learning models, namely convolutional neural networks, long short-term memory networks, and a hybrid of the two, to recognize multiple pulmonary diseases from recorded lung sound signals. The signals were obtained by merging two publicly available datasets: the International Conference on Biomedical Health Informatics (ICBHI) 2017 Challenge dataset and the King Abdullah University Hospital (KAUH) dataset. The recordings represent normal subjects as well as ten disease conditions, including asthma, lung fibrosis, BRON, COPD, heart failure, heart failure + COPD, heart failure + lung fibrosis, pleural effusion, and pneumonia. A CNN, an LSTM network, and a hybrid (CNN + LSTM) model were designed for the training and testing classification processes, extracting information from the temporal domain of the raw signals without any preprocessing. The three models were evaluated on non-augmented and augmented data, where the two source datasets were used to generate four different sub-datasets. Several evaluation metrics were used to assess disease recognition using the CNN and LSTM networks individually as well as their combination. To the best of our knowledge, single-network approaches have usually been used when building deep learning models for lung sound classification. Therefore, in addition to the suggested hybrid model (CNN + LSTM), each network's capacity to recognize diseases when operating independently, as either CNN or LSTM, was examined. The key contribution of this study is the combination of the standard CNN feature-extraction approach with the LSTM network's temporal feature memorization, which increases learning efficiency. Our system differs fundamentally from the systems mentioned in the literature. The contributions of this work can be summarized as follows:

  • To our knowledge, the first paper to compare three different types of deep learning models: a CNN model, an LSTM model, and a hybrid model combining both.

  • The first system to classify raw lung auscultation sounds without any preprocessing techniques.

  • The first system to use sound augmentation techniques to enhance the accuracy of the models.

  • The use of a large dataset of lung auscultation sounds collected from two different publicly available datasets.

  • A system that detects and classifies 11 different classes, comprising ten diseases and the healthy condition.

The remainder of the paper is organized as follows: Sect. 2 reviews recent related work. Section 3 presents a detailed explanation of the datasets used and the proposed architectures. Section 4 presents the results, including the performance of the different proposed architectures. Section 5 discusses the results, and finally, Sect. 6 concludes the proposed work.

2 Literature review

In recent years, a substantial amount of related research employing both machine learning and deep learning has been proposed, focused on the automated classification of raw and processed respiratory sounds. In this section, the works most recent and relevant to this paper's topic are discussed. Serbes et al. (2013) used time–frequency (TF) and timescale (TS) analysis to detect pulmonary crackles. The frequency characteristics of crackles were extracted from non-preprocessed and preprocessed signals using TF and TS analysis, with the dual-tree complex wavelet transform (DTCWT) applied in the preprocessing step to filter out frequency bands carrying no crackle information. They used K-nearest neighbors (KNN), the support-vector machine (SVM), and a multilayer perceptron to classify crackling and non-crackling sounds, achieving an accuracy of 97.5% with the SVM classifier. Bahoura (2009) proposed an approach for two classes of lung sounds, normal and wheeze, using Mel-frequency cepstral coefficients (MFCCs) for feature extraction and a Gaussian mixture model (GMM) for classification, achieving an accuracy of 94.2%.

Icer and Gengec (2014) used an SVM for automatic classification. They created features from the frequency ratio of power spectral density (PSD) values and the Hilbert–Huang transform (HHT) to distinguish between three classes of lung sounds, normal, crackles, and rhonchus, with an accuracy above 90%. Moreover, Jin et al. (2014) used an SVM to distinguish between respiratory sound classes including normal, wheezing, stridor, and rhonchi, achieving accuracies between 97.7 and 98.8%. Reyes et al. (2014) used a technique to obtain the time–frequency (TF) representation of thoracic sounds. The performance of the TF representations for different classes, including heart, adventitious, and normal lung sounds, was assessed using TF patterns, and the best performance was achieved by the Hilbert–Huang spectrum (HHS). Higher-order statistics (HOS) were used by Naves et al. (2016) to classify lung sounds into normal, coarse crackle, fine crackle, and monophonic and polyphonic wheezes. They used genetic algorithms and Fisher's discriminant ratio to reduce dimensionality, with KNN and naive Bayes classifiers for classification, achieving an accuracy of 98.1% on training data and 94.6% on validation data.

Orjuela-Canon et al. (2014) used MFCC features with an artificial neural network (ANN) to distinguish between normal sounds, wheezes, and crackles, achieving classification accuracies of 75% for crackles, 100% for wheezes, and 80% for normal sounds. Maruf et al. (2015) used a GMM to separate crackles from normal respiratory sounds. They applied a band-pass filter to reduce background noise and extracted three spectro-temporal features, namely pitch, energy, and the spectrogram, achieving an accuracy of 97.56%. A novel attractor recurrent neural network technique based on fuzzy functions (FFs-ARNN) for the classification of lung abnormalities was proposed by Khodabakhshi and Moradi (2017) and achieved an accuracy of 91%. Pinho et al. (2015) used signal processing methodologies to detect crackles in audio files. Their method uses fractal dimension and box filtering to extract the window of interest, verifies and validates each potential crackle, and then extracts the crackle parameters for characterization.

Islam et al. (2018) used artificial neural networks (ANNs) and an SVM to classify lung sounds from 60 subjects, half of whom had asthma, obtaining the best accuracy of 93.3% with the SVM. Accuracies of up to 93% and 91.7% were achieved using other neural network configurations for detecting crackles and wheezes, respectively (Guler et al. 2005). A dataset of seven classes, comprising normal, coarse crackle, fine crackle, monophonic wheeze, polyphonic wheeze, squawk, and stridor, was used in other work exploring different ANN approaches, where the convolutional neural network (CNN) gave the best results (Guler et al. 2005; Shuvo et al. 2020; Garcia-Ordas et al. 2020; Tsai et al. 2020; Demir et al. 2020; Kevat et al. 2020; Andrade et al. 2021; Wani et al. 2021).

A CNN was also used by Jacome et al. (2019) for respiratory sounds, with accuracies of 97% and 87% in detecting inspiration and expiration, respectively. Two machine learning algorithms were compared by Aykanat et al. (2017): Mel-frequency cepstral coefficient (MFCC) features in a support-vector machine (SVM) and spectrogram images in a convolutional neural network (CNN). Four datasets were prepared for each of the CNN and SVM algorithms to classify different respiratory conditions: healthy versus pathological classification; rale, rhonchus, and normal sound classification; singular respiratory sound type classification; and audio type classification with all sound types. They achieved accuracies of 86% (CNN) and 86% (SVM), 76% (CNN) and 75% (SVM), 80% (CNN) and 80% (SVM), and 62% (CNN) and 62% (SVM), respectively. Their results showed that pre-diagnosis and classification of respiratory audio can be done accurately using CNN and SVM machine learning algorithms. Bardou et al. (2018) used three approaches: two based on the extraction of handcrafted features fed to three different classifiers (SVM, KNN, and Gaussian mixture models), and a third based on a CNN. Their dataset consists of seven classes (normal, coarse crackle, fine crackle, monophonic wheeze, polyphonic wheeze, squawk, and stridor), and their results show that the CNN outperforms the handcrafted-feature-based classifiers.

Garcia-Ordas et al. (2020) used a convolutional neural network (CNN) to classify respiratory sounds into healthy, chronic, and non-chronic disease, achieving an F-score of 0.993 in this three-label classification. They also performed a more challenging classification across pathology and healthy classes, including URTI, COPD, bronchiectasis, pneumonia, and bronchiolitis, achieving an F-score of 0.990 over all classes. Moreover, Fraiwan et al. (2021a) used an electronic stethoscope to record lung sounds from 112 subjects (35 healthy and 77 unhealthy) to create a new dataset. The dataset covers normal breathing sounds and seven ailments: lung fibrosis, heart failure, asthma, pneumonia, bronchitis, pleural effusion, and COPD. It was created for use in machine learning models that distinguish the type of lung sounds or detect pulmonary diseases. Fraiwan et al. (2021b) used different ensemble classifiers to perform multi-class classification of respiratory diseases. Their data included a total of 215 subjects, with 308 clinically acquired lung sound recordings and 1176 recordings obtained from the ICBHI Challenge database, covering asthma, pneumonia, heart failure, bronchitis, chronic obstructive pulmonary disease, and the healthy condition. Shannon entropy, logarithmic energy entropy, and spectrogram-based spectral entropy were used to represent the lung sound signals. Bootstrap aggregation and adaptive boosting ensembles were built using decision trees and a discriminant classifier. Boosted decision trees achieved the best overall accuracy, sensitivity, specificity, F1-score, and Cohen's kappa coefficient of 98.27%, 95.28%, 98.9%, 93.61%, and 92.28%, respectively, while among the baseline methods the SVM provided an average accuracy of 98.20%, sensitivity of 91.5%, and specificity of 98.55%. Furthermore, Fraiwan et al. (2021c) used a deep learning model based on CNNs and bidirectional LSTM (BD-LSTM) to recognize pulmonary diseases, combining a dataset of 103 patients recorded at King Abdullah University Hospital (KAUH) in Jordan with data from 110 patients taken from the publicly available ICBHI challenge database. The highest average accuracy achieved in classifying patients by pulmonary disease type using CNN + BD-LSTM was 99.62%, with a precision of 98.85% and a total agreement of 98.26% between the predictions and the original classes within the training scheme.

Nguyen and Pernkopf (2022) proposed a methodology for lung sound classification employing co-tuning and stochastic normalization to enhance classification results. They split each sound record into 8 s segments and then computed corrected and normalized spectrograms. They co-tuned ResNet-based models and applied them to three scenarios with two, three, and four classes. Their highest performance was obtained with a 60–40 training/testing split for the two-class problem using log-mel spectrograms and ResNet101, scoring a specificity of 91.77% and a sensitivity of 95.76%. Moreover, Tripathy et al. (2022) proposed a methodology using an empirical wavelet transform with fixed boundary points, from which time-domain (Shannon entropy) and frequency-domain (peak amplitude and peak frequency) features were extracted. They then employed different classifiers, namely the support-vector machine, random forest, extreme gradient boosting, and the light gradient boosting machine, to detect pulmonary diseases from the extracted features. The best accuracy values of 80.35%, 83.27%, 99.34%, and 77.13% were obtained using the light gradient boosting machine classifier with fivefold cross-validation for the normal versus asthma, normal versus pneumonia, normal versus COPD, and normal versus pneumonia versus asthma versus COPD classification schemes, respectively.

Bhatta et al. (2022) proposed a respiratory audio based system to predict a variety of illnesses, including bronchiectasis, pneumonia, and asthma. They used respiratory and disease-diagnosis audio datasets, extracted characteristics from each audio dataset, and trained a convolutional neural network (CNN) model; after training, new test data can be supplied to predict an illness. The authors report accuracy and sensitivity values of 86% on their own dataset. Finally, Soni et al. (2022) proposed a method to bridge the problem of labeling (diagnosing) heart and lung sounds. They used a ResNet-18 model as a base to generate encodings of length N in the latent space for training, and they exploited clinical data, such as age, sex, weight, and sound location, together with the audio file to make use of the shared patient-level context of the recordings. When using patient-specific representations to choose positive and negative pairs, they demonstrated better performance in downstream heart and lung sound diagnosis tasks. The highest performance, achieved using linear evaluation, was an AUC of 0.752 with a 95% confidence interval of 0.715–0.791.

As the surveyed literature shows, most studies focus on pre-trained models rather than models of their own, or evaluate their methods on small datasets. Moreover, they often restrict themselves to comparing pre-trained models in order to select the best one. The literature also fails to provide models with good generalization capacity, either because of the datasets used or because no augmentation process was applied. Finally, most of the research effort is spent on developing signal enhancement or feature extraction methods rather than on designing new models or systems. Accordingly, this paper enriches the literature by combining two publicly available lung sound datasets, together with an augmentation process, to significantly enhance the generalization capacity and performance of systems that classify adventitious lung sounds and respiratory diseases. Moreover, it provides a comprehensive comparison between three types of deep learning models (CNN, LSTM, and CNN–LSTM) across several datasets, applying them to raw sounds without any enhancement or feature extraction methods except resizing, and identifies the best model for each scenario, opening the way for further investigations into the design of new models.

3 Materials and methods

As shown in Fig. 1, the adopted methodology consists of the following main phases: data acquisition and preparation, data augmentation, construction and training of the deep learning models, and finally performance evaluation. These steps are detailed below.

Fig. 1

The complete procedure followed in the proposed study

3.1 Lung sounds datasets

In this paper, the data used incorporate two different datasets, both consisting of stethoscope lung sound recordings labeled with different respiratory diseases. The first is the publicly available International Conference on Biomedical Health Informatics (ICBHI) 2017 Challenge dataset. The second is the newly released, publicly available King Abdullah University Hospital (KAUH) dataset. The two datasets were merged to produce four different sub-datasets, which vary in the number and type of classes included in each. Table 1 provides a detailed overview of the source datasets and the four sub-datasets derived from their merging. Each dataset is discussed in detail in the next two sections.

Table 1 The details of the used datasets

3.1.1 KAUH dataset

The KAUH dataset is a newly released public dataset covering a total of 70 individuals with various respiratory diseases, such as asthma, pneumonia, heart failure, bronchiectasis or bronchitis (BRON diseases), and chronic obstructive pulmonary disease (COPD) (Fraiwan et al. 2021a). A total of 35 healthy controls were also included. Participant age was deliberately not a selection variable, to keep the study unbiased; participants ranged from children through adults to the elderly. All participants signed a written consent form after thoroughly comprehending the parameters of the study and the technique involved. The complete dataset comprises 308 lung sound recordings, each lasting 5 s. Based on average resting respiration rates for humans (12–20 breaths per minute), this length is sufficient to encompass at least one respiratory cycle, and it has been used in previous investigations (Fraiwan et al. 2021a). In general, adopting short data windows alleviates the difficulty of medical data availability while also increasing the model's computational efficiency. Furthermore, training the model on lung sound signals rather than respiratory cycles makes data curation and labeling much easier. These features are usually advantageous in clinical situations and real-time applications (Fraiwan et al. 2021a, b, c).

3.1.2 ICBHI 2017 dataset

The International Conference on Biomedical Health Informatics (ICBHI) 2017 dataset, a publicly available benchmark dataset of lung auscultation sounds (Rocha et al. 2017), was also used in this research. Two independent research teams, from Portugal and Greece, gathered the data (Nuckowska et al. 2019). The dataset contains 5.5 h of audio sampled at multiple frequencies (4 kHz, 10 kHz, and 44 kHz), with recordings lasting from 10 to 90 s, comprising 920 audio samples from 126 participants in various anatomical positions (Chen et al. 2019). The samples are professionally annotated according to two schemes: (i) by the corresponding patient's pathological condition, i.e., healthy and seven distinct disease classes, namely pneumonia, bronchiectasis, COPD, URTI, LRTI, bronchiolitis, and asthma; and (ii) by the presence of respiratory anomalies, i.e., crackles and wheezes, in each respiratory cycle. Zhang et al. (2015) contains more information about the dataset and the data collection technique. Table 1 shows the details of the datasets used in our research.
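Since the recordings come at different sampling rates and lengths, some uniform resizing is needed before raw waveforms can be fed to the models. The sketch below shows one plausible way to do this with librosa; the target sampling rate and length are our own assumptions, as the paper only states that resizing is applied.

```python
# Hedged sketch of the only preparation applied to the raw sounds: loading
# at a common sample rate and resizing to a fixed length. The target rate
# and length are illustrative assumptions (the paper only states "resizing").
import numpy as np
import librosa

TARGET_SR = 4000      # assumed common sampling rate
TARGET_LEN = 20000    # assumed fixed waveform length (5 s at 4 kHz)

def load_fixed(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=TARGET_SR)       # resample on load
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]                     # truncate long recordings
    return np.pad(y, (0, TARGET_LEN - len(y)))    # zero-pad short ones
```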

3.2 Augmentation

Data augmentation is a popular technique for artificially expanding the size of a training dataset; for audio, this is done by creating modified versions of the recordings in the dataset (Guler et al. 2005; Shuvo et al. 2020). Training deep learning models on larger datasets can yield more skillful models, and augmentation creates variations of the audio files that improve the ability of the fitted models to generalize what they have learned to new recordings. With images, for instance, one might rotate the image slightly, crop or scale it, modify colors or lighting, or add some noise (Garcia-Ordas et al. 2020; Tsai et al. 2020; Demir et al. 2020). Since the semantics of the image are not materially changed, the target label of the original sample still applies to the augmented sample. Just as with images, several techniques exist to augment audio data, and this augmentation can be applied directly to the raw audio (Kevat et al. 2020; Andrade et al. 2021; Wani et al. 2021). In this research, the following audio augmentation techniques were applied (a minimal code sketch follows the list below):

  • Time Stretch: randomly slow down or speed up the sound.

  • Time Shift: shift audio to the left or the right by a random amount.

  • Add Noise: add random noise values to the sound.

  • Control Volume: randomly increase or decrease the volume of the audio.
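The following sketch illustrates how the four listed augmentations could be applied to a raw waveform with numpy and librosa. The parameter ranges are illustrative assumptions rather than the paper's exact settings.

```python
# Hedged sketch of the four waveform-level augmentations listed above.
# Parameter ranges are illustrative assumptions, not the paper's settings.
import numpy as np
import librosa

rng = np.random.default_rng(42)

def time_stretch(y, low=0.8, high=1.2):
    """Randomly slow down or speed up the sound."""
    return librosa.effects.time_stretch(y=y, rate=rng.uniform(low, high))

def time_shift(y, max_frac=0.2):
    """Shift the audio left or right by a random amount (circular shift)."""
    max_shift = int(max_frac * len(y))
    return np.roll(y, rng.integers(-max_shift, max_shift + 1))

def add_noise(y, noise_level=0.005):
    """Add random Gaussian values to the sound."""
    return y + noise_level * rng.standard_normal(len(y))

def control_volume(y, low=0.5, high=1.5):
    """Randomly increase or decrease the volume of the audio."""
    return y * rng.uniform(low, high)
```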

3.3 Deep learning models

Deep learning is a state-of-the-art artificial intelligence technology that has emerged in response to the growing quantity of massive datasets (Alqudah 2020; Esteva et al. 2021; Alqudah et al. 2021a, b; Kanavati et al. 2020). It is primarily defined and distinguished by architectures made up of many sequential layers, in which successive stages of input processing are carried out (LeCun et al. 1995, 2015). Deep learning is inspired by the deep structure of the human brain (LeCun et al. 1995; Alqudah et al. 2021c), whose many hidden layers allow features to be extracted and abstracted at various levels and from various perspectives. In recent years, many deep learning algorithms have been presented (Alqudah 2020; Esteva et al. 2021; Alqudah et al. 2021a, b, c; Kanavati et al. 2020; LeCun et al. 2015). The most frequently used, powerful, and efficient are the CNN (Alqudah 2020; Alqudah et al. 2021a, c; Alqudah and Alqudah 2022a) and the long short-term memory (LSTM) network (Ozturk and Ozkaya 2020; Petmezas et al. 2021; Cinar and Tuncer 2021; Jelodar et al. 2020). In the following subsections, we discuss in detail the developed standalone and hybrid CNN and LSTM models. Figure 2 shows the developed models, and Table 2 gives the layer details.

Fig. 2

The used deep learning models

Table 2 The layer details of proposed deep learning models

3.3.1 CNN model

CNNs have a large number of hidden layers that use convolution and subsampling operations to extract deep features from the input data (Alqudah et al. 2021a; LeCun et al. 2015). The layer types in a CNN include input, convolution, ReLU (rectified linear unit), fully connected, classification, and output layers, which are combined to create a model that can complete the task at hand. CNNs have excelled in a variety of scientific fields, particularly medicine (Chen et al. 2019; Esteva et al. 2021). The CNN layers primarily extract deep, representative, and discriminative characteristics: the early layers perform downsampling and feature selection, while the final layers produce the classification of the data. Figure 2A shows the CNN model used.
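A minimal Keras sketch of a 1D CNN of this kind, operating on raw waveforms, is given below. The filter counts, kernel widths, and input length are illustrative assumptions; the exact architecture used in this work is listed in Table 2.

```python
# Hedged sketch of a 1D CNN for raw lung sound waveforms (cf. Fig. 2A).
# Filter counts, kernel widths, and the input length are illustrative
# assumptions; Table 2 gives the paper's actual layer configuration.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 11     # ten diseases + healthy (the largest sub-dataset scenario)
INPUT_LEN = 20000    # assumed fixed raw-waveform length after resizing

cnn = models.Sequential([
    layers.Conv1D(16, kernel_size=9, activation="relu",
                  input_shape=(INPUT_LEN, 1)),
    layers.MaxPooling1D(4),                    # subsampling
    layers.Conv1D(32, kernel_size=9, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=9, activation="relu"),
    layers.GlobalAveragePooling1D(),           # collapse the time axis
    layers.Dense(64, activation="relu"),       # fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
cnn.summary()
```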

3.3.2 LSTM model

Hochreiter and Schmidhuber first proposed the LSTM in 1997 (Shadmand and Mashoufi 2016), and a group led by Felix Gers improved it in 2000 (Gers et al. 2000). Researchers have since introduced many LSTM variants, with Zaremba et al. (2014) providing a detailed treatment. A memory cell, an input gate, an output gate, and a forget gate make up the most typical LSTM design. Assume that the input, cell state, and hidden state at iteration t are \({x}_{t}\), \({c}_{t}\), and \({h}_{t}\), respectively. The cell state \({c}_{t}\) and hidden state \({h}_{t}\) are produced from the current input \({x}_{t}\), the previous cell state \({c}_{t-1}\), and the corresponding previous hidden state \({h}_{t-1}\) (Zaremba et al. 2014). Figure 2B shows the LSTM model used.
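For reference, a standard formulation of these updates (the textbook form, following the common notation used by Zaremba et al. (2014), with \(\sigma\) the logistic sigmoid and \(\odot\) element-wise multiplication; not an architectural detail specific to this work) is:

$$\begin{aligned} {i}_{t}&=\sigma \left({W}_{i}{x}_{t}+{U}_{i}{h}_{t-1}+{b}_{i}\right)\\ {f}_{t}&=\sigma \left({W}_{f}{x}_{t}+{U}_{f}{h}_{t-1}+{b}_{f}\right)\\ {o}_{t}&=\sigma \left({W}_{o}{x}_{t}+{U}_{o}{h}_{t-1}+{b}_{o}\right)\\ \widetilde{c}_{t}&=\mathrm{tanh}\left({W}_{c}{x}_{t}+{U}_{c}{h}_{t-1}+{b}_{c}\right)\\ {c}_{t}&={f}_{t}\odot {c}_{t-1}+{i}_{t}\odot \widetilde{c}_{t}\\ {h}_{t}&={o}_{t}\odot \mathrm{tanh}\left({c}_{t}\right) \end{aligned}$$

Here \({i}_{t}\), \({f}_{t}\), and \({o}_{t}\) are the input, forget, and output gates, and \(\widetilde{c}_{t}\) is the candidate cell state.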

3.3.3 CNN–LSTM model

In this hybrid model, deep feature extraction and selection from the lung sound signal is handled by the CNN blocks, namely the 1D convolutional layers and the max pooling layers, while the LSTM layer, which is fed these characteristics as time-dependent features, learns to extract contextual temporal information (Shahzadi et al. 2018). Our research shows that adopting a hybrid 1D CNN–LSTM for deep feature extraction and classification outperforms purely CNN- or LSTM-based methods (She and Zhang 2018). Furthermore, employing the LSTM layer allows a significantly shallower model to be built than pure CNN models, resulting in better performance with fewer parameters. Figure 2C shows the CNN–LSTM model used.
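A minimal Keras sketch of such a hybrid, with convolution and pooling blocks feeding an LSTM layer, is shown below. As before, the layer sizes are illustrative assumptions; Table 2 gives the actual configuration.

```python
# Hedged sketch of the hybrid 1D CNN-LSTM (cf. Fig. 2C): convolution and
# max-pooling blocks extract deep features, and an LSTM layer models their
# temporal context. Layer sizes are illustrative assumptions (see Table 2).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 11
INPUT_LEN = 20000    # assumed fixed raw-waveform length after resizing

cnn_lstm = models.Sequential([
    layers.Conv1D(16, kernel_size=9, activation="relu",
                  input_shape=(INPUT_LEN, 1)),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=9, activation="relu"),
    layers.MaxPooling1D(4),
    # The LSTM reads the pooled feature maps as a sequence of time-dependent
    # feature vectors and keeps only its final hidden state.
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
cnn_lstm.summary()
```

Note how the LSTM replaces the deeper convolutional stack and global pooling of the pure CNN, which is what allows the shallower design discussed above.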

3.4 Performance evaluation

Any artificial intelligence (AI)-based system must be evaluated on how well its performance carries over to new data. To assess the developed models, the original annotations of the raw lung auscultation sounds were compared with the annotations predicted by the models for the same sounds. The accuracy, sensitivity, precision, and specificity were then determined from these annotations; these indicators show how accurately lung sounds are diagnosed (Alqudah and Alqudah 2022a). Four statistical values are used to calculate these measures: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) (Kanavati et al. 2020; Alqudah et al. 2021c). All four indices are extracted from the confusion matrix, which summarizes the classification outcomes used to compute the performance metrics (Obeidat and Alqudah 2021; Alqudah and Alqudah 2022b; Alqudah et al. 2021d; Alqudah et al. 2020; Al-Issa and Alqudah 2022). Figure 3 shows a simple confusion matrix. The following performance evaluation parameters (accuracy, sensitivity, specificity, precision, F1 score, and MCC) were then calculated from these values:

Fig. 3

A confusion matrix example

$$\mathrm{Accuracy}= \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}$$
(1)
$$\mathrm{Sensitivity}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(2)
$$\mathrm{Specificity}= \frac{\mathrm{TN}}{\mathrm{FP}+\mathrm{TN}}$$
(3)
$$\mathrm{Precision}= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(4)
$$\mathrm{F}1 \,\mathrm{Score}= 2\frac{\mathrm{Precision}*\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}$$
(5)
$$\mathrm{MCC}= \frac{\mathrm{TP}*\mathrm{TN}-\mathrm{FP}*\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
(6)
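In the multi-class setting, TP, FP, FN, and TN are computed per class in a one-versus-rest fashion from the confusion matrix. The binary sketch below illustrates the arithmetic of Eqs. (1)–(6) using scikit-learn; the label vectors are placeholders, not data from this study.

```python
# Hedged sketch: computing Eqs. (1)-(6) from a confusion matrix with
# scikit-learn. The label vectors below are placeholders for a binary
# (one-vs-rest) case.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # placeholder ground truth
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + fp + tn + fn)                 # Eq. (1)
sensitivity = tp / (tp + fn)                                  # Eq. (2)
specificity = tn / (fp + tn)                                  # Eq. (3)
precision   = tp / (tp + fp)                                  # Eq. (4)
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (5)
mcc = (tp * tn - fp * fn) / np.sqrt(                          # Eq. (6)
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(accuracy, sensitivity, specificity, precision, f1, mcc)
```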

4 The experimental results

The investigated models were trained and tested on a computer with an Intel(R) Core(TM) i7-6700 3.40 GHz CPU and 16 GB RAM. The training process took around 45 min per model. All models were trained using the Adam optimizer with an initial learning rate of 0.001, a maximum of 100 epochs, a mini-batch size of 128, and a validation frequency of 100 iterations. The performance of each deep learning model is described using the confusion matrix and the statistical parameters extracted from it; the confusion matrix represents the classification results of a given model. After calculating the statistical parameters, namely false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN), the effectiveness of raw lung sound classification is compared using four statistical indices: sensitivity, specificity, precision, and accuracy. The performance evaluation results of all models, for the different types of raw lung sound classification across the different datasets, are shown in the following subsections.
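For reference, the stated training configuration maps onto a Keras setup roughly as follows. The loss function and validation split are our own assumptions, and Keras validates once per epoch rather than per a fixed iteration count, so the per-100-iterations validation frequency has no direct equivalent here.

```python
# Hedged sketch mapping the stated training setup onto Keras. The loss and
# validation split are assumptions; `model`, `X_train`, and `y_train` are
# assumed to come from the model definitions and data preparation above.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # initial LR
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
history = model.fit(
    X_train, y_train,
    epochs=100,        # maximum number of epochs
    batch_size=128,    # mini-batch size
    validation_split=0.1,
)
```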

4.1 Non-augmented dataset results

The following sections show the performance of the different deep learning models on the non-augmented datasets.

4.1.1 CNN model results

In this section, we display the results of the CNN model on all of the non-augmented datasets. Figure 4 shows the training results for all datasets using the CNN model, while Fig. 5 shows the corresponding testing results.

Fig. 4

CNN model training results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 5

CNN model testing results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

4.1.2 LSTM model results

In this section, we display the results of the LSTM model on all of the non-augmented datasets. Figure 6 shows the training results for all datasets using the LSTM model, while Fig. 7 shows the corresponding testing results.

Fig. 6

LSTM model training results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 7

LSTM model testing results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

4.1.3 CNN–LSTM model results

In this section, we display the results of the CNN–LSTM model on all of the non-augmented datasets. Figure 8 shows the training results for all datasets using the CNN–LSTM model, while Fig. 9 shows the corresponding testing results.

Fig. 8

CNN–LSTM model training results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 9

CNN–LSTM model testing results for non-augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

4.2 Augmented dataset results

The following sections show the performance of the different deep learning models on the augmented datasets.

4.2.1 CNN model results

In this section, we display the results of the CNN model on all of the augmented datasets. Figure 10 shows the training results for all datasets using the CNN model, while Fig. 11 shows the corresponding testing results.

Fig. 10

CNN model training results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 11

CNN model testing results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

4.2.2 LSTM model results

In this section, we display the results of the LSTM model on all of the augmented datasets. Figure 12 shows the training results for all datasets using the LSTM model, while Fig. 13 shows the corresponding testing results.

Fig. 12

LSTM model training results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 13

LSTM model testing results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

4.2.3 CNN–LSTM model results

In this section, we display the results of the CNN–LSTM model on all of the augmented datasets. Figure 14 shows the training results for all datasets using the CNN–LSTM model, while Fig. 15 shows the corresponding testing results.

Fig. 14

CNN–LSTM model training results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

Fig. 15

CNN–LSTM model testing results for augmented datasets, A Dataset 1, B Dataset 2, C Dataset 3, D Dataset 4

5 Discussion

In general, computer-aided systems for the detection of respiratory diseases can expedite diagnostic and treatment decisions and support the study of physiological patterns associated with various respiratory pathologies. In this work, we apply different deep learning models to perform multi-class classification of respiratory diseases. As is imperative for all deep learning frameworks, the design stage of the models targeted a better classification of raw lung sound patterns; well-optimized models are key to building effective classifiers and enhancing predictive accuracy. In general, respiratory sounds are random, nonlinear signals that are highly complex in nature, especially because of the fluctuating lung volume. These characteristics appear in both healthy and pathological individuals but are more pronounced in pathological lung sounds.

In this study, an investigation was carried out into the use of different deep learning models, including the combination of CNN and LSTM neural networks, for identifying pulmonary diseases. The developed models achieved high levels of performance: the highest accuracy, sensitivity, and specificity were all 100%, obtained with the hybrid CNN–LSTM model, which paves the way toward implementing deep learning in clinical settings. A comparison of the results of the proposed deep learning models on the non-augmented and augmented datasets is shown in Table 3. From Table 3, we can see that the hybrid CNN–LSTM model outperforms the other models (CNN and LSTM) on both the augmented and non-augmented datasets.

Table 3 Comparison between different deep learning models among different datasets using testing sub-datasets

Clinically, the proposed research supports the accurate detection and classification of different respiratory diseases from lung sounds. Unlike the traditional stethoscope, where diseases are diagnosed manually based on the practitioner's experience, electronic lung sounds combined with deep learning predictive models reduce errors in disease classification. Many clinical decisions can therefore be positively influenced to prevent further development of the disease and to guide treatment. Furthermore, although manual diagnosis may yield a correct result in some circumstances, it is clinically preferable to build models able to detect small variations in signals across patients, although such models may be strongly affected by patient-specific information within the same disease. A deep learning model that can learn from a huge number of features can thus automatically enrich the diagnostic process and act as a supportive decision-maker in clinical settings. A comparison between the proposed models and models in the literature is shown in Table 4, from which we can see that our proposed models achieved better performance than the models in the literature.

Table 4 Comparison between our proposed methods and literature

6 Conclusion

To sum up, this research paper proposed deep learning models based on convolutional neural networks (CNNs), long short-term memory (LSTM), and a hybrid of the two. These models were used to classify raw lung sounds from a combination of two datasets: the dataset of lung sounds recorded at King Abdullah University Hospital (KAUH) and the dataset used in the International Conference on Biomedical Health Informatics (ICBHI) 2017 Challenge. The CNN, LSTM, and CNN–LSTM models were employed as classification methods and compared using the achieved results. The experimental results showed that the hybrid CNN–LSTM classification model generally outperformed the CNN and LSTM methods commonly employed in the literature. This research paves the way toward designing and implementing deep learning models in clinical settings to assist clinicians in decision making with high accuracy. Future work will focus on expanding the dataset to include more records from different subjects and a wider range of diseases, such as COVID-19, which will enhance the credibility of the proposed model. Although the current classification models achieve high performance metrics, they could be improved further by additional hyperparameter tuning.

For future work, we intend to extend the proposed automatic lung auscultation sound classification to real-time examination. Initially, we intend to test the proposed method on more datasets. We then plan to develop an embedded system that integrates the developed CNN–LSTM model with a digital stethoscope. Furthermore, we aim to create an Internet of Things (IoT) system so that this approach is accessible in developing countries, where mortality rates are highest. Finally, we intend to analyze the feasibility of integrating the system with telemedicine platforms.

7 Limitations

We must acknowledge the limitations of the current research. Because this research collects data from two datasets, some classes have few records while others have a large number of samples. The main limitation is therefore the number of samples per class, and future effort is required to gather recordings for all diseases. Furthermore, other factors may affect the results; for example, medications used by patients may contribute to the development or deterioration of symptoms, but the current research did not investigate these factors.