1 Introduction

The heart is the primary organ of the human circulatory system. Heart sounds arise from the blood circulation within the heart, specifically from the opening and closing of the atrioventricular valves: as these valves open and close to admit and expel blood, they produce the characteristic heartbeat sound. Analysis of heart sounds is therefore fundamental to detecting heart-related disorders, and classifying heart-related diseases from these sounds is crucial for quickly taking the correct preventive action. In practice, background noise must be removed from the auscultation signal acquired through an electronic acoustic stethoscope before valvular heart disease analysis [1, 2]; the noise-free heart sound is then digitized for computational processing. Suitable feature-extraction algorithms [1, 3] are applied within computational intelligence to extract the essential features used to classify the heart sound [1, 4] for disease. Because PCG signals are produced by the opening and closing of valves, they are repetitive mechanical vibrations occurring at roughly fixed time intervals and are analogous to an electrical signal. Heart sounds can be analyzed in the conventional time and frequency domains using various algorithms and tools drawn from computational intelligence. Artificial intelligence combined with computational intelligence plays a significant role in cardiac monitoring, early screening, and identification of valvular heart diseases. Inception networks and residual networks have found suitable application in heart sound analysis owing to their accuracy and low screening time, and integrating squeeze-and-excitation blocks with these networks improves their performance depending on the connection position. Detailed studies over multiple hyperparameters are required to enhance the performance of a suitable deep-learning classifier on normal and abnormal heart sounds, and hyperparameter tuning is necessary to avoid overfitting and underfitting. This study aims to find the most cost-effective, simplified, and improved classifier tool for the early screening of heart diseases.

2 Paper organization

Section 3 provides a literature study of the deep learning methods used in heart sound analysis. Section 4 states the objective of the research work. Section 5 details the techniques and materials used in this research paper. Section 6 explains the hardware development of the proposed system. Section 7 explains the software-based deep learning models developed for the proposed method and their improvement through hyperparameter tuning. Section 8 presents the result analysis of the research work. Finally, Section 9 summarizes the conclusions and future scope.

3 Related work

In this research, the classification of valvular heart disease signals using deep learning models is essential, and according to the literature survey many researchers have worked in this area. The authors of [1] (2005) studied the different stages of the heart sound signal for PCG signal analysis.

The authors of [3] (2009) analyzed heart sounds using an adaptive Mamdani-type fuzzy inference classifier; the experiment was carried out on a standard heart sound repository, but the method was offline and was not tested with human subjects. The authors of [5] (2013) studied heart sound analysis using different feature extraction techniques and methods, but no artificial intelligence was involved. The authors of [6] (2014) reviewed papers on the classification of PCG signals. The authors of [4] (2015) analyzed second heart sounds by computing the length and energy of normalized heart sounds, but did not classify them. The authors of [7] (2018) investigated PCG signal sensing with several machine-learning methods for abnormal heart sound detection; however, the work was limited to discriminating between different types of heart sounds.

The authors of [8] (2018) and [9] (2013) analyzed PCG signals using the wavelet transform, which has restrictions in real-time signal analysis. The authors of [10] (2015) classified heart sounds using multimodal features and obtained an accuracy of 85%. The authors of [11] (2020) classified heart sounds using a CNN and achieved an accuracy of 88%. The authors of [12] (2017) investigated PCG signal analysis for murmur diagnosis using Shannon energy and obtained an accuracy of 83%. The authors of [13, 14] (2017, 2018) presented a simple technique for real-time heart sound detection and identification using the Kalman filter, in which various feature extraction techniques were discussed. Improved data science models and artificial intelligence are thus needed for PCG signal analysis. Table 1 lists recent work on PCG signal analysis and classification methods.

Table 1 Recent work done on PCG signal analysis and deep learning-based methods

4 Objective

The main objective of this paper is to design and develop an artificial-intelligence-based, ear-contactless electronic stethoscope that is low-cost, portable, and accurate. The developed stethoscope can also effectively diagnose various heart sound types, with auscultation performed through a Bluetooth-enabled micro speaker.

5 Methods and materials

5.1 Heart sound dataset description

The research is based on four heart sound data repositories, from which a large number of heart sound samples (70,000) are taken for training and testing the models, with a training-to-testing ratio of 80:20. Heart sounds of similar classes are grouped during training. These repositories are publicly available on the internet and widely used by researchers, and the recordings are considered authentic and reliable. The first is available at: https://github.com/yaseen21khan/Classification-of-Heart-Sound-Signal-Using-Multiple-Features [15].

Five categories of heart sound samples have been considered, as shown in Fig. 1: normal sound, mitral regurgitation, mitral stenosis, mitral valve prolapse, and aortic stenosis. Each recording lasts 5 s to 10 s and occupies a bandwidth of 65 Hz to 500 Hz.

Fig. 1
figure 1

Cardiac sound dataset 1

Heart sound dataset 2, shown in Fig. 2, is obtained from dataset B of the Classification of Heart Sound Recordings PASCAL challenge [15, 16]. Three categories of heart sounds have been considered: normal, murmur, and extrasystole.

Fig. 2
figure 2

Cardiac sound dataset 2

Figure 3 details the PhysioNet challenge training set [17, 18], comprising five training databases (A through E) containing 3126 heart sound samples.

Fig. 3
figure 3

Cardiac sound dataset 3

The Kaggle Heartbeat Sounds dataset [19, 20], consisting of normal, abnormal, noisy normal, and noisy abnormal recordings, is also used for the heart sound analysis.

Mainly, two categories of heart sounds are used for analysis: normal and abnormal.

The features [21] of the heart sound considered for the whole study are limited to the following (a feature-extraction sketch is given after the list):

1. Acoustic features: MFCCs, Mel spectrogram, Chroma, Spectral contrast, and Tonnetz.

2. Time-domain features: RMS, signal energy, signal power, ZCR, THD, DWT, skewness, and kurtosis.

3. Time- and frequency-domain features: Hilbert-Huang transform (HHT)
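For illustration, the acoustic features listed above can be computed with the librosa library. The sketch below is a minimal example under assumptions not stated in the paper (librosa as the extraction library, time-averaging of each feature, and a hypothetical file name):

```python
import numpy as np
import librosa

def acoustic_features(wav_path, sr=22050):
    """Extract the five acoustic features (time-averaged) from a heart sound WAV."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfccs    = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    chroma   = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.hstack([mfccs, mel, chroma, contrast, tonnetz])

# Example (hypothetical file name):
# feats = acoustic_features("heart_sample.wav")
```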

Figure 4 depicts the block diagram of PCG signal classification [21, 22]. Heart sound samples are fed to a pre-processing module, which applies a band-pass filter with a passband of 35 to 480 Hz to suppress ambient noise. A fixed sample length of 3 s is used for each heart sound recording during pre-processing. Features are then extracted from the pre-processed heart sound, and finally the sample is classified to validate the proposed system.

$$\mathrm{x{\prime}}\left(\mathrm{t}\right)=\mathrm{f}\left(\mathrm{x}\left(\mathrm{t}\right)\right)$$
(1)

where x(t) is the raw cardiac sound signal and x′(t) is the processed cardiac sound signal.
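A minimal pre-processing sketch following this description is given below; the Butterworth design and fourth-order choice are assumptions, while the 35-480 Hz passband and fixed 3-s segment length come from the text:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(x, fs, lowcut=35.0, highcut=480.0, order=4, seg_len_s=3.0):
    """Band-pass filter (35-480 Hz) and crop/pad to a fixed 3-s segment."""
    sos = butter(order, [lowcut, highcut], btype="bandpass", fs=fs, output="sos")
    y = sosfiltfilt(sos, x)                  # zero-phase filtering
    n = int(seg_len_s * fs)
    if len(y) >= n:
        return y[:n]                         # crop to 3 s
    return np.pad(y, (0, n - len(y)))        # zero-pad shorter recordings
```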

Fig. 4
figure 4

Block diagram of PCG signal classification

Each dataset is split into training data (85%) and test data (15%). Training data is further divided into validation data (15%) and the rest for training the model.
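Under these proportions, the split could be realized as in the following sketch (scikit-learn and the placeholder arrays are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 193)          # placeholder feature matrix
y = np.random.randint(0, 2, size=1000)  # placeholder labels (normal/abnormal)

# 85% train+validation, 15% held-out test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# 15% of the training portion reserved for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval, random_state=42)
```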

HHT is used as one of the feature extraction methods: the heart signal is decomposed into intrinsic mode functions (IMFs) using empirical mode decomposition (EMD), and the Hilbert transform is applied to the IMFs to obtain the time-frequency distribution for Hilbert spectral analysis (HSA), which helps detect extra heart sounds such as S3 and S4.
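A minimal HHT sketch is shown below, assuming the PyEMD package (not named in the paper) for EMD and SciPy's Hilbert transform:

```python
import numpy as np
from PyEMD import EMD               # pip install EMD-signal
from scipy.signal import hilbert

def hht(signal, fs):
    """Decompose into IMFs via EMD, then compute instantaneous amplitude/frequency."""
    imfs = EMD()(signal)                            # rows are the IMFs
    analytic = hilbert(imfs, axis=1)                # analytic signal per IMF
    amplitude = np.abs(analytic)                    # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic), axis=1)
    inst_freq = np.diff(phase, axis=1) * fs / (2.0 * np.pi)  # Hz
    return imfs, amplitude, inst_freq
```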

The following algorithms are applied to classify heart sounds:

1. Proposed SE-based Inception Network

2. Proposed CNN-RNN architecture

3. Proposed Recurrent Neural Network (RNN)

4. Proposed CNN-based Inception Network

All deep learning-based algorithms are written in Python 3.9.2 using the Thonny Python editor (Linux). The proposed algorithms mentioned above are briefly described under the software development of the proposed deep learning models.

The valvular heart sounds considered for the entire PCG signal analysis [23, 24] are divided into the following categories:

1. Normal Heart Sounds

2. Mitral Stenosis

3. Aortic Stenosis

4. Mitral Regurgitation

5. Aortic Regurgitation

6. Mitral Valve Prolapse

7. Extra Systole

6 The hardware development of the proposed system

The schematic diagram of the data acquisition system and other interfaces is shown in Fig. 5.

Fig. 5
figure 5

Schematic diagram of the proposed heart sound acquisition system

The hardware circuit for heart sound data acquisition and interfacing is described in Fig. 5. It comprises:

1. a chest piece for sensing real-time heart sound samples;

2. an electret microphone for converting the real-time heart sounds into an electronic signal;

3. a microphone pre-amplifier for amplifying the heart sound;

4. a notch filter for removing electrical interference;

5. an analog tunable band-pass filter;

6. a buffer amplifier for impedance matching;

7. a Raspberry Pi 4B model acting as the CPU;

8. a Bluetooth-enabled micro speaker for listening to the captured heart sound;

9. a power supply for the Raspberry Pi; and

10. a 7-inch touch screen for displaying the heart sound.

Heart sounds generate pressure waves in the chest, which are picked up by the stethoscope diaphragm. An electret microphone fitted inside the chest piece and stethoscope tube converts the heart sound into an electrical signal. This signal is amplified by an audio amplifier based on the MAX9812 IC (20 dB gain) and fed to a notch filter to remove electrical interference. The processed signal then passes through an analog tunable band-pass filter spanning 35-470 Hz, since normal and abnormal heart sound signals typically [25] lie in this range. Finally, the signal is digitized by a USB sound card containing a 16-bit ADC sampling at 44.1 kHz, whose output is connected to the Raspberry Pi 4B computer. The heart sound samples picked up in real time by the stethoscope sensor are saved in WAV format for further processing on the Raspberry Pi 4B with the developed deep learning models. A 7-inch LCD touch screen displays the processed data and the AI-based prediction for the heart sound sample captured from the chest in real time.
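On the Raspberry Pi, capturing one segment from the USB sound card and saving it in WAV format might look like the following sketch; the sounddevice library, the recording length, and the file name are assumptions, as the paper does not show its capture code:

```python
import sounddevice as sd
from scipy.io import wavfile

FS = 44100          # USB sound card sampling rate (16-bit, 44.1 kHz)
DURATION_S = 10     # length of one auscultation recording (assumed)

# Record one mono channel from the default input (the USB sound card)
recording = sd.rec(int(DURATION_S * FS), samplerate=FS, channels=1, dtype="int16")
sd.wait()                                          # block until recording finishes
wavfile.write("heart_sample.wav", FS, recording)   # store for the deep learning models
```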

Figure 6 highlights the prototype of the proposed system incorporating all the required elements. The Bluetooth-enabled micro speaker attached to the Raspberry Pi makes the stethoscope ear-contactless.

Fig. 6
figure 6

The prototype of the proposed system and its components. A The uninterruptible power supply. B The Raspberry Pi 4B (left) and uninterruptible power supply (right). C Combination of the Raspberry Pi and the power supply. D A 7-inch touch screen, Raspberry Pi with power supply, and micro speaker. E A fully assembled device containing the components in D

Stethoscopes are designed to meet specific parameters. The proposed stethoscope satisfies these parameters better than the alternatives considered here and can thus be regarded as a useful tool in cardiac monitoring of heart disease. Figure 7 shows heart sounds captured in real time through the proposed stethoscope and a 3M Littmann digital stethoscope. Figure 8 shows how digital WAV files are stored on the Raspberry Pi 4B. Figure 9 presents a comparative study of the proposed stethoscope, a conventional stethoscope, and the 3M Littmann stethoscope based on the computation of these parameters. The third (S3) and fourth (S4) heart sounds have very low frequency and low intensity; they can therefore sometimes only be heard using the bell of the stethoscope chest piece.

Fig. 7
figure 7

PCG recordings obtained through the 3M Littmann stethoscope and the proposed stethoscope

Fig. 8
figure 8

Digital heart sound WAV files stored on the Raspberry Pi using the Thonny IDE and Python 3.9

Fig. 9
figure 9

Comparison of the proposed stethoscope with a conventional stethoscope and the 3M Littmann stethoscope based on the evaluation of specific parameters

The developed system is ear-contactless because the heart sound plays through an external Bluetooth speaker; there is no contact between the stethoscope chest piece and the ear. It is therefore hygienically safe during COVID-19 chest examinations. Auscultation of the heart is essential in patients with COVID-19, yet proper auscultation of these patients is difficult when medical workers wear personal protective suits.

Figure 9 highlights the parameters considered in comparing the three stethoscopes:

1. Disinfection in use.

2. Ease of handling the stethoscope.

3. Safety for patients and health professionals in use.

4. Ease of auscultation.

5. Affordability.

6. Digital storage of WAV files.

7. Compatibility with the wearing of personal protective clothing.

7 The software development of the proposed model

7.1 Squeeze and excitation networks applied to existing state-of-the-art (SOTA) architectures

The residual network and the inception network work best for valvular heart sound analysis [26, 27]. However, their performance can be further improved by integrating squeeze-and-excitation (SE) blocks with these state-of-the-art models, as shown in Figs. 10 and 11.

Fig. 10
figure 10

SE—ResNet Block

Fig. 11
figure 11

SE—Inception Net Block

Figure 12 shows the various positions at which SE blocks can be attached. Table 2 compares results for the different SE block positions within the existing CNN models; it also shows that a standard SE block attached to an existing state-of-the-art model [28, 29], such as Inception Net, offers the best result among all configurations.

Fig. 12
figure 12

Various positions of squeeze and excitation blocks

Table 2 SE-block connection position-based results in CNN-based models applied to the PASCAL dataset

An SE module consists of three main parts:

1. Squeeze block

2. Excitation block

3. Scale block

In the squeeze block, global average pooling reduces the C × H × W input to a C × 1 × 1 descriptor, giving a global summary for each channel.

The excitation block contains two fully connected layers: the first reduces the channel dimension by a reduction ratio of r = 16, and the second restores it to the original size. The length-C vector obtained from the squeeze operation is thus transformed into a set of per-channel weights by the excitation operation.

Finally, the scaling operation applies a sigmoid to the excitation output, and the resulting weights are multiplied element-wise with the input feature map to produce the final output of the SE module.
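A minimal Keras sketch of an SE block following this three-part description is given below, using the reduction ratio r = 16 stated above (the function name and example feature-map shape are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-excitation block: squeeze -> excitation -> scale."""
    channels = x.shape[-1]
    # Squeeze: global average pooling, C x H x W -> C
    s = layers.GlobalAveragePooling2D()(x)
    # Excitation: bottleneck FC layers producing per-channel weights
    e = layers.Dense(channels // reduction, activation="relu")(s)
    e = layers.Dense(channels, activation="sigmoid")(e)
    # Scale: channel-wise multiplication with the input feature map
    e = layers.Reshape((1, 1, channels))(e)
    return layers.Multiply()([x, e])

# Usage on a feature map of shape (batch, H, W, C):
inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = se_block(inputs)
model = tf.keras.Model(inputs, outputs)
```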

Various learning curves are obtained, as shown in Figs. 13 and 14.

Fig. 13
figure 13

Plot of accuracy vs. epoch during training and validation in SE-Inception Net

Fig. 14
figure 14

Plot of cross-entropy loss vs. epoch during training and validation in SE-Inception Net

7.2 Proposed CNN-RNN architecture

A combined CNN-RNN architecture is proposed in which features are extracted by a CNN-based inception network and classification is performed by an RNN, as shown in Fig. 15. The hybrid model, pairing the CNN-based inception network with an LSTM-based RNN, uses acoustic features of the heart sound signal such as Mel-frequency cepstral coefficients (MFCCs), Mel spectrogram, Chroma, Spectral contrast, and Tonnetz. After hyperparameter tuning, the proposed CNN-RNN model [28, 30] achieved an accuracy of 91.17%, outperforming the RNN model (whose hyperparameters were likewise optimized) in accuracy.

Fig. 15
figure 15

Functional block diagram of the proposed CNN-RNN model
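A sketch of such a hybrid in Keras is shown below, with a 1-D convolutional front end feeding an LSTM classifier; the filter counts, input shape, and layer sizes are illustrative assumptions rather than values taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_rnn(input_shape=(300, 40), n_classes=2):
    """CNN front end extracts local features; LSTM classifies the sequence."""
    model = models.Sequential([
        layers.Input(shape=input_shape),          # (time steps, features per step)
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.LSTM(50, return_sequences=True),
        layers.Dropout(0.35),
        layers.LSTM(20),
        layers.Dropout(0.35),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```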

Figure 16 depicts the accuracy of the proposed CNN-RNN model after hyperparameter tuning against the number of epochs during training and validation. The plot shows that accuracy rises as the number of epochs grows.

Fig. 16
figure 16

Plot of accuracy vs. epoch during training and validation

Table 3 lists the hyperparameter settings of the proposed model.

Table 3 Parameter configuration in CNN-RNN model

7.3 Proposed RNN Architecture

An RNN is a special kind of deep neural network [13, 15] in which the result of the previous stage is used as input to the current stage, as illustrated in Fig. 17. It plays an important role in PCG signal analysis.

$${\mathrm{h}}_{\mathrm{t}}=\mathrm{f}\left({\mathrm{h}}_{\mathrm{t}-1},{\mathrm{x}}_{\mathrm{t}}\right)$$
(2)

where ht = present state of the RNN network

Fig. 17
figure 17

Steps involved in RNN

ht−1 = previous state of the RNN network

xt = input to the RNN network

$${\mathrm{h}}_{\mathrm{t}}=\mathrm{ReLU}\;\left({\mathrm{W}}_{\mathrm{hh}}{\mathrm{h}}_{\mathrm{t}-1}+{\mathrm{W}}_{\mathrm{xh}}{\mathrm{x}}_{\mathrm{t}}\right)$$
(3)
Whh = weight at the recurrent neuron in the RNN network

Wxh = weight at the input neuron in the RNN network

ReLU = activation function used in the hidden layer

$${\mathrm{y}}_{\mathrm{t}}=\upsigma\left({\mathrm{W}}_{\mathrm{hy}}\;{\mathrm{h}}_{\mathrm{t}}\right)$$
(4)

σ = softmax activation function used in the output layer

yt = output of the RNN network

Why = weight at the output layer in the RNN network.
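Equations (2) to (4) describe a single recurrent step, which can be written directly in NumPy as a check of the notation (the dimensions below are illustrative):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    """One RNN step: Eq. (3) hidden update with ReLU, Eq. (4) softmax output."""
    h_t = np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t)      # ReLU hidden state
    z = W_hy @ h_t
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax output
    return h_t, y_t

# Illustrative dimensions: 20 hidden units, 6 inputs, 2 output classes
rng = np.random.default_rng(0)
W_hh, W_xh, W_hy = rng.normal(size=(20, 20)), rng.normal(size=(20, 6)), rng.normal(size=(2, 20))
h, y = rnn_step(np.zeros(20), rng.normal(size=6), W_hh, W_xh, W_hy)
```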

Each dataset is grouped into training data (85%) and test data (15%), as shown in Fig. 18, and the training data is further decomposed into validation data (15%) and the remainder for training. Once the proposed RNN model [31, 32] is trained on the training data, it is validated on the validation data to adjust the hyperparameters and select the best model. The test data is then used to evaluate the performance of the selected model, and statistical parameters are finally computed for it.

Fig. 18
figure 18

Functional block diagram of RNN

Figure 19 shows the architecture of the proposed RNN model applied to dataset 2. The input layer uses six neurons, followed by two long short-term memory (LSTM) layers: the first LSTM layer [28, 33] comprises 50 neurons followed by a 35% dropout, and the second contains 20 neurons followed by a 35% dropout. Each LSTM layer uses the ReLU activation function. The dense output layer uses the softmax activation function, with the number of neurons depending on the dataset to which the model is applied, as shown in Table 3.

Fig. 19
figure 19

Architecture of the proposed RNN using LSTM
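A Keras sketch of this architecture, following the layer sizes, activations, and dropout rates described above, is given below; the sequence length is an illustrative assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_rnn(timesteps=100, n_features=6, n_classes=3, dropout=0.35):
    """Proposed RNN: two ReLU LSTM layers with 35% dropout, softmax output."""
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),   # six input features per step
        layers.LSTM(50, activation="relu", return_sequences=True),
        layers.Dropout(dropout),
        layers.LSTM(20, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"), # class count depends on dataset
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_rnn()
model.summary()
```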

Table 4 summarizes the proposed RNN model [34], which uses two LSTM layers with the ReLU activation function and a dense output layer with the softmax function.

Table 4 Proposed RNN model summary

Figure 20 shows the cross-entropy loss [35] of the proposed RNN model during the training and validation phase. The plot of cross-entropy loss against the number of epochs indicates that the loss decreases as the number of epochs increases in both stages.

Fig. 20
figure 20

Plot of cross-entropy loss vs. epoch during training and validation in RNN

Figure 21 depicts the accuracy of the proposed RNN model during the training and validation phase; accuracy rises as the number of epochs grows in both stages.

Fig. 21
figure 21

Plot of accuracy vs. epoch during training and validation in RNN

The effect of various optimizers on the proposed RNN model applied to dataset 2 is shown in Fig. 22, which indicates that the Adam optimizer gives the best result among them.

Fig. 22
figure 22

Comparison of different optimizers in RNN

Table 5 shows the effect of different dropout rates on the proposed RNN model: as the dropout rate increases, the model’s accuracy increases and its cross-entropy loss decreases. Hyperparameter tuning thus selects the best proposed RNN model at a dropout rate of 0.35. The performance of this model is evaluated on the test data with a 35% dropout rate, as shown in Table 6. After hyperparameter tuning, the proposed RNN-based long short-term memory (LSTM) model achieved an accuracy of 82.32%.

Table 5 Hyper-parameter tuning of samples by applying different dropout rates in RNN
Table 6 Summary of test-data results in RNN
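Such a dropout sweep can be scripted directly. The sketch below is illustrative: the candidate rates, training settings, and placeholder data (standing in for the extracted features) are assumptions, and it reuses the build_rnn helper from the sketch above:

```python
import numpy as np

# Placeholder data in place of the extracted heart sound features
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 100, 6)), rng.integers(0, 3, 200)
X_val, y_val = rng.normal(size=(50, 100, 6)), rng.integers(0, 3, 50)

best_rate, best_acc = None, 0.0
for rate in [0.15, 0.25, 0.35]:                 # candidate dropout rates (illustrative)
    model = build_rnn(dropout=rate)
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     epochs=10, batch_size=32, verbose=0)
    acc = max(hist.history["val_accuracy"])
    if acc > best_acc:
        best_rate, best_acc = rate, acc         # keep the best-performing rate
```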

Table 7 presents the proposed Inception Net architecture. Table 8 shows the effect of adding inception blocks to the proposed Inception Net model: experimentally, increasing the number of blocks increases the model’s accuracy and decreases its cross-entropy loss.

Figure 22 compares the different optimizers used with deep learning methods such as the RNN; accuracy rises steeply under the Adam optimizer. The learning rate is a configurable and highly important hyperparameter in a deep learning model. Figure 23 depicts the effect of the learning rate on the cross-entropy loss of the proposed RNN model during training and validation, showing that the optimal condition is obtained at a learning rate of 0.001. Since the combined CNN-RNN architecture [36, 37] can suffer from low accuracy and long training time, this hyperparameter optimization readily addresses both issues.

Fig. 23
figure 23

Effect of learning rate during training and validation using ADAM optimizer
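The learning rate can be swept in the same way; the sketch below compares Adam learning rates using the build_rnn helper and the placeholder data from the previous sketches (the candidate values and training settings are assumptions):

```python
import tensorflow as tf

for lr in [0.01, 0.001, 0.0001]:               # candidate learning rates
    model = build_rnn(dropout=0.35)             # builder from the sketch above
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     epochs=10, batch_size=32, verbose=0)
    # The paper reports 0.001 as the optimal learning rate
    print(lr, min(hist.history["val_loss"]))
```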

The parameters used to judge the performance of the developed system are accuracy, precision, recall, and F-measure. Computing these metrics requires the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), as follows:

$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}}$$
(5)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(6)
$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(7)
$$\mathrm{F}1\;\mathrm{Score}=\frac{2\times\mathrm{Recall}\times\mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}$$
(8)
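These metrics can be computed from model predictions with scikit-learn, as in the following sketch (the label arrays are placeholders; in practice y_pred would come from the evaluation step, e.g. model.predict(...).argmax(axis=1)):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [0, 1, 1, 0, 1, 0]   # placeholder true labels
y_pred = [0, 1, 0, 0, 1, 1]   # placeholder predictions

print("Accuracy :", accuracy_score(y_test, y_pred))                    # Eq. (5)
print("Precision:", precision_score(y_test, y_pred, average="macro"))  # Eq. (6)
print("Recall   :", recall_score(y_test, y_pred, average="macro"))     # Eq. (7)
print("F1 score :", f1_score(y_test, y_pred, average="macro"))         # Eq. (8)
```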

7.4 Proposed CNN-based inception network

Inception networks [9, 19, 38] are often used in classification problems in medical imaging. Here, the inception network takes a set of heart sounds, performs the required operations and analysis on them, and predicts the heart sound type for classification [39]. The CNN-based inception network [40, 41] has multiple convolutional layers and inception modules that learn the various features in a heart sound and predict the class labels accordingly. All network parameters and hyperparameters are adjusted during the training phase of the deep learning model. A Python-based Keras sequential model [42, 43] was used for the implementation. The entire design of the deep learning model is shown in Fig. 24; this model summary was obtained after training and validation on the dataset.

Fig. 24
figure 24

Inception net block

Figure 25 shows the skeleton of the inception module used: a sparsely connected block with a max-pooling branch and parallel convolutions of kernel sizes 1, 3, and 5 at the same layer, whose filter outputs are concatenated.

Fig. 25
figure 25

Proposed CNN-based inception net model summary

Table 7 shows the proposed Inception Net architecture using three inception blocks with the ReLU activation function, an input layer, and an output layer with the softmax activation function.

Table 7 reflects the architecture of the proposed Inception Net model applied to dataset 1. The first hidden convolution layer contains 256 filters of kernel size 3 with the ReLU activation function, followed by a second hidden convolution layer with the same number of filters and kernel size. The third hidden layer consists of three inception modules. The fourth, fifth, and sixth hidden layers are fully connected layers containing 1200, 600, and 150 nodes, respectively. The output layer contains five nodes with the softmax activation function to classify the five types of heart sound [2, 44]; a sketch of this architecture follows below. Learning curves obtained for the proposed model on normal and abnormal heart sounds are plotted in Figs. 26 and 27.
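A Keras sketch of this architecture is given below, assuming 1-D convolutions over the extracted feature sequence; the input shape, per-branch filter counts inside the module, and the global pooling before the dense layers are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def inception_module(x, filters=64):
    """Parallel 1/3/5 convolutions plus max pooling, concatenated (Fig. 25).
    The per-branch filter count is an illustrative assumption."""
    b1 = layers.Conv1D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv1D(filters, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling1D(3, strides=1, padding="same")(x)
    return layers.Concatenate()([b1, b3, b5, bp])

def build_inception_net(input_shape=(300, 40), n_classes=5, n_blocks=3):
    """Two 256-filter conv layers, inception blocks, dense 1200/600/150, softmax-5."""
    inp = tf.keras.Input(shape=input_shape)
    x = layers.Conv1D(256, 3, padding="same", activation="relu")(inp)
    x = layers.Conv1D(256, 3, padding="same", activation="relu")(x)
    for _ in range(n_blocks):                  # three inception modules (Table 7)
        x = inception_module(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(1200, activation="relu")(x)
    x = layers.Dense(600, activation="relu")(x)
    x = layers.Dense(150, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```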

Table 7 Proposed Inception Net architecture
Table 8 Effect of adding inception blocks in CNN-based Inception Net model
Fig. 26
figure 26

Plot of cross-entropy loss vs. epoch during training and validation in CNN-based Inception Net

Fig. 27
figure 27

Plot of accuracy vs. epoch during training and validation in CNN-based Inception Net

Figure 26 shows the cross-entropy loss of the proposed Inception Net model [45] during training and validation; the loss decreases as the number of epochs increases in both phases. The learning rate affects the cross-entropy loss and the accuracy: a lower learning rate requires many iterations, while a larger one requires fewer. Choosing a proper learning rate value is thus a challenging and significant task, as shown in Fig. 28.

Fig. 28
figure 28

Effect of learning rate during training and validation using ADAM optimizer

Figure 27 highlights the accuracy of the proposed Inception Net model during training and validation; accuracy rises as the number of epochs grows in both stages.

The performance of the proposed Inception Net model [46, 47] is evaluated on the test data with the number of inception blocks set to six, as shown in Table 9; the best model was obtained when the number of added blocks equals six.

Table 9 Summary of test-data results in the Inception Network

Figure 28 shows the effect of the learning rate on the cross-entropy loss of the proposed inception net model under the training and validation phase. It shows that the optimal condition is obtained at a learning rate of 0.001.

After hyperparameter settings, the proposed CNN-based inception net model achieved an accuracy of 99.65%.

7.5 Comparison to machine learning methods

The proposed model is compared with the available machine-learning methods. A similar set of feature learning methods is considered, and the proposed software-based deep learning model is then assessed against them.

Dataset 1, dataset 2, and dataset 3 are used to compare all the machine learning algorithms, as shown in Fig. 29. The machine learning algorithms [14], namely Support Vector Machine [43], K-Nearest Neighbors, Naïve Bayes, and Random Forest, are written in Python 3.8 using Keras and TensorFlow. The datasets are fed to these models using fivefold cross-validation to evaluate their performance and statistical behavior [48, 49]. The models are assessed against the three datasets and compared, and the inception network is found to work best among all the algorithms, as shown in Fig. 30.
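The fivefold comparison of the classical models can be reproduced with scikit-learn, as in the following sketch (default classifier settings and placeholder data are assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(200, 193)            # placeholder features for one dataset
y = np.random.randint(0, 2, size=200)    # placeholder labels

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")  # fivefold CV
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```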

Fig. 29
figure 29

Comparison of the machine learning performance between the proposed methods (RNN and CNN) and other machine learning methods

Fig. 30
figure 30

Comparison of different machine learning methods with deep learning on various datasets

8 Result analysis

Table 10 shows the accuracy of the four proposed models. After hyperparameter tuning, the proposed hybrid CNN-RNN model attained better accuracy than the RNN model, and the SE-based inception network was found to work best of all.

Table 10 Accuracy of the proposed models

Tables 11 and 12 compare the performance metrics in terms of training and testing time. After hyperparameter optimization, the training and testing times of the proposed hybrid CNN-RNN model are lower than those of the RNN model, and the SE-Inception network achieves the lowest screening time of all. Table 13 provides a detailed comparative study of the performance of different stethoscopes.

Table 11 Comparison of screening time of different deep learning models
Table 12 Comparison of performance metrics of different deep learning models

Table 14 describes the 30 volunteers of different age groups and genders considered for the entire experimental study. Table 15 analyzes the PCG recordings obtained using the proposed contactless stethoscope with the SE-Inception network. Different auscultation positions, namely the upper left sternal border (ULSB), upper right sternal border (URSB), and lower left sternal border (LLSB), are considered under sitting, standing, and supine postures. The recordings are finally compared with each volunteer’s past medical history (Table 16).

Table 13 Comparison of performance metrics of different stethoscopes
Table 14 Description of the volunteers considered

For heart sound detection of valvular disease, the stethoscope diaphragm should be placed in good contact with the body over the heart at the areas defined in Fig. 31. For pulse rate measurement, the stethoscope diaphragm is placed over the brachial artery, with a pressure of about 30 mm Hg above the systolic pressure applied during palpation. For every volunteer, nine PCG readings (three from each posture) are obtained using the developed system. Table 16 shows scores on a 1-to-5 scale comparing the developed stethoscope readings with the volunteers’ medical histories. The developed system works quite decently for most volunteers in terms of accuracy (Tables 15 and 16).

Fig. 31
figure 31

Locations of heart for the acquisition of heart sound

Table 15 Analysis of PCG recordings obtained at various positions and postures from different subjects using the proposed contactless stethoscope
Table 16 Analysis of PCG recordings obtained using the proposed stethoscope against past medical history

9 Conclusion

Machine learning algorithms have certain restrictions in real-time valvular heart sound analysis applications. This research aimed to use a CNN-based deep learning classifier to develop a low-cost, portable, contactless stethoscope for valvular heart disease prediction in rural areas. The hardware development provides an ear-contactless electronic stethoscope with a Bluetooth-enabled speaker. An echo ultrasound 2D/3D imaging machine for cardiac imaging costs around 12,000 to 15,000 USD, whereas the total development cost of the AI-based prototype for detecting and predicting heart diseases is about 2500 USD. The available echo machines for cardiac imaging are portable and easy to use in rural villages; the developed prototype cannot image the cardiac condition, but it can predict valvular diseases cheaply using artificial intelligence and can therefore be deployed in rural village applications. The designed system was experimentally evaluated with 30 human volunteers whose medical history, clinically assessed by a physician, comprised 27 volunteers with a normal heart, two with mitral stenosis, and one with mitral regurgitation; on these volunteers, the developed system predicted the same results. The experimental studies sought a suitable and improved deep learning classifier using Python-based convolutional neural networks (CNN) and recurrent neural networks (RNN) on verified normal and abnormal heart sounds from three datasets, as described in the text. Integrating SE blocks with the existing state-of-the-art architectures improved the performance of the inception and residual networks, with standard SE blocks producing the best results. The training dataset trains the deep learning model, the validation dataset is used to tune its hyperparameters, and the test dataset measures its overall performance; tuning the hyperparameters was found to extract the best results from any of the deep learning models. Throughout the experiments, classifier accuracy, precision, recall, and F1-score were evaluated on the different heart sound databases. The CNN-based inception network model proved the most effective deep learning classifier, though it carries some computational complexity. A recurrent neural network (RNN) is also a suitable AI classifier for heart sounds; it takes much time in the learning phase, but once trained it gives speedy results.