1 Introduction

Otitis media (OM), a global health problem, is mostly seen in children under the age of 7 and can cause severe hearing loss and speech disorders [1]. OM affects approximately 1.2 billion people worldwide [2] and ranks second in hearing loss and fifth in the global disease burden [2, 3]. WHO estimates that OM complications may cause 28 thousand deaths each year [4]. General practitioners often diagnose OM with a small hand-held medical device called an otoscope, whereas otolaryngologists usually use more advanced and specialized tools such as an endoscope or microscope [5]. However, the high cost of these devices prevents their wide use, which makes the disease difficult to diagnose in hospitals that lack them. In addition, clinical examinations performed by specialists may yield subjective and differing interpretations [6].

Although there are machine learning and statistical-based studies in the literature regarding hearing loss and inner ear disorder [7,8,9,10,11,12,13,14,15,16], there are still gaps in recognizing middle ear diseases causing the limited use of automatic diagnosing systems.

Traditional machine learning methods rely on hand-crafted feature extraction, which involves complex processes and requires careful attention. Moreover, the quality of the extracted features directly affects classification performance. Improvements in the capacities of central processing units (CPUs) and graphical processing units (GPUs) over the last decade have enabled the development of high-performance deep learning (DL) models that eliminate the difficulty of extracting hand-crafted features [17]. Complex cognitive tasks such as image and voice recognition can be performed more efficiently thanks to DL models that contain a large number of processing units and layers.

In recent years, the ability of DL models to solve complex tasks with high success by processing large data without the need for feature extraction has made these models quite common in disease classification and medical image recognition [18,19,20,21,22,23]. However, only a limited number of studies in the literature diagnose and classify middle ear disease using DL methods. This study uses a current public dataset prepared by Viscaino et al. [16]. The fact that only a few DL-based studies have been carried out on this dataset motivated us to examine the performance of different DL approaches in classifying the disease using this dataset. To fill the current gap, this study examines the effect of the ensemble voting approach on the multi-class image classification performance of DL methods on the otoscopy dataset, which includes images of tympanic membrane (TM) conditions such as myringosclerosis, earwax plug, chronic otitis media (COM), and normal. The first reason for adopting the ensemble approach is that it combines many independently trained models whose strengths and weaknesses lie in different parts of the input space, which improves generalization and yields more robust models. The second reason is that ensemble approaches are less sensitive to noise. Another important point is that they can achieve better results than single models by balancing bias and variance. In this context, the significant contributions of this study are summarized as follows.

  • This study is the first that addresses the performance of a voting ensemble framework that uses Inception V3, DenseNet 121, VGG16, MobileNet, and EfficientNet B0 pre-trained DL models on the Ear Imagery dataset.

  • The accuracy of TM classification was improved by utilizing hard and soft voting approaches.

  • The proposed approach provides a significant enhancement in solving slow diagnosis and high-cost problems of computer-aided decision support systems in this field.

The remainder of this study is organized as follows. Section 2 presents the papers related to the classification of tympanic membrane conditions. Section 3 introduces the dataset and explains the proposed voting ensemble framework and the pre-trained CNN models included in this framework. Section 4 presents the obtained results in detail. The results are discussed in Section 5. Finally, the study is concluded in Section 6.

2 Related work

In hospitals, otolaryngologists usually examine otoscopy images to gain a more comprehensive evaluation of the disease and implement a treatment plan based on this. In recent years, the increase in the use of DL-based approaches for the classification of middle ear diseases has been noticeable due to the contribution they provide to field experts in decision-making. For detecting normal and OM cases, Lee et al. proposed a convolutional neural network (CNN) model [24]. In the feature extraction process, they analyzed the tympanic membrane (TM) perforation using a class activation map and obtained 91% accuracy. Cha et al. [25] used an auto-endoscopic image dataset including 10,544 samples in their study, where they examined the success of 9 pre-trained DL models. In their study, a multi-class image classification including attic retraction, tympanic perforation, myringitis, otitis externa, and tumor and normal conditions was performed. They composed an ensemble classifier of two pre-trained models (Inception-V3 and ResNet101) which achieved an average accuracy of 93.67% for fivefold cross-validation. Another ensemble-based DL model was proposed by Zeng et al. They trained 9 CNN models using a total of 20,542 endoscopic images to classify 8 TM conditions. Finally, they selected the two best models according to accuracy and training time and combined them in an ensemble classifier and achieved an average accuracy of 95.59% [26]. In another study, Khan et al. [27] utilized DL models such as DenseNet161, ResNet50, VGGNet16, SE_ResNet152, and InceptionResNet_v2 on 2484 auto-endoscopic images and reported that the DenseNet161 model obtained 95% accuracy on the dataset including COM with TM perforations, OME, and normal classes. Başaran et al. [28] performed TM detection and TM classification in their study based on a faster regional convolutional neural network (Faster R-CNN) and pre-trained CNN on augmented images, respectively. 
In their study, they achieved the best results using Faster R-CNN and VGG16, with an average classification accuracy of 90.8% under tenfold cross-validation. Wang et al. [29] used computed tomography images of the temporal bone to diagnose COM and compared the assessments of six clinicians with the results of pre-trained CNN methods. They observed that their proposed model outperformed the clinical experts in some cases. Using a transfer learning approach, Zafer [30] performed TM classification on otoscopy images of normal, CSOM, AOM, and earwax cases obtained from the same hospital reported by Başaran et al. [28]. The features provided by 7 different pre-trained CNNs were fused and then fed to traditional machine learning models, and it was reported that the SVM model achieved the best accuracy with 99.47%. In the same study, the accuracy under tenfold cross-validation was reported to be 98.74%. Singh and Dutta proposed a deep learning-based approach for the automatic detection of ear disease and achieved a highest accuracy of 96% by applying data augmentation [31]. In addition to these studies, we have recently performed a tympanic membrane (TM) classification in which keypoint-based deep hypercolumn features extracted from 5 different layers of the VGG16 model were classified with 99.06% accuracy using a Bi-LSTM deep learning model [32]. However, in that study, we encountered high consumption of system resources due to the use of hypercolumn features.

In addition to the studies on OM cases discussed above, other studies deal with ear problems due to other causes. For example, Zeng et al. designed a Siamese network for the classification of conductive hearing loss, in which two ResNet-101 networks share weights on raw otoscopic images and on images segmented with the U-Net architecture [33]. Choi et al. addressed the problem that most TM lesions have more than one diagnostic name. The authors investigated the effect of concurrent diseases on the classification performance of deep learning networks in a dataset of auto-endoscopic images with multiple diseases [34]. Habib et al. used CNN-based ResNet-50, DenseNet-161, and VGG16 models and a Vision Transformer model on 1842 images. These models, which were initialized with weights pre-trained on the ImageNet dataset, were used to detect normal and abnormal classes. Although the authors worked on three different multiclass datasets in their study, the fact that their proposed model performs binary classification can be considered a drawback [35]. Nam et al. used the regions of interest detected from 4,808 autoscopic images with Mask R-CNN to extract the distinctive features necessary for classification. Then, by adopting an ensemble approach including EfficientNetB0 and Inception-V3 models, they achieved an average classification accuracy of 97.29% with fivefold cross-validation [36]. Afify et al. created a CNN model from scratch on 880 otoscopy images and determined the model parameters using the Bayesian optimization technique. The authors achieved 98.10% classification accuracy with the model they proposed [37]. Wang et al. proposed a VGG16 model to detect cholesteatoma and chronic suppurative otitis media, which was trained on 973 computed tomography images obtained from 499 patients. However, the authors reported that the model working with CT images had difficulties detecting the early stages of these two diseases [38].

As can be seen from the literature reviewed in this section, the success of ensemble approaches built from DL methods on TM classification has not been examined in detail. The fact that deep learning-based studies for middle ear disease recognition and classification are still limited, together with the need for DL-based approaches that solve this problem efficiently, motivated us to develop a DL approach for TM classification that uses system resources more efficiently.

3 Methodology

This study presents a framework that performs the classification of TM conditions including normal, earwax plug, myringosclerosis, and chronic otitis media, stably and efficiently. In this context, the predictions made by fine-tuned pre-trained CNN models on the Ear Imagery dataset were handled by a voting ensemble approach to obtain a more accurate and stable classification. A voting ensemble framework was built using both soft and hard voting approaches and the results of both approaches were examined. Figure 1 shows the overview of the proposed approach.

Fig. 1 Overview of the proposed voting ensemble framework

3.1 Dataset

In this study, the publicly available Ear Imagery dataset prepared by Viscaino et al. [16] is used. The images in the dataset were collected and labeled by a research group, one of whose members is an otolaryngologist, in collaboration with the Department of Otolaryngology of the Clinical Hospital of the Universidad de Chile. The ground truths of OM cases were determined by the otolaryngologist and approved by the scientific committee of the relevant university. In addition, this dataset was divided by its owners into train, validation, and test sets. The dataset has 880 otoscopy images with a resolution of 420 × 380, which belong to 180 patients aged 7 to 65 years who applied to the otolaryngology outpatient clinic of the Universidad de Chile Clinical Hospital. The images are divided into 4 categories (chronic otitis media, myringosclerosis, earwax plug, and normal otoscopy), and each category has 220 samples, which makes the dataset balanced. Figure 2 demonstrates sample images of each TM condition.

Fig. 2 Sample images belonging to different TM conditions: (a) normal, (b) chronic otitis media, (c) earwax plug, (d) myringosclerosis

3.2 CNN

Convolutional neural networks (CNNs), introduced by Yann LeCun [39], are neural networks primarily specialized for image classification, object recognition, and detection. Unlike traditional image processing methods, CNNs learn the most appropriate features directly from the data. The traditional CNN structure consists of convolution layers, pooling layers, and fully connected layers (Fig. 3). The convolution and pooling layers are used for feature extraction, while classification of an input image is performed in the fully connected layers. All features from the previous layers are flattened into a one-dimensional feature vector to make the network suitable for classification and prediction, and finally the output classification probabilities are obtained in the softmax layer.

Fig. 3 Traditional CNN structure

3.3 Pre-trained CNN models

In this study, the CNN-based DL models Inception-V3, DenseNet 121, VGG16, MobileNet, and EfficientNet B0 were used. These models are summarized in the following subsections.

3.3.1 Inception-V3

Inception-V3, developed by Google, is an extension of GoogLeNet and the third release in the Inception series of architectures [40]. Inception-V3 is one of the most advanced architectures used in image classification. The architecture is built from Inception modules, each of which applies convolutional filters of several different sizes in parallel and concatenates their outputs. This type of design reduces the number of parameters to be trained, thus reducing the computational complexity. Convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers are parts of the symmetrical and asymmetrical building blocks that make up the model. The Softmax activation function is used in the last layer of the Inception-V3 architecture, which has 42 layers in total and an input layer that takes images of 299 × 299 pixels.

3.3.2 DenseNet

Convolution and sub-sampling reduce the size of feature maps during the training of neural networks, and image features are also lost in the transitions between layers. Huang et al. [41] created the DenseNet architecture to make better use of image information. Each layer is connected to every subsequent layer in a feed-forward fashion, so a layer can access the feature maps of all layers preceding it. In addition, DenseNet has several advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

3.3.3 VGG16

The VGG16 model, which has 16 weight layers, was developed by Simonyan and Zisserman and is based on the CNN model [42]. It achieved 92.7% top-5 accuracy in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The first 13 of these layers are convolutional layers with 3 × 3 filters, interleaved with 2 × 2 max pooling layers that shrink the input images. The ReLU activation function is applied in these layers. They are followed by three fully connected layers, which contain most of the parameters of the network. In total, the model has about 138 million parameters. Because the model is deeper than previous CNN models, it takes a long time to train.

3.3.4 MobileNet

MobileNet [43] is a lightweight deep neural network (DNN) for mobile and embedded devices that was proposed by Google in 2017. MobileNet employs depthwise separable convolution to construct a lightweight DNN. The work is aimed at model compression, and its basic idea is the decomposition of the standard convolution kernel into a depthwise convolution followed by a 1 × 1 pointwise convolution. This efficiently reduces the number of network parameters while also optimizing latency.
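To make the saving from depthwise separable convolution concrete, the following minimal Python sketch compares the weight counts of a standard convolution and of its depthwise separable factorization, ignoring biases. The function names and the layer sizes are illustrative assumptions and are not taken from the MobileNet configuration [43].

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard k x k convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer (assumed sizes): 3 x 3 kernel, 128 input and 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
reduction = standard / separable
print(standard, separable, round(reduction, 2))
```

For this assumed layer, the factorized form needs roughly one ninth of the weights of the standard convolution.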

3.3.5 EfficientNet

The EfficientNet model, which achieved 84.4% accuracy in the ImageNet classification challenge with 66 M parameters, can be thought of as a family of CNN models. EfficientNet has 8 models ranging from B0 to B7, and as the model number increases, the estimated number of parameters does not increase significantly, but the accuracy does. The architecture is designed to obtain more effective results with smaller models. Unlike other state-of-the-art models, which typically scale a single dimension, EfficientNet scales depth, width, and resolution together, yielding more efficient results. A grid search is the first stage of the compound scaling approach; it determines the relationship between the scaling dimensions of the baseline network under a fixed resource constraint. Reasonable scaling coefficients for the depth, width, and resolution dimensions are established in this manner. Following that, the coefficients are used to scale up the baseline network [44].
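For reference, the compound scaling rule described in the EfficientNet paper [44] can be written as follows. The symbols follow the cited paper and are not used elsewhere in this study: \(\phi\) is a user-chosen compound coefficient, \(\alpha\), \(\beta\), and \(\gamma\) are the constants found by the grid search on the baseline network, and \(d\), \(w\), and \(r\) are the resulting multipliers for depth, width, and resolution.

$$d={\alpha }^{\phi },\quad w={\beta }^{\phi },\quad r={\gamma }^{\phi },\quad \text{subject to}\;\; \alpha \cdot {\beta }^{2}\cdot {\gamma }^{2}\approx 2,\;\; \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1$$

Under this constraint, increasing \(\phi\) by one roughly doubles the computational cost of the scaled network.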

3.4 Transfer learning

Transfer learning is the task of transferring the knowledge gained from a trained network to a different network created to solve a similar problem [45]. The main idea in this approach is that DL networks can learn characteristic features if they are properly trained. The layers that capture general features are retained, while the last few layers of the pre-trained network are replaced with layers suitable for the new task. Using a network pre-trained on a large dataset both reduces the training time required for the new task and enables higher accuracy to be achieved.

3.5 Voting ensemble approach

Rather than trying to find a single best classifier on a classification problem, stronger generalization capacity can be achieved by combining more than one classifier with ensemble methods [46]. Ensemble methods often produce more accurate results than a single classifier. The most preferred ensemble approach adopting the combination method is the voting ensemble approach [47,48,49,50], which provides models with high generalization capacity. There are two commonly used schemes among voting approaches. These are hard voting and soft voting.

Hard voting is the basic form of majority voting. In this method, the final class label is the label predicted by the majority of the classifiers. The function is represented by Eq. 1.

$${\widehat{y}}_{i}=mode\left\{{C}_{1}\left({x}_{i}\right), {C}_{2}\left({x}_{i}\right), \dots ,{C}_{j}\left({x}_{i}\right),\dots ,{C}_{m}\left({x}_{i}\right)\right\}$$
(1)

where, m is the number of classifiers, \({x}_{i}\) denotes the ith sample, \({C}_{j}({x}_{i})\) denotes the predictions of the jth classifier and the mode function is used to calculate the majority vote of all predictions [51].
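Eq. 1 can be illustrated with a minimal Python sketch. The function name `hard_vote`, the example predictions, and the tie-breaking rule (the label seen first wins) are assumptions made for illustration; the text does not state how ties are resolved.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the majority label among the classifiers' predictions for one sample.

    Ties are broken in favour of the label that appears first in the list;
    this tie-breaking rule is an assumption, not something stated in the text.
    """
    counts = Counter(predicted_labels)
    # most_common orders equal counts by first occurrence (Python 3.7+).
    return counts.most_common(1)[0][0]


# Hypothetical predictions of m = 5 classifiers for three samples.
predictions_per_sample = [
    ["normal", "normal", "myringosclerosis", "normal", "COM"],
    ["earwax plug", "earwax plug", "earwax plug", "earwax plug", "earwax plug"],
    ["COM", "myringosclerosis", "myringosclerosis", "COM", "myringosclerosis"],
]
final_labels = [hard_vote(p) for p in predictions_per_sample]
print(final_labels)
```

With an odd number of classifiers, such as the five models used here, ties can still occur in a 4-class problem (for example a 2-2-1 split), so some tie-breaking rule is needed in practice.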

Soft voting calculates the weighted sum of the prediction probabilities of all classifiers for each class for estimating the final class label as shown in Eq. 2. The label belonging to the class with the highest probability in total is selected.

$${\widehat y}_i={\arg\;\max\nolimits_k}\sum\nolimits_{j=1}^mw_jp_{i,k}^j$$
(2)

where, m is the number of classifiers, \({w}_{j}\) denotes the weight of jth classifier, \({p}_{i, k}^{j}\) denotes the prediction probability of jth classifier for assigning the ith sample to kth class [51].
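A minimal Python sketch of Eq. 2 is given below. The function name `soft_vote`, the equal default weights, the class order, and the probability values are hypothetical and serve only to show the mechanics.

```python
def soft_vote(probabilities, weights=None):
    """Weighted soft voting for one sample (Eq. 2).

    probabilities[j][k] is the probability that classifier j assigns to class k.
    Returns the index of the class with the largest weighted sum.
    """
    num_classifiers = len(probabilities)
    num_classes = len(probabilities[0])
    if weights is None:
        # Equal weights are an assumption; any non-negative weights can be passed.
        weights = [1.0] * num_classifiers
    totals = [
        sum(weights[j] * probabilities[j][k] for j in range(num_classifiers))
        for k in range(num_classes)
    ]
    return max(range(num_classes), key=lambda k: totals[k])


CLASSES = ["normal", "myringosclerosis", "COM", "earwax plug"]
# Hypothetical softmax outputs of three classifiers for a single image.
sample_probabilities = [
    [0.55, 0.40, 0.03, 0.02],
    [0.50, 0.45, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],
]
soft_label = CLASSES[soft_vote(sample_probabilities)]
print(soft_label)
```

In this example two of the three classifiers favour the normal class, so hard voting would return normal, whereas the summed probabilities favour myringosclerosis. This is how the two schemes can disagree on the same sample.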

3.6 Performance metrics

Several common performance metrics such as Accuracy (Acc), Sensitivity (Sen), Specificity (Spe), and Precision (Pre) are used to measure the performance of the state-of-the-art CNN models and the voting ensemble approach including these models. These metrics are based on the values of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN), which are calculated from the confusion matrix. Since multi-class classification is carried out in this study, TP, TN, FP, and FN values are calculated separately for each class. Thus, TP is the number of correctly classified images of the relevant class, while FP is the number of images from other classes that are incorrectly assigned to the relevant class. On the other hand, TN is the number of images from other classes that are not assigned to the relevant class, and FN is the number of images of the relevant class that are assigned to another class. The performance metrics used in the study and their extended calculations for multi-class classification using macro-averaging [52, 53] are given in Eqs. 3–12.

For a class \(k\),

$$Sen(k)= \frac{TP(k)}{TP(k)+FN(k)}$$
(3)
$$Spe(k)= \frac{TN(k)}{TN(k)+FP(k)}$$
(4)
$$Acc(k)= \frac{TP\left(k\right)+TN(k)}{TP(k)+FN\left(k\right)+TN\left(k\right)+FP(k)}$$
(5)
$$Pre(k)= \frac{TP(k)}{TP(k)+FP(k)}$$
(6)
$$F1\; score(k)= \frac{2\times Pre(k)\times Sen(k)}{Pre\left(k\right)+Sen(k)}$$
(7)
$$Average \;Sen= \frac{1}{\#classes} \sum\nolimits_{k=1}^{\#classes}Sen\left(k\right)$$
(8)
$$Average\; Spe= \frac{1}{\#classes} \sum\nolimits_{k=1}^{\#classes}Spe\left(k\right)$$
(9)
$$Average\; Acc= \frac{1}{\#classes} \sum\nolimits_{k=1}^{\#classes}Acc\left(k\right)$$
(10)
$$Average\; Pre= \frac{1}{\#classes} \sum\nolimits_{k=1}^{\#classes}Pre\left(k\right)$$
(11)
$$Average\; F1 \;score= \frac{1}{\#classes} \sum\nolimits_{k=1}^{\#classes}F1 \;score\left(k\right)$$
(12)
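The per-class and macro-averaged metrics of Eqs. 3–12 can be computed from a confusion matrix as in the following minimal Python sketch. The function names and the 3-class matrix are hypothetical; the sketch also assumes that no denominator is zero, which a production implementation would need to guard against.

```python
def per_class_metrics(confusion, k):
    """Sen, Spe, Acc, Pre and F1 for class k (Eqs. 3-7).

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                           # class k predicted as another class
    fp = sum(confusion[i][k] for i in range(n)) - tp      # other classes predicted as k
    tn = total - tp - fn - fp
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    acc = (tp + tn) / total
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    return {"Sen": sen, "Spe": spe, "Acc": acc, "Pre": pre, "F1": f1}


def macro_average(confusion):
    """Unweighted mean of each per-class metric (Eqs. 8-12)."""
    n = len(confusion)
    per_class = [per_class_metrics(confusion, k) for k in range(n)]
    return {name: sum(m[name] for m in per_class) / n for name in per_class[0]}


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
class0 = per_class_metrics(toy_confusion, 0)
averages = macro_average(toy_confusion)
print(class0)
print(averages)
```

Note that the macro-averaged accuracy of Eq. 10 is generally higher than the plain fraction of correctly classified samples (27/30 = 0.90 for this toy matrix, against a macro-averaged Acc of about 0.93), because every class counts the correctly rejected samples of the other classes as TNs.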

Apart from these performance metrics, another measure of classification performance is the Receiver Operating Characteristic (ROC) curve, which shows the variation of the True Positive Rate (TPR) with respect to the False Positive Rate (FPR). The Area Under the Curve (AUC) is the area under the ROC curve. A value close to 1 indicates high success for the model, a value close to 0 indicates low success in discriminating the classes, and 0.5 indicates that the model selects a random class each time. The ratio of true positives to all actual positives gives the TPR, which is also called sensitivity. The ratio of negatives incorrectly predicted as positive to all actual negatives gives the FPR, which is also defined as 1 − Specificity. The calculations of these metrics are given in Eqs. 13–14.

For a class \(k\),

$$TPR\left(k\right)=\mathrm{Sen }\;(k)$$
(13)
$$FPR\left(k\right)=1- \mathrm{Spe }\;(k)$$
(14)
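The construction of a one-vs-rest ROC curve from Eqs. 13–14 and of its AUC can be sketched in Python as follows. The function names, the labels, and the scores are hypothetical, and the trapezoidal rule is one common way to compute the area; the text does not state which implementation was used in the study.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) for one class treated as positive (one-vs-rest).

    labels: 1 for the positive class, 0 otherwise.
    scores: predicted probability of the positive class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the decision threshold from high to low over the distinct scores.
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores for six test images of one class versus the rest.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(toy_labels, toy_scores)
area = auc(curve)
print(curve, area)
```

In this toy case, 8 of the 9 positive-negative pairs are ranked correctly, which equals the AUC of 8/9 obtained from the curve.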

4 Results

4.1 Training

In this study, the Ear Imagery dataset was used in its original form determined by the publishers, without changing the number of instances in the training, validation, and test datasets. The publishers divided the dataset containing 880 images into training, validation, and test datasets containing 576, 144, and 160 images, respectively. These numbers correspond to approximately 80% and 20% of the images for training and testing, respectively, with 20% of the training portion reserved for validation. The dataset includes 3 classes of OM disease (chronic otitis media, earwax plug, myringosclerosis), as well as a control class (normal), yielding 4 classes in total. The train, validation, and test sets were balanced such that they contained an equal number of sample images for each class. For each model used in this study, the zero-center normalization method was applied to the pixels of the otoscopy images. After the dataset was split into train and test sets, the Z-score normalization parameters were computed using only the training dataset. Afterward, Z-score normalization was applied to the test dataset using the mean and variance obtained from the training dataset. In this study, CNN models pre-trained on the ImageNet dataset were used for classification after the required modifications were applied to their last layers. The structure of the CNN layers that form the feature maps in the models was preserved, and a global average pooling layer was added afterwards. The number of output neurons in the last fully connected layer of each pre-trained model was set to 4 to adapt it to the tympanic membrane classification problem. In the last layer, Softmax was used as the activation function, and categorical cross-entropy was used as the loss function. All layers in the models were set as trainable.
Hyperparameters of the CNN models such as learning rate, batch size, number of epochs, and activation function were set by trial and error, and the determined values are given in Table 1. All CNN models used in this study were trained for 40 epochs using the Adam optimizer with a batch size of 8. To prevent overfitting, the weights from the epoch that achieved the best validation accuracy during training were chosen as the final model.
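The normalization step described above, in which the statistics are estimated on the training data only and then reused for the test data, can be sketched as follows. The function names and pixel values are hypothetical, and a single global mean and standard deviation are assumed; the text does not state whether the statistics were computed per channel.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate the mean and standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)


def apply_zscore(values, mu, sigma):
    """Standardize values with previously estimated statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical pixel intensities; real images would be flattened arrays.
train_pixels = [10.0, 20.0, 30.0, 40.0]
test_pixels = [25.0, 55.0]

mu, sigma = fit_zscore(train_pixels)
train_norm = apply_zscore(train_pixels, mu, sigma)
test_norm = apply_zscore(test_pixels, mu, sigma)  # reuses the training statistics
print(mu, sigma, test_norm)
```

Reusing the training statistics keeps information about the test set from leaking into the preprocessing step.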

Table 1 Parameter values used for CNN models

In this study, all codes were developed using Keras 2.3.1 in Python programming language, and experiments were carried out on a computer running a 64-bit Ubuntu 18.04.3 LTS operating system with Intel (R) Xeon (R) 2.00 GHz CPU, 12 GB RAM and NVIDIA T80 GPU with 12 GB memory.

4.2 Experimental results

In this study, TM conditions were classified using the Voting ensemble framework, and the results of the proposed framework were compared with current studies in the literature. First, as mentioned in Section 3.4, we applied the transfer learning approach and fine-tuned the pre-trained CNN models to obtain the final models. The accuracy and loss curves obtained during the training of the models are given in Fig. 4. As can be observed from the figure, the accuracies obtained with pre-trained models were quite high.

Fig. 4 Training and validation accuracy/loss curves

While classifying in the Voting ensemble framework (soft voting and hard voting) approach, the weights used for voting are the values obtained from the Softmax activation function in the output layers of the DL models used in the study. Since the dataset used in this study includes 4 classes, the average values were calculated by taking the mean of the values presented by the model concerning relevant metrics for each class. In this context, the average Acc, Sen, Spe, and Pre values obtained on the test dataset for all models are given in Table 2. The values marked in bold in this table represent the best values obtained for the relevant performance criterion.

Table 2 Average results of pre-trained DL models and voting ensembles on the test dataset

As seen in Table 3, while the soft and hard voting approaches showed close performances, these two approaches offered higher accuracy compared to other models. In addition, these two approaches offered the best performance in the context of Sen and Spe metrics. In particular, the soft voting approach slightly outperformed the hard voting approach. Despite close accuracy values between 96.6% and 98.8% offered by the models, when it was evaluated in the context of the precision metric, it was seen that there were more significant differences between the models. Precision indicates how many of the samples a model predicts positively are true positive, and precision increases as the number of FPs decreases. Accordingly, although the correct classification rate in the samples classified as positive by soft voting and hard voting approaches is higher than other models, soft voting provided the best performance in terms of this metric.

Table 3 Experimental results and confidence intervals of used models

The challenging aspect of classification problems is the model's ability to make predictions for samples not seen during training, and the accuracy of these predictions must be evaluated in light of model uncertainty. The confidence interval is an appropriate way to measure the uncertainty of an estimate. To measure the confidence of a model, the model is trained multiple times, and the confidence interval is then calculated from the distribution of the results obtained by the model on the test data. Accordingly, the models used in this study were each trained 5 times, and the mean accuracy values, standard deviations, and the lower and upper accuracies of the 95% confidence interval were calculated for each model. As seen in Table 3, the differences between the lower and upper accuracy values calculated for the models vary within a small range, between 0.12 and 0.3, which indicates that the uncertainties of the models are quite low.
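One way to obtain such an interval from repeated runs is sketched below. The text does not state which formula was used, so the normal approximation, the function name, and the five accuracy values are all assumptions; with only five runs, a Student's t interval would be somewhat wider.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean of repeated runs."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * s / len(values) ** 0.5
    return m, s, m - half_width, m + half_width


# Hypothetical test accuracies (%) from five independent training runs.
run_accuracies = [98.4, 98.8, 98.6, 98.8, 98.4]
avg, sd, lower, upper = confidence_interval(run_accuracies)
print(round(avg, 2), round(sd, 3), round(lower, 2), round(upper, 2))
```

A narrow interval, as in this made-up example, indicates that the repeated runs produce nearly the same test accuracy.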

In addition, the accuracies of the pre-trained models and the voting ensemble approach were given in Fig. 5 in ascending order. As can be seen from Fig. 5, performances of voting ensemble approaches were superior to pre-trained models alone for TM conditions classification. The highest accuracy was achieved with the soft voting approach with 98.8%, while the model with the lowest accuracy was Inception V3 with 96.6%.

Fig. 5 Performance comparisons of the models

The confusion matrices of the ensemble approaches are given in Fig. 6. The Acc, Sen, Spe, and Pre values obtained by the hard and soft voting ensemble approaches, which provide the best performances in the context of all the metrics used in this study, were detailed in Tables 4 and 5 for each class in the Ear Imagery dataset. Considering the classification performances of hard and soft voting approaches used in the final estimation stage, while both approaches offered the same classification performance for COM and earwax plug classes in the abnormal category, the soft voting approach was slightly better than the hard voting approach in the classification of the samples in normal class and the myringosclerosis class in the abnormal category.

Fig. 6 Confusion matrices: (a) hard voting model, (b) soft voting model

Table 4 Detailed results of the hard voting model
Table 5 Detailed results of the soft voting model

Considering the precision values obtained by the soft voting approach for each class, this approach achieved values ranging from 95.1% to 100%. When the accuracy values obtained for each class were examined, it was seen that this approach made predictions with an accuracy between 98.1% and 100%. Finally, when the Sen and Spe values obtained for each class were analyzed, it was seen that the soft voting approach achieved the Sen metric between 92.5% and 100% and the Spe metric between 98.3% and 100%.

To show the performance of the DL architectures considered in this study more concretely, the number of misclassified samples, in other words the number of FNs, for each class is also given in Table 6. As can be observed in this table, the soft voting approach had some incorrect classifications in the myringosclerosis class. The confusion matrix obtained from the soft voting approach (Fig. 6b) shows 3 FNs for this class, of which 2 were predicted as normal and 1 as COM. The soft voting approach presented the lowest number of misclassifications, with 4 out of all samples. Although no model classified all samples in the normal class correctly, the soft and hard voting ensemble approaches presented the lowest number of misclassifications for this class, with 1. Among the CNN-based pre-trained models, DenseNet 121 and VGG16 also achieved this result. Hard and soft voting approaches correctly classified all the samples in the COM class at the final prediction stage. The pre-trained models that classified all the samples in this class correctly were DenseNet 121 and EfficientNet B0. Samples in the earwax plug class were correctly classified by both voting ensemble approaches and by the pre-trained CNN models. On the other hand, the most difficult TM condition to distinguish was myringosclerosis. The FP values presented by the pre-trained CNN models and the voting ensemble approach for the myringosclerosis class were higher than for the other classes. In addition, the soft voting approach had the lowest number of misclassifications compared to the single models and the hard voting approach. Other methods produced between 4 and 6 misclassifications.
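As a consistency check on these counts, the macro-averaged accuracy of Eq. 10 can be related to the number of misclassified test images. This is a sketch under the assumption that the reported 98.8% is the average of the per-class accuracies; the symbols \(E\) (misclassified test images), \(N\) (test images), and \(K\) (classes) are introduced here only for illustration. Each misclassified image counts once as an FN for its true class and once as an FP for its predicted class, so that

$$Average\; Acc=1-\frac{2E}{K\times N}=1-\frac{2\times 4}{4\times 160}=0.9875\approx 98.8\%$$

which agrees with the soft voting accuracy, whereas the plain fraction of correctly classified test images would be \(156/160=97.5\%\). Likewise, with 40 test images per class, the 3 FNs of the myringosclerosis class give \(Sen=37/40=92.5\%\).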

Table 6 False prediction numbers of DL models

The ROC curves and AUC values of both voting frameworks obtained for each TM condition are shown in Fig. 7. As seen in Fig. 7, the soft voting approach was superior to the hard voting approach with AUC values of 0.995, 0.991, 0.999, and 1.000 for normal, myringosclerosis, COM, and earwax plug classes, respectively.

Fig. 7

ROC curves of voting frameworks for each class: (a) COM, (b) earwax plug, (c) myringosclerosis, (d) normal

In addition, the class-based macro-average ROC curves and AUC values of the hard and soft voting approaches used in the final prediction stage for the classification of TM conditions are given in Fig. 8. As seen from the ROC curves, the soft voting approach was slightly better than the hard voting approach across all classes.

Fig. 8

ROC curves of soft and hard voting approaches
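The per-class and macro-average AUC values discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: each class is scored one-vs-rest, the AUC is computed through its rank interpretation (the probability that a positive sample receives a higher score than a negative one, with ties counted as one half), and the macro average is taken as the unweighted mean of the per-class AUCs. The function names and the toy labels and probabilities are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, n_classes):
    """Return (per-class AUC list, unweighted macro-average AUC)."""
    aucs = []
    for k in range(n_classes):
        labels = [1 if y == k else 0 for y in y_true]  # class k vs. the rest
        scores = [row[k] for row in proba]             # ensemble score for class k
        aucs.append(binary_auc(labels, scores))
    return aucs, sum(aucs) / n_classes


# Toy data (assumption): 6 samples, 3 classes, averaged ensemble probabilities.
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.8, 0.1, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.7, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

per_class, macro = one_vs_rest_auc(y_true, proba, 3)
print("per-class AUC:", per_class)
print("macro-average AUC:", macro)
```

For a hard voting ensemble, which outputs labels rather than probabilities, one possible choice of score is the fraction of models voting for each class. Whether the curves in Fig. 7 and Fig. 8 were obtained this way is not stated in the text.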

5 Discussion

Studies on the classification of TM conditions can be divided into two groups: those using traditional machine learning methods and those using DL methods. Studies that applied traditional machine learning methods to different datasets reported model accuracies ranging from 73 to 93%. In the last two years, DL models specialized for images, which do not require a laborious handcrafted feature extraction process, have been commonly used to increase the success of TM condition classification.

The dataset used in this study is a current dataset made public by Viscaino et al. [16]. They achieved 93.90% accuracy in their study, which used classical machine learning methods to classify extracted features. In this study, a voting ensemble framework containing pre-trained CNN models was proposed for the classification of TM conditions. Our proposed soft voting and hard voting ensemble approaches outperformed this classical machine learning result, achieving 98.8% and 98.4% accuracy, respectively, on the same dataset. In addition, the performances of the single pre-trained CNN models were examined. These models lagged behind both of the proposed voting ensemble approaches. Among the single pre-trained CNN models, InceptionV3 had the lowest performance in terms of all metrics. With 98.1% accuracy, DenseNet-121 outperformed the other single models and came closest to the hard and soft voting ensemble approaches.

Furthermore, Table 7 summarizes the TM classification results of our proposed DL-based ensemble model and other CNN-based DL studies. Except for the studies of Singh and Dutta [31] and Afify et al. [37], the studies in this table used different datasets. Therefore, a direct comparison of the classification performances presented in these studies with the performance of our proposed model is inappropriate. Singh and Dutta [31] carried out a four-class classification on augmented images with their own CNN architecture, consisting of 6 convolutional and 2 dense layers, without using a transfer learning approach. They trained the CNN model for 400 epochs, validated the model containing the best weights, obtained at the 367th epoch by applying the checkpoint technique, and achieved 96% accuracy on the test dataset. Their study is valuable in showing that 96% accuracy can be achieved in TM classification with data augmentation alone, without a transfer learning approach. However, since they applied data augmentation to both the train and test datasets, a direct comparison of their results with our study was not suitable. Applying data augmentation to both datasets before the training phase also required extra effort. With regard to the time spent training the models, Singh and Dutta [31] did not report training times in their study. Even though five different pre-trained models were trained in our study, the transfer learning approach enabled the trained models to reach their best generalization performance in a very short training period of 40 epochs. The accuracy-loss plots obtained in the experiments support this observation. Afify et al. [37], on the other hand, proposed a CNN architecture with hyperparameters adjusted by the Bayesian optimization method. According to the results reported by the authors, although their model was superior to other studies using the same dataset, it fell behind our model in terms of accuracy.

Table 7 The results of the proposed model and other DL-based approaches

From the perspective of ensemble learning approaches, there are a limited number of studies in the literature on the classification of TM conditions. In two prominent studies [25, 26], researchers combined two differently trained CNN models in a soft voting ensemble and showed that their proposed ensemble approaches increased the classification success compared to single CNN models. Our study differs from [25, 26] in terms of both the number of CNN models used and the voting approaches. The model we propose in this study combines five different CNN models via both hard voting and soft voting rules. In both cases, this model led to an increase in the classification success compared to single CNN models.
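The difference between the two voting rules can be sketched in a few lines of Python. This is a minimal illustration, assuming unweighted voting and a lowest-index tie-break for hard voting; neither detail is specified in the text above. The function names and the probability vectors are hypothetical and do not come from the models trained in this study.

```python
from collections import Counter


def hard_vote(prob_list):
    """Majority vote over each model's predicted label (argmax of its probabilities)."""
    votes = [max(range(len(p)), key=lambda k: p[k]) for p in prob_list]
    counts = Counter(votes)
    best = max(counts.values())
    # Tie-break (assumption): lowest class index among the most-voted classes.
    return min(k for k, c in counts.items() if c == best)


def soft_vote(prob_list):
    """Argmax of the unweighted mean of the models' class probabilities."""
    n_classes = len(prob_list[0])
    mean = [sum(p[k] for p in prob_list) / len(prob_list) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: mean[k])


classes = ["COM", "earwax plug", "myringosclerosis", "normal"]

# Hypothetical outputs of five models for ONE image (each row sums to 1).
sample = [
    [0.10, 0.15, 0.40, 0.35],  # three models weakly prefer myringosclerosis
    [0.10, 0.15, 0.40, 0.35],
    [0.10, 0.15, 0.40, 0.35],
    [0.02, 0.03, 0.05, 0.90],  # two models strongly prefer normal
    [0.02, 0.03, 0.05, 0.90],
]

print("hard voting:", classes[hard_vote(sample)])
print("soft voting:", classes[soft_vote(sample)])
```

In this toy case the two rules disagree: hard voting follows the three-model majority, whereas soft voting lets the two confident models outweigh three uncertain ones. This sensitivity to model confidence is one plausible reason why a soft voting ensemble can behave differently from a hard voting one on borderline images.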

6 Conclusion

In this study, a voting ensemble framework including 5 different pre-trained models was proposed to classify different TM conditions: normal, chronic otitis media, earwax plug, and myringosclerosis. To make the best use of the voting ensemble approach, the performances of the soft and hard voting methods were examined. Experiments showed that the soft voting method offered the best performance, with 98.8% Acc, 97.5% Sen, and 99.1% Spe, compared to the state-of-the-art pre-trained models. Moreover, the classification performance of the proposed method was superior to other studies performed on the same dataset in the literature. Considering the high classification accuracy achieved, the proposed soft voting ensemble framework can play an important role in the development of expert systems to be used in real clinical environments. A low-cost DL-based system will help to diagnose TM conditions more accurately in clinical environments where no field specialist is available. Such a system will also decrease the variability among observers' visual examination results, which are error-prone.

On the other hand, the fact that this study was performed on a small dataset can be considered a limitation. However, it should be noted that the number of publicly available datasets is itself insufficient. Finally, in our future studies, we intend to use domain adaptation in experiments on different datasets with more cases of middle ear disease, with the aim of obtaining a model that can generalize to more than one dataset.