1 Introduction

According to the World Health Organization, thyroid illness is the most widespread endocrine disorder in the world after diabetes (https://www.who.int/) [1]. Hyperthyroidism and hypothyroidism are the most frequent thyroid gland illnesses; they have been recorded in more than 110 countries, putting 1.6 billion people at risk, the majority of whom live in Asia, Africa, and Latin America [2]. Currently, over 25,000 emergency clinics around the globe collect patient information in various formats. However, studies are still conducted using traditional examinations and measurable tests [3], which is time-consuming and costly. Doctors believe that early disease detection, diagnosis, and treatment are critical in inhibiting disease progression and even preventing death. Despite numerous trials, clinical diagnosis is frequently regarded as a difficult task [4]. The thyroid is a tiny, butterfly-shaped gland that sits right below the Adam’s apple at the base of the neck [5]. It is part of the endocrine system, a complicated network of glands that coordinates many of the body’s activities. The thyroid gland produces hormones that govern the human body’s metabolism. The most common cause of thyroid disorders is a lack of iodine; however, they can also be caused by other circumstances [6]. T3, T4, and calcitonin are the three hormones produced by the thyroid gland, of which only T3 and T4 are thyroid hormones in the strictest sense [7]. Iodine is required for the creation of both hormones. We must obtain this trace element through our diet because our bodies cannot produce it. Iodine from food is absorbed into the bloodstream through the intestines and is ultimately used to produce thyroid hormones. Hypothyroidism (underactive thyroid) is a malfunction in which the thyroid gland does not produce enough of certain hormones [8]. A few symptoms may be seen early in the course of hypothyroidism.
If hypothyroidism is left unattended, it can lead to obesity. Moreover, several other problems, such as joint pain, heart disease, and sometimes infertility, may be seen among patients [9]. Hyperthyroidism is a malfunction in which the thyroid gland produces too much thyroid hormone, which circulates in the bloodstream. Some symptoms of hyperthyroidism are nervousness, impatience, and increased hunger [10]. For thyroid disease prediction at an early stage, we can use machine learning, an area of computer science that has exploded in popularity in recent years and is likely to continue to do so. Machine learning algorithms have several advantages, including high parallelism, speed, self-learning, and tolerance to noise and errors [11]. Machine learning allows humans to gain insight from vast amounts of data that would otherwise be too difficult or impossible to process. By building a machine learning model, we can predict hypothyroidism and hyperthyroidism from a patient's symptoms, which is a cost-effective and time-saving approach. The machine learning model is trained using data from various databases as input. Once it has been trained, it can be used to produce predictions for other input data. Several supervised machine learning algorithms are available in the literature [12,13,14,15,16]. In our study, we employed the Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, K-Nearest Neighbor, Logistic Regression, and Support Vector Machine to predict thyroid disease, and we compared the performance of these algorithms to identify the best method for predicting thyroid disease accurately. Finally, a post hoc technique known as explainable artificial intelligence is used to understand and trust the output and outcomes produced by the black box machine learning algorithms.

The primary goals of this research are:

  • To find a reliable machine learning classification method for predicting thyroid disease using the fewest possible features.

  • To identify the most important features for detecting thyroid disease.

  • To validate the experimental results using explainable AI.

We also determined the most important attributes of our datasets for predicting thyroid illness. Finally, we believe this research might significantly help the scientific community to better understand and apply machine learning in the medical field, especially in thyroid disease prediction. The overall workflow of this study is depicted in Fig. 1. First, we collected, preprocessed, cleaned, and resampled the data. Following that, two sub-datasets were generated from the original dataset and used to predict thyroid disease. Overall, we followed seven steps for building the thyroid prediction model. Finally, we validated our results using explainable AI.

Fig. 1 The complete workflow paradigm in this study

2 Literature Review

Several studies have addressed this problem. In [17], the authors employed Decision Tree, Support Vector Machine, Artificial Neural Network, and K-Nearest Neighbor classification algorithms on the thyroid dataset obtained from the UCI Repository (https://archive.ics.uci.edu/ml/datasets/thyroid+disease); classification and prediction were performed, and accuracy was measured based on the output. In [18], Logistic Regression and SVM were used to evaluate the thyroid dataset, and RMS error, precision, recall, F1 measure, and ROC were used to compare the two methods. According to the authors, logistic regression was the more successful classifier. Awasthi and Anil Antony [19] discussed using KNN, support vector machine (SVM), and other machine learning algorithms to categorize and detect thyroid illness. They employed the K-nearest neighbor technique to approximate missing values in user input for thyroid diagnosis. A classification system for two categories of thyroid disease, hypothyroidism and hyperthyroidism, was proposed in [20]. During the preprocessing stage, missing values that are not numeric are identified, and the mean value of the corresponding column is used to fill in the gaps. The differential evolution technique is used to create child subsets from parent records. In [21], the authors used SVM as a classifier to distinguish thyroid disease. That investigation is based on two datasets. For classification, the authors of [22] employed Naive Bayes and support vector machines. Other classification algorithms, such as K-nearest neighbor, were also applied. The RapidMiner tool was used to conduct the research, and the findings reveal that K-nearest neighbor is better than Naive Bayes at diagnosing thyroid issues. The K-nearest neighbor classifier was the most reliable, with an accuracy of 93.44%, while the Naive Bayes classifier reached only 22.56%.
In [23], SVM surpassed K-Nearest Neighbor and Bayesian classifiers with an accuracy of 84.62 percent. KNN independently discovered the closest neighborhood. In [24], the authors proposed several thyroid prediction strategies based on data mining techniques. They investigated the link between T3, T4, and TSH, as well as hyperthyroidism and hypothyroidism. In addition, other authors have recently used modified machine learning techniques to predict thyroid disease [25, 26]. Moreover, in [27] the authors used different feature engineering methods, namely forward, backward, and bidirectional feature elimination, for thyroid classification. In the area of healthcare, Pawar et al. [28] employed XAI techniques for model integrity, openness in feature selection, result monitoring, and model refinement. The same authors also employed explainable AI in 2021 to offer a tool for comprehending machine learning models in the healthcare industry. To the best of our knowledge, Arjaria et al. in 2022 [29] used XAI to forecast the accuracy of decision tree algorithms with an explanation of key features, improving the models’ accuracy and making the models more accountable by requiring them to explain the reasoning behind each decision. Despite the widespread application of AI and machine learning in medicine and diagnostics, we observed that there is still a gap in the use of explainable machine learning to predict thyroid disease. In this study, we first identify the best ML model and then use XAI to analyze that model’s “black box” for classifying thyroid diseases.

3 Material and Methods

3.1 Dataset Description

The first natural step towards the development of a machine learning model is the collection of data. The data was taken from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets/Thyroid+Disease) [30]. We took three datasets (hypothyroid, hyperthyroid, and sick) from the repository and combined them to create our final dataset, which has 3221 entries.

There are 30 features in total; six of them are real-valued, while the remainder are categorical, as shown in Table 1. Pre-processing is done to improve the quality of the dataset obtained for further analysis. The histograms of all the attributes are shown in Fig. 2, after dropping two attributes (TBG measured and TBG) because of their large proportion of missing values.

Table 1 Description of thyroid dataset attributes
Fig. 2 Visual representation of the dataset for all patients with all 28 attributes. The frequency distribution of each attribute can be seen from its histogram. Overall, the attributes age, TSH, T3, T4, T4U, and FTI appear to be normally distributed, whereas the other attributes are categorical in nature

3.2 Data Pre-processing

Raw data from the real world is frequently incomplete, unreliable, and devoid of specific behaviors or trends. It is also likely to contain many errors [31]. As a result, once the data has been collected, it is pre-processed into a format that the machine learning algorithm can use. The data pre-processing phase deserves considerable attention in order to get the best model quality. It includes several tasks that make the data more relevant. In this study, we took the following steps to preprocess the data.

At first, the dataset contained many unclear values with no significant meaning. We removed those values, which reduced the number of attributes in the dataset and improved the results. Following that, we replaced the missing values, because our dataset had many of them. We handled missing values in different ways, for example, by filling them with the median or the mode. In addition, the categorical data was encoded into integer format so that it could be fed into the models and improve prediction accuracy. Furthermore, we handled the class imbalance in our dataset, in which the target class had an unequal distribution of observations. To balance the dataset, we employed a resampling technique. Finally, we split the dataset into training and test sets. The training set was used to fit the model, and the test set was used to make predictions and compare them with the actual values. In this study, 70% of the data was used for training and 30% for testing.
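The imputation and encoding steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual code: the toy records, the helper names `impute` and `label_encode`, and the use of `None` as the missing-value marker are all assumptions made for the example.

```python
from statistics import median, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 41, "sex": "F", "TSH": 1.3},
    {"age": None, "sex": "M", "TSH": 4.1},
    {"age": 70, "sex": None, "TSH": None},
    {"age": 23, "sex": "F", "TSH": 0.9},
]


def impute(rows, numeric_cols, categorical_cols):
    """Fill numeric gaps with the column median and categorical gaps with the mode."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        fill = median(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    for col in categorical_cols:
        fill = mode(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = fill
    return filled


def label_encode(rows, col):
    """Replace each distinct category in `col` with an integer code (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted({r[col] for r in rows}))}
    for r in rows:
        r[col] = mapping[r[col]]
    return mapping


clean = impute(rows, numeric_cols=["age", "TSH"], categorical_cols=["sex"])
sex_mapping = label_encode(clean, "sex")
```

Computing the fill values from the observed entries only, before any encoding, keeps the median and mode from being distorted by placeholder values.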

3.3 Feature Selection Methods

Feature selection is a strategy for limiting the input variables of a model by removing insignificant data and keeping only valuable data [32]. The purpose of feature selection in machine learning is to determine the best set of characteristics for building effective models of the phenomena being studied. In this study, to select the most important features, we used the univariate feature selection approach and the feature importance method [33].
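As a rough illustration of univariate feature selection, the sketch below scores each feature independently against the target and keeps the top k. The scoring function is an assumption: the text does not state which univariate statistic was used, so the absolute Pearson correlation is used here purely for illustration, and the feature values and the names `pearson` and `select_top_k` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_top_k(features, target, k):
    """Score every feature on its own and return the k highest-scoring names."""
    scores = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: TSH separates the classes, "const" carries no information.
features = {
    "TSH": [0.5, 1.0, 8.0, 9.5, 0.7, 7.9],
    "age": [30, 62, 45, 51, 70, 28],
    "const": [1, 1, 1, 1, 1, 1],
}
target = [0, 0, 1, 1, 0, 1]
top2, scores = select_top_k(features, target, k=2)
```

Because each feature is scored in isolation, this kind of selection is cheap but cannot detect features that are only informative in combination, which is one reason to compare it with a model-based feature importance ranking.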

3.4 Selection of the Classification Algorithms

Before selecting an algorithm, there are a few things to consider: the size of the training data, the accuracy and/or interpretability of the output, training time or speed, linearity, and the number of features [34]. In this investigation, we applied seven popular machine learning classification algorithms to our dataset to determine which one performs best. To predict thyroid disease, we used the Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression, K-Nearest Neighbor, and Support Vector Machine (SVM) algorithms.

Supervised machine learning algorithms such as Decision Trees are typically used to tackle classification and regression problems by splitting the data based on specific criteria. While the data is divided among the internal nodes, the final decision is provided by the leaves. The main problem with this method is over-fitting, for which Random Forest offers a solution based on an ensemble modeling strategy. The Naive Bayes algorithm is based on conditional probability and uses a probability table as the model. Throughout training, the table is updated. The technique has benefits over competing methods owing to its effectiveness, performance, suitability for modest amounts of training data, handling of both discrete and continuous data, and capability to address binary and multi-class classification challenges. However, because NB models are simplistic, properly trained and optimized models of other types frequently outperform them.
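To make the idea of splitting data on a criterion concrete, the following sketch searches for the threshold on a single numeric feature that minimizes the weighted Gini impurity of the two resulting groups. Gini impurity is only one common splitting criterion and is an assumption here, since the text does not name the criterion used; the TSH values and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # no split: impurity of the whole node
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical values cannot be separated by a threshold
        thr = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for v, y in pairs if v <= thr]
        right = [y for v, y in pairs if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best


# Hypothetical toy data: low TSH values are negative, high values hypothyroid.
tsh = [0.5, 1.0, 0.7, 8.0, 9.5, 7.9]
y = ["negative", "negative", "negative", "hypothyroid", "hypothyroid", "hypothyroid"]
thr, score = best_threshold(tsh, y)
```

A full decision tree repeats this search over every feature at every node and recurses on the two groups; an ensemble such as Random Forest trains many such trees on resampled data and averages their votes.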

Computational efficiency, ease of regularization, and simplicity of implementation are some benefits of Logistic Regression. However, its inability to handle non-linear problems, its susceptibility to overfitting, and its poor performance unless all relevant independent variables are identified can sometimes make this algorithm problematic. Unsupervised K-Means clustering is a widespread choice for clustering problems. When the number of variables is large, it is computationally more efficient than hierarchical clustering, owing to its low order of complexity. However, choosing the value of K is challenging, and performance can be compromised for globular clusters.

SVM can be used for both classification and regression problems. It requires a decision boundary, a hyperplane, to divide a collection of objects into their classes. It can handle structured and semi-structured data. Moreover, it can handle complex functions if the right kernel function can be determined. SVM has a lower likelihood of overfitting. With a huge dataset, though, its performance suffers because of longer training times.

3.5 Evaluation of the Model

In machine learning, performance metrics describe how well an algorithm performs according to various criteria such as precision, accuracy, recall, and F1 score [35,36,37]. The following sections go through several performance metrics.

3.5.1 Accuracy

The percentage of correct predictions on the test data is referred to as accuracy. It is calculated by dividing the number of correct predictions by the total number of predictions. The formula for calculating the accuracy is given below:

$$Accuracy = \frac{TP + TN}{{TP + FP + TN + FN}}$$
(1)

3.5.2 Precision

The precision score measures the proportion of true positives among all positive predictions made by the model. The following is the formula for calculating precision:

$$Precision= \frac{TP}{TP+FP}$$
(2)

3.5.3 Recall (Sensitivity)

The recall score measures the proportion of true positives among all actual positive instances. Below is the formula for determining the recall.

$$Recall= \frac{TP}{TP+FN}$$
(3)

3.5.4 F1 Score

The F1-score is the harmonic mean of the precision and recall scores. It is used as a metric in situations where optimizing for either precision or recall alone could result in a model with excessive false positives or false negatives. The F1 score is calculated as follows.

$$F1 \,score = \frac{{2\,\left( {Precision \times Recall} \right)}}{Precision + Recall}$$
(4)
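Equations (1) to (4) can be checked with a short worked example. The sketch below computes all four metrics from the counts of a binary confusion matrix; the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch rather than something stated in the text.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 85/100, precision is 40/50, recall is 40/45, and the F1 score equals 80/95, which agrees with the equivalent form 2TP / (2TP + FP + FN).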

After combining the three datasets, our final thyroid dataset had 3221 instances, one per patient. Along with the target value, we had 30 attributes. An initial check reported no missing values in our data; however, when we looked back at the original dataset, there were missing values in several columns. These entries were replaced with ’nan’, and the columns were then converted into numerical format. Because the missing values, except those in sex, come from numeric attributes, they were replaced with the median value of the respective columns. Sex, however, is a categorical attribute, so its missing values were replaced with the mode of that attribute. Initially, we dropped two attributes, TBG and TBG measured, as the majority of their values are missing. Our categorical attributes were mapped to numeric values programmatically using a label encoder. Our other attributes were stored as objects, so we converted them to integer format to fit them into our model. Our dataset is imbalanced because the target class has an uneven distribution of observations. There are 2753 observations under the negative class label, 220 observations under the hypothyroid class label, 171 observations under the sick class label, and 77 observations under the hyperthyroid class label. When dealing with unbalanced datasets, typical machine learning methods may produce biased, erroneous, and unsatisfactory classifiers.

Standard classifier methods, such as Decision Tree and Logistic Regression, favor classes with many instances. Typically, they tend to predict only the majority class. The minority class’s traits are frequently dismissed as noise and ignored. As a result, the minority class has a higher chance of being misclassified than the majority class. This occurs because machine learning algorithms are typically designed to improve accuracy by reducing overall error. Therefore, we converted the dataset into a balanced one to obtain the desired result. We used a resampling technique to ensure that the minority and majority classes are equal in size. After resampling, the distribution of observations in our dataset is even across all classes.

3.6 Explainable Artificial Intelligence

Explainable artificial intelligence (XAI) is a collection of techniques and strategies that enable human users to grasp and trust the output and results generated by black box machine learning methods. In this study, a post hoc XAI approach has been considered to explain the model. Post hoc approaches analyze the model after training, and they do so without restricting the model's complexity. Therefore, the explainability does not affect the performance of the model. Intrinsic approaches, by contrast, restrict the complexity of the machine learning model. Based on scope, there are two types of explainability: global and local. A global explanation of a machine learning (ML) model specifies which features are vital to the overall model's outcome. In contrast, a local approach only explains single data points. In this study, we used Shapley additive explanations (SHAP) [38] and local interpretable model-agnostic explanations (LIME) [39] for the global and local explanations, respectively.

For N explanatory variables, SHAP satisfies local accuracy by approximating each prediction f(x) with an explanation model g(x′) and attribution values ϕj ∈ ℝ, defined as follows [40]:

$$f(x)\, = \,g(x^{\prime } ) = \,\varphi_{0} \, + \,\sum\limits_{j = 1}^{N} {\varphi_{j} } x_{j}^{\prime }$$
(5)

The three major properties of SHAP (local accuracy, missingness, and consistency) can be satisfied by only one explanatory model, defined as follows:

$$\varphi_{j}\left( f,x \right) = \sum\limits_{z^{\prime} \subseteq x^{\prime}} \frac{|z^{\prime}|!\,\left( N - |z^{\prime}| - 1 \right)!}{N!}\,\left[ f_{x}\left( z^{\prime} \right) - f_{x}\left( z^{\prime} \backslash j \right) \right]$$
(6)

where z′ ∈ {0,1}^N is a vector of binary variables (the explanation model g is a linear function of these variables), z′\j denotes setting z′j = 0, and |z′| is the number of non-zero entries in z′.
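For a small number of features, Shapley values can be computed exactly by enumerating subsets, which makes the local accuracy property in Eq. (5) easy to verify. The sketch below is an illustration under stated assumptions, not the SHAP library's implementation: it uses the classical form of the Shapley value, summing over subsets S that exclude feature j with weight |S|!(N − |S| − 1)!/N!, and it treats an "absent" feature as set to a baseline of 0. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by subset enumeration (feasible only for small N)."""
    phi = [0.0] * n_features
    for j in range(n_features):
        others = [i for i in range(n_features) if i != j]
        for size in range(n_features):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature j when added to subset s.
                phi[j] += weight * (value_fn(s | {j}) - value_fn(s))
    return phi


x = [1.0, 1.0, 1.0]  # hypothetical instance to explain


def model(v):
    """Hypothetical black box with an interaction between features 0 and 2."""
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


def value_fn(present):
    """Model output when only the features in `present` keep their values."""
    masked = [x[i] if i in present else 0.0 for i in range(len(x))]
    return model(masked)


phi = shapley_values(value_fn, n_features=3)
phi_0 = value_fn(frozenset())  # base value, the phi_0 term of Eq. (5)
```

Here the interaction term is shared equally between features 0 and 2, and the base value plus the three attributions reproduces the model output for x. The cost grows exponentially with N, which is why practical tools rely on approximations or model-specific algorithms.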

LIME, in contrast, attempts to fit a local model to sample data points that resemble the observation being explained. The explanation for each observation x is obtained as follows [40]:

$$\lambda\left( x \right) = \mathop{\mathrm{argmin}}\limits_{q \in Q} \; L\left( f, q, \pi_{x} \right) + \psi\left( q \right)$$
(7)

where L is a locality-aware loss, Q denotes the class of potentially interpretable models, πx(z) is the distance between an instance z and x, and ψ(q) is a measure of the complexity of the explanation q ∈ Q.
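The optimization in Eq. (7) can be illustrated with a one-dimensional toy case. In the sketch below, the class Q is restricted to straight lines, so the complexity term ψ(q) is constant and is dropped; the loss L is a proximity-weighted squared error; and πx is taken to be an exponential kernel. All of these are assumptions made for the example, as are the function names, and a deterministic grid of perturbations replaces the random sampling that LIME normally uses so that the result is reproducible.

```python
from math import exp


def lime_1d(f, x0, half_width=1.0, n_samples=41, kernel_width=0.5):
    """Fit a proximity-weighted line q(z) = a + b*z to a black box f around x0."""
    step = 2 * half_width / (n_samples - 1)
    zs = [x0 - half_width + i * step for i in range(n_samples)]  # perturbed points
    ws = [exp(-((z - x0) ** 2) / kernel_width ** 2) for z in zs]  # proximity weights
    ys = [f(z) for z in zs]  # black-box predictions at the perturbed points
    total = sum(ws)
    z_bar = sum(w * z for w, z in zip(ws, zs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    # Closed-form weighted least squares for a single feature.
    slope = (sum(w * (z - z_bar) * (y - y_bar) for w, z, y in zip(ws, zs, ys))
             / sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs)))
    intercept = y_bar - slope * z_bar
    return intercept, slope


def black_box(z):
    """Hypothetical non-linear model to be explained locally."""
    return z ** 2


intercept, slope = lime_1d(black_box, x0=2.0)
```

For this black box, the fitted slope at x0 = 2 is approximately 4, which matches the derivative of z² at that point: the local surrogate recovers how the prediction changes with the feature in the neighborhood of the explained instance.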

4 Results and Discussion

This study compared seven machine learning algorithms. The Decision Tree Classifier, Random Forest Classifier, Naive Bayes Classifier, Gradient Boosting Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine were used for thyroid disease prediction. First, we collected and preprocessed the data and then fed it to the models for training. Various performance criteria, including accuracy, precision, recall, and F1-score, were then compared to establish whether one algorithm is superior to the others. We prepared our dataset in three forms: the first with all attributes, the second with the 14 attributes chosen by the feature importance method, and the third with the 14 attributes chosen by univariate feature selection. We narrowed down the attributes based on their relationship with the target, as scored by the feature importance and univariate feature selection methods. The results of the various algorithms are explained in the remainder of this section.

4.1 Descriptive Statistics of the Dataset

Exploratory data analysis (EDA) is a form of data analysis that employs data visualization to evaluate and investigate datasets and describe their key properties [41, 42]. EDA is mainly used to examine what data might reveal outside formal modeling or hypothesis testing tasks and to better understand variables and their interactions. It can also help us to determine whether the statistical methods we are considering for data analysis are appropriate. Our dataset has 28 attributes, only six of which are numeric. We therefore give short descriptive statistics of these attributes in Table 2. All of the attributes have 3221 values in this table, because we filled in the missing values using various techniques before training the model. We can also see that the average age of the patients is 52.4, suggesting that the patients were predominantly older.

Table 2 Descriptive statistics of numeric value of our dataset

The youngest person was one year old, and the oldest person was 94 years old. The age distribution of the data is skewed, indicating that young patients are underrepresented. The standard deviation is 19.1, indicating the sparseness of the age group, which ranges from 57 to 73 years old. The mean TSH was 6.322 mIU/L, suggesting that TSH levels were, on average, above the normal range. TSH levels should be between 0.5 and 5.0 mIU/L to be considered normal. TSH had a minimum value of 0.005 mIU/L and a maximum value of 478.0 mIU/L. The mean T3 value was 1.95 nmol/L, with a minimum of 0.05 nmol/L and a maximum of 10.6 nmol/L. The mean value of TT4 is 107.55, with a maximum of 430 and a minimum of 2.

In the case of T4U, the mean value is 0.988 mIU/mL, with a maximum of 2.12 mIU/mL and a minimum of 0.31 mIU/mL. Moreover, the mean value of FTI is 110.26. The correlation between all the numeric attributes is depicted in Fig. 3, which shows that TT4 and FTI have a strong relationship. The heat map gives a better understanding of these correlations. Fig. 4 depicts a heatmap of all attribute correlations. From the heatmap and the numerical correlations in Fig. 3, we can interpret the correlations among the variables. It is clear from the heatmap that T4U measured and FTI measured have a very strong correlation. Moreover, some other attribute pairs also show a very strong relationship, such as TT4 with T3, T4U with FTI, TT4 with FTI, and TTI with T4U.

Fig. 3 Correlations among the numeric attributes of the dataset

Fig. 4 Correlation visualization using a heat map. The figure shows the correlation among the 28 attributes of our thyroid dataset. Some of the attribute pairs are highly correlated, and some are negatively correlated

4.2 Category Class Balancing

The target class has an uneven distribution of observations, which makes our dataset unbalanced. There are 2753 observations under the negative class label, 220 observations under the hypothyroid class label, 171 observations under the sick class label, and 77 observations under the hyperthyroid class label. So, our dataset is highly unbalanced.

As a result, machine learning classifiers faced difficulties in making accurate predictions on our dataset. Because classic classifiers such as Decision Tree and Logistic Regression favor classes with many occurrences, they tend to predict only the majority class. The features of the minority classes are frequently rejected and treated as noise. The graphical representation of our classes is shown in Fig. 5.

Fig. 5 Imbalanced classes of the original dataset

We can see that our dataset is heavily skewed. We therefore balance the classes in the training data before delivering the data as input to the classification model. The primary purpose of class balancing is to either increase the frequency of the minority classes or lower the frequency of the majority class, so that the number of instances in all classes is about equal. We employed a resampling technique to balance our dataset. Resampling is a common strategy for dealing with very imbalanced datasets. It involves deleting samples from the majority class (under-sampling) and/or adding examples to the minority classes (over-sampling). After resampling, each of our classes has 2753 observations. The balanced plot is shown in Fig. 6.

Fig. 6 Balanced classes after resampling of the original dataset

After resampling, we obtained our final balanced dataset. We can now build our model using this dataset, which should give more accurate results. After resampling, we have a total of 11,012 instances.
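The class counts reported here are consistent with random over-sampling of the minority classes up to the size of the majority class, since 4 × 2753 = 11,012. The text says only that a resampling technique was used, so the sketch below is an inference rather than the study's procedure; the function name and seed are hypothetical.

```python
import random
from collections import Counter


def oversample(labels, seed=0):
    """Return indices that balance the classes by sampling minorities with replacement."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)  # keep every original observation
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))  # add duplicates
    return balanced


# Class counts of the original dataset as reported in the text.
labels = (["negative"] * 2753 + ["hypothyroid"] * 220
          + ["sick"] * 171 + ["hyperthyroid"] * 77)
balanced_idx = oversample(labels)
counts = Counter(labels[i] for i in balanced_idx)
```

One caveat with this approach is that duplicated minority samples can end up in both the training and the test set if the split is done after resampling, which may inflate test scores.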

4.3 Performance Analysis of Different Algorithm

Our original dataset, which included all features, was first utilized to evaluate several machine learning measures. After that, we used our balanced dataset to test multiple machine-learning models. This study selected the dataset’s important features using feature importance methods and univariate feature selection techniques. In our experiments, those vital features are then used to identify the model’s precision, accuracy, recall, and F1 score.

The data is divided into two parts: training data and test data. In this study, 70% of the data was used for training and 30% for testing. So, out of our 11,012 dataset instances, 7708 were used for the training set and 3304 for the testing set. Using the testing set, we can determine the accuracy of our model and how well it can predict thyroid disease. We used the Sklearn library to split our dataset into training and test sets. The train_test_split function from Sklearn's model_selection module splits the dataset randomly in the specified proportion, giving random training and test parts of the entire dataset. After training the models with all algorithms, the testing dataset was used to test them. The F1-score, recall, precision, and accuracy were used to evaluate each model’s performance. The entire study aimed to see which algorithm could best classify the diseases.
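The split sizes quoted above can be reproduced with a small standard-library sketch of a random 70/30 split. This is not the Sklearn call itself: the function name and seed are hypothetical, and rounding the test size up is an assumption chosen because it yields the 7708 and 3304 instances reported in the text.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.3, seed=42):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # assumed rounding: round up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(11012)
```

Fixing the seed makes the split repeatable, which matters when several algorithms are compared on the same test set.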

This section highlights the study’s outcomes and introduces the top performer based on several performance criteria. At first, performance was measured using our raw dataset. Secondly, performance was measured using a dataset containing 14 attributes derived from the feature importance method. Third, performance was determined by considering 14 attributes from the univariate feature selection. Finally, we compare various performance metrics of various algorithms and feature categories.

4.3.1 Results Using All Features

We apply the selected algorithms to our dataset. Our dataset has 28 attributes; among them, the category is the target. The algorithms are then compared using various performance metrics. We can see from Table 3 that the Logistic Regression algorithm has the highest accuracy of any algorithm, followed by the Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier. Predictor accuracy refers to how well a predictor can forecast the value of a predicted characteristic for fresh data. In contrast, classifier accuracy refers to a classifier’s ability to predict the class label correctly. However, accuracy alone is not always a good metric for comparing algorithms, so we also assess our models using recall, precision, and F1 score.

Table 3 Evaluation of algorithms with all features

Logistic Regression, as shown in Table 3, performs best in terms of accuracy. However, this algorithm’s precision, recall, and F1 score are all low. Logistic Regression achieved an accuracy of 84.48%, a precision of 25%, a recall of 24%, and an F1 score of 25%, which makes it better than the other six classification algorithms on this dataset in terms of accuracy. The Support Vector Machine, Gradient Boosting Classifier, and Decision Tree Classifier also reach high accuracy. However, precision, recall, and F1-score are extremely low in each case. As a result, we can only rank the algorithms by accuracy, even though accuracy does not always provide a reliable measure of performance. Random Forest has a 74.4% accuracy, but its precision, recall, and F1 score are all low. The accuracy of K-Nearest Neighbor is 72.18 percent. Naive Bayes, on the other hand, scores poorly in this experiment, with an accuracy of only 16.44 percent, which is highly unsatisfactory. Based on accuracy, we therefore conclude that, for our dataset with all features, Logistic Regression is the best classification algorithm, while Naive Bayes is the worst.
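A short calculation shows how high accuracy can coexist with low precision and recall on an imbalanced dataset. The sketch below evaluates a trivial baseline that always predicts the majority class, using the original class counts reported earlier. It is purely illustrative: it does not claim that any model in Table 3 behaves this way, it assumes macro averaging (the averaging method is not stated in the text), and it uses the full class counts rather than the study's test split.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall for multi-class labels."""
    classes = sorted(set(y_true))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Assumed convention: a class that is never predicted gets precision 0.0.
        precisions.append(tp / (tp + fp) if (tp + fp) else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


y_true = (["negative"] * 2753 + ["hypothyroid"] * 220
          + ["sick"] * 171 + ["hyperthyroid"] * 77)
y_pred = ["negative"] * len(y_true)  # baseline: always predict the majority class
accuracy, macro_precision, macro_recall = macro_metrics(y_true, y_pred)
```

The baseline is correct for 2753 of 3221 instances, an accuracy of roughly 85%, yet its macro recall is exactly 0.25 because three of the four classes are never detected. This is why accuracy alone is a weak basis for comparing classifiers before the classes are balanced.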

4.3.2 Results for Our Dataset Using Feature Importance Method

We determine our 14 best-correlated features from our dataset using the feature importance technique. We apply the seven algorithms to the 14 features chosen using the method. The algorithms are then compared using various performance metrics. All the selected features are presented in Fig. 7 with their importance value.

Fig. 7 Important features according to feature importance

We apply the Random Forest Classifier, Decision Tree Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine to the selected features. We can see from the above bar chart that the Random Forest algorithm outperforms all others in terms of accuracy, followed by the Decision Tree Classifier and Gradient Boosting Classifier. As previously stated, accuracy is not always an appropriate metric when comparing algorithms, so we also consider precision, recall, and F1-score. The performance metrics of all seven algorithms are listed in Table 4.

Table 4 Evaluation of algorithms with the features of feature importance

As Table 4 shows, Random Forest leads on every performance criterion: it has the highest accuracy (91.92 percent) and the highest precision, recall, and F1 score (92 percent each). So, for our dataset with the 14 features chosen by feature importance, Random Forest outperforms the other six classification algorithms. The Gradient Boosting and Decision Tree Classifiers follow closely. Both have the same precision, recall, and F1 score, but Gradient Boosting has the higher accuracy, so in terms of accuracy Gradient Boosting outperforms the Decision Tree Classifier. K-Nearest Neighbor has an accuracy of 86.22 percent and an F1 score of 86 percent. SVM provides 73.7 percent accuracy with a 73.7 percent F1 score. Logistic Regression has a 73.15 percent accuracy with an F1 score of 86 percent. Finally, Naive Bayes gives 67.86 percent accuracy and a 64 percent F1 score. A confusion matrix shows, for each class, how the classifier’s predictions compare with the actual labels. The confusion matrices of all seven classification algorithms are shown in Fig. 8.

Fig. 8 Confusion matrix of different algorithms a SVM, b DT, c RF, d GB, e Naive Bayes, f KNN and g LR using the features of feature importance method

The confusion matrices in Fig. 8 also show that Random Forest gives the best prediction and Naive Bayes the poorest in this case. We therefore conclude that, for our chosen dataset, Random Forest is the best classification algorithm.
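How a confusion matrix is read can be sketched in a few lines. This is a minimal illustration, not the code used in this study: the labels and predictions are hypothetical, and the convention that rows are actual classes and columns are predicted classes is an assumption for this example.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


labels = ["negative", "hypothyroid", "hyperthyroid", "sick"]

# Hypothetical labels and predictions (not the study's data).
y_true = ["negative", "negative", "hypothyroid", "hypothyroid",
          "hyperthyroid", "hyperthyroid", "sick", "sick"]
y_pred = ["negative", "negative", "hypothyroid", "negative",
          "hyperthyroid", "negative", "sick", "sick"]

matrix = confusion_matrix(y_true, y_pred, labels)

# The diagonal holds correct predictions, so per-class recall is the
# diagonal entry divided by the row total.
per_class_recall = {
    label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)
}

for label, row in zip(labels, matrix):
    print(f"{label:>12}: {row}")
print(per_class_recall)
```

Off-diagonal entries show which classes are confused with each other; in this toy case the errors are hypothyroid and hyperthyroid instances predicted as negative.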

4.3.3 Results for Our Dataset Using Univariate Feature Selection Method

In this case, we use the univariate feature selection method to select the important features. The top 14 features, with their scores against the target, are given in Fig. 9. We apply the Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, Naive Bayes Classifier, Logistic Regression Classifier, K-Nearest Neighbor, and Support Vector Machine to the selected features. Table 5 shows that the results differ slightly from the previous ones. Random Forest again provides the best accuracy, 90.4 percent. The Decision Tree Classifier and Gradient Boosting Classifier follow, with accuracies of 89.55 percent and 89.35 percent, respectively. K-Nearest Neighbor has an accuracy of 86.07 percent. The accuracy of SVM increased to 74.5 percent, whereas that of Logistic Regression decreased to 71.82 percent. The accuracy of Naive Bayes fluctuates considerably on this dataset. As a result, we conclude that this method is less effective than the feature importance technique. The other performance metrics of all seven algorithms on this dataset are also presented in Table 5.
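Univariate feature selection scores each feature against the target independently and keeps the top-scoring ones. The sketch below is a minimal illustration under stated assumptions: it uses the one-way ANOVA F statistic as the score (the scoring function is an assumption; chi-squared is another common choice), the feature values are hypothetical toy numbers, and the helper names `anova_f_score` and `select_top_k` are illustrative.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature against class labels.

    Requires at least two classes and more samples than classes.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Variation of class means around the grand mean.
    between = sum(len(g) * (means[label] - grand_mean) ** 2
                  for label, g in groups.items())
    # Variation of samples around their own class mean.
    within = sum((v - means[label]) ** 2
                 for label, g in groups.items() for v in g)

    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return (between / (k - 1)) / (within / (n - k))


def select_top_k(features, labels, k):
    """Score every feature on its own and return the k best names."""
    scores = {name: anova_f_score(col, labels)
              for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data (not the study's data).
labels = ["hypothyroid"] * 3 + ["negative"] * 3
features = {
    "TSH": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # separates the classes well
    "T3": [1.0, 1.5, 2.0, 1.5, 2.0, 2.5],      # separates them weakly
    "age": [30.0, 50.0, 70.0, 35.0, 50.0, 65.0],  # same mean in both classes
}

selected, scores = select_top_k(features, labels, k=2)
print(selected, scores)
```

Because each feature is scored in isolation, this approach cannot account for interactions between features, which is one possible reason a model-based importance ranking can select a more useful subset.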

Fig. 9 Top 14 features selected using univariate feature selection procedure based on their score

Table 5 Evaluation of algorithms with the features of univariate feature selection

Table 5 shows that the performance metrics differ significantly from the previous test result. Logistic Regression, K-Nearest Neighbor, and Support Vector Machine have the same precision as before, whereas the precision of the Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, and Naive Bayes Classifier decreases. Recall follows the same pattern: it is unchanged for K-Nearest Neighbor, SVM, and Logistic Regression, and it falls for the Decision Tree Classifier, Random Forest Classifier, Gradient Boosting Classifier, and Naive Bayes Classifier. Because the F1 score combines precision and recall, it is likewise unchanged for Logistic Regression, K-Nearest Neighbor, and SVM, while the F1 score of Naive Bayes decreases. Based on Table 5, we conclude that Random Forest is the best performer, followed by the Decision Tree Classifier. The Gradient Boosting Classifier and Decision Tree Classifier are nearly equal, but the Decision Tree Classifier outperforms the Gradient Boosting Classifier by a small margin. Naive Bayes, however, performs worse across the board. The confusion matrices of all seven classification algorithms are shown in Fig. 10.

Fig. 10 Confusion matrix of different algorithms a SVM, b DT, c RF, d GB, e Naive Bayes, f KNN and g LR using the univariate feature selection method

We can also conclude from the confusion matrices that Random Forest provides the best prediction and that Naive Bayes gives the worst prediction in this case. The overall results for all classifiers and feature sets in this investigation are depicted in Fig. 11.

Fig. 11 Comparative analysis of performance measures of seven algorithms with three feature sets, where FS represents the dataset with all features, FS1 represents the dataset generated using the feature importance method, and FS2 represents the dataset generated using the univariate feature selection method

4.4 Explainability Analysis

We achieved the best results with the Random Forest model, so we explore explanations only for that model. Shapley additive explanations (SHAP) is a game theory-based method for interpreting the output of machine learning models. It provides a procedure for determining and displaying the relative importance of the individual features of the model. The method approximates the contribution of each feature for each instance in the dataset; the importance of a feature is assessed by comparing the model’s results with and without that feature.

To explain the model, we generated four visualizations of feature contributions, one for each of the four classes. Figure 12 shows the contribution of the features for the hypothyroid class. Most instances classified as hypothyroid have comparatively high TSH and T3 test values. On the other hand, for some instances, low values of the TT4 test and referral source significantly influence the hypothyroid prediction. Figure 13 presents the important features for the sick class. T3 has the highest feature importance for this class: the model predicts a patient as sick when the values of the T3, TSH, and TT4 tests are low. For a small number of instances, high values of the FTI feature are crucial for being classified as sick. Figure 14 illustrates that the FTI and T4U measurements are the most important features for a patient to be classified as negative; instances whose FTI test value lies in the normal range are classified as negative. Figure 15 shows that the FTI and T3 test values are essential for the hyperthyroid class. Interestingly, pregnancy also appears as a contributing factor for the hyperthyroid class.
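The "with and without the feature" idea behind Shapley values can be made concrete on a toy model. The sketch below is not the SHAP library and not the code used in this study: it computes exact Shapley values by enumerating every coalition of features, which is only feasible for a handful of features (SHAP approximates this). The scoring function `toy_score`, the feature values, and the baseline are all hypothetical, and replacing an absent feature by its baseline value is an assumption of this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance` relative to `baseline`.

    A feature outside the coalition is replaced by its baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {name: instance[name] if name in coalition else baseline[name]
             for name in names}
        return predict(x)

    result = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                # Marginal contribution of `name` to this coalition.
                total += weight * (value(without | {name}) - value(without))
        result[name] = total
    return result


def toy_score(x):
    """Hypothetical score with one interaction term (not a trained model)."""
    return 0.4 * x["TSH"] - 0.3 * x["T3"] + 0.01 * x["TSH"] * x["FTI"]


# Hypothetical patient and reference values.
instance = {"TSH": 8.0, "T3": 1.0, "FTI": 90.0}
baseline = {"TSH": 2.0, "T3": 2.0, "FTI": 110.0}

phi = shapley_values(toy_score, instance, baseline)
print(phi)
```

Two properties are worth checking on the output: the contributions sum to the difference between the score at the instance and the score at the baseline, and a purely additive feature such as T3 in this toy score receives exactly its own term's change.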

Fig. 12 Violin summary plot using SHAP for the hypothyroid class

Fig. 13 Violin summary plot using SHAP for the sick class

Fig. 14 Violin summary plot using SHAP for the negative class

Fig. 15 Violin summary plot using SHAP for the hyperthyroid class

Local interpretable model-agnostic explanations (LIME) is an explainable AI technique that helps show how a machine learning model works by making each individual prediction of the model understandable on its own. Since the method describes the classifier for a single instance, it works well for local explanations. LIME is based on the idea that any complicated model is simple on a local scale. LIME therefore fits a simplified model around a single instance, with the aim that this model mimics the behavior of the global model in the neighborhood of that instance. The simplified model can then be used to understand how the complex model behaves there.
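The local-surrogate idea can be sketched without any library. The example below is not the LIME package and not the code used in this study: it is a simplified illustration that perturbs one instance with Gaussian noise, weights each perturbed sample by its proximity to the instance, and fits a weighted linear model. The black-box function `toy_sick_probability`, the feature values, the perturbation scales, and the kernel width are all hypothetical assumptions.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def local_surrogate(predict, instance, scale,
                    n_samples=500, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to `predict` near `instance`.

    Returns [intercept, coef_1, ..., coef_d]; coefficients are per unit of
    standardized perturbation (one `scale` step of the feature).
    """
    rng = random.Random(seed)
    d = len(instance)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x = [xi + zi * si for xi, zi, si in zip(instance, z, scale)]
        # Exponential kernel: samples close to the instance count more.
        weight = math.exp(-sum(zi * zi for zi in z) / kernel_width ** 2)
        row = [1.0] + z
        y = predict(x)
        # Accumulate the weighted normal equations X^T W X and X^T W y.
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve(xtwx, xtwy)


def toy_sick_probability(x):
    """Hypothetical black box: rises as T3 (x[0]) falls; ignores x[1]."""
    return 1.0 / (1.0 + math.exp(4.0 * (x[0] - 1.2)))


instance = [1.0, 50.0]   # hypothetical T3 value and age
scale = [0.5, 10.0]      # hypothetical perturbation scale per feature

coefs = local_surrogate(toy_sick_probability, instance, scale)
print(coefs)
```

In this toy setting the surrogate assigns a negative coefficient to T3 (lower T3 pushes the prediction toward sick) and a much smaller coefficient to the ignored feature, which is the kind of per-instance reading that a LIME explanation provides.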

We used LIME for the local explanation of our model. With LIME, we can interpret the classification result for an individual patient; Fig. 16 shows one such case. The actual class of that patient was sick, and the class predicted by our model is also sick. The figure explains why the instance is classified as sick. Figure 16a shows our model’s predicted probability for each class. Figure 16b shows the conditions satisfied by the instance that led to its classification as sick. The instance is classified as sick because it has test values T3 ≤ 1.20, FTI ≤ 103.00, TT4 ≤ 85, etc. Other important features, such as T4U, age, and TSH, also pushed the instance toward the sick class. Figure 16c shows the actual feature values of that instance.

Fig. 16 LIME explanation for a patient: a category, b explanation, c history, where actual class: sick, predicted class: sick

Figure 17 is another local explanation of an instance correctly classified as hypothyroid. From Fig. 17a, we can see the prediction probabilities. Figure 17b represents the conditions for which the instance has been classified as hypothyroid.

Fig. 17 LIME explanation for a patient: a category, b explanation, c history, where actual class: hypothyroid, predicted class: hypothyroid

Our model classified both instances depicted in Figs. 18 and 19 as negative. Figure 19 illustrates the LIME explanation for an incorrectly classified instance: its actual class is hyperthyroid, although the model predicts it as negative. Figure 19a shows the predicted probability for each class. The actual class ranks second, with a class probability of 0.29. Figure 19b shows that several factors, such as FTI, TT4, and T3, pushed the model’s output toward the correct class. Nevertheless, Fig. 19c illustrates that other factors, such as lithium, TSH, T3, and TT4, pushed the model toward the negative class. Notably, several criteria, such as TT4 > 122.00 and T3 > 2.20, are shared by the negative and hyperthyroid classes. For this instance, however, the other factors play an important role in the prediction.

Fig. 18 LIME explanation for a patient: a category, b explanation, c history, where actual class: negative, predicted class: negative

Fig. 19 LIME explanation for a patient: a category, b explanation, c history, where actual class: hyperthyroid, predicted class: negative

4.5 Discussion

Machine learning is now used in almost every aspect of data analysis, so a practical machine learning model for medical data analysis, especially for thyroid disease detection, could greatly reduce the demand for expert physicians in this field. However, medical data is very sensitive, so a highly reliable model is a basic requirement for medical data analysis. Selecting appropriate features and an effective machine learning algorithm may pave the way toward automatic thyroid disease detection in healthcare. In this study, we used feature engineering to identify the best machine learning classification algorithm for the feature set used in thyroid detection. Furthermore, we used XAI to validate the best-performing model and to identify the features that most influenced its classifications. The performance of the different algorithms shows that each one behaved differently depending on whether a subset of features or the full feature set was used; depending on the situation, each algorithm has the inherent ability to outperform others. RF, for example, outperforms all other algorithms on our dataset, with accuracies of 91.42% and 90.4% for FS1 and FS2, respectively. Furthermore, SVM is known to perform better on small datasets, whereas ensemble classifiers such as Random Forest perform better on large datasets. Missing values play a significant role in decision trees: even after imputation, a tree cannot produce the same results as it would on a complete dataset. In our case, however, DT also performed very well, with an accuracy of 90.43% on FS1. Gaussian Naive Bayes is another well-regarded classifier, but it did not perform well on our dataset. The reason is its assumption that all attributes are independent; its results are less accurate when the features in the dataset depend on one another. The accuracy of K-Nearest Neighbor increases as the chosen value of K increases; this ensures that the given point and the dataset are similar.

The performance of the algorithms using all of the dataset’s features (FS) was poor relative to FS1 and FS2 for most algorithms. After reducing the number of attributes, the performance of the machine learning algorithms improved in most cases. When there are many attributes, classifiers become more complicated and prediction results vary. Following the standard process for evaluating algorithms, we compare the datasets using the performance metrics obtained after converting categorical values, balancing the dataset, and selecting features. Considering all of these situations and the performance metrics used in these experiments, we suggest that the Random Forest algorithm with the FS1 features should be used to train a model that predicts hypothyroidism and hyperthyroidism more accurately. Furthermore, the XAI analysis shows that the features TSH, T3, TT4, and FTI, which contribute the most to classifying hypothyroid and hyperthyroid cases, can be obtained by the feature importance method (FS1). Because we use a limited number of data points in our study, the results may still be biased, especially when all features are used to classify thyroid disease. Moreover, we balance the dataset before the final classification, which might not always produce perfect data points. Exact clinical data, which we are seeking, might further validate our approach.

With this UCI adjusted dataset, several studies [3, 28] have been conducted to determine the best-suited machine learning model for thyroid categorization. Other works [18, 21, 24] have addressed the prediction of thyroid disease using different datasets. Our concern in this paper is to apply explainable artificial intelligence (XAI) to verify the critical features of this dataset that led the best-fitted model to predict specific classes, and our results are comparable to those of the existing work.

5 Conclusion

After reducing the features using the feature importance technique and the univariate feature selection technique, we tested our collected dataset on various machine learning classifiers to see which classifier gave the best accuracy. When all features are considered, Logistic Regression outperforms all other classification algorithms for our dataset, with an accuracy score of 84.48 percent. When we use the feature importance method to narrow down the feature set, the Random Forest Classifier gives an accuracy score of 91.92 percent. The accuracy of the Decision Tree Classifier and the Gradient Boosting Classifier is 90.5 percent and 90.43 percent, respectively. When we use the univariate feature selection technique to narrow down the feature set, Random Forest again gives the highest accuracy score, 90.4 percent. The second-best algorithm is the Decision Tree Classifier, with an accuracy score of 89.55 percent, and the third-best is the Gradient Boosting Classifier, with an accuracy score of 89.35 percent. From the explainability analysis, we conclude that most instances have been classified as hypothyroid on the basis of the features TSH, T3, and TT4. We also identified that the FTI and T3 test values are important for the hyperthyroid class. Overall, the feature importance technique is more accurate than the univariate feature selection technique in determining relevant features. Thus, after looking at all of the performance metrics, we conclude that the Random Forest Classifier, Decision Tree Classifier, and Gradient Boosting Classifier, combined with the feature importance technique, are potential choices for predicting hypothyroidism and hyperthyroidism. Although feature engineering gave us relatively better results, there is still room to search for a better model and feature selection scheme to improve the results further. Moreover, we have tried our best to explain the biological terms clearly in this study; we will address any remaining limitations in our presentation in future work.