1 Introduction

Machine learning has become an important part of everyday life, providing smart and affordable solutions to a wide range of problems. Healthcare in particular is attracting the attention of many researchers, as society relies on healthy, capable individuals for its balanced functioning. A person who is ill spends much time worrying about their health, leaving little productive time to complete assigned duties, let alone perform well. This is an undesirable situation. Consider, for instance, a programmer sitting at her desk, trying to solve a coding problem, who feels agitated and has a pounding pulse. Or consider an accountant trying to complete a balance sheet for a client who feels feverish and delirious. Clearly, neither person is in a position to complete the task to the best of their abilities. One possible reason is a thyroid disorder called hyperthyroidism. Others may feel drowsy and lethargic, which can indicate hypothyroidism. Thyroid malfunction is a common condition affecting people of all age groups. It is not as dangerous as heart disease or cancer, but it may cause other diseases with severe complications.

Fortunately, researchers have long worked on modelling disease prediction problems, first with statistical techniques and now with machine learning and deep learning techniques. Data mining and machine learning techniques can be used to identify thyroid disease. This approach both reduces misdiagnoses due to human error and makes efficient use of time. However, most data mining and machine learning approaches require labeled training data, and the volume of data is critical for achieving high accuracy. Researchers working with healthcare data face other issues as well, such as obtaining authorization to collect data and addressing privacy and confidentiality concerns. In spite of all this, researchers remain motivated to explore such data through exploratory analysis, preprocessing, dimension reduction, data augmentation, and so on. The aim of this research is to provide a modelling solution for the prediction of thyroid disease so that society can benefit from advances in computational techniques. We apply dimension reduction techniques and feed their output into two classifiers. The comparative analysis shows the efficacy of our approach. To make the dataset large enough for building a deep neural network model, we also apply data augmentation.

The main objectives of the present research work are as follows:

  • Preprocess the data

  • Apply dimension reduction and data augmentation techniques

  • Build classifiers in a distributed environment

  • Perform comparative analysis

The rest of the paper is organized as follows: Sect. 2 presents the literature review. Section 3 presents the methodology. Section 4 describes the experiments, and Sect. 5 presents the results and exploratory analysis. Finally, Sect. 6 concludes the paper.

2 Literature Review

Researchers have been working to integrate information technology into healthcare, not only by applying machine learning techniques to healthcare data but also by devising techniques such as telemedicine and smart caretaking platforms. In [1], Anwar et al. propose a model in which telemedicine technology can help where there is a shortage of medical specialists or doctors. In their paper titled “Child Temperature Monitoring System” [2], the authors provide a smart way to protect infants from sudden infant death syndrome. This novel concept helps parents and caretakers to know their newborn better, especially because an infant cannot communicate its condition. Anwar and Prasad [3] claim that following a critical care plan is vital for chronic diseases, even more so if the patient has a disability. This can be achieved by integrating Information and Communication Technologies (ICT) with focal health care business models, which is a step towards a better workforce. In [4], Koren et al. claim that sensor data is subject to several sources of faults and errors, which may lead to imprecise or even incorrect and misleading answers. Data collected from wearable sensors therefore needs to be analyzed to confirm that it is correct and relevant; only then can it be included in a formal Electronic Health Record. The collection of large amounts of healthcare data using sensors has further motivated researchers to carry out these studies using big data platforms [4,5,6,7,8,9]. MapReduce [10, 11], a distributable and scalable parallel processing framework, is used for data processing in healthcare. Deep learning approaches have been applied to the prediction of violent incidents by patients [12].

We now discuss research specific to thyroid disease prediction. The thyroid gland produces the hormones triiodothyronine (T3) and thyroxine (T4), which regulate metabolism, and its activity is governed by thyroid-stimulating hormone (TSH). If the thyroid hormones are produced in excess, the condition is hyperthyroidism; if too little is produced, it is hypothyroidism. Symptoms, in addition to those cited in our earlier example, include intolerance to cold, muscle ache, cramps, constipation, and weight gain or loss. Researchers in [13] have applied neural network models to diagnose thyroid disease. In [14], Alqurashi and Wang worked on a thyroid dataset with five features using various ensemble clustering methods. Akbas et al. [15] studied the detection of thyroid cancer using multiple approaches. Other researchers have applied K-nearest neighbour, support vector machine, neural fuzzy methods, random forest, and extra trees to study this disease data [16, 17]. Dhyan Chandra Yadav in [18] proposed the prediction of thyroid diseases using a decision tree ensemble approach. In [19], the researchers developed a computer-aided diagnosis system using PCA and extreme learning techniques to predict thyroid diseases; their experiments show a maximum accuracy of 98.1%. Prasan Kumar Sahu in [20] proposed a cloud-enabled big data framework to provide a healthcare solution. The proposed technique deals with structured and unstructured data generated by healthcare systems and by wearable body sensors, and the results show 98% accuracy in predicting disease using correlation analysis.

In [21], research conducted on different patients showed that TSH is related to lipid or cholesterol levels: lipid values increased in patients after the level of TSH decreased. In [22], the researchers developed a hybrid architecture using rough set theory and machine learning algorithms to predict thyroid diseases. In [23], Zhiwen Yu applied a semi-supervised classifier ensemble approach and examined the problem of handling high-dimensional datasets with a limited number of labeled samples. Nyirenda [24] used a statistical approach to find the relationship between thyroid and vascular disease. The study found that a patient suffering from thyroid disease is more prone to vascular disease, and that significant mortality due to vascular disease is observed at a later stage in patients with thyroid disease. Raghuraman et al. in [25] performed a comparative thyroid disease diagnosis using the machine learning techniques Support Vector Machine (SVM), Multiple Linear Regression, and Decision Trees; the highest accuracy of 97.97% was obtained by the decision tree model. Dharamrajan et al. in [26] applied Support Vector Machine (SVM) and decision tree classifiers for thyroid prediction, and obtained an accuracy of 97.35% using decision trees.

3 Methodology

The dataset and methods are discussed in this section. The thyroid disease dataset consists of 3152 cases and 23 features, together with a class label indicating whether or not the individual is ill. We present the techniques and experimental set-up used for this task. The workflow begins with preprocessing of the data, followed by dimension reduction and data augmentation. After this, classifiers are implemented in a distributed environment, and finally a comparative analysis is performed. We begin by describing the three dimension reduction techniques. Dimensionality reduction is a method for representing the information in a high-dimensional feature space with fewer dimensions. In machine learning, reducing high-dimensional data is important for better classification, regression, presentation, and visualization. It also helps in understanding the associations within the data and in identifying the intrinsic dimensionality of the dataset. Since the volume of data is a critical issue in healthcare, data augmentation is applied to synthetically generate data for developing deep learning models, which are said to be data hungry.

3.1 Principal Component Analysis

Principal component analysis (PCA) is an unsupervised linear transformation technique commonly used in many fields, mainly for feature extraction and dimensionality reduction. Other common applications of PCA include data processing, signal de-noising, and the analysis of genome data and gene expression levels in bioinformatics. PCA identifies patterns in data based on the correlations between features. In short, PCA seeks the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with the same or fewer dimensions than the original.
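The projection described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function name pca_project and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top n_components principal directions."""
    # Center each feature so the covariance reflects variation around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are variables).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is suitable because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; pick the largest ones first.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 5 features, two of them strongly correlated.
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
Z, ratio = pca_project(X, n_components=2)
```

Here Z holds the data in the reduced two-dimensional subspace, and ratio gives the share of total variance captured by each retained direction.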

3.2 Singular Value Decomposition

The Singular Value Decomposition (SVD) of a matrix provides singular vectors of reduced dimension, which may be used very effectively for classification. This is especially so for data matrices, which are usually rectangular, so that eigenvalue decomposition is not possible. For symmetric matrices, the Spectral Theorem holds, which says that there is a basis of eigenvectors and every eigenvalue is real. The spectral theorem also provides a canonical decomposition, called the spectral decomposition, eigenvalue decomposition, or eigendecomposition, of the underlying vector space on which the operator acts. We now briefly explain the relationship between the spectral decomposition and the SVD. For a rectangular matrix \(A\) with \(m\) rows and \(n\) columns, the matrix \(A^{T} A\) is of dimension \(n \times n\) and is symmetric and positive semi-definite. Thus, \(A^{T} A = V E_{1} V^{T}\), and the columns of \(V\) are the eigenvectors of \(A^{T} A\). These vectors are orthogonal and in \(n\) dimensions. \(E_{1}\) is a diagonal matrix comprising the eigenvalues of \(A^{T} A\). Similar logic holds for the \(m \times m\) matrix \(A A^{T} = U E_{2} U^{T}\), where the columns of \(U\) are the eigenvectors of \(A A^{T}\). These vectors are orthogonal and in \(m\) dimensions. \(E_{2}\) is a diagonal matrix comprising the eigenvalues of \(A A^{T}\). The Singular Value Decomposition of \(A\) uses the matrices \(U\) and \(V\) introduced above as the eigenvectors of \(A A^{T}\) and \(A^{T} A\). The factorization of the rectangular matrix \(A\) into its Singular Value Decomposition is \(A = U \Sigma V^{T}\), such that the columns of \(U\) are the left singular vectors in \(m\) dimensions and the columns of \(V\) are the right singular vectors in \(n\) dimensions. The matrix \(\Sigma\) is a diagonal matrix whose diagonal entries are non-negative and are called singular values; they are the square roots of the non-zero eigenvalues shared by \(A^{T} A\) and \(A A^{T}\). Discarding the smallest singular values reduces the number of effective dimensions.
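As a small numerical check of the relationship between the SVD and the eigendecomposition of \(A^{T} A\), the following NumPy sketch uses a randomly generated matrix. The matrix size and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))   # hypothetical rectangular matrix, m = 6, n = 4

# Thin SVD: U is 6x4, s holds the singular values in descending order, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of the symmetric matrix A^T A, sorted in descending order.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# The squared singular values coincide with the eigenvalues of A^T A.
squares_match = np.allclose(s ** 2, eigvals)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the root of the sum of the discarded squared singular values.
approx_error = np.linalg.norm(A - A_k)
```

The approximation error depends only on the discarded singular values, which is why truncating small singular values loses little information.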
In our research work, for each class, we applied the Singular Value Decomposition and found the \(U\) singular vectors for non-zero singular values. These \(U_{i}\) were used for classification purposes. Further, we iterated over the number of singular values to determine how many were optimally required to perform the classification.
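One common way to classify with class-wise singular vectors is to assign a sample to the class whose subspace leaves the smallest residual. The sketch below illustrates that idea only; the residual rule, the helper names class_bases and classify, and the toy data are assumptions, and the procedure used in this work (Algorithm 1) combines the singular vectors with a KNN classifier instead.

```python
import numpy as np

def class_bases(X, y, k):
    """Return {label: U_k}, the k leading left singular vectors of each class."""
    bases = {}
    for label in np.unique(y):
        # Samples of one class are stored as columns, so U spans feature space.
        A = X[y == label].T
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        bases[label] = U[:, :k]
    return bases

def classify(x, bases):
    """Assign x to the class whose subspace leaves the smallest residual."""
    residuals = {label: np.linalg.norm(x - U @ (U.T @ x))
                 for label, U in bases.items()}
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(7)
# Hypothetical toy data: class 0 lies near the first axis, class 1 near the second.
t = rng.uniform(1.0, 5.0, size=(30, 1))
X0 = t * np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(30, 3))
X1 = t * np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(30, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 30 + [1] * 30)

bases = class_bases(X, y, k=1)
pred_a = classify(np.array([4.0, 0.1, 0.0]), bases)
pred_b = classify(np.array([0.1, 4.0, 0.0]), bases)
```

Varying k in class_bases corresponds to iterating over the number of singular values retained per class.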

3.3 Decision Tree

Decision tree methods build a decision model based on the actual attribute values in the data. For a given record, decisions are taken along the tree structure until a prediction is reached. Decision trees can be trained on data for both classification and regression problems. They are often fast and accurate, and they offer explainable solutions. A decision tree is a tree structure in which each internal (non-leaf) node represents a test on an attribute and each branch represents an outcome of that test. The leaf nodes are the class nodes. The objective is a model that estimates the value of the target variable from the input variables. In our work, decision trees are used to rank the features in order of decreasing importance.
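To make the idea of ranking features by a tree-based criterion concrete, the following sketch scores each feature by the largest Gini impurity decrease achievable with a single threshold split. This is a simplification (a one-level tree per feature) chosen for illustration; the helper names and toy data are assumptions and do not reproduce the importance computation of a full decision tree.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2 of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_gini_gain(feature, labels):
    """Largest impurity decrease over all single-threshold splits of one feature."""
    parent = gini(labels)
    best = 0.0
    for t in np.unique(feature)[:-1]:          # candidate thresholds
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = max(best, parent - weighted)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = np.array([[1, 5], [2, 1], [3, 4], [7, 2], [8, 6], [9, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

gains = [best_gini_gain(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(gains)[::-1]              # most important feature first
```

Features with a larger impurity decrease appear earlier in ranking, and a fixed number of top-ranked features can then be passed to a classifier.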

3.4 Building Classifiers

After feature reduction, the K-Nearest Neighbour (KNN) and Neural Network (NN) classifiers are built. We present outlines of the algorithms for implementing feature reduction and classification. Algorithm 1 gives the pseudocode for SVD with the KNN classifier. Step 1 loads the dataset into Resilient Distributed Datasets (RDDs). Step 2 performs preprocessing and normalization of the dataset. Step 3 splits the dataset into training (80%) and testing (20%) data. In Step 4, the testing data is broadcast so that each worker node receives only one copy. SVD is applied to the training data, and the left singular vectors U, which represent the training data, are obtained (Steps 7–8). The Euclidean distance between U and the test data is then calculated, the distances are collected at the master, and the KNN classifier is applied (Steps 9–12).

Algorithm 1 SVD with KNN classifier

Algorithm 2 Feature reduction with neural network classifier

Algorithm 2 gives the steps for the feature reduction technique with a neural network classifier. Steps 1–2 are the same as in Algorithm 1. The feature reduction technique, DT or PCA, is applied to the RDD dataset, and the data is split in the same manner as in Step 3 of Algorithm 1. The model is trained by applying the neural network classifier to the training data and is then evaluated on the testing data to obtain the accuracy score (Steps 5–7).
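The flow of Algorithm 1 (split, SVD-based reduction, Euclidean distances, KNN vote) can be sketched on a single machine with NumPy. This is only an illustration under stated assumptions: the data is synthetic, the reduced dimension r = 3 is arbitrary, the test rows are projected with the right singular vectors so that they share coordinates with the reduced training rows, and no distributed execution is involved.

```python
import numpy as np

def knn_predict(train_Z, train_y, test_Z, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise distances, shape (n_test, n_train).
    d = np.linalg.norm(test_Z[:, None, :] - train_Z[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = train_y[nearest]
    # Labels are 0/1, so class 1 wins when more than half of the k votes are 1.
    return (votes.sum(axis=1) * 2 > k).astype(int)

rng = np.random.default_rng(42)
# Hypothetical stand-in data: two well-separated Gaussian classes, 6 features.
X = np.vstack([rng.normal(0.0, 1.0, (100, 6)), rng.normal(4.0, 1.0, (100, 6))])
y = np.array([0] * 100 + [1] * 100)

# 80/20 split after shuffling.
idx = rng.permutation(len(y))
cut = int(0.8 * len(y))
train, test = idx[:cut], idx[cut:]

# Truncated SVD of the training matrix; rows of U * s are the reduced training points.
U, s, Vt = np.linalg.svd(X[train], full_matrices=False)
r = 3
train_Z = U[:, :r] * s[:r]
test_Z = X[test] @ Vt[:r].T      # project test rows onto the same directions

# Try the odd values of k and keep the best one.
accuracy = {k: float(np.mean(knn_predict(train_Z, y[train], test_Z, k) == y[test]))
            for k in (3, 5, 7, 9)}
best_k = max(accuracy, key=accuracy.get)
```

In a distributed setting, the distance computation would run on the worker nodes against the broadcast test data, with the master collecting the distances before the vote.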

3.5 Data Augmentation and Deep Learning

To apply data augmentation, we created 10,000 samples using a Gaussian distribution. The ratio of class 0 (non-thyroid) to class 1 (thyroid) is 91:9 in the original dataset. The mean and standard deviation of the features were calculated for each class label. We then created 900 samples of class 1 and 9100 samples of class 0 using \(Gaussian\left( {mu,sigma} \right) + random\left( \lambda \right)\), where \(mu\) represents the mean of each feature, \(sigma\) denotes the standard deviation of each feature, and the noise term is a random number \(\lambda \in \left( { - 0.1,0.1} \right)\). We set aside 20% of the 10,000 samples for validation purposes.
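The sampling scheme above can be sketched as follows. The function name augment, the stand-in original data, and the choice to draw independent noise for every value are assumptions for illustration; only the class shares (91:9), the total of 10,000 samples, the noise range, and the 20% validation share come from the description.

```python
import numpy as np

def augment(X, y, n_total, class_share, rng, noise=0.1):
    """Draw synthetic rows per class from per-feature Gaussians plus uniform noise."""
    rows, labels = [], []
    for label, share in class_share.items():
        n = int(round(n_total * share))
        Xc = X[y == label]
        # Per-feature mean and standard deviation within this class.
        mu, sigma = Xc.mean(axis=0), Xc.std(axis=0)
        samples = rng.normal(mu, sigma, size=(n, X.shape[1]))
        # Noise term lambda drawn uniformly from the interval (-0.1, 0.1).
        samples += rng.uniform(-noise, noise, size=samples.shape)
        rows.append(samples)
        labels.append(np.full(n, label))
    return np.vstack(rows), np.concatenate(labels)

rng = np.random.default_rng(0)
# Hypothetical stand-in for the original data: 100 rows, 4 features, 91:9 class ratio.
X = rng.normal(size=(100, 4))
y = np.array([0] * 91 + [1] * 9)

X_aug, y_aug = augment(X, y, 10_000, {0: 0.91, 1: 0.09}, rng)

# Hold out 20% of the synthetic samples for validation.
perm = rng.permutation(len(y_aug))
val_idx, train_idx = perm[:2000], perm[2000:]
```

Sampling each class separately keeps the class ratio of the original data in the augmented set.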

3.6 Data Pre-processing and Normalization

Data preprocessing is the primary task in data mining procedures. It includes cleaning, extraction, and transformation of data into a format suitable for machine processing. Raw data contains missing and invalid values, which lead to poor predictions by machine learning models. Categorical variables are replaced by 0 and 1; for example, male and female are replaced by 1 and 0. Normalization, which involves standardizing the data, is a very important step in deep learning.
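A minimal sketch of these preprocessing steps is given below. The category labels "M" and "F", the "?" marker for missing values, and the helper names are assumptions for illustration; the replacement of missing values by 0 and the 1/0 encoding follow the description in this paper.

```python
import numpy as np

SEX_CODES = {"M": 1.0, "F": 0.0}   # hypothetical category labels

def to_number(value, codes=SEX_CODES):
    """Map a raw field to a float; missing or invalid entries become 0."""
    if value in codes:
        return codes[value]
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0              # avoid division by zero for constant columns
    return (X - mean) / std

# Hypothetical raw records: sex, age, and one hormone measurement.
raw = [["M", "72", "1.3"], ["F", "?", "2.1"], ["F", "45", None], ["M", "60", "0.7"]]
X = np.array([[to_number(v) for v in row] for row in raw])
X_std = standardize(X)
```

After this step every column is numeric and on a comparable scale, which is what the distance-based and neural network classifiers expect.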

4 Experiments

The experimental setup used for this research work had five personal computers: a single master node and four worker nodes. The computers were identical, each with 8 GB RAM (DDR3), an Intel Core i7 processor (5th Gen), and a 1 TB hard disk. The operating system was Linux Ubuntu-18.04 with Apache Spark-2.4.3. Python was used as the programming language on the Spark platform.

All experiments were conducted in a distributed environment on the Spark platform. The data was loaded using \(\text{Data} = \text{sc.textFile}(\text{file})\). We then performed preprocessing and handled missing values: null values were replaced by 0. The data was then normalized, and the dataset was split into 80% training and 20% testing data. The test data was broadcast to all worker nodes using \(\text{testdata} = \text{sc.broadcast}(\text{testdata})\). The training data was distributed across the worker nodes using \(\text{rdd} = \text{sc.parallelize}(\text{train})\). The row matrix was created from the \(\text{rdd}\) using \(\text{mat} = \text{RowMatrix}(\text{rdd})\). After this, the dimension reduction techniques were applied and the reduced dimensions were fed into the classifiers. All of this is executed on the worker nodes, after which the distances to the test data are computed for classification. The master node collects all the distance values and predicts the class label corresponding to the minimum distance. Finally, the accuracy score is calculated.

For the K-nearest neighbour classifier, the values of K were taken to be 3, 5, 7, and 9, and the best results are reported. The input layer has 22 features for the PCA-NN classifier, 12 features for the SVD-NN classifier, and 5 features for the DT-NN classifier. The neural network model has two hidden layers with 10 neurons each and uses the sigmoid activation function. It is implemented on the Spark platform with \(\text{block size} = 128\) and \(\text{seed value} = 1234\).

For prediction with augmented data and a deep neural network, our experiment used an input layer with 23 inputs and two hidden layers with 16 neurons, with the Rectified Linear Unit (ReLU) activation function. The output layer has one neuron with a sigmoid activation function. We set the batch size to 64 and the number of epochs to 100, and 20% of the entire dataset was used for validation. Figure 1 shows the architecture of the deep neural network.
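To make the layer sizes concrete, the sketch below runs a forward pass through a network of the stated shape in plain NumPy. It assumes that each of the two hidden layers has 16 neurons, uses randomly initialized (untrained) weights, and omits training entirely; it is not the model used in the experiments.

```python
import numpy as np

LAYER_SIZES = [23, 16, 16, 1]   # input, two hidden layers (assumed 16 each), output

def init_params(sizes, rng):
    """Small random weights and zero biases for each pair of consecutive layers."""
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(X, params):
    """ReLU on the hidden layers, sigmoid on the single output neuron."""
    a = X
    for W, b in params[:-1]:
        a = np.maximum(0.0, a @ W + b)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(a @ W + b)))

rng = np.random.default_rng(0)
params = init_params(LAYER_SIZES, rng)
batch = rng.normal(size=(64, 23))           # one batch of 64 samples with 23 inputs
probs = forward(batch, params)              # predicted probability of class 1
n_params = sum(W.size + b.size for W, b in params)
```

Under these assumptions the network has 673 trainable parameters (384 + 272 + 17), which is small relative to the 10,000 augmented samples.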

Fig. 1 Structure of deep learning neural network

5 Results

All experiments were conducted in a distributed environment on the Spark platform. The dimension reduction techniques were applied, and the features identified by these techniques were then input into the classifiers. For the K-nearest neighbour classifier, the values of K were taken to be 3, 5, 7, and 9, and the best results are reported. For the neural network, we set the maximum number of iterations to 100 and the layers to [number of features, 10, 10, 2]. After 100 iterations, the error did not converge. Odd values of k were considered because majority-voting classifiers commonly use them (an odd k avoids tied votes between two classes), and they are also the options typically offered when searching for the best value of k in Python libraries.

5.1 Dataset Description and Exploratory Analysis

Table 1 describes the thyroid disease dataset [27], which comprises 3152 instances, 23 features, and a class label. The task is to predict whether or not a person is suffering from sickness-euthyroid disease.

Table 1 Dataset description

The names and descriptions of the features are given in Table 2.

Table 2 Description of attributes

The distribution of the classes in the thyroid dataset is shown in Table 3.

Table 3 Distribution of classes

Next, we find the importance of each feature using the Gini index, as shown in Fig. 2. The correlation between the features is presented in Fig. 3.

Fig. 2 Importance of each feature

Fig. 3 Correlation graph of features

5.2 Comparative Analysis of Classifier Performance

Table 4 shows that, as a dimension reduction technique, singular value decomposition performs better than principal component analysis, while the decision tree performs better than singular value decomposition. The best accuracy of 98.70% is obtained by combining decision tree dimension reduction, which selects five features, with the neural network classifier. The F1-score, precision, and recall are also the best for this combination. The same results are plotted in Fig. 4. Table 5 shows the total run time of the different classifiers: the neural network classifier takes slightly longer than the KNN classifier.
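For reference, the evaluation measures reported here (accuracy, precision, recall, and F1-score) can be computed from predicted and true labels as in the following sketch. The function name and the ten toy labels are illustrative assumptions and are unrelated to the values in Table 4.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 3 positive and 7 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
```

On imbalanced data such as a 91:9 class ratio, precision, recall, and F1-score are more informative than accuracy alone, which is why they are reported alongside it.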

Table 4 Comparative accuracy for varying number of features in thyroid dataset

Fig. 4 Plot of various parameters of thyroid dataset

Table 5 Total run times of different classifiers

In Table 6 we present the results of the deep neural network model built with augmented data. The scores in Table 6 are higher than the earlier results in Table 4, and the values of F1-score, precision, and recall are also the best.

Table 6 Parameters score by DNN

Figure 5 shows that the accuracy on the training and testing data varies in almost the same way. The accuracy on the testing data reached a maximum of 99.95%.

Fig. 5 Plot of accuracy between training data and testing data

Figure 6 shows the loss on the training and validation data. Initially, the training loss is high; it then decreases towards the validation loss over 100 epochs.

Fig. 6 Plot of training and validation loss across epochs

Finally, in Table 7 and Fig. 7 we compare the performance of our model with that of other researchers. Ioniţă and Ioniţă [28] apply Naive Bayes, decision tree, multilayer perceptron, and radial basis function network classifiers. Tyagi et al. [29] also use decision trees along with artificial neural networks for the classification of thyroid datasets.

Table 7 Comparison of our proposed model with other techniques

Fig. 7 Comparative study of our proposed work with other techniques

Sivasakthivel et al. [30] apply different kinds of decision tree classifiers for the same purpose. Li-Na Li in [19] developed a computer-aided diagnosis system using PCA and extreme learning techniques to predict thyroid diseases, and obtained a maximum accuracy of 98.1%. Prasan Kumar Sahu in [20] proposed a cloud-enabled big data framework to provide a healthcare solution, and the results show 98% accuracy in predicting disease using correlation analysis. Raghuraman et al. in [25] performed a comparative thyroid disease diagnosis using the machine learning techniques Support Vector Machine (SVM), Multiple Linear Regression, and Decision Trees; the highest accuracy of 97.97% was obtained by the decision tree model. Dharamrajan et al. in [26] applied Support Vector Machine (SVM) and decision tree classifiers for thyroid prediction, and obtained an accuracy of 97.35% using decision trees. Finally, of our two proposed techniques, the first, with feature reduction, achieves an accuracy of 98.7%, while the second, with data augmentation, achieves an accuracy of 99.95%; both outperform all the others.

6 Conclusion

An enormous growth has been observed in medical expert systems in recent years, and the available systems are now sufficiently developed to be used in practice. To provide patient care more efficiently, however, expert systems will gradually need to be incorporated into hospital information systems. Medical data are essential for tasks such as the design and production of vaccines. In medical applications, a dataset is collected by testing patients' responses to a particular medicine or by gathering the medical tests used to diagnose a certain condition. Thyroid disease is especially hard to diagnose because its symptoms can easily be confused with those of other conditions. With early diagnosis, therapy can regulate thyroid dysfunction. This study proposes a modelling solution for the prediction of thyroid disease so that society can benefit from advances in computational techniques. The thyroid disease dataset consists of 3152 cases and 23 features, together with a class label indicating whether or not the individual is ill. Dimension reduction and data augmentation techniques are applied, and their output is used as input to two classifiers. The detailed experimental results are presented in Tables 4, 5 and 6. A comparison with the work of other researchers in Table 7 shows that our feature reduction and data augmentation techniques perform well, with accuracies of 98.7% and 99.95%, respectively. As part of our ongoing work, we aim to apply deep learning models to the prediction of complex life-threatening diseases.