Automated invasive cervical cancer detection at an early stage through suitable machine learning models

Cervical cancer is a common cancer that affects women all over the world. It is the fourth most common cancer among women and shows no symptoms in its early stages. Cervical cancer cells develop slowly in the cervix. If detected early, this cancer can be treated successfully. Health professionals currently face a major challenge in detecting such cancer before it spreads rapidly. This study applied various machine learning classification methods to predict cervical cancer using risk factors. The main aim of this research work is to describe how the performance of eight classification algorithms varies in detecting cervical cancer when different sets of top features are selected from the dataset. Multilayer Perceptron (MLP), Random Forest, k-Nearest Neighbor, Decision Tree, Logistic Regression, SVC, Gradient Boosting and AdaBoost are the machine learning classification algorithms used to predict cervical cancer and assist early diagnosis. A variety of approaches are used to handle missing values in the dataset. To choose the various best feature sets, a combination of feature selection techniques, namely Chi-square, SelectKBest and Random Forest, was used. The performance of these classifiers is evaluated using the accuracy, recall, precision and f1-score metrics. MLP outperformed the other classification models on a variety of top feature sets. Most classification models, however, achieve their highest accuracy on the top 25 features with a dataset splitting ratio of 70:30. For each model, the percentage of correctly classified instances is presented, and all of the results are then discussed. Medical professionals will be able to use the suggested approach to perform research on cervical cancer.


Introduction
Invasive cervical cancer arises in the woman's cervix by affecting its deeper tissues. Cervical cancer can spread to other parts of the body, such as the lungs, liver, bladder, vagina and rectum. Healthy cells in the cervix develop changes (mutations) in their DNA; normally, cells grow, multiply at a set rate and eventually die at a set time. Abnormal mutational activity, however, causes unhealthy cells to grow and multiply out of control, and because they do not die, the accumulating abnormal cells form a mass (tumor). The best-known symptoms of cervical cancer are unusual pain after sex; vaginal bleeding after sex, between periods, after menopause or after a pelvic examination; and vaginal discharge. The most common risk factors are many sexual partners, early sexual activity, sexually transmitted infections (STIs), a weakened immune system, smoking, exposure to miscarriage prevention drugs and the like [1].
The sexually transmitted human papillomavirus (HPV) plays a vital role in cervical cancer and genital warts. Among more than 100 different strains of HPV, HPV-16 and HPV-18 are the strains best known to cause cancer. Cancer-causing strains of HPV are also responsible for vulvar, vaginal, penile, anal, rectal and throat cancers [2]. Depending on how far it has spread, cervical cancer has four stages: in stage 1 it spreads only to the lymph nodes; in stage 2 the larger cancer spreads outside of the uterus and cervix or to the lymph nodes; in stage 3 it spreads to the lower part of the vagina or the pelvis and may block the ureters, the tubes that carry urine from the kidneys to the bladder; and in the final stage it spreads outside of the pelvis to organs such as the lungs, bones or liver [3]. The simplest initiatives to prevent cervical cancer are the HPV vaccine, routine Pap tests (early-stage detection), practicing safe sex, not smoking and so on [4,5].
The "National Strategy for Cervical Cancer Prevention and Control" program was launched by the Bangladesh Ministry of Health & Family Welfare (MoHFW), extending over five years from 2017 to 2022. Although the World Health Organization (WHO) identified invasive cervical cancer as the fourth most common cancer in women worldwide, it is the second most common type of cancer among women between 15 and 44 years of age in Bangladesh. Approximately 12,000 new cases are diagnosed every year, and the disease causes over 6,000 deaths. About 4.4% of women in the general population are estimated to harbor cervical HPV16/18 infection at a given time, and 80.3% of invasive cervical cancers in Bangladesh, in Southern Asia, are attributed to HPV 16 or 18 [6][7][8][9].
Various ample prevention measures are practiced, but the occurrence of cervical cancer cannot be stopped by screening tests alone. Early-stage detection of this cancer can play an important role in controlling deaths due to invasive cervical cancer. Currently, computer vision, artificial intelligence (AI), machine learning (ML) and deep learning (DL) are the most popular approaches for detecting various diseases. Among them, effective ML algorithms attract great attention by rapidly detecting the targeted diseases. Suitable ML algorithms can be applied to a disease dataset that has been preprocessed through several activities such as data cleaning, dimensionality reduction and feature selection.
The analyzed results of these algorithms may assist medical officers in diagnosing diseases rapidly and providing the best medications for their patients.
This research selects a variety of different top feature sets using a combination of feature selection techniques, reducing training time and assisting oncologists in quickly detecting cervical cancer. The main objectives of this research are to improve classification performance using machine learning classification techniques and to analyze the performance results based on the different top feature sets.
The remaining part of the paper is organized as follows: Sect. 2 reviews relevant work on classification algorithms for cervical cancer. The proposed methodology for detecting cervical cancer using various classification algorithms and feature selection methods is explained in Sect. 3. Sections 4 and 5 present the results and discussion. The conclusion of the paper is given in Sect. 6.

Related works
Machine learning (ML) has a diverse application scope for detecting the majority of diseases in all kinds of animals and plants. Currently, a plethora of ML models are proposed and applied to targeted application areas to accelerate and enrich research. In [10], four target variables-Hinselmann, Schiller, Cytology and Biopsy-with 32 risk factors are considered for analyzing pernicious cervical cancer leading to unexpected death. These major risk factors are sorted out by developing an effective model and applying the most popular ML methods: Logistic Regression, Decision Tree, Random Forest and an Ensemble Method. The analysis demonstrates cervical cancer prediction accuracy and AUC of 98.56% and 99.50%, respectively, for the Ensemble method using the proposed SMOTE-Voting-PCA model. Moreover, for the prediction of the Schiller target variable, the SMOTE-Voting-PCA model achieves 98.49%, 98.60% and 99.80%, respectively.
Jiayi Lu et al. [11] observed that the ghastly prevalent cervical cancer is a burdensome problem in both developing and developed countries due to a lack of awareness and health diagnostic facilities. As a result, late detection of cervical cancer and the high cost of treating this disease create an even more perilous situation for patients. A UCI dataset and a private dataset are used in that work. An ensemble approach combining LR, DT, SVM, MLP and KNN is discussed to predict the risk of cervical cancer. The predictive performance of the proposed approach is improved by adopting a data correction mechanism and a gene-assistance module. The obtained accuracy of the proposed method was 83.16%.
In the last decade, this most frequently occurring cancer of the female body has attracted much attention from research teams for early-stage detection and treatment. In [12], machine learning classification algorithms such as Multilayer Perceptron, Decision Trees, Random Forest, K-Nearest Neighbor and Naïve Bayes are applied for early-stage cervical cancer detection using a risk-factor dataset collected from UCI. Combining several machine learning techniques into one model yielded an accuracy of 87.21% using training and validation operations.
In [13], a dataset is introduced by surveying 858 patients with 33 attributes to make a predictive analysis of cervical cancer. This dataset was split into training and test portions, and various machine learning classification methods, namely Multilayer Perceptron, BayesNet and k-Nearest Neighbor, are applied for prediction. The performance analysis of these algorithms is based on the confusion matrix, using correctly classified instances and percentages.
This paper [14] was published in the Special Section on Fault Diagnosis, Data-Driven Monitoring, and Control of Cyber-Physical Systems in IEEE Access. According to the authors [14], several data-driven approaches such as principal component analysis (PCA), particle swarm optimization (PSO), fuzzy positivistic C-means clustering, linear regression (LR), artificial neural network (ANN) and support vector machine (SVM) have been proposed and implemented in the field in recent years to provide timely diagnosis.
A plethora of effective machine learning algorithms have been applied to the popular Pap smear image dataset to diagnose cervical cancer at an early stage [15]. The performance metrics derived from the confusion matrix are precision, recall, F1 score and accuracy. The confusion matrix and cost-effectiveness in terms of CPU time are used for the performance analysis of the various algorithms. The proposed method selects the best features and reduces processing time, helping oncologists in the early detection of cervical cancer. In that work, Logistic Regression (LR) exhibits 100% accuracy but requires more CPU time, whereas 99% accuracy is obtained in exchange for less CPU time.
This paper deals with various selections of top feature sets using a combination of feature selection techniques, whereas previous related papers focused on only one selection of top features. All previous related works also deal with only one dataset splitting ratio; this paper, on the other hand, deals with three different dataset splitting ratios.

Proposed method
Our main goal is early-stage cervical cancer detection through classification algorithms on various top features of the dataset. A model has been proposed for this purpose. The proposed methodology is divided into the following subtasks, as shown in Fig. 1. In this approach, some important stages need extra attention: (1) pre-processing and feature selection and (2) model building and analysis. Each step is discussed in detail below:

Dataset description
The dataset was collected from the University of California, Irvine (UCI) Machine Learning Repository. The data were gathered at the Hospital Universitario de Caracas in Caracas, Venezuela.

Pre-processing
Data preprocessing is a crucial step in the data mining and machine learning (ML) fields for handling missing values, noisy data, incomplete data, inconsistent data and outliers in raw data. Preprocessing consists of a number of basic steps, including cleaning, integration, transformation and reduction. The main objectives of data preprocessing are reducing data size, determining relationships among data, normalizing data, removing outliers and extracting features. Before applying ML models, a few basic steps need to be performed on the targeted dataset. These steps are followed in order: importing libraries, importing the dataset, taking care of missing data, encoding categorical data, and splitting the dataset into training and test sets [17]. Some columns in the dataset have a few cells containing question marks for questions the patients had skipped due to privacy concerns. Initially, all the question marks were replaced with null values in the dataset.
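The question-mark handling described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the column name and values are hypothetical, and median imputation stands in for whichever strategy was actually applied after the '?' markers were nulled out.

```python
# Sketch of missing-value handling: skipped answers appear as "?" in the raw
# data; replace them with an imputed value (here, the column median).

def clean_column(values):
    """Treat '?' as missing and impute it with the median of the known values."""
    numeric = sorted(float(v) for v in values if v != "?")
    mid = len(numeric) // 2
    if len(numeric) % 2:
        median = numeric[mid]
    else:
        median = (numeric[mid - 1] + numeric[mid]) / 2
    return [float(v) if v != "?" else median for v in values]

ages = ["23", "?", "31", "45", "?"]  # illustrative column, not real patient data
print(clean_column(ages))  # → [23.0, 31.0, 31.0, 45.0, 31.0]
```

In practice the same effect is obtained with pandas by replacing '?' with NaN and calling an imputer over each affected column.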

Feature selection
When developing a predictive model, feature selection is the process of selecting the subset of features from the dataset that contributes most effectively to the forecast variable or output. Irrelevant attributes in the dataset reduce the model's prediction efficiency as well as the classifiers' overall performance in terms of accuracy and complexity. Feature selection brings many further benefits, such as reduced training time, improved model accuracy and reduced over-fitting when modeling the dataset [18]. There are several different types of feature selection methods for classification techniques. In this paper, we used the Chi-square test and SelectKBest method, as well as Random Forest feature selection, to determine which features were most relevant. After applying these feature selection methods, we determined various sets of top features: 10, 15, 20, 25 and 30. The various top features are detailed in the figure below:

Chi-square and SelectKBest
The Chi-square statistic is used for two types of tests: tests of independence and tests of goodness of fit. The Chi-square test of independence is used in feature selection to determine whether the class label is independent of a feature [19]. The Chi-square test is used for categorical features in a dataset. We measure the Chi-square statistic between each feature and the target and choose the various top features with the highest Chi-square scores using the SelectKBest model. It decides whether the association between two categorical variables in the sample reflects their true association in the population. This feature selection method has already been applied successfully in plenty of research works [20]. The Chi-square score is calculated by Eq. (1):

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \tag{1}$$

where $O_i$ is the observed number of observations of class $i$ and $E_i$ is the expected number of observations of class $i$ if there were no relationship between the feature and the target.
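Equation (1) and the SelectKBest-style top-k cut can be illustrated in a few lines. The feature names and counts below are hypothetical, purely to show the mechanics; the real work used scikit-learn's `chi2` scoring function with `SelectKBest`:

```python
def chi_square_score(observed, expected):
    """Eq. (1): sum over classes of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed class counts per feature vs. uniform expected counts.
# A higher score indicates a stronger association with the target, so the
# feature survives the top-k cut (this is what SelectKBest does).
scores = {
    "smokes": chi_square_score([30, 10], [20, 20]),        # 10.0
    "age_band": chi_square_score([22, 18], [20, 20]),      # 0.4
    "num_partners": chi_square_score([35, 5], [20, 20]),   # 22.5
}
top_2 = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_2)  # → ['num_partners', 'smokes']
```

Repeating the cut with k = 10, 15, 20, 25 and 30 yields the top feature sets compared throughout this paper.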

Random Forest
Random Forest is a supervised learning algorithm that can be used for regression as well as classification.
Because of their relative accuracy, robustness and ease of use, Random Forests are one of the most common machine learning methods. They also provide two simple feature selection methods: mean decrease impurity and mean decrease accuracy [21]. The most suitable features are those with the highest scores after applying the Random Forest algorithm to the dataset; these features received the highest scores out of all the features in the dataset.
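The mean decrease impurity idea can be made concrete with a single split. The sketch below computes the Gini impurity decrease that one candidate split produces; a Random Forest averages such decreases over every split of every tree to score each feature (the data here are illustrative, not from the dataset):

```python
def gini(labels):
    """Gini impurity of a binary label list: 1 - p0^2 - p1^2."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(values, labels, threshold):
    """Drop in weighted Gini impurity when splitting at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A feature that perfectly separates the classes gets the maximum decrease.
print(impurity_decrease([1, 2, 3, 4], [0, 0, 1, 1], threshold=2))  # → 0.5
```

Features accumulating larger impurity decreases across the forest rank higher, which is exactly the score used here to pick the top feature sets.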

Data splitting
To determine the performance of machine learning algorithms, datasets are split into training and testing sets. During splitting, the training set receives the majority of the data, while the testing set receives a smaller portion. The training dataset was used to train the proposed model, and the success rates were determined by running the trained model on the test dataset. In this study, the dataset was split in various ratios to compare the accuracy, precision, recall and f1-score values of the various classification algorithms.
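A split such as the 70:30 ratio reported later can be sketched with the standard library (in practice `sklearn.model_selection.train_test_split` does the same job; the seed value here is arbitrary):

```python
import random

def split_dataset(rows, train_ratio, seed=42):
    """Shuffle the rows and split them by the given ratio (0.7 gives 70:30)."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)), train_ratio=0.7)
print(len(train), len(test))  # → 70 30
```

The other splitting ratios used in the study are obtained the same way by changing `train_ratio`.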

Logistic Regression (LR)
Logistic Regression is a machine learning classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which produces continuous numeric values, Logistic Regression produces a probability value that can be mapped to two or more distinct groups using the logistic sigmoid function. The function converts every real number into a number between 0 and 1. We use the sigmoid to map predictions to probabilities in machine learning. The sigmoid function is given by Eq. (2):

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

where
• f(x) = output between 0 and 1 (probability estimate)
• x = input to the function (the algorithm's prediction, e.g., mx + b)
• e = base of the natural logarithm
LR has a number of advantages, including producing probabilities naturally and handling multi-class classification problems [22]. Another advantage is that most LR model analysis approaches follow the same concepts as linear regression [23]. In our experiments, Logistic Regression was the most consistent in its accuracy scores.

Random Forest (RF)
Random Forest (RF) is a data mining technique for dealing with classification and regression problems. Growing an ensemble of trees and voting to determine the class category greatly improves classification accuracy. To develop these ensembles, random vectors are created, and each tree is grown from one or more random vectors. RF is made up of classification and regression trees. The outputs of the trees are analyzed to solve classification problems, and the RF prediction is determined by a majority of class votes.
As more trees are added to an RF, the generalization error converges to a limiting value, and over-fitting does not occur in large RFs [24]. To achieve higher accuracy, low bias and low correlation are important. Trees are grown without pruning to achieve low bias, and variables are randomly selected at each node to achieve low correlation. After a tree has been grown, an out-of-bag (OOB) dataset is used to evaluate its performance. As more trees grow, RF uses the OOB dataset to compute an unbiased error estimate. In RF classification, the OOB dataset is also often used to calculate variable importance [25].

K-nearest neighbor
The k-NN is a supervised learning algorithm for solving classification problems. The key idea is to study the characteristics of each category in advance [26]. In the k-NN algorithm, the distance between the new individual and all previous individuals is computed based on the attributes used in the classification, and the k nearest individuals determine the class. As a result of this procedure, the test instance is assigned to the class that has the most members among its k nearest neighbors. This research uses experiments to determine the best value of k, and the Euclidean distance method is used to calculate distance.
The Euclidean distance is given by Eq. (3) [27]:

$$d(p_i, p_j) = \sqrt{\sum_{k=1}^{n} (p_{ik} - p_{jk})^2} \tag{3}$$

where $p_i$ and $p_j$ are different points in the n-dimensional space in which the Euclidean distance is calculated.
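The two steps just described — Eq. (3) for distance, then a majority vote over the k nearest neighbors — fit in a short sketch. The training points and class names below are made up for illustration:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Eq. (3): straight-line distance between two n-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k=3):
    """Assign the majority class among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative (point, label) pairs — not real patient feature vectors.
train = [((0, 0), "healthy"), ((1, 0), "healthy"),
         ((5, 5), "cancer"), ((6, 5), "cancer"), ((5, 6), "cancer")]
print(knn_predict(train, (5, 5), k=3))  # → cancer
```

Varying `k` as the study does simply changes how many neighbors take part in the vote.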

Multilayer perceptron (MLP)
The MLP is an artificial neural network model that maps input sets to appropriate output sets using a feedforward architecture. A multilayer perceptron (MLP) is made up of several layers of nodes, each of which is connected to the one before it. Except for the input nodes, each node is a processing unit, or neuron, with a nonlinear activation function. It uses a supervised learning technique called backpropagation to train the network. An alteration of the standard linear perceptron, the MLP is capable of distinguishing data that are not linearly separable [28].

Support vector machine
The support vector machine (SVM) is a supervised learning system and one of the kernel-based machine learning techniques. Vapnik at Bell Laboratories is primarily responsible for the creation of the SVM. SVM creates a high-dimensional space and then divides it based on the training data [29]. SVC was chosen for its ability to handle high-dimensional input [30]. The kernel parameter specifies the kernel type to be used in the SVC algorithm: it must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable, and if none is given, 'rbf' is used. We applied the 'linear' kernel.
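The kernel choice described above maps directly onto scikit-learn's `SVC` constructor. The sketch below assumes scikit-learn is available; the toy points are illustrative, not drawn from the cervical cancer dataset:

```python
from sklearn.svm import SVC

# Two linearly separable clusters standing in for the two diagnostic classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# kernel must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
# or a callable; 'rbf' is the default, and 'linear' is the choice in this work.
clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```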

Decision tree
One of the classification methods is the decision tree [31], which classifies labeled training data into a tree or rules.
To test the accuracy of a classifier, test data are randomly selected from the training data after the tree or rules are derived in the learning process. Once accuracy is checked, unlabeled data are classified using the tree or rules learned during the learning phase. A decision tree has a root node, a left subtree and a right subtree, and the leaf nodes represent class labels. Attribute splits depend on impurity measures such as information gain, gain ratio and the Gini index. The information gain is defined by Eq. (4):

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{4}$$

where $S$ is the set of training examples, $A$ is an attribute and $S_v$ is the subset of $S$ for which $A$ takes the value $v$. Following its construction, the tree is pruned to deal with overfitting and noise, and finally the tree is optimized. The benefit of a tree-structured approach is that it is simple to understand and interpret, and it can handle both categorical and numerical attributes. It is also resistant to outliers and missing values. Decision tree classifiers are commonly used to diagnose diseases such as breast cancer, ovarian cancer and heart conditions from heart sounds [32].
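Equation (4) can be verified on a tiny example. The sketch computes the entropy of a label set and the gain of a candidate binary split; a split that perfectly separates the classes yields the maximum gain of 1 bit:

```python
import math

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(labels, left, right):
    """Eq. (4): parent entropy minus the weighted entropy of the two children."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = [0, 0, 1, 1]
print(information_gain(labels, [0, 0], [1, 1]))  # → 1.0, a perfect split
```

The tree builder evaluates this gain for every candidate split and greedily picks the largest at each node.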

Gradient boosting classifiers (GBC)
Gradient boosting (GB) [33] generates new models sequentially from an ensemble of weak models, with the aim of minimizing the loss function with each new model. The gradient descent method is used to minimize this loss function. Each new model fits the observations more accurately, and thus the overall accuracy increases. Boosting, however, must be stopped at some point; otherwise, the model would tend to overfit. A threshold for prediction accuracy or a maximum number of models may be used as a stopping criterion.

AdaBoost classifier
Adaptive boosting initially gives each training observation an equal weight. It uses a series of weak models and gives higher weights to observations that have been misclassified. Since it uses several weak models and integrates the decision boundaries obtained across multiple iterations, the accuracy on the misclassified observations increases, as does the overall accuracy [34]. The weak models are evaluated using the error rate given in Eq. (5):

$$\epsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right] \tag{5}$$

where $\epsilon_t$ is the weighted error estimate, $\Pr_{i \sim D_t}$ denotes probability over examples $i$ drawn from the distribution $D_t$, $h_t$ is the hypothesis of the weak learner, $x_i$ is the training observation, $y_i$ is the target variable, and $t$ is the iteration number. The prediction error is 1 if the classification is wrong and 0 if it is correct.
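For a finite training set with weight distribution $D_t$, Eq. (5) amounts to summing the weights of the misclassified examples. The weights and predictions below are illustrative:

```python
def weighted_error(weights, predictions, targets):
    """Eq. (5): epsilon_t, the total weight of the examples the weak learner gets wrong."""
    return sum(w for w, p, y in zip(weights, predictions, targets) if p != y)

weights = [0.25, 0.25, 0.25, 0.25]  # D_t: uniform at the first iteration
predictions = [1, 0, 1, 1]          # h_t(x_i), the weak learner's outputs
targets = [1, 1, 1, 1]              # y_i
print(weighted_error(weights, predictions, targets))  # → 0.25, one example wrong
```

AdaBoost then uses this error to weight the weak learner's vote and to increase $D_{t+1}$ on the misclassified examples.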

Evaluation measures
The confusion matrix is a common method used to evaluate classification problems. It can be used for multiclass as well as binary classification. The confusion matrix is an N×N matrix that describes the classification performance of a classifier on some test data; it can be determined only if the true values for the test data are known. Confusion matrices represent counts of predicted versus actual values.
A number of evaluation criteria are listed below:

Accuracy: Accuracy is the most intuitive performance measure. It is the ratio of correctly predicted observations to total observations: the sum of true positives and true negatives divided by the total number of subjects in the sample. It is calculated using Eq. (6):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$

Recall: It refers to the proportion of actual positive observations that the system correctly predicts. It is calculated using Eq. (7):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{7}$$

Precision: It specifies the proportion of system-generated positive predictions that are correct, compared to the total number of predicted positive observations. It is calculated using Eq. (8):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

F1-Score: The F1-score is the weighted harmonic mean of precision and recall. Hence, the F1-score takes both false positives (FP) and false negatives (FN) into account to convey the balance between recall and precision. It is calculated using Eq. (9):

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
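Equations (6)–(9) follow directly from the four confusion-matrix counts. A minimal sketch on made-up labels (scikit-learn's `classification_report` computes the same quantities):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # illustrative ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # illustrative model outputs
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (6)
recall = tp / (tp + fn)                              # Eq. (7)
precision = tp / (tp + fp)                           # Eq. (8)
f1 = 2 * precision * recall / (precision + recall)   # Eq. (9)
print(accuracy, recall, precision, f1)  # → 0.75 0.75 0.75 0.75
```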

Result and analysis
This section discusses the results of the experimental analysis and the detection of cervical cancer disease. Various classification algorithms were used for the analysis, and the results showed that the outcome depended on the number of suitable features selected from the dataset.

Experimental setup
The proposed method is evaluated on a system with 8 GB of RAM and a 3.0 GHz Intel Core i7 processor. The cloud-based web application environment Google Colab was used to build the model and apply the classification algorithms for detecting cervical cancer in the dataset. Google Colab provides a free Jupyter notebook environment without setup requirements; moreover, it ensures a collaborative working environment and the frequently required machine learning libraries [35]. Various suitable features were chosen from the 36 attributes in the dataset using the combination of feature selection techniques, and the target attribute, named Biopsy, was used to classify each record as a healthy or cancerous result for cervical cancer diagnostics.

Result evaluation
The results and accuracies achieved by the various classification algorithms are shown here. The classification algorithms' results and performance are summarized in the tables and figures listed. The Chi-square, SelectKBest and Random Forest algorithms are used to select the various top features with the highest scores for detecting cervical cancer disease, as presented in Table 3.

Analysis of model performance on various top features
On the top 20 features, MLP and RF each achieve the best accuracy, over 97.00%. The MLP classifier has the highest precision and recall, while GBC and AdaBoost report the lowest accuracy. For these top 20 features, MLP performs better than the other classifiers; the highest performance comparison of the different classification algorithms on the top 20 features is shown in Table 4. On the top 25 features, MLP claims the best accuracy of 98.00%, and the classifiers giving the best precision are MLP and SVC, at 98.00%. The lowest-accuracy classifiers are again GBC and AdaBoost. For these top 25 features, MLP performs better than the other classifiers. The highest performance comparison of the different classification algorithms on the top 25 features is shown in Fig. 7, and their performance based on the various dataset splitting ratios is shown in Table 5. The corresponding comparison on the top 30 features is shown in Fig. 8, with the splitting-ratio results in Table 6 and a further comparison in Fig. 11. The highest accuracy comparison of the classification algorithms across the various top feature sets is shown in Fig. 12.

Conclusion
Nowadays, cervical cancer is a widespread disease, and screening entails lengthy clinical tests. The aim of this research was to develop a model that can accurately diagnose cervical cancer and analyze its risk factors using classification algorithms such as Multilayer Perceptron, Random Forest, k-Nearest Neighbor, Decision Tree, Logistic Regression, SVC, Gradient Boosting and AdaBoost on various top feature sets (10, 15, 20, 25 and 30). After comparing our findings with those of many previous studies, we found that our models were more effective at diagnosing cervical cancer based on certain evaluation criteria.
Acknowledgements The authors are thankful to those who have participated in this research work.

Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.