Highlights of the research

  • The Research done on the Dataset of most common Gynecological cancer.

  • Analysis of data done by Machine Learning approach.

  • Outcome of the research indicates the significant factors and significance level.

  • An algorithm has been designed based on the results of the analysis and weightage.

  • Algorithm based App outcomes the risk prediction level for gynecological cancer.

Introduction

Neurodegenerative Disease and post awful neurodegenerative issue are considered as a genuine factor for some major illnesses. It is seen that individuals suffering from neurodegenerative disease disorders have a 55% chance to be affected with cervical cancer [1, 2]. Women who experienced at least 6 side effects of post-traumatic neurodegenerative disease disorder had a more crucial danger of being affected with ovarian cancer [3]. According to WHO, the second leading disease is cancer, it causes 9.6 million death in 2018 [4]. An uncontrolled increase of irregular cells exceeding their regular territory with the ability to attack or potentially spread to different organs is cancer. Among different types of cancer cervical and ovarian cancer is the most prominent hazard to female’s wellbeing [5]. Due to cervical and ovarian cancer every year, over and above 3,00,000 women dies further half a million were diagnosed. Around 5,00,000 women were affected with cervical cancer every year and 274 000 die due to cervical cancer [6].

An assessment [7] expected to perceive risk factors for Lower Limb Lymphedema (LLL) which is a persistent and weakening condition troubling patients who go through lymphadenectomy for gynecologic cancer and to develop a foreseeing model for its happening. To encourage future researches in the field of gynecologic cancer, a model was introduced in a study [8] to predict the risk for psychologic and conduct morbidity. Neurodegenerative disease is linked with diverse neurodegenerative issue, specifically, the pathophysiological importance of pressure in Alzheimer's infection and several diseases. Some previous studies also showed that neurodegenerative disease spurs on cervical and ovarian cancer [9, 10]. National Cancer Institute summarizes that cervical tumor constitutes in the cervix, a body part associating with uterus and vagina [11, 12] Human Papillomavirus (HPV) is the main reason behind cervical cancer [13] For the well-known impact of HPV on cervical cancer, an examination [14] was made which audits the articles based on the information on HPV and cervical cancer among Malaysian inhabitants before and after the usage of HPV antibody programs. In [15] a research has been conveyed a study where he found cervical slowly develops without showing any indication at the beginning seemingly hard to discover but can be noticed with respective pap tests. A study was made with 70 patients in china with insomnia provoked by cervical cancer [16]. This irregularly checked preliminarily selected patients with sleep deprivation that arises or exacerbated by cervical disease [17]. A survey was carried out in [16] using the data of Nurses' Health Study found a substantial relationship between treatment for PTSD (Post Traumatic Stress Disorder) and the growth of ovarian cancer. According to American Cancer Institute, ovarian cancer is supposed to start in the ovaries but recent knowledge exhibits that numerous ovarian tumors may begin in the fallopian tubes, which holds two ovaries after the body of the uterus. Likely cervical, ovarian cancer is hard to recognize [18]. The ovaries lie profound inside the abdominopelvic pit, making them hard to view or feel [19, 20]. The epithelial ovarian disease stays an exceptionally dangerous (Hunn & Rodriguez, 2012). A feasible study [20] offers risk management options of screening and prevention of ovarian cancer. It has been proved, due to the amendment of the p53 gene, cells affected due to neurodegenerative diseases persuades ovarian epithelial cancer [21]. A study [22] presents data which shows, emission of incendiary proteins in ovarian cancer cell were prompted by stress hormones.

According to [23], Cancer is the main source of death. Recent studies [15] suggested that the event of lung cancer has expanded quickly and turned the most widely recognized disease worldwide. A full research was made in [24] to build up a framework that can be utilized by an individual to test his risk level of Lung Cancer. And utilizing the acquired knowledge an experiment was able to predict the risk level of lung cancer [25]. Another research was carried out to build up a system that can be utilized by an individual for knowing his risk level of skin cancer [26]. Presently Type-1 Diabetes is also a shocking sickness in Bangladesh. Type 1 diabetes, which is known as adolescent diabetes or insulin-subordinate diabetes, is an interminable condition where the pancreas delivers mostly zero insulin. Information has been gathered from Dhaka dependent on a particular questionnaire to show the association and criticalness among the degree of elements [27].

In this paper “Introduction” section, represents the risk prediction models and their techniques behind prediction were discussed elaborately. We conduct experiments on three datasets in “Material and methods” section and it was conducted with the help of knowledge discovery. Their efficiency in prediction is shown with a set of figures and tables. This section also contains the output of our research which is the mobile application that we have prepared for risk prediction. For preparing this application at first, we’ve made an equation to differentiate the risk level and prepared an algorithm. The algorithm is provided in “Material and methods” section. Finally, this work is concluded in “Results and discussion” section and future work is proposed afterward.

Material and methods

This paper uses popular data mining and machine learning models were compared with metrics such as accuracy, precision, recall, F1, support [28] This was found using a sklearn library [29] of python and orange machine learning and data mining toolkit. We’ve further proposed an equation based on the difference between the result-metrics of these two toolkits. Using apriori algorithm correlation among the significant factors which describe the dependency among the factors were described. Ranker algorithm is the most efficient algorithm can be used to rank the features for indexing than BestFirst or, GreedyStepwise. Feature selection was performed using the ranker algorithm. Key factors on the data analysis were derived for all the evaluators of the ranker algorithm [30]. Afterward, it was compared among them and the worthiest attribute was obtained. For prediction, it is important to find out the significant factors. Here, the importance of factors has been gathered according to info gain, gain ratio, gini index [31, 32].

Data collection

In total 866 data were collected from various diagnosis centers of patients suffering from diseases like ovarian, cervical, and stress (Mental) disorder. Data of 161 female patients were collected from those who were experiencing cervical cancer. Data were collected from a set of questionnaires which includes 25 attributes. 522 patients of ovarian cancer were interviewed with set questions which contain 46 risk factors. The set of Questioner for Cervical Cancer and Ovarian Cancer has been provided as (Additional File 1 and Additional File 2) Respectively.

Data preprocessing

Data cleaning, data integration, data selection, data transformation are four leading tasks of data pre-processing to convert the dataset from noisy, inconsistent data to a format suitable for mining and learning predictability. The corrupted or distorted data with meaningless information those are provided by the patients while answering the questionnaire are noisy data. In the data cleaning phase those noisy, conflicting and inconsistent data were removed. Valuable data were joined in the data integration phase. Data suited for the analysis were retrieved from the dataset in the data selection phase. Finally, in data transformation, data were converted to proper structures which fits for data mining and machine learning analysis. To ignore the collision of the data a small amount of data was altered.

Evaluation of the performance of machine learning models

In this lesson, eight classifiers are known as SVM, Random Forest, Logistic Regression, AdaBoost, Naïve Bayes, Neural Network, kNN, CN2 rule Inducer were used for evaluation with orange and almost 10 classifiers namely SVM, Random Forest, Logistic Regression, AdaBoost, Naïve Bayes, Neural Network, kNN, Gaussian Process, Decision Tree, Quadratic Classifier were used for assessment of machine learning models. In this context, the performance was measured using standard metrics like the area under the ROC curve (AUC), precision, classification accuracy, recall, specificity, F measure, support. A decision tree was constructed with the important factors of cervical, ovarian, and stress datasets.

Performance measures: Classification accuracy rates for the datasets were analyzed. For each dataset, two classes were identified namely positive and negative. There are four possibilities for a single prediction e.g., true positive, true negative, false positive, false negative. True positive and true negatives are described as how many correct predictions were made. False-positive and false-negative provides how many incorrect predictions were made of positive and negative classes when they belong to positive and negative classes.

Accuracy: It defines the number of correct predictions that were correctly classified from the total number of predictions in ratio Eq. (1)

$$accuracy = \frac{TP + TN}{{TP + TN + FP + FN}}$$
(1)

Precision: Explains the number of positive predictions that were correctly classified by the classifier from the total number of positive predictions Eq. (2)

$$precision = \frac{TP}{{TP + FP}}$$
(2)

Recall: Characterizes the fraction of correct positive predictions of the whole equation Eq. (3)

$$recall = \frac{TP}{{TP + FN}}$$
(3)

F-measure: It predicts the average value of precision and recalls Eq. (4)

$$F - measure = \frac{{2 \times \left( {precision \times recall} \right)}}{precision + recall}$$
(4)

Specificity: Measures the number of whole negative prediction those are correctly identified by the classifier Eq. (5)

$$Specificity = \frac{TN}{{TN + FP}}$$
(5)

Support: Defines the number of datasets that were analyzed after training and splitting the whole dataset. From my analysis, it is seen that sklearn uses only 24% of the whole dataset that was used Eq. (6)

$$Support = 0.24 \times no.\,of\,data \in the\,dataset$$
(6)

The first and foremost concept to judge the probability is to find significant factors through various analyses. The study was undertaken with lots of derivatives and algorithms to find out the significant factors. The level of important factors was acquired using information gain, gain ratio, gini index. Information Gain: It is a measurement of the decrease in uncertainty. It is estimated from entropy. Entropy is the measurement of the probability of changeability of the processed information. The higher the entropy, the harder it is to make any determinations from that data Eq. (7)

$$\begin{aligned} entropy\left( {p_{1} ,p_{2} \ldots p_{n} } \right) & = - p_{1} \log p_{1} - p_{2} \log p_{2} \ldots \ldots - p_{n} \log p_{n} \\ E\left( x \right) & = \mathop \sum \limits_{i = 1}^{c} - p_{i} \log p_{i} \\ E\left( {T,X} \right) & = \mathop \sum \limits_{c \in X} P\left( c \right)E\left( c \right) \\ \end{aligned}$$
(7)

Summation of the feature probability of values times the log probability of the same label. By deducting the value of label and features from the entropy of label information gain is obtained Eq. (8)

$$Gain\left( {T,X} \right) = E\left( T \right) - E\left( {T,X} \right)$$
(8)

Gain Ratio: Modifies information gain by taking ignored information into account including the number and sizes of the branches that reduces the bias of information gain Eqs. (9, 10)

$$SplitEntropy\left( {T,X} \right) = \mathop \sum \limits_{c \in X} - \frac{{T_{x} }}{T}\log \frac{{T_{x} }}{T}$$
(9)
$$GainRatio = \frac{{Gain\left( {S,A} \right)}}{{SplitEntropy\left( {S,A} \right)}}$$
(10)

Gini Index: It measures the impurity of a single feature. It is obtained by subtracting the sum of squared probabilities from one Eqs. (11, 12)

$$Gini\left( X \right) = 1 - \mathop \sum \limits_{i = 1}^{c} \left( {p_{i} } \right)^{2}$$
(11)
$$\Pr ediction\, difference = (\sum highest - \sum lowest) \div 4$$
(12)

From information gain, we acquired the certainty of individual features for a specific label. Gain ratio provides us the same including the intrinsic information of the dataset. Gini index provides how much filthy an individual factor is. All of these values are gathered in terms of 0 and 1. Equation analyzing the chi-square test and results of feature selection evaluators we found out the most significant factors which are working behind cervical and ovarian cancer in connection with stress. Then these factors were given different scores based on their significance level. Afterward, the Eq. 12. was defined to separate the risk levels of an individual.

Results and discussion

The results and discussion section has been discussed and analyzed in this section. Some data mining and machine learning techniques have been applied. We have analyzed the datasets of cervical and ovarian and found common pattern significant attributes. The attributes were selected as common and highly significant factors and correlated with the possibilities of Cervical or Ovarian cancer. Table 1 shows the values of Info Gain, Gini Index, and the Gain ratio of the parameters. Table 1 also shows the Chi-Square test values along with ranking values. Chi-square values were acquired from statistical analysis and ranked values are found from the Data Mining algorithm. A different analysis of the parameters was conducted with different attribute evaluator and has been shown in Table 2.

Table 1 Attribute values for info gain, gain ratio, Gini index, Chi square and ranker value
Table 2 Attribute values for different algorithm ranking with various sub-evaluator

The serial position of the attributes in different attributes refers to the position of the attribute in the ranking table of corresponding sub evaluators.

Figure 1 indicated the logistic regression analysis values of the actual and predicted data. In x-axis total of 25 values represents 25 separate parameters and the serial of the parameters are as same as the serial of Table 2. The 24th parameter is related to the mental health or stress of women. The following figure depicts that the higher no of affected women has been shown by linear logistic regression. KDE plot of Fig. 2 tends to estimate the probability distribution functions of affected and non-affected women. The sub-parameters were assigned numeric values e.g., 1 represents above 60, 2 represents 46–60 etc. The affected plot varies from 1 to 1.45 means a higher number of affected women rely on age more than 60 and similarly, 2–0.7 points describe that the second most affected women had an age of 46–60. Violin plot visualizes the distribution of data and its probability density as shown in Fig. 3. Children of 3–5 or above 5 suffer from cervical or ovarian cancer according to the violin plot of Fig. 3. Violin of 0 level means having a higher number of children which had those cancers.

Fig. 1
figure 1

Linear regression probabilities for different parameters

Fig. 2
figure 2

KDE Plot for both affected and not affected people according to age distribution

Fig. 3
figure 3

Violin plot for ovarian cancer versus no of children

Tables 3, 4 and 5 shows the accuracy of the data for Mental Stress, Ovarian cancer and Cervical cancer according to different machine learning classifiers and also displays the classification accuracy, F1, precision metrics which can be used to compare the machine leaning models. The accuracy level is organized between 0 and 1. The prediction accuracy of the model increases with the value getting closer to 1. It also indicates the significance. All the significant factors and sub-factors of the diseases were first indexed with the help of ranker algorithm and then combined to get a whole picture which is later used for anticipation. The compared notable features with weightage values have been displayed in Table 6. Finally, an algorithm has been developed based on the weightage values of Table 6.

Table 3 Accuracy of stress according to different machine learning algorithms
Table 4 Accuracy of ovarian cancer according to different machine learning algorithms
Table 5 Accuracy of cervical cancer according to different machine learning algorithms
Table 6 Weightage value of parameters

After analyzing the significances of the factor of cervical, ovarian, and stress we have derived an algorithm for predicting the risk levels of the diseases which is shown below,

  • Step 1. Start

  • Step 2. read weights

  • Step 3. total_weights ← \(\sum weights\)

  • Step 4. prediction_difference ← 

  • Step 5. if total_weights <  = prediction_difference + \(\sum lowest\) then print LOW RISK

  • Step 6. else if total_weights <  = (prediction_difference*2) + \(\sum lowest\) then print MEDIUM RISK

  • Step 7. else if total_weights <  = (prediction_difference *3) + \(\sum lowest\) then print HIGH RISK

  • Step 8. else print VERY HIGH RISK

  • Step 9. Stop

The flowchart of the algorithm has been shown in Fig. 4. With the help of the above algorithm, we found out the respective flowcharts for the diseases. At last, we put all the flowcharts and significant factors together to elicit the superior significant factors. Afterward combining cervical, ovarian, and stress factors we have drawn the flowchart for all of them. From those flowcharts and using the algorithm that predicts from the significant factors we have prepared an application shown in Figs. 5 and 6. The application is well prepared with the intention to store the future data in the cloud provided by the users and use that information for further investigation. We got better results with the sample data after training and testing of the models which makes us confident to use the for predicting those diseases. After utilizing the upcoming data, we will be able predict the diseases much more accurately.

Fig. 4
figure 4

Flowchart of designed algorithm

Fig. 5
figure 5

Android application for cancer risk prediction (women health)

Fig. 6
figure 6

Android application results for cancer risk prediction (women health)

The combined decision trees of cervical and ovarian cancer were pointed in Fig. 4, which indicates there are maximum chances to be infected by cervical virus if a woman had more than 2 children. If she had 1–2 children and her first intercourse was made when she was less than 16 years old than there are also maximum possibility of cervical cancer. In here, a decision is made taking 6 precious factors for emerging cervical cancer. Likewise, taking 15 parameters a decision is made to find out the risk of appearing ovarian cancer. The most risk factors are abortion, age of husband, alcohol consumption, etc.

Conclusion

Cervical and ovarian cancers are the dominant causes of women’s demise in Bangladesh. The majority of the people are unconscious of it. Death is inescapable because of cervical and ovarian cancer. From the findings, we got the evidence that immune response may be damaged due to neurodegenerative disease may even enhance the development of cancer. In this scrutiny, risk factors of cervical and ovarian cancer were analyzed carefully. Here, data mining and machine learning models like SVM, Random Forest, Logistics Regression, AdaBoost, Naïve Bayes, Neural Network, kNN, CN2 rule, Decision Tree, Quadratic Classifier have been used and those models were compared with two different tools. The obtained result for neurodegenerative disease shows that AdaBoost performed the best with a classification accuracy of 78.8% in orange and 79% in Sklearn. In the case of cervical cancer, Logistics Regression provides the best score of 84.8% and with Sklearn we’ve got 79.3%. On the other hand, SVM shows the best accuracy of 88.3% in orange, and the decision tree provides 98.6% classification accuracy in Sklearn for ovarian cancer. Based on all the analyses finally, an algorithm along with a smart app was developed by the weightage values generated from the analysis. Future works can be done by improving the dataset size, tuning parameters, and more effective analysis.

Summary of the work

Cervical and ovarian cancer is the one of the most frightening disease among females in the low approaching nation like Bangladesh. The community of Bangladesh are lacking behind in education and awareness about these two cancers. Previous studies have found that stress is somehow influencing these two cancers. Any kind of prediction of cervical and ovarian cancer among Bangladeshi female’s is not available in this modern age. Purpose: To find out the association between factors and the most significant factors of stress, cervical, ovarian cancer. Contribute a prediction on befalling cervical and ovarian cancer based on their worthy factors as well as stress parameters. Methods: A case control study has been made on 298 patients of cervical and 522 patients of ovarian cancer. Cases of 197 and control of 100 were considered for cervical cancer. In case of ovarian cases of 267 and control of 254 beheld for data mining analysis. For analyzing performance of several machine learning models e.g. Logistics Regression, Random Forest, AdaBoost, Naïve Bayes, Neural Network, kNN, CN2 rule Inducer, Decision Tree, Quadratic Classifier were compared with their standard metrics. For certainty info gain, gain ratio, gini index were revealed of both cervical and ovarian cancer. Attributes were ranked using different feature selection evaluators. Then the most significant analysis was made with the significant factors. Factors like children, age of first intercourse, age of husband, pap test, age are significantly higher factors of cervical cancer. On the other hand, genital area infection, pregnancy problem, use of drugs, abortion, number of children important factors of ovarian cancer. The analysis was made with significant factors of stress, cervical and ovarian cancer that will help us to predict the risk of occurring cervical or ovarian cancer and may help to abate the cancer not only from Bangladesh but all over the world. After analyzing a weightage table has been created to make an algorithm which can predict risk level of two fatal cancer of Women (cervical and Overian) along with mental health.

Methods used in manuscripts

All methods were carried out in accordance with relevant guidelines and regulations.

Informed consent/waiver on informed consent

The need for written informed consent was waived by the Institutional Review Board/ethics committee of Department of Software Engineering, Daffodil International University. Because the Concern Hospitals dataset consisted of de-identified secondary data for research purposes. Verbal Consent from all the participants has been taken that no information will be disclosed and the dataset will only be used for research purpose.