1 Introduction

Lung cancer is considered one of the deadliest diseases and a primary cause of high mortality in the present world. It affects human beings to a great extent and, according to predictions, now ranks 7th in the mortality index, causing 1.5% of total deaths worldwide [2]. Lung cancer originates in the lungs and can spread to other organs such as the brain. It is categorized into two major groups: non-small cell lung cancer and small cell lung cancer. Some of the symptoms associated with patients are severe chest pain, dry cough, breathlessness, weight loss, etc. Looking into the development of the cancer and its causes, doctors place particular stress on smoking and second-hand smoke as the primary causes of lung cancer. Treatment of lung cancer involves surgery, chemotherapy, radiation therapy, immunotherapy, etc. In spite of this, the lung cancer diagnosis process is very weak, because doctors are often able to detect the disease only at an advanced stage [18]. Therefore, early prediction before the final stage is highly important so that mortality can be effectively controlled. Even after proper medication and diagnosis, the survival rate of lung cancer remains poor, and it differs from person to person, depending on age, sex, race, and general health condition. Machine learning nowadays plays a crucial role in the detection and prediction of medical diseases at early stages, helping to save human lives. Machine learning makes the diagnosis process easier and more deterministic, and it has already made strong inroads into the medical field: every country is now adopting machine learning techniques in its health-care sector. With the application of machine learning, more accurate detection of diseases becomes possible. Some of the crucial applications of machine learning are described below.

Feature extraction: In any disease, the attributes are the real information carriers. Machine learning (ML) eases data analysis, processes the real attributes, and finds the actual drivers of the disease, helping medical experts to identify its root cause. Image processing: Using various machine learning processes, image analysis has become accurate and valuable. This helps the concerned doctors reach a better diagnosis, saving money and time while increasing value. Drug manufacturing: With the growth of various diseases, drugs must be multi-functional and produced in known quantities; ML helps the drug industry address this in manufacturing. Better prediction of diseases: ML helps to predict the severity of a disease and its outcome, and can help control disease outbreaks through early prediction so that appropriate measures can be taken. Still, machine learning applications need further refinement to become more standardized and reliable. Further improvement of machine learning algorithms would help physicians and health catalysts make accurate clinical decisions with high efficiency and good accuracy.

Machine learning enables a system to find the solution to a problem through its own learning strategies. ML is classified into three categories: unsupervised learning, supervised learning, and reinforcement learning. Supervised learning comprises two processes under its umbrella: classification and regression. Classification is the process in which input data are processed and categorized into certain groups. The proposed work was carried out in the Weka tool. Algorithms such as J48, KNN, Naive Bayes, and RBF were used in Weka, and a comparative analysis was finally derived.

2 Related Work

Z. Zubi et al. (2014) extracted features from chest X-ray images and used a back-propagation neural network to improve accuracy [31]. Rashmee Kohad et al. (2015) used ant colony optimization with ANN and SVM, achieving accuracies of 98% and 93.2% respectively on 250 lung cancer CT images [16]. Kourou et al. (2015) presented a review of various machine learning approaches on several cancer datasets and concluded that integrating feature selection with a classifier provides promising results in cancer data analysis [17]. Hosseinzadeh et al. (2013) proposed an SVM model on selected protein attributes and reported 88% accuracy, better than other classifier techniques, for prediction of lung cancer tumors [11]. Naveen and Pradeep (2018) showed that among SVM, Naive Bayes, and C4.5 classifiers, C4.5 performs best on the North Central Cancer Treatment Group (NCCTG) lung cancer data, and also predicted that C4.5 improves further as the lung cancer training data grows [25]. Gur Amrit Pal Singh and P. K. Gupta (2018) proposed a new feature-extraction algorithm for image data and applied machine learning classifiers to improve accuracy [29]. Hussein et al. (2019) proposed supervised learning using a 3D convolutional neural network (3D CNN) on a lung nodule dataset, as well as an unsupervised SVM approach, to classify benign and malignant data with an accuracy of 91% [12]. Monkam et al. (2019) surveyed the importance of convolutional neural networks for predicting lung nodules, with accuracies mostly greater than 90% [21]. Asuntha and Andy Srinivasan (2019) proposed fuzzy particle swarm optimization with a deep neural network on lung cancer images, achieving an accuracy of 99.2% [5]. Ganggayah et al. (2019) used various classifiers on breast cancer data comprising 8066 records with 23 predictors and concluded that the random forest classifier gives the best accuracy, 82% [9]. Gibbons et al. (2019) used supervised learning methods such as linear regression, support vector machines, and ANN, and found that SVM gives a better accuracy of 96% compared to the other methods [28]. Shakeel et al. (2019) used a feature selection process and a novel hybrid ANN approach on lung cancer data from the ELVIRA biomedical dataset, achieving an accuracy of 99.6% [26]. Bhuvaneswari et al. (2015) used a Gabor filter for feature extraction and a G-KNN approach to classify lung cancer images with an accuracy of 90% [7]. Xin Li, Bin Hu, and Hui Li (2019) used a 3D dense sharp network and IBM SPSS 25.0 statistical analysis software on 53 patients, obtaining an accuracy of 88% in distinguishing malignant from benign cases [19]. Shanti and Raj Kumar (2020) used a wrapper feature selection method with a stochastic diffusion search algorithm on lung cancer images and concluded that it is one of the best-performing algorithms for classification [27]. Rezaei Hachesu P. et al. (2017) proposed a different approach for survival-rate analysis that finds correlations between various attributes and survival rate, carried out on 470 records with 17 features [10]. Kadir et al. (2018) provided an overview of various deep learning strategies used for accuracy prediction on lung cancer CT images [15]. Paing et al. (2019) used a computer-aided diagnosis process in which segmentation, detection, and staging phases are followed for classification of CT lung cancer images with greater accuracy [23].

3 Dataset Description

The dataset is available in the UCI machine learning repository. It consists of 32 instances with 57 features (1 class attribute and 56 input attributes); all predictive attributes are nominal, ranging between 0–3, while the class attribute has 3 levels [1]. The nominal attributes and class labels are converted into binary form so that the data analysis process becomes easier; nominal-to-binary conversion is a standard preprocessing step for data analysis. The dataset contains some missing values, which degrade algorithm performance, so careful preprocessing is required before analysis. The class label is described as high, medium, or low; in this paper we encode high as 2, medium as 1, and low as 0.
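As a minimal sketch of this preprocessing (the helper names below are ours, not part of the original tooling), the label encoding and mode imputation described above might look like:

```python
# Illustrative sketch (assumed helper names): encode the class labels
# low/medium/high as 0/1/2 and replace missing nominal values with the
# most frequent value of the column, as described in the text.
from collections import Counter

LABEL_MAP = {"low": 0, "medium": 1, "high": 2}

def impute_mode(column):
    """Replace missing entries (None) with the column's most frequent value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

labels = ["high", "low", None, "high", "medium"]
encoded = [LABEL_MAP[v] for v in impute_mode(labels)]  # -> [2, 0, 2, 2, 1]
```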

4 Classification Techniques

Classification comes under the supervised learning process, in which given input data are assigned to a certain class label; it relies on mapping an input function to a certain output level. Various learning classifiers include the perceptron, Naïve Bayes, decision trees, logistic regression, K-nearest neighbour, artificial neural networks, and support vector machines. Classification is one of the principal decision-making techniques used for data analysis in machine learning, and various classifier techniques are used to classify data samples [20, 22]. This paper focuses on applying machine learning to the analysis of a lung cancer dataset to achieve good accuracy. Some of the most widely used classifier techniques are described below.

4.1 Neural Network

Neural networks are a basic building block of the machine learning approach, in which the learning process is carried out between neurons. An artificial neural network (ANN) comprises an input layer, an intermediate layer of hidden neurons, and an output layer. Every input neuron is connected to the hidden neurons through appropriate weights, and similarly weights connect the hidden units to the output units. The neurons in the hidden and output layers are processed with a known threshold (activation) function, chosen according to the requirements. During classification, the synaptic weights are multiplied with the corresponding neurons in the hidden and output layers. The desired target is reached through weight adjustment, either in a feed-forward or a feedback approach. Feed-forward networks are the simpler choice for classification.
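A single forward pass of such a feed-forward network can be sketched as follows; this is an illustrative toy example with assumed weights, not the paper's architecture:

```python
# Toy sketch of one forward pass through a feed-forward network with a
# single hidden layer; the weights here are assumed values for illustration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # each hidden neuron: activation of the weighted sum of its inputs
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    # output neuron: activation of the weighted sum of hidden activations
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

y = forward([1.0, 0.5], w_hidden=[[0.2, -0.4], [0.7, 0.1]], w_out=[0.5, -0.3])
```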

4.2 Radial Basis Function Network

A radial basis function network is a neural network that uses radial basis functions as its threshold functions. RBF networks have the advantages of ease of design and strong tolerance to input noise. The radial basis function network has a feed-forward architecture comprising one middle (hidden) layer between the input and output layers. It uses a series of basis functions, each centred at a sampling point. Formally, for a given input x, the network output can be written as follows (Fig. 1).

Fig. 1.

RBF.

Where

$$ y = \sum\nolimits_{i = 1}^{N} {w_{i} R_{i} (x)} + w_{0} $$
(1)

wi: weight, w0: bias term, Ri: basis (activation) function

$$ R_{i} \left( x \right) = {\varphi }\left[ {\left\| {x - c_{i} } \right\|} \right] $$
(2)

φ: radial function, ci: RBF centre

In the RBF architecture, the weights connecting the input units to the middle layer represent the centres of the corresponding neurons, whereas the weights connecting the middle layer to the output layer are used to train the network.
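Assuming the common Gaussian choice for the radial function φ, the network output of Eq. (1) can be sketched as below; the centres, weights, and width γ are illustrative assumptions, not fitted values:

```python
# Sketch of the RBF network output of Eq. (1) under an assumed Gaussian
# radial function; centres, weights and gamma are illustrative values.
import math

def rbf_output(x, centres, weights, bias, gamma=1.0):
    # y = sum_i w_i * phi(||x - c_i||) + w_0, with phi(r) = exp(-gamma * r^2)
    total = bias
    for c, w in zip(centres, weights):
        dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        total += w * math.exp(-gamma * dist_sq)
    return total

y = rbf_output([0.0, 0.0], centres=[[0.0, 0.0], [1.0, 1.0]],
               weights=[1.0, 0.5], bias=0.1)
```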

4.3 Support Vector Classifier

One of the simplest and most useful approaches in supervised learning is support vector classification. The support vector classifier (SVC) is usually preferred for data analysis because of its computational capability within a very short time frame. This classifier works on the concept of a decision boundary known as a hyperplane, which is used to classify the input data into the required target groups. To fit the decision boundary, the plane that maximizes the margin distance from the data points is chosen for classification. A user-defined support vector classifier can be built using various kernel functions to improve accuracy. The support vector classifier is well suited for both structured and unstructured data, and it is comparatively resistant to the overfitting problem, which makes it more reliable.
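The hyperplane decision rule can be illustrated with a minimal sketch; the weights and bias below are assumed toy values, whereas a real SVC would learn them from data by maximizing the margin:

```python
# Minimal illustration of the hyperplane decision rule: the sign of
# w.x + b decides the class. Weights and bias are assumed toy values.
def svc_predict(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

pred = svc_predict([2.0, 1.0], w=[1.0, -1.0], b=-0.5)  # score = 0.5 -> class 1
```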

4.4 Logistic Regression Classifier

The logistic regression classifier is borrowed from statistics. This classifier is based on the probability of an outcome given the processed input data. Binary logistic regression is generally preferred in machine learning for dealing with binary target variables. To assign the class to a specific category, the sigmoid function is utilized. Advantages of the logistic regression classifier:

  • Logistic regression classifier is very flexible to implement

  • Suitable for binary classification

  • Based on a probabilistic model
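A minimal sketch of the sigmoid step that maps a linear score to a probability; the weights below are assumed, not fitted values:

```python
# Sketch of the sigmoid mapping used by logistic regression; the weights
# here are assumed toy values rather than estimates from training data.
import math

def predict_proba(x, weights, bias):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability of the positive class

p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)  # z = 0 -> p = 0.5
```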

4.5 Random Forest Classifier

A random forest classifier is a combination of classifier trees: input variables are represented in the form of trees that together make up a forest-like structure. The input data are passed through the trees, and each tree outputs a class label. The performance of a random forest depends on its error rate, which is governed by two factors: the correlation between trees and the strength of each individual tree. Advantages of random forest:

  • A proper method for representing noisy and imbalanced data.

  • Data can be represented without any data reduction.

  • Best approach for analysis of large data set.

  • Finest approach in machine learning platform for improvement of accuracy.

  • It handles the overfitting problem that affects many other machine learning algorithms.

  • One of the most reliable algorithms.
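The majority-vote idea behind random forests can be sketched as follows; the "trees" here are stand-in functions rather than trained decision trees:

```python
# Sketch of random-forest voting: each tree votes a class label and the
# forest returns the majority. The "trees" are stand-in functions here.
from collections import Counter

def forest_predict(x, trees):
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

trees = [lambda x: 1, lambda x: 0, lambda x: 1]
pred = forest_predict([0.3], trees)  # two votes for class 1 -> 1
```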

4.6 J48 Classifier

J48 is the Weka implementation, written in Java, of the C4.5 algorithm. A decision tree uses a tree structure to find the solution to the problem: class labels are represented by leaf nodes, whereas attributes are tested at the internal nodes of the tree. In a decision tree, attribute selection is done using information gain (and related measures such as the gain ratio). Based on the information gain of each attribute, the decision tree classifier performs the classification. The information gain for a particular attribute X at a node is calculated as

$$ {\text{Information Gain}}\,({\text{N}},{\text{X}}) = {\text{Entropy}}\,({\text{N}}) - \mathop \sum \limits_{values\, of\, X} \frac{{\left| {N_{i} } \right|}}{\left| N \right|}\,Entropy(N_{i}) $$
(3)

Where N is the set of instances at that particular node, $N_i$ is the subset of N taking the i-th value of X, and

$$ \left| N \right|:\,cardinality $$

The entropy of N, where $p_i$ is the proportion of instances of class i in N, is found as:

$$ Entropy \,(N) = \mathop \sum \limits_{i} - p_{i} \,log_{2} \,p_{i} $$
(4)
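The entropy of Eq. (4) can be computed directly from the class labels at a node; a small sketch:

```python
# Direct computation of the entropy of Eq. (4) from the class labels at a
# node, using the class proportions p_i.
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = entropy([0, 0, 1, 1])  # two balanced classes -> 1 bit
```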

4.7 Naïve Bayes Classifier

The Naive Bayes classifier is a probabilistic classifier with a strong independence assumption between features. It is based on Bayes' theorem, and the classifier uses a Bayesian network model with the maximum a posteriori decision rule in a Bayesian setting. The features used in Naive Bayes are always assumed independent of each other. If y is the class variable and x is the dependent feature vector, then

$$ y\, = \,argmax_{y} \,p(y)\,\mathop \prod \limits_{i = 1}^{n} \,p\left( {x_{i} \,|\,y} \right) $$
(5)

p(y) is called the class prior probability and

$$ p\left( {x_{i} \,|\,y} \right) $$
(6)

is the conditional probability. Bayes' theorem gives

$$ Posterior\, = \,\frac{Prior\,*\,Likelihood}{Evidence} $$
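The maximum a posteriori rule of Eq. (5) can be sketched with assumed toy probability tables; a real implementation would estimate these from training counts:

```python
# Sketch of the MAP rule of Eq. (5): choose the class y maximizing
# p(y) * prod_i p(x_i | y). The probability tables are assumed toy values.
def nb_predict(x, priors, cond):
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, xi in enumerate(x):
            score *= cond[y][i].get(xi, 1e-9)  # tiny value for unseen features
        if score > best_score:
            best_class, best_score = y, score
    return best_class

priors = {0: 0.5, 1: 0.5}
cond = {0: [{"a": 0.9, "b": 0.1}],   # p(x_0 | y=0)
        1: [{"a": 0.2, "b": 0.8}]}   # p(x_0 | y=1)
pred = nb_predict(["a"], priors, cond)  # 0.45 vs 0.10 -> class 0
```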

4.8 KNN Classifier

The KNN classifier comes under the lazy learning process, in which training and testing can be realized on the same data or as per the programmer's choice. In this process, a data point of interest is assigned the majority class label among its k nearest neighbours, where k is an integer and nearness is determined by a distance calculation. The choice of k depends on the data: a larger value of k reduces the effect of noise on classification. Parameter selection is likewise a prominent technique for improving classification accuracy. Weighted KNN classifier: a mechanism in which a suitable weight is assigned to each neighbour's vote so that nearer neighbours contribute more than distant ones. In the weighted KNN approach, the weight plays a significant role in evaluating the nearest neighbours; generally the weight is taken as the reciprocal of the distance, and the weight is multiplied with the corresponding vote to obtain the required value.

Pseudo code for KNN

  • Take the input data

  • Consider an initial value of k

  • Divide the data into training and test sets

  • To achieve the required target, iterate over all training data points

  • Find the distance between the test data and each row of training data (Euclidean distance is a common choice)

  • Arrange the calculated distances in ascending order

  • Consider the top k entries from the sorted list

  • Find the majority class label among them

  • Obtain the target class
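The steps above can be sketched directly in Python (the function name is ours; Euclidean distance and majority vote as listed):

```python
# KNN sketch following the pseudo code: Euclidean distance to every
# training row, sort ascending, take the top k, return the majority label.
import math
from collections import Counter

def knn_predict(test_point, train_data, train_labels, k):
    distances = [(math.dist(test_point, row), label)
                 for row, label in zip(train_data, train_labels)]
    distances.sort(key=lambda pair: pair[0])       # ascending by distance
    top_k = [label for _, label in distances[:k]]  # top k neighbours
    return Counter(top_k).most_common(1)[0][0]     # majority class label

pred = knn_predict([0.0, 0.0],
                   [[0.1, 0.1], [0.2, 0.0], [5.0, 5.0]],
                   [1, 1, 0], k=3)  # neighbours vote 1, 1, 0 -> class 1
```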

5 Proposed Model

The data analysis process was carried out using both the Weka tool, version 3.6, and the Jupyter platform in Python [13, 24]. Weka is an open-source tool used for classification, clustering, regression, and data visualization. Weka generally supports input files in either .csv or .arff format. The Weka Explorer has various tabs for data analysis: preprocess, classify, cluster, associate, select attributes, and visualize. Selecting the preprocess tab enables uploading the input data into Weka [3, 30], which then represents the data clearly for ease of analysis. Before running any classification algorithm, Weka offers various evaluation options, such as a percentage split, use of the training set, a supplied test set, and cross-validation. Classification is most often performed with an 80% training and 20% testing split [6], but in this work the analysis was carried out with 10-fold cross-validation using the selected classifier techniques to obtain the output of interest [8]. Weka is a user-friendly visualization tool with which we have tested various classifier techniques and their output performance.
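For illustration, the 10-fold split that Weka performs internally can be sketched by hand; this is our own simplified partition, not Weka's implementation (which also shuffles and can stratify):

```python
# Hand-rolled sketch of a 10-fold split: each fold serves once as the
# test set while the other nine folds train the classifier.
def cross_validation_folds(n_samples, n_folds=10):
    indices = list(range(n_samples))
    folds = []
    for f in range(n_folds):
        test = indices[f::n_folds]                    # every n_folds-th index
        train = [i for i in indices if i not in test]
        folds.append((train, test))
    return folds

folds = cross_validation_folds(32, n_folds=10)  # 32 instances, as in the dataset
```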

6 Result Analysis

The input data contain missing values, so the data must first be preprocessed by replacing each missing value with the most frequent value of the corresponding column. The processed data are then loaded into the Weka data mining tool for analysis and converted into a form suitable for classification by the different classifier approaches. Each classifier is executed with the 10-fold cross-validation method; cross-validation is a powerful data analysis process in which the available data are divided into 10 folds so that an accurate decision can be made on the provided data with good prediction. Using the classify tab of Weka, the different classifier approaches are verified, and after careful analysis the results of the proposed classifiers are compared. The J48 and Naive Bayes algorithms classify the 32 instances into 25 correctly classified instances and 7 incorrectly classified instances. Likewise, 24 correctly classified and 8 incorrectly classified instances are obtained from the 32 instances using KNN with 5 nearest neighbours. As per our analysis, the RBF classifier is preferred among the various classifiers because of its highest classification accuracy, obtained from 26 correctly classified and 6 incorrectly classified instances out of 32; its false positives and false negatives number 3 each. The output results of the various classifiers used in the Weka tool on the lung cancer data are presented in the table below. In a confusion matrix, accuracy, recall, precision, and F-measure are the key parameters for assessing classification [4, 14]. Classification accuracy is the proportion of correct predictions out of the total number of predictions. These parameters depend on specific outcomes: TP (true positive) counts correctly predicted event values and TN (true negative) correctly predicted no-event values, while FP (false positive) counts incorrectly predicted event values and FN (false negative) incorrectly predicted no-event values. The relationships are derived as below (Figs. 2, 3) and (Table 1).

Fig. 2.

Process flow of various classifiers in Weka tool.

Fig. 3.

Accuracy graph.

Table 1. Classifiers output in Weka tool.
$$ Accuracy = \frac{TP\, + \,TN}{TP\, + \,TN\, + \,FP\, + \,FN} $$
(7)
$$ Recall = \frac{TP}{TP\, + \,FN} $$
(8)
$$ Precision = \frac{TP}{TP\, + \,FP} $$
(9)
$$ F\_Measure = \frac{2\,*\,Recall\,*\,Precision}{Recall\, + \,Precision} $$
(10)
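Eqs. (7)–(10) can be computed directly from the confusion-matrix counts. Note that the split of the 26 correct RBF instances into TP = 20 and TN = 6 below is an assumed illustration consistent with the reported FP = FN = 3, not a figure taken from Table 1:

```python
# Eqs. (7)-(10) computed from confusion-matrix counts. TP = 20 / TN = 6
# is an assumed split of the 26 correct RBF instances, for illustration only.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, f_measure

acc, rec, prec, f1 = classification_metrics(tp=20, tn=6, fp=3, fn=3)
# acc = 26/32 = 0.8125, matching the 81.25% reported for RBF
```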

7 Conclusion

In this paper we have shown that with the RBF classifier the accuracy on the lung cancer data is 81.25%. From this analysis it can be expected that, with a suitable feature selection method, integration with other supervised learning processes, and a modified functional approach in RBF, the accuracy can be further improved.