1 Introduction

The liver plays a critical role in many functions of the human body, including red blood cell decomposition. It is the largest internal organ, located in the upper right of the abdomen. Any abnormality of the liver is referred to as liver disease, which encompasses inflammation (hepatitis B and C) from infectious or non-infectious causes (chemical or autoimmune hepatitis), tumors, malignant scarring of the liver (cirrhosis), and metabolic disorders. Liver diseases can be caused by a variety of factors, among them high cholesterol, autoimmune disorders, and long-term use of medications [1].

Over the last few decades, liver diseases have increased rapidly in prevalence and severity, making them one of the leading global killers. According to the World Health Organization (WHO), chronic diseases are responsible for about 59% of global mortality and 46% of the global disease burden, killing about 35 million people every year [2]. Ultrasound (US), Computed Tomography (CT), and Magnetic Resonance Imaging (MRI) are among the imaging modalities used to diagnose liver disease [3, 4]. Nonetheless, routine blood tests can play an essential role in the early detection of liver disease [4]. The current information age generates millions of data points daily from a wide variety of sources. Using machine learning techniques, these data can be exploited to enhance healthcare services and accurately diagnose diseases.

Machine Learning (ML) has received considerable attention from researchers and has been widely adopted in a variety of contexts around the world. In medicine, ML has demonstrated its effectiveness on a range of urgent problems, including cancer treatment, heart disease diagnosis, and dengue fever treatment [5]. High data dimensionality is a problem that frequently arises in ML: it demands a large amount of memory, and some of the data may be irrelevant or redundant, leading to overfitting. Feature selection is carried out as a means of dealing with this issue; it is the process of deleting the less informative features and retaining the most informative ones [6].

For many classification problems, it is challenging to learn effective classifiers without first removing redundant features; removing superfluous features also reduces the complexity of learning algorithms. Numerous feature selection strategies are available for determining which features are the most informative, and they fall into two types: filter methods and wrapper methods [7,8,9]. Filter methods evaluate and select feature subsets using general characteristics of the data, without involving a learning algorithm. In contrast, wrapper methods use a classification algorithm to assess each feature subset produced as an optimization algorithm adds or removes features [7,8,9]. Figures 1 and 2 depict examples of filter and wrapper techniques, respectively.
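The contrast between the two families can be sketched with scikit-learn: a filter method scores features from a data statistic alone, while a wrapper lets a classifier's behavior drive the search. This is an illustrative sketch on a public dataset, not the paper's pipeline; the choice of chi-squared scoring, logistic regression, and five kept features are assumptions for the example.

```python
# Filter vs. wrapper feature selection (illustrative sketch, not the paper's setup).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features by the chi-squared statistic (no learner involved), keep the top 5.
filter_sel = SelectKBest(score_func=chi2, k=5).fit(X, y)

# Wrapper: recursive feature elimination driven by a classifier's fitted weights.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)

print("filter keeps :", filter_sel.get_support().nonzero()[0])
print("wrapper keeps:", wrapper_sel.get_support().nonzero()[0])
```

The two selectors typically keep different subsets, which is exactly the trade-off the text describes: the filter is cheap but learner-agnostic, while the wrapper is costlier but tailored to the classifier.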

Fig. 1
figure 1

Filter based feature selection methods

Fig. 2
figure 2

Wrapper based feature selection methods

The primary contribution of this paper is a new Liver Patients Detection Strategy (LPDS) that relies on routine blood tests of liver and non-liver patients. The proposed LPDS is divided into three phases: (i) data preprocessing, (ii) feature selection, and (iii) detection. The primary goal of the data preprocessing phase is to eliminate outliers from the input data using Isolation Forest (IF). During the feature selection phase, the most informative features are selected from the patient's routine blood tests using the Improved Binary Butterfly Optimization Algorithm (IB2OA) so that the detection phase can perform well. IB2OA is a hybrid technique that incorporates both filter and wrapper strategies in two steps: Primary Selection (PS), which quickly identifies the most important features, and Final Selection (FS), which accurately selects the most informative ones. During the detection phase, liver disease patients are detected based on the selected features. Experimental results demonstrate that IB2OA outperforms its competitors, achieving the highest accuracy.

The rest of the paper is organized as follows: Section 2 details the preliminary concepts underlying this work. Section 3 discusses previous efforts to classify patients with liver disease. Section 4 presents the proposed Liver Patients Detection Strategy (LPDS) in detail. Section 5 discusses the experimental findings, and Section 6 summarizes the conclusions.

2 Preliminary concepts

The principles used in this article include the swarm intelligence concept, the butterfly optimization algorithm, and opposition-based learning, which are covered in detail in the following subsections.

2.1 Swarm intelligence

Swarm Intelligence (SI) is a method for modelling cooperative intelligence in biological systems. This popular multi-agent framework is based on natural swarm behavior [10]: it acts much as a pack of animals would in order to survive. Swarm behavior is advantageous for many kinds of animals in many different environments; models of basic group behavior show that even relatively simple interactions between individuals can be enough to shape and display a variety of group morphologies. Individuals form groups in order to share the burden of processing information and making decisions. A group's ability to make better decisions than an individual is known as "collective intelligence" [11, 12], and SI refers to the mechanism by which individuals interact with one another and with the groups to which they belong [11, 12].

SI has been applied to the management of robots and unmanned vehicles, the forecasting of social behaviors, and the enhancement of computer and communication networks [10, 13, 14]. The field of SI algorithms is often thought of as a subset of AI. SI has recently attracted attention from the feature selection community [10, 13, 14] due to its ease of use and global search capabilities. Numerous algorithms that use swarm intelligence have been developed, such as Particle Swarm Optimization (PSO), Grey Wolf Optimizer (GWO), Genetic Algorithm (GA), Whale Optimizer Algorithm (WOA), Salp Swarm Algorithm (SSA), the Sine Cosine Algorithm (SCA), Bat Algorithm (BA), Ant Colony Optimization (ACO), and Butterfly Optimization Algorithm (BOA) [14, 15]. The search for the next iteration of any SI algorithm typically relies on a stochastic search algorithm, in which heuristic information is shared. The overarching structure of SI algorithms is depicted in Fig. 3.

Fig. 3
figure 3

General framework of SI algorithms

First, the algorithm's parameters are set. Then, initialization and its accompanying strategies kick off the evolutionary process. In each iteration of the SI framework, a fitness function is used to rank the search agents; it can be a simple metric, such as classification accuracy, or a more complex function. The search agents are then updated and relocated in accordance with the algorithm's underlying mathematics. This procedure is repeated until the termination condition is met, at which point the best search result found is returned [14, 15].

2.2 Butterfly optimization algorithm

The Butterfly Optimization Algorithm (BOA) is a bio-inspired metaheuristic, a class of algorithms that takes its inspiration from the natural world. Butterflies serve as the search agents of BOA, whose optimization process is modeled on their foraging behavior [16, 17]. Butterflies are equipped with olfactory receptors, chemoreceptors distributed all over the body, that allow them to detect the aroma of food and flowers. In BOA, each butterfly is assumed to emit a scent or fragrance with some intensity [16, 17]. This scent corresponds to the butterfly's fitness, which is determined by the problem's objective function; thus a butterfly's fitness changes as it moves from one position in the search space to another. The scent of a butterfly can be detected by other butterflies in the neighborhood, forming a collective social learning system. When a butterfly detects the scent of the best butterfly in the search space, it moves directly toward it; this stage is known as the global search phase. When a butterfly cannot detect the scent of any other butterfly, it takes random steps instead; this stage is known as the local search phase [16,17,18,19].

As in any SI algorithm, BOA has three phases: the initialization phase, the iteration phase, and the final phase [20]. In the initialization phase, the parameters of the algorithm are defined, after which the algorithm generates its initial population randomly. The algorithm then enters the iteration phase, where the search agents alternate between a global search phase and a local search phase according to a switch probability p; in effect, p controls the algorithm's balance between global and local search [20]. The global and local searches of BOA are determined using the following equations:

$${Y}_i\left(t+1\right)={Y}_i(t)+\left({r}^2\ast {Y}_{best}(t)-{Y}_i(t)\right)\ast {f}_i$$
(1)
$${Y}_i\left(t+1\right)={Y}_i(t)+\left({r}^2\ast {Y}_j(t)-{Y}_k(t)\right)\ast {f}_i\kern0.5em$$
(2)

Where Yi(t + 1) represents the location of the ith butterfly in the next iteration (t+1), Yi(t) denotes the position of the ith butterfly in the current iteration t, and r is a random number between zero and one. In addition, Ybest(t) is the location of the best butterfly in iteration t, and Yj(t) and Yk(t) stand for the jth and kth butterflies of the population in the current iteration t. fi represents the perceived magnitude of the fragrance, which is formulated using the following equation:

$$f=c{I}^a\kern0.5em$$
(3)

Where c is the sensory modality, I is the stimulus intensity, and a is the power exponent. In BOA, equations (1) and (2) are applied according to the following strategy:

$${Y}_i\left(t+1\right)=\left\{\begin{array}{c}{Y}_i(t)+\left({r}^2\ast {Y}_{best}(t)-{Y}_i(t)\right)\ast {f}_i\kern6.25em for\ \mathit{\operatorname{rand}}\left(0,1\right)<p\\ {}{Y}_i(t)+\left({r}^2\ast {Y}_j(t)-{Y}_k(t)\right)\ast {f}_i\kern6.5em otherwise\kern5.75em \end{array}\right.$$
(4)

Where p is a constant; p ∈ [0, 1]. Figure 4 shows the flow chart of the conventional BOA, and the conventional BOA algorithm is presented in Algorithm 1.
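The iteration loop of Eqs. (1)-(4) can be sketched on a toy problem. This is a minimal illustration, not the paper's configuration: the sphere objective, the parameter values (c, a, p), and the intensity mapping I = 1/(1 + cost) are all assumptions made for the example.

```python
import numpy as np

# Toy sketch of conventional BOA (Eqs. 1-4) minimizing the sphere function sum(y^2).
rng = np.random.default_rng(0)

def sphere(y):
    return float(np.sum(y ** 2))

n, dim, iters = 20, 5, 200
c, a, p = 0.01, 0.1, 0.8                  # sensory modality, power exponent, switch prob.
Y = rng.uniform(-5, 5, (n, dim))          # initial butterfly positions
best = min(Y, key=sphere).copy()
history = [sphere(best)]                  # best cost seen so far, per iteration

for t in range(iters):
    costs = np.array([sphere(y) for y in Y])
    if costs.min() < sphere(best):
        best = Y[costs.argmin()].copy()
    I = 1.0 / (1.0 + costs)               # stimulus intensity: fitter -> stronger (assumed mapping)
    f = c * I ** a                        # fragrance, Eq. (3)
    for i in range(n):
        r = rng.random()
        if rng.random() < p:              # global search toward the best butterfly, Eq. (1)
            Y[i] = Y[i] + (r ** 2 * best - Y[i]) * f[i]
        else:                             # local random walk among neighbors, Eq. (2)
            j, k = rng.integers(0, n, size=2)
            Y[i] = Y[i] + (r ** 2 * Y[j] - Y[k]) * f[i]
    history.append(sphere(best))

print("initial best:", history[0], "final best:", history[-1])
```

Because the best-so-far position is tracked explicitly, the recorded cost is non-increasing across iterations, mirroring the "optimal search result" returned in the final phase.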

Fig. 4
figure 4

Flow chart of conventional BOA

2.3 Opposition-Based Learning

Opposition-Based Learning (OBL) is an intelligent computing technique that has been successfully applied in many intelligent algorithm optimizations. OBL was proposed by Tizhoosh [21]. It has been theorized that OBL has a higher chance of finding a solution very close to the global optimum [22].

Definition 1: Opposite number

Let y be a real number in the interval [c, d]; then the opposite number \(\overline{y}\) is given by the following equation:

$$\overline{y}=c+d-y\kern0.75em$$
(5)

Where \(\overline{y}\) is the opposite value of y, and c and d refer to the lower bound and upper bound, respectively.

Definition 2: Opposite vector

If y = (y1, y2, …, yn), where y1, y2, …, yn are real numbers, then each component \({\overline{y}}_i\) is calculated using the following equation:

$${\overline{y}}_i={c}_i+{d}_i-{y}_i$$
(6)

Let f(y) be the evaluation function. In every iteration, we calculate f(y) and \(f\left(\overline{y}\right)\); the current solution is then replaced by \(\overline{y}\) if \(f\left(\overline{y}\right)>f(y)\), and remains y otherwise.
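Definitions 1 and 2 and the replacement rule can be shown in a few lines. The bounds and the evaluation function below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Opposition-based learning sketch (Eqs. 5-6): form the opposite point
# c + d - y and keep whichever candidate scores better.
c = np.array([-5.0, -5.0, -5.0])      # per-dimension lower bounds (assumed)
d = np.array([5.0, 5.0, 5.0])         # per-dimension upper bounds (assumed)

def evaluate(y):                      # toy objective (higher is better, assumed)
    return -float(np.sum((y - 1.0) ** 2))

y = np.array([4.0, -3.0, 1.0])
y_opp = c + d - y                     # opposite vector, Eq. (6)

# Replacement rule: keep the opposite solution only if it evaluates better.
better = y_opp if evaluate(y_opp) > evaluate(y) else y
print("opposite:", y_opp, "kept:", better)
```

Evaluating both the candidate and its opposite roughly doubles the initial coverage of the search space for the cost of one extra function evaluation per candidate.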

Algorithm 1
figure a

conventional BOA

3 Related work

Previous efforts to classify liver diseases are discussed in this section. An Image-based Classification Model (ICM) for liver disease classification was proposed in [23]. The method proceeds in several stages. The CT images used were first preprocessed by cutting out the background and focusing on the liver. 3D texture features were then derived using intensity and higher-order features. Next, a hybrid of the Whale Optimization Algorithm and Simulated Annealing (WOA-SA) was used to select the most effective features for further analysis. The proposed classification model was then fed these informative features. The experimental results show that the method outperforms the alternatives.

A machine learning-based Fatty Liver Disease Classification Model (FLDCM) is presented in [24]. The model predicts fatty liver disease using four classification techniques: Random Forest, Naive Bayes, Artificial Neural Networks, and Logistic Regression. In the first stage of data preparation, all gaps in the data set were filled. The relative importance of each variable was then calculated using Information Gain (IG), after which predictive classification models were built to identify FLD patients accurately. Compared to the other classification models, the RF model performed best, as shown by the results in [24].

In [25], a novel approach was proposed to classify patients with liver cancer using Serum Raman Spectroscopy in conjunction with Deep Learning Algorithms (SRS-DLA). Gaussian white noise was added to the data at levels of 5, 10, 15, 20, and 25 dB to increase the robustness of the models. Convolutional neural networks fed with data augmented by a factor of ten reportedly performed well [25]. A fully automated Deep Learning (DL) algorithm using gadoxetic acid-enhanced Hepatobiliary Phase (HBP) MRI was introduced in [26]. First, convolutional neural network (CNN) input was generated from HBP images by creating representative liver patch images for each patient. The DL model was then fed the segmented, patched liver images. As shown in [26], this noninvasive liver fibrosis staging achieved good-to-excellent diagnostic performance in experimental settings.

Computer-assisted diagnosis (CAD) using ultrasound images of the liver has been developed, as detailed in [27]. The proposed CAD method used a voting-based classifier and machine learning algorithms to determine whether liver tissues were fatty or normal. First, a genetic algorithm was used to select multiple regions of interest (ROIs; a total of nine ROIs) within the liver tissue. Then, using Gray-Level Co-Occurrence Matrix (GLCM) and First-Order Statistics (FOS), 26 features of each ROI were extracted. Finally, fatty liver tissue was classified using a voting-based classifier. The results obtained in [27] show that the proposed CAD method outperformed the existing literature.

In [28], a novel framework for the early detection of chronic liver disease was proposed, namely a Hybridized Rough Set and Bat-inspired Algorithm (HRS-BA) model. The primary goal of the model is to give the doctor a new perspective. Decision-making factors were first prioritized using BA, and decision rules were then created based on these factors. The results were compared to those obtained by hybridized decision tree algorithms and found to be vastly superior. Current classification methods are briefly compared in Table 1.

Table 1 A brief comparison of the current classification techniques

4 The proposed Liver Patients Detection Strategy (LPDS)

This section describes the proposed Liver Patients Detection Strategy (LPDS). LPDS's primary goal is to detect patients infected with liver disease quickly and accurately; early detection allows for faster treatment and, consequently, slows the disease's progression. LPDS receives as input a training set consisting of routine blood tests of both normal subjects and patients. After the model has been trained, it can classify new cases, determining whether or not the input case is infected. As shown in Fig. 5, LPDS is divided into three phases: (i) data preprocessing, (ii) feature selection, and (iii) detection. The details of each phase are discussed in the following subsections.

Fig. 5
figure 5

The proposed Liver Patients Detection Strategy (LPDS)

4.1 Preprocessing phase

The primary purpose of the preprocessing phase is to ready the data for the subsequent processing stages. First, patient attributes are extracted from the input training set; Table 2 lists the characteristics considered when detecting liver disease patients. The dataset contains four missing values, but special care must be taken with outliers. When machine learning algorithms were applied directly to the data without first removing outliers or selecting features, the results were unfavourable. By contrast, the results obtained after normalizing the dataset's distribution to combat overfitting and then applying Isolation Forest (IF) for outlier detection are quite encouraging [29, 30]. Several plotting methods were used to examine the data for skewness, detect outliers, and verify the data's distribution. Each of these preprocessing steps is essential.
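The outlier-removal step can be sketched with scikit-learn's `IsolationForest`. Synthetic data stands in for the blood-test features, and the contamination rate is an assumed value rather than the paper's setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for four blood-test features (assumed distribution).
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
X[:5] = 500.0                                          # inject five obvious outliers

# Fit the Isolation Forest and keep only the rows it marks as inliers.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
mask = iso.predict(X) == 1                             # predict: 1 = inlier, -1 = outlier
X_clean = X[mask]
print(f"kept {X_clean.shape[0]} of {X.shape[0]} rows")
```

The cleaned matrix `X_clean` is what the feature selection phase would then consume; the injected extreme rows are among those discarded.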

Table 2 Feature categorization of chronic liver diseases

4.2 Feature selection phase

The most informative features are chosen during the feature selection phase. Feature selection increases the classifier's accuracy by eliminating extraneous data, so the feature selection process is crucial to enhancing the efficacy of learning algorithms [25, 31, 32]. Filter and wrapper methods are the two most common categories of feature selection techniques [31, 32]. Filter methods have been shown to be fast and scalable but cannot match the performance of wrapper methods; wrapper methods, despite providing better performance, are more expensive to compute [31,32,33]. The Improved Binary Butterfly Optimization Algorithm (IB2OA) is proposed in this work as a new hybrid filter-wrapper approach to feature selection.

By combining the speed of the filter technique with the strength of the wrapper method, IB2OA selects features from the dataset with greater efficiency. The proposed method thus aims to simplify the calculations required to locate the optimal solution in high-dimensional datasets and to reduce the time spent doing so. As Figure 6 shows, the core of the Improved Binary Butterfly Optimization Algorithm (IB2OA) consists of two parts: (i) Primary Selection (PS) using Information Gain (IG), and (ii) Final Selection (FS) using B2OA. Although B2OA can accurately identify the effective features, the large search space slows down its computation. The primary goal of PS is therefore to apply IG to preselect the most effective features, narrowing the search space of B2OA and reducing the time complexity. Ultimately, the optimal subset of features enhances the reliability and performance of the employed classification model. Figure 6 shows how IG is used in PS to select the best possible set of useful and informative features from the liver patients' dataset.

Fig. 6
figure 6

The required sequential steps for implementing IB2OA

Figure 6 depicts the extraction of features from the liver patients' dataset, followed by their transfer to the first step (i.e., PS) for preselection of the most effective features. The PS output is then fed into the second step (i.e., FS) to produce the seed population of "q" butterflies. B2OA then runs until the stopping criterion is reached. In the end, the best solution in the population provides the most effective set of features. A base classifier, such as Naive Bayes (NB), is used to assess each candidate subset [34,35,36].

In general, IB2OA relies on a meta-heuristic optimization algorithm called the Butterfly Optimization Algorithm (BOA). BOA simulates the guidance and hunting behavior of butterflies searching for food in their natural environment, and was originally designed to solve continuous optimization problems. Hence, to deal with the feature selection problem, which is a discrete optimization problem, BOA is converted into B2OA. B2OA starts with a group of butterflies called the Population (P). Each butterfly represents a candidate solution (i.e., a candidate subset of features), encoded as a binary vector of 'zero' and 'one' values whose length equals the number of features. A value of 'one' indicates that the corresponding feature is selected, while a value of 'zero' indicates that it is removed. Consequently, the size of the selected subset equals the number of ones.

Figure 6 shows the required sequential steps for implementing IB2OA for feature selection. First, in PS, IG is used to preselect a set of optimal features from those extracted from the liver patients' dataset. In other words, IG ranks the features based on their entropy using the following equation [37]:

$$IG(f)=-\sum_{n=1}^mP\left({C}_n\right)\ logP\left({C}_n\right)+P(f)\sum_{n=1}^mP\left({C}_n|f\right)\ logP\left({C}_n|f\right)+P\left(\overline{f}\right)\sum_{n=1}^mP\left({C}_n|\overline{f}\right)\ logP\left({C}_n|\overline{f}\right)$$
(7)

Where Cn is the nth class category and f stands for the feature. P(Cn) denotes the proportion of samples in class category Cn, P(f) is the proportion of samples that contain the feature f, and \(P\left(\overline{f}\right)\) is the proportion of samples that do not contain it. P(Cn| f) is the proportion of samples containing the feature f that belong to class category Cn, and \(P\left({C}_n|\overline{f}\right)\) is the corresponding proportion among samples that do not contain the feature f.
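Eq. (7) can be checked on a hypothetical worked example: one binary feature over two balanced classes, with the probabilities estimated from raw counts. The data below are made up for illustration, and the natural logarithm is used (the log base only rescales the score).

```python
import numpy as np

# Hypothetical example: 8 samples, one binary feature and a binary class label.
f_vals = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # feature present (1) or absent (0)
labels = np.array([1, 1, 0, 0, 0, 1, 1, 0])   # class label of each sample

def ig(f_vals, labels):
    """Information gain of a binary feature per Eq. (7)."""
    classes = np.unique(labels)
    p_f = f_vals.mean()                        # P(f), feature present

    def cond_term(mask, prior):
        # prior * sum_n P(C_n | condition) * log P(C_n | condition)
        total = 0.0
        for cn in classes:
            p = ((labels == cn) & mask).sum() / max(mask.sum(), 1)
            if p > 0:
                total += p * np.log(p)
        return prior * total

    base = sum((labels == cn).mean() * np.log((labels == cn).mean()) for cn in classes)
    return -base + cond_term(f_vals == 1, p_f) + cond_term(f_vals == 0, 1.0 - p_f)

print("IG(f) =", round(ig(f_vals, labels), 4))
```

Here the class entropy is ln 2 ≈ 0.6931 and each conditional term contributes −0.2812, giving IG(f) ≈ 0.1308; a feature carrying no class information would score 0.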

This subset of optimal features is then passed to the second step (i.e., FS) to be searched by B2OA to determine the final best subset. As shown in Fig. 6, in FS, after generating the initial population of "q" butterflies, EOBL is used to find the opposite solution of each initial solution. For example, if the initial solution is y = {0, 1, 0, 1, 1, 0, 0, 1, 0, 1}, the opposite solution is \(\overline{y}=\left\{1,0,1,0,0,1,1,0,1,0\right\}\). All solutions are then evaluated using the accuracy of a standard classifier such as NB to find the best subset of features. For example, if there are 20 initial solutions, there are also 20 opposite solutions; consequently, the final population contains 40 solutions, which are evaluated using the following equation:

$$Fit\left({Y}_i\right)=0.99\ast \left(1- Accuracy\ \left({Y}_i\right)\right)+0.01\ast \left(\frac{number\ of\ selected\ features}{total\ number\ of\ features}\right)$$
(8)

Where Accuracy (Yi) is the success rate of classifying data using the ith set of features. The algorithm searches for the butterfly with the best (lowest) Fit (Yi). After evaluating all candidate solutions, the optimization procedure updates each butterfly's position using equation (4). After the new positions of the butterflies are computed, each position is adjusted using the sigmoid function, which maps the butterfly's position back to binary values using (9):

$${Y}_i^j\left(t+1\right)=\left\{\begin{array}{c}0\kern3.75em if\kern0.75em \mathit{\operatorname{rand}}\left(0,1\right)\ge \sigma \left({Y}_i^j\right)\\ {}\ \\ {}\ 1\kern5.25em otherwise\kern6em \end{array}\right.$$
(9)

Where \({Y}_i^j\left(t+1\right)\) represents the value of the ith butterfly at the jth position in the next iteration (t+1), where j = 1, 2, 3, ..., m, and rand(0,1) is a random number in [0,1]. Furthermore, the sigmoid transfer function \(\sigma \left({Y}_i^j\right)\) represents the likelihood that the jth bit takes the value 1, and is computed as follows:

$$\sigma \left({Y}_i^j\right)=\frac{1}{1+{e}^{-{Y}_i^j}}$$
(10)

Each butterfly in P is scored using the fitness function in (8), which takes into account the butterfly's new position \({Y}_i^j\left(t+1\right)\). The process is repeated up to the maximum allowed number of generations, and the algorithm stops once the best butterfly has been selected from the population. The features set to 1 in this butterfly are the most reliable indicators of liver disease. Applying the IB2OA algorithm to the liver patients' dataset yields the best subset of features: Total Bilirubin (TB), Alkaline Phosphotase (Alkphos), Aspartate aminotransferase (Sgot), and the Albumin to Globulin Ratio (A/G Ratio). Algorithm 2 depicts the proposed IB2OA.
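The binarization and scoring steps of Eqs. (8)-(10) can be sketched as follows. The accuracy value is a stand-in constant rather than the output of a trained classifier, and the random position vector is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def to_binary(y_cont):
    """Map a continuous position to a binary feature mask (Eqs. 9-10)."""
    sig = 1.0 / (1.0 + np.exp(-y_cont))                  # sigmoid, Eq. (10)
    # Bit is 1 when rand(0,1) < sigma, otherwise 0, per Eq. (9).
    return (rng.random(y_cont.shape) < sig).astype(int)

def fitness(accuracy, subset, n_features):
    """Weighted fitness of Eq. (8): trade error rate against subset size (lower is better)."""
    return 0.99 * (1.0 - accuracy) + 0.01 * (subset.sum() / n_features)

y_cont = rng.normal(size=10)          # a butterfly's updated continuous position (assumed)
subset = to_binary(y_cont)
score = fitness(0.95, subset, 10)     # 0.95 is a placeholder classifier accuracy
print("subset:", subset, "fitness:", round(score, 4))
```

The 0.99/0.01 weighting means classification accuracy dominates the score, with a small penalty nudging the search toward fewer selected features.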

Algorithm 2
figure b

The algorithm of the proposed IB2OA

4.3 Detection phase

Finally, in the detection phase, different ML classifiers are used. Advances in computer vision and ML technologies enable the accurate, quick, and early detection of liver disease patients [38], producing fast and precise computerized results and reducing wasted time. Patients with liver disease benefit from an early diagnosis because treatment can begin as soon as possible. In this paper, the selected features are used to feed five different classifiers: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Decision Tree (DT), and Random Forest (RF), as shown in Fig. 7. The effectiveness of these classifiers was evaluated based on classification accuracy.
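The detection phase can be sketched with scikit-learn: the same selected features feed all five classifiers, which are then compared by cross-validated accuracy. Synthetic data stands in for the four selected liver-dataset features, and the default hyperparameters below are assumptions, not the paper's settings.

```python
# Detection-phase sketch: compare five classifiers on the same feature matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for 583 patients x 4 selected features.
X, y = make_classification(n_samples=583, n_features=4, n_informative=4,
                           n_redundant=0, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB":  GaussianNB(),
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(random_state=0),
}

# 10-fold cross-validated accuracy per classifier, best first.
scores = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Whichever classifier tops this ranking would be the one carried forward, mirroring the accuracy-based selection described above.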

Fig. 7
figure 7

The detection phase using different classifiers

5 Experimental results

This section evaluates the effectiveness of the proposed Liver Patients Detection Strategy (LPDS), which identifies patients infected with liver disease from laboratory results. LPDS consists of three phases: (i) data preprocessing, (ii) feature selection, and (iii) detection. During data preprocessing, the patient's information is managed and anomalous data is eliminated. The most useful features are then selected by the Improved Binary Butterfly Optimization Algorithm (IB2OA) during the feature selection phase. Finally, these features are fed into five distinct classifiers: SVM, KNN, NB, DT, and RF, and the most efficient classifier is selected on the basis of classification accuracy. Our implementation is based on a database of laboratory results for people with and without liver disease (https://www.kaggle.com/uciml/indian-liver-patient-records/home) [39]; the results presented in this paper were generated from this patients' dataset. Given the scarcity of publicly available datasets, the classification model is verified via cross-validation. Using 10-fold cross-validation, the dataset is split into 10 equal parts, with one part serving as the testing set and the other 9 as the training set. As a result, there are 251 (90%) patients in the training phase and 28 (10%) patients in the testing phase. Our tests were run on an Intel Core i5-6200U processor at 2.30 GHz with 8 GB of RAM. Parameters and their corresponding values are shown in Table 3.

Table 3 The parameters applied with the corresponding used values

5.1 Data description

This dataset contains information collected from patients' medical records, including the results of a battery of routine blood tests on people of varying ages, sexes, and health conditions (https://www.kaggle.com/uciml/indian-liver-patient-records/home). There are a total of 583 cases. As shown in Table 4, the cases are split into two groups: those with liver disease, commonly referred to as liver patients, and those without, referred to as "non-liver patients." Ten features are used in both the training and testing datasets, all derived from standard blood tests: Age, Gender, Total Bilirubin (TB), Direct Bilirubin (DB), Alkaline phosphatase (Alkphos), alamine aminotransferase (Sgpt), aspartate aminotransferase (Sgot), Total Proteins (TP), Albumin (ALB), and Albumin to Globulin ratio (A/G Ratio) (https://www.kaggle.com/uciml/indian-liver-patient-records/home) [39]. A breakdown of the collected dataset by age, gender, and type of disease is shown in Figs. 8, 9 and 10.

Table 4 Dataset description
Fig. 8
figure 8

The Total Number of Cases according to age

Fig. 9
figure 9

The total number of cases according to age and sex

Fig. 10
figure 10

The presentation of Liver patient and non-Liver patient distribution

5.2 Evaluation metrics

Performance measures such as accuracy, error, precision, recall/sensitivity, and F-score are applied throughout this section to assess the quality of the proposed method [34,35,36]. The primary findings used to assess the effectiveness of each algorithm are detailed in Tables 5 and 6. The confusion matrix is summarized using the formulas shown in Table 6 [34,35,36].
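These metrics all derive from the four confusion-matrix counts. The counts below are made up purely to illustrate the standard formulas, not taken from the paper's results.

```python
# Standard confusion-matrix metrics on hypothetical counts.
TP, TN, FP, FN = 50, 40, 5, 5          # illustrative values only

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)             # also called sensitivity
f_score   = 2 * precision * recall / (precision + recall)
error     = 1 - accuracy

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F-score={f_score:.3f} error={error:.3f}")
```

With these counts, accuracy is 0.9 and, since precision equals recall (both 50/55), the F-score coincides with them.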

Table 5 The classification outcomes of the system
Table 6 Confusion matrix formulas

5.3 Comparison of IB2OA with standard Meta-heuristics

In this paper, a new feature selection methodology called the Improved Binary Butterfly Optimization Algorithm (IB2OA) was introduced. To demonstrate its effectiveness, numerous feature selection techniques are compared to the proposed IB2OA, using the NB classifier as the base classifier. They include Particle Swarm Optimization (PSO), Grey Wolf Optimizer (GWO), Genetic Algorithm (GA), Whale Optimizer Algorithm (WOA), Salp Swarm Algorithm (SSA), Sine Cosine Algorithm (SCA), Bat Algorithm (BA), Ant Colony Optimization (ACO), and Butterfly Optimization Algorithm (BOA). Table 7 displays the results.

Table 7 Comparison between IB2OA and other optimizers in terms of accuracy, precision, recall, and F-score

According to Table 7, the proposed IB2OA performs better than the other optimizers in terms of accuracy, precision, recall/sensitivity, and F-score. IB2OA achieved the highest accuracy of 0.988, while the lowest value, 0.851, was introduced by GWO. Furthermore, IB2OA introduces the highest precision value of 0.9009, while SCA introduces the lowest value of 0.699. IB2OA's average recall and F-score are 0.8018 and 0.8902, respectively. Table 7 reveals that IB2OA performs significantly better than PSO, GWO, GA, WOA, SSA, SCA, BA, ACO, and BOA. The proposed strategy also showed steady behavior, as it obtained the lowest average STD value across all methods.

5.4 Comparison with the state-of-the-arts FS methods

In this subsection, the performance of the proposed method is compared to state-of-the-art feature selection methods, including Improved Binary Global Harmony Search (IBGSH) [40], Dynamic Salp Swarm Algorithm (DSSA) [41], Chimp Optimization Algorithm (ChOA) [42], Improved Binary Grey Wolf Optimizer (IBGWO) [43], Binary Monarch Butterfly Optimization (BMBO) [44], Information Gain binary Butterfly Optimization Algorithm (IG-bBOA) [17], and Enhanced Chaotic Crow Search and Particle Swarm Optimization Algorithm (ECCSPSOA) [45]. Table 8 displays the results.

Table 8 Comparison between IB2OA and state-of-the-arts in terms of accuracy, precision, recall, F-measure, and minimum number of selected features

Table 8 shows that the accuracy of the proposed method is approximately 0.988, higher than that of IBGSH, DSSA, ChOA, IBGWO, BMBO, IG-bBOA, and ECCSPSOA. IB2OA achieved a precision of 0.8709, while the precisions of IBGSH, DSSA, ChOA, IBGWO, BMBO, IG-bBOA, and ECCSPSOA are 0.7821, 0.6802, 0.6920, 0.7397, 0.7063, 0.7318, and 0.8102, respectively. IB2OA's recall value is 0.801, while DSSA's is 0.6755. Additionally, IB2OA provided the highest F-score value of 0.8302, while ChOA introduced the lowest value. The proposed method also demonstrated stable behavior, having the lowest average STD value compared to the others.

5.5 Testing the whole strategy

In this subsection, the whole strategy is evaluated. Five different classifiers, SVM, KNN, NB, DT, and RF, are tested to select the best one. First, the selected features are used to feed these classifiers; the classifier with the highest classification accuracy is then selected to complete the proposed strategy. The results are shown in Fig. 11.

Fig. 11
figure 11

Comparison of performance of the used classifiers

Figure 11 presents a comparative study of the performance of the five classifiers. As shown in Fig. 11, the accuracies of SVM, KNN, NB, DT, and RF are 96.2%, 99.1%, 98.8%, 93%, and 94.1%, respectively. From the results obtained, it is concluded that KNN gives the best result with K = 5, as KNN is one of the most efficient and simple algorithms used for disease classification. These encouraging results illustrate that the proposed IB2OA with KNN can accurately detect liver disease patients.

6 Conclusions

Chronic liver disease, a leading cause of death worldwide, affects a vast number of people. A variety of factors that harm the liver contribute to the development of this disease, such as alcohol abuse, undiagnosed hepatitis, and obesity, which can cause hepatic encephalopathy, jaundice, abnormal nerve function, bloody coughing or vomiting, kidney failure, liver failure, and many other symptoms. Diagnosing this illness is expensive and difficult, so it is essential to identify liver disease early in order to save lives and take the necessary steps to control the disease. This paper describes a new way to find people with liver disease, called the Liver Patients Detection Strategy (LPDS). The proposed LPDS depends on a new feature selection method called the Improved Binary Butterfly Optimization Algorithm (IB2OA), which consists of two steps: the Primary Selection (PS) step and the Final Selection (FS) step. The selected features are then used to feed five different classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Decision Tree (DT), and Random Forest (RF), to detect liver patients. Experimental results show that the proposed IB2OA has better average accuracy, precision, sensitivity/recall, and F-score than competing methods. Additionally, the proposed model selects 4.425 features on average, fewer than the other state-of-the-art methods. The KNN classifier obtained the highest classification accuracy, 99.1%, on the test dataset among all classifiers considered.