1 Introduction

Breast cancer is the most common cancer among women and one of the deadliest. According to the American Cancer Society (ACS), there will be 287,850 new cases and 43,250 deaths in the USA in 2022 [1]. Breast cancer is a major health concern: at least 1.67 million people are diagnosed annually, leading to an estimated 522,000 deaths [2]. Early detection and classification of breast cancer patients are therefore essential. In practice, a number of imaging modalities aid in the detection of breast cancer [3], including magnetic resonance imaging (MRI), positron emission tomography (PET), thermography (thermal imaging), ultrasound imaging (sonography), and mammography, as shown in Fig. 1.

Fig. 1 Various breast cancer imaging techniques

Mammography is an X-ray examination used to screen the breast and is the most common method for breast cancer screening. A mammography examination, also known as a mammogram, helps detect and diagnose breast cancer early [4]. Mammography has several advantages: (1) it decreases the chance of dying from breast cancer, as it can detect all types of breast cancer, including invasive ductal and lobular cancer; (2) after the X-ray examination, no radiation remains in the patient's body; (3) it improves the physician's ability to identify small cancerous cells; and (4) in the typical diagnostic range of this examination, X-rays usually have no side effects [5]. However, it has drawbacks: (1) a false-negative mammogram that appears normal despite the existence of breast cancer, (2) a false-positive mammogram that appears abnormal despite the absence of breast cancer, and (3) waiting periods when follow-up tests are required. Therefore, we need to resort to alternative supporting methods, such as machine learning.

Machine learning (ML) is an area of computer science and artificial intelligence (AI) that studies how machines can learn from data and algorithms, improving their performance over time much as humans do [6]. By following this principle, a model gradually becomes more effective and precise. In practice, most ML methods follow the same general workflow. First, training data are fed into the chosen algorithm; the final model is developed from these training data, which may be labeled or unlabeled. The algorithm is then fed new input data to check whether it functions properly, and the predictions are cross-checked against the known outcomes [7]. Various ML techniques have been used for breast cancer classification [8,9,10,11,12]. However, they suffer from several limitations, such as (1) low diagnostic accuracy, (2) long running time, and (3) a high degree of complexity.

ML algorithms have gained prominence in healthcare for detecting and classifying different diseases, especially breast cancer. However, the existence of irrelevant features prevents many ML algorithms from working well. Irrelevant features cause several issues: (1) they degrade the classifier's performance, (2) they increase the training time, and (3) they lead to overfitting [13]. Therefore, good data preparation and dimensionality reduction before applying ML algorithms are critical, because they can affect the results. When data are precise, consistent, and free of noise, disease detection and classification become faster and easier [14].

Many ML algorithms, such as K-nearest neighbors (KNN) and support vector machines (SVM), are unable to perform well in the presence of unrelated or irrelevant features [15]. Removing those features before applying the learning algorithm is a critical step in improving classification performance [15]. Feature selection is the process of identifying the most significant features among the original features. Its main aims are to avoid overfitting, obtain higher accuracy, achieve better learning performance, and decrease computational cost [16, 17]. Feature selection approaches fall into two main categories: filter and wrapper methods [18,19,20]. Filter methods find the significant features without requiring the data to be classified first, whereas wrapper approaches use a classification algorithm to home in on the most relevant features [18,19,20]. With regard to classification accuracy, wrapper approaches are preferred over filter methods [18,19,20]. Swarm intelligence algorithms have been used increasingly for feature selection; a sketch contrasting the two categories follows.
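To make the distinction concrete, the following minimal Python sketch contrasts a filter criterion (a mutual-information ranking, standing in for information gain) with a wrapper evaluation that scores candidate subsets with the classifier itself. The dataset, the subset size of ten, and the naïve Bayes classifier are illustrative assumptions, not choices made in this paper.

```python
# Hypothetical contrast of filter vs. wrapper feature selection; the
# dataset, subset size, and classifier are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features by mutual information (an information-gain-style
# criterion) without consulting any classifier.
scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(scores)[::-1][:10]          # keep the 10 best-ranked features

# Wrapper: score a candidate feature subset with the classifier itself.
def subset_accuracy(feature_idx):
    return cross_val_score(GaussianNB(), X[:, feature_idx], y, cv=5).mean()

print("filter top-10 accuracy:", subset_accuracy(top_k))
print("all-features accuracy :", subset_accuracy(np.arange(X.shape[1])))
```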

Swarm intelligence (SI) is a method of simulating the collective intelligence of biological groups. It is a popular multi-agent framework inspired by natural swarm behaviors [21], mimicking the actions of a herd of animals fighting for its survival. SI has been applied to solve many complex real-world problems, such as controlling robots and unmanned vehicles, predicting social behaviors, and improving telecommunication and computer networks [21,22,23]. Recently, SI has attracted a lot of interest from the feature selection community due to its simplicity and global search ability. A variety of swarm intelligence algorithms are available today, including ant colony optimization (ACO), grey wolf optimization (GWO), particle swarm optimization (PSO), the bat algorithm (BA), and many others [23].

However, relying solely on feature selection will not provide the best classification accuracy, because a small number of instances in the datasets represent bad or noisy data [24]. Therefore, to boost the efficiency of classification techniques, such noisy data should be disregarded [25]. Outlier rejection is the process of eliminating noisy data whose behavior is very unusual compared to the rest. Consequently, before applying the classification technique, outliers must be rejected or removed from the datasets. Several ML algorithms treat outliers as noise; since their presence reduces the system's ability to predict future events, they must be eliminated. In general, there are two types of outlier methods: the classic outlier approach and the spatial outlier approach [24, 25].

This paper introduces a new strategy for detecting breast cancer patients called the patients detection strategy (PDS). PDS consists of two phases: the data preprocessing phase (DP2) and the patient detection phase (PDP). The main objective of DP2 is to transform the mammogram images into a set of features and eliminate the noisy data. It therefore comprises two processes: outlier rejection and feature selection. Firstly, the gray level co-occurrence matrix (GLCM) is used to extract features from the mammogram images. Then, noisy records that exhibit extreme behavior in comparison with the others are excluded from the extracted features. In the feature selection process, the most efficacious and useful features are selected and passed to the next phase (i.e., PDP). Undoubtedly, a good feature selection method speeds up classification by allowing fewer features to be considered, which not only increases model efficiency but also improves model performance. The main contribution of this paper is a new hybrid feature selection method (NHFSM). NHFSM combines filter and wrapper feature selection methods and includes two modules: the quick selection module (QSM) and the feature selection module (FSM). In the first module (QSM), information gain (IG) is used to select the active features from the extracted features; in other words, IG serves as an initial selection stage. In the second module (FSM), the selected features are used to initialize the population of the bat algorithm (BA), and a hybrid approach finally selects the most informative features. PDP then makes use of the filtered data to perform precise classification quickly. The experimental results show that NHFSM is superior to its rivals in terms of accuracy, precision, recall, and error rate.

1.1 Research questions

As such, this work attempts to answer the following research questions (Q):

Q1: Which machine learning models based on feature selection methods exist to detect breast cancer patients?

Q2: To what extent is a good feature selection method important for breast cancer detection?

1.2 Research organization

The rest of the paper is organized as follows. Section 2 briefly discusses the standard PSO and the standard BA. Section 3 reviews previous efforts on feature selection methods. Section 4 introduces the proposed patients detection strategy. Section 5 introduces the proposed feature selection method. Section 6 illustrates the experimental results. Conclusions and future work are discussed in Sect. 7.

2 Background and basic concepts

In this section, an overview of particle swarm optimization (PSO) and the bat algorithm (BA) is presented.

2.1 Particle swarm optimization (PSO)

Particle swarm optimization (PSO) is a stochastic, population-based optimization technique influenced by the social behavior of flocks of birds and schools of fish. It was introduced by Kennedy and Eberhart [26]. PSO uses a group of individuals to explore promising areas of the search space, iteratively enhancing the candidate solutions. PSO starts with a randomly generated group of particles (candidate solutions) called the swarm S. Each particle in the swarm positions itself relative to the other particles based on all of its experiences [27]. Hence, each particle is a potential solution to the optimization problem and flies at a certain velocity in the multi-dimensional search space. Each particle tracks both its personal best position (Pbest) and the global best position (XG), i.e., the best position found among all particles. Thus, the flight state can be described by a velocity vector and a position vector. PSO has several advantages compared to other optimization algorithms: (1) it is less sensitive to the nature of the objective function than other heuristic methods, (2) it depends less on the set of initial points than other evolutionary methods, resulting in more robust convergence, and (3) it is simple to implement with few parameters to adjust [26, 27]. Figure 2 shows the conventional PSO flowchart.

Fig. 2 Conventional PSO flowchart

Let us assume there are N particles in the PSO population, each representing a possible solution associated with two vectors: the position vector Xi and the velocity vector VXi. The search space has M dimensions, so the position of the ith particle is Xi = (Xi1, Xi2, …, XiM) and its velocity is VXi = (VXi1, VXi2, …, VXiM). The best previous position of the ith particle, i.e., the one with the best fitness value, is Pbesti = Ppi = (Ppi1, Ppi2, …, PpiM). The global best position among all particles in the swarm is XG, also known as XGlobal; XG = (XG1, XG2, …, XGM). Each particle's velocity and position are dynamically updated based on its momentum and on both Pbest and XGlobal, so that it flies toward the optimal solution. The velocity and position of each particle are updated using the following equations [26, 27]:

$$VX_{i} \left( {t + 1} \right) = w*VX_{i} \left( t \right) + \left( {c_{1} r_{1} \left( {P_{{{\text{pi}}}} \left( t \right) - X_{i} \left( t \right)} \right)} \right) + \left( {c_{2} r_{2} \left( {X_{{\text{G}}} \left( t \right) - X_{i} \left( t \right)} \right)} \right)$$
(1)
$$X_{i} \left( {t + 1} \right) = X_{i} \left( t \right) + VX_{i} \left( {t + 1} \right)$$
(2)

where VXi(t + 1) denotes the velocity of the ith particle at iteration (t + 1), VXi(t) is its velocity at iteration t, and \(P_{{{\text{pi}}}} \left( t \right)\) is its current personal best position Pbesti. XG(t) denotes the global best position in the swarm S at the current iteration (XGlobal), and Xi(t) is the current position of the ith particle. w is the inertia weight, which controls the impact of the velocity history on the current velocity; w ∈ [0.9, 1.2]. c1 and c2 are the cognitive and social acceleration constants; c1, c2 ∈ [2, 4]. r1 and r2 are two uniformly distributed random numbers in the interval [0, 1]. \(X_{i} \left( {t + 1} \right)\) is the position of the ith particle at the next iteration. The conventional PSO procedure is illustrated in Algorithm 1, and a minimal sketch follows.

Algorithm 1 Conventional PSO
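As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch runs continuous PSO on the sphere function. The swarm size, iteration budget, search bounds, and minimization objective are illustrative assumptions rather than settings used in this paper.

```python
# Minimal continuous PSO implementing Eqs. (1)-(2); all parameter values
# and the sphere objective are illustrative assumptions (minimization).
import numpy as np

def pso(fitness, dim=5, n_particles=20, iters=100, w=0.9, c1=2.0, c2=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_particles, dim))   # positions X_i
    V = np.zeros((n_particles, dim))             # velocities VX_i
    P = X.copy()                                 # personal bests P_pi
    p_fit = np.array([fitness(x) for x in X])
    g = P[np.argmin(p_fit)].copy()               # global best X_G
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))      # r1, r2 ~ U[0, 1]
        r2 = rng.random((n_particles, dim))
        V = w * V + c1 * r1 * (P - X) + c2 * r2 * (g - X)   # Eq. (1)
        X = X + V                                           # Eq. (2)
        fit = np.array([fitness(x) for x in X])
        better = fit < p_fit                     # update personal bests
        P[better], p_fit[better] = X[better], fit[better]
        g = P[np.argmin(p_fit)].copy()           # update global best
    return g, p_fit.min()

best, val = pso(lambda x: np.sum(x ** 2))        # sphere function
print(best, val)
```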

2.2 Bat algorithm (BA)

The bat algorithm (BA) is a bio-inspired algorithm created by Xin-She Yang in 2010, and it has since been shown to be quite effective. It mimics parts of the micro-bat's echolocation behavior in a simple way [28, 29]. The basic structure of BA is built on three major properties of the micro-bat: (1) its echolocation behavior, (2) the signal frequency, as the micro-bat transmits a signal with frequency f and variable wavelength λ, and (3) the loudness A0 of the emitted sound, which is used to locate prey [28,29,30]. Xin-She Yang's approximate rules for this method are as follows:

(1) Each bat uses echolocation to sense distance and can, in some way, determine the distance to its prey.

(2) To find prey, bats fly randomly with velocity vi at position xi, emitting sounds with a fixed frequency fmin, varying wavelength, and loudness A0.

(3) Bats can automatically adjust both the wavelength (or frequency) of their emitted pulses and the rate of pulse emission in response to the proximity of their target.

(4) Although the loudness can vary in many ways, it is assumed to decrease from a maximum A0 to a constant low value Amin.

For an m-dimensional search space, each bat is defined by its position, velocity, and frequency. For the ith bat, the velocity, position, and frequency for the next iteration are updated using the following equations [28]:

$$VX_{i} \left( {t + 1} \right) = VX_{i} \left( t \right) + \left( {X_{i} \left( t \right) - X_{{\text{G}}} \left( t \right)} \right)f_{i}$$
(3)
$$X_{i} \left( {t + 1} \right) = X_{i} \left( t \right) + VX_{i} \left( {t + 1} \right)$$
(4)

where t represents the current iteration, \(VX_{i} \left( {t + 1} \right)\) is the velocity of the ith bat at the next iteration, \(VX_{i} \left( t \right)\) is its velocity at the current iteration, and \(X_{i} \left( t \right)\) is its position at the current iteration. \(X_{{\text{G}}} \left( t \right)\) represents the current global best solution in the bat population, and \(X_{i} \left( {t + 1} \right)\) is the position of the ith bat at the next iteration (t + 1). \(f_{i}\) is the frequency of the ith bat, updated at each iteration as given in (5) [28]:

$$f_{i} = f_{\min } + \left( {f_{\max } - f_{\min } } \right)\beta$$
(5)

where β is a uniformly distributed random number, β ∈ [0, 1]. For the algorithm's local search part, once a solution is chosen from among the current best solutions, a new solution is generated locally for each bat using a random walk, as given by (6) [30]:

$$X_{i} \left( {t + 1} \right) = X_{i} \left( t \right) + \varepsilon \overline{A}_{t}$$
(6)

where ε is a random number, ε ∈ [−1, 1], and \(\overline{A}_{t}\) is the average loudness of all bats at the current iteration; the larger it is, the more the bats explore rather than exploit. The frequency value in BA determines the extent and range of the bats' movement. When bats get close to their prey, the loudness A decreases and the pulse emission rate r increases. The updated values of the loudness A and the pulse emission rate r are calculated using (7) and (8) [28]:

$$A_{i} \left( {t + 1} \right) = \alpha A_{i} \left( t \right)$$
(7)
$$r_{i} \left( {t + 1} \right) = r_{i} \left( 0 \right)\left( {1 - e^{ - \gamma t} } \right)$$
(8)

where \(A_{i} \left( {t + 1} \right)\) is the loudness of the ith bat at the next iteration and \(A_{i} \left( t \right)\) is its loudness at the current iteration. Similarly, \(r_{i} \left( {t + 1} \right)\) is the pulse emission rate at the next iteration and \(r_{i} \left( 0 \right)\) is the initial pulse emission rate. α and γ are constants. Figure 3 shows the flowchart of the conventional BA; the sequential steps are given in Algorithm 2, and a compact sketch follows.

Algorithm 2 Conventional BA

Fig. 3 Conventional BA flowchart
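The following compact Python sketch strings Eqs. (3) to (8) together for a continuous minimization problem. The bounds, population size, and the initial loudness and pulse-rate values are illustrative assumptions, and the acceptance rule follows the standard BA formulation.

```python
# Compact continuous bat algorithm following Eqs. (3)-(8); bounds,
# population size, and A0/r0 values are illustrative assumptions.
import numpy as np

def bat_algorithm(fitness, dim=5, n_bats=20, iters=100,
                  f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_bats, dim))        # positions x_i
    V = np.zeros((n_bats, dim))                  # velocities v_i
    A = np.ones(n_bats)                          # loudness A_i, starting at A0 = 1
    r0 = 0.5 * np.ones(n_bats)                   # initial pulse rates r_i(0)
    r = np.zeros(n_bats)                         # r_i grows from 0 via Eq. (8)
    fit = np.array([fitness(x) for x in X])
    g = X[np.argmin(fit)].copy()                 # global best X_G
    for t in range(1, iters + 1):
        for i in range(n_bats):
            f_i = f_min + (f_max - f_min) * rng.random()         # Eq. (5)
            V[i] = V[i] + (X[i] - g) * f_i                       # Eq. (3)
            x_new = X[i] + V[i]                                  # Eq. (4)
            if rng.random() > r[i]:              # local random walk, Eq. (6)
                x_new = X[i] + rng.uniform(-1, 1, dim) * A.mean()
            f_new = fitness(x_new)
            if rng.random() < A[i] and f_new < fit[i]:           # accept solution
                X[i], fit[i] = x_new, f_new
                A[i] = alpha * A[i]                              # Eq. (7)
                r[i] = r0[i] * (1 - np.exp(-gamma * t))          # Eq. (8)
        g = X[np.argmin(fit)].copy()
    return g, fit.min()

print(bat_algorithm(lambda x: np.sum(x ** 2)))   # sphere function
```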

3 Literature review

In this section, previous efforts related to feature selection in breast cancer classification are reviewed. Several researchers have focused on finding feature selection methods that can pick the most informative features from the extracted features in order to achieve the most accurate breast cancer classification possible. In [31], the authors investigated the impact of feature selection techniques integrated with a classification algorithm for diagnosing breast cancer. They used particle swarm optimization (PSO) to select the most effective and informative features, and concluded that selecting the most important features improved the classifier's performance. In [32], Raman spectral feature selection using ant colony optimization (ACO) for breast cancer diagnosis was introduced. ACO was used to find the features most related to cancerous changes and thereby promote classification accuracy. The experimental results showed that choosing the best features with ACO improved the classification accuracy of the normal, benign, and cancerous groups by 14%, reaching 87.7%.

In [33], a new feature selection method was introduced to improve the diagnostic precision of computer-aided diagnosis systems. The proposed method, called the opposition-based enhanced grey wolf optimization (OEGWO), was utilized to solve the feature selection problem in breast density classification. OEGWO involves three steps: opposition-based population initialization, modification of the parameter controlling exploration and exploitation, and modification of the position-updating step. Firstly, forty-five texture features were extracted from the mammogram images; OEGWO was then used to select the most useful ones. Based on the experimental results, OEGWO is superior to other feature selection algorithms for identifying breast density.

In [34], a new feature selection method for breast mass classification was proposed, called the opposition-based Harris hawks optimization (OHHO) algorithm. Firstly, forty-five texture features and nine shape features were extracted from the mammogram images. Then, to deal with the feature selection problem, OHHO was converted into a binary algorithm using a sigmoid transfer function and applied to select the most important features. The results in [34] showed that OHHO outperforms the other competitors.

It has become clear that selecting the most effective and informative features is an important step in enhancing classification performance. As illustrated in [35], the authors concluded that their system's excellent performance was due to the selection of more appropriate features. Also, in [36], a feature selection method called minimal redundancy maximal relevance feature selection (MRMRFS) was used to increase classification accuracy by selecting the most appropriate features from the breast cancer datasets. The experimental results in [36] show that selecting better features with MRMRFS enhanced the SVM's performance, resulting in an accuracy of 99.71%.

A lot of research has focused on combining feature selection techniques to obtain further performance improvements. In [37], a new hybrid method using the grey wolf optimizer and rough sets (GWORS) was proposed for feature selection. GWORS is based on two main processes: feature extraction and feature selection. Firstly, texture, intensity, and shape-based features were extracted from mass-segmented mammogram images; GWORS was then applied to select the most effective and informative of these features. The results in [37] indicated that GWORS is superior to other methods in terms of accuracy, F-measure, and the receiver operating characteristic curve.

In [38], a novel intelligent breast cancer diagnosis approach was proposed, introducing a new feature selection method called information gain with a simulated annealing genetic algorithm wrapper (IGSAGAW). Firstly, the extracted features were ranked using the information gain (IG) algorithm. Then, according to the importance of the features, SAGAW was used to select the top m optimal features to feed the classifier. IGSAGAW not only reduces the complexity of the classification process but also maximizes classification accuracy and minimizes misclassification cost.

In [39], a new feature selection method for classifying patients into malignant or benign cases was presented. Firstly, the extracted features were ranked using a new feature importance index based on principal component analysis (PCA) and the Bhattacharyya distance (BD); the combination of the two highlights features with higher variance and discriminatory power. The method then iteratively classified patient records using three classification techniques: K-nearest neighbor (KNN), linear discriminant analysis (LDA), and a probabilistic neural network (PNN). At each iteration the index eliminated the least significant feature, and classification was carried out on the remaining features until only one feature was left. The results in [39] demonstrated the effectiveness of the proposed method.

As illustrated in [40], a novel hybrid feature selection algorithm (HFSA) based on ReliefF and an entropy-based genetic algorithm (EGA) was proposed for breast cancer diagnosis. HFSA combines the advantages of filter and wrapper methods to deal with high-dimensional, uncertain datasets. Firstly, the weight of each feature was calculated and evaluated using the ReliefF ranking method as a filter; then, to remove irrelevant features more effectively, EGA was used as a wrapper. The experimental results in [40] proved that the proposed method not only generates a small subset of informative and significant features but also provides significant classification accuracy on large datasets. Table 1 gives a brief comparison of the current feature selection methods.

Table 1 A brief comparison of the current feature selection methods

4 The proposed patients detection strategy (PDS)

Over the years, breast cancer has risen to prominence as one of the most prevalent diseases affecting females worldwide. Therefore, automatic detection and classification of breast cancer patients is essential, particularly when a quick and accurate decision is required [41]. Automatic diagnosis of breast cancer patients can reduce the mortality rate by decreasing the time radiologists spend on examination [4, 42, 43]. In this paper, a new strategy for detecting breast cancer patients, called the patients detection strategy (PDS), is introduced. As shown in Fig. 4, the proposed PDS is split into two phases: the data preprocessing phase (DP2) and the patient detection phase (PDP).

Fig. 4 Patients detection strategy (PDS)

4.1 Data preprocessing phase (DP2)

The main aim of DP2 is to clean the collected patient historical data. To achieve this goal, data mining techniques perform two main processes, outlier rejection and feature selection, which result in a meaningful data pattern. Firstly, the patient's mammogram images are converted into a set of features using the GLCM; the outlier items are then eliminated during the outlier rejection stage. Outliers (i.e., bad data) are filtered out, and only the most useful features are carried forward to the next phase. In the outlier rejection stage, noisy data with unusual behavior are rejected or removed. Generally, there are two main types of outlier rejection methods: the classic outlier approach and the spatial outlier approach [24, 25]. The classic outlier approach examines outliers based on a transaction dataset, which is made up of a collection of items; it can be divided into five classes: statistical-based, distance-based, deviation-based, density-based, and depth-based approaches [44]. The spatial outlier approach, on the other hand, analyzes outliers based on a spatial dataset, i.e., a collection of spatially referenced objects; it is divided into two classes: space-based and graph-based approaches. A sketch of this stage follows.
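The sketch below illustrates the two DP2 steps on synthetic patches: GLCM texture features are extracted with scikit-image (function spelling per version 0.19 and later), and a simple z-score rule stands in for the outlier-rejection stage, since the text does not pin the stage to one specific rejection rule here. The patch size, distance/angle settings, and threshold are illustrative assumptions.

```python
# Sketch of DP2: GLCM texture features per ROI, then a z-score rule as a
# stand-in outlier-rejection step; all settings are illustrative assumptions.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(roi):
    """roi: 2-D uint8 patch -> vector of GLCM texture features."""
    glcm = graycomatrix(roi, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    feats = [graycoprops(glcm, p)[0, 0]
             for p in ("homogeneity", "energy", "contrast", "correlation")]
    p = glcm[:, :, 0, 0]
    feats.append(-np.sum(p[p > 0] * np.log2(p[p > 0])))  # entropy, by hand
    return np.array(feats)

def reject_outliers(F, z_thresh=3.0):
    """Drop rows in which any feature lies beyond z_thresh std deviations."""
    z = np.abs((F - F.mean(axis=0)) / F.std(axis=0))
    return F[(z < z_thresh).all(axis=1)]

rois = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(50)]
F = np.vstack([glcm_features(r) for r in rois])
print(F.shape, "->", reject_outliers(F).shape)
```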

On the other hand, removing irrelevant or insignificant features during the feature selection process is crucial to enhance the classifier's efficiency. By allowing fewer features to be considered, a good feature selection methodology undoubtedly increases the model's effectiveness while also accelerating the classification process. Generally, feature selection approaches can be classified into two classes: filter and wrapper approaches. In conclusion, performing outlier rejection and feature selection is critical to provide efficient and relevant features for the next phase.

4.2 Patients detection phase (PDP)

For women, breast cancer is a leading cause of death, and a precise diagnosis is one of the most important first steps in treating it. During PDP, the cleaned data are used to provide a rapid and accurate classification result. To perform the classification process well, a suitable classifier must be selected. Classification is a data mining (ML) technique used to predict group membership for data instances [15]. ML is a useful tool for predicting medical conditions and assisting radiologists in making informed medical decisions [15]. A wrong diagnosis of breast cancer cases allows the disease to spread day by day. Conversely, using only the most important and effective features helps the classifier make accurate and fast decisions, thereby enhancing diagnostic accuracy. The next section introduces the proposed feature selection method for selecting the most effective and informative features.

5 The proposed feature selection method

One of the main causes of overfitting is the presence of irrelevant features in the input dataset, particularly in the medical diagnosis of breast cancer patients [15, 45]. The main target of the proposed method is to identify the most informative features for classifying breast cancer patients. The diagnostic model's accuracy can be improved by excluding the least relevant features from the input. Feature selection should therefore be performed before learning the diagnostic model, making the model faster, more cost-effective, and better performing [46,47,48]. This section introduces a new hybrid feature selection method (NHFSM) that accurately selects the most informative features for classifying breast cancer patients. It is made up of two modules: (1) the quick selection module (QSM), which uses information gain (IG) as a filter method to preselect the most informative features quickly (sketched below), and (2) the feature selection module (FSM), which uses a hybrid bat algorithm and particle swarm optimization (HBAPSO) as a wrapper method.
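A minimal sketch of the QSM idea is given below: the information gain of each feature with respect to the class label is computed on a discretized copy of the feature, and the top-ranked fraction is retained. The bin count and the retained fraction are illustrative assumptions, as the text does not specify them at this point.

```python
# Sketch of QSM preselection: information gain per (discretized) feature;
# the bin count and keep ratio are illustrative assumptions.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, bins=10):
    """IG(class; feature) = H(class) - H(class | binned feature)."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins))
    h_cond = sum((binned == b).mean() * entropy(labels[binned == b])
                 for b in np.unique(binned))
    return entropy(labels) - h_cond

def qsm(X, y, keep_ratio=0.6):
    ig = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    k = max(1, int(keep_ratio * X.shape[1]))
    return np.argsort(ig)[::-1][:k]              # indices of retained features

X = np.random.rand(100, 12)
y = np.random.randint(0, 2, 100)
print("features kept by QSM:", qsm(X, y))
```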

The main aim of FSM is to pick the most informative features out of the preselected features to promote classification accuracy. To this end, FSM uses a combination of the bat algorithm (BA) and particle swarm optimization (PSO), called the hybrid bat algorithm and particle swarm optimization (HBAPSO). PSO is used in many applications for solving optimization problems, especially feature selection; however, it suffers from premature convergence and often requires considerable computational time [49, 50]. Unlike PSO, BA is good at controlling the exploration and exploitation of the search space, and it takes less time to compute [51, 52]. In this paper, PSO is improved using BA to select the most relevant subset of features and thereby improve the performance of the breast cancer classification model.

Generally, the use of hybrid methods to solve optimization problems is a recent and successful trend. This paper's primary objective is to develop a method that combines the advantages of multiple algorithms to achieve superior performance. HBAPSO integrates PSO's fast convergence with BA's low computational cost to find the best solution. Hence, PSO's velocity update in Eq. (1) is modified using the bat frequency of Eq. (5). Since BA was designed for continuous search spaces, it is converted into binary BA (BBA) to address discrete optimization problems such as feature selection. BA is transformed into BBA using a sigmoid transfer function, which maps the velocity from the continuous space into the discrete search space [52]. Consequently, the bat's velocity is calculated as in conventional BA, but the bats' current positions, best positions, and the global best position take only binary values (0 or 1). Figure 5 depicts the proposed method's sequential steps.

Fig. 5 The sequential steps of NHFSM

At the beginning, after the features have been extracted from the mammogram images, they are passed to QSM, which quickly eliminates the non-informative features using IG as a preselection stage. The initial population of BA is then generated randomly, and the same mechanism used in PSO is applied to it. Each bat is first evaluated using a naïve Bayes (NB) classifier; NB is chosen as the standard evaluation classifier because it has no parameters to adjust. The aim of evaluating each bat is to elect the most significant features characterizing breast cancer based on the resulting accuracy, called the fitness degree. The fitness degree ranks the bats by classification accuracy to find the best solution. The fitness degree of each bat, \({\text{Fit}}\left( {X_{i} } \right)\), is calculated by (9) [15]:

$${\text{Fit}}\left( {X_{i} } \right) = {\text{accuracy}}\left( {X_{i} } \right)$$
(9)

where \({\text{accuracy}}\left( {X_{i} } \right)\) denotes the classification accuracy obtained with the subset of features encoded by the ith bat. Based on the fitness value of every bat in P, the fitness degrees are sorted, the best positions are saved, and each bat's own best solution Xpersonal and the best solution among all bats XGlobal are updated using (10) and (11):

$$X_{{{\text{Personal}}}} \left( {X_{i} } \right) = X_{{{\text{pi}}}} = \left\{ {\begin{array}{*{20}c} {X_{i } } & {{\text{if\,}}( {\text{Fit}}\left( {X_{i} } \right) > {\text{Fit}}(X_{{{\text{pi}}}} ))} \\ {X_{{{\text{pi}}}} } & {{\text{otherwise}}} \\ \end{array} } \right.$$
(10)
$$X_{{{\text{Global}}}} = X_{{\text{G}}} = \left\{ {\begin{array}{*{20}c} {X_{{{\text{pi }} }} } & {{\text{if\,}}( {\text{Fit}}\left( {X_{{{\text{pi}}}} } \right) > {\text{Fit}}(X_{{\text{pi + 1}}} ))} \\ {X_{{\text{pi + 1}}} } & {{\text{otherwise}}} \\ \end{array} } \right.$$
(11)

where XPersonal(Xi) denotes the best solution of the ith bat and Xi its current position. Xpi represents the best position of the ith bat; Fit(Xi) is the fitness value of the ith bat at its current position, and Fit(Xpi) at its best position. XGlobal is the best particle in the whole population P; Fit(Xpi+1) is the fitness value of the (i + 1)th bat at its best position, and Xpi+1 is the personal best position of the (i + 1)th bat. In NHFSM, the best personal position at the current iteration, \(X_{{{\text{pi}}}}\), is assigned as \(P_{{{\text{pi}}}} \left( t \right)\), and the best position among all solutions is assigned as \(X_{{\text{G}}} \left( t \right)\). Furthermore, PSO's velocity is modified using the bat frequency \(f_{i}\), as presented in (12):

$$VX_{i} \left( {t + 1} \right) = w*VX_{i} \left( t \right) + \left( {c_{1} r_{1} \left( {P_{{{\text{pi}}}} \left( t \right) - X_{i} \left( t \right)} \right)} \right) + \left( {c_{2} r_{2} \left( {X_{{\text{G}}} \left( t \right) - X_{i} \left( t \right)} \right)} \right)f_{i}$$
(12)

Hybridizing BA and PSO performs well because PSO can quickly identify promising areas of the search space, while BA can quickly obtain the best solution within those areas. After the velocity of each bat in the population has been calculated using the enhanced PSO velocity update, that velocity defines the probability distribution from which the bat's new position is sampled. Hence, to obtain a new bat position in binary form, the sigmoid function is applied to the velocity of the search agent, as given by (13):

$$X_{i}^{j} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}c} 0 & {{\text{if}}\;{\text{rand}}\left( {0,1} \right) \ge {\text{sig}}\left( {VX_{i}^{j} } \right)} \\ 1 & { {\text{otherwise}}} \\ \end{array} } \right.$$
(13)

where Xij(t + 1) indicates the value of the jth bit of the ith bat at the next iteration (t + 1), with j = 1, 2, 3, …, m, and rand(0,1) is a random value in [0, 1]. Moreover, sig(VXij) is the sigmoid transfer function giving the probability that the jth bit takes the value 0 or 1; it is determined by (14):

$${\text{sig}}\left( {VX_{i}^{j} } \right) = \frac{1}{{1 + e^{{ - VX_{i}^{j} }} }}$$
(14)

Based on the new position \(X_{i}^{j} \left( {t + 1} \right)\) of each bat in P, each bat is evaluated using the fitness function defined by Eq. (9). After all the bats' velocities and positions have been updated, a new solution is generated around the best one using Eq. (6) whenever a generated random number is greater than the pulse emission rate. Moreover, if (rand < Ai and Fit(Xi) > Fit(XG)), the new solution is accepted and the loudness and emission rate are updated using Eq. (7) and Eq. (8). These computations are repeated until the maximum number of generations is reached. Finally, the algorithm outputs the population-wide best bat solution XGlobal and stops. All features denoted by 1 are the most significant features for breast cancer classification. After applying the NHFSM algorithm to the breast dataset, seven features are chosen: homogeneity (HOM), energy (ENG), entropy (ENT), contrast (CON), correlation (COR), cluster shade (CS), and cluster prominence (CP).

For the implementation of NHFSM, suppose there is an m-dimensional feature space, FS = {f1, f2, …, fm}. The input training data of g objects (patients) can be expressed as W = {Y1, Y2, …, Yg}, and the testing data of h objects as K = {R1, R2, …, Rh}. Each object Yi ∈ W and Rj ∈ K is expressed as an ordered set of m features: Yi(f1, f2, …, fm) = [f1i, f2i, …, fmi] and Rj(f1, f2, …, fm) = [f1j, f2j, …, fmj]. Hence, each object Yi and Rj is represented in an m-dimensional feature space. For the breast cancer classification problem, reducing the m dimensions by removing unrelated features from the breast dataset is vital to prevent overfitting and promote the performance of the classification model. Algorithm 3 illustrates the sequential steps of NHFSM, and a condensed sketch follows.

Algorithm 3 The sequential steps of NHFSM
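The following condensed Python sketch walks through one reading of the FSM loop, combining Eqs. (5) and (9) to (14): binary bat positions are scored by naïve Bayes accuracy, the velocity update of Eq. (12) is applied with the frequency factor scaling the social term as printed, and positions are re-binarized through the sigmoid rule of Eqs. (13) and (14). The loudness and pulse-rate bookkeeping is omitted for brevity, and the population size, iteration budget, and random data are illustrative assumptions.

```python
# Condensed sketch of the FSM/HBAPSO loop (Eqs. (5), (9)-(14)); loudness/
# pulse-rate bookkeeping omitted, parameters and data are illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def fitness(mask, X, y):                         # Eq. (9): NB accuracy
    if not mask.any():
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

def hbapso(X, y, n_bats=10, iters=30, w=0.9, c1=2.0, c2=2.0,
           f_min=0.0, f_max=2.0, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    P = rng.integers(0, 2, (n_bats, m))          # binary bat positions
    V = np.zeros((n_bats, m))
    pbest = P.copy()                             # personal bests, Eq. (10)
    pfit = np.array([fitness(b, X, y) for b in P])
    g = pbest[np.argmax(pfit)].copy()            # global best, Eq. (11)
    for _ in range(iters):
        for i in range(n_bats):
            f_i = f_min + (f_max - f_min) * rng.random()           # Eq. (5)
            r1, r2 = rng.random(m), rng.random(m)
            V[i] = (w * V[i] + c1 * r1 * (pbest[i] - P[i])
                    + c2 * r2 * (g - P[i]) * f_i)                  # Eq. (12)
            sig = 1.0 / (1.0 + np.exp(-V[i]))                      # Eq. (14)
            P[i] = (rng.random(m) < sig).astype(int)               # Eq. (13)
            new_fit = fitness(P[i], X, y)
            if new_fit > pfit[i]:                                  # Eq. (10)
                pbest[i], pfit[i] = P[i].copy(), new_fit
        g = pbest[np.argmax(pfit)].copy()                          # Eq. (11)
    return g

X = np.random.rand(80, 12)
y = np.random.randint(0, 2, 80)
print("selected feature mask:", hbapso(X, y))
```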

6 Experimental results

In this section, the main contribution of the proposed patients detection strategy (PDS), namely the new hybrid feature selection method (NHFSM), is evaluated. NHFSM is composed of two modules: the quick selection module (QSM) and the feature selection module (FSM). In QSM, information gain (IG) is used to quickly eliminate the least significant features. In FSM, the proposed hybrid bat algorithm and particle swarm optimization (HBAPSO) selects the most informative features extracted from the mammogram images. Our implementation is based on the MIAS dataset [53, 54], a publicly available dataset (also hosted on Kaggle) used to produce the results presented in this paper. It contains 322 cases: 209 normal, 61 benign, and 52 malignant. The data are split into two parts, training and testing; the training set is used to fit the model, and the testing set is used to measure how well the proposed model works. Accordingly, 226 patients (70%) are used for training and 96 (30%) for testing, as shown in Table 2, and a sketch of this split follows the tables. Table 3 lists the applied parameters and the values used.

Table 2 MIAS database image selection for training and testing
Table 3 The applied parameters with the corresponding values
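The 70/30 split described above can be reproduced with a stratified split, as in the short sketch below; the feature matrix here is a placeholder for the seven GLCM features, and the use of scikit-learn's train_test_split is an illustrative assumption.

```python
# Sketch of the 226/96 MIAS split; features are placeholders for the
# seven GLCM descriptors, and the stratified split is an assumption.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 209 + [1] * 61 + [2] * 52)    # 322 MIAS cases
X = np.random.rand(len(y), 7)                    # placeholder feature vectors

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=226, stratify=y, random_state=0)
print(len(y_tr), "training /", len(y_te), "testing")   # 226 / 96
```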

6.1 Evaluation metrics

To ensure the reliability of the proposed method, accuracy, error, precision, sensitivity, F-measure, micro-average, and macro-average are taken as the evaluation metrics. Table 4 shows the classification outcomes of the system [45, 55], and the evaluation metrics themselves are defined in Table 5 [45, 55]; a sketch computing them follows.

Table 4 The classification outcomes of the system
Table 5 Performance evaluation metrics
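For reference, the metrics of Table 5 can be computed as in the sketch below, using scikit-learn's standard definitions of the micro and macro averages on hypothetical three-class predictions; the random labels are placeholders, not the paper's results.

```python
# Sketch of the Table 5 metrics on hypothetical predictions; random
# labels are placeholders for actual classifier output.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, 96)                  # 3 classes, 96 test cases
y_pred = rng.integers(0, 3, 96)

acc = accuracy_score(y_true, y_pred)
print("accuracy        :", acc)
print("error           :", 1 - acc)
print("macro precision :", precision_score(y_true, y_pred, average="macro"))
print("macro recall    :", recall_score(y_true, y_pred, average="macro"))
print("micro precision :", precision_score(y_true, y_pred, average="micro"))
print("micro recall    :", recall_score(y_true, y_pred, average="micro"))
print("F-measure       :", f1_score(y_true, y_pred, average="macro"))
```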

6.2 Testing the proposed feature selection method

In this section, the proposed new hybrid feature selection method (NHFSM) is evaluated. To demonstrate its effectiveness, NHFSM is compared against several feature selection methods using NB as the standard classifier. The most recent feature selection methods used for evaluation are listed in Table 1: ACO [32], OEGWO [33], OHHO [34], MRMRFS [36], GWORS [37], and IGSAGAW [38]. The results are shown in Figs. 6, 7, 8, 9, 10, 11, 12, 13 and 14, where the proposed NHFSM shows the best performance. NHFSM improves accuracy, precision, recall, macro-average precision, macro-average recall, micro-average precision, micro-average recall, and F-measure, while reducing error, which demonstrates how well the proposed NHFSM works.

Fig. 6 Accuracy comparison: proposed method versus various feature selection methods

Fig. 7 Error comparison: proposed method versus various feature selection methods

Fig. 8 Precision comparison: proposed method versus various feature selection methods

Fig. 9 Sensitivity comparison: proposed method versus various feature selection methods

Fig. 10 Macro-average precision comparison: proposed method versus various feature selection methods

Fig. 11 Macro-average recall comparison: proposed method versus various feature selection methods

Fig. 12 Micro-average precision comparison: proposed method versus various feature selection methods

Fig. 13 Micro-average recall comparison: proposed method versus various feature selection methods

Fig. 14 F-measure comparison: proposed method versus various feature selection methods

A comparison of all models across the first group of evaluation metrics is shown in Figs. 6, 7, 8 and 9. Adding more patients to the training dataset enhances the effectiveness of all methods: the maximum precision, recall, and accuracy, and the minimum error, are obtained at the highest number of training patients. This is easy to understand: as training patients are added, more data are collected and better classification rules are developed, so, in light of the classifiers' better training, classification accuracy increases. With 226 training patients, the accuracy values for ACO, OEGWO, OHHO, MRMRFS, GWORS, IGSAGAW, and NHFSM are 0.81, 0.82, 0.88, 0.86, 0.91, 0.93, and 0.97, respectively, while the corresponding error values are 0.19, 0.18, 0.12, 0.14, 0.09, 0.07, and 0.03. NHFSM gives the highest precision value at 226 training patients, while ACO gives the lowest. Additionally, at 226 training patients, the highest recall value is 0.75, obtained by the proposed NHFSM, while the lowest is 0.61, obtained by ACO.

The results shown in Figs. 10, 11, 12 and 13 indicate that the proposed NHFSM yields the highest macro-average precision value of 0.756, while the lowest value, 0.634, is given by ACO when the number of training patients is 226. Moreover, at 226 training patients, the highest macro-average recall value is 0.7243, obtained by NHFSM, while the lowest, 0.6, is provided by ACO. At the same number of training patients, the best micro-average precision is 0.728, for NHFSM, and the worst is 0.632, for ACO; likewise, the micro-average recall of NHFSM is 0.7355, while the worst value, 0.61, belongs to ACO. According to Fig. 14, the worst F-measure value is 0.59 for ACO and the best is 0.716 for NHFSM at 226 training patients. Finally, according to Fig. 15, NHFSM is much faster than the other existing methods.

Fig. 15 Run time comparison: proposed method versus various feature selection methods

The slope of a linear regression through the data points, as described in [56], used by [57], explored by [58], and explained by [59], was applied to estimate the rate at which each method's run time grows with the number of testing patients for ACO, OEGWO, OHHO, MRMRFS, GWORS, IGSAGAW, and NHFSM. The analysis shows that the run time of ACO grows with the number of testing patients at a rate of 0.094642857, whereas OEGWO grows at the higher rate of 0.167857143. The run-time growth rates of OHHO, MRMRFS, and GWORS are 0.116071429, 0.133928571, and 0.169642857, respectively, while the run times of IGSAGAW and NHFSM grow at the lowest rate of 0.055357143. A sketch of this slope estimate follows.
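The slope estimate used above amounts to an ordinary least-squares line fit of run time against the number of testing patients, as in the sketch below; the data points are hypothetical, not the paper's measurements.

```python
# Sketch of the slope estimate: least-squares fit of run time vs. number
# of testing patients; the data points are hypothetical.
import numpy as np

n_test = np.array([20, 40, 60, 80, 96])          # hypothetical test-set sizes
runtime = np.array([2.1, 4.0, 5.8, 7.7, 9.2])    # hypothetical run times (s)

slope, intercept = np.polyfit(n_test, runtime, deg=1)
print(f"run time grows at about {slope:.4f} s per additional testing patient")
```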

The development of a computer-aided framework that can automatically and accurately diagnose breast lesions is one of the significant medical challenges for AI applications, and numerous studies have attempted to build a precise automatic system for it. Early research from 1972 to 1978 concentrated mostly on characterizing the texture of breast tissues and applying conventional machine learning techniques for classification without a good feature selection method; due to their low accuracy and sensitivity, the results of those approaches were not suitable for the precise classification of breast lesions. Work utilizing traditional ML approaches [60, 61] and DL-based methods [62,63,64] has produced results comparable to automatic systems based on the classification of a region of interest (ROI). Our proposed PDS, based on NHFSM, performs better on the MIAS dataset than the deep learning systems created in [65,66,67]. The classification performance of the proposed strategy is assessed against a number of state-of-the-art breast cancer detection systems in Table 6.

Table 6 Evaluation comparison of the proposed PDS against state-of-the-art breast cancer classification methods

7 Conclusions and future work

To answer the first research question, a literature analysis was conducted to investigate the available ML models for detecting breast cancer patients. As seen in the previous sections, there is a research need on this topic, since the performance of previous ML models was subpar. This research developed a novel strategy for detecting breast cancer patients, the patients detection strategy (PDS). Initially, features were extracted from the mammogram images using the GLCM and the outlier items were eliminated. Then, to answer the second research question of whether a good feature selection process is important for breast cancer detection, the most efficacious and important features were selected using the proposed new hybrid feature selection method (NHFSM). NHFSM combines the advantages of both filter and wrapper selection methods: first, feature redundancy is reduced and the relevant features are selected using IG as a preselection stage; then, these features are processed using the hybrid bat algorithm and particle swarm optimization (HBAPSO), and the optimized feature subset is selected according to the maximum fitness value. Finally, MATLAB was used to evaluate the strategy's effectiveness; it yields accurate results compared to the other competitors in terms of accuracy, precision, sensitivity/recall, F-measure, and error rate, providing about 0.97, 0.76, 0.75, 0.716, and 0.03, respectively. The reason is that the proposed NHFSM is a hybrid method that combines two efficient and accurate algorithms, PSO and BA.

In the future, breast cancer prediction may be further enhanced by finding an excellent classifier to complete the detection strategy. Additionally, work on metaheuristics can be expanded to improve the accuracy of the proposed feature selection approach as far as feasible (Tables 7, 8, 9, 10, 11, 12, 13, 14, 15, 16).

Table 7 Accuracy comparison: proposed method versus various feature selection methods
Table 8 Error comparison: proposed method versus various feature selection methods
Table 9 Precision comparison: proposed method versus various feature selection methods
Table 10 Sensitivity/recall comparison: proposed method versus various feature selection methods
Table 11 Macro-average precision comparison: proposed method versus various feature selection methods
Table 12 Macro-average recall comparison: proposed method versus various feature selection methods
Table 13 Micro-average precision comparison: proposed method versus various feature selection methods
Table 14 Micro-average recall comparison: proposed method versus various feature selection methods
Table 15 F-measure comparison: proposed method versus various feature selection methods
Table 16 Run time comparison: proposed method versus various feature selection methods