1 Introduction

Anemia is a global health problem [1]. It particularly affects young children and pregnant women. The World Health Organization estimates that 42% of children under 5 years of age and 40% of pregnant women worldwide are anemic [2].

Anemia is characterized by a decrease in the number of red blood cells or in the hemoglobin level of the blood, and the thresholds used to detect it depend on parameters such as sex, age, pregnancy, and nutrition. Anemia is therefore defined as a hemoglobin value below the appropriate reference range. The measurement of the values related to the cells circulating in the blood is called the complete blood count (hemogram). The complete blood count flags blood values as low or high according to the reference range.

Both the diagnosis and the treatment of anemia are decided by doctors. In order to diagnose anemia accurately, blood tests, radiological images, and other findings must be reviewed by the doctor. Diseases generate large amounts of medical data, from which alternative solutions can be produced, such as detecting diseases at an early stage, prescribing appropriate drugs, and intervening before the disease reaches a critical phase. Consequently, diagnoses for new patients can be supported by the medical data previously obtained from other patients. This helps doctors minimize the margin of error in their diagnoses and supports their decision making. Therefore, the evaluation of data records in health institutions is of great importance for patients and hospitals. However, these processes can be difficult and costly, especially in underdeveloped countries.

There are many studies on designing decision support systems for doctors for new patients by evaluating data records in hospitals with biomedical image processing, biomedical signal processing, biomedical digital data processing, etc. [3,4,5,6]. In image processing studies, medical images (magnetic resonance imaging (MRI), computerized tomography (CT) scans, etc.) have been analyzed and systems have been developed to help doctors make better treatment decisions [7,8,9]. Signal processing studies aim to develop systems that help doctors by analyzing and interpreting medical signals (electrocardiography (ECG), electroencephalography (EEG), etc.) [10, 11]. In studies on digital data processing, digital data (blood count, C-reactive protein (CRP) level, etc.) from patients are usually processed and systems have been developed to help doctors respond faster and more accurately for new patients. In addition to classical methods such as support vector machines, Naïve Bayes, regression, and k-nearest neighbors, artificial intelligence-based methods such as artificial neural networks, deep learning, and random forests have started to be used in studies [12, 13].

Optimization methods have an important place in the solution of engineering problems. Modeling a problem has become an area where optimization methods are frequently used. Finding the model parameters that best represent the problem is a very important step for modeling the problem. For this reason, mathematical modeling is needed in areas such as data analysis, control system design, machine learning, etc. [14,15,16].

Engineering problems are faced with increasing levels of complexity day by day. Classical methods often fail in the optimization of complex systems because of difficulties with high-dimensional problems, local minima, and the fact that many classical methods are designed only for differentiable problems. Therefore, the need for new optimization methods inspired by nature is increasing. These methods, which tend to perform better on complex problems, can deal with non-continuous problems and are less sensitive to local minima [17, 18]. Examples of nature-inspired algorithms frequently used in optimization are crow search optimization (CSO), the chicken swarm algorithm (CSA), JAYA, ant colony optimization, the Harris hawks algorithm (HHA), the artificial bee colony (ABC) algorithm, etc. [19, 20].

The multivariate adaptive regression splines (MARS) method, another mathematical modeling approach preferred for analyzing complex datasets, has frequently been used in prediction, analysis, and classification studies [21]. The studies in which the MARS method has been applied show that this machine learning approach can produce good prediction models for engineering datasets [22,23,24].

These methods may not perform well on every dataset because the datasets used in different studies have different characteristics. Factors such as the properties and number of parameters and the number of patient records in a dataset significantly affect the success of anemia classification methods. It is therefore essential to develop new techniques, because the properties of the studied datasets and the number or size of their parameters may differ [13]. In machine learning, image processing, biomedical applications, robotics, natural language processing, and other fields, both classical and metaheuristic methods have been successfully used to classify data with different parameters and feature structures [25,26,27]. In the literature, methods have been developed from different perspectives and analyzed under various scenarios, such as linear, quadratic, and exponential forms, in order to model the relationships between the parameters in the datasets [28, 29]. The model weight values in the constructed scenarios were calculated by optimizing the proposed methods according to the objective function, and disease classification was performed with the weight values that yielded the lowest error.

For these reasons, new approaches and algorithms need to be developed to predict anemia. There are many data mining methods used in anemia diagnosis in the literature. These are: learning vector quantization neural network (LVQ), k-nearest neighbors (k-NN), multiple linear regression (MLR), logistic regression (LR), fuzzy logic (FL), artificial neural networks (ANN), etc. [30, 31].

In this study, 1732 blood records from the Kaggle database were analyzed using the Harris hawks algorithm (HHA), a nature-inspired evolutionary algorithm, and the MARS algorithm, a classical mathematical modeling method. The proposed methods are analyzed under 6 different scenarios: multilinear form HHA, multilinear quadratic form HHA, multi-exponential form HHA, a first-order MARS model, a second-order MARS model, and a MARS model tuned to obtain the best degree and pruning coefficient. In the 3 MARS models, the pruning parameter and degree values, which have a significant effect on the performance of the MARS method, enable the model to learn different relationships and capture complex structures; in the other 3 models, the effects of the parameters on the classification success of the problem modeled in linear, quadratic, and exponential form were optimized with the HHA method and the most appropriate weight values were obtained. To the best of our knowledge, no anemia classification study has been performed using the MARS method and a parameter estimation method based on mathematical modeling with HHA.

2 Literature review

With the help of artificial neural networks and decision trees developed by genetic programming, an average performance of 90% was obtained in tests performed for the classification of thalassemia (Mediterranean anemia) [32]. In a 2008 study, a decision support system was designed to help physicians with iron deficiency anemia [33]. Anemia (+) and Anemia (−) results were evaluated at the end of the procedure, and the decisions of the decision support system completely coincided with those of the doctors. Serum iron, serum iron-binding capacity, and ferritin were used as parameters in that study, whereas six different blood parameters, namely HGB, RBC, MCH, MCHC, WBC, and HCT, are used in our study. In a study conducted in 2011, anemia prediction and classification were analyzed using data mining techniques; J48 and sequential minimal optimization (SMO) classification methods were applied in Weka, and the C4.5 decision tree algorithm (CDTA) and support vector machine (SVM) were studied [34]. Another study designed a neuro-fuzzy network to determine the level of anemia in a child [35]. With this system, which was developed after statistical measurements, the root mean square of the errors was found to be 0.2743. In 2012, artificial neural networks and an adaptive neuro-fuzzy inference system (ANFIS) were developed to predict iron deficiency anemia based on four laboratory values: mean erythrocyte volume (MCV), mean cellular hemoglobin (MCH), mean cellular hemoglobin concentration (MCHC), and red blood cell count (RBC) [36]. In a study on iron deficiency anemia in women, feedforward networks (FFN), cascade forward networks (CFN), distributed delay networks (DDN), probabilistic neural networks (PNN), and LVQ were used [37]. Another article presents the classification of blood characteristics with a CDTA, a Bayesian classifier, and a multilayer perceptron (MP) [38]. In that study, which classified eighteen thalassemia types with high prevalence in Thailand, the best classification performance was obtained with the Naïve Bayes (NB) classifier, followed by the multilayer perceptron. A further study used machine learning algorithms for the detection of anemia [39]; ANN, SVM, and statistical model methods were applied to the diagnosis of iron deficiency. Classification algorithms such as NB, MP, J48, and SMO were used with the WEKA data mining tool [40], and the J48 decision tree algorithm (JDTA) showed the best performance. Deep learning methodologies were used to increase the performance of white blood cell (WBC) identification systems, and a new WBC recognition system was proposed based on deep learning theory [41]. In a 2018 study, an easy-to-use and inexpensive device was developed to determine the anemia status of patients, preventing patients from having to go to the laboratory frequently and allowing a large number of people to be screened for anemia [42]. A strong correlation was observed between the values estimated by the device and the actual Hb values obtained from blood samples. The k-NN classification algorithm was used to assess anemia status and gave good results, allowing doctors to avoid a significant number of blood tests [42]. In a study conducted in 2019, the effect of biochemistry values on iron deficiency anemia was investigated with k-NN, CDTA, and ANN methods, based on blood values reported in the literature to be relevant to iron deficiency anemia [43].
As a result, the artificial neural network method was found to have the highest performance. In another study, the machine learning algorithms linear discriminant analysis (LDA), classification and regression trees (CART), SVM, random forest (RF), k-NN, and LR were used [44]; the RF algorithm achieved the best classification accuracy. In another study conducted in 2019, a new machine learning method (HEAC—Hemoglobin Estimation and Anemia Classification) was proposed for anemia classification based on blood parameters and compared with other machine learning methods in the literature [45]. Because the symptoms of iron deficiency anemia and β thalassemia are similar, a study conducted in 2020 developed a decision support system to discriminate between them [31]. In the proposed system, LR, k-NN, SVM, extreme learning machine, and regularized extreme learning machine classification algorithms were used. As a result, in the study in which male and female patients were evaluated together, an accuracy of 96.30% was obtained for women and 94.37% for men. In a study conducted in 2021, a structure was proposed to enable the recognition of anemia under clinical practice conditions [46]. ANN, SVM, NB, and ensemble decision tree methods were used as classification algorithms. In another study, two hybrid models combining a genetic algorithm (GA) with the deep learning algorithms (DLA) stacked autoencoder (SAE) and convolutional neural network (CNN) were proposed for the prediction of some types of anemia [13]. When the performances of the proposed algorithms were evaluated, the GA-CNN algorithm performed better. A 2022 study used the synthetic minority over-sampling technique (SMOTE) to reduce the imbalance of an anemia dataset from India [47]. Then, with the help of a decision tree rule-based learning method, rules for the detection of anemia were derived using the original and SMOTE datasets.

When the studies on anemia are examined, it is seen that the use of swarm-based optimization methods is quite limited. This study evaluates the success of the HHA algorithm, one of the metaheuristics that uses swarm intelligence to solve complex optimization problems that cannot be solved by analytical methods, in obtaining weight coefficients that emphasize the importance of the parameters in the dataset, and the success of the MARS method, a classical mathematical modeling method preferred for the analysis of complex datasets, in the classification problem with different degree and pruning parameters. Both methods are tested under three different scenarios, and their success is highlighted by modeling the relationships between the dataset parameters.

3 Material and method

3.1 Dataset and preprocessing

Blood data of 1732 patients from the Kaggle database were used in the study. The dataset consists of 351 patients with anemia and 1381 patients without anemia. As shown in Table 1, the study uses 6 attributes and 2 classes, anemia (1)/healthy (0). For each patient record, the RBC value indicates the number of red blood cells in the blood, the HGB value indicates the amount of hemoglobin, the iron-rich protein stored in red blood cells, the HCT value indicates the volumetric proportion of red blood cells in the blood, the MCV value indicates the average red blood cell volume, the MCH value indicates the average amount of hemoglobin in a single red blood cell, and the MCHC value indicates the average concentration of hemoglobin in a given volume of red blood cells. Each record thus contains these 6 blood components and the corresponding anemia outcome.

Table 1 List of attributes in the dataset
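For orientation, a minimal pandas sketch of how such a dataset could be loaded and inspected is given below. The file name anemia.csv and the column names, including the Result label, are assumptions made for illustration; the paper only describes the Kaggle source and the attribute list of Table 1.

```python
import pandas as pd

# File and column names are assumptions; adapt them to the actual Kaggle export.
df = pd.read_csv("anemia.csv")
features = ["RBC", "HGB", "HCT", "MCV", "MCH", "MCHC"]

print(df["Result"].value_counts())   # expected: 1381 healthy (0) and 351 anemia (1) records
X = df[features].to_numpy()          # 1732 x 6 matrix of blood parameters
y = df["Result"].to_numpy()          # 0/1 anemia labels
```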

3.2 Harris hawks optimization method

The Harris hawks algorithm, presented by Heidari et al. in 2019, works by mathematically modeling the hunting strategy of Harris hawks [48]. A search of the literature shows that the Harris hawks method has been used in many different areas [49,50,51,52]. However, the use of HHA in studies on disease prediction in the health sector has been limited [53,54,55]. Harris hawks move in groups with a leader at their head when hunting rabbits. First, they determine the location of the prey by making reconnaissance flights; they then move on to the hunting stage. The algorithm is population-based and consists of several stages.

3.2.1 Exploration phase

During the exploration phase, the Harris hawks wait and observe. This process continues in a loop: in each cycle, the hawk in the best position provides the best solution according to the position of the prey. While the hawks wander, they perform two different exploration moves, given in Eq. 1. The value of \(q\) in the equation is a probabilistic value that indicates which exploration move will be applied [48].

$$x\left(t+1\right)=\left\{\begin{array}{ll}{x}_{\rm{rand}}(t)-{r}_{1}\left|{x}_{\rm{rand}}(t)-2{r}_{2}x(t)\right|, & q\ge 0.5\\ \left({x}_{\rm{rabbit}}(t)-{x}_{m}(t)\right)-{r}_{3}\left(\text{LB}+{r}_{4}(\text{UB}-\text{LB})\right), & q<0.5\end{array}\right.$$
(1)

In the equation, \(x(t+1)\) is the vector indicating the position of the Harris hawk in the next iteration, \({x}_{\rm{rabbit}}\) is the vector indicating the position of the prey, \({r}_{1}\), \({r}_{2}\), \({r}_{3}\), \({r}_{4}\), and \(q\) are random numbers in (0, 1), and \(x(t)\) is the vector giving the current position of the hawk. \(\text{LB}\) and \(\text{UB}\) are the lower and upper bounds of the positions. A hawk chosen randomly from the population is \({x}_{\rm{rand}}(t)\), and the average position of the current hawk population is \({x}_{m}(t)\). The average position is found using Eq. 2 [3].

$${x}_{m}\left(t\right)=\frac{1}{N}\sum_{i=1}^{N}{x}_{i}(t)$$
(2)

The \(N\) value given in Eq. 2 indicates the number of hawks, while \(t\) is the current iteration.
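As an illustration, the exploration update of Eqs. 1 and 2 can be sketched in Python as follows. The function name and the NumPy population matrix X (one row per hawk) are our own notation for this sketch, not part of the original algorithm description.

```python
import numpy as np

def exploration_step(X, i, x_rabbit, lb, ub, rng):
    """Candidate position of hawk i in the exploration phase (Eqs. 1 and 2)."""
    r1, r2, r3, r4, q = rng.random(5)          # random numbers in [0, 1)
    x_rand = X[rng.integers(len(X))]           # randomly selected hawk of the population
    x_mean = X.mean(axis=0)                    # average position x_m(t) of Eq. 2
    if q >= 0.5:
        return x_rand - r1 * np.abs(x_rand - 2.0 * r2 * X[i])
    return (x_rabbit - x_mean) - r3 * (lb + r4 * (ub - lb))
```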

3.2.2 Transition from exploration to exploitation

After the hawks complete the exploration phase, they perform different attacks according to the energies of their prey. The decrease in the energy of the prey during escape is stated in Eq. 3 [48].

$$E=2{E}_{0}\left(1-\frac{t}{T}\right)$$
(3)

The \(E\) value in Eq. 3 indicates the escaping energy of the prey, \({E}_{0}\) its initial energy, \(t\) the current iteration, and \(T\) the maximum number of iterations.

3.2.3 Exploitation phase

This is the stage where the hawks make a surprise pounce on their prey. In response to the hawks' attack, the rabbits try to escape, and the Harris hawks employ different strategies against these escape attempts. In the algorithm, these strategies are designed in 4 different ways.

The first strategy is called the "soft besiege," where the Harris hawk tries to de-energize its prey by making deceptive leaps (\(r\ge 0.5\), \(|E|\ge 0.5\)). The soft besiege strategy is given in Eqs. 4 and 5 [48].

$$x\left(t+1\right)=\Delta x\left(t\right)-E\left|J{X}_{\rm{rabbit}}(t)-x(t)\right|$$
(4)
$$\Delta x\left(t\right)={x}_{\rm{rabbit}}(t)-x(t)$$
(5)

Considering Eqs. 4 and 5, \(r\) indicates the chance of the prey escaping before the surprise pounce, and \(E\) is the energy of the rabbit. The difference between the position of the rabbit and the hawk's position at iteration \(t\) is \(\Delta x(t)\). The value \(J\) is a random jump strength that changes in each iteration to simulate natural rabbit movement [48].

The second strategy is called the "hard besiege," where the prey's energy is greatly diminished (\(r\ge 0.5\), \(|E|<0.5\)) and the hawk encircles it closely before making the surprise pounce. This situation is shown in Eq. 6 [48].

$$x\left(t+1\right)={x}_{\rm{rabbit}}\left(t\right)-E|\Delta x(t)|$$
(6)

The third strategy is the "soft besiege with progressive rapid dives," which is applied when the prey still has enough energy to escape (\(r<0.5\), \(|E|\ge 0.5\)). The hawks perform a softer siege before the surprise pounce than in the hard besiege and decide on their next move before diving, as shown in Eq. 7 [48].

$$Y={x}_{\rm{rabbit}}\left(t\right)-E|J{x}_{\rm{rabbit}}(t)-x(t)|$$
(7)

In order to decide whether the next dive will be a good move, the hawks compare it with the previous dive. If it is not suitable, they perform a sudden, irregular dive based on Lévy flight, as given in Eq. 8.

$$Z=Y+S\times LF(D)$$
(8)

According to Eq. 8, the problem dimension is \(D\), and \(S\) is a random vector of size \(1\times D\). \(Y\) indicates the position computed according to the prey's decreasing energy, and \(Z\) is the candidate position used to decide whether the hawk will dive toward its prey. \(\text{LF}\) is the Lévy flight function, computed using Eq. 9 [48].

$$\text{LF}(x)=0.01\times \frac{u\times \sigma }{|v|^{\frac{1}{\beta }}},\quad \sigma =\left(\frac{\Gamma (1+\beta )\times \sin \left(\frac{\pi \beta }{2}\right)}{\Gamma \left(\frac{1+\beta }{2}\right)\times \beta \times {2}^{\left(\frac{\beta -1}{2}\right)}}\right)^{\frac{1}{\beta }}$$
(9)

According to Eq. 9, \(u\) and \(v\) are random numbers in (0, 1), and \(\beta\) is a constant set to 1.5. Equations 10 and 11 are used to update the positions of the hawks [48].

$$x(t+1)=Y\quad \text{if}\;\;F(Y)<F(x(t))$$
(10)
$$x(t+1)=Z\quad \text{if}\;\;F(Z)<F(x(t))$$
(11)

The values \(Y\) and \(Z\) in Eqs. 10 and 11 are obtained from Eqs. 7 and 8 [48].

The last strategy is called the "hard besiege with progressive rapid dives," applied when the prey does not have enough energy to escape (\(r<0.5\), \(|E|<0.5\)). The hawk performs a tight siege before the surprise pounce to capture the prey. The hard besiege case is given in Eqs. 12 and 13 [48].

$$x(t+1)=Y^{\prime}\quad \text{if}\;\;F(Y^{\prime})<F(x(t))$$
(12)
$$x(t+1)=Z^{\prime}\quad \text{if}\;\;F(Z^{\prime})<F(x(t))$$
(13)

The values \(Y^{\prime}\) and \(Z^{\prime}\) are determined from Eqs. 14 and 15 [48]:

$$Y^{\prime}={x}_{\rm{rabbit}}(t)-E|J{x}_{\rm{rabbit}}(t)-{x}_{m}(t)|$$
(14)
$$Z^{\prime}=Y^{\prime}+S\times LF(D)$$
(15)

3.3 Multivariate adaptive regression spline (MARS)

The linear regression model is very important in solving many problems. However, real-life problems often have a nonlinear structure, which linear models cannot represent well. Nonparametric regression is used to characterize such structures [56, 57]. When the number of independent variables in the model is large, however, classical nonparametric regression forms are neither practical nor easily interpretable. MARS, developed by Friedman in 1991, is a form of nonparametric regression that is useful for fitting nonlinear multivariate functions and does not suffer from these disadvantages when the number of independent variables is large.

MARS, which is nonparametric and does not assume a functional relationship between dependent and independent variables, has been used in many engineering problems [58, 59]. Instead of a mathematical relationship, it creates a dynamic relationship between cause and effect variables [60]. It builds a flexible regression model using basis functions corresponding to different ranges of independent variable [57, 61]. The model consists of data-driven basis functions and the coefficients associated with these bases. It divides the independent variable values into regions and associates each region with a regression equation.

General MARS model is defined as [61]

$$Y={\beta }_{0}+\sum_{k=1}^{K}{a}_{k}{\beta }_{k}({X}_{t})+{\varepsilon }_{i}$$
(16)

where \(k\) is the knot number, \(K\) is the number of basis functions, \(X\) is the independent variable, \({a}_{k}\) is the coefficient of the \(k\)th basis function, \({\beta }_{0}\) is a constant term, and \({\beta }_{k}({X}_{t})\) denotes the \(k\)th basis function evaluated at the independent variable.

For \(k=1,2,\ldots,K\), the basis function has the following form [61]

$${B}_{m}(x)=\prod_{l=1}^{{L}_{m}}\left[{S}_{l,m}\,({x}_{v(l,m)}-{k}_{l,m})\right]$$
(17)

According to Eq. 17, \({L}_{m}\) is the degree of interaction, \({S}_{l,m}\in \{\pm 1\}\), \({k}_{l,m}\) is the knot value, and \({x}_{v(l,m)}\) is the value of the corresponding independent variable.

The MARS method consists of two steps, a forward step and a backward step. In the forward step, basis functions (BFs) are added to the model sequentially until the maximum number of BFs is reached; the model produced at this stage usually contains BFs that contribute little to the overall performance, so it is more complex than desired and may contain inaccurate terms. The backward step is applied to avoid overfitting by reducing the complexity of the model without disturbing its fit to the data obtained in the forward step. At each step, it removes the BFs that lead to the smallest increase in the residual sum of squares, and finally a best-predicted model is created [62]. This process is called pruning and is controlled by the hyperparameter "nprune." By allowing different shapes of BFs and their interactions, MARS has the capacity to reliably track very complex data structures that are often hidden in high dimensions [63, 64]. Pruning is most commonly done with the generalized cross-validation technique. These operations protect against overfitting by reducing the complexity of the model. The degree of pruning and the choice of which functions or interactions to remove affect the performance of the model.

Another hyperparameter, which controls the degree of the piecewise linear basis function expansion used by the method, is the "degree" parameter. The degree parameter allows the model to learn different relationships: increasing the degree allows the model to learn more complex relationships, and decreasing it restricts the model to simpler ones. For these reasons, the pruning parameter and the degree have a significant impact on the performance of the MARS model.
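To make the roles of the basis functions and the degree hyperparameter concrete, the following minimal NumPy sketch builds a first-order hinge term and a second-order interaction term. The knot values, variable choices, and coefficients are purely illustrative and are not taken from the fitted models reported later.

```python
import numpy as np

def hinge(x, knot, sign=1):
    """Piecewise-linear MARS basis function: max(0, sign * (x - knot))."""
    return np.maximum(0.0, sign * (x - knot))

# Hypothetical HGB and MCV values for three records (illustrative only).
hgb = np.array([10.2, 13.5, 15.1])
mcv = np.array([72.0, 88.0, 95.0])

bf1 = hinge(hgb, 12.0, sign=-1)                     # degree 1: max(0, 12.0 - HGB)
bf2 = hinge(hgb, 12.0, -1) * hinge(mcv, 80.0, -1)   # degree 2: interaction of two hinges
y_hat = 0.3 + 1.8 * bf1 + 0.4 * bf2                 # MARS prediction = weighted sum of basis functions
```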

4 Application of HHA and MARS methods to the anemia prediction problem

In this section, it is stated how the HHA and MARS methods are adapted to the anemia disease prediction problem.

4.1 Adaptation of HHA algorithm to anemia disease problem

Since 6 different blood parameters are used with the Harris hawks method, the feature vector is given as follows.

$$B=\left[{B}_{1},{B}_{2},{B}_{3},{B}_{4},{B}_{5},{B}_{6}\right]$$
(18)

In order to detect anemia with HHA, the results were tested by modeling in 3 different forms. As shown in Table 2, Model-1HHA refers to linear form, Model-2HHA refers to quadratic form and Model-3HHA refers to exponential form.

Table 2 Types of models to be applied HHA

According to Eq. 18, column 1 of the feature vector \(\left({B}_{1}\right)\) is the coefficient of RBC, column 2 \(({B}_{2})\) the coefficient of HGB, column 3 \(\left({B}_{3}\right)\) the coefficient of HCT, column 4 \(({B}_{4})\) the coefficient of MCV, column 5 \(({B}_{5})\) the coefficient of MCH, and column 6 \(({B}_{6})\) the coefficient of MCHC. These coefficients represent the effect sizes of the parameters (RBC, HGB, HCT, MCV, MCH, and MCHC) on the classification. The effects of the parameters on the classification success are optimized with HHA to obtain the most appropriate weight values.

The multiple linear regression model adapted to the blood data is expressed as a combination of the anemia variables:

$$y={B}_{0}+{B}_{1}\text{HGB}+{B}_{2}\text{RBC}+{B}_{3}\text{MCH}+{B}_{4}\text{WBC}+{B}_{5}\text{MCV}+{B}_{6}\text{HCT}={B}_{0}+\sum_{i=1}^{k}{B}_{i}{x}_{i}$$
(19)

The multiple quadratic regression model can be established as follows:

$$\begin{aligned}y={}&{B}_{0}+{B}_{1}\text{HGB}+{B}_{2}\text{RBC}+\cdots +{B}_{6}\text{HCT}\\ &+{B}_{7}\text{HGB}^{2}+{B}_{8}\,\text{HGB}\cdot \text{RBC}+{B}_{9}\,\text{HGB}\cdot \text{MCH}+\cdots +{B}_{12}\,\text{HGB}\cdot \text{HCT}\\ &+{B}_{13}\text{RBC}^{2}+{B}_{14}\,\text{RBC}\cdot \text{MCH}+\cdots +{B}_{17}\,\text{RBC}\cdot \text{HCT}\\ &+{B}_{18}\text{MCH}^{2}+\cdots +{B}_{27}\text{HCT}^{2}\end{aligned}$$
(20)

The multiple exponential regression model can be established as follows:

$$y={B}_{0}+{B}_{1}{e}^{{B}_{7}\text{HGB}}+{B}_{2}{e}^{{B}_{8}\text{RBC}}+{B}_{3}{e}^{{B}_{9}\text{MCH}}+{B}_{4}{e}^{{B}_{10}\text{WBC}}+{B}_{5}{e}^{{B}_{11}\text{MCHC}}+{B}_{6}{e}^{{B}_{12}\text{HCT}}$$
(21)

The \(y\) values in Eqs. 19, 20, and 21 are the anemia values. \({B}_{i}\), with \(0\le i\le 6\), \(0\le i\le 27\), and \(0\le i\le 12\), are the parameters to be determined for Eqs. 19–21, respectively, using the Harris hawks algorithm.

The cost function in multiple linear form is expressed as

$$J(Q)=\frac{1}{N}\sum_{i=1}^{N}(Y-{X}_{i}{B}^{T}{)}^{2}$$
(22)

where \(X\) is the input set, \({X}_{i}\) is the \(i\)th patient record, \(Y\) denotes the label values, and \(N\) is the number of patients in the dataset.

Considering the problem as a data mining parameter optimization task, the model parameters (\(B\)) are estimated with the help of the HHA using the dataset. Within the scope of the study, the data are classified according to whether anemia exists or not.
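As a sketch, the objective of Eq. 22 for the linear model of Eq. 19 can be written as below, assuming the design matrix carries a leading column of ones for \(B_0\). Each hawk position in the HHA corresponds to one candidate weight vector \(B\), and this is the function the algorithm minimizes; the names and the 0.5 decision threshold are our own illustrative choices.

```python
import numpy as np

def linear_cost(B, X, Y):
    """Mean squared error of Eq. 22 for the multiple linear model of Eq. 19.

    B : candidate weight vector [B0, B1, ..., B6] (one hawk position)
    X : N x 7 design matrix (a column of ones followed by the six blood parameters)
    Y : length-N vector of labels (1 = anemia, 0 = healthy)
    """
    residuals = Y - X @ B
    return float(np.mean(residuals ** 2))

# After the HHA run, the best weight vector can be used for classification, e.g. by
# thresholding the model output:  y_pred = (X @ B_best >= 0.5).astype(int)
```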

Application stages of the HHA algorithm for Model-1HHA:

Step 1:

The population size \((N)\) and the maximum number of iterations \(( T)\) are defined. (In the study, \(N=50\) and \(T=250\) were taken.)

Step 2: A matrix \(B\) with as many rows as the population size is generated within the following bounds:

$$( - 5 < B_{0} < 5 \;\;{\text{and}}\;\; - 3 < B_{1} ,B_{2} ...B_{N} < 3)$$

Step 3 The objective function is calculated by considering Eq. 23.

$${\text{J}}=\frac{1}{2}\left({{\text{B}}}_{0}+\sum_{{\text{i}}=1}^{{\text{k}}}{{\text{y}}}_{{\text{i}}}-{{\text{B}}}_{{\text{i}}}{{\text{x}}}_{{\text{i}}}\right)^{2}$$
(23)

Step 4 Each row of matrix \(B\) is evaluated as a candidate rabbit position \({x}_{\rm{rabbit}}\) according to the objective function. Over the iterations, the \({x}_{\rm{rabbit}}\) position that minimizes the objective function is updated.

Step 5 Equation 24 gives the energy of the prey.

$${\text{E}}=2{{\text{E}}}_{0}\left(1-\frac{{\text{t}}}{{\text{T}}}\right)$$
(24)

The initial energy of the prey is compared with the objective function value, and this value is then reduced according to the number of iterations, thereby minimizing the objective function. In the following steps, the parameter values (prey position) change according to the energy of the prey.

Step 6: The position vector is updated (exploration) when the energy of the prey satisfies \(|E|\ge 1\).

Step 7: If \(|E|<1\), one of four different strategies is applied (exploitation phase).

Step 7.1: If (\(|E|\ge 0.5\) and \(r\ge 0.5\)), \({x}_{\rm{rabbit}}\) is updated using the soft besiege.

Step 7.2: If (\(|E|\ge 0.5\) and \(r<0.5\)), \({x}_{\rm{rabbit}}\) is updated using the soft besiege with progressive rapid dives.

Step 7.3: If (\(|E|<0.5\) and \(r\ge 0.5\)), \({x}_{\rm{rabbit}}\) is updated using the hard besiege.

Step 7.4: If (\(|E|<0.5\) and \(r<0.5\)), \({x}_{\rm{rabbit}}\) is updated using the hard besiege with progressive rapid dives.

Step 8 Steps 5–7 are repeated until the maximum number of iterations is reached.

Step 9 The prey position, i.e., the best solution, is obtained.

The accuracy at each fold and the average accuracy obtained by applying tenfold cross-validation for Model-1HHA are given in Table 3. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 3 Accuracy values at each fold by Model-1HHA

The accuracy at each fold and the average accuracy obtained by applying tenfold cross-validation for Model-2HHA are given in Table 4. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 4 Accuracy values at each fold by Model-2HHA

The accuracy at each fold and the average accuracy obtained by applying tenfold cross-validation for Model-3HHA are given in Table 5. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 5 Accuracy values at each fold by Model-3HHA

The coefficients produced at each fold by the HHA method applied to the linear, quadratic, and exponential form models are shown in Tables 6, 7, and 8.

Table 6 Attribute weights at each fold by Model-1HHA
Table 7 Attribute weights at each fold by Model-2HHA
Table 8 Attribute weights at each fold by Model-3HHA

As shown in Tables 6, 7, and 8, the overall model coefficients were created by averaging the weight values produced by Model-1HHA, Model-2HHA, and Model-3HHA at each fold and the average coefficients are presented in Tables 9, 10, and 11, respectively.

Table 9 Average attribute weights by Model-1HHA
Table 10 Average attribute weights by Model-2HHA
Table 11 Average attribute weights by Model-3HHA

In Tables 9, 10 and 11, the HHA algorithm is run for the mathematical model specified in Eqs. 19, 20 and 21 and the weight values that minimize the fitness function given in Eq. 22 are calculated.

4.2 Adaptation of the MARS method to the problem of anemia

The parameters of the anemia models (Eqs. 16 and 17) were determined as defined in Table 12, with Model-1MARS being the first-order model, Model-2MARS the second-order model, and Model-3MARS the model with the best pruning and degree values.

Table 12 Model types to which the MARS method is applied

To examine the performance of the MARS model with different combinations of hyperparameters, tests were performed on three different models. Tenfold cross-validation was used for the validation process. During the experiments, the hyperparameters degree and nprune were tested with different values: in Model-1MARS, \(\text{degree}=1\) and \(\text{nprune}=25\); in Model-2MARS, \(\text{degree}=2\) and \(\text{nprune}=25\); and in Model-3MARS, nprune was varied from 5 to 50 in steps of 5 (\(5:5:50\)) and degree from 1 to 4 \((1:4)\) in order to find the most successful parameter combination. With these hyperparameter combinations, the aim is to capture the behavior of the model over a wide range, as sketched below.
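The paper does not state which software implementation was used; the names degree and nprune suggest the R earth package. As a rough Python analogue, the sketch below assumes the third-party py-earth package, whose max_degree and max_terms parameters play roles comparable to degree and nprune, and thresholds the regression output at 0.5 to obtain class labels. It illustrates the grid search only and is not the authors' code.

```python
import numpy as np
from pyearth import Earth                           # third-party py-earth package (assumed available)
from sklearn.model_selection import StratifiedKFold

# X: (1732, 6) array of blood parameters, y: 0/1 anemia labels (loaded beforehand).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best_acc, best_params = 0.0, None
for degree in range(1, 5):                          # degree = 1:4
    for nprune in range(5, 55, 5):                  # nprune = 5:5:50
        fold_acc = []
        for tr, te in cv.split(X, y):
            model = Earth(max_degree=degree, max_terms=nprune).fit(X[tr], y[tr])
            y_hat = (model.predict(X[te]) >= 0.5).astype(int)   # threshold regression output
            fold_acc.append((y_hat == y[te]).mean())
        if np.mean(fold_acc) > best_acc:
            best_acc, best_params = float(np.mean(fold_acc)), (degree, nprune)
```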

The accuracy at each fold and the average accuracy obtained by applying tenfold cross-validation for Model-1MARS are given in Table 13. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 13 Accuracy values at each fold by Model-1MARS

Before the pruning process, 10 basis functions were generated, and the functions contributing little to model performance were removed from the model. Finally, in order to test the effect of RBC, HGB, HCT, MCV, MCH, and MCHC on anemia, the 9 basis functions generated by the MARS method for Model-1MARS and their weights are presented in Table 14.

Table 14 Degree = 1 Estimated results of the MARS model

The basis functions defined for the model are denoted \(\text{BF}j\), where \(j\) is the index of the basis function, and are presented in Table 14. In line with this information, the Model-1MARS model is as follows:

$$\text{Model-1MARS}=23.60-225.31\times \text{BF1}+237.19\times \text{BF2}-46.91\times \text{BF3}-27.17\times \text{BF4}-8.86\times \text{BF5}+12.05\times \text{BF6}+46.61\times \text{BF7}+6.99\times \text{BF8}$$
(25)

When Eq. 25 is analyzed, it is observed that there is no interaction of the basis functions by taking degree = 1, and simple linear functions are formed. According to Table 14 and Eq. 25, the RBC and WBC parameters were pruned and removed from the model by the MARS method because of their low contribution to the model performance.

The accuracy and average accuracy value obtained at each fold by applying tenfold cross-validation for Model-2 MARS are given in Table 15. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 15 Accuracy values at each fold by Model-2MARS

With Model-2MARS, 17 basis functions were generated before the pruning process and the functions with low success rates were removed from the model. Finally, in order to test the effect of RBC, HGB, HCT, MCV, MCH, and MCHC on anemia, 8 basis functions generated by the MARS method for Model-2MARS and their weights are presented in Table 16.

Table 16 Degree = 2 Estimated results of the MARS model

In line with the information in Table 16, the Model-2MARS model is as follows:

$$\text{Model-2MARS}=11.7-4.5\times \text{BF1}-585.3\times \text{BF2}+626.3\times \text{BF3}-45.5\times \text{BF4}+7.2\times \text{BF5}+294.1\times \text{BF6}-268.2\times \text{BF7}$$
(26)

When Eq. 26 is analyzed, the maximum interaction level of the basis functions is 2 because degree = 2 was used. This shows that the model's basis functions are linear and quadratic functions, as shown in Table 16.

\(\text{BF1}\), \(\text{BF2}\), \(\text{BF3},\) and \(\text{BF4}\) form linear functions; \(\text{BF5}\), \(\text{BF6},\) and \(\text{BF7}\) are quadratic functions formed as the product of two basis functions. According to Table 16 and Eq. 26, MCH, WBC, and MCHC parameters were removed from the model by the MARS method because of their low contribution to the model’s performance success.

The accuracy and average accuracy value obtained at each fold by applying tenfold cross-validation for Model-3MARS are given in Table 17. The high accuracy achieved on each fold shows that the model is not affected by the unbalanced dataset.

Table 17 Accuracy values at each fold by Model-3MARS

For Model-3MARS, degree = 1:4 and nprune = seq(5:5:50). As shown in Table 18, a grid containing the performances of the hyperparameters according to the values in these ranges was created.

Table 18 Accuracy values at each fold by the grid containing hyperparameters Degree = 1:4 and nprune = seq(5:5:50)

As a result of the tests, 17 basis functions were generated by Model-3MARS before the pruning process, and the functions contributing little to model performance were removed by pruning. Finally, in order to test the effect of RBC, HGB, HCT, MCV, MCH, and MCHC on anemia, the 5 basis functions generated by the MARS method for Model-3MARS and their weights are presented in Table 19.

Table 19 Degree = 1:4 and nprune = seq(5:5:50) estimated results of the MARS model

In line with the information in Table 19, the Model-3MARS model is as follows:

$$\text{Model-3MARS}=43.5-1107.3\times \text{BF1}+671.4\times \text{BF2}+419.9\times \text{BF3}-123.8\times \text{BF4}$$
(27)

When Eq. 27 is analyzed, the maximum interaction level of the basis functions is set to 2 by taking degree = 2 and nprune = 5. This shows that the model’s basis functions are linear and quadratic functions as shown in Table 19. BF1, BF2, and BF3 are linear functions, while BF4 is a quadratic function which is the product of two basis functions. According to Table 19 and Eq. 27, MCH, WBC, HCT, and MCHC parameters were pruned and removed from the model by the MARS method in the pruning phase since their contribution to the model performance was low.

When data for a new patient are to be analyzed using Eqs. 25–27, classification can be made by entering the required values into the \(\text{BF}j\) basis functions, as sketched below.
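A minimal sketch of this use is given below for the Model-1MARS form of Eq. 25. The hinge knots inside the basis functions are placeholders, since the actual BF definitions are listed in Table 14 and are not reproduced here; only the first two terms are written out, and the 0.5 decision threshold is likewise an assumption.

```python
def hinge(x, knot, sign=1):
    """Piecewise-linear MARS basis function: max(0, sign * (x - knot))."""
    return max(0.0, sign * (x - knot))

def model1_mars(record):
    """Skeleton of Eq. 25; knot values are placeholders to be replaced from Table 14."""
    bf1 = hinge(record["HGB"], 12.0, -1)            # placeholder knot
    bf2 = hinge(record["HGB"], 12.0, +1)            # placeholder knot
    # ... BF3 to BF8 defined analogously from Table 14 ...
    return 23.60 - 225.31 * bf1 + 237.19 * bf2      # + the remaining weighted terms of Eq. 25

new_patient = {"RBC": 4.4, "HGB": 10.8, "HCT": 33.0, "MCV": 75.0, "MCH": 24.0, "MCHC": 31.0}
label = 1 if model1_mars(new_patient) >= 0.5 else 0  # 1 = anemia, 0 = healthy
```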

5 Evaluation

In the study, the two classes were anemia (1) and non-anemia/healthy (0) individuals. A tenfold cross-validation method was used for all proposed models. With this method, the dataset is divided into 10 different subsets; each time, one subset is used as the test set while the remaining subsets are used for training. This approach ensures that each subset serves as the test set exactly once, and the results of all folds are averaged. In this way, the class imbalance in the dataset causes fewer problems during model learning and performance evaluation. In other words, since there are more non-anemia records in the dataset, a model may learn this class better and tend to misclassify the anemia class, which has fewer records. However, with this method, where the representation of each class in the training and test sets is kept compatible with its proportion in the overall dataset, the effect of such situations is minimized.
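A minimal sketch of this evaluation protocol is given below, assuming scikit-learn's StratifiedKFold so that the class proportions of each fold match the overall dataset (the paper does not name a specific library). Here fit_and_predict stands for any of the six models.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def tenfold_accuracy(X, y, fit_and_predict, seed=0):
    """Average accuracy over 10 stratified folds; each fold is used exactly once as the test set."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    fold_acc = []
    for train_idx, test_idx in cv.split(X, y):
        y_pred = fit_and_predict(X[train_idx], y[train_idx], X[test_idx])
        fold_acc.append(float(np.mean(y_pred == y[test_idx])))
    return fold_acc, float(np.mean(fold_acc))
```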

The performances of the classification methods were calculated according to ROC analysis metrics. These metrics reveal how well a model performs in its predictions and allow the results obtained by different methods to be compared; they are frequently used in data mining applications [65].

Confusion matrix for ROC analysis is shown in Table 20.

Table 20 Confusion matrix for ROC

The basic equations for ROC analysis are shown in Eqs. 28–33.

$${\text{Accuracy}}=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{TN}}+{\text{FP}}+{\text{FN}}}$$
(28)
$${\text{Recall}}-{\text{Sensitivity}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}$$
(29)
$$\mathrm{Specificity }= \frac{{\text{TN}}}{{\text{TN}}+{\text{FP}}}$$
(30)
$$\mathrm{Precision }= \frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}$$
(31)
$${\text{F}}1-{\text{Score}}=\frac{2{\text{Precision}}.{\text{Recall}}}{{\text{Precision}}+{\text{Recall}}}$$
(32)
$${\text{AUC}}=\frac{{\text{TPR}}+{\text{TNR}}}{2}$$
(33)

The purpose of ROC analysis is to compare the performance of the results obtained by various methods and to evaluate them in terms of sensitivity, specificity, precision, F1-score, AUC, and accuracy [66]. The ROC quantities used in the analysis are TP, TN, FP, and FN. TP (true positive) and TN (true negative) indicate correctly classified anemia and healthy records, respectively, while FP (false positive) and FN (false negative) represent the numbers of misclassified records, as computed in the sketch below.
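For reference, the metrics of Eqs. 28–33 can be computed from a binary confusion matrix as in the sketch below (1 denotes anemia). The single-threshold AUC formula follows Eq. 33 and is an approximation of the full ROC-curve AUC.

```python
import numpy as np

def roc_metrics(y_true, y_pred):
    """Metrics of Eqs. 28-33 from the binary confusion matrix (1 = anemia, 0 = healthy)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    recall = tp / (tp + fn)                     # sensitivity / TPR (Eq. 29)
    specificity = tn / (tn + fp)                # TNR (Eq. 30)
    precision = tp / (tp + fp)                  # Eq. 31
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. 28
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),  # Eq. 32
        "auc": (recall + specificity) / 2,                    # Eq. 33 (single-threshold approximation)
    }
```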

Selecting the appropriate metric for the model is very important for obtaining the desired results. Accuracy, AUC, and F1-score are parameters often used to evaluate model performance. All three are used to evaluate how well a model performs. Accuracy is the most popular metric that determines the percentage of correct predictions. AUC compares the relationship between the true-positive rate (TPR) and false-positive rate (FPR) at different thresholds. Data scientists try to achieve the highest TPR while maintaining the lowest FPR, indicating their success in making correct predictions. In unevenly distributed datasets, it is not enough to measure model success with the accuracy metric alone. F1-score is one of the most widely used metrics for unbalanced datasets [67, 68]. F-score is a measure based on precision and recall values. That is, if recall is high, precision is usually low, and vice versa. At the same time, a high recall value means that a large proportion of instances in the minority class are correctly predicted, while a high precision value indicates that there is a high probability that the predicted minority instances actually belong to the minority class. Higher values for both precision and recall lead to a higher F1-score, indicating that there is not a large difference between the precision and recall values [68, 69].

Although AUC is a very common measure for imbalanced data problems, the F-measure is more suitable for such cases. This is because the minority class is more critical than the majority class, and the F-measure is an indication that the desired method classifies samples in the minority class with a higher accuracy and a lower misclassification rate. In other words, the AUC measure evaluates the overall accuracy of the classifier in both the majority and minority classes, while the F-measure focuses only on the accuracy of the classifier in the minority class [68,69,70].

Since the dataset used in the study is an imbalanced dataset, AUC and F1-score performance comparison is made in the next section, but since other studies in the literature produce results based on the accuracy metric, other ROC metrics are also presented.

6 Experimental results of HHA and MARS

In this study, anemia classification was performed with two different approaches on patient data containing blood values and outcomes. The data used in the study were classified using the two methods given in Tables 2 and 12, and the performance of each method was tested with three different models. For the HHA method, classification was performed with the linear, quadratic, and exponential models; for the MARS method, with a first-order model with pruning coefficient 25, a second-order model with pruning coefficient 25, and a model with degree values between 1 and 4.

The high accuracy values in Tables 3, 4 and 5 for the HHA method and Tables 13, 15 and 17 for the MARS method show that the accuracy of the proposed methods is high in each of the 10 subsets of the dataset. This shows that the proposed methods are not affected by the dataset imbalance.

This section presents the confusion matrices and ROC performance analysis of the methods obtained for the six models. The confusion matrices of the Model-1HHA to Model-3HHA algorithms and the Model-1MARS to Model-3MARS methods are documented in Figs. 1, 2, 3, 4, 5 and 6, respectively, and the ROC performance values are calculated from them.

Fig. 1
figure 1

Confusion matrix of the Model-1HHA

Fig. 2
figure 2

Confusion matrix of the Model-2HHA

Fig. 3
figure 3

Confusion matrix of the Model-3HHA

Fig. 4
figure 4

Confusion matrix of the Model-1MARS

Fig. 5
figure 5

Confusion matrix of the Model-2MARS

Fig. 6
figure 6

Confusion matrix of the Model-3MARS

When the confusion matrices in Figs. 1, 2, 3, 4, 5 and 6 are evaluated, it is seen that a total of 13 patients with Model-1HHA, 12 patients with Model-2HHA, 316 patients with Model-3HHA, 14 patients with Model-1MARS, 12 patients with Model-2MARS and 11 patients with Model-3MARS could not be classified correctly.

Table 21 shows the ROC performance analyses made as a result of testing six different models.

Table 21 ROC performance analyses

In the study, in order to better model the relationships between the anemia disease parameters, 6 different tests were performed at different degrees, emphasizing the interaction of the parameters with each other. The analyses conducted on the different models showed that the classical MARS method and the metaheuristic HHA method achieved very high success in anemia classification when the accuracy metric is considered. However, since model performance metrics such as precision, recall, F1-score, and AUC must be evaluated in addition to accuracy to determine model success on non-uniformly distributed datasets, these metrics were also analyzed.

For Model-1HHA to Model-3HHA, the precision metric, which reflects the ability of the classifier not to label a healthy record as anemia, is high for all 3 models. The recall, F1-score, and AUC metrics are high for Model-1 and Model-2 but low for Model-3. In a dataset where the number of anemia patient records is low, class-based performance matters more than overall accuracy. In other words, although the accuracy value of Model-3 is 81.76%, its recall, the ratio of correctly predicted anemia records to all anemia records, is low because of the small number of anemia records. The F1-score, the harmonic mean of precision and recall, accounts for healthy records misclassified as anemia through the precision component and anemia records misclassified as healthy through the recall component. The AUC metric, which evaluates the accuracy of the classifier in both the minority and majority classes, and the F1-score, which focuses on the accuracy in the minority class, also produced high results for Model-1 and Model-2 but poor results for Model-3, which has an exponential form.

When the results of Model-1MARS to Model-3MARS, where the MARS method is applied, are evaluated, it is seen that the first-order Model-1 and the second-order Model-2 give successful results when nprune = 25 is selected. In the tests performed for Model-3, which searches for the best degree and nprune hyperparameters for classifying the anemia dataset with the MARS method, degree = 2 and nprune = 5 were found. The success of Model-3 is also high.

The cost function changes of the linear, quadratic, and exponential HHA models using the mean squared error (MSE) function are shown in Figs. 7, 8, and 9. The reason why there is no continuous decrease in the curve drawn for Model-1 in Fig. 7 is that the cost curves are calculated by averaging the coefficients. Figure 7 shows that the linear HHA model converges to the optimum result by the end of the 250th iteration.

Fig. 7
figure 7

The cost function of Model-1HHA

Fig. 8
figure 8

The cost function of Model-2HHA

Fig. 9
figure 9

The cost function of Model-3HHA

As in Model-1HHA, the cost values in Fig. 8 are plotted by averaging the coefficients of the parameters at each iteration, which is why there is no continuous decrease. Looking at Fig. 8, the HHA model, which is linear in the coefficients to be optimized, converges to the optimum result.

In Fig. 9, for the exponential form, the cost values are again plotted by averaging the coefficients of the parameters at each iteration, resulting in a fluctuating curve. As in Model-1 and Model-2, the lack of a continuous decrease is due to the coefficients being obtained by averaging in each iteration. It is concluded that the reduction from 130 to 13 parameters is not suitable for a model with exponential behavior. In other words, averaging the parameters to obtain a single general parameter set does not give healthy results in some cases.

These results show that the way the relationships between the parameters are modeled has a significant effect on the model's contribution to the anemia classification problem.

7 Conclusion

The Harris hawks algorithm and the MARS algorithm were used to classify the data in a database where each patient record contains 6 different blood components, namely the RBC, HGB, HCT, MCV, MCH, and MCHC values, together with the corresponding anemia outcome. During classification, the dataset was divided into 10 different subsets using the tenfold cross-validation method, and each of the 10 subsets was used once as the test set. The algorithms were analyzed on 6 different models: multilinear form HHA, multilinear quadratic form HHA, multi-exponential form HHA, a MARS model with first-order terms and pruning coefficient 25, a MARS model with second-order terms and pruning coefficient 25, and a MARS model using the best degree and pruning coefficient.

As a result of the tests performed on the models applying the MARS method based on classical mathematical modeling, it was concluded that they performed well in anemia classification, with accuracies of 99.19%, 99.31%, and 99.36%, respectively, as shown in Table 22. In addition to the accuracy metric, they also performed well on the F1-score and AUC metrics, which are commonly used to analyze imbalanced datasets (Table 21).

Table 22 Related studies using different datasets similar to our study

The accuracies of the parameter estimation method based on mathematical modeling with the HHA method on the anemia classification problem were 99.25%, 99.31%, and 81.76%, respectively. With the average parameter approach, which obtains a general parameter set that best summarizes the problem by averaging the parameter values produced at each fold of the cross-validation, the linear and quadratic models, which are linear in the coefficients to be optimized, give successful results, while the exponential form model, which does not show linear behavior, shows lower success than the other models.

When the results are evaluated, in the prediction process using both HHA and MARS, the targeted outputs were successfully achieved using 6 different blood components in a total of 1732 cases, 351 with anemia, and 1381 without anemia.

According to the performance results, the proposed algorithms classify anemia better than previous methods (Table 22). This shows that our proposed methods for the anemia classification problem perform well in terms of accuracy, F1-score, and AUC. The high performance obtained, together with the high accuracy in each fold of the tenfold cross-validation, shows that the proposed methods are not affected by the imbalance of the dataset.

The results of the study are expected to help medical students and doctors in the anemia classification problem. We believe that the classifier performances of our proposed models will contribute positively to the literature.