1 Introduction

There are several types of email messages that computer users do not opt to receive in their inboxes, such as spam, bulk email, junk email, and promotional and commercial emails. Although these messages differ somewhat, in this study they are all considered spam. Large-scale, inappropriate messages on the Internet that carry no useful content for the user are classified as spam. Spam is distributed in different formats and on various platforms: social media spam, web spam, forum spam, instant-messaging spam, email spam, and so on. Although most internet-based platforms can be used to transmit spam, email spamming has grown in popularity because email is widely used for a variety of purposes [34]. The Text REtrieval Conference (TREC) defines spam as follows: "Spam is unsolicited mail that is sent vaguely, directly or indirectly by someone who has no relationship with the recipient of the letter" [11].

Although email is an effective and easily accessible means of communication, it can be abused by marketers to advertise their products and by scammers to deceive people. The negative effects of spam are not limited to a severe waste of resources, time, and effort: spam also burdens communication, enables cybercrime, and even affects the global economy, costing businesses and individuals millions of dollars annually. In addition to consuming resources such as bandwidth, removal time, and storage space, unwanted emails pose a security threat [5]. Attackers use a variety of methods to gain access to victims' information, and email systems are one of the platforms attackers use to spread malware. A recent McAfee report states that more than 97% of the spam emails sent in the last four months of 2017 came from the Necurs and Gamut botnets [33].

Manual detection of suspicious emails by users can prevent attackers from reaching their goals: after observing the characteristics of a suspicious email, users should immediately take the necessary actions to prevent spam distribution and inform the relevant institutions [16]. Nevertheless, developing efficient mechanisms that automatically identify unsolicited emails is very important. Some characteristics of emails believed to be malicious are listed in the “Appendix”.

Spam detection is a challenging problem, and several techniques have been developed to detect spam emails automatically; however, none of them achieves 100% accuracy. Machine learning and deep learning techniques have proven to be the most successful of the methods introduced, and in recent years spam detection has become one of the common applications of machine learning [53]. Natural Language Processing (NLP) helps these methods increase their accuracy. These spam detection methods consist of two stages: feature selection and classification [15].

Optimization algorithms are another family of methods that can help in developing spam detection systems. The Horse herd Optimisation Algorithm (HOA) [35] is a novel metaheuristic algorithm with high exploration and exploitation performance that excels at finding optimal solutions to high-dimensional problems. In this article, our objective is to present a new method for detecting spam emails using HOA. To do this, we first convert the basic HOA, which is a continuous algorithm, into a discrete algorithm, and then extend it into a multiobjective algorithm so that it can solve multiobjective problems. Finally, the new multiobjective binary HOA is used to select the important features of spam emails so that received emails are correctly classified as spam or genuine; the resulting classification is then evaluated.

This study's main motivation for using HOA in spam detection is its outstanding performance on complex high-dimensional problems. It is exceptionally efficient in exploration and exploitation and can find the optimal solution quickly, with low cost and complexity. In terms of accuracy and efficiency, it outperforms many well-known optimization algorithms such as the grasshopper optimization algorithm [48], the sine cosine algorithm [38], the multi-verse optimizer [39], the moth-flame optimization algorithm [36], the dragonfly algorithm [37], and the grey wolf optimizer [40].

Overall, the current study has the following main contributions:

  • HOA, a novel metaheuristic algorithm for high exploration and faster convergence, has been used in the study. To the best of the authors’ knowledge, this algorithm has not yet been used for spam detection.

  • The original HOA was a single objective algorithm developed to solve continuous problems. In this study, HOA was discretized and converted to a multiobjective algorithm.

  • The original HOA was transformed into a binary opposition-based algorithm.

  • Using HOA for feature selection, a novel spam detection method is proposed.

  • After selecting the optimal features, the K-Nearest Neighbours (KNN) classification method was used to classify the collection of spam emails.

  • According to the evaluation results, the proposed method outperforms well-known algorithms in terms of accuracy, precision, and sensitivity.

The remainder of this article is organized as follows: Sect. 2 introduces the related works. In Sect. 3, the original horse herd optimization algorithm is presented. Section 4 introduces the new proposed approach, and finally, in Sect. 5, the evaluation results and conclusion are discussed.

2 Related works

Unsolicited spam emails sent by marketers to promote their products are regarded as annoying since they take up a lot of space on servers [45]. Some innocent users may also fall prey to fake emails [21]: scammers send such emails to obtain users' bank account details and steal money. Attackers and hackers also hide viruses and other malicious software behind attractive, exciting offer links in spam emails [23]. Therefore, the problem of spam emails should be addressed immediately, and effective measures should be taken to control it. Efforts to reduce spam email include the development of advanced filtering tools and anti-spam laws in the United States [5].

Many researchers have focused their attention on the email spam detection problem, and in the literature, several notable approaches have been proposed. This section discusses some of the previous studies focusing on detecting and classifying spam through machine learning techniques and deep learning algorithms. One of the widely used algorithms for this problem is Naive Bayes [4, 47, 50]. There are various techniques introduced for detecting spam; however, our main focus would be on metaheuristic optimization algorithms in the present study.

A decision tree was applied in the study by Carreras and Marquez [8] to filter unwanted emails. Because the features of spam emails are difficult to define, this method is not extensively employed in spam filtering. K-nearest neighbours (KNN), Naïve Bayes, and Reverse DBSCAN algorithms were used by Harisinghaney et al. [18] to classify image-based and text-based spam, and a performance comparison of these algorithms was provided based on four measuring factors.

Egozi and Verma [13] used natural language processing techniques to detect phishing emails. Their model applies a feature selection method to select 26 features in order to determine if an email is a genuine email or spam. With only 26 features, their approach correctly identified more than 95% of ham emails as well as 80% of phishing emails.

Sharma and Bhardwaj [51] introduced a spam mail detection (SMD) system based on hybrid machine learning applying Naive Bayes and the J8 decision tree. The system consists of four modules: data set preparation, data preprocessing, feature selection, and a hybrid bagged approach. Three experiments were performed: the first two were based on Naive Bayes and J8 alone, and the third was the proposed SMD, which achieved an accuracy of 87.5%.

A new model for spam detection (THEMIS) was introduced by Soni [52] that models emails at the header, body, character, and word levels simultaneously. The approach uses deep convolutional neural network algorithms to recognize spam emails. The evaluation results show that THEMIS's accuracy of 99.84% is higher than that of LSTM and CNN models.

In the study by GuangJun et al. [17], a method is proposed for spam classification in mobile communication using predictive machine learning models (e.g., logistic regression, K-Nearest Neighbor, and decision tree). Experiment results suggest that this method is accurate and timely in detecting spam and can protect email communication in mobile systems.

The study by Bibi et al. [6] compares past spam filtering algorithms, discussing their accuracy and the data sets employed. It presents in-depth knowledge of the simple Naive Bayes algorithm, one of the best algorithms for text classification. The study evaluated machine learning classifiers for spam detection and found that, using WEKA, the Naïve Bayes algorithm provides effective accuracy and precision.

Mohmmadzadeh [42] developed a new hybrid model combining the whale optimization algorithm and the flower pollination algorithm to solve the feature selection problem on the basis of opposition-based learning for detecting spam. The new model has higher accuracy in spam detection than previous approaches.

A spam detection approach using word embedding based on deep learning architecture in the NLP context was introduced by Srinivasan et al. [53]. The study reveals that deep learning outperforms standard machine learning classifiers when it comes to spam detection.

Apart from the sample methods described above, other methods exist that rely solely on metaheuristic algorithms, but none of the proposed methods is entirely accurate; all are erroneous to some extent. Moreover, many previous methods implemented only the classification phase and omitted feature selection. Feature selection reduces the dimensionality of the computation and increases classification accuracy by removing unnecessary features. Lacking a feature selection process, most previous solutions spend a tremendous amount of time running the algorithm and do not achieve a high accuracy percentage. Table 1 lists some recently published optimization methods used in spam detection, along with drawbacks that the method proposed in this study attempts to rectify.

Table 1 Examples of the recent spam detection methods

As can be seen in Table 1, even the most recent methods are not 100% accurate, need considerable time to execute, and in some cases have high computational complexity and high error rates. Thus, the objective of the current study was to employ a robust metaheuristic optimization algorithm, highly efficient in exploration and exploitation, to enhance the speed and accuracy of spam detection and to reduce the error rate. After a comprehensive search of the literature and an examination of several optimization algorithms, the authors decided to use the novel metaheuristic optimization algorithm HOA for the feature selection phase of the proposed approach; as a result, the spam detection method suggested by the current study is based on HOA. This optimization algorithm has been tested on multiple well-known high-dimensional test functions and has proven able to solve challenging, high-dimensional problems.

To carefully assess the performance and efficiency of the proposed method, some of the most popular and efficient optimization and classification algorithms in the literature were selected for the simulation, and their performance was compared with that of the proposed method. The simulation results indicate that the proposed method outperforms the previous methods: it demonstrates high accuracy and precision, requires less execution time, and has a lower error rate. Thus, the new method's superiority lies in its higher accuracy and speed and its lower error rate and complexity.

As stated earlier, to use HOA for feature selection, we converted it into a discrete algorithm, since it was originally continuous. Then, because feature selection is also a multiobjective problem, we transformed HOA into a multiobjective HOA and used it to select spam features. To the best of our knowledge, this is the first research in the field that presents a binary and multiobjective version of HOA. The following section introduces the horse herd optimization algorithm.

3 Horse herd optimization algorithm

In recent years, various metaheuristic algorithms have been employed to solve a wide range of optimization problems [10, 29, 56]. A reason for this is the ability of metaheuristic algorithms to mathematically model and solve a variety of real-world problems [49]. This study aimed to employ a novel metaheuristic algorithm for solving the feature selection problem for detecting spam emails. Therefore, the Horse herd Optimisation Algorithm (HOA) was used as the primary method for this purpose. HOA, proposed in the study by MiarNaeimi et al. [35], is a robust metaheuristic algorithm inspired by the horses’ herding behaviors at various ages. Because of the vast number of control factors based on the behavior of horses of various ages, HOA shows an outstanding performance at addressing complex high-dimensional problems. Its performance at high dimensions (up to 10,000) has been evaluated using popular test functions, and it was discovered to be extremely efficient in exploration and exploitation. It has the ability to find the best solution in the shortest time, at the lowest cost, and with the least amount of complexity, and in terms of accuracy and efficiency, it outperforms many well-known metaheuristic optimization algorithms. This algorithm is discussed in greater detail in the following section.

Horses show various behaviors at different ages [35]. A horse's maximum lifespan is around 25–30 years [25]. In HOA, horses are divided into four categories according to age: 0–5, 5–10, 10–15, and older than 15 years, represented by δ, γ, β, and α, respectively. HOA simulates the horses' social life using six general behaviors at these ages: "grazing, hierarchy, sociability, imitation, defence mechanism and roaming".

Equation (1) describes the horse movement at each iteration:

$$X_{m}^{{\text{Iter,AGE}}} = \vec{V}_{m}^{{\text{Iter,AGE}}} + X_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} , \quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$
(1)

where \(X_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the position of the mth horse, \(\vec{V}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the velocity vector of the mth horse, AGE is the horse age range, and Iter is the current iteration.

To determine the ages of the horses, a complete matrix of responses is needed at each iteration. The matrix is sorted by best response, and the top 10% of horses are chosen as α. The β, γ, and δ horses comprise the next 20%, 30%, and 40% of the remaining horses, respectively. To determine the velocity vector, the six behaviors above are implemented mathematically. During each cycle of the algorithm, the motion vector of horses of various ages can be expressed by Eq. (2) [35]:

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter,}}\alpha }} & = \vec{G}_{m}^{{{\text{Iter,}}\alpha }} + \vec{D}_{m}^{{{\text{Iter,}}\alpha }} \\ \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \vec{G}_{m}^{{{\text{Iter}},\beta }} + \vec{H}_{m}^{{{\text{Iter}},\beta }} + \vec{S}_{m}^{{{\text{Iter}},\beta }} + \vec{D}_{m}^{{{\text{Iter}},\beta }} \\ \vec{V}_{m}^{{{\text{Iter,}}\gamma }} & = \vec{G}_{m}^{{{\text{Iter,}}\gamma }} + \vec{H}_{m}^{{{\text{Iter,}}\gamma }} + \vec{S}_{m}^{{{\text{Iter,}}\gamma }} + \vec{I}_{m}^{{{\text{Iter,}}\gamma }} + \vec{D}_{m}^{{{\text{Iter,}}\gamma }} + \vec{R}_{m}^{{{\text{Iter,}}\gamma }} \\ \vec{V}_{m}^{{{\text{Iter}},\delta }} & = \vec{G}_{m}^{{{\text{Iter}},\delta }} + \vec{I}_{m}^{{{\text{Iter}},\delta }} + \vec{R}_{m}^{{{\text{Iter}},\delta }} \\ \end{aligned}$$
(2)
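The rank-based age assignment can be sketched as follows. This is a minimal NumPy illustration of the split used in the original HOA paper (best 10% α, next 20% β, next 30% γ, last 40% δ); the function and variable names are ours, not from any official implementation:

```python
import numpy as np

def assign_age_classes(costs):
    """Partition horses into HOA age classes by fitness rank:
    best 10% -> alpha, next 20% -> beta, next 30% -> gamma, last 40% -> delta."""
    n = len(costs)
    order = np.argsort(costs)                  # ascending: best (lowest cost) first
    n1, n2, n3 = round(0.1 * n), round(0.3 * n), round(0.6 * n)
    ages = np.empty(n, dtype=object)
    ages[order[:n1]] = "alpha"
    ages[order[n1:n2]] = "beta"
    ages[order[n2:n3]] = "gamma"
    ages[order[n3:]] = "delta"
    return ages

costs = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0])
print(assign_age_classes(costs))   # horse 1 (cost 0.1) becomes the alpha
```

The age classes are recomputed every iteration, so a horse migrates between classes as its fitness rank changes.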

As stated earlier, HOA is inspired by horses and their six general and social behaviors in various ages. The six behaviors and their mathematical implementation are discussed as follows.

Grazing: Horses are grazing animals that graze at all stages of their lives for about 16–20 h per day [25]. Equations (3) and (4) mathematically implement this behavior in HOA [35].

$$\vec{G}_{m}^{{{\text{Iter}},{\text{AGE}}}} = g_{m}^{{{\text{Iter}},{\text{AGE}}}} \left( {\check{u}} + \rho {\check{l}} \right) + [X_{m}^{{({\text{Iter}} - 1)}} ],\quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$
(3)
$$g_{m}^{{{\text{Iter}},{\text{AGE}}}} = g_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{g}$$
(4)

In the above equations, \(\vec{G}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the mth horse's motion parameter, indicating its tendency to graze. This factor is reduced linearly by \({\omega }_{g}\) in each iteration. \({\check{u}}\) is the upper bound of the grazing space, with a recommended value of 1.05; \({\check{l}}\) is the lower bound, with a recommended value of 0.95. \(\rho\) is a random number between 0 and 1. The coefficient \(g\) is recommended to be set to 1.5 for all age ranges.

Hierarchy: Horses are not self-sufficient; they usually follow a leader, which may be a human, an adult stallion, or a mare. This is known as the hierarchy law [7]. In a herd, the most experienced and strongest horse tends to lead, and the others follow it. Horses between the ages of 5 and 15 (β and γ) were shown to follow the hierarchy law. The hierarchy is implemented according to Eqs. (5) and (6) below [35]:

$$\vec{H}_{m}^{{\text{Iter,AGE}}} = h_{m}^{{\text{Iter,AGE}}} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \alpha ,\beta \;{\text{and}}\;\gamma$$
(5)
$$h_{m}^{{\text{Iter,AGE}}} = h_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{h}$$
(6)

where \(\vec{H}_{m}^{{{\text{Iter}},{\text{AGE}}}}\) is the impact of the location of the leader horse on the velocity, and \(X_{*}^{{({\text{Iter}} - 1)}}\) indicates the location of that horse.

Sociability: Sociability is another horse behavior that inspired HOA. Horses require social interaction and may coexist with other animals, which increases their chances of survival; some horses even appear to enjoy being with other animals such as cattle and sheep [25]. Horses between 5 and 15 years of age show this behavior. In HOA, socialization is modeled as movement towards the position of other horses in the herd and is implemented using Eqs. (7) and (8) [35]:

$$\vec{S}_{m}^{{\text{Iter,AGE}}} = s_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \beta ,\gamma$$
(7)
$$s_{m}^{{\text{Iter,AGE}}} = s_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{s}$$
(8)

where \(\vec{S}_{m}^{{\text{Iter,AGE}}}\) is the mth horse's social motion vector, and \(s_{m}^{{\text{Iter,AGE}}}\) is the same horse's orientation towards the herd in the Iterth iteration, decremented in each cycle by the factor \(\omega_{s}\). N is the total number of horses, and AGE is the age range of each horse in the herd. The s coefficient of β and γ horses is calculated in the parameters' sensitivity analysis.
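As a concrete illustration, the sociability term of Eq. (7) is simply a scaled pull towards the herd's mean position. A hedged NumPy sketch (function and argument names are ours):

```python
import numpy as np

def sociability(X, m, s_m):
    """Eq. (7): scaled movement of horse m towards the herd's mean position.
    X is an (N, d) array of horse positions; s_m is the orientation coefficient."""
    return s_m * (X.mean(axis=0) - X[m])

# Three horses in 2-D; the herd's mean position is (2, 2).
X = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])
print(sociability(X, 0, 0.5))   # -> [1. 1.]: horse 0 is pulled halfway to the centre
```

The hierarchy term of Eq. (5) has the same structure, with the leader's position in place of the herd mean.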

Imitation: Horses learn each other's good and undesirable habits by imitating one another [7]; this imitation is another horse behavior that inspired HOA. Young horses attempt to imitate others, and this behavior persists throughout their lives. Imitation is described by Eqs. (9) and (10) [35]:

$$\vec{I}_{m}^{{\text{Iter,AGE}}} = i_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right],\quad {\text{AGE}} = \gamma$$
(9)
$$i_{m}^{{\text{Iter,AGE}}} = i_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{i}$$
(10)

In the above equations, \(\vec{I}_{m}^{{\text{Iter,AGE}}}\) is the mth horse's motion vector towards the average of the best horses, whose locations are denoted by \(\widehat{X}\). pN is the number of horses in the best locations, and p is recommended to be set to 10% of the total number of horses in the herd. \({\omega }_{i}\) is the per-cycle reduction factor for \(i_{m}^{{\text{Iter,AGE}}}\).

Defense: Horses defend themselves with fight-or-flight behavior. Their first impulse is to flee, and they usually buck when trapped. They fight rivals over food and water and to avoid dangerous situations with enemies such as wolves [25, 55]. The defense mechanism, the other behavior used in HOA, is defined as running away from horses that exhibit non-optimal responses. Equations (11) and (12) describe the defense mechanism [35]:

$$\vec{D}_{m}^{{\text{Iter,AGE}}} = - d_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \alpha ,\beta \;{\text{and}}\;\gamma$$
(11)
$$d_{m}^{{\text{Iter,AGE}}} = d_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{d}$$
(12)

In the above equations, \(\vec{D}_{m}^{{\text{Iter,AGE}}}\) indicates “the escape vector of ith horse from the average of some horses with worst locations, which are shown by the \({\check{X}}\) vector”. qN is the number of horses in the worst locations, and q is recommended to be set to 20% of the total number of horses. \(\omega_{d}\) is the per-cycle reduction factor for \(d_{m}^{{\text{Iter,AGE}}}\).

Roaming: The last horse behavior that HOA simulates is roaming. In nature, horses that are not kept in stables roam and graze from one pasture to another in pursuit of food, and a horse may abruptly change its grazing site. Horses are also incredibly curious, frequently visiting different pastures and getting to know their surroundings [55]. Roaming is modeled as a random movement of a horse in the herd and is described by Eqs. (13) and (14) [35]:

$$\vec{R}_{m}^{{\text{Iter,AGE}}} = r_{m}^{{\text{Iter,AGE}}} pX^{{({\text{Iter}} - 1)}} ,\quad {\text{AGE}} = \gamma \;{\text{and}}\;\delta$$
(13)
$$r_{m}^{{\text{Iter,AGE}}} = r_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{r}$$
(14)

\(\vec{R}_{m}^{{\text{Iter,AGE}}}\) is “the random velocity vector of ith horse for a local search and an escape from local minima”. The reduction factor of \(r_{ m}^{{\text{Iter,AGE}}}\) per cycle is represented by \(\omega_{r}\).

The horses’ general velocity can be calculated by substituting Eqs. (3)–(14) in Eq. (2). The velocity of horses at different ages (δ, γ, β, and α, respectively) are obtained according to Eqs. (15)–(18).

$$\vec{V}_{m}^{{{\text{Iter}},\delta }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right]$$
(15)

where \(\vec{V}_{m}^{{{\text{Iter}},\delta }}\) is the δ horses’ velocity (horses at the age of 0–5).

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\gamma }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right] \\ \end{aligned}$$
(16)

where \(\vec{V}_{m}^{{{\text{Iter}},\gamma }}\) is the γ horses’s velocity (horses at the age of 5–10).

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ \end{aligned}$$
(17)

where \(\vec{V}_{m}^{{{\text{Iter}},\beta }}\) is the β horses’ velocity (horses at the age between 10 and 15 years).

$$\vec{V}_{m}^{{{\text{Iter}},\alpha }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] - \left[ {d_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right]$$
(18)

where \(\vec{V}_{m}^{{{\text{Iter}},\alpha }}\) is the α horses’ velocity (horses older than 15).
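To make the structure of these velocity updates concrete, the α-horse case of Eq. (18), grazing plus a negated defense term, can be sketched in NumPy. This is an illustrative sketch, not the authors' implementation; the function and parameter names are ours, and the coefficients follow the recommended values in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_alpha(X_m, X_worst_mean, g, d, w_g, w_d, u=1.05, l=0.95):
    """Sketch of Eq. (18): an alpha horse's velocity is its grazing term
    minus a defense term that pushes it away from the mean of the qN worst horses."""
    rho = rng.random()                          # random number in [0, 1)
    grazing = g * w_g * (u + rho * l) + X_m     # grazing term, cf. Eq. (3)
    defense = d * w_d * (X_worst_mean - X_m)    # bracketed defense term, cf. Eq. (11)
    return grazing - defense                    # leading minus sign of Eq. (18)

v = velocity_alpha(X_m=np.zeros(3), X_worst_mean=np.ones(3),
                   g=1.5, d=0.5, w_g=0.95, w_d=0.5)
print(v.shape)   # a 3-dimensional velocity vector
```

The β, γ, and δ cases of Eqs. (15)–(17) add the hierarchy, sociability, imitation, and roaming terms in the same additive fashion.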

The findings validated HOA's capacity to cope with difficult situations involving a large number of unknown variables in high-dimensional domains. Adult α horses start a local search around the global optimum with extremely high precision. The β horses look for other near situations around the adult α horses, intending to approach them; nevertheless, the γ horses have less interest in approaching the α horses. They show a strong drive to explore new regions and discover new global optimum spots. Because of their specific behavioral features, young δ horses are excellent candidates for the random search phase.

4 Proposed approach

In this study, the metaheuristic HOA is first modified, and the modified version is then used for feature selection in spam email detection. First, the continuous HOA is converted into a binary algorithm so that it can be used for feature selection, which is a discrete problem. The inputs of the resulting algorithm are then made opposition-based. Next, the binary opposition-based HOA is upgraded to a multiobjective algorithm in order to solve multiobjective problems. Finally, the multiobjective opposition-based binary HOA (MOBHOA) is applied to spam detection.

Users usually receive spam from anonymous senders with strange email addresses. This certainly does not mean that every email sent by an anonymous sender is spam. Therefore, appropriate methods are needed to detect and separate spam emails from legitimate emails containing important information. In the proposed method, every email received from the server goes through a series of steps to be classified as spam or genuine. The first step is feature extraction, in which a series of general or specific features is extracted from the email body. The next step is feature selection, which identifies related features and removes irrelevant and duplicate features. The final step is classification, which labels emails as spam or genuine. The overall structure of this method is depicted in Fig. 1, which shows the flowchart of the new approach and how it detects spam emails. The next sections provide further details of each step in modifying the HOA.

Fig. 1

Overall structure and steps of the proposed approach
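The three-step pipeline of extraction, selection, and classification can be sketched end-to-end in plain Python and NumPy. This is a toy illustration under our own assumptions: the corpus and labels are invented, and the MOBHOA feature selection step is stubbed with a random binary mask; only the overall data flow mirrors the proposed method:

```python
import numpy as np

# Toy corpus: 1 = spam, 0 = genuine (illustrative data, not the study's data set).
emails = ["win money now click here", "meeting at noon tomorrow",
          "free prize claim now", "project report attached"]
labels = np.array([1, 0, 1, 0])

# Step 1: feature extraction -- bag-of-words counts over the corpus vocabulary.
vocab = sorted({w for e in emails for w in e.split()})
X = np.array([[e.split().count(w) for w in vocab] for e in emails])

# Step 2: feature selection -- stand-in for MOBHOA: a random binary mask.
rng = np.random.default_rng(42)
mask = rng.random(X.shape[1]) < 0.7
X_sel = X[:, mask]

# Step 3: classification -- 1-nearest-neighbour by Euclidean distance.
def knn_predict(query, X_train, y_train):
    d = np.linalg.norm(X_train - query, axis=1)
    return y_train[np.argmin(d)]

print(knn_predict(X_sel[0], X_sel, labels))   # classifies the first email
```

In the proposed method, the random mask in step 2 is replaced by the feature subset found by MOBHOA, and KNN (as noted in the contributions) performs the final classification.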

4.1 Binary HOA

The optimization process in a binary search space differs significantly from that in a continuous search space. In a continuous search space, horse search agents update their positions by adding a step length to their position vectors; in a binary search space, positions cannot be updated this way because each element of the position vector can only take the value 0 or 1. Therefore, a binary version of HOA was needed for feature selection, which is a discrete problem.

Developing the binary version of HOA is simple. We only need to set the variables' minimum and maximum values between zero and one and run the algorithm. Just before the values are sent to the cost function, they are processed with the greatest integer (floor) function to map them to a vector of zeros and ones. The variables themselves remain continuous; they become binary only on entering the cost function. In other words, the algorithm treats the problem as continuous while the cost function treats it as discrete, and the greatest integer function of Eq. (19) acts as the bridge between the discrete (binary) cost function and the continuous algorithm. In Eq. (19), x is a real value between two consecutive integers m and n, and k is the integer obtained by applying the greatest integer function to x. This strategy allows a continuous algorithm to be used for discrete problems.

$$k = \lfloor x \rfloor$$
(19)
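A minimal sketch of this bridging step: the continuous positions that the algorithm evolves are floored only when passed to the cost function. Note that with variables bounded strictly in [0, 1] the floor would always be 0, so this sketch assumes the continuous variables range over [0, 2) (equivalently, one may read Eq. (19) as a rounding step); the range and the function name are our assumptions:

```python
import numpy as np

def binarize(position):
    """Eq. (19): greatest integer (floor) function applied to a continuous
    position vector; with values in [0, 2) (our assumption), the result is
    a 0/1 feature mask that only the cost function ever sees."""
    return np.floor(position).astype(int)

x = np.array([0.2, 1.7, 0.9, 1.1])   # continuous positions the algorithm evolves
print(binarize(x))                    # -> [0 1 0 1]: the cost function's view
```

Each 1 in the resulting mask marks a feature that is kept for classification; each 0 marks a feature that is discarded.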

4.2 Opposition-based binary HOA

Opposition-based learning increases the chance of starting with a better initial population by exploring opposite solutions [48]. This approach can be applied not only to the initial solutions but continuously to any solution in the current population. In metaheuristic approaches, opposition-based learning is generally employed to improve convergence and to limit the growth of temporal complexity. The strategy makes the metaheuristic method search in the direction opposite to the current solution and then keep whichever of the two, the current solution or its opposite, is better. This accelerates convergence and brings the solution closer to the optimum [48]. A sample application of opposition-based learning is discussed in the study by Ibrahim et al. [22].

Starting from a suitable initial population is an essential and challenging task in evolutionary algorithms, as the starting point affects both the algorithm's convergence speed and the quality of the final solution [48]. In an opposition-based algorithm, the members of the original population are determined by first defining an upper and a lower limit for each of the genes that make up the population members; the genes are then drawn randomly between these limits. To use opposite numbers when initializing the population, the opposite of each member is computed according to Eq. (20): assuming that X is the position of a horse between a and b, its opposite \(\overline{X}\) is defined as in Eq. (20). If the cost function at the opposite point is lower than at the original point, the point is replaced; otherwise, the original is kept. The gene and its opposite are therefore evaluated simultaneously, and the more suitable one is carried forward.

$$\overline{X} = a + b - X$$
(20)
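The opposition-based initialization just described can be sketched as follows (an illustrative Python fragment, not the paper's MATLAB code; the cost function is left abstract):

```python
import random

def opposite(x, a, b):
    """Opposite point of Eq. (20): x_bar = a + b - x for each gene in [a, b]."""
    return [a + b - xi for xi in x]

def opposition_init(pop_size, dim, a, b, cost_fn):
    """Generate a random population in [a, b]^dim and keep, for each
    member, whichever of the member and its opposite has the lower cost."""
    population = []
    for _ in range(pop_size):
        x = [random.uniform(a, b) for _ in range(dim)]
        x_bar = opposite(x, a, b)
        # evaluate the gene and its opposite simultaneously, keep the better
        population.append(min(x, x_bar, key=cost_fn))
    return population
```

Every member of the returned population is, by construction, at least as good as its own opposite under the supplied cost function.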

4.3 Multiobjective opposition-based binary HOA

Models used to optimize problems with only one objective function are known as single-objective models. In a single-objective problem, we attempt to find the best solution among the available solutions. In practice, many design and engineering problems involve more than one objective function; these are known as multiobjective optimization problems. In many cases, the objective functions defined in multiobjective optimization problems conflict with each other [9], meaning that the objectives are not compatible [37].

Spam detection is a multiobjective problem. The objectives are the number of features and the classification accuracy: the number of features should be minimal, whereas the classification accuracy should be maximal. Higher classification accuracy means that most emails are assigned to the correct category once classification is complete and that the classification error rate is minimal. Furthermore, because classification relies on the features selected by the modified HOA metaheuristic algorithm, the number of features should be kept as small as feasible to limit complexity. Since more than one objective function must be considered, a multiobjective optimization method is required. The essential aspect of such approaches is that they provide engineers and system designers with more than one solution; these solutions represent the trade-off between the various objective functions [24]. A multiobjective optimization problem can be expressed mathematically as the minimization problem of Eq. (21) [60]:

$$\begin{aligned} & {\text{Minimize:}}\;f_{m} (x), \quad m = 1,2, \ldots ,M \\ & {\text{Subject}}\;{\text{to:}}\;g_{j} (x) \ge 0, \quad j = 1,2, \ldots ,J \\ & h_{k} (x) = 0,\quad k = 1,2, \ldots ,K \\ & L_{i} \le x_{i} \le U_{i} ,\quad i = 1,2, \ldots ,n \\ \end{aligned}$$
(21)

In Eq. (21), M is the number of objectives, J the number of inequality constraints, K the number of equality constraints, and [Li, Ui] the bounds of the ith variable. The solutions of a multiobjective problem cannot be compared with arithmetic relational operators; instead, the concept of Pareto dominance is used to compare two solutions in a multiobjective search space [60].

To date, several single-objective metaheuristic methods have been converted to multiobjective ones [58]. This section explains how we converted the single-objective HOA to a multiobjective HOA. The multiobjective HOA employs a general objective function with a weight vector based on Eq. (22) to relate horses in a multiobjective search space. In this equation, the M objectives of each horse are combined into a single objective.

$$F(x_{i} ) = \frac{1}{M}\mathop \sum \limits_{j = 1}^{M} f_{j} (x_{i} )$$
(22)
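Eq. (22) is a simple average of the objective values; as an illustration (not the paper's MATLAB code), it amounts to:

```python
def aggregate(objectives):
    """Eq. (22): combine a horse's M objective values into one scalar,
    F(x_i) = (1/M) * sum_{j=1}^{M} f_j(x_i)."""
    return sum(objectives) / len(objectives)
```

For the two objectives used here, this averages the classification error and the feature-subset ratio, e.g. `aggregate([error_rate, feature_ratio])`.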

The main difference between the single-objective and multiobjective HOA lies in how the objective is updated. In a single-objective search space, the objective can simply be set to the best solution obtained so far. In the multiobjective HOA, however, the objective must be selected from a set of optimal solutions: these solutions are stored in an archive, and the ultimate objective is one of them. The challenge is to select an objective that improves the distribution of the stored solutions. To this end, the number of solutions in the neighborhood of each stored solution is first calculated [41]; this resembles the MOPSO approach in the study by Zouache et al. [60]. The neighbor count is then used as a quantitative measure of how crowded each area is. Equation (23) gives the probability of choosing a particular solution as the objective.

$$p_{i} = \frac{1}{{N_{i} }}$$
(23)

In Eq. (23), Ni is the number of solutions in the neighborhood of the ith solution. With these probabilities, a roulette-wheel method is used to choose the objective, which improves coverage of the less densely populated areas of the search space. Another benefit is that, in the event of premature convergence, solutions with a crowded neighborhood may still be chosen as the objective, which helps escape the problem [59]. The storage space used for the archive is limited: to lower the computational cost of the multiobjective HOA, only a small number of solutions should be kept, and the archive must be updated frequently. When comparing an out-of-archive solution with the in-archive solutions, several cases arise, and the multiobjective HOA must handle each of them to maintain the archive. The simplest case is when at least one archive member dominates the external solution; in this case, the external solution is discarded immediately. If no archive member dominates the external solution, it is non-dominated and, since the archive stores the non-dominated solutions found so far, it must be added to the archive. Finally, if the external solution dominates one or more archive members, it replaces them.
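The archive-maintenance cases and the roulette selection of Eq. (23) can be sketched as follows (an illustrative Python fragment under the assumption of minimized objective tuples; the neighbor counts Ni are taken as given):

```python
import random

def dominates(u, v):
    """Pareto dominance for minimization: u dominates v if it is no
    worse in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def update_archive(archive, candidate):
    """Handle the cases described above: discard the candidate if any
    archive member dominates it; otherwise remove every member the
    candidate dominates and insert the candidate."""
    if any(dominates(member, candidate) for member in archive):
        return archive  # case 1: candidate is dominated, discard it
    # cases 2 and 3: candidate is non-dominated; it replaces any
    # members it dominates and joins the archive
    archive = [m for m in archive if not dominates(candidate, m)]
    archive.append(candidate)
    return archive

def pick_objective(archive, neighbour_counts):
    """Eq. (23): roulette-wheel selection with weight 1/N_i, favouring
    archive members in sparsely populated regions."""
    weights = [1.0 / n for n in neighbour_counts]
    return random.choices(archive, weights=weights, k=1)[0]
```

A member in a sparse region (small Ni) thus receives a proportionally larger selection probability.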

In spam detection, feature selection is treated as a multiobjective optimization problem with two opposing objectives: (1) a minimal number of selected features and (2) a higher classification accuracy. A classification algorithm is therefore required to define the feature selection objective function [19, 20]. Because most studies in the literature have employed the KNN classification algorithm, this classifier is also used in the current study to define the feature selection objective function; the opposition-based binary HOA was converted to a multiobjective algorithm and then applied to the spam detection problem.

Equation (24) is applied as the multiobjective function for selecting features. It balances two opposing objectives so that a near-optimal solution is chosen.

A smaller number of features contributes to a more optimal solution, yet reducing the number of features can sometimes raise the classification error rate. Likewise, a smaller classification error makes the solution more optimal, but achieving it may require more features. In other words, fewer features do not always yield a better solution: below a certain limit, removing features may reduce classification accuracy, and conversely, pursuing a lower error rate may force more features to be selected. Each objective has its own threshold, which differs from problem to problem, so a balance must be struck between them; Eq. (24) establishes this balance.

$${\text{Fitness}} = \alpha \gamma_{R} (D) + \beta \frac{\left| R \right|}{{\left| N \right|}}$$
(24)

In Eq. (24), \(\alpha \gamma_{R} (D)\) denotes the classifier's error rate, \(\left| R \right|\) the cardinality of the selected feature subset, and \(\left| N \right|\) the total number of features in the data set. α and β weight the quality of the classification and the length of the subset, respectively. The α and β values are adapted from Emary, Zawbaa and Hassanien [14], where α ∈ [0, 1] and β = 1 − α. In this study, α is set to 0.99; thus, β is 0.01. KNN is used to evaluate the features selected by the suggested method and by the other compared methods, serving as a common benchmark for all algorithms [2, 3, 27, 30,31,32, 46].
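With these definitions, Eq. (24) reduces to a one-line weighted sum; a minimal Python sketch (the classifier's error rate is passed in rather than computed, so no KNN implementation is assumed):

```python
def fitness(error_rate, n_selected, n_total, alpha=0.99):
    """Eq. (24): Fitness = alpha * gamma_R(D) + beta * |R| / |N|,
    with beta = 1 - alpha, trading off classifier error against the
    fraction of features retained."""
    beta = 1.0 - alpha
    return alpha * error_rate + beta * (n_selected / n_total)
```

With α = 0.99, a perfect classifier using all 57 Spam Base features scores about 0.01, while an entirely wrong classifier using none scores 0.99: the error term dominates, as intended.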

4.4 Spam detection using multiobjective opposition-based binary HOA

The current study employed a data set on which preliminary processing was performed and from which a set of features was extracted. MOBHOA selects, from the extracted features, those that distinguish spam emails from genuine ones. This is accomplished through HOA's nature-inspired mechanisms, which are discussed as follows.

Feature selection is a four-step process comprising feature subset generation, subset evaluation, termination criterion checking, and validation of the results [26]. First, a feature subset is generated from the data set; candidate features are searched according to the MOBHOA search strategy. Candidate subsets are then evaluated and compared with the best value achieved so far by the evaluation measure; if a better subset is produced, it replaces the previous best. This generation and evaluation of subsets is iterated until the termination criterion of MOBHOA is reached; MOBHOA is repeated several times before the best global solution is achieved. After each cycle, the fitness function calculates the classifier's accuracy for the candidate subset. Candidate generation, fitness calculation, and evaluation continue until the final criteria are met. In general, the termination criteria are defined in terms of two factors: the error rate and the total number of iterations. The algorithm stops if the error rate falls below a certain threshold or if the specified number of iterations is exceeded [26].
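The four-step loop above can be sketched as a generic skeleton (illustrative Python only: random subset generation stands in for the MOBHOA search strategy, and the evaluation function is left abstract):

```python
import random

def feature_selection(n_features, evaluate, max_iter=100, error_threshold=0.01):
    """Skeleton of the four-step feature selection loop.

    `evaluate` maps a binary feature mask to a fitness value (lower is
    better); here a random mask generator stands in for the MOBHOA
    search strategy."""
    best_mask, best_fit = None, float("inf")
    for _ in range(max_iter):
        # step 1: generate a candidate feature subset
        mask = [random.randint(0, 1) for _ in range(n_features)]
        # step 2: evaluate it and keep it if it beats the previous best
        fit = evaluate(mask)
        if fit < best_fit:
            best_mask, best_fit = mask, fit
        # step 3: terminate on low error or exhausted iteration budget
        if best_fit < error_threshold:
            break
    # step 4: the returned subset would then be validated on held-out data
    return best_mask, best_fit
```

In the actual method, step 1 is driven by MOBHOA's horse-herd dynamics and step 2 by the KNN-based fitness of Eq. (24).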

As stated earlier, this study attempted to propose a new optimization approach for feature selection in detecting spam emails using MOBHOA. Figure 2 illustrates the flowchart related to the proposed approach.

Fig. 2
figure 2

The proposed algorithm's flowchart

5 Simulation and evaluation

The new algorithm was implemented and simulated in the MATLAB R2014a environment on a PC with a 64-bit i5 CPU and 4 GB of memory. For the simulation, the 'Spam Base' data set from the UCI data repository was used to evaluate the algorithm's performance in detecting spam; 20% of the data was allocated for training and 80% for testing. The data set contains 4601 emails, of which 1813 (39.4%) are spam and 2788 (60.6%) are non-spam. Every record comprises fifty-eight attributes, the last of which indicates whether the email is spam (1) or genuine (0). The first forty-eight features give the frequency of specific keywords, that is, the percentage of words in the email matching a specific word or phrase; the next six features give the frequency of specific characters; and the remaining three features describe runs of capital letters in the email. In Liu et al. [28], this data set was recommended as one of the most valid and suitable data sets for spam research.

The remainder of this section discusses the classification accuracy of the proposed method in detecting spam compared with GWO and KNN. The simulation results of GWO, KNN, and MOBHOA in terms of classification accuracy over different numbers of iterations are presented in Table 2 and Fig. 3. In this simulation, the number of iterations ranged from 1 to 100, and the population size was set to 20.

Table 2 Comparison of GWO, KNN, and MOBHOA in terms of the accuracy of the classification
Fig. 3
figure 3

Performance comparison of GWO, KNN, and MOBHOA in terms of the spam detection's accuracy

According to Table 2 and Fig. 3, MOBHOA obtains much better results than the GWO and KNN algorithms in detecting spam as the number of iterations increases. MOBHOA's performance was similar to that of the other two algorithms in the first iterations, but with more iterations it clearly outperformed GWO and KNN. This is due to the opposition-based approach, which develops solutions in the opposite region of the search space.

In the next evaluation, MOBHOA was compared with K-Nearest Neighbours-Grey Wolf Optimisation (KNN-GWO), KNN, Multi Layer Perceptron (MLP), Naive Bayesian (NB), and Support Vector Machine (SVM) classifiers with regards to accuracy, sensitivity and precision in detecting spam emails. Table 3 and Fig. 4 show the evaluation results.

Table 3 Comparison of MOBHOA with other classifiers in terms of accuracy, precision, and sensitivity
Fig. 4
figure 4

Comparison of MOBHOA with KNN and KNN-GWO classifiers in accuracy, sensitivity, and precision

As mentioned earlier, detecting spam emails is carried out in two steps: the first is feature selection, and the second is classification. Table 3 shows the results after feature selection.

The MOBHOA-KNN results in Table 3 were obtained by performing the feature selection step with MOBHOA and the classification step with KNN. Likewise, in the KNN-GWO method, feature selection is carried out with GWO and classification with KNN. The KNN, MLP, SVM, and NB methods classify the data without a feature selection step.

The proposed approach to feature selection improves accuracy, precision, and sensitivity and also reduces runtime, because optimal feature selection eliminates redundant or insignificant features so that operations are performed only on significant ones. As a result, the algorithm's execution time decreases while accuracy, precision, and sensitivity increase. In this experiment, KNN was kept constant across all feature selection settings in the first run; in the second run, KNN was combined with the proposed method for optimal feature selection, and the results are shown in Table 3.

As shown in Fig. 4, the evaluation results indicate that MOBHOA improves on KNN and KNN-GWO with respect to accuracy, sensitivity, and precision. Specifically, it has increased the algorithm's accuracy by around 50%.

The results show that the multiobjective opposition-based binary horse herd optimizer, run on the UCI data set, is more successful than several standard metaheuristic methods in terms of the average number of selected features and classification accuracy. According to the results, the proposed algorithm detects spam emails in the data set substantially more accurately than similar algorithms. This is due to the application of HOA, a highly efficient optimization algorithm with outstanding performance on high-dimensional problems, and to the inclusion of a feature selection phase alongside the classification phase. Feature selection decreases computational complexity and increases classification accuracy by removing unnecessary features.

Machine learning-based techniques are among the most efficient ways to solve a variety of problems; however, most machine learning algorithms suffer from computational complexity. More advanced techniques and algorithms are needed to improve accuracy and decrease the complexity and error rate of spam detection; we therefore used the horse herd optimization algorithm to further improve computation speed and accuracy. Recent advances in deep learning suggest that it can also be applied to spam detection, yet only a limited number of studies in the literature have examined the performance of deep learning algorithms for this task, and the majority of the data sets used are either small or artificially generated. Future studies are thus expected to consider big data solutions, large data sets, and deep learning algorithms to develop more efficient spam detection techniques. Furthermore, this study focused specifically on email spam detection; spam on other platforms, such as social networks, was not examined. Future studies may apply this approach to spam detection on those platforms.

6 Conclusion

Unwanted emails, or spam, have become a problem for Internet users and data centers, as they waste a large amount of storage and other resources. Moreover, they provide a basis for intrusion, cyber-attacks, and unauthorized access to user information. Several techniques exist for detecting, filtering, and classifying spam and facilitating its removal, but most of the proposed approaches retain some error rate, and none of the spam detection techniques, despite the optimizations performed, has been effective on its own. The objective of this paper was to apply a robust metaheuristic optimization algorithm to detect spam emails for use in email services. For this purpose, the horse herd optimization algorithm was employed: a novel nature-inspired metaheuristic developed for solving highly complex optimization problems. Spam detection is a discrete problem with multiple objectives; to apply HOA to it, the original continuous HOA was first binarised and then transformed into a multiobjective opposition-based algorithm to solve the feature selection problem in spam detection. The new algorithm, the multiobjective opposition-based binary horse herd optimization algorithm (MOBHOA), was implemented and simulated in MATLAB, and experiments were conducted on the Spam Base data set from the UCI data repository to evaluate its performance in detecting spam. According to the simulation results, the new approach outperforms similar approaches such as KNN, GWO, MLP, SVM, and NB in classification as well as in accuracy, precision, and sensitivity. The findings demonstrate that the new approach outperforms similar metaheuristic solutions introduced in the literature; it could therefore be used for feature selection in spam detection systems.