Memetic algorithms for training feedforward neural networks: an approach based on gravitational search algorithm

The backpropagation (BP) algorithm is a gradient-based algorithm used for training a feedforward neural network (FNN). Although BP is still widely used to train FNNs, it has several disadvantages: (i) it fails when non-differentiable functions are addressed, (ii) it can become trapped in local minima, and (iii) it converges slowly. To alleviate some of these problems, metaheuristic algorithms have been used to train FNNs. Although they have good exploration skills, they are not as good as gradient-based algorithms at exploitation tasks. The main contribution of this article lies in its application of novel memetic approaches based on the Gravitational Search Algorithm (GSA) and the Chaotic Gravitational Search Algorithm (CGSA), called respectively the Memetic Gravitational Search Algorithm (MGSA) and the Memetic Chaotic Gravitational Search Algorithm (MCGSA), to train FNNs on three classical benchmark problems: the XOR problem, the approximation of a continuous function, and classification tasks. The results show that both approaches constitute suitable alternatives for training FNNs, even improving on the performance of other state-of-the-art metaheuristic algorithms such as Particle Swarm Optimization (PSO), the Genetic Algorithm (GA), the Adaptive Differential Evolution algorithm with Repaired crossover rate (Rcr-JADE), and the Covariance matrix learning and Bimodal distribution parameter setting Differential Evolution (COBIDE) algorithm.


Keywords: Multilayer perceptron · Internet of things (IoT) · Neural network training problem

Nomenclature:
n_j   Number of neurons in layer j
c     Total number of layers in an FNN
W_j   Weight matrix associated with the connections from layer j−1 to layer j
b_j   Threshold vector of the neurons of layer j
x_i   Input vector for the FNN

Neural networks are one of the most popular and successful machine learning techniques. The aim of machine learning is to provide algorithms that allow the computer to learn automatically, without human intervention, and to adjust its actions accordingly. Specifically, FNNs are the best known neural network architecture (see [33]), with the layers interconnected in such a way that the output of the ith layer is used as the input of the (i+1)th layer. Most commercial applications which use neural networks implement this kind of architecture, since it has been shown to be suitable for addressing a large class of problems pertaining to pattern recognition or prediction (see [8,41,42,67]). Moreover, an FNN with only one hidden layer is considered a universal approximator (see [31]), as it can approximate any continuous function.
In neural networks, the learning process takes place through the adjustment of the weights and biases which measure the strength of the connection between neurons from different layers. These weights are usually updated in order to minimize the error made by the network, which is formulated using a loss function, like mean square error (MSE), sum of square errors (SSE) or root mean square error (RMSE). The process of adjusting the weights and biases for a given network is known as training. Therefore, from the mathematical point of view, behind a neural network lies the solution of an unconstrained optimization problem.
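For illustration, the MSE loss mentioned above can be computed as follows. This is a minimal sketch of ours, not code from the paper:

```python
import numpy as np

def mse_loss(outputs, targets):
    """Mean squared error between network outputs and target values."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((outputs - targets) ** 2))
```

SSE and RMSE follow the same pattern, replacing the mean by a sum or taking the square root of the result.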
In the past, gradient-based techniques were the most widely used for FNN optimization (see [32]). These algorithms use a specialized method for computing the gradient of the loss function with respect to the parameters of the network, known as the backpropagation technique. The most representative algorithm of this class is the steepest descent approach known as the BP algorithm (see [21]). This algorithm is based on the idea of propagating the errors made by the network to the back layers. This identifies how the error varies with respect to the network parameters and adjusts the weights and biases in order to minimize the loss function.
Training an FNN using the BP algorithm is difficult when the problems are non-differentiable or multimodal. This is because gradient-based algorithms are liable to become trapped in local minima (see [26]), making their performance highly dependent on the initial values of the weights and biases and on the chosen hyperparameters of the optimization algorithm. Furthermore, the vanishing gradient problem becomes a major concern when many hidden layers are added to the network. These facts motivate the search for new approaches and algorithms which outperform BP in these scenarios. To address these problems, metaheuristic algorithms have been successfully applied to training FNNs (see [12,52,53]). Metaheuristics improve on the exploration capacity of gradient-based algorithms, but they exhibit slow convergence in comparison with gradient-based approaches, which converge at a (super-)linear rate (see [4]).
In the last few years, MAs have appeared as a new computation paradigm which tries to combine metaheuristic algorithms with one or more local search procedures in order to combine the advantages of both approaches (see [47,50]). MAs have been successfully applied in different domains such as the train timetabling problems (see [14]) and segmentation of temporal series (see [40]).
Recently, [20] proposed two memetic algorithms for unconstrained optimization based on the hybridization of GSA (see [60]) or CGSA (see [44]) with quasi-Newton (qN) search directions. The resulting algorithms are named MGSA and MCGSA. In that study, the authors showed that these memetic algorithms can be considered part of the state of the art, since they outperformed state-of-the-art metaheuristic algorithms on a wide set of synthetic and real-world global optimization problems. This paper studies the application of both memetic approaches to a challenging machine learning problem, namely the training of an FNN. Both approaches are particularly suited to training an FNN: they incorporate a qN algorithm with a superlinear convergence rate, and they include a promising evolutionary algorithm, so as not to become trapped in local minima. The main advantage of using a metaheuristic algorithm instead of BP is that metaheuristic algorithms are able to escape from local optima, whereas BP may be trapped in them. However, the exploitation capabilities of BP significantly surpass those of metaheuristic algorithms. This situation motivates the use of memetic approaches based on gradient information, which combine the advantages of BP and metaheuristic algorithms: a greater rate of convergence than metaheuristic algorithms and, unlike BP, the ability to find a global optimum of the problem. To this end, the implementation of these algorithms has been adapted to this problem, and the results obtained are analyzed in order to compare the performance of the memetic approaches with that of other state-of-the-art metaheuristic algorithms.
Furthermore, [45] addressed the FNN training problem using GSA, PSO and a hybrid of GSA with PSO. This study replicates their computational experiment in order to assess the performance of MGSA and MCGSA in training FNNs. In addition, a statistical comparison of both approaches with state-of-the-art metaheuristic algorithms was carried out. To do this, PSO and GA were chosen as the traditional algorithms, while Rcr-JADE (see [24]) and COBIDE (see [69]) were selected as recently proposed algorithms, since these variants of the differential evolution (DE) algorithm have been widely used in recent years and give the best results in rankings (see [57]). The results obtained show that MCGSA statistically improves, in terms of convergence and speed, on the results obtained by the state-of-the-art metaheuristics when training an FNN.
The rest of the article is structured as follows: Sect. 2 gives a brief review of GSA applications in neural networks; Sect. 3 defines and formalizes the neural network training problem. Then, Sect. 4 describes GSA, CGSA and the memetic algorithms used in this paper. Section 5 shows the results of experiments carried out in this study, and finally, Sect. 6 summarizes the conclusions and further work derived from this article.

Related work
The success of metaheuristics and evolutionary algorithms in the field of optimization is well known. They are approximate and non-deterministic optimization methods which incorporate mechanisms to escape from local optima in order to reach the global optimum. The popularity of these methods is due, among other things, to their versatility, simple coding, the few hypotheses to be fulfilled and the fact they do not use derivatives. The applications of these algorithms cover many, if not all, fields of engineering.
Some recent applications are related to hydrology, such as [9], where an adaptive-network-based fuzzy inference system (ANFIS) model is designed for long-term prediction of discharges by the Manwan Hydropower Plant into the Lancangjiang River, in order to manage and schedule the hydroelectric reservoir. Subsequently, [68] proposed a neural network model coupled with ensemble empirical mode decomposition (EEMD) designed for medium- and long-term forecasting in hydrological time series. Hybrid algorithms have also been applied in this field, for example in [46], where a hybrid firefly algorithm with support vector regression is applied to predict evaporation for the purpose of managing water resources, and in [72], where an enhanced extreme learning machine (EELM) is used to forecast river flow in the Kelantan River (Malaysia).
Regarding other fields of science and engineering, recent relevant applications can also be found. For example, in chemistry, [48] have recently proposed approaches based on neural networks, ANFIS and response surface methods to estimate and optimize the main parameters which affect the yield and cost of biodiesel production, and found that neural networks with radial basis functions give the maximum biodiesel yield production with the minimum production cost. In another field, [17] presented a survey of computational intelligence techniques applied to address severe flood management, identifying neural networks, soft computing techniques, decision trees, fuzzy logic and hybrid approaches. Finally, other recent applications in the field of power systems are [1] where a hybrid algorithm based on PSO and the bacterial foraging optimization algorithm (BFOA) is applied to power system stabilizer design, and [2] where the same authors used GSA to design a static synchronous series compensator for single and multi-machine power systems.
Metaheuristics and evolutionary algorithms have been applied in the area of neural networks to optimize three FNN components: (i) the connection weights, i.e., to find the optimal combination of weights and biases which provides the minimum error while keeping the other components fixed at their initial setting; (ii) the architecture, i.e., the metaheuristic is used to search for the optimum architecture in a compact space of FNN topologies; and (iii) the hyperparameters, such as the learning rate in BP (recall that the efficiency of these algorithms is highly conditioned by these values). The first two areas have attracted the attention of the research community, with many studies appearing in these fields. The aim of this section is to review the studies and applications of GSA in the field of neural networks, paying special attention to groups (i) and (ii).

Using GSA for training neural networks
The complexity of the training problem derives from the topology of the network, since the size of the network, in terms of hidden neurons and hidden layers, leads to a large number of parameters to be optimized. For this reason, the number of hidden neurons is usually assumed to be fixed in this kind of problem, or different sizes are proposed in order to study which network topology provides the best performance. The sizes of the hidden layers are usually set based on the previous experience of the practitioners.
Metaheuristics and their hybridizations have been applied to the FNN training problem to avoid the drawbacks of gradient-based algorithms. Hybrid algorithms can be classified into two main categories: (i) combinations of two algorithms to take advantage of the local and global search capacities of both; thus, [70,73] proposed a memetic algorithm based on GA and BP, while [75] proposed a memetic PSO with the BP algorithm for training FNNs; and (ii) combinations of two metaheuristics, with the purpose of obtaining a more robust global search algorithm. A representative example of this trend is [34], where PSO and GA were used to design a hybrid algorithm for recurrent neural network design.
Although there are many studies that discuss this kind of algorithm (see [29,71]), this section focuses on studies which use GSA for training neural networks. In 2012, [45] proposed a hybrid of PSO with GSA (PSOGSA) for training FNNs with one hidden layer. They used this algorithm to test the performance of the network on three classical benchmark problems, showing that PSOGSA outperforms PSO and GSA in terms of convergence speed and avoidance of local minima.
Later, [62] proposed a hybrid of GSA with GA, showing that it outperforms the BP algorithm on function approximation problems and the XOR problem. The idea of hybridizing GSA with GA was subsequently used in [36] for tuning damping controller parameters in a unified power flow controller, giving good results. Moreover, [49] proposed a neuro-fuzzy system, developed using PSO hybridized with GSA, in order to predict the scouring process at pile groups due to waves. Recently, [55] proposed a two-layer FNN trained with a sensitivity-based linear learning method (SBLLM) hybridized with GSA for estimating the band gap of a doped titanium dioxide semiconductor using crystal lattice distortion.

Using GSA for optimizing network topology
Choosing the optimal number of hidden neurons and of hidden layers is a key point when considering FNNs. It is a complex problem, since it has to be tuned for the problem at hand. That is the reason why many studies propose different sizes for the hidden layer and, later, the network which exhibits the best performance is chosen (see [19]).
Thus, [5] uses an Elman-type recurrent neural network to estimate solar radiation intensity in order to determine the maximum power point tracking in photovoltaic systems. The number of hidden neurons of the proposed neural model is optimized using a hybrid binary PSO algorithm and GSA (BPSO-GSA); the best network topology had six hidden neurons. Later, [3] proposed a two-layer FNN optimized with an SBLLM method for estimating the relative cooling power of manganite-based materials for magnetic refrigeration enhancement. The number of epochs and hidden neurons of the network was optimized using GSA, giving an optimized network with 67 hidden neurons trained over 4468 epochs. Finally, another recent application of this approach is [56], where the authors propose an extreme learning machine (ELM) approach for precise quantitative analysis of laser-induced breakdown spectroscopy (LIBS) spectra. The number of hidden neurons is optimized using a hybrid of GSA with the ELM.
The analysis of these recent studies reveals a growing interest in hybrid algorithms based on GSA for optimizing FNNs. In this paper, we address the problem of choosing an optimal network topology based on memetic algorithms, an approach that remains little explored and constitutes a promising direction of research.

The neural network training problem
This section describes and formalizes the optimization problem which arises from the training of an FNN. In an FNN, also known as a multilayer perceptron (MLP), the network topology is composed of the input layer, a set of hidden layers and, finally, the output layer. The connections between the neurons from different layers are always forward and, usually, all the neurons of one layer are connected to all the neurons of the next layer. A general scheme of an FNN or MLP is shown in Fig. 1. In the figure, n_j is the number of neurons in the jth layer and the parameter c denotes the total number of layers in the architecture.
In every FNN, the connections between the neurons from different layers are associated with a real number, the so-called weight of the connection. The weight of a connection expresses the synaptic power of a given connection, which can be excitable (positive sign) or inhibitory (negative sign). Furthermore, each neuron in the topology has an internal threshold (b). It is used as a comparison factor in order to produce the output of the network and activate or not a neuron in the topology.
There are two main steps in the training process of an FNN. First, the inputs of the network are propagated forward in order to obtain an output, and the error between the output produced by the network and the desired value is computed. In a second step, the errors are propagated backward, adjusting the weights and thresholds of each neuron in the topology in order to minimize an error or loss function (defined by the designer). This last step is carried out using an optimization method.
We consider a canonical FNN with fully connected layers. The first is the input layer, the following layers are hidden layers and the last one is the output layer. The sizes of the layers are n_j for j = 1, …, c. Let W_j ∈ ℝ^{n_j × n_{j−1}} be the weight matrix associated with the connections from layer j−1 to layer j, and let b_j ∈ ℝ^{n_j} be the threshold vector of the neurons of layer j, for j = 1, …, c. The process of propagating the inputs of the network forward is the following:
1. Presenting the input vector to the network. The activation of the input layer is the input vector itself, which is computed using Eq. (1).
2. Computing the activation of the neurons of the hidden layer. The neurons of the hidden layer process the information received from the neurons of the input layer, applying the activation function to the weighted sum of the activations received. The FNN applies successive transformations to the given input x_i by means of Eq. (2).
where the vector b_j ∈ ℝ^{n_j} contains the jth layer parameters (biases) and s is a component-wise nonlinear activation function. There are many alternatives for the activation function of a neural network, such as the sigmoid, the hyperbolic tangent, the ReLU (rectified linear unit), the Leaky ReLU and the Softmax. A nonlinear activation function is necessary to avoid linearity of the computation: if successive layers applied only linear transformations, their composition would itself be linear, and the computation of those layers could be performed by a single one. Thus, to provide suitable computing capacity to a neural network, it is necessary to establish a nonlinear relationship between the input of each neuron and its output. The most widely used activation functions s in FNNs include the sigmoid function s(x) = 1/(1 + exp(−x)), the ReLU function s(x) = max{0, x} and the hyperbolic tangent s(x) = (e^x − e^{−x})/(e^x + e^{−x}). In this paper, the sigmoid function has been chosen. Hence, Eq. (2) can be rewritten as Eq. (3).
3. Computing the activation of the neurons of the output layer. The last step is to compute the activation of the neurons of the output layer in the same way as was done for the hidden layer. However, in this case, the activation of the neurons will be the output of the network o_i, as shown in Eq. (4).
where o_i is the output vector provided by the network for the input vector x_i.
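The forward propagation described in steps 1-3 can be sketched as follows. This is an illustrative implementation of ours, assuming the sigmoid activation chosen in the paper in every layer; it is not the authors' code:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate an input vector through a fully connected FNN.
    weights[j] has shape (n_j, n_{j-1}) and biases[j] has shape (n_j,),
    matching the W_j and b_j of the text."""
    a = np.asarray(x, dtype=float)      # activation of the input layer
    for W, b in zip(weights, biases):   # hidden layers, then output layer
        a = sigmoid(W @ a + b)          # weighted sum plus bias, then s(.)
    return a                            # the network output o_i
```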
This process, in which the neural network obtains the output vector o_i for an input vector x_i given the parametrization of the network x = {W_j, b_j}_{j=1}^{c}, can be schematically represented by o_i(x_i; x). Without loss of generality, it can be considered that x is a parameter vector. The training problem consists of determining the parameter vector x and involves a training dataset {(x_1, y_1), …, (x_N, y_N)} and the choice of a loss function ℓ, obtaining the resulting optimization problem shown in Eq. (5). A popular choice for the loss function ℓ is the MSE, shown in Eq. (6), where ‖·‖₂ is the Euclidean norm. Hence, Eq. (6) is the objective function of the problem, which can be formalized through Eq. (7).
Therefore, and in accordance with the prior formalization, the training problem of an FNN (without specifying a loss function) can be stated as

Minimize_{x ∈ ℝⁿ} f(x).   (8)

A small example is shown to illustrate the encoding strategy for the optimization algorithms. Given the fully connected FNN shown in Fig. 2, with three layers where n_0 = 3, n_1 = 4, n_2 = 2, a parametrization would be encoded using Eq. (9).
We use the vector x to represent the parametrization of the network. In population-based metaheuristics, each individual of the population is represented by a vector x which includes the weights and biases of the different layers of the network. At the end of the optimization process, the best individual in the population is used to set the parameters of the neural network.
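This flattening of the network parameters into a single vector can be sketched as follows. The ordering used here (all weight matrices first, then all bias vectors) is one possible layout; the exact ordering of Eq. (9) may differ:

```python
import numpy as np

def encode(weights, biases):
    """Flatten all weight matrices and bias vectors into one vector x."""
    return np.concatenate([W.ravel() for W in weights] +
                          [b.ravel() for b in biases])

def decode(x, sizes):
    """Rebuild W_j and b_j from x for layer sizes [n_0, ..., n_c]."""
    weights, biases, k = [], [], 0
    for n_prev, n in zip(sizes[:-1], sizes[1:]):
        weights.append(x[k:k + n * n_prev].reshape(n, n_prev))
        k += n * n_prev
    for n in sizes[1:]:
        biases.append(x[k:k + n])
        k += n
    return weights, biases
```

For the example with n_0 = 3, n_1 = 4, n_2 = 2, the vector x has 3·4 + 4·2 + 4 + 2 = 26 components.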

Memetic gravitational search algorithms
This section describes the MGSA and MCGSA algorithms for training FNNs. First, the GSA, CGSA and qN algorithms are reviewed. Then, the two memetic proposals, MGSA and MCGSA, are presented.

Gravitational search algorithm (GSA)
GSA is a physics-inspired algorithm proposed by Rashedi et al. in [60]. It is based on the theory of Newtonian gravity, in such a way that each solution in the search space corresponds to a mass in a universe metaphor. Then, according to the universal law of gravity, heavier masses (solutions with better fitness values) attract the lighter ones (solutions with worse fitness values), ensuring convergence. The notation we use to describe GSA differs slightly from the original formulation, since it has been adapted to the neural network training problem. In order not to diverge excessively, the notation is reused, and the indices i and j refer to population individuals since, in this context, there is no risk of confusion with the hidden layer j or the training data i.
At the start, a population of N_pob solutions is randomly initialized, with a position and a velocity for each solution. The objective function is then evaluated over all the solutions. In the GSA metaphor, the mass of each solution is related to its fitness value. The mass of solution i in iteration t is computed as shown in Eq. (11):

M_i^t = m_i^t / Σ_{j=1}^{N_pob} m_j^t,   (11)

where m_i^t is computed through Eq. (12):

m_i^t = (fit_i^t − worst^t) / (best^t − worst^t),   (12)

where fit_i^t is the fitness value of solution i in iteration t, worst^t is the worst fitness value in iteration t and best^t is the best fitness value in iteration t.
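The mapping from fitness values to normalized masses can be sketched as follows (an illustrative reading of Eqs. (11)-(12) for minimization; the small constant eps guarding the divisions is our addition):

```python
import numpy as np

def masses(fit, eps=1e-12):
    """Eqs. (11)-(12): map fitness values to normalized masses.
    For minimization, the lowest fitness receives the largest mass."""
    worst, best = fit.max(), fit.min()
    m = (fit - worst) / (best - worst + eps)   # Eq. (12)
    return m / (m.sum() + eps)                  # Eq. (11)
```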
Next, the gravitational force exerted on each individual by the rest is computed through Newton's law of gravitation, shown in Eq. (13):

F_ij^t = G^t (M_pi^t M_aj^t / (R_ij^t + ε)) (x_j^t − x_i^t),   (13)

where F_ij^t denotes the gravitational force which acts on mass i from mass j. M_pi^t corresponds to the passive gravitational mass of mass i, while M_aj^t is the active gravitational mass of mass j. Note that in GSA, M_ai = M_pi = M_i. R_ij^t is the Euclidean distance that separates masses i and j; R_ij is used in GSA instead of R_ij² because it provides better results (see [60]). The parameter ε is a small positive constant used to avoid division by zero. Moreover, G^t is the gravitational constant used to compute the force. It is initialized to a high value at the beginning of the algorithm and is reduced over the iterations according to

G^t = G_0 e^{−α t/T},

where G_0 is the initial value of G, α is the so-called learning rate, t is the current iteration and T is the maximum number of iterations allowed. G^t is an exponentially decreasing function which manages the balance between exploration and exploitation during the execution of the algorithm. At first, high G values give the algorithm greater exploration capability. Subsequently, this value decreases according to the expression shown above, giving the algorithm a greater exploitation capacity. Then, the total force exerted on a given mass is computed as the randomly weighted sum of the forces exerted on it by the rest of the space objects. The introduction of the random component in this calculation provides the algorithm with stochastic behavior. The total force is computed as shown in Eq. (14):

F_i^t = Σ_{j≠i} rand_j F_ij^t.   (14)
where rand_j is a uniformly distributed random number in the interval (0, 1). In addition to the gravitational constant, GSA has a further mechanism to control the balance between exploration and exploitation: the Kbest function, whereby only the k best objects apply their force to the rest at iteration t. Kbest^t is a linear function of the iteration t, and its value decreases linearly with the number of iterations. This means that at the beginning of the algorithm all the objects exert their force on the rest, enhancing exploration, while later only the Kbest objects apply their force, enhancing exploitation. Thus, the summation in Eq. (14) is taken over j ∈ Kbest^t \ {i}.
Finally, Newton's second law is applied to compute the acceleration of each mass at the end of each iteration, a_i^t = F_i^t / M_i^t. Then, the velocity and position of each mass are updated based on the total force:

v_i^{t+1} = rand_i · v_i^t + a_i^t,
x_i^{t+1} = x_i^t + v_i^{t+1},

where rand_i is a uniformly distributed random number in (0, 1). According to these formulae, GSA ensures the movement of the particles toward the particles with greater mass, performing exploitation as the heavier particles move slowly. A flowchart of the GSA is shown in Fig. 3.
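Putting the mass, force and motion updates together, one GSA iteration for minimization can be sketched as below. This is an illustrative reading of Eqs. (11)-(14), not the authors' implementation; the random weighting is simplified and the passive mass is cancelled analytically:

```python
import numpy as np

def gsa_step(X, V, fit, G, kbest, eps=1e-12, rng=np.random.default_rng()):
    """One GSA iteration (minimization). X, V are (N, n) arrays of
    positions and velocities; fit is the length-N fitness vector."""
    N = len(X)
    worst, best = fit.max(), fit.min()
    m = (fit - worst) / (best - worst + eps)      # Eq. (12)
    M = m / (m.sum() + eps)                        # Eq. (11)
    elite = np.argsort(fit)[:kbest]               # the Kbest agents
    A = np.zeros_like(X)
    for i in range(N):
        for j in elite:
            if j == i:
                continue
            R = np.linalg.norm(X[i] - X[j])
            # Eqs. (13)-(14); the passive mass M_pi cancels when dividing
            # the force by M_i to get the acceleration, so it is omitted.
            A[i] += rng.random() * G * M[j] * (X[j] - X[i]) / (R + eps)
    V = rng.random((N, 1)) * V + A                # velocity update
    return X + V, V                               # position update
```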

Chaotic gravitational search algorithm (CGSA)
CGSA is a variant of the GSA which appeared in [44] aimed at dealing with the problem of slow convergence inherent to GSA and improving the exploration/exploitation trade-off.
In the GSA, the exploration-exploitation trade-off is controlled by the two functions G t and Kbest t . On the one hand, Kbest t encourages the exploration step of the algorithm, allowing all the masses to exert their strength on the rest. This function is decreased using a linear function, and therefore, as the number of iterations increases, the number of masses which exert their force on the rest decreases, focusing the search on the most promising environments found so far. Kbest t thus encourages the exploitation stage toward the end.
On the other hand, the gravitational constant G_t also controls the trade-off between exploration and exploitation. Initially, high G_t values encourage exploration, increasing the force exerted by a mass i on the rest. This value is later decreased using a nonlinear function, in such a way that at the end of the algorithm, the slow movement of the heavier agents (thanks to the low values of G_t) encourages the exploitation skills of the GSA, that is, the ability to approximate a local minimum once a promising search environment has been found.
In [44], different chaotic maps are added to G_t in order to perturb its behavior. In the original GSA, G_t decreases monotonically over the iterations, so the algorithm either explores or exploits. Adding a chaotic map to the gravitational constant chaotically changes the value of G_t while it decreases over the iterations, providing both exploration and exploitation at the beginning and at the end of the algorithm. This approach significantly improves the performance of the original GSA, and the sinusoidal chaotic map was shown to be the most suitable for GSA. A graphical example of introducing a sinusoidal chaotic map into G_t is shown in Fig. 4. The flowchart of CGSA is identical to the GSA flowchart; the only difference is that the gravitational constant is replaced by its chaotic counterpart G_t^chaotic. CGSA with the sinusoidal map is the variant used in the experimental section.
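The idea of superimposing a chaotic term on G_t can be sketched as follows. The map constant 2.3 is the usual choice for the sinusoidal chaotic map; the (1 − t/T) damping of the chaotic term is an assumption for illustration, not the exact normalization of [44]:

```python
import math

def sinusoidal_map(c):
    """One step of the sinusoidal chaotic map, for c in (0, 1)."""
    return 2.3 * c * c * math.sin(math.pi * c)

def chaotic_G(G0, alpha, t, T, c):
    """Gravitational 'constant' for CGSA: the exponentially decreasing
    G_t of GSA plus a chaotic perturbation that shrinks over time."""
    c_next = sinusoidal_map(c)
    G = G0 * math.exp(-alpha * t / T) + c_next * (1.0 - t / T)
    return G, c_next
```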

Gradient-based approaches
In unconstrained optimization tasks, such as the training problem of an FNN, descent direction methods are widely used. These methods consider a search direction d_k at the kth iteration and make a movement in that direction, since it guarantees a decrease in the loss function for sufficiently small steps.
In order to analyze the descent property of these methods, the following function of one variable is defined:

φ(α) = f(x_k + α d_k).   (15)

The function φ evaluates the progress of the loss function in the direction d_k. If we require the derivative of φ at the point α = 0 to be negative, to ensure descent for sufficiently small values of α, the search direction must satisfy

∇f(x_k)ᵀ d_k < 0,   (16)

where ∇f(x_k) is the vector of first-order partial derivatives of the loss function with respect to the vector x, i.e., the gradient of the loss function. Condition (16) guarantees that the method is a descent method and allows us to consider weight changes of the form

x_{k+1} = x_k + α_k d_k,   (17)

where α_k is small enough to give a decrease in the loss function.
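Condition (16) is easy to check numerically. A minimal sketch (the function name is ours, for illustration):

```python
import numpy as np

def is_descent_direction(grad, d):
    """Check condition (16): grad f(x_k)^T d_k < 0."""
    return float(np.dot(grad, d)) < 0.0
```

In particular, the steepest descent choice d_k = −∇f(x_k) always satisfies this check whenever the gradient is nonzero.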
The canonical example is the steepest descent method, defined by the choice d_k = −∇f(x_k), which satisfies condition (16). The steepest descent method is a first-order gradient descent algorithm and is known as BP when used for training FNNs. The numerical scheme leads to

x_{k+1} = x_k − α_k ∇f(x_k),

where α_k is the learning rate. The main feature of BP is that the gradient calculation is specialized to the structure of the problem at hand, a method known as the backpropagation algorithm, in which the error obtained at the output layer is propagated backward to the hidden layers. Several numerical optimization methods appear when certain gradient transformations are applied. If d_k is a descent direction, the transformation d̂_k = H_k d_k, where H_k is a positive definite matrix, means the direction d̂_k is still a descent direction. The most notable example is the quasi-Newton direction d_k = −H_k ∇f(x_k), combined with a line search to determine a suitable step length, where α_k is taken as an approximate solution of the one-dimensional problem (21) of minimizing f(x_k + α d_k) over α > 0. One of the most widely used Hessian updating methods is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula. There is growing evidence that BFGS is the best current update formula for use in unconstrained optimization (see [10]); thus, it is considered the most effective of all quasi-Newton updating formulae. It approximates the inverse Hessian matrix [∇²f(x_{k+1})]^{−1} by the expression

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,

where ᵀ indicates the transpose, ρ_k = 1/(y_kᵀ s_k), s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). The procedure is then repeated from the point x_{k+1}. The initial matrix H_0 can be set to any symmetric positive definite matrix, e.g., the identity matrix. Each iteration can be accomplished at a cost of O(n²) arithmetic operations (plus the cost of function and gradient evaluations), where n is the number of decision variables. Storage and computing requirements therefore increase in proportion to n² and become impractical for large n, as in deep neural networks. Many approaches have been proposed to deal with this drawback, including limited memory BFGS methods (L-BFGS) (see [39,51]).
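The BFGS update of the inverse Hessian approximation can be sketched as below, with s_k and y_k as defined in the text. This is an illustrative implementation of the standard formula, not the paper's code:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H_k, with
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k).
    Assumes the curvature condition y^T s > 0 holds."""
    rho = 1.0 / float(y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

On a one-dimensional quadratic f(x) = x², a single update starting from H_0 = I recovers the exact inverse Hessian 1/2.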

Memetic algorithms based on GSA and CGSA
The proposed memetic algorithms have two main components: (i) an evolutionary framework or metaheuristic algorithm and (ii) one or more local search methods which are introduced into (i). The two memetic algorithms are proposed according to the following choices: (a) MGSA = GSA for (i) + qN for (ii), and (b) MCGSA = CGSA for (i) + qN for (ii). Both approaches share the local search and differ in the metaheuristic stage, which is, respectively, GSA and CGSA. The design of both memetic algorithms is based on the characterization of local minima: the algorithms must be able to detect that a promising neighborhood has been found in order to trigger the qN method. Adding mechanisms to characterize local minima is therefore a crucial stage in the design of these memetic algorithms. This paper characterizes the concept of a promising neighborhood of an unexplored local minimum using two rules: (i) it contains a solution which has a better fitness value than the best found so far, and (ii) it is located far from the current best solution. These rules are formally expressed as:

Rule 1: best^t < f_*^{t−1} − ε,
Rule 2: ‖x_{i*}^t − x_*^{t−1}‖ > γ.

In Rule 1, f_*^{t−1} is the best objective function value obtained thus far, best^t is the fitness value of the best solution in iteration t and ε is a parameter which measures whether the new best point is sufficiently better than the best found so far.
In Rule 2, $x^{t}_{i^{*}}$ is the best point in the population at iteration $t$, $x^{t-1}_{*}$ is the best point obtained so far and $\gamma$ is a parameter which measures whether the new best point is far enough from a neighborhood of $x^{t-1}_{*}$. Rule 1 states that the point $x^{t}_{i^{*}}$ reached in the current iteration $t$ is better than the previous points analyzed, and Rule 2 requires that $x^{t}_{i^{*}}$ not belong to a neighborhood of $x^{t-1}_{*}$, so that it is worthwhile to perform an exploitation stage in this new search environment. The parameters $\epsilon \geq 0$ and $\gamma > 0$ must be set in order to detect a promising neighborhood.
However, the detection of a promising region is not the only consideration taken into account by the algorithm, which also determines where and when the local search method is triggered and the intensity with which it is applied. Fulfillment of both rules (i) and (ii) triggers the qN method at the end of the corresponding GSA or CGSA iteration, starting from the best current solution $x^{t}_{i^{*}}$. Finally, the intensity of the local search is governed by a tolerance parameter. The tolerance parameter $tol$ represents a lower bound on the size of a step or on the decrease of the loss function. If the algorithm attempts to take a step smaller than the tolerance value, or the improvement is not significant, the iterations of qN end. Thus, the stopping criterion used in the quasi-Newton method is
$\| x_{k+1} - x_k \| < tol$ or $f(x_k) - f(x_{k+1}) < tol$.
The $tol$ parameter manages the intensity of the local search. At first, it is convenient to explore the neighborhood sparsely, increasing the intensity as the algorithm progresses. For this reason, $tol$ is initialized to $tol = 0.01$ and is then decreased to $tol = tol/10$ each time the quasi-Newton method is performed.
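The two trigger rules amount to a simple predicate on the population's best point. A minimal Python sketch follows; the function name and the use of the Euclidean norm are assumptions for illustration:

```python
def promising_neighborhood(f_best_prev, best_t, x_best_t, x_best_prev, eps, gamma):
    # Rule 1: the fitness of the new best solution improves on the best so far
    # by more than eps.
    rule1 = (f_best_prev - best_t) > eps
    # Rule 2: the new best point lies outside a gamma-neighborhood of the
    # previous best point (Euclidean distance).
    dist = sum((a - b) ** 2 for a, b in zip(x_best_t, x_best_prev)) ** 0.5
    rule2 = dist > gamma
    return rule1 and rule2
```

With the basic configuration used later in the paper (eps = 0, gamma = 1), the qN method fires whenever the population finds a strictly better solution more than one unit away from the previous best.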
It should be noted that, although the qN search method is used in this paper, any other conventional algorithm could be chosen as the local search component of the memetic algorithm. In particular, algorithms specialized for the neural network training problem, such as Rprop (see [61]), Quickprop (see [15]) or Levenberg-Marquardt (see [66]), may be chosen. These algorithms allow specialized gradient computation but have a linear convergence rate. This paper uses the qN method due to its superlinear convergence rate.
Finally, in order to summarize the design of the memetic algorithms, Fig. 5 shows a flowchart of both MGSA and MCGSA. The flowchart shows how the metaheuristic algorithm is run and, when a promising area is found (satisfying Rules 1 and 2), the qN method is triggered, starting from the best current point. The point obtained by the qN method replaces the best population point, updating not only its position but also its velocity. The structure of this memetic algorithm also allows the use of the qN method to be extended to non-differentiable problems: if the memetic algorithm reaches a non-differentiable point, the local search step carried out by qN becomes a null stage, but the algorithm escapes from this problematic point thanks to its evolutionary framework.
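The overall flow can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: a seeded random perturbation plays the role of the GSA/CGSA iteration, plain gradient descent on a one-dimensional toy objective plays the role of qN, and the rule checks and tol schedule follow the description above:

```python
import random
random.seed(0)

def f(x):
    # One-dimensional toy objective with minimum at x = 3 (hypothetical stand-in).
    return (x - 3.0) ** 2

def metaheuristic_step(pop):
    # Stand-in for one GSA/CGSA iteration: a random perturbation of each agent.
    return [x + random.uniform(-0.5, 0.5) for x in pop]

def local_search(x, tol):
    # Stand-in for qN: gradient descent that stops when the step drops below tol.
    while True:
        step = 0.25 * 2.0 * (x - 3.0)   # learning rate 0.25 times gradient 2(x - 3)
        if abs(step) < tol:
            return x
        x -= step

pop = [random.uniform(-10.0, 10.0) for _ in range(10)]
f_best, x_best, tol = float("inf"), None, 0.01
eps, gamma = 0.0, 1.0                   # basic configuration used in the paper
for t in range(30):
    pop = metaheuristic_step(pop)
    i = min(range(len(pop)), key=lambda j: f(pop[j]))
    improved = f(pop[i]) < f_best - eps                    # Rule 1
    far = x_best is None or abs(pop[i] - x_best) > gamma   # Rule 2
    if improved and far:
        pop[i] = local_search(pop[i], tol)  # qN triggered from the best current point
        tol /= 10.0                         # intensify subsequent local searches
    if f(pop[i]) < f_best:
        f_best, x_best = f(pop[i]), pop[i]
```

After the first trigger, the local search refines the best agent to within the current tolerance of the optimum, mirroring the exploitation role qN plays inside MGSA and MCGSA.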

Experimental results
This section describes the experiments carried out in this paper in order to study the performance of memetic algorithms based on GSA for training FNNs. This was done by replicating the experiments conducted in [45] under the same conditions. These experiments are intended to study whether the use of MAs for training FNNs is a suitable choice in comparison with current state-of-the-art metaheuristic algorithms. In [45], a hybrid of PSO and GSA was used to train a FNN on three different benchmark problems, showing that the proposed hybrid algorithm outperforms both individual GSA and individual PSO.
These authors chose a sample of three classical, well-known benchmark problems where neural networks are used: the first is a three-bit parity (XOR) problem, the second is a function approximation problem and the last is a classification problem. These problems have been chosen because they are classic benchmarks widely used by the scientific community when a new method for training FNNs is designed. Furthermore, this set of problems covers the spectrum of problems that FNNs solve: function approximation, binary classification and multi-class classification. Another advantage of choosing these problems is the possibility of replicating the experiments of [45] under the same conditions, in order to compare the results obtained with those obtained by the hybrid GSA of these authors. Currently, an alternative to classic benchmark problems is training deep neural networks as test problems. In these problems, backpropagation has been shown to be inefficient for networks with two or more hidden layers, due to the vanishing gradient problem, poor generalization and the possibility of becoming trapped in a local optimum (see [23,37,38]). In this scenario, the challenge is to find alternative methods to deal with these problems. However, the main focus of this paper is on testing hybrid methods to train FNNs.
To solve these problems, different algorithms, both classical and from the state of the art of metaheuristics, were chosen to train the FNNs. These are the following: the GSA proposed in [60], the CGSA with a sinusoidal map proposed in [44], and the MGSA and MCGSA described in Sect. 4. These memetic algorithms have been run in all trials with a basic configuration ($\epsilon = 0$, $\gamma = 1$, $tol = 0.01$) in order to ensure that the performance of the proposed algorithms is robust with respect to their parametrization, so that parameter tuning is not an obstacle to their application. This choice of $\epsilon$ and $\gamma$ means that the qN method is applied each time a mass of the population finds a better solution, especially when it manages to escape from a local minimum. Of course, an optimal tuning would guarantee better results. Also, the PSO algorithm implemented by the particleswarm MATLAB function was used [34,45,75]. The genetic algorithm implemented in MATLAB by the ga function has also been considered in this study, in the same way as in [34,62,65,73]. Both PSO and GA have been chosen since they are classical metaheuristic algorithms used to train FNNs. Furthermore, in order to compare the memetic algorithms with the state of the art, two algorithms based on the differential evolution (DE) algorithm have been considered in this experiment (see [57]). The first is Rcr-JADE (see [24]), which modifies the adaptation rule used in the crossover; the adaptation methods used in the control parameters of Rcr-JADE have also been applied in other versions of DE (see [25,27,64]). The second algorithm is COBIDE (see [69]), a DE variant based on covariance matrix learning and bimodal distribution parameter setting.
Both variants of the DE algorithm were chosen since their codes are publicly available and they occupied the first positions in the ranking made in [57], where Rcr-JADE was the best algorithm over a set of thirty algorithms in twenty-two real-world problems when 50,000 and 100,000 function evaluations were allowed, while the COBIDE algorithm was the best when 150,000 function evaluations were allowed. These experiments have been implemented in the MATLAB programming language, specifically MATLAB 2017b, and run on a server equipped with a 4.00 GHz AMD FX-8370 Eight-Core Processor.
In order to assess the performance of the algorithms involved in the computational experiments, the goodness of fit reached in the training process, measured through the MSE, and the computational cost, given as the number of function evaluations, have been used (see [35]). Also, in [54], RMSE is used to assess the performance of many classifiers applied to predict the travel modes of passengers. However, in the case of classification problems, alternative validation measures, such as accuracy, the ROC curve or the F1 score, are also widely used on training and test problems (see [6,16,28]). This paper focuses on the performance of the optimization algorithms employed, which is the main reason why metrics based on the error measured by the algorithm in the objective function and the number of function evaluations have been chosen as performance metrics (see [57,58]).
These three problems are then discussed, with an emphasis on the performance of the memetic algorithms used in this paper, and a statistical comparison is made with the selected algorithms from the state of the art.

The m-bit parity problem (XOR problem)
This is a very famous nonlinear benchmark problem widely used in the optimization of neural networks (see [7,22,65]). The definition of this problem is the following: Given a bit string or input vector of "0"s and "1"s with length m, the desired output is 1 if the input vector contains an odd number of "1"s and 0 if it contains an even number of "1"s. Table 1 shows the input vectors and the corresponding outputs for the case m = 3 bits.
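For reference, the input/output pairs of the m-bit parity problem can be generated programmatically; a short sketch (the helper name is ours):

```python
from itertools import product

def parity_dataset(m):
    # Each input vector of length m is paired with its parity target:
    # 1 for an odd number of "1"s, 0 for an even number.
    return [(list(bits), sum(bits) % 2) for bits in product([0, 1], repeat=m)]
```

Calling `parity_dataset(3)` yields the eight input/output pairs of Table 1.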
The main feature of the XOR problem, which makes it a widely used benchmark problem, is that it is not linearly separable. Hence, this problem cannot be solved using a perceptron (a neural network without hidden layers). This feature adds complexity to the problem. Figure 6 shows an illustration of this for the case m = 3 bits.
To solve the XOR problem, a feedforward neural network with topology 3-S-1 has been chosen (where S is the number of neurons in the hidden layer). Thus, the input layer has a total of three neurons, according to the length of the input vector in the case of m = 3 bits; the hidden layer has a variable number of neurons, which generates different topologies to be tested; and finally, the output layer has only one neuron, whose value will be 1 or 0, the possible outputs of the XOR problem. The following topologies have been trained in order to test the performance of the algorithms as the number of parameters increases: S = {5, 6, 7, 8, 9, 10, 11, 13, 15, 20, 30}. Figure 7 shows an illustration of the network topology used to solve the XOR problem. The setup of the experiment is as follows: each algorithm was run 30 times for each network topology, using a population of 30 individuals ($N_{Pob} = 30$), with 500 training iterations allowed. This is the only difference with respect to [45], since those authors carried out a more intensive computational experiment, using a population of 50 individuals over 500 iterations. This paper uses 15,000 function evaluations, as solutions are required in a brief time period when addressing real-world problems; thus, only 60% of the function evaluations performed in [45] are used here. Tables 2 and 3 display the results of this experiment. The average, median, standard deviation and best values of the mean square error (MSE) over the thirty runs are reported. In addition, the nonparametric Wilcoxon rank-sum test (see [11]) was performed to determine whether the difference in mean performance between algorithms is statistically significant.
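The dimension n of each training problem is the number of weights and thresholds to be adjusted. Assuming every weight and bias is a decision variable, which is consistent with the dimensions n = {10, 13, 16, 19, 22} reported later for the 1-S-1 topology, the count for a single-hidden-layer network can be sketched as:

```python
def num_parameters(n_in, S, n_out):
    # Weights input->hidden and hidden->output, plus one bias (threshold)
    # per hidden and per output neuron.
    return n_in * S + S * n_out + S + n_out
```

Under this assumption, the 3-S-1 XOR networks would have n = 5S + 1 parameters, e.g., 26 decision variables for S = 5.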
This test is very suitable when the sample size is small and, furthermore, does not require the distribution of the data to fulfill any hypothesis, in contrast to parametric tests (see [18]). Statistical significance is assessed using the p-values, which are also shown in Tables 2 and 3. The MCGSA has been used as the baseline. It should be noted that a p-value less than 0.05 means that there are significant differences between an algorithm and the baseline. Thus, h = 1 indicates a significant difference in the performance of the two algorithms in favor of MCGSA, h = -1 implies that the performance of the other algorithm is statistically better, h = 0 means that there are no significant differences in the performance of the two algorithms, and finally, NA means the test is not applicable, since an algorithm cannot be compared with itself.
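The rank-sum statistic itself is simple to compute. The sketch below (pure Python, normal approximation, and no tie correction in the variance, a simplification relative to full statistical packages) illustrates the two-sided p-value behind the h indicator:

```python
import math

def ranksum_p(a, b):
    # Wilcoxon rank-sum test with normal approximation; a minimal sketch,
    # not a replacement for a statistics library.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg = (i + 1 + j) / 2.0        # average rank for the tie group at i..j-1
        for k in range(i, j):
            ranks[k] = avg
        i = j
    W = sum(r for r, (v, src) in zip(ranks, pooled) if src == 0)
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (W - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p-value
```

Two completely disjoint small samples already give a p-value close to the 0.05 threshold, which is why the test is well suited to the 30-run samples used here.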
The numerical results show that using the proposed memetic algorithms to train a feedforward neural network for the XOR problem significantly improves on the results of the GSA family of algorithms and of the selected state-of-the-art algorithms. The performance of the MCGSA is only outperformed by the GA when S = 5 and by the CGSA when S = {8, 9, 11, 15, 30}. It should also be remarked that the MCGSA maintains its performance despite the increased complexity of the network, which shows the robustness of the algorithm. This does not occur, for example, with Rcr-JADE, which performs well on the smallest networks but worsens when bigger networks are considered. Moreover, in order to study not only the final results of the algorithms but also their convergence to global optima, Figs. 8 and 9 show the average convergence over the 30 runs of the four best algorithms of the eight tested. These figures show that the GSA can achieve good results, but the introduction of qN search in the MGSA speeds up convergence. Moreover, the chaotic component of the MCGSA improves the results obtained by the MGSA: although the MGSA converges more quickly, it gives a worse solution than the MCGSA, which is capable, thanks to the chaotic component, of avoiding local minima. Furthermore, comparing the MCGSA with the CGSA, the convergence of the MCGSA is quicker; this is due to the introduction of qN search directions, which improves the exploitation stage of the original CGSA, guaranteeing better convergence. Finally, comparing the results obtained in this study with those obtained by [45], there is a meaningful improvement, especially in the most complex topologies.

Function approximation problem
One of the most common problems addressed using neural networks is the function approximation problem (see [13,63,74]), which is why a problem of this kind has been included in this study. The definition of a function approximation problem is as follows: Given a function g(x) and a set of points X from the domain of g(x), the objective is to obtain, for each point of X, an image which guarantees the most accurate approximation of the function g(x).
In this experiment, the objective function from [45] has been used. Thus, the objective is to approximate the one-dimensional function $g(x) = \sin(2x)e^{-x}$. This is done by using a sample of 105 points in the interval $[0, \pi]$, with an increment of 0.03, as the training set. Figure 10 shows the graph of this function.
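The training set for this experiment can be reproduced directly from the description; the sketch below assumes the grid starts at 0 and keeps every point not exceeding π:

```python
import math

def g(x):
    # Target function g(x) = sin(2x) * exp(-x).
    return math.sin(2.0 * x) * math.exp(-x)

def training_set(step=0.03):
    # Sample [0, pi] with the given increment.
    points, i = [], 0
    while i * step <= math.pi:
        points.append(i * step)
        i += 1
    return [(x, g(x)) for x in points]
```

This grid contains 105 points, matching the sample size stated above.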
A FNN with the structure 1-S-1 has been used to solve the problem. The input layer has only one neuron, since the input vector has only one component (the variable x), and the output layer has only one neuron, since g(x) is a real function. The hidden layer takes several sizes, which generate the different network topologies to be trained. The values of S are S = {3, 4, 5, 6, 7}, so the problem dimensions are n = {10, 13, 16, 19, 22}. The proposed FNN topology for this problem is represented in Fig. 11. The setup of the experiment is identical to that of the XOR problem, and the statistics reported are the same as in the previous experiments. Again, the Wilcoxon rank-sum test has been used to check whether the difference in the performance of two algorithms is statistically significant. The results obtained are shown in Table 4. The MCGSA significantly outperforms the rest of the algorithms considered in the experiment, returning the best results in all problems except the first, where the GA exhibits the best performance. This first problem has a total of ten parameters to be adjusted and is the problem of lowest dimensionality, which could explain why the GA works better in this particular case. However, as the number of parameters increases, our proposal continues to perform well, in contrast to the GA, whose performance worsens as the dimensionality of the problem grows. Another interesting feature of the MCGSA is that increasing the number of hidden neurons allows it to approximate better, giving lower average MSE values when the number of hidden neurons is larger.
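As an illustration of the objective function being minimized, a 1-S-1 forward pass and its MSE can be sketched as follows (sigmoid hidden units and a linear output neuron are assumed here, since the activations are not restated in this section):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_1_S_1(x, w1, b1, w2, b2):
    # Hidden layer: S sigmoid units; output layer: a single linear unit.
    hidden = [sigmoid(w * x + b) for w, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def mse(params, data, S):
    # params packs the n = 3S + 1 decision variables: w1, b1, w2, b2.
    w1, b1 = params[:S], params[S:2 * S]
    w2, b2 = params[2 * S:3 * S], params[3 * S]
    errs = [(forward_1_S_1(x, w1, b1, w2, b2) - y) ** 2 for x, y in data]
    return sum(errs) / len(errs)
```

A metaheuristic then searches the n-dimensional space of `params` for the vector minimizing `mse` over the training set.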
In order to evaluate the behavior of the algorithms over the whole optimization process, Fig. 12 shows the convergence curves of the four best algorithms for each problem. The curves represent the mean performance of the four best algorithms in each problem over the 30 runs. As with the XOR problem, the proposed memetic algorithms show the fastest convergence. Although the MGSA converges more quickly than the MCGSA in the early iterations, the MCGSA reaches better final solutions. Comparing these results with those of [45], it should be remarked that the MCGSA guarantees results of the same order of magnitude, and even improves on them in some cases, with only 15,000 function evaluations, whereas [45] obtained their results over 100,000 function evaluations.

Classification problem
Classification problems are among the most representative supervised learning problems. This kind of problem can be defined as follows: Given a dataset where each instance is labeled with a class, the aim is to find a model which accurately classifies the instances belonging to each class. The iris or Fisher's iris dataset is one of the most famous datasets used in classification problems. It has 150 samples of iris flowers, each described by four features (sepal length, sepal width, petal length and petal width) and labeled with one of three species. In order to address this problem, a FNN with topology 4-S-3 was designed. Four input neurons are used, according to the four features which characterize an instance. The parameter S is again a variable which generates the different network topologies to be trained. The values of S are S = {4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}, providing the following set of problem dimensions: n = {27, 35, 43, 51, 59, 67, 75, 83, 91, 99, 107, 115, 123}. Finally, the output layer has three neurons, since there are three possible classes of iris flowers. A graphical representation of the FNN used to solve this problem is shown in Fig. 13.
The configuration of the experiment is the same as for the two previous experiments, returning the same statistics and carrying out the Wilcoxon rank-sum test in order to check whether there are significant differences in the performance of two different algorithms. Leave-one-out cross-validation was also used for training the FNNs, as in [45]. Thus, in each generation, 149 samples of the dataset are chosen for training the FNN, while the remaining sample is used to test it. The statistical results obtained for the training set are shown in Tables 5 and 6.
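Leave-one-out cross-validation on the 150 iris samples can be sketched as a simple index rotation (the helper below is a hypothetical illustration):

```python
def loo_splits(n):
    # Yields (training indices, test index) pairs: n splits in total,
    # each holding out exactly one sample for testing.
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```

For the iris dataset, `loo_splits(150)` produces 150 splits of 149 training samples each, matching the scheme described above.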
The results clearly show the superiority of the MCGSA over the rest of the algorithms tested. The MCGSA is the algorithm which returns the best mean value in all problems, with the exception of the one with five neurons in the hidden layer, where the MGSA gives the best value. In this problem, where the number of parameters is small, the MGSA can speed up convergence, since it does not introduce chaotic maps to balance exploration and exploitation, instead encouraging exploitation through the qN search directions.
Another interesting aspect derived from the results is the behavior of Rcr-JADE and the GA. In the XOR and function approximation problems, the best algorithms are usually MCGSA, MGSA, CGSA and GA. However, in this problem, Rcr-JADE appears as one of the four best algorithms in the small-size problems (i.e., S ≤ 12), after which the GA emerges as one of the best algorithms, replacing it. This implies that Rcr-JADE is a good choice for small instances, while the GA is more appropriate for larger sizes. As in the previous experiments, the convergence curves are shown in Figs. 14 and 15, with the purpose of analyzing the performance of the algorithms over the whole optimization process. The memetic proposals show the best convergence curves. Again, although the MGSA converges more quickly than the MCGSA, the latter is capable of avoiding local minima due to the chaotic maps introduced through the gravitational constant G, showing the best performance of all the algorithms tested.

Conclusions and further work
This paper applies two recently proposed MAs based on qN search directions to the FNN training problem in order to study whether the memetic approach can improve the performance of metaheuristic algorithms when training such networks.
To do this, MGSA and MCGSA were applied to three classical benchmark problems, along with the classical GSA and CGSA and other state-of-the-art metaheuristics, such as PSO, GA, Rcr-JADE and COBIDE. The main conclusions arising from the computational experiments undertaken in this study are:
- The MCGSA is the algorithm with the best performance among all those tested in the XOR problem, the function approximation problem and the classification problem. The improvement achieved by the MCGSA can be seen both in the speed of convergence and in the quality of the solution obtained.
- The CGSA is capable of avoiding local optima, due to the sinusoidal chaotic map added to it. This property is inherited by the MCGSA, while the introduction of the qN algorithm, which has a superlinear convergence rate, improves the convergence speed of the baseline algorithm. The improvement is especially evident in the function approximation problem and in the classification problem. It is also possible to see that the curse of dimensionality affects the MCGSA less than the CGSA, with solutions better by an order of magnitude in MSE for the classification problems of greatest dimensionality.
- The difficulty metaheuristic algorithms have in addressing the curse of dimensionality has been noted. The MCGSA is the algorithm that best addresses this problem.
This work has focused on the use of qN as the local search method. However, the framework of MAs allows the inclusion of algorithms specialized for the FNN training problem, such as BP, and these alternatives should be explored. Moreover, more effort must be devoted to data quality, since the effectiveness of a FNN depends to a great extent on the data used to train it. Thus, problems like imbalanced data, insufficient or overabundant data and high-dimensional data must be addressed in order to improve the performance of FNNs (see [43]). Currently, in the era of big data, stream data (see [76]) are a kind of high-dimensional data that are very common in many areas, such as natural language processing, speech processing and social networks. Although these data are usually processed with deep learning architectures, FNNs trained with MAs can be tested in order to study their performance. Also, memetic FNNs could be useful in managing and reducing the data stream (see [30]).
Furthermore, the use of evolutionary algorithms for optimizing the topology of deep neural networks is currently a hot topic in the literature. Future work should look at specializing memetic algorithms for optimizing the topology of deep and convolutional neural networks.
Finally, the data streams provided by internet of things (IoT) devices, such as human activity data, social activity data or smartphone data, need to be processed using inexpensive models of low complexity (see [59]). In this context, tools such as memetic FNNs can be a suitable choice for analyzing these data.