An efficient optimization approach for designing machine learning models based on genetic algorithm

Machine learning (ML) methods have shown powerful performance in different application. Nonetheless, designing ML models remains a challenge and requires further research as most procedures adopt a trial and error strategy. In this study, we present a methodology to optimize the architecture and the feature configurations of ML models considering a supervised learning process. The proposed approach employs genetic algorithm (GA)-based integer-valued optimization for two ML models, namely deep neural networks (DNN) and adaptive neuro-fuzzy inference system (ANFIS). The selected variables in the DNN optimization problems are the number of hidden layers, their number of neurons and their activation function, while the type and the number of membership functions are the design variables in the ANFIS optimization problem. The mean squared error (MSE) between the predictions and the target outputs is minimized as the optimization fitness function. The proposed scheme is validated through a case study of computational material design. We apply the method to predict the fracture energy of polymer/nanoparticles composites (PNCs) with a database gathered from the literature. The optimized DNN model shows superior prediction accuracy compared to the classical one-hidden layer network. Also, it outperforms ANFIS with significantly lower number of generations in GA. The proposed method can be easily extended to optimize similar architecture properties of ML models in various complex systems.


GðÁÞ
The activation function f l The output of a hidden layer w The connecting weights k Learning iteration e The error vector J The Jacobian matrix for the first derivatives of the network errors with respect to the weights l The learning rate I The identity matrix f i Crisp function of ANFIS l i ðx i Þ Membership grade of the inputs, x i y The set of n integer state variables pðyÞ The fitness function V f Volume fraction of nanofiller d n The diameter of the nanoparticles G Im The fracture energy E m Young's modulus r ym The yield strength N The number of datasets t i The target output O i The predicted output by ML model The total number of the model evaluations N w The number of connecting weights between layers

Introduction
Machine learning (ML) methods have been extensively used for simulating material design in various applications recently thanks to the considerable advancements in computing power. The high advantage of ML is to represent the actual behavior with much less cost and time. These tools are based on computational intelligence through correlating the input parameters to the output/s of interest by means of mathematics and statistical methods. ML modeling is reasonably accurate and able to capture and identify the nonlinearity in the very complex physical systems by developing a black box model without the need to mathematical models. Thus, it has become a viable complement and even an alternative to the physically based model [1][2][3]. Artificial neural network (ANN) is a widely common ML method that has the ability to learn the pattern rapidly. Disregarding the nature of the problem understudy, all the influencing factors can be taken into account considering their complicated joint effect. The general structure of ANN is composed of parallel layers connected by weights and biases to form the network. Feed-forward neural networks are used to construct an approximate function for the relationship starting from the input layer toward the output layer and passing through a hidden layer. The weights and the biases can be learned making use of a predefined training process [4,5]. Different from the classical shallow ANN, the deep neural network (DNN) is formed by multiple processing hidden layers (more than two) providing a higher learning representation [6]. Moreover, the adaptive neuro-fuzzy inference system (ANFIS) presents a combination of neural network and a fuzzy system that deals with reasoning. Using these artificial intelligence approaches, the behavior of the given problems can be captured effectively and, consequently, the future response can be predicted implicitly with much less effort. The structure and the related configurations are essential modeling factors in building the ML model. Different results are obtained when the architecture and the feature configurations are changed. The numbers of input and output variables define the number of neurons in the input and output layers. In the meantime, there is still no general rule for setting the dimension of the hidden layers and the number of neurons in each layer. It is difficult to find the optimal set of the possible structures and parameters. A trial and error is a very common adopted procedure in which the tedious iterative process is inevitable. Though, reliable and powerful optimization methods have been effectively used in identifying the optimal model from several trained models. Much of the focus has been given on mainly optimizing the parameters that can be derived through the training process. The optimal model was selected by finding the optimal connecting weights and was approximated by reducing the training error between the predictions and corresponding targets [7][8][9]. In ANFIS models, most of the optimization techniques were utilized for defining the membership functions and the corresponding fuzzy rules to increase the accuracy of standalone ANFIS [10,11]. When it comes to optimal architecture, it is seen that limited analyses were investigated. An efficient configuration of ML models can be obtained by optimizing hyperparameters whose magnitude is to be set before the learning process begins. This is a complicated optimization problem as it contains a large number of correlated design variables and nonlinear objective function. Therefore, a method for developing and optimizing ML models to obtain the best model configuration is needed.
This paper presents a robust methodology for finding the optimal architecture and features of the DNN and ANFIS models. Supervised learning is considered where the data points include a target output to be predicted from a given set of input variables. Genetic algorithm (GA)based integer-valued optimization is employed to find the best ML model configuration through minimizing a fitness function of the mean squared error (MSE) between the predicted and the target values. The hyperparameters are restricted to be integers. For DNN, the optimization variables include: the number of hidden layers, the number of neurons in each one, and the type of the activation function, whereas for ANFIS, they are the type and the number of membership functions In addition, the performance of the addressed ML models is also evaluated and compared by calculating the corresponding coefficient of determination (R 2 ) and the probability distribution of the relative error.
We apply the proposed methodology to the computational design of materials in order to validate the method and compare it with the classical technique. The focus of interest in this paper is the prediction of the fracture energy of polymer/nanoparticles composites (PNCs) based on a set of experimental measurements gathered from the literature. This is a challenging task considering the complex and nonlinear nature of toughening mechanism of PNCs which depends on diverse uncertain factors. Up to date, there are only a few contributions on ML to investigate the behavior of PNCs [12][13][14]. In a previous study, we presented unoptimized ML models for PNCs that were constructed using the concept of trial and error [15]. The developed ANN and ANFIS showed a considerable superior performance compared to the results obtained by the analytical and linear regression models. Yet, the search for a better model never stops. To the best knowledge of the authors, the optimal design of the architecture and the hyperparameters of the ML model in the area of PNCs material design remains almost unexplored.
The rest of the paper is organized as follows: firstly, we describe the machine learning and the optimization methods with their essential mathematical background in Sect. 2. Then, Sect. 3 addresses the model problem including details of the dataset. Subsequently, we briefly describe the application and performance's analysis of the proposed method in Sect. 4. Finally, Sect. 5 summarizes the key results and provides direction for future works.

Material and methods
There exist several ML approaches documented in the literature such as decision trees, random forest, support vector, Bayesian networks, extreme learning machine, evolutionary computing, ANN, and ANFIS, among others. An extensive review on the methods can be found in [16][17][18]. Nevertheless, for the purpose of finding the optimal configurations of ML model, ANN and ANFIS are investigated in this study due to their flexible and adjustable architecture. We have not performed a conclusive or exhaustive analysis to determine whether and how they are better, as our major concern to have relatively simple and optimized model that could be easily applied. Hereafter, a short description and the mathematical formulations involved in the development of the proposed method are elucidated.

Artificial neural networks
Artificial neural network (ANN) is a highly parallel system that mimics the function of the biological brain. It is designed to model the relation among the input and output parameters through a training process. The typical ANN model contains several inter-connected processing units called neurons or nodes grouped together into layers. The neighboring layers are connected by weights forming a large network. The network learns by analyzing multiple datasets and adjusting the connection weights [19,20]. In the course of this study, we apply the multilayer feedforward networks to predict the fracture energy of the PNCs. The optimal architecture of the deep neural network (DNN), a network composed of two or more hidden layers, is examined. The architecture of the network and the proposed DNN model for predicting the fracture energy of PNCs is shown in Fig. 1.
Inputs from previous layers are linked to each neuron by the corresponding weights and bias by which the neuron receives data and consequently proceeds it to the next layer. The weighted sums are passed through an activation function, GðÁÞ, to determine the neuron output. It takes inputs from previous layer to produce a scalar output. The output is computed layer by layer as in Eq. (1).
where f lÀ1 is the output from the preceding layer (l À 1), and w and b are the connecting weights and bias, respectively. Finally, the signal from the neurons of the last hidden layer is passed to the output layer with linear activation [21,22]. The network learns iteratively from several datasets in supervised learning process. The predicted outputs are compared with the target output, and accordingly, new iteration is proceeded to minimize the mean squared error (MSE). Levenberg-Marquardt algorithm is used for training where the weights and bias are updated via the error back-propagation (BP). It is a combination of gradient descent forms and Newton method. After each learning iteration, k, the error vector (e) is computed and the weights are updated. The Jacobian matrix (J) includes the first derivatives of the network errors with respect to the weights. During the training in the standard gradient descent, as the error converges to a minimum value, the gradient will become very small and the weights are updated very slightly. Contrary, the training by Levenberg-Marquardt method can be much faster [23,24]. The modification applied to update the weights is given by Eq. (2).
where l is the learning rate that governs the step size and I is the identity matrix.
In constructing the DNN models, we divide the data into two groups: training and testing datasets. The training dataset is used to build the network and approximate the  Fig. 1 The general architecture of multilayer feedforward networks for the model problem Neural Computing and Applications connecting weights, while testing dataset is used to validate the models against unseen data (cross-validation) and to prevent possible overfitting. Doing so, the learning iterations continues until the stopping criterion is met; (1) the MSE in the testing set increases to a maximum number of successive iterations of 20, and (2) the gradient of the error has a minimum value of 10 À7 .

Adaptive neuro-fuzzy inference system
The concept of fuzzy logic can be introduced through a fuzzy inference system (FIS) to map the input/output relationship. The process starts with defining the membership functions of fuzzy sets (fuzzification) statistically making use of the available dataset, comes through creating the rules and merging all the fuzzy rules by a proper fuzzy inference and ends finally by defuzzification into crisp output values [25]. The adaptive neuro-fuzzy inference system (ANFIS) benefits from the merits of fuzzy logic and the neural network [26]. It is based on using fuzzy rules for adaptation of a set of model parameters and the neural network for training and updating these parameters. Takagi-Sugeno fuzzy model is one of the ANFIS approaches characterized by linear or constant terms in the consequent part of the if-then rules [27]. The learning method of the ANFIS is similar to the common feedforward neural networks with defined network representation. It is comprised of nodes with specific functions collected in main five layers. For each layer, the output signals are processed by the node functions as displayed in Fig. 2. In the first layer, the nodes evaluate the fuzzy membership grade of the inputs, l i ðx i Þ, by dividing the domain of each one into a number of fuzzy subsets. The membership function can be of increasing, decreasing or approximation type [25]. The nodes in the second layer multiply the incoming signals from Layer 1 to calculate the weights of the rules firing strength, w i , and send the product out to the next layer. Then, the normalized firing strength of the fuzzy rules is approximated at the nodes of Layer 3, where the output in each node is calculated as the ratio of the firing weight to the sum of all firing weights. In the fourth layer, the outcomes from the preceding layer are multiplied by a crisp function (f i ) that specifies the membership function of the output. In this way, the defuzzification of the fuzzy rules is achieved for the overall weighted output. Finally, the overall output is computed as the summation of all incoming signals in Layer 5.

Genetic algorithm
For the purpose of constructing the ML models of the highest performance, the architecture and features are to be optimized in this work. The state variables represent the number of hidden layers, the number neurons in each, and the type of the activation function for the DNN, while they are the type and the number of membership functions in ANFIS. Such discrete nature of the variables makes the optimization a non-convex problem. The classical method for solving these problems is based on the branch and bound algorithm. It starts by finding the optimal solution of the variables without the integer constraints. Then, the branches of this solution are explored creating two new subproblems. The branch for each variable is checked against upper and lower estimated bounds on the optimal solution. The subproblems are solved for the new constrained and the process of branching is repeated until obtaining a solution that satisfies all the integer constraints [28]. Alternatively, heuristics methods such as genetic and evolutionary algorithms are faster and more efficient to approximate the solution of computationally expensive problems. It has been applied in solving numerical problems and prediction [29,30]. The heuristics methods search within the domain for integer feasible solutions. Starting from randomly generated candidate, a new generation with modified objective values is extracted. The procedure is continued to derive sufficiently good solution. Like the neural networks, a genetic algorithm (GA) is biologically inspired heuristic approach based on the evolutionary process representing an optimization procedure in a binary search space. It seeks to find the values of the decision state variables that optimize an objective function. The concept of GA was presented by Holland and his collaborators [31]. In this study, GA-based integer-valued optimization is employed [32]. The mathematical formulation of the problem is defined in Eq. (3).
In the above expression, y is the set of n integer variables with lower and upper limits of y l and y u ; pðyÞ is the fitness function; and gðyÞ and hðyÞ are the equality and inequality constraints. The objective (fitness) function is a nonlinear function representing the MSE for the predictions of the ML models with respect to the experimental data. The flowchart of the optimization steps is shown in Fig. 3. The genetic learning starts with creation of an initial population consisting of randomly generated rules. Each rule can be represented by a string of bits. The random population is the state variables that define the architecture of the ML. Then, a new population is formed consisting the fittest rules [33]. At each step, GA uses three main types of rules to create the next generation from the current population: crossover, mutation, and selection. In the crossover, the genetic information of two parents is combined from the selection operator to explore the design space, while the genetic information is changed with mutation probability to introduce diversity and to prevent local optimal solution. Meanwhile, a selection operator is maintained to choose fit individuals for the reproduction operators. In this way, the derivatives of the objective function are not required making GA favorable choice for the nonlinear and discontinuous optimization problems. The obtained solution is evaluated by the objective optimization function of the MSE. The corresponding value of the objective (known as the fitness) measures the performance of the chosen individuals compared to the other whole population. The process of generating new populations continues as long as the termination check has not been met with the condition that each rule in the population satisfies a prespecified fitness threshold.

Model problem
Epoxy polymer is well known to be a brittle material. It has a poor fracture toughness and a poor resistance to crack initiation and propagation. Several diverse second-phase materials have been added to the polymer matrix in order to improve the fracture properties without sacrificing other important thermo-physical properties. Structural characterization can be enhanced for the purpose of environmental applications [34,35]. Inorganic additives of fillers with the size of nanometer are effectively applied because of their high surface to volume ratio. In this regard, polymer nanocomposites have offered exceptional improvements even at lower filler contents. The shape of the nanofillers can be of spherical particles with a radius of 10-80 nm [36][37][38].
Studying the fracture energy enhancement due to the addition of rigid nanofillers is a highly challenging task. It depends on different factors which highly affect the toughening mechanism such as the volume fraction, the curing conditions, the mechanical properties of the two phases, the agglomeration, and distribution of the fillers. Several numerical and analytical models have been proposed to model the fracture and crack propagation of PNCs, see, for example, the contributions in [39][40][41][42][43][44][45] and the references therein. Besides these approximation models, the fracture energy of PNCs has been directly extracted from experiments [46][47][48][49].
In this study, we introduce data-driven models as a promising alternative to the 'classical' computational approaches. To establish the database for the purpose of developing the ML models, five parameters are defined. Two of them represent the geometrical properties of the rigid nanofillers, i.e., the volume fraction ðV f Þ, and the diameter of the particle ðd n Þ. The remaining three define the material properties of the epoxy polymers: the fracture energy ðG Im Þ, the Young's modulus ðE m Þ, and the yield  Fig. 3 Flowchart of the methodology for ML optimization using integer based GA Neural Computing and Applications strength ðr ym Þ. These parameters are well known to be the governing factors in the toughening mechanism of PNCs according to Huang and Kinloch [39], Williams [40], and Quaresimin et al. [41]. In a previous study, we have employed ANN and ANFIS to predict the fracture energy by establishing a database collected from the literature [15]. The dataset consists of 115 experimental measurements. Various ranges are collected to ensure the development of a robust model that can be applied to a wide range of the PNCs. Table 1 lists the range of the variation of the input-output parameters. Of the total datasets, 85 samples are assigned for training the ML models. The remaining is set as testing data to compute the validation error. It is also utilized to prevent overfitting that results in good performance in the training set and poor predictions in the overall data. Instead of random division, the data are divided based on the condition that the training and the testing datasets have identical statistical distributions.
An optimal machine learning model will be constructed to predict the fracture energy of PNCs as a function of the selected five inputs. The modeling relation is defined as where G Ic is the fracture energy of the PNCS and f() is the developed mapping function. An integer-valued-based GA optimization is applied to find the optimal architecture in DNN and ANFIS. In optimization scheme, various arrangements of neurons in the hidden layers using predefined multilayer neural networks of DNN are sought. In each layer, the number of neurons and the activation function are optimized in order to evaluate the fitness function. Likewise, the optimal type and number of membership functions in ANFIS are obtained. Measuring the performance of the models generally depends on the residual of the differences between the predictions and the target values. Ordinary, absolute, or relative absolute sum of residuals is affected by the direction of over-or under-prediction. Contrary, the indicators which consider the variance are used to eliminate this effect. In our analysis, the predictions of the developed ML models are evaluated by two performance evaluation indices: the mean squared error (MSE) and the coefficient of determination ðR 2 Þ in which the variance of the estimator and its squared bias are taken into account. MSE and R 2 can be calculated, respectively, using Eqs. (5) and (6).
where N is the number of datasets, t i refers to the actual observation from the experiments, and O i is the predicted fracture energy by the addressed ML model. From Eq. (5), MSE is the variance of the residuals that corresponds to the error's discrepancy between t i and O i . It measures the absolute fit of the model to identify the undesirable large differences. Differently, R 2 evaluates the relative measure of the fit which quantifies the variance of the residuals divided by the total sum of squares of error with respect to the mean. Besides, the probability distribution of the relative error is employed to measure the performance of the optimal models.

Results and discussion
The hyperparameters defining the optimal architecture of ML models are sought for the model problem of predicting the fracture energy of the PNCs ðG Ic Þ. Several network architectures are examined including the shallow ANN of one hidden layer and DNN of more than two hidden layers (from 2 to 6) in order to evaluate and select the most appropriate structure. We examine also the optimal type of the activation function out of four functions: tan-sigmoid, log-sigmoid, Elliot-sigmoid, and radial basis. Apparently, these are nonlinear functions. The derivative of the linear base (e.g., triangular or pure linear) is a constant and cannot go back and modify the weights to provide a better prediction. Hence, the linear functions are not possible to use back-propagation for training the DNN models. An integer-valued-based GA is applied for this purpose with MSE minimized as the objective function. Therefore, a population size of 100 is set for the first generation and 50 for the followings. Figure 4 shows the convergence rate to obtain the minimum MSE against the number of generations in the different DNN architectures. It can be noted that Str:-1 has stationary trend and cannot be improved with further generations; Str:-6 requires 75 (= 3800 function evaluations) generations to get the lowest fitness of MSE ¼ 1503:3 among the different DNNs. In the second order comes Str:-3 surpassing Str:-4 and Str:-5.

Neural Computing and Applications
The optimal architectures of the created models are summarized in Table 2. Each row contains the optimal number of neurons in each hidden layer of the k-layered networks and their activation functions, GðÁÞ, accompanied by the corresponding prediction accuracy measures (MSE and R 2 ). The last two columns include the total number of the model evaluations (N f ) used to get the optimal architecture, and the number of connecting weights (N w ) between layers. Except of the shallow network, the logsigmoid is found to be the optimal activation function. This may be explained by its smooth gradient that prevent the jumps in the output values. Moreover, the results show that the MSE decreases steadily with the increase of hidden layers from Str:-1 until Str:-3. Afterward, it becomes slightly stabilized before it reaches the minimum at Str:-6. With the increase of the hidden layers, the number of connecting weights grows significantly (excluding Str:-4). In the comparison between Str:-3 and Str:-6, the latter, on the one hand, provides higher accuracy. On the other hand, however, it has around fourfolded number of connecting weights. Interestingly, the more the deeper of the hidden layers is, the higher the required computation burden for each function evaluation.
The network performance for the training and testing datasets in the optimal structures is shown in Fig. 5. The training error is decreasing sharply in the first ten iterations before it reaches a roughly stable convergence. The training could be continued with negligible improvement, but the error in the testing set diverges after specific number of learning iterations. The training is allowed 20 more iterations since the last time it decreased in the testing set. Hence, in order to prevent overfitting, the connecting weights are selected at the lowest MSE in the testing set. Notably, Str:-3 is the fastest to reach the best results.
Similarly, integer-valued-based GA optimization is applied to find the optimal hyperparameters in ANFIS model. In particular, we optimize the number and the type of the membership functions required for the fuzzification of the full space in the input parameters. The number of the rules equals the product of a sequence of the numbers of membership functions associated with the each of the five inputs. As the numbers of membership functions increase, the number of the generated rules considerably increases, and the computation time becomes unaffordable. Seven different forms of approximating membership functions are examined: Bell-shaped, Gaussian, Gaussian combination, triangular, trapezoidal, P-shaped, and difference between two sigmoidal functions. An integer values from 1 to 7 are assigned to represent each type. Considering the number of membership functions with respect to the input parameters, the rules are generated using grid partitioning to generate a full range of the rules. The convergence of the MSE is depicted in Fig. 6. After 222 generations, the MSE converges to its minimum value of 2008.8. This convergence required 11,150 function evaluations. The computation  N l the number of hidden layers, N f the total number of the objective function evaluations used to get the optimal architecture by using GA, and N w the number of connecting weights between layers Neural Computing and Applications cost of GA in optimizing ANFIS is significantly higher with lower predictive accuracy compared to DNN. Details on the developed optimal ANFIS are listed in Table 3. The best performance is obtained also after ten training iterations. Additional training iterations lead to an overfitting error in the testing dataset, see Fig. 7. In [15], ANFIS was found to outperform a trial selected classical one-layer ANN model. However, an optimized multiple hidden layers configuration of DNN shows better prediction accuracy. We also employ a graphical analysis of the relative error to evaluate the performance of the optimized DNN and ANFIS models. The percentage relative error is calculated as the relative variation of the predicted value from the corresponding experimental data. The prediction inaccuracies of the optimal DNN (Str:-3 and Str:-6) and ANFIS are represented by the histograms in Fig. 8. Obviously, most of the data points have a relative error value close to zero indicating a robust predictive capability of both methods. The scatter of the distribution on either side of the equality is very similar. The probability of full matching the experimental values in the predictions by DNN is higher. The histograms show that Str:-6, Str:-3 and ANFIS can predict nearly 86%, 87%, and 76%, respectively, of the data with an absolute relative error of 10% or less.

Conclusions
A robust optimization approach for machine learning modeling was presented in this paper based on genetic algorithm. An integer-valued optimization was adopted. The approach investigated the architecture and the feature configurations of two machine learning models: deep neural networks (DNN) and adaptive neuro-fuzzy inference system (ANFIS). The set of optimized variables were the number of neurons in the hidden layers using predefined multilayer neural networks and the type of the activation function in the former, but the type and the number of membership functions in the latter. In solving the optimization problem, the mean squared error (MSE) was assigned to be the fitness function. We elucidated the proposed method by using a case study in the computational material design application. The aim was to find an efficient structure of DNN and ANFIS models to predict the fracture energy of polymer/nanoparticles composites (PNCs). The conventional analytical predictor models of PNCs show a complex and nonlinear toughening mechanism dominated by different parameters. Thus, the addressed ML models were trained to establish the inherent relation between five inputs and the fracture energy based on dataset gathered from the literature. The parameters include the percent of the volume fraction of the nanoparticles, their size, the fracture energy of the bulk epoxy polymer, its Young's modulus, and its yield strength. The dataset was divided into two groups: the training dataset used to build the model and the testing one used to validate the performance and to prevent overfitting during learning iterations. The mean squared error (MSE), the coefficient of determination (R 2 ), and the probability distribution of the relative error were employed to compare and to measure the performance of the optimized DNN and ANFIS models. The results indicated that the optimized DNN not only surpassed the classical one layer ANN model as expected but also yielded better prediction accuracy compared to ANFIS. The DNN with three and six hidden layers has shown the highest performance. However, it was found that as the number of the layers increases, the network structure becomes larger and the complexity of the algorithm increases. Although the results are limited to the application case study, the method can be applied to variant applications. In the future, we will extend the presented work to optimize non-supervised ML models in finding the solution of the partial differential equations.
Acknowledgements Open Access funding provided by Projekt DEAL. This study was funded by Alexander von Humboldt-Stiftung (Grant number Sofja Kovalevskaja 2015).

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.  Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.