Approximation of two-variable functions using high-order Takagi–Sugeno fuzzy systems, sparse regressions, and metaheuristic optimization

This paper proposes a new hybrid method for training high-order Takagi–Sugeno fuzzy systems using sparse regressions and metaheuristic optimization. The fuzzy system has Gaussian fuzzy sets in the antecedents and high-order polynomials in the consequents of the fuzzy rules. The fuzzy sets can be chosen manually or determined by a metaheuristic optimization method (particle swarm optimization, genetic algorithm, or simulated annealing), while the polynomials are obtained using ordinary least squares, ridge regression, or sparse regressions (forward selection, least angle regression, least absolute shrinkage and selection operator, and elastic net regression). A quality criterion is proposed that expresses a compromise between the prediction ability of the fuzzy model and its sparsity. The conducted experiments showed that (a) the use of sparse regressions and/or metaheuristic optimization can reduce the validation error compared with the reference method, and (b) the use of sparse regressions may simplify the fuzzy model by zeroing some of the coefficients.


Introduction
Many different methods have been developed for automatically training fuzzy systems from observed data. In this paper, we propose a novel approach to training Takagi-Sugeno fuzzy systems for function approximation, based on sparse regressions and metaheuristic optimization. Sparse regressions give sparse solutions, which means that some of the model coefficients are exactly zero. Such models are easier to interpret (Sjöstrand et al. 2018), more compact, and therefore easier to implement. In addition, sparse regressions provide regularization, and therefore they can be used when the problem is ill-conditioned (e.g., when the number of variables exceeds the number of observations). Metaheuristic optimization methods, in turn, are modern nature-inspired algorithms widely used in global optimization problems (Glover and Kochenberger 2003). Metaheuristic means that in an algorithm, there is a "master strategy" at a higher level that guides heuristics applied in local search.

Related works
The literature on using metaheuristic optimization methods to train fuzzy systems is extensive. Below, we discuss papers in which hybrid methods combining metaheuristics and regressions were applied. In such methods, the antecedents are trained by metaheuristic methods, while the consequents are trained by regressions.
One of the most commonly used algorithms to train fuzzy systems is particle swarm optimization (PSO). An approach presented in Li and Wu (2011) combines particle swarm optimization and a recursive least squares estimator to obtain a fuzzy approximation. The PSO is used to train the antecedent part of the first-order T-S system, whereas the consequent part is trained by the RLSE method. Building a type-2 neural-fuzzy system was discussed in Yeh et al. (2011). In the first step, a fuzzy clustering method is used to partition the dataset into clusters. Then, a type-2 fuzzy Takagi-Sugeno-Kang (TSK) rule is derived from each cluster. The parameters are refined using PSO and a divide-and-merge-based least squares. In Ying et al. (2011), an approach to function approximation using robust fuzzy regression and particle swarm optimization is proposed. A fuzzy regression is used to construct the first-order TSK fuzzy model, whereas particle swarm optimization is used to tune its parameters. A self-learning complex neuro-fuzzy system that uses Gaussian complex fuzzy sets was proposed in Li et al. (2012). The knowledge base consists of the T-S fuzzy rules with complex fuzzy sets in the antecedent part and linear models in the consequent part. The antecedent parameters and the consequent parameters are trained by the particle swarm optimization algorithm and recursive least squares, respectively. In Soltani et al. (2012), a method for fuzzy c-regression model clustering was proposed. The method combines the advantages of two algorithms: clustering and particle swarm optimization. The consequent parameters of the first-order T-S fuzzy rules are estimated by the orthogonal least squares method. A self-constructing radial basis function neural-fuzzy system was proposed in Yang et al. (2013).
The proposed method uses particle swarm optimization for generating the antecedent parameters and the least-Wilcoxon norm for the consequent parameters, instead of the traditional least squares estimation. A two-step fuzzy model building algorithm based on particle swarm optimization and kernel ridge regression was presented in Boulkaibet et al. (2017). In the first step, clustering based on particle swarm optimization separates the input data into clusters and yields the antecedent parameters. In the second step, the consequent parameters are calculated using a kernel ridge regression. In Taieb et al. (2018), the adaptive chaos particle swarm optimization algorithm (ACPSO) using weighted recursive least squares was proposed. The ACPSO is used to optimize the parameters of the model, and then the obtained parameters are used to initialize the fuzzy c-regression model. A fuzzy model identification method was proposed in Tsai and Chen (2018). Firstly, the fuzzy c-means algorithm is used to determine the rule number. Next, the initial fuzzy sets and the consequent parameters are obtained by particle swarm optimization. The final parameters are obtained using fuzzy c-regression and orthogonal least squares methods. In Tu and Li (2018), a complex-fuzzy machine learning approach to function approximation is proposed. Particle swarm optimization is used to select the premise parameters of the first-order fuzzy model, while the recursive least squares estimator is used to find the consequent parameters.
Other commonly used methods are genetic algorithms (GA). A two-step approach to the construction of first-order fuzzy rules from data was proposed in Setnes and Roubos (2000). In the first step, fuzzy clustering and the weighted least squares are used to obtain an initial fuzzy model. In the second step, this model is optimized by a real-coded genetic algorithm that allows simultaneous tuning of the rule antecedents and consequents. In Wang et al. (2005), a scheme based on multi-objective hierarchical GA (MOHGA) is proposed. This scheme is used to extract interpretable rule-based knowledge from data. First, fuzzy clustering is applied to generate an initial rule-based model. Then, the MOHGA and the recursive least squares estimator (RLSE) are used to obtain the optimized fuzzy models. A fuzzy modeling approach for the identification of nonlinear control processes was discussed in Yusof et al. (2011). This approach is based on a combination of genetic algorithm and recursive least squares. The antecedent parameters of the first-order T-S model are tuned by a genetic algorithm, whereas the consequent part is identified by recursive least squares estimator.

Contributions
From the literature review, it can be seen that at most first-order polynomials are used in the consequent part when training fuzzy systems. In this paper, we propose to use high-order fuzzy systems for two-variable function approximation. In such systems, higher-order polynomials are used in the consequent part of the rules, which can give greater flexibility in the selection of system parameters. Moreover, sparse regressions have not been used for two-variable function approximation. Sparse regressions can generate sparse models (Sjöstrand et al. 2018), which are more compact and easier to interpret and implement. Summarizing, the main contributions of this paper are:
-The definition of high-order Takagi-Sugeno fuzzy systems with two input variables,
-The use of sparse regressions and metaheuristic optimization to train these systems.
In the proposed method, the premise parameters are determined manually or by metaheuristic optimization methods such as particle swarm optimization (PSO), genetic algorithm (GA), and simulated annealing (SA). The consequent parameters are calculated by ordinary least squares (OLS), ridge regression (RIDGE), and sparse regressions. The following sparse regressions have been used: forward selection (FS), least angle regression (LAR), least absolute shrinkage and selection operator (LASSO), and elastic net regression (ENET). The OLS regression was used as a reference model. This paper is a continuation of the work (Wiktorowicz and Krzeszowski 2020), where the approximation of one-variable functions was considered.

[Table 1. Comparison of approaches to training fuzzy systems]
Regressions. Disadvantages: generates less fitted models; regression may be ill-conditioned; the number of fuzzy sets must be defined.
Metaheuristics (Almaraashi et al. 2016; Cheung et al. 2014; Lin et al. 2016; Martino et al. 2014; Zhao et al. 2010). Advantages: simple to use and implement. Disadvantages: large search space; difficulty in finding optimal parameters; the number of fuzzy sets must be defined; high computational cost; stochastic characteristic of the results.
Metaheuristics with regression (Li and Wu 2011; Li et al. 2012; Taieb et al. 2018; Tu and Li 2018; Ying et al. 2011; Yusof et al. 2011; Yang et al. 2013). Advantages: the use of regression reduces the search space; the use of clustering gives the initial structure of a system. Disadvantages: clustering complicates the algorithm; regression may be ill-conditioned; stochastic characteristic of the results.
Our approach. Advantages: the use of sparse regression simplifies the model; using the ridge and sparse regressions prevents the occurrence of ill-conditioned problems; the use of regression reduces the search space; the use of a high-order system provides greater flexibility in system design. Disadvantages: the number of fuzzy sets must be defined; stochastic characteristic of the results; high computational cost; applying the high-order fuzzy system increases the number of parameters in the consequent part.

Paper structure
The structure of this paper is as follows. Section 2 describes the Takagi-Sugeno fuzzy system with two inputs and with high-order polynomials in the consequent parts of the fuzzy rules. Section 3 presents the training methods of the consequent parameters when using the OLS, RIDGE, and sparse regressions. Section 4 presents the training methods of the antecedent parameters when using the PSO, GA, and SA methods. The performance criterion is described in Sect. 5. Section 6 contains the design procedure for training fuzzy models. In Sect. 7, the experimental results are presented. Finally, the conclusions are given in Sect. 8.

High-order Takagi-Sugeno fuzzy system
We consider a Takagi-Sugeno (T-S) fuzzy system (Takagi and Sugeno 1985) with two inputs x_1, x_2 and one output y, described by r fuzzy inference rules of the form

R_j: IF x_1 is F_j(x_1) AND x_2 is G_j(x_2) THEN y_j = P_j(x_1, x_2),  (1)

where j = 1, 2, ..., r, F_j(x_1), G_j(x_2) are fuzzy sets, and P_j(x_1, x_2) is a polynomial of degree d.

Definition 1
The T-S system with the rules (1) is called:
-Zero-order, if P_j(x_1, x_2) = a_j, which means that the consequent functions are constants (the polynomial degree d is equal to zero) (Takagi and Sugeno 1985),
-First-order, if P_j(x_1, x_2) = a_j + b_{j1} x_1 + c_{j1} x_2, which means that the consequent functions are linear (the polynomial degree d is equal to one) (Takagi and Sugeno 1985),
-High-order, if P_j(x_1, x_2) = a_j + Σ_{k=1}^{d} (b_{jk} x_1^k + c_{jk} x_2^k) with k = 2, 3, ..., m terms present, which means that the consequent functions are nonlinear (the polynomial degree d is greater than one).
In this paper, we use Gaussian membership functions that can be unevenly spaced in the universe of discourse (see Fig. 1). These functions are defined by

A_k(x_1) = exp(−(x_1 − p_k)^2 / (2σ_k^2)),  (2)
B_k(x_2) = exp(−(x_2 − q_k)^2 / (2δ_k^2)),  (3)

where k = 1, ..., ρ, ρ is the number of fuzzy sets for the inputs, p_k, q_k are the peaks, and σ_k, δ_k > 0 are the widths. Using the definitions of the fuzzy sets A_k(x_1) and B_k(x_2), the fuzzy rules (1) can be written in the tabular form presented in Table 2, where r = ρ^2 is the number of rules, A_j(x_1), B_j(x_2) are the fuzzy sets for the inputs x_1 and x_2, and P_j(x_1, x_2) is the high-order polynomial of x_1 and x_2. The output of the T-S system is computed by

ŷ = Σ_{j=1}^{r} A_j(x_1) B_j(x_2) P_j(x_1, x_2) / Σ_{j=1}^{r} A_j(x_1) B_j(x_2).  (4)

Definition 2 (Wang and Mendel 1992) The fuzzy basis function (FBF) for the jth rule is the function ξ_j(x_1, x_2) given by

ξ_j(x_1, x_2) = A_j(x_1) B_j(x_2) / Σ_{i=1}^{r} A_i(x_1) B_i(x_2).  (5)

Applying (5), the output of the T-S system can be written:
-For the zero-order system as

ŷ = Σ_{j=1}^{r} ξ_j(x_1, x_2) a_j,  (6)

-For the first-order and high-order systems as

ŷ = Σ_{j=1}^{r} ξ_j(x_1, x_2) (a_j + Σ_{l=1}^{d} b_{jl} x_1^l + Σ_{l=1}^{d} c_{jl} x_2^l).  (7)

Because in (7) the FBFs are multiplied by x_1^l and x_2^l, where l = 1, 2, ..., d, we define a modified fuzzy basis function.

Definition 3 The modified FBFs for the jth rule are the functions

ξ_j(x_1, x_2) x_1^l,  (8)
ξ_j(x_1, x_2) x_2^l,  (9)

where l = 1, 2, ..., d.

Applying (8) and (9), the products of the FBFs and the powers of the inputs in (7) can be collected into vectors. We introduce the following vectors:
-For the zero-order system,

h_j(x_1, x_2) = ξ_j(x_1, x_2),  w_j = a_j,  (11)-(12)

-For the first-order and high-order systems,

h_j(x_1, x_2) = [ξ_j, ξ_j x_1, ..., ξ_j x_1^d, ξ_j x_2, ..., ξ_j x_2^d],  (13)
w_j = [a_j, b_{j1}, ..., b_{jd}, c_{j1}, ..., c_{jd}]^T.  (14)

The output of the T-S system can now be written as

ŷ = h(x_1, x_2) w,  (15)

where h(x_1, x_2) = [h_1(x_1, x_2), ..., h_r(x_1, x_2)] and w = [w_1^T, ..., w_r^T]^T. The vector w contains p = r(2d + 1) parameters of the T-S fuzzy model to be determined.
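To make the construction concrete, the computation of the system output through the fuzzy basis functions can be sketched as follows (illustrative Python rather than the Matlab implementation used in the paper; the function and variable names are ours):

```python
import math

def gaussian(x, peak, width):
    # Gaussian membership function with a given peak and width
    return math.exp(-((x - peak) ** 2) / (2.0 * width ** 2))

def ts_output(x1, x2, rules, d):
    """Output of a high-order T-S system via fuzzy basis functions.

    Each rule is ((p, sigma), (q, delta), coeffs) with
    coeffs = [a, b_1, ..., b_d, c_1, ..., c_d], i.e. 2d+1 coefficients per rule.
    """
    # rule firing strengths: product of the two antecedent membership degrees
    mu = [gaussian(x1, p, s) * gaussian(x2, q, dl)
          for (p, s), (q, dl), _ in rules]
    total = sum(mu)
    y = 0.0
    for mu_j, (_, _, coeffs) in zip(mu, rules):
        xi = mu_j / total  # fuzzy basis function of the jth rule
        poly = coeffs[0]
        poly += sum(coeffs[l] * x1 ** l for l in range(1, d + 1))
        poly += sum(coeffs[d + l] * x2 ** l for l in range(1, d + 1))
        y += xi * poly
    return y
```

Because the fuzzy basis functions sum to one, a system whose consequents are all the same constant returns exactly that constant, and a system with r rules and degree d has r(2d + 1) consequent coefficients, matching the dimension of w.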

Training the consequent parameters
We assume that the observations ((x_1)_i, (x_2)_i, y_i) are known, where i = 1, ..., n and n is the number of observations. We introduce the regression matrix

X = [h((x_1)_1, (x_2)_1); h((x_1)_2, (x_2)_2); ...; h((x_1)_n, (x_2)_n)],  (18)

where the entries h_j((x_1)_i, (x_2)_i) are given by (11) or (13).

Ordinary least squares
The cost function to be minimized in the OLS is the sum of squared errors

J(w) = Σ_{i=1}^{n} (y_i − ŷ_i)^2,  (19)

where ŷ_i is the estimated output of the system (see Eq. 15) for the ith observation. The optimal solution is given by (Bishop 2006)

w = (X^T X)^{−1} X^T y,  (20)

where y = [y_1, ..., y_n]^T. Because the model parameters are computed directly from all the data contained in X and y, this method is a batch least squares.

Ridge regression
The cost function in the ridge regression (Hoerl and Kennard 1970) is the penalized sum of squared errors

J(w) = Σ_{i=1}^{n} (y_i − ŷ_i)^2 + λ ||w||_2^2,  (22)

where λ ≥ 0 is a regularization parameter. The fuzzy model weights are given by

w = (X^T X + λI)^{−1} X^T y,  (23)

where I is the identity matrix. The ridge regression is applied in this paper because it can be used for ill-conditioned problems, that is, when the matrix X^T X is close to singular. The ridge regression, similarly to the OLS, is a one-pass method, and therefore it is very fast.
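For a two-column regression matrix, the ridge solution can be computed directly from the 2x2 normal equations. The following sketch (illustrative Python; the function name and sample data are ours) shows both the regularization effect and that λ = 0 recovers the OLS solution:

```python
def ridge_2var(X, y, lam):
    # w = (X^T X + lam * I)^(-1) X^T y for a two-column regression matrix,
    # computed via the closed-form inverse of the 2x2 matrix
    s11 = sum(row[0] * row[0] for row in X) + lam
    s12 = sum(row[0] * row[1] for row in X)
    s22 = sum(row[1] * row[1] for row in X) + lam
    t1 = sum(row[0] * yi for row, yi in zip(X, y))
    t2 = sum(row[1] * yi for row, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [6.0, 5.0, 7.0, 10.0]
a, b = ridge_2var(X, y, 0.0)          # lam = 0 gives the OLS solution
a_big, b_big = ridge_2var(X, y, 1e6)  # a large lam shrinks both weights toward 0
```

The same closed form scales to p columns with a general matrix inverse; the key point is that the λI term keeps the system invertible even when X^T X is close to singular.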

Sparse regressions
The sparse regressions briefly described in this section allow the coefficients of a model to be exactly zero (Sjöstrand et al. 2018). These regressions lead to simplified models that are easier to interpret.
In the forward selection, which is an example of stepwise regression, the variables are added one by one to the model. In the beginning, all coefficients are equal to zero, and then a particular variable is chosen. The next variable to include can be chosen based on a number of criteria. For example, it can be the one that has the highest correlation with the current residual vector (Sjöstrand et al. 2018).
The least angle regression (Efron et al. 2004;Sjöstrand et al. 2018) works similarly to the FS procedure, but the algorithm does not move in the direction of one variable. In the LAR, the estimated parameters are calculated in a direction in which the angles with each of the variables currently in the model are equal. This algorithm is the basis for other sparse methods, such as the LASSO and elastic net regression.
The least absolute shrinkage and selection operator regression (Sjöstrand et al. 2018; Tibshirani 1996) has a mechanism that implements coefficient shrinkage and variable selection. The cost function combines the sum of the squared errors and a penalty function based on the L_1 norm:

J(w) = Σ_{i=1}^{n} (y_i − ŷ_i)^2 + λ ||w||_1,  (24)

where λ is a nonnegative regularization parameter. The elastic net regression (Sjöstrand et al. 2018; Zou and Hastie 2005) combines the features of the ridge regression and the LASSO. The cost function includes penalty terms related to both the L_1 and the L_2 norms:

J(w) = Σ_{i=1}^{n} (y_i − ŷ_i)^2 + δ ||w||_2^2 + λ ||w||_1,  (25)

where λ and δ are nonnegative regularization parameters. The solution is found by the LARS-EN algorithm, which is based on the LARS algorithm (Efron et al. 2004).
Example 1 Consider a simple regression problem with a small amount of data. We have four observations (n = 4) in the form of the vectors x = [1, 2, 3, 4]^T and y = [6, 5, 7, 10]^T. The goal is to build the regression model ŷ = ax + b, where β = [a, b]^T is the vector of the model coefficients. To obtain a model with the intercept term (the constant b different from zero), we add a column of ones to the regression matrix, which has the form

X = [1 1; 2 1; 3 1; 4 1].

It is easy to check that the OLS method gives the solution ŷ = 1.4x + 3.5, where a = 1.4 and b = 3.5. Applying the FS, we obtain three solutions in the coefficient path: β_1 = [0, 0]^T (the empty model), β_2 = [a_2, 0]^T (only the slope selected, with a_2 obtained by regressing y on x alone), and β_3 = [1.4, 3.5]^T (the full OLS solution). The LAR and the LASSO methods generate similar paths, as does the ENET with δ = 0. We can see that in the solution β_2, the coefficient b is exactly zero, which results from using the sparse regressions. The selection of one of the solutions is based on a specific criterion, e.g., cross-validation, Akaike's information criterion, or Bayesian information criterion (Sjöstrand et al. 2018).
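The sparsity effect in this example can be reproduced numerically. The sketch below is an illustrative coordinate-descent implementation of the LASSO with soft-thresholding (our own code, not the LARS-based solver from the SpaSM toolbox used in the paper); note that the intercept is penalized here, as in the example, so a large enough λ drives b exactly to zero:

```python
def soft(z, t):
    # soft-thresholding operator, the proximal map of t * |.|
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def lasso_cd(x, y, lam, sweeps=500):
    """Minimize sum_i (y_i - a*x_i - b)^2 + lam * (|a| + |b|) by coordinate descent."""
    n, a, b = len(x), 0.0, 0.0
    sxx = sum(v * v for v in x)
    for _ in range(sweeps):
        # update each coefficient with the other held fixed
        a = soft(sum(xi * (yi - b) for xi, yi in zip(x, y)), lam / 2.0) / sxx
        b = soft(sum(yi - a * xi for xi, yi in zip(x, y)), lam / 2.0) / n
    return a, b

x = [1.0, 2.0, 3.0, 4.0]
y = [6.0, 5.0, 7.0, 10.0]
dense = lasso_cd(x, y, 0.0)    # converges to the OLS solution (1.4, 3.5)
sparse = lasso_cd(x, y, 20.0)  # the intercept b is exactly zero here
```

Varying λ between these two extremes traces out a regularization path analogous to the coefficient path discussed above.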

Particle swarm optimization
Particle swarm optimization is a population-based algorithm developed by Kennedy and Eberhart (Kennedy and Eberhart 1995; Eberhart and Shi 2000). It is based on the social behavior of organisms that live in large groups, such as bird flocks or fish schools. In PSO, a group of particles (a population) forms a swarm, in which each particle represents a hypothetical solution. Each particle remembers its best position pbest and has access to the best position gbest in the swarm. The best local and global positions are selected using an objective function (Sect. 5). The learning scheme is based on two components: -Cognition component-attracts a particle toward its local best position, -Social component-attracts particles toward the best position in the swarm.
The velocity v_k and the position x_k of the kth particle are calculated from the following equations (Eberhart and Shi 2000; MathWorks 2019a):

v_k(l + 1) = ω v_k(l) + c_1 r_1 ∘ (pbest_k − x_k(l)) + c_2 r_2 ∘ (gbest − x_k(l)),
x_k(l + 1) = x_k(l) + v_k(l + 1),

where ω is the inertia weight, r_1, r_2 are vectors of random numbers uniformly distributed within [0, 1], ∘ denotes element-wise multiplication, l is the current iteration number, and c_1, c_2 are the cognitive and social coefficients, respectively.
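A minimal sketch of these update rules (illustrative Python; the parameter values are typical defaults of our own choosing, not the Matlab toolbox settings used in the paper):

```python
import random

def pso(f, dim, bounds, n_particles=20, iters=300, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Minimal PSO implementing the velocity/position updates above."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                 # best position of each particle
    pval = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pval[i])
    gbest, gval = pbest[g][:], pval[g]         # best position in the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][j] = (w * vs[i][j]
                            + c1 * r1 * (pbest[i][j] - xs[i][j])    # cognition
                            + c2 * r2 * (gbest[j] - xs[i][j]))      # social
                xs[i][j] = min(hi, max(lo, xs[i][j] + vs[i][j]))
            val = f(xs[i])
            if val < pval[i]:
                pbest[i], pval[i] = xs[i][:], val
                if val < gval:
                    gbest, gval = xs[i][:], val
    return gbest, gval

# sanity check on the 2-D sphere function, minimum at the origin
pso_best, pso_val = pso(lambda v: sum(c * c for c in v), dim=2, bounds=(-5.0, 5.0))
```

In this paper's setting, the decision vector would hold the peaks and widths of the membership functions rather than the sphere-function coordinates used here.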

Genetic algorithm
Genetic algorithm (Holland 1992; MathWorks 2019a; Whitley 1994) is a method for solving optimization problems inspired by the biological process of Darwinian evolution, in which selection, crossover, and mutation play a major role. The GA repeatedly modifies a population to achieve new and possibly better solutions. In each generation of the GA, individuals are randomly selected from the current population to be "parents" and are used to obtain "children" for the next generation. In subsequent generations, the population "evolves" toward the optimal solution. The GA uses three main types of rules to create the next generation from the current population: -Selection-during this process, individuals called "parents" are selected through a fitness-based process. Individuals with a good value of the objective function (Sect. 5) are more often chosen for the next generation, -Crossover (recombination)-combines two "parents" to form "children" for the next generation; it is analogous to the crossover that takes place during sexual reproduction in biology. The new individuals have the characteristics of both parents, -Mutation-during this process, an individual mutates, that is, random changes are introduced into its genotype. The purpose of this rule is to introduce diversity into the population, which prevents premature convergence of the algorithm.
Crossover and mutation characterize the explorative and exploitative features of GA. Maintaining a balance between these two features is crucial to speed up the search process and to achieve high-quality solutions.
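The selection-crossover-mutation loop can be sketched as follows (an illustrative real-coded GA in Python with tournament selection, blend crossover, Gaussian mutation, and elitism; it is not the Matlab ga implementation used in the paper, and all parameter values are our own choices):

```python
import random

def ga(f, dim, bounds, pop_size=30, gens=200, cx_rate=0.8, mut_rate=0.1,
       mut_scale=0.3, seed=1):
    """Minimal real-coded GA minimizing f over [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds

    def tournament(pop):
        # fitness-based selection: the better of two random individuals wins
        a, b = rng.sample(pop, 2)
        return a if f(a) < f(b) else b

    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        new = [min(pop, key=f)[:]]                  # elitism: keep the current best
        while len(new) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < cx_rate:              # crossover: blend of the parents
                alpha = rng.random()
                child = [alpha * u + (1.0 - alpha) * v for u, v in zip(p1, p2)]
            else:
                child = p1[:]
            for j in range(dim):                    # mutation: random perturbation
                if rng.random() < mut_rate:
                    child[j] = min(hi, max(lo, child[j] + rng.gauss(0.0, mut_scale)))
            new.append(child)
        pop = new
    best = min(pop, key=f)
    return best, f(best)

# sanity check on the 2-D sphere function, minimum at the origin
ga_best, ga_val = ga(lambda v: sum(c * c for c in v), dim=2, bounds=(-5.0, 5.0))
```

Here crossover exploits the region between good parents, while mutation explores around them, illustrating the exploration/exploitation balance discussed above.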

Simulated annealing
Simulated annealing (Kirkpatrick et al. 1983; MathWorks 2019a) is a method for solving unconstrained and bound-constrained optimization problems. This method was originally inspired by the process of annealing in metallurgy. The SA models the process of heating a material and then gradually lowering the temperature in order to reduce defects. The goal is to move the system from the initial state to the state with minimum energy. As the algorithm runs, a new state is randomly generated and accepted with a certain probability. The acceptance probability is a function of the energies of the two states and the temperature:

P(ΔE, T) = 1 / (1 + exp(ΔE / T)),

where ΔE is the difference of energies between the new and the previous solution (ΔE = E_{k+1} − E_k) and T is the current temperature. The algorithm systematically decreases the temperature and stores the best state found so far. The energy determines how good the solution is, and it corresponds to the value of the objective function (Sect. 5).
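A minimal sketch of this scheme (illustrative Python; improving moves are always accepted, worse moves are accepted with the logistic probability above, and all parameter values, including the temperature-scaled step size, are our own choices):

```python
import math
import random

def simulated_annealing(f, x0, bounds, t0=1.0, cooling=0.99, iters=1500, seed=1):
    """Minimal SA minimizing f, starting from x0, within bounds."""
    rng = random.Random(seed)
    lo, hi = bounds
    x, e = list(x0), f(x0)
    best, best_e = x[:], e
    t = t0
    for _ in range(iters):
        # candidate state: random perturbation scaled by the current temperature
        cand = [min(hi, max(lo, xi + rng.gauss(0.0, t))) for xi in x]
        e_cand = f(cand)
        de = e_cand - e                      # dE = E_(k+1) - E_k
        if de <= 0 or rng.random() < 1.0 / (1.0 + math.exp(min(de / t, 50.0))):
            x, e = cand, e_cand
            if e < best_e:
                best, best_e = x[:], e       # store the best state found so far
        t *= cooling                         # systematically decrease temperature
    return best, best_e

# sanity check on the 2-D sphere function, minimum at the origin
sa_best, sa_e = simulated_annealing(lambda v: sum(c * c for c in v),
                                    [3.0, 4.0], (-5.0, 5.0))
```

The `min(de / t, 50.0)` clamp only guards against overflow at very low temperatures and does not change which moves are realistically accepted.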

Performance criterion
The objective function for all methods is the square root of the mean square error

RMSE = sqrt( (1/V) Σ_{k=1}^{V} (y_k − ŷ_k)^2 ),  (33)

where V denotes the number of observations in the validation set, y_k denotes the kth output data in the validation set, and ŷ_k denotes the output of the fuzzy model obtained for the kth input data in the validation set. The fuzzy model used to calculate the estimate ŷ_k is obtained based on the observations in the training set. Fuzzy models used in this paper may be sparse, which means they may have some coefficients equal to zero. To describe the sparsity of a fuzzy model, we propose the following definition.

Definition 4 The sparsity of a T-S fuzzy model is defined as

S = z / (r(2d + 1)),  (34)

where S ∈ [0, 1], z is the number of zero-valued coefficients in the polynomials, r is the number of rules, and d is the polynomial degree.

Definition 5
The density of a T-S fuzzy model is defined as one minus the sparsity:

D = 1 − S.  (35)

In this paper, the best T-S model is chosen by minimizing a quality criterion in which the goal is to make the objective function (33) and the density as small as possible:

Q = α RMSE / RMSE_OLS + (1 − α) D,  (36)

where α ∈ [0, 1] and RMSE_OLS is the mean value of the RMSE for the OLS regression, which is treated as the reference method. The quality index (36) expresses a compromise between the prediction ability of the model and its sparsity.
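Both definitions and the criterion (36) are easy to check numerically. The sketch below (illustrative Python, our function names, assuming the normalized form of (36) implemented here) reproduces the Experiment 1 value Q ≈ 0.1450 reported for PSO-ENET from its RMSE and sparsity:

```python
def sparsity(coeffs, r, d):
    # Definition 4: S = z / (r * (2d + 1)), the fraction of zeroed coefficients
    z = sum(1 for c in coeffs if c == 0.0)
    return z / (r * (2 * d + 1))

def quality_index(rmse, rmse_ols, S, alpha=0.5):
    # Eq. (36): compromise between prediction error and model density D = 1 - S
    return alpha * rmse / rmse_ols + (1.0 - alpha) * (1.0 - S)

# Mean values reported for PSO-ENET in Experiment 1 (Sect. 7):
Q = quality_index(rmse=1.864e-3, rmse_ols=4.800e-2, S=0.7489)
# Q is approximately 0.1450, matching the reported value
```

A model with, for example, 3 zeroed coefficients out of r(2d + 1) = 5 (one rule, degree two) has S = 0.6 and D = 0.4.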

Design procedure for training fuzzy models
The following methods for building fuzzy models are applied in this paper:
-Non-sparse methods: OLS and RIDGE, and their metaheuristic variants PSO-OLS, PSO-RIDGE, GA-OLS, GA-RIDGE, SA-OLS, SA-RIDGE,
-Sparse methods (SR): FS, LAR, LASSO, ENET, and their metaheuristic variants PSO-SR, GA-SR, SA-SR.
The design procedure for training fuzzy models is presented in Fig. 2 (in the first step, the fuzzy sets for the system inputs are selected, followed by the decision to use sparse regressions, and finally the models are validated; the results are the polynomial coefficients in the consequents of the fuzzy rules). In Block 1, the Gaussian fuzzy sets are proposed. In the OLS, RIDGE, and SR methods, one proposition is generated in such a way that these sets are distributed evenly in the spaces X_1, X_2, and the cross-point of two adjacent sets is equal to 0.5. In the PSO-OLS, PSO-RIDGE, GA-OLS, GA-RIDGE, SA-OLS, SA-RIDGE and PSO-SR, GA-SR, SA-SR methods, 10 propositions are generated by the PSO, GA or SA algorithms. The outputs of Block 1 are the vectors p, σ and q, δ. In Block 2, the regression matrix X (18) is determined. In Block 3, the coefficient path for one of the SR methods is generated. In Block 4, the non-sparse methods are validated. As a result of validating the OLS method, the value of RMSE_OLS in the quality criterion (36) is obtained. In Block 5, the sparse methods are validated. The validation is done along the coefficient path. For all propositions, the RMSE, the sparsity S, and the quality index Q are calculated. Then, the smallest value of Q is chosen, with the constraint that the RMSE is not greater than RMSE_OLS.

Experimental setup
This section gives examples of two-variable nonlinear function approximation. The following parameters were used in all experiments. The number of observations ([(x_1)_i, (x_2)_i]^T, y_i) was n = 81, and they were evenly distributed in the space X_1 × X_2. The best method was selected using Monte-Carlo cross-validation (MCCV) (Picard and Cook 1984), in which the data set is divided randomly, with some fraction of the data forming the training set and the remaining points assigned to the validation set. This process was repeated 10 times, generating new training and validation partitions in the proportion of 70% training data and 30% validation data. Statistical analysis was carried out using the Wilcoxon signed rank test for differences in the RMSE results between all methods and the reference method (OLS). For the inputs of the fuzzy system, three fuzzy sets were defined, which gave nine fuzzy inference rules. The widths of the fuzzy sets were bounded in the intervals [σ_min, σ_max] = [δ_min, δ_max] = [0.0849, 2.123]. The degree of the polynomials in the consequent part was set to two. For the ridge regression (23), λ = 1e−08 was used, and for the ENET regression (25), δ = 1e−08. The number of objective function evaluations was 6000. The parameter in the quality criterion (36) was α = 0.5. For the metaheuristic algorithms, default parameter values were adopted in accordance with the implementation contained in the Matlab toolbox. The experiments were carried out on a mobile computer equipped with an Intel(R) Core(TM) i5-7200U and 8 GB RAM.
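The MCCV splitting described above can be sketched as follows (illustrative Python; the seed and function name are ours):

```python
import random

def mccv_splits(n, n_repeats=10, train_frac=0.7, seed=1):
    """Monte-Carlo cross-validation: repeated random train/validation partitions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # a fresh random permutation each repeat
        cut = int(round(train_frac * n))
        yield idx[:cut], idx[cut:]            # 70% training / 30% validation indices

# with n = 81 observations, as in the experiments, each split is 57 / 24
splits = list(mccv_splits(81))
```

Unlike k-fold cross-validation, the validation sets of different repeats may overlap, which is why the results are reported as means over the 10 repetitions.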

Implementation
The function regress from the Matlab Statistics and Machine Learning Toolbox (MathWorks 2019b) has been used to apply the OLS regression. The ridge regression has been implemented in Matlab using a custom function.
The sparse regressions have been implemented in Matlab using the toolbox SpaSM (Sjöstrand et al. 2018). From this toolbox, the following functions have been used: forwardselection, lar, lasso, and elasticnet. These functions take the regression matrix X and the vector y as arguments. Moreover, the function elasticnet has the regularization parameter δ. As the output, the described functions return the solution path in the form of the coefficients w, from which the best solution can be selected.
The metaheuristic methods have been implemented using the Global Optimization Toolbox in Matlab (MathWorks 2019a). From this toolbox, the following functions have been used: particleswarm, ga, and simulannealbnd. These functions allow the solution to be obtained subject to the bounds defined by the user. They operate on the vector of the antecedent parameters, that is, the peaks p_1, ..., p_ρ, q_1, ..., q_ρ and the widths σ_1, ..., σ_ρ, δ_1, ..., δ_ρ of the membership functions.

[Table 3. Performance comparison for Experiment 1; RMSE is the mean of the validation error, std is the standard deviation, min is the minimum value, max is the maximum value, p is the p-value of the Wilcoxon test, S is the mean of the model sparsity, Q is the mean of the quality index.]

Results of experiment 1
We consider the nonlinear function of Yeh et al. (2011), where x_1 ∈ [−1, 1] and x_2 ∈ [0, 1]. The results of Experiment 1 are presented in Table 3. The statistical analysis of the RMSE showed that most of the calculated models generate significantly different results (p < 0.05) compared to the OLS model. The exceptions are the LAR, LASSO, and ENET models. The smallest value of the quality index, Q = 0.1450, was obtained for the PSO-ENET method. For this method, the validation error RMSE is 1.864e−03, which is smaller than for the reference model, for which this error is equal to 4.800e−02. The sparsity S is 0.7489, which means that the PSO-ENET method zeroed out about 75% of the 45 coefficients. Thanks to this, the model is easier to interpret and implement. It is worth noting that in the PSO-ENET model, three rules (R_4, R_5, R_6) have the zero polynomial in the consequent part. Figures 3, 4 and 5 show the goal function y, the estimator ŷ, and the approximation error y − ŷ for the best model. The average time of training high-order T-S fuzzy systems using the PSO-ENET method for one MCCV data subset was about 40.28 s (Table 5). The methods with manually chosen fuzzy sets (OLS, RIDGE, FS, LAR, LASSO, ENET) have the shortest calculation times, while the longest times were obtained by the algorithms using SA.

Results of experiment 2
This experiment applies the nonlinear function of Yeh et al. (2011). It can be seen that two rules (R_1 and R_7) have the zero polynomial in the consequent part. Figures 6, 7 and 8 show the goal function y, the estimator ŷ, and the approximation error y − ŷ for the best model. The average time of training high-order T-S fuzzy systems using the PSO-FS method for one MCCV data subset was about 48.48 s (Table 5). As in Experiment 1, the methods with manually chosen fuzzy sets (OLS, RIDGE, FS, LAR, LASSO, ENET) have the shortest calculation times.

Conclusions
A method of training high-order Takagi-Sugeno systems for two-variable function approximation has been proposed. The method is based on sparse regressions and metaheuristic optimization. The antecedent parameters of the fuzzy rules are set manually or by metaheuristic optimization methods such as particle swarm optimization, genetic algorithm, or simulated annealing. The consequent parameters are determined by ordinary least squares, ridge regression, or sparse regressions such as forward selection, least angle regression, least absolute shrinkage and selection operator, or elastic net. Ordinary least squares regression is used as a reference method. A quality criterion based on a sparsity measure has been proposed to assess the quality of the fuzzy models. Compared with the reference method, the conducted experiments showed that the use of sparse regressions and/or metaheuristic optimization can reduce the validation error, and that the use of sparse regressions may simplify the fuzzy model by zeroing some of its coefficients.

Compliance with ethical standards
Conflict of interest The authors declare that there is no conflict of interests regarding the publication of this paper.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.