A data-driven TSK-style rule base generation approach for the proposed TSK+ inference engine is presented in this section, as outlined in Fig. 1. Given a data set \(\mathbb {T}\), which may be sparse, unevenly distributed, or dense, the system first groups the data instances into clusters using clustering algorithms. Then, each cluster is expressed as a TSK rule by employing linear regression. From this, an initial rule base is generated by combining all the extracted rules. Finally, the initialised rule base is optimised by applying the genetic algorithm (GA), which fine-tunes the membership functions of the fuzzy sets in the rule base.
Rule base initialisation
Centroid-based clustering algorithms are traditionally employed in TSK fuzzy modelling to group similar objects together for rule extraction (Chen and Linkens (2004)), and this is also the case in this work. However, differing from existing TSK-style rule base generation approaches, the proposed system works with dense, sparse and unevenly distributed data sets. Therefore, a two-level clustering scheme is applied in this work. The first level of clustering divides the given (dense/sparse) data set into multiple sub-data sets using the sparse K-Means clustering algorithm (Witten and Tibshirani (2010)). Owing to the nature of sparse K-Means clustering, these sub-data sets can generally be considered dense. The second level of clustering is applied to each obtained dense sub-data set, employing the standard K-Means clustering algorithm (MacQueen (1967)), to generate rule clusters for TSK fuzzy rule extraction. Note that the number of clusters has to be pre-defined for both sparse K-Means and standard K-Means, which is discussed first below.
Number of clusters determination
A number of approaches have been proposed in the literature to determine the value of k, such as the Elbow method, cross-validation, the Bayesian Information Criterion (BIC)-based approach, and the Silhouette-based approach (Kodinariya and Makwana (2013)). In particular, the Elbow method is fast and effective, and it is therefore employed in this work. This approach determines the number of clusters based on the criterion that adding another cluster does not lead to a significantly better modelling result with respect to a given objective function. For instance, for a given problem, the relationship between performance improvement and the value of k is shown in Fig. 2. The value of k in this case can be determined as 4, which is the obvious turning point (or Elbow point).
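To make this heuristic concrete, a minimal sketch is given below using scikit-learn's KMeans; the function name elbow_k, the candidate range k_max and the 10% improvement threshold are illustrative assumptions, not parameters specified in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_k(data, k_max=10, rel_improvement=0.10):
    """Pick k at the 'elbow': the last k for which adding one more
    cluster still reduces the within-cluster SSE appreciably."""
    sse = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        sse.append(km.inertia_)  # within-cluster sum of squared errors
    for k in range(1, k_max):
        # relative SSE reduction gained by moving from k to k+1 clusters
        gain = (sse[k - 1] - sse[k]) / sse[k - 1]
        if gain < rel_improvement:
            return k  # a (k+1)-th cluster no longer pays off
    return k_max
```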
Dense sub-data set generation
Sparse K-Means is an extension of the standard K-Means for handling sparse data sets (Witten and Tibshirani (2010)). Assuming k clusters are required, sparse K-Means also starts with the initialisation of k centroids (usually at random). This is followed by the assignment of data instances to centroids and the updating of centroids based on the assignments, which are iterated until there is no change in the assignments. Different from the standard K-Means, which assigns objects with the goal of minimising the within-cluster sum of squares error (SSE), sparse K-Means assigns objects by maximising the between-cluster sum of squares error (BCSS), which is defined as:
$$\begin{aligned} BCSS = \sum _{q=1}^{m+1} \left( \sum _{t=1}^{p} (x_{tq} - \mu _q)^2 - SSE\right) , \end{aligned}$$
(9)
where p is the total number of data instances in the given data set, m is the number of input features in the given data set, \(\mu _q\) is the mean of all the elements of the qth feature, and \(x_{tq}\) is the qth feature of the tth data point in the given data set.
The within-cluster sum of squares error SSE is defined as:
$$\begin{aligned} SSE=\sum _{j=1}^{k}\sum _{t=1}^{p_{j}}\left( \parallel x_{jt} - v_j \parallel \right) ^{2}, \end{aligned}$$
(10)
where k is the number of clusters determined by the Elbow approach, \(p_j\) is the number of data instances in the jth cluster, \(x_{jt}\) is the tth data point in the jth cluster, \(v_j\) is the jth cluster centre, and \(\parallel x_{jt} - v_j \parallel \) is the Euclidean distance between \(x_{jt}\) and \(v_j\). Note that if the labels in a given data set are symbolic values, only 0-order TSK rules are required and thus Eq. 9 becomes:
$$\begin{aligned} BCSS = \sum _{q=1}^{m} \left( \sum _{t=1}^{p} (x_{tq} - \mu _q)^2 - SSE\right) . \end{aligned}$$
(11)
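As an illustration, the BCSS criterion of Eqs. 9 and 10 can be evaluated for a given cluster assignment as sketched below, using the standard decomposition of total dispersion into within-cluster and between-cluster parts. Note that this only evaluates the objective; it does not reproduce the full feature-weighting scheme of sparse K-Means (Witten and Tibshirani (2010)).

```python
import numpy as np

def bcss(data, labels, centers):
    """Between-cluster sum of squares (Eq. 9), via the decomposition
    total dispersion = within-cluster SSE + BCSS."""
    # sum over features q and instances t of (x_tq - mu_q)^2
    total = ((data - data.mean(axis=0)) ** 2).sum()
    # within-cluster SSE of Eq. 10, summed over all clusters
    within = sum(((data[labels == j] - centers[j]) ** 2).sum()
                 for j in range(len(centers)))
    return total - within
```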
Rule cluster generation
Once the given training data set \(\mathbb {T}\) has been divided into k dense sub-data sets, K-Means is applied to each determined sub-data set \(T_i\) (\(1\le i \le k\)) to generate rule clusters, each representing a rule. Assume that \(k_i\) clusters are required for a sub-data set \(T_i\). K-Means is initialised with \(k_i\) random cluster centroids. It then assigns every data instance to one cluster by minimising the SSE:
$$\begin{aligned} SSE=\sum _{j=1}^{k_i}\sum _{t=1}^{p_{j}^i}\left( \parallel x_{jt}^i - v_j^i \parallel \right) ^{2}, \end{aligned}$$
(12)
where \(p_{j}^i\) is the number of data points in the jth cluster of the sub-data set \(T_i\), \(x_{jt}^i\) is the tth data point in the jth cluster of the sub-data set \(T_i\), \(v_j^i\) is the centre of the jth cluster in the sub-data set \(T_i\), and \(\parallel x_{jt}^i - v_j^i \parallel \) is the Euclidean distance between \(x_{jt}^i\) and \(v_j^i\). Once all the data instances are assigned, the algorithm updates the cluster centroids according to the newly assigned members. These two steps are iterated until there is no change in object assignments. After K-Means is applied, the given training data set \(\mathbb {T}\) is divided into \(n = \sum _{i=1}^{k}k_i\) clusters. For simplicity, the generated rule clusters are jointly represented as \(\{RC_1, RC_2, \ldots , RC_n\}\).
Fuzzy rule extraction
Each cluster determined by the above steps is utilised to form one TSK fuzzy rule. A number of approaches have been proposed to represent a cluster using a Gaussian membership function, such as Rezaee and Zarandi (2010). However, given that most real-world data are not normally distributed, the cluster centroid is usually not identical to the centre of the Gaussian membership function, and thus Gaussian membership functions may not accurately represent the distribution of the calculated clusters. In order to avoid this issue and also to maintain computational efficiency, as stated in Sect. 3, triangular membership functions are utilised in this work.
Suppose that a data set has m input features and a single-output feature. Given a rule cluster \(RC_r\) \((1\le r \le n)\), a TSK fuzzy rule \(R_r\) can be extracted from the cluster as follows:
$$\begin{aligned}&R_r:\ \mathbf{IF}\ x_1\ \text {is}\ A_{1r}\text { and } \ldots \text { and } x_m\ \text {is}\ A_{mr}\nonumber \\&\qquad \mathbf{THEN}\ y=f_r(x_1, \ldots ,x_m) . \end{aligned}$$
(13)
Without loss of generality, take the sth dimension (\(1 \le s\le m\)) of rule cluster \(RC_r\) as an example, denoted as \(RC_r^s\). Suppose that \(RC_r^s\) has \(p_r\) elements, i.e. \(RC_r^s = \{x_s^1, x_s^2, \ldots ,x_s^{p_r}\}\). As only triangular fuzzy sets are used in this work, fuzzy set \(A_{sr}\) can be precisely represented as \((a^1_{sr}, a^2_{sr}, a^3_{sr})\). The core of the triangular fuzzy set is set as the cluster centroid, that is, \(a^2_{sr} = \sum _{q=1}^{p_r} x_s^q/p_r\); and the support of the fuzzy set is set as the span of the cluster, i.e. \((a^1_{sr}, a^3_{sr}) = (\min \{x^1_s, x^2_s, \ldots , x_s^{p_r}\}, \max \{x^1_s, x^2_s, \ldots , x^{p_r}_s\})\).
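The antecedent construction just described can be sketched as follows, assuming each rule cluster is given as a \(p_r \times m\) array of its input values.

```python
import numpy as np

def extract_antecedent(rule_cluster):
    """Build one triangular fuzzy set (a1, a2, a3) per input dimension:
    core a2 = cluster centroid, support (a1, a3) = span of the cluster."""
    rc = np.asarray(rule_cluster)      # shape: (p_r, m), input values only
    return [(col.min(), col.mean(), col.max()) for col in rc.T]
```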
First-order polynomials are typically used as the consequents of TSK fuzzy rules. That is, \(y =\beta _{0r}+\beta _{1r}x_1 + \cdots +\beta _{mr}x_m\), where the parameters \(\beta _{0r}\) and \(\beta _{sr},\ s\in \{1,2,\ldots ,m\}\), are estimated using a linear regression approach. Locally weighted regression (LWR) is adopted in this work, owing to its ability to generate an independent model that is only related to the given cluster of data in the training data set (Nandi and Klawonn (2007); Rezaee and Zarandi (2010)). The rule consequent degenerates to 0-order if the values in the output dimension are discrete integers. From this, the raw rule base is initialised by combining all the extracted rules, taking the form of Eq. 1.
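A sketch of the consequent estimation is given below. Ordinary least squares is used here as a simplified stand-in for locally weighted regression; either way, the fitted model depends only on the data within the given rule cluster.

```python
import numpy as np

def consequent_params(cluster_x, cluster_y):
    """Estimate (beta_0r, beta_1r, ..., beta_mr) for one rule's consequent
    y = beta_0r + beta_1r*x_1 + ... + beta_mr*x_m by least squares."""
    # prepend a column of ones so beta[0] captures the intercept beta_0r
    X = np.column_stack([np.ones(len(cluster_x)), cluster_x])
    beta, *_ = np.linalg.lstsq(X, cluster_y, rcond=None)
    return beta   # beta[0] = beta_0r, beta[1:] = beta_1r ... beta_mr
```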
Rule base optimisation
The generated raw rule base is optimised in this section by fine-tuning the membership functions using a general-purpose optimisation algorithm, the genetic algorithm (GA). The GA has been successfully utilised in rule base optimisation, such as in Mucientes et al. (2009) and Tan et al. (2016). Briefly, the GA is an adaptive heuristic search algorithm for solving both constrained and unconstrained optimisation problems, inspired by the natural selection process that drives biological evolution. The algorithm first initialises the population with random individuals. It then selects a number of individuals for reproduction by applying the genetic operators. The offspring and some of the selected existing individuals jointly form the next generation. The algorithm repeats this process until a satisfactory solution is generated or a maximum number of generations has been reached.
Problem representation
Assume that an initialised TSK rule base is comprised of n rules as expressed in Eq. 1. A chromosome, or individual, denoted as \(\textit{I}\), represents a potential solution in the GA; in the proposed system, it is designed to encode the rule base, as illustrated in Fig. 3.
Population initialisation
The initial population \(\mathbb {P}= \{\textit{I}_1, \textit{I}_2,\ldots ,\textit{I}_{|\mathbb {P}|} \}\) is formed by taking the initialised rule base and its random variations. In order to guarantee that all the varied fuzzy sets are valid and convex, the constraint \(a_{sr}^{1}< a_{sr}^{2} < a_{sr}^{3}\) is applied to the genes representing each fuzzy set. The size of the population \(|\mathbb {P}|\) is a problem-specific adjustable parameter, typically ranging from tens to thousands, with 20–30 being the most commonly used values (Naik et al. (2014)).
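The following sketch illustrates one way to initialise such a population, assuming each individual is stored as an array of (a1, a2, a3) triples. The Gaussian noise scale and the population size of 25 are assumed values, the latter within the 20–30 range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_population(rule_base, size=25, scale=0.05):
    """Form the initial population from the initialised rule base and its
    random variations. `rule_base` is an array of (a1, a2, a3) triples."""
    base = np.array(rule_base, dtype=float)        # shape: (n_sets, 3)
    population = [base.copy()]                     # keep the initial rule base
    while len(population) < size:
        noisy = base + rng.normal(0.0, scale, base.shape)
        population.append(np.sort(noisy, axis=1))  # enforce a1 <= a2 <= a3
    return population
```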
Objective function
An objective function is used in the GA to determine the quality of individuals. The objective function in this work is defined as the root mean square error (RMSE). Given a training data set \(\mathbb {T} \) and an individual \(I_i\), \(1\le i \le | \mathbb {P} |\), the RMSE value can be calculated as:
$$\begin{aligned} RMSE_i = \sqrt{\frac{\displaystyle \sum \nolimits _{j=1}^{|\mathbb {T}|}\left( z_j - \hat{z}_j\right) ^2}{|\mathbb {T}|}}, \end{aligned}$$
(14)
where \(|\mathbb {T}|\) is the size of the given training data set, \(z_j\) is the label of the jth training data instance, and \(\hat{z}_j\) represents the output value produced by the proposed TSK+ inference approach. The individual with the smallest RMSE value represents the fittest solution in the population.
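For illustration, the objective function of Eq. 14 can be sketched as below, where tsk_plus_infer is a placeholder for the TSK+ inference engine of Sect. 3 rather than an implementation of it.

```python
import numpy as np

def rmse(individual, train_x, train_z, tsk_plus_infer):
    """Objective of Eq. 14: RMSE between the labels z_j and the outputs
    of the TSK+ engine under the rule base encoded by `individual`."""
    z_hat = np.array([tsk_plus_infer(individual, x) for x in train_x])
    return np.sqrt(np.mean((np.asarray(train_z) - z_hat) ** 2))
```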
Selection
A number of individuals need to be selected for reproduction, which is implemented in this work by the fitness proportionate selection method, also known as the roulette wheel selection. Assuming that \(f_i\) is the fitness of individual \(\textit{I}_i\) in the current population \(\mathbb {P}\), its probability of being selected to generate the next generation is:
$$\begin{aligned} p(I_i) = \frac{f_i}{\displaystyle \sum \nolimits _{j=1}^{|\mathbb {P}|}f_j}, \end{aligned}$$
(15)
where \(|\mathbb {P}|\) is the size of the population. The fitness value \(f_i\) of an individual \(\textit{I}_i\) in the proposed system is determined by adopting the linear-ranking algorithm (Baker (1985)), given as:
$$\begin{aligned} f_i = 2-max+\frac{2(max-1)(r_i-1)}{|\mathbb {P}|-1}, \end{aligned}$$
(16)
where \(r_i\) is the ranking position of individual \(I_i\) in the ordered population \(\mathbb {P}\), and max is the bias (or selective pressure) towards the fittest individuals in the population, typically set within \([1, 2]\).
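The two equations combine into the selection step as sketched below. The selective pressure value of 1.8 is an assumed choice, and ranks are assigned so that the lowest-RMSE individual receives the highest rank (and hence the highest fitness).

```python
import numpy as np

rng = np.random.default_rng(0)

def select_parents(rmse_values, n_parents, sp=1.8):
    """Linear-ranking fitness (Eq. 16) followed by roulette wheel
    selection (Eq. 15); `sp` plays the role of `max`."""
    p = len(rmse_values)
    ranks = np.empty(p)
    # sort by RMSE descending: worst individual gets rank 1, best gets rank p
    ranks[np.argsort(rmse_values)[::-1]] = np.arange(1, p + 1)
    fitness = 2 - sp + 2 * (sp - 1) * (ranks - 1) / (p - 1)   # Eq. 16
    probs = fitness / fitness.sum()                           # Eq. 15
    return rng.choice(p, size=n_parents, p=probs)             # parent indices
```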
Reproduction
Once a number of parents have been selected, they breed individuals for the next generation using the genetic operators crossover and mutation, as shown in Fig. 4. In particular, crossover swaps contiguous parts of the genes of two individuals. In this work, the single-point crossover approach is adopted: a crossover point is chosen, and all genes beyond that point are swapped between the two parent chromosomes to generate two children. Note that the crossover point can only lie between genes that represent two different fuzzy sets, such that all the fuzzy sets remain valid and convex throughout the reproduction process.
The second genetic operator, mutation, is used to maintain genetic diversity from one generation to the next, analogous to biological mutation. Mutation alters one or more gene values in a chromosome from their initial states. A pre-defined mutation rate is used to control the frequency of mutations. In this work, in order to ensure that the resulting fuzzy sets are valid and convex, the constraint \(a_{sr}^1< a_{sr}^2 < a_{sr}^3\) is applied to the genes representing each fuzzy set during the mutation operation. The newly bred individuals and some of the best individuals in the current generation \(\mathbb {P}\) jointly form the next generation of the population (\(\mathbb {P}^{'}\)).
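The two operators can be sketched as follows for individuals stored as arrays of (a1, a2, a3) triples: cutting only between whole triples realises the restriction on the crossover point, and re-sorting each mutated triple enforces the validity constraint. The mutation rate and noise scale are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(parent_a, parent_b):
    """Single-point crossover between individuals of shape (n_sets, 3);
    cutting between whole rows keeps every fuzzy set triple intact."""
    point = rng.integers(1, len(parent_a))        # boundary between fuzzy sets
    child_a = np.vstack([parent_a[:point], parent_b[point:]])
    child_b = np.vstack([parent_b[:point], parent_a[point:]])
    return child_a, child_b

def mutate(individual, rate=0.05, scale=0.05):
    """Perturb each gene with probability `rate`, then re-sort each triple
    so that a1 <= a2 <= a3 still holds after mutation."""
    mask = rng.random(individual.shape) < rate
    mutated = individual + mask * rng.normal(0.0, scale, individual.shape)
    return np.sort(mutated, axis=1)
```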
Iteration and termination
The selection and reproduction processes are iterated until the pre-defined maximum number of iterations is reached or the objective function value of an individual falls below a pre-specified threshold. When the termination condition is satisfied, the fittest individual in the current population is taken as the final solution.
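Putting the pieces together, a minimal sketch of the overall optimisation loop, reusing the helper sketches above, might look as follows; the maximum number of generations, the RMSE threshold and the elite count are all assumed parameters.

```python
import numpy as np

def optimise(population, evaluate, max_generations=200, threshold=1e-3,
             elite=2):
    """GA main loop sketch (selection, reproduction, elitism, termination).
    `evaluate` returns the RMSE of Eq. 14 for one individual."""
    for _ in range(max_generations):
        scores = np.array([evaluate(ind) for ind in population])
        order = np.argsort(scores)                 # best (lowest RMSE) first
        if scores[order[0]] < threshold:           # good-enough solution found
            break
        parents = select_parents(scores, len(population) - elite)
        offspring = []
        for a, b in zip(parents[0::2], parents[1::2]):
            for child in crossover(population[a], population[b]):
                offspring.append(mutate(child))
        # elitism: the best individuals survive into the next generation
        population = [population[i] for i in order[:elite]] + offspring
    scores = [evaluate(ind) for ind in population]
    return population[int(np.argmin(scores))]      # fittest individual
```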