1 Introduction

Fuzzy inference mechanisms are built upon fuzzy logic to map system inputs and outputs. A typical fuzzy inference system consists of a rule base and an inference engine. A number of inference engines have been developed, with the Mamdani inference (Mamdani (1977)) and the TSK inference (Takagi and Sugeno (1985)) being the most widely used. The Mamdani fuzzy model is more intuitive and suitable for handling linguistic inputs; its outputs are usually fuzzy sets, and thus, a defuzzification process is often required. In contrast, the TSK inference approach produces crisp outputs directly, as TSK fuzzy rules use polynomials as rule consequences. There are generally two types of rule bases used to support the two fuzzy inference engines, which are Mamdani-style rule bases and TSK-style rule bases accordingly.

A rule base, Mamdani-style or TSK-style, can either be translated from expert knowledge or extracted from data. The rule base produced by a knowledge-driven approach is therefore essentially a representation of the human experts’ knowledge in the format of fuzzy rules (Negnevitsky (2005)). For this approach to be applicable, the problem has to be comprehensible to, and fully understood by, human experts, who express it as linguistic rules; these are then interpreted as fuzzy rules by specifying the membership functions of the linguistic words. Recognising that expert knowledge may not always be available, data-driven approaches were proposed, which extract fuzzy rules from a set of training data using machine learning approaches (Rezaee and Zarandi (2010)). Both the Mamdani and TSK inference approaches are only workable with dense rule bases, i.e. rule bases that each cover the entire input domain.

Fuzzy interpolation relaxes the requirement of dense rule bases from conventional fuzzy inference systems (Kóczy and Hirota (1993); Yang et al. (2017c)). When a given observation does not overlap with any rule antecedent, certain conclusions can still be generated by means of interpolation. In addition, fuzzy interpolation helps in system complexity reduction by removing the rules that can be approximated by their neighbours. A number of fuzzy interpolation approaches have been developed, such as Chen and Hsin (2015), Huang and Shen (2006, 2008), Kóczy and Hirota (1997), Naik et al. (2014), Yang and Shen (2011), Yang et al. (2017a), Shen and Yang (2011) and Yang and Shen (2013). However, all these existing fuzzy interpolation approaches were extensions of the Mamdani inference.

A novel fuzzy interpolation approach, which extends the TSK inference, is presented in this paper. The proposed fuzzy inference engine is workable with sparse, dense or imbalanced TSK-style rule bases, which is a further development on the seminal work of Li et al. (2017). In addition, a data-driven TSK-style rule base generation approach is also proposed to extract compact and concise rule bases from incomplete, imbalanced, normal or over-dense data sets. Note that sparse and imbalanced data sets are still commonly seen, regardless of the magnitude of the data sets in the era of big data. The proposed approach has been applied to two benchmark problems and a real-world problem in the field of cyber security. The experimentation demonstrated the power of the proposed approach in enhancing the conventional TSK inference method by means of broader applicability and better system efficiency, and competitive performance in reference to other machine learning approaches.

The rest of the paper is organised as follows. Section 2 introduces the theoretical underpinnings of the TSK fuzzy inference model and the TSK rule base generation approaches. Section 3 presents the extended TSK system, and Sect. 4 discusses the proposed rule base generation approach. Section 5 details the experimentation for demonstration and validation. Section 6 concludes the paper and suggests probable future developments.

2 Background

The conventional TSK inference system and rule base generation approaches are briefly reviewed in this section.

2.1 TSK inference

Suppose a TSK-style fuzzy rule base comprises n rules, each with m antecedents:

$$\begin{aligned}&R_1:\ \mathbf{IF}\ x_1\text { is } A_{11} \text { and }\ldots \text {and } x_m\text { is } A_{m1} \nonumber \\&\quad \mathbf{THEN}\ y= f_1(x_1,\ldots ,x_m) \nonumber \\&\qquad \qquad =\beta _{01}+\beta _{11}x_1+\ldots + \beta _{m1}x_m,\nonumber \\&\ldots \nonumber \\&R_n:\ \mathbf{IF}\ x_1\text { is } A_{1n} \text { and }\ldots \text {and } x_m\text { is } A_{mn} \nonumber \\&\quad \mathbf{THEN}\ y= f_n(x_1,\ldots ,x_m) \nonumber \\&\qquad \qquad =\beta _{0n}+\beta _{1n}x_1+\ldots + \beta _{mn}x_m, \end{aligned}$$

where \(\beta _{0r}\) and \(\beta _{sr}\) (\(r\in \{1,2,\ldots ,n\}\) and \(s\in \{1,2,\ldots ,m\}\)) are constant parameters of the linear functions of the rule consequences. The consequence polynomials degenerate to constants \(\beta _{0r}\) when the outputs are discrete crisp numbers (used to represent symbolic values). Given an input vector \((A_1^*,\ldots ,A_m^*)\), the TSK engine performs inference in the following steps:

  • 1 Determine the firing strength of each rule \(R_r\) (\(r\in \{1,2,\ldots ,n\}\)) by integrating the similarity degrees between its antecedents and the given inputs:

    $$\begin{aligned} \alpha _r = S(A_{1}^*,A_{1r}) \wedge \ldots \wedge S(A_m^*,A_{mr}), \end{aligned}$$

    where \(\wedge \) is a t-norm usually implemented as a minimum operator, and \(S(A_s^*,A_{sr})\) (\(s\in \{1,2,\ldots , m\}\)) is the similarity degree between fuzzy sets \(A_s^*\) and \(A_{sr}\):

    $$\begin{aligned} S(A_s^*,A_{sr}) = max\{min\{\mu _{A_s^*}(x),\mu _{A_{sr}}(x)\}\}, \end{aligned}$$

    where \(\mu _{A_{s}^*}(x) \text { and } \mu _{A_{sr}}(x)\) are the degrees of membership for a given value x within the domain.

  • 2 Calculate the sub-output led by each rule \(R_r\) based on the given observation (\(A_1^*,\ldots ,A_m^*\)):

    $$\begin{aligned} \begin{aligned}&f_r(x_1^*,\ldots ,x_m^*) \\&\quad = \beta _{0r}+\beta _{1r}Rep(A_1^*)+ \cdots + \beta _{mr}Rep(A_m^*), \end{aligned} \end{aligned}$$

    where \(Rep(A^*_s)\) is the representative value or defuzzified value of fuzzy set \(A^*_s\), which is often calculated as the centre of area of the membership function.

  • 3 Generate the final output by integrating all the sub-outputs from all the rules:

    $$\begin{aligned} y=\frac{\displaystyle \sum \nolimits _{r=1}^{n}\alpha _r f_r(x_1^*,\ldots ,x_m^*)}{\displaystyle \sum \nolimits _{r=1}^{n}\alpha _r}. \end{aligned}$$

It is clear from Eq. 3 that the firing strength will be 0 if a given input vector does not overlap with any rule antecedent. In this case, no rule will be fired and the conventional TSK approach will fail.
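The three inference steps, and the failure case just described, can be illustrated with a minimal sketch. It assumes crisp (singleton) inputs, for which Eq. 3 reduces to the membership degree of the input value in the antecedent and \(Rep(A_s^*)\) is the input value itself; the names `membership` and `tsk_infer` and the two toy rules are illustrative assumptions, not part of the original system.

```python
from typing import List, Optional, Sequence, Tuple

Triangle = Tuple[float, float, float]               # (a1, a2, a3)
Rule = Tuple[Sequence[Triangle], Sequence[float]]   # (antecedents, [b0, b1, ..., bm])


def membership(x: float, tri: Triangle) -> float:
    """Membership degree of a crisp value x in a triangular fuzzy set."""
    a1, a2, a3 = tri
    if x == a2:
        return 1.0
    if a1 < x < a2:
        return (x - a1) / (a2 - a1)
    if a2 < x < a3:
        return (a3 - x) / (a3 - a2)
    return 0.0


def tsk_infer(rules: List[Rule], x: Sequence[float]) -> Optional[float]:
    """Conventional TSK inference for a crisp input vector x."""
    num = den = 0.0
    for antecedents, betas in rules:
        # Eq. 2 with the minimum t-norm; Eq. 3 reduces to membership for crisp inputs.
        alpha = min(membership(xs, a) for xs, a in zip(x, antecedents))
        # Eq. 4: first-order consequence evaluated at the input.
        f_r = betas[0] + sum(b * xs for b, xs in zip(betas[1:], x))
        num += alpha * f_r
        den += alpha
    # Eq. 5; None signals that no rule fired.
    return num / den if den > 0.0 else None


rules = [
    ([(0.0, 1.0, 2.0)], [0.0, 1.0]),    # IF x1 is "about 1" THEN y = x1
    ([(4.0, 5.0, 6.0)], [10.0, 0.0]),   # IF x1 is "about 5" THEN y = 10
]
print(tsk_infer(rules, [1.5]))  # only the first rule fires
print(tsk_infer(rules, [3.0]))  # falls in the gap between the two antecedents
```

The second call lands in the gap between the two antecedents, so every firing strength is 0 and the weighted average in Eq. 5 is undefined.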

2.2 TSK rule base generation

The antecedent variables of a TSK rule are represented as fuzzy sets and the consequence is represented by a linear polynomial function, as shown in Eq. 1. Data-driven fuzzy rule extraction approaches first partition the problem domain into multiple regions or rule clusters. Then, each region is represented by a TSK rule. Various clustering algorithms, such as K-Means (MacQueen (1967)) and its variations, can be used to divide the problem domain into sub-regions or rule clusters (Chen and Linkens (2004); Rezaee and Zarandi (2010)).

Given a rule cluster that contains a set of multi-dimensional data instances, a typical TSK fuzzy rule extraction process is performed in two steps: (1) rule antecedent determination and (2) consequent polynomial determination (Nandi and Klawonn (2007)). The rule antecedent determination process prescribes a fuzzy set, such as a Gaussian membership function (Chen and Linkens (2004)), to represent the information of the cluster on each dimension. The consequent polynomial can usually be determined by linear regression (Nandi and Klawonn (2007); Rezaee and Zarandi (2010)). Once each rule cluster has been expressed by a TSK rule, the TSK rule base can be assembled by combining all the extracted rules.

Note that a 0-order TSK fuzzy model is required if only symbolic labels are included in the output of the data set (Kerk et al. (2016)). In this case, the step of consequent polynomial determination is usually omitted. Instead, discrete crisp numbers are typically used to represent the symbolic output values. Accordingly, the rule base generation process differs: the labelled data set is first divided into multiple sub-data sets, each sharing the same label. Then, a clustering algorithm is applied to each sub-data set to generate rule clusters. Finally, each rule cluster is represented as the antecedents of a rule, with an integer number used as the 0-order consequence representing the label.

3 TSK inference with fuzzy interpolation (TSK+)

The conventional TSK fuzzy inference system is extended in this section by allowing the interpolation and extrapolation of inference results. The extended system, termed the TSK+ inference system, is thus workable with sparse, dense and imbalanced rule bases.

3.1 Modified similarity measure

Conventional TSK will fail if a given input does not overlap any rule antecedent in the rule base. This can be addressed using fuzzy interpolation such that the inference consequence can be approximated from the neighbouring rules of the given input. In order to enable this, the measure of firing strength used in the conventional TSK inference is modified based on a revised similarity measure proposed in Chen and Chen (2003). In particular, the similarity measure proposed in Chen and Chen (2003) is sensitive to distance in addition to membership functions. This similarity measure is further extended in this subsection such that its sensitivity to distance is flexible and configurable to support the development of the TSK+ inference engine.

It has been proven in the literature that different types of membership functions do not pose a significant difference in inference results if the membership functions are properly fine-tuned (Chang and Fan (2008)). Based on this, only triangular membership functions are used in this work for computational efficiency. Given two triangular fuzzy sets \(A=(a_1,a_2,a_3)\) and \(A'=(a_1',a_2',a_3')\) in a normalised variable domain, their similarity degree \(S(A,A^{'})\) can be calculated as (Chen and Chen (2003)):

$$\begin{aligned} \begin{aligned} S(A,A') = \left( 1- \frac{\displaystyle \sum \nolimits _{i=1}^3{|a_i-a_i'|}}{3} \right) . \\ \end{aligned} \end{aligned}$$

Equation 6 is extended in this work by introducing a configurable parameter as:

$$\begin{aligned} \begin{aligned} S(A,A') = \left( 1- \frac{\displaystyle \sum \nolimits _{i=1}^3{|a_i-a_i'|}}{3} \right) \cdot d, \\ \end{aligned} \end{aligned}$$

where d, termed as \(distance\ factor\), is a function of the distance between the two concerned fuzzy sets:

$$ \begin{aligned} d=\left\{ \begin{matrix} 1&{}; &{} \begin{matrix} a_1=a_2=a_3 \\ \& \ a_1^{'}=a_2^{'}=a_3^{'} \\ \end{matrix}\\ \\ 1 - \frac{1}{1+e^{(-s \cdot \Vert A,A'\Vert +5)}}&{}; &{} \text {otherwise}, \end{matrix}\right. \end{aligned}$$

where \(\Vert A,A'\Vert \) represents the distance between the two fuzzy sets, usually defined as the Euclidean distance between their representative values, and s (\(s > 0\)) is an adjustable sensitivity factor. According to Eq. 8, a larger value of s makes d decay faster as the distance grows, and therefore leads to a similarity degree which is more sensitive to the distance between the two fuzzy sets. The constant 5 in the equation ensures that the distance factor is approximately 1 (\(1-1/(1+e^{5}) \approx 0.993\)) when the distance between two given fuzzy sets is 0 (i.e. the two fuzzy sets have the same representative values). According to Eq. 8, the distance factor is not considered when fuzzy sets \(A \text { and } A'\) are both crisp. This is because, when the objects are fuzzy sets, their shapes contribute to the distance factor through the representative values, whereas there is no shape to consider when the objects are crisp numbers (Johanyák and Kovács (2005)).

The modified similarity measure \(S(A,A')\) between fuzzy sets A and \(A'\) has the following properties:

  1. A larger value of \(S(A,A')\) represents a higher similarity degree between fuzzy sets A and \(A'\);

  2. \(S(A,A') = 1\) if and only if fuzzy sets A and \(A'\) are identical;

  3. \(S(A,A') > 0\) unless (\(a_1=a_2=a_3=0\) and \(a_1'=a_2' =a_3' =1\)) or (\(a_1=a_2=a_3=1\) and \(a_1'=a_2'=a_3' =0\)).
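Equations 7 and 8 can be evaluated directly, as in the minimal sketch below. It assumes a normalised domain, takes the representative value as the centre of area of the triangle (the mean of its three points) and measures \(\Vert A,A'\Vert \) as the absolute difference of the representative values; the function names and the default `s = 20.0` are illustrative choices rather than values from this work.

```python
import math
from typing import Tuple

Triangle = Tuple[float, float, float]  # (a1, a2, a3) in a normalised domain


def rep(tri: Triangle) -> float:
    """Representative value: centre of area of a triangular fuzzy set."""
    return sum(tri) / 3.0


def distance_factor(a: Triangle, b: Triangle, s: float) -> float:
    """Eq. 8: equal to 1 for two crisp values, a sigmoid of the distance otherwise."""
    if a[0] == a[1] == a[2] and b[0] == b[1] == b[2]:
        return 1.0
    dist = abs(rep(a) - rep(b))
    return 1.0 - 1.0 / (1.0 + math.exp(-s * dist + 5.0))


def similarity(a: Triangle, b: Triangle, s: float = 20.0) -> float:
    """Eq. 7: shape term of Eq. 6 multiplied by the distance factor d."""
    shape = 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / 3.0
    return shape * distance_factor(a, b, s)


near = ((0.1, 0.2, 0.3), (0.2, 0.3, 0.4))
far = ((0.1, 0.2, 0.3), (0.6, 0.7, 0.8))
for s in (5.0, 10.0, 20.0):
    print(s, similarity(*near, s=s), similarity(*far, s=s))
```

Printing the similarity of a near pair and a far pair for several values of s gives a quick way to choose a sensitivity factor for a particular data set.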

3.2 Extended TSK inference

Given a rule base as specified in Eq. 1 and an input vector \((A_1^*,\ldots ,A_m^*)\), TSK+ performs inference using the same steps as those detailed in Sect. 2.1, except that Eq. 3 is replaced by Eq. 7. According to the third property of the modified similarity measure discussed above, \(S(A_s^{*},A_{sr}) >0\) unless \(A_s^*\) and \(A_{sr}\) take boundary crisp values 0 and 1. This means the firing strength of any rule \(R_r\) is always greater than 0, i.e. \(\alpha _r >0\), except for the special case when only boundary crisp values are involved. As a result, every rule in the rule base contributes to the final inference result to a certain degree. Therefore, even if the given observation does not overlap with any rule antecedent in the rule base, an inference result can still be generated, which significantly improves the applicability of the conventional TSK inference system.
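The effect of swapping Eq. 3 for Eq. 7 can be seen in a small sketch of the extended inference on a deliberately sparse rule base. The two rules, the value `s = 10.0` and all function names are assumptions made for illustration; the observation lies between the two antecedents and overlaps neither.

```python
import math
from typing import List, Sequence, Tuple

Triangle = Tuple[float, float, float]
Rule = Tuple[Sequence[Triangle], Sequence[float]]  # (antecedents, [b0, b1, ..., bm])


def rep(tri: Triangle) -> float:
    return sum(tri) / 3.0


def similarity(a: Triangle, b: Triangle, s: float) -> float:
    """Eq. 7 with the distance factor of Eq. 8."""
    shape = 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / 3.0
    if a[0] == a[1] == a[2] and b[0] == b[1] == b[2]:
        return shape
    dist = abs(rep(a) - rep(b))
    return shape * (1.0 - 1.0 / (1.0 + math.exp(-s * dist + 5.0)))


def tsk_plus_infer(rules: List[Rule], obs: Sequence[Triangle], s: float = 10.0) -> float:
    reps = [rep(a) for a in obs]
    num = den = 0.0
    for antecedents, betas in rules:
        # Eq. 2, with Eq. 7 in place of Eq. 3.
        alpha = min(similarity(o, a, s) for o, a in zip(obs, antecedents))
        # Eq. 4 evaluated at the representative values of the observation.
        f_r = betas[0] + sum(b * r for b, r in zip(betas[1:], reps))
        num += alpha * f_r
        den += alpha
    if den == 0.0:
        raise ValueError("all firing strengths are 0 (boundary crisp case)")
    return num / den  # Eq. 5


sparse_rules = [
    ([(0.0, 0.1, 0.2)], [1.0, 0.0]),  # IF x1 is "low"  THEN y = 1
    ([(0.8, 0.9, 1.0)], [3.0, 0.0]),  # IF x1 is "high" THEN y = 3
]
print(tsk_plus_infer(sparse_rules, [(0.5, 0.5, 0.5)]))  # midway between the rules
print(tsk_plus_infer(sparse_rules, [(0.3, 0.3, 0.3)]))  # closer to the "low" rule
```

Because both firing strengths are positive, the result is an interpolation between the two consequences, weighted towards the nearer rule.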

Fig. 1 TSK+ rule base generation

4 Sparse TSK rule base generation

A data-driven TSK-style rule base generation approach for the proposed TSK+ inference engine is presented in this section, which is outlined in Fig. 1. Given a data set \(\mathbb {T}\) which might be sparse, unevenly distributed, or dense, the system firstly groups the data instances into clusters using certain clustering algorithms. Then, each cluster is expressed as a TSK rule by employing linear regression. From this, an initial rule base is generated by combining all the extracted rules. Finally, the initialised rule base is optimised by applying the genetic algorithm (GA), which fine-tunes the membership functions of fuzzy sets in the rule base.

4.1 Rule base initialisation

Centroid-based clustering algorithms are traditionally employed in TSK fuzzy modelling to group similar objects together for rule extraction (Chen and Linkens (2004)), which is also the case in this work. However, differing from existing TSK-style rule base generation approaches, the proposed system is workable with dense data sets, sparse data sets and unevenly distributed data sets. Therefore, a two-level clustering scheme is applied in this work. The first level of clustering divides the given (dense/sparse) data set into multiple sub-data sets using the sparse K-Means clustering algorithm (Witten and Tibshirani (2010)). Based on the feature of the sparse K-Means clustering, those divided sub-data sets are generally considered to be dense. The second level of clustering is applied to each obtained dense sub-data set to generate rule clusters for TSK fuzzy rule extraction by employing the standard K-Means clustering algorithm (MacQueen (1967)). Note that the number of clusters has to be pre-defined for both sparse K-Means and the standard K-Means, which is discussed first below.

4.1.1 Number of clusters determination

A number of approaches have been proposed in the literature to determine the value of k, such as the Elbow method, cross-validation, the Bayesian Information Criterion (BIC)-based approach, and the Silhouette-based approach (Kodinariya and Makwana (2013)). In particular, the Elbow method is fast and effective, and is therefore employed in this work. This approach determines the number of clusters based on the criterion that adding another cluster does not lead to a much better modelling result with respect to a given objective function. For instance, for a given problem, the relationship between performance improvement and the value of k is shown in Fig. 2. The value of k in this case can be determined as 4, which is the obvious turning point (or the Elbow point).
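One simple way to automate the Elbow criterion is sketched below. The stopping rule (the improvement from adding one more cluster drops below a fraction `tol` of the total improvement) is an assumption chosen for illustration and is not necessarily the definition of performance improvement used in this work; the function name `elbow_k` and the SSE values are likewise hypothetical.

```python
from typing import Sequence


def elbow_k(sse: Sequence[float], tol: float = 0.1) -> int:
    """Pick k from SSE values, where sse[i] was obtained with k = i + 1 clusters.

    Returns the smallest k for which adding one more cluster improves the SSE
    by less than tol times the total improvement over the tested range.
    """
    if len(sse) < 2:
        return 1
    total = sse[0] - sse[-1]
    if total <= 0.0:
        return 1
    for i in range(1, len(sse)):
        gain = sse[i - 1] - sse[i]   # improvement from k = i to k = i + 1
        if gain < tol * total:
            return i                 # k = i is the elbow
    return len(sse)


# Hypothetical SSE values for k = 1, ..., 7 (not taken from the experiments).
sse_values = [100.0, 60.0, 35.0, 20.0, 18.0, 17.0, 16.5]
print(elbow_k(sse_values))
```

In practice the threshold would need tuning per data set, and a visual check of the curve remains a sensible safeguard.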

Fig. 2 Determination of k using the Elbow method

4.1.2 Dense sub-data set generation

Sparse K-Means is an extension of the standard K-Means for handling sparse data sets (Witten and Tibshirani (2010)). Assuming k clusters are required, sparse K-Means also starts with the initialisation of k centroids (usually randomly). This is followed by the assignment of data instances to centroids and the updating of centroids based on the assignments, which are iterated until there is no change on the assignments. Different from the standard K-Means which assigns objects with the goal of minimising the within-cluster sum of squares error (SSE), the sparse K-Means assigns objects by maximising the between-cluster sum of squares error (BCSS), which is defined as:

$$\begin{aligned} BCSS = \sum _{q=1}^{m+1} \left( \sum _{t=1}^{p} (x_{tq} - \mu _q)^2 - SSE\right) , \end{aligned}$$

where p is the total number of data instances in the given data set, m is the number of input features in the given data set, \(\mu _q\) is the mean of all the elements on the qth feature, and \(x_{tq}\) is the qth feature of the tth data point in the given data set.

The within-cluster sum of squares error SSE is defined as:

$$\begin{aligned} SSE=\sum _{j=1}^{k}\sum _{t=1}^{p_{j}}\left( \parallel x_{jt} - v_j \parallel \right) ^{2}, \end{aligned}$$

where k is the number of clusters determined by the Elbow approach, \(p_j\) is the number of data instances in the jth cluster, \(x_{jt}\) is the tth data point in the jth cluster, \(v_j\) is the jth cluster centre, and \(\parallel x_{jt} - v_j \parallel \) is the Euclidean distance between \(x_{jt}\) and \(v_j\). Note that if the labels in a given data set are symbolic values, only 0-order TSK rules are required and thus Eq. 9 becomes:

$$\begin{aligned} BCSS = \sum _{q=1}^{m} \left( \sum _{t=1}^{p} (x_{tq} - \mu _q)^2 - SSE\right) . \end{aligned}$$
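The two quantities can be computed for a given partition as in the sketch below. It assumes that the SSE term inside the sum of Eq. 9 is taken feature by feature, in which case summing over the features gives the total sum of squares minus the SSE of Eq. 10; it also ignores the feature weighting that sparse K-Means optimises. The helper names and the toy clusters are illustrative.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(clusters: Sequence[Sequence[Point]]) -> float:
    """Eq. 10: squared Euclidean distance of every point to its cluster centre."""
    total = 0.0
    for members in clusters:
        v = centroid(members)
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, v))
    return total


def bcss(clusters: Sequence[Sequence[Point]]) -> float:
    """Between-cluster sum of squares: total sum of squares minus SSE."""
    all_points = [p for members in clusters for p in members]
    mu = centroid(all_points)
    tss = sum((x - m) ** 2 for p in all_points for x, m in zip(p, mu))
    return tss - sse(clusters)


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (4.0, 2.0)],
]
print(sse(clusters), bcss(clusters))
```

For first-order rules each point would carry the output value as an extra coordinate, matching the upper limit m + 1 in Eq. 9; for 0-order rules only the m input features are included, as in Eq. 11.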

4.1.3 Rule cluster generation

Once the given training data set \(\mathbb {T}\) has been divided into k dense sub-data sets, K-Means is applied to each determined sub-data set \(T_i\) (\(1\le i \le k\)) to generate rule clusters, each representing a rule. Assume that \(k_i\) clusters are required for a sub-data set \(T_i\). K-Means is initialised with \(k_i\) random cluster centroids. It then assigns every data instance to one cluster by minimising the SSE:

$$\begin{aligned} SSE=\sum _{j=1}^{k_i}\sum _{t=1}^{p_{j}^i}\left( \parallel x_{jt}^i - v_j^i \parallel \right) ^{2}, \end{aligned}$$

where \(p_{j}^i\) is the number of data points in the jth cluster of the sub-data set \(T_i\), \(x_{jt}^i\) is the tth data point in the jth cluster in the sub-data set \(T_i\), \(v_j^i\) is the centre of the jth cluster in the sub-data set \(T_i\), and \(\parallel x_{jt}^i - v_j^i \parallel \) is the Euclidean distance between \(x_{jt}^i\) and \(v_j^i\). Once all the data instances are assigned, the algorithm updates the cluster centroids according to the newly assigned members. These two steps are iterated until there is no change in object assignments. After K-Means is applied, the given training data set \(\mathbb {T}\) is divided into \(n = \sum _{i=1}^{k}k_i\) clusters. For simplicity, the generated rule clusters are jointly represented as \(\{RC_1, RC_2, \ldots , RC_n\}\).
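The assignment and update steps described above can be sketched as follows. For reproducibility the sketch takes the initial centroids as an argument instead of drawing them at random; the function name `kmeans`, the iteration cap and the four sample points are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], init: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Standard K-Means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in init]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change in object assignments
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centres = kmeans(data, init=[(0.0, 0.0), (6.0, 5.0)])
print(labels, centres)
```

Running this once per sub-data set \(T_i\) with its own \(k_i\) yields the rule clusters that are pooled into \(\{RC_1, \ldots , RC_n\}\).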

4.1.4 Fuzzy rule extraction

Each determined cluster from the above steps is utilised to form one TSK fuzzy rule. A number of approaches have been proposed to use a Gaussian membership function to represent a cluster, such as Rezaee and Zarandi (2010). However, given the fact that most of the real-world data are not normally distributed, the cluster centroid usually is not identical with the centre of the Gaussian membership function, and thus, Gaussian membership functions may not be able to accurately represent the distribution of the calculated clusters. In order to prevent this and also keep computational efficiency as stated in Sect. 3, triangular membership functions are utilised in this work.

Suppose that a data set has m input features and a single-output feature. Given a rule cluster \(RC_r\) \((1\le r \le n)\), a TSK fuzzy rule \(R_r\) can be extracted from the cluster as follows:

$$\begin{aligned}&R_r:\ \mathbf{IF}\ x_1\ \text {is}\ A_{1r}\text { and } \ldots \text { and } x_m\ \text {is}\ A_{mr}\nonumber \\&\qquad \mathbf{THEN}\ y=f_r(x_1, \ldots ,x_m) . \end{aligned}$$

Without loss of generality, take the sth dimension (\(1 \le s\le m\)) of rule cluster \(RC_r\) as an example, denoted as \(RC_r^s\). Suppose that \(RC_r^s\) has \(p_r\) elements, i.e. \(RC_r^s = \{x_s^1,\) \( x_s^2, \ldots ,x_s^{p_r}\}\). As only triangular fuzzy sets are used in this work, fuzzy set \(A_{sr}\) can be precisely represented as \((a^1_{sr}, a^2_{sr}, a^3_{sr})\). The core of the triangular fuzzy set is set as the cluster centroid, that is \(a^2_{sr} = \sum _{q=1}^{p_r} x_s^q/p_r\); and the support of the fuzzy set is set as the span of the cluster, i.e. \((a^1_{sr}, a^3_{sr}) = (min\{x^1_s, x^2_s, \ldots , x_s^{p_r}\},\) \(max\{x^1_s, x^2_s, \ldots , x^{p_r}_s\})\).
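The antecedent construction just described amounts to taking the minimum, mean and maximum of the cluster on each input dimension, as the short sketch below shows. The function names and the three-point cluster are illustrative; the cluster is assumed to hold input features only.

```python
from typing import List, Sequence, Tuple

Triangle = Tuple[float, float, float]


def triangle_from_values(values: Sequence[float]) -> Triangle:
    """(a1, a2, a3) = (minimum, mean, maximum) of one dimension of a cluster."""
    return (min(values), sum(values) / len(values), max(values))


def antecedents_from_cluster(cluster: Sequence[Sequence[float]]) -> List[Triangle]:
    """One triangular fuzzy set per input dimension of the rule cluster."""
    return [triangle_from_values(column) for column in zip(*cluster)]


# A hypothetical rule cluster with two input features.
rule_cluster = [(1.0, 10.0), (2.0, 14.0), (6.0, 12.0)]
print(antecedents_from_cluster(rule_cluster))
```

Note that the core sits at the mean rather than the midpoint of the support, so the resulting triangles are generally asymmetric.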

First-order polynomials are typically used as the consequences of TSK fuzzy rules. That is, \(y =\beta _{0r}+\beta _{1r}x_1 + \cdots +\beta _{mr}x_m\), where the parameters \(\beta _{0r}\) and \(\beta _{sr},\ s\in \{1,2,\ldots ,m\}\) are estimated using a linear regression approach. Locally weighted regression (LWR) is adopted in this work, due to its ability to generate an independent model that is only related to the given cluster of data in the training data set (Nandi and Klawonn (2007); Rezaee and Zarandi (2010)). The rule consequence degenerates to 0-order if the values in the output dimension are discrete integer numbers. From this, the raw rule base is initialised by combining all the extracted rules, which takes the form of Eq. 1.

4.2 Rule base optimisation

The generated raw rule base is optimised in this section by fine-tuning the membership functions using the general optimisation searching algorithm, genetic algorithm (GA). GA has been successfully utilised in rule base optimisation, such as Mucientes et al. (2009) and Tan et al. (2016). Briefly, GA is an adaptive heuristic search algorithm for solving both constrained and unconstrained optimisation problems based on evolutionary ideas of natural selection process that mimics biological evolution. The algorithm firstly initialises the population with random individuals. It then selects a number of individuals for reproduction by applying the genetic operators. The offspring and some of the selected existing individuals jointly form the next generation. The algorithm repeats this process until a satisfactory solution is generated or a maximum number of generations has been reached.

4.2.1 Problem representation

Assume that an initialised TSK rule base is comprised of n rules as expressed in Eq. 1. A chromosome or individual, denoted as \(\textit{I}\), in the GA is used to represent a potential solution, which is designed to represent the rule base in this proposed system, as illustrated in Fig. 3.

Fig. 3 Chromosome encoding

4.2.2 Population initialisation

The initial population \(\mathbb {P}= \{\textit{I}_1, \textit{I}_2,\ldots ,\textit{I}_{|\mathbb {P}|} \}\) is formed by taking the initialised rule base and its random variations. In order to guarantee that all the varied fuzzy sets are valid and convex, the constraint \(a_{sr}^{1}< a_{sr}^{2} < a_{sr}^{3}\) is applied to the genes representing each fuzzy set. The size of the population \(|\mathbb {P}|\) is a problem-specific adjustable parameter, typically ranging from tens to thousands, with 20–30 being used most commonly (Naik et al. (2014)).

4.2.3 Objective function

An objective function is used in the GA to determine the quality of individuals. The objective function in this work is defined as the root mean square error (RMSE). Given a training data set \(\mathbb {T} \) and an individual \(I_i\), \(1\le i \le | \mathbb {P} |\), the RMSE value can be calculated as:

$$\begin{aligned} RMSE_i = \sqrt{\frac{\displaystyle \sum \nolimits _{j=1}^{|\mathbb {T}|}\left( z_j - \hat{z}_j\right) ^2}{|\mathbb {T}|}}, \end{aligned}$$

where \(|\mathbb {T}|\) is the size of the given training data set, \(z_j\) is the label of the jth training data instance, and \(\hat{z}_j\) represents the output value led by the proposed TSK+ inference approach. The individual with the smallest value of RMSE represents the fittest solution in the population.
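The objective function of Eq. 14 is straightforward to compute once the TSK+ outputs for the training set are available. In the sketch below the predictions are passed in as plain lists; the function name `rmse` and the sample values are illustrative assumptions.

```python
import math
from typing import Sequence


def rmse(labels: Sequence[float], predictions: Sequence[float]) -> float:
    """Eq. 14: root mean square error between labels and inferred outputs."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equally long")
    squared = sum((z - z_hat) ** 2 for z, z_hat in zip(labels, predictions))
    return math.sqrt(squared / len(labels))


# Hypothetical outputs of three individuals on the same three training instances.
targets = [1.0, 2.0, 3.0]
outputs = {"I1": [1.0, 2.0, 5.0], "I2": [1.1, 2.1, 2.9], "I3": [0.0, 0.0, 0.0]}
scores = {name: rmse(targets, pred) for name, pred in outputs.items()}
fittest = min(scores, key=scores.get)  # smallest RMSE wins
print(scores, fittest)
```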

4.2.4 Selection

A number of individuals need to be selected for reproduction, which is implemented in this work by the fitness proportionate selection method, also known as the roulette wheel selection. Assuming that \(f_i\) is the fitness of individual \(\textit{I}_i\) in the current population \(\mathbb {P}\), its probability of being selected to generate the next generation is:

$$\begin{aligned} p(I_i) = \frac{f_i}{\displaystyle \sum \nolimits _{j=1}^{|\mathbb {P}|}f_j}, \end{aligned}$$

where \(|\mathbb {P}|\) is the size of the population. The fitness value \(f_i\) of an individual \(\textit{I}_i\) in the proposed system was determined by adopting the linear-ranking algorithm (Baker (1985)) given as:

$$\begin{aligned} f_i = 2-max+\frac{2(max-1)(r_i-1)}{|\mathbb {P}|}, \end{aligned}$$

where \(r_i\) is the ranking position of individual \(I_i\) in the ordered population \(\mathbb {P}\), and max is the bias or selective pressure towards the fittest individuals in the population.
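Equations 15 and 16 can be combined into a small selection routine, sketched below. The sketch assumes that the population is ordered from worst to best, so that the individual with the largest RMSE receives rank 1 and the fittest receives rank \(|\mathbb {P}|\); the ordering direction, the value `max_pressure = 1.5` and the function names are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def linear_ranking_fitness(rmse_values: Sequence[float], max_pressure: float = 1.5) -> List[float]:
    """Eq. 16, with rank 1 given to the worst individual (largest RMSE)."""
    n = len(rmse_values)
    order = sorted(range(n), key=lambda i: rmse_values[i], reverse=True)
    fitness = [0.0] * n
    for rank, i in enumerate(order, start=1):
        fitness[i] = 2.0 - max_pressure + 2.0 * (max_pressure - 1.0) * (rank - 1) / n
    return fitness


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """Eq. 15: fitness proportionate (roulette wheel) probabilities."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(population: Sequence[str], probs: Sequence[float],
                    rng: random.Random) -> str:
    return rng.choices(population, weights=probs, k=1)[0]


population = ["I1", "I2", "I3"]
errors = [0.3, 0.1, 0.2]                      # hypothetical RMSE values
fit = linear_ranking_fitness(errors)
probs = selection_probabilities(fit)
parent = roulette_select(population, probs, random.Random(0))
print(fit, probs, parent)
```

Ranking before selection keeps the probabilities bounded regardless of how widely the raw RMSE values are spread, which is the usual motivation for linear ranking.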

4.2.5 Reproduction

Once a number of parents are selected, they breed individuals for the next generation using the genetic operators crossover and mutation, as shown in Fig. 4. In particular, crossover swaps contiguous parts of the genes of two individuals. In this work, the single-point crossover approach is adopted, which selects an index point and swaps all genes beyond that point between the two parent chromosomes to generate two children. Note that the crossover point can only lie between genes which are employed to represent two different fuzzy sets, such that all the fuzzy sets remain valid and convex throughout the reproduction process.

The second genetic operator, mutation, is used to maintain genetic diversity from one generation of the population to the next, which is analogous to biological mutation. Mutation alters one or more gene values in a chromosome from their initial state. A pre-defined mutation rate is used to control the percentage of occurrence of mutations. In this work, in order to make sure the resulting fuzzy sets are valid and convex, the constraint \(a_{sr}^1 < a_{sr}^2 < a_{sr}^3\) is applied to the genes representing each fuzzy set during the mutation operation. The newly bred individuals and some of the best individuals in the current generation \(\mathbb {P}\) jointly form the next generation of the population (\(\mathbb {P}^{'}\)).
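The two operators can be sketched as follows. The chromosome layout is an assumption: a flat list in which every three consecutive genes hold one triangular fuzzy set, with consequence parameters left out. The mutation step size `spread`, the rejection of invalid mutations and the function names are likewise illustrative choices.

```python
import random
from typing import List, Tuple

GENES_PER_SET = 3  # assumed layout: (a1, a2, a3) for each fuzzy set, back to back


def single_point_crossover(p1: List[float], p2: List[float],
                           rng: random.Random) -> Tuple[List[float], List[float]]:
    """Swap the tails of two parents, cutting only between fuzzy sets."""
    n_sets = len(p1) // GENES_PER_SET
    if n_sets < 2:
        return p1[:], p2[:]
    cut = GENES_PER_SET * rng.randint(1, n_sets - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(chromosome: List[float], rate: float, rng: random.Random,
           spread: float = 0.05) -> List[float]:
    """Perturb one gene per selected fuzzy set, keeping a1 < a2 < a3."""
    child = chromosome[:]
    for start in range(0, len(child), GENES_PER_SET):
        if rng.random() < rate:
            k = start + rng.randrange(GENES_PER_SET)
            old = child[k]
            child[k] = old + rng.uniform(-spread, spread)
            a1, a2, a3 = child[start:start + GENES_PER_SET]
            if not (a1 < a2 < a3):
                child[k] = old  # reject a mutation that breaks the constraint
    return child


parent_a = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
parent_b = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
rng = random.Random(0)
child_a, child_b = single_point_crossover(parent_a, parent_b, rng)
print(mutate(child_a, rate=0.3, rng=rng))
```

Restricting the cut point to multiples of three means every fuzzy set in a child is inherited whole from one parent, so crossover alone can never produce an invalid triangle.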

Fig. 4 Procedure of reproduction (with only one crossover or mutation operation per generation for illustration)

4.2.6 Iteration and termination

The selection and reproduction processes are iterated until the pre-defined maximum number of iterations is reached or the value of the objective function regarding an individual is less than a pre-specified threshold. When the termination condition is satisfied, the fittest individual in the current population is the optimal solution.

5 Experimentation

Two nonlinear mathematical models and a well-known real-world data set (KDD Cup 99 data set) are employed in this section to validate and evaluate the proposed system.

5.1 Experiment 1

A 3-dimensional nonlinear function has been used as a benchmark by a number of projects, including the recent ones such as Bellaaj et al. (2013); Tan et al. (2016) and Li et al. (2017), which is re-considered in this section for a comparative study. The problem is given below:

$$\begin{aligned} f(x_1,x_2) = sin \left( \frac{x_1}{\pi } \right) \cdot sin \left( \frac{x_2}{\pi } \right) , \end{aligned}$$

which takes two inputs, \(x_1\) (\(x_1 \in [10,30]\)) and \(x_2\) (\(x_2 \in [10,30]\)), and produces a single output \(y=f(x_1,x_2)\) (\(y \in [-1,1]\)), as illustrated in Fig. 5a.

Fig. 5 Mathematical model in Experiment 1. a Surface view of Eq. 17. b Training data set distribution

5.1.1 Rule base initialisation

In order to demonstrate the proposed TSK+ rule base generation approach, a sparse training data set \(\mathbb {T}\) was manually generated from Eq. 17. The data set is composed of 300 data points sparsely distributed within the \([10,30] \times [10,30]\) domain covering 57% of the input domain. The distribution of this training data set is illustrated in a 2-dimensional plane in Fig. 5b. The key steps of TSK rule base initialisation using the training data set are summarised below.

Step 1 Dense sub-data sets generation The sparse training data set \(\mathbb {T}\) was firstly divided into a number of dense sub-data sets by applying the sparse K-Means clustering algorithm. In particular, the number of sub-data sets was determined by the Elbow method as discussed in Sect. 4.1.1. The performance improvements (PIs) against the incremented number of clusters are listed in Table 1 and shown in Fig. 6. It is clear from the figure that the performance improvement decreased rapidly as k increased from 2 to 6 and flattened out thereafter. Therefore, 6 was taken as the number of clusters. The application of sparse K-Means led to 6 dense sub-data sets as demonstrated in Fig. 7.

Table 1 SSE and performance improvement

Fig. 6 Performance improvement against incremented k

Fig. 7 Result of sparse K-Means where k = 6

Step 2 Rule cluster generation Once the sparse training data set \(\mathbb {T}\) was divided into 6 dense sub-data sets (\(T_1,T_2,\ldots \), \(T_6\)), the standard K-Means clustering algorithm was employed on each determined sub-data set to group similar data points into rule clusters. The application of the Elbow approach led to a set of SSEs and PIs as listed in Table 2, which in turn determined the value of k for each sub-data set as listed in Table 3.

Table 2 SSE and PI for each sub-data set (\(k=\{1,2,\ldots ,10\}\))

Step 3 Rule base initialisation A rule was extracted from each cluster and all the extracted rules jointly initialised the rule base. The generated 28 TSK fuzzy rules are detailed in Table 4.

5.1.2 TSK fuzzy interpolation

Given any input, whether or not it overlaps with any rule antecedent, the proposed TSK+ inference engine is able to generate an output. For instance, a randomly generated testing data point was \((A_1^*=(27.37, 27.37, 27.37)\), \(A_2^*=(13.56, 13.56, 13.56))\). The proposed TSK+ inference engine firstly calculated the similarity degrees between the given input and the antecedents of every rule (\(S(A_1^*,A_{1r}), S(A_2^*,A_{2r}), r=\{1,2,\ldots ,28\}\)) using Eq. 7, with the results listed in Table 5.

From the calculated similarity degrees, the firing strength of each rule \(FS_r\) was computed using Eq. 2, and the sub-consequence from each rule was calculated using Eq. 4, which are also shown in Table 5. From this, the final output of the given input was calculated using Eq. 5 as \(y=-\,0.902\).

5.1.3 Rule base optimisation

In order to achieve the optimal performance, the generated raw rule base was fine-tuned using the GA as detailed in Sect. 4.2. The GA parameters used in this experiment are listed in Table 6. The population was initialised as the individuals representing the raw rule base and its random variations. The performance against the number of iterations is shown in Fig. 8, which clearly demonstrates the performance improvements led by the GA.

Table 3 The value of k for each sub-data set
Table 4 Generated raw TSK rule base

5.1.4 Results comparison

In order to enable a direct comparison with the approaches proposed in Bellaaj et al. (2013) and Tan et al. (2016), the proposed approach was also applied to 36 randomly generated testing data points. The sum of errors for the 36 testing data points led by the proposed approach is shown in Table 7, in addition to those led by the compared approaches. The proposed TSK+ outperformed the approaches proposed in Bellaaj et al. (2013) and Tan et al. (2016), although fewer rules were used. Another advantage of the proposed approach is that the number of rules was determined automatically by the system without the requirement of any human input. In addition, the optimisation process noticeably improved the system performance, reducing the sum of errors from 3.38 to 1.78.

5.2 Experiment 2

The proposed approach is not only able to deal with sparse or unevenly distributed data sets, but also able to work with dense data sets. In order to evaluate its ability in handling dense data sets, a two-input and single-output nonlinear mathematical model expressed in Eq. 18 was employed as a test bed, given that it has been used by Evsukoff et al. (2002) and Rezaee and Zarandi (2010).

$$\begin{aligned} f(x,y)=\frac{\sin (x)}{x} \cdot \frac{\sin (y)}{y}. \end{aligned}$$
Table 5 Sub-result from each rule and its calculation details
Table 6 GA parameters

In this case, a randomly generated dense training data set was used, comprising 1681 data points distributed within the range of \([-10,10] \times [-10,10]\). By employing the proposed approach, 14 TSK fuzzy rules in total were generated. To allow a direct comparison, the mean squared error (MSE) was used as the performance measure, following the work presented in Evsukoff et al. (2002) and Rezaee and Zarandi (2010). The detailed calculations of this experiment are omitted here. The MSE values obtained by different approaches with the specified number of rules are listed in Table 8. The results demonstrate that the proposed TSK+ system performed competitively with only 14 TSK fuzzy rules.
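For readers who wish to reproduce the test bed, the benchmark function and the MSE measure can be sketched as follows. The sketch assumes uniform random sampling over the stated range and extends the function continuously at zero, where \(\sin (x)/x\) takes its limit value 1; the function names and the seed are illustrative only.

```python
import math
import random
from typing import List, Sequence, Tuple


def sinc(v: float) -> float:
    """sin(v)/v, extended by its limit value 1 at v = 0."""
    return 1.0 if v == 0.0 else math.sin(v) / v


def f(x: float, y: float) -> float:
    """The two-input benchmark function used in Experiment 2."""
    return sinc(x) * sinc(y)


def sample_training_set(n: int = 1681, low: float = -10.0, high: float = 10.0,
                        seed: int = 0) -> List[Tuple[float, float, float]]:
    """Draw n points uniformly from [low, high] x [low, high] (assumed sampling)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x, y = rng.uniform(low, high), rng.uniform(low, high)
        points.append((x, y, f(x, y)))
    return points


def mse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error between target and predicted outputs."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
```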

5.3 Experiment 3

This section considers a well-known real-world data set NSL-KDD-99 (Tavallaee et al. (2009)), which has been widely used as a benchmark (Bostani and Sheikhan (2017); Wang et al. (2010); Yang et al. (2017b)). This data set is a modified version of the KDD Cup 99 data set generated in a military network environment. It contains 125,973 data instances with 41 attributes and 1 label which indicates the type of connection. In particular, normal connections and four types of attacks are labelled in the data set, including Denial of Service Attacks (DoS), User to Root Attacks (U2R), Remote to User Attacks (R2U) and Probes.

5.3.1 Data set pre-processing

In the work of Yang et al. (2017b), the four most important attributes were selected from the original 41 using expert knowledge in order to remove noise and redundancy. This experiment also took these four attributes as system inputs; the application of automatic feature selection and reduction approaches, such as Zheng et al. (2015), remains as future work. The four selected input features are listed in Table 9. The data set in this experiment was also normalised in an effort to reduce the potential noise in this real-world data set.

5.3.2 TSK+ model construction

As the labels are symbolic values, 0-order TSK-style fuzzy rules were used. In order to construct a 0-order TSK rule base, the training data set was divided into 5 sub-data sets based on the five symbolic labels, which are represented using five integer numbers. The sizes of the sub-data sets and their corresponding integer labels are listed in Table 10. The rule base generation process is summarised in four steps below.

Step 1 Dense sub-data set generation The sparse K-Means was applied on each sub-data set \(\mathbb {T}_j, 1\le j\le 5\) to generate dense sub-data sets. Taking the second sub-data set \(\mathbb {T}_2\) as an example, the performance improvement led by the increment of \(k_2,\ (k_2\in \{1,2,\ldots ,10\})\) is shown in Fig. 9. Following the Elbow approach, 3 was selected as the number of clusters, i.e. \(k_2 = 3\). Denote the 3 generated dense sub-sets as (\(T_{21},T_{22},T_{23}\)).
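The choice of \(k\) by the Elbow approach is often made by inspecting a plot such as Fig. 9. As a minimal sketch of how the choice could be automated, the function below picks the \(k\) whose cost lies farthest from the straight line joining the first and last points of the cost curve. This heuristic is an assumption made for illustration, and the function name `elbow_k` is hypothetical; the source does not state how the elbow was located.

```python
from typing import Sequence


def elbow_k(costs: Sequence[float]) -> int:
    """Return the elbow of a clustering cost curve.

    costs[i] is the clustering cost obtained with k = i + 1 clusters.
    The elbow is taken as the point farthest from the chord joining
    the first and last points of the curve (an assumed heuristic).
    """
    n = len(costs)
    if n < 3:
        raise ValueError("at least three cost values are required")
    dx = n - 1
    dy = costs[-1] - costs[0]
    best_k, best_dist = 1, -1.0
    for i, cost in enumerate(costs):
        # Unnormalised distance from (i, cost) to the chord via a cross product.
        dist = abs(dx * (cost - costs[0]) - dy * i)
        if dist > best_dist:
            best_k, best_dist = i + 1, dist
    return best_k
```

For example, the made-up cost sequence `[100, 40, 15, 12, 10, 9, 8, 7.5, 7, 6.8]` for \(k=1,\ldots ,10\) yields `elbow_k(...) == 3`; these numbers are illustrative and are not the values plotted in Fig. 9.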

Step 2 Rule cluster generation The standard K-Means clustering algorithm was applied on \(T_{ji}\) (\(j \in \{1,2, \ldots , 5\}\) and i ranging from 1 to the number of clusters determined using the Elbow approach), to generate rule clusters each representing a rule. Again, take \(\mathbb {T}_2\) as an example for illustration. The determined cluster numbers \(k_{2i}\) for each dense sub-data set \(T_{2i}\) are shown in the third column of Table 11. Then, 13 rule clusters were generated from the sub-data set \(\mathbb {T}_{2}\). The rule cluster generation process for other dense sub-data sets is not detailed here, but the rule clusters generated for the given training data set are summarised in Table 11.

Fig. 8 RMSE values decrease over time during optimisation

Step 3 Raw rule base generation A rule is extracted from each generated rule cluster. Taking rule cluster \(RC_{12}\) as an example, which is the first determined rule cluster in \(T_{21}\), the corresponding rule \(R_{12}\) can be extracted. In particular, the consequence of rule \(R_{12}\) is the integer number representing the class of connections, and the rule antecedents are four triangular fuzzy sets (\(A_{(12)1},\ A_{(12)2},\ A_{(12)3},\ A_{(12)4}\)) generated by the approach discussed in Sect. 4.1.4. The generated rule \(R_{12}\) is:

$$\begin{aligned}&R_{12}:\ \mathbf{IF}\ x_1 \text { is } (0.588,0.599,0.601) \nonumber \\&\quad \qquad \text {and } x_2 \text { is } (0,0.387,0.414) \nonumber \\&\quad \qquad \text {and } x_3 \text { is } (0.05,0.075,0.1) \text { and } x_4 \text { is } (0,0,0) \nonumber \\&\qquad \mathbf{THEN}\ y=2. \end{aligned}$$

According to Table 11, 46 rules in total were generated to initialise the rule base. The detailed initialised rule base can be found in Appendix I.
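As an illustration of the data structure involved, the rule \(R_{12}\) given above can be encoded as below. The class and function names are hypothetical. The membership function is included only to show how a triangular fuzzy set, including the degenerate set \((0,0,0)\), can be evaluated; TSK+ itself works with similarity degrees rather than plain membership matching.

```python
from dataclasses import dataclass
from typing import Tuple

# A triangular fuzzy set (a, b, c): support [a, c] with the peak at b.
Triangle = Tuple[float, float, float]


def membership(tri: Triangle, x: float) -> float:
    """Membership degree of x in a triangular fuzzy set."""
    a, b, c = tri
    if x == b:
        return 1.0  # also covers degenerate sets such as (0, 0, 0)
    if a < x < b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0


@dataclass(frozen=True)
class ZeroOrderRule:
    """A 0-order TSK rule: fuzzy antecedents and a constant consequence."""
    antecedents: Tuple[Triangle, ...]
    consequence: int


# Rule R_12 as printed in the text.
R12 = ZeroOrderRule(
    antecedents=(
        (0.588, 0.599, 0.601),
        (0.0, 0.387, 0.414),
        (0.05, 0.075, 0.1),
        (0.0, 0.0, 0.0),
    ),
    consequence=2,
)
```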

Table 7 Results led by approaches in experiment 1
Table 8 Results led by fuzzy models in experiment 2
Table 9 Selected input features
Table 10 Data details regarding the types of connections
Fig. 9 Performance improvement regarding incremented k

Table 11 Rules and clusters for each set of data instances with the same type of connection

Step 4 Rule base optimisation The GA was applied to optimise the membership functions of the fuzzy sets involved in the extracted fuzzy rules. The GA parameters used in Experiment 1, as listed in Table 6, were also used in this experiment, except that the number of iterations was increased to 20,000. The system performance against the number of GA iterations is shown in Fig. 10. The optimised rule base is attached in Appendix II.

Fig. 10 RMSE decreasing over time

Table 12 Performance comparison

5.3.3 TSK+ model evaluation

Once the TSK fuzzy rule base was generated, it could then be used for connection classification by the proposed TSK+ inference engine. The rule base and the inference engine TSK+ jointly formed the fuzzy model, which was validated and evaluated using a testing data set. The testing data set contains 22,544 data samples provided by Tavallaee et al. (2009). The testing data set was also extracted from the original KDD Cup 99 data set, but it does not share any data instance with the training data set NSL-KDD-99.

Note that the testing data set has been used in a number of projects with different classification approaches. In particular, decision tree, naive Bayes, back-propagation neural network (BPNN) and fuzzy clustering-artificial neural network (FC-ANN) have been employed in Wang et al. (2010), and modified optimum-path forest (MOPF) was applied in Bostani and Sheikhan (2017). The accuracy of the classification results for each class of network traffic generated by the different approaches, including the proposed one with the initialised rule base and with the optimised rule base, is listed in Table 12.

The results show that the proposed TSK+ overall outperformed all other approaches, and that the proposed rule base optimisation approach significantly improved the system performance by over 10% on average. In particular, the proposed system achieved better accuracy than all other approaches on the prediction of normal connections, DoS and R2U. It performed worse than FC-ANN and MOPF on the class U2R, and it generated a result for the class Probes similar to the best existing performance achieved by the other approaches.
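A per-class accuracy of the kind reported in Table 12 can be computed as sketched below. Two assumptions are made: the crisp output of the 0-order TSK model is mapped to the nearest integer class label, and the accuracy of a class is the fraction of testing samples of that class that are classified correctly. Both function names are hypothetical, and the source does not specify either detail.

```python
from collections import Counter
from typing import Dict, Sequence


def nearest_label(y: float, labels: Sequence[int]) -> int:
    """Map a crisp model output to the closest integer class label (assumed rule)."""
    return min(labels, key=lambda label: abs(label - y))


def per_class_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """Fraction of the samples of each true class that were predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / count for label, count in totals.items()}
```

Reporting accuracy per class in this way keeps a small class such as U2R visible, whereas a single overall accuracy would be dominated by the larger classes.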

6 Conclusion

This paper proposed a fuzzy inference system TSK+, which extended the conventional TSK inference system such that it is also applicable to sparse rule bases and unevenly distributed rule bases. This is achieved by allowing the interpolation and extrapolation of the output from all rules even if the given input does not overlap with any rule antecedents. This paper also proposed a novel data-driven rule base generation approach, which is workable with sparse data sets, dense data sets, and unevenly distributed data sets. The system was validated and evaluated by applying it to two benchmark problems and one real-world data set. The experimental results demonstrated the wide applicability of the proposed system with compact rule bases and competitive performance.

The proposed system can be further enhanced in the following areas. Firstly, the sparsity-aware possibilistic clustering algorithm (Xenaki et al. (2016)) was designed to work with sparse data sets, and thus, it is desirable to investigate how this clustering algorithm can support the proposed TSK+ inference system. Secondly, it is worthwhile to study how the proposed sparse rule base generation approach can help in generating Mamdani-style fuzzy rule bases. Finally, it would be interesting to define the sparsity or density of a data set such that more accurate clustering results can be generated during rule cluster generation.