TSK-Streams: learning TSK fuzzy systems for regression on data streams

The problem of adaptive learning from evolving and possibly non-stationary data streams has attracted a lot of interest in machine learning in the recent past, and also stimulated research in related fields, such as computational intelligence and fuzzy systems. In particular, several rule-based methods for the incremental induction of regression models have been proposed. In this paper, we develop a method that combines the strengths of two existing approaches rooted in different learning paradigms. More concretely, our method adopts basic principles of the state-of-the-art learning algorithm AMRules and enriches them by the representational advantages of fuzzy rules. In a comprehensive experimental study, TSK-Streams is shown to be highly competitive in terms of performance.


Introduction
In many practical applications of machine learning and predictive modeling, data is produced incrementally in the course of time and observed in the form of a continuous, potentially unbounded stream of observations. Correspondingly, the problem of learning from data streams (Gama 2012) has received increasing attention in recent years. Algorithms for learning on streams must be able to process the data in a single pass, which implies an incremental mode of learning, to detect and handle different types of drift in the data distribution, and, correspondingly, to adapt to changes of the underlying data-generating process (Gama et al. 2014; Lu et al. 2019).
A popular approach for learning on data streams, both for classification and regression, is rule induction, in the fuzzy logic and computational intelligence community also known as "evolving fuzzy systems" (Lughofer 2011). Shaker et al. (2017) proposed a method for regression that builds on a very efficient and effective technique for rule induction, which is inspired by the state-of-the-art machine learning algorithm AMRules (Almeida et al. 2013), and combines it with the strengths of fuzzy modeling. Thus, the method induces a set of fuzzy rules, which, compared to conventional rules with Boolean antecedents, has the advantage of producing smooth regression functions.
The method presented in this paper, called TSK-Streams, is a substantially revised and improved variant of the method of Shaker et al. (2017). The main modifications and novel contributions are as follows:

- We give a concise overview of regression learning on data streams as well as a systematic comparison of existing methods with regard to properties such as discretization of features, splitting criteria for rules, etc. This overview helps to better understand the specificities and characteristics of approaches originating from different research fields, and to position our own approach.
- We introduce a new strategy for the induction of TSK fuzzy rules and realize it in the form of two concrete variants: variance reduction and error reduction. While the latter is still close to Shaker et al. (2017), the variance reduction approach has not been considered for online learning of fuzzy systems so far. Compared with error reduction and other state-of-the-art methods, it leads to models with superior predictive performance.
- In Shaker et al. (2017), rule antecedents may contain disjunctions and negations, which makes them difficult to understand and interpret. The representation of TSK rules used in this paper is simpler and more concise. This is achieved by means of an improved technique for splitting fuzzy sets (and extending corresponding rules).
- We propose the induction of candidate fuzzy rules using a discretization technique that is based on an extended binary search tree (E-BST) structure. Compared to the three-layered discretization architecture used by Shaker et al. (2017), the use of E-BST for constructing candidate fuzzy sets has a number of advantages in the context of online learning. Most notably, it comes with a reduction of complexity from linear to logarithmic (in the number of candidate extensions).
- Our empirical evaluation is more extensive and comprises a couple of additional large-scale data sets with up to 100k instances.
The evaluation is also extended by including an additional method that has been introduced recently (Gomes et al. 2018).
The rest of the paper is organized as follows. Following a brief discussion of related work and systematic exposition of methods for regression on data streams in Sect. 2, we introduce our method TSK-Streams in Sect. 3. Section 4 presents a comprehensive experimental study, in which TSK-Streams is compared to several competitors, prior to concluding the paper in Sect. 5.

Learning regression models on data streams
Recall that, in the standard setting of supervised learning, a learner is given access to a set of training data D = {(x_1, y_1), ..., (x_N, y_N)} ⊂ X × Y, where X is an instance space and Y the set of outcomes that can be associated with an instance. In the case of regression, Y = ℝ, i.e., outcomes are real numbers. Most commonly, instances are represented in terms of feature vectors x_i = (x_{i,1}, ..., x_{i,d}) ∈ ℝ^d. Given a hypothesis space H (consisting of hypotheses h : X → Y mapping instances x to outcomes y) and a loss function ℓ : Y × Y → ℝ, the goal of the learner is to induce a hypothesis h* ∈ H with low risk (expected loss)

R(h) = ∫ ℓ(h(x), y) dP(x, y),

where P is a joint probability measure on X × Y characterizing the data-generating process. Thus, given the training data D, the learner needs to "guess" a good hypothesis h. Assuming the training examples (x_i, y_i) to be independent and identically distributed (according to P), this choice is commonly guided by the empirical risk

R_emp(h) = (1/N) Σ_{i=1}^{N} ℓ(h(x_i), y_i),

i.e., the performance of a hypothesis on the training data. However, since R_emp(h) is only an estimate of the true risk R(h), the hypothesis (empirical risk minimizer) ĥ := argmin_{h ∈ H} R_emp(h) favored by the learner will normally not coincide with the true risk minimizer h* := argmin_{h ∈ H} R(h). Moreover, since empirical risk minimization is prone to overfitting the training data, which in turn compromises generalization performance, the learning process is typically regularized.
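The connection between empirical risk and the least-squares fit can be made concrete with a minimal sketch, assuming squared loss and a hypothesis space of affine functions h(x) = w·x + b; the function names (`empirical_risk`, `fit_erm`) are ours, not the paper's.

```python
import numpy as np

def empirical_risk(w, b, X, y):
    """Average squared loss of the affine hypothesis on the training data."""
    preds = X @ w + b
    return float(np.mean((preds - y) ** 2))

def fit_erm(X, y):
    """Empirical risk minimizer over affine hypotheses (least squares)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[:-1], coef[-1]  # (w, b)

# Noise-free data generated by an affine function is fitted exactly:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5
w, b = fit_erm(X, y)
```

On noisy data, the empirical risk minimizer ĥ would differ from the true risk minimizer h*, which is exactly the gap discussed above.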
When learning from data streams, as opposed to learning in "batch mode", the training data is not given in the form of a static data set D. Instead, the data is produced in an online manner, and training examples are provided one by one. In other words, the data forms a potentially unbounded, continuously evolving sequence (x_1, y_1), ..., (x_i, y_i), ... of data points. Also, the data-generating process is not necessarily stationary, i.e., the (x_i, y_i) are not necessarily generated from the same distribution P. Instead, this distribution may change in the course of time, giving rise to what is called concept drift or concept shift in the literature (Gama et al. 2014; Lu et al. 2019).

An overview of existing methods
In the machine learning community, research on supervised learning from data streams has mainly focused on classification problems so far. As one of the first methods, Hoeffding trees (Domingos and Hulten 2000) have been proposed for learning classifiers on high-speed data streams. Since then, the tree-based approach has been developed further, and various modifications and variants can be found in the current literature (Bifet and Gavaldà 2009). Closely related to tree-based approaches is the induction of decision rules. For example, the Adaptive Very Fast Decision Rules (AVFDR) method (Kosina and Gama 2012) is an extension of the Very Fast Decision Rules (VFDR) classifier (Gama and Kosina 2011), which learns a compact set of rules in an incremental manner. Most recently, Bifet et al. (2017) developed an extremely fast version of Hoeffding trees with an implementation that is ready to be used in industrial environments.
Less research has been done on regression for data streams. Notable exceptions include AMRules (Almeida et al. 2013), which is an extension of AVFDR for handling numeric target values, and FIMTDD (Ikonomovska et al. 2011), which induces model trees. In contrast to the machine learning community, the fuzzy systems community has put more emphasis on regression than on classification (Angelov 2002;Angelov et al. 2010;Lughofer 2011). In particular, FLEXFIS (Lughofer 2008) is a method for inducing Takagi-Sugeno-Kang (TSK) rules (Takagi and Sugeno 1985) from data streams.
In the following, we elaborate a bit more on those approaches that are especially relevant for our own method and the experimental study presented later on.
In the Adaptive Model Rules (AMRules) approach, the rule premises are represented in the form of conjunctive combinations of literals on the input variables. Moreover, the rule consequents are specified as linear functions of the variables, which are fitted to the data using least squares regression. Each rule maintains various statistics characterizing the part of the instance space covered by that rule. Starting with a single literal, each rule is expanded by new literals step by step, using the Hoeffding bound as a selection criterion. A distinction between unordered rule sets and decision lists is made by Almeida et al. (2013). In that work, the authors propose two prediction and update schemes. In the first approach, the rules are sorted in the order in which they have been learned. For prediction, only the first rule that is activated by an example is used. In the second approach, the rules are treated as a set, and their predictions are aggregated using weights inversely proportional to the losses of the corresponding rules. Moreover, all rules activated by an example are updated. Since a better performance was achieved for the second approach, the authors used that one in their study.
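The second prediction scheme can be sketched as follows: the predictions of all rules covering an instance are averaged with weights inversely proportional to each rule's current loss estimate. The function name and the `eps` smoothing constant are our own choices, not part of AMRules.

```python
def aggregate(predictions, losses, eps=1e-9):
    """Weighted average of rule predictions, with weights ~ 1/loss."""
    weights = [1.0 / (loss + eps) for loss in losses]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# A rule with a small running loss dominates the aggregated prediction:
print(aggregate([10.0, 20.0], [0.1, 10.0]))
```

Here the first rule (loss 0.1) receives one hundred times the weight of the second (loss 10.0), so the aggregate lies close to 10.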
Fast Incremental Model Trees with Drift Detection (FIMTDD) is a method for learning model trees for regression. To determine splits of the model tree, candidate attributes are assessed according to how much they help to reduce the variance of the target variable. Moreover, a linear function on a corresponding subspace is specified for each leaf of the induced tree, and learning these functions is accomplished using stochastic gradient descent. An ensemble of FIMTDD trees was introduced by Ikonomovska et al. (2015) after equipping each tree with the option mechanism, i.e., the standard single split approach is replaced by a multi-split ability to avoid long waiting times before the tie-breaking takes place. Another ensemble version of FIMTDD (adaptive random forest, ARF-Reg) was proposed by Gomes et al. (2018), using an online version of bagging for creating the ensemble members (Oza and Russell 2001).
The Flexible Fuzzy Inference Systems (FLEXFIS) approach by Lughofer (2008) is a method for learning fuzzy rules, or, more specifically, Takagi-Sugeno-Kang (TSK) rules, on data streams. This type of rule will be formally introduced in Sect. 4.1. In contrast to Boolean rules, fuzzy rules are of a gradual nature and can cover an instance to a certain degree, which in turn allows for modulating the influence of a rule on a prediction in a more fine-granular way. In FLEXFIS, the fuzzy support of a rule, i.e., the region it covers in the input space, is determined by (incrementally) clustering the training data and associating each rule with a cluster. Rule consequents are specified in terms of linear functions of the input variables, and the estimation of these functions is successively adapted through recursive weighted least squares (RWLS) (Ljung 1999). While both fuzzy (FLEXFIS) and non-fuzzy (AMRules) methods are capable of performing a weighted aggregation, in the latter case weighting is oblivious to the degree to which a rule covers a data sample.
The main motivation of our approach is to take advantage of the effectiveness and efficiency of the algorithmic techniques for rule learning as implemented by methods such as AMRules, and to combine them with the expressiveness of fuzzy rules as used in approaches like FLEXFIS and eTS+ (Angelov 2010) as well as related formalisms such as fuzzy pattern trees (Shaker et al. 2013).
In the following, we provide a more systematic exposition and categorize the learning algorithms discussed above according to several properties. Along the way, we highlight potential advantages of combining different algorithms and their features.

Trees versus rules
Most tree and rule induction methods are based on refining rules in a general-to-specific manner, i.e., they share the property of moving from general to more specific hypotheses. In FIMTDD, for example, leaf nodes are split into more specific leaf nodes. Likewise, in AMRules and TSK-Streams, rules are specialized by adding terms to the premise part.
Trees can be seen as rule sets with a specific structure. Thus, while a direct transformation from a tree to a set of rules can usually be done in a straight-forward manner, the other direction is not always possible. In AMRules, for example, some of the rules are removed upon detection of a concept change, which makes it impossible to map the current rules to an equivalent tree-model. FLEXFIS and eTS+ do not follow the aforementioned general-to-specific induction scheme. Instead, they learn and maintain rules in the form of clusters directly in the instance space. In general, these rules cannot be represented in terms of an equivalent tree structure.

Binary versus gradual membership
The application of fuzzy logic in decision tree and rule learning leads to two important distinctions from conventional learning. First, hard conditions (in rule antecedents) are replaced by soft conditions, so that an example can satisfy a condition to a certain degree. Therefore, in a tree structure, an instance can be propagated to different sibling nodes/leaves simultaneously, perhaps with different weights. Likewise, in a system of rules, it can be covered by multiple rules with different membership degrees.
The second difference is the ability to aggregate the decisions made by different rules in a weighted manner, as done by TSK-Streams, FLEXFIS, and eTS+, instead of merely computing an unweighted average of the outputs of all rules covering an instance. Thus, more weight can be given to the more relevant and less to the less relevant rules.
Likewise, gradual membership allows for more general inference in the case of tree structures. While decision and model trees restrict tree traversal to a single branch from the root to a leaf node, an equivalent fuzzy model tree would follow several such paths simultaneously, branching an instance at an inner node in a weighted manner depending on how much it agrees with the conditions associated with each branch.

Discretization
Discretization is usually needed to create a finite number of candidate values for splitting points (thresholds) in the case of continuous features; these splitting points are then validated using a splitting criterion to decide how a tree/rule should be extended.
Both AMRules and FIMTDD apply a supervised discretization technique that is tailored to each rule and leaf node; this is achieved by considering the target values of all instances that reached a given leaf node or are covered by a rule.
TSK-Streams, as we will explain later, applies a supervised discretization technique for the creation of fuzzy sets that are evaluated for future extensions.

Splitting criteria
As already said, refining a model normally means extending a rule with additional conditions, thereby splitting it into two more specific rules or shrinking the region covered by that rule. A splitting criterion is used to find the presumably best among the (typically large) set of candidate splits. To quantify the usefulness of a split, different measures are conceivable.
A splitting criterion employed by many methods, including AMRules, is variance reduction: For a rule R, the set N of instances covered by that rule is split into two groups N_1 and N_2 based on an attribute x_j and a threshold v, i.e.,

N_1 = {(x, y) ∈ N | x_j ≤ v} and N_2 = {(x, y) ∈ N | x_j > v}.

The sets N_1 and N_2 then specify new rules R_1 and R_2, respectively. Both x_j and v are chosen so as to achieve a maximal reduction of variance

VR = Var(N) − ( |N_1|/|N| · Var(N_1) + |N_2|/|N| · Var(N_2) ),    (2)

where Var(N) is the variance of the target attribute (the y-values) of the instances in N.
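The variance-reduction criterion just described can be sketched in a few lines; the function name is ours, and the split is evaluated naively on a stored batch rather than incrementally.

```python
import numpy as np

def variance_reduction(xj, y, v):
    """Drop in weighted target variance when splitting on x_j at threshold v."""
    xj, y = np.asarray(xj, dtype=float), np.asarray(y, dtype=float)
    left, right = y[xj <= v], y[xj > v]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: no reduction
    n = len(y)
    return float(np.var(y)
                 - (len(left) / n) * np.var(left)
                 - (len(right) / n) * np.var(right))

# A threshold separating the two target clusters removes all variance:
xj = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
print(variance_reduction(xj, y, 5.0))  # -> 25.0
```

A streaming learner evaluates the same quantity from sufficient statistics (counts, Σy, Σy²) instead of stored instances, as discussed below for the E-BST.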
Variance reduction has its roots in the earliest decision tree induction methods, in which splits are chosen that decrease the impurity of leaf nodes. For categorical target attributes, this is usually put in practice by reducing the information entropy. In the case of classification, the majority class is then used for prediction at a leaf node. In regression, where the target attribute is numerical, averaging is a more reasonable aggregation strategy; it was already adopted by the first regression tree learner CART (Breiman et al. 1984). With the aim of minimizing the sum of squared errors, variance reduction becomes the right splitting criterion, since the sum of weighted variances [the second part of (2)] can be written as the sum of squared errors:

Σ_k |N_k|/|N| · Var(N_k) = (1/|N|) Σ_k Σ_{(x,y) ∈ N_k} (y − ȳ_k)²,

where ȳ_k denotes the mean of the target values in N_k. M5 (Quinlan 1992), one of the most popular regression approaches, is similar to a regression tree, with the exception of learning a linear function in the leaf nodes instead of predicting a constant (the average in CART), while employing variance reduction as a splitting criterion. FIMTDD extends M5 for learning model trees from data streams; it also applies variance reduction as a splitting criterion.
Despite the popularity of variance reduction, it has been criticized by Karalič (1992) as "not an appropriate measure for impurity of an example set since example sets with large variance and very low impurity can arise". Similarly, a set of data points might be perfectly located on a hyperplane, non-orthogonal to the target axis, and still have a high variance.
FLEXFIS and eTS+ do not apply a splitting criterion directly, but utilize an extension mechanism that decides when to add rules to the current rule set. More specifically, FLEXFIS applies an incremental clustering method, namely an incremental version of vector quantization (Gray 1984), such that a new example forms a new cluster if its distance to the nearest cluster is larger than the "vigilance" parameter. This parameter controls the tradeoff between major structural changes (creating a new cluster) and minor adaptations of the current structure. Likewise, eTS+ utilizes a density-based incremental clustering, eClustering+ (Angelov 2004). In both approaches, the clusters found are eventually transformed into rules.
Finally, we mention that most of the presented approaches consider only a single attribute for splitting, which leads to axis-parallel splits, not only in the standard case (FIMTDD and AMRules) but also in the case of fuzzy methods. FLEXFIS and eTS+ constitute an exception, since they find multivariate Gaussian clusters with non-diagonal covariance matrices.

Statistical tests versus engineered parameters
Learning on data streams, including the choice of the next split, must be done in an online manner. To answer the question whether or not an additional split is required, i.e., whether or not a significant improvement can be achieved through a split, statistical tests can be applied. A statistical test based on the Hoeffding bound has been extensively used by recent machine learning approaches for classification and regression, including Hoeffding trees, FIMTDD, AMRules, and TSK-Streams.
Instead of applying statistical tests, FLEXFIS and eTS+ make use of more engineered solutions, such as creating a new rule whenever an example is distant from all existing rules, as in FLEXFIS, or when adding an example reduces the density of existing ones, as in eTS+.

The learning algorithm TSK-Streams
TSK-Streams is an incremental, adaptive algorithm for learning rule-based regression models in a streaming mode. More specifically, TSK-Streams produces a widely used type of fuzzy rule system called Takagi-Sugeno-Kang (TSK) (Takagi and Sugeno 1985).

Basic concepts from fuzzy logic
Let us recall that the notion of a fuzzy set generalizes the conventional concept of a set in the sense of allowing for partial membership, which means that an element can belong to a set to a certain degree (Zadeh 1965). More formally, a fuzzy subset A of a reference set X is characterized in terms of a so-called membership function μ A : X −→ M, where M is a totally or partially ordered set of membership degrees, typically the unit interval [0, 1]. This function can be considered as a generalization of the characteristic function of a set, which is restricted to the membership degrees 0 (no membership) and 1 (full membership). In general, the membership degree μ A (x) can be interpreted as the truth degree of the proposition that x is an element of A.
Indeed, like in the classical case, there is a close relationship between set theory and logic. For example, generalized logical operators are used to define generalizations of set-theoretical operations such as intersection and union. A triangular norm (t-norm) is a binary operator ⊤ : [0, 1]² → [0, 1], which is commutative, associative, nondecreasing in both arguments, and with neutral element 1 and absorbing element 0 (Klement et al. 2000). Commonly used examples include the minimum, the product, and the Lukasiewicz t-norm ⊤(u, v) = max{u + v − 1, 0}. A t-norm serves as a generalized logical conjunction, and as such, is also used to define the intersection of fuzzy sets: If A and B are fuzzy subsets of X with membership functions μ_A and μ_B, respectively, then the intersection C = A ∩ B is characterized by the membership function μ_C(x) = ⊤(μ_A(x), μ_B(x)) for all x ∈ X. Note that, due to its associativity and commutativity, a t-norm can be generalized from a binary operation to a conjunction of any number of elements in a canonical way; we shall write ⊤(u_1, ..., u_n) for the combination of membership degrees u_1, ..., u_n.
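The three t-norms mentioned above, and the resulting fuzzy-set intersection, can be illustrated with a small sketch; all function names are our own.

```python
def t_min(u, v):
    return min(u, v)                 # minimum (Gödel) t-norm

def t_prod(u, v):
    return u * v                     # product t-norm

def t_luka(u, v):
    return max(u + v - 1.0, 0.0)     # Lukasiewicz t-norm

def intersect(mu_a, mu_b, tnorm):
    """Membership function of A ∩ B: μ_C(x) = ⊤(μ_A(x), μ_B(x))."""
    return lambda x: tnorm(mu_a(x), mu_b(x))

mu_a = lambda x: 0.8   # constant membership degrees, just for illustration
mu_b = lambda x: 0.5
mu_c = intersect(mu_a, mu_b, t_luka)
print(mu_c(0.0))  # max(0.8 + 0.5 - 1, 0) = 0.3 (up to float rounding)
```

All three operators agree on the boundary cases (⊤(u, 1) = u, ⊤(u, 0) = 0) but differ strictly inside the unit square.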
With every t-norm ⊤, one can associate a t-conorm ⊥ given by ⊥(u, v) = 1 − ⊤(1 − u, 1 − v). The latter plays the role of a generalized disjunction and can be used, for example, to define the union of fuzzy sets. The t-conorms obtained for the minimum, the product, and the Lukasiewicz t-norm are given by the maximum, the algebraic sum ⊥(u, v) = u + v − uv, and the bounded sum ⊥(u, v) = min{u + v, 1}, respectively.

TSK fuzzy systems
A TSK rule R_i has the following structure:

R_i : IF (x_1 IS A_{i,1}) AND ... AND (x_d IS A_{i,d})
      THEN y_i = ω_{i,0} + ω_{i,1} x_1 + ... + ω_{i,d} x_d,    (4)

where (x_1, ..., x_d) is the feature representation of an instance x ∈ ℝ^d and A_{i,j} defines the jth antecedent of R_i in terms of a soft constraint (modeled by a fuzzy set). The consequent part of the rule is specified by the vector ω_i = (ω_{i,0}, ω_{i,1}, ..., ω_{i,d}), which defines an affine function of the input features. In what follows, we denote a rule by R_i = (M_i, ω_i), with M_i the fuzzy sets defining the rule antecedents, and ω_i the coefficients specifying the linear function. The soft constraint A_{i,j} is modeled in terms of a fuzzy set with membership function μ^(i)_j. The overall degree to which an instance x satisfies the premise of R_i is

μ_{R_i}(x) = ⊤( μ^(i)_1(x_1), ..., μ^(i)_d(x_d) ),    (5)

where the triangular norm ⊤ models the logical conjunction. We will adopt the Gödel t-norm, which is given by ⊤(u, v) = min(u, v). Notice that A_{i,j} might be a void constraint, which corresponds to setting μ^(i)_j = μ_void ≡ 1; in that case, the feature x_j is effectively removed from the premise of the rule (4).

Now, given an instance x as an input to a TSK system with C rules RS = {R_1, ..., R_C}, each of these rules will be "activated" with the degree (5). Therefore, the system's output is specified by the weighted average of the outputs suggested by the individual rules (see Fig. 1 for an illustration):

ŷ = Σ_{i=1}^{C} Ψ_i(x) · y_i(x),    (6)

with

Ψ_i(x) = μ_{R_i}(x) / Σ_{k=1}^{C} μ_{R_k}(x).

Fuzzy sets can have membership functions with different shapes and properties (Pedrycz and Gomide 1998). In our approach, we employ the family of "S-shaped" parametrized functions: a fuzzy set μ parametrized by (a, b, c, d) has support [a, d] and core [b, c]. The left boundary of the fuzzy set μ is modeled in terms of an "S-shaped" transition between zero and full membership on [a, b], and the right boundary in terms of a symmetric transition between full and zero membership on [c, d]. An S-shaped membership function can also be left- or right-unbounded.
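The inference scheme above can be sketched as follows, with S-shaped fuzzy sets parametrized by (a, b, c, d). The smoothstep polynomial used for the transitions is an assumption for illustration (the paper's exact transition function may differ), and all names are our own.

```python
import numpy as np

def s_shape(x, a, b, c, d):
    """Membership degree of x in the fuzzy set (a, b, c, d); a < b <= c < d."""
    if x < a or x > d:
        return 0.0  # outside the support
    if b <= x <= c:
        return 1.0  # inside the core
    t = (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return 3 * t ** 2 - 2 * t ** 3   # smoothstep transition (assumed shape)

def tsk_predict(x, rules, tnorm=min):
    """Each rule is (antecedents, omega): antecedents is a list of
    (feature_index, (a, b, c, d)) pairs, omega a coefficient vector with
    the bias first. Void premises (empty lists) activate with degree 1."""
    x = np.asarray(x, dtype=float)
    activations, outputs = [], []
    for antecedents, omega in rules:
        mu = 1.0
        for j, params in antecedents:
            mu = tnorm(mu, s_shape(x[j], *params))  # conjunction via t-norm
        activations.append(mu)
        outputs.append(omega[0] + float(np.dot(omega[1:], x)))  # affine part
    total = sum(activations)
    if total == 0.0:
        return 0.0  # no rule covers x
    # Weighted average of rule outputs, weights normalized as in (6):
    return sum(a * o for a, o in zip(activations, outputs)) / total

# Two rules on a one-dimensional input; x = 3 lies in the core of the
# first rule's antecedent and outside the second rule's support:
rules = [([(0, (0.0, 1.0, 5.0, 6.0))], [0.0, 1.0]),
         ([(0, (4.0, 5.0, 9.0, 10.0))], [100.0, 0.0])]
print(tsk_predict([3.0], rules))  # -> 3.0
```

For an input activating both rules partially, the output interpolates smoothly between the two affine consequents, which is exactly the smoothness advantage over Boolean rules.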

Online rule induction
The TSK-Streams algorithm (cf. Algorithm 1 for the basic structure) begins with a single default rule and then learns rules in an incremental manner. The default rule has a void constraint for each feature (that is, the membership function μ_void) and covers the complete input space. Then, the algorithm continuously checks whether, for any of the rules R_i, one of its extensions could possibly improve the performance of the current fuzzy system.

Fig. 1 Illustration of a TSK fuzzy system with three rules (in blue, green, and black, respectively) for the case of a one-dimensional instance space: Each rule is specified in terms of an S-shaped fuzzy set (rule antecedent) and an affine function (rule consequent, dashed line). The function (6) modeled by the TSK system is plotted in red color (Color figure online)

Fig. 2 Illustration of rule expansion: On the left side, the domain is covered by a single fuzzy set (μ_void, blue flat line), so that the TSK system produces an affine function. On the right side, the fuzzy set is split into two fuzzy sets (in blue and green, respectively), each of them giving rise to an individual rule with a different consequent. The TSK system is now able to fit the data (black points) more accurately (Color figure online)

An expansion of a rule R_i with a predicate (x_j IS A_{i,j}) on the jth attribute means that the rule is split into two new rules R'_i and R''_i with predicates (x_j IS A'_{i,j}) and (x_j IS A''_{i,j}), respectively. These membership functions are chosen after a fuzzy partitioning of the domain of feature x_j. To this end, we apply a supervised discretization technique that divides a fuzzy set into two new fuzzy sets so as to improve the overall performance (see Fig. 2 for an illustration). Here, we focus on two criteria (cf. the discussion in Sect. 2), to be detailed in the following.

Variance reduction
Similar to the AMRules principle of reducing the variance, based on the fuzzy set A_{i,j}, two fuzzy sets A'_{i,j} and A''_{i,j} are created such that a maximum reduction of the target attribute's variance is achieved. For example, let A_{i,j} be a fuzzy set (for the jth attribute in the rule R_i) characterized by the S-shaped membership function μ^(i)_j, which is parametrized by the quadruple (a, b, c, d). Let N_{R_i} be the set of examples (x, y) covered by the rule R_i, i.e., the examples for which μ_{R_i}(x) > 0. We then seek the value q ∈ [a, d] that maximizes the reduction in variance

VR(q) = Var(N_{R_i}) − ( |N_1|/|N_{R_i}| · Var(N_1) + |N_2|/|N_{R_i}| · Var(N_2) ),

with N_1 = {(x, y) ∈ N_{R_i} | x_j ≤ q}, N_2 = {(x, y) ∈ N_{R_i} | x_j > q}, and Var(S) the variance of the target values of the instances in S. Similar to AMRules and FIMTDD, we achieve the variance reduction by storing candidate values in an extended binary search tree (E-BST). This data structure allows for computing the variance reduction for each candidate value in time that is linear in the size of the tree; moreover, it can be updated in logarithmic time (Ikonomovska 2012). The E-BST stores sufficient statistics to evaluate the variance reduction resulting from the split at each node. Each node contains (i) the test value at that node, (ii) the number of samples reaching that node, the sum of their target values (Σ y_i), and the sum of the squared target values (Σ y_i²), and (iii) the same statistics for the samples reaching the right child of the node.
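The E-BST idea can be sketched as follows, under a slightly simplified bookkeeping: each node keeps the sufficient statistics (count, Σy, Σy²) of the examples with x_j ≤ its key seen on its search path, so that the variance reduction of every stored threshold can be evaluated in a single in-order traversal. This is our reading of the structure, not the authors' implementation (which stores the analogous statistics for the right child), and all names are ours.

```python
class EBSTNode:
    def __init__(self, key, y):
        self.key = key
        self.n, self.sy, self.syy = 1, y, y * y  # stats for x_j <= key here
        self.left = self.right = None

    def insert(self, key, y):
        """Logarithmic-time update per example (for a balanced tree)."""
        if key <= self.key:
            self.n += 1; self.sy += y; self.syy += y * y
            if key < self.key:
                if self.left: self.left.insert(key, y)
                else: self.left = EBSTNode(key, y)
        else:
            if self.right: self.right.insert(key, y)
            else: self.right = EBSTNode(key, y)

def best_split(node, total, acc=(0, 0.0, 0.0)):
    """In-order scan returning (best variance reduction, best threshold).
    total = (n, Σy, Σy²) over all examples; acc carries the statistics of
    the examples known to lie to the left of this subtree."""
    if node is None:
        return (0.0, None)
    var = lambda n, s, ss: ss / n - (s / n) ** 2  # variance from stats
    n_l, sy_l, syy_l = acc[0] + node.n, acc[1] + node.sy, acc[2] + node.syy
    best = best_split(node.left, total, acc)
    n, sy, syy = total
    n_r, sy_r, syy_r = n - n_l, sy - sy_l, syy - syy_l
    if n_l > 0 and n_r > 0:
        red = var(n, sy, syy) - (n_l / n) * var(n_l, sy_l, syy_l) \
                              - (n_r / n) * var(n_r, sy_r, syy_r)
        best = max(best, (red, node.key), key=lambda t: t[0])
    return max(best, best_split(node.right, total, (n_l, sy_l, syy_l)),
               key=lambda t: t[0])

# Stream in four examples (x_j, y) and query the best threshold:
root = EBSTNode(5.0, 1.0)
for xj, y in [(3.0, 2.0), (8.0, 3.0), (3.0, 4.0)]:
    root.insert(xj, y)
print(best_split(root, (4, 10.0, 30.0)))  # -> (0.25, 3.0)
```

Note that only the sufficient statistics are stored, never the raw examples, which is what makes the structure suitable for the streaming setting.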

Error reduction
Extending the current model with new rules so as to improve the system's overall performance requires, for each existing rule, the creation and evaluation of all possible extensions-evaluating an extension here means determining the empirical performance of the (modified) system as a whole. As before, by a possible rule extension we mean replacing a fuzzy set A i, j in a rule antecedent by new fuzzy sets A i, j and A i, j , which are produced by bisecting the support of A i, j at some suitable splitting point. Even if these splitting points were organized in a binary search tree structure, the number of updates required after observing a new example would no longer be logarithmic but linear. Indeed, every possible extension means fitting a step-wise linear function, at each splitting value, on the entire training data (or updating the linear function on the new data instance).
To counter the aforementioned problem, we suggest a heuristic that simultaneously chooses a promising splitting value and fits a stepwise linear function for each candidate extension rule. The splitting value is chosen by adaptively shifting (increasing or decreasing) it based on the performance of new candidate rules. More formally, let A_{i,j} be a fuzzy set characterized by the S-shaped membership function μ^(i)_j, parametrized by (a, b, c, d), and let N_{R_i} be the set of instances (x, y) covered by the rule R_i. Let q ∈ [a, d] be the initial splitting value from which A'_{i,j} and A''_{i,j} are constructed via suitable parametrizations (a, b, q + ρ_1, q + ρ_2) and (q − ρ_2, q − ρ_1, c, d) of their membership functions μ'^(i)_j and μ''^(i)_j, respectively. We initialize q by the current mean of the observed values x_j. The values ρ_1 and ρ_2 control the steepness of the S-shaped function and are chosen in proportion to the observed variance. From the membership functions μ'^(i)_j and μ''^(i)_j and the parent rule R_i, the new candidate rules R'_i and R''_i are created (see lines 1-9 of Algorithm 2). Upon observing a new example (x, y), both the membership degrees μ'^(i)_j(x), μ''^(i)_j(x) and the errors committed by each candidate rule, err_1 = (ω'_i · x − y)², err_2 = (ω''_i · x − y)², are computed. If the "winner rule", i.e., the candidate rule by which the example is covered the most, commits an error that is larger than the error committed by the other candidate rule (covering the example to a lesser degree), we consider this an inconsistency. The latter can be mitigated by shifting the splitting value q right or left, in proportion to the error committed by each candidate extension (see lines 11-21 of Algorithm 2).
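The shift of the splitting value can be sketched as follows: when the candidate that covers the example most strongly also commits the larger error, q is shifted away from that candidate, in proportion to the error difference. Variable names follow the text; the step size `eta` and its default value are our own choices.

```python
def shift_split_value(q, m1, m2, err1, err2, psi, eta=0.1):
    """m1, m2: membership degrees of the example in the two candidates;
    err1, err2: their squared errors; psi: weight Ψ_i(x) of the parent rule."""
    if m1 > m2 and err1 > err2:
        return q - eta * psi * (err1 - err2)  # inconsistency: shift q left
    if m1 < m2 and err1 < err2:
        return q + eta * psi * (err2 - err1)  # inconsistency: shift q right
    return q  # winner also has the smaller error: keep the splitting value

# The left candidate covers the example more (m1 > m2) but errs more,
# so q moves to the left:
print(shift_split_value(5.0, 0.9, 0.1, 4.0, 1.0, 1.0))  # -> 4.7
```

Only the splitting value is adapted here; the linear consequents of the two candidates are updated separately by the gradient step described below.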

Algorithm 2: GenUpdateERCandidates -ErrorReduction
Input: R_i : the rule whose extensions should be created/updated; M_i : the set of fuzzy sets conjugated in the premise; ω_i : the vector of coefficients of the linear function.
14  if m_1 > m_2 ∧ error_1 > error_2 then  /* shift q to the left */
15      q = q − η Ψ_i(x_t)(error_1 − error_2)
16  else if m_1 < m_2 ∧ error_1 < error_2 then  /* shift q to the right */
17      q = q + η Ψ_i(x_t)(error_2 − error_1)

In the explanations above, we outlined two ways of splitting an S-shaped function into two such functions of similar shape. In the beginning, however, the default rule contains only unbounded fuzzy sets characterized by μ_void. A split of an unbounded fuzzy set produces two sets with membership functions μ_left-ub(x) and μ_right-ub(x), respectively, which cover the resulting half spaces (with some degree of overlap). Similarly, a split of a right- or left-unbounded membership function leads to a right- or left-unbounded and an S-shaped function.
Recall that AMRules adopts only a single rule from the two candidates emerging from a rule expansion (cf. Sect. 2.2). More specifically, AMRules keeps the rule with minimum weighted variance and discards the other candidate as well as the parent rule from the original rule set. Since the resulting rule set does not form a partition of the instance space, this strategy requires a default rule covering the space that is not covered by any other rule. Motivated by this strategy, we also study the effect of adopting only a single instead of both rule extensions. Thus, we distinguish the following two strategies.
1. Single Extension: Only the best extension is added to the rule set, while the other one is discarded. The parent rule is also discarded unless it is the default rule. The choice of the best rule depends on the criterion used for splitting: either the weighted variance reduction or the weighted SSE.
2. All Extensions: Both extensions are added to the rule set, and the parent rule is removed. This approach makes the whole system of rules equivalent to a tree structure.
The two adaptation strategies will be revisited in the context of change detection in Sect. 3.6. A more detailed exposition is given in Algorithms 4 and 5.

Rule consequents
FLEXFIS makes use of recursive weighted least squares estimation (RWLS) (Ljung 1999) to fit linear functions as rule consequents. This approach is computationally expensive, as it requires multiple matrix inversions. In our approach, and similar to AMRules, we learn consequents more efficiently using gradient methods. When a new training instance (x_t, y_t) arrives, TSK-Streams produces a prediction

ŷ_t = Σ_{R_i ∈ RS} Ψ_i(x_t) (ω_i · x_t),

where RS is the current set of rules, and incurs the squared error E_t = (ŷ_t − y_t)^2. According to the technique of stochastic gradient descent, the coefficients ω_{i,j} are then moved in the negative direction of the gradient, with the length of the shift being controlled by the learning rate η. Thus, the following (component-wise) update rule is obtained:

ω_{i,j} ← ω_{i,j} − η Ψ_i(x_t) (ŷ_t − y_t) x_{t,j}.

The process of updating the rule consequents is summarized in Algorithm 3, which also updates the consequents of the rule's extensions (when the error reduction strategy is used).
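The prediction and update step above can be sketched in a few lines. This is a schematic illustration, not Algorithm 3 itself: rules are modeled as (activation function, coefficient vector) pairs, the activations are assumed to be already normalized, and the constant factor of the gradient is absorbed into η.

```python
def dot(w, x):
    """Inner product of coefficient vector and feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(x, rules):
    """TSK prediction: activation-weighted sum of the linear consequents.
    Each rule is a pair (psi, w), where psi(x) is the (normalized) rule
    activation and w the consequent coefficients (w[0] acts as intercept
    if x carries a leading constant 1)."""
    return sum(psi(x) * dot(w, x) for psi, w in rules)

def sgd_update(x, y, rules, eta=0.05):
    """One stochastic gradient step on the squared prediction error:
    w_ij <- w_ij - eta * Psi_i(x) * (y_hat - y) * x_j for every rule and
    coefficient. Returns the prediction made before the update."""
    y_hat = predict(x, rules)
    for psi, w in rules:
        g = psi(x) * (y_hat - y)        # factor shared by all coefficients
        for j in range(len(w)):
            w[j] -= eta * g * x[j]
    return y_hat
```

On a stationary target, repeated updates drive the consequent of an always-active rule toward the least-squares fit.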

Model structure
TSK-Streams adapts the TSK rule system (that is, the fuzzy sets in the rule antecedents and the linear function in the consequents) in a continuous manner. While the adaptations discussed so far essentially concern the parameters of the system, the replacement of a rule by one of its expansions corresponds to a (more substantial) structural change.
For obvious reasons, such changes should be handled with caution, especially when they lead to an increased complexity of the model. Learning methods therefore tend to maintain the current model unless being sufficiently convinced that an expansion will yield an improvement. To decide whether or not a possible expansion should be adopted, the estimated performance difference is typically taken as a criterion: this difference should be significant in a statistical sense.
In our algorithm, we make use of Hoeffding's inequality to support these decisions. The latter bounds the difference between the empirical mean X̄ of n i.i.d. random variables X_1, ..., X_n (having support [a, b] ⊂ R) and the expectation E(X) in terms of

P(|X̄ − E(X)| > ε) ≤ 2 exp(−2nε² / (b − a)²).

More specifically, when using the error reduction criterion, we replace a rule R_i by two rules R'_i and R''_i, considering the reduction in the sum of squared errors (SSE). That is, the SSE of the current rule set RS is compared with the SSE of all alternative rule sets resulting from a single expansion. With SSE_best and SSE_2ndbest denoting the expansions with the lowest and the second lowest error, respectively, the best expansion is adopted if

SSE_best / SSE_2ndbest < 1 − ε,

or when ε falls below a tie-breaking constant τ. The constant ε is obtained from the inequality above by setting the confidence to a desired degree 1 − δ (i.e., the failure probability to δ) and solving for ε; noting that the ratio SSE_best / SSE_2ndbest is bounded in ]0, 1], b − a is set to 1. Algorithm 4 depicts the system expansion procedure when the error reduction criterion is applied. The same technique can be used for the single extension variant, except that the rule R_i is replaced with the extension that achieves the lowest weighted SSE (provided R_i is not the default rule, otherwise R_i is also kept).
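The expansion decision can be sketched as below. The one-sided form ε = sqrt((b − a)² ln(1/δ) / (2n)), common in Hoeffding-tree learners, is assumed here; the function names and default parameter values are illustrative, not taken from the paper.

```python
import math

def hoeffding_epsilon(n, delta, value_range=1.0):
    """Hoeffding bound: with probability at least 1 - delta, the empirical
    mean of n observations with support width value_range deviates from its
    expectation by less than the returned epsilon (one-sided form)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_expand(sse_best, sse_2nd, n, delta=0.05, tau=0.05):
    """Adopt the best candidate expansion if its SSE ratio to the runner-up
    is significantly below 1, or if the bound is already tighter than the
    tie-breaking constant tau."""
    eps = hoeffding_epsilon(n, delta)       # range set to 1: ratio is in ]0, 1]
    ratio = sse_best / sse_2nd
    return ratio < 1.0 - eps or eps < tau
```

With many observations, ε shrinks toward 0, so even small differences between the two best expansions eventually become decisive; the penalty term C discussed below can be added to ε to discourage expansions of already-large rule sets.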
As an alternative to the global error reduction criterion, the variance reduction approach checks for the decrease in variance for each rule locally. The Hoeffding inequality is then applied to the ratio of the variance reductions of the best two candidate extensions of the same rule R i . The procedure that performs the expansion is depicted in Algorithm 5. This strategy can be seen as a model adaptation through local improvements.
Overfitting is a potential problem when a tree- or rule-based system is extended merely based on the measured improvement in the training loss. To circumvent overfitting, pruning is often applied. In our approach, we propose a penalization mechanism to avoid the danger of overfitting due to an excessive increase in the number of rules. This mechanism consists of adding a complexity term C to ε. For both extensions (variance reduction and error reduction), C is set to d^(−2) √|RS|, with d the number of features and RS the current rule set. Again, we refer to Algorithm 1 for an overview of the TSK-Streams algorithm (for both alternatives, variance and error reduction).

Change detection
A concept drift may cause a drop in the performance of a rule. To detect such cases, we make use of the adaptive windowing (ADWIN) (Bifet and Gavaldà 2007) drift detector. Compared to the Page-Hinkley test (PH) (Page 1954), which is used by AMRules, ADWIN has the advantage of being non-parametric, which means that it makes no assumptions about the observed random variable. Besides, only a single parameter needs to be chosen, namely the tolerance towards false alarms (δ_adwin). In our approach, ADWIN is applied locally in each rule. More specifically, given that an example is covered by a rule, the detector is applied to the absolute error committed by that rule on this example.

Algorithm 4: ExpandSystemER -- ErrorReduction
Input: SSE_ij: the sum of squared errors committed by extension j of rule R_i; δ: confidence level; τ: tie-breaking constant; n: number of examples seen by the current system; (x_t, y_t): a new training example.
1 let SSE_current be the SSE of the current system
2 let s_pq be the extension with the smallest SSE
3 let s_p'q' be the extension with the second smallest SSE
4 update SSE_current, s_pq and s_p'q' on (x_t, y_t)
if Single Extension then
10     let R_best ∈ {R'_p, R''_p} be the rule with the smallest weighted SSE
For the single extension strategy, a rule that suffers from a drop in performance can simply be discarded. In the all extensions strategy, however, upon detecting a drift in the rule R_p = (M_p, ω_p), we find its sibling rule R_q = (M_q, ω_q), from which it differs by only one single literal: there is a fuzzy set μ^(p)_j ∈ M_p on the jth attribute such that μ^(p)_i = μ^(q)_i for all i ∈ {1, ..., d}\{j}, while μ^(p)_j ≠ μ^(q)_j. To remove the rule R_p, it is retracted from the rule set, and its sibling rule R_q is updated by replacing μ^(q)_j with the corresponding fuzzy set of the parent rule (thereby undoing the split). In case the sibling rule R_q has already been extended before the drift is detected, the same procedure is applied recursively to the children of this rule.
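The per-rule drift monitoring can be illustrated with a toy detector. Note that this is an explicitly simplified stand-in for ADWIN, which checks all cut points of an adaptive window; the fixed half-window split, the class name, and the default parameters below are assumptions made for the sketch.

```python
from collections import deque
import math

class SimpleDriftDetector:
    """Toy stand-in for ADWIN: compares the mean absolute error of the
    oldest and newest halves of a fixed-size sliding window, and flags
    drift when they differ by more than a Hoeffding-style bound.
    (ADWIN itself adapts the window size and tests every cut point.)"""

    def __init__(self, window=200, delta=0.002):
        self.window = window
        self.delta = delta                      # tolerance towards false alarms
        self.errors = deque(maxlen=window)

    def update(self, abs_error):
        """Feed the absolute error the rule committed on a covered example;
        returns True if a drift is signaled."""
        self.errors.append(abs_error)
        n = len(self.errors)
        if n < self.window:
            return False                        # not enough evidence yet
        half = n // 2
        vals = list(self.errors)
        old = sum(vals[:half]) / half
        new = sum(vals[half:]) / (n - half)
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * half))
        return abs(new - old) > eps             # significant error increase
```

In the rule system, one such detector would be attached to each rule and updated only on the examples that rule covers, mirroring the local application of ADWIN described above.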

Algorithm 5: ExpandSystemVR -VarianceReduction
Input: VarRed_ij: the variance reduction caused by extension j of rule R_i; δ: confidence level; τ: tie-breaking constant; n: number of examples seen by the current system; (x_t, y_t): a new training example.

Empirical evaluation
To compare our method TSK-Streams with existing algorithms, we conducted a series of experiments, in which we investigated the algorithms' predictive accuracy, their runtime, and the size of the models they produce.

Methods, data, and experimental setup
TSK-Streams is implemented in MOA (Massive Online Analysis), an open source software framework for mining and analyzing large data sets in a streaming mode (Bifet et al. 2010). In our experiments, TSK-Streams is compared with AMRules, FIMTDD, ARF-Reg, and FLEXFIS. Both AMRules and FIMTDD are included in MOA's distribution. We implemented ARF-Reg as described in the original paper (Gomes et al. 2018). FLEXFIS is implemented in Matlab. For the (hyper-)parametrization of all methods, we perform grid search as described in Sect. 1. The test-then-train protocol was used for all experiments. According to this protocol, each instance is used for both testing and training: The model is evaluated on the instance first, and a learning step is carried out afterward. Experiments are performed on benchmark data sets collected from the UCI repository (Dua and Graff 2019) and other repositories; a summary of the type, the number of attributes, and the number of instances of each data set is given in Table 1.
The data sets starting with prefix BNG-are obtained from the online machine learning platform OpenML (Bischl et al. 2017); these large data streams are drawn from Bayesian networks as generative models, after constructing each network from a relatively small data set (we refer to van Rijn et al. (2014) for more details).

Results
In the first part of the evaluation, we compare the four variants of our own method: variance reduction versus error reduction, and the extension using a single candidate versus the extension using both candidates. Let us note that the combination of error reduction with the extension using both candidates essentially corresponds to the previous version by Shaker et al. (2017). Table 2 shows the average RMSE (root mean squared error) and the corresponding standard error over ten rounds for each data set. In this table, the last row shows the number of statistically significant wins/losses of the first three variants against the fourth (variance reduction with consideration of both candidates); these tests apply the Wilcoxon signed-rank test to the paired performances of the 10 iterations with confidence level α = 0.05. From the results, the fourth variant appears to be superior to the other variants. Therefore, we adopt this variant (simply referred to as TSK-Streams in the following) and consider it for further comparison with state-of-the-art methods.

One way to understand this result is to realize that, in the variance reduction case, keeping both candidate extensions is better than keeping a single one: the split does not only reduce the variance in comparison to the original rule, but also decreases the variance within each of the two candidate rules. The same cannot be said about error reduction, which measures the weighted errors of each candidate. Hence, a combination of a well and a poorly performing candidate rule might still reduce the overall error. In such a case, it is more reasonable to keep only the best performing candidate, which explains why the single candidate strategy is better when error reduction is used as the criterion.
One piece of evidence supporting this claim is that, on average, the both-candidates strategy produces a smaller number of rules (and shorter runtime) than the single candidate strategy, which means that the latter sometimes gets slowed down (or stuck) with poorly performing rules. Tables 3 and 4 show the comparison between the different TSK-Streams variants in terms of model size and runtime, respectively.

Table 5 presents the performance comparison between TSK-Streams and the other approaches: AMRules, FIMTDD, ARF-Reg, and FLEXFIS. Overall, TSK-Streams compares quite favourably and performs best in terms of the average rank statistic. Moreover, on at least 9 of the 18 data sets, its performance is statistically better (again according to the Wilcoxon signed-rank test at significance level α = 0.05) than that of any other approach. In spite of the limited number of data sets, the advantage over AMRules and ARF-Reg is even statistically significant; this is not the case, however, for FIMTDD and FLEXFIS (cf. Fig. 3).
Other criteria important for the applicability of an approach in the setting of data streams include model complexity and efficiency. Obviously, these properties are not independent of each other, because more time is needed to maintain and adapt larger models. We measure the two criteria, respectively, in terms of the number of rules/leaf nodes in the model eventually produced by a learning algorithm and the average time (in milliseconds) the algorithms need to process a single instance. We consider the latter more informative than the total runtime on an entire data set (stream), because the processing time per instance is more relevant for the possible application of an algorithm under real-time conditions. Table 6 shows that TSK-Streams tends to produce smaller models than FIMTDD and ARF-Reg, though still slightly larger than those of FLEXFIS and AMRules. Table 7 shows that TSK-Streams is also slower on average. We would argue, however, that this is not critical, as it is still extremely fast in terms of absolute runtime: being able to predict and learn from each new instance in just a few milliseconds, it certainly meets the requirements for learning on data streams in common practical applications.

Table note: the superscripts (e.g., ⊕) indicate statistical significance for the comparison with TSK-Streams.

Fig. 3 Comparison of all learners using the Nemenyi test (Demsar 2006). For a confidence level of 0.05, the critical difference (CD) for average ranks is 1.43, which indicates that the advantage of TSK-Streams is statistically significant in the cases of FIMTDD and ARF-Reg, but neither for AMRules nor for FLEXFIS

Conclusion
In this paper, we introduced a new fuzzy rule learner for adaptive regression on data streams, called TSK-Streams. This method combines the effectiveness of the rule induction concepts implemented in AMRules with the expressivity of TSK fuzzy rules. TSK-Streams as presented in this paper is an improved variant of an earlier version (Shaker et al. 2017); the modifications essentially concern all parts of the learning algorithm, including the discretization, the rule extension, and the drift detection.
In an experimental study with real and synthetic data, we compared TSK-Streams with state-of-the-art regression algorithms for learning from data streams: AMRules, FIMTDD, ARF-Reg, and FLEXFIS. The results are very promising, especially because our learner achieves the best performance in terms of predictive accuracy. This is remarkable, given that AMRules and FLEXFIS are truly strong (and indeed still competitive) learners; these methods have been developed over many years and are therefore difficult to beat.
Our current implementation of TSK-Streams can be obtained from our GitHub repository.

Funding: Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.