Decision rules extraction from data stream in the presence of changing context for diabetes treatment

The knowledge extraction is an important element of the e-Health system. In this paper, we introduce a new method for decision rules extraction called Graph-based Rules Inducer to support the medical interview in the diabetes treatment. The emphasis is put on the capability of hidden context change tracking. The context is understood as a set of all factors affecting patient condition. In order to follow context changes, a forgetting mechanism with a forgetting factor is implemented in the proposed algorithm. Moreover, to aggregate data, a graph representation is used and a limitation of the search space is proposed to protect from overfitting. We demonstrate the advantages of our approach in comparison with other methods through an empirical study on the Electricity benchmark data set in the classification task. Subsequently, our method is applied in the diabetes treatment as a tool supporting medical interviews.

To lower the costs and make the disease bearable for a patient, so-called e-Health systems are proposed [1,35,50,51].
Such systems help the patient conduct measurements (e.g., of the level of glucose in the blood) and remind the patient to take medicines [13], but they also make it possible to consult via a mobile or computer network [65], make an appointment with a physician [41], support image analysis [44], or support decision making [7]. Sometimes they play a crucial role in the educational aspect of the disease and help the patient learn about their condition [38].
From a medical perspective, it is obvious that a medical interview, called anamnesis, is a crucial element in diagnosing and providing medical care to the patient. Therefore, in most e-Health systems, a module for supporting the medical interview is implemented. However, in the systems for diabetes treatment, such a module is usually designed to make only simple aggregations (e.g., statistics or trends [49], rarely predictions [26]). Applications with knowledge extraction in a form easily understandable for a human being, for example, a decision tree [48] or decision rules [60], are rather uncommon. Either way, the medical community strongly emphasizes the important role of anamnesis in diabetes treatment [27]. The knowledge about the context of the patient determines the interpretation of measurements and the further treatment schedule. The context is understood as a set of factors affecting the patient's condition, such as eating habits, sport activities, mood (e.g., stress, tension, relationships), and general health condition. However, measuring the context directly is difficult, if not impossible; hence, it should be partially or fully obtained during anamnesis.
Therefore, in this work, a method for knowledge extraction in the form of decision rules to support anamnesis is presented. This method is implemented in a system called eDiab [60].
The eDiab system is in the preliminary phase of development (more technical details can be found in [60]), and this work contributes an algorithm for rules induction that enables a physician to conduct a personalized and detailed medical interview. However, to solve the problem of the medical interview, two issues should be considered. First, a knowledge representation should be chosen. In this work, decision rules are applied because they are easily understandable by a human being and can be seamlessly translated to a natural language. Second, the context is unknown and non-stationary (evolving in time). Thus, it is impossible to apply models that assume stationarity, e.g., Hidden Markov Models [6,36]. A reasonable solution is to apply a knowledge extraction method with incremental learning and a forgetting mechanism.
Incremental learning is primarily focused on processing data in a sequential way. In other words, knowledge is updated incrementally by processing examples one by one. However, to keep the knowledge valid in the presence of a changing hidden context, a forgetting mechanism should be proposed. The forgetting mechanism enables the knowledge to follow changes of the observed phenomenon. It means that parts of the knowledge which do not reflect current observations are forgotten, i.e., removed. In the literature, there are two ways of forgetting [45,59]: (i) explicit forgetting, (ii) implicit forgetting. The explicit forgetting assumes that knowledge is updated and validated based on a fixed number of recent examples with constant weighting, a so-called shifting window, or with exponential weighting, a so-called forgetting factor. Weighting means that each observation contributes to learning with a weight. In the forgetting with a shifting window, all observations maintained in the window have weight equal to 1, and all other examples have weight equal to 0. In the exponential forgetting, a forgetting factor weighs observations in such a way that the older an observation is, the less it influences a model. On the other hand, implicit forgetting does not apply weighting according to the time criterion, but some other mechanism is used; for example, examples are weighted spatially [59]. In this work, a method with incremental learning and forgetting with a forgetting factor is proposed.
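The two explicit forgetting schemes can be illustrated with a minimal sketch. The function names and the concrete exponential form `lam ** age` (with a forgetting factor `lam` between 0 and 1) are assumptions made for illustration; the text above only states that older observations receive smaller weights.

```python
def window_weights(n_obs, window):
    """Shifting window: weight 1 for the `window` most recent observations, 0 otherwise."""
    return [1.0 if n_obs - 1 - i < window else 0.0 for i in range(n_obs)]


def exponential_weights(n_obs, lam):
    """Forgetting factor (assumed form): an observation `age` steps old gets lam ** age."""
    return [lam ** (n_obs - 1 - i) for i in range(n_obs)]


def decayed_count(labels, target, lam):
    """Weighted count of `target`, maintained incrementally one example at a time."""
    count = 0.0
    for label in labels:
        # Decay everything seen so far, then add the newest observation with weight 1.
        count = lam * count + (1.0 if label == target else 0.0)
    return count
```

With a forgetting factor, the weighted statistic can be maintained without storing past examples, which is why `decayed_count` needs only the previous value and the newest label.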
The work is organized as follows. In Sect. 2, the review of existing solutions is presented. In Sect. 3, the problem is formally stated. In Sect. 4, the proposed algorithm is outlined. In Sect. 5, the forgetting mechanism for the proposed algorithm is described. In Sect. 6, two empirical studies are presented: (i) based on the Electricity benchmark data set [28], (ii) based on real-life measurements of diabetics [63]. Section 7 concludes the paper.

Related work
There are many knowledge extraction methods that represent knowledge in a way easily understandable for a human being, such as rules inducers, for example, AQ [46], CEA [47], CN2 [15], RIPPER or its modifications [55], and tree inducers, for example, CART [9], ID3 [56], C4.5 [57]. All of them apply the batch learning paradigm, which means that the knowledge is extracted based on the entire set of training examples.
However, in the presence of data stream and an unknown, changing context, so-called hidden context [29,67], the batch learning fails. In the batch learning, a growing amount of training data increases the processing time and decreases the available memory space. Furthermore, even if all examples could be handled by the system, the knowledge extracted from past data is hardly valid and useful at the moment of decision-making. For this reason, an incremental learning paradigm should be applied. Additionally, in the presence of the changing context, a forgetting mechanism has to be used.
One of the first algorithms for rules induction with incremental learning and forgetting was introduced by Maloof and Michalski called AQ with Partial Memory (AQ-PM) [45]. Their approach uses the AQ algorithm [46] as a rules inducer but examples in the learning sequence are selected. The selection is made in two ways. First, only recent examples are maintained (explicit forgetting). Second, from the maintained examples only those are kept that were incorrectly classified by the knowledge (implicit forgetting). Then, a new knowledge is extracted from examples in the partial memory, that is, recent selected examples. The idea of AQ-PM is very interesting even though the performance of the method is unsatisfactory.
Recently, some other modifications of AQ algorithm were proposed [61]. The first modification uses the explicit forgetting only if the classification accuracy drops significantly. The second modification induces rules based on the shifting window at each learning step and a validation step is added: rules are removed if their evaluation value is below given threshold. This is another way of applying the implicit forgetting but on the model itself (not only on examples as in AQ-PM). In comparison with AQ-PM, both mentioned AQ modifications performed slightly better but still rather mediocre.
Another approach, called FLORA (an acronym standing for FLOating Rough Approximation), was presented by Widmer and Kubat [67]. The basics of the method were originally introduced by Kubat [40] with the idea to use the rough set theory for inducing rules. The model consists of descriptions for positive, negative and noisy examples. Each description is parameterized by a coverage measure [30]. To keep the knowledge up to date, the explicit forgetting is used to remove older examples and the implicit forgetting to change the coverage measure values. Additionally, a heuristic is applied to change the size of the shifting window. The problem of the FLORA methods is their high computational complexity, which precludes their further practical usage.
For building decision trees, two approaches are particularly noteworthy. The first approach, called SPLICE [29], uses temporal batch learning. The data stream is divided into clusters, and then the C4.5 algorithm is applied to extract knowledge on each cluster. A similar idea was proposed in [25], where training examples are divided into clusters and rules are induced from each cluster. However, SPLICE works off-line and can be used only to analyze historic data. The second approach, called CVFDT [32], applies the explicit learning for changing the tree structure by repeatedly applying the VFDT algorithm [19]. The algorithm formulates an alternative subtree for each attribute having a relatively high information gain and replaces the old subtree when a new one becomes more accurate.
An original approach is to use a graph structure as in the OLIN algorithm [42]. The idea is to build an information-fuzzy network (IFN) to learn and track the changing context. Next, to obtain rules, an additional method for rules induction from the IFN has to be applied [43]. Each layer in the IFN corresponds to an attribute, and layers consist of nodes that contain values of previous attributes plus values of the attribute in a given layer. Thus, the last layer represents nodes with values of different attributes, which are connected with decisions. An edge in the network represents a connection between two attributes. The decision about merging two attributes is based on an information-theoretic measure. To follow context changes, a shifting window is used. OLIN is a novel approach from the rules induction perspective, since it uses a graph-based representation to maintain the knowledge. However, knowledge updating is accomplished by a graph structure modification that entails possible knowledge loss (examples are not aggregated anywhere).
Another idea connected with graphical representation of rules is to use Pawlak's flow graphs [52]. Rules are represented as a flow network and flows in the network correspond to strength of rules. Moreover, Pawlak has shown several expressions that bound the accuracy and coverage with the form of Bayes'-like theorem but with a truth level interpretation instead of probabilities. Pawlak's approach is very interesting and gives another perspective at rules in terms of network flows. Nevertheless, flow graphs are used rather to represent rules than to aggregate data. Therefore, in the form they are given, they cannot be applied to rules induction successfully.
However, the OLIN algorithm and the flow graphs can be regarded as a part of a fast developing subfield of graph-based machine learning and knowledge extraction methods [16,31,66]. Graph-based methods were used for rules induction [54,58], clustering [23], substructure detection [16,34]. However, for such methods, there is still a need for developing proper forgetting mechanisms and a method for applying incremental learning.
A lot of effort has been made to handle data streams and the hidden context. However, all approaches mentioned above have one substantial disadvantage: the performance is still unsatisfactory. Moreover, due to the non-parametric character of the rule-based model, the forgetting is performed on the model structure, not on parameters. This is a huge obstacle because not only the knowledge has to be kept in memory but the examples as well. If any changes are made in the model, for example, some rules are removed, the lost knowledge cannot be restored. Hence, it is a great challenge to propose a method of aggregating data such that the learning phase is conducted in polynomial time and the forgetting is performed on parameters, not on the model itself. Such forgetting would give a fast and easy method of adapting to context changes and would not force any additional procedures for handling and maintaining examples. Furthermore, if examples are aggregated, the knowledge can even be completely removed and later restored from the aggregated data at any time.

Knowledge extraction problem statement
In general, the problem of knowledge extraction (or learning) is considered a problem of finding an unknown dependency of an analyzed system using a limited number of observations [12,14,17,64].
The knowledge extraction from examples can be described using the following components [14,64]:
(i) A generator of random input vectors $u = [u_1\ u_2\ \ldots\ u_D] \in U$, drawn independently from a fixed probability distribution function $P(u \mid c_n)$, which is unknown.
(ii) An object that returns an output value $y \in Y$ for a given $u$, according to the fixed conditional probability density function $P(y \mid u, c_n)$, which is also unknown.
(iii) A learning algorithm that is capable of implementing a model, which maps an input $u$ to a predicted output $\bar{y}$.
(iv) An environment (or context), $c_n \in C$, that influences the input generator and the object and is assumed to be unknown and unobservable.
Remark The term model and knowledge are very often used interchangeably. However, knowledge refers to the model with determined structure or/and parameters.
In the considered domain of diabetes, the input represents quantities that describe a patient and a measurement, for example, weight, pressure, part of a day, day of a week, information about a meal. The output can represent the level of glucose in blood or whether it is within the norm or not. In further considerations, it is assumed that $Y = \{-1, 1\}$, where $y = 1$ and $y = -1$ denote normal and abnormal levels of glucose in blood, respectively. Moreover, it is assumed that all input values are nominal (discrete), that is, for all $d = 1, 2, \ldots, D$, $u_d \in U_d$ and $\mathrm{card}\{U_d\} = K_d < \infty$. For further clarity, let $K$ denote the sum of all $K_d$, that is, $K = \sum_{d=1}^{D} K_d$.
Moreover, as previously mentioned, probabilities for the inputs and the output are unknown. Only a training sequence of N examples, $(u_1, y_1), (u_2, y_2), \ldots, (u_N, y_N)$ (1), is given, drawn from a distribution $P(u, y \mid c_n) = P(u \mid c_n) \cdot P(y \mid u, c_n)$. The examples in the class $y_n = 1$ will be called positive examples and those in the class $y_n = -1$ negative examples. It is worth stressing that examples are drawn from a distribution dependent on the context. The learning algorithm is supposed to choose the best model for the given training sequence, and it has to follow the changing context. Hence, it should choose a sequence of models (knowledge) over a period when M values of the context appear, one model for each context value. The sequence of models has to minimize the quality criterion (2), a functional built from the expected value $E[\cdot]$ of a loss function $L(\cdot, \cdot)$. It is easy to notice that to minimize (2) it is enough to minimize its component for each $m = 1, 2, \ldots, M$. Then, the Empirical Risk Minimization (ERM) principle can be applied [12,14,17,64], and the model is chosen according to an empirical criterion computed for each $m = 1, 2, \ldots, M$ from the $N_m$ examples observed in the mth context, where $\sum_{m=1}^{M} N_m = N$.
However, it is assumed that the context is unknown and unobservable. Therefore, the values of the context are unknown, and the moments of the context change are unknown as well. That is why, in order to use the ERM principle, an additional mechanism for context change detection has to be applied. Otherwise, it is impossible to minimize (2), and a substitute criterion, the so-called prequential error [22], has to be considered. Then, the problem is to find a sequence of N models, one for each time step, that minimizes the prequential error (6). To solve the problem, an algorithm with incremental learning can be proposed. There are the following propositions:
1. Algorithms with temporal batch learning for the mth context.
2. Algorithms with incremental learning and a shifting window, in which a learning algorithm $G_2$ updates the knowledge at the time step n using the L recent observations, $U_{n-L:n}$ and $y_{n-L:n}$.
3. Algorithms with incremental learning and a forgetting factor, in which a learning algorithm $G_3$ updates the knowledge at the time step n using the recent observations $u_n$ and $y_n$.
The algorithms of the first type need an additional method for context change detection, but they solve the problem in the form (2) using the ERM principle. The algorithms of the second type update the knowledge based on the shifting window, that is, the L recent examples. The algorithms of the third type update the knowledge using the most recent example with weight 1 and the remaining examples with exponentially decreasing weights. Both the algorithms of the second and the third type solve the problem using the prequential error (6).
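The evaluation scheme behind the prequential error can be sketched as follows. This is a minimal illustration, assuming the common test-then-train reading with the 0-1 loss averaged over the stream; the class `ForgettingMajority` is a hypothetical toy learner with a forgetting factor `lam`, not the algorithm proposed in this paper.

```python
class ForgettingMajority:
    """Toy learner: predicts the class with the larger exponentially decayed count."""

    def __init__(self, lam=0.9):
        self.lam = lam
        self.counts = {-1: 0.0, 1: 0.0}

    def predict(self, u):
        # The input u is ignored by this toy model; ties go to class 1.
        return 1 if self.counts[1] >= self.counts[-1] else -1

    def update(self, u, y):
        # Older observations are down-weighted by lam, the newest one gets weight 1.
        for label in self.counts:
            self.counts[label] *= self.lam
        self.counts[y] += 1.0


def prequential_error(stream, model):
    """Average 0-1 loss when each example is predicted before it is learned."""
    errors = 0
    n = 0
    for u, y in stream:
        errors += int(model.predict(u) != y)  # test first ...
        model.update(u, y)                    # ... then train
        n += 1
    return errors / n if n else 0.0
```

On a stream whose class flips halfway (three examples of class 1 followed by three of class −1), this toy learner with `lam=0.5` makes one error, while with `lam=1.0` (no forgetting) it makes three, which illustrates why forgetting helps when the context changes.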

Rule-based representation
As previously mentioned, to support the anamnesis, the knowledge should be expressed in an easily understandable way. Therefore, in further considerations, the set of models is represented by the attribute-value logic [11,46]. It means that there are input and output atomic formulae that correspond to the input variables and the output variable, respectively. An input atomic formula is denoted by $\alpha^d_k$ = "$u_d = k$" and means: "the dth input equals k", $k \in U_d$. An output atomic formula is denoted by $\alpha^{out}_l$ = "$y = l$" and means: "the output equals l", $l \in Y$. Further, the atomic formulae are connected by logical operators such as ∧, ∨, and ⇒. The input formulae are in 1-CNF (Conjunctive Normal Form) [37], which is a conjunction (logical operator and) of no more than one input formula concerning each input. Then, a decision rule is an implication:
$$\varphi_{in} \Rightarrow \varphi_{out}, \tag{7}$$
where $\varphi_{in}$ is the condition, which is the 1-CNF expression of input formulae, and $\varphi_{out}$ is the decision, which is a single output formula. A decision rule is denoted by $\varphi$.
The rule-based knowledge is a disjunction of rules:
$$\varphi_1 \lor \varphi_2 \lor \ldots \lor \varphi_J, \tag{8}$$
where J is the number of rules. Such models are called D-DNF (Disjunctive Normal Form for D inputs) [37], in which expressions in 1-CNF with the decision are connected by disjunctions (logical operator or). Hence, the set of models is in the D-DNF form.
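The representation can be sketched in code as follows. This is only an illustration: the dictionary-based encoding of a condition, the function names, the example rule set `RULES`, and the choice of returning the decision of the first covering rule are assumptions, not part of the formal definition.

```python
def covers(condition, u):
    """A condition maps an input index d to a required value k (at most one per input)."""
    return all(u[d] == k for d, k in condition.items())


def classify(rules, u, default=None):
    """Return the decision of the first rule whose condition covers u (assumed policy)."""
    for condition, decision in rules:
        if covers(condition, u):
            return decision
    return default


# Hypothetical rule set over two inputs (indices 0 and 1), for illustration only.
RULES = [
    ({0: "a", 1: 1}, 1),   # u_1 = a AND u_2 = 1  =>  y = 1
    ({0: "b"}, -1),        # u_1 = b              =>  y = -1
    ({1: 2}, -1),          # u_2 = 2              =>  y = -1
]
```

Because a dictionary holds at most one value per key, each condition automatically contains no more than one atomic formula per input, which mirrors the 1-CNF restriction.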
It is also worth noting that the rule-based knowledge representation is a discriminative type of model. In contrast to generative models, discriminative models are unable to generate input values; for a given input, they only return an appropriate output [6].
Moreover, for the rule-based knowledge representation treated as a classifier, a Vapnik-Chervonenkis dimension (VC-dim) can be given [64]. For rules expressed in the attribute-value logic, the value of the VC-dim is given in [2]. This result entails a big capacity of a rule-based classifier. In other words, there is a threat of an excessive adjustment to data, so-called overfitting. Therefore, in formulating an algorithm for rules induction, a method of regularization should be proposed.

Problem of forgetting in rule-based models
In the presence of a changing hidden context, the rule-based model has to be updated and validated continuously. This is accomplished by applying a forgetting mechanism and the incremental learning paradigm. However, there are two issues that need to be considered. First, the rule-based models are non-parametric. In such a case, the forgetting has to be conducted directly on the model structure. Removing parts of the knowledge leads to an irreversible loss of information. Second, the problem of inducing an optimal set of rules is proved to be NP-Complete [2]. Therefore, many algorithms are insufficient from a computational perspective, especially when data streams must be processed.
A solution to both problems can be to propose a method of parameterization. However, such parameterization needs to fulfill two restrictions: (i) parameters should reflect data (data aggregation), (ii) rule-based model should be easily induced based on parameters. Hitherto, according to the literature, there are some propositions of parameterization, as in FLORA. Nevertheless, they do not meet mentioned restrictions.
Hence, in this paper, a new way of parameterization is proposed. The idea is to use graph structures that are parameterized. Nodes of graphs are associated with input and output formulae, and arcs denote logical relations. Weights on graph's arcs reflect occurrences of examples. Consequently, the graphs are used for rules induction.
The idea of the approach is derived mainly from three concepts. The first one is the OLIN algorithm, which applies graphs to represent rules [42]. The second one is the work of Georgii et al., who use a graph as a search space and an inverse search method for finding clusters [23]. Last but not least, the presented approach is strongly influenced by Pawlak's flow graphs [52].

Preliminaries
Prior to discussing formal details, a single rule should be considered. The decision rule consists of a condition, which is in 1-CNF, and a decision. Moreover, the condition is a conjunction of input formulae, and the decision is an output formula. Thus, the rule can be represented as a simple graph in which edges connecting input formulae are identified with the operator ∧, and an edge between any input formula and the output formula means ⇒. Since the and operator is commutative, the inputs can be ordered in a chosen manner.
Example 4.1 Consider an object with two inputs, $u_1 \in \{a, b\}$ and $u_2 \in \{1, 2\}$, and one output, $y \in \{-1, 1\}$. Assume that the object can be described by the following rules:
$\varphi_1$: $\alpha^1_a \land \alpha^2_1 \Rightarrow \alpha^{out}_1$,
$\varphi_2$: $\alpha^1_b \Rightarrow \alpha^{out}_{-1}$,
$\varphi_3$: $\alpha^2_2 \Rightarrow \alpha^{out}_{-1}$.
Then, all possible rules are given in Fig. 1c, the rules for the class 1 in Fig. 1a, and the rules for the class −1 in Fig. 1b. The arcs corresponding to the and logical operator in a rule are dotted. The arcs between input formulae and the output formula are denoted by a double line. Then, the rule $\varphi_1$ is represented in Fig. 1a. The conjunction of $\alpha^1_a$ and $\alpha^2_1$ is the dotted arc, and the implication is the double line. However, for the rule $\varphi_2$, there is a direct connection between $\alpha^1_b$ and an output (similarly, for the rule $\varphi_3$, between $\alpha^2_2$ and an output); therefore, these arcs denote implications.
Hence, any rule-based model can be represented by a graph [18] $G = (V, A)$, where $V$ denotes a set of vertices, that is, input and output formulae, and $A$ denotes a set of arcs. Furthermore, each example can be seen as the most specific rule that has D conditions and a decision. Therefore, examples can be represented by graphs. Let us introduce graphs for positive examples, denoted by $G_+$, and for negative ones, denoted by $G_-$. Both graphs have the same set of vertices, but they can have different sets of arcs. Moreover, weights are associated with arcs in the positive graph, $w_+$, and similarly in the negative graph, $w_-$.
Recall that a condition in 1-CNF contains at most one input formula for each input and no or operators. Hence, only the connections between two different inputs (layers) are possible. Therefore, the considered graph representing rules is layered, directed, and acyclic. Each layer consists of input formulae for one input, and the following order is chosen: the first layer corresponds to an input with the smallest number of values, the second to an input with the next smallest number of values, and so on up to the Dth layer. In other words, inputs are ordered so that $K_{(1)} \le K_{(2)} \le \ldots \le K_{(D)}$, where the lower index in parentheses denotes the ordered input number. The last layer, that is, the (D + 1)th layer, contains only an output formula. For further simplicity, let us assume that a vertex associated with the input formula $\alpha^d_k$ is denoted by $v^d_k$, and a vertex associated with the output formula by $v^{out}_l$. The arc connecting two vertices, i in the sth layer and j in the tth layer, is denoted by $a^{s \to t}_{i,j}$; for example, the arc between $v^1_b$ and $v^{out}_{-1}$ is $a^{1 \to out}_{b,-1}$. A weight is associated with each arc: $w^{s \to t}_{+,i,j}$ for the positive graph and $w^{s \to t}_{-,i,j}$ for the negative graph. Normally, a graph is represented by an adjacency matrix [18]. However, because the considered graph is layered, directed and acyclic, it is enough to keep fewer than $K^2$ weights of arcs.
The following lemma (Lemma 4.1) can be given for the number of arcs in the graph representing the rule-based model. Each vertex in the sth layer can be connected with every vertex in the subsequent layers and with the output vertex. Hence, there are
$$\kappa = \sum_{s=1}^{D} K_{(s)} \Bigl( 1 + \sum_{t=s+1}^{D} K_{(t)} \Bigr)$$
arcs. This expression can be simplified into the following form:
$$\kappa = K + \tfrac{1}{2} \Bigl( K^2 - \sum_{d=1}^{D} K_d^2 \Bigr).$$
Lemma 4.1 shows that instead of $K^2$ entries in the adjacency matrix, only $\kappa$ arcs need to be kept. Therefore, each graph representing a rule-based model is coded using $\kappa$ parameters as follows:
$$\mathrm{code} = [\mathrm{code}(\mathrm{layer}_1)\ \mathrm{code}(\mathrm{layer}_2)\ \ldots\ \mathrm{code}(\mathrm{layer}_D)], \tag{11}$$
where $\mathrm{code}(\mathrm{layer}_d)$ is a code (0-1 sequence) for a single layer, for $d = 1, 2, \ldots, D$. For example, the graphs from Fig. 1a, b are the following: [1 0 0 | 0 0 0 | 1 0] and [0 0 0 | 0 0 1 | 0 1]. For instance, the first code states that the atomic formula $\alpha^1_a$ is connected with $\alpha^2_1$ and not with $\alpha^2_2$ and $\alpha^{out}_1$; the formula $\alpha^1_b$ is connected with no formulae; and from the second layer, there exists only the connection between $\alpha^2_1$ and $\alpha^{out}_1$.
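The size of the code and the position of each arc in it can be sketched as follows. The sketch assumes that every vertex may be connected with every vertex in the later layers and with a single output vertex; the tuple layout `(s, i, t, j)` for an arc, with `t = "out"` for the output vertex, and all function names are illustrative choices.

```python
def order_layers(sizes):
    """Order inputs by their number of values, smallest first."""
    return sorted(range(len(sizes)), key=lambda d: sizes[d])


def num_arcs(sizes):
    """Count arcs from every vertex to all vertices in later layers and to the output."""
    ordered = sorted(sizes)
    total = 0
    for s, k_s in enumerate(ordered):
        later = sum(ordered[s + 1:])
        total += k_s * (later + 1)  # +1 for the arc to the output vertex
    return total


def arc_index(sizes):
    """Map each arc (s, i, t, j) to its position in the code; t is "out" for the output."""
    index = {}
    ordered = sorted(sizes)
    for s, k_s in enumerate(ordered):
        for i in range(k_s):
            for t in range(s + 1, len(ordered)):
                for j in range(ordered[t]):
                    index[(s, i, t, j)] = len(index)
            index[(s, i, "out", None)] = len(index)
    return index


def encode(arcs, sizes):
    """Return the 0-1 code of a graph given as a list of arcs."""
    index = arc_index(sizes)
    code = [0] * len(index)
    for arc in arcs:
        code[index[arc]] = 1
    return code
```

For two inputs with two values each, `num_arcs([2, 2])` gives 8 parameters instead of the 16 entries of an adjacency matrix over the K = 4 input formulae. A graph with one arc from the first value of the first input to the first value of the second input, and one arc from the latter to the output, is encoded as `[1, 0, 0, 0, 0, 0, 1, 0]`.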

Learning and rules induction
As mentioned before, each example can be seen as the most specific rule. Thus, an example can be coded using (11). If the nth example is $(u_n, y_n)$, then the weights are updated as follows:
$$w_{y_n} := w_{y_n} + \mathrm{code}(u_n), \tag{12}$$
$$w^{out}_{y_n} := w^{out}_{y_n} + 1. \tag{13}$$
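The incremental update can be sketched as follows. This is one possible reading rather than the definitive coding: it assumes that an example is coded like the most specific rule, that is, as a chain of arcs through consecutive layers that ends in the output vertex. The class `ArcWeights` and the arc layout `(s, i, t, j)` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ArcWeights:
    """Per-class arc counters; a sketch of the updates (12) and (13)."""

    def __init__(self):
        self.w = {1: defaultdict(float), -1: defaultdict(float)}
        self.w_out = {1: 0.0, -1: 0.0}

    @staticmethod
    def code(u):
        """Assumed coding: a chain through consecutive layers ending in the output."""
        arcs = [(d, u[d], d + 1, u[d + 1]) for d in range(len(u) - 1)]
        arcs.append((len(u) - 1, u[-1], "out", None))
        return arcs

    def update(self, u, y):
        for arc in self.code(u):
            self.w[y][arc] += 1.0   # w_y := w_y + code(u)
        self.w_out[y] += 1.0        # w_out_y := w_out_y + 1
```

Each update touches only the arcs of the current example, so the cost per example does not depend on the number of examples processed so far.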
This can be done because the codes for the weights and for the example are of the same length and reflect the same inputs. Besides, it is worth noting that the weights aggregate data in an incremental manner. Moreover, having the values of the weights, it is possible to restore the whole training set.
First, we propose a criterion for evaluating pairs of atomic formulae, and then for evaluating paths that end in the final vertex, that is, rules. In the machine learning literature on rules induction, the coverage and accuracy measures are used [55]. The coverage measure expresses the generality of the rule, while the accuracy measure expresses the specialization of the rule. For example, the rule $\alpha^1_a \Rightarrow \alpha^{out}_1$ is more general than $\alpha^1_a \land \alpha^2_1 \Rightarrow \alpha^{out}_1$. Let $E_{\varphi_{in}}$ denote the set of all examples covered by the condition $\varphi_{in}$ of a rule, and $E_l$ the set of all examples in the class $y = l$. Then the coverage measure is defined as follows:
$$\mu_c(\varphi_{in}, l) = \frac{\mathrm{card}\{E_{\varphi_{in}} \cap E_l\}}{\mathrm{card}\{E_l\}}. \tag{14}$$
The accuracy measure is expressed in the following way:
$$\mu_a(\varphi_{in}, l) = \frac{\mathrm{card}\{E_{\varphi_{in}} \cap E_l\}}{\mathrm{card}\{E_{\varphi_{in}}\}}. \tag{15}$$
When the above expressions are undetermined, they are set to zero. It is worth realizing that accuracy can be determined by coverage [52]:
$$\mu_a(\varphi_{in}, l) = \frac{\mu_c(\varphi_{in}, l)\, \mathrm{card}\{E_l\}}{\sum_{l' \in Y} \mu_c(\varphi_{in}, l')\, \mathrm{card}\{E_{l'}\}}. \tag{16}$$
Moreover, it is worth noting that for any pair of input formulae A and B, $\mu_c$ is anti-monotonic, that is,
$$\mu_c(A, y) \ge \mu_c(A \land B, y), \tag{17}$$
and $\mu_a$ is monotonic, that is,
$$\mu_a(A, y) \le \mu_a(A \land B, y). \tag{18}$$
Both properties follow from the definitions. The set of examples covered by a single formula A is the same as or larger than the set covered by $A \land B$, that is, $\mathrm{card}\{E_A\} \ge \mathrm{card}\{E_{A \land B}\}$. Hence, by adding a formula, the numerator of the coverage can decrease while the denominator is constant, so the value of coverage can decrease or remain the same. In accuracy, however, both the numerator and the denominator can decrease. But the denominator decreases at a faster pace than the numerator because $\mathrm{card}\{E_{A \land B} \cap E_y\} \le \mathrm{card}\{E_{A \land B}\}$, and thus the value of accuracy can increase or remain unchanged. However, coverage and accuracy are only suitable to measure the generalization and specialization of a rule separately, whereas the goal in the rules induction is to obtain the knowledge that is the most accurate generalization of examples [47]. Hence, to reach a balance between generalization and specialization in a synthetic criterion, the following convex combination of (14) and (15) can be proposed:
$$q(\varphi, l) = \beta\, \mu_c(\varphi_{in}, l) + (1 - \beta)\, \mu_a(\varphi_{in}, l), \tag{19}$$
where $\beta \in [0, 1]$ determines the balance between the generality of rules, expressed by $\mu_c$, and their specificity, expressed by $\mu_a$.
Consequently, having weights for both positive and negative graphs, the coverage and accuracy can be calculated for a single arc $a^{s \to t}_{i,j}$ in the class $y = l$. Denoting by $w^{s \to t}_{l,i,j}$ the weight of the arc $a^{s \to t}_{i,j}$ in the class $y = l$, and by $w^{out}_l$ the number of occurrences of the class $y = l$, the coverage of an arc is defined as follows:
$$\mu_c(a^{s \to t}_{i,j}, l) = \frac{w^{s \to t}_{l,i,j}}{w^{out}_l}. \tag{20}$$
Then, applying (16), we get the accuracy of an arc in the form
$$\mu_a(a^{s \to t}_{i,j}, l) = \frac{w^{s \to t}_{l,i,j}}{\sum_{l' \in Y} w^{s \to t}_{l',i,j}}. \tag{21}$$
Next, it is important to evaluate the quality of a path, which is equivalent to evaluating a rule. The path $\pi$ of a length no greater than D is a sequence of distinct vertices with the beginning in $v^s_k$ and the end in $v^{out}_l$ such that from each of its vertices there is an arc to the next vertex in the sequence. Obviously, the vertices in the path belong to distinct layers.
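The coverage, the accuracy and their convex combination can be computed directly from a set of examples, as in the following minimal sketch. The dictionary encoding of a condition, the function names, and the placement of `beta` on the coverage term are assumptions made for illustration; undetermined ratios are set to zero as stated above.

```python
def covers(condition, u):
    """A condition maps an input index d to a required value k."""
    return all(u[d] == k for d, k in condition.items())


def coverage(condition, label, examples):
    """Fraction of the examples of class `label` that the condition covers."""
    in_class = [u for u, y in examples if y == label]
    if not in_class:
        return 0.0
    return sum(covers(condition, u) for u in in_class) / len(in_class)


def accuracy(condition, label, examples):
    """Fraction of the covered examples that belong to class `label`."""
    covered = [y for u, y in examples if covers(condition, u)]
    if not covered:
        return 0.0
    return sum(y == label for y in covered) / len(covered)


def quality(condition, label, examples, beta=0.5):
    """Convex combination of coverage and accuracy (beta weighs coverage here)."""
    cov = coverage(condition, label, examples)
    acc = accuracy(condition, label, examples)
    return beta * cov + (1.0 - beta) * acc
```

For instance, on five examples in which the value a of the first input occurs three times in the class 1 and once in the class −1, the condition "first input equals a" has coverage 1 and accuracy 0.75 for the class 1; adding a second formula to the condition cannot raise the coverage.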
To calculate the quality criterion of the path, the anti-monotonicity of $\mu_c$ is used to obtain the coverage of the path (22). Then, having the coverage of the path, the accuracy of the path (23) can be calculated using (16). Thus, having negative and positive weights and a given $\beta$, it is possible to calculate the criterion (19) for any path in the graph. Now, all rules with the value of the criterion (19) larger than 0, that is, $q(\varphi, l) > 0$, can be generated. However, in this approach, the total number of $2^K$ rules has to be considered, and the knowledge, even if the rules are ordered according to the quality criterion, is rather useless. Therefore, the search space should be limited.
The idea of limiting the search space is to create a new graph in which an arc is either negative or positive. A rule is formulated only if all the arcs in the path are either positive or negative. First, the negative and positive graphs are calculated, that is, the weights $w_-$, $w_+$, $w^{out}_-$, $w^{out}_+$. Next, for each arc, the quality criterion (19) is calculated in the positive and negative graphs. It results in obtaining different weights for the negative and positive graphs, denoted by $q_-$ and $q_+$, respectively. Finally, a graph that determines the search space is computed as the difference between $q_+$ and $q_-$. The difference is denoted by $q$. Then, all arcs in the graph coded by $q$ are either positive or negative. Each path that has only positive or only negative weights is regarded as an admissible rule.
The procedure for determining the search space is described in Algorithm 4.1.
Algorithm 4.1
Input: [...] (iv) coverage (20), (v) accuracy (21).
Output: The graph that defines the search space.
Step 1: Calculate the weights $w_+$, $w_-$, $w^{out}_+$, $w^{out}_-$ of the graphs $G_+$ and $G_-$ using (12) and (13).
Step 2: For the given $\beta$, for each arc in the graphs $G_+$ and $G_-$, calculate the quality criterion (19) using (20) and (21). Denote those weights by $q_+$ and $q_-$ for $G_+$ and $G_-$, respectively. (Both vectors are of size $\kappa$.)
Step 3: Calculate the difference between $q_+$ and $q_-$, that is:
$$q = q_+ - q_-. \tag{24}$$
The rationale behind (24) is to classify an arc to one and only one output value. The graph represented by the code $q$ contains the final output vertex that is neither positive nor negative.
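The last two steps of the procedure can be sketched as follows, given already aggregated weights. This is a minimal illustration under stated assumptions: the weights are stored in dictionaries keyed by an arbitrary arc identifier, `beta` is taken to weigh the coverage term, and undetermined ratios are set to zero; the function names are hypothetical.

```python
def arc_quality(w_l, w_other, w_out_l, beta):
    """Criterion for one arc in one class, computed from its weights."""
    cov = w_l / w_out_l if w_out_l > 0 else 0.0   # arc coverage
    total = w_l + w_other
    acc = w_l / total if total > 0 else 0.0       # arc accuracy
    return beta * cov + (1.0 - beta) * acc


def search_space(w_pos, w_neg, w_out_pos, w_out_neg, beta=0.5):
    """Return q = q_plus - q_minus for every arc seen in either graph."""
    q = {}
    for arc in set(w_pos) | set(w_neg):
        wp = w_pos.get(arc, 0.0)
        wn = w_neg.get(arc, 0.0)
        q_plus = arc_quality(wp, wn, w_out_pos, beta)
        q_minus = arc_quality(wn, wp, w_out_neg, beta)
        q[arc] = q_plus - q_minus
    return q
```

An arc that occurs mostly in positive examples obtains a positive value of q, an arc that occurs mostly in negative examples obtains a negative value, and the sign decides to which class the arc is assigned.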
Hence, having the search space defined by $q$, it is clear that not all paths are allowed, because some paths contain both positive and negative arcs. Additionally, we are interested only in rules for which the quality criterion is not lower than a given value $\theta$, that is, $q(\varphi, l) \ge \theta$, where $\theta \in [0, 1)$ is a threshold. Hence, the final algorithm for rules induction, called Graph-based Rules Inducer (GRI), can be proposed (Algorithm 4.2). It is worth noting that the procedure starts from the final layer, that is, the (D + 1)th layer, and then proceeds to the first one. $P_+$ and $P_-$ denote the sets of all possible paths built only from positive and negative arcs, respectively.
Step 3: For each $\pi \in P_+$ such that $\pi = (a^{b \to \cdot}_{j,\cdot}, \ldots)$ and for all $i = 1, \ldots, K_d$, consider the path $\pi' = (a^{d \to b}_{i,j}, a^{b \to \cdot}_{j,\cdot}, \ldots)$. If $q^{d \to b}_{i,j} > 0$, then $P_+ := P_+ \cup \{\pi'\}$. For each $\pi \in P_-$ such that $\pi = (a^{b \to \cdot}_{j,\cdot}, \ldots)$ and for all $i = 1, \ldots, K_d$, consider the path $\pi' = (a^{d \to b}_{i,j}, a^{b \to \cdot}_{j,\cdot}, \ldots)$. If $q^{d \to b}_{i,j} < 0$, then $P_- := P_- \cup \{\pi'\}$.
In Algorithm 4.2, in the first step, the graph determining the search space is obtained. In order to create only the paths that are coupled with a proper rule, that is, with the sign of the path, the algorithm runs backward, that is, from the final vertex to the first layer. Despite the fact that the number of paths in both sets might grow exponentially with respect to the number of features, in many practical cases the number of paths seems to be reasonable. Nevertheless, formulating a rule-based model in problems with many inputs might be intractable; therefore, some heuristics should be proposed. On the other hand, in many cases, for example, in diabetes treatment, where human health and life are at stake, the accuracy of the rules should be as high as possible. Then, all possible paths ought to be checked and evaluated.
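The backward construction of sign-consistent paths can be sketched as follows. It is an illustration of the idea, not a complete implementation of the GRI algorithm: it assumes that the path sets are initialized with the arcs entering the output vertex, that `q` is a dictionary keyed by arcs `(s, i, t, j)` with `t = "out"` for the output vertex and layers numbered from 0, and it omits the final filtering by the threshold. The function name is hypothetical.

```python
def sign_consistent_paths(q, num_layers, sign):
    """Enumerate paths to the output whose arcs all have the given sign (+1 or -1)."""

    def ok(value):
        return value * sign > 0

    # Start from the arcs that enter the output vertex.
    paths = [[arc] for arc, value in q.items() if arc[2] == "out" and ok(value)]
    # Walk backward: prepend arcs leaving layer d to paths that start in a later layer.
    for d in range(num_layers - 2, -1, -1):
        extended = []
        for path in paths:
            b, j = path[0][0], path[0][1]  # first vertex of the existing path
            if b <= d:
                continue
            for arc, value in q.items():
                if arc[0] == d and arc[2] == b and arc[3] == j and ok(value):
                    extended.append([arc] + path)
        paths.extend(extended)
    return paths
```

Because a path is extended only by an arc of the same sign, a negative arc that leads to a vertex whose outgoing arc is positive never produces a rule, which is exactly the restriction imposed by the search space.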

Remarks
1. Graphs G− and G+ can be regarded as a special kind of Pawlak's flow graphs. However, in the presented approach, each vertex is connected with a single atomic formula, not with a conjunction of atomic formulae as in the original flow graphs. Nevertheless, similarly to Pawlak's flow graphs, the weights associated with arcs of the positive and negative graphs can be deemed as information flow between layers. The amount of information inflow of the dth layer in the class y = l, f_in(d, l), equals the amount of information outflow from the layer, f_out(d, l); see (25) and (26). Because of the way the weights are updated (12), it can be checked that f_in(d, l) = f_out(d, l). Equations (25) and (26) can be considered as flow conservation equations [53].
2. The limitation of the search space applied in the GRI algorithm matters in two ways. First, only either positive or negative paths are allowed. Second, not all models can be obtained from such a structure. To provide evidence, let us refer to the example in Fig. 2. There are two inputs with three values each, that is, $\alpha^1_a$, $\alpha^1_b$, $\alpha^1_c$, and $\alpha^2_1$, $\alpha^2_2$, $\alpha^2_3$. The class 1 is depicted in gray and the class −1 in white (gray and white rectangles in Fig. 2a). However, Algorithm 4.1 does not allow both rules $\alpha^1_b \wedge \alpha^2_2 \Rightarrow \alpha^{out}_{1}$ and $\alpha^1_a \wedge \alpha^2_2 \Rightarrow \alpha^{out}_{-1}$ to be obtained, because it would mean that the arc in the graph q between $v^2_2$ and the final node $v^{out}_y$ is positive and negative at the same time. Thus, depending on the data, the knowledge can take a form similar to Fig. 2b. Therefore, the limitation of the search space makes the model robust to overfitting. However, if the true object description is as in Fig. 2a, then the obtained knowledge is less accurate. Nevertheless, in the case of changing context, the robustness is usually at the expense of accuracy.
3. The graph coded by q can be used as a classifier itself. The procedure for classification of a new example is as follows. First, all possible subpaths of the example (note: the example is treated as a path) should be found (see footnote 5). However, only subpaths that are either positive or negative are considered. Second, the subpath with the highest quality value is chosen. The sign of that subpath is returned as the class. When the GRI algorithm is used as a classifier, it is referred to as the GRI classifier. When only the class is needed, this simple procedure usually classifies an upcoming observation very fast.
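The classification procedure from the third remark can be sketched as below. The sketch assumes that a subpath is any non-empty subset of the example's (layer, value) pairs, which matches the count of 2^D − 1 subpaths given in the paper's footnote. The `signed_quality` callable and the values in `toy` are hypothetical stand-ins for the quality stored in the graph: a positive number for a positive subpath, a negative number for a negative one, and `None` for a mixed or unseen subpath.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterator, Optional, Sequence, Tuple

Subpath = Tuple[Tuple[int, Hashable], ...]  # ((layer, value), ...)


def subpaths(example: Sequence[Hashable]) -> Iterator[Subpath]:
    """Yield every non-empty subset of (layer, value) pairs: 2**D - 1 in total."""
    items = list(enumerate(example, start=1))
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def classify(
    example: Sequence[Hashable],
    signed_quality: Callable[[Subpath], Optional[float]],
) -> Optional[int]:
    """Return the sign (+1 or -1) of the subpath with the highest quality."""
    best: Optional[float] = None
    for subpath in subpaths(example):
        score = signed_quality(subpath)
        if not score:  # None or 0: mixed or unknown subpath, skipped
            continue
        if best is None or abs(score) > abs(best):
            best = score
    if best is None:
        return None  # no purely positive or negative subpath was found
    return 1 if best > 0 else -1


# Hypothetical quality table, used only for illustration.
toy = {
    ((1, "Sat"),): -0.6,
    ((1, "Sat"), (2, "morning")): -0.9,
    ((3, "before_meal"),): 0.7,
}
example = ["Sat", "morning", "before_meal"]
label = classify(example, toy.get)
count = len(list(subpaths(example)))
```

Enumerating all subsets is exponential in D, so the sketch is practical only for a small number of inputs, as in the three-attribute diabetes setting.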

Graph-based rules inducer with forgetting
The presented form of the GRI algorithm is a knowledge extraction method with incremental learning, as the weights are updated step by step. Moreover, the weights w− and w+ reflect data aggregation and thus examples can be restored at any time. However, the main goal of this paper is to update and validate knowledge according to the changing context. To follow context changes, a forgetting mechanism has to be proposed. The first idea is to apply a shifting window, but it would require the weight values to be recalculated constantly. The second proposition is the forgetting factor (exponential forgetting). Applying the forgetting factor, γ ∈ [0, 1], in the negative and positive graphs requires recalculating the weight values only once, before the updating phase. Then, if a new example (u_{n+1}, y_{n+1}) emerges, the updating formulas (12) and (13) can be rewritten as follows: if y_{n+1} = 1, the weights are updated using (27) and (28); if y_{n+1} = −1, they are updated using (31) and (32).

Footnote 5: For an observation of length D there are 2^D − 1 subpaths in total.
The forgetting factor γ is usually very close to 1, meaning that such an approach can be seen as a weighted shifting window of size approximately equal to 3/(1 − γ) [8]. All examples that fall outside the shifting window, that is, examples that arrived more than 3/(1 − γ) steps ago, influence the model with a weight smaller than 0.05.
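The 0.05 bound can be checked with a short derivation. It is a sketch that assumes an example seen k steps ago carries the weight γ^k and uses only the elementary inequality ln γ ≤ −(1 − γ) for 0 < γ < 1.

```latex
\[
\gamma^{k} = e^{k \ln \gamma}, \qquad k = \frac{3}{1-\gamma}, \qquad 0 < \gamma < 1 .
\]
\[
\ln \gamma \le -(1-\gamma)
\;\Longrightarrow\;
k \ln \gamma \le -3
\;\Longrightarrow\;
\gamma^{k} \le e^{-3} \approx 0.0498 < 0.05 .
\]
```

For instance, γ = 0.95 gives a window of k = 60 examples and 0.95^60 ≈ 0.046.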
Obviously, applying the forgetting factor precludes restoration of the original data from the graphs G+ and G−. However, it is a price that has to be paid for the context tracking.
Hence, to make Algorithms 4.1 and 4.2 capable of context tracking, exponential forgetting should be applied to the determination of the search space.
Step 0: Set n := 0.
Step 1: Set n := n + 1 and take an example (u_n, y_n) from the data stream. If the example is misclassified and y_n = 1, then update the weights using (27) and (28). If the example is misclassified and y_n = −1, then update the weights using (31) and (32). Otherwise, update the weights using (12) and (13).
Step 2: For a given value of β, for each arc in graphs G+ and G−, calculate the quality criterion (19) using (20) and (21). Denote those weights by q+ and q− for G+ and G−, respectively (both vectors are of size κ).
Step 3: Calculate the difference between q+ and q−, that is, (24).
The forgetting in Step 1 can be applied after each observation, not only when a misclassification occurs.
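A minimal sketch of exponential forgetting on aggregated arc weights is given below. It applies the forgetting after every observation, the variant mentioned in the last sentence above. The class name `ForgettingWeights` and the plain "discount by γ, then add 1" update are assumptions made for illustration; the paper's own update formulas (12), (13), and (27)-(32) are not reproduced here.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Hashable, Iterable


class ForgettingWeights:
    """Arc weights of the positive and negative graphs with exponential forgetting."""

    def __init__(self, gamma: float) -> None:
        if not 0.0 <= gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        self.gamma = gamma
        # One weight table per class: +1 (positive graph), -1 (negative graph).
        self.w: Dict[int, DefaultDict[Hashable, float]] = {
            1: defaultdict(float),
            -1: defaultdict(float),
        }

    def update(self, arcs: Iterable[Hashable], y: int) -> None:
        # Discount everything aggregated so far, once per new example ...
        for graph in self.w.values():
            for arc in graph:
                graph[arc] *= self.gamma
        # ... then add the new example to the graph of its class.
        for arc in arcs:
            self.w[y][arc] += 1.0


fw = ForgettingWeights(gamma=0.5)
fw.update(["a"], 1)   # w+[a] = 1.0
fw.update(["a"], 1)   # w+[a] = 0.5 * 1.0 + 1.0 = 1.5
fw.update(["b"], -1)  # w+[a] = 0.75, w-[b] = 1.0
```

With γ = 1 the tables reduce to plain counts, which corresponds to the algorithm without forgetting; with γ < 1 the original examples can no longer be restored from the weights.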
In Algorithm 4.2, the rules induction is performed based on the calculated q. It does not matter whether the first step of Algorithm 4.2 uses Algorithm 4.1 or Algorithm 5.1. The only difference is that the first approach works without forgetting (see Fig. 3a), whereas the second one works with a forgetting mechanism, and rules are induced only when the procedure is set to be run, e.g., after a fixed number of iterations (see Fig. 3b). Both versions of the GRI algorithm are schematically presented in Fig. 3. It is worth remembering that in Algorithms 4.1 and 5.1, the weights updating is performed in an incremental manner.
Moreover, the rules induction does not necessarily have to be conducted at each time step. For instance, in the diabetes treatment, the rules induction is applied only when the physician or the patient requests it, and not after each measurement.
The remarks given for the GRI algorithm (see Sect. 4.3) hold true for the GRI algorithm with forgetting.

Experimental results
To evaluate the GRI algorithm with forgetting for data streams in the presence of changing context, two experiments were conducted. The first one involves thirteen other methods and a benchmark data set, Electricity [28]. The second one is the application in the diabetes treatment and involves five other methods and a data set collected by Michael Kahn, M.D., Ph.D., and published in the UCI Machine Learning Repository [63].

Fig. 3 Schemes for the GRI algorithm with limiting search space: (a) without forgetting, (b) with forgetting

Electricity
The data used in the experiment were first described by Harries [28]. The data set was prepared based on the observations collected from the Australian New South Wales (NSW) Electricity Market. In this market, the prices are not fixed and they are affected by the demand and the supply of the market. There are several factors influencing the market demand, such as weather, time of day, district population density, and market expansion. In other words, this domain is known to exhibit seasonality and sensitivity to short-term events, for example, weather changes. Therefore, all influences can be treated as a hidden context that affects the market.
The Electricity data set (referred to as ELEC2 in the literature [28]) contains 45312 examples dated from May 1996 to December 1998. An example consists of the following inputs: u_1: day of the week, u_2: period of the day, u_3: NSW price, u_4: NSW demand, u_5: Victoria region price, u_6: Victoria region demand, and u_7: the scheduled electricity transfer between states. Because inputs 3-7 were numeric, discretization was applied. Finally, the numbers of input values were as follows: K_1 = 7, K_2 = 48, K_3 = 10, K_4 = 6, K_5 = 10, K_6 = 7, and K_7 = 7 (K = 95 values in total).
The output value identifies changes of the price related to a moving average of the last 24 hours. The class reflects deviations of the price from a one-day average and removes the impact of longer-term price trends. There are two possible output values: UP (y = 1), and DOWN (y = −1).
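The labelling rule can be illustrated with a short sketch. It assumes that one day consists of 48 periods (in line with K_2 = 48), that the moving average excludes the current price, and that a price strictly above the average counts as UP; the function name `price_direction` is hypothetical and the data set itself ships with ready-made labels.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def price_direction(prices: Iterable[float], window: int = 48) -> List[Optional[int]]:
    """Label each price UP (+1) or DOWN (-1) against the mean of the previous `window` prices."""
    recent: Deque[float] = deque(maxlen=window)
    labels: List[Optional[int]] = []
    for price in prices:
        if len(recent) < window:
            labels.append(None)  # not enough history for a full-window average
        else:
            average = sum(recent) / window
            labels.append(1 if price > average else -1)
        recent.append(price)  # the oldest price drops out automatically
    return labels


# Tiny illustration with a window of 2 periods instead of 48.
labels = price_direction([1.0, 2.0, 3.0, 4.0, 1.0], window=2)
```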
In the experiment, the following methods were compared with the GRI algorithm:
- IB1 (also known as 1-NN) with DDM (Drift Detection Method) [21] or EDDM (Early Drift Detection Method) [5]: a lazy classifier (the nearest neighbor) with a method for context change detection;
- J48 with DDM or EDDM: an implementation of the tree-based classifier C4.5 [57] with a method for context change detection;
- DWM-NB (Dynamic Weighted Majority of Naïve Bayes) [39]: an ensemble method of Naïve Bayes classifiers that uses the idea of weighted majority;
- PL-NB (Paired Learner of Naïve Bayes) [4]: a special kind of ensemble method of Naïve Bayes classifiers that uses two classifiers, one learning from all data and the other learning from data maintained in a shifting window;
- SWIM (Shifting WINNOW) [3]: a classifier that implements a Boolean threshold function (based on the WINNOW algorithm) but with a shifting window;
- NB (Naïve Bayes) classifier with exponential forgetting [62]: probabilities are estimated from a frequency matrix that aggregates data, and forgetting is conducted on that matrix;
- AQ-P1 [61]: the AQ algorithm with a shifting window;
- AQ-P2 [61]: the AQ algorithm with a shifting window and implicit forgetting on a model;
- CART [9] with shifting window: a classification tree that is induced from examples maintained in a shifting window;
- Random Forest [10] with shifting window: an ensemble classifier of classification trees that is learned from examples maintained in a shifting window;
- SVM [64] with shifting window: a Support Vector Machine classifier that is obtained from examples maintained in a shifting window; the kernel function (35) from [6] was applied, in which card{u ∩ v} denotes the number of conditions shared by rules u and v.
It is worth noting that only J48, CART, AQ-P1 and AQ-P2 are rule-based representations. All methods are evaluated using criterion (6). Furthermore, in all cases, the learning is conducted on the stream of data (examples arrive according to the original time stamps). A newly arriving example is first classified and then used for learning (the true output is given after classification).
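The test-then-train protocol described above can be sketched as follows. The sketch assumes that the evaluation criterion is the fraction of misclassified examples accumulated over the stream (a prequential error); the exact form of criterion (6) is not reproduced here. The names `prequential_error`, `predict`, and `learn`, and the trivial "repeat the last label" learner are hypothetical.

```python
from typing import Callable, Iterable, Tuple, TypeVar

U = TypeVar("U")


def prequential_error(
    stream: Iterable[Tuple[U, int]],
    predict: Callable[[U], int],
    learn: Callable[[U, int], None],
) -> float:
    """Classify each example first, then reveal its label and learn from it."""
    mistakes = 0
    total = 0
    for u, y in stream:
        if predict(u) != y:  # test on the not-yet-seen example
            mistakes += 1
        learn(u, y)          # only afterwards is the true output used
        total += 1
    return mistakes / total if total else 0.0


# Toy learner that always predicts the most recent label it has seen.
state = {"last": 1}
stream = [(None, 1), (None, 1), (None, -1), (None, -1), (None, 1)]
error = prequential_error(
    stream,
    predict=lambda u: state["last"],
    learn=lambda u, y: state.update(last=y),
)
```

Because every example is classified before it is learned from, no separate test set is needed and the error reflects how quickly a method adapts after a context change.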
To the best of the authors' knowledge, the methods SWIM, NB with forgetting, AQ-P1, AQ-P2, CART, Random Forest, and SVM have not yet been evaluated on the Electricity data set. All of the mentioned algorithms, as well as the GRI, were implemented in the Matlab ® environment. Results for the other methods are taken from the literature specified in Table 1.
Only the output value is important in this task; therefore, the GRI was used as a classifier (the procedure described in Sect. 4.3).

Results and discussion
The Electricity data set contains real-life data and the number of examples is sufficient, from a statistical perspective, to compare different methods. The results are presented in Table 1. The best performance is achieved by SWIM, which is better than the GRI by approximately 0.01. The GRI is second with a criterion value of 0.12 and is comparable with the Random Forest. Among the rule-based algorithms, the criterion equals 0.13 for CART, 0.16 for J48 with EDDM, and around 0.19 for AQ-P1 and AQ-P2. The performance values for all methods vary from 0.11 to 0.23.
Nevertheless, the performance of the GRI algorithm is very promising. The results indicate that the GRI algorithm can extract very accurate knowledge. In particular, it outperforms the AQ algorithms by around 0.06, J48 with EDDM by approximately 0.03, and CART by about 0.01.
The best performance of the GRI classifier is achieved for γ = 0.88 and β = 0.1, which is quite an interesting result. The γ parameter arbitrates weights of the examples and quantifies examples for further consideration in the knowledge extraction process. It is important to notice that the result of the GRI without forgetting, that is, γ = 1, is much poorer than for γ = 0.88 (the difference is around 0.13 in favor of γ = 0.88, see Table 2). The β parameter determines the balance between choosing general or specialized rules. In other words, the more specialized the rules are, the more the knowledge describes single situations. In this case, a small value of β can be explained by the domain specificity. The electricity price market is dependent on short-term events and seasonality. For instance, in winter the days are shorter and electricity demand is higher, so these conditions determine specific situations that influence price fluctuations.

Table 1 Results for methods compared in the experiment for the criterion (6) in the electricity price market

| Method | Rule-based | Criterion (6) | Source |
| IB1 with DDM | N | 0.23 | [21] |
| IB1 with EDDM | N | 0.14 | [5] |
| J48 with DDM | I | 0.21 | [21] |
| J48 with EDDM | I | 0.16 | [5] |
| DWM-NB | N | 0.17 | [39] |
| PL-NB | N | 0.19 | [4] |
| SWIM | N | 0.11 | [3] |
| NB with forgetting | N | 0.17 | [62] |
| AQ-P1 | Y | 0.19 | [61] |
| AQ-P2 | Y | 0.19 | [61] |
| CART with shifting window | I | 0.13 | - |
| Random Forest with shifting window | N | 0.12 | - |
| SVM with shifting window | N | 0.14 | - |
| GRI | Y | 0.12 | - |

Each method contains additional information whether it has a rule-based representation (Y: yes, N: no, I: can be interpreted as rules). The best three results are 0.11 (SWIM) and 0.12 (Random Forest with shifting window and GRI).

Table 2 Results for the GRI algorithm with different values of parameters, and for the criterion (6) in the electricity price market. Best result in bold

Supporting anamnesis in diabetes treatment
To evaluate the GRI algorithm for the anamnesis module in the e-Health system for the diabetes treatment, a real-life data set is used [63]. The data were collected by Michael Kahn, M.D., Ph.D. and cover 70 patients. Diabetes patient records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to time stamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, bedtime) reported by a patient (e.g., breakfast is 6:30, or 7:35). Each patient's medical history corresponds to a period from 20 to 149 days of measurements, depending on the patient. The original diabetes files consist of four fields per record: (i) date, (ii) time, (iii) code (nominal), (iv) glucose level in blood (numeric). The code describes the measurement, for example, regular insulin dose, pre-lunch glucose measurement, typical meal ingestion, typical exercise activity, and others (details can be found in [63]).
However, the aim of the anamnesis is to find the hidden context, in other words, to provide knowledge that could help a physician in formulating adequate questions. For instance, if the patient's glucose level is always bad on a Saturday morning, this might be an indication that his/her habits influence the glucose level in blood.
Therefore, the original inputs were transformed into the following attributes: u_1: day of the week (Monday, Tuesday, and so on), u_2: part of the day (from 4:00 until 10:00, from 10:00 until 16:00, from 16:00 until 22:00, and from 22:00 until 4:00), u_3: measurement code (before a meal, after a meal, after an insulin dose, other). The numbers of values for the inputs are the following: K_1 = 7, K_2 = 4, K_3 = 5. Besides, the glucose level in blood was transformed into the output that describes whether the glucose level is in norm (y = 1), or not (y = −1). The output is determined based on the original code, for example, pre-lunch glucose measurement, and the glucose level in blood. For instance, the allowed glucose level in blood is different before a meal (80-120 mg/dl) and after a meal (80-140 mg/dl). Details can be found in [63]. Having these inputs, a physician is able to identify, for example, the periods during which the glucose is not in norm or what the glucose level is before and after meals, and then inquire into the reasons for such a state.
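The attribute transformation can be sketched as below. The sketch is an illustration under stated assumptions: the context labels `"before_meal"` and `"after_meal"` are hypothetical names (the mapping from the original numeric codes is not reproduced), the norm bounds are treated as inclusive, and only the two ranges quoted in the text are covered.

```python
from datetime import datetime
from typing import Dict, Tuple

# Allowed glucose ranges in mg/dl as quoted in the text (assumed inclusive).
NORM_RANGE_MG_DL: Dict[str, Tuple[float, float]] = {
    "before_meal": (80.0, 120.0),
    "after_meal": (80.0, 140.0),
}


def part_of_day(hour: int) -> int:
    """Map an hour to 0: 4-10, 1: 10-16, 2: 16-22, 3: 22-4."""
    return ((hour - 4) % 24) // 6


def encode(timestamp: datetime, context: str, glucose: float) -> Tuple[int, int, str, int]:
    """Return (u1, u2, u3, y) for one glucose measurement."""
    low, high = NORM_RANGE_MG_DL[context]
    y = 1 if low <= glucose <= high else -1  # in norm or not
    # weekday(): Monday is 0 and Sunday is 6.
    return timestamp.weekday(), part_of_day(timestamp.hour), context, y


# The same reading of 135 mg/dl is out of norm before a meal but in norm after one.
saturday_morning = encode(datetime(2024, 1, 6, 7, 30), "before_meal", 135.0)
monday_night = encode(datetime(2024, 1, 1, 23, 0), "after_meal", 135.0)
```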
Diabetes is an illness that is caused not only by insulin production problems but also depends on other factors such as psychological tension, feeding and drinking habits, sport activities, and health condition. All of these factors can be treated as the hidden context. Moreover, the diabetic condition evolves in time and the results of treatment become noticeable only in the long term. Therefore, in the experiment, only 10 out of 70 patient records were used, among which the smallest number of examples was 926 (116 days) and the biggest number was 1,327 (149 days).
To evaluate the GRI algorithm in comparison with other rule-based inducers (AQ-P1, AQ-P2, and CART with shifting window) and with well-known classifiers such as Random Forest and SVM with kernel function (35) and shifting window, the criterion (6) was used. In a real-life application, a physician uses the extracted knowledge. However, in the experiment, it would be methodologically incorrect to evaluate the results based on the physician's opinions. Hence, it is assumed that if the algorithm is able to track the context changes, the extracted knowledge is useful for the medical interview. Furthermore, similarly to the Electricity data set, in all cases, the learning is conducted on the stream of data (examples arrive according to the original time stamps). A new example is first classified and then used for learning (the true output is given after classification).
All algorithms were checked with different parameter values, and results for the highest performing algorithms are presented in Table 3. The mean prequential error over the 10 patients for the GRI with different γ and β values is provided in Table 4.

Table 3 The results for the GRI, AQ-P1, AQ-P2, CART, Random Forest, and SVM, for the criterion (6)

Results and discussion
First of all, the obtained results indicate that applying knowledge in the form of rules, that is, GRI, AQs, and CART, to diabetes is satisfactory (see Table 3). The mean value of criterion (6) is 0.15 and 0.16 for AQ-P1 and AQ-P2, respectively, and 0.14 for CART. The GRI classifier performed better by about 0.05, reaching a criterion value of 0.10. Moreover, the GRI classifier outperformed Random Forest and SVM by about 0.03. Nevertheless, because only 10 patients were considered, it is worth checking whether the differences between the GRI and the other algorithms are significant. Therefore, statistical hypothesis testing is applied. For that purpose, the following methodology was considered:
1. Verify whether the obtained results for all patients are drawn from a normal distribution.
2. Verify whether the standard deviations (or variances) of all algorithms are the same.
3. Verify whether the mean value of the GRI is worse than that of the other methods.
To verify whether the methods are statistically comparable, the one-tailed Student's t-test is used. The null hypothesis is as follows: H_0: μ < μ_GRI, where μ is the mean value of an algorithm other than the GRI, and μ_GRI is the mean value of the GRI. However, the Student's t-test can be used only if the observations are drawn from a normal distribution and the variances are equal. Therefore, to solve the first problem, the Kolmogorov-Smirnov test is applied. For the second problem, the F-statistic is used. All tests were conducted in the Matlab ® environment. Hence, it can be assumed that the variances are equal. In all cases, the T was greater than t and thus the null hypotheses can be rejected (in all cases with p-value ≈ 1).
Conclusions of the statistical hypothesis testing are the following. Comparing the GRI with the other algorithms, that is, AQ-P1, AQ-P2, CART, Random Forest, and SVM, it can be stated that the difference between the GRI and the other algorithms is significant, and in the application of diabetes treatment, the GRI performs better than the other algorithms.
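For reference, the statistic of a two-sample Student's t-test with equal variances can be computed as sketched below. The sketch uses only the Python standard library rather than the Matlab routines used in the study, the function name `pooled_t_statistic` is hypothetical, and the two samples are made-up numbers chosen for illustration, not per-patient results from the experiment.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Student's t statistic under the equal-variance assumption."""
    na, nb = len(a), len(b)
    # Pooled sample variance with na + nb - 2 degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


# Made-up error values for two methods (illustration only).
other_method = [0.15, 0.16, 0.14]
reference_method = [0.10, 0.11, 0.09]
t_value = pooled_t_statistic(other_method, reference_method)  # roughly 6.12
```

The resulting value is then compared with the critical value of the t distribution with na + nb − 2 degrees of freedom at the chosen significance level.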
The GRI classifier performs best for γ = 0.95 and β = 0.4 (see Table 4). This result indicates two issues. First, the hidden context and treatment have influenced the patients' conditions. Therefore, the application of the forgetting mechanism had a slight but positive effect on the knowledge quality (see the difference between γ = 1 and γ = 0.95, Table 4). However, in a biomedical application, even a small difference is priceless and can save human lives.
Second, in the considered application, the value of β close to 0.5 indicates that both generalization and specialization are important. According to the domain, the diabetic condition evolves rather slowly and thus there are few repeating routines. Therefore, in this case, it is more important to generalize data rather than to focus on single situations, as is the case in the domain of the electricity price market.
Finally, it is worth showing how rules can be presented to a physician. It is assumed that only rules connected with an abnormal glucose level in blood are considered. Table 5 presents all rules for the glucose level not in norm for patient No. 67. The rules can be reported to the physician in a raw form (Table 5) or in a translated form (Table 6). This translation is relatively easy to perform as the rule-based knowledge is represented by logical formulae.

Conclusions
Application of e-Health systems to support diabetes treatment is a challenging task both from the technical and the medical perspective. However, a lot of effort should be made to make patients' lives more bearable and to decrease the costs of treatment. In this paper, a new method of knowledge extraction for supporting anamnesis was described. To allow the context tracking, a novel method for data aggregation was proposed. In order to avoid overfitting, a method of limiting the search space was presented. The limitation of the search space can be regarded as a kind of regularization. It is especially important in rules induction because rule-based models have a high Vapnik-Chervonenkis dimension, which entails high susceptibility to overfitting. Experimental validation shows that the GRI algorithm not only performs comparably to or better than the compared methods but also can be successfully applied to support the anamnesis in the diabetes treatment.
Future developments of this work will address: (i) development of the eDiab system as a service oriented system [24], (ii) conducting experiments not only with glucometer but also with weight and pressure gauge, (iii) deeper insight into the theoretical aspects of the limitation of the search space, (iv) other measures than coverage and accuracy, (v) heuristics for path selection.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.