In this section we present our algorithm for the \(\tilde{\varDelta }\)-Closed Set Listing problem defined in Sect. 3. To tackle massive data streams in feasible time, we approximate the \(\tilde{\varDelta }\)-closed sets for a data stream \(\mathcal {S}_t = \langle T_1,\ldots ,T_t\rangle \) at time t from a random sample \(\mathcal {D}_{t}\) generated from \(\mathcal {S}_t\) without replacement. Since the order of the elements in the sample does not matter, \(\mathcal {D}_{t}\) is regarded as a transaction database. The size s of \(\mathcal {D}_t\) is chosen in a way that for all \(X \subseteq E\), the discrepancy between the relative frequency of X in \(\mathcal {S}_t\) and that in \(\mathcal {D}_{t}\) is at most \(\epsilon \) with probability at least \(1-\delta \), i.e., s satisfies
$$\begin{aligned} \mathrm{Pr}\left( \left| \frac{|\mathcal {S}_t[X]|}{t} - \frac{|\mathcal {D}_t[X]|}{s} \right| \le \epsilon \right) \ge 1 - \delta \end{aligned}$$
(2)
for any \(X \subseteq E\). The parameters \(\epsilon \) (error) and \(\delta \) (confidence) are specified by the user. Our extensive experiments in Sect. 5.2 show that a very close approximation of the true family of \(\tilde{\varDelta }\)-closed itemsets can be obtained in this way.
Our algorithm recalculates the family of \(\tilde{\varDelta }\)-closed itemsets not after each new transaction, but either upon request or after b new transactions have been received since the last update, where b, the buffer size, is specified by the user. Given \(\mathcal {S}_t = \langle T_1,\ldots ,T_t\rangle \) and \(\mathcal {S}_{t'} = \langle T_1,\ldots ,T_t, T_{t+1},\ldots ,T_{t'} \rangle \) with \(t'-t \le b\), the new sample \(\mathcal {D}_{t'}\) of \(\mathcal {S}_{t'}\) is computed from the old sample \(\mathcal {D}_{t}\) by \( \mathcal {D}_{t'}=\mathcal {D}_{t}\ominus \mathcal {D}_{\mathrm{del}}\oplus \mathcal {D}_{\mathrm{ins}}\), where \(\mathcal {D}_{\mathrm{del}}\) (resp. \(\mathcal {D}_{\mathrm{ins}}\)) is the multiset of transactions to be removed from (resp. added to) \(\mathcal {D}_{t}\), and \(\ominus \) and \(\oplus \) denote the set difference and the union operations on multisets.
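The multiset update \( \mathcal {D}_{t'}=\mathcal {D}_{t}\ominus \mathcal {D}_{\mathrm{del}}\oplus \mathcal {D}_{\mathrm{ins}}\) can be illustrated with a minimal sketch. It assumes that transactions are represented as frozensets and multisets as `collections.Counter` objects; the function name `update_sample` and the toy transactions are hypothetical and are not taken from Fig. 1.

```python
from collections import Counter

def update_sample(d_t, d_del, d_ins):
    """Return (D_t minus D_del) plus D_ins, all given as multisets (Counters)."""
    # Counter subtraction keeps only positive counts (multiset difference);
    # Counter addition sums the multiplicities (multiset union).
    return (d_t - d_del) + d_ins

# Hypothetical transactions; frozensets are hashable and can be Counter keys.
t1, t2, t9 = frozenset("abd"), frozenset("bd"), frozenset("cd")
d_t = Counter({t1: 1, t2: 2})
d_new = update_sample(d_t, Counter({t1: 1}), Counter({t9: 1}))
```

Since \(|\mathcal {D}_{\mathrm{del}}| = |\mathcal {D}_{\mathrm{ins}}|\), the total multiplicity of the sample is unchanged by the update.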
The algorithm will be illustrated on the example transactional data stream given in Fig. 1 with \(b=8\) and \(\varDelta = 2\). For the sake of simplicity, we assume that transaction 1 will be replaced with transaction 9 by the sampling algorithm. The strongly closed itemsets for the first and for the last eight transactions are shown in Figs. 2 and 3, respectively.
The rest of this section is organized as follows. We sketch the sampling algorithm in Sect. 4.1 and describe the algorithm updating the family of \(\tilde{\varDelta }\)-closed itemsets from \(\mathcal {D}_{t}\) to \(\mathcal {D}_{t'}\) in Sect. 4.2.
Sampling
We use reservoir sampling (Knuth 1997; Vitter 1985) for generating a random sample \(\mathcal {D}_{t}\) of size s for a data stream \(\mathcal {S}_t = \langle T_1,\ldots ,T_t\rangle \), as this method does not require the stream length to be known in advance. The general scheme of reservoir algorithms is that they first add \(T_1,\ldots ,T_s\) to a “reservoir” and then throw a biased coin with probability s / k of head for all \(k=s+1,\ldots ,t\). If the outcome is head they replace one of the elements selected from the reservoir uniformly at random with \(T_k\). This naive version of reservoir sampling, attributed to A.G. Waterman by D. Knuth in Knuth (1997), generates a random sample \(\mathcal {D}_{t}\) of \(\mathcal {S}_t\) without replacement uniformly at random. That is, all elements of \(\mathcal {S}_t\) have probability s / t of being part of the sample after \(\mathcal {S}_t\) has been processed. We have implemented Vitter’s more sophisticated version, called Algorithm Z in Vitter (1985).
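The naive reservoir scheme described above can be sketched as follows. This is only the simple version attributed to Waterman, not Vitter's Algorithm Z used in our implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this illustration.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Uniform sample of size s without replacement from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for k, transaction in enumerate(stream, start=1):
        if k <= s:
            # The first s transactions fill the reservoir.
            reservoir.append(transaction)
        elif rng.random() < s / k:
            # Heads with probability s/k: replace a uniformly chosen element.
            reservoir[rng.randrange(s)] = transaction
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Algorithm Z produces a sample with the same distribution, but it skips over stream elements instead of throwing a coin for each of them.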
Given a sample \(\mathcal {D}_{t}\) of a data stream \(\mathcal {S}_t = \langle T_1,\ldots ,T_t\rangle \), the sample \(\mathcal {D}_{t'}\) for \(\mathcal {S}_{t'} = \langle T_1,\ldots ,T_t,T_{t+1},\ldots ,T_{t'}\rangle \) is computed from \(\mathcal {D}_{t}\) by repeatedly applying Algorithm Z to \(\mathcal {D}_{t}\) and the elements in \(\langle T_{t+1},\ldots ,T_{t'}\rangle \). Recall that \(t'-t \le b\), where b is the buffer size. If a transaction in the sample is replaced by a new transaction \(T \in \{T_{t+1},\ldots ,T_{t'}\}\), we appropriately update a database \(\mathcal {D}_{\mathrm{del}}\) containing the transactions to be removed from \(\mathcal {D}_{t}\) and a database \(\mathcal {D}_{\mathrm{ins}}\) containing the transactions to be added to \(\mathcal {D}_{t}\). Clearly, \(|\mathcal {D}_{\mathrm{del}}| =|\mathcal {D}_{\mathrm{ins}}|\). Furthermore,
$$\begin{aligned} \mathbb {E}[|\mathcal {D}_{\mathrm{del}}|] = \mathbb {E}[|\mathcal {D}_{\mathrm{ins}}|] \le \frac{bs}{t'} . \end{aligned}$$
(3)
This follows directly from the linearity of expectation and from \(\mathbb {E}[X_k] = s/t'\), where \(X_k\) is the indicator random variable for the event that \(T_k\) is selected for the sample \(\mathcal {D}_{t'}\) of \(\mathcal {S}_{t'}\). Note that the inequality in (3) is strict only in the case that \(t'-t < b\), i.e., when an update is calculated upon request for an incomplete buffer; otherwise we always have equality. The RHS of (3) approaches 0 as \(t'\) approaches infinity. For example, it is only 15 for \(b = 10\)k, \(t' = 100\)M, \(\epsilon = 0.005\), \(\delta = 0.001\), and \(s = 150\)k, where the sample size \(s=s(\epsilon ,\delta )\) satisfying (2) is calculated by Hoeffding’s inequality, i.e.,
$$\begin{aligned} s= \left\lceil \frac{1}{2\epsilon ^2} \ln \frac{2}{\delta }\right\rceil . \end{aligned}$$
(4)
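The arithmetic behind (3) and (4) can be checked with a short sketch; the function names `sample_size` and `expected_replacements_bound` are hypothetical helpers introduced only for this illustration.

```python
import math

def sample_size(epsilon, delta):
    """Sample size s from (4): ceil( ln(2/delta) / (2 * epsilon^2) )."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def expected_replacements_bound(b, s, t_new):
    """Right-hand side of (3): b * s / t'."""
    return b * s / t_new

s = sample_size(0.005, 0.001)
bound = expected_replacements_bound(10_000, 150_000, 100_000_000)
```

For \(\epsilon = 0.005\) and \(\delta = 0.001\), (4) gives \(s = 152{,}019\), which is rounded to 150k in the example above; with \(s = 150\)k, \(b = 10\)k, and \(t' = 100\)M, the bound in (3) evaluates to 15.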
Incremental update
Note that the sample size s in (4) depends on the error and confidence parameters \(\epsilon \) and \(\delta \) only. That is, s does not change with increasing data stream length. Hence, both denominators in the LHS of (1) will be fixed (i.e., s) for the entire mining process from the s-th transaction onward. More precisely, for any transaction database \(\mathcal {D}\) of size s and \(\tilde{\varDelta } \in [0,1]\), the family of relatively \(\tilde{\varDelta }\)-closed itemsets of \(\mathcal {D}\) is equal to the family \(\mathcal {C}_{\varDelta ,\mathcal {D}}\) of absolutely \(\varDelta \)-closed itemsets for \(\varDelta = \lceil s\tilde{\varDelta }\rceil \). This allows us to consider the following problem equivalent to the \(\tilde{\varDelta }\)-Closed Set Listing problem:
\(\varDelta \)-Closed Set Listing Problem: Given \(\mathcal {D}_{t}\), \(\mathcal {D}_{\mathrm{del}}\), \(\mathcal {D}_{\mathrm{ins}}\) for \(\mathcal {S}_t\) and \(\mathcal {S}_{t'}\) as defined in Sect. 4.1, an integer \(\varDelta > 0\), and the family \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\) of \(\varDelta \)-closed itemsets of \(\mathcal {D}_{t}\), generate all elements of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) for \(\mathcal {D}_{t'}= \mathcal {D}_{t}\ominus \mathcal {D}_{\mathrm{del}}\oplus \mathcal {D}_{\mathrm{ins}}\).
Instead of generating \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) from scratch, our goal is to design a much faster practical algorithm by reducing the number of evaluations of the closure operator for \(\mathcal {D}_{t'}\). This is motivated by the fact that the execution of the closure operator is the most expensive part of the algorithm. We make use of the fact that the expected number of changes in \(\mathcal {D}_{t'}\) w.r.t. \(\mathcal {D}_{t}\) becomes smaller and smaller as \(t'\) increases (cf. (3)). Accordingly, our focus in the design of the updating algorithm is on quickly deciding whether an element \(C' \in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\) remains \(\varDelta \)-closed in \(\mathcal {D}_{t'}\), where \(C'\) is obtained by \(C' = \sigma _{\varDelta ,\mathcal {D}_t}(C\cup \{e\})\) for some \(C\in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\) and \(e \in E\). Below we show that in all of the cases when at least one of the support sets \(\mathcal {D}_{\mathrm{del}}[C\cup \{e\}]\) or \(\mathcal {D}_{\mathrm{ins}}[C\cup \{e\}]\) is empty, the problem above can be decided much faster than with the naive way of using Algorithm 1. As we empirically demonstrate in Sect. 5, a considerable speed-up over the naive algorithm can be achieved in this way.
We first briefly sketch the algorithm computing \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) from \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\) (see Algorithm 2). It requires four auxiliary pieces of information for all strongly closed itemsets in \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\), except for the empty set (cf. Line 1 of Main). Hence, to simplify the notation, the set variables \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\) and \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) in all algorithms of this section store quintuples, where the first component is the strongly closed itemset itself; the other four components are specified below.
Algorithm 2 is a divide and conquer algorithm that recursively calls ListClosed with some \(\varDelta \)-closed set \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\), forbidden set \(N \subseteq E\), and minimum candidate generator element i. It first determines the next smallest generator element e (Line 3) and calculates the closure \(C'=\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\) in Lines 4–10; these steps are discussed in detail below. We store \(C'\), together with some auxiliary information (Lines 12 and 15). The algorithm then calls ListClosed recursively for generating further \(\varDelta \)-closed supersets of \(C'\). In particular, if \(C'\) does not contain any forbidden item from N then the last element of the quintuple stored for \(C'\) is \(\uparrow \) (Line 12); otherwise it is \(\downarrow \) (Line 15). After all elements of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) have been generated that are supersets of C, contain e, but do not contain any element in N, the algorithm generates all closed sets in \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) that are supersets of C and do not contain any element from \(N \cup \{e\}\) (Lines 16–19).
Example 1
Using the transactions in Fig. 1, we show how Algorithm 2 updates the family of 2-closed itemsets for the first eight transactions (cf. Fig. 2) to that for the last eight (cf. Fig. 3). For \(E = \{a,b,c,d\}\) with \(a< b< c < d\) and \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}= \{d, ad, bd, cd\}\), the input to the algorithm for this update consists of \(\mathcal {D}_{\mathrm{del}}= \{t_1\}\), \(\mathcal {D}_{\mathrm{ins}}= \{t_9\}\), and \(\varDelta = 2\). The algorithm first initializes \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\leftarrow \{\emptyset \}\) (line 2) and then calls ListClosed(\(\emptyset \), \(\emptyset \), a) (line 3). The recursive calls of ListClosed are visualized in Fig. 4. The edges corresponding to lines 1–15 are labeled with the value of the variable \(C_e\) (cf. line 3) and the case used for update in lines 4–10; unlabeled edges correspond to lines 17–19.
Theorem 2
Algorithm 2 generates all elements of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) correctly, irredundantly, in total time \(O\left( |E| \cdot |\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}| \cdot \Vert \mathcal {D}_{t'}\Vert _0\right) \), with delay \(O\left( |E|^2 \Vert \mathcal {D}_{t'}\Vert _0\right) \), and in space \(O\left( |E|+\Vert \mathcal {D}_{t'}\Vert _0\right) \).
Proof
Regarding the correctness, we only need to show that \(C'\) computed in Lines 4–10 satisfies \(C' = \sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\). The correctness of Closure_\(\alpha \) (Algorithm 3), Closure_\(\beta \) (Algorithm 4), and Closure_\(\gamma \) (Algorithm 5) is shown below in Lemmas 1, 2, and 3, respectively. The proofs of the irredundancy and the time and space complexity are immediate from Boley et al. (2010) and Gély (2005) by noting that Algorithm 2 must call the closure operator for all elements in \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) in the worst case. \(\square \)
In the rest of this section we give the algorithms for the cases distinguished in Lines 4–10 (case (\(\delta \)) is trivial) and prove their correctness.
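The case distinction can be summarized by a small sketch that inspects the two support sets \(\mathcal {D}_{\mathrm{del}}[C\cup \{e\}]\) and \(\mathcal {D}_{\mathrm{ins}}[C\cup \{e\}]\) used in conditions (5), (6), and (10) below. It assumes that case (\(\delta \)) is the remaining situation in which both support sets are non-empty; the names `support` and `update_case` as well as the toy transactions are hypothetical.

```python
def support(db, itemset):
    """Support set D[X]: all transactions of db that contain every item of X."""
    return [t for t in db if itemset <= t]

def update_case(d_del, d_ins, c, e):
    """Name of the case applying to C ∪ {e}, decided from D_del and D_ins only."""
    x = frozenset(c) | {e}
    deleted = bool(support(d_del, x))
    inserted = bool(support(d_ins, x))
    if not deleted and not inserted:
        return "alpha"   # support set unchanged, cf. (5)
    if deleted and not inserted:
        return "beta"    # support set can only shrink, cf. (6)
    if not deleted and inserted:
        return "gamma"   # support set can only grow, cf. (10)
    return "delta"       # assumed: both non-empty, general closure computation

# Hypothetical deleted and inserted transactions (not those of Fig. 1).
d_del = [frozenset("bd")]
d_ins = [frozenset("cd")]
cases = [update_case(d_del, d_ins, set(), e) for e in "abcd"]
```

Only the typically small databases \(\mathcal {D}_{\mathrm{del}}\) and \(\mathcal {D}_{\mathrm{ins}}\) are scanned to select the case.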
Case \((\alpha )\) We first consider the case that the set \(C\cup \{e\}\) with \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and \(e \in E\) to be extended for further \(\varDelta \)-closed sets satisfies
$$\begin{aligned} \mathcal {D}_{\mathrm{del}}[C\cup \{e\}] = \emptyset \text { and } \mathcal {D}_{\mathrm{ins}}[C\cup \{e\}] = \emptyset \end{aligned}$$
(5)
(Lines 4–5 of Algorithm 2). The closure \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C\cup \{e\})\) for this case can be computed by Algorithm 3; the correctness of Algorithm 3 is stated in Lemma 1 below.
Lemma 1
Algorithm 3 is correct, i.e., for all \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and for all \(e \in E\), the output of the algorithm is \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\).
Proof
Condition (5) implies that \(\mathcal {D}_{t}[C\cup \{e\}] = \mathcal {D}_{t'}[C\cup \{e\}]\), where \(\mathcal {D}_{t'}= \mathcal {D}_{t}\ominus \mathcal {D}_{\mathrm{del}}\oplus \mathcal {D}_{\mathrm{ins}}\). Hence, \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\}) = \sigma _{\varDelta ,\mathcal {D}_t}(C \cup \{e\})\) and \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\}) \in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\), from which the proof is immediate for both cases considered in Lines 1–2. \(\square \)
Example 2
In our running Example 1, the first call LC(\(\emptyset \),\(\emptyset \),a) in Fig. 4 corresponds to case (\(\alpha \)), as \(\mathcal {D}_{\mathrm{ins}}[a] = \mathcal {D}_{\mathrm{del}}[a] = \emptyset \). Algorithm 3 returns ad as the closure of a in line 1, i.e., we do not need to (re)evaluate the closure operator on a.
Case \((\beta )\) We now turn to the case that \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and \(e \in E\) fulfill
$$\begin{aligned} \mathcal {D}_{\mathrm{del}}[C\cup \{e\}] \ne \emptyset \text { and } \mathcal {D}_{\mathrm{ins}}[C\cup \{e\}] = \emptyset \end{aligned}$$
(6)
(Lines 6–7 of Algorithm 2). In Proposition 1 below we first prove some monotonicity results that will be used also for case (\(\gamma \)).
Proposition 1
Let \(\mathcal {D}_1\) and \(\mathcal {D}_2\) be transaction databases over E. If \(\mathcal {D}_1 \subseteq \mathcal {D}_2\) then for all \(\varDelta \in \mathbb {N}\),
$$\begin{aligned} \mathcal {C}_{\varDelta ,\mathcal {D}_1} \subseteq \mathcal {C}_{\varDelta ,\mathcal {D}_2} . \end{aligned}$$
(7)
Furthermore, for all \(\varDelta \in \mathbb {N}\) and for all \(X \subseteq E\),
$$\begin{aligned} \sigma _{\varDelta ,\mathcal {D}_1}(X) \supseteq \sigma _{\varDelta ,\mathcal {D}_2}(X) . \end{aligned}$$
(8)
Proof
Let \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_1}\) for some \(\varDelta \in \mathbb {N}\) and let \(\mathcal {D}'= \mathcal {D}_2 \ominus \mathcal {D}_1\). Then, for any \(e \in E \setminus C\), we have
$$\begin{aligned} |\mathcal {D}_2[C\cup \{e\}]| &= |\mathcal {D}_1[C\cup \{e\}]|+ |\mathcal {D}'[C\cup \{e\}]| \\ &\le |\mathcal {D}_1[C]| - \varDelta +|\mathcal {D}'[C]| \\ &= |\mathcal {D}_2[C]|-\varDelta , \end{aligned}$$
where the inequality follows from \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_1}\) and from the anti-monotonicity of support sets. Hence \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_2}\) completing the proof of (7).
To show (8), suppose that during the calculation of \(\sigma _{\varDelta ,\mathcal {D}_2}(X)\), the items in \(\sigma _{\varDelta ,\mathcal {D}_2}(X) \setminus X\) have been added to X in the order \(e_1,\ldots ,e_k\). Let \(X_0 =X\) and \(X_i = X \cup \{e_1,\ldots ,e_{i-1},e_{i}\}\) for all \(i\in [k]\). Then \( |\mathcal {D}_2[X_{i-1}]| - |\mathcal {D}_2[X_i]| < \varDelta \) for all \(i\in [k]\) (see Algorithm 1). Since \(\mathcal {D}_2[X_{i-1}] \supseteq \mathcal {D}_2[X_{i}]\) and \(\mathcal {D}_1 \subseteq \mathcal {D}_2\), we have \( |\mathcal {D}_1[X_{i-1}]| - |\mathcal {D}_1[X_{i}]| < \varDelta \) for all i. Thus, as Algorithm 1 is Church-Rosser, all \(e_i\) will be added to \(\sigma _{\varDelta ,\mathcal {D}_1}(X)\) as well, implying (8). \(\square \)
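The monotonicity in (8) can be observed on a toy example. The sketch below assumes that the closure operator repeatedly adds every item e with \(|\mathcal {D}[X]| - |\mathcal {D}[X \cup \{e\}]| < \varDelta \), as used in the proof above; it is an illustrative reimplementation, not a transcription of Algorithm 1, and the databases are hypothetical.

```python
def count(db, itemset):
    """Support count |D[X]|."""
    return sum(1 for t in db if itemset <= t)

def closure(db, x, items, delta):
    """Add items whose removal from the support costs fewer than delta transactions."""
    closed = set(x)
    changed = True
    while changed:
        changed = False
        for e in sorted(items - closed):
            if count(db, closed) - count(db, closed | {e}) < delta:
                closed.add(e)
                changed = True
    return frozenset(closed)

items = set("abc")
d1 = [frozenset("ab"), frozenset("ab"), frozenset("abc")]
d2 = d1 + [frozenset("a"), frozenset("a"), frozenset("c")]  # D1 is contained in D2

closure_d1 = closure(d1, {"a"}, items, 2)
closure_d2 = closure(d2, {"a"}, items, 2)
```

Here the closure of a is ab in \(\mathcal {D}_1\) and a in \(\mathcal {D}_2\), in accordance with (8): the closure in the smaller database is a superset of that in the larger one.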
Using Proposition 1, we have the following result for Algorithm 4 concerning case (\(\beta \)):
Lemma 2
Algorithm 4 is correct, i.e., for all \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and for all \(e \in E\), the output of the algorithm is \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\).
Proof
By Condition (6), \(\mathcal {D}_{t'}[C\cup \{e\}] \subseteq \mathcal {D}_{t}[C\cup \{e\}]\) and hence Proposition 1 implies that there is no \(Y \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) with \(C\cup \{e\} \subsetneq Y \subsetneq \sigma _{\varDelta ,\mathcal {D}_t}(C\cup \{e\})\). Furthermore, if \(\sigma _{\varDelta ,\mathcal {D}_t}(C\cup \{e\}) \not \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) then \(\sigma _{\varDelta ,\mathcal {D}_t}(C\cup \{e\}) \subsetneq \sigma _{\varDelta ,\mathcal {D}_{t'}}(C\cup \{e\})\). Thus, to check whether \(C' = \sigma _{\varDelta ,\mathcal {D}_t}(C\cup \{e\})\) remains closed in \(\mathcal {D}_{t'}\), it suffices to test whether
$$\begin{aligned} |\mathcal {D}_{t'}[C']|- |\mathcal {D}_{t'}[C' \cup \{i\}]| \ge \varDelta \end{aligned}$$
(9)
still holds for all items \(i \in E \setminus C'\) (Lines 2–6 of Algorithm 4). If so, the algorithm returns \(C'\) in Line 7, implying the correctness of Algorithm 4 for the case that \(C' \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\); the claim is trivial for the other two cases (Lines 6 and 9). \(\square \)
We note that in our implementation of Algorithm 4 we do not calculate \(C'.\mathrm{count}\) and \(C'.\varDelta _i\) in Lines 2 and 4, but store and maintain them consistently. In this way, the condition in Line 5 can be decided from \(\mathcal {D}_{\mathrm{del}}\), without any access to \(\mathcal {D}_{t}\). It is important to mention that with increasing stream length, the number of elements to be deleted from \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\) becomes smaller (cf. (3)) and typically, most of the elements of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) are calculated by terminating in Line 7.
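The following sketch illustrates how condition (9) can be tested from \(\mathcal {D}_{\mathrm{del}}\) alone. It assumes that, for a closed set \(C'\), the differences \(|\mathcal {D}_{t}[C']| - |\mathcal {D}_{t}[C' \cup \{i\}]|\) are stored in a dictionary (`stored_gaps`, a hypothetical name); since \(\mathcal {D}_{\mathrm{ins}}[C'] = \emptyset \) in case (\(\beta \)), only the deleted transactions can change them.

```python
def count(db, itemset):
    """Support count |D[X]|."""
    return sum(1 for t in db if itemset <= t)

def still_closed_after_deletions(c_prime, stored_gaps, d_del, delta):
    """Check condition (9) for all items i outside C' using D_del only."""
    removed_c = count(d_del, c_prime)
    for i, old_gap in stored_gaps.items():
        removed_ci = count(d_del, c_prime | {i})
        # Deleted transactions containing C' but not i shrink the gap.
        new_gap = old_gap - (removed_c - removed_ci)
        if new_gap < delta:
            return False
    return True

# Hypothetical sample: bd occurs in 4 transactions, abd in 1, bcd in none.
c_prime = frozenset("bd")
stored_gaps = {"a": 3, "c": 4}
one_deletion = still_closed_after_deletions(c_prime, stored_gaps, [frozenset("bd")], 2)
two_deletions = still_closed_after_deletions(c_prime, stored_gaps, [frozenset("bd")] * 2, 2)
```

In this toy example, bd stays 2-closed after one deletion of the transaction bd, but not after two such deletions.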
Example 3
In our running Example 1, the call of LC(\(\emptyset \),a,b) in Fig. 4 corresponds to case (\(\beta \)) because \(\mathcal {D}_{\mathrm{ins}}[b] = \emptyset \) and \(\mathcal {D}_{\mathrm{del}}[b] \ne \emptyset \). Since \((\emptyset ,a,b,bd,\uparrow ) \in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\), Algorithm 4 only needs to compute support queries on \(\mathcal {D}_{\mathrm{del}}\) in lines 2, 4 and 5. For all i considered in line 3, the condition in line 5 is not fulfilled. Hence, the algorithm returns bd in line 7, without calling the closure operator.
Case \((\gamma )\) Finally we discuss the case that \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and \(e \in E\) satisfy the condition
$$\begin{aligned} \mathcal {D}_{\mathrm{del}}[C\cup \{e\}] = \emptyset \text { and } \mathcal {D}_{\mathrm{ins}}[C\cup \{e\}] \ne \emptyset \end{aligned}$$
(10)
(see Lines 8–9 of Algorithm 2). The proof for this case is shown also by using Proposition 1.
Lemma 3
Algorithm 5 is correct, i.e., for all \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and for all \(e \in E\), the output of the algorithm is \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\).
Proof
The proof is automatic for the case that the condition in Line 1 of Algorithm 5 is false. Consider the case that it is true. Proposition 1 with Condition (10) implies that \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\subseteq \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) (i.e., all \(\varDelta \)-closed itemsets in \(\mathcal {C}_{\varDelta ,\mathcal {D}_t}\) are preserved) and that \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\}) \subseteq \sigma _{\varDelta ,\mathcal {D}_t}(C \cup \{e\})\). Thus, when calculating \(\sigma _{\varDelta ,\mathcal {D}_{t'}}(C \cup \{e\})\) in Loop 3–7, it suffices to consider only the elements in \(\sigma _{\varDelta ,\mathcal {D}_t}(C \cup \{e\}) \setminus (C \cup \{e\})\), from which the claim is immediate for this case. \(\square \)
Compared to case (\(\beta \)), we need to calculate support counts in the entire sample \(\mathcal {D}_{t'}\) for this case. However, the inner loop (Lines 4–6) iterates over a typically much smaller set than the general closure algorithm (cf. Lines 2–5 of Algorithm 1). Analogously to case (\(\beta \)), the number of new \(\varDelta \)-closed itemsets to be added to \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) becomes smaller with increasing stream length, and hence, most of the elements of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) are calculated in the “then” part (Lines 2–8) of the “if” statement.
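The restriction of the candidate items in case (\(\gamma \)) can be sketched as follows. The code assumes that the old closure \(\sigma _{\varDelta ,\mathcal {D}_t}(C \cup \{e\})\) is available and computes the new closure by considering only its elements; it illustrates the idea and is not a transcription of Algorithm 5. The function name `closure_gamma` and the toy databases are hypothetical.

```python
def count(db, itemset):
    """Support count |D[X]|."""
    return sum(1 for t in db if itemset <= t)

def closure_gamma(d_new, c, e, old_closure, delta):
    """Closure of C ∪ {e} in the new sample, with candidates from the old closure only."""
    closed = set(c) | {e}
    changed = True
    while changed:
        changed = False
        for i in sorted(set(old_closure) - closed):
            if count(d_new, closed) - count(d_new, closed | {i}) < delta:
                closed.add(i)
                changed = True
    return frozenset(closed)

# Hypothetical old sample in which the 2-closure of c is cd.
d_t = [frozenset("cd"), frozenset("cd"), frozenset("d"), frozenset("d"), frozenset("ab")]
after_one = closure_gamma(d_t + [frozenset("c")], set(), "c", frozenset("cd"), 2)
after_two = closure_gamma(d_t + [frozenset("c")] * 2, set(), "c", frozenset("cd"), 2)
```

In this toy example, one inserted transaction c leaves the closure cd unchanged, whereas two of them turn c into a new 2-closed itemset.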
Example 4
In our running Example 1, the call of LC(\(\emptyset \),ab,c) in Fig. 4 corresponds to case (\(\gamma \)) since item c occurs only in \(\mathcal {D}_{\mathrm{ins}}\) (i.e., \(\mathcal {D}_{\mathrm{ins}}[c] \ne \emptyset \) and \(\mathcal {D}_{\mathrm{del}}[c] = \emptyset \)). Since \((\emptyset ,ab,c,cd,\uparrow ) \in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\), the algorithm goes into the loop 4–6, iterating over all elements of \(cd\setminus c\). The condition in line 5 is not satisfied for d and thus c is returned as a new closed itemset in line 8, without calling the closure operator.
Controlling the time and space complexity of the update
Although by Theorem 2 Algorithm 2 does not improve the worst-case time and space complexity of the batch algorithm (Boley et al. 2009) calculating the family of strongly closed sets from scratch, our experimental results presented in Sect. 5 clearly demonstrate a considerable speed-up on artificial and real-world datasets. The total time depends on the cardinality of \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\), which can be exponential in |E|. The time and space of the update can be controlled by selecting the parameter \(\varDelta \) in a way that \(|\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}| < K\) for some reasonably small K. Once K has been fixed, the value of \(\varDelta \) can automatically be adjusted when the number of elements in \(\mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) that have already been enumerated exceeds K. More precisely, suppose Algorithm 2 has generated a subset \(\mathcal {C}'\subseteq \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) with \(|\mathcal {C}'| = K+1\). For all \(C \in \mathcal {C}'\), let \(\varDelta _C\) be the strength of C and denote \(\varDelta ' = \min _{C \in \mathcal {C}'} \varDelta _C\). Clearly, \(\varDelta ' \ge \varDelta \). Let \(\mathcal {C}'' = \{ C \in \mathcal {C}' : \varDelta _C > \varDelta '\}\). For the set obtained we have \(\mathcal {C}'' \subseteq \mathcal {C}_{\varDelta '+1,\mathcal {D}_{t'}}\) and \(|\mathcal {C}''| \le K\).
This change of \(\varDelta \) to \(\varDelta '+1\) requires, however, the maintenance of auxiliary pieces of information for all already generated strongly closed sets, as well as the reconstruction of the quintuples for the remaining closed sets. More precisely, suppose \(\varDelta _{C,e} = |\mathcal {D}_{t}[C]| -|\mathcal {D}_{t}[C\cup \{e\}]|\) has been calculated correctly for all \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_t}\) and for all \(e \in E\setminus C\). Notice that the strength of C in \(\mathcal {D}_{t}\) is given by \(\min _{e \in E \setminus C} \varDelta _{C,e}\), where the \(\varDelta _{C,e}\)s are obtained as a byproduct of the algorithm computing the closure operator (cf. Algorithm 1). One can see that if \(C \in \mathcal {C}_{\varDelta ,\mathcal {D}_{t'}}\) and C has not been recalculated by calling the closure operator, then \(\varDelta _{C,e}\) can be updated by
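The adjustment of \(\varDelta \) can be sketched as follows, assuming that the strengths \(\varDelta _C\) of the \(K+1\) sets enumerated so far are available in a dictionary; the function name `raise_delta` and the strength values are hypothetical.

```python
def raise_delta(strengths):
    """Given the strengths of K+1 enumerated sets, return the new delta and the survivors."""
    delta_min = min(strengths.values())
    # Only sets that are strictly stronger than the weakest one stay closed
    # for the increased parameter delta_min + 1.
    survivors = {c: d for c, d in strengths.items() if d > delta_min}
    return delta_min + 1, survivors

# Hypothetical strengths for K + 1 = 4 enumerated sets.
strengths = {frozenset("d"): 5, frozenset("ad"): 2, frozenset("bd"): 3, frozenset("cd"): 2}
new_delta, kept = raise_delta(strengths)
```

At least one set attains the minimum strength and is dropped, so at most K sets survive.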
$$\begin{aligned} \varDelta _{C,e} = \varDelta _{C,e} +|\mathcal {D}_{\mathrm{ins}}[C]|-|\mathcal {D}_{\mathrm{ins}}[C\cup \{e\}]|-|\mathcal {D}_{\mathrm{del}}[C]|+|\mathcal {D}_{\mathrm{del}}[C\cup \{e\}]| \end{aligned}$$
for all \(e \in E \setminus C\). Thus, the complexity of the update for this case depends on the cardinality of \(\mathcal {D}_{\mathrm{ins}}\) and \(\mathcal {D}_{\mathrm{del}}\) only, which become smaller and smaller with increasing \(t'\) by (3). Finally, utilizing the algebraic properties of closure systems, the quintuples can be reconstructed by a top-down traversal of the enumeration tree corresponding to Algorithm 2.