Here, we address the issue of the data structure needed to efficiently realize the three key operations of the proposed algorithm: incremental intersection, deletion, and Δ-compression. Of these, incremental intersection incurs the majority of the computational cost. This operation requires traversing every entry, e, to compute its intersection with the transaction, ti, at each timestamp i. This computation is often redundant: for example, if αe has no items in common with ti, computing the intersection of αe with ti is wasted effort.
Based on this observation, Yen et al. (2011) proposed an indexing data structure, called cid_list, corresponding to the vertical format of the stored itemsets: for each item, x, cid_list(x) maintains the indexes of the entries whose itemsets contain x. Using cid_list, we can restrict attention to the entries whose indexes are contained in \(\bigcup _{x\in t_{i}} cid\_list(x)\) and compute only their intersections with ti. However, the computational cost of maintaining cid_list is relatively high: the entire cid_list changes dynamically under addition and deletion operations. This overhead becomes especially pronounced for dense datasets, since most itemsets stored in Ti share items with ti.
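To make the vertical format concrete, the following minimal sketch shows how such an index can filter the candidate entries; the function names register and candidates are illustrative and are not taken from the implementation of Yen et al. (2011).

```python
from collections import defaultdict

# cid_list[x] holds the ids of the stored entries whose itemset contains x.
cid_list = defaultdict(set)

def register(entry_id, itemset):
    """Index a newly stored itemset under every item it contains."""
    for x in itemset:
        cid_list[x].add(entry_id)

def candidates(transaction):
    """Entries that can have a non-empty intersection with the transaction:
    the union of cid_list(x) over all items x in the transaction."""
    ids = set()
    for x in transaction:
        ids |= cid_list[x]
    return ids

# Only entries sharing at least one item with t_i are visited.
register(0, {1, 4, 5})
register(1, {2, 3})
print(candidates({1, 2}))   # {0, 1}; an entry for {6, 7} would be skipped
```

The maintenance overhead noted above is visible here: removing a single entry requires updating cid_list[x] for every item x in its itemset.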
Borgelt et al. (2011) proposed a fast two-pass FIM method, called ISTA, based on incremental intersection. In this implementation, a prefix tree (as well as a Patricia tree) was introduced to efficiently maintain Ti and perform the incremental intersection. Although it is reasonable to represent Ti with such a concise data structure, it is not directly applicable in the one-pass approximation setting used here. For example, we cannot use the item supports as static information, whereas they are available in a transactional database that allows multi-pass scanning. This information is crucial for constructing a compact trie by sorting the items in a pre-processing step. Note that the trie size can be directly affected by the order of the (sorted) transactions: indeed, for the retail dataset (No. 3 in Table 1), ISTA constructs a trie with 306 nodes, compared with 244,938 nodes when the pre-processing technique is not applied. Besides, in the context of the SD, where concept drift emerges, it is not appropriate to assume a static distribution.
Table 1 Characteristics of the used datasets

Thus, it is necessary to design a data structure better suited to our proposed algorithm, one that enables the pruning of redundant computations in the incremental intersection, as well as quick access to both the minimum entries and the Δ-covered entries required for PARASOL's deletion and Δ-compression, respectively.
Weeping tree
Now, we propose a variation of the binomial spanning tree (Johnsson and Ho 1989; Chang 2005), called the weeping tree, in which a collection of entries, Tn, can be represented in a binary n-cube as follows. Let e be an entry in Tn. By Theorem 1, αe corresponds to the intersection of a certain set, S, of transactions. This set can be represented as a binary address (x1, x2,…, xn), where each xj is one if S contains tj and zero otherwise. Every αe has its own binary address. Thus, Tn can be represented as a set of binary addresses, denoted by V(n), each of which identifies αe for an entry e ∈ Tn. Each binary address can be described by an integer, \(x = {\sum }_{j=1}^{n} (x_{j}\times 2^{n-j})\). Then, V(n) corresponds to a subset of {1,…,2^n − 1}.
Let x and y be two integers (1 ≤ x, y ≤ 2^n − 1) with the addresses \((x_{1},x_{2},\ldots , x_{n})\) and \((y_{1},y_{2},\ldots , y_{n})\), respectively. p(x) is the position in x that satisfies two conditions: xp(x) = 1 and xj = 0 for each j (p(x) + 1 ≤ j ≤ n). In other words, p(x) is the position of the least significant set bit of x. We say that x covers y if yj = xj for each j with 1 ≤ j ≤ p(x). For example, if x = 12 and y = 15 in the 4-cube, with addresses (1100) and (1111), then p(x) = 2, y1 = x1, and y2 = x2, so x covers y. We assume that zero covers every integer.
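The address encoding and the cover test reduce to a few bit operations on the integer form. The following is a minimal sketch (ours, not from the paper):

```python
def address_to_int(bits):
    """Convert a binary address (x1, ..., xn) to the integer
    x = sum_j x_j * 2^(n-j); x1 is the most significant bit."""
    n = len(bits)
    return sum(b << (n - j - 1) for j, b in enumerate(bits))

def covers(x, y):
    """x covers y iff y agrees with x on every position up to p(x),
    the least significant set bit of x."""
    if x == 0:
        return True          # zero (the root) covers every integer
    lsb = x & -x             # isolate the least significant set bit
    mask = ~(lsb - 1)        # keep positions 1..p(x), drop the rest
    return (y & mask) == x   # x itself has only zeros beyond p(x)

# The example from the text: x = (1100) = 12 covers y = (1111) = 15.
assert address_to_int((1, 1, 0, 0)) == 12
assert covers(12, 15) and not covers(12, 8)
```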
Now, we define the binomial spanning tree of V (n), following the notion in the literature (Chang 2005).
Definition 4 (Binomial spanning tree)
Let V(n) be a subset of {1,…,2^n − 1}, x be an integer such that 0 ≤ x ≤ 2^n − 1, and C(x) be the set of integers in V(n) each of which is covered by x. The binomial spanning tree of V(n) is the tree in which the root node, r, is zero and the other nodes are V(n). The children of each node, x, correspond to the following set:
$$\{\, y~|~y\in C(x) \text{ and } \not\exists y^{\prime}\in C(x) \text{ s.t. } y \neq y^{\prime} \text{ and } y \in C(y^{\prime})\,\}.$$
The siblings y(1), y(2),…, y(m) are sorted in descending order (i.e., y(j) is the left sibling of y(j+1)). We say that a non-root node, y, is a descendant of node x if y ∈ C(x), a precursor of x if y ∉ C(x) and y > x, and a successor of x if y < x. The precursors and successors of x are denoted by P(x) and S(x), respectively.
The weeping tree at time n, denoted by W(n), is the binomial spanning tree of V(n) obtained by associating each node, x, with its corresponding entry, e (i.e., αe is the intersection of the transactions indicated by the address x).
Example 5
Consider again the stream \(\mathcal {S}_{4}^{1}\) in Example 2. Let 𝜖 and k be 0.2 and 15, respectively. PARASOL uses the 15 entries in T(4) for all closed itemsets (i.e., V(4) = {1,2,…,15}). The corresponding weeping tree, W(4), is depicted in Fig. 7. Each node, x, is associated with its own entry, e. For example, the node x = 6 (0110) corresponds to the entry for α = t2 ∩ t3, i.e., α = {1,4,5}. Note also that C(6) = {7}, P(6) = {8,9,…,15}, and S(6) = {1,2,…,5}.
One crucial feature of the weeping tree is that it captures inclusion relationships among the stored itemsets in Tn.
Proposition 2
Let two nodes, x and y, be associated with two entries, ex and ey, respectively. If x covers y, then \(\alpha _{e_{y}}\subseteq \alpha _{e_{x}}\).
Proof
The address of y can be written as (x1,…, xp(x), yp(x)+1,…, yn), since x covers y. Accordingly, \(\alpha _{e_{y}}\) is written as \((\bigcap _{x_{j} = 1, 1\leq j \leq p(x)} t_{j})\ \cap \ I\), where \(I = \bigcap _{y_{j} = 1, p(x)+1\leq j \leq n} t_{j}\). Since p(x) is the position of the least significant set bit, \(\alpha _{e_{x}}=\bigcap _{x_{j} = 1, 1\leq j \leq p(x)} t_{j}\) holds, and hence \(\alpha _{e_{y}} = \alpha _{e_{x}}\cap I\). □
Proposition 2 has three useful implications. First, it can be used to prune intersection computations. Suppose that, during the updating process at time i, an entry, e, is found such that \(\alpha _{e} \subseteq t_{i}\). Since the itemset of every descendant of e must also be included in ti, it is unnecessary to compute the intersections of these descendants with ti. Second, Proposition 2 implies that every minimum entry must be located in the shallowest layer of the tree, due to the anti-monotonicity of Ti. This feature is useful for the minimum entry deletion (in practice, it is reasonable to maintain a min-heap over the shallowest layer).
Finally, Proposition 2 is applicable to the pairwise checking involved in Δ-compression. Suppose that we find a parent entry, ep, and its child entry, ec, such that \(C_{e_{c}} \leq C_{e_{p}}-{\Delta }_{e_{p}} + {\Delta }(n)\). Then, \(\alpha _{e_{c}}\) must be Δ(n)-covered by \(\alpha _{e_{p}}\), according to Corollary 1. Hence, a quick check can determine whether each child is Δ(n)-covered by its parent. Note that a brute-force approach requires \(O(k^{2})\) time for Δ-compression, while the quick pairwise checking can be completed in O(k) time. Thus, it is useful as a pre-processing step preceding Δ-compression.
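A minimal sketch of this quick check follows, assuming a simple node record (the class and field names are ours, not the paper's); each parent-child pair is visited exactly once, hence the O(k) bound. Since the root stores no entry, the scan starts from the root's children.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    itemset: frozenset            # the itemset alpha_e of the entry
    c: int = 0                    # frequency count
    delta: int = 0                # error count
    children: list = field(default_factory=list)

def quick_delta_check(parent, delta_n, removable):
    """Flag every child e_c with c_c <= c_p - delta_p + Delta(n); by
    Corollary 1, such a child is Delta(n)-covered by its parent e_p."""
    for child in parent.children:
        if child.c <= parent.c - parent.delta + delta_n:
            removable.append(child)
        quick_delta_check(child, delta_n, removable)

# Usage (the root itself stores no entry):
#   removable = []
#   for top in root.children:
#       quick_delta_check(top, delta_n, removable)
```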
Weeping tree updating
Here, we explain how to incrementally update W(i) at each time i. Suppose that an itemset, α, is newly stored in W(i + 1). The address (x1,…, xi, 1) is then assigned to α, where (x1,…, xi) is the address of the node, r, in W(i) that corresponds to a representative entry of α, if such an r exists; if there is no such node, r is the root node. Thus, the entry for α is newly located as a child of node r.
Example 6
Consider Example 5 again. Let α be the itemset {2,5} that is newly added in W(4). There exists a representative entry, r = 〈{2,4,5},2,0〉, for α in T(3) (see Fig. 4). Since r is given the address (101) at time i = 3, the address of α becomes (1011), and the entry for α is located as a child of r.
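In integer terms, this assignment is simple arithmetic: advancing the clock from i to i + 1 appends a trailing zero to every existing address (doubling its integer), and the new itemset takes its representative's address with a trailing one appended. A tiny sketch (the function name is ours):

```python
def new_address(rep_addr):
    """Address of a newly stored itemset at time i+1, given the integer
    address of its representative at time i (0 if it is the root)."""
    return (rep_addr << 1) | 1   # append a bit to r's address and set it to 1

# Example 6: r has address (101) = 5 at time i = 3, so alpha = {2,5}
# receives (1011) = 11 at time i = 4.
assert new_address(5) == 11
assert new_address(0) == 1       # no representative: a child of the root
```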
We can interpret the meaning of the address assigned to each entry. Let e be an entry of a node x with the address (x1,…, xn). We denote the positions of the least and most significant set bits of x by p(x) and q(x), respectively. Then, e can be written as 〈α,Δ(q(x) − 1) + B,Δ(q(x) − 1)〉, where \(\alpha = \bigcap _{x_{j}=1, q(x)\leq j\leq p(x)} t_{j}\), Δ(q(x) − 1) is the maximum error at time q(x) − 1, and B is the bit count of x (see Fig. 8). This observation leads to the following proposition:
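These quantities can be read directly off the integer form of the address with standard bit operations; the following sketch (ours) checks this reading for the node x = 6 from Example 5.

```python
def p(x, n):
    """Position of the least significant set bit, 1-indexed from the left."""
    return n - (x & -x).bit_length() + 1

def q(x, n):
    """Position of the most significant set bit, 1-indexed from the left."""
    return n - x.bit_length() + 1

# Node x = 6 = (0110) in the 4-cube: q(x) = 2, p(x) = 3, and the bit count
# B = 2, so its entry reads <t2 ∩ t3, Δ(1) + 2, Δ(1)>, as stated above.
assert q(6, 4) == 2 and p(6, 4) == 3 and bin(6).count("1") == 2
```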
Proposition 3
Consider two nodes, x and y, of entries ex and ey, respectively, in a weeping tree. If \(\alpha _{e_{y}}\subseteq \alpha _{e_{x}}\), then y is either a descendant or a precursor of x.
Proof
We derive a contradiction for the case that \(\alpha _{e_{y}}\subseteq \alpha _{e_{x}}\) and y < x. Let v be the address obtained by the bitwise OR operation between x and y. Since y < x and x ≤ v, we have y < v. We write v as (v1,…, vn) and denote by αv the intersection \(\bigcap _{v_{j} = 1, 1\leq j\leq n} t_{j}\). Since \(\alpha _{e_{y}}\subseteq \alpha _{e_{x}}\), we have \(\alpha _{v}=\alpha _{e_{y}}\). Thus, node v cannot appear in the tree, since duplicates never occur in Tn. Without loss of generality, this implies that v was deleted at some time, referred to as time m. Accordingly, the tree never contains a node, u, with an address (u1,…, un) such that q(u) = q(v) and the bit count of the m-prefix (u1,…, um) is lower than the bit count of the m-prefix (v1,…, vm); every node with such an address was deleted at time m, along with v (i.e., u has a lower frequency count than v). Hence, no node can have an address whose m-prefix matches (u1,…, um) or (v1,…, vm). Next, we consider the address of node x. Since y < x and v is obtained by the bitwise OR operation between x and y, we have q(x) = q(v). In addition, (x1,…, xm) is either equal to (v1,…, vm) or has a lower bit count than (v1,…, vm). Hence, x cannot appear in the tree. This is a contradiction. □
For example, consider node 6 for the itemset {1,4,5}, as shown in Fig. 7. There are three nodes (7, 14, and 15) whose itemsets are subsets of this itemset, and each of them is either a descendant or a precursor of node 6.
Proposition 3 is useful for pruning the computation in the incremental intersection. Suppose that, for some entry C, ti ⊆ αC holds. Then, the intersection computations of every successor of C with ti can be skipped, since the successors store no subset of ti (see Fig. 9).
The weeping tree can thus be used to perform the incremental intersection by a depth-first, left-to-right traversal. Algorithm 3 sketches the process for updating a node x in W(i) with an itemset E. Note that E is initially a transaction.
In the algorithm, a node, x, is identified with its associated entry, ex; αx, cx, and Δx denote the itemset, frequency count, and error count of ex, respectively. Given the transaction ti+1 and W(i), the next tree, W(i + 1), is obtained by calling the function update(root, ti+1, W(i)). Note that W(0) is defined as the initial tree consisting of the root node only.
A few characteristics of the update(x, E, W(i)) algorithm should be noted (a code sketch follows the list):
- Line 4 checks whether \(\alpha _{y}\subseteq E\). By Proposition 2, the itemset of every descendant of y is then also included in E. Thus, the frequency count of each node, z ∈ C(y), can simply be incremented without computing any further intersections. This is called descendant-intersect-skipping (DIS).
- In Line 9, we continue the updating process. In the recursive call, the intersection, I, is used instead of the original itemset, E. Since I = αy ∩ E and \(\alpha _{y^{\prime }} \subseteq \alpha _{y}\) for each child \(y^{\prime }\) of y, it follows that \(\alpha _{y^{\prime }}\cap E = \alpha _{y^{\prime }}\cap I\). By reducing E to the subset I, the computational cost of the recursive calls after Line 9 is reduced. This pruning technique is called masking.
- If I = ∅, every descendant of y has no items in common with E, so the descendants need not be updated. This is called descendant-update-skipping (DUS).
- Line 11 checks whether \(E \subseteq \alpha _{y}\). If so, the right siblings of y need not be updated. This follows from Proposition 3: the entry for any subset of E never appears among the successors of y. This is called successor-update-skipping (SUS).
- Finally, if there is no entry for E in W(i), a new entry for E is added as the right-most child of x. Note that if x is the root node, we set cx = 0 and Δx = Δ(i).
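The following is a minimal, hypothetical reconstruction of the update function from the four rules above; the actual Algorithm 3 may differ in its bookkeeping details. It re-uses the Node class from the Δ-compression sketch; a plain dict keyed by itemset stands in for the existence check of Line 15, whose mechanism the paper does not specify, and a new entry receives the count cx + 1 and error Δx of its representative x, following Example 6.

```python
def increment_subtree(y):
    """DIS: alpha_y ⊆ E implies, by Proposition 2, that every descendant
    of y is also ⊆ E, so all counts are incremented with no intersections."""
    y.c += 1
    for z in y.children:
        increment_subtree(z)

def update(x, E, index):
    """Update the subtree of x with the itemset E (initially t_{i+1});
    `index` maps each stored itemset to its node (our stand-in for the
    existence check of Line 15)."""
    for y in x.children:
        if y.itemset <= E:                 # Line 4: alpha_y ⊆ E
            increment_subtree(y)           # DIS
        else:
            I = y.itemset & E              # masking: recurse with I, not E
            if I:                          # Line 9
                update(y, I, index)
            # I empty: DUS, the whole subtree of y is skipped
        if E <= y.itemset:                 # Line 11: E ⊆ alpha_y
            break                          # SUS: skip the right siblings
    if E not in index:                     # Line 15: store E itself
        new = Node(E, x.c + 1, x.delta)    # representative rule (Example 6)
        x.children.append(new)             # right-most child of x
        index[E] = new

# Driver for the stream of Example 7; for simplicity the root carries
# c = 0 and delta = 0 (i.e., Δ(i) is taken to be 0 here).
root, index = Node(frozenset()), {}
for t in [{1, 2, 3, 5}, {1, 2, 4}, {2, 3, 4}, {1, 2, 5}]:
    update(root, frozenset(t), index)
```

On this stream, the sketch reproduces the run described in Example 7 below: DIS fires at the node for {1,2}, a new entry for {1,2,5} is appended under the node for {1,2,3,5}, and SUS skips the remaining root-level siblings.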
Example 7
We explain how Algorithm 3 works using
$${\mathcal S}^{2}_{4} = \langle \{1,2,3,5\},\{1,2,4\},\{2,3,4\},\{1,2,5\}\rangle.$$
W(3) corresponds to the left tree in Fig. 10. The function update(root, t4, W(3)) is called to derive W(4) from W(3). For the left-most child, e1, the intersection, I1 = {1,2,5}, of e1 with t4 is computed. Since I1 ≠ ∅, update(e1, I1, W(3)) is called, as shown in Line 9. For the left-most child, e2, of e1, the intersection, I2 = {1,2}, of e2 with I1 is computed using I1 by masking. Since \(I_{2} = \alpha _{e_{2}}\), DIS is applied in Lines 5-7. Then, \(c_{e_{5}}\) is simply incremented by one. Moving to the right sibling, e4, the intersection, I4 = {2}, of e4 with I1 is computed. Since I4 ≠ ∅, update(e4, I4, W(3)) is called. Since e4 does not have any children, the algorithm checks whether the entry for I4 exists and then backtracks to the second call (i.e., update(e1, I1, W(3))). Since e4 has no right sibling, the algorithm proceeds to Line 15, and a new entry, e7, for I1 is added as the right-most child of e1. After backtracking to the first call, SUS is applied in Line 11, since I1 = t4. Thus, the updating of the two right-most sibling nodes, e3 and e6, is skipped, and the algorithm proceeds to Line 15. Since the entry for t4 now exists, the updated tree, W(4), is returned as the output.
In this way, the update function realizes the incremental intersection. Note that every node in W(i) is visited at most once, which implies that Algorithm 3 efficiently runs update(root, ti+1, W(i)) to return W(i + 1) in O(kL) time.
Next, we show how PARASOL realizes the minimum entry deletion in the weeping tree. As explained before, the minimum entries to be deleted are located in the shallowest layer, directly below the root. Recall Example 5, in which 𝜖 was set to 0.25 and PARASOL was used to delete the entries with frequency counts of one at time i = 4. These minimum entries can be quickly accessed by applying a min-heap to the shallowest layer. The reduced weeping tree is obtained by reconnecting the children of the deleted nodes to the root, as shown in Fig. 11.
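A minimal sketch of this deletion, re-using the Node class above and keyed on the frequency count alone for simplicity:

```python
import heapq

def delete_minimum_entries(root):
    """Remove every entry with the minimum frequency count.  By
    Proposition 2, such entries all lie in the shallowest layer (the
    root's children), so a min-heap over that layer yields them directly;
    the children of each deleted node are reconnected to the root, as in
    Fig. 11.  The id() tiebreaker keeps heap items comparable."""
    heap = [(y.c, id(y), y) for y in root.children]
    heapq.heapify(heap)
    if not heap:
        return
    c_min = heap[0][0]
    while heap and heap[0][0] == c_min:
        _, _, y = heapq.heappop(heap)
        root.children.remove(y)
        for z in y.children:           # reconnected children join the layer
            root.children.append(z)
            heapq.heappush(heap, (z.c, id(z), z))
```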
The weeping tree can also be used for a pre-processing step of the Δ-compression, in which we find each child node, e, such that ce ≤ cr − Δr + Δ(n), where r is the parent node of e. By Corollary 1, such an e can be deleted from the tree.
Figure 11 shows the reduced tree obtained by searching the weeping tree W(4) in a bottom-up manner, from the left-most leaf to the root; this process results in the removal of four nodes (7, 11, 13, and 15). Note that Δ-compression requires \(O(k^{2})\) time to completely check every pair of nodes. Thus, it is reasonable to carry out the Δ-compression in a two-step procedure: first checking each parent-child pair in a one-time scan, and subsequently performing the brute-force search over the remaining nodes.
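A sketch of this two-step procedure, combining quick_delta_check from above with a brute-force pass; note that the paper states the count condition of Corollary 1 only for parent-child pairs, so applying it to arbitrary subset pairs in the second step is our assumption (the paper's brute-force step may instead use the full definition of Δ(n)-coverage):

```python
def all_entries(x):
    """Every entry below x, in depth-first order."""
    for y in x.children:
        yield y
        yield from all_entries(y)

def two_step_compression(root, delta_n):
    # Step 1: O(k) one-time scan of parent-child pairs.
    removable = []
    for top in root.children:          # the root itself stores no entry
        quick_delta_check(top, delta_n, removable)
    dead = {id(e) for e in removable}
    # Step 2: O(k^2) brute force over the survivors.
    rest = [e for e in all_entries(root) if id(e) not in dead]
    for e in rest:
        if id(e) in dead:
            continue
        for f in rest:
            if (f is not e and id(f) not in dead
                    and e.itemset <= f.itemset
                    and e.c <= f.c - f.delta + delta_n):
                dead.add(id(e))        # e is Δ(n)-covered by f
                break
    return dead                        # ids of the entries to remove
```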