CC-IFIM: an efficient approach for incremental frequent itemset mining based on closed candidates

Frequent itemset mining (FIM) is a crucial task in mining association rules: it finds all frequent k-itemsets in a transaction dataset, from which all association rules are extracted. In the big-data era, datasets are huge and rapidly expanding, so adding new transactions as time advances results in periodic changes in the correlations and frequent itemsets present in the dataset. Re-mining the updated dataset is impractical and costly. This problem is solved via incremental frequent itemset mining. Numerous researchers view the new transactions as a distinct dataset (partition) that may be mined to obtain all of its frequent itemsets. The extracted local frequent itemsets are then combined to create a collection of global candidates, whose support counts can be estimated to avoid re-scanning the dataset. However, these works are hampered by the growth of a huge number of candidates, and the support count estimation is still imprecise. In this paper, the Closed Candidates-based Incremental Frequent Itemset Mining approach, or CC-IFIM, is proposed to decrease candidate generation and improve the accuracy of the retrieved global frequent itemsets. The proposed approach is able to prune several produced candidates in early steps, before performing any further computations. To improve the accuracy of the computation of the support count of the produced candidates, the similarity between partitions is evaluated using just the local closed candidates rather than all candidates. The experimental findings demonstrate that the CC-IFIM approach is superior to its competitors in terms of efficiency and accuracy.


Introduction
Association rule mining (ARM) is one of the key tasks in data mining. It discovers interesting correlations among sets of items in large transaction datasets. ARM has developed into a powerful tool with immense potential and wide applications such as market basket analysis, medical diagnosis, bioinformatics, fraud detection, and the Internet of Things.
Formally, let I = {I1, I2, …, Im} be a set of m items and D = {T1, T2, …, Tn} a dataset of n transactions, where Ti ⊆ I. ARM is the process of mining all strong association rules (AR) by performing the following two tasks [1]. First, generate all frequent k-itemsets (FI), k = 1, 2, …, m. A k-itemset X (a set of k items) is called frequent if sup(X) ≥ minsup × n, where sup(X) = |{Ti : X ⊆ Ti}| and minsup is a given threshold (say 80%). Second, from the extracted FI, the strong association rules (AR) are mined; an association rule R ∈ AR of the form X ⇒ Y is strong if sup(X ∪ Y) ≥ minsup × n and conf(R) = sup(X ∪ Y)/sup(X) ≥ minconf, where minsup and minconf are given thresholds.
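For concreteness, the definitions above can be sketched in Python as follows. This is a brute-force illustration of sup(X) and FI on a hypothetical four-transaction dataset; the names and the exhaustive enumeration are ours, not part of any algorithm cited here, and practical miners such as Apriori or FP-growth avoid this exponential search.

```python
from itertools import combinations

def support(itemset, transactions):
    """sup(X): number of transactions T_i with X ⊆ T_i."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def frequent_itemsets(transactions, minsup):
    """All frequent k-itemsets by brute-force enumeration
    (exponential; shown only to illustrate the definition)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    fi = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            cnt = support(cand, transactions)
            if cnt >= minsup * n:
                fi[cand] = cnt
                found = True
        if not found:  # Apriori property: no superset can be frequent either
            break
    return fi

# Toy dataset: n = 4 transactions over {a, b, c}, minsup = 50% (count >= 2).
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
fi = frequent_itemsets(D, 0.5)
```

Here every single item and every pair is frequent, while {a, b, c} appears in only one transaction and is therefore pruned.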
Frequent itemset mining (FIM) is the crucial task in ARM that finds all FI among a set of items in D. The exponential complexity of FIM attracts researchers to develop new algorithms to improve it. On huge datasets, the traditional algorithms, such as Apriori [1], FP-growth [9], Eclat [19] and Quick-Apriori [17], require exponential time and memory for mining all FI. In several applications, transaction datasets grow continuously, so adding incremental transactions to the original dataset requires repeating the mining process. These algorithms are considered static because they have no way to merge the new transactions without re-mining the updated dataset. Due to their exponential complexity, traditional mining approaches fail to address continuously increasing datasets.
To overcome these obstacles, incremental frequent itemset mining (IFIM) is utilized. IFIM approaches fall into two categories: Apriori-based [4,5,20] and tree-based [9,10,13,15]. Apriori-based approaches suffer from the I/O overhead of scanning the dataset many times and from the computation and memory cost of generating a huge number of candidates, while FP-tree-based approaches suffer from complex operations for adjusting trees [18].
Nowadays, when the number of transactions is very huge and continuously changing, all the above approaches are not applicable because they cannot produce results efficiently. Therefore, some approaches [3,18] introduced approximated solutions for IFIM. These approaches can be utilized in mining FI as well. For IFIM, the approximated approach FPMSIM [18] considers the new transactions as a separate dataset to be mined, from which all local FI are extracted. Subsequently, the extracted local FIs of both the original and the incremental datasets are combined to produce global candidates. The support count of each candidate is estimated using a statistical model such as the Jaccard similarity between the produced candidates of the original and incremental datasets. Although the approach [18] achieves good results, it produces many candidate itemsets that could be pruned before any further computation. In addition, the estimation of the support count of all candidates is imprecise, which causes the loss of global FI. The purpose of this paper is to address the issues raised in FPMSIM [18]. The proposed approach is named CC-IFIM, or Closed Candidates-based Incremental Frequent Itemset Mining. It makes use of a pruning mechanism to eliminate certain pointless candidates early. To increase the accuracy of the estimated support count of all candidates, the CC-IFIM approach measures the similarity between just the closed candidates of the original and incremental datasets, rather than all produced candidates. The experimental results on five benchmark datasets show that the CC-IFIM approach is more efficient and accurate than the FPMSIM approach.
The remainder of this paper is organized as follows: The related works are discussed in Sect. 2, where FPMSIM , which is the source of CC-IFIM, is given specific attention. Section 3 introduces the proposed approach CC-IFIM. Section 4 discusses the experimental results. Finally, Sect. 5 concludes the paper.

Related work
In this section, incremental frequent itemset mining is classified into two categories: exact approaches that find all FI (Sect. 2.1), and approximated approaches that try to find an approximated FI, denoted FI_approx, with FI_approx ⊆ FI (Sect. 2.2).

Exact approaches
Exact approaches can be classified into two major categories, Apriori-based and FP-tree-based, as shown in the following two subsections:

Apriori-based
Apriori-based approaches first scan the incremental dataset to extract its FIs, then re-scan the original dataset to get the exact support of each candidate, as in FUP [4] and FUP2 [5]. However, re-scanning the original dataset causes heavy I/O cost.
In 2001, Zhou et al. [20] developed the MAAP algorithm, which utilizes the characteristics of Apriori to improve the overall performance. Whereas the Apriori algorithm joins low-level itemsets to generate high-level itemsets, MAAP uses high-level itemsets to deduce low-level itemsets. MAAP compares frequent patterns with new transactions to generate new association rules. In addition, MAAP improves on the performance of FUP [4]; however, the problem of re-scanning the dataset still exists.

FP tree-based
Rather than employing the generate-and-test strategy of Apriori-based algorithms, the tree-based framework constructs an extended prefix-tree structure, called the Frequent Pattern tree (FP-tree), to capture the content of the transaction database [9].
In 2008, Hong et al. [10] proposed a fast and effective method for updating the structure of the FP-tree, called the FUFP (Fast Updated Frequent Pattern) algorithm, in which only frequent items are saved in the FUFP-tree. The FUFP algorithm can quickly update and modify the tree by classifying items into cases. When an originally large (frequent) item becomes small, it is directly deleted from the FUFP-tree; conversely, when an originally small item becomes large, it is added to the end of the header table in descending order. However, it needs to re-scan the original dataset to find the transactions of the newly enlarged items and insert them into the FUFP-tree.
In 2014, two remarkably efficient algorithms were introduced: FIN [7] and PrePost+ [8], with the POC-tree and PPC-tree, respectively. These two structures are prefix trees similar to the FP-tree. Moreover, the two algorithms employ two additional data structures, called Nodeset and N-list, respectively, to significantly improve mining speed. However, the N-list consumes a lot of memory, and for some datasets the Nodeset's cardinality grows significantly [6].
The preceding issue was resolved in 2016 by Deng, who proposed an algorithm called dFIN [6] based on a new data structure called DiffNodeset instead of Nodeset. In contrast to Nodeset, the DiffNodeset of each k-itemset (k ≥ 3) is obtained from the difference between the DiffNodesets of two (k−1)-itemsets. Numerous investigations demonstrate that DiffNodeset's cardinality is lower than Nodeset's; as a result, the dFIN algorithm is quicker than Nodeset-based algorithms. However, the calculation of the difference between two DiffNodesets can be time-consuming for some datasets.
In 2017, Huynh et al. [11] proposed the IPPC (Incremental Pre-Post-order Coding) tree structure, which supports incremental tree construction, together with an algorithm for incrementally mining frequent itemsets, IFIN (Incremental Frequent Itemsets Nodesets). Through experiments, the IFIN algorithm has demonstrated superior performance compared to FIN [7] and PrePost+ [8]. However, for datasets comprising a large number of distinct items but only a small percentage of frequent items at a given support threshold, the IPPC tree loses its advantage in running time and construction memory compared to trees such as the POC and PPC trees of the FIN and PrePost+ algorithms.
In 2021, Satyavathi et al. [14] proposed the FIN_INCRE algorithm, which enhances the FIN algorithm [7] for efficient mining of incremental association rules and requires scanning the original dataset only once. After scanning the dataset, it generates the POC-tree, from which it produces the itemsets that occur frequently; these are then used to generate association rules. When new instances are inserted into the original dataset, the algorithm scans only the newly added instances, then updates the POC-tree and the frequent itemsets before updating the mined association rules.
However, FP tree-based approaches suffer from complex operations for adjusting FP-tree.

Approximated approaches
Regardless of whether they use Apriori [2] or the FP-tree [9], the key goal of these approaches is reducing both the I/O cost and the complex generation of FP-trees.
In 2017, Li et al. [12] proposed a three-way decision update pattern (TDUP) approach along with a synchronization mechanism for this issue. With two support-based measures, all possible itemsets are divided into positive, boundary, and negative regions. TDUP efficiently updates frequent itemsets and reduces the cost of re-scanning. However, TDUP may miss some potentially frequent itemsets under incremental data updates.
In 2021, Xun et al. [18] developed an approach for incremental frequent itemset mining based on the frequent pattern tree and multi-scale theory, called FPMSIM. The partitioning-based FPMSIM is valid not only for parallel mining of all FIs, but also for addressing the problem of incremental frequent itemset mining. In general, as shown in Algorithm 1, FPMSIM divides the dataset into scales or partitions using a multi-scale concept in which each partition may have the same essence and characteristics. Using FP-growth [9], the local frequent itemsets FI of each partition are extracted. The union of all local frequent itemsets is considered as the global candidate set C. To estimate the support of the global candidates, the Jaccard method [16] is used for measuring the similarities between scales or partitions. Finally, the global frequent itemsets are the candidates whose estimated support is greater than or equal to the global minsup. On big data, this approach is efficient and scalable and produces an acceptable FI_approx compared with the exact FI; however, it produces many candidate itemsets that could be pruned before any further computation. Additionally, the estimation of the support count of all candidates is imprecise, which causes the loss of global FI and, consequently, the missing of some potentially frequent itemsets.
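The partition-merge-estimate pipeline described above can be sketched roughly as follows. The local mining step is abstracted away (each partition is represented by a dictionary mapping its locally frequent itemsets to their counts), and the exact rule for filling in missing counts is our assumption for illustration, not the published FPMSIM formula.

```python
def jaccard(a, b):
    """Jaccard similarity between the sets of itemsets frequent in two partitions."""
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_candidates(local_fis):
    """Global candidates: the union of the local frequent itemsets of all partitions."""
    cands = set()
    for fi in local_fis:
        cands |= set(fi)
    return cands

def estimate_support(cand, local_fis, sim):
    """Sum a candidate's support over all partitions, filling each partition where
    it is locally infrequent with similarity-scaled counts borrowed from the
    partitions where it is frequent (the filling rule here is our assumption)."""
    known = [j for j, fi in enumerate(local_fis) if cand in fi]
    total = 0.0
    for j, fi in enumerate(local_fis):
        if cand in fi:
            total += fi[cand]
        elif known:
            total += sum(sim[j][j2] * local_fis[j2][cand] for j2 in known) / len(known)
    return total

# Two partitions: ('b',) is locally frequent only in the first one.
fi1 = {("a",): 3, ("b",): 2}
fi2 = {("a",): 4}
parts = [fi1, fi2]
keys = [set(p) for p in parts]
sim = [[jaccard(keys[j], keys[j2]) for j2 in range(2)] for j in range(2)]
globals_ = {c: estimate_support(c, parts, sim) for c in merge_candidates(parts)}
```

The point of the sketch is the shape of the estimate: the higher the inter-partition similarity, the larger the count credited to a candidate in a partition where it was not locally frequent, which is exactly the quantity CC-IFIM tries to make more accurate.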

The proposed approach CC-IFIM
In this section, an improved version of the FPMSIM approach [18] is introduced. This approach is called Closed Candidates-based Incremental Frequent Itemset Mining, denoted CC-IFIM. Although FPMSIM and CC-IFIM share some phases, there are many differences between them. The core difference is that CC-IFIM ignores the division of the dataset into scales, which requires additional cost for obtaining further information as in FPMSIM; this kind of division is not applicable in most real applications.

Consider a transaction dataset D and a minimum support count minsup_D. Algorithm 2 shows the steps of extracting FI, where CC-IFIM works as follows:
1. Divide D into k partitions P_1, P_2, …, P_k, where each P_j has its own minimum support, called minsup_Pj.
2. Using the Apriori [2] or FP-tree [9] algorithm, mine the local frequent itemsets FI_j of every partition P_j ∈ P.
3. Form the global candidate set C_D = FI_1 ∪ FI_2 ∪ … ∪ FI_k, and let q be the number of generated candidates (i.e., q = |C_D|). Each itemset c ∈ C_D is called a candidate itemset of D.
4. Build a matrix S ∈ ℝ^(q×k), in which row i represents candidate c_i (i = 1, 2, …, q) and column j represents partition P_j; S(i, j) holds sup_j(c_i) when c_i is frequent in P_j, and 0 otherwise.
5. Associate with S an extra column m ∈ ℝ^(q×1) of dimension q, recording, for each candidate c_i, the set p+(i) = {j : S(i, j) ≠ 0} of partitions in which c_i is frequent.
6. Extract the closed candidates. Using the candidate set C_D and the matrix S, which stores the local frequent itemsets, extract only the closed candidates CC_D, which contain all local closed frequent itemsets (Definition 1).
7. Using CC_D, build its corresponding matrix S_closed ∈ ℝ^(q_c×k), where q_c = |CC_D|, q_c ≤ q, and S_closed(i, j) = S(i, j) for cc_i ∈ CC_D and j = 1, 2, …, k. In CC-IFIM, the matrix S_closed is used for creating the similarity matrix M_CC, as shown in the next step, instead of the matrix S as in FPMSIM.
8. Using S_closed, build a symmetric similarity matrix M_CC ∈ ℝ^(k×k), where M_CC(j, j′) is the similarity between the partitions P_j and P_j′ for all j, j′ = 1, 2, …, k and j ≠ j′. Three versions of similarity, Jaccard, Dice, and Cosine [16], have been used, from which the most accurate one is chosen as the similarity measurement. Let F_j = {i : S_closed(i, j) ≠ 0} denote the closed candidates that are frequent in partition j; simply, S_closed(i, j) ≠ 0 can be read as "the candidate cc_i is frequent in partition j". The Jaccard similarity matrix is calculated as follows:
M_CC(j, j′) = |F_j ∩ F_j′| / |F_j ∪ F_j′|
Therefore, M_CC(j, j′) is the number of closed candidates that are frequent in both partitions j and j′ divided by the number of closed candidates that are frequent in at least one of them. In CC-IFIM, we found that Jaccard similarity is not the most suitable method for measuring the similarity between partitions, as shown in the experimental results section. The Dice similarity matrix is calculated as follows:
M_CC(j, j′) = 2|F_j ∩ F_j′| / (|F_j| + |F_j′|)
The Cosine similarity is calculated as follows:
M_CC(j, j′) = |F_j ∩ F_j′| / sqrt(|F_j| · |F_j′|)
For any version of the similarity method, obviously M_CC(j, j) = 1 and M_CC(j, j′) = M_CC(j′, j). The coefficient M_CC(j, j′) ranges between 0 and 1: M_CC(j, j′) = 1 means the partitions j and j′ are similar, while M_CC(j, j′) = 0 means the partitions j and j′ are dissimilar and there is no intersection between them.
9. Pruning step. After creating the matrix M_CC, we return to C_D and its corresponding matrix S. Before estimating the global support, a hypothesis is used at this early step to predict whether a candidate c_i can be globally frequent. Therefore, for every c_i ∈ C_D, Critical_sup is calculated as follows:
Critical_sup(c_i) = Σ_{j=1..k} v(i, j), where v(i, j) = S(i, j) if S(i, j) ≠ 0, and v(i, j) = minsup_Pj − 1 otherwise,
since, when S(i, j) = 0, c_i is infrequent in partition j.
Therefore, the largest support value c_i can have in such a partition is minsup_Pj − 1, the largest count below minsup_Pj. Remove row i from the matrix S, and consequently remove candidate c_i from C_D, if Critical_sup(c_i) < minsup. This step reduces the computation and memory cost of the remaining steps over the matrix S.
10. Estimate the support of each candidate c_i ∈ C_D in every partition j with S(i, j) = 0. The estimated support Ŝ(i, j) is computed according to the following equation:
Ŝ(i, j) = (1 / |p+(i)|) Σ_{j′ ∈ p+(i)} M_CC(j, j′) · S(i, j′)
11. For each candidate c_i ∈ C_D, estimate a global support using the following equation:
sup(c_i) = Σ_{j=1..k} S(i, j), where the zero entries of row i are replaced by their estimates Ŝ(i, j)
12. Add c_i to the global frequent itemsets FI if it satisfies the condition sup(c_i) ≥ minsup × n.
The following example discusses each step of CC-IFIM for extracting FI and the differences between FPMSIM and CC-IFIM. Consider the transaction dataset D (Fig. 1), which contains sixteen transactions (n = 16) and five items I = {a, b, c, d, e} (|I| = 5), with minsup = 50% of D, i.e., 0.5 × 16 = 8. CC-IFIM works as follows:
1. Firstly, D is divided into four partitions (k = 4): P1, P2, P3, and P4 (Fig. 1).
• Incremental dataset. Finally, this approach deals with any incremental dataset as a new partition P2, where P1 is the original dataset. As shown in Algorithm 2, FI_1 and FI_2 are extracted from the partitions P1 and P2, respectively, and all the steps of Algorithm 2 are applied to P1 and P2: form the candidates C = FI_1 ∪ FI_2; form the matrix S for the two partitions P1 and P2 as mentioned earlier; find the closed candidates CC, then the similarity matrix M_CC corresponding to the closed candidates in CC (Definition 1); apply the pruning step; then, using M_CC, estimate the support of the zero entries in the matrix S to find the global support of all candidates; finally, keep in FI the candidates that satisfy the minsup threshold.
• The differences between the traditional, FPMSIM, and CC-IFIM approaches. The traditional approach is applied to the transaction dataset D (Fig. 1), where the extracted FI_Original = {a, b, c, d, e, ab, ac, bc, ce, abc}. Using FPMSIM, the similarity matrix M_C (shown in the FPMSIM part at the right of Fig. 3) is measured based on the original matrix S that corresponds to C_D, instead of M_CC as in CC-IFIM.
In FPMSIM, which is based on Jaccard similarity, the matrix M_C is calculated according to the following equation:
M_C(j, j′) = |{i : S(i, j) ≠ 0 ∧ S(i, j′) ≠ 0}| / |{i : S(i, j) ≠ 0 ∨ S(i, j′) ≠ 0}|
The reason for calculating the similarity matrix based on CC_D instead of C_D is the following property: if two frequent itemsets X and Y have the same support in all the corresponding partitions, and X ⊂ Y, they are equivalent and share the same information. Therefore, using both X and Y for computing the similarity between two partitions j and j′ decreases the coefficient M(j, j′), because the denominator is increased twice, once for X and once for Y; consequently, the calculated similarity coefficient is decreased. Instead, we ignore X and keep only Y, so the denominator is increased only once, which leads to a higher similarity coefficient than before. This means that the similarity computed based only on the local closed candidates is more realistic and better. For example, using CC_D, M_CC(1, 2) = 7/9 = 0.7778: the number of itemsets that are frequent in both P1 and P2 (the intersection) is 7, namely {a, b, c, d, ab, ac, abc}, while the union is 9, namely {a, b, c, d, e, ab, ac, ce, abc}. Similarly, using C_D, M_C(1, 2) = 9/17 = 0.52941. As shown in Fig. 3, the coefficients of the matrix M_CC are greater than the coefficients of the matrix M_C; therefore, the support estimated by CC-IFIM is greater than that of FPMSIM. Back to the example, it is worth noting that, for calculating the similarity matrix, we consider the whole candidate set C_D before applying the pruning step, as in our approach.
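Under our reading of Definition 1 and the equations above, steps 6-9 of CC-IFIM can be sketched as follows. Here `S` maps each candidate to its row of per-partition support counts; the closedness test, the three similarity variants, and the Critical_sup pruning follow the formulas of this section, while the names and data layout are our own illustration.

```python
import math

def closed_candidates(S):
    """Keep candidate X only if no proper superset Y has an identical
    per-partition support row (our reading of the local closed
    candidates of Definition 1)."""
    cc = {}
    for x, row in S.items():
        xs = frozenset(x)
        if not any(frozenset(y) > xs and S[y] == row for y in S):
            cc[x] = row
    return cc

def similarity(cc, k, kind="dice"):
    """Symmetric k x k partition-similarity matrix computed over the
    closed candidates only (Jaccard, Dice, or Cosine, as in step 8)."""
    freq = [{x for x, row in cc.items() if row[j] != 0} for j in range(k)]
    M = [[1.0] * k for _ in range(k)]
    for j in range(k):
        for j2 in range(j + 1, k):
            a, b = freq[j], freq[j2]
            inter = len(a & b)
            if kind == "jaccard":
                v = inter / len(a | b) if a | b else 0.0
            elif kind == "dice":
                v = 2 * inter / (len(a) + len(b)) if a or b else 0.0
            else:  # cosine
                v = inter / math.sqrt(len(a) * len(b)) if a and b else 0.0
            M[j][j2] = M[j2][j] = v
    return M

def prune(S, minsups, global_minsup):
    """Pruning step: drop candidate c_i when even its most optimistic support,
    counting minsup_j - 1 in every partition where it is locally infrequent,
    cannot reach the global minimum support (the Critical_sup test)."""
    return {
        x: row
        for x, row in S.items()
        if sum(v if v != 0 else minsups[j] - 1 for j, v in enumerate(row)) >= global_minsup
    }

# Toy run over two partitions: ('a',) is absorbed by its superset
# ('a', 'b'), which carries the same support row.
S = {("a",): (3, 4), ("b",): (2, 0), ("a", "b"): (3, 4)}
cc = closed_candidates(S)
M = similarity(cc, k=2, kind="jaccard")
```

In this toy run the Jaccard coefficient over closed candidates is 1/2, and the candidate ('b',) is pruned once its Critical_sup, 2 + (2 − 1) = 3, falls below a global threshold of 4; this mirrors the behaviour the section describes on the Fig. 1 example.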

Experimental results
In this section, we compare CC-IFIM with FPMSIM [18]. We implemented both approaches in Python. The implementations were applied to various real and synthetic datasets (shown in Table 1), available at http://fimi.ua.ac.be, which are widely used for performance evaluation in the pattern mining area. Accidents and Mushroom are dense datasets, while Retail is a sparse dataset. Generally, a dense dataset is composed of relatively long transactions and a small number of items, whereas a sparse dataset is characterized by relatively short transactions and a large number of items.
We also used a synthetic sparse dataset, T10I4D100K, and a real dataset of online news portal click-stream data (Kosarak). The efficiency (running time) of CC-IFIM compared to FPMSIM is shown in Sect. 4.1. Section 4.2 introduces a comparative study that shows the accuracy of CC-IFIM compared to FPMSIM using three measurements: coverage, precision, and average support error.

Efficiency of CC-IFIM
Several minsup thresholds were used on the mentioned datasets.
• Running time. Since the generation of the FI of each part is identical in FPMSIM and CC-IFIM, the running time of FI generation is not included in the comparisons. Figure 4 shows the running time in seconds of CC-IFIM compared to FPMSIM. In CC-IFIM, the running time is measured starting from pruning C_D, getting CC_D, and calculating M_CC, through estimating the zero entries in the matrix S. In FPMSIM, the running time is measured starting from calculating M_C through estimating the zero entries in the matrix S. All results show that CC-IFIM is better than FPMSIM: on the Mushroom, Accidents, Retail, T10I4D100K, and T40I10D100K datasets, CC-IFIM reduced the execution time of FPMSIM by around 10-12%, 10-42%, 16-34%, 10-18%, and 10-18%, respectively. On the Kosarak dataset, CC-IFIM reduced the execution time of FPMSIM by around 10-13%. Although our approach spends time extracting the closed candidates, with the pruning step it is still more efficient than FPMSIM.
• Memory consumption. Figure 5 shows the memory consumption of CC-IFIM compared to FPMSIM. All results show that CC-IFIM uses memory more efficiently than FPMSIM; our approach consumed less memory than FPMSIM.

Accuracy of CC-IFIM Using M CC
The coverage, precision, and average support error of CC-IFIM are evaluated against FPMSIM [18]. The coverage is calculated using the following equation:
coverage = |FI_app ∩ FI| / |FI|
where FI_app is the approximated FI produced by an approach and fPFI = FI_app − FI is the false positive FI. The precision reflects the effect of false positive and false negative itemsets on the accuracy of the experiments [18], and is defined as:
precision = (|FI_app| − |fPFI|) / |FI_app|
The support estimation error is the average of the difference between the estimated support sup*(f) of the frequent itemsets f in CC-IFIM and the real support sup(f) of these itemsets, as follows:
avg_sup_error = (1 / |FI_app|) Σ_{f ∈ FI_app} |sup*(f) − sup(f)|
Figures 6 and 7 show the coverage, precision, and avg_sup_error of FPMSIM and the three versions of CC-IFIM based on Jaccard, Dice and Cosine, respectively. These results were tested using several minsup values. On the Mushroom, Retail, T10I4D100K and T40I10D100K datasets, the overall coverage and precision of CC-IFIM are better than those of FPMSIM. This means that similarity based on CC_D estimates the global itemset supports more accurately than similarity based on C_D as in FPMSIM. As shown in the last column, which gives the average calculations, CC-IFIM based on Dice or Cosine similarity is better than CC-IFIM using Jaccard similarity. It is worth noting that, although the avg_sup_error increases for CC-IFIM, this is normal due to the increasing number of FI. On the Mushroom dataset, the avg_sup_error recorded 0% at some minsup thresholds; this does not indicate that the estimated support values are always exact. For big data, namely the Kosarak dataset, the traditional algorithm fails to mine all FIs due to a heap memory exception. Therefore, we were not able to calculate coverage, precision, and avg_sup_error there, while the approximated approaches are a suitable solution for mining FIs; the comparative study was therefore made between CC-IFIM and FPMSIM only. Figure 8 shows that the number of FIs extracted by CC-IFIM is greater than the number extracted by FPMSIM.
As shown, at minsup = 0.6% the CC-IFIM approach mines 1332 frequent itemsets from 1465 candidates, while FPMSIM mines only 1309. Moreover, the CC-IFIM approach pruned 131 candidates.
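The three evaluation measures used above can be written down directly. The snippet below is our reconstruction of the metrics as described, with coverage as the recall of the exact FI, precision as the fraction of reported itemsets that are truly frequent, and the average absolute support error; the variable names and toy data are illustrative only.

```python
def coverage(fi_app, fi_exact):
    """Share of the exact frequent itemsets that the approximation recovers."""
    return len(fi_app & fi_exact) / len(fi_exact)

def precision(fi_app, fi_exact):
    """Share of the reported itemsets that are truly frequent
    (false positives fPFI = fi_app - fi_exact lower this value)."""
    return len(fi_app & fi_exact) / len(fi_app)

def avg_sup_error(est_sup, real_sup):
    """Average absolute gap between estimated sup*(f) and real sup(f)."""
    common = est_sup.keys() & real_sup.keys()
    return sum(abs(est_sup[f] - real_sup[f]) for f in common) / len(common)

# Toy example: one false positive ('c','e') and one missed itemset ('c',).
exact = {("a",), ("b",), ("c",), ("a", "b")}
approx = {("a",), ("b",), ("a", "b"), ("c", "e")}
```

On the toy sets, both coverage and precision come out to 3/4, matching the intuition that one exact itemset is missed and one reported itemset is spurious.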

Conclusion
Incremental frequent itemset mining is a good choice for huge datasets that rapidly expand through the periodic addition of new transactions. In this setting, the frequent pattern tree and multi-scale incremental frequent itemset approach, FPMSIM, splits the dataset into several components, applies frequent itemset mining to each separately, and then mines the global frequent itemsets using Jaccard similarity based on all local frequent itemsets. However, it faces two problems: efficiency and accuracy. In this paper, we developed this approach to be more efficient and accurate. In CC-IFIM, the performance is increased by applying prior candidate pruning, while the accuracy is increased by applying similarity measurements to the local closed frequent itemsets instead of all local frequent itemsets, and by using other similarity methods, such as Dice or Cosine, which remarkably increase the accuracy of CC-IFIM. The key contributions of this study are the introduction of a pruning approach for reducing the number of candidates, as well as a novel definition of closed itemsets, termed local closed frequent itemsets, on which the similarity methods rely. Future research will focus on establishing the scientific rationale for the ideal number of partitions in the division process; closed and maximal incremental frequent itemsets will also be covered. Additionally, we will investigate several different similarity models in order to propose a new one.
Fig. 15 Coverage, precision and avg_sup_error among FPMSIM and three versions of CC-IFIM based on Jaccard, DICE and Cosine for the Mushroom and Retail datasets after dividing them into an 80% original dataset and a 20% incremental dataset