Mining high utility itemsets by dynamically pruning the tree structure
Abstract
Mining high utility itemsets is one of the most important research issues in data mining owing to its ability to consider nonbinary frequency values of items in transactions and different profit values for each item. Mining such itemsets from a transaction database involves finding those itemsets with utility above a user-specified threshold. In this paper, we propose an efficient concurrent algorithm, called CHUI-Mine (Concurrent High Utility Itemsets Mine), for mining high utility itemsets by dynamically pruning the tree structure. A tree structure, called the CHUI-Tree, is introduced to capture the important utility information of the candidate itemsets. By recording changes in support counts of candidate high utility items during the tree construction process, we implement dynamic CHUI-Tree pruning, and discuss the rationality thereof. The CHUI-Mine algorithm makes use of a concurrent strategy, enabling the simultaneous construction of a CHUI-Tree and the discovery of high utility itemsets. Our algorithm alleviates the problem of huge memory usage for tree construction and traversal in tree-based algorithms for mining high utility itemsets. Extensive experimental results show that the CHUI-Mine algorithm is both efficient and scalable.
Keywords
Data mining · High utility itemset · CHUI-Tree · Dynamically pruning · Concurrency

1 Introduction
Data mining, which is the extraction of hidden information from large databases, has become more and more important in many domains, including business, scientific research, government, and so on [12, 24].
Frequent Itemset Mining (FIM) is a major problem in many data mining applications [2, 11]. It started as a phase in the discovery of association rules [15, 25, 30, 34], but has since been generalized, independent of these, to many other patterns, for example, frequent sequences [16], episodes [29], periodic patterns [35], frequent subgraphs [23], and bridging rules [40].
Apriori [2] is the first FIM algorithm, based on the anti-monotone property: if an itemset is frequent, then all its non-empty subsets are frequent. The Apriori technique finds frequent itemsets of length k from a set of previously generated itemsets of length k−1 and therefore requires multiple database scans, proportional in number to the length of the longest frequent itemset. Moreover, a large amount of memory is needed to handle the candidate itemsets when the number of potential frequent itemsets is reasonably large. Han et al. [13] proposed the FP-growth method to avoid generating candidate itemsets by building an FP-tree while scanning the database only twice. This method constructs a conditional database for each frequent itemset X. All itemsets with X as a prefix can be mined from the respective conditional database without accessing other information. FP-growth avoids the costly candidate itemset generation phase, which overcomes the main bottleneck of Apriori-like algorithms. Various other studies [1, 10, 28, 39] have been carried out on frequent itemset mining.
Although standard algorithms are capable of identifying itemsets that produce distinct patterns, they fail to consider the quantity or weight of items, such as their profit. For example, in retail applications, frequent itemsets identified by traditional FIM algorithms may contribute only a small portion of the overall revenue or profit, as high-margin or luxury goods typically do not appear in a large number of transactions. A similar problem occurs when data mining is applied within an enterprise to identify the most valuable client segments, or product combinations that contribute most to the company’s bottom line. To address this limitation, utility mining [6, 27, 38] has emerged as an important topic in data mining.
The basic meaning of utility is the importance or profitability of items to the users. The utility values of items in a transaction database consist of two parts: one is the profit of distinct items, which is called external utility, while the other is the quantity of items in one transaction, which is called internal utility. The utility of an item is defined as its external utility multiplied by its internal utility. High utility itemset (HUI) mining aims to find all itemsets with utilities no smaller than a user-specified minimum utility value. Mining HUIs is an important method for retrieving more valuable information from a database by measuring how useful items are. This information can help businesses make a variety of decisions, such as revising revenue, adjusting inventory, or determining purchase orders.
Nevertheless, HUI mining is not an easy task. The difficulty is that it does not follow the “downward closure property” [2], that is, a high utility itemset may consist of some low utility subsets. Earlier studies [22, 36] suffered from the level-wise candidate generation-and-test problem, involving several database scans depending on the length of candidate itemsets. In view of this, some novel tree structures have been used for HUI mining [4, 9]. As these algorithms are based on the pattern growth approach [13], they also generate a vast number of conditional trees with a corresponding high cost in terms of time and space.
To reduce the cost of storage and traversal of the vast number of tree structures, we propose a concurrent algorithm for mining HUIs based on dynamic tree structure pruning. The major contributions of this work are summarized below.
On the one hand, a new tree structure, called the CHUI-Tree, is proposed. It exploits a pattern growth approach to avoid the problem of the level-wise candidate generation-and-test strategy. By monitoring changes in item counts during the tree construction process, we introduce a dynamic tree pruning strategy, the rationality of which is discussed in this paper.
On the other hand, two concurrent processes [26] for discovering HUIs during the construction of trees are introduced. Compared with other tree-based algorithms, our algorithm does not need to wait for the whole tree structure to be created before starting the mining process.
Extensive experimental results on both synthetic and real datasets show that our concurrent algorithm is efficient and scalable for mining high utility itemsets.
The remainder of this paper is organized as follows. In Sect. 2, we discuss the HUI mining problem and related work. In Sect. 3, the proposed data structure and the concurrent algorithm are described in detail. In Sect. 4, we present experimental results on both synthetic and real datasets. Finally, conclusions are drawn in Sect. 5.
2 Background and related work
2.1 Problem definition
We adopted similar definitions to those presented in previous works [4, 22]. Let I={i_{1},i_{2},…,i_{m}} be a finite set of items. Then, set X⊆I is called an itemset, or a k-itemset if it contains k items. Let D={T_{1},T_{2},…,T_{n}} be a transaction database. Each transaction T_{i}∈D, with unique identifier tid, is a subset of I.
The internal utility q(i_{p},T_{d}) represents the quantity of item i_{p} in transaction T_{d}. The external utility p(i_{p}) is the unit profit value of item i_{p}. The utility of item i_{p} in transaction T_{d} is defined as u(i_{p},T_{d})=p(i_{p})×q(i_{p},T_{d}).
The transaction utility of transaction T_{d} is defined as TU(T_{d})=∑_{i_{p}∈T_{d}}u(i_{p},T_{d}). The utility of an itemset X in transaction T_{d} is u(X,T_{d})=∑_{i_{p}∈X}u(i_{p},T_{d}), and the utility of X in database D is u(X)=∑_{T_{d}∈D∧X⊆T_{d}}u(X,T_{d}).
In the HUI mining problem, we need to find those itemsets that make a significant contribution to the total profit. This contribution is quantified by a metric called the minimum utility threshold δ, which must be specified by the user. Using this measure, businesses and other users can express the required contribution to the total profit as a percentage according to their requirements.
An itemset X is called a high utility itemset if u(X)≥min_util, where min_util denotes the minimum utility value corresponding to the threshold δ. Otherwise, it is called a low utility itemset. Given a transaction database D, the task of HUI mining is to find all itemsets that have utilities no less than min_util.
The main challenge of HUI mining is that the itemset utility does not have the downward closure property. Liu et al. [22] proposed transaction-weighted downward closure to prune the search space of HUIs.
The transaction-weighted utilization of an itemset X, denoted by TWU(X), is the sum of the transaction utilities of all transactions containing X, i.e., TWU(X)=∑_{X⊆T_{d}∧T_{d}∈D}TU(T_{d}). X is a high transaction-weighted utilization itemset (HTWUI) if TWU(X)≥min_util. Otherwise, it is called a low transaction-weighted utilization itemset (LTWUI).
As shown in [22], any superset of an LTWUI is also an LTWUI. Thus, we can prune the supersets of LTWUIs. However, since the transaction-weighted utilization is an over-estimation of an itemset’s real utility value, further pruning of HTWUIs is required.
The support count of itemset X, denoted by σ(X), is defined as the number of transactions in which X occurs as a subset [2]. The support count is used to prune the tree structure dynamically in the proposed CHUI-Mine, as discussed in Sect. 3.2.
Example 1
Example database
TID | Transactions | TU |
---|---|---|
T_{1} | (A,2)(E,4)(F,1) | 19 |
T_{2} | (A,1)(B,1)(D,1)(E,1) | 25 |
T_{3} | (A,1)(C,1)(F,1) | 13 |
T_{4} | (B,1)(D,1)(E,2) | 22 |
T_{5} | (B,1)(C,1)(E,5) | 27 |
Profit table
Item | A | B | C | D | E | F |
Profit | 5 | 10 | 7 | 8 | 2 | 1 |
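To make the definitions above concrete, the following Python sketch (ours, not part of the authors' implementation) recomputes the TU column of Example 1 from the quantities and profit table, and also derives the TWU values used later in Example 2:

```python
# Profit table (external utility) and quantities (internal utility) of Example 1.
profit = {'A': 5, 'B': 10, 'C': 7, 'D': 8, 'E': 2, 'F': 1}
db = {
    'T1': {'A': 2, 'E': 4, 'F': 1},
    'T2': {'A': 1, 'B': 1, 'D': 1, 'E': 1},
    'T3': {'A': 1, 'C': 1, 'F': 1},
    'T4': {'B': 1, 'D': 1, 'E': 2},
    'T5': {'B': 1, 'C': 1, 'E': 5},
}

def tu(trans):
    # Transaction utility: sum of u(i, T) = p(i) * q(i, T) over items in T.
    return sum(profit[i] * q for i, q in trans.items())

# TWU(i): sum of TU over all transactions containing item i.
twu = {}
for trans in db.values():
    t = tu(trans)
    for i in trans:
        twu[i] = twu.get(i, 0) + t
```

For instance, tu(db['T1']) evaluates to 19 and twu['E'] to 93, agreeing with the TU column above and the TWU table in Example 2.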
2.2 Existing algorithms
Recently, mining HUIs from a large transaction database has become an active research problem in data mining [14, 18].
The basic concepts of HUI mining were given in [36]. Since this approach, called Mining with Expected Utility (MEU), cannot use the downward closure property to reduce the number of candidate itemsets, a heuristic approach was proposed to predict whether an itemset should be added to the candidate set. However, the prediction usually overestimates, especially in the initial stages. Moreover, examining the candidates is impractical, in terms of both processing cost and memory requirements, whenever the number of items is large or the utility threshold is low. Later, the same authors proposed two algorithms, UMining and UMining_H [37], to discover HUIs. In UMining, a pruning strategy based on the utility upper bound property is used. UMining_H was designed with another pruning strategy based on a heuristic method. However, these methods still do not satisfy the downward closure property and therefore overestimate itemset utilities. Thus, they also suffer from excessive candidate generation and costly candidate testing.
The Two-Phase algorithm [22] was developed to find HUIs using the downward closure property. In phase I, the useful property, i.e., the transaction-weighted downward closure property, is used: the number of candidates is reduced because supersets of low transaction-weighted utilization itemsets can be pruned. In phase II, only one extra database scan is needed to filter out those HTWUIs that are in fact low utility itemsets. Although the Two-Phase algorithm effectively reduces the search space and captures the complete set of HUIs, it still generates too many candidate HTWUIs and requires multiple database scans, especially when mining dense datasets and long patterns, much like the Apriori algorithm for frequent pattern mining.
To reduce the number of candidates in the Two-Phase algorithm, Li et al. [20] proposed an isolated items discarding strategy, abbreviated as IIDS. The IIDS shows that an itemset share mining [5] problem can be converted to a utility mining problem by replacing the frequency value of each item in a transaction by its total profit. By pruning isolated items during the level-wise search, the number of HTWUIs can be effectively reduced. The authors developed efficient algorithms called FUM and DCG+ for HUI mining. However, these approaches still need to scan the database multiple times and suffer from the problem of the candidate generation-and-test scheme to find HUIs.
To generate HTWUIs efficiently in phase I and avoid scanning the database multiple times, several methods have been proposed, including the projection-based approach [17] and the approach based on vertical data layout [19]. Of these new approaches, the tree-structure-based algorithms have been shown to be very efficient for mining HUIs. Ahmed et al. [3] proposed a tree-based algorithm, called IHUP, for mining HUIs. The authors used an IHUP-Tree to maintain the information of HUIs and transactions. First, items in the transaction are rearranged in a fixed order such as lexicographic order. Then, the rearranged transactions are inserted into the IHUP-Tree. Next, HTWUIs are generated from the IHUP-Tree by applying the FP-Growth algorithm [13]. Finally, HUIs and their utilities are identified from the set of HTWUIs by scanning the original database once only.
HUC-Prune is another novel HUI mining algorithm [4]. HUC-Prune uses the HUC-tree structure, which is a prefix tree storing the candidate items in descending order of TWU value, and every node in the HUC-Tree consists of an item name and a TWU value. Similar to IHUP [3], this algorithm also replaces the level-wise candidate generation process by a pattern growth approach.
Besides the above two algorithms, there are various other HUI mining methods based on tree structures, such as CTU-Tree [7], CUP-Tree [8], UP-Tree [31], HUP-Tree [21], and so on. These tree-based algorithms comprise three steps: (1) construction of trees, (2) generation of candidate HUIs from the trees using the pattern growth approach, and (3) identification of HUIs from the set of candidates. Although these tree structures are often compact, they may not be minimal and still occupy a large memory space. The mining performance of these methods is closely related to the number of conditional trees constructed during the whole mining process and the construction/traversal cost of each conditional tree. Thus, one of the performance bottlenecks of these algorithms is the generation of a huge number of conditional trees, which has high time and space costs.
3 Proposed method
In this section, we first introduce the proposed data structure, called CHUI-Tree. Then, we discuss dynamic pruning of the CHUI-Tree, and describe the proposed concurrent algorithm, called CHUI-Mine, in detail.
3.1 The data structure: CHUI-Tree
3.1.1 Elements in the CHUI-Tree
Specifically, CHUI-Tree T is a tree structure composed as follows:
(1) It consists of one root labeled as “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and an HTWUI-header table, denoted by T.header.
(2) Each node N in the item-prefix subtree consists of six fields: N.item, N.count, N.util, N.nodelink, N.children and N.parent, where N.item registers which item N represents; N.count is the support count of N.item; N.util is TWU(N.item); N.nodelink links to the next node in the CHUI-Tree carrying the same N.item, or “null” if there is none; N.children registers the children nodes of N, or “null” if there is none; and N.parent registers the parent node of N.
(3) Each entry in T.header consists of four fields: (1) item, denoting the item this entry represents; (2) No, giving the number of transactions containing item to be inserted into the tree during the subsequent construction process; (3) TWU, representing TWU(item); (4) node-link, a pointer pointing to the first node in the CHUI-Tree carrying item.
Note that we use No in the HTWUI-header table to realize dynamic pruning of the CHUI-Tree. The initial value of No is the support count of item in T, and the value is decremented by 1 each time a transaction containing item is inserted into T. When No reaches 0, all transactions containing item have been inserted into the tree.
Definition 1
Let T be a CHUI-Tree. A conditional pattern of itemset X is defined as CP(X)=(Y:count,util), where Y is the set of items on the path from T.root to the node representing X, count is the support count of X on this path, and util is TWU(X) on this path. The set of all of X’s conditional patterns is called the conditional pattern base of X, denoted by CPB(X). The CHUI-Tree constructed from CPB(X) is called X’s conditional CHUI-Tree, denoted as CT(X).
The CHUI-Tree constructed from the initial database can be viewed as CT(∅), the conditional CHUI-Tree for the empty itemset.
3.1.2 Construction of CHUI-Tree
The construction of a CHUI-Tree is realized by the pattern growth approach, and can be completed with two scans of the database. In the first pass over the database, the algorithm scans each transaction T_{i} and calculates its transaction utility value. Thereafter, it adds this value to the TWU value of each item present in T_{i}.
After the first scan of the database, high transaction-weighted utilization items are organized in the HTWUI-header table in descending order of TWU value. During the second scan of the database, transactions are inserted into the CHUI-Tree. Initially, the tree consists only of a root. When a transaction is retrieved, low transaction-weighted utilization items are removed from it and their utilities are subtracted from the TU of the transaction, since only supersets of high transaction-weighted utilization items can potentially become high utility itemsets. The remaining items in the transaction are sorted in descending order of TWU. Then, an update node operation is performed if the current node contains a child carrying the item to be inserted; otherwise, an insert node operation is performed. This continues until every high transaction-weighted utilization item in the current transaction has been processed.
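The two-scan construction just described can be sketched as follows. This is a simplified illustration in our own naming, not the authors' code; node fields follow Sect. 3.1.1, and the dynamic pruning via the No field is omitted here (it is the subject of Sect. 3.2):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0     # support count of item on this path
        self.util = 0      # accumulated TWU contribution on this path
        self.children = {} # item -> child Node

def build_chui_tree(db, profit, min_util):
    # Scan 1: compute TU of each transaction and TWU of each item.
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for i in t:
            twu[i] = twu.get(i, 0) + tu[tid]
    # Header table keeps only high transaction-weighted utilization items,
    # in descending order of TWU.
    header = {i: {'No': 0, 'TWU': w, 'link': None}
              for i, w in sorted(twu.items(), key=lambda kv: -kv[1])
              if w >= min_util}
    for t in db.values():
        for i in t:
            if i in header:
                header[i]['No'] += 1   # No is initialised to the support count
    # Scan 2: insert pruned, TWU-sorted transactions into the tree.
    root = Node(None, None)
    for tid, t in db.items():
        items = sorted((i for i in t if i in header),
                       key=lambda i: -header[i]['TWU'])
        # Utilities of removed low-TWU items are subtracted from the TU.
        t_tu = tu[tid] - sum(profit[i] * t[i] for i in t if i not in header)
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i, node))  # insert/update
            node.count += 1
            node.util += t_tu
    return root, header
```

Run on Example 1 with, say, min_util = 35, item F (TWU 32) is dropped and the header table comes out in the order E, B, A, D, C with the same No values as the initial HTWUI-header table of Example 2.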
3.2 Dynamic pruning of CHUI-Tree
Most tree-based algorithms for HUI mining consist of three steps: tree construction, HTWUIs identification, and HUIs discovery. The main work done in these methods is traversing trees and constructing new conditional trees after the whole tree has been constructed from the original database. When using these algorithms on a large database with a low utility threshold, the storage and traversal costs of numerous conditional trees are high. Thus, the questions that arise are: Can we reduce the storage space and traversal time so that the method has lower runtimes? And, can we discover HTWUIs during the process of tree construction, instead of after the construction of the whole tree? The answer to both is “yes”, by using the No field in the HTWUI-header table.
As described in Sect. 3.1.2, after the first scan of the database, high transaction-weighted utilization items are organized in the HTWUI-header table, and values of the field of No are initialized by the support counts of the corresponding items. When transaction T_{d} is inserted into the CHUI-Tree, the No fields of items contained in T_{d} are decremented by 1. If the No value of a certain item is reduced to 0, the nodes containing this item and their offspring nodes can be pruned. The rationality of this pruning strategy is based on the following theorem.
Theorem 1
Let i be an item of CHUI-Tree T and N={n_{1},n_{2},…,n_{k}} be the set of nodes containing i. During the construction of T, if the No field of item i in the HTWUI-header table is reduced to 0, nodes in N do not change during the subsequent construction process of T, and the conditional patterns of items contained in the offspring nodes of nodes in N do not change either.
Proof
During the construction of CHUI-Tree T, if the No field of item i is reduced to 0, we can see that no nodes containing i will be inserted into T, and nodes in N do not change.
For ∀n∈N, we have the following two cases:
(1) If n is a leaf node, there are no offspring nodes of n.
(2) If n is not a leaf node, based on the construction of the CHUI-Tree, no items will be added into T as the offspring nodes of n. Thus, the values of both support counts and the TWU of items contained in the offspring nodes of n will not change. So the conditional patterns of these nodes will not change either. □
To discover HUIs from the pruned branches, we provide the following definition.
Definition 2
Let T be a CHUI-Tree, I_{T}={i_{1},i_{2},…,i_{n}} be items in T.header, and the conditional pattern list of T be an array with size n, denoted as CPL(T). Each element of the array corresponds to a triple (item,flag,CPB(item)), where item∈I_{T}, flag is a Boolean variable with values in {TRUE, FALSE}, and CPB(item) is the conditional pattern base of item.
According to Theorem 1, during the construction of CHUI-Tree T, once the No field of item i has been reduced to 0, we can prune the nodes containing i and their offspring nodes, and store the pruned branches in CPL(T). The flag of i is set to TRUE, which means we can start the mining process of HTWUIs containing i. For other items stored in CPL(T) together with i, we set their flags to FALSE.
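This pruning step can be sketched in Python as follows. The sketch is ours, under the assumption that header[item]['link'] chains all tree nodes carrying item (the node-link field of Sect. 3.1.1), and that conditional patterns are stored as (prefix-path, count, util) triples:

```python
class Node:
    def __init__(self, item, parent, count=0, util=0):
        self.item, self.parent = item, parent
        self.count, self.util = count, util
        self.children = {}
        self.link = None   # next tree node carrying the same item

def prefix_path(node):
    # Items on the path from the root down to node's parent.
    path, p = [], node.parent
    while p is not None and p.item is not None:
        path.append(p.item)
        p = p.parent
    return tuple(reversed(path))

def insert_offspring(node, cpl):
    # Offspring items are stored in the CPL with flag FALSE (Definition 2).
    for child in node.children.values():
        entry = cpl.setdefault(child.item, {'flag': False, 'cpb': []})
        entry['cpb'].append((prefix_path(child), child.count, child.util))
        insert_offspring(child, cpl)

def prune_item(item, header, cpl):
    # Called when the No field of `item` reaches 0 during tree construction.
    entry = cpl.setdefault(item, {'flag': False, 'cpb': []})
    entry['flag'] = True   # mining of HTWUIs containing `item` may start
    node = header[item]['link']
    while node is not None:
        entry['cpb'].append((prefix_path(node), node.count, node.util))
        insert_offspring(node, cpl)
        del node.parent.children[node.item]   # detach node and its offspring
        node = node.link
    return cpl
```

Applied to the partial tree of Example 2 (after the first three reorganized transactions have been inserted), prune_item('A', …) records the conditional patterns E:1,18 and EB:1,25 for A with flag TRUE, and EBA:1,25 for D and A:1,12 for C with flag FALSE, matching the CPL table below; the root-level A node additionally contributes an empty prefix, which the table omits.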
Example 2
TWU of each item
Item | A | B | C | D | E | F |
TWU | 57 | 74 | 40 | 47 | 93 | 32 |
The initial HTWUI-header table
Item | No. | TWU | Node-link |
---|---|---|---|
E | 4 | 93 | Null |
B | 3 | 74 | Null |
A | 3 | 57 | Null |
D | 2 | 47 | Null |
C | 2 | 40 | Null |
The re-organized database
TID | Transactions | TU |
---|---|---|
\(T_{1}'\) | (E,4)(A,2) | 18 |
\(T_{2}'\) | (E,1)(B,1)(A,1)(D,1) | 25 |
\(T_{3}'\) | (A,1)(C,1) | 12 |
\(T_{4}'\) | (E,2)(B,1)(D,1) | 22 |
\(T_{5}'\) | (E,5)(B,1)(C,1) | 27 |
The CPL of CHUI-Tree in Fig. 2
Item | Flag | CPB |
---|---|---|
A | TRUE | E:1,18 EB:1,25 |
D | FALSE | EBA:1,25 |
C | FALSE | A:1,12 |
3.3 Proposed concurrent algorithm
To prune and discover HUIs simultaneously during the construction of the CHUI-Tree, we propose the CHUI-Mine algorithm based on two concurrent processes. Concurrent processes can function completely independently of one another [26]. Two processes are concurrent if their execution can overlap in time; that is, the execution of the second process starts before the first process completes. Concurrent processes generally interact through the following two mechanisms: shared variables and message passing.
The CHUI-Mine algorithm can be divided into two sub-tasks: the first is to prune the CHUI-Tree dynamically and store the results in the CPL in a shared buffer, while the other is to read conditional pattern bases from the CPL and mine HUIs using a pattern growth approach. These two tasks are implemented by the concurrent Processes 1 and 2, described below.
Process 1
In Process 1, the database is first scanned once to determine the high transaction-weighted utilization single items (step 1). In step 2, the remaining items are ordered in descending order of TWU in the CHUI-Tree, and the pruned transactions are listed in the same order. The CHUI-Tree is pruned and the CPL is modified in the main loop, transaction by transaction (steps 3–25). Step 4 accesses the CPL structure in the buffer. In step 5, one transaction is added to the CHUI-Tree. The items in the transaction are examined one by one (steps 6–22). In step 7, the No field of the item being processed is reduced by 1. If the item’s No value is reduced to 0, the CHUI-Tree is pruned and the CPL in the buffer is modified (steps 8–21). Then, the nodes containing the current item are pruned one by one (steps 9–20). The conditional pattern of the item is computed in step 11. In steps 12–16, the item and its conditional patterns are inserted into the CPL, and its flag is set to TRUE. In step 17, the items contained in the offspring nodes of the current nodes are added to the CPL. Step 18 prunes the CHUI-Tree, while step 23 releases the CPL in the buffer. Process 2 is woken up in step 24. In Procedure Insert, the items contained in the offspring nodes of the current node and their conditional patterns are added to the CPL recursively, and their flags are set to FALSE. In Procedure Delete, the current node and its offspring nodes are deleted from the CHUI-Tree.
Process 2
Initially Process 2 waits for Process 1 (step 1). Step 2 accesses the CPL structure in the buffer. The main loop modifies the CPL in the buffer and discovers HTWUIs (steps 3–7). In step 4, the conditional CHUI-Tree of elements with TRUE values of the flag, is constructed. Then, Procedure HTW-growth is called to discover HTWUIs in step 5. Step 6 deletes this element in the CPL. Step 8 releases the CPL in the buffer. Procedure HTW-growth is similar to the methods used in other tree-based algorithms.
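The interaction between the two processes can be sketched with Python threads and a blocking queue standing in for the shared CPL buffer. This is a deliberate simplification of Processes 1 and 2: tree construction, pruning, and HTW-growth are elided, and only the producer/consumer structure is shown.

```python
import queue
import threading

cpl_buffer = queue.Queue()   # shared buffer holding CPL entries
DONE = object()              # sentinel: Process 1 has finished pruning

results = []

def process1(transactions):
    # Stands in for tree construction and dynamic pruning; each pruned
    # branch is placed into the shared buffer, waking up Process 2.
    for t in transactions:
        cpl_buffer.put(('pruned-branch-for', t))
    cpl_buffer.put(DONE)

def process2():
    # Reads conditional pattern bases from the buffer and mines them
    # (HTW-growth itself is elided; we only record what was consumed).
    while True:
        item = cpl_buffer.get()   # blocks until Process 1 produces data
        if item is DONE:
            break
        results.append(item)

t2 = threading.Thread(target=process2)
t2.start()
process1(['T1', 'T2', 'T3'])
t2.join()
```

Because Queue.get blocks, Process 2 starts consuming as soon as the first branch is pruned, rather than waiting for the whole tree to be built; this mirrors the overlap of construction and mining that CHUI-Mine exploits.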
It should be noted that the results discovered by the two concurrent processes described above are HTWUIs. Finally, HUIs are identified by scanning the reorganized transactions. Since there is no low transaction-weighted utilization item in the reorganized transactions, the I/O cost and execution time can be reduced.
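This final identification phase can be sketched as a single pass over the reorganized transactions. The function below is our illustration of the step, not the authors' code; candidates are itemsets (tuples of items) and transactions map items to quantities:

```python
def identify_huis(candidates, reorganized_db, profit, min_util):
    # One extra scan of the reorganized transactions computes the exact
    # utility u(X) of every candidate HTWUI; low utility ones are dropped.
    utility = {x: 0 for x in candidates}
    for trans in reorganized_db:          # trans: item -> quantity
        for x in candidates:
            if set(x) <= set(trans):
                utility[x] += sum(profit[i] * trans[i] for i in x)
    return {x: u for x, u in utility.items() if u >= min_util}
```

On the reorganized database of Example 2, for instance, the candidate {E, B} has exact utility 12 + 14 + 20 = 46 over transactions T_2', T_4', and T_5', while {E, A} only reaches 25, so with min_util = 40 only {E, B} survives.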
3.4 Algorithm correctness and discussion
The rationality of concurrent HUI mining can be proved by the following theorem.
Theorem 2
Given a transaction database D, let T_{i} (1≤i≤n) be the conditional trees for discovering HTWUIs produced by dynamic pruning, and let T be the whole tree constructed using the pattern-growth approach without pruning, where nodes in T_{i} (1≤i≤n) and T have the same structure. Then, there is a 1-1 mapping between T_{1}∪T_{2}∪⋯∪T_{n} and T.
Proof
We prove the theorem from the perspective of vertices and edges.
(1) For an arbitrary vertex v in T, as can be seen from the CHUI-Mine algorithm, the No field of the item v represents will eventually be reduced to 0. That is, vertex v will eventually be pruned. Therefore, there is a node v′∈T_{1}∪T_{2}∪⋯∪T_{n} that represents the same item as v. In the same way, for an arbitrary vertex v′ in T_{i} (1≤i≤n), there exists a node v in T that represents the same item as v′. So there is a 1-1 mapping of vertices between T_{1}∪T_{2}∪⋯∪T_{n} and T.
(2) For an arbitrary edge e=(v_{i},v_{j}) in T, according to the above discussion on vertices, there are \(v_{i}'\in T_{i}\) (1≤i≤n) and \(v_{j}'\in T_{j}\) (1≤j≤n) corresponding to v_{i} and v_{j}, respectively. As the order of items in both the pruned trees and the whole tree is determined by the original database, the connection relations in T_{1}∪T_{2}∪⋯∪T_{n} and T are the same. So, there exists \(e' = (v_{i}', v_{j}')\) in T_{i}∪T_{j} corresponding to e=(v_{i},v_{j}). Similarly, for arbitrary \(e'= (v_{i}', v_{j}')\in T_{i}\) (1≤i≤n), there exists e=(v_{i},v_{j})∈T corresponding to e′. So there is a 1-1 mapping of edges between T_{1}∪T_{2}∪⋯∪T_{n} and T.
According to the above, there is a 1-1 mapping between T_{1}∪T_{2}∪⋯∪T_{n} and T. □
Theorem 2 ensures that the result of concurrent HUI mining is the same as mining the whole tree structure.
Compared with related HUI mining algorithms based on tree structures [4, 9], the proposed CHUI-Mine can discover HUIs concurrently with tree construction. Its major advantages are summarized as follows. On the one hand, HTWUIs can be discovered during the process of tree construction, instead of after the construction of the whole tree, which improves mining efficiency. On the other hand, it avoids generating the whole tree structure. As the memory consumption of pruned trees is usually small, CHUI-Mine can reduce peak memory usage.
The main reason for these advantages is that the support count, together with the TWU, is also recorded in the CHUI-Tree. As proved in Theorem 1, by monitoring the changes in support counts, sub-trees can be pruned dynamically. Thus, tree construction and HTWUI mining can be realized concurrently.
4 Experimental results
In this section, we evaluate the performance of our algorithm and compare it with the Two-Phase [22], FUM [20], and HUC-Prune [4] algorithms.
4.1 Experimental environment and datasets
Characteristics of datasets
Datasets | Avg. trans. length | No. of trans | No. of items |
---|---|---|---|
T10I4D100k | 10 | 100,000 | 100 |
T20I4D100k | 20 | 100,000 | 100 |
T5N5D1M | 5 | 1,000,000 | 100 |
Chess | 37 | 3,196 | 75 |
Mushroom | 23 | 8,124 | 119 |
BMS-POS | 7 | 515,597 | 1,657 |
4.2 Performance analysis on synthetic datasets
We first show the performance of these algorithms on the synthetic datasets T10I4D100k, T20I4D100k, and T5N5D1M. Since the first two synthetic datasets have fewer transactions than the third one, we only changed the average transaction size between T10I4D100k and T20I4D100k. Then, the large dataset with a short average transaction size, T5N5D1M, was used to analyze performance. Such synthetic datasets are sparse: although the average transaction length is small, they contain a large number of transactions.
From the above discussion, we can see that CHUI-Mine outperforms Two-Phase, FUM, and HUC-Prune with regard to efficiency, number of candidates, and memory usage on synthetic datasets.
4.3 Performance analysis on real datasets
In this section, we compare the performance of these algorithms on real dense datasets, Chess, Mushroom, and BMS-POS, in which each transaction contains many items. As the probability of an item’s occurrence in each transaction is very high, the runtime tends to be very long especially when the minimum utility threshold is low. Furthermore, these real datasets also generate a large number of candidates.
From the above discussion, we can see that CHUI-Mine outperforms Two-Phase, FUM, and HUC-Prune with regard to efficiency, number of candidates, and memory usage on real datasets.
4.4 Scalability
In the following experiments, we varied the dataset size and number of items to evaluate scalability of the four algorithms. The datasets were all generated by the IBM data generator [32].
From the above discussion, by varying either the dataset size or the number of items, CHUI-Mine reduces the running time for HUI mining while offering linear scalability.
5 Conclusions
In this paper, we proposed a concurrent algorithm called CHUI-Mine for mining HUIs from transaction databases. A novel data structure called the CHUI-Tree was created for maintaining the information of HUIs. Using this structure, potential HUIs can be generated efficiently using two concurrent processes: one for constructing and dynamically pruning the tree, and then placing the conditional trees into a buffer, and the other for reading the conditional pattern list from the buffer and mining HUIs. The rationality of dynamic pruning of the tree structure was also proved. In the experiments, both synthetic and real datasets were used to evaluate the performance of CHUI-Mine. The mining performance is enhanced since both the search space and the number of candidates are effectively reduced. In addition, the experimental results show that CHUI-Mine is both efficient and scalable.
Acknowledgements
We would like to express our deep gratitude to the anonymous reviewers of this paper. The work is partly supported by the National Natural Science Foundation of China (61105045), Funding Project for Academic Human Resources Development in Institutions of Higher Learning Under the Jurisdiction of Beijing Municipality (PHR201108057), and North China University of Technology (CCXZ201303).
References
- 1. Adnan M, Alhajj R (2009) DRFP-tree: disk-resident frequent pattern tree. Appl Intell 30(2):84–97
- 2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB’94), pp 487–499
- 3. Ahmed CF, Tanbeer SK, Jeong B-S, Lee Y-K (2009) Efficient tree structures for high utility pattern mining. IEEE Trans Knowl Data Eng 21(12):1708–1721
- 4. Ahmed CF, Tanbeer SK, Jeong B-S, Lee Y-K (2011) HUC-Prune: an efficient candidate pruning technique to mine high utility patterns. Appl Intell 34(2):181–198
- 5. Barber B, Hamilton HJ (2003) Extracting share frequent itemsets with infrequent subsets. Data Min Knowl Discov 7(2):153–185
- 6. Chan R, Yang Q, Shen Y-D (2003) Mining high utility itemsets. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM’03), pp 19–26
- 7. Erwin A, Gopalan RP, Achuthan NR (2007) CTU-Mine: an efficient high utility itemset mining algorithm using the pattern growth approach. In: Proceedings of the 7th IEEE international conference on computer and information technology (CIT’07), pp 71–76
- 8. Erwin A, Gopalan RP, Achuthan NR (2007) A bottom-up projection based algorithm for mining high utility itemsets. In: Proceedings of the 2nd international workshop on integrating artificial intelligence and data mining (AIDM’07), pp 3–10
- 9. Erwin A, Gopalan RP, Achuthan NR (2008) Efficient mining of high utility itemsets from large datasets. In: Proceedings of the 12th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD’08), pp 554–561
- 10. Grahne G, Zhu J (2005) Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans Knowl Data Eng 17(10):1347–1362
- 11. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
- 12. Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann, San Francisco
- 13. Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
- 14. Hu J, Mojsilovic A (2007) High-utility pattern mining: a method for discovery of high-utility item sets. Pattern Recognit 40(11):3317–3324
- 15. Kaya M, Alhajj R (2008) Online mining of fuzzy multidimensional weighted association rules. Appl Intell 29(1):13–34
- 16. Lee C-H (2007) IMSP: an information theoretic approach for multi-dimensional sequential pattern mining. Appl Intell 26(3):231–242
- 17. Lan G-C, Hong T-P, Tseng VS (2012) A projection-based approach for discovering high average-utility itemsets. J Inf Sci Eng 28(1):193–209
- 18. Li H-F, Huang H-Y, Lee S-Y (2011) Fast and memory efficient mining of high-utility itemsets from data streams: with and without negative item profits. Knowl Inf Syst 28(3):495–522
- 19. Li H-F (2011) MHUI-max: an efficient algorithm for discovering high-utility itemsets from data streams. J Inf Sci 37(5):532–545
- 20.Li Y-C, Yeh J-S, Chang C-C (2008) Isolated items discarding strategy for discovering high utility itemsets. Data Knowl Eng 64(1):198–217 CrossRefGoogle Scholar
- 21.Lin C-W, Hong T-P, Lu W-H (2011) An effective tree structure for mining high utility itemsets. Expert Syst Appl 38(6):7419–7424 CrossRefGoogle Scholar
- 22.Liu Y, Liao W-K, Choudhary AN (2005) A two phase algorithm for fast discovery of high utility of itemsets. In: Proceedings of the 9th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’05), pp 689–695 Google Scholar
- 23.Maunz A, Helma C, Kramer S (2011) Efficient mining for structurally diverse subgraph patterns in large molecular databases. Mach Learn 83(2):193–218 CrossRefMATHMathSciNetGoogle Scholar
- 24.Piatetsky-Shapiro G (2007) Data mining and knowledge discovery 1996 to 2005: overcoming the hype and moving from “university” to “business” and “analytics”. Data Min Knowl Discov 15(1):99–105 CrossRefMathSciNetGoogle Scholar
- 25.Rauch J (2005) Logic of association rules. Appl Intell 22(1):9–28 CrossRefMATHGoogle Scholar
- 26.Roscoe AW (2010) Understanding concurrent systems. Springer, London CrossRefMATHGoogle Scholar
- 27.Shie B-E, Yu PS, Tseng VS (2013) Mining interesting user behavior patterns in mobile commerce environments. Appl Intell 38(3):418–435 CrossRefGoogle Scholar
- 28.Song W, Yang BR, Xu ZY (2008) Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl-Based Syst 21(6):507–513 CrossRefGoogle Scholar
- 29.Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66 CrossRefMATHMathSciNetGoogle Scholar
- 30.Tseng M-C, Lin Y-Y, Jeng R (2008) Updating generalized association rules with evolving taxonomies. Appl Intell 29(3):306–320 CrossRefGoogle Scholar
- 31.Tseng VS, Wu C-W, Shie B-E, Yu PS (2010) UP-Growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10), pp 253–262 CrossRefGoogle Scholar
- 32.IBM data generator. http://www.cs.loyola.edu/~cgiannel/assoc_gen.html
- 33.Frequent Itemset mining implementations repository. http://fimi.ua.ac.be/
- 34.Wang S-L, Patel D, Jafari A, Hong T-P (2007) Hiding collaborative recommendation association rules. Appl Intell 27(1):67–77 CrossRefMATHGoogle Scholar
- 35.Wang Y-T, Cheng J-T (2011) Mining periodic movement patterns of mobile phone users based on an efficient sampling approach. Appl Intell 35(1):32–40 CrossRefGoogle Scholar
- 36.Yao H, Hamilton HJ, Butz CJ (2004) A foundational approach to mining itemset utilities from databases. In: Proceedings of the 4th SIAM international conference on data mining (SDM’04), pp 482–486 CrossRefGoogle Scholar
- 37.Yao H, Hamilton HJ (2006) Mining itemset utilities from transaction databases. Data Knowl Eng 59(3):603–626 CrossRefGoogle Scholar
- 38.Yu G, Shao S, Luo B, Zeng X (2009) A hybrid method for high-utility itemsets mining in large high-dimensional data. Int J Data Warehous Min 5(1):57–73 CrossRefGoogle Scholar
- 39.Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’03), pp 326–335 Google Scholar
- 40.Zhang S, Chen F, Wu X, Zhang C, Wang R (2012) Mining bridging rules between conceptual clusters. Appl Intell 36(1):108–118 CrossRefGoogle Scholar