Applied Intelligence

Volume 40, Issue 1, pp 29–43

Mining high utility itemsets by dynamically pruning the tree structure


Abstract

Mining high utility itemsets is one of the most important research issues in data mining owing to its ability to consider nonbinary frequency values of items in transactions and different profit values for each item. Mining such itemsets from a transaction database involves finding those itemsets with utility above a user-specified threshold. In this paper, we propose an efficient concurrent algorithm, called CHUI-Mine (Concurrent High Utility Itemsets Mine), for mining high utility itemsets by dynamically pruning the tree structure. A tree structure, called the CHUI-Tree, is introduced to capture the important utility information of the candidate itemsets. By recording changes in support counts of candidate high utility items during the tree construction process, we implement dynamic CHUI-Tree pruning, and discuss the rationality thereof. The CHUI-Mine algorithm makes use of a concurrent strategy, enabling the simultaneous construction of a CHUI-Tree and the discovery of high utility itemsets. Our algorithm reduces the problem of huge memory usage for tree construction and traversal in tree-based algorithms for mining high utility itemsets. Extensive experimental results show that the CHUI-Mine algorithm is both efficient and scalable.

Keywords

Data mining · High utility itemset · CHUI-Tree · Dynamic pruning · Concurrency

1 Introduction

Data mining, which is the extraction of hidden information from large databases, has become more and more important in many domains, including business, scientific research, government, and so on [12, 24].

Frequent Itemset Mining (FIM) is a major problem in many data mining applications [2, 11]. It started as a phase in the discovery of association rules [15, 25, 30, 34], but has since been generalized, independent of these, to many other patterns, for example, frequent sequences [16], episodes [29], periodic patterns [35], frequent subgraphs [23], and bridging rules [40].

Apriori [2] is the first FIM algorithm based on the anti-monotone property, i.e., if an itemset is frequent then all its non-empty subsets are frequent. The Apriori technique finds frequent itemsets of length k from a set of previously generated itemsets of length k−1 and, therefore, requires multiple database scans, with the number of scans proportional to the length of the longest frequent itemset. Moreover, a large amount of memory is needed to handle the candidate itemsets when the number of potential frequent itemsets is reasonably large. Han et al. [13] proposed the FP-growth method to avoid generating candidate itemsets by building an FP-tree while scanning the database only twice. This method constructs a conditional database for each frequent itemset X. All itemsets with X as a prefix can be mined from the respective conditional database without accessing other information. FP-growth avoids the costly candidate itemset generation phase, which overcomes the main bottleneck of Apriori-like algorithms. Various other studies [1, 10, 28, 39] have been carried out on frequent itemset mining.

Although standard algorithms are capable of identifying itemsets that produce distinct patterns, they fail to consider the quantity or weight, such as the profit, of the items. For example, in retail applications, frequent itemsets identified by the traditional FIM algorithms may contribute only a small portion of the overall revenue or profit, as high-margin or luxury goods typically do not appear in a large number of transactions. A similar problem occurs when data mining is applied within an enterprise to identify the most valuable client segments, or product combinations that contribute most to the company’s bottom line. To address this limitation, utility mining [6, 27, 38] has emerged as an important topic in data mining.

The basic meaning of utility is the importance or profitability of items to the users. The utility values of items in a transaction database consist of two parts: one is the profit of distinct items, called the external utility, while the other is the quantity of items in one transaction, called the internal utility. The utility of an item is defined as its external utility multiplied by its internal utility, and the utility of an itemset is the sum of the utilities of its items. High utility itemset (HUI) mining aims to find all itemsets with utilities no smaller than a user-specified minimum utility value. Mining HUIs is a very important method for retrieving more valuable information from a database by measuring how useful the items are. This information can help businesses make a variety of decisions, such as revising revenue, adjusting inventory, or determining purchase orders.

Nevertheless, HUI mining is not an easy task. The difficulty is that it does not follow the “downward closure property” [2], that is, a high utility itemset may consist of some low utility subsets. Earlier studies [22, 36] suffered from the level-wise candidate generation-and-test problem, involving several database scans depending on the length of candidate itemsets. In view of this, some novel tree structures have been used for HUI mining [4, 9]. As these algorithms are based on the pattern growth approach [13], they also generate a vast number of conditional trees with a corresponding high cost in terms of time and space.

To reduce the cost of storage and traversal of the vast number of tree structures, we propose a concurrent algorithm for mining HUIs based on dynamic tree structure pruning. The major contributions of this work are summarized below.

On the one hand, a new tree structure, called the CHUI-Tree, is proposed. It exploits a pattern growth approach to avoid the problem of the level-wise candidate generation-and-test strategy. By monitoring changes in item counts during the tree construction process, we introduce a dynamic tree pruning strategy, the rationality of which is discussed in this paper.

On the other hand, two concurrent processes [26] for discovering HUIs during the construction of trees are introduced. Compared with other tree-based algorithms, our algorithm does not need to wait for the whole tree structure to be created, before starting the mining process.

Extensive experimental results on both synthetic and real datasets show that our concurrent algorithm is efficient and scalable for mining high utility itemsets.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the HUI mining problem and related work. In Sect. 3, the proposed data structure and the concurrent algorithm are described in detail. In Sect. 4, we present experimental results on both synthetic and real datasets. Finally, conclusions are drawn in Sect. 5.

2 Background and related work

2.1 Problem definition

We adopted similar definitions to those presented in previous works [4, 22]. Let I={i1,i2,…,im} be a finite set of items. A set X ⊆ I is called an itemset, or a k-itemset if it contains k items. Let D={T1,T2,…,Tn} be a transaction database. Each transaction Ti ∈ D, with unique identifier tid, is a subset of I.

The internal utility q(ip,Td) represents the quantity of item ip in transaction Td. The external utility p(ip) is the unit profit value of item ip. The utility of item ip in transaction Td is defined as u(ip,Td)=p(ip)×q(ip,Td).

The utility of itemset X in transaction Td is defined as:
$${u(X, T_d)=\sum_{i_p\in X \wedge X\subseteq T_d} u(i_p,T_d)} $$
The utility of itemset X in D is defined as:
$${u(X)=\sum_{X \subseteq T_d \wedge T_d\in \boldsymbol{D}} u(X,T_d)} $$

The transaction utility of transaction Td is defined as TU(Td)=u(Td,Td).
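These definitions can be made concrete with a small sketch over the data of Example 1 (Tables 1 and 2); the dict layout and helper names are our own, not the paper's implementation.

```python
# Example 1 data: external utilities (Table 2) and per-transaction quantities (Table 1).
profit = {'A': 5, 'B': 10, 'C': 7, 'D': 8, 'E': 2, 'F': 1}
db = {
    'T1': {'A': 2, 'E': 4, 'F': 1},
    'T2': {'A': 1, 'B': 1, 'D': 1, 'E': 1},
    'T3': {'A': 1, 'C': 1, 'F': 1},
    'T4': {'B': 1, 'D': 1, 'E': 2},
    'T5': {'B': 1, 'C': 1, 'E': 5},
}

def u_item(i, td):
    """u(i, Td) = p(i) * q(i, Td)."""
    return profit[i] * db[td][i]

def u_itemset(X, td):
    """u(X, Td): sum of item utilities when X is contained in Td, else 0."""
    if not set(X) <= set(db[td]):
        return 0
    return sum(u_item(i, td) for i in X)

def u_total(X):
    """u(X): utility of X over the whole database D."""
    return sum(u_itemset(X, td) for td in db)

def tu(td):
    """Transaction utility TU(Td) = u(Td, Td)."""
    return u_itemset(list(db[td]), td)
```

This reproduces the values of Example 1, e.g. u(BE)=46 and TU(T2)=25.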

In the HUI mining problem, we need to find those itemsets that make a significant contribution to the total profit. Therefore, we quantify this contribution using a metric called the minimum utility threshold δ, defined for this purpose. Using this measure, business users can express the required contribution to the total profit as a percentage according to their requirements.

The minimum utility threshold δ is given as a percentage of the total transaction utility values of the database, while the minimum utility value is defined as:
$${\mathit{min}\_\mathit{util}=\delta\times\sum _{ T_d \in {\boldsymbol{D}}} \mathit{TU}(T_d)} $$

An itemset X is called a high utility itemset if u(X)≥min_util. Otherwise, it is called a low utility itemset. Given a transaction database D, the task of HUI mining is to find all itemsets that have utilities no less than min_util.
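A minimal sketch of the threshold computation, using the transaction utilities of Table 1 (which sum to 106); the function names are our own.

```python
# Transaction utilities TU(Td) from Table 1 of the running example.
tus = {'T1': 19, 'T2': 25, 'T3': 13, 'T4': 22, 'T5': 27}

def min_util(delta):
    """min_util = delta * sum of TU(Td) over all transactions in D."""
    return delta * sum(tus.values())

def is_hui(itemset_utility, delta):
    """An itemset X is a HUI iff u(X) >= min_util."""
    return itemset_utility >= min_util(delta)
```

For instance, with δ=0.33 the threshold is 0.33 × 106 = 34.98, so BE (utility 46) qualifies as a HUI while B alone (utility 30) does not.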

The main challenge of HUI mining is that the itemset utility does not have the downward closure property. Liu et al. [22] proposed transaction-weighted downward closure to prune the search space of HUIs.

The transaction-weighted utilization (TWU) of itemset X is the sum of the transaction utilities of all the transactions containing X, which is defined as:
$${\mathit{TWU}(X)=\sum_{X \subseteq T_d \wedge T_d \in \boldsymbol{D}} \mathit{TU}(T_d)} $$

X is a high transaction-weighted utilization itemset (HTWUI) if TWU(X)≥min_util. Otherwise, it is called a low transaction-weighted utilization itemset (LTWUI).

As shown in [22], any superset of an LTWUI is also an LTWUI. Thus, we can prune the supersets of LTWUIs. However, since the transaction-weighted utilization is an over-estimation of the real utility value of an itemset, further pruning of HTWUIs is required.

The support count of itemset X, denoted by σ(X), is defined as the number of transactions in which X occurs as a subset [2]. The support count is used to prune the tree structure dynamically in the proposed CHUI-Mine, as discussed in Sect. 3.2.
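TWU, LTWUI pruning, and the support count can be sketched together on the example data; as before, the dict layout and helper names are our own illustration.

```python
# Example data from Tables 1 and 2.
profit = {'A': 5, 'B': 10, 'C': 7, 'D': 8, 'E': 2, 'F': 1}
db = {
    'T1': {'A': 2, 'E': 4, 'F': 1},
    'T2': {'A': 1, 'B': 1, 'D': 1, 'E': 1},
    'T3': {'A': 1, 'C': 1, 'F': 1},
    'T4': {'B': 1, 'D': 1, 'E': 2},
    'T5': {'B': 1, 'C': 1, 'E': 5},
}

def tu(td):
    """TU(Td): total utility of transaction Td."""
    return sum(profit[i] * q for i, q in db[td].items())

def twu(X):
    """TWU(X): sum of TU over the transactions containing X."""
    return sum(tu(td) for td in db if set(X) <= set(db[td]))

def support(X):
    """sigma(X): number of transactions containing X."""
    return sum(1 for td in db if set(X) <= set(db[td]))

min_util = 35
ltwui_items = {i for i in profit if twu([i]) < min_util}   # single-item LTWUIs to prune
```

With min_util = 35, only item F (TWU = 32) is pruned, matching Example 1, where TWU(BE) = 74 and σ(BE) = 3.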

Example 1

Consider the transaction database in Table 1 and the profit table in Table 2. For convenience, we write itemset {B,E} as BE. In the example database, the utility of item E in transaction T2 is u(E,T2)=2×1=2, the utility of itemset BE in transaction T2 is u(BE,T2)=u(B,T2)+u(E,T2)=10+2=12, and the utility of itemset BE in the transaction database is u(BE)=u(BE,T2)+u(BE,T4)+u(BE,T5)=12+14+20=46. Given min_util=35, since u(BE)>min_util, BE is a HUI. The transaction utility of T2 is TU(T2)=u(ABDE,T2)=25, and the transaction-weighted utilization of itemset BE is TWU(BE)=TU(T2)+TU(T4)+TU(T5)=74; thus, BE is a HTWUI. Since BE is contained in 3 transactions, T2, T4 and T5, the support count of BE is σ(BE)=3.
Table 1  Example database

TID  Transactions          TU
T1   (A,2)(E,4)(F,1)       19
T2   (A,1)(B,1)(D,1)(E,1)  25
T3   (A,1)(C,1)(F,1)       13
T4   (B,1)(D,1)(E,2)       22
T5   (B,1)(C,1)(E,5)       27

Table 2  Profit table

Item    A   B   C  D  E  F
Profit  5   10  7  8  2  1

2.2 Existing algorithms

Recently, mining HUIs from a large transaction database has become an active research problem in data mining [14, 18].

The basic concepts of HUI mining were given in [36]. Since this approach, called Mining with Expected Utility (MEU), cannot use the downward closure property to reduce the number of candidate itemsets, a heuristic approach was proposed to predict whether an itemset should be added to the candidate set. However, the prediction usually overestimates, especially in the initial stages. Moreover, the examination of candidates is impractical, in terms of both processing cost and memory requirements, whenever the number of items is large or the utility threshold is low. Later, the same authors proposed two algorithms, UMining and UMining_H [37], to discover HUIs. In UMining, a pruning strategy based on the utility upper bound property is used. UMining_H was designed with another pruning strategy based on a heuristic method. However, these methods still do not satisfy the downward closure property, and therefore overestimate itemset utilities. Thus, they also suffer from excessive candidate generation and the costly generation-and-test methodology.

The Two-Phase algorithm [22] was developed to find HUIs using the downward closure property. In phase I, a useful property, namely the transaction-weighted downward closure property, is used: the candidate set is reduced by pruning the supersets of LTWUIs. In phase II, only one extra database scan is needed to filter out the HTWUIs that are in fact low utility itemsets. Although the Two-Phase algorithm effectively reduces the search space and captures a complete set of HUIs, it still generates too many HTWUI candidates and requires multiple database scans, especially when mining dense datasets and long patterns, much like the Apriori algorithm for frequent pattern mining.

To reduce the number of candidates in the Two-Phase algorithm, Li et al. [20] proposed an isolated items discarding strategy, abbreviated as IIDS. The IIDS shows that an itemset share mining [5] problem can be converted to a utility mining problem by replacing the frequency value of each item in a transaction by its total profit. By pruning isolated items during the level-wise search, the number of HTWUIs can be effectively reduced. The authors developed efficient algorithms called FUM and DCG+ for HUI mining. However, these approaches still need to scan the database multiple times and suffer from the problem of the candidate generation-and-test scheme to find HUIs.

To generate HTWUIs efficiently in phase I and avoid scanning the database multiple times, several methods have been proposed, including the projection-based approach [17] and the approach based on vertical data layout [19]. Of these new approaches, the tree-structure-based algorithms have been shown to be very efficient for mining HUIs. Ahmed et al. [3] proposed a tree-based algorithm, called IHUP, for mining HUIs. The authors used an IHUP-Tree to maintain the information of HUIs and transactions. First, items in the transaction are rearranged in a fixed order such as lexicographic order. Then, the rearranged transactions are inserted into the IHUP-Tree. Next, HTWUIs are generated from the IHUP-Tree by applying the FP-Growth algorithm [13]. Finally, HUIs and their utilities are identified from the set of HTWUIs by scanning the original database once only.

HUC-Prune is another novel HUI mining algorithm [4]. HUC-Prune uses the HUC-Tree structure, a prefix tree storing the candidate items in descending order of TWU value, in which every node consists of an item name and a TWU value. Similar to IHUP [3], this algorithm also replaces the level-wise candidate generation process with a pattern growth approach.

Besides the above two algorithms, there are various other HUI mining methods based on tree structures, such as CTU-Tree [7], CUP-Tree [8], UP-Tree [31], HUP-Tree [21], and so on. These tree-based algorithms comprise three steps: (1) construction of trees, (2) generation of candidate HUIs from the trees using the pattern growth approach, and (3) identification of HUIs from the set of candidates. Although these tree structures are often compact, they may not be minimal and can still occupy a large amount of memory. The mining performance of these methods is closely related to the number of conditional trees constructed during the whole mining process and the construction/traversal cost of each conditional tree. Thus, one of the performance bottlenecks of these algorithms is the generation of a huge number of conditional trees, which has high time and space costs.

3 Proposed method

In this section, we first introduce the proposed data structure, called CHUI-Tree. Then, we discuss dynamic pruning of the CHUI-Tree, and describe the proposed concurrent algorithm, called CHUI-Mine, in detail.

3.1 The data structure: CHUI-Tree

3.1.1 Elements in the CHUI-Tree

Specifically, CHUI-Tree T is a tree structure composed as follows:

(1) It consists of one root labeled as “null”, denoted by T.root; a set of item-prefix subtrees as the children of the root, denoted by T.tree; and an HTWUI-header table, denoted by T.header.

(2) Each node N in the item-prefix subtree consists of six fields: N.item, N.count, N.util, N.nodelink, N.children and N.parent, where N.item registers which item N represents; N.count is the support count of N.item; N.util is TWU(N.item); N.nodelink links to the next node in the CHUI-Tree carrying the same N.item, or “null” if there is none; N.children registers the children nodes of N, or “null” if there is none; and N.parent registers the parent node of N.

(3) Each entry in T.header consists of four fields: (1) item, denoting the item this entry represents; (2) No, giving the number of transactions containing item to be inserted into the tree during the subsequent construction process; (3) TWU, representing TWU(item); (4) node-link, a pointer pointing to the first node in the CHUI-Tree carrying item.

Note that we use No in the HTWUI-header table to realize dynamic pruning of the CHUI-Tree. The initial value of No is the support count of item in T, and the value is decremented by 1 each time a transaction containing item is inserted into T. The subtraction is repeated until No reaches 0, which means that all nodes containing item have been inserted into the tree.
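The node and header-entry fields above can be sketched as plain records. This is our own illustration, not the paper's implementation; the final loop simulates four insertions of transactions containing E, whose support count is 4 in the running example, driving its No counter to 0.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One CHUI-Tree node: item, support count, accumulated TWU, and links."""
    item: Optional[str]                    # None for the root
    count: int = 0                         # support count of item on this path
    util: float = 0.0                      # TWU accumulated at this node
    nodelink: Optional['Node'] = None      # next node carrying the same item
    parent: Optional['Node'] = None
    children: dict = field(default_factory=dict)

@dataclass
class HeaderEntry:
    """One HTWUI-header table entry."""
    item: str
    no: int                                # transactions still to insert; 0 => prune-ready
    twu: float
    nodelink: Optional[Node] = None        # first node carrying item

# Header entry for item E (support count 4, TWU 93 in the example).
entry = HeaderEntry('E', no=4, twu=93)
for _ in range(4):                         # one decrement per inserted transaction with E
    entry.no -= 1

# A tiny tree fragment: root -> E node.
root = Node(None)
root.children['E'] = Node('E', 1, 18, parent=root)
```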

Definition 1

Let T be a CHUI-Tree. A conditional pattern of itemset X is defined as CP(X)=(Y:count,util), where Y is the set of items on the path from T.root to X, count is the support count of X on this path, and util is TWU(X) on this path. The set of all of X’s conditional patterns is called the conditional pattern base of X, denoted by CPB(X). The CHUI-Tree constructed from CPB(X) is called X’s conditional CHUI-Tree, denoted as CT(X).

The CHUI-Tree constructed from the initial database can be viewed as CT(∅), the conditional CHUI-Tree for the empty itemset.

3.1.2 Construction of CHUI-Tree

The construction of a CHUI-Tree is realized by the pattern growth approach, and can be completed with two scans of the database. In the first pass over the database, the algorithm scans each transaction Ti and calculates its transaction utility value. Thereafter, it adds this value to the TWU value of each item present in Ti.

After the first scan of the database, high transaction-weighted utilization items are organized in the HTWUI-header table in descending order of TWU values. During the second scan of the database, transactions are inserted into the CHUI-Tree. Initially, the tree is created with a root. When a transaction is retrieved, low transaction-weighted utilization items are removed from the transaction and their utilities are eliminated from the TU of the transaction since only supersets of high transaction-weighted utilization items can potentially become the high utility itemsets. The remaining items in the transaction are sorted in descending order of TWU. Then, an update node operation is performed if the current root node contains a child with the item to be inserted; otherwise an insert node operation is performed, until every high transaction-weighted utilization item in the current transaction has been processed.
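The two-scan construction can be sketched as follows on the running example; the dict representation of the database is our own. The first scan accumulates TU and per-item TWU; the second drops low-TWU items, adjusts TU accordingly, and sorts the remaining items by descending TWU, reproducing the reorganized transactions of Table 5.

```python
# Example data from Tables 1 and 2.
profit = {'A': 5, 'B': 10, 'C': 7, 'D': 8, 'E': 2, 'F': 1}
db = {
    'T1': {'A': 2, 'E': 4, 'F': 1},
    'T2': {'A': 1, 'B': 1, 'D': 1, 'E': 1},
    'T3': {'A': 1, 'C': 1, 'F': 1},
    'T4': {'B': 1, 'D': 1, 'E': 2},
    'T5': {'B': 1, 'C': 1, 'E': 5},
}
min_util = 35

# Scan 1: transaction utilities and per-item TWU.
tu = {td: sum(profit[i] * q for i, q in items.items()) for td, items in db.items()}
twu = {}
for td, items in db.items():
    for i in items:
        twu[i] = twu.get(i, 0) + tu[td]

# Scan 2: drop low-TWU items, subtract their utility from TU, sort by descending TWU.
def reorganize(td):
    kept = [(i, q) for i, q in db[td].items() if twu[i] >= min_util]
    kept.sort(key=lambda iq: twu[iq[0]], reverse=True)
    return [i for i, _ in kept], sum(profit[i] * q for i, q in kept)
```

For T1, item F (TWU 32 < 35) is removed, its utility 1 is subtracted from TU, and the result is (['E', 'A'], 18), matching T1' in Table 5.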

3.2 Dynamic pruning of CHUI-Tree

Most tree-based algorithms for HUI mining consist of three steps: tree construction, HTWUIs identification, and HUIs discovery. The main work done in these methods is traversing trees and constructing new conditional trees after the whole tree has been constructed from the original database. When using these algorithms on a large database with a low utility threshold, the storage and traversal costs of numerous conditional trees are high. Thus, the questions that arise are: Can we reduce the storage space and traversal time so that the method has lower runtimes? And, can we discover HTWUIs during the process of tree construction, instead of after the construction of the whole tree? The answer to both is “yes”, by using the No field in the HTWUI-header table.

As described in Sect. 3.1.2, after the first scan of the database, high transaction-weighted utilization items are organized in the HTWUI-header table, and values of the field of No are initialized by the support counts of the corresponding items. When transaction Td is inserted into the CHUI-Tree, the No fields of items contained in Td are decremented by 1. If the No value of a certain item is reduced to 0, the nodes containing this item and their offspring nodes can be pruned. The rationality of this pruning strategy is based on the following theorem.

Theorem 1

Let i be an item of CHUI-Tree T and N={n1,n2,…,nk} be the set of nodes containing i. During the construction of T, if the No field of item i in the HTWUI-header table is reduced to 0, nodes in N do not change during the subsequent construction process of T, and the conditional patterns of items contained in the offspring nodes of nodes in N do not change either.

Proof

During the construction of CHUI-Tree T, if the No field of item i is reduced to 0, we can see that no nodes containing i will be inserted into T, and nodes in N do not change.

For ∀nN, we have the following two cases:

(1) If n is a leaf node, there are no offspring nodes of n.

(2) If n is not a leaf node, based on the construction of the CHUI-Tree, no items will be added into T as the offspring nodes of n. Thus, the values of both support counts and the TWU of items contained in the offspring nodes of n will not change. So the conditional patterns of these nodes will not change either. □

To discover HUIs from the pruned branches, we provide the following definition.

Definition 2

Let T be a CHUI-Tree, IT={i1,i2,…,in} be the items in T.header, and let the conditional pattern list of T be an array of size n, denoted as CPL(T). Each element of the array corresponds to a triple (item,flag,CPB(item)), where item ∈ IT, flag is a Boolean variable with values in {TRUE, FALSE}, and CPB(item) is the conditional pattern base of item.

According to Theorem 1, during the construction of CHUI-Tree T, once the No field of item i has been reduced to 0, we can prune the nodes containing i and their offspring nodes, and store the pruned branches in CPL(T). The flag of i is set to TRUE, which means we can start the mining process of HTWUIs containing i. For other items stored in CPL(T) together with i, we set their flags to FALSE.
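A minimal sketch of this pruning step. It hand-builds the CHUI-Tree state reached after inserting T1', T2' and T3' in the running example (Figs. 1 and 2), then prunes all nodes carrying item A once its No counter has reached 0, collecting conditional patterns into a CPL. The Node class and dict-based CPL layout are our own simplifications, not the paper's implementation.

```python
class Node:
    """One CHUI-Tree node: item, support count, accumulated TWU, parent/children links."""
    def __init__(self, item, count=0, util=0, parent=None):
        self.item, self.count, self.util = item, count, util
        self.parent, self.children = parent, {}

    def add(self, item, count, util):
        child = Node(item, count, util, self)
        self.children[item] = child
        return child

def prefix_of(node):
    """Items on the path from the root down to node's parent, e.g. 'EB'."""
    items, p = [], node.parent
    while p is not None and p.item is not None:
        items.append(p.item)
        p = p.parent
    return ''.join(reversed(items))

def prune(root, target):
    """Detach every node carrying `target`, recording conditional patterns in a CPL.

    CPL entries: item -> {'flag': bool, 'cpb': [(prefix, count, util), ...]}.
    The pruned item gets flag True (ready to mine); its descendants get False."""
    cpl = {}
    def collect(n, flag):
        entry = cpl.setdefault(n.item, {'flag': flag, 'cpb': []})
        prefix = prefix_of(n)
        if prefix:                      # an empty prefix yields no conditional pattern
            entry['cpb'].append((prefix, n.count, n.util))
        for c in n.children.values():
            collect(c, False)
    def walk(n):
        for c in list(n.children.values()):
            if c.item == target:
                collect(c, True)
                del n.children[c.item]  # detach the pruned branch
            else:
                walk(c)
    walk(root)
    return cpl

# Tree state after inserting T1', T2', T3' (cf. Figs. 1 and 2).
root = Node(None)
e = root.add('E', 2, 43)                 # shared prefix of T1' and T2'
e.add('A', 1, 18)                        # from T1'
b = e.add('B', 1, 25)                    # from T2'
b.add('A', 1, 25).add('D', 1, 25)
root.add('A', 1, 12).add('C', 1, 12)     # from T3'; now No(A) has dropped to 0

cpl = prune(root, 'A')
```

The resulting CPL matches Table 6: A is mine-ready with conditional patterns E:1,18 and EB:1,25, while D and C wait with flags FALSE.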

Example 2

Consider the transaction database in Table 1 and the profit table in Table 2. Suppose min_util is 35. After the first database scan, we obtain the TU of each transaction in Table 1 and the TWU of each item in Table 3. As TWU(F)<min_util, item F is deleted. The high transaction-weighted utilization items are organized in the HTWUI-header table in descending order of TWU. Note that in the HTWUI-header table, the initial value of the No field corresponding to item i is i’s support count σ(i). The initial HTWUI-header table is shown in Table 4, while Table 5 shows the reorganized transactions and their TUs for the database in Table 1. As shown in the latter table, F is removed from transactions T1 and T3. Moreover, the utilities of F are eliminated from the TUs of T1 and T3. Then, we insert the reorganized transactions into the CHUI-Tree one by one.
Table 3  TWU of each item

Item  A   B   C   D   E   F
TWU   57  74  40  47  93  32

Table 4  The initial HTWUI-header table

Item  No.  TWU  Node-link
E     4    93   Null
B     3    74   Null
A     3    57   Null
D     2    47   Null
C     2    40   Null

Table 5  The re-organized database

TID  Transactions          TU
T1'  (E,4)(A,2)            18
T2'  (E,1)(B,1)(A,1)(D,1)  25
T3'  (A,1)(C,1)            12
T4'  (E,2)(B,1)(D,1)       22
T5'  (E,5)(B,1)(C,1)       27

Consider the reorganized transactions in Table 5. The first reorganized transaction T1'={E,A} leads to the creation of a branch in the CHUI-Tree. The first node nE is created under the root with nE.item=E, nE.count=1, nE.util=18. The second node nA is created under node nE with nA.item=A, nA.count=1 and nA.util=18. In addition, the values of the No fields corresponding to E and A are decremented by 1. The reorganized transaction T2' is inserted in the same way. The CHUI-Tree up to this point is shown in Fig. 1. When the reorganized transaction T3'={A,C} is inserted, the value of No corresponding to A is reduced to zero. According to Theorem 1, we can prune the nodes containing A and their offspring nodes, and then we insert A, D and C into the CPL. The pruned CHUI-Tree at this stage is shown in Fig. 2, where nodes under the dotted line are pruned. The CPL is shown in Table 6.
Fig. 1

CHUI-Tree after inserting T2'

Fig. 2

The pruned CHUI-Tree after inserting T3'

Table 6  The CPL of CHUI-Tree in Fig. 2

Item  Flag   CPB
A     TRUE   E:1,18; EB:1,25
D     FALSE  EBA:1,25
C     FALSE  A:1,12

3.3 Proposed concurrent algorithm

To prune and discover HUIs simultaneously during the construction of the CHUI-Tree, we propose the CHUI-Mine algorithm based on two concurrent processes. Concurrent processes can function completely independently of one another [26]. Two processes are concurrent if their execution can overlap in time; that is, the execution of the second process starts before the first process completes. Concurrent processes generally interact through the following two mechanisms: shared variables and message passing.

The CHUI-Mine algorithm can be divided into two sub-tasks: the first is to prune the CHUI-Tree dynamically and store the results in the CPL in the shared buffer, while the other is to read conditional pattern bases from the CPL and mine HUIs using a pattern growth approach. These two tasks are implemented by concurrent Processes 1 and 2, described below.
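The interaction between the two sub-tasks can be sketched as a producer/consumer pair over a bounded shared buffer. This is a minimal sketch using Python threads and a queue standing in for the paper's CPL buffer and wake-up signalling; the branch labels are placeholders for pruned branches.

```python
import threading
import queue

cpl_buffer = queue.Queue(maxsize=4)   # bounded shared buffer standing in for the CPL
mined = []                            # HTWUIs discovered by the mining side

def process1(pruned_branches):
    """Tree-construction side: emit each pruned branch as soon as it is ready."""
    for branch in pruned_branches:
        cpl_buffer.put(branch)        # blocks if the buffer is full
    cpl_buffer.put(None)              # sentinel: construction is finished

def process2():
    """Mining side: consume branches concurrently with tree construction."""
    while True:
        branch = cpl_buffer.get()     # blocks until a branch is available
        if branch is None:
            break
        mined.append(('mined', branch))   # stand-in for HTW-growth on this branch

t1 = threading.Thread(target=process1, args=(['A', 'D', 'C'],))
t2 = threading.Thread(target=process2)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the mining thread starts consuming before the producer finishes, no complete tree ever needs to reside in memory, which is the point of the concurrent design.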

Process 1

Prune CHUI-Tree dynamically

In Process 1, the conditional pattern base is first scanned once to determine the high transaction-weighted utilization single items (step 1). In step 2, the remaining items are ordered in descending order of TWU in the CHUI-Tree. The pruned transactions are also listed in the same order. The CHUI-Tree is pruned and the CPL is modified in the main loop transaction by transaction (steps 3–25). Step 4 accesses the CPL structure in the buffer. In step 5, one transaction is added into the CHUI-Tree. The items in the transaction are examined one by one (steps 6–22). In step 7, the No field of the item being processed is reduced by 1. If the item’s count is reduced to 0, the CHUI-Tree is pruned and the CPL in the buffer is modified (steps 8–21). Then, the nodes containing the current item are pruned one by one (steps 9–20). The conditional pattern of the item is computed in step 11. In steps 12–16, the item and its conditional patterns are inserted into the CPL, and its flag is set to TRUE. In step 17, the items contained in the offspring nodes of the current nodes are added to the CPL. Step 18 prunes the CHUI-Tree, while step 23 releases the CPL in the buffer. Process 2 is woken up in step 24. In Procedure Insert, the items contained in the offspring nodes of the current node and their conditional patterns are added to the CPL recursively. Their flags are set to FALSE. In Procedure Delete, the current node and its offspring nodes are deleted from the CHUI-Tree.

Process 2

Initially Process 2 waits for Process 1 (step 1). Step 2 accesses the CPL structure in the buffer. The main loop modifies the CPL in the buffer and discovers HTWUIs (steps 3–7). In step 4, the conditional CHUI-Tree of elements with TRUE values of the flag, is constructed. Then, Procedure HTW-growth is called to discover HTWUIs in step 5. Step 6 deletes this element in the CPL. Step 8 releases the CPL in the buffer. Procedure HTW-growth is similar to the methods used in other tree-based algorithms.

It should be noted that the results discovered by the two concurrent processes described above are HTWUIs. Finally, HUIs are identified by scanning the reorganized transactions. Since there is no low transaction-weighted utilization item in the reorganized transactions, the I/O cost and execution time can be reduced.
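This final identification phase can be sketched as a single scan over the reorganized transactions: each candidate's exact utility is computed and compared against min_util. Data and threshold follow the running example; the candidate list is a hypothetical output of the two processes.

```python
# Unit profits (Table 2) and the reorganized transactions (Table 5, item F removed).
profit = {'A': 5, 'B': 10, 'C': 7, 'D': 8, 'E': 2}
reorg = [
    {'E': 4, 'A': 2},
    {'E': 1, 'B': 1, 'A': 1, 'D': 1},
    {'A': 1, 'C': 1},
    {'E': 2, 'B': 1, 'D': 1},
    {'E': 5, 'B': 1, 'C': 1},
]
min_util = 35

def exact_utility(X):
    """u(X): exact utility of X over the reorganized transactions, in one scan."""
    return sum(sum(profit[i] * t[i] for i in X)
               for t in reorg if set(X) <= set(t))

# Hypothetical HTWUI candidates emitted by the two processes.
candidates = [frozenset('BE'), frozenset('B'), frozenset('E')]
huis = {X for X in candidates if exact_utility(X) >= min_util}
```

Here only BE (utility 46) survives; B (30) and E (24) are HTWUIs whose exact utilities fall below the threshold.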

3.4 Algorithm correctness and discussion

The rationality of concurrent HUI mining can be proved by the following theorem.

Theorem 2

Given a transaction database D, let Ti (1≤i≤n) be the conditional trees for discovering HTWUIs produced by dynamic pruning, and let T be the whole tree constructed using the pattern-growth approach without pruning, where nodes in Ti (1≤i≤n) and T have the same structure. Then, there is a 1-1 mapping between T1∪T2∪⋯∪Tn and T.

Proof

We prove the theorem from the perspective of vertices and edges.

(1) For an arbitrary vertex v in T, as can be seen from the CHUI-Mine algorithm, the No counter of its item will eventually be reduced to 0; that is, vertex v will finally be pruned. Therefore, there is a node v′∈T1∪T2∪⋯∪Tn that represents the same item as v. In the same way, for an arbitrary vertex v′ in Ti (1≤i≤n), there exists a node v in T that represents the same item as v′. So there is a 1-1 mapping of vertices between T1∪T2∪⋯∪Tn and T.

(2) For an arbitrary edge e=(vi,vj) in T, according to the above discussion on vertices, there are \(v_{i}'\in T_{i}\) (1≤in) and \(v_{j}'\in T_{j}\) (1≤jn) corresponding to vi and vj, respectively. As the order of items in both the pruned trees and the whole tree is determined by the original database, the connection relations in T1T2∪⋯∪Tn and T are the same. So, there exists \(e' = (v_{i}', v_{j}')\) in TiTj corresponding to e=(vi,vj). Similarly, for arbitrary \(e'= (v_{i}', v_{j}')\in T_{i}\) (1≤in), there exists e=(vi,vj)∈T corresponding to e′. So there is a 1-1 mapping of edges between T1T2∪⋯∪Tn and T.

According to the above, there is a 1-1 mapping between T1T2∪⋯∪Tn and T. □

Theorem 2 ensures that the result of concurrent HUI mining is the same as mining the whole tree structure.

Compared with related tree-based HUI mining algorithms [4, 9], the proposed CHUI-Mine can discover HUIs concurrently. The major contributions of CHUI-Mine are summarized as follows. On the one hand, HTWUIs can be discovered during the process of tree construction, instead of after the construction of the whole tree, which improves mining efficiency. On the other hand, it avoids generating the whole tree structure. As the memory consumption of pruned trees is usually small, CHUI-Mine can reduce peak memory usage.

The main reason for these advantages is that the support count, together with the TWU, is also recorded in the CHUI-Tree. As proved in Theorem 1, by monitoring the changes in support counts, sub-trees can be pruned dynamically. Thus, tree construction and HTWUI mining can be realized concurrently.

4 Experimental results

In this section, we evaluate the performance of our algorithm and compare it with the Two-Phase [22], FUM [20], and HUC-Prune [4] algorithms.

4.1 Experimental environment and datasets

The experiments were performed on a 2.40 GHz CPU with 2 GB of memory, running Windows XP. Our programs were written in C++. Both synthetic and real datasets were used to evaluate the performance of the algorithms. We generated three synthetic datasets with the IBM data generator [32]: T10I4D100k, T20I4D100k and T5N5D1M. Real datasets were downloaded from the frequent itemset mining implementations repository [33]. The Chess dataset was derived from game steps. The Mushroom dataset contains characteristics of various species of mushrooms. The BMS-POS dataset contains several years’ worth of point-of-sale data from a large electronics retailer; its purpose was to find associations between product categories purchased by customers in a single visit to the retailer. This dataset was used in the KDD-Cup 2000. Table 7 gives the characteristics of the datasets used in the experiments.
Table 7  Characteristics of datasets

Datasets    Avg. trans. length  No. of trans  No. of items
T10I4D100k  10                  100,000       100
T20I4D100k  20                  100,000       100
T5N5D1M     5                   1,000,000     100
Chess       37                  3,196         75
Mushroom    23                  8,124         119
BMS-POS     7                   515,597       1,657

None of these datasets provide the utility value or quantity of each item in each transaction. Thus, to fit them into the scenario of HUI mining, we randomly generated a quantity for each item in every transaction, ranging from 1 to 5, and a number for the utility value ranging from 1.0 to 10.0. Having observed from real databases that most items carry low profit, we generated the utility values using a log normal distribution. Figure 3 shows the external utility distribution of 1000 distinct items.
Fig. 3

External utility distribution for 1000 distinct items
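The data preparation just described can be sketched as follows. This is a hedged reconstruction: the paper does not give the log-normal parameters, so `mu`, `sigma`, and the clipping strategy below are illustrative assumptions, chosen only so that most items carry low profit within the stated [1.0, 10.0] range.

```python
import random

random.seed(42)
NUM_ITEMS = 1000

def gen_external_utilities(n, mu=0.0, sigma=1.0, lo=1.0, hi=10.0):
    # External utility (profit) per item: log-normal, so most items get
    # low values and a few get high values, clipped to [lo, hi].
    # mu and sigma are assumed parameters, not from the paper.
    return {i: min(hi, max(lo, random.lognormvariate(mu, sigma)))
            for i in range(n)}

def gen_quantity():
    # Internal utility (purchase quantity) per item occurrence: uniform 1..5.
    return random.randint(1, 5)

utils = gen_external_utilities(NUM_ITEMS)
print(len(utils), min(utils.values()), max(utils.values()))
```

With a log-normal draw, the bulk of the mass sits near the low end of the range after clipping, matching the skewed profit distribution shown in Fig. 3.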

4.2 Performance analysis on synthetic datasets

We first show the performance of these algorithms on the synthetic datasets T10I4D100k, T20I4D100k, and T5N5D1M. Since the first two synthetic datasets have fewer transactions than the third, we varied only the average transaction size between T10I4D100k and T20I4D100k. A large dataset with a short average transaction size, T5N5D1M, was then used to analyze performance at scale. Synthetic datasets of this kind tend to have many distinct items; thus, although their average transaction length is small, they typically contain many transactions.

Figures 4, 5, and 6 show the execution time comparisons on the synthetic datasets. For T10I4D100k, CHUI-Mine is twice as fast as Two-Phase and faster than FUM and HUC-Prune when the minimum utility threshold is between 5 % and 8 %. For T20I4D100k, the result is almost the same as that for T10I4D100k. With the minimum utility threshold increased from 23 % to 30 %, Two-Phase is always the slowest. Although FUM and HUC-Prune are faster than Two-Phase, they still have a longer runtime than CHUI-Mine over the whole process. For T5N5D1M, as the dataset is very large, CHUI-Mine is only slightly faster than HUC-Prune when the minimum utility threshold is between 1 % and 5 %. On average, it is 3 and 4 times faster than FUM and Two-Phase, respectively.
Fig. 4

Execution times on T10I4D100k

Fig. 5

Execution times on T20I4D100k

Fig. 6

Execution times on T5N5D1M

As another important criterion, we also compared the number of candidates for the three synthetic datasets, as shown in Figs. 7, 8, and 9, respectively. On T10I4D100k and T20I4D100k, we notice that the comparison results are clearly divided into two groups. The first gives the results for Two-Phase and FUM, where the numbers of candidates are much greater than those for the other two algorithms. The other group contains the results for HUC-Prune and CHUI-Mine, which have considerably fewer candidates. The reason is that HUC-Prune and CHUI-Mine use tree structures to store the data and tree-based techniques to prune candidates, which significantly reduces the number of candidates. On the large synthetic dataset T5N5D1M, the candidates generated by CHUI-Mine are still slightly fewer than those generated by HUC-Prune. Two-Phase and FUM generate almost the same number of candidates; on average, they generate twice as many candidates as CHUI-Mine.
Fig. 7

Number of candidates on T10I4D100k

Fig. 8

Number of candidates on T20I4D100k

Fig. 9

Number of candidates on T5N5D1M

Next we compared the memory usage of these algorithms. A tree structure can represent useful information in a very compressed form because transactions have many items in common; by utilizing path overlapping (prefix sharing), tree structures can save a great deal of space. This is verified by the memory comparison results shown in Figs. 10, 11, and 12. We find that the memory usage of Two-Phase and FUM is almost constant, with both methods using more than 200 MB on T10I4D100k and T20I4D100k, and more than 330 MB on T5N5D1M. CHUI-Mine requires the least storage space; although its memory usage increases as the minimum utility threshold decreases, the rate of increase is always less than that of HUC-Prune.
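The prefix-sharing idea behind this compression can be illustrated with a minimal tree-insertion sketch. This is not the CHUI-Tree itself (which additionally records TWU and support counts); it only shows how transactions with a common prefix reuse the same nodes, with the `Node` class and sample transactions being illustrative assumptions.

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0      # number of transactions passing through this node
        self.children = {}  # item -> child Node

def insert(root, transaction):
    # Transactions sharing a prefix walk the same path, so common
    # prefixes are stored only once (path overlapping).
    node = root
    for item in transaction:
        node = node.children.setdefault(item, Node(item))
        node.count += 1

root = Node(None)
for t in [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]:
    insert(root, t)

# All three transactions share the single node for "a":
print(root.children["a"].count)  # 3
```

Three transactions of total length 8 are stored in just 5 nodes here; the denser the dataset, the greater the saving, which is why the tree-based algorithms dominate the memory comparisons above.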
Fig. 10

Memory usage on T10I4D100k

Fig. 11

Memory usage on T20I4D100k

Fig. 12

Memory usage on T5N5D1M

From the above discussion, we can see that CHUI-Mine outperforms Two-Phase, FUM, and HUC-Prune with regard to efficiency, number of candidates, and memory usage on synthetic datasets.

4.3 Performance analysis on real datasets

In this section, we compare the performance of these algorithms on real dense datasets, Chess, Mushroom, and BMS-POS, in which each transaction contains many items. As the probability of an item’s occurrence in each transaction is very high, the runtime tends to be very long especially when the minimum utility threshold is low. Furthermore, these real datasets also generate a large number of candidates.

Figures 13, 14, and 15 show the execution time comparisons on Chess, Mushroom and BMS-POS, respectively, with CHUI-Mine achieving the best performance. On Chess, the differences are not obvious when the minimum utility threshold is above 70 %. However, with the threshold below 65 %, we can see the difference clearly. Two-Phase requires more than 8000 seconds when the minimum utility threshold is 60 %. Under the same condition, CHUI-Mine runs about twice as fast as HUC-Prune and one order of magnitude faster than Two-Phase. On Mushroom, CHUI-Mine runs four to nine times faster than FUM, and one order of magnitude faster than Two-Phase when the minimum utility threshold is 10 %. On BMS-POS, CHUI-Mine runs almost three times faster than Two-Phase and twice as fast as FUM and HUC-Prune over the whole process.
Fig. 13

Execution times on Chess

Fig. 14

Execution times on Mushroom

Fig. 15

Execution times on BMS-POS

Figures 16, 17, and 18 show the number of candidates generated on Chess, Mushroom and BMS-POS, respectively. On Chess, the number of candidates generated by CHUI-Mine is two to nine times smaller than that generated by HUC-Prune, and one to two orders of magnitude smaller than that generated by FUM and Two-Phase. On Mushroom, although CHUI-Mine still generates the fewest candidates, the differences between the four algorithms are not as obvious as on Chess. On BMS-POS, the difference in the number of candidates for the four algorithms is clear. CHUI-Mine once again generates fewer candidates than all the other algorithms; in fact, only a quarter as many as Two-Phase. These comparison results show that the proposed pruning method is effective.
Fig. 16

Number of candidates on Chess

Fig. 17

Number of candidates on Mushroom

Fig. 18

Number of candidates on BMS-POS

We also compared the memory usage for the three datasets, as shown in Figs. 19, 20, and 21, respectively. On Chess, Two-Phase and FUM have near-constant memory usage of around 200 MB for thresholds between 60 % and 75 %, with FUM using slightly less memory than Two-Phase. HUC-Prune uses less memory than these two algorithms when the utility threshold is high; however, once the utility threshold decreases to 63 %, its memory usage increases explosively. Although its memory requirement grows as the utility threshold decreases, CHUI-Mine still requires the least memory. On Mushroom, the memory usage of Two-Phase and FUM remains constant at a level above 200 MB. CHUI-Mine uses slightly less space than HUC-Prune, with the gap between them more obvious when the utility threshold is lower than 15 %. On BMS-POS, the memory usage of Two-Phase increases to about 270 MB, while FUM consumes about 210 MB. The memory usage of HUC-Prune is similar to that on Chess: it requires only a little more space than CHUI-Mine when the threshold is high, but its usage increases rapidly as the threshold decreases, exceeding that of FUM when the utility threshold is lower than 1.4 % and that of Two-Phase when the threshold is lower than 1.1 %. The memory usage of CHUI-Mine increases slowly for thresholds between 1 % and 2 % and never exceeds 200 MB.
Fig. 19

Memory usage on Chess

Fig. 20

Memory usage on Mushroom

Fig. 21

Memory usage on BMS-POS

From the above discussion, we can see that CHUI-Mine outperforms Two-Phase, FUM, and HUC-Prune with regard to efficiency, number of candidates, and memory usage on real datasets.

4.4 Scalability

In the following experiments, we varied the dataset size and number of items to evaluate scalability of the four algorithms. The datasets were all generated by the IBM data generator [32].

Figure 22 shows the scalability of the algorithms when increasing the number of transactions on T10I4 with a minimum utility threshold of 3 %. The number of transactions varied from 10k to 100k. The running time of CHUI-Mine increases approximately linearly with an increase in the size of the database. On average, the running time of CHUI-Mine is less than that of HUC-Prune, FUM, and Two-Phase, by 28 %, 60 %, and 94 %, respectively.
Fig. 22

Scalability on T10I4D10-100k

Figure 23 shows the scalability comparisons by varying the database size of the large dataset T5N5 with a minimum utility threshold of 5 %. The number of transactions varied from 1 million to 3 million. On this large dataset, FUM and Two-Phase required almost the same time for most cases. The running time of CHUI-Mine is, on average, less than that of HUC-Prune, FUM, and Two-Phase, by 17 %, 31 %, and 33 %, respectively. This experiment shows that CHUI-Mine is scalable on a dataset with a large number of transactions.
Fig. 23

Scalability on T5N5D1M-3M

We also generated another dataset series based on T10I4D100k, with a minimum utility threshold of 3 %, to evaluate scalability. The number of items varied from 100 to 500. As shown in Fig. 24, the running time of CHUI-Mine is 15 % less than that of HUC-Prune. Moreover, CHUI-Mine is 2 and 3 times faster than FUM and Two-Phase, respectively.
Fig. 24

Scalability on T10I4D100kN100-500

From the above discussion, by varying either the dataset size or the number of items, CHUI-Mine reduces the running time for HUI mining while offering linear scalability.

5 Conclusions

In this paper, we proposed a concurrent algorithm called CHUI-Mine for mining HUIs from transaction databases. A novel data structure called the CHUI-Tree was created for maintaining the information of HUIs. Using this structure, potential HUIs can be generated efficiently using two concurrent processes: one for constructing and dynamically pruning the tree, and then placing the conditional trees into a buffer, and the other for reading the conditional pattern list from the buffer and mining HUIs. The rationality of dynamic pruning of the tree structure was also proved. In the experiments, both synthetic and real datasets were used to evaluate the performance of CHUI-Mine. The mining performance is enhanced since both the search space and the number of candidates are effectively reduced. In addition, the experimental results show that CHUI-Mine is both efficient and scalable.

Notes

Acknowledgements

We would like to express our deep gratitude to the anonymous reviewers of this paper. The work is partly supported by the National Natural Science Foundation of China (61105045), Funding Project for Academic Human Resources Development in Institutions of Higher Learning Under the Jurisdiction of Beijing Municipality (PHR201108057), and North China University of Technology (CCXZ201303).

References

1. Adnan M, Alhajj R (2009) DRFP-tree: disk-resident frequent pattern tree. Appl Intell 30(2):84–97
2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB’94), pp 487–499
3. Ahmed CF, Tanbeer SK, Jeong B-S, Lee Y-K (2009) Efficient tree structures for high utility pattern. IEEE Trans Knowl Data Eng 21(12):1708–1721
4. Ahmed CF, Tanbeer SK, Jeong B-S, Lee Y-K (2011) HUC-Prune: an efficient candidate pruning technique to mine high utility patterns. Appl Intell 34(2):181–198
5. Barber B, Hamilton HJ (2003) Extracting share frequent itemsets with infrequent subsets. Data Min Knowl Discov 7(2):153–185
6. Chan R, Yang Q, Shen Y-D (2003) Mining high utility itemsets. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM’03), pp 19–26
7. Erwin A, Gopalan RP, Achuthan NR (2007) CTU-Mine: an efficient high utility itemset mining algorithm using the pattern growth approach. In: Proceedings of the 7th IEEE international conference on computer and information technology (CIT’07), pp 71–76
8. Erwin A, Gopalan RP, Achuthan NR (2007) A bottom-up projection based algorithm for mining high utility itemsets. In: Proceedings of the 2nd international workshop on integrating artificial intelligence and data mining (AIDM’07), pp 3–10
9. Erwin A, Gopalan RP, Achuthan NR (2008) Efficient mining of high utility itemsets from large datasets. In: Proceedings of the 12th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD’08), pp 554–561
10. Grahne G, Zhu J (2005) Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans Knowl Data Eng 17(10):1347–1362
11. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
12. Han J, Kamber M (2006) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann, San Francisco
13. Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
14. Hu J, Mojsilovic A (2007) High-utility pattern mining: a method for discovery of high-utility item sets. Pattern Recognit 40(11):3317–3324
15. Kaya M, Alhajj R (2008) Online mining of fuzzy multidimensional weighted association rules. Appl Intell 29(1):13–34
16. Lee C-H (2007) IMSP: an information theoretic approach for multi-dimensional sequential pattern mining. Appl Intell 26(3):231–242
17. Lan G-C, Hong T-P, Tseng VS (2012) A projection-based approach for discovering high average-utility itemsets. J Inf Sci Eng 28(1):193–209
18. Li H-F, Huang H-Y, Lee S-Y (2011) Fast and memory efficient mining of high-utility itemsets from data streams: with and without negative item profits. Knowl Inf Syst 28(3):495–522
19. Li H-F (2011) MHUI-max: an efficient algorithm for discovering high-utility itemsets from data streams. J Inf Sci 37(5):532–545
20. Li Y-C, Yeh J-S, Chang C-C (2008) Isolated items discarding strategy for discovering high utility itemsets. Data Knowl Eng 64(1):198–217
21. Lin C-W, Hong T-P, Lu W-H (2011) An effective tree structure for mining high utility itemsets. Expert Syst Appl 38(6):7419–7424
22. Liu Y, Liao W-K, Choudhary AN (2005) A two-phase algorithm for fast discovery of high utility itemsets. In: Proceedings of the 9th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’05), pp 689–695
23. Maunz A, Helma C, Kramer S (2011) Efficient mining for structurally diverse subgraph patterns in large molecular databases. Mach Learn 83(2):193–218
24. Piatetsky-Shapiro G (2007) Data mining and knowledge discovery 1996 to 2005: overcoming the hype and moving from “university” to “business” and “analytics”. Data Min Knowl Discov 15(1):99–105
25. Rauch J (2005) Logic of association rules. Appl Intell 22(1):9–28
26. Roscoe AW (2010) Understanding concurrent systems. Springer, London
27. Shie B-E, Yu PS, Tseng VS (2013) Mining interesting user behavior patterns in mobile commerce environments. Appl Intell 38(3):418–435
28. Song W, Yang BR, Xu ZY (2008) Index-BitTableFI: an improved algorithm for mining frequent itemsets. Knowl-Based Syst 21(6):507–513
29. Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66
30. Tseng M-C, Lin Y-Y, Jeng R (2008) Updating generalized association rules with evolving taxonomies. Appl Intell 29(3):306–320
31. Tseng VS, Wu C-W, Shie B-E, Yu PS (2010) UP-Growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10), pp 253–262
32.
33. Frequent itemset mining implementations repository. http://fimi.ua.ac.be/
34. Wang S-L, Patel D, Jafari A, Hong T-P (2007) Hiding collaborative recommendation association rules. Appl Intell 27(1):67–77
35. Wang Y-T, Cheng J-T (2011) Mining periodic movement patterns of mobile phone users based on an efficient sampling approach. Appl Intell 35(1):32–40
36. Yao H, Hamilton HJ, Butz CJ (2004) A foundational approach to mining itemset utilities from databases. In: Proceedings of the 4th SIAM international conference on data mining (SDM’04), pp 482–486
37. Yao H, Hamilton HJ (2006) Mining itemset utilities from transaction databases. Data Knowl Eng 59(3):603–626
38. Yu G, Shao S, Luo B, Zeng X (2009) A hybrid method for high-utility itemsets mining in large high-dimensional data. Int J Data Warehous Min 5(1):57–73
39. Zaki MJ, Gouda K (2003) Fast vertical mining using diffsets. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’03), pp 326–335
40. Zhang S, Chen F, Wu X, Zhang C, Wang R (2012) Mining bridging rules between conceptual clusters. Appl Intell 36(1):108–118

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

College of Information Engineering, North China University of Technology, Beijing, China