Abstract
High utility itemset mining is a crucial research area that focuses on identifying combinations of items whose utility value in a database exceeds a user-specified threshold. However, most existing algorithms assume that databases are static, which is unrealistic for real-life datasets that grow continuously with new data. Furthermore, existing algorithms rely only on the utility value to identify relevant itemsets, so even combinations that occurred only at the very beginning of the data are produced as output. Although some mining algorithms adopt a support-based approach to account for itemset frequency, they do not consider the temporal nature of itemsets. To address these challenges, this paper proposes the Scented Utility Miner (SUM) algorithm, which uses a reinduction strategy to track the recency of itemset occurrences and mine itemsets from incremental databases. The paper provides a novel approach for mining high utility itemsets from dynamic databases and presents several experiments that demonstrate the effectiveness of the proposed approach.
1 Introduction
Advances in knowledge discovery and data mining have led to the development of efficient techniques for extracting useful information from databases in the form of patterns. The big data revolution has enhanced the process of data collection and retrieval from databases, leading to the application of data mining in various fields, such as web data mining, health record analysis, and smart device failure analysis. Mining transaction databases for useful patterns is one of the most widely explored fields in data mining [21]. Initially, frequent itemsets were extracted from transaction databases, but this approach does not consider more relevant parameters such as the cost and quantity of items. High Utility Itemset Mining (HUIM) overcomes this limitation by mining patterns with utility values higher than a user-defined minimum threshold. However, mining high utility itemsets is more complex than mining frequent itemsets because the utility measure lacks the downward closure property: the subsets of a high utility itemset may or may not be high utility itemsets. Efficient pruning mechanisms are required to tackle this problem, and the transaction-weighted downward closure property and other storage and retrieval techniques have been proposed in the literature. However, most of these algorithms are meant for static datasets and generate a large search space, requiring multiple database scans. Since real-life datasets are ever-growing, high utility itemset mining algorithms must take into account the incremental nature of the databases.
Mining itemsets from an expanding database presents new challenges, including the need for mined patterns to capture the latest trends in the data due to the addition of new transactions. Outdated patterns can be misleading and should be eliminated from the results. Additionally, as the input database evolves, the minimum utility threshold required for mining may vary and therefore should be dynamically computed based on the current set of underlying records. To address these challenges, this paper proposes a novel algorithm called the Scented Utility Miner (SUM). SUM is based on a data structure called the residue map, proposed in [18], and uses a master map to hold the necessary information for extracting high utility itemsets (HUIs). The algorithm uses a reinduction strategy to track the recency of occurrence of itemsets and dynamically compute the minimum utility threshold value. The main contributions of this paper are listed as follows:
-
1.
The proposed Scented Utility Miner (SUM) algorithm tackles the challenges of mining itemsets from evolving databases. This is achieved by utilizing two data structures, the residue map and the master map, which retain useful information from previous scans and store the latest information from incremental updates.
-
2.
To capture the relevance of itemsets in the database, the algorithm uses a reinduction-based strategy. The reinduction counter of an itemset ensures that obsolete patterns are removed from the set of high utility itemsets and are not used to produce new combinations. If an itemset becomes relevant again with the addition of new transactions, it is reinducted into the probable set of high utility itemsets and used to compute combinations with other relevant itemsets.
-
3.
A dynamic minimum threshold setting strategy is introduced to ensure that high utility itemsets are mined against a suitable threshold value based on the latest trends in the database.
-
4.
The efficacy of the proposed algorithm is validated through experiments conducted on both real and synthetic datasets.
The paper is organized as follows: Section 2 presents the problem of high utility itemset mining, followed by a discussion of related work in Sect. 3. The proposed work is introduced in Sect. 4, and its functioning is illustrated in Sect. 5. Section 6 presents the experimental results, and Sect. 7 concludes the paper with a discussion on future directions.
2 Problem Statement
The problem of HUIM can be formally stated as follows; the terminology used here is adopted from the work discussed in [15].
-
I = \(\{i_{1}, i_{2},\dots , i_{m}\}\) represents a set of items.
-
D = \(\{T_{1}, T_{2},\dots , T_{n}\}\) is a database of transactions such that every transaction \(T_i \in D\), consists of pairs of the form a:b, where a is the item-id and b is the internal utility value of a.
-
\(q(i_{p}, T_{q})\) represents the internal or local transaction utility value. It is the quantity of item \(i_{p}\) in transaction \(T_{q}\).
-
\(pr(i_{p})\) is the external utility, which is the profit associated with item \(i_{p}\).
-
\(u(i_{p}, T_{q})\) is the measure of utility for item \(i_{p}\) in transaction \(T_{q}\). It is defined as \(q(i_{p}, T_{q}) \times pr(i_{p})\).
-
\(u(X, T_{q})\) is the utility of itemset X in transaction \(T_{q}\). It is defined as \(\sum _{i_{p} \in X} u(i_{p}, T_{q})\), where X = \(\{i_{1}, i_{2},\dots , i_{r}\}\) is an r-itemset, X \(\subseteq T_{q}\), \(1 \le r \le m\), and \(\{i_{1}, i_{2},\dots , i_{m}\}\) represents the set of items in D.
-
u(X) represents the utility of itemset X in the database D, which is \(\sum _{T_{q}\in D \wedge X \subseteq T_{q}} u(X,T_{q})\).
The problem of utility mining is to mine all the high utility itemsets in a given database. An itemset X is a high utility itemset if u(X) \(\ge \epsilon\), where X \(\subseteq\) I and \(\epsilon\) is the minimum utility threshold.
In the context of incremental databases, new transactions are added to the database over time, and the utility values of itemsets may change due to these updates. Let \(\mathcal {D}_k\) be the database consisting of the first k transactions, where \(1\le k\le N\) and N is the total number of transactions received so far. The problem of mining high utility itemsets from incremental databases is to efficiently mine the high utility itemsets from \(\mathcal {D}_k\) for all \(k=1,2,\dots ,N\), given the minimum utility threshold.
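These definitions can be made concrete with a short sketch; the items, quantities, and profits below are invented for illustration and are not taken from the paper's figures.

```java
import java.util.*;

// Sketch of the utility definitions above: u(i,T) = q(i,T) * pr(i),
// u(X,T) sums over the items of X, and u(X) sums over the transactions
// containing X. All item names, quantities, and profits are hypothetical.
public class UtilityDemo {
    // External utility pr(i): profit per unit of each item.
    public static final Map<String, Integer> PROFIT = Map.of("a", 2, "b", 3, "c", 1);

    // Each transaction maps item-id -> internal utility q(i,T) (quantity).
    public static final List<Map<String, Integer>> DB = List.of(
        Map.of("a", 1, "b", 2),          // T1
        Map.of("a", 2, "c", 4),          // T2
        Map.of("a", 1, "b", 1, "c", 2)   // T3
    );

    // u(X, T): utility of itemset X in transaction T; zero if T does not contain X.
    public static int utility(Set<String> x, Map<String, Integer> t) {
        if (!t.keySet().containsAll(x)) return 0;
        int u = 0;
        for (String item : x) u += t.get(item) * PROFIT.get(item);
        return u;
    }

    // u(X): utility of itemset X over the whole database D.
    public static int utility(Set<String> x) {
        int u = 0;
        for (Map<String, Integer> t : DB) u += utility(x, t);
        return u;
    }

    public static void main(String[] args) {
        // u({a,b}) = u in T1 (1*2 + 2*3 = 8) + u in T3 (1*2 + 1*3 = 5) = 13.
        System.out.println("u({a,b}) = " + utility(Set.of("a", "b")));
    }
}
```

With a threshold of, say, 12, the itemset {a, b} would be a high utility itemset under these illustrative values.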
For example, in the database shown in Fig. 1, each record is uniquely identified by an identifier and consists of a set of items. Each item has corresponding internal and external utility values, as shown in Fig. 2. If a set of new transactions is added to the original database, as shown in Fig. 3, then the problem of high utility itemset mining from incremental datasets is to extract all itemsets from the total set of transactions whose utility value is higher than or equal to the user-specified minimum utility threshold.
3 Related Work
There are numerous algorithms available in the literature that focus on extracting high utility items from large databases. The initial solutions to the utility mining problem are classified based on the techniques employed to extract high utility patterns and the data structures used to store the resulting set of items. These approaches can be categorized into three main types: generate and test based techniques, tree-based pattern growth techniques, and list-based techniques. This section provides a detailed explanation of these techniques and reviews the existing literature on high utility itemset mining.
3.1 Generate-and-Test Based Techniques
These techniques involve generating itemsets or patterns and then evaluating their utility against a predefined threshold. One of the earliest techniques for HUIM is the two-phase algorithm [16], which defines a transaction-weighted downward closure property. It uses the sum of transaction utilities to prune the search space by eliminating candidates that have transaction-weighted utility values lower than the user-defined threshold. The transaction weighted downward closure property plays a crucial role in designing efficient algorithms for mining high utility items from large databases by allowing the elimination of unpromising itemsets during the mining process.
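The transaction-weighted pruning idea can be sketched as follows, assuming the standard definition in which TWU(X) is the sum of the total utilities of the transactions containing X; the data and threshold are hypothetical.

```java
import java.util.*;

// Sketch of transaction-weighted utility (TWU) pruning. TWU(X) sums the total
// utilities of the transactions containing X and over-estimates u(X), so any X
// with TWU(X) below the threshold can be discarded along with all its supersets.
// The item profits, transactions, and threshold are hypothetical.
public class TwuDemo {
    public static final Map<String, Integer> PROFIT = Map.of("a", 2, "b", 3, "c", 1);
    public static final List<Map<String, Integer>> DB = List.of(
        Map.of("a", 1, "b", 2),
        Map.of("a", 2, "c", 4),
        Map.of("a", 1, "b", 1, "c", 2)
    );

    // TU(T): total utility of a transaction.
    public static int transactionUtility(Map<String, Integer> t) {
        int tu = 0;
        for (Map.Entry<String, Integer> e : t.entrySet())
            tu += e.getValue() * PROFIT.get(e.getKey());
        return tu;
    }

    // TWU(X): sum of TU(T) over all transactions containing X.
    public static int twu(Set<String> x) {
        int sum = 0;
        for (Map<String, Integer> t : DB)
            if (t.keySet().containsAll(x)) sum += transactionUtility(t);
        return sum;
    }

    public static void main(String[] args) {
        int threshold = 20;
        for (String item : PROFIT.keySet()) {
            int w = twu(Set.of(item));
            System.out.println(item + ": TWU=" + w
                + (w >= threshold ? " -> kept" : " -> pruned with all supersets"));
        }
    }
}
```

The downward closure holds for TWU (not for utility itself), which is what makes this pruning safe.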
3.2 Tree-Based Pattern Growth Techniques
Pattern growth approaches are used to produce HUIs by utilizing tree-like data structures to store base patterns and incrementally updating the utility-related information [1, 9, 19]. These pattern growth techniques utilize a compact and efficient data structure, typically referred to as a tree or a prefix tree, to represent frequent patterns or itemsets. The initial step involves constructing this tree from the transactional database. The tree structure is then traversed and expanded to generate meaningful itemsets based on a given minimum utility threshold [12]. The utility-related information, such as the total utility of each itemset, is maintained and updated during the tree traversal process.
By utilizing these tree-like data structures, pattern growth approaches efficiently mine HUIs [17]. They avoid generating unnecessary intermediate itemsets and focus only on relevant high utility patterns. The incremental updates to utility-related information allow for efficient pruning and filtering of unpromising itemsets.
3.3 List-Based Techniques
Inverted list-based approaches are also used to solve the HUIM problem, where inverted lists are formed from database scans that are recursively traversed to produce meaningful combinations of potential HUIs. Then, the minimum utility threshold is imposed on the potential items to extract the final set of HUIs. The HUI-miner algorithm [14] provides an important design called utility lists to store utility-relevant information for each item in the form of transaction-ID, utility value, and remaining utility value. However, it suffers from the drawback of performing a highly time-consuming join operation between utility lists to generate potential itemsets. Moreover, it requires different definitions of the join operation for single items and itemsets with size greater than one and also suffers from a large number of redundant join operations.
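The utility-list layout described above can be sketched as follows; the field names and values are illustrative and do not reproduce the original HUI-miner implementation.

```java
import java.util.*;

// Sketch of a HUI-miner style utility list: one entry per transaction
// containing the itemset, holding the transaction-ID, the itemset's utility
// in that transaction, and the remaining utility. Values are hypothetical.
public class UtilityList {
    public record Entry(int tid, int iutil, int rutil) {}

    public final List<Entry> entries = new ArrayList<>();

    public void add(int tid, int iutil, int rutil) {
        entries.add(new Entry(tid, iutil, rutil));
    }

    // Sum of the iutil fields: the exact utility u(X) of the itemset.
    public int sumIutil() {
        return entries.stream().mapToInt(Entry::iutil).sum();
    }

    // Sum of iutil + rutil: an upper bound used to prune extensions of X.
    public int upperBound() {
        return entries.stream().mapToInt(e -> e.iutil() + e.rutil()).sum();
    }

    public static void main(String[] args) {
        UtilityList ul = new UtilityList();
        ul.add(1, 8, 5);  // hypothetical entries
        ul.add(3, 5, 2);
        System.out.println("u(X) = " + ul.sumIutil() + ", bound = " + ul.upperBound());
    }
}
```

The costly join operation criticized in the text intersects two such lists on their TID fields, which is what the later list-based refinements try to avoid or speed up.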
The FHM algorithm [7] uses a special data structure to store utility values of correlated itemsets to improve the performance of the join operation. Paper [10] introduces several pruning strategies to enhance the mining process of HUI-miner in terms of execution time and memory. A list-based mining procedure is discussed in [23] with several techniques such as fast utility counting, high utility database projection, and high utility transaction merging.
Recently, the residue map data structure was introduced in [17]. A residue map is defined for every distinct item in the dataset and stores utility-related information in the fields TID-list, total utility, and residue, where the residue is a parameter that depends on the user-defined minimum threshold. A set of residue maps sorted in ascending order of total utility value is utilized to recursively mine the high utility items. Additionally, several evolutionary algorithms have also been utilized for HUIM, as in [2, 5].
One of the key challenges for existing high utility mining algorithms is the ability to effectively mine dynamic databases, which are subject to constant change over time. While a number of algorithms have been developed for static databases, few have been designed to handle the unique requirements of dynamic databases.
One approach to handling dynamic databases is discussed in [13], which categorizes transactions into cases and processes each case separately to maintain discovered high utility itemsets. Another algorithm, PRE-HUI-INS [11], extends the pre-large concept [8] by defining upper and lower utility thresholds for deriving high utility and pre-large utility itemsets. HUI-list-INS [11] maintains and updates utility list structures for mining HUIs with transaction insertion based on FHM [7], while LIHUP [22] constructs a global data structure through a single scan and restructures it according to an optimal sorting order for mining HUIs.
However, most of these algorithms mine items using a single value of the minimum utility threshold, which may not accurately capture the changing trends in dynamic databases over time. Even in static databases of huge size, the combinations of itemsets that are relevant at the beginning of the dataset may not be relevant towards the end. To address this challenge, sliding-window approaches have been proposed for mining dynamic databases [3].
Some studies have discussed support-based mining procedures [20] to address this gap, but this approach has a major drawback: itemsets are produced as output only after they reach a certain support threshold. This means that some recent items may not be produced as output during periods of the data where they are highly useful. Thus, there is a need for more efficient algorithms that take into account the incremental nature of databases while producing highly relevant itemsets.
This study proposes an efficient algorithm that addresses the research gaps discussed above. The algorithm assigns a reinduction-based counter and dynamically sets the value of the minimum utility threshold, thereby producing highly relevant itemsets while accounting for the incremental nature of databases. This approach can help improve the effectiveness of mining dynamic databases and has potential applications in various domains, such as healthcare, finance, and e-commerce. The proposed algorithm can also be extended and customized for specific applications and datasets.
4 Proposed Work
The proposed work introduces a novel algorithm, the Scented Utility Miner (SUM), for high utility itemset mining that uses residue maps to store the utility-related information for each itemset. The residue map consists of four fields, namely, itemset, TID-list, total utility, and residue, as shown in Fig. 4.
A sorted set of residue maps forms the master map, which is traversed recursively to mine the set of high utility items. The main advantage of using residue maps for mining utility itemsets is their ease of adaptability: the minimum threshold information needed to compute the residue value is stored directly in each map, so as new transactions are inserted and the minimum utility threshold varies, only those residue maps that are relevant to the current set of records are generated.
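A minimal sketch of the residue-map fields and the ascending-order master map follows; the item names, TID-lists, utilities, and threshold below are hypothetical, not taken from the paper's figures.

```java
import java.util.*;

// Sketch of the residue-map fields described in the text (itemset, TID-list,
// total utility, residue), with residue = minimum utility threshold - total
// utility. All item names, TID-lists, and utility values are hypothetical.
public class ResidueMapDemo {
    public record ResidueMap(String itemset, List<Integer> tidList,
                             int totalUtility, int residue) {}

    public static ResidueMap build(String itemset, List<Integer> tids,
                                   int totalUtility, int minUtil) {
        return new ResidueMap(itemset, tids, totalUtility, minUtil - totalUtility);
    }

    public static void main(String[] args) {
        int minUtil = 11;  // hypothetical threshold
        List<ResidueMap> masterMap = new ArrayList<>(List.of(
            build("w", List.of(1, 4), 11, minUtil),
            build("p", List.of(2, 3, 5), 7, minUtil),
            build("q", List.of(1, 2), 9, minUtil)
        ));
        // The master map keeps residue maps in ascending order of total utility.
        masterMap.sort(Comparator.comparingInt(ResidueMap::totalUtility));
        for (ResidueMap rm : masterMap)
            System.out.println(rm.itemset() + ": TU=" + rm.totalUtility()
                + ", residue=" + rm.residue());
    }
}
```

A residue of zero or less means the itemset already meets the threshold, which is why the threshold is folded into the structure rather than checked separately.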
The SUM algorithm is designed to cater to the requirements of incremental updates to the initial dataset. For each increment to the dataset, the master map is updated using the concepts of item reinduction and dynamic setting of the mining threshold. The formal algorithm of the proposed scheme is presented in Algorithm 1.
Algorithm 1 takes as input a database D with k transactions and n distinct items, where each item holds a utility value. It also takes the minimum utility threshold and the maximum value of the reinduction counter as inputs. The output of the algorithm is the set of high utility itemsets.
The algorithm works by first updating the reinduction counters of all the items in the master map: initially, the reinduction counter of every item in the master map is decremented by one (line:2). Once a given transaction is scanned, the reinduction counters of the items present in that transaction are set to the maximum reinduction count (line:13, line:21). Then, for each transaction in the database, the algorithm processes each item in that transaction. If an item is not in the master map (line:6), the algorithm creates a new residue map for that item and adds it to the master map (line:7-12). If the item is already in the master map (line:15), the algorithm retrieves its residue map and updates its TID-list, total utility, and residue fields (line:15-20).
The algorithm then checks whether the TWU (Transaction-Weighted Utility) of the item is greater than the minimum utility threshold (line:23). If so, the algorithm generates a sorted set of residue maps whose utility is greater than or equal to that of the item (line:24) and traverses the master map recursively to generate combinations of itemsets and produce high utility itemsets (line:29).
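The per-transaction bookkeeping of Algorithm 1 (residue-map creation or update plus reinduction-counter refresh) might be sketched as follows; the field names and the threshold/window values are illustrative, and the recursive combination step is omitted.

```java
import java.util.*;

// Condensed sketch of the per-transaction processing in Algorithm 1: each item
// either gets a new residue map (lines 7-12) or its existing map is updated
// (lines 15-20), and its reinduction counter is reset to the window size.
// Field names and the minUtil/wn values are illustrative, not the paper's code.
public class SumUpdateDemo {
    public static class RMap {
        public final List<Integer> tids = new ArrayList<>();
        public int totalUtility;
        public int residue;
    }

    public final int minUtil, wn;
    public final Map<String, RMap> masterMap = new LinkedHashMap<>();
    public final Map<String, Integer> reinduction = new HashMap<>();

    public SumUpdateDemo(int minUtil, int wn) { this.minUtil = minUtil; this.wn = wn; }

    // Scan one transaction, given as item -> utility of the item in this transaction.
    public void scan(int tid, Map<String, Integer> transaction) {
        // Counters of all known items decay by one per transaction (line 2).
        reinduction.replaceAll((item, c) -> c - 1);
        for (Map.Entry<String, Integer> e : transaction.entrySet()) {
            RMap rm = masterMap.computeIfAbsent(e.getKey(), k -> new RMap());
            rm.tids.add(tid);
            rm.totalUtility += e.getValue();
            rm.residue = minUtil - rm.totalUtility;  // threshold-dependent residue field
            reinduction.put(e.getKey(), wn);         // item seen: reset to window size
        }
    }

    public static void main(String[] args) {
        SumUpdateDemo sum = new SumUpdateDemo(11, 3);
        sum.scan(1, Map.of("p", 4, "q", 5));
        sum.scan(2, Map.of("p", 3));
        System.out.println("p: TU=" + sum.masterMap.get("p").totalUtility
            + " residue=" + sum.masterMap.get("p").residue
            + " RC=" + sum.reinduction.get("p")
            + "; q: RC=" + sum.reinduction.get("q"));
    }
}
```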
The proposed algorithm is efficient and scalable since it only generates relevant residue maps as new transactions are inserted and the minimum utility threshold varies. The use of residue maps enables ease of adaptability, and the concept of reinduction of items improves the efficiency of the mining process.
4.1 Incremental Updates to Transaction Database
The SUM algorithm begins by constructing an initial master map for a given database. This map is sorted based on the total utility values from the corresponding set of residue maps, as demonstrated in Algorithm 1. Next, the algorithm sequentially scans every transaction in the input database to identify a set of items. For each newly discovered item, a new residue map is constructed. If an item has already been scanned in the database, then its residue map is updated.
For a database D, if a set of new transactions N is inserted such that the updated database is D’, then the following properties hold.
- Property-1:
-
If an itemset X is a high utility itemset in the original database D, and X does not appear in the newly inserted transaction set N, then X will also be a high utility itemset in the updated database D’ for a given minimum utility threshold.
- Property-2:
-
If an itemset X is not a high utility itemset in the original database D, and X does not appear in the newly inserted transaction set N, then X will also not be a high utility itemset in the updated database D’ for a given minimum utility threshold.
- Property-3:
-
If an itemset X is not a high utility itemset in the original database D, and X appears in the newly inserted transaction set N, then X will be a high utility itemset in the updated database D’ if and only if its total utility \({U_X}\) in D’ is greater than or equal to the specified minimum utility threshold.
To account for incremental updates to the database, the SUM algorithm employs the existing master map to incorporate new information from transactions. When a new set of transactions is added to the database, the algorithm checks if any new items are present. If so, a residue map is created and a pointer to it is added to the master map. If an itemset already exists in the database, the corresponding residue map is retrieved and updated with the new information. The master map is then rearranged to sort the itemsets based on total utility in ascending order. Finally, the updated master map is recursively traversed to mine the exact set of HUIs for the updated database.
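The increment-handling step can be sketched as follows, reducing each residue map to its total-utility field for brevity; all item names and values are hypothetical.

```java
import java.util.*;

// Sketch of folding an incremental update into existing per-item totals and
// re-sorting the master map in ascending order of total utility. Each residue
// map is reduced here to its total-utility field; all values are hypothetical.
public class IncrementDemo {
    public static LinkedHashMap<String, Integer> applyIncrement(
            Map<String, Integer> totals, List<Map<String, Integer>> newTransactions) {
        Map<String, Integer> updated = new HashMap<>(totals);
        for (Map<String, Integer> t : newTransactions)
            for (Map.Entry<String, Integer> e : t.entrySet())
                updated.merge(e.getKey(), e.getValue(), Integer::sum); // update or create
        // Rearrange in ascending order of total utility, as the master map requires.
        LinkedHashMap<String, Integer> sorted = new LinkedHashMap<>();
        updated.entrySet().stream()
               .sorted(Map.Entry.comparingByValue())
               .forEach(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = Map.of("p", 7, "q", 9, "w", 11);
        List<Map<String, Integer>> increment = List.of(Map.of("q", 6, "s", 2));
        System.out.println(applyIncrement(totals, increment));
    }
}
```

Note that a brand-new item (s above) simply enters the map, while existing items accumulate utility, matching the create-or-update rule in the text.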
4.2 Maintaining Relevance of Itemsets Across Multiple Database Updates
To prevent outdated patterns from impacting the mining process, it is crucial to assign a relevance factor based on the recency of itemset occurrences in the database. Therefore, to evaluate the relevance of itemsets, the concept of reinduction is introduced as follows:
-
1.
A reinduction counter, \(\text{RC}_i\), is defined for every item i in the database. A window size, \(W_n\), is defined as the maximum number of consecutive transactions within which an item must occur at least once to be considered relevant.
-
2.
The reinduction counter for every item is initialized with a value of zero.
$$\begin{aligned} \forall i \in D,\; \text{RC}_i \leftarrow 0 \end{aligned}$$ -
3.
As a sequence of transactions in the database is scanned, for every new item x that is discovered, its reinduction counter is set to the value corresponding to the maximum window size, \(W_n\).
$$\begin{aligned} \text{RC}_x \leftarrow W_n \end{aligned}$$ -
4.
For every subsequent transaction, the reinduction counter is decremented by one if the item does not occur in the transaction. That is,
$$\begin{aligned} \text {if } x \notin T_i, \quad \text{RC}_x \leftarrow \text{RC}_x - 1 \end{aligned}$$ -
5.
For every subsequent transaction, the reinduction counter is set to the maximum window size if the item occurs in the transaction. That is,
$$\begin{aligned} \text {if } x \in T_i, \quad \text{RC}_x \leftarrow W_n \end{aligned}$$
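The five steps above can be transcribed directly as follows; the transaction stream and the window size are illustrative.

```java
import java.util.*;

// Direct transcription of steps 1-5 above: every counter starts at zero, is set
// to the window size Wn whenever the item occurs in a transaction, and is
// decremented by one otherwise. The transaction stream and Wn are hypothetical.
public class ReinductionDemo {
    public static Map<String, Integer> counters(List<Set<String>> transactions,
                                                Set<String> items, int wn) {
        Map<String, Integer> rc = new HashMap<>();
        for (String i : items) rc.put(i, 0);  // step 2: initialize to zero
        for (Set<String> t : transactions)
            for (String i : items)
                rc.put(i, t.contains(i) ? wn : rc.get(i) - 1);  // steps 3-5
        return rc;
    }

    public static void main(String[] args) {
        // Item y occurs only in the first of four transactions; with Wn = 3 its
        // counter decays from 3 back down to 0, while x is refreshed each time.
        List<Set<String>> db = List.of(
            Set.of("x", "y"), Set.of("x"), Set.of("x"), Set.of("x"));
        System.out.println(counters(db, Set.of("x", "y"), 3));
    }
}
```

Items whose counters have decayed to zero or below are the "obsolete" ones excluded from the master map.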
To ensure that only relevant items are included in the master map, the reinduction counters of the single items are computed, and only those items with a positive reinduction counter are considered for insertion in the master map. Computing reinduction counters for every item in the database ensures that the relevance of items is maintained even as the database grows with the addition of new transactions. The window size defines how recently an item needs to have occurred to be considered relevant, and it can be set to suit the requirements of an application. For applications that require high sensitivity to changes in incoming data, such as stock market analysis, a small window size can be used to ensure that the variation is reflected in the master map. For applications with slowly changing patterns, such as sales of products in a retail store, a moderate to high window size can be used.
4.3 Dynamic Selection of Minimum Utility Threshold Based on Incremental Updates in Database
To mine high utility itemsets (HUIs) from a growing database of transactions, the utility threshold value needs to be updated to reflect changing trends. Here is a proposed strategy to cater to the dynamic threshold value requirements:
-
1.
For the first scan of the database, a user-specified minimum utility threshold (\(\text{MU}_{0}\)) value is set. After mining the HUIs, the least total utility value (\(\text{TU}_{\text{Min}(0)}\)) for the current set of residue maps is identified to raise the threshold in the next scan.
-
2.
When the database is updated, a new minimum utility value (\(\text{MU}_{1}\)) is computed by adding the least total utility from the initial scan (\(\text{TU}_{\text{Min}(0)}\)) to the user-specified minimum utility threshold value (\(\text{MU}_{0}\)), as follows:
$$\begin{aligned} \text{MU}_{1} \leftarrow \text{MU}_{0} + \text{TU}_{\text{Min}(0)} \end{aligned}$$ -
3.
The least total utility value (\(\text{TU}_{\text{Min}(1)}\)) is also updated based on the new set of records for subsequent scans.
-
4.
For every subsequent database update, the minimum utility threshold value is systematically raised based on the previous utility value and the least total utility from the previous scan, as follows:
$$\begin{aligned} \text{MU}_{i+1} \leftarrow \text{MU}_{i} + \text{TU}_{\text{Min}(i)} \end{aligned}$$
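The threshold-raising rule can be sketched as follows; the starting threshold and the sequence of least total utilities are illustrative.

```java
import java.util.*;

// Sketch of the threshold-raising rule above: MU_{i+1} = MU_i + TU_Min(i).
// The starting threshold and least-total-utility sequence are illustrative.
public class ThresholdDemo {
    public static int[] thresholds(int mu0, int[] leastTotalUtility) {
        int[] mu = new int[leastTotalUtility.length + 1];
        mu[0] = mu0;  // user-specified threshold for the first scan
        for (int i = 0; i < leastTotalUtility.length; i++)
            mu[i + 1] = mu[i] + leastTotalUtility[i];  // raise after each update
        return mu;
    }

    public static void main(String[] args) {
        // With MU0 = 11 and least total utilities 3 and 2 across two updates,
        // the thresholds used for successive scans are 11, 14, and 16.
        System.out.println(Arrays.toString(thresholds(11, new int[]{3, 2})));
    }
}
```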
This strategy ensures that the HUIs are mined based on a methodically computed minimum utility threshold value, which closely aligns with the underlying trends in the database. The least total utility value for an incremental update may remain the same, increase or decrease, and the threshold value is adjusted accordingly. Using a dynamically computed threshold is crucial to avoid generating a large number of irrelevant results that can obscure the underlying useful information.
5 SUM: An Example
To further clarify the explanation of the SUM algorithm, we use a sample database and a set of incremental updates, given in Fig. 5, applied to the original database. The database comprises six transactions and five items. Each item in a transaction is assigned a utility value. We now explain each step of the SUM algorithm, assuming a minimum utility threshold of eleven and a maximum reinduction counter (window size) of three.
-
Generation of residue maps A residue map for each item in the database, D, is formed by storing the IDs of the transactions in which the item occurs, the total utility value, and the residue, as shown in Fig. 6. Note that the residue is computed as the difference between the minimum utility threshold and the total utility value.
-
Computation of reinduction counters A mapping of every item to a reinduction counter is computed as follows. Initially, the reinduction counter of every distinct item in the database is assigned a zero value. Whenever an item is encountered for the first time, its reinduction counter is set to a user-specified maximum value. The reinduction counter is then decremented with every subsequent transaction until the item is encountered again, at which point it is reset to the maximum value. The computation procedure for reinduction counters is shown in Fig. 7.
-
Formation of master map The master map is formed by sorting the set of residue maps in ascending order of total utility value. Only those residue maps whose reinduction counters hold a positive value qualify for inclusion in the master map. The master map for the current database is shown in Fig. 8. Since the reinduction counter of the residue map for item r is negative, it does not qualify for inclusion in the master map.
-
Mining HUIs from the original database The initial set of HUIs is mined by recursively traversing the master map as shown in Fig. 9. The item w has a utility value of eleven, which is equal to the minimum utility threshold, and hence it is produced as output directly. The sets of residue maps generated from the comparisons shown in Fig. 9 are presented in Figs. 10, 11, 12.
5.1 Updating the Master Map Based on Incremental Data
Consider an incremental update to the original database with a new set of transactions, as shown in Fig. 13. Based on the newly generated data, the residue maps are updated, and the reinduction counters are re-computed to form the master map. It is to be noted that the new set of transactions may contain items that may or may not exist in the current set of residue maps.
Handling of residue maps based on an incremental update: When the incremental transactions are scanned, the existing residue maps are updated for the items that already exist from the previous scans. If an item is discovered that does not occur in the previous transactions, then a new residue map is constructed. The updated set of residue maps, based on the incremental update to the database, is shown in Fig. 14.
Handling of reinduction counters based on an incremental update: The reinduction counters should imply the recency of occurrence of an item in the database. So, the reinduction counters are set based on the complete set of transactions, not as per the individual increments to the database. The updated set of reinduction counters is shown in Fig. 15.
Raising minimum utility threshold: The minimum utility threshold is raised by adding the least total utility value from the previous database scan. From Fig. 6, it can be observed that the least total utility value is 3. So, the new minimum utility threshold is computed as minimum utility threshold + least utility, i.e., 11 + 3 = 14.
Formation of master map and generation of HUIs: Based on the updated data, the set of items with positive reinduction counters, that is, {p, q, r}, qualifies for the formation of the master map. Based on the newly updated master map, the HUIs are extracted by recursive iterations, as in the case of the original database, using the updated value of the minimum utility threshold.
6 Experimental Results
The SUM algorithm is implemented in Java on a computer with a 1.4 GHz quad-core Intel i5 processor and 8 GB of memory. In this section, we evaluate the performance of the SUM algorithm and compare it with that of state-of-the-art mining algorithms on several real-life datasets. The details of the datasets used in the experiments are given in Table 1. The algorithms are executed until one of the following conditions is met: a clear winner is observed, the execution time of an algorithm becomes prohibitively long, or the system runs out of memory.
6.1 Comparative Performance of SUM with Other High Utility Itemset Mining Algorithms
To evaluate the performance of the SUM algorithm, several experiments were conducted. In the first experiment, the performance of the SUM algorithm is assessed by comparing it with other high utility itemset mining algorithms. To conduct this comparison, a modified version of SUM, called m-SUM, is introduced. The m-SUM algorithm incorporates the concept of a reinduction counter into an existing list-based mining algorithm, that is, the ULB-miner [4]. ULB-miner is chosen for its reputation as one of the fastest list-based algorithms available. In the evaluation, both the execution time and memory consumption of the SUM algorithm are compared against those of ULB-miner and the newly introduced m-SUM algorithm. The results of this comparison are presented in Fig. 16 for execution time and in Fig. 17 for memory consumption.
As depicted in Fig. 16, the SUM algorithm exhibits the shortest execution time among the three algorithms. This notable improvement can be attributed to the use of residue maps and the master map as the underlying data structures for mining, which offer more efficient storage and processing of data, thereby reducing the computational overhead and speeding up execution. Furthermore, as depicted in Fig. 17, employing residue maps leads to a substantial reduction in memory consumption, which follows from the way they store and represent data: residue maps store only the differences, that is, the residual utility, instead of the absolute values of each item separately. When traditional data structures are used, the utility and the remaining utility of each item in a transaction are stored individually, requiring relatively more memory.
6.2 Comparative Performance of SUM with Support Based Mining Algorithms
In this section, we compare the performance of SUM with a support-based mining algorithm called FHM-frequent [20], which is the only existing algorithm in the literature that mines high utility items while taking into account the frequency of the items. However, FHM-frequent has a major drawback: an item is produced as output only once its occurrence frequency reaches the user-specified support value, so important items may not be reported during certain periods of the data. Moreover, generating an item as output only after its support crosses a specific threshold fails to consider the temporal relevance of the item, as its occurrence frequency might spike irregularly.
6.2.1 Comparison of the Count of Mined Itemsets
The SUM algorithm addresses this drawback by specifying windows indicating the intervals during which an item should occur at least once to be considered relevant. This point is further validated by the results of the comparison of the number of items that qualify for the reinduction condition and the support condition, as shown in Fig. 18, where the window size for the reinduction counter is set to 150 and the minimum support threshold is set to 0.1. The black markers in the figure represent the maximum number of distinct items scanned in the database with each incremental update. It can be observed from the figure that the number of items that satisfy the reinduction specification is much higher than the number of items that qualify for the support measure. Hence, support may not be an accurate measure to mine relevant high utility itemsets from databases.
6.2.2 Comparison of the Execution Time and the Memory Requirement
The execution time and memory requirement of the SUM and FHM-frequent algorithms are given in Figs. 19 and 20, respectively. It can be observed that FHM-frequent and SUM perform similarly in almost all cases. For the chess dataset, FHM-frequent does not finish executing, so its complete trace is missing from Figs. 19c and 20c. The SUM algorithm leverages residue maps as an intermediate data structure to store utility information from previous scans.
This approach reduces the need for repeated rescans of the entire database, which can be computationally expensive.
By efficiently retaining valuable information through residue maps, the SUM algorithm avoids redundant computations and significantly improves execution time and memory consumption, as evident from Figs. 19 and 20.
6.3 Comparative Performance of SUM with Incremental Mining Algorithms
We compared the performance of SUM with that of EIHI [6], currently the best-performing algorithm for incremental high utility itemset mining. The results are presented in Figs. 21 and 22, which clearly indicate that SUM outperforms EIHI, owing to its efficient handling of dynamic updates and its adaptability to changing database conditions.
This improved performance is attributed to the use of residue maps in SUM, which require significantly less memory compared to the trie-like structure used in EIHI. Specifically, the master map in SUM is used as an intermediate structure to store the utility information from previous scans, whereas EIHI stores HUIs using a trie-like structure. As the number of transactions increases, the trie-like structure in EIHI consumes more memory to store HUIs, whereas the master map in SUM serves both the mining process and intermediate result storage. As a result, SUM achieves better execution time and lower memory consumption than EIHI.
Due to the utilization of residue maps and reinduction counters, the SUM algorithm efficiently retains valuable information from previous scans. This helps avoid redundant computations during incremental updates, as it does not need to recompute the mining process for unchanged data. In contrast, the EIHI algorithm needs to reprocess the entire database, leading to increased computation time and inefficiencies.
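The incremental idea described above can be sketched minimally as follows. The class name and its internals are hypothetical placeholders, not the paper's actual master-map structure: the point is only that an update folds the new batch into state retained from earlier scans, rather than re-mining the whole database as a full-reprocessing approach would.

```python
# Hedged sketch of incremental accumulation (names hypothetical).
# State from previous scans is retained, so an update touches only
# the newly inserted transactions.

class MasterMap:
    def __init__(self):
        self.utility = {}   # item -> utility accumulated over all scans
        self.scanned = 0    # number of transactions processed so far

    def update(self, new_batch):
        # Scan only the new transactions; earlier results are reused.
        for tx in new_batch:
            for item, util in tx:
                self.utility[item] = self.utility.get(item, 0) + util
            self.scanned += 1

    def high_utility_items(self, min_util):
        return {i for i, u in self.utility.items() if u >= min_util}

mm = MasterMap()
mm.update([[("a", 5), ("b", 3)]])      # initial database: 1 transaction
mm.update([[("a", 4)], [("c", 9)]])    # incremental insert: 2 transactions
print(mm.high_utility_items(8))        # {'a', 'c'}
```

Each update here costs time proportional to the size of the new batch alone, which is the essential advantage over reprocessing the entire database on every insertion.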
6.4 Comparative Performance of SUM with Sliding Window Based Algorithms
As shown in Figs. 23 and 24, SUM outperforms naive-FHMDS and FHMDS [3], which utilize inverted lists and a sliding window model to mine HUIs from incremental databases with dynamic threshold setting, in terms of both execution time and memory usage.
Both naive-FHMDS and FHMDS define a window as a batch of transactions used to form an inverted-list-like structure, and the transaction-weighted utilization of the items is stored in memory for each batch. As the number of batches in the database grows, this method therefore requires an increasing amount of memory. Moreover, the FHMDS algorithm requires computation for each window, leading to increased computation time and inefficiencies.
6.5 Scalability of SUM with Varying Window Size
The performance of the SUM algorithm is evaluated on each dataset by setting the window size for the reinduction counters to four different values: 10, 50, 100, and 150. The number of items with positive reinduction counters is plotted in Fig. 25. As shown in the figure, for almost all datasets, this number increases four-fold as the window size is increased. Consequently, the size of the master map, which is formed using only items with positive reinduction counters, also grows proportionately.
Despite the increase in the size of the master map, the time consumption and memory requirements remain consistent with varying window sizes, as seen in Figs. 26 and 27, respectively.
This consistent performance across window sizes is attributed to the use of residue maps for mining High Utility Itemsets (HUIs) and of reinduction counters to retain only meaningful itemsets while scanning the database. The residue map, which serves as the fundamental storage unit in the SUM algorithm, is used to construct the master map, which is central to the mining process.
As the window size increases, a higher number of items hold positive reinduction counters for longer durations. Consequently, relevant residue maps remain valid and are retained for constructing the master map. As a result, the SUM algorithm maintains similar performance with various window sizes, showcasing its highly scalable nature. This scalability demonstrates a significant advantage of the SUM algorithm when dealing with different datasets and varying window size requirements.
7 Conclusion and Future Work
In this paper, we have proposed a reinduction-based approach for mining high utility itemsets from incremental databases. We introduced the concept of reinduction counters and discussed a threshold raising strategy to improve the mining process. Our proposed algorithm outperformed existing algorithms in terms of execution time and memory requirements. Additionally, we tested for scalability by varying the window size.
The proposed SUM algorithm addresses several drawbacks associated with existing high utility itemset mining algorithms for incremental databases. Firstly, traditional algorithms suffer from high memory requirements due to the storage of intermediate results and large data structures like tries, which can become infeasible as the size of the database increases. The SUM algorithm overcomes this by using the residue map, which requires significantly less memory than other data structures. Secondly, existing algorithms struggle to maintain consistency and accuracy when dealing with dynamic databases with varying thresholds. The SUM algorithm addresses this by using a reinduction strategy based on a sliding window model that adjusts the utility threshold value based on previous scans. Lastly, traditional algorithms are often not scalable enough to handle large datasets with dynamic thresholds. The SUM algorithm improves scalability by using the residue map as an intermediate structure for storing utility information from previous scans, making it more efficient and effective for mining high utility itemsets from incremental databases.
Some possible future directions for the SUM algorithm include its extension to other types of high utility mining problems. The SUM algorithm uses a residue map-based approach for mining high utility itemsets from incremental databases; similar data structures and strategies could be explored for other variations of the high utility mining problem, such as mining with negative weights or temporal mining. Additionally, the SUM algorithm could be integrated with other data mining techniques, such as classification or clustering, to provide a more comprehensive analysis of the data. As the size of the data increases, parallelization of the mining process becomes necessary to ensure efficient execution, so parallel versions of the SUM algorithm could be developed to exploit multi-core or distributed architectures. Finally, the high utility itemset mining problem has several applications in domains such as retail, healthcare, and finance, and the SUM algorithm could be adapted and applied to these domains to address their specific challenges and requirements.
Data Availability
The authors declare that the data used for the experiments in this research is openly available on the Internet in the SPMF repository (https://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php).
References
Ahmed CF, Tanbeer SK, Jeong B-S, Lee Y-K (2009) Efficient tree structures for high utility pattern mining in incremental databases. IEEE Trans Knowl Data Eng 21(12):1708–1721
Ahmed U, Chun-Wei Lin J, Srivastava G, Yasin R, Djenouri Y (2020) An evolutionary model to mine high expected utility patterns from uncertain databases. IEEE Trans Emerg Top Comput Intell 5(1):19–28
Dawar S, Sharma V, Goyal V (2017) Mining top-k high-utility itemsets from a data stream under sliding window model. Appl Intell 47(4):1240–1255
Duong Q-H, Fournier-Viger P, Ramampiaro H, Nørvåg K, Dam T-L (2018) Efficient high utility itemset mining using buffered utility-lists. Appl Intell 48:1859–1877
Fang W, Zhang Q, Sun J, Wu X-J (2020) Mining high quality patterns using multi-objective evolutionary algorithm. IEEE Trans Knowl Data Eng
Fournier-Viger P, Chun-Wei Lin J, Gueniche T, Barhate P (2015) Efficient incremental high utility itemset mining. In: Proceedings of the ASE BigData & SocialInformatics 2015, pp 1–6
Fournier-Viger P, Wu C-W, Zida S, Tseng VS (2014) FHM: faster high-utility itemset mining using estimated utility co-occurrence pruning. In: International symposium on methodologies for intelligent systems. Springer, pp 83–92
Hong T-P, Wang C-Y, Tao Y-H (2001) A new incremental data mining algorithm using pre-large itemsets. Intell Data Anal 5(2):111–129
Jianying H, Mojsilovic A (2007) High-utility pattern mining: a method for discovery of high-utility item sets. Pattern Recognit 40(11):3317–3324
Krishnamoorthy S (2015) Pruning strategies for mining high utility itemsets. Expert Syst Appl 42(5):2371–2381
Lin C-W, Hong T-P, Lan G-C, Wong J-W, Lin W-Y (2014) Incrementally mining high utility patterns based on pre-large concept. Appl Intell 40(2):343–357
Lin C-W, Hong T-P, Wen-Hsiang L (2011) An effective tree structure for mining high utility itemsets. Expert Syst Appl 38(6):7419–7424
Lin C-W, Lan G-C, Hong T-P (2012) An incremental mining algorithm for high utility itemsets. Expert Syst Appl 39(8):7173–7180
Liu M, Qu J (2012) Mining high utility itemsets without candidate generation. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, pp 55–64
Liu Y, Liao W-k, Choudhary A (2005) A fast high utility itemsets mining algorithm. In: Proceedings of the 1st international workshop on Utility-based data mining. ACM, pp 90–99
Liu Y, Liao W-k, Choudhary A (2005) A two-phase algorithm for fast discovery of high utility itemsets. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 689–695
Qu J-F, Liu M, Fournier-Viger P (2019) Efficient algorithms for high utility itemset mining without candidate generation. In: High-utility pattern mining: theory, algorithms and applications, pp 131–160
Sra P, Chand S (2023) A residual utility-based concept for high-utility itemset mining. Knowl Inf Syst, pp 1–25
Tseng VS, Wu C-W, Shie B-E, Yu PS (2010) UP-Growth: an efficient algorithm for high utility itemset mining. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 253–262
Vu HQ, Li G, Law R (2020) Discovering highly profitable travel patterns by high-utility pattern mining. Tour Manag 77:104008
Yin Q, Wang J, Sheng D, Leng J, Li J, Hong Y, Zhang F, Chai Y, Zhang X, Zhao X et al (2022) An adaptive elastic multi-model big data analysis and information extraction system. Data Sci Eng 7(4):328–338
Yun U, Ryang H, Lee G, Fujita H (2017) An efficient algorithm for mining high utility patterns from incremental databases with one database scan. Knowl-Based Syst 124:188–206
Zida S, Fournier-Viger P, Chun-Wei Lin J, Wu C-W, Tseng VS (2015) EFIM: a highly efficient algorithm for high-utility itemset mining. In: Mexican international conference on artificial intelligence. Springer, pp 530–546
Funding
The authors declare that no funding was received for the research associated with this study.
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Sra, P., Chand, S. A Reinduction-Based Approach for Efficient High Utility Itemset Mining from Incremental Datasets. Data Sci. Eng. 9, 73–87 (2024). https://doi.org/10.1007/s41019-023-00229-4