A general-purpose distributed pattern mining system

This paper explores five pattern mining problems and proposes a new distributed framework called DT-DPM: Decomposition Transaction for Distributed Pattern Mining. DT-DPM addresses the limitations of existing pattern mining algorithms by reducing the enumeration search space. Thus, it derives the relevant patterns by studying the correlations among the transactions. It first decomposes the set of transactions into several clusters of different sizes, and then explores heterogeneous architectures, including MapReduce, single CPU, and multi-CPU, based on the density of each subset of transactions. To evaluate the DT-DPM framework, extensive experiments were carried out on five pattern mining problems (FIM: Frequent Itemset Mining, WIM: Weighted Itemset Mining, UIM: Uncertain Itemset Mining, HUIM: High Utility Itemset Mining, and SPM: Sequential Pattern Mining). Experimental results reveal that by using DT-DPM, the scalability of the pattern mining algorithms is improved on large databases. Results also reveal that DT-DPM outperforms the baseline parallel pattern mining algorithms on big databases.


Introduction
Pattern mining is a data mining task that aims at studying the correlations within data and discovering relevant patterns from large databases. In practice, different database representations can be observed (from Boolean databases to sequence databases). The problem of pattern mining is to find an efficient approach to extract the relevant patterns from a database. It is used in many applications and domains such as ontology matching [1], process mining [2], decision making [3], and constraint programming [4]. Pattern mining also arises in "Big data" applications such as extracting frequent genes from DNA in bioinformatics [5], extracting relevant hashtags from Twitter streams in social network analysis [6], and analyzing sensor data from IoT devices in smart city applications [7]. This work mainly focuses on mining information from big transactional databases.

Motivation
Solutions to pattern mining problems [8][9][10][11][12] are highly time-consuming when dealing with large and very large databases for problems such as FIM and WIM, and they are totally inefficient when solving more complex problems such as UIM, HUIM, and SPM. To improve the runtime performance of pattern mining approaches, many optimization and high-performance computing techniques have been proposed [13][14][15][16][17][18]. However, these strategies are inefficient when dealing with big databases, where only a small number of relevant patterns are useful and displayed to the end user. We contend that these algorithms are inefficient because they consider the whole database in the mining process. In our previous work [19], we proposed a new pattern mining algorithm whose aim is to study the correlation between the input data in order to split the whole problem into many smaller sub-problems that are as independent as possible. We used a k-means algorithm to assign the transactions to different clusters, and developed an efficient strategy to accurately explore the clusters of transactions. This approach gives good results compared to the baseline serial methods. However, it still suffers from runtime and accuracy issues when dealing with big databases. This is due to the separator items between clusters, whose mining requires exploring the transactions of all clusters, which degrades the overall performance of the approach. Motivated by the preliminary results reported in [19], we propose a new parallel framework that addresses the following issues: i) minimizing the number of separator items, and ii) improving the runtime and accuracy on big databases.

Contributions
In this research work, we propose a generic intelligent pattern mining algorithm for solving pattern mining problems on big databases. It is a comprehensive extension of our previous work [19]. With this in mind, the main contributions of this work are as follows: 1. Propose a new framework called DT-DPM for improving pattern mining algorithms in a distributed environment. 2. Develop a decomposition approach to cluster the transaction set into smaller similar groups. 3. Extend the MapReduce computing framework to deal with pattern mining algorithms by exploiting the different dependencies between the transactions of the clusters. 4. Analyze five case studies (FIM, WIM, UIM, HUIM, and SPM) on well-known pattern mining databases by considering the five best pattern mining algorithms in terms of time complexity as baseline algorithms for the DT-DPM framework. Experimental results reveal that by using DT-DPM, the scalability of the pattern mining algorithms is improved on large databases, and that DT-DPM outperforms the baseline parallel pattern mining algorithms on big databases.

Outline
The remainder of the paper is organized as follows: Section 2 introduces the basic concepts of pattern mining problems. Section 3 reviews existing pattern mining algorithms followed by a detailed explanation of our DT-DPM framework in Section 4. The performance evaluation is presented in Section 5 whereas Section 6 draws the conclusions.

Pattern mining problems
In this section, we first present a general formulation of pattern mining and then we present a few pattern mining problems according to the general formulation.
Definition 1 (pattern) Let I = {1, 2, …, n} be a set of items, where n is the number of items, and T = {t_1, t_2, …, t_m} be a set of transactions, where m is the number of transactions. We define a function σ, where for the item i in the transaction t_j, the corresponding pattern reads p = σ(i, j).
Definition 2 (pattern mining) A pattern mining problem finds the set of all relevant patterns L, such that

L = {p | Interestingness(T, I, p) ≥ γ},

where Interestingness(T, I, p) is the measure used to evaluate a pattern p among the set of transactions T and the set of items I, and γ is the mining threshold. From these two definitions, we present the existing pattern mining problems.

Definition 3 (Boolean database)
We define a Boolean database by setting the function σ (see Def. 1) as

σ(i, j) = 1 if i ∈ t_j, and 0 otherwise.

Definition 4 (frequent itemset mining (FIM)) We define a FIM problem as an extension of the pattern mining problem (see Def. 2) by

Interestingness(T, I, p) = Support(T, I, p), with Support(T, I, p) = |p|_{T,I} / |T|,

where T is the set of transactions in a Boolean database defined by Def. 3, γ is a minimum support threshold, and |p|_{T,I} is the number of transactions in T containing the pattern p.
Definition 5 (weighted database) We define a weighted database by setting the function σ (see Def. 1) as

σ(i, j) = w_ij if i ∈ t_j, and 0 otherwise.

Note that w_ij is the weight of the item i in the transaction t_j.
Definition 6 (weighted itemset mining (WIM)) We define a WIM problem as an extension of the pattern mining problem (see Def. 2) by

Interestingness(T, I, p) = WS(T, I, p), with WS(T, I, p) = Σ_{j=1}^{|T|} W(t_j, I, p),

where T is the set of transactions in the weighted database defined by Def. 5, W(t_j, I, p) is the minimum weight of the items of the pattern p in the transaction t_j, and γ is a minimum weighted support threshold.
Definition 7 (uncertain database) We define an uncertain database by setting the function σ (see Def. 1) as

σ(i, j) = Prob_ij if i ∈ t_j, and 0 otherwise.

Note that Prob_ij is the uncertainty value of i in the transaction t_j.
Definition 8 (uncertain itemset mining (UIM)) We define a UIM problem as an extension of the pattern mining problem (see Def. 2) by

Interestingness(T, I, p) = US(T, I, p), with US(T, I, p) = Σ_{j=1}^{|T|} Π_{i∈p} Prob_ij,

where T is the set of transactions in the uncertain database defined by Def. 7 and γ is the minimum uncertain support threshold.
Definition 9 (utility database) We define a utility database by setting the function σ (see Def. 1) as

σ(i, j) = iu_ij if i ∈ t_j, and 0 otherwise.

Note that iu_ij is the internal utility value of i in the transaction t_j; we also define the external utility of each item i by eu(i).
Definition 10 (high utility itemset mining (HUIM)) We define a HUIM problem as an extension of the pattern mining problem (see Def. 2) by

Interestingness(T, I, p) = U(T, I, p), with U(T, I, p) = Σ_{j=1}^{|T|} Σ_{i∈p} iu_ij × eu(i),

where T is the set of transactions in the utility database defined by Def. 9 and γ is the minimum utility threshold.
Definition 11 (sequence database) We assume a total order ≺ on items, such as 1 ≺ 2 ≺ 3 ≺ … ≺ n. A sequence is an ordered list of itemsets s = {I_1, I_2, …, I_|s|}. Each itemset I_i is defined by setting the function σ (see Def. 1) as σ(i, j) = i, if i ∈ t_j.

Definition 12 (sequential pattern mining (SPM)) We define a SPM problem as an extension of the pattern mining problem (see Def. 2) by

Interestingness(T, I, p) = Support(T, I, p),

where the support is the number of sequences of T containing the pattern p as a subsequence, T is the set of transactions in the sequence database defined by Def. 11, and γ is the minimum support threshold.
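As an illustration of Definitions 4, 6, 8, and 10, the four itemset interestingness measures can be sketched as follows. This is a minimal Python sketch of ours, not the authors' implementation; the database encodings (sets for Boolean transactions, item-to-value dictionaries for the weighted, uncertain, and utility databases) and the function names are illustrative assumptions.

```python
def support(transactions, p):
    """FIM (Def. 4): fraction of transactions containing every item of p."""
    return sum(p <= t for t in transactions) / len(transactions)

def weighted_support(weighted_db, p):
    """WIM (Def. 6): sum over transactions of the minimum weight of p's items."""
    total = 0.0
    for t in weighted_db:                      # t maps item -> weight w_ij
        if p <= set(t):
            total += min(t[i] for i in p)
    return total

def expected_support(uncertain_db, p):
    """UIM (Def. 8): sum over transactions of the product of item probabilities."""
    total = 0.0
    for t in uncertain_db:                     # t maps item -> probability Prob_ij
        if p <= set(t):
            prod = 1.0
            for i in p:
                prod *= t[i]
            total += prod
    return total

def utility(utility_db, eu, p):
    """HUIM (Def. 10): sum of internal * external utilities of p's items."""
    total = 0
    for t in utility_db:                       # t maps item -> internal utility iu_ij
        if p <= set(t):
            total += sum(t[i] * eu[i] for i in p)
    return total
```

A pattern p is relevant when the corresponding measure reaches the mining threshold γ.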

Related work
Pattern mining has been largely studied in the last three decades [8-11, 20, 21]. There are many variants of the pattern mining problem, such as FIM, WIM, HUIM, UIM, and SPM.
FIM It aims at extracting all frequent itemsets that exceed the minimum support threshold. Apriori [22] and FP-Growth [23] are the most popular algorithms. Apriori applies a generate-and-test strategy to explore the itemset space. The candidate itemsets are generated incrementally and recursively: to generate k-sized candidate itemsets, the algorithm combines the frequent (k-1)-sized itemsets. This process is repeated until no candidate itemsets are obtained in an iteration. In contrast, FP-Growth adopts a divide-and-conquer strategy and compresses the transactional database in volatile memory using an efficient tree structure. It then applies the mining process recursively to find the frequent itemsets. The main limitation of the traditional FIM algorithms is the database format, where only binary items can be mined. A typical application of this problem is market basket analysis, in which a given item (product) may be present or absent in a given transaction (customer basket).
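The Apriori join step described above (combining two frequent (k-1)-itemsets that share a (k-2)-prefix, then pruning candidates with an infrequent subset) can be sketched as follows. This is our illustrative Python sketch, not the original implementation; itemsets are represented as sorted tuples, an assumption of ours.

```python
def apriori_gen(frequent_k_minus_1):
    """Generate k-sized candidates from (k-1)-sized frequent itemsets
    (sorted tuples), as in the Apriori join-and-prune step."""
    frequent = sorted(frequent_k_minus_1)
    candidates = []
    for a_idx in range(len(frequent)):
        for b_idx in range(a_idx + 1, len(frequent)):
            a, b = frequent[a_idx], frequent[b_idx]
            # join only if the two itemsets share the common (k-2)-prefix
            if a[:-1] == b[:-1]:
                candidate = a + (b[-1],)
                # prune: every (k-1)-subset must itself be frequent
                if all(candidate[:i] + candidate[i + 1:] in frequent_k_minus_1
                       for i in range(len(candidate))):
                    candidates.append(candidate)
    return candidates
```

For example, from the frequent 2-itemsets {(1,2), (1,3), (2,3), (2,4)}, the join produces (1,2,3) and (2,3,4), and the prune step discards (2,3,4) because its subset (3,4) is not frequent.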
WIM To address the FIM limitation, WIM was introduced, where weights are associated with each item to indicate its relative importance in the given transaction [24]. The goal is to extract itemsets exceeding a minimum weight threshold. The first WIM algorithm is called WFIM: Weighted Frequent Itemset Mining [25]. It introduces a weight range and a minimum weight constraint into the FP-Growth algorithm. Both weight and support measures are considered to prune the search space. Yun [26] proposed WIP: Weighted Interesting Pattern. It introduces an affinity measure that determines the correlation between the items of the same pattern. The integration of WIM in both Apriori and FP-Growth is studied in [27]. The results showed that FP-Growth outperforms Apriori for mining weighted patterns. Le et al. [28] proposed a frequent subgraph algorithm on a weighted large graph. A novel strategy is developed to compute the weight of all candidate subgraphs, and an efficient pruning strategy reduces both the processing time and the memory usage. Lee et al. [29] mine the frequent weighted itemsets by employing a novel type of prefix tree structure. This allows retrieving the relevant patterns more accurately without storing the list of identifiers of the different transactions.
UIM An extension of WIM, called UIM, explores uncertain transactional databases, where two models (expected support and probabilistic itemsets) are defined to mine uncertain patterns. Li et al. [30] proposed the PFIMoS: Probabilistic Frequent Itemset Mining over Streams algorithm. It derives the probabilistic frequent itemsets in an incremental way by determining the upper and lower bounds of the mining threshold. Lee et al. [31] introduced the U-WFI: Uncertain mining of Weighted Frequent Itemsets algorithm. It discovers, from a given uncertain database, relevant uncertain frequent itemsets with high probability values by focusing on item weights. Liaqat et al. [32] show the use of uncertain frequent patterns in the image retrieval process, incorporating fuzzy ontology and uncertain frequent pattern mining to find the relevant images for a user query. Lee et al. [33] suggest novel data structures to guarantee the correctness of the mining outputs without any false positives, which allows retrieving a complete set of uncertain relevant patterns in a reasonable amount of time.
HUIM High Utility Itemset Mining is an extension of WIM where both internal and external utilities of the items are involved. The aim is to find all high utility patterns from a transactional database that exceed the minimum utility threshold. The utility of a pattern is the sum of the utilities of all its items, where the utility of an item is defined as the product of its internal and external utility values. Chan et al. [34] proposed the first HUIM algorithm. It applies an Apriori-based algorithm to discover the top-k high utility patterns. This algorithm suffers from poor runtime performance because the search space is not well pruned by the downward closure property, as the utility measure is neither monotone nor anti-monotone. To address this limitation, the TWU: Transaction Weighted Utility property was defined to prune the high utility pattern space [35,36]. It is a monotone upper-bound measure used to reduce the search space. More efficient HUIM algorithms based on TWU have been recently proposed, such as EFIM: EFficient high-utility Itemset Mining [37] and d2HUP: Direct Discovery for High Utility Patterns [38]. The particularity of these approaches is that they use more efficient data structures to determine the TWU and the utility values. Singh et al. [39] address the problem of tuning the minimum utility threshold and derive the top-k high utility patterns. Their algorithm uses transaction merging and data projection techniques to reduce the data scanning cost, and develops an intelligent strategy designed for top-k high utility patterns to prune the enumeration search tree. Gan et al. [40] proposed a correlated high utility pattern miner algorithm. It considers the positive correlation, the profitable value concepts, and several strategies to prune the search space. Lee et al. [41] developed an efficient incremental approach for identifying high utility patterns. It adopts an accurate data structure to mine high utility patterns in an incremental way.
SPM Sequential Pattern Mining is an extension of FIM to discover a set of ordered patterns in a sequence database [42][43][44]. Salvemini et al. [42] find the complete set of sequential patterns, reducing the candidate generation runtime by employing an efficient lexicographic tree structure. Fumarola et al. [43] discover closed sequential patterns using two main steps: i) finding the closed sequence patterns of size 1, and ii) generating new sequences from the sequence patterns of size 1 deduced in the first step. Van et al. [44] introduced a pattern-growth algorithm for the sequential pattern mining problem with itemset constraints. It proposes an incremental strategy to prune the enumeration search tree, which reduces the number of visited nodes. Aisal et al. [45] proposed a novel convoy pattern mining approach which can operate on a variety of operational data stores. It suggests a new heuristic to prune the objects which have no chance of forming a convoy. Wu et al. [46] solved the contrast sequential pattern mining problem, an extension of SPM that discovers all relevant patterns appearing in one sequence dataset and not in the others. These patterns are widely used in specialized applications such as analysing anomalous customers in business intelligence or medical diagnosis in smart healthcare [47][48][49].
High performance computing Regarding high performance computing, many algorithms have been developed for boosting FIM performance [15,[50][51][52][53][54]. However, few algorithms have been proposed for the other pattern mining problems [16][17][18],[55]. In [52], some challenges in big data analytics are discussed, such as mining evolving data streams and the need to handle many exabytes of data across various application areas such as social network analysis. The BigFIM: Big Frequent Itemset Mining [56] algorithm combines principles from both Apriori and Eclat and is implemented using the MapReduce paradigm: the mappers are computed using the Eclat algorithm, whereas the reducers are computed using the Apriori algorithm. [57] develops two strategies for parallelizing both candidate itemset generation and support counting on a GPU (Graphics Processing Unit). In the candidate generation, each thread is assigned two frequent (k-1)-sized itemsets; it compares them to make sure that they share a common (k-2)-prefix and then generates a k-sized candidate itemset. In the evaluation, each thread is assigned one candidate itemset and counts its support by scanning the transactions simultaneously. The evaluation of frequent itemsets is improved in [58] by proposing mapping and sum-reduction techniques to merge all counts of the given itemsets. It is also improved in [59] by developing three strategies for minimizing the impact of GPU thread divergence. In [60], a multilevel layer data structure is proposed to enhance the support counting of the frequent itemsets. It divides vertical data into several layers, where each layer is an index table of the next layer. This strategy can completely represent the original vertical structure. In a vertical structure, each item corresponds to a fixed-length binary vector.
However, in this strategy, the length of each vector varies, depending on the number of transactions containing the corresponding item. A Hadoop implementation based on the MapReduce programming approach, called FiDoop: Frequent itemset based on Decomposition, is proposed in [61] for the frequent itemset mining problem. It incorporates the concept of the FIU-tree (Frequent Itemset Ultrametric tree) rather than the traditional FP-tree of the FP-Growth algorithm, for the purpose of improving the storage of the candidate itemsets. An improved version called FiDoop-DP is proposed in [15]. It develops an efficient strategy to partition data sets among the mappers, which allows better exploitation of the cluster hardware architecture by avoiding job redundancy. Andrzejewski et al. [62] introduce the concept of incremental co-location patterns, i.e., updating the set of knowledge about the spatial features after inserting new spatial data into the original data. The authors develop a new parallel algorithm which combines an effective update strategy with multi-GPU co-location pattern mining [63] by designing an efficient enumeration tree on the GPU. Since the proposed approach is memory-aware, i.e., the data is divided into several packages to fit the GPU memories, it only achieves a speedup of six. Jiang et al. [64] adopt a parallel FP-Growth for mining world ocean atlas data. The whole data is partitioned among multiple CPU threads, where each thread explores 300,000 data points and derives correlations and regularities of oxygen, temperature, phosphate, nitrate, and silicate in the ocean. The experimental results reveal that the suggested adaptation only reaches a speedup of 1.2. Vanhalli et al. [65] developed a parallel row-enumeration algorithm for mining frequent colossal closed patterns from high-dimensional data. It first prunes the whole data by removing irrelevant items and transactions using a rowset cardinality table, which determines the closeness of each subset of transactions.
In addition, it uses a hybrid parallel bottom-up bitset-based approach to enumerate the colossal frequent closed patterns. This approach is fast; however, it suffers from an accuracy issue, as it may ignore some relevant patterns due to the preprocessing phase. It also requires an additional parameter to be fixed, represented by a cardinality threshold. Yu et al. [66] propose a parallel version of PrefixSpan on Spark: PrefixSpan-S. It optimizes the overhead by first loading the data from the Hadoop distributed file system into RDDs: Resilient Distributed Datasets, then reading the data from the RDDs, and saving the potential results back into the RDDs. This approach reaches good performance with a wise choice of the minimum support threshold. However, it is very sensitive to the data distribution.
Kuang et al. [67] proposed a parallel implementation of the FP-Growth algorithm in Hadoop that removes the data redundancy between the different data partitions, which allows handling the transactions in a single pass. Sumalatha et al. [68] introduce the concept of distributed temporal high utility sequential patterns, and propose an intelligent strategy based on a time-interval utility data structure for evaluating the candidate patterns. The authors also define two utility upper bounds, the remaining utility and the co-occurrence utility, to prune the search space.
To improve the runtime performance of pattern mining approaches, several strategies have been proposed using metaheuristics, specifically exploiting evolutionary and/or swarm intelligence approaches [13,14,69,70]. However, these optimizations are inefficient when dealing with large and big transactional databases, where only a small number of interesting patterns are discovered. To deal with this challenging issue, the next section presents a new framework which investigates both decomposition techniques and distributed computing for solving pattern mining problems.

DT-DPM: decomposition transaction for distributed pattern mining
This section presents the DT-DPM (Decomposition Transaction for Distributed Pattern Mining) framework, which integrates the DBSCAN: Density-Based Spatial Clustering of Applications with Noise algorithm and distributed computing, represented by MapReduce, multi-core CPUs, and a single CPU, for solving pattern mining problems. As seen in Fig. 1, the DT-DPM framework uses heterogeneous distributed computing and decomposition techniques for solving pattern mining problems. A detailed step-by-step explanation of the DT-DPM framework is given in the following.

DBSCAN
The aim of this step is to divide the database into a collection of homogeneous groups using decomposition techniques, where each group contains highly correlated entries, i.e., the database entries of each group share a maximum number of items compared to the entries of the other groups.
Definition 13 A database D is decomposed into several groups G = {G_i}, where each group G_i is a subset of rows of D such that G_i ∩ G_j = ∅ for i ≠ j. We define I(G_i), the set of items of the group G_i, as the union of the items appearing in the transactions of G_i.

Proposition 1 Suppose that the groups in G do not share any items, that is, I(G_i) ∩ I(G_j) = ∅ for all i ≠ j. Then we have

L = ⋃_i P_i,

where P_i is the set of the relevant patterns of the group G_i.
From the above proposition, one may argue that if the whole set of transactions in the database is decomposed in such a way, independent groups are derived. This means that no group of transactions shares items with any other group, and therefore the groups can be solved separately. Unfortunately, such a case is difficult to realize, as many dependencies may be observed between rows. The aim of the decomposition step is therefore to minimize the items shared between the different clusters; these shared items are called separator items. More formally, this decomposition generates a labeled, weighted, non-directed graph noted G = <C, S>, where C is the set of nodes formed by the clusters and S is the set of separator items. Each element s_ij in S contains two components: i) s_ij^l, the label of the element s_ij, represented by the set of items shared by the clusters C_i and C_j, and ii) s_ij^w, the weight of the element s_ij, represented by the number of items shared by the clusters C_i and C_j. As a result, k disjoint partitions P = {P_1, P_2, …, P_k} are obtained, where P_i ∩ P_j = ∅, ∀(i, j) ∈ [1..k]^2, and ⋃_{i=1}^{k} P_i = T. The partitions are constructed by minimizing the total weight of the separator items, Σ_{i<j} s_ij^w. Solving this problem with an exact solver requires high computational time. One way to address this issue is to use clustering algorithms [71]. The adaptation of the DBSCAN [72] clustering algorithm has been investigated in this research. Before presenting it, some definitions are given as follows.
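The separator-item graph of Definition 13 can be sketched as follows. This is an illustrative Python sketch of ours (not the authors' code); the representation of clusters as lists of transaction sets and the function name are assumptions.

```python
from itertools import combinations

def separator_graph(clusters):
    """Build the separator-item edges between clusters.
    clusters: list of clusters, each a list of transactions (sets of items).
    Returns {(i, j): (label s_ij^l, weight s_ij^w)} for item-sharing pairs."""
    item_sets = [set().union(*cluster) for cluster in clusters]  # I(G_i)
    edges = {}
    for i, j in combinations(range(len(clusters)), 2):
        shared = item_sets[i] & item_sets[j]       # label: shared items
        if shared:
            edges[(i, j)] = (shared, len(shared))  # (s_ij^l, s_ij^w)
    return edges
```

A good decomposition is one whose total edge weight (the sum of the s_ij^w values) is as small as possible.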

Definition 14 (transaction-based similarity)
We define the transaction-based similarity by an adapted Jaccard similarity [73] as

sim(D_i, D_j) = |D_i ∩ D_j| / |D_i ∪ D_j|.

Note that x_i is the value of the variable x in the row data D_i.
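A minimal sketch of this similarity, under our assumption that each row of data is represented as a set of items:

```python
def transaction_similarity(t_i, t_j):
    """Jaccard similarity between two transactions (sets of items)."""
    if not t_i and not t_j:
        return 1.0  # two empty transactions are considered identical
    return len(t_i & t_j) / len(t_i | t_j)
```

For example, transactions {1, 2, 3} and {2, 3, 4} share two of four distinct items, giving a similarity of 0.5.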

Definition 15 (centroids)
We define the centroid of the group of row data G_i, noted Ḡ_i, as the row formed, for each variable x_l, by max_{x_l}(x_i^l), the most frequent value of the variable x_l in the group G_i (see Fig. 1 for the overall DT-DPM framework).

Definition 16 (neighborhood)
We define the neighborhood of a row data D_i for a given threshold ϵ, noted N_{D_i}, as the set of rows whose similarity to D_i is at least ϵ.

Definition 17 (core data) A row data D_i is called core data if its neighborhood contains at least a minimum number σ_D of rows, i.e., |N_{D_i}| ≥ σ_D.

Algorithm 1 presents the pseudo-code of the decomposition of the row data. The process starts by checking the ϵ-neighborhood of each transaction. The core transactions are determined, and then density-reachable transactions are iteratively collected from these core transactions, which may involve merging a few density-reachable clusters. The process terminates when no new transaction can be added to any cluster. The output of the decomposition step is a labeled, weighted, non-directed graph noted O = <G, S>, where G is the set of nodes formed by the groups of row data and S is the set of separator items. Each element s_ij in S contains two components: i) s_ij^l, the label of the element s_ij, represented by the set of items shared by the groups G_i and G_j, and ii) s_ij^w, the weight of the element s_ij, represented by the number of items shared by the groups G_i and G_j.
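The decomposition step can be sketched as a DBSCAN-style clustering of transactions under the Jaccard similarity. This is our simplified Python sketch of the process described above, not the authors' Algorithm 1; the parameter names (eps, min_pts) and the quadratic neighborhood scan are simplifications of ours.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def decompose(transactions, eps, min_pts):
    """Cluster transactions (sets of items) DBSCAN-style.
    Returns (clusters, noise) as lists of transaction indices."""
    n = len(transactions)
    # eps-neighborhood of each transaction (includes the transaction itself)
    neighborhoods = [
        [j for j in range(n) if jaccard(transactions[i], transactions[j]) >= eps]
        for i in range(n)
    ]
    labels = [None] * n          # None = unvisited, -1 = noise
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighborhoods[i]) < min_pts:
            labels[i] = -1       # noise, later handled on a single CPU
            continue
        labels[i] = cluster_id
        frontier = list(neighborhoods[i])
        while frontier:          # collect density-reachable transactions
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # border point claimed by this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighborhoods[j]) >= min_pts:
                frontier.extend(neighborhoods[j])
        cluster_id += 1
    clusters = [[i for i in range(n) if labels[i] == c] for c in range(cluster_id)]
    noise = [i for i in range(n) if labels[i] == -1]
    return clusters, noise
```

On a toy database with two dense groups of transactions and one isolated transaction, the sketch yields two clusters and one noise row, matching the behavior expected from Algorithm 1.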

Mining process
After applying the DBSCAN algorithm to the whole data, the next step is to solve each resulting cluster independently using heterogeneous machines such as supercomputers, multi-core CPUs, and a single CPU core. The question that should be answered now is: which cluster should be assigned to which machine? We know that supercomputers are more powerful than CPU cores and a single CPU. Obviously, clusters with a high workload are assigned to the supercomputer, the remaining clusters are assigned to the CPU cores, and the noise data is assigned to a single CPU. This leads to the following proposition.

Proposition 2 Consider a function Cost(A, m, n) that computes the complexity cost of a given pattern mining algorithm A. Given the output of the DBSCAN algorithm O = <G, S>, the mining process by A of the clusters can be performed in parallel on heterogeneous machines according to the density of each cluster G_i, represented by (|G_i|, I(G_i)), and a complexity cost threshold μ_cost, as follows:

1. If Cost(A, |G_i|, |I(G_i)|) ≥ μ_cost, then send G_i to the MapReduce framework.

2. If Cost(A, |G_i|, |I(G_i)|) < μ_cost, then send G_i to the CPU cores.
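The dispatch rule of Proposition 2 can be sketched as follows. This is an illustrative Python sketch of ours; in particular, the cost model Cost(A, m, n) = m × n is an assumption chosen for illustration, not the cost function used by the authors.

```python
def dispatch(clusters, mu_cost):
    """Route each cluster to a compute tier by its density.
    clusters: list of clusters, each a list of transactions (sets of items).
    Uses the assumed cost model Cost = |G_i| * |I(G_i)|."""
    assignment = {}
    for idx, cluster in enumerate(clusters):
        m = len(cluster)                                   # |G_i|
        n = len(set().union(*cluster)) if cluster else 0   # |I(G_i)|
        assignment[idx] = 'mapreduce' if m * n >= mu_cost else 'cpu_cores'
    return assignment
```

Noise transactions are not dispatched here; as described below, they are merged and mined on a single CPU.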
Map and reduce Each mapper M_i first retrieves its assigned cluster G_i. It then processes the transactions of the cluster G_i, applying the mining process to each transaction. Gradually, it creates a set of candidate patterns C_i. When M_i has scanned all transactions of the cluster G_i, it sends C_i to the reducer R_i. The reducer R_i scans the candidate patterns C_i and computes the local support of each pattern belonging to C_i. This allows creating the local hash table LH_i^MR.
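The mapper/reducer pair can be sketched as follows for the FIM case. This is our illustrative Python sketch, not the authors' MapReduce code; enumerating all itemsets of each transaction is a simplification of ours that is only workable for small transactions.

```python
from itertools import combinations
from collections import Counter

def mapper(cluster):
    """Emit every candidate itemset of every transaction in the cluster."""
    for t in cluster:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                yield itemset

def reducer(candidates):
    """Build the local hash table: itemset -> local support count."""
    return Counter(candidates)
```

For instance, the cluster [{1, 2}, {1}] yields the local hash table {(1,): 2, (2,): 1, (1, 2): 1}.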

CPU cores
The CPU cores solve the clusters having a complexity cost less than μ_cost, where each CPU core deals with one cluster of transactions and applies the given pattern mining algorithm A sequentially on the assigned cluster. The result is stored in a local hash table called LH_i^CPU.
Single CPU The noise transactions resulting from DBSCAN are solved separately on a single CPU. The generated noise rows are merged into a small set of transactions, on which the sequential process is applied. Again, the result is stored in a local hash table called LH_noise.

Merging The merging step is then performed to determine the global support of all patterns and extract all relevant patterns into the global hash table GH. This step considers the set of separator items S as well as the clusters in the mining process, which allows discovering all relevant patterns from the whole set of transactions. The possible candidate patterns are first generated from the separator items. For each generated pattern, the aggregation function (see Definition 2) is then used to determine the interestingness of this pattern in the whole transactional database. Note that the interestingness depends on the problem; for instance, for a frequent itemset mining problem, the interestingness function is the support measure. The relevant patterns of the separator items are then concatenated with the local hash tables LH to derive the global relevant patterns of the whole transactional database, noted GH.
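The merging step can be sketched as follows for the support measure. This is an illustrative Python sketch of ours (not the authors' implementation): local hash tables map itemsets to local support counts, global support is their sum divided by the total number of transactions, and the function names are assumptions.

```python
from collections import Counter

def merge(local_tables, total_transactions, min_support):
    """Aggregate local hash tables (LH_MR, LH_CPU, LH_noise, ...) into the
    global hash table GH, keeping patterns whose global support >= min_support."""
    global_counts = Counter()
    for table in local_tables:        # each table: itemset -> local count
        global_counts.update(table)
    return {p: count / total_transactions
            for p, count in global_counts.items()
            if count / total_transactions >= min_support}
```

For example, merging the local tables {(1,): 2, (1, 2): 1} and {(1,): 1} over 4 transactions with a minimum support of 0.5 keeps only the pattern (1,) with global support 0.75.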

Definition 18
Let us define an aggregation function of the pattern p over the clusters of transactions C as the combination of the local interestingness values of p in each cluster; for instance, for the support measure it reads

Support(T, I, p) = (1 / |T|) Σ_{C_i ∈ C} |p|_{C_i, I}.

DT-DPM improves the sequential version of the pattern mining algorithms by exploiting heterogeneous machines while generating and evaluating the candidate patterns. Load balancing is automatically achieved, since the transactions are assigned to the workers by the decomposition step.
Workers (mappers, CPU cores, and the single CPU) can process transactions independently and store the results in the local hash tables. The merging step is finally performed on the local hash tables and extracts the set of all frequent patterns by dealing with the separator items using the aggregation function.

Performance evaluation
Extensive experiments were carried out to evaluate the DT-DPM framework. Five case studies were investigated, considering the FIM, WIM, UIM, HUIM, and SPM problems. DT-DPM is integrated into the SPMF data mining library [74], which offers more than 150 pattern mining algorithms. The DT-DPM Java source code is integrated into the five best pattern mining algorithms in terms of time complexity [8]: i) frequent itemset mining: PrePost+ [75], ii) weighted itemset mining: WFIM [27], iii) uncertain itemset mining: U-Apriori [76], iv) high utility itemset mining: EFIM [37], and v) sequential pattern mining: FAST [42]. All experiments were run on a computer with an Intel Core i7 processor running Windows 10 and 16 GB of RAM.

Dataset description
Experiments were run on well-known large and big pattern mining databases. Table 1 presents the characteristics of the large databases used in the experiments. Moreover, three very large databases have been used to evaluate the pattern mining approaches. In addition, an IBM Synthetic Data Generator for Itemsets and Sequences is used to generate big databases with different numbers of items and transactions.

Decomposition performance
The first experiment aims at evaluating the clustering step. Several tests have been performed on the k-means algorithm to fix the number of clusters k. Figures 2 and 3 present the quality of the obtained clusters, measured by the percentage of separator items, and the runtime performance in seconds, using the twelve databases mentioned above. By varying the number of clusters from 2 to 20, the percentage of separator items is reduced. Thus, when k is set to 2, the percentage of separator items exceeds 50%, while this percentage does not reach 1% when the number of clusters is set to 20 for almost all databases. Moreover, by increasing the number of clusters, the runtime increases linearly, and does not reach 2.50 seconds for any database. As a result, we fix the number of clusters to 20 for the remainder of the experiments. Given these results, we can conclude that the preprocessing step does not influence the overall performance of the DT-DPM framework, in particular for very large and big databases. Figure 4 presents the runtime performance of the pattern mining algorithms with and without the DT-DPM framework for both strategies (approximate and exact), using different databases and different mining thresholds. Experimental results reveal that by reducing the mining threshold and increasing the complexity of the problem solved, the pattern mining algorithms benefit from the DT-DPM framework. Therefore, for a small value of the mining threshold and for more complex problems like UIM, HUIM, or SPM, the approximate-based and exact-based strategies outperform the original pattern mining algorithms. For instance, when the minimum utility threshold is set to 1600 K, the runtime of the original EFIM and of EFIM using the DT-DPM framework is 1 s on the Connect database. However, when setting the minimum utility to 1000 K, the runtime of the original EFIM exceeds 8,000 seconds, whereas the runtime of EFIM with the DT-DPM framework does not reach 1,500 seconds.
These results are obtained thanks to several factors: i) the decomposition method applied in the DT-DPM framework, which minimizes the number of separator items; ii) solving sub-problems with a small number of transactions and a small number of items, instead of dealing with the whole transactional database and all distinct items; and iii) the ability of the pattern mining algorithms to be integrated with the DT-DPM framework. Table 2 presents the speedup of the pattern mining algorithms with and without the DT-DPM framework on the large databases, using different numbers of mappers (2, 4, 8, 16, 32). The results reveal that as the number of mappers grows and the complexity of the problem increases, the speedup of the pattern mining algorithms benefits from the DT-DPM framework. Thus, for a large number of mappers, and for more complex problems such as UIM, HUIM, or SPM, the mining process with DT-DPM outperforms the original one.

Results on big databases
Figure 5 presents the runtime of DT-DPM and the baseline MapReduce-based models on big databases, for both itemset and sequential pattern mining problems. The baseline methods are FiDoop-DP [15] and NG-PFP (Non-Group Parallel Frequent Pattern mining) [67] for itemset mining, and PrefixSpan-S [66] for sequence mining. The results reveal that our model outperforms the baseline MapReduce-based models in terms of computational time for both itemset and sequence mining, regardless of the number of items and the number of transactions. Moreover, when the number of items varies from 10,000 to 1 million, and the number of transactions from 1 million to 10 million, the ratio of runtimes between our model and the baseline models increases. All these results are obtained thanks to i) the k-means algorithm, which minimizes the number of shared items, and ii) the efficient hybrid parallel processing on MapReduce and the multi-CPU cores, which takes into account the information extracted during the decomposition step.
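The hybrid processing mentioned in ii) routes each cluster to an architecture according to its density: noise transactions to a single CPU, micro clusters to CPU cores, and dense clusters to MapReduce. A minimal dispatch sketch follows; the size thresholds and the function name are illustrative assumptions, not values from the paper.

```python
def dispatch(clusters, micro_max=1_000, dense_min=100_000):
    """Route each cluster (by index) to a backend based on its
    transaction count. Thresholds are hypothetical examples."""
    plan = {"single_cpu": [], "multi_cpu": [], "mapreduce": []}
    for idx, cluster in enumerate(clusters):
        n = len(cluster)
        if n < micro_max:        # noise / tiny clusters -> sequential CPU
            plan["single_cpu"].append(idx)
        elif n < dense_min:      # micro clusters -> multi-core processing
            plan["multi_cpu"].append(idx)
        else:                    # dense clusters -> MapReduce
            plan["mapreduce"].append(idx)
    return plan
```

Such a plan lets each backend receive only work at the scale it handles best, avoiding MapReduce job overhead on small clusters while keeping dense clusters off a single machine.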

Conclusion
A novel distributed pattern mining framework called DT-DPM is proposed in this paper. DT-DPM aims to derive relevant patterns from big databases by studying the correlations within the transactional database and by exploring heterogeneous architectures. The set of transactions is first grouped using a clustering approach, where transactions close to each other are assigned to the same cluster. For each cluster of transactions, a pattern mining algorithm is launched to discover the relevant patterns. DT-DPM handles the varying cluster sizes by incorporating heterogeneous computing, including single CPU, CPU cores, and MapReduce: noise transactions are assigned to a single CPU core, micro clusters are assigned to CPU cores, and dense clusters are assigned to the MapReduce architecture. DT-DPM was integrated into the SPMF tool, and five case studies were evaluated (FIM, WIM, UIM, HUIM, and SPM). The results reveal that by using DT-DPM, the scalability of the pattern mining algorithms is improved on large databases. Moreover, DT-DPM outperforms the baseline pattern mining algorithms on big databases. Motivated by these promising results, we plan to boost the performance of DT-DPM and apply it to big data mining applications such as Twitter analysis, smart building applications, and other large-scale applications.
Funding Information Open Access funding provided by NTNU Norwegian University of Science and Technology (incl. St. Olavs Hospital - Trondheim University Hospital).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.