1 Introduction

Pattern mining is a data mining task that aims at studying the correlations within data and discovering relevant patterns from large databases. In practice, different database representations can be observed (from Boolean databases to sequence databases). The problem of pattern mining is to find an efficient approach to extract the relevant patterns from a database. It is used in many applications and domains such as ontology matching [1], process mining [2], decision making [3], and constraint programming [4]. Pattern mining is also applied in "Big data" settings, such as frequent gene extraction from DNA in bioinformatics [5], mining relevant hashtags from Twitter streams in social network analysis [6], and the analysis of sensor data from IoT devices in smart city applications [7]. This work mainly focuses on mining information from big transactional databases.

1.1 Motivation

Solutions to pattern mining problems [8,9,10,11,12] are highly time-consuming when dealing with large and very large databases for problems such as FIM and WIM, and they are totally inefficient when solving more complex problems such as UIM, HUIM, and SPM. To improve the runtime performance of pattern mining approaches, many optimization and high performance computing techniques have been proposed [13,14,15,16,17,18]. However, these strategies are inefficient when dealing with big databases, where only a small number of relevant patterns are useful and displayed to the end user. We argue that these algorithms are inefficient because they consider the whole database in the mining process. In our previous work [19], we proposed a new pattern mining algorithm whose aim is to study the correlation between the input data in order to split the whole problem into several smaller sub-problems that are as independent as possible. We used the k-means algorithm to assign the transactions to different clusters, and we developed an efficient strategy to accurately explore the clusters of transactions. This approach gives good results compared to the baseline serial methods. However, it still suffers from runtime and accuracy limitations when dealing with big databases. This is due to the separator items between clusters, whose mining requires exploring the transactions of all clusters, which degrades the overall performance of the approach. Motivated by the preliminary results reported in [19], we propose a new parallel framework that addresses the following issues: i) minimizing the number of separator items, and ii) improving the runtime and accuracy on big databases.

1.2 Contributions

In this research work, we propose a generic intelligent pattern mining algorithm for dealing with pattern mining problems on big databases. It is a comprehensive extension of our previous work [19]. With this in mind, the main contributions of this work are as follows:

  1. Propose a new framework called DT-DPM for improving pattern mining algorithms in a distributed environment.

  2. Develop a decomposition approach to cluster the transaction set into smaller similar groups.

  3. Extend the MapReduce computing framework to deal with pattern mining algorithms by exploiting the different dependencies between the transactions of the clusters.

  4. Five case studies (FIM, WIM, UIM, HUIM, and SPM) have been analyzed on well-known pattern mining databases by considering the five best pattern mining algorithms in terms of time complexity as baseline algorithms for the DT-DPM framework. Experimental results reveal that by using DT-DPM, the scalability of the pattern mining algorithms was improved on large databases. Results also reveal that DT-DPM outperforms the baseline parallel pattern mining algorithms on big databases.

1.3 Outline

The remainder of the paper is organized as follows: Section 2 introduces the basic concepts of pattern mining problems. Section 3 reviews existing pattern mining algorithms, followed by a detailed explanation of our DT-DPM framework in Section 4. The performance evaluation is presented in Section 5, and Section 6 concludes the paper.

2 Pattern mining problems

In this section, we first present a general formulation of pattern mining and then we present a few pattern mining problems according to the general formulation.

Definition 1 (pattern)

Let us consider I = {1, 2, …, n} as a set of items, where n is the number of items, and T = {t1, t2, …, tm} as a set of transactions, where m is the number of transactions. We define a function σ, where for the item i in the transaction tj, the corresponding pattern value reads p = σ(i, j).

Definition 2 (pattern mining)

A pattern mining problem finds the set of all relevant patterns L, such that

$$ L=\left\{p\ |\ Interestingness\left(T,I,p\right)\ge \gamma \right\} $$
(1)

where Interestingness(T, I, p) is the measure used to evaluate a pattern p over the set of transactions T and the set of items I, and γ is the mining threshold.

From these two definitions, we present the existing pattern mining problems.

Definition 3 (Boolean database)

We define a Boolean database by setting the function σ (see Def. 1) as

$$ \sigma \left(i,j\right)=\left\{\begin{array}{ll}1& \mathrm{if}\ i\in {t}_j\\ {}0& \mathrm{otherwise}\end{array}\right. $$
(2)

Definition 4 (frequent itemset mining (FIM))

We define a FIM problem as an extension of the pattern mining problem (see Def. 2) by

$$ L=\left\{p\ |\ Support\left(T,I,p\right)\ge \gamma \right\} $$
(3)

with \( Support\left(T,I,p\right)=\frac{{\left|p\right|}_{T,I}}{\mid T\mid } \), where T is the set of transactions of a Boolean database as defined in Def. 3, γ is a minimum support threshold, and |p|T, I is the number of transactions of T containing the pattern p.
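As an illustration, the following minimal Java sketch (the class name, method name, and map-of-sets database layout are ours, not part of the framework) computes the support of Eq. (3) on a toy Boolean database.

```java
import java.util.*;

/** Minimal sketch: support of a pattern over a Boolean database (Eq. 3). */
public class SupportExample {

    /** |p|_{T,I} / |T| : fraction of transactions containing every item of p. */
    static double support(List<Set<Integer>> transactions, Set<Integer> pattern) {
        long containing = transactions.stream()
                .filter(t -> t.containsAll(pattern))
                .count();
        return (double) containing / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<Integer>> T = List.of(
                Set.of(1, 2, 3), Set.of(1, 3), Set.of(2, 3), Set.of(1, 2, 3));
        Set<Integer> p = Set.of(1, 3);
        // p appears in 3 of the 4 transactions, so its support is 0.75;
        // p is relevant for any minimum support threshold gamma <= 0.75.
        System.out.println(support(T, p));
    }
}
```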

Definition 5 (weighted database)

We define a weighted database by setting the function σ (see Def. 1) as

$$ \sigma \left(i,j\right)=\left\{\begin{array}{ll}{w}_{ij}& \mathrm{if}\ i\in {t}_j\\ {}0& \mathrm{otherwise}\end{array}\right. $$
(4)

Note that wij is the weight of the item i in the transaction tj.

Definition 6 (weighted itemset mining (WIM))

We define a WIM problem as an extension of the pattern mining problem (see Def. 2) by

$$ L=\left\{p\ |\ WS\left(T,I,p\right)\ge \gamma \right\} $$
(5)

with \( WS\left(T,I,p\right)={\sum}_{j=1}^{\mid T\mid }W\left({t}_j,I,p\right) \), where T is the set of transactions of a weighted database as defined in Def. 5, W(tj, I, p) is the minimum weight of the items of the pattern p in the transaction tj, and γ is a minimum weighted support threshold.
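A small Java sketch of the weighted support of Eq. (5) is given below; it assumes (our reading of W(tj, I, p)) that a transaction that does not contain the whole pattern contributes 0, and all names are illustrative.

```java
import java.util.*;

/** Minimal sketch of the weighted support of Eq. (5). */
public class WeightedSupportExample {

    /** Each transaction maps an item to its weight w_ij. */
    static double weightedSupport(List<Map<Integer, Double>> transactions, Set<Integer> pattern) {
        double ws = 0.0;
        for (Map<Integer, Double> t : transactions) {
            if (!t.keySet().containsAll(pattern)) continue;     // pattern absent: assumed to contribute 0
            double min = Double.MAX_VALUE;
            for (int item : pattern) min = Math.min(min, t.get(item));
            ws += min;                                          // minimum weight of p's items in t
        }
        return ws;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> T = List.of(
                Map.of(1, 0.9, 2, 0.4), Map.of(1, 0.7, 2, 0.8, 3, 0.2), Map.of(2, 0.5));
        System.out.println(weightedSupport(T, Set.of(1, 2)));   // 0.4 + 0.7 = 1.1
    }
}
```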

Definition 7 (uncertain database)

We define an uncertain database by setting the function σ (see Def. 1) as

$$ \sigma \left(i,j\right)=\left\{\begin{array}{ll} Pro{b}_{ij}& \mathrm{if}\ i\in {t}_j\\ {}0& \mathrm{otherwise}\end{array}\right. $$
(6)

Note that Probij is the uncertainty value of i in the transaction tj.

Definition 8 (uncertain itemset mining (UIM))

We define a UIM problem as an extension of the pattern mining problem (see Def. 2) by

$$ L=\left\{p\ |\ US\left(T,I,p\right)\ge \gamma \right\} $$
(7)

with \( US\left(T,I,p\right)={\sum}_{j=1}^{\mid T\mid }{\prod}_{i\in p} Pro{b}_{ij} \), where T is the set of transactions of an uncertain database as defined in Def. 7 and γ is the minimum uncertain support threshold.
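The following Java sketch illustrates the expected (uncertain) support of Eq. (7); Prob_ij is taken as 0 when the item does not occur in the transaction, following Eq. (6). The names are ours.

```java
import java.util.*;

/** Minimal sketch of the expected (uncertain) support of Eq. (7). */
public class UncertainSupportExample {

    static double uncertainSupport(List<Map<Integer, Double>> transactions, Set<Integer> pattern) {
        double us = 0.0;
        for (Map<Integer, Double> t : transactions) {
            double prod = 1.0;
            for (int item : pattern) prod *= t.getOrDefault(item, 0.0);  // sigma(i, j), 0 if absent
            us += prod;                                                  // existential probability of p in t
        }
        return us;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> T = List.of(
                Map.of(1, 0.9, 2, 0.5), Map.of(1, 0.6, 2, 1.0), Map.of(1, 0.8));
        System.out.println(uncertainSupport(T, Set.of(1, 2)));  // 0.45 + 0.6 + 0 = 1.05
    }
}
```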

Definition 9 (utility database)

We define a utility database by setting the function σ (see Def. 1) as

$$ \sigma \left(i,j\right)=\left\{\begin{array}{ll}i{u}_{ij}& \mathrm{if}\ i\in {t}_j\\ {}0& \mathrm{otherwise}\end{array}\right. $$
(8)

Note that iuij is the internal utility of i in the transaction tj; we also define the external utility of each item i as eu(i).

Definition 10 (high utility itemset mining (HUIM))

We define a HUIM problem as an extension of the pattern mining problem (see Def. 2) by

$$ L=\left\{p\ |\ U\left(T,I,p\right)\ge \gamma \right\}, with\ U\left(T,I,p\right)=\sum \limits_{j=1}^{\mid T\mid}\sum \limits_{i\in p}i{u}_{ij}\times eu(i) $$
(9)

where T is the set of transactions of a utility database as defined in Def. 9 and γ is the minimum utility threshold.
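For illustration, the Java sketch below implements the utility measure of Eq. (9) literally: internal utilities iu_ij come from the utility database (0 when the item is absent, following Eq. (8)) and eu(i) is the external utility of item i. Names and the toy data are ours.

```java
import java.util.*;

/** Minimal sketch of the utility measure of Eq. (9). */
public class UtilityExample {

    static double utility(List<Map<Integer, Integer>> transactions,   // item -> iu_ij (e.g. quantity)
                          Map<Integer, Double> externalUtility,       // item -> eu(i) (e.g. unit profit)
                          Set<Integer> pattern) {
        double u = 0.0;
        for (Map<Integer, Integer> t : transactions)
            for (int item : pattern)
                u += t.getOrDefault(item, 0) * externalUtility.getOrDefault(item, 0.0);
        return u;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> T = List.of(Map.of(1, 2, 3, 1), Map.of(1, 1, 2, 4));
        Map<Integer, Double> eu = Map.of(1, 5.0, 2, 1.0, 3, 10.0);
        System.out.println(utility(T, eu, Set.of(1, 3)));  // (2*5 + 1*10) + (1*5 + 0) = 25.0
    }
}
```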

Definition 11 (sequence database)

We assume a total order ≺ on items, such that 1 ≺ 2 ≺ 3 ≺ … ≺ n. A sequence is an ordered list of itemsets s = {I1, I2, …, Is}. Each itemset Ii is defined by setting the function σ (see Def. 1) as σ(i, j) = i if i ∈ tj.

Definition 12 (sequential pattern mining (SPM))

We define a SPM problem as an extension of the pattern mining problem (see Def. 2) by

$$ L=\left\{p\ |\ Support\left(T,I,p\right)\ge \gamma \right\} $$
(10)

where T is the set of transactions of a sequence database as defined in Def. 11 and γ is the minimum support threshold.

3 Related work

Pattern mining has been widely studied over the last three decades [8,9,10,11, 20, 21]. There are many variants of the pattern mining problem, such as FIM, WIM, HUIM, UIM, and SPM.

FIM

It aims at extracting all frequent itemsets that exceed the minimum support threshold. Apriori [22] and FP-Growth [23] are the most popular algorithms. Apriori applies a generate-and-test strategy to explore the itemset space. The candidate itemsets are generated incrementally and recursively: to generate k-sized candidate itemsets, the algorithm combines the frequent (k-1)-sized itemsets. This process is repeated until no candidate itemsets are obtained in an iteration. FP-Growth, in contrast, adopts a divide-and-conquer strategy and compresses the transactional database in the volatile memory using an efficient tree structure. It then recursively applies the mining process to find the frequent itemsets. The main limitation of the traditional FIM algorithms is the database format, where only binary items can be mined. A typical application of this problem is market basket analysis, in which a given item (product) may be present or absent in a given transaction (customer basket).
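For concreteness, the short Java sketch below illustrates the join step of this generate-and-test strategy, i.e., how two frequent (k-1)-sized itemsets sharing the same (k-2)-prefix are merged into a k-sized candidate. It is only a didactic sketch (class and method names are ours); support counting and the subset-based pruning step are performed afterwards.

```java
import java.util.*;

/** Sketch of the Apriori join step: two frequent (k-1)-itemsets sharing the same
 *  first k-2 items are merged into one k-sized candidate (itemsets kept sorted). */
public class AprioriJoin {

    static List<List<Integer>> generateCandidates(List<List<Integer>> frequentKMinus1) {
        List<List<Integer>> candidates = new ArrayList<>();
        for (int a = 0; a < frequentKMinus1.size(); a++) {
            for (int b = a + 1; b < frequentKMinus1.size(); b++) {
                List<Integer> x = frequentKMinus1.get(a), y = frequentKMinus1.get(b);
                int k1 = x.size();
                // keep the pair only if the first k-2 items are identical
                if (!x.subList(0, k1 - 1).equals(y.subList(0, k1 - 1))) continue;
                List<Integer> cand = new ArrayList<>(x);
                cand.add(Math.max(x.get(k1 - 1), y.get(k1 - 1)));   // append the larger last item
                cand.set(k1 - 1, Math.min(x.get(k1 - 1), y.get(k1 - 1)));
                candidates.add(cand);                               // support counting happens later
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // frequent 2-itemsets -> 3-sized candidates [1,2,3], [1,2,5] and [1,3,5];
        // the subset-pruning step would later discard [1,3,5] since {3,5} is not frequent here.
        System.out.println(generateCandidates(List.of(
                List.of(1, 2), List.of(1, 3), List.of(1, 5), List.of(2, 3))));
    }
}
```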

WIM

To address this FIM limitation, WIM was introduced, where a weight is associated with each item to indicate its relative importance in the given transaction [24]. The goal is to extract the itemsets exceeding a minimum weight threshold. The first WIM algorithm is called WFIM: Weighted Frequent Itemset Mining [25]. It introduces a weight range and a minimum weight constraint into the FP-Growth algorithm. Both weight and support measures are considered to prune the search space. Yun [26] proposed WIP: Weighted Interesting Pattern mining. It introduces an affinity measure that determines the correlation between the items of the same pattern. The integration of WIM into both Apriori and FP-Growth is studied in [27]. The results showed that FP-Growth outperforms Apriori for mining weighted patterns. Le et al. [28] proposed a frequent subgraph mining algorithm on a weighted large graph. A novel strategy is developed to compute the weight of all candidate subgraphs, and an efficient pruning strategy reduces both the processing time and the memory usage. Lee et al. [29] mine frequent weighted itemsets by employing a novel type of prefix tree structure, which allows the relevant patterns to be retrieved more accurately without saving the list of identification numbers of the different transactions.

UIM

An extension of WIM, called UIM, explores uncertain transactional databases, where two models (expected support and probabilistic itemsets) are defined to mine uncertain patterns. Li et al. [30] proposed PFIMoS: Probabilistic Frequent Itemset Mining over Streams. It derives the probabilistic frequent itemsets in an incremental way by determining the upper and lower bounds of the mining threshold. Lee et al. [31] introduced U-WFI: Uncertain mining of Weighted Frequent Itemsets. It discovers, from a given uncertain database, relevant uncertain frequent itemsets with high probability values by focusing on item weights. Liaqat et al. [32] show the use of uncertain frequent patterns in the image retrieval process, incorporating fuzzy ontology and uncertain frequent pattern mining to find the relevant images for a user query. Lee et al. [33] suggest novel data structures that guarantee the correctness of the mining outputs without any false positives, which allows a complete set of uncertain relevant patterns to be retrieved in a reasonable amount of time.

HUIM

High Utility Itemset Mining is an extension of WIM in which both internal and external utilities of the items are involved. The aim is to find all high utility patterns of a transactional database that exceed the minimum utility threshold. The utility of a pattern is the sum of the utilities of its items, where the utility of an item is the product of its internal and external utility values. Chan et al. [34] proposed the first HUIM algorithm. It applies an Apriori-based algorithm to discover the top-k high utility patterns. This algorithm suffers from poor runtime performance because the search space cannot be pruned using the downward closure property, since the utility measure is neither monotone nor anti-monotone. To address this limitation, the TWU: Transaction Weighted Utility property was defined to prune the high utility pattern space [35, 36]. It is an upper-bound monotone measure used to reduce the search space. More efficient HUIM algorithms based on TWU have been recently proposed, such as EFIM: EFficient high-utility Itemset Mining [37] and d2HUP: Direct Discovery for High Utility Patterns [38]. The particularity of such approaches is that they use more efficient data structures to determine the TWU and the utility values. Singh et al. [39] address the problem of tuning the minimum utility threshold and derive the top-k high utility patterns. Their approach uses transaction merging and data projection techniques to reduce the data scanning cost, and develops an intelligent strategy designed for top-k high utility patterns to prune the enumeration search tree. Gan et al. [40] proposed a correlated high utility pattern mining algorithm. It considers the positive correlation, the profitable value concepts, and several strategies to prune the search space. Lee et al. [41] developed an efficient incremental approach for identifying high utility patterns. It adopts an accurate data structure to mine high utility patterns in an incremental way.

SPM

Sequential Pattern Mining is an extension of FIM to discover sets of ordered patterns in a sequence database [42,43,44]. Salvemini et al. [42] find the complete set of sequential patterns while reducing the candidate generation runtime by employing an efficient lexicographic tree structure. Fumarola et al. [43] discover closed sequential patterns using two main steps: i) finding the closed sequential patterns of size 1, and ii) generating new sequences from the size-1 sequential patterns deduced in the first step. Van et al. [44] introduced a pattern-growth algorithm for sequential pattern mining with itemset constraints. It proposes an incremental strategy to prune the enumeration search tree, which reduces the number of visited nodes. Aisal et al. [45] proposed a novel convoy pattern mining approach which can operate on a variety of operational data stores. It suggests a new heuristic to prune the objects which have no chance of forming a convoy. Wu et al. [46] solved the contrast sequential pattern mining problem, an extension of SPM that discovers all relevant patterns appearing in one sequence dataset and not in the others. These patterns are widely used in specific applications such as analyzing anomalous customers in business intelligence or medical diagnosis in smart healthcare [47,48,49].

High performance computing

Regarding high performance computing, many algorithms have been developed for boosting the FIM performance [15, 50,51,52,53,54]. However, few algorithms have been proposed for the other pattern mining problems [16,17,18, 55]. In [52], some challenges in big data analytics are discussed, such as mining evolving data streams and the need to handle many exabytes of data across various application areas such as social network analysis. The BigFIM: Big Frequent Itemset Mining [56] algorithm combines principles from both Apriori and Eclat and is implemented using the MapReduce paradigm: the mappers are determined using the Eclat algorithm, whereas the reducers are computed using the Apriori algorithm. [57] develops two strategies for parallelizing both candidate itemset generation and support counting on a GPU (Graphic Processing Unit). In the candidate generation, each thread is assigned two frequent (k-1)-sized itemsets; it compares them to make sure that they share the common (k-2) prefix and then generates a k-sized candidate itemset. In the evaluation, each thread is assigned one candidate itemset and counts its support by scanning the transactions simultaneously. The evaluation of frequent itemsets is improved in [58] by proposing mapping and sum-reduction techniques to merge all counts of the given itemsets. It is also improved in [59] by developing three strategies for minimizing the impact of GPU thread divergence. In [60], a multilevel layer data structure is proposed to enhance the support counting of the frequent itemsets. It divides vertical data into several layers, where each layer is an index table of the next layer. This strategy can completely represent the original vertical structure. In a vertical structure, each item corresponds to a fixed-length binary vector; in this strategy, however, the length of each vector varies, depending on the number of transactions that include the corresponding item. A Hadoop implementation based on the MapReduce programming approach, called FiDoop: Frequent itemset based on Decomposition, is proposed in [61] for the frequent itemset mining problem. It incorporates the concept of the FIU-tree (Frequent Itemset Ultrametric tree) rather than the traditional FP-tree of the FP-Growth algorithm, for the purpose of improving the storage of the candidate itemsets. An improved version called FiDoop-DP is proposed in [15]. It develops an efficient strategy to partition data sets among the mappers, which allows better exploitation of the cluster hardware architecture by avoiding job redundancy. Andrzejewski et al. [62] introduce the concept of incremental co-location patterns, i.e., updating the knowledge about spatial features after new spatial data is inserted into the original one. The authors develop a new parallel algorithm which combines an effective update strategy with multi-GPU co-location pattern mining [63] by designing an efficient enumeration tree on the GPU. Since the proposed approach is memory-aware, i.e., the data is divided into several packages to fit into the GPU memories, it only achieves a speedup of six. Jiang et al. [64] adopt a parallel FP-Growth for mining world ocean atlas data. The whole data is partitioned among multiple CPU threads, where each thread explores 300,000 data points and derives correlations and regularities of oxygen, temperature, phosphate, nitrate, and silicate in the ocean. The experimental results reveal that the suggested adaptation only reaches a speedup of 1.2. Vanhalli et al. [65] developed a parallel row-enumeration algorithm for mining frequent colossal closed patterns from high dimensional data. It first prunes the whole data by removing irrelevant items and transactions using a rowset cardinality table, which determines the closeness of each subset of transactions. In addition, it uses a hybrid parallel bottom-up bitset-based approach to enumerate the colossal frequent closed patterns. This approach is fast; however, it suffers from an accuracy issue, as it may ignore some relevant patterns due to the preprocessing phase. It also requires an additional parameter to be fixed, represented by a cardinality threshold. Yu et al. [66] propose the parallel version of PrefixSpan on Spark: PrefixSpan-S. It optimizes the overhead by first loading the data from the Hadoop distributed file system into RDDs: Resilient Distributed Datasets, then reading the data from the RDDs, and saving the potential results back into the RDDs. This approach reaches a good performance with a wise choice of the minimum support threshold; however, it is very sensitive to the data distribution. Kuang et al. [67] proposed a parallel implementation of the FP-Growth algorithm in Hadoop that removes the data redundancy between the different data partitions, which allows the transactions to be handled in a single pass. Sumalatha et al. [68] introduce the concept of distributed temporal high utility sequential patterns and propose an intelligent strategy that creates a time-interval utility data structure for evaluating the candidate patterns. The authors also define two utility upper bounds, the remaining utility and the co-occurrence utility, to prune the search space.

To improve the runtime performance of pattern mining approaches, several strategies have been proposed using metaheuristics, specifically exploiting evolutionary and/or swarm intelligence approaches [13, 14, 69, 70]. However, these optimizations are inefficient when dealing with large and big transactional databases where only a small number of interesting patterns are discovered. To deal with this challenging issue, the next section presents a new framework which investigates both decomposition techniques and distributed computing for solving pattern mining problems.

4 DT-DPM: decomposition transaction for distributed pattern mining

This section presents the DT-DPM (Decomposition Transaction for Distributed Pattern Mining) framework, which integrates the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm with distributed computing, represented by MapReduce, multi-core CPUs, and a single CPU, for solving pattern mining problems. As seen in Fig. 1, the DT-DPM framework uses heterogeneous distributed computing and decomposition techniques for solving pattern mining problems. A detailed, step-by-step explanation of the DT-DPM framework is given in the following.

Fig. 1 DT-DPM framework

4.1 DBSCAN

The aim of this step is to divide the database into a collection of homogeneous groups using decomposition techniques, where each group contains highly correlated entries, i.e., the database entries of each group share a maximum number of items compared to the entries of the other groups.

Definition 13

A database \( \mathcal{D} \) is decomposed into several groups G = {Gi}, where each group Gi is a subset of rows of \( \mathcal{D} \) such that Gi ∩ Gj = ∅ for i ≠ j. We define \( \mathcal{I}\left({G}_i\right) \), the set of items of the group Gi, by

$$ \mathcal{I}\left({G}_i\right)=\bigcup \limits_{{\mathcal{D}}_j\in {G}_i}\mathcal{I}\left({\mathcal{D}}_j\right) $$
(11)

Proposition 1

Suppose that the groups in G do not share any items, which means

$$ \forall \left({G}_i,{G}_j\right)\in {G}^2,\ i\ne j,\kern0.5em \mathcal{I}\left({G}_i\right)\cap \mathcal{I}\left({G}_j\right)=\varnothing $$
(12)

We have the following proposition

$$ L=\left\{\bigcup \limits_{i=1}^k{L}_i\right\} $$
(13)

Note that \( {L}_i \) is the set of the relevant patterns of the group \( {\mathcal{G}}_i \).

From the above proposition, one may argue that if the whole set of transactions in the database is decomposed in such a way, independent groups will be derived. It means that no group of transactions shares items with any other group, and therefore the groups can be solved separately. Unfortunately, such a case is difficult to realize, as many dependencies may be observed between rows. The aim of the decomposition techniques is to minimize the separator items between the groups such that

$$ {G}^{\ast }=\underset{G}{\arg\ \min}\left|\bigcup \limits_{i\ne j}\left(\mathcal{I}\left({G}_i\right)\cap \mathcal{I}\left({G}_j\right)\right)\right| $$
(14)

The aim of the decomposition step is to minimize the shared items between the different clusters; these shared items are called separator items. More formally, the decomposition generates a labeled, weighted, undirected graph noted G =  < C, S>, where C is the set of nodes formed by the clusters of G and S is the set of the separator items. Each element sij in S contains two components: i) \( {s}_l^{ij} \), the label of the element sij, represented by the set of items shared by the clusters Ci and Cj, and ii) \( {s}_w^{ij} \), the weight of the element sij, represented by the number of items shared by the clusters Ci and Cj. As a result, k disjoint partitions P = {P1, P2, …, Pk} are obtained, where Pi ∩ Pj = ∅, ∀(i, j) ∈ [1, …, k]2 with i ≠ j, and \( {\bigcup}_{i=1}^k{P}_i=T \). The partitions are constructed by minimizing the following function

$$ \sum \limits_{i=1}^k\left(\sum \limits_{j=1}^k\left(\sum \limits_{l=1}^{\mid {P}_i\mid}\left({2}^{{\mathcal{G}}_i^l}-1\right)-\sum \limits_{l=1}^{\mid {P}_j\mid}\left({2}^{{\mathcal{G}}_j^l}-1\right)\right)\right) $$
(15)

Solving this equation with an exact solver requires high computational time. One way to address this issue is to use clustering algorithms [71]. The adaptation of the DBSCAN [72] clustering algorithm has been investigated in this research. Before presenting it, some definitions are given as follows.

Definition 14 (transaction-based similarity)

We define the transaction-based similarity as an adapted Jaccard similarity [73]:

$$ {\mathcal{J}}_{\mathcal{D}}\left({\mathcal{D}}_i,{\mathcal{D}}_j\right)=\frac{\sum_{x\in \left({\mathcal{D}}_i\cap {\mathcal{D}}_j\right)} Sim\left({\mathcal{D}}_i,{\mathcal{D}}_j,x\right)}{\mid {\mathcal{D}}_i\mid +\mid {\mathcal{D}}_j\mid +{\sum}_{x\in \left({\mathcal{D}}_i\cap {\mathcal{D}}_j\right)} Sim\left({\mathcal{D}}_i,{\mathcal{D}}_j,x\right)} $$
(16)
$$ Sim\left({\mathcal{D}}_i,{\mathcal{D}}_j,x\right)=\left\{\begin{array}{ll}1& \mathrm{if}\ {x}^i={x}^j\\ {}0& \mathrm{otherwise}\ \end{array}\right. $$
(17)

Note that xi is the value of the variable x in the row data \( {\mathcal{D}}_i \).
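As an illustration, the following Java sketch (a simplified reading of Eqs. (16)-(17), with rows stored as item-to-value maps; the class and method names are ours) computes this similarity between two rows.

```java
import java.util.*;

/** Sketch of the transaction-based similarity of Definition 14. Each row is a map
 *  item -> value (sigma(i, j)); Sim counts the shared items having the same value. */
public class RowSimilarity {

    static double similarity(Map<Integer, Double> di, Map<Integer, Double> dj) {
        double sharedEqual = 0.0;
        for (Map.Entry<Integer, Double> e : di.entrySet())
            if (e.getValue().equals(dj.get(e.getKey())))    // x in di ∩ dj and x^i = x^j
                sharedEqual++;
        return sharedEqual / (di.size() + dj.size() + sharedEqual);   // Eq. (16)
    }

    public static void main(String[] args) {
        Map<Integer, Double> d1 = Map.of(1, 1.0, 2, 1.0, 3, 1.0);
        Map<Integer, Double> d2 = Map.of(2, 1.0, 3, 1.0, 4, 1.0);
        System.out.println(similarity(d1, d2));   // 2 / (3 + 3 + 2) = 0.25
    }
}
```

For a Boolean database, all stored values equal 1, so the numerator simply counts the items shared by the two transactions.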

Definition 15 (centroids)

We define the centroids of the group of rows data Gi, noted \( \overline{G_i} \) by

$$ \overline{G_i}=\left\{\bigcup \underset{x_l}{\max}\left({x}_l^i\right)|{x}_l\in \mathcal{X}\left({G}_i\right)\right\} $$
(18)

where \( \underset{x_l}{\max}\left({x}_l^i\right) \) is the most frequent value of the variable xl in the group Gi.

Definition 16 (row neighborhoods)

We define the neighborhood of a row data \( {\mathcal{D}}_i \) for a given threshold ϵ, noted \( {\mathcal{N}}_{{\mathcal{D}}_i} \), by

$$ {\mathcal{N}}_{{\mathcal{D}}_i}=\left\{{\mathcal{D}}_j\ |\ {\mathcal{J}}_{\mathcal{D}}\left({\mathcal{D}}_i,{\mathcal{D}}_j\right)\ge \epsilon \wedge j\ne i\right\} $$
(19)

Definition 17 (core data)

A row data \( {\mathcal{D}}_i \) is called a core data if its neighborhood contains at least a minimum number σD of rows, i.e., \( \mid {\mathcal{N}}_{{\mathcal{D}}_i}\mid \ge {\sigma}_D \).


Algorithm 1 presents the pseudo-code of the decomposition of the rows data. The process starts by checking the ϵ-neighborhood of each transaction. The core transactions are determined, and then density-reachable transactions are iteratively collected from these core transactions, which may involve merging a few density-reachable clusters. The process terminates when no new transaction can be added to any cluster. The output of the decomposition step is a labeled, weighted, undirected graph noted O =  < G, S>, where G is the set of nodes formed by the groups of rows data and S is the set of the separator items. Each element sij in S contains two components: i) \( {s}_l^{ij} \), the label of the element sij, represented by the set of variables shared by the groups Gi and Gj, and ii) \( {s}_w^{ij} \), the weight of the element sij, represented by the number of variables shared by the groups Gi and Gj.
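A compact, self-contained Java sketch of this decomposition step is given below, assuming Boolean transactions and the Boolean specialization of the similarity of Definition 14; the constants and method names (decompose, epsilon, minRows) are ours, and the authors' implementation may differ.

```java
import java.util.*;

/** Minimal DBSCAN-style sketch of the decomposition step (Algorithm 1) over transactions. */
public class TransactionDBSCAN {
    static final int NOISE = -1, UNVISITED = 0;

    /** Returns a cluster label per transaction; label NOISE marks noise transactions. */
    static int[] decompose(List<Set<Integer>> db, double epsilon, int minRows) {
        int[] label = new int[db.size()];                    // 0 = UNVISITED
        int cluster = 0;
        for (int i = 0; i < db.size(); i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(db, i, epsilon);
            if (seeds.size() < minRows) { label[i] = NOISE; continue; }
            cluster++;                                       // start a new cluster of transactions
            label[i] = cluster;
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {                       // expand density-reachable transactions
                int j = queue.poll();
                if (label[j] == NOISE) label[j] = cluster;   // noise becomes a border transaction
                if (label[j] != UNVISITED) continue;
                label[j] = cluster;
                List<Integer> jn = neighbors(db, j, epsilon);
                if (jn.size() >= minRows) queue.addAll(jn);  // j is itself a core transaction
            }
        }
        return label;
    }

    static List<Integer> neighbors(List<Set<Integer>> db, int i, double epsilon) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < db.size(); j++)
            if (j != i && similarity(db.get(i), db.get(j)) >= epsilon) out.add(j);
        return out;
    }

    /** Boolean specialization of the similarity of Definition 14. */
    static double similarity(Set<Integer> a, Set<Integer> b) {
        long shared = a.stream().filter(b::contains).count();
        return (double) shared / (a.size() + b.size() + shared);
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(Set.of(1, 2, 3), Set.of(1, 2), Set.of(1, 2, 4),
                                        Set.of(7, 8, 9), Set.of(7, 8), Set.of(5));
        System.out.println(Arrays.toString(decompose(db, 0.2, 2)));  // e.g. [1, 1, 1, -1, -1, -1]
    }
}
```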

4.2 Mining process

After applying the DBSCAN algorithm on the whole data, the next step is to solve each resulting cluster independently using heterogeneous machines such as supercomputers, CPU cores, and a single CPU core. The question that should be answered now is: which cluster should be assigned to which machine? We know that supercomputers are more powerful than CPU cores and a single CPU. Consequently, clusters with a high workload are assigned to the supercomputer, the remaining clusters are assigned to CPU cores, and the noise data is assigned to a single CPU. This leads to the following proposition.

Proposition 2

Consider a function Cost(\( \mathcal{A} \), m, n) that computes the complexity cost of a given pattern mining algorithm \( \mathcal{A} \), and consider the output of the DBSCAN algorithm O =  < G, S>. The mining process of the clusters by \( \mathcal{A} \) can be performed in parallel using heterogeneous machines, according to the density of each cluster Gi, represented by \( \left(|{G}_i|,\mathcal{I}\left({G}_i\right)\right) \), and a complexity cost threshold μcost, as follows (a dispatching sketch is given after this list):

  1. If Cost(\( \mathcal{A} \), ∣Gi∣, \( \mathcal{I}\left({G}_i\right) \)) ≥ μcost, then send Gi to the MapReduce framework.

  2. If Cost(\( \mathcal{A} \), ∣Gi∣, \( \mathcal{I}\left({G}_i\right) \)) < μcost, then send Gi to the CPU cores.
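A minimal Java sketch of this dispatching rule follows; the cost estimate used here (|Gi| · 2^|I(Gi)|), the threshold value, and all class and method names are illustrative assumptions rather than the framework's actual code.

```java
import java.util.*;
import java.util.function.BiFunction;

/** Sketch of the dispatching rule of Proposition 2: clusters whose estimated mining cost
 *  reaches the threshold go to MapReduce, the remaining clusters go to local CPU cores,
 *  and the DBSCAN noise is mined on a single CPU. */
public class ClusterDispatcher {

    enum Target { MAPREDUCE, CPU_CORES, SINGLE_CPU }

    /** clusterStats: cluster id -> {|Gi|, |I(Gi)|}; cost estimates Cost(A, |Gi|, |I(Gi)|). */
    static Map<Integer, Target> dispatch(Map<Integer, int[]> clusterStats,
                                         Set<Integer> noiseIds,
                                         BiFunction<Integer, Integer, Double> cost,
                                         double muCost) {
        Map<Integer, Target> plan = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : clusterStats.entrySet()) {
            if (noiseIds.contains(e.getKey())) { plan.put(e.getKey(), Target.SINGLE_CPU); continue; }
            double c = cost.apply(e.getValue()[0], e.getValue()[1]);
            plan.put(e.getKey(), c >= muCost ? Target.MAPREDUCE : Target.CPU_CORES);
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> stats = Map.of(1, new int[]{500_000, 400}, 2, new int[]{2_000, 20},
                                           -1, new int[]{120, 15});   // -1 = DBSCAN noise
        // Cluster 1 goes to MapReduce, cluster 2 to the CPU cores, noise to the single CPU.
        System.out.println(dispatch(stats, Set.of(-1),
                (m, n) -> m * Math.pow(2, Math.min(n, 60)), 1e12));
    }
}
```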

Map and reduce

Each mapper \( {\mathcal{M}}_i \) first retrieves the assigned cluster Gi. It then processes the transactions of the cluster Gi and applies the mining process to each transaction, gradually creating a set of candidate patterns \( {\mathcal{C}}_i \). When \( {\mathcal{M}}_i \) has scanned all transactions of the cluster Gi, it sends \( {\mathcal{C}}_i \) to the reducer \( {\mathcal{R}}_i \). The reducer \( {\mathcal{R}}_i \) scans the candidate patterns \( {\mathcal{C}}_i \) and computes the local support of each pattern belonging to \( {\mathcal{C}}_i \). This yields the local hash table \( {\mathcal{LH}}_{M{R}_i} \).
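The following Hadoop-free Java sketch mimics these map and reduce steps for a FIM instance; to keep it short, the emitted candidates are restricted to itemsets of size one and two, and all names are ours rather than the framework's API.

```java
import java.util.*;

/** Simulation of the map and reduce steps over one cluster Gi: the mapper emits the
 *  candidate patterns of each transaction, and the reducer builds the local hash table
 *  LH_i mapping each pattern to its local support. */
public class LocalSupportMapReduce {

    /** Map phase: emit (pattern, 1) for each candidate found in each transaction. */
    static List<Map.Entry<String, Integer>> map(List<Set<Integer>> cluster) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (Set<Integer> t : cluster) {
            List<Integer> items = new ArrayList<>(t);
            Collections.sort(items);
            for (int a = 0; a < items.size(); a++) {
                emitted.add(Map.entry(items.get(a).toString(), 1));
                for (int b = a + 1; b < items.size(); b++)
                    emitted.add(Map.entry(items.get(a) + "," + items.get(b), 1));
            }
        }
        return emitted;
    }

    /** Reduce phase: sum the counts of each pattern into the local hash table. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> localHashTable = new HashMap<>();
        for (Map.Entry<String, Integer> e : emitted)
            localHashTable.merge(e.getKey(), e.getValue(), Integer::sum);
        return localHashTable;
    }

    public static void main(String[] args) {
        List<Set<Integer>> cluster = List.of(Set.of(1, 2), Set.of(1, 2, 3), Set.of(2, 3));
        System.out.println(reduce(map(cluster)));   // e.g. "2" -> 3, "1,2" -> 2, "2,3" -> 2
    }
}
```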

CPU cores

The CPU cores solve the clusters whose complexity cost is lower than μcost. Each CPU core deals with one cluster of transactions and sequentially applies the given pattern mining algorithm \( \mathcal{A} \) to the assigned cluster. The result is stored in a local hash table called \( {\mathcal{LH}}_{CP{U}_i} \).

Single CPU

The noise transactions resulting from DBSCAN are solved separately on a single CPU. The generated noise is merged into a small set of transactions, on which the sequential process is applied. Again, the result is stored in a local hash table called \( {\mathcal{LH}}_{noise} \).

Merging

The merging step is then performed to determine the global support of all patterns and extract all relevant patterns into the global hash table \( \mathcal{GH} \). This step considers the set of separator items S as well as the clusters in the mining process, which allows all relevant patterns to be discovered from the whole set of transactions. The possible candidate patterns are first generated from the separator items. For each generated pattern, the aggregation function (see Definition 18) is then used to determine the interestingness of this pattern in the whole transactional database. Note that the interestingness depends on the problem; for instance, for a frequent itemset mining problem, the interestingness function is the support measure. The relevant patterns of the separator items are then concatenated with the local hash tables \( \mathcal{LH} \) to derive the global relevant patterns of the whole transactional database, noted \( \mathcal{GH} \).

Definition 18

Let us define an aggregation function of the pattern p in the clusters of the transactions C by

$$ \mathcal{A}(p)=\sum \limits_{i=1}^k Interestingness\left({G}_i,\mathcal{I}\left({G}_i\right),p\right) $$
(20)
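A simplified Java sketch of this merging step is shown below; it assumes that each local hash table already stores the local interestingness of its patterns, sums them according to the aggregation function of Eq. (20), and keeps the patterns reaching the threshold. The names and the string encoding of patterns are ours.

```java
import java.util.*;

/** Sketch of the merging step: sum the local interestingness of each candidate pattern
 *  (including those generated from the separator items) and keep the relevant ones in GH. */
public class MergeStep {

    static Map<String, Integer> merge(List<Map<String, Integer>> localHashTables,
                                      Set<String> separatorPatterns, int gamma) {
        Map<String, Integer> globalHashTable = new HashMap<>();
        // candidate patterns are those seen locally plus those generated from separator items
        Set<String> candidates = new HashSet<>(separatorPatterns);
        localHashTables.forEach(lh -> candidates.addAll(lh.keySet()));
        for (String p : candidates) {
            int global = 0;
            for (Map<String, Integer> lh : localHashTables)     // A(p): sum of local interestingness
                global += lh.getOrDefault(p, 0);
            if (global >= gamma) globalHashTable.put(p, global);
        }
        return globalHashTable;
    }

    public static void main(String[] args) {
        Map<String, Integer> lh1 = Map.of("1,2", 4, "2", 6);
        Map<String, Integer> lh2 = Map.of("2,3", 3, "2", 5);
        // item 2 is a separator item shared by the two clusters
        System.out.println(merge(List.of(lh1, lh2), Set.of("2"), 5));  // -> {2=11}
    }
}
```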

DT-DPM improves the sequential version of the pattern mining algorithms by exploiting heterogeneous machines while generating and evaluating the candidate patterns. Load balancing is conducted automatically, since the transactions are assigned to the workers by the decomposition step. The workers (mappers, CPU cores, and the single CPU) process transactions independently and store the results in local hash tables. The merging step is finally performed on the local hash tables and extracts the set of all relevant patterns by dealing with the separator items using the aggregation function.

5 Performance evaluation

Extensive experiments were carried out to evaluate the DT-DPM framework. Five case studies were investigated by considering the FIM, WIM, UIM, HUIM, and SPM problems. DT-DPM is integrated into the SPMF data mining library [74], which offers more than 150 pattern mining algorithms. The DT-DPM Java source code is integrated with the five best pattern mining algorithms in terms of time complexity [8]: i) frequent itemset mining: PrePost+ [75], ii) weighted itemset mining: WFIM [27], iii) uncertain itemset mining: U-Apriori [76], iv) high utility itemset mining: EFIM [37], and v) sequential pattern mining: FAST [42]. All experiments were run on a computer with an Intel Core i7 processor running Windows 10 and 16 GB of RAM.

5.1 Dataset description

Experiments were run on well-known large and big pattern mining databases. Table 1 presents the characteristics of the large databases used in the experiments. Moreover, three very large databases have been used to evaluate the pattern mining approaches:

  1. Webdocs: created from a set of links among HTML documents available on the Web. It contains 1,692,082 transactions with 526,765 different items. The maximum number of items per transaction is 70,000, and the database size is 1.48 GB [77].

  2. Twitter: a dataset of user tweets. It contains 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets [78].

  3. NewYorkTimes: consists of over 1.8 million newspaper articles published over twenty years.

Table 1 Large databases

In addition, the IBM Synthetic Data Generator for Itemsets and Sequences is used to generate big databases with different numbers of items and transactions.

5.2 Decomposition performance

The first experiment aims at evaluating the clustering step. Several tests have been performed on the k-means algorithm to fix the number of clusters k. Figures 2 and 3 present the quality of the obtained clusters, measured by the percentage of separator items, and the runtime performance in seconds, using the twelve databases mentioned above. By varying the number of clusters from 2 to 20, the percentage of separator items is reduced: when k is set to 2, the percentage of separator items exceeds 50%, while it does not reach 1% when the number of clusters is set to 20 for almost all databases. Moreover, by increasing the number of clusters, the runtime increases linearly, but it does not reach 2.50 seconds for any database. As a result, we fix the number of clusters to 20 for the remainder of the experiments. Given these results, we can conclude that the pre-processing step does not degrade the overall performance of the DT-DPM framework, in particular for very large and big databases.

Fig. 2 Percentage (%) of separator items of the decomposition using different clustering algorithms on the following databases: D1: pumsb, D2: mushroom, D3: connect, D4: chess, D5: accident, D6: kosarak, D7: foodmart, D8: chainstore, D9: leviathan, D10: sign, D11: snake, and D12: FIFA

Fig. 3 Runtime (seconds) of the pattern mining algorithms without and with the decomposition step on the following databases: D1: pumsb, D2: mushroom, D3: connect, D4: chess, D5: accident, D6: kosarak, D7: foodmart, D8: chainstore, D9: leviathan, D10: sign, D11: snake, and D12: FIFA

Figure 3 presents the runtime performance of the pattern mining algorithms with and without the DT-DPM framework for both strategies (approximate and exact), using different databases and different mining thresholds. The experimental results reveal that, by reducing the mining threshold and increasing the complexity of the problem solved, the pattern mining algorithms benefit from the DT-DPM framework. Therefore, for a small mining threshold and for more complex problems like UIM, HUIM, or SPM, the approximate-based and exact-based strategies outperform the original pattern mining algorithms. For instance, when the minimum utility threshold is set to 1600 K, the runtime of the original EFIM and of EFIM using the DT-DPM framework is 1 s on the Connect database. However, when the minimum utility is set to 1000 K, the runtime of the original EFIM exceeds 8,000 seconds, whereas the runtime of EFIM with the DT-DPM framework does not reach 1,500 seconds. These results are obtained thanks to several factors: i) the decomposition method applied in the DT-DPM framework, which minimizes the number of separator items, ii) solving sub-problems with a small number of transactions and a small number of items, instead of dealing with the whole transactional database and all distinct items, and iii) the ability of the pattern mining algorithms to be integrated with the DT-DPM framework.

5.3 Speedup of DT-DPM

Table 2 presents the speedup of the pattern mining algorithms with and without the DT-DPM framework on the large databases. The results reveal that, by increasing the number of mappers and the complexity of the problem solved, the speedup of the pattern mining algorithms benefits from the DT-DPM framework. Thus, for a large number of mappers and for more complex problems like UIM, HUIM, or SPM, the mining process with DT-DPM outperforms the original pattern mining algorithms. For instance, when the number of mappers is set to 2, the speedup of the original PrePost+ and of PrePost+ using the DT-DPM framework is less than 5 on the pumsb database. However, by setting the number of mappers to 32, the speedup of the original EFIM does not reach 8630, whereas the speedup of EFIM with the DT-DPM framework exceeds 800 on the Connect database. These results are achieved thanks to the following factors: i) the decomposition method applied in the DT-DPM framework, which minimizes the number of separator items, and ii) solving the sub-problems with a small number of transactions and a small number of items using the MapReduce architecture.

Table 2 Speedup of the pattern mining algorithms with and without the DT-DPM framework for different numbers of mappers (2, 4, 8, 16, 32)

5.4 DT-DPM Vs state-of-the-art algorithms

Figure 4 presents the speedup of DT-DPM against the baseline pattern mining algorithms (FiDoop-DP [15] for FIM, PAWI: Parallel Weighted Itemsets [16] for WIM, MRGrowth: MapReduce for FP-Growth [17] for UIM, PHI-Miner: Parallel High utility Itemset Miner [55] for HUIM, and MG-FSM: Mind the Gap for Frequent Sequence Mining [18] for SPM) on the big databases, setting the mining threshold to 10% for Webdocs, 5% for Twitter, and 2% for the NewYorkTimes database. The results reveal that, by increasing the percentage of transactions from 5% to 100% and increasing the complexity of the problem solved, DT-DPM outperforms the baseline pattern mining algorithms, benefiting from the intelligent partitioning of the transactional database and the intelligent mapping of the different clusters onto the mapper nodes. For instance, the speedup of our framework is 701 on Webdocs, whereas the speedup of FiDoop-DP is 521. On the other hand, the speedup of our framework on NewYorkTimes is 1325, whereas the speedup of MG-FSM does not exceed 900.

Fig. 4 Speedup of DT-DPM and the state-of-the-art parallel pattern mining algorithms using big databases

5.5 Results on big databases

Figure 5 presents the runtime of DT-DPM and the baseline MapReduce-based models on big databases for solving both itemset and sequential pattern mining problems. The baseline methods are FiDoop-DP [15] and NG-PFP: NonGroup Parallel Frequent Pattern mining [67] for itemset mining, and PrefixSpan-S [66] for sequence mining. The results reveal that our model outperforms the baseline MapReduce-based models in terms of computational time for both itemset and sequence mining. These results hold regardless of the number of items and the number of transactions. Moreover, when the number of items is varied from 10,000 to 1 million, and the number of transactions from 1 million to 10 million, the ratio of runtimes between our model and the baseline models increases. All these results are obtained thanks to i) the k-means algorithm, which minimizes the number of shared items, and ii) the efficient hybrid parallel processing of MapReduce and the multi-core CPUs, which takes into account the information extracted during the decomposition step.

Fig. 5 Comparison of the runtime of DT-DPM and the baseline parallel pattern mining algorithms using big databases

6 Conclusion

A novel distributed pattern mining framework called DT-DPM is proposed in this paper. DT-DPM aims to derive relevant patterns from big databases by studying the correlations within the transactional database and exploiting heterogeneous architectures. The set of transactions is first grouped using a clustering approach, where transactions close to each other are assigned to the same cluster. For each cluster of transactions, the pattern mining algorithm is launched to discover the relevant patterns. DT-DPM addresses the issue of cluster size by incorporating heterogeneous computing such as a single CPU, CPU cores, and MapReduce: the noise transactions are assigned to the single CPU core, the micro clusters are assigned to the CPU cores, and the dense clusters are assigned to the MapReduce architecture. DT-DPM was integrated into the SPMF tool and evaluated on five case studies (FIM, WIM, UIM, HUIM, and SPM). The results reveal that, by using DT-DPM, the scalability of pattern mining algorithms is improved on large databases. Moreover, DT-DPM outperforms the baseline pattern mining algorithms on big databases. Motivated by the promising results shown in this paper, we plan to boost the performance of DT-DPM and apply it to big data mining applications such as Twitter analysis, smart building applications, and other large-scale applications.