Introduction

With the growth of information, traditional analysis methods must be modified because they cannot handle immense amounts of data. Data mining algorithms are analytics that require a paradigm shift for algorithm execution and changes for deployment over nodes. Data mining algorithms must be modified so they can be executed over scalable and distributable environments. However, modifying data mining algorithms for distributed architecture is not easy. One problem with distributed architecture is data locality. This occurs when the required data for processing do not exist on the processor node.

One of the most prominent methods used to solve big data problems over shared-nothing architecture is Map-Reduce. Map-Reduce [1] is used in open-source solutions such as Hadoop and Spark. There are many products in addition to Hadoop and Spark that can be used to solve data mining problems. However, pure Map-Reduce suffers from the data locality problem and does not support iteration. Because nearly all data mining algorithms are iterative in nature, it is necessary to solve the iteration problem in Map-Reduce to solve data mining problems by using Map-Reduce-based methods.

Association rule mining is a data mining algorithm that requires iteration to find frequent itemsets. Methods such as Apriori [2] and FP-Growth [3] solve association rule mining problems. There are two main parameters for association rule mining: confidence and support. A rule in an association rule is defined as x → y (if x, then y), where x and y are itemsets. The support of a rule is defined as the frequency of an itemset in a database. The confidence of a rule is defined as

$$ Confidence\left( {x \to y} \right) = \frac{{{\text{Support}}\left( {{\text{x}} \cap {\text{y}}} \right)}}{{{\text{Support}}\left( {\text{x}} \right)}} $$

In this paper, Kavosh, which is a method for association rule mining, is proposed. This method is designed for a shared-nothing architecture and can properly solve the association rule mining problem in Map-Reduce.

This method solves the data locality problem completely. Thus, the nodes do not require data from other nodes. By changing and unifying the data format, data compression and data load balancing are supported. In addition, iteration is omitted in the proposed method; therefore, it is well-suited for the Map-Reduce architecture.

The remainder of this paper is structured as follows: In “Related works” section, related works are discussed. In the third section, Kavosh is presented. The fourth section describes an evaluation, and the final section presents the conclusions.

Related works

Map-Reduce is used to solve many problems over a shared-nothing architecture. This method is also used to solve association rule mining. There are four main problems with association rule mining on a shared-nothing architecture, namely, lack of support for the following:

  • Data locality,

  • Iteration,

  • Load balancing, and

  • Data compression.

As mentioned above, the data locality problem is a lack of required data on the processor node, which creates data dependency among nodes. This problem causes network congestion and increases the algorithm execution time. Another problem is lack of iteration support. When intermediate results are created by the Reducers, they must be fed as input to the Mappers for another iteration. This problem also increases the execution time because intermediate data must be written on the disk by the Reducers and read by the Mappers to initiate another iteration. The third problem is the load balancing problem. This occurs when the processes are not allocated to the Mappers fairly. This problem reduces the algorithm speed because all nodes must wait for a busy node after job completion. The last problem is lack of support for data compression. Because of the large amount of data, data volume reduction is necessary for more rapid iterations and intermediate result storage.

In some methods, traditional Apriori methods are used in Map-Reduce. Transactions are allocated to Mappers and frequent k-itemsets are extracted from each Mapper before the results are shuffled through combiners and the final k-itemsets are extracted according to support and confidence thresholds. In Oruganti et al. [4], Kovacs et al. [5], Li et al. [6], Mappers and Reducers are used, but in [7, 8], in addition to Mappers and Reducers, Combiners are used for better shuffling and to address performance issues.

In Lin et al [9], three methods are proposed: Single Pass Counting (SPC), Fixed Passes Combined-counting (FPC) and Dynamic Passes Combined-counting (DPC). In Apriori, each iteration must generally wait until the results of all the Reducers from the previous iteration have been generated.

The FiDoop [10] method uses three Map-Reduce phases to generate frequent itemsets. FiDoop claims to support automatic parallelization, load balancing, data distribution, and fault tolerance. ScaDiBino [11] extracts rules with the maximum length and one target field. This method omits iterations. In Yu et al. [12], the Distributed Parallel Apriori (DPA) algorithm is proposed and metadata are stored in the form of Transaction Identifiers (TIDs). In this method, a single scan of the database is required and a balanced workload among nodes is created.

Various proposed methods use FP-Growth on a shared-nothing architecture. In Li et al. [13], parallel FP-Growth is proposed for independently executing a group of tasks on a node. In Bechini et al. [14], MRAC and MRAC+ are proposed, which are Map-Reduce-based and FP-Growth-based methods, respectively, and FP-Growth was modified to overcome performance issues. In Yang et al. [15], a Hadoop-based method is proposed that uses a distributed DH-TRIE frequent pattern algorithm and tries to solve FP-Growth problems for big data. In Tlili et al. [16], a partitioning method is used to achieve load balancing problems for association rule mining. PARMA [17] creates multiple small random samples of the input data and runs a mining algorithm on them. Because it is implemented on small samples, the algorithm can be run in parallel and independently. The results are aggregated to produce the final results. In Yu and Zhou [18], Tidset-based Parallel FP-tree (TPFP-tree) and Balanced Tidset-based Parallel FP-tree (BTP-tree) are proposed. In the proposed method, a transaction identification set (Tidset) is used to provide direct access to transactions instead of a full database scan. In Moens et al. [19], two methods are proposed: Dist-Eclat and BigFIM. Dist-Eclat has three steps: finding the frequent items, generating frequent itemsets of size k and sub-tree mining. BigFIM also has three steps: generating frequent itemsets of size k, finding potential extensions and sub-tree mining. The Sequence-Growth algorithm is designed according to the concepts of the lexicographical sequence tree and the lazy mining pruning strategy and is implemented in the MapReduce framework for a distributed execution [20]. In Liang et al. [21], an algorithm for lexicographic frequent itemset generation is proposed, which claims to find the maximum information from a database regarding frequent itemsets and their respective frequencies in a single database scan.

Methods

The proposed method uses the Map-Reduce architectural structure for association rule mining. The first step in the proposed method is to convert the input data items to a Kavosh format and the second step is to extract rules with variable lengths. In this paper, the number of fields after the “if” conditions are applied is defined as the length of the rule. In the proposed method, data are converted into Kavosh format, which helps nodes execute their processes independently. This format is a unified format and other input data formats are converted to this format. The proposed unified format helps create <key, value> pairs for use in the Map-Reduce method.

• Converting input data items to Kavosh format

In this section, the Kavosh format is defined

Θ = {θ1, θ2, …, θn, ϯ}.

The input data table is denoted as Θ, the columns of the table as θ and the table key as ϯ

θk= {µ1, µ2, ..., µm}.

The distinct values of each column are denoted as µ. In this paper, it is assumed that all θ values are discrete and that a continuous θ must be converted to discrete values.

Based on the above definitions, the Kavosh format is defined as follows:

ϝ = {θ1 → μ1, θ2 → μ2,…, θn → μm}


where ϝ is the Kavosh table and each θp → µq is equal to zero or one. Each input column value is converted to a Boolean column if there are more than two values for the column.

The input data (Θ) are divided into equal segments. For fragmentation, the function ψ is used, which is defined as follows:

Ψ (ϝ, ϯstart, ϯfinish, ηk)


where ηk is the Mapper node to which the data ϝ are allocated from key ϯstart to key ϯfinish. From the previous conversion steps, a Boolean matrix is created on each Mapper. Each field ϝ can be selected as a target field. If a binary position is assigned to each field ϝ (except target fields), each row can be converted into a decimal number. Therefore, a <Key, Value> pair is created in which the Key is the decimal number of each row (Rule-Key) and the Value is the decimal value of the target field. In the next step, the Kavosh triple is created, which is defined as follows:

<Rule-Key, Rule count, Target field count with the specified value>

According to the generated <Rule-Key, Target fields>, the count of each Rule-Key and target field with the specified value is calculated and the Kavosh triple set is created. To convert Mapper data into the Kavosh triple and Rule-Key aggregation, the function Ϯ is used.

Ϯ(ηk, ϼk)


where ϼk is the result of computation of the input data. After calculating ϼk for each Mapper, all ϼk are reduced to ϼReducer by the function £.

£(ϼ1, ϼ2,…, ϼp, ϼReducer, ϧ)


where ϧ is the Reducer node and p is the number of Mappers.

• Rule extraction

In this section, a Map-Key is added to the Rule-Key (ϻ). The Map-Key is used for rule generation, and its length in binary format is equal to the Rule-Key. Here, ϼReducer is sent to the rule generation nodes. Each rule generation node has one or more Map-Keys. The function ϖ allocates ϼReducer to each rule generation node.


where ι is the Map-Key, q is the number of Map-Keys allocated to a rule generation node, and Ϧ is the rule generation node.

ϖ (ϼReducer, ι1, ι2,…, ιq ,Ϧr)

Each Map-Key is concatenated to all Rule-Keys. If the length of the Rule-Key in binary format is equal to L, then there are 2L − 1 Map-Keys. They start from zero to 2L − 1 and are concatenated with the Rule-Keys to create various condition combinations of fields ϝ. The concatenation (ϒ) of the Map-Key with the result of the “logical and” (ϰ) of the Map-Key and the Rule-Key is called MHB_Key, and τ is the MHB_Key.

τ = ϒi, ϰ (ϻ, ιi))

There are two important factors for rule extraction in association rule mining: confidence and support. In this paper, Д (τ) is defined as the frequency of a specified MHB_Key (τ) and ζ (τ, σ, ρ) is defined as the frequency of a specific decimal value (ρ) that is equal to the target field (σ) for a specified MHB_Key (τ). In addition, ς is the total number of generated Kavosh triple sets. Thus, the support ( ) is calculated using the following formula:


Moreover, the confidence ( ) is calculated using the following formula:


After and have been calculated, they are compared with the defined thresholds for the association rules and the final results are produced. The rules are extracted using Д.


where α is the support threshold and β is the confidence threshold. Each rule generation node creates its final result separately because the results do not have common rules with other nodes. Figure 1 shows the Kavosh architecture. It consists of three layers: the Mapper layer, the Reducer layer and the rule generation layer.

Fig. 1
figure 1

Kavosh architecture

For clearer illustration, pseudocodes are provided. Table 1 lists the functions used in the code.

Table 1 Pseudocode functions

The following code shows the binominal conversion.

figure f

The following pseudocode shows the Mapper’s functionality:

figure g

In the reduce phase, the EXEC() function is repeated to obtain the final intermediate results layer. The following pseudocode shows the Reducer phase:

figure c

After the extraction of all the rules with a maximum length in the intermediate results layer (Fig. 4), a mapping field is added to these rules. Each rule is converted to 2n − 1 rules, where n is the number of input data item fields. The new rule value is equal to the concatenation of the mapping field and the result of the logical And between the mapping field and the rule value.

figure d

Suppose the input data are as described in Table 2, where New Service is the target field and the data are divided between two Mappers.

Table 2 Sample input information

Table 3 shows the data after conversion to a binominal format. Table 4 shows the Mapper results. The rules are now combined based on their keys in the Reducer layer. As shown in Table 5, the rules are combined in this phase, and their rule count and rule summation values are recalculated. Then, the mapping fields are added to the rules. The mapping fields are sorted to prevent future shuffling among the Mappers. As the input item, the field count is equal to three, and seven mapping fields (23 − 1) must be added to the rules. Table 6 shows this stage.

Table 3 Binominal format
Table 4 Mapper results
Table 5 Reducer results
Table 6 Mapping field addition stage

The rule sum and count are recalculated, and the same key rules are combined, as depicted in Table 7. In the next stage, confidence and support are calculated. Table 8 shows the results of this stage. As mentioned earlier, confidence values must be calculated for two values: zero and one. Finally, the Reduce stage of the rule generation layer filters rules according to thresholds for support and confidence. If support is greater than 5% or confidence is greater than 95%, the Reducer results are as shown in Table 9. As shown in Table 9, some rows have confidence equal to 0%. These rows have confidence equal to 100% for a zero value. In other words, for rule filtering with confidence greater than 95%, it is necessary to consider confidence lower than 5% (100 − 95%). Kavosh can be summarized as follows:

Table 7 Rule sum and count recalculation
Table 8 Support and confidence results
Table 9 Reducer results
  • The input data are converted to a binominal format.

  • The converted data are distributed among the Mappers.

  • The results of the Mappers are sent to the Reducer layer.

  • Mapping fields are added to the results of the Reducer layer, and MHB_Keys are generated.

  • The Reducer results are sent to the Rule generation layer.

  • Confidence and support parameters are applied to the extracted rules to create the final results.

Evaluation

The proposed method is evaluated on the TPC-DS dataset and a real-world dataset. “The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark” (http://www.tpc.org/tpcds).

The proposed method can use DBMS on each node independently. PostgreSQL is used as DBMS on each node. PostgreSQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads (https://www.postgresql.org/about/).

To achieve higher performance, an In-Memory database is used. Redis is an open-source (BSD licensed), in-memory data structure store, which is used as a database, cache and message broker (https://redis.io/).

TPC-DS

To implement the Kavosh method, Hadoop 2.7.3 is used. Ubuntu 16.04 is installed on each node. Parts of Hadoop were modified for Kavosh implementation. As described above, there are three layers of nodes in Kavosh: The first layer consists of the Mapper nodes, the second layer consists of the Reducer node, and the third layer is the rule generation layer. The DataNodes of Hadoop are used for these three layers. A MetaNode is added to the Hadoop NameNode to maintain the type information of each group of nodes. Because of the Kavosh data format and the independence of each node, it is possible to use a database management system for each data node. Therefore, PostgreSQL 9.6.1 is used for the Mapper nodes. A Kavosh converter exists on the Mapper nodes to convert the input data format into the Kavosh format. To achieve faster computation speed, an in-memory database is used for the Reducer node and the rule generation nodes. Redis 3.2.5 is used as an in-memory database on these nodes with AOF disk persistence for the Reducer and disk persistence is disabled for the rule generation nodes. Table 10 shows DBMS for the various node types.

Table 10 DBMS for the various node types

Figure 2 shows Kavosh architecture using Hadoop.

Fig. 2
figure 2

Using Hadoop to implement Kavosh

For data generation, TPC-DS_Tools_v2.8.0 was used. Four tables (Customer_demographics, Store_sales, Web_sales, and Item) were used for association rule mining. Figures 3 and 4 show the ER-Diagrams.

Fig. 3
figure 3

Store sales diagram

Fig. 4
figure 4

Web sales diagram

A database view is created over four mentioned tables. This view contains customer information and items that each customer bought. Figures 5, 6, 7 and 8 show the table structures.

Fig. 5
figure 5

Store_sales

Fig. 6
figure 6

web_sales

Fig. 7
figure 7

Items

Fig. 8
figure 8

Customer_Demographics

The Kavosh method extracts market basket analysis (MBA) rules for items on various sell channels. The frequent items with a maximum length of twenty are extracted. In other words, twenty related items are extracted. The scale factor is 100 TB and SF = 100,000. Table 11 shows the number of rows in each data table.

Table 11 Number of rows in each data table

Tables 12 and 13 show the node specifications for the Kavosh method. A total of 86 nodes are used for evaluation. Seventy nodes were used as the Mapper nodes, one node was used as the Reducer node and fifteen nodes were used as the rule generation nodes. The input data must be converted to the Kavosh format and it requires 6740 s for the Kavosh converter to convert the input data to the Kavosh format. Each Mapper node simultaneously converts approximately 1.5 TB of data.

Table 12 Mapper node specifications
Table 13 Reducer and rule generation node specifications

The Kavosh method is compared with FiDoop [10] and Sequence-Growth [20]. Table 14 shows the node specifications for the FiDoop and Sequence-Growth methods.

Table 14 Nodes for FiDoop and Sequence-Growth

The total CPU, RAM and HDD used for Kavosh and other methods are equal.

To evaluate the three methods, frequent items for MBA of web sales and store sales are extracted. The association rule parameters are shown in Table 15.

Table 15 Association rule parameters

They are compared with one another in three dimensions: execution time, data compression and load balancing.

Execution time

The method proposed by FiDoop requires three Map-Reduce cycles to generate the final results. The execution time for each cycle is almost the same, and in each cycle, a full database scan is required. Sequence-Growth requires twenty Map-Reduce cycles because it generates each frequent item in a Map-Reduce cycle. However, the execution time decreases each cycle in comparison with the previous cycle because the number of frequent items decreases. Nevertheless, Kavosh requires a Map-Reduce cycle on compressed data and a rule generation phase. According to the data format, all nodes can complete their processes independently. In addition, Kavosh uses an in-memory database in the Reducer and rule generation nodes, which causes a reduction in I/O time. According to the above description, Table 16 shows the execution time for each method.

Table 16 Execution details

Figure 9 shows the total execution time for each method.

Fig. 9
figure 9

Association rule mining execution time

Compression

In this part of the evaluation, the Kavosh data volume is compared with those of FiDoop and Sequence-Growth. In FiDoop and Sequence-Growth, no compression is performed on the input data, whereas the data are compressed in Kavosh due to the conversion of the input data to the Kavosh data format. The compression rates are different: if fields have fewer discrete values, the compression rate reaches ninety percent, but Kavosh compresses data by an average of approximately 57%. The data compression in the Kavosh data format is due to the conversion of string and number values to the bit data type. In the Kavosh method, no data replication or collocation is required for performance issues. In addition, data compression helps Kavosh retrieve more data in the data retrieval phase. Table 17 shows the memory usage for different methods.

Table 17 Memory usage

Load balancing

In this section, the load balancing of various methods is investigated. To calculate the load balance, the length of time during which a processor has a CPU usage of over eighty percent while other CPUs have CPU usages of under thirty percent is calculated. In this paper, this measure is called Balance_Factor. Using this definition, the results in Table 18 were obtained.

Table 18 Hadoop configuration

Real traffic data of a mobile operator

In this section, the proposed method is evaluated using real traffic data of a mobile value-added service (VAS) provider. The evaluation data include more than 60,000,000 subscribers, together with their services. The subscribers used 3000 services and all the subscribers’ activities were considered in the evaluation. The mobile operator needs accurate information about their customers’ favourites to be able to provide appropriate VAS suggestions. Due to the high volume of information, existing association rule algorithms would take a long time to execute and, therefore, are not suitable. In this evaluation, high-dimensional data items were used. If all the combinations of rules were applied, 23000 − 1 mapping fields would have to be added to each rule that is generated in the rule generation layer. Such a large number of rules is neither necessary nor calculable. Based on additional information obtained about the services, a maximum of ten services is required for recommending VAS. Thus, rules with a maximum length of ten must be generated.

The input data format is as shown in Table 19. The first column is the subscriber mobile number and the second column is the service’s ID.

Table 19 Input data format

We converted Table 18 to a binominal format. Finally, a <Key,Value> pair is created. The Key is the mobile number, and the value is a bit stream with a length of 2999 (Number of VAS services − 1 (Target Field)). Kavosh was deployed on the hardware nodes with the specifications detailed in “TPC-DS” section.

In the first step, the customer service information table is converted to the Kavosh format. In this phase, one service is considered as a target field. The relationships between this service and other services are extracted as rules. The customer service information table is distributed over 70 nodes. The conversion time is 1200s. Fifty servers are used for the Mapper, one server is used for the Reducer and sixty servers are used for the rule generation layer.

Table 20 presents the parameters that are used for the association rule algorithm.

Table 20 Association rule parameters

Table 21 shows execution details for the three methods.

Table 21 Execution details

Figure 10 shows the total execution time for each method.

Fig. 10
figure 10

Association rule mining execution time

Compression

Table 22 shows the memory usage for the three methods.

Table 22 Memory usage

Load balancing

Table 23 shows the values of Balance_Factor for the three methods

Table 23 Load balancing

Scalability

In this section Kavosh scalability is evaluated. Different data sizes(TPC-DS) and various number of nodes are investigated and the results are depicted as Fig. 11. The results show that by adding more nodes and scale out data over nodes less time is required for data processing.

Fig. 11
figure 11

Kavosh scalability

Conclusion

In this paper, the Kavosh method is proposed. Kavosh is a Map-Reduce-based association rule mining method that can manage an immense amount of data. This method converts input data into the Kavosh format, which is a unified and compressed format. Using the Kavosh data format, data can be distributed over nodes, and nodes do not require additional data from other nodes. Kavosh also compresses input data and facilitates data management. The main advantage of the proposed method is the omission of iterations for extracting rules. This is a substantial achievement because writing intermediate results to a disk and retrieving them for the next iteration is the most time-consuming portion of association rule mining. Another important achievement of the proposed method is the balanced process for each node. Because of the unified data format, each node is allocated the same number of rows in the homogenous nodes; thus, the process is equally divided among the nodes. By modifying Hadoop, the proposed method was implemented. Kavosh was compared with prominent association rule mining methods and achieved a faster execution time.

Kavosh is not suitable for high-dimensional data because creating all the input field combinations requires many nodes in the data generation layer. However, Kavosh is completely usable for normal data (not high-dimensional data) and can generate rules of a predefined length if other rule lengths are not required.

Data format unification can be applied to problems in other fields. This method can be used in data warehouse like Aras and Atrak methods [22, 23], graph processing [24], integrating multidimensional data sources [25] and specific problems like finding patient similarity [26]. For future works, this method can also be used for interactive query processing, online data mining and stream processing.