1 Introduction

Association rule mining (ARM) is a widely known data mining technique which was first introduced by Agrawal et al. [3] in the early 1990s. ARM was conceived as an unsupervised learning task for finding close relationships between items in large data sets. Let \(\mathcal{I} = \{i_{1}, i_{2}, i_{3}, \ldots, i_{n}\}\) be the set of items and let \(\mathcal{D} = \{t_{1}, t_{2}, t_{3}, \ldots, t_{m}\}\) be the set of all transactions in a relation. An association rule is an implication of the form \(X \rightarrow Y\), where \(X \subset \mathcal{I}\), \(Y \subset \mathcal{I}\), and \(X \cap Y = \emptyset\). The meaning of an association rule is that if the antecedent X is satisfied, then it is highly likely that the consequent Y will also be satisfied. ARM was originally designed for market basket analysis to obtain relations between products, such as \(\mathit{diapers} \rightarrow \mathit{beer}\), which describes the high probability that someone buying diapers will also buy beer. Such a relationship would allow shopkeepers to exploit it by moving the products closer together on the shelves. ARM tasks have also been applied in a wide range of domains, such as customer behavior [10], bank transactions [44], and medical diseases [38], where it is particularly important to discover, for instance, relationships between a specific disease and specific life habits.

The first algorithms in this field were based on an exhaustive search methodology, extracting all those rules which satisfy minimum quality thresholds [9, 22]. These approaches divide the ARM problem into two sub-problems: mining the items and itemsets whose frequency of occurrence is greater than a specific threshold, and discovering strong relationships between these frequent itemsets. The first step is computationally expensive since it analyzes every item within the data set. The second step requires large amounts of memory because every generated rule must be kept in memory.

As the interest in storing information grows, data sets are increasing in both the number of transactions and the number of items, thus prompting researchers to study in depth the computational and memory requirements of ARM algorithms [22]. Besides, real-world data sets usually contain numerical items, so a pre-processing step is required before mining rules with exhaustive search approaches. The problem with using numerical attributes is the huge number of distinct values they might take on, which makes the search space much bigger, increases the computational time and memory requirements, and therefore hampers the mining process.

For the sake of overcoming such drawbacks, researchers have achieved some promising results when using evolutionary algorithms [36, 40], which are becoming the most widely employed means to this end. Whereas exhaustive search algorithms for mining association rules first mine frequent items, evolutionary algorithms extract association rules directly, thus not requiring a prior step for mining frequent items. Grammar-guided genetic programming (G3P) [24] has also been applied to the extraction of association rules [37], where the use of a grammar allows for defining syntax constraints, restricting the search space, and obtaining expressive solutions in different attribute domains. Nevertheless, the capabilities of data collection in real-world application domains are still growing, hindering the mining process even when evolutionary algorithms are used. Therefore, it is becoming essential to design algorithms capable of handling very large data collections [5] in a reasonable time.

Parallel frequent pattern mining is emerging as an interesting research area, where many parallel and distributed methods have been proposed to reduce computational time [19, 25, 48]. To this end, hierarchical parallel environments with multi-core processors [2] and graphic processing units (GPUs) [6, 14] help to speed up the process. The use of GPUs has already been studied in machine learning [13, 21, 39], and specifically for speeding up algorithms within the framework of evolutionary computation [15, 23, 33] and data mining [28, 35]. Specifically, GPUs have also been successfully applied to speeding up the evaluation process of classification rules [12, 16].

In this paper, a new parallel methodology for evaluating association rules on GPUs is described. This methodology could be implemented on any ARM approach, regardless of the design and methodology of the algorithm. This issue is of special interest for mining association rules by means of evolutionary algorithms, and more specifically for G3P algorithms, which allow for mining rules in any domain. Evolutionary computation algorithms devote most of their time to the evaluation of the population, so it is especially interesting to speed up the evaluation phase in this kind of algorithm. This synergy has prompted the development of a high-performance model, which demonstrates good performance in the experimental study presented herein.

This paper is organized as follows. In Sects. 2 and 3, association rule mining algorithms and the way of evaluating them are presented, respectively. Section 4 presents some details concerning the CUDA programming model. In Sect. 5, the evaluation methodology on GPUs is proposed. Section 6 describes the experimental study. The results obtained are discussed in Sect. 7. Finally, Sect. 8 presents some concluding remarks.

2 Association rule mining algorithms

Most existing proposals for mining association rules are based on Agrawal’s proposal in [3], which is widely known as Apriori. In Apriori, the authors divided the ARM problem into two steps. First, those patterns that frequently appear in a data set are extracted. Second, using the whole set of previously mined patterns, the algorithm seeks association rules. This exhaustive search becomes impracticable as the amount of information stored in a data set increases, so the anti-monotone property was introduced: if a length-k itemset is not frequent in a data set, its length-(k+1) super-itemsets cannot be frequent in that data set. Apriori has four major drawbacks. First, the higher the number of items in a data set, the more tedious the mining of frequent patterns becomes. Second, the higher the number of frequent patterns, the more tedious the discovery of association rules becomes. Besides, Apriori works in two steps, increasing the computational time required. Finally, real-world data sets usually comprise numerical items which are hard to mine using Apriori, so a pre-processing step is used. Therefore, the number of steps required by this algorithm is even greater, thus hampering the mining process and significantly increasing execution time.

Many researchers have focused on Apriori-like algorithms with the goal of overcoming these problems. One of the most relevant algorithms to this end was FP-Growth [22]. In this algorithm, Han et al. proposed a novel way of storing the mined frequent patterns, placing them in a frequent pattern tree (FP-tree) structure. In this way, the algorithm works on the FP-tree structure instead of on the whole set of frequent patterns, employing a divide-and-conquer method to reduce the computational time. Recently, an improved version of FP-Growth was presented in [29]. This proposal, called IFP-Growth (Improved FP-Growth), uses a novel structure, called FP-tree+, and an address table to decrease the complexity of building the entire FP-tree. Both structures require less memory, and the algorithm provides better performance than the one originally proposed by Han et al.

With the growing interest in evolutionary algorithms (EAs), more and more researchers have focused on the evolutionary perspective of data mining tasks [4, 40], especially in ARM, where an exhaustive search may not be an option for huge data sets. In this way, most existing EAs for mining association rules overcome the problems of Apriori-like algorithms. First, the mere fact of using an evolutionary perspective enables the computational and memory requirements to be tackled. Furthermore, most EAs for mining association rules are based on genetic algorithms (GAs), which discover rules directly instead of requiring two steps. Nevertheless, some GAs in the ARM field still require a pre-processing step for dealing with numerical items.

Recently, a genetic programming (GP) approach was introduced for mining association rules [37]. The main difference between GAs and that proposal is their representation. Whereas GAs use a fixed-length chromosome which is not very flexible, the proposal presented by Luna et al. represents individuals by using a variable-length hierarchical structure, where the shape, size, and structural complexity of the solutions are not constrained a priori. Furthermore, this algorithm uses a grammar to encode individuals, allowing for enforcing syntactic and semantic constraints and also dealing with any item regardless of its domain.

Despite the fact that the use of EAs in the field of ARM has overcome most of these drawbacks, evaluating the mined rules is still hard work when the data set is large. The higher the number of records to be checked, the higher the evaluation time. Table 1 shows the runtime of the different phases of the evolutionary association rule mining algorithm of [37]. The evaluation phase clearly takes about 95 % of the algorithm’s runtime. Therefore, it is desirable to solve this problem, and GPUs are currently being presented as efficient and high-performance platforms for this goal.

Table 1 Runtime of evolutionary ARM phases [37]

3 Evaluation of association rules

ARM algorithms usually discover an extremely large number of association rules. Hence, it becomes impossible for the end user to understand such a large set of rules. Nowadays, the challenge for ARM is to reduce the number of association rules discovered, focusing on the extraction of only those rules that have real interest for the user or satisfy certain quality criteria. Therefore, evaluating the rules discovered in the mining process has been studied in depth by different researchers, and many measures have been introduced to evaluate the interest of the rules [43]. These measures allow ‘common’ rules to be filtered out from ‘quality’ or interesting ones.

In ARM, measures are usually calculated by means of frequency counts, so the study of each rule by using a contingency table helps to analyze the relationships between the patterns. Given a sample association rule \(X \rightarrow Y\), its contingency table is defined as in Table 2.

Table 2 Contingency table for a sample rule

On the basis of this contingency table, one can note the following.

  • X states the antecedent of the rule.

  • Y defines the consequent of the rule.

  • ¬X describes the fact of not satisfying the antecedent of the rule.

  • ¬Y defines the fact of not satisfying the consequent of the rule.

  • n(AC) is the number of records satisfying both A and C. Notice that A could be either X or ¬X, and C might be either Y or ¬Y.

  • N is the number of records in the data set studied.

Once the contingency table has been obtained, any association rule measure can be calculated. Some measures focus not only on absolute frequencies but also on relative frequencies, which are denoted p(A) instead of n(A), a relative frequency p(A) being defined as n(A)/N.

The two most frequently used measures in ARM are support and confidence. Support, also known as frequency or coverage, calculates the proportion of instances in the data set that satisfy both the antecedent X and the consequent Y. Support is defined by Eq. (1) as a relative frequency. Rules having a low support are often misleading since they do not describe a significant number of records. We have

$$ \mathit{Support}(X \rightarrow Y) = p(XY) = n(XY)/N $$
(1)

Confidence measures the reliability of the association rule. It is defined by Eq. (2) as the proportion of transactions that include both X and Y among all the transactions that include X. We have

$$ \mathit{Confidence}(X \rightarrow Y) = p(XY)/p(X) = n(XY)/n(X) $$
(2)

Most ARM proposals base their extraction process on the use of a support–confidence framework [9, 22, 47], attempting to discover rules whose support and confidence values are greater than certain minimum thresholds. However, the mere fact of having rules that exceed these thresholds is no guarantee that the rules will be of any interest, as noted by [8]. Other widely used measures to evaluate the interest of the extracted rules are lift [45], Piatetsky–Shapiro’s measure [41] and conviction [11].

Lift, or the interest measure (see Eq. (3)), calculates how many times more often the antecedent and consequent occur together in a data set than would be expected if they were statistically independent. It can also be calculated as the ratio between the confidence of the rule and the support of its consequent:

$$ \mathit{Lift}(X \rightarrow Y) = p(XY)/(p(X)p(Y)) = n(XY)N/(n(X)n(Y)) $$
(3)

The leverage measure (see Eq. (4)) was proposed by Piatetsky-Shapiro. This measure calculates the difference between the frequency with which X and Y appear together in the data set and the frequency that would be expected if they were statistically independent:

$$ \mathit{Leverage}(X \rightarrow Y) = p(XY)-p(X)p(Y) $$
(4)

Conviction is another well-known measure in the ARM field (see Eq. (5)). It was developed as an alternative to confidence, which was found not to adequately capture the direction of associations. Conviction compares the probability that X would appear without Y if they were statistically independent with the actual frequency of the appearance of X without Y. We have

$$ \mathit{Conviction}(X \rightarrow Y) = p(X)p(\neg Y)/p(X\neg Y) $$
(5)

These measures are those commonly used in ARM problems for evaluating the interest of the extracted association rules. However, there is a large variety of other measures which could also be applied, such as interest strength [20], Klosgen’s measure [30], etc. Any measure for evaluating association rules is calculated by using the contingency table, so the mere fact of calculating this table allows for any association rule to be evaluated. Nevertheless, it is also interesting not only to obtain the contingency table but also to calculate the coverage of each condition within the rule.
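To make the relationship between the contingency table and these measures concrete, the following host-side sketch derives all of them from the four cells of Table 2. It is only an illustration: the structure and function names are invented for the example, not taken from any particular implementation.

// Sketch: all the measures of this section follow directly from the four cells
// of the contingency table (Table 2). Names are illustrative only.
struct Measures { float support, confidence, lift, leverage, conviction; };

Measures computeMeasures(float nXY, float nXnotY, float nnotXY, float nnotXnotY)
{
    float N   = nXY + nXnotY + nnotXY + nnotXnotY;          // number of records
    float nX  = nXY + nXnotY;                                // records satisfying X
    float nY  = nXY + nnotXY;                                // records satisfying Y
    float pXY = nXY / N, pX = nX / N, pY = nY / N;

    Measures m;
    m.support    = pXY;                                      // Eq. (1)
    m.confidence = (nX > 0) ? nXY / nX : 0.0f;               // Eq. (2)
    m.lift       = (pX * pY > 0) ? pXY / (pX * pY) : 0.0f;   // Eq. (3)
    m.leverage   = pXY - pX * pY;                            // Eq. (4)
    m.conviction = (pX - pXY > 0)                            // Eq. (5): p(X)p(¬Y)/p(X¬Y)
                 ? pX * (1.0f - pY) / (pX - pXY) : 0.0f;
    return m;
}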

4 The CUDA programming model

Compute unified device architecture (CUDA) [1] is a parallel computing architecture developed by the NVIDIA corporation that allows programmers to take advantage of the parallel computing capacity of NVIDIA GPUs in a general purpose manner. The CUDA programming model executes kernels as batches of parallel threads. These kernels comprise thousands to millions of lightweight GPU threads per invocation.

CUDA threads are organized in a two-level hierarchy: at the higher level, all the threads in a data-parallel execution phase form a 3D grid. Each call to a kernel execution initiates a grid composed of many thread groupings, called thread blocks. Thread blocks are executed on streaming multiprocessors, as shown in Fig. 1. A streaming multiprocessor can perform zero-overhead scheduling to interleave warps (a warp is a group of 32 threads that execute together) and hide the overhead of long-latency arithmetic and memory operations. Current GPU architectures may execute up to 16 kernels concurrently as long as there are multiprocessors available. Moreover, asynchronous data transfers can be performed concurrently with kernel executions. These two features allow for speedier execution compared to the sequential kernel pipeline and synchronous data transfers of previous GPU architectures.

Fig. 1 GPU streaming processing paradigm

There are four main memory spaces: global, constant, shared, and local. These GPU memories are specialized and differ in access time, lifetime, and scope.

  • Global memory is a large long-latency memory which exists physically as off-chip dynamic device memory. Threads can read and write global memory to share data, and must write the kernel’s output to it so that the results remain readable after the kernel terminates. However, a better way to share data and improve performance is to take advantage of shared memory.

  • Shared memory is a small low-latency memory which exists physically on chip. Its contents are only maintained during thread block execution and are discarded when the thread block completes. Kernels which read or write a known range of global memory with spatial or temporal locality can employ shared memory as a software-managed cache. Such caching potentially reduces global memory bandwidth demands and improves overall performance.

  • Local memory is a per-thread memory space used mainly for register spilling. The number of registers a thread uses determines the number of threads that can execute concurrently on a multiprocessor, which is called the multiprocessor occupancy. To avoid wasting hundreds of cycles while a thread waits for a long-latency global-memory load or store to complete, a common technique is to execute batches of global accesses, one per thread, exploiting the hardware’s warp scheduling to overlap the threads’ access latencies.

  • Constant memory is specialized for situations in which many threads read the same data simultaneously. This type of memory stores data written by the host thread and does not change during the execution of the kernel. A value read from the constant cache is broadcast to all threads in a warp, effectively serving all loads from memory with a single cache access. This enables a fast, single-ported cache to feed multiple simultaneous memory accesses.

There are several recommendations for improving performance on a GPU [18]. Accesses to global memory should be coalesced. Global memory resides in the device memory and is accessed via 32-, 64-, or 128-byte segment memory transactions, so it is recommended to perform fewer but larger memory transactions. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions that are necessary, the more unused words are transferred in addition to the words accessed by the threads, thus reducing the instruction throughput.

To maximize global memory throughput, it is therefore important to maximize coalescing by following optimal access patterns, using data types that meet size and alignment requirements, or padding data in some cases, for example, when accessing a two-dimensional array. For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.
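As a simple illustration of the padding recommendation, a pitched allocation lets the CUDA runtime pad each row of a two-dimensional array so that every row starts on an aligned boundary. This is a generic sketch, independent of the evaluation model described later.

#include <cuda_runtime.h>

// Sketch: allocate a two-dimensional float array so that every row starts on an
// aligned boundary. The runtime chooses the padded row width (the pitch), which
// keeps warp accesses to consecutive elements of a row coalesced.
float *allocPadded2D(size_t widthInElements, size_t height, size_t *pitchBytes)
{
    float *d_data = NULL;
    cudaMallocPitch((void **)&d_data, pitchBytes, widthInElements * sizeof(float), height);
    return d_data;
}

// In device code a row is then addressed through the pitch, e.g.:
//   float *row = (float *)((char *)base + rowIndex * pitchBytes);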

5 GPU evaluation model

This section presents the GPU model designed to evaluate association rules on the GPU. First, parallelization opportunities are analyzed to find the best way to parallelize the computation. Then, the kernels and data structures are given in detail.

The evaluation process of association rules has been described in Sect. 3. Its purpose is to check the coverage of the conditions of the rule, hence obtaining the coverage of the antecedent and the consequent over the instances of the data set and allowing the fitness measures to be computed.

The evaluation of one association rule is independent of the evaluation of any other, because there are no dependencies between rules. Thus, the rules can be evaluated concurrently. Furthermore, the coverage of the antecedent and that of the consequent of a rule over the instances are also independent, so both can be computed concurrently for each rule. Moreover, the coverage of a single instance is also an independent operation. Therefore, the coverage of the antecedents and consequents is independent for each rule and instance, which provides a high degree of parallelism. Computing the support of the conditions of the antecedent and the consequent is also of interest when evaluating association rules, and it too can be performed concurrently for each condition and rule. Nevertheless, calculating the support of the conditions and building the contingency table for computing the fitness measures call for a reduction operation [26, 46] on the coverage results.

The GPU model designed to evaluate association rules comprises three kernels, which compute the coverage, the reduction, and the fitness. Figure 2 shows the computation flow. First, the rule (antecedent and consequent) is copied to the GPU memory. Specifically, it is copied to the constant memory, which provides broadcast to the GPU threads, taking advantage of the memory properties described in Sect. 4. Second, the coverage kernel is executed on both the antecedent and the consequent, using the instances of the data set. Third, the reduction kernel is executed and the support of the conditions is calculated. Fourth, the fitness computation kernel is executed, providing the fitness measures described in Sect. 3. Finally, the results are copied back to the host memory. The figure represents the evaluation process of an association rule using serial kernels (one after the other). However, as mentioned before, some of these computations are independent, which allows their execution to be overlapped.

Fig. 2 GPU kernel model

The CUDA compute capability 2.0 architecture introduced concurrent kernels, allowing kernel executions to be overlapped as long as there are sufficient resources available on the multiprocessors and no dependencies between kernels. Concurrent kernels are issued by means of CUDA streams, which also allow asynchronous data transfers between host and device memory, overlapping data transfers and kernel executions. The coverage kernel may overlap its own execution when concurrently evaluating the antecedent and the consequent, since these operate on different data and have no dependencies. After the coverage kernel is completed, the reduction kernel may also overlap its execution on the antecedent and consequent bitsets. Finally, the fitness kernel has no dependencies on the reduction kernel, so their executions may also overlap.

CUDA streams also enable asynchronous data transfers, which overlap memory transactions with kernel computations: there is no need to wait for kernel completion to copy data between host and device memories. In this way, the execution of the coverage kernel over the antecedent can be concurrent with copying the consequent to the GPU memory. Similarly, copying the support of the conditions back to host memory can be concurrent with the computation of the fitness kernel. Figure 3 illustrates the benefits of concurrent kernels and asynchronous transfers in the compute capability 2.0 architecture versus the old-style serial execution pipeline of previous GPU architectures. Dependent kernels must still be serialized within a stream, but kernels with independent computations may overlap their execution using multiple streams, along with asynchronous data transfers, thus saving time. A sketch of this stream layout is given after Fig. 3.

Fig. 3 GPU timeline using serial and concurrent kernels with asynchronous transfers
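The following sketch shows one possible arrangement of this stream layout. It is a minimal, hypothetical illustration: the kernel bodies are placeholders, the buffer bundle is invented for the example, and page-locked host memory (cudaHostAlloc) is assumed so that the copies are truly asynchronous.

#include <cuda_runtime.h>

// Placeholder kernels standing in for the coverage, reduction, and fitness kernels
// described below; only the stream layout is of interest here, so the bodies are empty.
__global__ void coverageK (const int *expr, unsigned char *bits, int n)                {}
__global__ void reductionK(const unsigned char *bits, int *sup, int n)                 {}
__global__ void fitnessK  (const unsigned char *antBits, const unsigned char *conBits,
                           float *fit, int n)                                          {}

// Hypothetical buffer bundle (h* = pinned host memory, d* = device memory).
struct RuleBuffers {
    int *hAntExpr, *hConExpr, *dAntExpr, *dConExpr;
    unsigned char *dAntBits, *dConBits;
    int *hAntSup, *hConSup, *dAntSup, *dConSup;
    float *hFit, *dFit;
    size_t exprBytes, supBytes, fitBytes;
    int numInstances;
};

// Two streams, one per expression: copies and kernels issued in different streams may
// overlap on compute capability 2.0 devices, matching the timeline of Fig. 3.
void evaluateRuleAsync(const RuleBuffers &b)
{
    cudaStream_t sAnt, sCon;
    cudaEvent_t conCovered;
    cudaStreamCreate(&sAnt);
    cudaStreamCreate(&sCon);
    cudaEventCreate(&conCovered);

    // Asynchronous host-to-device copies of the antecedent and the consequent.
    cudaMemcpyAsync(b.dAntExpr, b.hAntExpr, b.exprBytes, cudaMemcpyHostToDevice, sAnt);
    cudaMemcpyAsync(b.dConExpr, b.hConExpr, b.exprBytes, cudaMemcpyHostToDevice, sCon);

    // Coverage of the antecedent and the consequent may overlap across streams.
    coverageK<<<64, 256, 0, sAnt>>>(b.dAntExpr, b.dAntBits, b.numInstances);
    coverageK<<<64, 256, 0, sCon>>>(b.dConExpr, b.dConBits, b.numInstances);
    cudaEventRecord(conCovered, sCon);                      // consequent bitset ready

    // Per-stream reduction of the condition supports; the results are copied back
    // asynchronously while the other stream keeps computing.
    reductionK<<<64, 128, 0, sAnt>>>(b.dAntBits, b.dAntSup, b.numInstances);
    reductionK<<<64, 128, 0, sCon>>>(b.dConBits, b.dConSup, b.numInstances);
    cudaMemcpyAsync(b.hAntSup, b.dAntSup, b.supBytes, cudaMemcpyDeviceToHost, sAnt);
    cudaMemcpyAsync(b.hConSup, b.dConSup, b.supBytes, cudaMemcpyDeviceToHost, sCon);

    // The fitness kernel only depends on the two coverage bitsets: it waits for the
    // consequent coverage via a cross-stream event, but may still overlap with the
    // reduction running in the other stream.
    cudaStreamWaitEvent(sAnt, conCovered, 0);
    fitnessK<<<64, 128, 0, sAnt>>>(b.dAntBits, b.dConBits, b.dFit, b.numInstances);
    cudaMemcpyAsync(b.hFit, b.dFit, b.fitBytes, cudaMemcpyDeviceToHost, sAnt);

    cudaStreamSynchronize(sAnt);
    cudaStreamSynchronize(sCon);
    cudaEventDestroy(conCovered);
    cudaStreamDestroy(sAnt);
    cudaStreamDestroy(sCon);
}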

In the following subsections, the kernels and data structures are described in detail, considering the best way to facilitate coalescing, maximize occupancy, and maximize instruction throughput.

5.1 Coverage kernel

The coverage kernel interprets expressions (antecedents or consequents), which are expressed in reverse Polish notation (RPN), over the instances of the data set. The expression interpreter is stack-based, i.e., operands are pushed onto the stack, and when an operation is performed, its operands are popped from the stack and its result is pushed back on.

The kernel checks the coverage of the expressions and their conditions over the instances and stores the results in a bitset array. Each thread is responsible for the coverage of one expression over a single instance. Threads are grouped into a two-dimensional grid of thread blocks, as illustrated in Fig. 4, whose size depends on the number of expressions (width) and instances (height). The number of vertical blocks depends on the number of instances and the number of threads per block, which is recommended to be a multiple of the warp size, usually 128, 256, or 512 threads per block.

Fig. 4 Two-dimensional grid of thread blocks for the coverage kernel

The selection of the optimal number of threads per block is critical in CUDA since it limits the number of active blocks and warps in a multiprocessor, depending on the register pressure of the kernels. Table 3 shows the multiprocessor occupancy, which should be maximized, for different block sizes of the coverage kernel. The NVIDIA CUDA compiler (nvcc) reports that the coverage kernel requires 20 registers per thread, and the number of registers available per multiprocessor is limited to 32768. Therefore, the 256 and 512 threads-per-block configurations achieve 100 % occupancy, with 48 active warps and 1536 threads per multiprocessor. However, we consider 256 threads per block to be the best option, since it provides more active thread blocks per multiprocessor to hide the latency arising from register dependencies and, therefore, a wider range of possibilities for the dispatcher to issue concurrent blocks to the execution units. Moreover, it provides better scalability to future GPUs with more multiprocessors capable of handling more active blocks.
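As a back-of-the-envelope check of these figures (ignoring the register allocation granularity), the register file is the limiting resource of the coverage kernel:

$$ 256 \times 20 = 5120 \ \text{registers/block}, \qquad \lfloor 32768/5120 \rfloor = 6 \ \text{blocks}, \qquad 6 \times 256 = 1536 \ \text{threads} = 48 \ \text{warps} $$

which is the maximum number of resident warps of the architecture and hence 100 % occupancy. The 512-thread configuration reaches the same 1536 resident threads, but with only 3 active blocks per multiprocessor, which supports the choice of 256 threads per block.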

Table 3 Threads per block and multiprocessor occupancy

As noted in Sect. 4, for global memory accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size. Therefore, the bitset array employs intra-array padding to align the memory addresses to the memory transfer segment sizes [12, 27, 42].

The kernel must be called for both the antecedents and the consequents, and these executions can be overlapped and run concurrently because they are independent. Moreover, the copy of the expressions to the GPU memory is asynchronous, i.e., the data from one stream can be copied while a kernel from another stream is executing. The code for the coverage kernel is shown in Listing 1; the covers() method performs the push and pop operations to interpret the expression. A simplified sketch of such a kernel is also given after the listing.

Listing 1 Coverage kernel
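The sketch below illustrates a coverage kernel of this kind. It is not the authors’ Listing 1: the rule encoding (condition tokens in RPN held in constant memory), the byte-per-instance bitset, and all identifiers are assumptions made for the example.

#include <cuda_runtime.h>

// Hypothetical RPN token: a negative 'attr' marks the AND operator; otherwise the
// token is a numerical condition "instance[attr] <op> value" (op: 0 '<', 1 '>=', 2 '==').
struct Token { int attr; int op; float value; };

#define MAX_EXPRESSIONS 200
#define MAX_EXPR_LEN    16
#define MAX_STACK       16

// Expressions are broadcast to all threads through constant memory (see Sect. 4).
__constant__ Token d_expr[MAX_EXPRESSIONS * MAX_EXPR_LEN];
__constant__ int   d_exprLen[MAX_EXPRESSIONS];

// Stack-based interpreter: evaluates one RPN expression over one instance.
__device__ unsigned char covers(const float *instance, const Token *expr, int len)
{
    unsigned char stack[MAX_STACK];
    int top = -1;
    for (int i = 0; i < len; ++i) {
        Token t = expr[i];
        if (t.attr < 0) {                            // AND: pop two operands, push conjunction
            unsigned char b = stack[top--];
            unsigned char a = stack[top--];
            stack[++top] = (unsigned char)(a && b);
        } else {                                     // condition: compare attribute with value
            float x = instance[t.attr];
            unsigned char r = (t.op == 0) ? (x <  t.value)
                            : (t.op == 1) ? (x >= t.value)
                            :               (x == t.value);
            stack[++top] = r;
        }
    }
    return stack[top];
}

// One thread checks one expression (grid width) over one instance (grid height).
// For simplicity the bitset stores one byte per instance, padded to paddedWidth.
__global__ void coverageKernel(const float *instances, int numInstances,
                               int numAttributes, int paddedWidth,
                               unsigned char *bitset)
{
    int expr = blockIdx.x;
    int row  = blockIdx.y * blockDim.x + threadIdx.x;
    if (row < numInstances) {
        const float *instance = instances + row * numAttributes;
        bitset[expr * paddedWidth + row] =
            covers(instance, d_expr + expr * MAX_EXPR_LEN, d_exprLen[expr]);
    }
}

A launch such as coverageKernel<<<dim3(numExpressions, (numInstances + 255)/256), 256, 0, stream>>>(...) reproduces the grid layout of Fig. 4 with the 256-thread block size selected above.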

5.2 Reduction kernel

The reduction kernel performs a reduction operation on the coverage bitsets of the expressions (antecedents and consequents) to compute the absolute support of the conditions. The naïve reduction operation is conceived as an iterative, sequential process. However, there are many ways to perform this operation in parallel; in fact, NVIDIA provides six different ways of optimizing parallel reduction in CUDA. The one which performed best for reducing the bitsets of the expressions is the two-level parallel reduction illustrated in Fig. 5, which shows a 4-threaded reduction using shared memory to compute a condition’s support.

Fig. 5 Two-level parallel reduction

Threads within a block cooperate to reduce the bitset. Specifically, each thread is responsible for reducing a subset of the bitset and stores its temporary result in its corresponding position of the shared memory. Finally, a single thread performs the last reduction of the temporary results from the different threads, thus obtaining the final result. The number of reduction levels and the number of threads per block determine the efficiency and cost of the reduction: more reduction levels imply more steps, but each step reduces fewer values, and since the shared memory size is limited, the number of threads per block should not be too high. Thus, the implementation uses a two-level reduction with 128 threads per block, a trade-off that obtained the best experimental results in previous studies [12].

The threads are grouped into a two-dimensional grid of thread blocks, whose size depends on the number of expressions (width) and the number of conditions (height). Notice that the kernel must be called for both the antecedent and the consequent conditions, and these executions can be overlapped and run concurrently because they are independent. Moreover, the copying of supports back to the host memory is asynchronous, i.e., the supports of the antecedent conditions can be copied while the kernel is executing on the consequent conditions in a different stream. The reduction kernel can only be called after the completion of the coverage kernel. The code for the reduction kernel is shown in Listing 2; a simplified sketch is also given after the listing.

Listing 2 Reduction kernel
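The following sketch shows a two-level reduction of this kind. It assumes the byte-per-instance bitset layout of the coverage sketch above, one thread block per condition, and 128 threads per block; it is an illustration, not the authors’ Listing 2.

#define REDUCTION_THREADS 128

// Each block reduces the coverage bitset of one condition: blockIdx.x indexes the
// expression and blockIdx.y the condition within it (assumed, flattened layout).
__global__ void reductionKernel(const unsigned char *conditionBitset, int numInstances,
                                int paddedWidth, int *supports)
{
    __shared__ int partial[REDUCTION_THREADS];

    int cond = blockIdx.y * gridDim.x + blockIdx.x;
    const unsigned char *row = conditionBitset + cond * paddedWidth;

    // First level: each thread accumulates a strided subset of the bitset.
    int sum = 0;
    for (int i = threadIdx.x; i < numInstances; i += REDUCTION_THREADS)
        sum += row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Second level: a single thread adds up the per-thread partial sums.
    if (threadIdx.x == 0) {
        int total = 0;
        for (int t = 0; t < REDUCTION_THREADS; ++t)
            total += partial[t];
        supports[cond] = total;        // absolute support of the condition
    }
}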

5.3 Fitness kernel

The fitness kernel reduces both the antecedent and the consequent bitsets to calculate the contingency table. The reduction process is similar to that of the reduction kernel, but it stores the temporary results of the contingency table in shared memory. Finally, one thread collects the temporary results and computes the final contingency table for the rule. The contingency table allows for computing the support, confidence, lift, leverage, conviction, and any other measure for evaluating association rules.

The fitness measures are all stored in a single float array so that one larger memory transaction is performed back to host memory, avoiding multiple smaller transactions, which are known to produce inefficiencies and low memory throughput due to the high latency of the peripheral component interconnect express (PCI-e) bus.

The kernel is called only once, and the threads are grouped into a grid of thread blocks whose size depends on the number of rules. The fitness kernel has no dependencies on the reduction kernel, so they can be executed concurrently. The code for the fitness kernel is shown in Listing 3; a simplified sketch is also given after the listing.

Listing 3 Fitness kernel
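The sketch below illustrates a fitness kernel of this kind, again under the assumed bitset layout and with invented identifiers; it is not the authors’ Listing 3. One block processes one rule, reduces the antecedent and consequent bitsets into the contingency counts, and then derives the measures exactly as in the host-side sketch of Sect. 3.

#define FITNESS_THREADS 128

// One thread block per rule: reduce both coverage bitsets into contingency counts
// and compute the quality measures (Sect. 3), packed into a single float array.
__global__ void fitnessKernel(const unsigned char *antBitset, const unsigned char *conBitset,
                              int numInstances, int paddedWidth, float *fitness)
{
    __shared__ int nXY[FITNESS_THREADS];     // records satisfying X and Y
    __shared__ int nX [FITNESS_THREADS];     // records satisfying X
    __shared__ int nY [FITNESS_THREADS];     // records satisfying Y

    int rule = blockIdx.x;
    const unsigned char *ant = antBitset + rule * paddedWidth;
    const unsigned char *con = conBitset + rule * paddedWidth;

    // First level: per-thread partial contingency counts.
    int xy = 0, x = 0, y = 0;
    for (int i = threadIdx.x; i < numInstances; i += FITNESS_THREADS) {
        unsigned char a = ant[i], c = con[i];
        x  += a;
        y  += c;
        xy += a & c;
    }
    nXY[threadIdx.x] = xy;  nX[threadIdx.x] = x;  nY[threadIdx.x] = y;
    __syncthreads();

    // Second level: one thread collects the counts and derives the measures.
    if (threadIdx.x == 0) {
        int sXY = 0, sX = 0, sY = 0;
        for (int t = 0; t < FITNESS_THREADS; ++t) {
            sXY += nXY[t];  sX += nX[t];  sY += nY[t];
        }
        float N   = (float)numInstances;
        float pXY = sXY / N, pX = sX / N, pY = sY / N;
        float *out = fitness + rule * 3;                     // packed measures per rule
        out[0] = pXY;                                        // support, Eq. (1)
        out[1] = (sX > 0) ? (float)sXY / sX : 0.0f;          // confidence, Eq. (2)
        out[2] = (pX * pY > 0) ? pXY / (pX * pY) : 0.0f;     // lift, Eq. (3)
        // Leverage (Eq. (4)) and conviction (Eq. (5)) follow from the same counts,
        // exactly as in the host-side sketch of Sect. 3.
    }
}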

6 Experimental setup

This section presents the hardware configuration and the different experiments used to evaluate the performance of the proposed model.

6.1 Hardware configuration

The experiments were run on a machine equipped with an Intel Core i7 quad-core processor running at 3.0 GHz and 12 GB of DDR3-1600 host memory. It featured two NVIDIA GeForce 480 GTX video cards equipped with 1.5 GB of GDDR5 video RAM. The 480 GTX GPU comprised 15 multiprocessors and 480 CUDA cores clocked at 1.4 GHz. The host operating system was GNU/Linux Ubuntu 11.10 64 bit along with CUDA runtime 4.2, NVIDIA drivers 302.07, Eclipse integrated development environment 3.7.0, Java OpenJDK runtime environment 1.6-23 64 bit, and GCC compiler 4.6.3 (O2 optimization level).

6.2 Experiments

Three different experiments were carried out. First, the performance of the rule interpreter was evaluated. Second, the efficiency of the evaluation model was analyzed on a series of real-world data sets. Finally, the performance of the serial and concurrent kernel models was compared. In order to make a fair comparison, both the CPU and GPU codes used single-precision floating-point arithmetic.

6.2.1 Rule interpreter performance

The efficiency of rule interpreters is often reported as the number of primitives interpreted by the system per second, similarly to GP interpreters, whose performance is measured in GP operations per second (GPops/s) [7, 31, 32, 34]. In GP, interpreters evaluate expression trees, which represent solutions for performing a user-defined task.

In this experimental stage, the efficiency and performance of the interpreter is evaluated by running it on different numbers of instances and rules. Hence, a sensitivity analysis of the effect of these parameters on the speed of the interpreter is achieved.

6.2.2 Evaluation model performance

A different experiment was carried out to study the performance of the complete parallelized evaluation model, including all kernel execution times and data transfer times. In this second experiment, the number of rules to be evaluated and the number of instances were increased in order to analyze the scalability of the evaluation model with respect to these parameters and to study its behavior when using more than one GPU. To this end, a series of data sets from the University of California Irvine (UCI) repository [17] were used. The data sets chosen for the experiment covered a wide range in both the number of attributes and the number of instances.

6.2.3 Serial vs. concurrent kernels

The capability of modern GPUs to execute concurrent kernels is a major improvement for the efficient computation of independent kernels. Therefore, a profiling analysis of the serial and concurrent kernel execution on the GPUs was carried out using the NVIDIA Visual Profiler tool.

7 Results

This section describes and discusses the experimental results obtained.

7.1 Rule interpreter performance

Table 4 shows interpreter execution times and performance in terms of GPops/s. The results depicted are the averages over ten different executions. Here, the performance of the GPU model using one and two 480 GPUs is compared to that of the single-threaded and multi-threaded CPU implementations. Each row represents the interpreter performance for a different number of instances and rules. The number of GP operations interpreted (GPops) depends on the number of instances, the number of rules, and the number of conditions included in each rule.

Table 4 RPN interpreter performance

The higher the number of instances and rules to be evaluated, the larger the number of GP operations to be interpreted and, in consequence, the more execution time is required. With regard to CPU performance, the single-threaded interpreter scales linearly, its throughput remaining roughly constant at around 10 million GPops/s (GP operations per second) regardless of the number of instances and rules. The multi-threaded interpreter increases performance to around 35 million GPops/s when using the four CPU cores available in the Intel i7 processor.

The GPU implementation achieves high performance regardless of the number of rules and instances to be evaluated. The larger the number of rules and instances, the larger the number of thread blocks to compute and, consequently, the higher the occupancy of the GPU multiprocessors. Using one 480 GPU, the limit is reached at 33 billion GPops/s, whereas with two 480 GPUs the limit is reached at 67 billion GPops/s, demonstrating the great scalability of the model with both the number of GPU devices and the complexity of the problem. The resulting performance is remarkable: only 0.172 seconds are needed to interpret 200 rules over one million instances. In other words, the GPU implementation using two 480 GPUs requires less than 0.2 seconds to carry out more than 11 billion GP operations. All the results are summarized in Fig. 6, which depicts the RPN performance of two 480 GPUs in terms of GPops/s with regard to the number of instances and rules.

Fig. 6 RPN interpreter performance using 2×480 GPUs

7.2 Evaluation model performance

Table 5 shows the execution times and the speed-ups obtained for ten different real-world data sets. The results depicted are the averages over ten different executions. Each row represents the execution time and the speed-up achieved when evaluating a series of rules over data sets having different numbers of instances and attributes.

Table 5 UCI data sets evaluation performance

As far as the CPU evaluation model is concerned, it is shown that execution time increases linearly with the number of instances and rules to be evaluated (as is theoretically to be expected). Therefore, evaluating association rules over data sets having many instances becomes unmanageable, especially if the evaluation is embedded within an evolutionary algorithm, since it has to evaluate the population of each generation.

With regard to the GPU evaluation model, the execution times show little variation as the number of instances and rules increases. When using two 480 GPUs, the execution time is less than one second in most of the evaluation cases. In the extreme cases, the evaluation model requires only three seconds to evaluate 200 rules over a data set comprising more than one million instances. The scalability of the GPU model with the problem complexity and with the number of GPU devices is maintained.

Finally, focusing on the speed-up achieved when working with one 480 GPU (see Table 5), the speed-up reaches its maximum at values of around 200×. This maximum speed-up is obtained for data sets having more than 50 thousand instances. On the other hand, working with two 480 GPUs, the evaluation model achieves its maximum speed-up at values of around 400×, doubling the performance of a single 480 GPU. The maximum speed-up is obtained when all the GPU multiprocessors are fully occupied, so using multiple GPUs makes it possible to work on larger data sets with higher numbers of instances and to achieve better speed-ups. All the results are summarized in Fig. 7, which depicts the speed-ups obtained by two 480 GPUs as a function of the number of rules and instances of the data sets.

Fig. 7 GPU evaluation speed-up using 2×480 GPUs

NVIDIA Parallel Nsight software was used to debug and profile the GPU implementation, allowing the CPU/GPU execution timeline to be traced and analyzed. The results reported by the software demonstrate the efficiency of the implementation, which uses shared memory effectively, avoids bank conflicts, and provides fully coalesced memory accesses. The software also makes it possible to monitor the concurrent execution of the kernels and the asynchronous data transfers. The throughput of data transfers between host and device memory was reported to reach 6 GB/s, which is the maximum speed of the PCI-e 2.0 bus.

7.3 Serial vs. concurrent kernels

As mentioned, there is some independence among the kernels that enables their execution to be efficiently overlapped. Concurrent kernels are issued by means of CUDA streams, which also allow for asynchronous data transfers. Figure 8 shows the performance profiling of the serial and concurrent kernel execution on the GPUs obtained from the NVIDIA Visual Profiler developer tool. The timeline clearly shows the runtime saved due to concurrent execution and data transfers. First, there are host-to-device (HtoD) memory transfers, which copy the antecedent and the consequent of the rules to the GPU memory; these transfers can be issued concurrently in different streams in the asynchronous case (HtoD async). Second, the coverage kernel is executed on both the antecedent and the consequent, and their executions overlap as soon as there are GPU resources available. Third, the reduction kernel is executed on each of them. Fourth, the fitness kernel is executed, overlapping with the reduction kernel, since it only depends on the completion of the coverage kernel. Finally, memory transfers copy the results and fitness measures back to the host memory (DtoH); when streams are used, these transactions are concurrent with kernel executions. Table 6 shows the runtime for each of the kernels and for the data transfers. The concurrent kernels and asynchronous transfers model reduces the runtime significantly, from 1.49 ms to 1.08 ms, a reduction of about 27 %.

Fig. 8 GPU timeline from NVIDIA Visual Profiler developer tool

Table 6 Runtime of kernels and data transfers

8 Concluding remarks

Association rule mining was conceived as an unsupervised learning task for finding close relationships between items in large data sets. However, with the growing interest in the storage of information, high-dimensional data sets have been appearing, giving rise to high computational and memory requirements. This computational cost is mainly due to the evaluation process, where each condition within each rule has to be evaluated for each instance.

This paper describes the use of GPUs for evaluating association rules employing a novel and efficient model which may be applied to any association rule mining approach. It allows for the mined rules to be evaluated in parallel, thus reducing computational time. The proposal takes advantage of the concurrent execution of kernels and asynchronous data transfers to improve the efficiency and occupancy of the GPU.

The results of our experiments demonstrate the performance of the interpreter and the efficiency of the evaluation model by comparing the runtimes of the single-threaded and multi-threaded CPU with those of the GPU. The GPU model achieved an interpreter performance above 67 billion genetic programming operations per second (GPops/s). As for the speed-up achieved over the CPU evaluation model, a value of 454× was obtained when using two NVIDIA 480 GTX GPUs. The experimental study demonstrated the efficiency of this evaluation model and its scalability with the problem complexity, the number of instances and rules, and the number of GPU devices.