1 Introduction

Association rule mining (ARM), an important branch of data mining, has been extensively used in many areas since Agrawal first introduced it in 1993 [1]. In general, ARM can be seen as a method for discovering groups of items that co-occur with high frequency. In contrast to data mining methods built on statistical models, ARM can extract possible relationships between variables from huge datasets with little prior knowledge, relying only on co-occurrence features. When applied to biomedical data, ARM can yield rules that provide a better understanding of biological associations among different covariates or between covariates and response variables. Bioinformatics techniques have been developing at increasing speed, and so have several high-throughput biotechnologies, such as genomic microarrays and Next Generation Sequencing (NGS). High-dimensional data, i.e., data with anywhere from a few dozen to many thousands of dimensions, are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once; such a dataset most commonly corresponds to the contents of a single database table or a single statistical data matrix. It follows that a central concern among researchers has been the efficient and effective discovery of latent information underlying huge amounts of data, and ARM has been extensively applied as a possible solution to this problem. A typical application of ARM to such high-throughput datasets is gene association analysis (GAA) [2, 3], in which the goal is to exploit the relationships among different genes based on their corresponding expression levels.

Data from these high-throughput techniques often share in common the feature of high dimensionality. For example, gene expression data typically take the form of an \(N \times M\) matrix, where each row of the matrix represents a sample, and each column corresponds to the expression level of a certain gene. The number of genes in a given study can be in the thousands, while the number of specimens is generally dozens or hundreds.
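
To make the connection to ARM concrete, the sketch below (an illustration under our own assumptions, not taken from any cited study) shows how such an \(N \times M\) expression matrix might be discretized into transactions of "gene is highly expressed" items, the form required by frequent itemset mining; the gene names, values and threshold are hypothetical.

import numpy as np

# Hypothetical expression matrix: 4 samples (rows) x 5 genes (columns).
genes = ["G1", "G2", "G3", "G4", "G5"]
expr = np.array([
    [8.2, 1.1, 7.9, 0.4, 6.5],
    [7.8, 0.9, 8.3, 0.2, 0.8],
    [1.2, 6.7, 7.5, 5.9, 6.1],
    [0.9, 7.1, 1.0, 6.3, 5.8],
])

# Discretize: a gene becomes an item of a sample's transaction
# when its expression exceeds a (hypothetical) threshold.
threshold = 5.0
transactions = [
    {g for g, value in zip(genes, row) if value > threshold}
    for row in expr
]
# e.g. transactions[0] == {"G1", "G3", "G5"}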

Such high dimensionality is also characteristic of other kinds of biomedical datasets, e.g., Operational Taxonomic Unit (OTU) abundance datasets with additional environmental factors at different levels in metagenomics analysis [4], as well as multiple datasets from The Cancer Genome Atlas (TCGA) project, including mRNA/miRNA expression data and Copy Number Variation (CNV) data [5].

Given the high dimensionality of such datasets, traditional association rule mining methods applied directly to them can yield unsatisfactory performance [6]. To cope with the challenges posed by high-dimensional datasets, multiple specialized algorithms have been proposed over the last decade.

2 Mining algorithms on high-dimensional datasets

2.1 Basic association rule mining algorithms

Apriori, the first ARM algorithm, was proposed by Agrawal [7]; it successfully reduced the size of the search space with the downward-closure Apriori property, which states that a \(k{\text{-itemset}}\) can be frequent only if all of its subsets are frequent. The Apriori algorithm is characterized by candidate generation, whereby \(\left( {k+1} \right){\text{-candidate-itemsets}}\) are generated iteratively by combining any two frequent \(k{\text{-itemsets}}\) that share a common prefix of length \(\left( {k{\text{-}}1} \right)\). The support of each candidate itemset is then computed to determine whether the candidate is frequent. The algorithm terminates when no more frequent itemsets can be generated.
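
These two steps can be sketched as follows (a toy illustration only, not a faithful reimplementation of [7]; candidates are formed here by joining any two frequent \(k{\text{-itemsets}}\) whose union has \(k+1\) items and then applying downward-closure pruning, which yields the same candidate set as the prefix-based join).

from itertools import combinations

def apriori(transactions, min_support):
    # Frequent 1-itemsets (transactions are sets of items).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: join pairs of frequent k-itemsets, then prune
        # candidates containing an infrequent k-subset (downward closure).
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(union, k)
            ):
                candidates.add(union)
        # Support counting: one additional scan over the dataset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

For example, apriori(transactions, min_support=2) applied to the hypothetical transactions above returns every frequent itemset together with its support count.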

Based on the standard Apriori algorithm, several improved variations have been proposed. The performance-enhancing strategies include the hashing technique [8], partitioning technique [9], sampling approach [10], dynamic itemset counting [11], and incremental mining [12]. As previous studies demonstrated, these Apriori-based approaches achieve good performance when the dataset is sparse and the patterns discovered are short. However, such methods suffer from the nontrivial costs of generating huge numbers of candidate itemsets and of extra scans over the dataset for support computation.

New algorithms then emerged, such as FP (frequent pattern)-growth, which requires no candidate itemsets [13]. First, an FP-tree, which retains the association information of the itemsets, is constructed according to the frequencies of the 1-itemsets. Next, patterns of increasing length are generated recursively by concatenating each suffix pattern with the frequent patterns found in its conditional FP-tree, the subtree consisting of the prefix paths in the FP-tree that co-occur with that suffix. In other words, this method involves only frequent-pattern growth instead of Apriori-like generation-and-test. In this sense, it applies a partitioning-based divide-and-conquer strategy, and efficiency studies demonstrated that this method substantially reduces search time. Subsequently, multiple algorithms were proposed as extensions to the FP-growth approach, such as generating frequent itemsets in a depth-first manner [14], mining with a devised hyperstructure [15], pattern-growth mining by traversing an FP-tree-like structure in both directions (top-down and bottom-up) [16], and pattern-growth mining with a tree structure in an array-based implementation [17, 18]. Recursive searches on the tree can incur enormous costs in certain cases; nevertheless, such methods laid a solid foundation for the further application of tree structures in association rule mining algorithms.
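
The construction step can be sketched compactly (an illustrative toy under our own naming, not the implementation of [13]; the recursive mining of conditional FP-trees is omitted for brevity).

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Count 1-itemset frequencies and keep frequent items, most frequent first.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    order = [i for i, c in sorted(freq.items(), key=lambda x: -x[1])
             if c >= min_support]
    root = FPNode(None, None)
    header = defaultdict(list)   # item -> nodes, used later to collect prefix paths
    for t in transactions:
        node = root
        # Insert the transaction's frequent items in global frequency order.
        for item in [i for i in order if i in t]:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header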

Apriori and FP-growth both adopt a horizontal data format for mining frequent itemsets. In contrast, Zaki proposed the Equivalence CLAss Transformation (Eclat) algorithm, which employs a vertical data format [19]. Eclat also utilizes Apriori-style generation of \(\left( {k+1} \right){\text{-itemset}}\) candidates. In Eclat, however, the support of a candidate can be computed by simply intersecting the sample-id sets of the corresponding frequent \(k{\text{-itemsets}}\). More simply stated, the support of any itemset can be obtained directly from its vertical sample-id set without any further computation, so additional scans of the original dataset are avoided, again reducing search time.
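
The vertical idea can be sketched in a few lines (illustrative only, not Zaki's implementation; the support of each extension is read directly off the size of the intersected sample-id set).

def eclat(vertical, min_support, prefix=frozenset(), results=None):
    # `vertical` maps item -> set of sample ids containing that item.
    if results is None:
        results = {}
    items = sorted(vertical)
    for i, item in enumerate(items):
        tids = vertical[item]
        if len(tids) < min_support:
            continue
        itemset = prefix | {item}
        results[itemset] = len(tids)   # support = size of the tid-set
        # Conditional vertical database, built by tid-set intersection.
        suffix = {other: tids & vertical[other] for other in items[i + 1:]}
        eclat(suffix, min_support, itemset, results)
    return results

# Example: eclat({"G1": {0, 1}, "G3": {0, 1, 2}, "G5": {0, 2}}, min_support=2)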

Generally speaking, finding all frequent itemsets of a specific dataset can be regarded as a process consisting of search space traversal, itemset support computation, and search path pruning. A common strategy of traversing the search space includes breadth-first search (BFS). For example, in Apriori, the frequent \(\left( {k+1} \right){\text{-itemset}}\) is not generated until all frequent \(\left( {k+1} \right){\text{-itemsets}}\) have been discovered. Another common strategy is depth-first search (DFS). For example, in FP-growth, longer frequent patterns are generated recursively until no more can be done. Common strategies for support computation include counting (e.g., Apriori, FP-growth) and intersection (e.g., Eclat). In sum, the methodologies adopted by the three basic association rule mining algorithms described (Apriori [7], FP-growth [13] and Eclat [19]) serve as landmarks for the development of association rules mining and constitute the basis for subsequent association algorithms.

2.2 Maximal frequent itemset mining and frequent closed itemset mining

According to the Apriori property, it is obvious that the \({2^k}{\text{-}}2\) subsets (excluding itself and \(\varphi\)) of a certain frequent \(k{\text{-itemset}}\) are also frequent. This characteristic results in massive, unnecessary redundancy among frequent itemsets. To limit this redundancy, two alternative concepts were advanced, namely maximal frequent itemset mining and frequent closed itemset mining [20].
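
For clarity, the two notions can be stated formally (standard definitions, paraphrased here rather than quoted from [20]): given a minimum support threshold \(\sigma\), a frequent itemset \(X\) is maximal if it has no frequent proper superset, and closed if it has no proper superset with the same support, i.e.,

\[ X \text{ is maximal} \iff {\text{supp}}(X) \ge \sigma \;\wedge\; \nexists\, Y \supsetneq X: {\text{supp}}(Y) \ge \sigma, \]

\[ X \text{ is closed} \iff {\text{supp}}(X) \ge \sigma \;\wedge\; \nexists\, Y \supsetneq X: {\text{supp}}(Y) = {\text{supp}}(X). \]

Every maximal frequent itemset is also closed, so the maximal itemsets form the more compact, but lossier, of the two representations.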

Many algorithms were developed to mine these two categories of itemsets. For example, MaxMiner, the very first study on maximal frequent itemset mining, was proposed in 1998 [21]. Based on Apriori, MaxMiner adopted a breadth-first search (BFS) strategy and reduced the search space by both superset and subset frequency pruning. As another efficient maximal frequent itemset mining method, MAFIA improved support counting efficiency by adopting vertical bitmaps to compress the transaction id list [22].

For frequent closed itemset mining, numerous methods have been proposed since 1999, when A-Close, an Apriori-based frequent closed itemset mining approach, was reported [23]. CLOSET explored frequent closed itemset mining based on the FP-tree structure [24]. Another typical frequent closed itemset mining approach is CHARM, which adopts a hybrid search strategy, the diffsets technique (a compact form of \(tID\) list information), and a hash-based "non-closed itemset disposal" approach to enhance both computation and memory efficiency [25]. In addition, AFOPT-close can adaptively use three different structures, including arrays, the AFOPT-tree (FP-tree-like) and buckets, to represent conditional databases according to their respective densities [26]. To integrate previous effective approaches and some newly developed techniques, CLOSET+ was proposed [27]. After thorough performance studies on diverse datasets, CLOSET+ was considered one of the most efficient methods at the time.

Previous studies have shown that algorithms of these two categories are usually more efficient than their predecessors. However, maximal frequent itemsets have a critical defect in that the supports of their subitemsets may differ from their own. This, in turn, results in extra scans over the dataset for support computation and makes them ultimately unfit for rule extraction. Frequent closed itemset mining does not encounter such problems, essentially because the support of any frequent itemset equals that of its closure, i.e., the smallest closed itemset containing it, so the supports of all frequent itemsets can be recovered without rescanning the data. Furthermore, frequent closed itemsets can be regarded as a compressed form of the complete set of frequent itemsets without information loss. Based on these features and properties, we can conclude that frequent closed itemset mining is more likely to play a vital role in the development of association rule mining.

In summary, frequent closed itemsets can provide analytical power equivalent to that of the complete frequent itemsets, but with a much smaller size. Numerous approaches have verified the higher efficiency and better interpretability obtained by frequent closed itemset mining. However, most of the above-mentioned algorithms adopt a column-enumeration strategy; when such approaches are applied to high-dimensional datasets, the search space tends to expand exponentially with the number of features, making the computational cost prohibitive. It is therefore easy to see why most of the algorithms discussed thus far cannot be applied to high-dimensional datasets, again underscoring the need to develop algorithms applicable to high-dimensional datasets to keep pace with advancements in sequencing and computer technology.

2.3 Algorithms applicable to high-dimensional datasets

From previous sections, we can see that the applicability of the above-mentioned algorithms to high-dimensional datasets is limited. In this section, we will discuss approaches better able to meet this challenge.

Among them, approaches incorporating frequent closed itemset mining and row enumeration can serve as a possible solution. This idea was first explored in 2003 with CARPENTER [28]. Based on data in vertical format, CARPENTER constructs a row-enumeration tree and adopts a depth-first search (DFS) strategy to traverse it. Additionally, several pruning strategies are employed during the search to cut off branches incapable of generating frequent closed itemsets. A previous study [28] showed that CARPENTER outperformed rivals such as CHARM [25] and CLOSET+ [27] when applied to high-dimensional microarray datasets.
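
The underlying idea can be sketched as follows (a deliberately simplified, unpruned illustration under our own naming, not CARPENTER itself): rowsets are enumerated depth-first over the transposed data; the intersection of the rows in a rowset is always a closed itemset, and its support is recovered as the size of the largest rowset producing it.

def row_enumerate(rows, min_support, start=0, rowset=(), results=None):
    # rows: one item set per sample (the vertical, transposed view of the data).
    if results is None:
        results = {}
    for r in range(start, len(rows)):
        new_rowset = rowset + (r,)
        common = rows[new_rowset[0]]
        for idx in new_rowset[1:]:
            common = common & rows[idx]
        if not common:
            continue                     # nothing shared, no need to go deeper
        if len(new_rowset) >= min_support:
            key = frozenset(common)
            # The largest rowset producing `common` gives its support.
            results[key] = max(results.get(key, 0), len(new_rowset))
        row_enumerate(rows, min_support, r + 1, new_rowset, results)
    return results

# Example: row_enumerate([{"G1","G3"}, {"G1","G3","G5"}, {"G3","G5"}], 2)
# yields {G1,G3}: 2, {G3}: 3, {G3,G5}: 2, i.e. the frequent closed itemsets.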

Other methods adopting similar strategies were developed. For instance, RERII resembles CARPENTER, but instead of searching for frequent itemsets starting from the whole original dataset, RERII explores frequent closed itemsets in the opposite direction, starting from the nodes that represent the complete rowsets [29]. This strategy has the potential to enhance overall performance by reducing the cost of searching short rowsets and 1-item rowsets.

To make CARPENTER more adaptable to more complex datasets, COBBLER integrated the strategies of both CARPENTER and CLOSET+ [30]. Accordingly, COBBLER can dynamically switch between row enumeration and column enumeration according to estimated cost. Its efficiency has been verified in experiments on datasets with high dimensionality and a relatively large number of rows.

TD-CLOSE adopts a top-down row-enumeration search strategy, which supports stronger pruning than the bottom-up style adopted by CARPENTER [31]. To guarantee closedness during the mining process, an additional closeness-checking method is included in TD-CLOSE. Moreover, in 2009, an improved version of TD-CLOSE, called TTD-CLOSE, was proposed [32]. With an optimized pruning strategy and data structure, TTD-CLOSE obtained better performance than the original TD-CLOSE.

To extend the applications of frequent pattern mining, new classification methods based on ARM over high-dimensional datasets, such as FARMER and TOPKRGS, emerged [33, 34]. With class information attached to the original datasets, both algorithms can extract classification rules of the form \(X \Rightarrow C\), where \(C\) is a class label and \(X\) is a set of items. As discussed above, a rule extracted from a frequent closed itemset consisting of \(k\) items in total also implies the existence of \({2^k}{\text{-}}2\) other rules. To reduce rule redundancy, these two algorithms extract only "interesting" rule groups instead of all rules. Specifically, FARMER adopts the concept of a rule group, consisting of a unique upper-bound rule and a set of lower-bound rules, to cluster the complete set of rules, whereas TOPKRGS selects only the most significant top-k covering rule groups. In addition, FARMER reinforces its interestingness measures with the Chi-square statistic alongside support and confidence, while TOPKRGS adopts a prefix-tree structure to speed up frequency computation and utilizes a dynamic minimum-confidence generation strategy to better fit different datasets.
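
To illustrate the scale of this redundancy, take \(k=3\): a rule \(\left\{ {a,b,c} \right\} \Rightarrow C\) implies the \({2^3}{\text{-}}2=6\) further rules \(\left\{ a \right\} \Rightarrow C\), \(\left\{ b \right\} \Rightarrow C\), \(\left\{ c \right\} \Rightarrow C\), \(\left\{ {a,b} \right\} \Rightarrow C\), \(\left\{ {a,c} \right\} \Rightarrow C\) and \(\left\{ {b,c} \right\} \Rightarrow C\), one for each non-empty proper subset of the antecedent; a rule group summarizes such a collection by its upper- and lower-bound rules only.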

To enhance mining efficiency, HDMiner employs effective search-space partitioning and pruning strategies. HDMiner gradually narrows down the search space by pruning false-valued cells based on a space-partition tree, instead of accumulating true-valued cells as FP-tree- or enumeration-tree-based methods do. Because dense datasets contain far fewer false-valued cells than true-valued cells, HDMiner works much more efficiently than the FP-tree- or enumeration-tree-based methods [35]. HDMiner shows its superiority especially on synthetic data and dense microarray data.

To summarize, previous studies have verified the relatively high efficiency of row-enumeration algorithms for mining frequent closed itemsets over high-dimensional datasets. However, with the advancement of biomedical data acquisition techniques, data volumes have grown ever larger, and the row size of a given dataset may therefore become as large as its column size. In this case, methods such as COBBLER or TD-CLOSE, described above, may still have trouble handling such large datasets. Consequently, attention has increasingly shifted from sequential algorithms to parallel and distributed algorithms. In fact, parallel association rule mining algorithms were proposed as early as the 1990s [36, 37]. However, because their effectiveness was challenged by complicated strategies for workload balancing, fault tolerance and data distribution, as well as interconnection costs and the limited computer hardware capacity of that time, extensive application of such algorithms was suppressed. More recently, buoyed by the enthusiasm for cloud computing and distributed computing techniques, parallelized association rule mining algorithms have been revived and given the opportunity to show their power. Specifically, as the most widely recognized large-scale data analysis technique, Hadoop has been broadly utilized in modern biomedical studies [38, 39]. Characterized by mapper and reducer functions [40], the Hadoop MapReduce framework is especially good at processing gigabytes, or even terabytes, of data. Moreover, by hiding the details of the underlying controls, Hadoop allows users to concentrate on algorithm design. All of these features make Hadoop a promising candidate to propel the development of ARM in the "big data" era.

Typical examples of adapting Apriori to Hadoop MapReduce include SPC (Single Pass Counting), FPC (Fixed Pass Combined-Counting) and DPC (Dynamic Pass Counting) [41]. These algorithms share the common procedure of distributing data to different mappers and counting supports in parallel, but differ in candidate generation. SPC generates frequent itemsets of only a single length after each MapReduce phase, whereas FPC and DPC generate frequent itemsets of several lengths per phase. In addition, as the names suggest, the number of passes combined into each phase is a fixed parameter in FPC, while in DPC it is determined dynamically by the number of candidate itemsets generated at each MapReduce phase. Other Hadoop-based Apriori algorithms that work in a similar manner, but with different implementations, have also been proposed [42, 43].
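
The shared counting step of these algorithms can be illustrated with a minimal, framework-free sketch (plain Python standing in for Hadoop's mapper/reducer interfaces; the function names and toy data are our own assumptions, not from [41]).

from collections import defaultdict
from itertools import combinations

def mapper(transaction, candidates):
    # Emit (candidate_itemset, 1) for every candidate contained in this transaction.
    for c in candidates:
        if c <= transaction:
            yield (c, 1)

def reducer(pairs):
    # Sum the counts emitted for each candidate itemset.
    totals = defaultdict(int)
    for itemset, count in pairs:
        totals[itemset] += count
    return totals

# Toy run: one "phase" of parallel support counting for 2-itemset candidates.
transactions = [{"G1", "G3", "G5"}, {"G1", "G3"}, {"G2", "G3", "G4"}]
candidates = {frozenset(c) for t in transactions for c in combinations(sorted(t), 2)}
pairs = [p for t in transactions for p in mapper(t, candidates)]   # on the mappers
supports = reducer(pairs)                                          # on the reducers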

To address the inability of traditional association rule mining algorithms to meet the efficiency and scalability demands of very large datasets, FP-growth, for example, has been parallelized on the Hadoop framework using the MapReduce model; the parallel version better meets the requirements of big data mining and efficiently extracts frequent itemsets and association rules from large datasets [44]. MRFP-Growth (MapReduce Frequent Pattern Growth) was likewise implemented to solve the problem of discovering frequent patterns in massive datasets, and its efficiency and performance were shown to improve on other mining algorithms [45]. Implementations of Eclat on MapReduce have also been proposed, such as Dist-Eclat, which focuses on speed, and its optimized version BigFIM, which adopts a hybrid approach incorporating both Apriori and Eclat, making it better suited to very large datasets [46]. Experiments on real datasets have demonstrated their scalability. A MapReduce algorithm for mining closed frequent itemsets has been implemented as well [47].

Another noteworthy approach is PARMA [48]. It applies parallel mining algorithms to randomly selected subsets of the original large dataset. Owing to this random-sampling property, its mined results can be regarded as approximations of the exact results over the whole dataset. The approximation quality has been verified by both statistical analysis and practical application.

Based on CARPENTER, a new algorithm called PaMPa-HD was developed. It too adopts a depth-first search process, but the process is broken up into independent subprocesses, to each of which a centralized version of CARPENTER is applied so that subtrees of the search space can be evaluated autonomously. The closed itemsets of each subprocess are then consolidated to compute the overall closed-itemset result [49]. Since the subprocesses are independent, they can be executed in parallel on a distributed computing platform such as Hadoop.

To achieve compressed storage and avoid building conditional pattern bases, FiDoop was put forward. FiDoop utilizes a frequent-itemset ultrametric tree. In FiDoop, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately, which speeds up mining for high-dimensional dataset analysis [50]. Extensive experiments using real-world celestial spectral data indicate that FiDoop is scalable and efficient.

In addition to the huge computational cost, the set of patterns derived from high-dimensional datasets is typically enormous, and such growth makes their effective use difficult. Therefore, a new methodology aimed at mining approximate or representative patterns, instead of full-scale patterns, appeared as a solution. The most typical approach, known as Pattern-Fusion, is based on a novel concept called the core pattern and is able to discover approximate colossal patterns by fusing core patterns; an improved variant of this colossal pattern mining algorithm further refines seed-pattern selection by choosing large patterns that are distant from one another, rather than selecting seed patterns at random [51].

Recently, Graphics Processing Units (GPUs) have emerged as one of the most widely used parallel hardware platforms for solving large, complex scientific problems. An approach that benefits from the massively parallel power of the GPU by using a large number of threads to evaluate association rules was proposed [52]. A new algorithm called MWBSO-MEGPU was then proposed; this method combines GPU and cluster computing to improve a Bees Swarm Optimization (BSO) metaheuristic. Several tests have been carried out to evaluate this approach, and the results reveal that MWBSO-MEGPU outperforms HPC-based ARM approaches in terms of speedup when exploring the Webdocs instance [53].

3 Discussion

All algorithms reviewed as applicable approaches for high-dimensional datasets are summarized below in Table 1.

Table 1 Overall compilation of association rule mining algorithms on high-dimensional datasets

The performance evaluation of association rule mining algorithms raises two major concerns. The first is scalability, which refers to the ability of an algorithm to handle a large amount of data in a suitably efficient way. The other is interpretability, or the capacity to translate the results to real-world issues, such as biological meaning.

With respect to scalability, our primary focus in this paper, numerous approaches to efficiently processing high-dimensional datasets have been proposed. With the advantages of advanced computer technology, seemingly unlimited cloud computing resources and "big data" processing techniques, parallelized association rule mining may be the most promising candidate to lead the further development of ARM in the new "big data" era. Further efforts should concentrate primarily on a more appropriate data distribution model, more efficient mining methods that take better advantage of the key-value feature, and a more reliable load-balancing scheme. Furthermore, the idea adopted by Pattern-Fusion, whereby approximate or representative patterns are extracted instead of full-scale patterns, is a promising methodology for addressing the high-dimensionality problem. By employing such a methodology, the mining process saves the cost of identifying huge numbers of shorter patterns and yields a much smaller result set consisting of the longer patterns preferred in practical use; in gene expression analysis, for example, longer patterns are usually more favorable. Still, approximation quality needs to be guaranteed to avoid major loss of latent information, which may require more statistical analysis and theoretical proof.

Interpretability is another critical issue in biomedical research. Incorporating previously known biological knowledge into association rule mining algorithms is typically seen as providing more biologically meaningful results [6]; however, in this review, we have suggested that taking too much knowledge into account might lower the ability of ARM to discover new rules, because the algorithm would tend to fit the existing biological knowledge more closely. Additionally, such an approach may increase cost and, in turn, reduce scalability. Instead, toolkits that include fuzzy set theory, genetic algorithms, ant colony algorithms, particle swarm optimization, and other heuristic algorithms can be utilized to optimize association rule mining algorithms for better interpretability. Such optimization methods have mainly been used over quantitative data without a typical pre-discretization procedure. For example, fuzzy set theory can be used to generate fuzzy association rules with more practical meaning, genetic algorithms to dynamically specify appropriate support thresholds, and ant colony algorithms to reduce the number of resulting rules [54, 55]. As also suggested in this review, association rule mining algorithms that merge diverse optimization methods might, when combined with improved computing techniques, provide researchers with tools of greater practical value.

4 Conclusion

Generally speaking, ARM has been widely utilized in bioinformatics studies. ARM can be used to identify the most relevant covariates in a certain biological process and thus to construct the underlying intrinsic latent network. When applied to high-dimensional datasets, many older methods cannot cope with the high dimensionality. Many more recent methods have been proposed to address this problem, each with its merits and faults, but no perfect solution has been achieved. For better usage in this area, new algorithms that better balance scalability and interpretability are still in demand.