Introduction

Classification is one of the main data mining tasks, with many real-world applications such as personalization, anomaly detection, recommendation and prediction. Many classifiers have been developed based on mining a small subset of rules out of training datasets. These methods can be categorized as naïve-Bayes classification [1], rule learning [2], statistical approaches [3] and decision trees [4]. Associative classification is another such research area.

Background

Associative classification (AC) [5, 6] is a data mining task that integrates association rule discovery and classification. Classification and association rule mining are closely related concepts, with the exception that classification predicts class labels, whereas association rules represent relationships between items in transactional datasets. In association rule mining there is no predetermined target, while in classification rule mining there is one and only one target, the class [6]. Associative classification is a promising approach and its classifiers can compete with decision tree, rule induction and probabilistic classifiers [6,7,8]. In this paper, we target a new associative classification technique, which statistically shows good accuracy compared to the traditional AC techniques (on baseline datasets), and better efficiency for large dataset classification.

Association rule mining algorithms consist of two phases: first, the frequent itemsets are found based on a support threshold, and second, the association rules that meet the confidence criterion are derived from these frequent itemsets. Several well-known algorithms have been proposed for pattern and association rule mining, such as GSP [9], SPADE [10], PrefixSpan [11], SPAM [12] and AprioriAll [13]. In this paper, we utilize discriminative itemsets to propose an accurate and efficient classification technique called discriminative associative classification (DAC).

Discriminative itemsets are frequent itemsets in one data class that have higher frequencies compared with the same itemsets in the other data classes, based on the defined thresholds [14,15,16,17,18]. Class discriminative association rules (CDARs) are discovered from the discriminative itemsets based on the defined discriminative value, minimum confidence and minimum support thresholds. They are class association rules (CARs) in one data class that have higher support compared with the same CARs in the other data classes. An essential issue is to find the small set of rules that can distinguish each target data class from all other data classes. The interpretability of CDARs is expected to be much higher than that of CARs, as they show the dominant rules in each data class which have little or no importance in the other data classes. Compared to CARs, they exclude the rules which are dominant in more than one data class. Fast algorithms have been proposed for mining discriminative items [14, 19] and discriminative itemsets [15,16,17,18, 20]. H-DAC has also been proposed for discriminative associative classification in data streams [21].

Many real-world scenarios show the significance of CDARs in datasets. In classification problems, there are two or more data classes. Usually, these data classes are conceptually discriminated from each other (e.g. the edible and poisonous data classes in the mushroom dataset, and the won and lost data classes in the chess dataset [22]). In these types of datasets, there is a large number of rules in each data class. In AC methods, we can find many rules which discriminate the data classes based on their concepts. However, there are also many other rules with high support in more than one data class. This causes lower accuracy and greater time and space usage in classification rule mining algorithms. In many other datasets, there are no discriminative data classes (e.g. different branches in market basket datasets).

Classifying a market basket dataset made of multiple market branches (i.e. each branch is a data class) is an interesting application for seeing the different dominant trends in each market. There are no discriminative concepts between different branches of the market basket dataset (i.e. data classes). However, there are still CDARs (fewer in number) which can be used for classifying each branch vs all others. In the market basket dataset, these rules are useful for highlighting the itemsets of high interest in one market compared with the other markets. An interesting scenario is how people nourish their kids in different suburbs. In this scenario, we look for the itemsets which are discriminative in each suburb (i.e. having higher support in the branch of that suburb vs the branches in all other suburbs), to classify the suburbs based on their prevailing purchasing behaviours. Web page personalization can also be optimized using variations in user preferences across different geographical areas (i.e. data classes). Each user group in one area usually visits certain groups of web pages much more frequently than the user groups in other areas. The sequences of queries with higher support in one geographical area compared with other areas are considered.

Research Gaps

Although this field of classification is crowded, we target an efficient and accurate classification method specifically for large datasets. We propose an approach based on rule mining followed by rule selection that classifies large datasets efficiently and, moreover, improves accuracy. In the above examples, the datasets are usually much larger than in traditional classification problems (i.e. as in the experiments), and our DAC algorithm shows much better time and space efficiency using its pruning heuristics [17]. The proposed algorithm is more efficient and shows good accuracy (i.e. on average) compared to the traditional AC algorithms.

Discriminative associative classification has more challenges, as it does not follow the Apriori property. The DISSparse algorithm was proposed [17, 18, 20] as an efficient method for mining discriminative itemsets in one batch of transactions. This algorithm shows good time and space usage by mining the complete set of discriminative itemsets, without any false positives or false negatives. DISSparse is a heuristic-based algorithm and prunes many non-potential subsets during itemset combination generation. We adopt DISSparse for rule mining with some modifications, as will be explained in Sect. Rule Mining Algorithm. In DISSparse, discriminative itemsets were defined as a kind of contrast pattern [23]. Contrast patterns show non-trivial differences between datasets. One of the well-known kinds of contrast patterns is emerging patterns (EPs) [24]. These are patterns with significant frequency growth in one dataset in comparison to another.

Despite similarities between the definition of discriminative itemsets and some types of emerging patterns [23, 25], they are fundamentally different. Discriminative itemsets, as proposed by [17, 18], are single itemsets with explicit supports in the datasets, rather than a group of patterns represented between two frequency support borders by their degree of support change, as in emerging patterns [24, 25]. The rules are discovered from the discriminative itemsets with different cardinalities, supports and confidences, and are useful in classification applications which need single itemsets rather than a group of itemsets between borders. Moreover, discriminative itemsets may be frequent in every data class; still, the frequency is relatively different across the data classes (e.g. the itemsets in the market basket are frequent in all suburbs, with relatively higher frequency in the target suburb with a smaller average age of population).

In addition to the advantages of AC methods mentioned above, and as another research gap, there are differences from emerging patterns as well. Firstly, the CDARs are presented as single itemsets (rules) with explicit support and confidence in the datasets. Secondly, they are presented with different discriminative values (i.e. the discrimination between the support in each data class vs the support in all other data classes). Both highly and lowly discriminative itemsets are reported with their exact support and confidence on the datasets. The itemsets with different discriminative values are different from emerging patterns, which are discovered between two borders.

Contributions

To the best of our knowledge, this is the first study on discriminative associative classification for static datasets. There are technical novelties in our proposed method compared to the existing associative classification methods. We utilize the efficient DISSparse method [17] for mining discriminative itemsets in data classes, and we extract the rules out of the discriminative itemsets. DISSparse does not follow the Apriori property, which is mainly used for mining frequent itemsets in associative classification methods proposed based on either Apriori or FP-Growth [7, 8]. The DISSparse algorithm works based on two data classes, i.e. the target class and the general class. We propose a highly efficient and accurate method, called the DAC algorithm, that works with several classes. In fact, instead of mining rules in the target class vs the general class (i.e. one vs all), we discover the rules in each class vs all other classes (i.e. each vs all). The method has to deal with the exponential number of itemsets generated in more than one data class. We adopt a new concise prefix-tree structure in the DAC algorithm for holding the discovered discriminative rules. Finding a small set of rules from an exponential number of itemsets is time-consuming. Rule pruning then happens by deleting the misleading rules. After that, the discovered rules are ranked based on their confidence, discriminative value and support. Then, the final set of rules is selected for classifying the data classes. Finally, the rules are evaluated. Despite the challenges, discriminative associative classification is an emerging research area with great potential.

An extensive evaluation is done on the proposed method using small and large datasets made of multiple classes of transactions exhibiting diverse characteristics and by setting different thresholds. Empirical analysis shows the high accuracy and efficiency of the algorithm.

More specifically, we make these contributions:

  • Defining the novel concept of discriminative associative classification in the datasets.

  • Developing an efficient and accurate algorithm for mining class discriminative association rules (CDARs) in data classes in datasets.

  • Introducing novel in-memory DAC-Tree structures for efficiently mining the CDARs.

  • Using the discovered rules for the prediction of the labelled transactional datasets and evaluating the quality of the classification in a wide range of baseline datasets with different parameter settings.

  • Applying the classifier to several large datasets for showing its efficiency and scalability.

  • Showing strategies and principles for tuning parameters based on the classification application domains and dataset characteristics.

The remainder of the paper is organized as follows. The next section discusses related works, followed by a definition of the research problem in “Problem Statement”. Our novel classification method is presented in “DAC Method”. The details of the experimental results are reported in “Empirical Evaluation”, and the final section concludes the paper and discusses future works.

Related Works

Contrast mining, which discovers interesting contrast patterns, is an active research area in data mining that captures significant differences between datasets [23]. Class discriminative association rules are a type of contrast pattern. One of the well-known contrast patterns is the emerging pattern (EP) [24]. EPs are itemsets with significant frequency growth from one data class to another data class. They can show prominent emerging trends in time-stamped data classes or highlight the differences between data classes. The initially proposed method [24], followed by [24,25,26,27,28,29,31], works based on pre-specified borders in terms of the frequency growth rate of maximal itemsets in the first and second datasets. In emerging pattern mining, the maximal itemsets are discovered for each dataset following the frequency-changing rate in the first and second datasets. All the maximal itemsets enclosed between the two borders are then reported. The borders are defined based on the lowest and the highest thresholds [24], considered as the left and right borders denoted as < Left, Right > . A large collection of itemsets can thus be represented compactly by the area between these two borders, < L, R > . The number of candidate itemsets is decreased by defining only the borders of the maximal itemsets. The ‘frequency growth’ of EPs lies between the given ‘borders’. Unlike our method, emerging pattern mining methods do not need a pattern’s true frequency and only check the frequency change. In contrast, we find the frequent itemsets first and then compare the frequency difference of the same itemsets in each data class vs all other classes. Several types of emerging patterns have been defined by [23, 28, 29, 32, 33].

An information-based approach is proposed [25] for classification based on the aggregation of emerging patterns (i.e. iCAEP). Each EP shows a sharp difference in support and ratio of the class members in the part of instances containing the EP. This enables the constraint-based EP mining algorithm to learn from large-volume and high-dimensional data. The algorithm chooses representative EPs efficiently for a high predictive accuracy system. In the experiment section, we observed that our proposed DAC method outperforms this method.

There are differences between our method and EP mining in terms of both concepts and algorithms. Firstly, EPs focus on the change degree in the support of itemsets, and the itemsets’ true support is not considered [24]. The proposed algorithms for emerging patterns are mostly focused on compactly representing these patterns to avoid examining all the possible itemsets. Although all EPs are produced, the real supports of the EPs are not presented, as they are reported in a group between two borders using the maximal itemsets (i.e. the support of each EP is reported in the range of the left and right borders). Many infrequent EPs can be reported as emerging patterns with low supports, which may not be interesting [33]. In contrast, CDARs are frequent in each target data class based on the defined support threshold and are discriminative in each target data class compared with all other data classes based on the defined discriminative level. In the method proposed in this paper, the confidence and support of the rules are derived explicitly in each data class vs all other data classes. Secondly, the significance of our method is in the explicit discriminative value of each rule in the data classes. The methods proposed by [15,16,17, 20] focus on the difference in support rather than the change in the degree of support. They discover the real support of each discriminative itemset and the relative differences of the supports in the data classes explicitly.

There are also substantial differences between the proposed method and the EP methods in terms of the algorithms. Our proposed method is fundamentally designed following the concepts of the FP-Growth method [34]. There are datasets with inherent discrimination (e.g. the mushroom and chess datasets), which usually have one accepted class and one rejected class. In these types of datasets, emerging pattern mining methods are more efficient. Here, a few attributes mainly make the permanent discrimination in the itemsets, irrespective of the rest of the items involved in the itemsets. This makes it easy to represent them between borders, as in emerging pattern mining methods [3, 24, 33]. In other types of datasets with no inherent discrimination (e.g. market basket datasets in different suburbs, or adult income datasets), each data class has a similar standing to the others. In this case, a large number of borders is discovered between emerging patterns, which can become almost equal to the number of rules, causing inefficiency. Here, the rules form a random and sparse set and each rule must be discovered separately. Our algorithm skips a great number of non-potential rules during rule mining, based on the heuristics proposed by [17].

Our proposed method is highly related to associative classification. Associative classification (AC) algorithms are designed based on association rules discovered from frequent itemsets. The promising associative classification techniques [6, 35] are widely used for the classification of datasets [7, 8]. The AC methods consist of four steps: rule ranking, rule pruning, rule prediction and rule evaluation [6]. Different methods have been proposed based on different approaches to these four steps, including CBA [6], CMAR [5], CPAR [36], MMAC [37], HARMONY [38], L3 [39] and ACN [40].

Discovering the complete set of rules (i.e. rule mining) in the traditional AC approaches is time-consuming and not suitable for large datasets. The large number of rules discovered in the datasets may also cause the other steps to take longer. The discriminative itemsets are a small subset of the frequent itemsets. They are used for discovering a new set of rules, useful for classification purposes and for differentiating the datasets (data classes). The effectiveness of the developed algorithm is highly dependent on the efficiency of the algorithms used for mining discriminative itemsets. However, the challenges related to rule mining should also be addressed.

The method does not require the prior frequency calculation of all generated itemset combinations. It is a heuristic-based method and effectively utilizes a novel prefix-tree structure (i.e. the DAC-Tree) for holding the rules. The empirical analysis of the proposed method reveals that it produces the complete set of CDARs. For rule ranking, we rely on the most distinguishing characteristics of the rules (i.e. confidence, discriminative value and support, in that order). Several rule pruning techniques have been used in previous methods, including \(\chi^{2}\) testing, redundant rule pruning, database coverage, pessimistic error estimation, lazy pruning and conflicting rule removal [7, 8]. In our proposed method, the rule pruning is mainly done during rule mining (i.e. because of the discriminative values), but we have a separate rule pruning step as well. Finally, the classifier is evaluated based on the general measures defined for classification.

Compared to the current associative classification techniques, our proposed discriminative associative classification method is different. First, the proposed method is based on discriminative itemsets in datasets with multiple data classes, whereas the AC methods are designed based on frequent itemsets only. Second, the Apriori property defined for frequent itemset mining is not valid, and a subset of a CDAR's itemset can be non-discriminative. Third, the rule ranking and rule pruning processes are quite different, as the number of discovered rules is much smaller. Moreover, compared with the H-DAC method [21], which targets discriminative classification in large, high-speed data streams, our proposed DAC method is designed for the classification of static datasets. We extensively evaluate this method on several datasets with different characteristics, in comparison to the previously proposed AC and EP methods.

The rules discovered by our algorithm and the patterns discovered by the methods proposed in [41, 42] differ in their definitions of discrimination. In those methods, discriminative frequent patterns are considered in one dataset w.r.t class labels, and the patterns are discriminative based on the measure of information gain [4]. In our proposed method, however, the rules can be discriminative in each data class vs all other data classes, and the discriminative measure is the relative frequency of the rules (i.e. the discriminative value) in the data classes.

In summary, our proposed method is positioned in the category of associative classification methods. CDARs are rules with discriminative itemsets on their left side and class labels on their right side. Each rule is considered with its exact discriminative value, exact support and exact confidence. The rules are concise in number, since many associative classification rules (i.e. frequent ones) are not discriminative in any class and are pruned during rule mining, as will be discussed. The superiority of this method lies in its rule interpretability in the application domain, its better accuracy on the majority of baseline datasets, and its efficiency when applied to large datasets.

Problem Statement

We define the dataset as \(D={\bigcup }_{i=1}^{k}{D}_{i}\) (i.e. \({D}_{i}\) is a data class in \(D\)) with cardinality \({n}_{i}\), \(i=1,2,\dots ,k\) (i.e. the size of each data class \({D}_{i}\)). Let the dataset consist of transactions \(T\) of different lengths, made of the lexicographically ordered alphabet of items \(\sum\). An itemset \(I\) is a subset of items occurring in the transactions of the dataset. Let \({C}_{i}(I)\), \(i=1,2,\dots ,k\), be the class distribution of itemset \(I\) in data class \({D}_{i}\) of dataset \(D\), and let \({r}_{i}(I) = {C}_{i}(I)/{n}_{i}\) denote the frequency ratio of \(I\) in data class \({D}_{i}\). The discriminative itemsets are those itemsets that occur in one data class more frequently than in all other data classes. Specifically, we look for itemsets that are frequent in data class \({D}_{i}\) and whose frequency in that data class is higher than the frequency of the same itemsets in all other data classes (i.e. except \({D}_{i}\)), based on the specified threshold \(\theta > 1\), using the following definition; \(\exists i\in \left\{1,2,\dots ,k\right\}\) such that:

$${R}_{i\bigcup_{j\ne i}^{k}j }\left(I\right)= \frac{{r}_{i}\left(I\right)}{\sum_{j\ne i}^{k}{r}_{j}\left(I\right)}=\frac{{C}_{i}\left(I\right)\sum_{j\ne i}^{k}{n}_{j}}{\sum_{j\ne i}^{k}{C}_{j}\left(I\right){n}_{i}} \ge \theta .$$
(1)

The reason for \(j\ne i\) in the sum of the denominator in Eq. (1) is that the ratio of itemset frequency is defined as its ratio in each data class vs its ratio in all other data classes, and \({r}_{i}\left(I\right)\) should be divided by the sum of all \({r}_{j}\left(I\right)\) except itself. This makes the denominator change for every \(i\), and thus makes the comparison of the measures relative. In the case of \({C}_{\bigcup_{j\ne i}^{k}j}(I)=0\), another user-defined parameter is used as the minimum support threshold \(0 < \varphi < 1/\theta\). Here, the itemset is considered a discriminative itemset if its frequency is at least \(\varphi \theta {n}_{i}\) (\({C}_{i}(I) \ge \varphi \theta {n}_{i}\)) and also \({R}_{i\bigcup_{j\ne i}^{k}j }(I)\ge \theta\). The set of discriminative itemsets (\(DI\)) in \(D\) is defined as; \(\exists i\in \left\{1,2,\dots ,k\right\}\) such that:

$${DI}_{i\bigcup_{j\ne i}^{k}j}=\left\{I \subseteq \sum | {C}_{i}(I)\ge \varphi \theta {n}_{i} \& {R}_{i\bigcup_{j\ne i}^{k}j }(I)\ge \theta \right\}.$$
(2)
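
For concreteness, the check in Eqs. (1) and (2) can be sketched in a few lines of C++ (the language used for the experiments). This is our illustration, not the authors' implementation, and the function and parameter names are ours.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the discriminative test in Eqs. (1) and (2).  counts[i] holds C_i(I),
// the count of itemset I in data class D_i, and sizes[i] holds n_i.  theta is the
// discriminative threshold and phi the minimum support (0 < phi < 1/theta).
bool isDiscriminativeIn(const std::vector<long>& counts,
                        const std::vector<long>& sizes,
                        std::size_t i, double theta, double phi) {
    long countRest = 0, sizeRest = 0;
    for (std::size_t j = 0; j < counts.size(); ++j) {
        if (j == i) continue;
        countRest += counts[j];   // sum of C_j(I) over j != i
        sizeRest  += sizes[j];    // sum of n_j    over j != i
    }
    // Frequency condition of Eq. (2): C_i(I) >= phi * theta * n_i.
    if (counts[i] < phi * theta * sizes[i]) return false;
    // If I never occurs outside class i, the ratio in Eq. (1) is unbounded and
    // the frequency condition alone decides.
    if (countRest == 0) return true;
    // Ratio condition of Eq. (1): (C_i(I) * sum n_j) / (sum C_j(I) * n_i) >= theta.
    double ratio = (double(counts[i]) * double(sizeRest)) /
                   (double(countRest) * double(sizes[i]));
    return ratio >= theta;
}
```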

Let \(DI\) be the discovered itemsets and \(Y=\left\{1,2,\dots ,k\right\}\) be the set of class labels in the dataset \(D\). The class discriminative association rules (CDARs) are defined as the rules \(I \to y\), where \(I \in DI\) and \(y \in Y\). A rule \(I \to y\) holds in \(D\) with confidence \(c\) if \(c\%\) of the cases in \(D\) that contain \(I\) are labelled with class \(y\) (i.e. \(\frac{{C}_{i}(I)}{\sum_{j=1}^{k}{C}_{j}(I)}*100\%\ge c\)). The rule \(I \to y\) has support \(s\) (i.e. \(s=\varphi *100\%\)) in \(D\) if \(s\%\) of the cases in \(D\) contain \(I\) and are labelled with class \(y\) (i.e. \(\frac{{C}_{i}(I)}{\sum_{j=1}^{k}{n}_{j}}*100\%\ge s\)). We define the rules in a dataset with multiple data classes (i.e. \(\left\{1,2,\dots ,k\right\}\)).

The rule mining algorithm skips the non-potential discriminative itemsets using two heuristics. In [17], it is proved that the heuristic-based DISSparse algorithm is correct (i.e. it does not miss any discriminative itemset) and that it generates the potential superset of discriminative itemsets efficiently. These two heuristics will be discussed in Sect. Rule Mining Algorithm. We then generate the complete set of rules that satisfy the user-specified discriminative value (called ratio \(\theta\)), minimum confidence (called minconf \(c\)) and minimum support (called minsup \(s\)) constraints, and then build a classifier out of them.

Example 1.

Consider two attributes (i.e. \(A\) and \(B\)) and two data classes with sizes \({n}_{1}=10\) and \({n}_{2}=40\), respectively (i.e. the total number of cases in \(D\) is \(50\)). The itemset \(\left\{\left(A,1\right),\left(B,1\right)\right\}\) occurs \(15\) times in \(D\). It yields the candidate rules \(<\left\{\left(A,1\right),\left(B,1\right)\right\},(class,1)>\) with count \(5\) (i.e. \(sup=\frac{5}{50}*100\%=10\%\)) and \(<\left\{\left(A,1\right),\left(B,1\right)\right\},(class,2)>\) with count \(10\) (i.e. \(sup=\frac{10}{50}*100\%=20\%\)). The rule discrimination in \(class 1\) vs the other classes (i.e. \(class 2\)) is \(2\) (i.e. \({R}_{1\bigcup_{j\ne 1}^{2}j}=\frac{{C}_{1}(I)\sum_{j\ne 1}^{2}{n}_{j}}{\sum_{j\ne 1}^{2}{C}_{j}(I){n}_{1}}=\frac{5*40}{10*10}=\frac{200}{100}=2\)). The confidence of the first rule is \(33.3\%\) and that of the second rule is \(66.7\%\). If \(s\) is \(10\%\), then both rules are frequent. If \(\theta =2\), then the rule is discriminative in \(class 1\) vs \(class 2\). It has to be considered that having higher support in a data class is not enough; the support ratio of the rule between data classes (i.e. the discriminative value) should be higher than \(\theta\). When a rule is discriminative, its confidences in the different data classes can be smaller than, bigger than or equal to each other.

For all rules with the same itemsets on their left side (i.e. \(I\) in \(I \to y\)), the one with the highest confidence is selected. If there is more than one rule with similar \(I\) and the same confidence, the one with the highest discrimination is chosen. If there is more than one rule with similar \(I\) and the same confidence and same discrimination, the one with the highest support is chosen. Finally, if all are the same, the rule with the shortest length is selected. In the above example, the rule \(<\left\{\left(A,1\right),\left(B,1\right)\right\},(class,1)>\) is generated with \(sup=10\%\), \(confd=33.3\%\) and \(Dis=2\). The other one, \(<\left\{\left(A,1\right),\left(B,1\right)\right\},(class,2)>\), is with \(sup=20\%\), \(confd=66.7\%\) and \(Dis=0.5\), which is not discriminative (i.e. it is not a rule). If the support is greater than \(s\), the rule is frequent. If the discrimination is greater than \(\theta\), the rule is discriminative. If the confidence is greater than \(c\), the rule is accurate. The set of rules thus consists of all rules that are frequent, discriminative and accurate.
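
The numbers in Example 1 can be reproduced directly from these formulas. The small program below (ours, purely illustrative) prints the support, confidence and discriminative value of the two candidate rules.

```cpp
#include <cstdio>

int main() {
    // Numbers from Example 1: two classes with n1 = 10 and n2 = 40 (|D| = 50),
    // and the itemset {(A,1),(B,1)} occurring 5 times in class 1 and 10 times in class 2.
    const double n1 = 10, n2 = 40, total = n1 + n2;
    const double c1 = 5, c2 = 10;

    std::printf("class 1: sup=%.1f%% conf=%.1f%% dis=%.1f\n",
                c1 / total * 100, c1 / (c1 + c2) * 100, (c1 * n2) / (c2 * n1));
    std::printf("class 2: sup=%.1f%% conf=%.1f%% dis=%.1f\n",
                c2 / total * 100, c2 / (c1 + c2) * 100, (c2 * n1) / (c1 * n2));
    // With s = 10% both rules are frequent; with theta = 2 only the class 1 rule
    // is discriminative (2.0 >= 2), so only it is kept as a CDAR.
    return 0;
}
```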

Our objectives are (1) to generate the complete set of CDARs that satisfy the user-specified minimum support, minimum discriminative value and minimum confidence and (2) to build a classifier out of the CDARs.

The process of rule mining is time-consuming and the proposed methods have to deal with a large number of combinations in multiple data classes. Efficiency is a challenging issue. The Apriori property defined for the association rules is not valid, that is, not every subset of a rule is discriminative. The other challenge is the generation of compact data structures so that all sets of rules can be discovered, ranked, pruned and evaluated appropriately. In this paper, an algorithm is proposed, and data structures and detailed processes of the algorithm are discussed in Sect. DAC Method. The specific characteristics and limitations of the proposed method are also presented. The algorithm is evaluated in Sect. Empirical Evaluation using different input datasets with different characteristics and sizes, for its effectiveness and performance.

DAC Method

We propose a method based on a novel in-memory prefix-tree structure, called the DAC-Tree, for holding all classification rules. The input dataset is a batch of transactions \(D\) from multiple data classes, i.e. \({D}_{i}\), \(i=1, 2,\dots ,k\), with different sizes. We propose rule mining based on a modified DISSparse, followed by rule pruning and rule ranking based on rule precedence. DISSparse is modified to work on multiple data classes and to output the rules to the DAC-Tree. The rule precedence is presented in the following section.

Classifier Basic Concepts

To produce the best classifier out of the whole set of rules, all the possible rules related to the transactions in the training datasets would have to be evaluated, and the subset of rules that gives the smallest number of errors selected. There are \({2}^{m}\) such subsets, where \(m\) is the number of rules, which can be very large (e.g. \(10000\)). Working with this exponential number of rules is infeasible. We instead use rule ranking to choose the best subset of rules based on heuristics. The classifier it builds performs very well compared with those built by C4.5 [4] and the AC methods [4,5,6, 35,36,37,38,40]. Before presenting the algorithm, let us define the total order of the generated rules for our classifier.

Definition:

Given two rules \({r}_{i}\) and \({r}_{j}\), \({r}_{i}> {r}_{j}\) (also called \({r}_{i}\) precedes \({r}_{j}\)) if

  1. the confidence of \({r}_{i}\) is greater than that of \({r}_{j}\), or

  2. their confidences are the same, but the discriminative level of \({r}_{i}\) is greater than that of \({r}_{j}\), or

  3. both the confidences and discriminative levels of \({r}_{i}\) and \({r}_{j}\) are the same, but the support of \({r}_{i}\) is greater than that of \({r}_{j}\), or

  4. the confidences, discriminative levels and supports of \({r}_{i}\) and \({r}_{j}\) are all the same, but \({r}_{i}\) is generated earlier than \({r}_{j}\) (i.e. has fewer attributes on its left side).
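
The total order above can be captured by a small comparator. The sketch below is ours, with illustrative field names, and assumes each rule stores its confidence, discriminative value, support and the length of its left-hand side.

```cpp
#include <cstddef>

// Illustrative rule record; the field names are ours, not the paper's code.
struct CDAR {
    double confidence;        // confidence of the rule
    double discriminative;    // discriminative value (the ratio in Eq. (1))
    double support;           // support of the rule
    std::size_t length;       // number of attributes on the left-hand side
};

// Total order of the definition above: r_i precedes r_j if it has higher
// confidence; ties are broken by discriminative value, then support, then by
// the rule generated earlier (fewer attributes on its left side).
bool precedes(const CDAR& ri, const CDAR& rj) {
    if (ri.confidence     != rj.confidence)     return ri.confidence     > rj.confidence;
    if (ri.discriminative != rj.discriminative) return ri.discriminative > rj.discriminative;
    if (ri.support        != rj.support)        return ri.support        > rj.support;
    return ri.length < rj.length;
}
```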

Let \(R\) be the set of generated rules (i.e. CDARs), and \(D\) the training datasets. The algorithm’s basic idea is to choose a set of high precedence rules in \(R\) to cover \(D\). Our classifier is of the following format:

$$\langle r_{1}, r_{2}, \ldots, r_{n}, default\_class\rangle,$$

where \({r}_{i}\in R\) and \({r}_{a}> {r}_{b}\) if \(b> a\), and \(default\_class\) is the default class. In classifying an unseen case, two criteria are considered. First, each rule that satisfies the case is considered. Second, for better accuracy, the number of rules that satisfy the case in each class (i.e. the cardinality in each class) is also considered. Finally, the case is classified based on the maximum value obtained by multiplying the cardinality of the matched rules in each class with the rule precedence criteria. If no rule applies to the case, it takes on the default class, as in C4.5 and the AC methods.

Associative classification algorithms usually first extract the rules, then rank and prune them, and finally use them to classify new test objects. The proposed DAC algorithm extracts CDARs based on a modification of the DISSparse algorithm [20]. It executes DISSparse considering every class in turn as the target class and holds the discovered rules in a novel concise prefix-tree structure (i.e. the DAC-Tree). The proposed pruning method is based on deleting, during rule mining, the many association rules that are not discriminative in any data class. It also uses the rule's length, confidence and support to prune the rules that cause overfitting. Finally, it detects the redundant rules using a method for discarding specific rules based on an adaptation of CMAR [5]. The proposed ranking method is based on the precedence of the rules, taking into consideration the confidence, discriminative level, support and rule antecedent length, based on an adaptation of CBA [6]. The novelty of the proposed DAC method is mainly related to its efficiency and accuracy for the classification of large datasets. This has also been used for the classification of large and fast-growing multiple data streams [21].

Our algorithm for building such a classifier has three steps as in Algorithm 1.

Step 1: generate the set of rules \(R\) based on a modified version of the DISSparse algorithm. It ensures all the rules are considered for our classifier (lines 1–2).

Step 2: find all the matched rules with every new transaction \(Tr\) (line 3).

Step 3: select the highest precedence rule based on relation \(>\) for our classifier and measure the accuracy using cross-validation (lines 4–5).

Rule Mining Algorithm

This section presents the DAC algorithm for building a classifier using the ranked set of CDARs. The modified DISSparse algorithm generates a complete set of discriminative itemsets in a batch of transactions as a training dataset. The method deals with an exponential number of itemsets generated in more than one data class; however, it does not generate all itemsets. It uses two heuristics as proposed by [17, 20] to efficiently mine only the potential discriminative itemsets, as follows.

The two heuristics used in the DISSparse algorithm eliminate many non-potential discriminative itemsets during the itemset generation process. These heuristics are applied before the itemset combination generation process. The itemsets are generated following the principles of the FP-Growth algorithm [34]. However, unlike FP-Growth, DISSparse does not use divide and conquer, as the Apriori property does not hold for discriminative itemset mining (i.e. a subset of a discriminative itemset can be non-discriminative). DISSparse generates itemsets by incrementally discovering the discriminative itemsets ending with a specific item. The two heuristics are defined for avoiding the generation of non-potential discriminative itemsets. The first heuristic is defined on the whole set of itemsets starting with a specific item and ending with another specific item. The second heuristic is defined on the internal items lying between the starting and ending items of such itemsets. The second heuristic is only applied when the first heuristic indicates that the set is potentially able to generate a discriminative itemset. Using these two heuristics, DISSparse skips, during itemset combination generation, either the whole set of itemsets starting with a specific item and ending with another specific item, or part of its internal items. The correctness and completeness of the heuristics in the DISSparse algorithm have been justified [17, 20]. The efficiency gained by these heuristics has also been reported using a set of experiments with large synthetic and real datasets.

In DISSparse, these two heuristics work based on one data class only (i.e. the target data class). That is, they eliminate the itemsets that are non-potential discriminative in the target class vs the general class (i.e. a summary of all other classes). We modified the heuristics to work for each class vs all other classes, instead of a single target class vs the general class. For this, we consider one class as the target class and then discover the discriminative itemsets in that class vs all other classes (i.e. treated together as one class, called the general class). We then repeat the same process for each class in turn.

At first, the algorithm discovers the discriminative itemsets and then adds each itemset as a new rule (i.e. with its class label) to DAC-Tree.
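
A minimal sketch of this each-vs-all strategy follows (ours, not the published pseudocode). The callable mineTargetVsRest is a hypothetical stand-in for one run of the modified DISSparse, with class i as the target class and the union of the remaining classes as the general class.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Transaction { std::vector<int> items; int classLabel; };
struct Rule { std::vector<int> lhs; int classLabel; double conf, dis, sup; };

// Runs the target-vs-general miner once per class: class i is the target and
// the union of all other classes plays the role of the general class.
std::vector<Rule> mineEachVsAll(
    const std::vector<std::vector<Transaction>>& classes,
    const std::function<std::vector<Rule>(const std::vector<Transaction>& target,
                                          const std::vector<Transaction>& rest,
                                          int targetLabel)>& mineTargetVsRest) {
    std::vector<Rule> allRules;
    for (std::size_t i = 0; i < classes.size(); ++i) {
        std::vector<Transaction> rest;                        // the general class
        for (std::size_t j = 0; j < classes.size(); ++j)
            if (j != i) rest.insert(rest.end(), classes[j].begin(), classes[j].end());
        std::vector<Rule> rules = mineTargetVsRest(classes[i], rest, int(i));
        allRules.insert(allRules.end(), rules.begin(), rules.end());
    }
    return allRules;
}
```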

DAC-Tree: a prefix-tree structure, as proposed for FP-Growth [34], is used for holding the discriminative itemsets of the dataset. The prefix-tree shares branches for their most common frequent items. The branch from the root to a node is considered the prefix of the itemsets ending at the nodes below it. Not all the nodes represent rules, as some of them correspond to subsets of discriminative itemsets; we know that a subset of a discriminative itemset can be non-discriminative, as the Apriori property is not followed. Compared with FP-Growth, each rule node in the DAC-Tree is associated with the class label, confidence, discriminative value and support of the rule on the path starting from the root and ending at this node (e.g. there are four metrics associated with the rule node in the DAC-Tree in Fig. 1, based on Example 1). The DAC-Tree is a compact structure. It exploits potential sharing among rules and thus can save a lot of space when storing rules.

Fig. 1 DAC-Tree based on Example 1
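
One possible node layout for such a prefix tree is sketched below (our illustration; the actual DAC-Tree implementation may differ). Only rule nodes carry the four metrics of a CDAR, and shared prefixes share nodes.

```cpp
#include <iterator>
#include <map>
#include <memory>

// Illustrative DAC-Tree node.  A path from the root to a node spells out an
// itemset; shared prefixes share nodes.  Only "rule nodes" carry a rule entry.
struct DACNode {
    int item = -1;                                    // item stored at this node
    std::map<int, std::unique_ptr<DACNode>> children;

    struct RuleInfo {                                 // the four rule metrics
        int classLabel;
        double confidence, discriminative, support;
    };
    std::unique_ptr<RuleInfo> rule;                   // null for non-rule nodes

    // Inserts a rule whose left-hand itemset is given in lexicographic order.
    template <typename It>
    void insertRule(It first, It last, const RuleInfo& info) {
        if (first == last) { rule = std::make_unique<RuleInfo>(info); return; }
        auto& child = children[*first];
        if (!child) { child = std::make_unique<DACNode>(); child->item = *first; }
        child->insertRule(std::next(first), last, info);
    }
};
```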

The rule mining in our algorithm (i.e. DAC) is based on batch processing, and by the end of it the full set of rules is collected in the prefix-tree structure. The discovered rules are stored in the DAC-Tree, each with its four metrics. After that, the test dataset is read transaction by transaction. With each new transaction, the DAC-Tree is traversed using a recursive function, and all rules matched with the new transaction are collected and then ranked using another function, both discussed in Sect. Rule Ranking. Every new transaction is classified, and the process continues with cross-validation for measuring classification accuracy. The total number of errors made by the classifier is computed and recorded; this is the sum of the number of errors made by all the selected rules in the classifier. The main challenge is setting the right set of parameters for mining more accurate rules for different datasets. This will be discussed in Sect. Empirical Evaluation. The DAC algorithm is given in Algorithm 1.

Algorithm 1 The DAC algorithm

This algorithm is very efficient, especially when the dataset is large. The method takes only two scans of the dataset (i.e. as required by DISSparse). The main complexity of the algorithm is related to discriminative itemset mining. The two heuristics in the modified DISSparse are efficient, and they do not need to generate and test all the itemset combinations; they eliminate non-potential itemsets before generating them. This algorithm showed good performance when tested using different synthetic and real large datasets from different domains [17, 20]. Two functions are used in the above code, for DAC-Tree traversal and rule ranking, respectively. The rule pruning is also done by adding only the non-redundant, confident and frequent rules to the DAC-Tree, as follows.

Rule Pruning

Compared to traditional association rules, the discriminative association rules are smaller in number. They have to be frequent first and then have a higher ratio in the target data class vs the rest of the data classes. The second condition prunes many association rules that are not discriminative in any data class. Although discriminative itemsets are a small subset of frequent itemsets, the DAC algorithm still derives a large set of rules, because classification data are typically highly correlated. As a result, we attempt to reduce the size of the classifier produced by the DAC approach, mainly by preventing rules that are either redundant or misleading from taking any role in the prediction process of test data objects. The removal of redundant rules can lead to a more effective and accurate classification process. Our rule pruning consists of three different steps: rule length, rule confidence and support, and rule redundancy. We limit the length of the generated rules to avoid long rules; longer rules are susceptible to noise in the datasets (i.e. they lead to overfitting). We define a minimum confidence \(c\) and minimum support \(s\) to prune the rules that fall below these thresholds. By using these two parameters, the misleading rules with low support and high confidence, or low confidence and high support, are removed before the rule ranking process.

We prune the redundant rules during rule mining and before adding them to the DAC-Tree. A rule's condition is made of attribute-value combinations. Classification rules may therefore share training items in their conditions, and for this reason there can be several specific rules containing a more general rule. Rule redundancy in the classifier is unnecessary and can in fact be a serious problem, especially if the number of discovered rules is extremely large. Rule redundancy is handled by preferring the shorter rules (i.e. general rules) that have higher or equal confidence compared with the longer rules sharing their items (i.e. specific rules). The redundant rule pruning method discards specific rules whose confidence is lower than or equal to that of a general rule. Once the rule generation process is completed, and before adding the rules to the DAC-Tree, an evaluation step is performed to prune every rule I′ → c from the set of generated rules for which there is some general rule I → c of a higher rank with I ⊆ I′. This pruning method may minimize rule redundancy and reduce the size of the resulting classifiers.
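
A small sketch of the redundancy check is given below (our reading of the procedure, with illustrative names): a specific rule I′ → c is dropped when a shorter general rule I → c with I ⊆ I′ has confidence at least as high.

```cpp
#include <algorithm>
#include <vector>

struct Rule {
    std::vector<int> lhs;     // sorted left-hand itemset
    int classLabel;
    double confidence, discriminative, support;
};

// True if every item of general.lhs occurs in specific.lhs (I ⊆ I');
// both sides are assumed to be sorted.
bool isSubsetOf(const Rule& general, const Rule& specific) {
    return std::includes(specific.lhs.begin(), specific.lhs.end(),
                         general.lhs.begin(), general.lhs.end());
}

// Removes specific rules covered by a shorter general rule of the same class
// with confidence at least as high -- one reading of the redundant rule
// pruning described above, not the exact published procedure.
void pruneRedundant(std::vector<Rule>& rules) {
    std::vector<Rule> kept;
    for (const Rule& r : rules) {
        bool redundant = false;
        for (const Rule& g : rules) {
            if (&g == &r || g.classLabel != r.classLabel) continue;
            if (g.lhs.size() < r.lhs.size() && g.confidence >= r.confidence &&
                isSubsetOf(g, r)) { redundant = true; break; }
        }
        if (!redundant) kept.push_back(r);
    }
    rules.swap(kept);
}
```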

After rule pruning, we traverse the DAC-Tree and find the set of all rules that classify the new transaction. Then, we do the rule ranking based on rule precedence as follows.

Rule Ranking

DAC-Tree traversal is done using a recursive function called DACTraverse. This function takes \(Tr\), \(Tr\_len\) and \(DACroot\) as inputs and fills a stack with all the rules matched with the new \(Tr\) as output. It traverses the DAC-Tree using all \(Tr\) subsets starting with any of the \(Tr\) items. When it reaches a rule node (i.e. one with the four rule metrics), it adds the rule to the stack and continues with the other \(Tr\) subsets. To avoid traversing the same rule more than once, we negate (mark) the traversed \(Tr\) subsets. The recursive function then checks the \(Tr\) items before calling itself on \(Tr\) with a smaller length. After checking all subsets, the stack is used for rule ranking. The traversing algorithm is given in Algorithm 2.

Algorithm 2 The DACTraverse function
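
A simplified, self-contained version of such a traversal is sketched below (ours, not the published DACTraverse). It walks the prefix tree along every subset of the sorted transaction and collects the rule nodes it reaches; because it only advances forward through the transaction items, each rule is visited at most once when the items are distinct.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Minimal node and rule records for the traversal sketch (illustrative only).
struct RuleInfo { int classLabel; double confidence, discriminative, support; };
struct Node {
    std::map<int, Node> children;
    const RuleInfo* rule = nullptr;       // non-null only for rule nodes
};

// Collects every rule whose left-hand itemset is a subset of the (sorted)
// transaction: from each node, try to continue with any later transaction
// item, and record the rule nodes encountered along the way.
void collectMatches(const Node& node, const std::vector<int>& tr, std::size_t from,
                    std::vector<const RuleInfo*>& matched) {
    if (node.rule) matched.push_back(node.rule);
    for (std::size_t i = from; i < tr.size(); ++i) {
        auto it = node.children.find(tr[i]);
        if (it != node.children.end())
            collectMatches(it->second, tr, i + 1, matched);   // continue after item i
    }
}
```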

Discriminative itemsets are a small subset of the frequent itemsets. The prefix tree used for storing the rules is therefore small, and traversing it with a recursive function is not time-consuming. The traversing function fills the stack with the set of rules matched with each new transaction. The rules may belong to different classes, and each rule carries its class label, confidence, discriminative value and support. After that, rule ranking based on rule precedence predicts the right label for the transaction.

For ranking the discovered rules related to each new transaction, we define a function called DACRanking. The rules are placed in a stack before rule ranking. We check all the rules with the same class label to choose the one with the highest precedence. The highest precedence rule in each class nominates that class label. However, using only one rule per class may cause overfitting due to noise. To avoid this, we use a rule counter for each class and then multiply the rule metrics (i.e. confidence, discriminative value and support) by this counter. The hypothesis is that both the highest precedence and the highest cardinality of rule repetition (i.e. in each class label) should be involved in rule ranking. Based on this, the rule with the highest precedence in a class with a higher cardinality of rules is ranked higher. In the case of the same metrics for two rules, the algorithm chooses the one with the shortest length (i.e. generated earlier). If no rule matches the transaction, the default class is chosen. In our algorithm, the default class is the class with the highest cardinality of rules in the DAC-Tree. After that, the DAC algorithm reads the next \(Tr\) and the process continues for the rest of the test dataset. The rule ranking algorithm is given in Algorithm 3.

Algorithm 3 The DACRanking function
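
The sketch below gives one possible reading of this ranking step (ours; the exact weighting in Algorithm 3 may differ): the best rule of each class is chosen by precedence, weighted by how many matched rules that class contributed, and the class with the highest weighted score is predicted, falling back to the default class when nothing matches.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct MatchedRule { int classLabel; double confidence, discriminative, support; };

int rankAndPredict(const std::vector<MatchedRule>& matched, int defaultClass) {
    if (matched.empty()) return defaultClass;                 // no matching rule
    // class label -> (best rule seen so far by precedence, number of matched rules)
    std::map<int, std::pair<const MatchedRule*, std::size_t>> best;
    for (const MatchedRule& r : matched) {
        auto& entry = best[r.classLabel];
        ++entry.second;
        const MatchedRule* b = entry.first;
        if (!b ||
            r.confidence > b->confidence ||
            (r.confidence == b->confidence && r.discriminative > b->discriminative) ||
            (r.confidence == b->confidence && r.discriminative == b->discriminative &&
             r.support > b->support))
            entry.first = &r;
    }
    int predicted = defaultClass;
    double bestScore = -1.0;
    for (const auto& kv : best) {
        const MatchedRule& b = *kv.second.first;
        // Weight the best rule's metrics by the class's rule count (illustrative combination).
        double score = double(kv.second.second) * (b.confidence + b.discriminative + b.support);
        if (score > bestScore) { bestScore = score; predicted = kv.first; }
    }
    return predicted;
}
```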

The number of rules in the stack is usually not large. The algorithm checks all the rules matched with transaction \(Tr\) using a nested loop. The rules in each class are checked only once, and the one with the highest precedence and highest repetition is chosen. This is done for each transaction in the test dataset based on cross-validation. The accuracy of the built classifier is high for most of the datasets, and the algorithm is very efficient, especially for large datasets. The advantages of the rules in our method, and their superiority to the other classification rule mining methods, are explained in the following.

Quality Rules

Many classification techniques use domain-independent biases to generate rules to form a classifier, resulting in many rules that make no sense to the user. Our framework helps to improve the understandability of the rules in classification rule mining. We use discriminative itemsets that show the class differences. In many cases, the discrimination is inherent in the data classes, and the rules are understandable by the user in specific domains. Moreover, associative classification techniques may sometimes suffer from biased classification or overfitting caused by the huge set of mined rules. The discriminative itemsets are a small subset of the class association rules and are usually shorter in length. The superiority of our classification technique is mostly shown on large-scale datasets, where the other classifiers fail or take too long during rule mining. Our algorithm achieves this using several pruning heuristics, both during the rule mining and during the rule ranking processes. The rationale of our pruning is that we use rules reflecting strong implications to do classification; by removing rules that are not positively correlated, we prune noise. We also prune the redundant rules that are longer and have lower or equal confidence compared with a general rule. In addition, to make a reliable and accurate prediction, the most confident rules may not always be the best choice. We use rule precedence based on itemsets that are all discriminative; however, confidence still precedes the discriminative value of the rules. In the case of higher confidence, the discriminative value is not considered; if the confidences are equal, then the rule with the higher discriminative value is preferred. Our algorithm chooses the highest precedence rule in the class that has the higher cardinality of rules. In this way, overfitting is prevented by avoiding rule ranking based only on rule precedence.

Empirical Evaluation

In this section, we evaluate the accuracy, efficiency and scalability of DAC by performing an extensive performance study. The execution time of the DAC algorithm is also shown. In our experiments, we use different discriminative values \(\theta\) and support thresholds \(\varphi\). The algorithms were implemented in C++ and the experiments were conducted on a desktop computer with an Intel Core 2 Duo T6600 2.2 GHz CPU and 4 GB main memory running 64-bit Microsoft Windows 7 Enterprise.

The baseline datasets contain categorical or continuous attributes. The possible values in the categorical attributes are mapped to a set of consecutive positive integers. For a continuous attribute, the value range is discretized into intervals, and then intervals are mapped to consecutive positive integers. We do the discretization of continuous attributes using the Entropy method [43]. Following these procedures, all the attributes are treated uniformly in this study. Tenfold cross-validation is used for every dataset.
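
As a small illustration of this preprocessing (ours, not the paper's code), the function below maps the distinct values of one categorical attribute to consecutive positive integers; entropy-based discretization of continuous attributes would be applied before such an encoding.

```cpp
#include <map>
#include <string>
#include <vector>

// Maps the distinct values of one categorical attribute column to consecutive
// positive integers (1, 2, 3, ...), in order of first appearance.
std::vector<int> encodeCategorical(const std::vector<std::string>& column) {
    std::map<std::string, int> code;
    std::vector<int> encoded;
    encoded.reserve(column.size());
    for (const std::string& v : column) {
        auto it = code.find(v);
        if (it == code.end())
            it = code.emplace(v, int(code.size()) + 1).first;   // assign the next code
        encoded.push_back(it->second);
    }
    return encoded;
}
```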

Comparison with Traditional Rule-Based Classification Methods

In this section, we compare our classifier with the state-of-the-art rule-based classifiers in two sub-sections: first, with those produced by C4.5 [4] and the AC methods including CBA [6], CMAR [5], CPAR [36] and L3 [39]; and second, in a separate sub-section, with HARMONY [38] and ACN [40].

DAC Superiority over the Traditional AC Methods

Here, we used 26 datasets from the UCI ML Repository [22] for this purpose. These datasets have been used with the traditional AC methods, as the state of the art. We limit the rule length to less than 7 for the different datasets. This also improves the performance of the algorithm, specifically for datasets with longer transaction lengths (e.g. sick and sonar). Because the DAC algorithm does not follow the Apriori property, this limit helps to finish the rule generation process with reasonable time and space complexity. Table 1 shows the accuracy of the six approaches on the selected datasets.

Table 1 The comparison of C4.5, CBA, CMAR, CPAR, L3 and DAC on accuracy 

As can be seen, DAC does well compared with C4.5, CBA, CMAR, CPAR and L3 on accuracy. On average, DAC stays between CMAR and L3. Furthermore, out of the 26 datasets, DAC achieves the best accuracy on 11 of them; in other words, DAC wins in more than 42% of the test datasets. On some datasets, e.g. Sonar, Waveform and Wine, DAC takes second place. Although L3 shows higher accuracy on average for these datasets, DAC is specifically designed for mining a small set of discriminative rules from large datasets. Its time and space efficiency (plus its reasonably high accuracy) are crucial for the classification of large datasets, where the rest of the FP-Growth-like methods fail or take too long.

Statistically, DAC outperforms the other associative classification methods on datasets which are not pessimistic. For clarification, pessimistic datasets are the ones where the majority of the instances are labelled with one specific class (e.g. more than 90%), and the rest of the instances are labelled with the other class(es). In our experiments, Hypo and Sick are pessimistic (i.e. with more than 93% belonging to one class). On these datasets, our algorithm reports accuracy only a bit better than random classification (i.e. classifying all instances as the majority class). In the pessimistic datasets, the class distribution of the instances is itself discriminative, which causes the rules of the majority class to take precedence. Although we could handle such exceptions based on the dataset characteristics (i.e. the class distribution), we leave the algorithm in its general form.

There are two important thresholds, support threshold \(\varphi\) and ratio \(\theta\) for discriminative itemsets in DAC. As discussed before, these two thresholds control the number of rules selected for classification. In general, if the set of rules is too small, some effective rules may be missed. On the other hand, if the rule set is too large, the training dataset may be overfitting. Thus, we need to test the sensitivity of the thresholds w.r.t classification accuracy.

The results reported for the DAC algorithm have been obtained by selecting the best configuration for each dataset. We used a greedy method for tuning the parameters. For this, we hold all the parameters fixed (i.e. the support threshold, the ratio, and less important but still effective parameters including rule length, confidence and support) and change only one of them to find the best reported accuracy. We follow the same process for tuning the other parameters. To achieve possibly better accuracy, we repeat the greedy procedure once or a couple more times. Considering the quick run-time of the proposed DAC algorithm, the time needed to find the parameter setting during classifier tuning is small (e.g. the time usage of the DAC algorithm for most of the datasets in Table 1 is under a minute). This is true even for large datasets, as we can sample a large dataset by selecting a few thousand of its instances and running the algorithm for tuning the classifier. Then, the obtained setting is used for the classification of the dataset at its real size (i.e. as in Sect. DAC Scalability in Large-Scale Datasets). Theoretically, the time complexity of the tuning process amounts to setting the five parameters mentioned above using a greedy method, and the complexity of such a greedy method is in the range of \(O({n}^{2})\). Empirically, the algorithm is fast and efficient, tuning the classifier does not take much time, and usually the best setting is identified in the first or second iteration.
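
The greedy tuning loop can be sketched as a coordinate-wise search (our illustration, with assumed parameter names and an assumed evaluate callback that returns the cross-validated accuracy for a given setting); the real tuning procedure may differ in the candidate values it tries.

```cpp
#include <array>
#include <functional>

// Tunable parameters: ratio theta, support phi, maximum rule length, minconf, minsup.
struct Params { double theta, phi, maxLen, minconf, minsup; };

// Greedy, one-parameter-at-a-time tuning: hold all parameters fixed, vary one,
// keep the best value found, and repeat the sweep a couple of times.
Params tuneGreedy(Params p,
                  const std::function<double(const Params&)>& evaluate,
                  int sweeps = 2) {
    const std::array<double, 3> steps = {0.8, 1.0, 1.25};   // candidate multipliers
    double bestAcc = evaluate(p);
    for (int s = 0; s < sweeps; ++s) {
        double* fields[] = {&p.theta, &p.phi, &p.maxLen, &p.minconf, &p.minsup};
        for (double* f : fields) {
            double original = *f, bestVal = original;
            for (double m : steps) {
                *f = original * m;
                double acc = evaluate(p);
                if (acc > bestAcc) { bestAcc = acc; bestVal = *f; }
            }
            *f = bestVal;             // keep the best value found for this parameter
        }
    }
    return p;
}
```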

As an example, we test different support threshold \(\varphi\) and ratio \(\theta\) values on the Austral dataset. The ratio \(\theta\) is set to \(1.5\) with different support thresholds. We also test different ratio \(\theta\) values on the Austral dataset, where \(\varphi\) is set to \(0.068\). The accuracy results are shown in Fig. 2. The accuracy sensitivity and the time and space complexity results on the Austral dataset are representative of, and generalizable to, other datasets. We consider the variation of the support threshold \(\varphi\) and the ratio \(\theta\) separately with respect to the indicators of accuracy, time and space, as \(\varphi\) itself is defined in terms of \(\theta\) (i.e. \(0 < \varphi < 1/\theta\)).

Fig. 2 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Austral dataset

From the above figure, we can see that there are optimal settings for both thresholds. However, according to our experimental results, there seems to be no way to pre-determine the best threshold values. Fortunately, both curves are quite flat, which means the accuracy is not very sensitive to the two threshold values. The time and space complexity are shown in Figs. 3 and 4, respectively.

Fig. 3 The time and space complexity of DAC with different support thresholds \(\varphi\) and \(\theta =1.5\) on the Austral dataset

Fig. 4 The time and space complexity of DAC with different ratios \(\theta\) and \(\varphi =0.068\) on the Austral dataset

DAC-Tree is a compact structure for storing rules. To test its effectiveness, we compare the main memory usage of CBA, CMAR, L3 and DAC on large datasets. The results are shown in Table 2. Note that CMAR and CPAR use similar rule mining algorithms and, on average, they show close accuracy to each other (as in Table 1), so the amount of memory used by CPAR is not reported.

Table 2 The comparison of CBA, CMAR and DAC on space usage

In these experiments, the limitation on the number of rules in CBA is disabled. In such a setting, CBA, CMAR, L3 and DAC are compared on a fair basis, as they generate all the rules necessary for the classification. From Table 2, we can see that, on average, DAC achieves very good savings in main memory usage. The compactness of the DAC-Tree brings a significant gain in storing a large set of rules where many items in the rules can be shared. The runtimes vary from a few seconds for most of the datasets (e.g. the Breast and Diabetes datasets) to a few minutes (e.g. the Sonar and Waveform datasets). The number of discovered rules (i.e. CDARs), in most of the baseline datasets, mainly stays under 10,000 to 15,000, with fewer than 1,000 rules for several of them (e.g. Diabetes, Horse and Iris). Moreover, for some datasets (e.g. Auto and Vehicle) it reaches about 70,000 rules. In the following sub-sections, we compare the DAC effectiveness with other, more recent methods, i.e. HARMONY [38] and ACN [40].

DAC Superiority to HARMONY

We compare DAC with HARMONY in Table 3. Here, we used 16 datasets from the UCI ML Repository [22] for this purpose. These are the datasets used by [38] for comparing HARMONY with other methods on accuracy.

Table 3 The comparison of HARMONY and DAC on accuracy

DAC Superiority to ACN

We compare DAC with ACN in Table 4. Here, we used 16 datasets from the UCI ML Repository [22] for this purpose. These are the datasets used by [40] for comparing ACN with other methods on accuracy.

Table 4 The comparison of ACN and DAC on accuracy

As we discussed in the introduction and related works, CDARs are conceptually close to emerging patterns. In the following, we compare the effectiveness of our classifier with the one proposed based on emerging patterns.

Comparison with Emerging Pattern Classification

We compare the DAC method with iCAEP in Table 5. Here, we used 14 datasets from the UCI ML Repository [22] for this purpose. These are the datasets used by [25] for comparing iCAEP with other methods on accuracy.

Table 5 The comparison of iCAEP and DAC on accuracy

Comparison with Another Type of Classification

Comparing with top classification approaches is also interesting, to see the difference in accuracy (i.e. what one sacrifices in accuracy to get "understandable" rules). Here, we used 19 datasets from the UCI ML Repository [22] for this purpose. These are the datasets used by [41] for accuracy comparison. The accuracies of C4.5 and SVM on discriminative frequent patterns are presented in Table 6.

Table 6 The comparison of C4.5 and SVM on discriminative frequent patterns and DAC on accuracy

We also used the seven datasets from the UCI ML Repository [22] used by DDPMine [42] for accuracy comparison. The accuracy of DDPMine on discriminative frequent patterns is presented in Table 7.

Table 7 The comparison of DDPMine and DAC on accuracy

The most appealing feature of the proposed method, compared with the other methods, is that it uses a smaller number of discriminative rules extracted during rule mining, which makes the algorithm efficient in time and space usage while maintaining good accuracy. Although DAC surpasses most of the traditional AC methods and emerging pattern mining methods in accuracy on the baseline datasets (Tables 1, 3, 4 and 5), the main purpose of presenting this method is its scalability (and, at the same time, its interpretability) for classifying large-scale datasets (e.g. big data analytics and data stream classification [21]). The DAC algorithm is based on the DISSparse method, which is highly efficient compared with the other AC methods designed on top of FP-Growth (extensively studied and evaluated in [17, 20]). In the following section, we measure DAC scalability on three large datasets.

DAC Scalability in Large-Scale Datasets

In this section, we evaluate the DAC scalability using several synthetic and real datasets. As discussed in the rule mining algorithm section, the rule mining algorithm in the DAC method is based on a modified version of the DISSparse algorithm. Rule mining in the traditional AC methods is based on either Apriori or FP-Growth. However, the Apriori property does not hold for discriminative itemsets, as a subset of a discriminative itemset can be non-discriminative. In [16], DISTree was proposed based on FP-Growth, following an incremental itemset mining approach (i.e. divide-and-conquer over subsets does not work). DISSparse is a fast heuristic-based algorithm proposed in [17] for discriminative itemset mining. An extensive efficiency comparison between DISTree and DISSparse was carried out in [17, 20] using synthetic and real datasets. Here, we evaluate the DAC scalability on three real large-scale datasets.
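To make the failure of the Apriori property concrete, consider a hypothetical two-class example (the counts below are invented for illustration and are not taken from any of the experimental datasets, and the check uses only the support ratio, whereas the full CDAR definition also involves the support threshold \(\varphi\)): an itemset can meet the discriminative ratio threshold while one of its subsets does not, so subset-based pruning as in Apriori or plain FP-Growth cannot be applied directly.

```python
# Hypothetical per-class supports (fractions of transactions in each class).
support_target = {("a",): 0.30, ("a", "b"): 0.20}   # target data class
support_other  = {("a",): 0.25, ("a", "b"): 0.04}   # all other data classes combined
theta = 3.0   # required ratio of target-class support to other-class support

def is_discriminative(itemset):
    """Discriminative check based on the support ratio only (illustrative)."""
    return support_target[itemset] / support_other[itemset] >= theta

print(is_discriminative(("a",)))       # False: 0.30 / 0.25 = 1.2  < 3.0
print(is_discriminative(("a", "b")))   # True:  0.20 / 0.04 = 5.0 >= 3.0
```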

As explained, the CDARs are a small subset of the CARs, and in contrast to the previous methods, the DAC algorithm is very efficient for large datasets. We chose three larger datasets, Mushroom, Adult and Susy, from the UCI repository [22] to test our algorithm. The Mushroom dataset has inherent discrimination, with its two classes of edible and poisonous mushrooms; the Adult and Susy datasets have less inherent discrimination. The selected datasets are dense, with few sparsity characteristics. The time and space usage of the baseline AC methods (i.e. the ones used in Table 1) is mostly not within a tolerable range on these three datasets, so we evaluate them only to show DAC scalability on large-scale datasets, although statistically mostly better accuracies are also obtained. Unfortunately, we did not find more large datasets suitable for classification. However, to assess scalability, we repeated our method on the Susy dataset (made of five million instances) with different sizes and obtained the same promising results (here we report only the experiments on the first hundred thousand instances).

The Mushroom dataset includes descriptions of hypothetical samples corresponding to different species identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one. The dataset size is 8124 instances. In each transaction, the first column is the class label, followed by 22 features, and transaction items are constructed from about 120 unique items. The best observed accuracy on this dataset was \(99.9\%\). The results are shown in Fig. 5.
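Attribute-value tables such as Mushroom are typically converted into transactions of (attribute, value) items before rule mining. The conversion below is a generic sketch of that step, not the exact preprocessing code used in our experiments; the item naming scheme is an assumption.

```python
def row_to_transaction(row):
    """Convert one CSV row, e.g. 'p,x,s,n,...', into (class_label, items).

    The first column is the class label; each remaining column j with value v
    becomes the item 'attr{j}={v}', which is how roughly 22 features expand
    into on the order of a hundred unique items.
    """
    fields = row.strip().split(",")
    class_label = fields[0]
    items = tuple(f"attr{j}={v}" for j, v in enumerate(fields[1:], start=1))
    return class_label, items


label, items = row_to_transaction("p,x,s,n,t,p,f,c,n,k")
print(label)       # 'p' (poisonous)
print(items[:3])   # ('attr1=x', 'attr2=s', 'attr3=n')
```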

Fig. 5 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Mushroom dataset

The DAC algorithm is also fast, especially on large datasets. Note that a large part of the running time is consumed by the cross-validation process. The time and space usage are presented in Figs. 6 and 7.

Fig. 6 The time and space complexity of DAC with different support thresholds \(\varphi\) and \(\theta = 3\) on the Mushroom dataset

Fig. 7 The time and space complexity of DAC with different ratios \(\theta\) and \(\varphi = 0.08\) on the Mushroom dataset
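The cross-validation loop mentioned above multiplies the rule mining cost by the number of folds, which is why it dominates the reported running times. The sketch below is a generic k-fold evaluation loop, not the evaluation harness used in our experiments; `train_and_classify` is a placeholder for training DAC on the fold and returning a prediction function.

```python
import random

def k_fold_accuracy(transactions, train_and_classify, k=10, seed=0):
    """Plain k-fold cross-validation over (class_label, items) pairs.

    `train_and_classify(train)` is assumed to mine rules, build the classifier
    and return a function classify(items) -> class_label.
    """
    data = list(transactions)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        classify = train_and_classify(train)   # rule mining repeated once per fold
        correct = sum(classify(items) == label for label, items in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```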

The Adult dataset is made of 48,842 instances; the first column is the class label, followed by 14 features. The task is to determine whether a person makes over 50K a year. Greedily tuning the classification parameters (i.e. \(\theta\) and \(\varphi\), plus the rule length limit, confidence and support) on large-scale datasets, as explained earlier, is time-consuming. To save time, we ran the algorithm on a smaller sample (e.g. 5000 instances) to learn about the classifier behaviour and then expanded to the full-size dataset (i.e. 48,842 instances). The best observed accuracy on this dataset was \(83.34\%\), for \(\theta = 4.5\) and \(\varphi = 0.01\); the best accuracy was obtained with a rule length limit of less than 6. The results are shown in Figs. 8, 9 and 10.

Fig. 8 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Adult dataset

Fig. 9 The time and space complexity of DAC with different support thresholds \(\varphi\) and \(\theta = 4.5\) on the Adult dataset

Fig. 10 The time and space complexity of DAC with different ratios \(\theta\) and \(\varphi = 0.01\) on the Adult dataset
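The sample-first tuning used for the Adult dataset can be sketched as a coarse grid search on a subsample, after which the best \((\theta, \varphi)\) pair is re-evaluated on the full data. The grids and the `evaluate` function below are placeholders rather than the exact settings from our experiments; only the 5000-instance sample size comes from the description above.

```python
def tune_on_sample(dataset, evaluate, sample_size=5000,
                   theta_grid=(1.5, 2.5, 3.5, 4.5),
                   phi_grid=(0.005, 0.01, 0.02, 0.05)):
    """Pick (theta, phi) on a small sample, then confirm on the full dataset.

    `dataset` is a list of transactions; `evaluate(data, theta, phi)` is assumed
    to train DAC with the given thresholds and return cross-validated accuracy.
    """
    sample = dataset[:sample_size]
    # Coarse grid search on the sample only.
    _, theta, phi = max((evaluate(sample, t, p), t, p)
                        for t in theta_grid for p in phi_grid)
    # One expensive run on the full-size data with the selected thresholds.
    return theta, phi, evaluate(dataset, theta, phi)
```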

The Susy dataset contains high-level features derived by physicists to help discriminate between two classes, defined as signal and background. This dataset is made of five million instances; the first column is the class label, followed by eighteen features. We selected the first hundred thousand instances for these experiments (i.e. 2% of the dataset). The best observed accuracy on this dataset was \(70.15\%\), using \(\theta = 2.5\) and \(\varphi = 0.015\). The results are shown in Figs. 11, 12 and 13.

Fig. 11 The effect of support threshold \(\varphi\) and ratio \(\theta\) on the accuracy of the Susy dataset

Fig. 12 The time and space complexity of DAC with different support thresholds \(\varphi\) and \(\theta = 2.5\) on the Susy dataset

Fig. 13 The time and space complexity of DAC with different ratios \(\theta\) and \(\varphi = 0.015\) on the Susy dataset

It is clear from the above figures that our algorithm is capable of classifying large datasets with reasonable time and space usage and good accuracy. We observed linear growth in the algorithm's time and space usage with increasing dataset size (e.g. we repeated the experiments on the Susy dataset with different sizes). As can be seen from the scalability experiments, a greater number of rules (obtained with smaller \(\theta\) and \(\varphi\)) led to lower accuracy (i.e. overfitting caused by longer rules with lower support and discriminative value). Therefore, the best parameter setting is usually somewhere in between, which also leads to acceptable time and space usage. The main advantage of our algorithm over previous works is the use of discriminative itemsets: every selected rule is discriminative. As a result, many unnecessary rules are either not generated during rule mining or simply pruned during rule pruning and rule ranking. The discriminative value of the discovered rules is the second priority for rule ranking, after rule confidence.
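The ranking criterion described above (confidence first, then discriminative value) can be expressed as a simple sort key. The tie-breaking by support and rule length shown here is a plausible convention rather than necessarily the exact order used by DAC, and the rule fields are illustrative.

```python
from typing import NamedTuple, Tuple

class CDAR(NamedTuple):
    antecedent: Tuple[str, ...]
    class_label: str
    confidence: float
    disc_value: float   # discriminative value (ratio between class-wise supports)
    support: float

def rank_rules(rules):
    """Order rules by confidence, then discriminative value; the remaining
    tie-breakers (support, then shorter antecedents) are assumptions."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.disc_value,
                                        -r.support, len(r.antecedent)))
```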

Conclusion and Future Works

Discriminative itemsets show the distinguishing features of each data class in comparison with all other data classes in the dataset. This paper proposes a method for mining a set of discriminative rules from large datasets using the DAC-Tree structure. It presents the DAC algorithm to extract a complete set of rules from datasets accommodating several data classes. The structures used in the classification mining process are designed to consume the least time and space and can easily fit into main memory. An extensive evaluation with various datasets exhibiting distinct characteristics has been conducted. The algorithm reports the set of rules with full accuracy and recall and with no false answers. The classifier is built based on the highest-precedence rule with the highest repetition. The results confirm that mining class discriminative association rules is effective and practical for both small and large datasets. In future work, we plan to extend the algorithm to discriminative associative classification in data streams using different window models.