Grafting for combinatorial binary model using frequent itemset mining

We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs offer high knowledge interpretability, but naïve learning of them from labeled data requires computational cost that grows exponentially with the length of the conjunctions. On the other hand, for large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the $L_1$-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step of the grafting algorithm, which is designed to solve large-scale $L_1$-regularized empirical risk minimization problems by an iterative approach. Furthermore, we experimentally show that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.


INTRODUCTION

Motivation
We are concerned with learning classification/regression functions from labeled data. We require interpretability of the acquired knowledge as well as high classification accuracy. Under this requirement we focus on the class of linear prediction models of the following form:

$$f_w(x) = \sum_{j=1}^{d} w_j x_j + w_{d+1},$$

where w is a (d + 1)-dimensional vector. It takes as input a d-dimensional vector x = (x_1, ..., x_d) ∈ R^d and outputs a scalar value y = f_w(x). We call each x_j an attribute.
This type of linear prediction function is suitable for knowledge discovery in the sense that the attributes whose weights are large can be interpreted as more important features for classification or regression. Nonlinear models such as kernel methods or multi-layer neural networks may achieve higher classification accuracy, but have no such obvious interpretation of features. The power of knowledge representation of linear prediction models is, however, quite limited in the sense that an important feature is represented by a single attribute only. We are rather interested in a wider class of linear prediction models in which features may be represented by combinations of attributes. Such a class has richer knowledge representations and may achieve higher prediction accuracy, if properly learned. However, we may suffer from computational difficulty in learning such a class, since the total number of combinations of attributes is exponential in the dimension d. There arises the issue of how we can learn such a class of rich knowledge representations with less computational demand. The purpose of this paper is twofold. The first is to propose an efficient algorithm that learns a class of linear predictors over all possible combinations of binary attributes. The second is to empirically demonstrate that the proposed algorithm efficiently produces predictors with higher accuracy as well as better interpretability than competitive methods.

Significance and Novelty
The significance of this paper is summarized as follows: 1) Proposal of an efficient algorithm that learns a class of linear predictors over all conjunctions of attributes: We consider the class of linear predictors over all conjunctions of Boolean attributes. We call this class the combinatorial Boolean model (CBM). It offers rich knowledge representation over the Boolean domain. We consider the problem of learning CBM from labeled examples within the regularized loss minimization framework. Since the size of CBM is exponential in the data dimension, it is challenging to learn CBM as efficiently as possible.
We propose a novel algorithm called GRAfting for Boolean datasets (GRAB) to learn CBM efficiently. It outputs a linear predictor in CBM in time polynomial in the sample size m and |P_all|, where P_all is the set of combinations of attributes that occur at least once in the input dataset.
The key idea of GRAB is to reduce the loss minimization problem to a frequent itemset mining (FIM) problem. We use the grafting algorithm [13], which is designed for large-scale regularized loss minimization. We observed that the grafting algorithm includes FIM as a sub-procedure. Meanwhile, there exists an algorithm for FIM [21] that efficiently finds frequent itemsets by making use of the monotonicity property of itemsets. We integrate this efficient FIM algorithm into the grafting algorithm so that GRAB works very efficiently.
2) Empirical demonstration of the validity of GRAB in terms of computational efficiency, prediction accuracy, and knowledge interpretability: We employ benchmark datasets to demonstrate that GRAB is effective in terms of computational efficiency, prediction accuracy, and knowledge interpretability. We empirically show that GRAB achieves higher or almost comparable prediction accuracy with much less computation time than existing methods such as support vector classifiers with polynomial kernels and radial basis function kernels. Further, we show that GRAB can acquire important knowledge in a comprehensible form.

Related Work
There are many studies on learning interpretable knowledge representations such as decision trees, Boolean functions, etc. Meanwhile, there are also many studies on uninterpretable but highly predictive knowledge representations such as support vector machines, neural networks, etc. We note that there also exist studies that attempt to discover comprehensible knowledge using highly predictive models. Setiono et al. [17] proposed an algorithm that approximates a trained neural network using combinations of linear predictors. The resulting combinations of predictors are highly interpretable.
FIM has been successful in several tasks emerging in machine learning such as clustering and boosting [10, 16, 19]. One of the most closely related works is [16], which utilized FIM for boosting. They also considered a model similar, but not identical, to CBM and worked on regression tasks in a biological context.
In the case of binary classification tasks, learning of CBM is related to learning Boolean functions, especially disjunctive normal forms (DNF), which has been extensively explored in computational learning theory, e.g., [2] and [6]. However, CBM differs from DNF in two respects: 1. CBM takes a weighted sum of conjunctions of attributes rather than a disjunction. 2. CBM includes all scalar-valued functions {0, 1}^d → R. Hence CBM can be thought of as a wider class of functions than DNF, and learning CBM is significant in this sense. The rest of this paper is organized as follows: Section 2 proposes CBM. Section 3 introduces the grafting algorithm for loss minimization. Section 4 introduces the frequent itemset mining methodology. Section 5 proposes GRAB by combining the grafting algorithm with frequent itemset mining. Section 6 shows experimental results. Section 7 gives concluding remarks.

COMBINATORIAL BOOLEAN MODEL
This section introduces the class of combinatorial Boolean models. Suppose that x ∈ {0, 1}^d, that is, each datum is represented by a Boolean-valued vector. This assumption does not lose generality because, when a datum is real-valued or integer-valued, we may transform it into a Boolean-valued one by discretizing it in some appropriate way (see Section 6).
Let X be the set of all attributes. We define the combinatorial feature set Φ^(d,k) as the set of conjunctions of at most k distinct attributes chosen from X. For example, in the case of d = 4 and k = 2, Φ^(d,k) is given as follows:

$$\Phi^{(4,2)} = \{ x_1,\ x_2,\ x_3,\ x_4,\ x_1 \wedge x_2,\ x_1 \wedge x_3,\ x_1 \wedge x_4,\ x_2 \wedge x_3,\ x_2 \wedge x_4,\ x_3 \wedge x_4 \}.$$

When d is fixed, we denote Φ^(d,k) by Φ^(k) in the discussion to follow. We define a linear predictor associated with Φ^(k) by

$$f_w^{(k)}(x) = \sum_{\phi \in \Phi^{(k)}} w_\phi\, \phi(x), \quad (1)$$

where w = (w_ϕ) is a real-valued |Φ^(k)|-dimensional parameter vector. We call the class of all functions of the form (1) the class of combinatorial Boolean models, which we abbreviate as CBM. We call each ϕ ∈ Φ^(k) a feature and k the degree. The weight w_ϕ represents the importance of the feature ϕ. Hence CBM has good interpretability because an important feature can be represented in the comprehensible form of a conjunction of attributes.
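For concreteness, the following is a minimal sketch (ours, not the authors' implementation; all names are assumed) of how Φ^(d,k) can be enumerated and how a CBM predictor is evaluated.

```python
from itertools import combinations

def feature_set(d, k):
    """Enumerate Phi^(d,k): all conjunctions of at most k distinct attributes,
    each represented as a frozenset of attribute indices."""
    return [frozenset(c) for j in range(1, k + 1)
            for c in combinations(range(d), j)]

def predict(w, x):
    """Evaluate f_w(x) = sum_phi w_phi * phi(x), where phi(x) = 1 iff every
    attribute in the conjunction equals 1 in x."""
    return sum(w_phi for phi, w_phi in w.items() if all(x[j] for j in phi))

# Tiny usage example with d = 4, k = 2.
phi_42 = feature_set(4, 2)                           # 4 + 6 = 10 features
w = {frozenset({0}): 0.5, frozenset({1, 3}): -1.0}   # sparse weight vector
print(len(phi_42), predict(w, [1, 1, 0, 1]))         # -> 10 -0.5
```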
On the other hand, as for the capacity of CBM, the following proposition holds. Let $(x_i, y_i)_{i=1}^{m}$ be the labeled examples, where $x_i \in \{0,1\}^d$ and $y_i \in \mathbb{R}$ for all i = 1, ..., m, and assume that $y_i = y_{i'}$ whenever $x_i = x_{i'}$. Then, for any $(x_i, y_i)_{i=1}^{m}$, there exists $(w_\phi)_{\phi \in \Phi^{(d)}}$ that satisfies

$$f_w^{(d)}(x_i) = y_i \quad (i = 1, \ldots, m).$$

The proof is omitted but will appear in the full version. This proposition shows that CBM has such high representational power that it can fit any labeled examples.
Let us consider the problem of learning CBM. The purpose of learning is to estimate the parameter vector w from given labeled examples (x_1, y_1), ..., (x_m, y_m), where y_i is the label corresponding to x_i and m is the sample size. As a learning framework, we employ regularized loss minimization, in which the objective function for learning is given by

$$G(w) = C \sum_{i=1}^{m} \ell\big(y_i, f_w^{(k)}(x_i)\big) + \Omega(w), \quad (2)$$

where ℓ is a loss function, Ω is a regularizing function, and C is a positive constant.
Our learning setting is closely related to loss minimization using polynomial kernels. In the case where x is a Boolean vector, loss minimization for CBM is analogous to that using polynomial kernels [18]. However, the weights that a polynomial kernel implicitly assigns to the features ϕ ∈ Φ^(k) depend on their degrees, i.e., on the numbers of attributes in the conjunctions. Thus the features are non-uniformly weighted under polynomial kernels, while CBM treats all features in Φ^(k) uniformly.
From the standpoint of knowledge interpretability, it is desirable that large weights be assigned only to a relatively small number of features ϕ ∈ Φ^(k) after learning. To this end we employ the framework of sparse learning [15] by taking the L_1-regularizer Ω(w) = ‖w‖_1 in (2). We denote the objective function for this case by G^(k)(w):

$$G^{(k)}(w) = C \sum_{i=1}^{m} \ell\big(y_i, f_w^{(k)}(x_i)\big) + \|w\|_1.$$

It is computationally difficult to learn CBM using existing standard techniques for loss minimization. In applying them, all of the values ϕ(x_i) would have to be stored for all ϕ and i beforehand, which requires memory exponential in the data dimension. Further, it is also computationally expensive to solve the large-scale optimization problem associated with the loss minimization, since the total number of parameters is $|\Phi^{(k)}| = \sum_{j=1}^{k} \binom{d}{j}$, which is of order d^k. We show how to overcome this computational difficulty in the sections to follow.

GRAFTING ALGORITHM
In this section we introduce the grafting algorithm [13]. Suppose that the data dimension is so large that there are many irrelevant attributes in the parameter vector w.
Then we may employ the L_1-regularizer as the penalty term Ω in (2), so that the objective is written as

$$G(w) = C \sum_{i=1}^{m} \ell\big(y_i, f_w(x_i)\big) + \|w\|_1. \quad (4)$$

The grafting algorithm is designed to solve this optimization problem efficiently even at large scale. The key idea of the grafting algorithm is to construct a set of active features by adding features incrementally.

Overall procedure
In each iteration of the grafting algorithm, a (sub)gradient-based heuristic is employed to find the feature that seemingly improves the model most effectively and to add it to the set of active features. At the t-th iteration, the grafting algorithm divides the components of the parameter vector w into two disjoint sets: F_t and Z_t = ¬F_t. We call the w_j ∈ F_t free weights. Z_t is constructed implicitly so that it always satisfies w_j = 0 for all w_j ∈ Z_t. The overall procedure of the grafting algorithm is as follows. First, it minimizes (4) with respect to the free weights, resulting in

$$0 \in \partial_{w_j} G \quad (5)$$

for all j ∈ F_t, where ∂_{w_j}G is the subdifferential of G with respect to w_j. Then, for j ∈ Z_t,

$$0 \notin \partial_{w_j} G \quad (6)$$

is a necessary and sufficient condition for the value of the objective function to be decreased by changing w_j (globally in the case of convex G and locally in general). Secondly, the grafting algorithm selects from Z_t the parameter that is seemingly most effective in decreasing the objective function and adds it to F_{t+1}. Then Z_{t+1} is implicitly updated by removing the selected parameter, and the grafting algorithm iterates the procedure described above.
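As an illustration of the loop structure only (ours, not the authors' code), the following sketch assumes the logistic loss, a naive proximal-gradient inner solver, and the threshold 1 appearing in condition (8); all function names are ours.

```python
import numpy as np

def grad_L(w, X, y, C):
    """Gradient of L(w) = C * sum_i log(1 + exp(-y_i * f_i)) with f = X @ w."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))      # dl/df for the logistic loss
    return C * (X.T @ coeff)

def optimize_free_weights(w, free, X, y, C, lr=0.1, iters=500):
    """Naive proximal-gradient steps on the free coordinates only (L1 penalty)."""
    for _ in range(iters):
        g = grad_L(w, X, y, C)
        for j in free:
            wj = w[j] - lr * g[j]
            w[j] = np.sign(wj) * max(abs(wj) - lr, 0.0)   # soft-thresholding
    return w

def grafting(X, y, C, max_features=10):
    m, d = X.shape
    w, free = np.zeros(d), []
    while len(free) < max_features:
        g = grad_L(w, X, y, C)
        candidates = [j for j in range(d) if j not in free]
        if not candidates:
            break
        j_best = max(candidates, key=lambda j: abs(g[j]))
        if abs(g[j_best]) <= 1.0:   # no remaining zero weight satisfies condition (8)
            break
        free.append(j_best)         # move w_{j_best} from Z to F
        w = optimize_free_weights(w, free, X, y, C)
    return w, free

# Toy usage: two informative binary attributes out of six.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = np.where(X[:, 0] + X[:, 1] >= 1, 1.0, -1.0)
w, active = grafting(X, y, C=1.0)
print(active, np.round(w[active], 2))
```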

Condition on effective parameters
The subdifferential of the objective function with respect to w_j ∈ Z_t (i.e., at w_j = 0) is calculated as

$$\partial_{w_j} G(w) = \frac{\partial L(w)}{\partial w_j} + [-1, 1],$$

where $L(w) = C \sum_{i=1}^{m} \ell(y_i, f_w(x_i))$ denotes the loss term of (4). Hence the condition (6) is equivalent to

$$\left| \frac{\partial L(w)}{\partial w_j} \right| > 1. \quad (8)$$

This implies that changing the value of w_j from 0 does not decrease the objective function if (8) is not satisfied, which is the main reason why L_1 regularization gives a sparse solution. It also leads to a stopping condition of the grafting algorithm, as shown below.

Parameter selection
We consider the problem of selecting a parameter to be moved from Z_t to F_t. We see from the above argument that a w_j ∈ Z_t satisfying (8) makes the value of the objective function decrease when its value is changed from 0.
There may exist more than one candidate satisfying (8). In that case, the grafting algorithm selects the parameter w_best that is expected to decrease the value of the objective function the most, by making use of the following gradient-based heuristic:

$$w_{\mathrm{best}} = \underset{w_j \in Z_t}{\arg\max}\ \left| \frac{\partial L(w)}{\partial w_j} \right|. \quad (9)$$

The derivatives of the objective function with respect to all parameters must be calculated to obtain the maximum in (9). However, this naïve method can be computationally intractable when the data dimension and sample size are large. To the best of the authors' knowledge, no efficient general method exists for this parameter selection problem.
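To see why this step becomes the bottleneck for CBM, the following brute-force sketch (ours, with assumed names) evaluates ∂L/∂w_ϕ for every conjunction of at most k attributes; the number of candidates grows like d^k, which is what GRAB avoids in Section 5.

```python
import numpy as np
from itertools import combinations

def naive_selection(X, dloss_df, C, k):
    """Brute-force evaluation of dL/dw_phi for every conjunction of <= k attributes.

    X        : (m, d) binary data matrix
    dloss_df : (m,) per-example loss derivatives dl(y_i, f(x_i))/df
    Returns the conjunction with the largest |dL/dw_phi| and that value.
    """
    m, d = X.shape
    best_phi, best_val = None, 0.0
    for size in range(1, k + 1):
        for phi in combinations(range(d), size):
            phi_x = X[:, phi].all(axis=1)           # phi(x_i) for all i
            grad = C * np.dot(dloss_df, phi_x)      # dL/dw_phi
            if abs(grad) > abs(best_val):
                best_phi, best_val = phi, grad
    return best_phi, best_val

# Toy usage with random residual-like derivatives.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 8)).astype(float)
dl = rng.normal(size=100)
print(naive_selection(X, dl, C=1.0, k=2))
```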

Stopping condition
If the condition (8) is not satisfied for any w_j ∈ Z_t, the value of the objective function cannot be decreased by changing the value of any w_j ∈ Z_t from 0, or of any w_j ∈ F_t from its current value, globally in the case of convex G and locally in general. Therefore this condition can be used as a stopping condition for the grafting algorithm. When it is fulfilled, we may conclude that a local optimum (or the global optimum in the case of convex G) has been reached.
To summarize, the overall procedure is given in Algorithm 1.

FREQUENT ITEMSET MINING
It is computationally difficult to find the best parameter according to (9), because it requires computing the gradient of the loss with respect to all components of the parameter vector. In order to overcome this difficulty, we employ techniques from frequent itemset mining (FIM). In this section we briefly review FIM.

Terminology
A set of items I = {1, ..., d} is called the item base. The set T = {t_1, ..., t_m} is called the transaction database, where each t_i is a subset of I. Each element of the transaction database is called a transaction, and we refer to a subset p of the item base as an itemset. Given a transaction database, the occurrence set of p, denoted by T(p), is the set of all transactions that include p, i.e.,

$$T(p) = \{ t \in T \mid p \subseteq t \}.$$

The cardinality of T(p) is called the frequency of p, denoted frq(p; T):

$$\mathrm{frq}(p; T) = |T(p)|.$$

The simplest form of the FIM problem is as follows: for a given transaction database T and threshold θ, find P, the set of all itemsets whose frequency is larger than θ, i.e.,

$$P = \{ p \subseteq I \mid \mathrm{frq}(p; T) > \theta \}.$$

Figure 1: Demonstration of frequent itemset mining (FIM) and its variants. In the standard setting of FIM, all transactions are treated homogeneously and the task is to find all itemsets that appear in a given transaction database more than once (left panel). Non-negatively weighted FIM (NWFIM) deals with non-negative weights on the transactions and finds itemsets whose weighted frequency is more than 1 (middle panel). Weighted FIM (WFIM) allows transaction weights to be negative and loses the monotonicity (right panel).

Algorithm 1: The grafting algorithm for the L_1-regularized problem.
1: F_0 ← ∅, t ← 0
2: while max_{w_j ∈ Z_t} |∂L(w)/∂w_j| > 1 do
3:   Optimize G(w) with respect to all w_j ∈ F_t
4:   Select w_best from Z_t by (9)
5:   Move w_best from Z_t to F_{t+1}
6:   t ← t + 1
7: end while

Efficient Algorithms by Utilizing Monotonicity
It is obvious that any subset of an itemset p is included in a transaction t whenever p is included in t. In other words, the following monotonicity holds:

$$T(p') \supseteq T(p) \quad \text{and} \quad \mathrm{frq}(p'; T) \ge \mathrm{frq}(p; T) \quad \text{for all } p' \subseteq p. \quad (13)$$

By making use of this property, we can search for all frequent itemsets by adding items one by one starting from ∅. The algorithm that performs this search in a breadth-first manner is the apriori algorithm, whereas the one that performs it in a depth-first manner is the backtracking algorithm. The apriori algorithm was first proposed by [1]. FIM algorithms based on backtracking were proposed in, e.g., [23] and [4].
The size of the transaction database, denoted ‖T‖, is defined by

$$\|T\| = \sum_{t \in T} |t|.$$

The time complexities of both the apriori and the backtracking algorithms are O(m‖T‖|P|), which is called output-polynomial time. Hence they are expected to run in practical time as long as |P| is small.
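For illustration, a minimal depth-first (backtracking) miner might look as follows. This is our sketch, not LCM; it prunes a branch as soon as the frequency falls to or below the threshold, which is exactly where the monotonicity (13) is used.

```python
def mine_frequent(transactions, n_items, theta):
    """Depth-first enumeration of all itemsets with frequency > theta.

    transactions : list of sets of item ids in {0, ..., n_items-1}
    Returns a dict mapping frozenset(itemset) -> frequency.
    """
    result = {}

    def dfs(itemset, occ, last_item):
        # occ is the occurrence set T(itemset); extend only with items > last_item
        for item in range(last_item + 1, n_items):
            new_occ = [t for t in occ if item in t]
            if len(new_occ) > theta:          # monotonicity: children can only shrink
                new_set = itemset | {item}
                result[frozenset(new_set)] = len(new_occ)
                dfs(new_set, new_occ, item)

    dfs(set(), transactions, -1)
    return result

# Toy usage.
T = [{0, 1, 2}, {0, 1}, {1, 2}, {0, 1, 2}, {2}]
print(mine_frequent(T, n_items=3, theta=2))
```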

Extension to Weighted FIM
Standard FIM methods handle transactions as if there were no difference in importance among them. However, there often appear cases in which the importance differs from one transaction to another, and the importance values may be positive or negative. For example, if each transaction carries a positive or negative label, we may be interested in discovering itemsets that appear frequently in positive transactions but not in negative ones.
In the setting where only positive importance appears, we can still utilize the backtracking algorithm or the apriori algorithm; an efficient algorithm can be constructed since the monotonicity of frequent itemsets still holds. We define the weighted frequency as

$$\mathrm{frq}(p; T, \alpha) = \sum_{t \in T(p)} \alpha_t$$

for given weights α_t for t ∈ T. The same monotonicity as (13) still holds for this weighted variant of the frequency when the weights are non-negative. We are thus led to non-negatively weighted FIM (NWFIM): find all itemsets whose weighted frequency is larger than a given threshold.
In the setting where both positive and negative importance must be dealt with, the monotonicity no longer holds, and an output-polynomial-time algorithm may not exist. However, we may instead employ the following two-stage strategy. Let the sets of positive and negative transactions in T be T_+ and T_-, respectively. In the first stage, ignoring the transactions with negative importance, we obtain the frequent itemsets

$$P_+ = \{ p \subseteq I \mid \mathrm{frq}(p; T_+, \alpha) > \theta \}.$$

In the second stage, for each itemset in P_+ we check whether the weighted frequency frq(p; T, α) is still larger than θ. The first stage is executed using the aforementioned algorithms in time O(m‖T_+‖|P_+|), while the second stage is executed in time O(m‖T_-‖|P_+|) by accessing all negative transactions for each itemset obtained in the first stage. The total computation time is thus O(m‖T‖|P_+|). The three types of FIM, namely standard FIM, non-negatively weighted FIM (NWFIM), and weighted FIM (WFIM), are illustrated in Figure 1.
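A minimal sketch of this two-stage strategy follows (ours; the names and the naive depth-first miner are illustrative stand-ins for LCM).

```python
def weighted_freq(itemset, transactions, alpha):
    """frq(p; T, alpha) = sum of weights over transactions containing p."""
    return sum(a for t, a in zip(transactions, alpha) if itemset <= t)

def wfim_two_stage(transactions, alpha, n_items, theta):
    """Two-stage WFIM: mine with non-negative weights only, then re-check
    every candidate against the full (possibly negative) weights."""
    pos = [(t, a) for t, a in zip(transactions, alpha) if a > 0]

    candidates = {}
    def dfs(itemset, occ, last_item):
        for item in range(last_item + 1, n_items):
            new_occ = [(t, a) for t, a in occ if item in t]
            if sum(a for _, a in new_occ) > theta:   # monotone on positive weights
                new_set = itemset | {item}
                candidates[frozenset(new_set)] = None
                dfs(new_set, new_occ, item)
    dfs(set(), pos, -1)

    # Second stage: keep itemsets whose full weighted frequency still exceeds theta.
    return {p: weighted_freq(p, transactions, alpha)
            for p in candidates
            if weighted_freq(p, transactions, alpha) > theta}

# Toy usage: two positive and two negative transactions.
T = [{0, 1}, {0, 1, 2}, {1, 2}, {0}]
a = [1.0, 0.5, -0.8, -0.3]
print(wfim_two_stage(T, a, n_items=3, theta=0.4))
```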
A variant of WFIM can also be designed by restricting the itemsets so that their sizes are at most k; it then outputs P = {p ⊆ I | frq(p; T, α) > θ, |p| ≤ k}. This is realized by performing the breadth- or depth-first search of frequent itemsets only up to depth k.

PROPOSED ALGORITHM (GRAB)
In this section we introduce our proposed algorithm for learning CBM, which we call GRAB (GRAfting for Boolean datasets). As shown in Section 2, CBM has up to 2^d parameters. It is therefore difficult to store all the values ϕ(x_i) for all ϕ and i from the viewpoint of space complexity, and difficult to solve the resulting large-scale optimization problem from the viewpoint of time complexity. We can overcome these difficulties by using the grafting algorithm in combination with WFIM. Consider solving the optimization problem for learning CBM by means of the grafting algorithm. It overcomes the space complexity issue because it does not require all possible features to be stored. On the other hand, a time complexity issue arises in the step of finding a new feature at each iteration.
First note that the partial differential of the objective function G^(k) with respect to w_ϕ in the case w_ϕ = 0 is given by

$$\partial_{w_\phi} G^{(k)}(w) = \frac{\partial L(w)}{\partial w_\phi} + [-1, 1], \qquad \frac{\partial L(w)}{\partial w_\phi} = C \sum_{i=1}^{m} \frac{\partial \ell\big(y_i, f_w^{(k)}(x_i)\big)}{\partial f}\, \phi(x_i).$$

We see from (8) that the value of the objective function is decreased by changing the value of w_ϕ if and only if

$$\left| \frac{\partial L(w)}{\partial w_\phi} \right| > 1. \quad (16)$$

Thus, the problem of finding the best feature reduces to finding ϕ ∈ Φ^(k) satisfying (16). To solve this problem, we utilize WFIM. Note that the feature vector x is an element of {0, 1}^d. Let us define a bijection T^(d)(·) from {0, 1}^d to 2^{{1, 2, ..., d}} and an injection P^(k)(·) from Φ^(k) to 2^{{1, 2, ..., d}} as follows: T^(d)(x) is the set of indices j with x_j = 1, and P^(k)(ϕ) is the set of indices of the attributes appearing in the conjunction ϕ. Using these functions, it holds that ϕ(x) = 1 if and only if P^(k)(ϕ) ⊆ T^(d)(x). We abbreviate T^(d)(x_i) and P^(k)(ϕ) as t_i and p_ϕ, respectively, for each feature vector x_i (i = 1, ..., m) and each ϕ ∈ Φ^(k). The left-hand side of (16) is then rewritten using

$$\frac{\partial L(w)}{\partial w_\phi} = C \sum_{t_i \in T(p_\phi)} \frac{\partial \ell\big(y_i, f_w^{(k)}(x_i)\big)}{\partial f},$$

where T(p_ϕ) is the occurrence set of p_ϕ with respect to the transaction database T that regards each t_i (i = 1, ..., m) as a transaction. Hence, (16) is rewritten as

$$\left| \mathrm{frq}(p_\phi; T, \alpha) \right| > 1, \quad (17)$$

where the transaction weights are given by

$$\alpha_i = C\, \frac{\partial \ell\big(y_i, f_w^{(k)}(x_i)\big)}{\partial f} \quad (i = 1, \ldots, m). \quad (18)$$

In other words, we can obtain the set of all p_ϕ satisfying (17) by applying WFIM to the transaction database T with the transaction weights (18). In the original form of the grafting algorithm, only one parameter is newly added to the free weights by (9) at each iteration. However, since (16) is a necessary and sufficient condition for the objective function to decrease, we may select more than one parameter at a time, such as the top-K parameters {p_ϕ_1, ..., p_ϕ_K}, where p_ϕ_i is the i-th most frequent itemset in P. This is more efficient than adding one parameter at a time in the case where the WFIM-based parameter selection procedure requires much computation time.
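For illustration only, the selection step can be organized as sketched below. This is not the authors' implementation: the names are ours, the logistic loss is assumed, a naive depth-first search stands in for LCM, and we prune on the sum of absolute weights (a monotone upper bound on |frq(p; T, α)|) instead of the two-stage strategy of Section 4.

```python
import numpy as np

def grab_select_features(X, y, f_vals, C, k, K, theta=1.0):
    """One GRAB selection step (illustrative sketch).

    X      : (m, d) binary data; row i corresponds to transaction t_i
    f_vals : current predictions f_w(x_i)
    Returns up to K conjunctions (frozensets of attribute indices) with the
    largest |weighted frequency| = |dL/dw_phi| exceeding theta, cf. (16)/(17).
    """
    m, d = X.shape
    alpha = C * (-y / (1.0 + np.exp(y * f_vals)))   # dl/df for the logistic loss
    found = {}

    def dfs(itemset, rows, last_item, depth):
        if depth == k:
            return
        for item in range(last_item + 1, d):
            new_rows = [i for i in rows if X[i, item] == 1]
            # Prune with the monotone upper bound sum_i |alpha_i| over T(p).
            if np.sum(np.abs(alpha[new_rows])) <= theta:
                continue
            new_set = itemset | {item}
            wfreq = float(np.sum(alpha[new_rows]))   # frq(p; T, alpha)
            if abs(wfreq) > theta:
                found[frozenset(new_set)] = wfreq
            dfs(new_set, new_rows, item, depth + 1)

    dfs(set(), list(range(m)), -1, 0)
    top = sorted(found, key=lambda p: abs(found[p]), reverse=True)[:K]
    return [(p, found[p]) for p in top]

# Toy usage: start from w = 0, so f_vals = 0 for all examples.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 6))
y = np.where(X[:, 0] * X[:, 2] == 1, 1.0, -1.0)
print(grab_select_features(X, y, np.zeros(200), C=0.1, k=2, K=3))
```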
The overall flow of the GRAB algorithm is given in Algorithm 2 below. Its time complexity is evaluated as follows. At each step of GRAB, we perform WFIM and the optimization of G^(k)(w). Since the time complexity of the optimization largely depends on the loss function and the employed solver, we consider only the time complexity of the WFIM step; in many cases WFIM takes much more time than the optimization. As discussed in Section 4, the computation time of WFIM is O(m‖T‖|P_+|). It obviously holds that P_+ ⊆ P_all, where P_all = {p ⊆ I | frq(p; T) > 0}. The total computation time for WFIM in GRAB is therefore evaluated as O(m T_1 ‖T‖ |P_all|), where T_1 is the total number of iterations.

Algorithm 2: The GRAB algorithm.
1: F ← ∅, Z ← Φ^(k)
2: while the stopping condition is not satisfied do
3:   Compute the transaction weights by (18)
4:   Run WFIM on T to obtain the itemsets P satisfying (17)
5:   for j = 1, ..., K do
6:     Pick ϕ such that p_ϕ is the j-th most frequent itemset in P
7:     Move w_ϕ to F from Z
8:   end for
9:   Optimize G^(k)(w) w.r.t. w_ϕ ∈ F
10: end while

Implementation Issue
We describe details of implementation for acceleration and termination of GRAB.

Dynamic threshold control for the acceleration of WFIM. We have already shown that the time complexity of WFIM is O(m‖T‖|P_+|). Hence WFIM terminates faster when the threshold θ is larger and |P_+| is accordingly smaller. Further, it is desirable to find the top-K frequent itemsets without extracting all the features that satisfy (16). To this end we first execute WFIM with the threshold θ = 2^M, starting from M = 10, and decrement M by 1 until we obtain K itemsets or the threshold reaches θ = 2^0 = 1. When WFIM is called the next time, we start from the same value of M as the previously used one.
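A sketch of this schedule (ours; run_wfim stands for any weighted miner that returns itemsets with their weighted frequencies):

```python
def select_top_k_with_threshold_control(run_wfim, K, m_start=10):
    """Call the miner with threshold 2^M, lowering M until at least K itemsets
    are found or the threshold reaches 2^0 = 1; remember M for the next call."""
    M = m_start
    while True:
        itemsets = run_wfim(theta=float(2 ** M))   # {itemset: weighted_freq}
        if len(itemsets) >= K or M == 0:
            top = sorted(itemsets, key=lambda p: abs(itemsets[p]), reverse=True)[:K]
            return top, M                          # reuse M on the next call
        M -= 1

# Toy usage with a stand-in miner over fixed weighted frequencies.
freqs = {frozenset({0}): 900.0, frozenset({1}): 40.0, frozenset({0, 1}): 3.0}
fake_miner = lambda theta: {p: f for p, f in freqs.items() if abs(f) > theta}
print(select_top_k_with_threshold_control(fake_miner, K=2))
```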
Incomplete Termination of WFIM. When the total number of outputs of WFIM is too large, we terminate WFIM after extracting 100K frequent itemsets and select the top-K frequent itemsets among them. In such cases, it is not guaranteed that the true top-K frequent itemsets are selected. However, since the selection of a new feature based on (9) is already a heuristic, we do not expect this to significantly deteriorate the performance of the optimization step.
Stopping condition. We employ the following stopping condition. First we define the suboptimality V(t) of the solution at iteration t as the sum of two terms. For a given tolerance ε > 0, we terminate the algorithm when V(t) < εV(0) is satisfied. The first term of V(t) is computed using the parameter obtained in the optimization step. The second term is computed by summing over the features obtained by WFIM. V(0) cannot be computed exactly as long as we employ the heuristics for acceleration; we therefore underestimate V(0) by computing the summand of the second term only over the obtained features. This makes the stopping condition stricter and the proposed algorithm run longer. In our experiments, we set ε = 0.01.

EXPERIMENTS
We implemented our algorithm GRAB in C and C++ to evaluate its computation time, prediction accuracy, and knowledge interpretability. All of our experimental code is available at https://www.dropbox.com/sh/jmwkiwt509368/AABR7L2at0vQc4xMld6 WCGoa?dl=0. In order to select the new features to be added, we used the Linear time Closed itemset Miner (LCM) [20, 21, 22], one of the backtracking-based FIM algorithms. All experiments below were executed on Linux (CentOS 6.4) machines with 96 GB memory and Intel(R) Xeon(R) X5690 processors @ 3.47 GHz. We restricted the computation time to within one day; experiments exceeding this limit were treated as time-outs.

Evaluation of Computation Time
In order to evaluate the computational efficiency of GRAB, we employed benchmark datasets to compare GRAB with other methods, investigating how effective the grafting algorithm and WFIM each were. To this end, we employed the following two combinations as methods for comparison: (1) Expansion + L_1-regularized logistic regression: in this combination, we first expand the dataset so that all features ϕ ∈ Φ^(k) are represented explicitly (expansion), and then learn combinatorial linear models on the expanded dataset within the L_1-regularized logistic regression framework. As the solver, we used LIBLINEAR [9]. (2) Grafting + naïve feature selection: in this combination we employ the grafting algorithm without combining it with WFIM; (9) is computed by searching exhaustively over the set of all features ϕ ∈ Φ^(k).
We first observe that the computation time of GRAB is significantly smaller than that of Grafting + naïve feature selection. This implies that WFIM is efficient enough to select features even when exhaustive search is impractical. LIBLINEAR could solve the optimization problem more quickly than GRAB; however, the expansion step took much longer than the whole procedure of GRAB. We therefore conclude that GRAB is more efficient than the compared methods. We consider the main reason for this efficiency to be that GRAB extracts features selectively.

Evaluation of prediction accuracy
In order to evaluate GRAB's prediction performance, we conducted experiments on the following benchmark datasets (a sketch of the binarization procedure used below is given after the dataset descriptions): (1) a1a [14] (obtained from the LIBSVM binary datasets page, www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/binary.html): it has 32,561 records and 123 attributes per record. This dataset is obtained by preprocessing the UCI Adult dataset [11], which has 6 continuous and 8 categorical features. We divided the data into 30,000 training examples and 2,561 test examples. (2) cod-rna [3]: a dataset for the problem of detecting non-coding RNAs from attributes of base sequences. Since all of its 8 attributes are real-valued, we transformed each attribute into binary ones as follows: we divided the interval from the minimum value to the maximum value into 30 cells of equal length and expressed each attribute in binary form by indicating which cell its value fell into. Thus, a binary dataset with 240 attributes was obtained, in which each record has exactly 8 non-zero attributes. We used 300,000 examples for training and 31,152 for testing.
Table 1: Comparison of computation time (sec) on the a1a dataset. The bracketed values for Expansion + L_1-regularized logistic regression are the elapsed times to solve the L_1-regularized logistic regression.
(3) covtype.binary: we binarized all the quantitative attributes in the same way as for the cod-rna dataset. We used 500,000 examples for training and 81,012 for testing.
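The equal-width binarization described above can be sketched as follows; this is our illustration (function names are ours), with 30 bins as in the paper.

```python
import numpy as np

def binarize_equal_width(X_real, n_bins=30):
    """One-hot encode each real-valued column by which of n_bins equal-width
    cells (between the column's min and max) its value falls into."""
    m, d = X_real.shape
    out = np.zeros((m, d * n_bins), dtype=np.int8)
    for j in range(d):
        col = X_real[:, j]
        lo, hi = col.min(), col.max()
        width = (hi - lo) / n_bins if hi > lo else 1.0
        bins = np.minimum(((col - lo) / width).astype(int), n_bins - 1)
        out[np.arange(m), j * n_bins + bins] = 1
    return out

# Toy usage: 5 records, 2 real attributes -> 2 * 30 = 60 binary attributes.
X = np.array([[0.1, 5.0], [0.9, 2.5], [0.5, 0.0], [0.3, 9.9], [0.7, 7.2]])
B = binarize_equal_width(X)
print(B.shape, B.sum(axis=1))   # each record has exactly 2 non-zero attributes
```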
As methods for comparison, we employed support vector classification (SVC) [5] with two types of kernel functions, the polynomial kernel (Poly) [18] and the radial basis function kernel (RBF) [18], both of which have been used extensively for classification. The polynomial kernel is defined as

$$k(x, x') = \left( \gamma\, x^\top x' + 1 \right)^{d},$$

where d ∈ N_+ and γ ∈ (0, 1] are hyperparameters, and the radial basis function kernel is defined as

$$k(x, x') = \exp\left( -\frac{1}{2\gamma} \| x - x' \|_2^2 \right),$$

where γ > 0 is a hyperparameter.
We employed the logistic loss and the L_2-hinge loss as loss functions for learning CBM. The L_2-hinge loss function is defined as

$$\ell(y, f(x)) = \left( \max\{ 0,\ 1 - y f(x) \} \right)^2.$$

Note that the WFIM procedure in GRAB can be accelerated when the L_2-hinge loss is used. This is because ∂ℓ(y_i, f^(k)(x_i))/∂f, which determines the i-th transaction weight of WFIM, is 0 for all i satisfying 1 − y_i f(x_i) ≤ 0, so the corresponding data can be removed from the transaction database.
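This pruning can be sketched as follows (our illustration): under the L_2-hinge loss, only examples with 1 − y_i f(x_i) > 0 receive non-zero transaction weights.

```python
import numpy as np

def l2_hinge_weights(y, f_vals, C):
    """Transaction weights alpha_i = C * dl/df for the squared hinge loss
    l(y, f) = max(0, 1 - y*f)^2; examples with 1 - y*f <= 0 get weight 0."""
    slack = np.maximum(0.0, 1.0 - y * f_vals)
    return C * (-2.0 * y * slack)

def active_transactions(X, y, f_vals, C):
    """Keep only transactions with non-zero weight before running WFIM."""
    alpha = l2_hinge_weights(y, f_vals, C)
    keep = alpha != 0.0
    return X[keep], alpha[keep]

# Toy usage: well-classified examples (margin >= 1) drop out of the database.
y = np.array([1.0, -1.0, 1.0, -1.0])
f = np.array([2.0, -1.5, 0.2, 0.4])        # first two have margin >= 1
Xb = np.eye(4, dtype=int)
Xk, ak = active_transactions(Xb, y, f, C=1.0)
print(Xk.shape[0], ak)                      # -> 2 remaining transactions
```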
For each algorithm, after learning models with C = 10^{-3}, 10^{-2}, ..., 10^3 for GRAB, and with C = 10^{-3}, 10^{-2}, ..., 10^3 and γ = 0.1, 0.2, ..., 1 for SVC, on the training data, we selected the model with the highest accuracy on the test data. To run the SVCs, we used scikit-learn [12], a machine learning library written in Python. Note that it is not unfair to compare the computation times of GRAB with these, since the SVCs in scikit-learn call LIBSVM [7], which is written in C and C++.
Table 2 shows the accuracies of the respective algorithms. From the results on the a1a dataset, we see that GRAB with k = 1 is almost comparable to Poly with d = 1, while the former finished training more than 100 times faster than the latter. This implies that linear models are accurate enough for prediction on the a1a dataset.
The computation time of logistic-loss GRAB with k = 3 was more than 10 times that of the polynomial kernel SVC with d = 3.
From the results on the cod-rna dataset shown in Table 2(b), we observe that the accuracies of SVC on the binarized dataset are much higher than those on the raw one.
The reason for this may be that the cod-rna dataset becomes almost linearly separable by binarization. L_2-hinge-loss GRAB with k = ∞ is more than 1% more accurate than that with k = 1. This implies that combinatorial features are effective for prediction on the cod-rna dataset. Poly with d = 3 on the binarized data, which achieved the highest accuracy among the SVC family, was almost comparable to L_2-hinge-loss GRAB with k = ∞, which achieved the highest accuracy in the GRAB family. However, the computation time of the best of the GRABs was less than one hundredth of that of the best of the SVCs.
This is because the binarized cod-rna dataset is highly sparse, in the sense that the number of non-zero elements per record is just eight out of 240, so WFIM executes very rapidly.
From the results on the covtype.binary dataset shown in Table 2(c), we see that binarization contributed to improving the accuracies of the SVCs, similarly to the case of cod-rna. We also see that combinatorial features greatly contributed to prediction on the covtype.binary dataset. In terms of computation time, GRAB outperformed SVC significantly: GRAB finished within two hours, while SVC on the binarized dataset did not finish within one day.

Evaluation of interpretability
We can carry out knowledge acquisition from the learning results of GRAB as follows. Looking at the predictor learned by GRAB, the features whose corresponding weights have large absolute values can be interpreted as important ones, because such features greatly affect the prediction results. Hence, by simply extracting features with large weights, we can acquire knowledge about which features are important for prediction. Moreover, the features used in GRAB are easy to comprehend since they are represented as conjunctions of attributes. This implies that GRAB offers high interpretability of the acquired knowledge.
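A small illustration of this extraction step (ours, assuming the sparse weight-dictionary format used in the earlier sketches):

```python
def top_features(w, n=10):
    """List the n conjunction features with the largest absolute weights."""
    ranked = sorted(w.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [('&'.join(f'x{j}' for j in sorted(phi)), round(wv, 3))
            for phi, wv in ranked[:n]]

# Toy usage with a sparse weight dictionary.
w = {frozenset({2}): -1.3, frozenset({0, 5}): 0.9, frozenset({1}): 0.1}
print(top_features(w))   # -> [('x2', -1.3), ('x0&x5', 0.9), ('x1', 0.1)]
```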
Below we show the results on knowledge discovery for the a1a dataset [14]. This dataset was obtained by binarizing the Adult dataset, which was extracted from a census bureau database. The a1a dataset has been used as a benchmark for classifying people into those with more than $50,000 annual income and the others, on the basis of their attributes.
Table 3 shows the weights with large absolute values obtained by GRAB with the L_2-hinge loss and k = 3 on the a1a dataset. The features are listed in descending order of the absolute value of their weights. This list itself is highly interpretable and represents the knowledge acquired from the dataset.
The feature with the largest absolute weight is (Not in family), which has a negative gain for classification.
The feature with the second largest absolute weight is (Not in family) & (No monetary capital losses), which has a positive gain for classification. It is interesting to see that an identical attribute (Not in family) can contribute to both positive and negative gains for classification. This suggests the importance of conjunctions of attributes in knowledge discovery. The feature with the 6th largest absolute weight is (Large education number), which has a positive gain for classification. But its conjunction with (Other service) & (Middle average hours per week worked) has a negative gain (the 9th feature), while its conjunction with (Prof specialty) & (Have monetary capital gains) has a positive gain (the 10th feature). Hence making combinations of attributes is really informative for deepening knowledge discovery.

CONCLUSION
In this paper we proposed GRAB, an algorithm for learning combinatorial Boolean models (CBM). The key idea of GRAB is to incorporate frequent itemset mining techniques into the grafting algorithm for regularized loss minimization. We showed that GRAB is able to learn CBM more efficiently than competitive methods such as kernel methods. We also showed that it achieves higher or comparable prediction accuracy compared with the competitors. We further demonstrated that GRAB enables us to discover knowledge in the form of conjunctions of attributes of the given data.
This knowledge representation turned out to be easy to comprehend.
The main reason for the efficiency of GRAB is that the monotonicity of itemsets over Boolean inputs makes it possible to search efficiently over all possible features. Therefore, any other data structure having such monotonicity (e.g., sequences, graphs, etc.) can also be incorporated into our methodology.
In this paper, we considered only convex loss functions in order to prevent the algorithm from being trapped in local minima of the objective function. However, it is possible that GRAB works for non-convex losses as well as for convex ones; it is worth noting that GRAB works whenever the loss function is partially differentiable. Hence, as future work, it would be challenging and worthwhile to apply the GRAB methodology to the efficient training of multi-layer neural networks or other kinds of highly predictive machine learning models.



Table 2: Comparison of accuracy (%) on benchmark datasets. The terms Raw and Binarized mean that the original and the binarized dataset, respectively, were used as training data. The row Time shows the elapsed time (sec) of each method when the number of training examples is the maximum.