Abstract
An important problem in supervised machine learning is designing systems that are interpretable by humans. In domains such as law, medicine, and finance that deal with human lives, delegating decisions to a black-box machine-learning model carries significant operational risk and often legal implications; interpretable classifiers are therefore required. Building on ideas from Boolean compressed sensing, we propose a rule-based classifier which explicitly balances accuracy against interpretability in a principled optimization formulation. We represent the problem of learning conjunctive clauses or disjunctive clauses as an adaptation of a classical problem from statistics, Boolean group testing, and apply a novel linear programming (LP) relaxation to find solutions. We derive theoretical results for recovering sparse rules which parallel the conditions for exact recovery of sparse signals in the compressed sensing literature. This is an exciting development in interpretable learning, where most prior work has focused on heuristic solutions. We also consider a more general class of rule-based classifiers, checklists and scorecards, learned using ideas from threshold group testing. We show competitive classification accuracy using the proposed approach on real-world data sets.
Notes
1. Other approaches to approximately solve group testing include greedy methods and loopy belief propagation; see references in [34].
2. Instead of using LP, one can find solutions greedily, as is done in the SCM, which gives a log(m) approximation. The same guarantee holds for LP with randomized rounding; empirically, LP tends to find sparser solutions. (A sketch of this rounding scheme follows this list.)
3. Surprisingly, for many practical datasets the LP formulation obtains integral solutions, or requires only a small number of branch-and-bound steps.
4. In general it will contain the features and their complements as columns. However, with enough data, one of the two choices will be removed by zero-row elimination beforehand.
5. Here, the subscript “z” stands for zero and “o” stands for one.
6. We use IBM SPSS Modeler 14.1 and Matlab R2009a with default settings.
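As a concrete illustration of note 2, here is a minimal sketch of the LP relaxation followed by randomized rounding, assuming binary matrices A_pos (positive samples × candidate features, the rows \(\mathcal{P}\)) and A_neg (negative samples, the rows \(\mathcal{Z}\)); the function and variable names are illustrative rather than taken from the chapter.

```python
import numpy as np
from scipy.optimize import linprog

def lp_round_rule(A_pos, A_neg, lam=1.0, seed=0):
    """Sketch: LP relaxation of sparse rule learning, then randomized rounding.

    A rule (an OR of selected features) should cover every positive row:
    A_pos @ w + xi >= 1; errors on covered negative rows enter through the
    column sums of A_neg folded into the objective, as in the appendices.
    """
    p, n = A_pos.shape
    # Variables x = [w (n entries), xi (p entries)].
    c = np.concatenate([lam + A_neg.sum(axis=0), np.ones(p)])
    # A_pos @ w + xi >= 1  rewritten as  -A_pos @ w - xi <= -1.
    A_ub = np.hstack([-A_pos, -np.eye(p)])
    b_ub = -np.ones(p)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * (n + p), method="highs")
    w_frac = res.x[:n]
    # Randomized rounding: include feature j with probability w_j,
    # repeated O(log p) times and unioned -- the standard set-cover
    # rounding behind the log(m) guarantee mentioned in note 2.
    rng = np.random.default_rng(seed)
    w = np.zeros(n, dtype=bool)
    for _ in range(int(np.ceil(np.log(p + 1))) + 1):
        w |= rng.random(n) < w_frac
    return w_frac, w
```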
References
Adams, S.T., Leveson, S.H.: Clinical prediction rules. Br. Med. J. 344, d8312 (2012)
Atia, G.K., Saligrama, V.: Boolean compressed sensing and noisy group testing. IEEE Trans. Inf. Theory 58 (3), 1880–1901 (2012)
Bertsimas, D., Chang, A., Rudin, C.: An integer optimization approach to associative classification. In: Advances in Neural Information Processing Systems 25, pp. 269–277 (2012)
Blum, A., Kalai, A., Langford, J.: Beating the hold-out: bounds for k-fold and progressive cross-validation. In: Proceedings of the Conference on Computational Learning Theory, Santa Cruz, CA, pp. 203–208 (1999)
Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12 (2), 292–306 (2000)
Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Signal Process. Mag. 25 (2), 21–30 (2008)
Chen, H.B., Fu, H.L.: Nonadaptive algorithms for threshold group testing. Discret. Appl. Math. 157, 1581–1585 (2009)
Cheraghchi, M., Hormati, A., Karbasi, A., Vetterli, M.: Compressed sensing with probabilistic measurements: a group testing solution. In: Proceedings of the Annual Allerton Conference on Communication Control and Computing, Allerton, IL, pp. 30–35 (2009)
Clark, P., Niblett, T.: The CN2 induction algorithm. Mach. Learn. 3 (4), 261–283 (1989)
Cohen, W.W.: Fast effective rule induction. In: Proceedings of the International Conference on Machine Learning, Tahoe City, CA, pp. 115–123 (1995)
Dai, L., Pelckmans, K.: An ellipsoid based, two-stage screening test for BPDN. In: Proceedings of the European Signal Processing Conference, Bucharest, Romania, pp. 654–658 (2012)
Dash, S., Malioutov, D.M., Varshney, K.R.: Screening for learning classification rules via Boolean compressed sensing. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, pp. 3360–3364 (2014)
Dash, S., Malioutov, D.M., Varshney, K.R.: Learning interpretable classification rules using sequential row sampling. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brisbane, Australia (2015)
Dembczyński, K., Kotłowski, W., Słowiński, R.: ENDER: a statistical framework for boosting decision rules. Data Min. Knowl. Disc. 21 (1), 52–90 (2010)
Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proc. Natl. Acad. Sci. 100 (5), 2197–2202 (2003)
Du, D.Z., Hwang, F.K.: Pooling Designs and Nonadaptive Group Testing: Important Tools for DNA Sequencing. World Scientific, Singapore (2006)
Dyachkov, A.G., Rykov, V.V.: A survey of superimposed code theory. Prob. Control. Inf. 12 (4), 229–242 (1983)
Dyachkov, A.G., Vilenkin, P.A., Macula, A.J., Torney, D.C.: Families of finite sets in which no intersection of l sets is covered by the union of s others. J. Combin. Theory 99, 195–218 (2002)
Eckstein, J., Goldberg, N.: An improved branch-and-bound method for maximum monomial agreement. INFORMS J. Comput. 24 (2), 328–341 (2012)
El Ghaoui, L., Viallon, V., Rabbani, T.: Safe feature elimination in sparse supervised learning. Pac. J. Optim. 8 (4), 667–698 (2012)
Emad, A., Milenkovic, O.: Semiquantitative group testing. IEEE Trans. Inf. Theory 60 (8), 4614–4636 (2014)
Frank, A., Asuncion, A.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2010)
Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 2 (3), 916–954 (2008)
Fry, C.: Closing the gap between analytics and action. INFORMS Analytics Mag. 4 (6), 4–5 (2011)
Gage, B.F., Waterman, A.D., Shannon, W., Boechler, M., Rich, M.W., Radford, M.J.: Validation of clinical classification schemes for predicting stroke. J. Am. Med. Assoc. 285 (22), 2864–2870 (2001)
Gawande, A.: The Checklist Manifesto: How To Get Things Right. Metropolitan Books, New York (2009)
Gilbert, A.C., Iwen, M.A., Strauss, M.J.: Group testing and sparse signal recovery. In: Conference Record - Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pp. 1059–1063 (2008)
Jawanpuria, P., Nath, J.S., Ramakrishnan, G.: Efficient rule ensemble learning using hierarchical kernels. In: Proceedings of the International Conference on Machine Learning, Bellevue, WA, pp. 161–168 (2011)
John, G.H., Langley, P.: Static versus dynamic sampling for data mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 367–370 (1996)
Kautz, W., Singleton, R.: Nonrandom binary superimposed codes. IEEE Trans. Inf. Theory 10 (4), 363–377 (1964)
Letham, B., Rudin, C., McCormick, T.H., Madigan, D.: Building interpretable classifiers with rules using Bayesian analysis. Tech. Rep. 609, Department of Statistics, University of Washington (2012)
Liu, J., Li, M.: Finding cancer biomarkers from mass spectrometry data by decision lists. J. Comput. Biol. 12 (7), 971–979 (2005)
Liu, J., Zhao, Z., Wang, J., Ye, J.: Safe screening with variational inequalities and its application to lasso. In: Proceedings of the International Conference on Machine Learning, Beijing, China, pp. 289–297 (2014)
Malioutov, D., Malyutov, M.: Boolean compressed sensing: LP relaxation for group testing. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 3305–3308 (2012)
Malioutov, D.M., Varshney, K.R.: Exact rule learning via Boolean compressed sensing. In: Proceedings of the International Conference on Machine Learning, Atlanta, GA, pp. 765–773 (2013)
Malioutov, D.M., Sanghavi, S.R., Willsky, A.S.: Sequential compressed sensing. IEEE J. Spec. Top. Signal Proc. 4 (2), 435–444 (2010)
Malyutov, M.: The separating property of random matrices. Math. Notes 23 (1), 84–91 (1978)
Malyutov, M.: Search for sparse active inputs: a review. In: Aydinian, H., Cicalese, F., Deppe, C. (eds.) Information Theory, Combinatorics, and Search Theory: In Memory of Rudolf Ahlswede, pp. 609–647. Springer, Berlin/Germany (2013)
Marchand, M., Shawe-Taylor, J.: The set covering machine. J. Mach. Learn. Res. 3, 723–746 (2002)
Maron, O., Moore, A.W.: Hoeffding races: accelerating model selection search for classification and function approximation. Adv. Neural Inf. Proces. Syst. 6, 59–66 (1993)
Mazumdar, A.: On almost disjunct matrices for group testing. In: Proceedings of the International Symposium on Algorithms and Computation, Taipei, Taiwan, pp. 649–658 (2012)
Provost, F., Jensen, D., Oates, T.: Efficient progressive sampling. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 23–32 (1999)
Quinlan, J.R.: Simplifying decision trees. Int. J. Man Mach. Stud. 27 (3), 221–234 (1987)
Rivest, R.L.: Learning decision lists. Mach. Learn. 2 (3), 229–246 (1987)
Rückert, U., Kramer, S.: Margin-based first-order rule learning. Mach. Learn. 70 (2–3), 189–206 (2008)
Sejdinovic, D., Johnson, O.: Note on noisy group testing: asymptotic bounds and belief propagation reconstruction. In: Proceedings of the Annual Allerton Conference on Communication Control and Computing, Allerton, IL, pp. 998–1003 (2010)
Stinson, D.R., Wei, R.: Generalized cover-free families. Discret. Math. 279, 463–477 (2004)
Ustun, B., Rudin, C.: Methods and models for interpretable linear classification. Available at http://arxiv.org/pdf/1405.4047 (2014)
Wagstaff, K.L.: Machine learning that matters. In: Proceedings of the International Conference on Machine Learning, Edinburgh, United Kingdom, pp. 529–536 (2012)
Wang, F., Rudin, C.: Falling rule lists. Available at http://arxiv.org/pdf/1411.5899 (2014)
Wang, J., Zhou, J., Wonka, P., Ye, J.: Lasso screening rules via dual polytope projection. Adv. Neural Inf. Proces. Syst. 26, 1070–1078 (2013)
Wang, Y., Xiang, Z.J., Ramadge, P.J.: Lasso screening with a small regularization parameter. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3342–3346 (2013)
Wang, Y., Xiang, Z.J., Ramadge, P.J.: Tradeoffs in improved screening of lasso problems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3297–3301 (2013)
Wang, T., Rudin, C., Doshi, F., Liu, Y., Klampfl, E., MacNeille, P.: Bayesian or’s of and’s for interpretable classification with application to context aware recommender systems. Available at http://arxiv.org/abs/1504.07614 (2015)
Wu, H., Ramadge, P.J.: The 2-codeword screening test for lasso problems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3307–3311 (2013)
Xiang, Z.J., Ramadge, P.J.: Fast lasso screening tests based on correlations. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 2137–2140 (2012)
Xiang, Z.J., Xu, H., Ramadge, P.J.: Learning sparse representations of high dimensional data on large scale dictionaries. Adv. Neural Inf. Proces. Syst. 24, 900–908 (2011)
Acknowledgements
The authors thank Vijay S. Iyengar, Benjamin Letham, Cynthia Rudin, Viswanath Nagarajan, Karthikeyan Natesan Ramamurthy, Mikhail Malyutov, and Venkatesh Saligrama for valuable discussions.
Appendices
Appendix 1: Dual Linear Program
We now derive the dual LP, which we use in Sect. 5. We start by reformulating the LP in (10), i.e., we consider an LP with the same set of optimal solutions as the one in (10). First note that the upper bounds of 1 on the variables \(\xi_{i}\) are redundant. Let \((\bar{\mathbf{w}},\bar{\boldsymbol{\xi}})\) be a feasible solution of (10) without the upper bound constraints such that \(\bar{\xi}_{i} > 1\) for some \(i \in \mathcal{P}\). Reducing \(\bar{\xi}_{i}\) to 1 yields a feasible solution (as \(\mathbf{a}_{i}\bar{\mathbf{w}} + \bar{\xi}_{i} \geq 1\), the only inequality \(\xi_{i}\) participates in besides the bound constraints, is still satisfied). The new feasible solution has a lower objective function value than before, as \(\xi_{i}\) has a positive coefficient in the objective function (which is to be minimized). One can similarly argue that in every optimal solution of (10) without the upper bound constraints, we have \(w_{j} \leq 1\) for \(j = 1,\ldots,n\). Finally, observe that we can substitute \(\xi_{i}\) for \(i \in \mathcal{Z}\) in the objective function by \(\mathbf{a}_{i}\mathbf{w}\) because of the constraints \(\mathbf{a}_{i}\mathbf{w} = \xi_{i}\) for \(i \in \mathcal{Z}\). We thus get the following LP equivalent to (10):
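\[
\min_{\mathbf{w},\,\boldsymbol{\xi}_{\mathcal{P}}}\ \sum_{j=1}^{n}\big(\lambda + \|\mathbf{a}_{\mathcal{Z}}^{j}\|_{1}\big)\,w_{j} + \sum_{i\in\mathcal{P}}\xi_{i}
\quad\text{s.t.}\quad
\mathbf{A}_{\mathcal{P}}\mathbf{w} + \boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{1},\quad \mathbf{w} \geq \mathbf{0},\quad \boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{0}, \qquad (19)
\]
with \(p = \vert\mathcal{P}\vert\); this statement of (19) is written in a form consistent with the derivation above rather than quoted verbatim.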
The optimal solutions and optimal objective values are the same as in (10).
Writing \(\mathbf{A}_{\mathcal{P}}\mathbf{w} + \boldsymbol{\xi}_{\mathcal{P}}\) as \(\mathbf{A}_{\mathcal{P}}\mathbf{w} + \mathbf{I}\boldsymbol{\xi}_{\mathcal{P}}\), where \(\mathbf{I}\) is the \(p \times p\) identity matrix, writing \(\|\mathbf{a}_{\mathcal{Z}}^{j}\|_{1}\) as \(\mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{j}\), and letting \(\boldsymbol{\mu}\) be a row vector of \(p\) dual variables, one can see that the dual is:
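\[
\max_{\boldsymbol{\mu}}\ \sum_{i=1}^{p}\mu_{i}
\quad\text{s.t.}\quad
\boldsymbol{\mu}\,\mathbf{A}_{\mathcal{P}} \leq \lambda\mathbf{1}_{n}^{T} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}},\quad \mathbf{0} \leq \boldsymbol{\mu} \leq \mathbf{1}, \qquad (20)
\]
where the column bound \(\lambda\mathbf{1}_{n}^{T} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}\) is the one referenced in Appendix 3; this statement of (20) is likewise written in a form consistent with the surrounding derivation.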
Suppose \(\boldsymbol{\bar{\mu }}\) is a feasible solution to (20). Then clearly \(\sum _{i=1}^{p}\bar{\mu }_{i}\) yields a lower bound on the optimal solution value of (19).
Appendix 2: Derivation of Screening Tests
Let \(\mathcal{S}(j)\) stand for the support of \(\mathbf{a}_{\mathcal{P}}^{j}\). Furthermore, let \(\mathcal{N}(j)\) stand for the support of \(\mathbf{1} - \mathbf{a}_{\mathcal{P}}^{j}\), i.e., the set of indices from \(\mathcal{P}\) such that the corresponding components of \(\mathbf{a}_{\mathcal{P}}^{j}\) are zero.
Now consider the situation where we fix \(w_{1}\) (say) to 1. Let \(\mathbf{A}'\) stand for the submatrix of \(\mathbf{A}\) consisting of the last \(n-1\) columns, and let \(\mathbf{w}'\) stand for the vector of variables \(w_{2},\ldots,w_{n}\). Then the constraints \(\mathbf{A}_{\mathcal{P}}\mathbf{w} + \boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{1}\) in (19) become \(\mathbf{A}'_{\mathcal{P}}\mathbf{w}' + \boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{1} - \mathbf{a}_{\mathcal{P}}^{1}\). Therefore, for all \(i \in \mathcal{S}(1)\), the corresponding constraint becomes \((\mathbf{A}'_{\mathcal{P}})_{i}\mathbf{w}' + \xi_{i} \geq 0\), which is redundant since \(\mathbf{A}'_{\mathcal{P}} \geq 0\) and \(\mathbf{w}', \xi_{i} \geq 0\). The only remaining non-redundant constraints correspond to the indices in \(\mathcal{N}(1)\). Then the value of (19) with \(w_{1}\) set to 1 becomes
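\[
\lambda + \mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{1} \;+\; \min\Big\{\sum_{j=2}^{n}\big(\lambda + \mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{j}\big)w_{j} + \sum_{i\in\mathcal{N}(1)}\xi_{i} \;:\; (\mathbf{A}'_{\mathcal{P}})_{\mathcal{N}(1)}\mathbf{w}' + \boldsymbol{\xi}_{\mathcal{N}(1)} \geq \mathbf{1},\ \mathbf{w}' \geq \mathbf{0},\ \boldsymbol{\xi} \geq \mathbf{0}\Big\}, \qquad (21)
\]
where the minimization is the LP we refer to as (21), written here in a form consistent with the derivation above.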
This LP clearly has the same form as the LP in (19). Furthermore, given any feasible solution \(\boldsymbol{\bar{\mu }}\) of (20), \(\boldsymbol{\bar{\mu }}_{\mathcal{N}(1)}\) defines a feasible dual solution of (21) as
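\[
\sum_{i\in\mathcal{N}(1)}\bar{\mu}_{i}\,(\mathbf{A}'_{\mathcal{P}})_{ij} \;\leq\; \sum_{i\in\mathcal{P}}\bar{\mu}_{i}\,(\mathbf{A}_{\mathcal{P}})_{ij} \;\leq\; \lambda + \mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{j} \qquad \text{for } j = 2,\ldots,n,
\]
using \(\bar{\boldsymbol{\mu}} \geq \mathbf{0}\) and the column constraints of (20); the bounds \(\mathbf{0} \leq \bar{\boldsymbol{\mu}}_{\mathcal{N}(1)} \leq \mathbf{1}\) carry over directly.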
Therefore \(\sum_{i\in\mathcal{N}(1)}\bar{\mu}_{i}\) is a lower bound on the optimal solution value of the LP in (21), and therefore
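\[
\lambda + \mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{1} + \sum_{i\in\mathcal{N}(1)}\bar{\mu}_{i} \qquad (22)
\]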
is a lower bound on the optimal solution value of (19) with \(w_{1}\) set to 1. In particular, if \((\bar{\mathbf{w}},\bar{\boldsymbol{\xi}})\) is a feasible integral solution to (19) with objective function value \(\sum_{j=1}^{n}(\lambda + \mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{j})\bar{w}_{j} + \sum_{i=1}^{p}\bar{\xi}_{i}\), and if (22) is greater than this value, then no optimal integral solution of (19) can have \(w_{1} = 1\). Therefore \(w_{1} = 0\) in any optimal solution, and we can simply drop the column corresponding to \(w_{1}\) from the LP.
In order to use the screening results in this section we need to obtain a feasible primal and a feasible dual solution. Some useful heuristics to obtain such a pair are described in [12].
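As an illustration, here is a minimal sketch of this screening test applied to every column rather than only \(w_{1}\), assuming a binary positive-sample matrix A_pos (rows \(\mathcal{P}\)), a negative-sample matrix A_neg (rows \(\mathcal{Z}\)), a feasible dual vector mu_bar for (20), and the objective value ub of a feasible integral solution of (19); all names are illustrative.

```python
import numpy as np

def screen_columns(A_pos, A_neg, mu_bar, ub, lam=1.0):
    """Drop column j (fix w_j = 0) when the dual-based lower bound on the
    LP value with w_j = 1, the quantity (22), exceeds the value ub of a
    known feasible integral solution."""
    col_neg = A_neg.sum(axis=0)            # 1^T a_Z^j for each column j
    keep = []
    for j in range(A_pos.shape[1]):
        in_N_j = A_pos[:, j] == 0          # indices in N(j): positives not covered by j
        bound = lam + col_neg[j] + mu_bar[in_N_j].sum()   # quantity (22) for column j
        if bound <= ub:
            keep.append(j)                 # cannot rule out w_j = 1; keep the column
    return keep
```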
Appendix 3: Extending the Dual Solution for Row-Sampling
Suppose that \(\hat{\boldsymbol{\mu }}^{p}\) is the optimal dual solution to the small LP in Sect. 5.3. Note that the number of variables in the dual for the large LP increases from p to \(\bar{p}\) and the bound on the second constraint grows from \(\lambda \mathbf{1}_{n} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}\) to \(\lambda \mathbf{1}_{n} + \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\).
We use a greedy heuristic to extend \(\hat{\boldsymbol{\mu}}^{p}\) to a feasible dual solution \(\bar{\boldsymbol{\mu}}_{\bar{p}}\) of the large LP. We set \(\bar{\mu}_{j} = \hat{\mu}_{j}\) for \(j = 1,\ldots,p\). We extend the remaining entries \(\bar{\mu}_{j}\) for \(j = p+1,\ldots,\bar{p}\) by setting a subset of them to 1 while satisfying the dual feasibility constraint. In other words, the extension of \(\boldsymbol{\bar{\mu}}\) corresponds to a subset \(\mathcal{R}\) of the row indices \(\{p+1,\ldots,\bar{p}\}\) of \(\bar{\mathbf{A}}_{\mathcal{P}}\) such that \(\hat{\boldsymbol{\mu}}_{p}^{T}\mathbf{A}_{\mathcal{P}} + \sum_{i\in\mathcal{R}}(\bar{\mathbf{A}}_{\mathcal{P}})_{i} \leq \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\). Having \(\boldsymbol{\bar{\mu}}^{T}\bar{\mathbf{A}}_{\mathcal{P}} \leq \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\) with \(\boldsymbol{\bar{\mu}}\) extended by a binary vector implies that \(\boldsymbol{\bar{\mu}}\) is feasible for (20). We initialize \(\mathcal{R}\) to ∅ and then simply go through the unseen rows of \(\bar{\mathbf{A}}_{\mathcal{P}}\) in some fixed order (increasing from \(p+1\) to \(\bar{p}\)), and for a row \(k\), if
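\[
\hat{\boldsymbol{\mu}}_{p}^{T}\mathbf{A}_{\mathcal{P}} + \sum_{i\in\mathcal{R}}(\bar{\mathbf{A}}_{\mathcal{P}})_{i} + (\bar{\mathbf{A}}_{\mathcal{P}})_{k} \;\leq\; \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}},
\]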
we set \(\mathcal{R}\) to \(\mathcal{R}\cup \{ k\}\). The heuristic (we call it H1) needs only a single pass through the matrix \(\bar{\mathbf{A}}_{\mathcal{P}}\), and is thus very fast.
This heuristic, however, does not use the optimal solution \(\hat{\mathbf{w}}^{m}\) in any way. Suppose \(\hat{\mathbf{w}}^{m}\) were an optimal solution of the large LP. Then complementary slackness would imply that if \((\bar{\mathbf{A}}_{\mathcal{P}})_{i}\hat{\mathbf{w}}^{m} > 1\), then in any optimal dual solution \(\boldsymbol{\mu}\) we have \(\mu_{i} = 0\). Thus, assuming \(\hat{\mathbf{w}}^{m}\) is close to an optimal solution for the large LP, we modify heuristic H1 to obtain heuristic H2 by simply setting \(\bar{\mu}_{i} = 0\) whenever \((\bar{\mathbf{A}}_{\mathcal{P}})_{i}\hat{\mathbf{w}}^{m} > 1\), while keeping the remaining steps unchanged.
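The following is a minimal sketch of H1 and H2, assuming the newly sampled rows of \(\bar{\mathbf{A}}_{\mathcal{P}}\) are given as A_pos_new, the enlarged \(\bar{\mathbf{A}}_{\mathcal{Z}}\) as A_neg_big, and the vector \(\hat{\boldsymbol{\mu}}^{T}\mathbf{A}_{\mathcal{P}}\) as used; names are illustrative. The sketch applies the sufficient condition with right-hand side \(\mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\) described above, which drops the nonnegative \(\lambda\mathbf{1}_{n}\) slack.

```python
import numpy as np

def extend_dual(A_pos_new, A_neg_big, mu_hat, used, w_hat=None):
    """Greedily extend the small-LP dual mu_hat to the sampled rows.

    used:  mu_hat^T A_P for the small matrix (budget already consumed).
    w_hat: small-LP primal solution; if given, rows whose primal
           constraint is slack get mu = 0 (heuristic H2, motivated by
           complementary slackness); otherwise this is heuristic H1.
    """
    budget = A_neg_big.sum(axis=0)           # 1^T A_Z-bar, per column
    used = used.astype(float).copy()
    mu_ext = np.zeros(A_pos_new.shape[0])
    for k, row in enumerate(A_pos_new):      # single pass, fixed order
        if w_hat is not None and row @ w_hat > 1:
            continue                         # H2: set mu_k = 0
        if np.all(used + row <= budget):     # adding row k keeps dual feasibility
            used += row
            mu_ext[k] = 1.0
    return np.concatenate([mu_hat, mu_ext])
```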