Learning Interpretable Classification Rules with Boolean Compressed Sensing

Chapter in: Transparent Data Mining for Big and Small Data

Part of the book series: Studies in Big Data (SBD, volume 32)

Abstract

An important problem in the context of supervised machine learning is designing systems which are interpretable by humans. In domains such as law, medicine, and finance that deal with human lives, delegating the decision to a black-box machine-learning model carries significant operational risk, and often legal implications, thus requiring interpretable classifiers. Building on ideas from Boolean compressed sensing, we propose a rule-based classifier which explicitly balances accuracy versus interpretability in a principled optimization formulation. We represent the problem of learning conjunctive clauses or disjunctive clauses as an adaptation of a classical problem from statistics, Boolean group testing, and apply a novel linear programming (LP) relaxation to find solutions. We derive theoretical results for recovering sparse rules which parallel the conditions for exact recovery of sparse signals in the compressed sensing literature. This is an exciting development in interpretable learning where most prior work has focused on heuristic solutions. We also consider a more general class of rule-based classifiers, checklists and scorecards, learned using ideas from threshold group testing. We show competitive classification accuracy using the proposed approach on real-world data sets.
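To make the group-testing connection concrete, the toy sketch below illustrates the Boolean measurement model the chapter builds on: each sample produces a Boolean measurement y_i = OR_j (a_ij AND w_j), and learning a sparse rule amounts to recovering the sparse Boolean vector w from the pairs (A, y). The Python is purely illustrative (synthetic data, our own variable names, brute-force recovery); the chapter replaces this combinatorial search with an LP relaxation.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_samples, n_features, sparsity = 30, 12, 2

A = rng.integers(0, 2, size=(n_samples, n_features))   # binarized feature matrix
w_true = np.zeros(n_features, dtype=int)                # sparse Boolean rule vector
w_true[rng.choice(n_features, size=sparsity, replace=False)] = 1

# Boolean measurements: y_i = OR_j (A[i, j] AND w[j]).
y = (A @ w_true > 0).astype(int)

# Brute-force search for a sparse rule consistent with all labels; the chapter
# replaces this combinatorial search with a linear-programming relaxation.
for support in combinations(range(n_features), sparsity):
    w = np.zeros(n_features, dtype=int)
    w[list(support)] = 1
    if np.array_equal((A @ w > 0).astype(int), y):
        print("consistent support:", support)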

Notes

  1. Other approaches to approximately solve group testing include greedy methods and loopy belief propagation; see references in [34].

  2. Instead of using LP, one can find solutions greedily, as is done in the SCM, which gives a log(m) approximation. The same guarantee holds for LP with randomized rounding. Empirically, LP tends to find sparser solutions.

  3. Surprisingly, for many practical datasets the LP formulation obtains integral solutions, or requires a small number of branch-and-bound steps.

  4. In general it will contain the features and their complements as columns. However, with enough data, one of the two choices will be removed by zero-row elimination beforehand (see the sketch after these notes).

  5. Here, the subscript “z” stands for zero and “o” stands for one.

  6. We use IBM SPSS Modeler 14.1 and Matlab R2009a with default settings.
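As a small, hypothetical illustration of note 4 (the function name and thresholds below are ours, not from the chapter): each binarized feature contributes two candidate columns, the condition and its complement, and zero-row elimination would later discard one column of each pair when enough data is available.

import numpy as np

def binarize_with_complements(X, thresholds):
    """Build a candidate rule matrix from numeric features: one column per
    threshold condition and one for its complement (illustrative sketch)."""
    cols, names = [], []
    for j, t in enumerate(thresholds):
        base = (X[:, j] > t).astype(int)
        cols.extend([base, 1 - base])                  # feature and its complement
        names.extend([f"x{j} > {t}", f"x{j} <= {t}"])
    return np.column_stack(cols), names

X = np.array([[1.0, 5.0], [2.0, 3.0], [0.5, 7.0]])
A, names = binarize_with_complements(X, thresholds=[1.0, 4.0])
print(names)   # candidate clauses; zero-row elimination would prune redundant ones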

References

  1. Adams, S.T., Leveson, S.H.: Clinical prediction rules. Br. Med. J. 344, d8312 (2012)
  2. Atia, G.K., Saligrama, V.: Boolean compressed sensing and noisy group testing. IEEE Trans. Inf. Theory 58(3), 1880–1901 (2012)
  3. Bertsimas, D., Chang, A., Rudin, C.: An integer optimization approach to associative classification. In: Advances in Neural Information Processing Systems 25, pp. 269–277 (2012)
  4. Blum, A., Kalai, A., Langford, J.: Beating the hold-out: bounds for k-fold and progressive cross-validation. In: Proceedings of the Conference on Computational Learning Theory, Santa Cruz, CA, pp. 203–208 (1999)
  5. Boros, E., Hammer, P.L., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12(2), 292–306 (2000)
  6. Candès, E.J., Wakin, M.B.: An introduction to compressive sampling. IEEE Signal Process. Mag. 25(2), 21–30 (2008)
  7. Chen, H.B., Fu, H.L.: Nonadaptive algorithms for threshold group testing. Discret. Appl. Math. 157, 1581–1585 (2009)
  8. Cheraghchi, M., Hormati, A., Karbasi, A., Vetterli, M.: Compressed sensing with probabilistic measurements: a group testing solution. In: Proceedings of the Annual Allerton Conference on Communication Control and Computing, Allerton, IL, pp. 30–35 (2009)
  9. Clark, P., Niblett, T.: The CN2 induction algorithm. Mach. Learn. 3(4), 261–283 (1989)
  10. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the International Conference on Machine Learning, Tahoe City, CA, pp. 115–123 (1995)
  11. Dai, L., Pelckmans, K.: An ellipsoid based, two-stage screening test for BPDN. In: Proceedings of the European Signal Processing Conference, Bucharest, Romania, pp. 654–658 (2012)
  12. Dash, S., Malioutov, D.M., Varshney, K.R.: Screening for learning classification rules via Boolean compressed sensing. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, Italy, pp. 3360–3364 (2014)
  13. Dash, S., Malioutov, D.M., Varshney, K.R.: Learning interpretable classification rules using sequential row sampling. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brisbane, Australia (2015)
  14. Dembczyński, K., Kotłowski, W., Słowiński, R.: ENDER: a statistical framework for boosting decision rules. Data Min. Knowl. Disc. 21(1), 52–90 (2010)
  15. Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proc. Natl. Acad. Sci. 100(5), 2197–2202 (2003)
  16. Du, D.Z., Hwang, F.K.: Pooling Designs and Nonadaptive Group Testing: Important Tools for DNA Sequencing. World Scientific, Singapore (2006)
  17. Dyachkov, A.G., Rykov, V.V.: A survey of superimposed code theory. Prob. Control. Inf. 12(4), 229–242 (1983)
  18. Dyachkov, A.G., Vilenkin, P.A., Macula, A.J., Torney, D.C.: Families of finite sets in which no intersection of l sets is covered by the union of s others. J. Combin. Theory 99, 195–218 (2002)
  19. Eckstein, J., Goldberg, N.: An improved branch-and-bound method for maximum monomial agreement. INFORMS J. Comput. 24(2), 328–341 (2012)
  20. El Ghaoui, L., Viallon, V., Rabbani, T.: Safe feature elimination in sparse supervised learning. Pac. J. Optim. 8(4), 667–698 (2012)
  21. Emad, A., Milenkovic, O.: Semiquantitative group testing. IEEE Trans. Inf. Theory 60(8), 4614–4636 (2014)
  22. Frank, A., Asuncion, A.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2010)
  23. Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 2(3), 916–954 (2008)
  24. Fry, C.: Closing the gap between analytics and action. INFORMS Analytics Mag. 4(6), 4–5 (2011)

  25. Gage, B.F., Waterman, A.D., Shannon, W., Boechler, M., Rich, M.W., Radford, M.J.: Validation of clinical classification schemes for predicting stroke. J. Am. Med. Assoc. 285(22), 2864–2870 (2001)

  26. Gawande, A.: The Checklist Manifesto: How To Get Things Right. Metropolitan Books, New York (2009)
  27. Gilbert, A.C., Iwen, M.A., Strauss, M.J.: Group testing and sparse signal recovery. In: Conference Record - Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pp. 1059–1063 (2008)
  28. Jawanpuria, P., Nath, J.S., Ramakrishnan, G.: Efficient rule ensemble learning using hierarchical kernels. In: Proceedings of the International Conference on Machine Learning, Bellevue, WA, pp. 161–168 (2011)
  29. John, G.H., Langley, P.: Static versus dynamic sampling for data mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 367–370 (1996)
  30. Kautz, W., Singleton, R.: Nonrandom binary superimposed codes. IEEE Trans. Inf. Theory 10(4), 363–377 (1964)
  31. Letham, B., Rudin, C., McCormick, T.H., Madigan, D.: Building interpretable classifiers with rules using Bayesian analysis. Tech. Rep. 609, Department of Statistics, University of Washington (2012)
  32. Liu, J., Li, M.: Finding cancer biomarkers from mass spectrometry data by decision lists. J. Comput. Biol. 12(7), 971–979 (2005)
  33. Liu, J., Zhao, Z., Wang, J., Ye, J.: Safe screening with variational inequalities and its application to lasso. In: Proceedings of the International Conference on Machine Learning, Beijing, China, pp. 289–297 (2014)
  34. Malioutov, D., Malyutov, M.: Boolean compressed sensing: LP relaxation for group testing. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 3305–3308 (2012)
  35. Malioutov, D.M., Varshney, K.R.: Exact rule learning via Boolean compressed sensing. In: Proceedings of the International Conference on Machine Learning, Atlanta, GA, pp. 765–773 (2013)
  36. Malioutov, D.M., Sanghavi, S.R., Willsky, A.S.: Sequential compressed sensing. IEEE J. Spec. Top. Signal Proc. 4(2), 435–444 (2010)
  37. Malyutov, M.: The separating property of random matrices. Math. Notes 23(1), 84–91 (1978)
  38. Malyutov, M.: Search for sparse active inputs: a review. In: Aydinian, H., Cicalese, F., Deppe, C. (eds.) Information Theory, Combinatorics, and Search Theory: In Memory of Rudolf Ahlswede, pp. 609–647. Springer, Berlin (2013)
  39. Marchand, M., Shawe-Taylor, J.: The set covering machine. J. Mach. Learn. Res. 3, 723–746 (2002)
  40. Maron, O., Moore, A.W.: Hoeffding races: accelerating model selection search for classification and function approximation. Adv. Neural Inf. Proces. Syst. 6, 59–66 (1993)
  41. Mazumdar, A.: On almost disjunct matrices for group testing. In: Proceedings of the International Symposium on Algorithms and Computation, Taipei, Taiwan, pp. 649–658 (2012)
  42. Provost, F., Jensen, D., Oates, T.: Efficient progressive sampling. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 23–32 (1999)
  43. Quinlan, J.R.: Simplifying decision trees. Int. J. Man Mach. Stud. 27(3), 221–234 (1987)
  44. Rivest, R.L.: Learning decision lists. Mach. Learn. 2(3), 229–246 (1987)
  45. Rückert, U., Kramer, S.: Margin-based first-order rule learning. Mach. Learn. 70(2–3), 189–206 (2008)
  46. Sejdinovic, D., Johnson, O.: Note on noisy group testing: asymptotic bounds and belief propagation reconstruction. In: Proceedings of the Annual Allerton Conference on Communication Control and Computing, Allerton, IL, pp. 998–1003 (2010)
  47. Stinson, D.R., Wei, R.: Generalized cover-free families. Discret. Math. 279, 463–477 (2004)
  48. Ustun, B., Rudin, C.: Methods and models for interpretable linear classification. Available at http://arxiv.org/pdf/1405.4047 (2014)
  49. Wagstaff, K.L.: Machine learning that matters. In: Proceedings of the International Conference on Machine Learning, Edinburgh, United Kingdom, pp. 529–536 (2012)
  50. Wang, F., Rudin, C.: Falling rule lists. Available at http://arxiv.org/pdf/1411.5899 (2014)
  51. Wang, J., Zhou, J., Wonka, P., Ye, J.: Lasso screening rules via dual polytope projection. Adv. Neural Inf. Proces. Syst. 26, 1070–1078 (2013)
  52. Wang, Y., Xiang, Z.J., Ramadge, P.J.: Lasso screening with a small regularization parameter. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3342–3346 (2013)
  53. Wang, Y., Xiang, Z.J., Ramadge, P.J.: Tradeoffs in improved screening of lasso problems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3297–3301 (2013)
  54. Wang, T., Rudin, C., Doshi, F., Liu, Y., Klampfl, E., MacNeille, P.: Bayesian or's of and's for interpretable classification with application to context aware recommender systems. Available at http://arxiv.org/abs/1504.07614 (2015)
  55. Wu, H., Ramadge, P.J.: The 2-codeword screening test for lasso problems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 3307–3311 (2013)
  56. Xiang, Z.J., Ramadge, P.J.: Fast lasso screening tests based on correlations. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, pp. 2137–2140 (2012)
  57. Xiang, Z.J., Xu, H., Ramadge, P.J.: Learning sparse representations of high dimensional data on large scale dictionaries. In: Advances in Neural Information Processing Systems 24, pp. 900–908. MIT Press, Cambridge, MA (2011)

Acknowledgements

The authors thank Vijay S. Iyengar, Benjamin Letham, Cynthia Rudin, Viswanath Nagarajan, Karthikeyan Natesan Ramamurthy, Mikhail Malyutov and Venkatesh Saligrama for valuable discussions.

Author information

Correspondence to Dmitry M. Malioutov.

Appendices

Appendix 1: Dual Linear Program

We now derive the dual LP, which we use in Sect. 5. We start off by giving a reformulation of the LP in (10), i.e., we consider an LP with the same set of optimal solutions as the one in (10). First note that the upper bounds of 1 on the variables \(\xi_i\) are redundant. Let \((\bar{\mathbf{w}},\bar{\boldsymbol{\xi}})\) be a feasible solution of (10) without the upper bound constraints such that \(\bar{\xi}_{i} > 1\) for some \(i \in \mathcal{P}\). Reducing \(\bar{\xi}_{i}\) to 1 yields a feasible solution (as \(\mathbf{a}_{i}\bar{\mathbf{w}} + \bar{\xi}_{i} \geq 1\), the only inequality \(\xi_i\) participates in besides the bound constraints, is still satisfied). The new feasible solution has a lower objective function value than before, as \(\xi_i\) has a positive coefficient in the objective function (which is to be minimized). One can similarly argue that in every optimal solution of (10) without the upper bound constraints, we have \(w_j \leq 1\) for \(j = 1,\ldots,n\). Finally, observe that we can substitute \(\xi_i\) for \(i \in \mathcal{Z}\) in the objective function by \(\mathbf{a}_{i}\mathbf{w}\) because of the constraints \(\mathbf{a}_{i}\mathbf{w} = \xi_{i}\) for \(i \in \mathcal{Z}\). We thus get the following LP equivalent to (10):

$$ \displaystyle\begin{array}{rcl} & & \min \quad \sum _{j=1}^{n}\left (\lambda +\|\mathbf{a}_{ \mathcal{Z}}^{j}\|_{ 1}\right )w_{j} +\sum _{ i=1}^{p}\xi _{ i} \\ & & \,\ \mathrm{s.t.}\quad 0 \leq w_{j},\,j = 1,\ldots,n \\ & & \qquad \quad 0 \leq \xi _{i},\,i = 1,\ldots,p \\ & & \qquad \quad \mathbf{A}_{\mathcal{P}}\mathbf{w} +\boldsymbol{\xi } _{\mathcal{P}}\geq \mathbf{1}. {}\end{array}$$
(19)

The optimal solutions and optimal objective values are the same as in (10).

Writing \(\mathbf{A}_{\mathcal{P}}\mathbf{w} +\boldsymbol{\xi } _{\mathcal{P}}\) as \(\mathbf{A}_{\mathcal{P}}\mathbf{w} + \mathbf{I}\boldsymbol{\xi }_{\mathcal{P}}\), where I is the p × p identity matrix, \(\vert \vert \mathbf{a}_{\mathcal{Z}}^{j}\vert \vert _{1}\) as \(\mathbf{1}^{T}\mathbf{a}_{\mathcal{Z}}^{j}\), and letting \(\boldsymbol{\mu }\) be a row vector of p dual variables, one can see that the dual is:

$$\displaystyle\begin{array}{rcl} & & \max \quad \sum _{i=1}^{p}\mu _{ i} \\ & & \,\ \mathrm{s.t. }\quad 0 \leq \mu _{i} \leq 1,\,i = 1,\ldots,p \\ & & \qquad \quad \boldsymbol{\mu }^{T}\mathbf{A}_{ \mathcal{P}}\leq \lambda \mathbf{1}_{n} + \mathbf{1}^{T}\mathbf{A}_{ \mathcal{Z}}.{}\end{array}$$
(20)

Suppose \(\boldsymbol{\bar{\mu }}\) is a feasible solution to (20). Then clearly \(\sum _{i=1}^{p}\bar{\mu }_{i}\) yields a lower bound on the optimal solution value of (19).
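As a numerical sanity check (illustrative code only; the random data and the names A_P, A_Z, lam are our own assumptions, not the chapter's implementation), the sketch below builds the primal LP (19) and the dual (20) with scipy.optimize.linprog and confirms that their optimal values coincide, so any feasible dual vector indeed lower-bounds (19).

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, p, z, lam = 8, 20, 15, 0.5          # features, |P| rows, |Z| rows, lambda
A_P = rng.integers(0, 2, size=(p, n))
A_Z = rng.integers(0, 2, size=(z, n))

# Primal (19): variables [w (n entries), xi (p entries)], all nonnegative.
c = np.concatenate([lam + A_Z.sum(axis=0), np.ones(p)])   # (lambda + ||a_Z^j||_1, 1)
A_ub = -np.hstack([A_P, np.eye(p)])    # A_P w + xi >= 1  <=>  -(A_P w + xi) <= -1
b_ub = -np.ones(p)
primal = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")

# Dual (20): maximize sum(mu) s.t. 0 <= mu <= 1, mu^T A_P <= lam*1 + 1^T A_Z.
dual = linprog(-np.ones(p), A_ub=A_P.T,
               b_ub=lam * np.ones(n) + A_Z.sum(axis=0),
               bounds=(0, 1), method="highs")

print("primal optimum:", primal.fun)
print("dual   optimum:", -dual.fun)    # equals the primal optimum at optimality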

Appendix 2: Derivation of Screening Tests

Let \(\mathcal{S}(j)\) stand for the support of \(\mathbf{a}_{\mathcal{P}}^{j}\). Furthermore, let \(\mathcal{N}(j)\) stand for the support of \(\mathbf{1} - \mathbf{a}_{\mathcal{P}}^{j}\), i.e., it is the set of indices from \(\mathcal{P}\) such that the corresponding components of \(\mathbf{a}_{\mathcal{P}}^{j}\) are zero.

Now consider the situation where we fix \(w_1\) (say) to 1. Let \(\mathbf{A}^{\prime}\) stand for the submatrix of \(\mathbf{A}\) consisting of the last n − 1 columns. Let \(\mathbf{w}^{\prime}\) stand for the vector of variables \(w_2,\ldots,w_n\). Then the constraints \(\mathbf{A}_{\mathcal{P}}\mathbf{w} +\boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{1}\) in (19) become \(\mathbf{A}_{\mathcal{P}}^{\prime}\mathbf{w}^{\prime} +\boldsymbol{\xi}_{\mathcal{P}} \geq \mathbf{1} -\mathbf{a}_{\mathcal{P}}^{1}\). Therefore, for all \(i \in \mathcal{S}(1)\), the corresponding constraint is now \((\mathbf{A}_{\mathcal{P}}^{\prime})_{i}\mathbf{w}^{\prime} +\xi_{i} \geq 0\), which is a redundant constraint as \(\mathbf{A}_{\mathcal{P}}^{\prime}\geq 0\) and \(\mathbf{w}^{\prime},\xi_{i} \geq 0\). The only remaining non-redundant constraints correspond to the indices in \(\mathcal{N}(1)\). Then the value of (19) with \(w_1\) set to 1 becomes

$$\displaystyle{ \begin{array}{rl} \left (\lambda +\|\mathbf{a}_{\mathcal{Z}}^{1}\|_{1}\right ) +\min \quad &\sum _{j=2}^{n}\left (\lambda +\|\mathbf{a}_{\mathcal{Z}}^{j}\|_{1}\right )w_{j} +\sum _{i\in \mathcal{N}(1)}\xi _{i} \\ \mathrm{s.t.}\quad &0 \leq w_{j},\,j = 2,\ldots,n \\ &0 \leq \xi _{i},\,i \in \mathcal{N}(1) \\ &\mathbf{A}^{\prime}_{\mathcal{N}(1)}\mathbf{w}^{\prime} +\boldsymbol{\xi } _{\mathcal{N}(1)} \geq \mathbf{1}. \end{array} }$$
(21)

This LP clearly has the same form as the LP in (19). Furthermore, given any feasible solution \(\boldsymbol{\bar{\mu }}\) of (20), \(\boldsymbol{\bar{\mu }}_{\mathcal{N}(1)}\) defines a feasible dual solution of (21) as

$$\displaystyle\begin{array}{rcl} & \boldsymbol{\bar{\mu }}^{T}\mathbf{A}_{\mathcal{P}}\leq \lambda \mathbf{1}_{n} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}} & {}\\ & \Rightarrow \boldsymbol{\bar{\mu }}_{\mathcal{S}(1)}^{T}\mathbf{A}^{\prime}_{\mathcal{S}(1)} +\boldsymbol{\bar{\mu }}_{ \mathcal{N}(1)}^{T}\mathbf{A}^{\prime}_{\mathcal{N}(1)} \leq \lambda \mathbf{1}_{n-1} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}^{\prime}& {}\\ & \Rightarrow \boldsymbol{\bar{\mu }}_{\mathcal{N}(1)}^{T}\mathbf{A}^{\prime}_{\mathcal{N}(1)} \leq \lambda \mathbf{1}_{n-1} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}^{\prime}. & {}\\ \end{array}$$

Therefore \(\sum_{i\in \mathcal{N}(1)}\bar{\mu}_{i}\) is a lower bound on the optimal solution value of the LP in (21), and therefore

$$\displaystyle{ \lambda +\vert \vert \mathbf{a}_{\mathcal{Z}}^{1}\vert \vert _{ 1} +\sum _{i\in \mathcal{N}(1)}\bar{\mu }_{i} }$$
(22)

is a lower bound on the optimal solution value of (19) with \(w_1\) set to 1. In particular, if \((\bar{\mathbf{w}},\bar{\boldsymbol{\xi}})\) is a feasible integral solution to (19) with objective function value \(\lambda (\sum_{i=1}^{n}\bar{w}_{i}) +\sum_{i=1}^{p}\bar{\xi}_{i}\), and if (22) is greater than this value, then no optimal integral solution of (19) can have \(w_1 = 1\). Therefore \(w_1 = 0\) in any optimal solution, and we can simply drop the column corresponding to \(w_1\) from the LP.

In order to use the screening results in this section we need to obtain a feasible primal and a feasible dual solution. Some useful heuristics to obtain such a pair are described in [12].
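Concretely, the screening test can be applied column by column. The sketch below is our own illustration, assuming a feasible dual vector mu_bar for (20) and the objective value primal_obj of a feasible integral solution of (19) are already available (e.g., via the heuristics in [12]); it computes the bound (22) for each column and flags columns that can be fixed to zero.

import numpy as np

def screen_columns(A_P, A_Z, lam, mu_bar, primal_obj):
    """Return a Boolean mask of columns j that can be fixed to w_j = 0."""
    n = A_P.shape[1]
    drop = np.zeros(n, dtype=bool)
    col_cost = lam + A_Z.sum(axis=0)            # lambda + ||a_Z^j||_1 for each column
    for j in range(n):
        N_j = A_P[:, j] == 0                    # rows of P not covered by column j
        lower_bound = col_cost[j] + mu_bar[N_j].sum()   # the bound (22) with w_j = 1
        # If forcing w_j = 1 already costs more than a known feasible integral
        # solution, no optimal integral solution uses column j.
        drop[j] = lower_bound > primal_obj
    return drop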

Appendix 3: Extending the Dual Solution for Row-Sampling

Suppose that \(\hat{\boldsymbol{\mu }}^{p}\) is the optimal dual solution to the small LP in Sect. 5.3. Note that the number of variables in the dual for the large LP increases from p to \(\bar{p}\) and the bound on the second constraint grows from \(\lambda \mathbf{1}_{n} + \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}\) to \(\lambda \mathbf{1}_{n} + \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\).

We use a greedy heuristic to extend \(\hat{\boldsymbol{\mu}}^{p}\) to a feasible dual solution \(\bar{\boldsymbol{\mu}}_{\bar{p}}\) of the large LP. We set \(\bar{\mu}_{j} = \hat{\mu}_{j}\) for \(j = 1,\ldots,p\). We extend the remaining entries \(\bar{\mu}_{j}\) for \(j = p+1,\ldots,\bar{p}\) by setting a subset of these entries to 1 while satisfying the dual feasibility constraint. In other words, the extension of \(\boldsymbol{\bar{\mu}}\) corresponds to a subset \(\mathcal{R}\) of the row indices \(\{p + 1,\ldots,\bar{p}\}\) of \(\bar{\mathbf{A}}_{\mathcal{P}}\) such that \(\hat{\boldsymbol{\mu}}_{p}^{T}\mathbf{A}_{\mathcal{P}} +\sum_{i\in \mathcal{R}}(\bar{\mathbf{A}}_{\mathcal{P}})_{i} \leq \mathbf{1}^{T}\bar{\mathbf{A}}_{\mathcal{Z}}\). Having \(\boldsymbol{\bar{\mu}}^{T}\mathbf{A}_{\mathcal{P}} \leq \mathbf{1}^{T}\mathbf{A}_{\mathcal{Z}}\) with \(\boldsymbol{\bar{\mu}}\) extended by a binary vector implies that \(\boldsymbol{\bar{\mu}}\) is feasible for (20). We initialize \(\mathcal{R}\) to ∅ and then simply go through the unseen rows of \(\bar{\mathbf{A}}_{\mathcal{P}}\) in some fixed order (increasing from \(p + 1\) to \(\bar{p}\)), and for a row k, if

$$\displaystyle{\hat{\boldsymbol{\mu }}_{p}^{T}\mathbf{A}_{ \mathcal{P}} +\sum _{i\in \mathcal{R}}(\bar{\mathbf{A}}_{\mathcal{P}})_{i} + (\bar{\mathbf{A}}_{\mathcal{P}})_{k} \leq \mathbf{1}^{T}\bar{\mathbf{A}}_{ \mathcal{Z}},}$$

we set \(\mathcal{R}\) to \(\mathcal{R}\cup \{ k\}\). The heuristic (we call it H1) needs only a single pass through the matrix \(\bar{\mathbf{A}}_{\mathcal{P}}\), and is thus very fast.

This heuristic, however, does not use the optimal solution \(\hat{\mathbf{w}}^{m}\) in any way. Suppose \(\hat{\mathbf{w}}^{m}\) were an optimal solution of the large LP. Then complementary slackness would imply that if \((\bar{\mathbf{A}}_{\mathcal{P}})_{i}\hat{\mathbf{w}}^{m} > 1\), then in any optimal dual solution \(\boldsymbol{\mu },\mu _{i} = 0\). Thus, assuming \(\hat{\mathbf{w}}^{m}\) is close to an optimal solution for the large LP, we modify heuristic H1 to obtain heuristic H2, by simply setting \(\bar{\mu }_{i} = 0\) whenever \((\bar{\mathbf{A}}_{\mathcal{P}})_{i}\hat{\mathbf{w}}^{m} > 1\), while keeping the remaining steps unchanged.
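A minimal sketch of heuristics H1 and H2 follows (our own illustrative code and variable names), assuming the small-LP dual mu_hat, its constraint matrix A_P_small, the newly sampled rows A_bar_P_new, the enlarged zero-label matrix A_bar_Z, and, for H2, the small-LP primal solution w_hat are given.

import numpy as np

def extend_dual(mu_hat, A_P_small, A_bar_P_new, A_bar_Z, w_hat=None):
    """Greedily extend mu_hat to a feasible dual of the large LP (H1); if w_hat
    is given, additionally skip rows with slack primal constraints (H2)."""
    budget = A_bar_Z.sum(axis=0).astype(float)   # 1^T A_bar_Z, one entry per column
    used = mu_hat @ A_P_small                    # contribution of the original rows
    mu_new = np.zeros(A_bar_P_new.shape[0])
    for k, row in enumerate(A_bar_P_new):        # single pass over the new rows
        if w_hat is not None and row @ w_hat > 1:
            continue                             # H2: complementary slackness suggests mu_k = 0
        if np.all(used + row <= budget):         # adding this row keeps the dual feasible
            mu_new[k] = 1.0
            used = used + row
    return np.concatenate([mu_hat, mu_new])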

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Malioutov, D.M., Varshney, K.R., Emad, A., Dash, S. (2017). Learning Interpretable Classification Rules with Boolean Compressed Sensing. In: Cerquitelli, T., Quercia, D., Pasquale, F. (eds) Transparent Data Mining for Big and Small Data. Studies in Big Data, vol 32. Springer, Cham. https://doi.org/10.1007/978-3-319-54024-5_5

  • Print ISBN: 978-3-319-54023-8

  • Online ISBN: 978-3-319-54024-5
