Supersparse linear integer models for optimized medical scoring systems
Abstract
Scoring systems are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. These models are in widespread use by the medical community, but are difficult to learn from data because they need to be accurate and sparse, have coprime integer coefficients, and satisfy multiple operational constraints. We present a new method for creating data-driven scoring systems called a Supersparse Linear Integer Model (SLIM). SLIM scoring systems are built by using an integer programming problem that directly encodes measures of accuracy (the 0–1 loss) and sparsity (the \(\ell _0\)-seminorm) while restricting coefficients to coprime integers. SLIM can seamlessly incorporate a wide range of operational constraints related to accuracy and sparsity, and can produce acceptable models without parameter tuning because of the direct control provided over these quantities. We provide bounds on the testing and training accuracy of SLIM scoring systems, and present a new data reduction technique that can improve scalability by eliminating a portion of the training data beforehand. Our paper includes results from a collaboration with the Massachusetts General Hospital Sleep Laboratory, where SLIM is being used to create a highly tailored scoring system for sleep apnea screening.
Keywords
Medical scoring systems Discrete linear classification Integer programming 0–1 Loss Sparsity Interpretability Sleep apnea Supervised Classification
1 Introduction
Scoring systems are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. These models are used to assess the risk of numerous serious medical conditions since they allow physicians to make quick predictions, without extensive training, and without the use of a computer (see e.g., Knaus et al. 1991; Bone et al. 1992; Moreno et al. 2005). Many medical scoring systems that are currently in use were handcrafted by physicians, whereby a panel of experts simply agreed on a model (see e.g., the CHADS\(_2\) score of Gage et al. 2001). Some medical scoring systems are data-driven in the sense that they were created by rounding logistic regression coefficients (see e.g., the SAPS II score of Le Gall et al. 1993). Despite the widespread use of medical scoring systems, there has been little to no work that has focused on machine learning methods to learn these models from data.
Scoring systems are difficult to create using traditional machine learning methods because they need to be accurate, sparse, and use small coprime integer coefficients. This task is especially challenging in medical applications because models also need to satisfy explicit constraints on operational quantities such as the false positive rate or the number of features. Such requirements represent serious challenges for machine learning. Current methods for sparse linear classification such as the lasso (Tibshirani 1996) and elastic net (Zou and Hastie 2005) control the accuracy and sparsity of models via approximations to speed up computation, and require rounding to yield models with coprime integer coefficients. Approximations such as convex surrogate loss functions, \({\ell _{1}}\)-regularization, and rounding not only degrade predictive performance but make it difficult to address operational constraints imposed by physicians. To train a model that satisfies a hard constraint on the false positive rate, for instance, we must explicitly calculate its value, which is impossible when we control accuracy by means of a surrogate loss function. In practice, traditional methods can only address multiple operational constraints through a tuning process that involves high-dimensional grid search. As we show, this approach often fails to produce a model that satisfies operational constraints, let alone a scoring system that is optimized for predictive accuracy.
In this paper, we present a new method to create data-driven scoring systems called a Supersparse Linear Integer Model (SLIM). SLIM is an integer program that optimizes direct measures of accuracy (the 0–1 loss) and sparsity (the \(\ell _0\)-seminorm) while restricting coefficients to a small set of coprime integers. In comparison to current methods for sparse linear classification, SLIM can produce scoring systems that are fully optimized for accuracy and sparsity, and that satisfy a wide range of complicated operational constraints without any parameter tuning. The main contributions of our paper are as follows.

We present a principled machine learning approach to learn scoring systems from data. This approach can produce scoring systems that satisfy multiple operational constraints without any parameter tuning. Further, it has a unique advantage for imbalanced classification problems, where constraints on class-based accuracy can be explicitly enforced.

We derive new bounds on the accuracy of discrete linear classification models. In particular, we present discretization bounds that guarantee that we will not lose training accuracy when the size of the coefficient set is sufficiently large. In addition, we present generalization bounds that relate the size of the coefficient set to a uniform guarantee on testing accuracy.

We develop a novel data reduction technique that can improve the scalability of supervised classification algorithms by removing a portion of the training data beforehand. Further, we show how data reduction can be applied directly to SLIM.

We present results from a collaboration with the Massachusetts General Hospital (MGH) Sleep Laboratory where SLIM is being used to create a highly tailored scoring system for sleep apnea screening. Screening for sleep apnea is important: the condition is difficult to diagnose, has significant costs, and affects over 12 million people in the United States alone (Kapur 2010).

We provide a detailed experimental comparison between SLIM and eight popular classification methods on publicly available datasets. Our results suggest that SLIM can produce scoring systems that are accurate and sparse in a matter of minutes.
1.1 Related work
Our work is related to several streams of research, namely: medical scoring systems; sparse linear classification; discrete linear classification; and mixed-integer programming (MIP) approaches for classification. In what follows, we discuss related work in each area separately.
1.1.1 Medical scoring systems

Some popular medical scoring systems include:

SAPS I, II and III, to assess ICU mortality risk (Le Gall et al. 1984, 1993; Moreno et al. 2005);

APACHE I, II and III, to assess ICU mortality risk (Knaus et al. 1981, 1985, 1991);

CHADS\(_2\), to assess the risk of stroke in patients with atrial fibrillation (Gage et al. 2001);

Wells Criteria for pulmonary embolisms (Wells et al. 2000) and for deep vein thrombosis (Wells et al. 1997);

TIMI, to assess the risk of death and ischemic events (Antman et al. 2000);

SIRS, to detect systemic inflammatory response syndrome (Bone et al. 1992).
Many medical scoring systems were constructed without optimizing for predictive accuracy. In some cases, physicians built scoring systems by combining existing linear classification methods with heuristics. The SAPS II score, for instance, was constructed by rounding logistic regression coefficients: as Le Gall et al. (1993) write, “the general rule was to multiply the \(\beta \) for each range by 10 and round off to the nearest integer.” This approach is at odds with the fact that rounding is known to produce suboptimal solutions in the field of integer programming. In other cases, scoring systems were handcrafted by a panel of physicians, and not learned from data at all. This appears to have been the case for CHADS\(_2\) as suggested by Gage et al. (2001): “To create CHADS\(_2\), we assigned 2 points to a history of prior cerebral ischemia and 1 point for the presence of other risk factors because a history of prior cerebral ischemia increases the relative risk (RR) of subsequent stroke commensurate to 2 other risk factors combined. We calculated CHADS\(_2\), by adding 1 point each for each of the following—recent CHF, hypertension, age 75 years or older, and DM—and 2 points for a history of stroke or TIA.” Methods that can easily produce highly tailored prediction models, such as SLIM, should eliminate the need for physicians to create models by hand.
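The point assignment quoted from Gage et al. (2001) is simple enough to write down directly. The following is a minimal sketch of that rule as described in the quotation; the function name and argument names are hypothetical, and the snippet is an illustration of how a scoring system is applied, not clinical software.

```python
def chads2_score(chf, hypertension, age, diabetes, prior_stroke_or_tia):
    """Add up CHADS2 points as described in the quoted passage.

    One point each for recent CHF, hypertension, age 75 or older, and
    diabetes mellitus; two points for a history of stroke or TIA.
    """
    score = 0
    score += 1 if chf else 0
    score += 1 if hypertension else 0
    score += 1 if age >= 75 else 0
    score += 1 if diabetes else 0
    score += 2 if prior_stroke_or_tia else 0
    return score


# Hypothetical patient: hypertension, 80 years old, prior TIA.
example_score = chads2_score(
    chf=False, hypertension=True, age=80, diabetes=False, prior_stroke_or_tia=True
)
```

The whole model is five small integers and one threshold on age, which is what allows a physician to evaluate it without a computer.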
In addition to the sleep apnea application that we present in Sect. 6, SLIM has also been used to create a medical scoring system for diagnosing cognitive impairments using features derived from a clock-drawing test (Souillard-Mandar et al. 2015), and to create scoring systems for recidivism prediction (Zeng et al. 2015).
1.1.2 Sparse linear classification models
In comparison to SLIM, the majority of current methods for sparse linear classification are designed to fit models with real coefficients and would therefore require rounding to yield scoring systems. In practice, rounding the coefficients of a linear model may significantly alter its accuracy and sparsity, and may produce a scoring system that violates operational constraints on these quantities. Further, many current methods also control accuracy and sparsity by means of convex surrogate functions to preserve scalability (see e.g., Tibshirani 1996; Efron et al. 2004). As we show in Sects. 6 and 7, surrogate functions provide a poor tradeoff between accuracy and sparsity. Convex surrogate loss functions, for instance, produce models that do not attain the best learning-theoretic guarantee on predictive accuracy and are not robust to outliers (Li and Lin 2007; Brooks and Lee 2010; Nguyen and Sanner 2013). Similarly, \(\ell _1\)-regularization is only guaranteed to recover the correct sparse solution (i.e., the one that minimizes the \(\ell _0\)-norm) under restrictive conditions that are rarely satisfied in practice (Zhao and Bin 2007; Liu and Zhang 2009). In fact, \(\ell _1\)-regularization may recover a solution that attains a significant loss in predictive accuracy relative to the correct sparse solution (see e.g., Lin et al. 2008, for a discussion). Sparse linear classifiers can also be produced using feature selection algorithms (Guyon and Elisseeff 2003; Mao 2004), though these algorithms cannot guarantee an optimal balance between accuracy and sparsity as they typically rely on greedy optimization (with some exceptions, see e.g., Bradley et al. 1999).
1.1.3 Discrete linear classification models
SLIM is part of a recent stream of research on creating linear classifiers with discrete coefficients. Specifically, Chevaleyre et al. (2013) have considered training linear classifiers with binary coefficients by rounding the coefficients of linear classifiers that minimize the hinge loss. In addition, Carrizosa et al. (2013) have considered training linear classifiers with small integer coefficients by using a MIP formulation that optimizes the hinge loss. The discretization bounds and generalization bounds in Sect. 4 are a novel contribution to this body of work and applicable to all linear models with discrete coefficients.
SLIM can reproduce the models proposed by Chevaleyre et al. (2013) and Carrizosa et al. (2013) (see e.g., our formulation to create M-of-N rule tables in Sect. 8.2.1). The converse, however, is not necessarily true because the methods of Chevaleyre et al. (2013) and Carrizosa et al. (2013) have the following weaknesses: (i) they optimize the hinge loss as opposed to the 0–1 loss; and (ii) they do not include a mechanism to control sparsity. These differences may result in better scalability compared to SLIM. However, they also eliminate the ability of these methods to produce scoring systems that are sparse, that satisfy operational constraints on accuracy and/or sparsity, and that can be trained without parameter tuning.
1.1.4 MIP approaches for classification
SLIM uses integer programming (IP) to achieve three distinct goals: (i) minimize the 0–1 loss; (ii) penalize the \({\ell _{0}}\)-norm for feature selection; and (iii) restrict coefficients to a small set of integers. MIP approaches have been used to tackle each of these goals, albeit separately.
MIP formulations to minimize the 0–1 loss, for instance, were first proposed in Liittschwager and Wang (1978) and Bajgier and Hill (1982), and later refined by Mangasarian (1994), Asparoukhov and Stam (1997) and Glen (1999). Similarly, MIP formulations that penalize the \(\ell _0\)-norm for feature selection were proposed in Goldberg and Eckstein (2010), Goldberg and Eckstein (2012) and Nguyen and Franke (2012). To our knowledge, the only MIP formulation to restrict coefficients to a small set of integers is proposed in Carrizosa et al. (2013).
SLIM has unique practical benefits in comparison to these MIP formulations since it handles all three of these goals simultaneously. As we discuss in Sect. 2, the simultaneous approach allows SLIM to train models without parameter tuning, and make use of the variables that encode the 0–1 loss and \({\ell _{0}}\)-norm to accommodate important operational constraints. Further, restricting coefficients to a finite discrete set leads to an IP formulation whose LP relaxation is significantly tighter than other MIP formulations designed to minimize the 0–1 loss and/or penalize the \(\ell _0\)-norm.
The problem of finding a linear classifier that minimizes the 0–1 loss function is sometimes referred to as misclassification minimization for the linear discriminant problem in the MIP community (see e.g., Rubin 2009; Lee and Wu 2009, for an overview). Seeing how early attempts at misclassification minimization were only feasible for tiny datasets with at most 200 examples (see e.g., Joachimsthaler and Stam 1990), a large body of work has focused on improving scalability by using heuristics (Rubin 1990; Yanev and Balev 1999), decomposition procedures (Rubin 1997), cutting planes (Brooks 2011) and specialized branch-and-bound algorithms (Nguyen and Sanner 2013). The data reduction technique we present in Sect. 5 is a novel contribution to this body of work, and addresses a need for general methods to remove redundant data put forth by Bradley et al. (1999). We compare and contrast data reduction to existing approaches in greater detail in Sect. 5.3.
2 Methodology
SLIM is designed to produce scoring systems that attain a Pareto-optimal tradeoff between accuracy and sparsity: when we minimize 0–1 loss and the \(\ell _0\)-penalty, we only sacrifice classification accuracy to attain higher sparsity, and vice versa. Minimizing the 0–1 loss produces scoring systems that are completely robust to outliers and attain the best learning-theoretic guarantee on predictive accuracy (see e.g., Brooks 2011; Nguyen and Sanner 2013). Similarly, controlling for sparsity via \({\ell _{0}}\)-regularization prevents the additional loss in accuracy due to \(\ell _1\)-regularization (see Lin et al. 2008, for a discussion). We can make further use of the variables that encode the 0–1 loss and \({\ell _{0}}\)-penalty to formulate operational constraints related to accuracy and sparsity (see Sect. 3).
One unique benefit in minimizing an approximation-free objective function over a finite set of discrete coefficients is that the free parameters in the objective of (2) have special properties.
Remark 1
Remark 2
The tradeoff parameter \(C_0\) represents the maximum accuracy that SLIM will sacrifice to remove a feature from the optimal scoring system.
Remark 3
Remark 4
The aforementioned properties are only possible using the formulation in (2). In particular, the properties in Remarks 2–4 require that we control accuracy using the 0–1 loss and control sparsity using an \(\ell _0\)-penalty. In addition, the property in Remark 1 requires that we restrict coefficients to a finite set of discrete values.
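To make the role of \(C_0\) concrete, the sketch below enumerates a tiny coefficient set and minimizes the 0–1 loss plus an \(\ell _0\)-penalty by brute force. It is an illustration only: SLIM solves an integer program rather than enumerating, the full objective in (2) is not reproduced here, and the conventions that the first column of `X` is an unpenalized intercept and that ties are broken by enumeration order are assumptions of this sketch.

```python
from itertools import product


def zero_one_loss(lam, X, y):
    """Fraction of examples with y_i * (lam . x_i) <= 0."""
    errors = 0
    for x_i, y_i in zip(X, y):
        score = sum(l * v for l, v in zip(lam, x_i))
        if y_i * score <= 0:
            errors += 1
    return errors / len(X)


def brute_force_scoring_system(X, y, values, c0):
    """Minimize 0-1 loss + c0 * (number of nonzero non-intercept coefficients)."""
    best_lam, best_obj = None, float("inf")
    for lam in product(values, repeat=len(X[0])):
        l0 = sum(1 for l in lam[1:] if l != 0)  # assumption: intercept is not penalized
        obj = zero_one_loss(lam, X, y) + c0 * l0
        if obj < best_obj:
            best_lam, best_obj = lam, obj
    return best_lam, best_obj


# Toy data: columns are (intercept, feature 1, feature 2); the label depends on feature 1 only.
X = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1]]
y = [1, 1, -1, -1]
values = range(-2, 3)

sparse_lam, sparse_obj = brute_force_scoring_system(X, y, values, c0=0.1)
empty_lam, empty_obj = brute_force_scoring_system(X, y, values, c0=0.6)
```

With `c0=0.1` the search keeps feature 1 because dropping it would cost 50 % accuracy, far more than the penalty. With `c0=0.6` the penalty exceeds that gain, so the search returns a model with no features, which matches the reading of \(C_0\) in Remark 2.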
2.1 SLIM IP Formulation
Restricting coefficients to a finite discrete set results in significant practical benefits for the SLIM IP formulation, especially in comparison to other IP formulations that minimize the 0–1 loss and/or penalize the \(\ell _0\)-norm. Many IP formulations compute the 0–1 loss and \(\ell _0\)-norm by means of Big-M constraints that rely on Big-M constants (see e.g., Wolsey 1998). Restricting the coefficients to a finite discrete set allows us to bound Big-M constants that are typically set heuristically to a large value. Specifically, the Big-M constant for computing the 0–1 loss in constraints (3a) is bounded as \(M_i \le \max _{\varvec{\lambda }\in \mathcal {L}} (\gamma - y_i\varvec{\lambda }^T\varvec{x}_i)\) (compare with Brooks 2011, where the same parameter has to be approximated by a “sufficiently large constant”). Similarly, the Big-M constant used to compute the \(\ell _0\)-norm in constraints (3c) is bounded as \(\varLambda _j \le \max _{\lambda _j\in \mathcal {L}_j} \lambda _j\) (compare with Guan et al. 2009, where this same parameter has to be approximated by a “sufficiently large constant”). These differences lead to a tighter LP relaxation, which narrows the integrality gap, and subsequently improves the ability of commercial IP solvers to obtain a proof of optimality.
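As a rough illustration of the bound on \(M_i\), the sketch below computes \(\max _{\varvec{\lambda }\in \mathcal {L}} (\gamma - y_i\varvec{\lambda }^T\varvec{x}_i)\) when \(\mathcal {L}\) is a product of per-coefficient sets \(\mathcal {L}_j\), in which case the maximum separates across coefficients. The function names, the value \(\gamma = 0.1\), and the product structure of \(\mathcal {L}\) are assumptions made for this example.

```python
from itertools import product


def big_m_bound(x_i, y_i, coef_sets, gamma=0.1):
    """Compute max over lam in L_1 x ... x L_P of (gamma - y_i * lam . x_i).

    Because the coefficient set is a product of per-coefficient sets, each
    term -y_i * x_ij * lam_j can be maximized independently.
    """
    total = gamma
    for x_ij, values in zip(x_i, coef_sets):
        total += max(-y_i * x_ij * lam_j for lam_j in values)
    return total


def big_m_bound_by_enumeration(x_i, y_i, coef_sets, gamma=0.1):
    """Same quantity by explicit enumeration; only practical for tiny sets."""
    return max(
        gamma - y_i * sum(l * v for l, v in zip(lam, x_i))
        for lam in product(*coef_sets)
    )


# Example with an intercept in {-100,...,100} and two coefficients in {-10,...,10}.
coef_sets = [range(-100, 101), range(-10, 11), range(-10, 11)]
m_i = big_m_bound([1, 1, 0], 1, coef_sets)
```

A constant computed this way is the smallest value that keeps the Big-M constraint valid for every coefficient vector in the set, which is the source of the tighter LP relaxation discussed above.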
3 Operational constraints
In this section, we discuss how SLIM can accommodate a wide range of operational constraints related to the accuracy and sparsity of predictive models. The following techniques provide users with a practical approach to create tailored prediction models. They are made possible by the facts that: (i) the variables that encode the 0–1 loss and \({\ell _{0}}\)-penalty in the SLIM IP formulation can also be used to handle accuracy and sparsity; and (ii) the free parameters in the SLIM objective can be set without tuning (see Remarks 2–4).
3.1 Loss constraints for imbalanced data
The majority of classification problems in the medical domain are imbalanced. Handling imbalanced data is incredibly difficult for most classification methods since maximizing classification accuracy often produces a trivial model (e.g., if the probability of heart attack is 1 %, a model that never predicts a heart attack is still 99 % accurate).
3.2 Feature-based constraints for input variables
SLIM provides fine-grained control over the composition of input variables in a scoring system by formulating feature-based constraints. Specifically, we can make use of the indicator variables that encode the \(\ell _0\)-norm \(\alpha _j := \mathbbm {1}\left[ \lambda _j \ne 0\right] \) to formulate arbitrarily complicated logical constraints between features such as “either-or” conditions or “if-then” conditions (see e.g., Wolsey 1998). This presents a practical alternative to create classification models that obey structured sparsity constraints (see e.g., Jenatton et al. 2011) or hierarchical constraints (see e.g., Bien et al. 2013).
3.3 Feature-based preferences
Physicians often have soft preferences between different input variables. SLIM allows practitioners to encode these preferences by specifying a distinct tradeoff parameter for each coefficient \(C_{0,j}\).
Explicitly, when our model should use feature j instead of feature k, we set \(C_{0,k} = C_{0,j} + \delta \), where \(\delta > 0\) represents the maximum additional training accuracy that we are willing to sacrifice in order to use feature j instead of feature k. Thus, setting \(C_{0,k} = C_{0,j} + 0.02\) would ensure that we would only be willing to use feature k instead of feature j if it yields an additional 2 % gain in training accuracy over feature j.
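The arithmetic behind this preference can be checked with a small sketch. It compares two hypothetical one-feature models under per-feature penalties with \(\delta = 0.02\); the training errors, the penalty values, and the function name are made up for illustration, and the objective is simplified to training error plus the sum of the penalties of the features used.

```python
def penalized_objective(train_error, used_features, c0):
    """Training error plus the per-feature penalties of the features used."""
    return train_error + sum(c0[j] for j in used_features)


# Hypothetical penalties: C_{0,k} = C_{0,j} + delta with delta = 0.02.
c0 = {"j": 0.01, "k": 0.03}

model_with_j = penalized_objective(0.10, ["j"], c0)

# Feature k improves training accuracy by 1 %: less than delta, so j is still preferred.
model_with_k_small_gain = penalized_objective(0.09, ["k"], c0)

# Feature k improves training accuracy by 3 %: more than delta, so k is now preferred.
model_with_k_large_gain = penalized_objective(0.07, ["k"], c0)

prefers_j = model_with_j < model_with_k_small_gain
prefers_k = model_with_k_large_gain < model_with_j
```

Both comparisons come out as expected under these numbers: the model that uses feature k only wins once its accuracy gain exceeds the 2 % gap between the two penalties.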
4 Bounds on training and testing accuracy
In this section, we present bounds on the training and testing accuracy of SLIM scoring systems.
4.1 Discretization bounds on training accuracy
Our first result shows that we can always craft a finite discrete set of coefficients \(\mathcal {L}\) so that the training accuracy of a linear classifier with discrete coefficients \(\varvec{\lambda }\in \mathcal {L}\) (e.g. SLIM) is no worse than the training accuracy of a baseline linear classifier with real-valued coefficients \(\varvec{\rho }\in \mathbb {R}^P\) (e.g. SVM).
Theorem 1
(Minimum margin resolution bound) Let \(\varvec{\rho }= [\rho _1,\ldots ,\rho _P]^T \in \mathbb {R}^P\) denote the coefficients of a baseline linear classifier trained using data \(\mathcal {D}_N = (\varvec{x}_i,y_i)_{i=1}^N\). Let \(X_{\max } = \max _i \Vert \varvec{x}_i\Vert _2\) and \(\gamma _{\min } = \min _i \frac{\varvec{\rho }^T\varvec{x}_i}{\left\| \varvec{\rho }\right\| _2}\) denote the largest magnitude and minimum margin achieved by any training example, respectively.
Proof
See Appendix. \(\square \)
The proof of Theorem 1 uses a rounding procedure to choose a resolution parameter \(\varLambda \) so that the coefficient set \(\mathcal {L}\) contains a classifier with discrete coefficients \(\varvec{\lambda }\) that attains the same 0–1 loss as the baseline classifier with real coefficients \(\varvec{\rho }\). If the baseline classifier \(\varvec{\rho }\) is obtained by minimizing a convex surrogate loss, then the optimal SLIM classifier trained with the coefficient set from Theorem 1 may attain a lower 0–1 loss than \(\mathbbm {1}\left[ y_i\varvec{\rho }^T\varvec{x}_i\le 0\right] \) because SLIM directly minimizes the 0–1 loss.
The next corollary yields additional bounds on the training accuracy by considering progressively larger values of the margin. These bounds can be used to relate the resolution parameter \(\varLambda \) to a worst-case guarantee on training accuracy.
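The idea of rounding at a given resolution can be illustrated numerically. The sketch below scales a real-valued coefficient vector to length \(\varLambda \), rounds each entry to the nearest integer, and searches for the smallest \(\varLambda \) at which the rounded classifier has the same number of training errors as the baseline. The scaling rule, the search loop, and the toy data are assumptions of this example; the exact construction used in the proof is given in the Appendix and is not reproduced here.

```python
import math


def num_errors(coefs, X, y):
    """Number of examples with y_i * (coefs . x_i) <= 0."""
    return sum(
        1
        for x_i, y_i in zip(X, y)
        if y_i * sum(c * v for c, v in zip(coefs, x_i)) <= 0
    )


def round_to_resolution(rho, resolution):
    """Scale rho to Euclidean length `resolution`, then round to integers."""
    norm = math.sqrt(sum(r * r for r in rho))
    return [round(resolution * r / norm) for r in rho]


def smallest_matching_resolution(rho, X, y, max_resolution=1000):
    """Smallest resolution whose rounded classifier matches the baseline error count."""
    target = num_errors(rho, X, y)
    for resolution in range(1, max_resolution + 1):
        lam = round_to_resolution(rho, resolution)
        if num_errors(lam, X, y) == target:
            return resolution, lam
    return None


# Toy baseline classifier that separates four examples perfectly.
rho = [0.9, -0.35]
X = [[1, 2], [1, 3], [2, 1], [0.5, 1]]
y = [1, -1, 1, 1]
result = smallest_matching_resolution(rho, X, y)
```

On this toy data the coarse roundings misclassify the examples with small margins, and the error count only matches the baseline once the resolution is large enough to preserve the ratio between the two coefficients.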
Corollary 1
(kth margin resolution bound) Let \(\varvec{\rho }= [\rho _1,\ldots ,\rho _P]^T \in \mathbb {R}^P\) denote the coefficients of a linear classifier trained with data \(\mathcal {D}_N = (\varvec{x}_i,y_i)_{i=1}^N\). Let \(\gamma _{(k)}\) denote the value of the kth smallest margin, \(\mathcal {I}_{(k)}\) denote the set of training examples with \(\frac{\varvec{\rho }^T\varvec{x}_i}{\left\| \varvec{\rho }\right\| _2} \le \gamma _{(k)}\), and \(X_{(k)} = \max _{i \not \in \mathcal {I}_{(k)}} \Vert \varvec{x}_i\Vert _2\) denote the largest magnitude of any training example \(\varvec{x}_i \in \mathcal {D}_N\) for \(i \not \in \mathcal {I}_{(k)}\).
Proof
The proof follows by applying Theorem 1 to the reduced dataset \(\mathcal {D}_N \backslash \mathcal {I}_{(k)}\). \(\square \)
We have now shown that good discretized solutions exist and can be constructed easily. This suggests that optimal discretized solutions, which by definition are at least as good as rounded solutions, will also be good relative to the best non-discretized solution.
4.2 Generalization bounds on testing accuracy
According to the principle of structural risk minimization (Vapnik 1998), fitting a classifier from a simpler class of models may lead to an improved guarantee on predictive accuracy. Consider training a classifier \(f:\mathcal {X}\rightarrow \mathcal {Y}\) with data \(\mathcal {D}_N = (\varvec{x}_i,y_i)_{i=1}^N\), where \(\varvec{x}_i \in \mathcal {X}\subseteq \mathbb {R}^P\) and \(y_i \in \mathcal {Y}= \{-1,1\}\). In what follows, we provide uniform generalization guarantees on the predictive accuracy of all functions, \(f \in \mathcal {F}\). These guarantees bound the true risk \(R^\text {true}(f) = \mathbb {E}_{\mathcal {X},\mathcal {Y}} \mathbbm {1}\left[ f(\varvec{x}) \ne y\right] \) by the empirical risk \(R^\text {emp}(f) = \frac{1}{N}\sum _{i=1}^N \mathbbm {1}\left[ f(\varvec{x}_i) \ne y_i\right] \) and other quantities important to the learning process.
Theorem 2
A proof of Theorem 2 can be found in Section 3.4 of Bousquet et al. (2004). The result that more restrictive hypothesis spaces can lead to better generalization provides motivation for using discrete models without necessarily expecting a loss in predictive accuracy. The bound indicates that we can include more coefficients in the set \(\mathcal {L}\) as the amount of data N increases.
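The interplay between the size of \(\mathcal {L}\) and N can be explored numerically. The sketch below assumes the standard bound for a finite hypothesis class, in which the gap between true and empirical risk is at most \(\sqrt{(\log |\mathcal {L}| + \log (1/\delta ))/(2N)}\) with probability at least \(1-\delta \), and assumes that \(\mathcal {L}\) contains all integer vectors with entries in \(\{-\varLambda ,\ldots ,\varLambda \}\), so that \(|\mathcal {L}| = (2\varLambda +1)^P\). Both are assumptions of this example rather than a restatement of Theorem 2, and the function names are hypothetical.

```python
import math


def coefficient_set_size(num_features, resolution):
    """Number of integer vectors with entries in {-resolution, ..., resolution}."""
    return (2 * resolution + 1) ** num_features


def finite_class_slack(num_examples, num_features, resolution, delta=0.05):
    """Slack term sqrt((log|L| + log(1/delta)) / (2N)) of the assumed bound."""
    log_size = num_features * math.log(2 * resolution + 1)
    return math.sqrt((log_size + math.log(1.0 / delta)) / (2.0 * num_examples))


# The slack shrinks as N grows and grows with the size of the coefficient set.
slack_small_n = finite_class_slack(num_examples=500, num_features=10, resolution=10)
slack_large_n = finite_class_slack(num_examples=5000, num_features=10, resolution=10)
slack_coarse = finite_class_slack(num_examples=5000, num_features=10, resolution=2)
```

Under these assumptions, a tenfold increase in N leaves room for a much larger coefficient set at the same slack, which is the qualitative behavior described above.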
In Theorem 3, we improve the generalization bound from Theorem 2 by exploiting the fact that we can bound the number of nonzero coefficients in a SLIM scoring system in terms of the value of \(C_0\).
Theorem 3
Proof
See Appendix. \(\square \)
This theorem relates the tradeoff parameter \(C_0\) in the SLIM objective to the generalization of SLIM scoring systems. It indicates that increasing the value of the \(C_0\) parameter will produce a model with better generalization properties.
In Theorem 4, we produce a better generalization bound by exploiting the fact that SLIM scoring systems use coprime integer coefficients (see Remark 1). In particular, we express the generalization bound from Theorem 2 using the P-dimensional Farey points of level \(\varLambda \) (see Marklof 2012, for a definition).
Theorem 4
The proof involves a counting argument over coprime integer vectors, using the definition of Farey points from number theory.
5 Data reduction
Data reduction is a technique that can decrease the computation associated with training a supervised classification model by discarding redundant training data. This technique can be applied to any supervised classification method where the training procedure is carried out by solving an optimization problem. However, it is best suited for methods such as SLIM, where the underlying optimization problem may be difficult to solve for large instances. In this section, we first describe how data reduction works in a general setting, and then show how it can be applied to SLIM.
5.1 Data reduction for optimization-based supervised classification
Data reduction aims to decrease the computation associated with solving the original problem by removing redundant examples from \(\mathcal {D}_N = (\varvec{x}_i,y_i)_{i=1}^N\) (i.e., data points that can be safely discarded without changing the optimal solution to (6)). The technique requires users to specify a surrogate problem that is considerably easier to solve. Given the initial training data \(\mathcal {D}_N = (\varvec{x}_i,y_i)_{i=1}^N\), and the surrogate problem, data reduction solves \(N+1\) variants of the surrogate problem to identify redundant examples. These examples are then removed from the initial training data to leave behind a subset of reduced training data \(\mathcal {D}_M \subseteq \mathcal {D}_N\) that is guaranteed to yield the same optimal classifier as \(\mathcal {D}_N\). Thus, the computational gain from data reduction comes from training a model with \(\mathcal {D}_M\) (i.e., solving an instance of the original problem with \(N-M\) fewer examples).
Theorem 5
Proof
See Appendix.
Theorem 6
 I. Upper bound on the 0–1 loss: \(Z_{01}\left( \varvec{\lambda }\right) \le Z_{\psi }\left( \varvec{\lambda }\right) \)
 II. Lipschitz near \(\varvec{\lambda }^{*}_{01}\): \(\Vert \varvec{\lambda }- \varvec{\lambda }^{*}_{\psi }\Vert < A \implies Z_{\psi }\left( \varvec{\lambda }\right) - Z_{\psi }\left( \varvec{\lambda }^{*}_{\psi }\right) < L \Vert \varvec{\lambda }- \varvec{\lambda }^{*}_{\psi }\Vert \)
 III. Curvature near \(\varvec{\lambda }^{*}_{\psi }\): \(\Vert \varvec{\lambda }- \varvec{\lambda }^{*}_{\psi }\Vert > C_{\varvec{\lambda }} \implies Z_{\psi }\left( \varvec{\lambda }\right) - Z_{\psi }\left( \varvec{\lambda }^{*}_{\psi }\right) > C_{\psi }\)
 IV. Closeness of loss near \(\varvec{\lambda }^{*}_{01}\): \(\left| Z_{\psi }\left( \varvec{\lambda }^{*}_{01}\right) - Z_{01}\left( \varvec{\lambda }^{*}_{01}\right) \right| < \varepsilon \)
Proof
See Appendix. \(\square \)
5.2 Off-the-shelf data reduction for SLIM
Data reduction can easily be applied to SLIM by using an off-the-shelf approach where we use the LP relaxation of the SLIM IP as the surrogate problem.
5.3 Discussion and related work
Data reduction is a technique that can be applied to a wide range of supervised classification methods, including methods that minimize the 0–1 loss function.
Data reduction is fundamentally different from many techniques for improving the scalability of 0–1 loss minimization, such as oscillation heuristics (Rubin 1990), decomposition (Rubin 1997), cutting planes (Brooks 2011) and specialized branch-and-bound algorithms (Nguyen and Sanner 2013). In fact, data reduction is most similar to the scenario reduction methods in the stochastic programming literature (see e.g., Dupačová et al. 2000, 2003). In comparison to scenario reduction techniques, data reduction does not require the objective function to satisfy stability or convexity assumptions, and is designed to recover the true optimal solution as opposed to an \(\varepsilon \)-optimal solution.
Data reduction has the advantage that it can easily be applied to SLIM by using SLIM’s LP relaxation as the surrogate problem. This off-the-shelf approach may be used as a preliminary procedure before the training process, or as an iterative procedure that is called by the IP solver during the training process as feasible solutions are found. Off-the-shelf data reduction makes use of the integrality gap to identify examples that have to be classified a certain way. Accordingly, the effectiveness of this approach may be improved using techniques that narrow the integrality gap, specifically by using higher quality feasible solutions and/or strengthening SLIM’s LP relaxation (e.g., by using the cutting planes of Brooks 2011).
6 Application to sleep apnea screening
In this section, we discuss a collaboration with the MGH Sleep Laboratory where we used SLIM to create a scoring system for sleep apnea screening (see also Ustun et al. 2015). Our goal is to highlight the flexibility and performance of our approach on a real-world problem that requires a tailored prediction model.
6.1 Data and operational constraints
The dataset for this application contains \(N = 1922\) records of patients and \(P = 33\) binary features related to their health and sleep habits. Here, \(y_i=+1\) if patient i has obstructive sleep apnea (OSA), and there is significant class imbalance as Pr\((y_i=+1) = 76.9\,\%\).
The model had to satisfy three operational constraints:
 1. Limited FPR: The model had to achieve the highest possible true positive rate (TPR) while maintaining a maximum false positive rate (FPR) of 20 %. This would ensure that the model could diagnose as many cases of sleep apnea as possible but limit the number of faulty diagnoses.
 2. Limited model size: The model had to be transparent and use at most 5 features. This would ensure that the model could be explained and understood by other physicians in a short period of time.
 3. Sign constraints: The model had to obey established relationships between well-known risk factors and the incidence of sleep apnea (e.g. it could not suggest that a patient with hypertension had a lower risk of sleep apnea since hypertension is a positive risk factor for sleep apnea).
6.2 Training setup and model selection

We added a loss constraint using the loss variables to limit the maximum FPR at 20 %. We then set \(W^{+}= N^{-}/(1+N^{-})\) to guarantee that the optimization would yield a classifier with the highest possible TPR with an FPR of at most 20 % (see Sect. 3.1).

We added a feature-based constraint using the indicator variables that encode the \(\ell _0\)-norm to limit the maximum number of features to 5 (see Sect. 3.2). We then set \(C_0 = 0.9W^{-}/NP\) so that the optimization would yield a classifier that did not sacrifice accuracy for sparsity (see Remark 3).

We added sign constraints to the coefficients to ensure that our model would not violate established relationships between features and the predicted outcome (i.e., we set \(\lambda _j \ge 0\) if there had to be a positive relationship, and \(\lambda _j \le 0\) if there had to be a negative relationship).
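The three requirements can also be verified after training, for any fitted linear scoring system. The sketch below is a hypothetical checker, not part of SLIM's training procedure: it assumes that the first column of `X` is an intercept that does not count toward model size, that both classes appear in the data, and that `signs` maps a coefficient index to `+1` or `-1`.

```python
def rates(lam, X, y):
    """Return (TPR, FPR) of the rule: predict +1 when lam . x > 0."""
    tp = fp = pos = neg = 0
    for x_i, y_i in zip(X, y):
        score = sum(l * v for l, v in zip(lam, x_i))
        predicted_positive = score > 0
        if y_i == 1:
            pos += 1
            tp += predicted_positive
        else:
            neg += 1
            fp += predicted_positive
    return tp / pos, fp / neg


def satisfies_constraints(lam, X, y, max_fpr=0.20, max_size=5, signs=None):
    """Check the FPR limit, the model size limit, and the sign constraints."""
    _, fpr = rates(lam, X, y)
    model_size = sum(1 for l in lam[1:] if l != 0)  # intercept excluded (assumption)
    signs_ok = all(sign * lam[j] >= 0 for j, sign in (signs or {}).items())
    return fpr <= max_fpr and model_size <= max_size and signs_ok


# Toy data: columns are (intercept, feature 1, feature 2).
X = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0]]
y = [1, 1, -1, -1, -1]
lam = [-1, 2, 0]
tpr, fpr = rates(lam, X, y)
```

A check of this kind is what the baseline methods in the comparison need to run over every model in their tuning grid, whereas SLIM enforces the same requirements inside the optimization problem.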
Training setup for all methods

| Method | Controls | # Instances | Settings and free parameters |
| --- | --- | --- | --- |
| CART | Max FPR, Model Size | 39 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) |
| C5.0T | Max FPR | 39 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) |
| C5.0R | Max FPR, Model Size | 39 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) |
| Lasso | Max FPR, Model Size, Signs | 39,000 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) \(\times \) 1000 values of \(\lambda \) chosen by glmnet |
| Ridge | Max FPR, Signs | 39,000 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) \(\times \) 1000 values of \(\lambda \) chosen by glmnet |
| Elastic Net | Max FPR, Model Size, Signs | 975,000 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) \(\times \) 1000 values of \(\lambda \) chosen by glmnet \(\times \) 19 values of \(\alpha \in \{0.05,0.10,\ldots ,0.95\}\) |
| SVM Lin. | Max FPR | 975 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) \(\times \) 25 values of \(C \in \{10^{-3},10^{-2.75},\ldots ,10^{3}\}\) |
| SVM RBF | Max FPR | 975 | 39 values of \(W^{+}\in \{0.025,0.05,\ldots ,0.975\}\) \(\times \) 25 values of \(C \in \{10^{-3},10^{-2.75},\ldots ,10^{3}\}\) |
| SLIM | Max FPR, Model Size, Signs | 1 | \(W^{+}= N^{-}/(1+N^{-})\), \(C_0 = 0.9W^{-}/NP\), \(\lambda _0 \in \{-100,\ldots ,100\}\), \(\lambda _j \in \{-10,\ldots ,10\}\) |
6.3 Results and observations
TPR, FPR and model size for all methods
Objective  Constraints  Other information  

Method  Constraints satisfied  Test TPR (%)  Test FPR (%)  Final Model Size  Model Size  Train TPR (%)  Train FPR (%)  Final Train TPR (%)  Final Train FPR (%) 
SLIM  All  61.4  20.9  5  5  62.4  19.7  62.0  19.6 
55.5–68.8  15.0–30.4  –  5–5  61.0–64.2  19.3–20.0  –  –  
Lasso  All  29.3  8.6  3  3  28.7  7.2  22.1  3.8 
19.2–60.0  0.0–33.3  –  3–6  21.4–54.6  3.5–20.5  –  –  
Elastic Net  All  44.2  18.8  3  3  45.6  17.4  54.3  20.7 
0.0–64.1  0.0–37.0  –  3–6  0.0–66.5  0.0–36.4  –  –  
Ridge  Max FPR  66.0  20.6  30  30  66.4  18.9  66.0  18.9 
60.5–68.5  8.6–32.6  –  30–30  64.0–68.9  17.3–21.5  –  –  
SVM RBF  Max FPR  64.3  20.8  33  33  67.9  12.2  67.8  12.4 
59.2–71.1  10.0–30.4  –  33–33  66.5–70.0  11.1–13.3  –  –  
SVM Lin.  Max FPR  62.7  19.8  31  31  63.7  17.0  63.1  17.1 
57.9–69.0  7.5–28.6  –  31–31  61.5–66.1  15.6–18.5  –  –  
C5.0R  None  84.0  43.0  26  23  86.1  33.8  85.5  32.9 
78.9–87.7  32.6–54.2  –  18–30  84.2–88.5  30.9–38.2  –  –  
C5.0T  None  81.3  42.9  39  42  85.3  29.5  84.5  28.4 
77.4–84.8  29.6–62.5  –  39–50  82.6–88.6  24.6–33.7  –  –  
CART  None  93.0  70.4  8  9  95.2  66.8  95.9  73.9 
88.8–96.1  61.1–83.3  –  4–12  93.1–97.2  55.0–76.0  –  – 
6.3.1 On the difficulties of handling operational constraints

Many methods cannot handle simple operational constraints that are crucial for models to be used and accepted. Implementations of popular classification methods do not have a mechanism to adjust important model qualities. That is, there is no mechanism to control sparsity in C5.0T (Kuhn et al. 2012) and no mechanism to incorporate sign constraints in SVM (Meyer et al. 2012). Finding a method with suitable controls is especially difficult when a model has to satisfy multiple operational constraints.

Many methods lack controls that are easy-to-use and/or that work correctly. When a method has suitable controls to handle operational constraints, producing a model often requires a tuning process over a high-dimensional free parameter grid. Even after extensive tuning, however, it is possible to never find a model that satisfies all operational constraints (e.g. CART, C5.0R, C5.0T for the Max FPR constraint in Fig. 4).

Many methods do not allow tuning to be portable when the training set changes. Consider a standard K-CV model selection procedure where we choose free parameters to maximize predictive accuracy. To adapt this procedure to a problem with operational constraints, we would train models on K validation folds for each instance of the free parameters, choose an instance of the free parameters that maximizes the mean K-CV test accuracy among the instances that satisfied all operational constraints, and train a “final model” for this instance. Unfortunately, there is no guarantee that the final model will obey all operational constraints.
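The constrained K-CV selection step described above can be sketched as follows. The data layout (`results` as a dictionary of per-fold records), the key names, and the rule that an instance is feasible only if every fold model satisfies the constraints on its training data are assumptions made for illustration; other readings of "satisfied all operational constraints" are possible.

```python
from statistics import mean


def select_instance(results, max_fpr=0.20, max_size=5):
    """Pick the feasible free-parameter instance with the best mean test accuracy.

    results maps an instance label to a list of per-fold dicts with keys
    'test_acc', 'train_fpr' and 'size'. Returns None if no instance is feasible.
    """
    best, best_acc = None, float("-inf")
    for instance, folds in results.items():
        # Assumed feasibility rule: every fold model meets every constraint.
        feasible = all(
            f["train_fpr"] <= max_fpr and f["size"] <= max_size for f in folds
        )
        if not feasible:
            continue
        acc = mean(f["test_acc"] for f in folds)
        if acc > best_acc:
            best, best_acc = instance, acc
    return best
```

Note that the selected instance only fixes the free parameters. The final model trained on the full dataset with those parameters still has to be checked against the constraints, which is the portability problem discussed above.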
Percentage of instances that fulfilled operational constraints
% of Instances  

Method  Max FPR  Max FPR & Model Size  Max FPR, Model Size & Signs 
SLIM  100.0  100.0  100.0 
Lasso  19.6  4.8  4.8 
Elastic Net  18.3  1.0  1.0 
Ridge  20.9  0.0  0.0 
SVM Lin  18.7  0.0  0.0 
SVM RBF  15.8  0.0  0.0 
C5.0R  0.0  0.0  0.0 
C5.0T  0.0  0.0  0.0 
CART  0.0  0.0  0.0 
6.3.2 On the sensitivity of acceptable models
6.3.3 On the usability and interpretability of acceptable models
Datasets used in the numerical experiments
Dataset  Source  N  P  Classification task 

adult  Kohavi (1996)  32561  36  Predict if a U.S. resident earns more than $50,000 
breastcancer  Mangasarian et al. (1995)  683  9  Detect breast cancer using a biopsy 
bankruptcy  Kim and Han (2003)  250  6  Predict if a firm will go bankrupt 
haberman  Haberman (1976)  306  3  Predict 5-year survival after breast cancer surgery 
heart  Detrano et al. (1989)  303  32  Identify patients at a high risk of heart disease 
mammo  Elter et al. (2007)  961  12  Detect breast cancer using a mammogram 
mushroom  Schlimmer (1987)  8124  113  Predict if a mushroom is poisonous 
spambase  Cranor and LaMacchia (1998)  4601  57  Predict if an email is spam 
Our results highlight some of the unique interpretability benefits of SLIM scoring systems, that is, their ability to provide “a qualitative understanding of the relationship between joint values of the input variables and the resulting predicted response value” (Hastie et al. 2009). SLIM scoring systems are well-suited to provide this kind of qualitative understanding due to their high level of sparsity and small integer coefficients. These qualities help users gauge the influence of each input variable with respect to the others, which is especially important because humans can only handle a few cognitive entities at once (\(7\pm 2\) according to Miller 1984), and are seriously limited in estimating the association between three or more variables (Jennings et al. 1982). Sparsity and small integer coefficients also allow users to make quick predictions without a computer or a calculator, which may help them understand how the model works by actively using it to classify prototypical examples. Here, this process helped our collaborators come up with the following simple rule-based explanation for when our model predicted that a patient has OSA (i.e., when SCORE \(>\) 1): “if the patient is male, predict OSA if age \(\ge \) 60 OR hypertension OR bmi \(\ge \) 30; if the patient is female, predict OSA if bmi \(\ge \) 40 AND (age \(\ge \) 60 OR hypertension).”
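The quoted explanation can be written down directly as a small function. The sketch below encodes only the rule as stated in the text; it does not reproduce the underlying scoring system or its coefficients, and the function and argument names are illustrative.

```python
def predicts_osa(male, age, hypertension, bmi):
    """Return True if the rule-based explanation predicts OSA for this patient."""
    if male:
        # Male: age >= 60 OR hypertension OR bmi >= 30
        return bool(age >= 60 or hypertension or bmi >= 30)
    # Female: bmi >= 40 AND (age >= 60 OR hypertension)
    return bool(bmi >= 40 and (age >= 60 or hypertension))
```

Classifying a few prototypical patients by hand with this rule (for example, a 45-year-old male with a BMI of 31 and no hypertension) is the kind of exercise that helped our collaborators understand the model.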
7 Numerical experiments
In this section, we present numerical experiments to compare the accuracy and sparsity of SLIM scoring systems to other popular classification models. Our goal is to illustrate the off-the-shelf performance of SLIM and show that we can train accurate scoring systems for real-sized datasets in minutes.
7.1 Experimental setup
Datasets: We ran numerical experiments on 8 datasets from the UCI Machine Learning Repository (Bache and Lichman 2013) summarized in Table 4. We chose these datasets to explore the performance of each method as we varied the size and nature of the training data. We processed each dataset by binarizing all categorical features and some real-valued features. For the purposes of reproducibility, we include all processed datasets in Online Resource 1.
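As a minimal sketch of what binarizing a categorical feature can look like, the function below replaces one categorical column with a 0/1 column per observed level. The row-of-dictionaries layout, the `column=level` naming, and the decision to keep every level (rather than drop a reference level) are assumptions for illustration; the processed datasets in Online Resource 1 remain the authoritative version.

```python
def binarize_categorical(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    out = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}={level}"] = int(row[column] == level)
        out.append(new_row)
    return out
```

Indicators produced this way are mutually exclusive within a feature: exactly one of them equals 1 for any given example.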
Training setup for classification methods used for the numerical experiments
Method  Acronym  Software  Settings and free parameters 

CART Decision Trees  CART  rpart (Therneau et al. 2012)  default settings 
C5.0 Decision Trees  C5.0T  c50 (Kuhn et al. 2012)  default settings 
C5.0 Rule List  C5.0R  c50 (Kuhn et al. 2012)  default settings 
Log. Reg. + \(\ell _1\) penalty  Lasso  glmnet (Friedman et al. 2010)  1000 values of \(\lambda \) chosen by glmnet 
Log. Reg. + \(\ell _2\) penalty  Ridge  glmnet (Friedman et al. 2010)  1000 values of \(\lambda \) chosen by glmnet 
Log. Reg. + \(\ell _1/\ell _2\) penalty  Elastic Net  glmnet (Friedman et al. 2010)  1000 values of \(\lambda \) chosen by glmnet \(\times \) 19 values of \(\alpha \in \{0.05,0.10,\ldots ,0.95\}\) 
SVM + Linear Kernel  SVM Lin.  e1071 (Meyer et al. 2012)  25 values of \(C \in \{10^{-3},10^{-2.75},\ldots ,10^{3}\}\) 
SVM + RBF Kernel  SVM RBF  e1071 (Meyer et al. 2012)  25 values of \(C \in \{10^{-3},10^{-2.75},\ldots ,10^{3}\}\) 
SLIM Scoring Systems  SLIM  CPLEX 12.6.0.0  6 values of \(C_0 \in \{0.01, 0.075, 0.05, 0.025, 0.001, 0.9/NP\}\) with \(\lambda _j \in \{-10,\ldots ,10\}\); \(\lambda _0 \in \{-100,\ldots ,100\}\) 
Accuracy and sparsity of all methods on all datasets
Dataset  Details  Metric  SLIM  Lasso  Ridge  Elastic Net  C5.0R  C5.0T  CART  SVM Lin.  SVM RBF  

adult  N  32561  Test Error  17.4 \(\pm \) 1.4 %  17.3 \(\pm \) 0.9 %  17.6 \(\pm \) 0.9 %  17.4 \(\pm \) 0.9 %  26.4 \(\pm \) 1.8 %  26.3 \(\pm \) 1.4 %  75.9 \(\pm \) 0.0 %  16.8 \(\pm \) 0.8 %  16.3 \(\pm \) 0.5 % 
P  37  Train Error  17.5 \(\pm \) 1.2 %  17.2 \(\pm \) 0.1 %  17.6 \(\pm \) 0.1 %  17.4 \(\pm \) 0.1 %  25.3 \(\pm \) 0.4 %  24.9 \(\pm \) 0.4 %  75.9 \(\pm \) 0.0 %  16.7 \(\pm \) 0.1 %  16.3 \(\pm \) 0.1 %  
Pr(\(y=+1\))  24 %  Model Size  18  14  36  17  41  87  4  36  36  
Pr(\(y=-1\))  76 %  Model Range  7–26  13–14  36–36  16–18  38–46  78–99  4–4  36–36  36–36  
breastcancer  N  683  Test Error  3.4 \(\pm \) 2.0 %  3.4 \(\pm \) 2.2 %  3.4 \(\pm \) 2.0 %  3.1 \(\pm \) 2.1 %  4.3 \(\pm \) 3.3 %  5.3 \(\pm \) 3.4 %  5.6 \(\pm \) 1.9 %  3.1 \(\pm \) 2.0 %  3.5 \(\pm \) 2.5 % 
P  10  Train Error  3.2 \(\pm \) 0.2 %  2.9 \(\pm \) 0.3 %  3.0 \(\pm \) 0.3 %  2.8 \(\pm \) 0.3 %  2.1 \(\pm \) 0.3 %  1.6 \(\pm \) 0.4 %  3.6 \(\pm \) 0.3 %  2.7 \(\pm \) 0.2 %  0.3 \(\pm \) 0.1 %  
Pr(\(y=+1\))  35 %  Model Size  2  9  9  9  8  13  7  9  9  
Pr(\(y=-1\))  65 %  Model Range  2–2  8–9  9–9  9–9  6–9  7–16  3–7  9–9  9–9  
bankruptcy  N  250  Test Error  0.8 \(\pm \) 1.7 %  0.0 \(\pm \) 0.0 %  0.4 \(\pm \) 1.3 %  0.0 \(\pm \) 0.0 %  0.8 \(\pm \) 1.7 %  0.8 \(\pm \) 1.7 %  1.6 \(\pm \) 2.8 %  0.4 \(\pm \) 1.3 %  0.4 \(\pm \) 1.3 % 
P  7  Train Error  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  0.4 \(\pm \) 0.1 %  0.4 \(\pm \) 0.7 %  0.4 \(\pm \) 0.2 %  0.4 \(\pm \) 0.2 %  1.6 \(\pm \) 0.3 %  0.4 \(\pm \) 0.1 %  0.4 \(\pm \) 0.1 %  
Pr(\(y=+1\))  57 %  Model Size  3  3  6  3  4  4  2  6  6  
Pr(\(y=-1\))  43 %  Model Range  2–3  3–3  6–6  3–3  4–4  4–4  2–2  6–6  6–6  
haberman  N  306  Test Error  29.2 \(\pm \) 14.0 %  42.5 \(\pm \) 11.3 %  36.9 \(\pm \) 15.0 %  40.9 \(\pm \) 14.0 %  42.7 \(\pm \) 9.4 %  42.7 \(\pm \) 9.4 %  43.1 \(\pm \) 8.0 %  45.3 \(\pm \) 14.7 %  47.5 \(\pm \) 6.2 % 
P  4  Train Error  29.7 \(\pm \) 1.5 %  40.6 \(\pm \) 1.9 %  41.0 \(\pm \) 9.7 %  45.1 \(\pm \) 12.0 %  40.4 \(\pm \) 8.5 %  40.4 \(\pm \) 8.5 %  34.3 \(\pm \) 2.8 %  46.0 \(\pm \) 3.6 %  5.4 \(\pm \) 1.5 %  
Pr(\(y=+1\))  74 %  Model Size  3  2  3  1  3  3  9  3  4  
Pr(\(y=-1\))  26 %  Model Range  2–3  2–2  3–3  1–1  0–3  1–3  4–9  3–3  4–4  
mammo  N  961  Test Error  19.5 \(\pm \) 3.0 %  19.0 \(\pm \) 3.1 %  19.2 \(\pm \) 3.0 %  19.0 \(\pm \) 3.1 %  20.5 \(\pm \) 3.3 %  20.3 \(\pm \) 3.5 %  20.7 \(\pm \) 3.9 %  20.3 \(\pm \) 3.0 %  19.1 \(\pm \) 3.1 % 
P  15  Train Error  18.3 \(\pm \) 0.3 %  19.3 \(\pm \) 0.3 %  19.2 \(\pm \) 0.4 %  19.2 \(\pm \) 0.3 %  19.8 \(\pm \) 0.3 %  19.9 \(\pm \) 0.3 %  20.0 \(\pm \) 0.6 %  20.3 \(\pm \) 0.4 %  18.2 \(\pm \) 0.4 %  
Pr(\(y=+1\))  46 %  Model Size  9  13  14  14  5  5  5  14  14  
Pr(\(y=-1\))  54 %  Model Range  9–11  12–13  14–14  13–14  3–5  4–6  3–5  14–14  14–14  
heart  N  303  Test Error  16.5 \(\pm \) 7.8 %  15.2 \(\pm \) 6.3 %  14.9 \(\pm \) 5.9 %  14.5 \(\pm \) 5.9 %  21.2 \(\pm \) 7.5 %  23.2 \(\pm \) 6.8 %  19.8 \(\pm \) 6.5 %  15.5 \(\pm \) 6.5 %  15.2 \(\pm \) 6.0 % 
P  33  Train Error  14.9 \(\pm \) 1.1 %  14.0 \(\pm \) 1.0 %  13.1 \(\pm \) 0.8 %  13.2 \(\pm \) 0.6 %  10.0 \(\pm \) 1.8 %  8.5 \(\pm \) 2.0 %  14.3 \(\pm \) 0.9 %  13.6 \(\pm \) 0.5 %  10.4 \(\pm \) 0.8 %  
Pr(\(y=+1\))  46 %  Model Size  3  12  32  24  7  16  6  31  32  
Pr(\(y=-1\))  54 %  Model Range  3–3  10–13  30–32  22–27  9–17  12–27  6–8  28–32  32–32  
mushroom  N  8124  Test Error  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  1.7 \(\pm \) 0.3 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  1.2 \(\pm \) 0.6 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 % 
P  114  Train Error  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  1.7 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  1.1 \(\pm \) 0.3 %  0.0 \(\pm \) 0.0 %  0.0 \(\pm \) 0.0 %  
Pr(\(y=+1\))  48 %  Model Size  7  21  113  108  8  9  7  98  113  
Pr(\(y=-1\))  52 %  Model Range  7–7  19–23  113–113  106–108  8–8  9–9  6–8  98–108  113–113  
spambase  N  4601  Test Error  6.3 \(\pm \) 1.2 %  10.0 \(\pm \) 1.7 %  26.3 \(\pm \) 1.7 %  10.0 \(\pm \) 1.7 %  6.6 \(\pm \) 1.3 %  7.3 \(\pm \) 1.0 %  11.1 \(\pm \) 1.4 %  7.8 \(\pm \) 1.5 %  13.7 \(\pm \) 1.4 % 
P  58  Train Error  5.7 \(\pm \) 0.3 %  9.5 \(\pm \) 0.3 %  26.1 \(\pm \) 0.2 %  9.6 \(\pm \) 0.2 %  4.2 \(\pm \) 0.3 %  3.9 \(\pm \) 0.3 %  9.8 \(\pm \) 0.3 %  8.1 \(\pm \) 0.8 %  1.3 \(\pm \) 0.1 %  
Pr(\(y=+1\))  39 %  Model Size  34  28  57  28  29  73  7  57  57  
Pr(\(y=-1\))  61 %  Model Range  28–40  28–29  57–57  28–29  23–31  56–78  6–10  57–57  57–57 
7.2 Results and observations
We summarize the results of our experiments in Table 6 and Figs. 13, 14. We report the sparsity of models using a metric that we call model size. Model size represents the number of coefficients for linear models (Lasso, Ridge, Elastic Net, SLIM, SVM Lin.), the number of leaves for decision tree models (C5.0T, CART), and the number of rules for rule-based models (C5.0R). For completeness, we set the model size for black-box models (SVM RBF) to the number of features in each dataset.
We show the accuracy and sparsity of all methods on all datasets in Figs. 13 and 14. For each dataset, and each method, we plot a point at the 10-CV mean test error and final model size, and surround this point with an error bar whose height corresponds to the 10-CV standard deviation in test error. In addition, we include \(\ell _0\)-regularization paths for SLIM and Lasso on the right side of Figs. 13 and 14 to show how the test error varies at different levels of sparsity for sparse linear models.
7.2.1 On accuracy, sparsity, and computation
Our results show that many methods are unable to produce models that attain the same levels of accuracy and sparsity as SLIM. As shown in Figs. 13 and 14, SLIM always produces a model that is more accurate than Lasso at some level of sparsity, and sometimes more accurate at all levels of sparsity (e.g., spambase, haberman, mushroom, breastcancer). Although the optimization problems we solved to train SLIM scoring systems were \(\mathcal {NP}\)-hard, we did not find any evidence that computational issues hurt the performance of SLIM on any of the datasets. We obtained accurate and sparse models for all datasets in 10 minutes using CPLEX 12.6. Further, the solver provided a proof of optimality (i.e. a relative MIPGAP of 0.0 %) for all scoring systems we trained for mammo, mushroom, bankruptcy, and breastcancer.
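To make the optimality statement concrete, the sketch below computes a relative MIP gap from the incumbent objective value and the best bound. The formula is the one we understand the CPLEX documentation to give for its relative gap; the function and argument names are illustrative and are not part of any solver API.

```python
def relative_mip_gap(best_integer, best_bound):
    """Relative gap between the incumbent objective and the best known bound."""
    return abs(best_bound - best_integer) / (1e-10 + abs(best_integer))
```

A gap of 0.0 means the bound has met the incumbent, so no feasible solution can have a better objective value. A positive gap means the incumbent may still be optimal, but the solver has not yet proven it.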
7.2.2 On the regularization effect of discrete coefficients
We expect that methods that directly optimize accuracy and sparsity will achieve the best possible accuracy at every level of sparsity (i.e. the best possible tradeoff between accuracy and sparsity). SLIM directly optimizes accuracy and sparsity. However, it may not necessarily achieve the best possible accuracy at each level of sparsity because it restricts coefficients to a finite discrete set \(\mathcal {L}\).
By comparing SLIM to Lasso, we can identify a baseline regularization effect due to this \(\mathcal {L}\) set restriction. In particular, we know that when Lasso’s performance dominates that of SLIM, it is arguably due to the use of a small set of discrete coefficients. Our results show that this tends to happen mainly at large model sizes (see e.g., the regularization path for breastcancer, heart, mammo). This suggests that the \(\mathcal {L}\) set restriction has a more noticeable impact on accuracy at larger model sizes.
7.2.3 On interpretability
In this case, the SLIM scoring system uses 7 integer coefficients. However, it can be expressed as a 5-line scoring system due to the fact that odor=none, odor=almond, and odor=anise are mutually exclusive variables that all use the same coefficient. The model benefits from the fact that it not only allows users to make predictions by hand but also uses a linear form that helps users to gauge the influence of each input variable with respect to the others. We note that only some of these qualities are found in other models. The Lasso model, for instance, has a linear form but uses far more features. Similarly, the C5.0 models allow users to make predictions by hand, but use a hierarchical structure that makes it difficult to assess the influence of each input variable with respect to the others (see Freitas 2014, for a discussion).
We believe that these qualities represent “baseline” interpretability benefits of SLIM scoring systems. In practice, interpretability is a subjective and multifaceted notion (i.e., it depends on who will be using the model but also depends on multiple model qualities as discussed by Kodratoff 1994; Pazzani 2000; Freitas 2014). In light of this fact, SLIM has the unique benefit that it allows practitioners to work closely with the target audience and directly encode all interpretability requirements by means of operational constraints.
8 Specialized models
In this section, we present three specialized models related to SLIM. These models are all special instances of the optimization problem in (1).
8.1 Personalized models
Here, the loss constraints and Big-M parameters in (12a) are identical to those from the SLIM IP formulation in Sect. 2. The \(u_{j,k,r}\) are binary indicator variables that are set to 1 if \(\lambda _j\) is equal to \(l_{k,r}\). Constraints (12d) ensure that each coefficient uses exactly one value from one interpretability set. Constraints (12c) ensure that each coefficient \(\lambda _j\) is assigned a value from the appropriate interpretability set \(\mathcal {L}_r\). Constraints (12b) ensure that each coefficient \(\lambda _j\) is assigned the value specified by the personalized interpretability penalty.
8.2 Rule-based models
SLIM can be extended to produce specialized “rule-based” models when the training data are composed of binary rules. In general, any real-valued feature can be converted into a binary rule by setting a threshold (e.g., we can convert age into the feature \(age \ge 25 := \mathbbm {1}\left[ age \ge 25\right] \)). The values of the thresholds can be set using domain expertise, rule mining, or discretization techniques (Liu et al. 2002).
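The conversion of a real-valued feature into binary rules can be sketched as follows. The first function applies a single threshold, as in the \(age \ge 25\) example. The second builds a pool of candidate thresholds by placing one threshold at the midpoint between each pair of adjacent distinct values; the midpoint placement is an assumption made for illustration, since any point strictly between two adjacent values yields the same binary rule on the training data.

```python
def threshold_rule(values, threshold):
    """Binary rule 1[x >= threshold], evaluated for each value."""
    return [int(v >= threshold) for v in values]


def candidate_thresholds(values):
    """One threshold between each pair of adjacent distinct values (midpoints)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
```

With N training examples, `candidate_thresholds` returns at most N - 1 thresholds, and fewer when the feature has repeated values.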
8.2.1 M-of-N rule tables
8.2.2 Threshold-rule models
A Threshold-Rule Integer Linear Model (TILM) is a scoring system where the input variables are thresholded versions of the original feature set (i.e. decision stumps). These models are well-suited to problems where the outcome has a nonlinear relationship with real-valued features. As an example, consider the SAPS II scoring system of Le Gall et al. (1993), which assesses the mortality of patients in intensive care using thresholds on real-valued features such as \(blood\_pressure>200\) and \(heart\_rate<40\). TILM optimizes the binarization of real-valued features by using feature selection on a large (potentially exhaustive) pool of binary rules for each real-valued feature. Carrizosa et al. (2010), Belle et al. (2013) and Goh and Rudin (2014) take different but related approaches for constructing classifiers with binary threshold rules.
Here, the loss constraints and Big-M parameters in (14a) are identical to those from the SLIM IP formulation in Sect. 2. Constraints (14b) set the interpretability penalty for each coefficient as \(\varPhi _j = C_f\nu _j + C_t \tau _j + \epsilon \sum _t \beta _{j,t}\). The variables in the interpretability penalty include: \(\nu _j\), which indicate that we use at least one threshold rule from feature j; \(\tau _j\), which count the number of additional binary rules derived from feature j; and \(\beta _{j,t}:= |\lambda _{j,t}|\). The values of \(\nu _j\) and \(\tau _j\) are set using the indicator variables \(\alpha _{j,t} := \mathbbm {1}\left[ \lambda _{j,t}\ne 0\right] \) in constraints (14c) and (14d). Constraints (14e) limit the number of binary rules from feature j to \(\mathbb {R}_{max}\). Constraints (14g) ensure that the coefficients of binary rules derived from feature j agree in sign; these constraints are encoded using the variables \(\delta _j:=\mathbbm {1}\left[ \lambda _{j,t} \ge 0\right] \).
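The interpretability penalty for a single feature can be evaluated outside the IP as a sanity check. The sketch below rests on two assumptions that should be read as such: \(\beta _{j,t}\) is taken to be the absolute value of \(\lambda _{j,t}\), and \(\tau _j\) is taken to be the number of nonzero rules from feature j beyond the first. The function and argument names are illustrative.

```python
def tilm_penalty(coefs, c_f, c_t, eps):
    """Penalty C_f * nu_j + C_t * tau_j + eps * sum_t beta_{j,t} for one feature j.

    coefs holds the coefficients lambda_{j,t} of the threshold rules
    derived from feature j.
    """
    n_used = sum(1 for c in coefs if c != 0)   # sum_t alpha_{j,t}
    nu = 1 if n_used > 0 else 0                # at least one rule from feature j
    tau = max(n_used - 1, 0)                   # additional rules beyond the first
    beta_sum = sum(abs(c) for c in coefs)      # assumed beta_{j,t} = |lambda_{j,t}|
    return c_f * nu + c_t * tau + eps * beta_sum
```

Under this reading, \(C_f\) is paid once for using a feature at all, and \(C_t\) is paid for each extra threshold on that feature, so the two parameters trade off the number of features against the number of thresholds per feature.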
9 Conclusion
In this paper, we introduced a new method for creating data-driven medical scoring systems which we refer to as a Supersparse Linear Integer Model (SLIM). We showed how SLIM can produce scoring systems that are fully optimized for accuracy and sparsity, that can accommodate multiple operational constraints, and that can be trained without parameter tuning.
The major benefits of our approach over existing methods come from the fact that we avoid approximations that are designed to achieve faster computation. Approximations such as surrogate loss functions and \({\ell _{1}}\)-regularization hinder the accuracy and sparsity of models as well as the ability of practitioners to control these qualities. Such approximations are no longer needed for many datasets, since current integer programming software can now train scoring systems for many real-world problems. Integer programming software also caters to practitioners in other ways, by allowing them to choose from a pool of models by mining feasible solutions and to seamlessly benefit from periodic computational improvements without revising their code.
Footnotes
 1.
To illustrate the use of the \({\ell _{1}}\)-penalty, consider a classifier such as \(\hat{y}={\text {sign}}\left( x_1 + x_2\right) \). If the objective in (2) only minimized the 0–1 loss and an \({\ell _{0}}\)-penalty, then \(\hat{y}={\text {sign}}\left( 2 x_1 + 2 x_2\right) \) would have the same objective value as \(\hat{y}={\text {sign}}\left( x_1 + x_2\right) \) because it makes the same predictions and has the same number of nonzero coefficients. Since coefficients are restricted to a finite discrete set, we add a tiny \({\ell _{1}}\)-penalty in the objective of (2) so that SLIM chooses the classifier with the smallest (i.e. coprime) coefficients, \(\hat{y} = {\text {sign}}\left( x_1+x_2\right) \).
 2.
Model size represents the number of coefficients for linear models (Lasso, Ridge, Elastic Net, SLIM, SVM Lin.), the number of leaves for decision tree models (C5.0T, CART), and the number of rules for rule-based models (C5.0R). For completeness, we set the model size for black-box models (SVM RBF) to the number of features in each dataset.
 3.
While there exists an infinite number of thresholds for a real-valued feature, we only need to consider at most \(N-1\) thresholds (i.e. one threshold placed between each pair of adjacent values, \(x_{(i),j}<v_{j,t}<x_{(i+1),j}\)). Using additional thresholds will produce the same set of binary rules and the same rule-based model.
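The tie-breaking argument in footnote 1 can be checked numerically. The sketch below evaluates an objective of the form 0–1 loss plus \(C_0\) times the \(\ell _0\)-penalty plus \(\epsilon \) times the \(\ell _1\)-penalty; treating a point as misclassified when \(y\,\lambda ^T x \le 0\), the toy data, and the parameter values used in the usage note are assumptions made for illustration.

```python
from functools import reduce
from math import gcd


def objective(coefs, X, y, c0, eps):
    """0-1 loss + c0 * (number of nonzero coefs) + eps * (sum of |coefs|)."""
    errors = 0
    for x, label in zip(X, y):
        score = sum(c * xi for c, xi in zip(coefs, x))
        errors += int(label * score <= 0)   # assumed misclassification rule
    l0 = sum(1 for c in coefs if c != 0)
    l1 = sum(abs(c) for c in coefs)
    return errors / len(y) + c0 * l0 + eps * l1


def is_coprime(coefs):
    """True if the greatest common divisor of the coefficients is 1."""
    return reduce(gcd, (abs(c) for c in coefs)) == 1
```

On the points `X = [(1, 1), (-1, -2), (2, -1)]` with labels `y = [1, -1, 1]`, the coefficient vectors `(1, 1)` and `(2, 2)` make identical predictions and have the same number of nonzero coefficients, so with `eps = 0` their objective values coincide. With any `eps > 0`, `(1, 1)` has the strictly smaller objective value, and it is the coprime one of the two.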
Notes
Acknowledgments
We thank the editors and reviewers for valuable comments that helped improve this paper. In addition, we thank Dr. Matt Bianchi and Dr. Brandon Westover at the Massachusetts General Hospital Sleep Clinic for providing us with data used in Sect. 5. We gratefully acknowledge support from Siemens and Wistron.
Supplementary material
References
 Antman, E. M., Cohen, M., Bernink, P. J. L. M., McCabe, C. H., Horacek, T., Papuchis, G., et al. (2000). The TIMI risk score for unstable angina/non-ST elevation MI. The Journal of the American Medical Association, 284(7), 835–842.
 Asparoukhov, O. K., & Stam, A. (1997). Mathematical programming formulations for two-group classification with binary variables. Annals of Operations Research, 74, 89–112.
 Bache, K., & Lichman, M. (2013). UCI machine learning repository.
 Bajgier, S. M., & Hill, A. V. (1982). An experimental comparison of statistical and linear programming approaches to the discriminant problem. Decision Sciences, 13(4), 604–618.
 Bien, J., Taylor, J., Tibshirani, R., et al. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.
 Bone, R. C., Balk, R. A., Cerra, F. B., Dellinger, R. P., Fein, A. M., Knaus, W. A., et al. (1992). American college of chest physicians/society of critical care medicine consensus conference: Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Critical Care Medicine, 20(6), 864–874.
 Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced lectures on machine learning. Springer, pp. 169–207.
 Bradley, P. S., Fayyad, U. M., & Mangasarian, O. L. (1999). Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11(3), 217–238.
 Brooks, J. P. (2011). Support vector machines with the ramp loss and the hard margin loss. Operations Research, 59(2), 467–479.
 Brooks, J. P., & Lee, E. K. (2010). Analysis of the consistency of a mixed integer programming-based multi-category constrained discriminant model. Annals of Operations Research, 174(1), 147–168.
 Carrizosa, E., Martín-Barragán, B., & Morales, D. R. (2010). Binarized support vector machines. INFORMS Journal on Computing, 22(1), 154–167.
 Carrizosa, E., Nogales-Gómez, A., & Morales, D. R. (2013). Strongly agree or strongly disagree? Rating features in support vector machines. Technical report, Saïd Business School, University of Oxford, UK.
 Chevaleyre, Y., Koriche, F., & Zucker, J.-D. (2013). Rounding methods for discrete linear classification. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 651–659.
 Cranor, L. F., & LaMacchia, B. A. (1998). Spam! Communications of the ACM, 41(8), 74–83.
 Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., et al. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304–310.
 Dupačová, J., Consigli, G., & Wallace, S. W. (2000). Scenarios for multistage stochastic programs. Annals of Operations Research, 100(1–4), 25–53.
 Dupačová, J., Gröwe-Kuska, N., & Römisch, W. (2003). Scenario reduction in stochastic programming. Mathematical Programming, 95(3), 493–511.
 Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
 Elter, M., Schulz-Wendtland, R., & Wittenberg, T. (2007). The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Medical Physics, 34(11), 4164–4172.
 Freitas, A. A. (2014). Comprehensible classification models: A position paper. ACM SIGKDD Explorations Newsletter, 15(1), 1–10.
 Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
 Gage, B. F., Waterman, A. D., Shannon, W., Boechler, M., Rich, M. W., & Radford, M. J. (2001). Validation of clinical classification schemes for predicting stroke. The Journal of the American Medical Association, 285(22), 2864–2870.
 Glen, J. J. (1999). Integer programming methods for normalisation and variable selection in mathematical programming discriminant analysis models. Journal of the Operational Research Society, 50, 1043–1053.
 Goh, S. T., & Rudin, C. (2014). Box drawings for learning with imbalanced data. In Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp. 333–342.
 Goldberg, N., & Eckstein, J. (2010). Boosting classifiers with tightened l0-relaxation penalties. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 383–390.
 Goldberg, N., & Eckstein, J. (2012). Sparse weighted voting classifier selection and its linear programming relaxations. Information Processing Letters, 112, 481–486.
 Guan, W., Gray, A., & Leyffer, S. (2009). Mixed-integer support vector machine. In NIPS workshop on optimization for machine learning.
 Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.
 Haberman, S. J. (1976). Generalized residuals for log-linear models. In Proceedings of the 9th international biometrics conference, Boston, pp. 104–122.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (Vol. 2). New York: Springer.
 Jenatton, R., Audibert, J.-Y., & Bach, F. (2011). Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research, 12, 2777–2824.
 Jennings, D., Amabile, T. M., & Ross, L. (1982). Informal covariation assessment: Data-based vs. theory-based judgments. Judgment under uncertainty: Heuristics and biases, pp. 211–230.
 Joachimsthaler, E. A., & Stam, A. (1990). Mathematical programming approaches for the classification problem in two-group discriminant analysis. Multivariate Behavioral Research, 25(4), 427–454.
 Kapur, V. K. (2010). Obstructive sleep apnea: Diagnosis, epidemiology, and economics. Respiratory Care, 55(9), 1155–1167.
 Kim, M.-J., & Han, I. (2003). The discovery of experts’ decision rules from qualitative bankruptcy data using genetic algorithms. Expert Systems with Applications, 25(4), 637–646.
 Knaus, W. A., Zimmerman, J. E., Wagner, D. P., Draper, E. A., & Lawrence, D. E. (1981). APACHE-acute physiology and chronic health evaluation: a physiologically based classification system. Critical Care Medicine, 9(8), 591–597.
 Knaus, W. A., Draper, E. A., Wagner, D. P., & Zimmerman, J. E. (1985). APACHE II: a severity of disease classification system. Critical Care Medicine, 13(10), 818–829.
 Knaus, W. A., Wagner, D. P., Draper, E. A., Zimmerman, J. E., Bergner, M., Bastos, P. G., et al. (1991). The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest Journal, 100(6), 1619–1636.
 Kodratoff, Y. (1994). The comprehensibility manifesto. KDD Nugget Newsletter, 94, 9.
 Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, pp. 202–207.
 Kuhn, M., Weston, S., & Coulter, N. (2012). C50: C5.0 Decision trees and rulebased models, 2012. C code for C5.0 by R. Quinlan. R package version 0.1.0013.Google Scholar
 Le Gall, J.R., Loirat, P., Alperovitch, A., Glaser, P., Granthil, C., Mathieu, D., et al. (1984). A simplified acute physiology score for icu patients. Critical Care Medicine, 12(11), 975–977.CrossRefGoogle Scholar
 Le Gall, J.R., Lemeshow, S., & Saulnier, F. (1993). A new simplified acute physiology score (SAPS II) based on a european/north american multicenter study. The Journal of the American Medical Association, 270(24), 2957–2963.CrossRefGoogle Scholar
 Lee, E. K., & Wu, T.L. (2009). Classification and disease prediction via mathematical programming. In Handbook of optimization in medicine. Springer, pp. 1–50.Google Scholar
 Li, L., & Lin, H.-T. (2007). Optimizing 0/1 loss for perceptrons by random coordinate descent. In International joint conference on neural networks, 2007. IJCNN 2007. IEEE, pp. 749–754.
 Light, R. W., Macgregor, M. I., Luchsinger, P. C., & Ball, W. C. (1972). Pleural effusions: The diagnostic separation of transudates and exudates. Annals of Internal Medicine, 77(4), 507–513.
 Liittschwager, J. M., & Wang, C. (1978). Integer programming solution of a classification problem. Management Science, 24, 1515–1525.
 Lin, D., Pitler, E., Foster, D. P., & Ungar, L. H. (2008). In defense of l0. In Workshop on feature selection (ICML 2008).
 Liu, H., Hussain, F., Tan, C. L., & Dash, M. (2002). Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6, 393–423.
 Liu, H., & Zhang, J. (2009). Estimation consistency of the group lasso and its applications. In Proceedings of the twelfth international conference on artificial intelligence and statistics.
 Mangasarian, O. L. (1994). Misclassification minimization. Journal of Global Optimization, 5(4), 309–323.
 Mangasarian, O. L., Street, W. N., & Wolberg, W. H. (1995). Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), 570–577.
 Mao, K. Z. (2004). Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(1), 629–634.
 Marklof, J. (2012, July). Fine-scale statistics for the multidimensional Farey sequence. arXiv e-prints.
 Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2012). e1071: Misc functions of the department of statistics (e1071), TU Wien. R package version 1.6-1.
 Miller, A. J. (1984). Selection of subsets of regression variables. Journal of the Royal Statistical Society Series A (General), 147, 389–425.
 Moreno, R. P., Metnitz, P. G. H., Almeida, E., Jordan, B., Bauer, P., Campos, R. A., et al. (2005). SAPS 3—From evaluation of the patient to evaluation of the intensive care unit. Part 2: Development of a prognostic model for hospital mortality at ICU admission. Intensive Care Medicine, 31(10), 1345–1355.
 Nguyen, H. T., & Franke, K. (2012). A general Lp-norm support vector machine via mixed 0-1 programming. In Machine learning and data mining in pattern recognition. Springer, pp. 40–49.
 Nguyen, T., & Sanner, S. (2013). Algorithms for direct 0–1 loss optimization in binary classification. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1085–1093.
 Pazzani, M. J. (2000). Knowledge discovery from data? IEEE Intelligent Systems and Their Applications, 15(2), 10–12.
 R Core Team. (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
 Ranson, J. H., Rifkind, K. M., Roses, D. F., Fink, S. D., Eng, K., Spencer, F. C., et al. (1974). Prognostic signs and the role of operative management in acute pancreatitis. Surgery, Gynecology & Obstetrics, 139(1), 69.
 Rubin, P. A. (1990). Heuristic solution procedures for a mixed-integer programming discriminant model. Managerial and Decision Economics, 11, 255–266.
 Rubin, P. A. (1997). Solving mixed integer classification problems by decomposition. Annals of Operations Research, 74, 51–64.
 Rubin, P. A. (2009). Mixed integer classification problems. In Encyclopedia of optimization. Springer, pp. 2210–2214.
 Schlimmer, J. C. (1987). Concept acquisition through representational adjustment.
 Souillard-Mandar, W., Davis, R., Rudin, C., Au, R., Libon, D. J., Swenson, R., et al. (2015). Learning classification models of cognitive conditions from subtle behaviors in the digital clock drawing test. Machine Learning. Accepted.
 Therneau, T., Atkinson, B., & Ripley, B. (2012). rpart: Recursive partitioning. URL http://CRAN.R-project.org/package=rpart. R package version 4.1-0.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological), 58, 267–288.
 Towell, G. G., & Shavlik, J. W. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13, 71–101.
 Ustun, B., Westover, M. B., Rudin, C., & Bianchi, M. T. (2015). Clinical prediction models for sleep apnea: The importance of medical history over symptoms. Journal of Clinical Sleep Medicine.
 Van Belle, V., Neven, P., Harvey, V., Van Huffel, S., Suykens, J. A. K., & Boyd, S. (2013). Risk group detection and survival function estimation for interval coded survival methods. Neurocomputing, 112, 200–210.
 Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
 Wells, P. S., Anderson, D. R., Bormanis, J., Guy, F., Mitchell, M., Gray, L., et al. (1997). Value of assessment of pretest probability of deep-vein thrombosis in clinical management. Lancet, 350(9094), 1795–1798.
 Wells, P. S., Anderson, D. R., Rodger, M., Ginsberg, J. S., Kearon, C., Gent, M., et al. (2000). Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: Increasing the models utility with the SimpliRED D-dimer. Thrombosis and Haemostasis, 83(3), 416–420.
 Wolsey, L. A. (1998). Integer programming (Vol. 42). New York: Wiley.
 Yanev, N., & Balev, S. (1999). A combinatorial approach to the classification problem. European Journal of Operational Research, 115(2), 339–350.
 Zhao, P., & Bin, Y. (2007). On model selection consistency of lasso. Journal of Machine Learning Research, 7(2), 25–41.
 Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
 Zeng, J., Ustun, B., & Rudin, C. (2015). Interpretable classification models for recidivism prediction. arXiv preprint arXiv:1503.07810.