1 Introduction

Neural networks have become increasingly popular due to their remarkable achievements in computer vision and natural language processing. Their generalization power has been demonstrated in wide-ranging applications, from classifying photos to recommending products. However, neural networks face challenges in real-world applications for high-stakes decision-making, including healthcare, policy-making, and autonomous driving.

First, many standard neural networks are not robust – they can be easily fooled by natural or artificial noise in the input data (Szegedy et al., 2014), making them vulnerable to perturbations that may arise in real-world applications. Moreover, neural networks, similar to other machine learning models, often suffer from instability during the training process – different train-validation splits could generate models with very different performance (May et al., 2010; Xu & Goodacre, 2018). This reduces the policymakers’ trust in these models and hinders post-hoc interpretations. Another critical difficulty is that neural networks are not sparse – the high number of parameters utilized for neural networks prevents efficient computation and storage (Thompson et al., 2020). Most neural networks have millions of non-zero parameters to be stored and accessed for evaluation. This is problematic in many decision-making settings with limitations or restrictions on hardware capabilities. Reducing the number of parameters could make them more applicable in a broader range of scenarios (Changpinyo et al., 2017; Narang et al., 2017).

The questions around improving robustness, stability, and sparsity metrics have all been previously studied in the neural network literature. However, they have been almost exclusively studied in isolation, with a limited understanding of the tradeoffs between these desired qualities and their effect on natural accuracy (accuracy with respect to the unperturbed data samples). This paper aims to simultaneously address all these objectives through a novel comprehensive methodology named Holistic Deep Learning (HDL). In particular, HDL carefully combines state-of-the-art techniques that address these individual challenges and demonstrates their collective efficacy through extensive experiments on diverse data sets. Our findings provide a promising pathway toward developing efficient and reliable machine learning models across many dimensions for real-world applications.

Specifically, our contributions are as follows:

  1. We design HDL, a novel framework that jointly optimizes for neural network robustness (adversarial accuracy), stability (worst accuracy across train-validation splits), and sparsity (parameters with value zero) metrics by appropriately modifying the objective function.

  2. Through extensive ablation experiments and SHAP value analysis (Lundberg & Lee, 2017) across 45 UCI data sets (Dua & Graff, 2017) and 3 image data sets (MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017) and CIFAR10 (Krizhevsky et al., 2009)), we analyze the individual performance of each metric as well as the interactions and trade-offs between them. We corroborate that imposing robustness, stability, and sparsity improves the corresponding metrics across all data sets. In addition, we show that:

    • Imposing stability and sparsity further improves robustness,

    • Imposing stability and robustness further improves sparsity,

    • Imposing robustness further improves stability,

    • Imposing stability and robustness further improves natural accuracy.

    The effect of sparsity on natural accuracy is more complex and varies considerably across data sets. However, we show that it is often possible to simultaneously improve robustness, stability, and sparsity without sacrificing natural accuracy.

  3. We propose a prescriptive approach to provide recommendations on selecting the appropriate loss function depending on the practitioner’s objective. In particular, simultaneously imposing robustness, stability and sparsity in the loss function leads to the best results when jointly optimizing for all the metrics.

The paper is organized as follows: Sect. 2 outlines the current literature on robust, sparse, and stable methods; Sect. 3 describes the Holistic Deep Learning framework; and Sect. 4 presents the results of the computational experiments.

2 Related work

2.1 Robust neural networks

Many state-of-the-art deep neural networks are highly vulnerable to small perturbations in the input data (Szegedy et al., 2014), which can be a threat to some real-world applications like self-driving cars or face recognition. Adversarial robustness evaluates a neural network’s resistance against these altered inputs, referred to as adversarial examples, intentionally designed to worsen the network’s performance (Goodfellow et al., 2014; Carlini & Wagner, 2017; Madry et al., 2017).

Multiple methods have been developed in recent years to enhance the adversarial robustness of neural networks. One of the most popular heuristics is augmenting the data set during training with adversarial examples (Madry et al., 2017; Goodfellow et al., 2014). Others include neuron randomization (Prakash et al., 2018; Xie et al., 2017), input space projections (Lamb et al., 2018; Kabilan et al., 2021; Ilyas et al., 2017) and regularization (Bertsimas et al., 2021; Ross & Doshi-Velez, 2018; Hein & Andriushchenko, 2017; Yan et al., 2018). A less common but more theoretically rigorous approach is to minimize a provable upper bound of the loss achieved with adversarial examples (Raghunathan et al., 2018; Singh et al., 2018; Zhang et al., 2018; Weng et al., 2018; Dvijotham et al., 2018; Lecuyer et al., 2019; Cohen et al., 2019; Anderson et al., 2020; Bertsimas et al., 2021).

While these methods successfully improve the network’s robustness, the extent to which they do so often depends on the data set, the network size, and the magnitude of the input perturbations. In particular, heuristic methods generally work well for small perturbations, while the upper bound methods yield better results when the input noise is larger (Bertsimas et al., 2021; Athalye et al., 2018). However, there is a trade-off between effectiveness and efficiency. The methods providing the strongest adversarial robustness are often computationally demanding, making it challenging to implement them for large data sets or complex network architectures.

2.2 Sparse neural networks

In machine learning, sparse models make predictions based on a limited number of parameters. Sparsity is often desirable, as it may save memory, enhance model interpretability, and reduce overfitting (Bertsimas et al., 2020).

There are two typical approaches to sparsity in deep learning. The first one, train-then-sparsify, consists of removing unnecessary neurons or connections after training the network, sometimes followed by retraining (Janowsky, 1989; LeCun et al., 1989). This approach has been widely investigated, and several schemes exist to choose which connections to prune (Hoefler et al., 2021). Han et al. (2015), for example, propose to prune the connections with the smallest weights. Other methods include formulating a convex optimization problem (Aghasi et al., 2020), removing filters for which the total absolute sum is low (Li et al., 2016), and eliminating channels that have limited impact on the network’s discriminatory ability (Zhuang et al., 2018). The second approach, sparsify-during-training, learns a sparse architecture while training the network. Multiple methodologies exist (Bellec et al., 2017; Mocanu et al., 2017; Mostafa & Wang, 2019), including methods that approximate the \(\ell _0\) norm with continuous functions and add it as a regularization term to the loss function (Louizos et al., 2017; Savarese et al., 2020). We refer the reader to Gale et al. (2019) and Hoefler et al. (2021) for more comprehensive surveys on sparsity.

2.3 Stable neural networks

The stochastic nature of data samples can lead to instability and high dependence of machine learning models on the specific train-validation split. This can negatively impact the interpretability of the resulting model and its ability to make reliable predictions (Bertsimas & Paskov, 2020), a key factor in establishing trust in any algorithm.

The sensitivity of machine learning models to the choice of training split has mostly been studied through the lens of cross-validation and distributionally robust optimization. Cross-validation can be used to measure the variability from the selection of training split but at a significant increase in computational cost (Krogh & Vedelsby, 1994; Hastie et al., 2001) that is often intractable for deep learning settings. Distributionally robust optimization has been used to quantify the worst-case generalization error in the presence of shifts in distribution or regime (Staib & Jegelka, 2019; Goldwasser et al., 2020; Sagawa et al., 2019), but it often requires pre-defined groups over the training data and expensive group annotations for each data sample to avoid overly pessimistic uncertain distributions (Sagawa et al., 2019; Liu et al., 2021). A different approach has been studied by Bertsimas and Paskov (2020) and Bertsimas et al. (2022), who instead optimize over the worst training set of fixed size without making any probabilistic assumptions. Although their method was presented in the context of linear and tree-based models, their framework also applies to neural networks.

3 The holistic deep learning approach

3.1 The HDL framework

We introduce the HDL framework for a classification problem over points \(\textbf{x}\in {\mathbb {R}}^M\) whose target \(y\in [K]\) is one of K different classes (we use the notation [n] to denote the set \(\{1, \ldots , n\}\)). We illustrate our approach over a fully-connected neural network for simplicity of notation, but the framework remains the same for convolutional neural networks.

For \(\textbf{x}\in {\mathbb {R}}^M\), define \([\textbf{x}]_m^+ = \max \{0,x_m\}\) (the ReLU function). Given weight matrices \(\textbf{W}^\ell \in {\mathbb {R}}^{r_{\ell }\times r_{\ell -1}}\) and bias vectors \(\textbf{b}^\ell \in {\mathbb {R}}^{r_\ell }\) for \(\ell \in [L]\), such that \(r_0= M, r_L = K\), the corresponding feed-forward neural network with L layers and ReLU activation functions is defined by the equations:

$$\begin{aligned} \textbf{z}^1({\varvec{\mathcal {W}}} , \textbf{x})&= \textbf{W}^1\textbf{x}+\textbf{b}^1, \end{aligned}$$
(1)
$$\begin{aligned} \textbf{z}^\ell ({\varvec{\mathcal {W}}} , \textbf{x})&= \textbf{W}^\ell [\textbf{z}^{\ell -1}({\varvec{\mathcal {W}}} , \textbf{x})]^+ + \textbf{b}^{\ell }, \quad \forall \, 2\le \ell \le L, \end{aligned}$$
(2)

where \({\varvec{\mathcal {W}}}\) denotes the parameters \((\textbf{W}^\ell , \textbf{b}^\ell )\) for all \(\ell \in [L]\). Consider a data set \(\{({\textbf{x}}_n, {y}_n)\}_{n=1}^N\), where \(y_n\in [K]\) is the target class of \(\textbf{x}_n\). For each point \(\textbf{x}_n\), the class predicted by the network is \({{\,\mathrm{arg\,max}\,}}_k \, {z}^L_k({\varvec{\mathcal {W}}}, \textbf{x}_n)\).
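To make Eqs. (1) and (2) concrete, the following is a minimal sketch of the forward pass in plain Python. The helper names (`affine`, `forward`, `predict`) and the toy weights are illustrative assumptions and are not taken from the authors' implementation.

```python
def relu(v):
    """Element-wise ReLU, i.e. [v]^+."""
    return [max(0.0, t) for t in v]


def affine(W, b, v):
    """Return W v + b, where each row of W produces one output coordinate."""
    return [sum(w * t for w, t in zip(row, v)) + bi for row, bi in zip(W, b)]


def forward(params, x):
    """params is a list of (W, b) pairs, one per layer; returns z^L."""
    W, b = params[0]
    z = affine(W, b, x)                # Eq. (1)
    for W, b in params[1:]:
        z = affine(W, b, relu(z))      # Eq. (2)
    return z


def predict(params, x):
    """arg max_k z^L_k, reported as a class in {1, ..., K}."""
    z = forward(params, x)
    return max(range(len(z)), key=lambda k: z[k]) + 1


# Toy network (hypothetical weights): M = 2 inputs, 3 hidden units, K = 2 classes.
params = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -1.0, 0.5]),
    ([[1.0, 0.0, -1.0], [-1.0, 1.0, 1.0]], [0.1, -0.1]),
]
x = [1.0, 2.0]
logits = forward(params, x)
label = predict(params, x)
```

Note that no activation is applied to the last layer, so `logits` are the raw scores whose arg max gives the predicted class.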

The nominal DL approach is to minimize the cross-entropy loss of the network \(\textbf{z}^L\) described in Eq. (2), which can be written as:

$$\begin{aligned} \min _{{\varvec{\mathcal {W}}}} \frac{1}{N}\sum _{n=1}^N \log \left( \sum _{k=1}^K e^{(\Delta {\varvec{e}}_k^{y_n})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)}\right) , \end{aligned}$$
(3)

where \(\Delta {\varvec{e}}_k^{y_n} = {\varvec{e}}_k - {\varvec{e}}_{y_n}\) and \({\varvec{e}}_k\) refers to the one-hot vectors with a 1 in the \(k^{\text {th}}\) coordinate and 0 everywhere else. In our HDL framework, we propose instead to minimize the following optimization problem:

$$\begin{aligned} \min _{\varvec{s},\theta ,{\varvec{\mathcal {W}}}} \;&\lambda \underbrace{\sum _{j=1}^{\vert {\varvec{\mathcal {W}}}\vert } \sigma (\beta s_j)}_{\text {Sparsity}} + \underbrace{\theta }_{\text {Stability}} \\&+ \frac{1}{a}\sum _{n=1}^N \left[ \log \sum _{k=1}^K e^{(\Delta {\varvec{e}}_k^{y_n})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}\odot \sigma (\beta \varvec{s}), \textbf{x}_n) + \overbrace{\rho \Vert \nabla _{\textbf{x}} (\Delta {\varvec{e}}_k^{y_n})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}\odot \sigma (\beta \varvec{s}), \textbf{x}_n)\Vert _1}^{\text {Robustness}}} - \underbrace{\theta }_{\text {Stability}}\right] ^{+}, \end{aligned}$$
(4)

where \(\odot\) corresponds to the element-wise product, \(\sigma\) is the standard sigmoid function, \({\textbf{z}}^L(\cdot , \textbf{x})\) was defined in (2), \(\lambda\) (resp. \(\rho )\) is the regularization coefficient corresponding to the sparsity (resp. robustness) loss component, and a is the size of the data subsets used for the stability requirement (see Sect. 3.4). We observe that robustness adds a term to the output, while stability and sparsity add new parameters (\(\theta\) and \(\varvec{s}\) respectively) to be optimized. This loss function allows us to simultaneously train robust, sparse, and stable feed-forward neural networks at scale. In the following subsections, we provide more details about each component.
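As a rough illustration of how the three components of Eq. (4) fit together, the sketch below assembles the objective value from per-sample robust losses that are assumed to be computed elsewhere (the log-sum-exp term inside the brackets, evaluated with the masked weights). The function and argument names (`hdl_objective`, `robust_losses`, `lam`) are hypothetical and chosen only for this example.

```python
import math


def sigmoid(t):
    """Numerically stable standard sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def hdl_objective(robust_losses, s, theta, lam, beta, a):
    """Value of Eq. (4) given per-sample robust losses.

    robust_losses[n] is assumed to already contain
    log sum_k exp(margin_k + rho * ||grad_x margin_k||_1) for sample n.
    """
    # Sparsity: lambda * sum_j sigma(beta * s_j)
    sparsity = lam * sum(sigmoid(beta * sj) for sj in s)
    # Stability: theta + (1/a) * sum_n [loss_n - theta]^+
    stability = theta + sum(max(0.0, l - theta) for l in robust_losses) / a
    return sparsity + stability


# Hypothetical numbers: three samples, two sparsity variables.
example = hdl_objective([0.5, 1.0, 1.5], s=[0.0, 0.0], theta=0.0,
                        lam=0.1, beta=1.0, a=3)
```

With `lam = 0`, `a` equal to the number of samples, and `theta` no larger than the smallest loss, the value reduces to the plain average loss, which matches the nominal objective when \(\rho = 0\).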

3.2 Robustness

This section describes our method to introduce the robust component into neural network training. Since our ultimate goal is to incorporate the sparsity, robustness, and stability of neural networks together in a tractable way, we avoid algorithms that improve robustness at the expense of a significant increase in the training time or the algorithm’s complexity (for instance, the algorithms that perform training with adversarial examples usually require significantly longer times to optimally find such examples at each gradient descent iteration (Madry et al., 2017; Bertsimas et al., 2021)). We follow the approach from Bertsimas et al. (2021) of using a linear approximation of the neural network to estimate the robust objective. This approach is simple to implement, produces good adversarial accuracy, and does not require the extensive training time of other algorithms.

For a given \((\textbf{x}, y)\) pair, the robust problem using the cross-entropy loss and the \(\ell _{\infty }\)-norm uncertainty sets can be upper bounded as:

$$\begin{aligned} \max _{\varvec{\delta }: \Vert \varvec{\delta }\Vert _{\infty }\le \rho }\log \sum _{k} e^{(\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}+ \varvec{\delta })} \le \log \sum _{k} e^{\max _{\varvec{\delta }: \Vert \varvec{\delta }\Vert _{\infty }\le \rho } \: (\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}+ \varvec{\delta })}. \end{aligned}$$
(5)

Since \({z}^L_k({\varvec{\mathcal {W}}}, \textbf{x})\) is piece-wise linear, we expect the outputs \({z}^L_k({\varvec{\mathcal {W}}}, \textbf{x})\) and \({z}^L_k({\varvec{\mathcal {W}}}, \textbf{x}+ \varvec{\delta })\) to be in the same linear piece when \(\textbf{x}+ \varvec{\delta }\) is close to \(\textbf{x}\). In other words, the linear approximation

$$\begin{aligned} {z}^L_k({\varvec{\mathcal {W}}}, \textbf{x}+ \varvec{\delta }) \approx {z}^L_k({\varvec{\mathcal {W}}}, \textbf{x}) + \varvec{\delta }^\top \nabla _{\textbf{x}} {z}^L_k({\varvec{\mathcal {W}}}, \textbf{x}) \end{aligned}$$
(6)

is exact for small enough \(\varvec{\delta }\). Therefore, we approximate the upper bound in (5) as

$$\begin{aligned}&\log \sum _{k} e^{\max _{\varvec{\delta }: \Vert \varvec{\delta }\Vert _{\infty }\le \rho } \: (\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}+ \varvec{\delta })}\nonumber \\&\quad \approx \log \sum _{k} e^{\max _{\varvec{\delta }: \Vert \varvec{\delta }\Vert _{\infty }\le \rho } \: (\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}) + \varvec{\delta }^\top \nabla _{\textbf{x}} (\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x})} \nonumber \\&\quad = \log \sum _{k} e^{ (\Delta {\varvec{e}}_k^{y})^\top {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}) + \rho \Vert (\Delta {\varvec{e}}_k^{y})^\top \nabla _{\textbf{x}}{\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x})\Vert _1}. \end{aligned}$$
(7)

Even though the expression in Eq. (7) is not always an upper bound of Eq. (5) for an arbitrary value of \(\rho\), Bertsimas et al. (2021) experimentally show that generally the average loss obtained using this expression is indeed an upper bound of the average adversarial loss. In fact, for small \(\rho\), their experiments demonstrate that this approach achieves competitive results with state-of-the-art methods while requiring significantly less computational time across various tabular and image data sets. However, we emphasize that the methodology developed in this paper could also be performed with other methods for robust training, like adversarial training or upper bound minimization, which might be more appropriate for large uncertainty sets.
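The approximation in Eq. (7) can be sketched for a small fully-connected ReLU network as follows. This is an illustrative plain-Python example, not the authors' TensorFlow code: the input Jacobian is obtained by hand from the active ReLU pattern, the class label `y` is 0-indexed, and all names and weights are assumptions.

```python
import math


def forward_with_jacobian(params, x):
    """Return z^L(x) and the K x M Jacobian dz^L/dx of a ReLU network."""
    W, b = params[0]
    z = [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]
    J = [row[:] for row in W]                      # dz^1/dx = W^1
    for W, b in params[1:]:
        active = [1.0 if t > 0 else 0.0 for t in z]
        h = [max(0.0, t) for t in z]
        # Chain rule through the ReLU: J <- W diag(active) J
        J = [[sum(W[i][j] * active[j] * J[j][m] for j in range(len(z)))
              for m in range(len(x))] for i in range(len(W))]
        z = [sum(w * t for w, t in zip(row, h)) + bi for row, bi in zip(W, b)]
    return z, J


def robust_loss(params, x, y, rho):
    """Eq. (7): log sum_k exp(margin_k + rho * ||grad_x margin_k||_1)."""
    z, J = forward_with_jacobian(params, x)
    exponents = []
    for k in range(len(z)):
        margin = z[k] - z[y]                       # (e_k - e_y)^T z^L
        grad_l1 = sum(abs(J[k][m] - J[y][m]) for m in range(len(x)))
        exponents.append(margin + rho * grad_l1)
    top = max(exponents)                           # stable log-sum-exp
    return top + math.log(sum(math.exp(e - top) for e in exponents))


# Toy network (hypothetical weights): 2 inputs, 3 hidden units, 2 classes.
params = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -1.0, 0.5]),
    ([[1.0, 0.0, -1.0], [-1.0, 1.0, 1.0]], [0.1, -0.1]),
]
x, y = [1.0, 2.0], 1
nominal = robust_loss(params, x, y, rho=0.0)
robust = robust_loss(params, x, y, rho=0.1)
```

Setting `rho=0.0` recovers the nominal cross-entropy loss. In this toy example, every perturbation with \(\Vert \varvec{\delta }\Vert _{\infty }\le 0.1\) keeps the same hidden units active, so the linearization in Eq. (6) is exact over the whole uncertainty set.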

3.3 Sparsity

In this work, we use the specific retraining procedure proposed by Savarese et al. (2020), which deterministically approximates the \(\ell _0\) regularization utilizing a sequence of sigmoid functions and adding them as a penalty term in the loss function. Notably, the implementation is easily compatible with our robustness and stability requirements, since this methodology relies on a penalty term added in the loss function. Therefore, we can use gradient descent to simultaneously optimize the objective function comprising the robustness, stability, and sparsity penalties.

Adding \(\ell _0\) regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the \(\ell _0\)-norm induces a priori a non-convex and non-differentiable loss function \({\mathcal {R}}(\varvec{{\varvec{\mathcal {W}}}})\), as follows:

$$\begin{aligned} {\mathcal {R}}(\varvec{{\varvec{\mathcal {W}}}})=\frac{1}{N}\left( \sum _{n=1}^{N} {\mathcal {L}}\left( y_{n}, {\textbf{z}}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)\right) \right) +\lambda \Vert {\varvec{\mathcal {W}}}\Vert _{0}, \quad \Vert {\varvec{\mathcal {W}}}\Vert _{0}=\sum _{j=1}^{\vert {\varvec{\mathcal {W}}}\vert } {\mathbb {I}}\left[ w_{j} \ne 0\right] , \end{aligned}$$
(8)

where \(\vert {\varvec{\mathcal {W}}}\vert\) is the number of parameters, \(w_j\) is the \(j^{th}\) coordinate of \({\varvec{\mathcal {W}}}\), \(\lambda\) is the regularization weight and \({\mathcal {L}}\) a loss function (e.g., cross-entropy loss).

The goal is to relax the discrete nature of the \(\ell _0\) penalty to preserve an efficient continuous optimization while allowing for exact zeros in the neural network weights. To do this, Savarese et al. (2020) propose to first gate each weight \(w_j\) by \(H(s_j)\), where \(H(\cdot )\) is the Heaviside step function and \(s_j\) is an auxiliary parameter, and then approximate the non-differentiable step function with the sigmoid function: \(\sigma (\beta s_j) \rightarrow H(s_j)\) when \(\beta \rightarrow \infty\). Therefore, \(\beta\) is the hardness parameter that controls how close the approximation is to the \(\ell _0\) regularization, and the final loss function can be written as:

$$\begin{aligned} {\mathcal {R}}(\varvec{{\varvec{\mathcal {W}}}})\approx \frac{1}{N}\left( \sum _{n=1}^{N} {\mathcal {L}}\left( y_{n}, {\textbf{z}}^L({\varvec{\mathcal {W}}}\odot \sigma (\beta \varvec{s}), \textbf{x}_n)\right) \right) +\lambda \sum _{j=1}^{\vert {\varvec{\mathcal {W}}}\vert } \sigma (\beta s_j). \end{aligned}$$
(9)

To achieve a sparse network, we use the loss function (9) over multiple training rounds, each of which yields a sparser initialization for the next, before training the final sparse neural network. In each round, we start from the auxiliary sparsity parameters \(\varvec{s}_0\) and hardness \(\beta =1\). Over the T training iterations, we gradually increase \(\beta\) until it reaches its maximum value \({\bar{\beta }}\), at which point the round ends with sparsity parameters \(\varvec{s}_{T}\). Then, we take \(\varvec{s}_0'=\min ({\bar{\beta }} \varvec{s}_T, \varvec{s}_0)\), with the minimum applied element-wise, as the initialization for the next round of training. This minimum essentially keeps the information of the suppressed weights, i.e., those with \(\sigma (\beta s_j) \approx 0\), while reverting those not suppressed to their starting position. This process is repeated over multiple rounds to find better and sparser initializations for the neural network.
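The soft mask and the per-round reset described above can be sketched as follows. The exponential growth of \(\beta\) from 1 to \({\bar{\beta }}\) used in `beta_schedule` is an assumption made for illustration (the text only states that \(\beta\) is increased gradually), and all function names are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable standard sigmoid."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_mask(s, beta):
    """sigma(beta * s_j): approaches the Heaviside step H(s_j) as beta grows."""
    return [sigmoid(beta * sj) for sj in s]


def beta_schedule(t, T, beta_max):
    """Assumed schedule: beta = 1 at t = 0 and beta = beta_max at t = T."""
    return beta_max ** (t / T)


def reset_for_next_round(s_T, s_0, beta_max):
    """s_0' = min(beta_max * s_T, s_0), applied element-wise."""
    return [min(beta_max * st, s0) for st, s0 in zip(s_T, s_0)]


# Hypothetical end-of-round values for three weights, with s_0 = 0.
s_0 = [0.0, 0.0, 0.0]
s_T = [-0.02, 0.03, 0.0]
next_s0 = reset_for_next_round(s_T, s_0, beta_max=200.0)
```

In this example the first weight, whose \(s_j\) became negative, starts the next round strongly suppressed, while the other two return to their initial value of 0.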

We implement the methodology as suggested by Savarese et al. (2020). In the results section, we measure sparsity in terms of the percentage of neuron connections (weights) set to 0.

3.4 Stability

Using the measure of stability defined in Sect. 2.3, we apply the methodology developed in Bertsimas et al. (2022) for building stable neural networks. At a high level, this corresponds to constructing a model that is robust to the specific subset of data used to train it. One way to think about this is to view the training data set as a sample from the true data distribution and then require the model to be robust to the specific sample. Considering the partition of the data into train-validation sets as a sampling mechanism from this true data distribution (each split choice gives a different training set), we desire to build models that are robust to every partition.

To achieve this, we first associate each observation \((\textbf{x}_n,y_n)\) with a binary variable \(z_n\), \(n\in [N]\), that indicates whether or not \((\textbf{x}_n,y_n)\) is part of the training set. We then choose the network’s parameters so as to minimize the worst-case loss over all possible allocations of these \(z_n\)’s, resulting in a model that is explicitly built to do well not just over one training set, but over all possible training sets. We start from the same minimization problem introduced in Sect. 3.1, i.e.,

$$\begin{aligned} \min _{{\varvec{\mathcal {W}}}} \frac{1}{N}\sum _{n=1}^N {\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)). \end{aligned}$$

To obtain network stability we require the model to be robust to every training set of fixed size a, which results in the following optimization problem:

$$\begin{aligned} \min _{{\varvec{\mathcal {W}}}}&\max _{z \in \mathcal{Z}}\frac{1}{a}\sum _{n=1}^N z_n{\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)), \\ \quad \text {where} \ {}&\mathcal{Z}= \left\{ z: ~\sum _{n=1}^N z_n = a, \quad z_n \in \{0,1\},~ n\in [N]\right\} . \end{aligned}$$
(10)

The value of a determines the relative sizes of the training and validation sets. For example, setting \(a = 0.7N\) recovers the typical 70/30 train-validation split. Since the inner maximization problem is linear in z, it is equivalent to optimizing over the convex hull of \(\mathcal{Z}\). This implies that the binary constraints on \(z_n\) can be relaxed to \(0\le z_n\le 1\), which yields a linear maximization problem in the variables \(z_n\). By linear programming duality, the optimal value of the inner maximization problem equals that of:

$$\begin{aligned} \min _{\theta ,u_n} ~ \theta + \frac{1}{a}\sum _{n=1}^N u_n \quad \text {subject to} \quad \theta + u_n \ge {\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)), \quad u_n \ge 0,~ n\in [N]. \end{aligned}$$

Therefore, the stability problem becomes

$$ \min _{\varvec{\mathcal {W}}, \theta ,u_n } \theta + \frac{1}{a}\sum _{n=1}^N u_n \quad \text {subject to} \quad \theta + u_n \ge {\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)), \,\, u_n \ge 0,~ n\in [N].$$

Note that the variables \(u_n\) can be solved in closed form as \(u_n = [{\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)) - \theta ]^+\). The final minimization problem with stability then becomes:

$$\begin{aligned} \min _{{\varvec{\mathcal {W}}}, \theta } \hspace{5.0pt}\theta + \frac{1}{a}\sum _{n=1}^N \left[ {\mathcal {L}}(y_n, \textbf{z}^L({\varvec{\mathcal {W}}}, \textbf{x}_n)) - \theta \right] ^+, \end{aligned}$$

which is now an unconstrained problem that can be solved with standard gradient descent optimization algorithms.
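The equivalence between the inner maximization in Eq. (10) and the final formulation can be checked numerically. The sketch below, with hypothetical loss values and function names, compares the average of the a largest losses with the value of \(\theta + \frac{1}{a}\sum _n [{\mathcal {L}}_n - \theta ]^+\) at \(\theta\) equal to the a-th largest loss, which is a minimizer when a is an integer.

```python
def stable_objective(losses, theta, a):
    """theta + (1/a) * sum_n [loss_n - theta]^+."""
    return theta + sum(max(0.0, l - theta) for l in losses) / a


def worst_subset_average(losses, a):
    """Inner maximum of Eq. (10): average loss over the a hardest samples."""
    return sum(sorted(losses, reverse=True)[:a]) / a


# Hypothetical per-sample losses with N = 10 and a = 0.7 N = 7.
losses = [0.2, 1.5, 0.7, 3.0, 0.1, 0.9, 2.2, 0.4, 1.1, 0.3]
a = 7
theta_star = sorted(losses, reverse=True)[a - 1]   # a-th largest loss
worst_case = worst_subset_average(losses, a)
at_optimum = stable_objective(losses, theta_star, a)
```

Any other value of \(\theta\) gives an objective at least as large as `worst_case`, which is why \(\theta\) can simply be optimized jointly with the network weights by gradient descent.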

4 Experiments

This section presents extensive computational experiments comparing the nominal DL approach (abbreviated DL) with 7 other models resulting from our holistic methodology. We showcase the merit of our HDL framework and investigate the influence of each studied component – robustness, sparsity, and stability – on the overall performance across 4 evaluation metrics:

  • Natural accuracy: Average accuracy on the testing set across the 10 different train-validation splits with respect to the original input data.

  • Adversarial robustness: Average adversarial accuracy on the testing set across the 10 different train-validation splits with respect to adversarial attacks resulting from perturbations of the original input data. We consider only attacks bounded in the \(\ell _\infty\) norm by some radius \(\rho\), generated using Projected Gradient Descent as in Madry et al. (2017).

  • Stability: Worst accuracy on the testing set across the 10 different train-validation splits with respect to the original input data.

  • Sparsity: Percentage of network parameters with value 0.
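The four metrics above can be computed from per-split results as in the following sketch. The input format (one test accuracy per train-validation split, plus a flat list of network parameters) and the name `summarize` are assumptions made for illustration.

```python
def summarize(natural_acc, adversarial_acc, weights):
    """Aggregate per-split test accuracies and parameters into the 4 metrics.

    natural_acc / adversarial_acc: one test accuracy per train-validation split.
    weights: flat list of the network parameters of the evaluated model.
    """
    return {
        "natural_accuracy": sum(natural_acc) / len(natural_acc),
        "adversarial_robustness": sum(adversarial_acc) / len(adversarial_acc),
        "stability": min(natural_acc),             # worst split
        "sparsity": 100.0 * sum(1 for w in weights if w == 0.0) / len(weights),
    }


# Hypothetical results over 3 splits (the experiments use 10).
metrics = summarize(
    natural_acc=[0.5, 0.75, 1.0],
    adversarial_acc=[0.25, 0.5, 0.75],
    weights=[0.0, 1.0, 0.0, 2.0],
)
```

All four quantities are defined so that larger values are better, which makes them directly comparable in a rank-based summary.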

The exact optimization problem solved for each model results from combinations of the loss functions described in the previous section, and the specific formulations can be found in Table 1.

Table 1 Loss functions used for DL and all methods in the HDL framework

Data

We ran experiments on classification tasks using 45 data sets from the UCI Machine Learning Repository (Dua & Graff, 2017). These data sets span a variety of problem sizes and difficulties and form a representative sample of real-world tabular problems, with the largest data set having 245,056 observations and the highest number of features being 856. We also benchmarked our methodologies on three image data sets: MNIST, Fashion-MNIST, and CIFAR10.

Implementation

Our code is written in Python 3.8 (Van Rossum & Drake, 2009). Neural networks are coded using TensorFlow v1 (Abadi et al., 2015). We trained each model on a system equipped with an Intel Xeon Gold 6248 processor, using 4 CPU cores and one Nvidia Volta V100 GPU.

Training methodology

For each data set, we used 20% of the data to obtain a fixed test set, and we randomly generated 10 different 80%-20% train-validation splits with the remaining data points. The same 10 train-validation partitions were used across all methods for a fair comparison. Given a choice of model and evaluation metric, we selected the hyperparameters that led to the best average performance in the validation set for the metric in question. We then reported the average performance of the chosen parameter configuration on the test set with respect to the given metric. For all evaluation metrics, the average performance is computed over the 10 train-validation splits initially generated.
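The splitting protocol can be sketched with the standard library as follows. The function name `make_splits`, the rounding of split sizes, and the seed handling are assumptions, since the text does not specify them.

```python
import random


def make_splits(n, n_splits=10, test_frac=0.2, val_frac=0.2, seed=0):
    """Return a fixed test index set and n_splits (train, validation) pairs."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(test_frac * n))
    test, rest = indices[:n_test], indices[n_test:]    # fixed 20% test set
    splits = []
    for _ in range(n_splits):
        perm = rest[:]
        rng.shuffle(perm)                              # new random partition
        n_val = int(round(val_frac * len(perm)))
        splits.append((perm[n_val:], perm[:n_val]))    # (train, validation)
    return test, splits


test_idx, splits = make_splits(100)
```

Generating the splits once and reusing them for every method is what allows the same 10 partitions to be shared across all models.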

Neural network architectures

For our experiments on UCI data sets, we used a feedforward neural network architecture with 2 hidden layers, each with 128 neurons and ReLU activations. For our experiments on the image data sets, we used a convolutional neural network with the AlexNet architecture (Krizhevsky et al., 2012). We used the Glorot uniform initialization (Glorot & Bengio, 2010) for the network weights \({\varvec{\mathcal {W}}}\) and \(\varvec{0}\) as initialization for the sparsity variable \(\varvec{s}_0\). The choice of architecture and initialization was made to reflect typical settings utilized in the machine learning community (e.g. Madry et al. (2017); Savarese et al. (2020); Bertsimas et al. (2021)) while maintaining moderate size networks that facilitate exhaustive experimentation across dozens of data sets. Importantly, the same architecture is used across all methods being evaluated.

Hyperparameter search

For each model, we cross-validated the values of the following hyperparameters:

  • Adam learning rate: \(\{10^{-2}, 10^{-3}\}\) for UCI data sets, \(\{10^{-3}, 10^{-4}\}\) for image data sets.

  • Number of epochs: 150 for UCI data sets, 50 for image data sets.

  • Batch size: 32 for UCI data sets, 64 for image data sets.

  • Robustness radius \(\rho\): \(\{10^{-1}, ~10^{-2}, ~10^{-3}, ~10^{-4}, ~10^{-5}\}\).

  • Sparsity regularization parameter \(\lambda\): \(\{10^{-6}, ~10^{-8}, ~10^{-10}\}\).

  • Sparsity temperature parameter \({\bar{\beta }}\): \(\{200, 1000\}\).

  • Stability parameter a, as a fraction of N: \(\{0.7, 0.8, 0.9, 1\}\).
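To give a sense of the size of this search, the sketch below enumerates the grid for the UCI data sets, with the values written in floating-point form. It assumes that each model only searches over the hyperparameters of the components it includes, which the text does not state explicitly; the dictionary keys and the function `configurations` are hypothetical.

```python
from itertools import product

GRID_UCI = {
    "learning_rate": [1e-2, 1e-3],
    "rho": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],      # robustness radius
    "lam": [1e-6, 1e-8, 1e-10],                 # sparsity regularization
    "beta_max": [200, 1000],                    # sparsity temperature
    "a_fraction": [0.7, 0.8, 0.9, 1.0],         # stability parameter a / N
}

# Assumed mapping from loss components to the hyperparameters they introduce.
COMPONENT_KEYS = {
    "robust": ["rho"],
    "sparse": ["lam", "beta_max"],
    "stable": ["a_fraction"],
}


def configurations(grid, components):
    """All hyperparameter combinations for a model with the given components."""
    keys = ["learning_rate"]
    for component in components:
        keys += COMPONENT_KEYS[component]
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


n_dl = len(configurations(GRID_UCI, []))
n_robust = len(configurations(GRID_UCI, ["robust"]))
n_hdl = len(configurations(GRID_UCI, ["robust", "sparse", "stable"]))
```

Under this assumption the nominal model has 2 configurations per data set and the full HDL model has 2 × 5 × 3 × 2 × 4 = 240.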

4.1 UCI data sets

We split the 45 UCI data sets into 6 roughly even-sized groups based on their difficulty level. Specifically, we consider the ranges \(0\%\)-\(70\%\), \(70\%\)-\(80\%\), \(80\%\)-\(90\%\), \(90\%\)-\(95\%\), \(95\%\)-\(98\%\) and \(98\%\)-\(100\%\) of natural accuracy achieved by the nominal DL approach. We first investigate the performance of the HDL framework with respect to a single evaluation metric. In Fig. 1, we evaluate all methods in terms of natural accuracy, adversarial accuracy with \(\rho = 0.1\), stability, and sparsity.

Figure 1a and c show that those data sets for which the nominal approach achieves accuracy in the \(70\%\)-\(90\%\) range are the ones that benefit the most from the HDL framework (especially the Robust, Stable, and Stable+Robust models) when the evaluation metric corresponds to natural accuracy or stability. For the data sets with natural accuracy above \(90\%\), none of the models significantly improve over the natural accuracy or stability achieved by the nominal DL model. However, for data sets in the \(98\%\)-\(100\%\) range, sparsity slightly improves accuracy, and robustness slightly improves stability.

Figure 1b shows the adversarial robustness achieved with perturbation parameter \(\rho = 0.1\). We see a substantial adversarial robustness improvement in all methods that included the robust component. Moreover, combining robustness with stability and/or sparsity leads to higher adversarial accuracy than that achieved with robustness alone. In terms of parameter sparsity, Fig. 1d shows that all models with imposed sparsity (Sparse, Stable+Sparse, Robust+Sparse, and HDL) have a much lower percentage of nonzero parameters compared to the models without it. Importantly, both robustness and stability help achieve sparser models when combined with sparsity.

Fig. 1 Evaluation of the different methods depending on the natural accuracy of the nominal DL approach on the UCI data sets

Since we are also interested in models that are simultaneously accurate, sparse, robust, and stable, we consider a multi-objective metric using the rank of each method (ranks start at 1, with lower ranks corresponding to better performance). For each method, we use the natural accuracy, adversarial accuracy, stability, and sparsity achieved in the validation set respectively to rank all its hyperparameter configurations 4 times. Then for each hyperparameter configuration, we compute the average rank across the 4 metrics and select the configuration with the best (i.e., lowest) average rank for that method. Finally, we rank the 8 selected models (for the 8 different methods) with respect to each evaluation metric on the testing set to obtain their out-of-sample average rank. As shown in Fig. 2, all 7 models from the HDL framework outperform the nominal DL approach with respect to this holistic metric. Moreover, the HDL model typically achieves the best results across data set complexities.
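The rank-based selection of a hyperparameter configuration can be sketched as follows. Tie handling is not specified in the text, so this example assumes that tied configurations share the better rank; the function name, the metric keys, and the validation numbers are hypothetical.

```python
METRICS = ["natural", "adversarial", "stability", "sparsity"]


def select_configuration(val_metrics):
    """Pick the configuration with the lowest average rank over the 4 metrics.

    val_metrics: one dict per hyperparameter configuration, mapping each
    metric name to its validation value (higher is better for all four).
    """
    n = len(val_metrics)
    avg_rank = [0.0] * n
    for name in METRICS:
        column = [m[name] for m in val_metrics]
        for i, value in enumerate(column):
            # Rank 1 is best: one plus the number of strictly better values.
            rank = 1 + sum(1 for other in column if other > value)
            avg_rank[i] += rank / len(METRICS)
    best = min(range(n), key=lambda i: avg_rank[i])
    return best, avg_rank


# Hypothetical validation results for three configurations of one method.
val_metrics = [
    {"natural": 0.90, "adversarial": 0.50, "stability": 0.85, "sparsity": 10.0},
    {"natural": 0.88, "adversarial": 0.70, "stability": 0.86, "sparsity": 40.0},
    {"natural": 0.85, "adversarial": 0.60, "stability": 0.80, "sparsity": 60.0},
]
best_index, avg_rank = select_configuration(val_metrics)
```

Here the second configuration is selected: it is never the worst on any metric, even though it is the best on only two of them. The same ranking routine can then be applied to the 8 selected models on the testing set.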

Fig. 2 Average multi-objective rank

4.2 Image data sets

In this section, we evaluate all methods using the MNIST, Fashion-MNIST, and CIFAR10 data sets. For each method, we select the parameters on the validation set using the multi-objective metric introduced for the UCI data sets, and we report the performance across metrics. In Tables 2 and 3, we see that for Fashion-MNIST and MNIST, the HDL model outperforms the DL model for all objectives. In particular, HDL achieves higher accuracy using only around 70% of the parameters. The results for the CIFAR10 data set (Table 4) differ somewhat, since adding sparsity slightly hurts natural accuracy. However, the accuracy achieved by the HDL model is comparable to that achieved by the non-sparse models, and the number of parameters is reduced by 47%.

Table 2 Results for the Fashion-MNIST data set. For each method, the parameters with the best average rank in the validation set were chosen
Table 3 Results for the MNIST data set. For each method, the parameters with the best average rank in the validation set were chosen
Table 4 Results for the CIFAR10 data set. For each method, the parameters with the best average rank in the validation set were chosen

4.3 Computational times

Since modifying the loss function often affects the training time, we quantify the slowdown for all the methods in the HDL framework. Specifically, for each of the 45 UCI data sets as well as the 3 image data sets introduced in the previous section, we calculate how many times slower each method is than the nominal DL approach, in terms of both batches per second and number of iterations needed. The average slowdown factors are shown in Table 5. We observe that robustness and sparsity each decrease the number of batches per second by approximately a factor of 3, while stability preserves the speed of the DL approach. In addition, since we used 5 training rounds for the methods incorporating sparsity, they require 5 times as many training iterations as the other methods. On average, the HDL method is 16 times slower than the nominal DL approach, and methods that do not optimize for sparsity increase the computational time by a factor of less than 3.
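As a rough illustration of how such slowdown factors can be computed, the sketch below separates the throughput factor from the iteration factor. The function name and the input numbers are hypothetical, and combining the two factors by multiplication is an assumption rather than the procedure confirmed by the text.

```python
def slowdown_factors(base_batches_per_sec, method_batches_per_sec,
                     base_iterations, method_iterations):
    """Return (throughput factor, iteration factor, combined factor).

    Each factor states how many times slower the method is than the
    baseline. The combined factor assumes total time is proportional to
    iterations divided by batches per second.
    """
    throughput = base_batches_per_sec / method_batches_per_sec
    iterations = method_iterations / base_iterations
    return throughput, iterations, throughput * iterations

# Hypothetical measurements: the method processes a third as many batches
# per second and needs five times as many iterations as the baseline.
throughput, iterations, combined = slowdown_factors(300.0, 100.0, 1000, 5000)
print(throughput, iterations, combined)
```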

Table 5 Average slowdown factors of computational time with respect to the nominal DL method

4.4 SHAP values

To gain a deeper understanding of the interplay between individual loss components (robustness, stability, sparsity) and the metrics we measure, we employ the SHAP values method (Lundberg & Lee, 2017). We compute the SHAP values for each UCI data set and average the results over three data set categories: Low Accuracy (\(<80\%\)), Medium Accuracy (\(80\%\)-\(95\%\)), and High Accuracy (\(>95\%\)), with 15 data sets each. The results are shown in Fig. 3.
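To make the attribution idea concrete, the following sketch computes exact Shapley values, the quantity that SHAP estimates, for the three binary loss components. It assumes that each of the 8 methods corresponds to one subset of the components and that a single metric score is available per method. The scores are invented for illustration, and this is not necessarily the procedure used to produce Fig. 3.

```python
from itertools import permutations

COMPONENTS = ("robust", "stable", "sparse")

def shapley_values(value):
    """Exact Shapley values for a score defined on every subset of COMPONENTS.

    value maps frozenset(active components) -> metric score.
    """
    contributions = {c: 0.0 for c in COMPONENTS}
    orderings = list(permutations(COMPONENTS))
    for order in orderings:
        active = frozenset()
        for component in order:
            extended = active | {component}
            # Marginal gain from adding this component to the current subset.
            contributions[component] += value[extended] - value[active]
            active = extended
    return {c: total / len(orderings) for c, total in contributions.items()}

# Hypothetical adversarial accuracies for the 8 methods on one data set.
scores = {
    frozenset(): 0.50,                               # DL
    frozenset({"robust"}): 0.70,                     # Robust
    frozenset({"stable"}): 0.55,                     # Stable
    frozenset({"sparse"}): 0.52,                     # Sparse
    frozenset({"robust", "stable"}): 0.76,           # Stable+Robust
    frozenset({"robust", "sparse"}): 0.74,           # Robust+Sparse
    frozenset({"stable", "sparse"}): 0.58,           # Stable+Sparse
    frozenset({"robust", "stable", "sparse"}): 0.80, # HDL
}

phi = shapley_values(scores)
print(phi)
```

By construction, the three values sum to the difference between the HDL score and the nominal DL score, so each value can be read as the share of the total improvement attributed to one component.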

Fig. 3 SHAP values on various metrics across different UCI data set categories. Blue/red indicates that the feature has a positive/negative SHAP value on a specific category of UCI data set

Our findings confirm that robustness, stability, and sparsity techniques improve the corresponding metrics across all data set categories. More intriguingly, these techniques also positively impact metrics beyond their intended purposes. For example, sparsity and stability enhance adversarial accuracy, while robustness and stability yield sparser networks. This indicates that combining techniques does not necessarily have adverse effects and that it is feasible to attain networks with good performance across all metrics. Additionally, the benefits of these techniques are more pronounced in data sets with low initial accuracy, particularly for the accuracy and stability metrics. Lastly, we observe that sparsity generally hurts accuracy and stability, although this effect varies considerably across data sets, as observed in Sect. 4.1.

4.5 Prescriptive approach

In this section, we develop a prescriptive approach that allows users to choose a training loss function based on the specific objective they wish to maximize, which can be a single evaluation metric or a weighted combination of several metrics. Depending on the data set characteristics and the performance scores of the nominal DL model, we propose a tree-based recommendation model to suggest the most suitable HDL loss function for optimal results with respect to the desired objective.

We train our models using an Optimal Policy Tree (OPT) algorithm (Amram et al., 2022), which uses observational data of the form \(\{({\textbf{x}}_i, y_i, z_i)\}\). While it is possible to include variability and complexity indicators of the data set as part of \({\textbf{x}}_i\) (Lorena et al., 2019), given the extensive and diverse range of data sets under consideration, we choose to capture complexity using the performance metrics achieved by the nominal DL approach on the corresponding classification tasks. In our case, each observation (i.e., data set) i encompasses:

  • Data set features \({\textbf{x}}_i \in {\mathbb {R}}^{8}\): number of features, number of target classes, nominal DL accuracy, nominal DL adversarial accuracy with \(\rho = 0.001\), nominal DL adversarial accuracy with \(\rho = 0.01\), nominal DL adversarial accuracy with \(\rho = 0.1\), nominal DL stability, nominal DL worst case accuracy.

  • Prescriptions \(z_i\in \{1,\ldots ,8\}\): DL, Robust, Stable, Sparse, Robust+Sparse, Stable+Sparse, Stable+Robust, HDL.

  • Outcomes \(\varvec{y}_i\in {\mathbb {R}}^8\), which represent the performance improvement of each method compared to the nominal DL model with respect to the metric set by the user.

Our prescriptive task is to find the optimal policy that, given the information \({\textbf{x}}\) of a data set, prescribes the method z leading to the best metric score y. We randomly split the 45 UCI data sets into a training set (40 data sets) and a test set (5 data sets from different difficulty levels). We cross-validated the optimal tree depth and complexity using the training set.
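The prescriptive task can be illustrated with a simplified sketch. This is not the OPT algorithm of Amram et al. (2022): it is an exhaustive search over a single split (a depth-one tree), assuming that the outcome of every method is observed for every data set. The helper names, the two features, the three methods, and all numbers are hypothetical.

```python
def best_leaf(outcomes):
    """Return the prescription with the largest total outcome in a leaf."""
    n_methods = len(outcomes[0])
    totals = [sum(row[z] for row in outcomes) for z in range(n_methods)]
    z = max(range(n_methods), key=totals.__getitem__)
    return z, totals[z]

def fit_depth_one_policy(X, Y):
    """Search every split x[j] < t; return (total, j, t, z_left, z_right)."""
    z_all, total_all = best_leaf(Y)
    best = (total_all, None, None, z_all, z_all)  # baseline without a split
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # midpoint between consecutive feature values
            left = [y for x, y in zip(X, Y) if x[j] < t]
            right = [y for x, y in zip(X, Y) if x[j] >= t]
            z_left, total_left = best_leaf(left)
            z_right, total_right = best_leaf(right)
            if total_left + total_right > best[0]:
                best = (total_left + total_right, j, t, z_left, z_right)
    return best

def prescribe(policy, x):
    """Apply a fitted depth-one policy to the features of a new data set."""
    _, j, t, z_left, z_right = policy
    if j is None:
        return z_left
    return z_left if x[j] < t else z_right

# Hypothetical features: [nominal accuracy, nominal adversarial accuracy].
X = [[0.72, 0.10], [0.75, 0.40], [0.95, 0.15], [0.97, 0.45]]
# Hypothetical improvements over nominal DL for methods 0=DL, 1=Stable,
# 2=Stable+Robust (three methods instead of eight, for brevity).
Y = [[0.0, 0.08, 0.03], [0.0, 0.06, 0.02],
     [0.0, -0.01, 0.01], [0.0, -0.02, 0.02]]

policy = fit_depth_one_policy(X, Y)
print(policy, prescribe(policy, [0.80, 0.20]))
```

In this toy example the search splits on nominal accuracy, prescribing one method for the harder data sets and another for the easier ones. The OPT algorithm additionally handles deeper trees and tunes tree complexity, which this sketch omits.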

Figures 4 and 5 represent the OPTs obtained for maximizing two different objectives: natural accuracy and adversarial accuracy. The tree in Fig. 4 highlights that the Stable and Stable+Robust methods are the best suited to obtain high natural accuracy, with the former being preferred when the nominal DL approach has very low adversarial accuracy (\(\rho = 0.1\)). To maximize robustness, the tree in Fig. 5 prescribes HDL, Stable+Robust, or Robust+Sparse depending on the adversarial accuracy achieved by the nominal DL method.

In addition, we obtained single-leaf trees when maximizing the stability and sparsity objectives. The recommended methods are Stable+Robust for optimizing stability and Stable+Sparse for maximizing sparsity. Lastly, HDL was always the prescribed method when the desired objective was the equally weighted average of all 4 previous metrics.

Fig. 4 Optimal policy tree for maximizing natural accuracy

Fig. 5 Optimal policy tree for maximizing robustness (\(\rho = 0.1\))

Finally, Table 6 reports the out-of-sample performance of these prescription trees on the 5 UCI data sets from the test set (cnae-9, hill-valley, libras-movement, magic-gamma, and thyroid-ann). We emphasize that the performance of the prescribed methods is higher than that of the nominal DL approach across all objectives and data sets, and it often matches the performance of the best method.

Table 6 Performance of prescription trees on the testing set

4.6 Significance analysis

To further validate the improvements achieved by the HDL framework, we analyze the significance of our results with one-sided Welch's t-tests, which do not assume equal variances across groups. Specifically, for each evaluation metric and each leaf of the corresponding optimal prescriptive tree, we consider all the UCI and image data sets that fall within that leaf. For those data sets, we test the null hypothesis that the average performance achieved by the prescribed method is equal to that achieved by the nominal DL approach, against the alternative hypothesis that the average performance achieved by the prescribed method is higher. As shown in Table 7, all p-values are below the 0.05 significance level, indicating that the prescribed methods achieve statistically significantly higher performance than the nominal DL approach across all performance metrics.
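A test of this kind can be run with SciPy's `ttest_ind`, as in the sketch below. The two samples are invented scores for the data sets in a single hypothetical leaf, and the `alternative` argument requires SciPy 1.6 or later.

```python
from scipy import stats

# Hypothetical per-data-set scores within one leaf of a prescriptive tree.
prescribed = [0.81, 0.77, 0.92, 0.85, 0.74, 0.88]
nominal = [0.76, 0.70, 0.91, 0.80, 0.69, 0.84]

# equal_var=False selects Welch's t-test; alternative="greater" tests
# whether the mean of the first sample exceeds the mean of the second.
result = stats.ttest_ind(prescribed, nominal, equal_var=False,
                         alternative="greater")
t_stat, p_value = result.statistic, result.pvalue
print(t_stat, p_value)
```

The null hypothesis of equal means would be rejected at the 0.05 level whenever `p_value` falls below 0.05.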

Table 7 Significance results for the null hypothesis that the average performance achieved by the prescribed method is equal to that achieved by the nominal DL approach, against the alternative hypothesis that the average performance achieved by the prescribed method is higher

5 Conclusions

This paper presents a unifying methodology to obtain deep learning models that are accurate, robust, stable, and sparse by appropriately modifying the objective function to be minimized. Across multiple computational experiments, we show how these 4 metrics interact and demonstrate that we can often train models that simultaneously improve adversarial accuracy, worst-case accuracy, and parameter sparsity without sacrificing natural accuracy. Finally, we provide prescriptive trees that use general features of the data set (e.g., dimension, number of target classes, and nominal accuracy) to recommend which method is most appropriate for the desired objective, and we show that the improvements achieved by the prescribed methods are statistically significant.

For future research, we aim to explore how HDL performs with respect to other data set indicators, such as variability and complexity, as this could offer further guidance on which method to select. We would also like to test our framework in real-world applications, for instance in healthcare, where trustworthy models are crucial and memory constraints often apply in practice. For such applications, improving the interpretability of the HDL framework would also be essential. We deem adversarial robustness, stability, and sparsity critical qualities in the development of more reliable machine learning algorithms, and we hope this work will motivate further research in this important field.