Holistic Deep Learning

This paper presents a novel holistic deep learning framework that simultaneously addresses the challenges of vulnerability to input perturbations, overparametrization, and performance instability from different train-validation splits. The proposed framework holistically improves accuracy, robustness, sparsity, and stability over standard deep learning models, as demonstrated by extensive experiments on both tabular and image data sets. The results are further validated by ablation experiments and SHAP value analysis, which reveal the interactions and trade-offs between the different evaluation metrics. To support practitioners applying our framework, we provide a prescriptive approach that offers recommendations for selecting an appropriate training loss function based on their specific objectives. All the code to reproduce the results can be found at https://github.com/kimvc7/HDL.


Introduction
Neural networks have become increasingly popular due to their remarkable achievements in computer vision and natural language processing. Their generalization power has been demonstrated in wide-ranging applications, from classifying photos to recommending products. However, neural networks face challenges in real-world applications for high-stakes decision-making, including healthcare, policy-making, and autonomous driving.
First, many standard neural networks are not robust: they can be easily fooled by natural or artificial noise in the input data (Szegedy et al, 2014), making them vulnerable to perturbations that may arise in real-world applications. Moreover, neural networks, like other machine learning models, often suffer from instability during the training process: different train-validation splits can generate models with very different performance (May et al, 2010; Xu and Goodacre, 2018). This reduces policymakers' trust in these models and hinders post-hoc interpretations. Another critical difficulty is that neural networks are not sparse: the high number of parameters prevents efficient computation and storage (Thompson et al, 2020). Most neural networks have millions of non-zero parameters that must be stored and accessed for evaluation, which is problematic in many decision-making settings with limited hardware capabilities. Reducing the number of parameters could make neural networks applicable in a broader range of scenarios (Changpinyo et al, 2017; Narang et al, 2017).
The questions of improving robustness, stability, and sparsity have all been studied previously in the neural network literature. However, they have been almost exclusively studied in isolation, with limited understanding of the trade-offs between these desired qualities. This paper aims to address all these objectives simultaneously through a novel comprehensive methodology named Holistic Deep Learning (HDL). In particular, HDL carefully combines state-of-the-art techniques that address these individual challenges, and we demonstrate their collective efficacy through extensive experiments on diverse data sets. Our findings provide a promising pathway toward developing machine learning models that are efficient and reliable across many dimensions for real-world applications. Specifically, our contributions are as follows:

1. We design HDL, a novel framework that jointly optimizes for neural network robustness (adversarial accuracy), stability (worst accuracy across train-validation splits), and sparsity (fraction of parameters with value zero) by appropriately modifying the objective function.

2. Through extensive ablation experiments and SHAP value analysis (Lundberg and Lee, 2017) across 45 UCI data sets (Dua and Graff, 2017) and 3 image data sets (MNIST (Deng, 2012), Fashion-MNIST (Xiao et al, 2017), and CIFAR10 (Krizhevsky et al, 2009)), we analyze the individual performance of each metric as well as the interactions and trade-offs between them. We corroborate that imposing robustness, stability, and sparsity improves the corresponding metrics across all data sets. In addition, we show that:
• imposing stability and sparsity further improves robustness,
• imposing stability and robustness further improves sparsity,
• imposing robustness further improves stability,
• imposing stability and robustness further improves natural accuracy.
The effect of sparsity on natural accuracy is more complex and varies highly across data sets. However, we show that it is often possible to simultaneously improve robustness, stability, and sparsity without sacrificing natural accuracy.

3. We propose a prescriptive approach that provides recommendations on selecting the appropriate loss function depending on the practitioner's objective. In particular, simultaneously imposing robustness, stability, and sparsity in the loss function leads to the best results when jointly optimizing for all the metrics.
The paper is organized as follows: Section 2 outlines the current literature on robust, sparse, and stable methods; Section 3 describes the Holistic Deep Learning framework; and Section 4 presents the results of the computational experiments.
2 Related Work

Robust Neural Networks
Many state-of-the-art deep neural networks are highly vulnerable to small perturbations in the input data (Szegedy et al, 2014), which can be a threat to real-world applications like self-driving cars or face recognition. Adversarial robustness evaluates a neural network's resistance against these altered inputs, referred to as adversarial examples, which are intentionally designed to worsen the network's performance (Goodfellow et al, 2014; Carlini and Wagner, 2017; Madry et al, 2017).
Multiple methods have been developed in recent years to enhance the adversarial robustness of neural networks. One of the most popular heuristics is augmenting the data set during training with adversarial examples (Madry et al, 2017; Goodfellow et al, 2014). Others include neuron randomization (Prakash et al, 2018; Xie et al, 2017), input space projections (Lamb et al, 2018; Kabilan et al, 2021; Ilyas et al, 2017), and regularization (Bertsimas et al, 2021; Ross and Doshi-Velez, 2018; Hein and Andriushchenko, 2017; Yan et al, 2018). A less common but more theoretically rigorous approach is to minimize a provable upper bound of the loss achieved with adversarial examples (Raghunathan et al, 2018; Singh et al, 2018; Zhang et al, 2018; Weng et al, 2018; Dvijotham et al, 2018; Lecuyer et al, 2019; Cohen et al, 2019; Anderson et al, 2020; Bertsimas et al, 2021).
While these methods successfully improve the network's robustness, the extent to which they do so often depends on the data set, the network size, and the magnitude of the input perturbations. In particular, heuristic methods generally work well for small perturbations, while the upper bound methods yield better results when the input noise is larger (Bertsimas et al, 2021; Athalye et al, 2018). However, there is a trade-off between effectiveness and efficiency: the methods providing the strongest adversarial robustness are often computationally demanding, making them challenging to implement for large data sets or complex network architectures.

Sparse Neural Networks
In machine learning, sparse models make predictions based on a limited number of parameters. Sparsity is often desirable, as it may save memory, enhance model interpretability, and reduce overfitting (Bertsimas et al, 2020).
There are two typical approaches to sparsity in deep learning. The first one, train-then-sparsify, consists of removing unnecessary neurons or connections after training the network, sometimes followed by retraining (Janowsky, 1989; LeCun et al, 1989). This approach has been widely investigated, and several schemes exist to choose which connections to prune (Hoefler et al, 2021). Han et al (2015), for example, propose pruning the connections with the smallest weights. Other methods include formulating a convex optimization problem (Aghasi et al, 2020), removing filters whose total absolute sum is low (Li et al, 2016), and eliminating channels that have limited impact on the network's discriminatory ability (Zhuang et al, 2018). The second approach, sparsify-during-training, learns a sparse architecture while training the network. Multiple methodologies exist (Bellec et al, 2017; Mocanu et al, 2017; Mostafa and Wang, 2019), including approximating the ℓ0 norm with continuous functions and adding a regularization term to the loss function (Louizos et al, 2017; Savarese et al, 2020). We refer the reader to Gale et al (2019) and Hoefler et al (2021) for more comprehensive surveys on sparsity.
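As an illustration of the train-then-sparsify approach, the magnitude-based pruning rule of Han et al (2015) can be sketched in a few lines. The weight matrix and sparsity level below are toy values, not taken from the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude.
    Ties at the threshold may prune slightly more than requested."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(weights) <= threshold] = 0.0
    return pruned

W = np.array([[0.9, -0.05],
              [0.01, -1.2]])
W_sparse = magnitude_prune(W, sparsity=0.5)  # keeps only the two largest weights
```

In practice, this step is followed by a retraining (fine-tuning) phase on the surviving connections.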

Stable Neural Networks
The stochastic nature of data samples can lead to instability and high dependence of machine learning models on the specific train-validation split. This can negatively impact the interpretability of the resulting model and its ability to make reliable predictions (Bertsimas and Paskov, 2020), a key factor in establishing trust in any algorithm.
The sensitivity of machine learning models to the choice of training split has mostly been studied through the lens of cross-validation and distributionally robust optimization. Cross-validation can be used to measure the variability from the selection of the training split, but at a significant increase in computational cost (Krogh and Vedelsby, 1994; Hastie et al, 2001) that is often intractable for deep learning settings. Distributionally robust optimization has been used to quantify the worst-case generalization error in the presence of shifts in distribution or regime (Staib and Jegelka, 2019; Goldwasser et al, 2020; Sagawa et al, 2019), but it often requires pre-defined groups over the training data and expensive group annotations for each data sample to avoid overly pessimistic uncertain distributions (Sagawa et al, 2019; Liu et al, 2021). A different approach has been studied by Bertsimas and Paskov (2020) and Bertsimas et al (2022), who instead optimize over the worst training set of fixed size without making any probabilistic assumptions. Although their method was presented in the context of linear and tree-based models, their framework also applies to neural networks.
3 The Holistic Deep Learning Approach

The HDL Framework
We introduce the HDL framework for a classification problem over points x ∈ R^M whose target y ∈ [K] is one of K different classes (we use the notation [n] to denote the set {1, . . . , n}). We illustrate our approach on a fully-connected neural network for simplicity of notation, but the framework remains the same for convolutional neural networks.
For x ∈ R, define [x]_+ = max{0, x} (the ReLU function). Given weight matrices W^ℓ ∈ R^{r_{ℓ−1} × r_ℓ} and bias vectors b^ℓ ∈ R^{r_ℓ} for ℓ ∈ [L], with r_0 = M and r_L = K, the corresponding feed-forward neural network with L layers and ReLU activation functions is defined by the equations

z^0(W, x) = x,   z^ℓ(W, x) = [W^{ℓ⊤} z^{ℓ−1}(W, x) + b^ℓ]_+,  ℓ ∈ [L − 1],   (1)
z^L(W, x) = W^{L⊤} z^{L−1}(W, x) + b^L,   (2)

where W denotes the parameters (W^ℓ, b^ℓ) for all ℓ ∈ [L]. Consider a data set {(x_n, y_n)}_{n=1}^N, where y_n ∈ [K] is the target class of x_n. For each point x_n, the class predicted by the network is arg max_k z^L_k(W, x_n). The nominal DL approach is to minimize the cross-entropy loss of the network z^L described in Eq. (2), which can be written as

min_W (1/N) Σ_{n=1}^N log( Σ_{k∈[K]} exp( Δe_{y_n k}^⊤ z^L(W, x_n) ) ),   (3)

where Δe_{y_n k} = e_k − e_{y_n} and e_k refers to the one-hot vector with a 1 in the k-th coordinate and 0 everywhere else. In our HDL framework, we propose instead to solve the following optimization problem:

min_{W, s, θ} θ + (1/a) Σ_{n=1}^N [ log( Σ_{k∈[K]} exp( Δe_{y_n k}^⊤ z^L(W ⊙ σ(s), x_n) + ρ ‖∇_x Δe_{y_n k}^⊤ z^L(W ⊙ σ(s), x_n)‖_1 ) ) − θ ]_+ + λ Σ_j σ(s_j),   (4)

where ⊙ denotes the element-wise product, σ is the standard sigmoid function, z^L(·, x) was defined in (2), λ (resp. ρ) is the regularization coefficient corresponding to the sparsity (resp. robustness) loss component, and a is the size of the data subsets used for the stability requirement (see Section 2.3). We observe that robustness adds a term to the output, while stability and sparsity add new parameters (θ and s, respectively) to be optimized. This loss function allows us to simultaneously train robust, sparse, and stable feed-forward neural networks at scale. In the next section, we provide more details about each metric.
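The network equations and the nominal cross-entropy objective can be sketched in a few lines of NumPy. The layer sizes and random weights below are illustrative placeholders, not the architectures used in the experiments:

```python
import numpy as np

def forward(W_list, b_list, x):
    """Feed-forward ReLU network: hidden layers z = [W.T z + b]_+, linear output."""
    z = x
    for W, b in zip(W_list[:-1], b_list[:-1]):
        z = np.maximum(0.0, W.T @ z + b)   # ReLU hidden layers
    return W_list[-1].T @ z + b_list[-1]   # logits z^L

def nominal_loss(W_list, b_list, x, y):
    """Cross-entropy written as log sum_k exp(Delta e^T z^L), i.e. z_k - z_y."""
    z = forward(W_list, b_list, x)
    return np.log(np.sum(np.exp(z - z[y])))

rng = np.random.default_rng(0)
W_list = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]  # toy 2-layer net
b_list = [np.zeros(8), np.zeros(3)]
x = rng.normal(size=4)
loss = nominal_loss(W_list, b_list, x, y=1)  # nonnegative scalar
```

The HDL objective then wraps this per-sample loss with the robustness, stability, and sparsity terms described in the following subsections.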

Robustness
This section describes our method to introduce the robust component into neural network training. Since our ultimate goal is to incorporate the sparsity, robustness, and stability of neural networks together in a tractable way, we avoid training algorithms that significantly increase the training time or the algorithm's complexity. We follow the approach of Bertsimas et al (2021), using a linear approximation of the neural network to estimate the robust objective. This approach is simple to implement, produces good adversarial accuracy, and does not require the extensive training time of other algorithms.
For a given (x, y) pair, the robust problem using the cross-entropy loss and ℓ∞-norm uncertainty sets can be upper bounded as

max_{‖δ‖_∞ ≤ ρ} log( Σ_{k∈[K]} exp( Δe_{yk}^⊤ z^L(W, x + δ) ) ) ≤ log( Σ_{k∈[K]} exp( max_{‖δ‖_∞ ≤ ρ} Δe_{yk}^⊤ z^L(W, x + δ) ) ).   (5)

Since z^L_k(W, x) is piece-wise linear, we expect the outputs z^L_k(W, x) and z^L_k(W, x + δ) to lie in the same linear piece when x + δ is close to x. In other words, the linear approximation

Δe_{yk}^⊤ z^L(W, x + δ) ≈ Δe_{yk}^⊤ z^L(W, x) + δ^⊤ ∇_x Δe_{yk}^⊤ z^L(W, x)   (6)

is exact for small enough δ. Therefore, maximizing the right-hand side of (6) over ‖δ‖_∞ ≤ ρ, we approximate the upper bound in (5) as

log( Σ_{k∈[K]} exp( Δe_{yk}^⊤ z^L(W, x) + ρ ‖∇_x Δe_{yk}^⊤ z^L(W, x)‖_1 ) ).   (7)

Even though the expression in Eq. (7) is not always an upper bound of Eq. (5) for an arbitrary value of ρ, Bertsimas et al (2021) experimentally show that the average loss obtained using this expression is generally an upper bound of the average adversarial loss. In fact, for small ρ, their experiments demonstrate that this approach achieves results competitive with state-of-the-art methods while requiring significantly less computational time across various tabular and image data sets. However, we emphasize that the methodology developed in this paper could also be combined with other methods for robust training, like adversarial training or upper bound minimization, which might be more appropriate for large uncertainty sets.
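To make the linearization concrete, consider a single linear layer, for which the approximation is exact: the gradient of each margin with respect to x is constant, and the robust loss simply adds ρ times its ℓ1 norm to each logit margin. This is an illustrative sketch with toy dimensions, not the paper's implementation:

```python
import numpy as np

def robust_loss_linear(W, b, x, y, rho):
    """Linearized robust cross-entropy for z = W.T x + b (exact in the linear case)."""
    z = W.T @ x + b
    margins = z - z[y]                        # Delta e^T z: z_k - z_y per class k
    grads = W - W[:, [y]]                     # grad_x of each margin (one column per k)
    penalty = rho * np.abs(grads).sum(axis=0) # rho * l1 norm of the gradient
    return np.log(np.sum(np.exp(margins + penalty)))

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))   # 5 features, 3 classes (toy sizes)
b = np.zeros(3)
x = rng.normal(size=5)
nominal = np.log(np.sum(np.exp(W.T @ x + b - (W.T @ x + b)[2])))
robust = robust_loss_linear(W, b, x, y=2, rho=0.1)
```

Since the added penalty is nonnegative, the robust loss is never smaller than the nominal one; the gap grows with ρ.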

Sparsity
In this work, we use the specific retraining procedure proposed by Savarese et al (2020), which deterministically approximates the ℓ0 regularization using a sequence of sigmoid functions added as a penalty term to the loss function. Notably, the implementation is easily compatible with our robustness and stability requirements, since this methodology relies on a penalty term added to the loss function. Therefore, we can use gradient descent to simultaneously optimize the objective function comprising the robustness, stability, and sparsity penalties. Adding ℓ0 regularization explicitly penalizes the number of non-zero weights in the model to induce sparsity. However, the ℓ0 norm a priori induces a non-convex and non-differentiable loss function R(W):

R(W) = (1/N) Σ_{n=1}^N L(z^L(W, x_n), y_n) + λ Σ_{j=1}^{|W|} 1[w_j ≠ 0],   (8)

where |W| is the number of parameters, w_j is the j-th coordinate of W, λ is the regularization weight, and L is a loss function (e.g., the cross-entropy loss).
The goal is to relax the discrete nature of the ℓ0 penalty so as to preserve efficient continuous optimization while allowing for exact zeros in the neural network weights. To do this, Savarese et al (2020) propose to first parameterize the weights through the Heaviside step function H(·) applied to auxiliary variables s_j, and then approximate the non-differentiable step function with the sigmoid function: σ(βs_j) → H(s_j) as β → ∞. Therefore, β is the hardness parameter that controls how close the approximation is to the ℓ0 regularization, and the final loss function can be written as

min_{W, s} (1/N) Σ_{n=1}^N L(z^L(W ⊙ σ(βs), x_n), y_n) + λ Σ_{j=1}^{|W|} σ(βs_j).   (9)

To achieve a sparse network, we use this loss function (9) over multiple training rounds to gradually reach a sparse initialization before training the final sparse neural network. To obtain each initialization before a new training round, we start with our initialized auxiliary sparsity parameters s^0 and hardness β = 1. Over the T training iterations, we gradually increase β until it reaches a maximum value β_max when the training procedure is completed with sparsity parameters s^T. Then, we take s^0 = min(β_max s^T, s^0) (element-wise) to generate the new initialization for the next round of training. This minimization essentially keeps the information of the suppressed weights, i.e., those with σ(βs_j) ≈ 0, while reverting the weights that were not suppressed to their starting position. This process is repeated over multiple rounds to find better and sparser initializations for the neural network.
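The soft gate, the differentiable ℓ0 surrogate, and the round-to-round reset described above can be sketched as follows. The values of s, β_max, and λ are toy choices for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_weights(w, s, beta):
    """Soft gate sigma(beta*s) approximating the Heaviside mask H(s)."""
    return w * sigmoid(beta * s)

def l0_penalty(s, beta, lam):
    """Differentiable surrogate for lam * ||W||_0: lam * sum_j sigma(beta*s_j)."""
    return lam * np.sum(sigmoid(beta * s))

def reset_mask(s_T, s_0, beta_max):
    """Round-to-round reset: keep suppressed gates, revert the rest toward s^0."""
    return np.minimum(beta_max * s_T, s_0)

s_0 = np.zeros(4)                           # toy initialization
s_T = np.array([-3.0, 2.0, -0.5, 1.0])      # toy values learned after one round
s_next = reset_mask(s_T, s_0, beta_max=100.0)
penalty = l0_penalty(s_next, beta=1.0, lam=0.1)
```

After the reset, the suppressed coordinates (negative s) stay strongly negative, so their gates remain near 0, while the surviving coordinates are reverted to their initialization value of 0.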
We implement the methodology as suggested by Savarese et al (2020). In the results section, we measure sparsity as the percentage of neuron connections (weights) set to 0.

Stability
Using the measure of stability defined in Section 2.3, we apply the methodology developed by Bertsimas et al (2022) for building stable neural networks. At a high level, this corresponds to constructing a model that is robust to the specific subset of data used to train it. One way to think about this is to view the training data set as a sample from the true data distribution and then require the model to be robust to the specific sample. Considering the partition of the data into training/validation sets as a sampling mechanism from this true data distribution (each split choice gives a different training set), we want to build models that are robust to every partition.
To achieve this, we first associate each observation (x_n, y_n) with a binary variable z_n, n ∈ [N], indicating whether or not (x_n, y_n) is part of the training set. We then choose the network's parameters to minimize the worst-case loss over all possible allocations of these z_n's, resulting in a model that is explicitly built to do well not just over one training set, but over all possible training sets. We start from the same minimization problem introduced in Section 3.1, i.e.,

min_W (1/N) Σ_{n=1}^N ℓ_n(W),

where ℓ_n(W) denotes the loss of observation (x_n, y_n). To obtain network stability, we require the model to be robust to every training set of fixed size a, which results in the following optimization problem:

min_W max_{z ∈ Z} (1/a) Σ_{n=1}^N z_n ℓ_n(W),   with Z = { z ∈ {0, 1}^N : Σ_{n=1}^N z_n = a }.

The value of a indicates the desired proportion between the sizes of the training and validation sets. For example, by setting a = 0.7N we recover the typical 70/30 training/validation split. Since the inner maximization problem is linear in z, it is equivalent to optimizing over the convex hull of Z. This implies that the binary constraints on z_n can be relaxed to 0 ≤ z_n ≤ 1, and the inner maximization problem becomes a linear program in the variables z_n. Computing its dual, we obtain that the value of the inner maximization problem is equal to

min_{θ, u ≥ 0} aθ + Σ_{n=1}^N u_n   subject to   θ + u_n ≥ ℓ_n(W)/a,  n ∈ [N].

Therefore, the stability problem becomes

min_{W, θ, u ≥ 0} aθ + Σ_{n=1}^N u_n   subject to   θ + u_n ≥ ℓ_n(W)/a,  n ∈ [N].

Note that the variables u_n can be solved in closed form as u_n = [ℓ_n(W)/a − θ]_+. The final minimization problem with stability then becomes

min_{W, θ} aθ + Σ_{n=1}^N [ℓ_n(W)/a − θ]_+,

which is now an unconstrained problem that can be solved with standard gradient-descent optimization algorithms.
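The equivalence between the worst-case subset average and its dual reformulation can be checked numerically on a toy example (the per-observation losses below are arbitrary illustrative values):

```python
import numpy as np
from itertools import combinations

def worst_subset_avg(losses, a):
    """Inner maximization by brute force: worst average loss over all subsets of size a."""
    return max(np.mean([losses[i] for i in idx])
               for idx in combinations(range(len(losses)), a))

def dual_form(losses, a, theta):
    """Dual reformulation: theta + (1/a) * sum_n [l_n - theta]_+."""
    return theta + np.maximum(np.array(losses) - theta, 0.0).sum() / a

losses = [0.3, 1.2, 0.7, 2.0, 0.1]
a = 3
direct = worst_subset_avg(losses, a)                 # mean of the 3 largest losses
thetas = np.linspace(0.0, 3.0, 3001)
dual = min(dual_form(losses, a, t) for t in thetas)  # minimize over theta on a grid
```

Both quantities equal the average of the a largest losses (here 1.3); the dual form replaces the combinatorial inner maximization with a single extra scalar variable θ that can be optimized by gradient descent alongside the network weights.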

Experiments
This section presents extensive computational experiments comparing the nominal DL approach (abbreviated DL) with the 7 other models resulting from our holistic methodology. We showcase the merit of our HDL framework and investigate the influence of each studied component (robustness, sparsity, and stability) on the overall performance across 4 evaluation metrics:
• Natural accuracy: Average accuracy on the testing set across the 10 different train-validation splits with respect to the original input data.
Table 1: Loss functions used for DL and all methods in the HDL framework.
• Adversarial robustness: Average adversarial accuracy on the testing set across the 10 different train-validation splits with respect to adversarial attacks resulting from perturbations of the original input data. We consider only attacks bounded in the ℓ∞ norm by some radius ρ, generated using Projected Gradient Descent as in Madry et al (2017).
• Stability: Worst accuracy on the testing set across the 10 different train-validation splits with respect to the original input data.
• Sparsity: Percentage of network parameters with value 0.
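A minimal sketch of an ℓ∞-bounded PGD attack in the spirit of Madry et al (2017), written for a linear classifier so that the gradient is available in closed form; the step size, iteration count, and toy problem dimensions are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def pgd_attack_linear(W, b, x, y, rho, alpha=0.02, steps=20):
    """PGD in the l_inf ball of radius rho around x, ascending the cross-entropy."""
    x_adv = x.copy()
    for _ in range(steps):
        z = W.T @ x_adv + b
        p = np.exp(z - z.max())
        p /= p.sum()                               # softmax probabilities
        grad = W @ (p - np.eye(len(b))[y])         # grad_x of cross-entropy (linear model)
        x_adv = x_adv + alpha * np.sign(grad)      # signed-gradient ascent step
        x_adv = x + np.clip(x_adv - x, -rho, rho)  # project back into the l_inf ball
    return x_adv

rng = np.random.default_rng(2)
W = rng.normal(size=(6, 3))   # 6 features, 3 classes (toy sizes)
b = np.zeros(3)
x = rng.normal(size=6)
x_adv = pgd_attack_linear(W, b, x, y=0, rho=0.1)
```

By construction the returned point stays within distance ρ of the original input in every coordinate; adversarial accuracy is the accuracy measured on such perturbed inputs.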
The exact optimization problem solved for each model results from combinations of the loss functions described in the previous section, and the specific formulations can be found in Table 1 above.

Data
We conducted experiments on classification tasks with 45 data sets from the UCI Machine Learning Repository (Dua and Graff, 2017). These data sets span various problem sizes and difficulties and form a representative sample of real-world tabular problems, with the largest data set having 245,056 observations and the highest number of features being 856. We also benchmarked our methodologies on three image data sets: MNIST, Fashion-MNIST, and CIFAR10.

Implementation
Our code is written in Python 3.8 (Van Rossum and Drake, 2009). Neural networks are implemented using TensorFlow v1 (Abadi et al, 2015). We trained each model on a system equipped with an Intel Xeon Gold 6248 processor with 4 CPU cores and one Nvidia Volta V100 GPU.

Training Methodology
For each data set, we used 20% of the data to obtain a fixed test set, and we applied 80%/20% train-validation splits to the remaining data points.Given a choice of model and evaluation metric, we selected the hyperparameters that led to the best average performance in the validation set for the metric in question.We then reported the average performance of the chosen parameter configuration on the test set with respect to the given metric.For all evaluation metrics, the average performance is computed over 10 training-validation splits that are random but identical for all experiments for a fair comparison.
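The hyperparameter-selection rule described above amounts to averaging each configuration's validation metric over the 10 splits and keeping the best configuration; a toy sketch with made-up accuracies (the third configuration illustrates how split-to-split variance is averaged out):

```python
import numpy as np

# Rows: hyperparameter configurations; columns: the 10 train-validation splits.
# Entries: validation accuracy of that configuration on that split (toy numbers).
val_acc = np.array([
    [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78, 0.80, 0.81, 0.82],
    [0.85, 0.84, 0.86, 0.83, 0.85, 0.84, 0.86, 0.85, 0.84, 0.85],
    [0.70, 0.90, 0.60, 0.95, 0.65, 0.92, 0.62, 0.93, 0.61, 0.94],
])

avg_per_config = val_acc.mean(axis=1)          # average over the 10 splits
best_config = int(np.argmax(avg_per_config))   # configuration reported on the test set
```

The chosen configuration is then evaluated on the held-out test set, averaged over the same 10 splits for every method, which keeps the comparison fair.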

Neural network architectures
For our experiments on UCI data sets, we used a feedforward neural network architecture with 2 hidden layers, each with 128 neurons and ReLU activations.
For our experiments on the image data sets, we used a convolutional neural network with the AlexNet architecture (Krizhevsky et al, 2012). We used the Glorot uniform initialization (Glorot and Bengio, 2010) for the network weights W and initialized the sparsity variables s^0 to 0.
We used a batch size of 32 for the UCI data sets and 64 for the image data sets.

UCI Data sets
We split the 45 UCI data sets into 6 roughly even-sized groups based on their difficulty level. Specifically, we consider the ranges 0%-70%, 70%-80%, 80%-90%, 90%-95%, 95%-98%, and 98%-100% of natural accuracy achieved by the nominal DL approach. We first investigate the performance of the HDL framework with respect to a single evaluation metric. In Figure 1, we evaluate all methods in terms of natural accuracy, adversarial accuracy with ρ = 0.1, stability, and sparsity.
Figures 1a and 1c show that the data sets for which the nominal approach achieves accuracy in the 70%-90% range benefit the most from the HDL framework (especially the Robust, Stable, and Stable+Robust models) when the evaluation metric is natural accuracy or stability. For the data sets with natural accuracy above 90%, none of the models significantly improve over the natural accuracy or stability achieved by the nominal DL model. However, for data sets in the 98%-100% range, sparsity slightly improves accuracy and robustness slightly helps stability.
Figure 1b shows the adversarial robustness achieved with perturbation parameter ρ = 0.1. We see a substantial adversarial robustness improvement in all methods that include the robust component. Moreover, combining robustness with stability and/or sparsity leads to higher adversarial accuracy than robustness alone. In terms of parameter sparsity, Figure 1d shows that all models with imposed sparsity (Sparse, Stable+Sparse, Robust+Sparse, and HDL) have a much lower percentage of nonzero parameters than the models without it. Importantly, both robustness and stability help achieve sparser models when combined with sparsity.

Since we are also interested in models that are simultaneously accurate, sparse, robust, and stable, we consider a multi-objective metric based on the rank of each method (ranks start at 1, with lower ranks corresponding to better performance). For each method, we rank all of its hyperparameter configurations 4 times, using the natural accuracy, adversarial accuracy, stability, and sparsity achieved on the validation set, respectively. Then, for each hyperparameter configuration, we compute the average rank across the 4 metrics and select the configuration with the method's best average rank. Finally, we rank the 8 selected models (one for each of the 8 methods) with respect to each evaluation metric on the testing set to obtain their out-of-sample average rank. As shown in Figure 2, all 7 models from the HDL framework outperform the nominal DL approach with respect to this holistic metric. Moreover, the HDL model typically achieves the best results across data set complexities.
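The average-rank selection rule can be sketched as follows; the metric scores are invented for illustration (for sparsity, a higher percentage of zeroed parameters ranks better):

```python
import numpy as np

def ranks(values, higher_is_better=True):
    """Rank 1 = best; ties broken by order of appearance."""
    vals = np.asarray(values, dtype=float)
    order = np.argsort(-vals if higher_is_better else vals)
    r = np.empty(len(vals), dtype=int)
    r[order] = np.arange(1, len(vals) + 1)
    return r

# Toy validation scores for 3 hyperparameter configurations of one method.
natural  = [0.90, 0.88, 0.85]
adv      = [0.60, 0.75, 0.70]
stable   = [0.85, 0.86, 0.80]
sparsity = [0.10, 0.50, 0.90]

rank_matrix = np.stack([ranks(m) for m in (natural, adv, stable, sparsity)])
avg_rank = rank_matrix.mean(axis=0)   # average rank per configuration
chosen = int(np.argmin(avg_rank))     # configuration with the best average rank
```

Here the second configuration wins: it is never the best on any single metric, but it is consistently near the top, which is exactly what the multi-objective criterion rewards.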

Image Data Sets
In this section, we evaluate all methods on the MNIST, Fashion-MNIST, and CIFAR10 data sets. For each method, we select the parameters based on the multi-objective metric used for the UCI data sets, computed on the validation set, and report the performance across metrics. In Tables 2 and 3, we see that for MNIST and Fashion-MNIST, the HDL model outperforms the DL model on all objectives. In particular, HDL achieves higher accuracy using only around 70% of the parameters. The results for the CIFAR10 data set (Table 4) are somewhat different, since adding sparsity slightly hurts natural accuracy. However, the accuracy achieved by the HDL model is comparable to that of the non-sparse models, and the number of parameters is reduced by 47%.

Table 4: Results for the CIFAR10 data set. For each method, the parameters with the highest average rank in the validation set were chosen.

Computational Times
Since modifying the loss function often affects training time, we quantify the slowdown effect for all the methods in the HDL framework. Specifically, for each of the 45 UCI data sets as well as the 3 image data sets from the previous section, we calculate how many times slower each method is compared to the nominal DL approach, in terms of both batches per second and the number of iterations needed. The average slowdown factors are shown in Table 5. We observe that robustness and sparsity each decrease the number of batches per second by approximately a factor of 3, while stability preserves the same speed as the DL approach. In addition, since we used 5 training rounds for the methods incorporating sparsity, they require 5 times as many training iterations as the other methods. On average, the HDL method is only 16 times slower, and the methods that do not optimize for sparsity increase the computational time by less than a factor of 3.

SHAP Values
To gain a deeper understanding of the interplay between the individual loss components (robustness, stability, sparsity) and the metrics we measure, we employ the SHAP values method (Lundberg and Lee, 2017). We compute the SHAP values for each UCI data set and average the results over three data set categories: Low Accuracy (< 80%), Medium Accuracy (80%-95%), and High Accuracy (> 95%), with 15 data sets each. The results are shown in Figure 3. Our findings confirm that the robustness, stability, and sparsity techniques improve the corresponding metrics across all data set categories. More intriguingly, these techniques also positively impact metrics beyond their intended purposes. For example, sparsity and stability enhance adversarial accuracy, while robustness and stability yield sparser networks. This indicates that combining techniques does not necessarily result in adverse effects and that it is feasible to attain networks with good performance across all metrics. Additionally, the benefits of these techniques are more pronounced on data sets with low initial accuracy, particularly for the accuracy and stability metrics. Lastly, we observe that sparsity generally hurts accuracy and stability, although this varies highly across data sets, as observed in Section 4.1.

Prescriptive Approach
In this section, we develop a prescriptive approach that allows users to choose a training loss function based on the specific objective they wish to maximize, which can be a single evaluation metric or a weighted combination of several metrics. Based on the data set characteristics and the performance scores of the nominal DL model, we propose a tree-based recommendation model that suggests the most suitable HDL loss function for the desired objective.
We train our models using the Optimal Policy Tree (OPT) algorithm (Amram et al, 2022), which uses observational data of the form (x_i, y_i, z_i). In our case, each observation (i.e., data set) i encompasses the features x_i, prescriptions z_i, and outcomes y_i described below. Our prescriptive task is to find the optimal policy that, given the information x of a data set, prescribes the method z leading to the best metric score y.
We randomly split the 45 UCI data sets into a training set (40 data sets) and a test set (5 data sets from different difficulty levels). We cross-validated the optimal tree depth and complexity using the training set. Figures 4 and 5 show the OPTs obtained for maximizing two different objectives: natural accuracy and adversarial accuracy. The tree in Figure 4 highlights that the Stable and Stable+Robust methods are best suited to obtain high natural accuracy, with the former being preferred when the nominal DL approach has very low adversarial accuracy (ρ = 0.1). To maximize robustness, the tree in Figure 5 prescribes HDL, Stable+Robust, or Robust+Sparse depending on the adversarial accuracy achieved by the nominal DL method.
In addition, we obtained single-leaf trees when maximizing the stability and sparsity objectives. The recommended methods are Stable+Robust for optimizing stability and Stable+Sparse for maximizing sparsity. Lastly, HDL was always the prescribed method when the desired objective was the equally weighted average of all 4 previous metrics.
Finally, Table 6 reports the out-of-sample performance of these prescription trees on the 5 UCI data sets from the test set (cnae-9, hill-valley, libras-movement, magic-gamma, and thyroid-ann). We emphasize that the performance of the prescribed methods is higher than that of the nominal DL approach across all objectives and data sets, and it often matches the performance of the best method.

Table 6: Performance of prescription trees on the testing set.

Conclusions
This paper presents a unifying methodology to obtain deep learning models that are accurate, robust, stable, and sparse by appropriately modifying the objective function to be minimized. Across multiple computational experiments, we show how these 4 metrics interact and demonstrate that we can often train models that simultaneously improve adversarial accuracy, worst-case accuracy, and parameter sparsity without sacrificing natural accuracy. Finally, we provide prescriptive trees that recommend which method is most appropriate depending on the desired objective to be maximized.

Appendix A Results Tables

We present the evaluation results for natural accuracy, adversarial accuracy, stability, and sparsity on the test sets across all data sets and methods discussed in the paper. The natural accuracy results can be found in Table A1, adversarial accuracy results in Table A2, stability results in Table A3, and sparsity results in Table A4.

Fig. 1 :
Fig. 1: Evaluation of the different methods depending on the natural accuracy of the nominal DL approach on the UCI data sets.

Fig. 3 :
Fig. 3: SHAP values on various metrics across different UCI data set categories. Blue/red indicates that the feature has a positive/negative SHAP value on a specific category of UCI data set.

Table 2 :
Results for the Fashion-MNIST data set. For each method, the parameters with the highest average rank in the validation set were chosen.

Table 3 :
Results for the MNIST data set. For each method, the parameters with the highest average rank in the validation set were chosen.

Table 5 :
Average slowdown factors of computational time with respect to the nominal DL method.
• Features x_i: number of features, number of target classes, nominal DL accuracy, nominal DL adversarial accuracy with ρ = 0.001, nominal DL adversarial accuracy with ρ = 0.01, nominal DL adversarial accuracy with ρ = 0.1, and nominal DL stability (worst-case accuracy).
• Prescriptions z_i ∈ {1, . . . , 8}: DL, Robust, Stable, Sparse, Robust+Sparse, Stable+Sparse, Stable+Robust, HDL.
• Outcomes y_i ∈ R^8, which represent the performance improvement of each method compared to the nominal DL model with respect to the metric set by the user.

Table A1 :
Natural accuracy results for all UCI and vision data sets, where n denotes the data size, p denotes the data dimension, and k denotes the number of classes. Darker blue corresponds to higher nominal DL natural accuracy for the UCI data sets.

Table A2 :
Adversarial accuracy results for all UCI and vision data sets, where n denotes the data size, p denotes the data dimension, and k denotes the number of classes. We use ρ = 0.1 for all data sets except CIFAR10 and Fashion-MNIST, for which we set ρ = 0.01. Darker blue corresponds to higher nominal (DL) natural accuracy.

Table A3 :
Stability (worst-case accuracy) results for all UCI and vision data sets, where n denotes the data size, p denotes the data dimension, and k denotes the number of classes. Darker blue corresponds to higher nominal (DL) natural accuracy.