1 Introduction

One crucial component of the ambitious goal of achieving trustworthy AI (e.g., [58]) is the robustness of the AI model itself, sometimes referred to under the notion of security of AI. Robustness is often understood in the sense of adversarial robustness, i.e., robustness of the AI against adversarial attacks (e.g., [25]). Adversarial attacks are perturbations of the data points which are as small as possible, usually quantified in some vector norm, so that the output of the AI model changes significantly; e.g., in the standard classification setting for adversarial attacks, the output class switches from the original class to another class or even to a target class prescribed by the attacker (e.g., [11]). Adversarial attacks are crafted after the AI model is trained; therefore, they do not have any effect on AI training unless they are used for adversarial training.

Classical robust statistics (e.g., [26, 31, 39, 44]) considers the robustness of an estimator against contamination in the dataset, where the contamination appears before training, often for natural reasons like model misspecification. The robustness of an estimator is quantified either in terms of local robustness, which amounts to the influence function [16, 28], measuring the infinitesimal influence of a single data point on the estimator, or in terms of global robustness, which corresponds to the breakdown point (BDP). As for the BDP, there exists a functional version [27], but when working with data, the empirical version, also called the finite-sample BDP [16], is often more appropriate; it quantifies the number of instances in the dataset that have to be modified arbitrarily in order to let the estimator break down, i.e., take unreasonable values.

Robust statistics is related to the concept of poisoning attacks [9], a counterpart of adversarial attacks, where perturbations of data points are also computed in order to distort a model; in contrast to adversarial attacks, however, the perturbation takes place before AI training, hence poisoning attacks indeed have an impact on the trained model. Note that one can either consider poisoning attacks as additional samples [9] or as perturbations of existing samples that replace these original samples [37]. In contrast to the perturbations usually considered in robust statistics, which affect either a subset of the instances or instance-specific subsets of the cells [3], the perturbations for poisoning attacks are computed over the whole training data subject to an overall bound, often quantified in some matrix norm. Therefore, the perturbations considered in robust statistics can be interpreted as poisoning attack perturbations that are bounded in the \(l_0\)-norm, either w.r.t. the row indices or w.r.t. the cell indices.

In order to safeguard against perturbations, robust statistics considers means to bound the influence of data points, for example, by downweighting or even trimming outlying instances or cells. It is important to note that contamination in the response vector is fundamentally different from contamination in the predictor matrix due to the additional leverage effect. In fact, while the first situation may be handled using loss functions with bounded gradient, contamination in the predictor matrix requires loss functions whose gradient even redescends to zero for input values with large norm, which bounds the loss function itself (e.g., [26]). Due to its low efficiency, the resulting M-estimator often enters MM-estimation as the starting estimator (e.g., [39]). Another option is to robustify the aggregation procedure of the losses in the sense that a trimmed mean or the median of the losses is minimized, which, for the squared loss function, leads to the least trimmed squares (LTS) or the least median of squares (LMS) estimator (see [46]). Techniques based on robust loss functions or aggregations have indeed also been proposed for deep learning, e.g., [17] or [32], [49].

The enormous rise of deep AI models during the past two decades raises the natural question of the extent to which they are sensitive to contamination in the training data. In particular, in safety-critical applications such as autonomous driving, models that are distorted due to contaminated training data can have a fatal impact on the safety of the system and thus even on human life. For example, [8] and [43] use robust linear regression techniques in order to estimate vehicle parameters that are hardly measurable during operation, such as the tractive force, resistances in electric vehicles, or the state of health of lithium-ion cells. Difficult nonlinear relationships between the regressor variables and the response variable encourage the use of neural networks, so once a similar estimation has to be performed where linearity cannot be assumed, one needs robust nonlinear methods, such as suitably robustified neural networks, in order to fulfill this task safely.

In this paper, we aim at systematically investigating the effects of different types of contamination and the extent to which a suitably robust loss function can safeguard against them. Moreover, we provide a more formal description of the meaning of robustness in the setting of deep learning and propose an adapted version of the regression BDP in order to quantify the global robustness of feed-forward neural networks.

The paper is organized as follows. Section 2 compiles the basic components of feed-forward neural networks, robust statistics, and robustifications of neural networks by the standard concepts of robust statistics from the literature. In Sect. 3, we propose an adapted breakdown point notion tailored to neural networks and analyze it for different contamination configurations. Section 4 provides the details of our simulation study. The results are summarized in Sect. 5, while the graphics that depict them are provided in the supplementary file due to their sheer number.

2 Preliminaries

2.1 Feed-forward neural networks

Neural networks are complex structures that do not directly map an input vector onto an output like a standard linear model but that include numerous potentially nonlinear intermediate transformations. We refer to [24] for an excellent overview.

In this work, we restrict ourselves to the analysis of feed-forward neural networks (FFNNs), which can be considered the simplest ones. Let a data matrix \(\mathcal {D}=(X,Y) \in \bigotimes _{i=1}^n \mathcal {X} \times \mathcal {Y} \subset \mathbb {R}^{n \times (p+1)}\) be given for predictors \(X_i \in \mathcal {X} \subset \mathbb {R}^p\) and responses \(Y_i \in \mathcal {Y} \subset \mathbb {R}\). An FFNN consists of \((H+1)\) layers, of which H are interpreted as hidden layers since they contain the intermediate transformations, while the \((H+1)\)-th one is the output layer that maps the final transformation onto the output space. Layer h consists of \(L_h\) so-called hidden nodes, \(h=1,\ldots ,H+1\), in which the actual transformations are computed.

Starting from an input vector, \(x \in \mathcal {X}\), which can be interpreted to define the input layer (layer 0), a fully connected FFNN computes

$$\begin{aligned} z_l^{(1)}=\sigma ^{(1)}(a_l^{(1)}):=\sigma ^{(1)}\left( \sum _{j=1}^p w_{lj}^{(1)}x_j+b_l^{(1)}\right) , \ \ \ l=1,\ldots ,L_1, \end{aligned}$$
(1)

in the first hidden layer, for a so-called activation function \(\sigma ^{(1)}: \mathbb {R} \rightarrow \mathbb {R}\) and weights \(b_l^{(1)}\), \(w_{lj}^{(1)} \in \mathbb {R}\), \(j=1,\ldots ,p\). The \(z_l^{(1)}\), \(l=1,\ldots ,L_1\), serve as the input for the second hidden layer and so forth. In order to streamline the notation, we write \(z_j^{(0)}:=x_j\), so that for layer \(h=1,\ldots ,H+1\), the computation in the l-th hidden node, \(l=1,\ldots ,L_h\), can be written as

$$\begin{aligned} z_l^{(h)}=\sigma ^{(h)}(a_l^{(h)}):=\sigma ^{(h)}\left( \sum _{j=1}^{L_{h-1}} w_{lj}^{(h)}z_j^{(h-1)}+b_l^{(h)}\right) . \end{aligned}$$

Classical intermediate activation functions include sigmoid ones like \(\hbox {tanh}\) or the piece-wise linear ReLU activation \(z \mapsto \max (z,0)\), although many variants have already been proposed [24]. The output activation function \(\sigma ^{(H+1)}\) is essentially determined by the learning task: for example, a regression NN with real-valued outputs requires no output activation (artificially, one can speak of an “identity activation”) and only one output node (\(L_{H+1}=1\)), while classification problems usually invoke a mapping of the activations in layer H onto the, say, K different classes, leading to K output nodes, which amounts to using a softmax activation, or a sigmoid activation for \(K=2\).
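To make the layer-wise computation concrete, the following minimal R sketch (with purely hypothetical layer sizes and randomly initialized weights) evaluates the forward pass of a fully connected FFNN with logistic hidden activations and an identity output activation, as in the regression setting described above.

```r
## Minimal sketch of a forward pass through a fully connected FFNN;
## all sizes and weights are hypothetical (weights would normally be learned).
logistic <- function(a) 1 / (1 + exp(-a))

forward_ffnn <- function(x, weights, intercepts, act = logistic, out_act = identity) {
  ## weights[[h]] is an L_h x L_{h-1} matrix, intercepts[[h]] a vector of length L_h
  z <- x
  n_layers <- length(weights)                              # H hidden layers plus the output layer
  for (h in seq_len(n_layers)) {
    a <- as.vector(weights[[h]] %*% z + intercepts[[h]])   # pre-activations a_l^(h)
    z <- if (h < n_layers) act(a) else out_act(a)          # activations z_l^(h)
  }
  z
}

## Example: p = 3 inputs, two hidden layers with 4 and 2 nodes, one output node
set.seed(1)
W <- list(matrix(rnorm(4 * 3), 4, 3), matrix(rnorm(2 * 4), 2, 4), matrix(rnorm(2), 1, 2))
b <- list(rnorm(4), rnorm(2), rnorm(1))
forward_ffnn(c(0.5, -1.2, 2.0), W, b)
```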

During training, the intercepts \(((b_l^{(h)})_{l=1}^{L_h})_{h=1}^{H+1}\) and weights \((((w_{lj}^{(h)})_{j=1}^{L_{h-1}})_{l=1}^{L_h})_{h=1}^{H+1}\) (we will usually refer to both intercepts and weights just as “weights”) are updated via gradient steps. Given a loss function \(L: \mathcal {Y} \times \mathcal {Y} \rightarrow [0,\infty [\), one updates the weights in the output layer via standard gradient steps. Provided that the activation functions are sufficiently smooth, this gradient step can be used to update the weights of the previous layer in a backward updating scheme that relies on the chain rule; therefore, the procedure is called backpropagation (BP). Based on an initialization of all weights, the first training epoch consists of a forward pass where the inputs \(X_i\) are fed into the NN, resulting in predictions \(\hat{Y}_i\). Using these \(\hat{Y}_i\), a backward pass based on the losses \(L(\hat{Y}_i,Y_i)\) is performed, updating the weights via BP. Starting with the output layer, denoting

$$\begin{aligned} \partial _{a_l^{(H+1)}}L=\partial _{z_l^{(H+1)}}L \, (\sigma ^{(H+1)})'(a_l^{(H+1)})=:\delta _l^{(H+1)}, \end{aligned}$$

one computes the partial derivatives

$$\begin{aligned} \partial _{b_l^{(H+1)}}L&=\partial _{a_l^{(H+1)}}L \, \partial _{b_l^{(H+1)}}a_l^{(H+1)}=\delta _l^{(H+1)}, \\ \partial _{w_{lj}^{(H+1)}}L&=\partial _{a_l^{(H+1)}}L \, \partial _{w_{lj}^{(H+1)}}a_l^{(H+1)}=\delta _l^{(H+1)}z_j^{(H)}, \end{aligned}$$

and using \(\partial _{b_l^{(h)}}a_l^{(h)}=1 \ \forall l,h\) and \(\partial _{w_{lj}^{(h)}}a_l^{(h)}=z_j^{(h-1)} \ \forall j,l,h\) leads to the recursion

$$\begin{aligned} \delta _l^{(h)}=\left( \sum _{k=1}^{L_{h+1}} w_{kl}^{(h+1)} \delta _k^{(h+1)}\right) (\sigma ^{(h)})'(a_l^{(h)}), \end{aligned}$$
$$\begin{aligned} \partial _{b_l^{(h)}}L=\delta _l^{(h)}, \ \ \ \partial _{w_{lj}^{(h)}}L=z_j^{(h-1)}\delta _l^{(h)}. \end{aligned}$$
(2)

The gradients in Eq. 2 are then used for a gradient step of the form

$$\begin{aligned} w_{lj}^{(h)}=w_{lj}^{(h)}-\frac{\eta }{m}\sum _{i=1}^m \partial _{w_{lj,i}^{(h)}}L, \ \ \ b_l^{(h)}=b_l^{(h)}-\frac{\eta }{m}\sum _{i=1}^m \partial _{b_{l,i}^{(h)}}L, \end{aligned}$$

where the index i indicates that the respective derivative is evaluated on instance i from a sample of size \(m \le n\) of instances from the training data and where \(\eta\) is a learning rate.

This is done until numerical convergence or until a maximum number of epochs is reached.
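The following R sketch (hypothetical sizes, one hidden layer, logistic hidden activation, identity output activation and the squared loss \(L(r)=r^2/2\)) performs one full-batch gradient step according to Eq. (2) and the update rule above; it is meant as an illustration of the mechanics only.

```r
## One full-batch gradient step for an FFNN with one hidden layer, logistic hidden
## activation, identity output activation and squared loss L(r) = r^2/2
## (minimal sketch with hypothetical sizes; cf. Eq. (2) and the update rule above).
set.seed(2)
n <- 50; p <- 4; L1 <- 6; eta <- 0.05
X <- matrix(rnorm(n * p), n, p); Y <- rnorm(n)
W1 <- matrix(rnorm(L1 * p, sd = 0.1), L1, p); b1 <- rep(0, L1)   # first-layer weights
w2 <- rnorm(L1, sd = 0.1); b2 <- 0                               # output-layer weights
logistic <- function(a) 1 / (1 + exp(-a))

## forward pass
A1 <- X %*% t(W1) + matrix(b1, n, L1, byrow = TRUE)   # n x L1 pre-activations a_l^(1)
Z1 <- logistic(A1)                                    # hidden activations z_l^(1)
Yhat <- as.vector(Z1 %*% w2 + b2)                     # identity output activation

## backward pass: delta terms of the output and the hidden layer
delta2 <- Yhat - Y                                    # dL/da^(2) for L(r) = r^2/2, r = Y - Yhat
delta1 <- (delta2 %o% w2) * Z1 * (1 - Z1)             # n x L1, chain rule as in Eq. (2)

## averaged gradients and one gradient step (full batch, m = n)
w2 <- w2 - eta * colMeans(delta2 * Z1)
b2 <- b2 - eta * mean(delta2)
W1 <- W1 - eta * t(crossprod(X, delta1)) / n
b1 <- b1 - eta * colMeans(delta1)
```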

2.2 Robust statistics

The following definition formalizes the understanding of contamination in robust statistics (cf. [44, Sect. 4.2]).

Definition 1

Let \((\Omega , \mathcal {A})\) be a measurable space, let \(\Theta \subset \mathbb {R}^p\) be a parameter space, and denote by \(\mathcal {P}:=\{P_{\theta } \ \vert \ \theta \in \Theta \}\) a parametric family so that each element \(P_{\theta } \in \mathcal {P}\) is a distribution on \((\Omega , \mathcal {A})\). Let \(P_{\theta _0}\) denote the ideal distribution. A contamination model is a family \(\mathcal {U}_*(\theta _0):=\{U_*(\theta _0,r) \ \vert \ r \in [0,\infty [\}\) of contamination balls \(U_*(\theta _0,r)=\{Q \in \mathcal {M}_1(\mathcal {A}) \ \vert \ d_*(P_{\theta _0},Q) \le r \}\), where \(\mathcal {M}_1(\mathcal {A})\) denotes the set of probability distributions on \(\mathcal {A}\) and \(d_*\) is a suitable discrepancy between distributions. The radius r is sometimes referred to as the “contamination radius”.

Example 1

The convex contamination model \(\mathcal {U}_c(\theta _0)\) is given by the set of convex contamination balls

$$\begin{aligned} U_c(\theta _0,r)=\{(1-r)_+P_{\theta _0}+\min (1,r)Q \ \vert \ Q \in \mathcal {M}_1(\mathcal {A}) \}. \end{aligned}$$
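For illustration, sampling from one element of the convex contamination ball \(U_c(\theta _0,r)\) can be sketched in R as follows; the standard normal ideal distribution and the shifted normal contaminating distribution Q are arbitrary illustrative choices.

```r
## Sampling from an element of a convex contamination ball (illustrative choice of Q):
## with probability min(1, r) an observation stems from the contaminating distribution Q.
rconvex_contam <- function(n, r, rideal = rnorm,
                           rcontam = function(n) rnorm(n, mean = 100)) {
  contaminated <- runif(n) < min(1, r)
  x <- rideal(n)
  x[contaminated] <- rcontam(sum(contaminated))
  x
}

set.seed(3)
x <- rconvex_contam(1000, r = 0.1)
mean(x > 10)   # approximately the contamination radius r
```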

The finite-sample regression BDP has been introduced in [16] and considers linear models, i.e., \(Y_i=X_i\beta +\epsilon _i\).

Definition 2

Let \(Z_n:=\{(X_1,Y_1),\ldots ,(X_n,Y_n)\}\) for instances \((X_i,Y_i) \in \mathbb {R}^{p+1}\) in the given dataset. The case-wise finite-sample breakdown point for a regression estimator \(\hat{\beta }\in \mathbb {R}^p\) is defined by

$$\begin{aligned} \varepsilon ^*(\hat{\beta },Z_n)=\min \left\{ \frac{m}{n} \ \bigg \vert \ \sup _{Z_n^m}(\vert \vert \hat{\beta }(Z_n^m)\vert \vert )=\infty \right\} \end{aligned}$$
(3)

where the set \(Z_n^m\) denotes any sample with exactly \((n-m)\) instances in common with \(Z_n\) and \(\hat{\beta }(Z_n^m)\) is the regression coefficient estimated on \(Z_n^m\).

The contamination procedure considered in Definition 2 is a case-wise or instance-wise contamination as in the convex contamination setting in Example 1 where the distributions P and Q are defined on \(\mathcal {X} \times \mathcal {Y}\). One can restrict the contamination to Y-contamination, indicating that one is only allowed to contaminate the response vector, so P and Q would only be defined on \(\mathcal {Y}\) while the predictor matrix X would stay unharmed. Contamination of the predictor matrix is referred to as X-contamination. An even more flexible setting has been proposed in [3] where one does not necessarily have to contaminate whole instances but where one is allowed to contaminate single cells. This contamination model can be described by the following definition (see also [1]):

Definition 3

Let \(P_{\theta _0}\) be a distribution on some measurable space \((\Omega ,\mathcal {A})\) with \(\Omega \subset \mathbb {R}^p\) and let \(X \sim P_{\theta _0}\) be a p-dimensional random vector (a row of the \(n \times p\) data matrix). For i.i.d. random variables \(U_1,\ldots ,U_p \sim Bin(1,r)\) and the diagonal matrix \(U=\mathop {\textrm{diag}}(U_1,\ldots ,U_p)\), the cell-wise convex contamination model consists of all sets of the form

$$\begin{aligned} U^{cell}(\theta _0,r):=\{Q \ \vert \ Q=\mathcal {L}((I_p-U)X+U \tilde{X})\} \end{aligned}$$

for \(\tilde{X} \sim \tilde{Q}\), where \(\tilde{Q}\) is any distribution on \((\Omega ,\mathcal {A})\), so that each cell of X is independently replaced with probability r by the corresponding cell of \(\tilde{X}\).

Remark 1

The notion of the BDP can be lifted to the cell-wise contamination setting by just considering the relative frequency of contaminated cells necessary in order to let the estimator break down, as for example proposed in [56].
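A cell-wise contamination of a data matrix in the spirit of Definition 3, together with the realized fraction of contaminated cells used in the BDP notion of Remark 1, can be sketched as follows; the contaminating distribution is again an arbitrary illustrative choice.

```r
## Cell-wise contamination: each cell of X is independently replaced
## with probability r by a draw from a contaminating distribution.
contaminate_cells <- function(X, r, rcontam = function(n) rnorm(n, mean = 100)) {
  U <- matrix(rbinom(length(X), size = 1, prob = r), nrow(X), ncol(X))
  X[U == 1] <- rcontam(sum(U))
  list(X = X, cell_rate = mean(U))   # cell_rate: realized fraction of contaminated cells
}

set.seed(4)
out <- contaminate_cells(matrix(rnorm(100 * 5), 100, 5), r = 0.1)
out$cell_rate
```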

2.3 Robust neural networks

There is a vast literature on the robustness of NNs. The term “robustness” is often understood as robustness against adversarial attacks [25], which aim at finding perturbations of the training points that lead to significantly different outputs (while training the model on the non-perturbed training points), or against poisoning or backdoor attacks [9], which are injections of malicious inputs into the training data, either by replacing existing ones (e.g., [37]) or by adding new points. While adversarial robustness relates to the smoothness of the NN itself for regression and also considers a proper placement of the decision boundaries for classification, robustness against poisoning attacks is closely related to the classical understanding of robustness in robust statistics. More precisely, poisoning attacks are crafted subject to a \(\vert \vert \cdot \vert \vert _q\)-bound on the perturbation matrix itself, \(1 \le q \le \infty\), while classical robustness mostly considers the sparse variant, i.e., the \(\vert \vert \cdot \vert \vert _0\)-bound, meaning that only a few instances or cells can be perturbed, but with arbitrary perturbation magnitude, whereas the magnitude is bounded in poisoning attacks.

The application of robust loss functions or trimming for NNs has been proposed in several works, e.g., [5, 13, 17, 29, 32, 33, 49, 51, 59, 60, 63]. Although these works are accompanied by studies on simulated and real data, to the best of our knowledge, a systematic robustness analysis distinguishing between Y-outliers, case-wise and cell-wise X-outliers has not been done so far, and Y-outliers and X-outliers have only very rarely been handled separately, e.g., in [34]. Moreover, although the notion of the BDP has been used in many works, a concise definition in the context of neural networks has not been given except for the work of [12], which is however tailored to the aggregation procedure in convolutional filters.

Another strategy is noise modeling, which, for example, has been proposed in [53]. They consider large outliers in the training data of NNs for single-input single-output systems, i.e., they consider time-dependent structures where the goal is to identify the potentially nonlinear structure that holds across all time steps. Note that the training scheme is therefore recursive (see [10] for details). Due to the non-robustness of the standard gradient descent algorithm, they propose to model the noise by Huber's model of measurement noise, which is a mixture of two Gaussians with different scales. Alternatively, they suggest modeling the noise by an exponential power distribution, which in their case leads to an M-estimation with a Huber-type loss function. As a third alternative, they use a trimmed loss criterion. These algorithms are applied in [54] for modeling the temperature at the exit of an induction furnace, where the exponential power distribution method leads to the best results.

Further robustifications have been proposed for the training process itself, for example, gradient clipping, which essentially corresponds to a Huberization of the losses [40], or resilient backpropagation, where the weights are not updated according to the gradients themselves but by a fixed amount into which the gradient only enters via its sign [45], i.e.,

$$\begin{aligned} w_{lj}^{(h)}&=w_{lj}^{(h)}-\eta \mathop {\textrm{sign}}\limits \left( \frac{1}{m}\sum _{i=1}^m \partial _{w_{lj,i}^{(h)}}L\right) , \\ b_l^{(h)}&=b_l^{(h)}-\eta \mathop {\textrm{sign}}\limits \left( \frac{1}{m}\sum _{i=1}^m \partial _{b_{l,i}^{(h)}}L\right) , \end{aligned}$$

robust aggregation in GNNs [22] or bounded activation functions, just to name a few. Proper regularization like dropout [52] or knowledge constraints as in informed machine learning (e.g., [57]) can also be interpreted as implicit robustification, although generally not safeguarding against large outliers.
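The following R sketch contrasts a plain averaged gradient step with the sign-based update above (a simplified, single-step-size version of resilient backpropagation; the full Rprop algorithm of [45] additionally adapts per-weight step sizes): a single huge per-instance gradient, e.g., caused by a large Y-outlier, dominates the plain step but contributes only its sign to the sign-based step.

```r
## Simplified comparison of a standard gradient step and a sign-based (Rprop-like) step.
## grads: matrix of per-instance gradients (m x number of weights), w: current weight vector.
standard_step <- function(w, grads, eta) w - eta * colMeans(grads)
sign_step     <- function(w, grads, eta) w - eta * sign(colMeans(grads))

## nine ordinary per-instance gradients plus one huge one (e.g., from a large Y-outlier)
set.seed(5)
grads <- rbind(matrix(rnorm(9 * 3), 9, 3), c(1e6, -1e6, 1e6))
w <- c(0, 0, 0)
standard_step(w, grads, eta = 0.01)   # dominated by the outlying gradient
sign_step(w, grads, eta = 0.01)       # bounded update of magnitude eta per weight
```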

Of course, classification data can also be contaminated. Y-contamination is nothing but a label switch. This type of contamination is known as label noise. We refer to [23] and references therein.

3 Breakdown of regression feed-forward neural networks

3.1 Robust loss functions and robust aggregation

In order to safeguard, at least partially, against contamination, several “robust loss functions” have been proposed (see, e.g., [26]). Well-known ones include Huber’s loss function, introduced in [30], and Tukey’s biweight function,

$$\begin{aligned} L^{\textrm{Huber}}_{\delta }(r)&={\left\{ \begin{array}{ll} r^2/2, &{} \vert r\vert \le \delta \\ \delta \vert r\vert -\delta ^2/2, &{} \vert r\vert >\delta \end{array}\right. } \\ L^{\textrm{Tukey}}_k(r)&={\left\{ \begin{array}{ll} 1-[1-(r/k)^2]^3, &{} \vert r\vert \le k \\ 1, &{} \vert r\vert >k \end{array}\right. } \end{aligned}$$

While Huber’s loss function only has bounded gradients, Tukey’s loss function is a redescender (e.g., [31]) since the gradients attain zero for large inputs (in absolute value) (Fig. 1). For regression, r denotes the residuals, i.e., for an instance \((X_i,Y_i)\) of the dataset, the residual is given by \(r_i=Y_i-\hat{Y}_i\) for the prediction \(\hat{Y}_i\) of the regression model.

Fig. 1: Huber's and Tukey's loss functions and derivatives

Another common technique to safeguard against contamination is to robustify the aggregation procedure when computing average losses or gradients during training. Rousseeuw [46] proposed to minimize the median of the squared residuals, leading to the least median of squares (LMS) estimator, or the trimmed mean of the squared residuals, resulting in the least trimmed squares (LTS) estimator. Trimming is understood in the sense that one only considers the \(\lceil n/2 \rceil \le h<n\) instances with the smallest losses during training. Although finding the optimal h-subset is a combinatorial problem and therefore infeasible, a clever strategy suggested in [47, 48], based on so-called concentration steps, allows for a fast computation of the LTS: one iteratively finds the h instances with the smallest losses w.r.t. the current model fit and updates the model by re-computing it on these currently best h instances (see the sketch below).
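A minimal R sketch of these concentration steps for LTS in a linear model (hypothetical data with a few Y-outliers; lm() refits the model on the current h-subset in each C-step):

```r
## Concentration (C-)steps for least trimmed squares in a linear model (sketch).
set.seed(6)
n <- 100; p <- 3; h <- 75
X <- matrix(rnorm(n * p), n, p); beta <- c(1, -2, 0.5)
y <- as.vector(X %*% beta + rnorm(n)); y[1:10] <- y[1:10] + 50   # some Y-outliers
dat <- data.frame(y = y, X)

idx <- sample(n, h)                                  # random initial h-subset
for (k in 1:20) {
  fit <- lm(y ~ ., data = dat[idx, ])                # refit on the current h-subset
  res2 <- (dat$y - predict(fit, dat))^2              # squared residuals on all instances
  idx_new <- order(res2)[1:h]                        # h instances with the smallest losses
  if (setequal(idx_new, idx)) break                  # h-subset no longer changes
  idx <- idx_new
}
coef(fit)
```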

It is important to point out that trimming does not correspond to a robust loss function because, for each instance, either the standard loss function or the loss function \(L \equiv 0\) is used, in contrast to a skipped loss function (e.g., [26, Sect. 2.5, Example 1]) which, based on the squared loss, would be \(L_k(r)=\vert r\vert ^2I(\vert r\vert \le k)\), i.e., a loss function that indeed drops discontinuously to zero for large absolute values of the input. We highlight this fact because [34] mentioned the non-differentiability of certain loss functions, including a trimmed absolute loss. The trimmed absolute loss is clearly non-differentiable due to the absolute loss already being non-differentiable (in zero), but this does not jeopardize BP in NN training as the non-differentiability is restricted to a null set. This fact has been proven in [36], who show that automatic differentiation as done in BP is correct even for non-differentiable functions as long as they satisfy certain regularity conditions, which holds for piece-wise smooth functions, including Huber's loss, a skipped squared loss, or an absolute loss.

3.2 Breakdown point

Although we do not propose a new BDP concept, we intend to clarify the meaning of a regression BDP for neural networks:

Definition 4

Let \(Z_n\) and \(Z_n^m\) be defined as in Definition 2, let \(B \in \mathbb {R}^{L_1} \times \mathbb {R}^{L_2} \times \cdots \times \mathbb {R}^{L_{H+1}}\) denote the intercept arrays and let \(W \in \mathbb {R}^{L_1 \times L_0} \times \cdots \times \mathbb {R}^{L_{H+1} \times L_H}\) denote the weight arrays that correspond to the FFNN. Then, the case-wise finite-sample BDP for the FFNN is given by

$$\begin{aligned} \varepsilon ^*(\hat{B}, \hat{W},Z_n)=\min \left\{ \frac{m}{n} \ \bigg \vert \ \sup _{Z_n^m}\left( \sup _{k \in \mathbb {N}}\left( \vert \vert (vec(\hat{B}^{(k)}(Z_n^m)),vec(\hat{W}^{(k)}(Z_n^m)))\vert \vert \right) \right) =\infty \right\} \end{aligned}$$
(4)

where \(vec\) denotes the operator that flattens such a tuple of arrays into a single vector by concatenating all entries and where \(\hat{B}^{(k)}(Z_n^m)\), \(\hat{W}^{(k)}(Z_n^m)\) denote the estimated intercepts and weights on \(Z_n^m\) in training epoch k.

The situation of cell-wise contamination can be captured as described in Remark 1, i.e., the relative number of contaminated cells is considered.

We have to elaborate why the epoch enters the BDP definition here. BDP computations in machine learning usually consider the underlying empirical risk in order to show the effect of outliers; for example, one can show that the loss of the zero coefficient is smaller than the loss of a broken coefficient in order to prove that no breakdown occurs (e.g., [2, Theorem 1], [61, Theorem 2]). Since NN training employs techniques like gradient clipping or resilient backpropagation, which bound the gradients even if the underlying loss function in fact leads to unbounded gradients, such NNs, in combination with early stopping, would never break down by design, although they do not converge and are in fact highly non-robust. In our BDP Definition 4, we therefore do not allow for early stopping since it does not seem reasonable in this setting: otherwise, one could stop any iterative algorithm directly after the initialization, which, for a proper initialization, would also never lead to a breakdown and would artificially create robustness where there is in fact none.

Remark 2

We implicitly made the assumption that only the instances, i.e., the predictors \(X_i\) or the responses \(Y_i\), can be contaminated, so a direct manipulation of the hidden nodes is excluded. Although it may technically be possible to intercept a training procedure by modifying and freezing hidden neurons arbitrarily, this does not seem reasonable in this work, where we want to examine the robustness of FFNNs against training data contamination only. Note that if one had access to the internal training process, one could just modify the activations \(z_j^{(H)}\) in the final hidden layer and freeze them, which would correspond to the standard linear regression or GLM setting where these final activations act as the usual predictors.

Note that we consistently cover non-fully connected FFNNs where some neuron connections do not exist, which can be described by weights that are frozen at zero.

3.3 Breakdown point for Y-contamination

Proposition 1

A regression FFNN with an unbounded output activation function \(\sigma\) satisfying \(\vert \sigma (z)\vert \rightarrow \infty\) for \(\vert z \vert \rightarrow \infty\) whose gradient does not vanish for \(\vert z \vert \rightarrow \infty\), and with an unbounded loss function L satisfying \(L(r) \rightarrow \infty\) for \(\vert r\vert \rightarrow \infty\) for residuals r with non-redescending gradient, possibly regularized with a penalty term of the form \(\lambda J(W)\) for \(J \ge 0\), \(J(W) \rightarrow \infty\) for \(\vert \vert W\vert \vert \rightarrow \infty\) and \(J(0)=0\), has an FSBDP of 1/n w.r.t. Y-contamination if non-trimmed average gradients are computed in BP.

Proof

Let, w.l.o.g., \(Y_1\) be contaminated.

(1) In the case of a loss function L with unbounded gradient and non-manipulated gradients, \(Y_1\) can be manipulated so that \(\vert L'(u-u')\vert _{u=Y_1,u'=z^{(H+1)}} \rightarrow \infty\). Then, the first gradient in the first BP step, i.e., \(\delta ^{(H+1)}=\partial _{z^{(H+1)}}L \, (\sigma ^{(H+1)})'(a^{(H+1)})\), is already unbounded almost-surely, leading to unbounded output weights via the first update. This is also true if a regularization term of the form \(\lambda J(W)\) is added to the loss function, since, due to L being unbounded, one can always find \(Y_1\) so that the loss term is dominant, the penalty parameter \(\lambda\) being static. Regardless of the remaining backward pass, the following forward pass leads to unbounded output activations by assumption. More precisely, due to the non-vanishing gradients of the output activation function and the unbounded gradients of the loss function, the total gradients in the BP pass that update the output weights are unbounded, always leading to unbounded output weights and, therefore, to a breakdown. Note that due to Definition 4, minibatch GD that randomly selects \(m<n\) instances whose gradients are aggregated does not safeguard against contamination, as the probability that only clean instances are selected in every epoch converges to zero for a growing number of epochs, so eventually, i.e., for epoch number \(k \rightarrow \infty\), the contaminated instance will almost-surely be selected.

(2) The case of clipped gradients, resilient backpropagation (Rprop) and bounded gradients can be handled simultaneously. Contamination of \(Y_1\) has the effect that the maximum possible gradient \(\vert L'(u-u')\vert _{u=Y_1,u'=z^{(H+1)}}\) in absolute value can be achieved. Therefore, although the gradient is bounded, either by bounded gradients of the loss function, gradient clipping or Rprop, which only considers the sign of the gradient, \(Y_1\) always induces a non-zero maximum gradient (whose absolute value is \(\eta\) in Rprop) due to the assumption that the gradient is non-redescending. As pointed out in the discussion after Definition 4, robustness solely achieved by early stopping is ignored here; therefore, letting \(k \rightarrow \infty\) for the number k of training epochs, at least the output weights eventually diverge as a non-zero constant (positive for \(Y_1 \rightarrow \infty\), negative for \(Y_1 \rightarrow -\infty\)) is added in each epoch to the output weights, hence, a breakdown in the sense of Definition 4 occurs. \(\square\)

Proposition 1 shows that the robustification of the training procedure, although undoubtedly solving or at least safeguarding against the exploding gradient problem, does not lead to robustness in the classical sense of quantitative robustness in our modified notion in Definition 4 if neither the loss function nor the aggregation of the losses is robust; hence, the outliers can persistently pull the weights to eventually unbounded values.

The situation changes if one robustifies the aggregation procedure of either the losses or the gradients by classical trimming.

Proposition 2

A regression FFNN with an unbounded output activation function \(\sigma\) satisfying \(\vert \sigma (z)\vert \rightarrow \infty\) for \(\vert z \vert \rightarrow \infty\) whose gradient does not vanish for \(\vert z \vert \rightarrow \infty\), and with an unbounded loss function L satisfying \(L(r) \rightarrow \infty\) for \(\vert r\vert \rightarrow \infty\) with non-redescending gradient, and where one computes the average of the h gradients with the smallest norm in BP in each training epoch, has a BDP of \((n-h+1)/n\).

Proof

Assume that there are \((n-h)\) large outliers which, as in Proposition 1, produce unbounded losses and gradients. Since the \((n-h)\) largest (in absolute value) gradients are trimmed away, these outliers cannot harm the training procedure; therefore, the BDP is at least \((n-h+1)/n\).

Conversely, assume that there are \((n-h+1)\) large outliers. Even trimming away \((n-h)\) of them results in one remaining large outlier, so a breakdown occurs due to the unboundedness of the gradients, as detailed in the proof of Proposition 1. Hence, the BDP is at most \((n-h+1)/n\). \(\square\)

The difference between trimmed means of gradients and mini-batch GD is that mini-batch GD randomly samples gradients while trimming deliberately removes the largest gradients. Note that one cannot treat the case of a trimmed loss aggregation and a trimmed gradient aggregation simultaneously in Proposition 2 as both procedures are in general not equivalent. For example, if one used Tukey's biweight loss (which is not covered by Proposition 2 due to the vanishing gradients) with an additional trimming in the aggregation and if there were \((n-h)\) large outliers falling into the constant area of the biweight loss, one would trim away these instances when trimming the loss aggregation; however, as these large outliers correspond to zero gradients, one would not trim them away during trimmed gradient aggregation (which would not be harmful in the sense of the BDP as these gradients are zero). For classical loss functions like the squared loss that satisfy both \(L=L(r)\) with \(L(r) \rightarrow \infty\) for \(\vert r\vert \rightarrow \infty\) as well as \(\nabla L=\nabla L(r)\) with \(\vert \nabla L(r)\vert \rightarrow \infty\) for \(\vert r\vert \rightarrow \infty\), the equivalence obviously holds, hence upper \(\alpha\)-trimmed squared losses would indeed induce a BDP of \(\alpha\). As a consequence, when using loss functions like the Tukey loss, one large outlier, even if not trimmed away during gradient or loss aggregation, may not necessarily be able to lead to a breakdown.

Remark 3

We want to point out that we only consider outlier robustness here due to the upper trimming, i.e., trimming away the largest losses or gradients. Inlier robustness would require also a lower trimming. In this work, we consider only the classical static contamination in the sense that the attacker can only manipulate a selected subset of instances before the training starts. One could, due to the iterative training procedure, also consider an attacker who can intercept this process and adapt the contamination (at least for the initially selected instances), leading to iterative contamination.

In the setting of iterative contamination, trimming away a large fraction of the highest losses can indeed backfire in terms of robustness. For example, consider an upper \(\alpha\)-trimmed squared loss with \(\alpha =0.5\) and assume for simplicity that n is even. Then, provided that the attacker can query the output of the neural network in each training epoch, the attacker can, for the selected half of the instances, manipulate the responses in a way that they are always slightly larger than the currently predicted values but still correspond to the lowest half of the losses. More precisely, let \(L_i^{(k)}=L(Y_i^{(k)},\hat{Y}_i^{(k)})\) for all \(i=1,\ldots ,n\) in iteration k and let \(L_{(1)}^{(k)} \le \ldots \le L_{(n)}^{(k)}\). Let \(I^{(k)}\) be the index set corresponding to the n/2 smallest losses. The iterative replacement outliers \(Y_i^{(k+1)}\) after finishing iteration k would be taken from the set

$$\begin{aligned} \left\{ Y \ \vert \ L(Y,\hat{Y}_i^{(k)})<L_{(n/2+1)}^{(k)}, Y>\hat{Y}_i^{(k)}\right\} \end{aligned}$$

for all \(i \in I^{(k)}\). This will result in trimming away all instances j for \(j \in \{1,\ldots ,n\} {\setminus } I^{(k)}\) and, by construction, in a (potentially small but) positive gradient for the instances in \(I^{(k)}\). After the following forward pass, the attacker adapts the responses again according to this scheme and so forth, eventually making the output weights arbitrarily large.
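One step of this adaptive replacement scheme can be roughly sketched as follows, assuming the attacker can query the current predictions yhat; the factor 0.9 is an arbitrary choice that keeps the new losses strictly below the trimming threshold.

```r
## One step of the adaptive Y-replacement against an upper 0.5-trimmed squared loss
## (rough sketch; y are the current responses, yhat the current predictions, n even).
adapt_outliers <- function(y, yhat) {
  n <- length(y)
  loss <- (y - yhat)^2
  thresh <- sort(loss)[n / 2 + 1]          # L_{(n/2+1)}^{(k)}
  idx <- order(loss)[1:(n / 2)]            # I^{(k)}: indices of the n/2 smallest losses
  ## new responses: slightly above the predictions, but with loss below the threshold,
  ## so they survive the trimming and keep inducing a positive gradient
  y[idx] <- yhat[idx] + 0.9 * sqrt(thresh)
  y
}
```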

3.4 The case of X-contamination

The effect of X-contamination is much more difficult to handle than that of Y-contamination because the influence of such contamination strongly depends on the architecture and the weights. We make this more precise in the following proposition.

Proposition 3

An FFNN that uses standard BP has a case-wise BDP of 1/n if the initial weights allow the error terms \(\delta _l^{(1)}\), corresponding to the first hidden layer and considered as a function of an input vector x, to decrease at most at a sublinear rate as \(\vert \vert x\vert \vert \rightarrow \infty\).

Proof

In BP, the gradients in the first layer are given by \(\partial _{w_{lj}^{(1)}} L=z_j^{(0)}\delta _l^{(1)}=x_j\delta _l^{(1)}\) (see Eq. 2), so if the assumptions are satisfied, a large X-outlier, w.l.o.g. \(X_1\), can produce unbounded gradients here and, therefore, unbounded weights for the first hidden layer in the first BP step. \(\square\)

Corollary 4

In the same setting as in Proposition 3, cell-wise X-contamination would lead to a cell-wise BDP of \(\frac{1}{np}\).

The extreme setting considered in the proof above then carries over to the following forward pass, where either all predicted responses (not only \(\hat{Y}_1\)!) become unbounded, leading to the problems we examined for Y-contamination, or the large weights of the first hidden layer cause intermediate linear terms \(a_l^{(h)}\) to be located in near-flat regions of the following intermediate activation function \(\sigma ^{(h)}\). The latter, for \(X_{1,j} \rightarrow \infty\), in fact prevents any further weight updates in the subsequent BP steps when using standard BP, while it may still allow for updates using Rprop if the gradients of the activation functions are not exactly zero. Nevertheless, the problem of diverging initial weights in the setting of Proposition 3 cannot be remedied.

However, it is not evident under which conditions the assumptions hold unless a particular NN with fixed initial weights is given. This is because even ReLU activation functions have flat regions. For example, a simple NN with one hidden layer and two hidden nodes with ReLU activation is not automatically affected by large X-outliers. Let \(p=1\) for simplicity. Then, letting \(X_1 \rightarrow \infty\), a negative initial weight \(w_{11}^{(1)}\) would make the activation \(a_{1}^{(1)}\) negative and therefore censored by the ReLU activation, leading to a vanishing gradient w.r.t. instance 1 in the BP, so the aggregated gradient in BP will not be affected by the outlier. However, due to the gradients w.r.t. the other instances, the weight \(w_{11}^{(1)}\) may become positive in later iterations. Due to this fact, a rigorous assessment of the effect of X-contamination is out of scope for this paper.

We restrict ourselves to a practical analysis of the effect of X-contamination on FFNN training which, in our opinion, has not yet been sufficiently done in the literature. Especially the case of cell-wise contamination does not yet seem to have been considered at all for FFNNs. Although the effect of X-contamination may be bounded if bounded activation functions are used, it is not evident how an NN behaves in the presence of cell-wise outliers, which easily affect all instances while keeping the cell-wise contamination rate rather low, especially for high-dimensional predictors.

4 Simulation study

In order to experimentally examine the quantitative robustness of regression FFNNs, we set up an extensive simulation study that shall shed light on the performance, the number of training steps and the breakdown behavior of robust and non-robust FFNNs for non-contaminated data, Y-contaminated data, case-wise X-contaminated data and cell-wise (XY)-contaminated data, all with convex contamination, for different dimensions, underlying regression structures and contamination radii.

4.1 Data generation

The data are generated by first drawing \(X_i \sim \mathcal {N}_p(\mu 1_p, I_p)\) i.i.d., \(i=1,\ldots ,n\), for \(1_p=(1,\ldots ,1) \in \mathbb {R}^p\) and the identity matrix \(I_p \in \mathbb {R}^{p \times p}\), and coefficients \(\beta _j \sim \mathcal {N}(0,1)\) for \(j=1,\ldots ,p\) i.i.d.. We consider three different regression structures:

1. Structure “lin”: We have a standard linear model of the form \(Y_i=X_i\beta +\epsilon _i\).

2. Structure “poly”: We first compute \(\tilde{Y}_i=X_i\beta\) and transform the responses according to \(Y_i=\vert \tilde{Y}_i\vert ^{-2/3}+\epsilon _i\) afterward. This structure has been proposed in the literature, e.g., [50].

3. Structure “trig”: We again first compute \(\tilde{Y}_i=X_i\beta\) and transform the responses according to \(Y_i=\frac{\sin (\vert \tilde{Y}_i\vert )}{\vert \tilde{Y}_i\vert }+\epsilon _i\), which has also been proposed e.g. in [50].

Evidently, for the linear structure, one would not need neural networks; however, we aim at comparing the performance of the FFNNs also in this setting. The errors \(\epsilon _i\) are drawn i.i.d. according to a Gaussian distribution where the variance is adapted according to a prescribed signal-to-noise ratio (\(\mathop {\textrm{SNR}}\limits\)). The different configurations are collected in Table 1.

Table 1 Data generation specifications

The training sets contain \(n_\textrm{train}\) instances, the (independent) test sets \(n_\textrm{test}\) instances. For each configuration, we draw \(V=100\) independent datasets.
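The data generation can be sketched as follows; the SNR convention shown here (noise variance chosen as the signal variance divided by the SNR) is one common choice, and the concrete arguments in the example call are hypothetical, the actual values being those of Table 1.

```r
## Sketch of the data generation for the structures "lin", "poly" and "trig".
gen_data <- function(n, p, mu, snr, structure = c("lin", "poly", "trig")) {
  structure <- match.arg(structure)
  X <- matrix(rnorm(n * p, mean = mu, sd = 1), n, p)
  beta <- rnorm(p)
  ytilde <- as.vector(X %*% beta)
  signal <- switch(structure,
                   lin  = ytilde,
                   poly = abs(ytilde)^(-2/3),
                   trig = sin(abs(ytilde)) / abs(ytilde))
  eps <- rnorm(n, sd = sd(signal) / sqrt(snr))   # noise scaled to the prescribed SNR
  list(X = X, Y = signal + eps, beta = beta)
}

set.seed(7)
train <- gen_data(n = 200, p = 5, mu = 1, snr = 4, structure = "trig")   # hypothetical values
```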

4.2 Perturbation generation

In the non-contaminated setting, the training set is used for training as it is. Note that the test data are never perturbed.

As for Y-contamination, we randomly select \({\lceil rn_\textrm{train} \rceil}\) responses for the contamination radius r in the convex contamination model (see Example 1) and replace them with i.i.d. realizations from a \(\mathcal {N}(\mu _\textrm{out},1)\)-distribution. In the X-contamination setting, we randomly select \(\lceil rn_\textrm{train} \rceil\) instances and replace the p-dimensional predictor vectors with i.i.d. realizations from a \(\mathcal {N}_p(\mu _\textrm{out}1_p,I_p)\)-distribution. Finally, in the cell-wise (XY)-contamination setting, we randomly select \(\lceil rn_\textrm{train}(p+1) \rceil\) cells and replace them with i.i.d. realizations from a \(\mathcal {N}(\mu _\textrm{out},1)\)-distribution. We set \(r \in \{0.1,0.25,0.4\}\) and \(\mu _\textrm{out} \in \{10,100,1000\}\).
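The three contamination schemes can be sketched as follows for a training set (X, Y); this is a minimal sketch mirroring the description above, not the exact code used in the study.

```r
## Convex contamination of the training data: Y-, case-wise X- and cell-wise
## (X,Y)-contamination (minimal sketch of the schemes described above).
contaminate <- function(X, Y, r, mu_out, type = c("Y", "X", "XY")) {
  type <- match.arg(type)
  n <- nrow(X); p <- ncol(X)
  if (type == "Y") {
    idx <- sample(n, ceiling(r * n))
    Y[idx] <- rnorm(length(idx), mean = mu_out)
  } else if (type == "X") {
    idx <- sample(n, ceiling(r * n))
    X[idx, ] <- matrix(rnorm(length(idx) * p, mean = mu_out), length(idx), p)
  } else {                                        # cell-wise (X,Y)-contamination
    D <- cbind(X, Y)
    cells <- sample(length(D), ceiling(r * n * (p + 1)))
    D[cells] <- rnorm(length(cells), mean = mu_out)
    X <- D[, 1:p, drop = FALSE]; Y <- D[, p + 1]
  }
  list(X = X, Y = Y)
}
```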

4.3 Network training

We use the implementation from the \(\textsf{R}\)-package neuralnet [21]. In order to allow for robust networks, we added Huber’s loss function, Tukey’s loss function, and the upper trimmed squared loss as well as their respective gradients to the convert.error.function function.

As for network training, we only modify the parameters err.fct, act.fct, stepmax and hidden; for the other parameters, the default configurations are used. In particular, we use the resilient BP algorithm here, since experiments with large outliers lead to exploding gradients when using the standard BP algorithm.

We consider two different activation functions (that are used in each hidden node, note that the output activation is the identity function as we are in a regression setting), namely, the logistic activation function and the softplus activation function \(x \mapsto \ln (1+e^x)\), which is a classical smooth approximation of the ReLU activation function (e.g., [24]).

We consider six loss functions/aggregations, i.e., the squared loss, Huber's loss, Tukey's loss, and the upper \(\alpha\)-trimmed squared loss (just trimmed (squared) loss henceforth) with a trimming rate \(\alpha \in \{0.1, 0.25, 0.5\}\). As for the depth and size of the FFNNs, we use small networks with \(H=2\) hidden layers with \(L_h=2\) hidden nodes, \(h=1,2\), shallow networks with \(H=2\) hidden layers with \(L_h=10\) hidden nodes, \(h=1,2\), and deep networks with \(H=10\) hidden layers and \(L_h=5\) hidden nodes, \(h=1,\ldots ,10\). The resulting number of weights is listed in Table 2. As for the Tukey loss, we use the value \(k=4.685\) from the literature, corresponding to 95% efficiency (e.g., [39]). As for the Huber loss, we compute \(\delta =\hbox {med}((\vert Y_i-\hat{Y}_i\vert )_{i=1}^n)\) in each iteration.

Note that in the backpropagation algorithm, only the outer derivatives, i.e., \(L'\), change by changing the loss function. For the Huber and Tukey loss function, the derivative is

$$\begin{aligned} (L_{\delta }^\textrm{Huber})'&={\left\{ \begin{array}{ll} r, &{} \vert r\vert \le \delta \\ \delta \mathop {\textrm{sign}}\limits (r), &{} \vert r \vert >\delta \end{array}\right. },\\ (L_k^\textrm{Tukey})'&={\left\{ \begin{array}{ll} \frac{6}{k^2}r\left( 1-\left( \frac{r}{k}\right) ^2\right) ^2, &{} \vert r\vert \le k \\ 0, &{} \vert r\vert >k \end{array}\right. }, \end{aligned}$$

respectively, where \(\mathop {\textrm{sign}}\limits (r)=1\) for \(r>0\), \(\mathop {\textrm{sign}}\limits (r)=-1\) for \(r<0\), and \(\mathop {\textrm{sign}}\limits (r) \in [-1,1]\) for \(r=0\). As for the trimmed loss functions, we use the standard gradients for all non-trimmed instances, and zero gradients for all trimmed instances.
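The outer derivatives actually entering BP can be sketched in R as follows; the trimmed variant zeroes the gradients of the \(\alpha\)-fraction of instances with the largest squared residuals, and the Huber tuning constant is re-computed as the median absolute residual as described above.

```r
## Outer derivatives L'(r) for the residuals r = Y - Yhat (sketch of the variants above).
dhuber <- function(r, delta) ifelse(abs(r) <= delta, r, delta * sign(r))
dtukey <- function(r, k = 4.685) ifelse(abs(r) <= k, 6 / k^2 * r * (1 - (r / k)^2)^2, 0)
dtrimmed_sq <- function(r, alpha) {
  keep <- rank(r^2, ties.method = "first") <= ceiling((1 - alpha) * length(r))
  ifelse(keep, r, 0)   # gradient of r^2/2 for the kept instances, zero for the trimmed ones
}

## Example: adaptive Huber tuning constant as the median absolute residual
r <- c(-0.5, 0.2, 1.3, 30, -45)
dhuber(r, delta = median(abs(r)))
dtukey(r)
dtrimmed_sq(r, alpha = 0.25)
```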

The weight initialization is done by a random initialization where each single weight is drawn from a standard normal distribution. Therefore, the initialization is independent of the data contamination. On each of the V generated datasets for each scenario, one single network is trained, so each \(v=1,\ldots ,V\) is associated with one realization of the data and the contamination, and each of the trained networks is initialized with one individual realization of the weights.

Table 2 Number of neural network weights

For the shallow neural networks, we allow for a maximum epoch number stepmax of 100,000, while deep networks are allowed to iterate through 250,000 epochs.

Finally, we consider both standardized and non-standardized responses, i.e., in the case of a standardization, we compute

$$\begin{aligned} \displaystyle Y_i \mapsto \frac{Y_i-\min _j(Y_j)}{\max _j(Y_j)-\min _j(Y_j)} \end{aligned}$$

as a pre-processing step before training which is a common technique to stabilize the training procedure.
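Putting the pieces together, a single training run with the neuralnet package can be sketched as follows, shown with the built-in squared error "sse" and logistic activation; our robust losses require the modifications of the package internals described above, and the data object train is the hypothetical one from the data-generation sketch in Sect. 4.1.

```r
## Sketch of one training run with the R package 'neuralnet' (built-in squared error;
## the robust losses require the package modifications described in Sect. 4.3).
library(neuralnet)

dat <- data.frame(train$X, Y = train$Y)
## min-max standardization of the responses as a preprocessing step
dat$Y <- (dat$Y - min(dat$Y)) / (max(dat$Y) - min(dat$Y))

fml <- as.formula(paste("Y ~", paste(setdiff(names(dat), "Y"), collapse = " + ")))
nn <- neuralnet(fml, data = dat,
                hidden = c(10, 10),        # shallow architecture of the study
                err.fct = "sse",           # squared error loss
                act.fct = "logistic",      # logistic hidden activation
                algorithm = "rprop+",      # resilient backpropagation (default)
                stepmax = 1e5,             # maximum number of training epochs
                linear.output = TRUE)      # identity output activation (regression)
pred <- compute(nn, dat[, setdiff(names(dat), "Y")])$net.result
```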

Summarizing, we consider a factorial design of the following configurations listed in Table 3.

Table 3 Simulation configurations

Due to the factorial design, we have 23,328 different configurations in total.

Remark 4

(Trimmed loss) Trimming has been successfully applied to regression tasks in the least trimmed squares (LTS, [46]) and sparse least trimmed squares (SLTS, [2]) algorithms. Both invoke so-called C-steps [47, 48], i.e., concentration steps, based on the following argument: Having initially trained the regression model on a random h-subset of the dataset, identify the h instances with the smallest losses (originally, with the smallest absolute residuals; however, for a loss function of the form \(L(u,u')=L(\vert u-u'\vert )\), which holds for the losses we use in this work, both statements are equivalent) and re-compute the estimator on these instances. Denote by \(R_H(\hat{\theta })\) the average loss on the h-subset \(H \subset \{1,\ldots ,n_\textrm{train}\}\) for regression parameter \(\hat{\theta }\). In iteration k, for the current parameter \(\hat{\theta }^{(k-1)}\), it holds by definition of the best h-subset \(H^{(k)}\) that \(R_{H^{(k)}}(\hat{\theta }^{(k-1)}) \le R_{H^{(k-1)}}(\hat{\theta }^{(k-1)})\). Then, by computing \(\hat{\theta }^{(k)}=\mathop {\textrm{argmin}}\limits _{\theta }(R_{H^{(k)}}(\theta ))\), it holds that \(R_{H^{(k)}}(\hat{\theta }^{(k)}) \le R_{H^{(k)}}(\hat{\theta }^{(k-1)})\), so the loss monotonically decreases.

In the setting of neural networks, due to the iterative and time-consuming training procedure, the pragmatic solution which is also used for the computations in this paper does not enable full C-steps. If one indeed computed \(\hat{\theta }^{(k)}=(\hat{B}^{(k)}, \hat{W}^{(k)})\) for each k by training the corresponding neural network until convergence, one would have the same argument as in LTS or SLTS that at least a local minimum of the trimmed objective is attained. The practical problem is, however, that this procedure would be very time-consuming. The computation of multiple neural networks until convergence is usually avoided; for example, instead of bagging neural networks, one applies dropout [52], where in each iteration nodes are randomly dropped so that multiple networks are created artificially without having to train each of them until convergence.

For the trimmed squared loss, however, we effectively only localize the loss and hence the gradient to the current h-subset with \(h=\lceil (1-\alpha )n_\textrm{train}\rceil\) for one single epoch. One can interpret this as an approximate C-step; however, one single gradient step is not guaranteed to lead to a better solution as, for example, a too large step size in the gradient update may cause the updated parameter to jump over the local optimum in the parameter space and to be in fact worse than the previous one.

4.4 Evaluation

In our simulations, we always use batches of all 24 configurations concerning the loss function and the contamination for all of the 972 configurations of the other parameters. For each of these 972 configurations, we evaluate the performance of the neural networks trained w.r.t. the six different loss functions in each of the four different contamination scenarios.

The performance consists of three aspects: the test loss, the number of training epochs, and the number of successfully terminated training procedures. A training procedure is successfully terminated if the partial derivatives converged up to a given threshold (we use the default of 0.01) within the maximum allowed number of training epochs. In case of successful training, we can compute the test loss, for which, since we are in a regression setting, we simply compute the average squared test loss, i.e.,

$$\begin{aligned} \frac{1}{n_{\rm test}}\sum _{i=1}^{n_{\rm test}} (\hat{Y}_i-Y_i)^2. \end{aligned}$$

Moreover, if training was successful, we extract the number of epochs required.

The test losses and training epoch numbers are then averaged over all (out of \(V=100\)) successful trainings so that for each of the 972 configurations, we can compute the average test loss and the average number of training epochs for all of the 24 (loss function, contamination)-configurations. These averages are depicted graphically, separately for the test losses and training epochs. The graphics also contain the number of successful trainings, which is written on top of the individual bars. As a by-product, the relative number of failed trainings can be interpreted as a surrogate of the breakdown rate. In some cases, in particular for softplus activation, the test loss can be so large that \(\textsf{R}\) reports Inf, so that the average test loss would also be \(\infty\), which would be misleading. Therefore, we compute the average of the finite test losses and report in the graphics that there was at least one infinite loss by writing “Inf” over the respective error bar.

Due to the large number of 1944 graphics, they can be found in the supplementary file.

5 Results

In this section, we summarize the results of our simulation study. Due to the large number of different parameters, we analyze the impact of each parameter separately in an own subsection.

5.1 Output standardization

Regarding the test losses, evidently, the mean losses are smaller in the scenarios where an output standardization was used than in scenarios without output standardization. Note that the ranking of the models in terms of their mean test losses in the scenarios with output standardization does not need to coincide with the ranking in the corresponding scenarios without output standardization since this scaling of the responses does not directly carry over into a scaling of the model as we do not have linear models here.

Notably, the number of NNs that have converged is considerably higher in the scenarios with output standardization than in the scenarios without output standardization. Due to the standardization, large outliers have the effect that the original responses are crowded near the left end of the interval [0, 1] into which they are transformed, while the large outliers are crowded near the right end. It is important to emphasize that robustness has to be defined carefully when operating in bounded spaces (see [14]), as, in this standardized setting, it would be impossible to achieve a breakdown in the classical sense of an unbounded norm of the coefficients/weights of the model (in our particular setting, if only Y-contamination is allowed, as the regressors are not standardized). Although a breakdown would also not be achievable in our breakdown point notion in Definition 4, the results show that, given a maximum number of iterations, the NN algorithm was unable to converge in many cases in settings with a large contamination magnitude and a large contamination radius.

We want to point out that the number of converged networks does not necessarily need to be higher in a setting with standardized responses than in the corresponding setting without output standardization, which can happen in particular for the Tukey loss. The reason is that the Tukey loss can lead to the vanishing gradient problem, which we will discuss in more detail in Sect. 5.6. If the softplus activation function is used, as expected, output standardization cannot deal with large X- and (XY)-contamination radii since we only use the output standardization once as a data preprocessing step, so after the first iteration, the predicted responses suffer from the X-outliers, preventing the NN from converging within the maximum number of iterations.

The number of iterations is generally lower if output standardization is used and often even dramatically lower, although the mean number of iterations does not necessarily have to be lower in a scenario with output standardization than in the corresponding setting without output standardization, which again can happen for the Tukey loss.

5.2 Dimensionality and regression structure

Contamination shows similar effects concerning the number of converged NNs, the required training epochs and the test losses, regardless of the dimensionality of the data. One has to be careful when comparing the results among the respective sections since, for a growing dimensionality of the data, the NNs tend not to converge within our fixed maximum number of training epochs even on non-contaminated data. Nevertheless, the fact that, on average, the number of converged NNs decreases with the amount of contamination holds for all dimensionalities.

The regression structure leads to a different loss scale which is smaller for the trigonometric structure than for the linear and the polynomial structure. The effect of contamination however is similar among all regression structures.

Figure 2 supports the discussion in this and the previous subsection. Note that the skewness is a consequence of gathering the test losses among many different scenarios. It does not imply that the empirical loss distributions over the \(V=100\) repetitions of each single scenario are skewed.

Fig. 2: Boxplots of all test losses corresponding to different configurations of the dimensionality of the training data, the output standardization, and the structure of the underlying regression function. Left half: no output standardization, right half: output standardization. Within each half: training data with \(p=5\) (left), \(p=20\) (middle), and \(p=50\) (right) variables. The three boxplots for each p correspond to a linear (left), polynomial (middle), and trigonometric (right) regression function

5.3 Contamination

Of course, a higher contamination magnitude and a higher contamination radius generally lead to a lower number of converged NNs and a higher test loss. However, occasionally, a higher amount of contamination may even lead to a better test loss or a higher number of converged NNs, which happens in particular for the Tukey loss.

It is important to point out that the number of converged NNs does not necessarily have to decrease for less robust NNs with a growing amount of contamination; in particular, this is the case for the NN based on the squared loss. One can observe that, oddly, the number of converged NNs w.r.t. the squared loss and the 0.1-trimmed squared loss sometimes increases with the contamination magnitude and the contamination radius, at least for the logistic activation function. However, in these cases, the test loss increases. One may interpret this observation as a gradient enhancement due to the contamination so that oscillating gradients (w.r.t. the subsequent epochs) that may prevent convergence (within the allowed maximum number of epochs) are avoided. Nevertheless, one clearly cannot speak of a reasonable fit due to the large test losses, strongly indicating that the NNs were highly distorted by the outliers. For other regression structures, due to the low test losses of the more robust models, the test loss of the distorted fit of the NNs w.r.t. the squared loss can be dramatically higher, by more than 6 orders of magnitude, than for the robust models. This effect of an increased number of converged NNs w.r.t. non-robust losses for higher contamination radii and magnitudes cannot be observed for the softplus activation function, not even if output standardization is used, most likely due to exploding gradients against which the Rprop method does not safeguard in the case that the raw gradient whose sign is computed is already infinite (or at least no longer representable by the software).

As for the number of iterations, there is no clear effect of contamination or the different types of contamination.

5.4 Activation function

Evidently, the logistic activation function bounds the activations of the respective hidden nodes, but it clearly does not prevent the NN from making large-valued predictions due to the weights in the output layer and the identity output activation for regression; neither does it safeguard against Y-contamination.

Most interestingly, X-contamination generally only leads to a slight increase of the test loss for small data for logistic activation and can in some cases even lead to a better out-of-sample performance than in the corresponding non-contaminated situation. However, X-contamination has a strong effect for softplus activation, leading to a large test loss and a decreasing number of converged NNs. Due to the bounded logistic activation and the redescending gradient of this function, the predicted responses for instances containing X-outliers cannot become infinite and hence do not distort the training process much, in contrast to the unbounded softplus activation.

In this sense, one can conclude that a bounded activation function in general can provide a robustification against X-contamination but not against Y-contamination.
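As a minimal numerical illustration of this saturation argument (not part of the original study; the function names and the example values are only illustrative), consider how the two activation functions respond to pre-activations of growing magnitude, as they may be caused by X-outliers:

```python
import numpy as np

def logistic(z):
    # Bounded activation: saturates at 1 for large positive pre-activations.
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    # Unbounded activation: grows roughly linearly for large positive pre-activations.
    # logaddexp(0, z) = log(1 + exp(z)), computed in a numerically stable way.
    return np.logaddexp(0.0, z)

pre_act = np.array([1.0, 10.0, 1e3, 1e6])  # pre-activations driven by increasingly extreme X-outliers
print(logistic(pre_act))   # [0.731 0.99995 1. 1.] -> bounded; outliers cannot inflate the activations
print(softplus(pre_act))   # [1.313 10.00005 1000. 1000000.] -> activations grow with the outlier
```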

Figure 3 supports the discussion in this and the previous subsection.

Fig. 3

Boxplots of all test losses corresponding to different configurations of the activation function and the contamination. Left: Logistic activation, right: Softplus activation. In each half, the leftmost boxplot corresponds to the setting without contamination, the boxplots 2–4 to Y-, X-, and XY-contamination with \(r=0.1\), respectively, the boxplots 5–7 to Y-, X-, and XY-contamination with \(r=0.25\), respectively, the boxplots 8–10 to Y-, X-, and XY-contamination with \(r=0.4\), respectively

5.5 Network structure

As for the network structure, there is one striking unexpected behavior: Although the squared loss function induces a highly non-robust network, the number of converged NNs is generally much larger for deep networks than for shallow networks and, oddly, may even be higher on contaminated data than on non-contaminated data. This behavior can sometimes also be observed for shallow networks, but it is characteristic for deep networks for every regression structure, although only without output standardization and for the logistic activation function. We interpret this behavior as a clear overfitting problem of deep networks. Current research indicates that there is a so-called “interpolating regime” (e.g., [18]) or “modern regime” (e.g., [7]) in which the statistical risk of a model, after increasing with growing model complexity due to overfitting, may decrease again once the model complexity becomes very large, leading to a double descent risk curve [7]; this behavior is referred to as “benign overfitting”. Recently, another phenomenon called “grokking” [42] or “delayed generalization” [38] has been discovered, where a neural network first overfits but, after a long training time, the generalization performance suddenly starts to improve, which may happen due to slowly emerging well-generalizing learning patterns [15]. However, although benign overfitting has been proven for data from an ideal distribution with certain assumptions on the tail behavior (e.g., [4, 6, 55]) for linear models and, in particular, in [19] for nonlinear models (see also recent works on benign overfitting for two-layer ReLU CNNs [35], two-layer leaky ReLU NNs [20], and multilayer ReLU NNs [62]), a tight approximation of the training data certainly has to be understood as overfitting if the training data are vastly contaminated, since near-interpolating such data cannot lead to a well-generalizing model, which is confirmed by the large test losses. Note that [19] indeed shows that for neural networks trained according to the logistic loss, even adversarial label flipping is allowed while still achieving a double descent risk curve, so a certain amount of contamination could still be coped with, although our setting is a regression setting and hence structurally different. Another aspect of benign overfitting is that it not only happens as the model complexity increases, but also for an increasing number of epochs in NN training [41]. Should benign overfitting have occurred on the clean data in our experiments, one could ask whether the cases in which models do not perform well on test data are caused by the maximum number of training epochs we imposed, i.e., whether the generalization performance would have been better if more iterations had been allowed. However, in contaminated settings, even a theoretically infinite number of training epochs should never result in benign overfitting.

Interestingly, this effect can sometimes also be observed for the smallest networks. It seems that in the context of contaminated data, there is a tradeoff between the size of the network and its adaptability to the data. Larger networks have more complicated gradients that may explode, but they can interpolate even contaminated data. Small networks, on the other hand, have less complicated gradients so that training is more likely to succeed; however, on contaminated data, outliers contribute heavily to the losses and gradients, potentially destabilizing the training even of small networks.

In addition, it seems that in combination with the softplus activation function, deep neural networks are much more brittle than shallow networks, even on non-contaminated data; this is particularly apparent in the cases with output standardization, where the test loss is generally higher for the deeper networks. The difference is particularly visible for the Huber NNs. This could be explained by the fact that a deep network contains even more summations: due to the unboundedness of the softplus activation, the activations grow in each hidden layer and, unless they become negative and are thus clipped, they destabilize the training. Another reason could be that the bound of 250,000 epochs we fixed may not be sufficient for convergence in these settings.

Concerning the training steps, one can clearly observe that in the cases where the deep network nearly interpolates the contaminated training data for non-robust networks, the number of training epochs is rather small, indicating that, thanks to the large number of network parameters, the network can quickly adapt to the data, since the squared and the 0.1-trimmed squared loss trim no or only few gradients away, in contrast to the other loss functions. In most of those cases, the number of training steps is even smaller than for the smallest networks.

Figure 4 supports the discussion in this and the previous subsection.

Fig. 4

Boxplots of all test losses corresponding to different configurations of the network depth and the contamination. Left: Shallow networks, middle: small networks, right: deep networks. Considering the 10 boxplots corresponding to each depth, the leftmost boxplot corresponds to the setting without contamination, the boxplots 2–4 to Y-, X-, and XY-contamination with \(r=0.1\), respectively, the boxplots 5–7 to Y-, X-, and XY-contamination with \(r=0.25\), respectively, the boxplots 8–10 to Y-, X-, and XY-contamination with \(r=0.4\), respectively

5.6 Loss function

This is the main part of the analysis of the results.

The most non-robust loss functions, i.e., the non-trimmed and the 0.1-trimmed squared loss function, suffer most from contamination, which can be clearly seen in the simulation results. Even for logistic activation, the number of converged NNs quickly decreases with the contamination magnitude and radius. As already highlighted in Sect. 5.5, sometimes the number of converged NNs even increases with the amount of contamination, but at the price of an immense test loss due to overfitting. Concerning the test losses of the non-robust NNs, output standardization prevents them from becoming extremely large in our simulations; nevertheless, the effect of contamination is clearly visible. A very important additional result is that the NN based on the non-trimmed squared loss does not necessarily correspond to the smallest test losses when trained on non-contaminated data. There are configurations for which the NNs based on the non-trimmed squared loss even lead to the largest test losses when trained on non-contaminated data. On top of that, there are many configurations in which this NN was not able to converge within the maximum number of training epochs, even on non-contaminated data.

As for the number of training epochs, one can observe that the NNs based on the non-trimmed squared loss and the 0.1-trimmed squared loss are often among the networks with the highest required number of training epochs (together with the Tukey NNs), on both non-contaminated and contaminated data, in particular when using output standardization. Nevertheless, there are also situations where the number of iterations is smaller on contaminated data than on non-contaminated data and lower than for NNs trained w.r.t. other loss functions. Unexpectedly, in the already mentioned situations where the NN overfits to contaminated data, the number of training epochs is rather low. An explanation could be that, although the Rprop algorithm only considers the sign of the gradients, the large number of network weights and the fact that no gradients of any training instance are clipped for the non-trimmed squared loss allow the NN to quickly adapt to the contaminated data, so that the training loss decreases quickly, leading to small residuals, small gradients, and quick numerical convergence; in contrast, for NNs trained w.r.t. robust loss functions, the gradients of large outliers are trimmed or zero during a training epoch, but the set of trimmed instances may change from epoch to epoch, delaying convergence. Note that this near-interpolating behavior of the NN trained w.r.t. the non-trimmed squared loss does not necessarily imply a low number of training epochs.
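Since this argument repeatedly refers to Rprop using only the sign of the (averaged) gradients, the following sketch of a basic Rprop update may be helpful; it is a generic, simplified variant without weight backtracking, with the commonly used default factors 1.2 and 0.5, and the actual implementation used in the experiments may differ in details:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One (simplified) Rprop update for a weight vector w.

    Only the sign of the current gradient enters the weight update; the gradient
    magnitude influences training solely via sign changes between epochs, which
    enlarge or shrink the per-weight step sizes.
    """
    sign_change = np.sign(grad) * np.sign(prev_grad)
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, step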

The 0.25- and the 0.5-trimmed squared loss lead to models which perform quite well throughout our simulations and which can handle large contamination radii. If the contamination radius is 0.4, exceeding the trimming rate of the 0.25-trimmed squared loss, one can occasionally indeed see that the performance is much worse than that of the NN trained according to the 0.5-trimmed squared loss; however, although \(25\%\) of the instances are trimmed away for the 0.25-trimmed squared loss, it does not necessarily safeguard against settings with a contamination radius of 0.25. In the settings where the softplus activation function is used, the 0.5-trimmed loss usually leads to models that have converged, if any model has converged at all, often accompanied by Tukey NNs. Notably, the 0.5-trimmed squared loss sometimes shows outstanding behavior in comparison with the other losses, leading to models which are clearly the best ones in any setting without output standardization while still being competitive in settings with output standardization. As the contamination radius increases, the corresponding models are among the best models. Of course, trimming too much information away can also result in undesired behavior, leading to cases where the trimmed losses with a high trimming rate induce weakly performing models on the non-contaminated data.
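To fix ideas, here is a minimal sketch of the trimmed squared loss aggregation (an illustrative implementation; the exact handling of ties and of the number of kept instances in the study may differ slightly): in each epoch, only the fraction of instances with the smallest squared residuals enters the aggregated loss, so the trimmed instances contribute neither to the loss nor to the gradient in that epoch.

```python
import numpy as np

def trimmed_squared_loss(y_true, y_pred, trim_rate=0.25):
    """alpha-trimmed squared loss: average of the smallest (1 - alpha) * n
    per-instance squared residuals; the largest residuals are discarded."""
    sq_residuals = (y_true - y_pred) ** 2
    n_keep = int(np.ceil((1.0 - trim_rate) * len(sq_residuals)))
    kept = np.sort(sq_residuals)[:n_keep]
    return kept.mean()

y = np.array([0.1, -0.2, 0.3, 1e6])            # last response is a gross Y-outlier
pred = np.zeros(4)
print(trimmed_squared_loss(y, pred, 0.25))     # outlier trimmed away -> approx. 0.047
```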

As for the number of training epochs, intuitively, NNs trained according to a trimmed loss with a high trimming rate should require fewer iterations on contaminated data in order to converge, since the gradients of outliers are trimmed away. Of course, there are counterexamples where these NNs require more epochs than most of the other NNs, namely in the cases without output standardization and without contamination or with Y-contamination, respectively; but generally, the number of training epochs is lower than for most of the other NNs, often considerably lower, while the corresponding NNs performed very well concerning the test losses in the cases without output standardization. However, overly quick convergence due to excessive trimming can also be misleading; this happens for the NN trained according to the 0.5-trimmed squared loss in cases where the number of training epochs is very low, which may have prevented these models from approximating the (mostly clean) part of the data well.

The Huber loss function does not lead to zero gradients, in contrast to the Tukey loss function, where large residuals correspond to zero gradients, or trimmed loss functions, where losses (and hence gradients) are trimmed away. The robustification effect of the Huber loss function originates from the constant gradients for large residuals, while the magnitude of moderate residuals still contributes to the gradients. However, as we use the Rprop algorithm, one can ask whether the robustification effect is still valid, as only the sign of the averaged gradients is considered. Indeed, due to the bounded gradients of the Huber loss function, a single outlier cannot contribute much to this average gradient, i.e., the robustification effect here means that single outliers cannot have full control over the average gradient and hence over its sign. Therefore, the Huber loss also has a robustification effect in our simulations, but a weaker one than that of loss functions like the Tukey loss or the 0.5-trimmed loss, because in situations with high contamination radii, the outliers can still pull the average gradients and hence control their sign.
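The following sketch illustrates this sign argument with the gradient of the Huber loss w.r.t. the prediction (the threshold value 1.345 is only an example; in the study, \(\delta\) is chosen adaptively in each epoch): a single gross outlier contributes at most \(\pm \delta\) and therefore cannot flip the sign of the averaged gradient, whereas it dominates the averaged gradient of the squared loss.

```python
import numpy as np

def huber_grad(residual, delta=1.345):
    """Gradient of the Huber loss w.r.t. the prediction (residual = y - y_hat):
    linear inside [-delta, delta], clipped to +/- delta outside."""
    return np.where(np.abs(residual) <= delta, -residual, -delta * np.sign(residual))

def squared_grad(residual):
    # Gradient of 0.5 * residual**2 w.r.t. the prediction: unbounded.
    return -residual

residuals = np.array([-0.5, -0.3, -0.8, 1e6])   # three clean residuals, one gross Y-outlier
print(np.sign(huber_grad(residuals).mean()))    # +1: the clean instances still decide the sign
print(np.sign(squared_grad(residuals).mean()))  # -1: the single outlier flips the averaged sign
```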

Overall, the Huber NNs clearly suffer from contamination in the settings without output standardization, but perform very well in settings with output standardization and logistic activation and, for shallow networks, also with softplus activation (see Sect. 5.5, where we pointed out that for softplus activation, shallow networks tend to perform better than deep ones), often even leading to a better performance than the NNs trained according to the non-trimmed squared loss on non-contaminated data and being competitive on contaminated data. Note again that the effect of network depth is particularly striking for the Huber NNs, which generally perform very well for softplus activation and output standardization when the networks are shallow, but not when they are deep. However, in some settings, they do not perform as desired even for logistic activation.

As for the number of training epochs, in the cases without output standardization, it is still lower than for the squared loss and often comparable to or even lower than for the Tukey loss or the 0.1-trimmed squared loss; for standardized output, one can clearly observe that the Huber NNs generally require far fewer training epochs than the NNs trained according to the squared loss, while generally requiring more epochs than the NNs trained according to the 0.5-trimmed squared loss. Interestingly, this aspect also degrades for deep networks, where the number of training epochs is relatively high in cases without output standardization while generally remaining very low when using output standardization.

The Tukey NNs lead to anomalous behavior in several cases. One can observe that without output standardization, the performance of the Tukey NNs is competitive for Y-contamination, but the number of converged Tukey NNs quickly decreases with growing contamination magnitudes and radii if X-contamination is included. For data with output standardization, the test loss corresponding to the Tukey NNs is visibly higher than for most of the other NNs, in particular if X-contamination is involved. For deep networks, the behavior is similar, although for low contamination radii, the performance of the Tukey NNs on data with output standardization and X- and XY-contamination is comparable to that of the other NNs, while it decreases significantly on data with XY-contamination when the contamination radius increases. These characteristics carry over to the scenarios with higher-dimensional data, where the performance is even worse than on the smaller data, both in terms of test losses and converged NNs.

Figures 5 and 6 support the discussion in this and the previous subsection. We want to emphasize that the boxplots should not be misinterpreted: one could ask whether the advantage of robust losses in contaminated settings really is as high as discussed. Apart from the logarithmic scale, one has to account for the fact that the boxplots gather all test losses over different scenarios, including the “easy” scenarios where a robust loss leads to a similar performance as a non-robust loss. The key difference lies in the challenging scenarios, which, at least for the case of non-standardized output, appear as outliers and upper quartiles in the boxplots.

Fig. 5

Boxplots of all test losses corresponding to different configurations of the loss function and the contamination. Left three columns: No contamination (\(L_2\)-loss, Huber loss, Tukey loss), blocks of 9 boxplots (from left to right): Y-contamination, X-contamination, XY-contamination. Within each block: left: \(r=0.1\), middle: \(r=0.25\), right: \(r=0.4\), within each block of three boxplots for a given r: \(L_2\)-loss (left), Huber loss (middle), Tukey loss (right)

Fig. 6

Boxplots of all test losses corresponding to different configurations of the loss function and the contamination. Left three columns: No contamination (trimmed losses with trimming rate 0.1, 0.25, and 0.5, respectively), blocks of 9 boxplots (from left to right): Y-contamination, X-contamination, XY-contamination. Within each block: left: \(r=0.1\), middle: \(r=0.25\), right: \(r=0.4\), within each block of three boxplots for a given r: Trimmed loss with trimming rate 0.1, 0.25, 0.5, respectively

As for the number of training epochs, the Tukey NNs converge rather quickly in the settings where they perform well, i.e., on data with Y-contamination and standardized responses and where the NNs use logistic activation functions. If X-contamination is included, the number of training epochs is in general very high, which may be a consequence of implicitly trimming many instances away due to zero gradients (as the threshold of \(k=4.685\) is fixed during training), slowing down the training progress and, on top of that, misleading the NNs, since the test losses are high in these cases. In particular in situations with softplus activation, one can also observe the contrary behavior: the test loss in these cases is large, but the number of training epochs is extremely low, evidently due to the large residuals resulting from the unbounded activation function, so that a high number of instances receives a zero gradient, stopping training far too early. This issue has obviously occurred in very challenging situations such as for non-standardized responses with contaminated data and ReLU activation, where only Tukey NNs have converged, but with an awkwardly low average number of training epochs of around 1 or 2. In other words, while no other NN converged here, evidently due to exploding gradients, some Tukey NNs were able to perform only a few training epochs until all residuals were so large that every instance was trimmed, leading to zero gradients and hence numerical convergence.
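For reference, a sketch of the Tukey biweight gradient with the fixed threshold \(k=4.685\) (assuming the standard biweight formulation): residuals beyond k receive an exactly zero gradient, which explains both the implicit trimming and the spurious "convergence" after only one or two epochs once all residuals are huge.

```python
import numpy as np

def tukey_grad(residual, k=4.685):
    """Gradient of the Tukey biweight loss w.r.t. the prediction (residual = y - y_hat):
    redescends smoothly towards zero and is exactly zero for |residual| > k."""
    inside = np.abs(residual) <= k
    psi = residual * (1.0 - (residual / k) ** 2) ** 2   # influence function psi(r)
    return np.where(inside, -psi, 0.0)

residuals = np.array([0.5, 3.0, 10.0, 1e5])
print(tukey_grad(residuals))   # the two large residuals contribute exactly zero gradient
```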

As for the statistical significance of whether a robust loss function leads to better results, i.e., smaller losses, than a non-robust loss function in the presence of contamination, we summarize all our experiments using Wilcoxon tests. More precisely, for each pair of two loss functions and for each of the 28 different contamination scenarios (no contamination; for Y-, X-, and XY-contamination, respectively, the nine scenarios defined by \(\mu _\textrm{out}\) and r), we perform two Wilcoxon tests for each of the 108 sub-configurations (defined by the combinations of network depth, data dimensionality, activation function, true underlying input–output function, and standardization), one left-sided and one right-sided. At the end, we compute the relative number of all sub-configurations where the p-value was smaller than 0.05. Due to sub-configurations where too few networks converged, there may be fewer than 108 tests. Note that we do not have multiplicity issues here because the individual tests for each of the 840 outer cases (28 outer scenarios times 15 pairs of loss functions times 2 one-sided tests) are based on independent samples, as the data are independent across these configurations. Since we report the results for each outer case separately and do not aggregate the test results, we do not need to account for multiplicity here either. The results can be found in Table 4 for non-contaminated data, Table 5 for Y-contaminated data, Table 6 for X-contaminated data, and Table 7 for data with both Y- and X-contamination.
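A sketch of the per-sub-configuration testing step is given below; we assume a paired Wilcoxon signed-rank test on the replication-wise test losses of two loss functions (the exact pairing and the handling of non-converged runs in the study may differ), with the two one-sided alternatives as described above.

```python
from scipy.stats import wilcoxon

def one_sided_wilcoxon_pair(losses_a, losses_b, alpha=0.05):
    """Two one-sided Wilcoxon signed-rank tests for one sub-configuration.

    losses_a, losses_b: test losses of the converged replications of two NNs,
    paired by replication. Returns whether A is significantly better (smaller
    losses) than B and whether B is significantly better than A at level alpha.
    """
    a_better = wilcoxon(losses_a, losses_b, alternative="less").pvalue < alpha
    b_better = wilcoxon(losses_a, losses_b, alternative="greater").pvalue < alpha
    return a_better, b_better

# Aggregation as reported in Tables 4-7 (sub_config_losses is a hypothetical list
# of (losses_a, losses_b) pairs, one per sub-configuration of a given scenario):
# frac_a_better = sum(one_sided_wilcoxon_pair(a, b)[0]
#                     for a, b in sub_config_losses) / len(sub_config_losses)
```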

Table 4 Results of the Wilcoxon tests for non-contaminated data. The upper triangle reports the relative number of sub-configurations where the right-sided Wilcoxon test leads to a p-value less than 0.05. Analogously, the lower triangle reports the results for the left-sided tests. In other words, the number in a cell always corresponds to the relative fraction of cases where NN training with the loss in the upper row leads to a significantly lower median test loss than NN training with the loss in the left column
Table 5 Results of the Wilcoxon tests for data with Y-contamination. The upper triangle reports the relative number of sub-configurations where the right-sided Wilcoxon test leads to a p-value less than 0.05. Analogously, the lower triangle reports the results for the left-sided tests. In other words, the number in a cell always corresponds to the relative fraction of cases where NN training with the loss in the upper row leads to a significantly lower median test loss than NN training with the loss in the left column
Table 6 Results of the Wilcoxon tests for data with X-contamination. The upper triangle reports the relative number of sub-configurations where the right-sided Wilcoxon test leads to a p-value less than 0.05. Analogously, the lower triangle reports the results for the left-sided tests. In other words, the number in a cell always corresponds to the relative fraction of cases where NN training with the loss in the upper row leads to a significantly lower median test loss than NN training with the loss in the left column
Table 7 Results of the Wilcoxon tests for data with X- and Y-contamination. The upper triangle reports the relative number of sub-configurations where the right-sided Wilcoxon test leads to a p-value less than 0.05. Analogously, the lower triangle reports the results for the left-sided tests. In other words, the number in a cell always corresponds to the relative fraction of cases where NN training with the loss in the upper row leads to a significantly lower median test loss than NN training with the loss in the left column

We can observe in Table 4 that the Huber loss leads to a significantly better median of the test losses than the other losses in 37–61% of all configurations, and to a worse median in at most 9% of the cases. This is an interesting result that motivates using the Huber loss even instead of the squared loss on clean data, as the number of training steps is generally lower, possibly preventing the network from overfitting. In contrast, trimming away large fractions of the data leads to a performance loss, as the 0.25- and 0.5-trimmed squared losses do not perform well here.

On Y-contaminated data, we can observe in Table 5 that the performance of the Tukey loss and the trimmed losses increases as the contamination amount and the contamination radius increase. While this observation was expected, it is surprising that the squared loss also leads to an increased performance as the contamination amount and radius increase. The Huber loss, in contrast, leads to a decreased performance. This may partly be explained by the fact that both losses are non-robust, but that, due to the complex structure of neural networks, the very susceptible squared loss leads to a network that, despite overfitting, can model the outliers and the clean data rather well, while the Huber loss leads to a tradeoff that no longer performs well on out-of-sample data.

As for X-contaminated data, one can see a slight tendency that the squared loss leads to a worse performance as the contamination amount and radius increase. In contrast to the case with Y-contamination, there are many more cases where the test did not lead to a p-value less than 0.05. Overall, one can observe that the 0.5-trimmed loss leads to the best results.

For data with X- and Y-contamination, the phenomenon that the squared loss leads to an increased performance for larger contamination amounts and radii is inherited, but not to the same extent as for pure Y-contamination. Similarly, the performance decrease of the Huber loss is also less pronounced here than for pure Y-contamination. One can clearly observe that the Tukey loss and the 0.5-trimmed squared loss perform best here.

5.7 Discussion

The goal of this work was to investigate the effect of different types of contamination on feed-forward regression networks with different network configurations, in particular, different loss functions. We are aware of the fact that in real data analysis, extreme outliers like the ones we generated in the settings with at least \(\mu _\textrm{out}=10^3\) would very likely be detected when applying an outlier detection algorithm first. Nevertheless, contamination can have a masking effect [26] in the sense that contaminated data points let other contaminated data points appear to be non-contaminated, making the detection of all outliers in practice very complicated.

The results of our simulations indicate, as expected, that contamination decreases the performance of the trained NNs and even prevents some NNs from converging within a given number of training epochs. This observation holds throughout the whole simulation study, but one has to point out that the effect of X-contamination depends on the activation function, since logistic activation decreased the effect of X-outliers due to the boundedness of the activation function, so that a growing magnitude of contamination essentially has no further effect. Our results are consistent with those from the literature; in particular, in [49], it has also been observed that the effect of X-contamination when using a sigmoid activation function is limited on the loss scale; however, NNs trained according to robust approaches (least median and least trimmed squares) still showed a better performance in these settings.

Output standardization is essentially always done in practice as it speeds up the training procedure. In this study, we nevertheless also wanted to examine cases without output standardization. As expected, contamination then leads to a drastic decrease in performance; in particular, sometimes nearly no NN converges. For the settings with output standardization, one could ask whether contamination has an effect at all, as the responses are standardized to [0, 1]. Nevertheless, contamination indeed significantly decreases the performance of the models, motivating the use of robust NNs.
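To see why Y-outliers still matter despite the [0, 1] range, consider a minimal min-max scaling sketch (assuming min-max standardization of the training responses, which is consistent with the [0, 1] range mentioned above; the exact procedure in the study may differ): a single gross outlier compresses the clean responses into a tiny sub-interval, so much of the usable signal is lost on the standardized scale.

```python
import numpy as np

def minmax_standardize(y):
    # Map responses to [0, 1] using the training minimum and maximum.
    return (y - y.min()) / (y.max() - y.min())

y_clean = np.array([1.0, 2.0, 3.0, 4.0])
y_contam = np.array([1.0, 2.0, 3.0, 1e6])      # one gross Y-outlier
print(minmax_standardize(y_clean))             # [0.  0.333  0.667  1.]
print(minmax_standardize(y_contam))            # clean responses squashed near 0: [0.  1e-06  2e-06  1.]
```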

As for the robust NNs, we can conclude that the Tukey loss is not suitable for network training due to the vanishing gradients, preventing the NN from learning. This issue results either in far too fast convergence or in a very high number of training epochs required for convergence. Although the Tukey NNs performed rather well on data with both Y- and X-contamination, their performance was not convincing throughout the simulation study, as they often fail on data with pure Y-contamination, pure X-contamination, or on clean data. One could try to empirically adapt the parameter k of the Tukey loss, but it does not seem necessary to focus on the Tukey loss here because, in contrast, the Huber NNs generally perform very well in the settings with output standardization and logistic activation, where the parameter \(\delta\) has been adaptively chosen in each epoch. Moreover, they generally require fewer training epochs than the NNs trained according to the non-trimmed squared loss. As elaborated in the previous subsection, although the gradients of the Huber loss function are bounded and although we use the Rprop algorithm, the Huber NNs are not robust against large contamination amounts, since a large group of outliers can gain full control over the average gradients and hence over their sign; therefore, a Huber NN is prone to fail on contaminated data, in particular on data with Y-contamination. Trimmed losses with a high trimming rate turned out to be highly competitive throughout the whole simulation study, in terms of test losses, the number of converged NNs, and the required training epochs. However, excessive trimming can also mislead the training process, in particular on clean data, apart from the fact that the trimming rate is another hyperparameter whose optimal value cannot be easily selected.

Summarizing, we recommend replacing the squared loss function in regression applications of feed-forward NNs either by the Huber loss function if one can expect a low contamination radius, or by a trimmed squared loss function otherwise.

6 Conclusion

In this paper, we formalized and investigated the global quantitative robustness of feed-forward regression neural networks.

First, we argued why the classical regression BDP from robust statistics is not suitable for measuring the quantitative robustness of a regression NN due to the iterative training procedure in combination with techniques like gradient clipping or resilient backpropagation that erroneously could let highly non-robust NNs appear to be robust. We proposed an adapted version of the regression BDP for neural networks, which is based on suprema of the norm of the network weights over all possible contaminated samples and over the training iterations.

We formally proved that a feed-forward NN with an unbounded output activation and an unbounded loss function has a BDP of zero while a trimmed loss aggregation indeed induces NNs with a positive BDP.

We conducted an extensive simulation study in which we compared the performance of NNs trained according to the squared loss, the Huber loss, the Tukey loss, and several trimmed squared losses on a plethora of different scenarios, which differ by the amount of contamination, the size of the data, the activation function, the standardization of the responses, and the underlying true regression structure. We computed the average squared loss on independent and non-contaminated test data, the average number of training epochs, and the number of converged NNs. Moreover, we performed Wilcoxon tests for each pair of loss functions and for each configuration and computed the relative number of significant test results, distinguished by the contamination type, the contamination amount, and the contamination radius.

The main observations were: Output standardization generally leads to smaller test losses, a higher number of converged training procedures, and a lower number of training iterations; the performance of the NNs trained according to the non-trimmed squared loss suffers from contamination, in particular if Y-contamination is included, while the effect of X-contamination is bounded if logistic activation is used; the logistic activation function generally prevents networks from being distorted by X-contamination as the activations are bounded; X-contamination clearly decreases the performance of the networks when using the softplus activation function; deep networks are difficult to train when using an unbounded activation function such as the softplus activation due to exploding gradients; deep networks can adapt even to contaminated data much faster than shallow networks and therefore may require fewer training iterations; the effects of contamination and the other parameters carry over to all considered underlying data structures.

As an unexpected observation, it turned out that even networks trained according to the squared loss can perform well on severely contaminated data and even outperform the Huber loss on data with Y-contamination. One may suspect that, due to the complex structures that neural networks can approximate, some networks reach the interpolating regime where the risk decreases again. Here, however, one has to be very cautious, as this interpolating behavior has not yet been proven for neural networks in general, in particular not for underlying mixture distributions where the ideal distribution is convex-contaminated with another distribution, which is the case in our experiments.

One can observe that the overall behavior of the Huber NNs and the NNs trained according to a trimmed squared loss with a high trimming rate (0.25, 0.5) is best in all three aspects. However, one has to account for the contamination type and radius here. The Huber loss leads to a good performance if the contamination radius is rather small, while trimmed losses with a high trimming rate suffer from information loss on clean data and should therefore be used on data where one can expect a high contamination radius.

We conclude that, for regression FFNNs, one should consider using the Huber loss function on data where the contamination radius can be expected to be rather small, and a trimmed squared loss on data where the contamination radius can be expected to be rather high.

Robustness of graph convolutional networks has been considered in [12], who propose a median or trimmed mean aggregation of the neighbors' information in order to robustify the networks. Future work should consider a formal notion of robustness and the robustification of other types of neural networks.