1 Motivation

Optimization problems based on neural networks arise in a variety of different contexts, including function approximation, pattern recognition, sequential decision making, data compression and many others. This work is driven by the two key motivations highlighted below.

Adversarial attacks for adversarial robustness. Neural networks trained on large datasets have achieved breakthroughs in speech and visual recognition tasks and many other applications in recent years [24, 54]. However, these models have been shown to be susceptible to imperceptible perturbations [55]. This motivates the study of adversarial robustness, that is, the design of classifiers that are robust to perturbations within a small ball around the input, typically measured in the \(\ell _p\) norm [5, 7, 9, 15, 37, 40, 57].

Thus, in the adversarial setting, the standard binary classification loss is replaced by its adversarial counterpart, which is defined, for a given data point, as the supremum of the binary loss over a ball of a pre-specified radius around the given point. The adversarial counterpart of the classification margin loss is similarly defined via its supremum over the ball. Optimizing the binary loss or adversarial binary loss directly is NP-hard for most hypothesis sets. A natural alternative, instead, is to optimize a surrogate margin-based loss with an alternative function \(\Phi \), such as the hinge loss, that admits more favorable properties such as differentiability, convexity and statistical consistency with respect to the target loss [2,3,4, 8, 29, 36, 38, 39, 41, 42, 53, 56, 62, 63].

However, the evaluation of the above surrogates still requires computing the adversarial attack, that is, solving the optimization problem induced by the supremum in the definition of the loss. For a surrogate margin-based loss, this amounts to a margin minimization problem, that is, finding the strongest adversarial attack point in the ball, the one that minimizes the margin. This objective should of course be distinguished from the standard goal of margin maximization in learning and generalization [45].

Margin minimization is a notoriously difficult problem in adversarial training for which a variety of heuristics have been developed. We will show that this problem can be cast as a DC-programming problem by proving a decomposition of the multi-class margin as a difference of two convex functions. The DCA algorithm of Pham Dinh and Le Thi [46, 47] can then be used to solve the optimization problem.

We note that Seck et al. [50] also applied DCA to adversarial robustness verification. However, their optimization problem is to find the minimal perturbation required to flip the label of the clean example, which is distinct from our setting. In addition, the approaches used to address the problems are different: their technique involved the use of DC to eliminate some binary variables in the optimization problem, while our method provides an explicit DC-decomposition of neural networks to solve the margin minimization problem.

Approximate optimization of complex functions. Neural networks with at least one hidden layer and enough hidden units are known to be universal approximators, that is, they are able to approximate any measurable function defined over finite-dimensional spaces to an arbitrary degree of accuracy [22]. On the other hand, the minimization or maximization of an arbitrary complex function remains an open problem in the optimization literature. The expressive power of neural networks and the challenge of optimizing an arbitrary function motivate the study of approximate optimization, that is, the design of algorithms that can closely optimize any function that is well approximated by neural networks.

Let F be a very complex or costly function to optimize. Then, the approximate optimization procedure consists of the following steps:

  • Sampling: draw a large number of pairs \(\{ (x^i, F(x^i)) \}_{i = 1}^m\);

  • Learning: use a deep neural network h to fit F on that sample by minimizing an appropriate loss;

  • Optimization: find the minimizer or maximizer of the learned neural network h.
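
As a rough, self-contained sketch of the three steps (the toy target \(F(x) = \sin (3x)\), the random-feature fit, and all variable names are illustrative assumptions, not part of the procedure above):

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(3 * x)        # toy stand-in for a costly black-box function

# Sampling: draw pairs (x_i, F(x_i)) on the domain of interest.
m = 200
xs = rng.uniform(-1.0, 1.0, size=m)
ys = F(xs)

# Learning: a one-hidden-layer ReLU network; hidden weights are drawn at
# random and only the output weights are fitted by least squares, a simple
# stand-in for full training.
n_hidden = 100
A = rng.normal(scale=3.0, size=n_hidden)     # hidden-layer weights
b = rng.uniform(-3.0, 3.0, size=n_hidden)    # hidden-layer offsets
feats = np.maximum(np.outer(xs, A) + b, 0.0)       # ReLU features, (m, n_hidden)
w, *_ = np.linalg.lstsq(feats, ys, rcond=None)     # output weights

h = lambda x: np.maximum(np.outer(np.atleast_1d(x), A) + b, 0.0) @ w

# Optimization: here a dense grid stands in for the DCA step of Sect. 3.4.
grid = np.linspace(-1.0, 1.0, 2001)
x_star = grid[np.argmin(h(grid))]
```

On this toy problem, x_star lands close to the true minimizer \(-\pi /6\) of \(\sin (3x)\) on \([-1, 1]\).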

The first two steps of sampling and learning have been extensively studied in the literature. However, the final step of optimization of neural networks brings up new challenges due to the lack of convexity, the complexity of architectures and the diversity of neural networks. We will show that the neural networks typically used in practice are all DC-functions (differences of convex functions) of the input feature vector and thus that the DCA algorithm [46, 47] can be adopted in the final step, which helps solve the approximate optimization problem efficiently.

In the next section, we introduce the notation and some basic concepts for DC-programming and the hypothesis sets of neural networks considered in this paper: multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). In Sect. 3, we derive the explicit form of a DC-decomposition for MLPs and CNNs, which we use to derive a DCA solution for the approximate optimization of complex functions. In Sect. 4, we give a detailed description of the margin minimization problem and present a DCA solution for the corresponding computation of the adversarial attack for adversarial robustness. To do so, we prove that the DC-decomposition derived in Sect. 3 naturally leads to a DC-decomposition of the margin. In Sect. 5, we report the results of experiments demonstrating the effectiveness of the DCA algorithms proposed in Sects. 3 and 4.

2 Preliminaries

We start by introducing some basic concepts and results that are related to DC-programming.

2.1 DC-programming definitions and algorithms

Definition 1

(DC-functions) We say that a hypothesis \(h :\mathbb {R}^n \rightarrow \mathbb {R}\) is a DC-function if it admits a DC-decomposition, that is, if it can be written as the difference of two convex functions (DC) f and g:

$$\begin{aligned} h = f - g. \end{aligned}$$

In that case, functions f and g are called DC-components of the hypothesis h.

The optimization problem corresponding to such a function h is referred to as a DC-programming problem and defined as follows:

$$\begin{aligned} p := \inf \left\{ h(x) = f(x)-g(x) :x\in \mathbb {R}^n\right\} . \end{aligned}$$
(1)

Observe that any constraint \(x\in \mathcal {C}\) with a closed convex set \(\mathcal {C}\subset \mathbb {R}^n\) can be equivalently incorporated into the standard DC-program by letting \(h(x) = [{f(x) + \chi _{\mathcal {C}}(x)}] - g(x)\), where \(\chi _{\mathcal {C}}(x):= {\left\{ \begin{array}{ll} 0 &{} x\in \mathcal {C}\\ +\infty &{} \text {otherwise}. \end{array}\right. }\)

To find the global minimum of (1), several approaches have been discussed in the literature [48], including the pioneering combinatorial approach proposed by Hoang [17], its further development for low-rank non-convex structures [16], and the branch-and-bound algorithm, which admits exponential convergence [49], with the correction of Hoang [18]. Nevertheless, as pointed out by Pham Dinh and Le Thi [46], these global algorithms are not able to solve real-world high-dimensional DC-programs.

Instead, an alternative method based on convex analysis, the DCA algorithm [46, 47], is often adopted in practice, which can be further combined with branch-and-bound techniques to find a global optimum. When the function g is sub-differentiable, DCA coincides with the concave-convex procedure (CCCP) of Yuille and Rangarajan [61]. The pseudocode of DCA is given for that case in Algorithm 1; here, \(\partial g\) denotes any subgradient of the function g and \(\cdot \) denotes the inner product of two vectors.

Algorithm 1
figure a

DCA algorithm solving DC-program (1).

The algorithm consists of iteratively replacing g with its first-order approximation and solving the resulting convex optimization problem. The stopping criterion is checked at the end of each iteration. Figure 1 illustrates this procedure for minimizing the non-convex function \(h:x \mapsto -\frac{1}{1 + x^2}\) that can be decomposed into the difference of two convex functions f and g. DCA is a primal-dual sub-differential method [46, 47], which can handle DC-programs (1) with proper lower semi-continuous convex functions f and g. In this paper, we propose solutions based on DCA (Algorithm 2 and Algorithm 3) for the computation of the adversarial attack and approximate optimization problems described in Sect. 1 and further demonstrate their effectiveness for these problems in Sect. 5.

Fig. 1
figure 1

Left: an illustration of DCA in Algorithm 1 for minimizing the non-convex function \(h:x \mapsto -\frac{1}{1 + x^2}\). For each t, the convex function \({\mathscr {C}}_t\) has the same value and derivative as h at \(x = x_t\) and upper bounds h by definition. Right: DC-components f and g of the function \(h:x \mapsto -\frac{1}{1 + x^2}\), where \(f:x \mapsto x^2\) and \(g:x \mapsto x^2 + \frac{1}{1 + x^2}\)
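
For this one-dimensional example, the DCA iteration can be written out in a few lines; since \(f:x \mapsto x^2\), the convex subproblem \(\mathrm{argmin}_x\, f(x) - g'(x_t)\,x\) has the closed-form solution \(g'(x_t)/2\) (a minimal sketch of Algorithm 1 for this particular h, with illustrative names):

```python
# DCA for h(x) = -1/(1 + x^2) with f(x) = x^2 and g(x) = x^2 + 1/(1 + x^2).
h = lambda t: -1.0 / (1.0 + t * t)

def grad_g(t):
    # g'(t) = 2t - 2t / (1 + t^2)^2
    return 2 * t - 2 * t / (1 + t * t) ** 2

x = 1.5                              # starting point x_1
for _ in range(100):
    # Convex subproblem: argmin_x x^2 - grad_g(x_t) * x = grad_g(x_t) / 2.
    x_next = grad_g(x) / 2
    if abs(x_next - x) < 1e-12:      # stopping criterion
        break
    x = x_next
```

The iterates decrease h monotonically and converge to the global minimizer x = 0 of this particular function.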

The DCA algorithm is based on the decomposition of a function, and benefits from the flexibility of such a decomposition. Indeed, if \(h = f - g\) is a DC-decomposition, adding a convex auxiliary function to both f and g still results in a DC-decomposition. In practice, a good auxiliary function, such as a strongly convex one, e.g., \(\lambda \Vert \, \cdot \, \Vert ^2\), often makes the DCA algorithm more efficient [33, 34]. The convergence of DCA has been discussed extensively in the literature [31, 46, 47, 51, 59]. Pham Dinh and Le Thi [46] first proved that DCA is guaranteed to converge to a critical point. The same result also holds for the CCCP algorithm, which is a special case of DCA [51]. Moreover, DCA can find the global optimum of the trust region problem [47]. For many DC-programs related to support vector machines (SVM) [10], the DCA algorithm admits linear convergence and thus its number of iterations is relatively small [59]. In practice, an effective heuristic adopted in many applications consists of using DCA with multiple restarting points to minimize the objective function [43, 44]. In some applications, global optimality can be efficiently tested and, in fact, DCA typically leads to the global optimum without even resorting to such heuristics [11, 19,20,21].

DCA can also often be combined with branch-and-bound techniques to determine the global optimum, which can be achieved efficiently in some instances [30, 35]. More generally, as pointed out by Le Thi and Pham Dinh [32,33,34], DCA admits the following benefits: (1) Flexibility: a suitable choice of the DC-decomposition can make DCA more robust and efficient, and even lead to the global optimum; (2) Convergence: DCA admits linear convergence for general DC-programs and, in particular, admits finite convergence for polyhedral DC-programs; (3) Versatility: DCA can recover many standard algorithms with a careful choice of the DC-decomposition for both convex and non-convex optimization. For a convex program, DCA can converge to the global optimum by reinterpreting it as a DC-program.

In view of these advantages, DCA can be applied to a wide range of non-convex optimization problems and applications, including non-convex quadratic programs [13], the trust region problem [47], kernel selection [1], learning in second-price reserve auctions [43, 44], forecasting time series [26,27,28], discrepancy estimation in domain adaptation [6], eigenvalue problems [52] and many others; in these applications, DCA scales to large problems, adapts to varied objectives, and benefits from theoretical guarantees. In this paper, we will apply DCA to two problems related to the learning and optimization of neural networks.

2.2 Neural network definitions

Let \({\mathscr {X}}\) denote the input feature space and \({\mathscr {Y}}\) denote the label space. In the approximate optimization of complex functions, \({\mathscr {Y}}= \mathbb {R}\) is real-valued, while in the computation of the adversarial attack for adversarial robustness, \({\mathscr {Y}}= \{ 1, \ldots , c \}\) is a set of \(c \ge 2\) classes. Let \({\mathscr {H}}\) be a hypothesis set of functions mapping from \({\mathscr {X}}\times {\mathscr {Y}}\) to \(\mathbb {R}\) and \(\overline{h}(x) = ( h(x, 1), \ldots , h(x, c) )\) be the output vector of a hypothesis \(h\in {\mathscr {H}}\) in multi-class classification. For each class \(z\in \{ 1, \ldots , c \}\), the real value \(h(x, z)\) can be viewed as the score assigned to class z by h. For example, if we let \({\mathscr {H}}\) be the family of feedforward neural networks with L hidden layers (introduced in (2) below), then \(\overline{h}(x)\) is equal to \(a^{[L+2]}(x)\) and \(\{ x \mapsto h(x,z),~z\in \{ 1, \ldots , c \} \}\) are the component functions of \(a^{[L+2]}\). We denote by \(\ell :{\mathscr {H}}\times {\mathscr {X}}\times {\mathscr {Y}}\rightarrow \mathbb {R}\) a loss function and thus by \(\ell (h, x, y)\) the loss of a hypothesis h for a pair \((x, y)\). For the hypothesis set of neural networks, we will specifically consider the family of feedforward neural networks (also known as multilayer perceptrons (MLPs)) and convolutional neural networks (CNNs) [14].

The feedforward neural network is a quintessential artificial neural network, which typically consists of one input layer, L hidden layers and one output layer, and can be represented in the following form:

$$\begin{aligned} \begin{aligned} a^{[1]}(x)&= x\in \mathbb {R}^{n_1}, \\ a^{[l]}(x)&= \sigma \big ( W^{[l]}a^{[l-1]}(x)+b^{[l]} \big ) \in \mathbb {R}^{n_l},\text { for }l = 2, \ldots , L + 1, \\ a^{[L+2]}(x)&= W^{[L+2]}a^{[L+1]}(x)+b^{[L+2]} \in \mathbb {R}^{n_{L+2}}, \end{aligned} \end{aligned}$$
(2)

where given an input \(x\in \mathbb {R}^{n_1}\), we use \(n_l\) to denote the dimension of the output of the lth layer, \(a^{[l]}(x)\). Here, \(W^{[l]}\in \mathbb {R}^{n_l\times n_{l-1}}\) and \(b^{[l]}\in \mathbb {R}^{n_l}\) denote the weight matrix and the offset vector at layer l, and \(\sigma \) denotes the ReLU activation. The expressions (2) define vector-valued functions \(a^{[l]} :\mathbb {R}^{n_1} \rightarrow \mathbb {R}^{n_l}\), for \(l=1,\ldots ,L+2\). For the MLP given by (2), we also refer to \(L + 2\) as its depth and \(n_l\) as the number of units at the lth layer. The simplest MLPs are those with one hidden layer, and are often referred to as one-hidden-layer neural networks, i.e., in the form of (2) with \(L=1\).
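
A direct transcription of (2) into code may be helpful (shapes and names are illustrative; \(\sigma \) is the ReLU):

```python
import numpy as np

def mlp_forward(x, Ws, bs):
    """Forward pass of the MLP in (2).

    Ws, bs: weight matrices W^{[2]}, ..., W^{[L+2]} and offsets b^{[2]}, ...;
    ReLU is applied on every layer except the last, which stays linear."""
    a = x                                   # a^{[1]}(x) = x
    for W, b in zip(Ws[:-1], bs[:-1]):      # hidden layers l = 2, ..., L+1
        a = np.maximum(W @ a + b, 0.0)      # ReLU activation
    return Ws[-1] @ a + bs[-1]              # output layer a^{[L+2]}

# One hidden layer (L = 1) with identity weights: ReLU zeroes the negative entry.
out = mlp_forward(np.array([-1.0, 2.0]),
                  [np.eye(2), np.eye(2)], [np.zeros(2), np.zeros(2)])
```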

The convolutional neural network is a specialized kind of neural network for processing input feature vectors with a known grid-like structure, such as images with a two-dimensional grid of pixels. A CNN typically has convolution layers and pooling layers. Let X denote the input image with height \(n_{h_1}\), width \(n_{w_1}\) and number of channels \(n_{c_1}\). The convolution layers can be represented in the following form:

$$\begin{aligned} \begin{aligned} s^{[1]}(X)&= X \in \mathbb {R}^{n_{h_1}\times n_{w_1} \times n_{c_1}}, \\ s^{[l]}(X)&= \sigma \big ( W^{[l]} *s^{[l-1]}(X)+b^{[l]} \big ) \in \mathbb {R}^{n_{h_l}\times n_{w_l} \times n_{c_l}},\text { for }l=2,\ldots ,L+1, \end{aligned} \end{aligned}$$
(3)

where given an input image \(X \in \mathbb {R}^{n_{h_1}\times n_{w_1} \times n_{c_1}}\), \(s^{[l]}(X)\) is the output of the lth convolution layer with dimension \(n_{h_l}\times n_{w_l} \times n_{c_l}\), for \(l = 2, \ldots , L + 1\). Here, \(\sigma \) is the ReLU activation, \(W^{[l]}\in \mathbb {R}^{k_{h_l} \times k_{w_l} \times n_{c_{l-1}} \times n_{c_l} }\) is the convolution kernel and \(b^{[l]}\in \mathbb {R}^{1\times 1\times 1\times n_{c_l}}\) is the offset at layer l. The symbol \(*\) is used to denote the convolution. With VALID padding and a stride of \(1\times 1\) [14], the result of the convolution \(W^{[l]} *s^{[l-1]}(X)\) is of the shape \(n_{h_l}\times n_{w_l} \times n_{c_l}\), where

$$\begin{aligned} n_{h_l}&= n_{h_{l-1}} - k_{h_l} + 1, \\ n_{w_l}&= n_{w_{l-1}} - k_{w_l} + 1, \end{aligned}$$

and it is defined by the following equation:

$$\begin{aligned} \big ( W^{[l]} *s^{[l-1]}(X) \big )_{i,j,t} = \sum _{r=0}^{n_{c_{l-1}} - 1} \sum _{u=0}^{k_{h_l}-1} \sum _{v=0}^{k_{w_l}-1} W^{[l]}_{u,v,r,t} s^{[l-1]}(X)_{i+u, j+v, r}. \end{aligned}$$
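
The triple sum above can be transcribed directly as a naive loop (VALID padding, stride \(1\times 1\); shapes and names are illustrative):

```python
import numpy as np

def valid_conv(W, s):
    """(W * s)_{i,j,t} as in the displayed sum.

    W: kernel of shape (k_h, k_w, c_in, c_out); s: input of shape (n_h, n_w, c_in)."""
    k_h, k_w, c_in, c_out = W.shape
    n_h, n_w, _ = s.shape
    # Output shape follows n_{h_l} = n_{h_{l-1}} - k_{h_l} + 1 (same for width).
    out = np.zeros((n_h - k_h + 1, n_w - k_w + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for t in range(c_out):
                # The sums over r, u, v collapse into one elementwise product.
                out[i, j, t] = np.sum(W[:, :, :, t] * s[i:i + k_h, j:j + k_w, :])
    return out

# A 2x2 all-ones kernel sums each 2x2 window of a 3x3 single-channel input.
s = np.arange(9, dtype=float).reshape(3, 3, 1)
out = valid_conv(np.ones((2, 2, 1, 1)), s)
```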

The pooling layers further modify the output by replacing each element with a summary statistic of the elements within its neighborhood. We will consider two kinds of pooling: max-pooling and average pooling, which report the maximum and the average output within a rectangular neighborhood, respectively [14].

3 DC-decomposition of neural networks

In this section, we prove that many commonly used types of neural networks are DC-functions of the input feature vector and provide an explicit DC-decomposition for these functions. Building upon that, we present a DCA solution for the approximate optimization of complex functions.

3.1 One-hidden-layer neural networks

We start the analysis with one-hidden-layer neural networks, which admit the following form:

$$\begin{aligned} \begin{aligned} a^{[1]}(x)&= x\in \mathbb {R}^{n_1}, \\ a^{[2]}(x)&= \sigma \big ( W^{[2]}a^{[1]}(x)+b^{[2]} \big ) \in \mathbb {R}^{n_2}, \\ a^{[3]}(x)&= W^{[3]}a^{[2]}(x)+b^{[3]} \in \mathbb {R}^{n_3}. \end{aligned} \end{aligned}$$
(4)

Here, given an input \(x\in \mathbb {R}^{n_1}\), the vectors \(a^{[1]}(x)\), \(a^{[2]}(x)\), and \(a^{[3]}(x)\) are the outputs of the input layer, the hidden layer and the output layer with dimension \(n_1\), \(n_2\), and \(n_3\) respectively. The next theorem shows that each component function of the vector-valued function \(a^{[3]} :\mathbb {R}^{n_1} \rightarrow \mathbb {R}^{n_3}\) is a DC-function of x.

Theorem 1

For one-hidden-layer neural networks (4), the function \(a^{[3]}_i\) can be written as \(f^{[3]}_i - g^{[3]}_i\), for \(i=1,\ldots ,n_3\), where \(f^{[3]}_i\) and \(g^{[3]}_i\) are convex in x.

Proof

Since composition with an affine function preserves convexity, any component function of \(a^{[2]}\) is convex in x. Note that the weight matrix \(W^{[3]}\) can be expressed in terms of its positive and negative parts as

$$\begin{aligned} W^{[3]} = W^{[3]}_+ - W^{[3]}_-, \end{aligned}$$

where \(W^{[3]}_+\) and \(W^{[3]}_-\) are both nonnegative. Therefore, \(a^{[3]}\) admits the following decomposition:

$$\begin{aligned} a^{[3]}&= \big ( W^{[3]}_+a^{[2]}+b^{[3]} \big ) - \big ( W^{[3]}_-a^{[2]} \big ). \end{aligned}$$

We can let \(f^{[3]} = W^{[3]}_+a^{[2]}+b^{[3]}\) and \(g^{[3]} = W^{[3]}_-a^{[2]}\), since nonnegative weighted sum preserves convexity. \(\square \)

Theorem 1 also holds for the special case where \(a^{[3]}\) only has one component function. This implies that any one-hidden-layer neural network (4) with \(n_3=1\) admits a DC-decomposition.

Corollary 1

Any one-hidden-layer neural network with one output unit is a DC-function of the input feature vector.
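
The construction in the proof of Theorem 1 can be checked numerically (random weights; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 4, 8, 3
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)
W3, b3 = rng.normal(size=(n3, n2)), rng.normal(size=n3)

x = rng.normal(size=n1)
a2 = np.maximum(W2 @ x + b2, 0.0)      # hidden layer: convex components of x
a3 = W3 @ a2 + b3                      # network output a^{[3]}(x)

# Split W^{[3]} into its nonnegative positive and negative parts.
W3p, W3m = np.maximum(W3, 0.0), np.maximum(-W3, 0.0)
f3 = W3p @ a2 + b3                     # convex component f^{[3]}(x)
g3 = W3m @ a2                          # convex component g^{[3]}(x)
```

By construction, a3 equals f3 - g3 componentwise.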

3.2 Multi-layer perceptrons

The decomposition of one-hidden-layer neural networks can be extended to multilayer perceptrons with L hidden layers (2). The next theorem shows that for an MLP, each component function of the vector-valued function \(a^{[l]} :\mathbb {R}^{n_1} \rightarrow \mathbb {R}^{n_l}\), for \(l = 1, \ldots , L + 2\), is a DC-function of x, which generalizes Theorem 1.

Theorem 2

For multi-layer perceptrons (2), any component function of the vector-valued function \(a^{[l]}\), for \(l = 1, \ldots , L + 2\), is a DC-function of the input feature vector x.

Proof

Clearly, any component function of \(a^{[1]}\) is convex and thus is a DC-function of x. Since composition with an affine function preserves convexity, any component function of \(a^{[2]}\) is convex and thus is a DC-function of x as well.

We then proceed by induction on l. Assume that \(a^{[l]}\) admits the following decomposition:

$$\begin{aligned} a^{[l]} = f^{[l]} - g^{[l]}, \end{aligned}$$

where the component functions of \(f^{[l]}\) and \(g^{[l]}\) are all convex in x. Then, \(c^{[l+1]}:=W^{[l+1]}a^{[l]}+b^{[l+1]}\) can be written as

$$\begin{aligned} c^{[l+1]}&= \big ( W^{[l+1]}_+ - W^{[l+1]}_- \big ) \big ( f^{[l]} - g^{[l]} \big ) + b^{[l+1]} \\&= \big (W^{[l+1]}_+ f^{[l]} + W^{[l+1]}_-g^{[l]} + b^{[l+1]} \big ) - \big ( W^{[l+1]}_-f^{[l]} + W^{[l+1]}_+g^{[l]} \big ). \end{aligned}$$

Since non-negative weighted sum preserves convexity, \(\tilde{f}^{[l+1]}:= W^{[l+1]}_+ f^{[l]} + W^{[l+1]}_-g^{[l]} + b^{[l+1]}\) and \(\tilde{g}^{[l+1]}:= W^{[l+1]}_-f^{[l]} + W^{[l+1]}_+g^{[l]}\) both have convex component functions. Hence, the component functions of \(c^{[l+1]}\) are DC-functions. Now, we have

$$\begin{aligned} a^{[l+1]}&= \sigma \big ( \tilde{f}^{[l+1]} - \tilde{g}^{[l+1]} \big ) \\&= \max (\tilde{f}^{[l+1]}, \tilde{g}^{[l+1]}) - \tilde{g}^{[l+1]}, \end{aligned}$$

where the component functions of \(f^{[l+1]}:=\max (\tilde{f}^{[l+1]}, \tilde{g}^{[l+1]})\) and \(g^{[l+1]}:=\tilde{g}^{[l+1]}\) are all convex.

Note that \(a^{[L+2]} = c^{[L+2]}\). Therefore, any component function of \(a^{[l]}\), for \(l=1,\ldots ,L+2\), is a DC-function. \(\square \)

Similarly to Theorem 1, Theorem 2 holds for the special case where \(a^{[L+2]}\) only has one component function. Thus, Theorem 2 implies that any multilayer perceptron (2) with \(n_{L+2}=1\) admits a DC-decomposition.

Corollary 2

Any multilayer perceptron with one output unit is a DC-function of the input feature vector.

Multilayer perceptrons with one output unit are useful for the approximate optimization of real-valued complex functions, as indicated in Sect. 1. Results such as Corollary 1 and Corollary 2 are thus helpful for the design of a DCA solution to these problems, as shown in Sect. 3.4.
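
The induction in the proof of Theorem 2 translates into code that carries the pair \((f^{[l]}, g^{[l]})\) along with the forward pass; the sketch below (illustrative shapes and names) recovers \(a^{[L+2]} = f - g\) on random weights:

```python
import numpy as np

def forward_and_dc(x, Ws, bs):
    """Return the MLP output a^{[L+2]}(x) and DC components (f, g) at x."""
    a = x
    f, g = x.copy(), np.zeros_like(x)          # a^{[1]} = x = x - 0
    for k, (W, b) in enumerate(zip(Ws, bs)):
        Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)
        f_t = Wp @ f + Wm @ g + b              # \tilde f^{[l+1]}
        g_t = Wm @ f + Wp @ g                  # \tilde g^{[l+1]}
        a = W @ a + b                          # pre-activation c^{[l+1]}
        if k < len(Ws) - 1:                    # hidden layers: apply ReLU
            a = np.maximum(a, 0.0)
            f, g = np.maximum(f_t, g_t), g_t   # max(f~, g~) - g~
        else:                                  # linear output layer
            f, g = f_t, g_t
    return a, f, g

rng = np.random.default_rng(2)
dims = [5, 7, 6, 4]                            # n_1, ..., n_{L+2} with L = 2
Ws = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.normal(size=dims[i + 1]) for i in range(len(dims) - 1)]
x = rng.normal(size=dims[0])
a_out, f_out, g_out = forward_and_dc(x, Ws, bs)
```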

3.3 Convolutional neural networks (CNNs)

As with dense layers, DC-decomposition also works for convolution layers (3). Theorem 3 shows that a DC-decomposition can be constructed for any component function of \(s^{[l]}\) at convolution layer l.

Theorem 3

For convolution layers (3), any component function of \(s^{[l]}\), for \(l=1,\ldots ,L+1\), is a DC-function of the input X.

Proof

Clearly, any component function of \(s^{[1]}\) is convex and thus is a DC-function of X. Since composition with an affine function preserves convexity, any component function of \(s^{[2]}\) is also convex and thus is a DC-function of X.

We then proceed with induction on l. Assume that \(s^{[l]}\) admits a decomposition

$$\begin{aligned} s^{[l]} = f^{[l]} - g^{[l]}, \end{aligned}$$

where the component functions of \(f^{[l]}\) and \(g^{[l]}\) are all convex in X. Then, \(c^{[l+1]}:=W^{[l+1]}*s^{[l]}+b^{[l+1]}\) can be written as

$$\begin{aligned} c^{[l+1]}&= \big ( W^{[l+1]}_+ - W^{[l+1]}_- \big )*\big ( f^{[l]} - g^{[l]} \big ) + b^{[l+1]} \\&= \big (W^{[l+1]}_+ *f^{[l]} + W^{[l+1]}_-*g^{[l]} + b^{[l+1]} \big ) - \big ( W^{[l+1]}_-*f^{[l]} + W^{[l+1]}_+*g^{[l]} \big ). \end{aligned}$$

Since non-negative weighted sum preserves convexity, \(\tilde{f}^{[l+1]}:= W^{[l+1]}_+*f^{[l]} + W^{[l+1]}_-*g^{[l]} + b^{[l+1]}\) and \(\tilde{g}^{[l+1]}:= W^{[l+1]}_-*f^{[l]} + W^{[l+1]}_+*g^{[l]}\) both have convex component functions. Hence, the component functions of \(c^{[l+1]}\) are DC-functions. Now we have

$$\begin{aligned} s^{[l+1]}&= \sigma \big ( \tilde{f}^{[l+1]} - \tilde{g}^{[l+1]} \big ) \\&= \max (\tilde{f}^{[l+1]}, \tilde{g}^{[l+1]}) - \tilde{g}^{[l+1]}, \end{aligned}$$

where the component functions of \(f^{[l+1]}:= \max (\tilde{f}^{[l+1]}, \tilde{g}^{[l+1]})\) and \(g^{[l+1]}:= \tilde{g}^{[l+1]}\) are all convex.

Therefore, any component function of \(s^{[l]}\), for \(l=1,\ldots ,L+1\), is a DC-function. \(\square \)

DC-decomposition is also compatible with pooling layers. We analyze average pooling and max pooling in turn.

Average pooling. Let \({\mathscr {A}}\) denote the average pooling operation. Theorem 4 shows that the DC structure is preserved by the average pooling.

Theorem 4

If all the component functions of \(s^{[l]}\) are DC-functions of the input x, then the component functions of \({\mathscr {A}}\circ s^{[l]}\) are also DC-functions of x.

Proof

Suppose that \(s^{[l]}\) admits a decomposition

$$\begin{aligned} s^{[l]} = f^{[l]} - g^{[l]}, \end{aligned}$$

where the component functions of \(f^{[l]}\) and \(g^{[l]}\) are all convex in x. Since the average pooling operation is linear, \({\mathscr {A}}\circ s^{[l]}\) can be written as

$$\begin{aligned} {\mathscr {A}}\circ s^{[l]} = {\mathscr {A}}\circ f^{[l]} - {\mathscr {A}}\circ g^{[l]}. \end{aligned}$$

Since non-negative weighted sum preserves convexity, the component functions of \({\mathscr {A}}\circ f^{[l]}\) and \({\mathscr {A}}\circ g^{[l]}\) are all convex in x. \(\square \)

Max pooling. Let \({\mathscr {M}}\) denote the max pooling operation. Theorem 5 shows that the DC structure is preserved by the max pooling.

Theorem 5

If all the component functions of \(s^{[l]}\) are DC-functions of the input x, then the component functions of \({\mathscr {M}}\circ s^{[l]}\) are also DC-functions of x.

Proof

Suppose that \(s^{[l]}\) admits a decomposition

$$\begin{aligned} s^{[l]} = f^{[l]} - g^{[l]}, \end{aligned}$$

where the component functions of \(f^{[l]}\) and \(g^{[l]}\) are all convex in x. Assume that the kernel size of the max pooling layer is \(k_h\times k_w\). Then, \({\mathscr {M}}\circ s^{[l]}\) can be written as

$$\begin{aligned} {\mathscr {M}}\circ s^{[l]} = \big ( {\mathscr {M}}\circ s^{[l]} + k_h k_w {\mathscr {A}}\circ g^{[l]} \big ) - k_h k_w {\mathscr {A}}\circ g^{[l]}. \end{aligned}$$

Since non-negative weighted sum preserves convexity, the component functions of \(k_h k_w {\mathscr {A}}\circ g^{[l]}\) are all convex in x. The component functions of \({\mathscr {M}}\circ s^{[l]} + k_h k_w {\mathscr {A}}\circ g^{[l]}\) are all convex in x as well. Indeed, each one of them is of the following form:

$$\begin{aligned}&\max _{\begin{array}{c} 0\le u\le k_h-1 \\ 0\le v\le k_w-1 \end{array}}\{f^{[l]}_{i+u,j+v,r}-g^{[l]}_{i+u,j+v,r}\} + k_h k_w\frac{1}{k_h k_w} \sum _{u'=0}^{k_h-1}\sum _{v'=0}^{k_w-1}g^{[l]}_{i+u', j+v', r} \\&= \max _{\begin{array}{c} 0\le u\le k_h-1 \\ 0\le v\le k_w-1 \end{array}}\{f^{[l]}_{i+u,j+v,r} + \sum _{u'=0}^{k_h-1}\sum _{v'=0}^{k_w-1}g^{[l]}_{i+u', j+v', r} - g^{[l]}_{i+u,j+v,r}\}, \end{aligned}$$

for some ijr. It is convex in x since sum and pointwise maximum preserve convexity. \(\square \)
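
The last algebraic step of the proof, moving the window sum of \(g^{[l]}\) inside the maximum, can be checked numerically on one channel (non-overlapping windows for simplicity; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
k_h, k_w = 2, 2
f = rng.normal(size=(4, 4))       # values of f^{[l]} (one channel) at a fixed x
g = rng.normal(size=(4, 4))       # values of g^{[l]} at the same x
s = f - g                         # s^{[l]} = f^{[l]} - g^{[l]}

def pool(a, op):
    """Non-overlapping k_h x k_w pooling with reducer `op` (np.max or np.mean)."""
    n_h, n_w = a.shape
    return np.array([[op(a[i:i + k_h, j:j + k_w])
                      for j in range(0, n_w, k_w)]
                     for i in range(0, n_h, k_h)])

lhs = pool(s, np.max) + k_h * k_w * pool(g, np.mean)   # max-pool of s + k_h k_w avg-pool of g
# Right-hand side: per window, max over (u, v) of f + (window sum of g) - g.
rhs = np.array([[np.max(f[i:i + k_h, j:j + k_w]
                        + np.sum(g[i:i + k_h, j:j + k_w])
                        - g[i:i + k_h, j:j + k_w])
                 for j in range(0, 4, k_w)]
                for i in range(0, 4, k_h)])
```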

Theorem 3, Theorem 4 and Theorem 5 imply that any convolutional neural network with one output unit admits a DC-decomposition.

Corollary 3

Any convolutional neural network with one output unit is a DC-function of the input X.

As with multilayer perceptrons, convolutional neural networks with one output unit can also be used for the approximate optimization of real-valued complex functions. Thus, Corollary 3 can serve as a tool for the design of a DCA solution for approximate optimization problems with CNNs. Furthermore, we will see that results such as Corollary 3 are also helpful for the design of a DCA solution for margin minimization in adversarial robustness, where convolutional neural networks are commonly used as hypotheses, as detailed in Sect. 4 and Sect. 5.

3.4 DCA solution for approximate optimization

Sects. 3.1–3.3 show that feedforward neural networks and convolutional neural networks admit an explicit DC-decomposition. Building upon this, we can give our DCA solution for the approximate optimization problem introduced in Sect. 1 using such neural networks with one output unit, denoted as \({\mathscr {H}}_{\mathrm {NN-aprox}}\). The pseudocode of our algorithm is given in Algorithm 2.

Algorithm 2
figure b

DCA solution for the approximate optimization of complex functions.

In Sect. 5, we report empirical results that further demonstrate the effectiveness of Algorithm 2.
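
For the special case of a one-hidden-layer network, each convex subproblem in the DCA iteration minimizes a nonnegative combination of ReLUs minus a linear term over a box, which can be rewritten as a linear program. A minimal sketch under that assumption (all names are illustrative, the box constraint is an added assumption, and SciPy's linprog stands in for a generic convex solver; this is not the general Algorithm 2):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, m = 2, 8                                        # input dim, hidden units
A = rng.normal(size=(m, n))                        # hidden weights W^{[2]}
b = rng.normal(size=m)                             # hidden offsets b^{[2]}
w = rng.normal(size=m)                             # output weights W^{[3]}, one unit
wp, wm = np.maximum(w, 0.0), np.maximum(-w, 0.0)   # w = wp - wm (Theorem 1)
lo, hi = -2.0, 2.0                                 # box constraint on x

def h(x):                                          # learned network h = f - g
    return w @ np.maximum(A @ x + b, 0.0)

x = rng.uniform(lo, hi, size=n)
h_start = h(x)
for _ in range(50):
    # Subgradient of g(x) = sum_j wm_j relu(a_j . x + b_j) at the current x.
    v = A.T @ (wm * (A @ x + b > 0))
    # Convex subproblem min_x f(x) - v . x, written as an LP in (x, t):
    #   min -v . x + sum_j wp_j t_j   s.t.  t_j >= a_j . x + b_j,  t_j >= 0.
    c = np.concatenate([-v, wp])
    res = linprog(c, A_ub=np.hstack([A, -np.eye(m)]), b_ub=-b,
                  bounds=[(lo, hi)] * n + [(0, None)] * m)
    x_new = res.x[:n]
    if np.linalg.norm(x_new - x) < 1e-9:           # stopping criterion
        break
    x = x_new
```

As in Algorithm 1, each iteration is guaranteed not to increase h.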

4 DC-decomposition of confidence margin

In this section, we show that the margin of a hypothesis h, if viewed as a function of x, is also a DC-function, when for each class z, \(x \mapsto h(x, z)\) is a DC-function. We then present a DCA solution for the adversarial attack computation for adversarial robustness.

For a real-valued hypothesis h, the multi-class margin \(\rho _h(x, y)\) for a labeled pair \((x, y)\) is defined by

$$\begin{aligned} \rho _h(x, y) = h(x, y) - \max _{y' \ne y} h(x, y'). \end{aligned}$$
(5)

Table 1 shows the standard and adversarial losses, where \(\Phi :\mathbb {R}\rightarrow \mathbb {R}_{+}\) is a non-increasing function upper bounding the indicator function \(t \mapsto \mathbbm {1}_{t\le 0}\). Here, \(\Vert \, \cdot \, \Vert \) denotes a norm on the input feature space \({\mathscr {X}}\), and \(\gamma \) is the size of a perturbation. Observe that since \(\Phi \) is non-increasing, the adversarial margin-based loss can be rewritten via the following equality [60]:

$$\begin{aligned} \widetilde{\Phi }(h,x,y)=\sup _{x':\Vert x-x'\Vert \le \gamma }\Phi ( \rho _h(x',y) ) = \Phi \left( {\inf _{x':\Vert x-x'\Vert \le \gamma } \rho _h(x',y)}\right) . \end{aligned}$$
(6)

Thus, to optimize \(\widetilde{\Phi }\), we need to solve the following adversarial attack computation problem for each labeled pair (xy), often with the hypothesis h being a neural network:

$$\begin{aligned} \min _{x':\Vert x-x'\Vert \le \gamma } \rho _h(x',y)=\min _{x':\Vert x-x'\Vert \le \gamma } \left( {h(x',y) - \max _{y'\ne y} h(x',y')}\right) . \end{aligned}$$
(7)
Table 1 Target loss and surrogate margin-based loss in standard and adversarial classification

We now show that for a fixed y, the function \(x\mapsto \rho _h(x,y)\) admits a DC-decomposition when for each class z, the function \(x \mapsto h(x, z)\) is a DC-function. This condition is satisfied by all neural networks commonly used in practice (see Sect. 3).

Theorem 6

Assume that for each class \(z\in \{ 1, \ldots , c \}\), h(xz) admits the DC-decomposition \(h(x, z) = f_z(x) - g_z(x)\), where \(f_z\) and \(g_z\) are convex functions. Then, for any y, the function \(x\mapsto \rho _h(x,y)\) admits the following DC-decomposition:

$$\begin{aligned} \rho _h(x,y) = \left( {f_y(x)+\sum _{z\in {\mathscr {Y}}:z \ne y} g_z(x)}\right) -\max _{y' \ne y} \left( {f_{y'}(x) + \sum _{z\in {\mathscr {Y}}:z \ne y'} g_z(x)}\right) , \end{aligned}$$

where \(x \mapsto ( f_y(x) + \sum _{z\in {\mathscr {Y}}:z \ne y} g_z(x) )\) and \(x \mapsto \max _{y' \ne y}( f_{y'}(x) + \sum _{z\in {\mathscr {Y}}:z \ne y'} g_z(x) )\) are convex functions with respect to x.

Proof

For a labeled example (xy), the margin can be expressed as follows:

$$\begin{aligned} \rho _h(x,y)&=h(x,y) - \max _{y' \ne y}h(x,y')\\&=\left( {f_y(x)-g_y(x)}\right) - \max _{y' \ne y} \left( {f_{y'}(x)-g_{y'}(x)}\right) \\&= \left( {f_y(x)+\sum _{z\in {\mathscr {Y}}}g_z(x)-g_y(x)}\right) - \max _{y' \ne y} \left( {f_{y'}(x)+\sum _{z\in {\mathscr {Y}}}g_z(x)-g_{y'}(x)}\right) \\&=\left( {f_y(x)+\sum _{z\in {\mathscr {Y}}:z \ne y} g_z(x)}\right) -\max _{y' \ne y} \left( {f_{y'}(x)+\sum _{z\in {\mathscr {Y}}:z \ne y'} g_z(x)}\right) . \end{aligned}$$

The convexity of the function \(x \mapsto ( f_y(x) + \sum _{z\in {\mathscr {Y}}:z \ne y} g_z(x) )\) and that of \(x \mapsto \max _{y' \ne y} ( f_{y'}(x) + \sum _{z \in {\mathscr {Y}}:z \ne y'} g_z(x) )\) hold by the assumption and the fact that convexity is preserved under sum and pointwise maximum. \(\square \)
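
Since the decomposition is a pointwise identity in the values \(f_z(x)\) and \(g_z(x)\), it can be checked on arbitrary numbers (an illustrative sketch for \(c = 4\) classes at a fixed x, with 0-indexed labels):

```python
import numpy as np

rng = np.random.default_rng(5)
c, y = 4, 1                          # number of classes and the true label
f = rng.normal(size=c)               # values f_z(x) at a fixed x
g = rng.normal(size=c)               # values g_z(x) at the same x
h = f - g                            # scores h(x, z) = f_z(x) - g_z(x)

others = [z for z in range(c) if z != y]
margin = h[y] - max(h[z] for z in others)        # definition (5)

S = g.sum()
first = f[y] + S - g[y]                          # f_y + sum_{z != y} g_z
second = max(f[z] + S - g[z] for z in others)    # max_{y'} (f_{y'} + sum_{z != y'} g_z)
```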

As discussed in Sect. 1, the computation of the adversarial attack for adversarial robustness is an important problem in practice, with the form in (7), for a given hypothesis \(h\in {\mathscr {H}}\), sample (xy) and perturbation size \(\gamma \). Combining the results of Sect. 3 and Theorem 6, the problem can be cast as an instance of DC-programming [46, 47], for which we can make use of the DCA algorithm. The pseudocode of our algorithm to solve the optimization problem (7) is given in Algorithm 3. This provides a DCA-based solution for the computation of the adversarial attack for the family of feedforward neural networks and that of convolutional neural networks, denoted as \({\mathscr {H}}_{\mathrm {NN-adv}}\).

Algorithm 3 DCA solution for the computation of the adversarial attack for adversarial robustness.
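To illustrate the structure of this DCA solution, the following is a minimal sketch of the iteration for margin minimization over an \(\ell_\infty\) ball, under the DC split \(\rho(x) = u(x) - v(x)\) with \(u = \phi_y\) and \(v = \max_{y' \ne y} \phi_{y'}\) and each \(\phi_{y'}\) convex as in Theorem 6. The callables `phi` and `grad_phi` and the inner projected-gradient solver are illustrative assumptions, not the paper's exact Algorithm 3:

```python
import numpy as np

def dca_attack(phi, grad_phi, x0, y, n_classes, gamma,
               n_rounds=20, inner_steps=50, lr=0.01):
    """Sketch of DCA for min_{||x - x0||_inf <= gamma} phi_y(x) - max_{y'!=y} phi_{y'}(x)."""
    x_center, x = x0.copy(), x0.copy()
    for _ in range(n_rounds):
        # Linearize the concave part: a subgradient of v at the current x
        # is the gradient of the maximizing component phi_{y*}.
        others = [yp for yp in range(n_classes) if yp != y]
        y_star = max(others, key=lambda yp: phi(yp, x))
        g_v = grad_phi(y_star, x)
        # Convex subproblem: min_x u(x) - <g_v, x> over the ball, solved
        # approximately by projected gradient descent (the l_inf projection
        # is coordinate-wise clipping).
        for _ in range(inner_steps):
            x = x - lr * (grad_phi(y, x) - g_v)
            x = np.clip(x, x_center - gamma, x_center + gamma)
    return x
```

Each round solves a convex problem exactly in principle; DCA's descent guarantee [46] is what distinguishes this scheme from purely heuristic attack updates.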

Here, we wish to further emphasize the significance of our DCA solution for computing the adversarial attack. The adversarial attack computation problem, which is crucial for both adversarial training and evaluation, has been extensively studied in the adversarial robustness literature. Typical existing methods include the single-step Fast Gradient Sign Method (FGSM) [15], its stronger multi-step version, the Projected Gradient Descent (PGD) method [25, 37], and the state-of-the-art Auto-PGD (APGD) method [12]. For comparison with our Algorithm 3, we describe below FGSM and PGD for solving (7) with the \(\ell _{\infty }\) norm, and refer interested readers to [12, Algorithm 1] for the details of APGD. We adopt the same notation as in Algorithm 3.

  • Fast Gradient Sign Method (FGSM):

    $$\begin{aligned} x' = x - \gamma {{\,\textrm{sign}\,}}( \partial \rho _{h_{\mathrm {NN-adv}}}(x,y) ). \end{aligned}$$
  • Projected Gradient Descent (PGD): for each round \(t=1,\ldots ,T\),

    $$\begin{aligned} x'_{t+1} = \textrm{Proj}_{\{ x':\Vert x-x'\Vert \le \gamma \}} ( x'_t - \alpha \cdot {{\,\textrm{sign}\,}}( \partial \rho _{h_{\mathrm {NN-adv}}}(x'_t,y) ) ), \end{aligned}$$

    where \(\alpha \) is the step size and \(\textrm{Proj}_B(\beta ):= {{\,\mathrm{\mathrm argmin}\,}}_{\beta '\in B}\Vert \beta - \beta ' \Vert _2\) is the projection.
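The PGD update above can be sketched as follows; `grad_margin_fn`, the step size, and the random start are illustrative assumptions standing in for a (sub)gradient oracle of \(\rho_{h_{\mathrm {NN-adv}}}\):

```python
import numpy as np

def pgd_attack(grad_margin_fn, x, y, gamma, alpha=0.01, n_steps=10, seed=0):
    """Sketch of l_inf PGD on the margin: grad_margin_fn(x, y) is a
    hypothetical callable returning a (sub)gradient of rho_h(x, y) in x."""
    rng = np.random.default_rng(seed)
    # Random start inside the l_inf ball of radius gamma around x.
    x_adv = x + rng.uniform(-gamma, gamma, size=x.shape)
    for _ in range(n_steps):
        # Signed descent step: we *minimize* the margin to find the attack.
        x_adv = x_adv - alpha * np.sign(grad_margin_fn(x_adv, y))
        # For the l_inf ball, the projection reduces to coordinate-wise clipping.
        x_adv = np.clip(x_adv, x - gamma, x + gamma)
    return x_adv
```

FGSM corresponds to a single step of size \(\gamma\) from the clean input, without the projection loop.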

Nevertheless, none of these methods is guaranteed to produce the strongest attack, that is, the margin minimizer. On the one hand, it has been observed that using a single-step method like FGSM during adversarial training suffers from the issue of “catastrophic overfitting” [58], which suggests that a stronger attack method, or true margin minimization, is needed during training. On the other hand, [12, Table 2] shows that the adversarial test accuracy of most of the studied defenses can be further reduced when evaluated with APGD, which implies that previous methods are not fully reliable for adversarial evaluation. In contrast, our solution for the adversarial attack computation is built upon the DCA algorithm and benefits from its convergence guarantees [46]; as shown in Sect. 5, it can produce stronger attacks than PGD and is comparable with the state-of-the-art APGD method in certain cases. Furthermore, our method can be combined with or incorporated into other targeted or ensemble attack methods such as AutoAttack [12], which can be helpful for both adversarial training and evaluation.

Fig. 2 Left: a fourth degree polynomial \(F :x \mapsto 3x^4 - 4x^3 - 12x^2\). Right: DC-components \(f_{\mathrm {NN-aprox}}\) and \(g_{\mathrm {NN-aprox}}\) of the trained two-hidden-layer neural network

Fig. 3 Histograms of the margin value over 100 trials for each image. We ran 10-step PGD on the margin loss with random starts and then ran DCA for 20 rounds starting from the returned point. The vertical line indicates the margin value corresponding to the adversarial attack found by 100-step APGD on the margin loss

Table 2 Architecture of the convolutional neural network used in Sect. 5 for the computation of the adversarial attack

5 Experiments

Here, we present experiments on simulated data and a real-world dataset to validate the effectiveness of our algorithms for the approximate optimization and adversarial attack computation problems. In the approximate optimization example, we show that our DCA solution, Algorithm 2, can effectively minimize a polynomial function using a feedforward neural network, given access only to samples of the polynomial. In the adversarial attack computation example, we show that our DCA solution, Algorithm 3, can effectively achieve margin minimization for an adversarially trained convolutional neural network on CIFAR-10 [23].

Approximate optimization. The function to minimize is the fourth-degree polynomial \(F :x \mapsto 3x^4 - 4x^3 - 12x^2\), plotted in Fig. 2. Its derivative is \(F'(x) = 12(x+1)x(x-2)\), so F admits a global minimum \(F(2) = -32\) at \(x = 2\) and a local minimum \(F(-1) = -5\) at \(x = -1\). Assume that we have access to samples \(\{ (x^i, F(x^i)) \}_{i = 1}^m\), where the points \(\{ x^i \}_{i = 1}^m\) are drawn from the uniform distribution on the interval \([-2, 3]\). We train a feedforward neural network with two hidden layers of 100 units each, using \(5{,}000\) training samples. As shown in Theorem 2, the trained neural network, denoted by \(h_{\mathrm {NN-aprox}}\), admits the DC-decomposition \(h_{\mathrm {NN-aprox}} = f_{\mathrm {NN-aprox}} - g_{\mathrm {NN-aprox}}\); the two convex DC-components are illustrated in Fig. 2. Starting from a randomly chosen initial point \(x = 0.47\), DCA rapidly converges to \(h_{\mathrm {NN-aprox}}(2.05) = -32.20\). The iteration points, marked in Fig. 2, closely track the polynomial F and approach its minimum value, even though the form of F is unknown to the learner.
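The quoted critical values, and the behavior of descent from the same start, can be checked directly; this is a plain gradient-descent sanity check on F itself, not the neural-network pipeline:

```python
def F(x):
    # The target polynomial from the experiment.
    return 3 * x**4 - 4 * x**3 - 12 * x**2

def dF(x):
    # F'(x) = 12(x + 1) x (x - 2), with roots -1, 0, 2.
    return 12 * (x + 1) * x * (x - 2)

print(F(-1.0), F(0.0), F(2.0))  # -5.0 0.0 -32.0

# Plain gradient descent on F from the same initial point x = 0.47.
x = 0.47
for _ in range(200):
    x -= 0.01 * dF(x)
print(round(x, 3), round(F(x), 3))  # converges near x = 2, F = -32
```

Since 0.47 lies in the basin of the global minimizer, descent on F lands at \(x = 2\), mirroring where DCA on the trained surrogate converges.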

Adversarial attacks for adversarial robustness. We report adversarial attack computation results for an adversarially trained convolutional neural network on images from CIFAR-10 [23]; the architecture of the CNN is described in Table 2. The adversarial attacks are \(\ell _{\infty }\)-norm bounded perturbations of size \(\gamma = 8/255\). Figure 3 displays histograms of the margin value over 100 trials for each image: we ran 10-step PGD on the margin loss with random starts and then ran DCA for 20 rounds starting from the returned point; the vertical line indicates the margin value corresponding to the adversarial attack found by 100-step APGD on the margin loss. Figure 3 shows that DCA improves upon PGD attacks and is comparable with strong APGD attacks.

6 Conclusion

We presented a study of two key problems in neural network optimization: the computation of the adversarial attack for adversarial robustness, and approximate optimization based on neural networks. Our results include new DCA-based solutions whose effectiveness is demonstrated by our experiments. They can help design better adversarial training algorithms and stronger adversarial attacks for evaluation in adversarial robustness, carry over to many other neural network architectures via a similar proof technique, and extend to many other neural network learning and optimization problems.