Regularisation of Neural Networks by Enforcing Lipschitz Continuity

We investigate the effect of explicitly enforcing the Lipschitz continuity of neural networks with respect to their inputs. To this end, we provide a simple technique for computing an upper bound to the Lipschitz constant of a feed forward neural network composed of commonly used layer types and demonstrate inaccuracies in previous work on this topic. Our technique is then used to formulate training a neural network with a bounded Lipschitz constant as a constrained optimisation problem that can be solved using projected stochastic gradient methods. Our evaluation study shows that, in isolation, our method performs comparatively to state-of-the-art regularisation techniques. Moreover, when combined with existing approaches to regularising neural networks the performance gains are cumulative. We also provide evidence that the hyperparameters are intuitive to tune and demonstrate how the choice of norm for computing the Lipschitz constant impacts the resulting model.


Introduction
Supervised learning is primarily concerned with the problem of approximating a function given examples of what output should be produced for a particular input. In order for the approximation to be of any practical use, it must generalise to unseen data points. Thus, we need to select an appropriate space of functions in which the machine should search for a good approximation, and select an algorithm to search through this space. This is typically done by first picking a large family of models, such as support vector machines or decision trees, and applying a suitable search algorithm. Crucially, when performing the search, regularisation techniques specific to the chosen model family must be employed to combat overfitting. For example, one could limit the depth of decision trees considered by a learning algorithm, or impose probabilistic priors on tunable model parameters.
Regularisation of neural network models is a particularly difficult challenge. The methods that are currently most effective (Srivastava et al., 2014;Ioffe and Szegedy, 2015) are heuristically motivated, which can make the process of applying these techniques to new problems nontrivial or unreliable. In contrast, well-understood regularisation approaches adapted from linear models, such as applying an ℓ 2 penalty term to the model parameters, are known to be less effective than the ad hoc approaches (Srivastava et al., 2014). This provides a clear motivation for developing well-founded and effective regularisation methods for neural networks. Following the intuition that functions are considered simpler when they vary at a slower rate, and thus generalise better, we develop a method that allows us to control the Lipschitz constant of a network-a measure of the maximum variation a function can exhibit. Our experiments show that this is a useful inductive bias to impose on neural network models.
One of the prevailing themes in the theoretical work surrounding neural networks is that the magnitude of the weights directly impacts the generalisation gap (Bartlett, 1998;Bartlett et al., 2017;Neyshabur, 2017), with larger weights being associated with poorer relative performance on new data. In several of the most recent works (Bartlett et al., 2017;Neyshabur, 2017), some of the dominant terms in these bounds are equal to the upper bound of the Lipschitz constant of neural networks as we derive it in this paper. While previous works have only considered the Lipschitz continuity of networks with respect to the ℓ 2 norm, we put a particular emphasis on working with ℓ 1 and ℓ ∞ norms and construct a practical algorithm for constraining the Lipschitz constant of a network during training. The algorithm takes a single hyperparameter that is used to enforce an upper bound on the Lipschitz constant of the network.
Several interesting properties of this regularisation technique are demonstrated experimentally. We show that our algorithm performs similarly to other methods in isolation. Moreover, when the method presented in this paper is combined with other commonly used regularisers the performance gains are cumulative. We verify that the hyperparameter behaves in an intuitive manner: when set to a small value the generalisation gap is very small, and as the value of the hyperparameter is increased, the generalisation gap widens.
The paper begins with an outline of previous work related to regularisation and the Lipschitz continuity of neural networks in Section 2. This is followed by a detailed derivation of our upper bound on the Lipschitz constant of a wide class of feed forward neural networks in Section 3, where we give consideration to multiple choices of vector norms. Section 4 shows how this upper bound can be used to regularise the neural network in an efficient manner. Experiments showing the utility of our regularisation approach are given in Section 5, and conclusions are drawn in Section 6.

Related work
One of the most widely applied regularisation techniques currently used for deep networks is dropout (Srivastava et al., 2014). By randomly setting the activations of each hidden unit to zero with some probability, p, during training, this method noticeably reduces overfitting for a wide variety of models. Various extensions have been proposed, such as randomly setting weights to zero instead of activations (Wan et al., 2013). Another modification, concrete dropout (Gal et al., 2017), allows one to directly learn the dropout rate, p, thus making the search for a good set of hyperparameters easier. Kingma et al. (2015) have also shown that the noise level in Gaussian dropout can be learned during optimisation. Srivastava et al. (2014) found that constraining the ℓ 2 norm of the weight vector for each unit in isolation-a technique that they refer to as maxnorm-slightly improved the performance of networks trained with dropout.
Batch normalisation (Ioffe and Szegedy, 2015), which was initially developed to accelerate the convergence of the training process, has also been shown to improve generalisation. It is efficient and simple to implement: it solely standardises the activations of each layer by aggregating statistics over minibatches. The activations are then rescaled and translated by model parameters learned during training. A similar technique that is more effective in the small minibatch case, when the statistics required by batch normalisation cannot be computed reliably, is weight normalisation (Salimans and Kingma, 2016). Rather than computing the mean and standard deviation, the orientation and magnitude of each weight vector are decoupled and learned separately. Interestingly, it has been shown that weight decay (adding an ℓ 2 penalty term to the loss function) provides no regularisation benefit when used in conjunction with these methods (van Laarhoven, 2017), but it does change the effective learning rate of commonly used optimisation algorithms. However one could achieve the same effect by simply changing the learning rate directly.
The recent work on optimisation for deep learning has also contributed to our understanding of the generalisation performance of neural networks. Hardt et al. (2016) provide some theoretical justifications for common practices, such as early stopping. Neyshabur et al. (2015) presents the Path-SGD optimisation algorithm, a method specifically designed for training deep networks. They show that it outperforms several conventional stochastic gradient methods for training multi-layer perceptrons.
Several papers have shown that the generalisation gap of a neural network is dependent on the magnitude of the weights (Bartlett et al., 2017;Neyshabur, 2017;Bartlett, 1998). Early results, such as Bartlett (1998), present bounds that assume sigmoidal activation functions, but nevertheless relate generalisation to the sum of the absolute values of the weights in the network. More recent work has shown that the product of spectral norms, scaled by various other weight matrix norms, can be used to construct bounds on the generalisation gap. Bartlett et al. (2017) scale the spectral norm product by a term related to the element-wise ℓ 1 norm, whereas  use the Frobenius norm.  speculate that Lipschitz continuity alone is insufficient to guarantee generalisation. However, it appears in multiple generalisation bounds (Neyshabur, 2017;Bartlett et al., 2017), and this paper presents evidence that it is an effective means for controlling the generalisation performance of a deep network.  proposed a new regularisation scheme in the form of a term in the loss function that penalises the sum of spectral norms of the weight matrices. Our work differs from theirs in several ways. Firstly, we investigate norms other than ℓ 2 . Secondly,  use a penalty term, whereas we employ a hard constraint on the induced weight matrix norm. Moreover, they penalise the sum of the norms, but the Lipschitz constant is determined by the product of operator norms. Finally,  use a heuristic to regularise convolutional layers that does not correspond to penalising their spectral norm. Specifically, they compute the largest singular value of a flattened weight tensor, as opposed to deriving the true matrix corresponding to the linear operation performed by convolutional layers, as we do in Section 3.2. Explicitly constructing this matrix and computing its largest singular value-even approximately-would be prohibitively expensive. We provide efficient methods for computing the ℓ 1 and ℓ ∞ norms of convolutional layers exactly, and show how one can approximate the spectral norm efficiently by avoiding the need to explicitly construct the matrix representing the linear operation performed by convolutional layers.
Constraining the Lipschitz continuity of a network is not only interesting for regularisation.  have shown that constraining the weights of the discriminator in a generative adversarial network to have a specific spectral norm can improve the quality of generated samples. They use the same technique as  to compute these norms, and thus may benefit from the improvements presented in this paper.

Computing the Lipschitz Constant
A function, f : X → Y , is said to be Lipschitz continuous if it satisfies for some real-valued k ≥ 0, and metrics D X and D Y . The value of k is known as the Lipschitz constant, and the function can be referred to as being k-Lipschitz. Generally, we are interested in the smallest possible Lipschitz constant, but it is not always possible to find it. In this section, we show how to compute an upper bound to the Lipschitz constant of a feed-forward neural network with respect to the input features. Such networks can be expressed as a series of function compositions: where each φ i is an activation function, linear operation, or pooling operation. A particularly useful property of Lipschitz functions is how they behave when composed: the composition of a k 1 -Lipschitz function, f 1 , with a k 2 -Lipschitz function, f 2 , is a k 1 k 2 -Lipschitz function. It is important to note that k 1 k 2 will not necessarily be the smallest Lipschitz constant of (f 2 • f 1 ), even if k 1 and k 2 are individually the best Lipschitz constants of f 1 and f 2 , respectively. Denoting the Lipschitz constant of some function, f , as L(f ), repeated application of this composition property yields the following upper bound on the Lipschitz constant for the entire feed-forward network: Thus, we can compute the Lipschitz constants of each layer in isolation and combine them in a modular way to establish an upper bound on the constant of the entire network. In the remainder of this section, we derive closed form expressions for the Lipschitz constants of common layer types when D X and D Y correspond to ℓ 1 , ℓ 2 , or ℓ ∞ norms respectively. As we will see in Section 4, Lipschitz constants with respect to these norms can be constrained efficiently.

Fully Connected Layers
A fully connected layer, φ f c ( x), implements an affine transformation parameterised by a weight matrix, W , and a bias vector, b: Others have already established that, under the ℓ 2 norm, the Lipschitz constant of a fully connected layer is given by the spectral norm of the weight matrix Neyshabur, 2017). We provide a slightly more general formulation that will prove to be more useful when considering other p-norms. We begin by plugging the definition of a fully connected layer into the definition of Lipschitz continuity: By setting a = x 1 − x 2 and simplifying the expression slightly, we arrive at which, assuming x 1 = x 2 , can be rearranged to W a p a p ≤ k, a = 0.
The smallest Lipschitz constant is therefore equal to the supremum of the left-hand side of the inequality, which is the definition of the operator norm. For the p-norms we consider in this paper, there exist efficient algorithms for computing operator norms on relatively large matrices. Specifically, for p = 1, the operator norm is the maximum absolute column sum norm; for p = ∞, the operator norm is the maximum absolute row sum norm. The time required to compute both of these norms is linearly related to the number of elements in the weight matrix. When p = 2, the operator norm is given by the largest singular value of the weight matrix-the spectral norm-which can be approximated relatively quickly using a small number of iterations of the power method.

Convolutional Layers
Convolutional layers, φ conv (X), also perform an affine transformation, but it is usually more convenient to express the computation in terms of discrete convolutions and pointwise additions. For a convolutional layer, the i-th output feature map is given by where each F i,j is a filter, each X j is an input feature map, B i is an appropriately shaped bias tensor exhibiting the same value in every element, and the previous layer produced M l−1 feature maps. The convolutions in Equation 9 are linear operations, so one can exploit the isomorphism between linear operations and square matrices of the appropriate size to reuse the matrix norms derived in Section 3.1. To represent a single convolution operation as a matrix-vector multiplication, the input feature map is serialised into a vector, and the filter coefficients are used to construct a doubly block circulant matrix. Due to the structure of doubly block circulant matrices, each filter coefficient appears in each column and row of this matrix exactly once. Consequently, the ℓ 1 and ℓ ∞ operator norms are the same and given by F i,j 1 , the sum of the absolute values of the filter coefficients used to construct the matrix.
Summing over several different convolutions associated with different input feature maps and the same output feature map, as done in Equation 9, can be accomplished by horizontally concatenating matrices. For example, suppose V i,j is a matrix that performs a convolution of F i,j with the j-th feature map serialised into a vector. Equation 9 can now be rewritten in matrix form as where the inputs and biases, previously represented by X and B i , have been serialised into vectors x and b i , respectively. The complete linear transformation, W , performed by a convolutional layer to generate M l output feature maps can be constructed by adding additional rows to the block matrix: To compute the ℓ 1 and ℓ ∞ operator norms of W , recall that the operator norm of V i,j for p ∈ {1, ∞} is F i,j 1 . A second matrix, W ′ , can be constructed from W , where each block, V i,j , is replaced with the corresponding operator norm, F i,j 1 . Each of these operator norms can be thought of as a partial row or column sum for the original matrix, W . Now, based on the discussion in Section 3.1, the ℓ 1 operator norm is given by and the ℓ ∞ operator norm is given by Yoshida and Miyato (2017) and  both investigate the effect of penalising or constraining the spectral norm of convolutional layers by reinterpreting the weight tensor of a convolutional layer as a matrix, where each u i,j contains the elements of the corresponding F i,j serialised into a row vector. They then proceed to compute the spectral norm of U , rather than computing the spectral norm of W , given in Equation 11. However, it is unclear when the spectral norm of U is an accurate approximation of the spectral norm of W , and it is in fact trivial to construct cases where it is highly inaccurate. 1 Explicitly constructing W and applying a conventional singular value decomposition to compute the spectral norm of a convolutional layer is infeasible, but we show how the power method can be adapted to use standard convolutional network primitives to compute it efficiently. Consider the usual process for computing the largest singular value of a square matrix using the power method, provided in Algorithm 1. The expression of most interest to us is inside the for loop, namely which, due to the associativity of matrix multiplication, can be broken down into two steps: and When W is the matrix in Equation 11, the expressions given in Equations 16 and 17 correspond to a forward propagation and a backwards propagation through a convolutional layer, respectively. Thus, if we replace these matrix multiplication with convolution and transposed convolution operations respectively, as implemented in many deep learning frameworks, the spectral norm can be computed efficiently. Note that only a single vector must undergo the forward and backward propagation operations, rather than an entire batch of instances. This means, for most cases, only a small increase in runtime will be incurred by using this method. It also automatically takes into account the padding and stride hyperparameters used by the convolutional layer.

Pooling Layers and Activation Functions
Most common activation functions and pooling operations are, at worst, 1-Lipschitz with respect to all p-norms. For example, the maximum absolute sub-gradient of the ReLU activation function is 1, which means that ReLU operations have a Lipschitz constant of Algorithm 1 Power method for producing the largest singular value, σ max , of a non-square matrix, W .
A similar argument yields that the Lipschitz constant of max pooling layers is one. The Lipschitz constant of the softmax is one (Gao and Pavel, 2017).

Residual Connections
Recently developed feed-forward architectures often include residual connections between non-adjacent layers (He et al., 2016). These are most commonly used to construct structures known as residual blocks: where the function composition may contain a number of different linear transformations and activation functions. In most cases the composition is formed by two convolutional layers, each preceded by a batch normalisation layer and a ReLU function. While networks that use residual blocks still qualify as feed-forward networks, they no longer conform to the linear chain of function compositions we formalised in Equation 2. Fortunately, networks with residual connections are usually built by composing a linear chain of residual blocks of the form given in Equation 18. Hence, the Lipschitz constant of a residual network will be the product of Lipschitz constants for each residual block. For a k 1 -Lipschitz function, f 1 , and a k 2 -Lipschitz function, f 2 , we are interested in the Lipschitz constant of their sum: which can be rearranged to The subadditivity property of norms and the Lipschitz constants of f 1 and f 2 can then be used to bound Equation 20 from above: Thus, we can see that the Lipschitz constant of the addition of two functions is bounded from above by the sum of their Lipschitz constants. Setting f 1 to be the identity function and f 2 to be a linear chain of function compositions, we arrive at the definition of a residual block as given in Equation 18. Noting that the Lipschitz constant of the identity function is one, we can see that the Lipschitz constant of a residual block is bounded by where the property given in Equation 3 has been applied to the function compositions.

Constraining the Lipschitz Constant
The assumption motivating our work is that the Lipschitz constant of a feed-forward neural network enables control of how well the model will generalise to new data. Using the composition property of Lipschitz functions, we have shown that the Lipschitz constant of a network is the product of the Lipschitz constants of each layer. Thus, controlling the Lipschitz constant of a network can be accomplished by constraining the Lipschitz constant of each layer in isolation by performing constrained optimisation when training the network. In practice, we pick a single hyperparameter, λ, and use it to control the upper bound of the Lipschitz constant for each layer. This means the network as a whole will have a Lipschitz constant less than or equal to λ d , where d is the depth of the network. The easiest way to adapt existing deep learning methods to allow for constrained optimisation is to introduce a projection step and perform a variant of the projected stochastic gradient method. In our particular problem, because each parameter matrix is constrained in isolation, it is straightforward to project any infeasible parameter values back into the set of feasible matrices. Specifically, after each weight update step we must check that none of the weight matrices (including the filter banks in the convolutional layers) are violating the Lipschitz constant constraints. If the weight update has caused a weight matrix to leave the feasible set, we must replace the resulting matrix with the closest matrix that does lie in the feasible set. This can all be accomplished with the projection function, which will leave the matrix untouched if it does not violate the constraint, and project it back to the closest matrix in the feasible set if it does. Closeness is measured by the matrix distance metric induced by taking the operator norm of the difference between two matrices. This will work with any valid operator norm, because all norms are absolutely homogeneous. In particular, it will work with the operator norms with p ∈ {1, 2, ∞}, which can be computed using the approaches outlined in Section 3. Pseudocode for this projected gradient method is given in Algorithm 2. We have observed fast convergence when using the AMSGrad update rule (Reddi et al., 2018), but other variants of the stochastic gradient method also work. For example, in our experiments, we show that stochastic gradient descent with Nesterov's momentum is compatible with our approach.

Stability of p-norm Estimation
A natural question to ask is which p-norm should be chosen when using the training procedure given in Algorithm 2. The Euclidean (i.e., spectral) norm is often seen as the Algorithm 2 Projected stochastic gradient method to optimise a neural network subject to the Lipschitz Constant Constraint (LCC).
i , λ) end for end while default choice, due to its special status when talking about distances in the real world. Like , we use the power method to estimate the spectral norms of the linear operations in deep networks. The convergence rate of the power method is related to the ratio of the two highest singular values, σ 2 σ 1 . If the two largest singular values are almost the same, it will converge very slowly. Because each iteration of the power method for computing the spectral norm of a convolutional layer requires both forward propagation and backward propagation, it is only feasible to perform a small number of iterations. However, regardless of the quality of the approximation, we can be certain that our approximation is an underestimate of the true norm: the expression in the final line of Algorithm 1 is maximised when x n is the first eigenvector of W . Therefore, if the algorithm has not converged x n will not be an eigenvector of W and our approximation of σ max will be an underestimate.
In contrast to the spectral norm calculation, we compute the values of the ℓ 1 and ℓ ∞ norms exactly, in time that is linear in the number of weights in a layer, so it always comprises a relatively small fraction of the overall runtime for training the network. However, it may be the case that the ℓ 1 and ℓ ∞ constraints do not provide as suitable an inductive bias as the ℓ 2 constraint. This is something we investigate in our experimental evaluation.

Compatibility with Batch Normalisation
Constraining the Lipschitz continuity of the network will have an impact on the magnitude of the activations produced by each layer, which is what batch normalisation attempts to explicitly control (Ioffe and Szegedy, 2015). Thus, we consider whether batch normalisation is compatible with our Lipschitz Constant Constraint regulariser. Batch normalisation can be expressed as where diag(·) denotes a diagonal matrix, and γ and β are learned parameters. This can be seen as performing an affine transformation with a linear transformation term x.
Based on the operator norm of this diagonal matrix the Lipschitz constant of a batch normalisation layer, with respect to p-norms where p ∈ {1, 2, ∞}, is given by Thus, if one were to use batch normalisation in conjunction with our technique, the γ parameter must also be constrained. This is accomplished by using the expression in Equation 28 to compute the operator norm in the projection function given in Equation 25. We use a moving average estimate of the variance for performing the projection, rather than the variance computed solely on the current minibatch of training examples. This is done because the minibatch estimates of the mean and variance can be quite noisy.

Interaction with Dropout
In the standard formulation of dropout, one corrupts activations during training by performing pointwise multiplication with vectors of Bernoulli random variables. As a consequence, when making a prediction at test time-when units are not dropped out-the activations must be scaled by the probability that they remained uncorrupted during training. This means the activation magnitude at both test time and training time is approximately the same. The majority of modern neural network make extensive use of rectified linear units, or similar activation functions that are also homogeneous. This implies that scaling the activations at test time is equivalent to scaling the weight matrices in the affine transformation layers. By definition, this will also scale the operator norm, and therefore the Lipschitz constant, of that layer. As a result, one may expect that when using our technique in conjunction with dropout, the λ hyperparameter will need to be increased in order to maintain the desired Lipschitz constant. Note that this does not necessarily mean that the optimal value for λ, from the point of view of generalisation performance, can be found by performing hyperparameter optimisation without dropout, and then dividing the best λ found on the validation set by one minus the desired dropout rate: the change in optimisation dynamics and regularisation properties of dropout make it difficult to predict analytically how these two methods interact when considering generalisation performance.

Experiments
In our experiments, we aim to answer several questions about the behaviour of our Lipschitz Constant Constraint (LCC) regularisation scheme. The question of most interest is how well this regularisation technique compares to the state-of-the-art, in terms of the accuracy measured on held out data. In addition to this, we perform experiments that demonstrate how sensitive the method is to the choice of λ, how it interacts with existing regularisation methods, and investigate to what extent other methods implicitly control the Lipschitz continuity of feed-forward networks.
Several different architectural design philosophies are used throughout the experiments. Specifically, we use multi-layer perceptrons, VGG-style convolutional networks, and networks with residual connections. This is done to ensure that the regularisation method works for a broad range of feed-forward architectures. Similarly, two different optimisers are used in order to verify that Algorithm 2 behaves well when used with a range of common stochastic gradient methods. Specifically, we use SGD with Nesterov momentum for some experiments, and an adaptive gradient method, AMSGrad (Reddi et al., 2018), for other experiments.

Synthetic Data
We begin by illustrating several key points using a simple synthetic dataset generated using the function This function was chosen because there are two trends: one with a large amplitude that varies slowly, and another that varies quickly and has a much smaller amplitude. A model with a reasonably small Lipschitz constant should largely ignore the high frequency signal and model only the component with low frequency and high amplitude. Because the function is a scalar function of one variable, it is also very convenient to visualise. A training dataset was generated by randomly sampling 1,000 points from a uniform distribution covering the range [−5, 5]. Each of these points is then labelled by applying the generating function in Equation 29. We train Lipschitz continuous neural networks with several different λ values for each of the three p-norms discussed earlier. Fully connected networks with two hidden layers, each containing 1,000 hidden units, were used. Figures 1,  2, and 3 show the result of training these networks on the synthetic training set using p = 1, p = 2, and p = ∞, respectively.
For all three norms, when λ is set to one, the function approximates the training data very poorly, failing to even capture the general trend provided by the sine function. As λ is increased, the networks approximate the function used to generate the dataset with higher accuracy. As expected, for each norm, there is a window where only the low frequency component of the function is approximated.
One might expect that, because we are dealing with a scalar function, all three norms would behave the same, as they all reduce to taking the absolute value in the single dimensional case. Despite this, we clearly observe varying levels of approximation quality when comparing the use of different norms with the same value for λ. In the case of the ℓ 2 norm, this is fairly easy to explain: we are only approximating the spectral norm of each layer using the power method, which we have already explained is liable to underestimate the true spectral norm if only a few iterations are used. We suspect the difference between the ℓ 1 and ℓ ∞ norm approaches-most noticeable when λ is two-is explained by a less obvious phenomenon. Because of the difference in the operator norms for the ℓ 1 and ℓ ∞ vector norms, the set of permissible weight matrices is different. This is something that is very likely to impact the trajectory of the weights during optimisation, and hence result in quite different solutions. Optimisation dynamics are presumably also the reason the function is poorly approximated when λ is one-even though the sine function is 1-Lipschitz. 2

CIFAR-10
The CIFAR-10 dataset (Krizhevsky and Hinton, 2009) contains 60,000 tiny images, each belonging to one of 10 classes. We follow the common protocol of using 10,000 of the images in the 50,000 image training set for tuning model hyperparameters. Two networks are considered for this dataset: a VGG19-style network, resized to be compatible with the 32×32 pixel images in CIFAR-10, and a Wide Residual Network (WRNs) (Zagoruyko and Komodakis, 2016). All experiments on this dataset utilise data augmentation in the form of random crops and horizontal flips, and the image intensities were rescaled to fall into the [−1, 1] range. Each of the VGG networks were trained for 140 epochs using the AMSGrad optimiser. The initial learning rate was set to 10 −4 and decreased by a factor of 10 after epoch 100 and epoch 120. The WRNs were trained for a total of 200 epochs using the stochastic gradient method with Nesterov's momentum. The learning rate was initialised to 0.1, and decreased by a factor of 5 at epochs 60, 120, and 160. All network weights were initialised using the method of Glorot and Bengio (2010).
We begin by comparing how well the LCC regulariser improves generalisation compared to other commonly used and related methods. Specifically, dropout, batch normalisation, and the spectral decay method of  are considered. Dropout and batch normalisation are the two most widely used regularisation schemes, often acting as key components of state-of-the-art models (Simonyan and Zisserman, 2014;He et al., 2016;Zagoruyko and Komodakis, 2016), and the spectral decay method has a similar goal to the ℓ 2 instantiation of our method: encouraging the spectral norm of the weight matrices to be small. For this particular experiment, we consider each regulariser in isolation, and also the combination of our technique with dropout and batch normalisation. Results are given in Tables 1 and 2 for the VGG and WRN architectures, respectively. Interestingly, the performance of the VGG network varies considerably more than that of the Wide Residual Network. VGG models trained with our regularisation exhibit only a small performance gain over the unregularised and spectral decay baselines; however, when combined with batch normalisation, there is a sizeable increase in accuracy. In the case of WRNs, our method performs similarly to dropout. It is noteworthy that spectral decay performs worse than LCC-ℓ 2 , as expected based on our discussion in Section 3.2.

CIFAR-100
CIFAR-100, like CIFAR-10, is a dataset of 60,000 tiny images, but with 100 classes. The same data augmentation methods used for CIFAR-10 are also used for training models on CIFAR-100-random crops and horizontal flips. We train WRNs and VGG-style networks on this dataset. We found that the learning rate schedules used in the CIFAR-10 experiments also worked well on this dataset, which is not surprising given their similarities. We did, however, optimise the regularisation hyperparameters specifically for CIFAR-100. The   results for the VGG networks are given in Table 3, and the WRN results can be found in Table 4.
We can see that combining the Lipschitz-based regularisation scheme with batch normalisation is a particularly effective technique for improving generalisation of networks both with and without residual connections. For the VGG networks, there is an obvious trend that the combination of LCC with batch normalisation provides a sizeable increase in accuracy. For the WRN models, there are more modest gains provided by the combination of LCC and dropout. Again, spectral decay performs consistently worse than LCC-ℓ 2 .

MNIST and Fashion-MNIST
The Fashion-MNIST dataset (Xiao et al., 2017) is designed as a more challenging drop-in replacement for the original MNIST dataset of hand-written digits (LeCun et al., 1998). Both contain 70, 000 greyscale images labelled with one of 10 possible classes. The last 10,000 instances are used as the test set. For optimising hyperparameters we use the last 10,000 of the training set to measure validation accuracy. We train a small convolutional network on both of these datasets with different combinations of regularisers. The network contains only two convolutional layers, each consisting of 5×5 kernels, and both are followed by 2 × 2 max pooling layers. The first layer has 64 feature maps, and the second has 128.
These layers feed into a fully connected layer with 128 units, which is followed by the output layer with 10 units. ReLU activations are used for all layers, and each model is trained for 60 epochs using AMSGrad. The learning rate was started at 0.0001 and decreased by a factor of 10 at the fiftieth epoch. The test accuracies for each of the models trained on these datasets are given in Table 5. For both datasets, the LCC methods alone provide a small but consistent improvement in performance. When combined with dropout only, the gains are less consistent. Applying LCC in conjunction with batch normalisation provides a larger, and more reliable, increase in test accuracy. Spectral decay is slightly worse than LCC-ℓ 2 .

Street View House Numbers
The Street View House Numbers dataset contains over 600, 000 images of digits extracted from Google's Street View platform. Each image contains three colour channels and has a resolution of 32×32 pixels. As with the previous datasets, the only preprocessing performed is to rescale the input features to the range [−1, 1]. No data augmentation is performed while training on this dataset. The first network used for this dataset, which follows a VGG-style structure, is comprised of four conv-conv-maxpool blocks with 64, 128, 192, and 256 feature maps, respectively. This is followed by two fully connected layers, each with 512 units, and then the logistic regression layer. Due to the large training set size, we found that it was only necessary to train for 20 epochs. The AMSGrad optimiser was used with an initial learning rate of 10 −4 , which was decreased by a factor of 10 at epochs 15 and 18. Results demonstrating the performance of individual regularisers, and the combination of LCC with dropout and batch normalisation, are given in Table 6. We also train small WRN models on this dataset. Once again, due to the large size of the training set, it is sufficient to only train each network for 20 epochs in total. We therefore use a compressed learning rate schedule, compared to WRNs trained on CIFAR-10 and CIFAR-100. The learning rate is started at 0.1, and is decreased by a factor of 5 at epochs 6, 12, and 16. Test set performance for each of the WRN models trained on SVHN is provided in Table 7.
In isolation, LCC provides only a small improvement in test accuracy, which is in line with what has been observed on other datasets so far. Also congruent with our other results is that the performance of LCC combined with batch normalisation on the VGG networks is noticeably higher than other approaches-in this case these networks perform comparably with the WRN models. For this dataset, the LCC methods provide very little benefit to the residual networks we have considered. Spectral decay again performs worse than LCC-ℓ 2 and actually fails to converge when using wide residual networks.

Fully Connected Networks
Neural networks consisting exclusively of fully connected layers have a long history of being applied to classification problems arising in data mining scenarios. To evaluate how well the LCC regularisers work on tabular data, we have trained fully connected networks on the classification datasets collected by Geurts and Wehenkel (2005). These datasets were primarily collected from the University California at Irvine dataset repository, and the only selection criterion used by Geurts and Wehenkel (2005) was that they contain only numeric features. Each network contained two hidden layers consisting of 100 units, and used the ReLU activation function. We performed two repetitions of 5-fold cross validation for each dataset. Hyperparameters for each regulariser were tuned on a per-fold basis using grid search. The accuracy of a particular hyperparameter combination tried during the grid search was determined using a hold-out set drawn from the training data in each fold. The values considered for dropping a unit when using dropout were p ∈ {0.2, 0.3, 0.4, 0.5}. The values considered for λ when using the ℓ 2 and ℓ ∞ approaches were {2, 4, ..., 18, 20}, and for the ℓ 1 variant we used {5, 10, ..., 45, 50}. Once again, we also considered the combination of each of the regularisation methods. Because multiple estimates of the test set accuracy are available to us, we can perform hypothesis tests to determine statistically significant differences in performance of each of the regularisers over the unregularised baseline. We use the hypothesis testing procedure outlined by Bouckaert and Frank (2004) in order to overcome the problem of overlapping testing and training sets introduced by the cross validation procedure. The mean test set accuracies on each dataset are given in Table 8, with statistically significant improvements and degredations in accuracy at the 95% confidence level indicated by the +and -symbols, respectively.
Several interesting trends can be found in this table. One particularly surprising trend is that the presence of dropout is a very good indicator of a statistically significant degredation in accuracy. Interestingly, the only exceptions to this are the two synthetic datasets, where dropout is associated with an improvement in accuracy. The combination of batch normalisation with LCC is one of the more reliable approaches to regularisation on the tabular classification datasets considered. In particular, the LCC-ℓ ∞ method combined with batch normalisation achieves the highest mean accuracy on seven of the 10 datasets. On the other three datasets, either all regularisers are ineffective, or LCC-ℓ ∞ is still present in the best combination of regularisation schemes. This provides strong evidence that LCC-ℓ ∞ is a good choice for regularisation of neural network models trained on tabular data.
These results can also be visualised using a critical difference diagram (Demšar, 2006), as shown in Figure 4. The average rank of LCC-ℓ ∞ combined with batch normalisation is 2.6, which is considerably higher than the next best method: batch normalisation, with an average rank of 5.9. However, there is insufficient evidence to be able to state that LCC-ℓ ∞ significantly outperforms batch normalisation. It can also be seen from this diagram that the LCC-ℓ ∞ with batch normalisation is statistically significantly better than most of the combinations of regularisers that include dropout.  Figure 4: A critical difference diagram showing the statistically significant (95% confidence) differences between the average rank of each method. The number beside each method is the average rank of that method across all datasets. The thick black bars overlaid on groups of thin black lines indicate a clique of methods that have not been found to be statistically significantly different.

Sensitivity to λ
We have proposed using the Lipschitz constant of a network to control its effective model capacity. This suggests setting λ too low will cause the network to underfit the data. Similarly, if λ is set too high, the network will overfit. Figure 5 shows how the validation accuracies of the VGG19 network trained on CIFAR-10 are impacted by varying the value of λ for the LCC-ℓ 1 , LCC-ℓ 2 , and LCC-ℓ ∞ regularisation methods. The plot corresponding to LCC-ℓ 2 best follows our expected trend. With λ = 2, the network underfits, relative to the best validation accuracy obtained. Increasing λ results in improved accuracy initially, but increasing it further causes the model to begin to overfit. This trend is also apparent to a lesser extent in the LCC-ℓ 1 and LCC-ℓ ∞ plots.

Do other methods constrain the Lipschitz constant?
The results presented so far have indicated that constraining the Lipschitz constant of a network provides effective regularisation. It is interesting to consider whether other com- indicates that the methods are complementary, but to further investigate this, we supply plots of the Lipschitz constant of each layer of the networks trained on CIFAR-10. These plots are given in Figure 6. When the network is trained with dropout, we scale each of the operator norms by the probability of retaining an activation for the reasons described in Section 4.3. The network trained with batch normalisation is omitted from these plots due to excessively large variance in the per-layer Lipschitz constants-often over an order of magnitude larger than that of other methods, and an order of magnitude lower in some cases.
The most obvious characteristic of the plots in Figure 6 is that networks trained without LCC have significantly larger operator norms than those trained with LCC. Interestingly, dropout does reduce both the ℓ 1 and ℓ ∞ operator norms for nearly all layers compared to the network trained without dropout or LCC. This indicates that dropout does, to some extent, penalise layers with large operator norms, but that there is also another mechanism at play. When using LCC-ℓ 2 to regularise a network, a single iteration of the power method is sufficient to estimate the operator norm during training. However, when constructing these plots, we performed power iterations until the operator norms converged to their true values. As can be seen in the middle plot, this shows that some layers were not quite constrained to have an operator norm less than four-the value chosen for λ for this network.

Convergence
The technique presented in this paper incurs only a small increase in the time required to perform each minibatch of training. The exact amount of time required to perform the projection of each weight matrix, in the case of the ℓ 1 and ℓ ∞ variants, is proportional to the number of parameters in the network. In the ℓ 2 case, the runtime is approximately the same as performing both a forward and backward propagation for a single sample. For typical network architectures, one can expect an increase in runtime of only a few percent, across all metrics.
Another potential concern is the number of weight updates required for the network loss to converge. Learning curves for VGG and WRN networks trained on CIFAR-100 are provided in Figure 7. These plots alleviate the concern that the projection operation may slow convergence, due to the truncated weight update. The rate of convergence of our method follows similar trends to the other regularisers. Something interesting to note is that it is particularly dependent on dropping the learning rate. This is most evident in the WRN chart, where the models that do not use the per layer operator norm constraint converge towards much better test accuracies in the first 120 epochs, before the learning rate has been lowered substantially.

Conclusion
We have presented a simple and effective regularisation technique for deep feed-forward neural networks, shown that it is applicable to a variety of feed-forward neural network architectures, and does not depend on a particular optimiser. We have also demonstrated that our technique can be used in conjunction with both batch normalisation and dropout, often with cumulative performance gains. The investigation into the differences between the three p-norms (p ∈ {1, 2, ∞}) we considered has provided some useful information about which one might be best suited to the problem at hand. In particular, the ℓ ∞ norm was best suited to tabular data, and the ℓ 2 norm consistently showed the best performance when used as a regulariser on natural image datasets. However, given that LCC-ℓ 2 is only approximately constraining the norm, if one wants a guarantee that the Lipschitz constant of the trained network is bounded below some user specified value, then using the ℓ 1 or ℓ ∞ norm would be more appropriate.
Lastly, we anticipate that the utility of constraining the Lipschitz constant of neural networks is limited not only to improving classification accuracy. There is already evidence that constraining the Lipschitz constant of the discriminator networks in GANs is useful (Arjovsky et al., 2017;. Given the shortcomings in these previous approaches to constraining Lipschitz constants we have outlined, one might expect improvements training GANs that are k-Lipschitz with respect to the ℓ 1 or ℓ ∞ norms, and approximately 1-Lipschitz with respect to the ℓ 2 norm. Exploring how well our technique works with recurrent neural networks would also be of great interest, but the theoretical basis for the method presented in this paper does apply in the context of recurrent neural networks. Finally, our method assumes that all layers should have the same Lipschitz constant. This is a potentially inappropriate assumption in practice, and a more sophisticated hyperparameter tuning mechanism that allows for selecting a different value of λ for each layer could provide a further improvement to performance.