1 Introduction

One of the key areas of artificial intelligence is supervised machine learning (ML), which involves the development of algorithms and models capable of learning from a given set of samples and making predictions or decisions on previously unseen data. In the ML community, this property is commonly known as generalization capability. In most real-world applications, as well as in the case analyzed in this paper, the system is a neural network trained by minimizing a differentiable loss function that measures the dissimilarity between some target values and the values returned by the network itself. The training process consists of the minimization of a loss function with respect to the network weights and biases, which can result in a complex and large-scale optimization problem.

The crucial role played by optimization algorithms in machine learning, acknowledged since the birth of this research field, has been widely and deeply discussed in the literature, both from an operations research and a computer science perspective. Training a supervised ML model involves, indeed, addressing the optimization problem (Gambella et al. 2021) of minimizing a function that measures the dissimilarity between predicted and correct values. Simple examples are the different types of regression (Lewis-Beck and Lewis-Beck 2015; LaValley 2008; Ranstam and Cook 2018), neural network optimization (Goodfellow et al. 2014), decision trees (Carrizosa et al. 2021; Rokach and Maimon 2010; Bertsimas and Dunn 2017; Buntine 2020), and support vector machines (Steinwart and Christmann 2008; Tatsumi and Tanino 2014; Suthaharan and Suthaharan 2016; Pisner and Schnyer 2020). While the underlying idea of stochastic gradient-like methods, proposed by Robbins and Monro (1951), dates back to the 1950s, the research community has deeply investigated their theoretical and computational properties, and a vast number of new algorithms has been developed in the past decades.

Despite all the studies that have already been carried out in this field, which we discuss further below, to the best of our knowledge only a few have tried to answer some relevant computational questions encountered when using optimization methods to train deep neural networks (DNNs). In this paper, we point out and address the following issues in solving training optimization problems, with the aim of assessing the influence of optimization algorithm settings and architectural choices on generalization performance:

  • convergence to local versus global minimizers and the effect of the quality (in terms of training loss) of the solution on the generalization performances;

  • effect of the non-monotone behavior of mini-batch methods with respect to traditional batch methods (L-BFGS) on the computational performances;

  • different role of the starting point and regions of attraction on L-BFGS than on mini-batch algorithms;

  • importance of the optimization algorithm’s hyperparameters tuning on the optimization and the generalization performances;

  • the robustness of a hyperparameter setting tuned on a specific architecture and dataset when the number of layers and neurons or the dataset is modified.

In this paper, we discuss and try to answer some questions regarding the aforementioned issues. We conduct extensive computational experiments to enforce our main claims:

  i) generalization performances can be influenced by the solution found in the training process. Local minima can be very different from each other and result in very different test performances;

  ii) traditional batch methods, like L-BFGS, are less efficient and also more sensitive to the starting point than mini-batch online algorithms;

  iii) hyperparameters tuned on a specific baseline problem, namely a given baseline architecture trained on an instance of a class of problems, can achieve better generalization performance than the default ones even on different problems, changing the architecture and/or the instance in the given class.

To this aim, we consider the task of training convolutional neural networks (CNNs) for image classification. We use three open-source datasets to carry out our experiments. We train the networks using nine optimization algorithms implemented in open-source state-of-the-art libraries for optimization and ML. We show that not all the algorithms reach a neighborhood of a global optimum, some getting stuck in local minima. In particular, FTRL, Adadelta, and Adagrad cannot find good solutions on our experimental testbed, regardless of the initialization seed and the hyperparameter setting. We also notice that test performance, i.e., the classification test accuracy, is remarkably higher when a good approximation of the global solution is reached and that better solutions can be achieved by carefully choosing the optimization hyperparameter setting. We carry out a thorough computational analysis to assess the robustness of the hyperparameter configuration tuned on a baseline problem (the image classification task on the open-source dataset UC Merced (Yang and Newsam 2010) using a customized deep convolutional neural network) with respect to architectural changes of the network and to new image classification datasets, and we find that the hyperparameters tuned on the baseline problem often give better out-of-sample performance than the default settings even on different image classification datasets. Notice that we define a problem as a dataset-network pair, e.g., UC Merced-Baseline architecture.

The paper is organized as follows. In Sect. 2, we discuss some relevant literature, highlighting both the importance of what has already been produced by the ML research community and the novelty of our contributions. In Sect. 3, we describe the network architecture, while in Sect. 4, we formalize the optimization problem behind the image classification task, mathematically describing the convolution operation performed by the network layers. In Sect. 5, we briefly describe each of the nine different open-source algorithms we have tested on our task. In Sect. 6, we describe the composition of the three open-source datasets, and in Sect. 7, implementation details are reported. In Sect. 8, we describe in detail the computational tests we have carried out on different networks and datasets. We present our conclusions in Sect. 9.

2 Related literature

Several attempts have been made in the scientific literature to address the main issues discussed in this paper, both in the form of surveys and in the form of comparative analyses and computational studies. For instance, to understand how different types of data and the tuning of algorithm parameters affect performance, Lim et al. (2000) carried out a thorough comparative analysis of nearly all the algorithms available at the time for classification tasks. As machine learning and, in particular, deep learning gained steadily growing interest in the community, this comparative analysis methodology became a standard framework applied to specific methods and neural architectures. More recently, some other specific surveys have been produced, in particular comparing the behavior of different optimization algorithms on image classification tasks (Dogo et al. 2018; Kandel et al. 2020; Haji and Abdulazeez 2021), but they are mostly focused on mere computational aspects rather than on performance with respect to the ML task. Pouyanfar et al. (2018) and Braiek and Khomh (2020) provided methodological surveys on different approaches to ML problems. In the same years, algorithms used in machine learning have been widely studied also from an optimization perspective. Bottou et al. (2018) studied different first-order optimization algorithms applied to large-scale machine learning problems, while Baumann et al. (2019) produced a thorough comparative analysis of first-order methods in a machine learning framework and traditional combinatorial methods on the same classification tasks. Following the increased need for a high-level overview, some other papers on first-order methods have been published by Lan (2020), who provides a detailed survey on stochastic optimization algorithms, and by Sun et al. (2019), who compare from a theoretical perspective the main advantages and drawbacks of some of the most used methods in machine learning.

Some recent literature (see, e.g., Palagi 2019) also discusses the role of global optimization in the training of neural networks, as well as the problem of hyperparameter optimization. The role of global optimization in ML is also strictly linked to the emerging practice in ML of perfect interpolation, i.e., of training a model to fit the dataset perfectly. Advanced studies in this direction have been carried out over the past years, starting from Zhang et al. (2016) and Zhang et al. (2021), who reconsidered the classical bias–variance trade-off, remarking that most of the state-of-the-art neural models, especially in the field of image classification, are trained to reach close-to-zero training error, i.e., a global minimizer of the loss function. Sun (2019) investigates the problem of choosing the best initialization of parameters and the best-performing algorithms for a given dataset, within the so-called Global Optimization of the Network framework. Other computational studies (Advani et al. 2020; Spigler et al. 2019; Geiger et al. 2019) enforce the idea that larger or more trained (i.e., trained for a larger number of epochs) models also generalize better.

Another particularly valuable work for our research is Im et al. (2016), where a loss function projection mechanism is used to discuss how different algorithms can have remarkably different performances on the same problem. Moreover, the issue of having plenty of local minimizers, some of which are better than others in the sense that they lead to better test performances (which is, indeed, the main point of our claim i)), has been actively addressed both from a theoretical and from a computational perspective. Ding et al. (2022) provided a detailed mathematical proof of the existence of sub-optimal local minima for deep neural networks with smooth activation. The authors show that it is not possible to derive general mathematical rules guaranteeing convergence to good local minima. Recent research shows that local minima can, in practice, be distinguished by visualizing the loss function (Sun et al. 2020) and, in particular, that the occurrence of bad local minima can be empirically reduced with some architectural choices (Li et al. 2018). We show the role of selecting different stationary points in Sect. 8.1, where a multistart approach is also used on a subset of algorithms that seem particularly affected by the starting point.

The issues surrounding the hyperparameters of optimization algorithms are also an important field for the ML research community. These hyperparameters are often treated in the same way as the hyperparameters defining the architecture (layers, neurons, activation functions, etc.), thus causing possible confusion about the reason for the good/bad performance of the obtained classification model. More specifically, although the problem of local minimizers is certainly well known, it has been studied mostly in relation to the loss landscape, namely in relation to architecture hyperparameters, which can have an impact on shaping the loss landscape. However, recent research highlights that the hyperparameter setting can have a strong influence on an algorithm's behavior if the hyperparameters are specifically tuned on a given task. Xu et al. (2020) are among the first to point out the problem of the robustness of hyperparameters; they discuss how traditional first-order methods can get stuck in bad local minima or saddle points when tackling non-convex ML problems and how computational results can depend on the hyperparameter setting. Jais et al. (2019) carry out a thorough analysis of the Adam algorithm's performance on a classification problem, focusing on optimizing the network structure as well as the Adam parameters. Nonetheless, hyperparameters are often set to a default value, which is obtained by maximizing the aggregated (in most cases, the average) performance across a variety of tasks, balancing a trade-off between efficiency and adaptability to different datasets (Probst et al. 2019; Yang and Shami 2020; Bischl et al. 2023). To our knowledge, no one has systematically addressed the question of whether it could be convenient to tune the hyperparameters on a baseline problem (small network and small dataset) and use the tuned configuration on other problems (network-dataset pairs) rather than using the default setting, which is our claim iii). Indeed, performing a grid search is computationally expensive for the considered task due to the large amount of training time needed for each possible combination of hyperparameters. Thus, we aim to show that performing a single grid search, tuning the hyperparameters for the baseline network on a simple dataset, can also bring advantages on more complex problems (network-dataset pairs). The grid search on the baseline problem is reported in Sect. 8.2. We then reuse the best-identified hyperparameter setting to investigate its effect when the architecture changes (Sect. 8.3) and when the dataset varies (Sect. 8.4). Indeed, we aim to analyze whether the highly demanding grid search operation is more dataset-oriented or architecture-oriented, i.e., whether the hyperparameters are more sensitive to changes in the architecture or in the dataset, given the same (classification) task.

3 The task and the network architectures

The chosen task is multi-class image classification, which is a predictive modeling problem where a class, among a given set of classes, is predicted for each input sample. More formally, we are given a training set made up of P pairs \((x^j, y^j)\), \(j=1,\dots ,P\), where x is a two-dimensional color input image represented by \(H \times W\) pixels for each of the three color channels (red, green, blue), thus as a tensor \((3 \times H \times W)\), and y is the corresponding class label. We denote by N the number of possible classes, so that \(y^j\in \{0,1\}^N\), and the target class value of sample j is \(y^ j_{i}= 1\) if the sample image j belongs to class i and \(y^ j_{i}= 0\) otherwise. In this paper, we consider three well-known 2D image datasets, described in Sect. 6.

Developed and formalized by LeCun et al. (1995), deep Convolutional Neural Networks (CNNs), a special type of deep neural network (DNN) architecture, are one of the most widespread types of neural networks for image processing, e.g., image recognition and classification (Hijazi et al. 2015), monocular depth estimation (Papa et al. 2022), semantic segmentation (Guo et al. 2018), video recognition (Ding and Tao 2017), and vision, speech, and image processing tasks (Abbaschian et al. 2021; Kuutti et al. 2021; Shorten et al. 2021). In the literature, well-known deep CNN models have been developed to address multi-class classification. Among them, we cite DenseNet (Huang et al. 2017), ResNet (He et al. 2016), and MobileNet (Howard et al. 2017). These architectures are distinguished by specific and complex designs composed of stacked operational blocks.

In this paper, different optimizers are tested over the three datasets and different CNN architectures. In particular, we specifically design a lightweight, low-complexity Baseline CNN model composed of elementary operations, such as Convolution (Conv2D), Pooling, and Fully Connected (FC) layers, briefly described below. Moreover, starting from the Baseline model, we design three architectural variants based on the same elementary blocks, varying the number of units per layer (Wide), the number of layers (Deep), and both of them (Deep&Wide). We refer to these architectures as Synthetic Networks.

The Synthetic Networks have been designed to analyze whether the hyperparameter setup tuned for the Baseline architecture on a baseline dataset yields similar improvements across architectural and dataset changes. To check whether this behavior generalizes to other architectures, we also used two traditional CNN architectures, namely Resnet50 (He et al. 2016) and Mobilenetv2 (Sandler et al. 2018).

In this section, we present the architectural aspects of the Baseline CNN architecture and its Wide, Deep, and Deep&Wide variants. Details and the mathematical formalization of the operations performed by the different layers are presented in Sect. 4.

The Baseline CNN is composed of a cascade of

  • five Convolutional Downsampling Blocks (CDBs)

  • one Fully Connected Block (FCB)

  • one final Classification Block (CB).

A graphical representation of the models and a detailed block diagram representation, with layer operations and respective parameters, are reported in Fig. 1. Each CDB block, represented by the yellow blocks in Fig. 1, performs the following sequence of operations:

  • a standard 2D-convolution (Conv2D layer), which takes as input a tensor of dimension \(C \times H \times W\) and produces as output a new tensor with the spatial feature dimensions H and W decreased and the number of channels C increased;

  • application of an activation function \(\sigma \);

  • a 2D-max-pooling or 2D-mean-pooling operation, which downsamples the extracted features along their spatial dimensions by taking the maximum value or the mean over a fixed-size window, known as the pool size;

  • batch normalization.

Both the CDB and the FCB allow dropout with a given drop rate. Dropout consists of randomly removing parameters during optimization. Thus, it affects the structure of the objective function during the iterations by fixing some variables, and it can be seen as a sort of decomposition over the variables.

The FC layer is a shallow Feed-forward Neural Network (FFN) in which all the possible layer-by-layer connections are established. The FCB block uses the Dropout operation, which is considered a trick to prevent overfitting. The last Classification Block is made up of an FC layer too, followed by the SoftMax activation function to extract the probability of each class.

An overview of the input–output shapes of the Baseline model and the respective number of trainable parameters is reported in Table 1.
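As an illustration of how these blocks compose, the following Keras sketch assembles a model of the same kind (Conv2D, pooling, batch normalization, dropout, fully connected, softmax classification). The filter counts, kernel size, pool size, and drop rate are illustrative placeholders and do not reproduce the exact values of Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_downsampling_block(x, filters, activation="relu", drop_rate=0.2):
    # Conv2D -> activation -> pooling -> batch norm (+ optional dropout),
    # mirroring the CDB description; the numeric values are placeholders.
    x = layers.Conv2D(filters, kernel_size=3, strides=1, activation=activation)(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(drop_rate)(x)
    return x

def build_baseline(input_shape=(256, 256, 3), n_classes=21):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64, 128, 256):        # five CDBs (illustrative widths)
        x = conv_downsampling_block(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)   # FCB
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)  # CB
    return models.Model(inputs, outputs)

model = build_baseline()
model.summary()
```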

Fig. 1

Overview of the Baseline model and corresponding blocks used for the study and its variants Wide, Deep, and Deep&Wide. A generic activation function \(\sigma \) is used

Table 1 Baseline architecture: input–output spatial dimensions, reported in the [Channels, Height, Width] format, and number of trainable parameters (\(\hbox {N}^{\circ }\) Param.) of each layer in the case of a \(256 \times 256\) input

The Wide, Deep, and Deep&Wide architectures are detailed below, while a block diagram for each designed architecture is shown in Fig. 1.

  • The Wide model is designed by doubling the dimension of the output of each Conv2D layer, i.e., the number of output filters of the Baseline model in the convolution.

  • The Deep model is designed by doubling the number of convolutional operations, i.e., stacking to each CDB a further Convolutional Block (CB), as reported in Fig. 1 (blue blocks), performing the same operations as the CDB except for the downsampling step of the 2D-max/mean pooling.

  • The Deep&Wide model is designed by combining the previous Wide and Deep structures.

Finally, to assess the generality of the computational results, tests have been carried out also using two well-known neural architectures: Resnet50 (He et al. 2016) and Mobilenetv2 (Sandler et al. 2018). Resnet50 is a deep convolutional neural network with 50 hidden layers and with residual connections at each layer, meaning that the output of each layer is added to the output of the subsequent layer to prevent the well-known vanishing gradient issue (Borawar and Kaur 2023). Mobilenetv2 is a lightweight convolutional neural network, which uses a lighter convolutional operator. Both Resnet50 and Mobilenetv2 are among the most used neural architectures for image classification.

4 The optimization problem for the synthetic networks

The optimization problem related to our task consists in the unconstrained minimization of the Categorical Cross Entropy (CCE) between the predicted output of the neural model \(\hat{y}^ j_{i}(\omega )\) and the correct classes \(y^j\in \{0,1\}^N\).

The predicted value \(\hat{y}^ j_{i}(\omega )\) is the output of the last Classification block, and it represents the probability, estimated by the neural architecture, that the sample j belongs to class i. Thus, we have \(\sum _{i=1}^{N} \hat{y}^ j_{i}(\omega ) = 1 \ \forall j = 1,\dots ,P \ \forall \omega \in \mathbb {R}^h\). Thus, the unconstrained optimization problem can be written as

$$\begin{aligned} \min _{\omega \in \mathbb {R}^h} f(\omega ) {:}{=} - \frac{1}{P}\sum ^P_{j=1} \sum ^{N}_{i=1} y^j_{i} \log (\hat{y}^j_{i}(\omega )). \end{aligned}$$
(1)
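As a concrete reading of Eq. (1), the following minimal NumPy sketch evaluates the categorical cross-entropy for one-hot targets; the clipping constant is an added safeguard against \(\log (0)\) and is not part of the formulation above.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """y_true: (P, N) one-hot labels; y_pred: (P, N) predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)          # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Toy usage: P = 2 samples, N = 3 classes.
y_true = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_true, y_pred))
```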

We now derive the probability output of the Baseline synthetic network as a function of the network parameters \(\omega \), i.e., \(\hat{y}^ j_{i}(\omega )\), and, in particular, write the dependency of \(\hat{y}^ j_{i}\) on \(\omega \) in closed form.

As reported in Fig. 1, in the Baseline model, the input images propagate through convolutional downsampling blocks (CDB), a fully connected block (FCB), and a classification block (CB). We formalize this process to get an analytical expression for \(\hat{y}^ j_{i}(\omega )\).

Each CDB performs a standard 2D-convolution, denoted as Conv2D layer in Fig. 1, followed by an activation function, a Max-pooling or Mean-pooling operation, and batch normalization. In the CB, the Max/Mean-pooling is removed.

The input \(X^{\ell -1}\) to a Conv2D layer \(\ell \) is a tensor of dimension \([d^{\ell -1}, m^{\ell -1}, m^{\ell -1}]\), where \(d^{\ell -1}\) is the number of channels and \(m^{\ell -1}\) is the height/width of each channel \(c=1,\dots , d^{\ell -1}\); the output \(X^\ell \) is a tensor of dimension \([d^{\ell }, m^{\ell }, m^{\ell }]\). The input \(X^0\) at layer \(\ell =0\) of the first CDB is the color sample image represented with \(m^0=256\) pixels and \(d^0=3\) color channels (red, green, blue). We denote by \(X^\ell _c\) the \(m^{\ell } \times m^{\ell }\) matrix for each channel \(c=1,\dots , d^{\ell }\) and set \(x_c=X^0_c\in \mathbb {R}^{ m^0 \times m^0}\) for \(c=1,\dots , d^0\). A Conv2D layer \(\ell \) applies a discrete convolution to the input \(X^{\ell -1}\). This operation consists in applying filters (also called kernels) \(w^k_c \in \mathbb {R}^{n \times n}\) for \(k=1,\dots , d^{\ell }\) to the c-th input channel \(X^{\ell -1}_c\), with \(c=1,\dots , d^{\ell -1}\).

The convolution operation depends on the integer stride \(s\ge 1\), representing the amount by which the filter \(w^k_c\) shifts over the input \(X^{\ell -1}_c\). The stride is commonly set to \(s=1\), as we did for all the experiments except for L-BFGS, for which the stride has been fixed to \(s=2\). The dimension of the filter n, the number of channels \(d^\ell \), and the stride \(s\ge 1\), for each layer \(\ell \), are network hyperparameters. Let us denote by \(\otimes _s\) the discrete convolution operation with stride \(s\in \mathbb {N}\). The component-wise expression of the convolution between the filter \(w^k_c \in \mathbb {R}^{n \times n}\) and the input feature \(X^{\ell -1}_c\) is

$$\begin{aligned} \left[ w^k_c \otimes _s X_c\right] _{ij} = \sum _{a=1}^n \sum _{b=1}^n [w^k_c]_{a,b}[X_c]_{i+sa, j+sb} \quad i, j=0,\dots , m-n, \end{aligned}$$

where for the sake of simplicity, we avoid the use of the superscript \(\ell \). As reported in Bengio et al. (2017) Chapter 9 (equation (9.4) with \(s=1\)), the kth convoluted output is the matrix \(F_k^\ell \in \mathbb {R}^{\frac{(m^{\ell -1}-n+1)}{s} \times \frac{(m^{\ell -1}-n+1)}{s}}\) defined as the sum over the channels, namely

$$\begin{aligned} F^\ell _k = \sum _{c=1}^{d^{\ell -1}} w^k_c \otimes _s X^{\ell -1}_c \qquad k=1,\dots , {d^{\ell }}. \end{aligned}$$
(2)

The convoluted output \(F^\ell \) of the Conv2D layer \(\ell \) is thus

$$F^\ell =[F^\ell _k]_{k=1,\dots , d^\ell }= (F^\ell _1,\dots ,F^\ell _{d^{\ell }}) \in \mathbb {R}^{d^{\ell } \times \frac{(m^{\ell -1}-n+1)}{s}\times \frac{(m^{\ell -1}-n+1)}{s}}.$$

The parameters of the convolutional layer \(\ell \) are denoted as

$$\begin{aligned} W_{Conv}^{\ell } = \left[ w^k_c\right] ^{k=1,\dots ,d^{\ell }}_{c=1,\dots ,d^{\ell - 1}} \in \mathbb {R}^{n \times n \times d^{\ell } \times d^{\ell - 1}}. \end{aligned}$$
(3)
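To make the indexing explicit, here is a direct (and deliberately naive) NumPy transcription of the strided convolution of Eq. (2) and of the stacking of the \(d^{\ell }\) output channels, using standard zero-based strided indexing and no padding; the tensor layout of the filters in the sketch is a coding convenience, not the ordering of Eq. (3).

```python
import numpy as np

def conv2d_single(w, X, s=1):
    """Discrete 2D convolution (cross-correlation form) of one n x n filter w
    with one m x m channel X, stride s, no padding."""
    n, m = w.shape[0], X.shape[0]
    out_dim = (m - n) // s + 1
    F = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            F[i, j] = np.sum(w * X[i * s:i * s + n, j * s:j * s + n])
    return F

def conv2d_layer(W, X, s=1):
    """W: (d_out, d_in, n, n) filters, X: (d_in, m, m) input.
    Returns F: (d_out, out_dim, out_dim), i.e. the channel sum of Eq. (2)."""
    d_out, d_in = W.shape[0], W.shape[1]
    F = [sum(conv2d_single(W[k, c], X[c], s) for c in range(d_in))
         for k in range(d_out)]
    return np.stack(F)

# Toy example: 3-channel 8x8 input, 4 filters of size 3x3, stride 1.
rng = np.random.default_rng(0)
out = conv2d_layer(rng.normal(size=(4, 3, 3, 3)), rng.normal(size=(3, 8, 8)))
print(out.shape)   # (4, 6, 6)
```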

The next block applies a nonlinear activation function \(\sigma \) to the output of the Conv2D layer \(\ell \). The activation function \(\sigma \) that has been used in the computational experiments in Sect. 8 can be either the ReLU \(\sigma (z) = \max \{ 0,z \}\) or the SiLU \(\sigma (z) = \frac{z}{1 + e^{-z}}\) (Mercioni and Holban 2020; Ramachandran et al. 2017). In particular, the SiLU is used when applying L-BFGS to ensure the smoothness of the objective function and avoid failures of the optimization procedure.

In CDB blocks, the Conv2D output undergoes the pooling operation, aimed at further reducing the image dimensionality. The pooling operation Pool involves sliding a two-dimensional non-overlapping \(p \times p\) window, where p is the pool size, over each convoluted output \(F^\ell _k\) and contracting the features lying within the \(p\times p\) region covered by the window using the max or the mean operation. More formally, let us introduce the set \(\mathcal{P}_{(i,j)}\) of positions, i.e., for each (i, j) the rows and columns of the matrix \(F^\ell _k\) that are located in the corresponding \(p\times p\) region; then, Pool: \(F^\ell _k\rightarrow G^\ell _k\), where the kth pooled output is \( G^\ell _k \in \mathbb {R}^{\frac{(m^{\ell -1}-n+1)}{sp} \times \frac{(m^{\ell -1}-n+1)}{sp}}\). The output of the Max-Pool or Mean-Pool layer is computed, respectively, as

$$\begin{aligned} (G_k)^{Max}_{ij} = \max _{(r,c) \in \mathcal{P}_{(i,j)}} (F_k)_{r,c} \qquad \qquad (G_k)^{Mean}_{ij} = \frac{1}{\left| \mathcal{P}_{(i,j)}\right| }\sum _{(r,c) \in \mathcal{P}_{(i,j)}} (F_k)_{r,c}, \end{aligned}$$
(4)

where, for the sake of simplicity, we removed the superscript \(\ell \). The set \(\mathcal{P}_{(i,j)}\) depends on how drastically we want to reduce the dimensionality. In our experiments, we have set \(p=2\). In this case, the moving region is just a \(2 \times 2\) matrix, and thus we halve the dimension of F. For instance, \(\mathcal{P}_{(1,1)} = \{(1,1),(1,2),(2,1),(2,2)\}\), meaning that \((G_k)_{1,1}\) is the maximum/mean value among the four values \(\{(F_k)_{1,1},(F_k)_{1,2},(F_k)_{2,1},(F_k)_{2,2}\}\). The Max-pooling operation is widely used in image classification, but it introduces a non-differentiability issue. For this reason, when testing L-BFGS, where differentiability is crucial, we use Mean-pooling.
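The following NumPy sketch implements the non-overlapping max/mean pooling of Eq. (4) for a single channel; p = 2 reproduces the halving used in our experiments.

```python
import numpy as np

def pool2d(F, p=2, mode="max"):
    """Non-overlapping p x p pooling of a square feature map F, as in Eq. (4)."""
    m = F.shape[0] // p * p                  # drop a possible ragged border
    blocks = F[:m, :m].reshape(m // p, p, m // p, p)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

F = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(F, p=2, mode="max"))    # [[ 5.  7.] [13. 15.]]
print(pool2d(F, p=2, mode="mean"))   # [[ 2.5  4.5] [10.5 12.5]]
```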

The output of a CDB block, obtained by combining Eqs. (2) and (4), is finally given as

$$\begin{aligned} Z^\ell = \left\{ \texttt {Pool} \left[ \sigma \left( F^\ell _k\right) \right] \right\} _{k = 1,\dots , d_{\ell }}, \end{aligned}$$
(5)

whereas the output of a CB block does not use the Pool operations and thus is given simply by \( Z^\ell = F^\ell \). In both cases, \(Z^\ell \) is then normalized to stabilize and speed up the training process. The normalization is performed following the standard batch-normalization procedure described in Ioffe and Szegedy (2015), i.e., subtracting the mean and dividing by the standard deviation.

The output \(X^L\) of the last layer, L being the total number of layers, of either a CDB or a CB is finally sent into the Fully Connected Block (FCB) and then into the Classification Block. The FCB is a shallow Feed-forward Neural Network (FFN) with \(M_{FCB}\) neural units and activation function \(\sigma \) (ReLU or SiLU, as before). The output of the FCB is given by

$$\begin{aligned} Z_{FCB} = \sigma (W_{FCB}^T X^{L} + b_{FCB} ), \end{aligned}$$
(6)

where \((W_{FCB},b_{FCB})\) are the weights and biases of the FFN. The Classification Block is made up of a shallow FFN followed by a softmax operation, so that the final output is

$$\begin{aligned} \hat{y}= \texttt {Soft}( Z_{CB}) =\texttt {Soft}\left( W_{CB}^T Z_{FCB} + b_{CB} \right) , \end{aligned}$$
(7)

where \(\texttt {Soft}\) is applied component-wise to the vector \(Z_{CB}\) as

$$\texttt {Soft}(z_h) = \frac{e^{z_h}}{\sum _{j=1}^{N} e^{z_j}}.$$

The overall network parameters are

$$\omega =(W_{CB},b_{CB},W_{FCB},b_{FCB},W_{Conv}^L, \dots , W_{Conv}^1).$$
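To make Eqs. (6)–(7) concrete, the following NumPy sketch evaluates the FCB and the Classification Block on a flattened feature vector; the sizes of the toy tensors are arbitrary, and the max-shift inside the softmax is an added numerical safeguard.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stabilization (added detail)
    e = np.exp(z)
    return e / e.sum()

def classification_head(x, W_fcb, b_fcb, W_cb, b_cb,
                        sigma=lambda z: np.maximum(z, 0.0)):
    """FCB (Eq. (6)) followed by the Classification Block (Eq. (7))."""
    z_fcb = sigma(W_fcb.T @ x + b_fcb)
    return softmax(W_cb.T @ z_fcb + b_cb)

rng = np.random.default_rng(0)
x = rng.normal(size=32)                                  # flattened X^L (toy size)
W_fcb, b_fcb = rng.normal(size=(32, 16)), np.zeros(16)   # M_FCB = 16 units (toy)
W_cb, b_cb = rng.normal(size=(16, 21)), np.zeros(21)     # N = 21 classes
y_hat = classification_head(x, W_fcb, b_fcb, W_cb, b_cb)
print(y_hat.sum())                                       # 1.0, a probability vector
```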

Mobilenetv2 (Sandler et al. 2018) and Resnet50 (He et al. 2016) present differences with respect to the Synthetic Networks. Indeed, Mobilenetv2 uses depthwise separable convolutions, which differ from (2), while Resnet50 presents residual connections among layers. Thus, the resulting optimization problems can differ from the one described in this section.

5 The selected optimization algorithms

In this paper, L-BFGS and eight state-of-the-art ML optimization algorithms with multiple hyperparameter setups are compared. Precisely, these are: Adam (Kingma and Ba 2015), Adamax (Kingma and Ba 2015), Nadam (Dozat 2016), RMSprop, SGD (Robbins and Monro 1951; Bottou et al. 2018; Ruder 2016; Sutskever et al. 2013b), FTRL (McMahan 2011), Adagrad (Duchi et al. 2011), and Adadelta (Zeiler 2012). We use the SciPy version of L-BFGS and the built-in implementations in the TensorFlow library for the other eight.

For the sake of completeness, we report the updating rule of each algorithm, assuming it is applied to the problem as in Eq. (1), namely

$$\min _{\omega \in \mathbb {R}^h} f(\omega )= \sum _{p=1}^P f_p(\omega ).$$

We note that all the algorithms require f to be continuously differentiable. The use of non-differentiable activation functions (ReLU) in the network layers and of the MaxPooling layer, as usually done in CNNs, implies that the objective function f does not satisfy this essential property, and a set of non-differentiable points arises. When using L-BFGS, this aspect becomes evident, as discussed in Sect. 8.1. We tried to use L-BFGS with the standard CNN setting, but very often the method failed and ended at a non-stationary point. Indeed, since L-BFGS is a full-batch method using all the samples at each iteration, whenever a point of non-differentiability is reached, the gradient returned by TensorFlow is None and the method gets stuck. The other eight first-order algorithms are instead mini-batch methods, which update the network parameters using only a small subset of the whole set of samples. When the partial gradient is None on a subset of samples, the method continues within the epoch by changing the batch, and the new partial gradient can possibly be used to move away from the current iterate. Hence, although convergence of the mini-batch methods requires smoothness, from the computational point of view they can work heuristically without it.

Hence, when using L-BFGS, we need to reduce non-differentiability. To this aim, we use the SiLU as activation function and select MeanPooling, which is not a common practice in CNNs for image classification. We also fully deactivate the Dropout operation and set the stride to s = 2. We also remark that we are comparing a globally convergent traditional full-batch method with eight different mini-batch methods, whose convergence proofs require strong assumptions that do not hold for the problem at hand. By studying L-BFGS performance against commonly used optimizers, we aim to assess whether theoretical convergence really plays an important role in determining the efficiency, the training performance, and, most of all, the generalization capability.

5.1 L-BFGS

Being one of the best-known first-order methods with strong convergence properties (see Liu and Nocedal (1989)), L-BFGS belongs to the class of limited-memory quasi-Newton methods. The algorithm is purely deterministic; at every iteration k, it exploits an approximation of the inverse Hessian of the objective function and performs the following update:

$$\begin{aligned} \omega _{k+1} = \omega _k - \eta _k H_k \nabla f(\omega _k), \end{aligned}$$

where \(\eta _k\) is a step size obtained via some line search method.

The updating rule for \(H_k\) has been formalized by Nocedal and Wright (1999). Given an initial approximation of the inverse Hessian \(H_0 \sim \left[ \nabla ^2 f(\omega _0)\right] ^{-1}\), the algorithm uses the rule

$$\begin{aligned} H_{k+1} = V^T_k H_k V_k + \frac{s_k s^T_k}{y_k^T s_k}, \end{aligned}$$

where

$$\begin{aligned} s_k = \omega _{k+1} - \omega _k, \qquad y_k = \nabla f(\omega _{k+1}) - \nabla f(\omega _{k}),\quad V_k = I - \frac{y_k s^T_k}{y_k^T s_k}, \end{aligned}$$

where I is the identity matrix.
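As an illustration, a minimal NumPy sketch of the inverse-Hessian recursion above is given below; practical L-BFGS implementations, including the SciPy one used in this paper, store only a limited history of pairs \((s_k, y_k)\) and never form \(H_k\) explicitly.

```python
import numpy as np

def bfgs_inverse_hessian_update(H, s, y):
    """One step of the recursion H_{k+1} = V_k^T H_k V_k + s_k s_k^T / (y_k^T s_k)."""
    rho = 1.0 / float(y @ s)
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

# Sanity check: the updated matrix satisfies the secant condition H_{k+1} y_k = s_k.
rng = np.random.default_rng(0)
H0 = np.eye(3)
s, y = rng.normal(size=3), rng.normal(size=3)
if y @ s <= 0:          # the update is well defined only under positive curvature
    y = -y
H1 = bfgs_inverse_hessian_update(H0, s, y)
print(np.allclose(H1 @ y, s))   # True
```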

L-BFGS is not among the optimizers most used in machine learning. However, owing to its strong convergence properties, this algorithm has recently gained increasing interest in the research community. Some multi-batch versions of L-BFGS have been proposed in the past years (Berahas et al. 2016; Bollapragada et al. 2018; Berahas and Takáč 2020), in particular for image processing tasks in medicine (Yun et al. 2018; Wang et al. 2019).

Since L-BFGS is not directly available in TensorFlow, we have used the SciPy version through an open-source wrapper available online.

5.2 SGD

The Stochastic Gradient Descent (SGD) method is the basic algorithm for minimizing the objective function using a direction \(d_k\) which is a random estimate of its gradient (see, for details, the comprehensive survey by Bottou et al. (2018)). In the TensorFlow implementation, the following mini-batch approximation is used:

$$\begin{aligned} g_k (\omega _k)=\displaystyle \frac{1}{|\mathcal{B}_k|} \sum _{i\in \mathcal{B}_k}\nabla f_{i}(\omega _k)\text { for some }\mathcal{B}_k\subset \{1,\dots ,P\}. \end{aligned}$$

The updating rule is given by

$$\begin{aligned} \omega _{k+1} = \omega _k - \eta _k g_k(\omega _k). \end{aligned}$$

The update step can be modified by adding a momentum term (which depends on a parameter \(\beta \)) or a Nesterov acceleration step, which is an extrapolation step along the difference between the two past iterates. TensorFlow provides a Boolean parameter, called Nesterov, which enables the Nesterov acceleration step (see Sutskever et al. (2013a)). When Nesterov=False, only a momentum term is applied and the basic SGD iteration is modified by adding

$$\beta _k (\omega _k - \omega _{k-1})$$

with momentum parameter \(\beta _k \ge 0\). When Nesterov=True, first

$$\begin{aligned} z_k = \omega _k + \beta _k (\omega _k - \omega _{k-1})\end{aligned}$$
(8)

is computed and the updating rule becomes

$$\omega _{k+1} = z_k - \eta _k g_k(z_k).$$

In both cases, the value of \(\beta \) is a hyperparameter to be tuned.
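A schematic NumPy version of the two variants described above (heavy-ball momentum and the Nesterov step of Eq. (8)) is sketched below; the gradient oracle, step size, and momentum value are placeholders.

```python
import numpy as np

def sgd_step(w, w_prev, grad_fn, eta=0.01, beta=0.9, nesterov=False):
    """One SGD update with momentum; grad_fn(w) returns a mini-batch gradient g_k(w)."""
    if nesterov:
        z = w + beta * (w - w_prev)          # extrapolation step, Eq. (8)
        w_new = z - eta * grad_fn(z)
    else:
        w_new = w - eta * grad_fn(w) + beta * (w - w_prev)   # heavy-ball momentum
    return w_new, w                          # new iterate and new "previous" iterate

# Toy usage on f(w) = 0.5 ||w||^2, whose gradient is w.
w, w_prev = np.ones(4), np.ones(4)
for _ in range(100):
    w, w_prev = sgd_step(w, w_prev, grad_fn=lambda v: v, nesterov=True)
print(np.linalg.norm(w))   # small: the iterates approach the minimizer at the origin
```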

In Bertsekas and Tsitsiklis (2000), the pure SGD has been proved to converge to a stationary point under strong assumptions. We further refer the reader to Drori and Shamir (2020) for a thorough analysis of SGD complexity and convergence rate.

5.3 Adam

Adam is one of the first SGD extensions, where the gradient estimate is enhanced with the use of exponential moving averages governed by two coefficients, \(\beta _1\) and \(\beta _2\), ranging in (0, 1). The index \(i \in \{1,2\}\) refers to the moment of the stochastic gradient, i.e., the first moment (expected value) and the second moment (non-centered variance). With \(g_k\) being the same mini-batch approximation used in SGD, we define the following first- and second-moment estimators at iteration k:

$$\begin{aligned} \mathbb {E}\left[ \nabla f(\omega _k) \right] \sim m_k = (1-\beta _1) \sum _{i=1}^k \beta _1^{k-i} g_i \end{aligned}$$
(9)
$$\begin{aligned} \mathbb {E}^2\left[ \nabla f(\omega _k) \right] \sim v_k = (1-\beta _2) \sum _{i=1}^k \beta _2^{k-i} {(g_i\odot g_i)}, \end{aligned}$$
(10)

where \(\odot \) is the Hadamard component-wise product among vectors.

Given the following matrix:

$$\begin{aligned}\widetilde{V}_k(\varepsilon ) =\frac{1}{1-\beta _2^k} \left[ I \varepsilon + \text{ diag }({v}_k)\right] ^{\frac{1}{2}},\end{aligned}$$

where \(\text{ diag }(v)\) denotes the diagonal \(h \times h\) matrix with elements \(v_i\) on the diagonal and \(\varepsilon >0\), the updating rule is the following:

$$\begin{aligned} \omega _{k+1} = \omega _{k} - \eta _k \widetilde{V}_k(\varepsilon )^{-1} \frac{1}{1-\beta _1^k} m_k, \end{aligned}$$

where \(m_k\) is given in (9). It has been recently proved in Défossez et al. (2020) that Adam can converge under a smoothness assumption and boundedness of the gradients in the \(L_{\infty }\) norm, with convergence rate \(O(\frac{h\log (N)}{N})\), where h is the number of variables and N the number of iterations. For a more detailed discussion of Adam's complexity (as well as that of the other adaptive gradient methods Adamax and Nadam), we refer the reader to Zhou et al. (2018).
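A NumPy sketch of the Adam recursion is given below, written in the usual running-average form of Eqs. (9)–(10) with the bias corrections \(1-\beta _1^k\) and \(1-\beta _2^k\); as in the TensorFlow implementation, \(\varepsilon \) is added after the square root, a minor difference from the matrix notation above. The coefficient values are the common defaults and are assumptions here.

```python
import numpy as np

def adam_step(w, g, m, v, k, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update; g is the mini-batch gradient at iteration k (k starts at 1)."""
    m = beta1 * m + (1 - beta1) * g                 # first-moment estimate, Eq. (9)
    v = beta2 * v + (1 - beta2) * g * g             # second-moment estimate, Eq. (10)
    m_hat = m / (1 - beta1 ** k)                    # bias corrections
    v_hat = v / (1 - beta2 ** k)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on f(w) = 0.5 ||w||^2, whose gradient is w.
w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for k in range(1, 2001):
    w, m, v = adam_step(w, w, m, v, k)
print(np.linalg.norm(w))   # close to zero
```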

5.4 Adamax

Adamax performs essentially the same operations as Adam, but it does not make use of the parameter \(\varepsilon \), and it exploits the infinity norm to average the gradient. Let

$$u_k = \max _{i=1,\dots ,k} \beta _2^{k-i}|g_i|\qquad U_k=\text{ diag }(u_k)$$

and \(m_k\) as in (9). Thus, the updating rule is

$$\begin{aligned} \omega _{k+1} = \omega _{k} - \eta _k U_k^{-1}\frac{1}{(1 - \beta _1^k)}m_k. \end{aligned}$$

5.5 Nadam

Nadam, also known as Nesterov–Adam, performs the same updating rule as Adam but employs the Nesterov acceleration step Eq. (8). Nadam is expected to be more efficient, but the Nesterov trick involves only the order in which operations are carried out and not the updating formula.

5.6 Adagrad

Adagrad is one of the first adaptive-learning-rate extensions of SGD, designed to discriminate between more informative and rare features. The general update rule for \(\omega _k \in \mathbb {R}^h\) involves matrix operations, for which we need to introduce some further notation. At iteration k, we introduce the cumulative vector

$$G_k = \displaystyle \sum _{t=0}^{k-1} \left( g_t \odot g_t \right) $$

where \(g_t=\sum _{i\in \mathcal{B}_t} \nabla f_i(\omega _t) \). Given \(\varepsilon > 0\) and the identity matrix I, we define the following matrix:

$$\begin{aligned} H_k(\varepsilon ) = \left[ I \varepsilon + \text{ diag }(G_k) \right] ^{\frac{1}{2}}, \end{aligned}$$

where \(\text{ diag }(v)\), with \(v\in \mathbb {R}^h\), denotes the diagonal \({h\times h}\) matrix with elements \(v_i\) on the diagonal. Thus, the updating rule, resulting from the minimization of a specific proximal function (see Duchi et al. (2011)), is the following:

$$\omega _{k+1} = \omega _k - \eta H_k (\varepsilon )^{-1} g_k . $$

Convergence results concerning Adagrad, as well as its modification Adadelta, have been reported in (Li and Orabona 2019; Défossez et al. 2020; Chen et al. 2018) and they still require strong assumptions.
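A component-wise NumPy sketch of the Adagrad rule above is given below; the values of \(\eta \) and \(\varepsilon \) are illustrative.

```python
import numpy as np

def adagrad_step(w, g, G, eta=0.001, eps=1e-7):
    """One Adagrad update: accumulate squared gradients, then rescale component-wise,
    i.e. w <- w - eta * g / sqrt(eps + G_k), matching H_k(eps)^{-1} g_k above."""
    G = G + g * g                          # cumulative vector G_k
    w = w - eta * g / np.sqrt(eps + G)
    return w, G

# Toy usage on f(w) = 0.5 ||w||^2, whose gradient is w.
w, G = np.ones(4), np.zeros(4)
for _ in range(500):
    w, G = adagrad_step(w, w, G, eta=0.1)
print(np.linalg.norm(w))   # shrinks, ever more slowly as the accumulator G grows
```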

5.7 RMSProp

Proposed by Hinton et al. in the unpublished lecture notes Hinton et al. (2012), RMSProp (Root Mean Square Propagation) performs operations similar to Adagrad, but the update rule is modified to slow down the decrease of the learning rate.

In particular, following the notation introduced in the previous subsections, at iteration k, the following matrix is used:

$$\begin{aligned} {V}_k(\varepsilon ) = \left[ I \varepsilon + \text{ diag }({v}_k)\right] ^{\frac{1}{2}} \end{aligned}$$

where \(v_k\) is given by (10). The update rule is:

$$\begin{aligned} \omega _{k+1} = \omega _k - \eta {V_k (\varepsilon )^{-1}} g_k. \end{aligned}$$

Despite being still unpublished, RMSProp has already been studied from a convergence theory perspective in (Défossez et al. 2020; De et al. 2018), with an eye to the non-convex ML context.

5.8 Adadelta

Adadelta (Zeiler 2012) can be considered as an extension of Adagrad, which allows for a less rapid decrease in learning rate. Let us consider the same matrix

$$\begin{aligned} {V}_k(\varepsilon ) = \left[ I \varepsilon + \text{ diag }({v}_k)\right] ^{\frac{1}{2}} \end{aligned}$$

used in RMSProp. Further, let \(\delta _{k}=\omega _{k+1}-\omega _{k}\) and

$${\widetilde{\delta }}_{k} = \displaystyle \sum _{t=1}^{k} \rho ^{k-t} (1 - \rho ) \left( \delta _t \odot \delta _t \right) \qquad \widetilde{\Delta }_k(\varepsilon ) = \left[ I \varepsilon + \text{ diag }({\widetilde{\delta }}_k)\right] ^{\frac{1}{2}}. $$

Thus, we can write the Adadelta updating rule as follows:

$$ \omega _{k+1} = \omega _k - V_k(\varepsilon )^{-1} \widetilde{\Delta }_{k-1}(\varepsilon ) g_k. $$

5.9 FTRL

FTRL (Follow The Regularized Leader), as implemented in TensorFlow following McMahan et al. (2013), is a regularized version of SGD, which uses the L1 norm in the update of the variables. Given, at every iteration k, \(d_k = \sum _{t=1}^k g_t\) and quantities \(\sigma _t\) such that \(\sum _{t=0}^k \sigma _{t} = \frac{1}{\eta _k}\), the update rule is the following:

$$ \omega _{k+1} = \arg \min _{\omega } \left\{ d_k^T \omega + \sum _{t=1}^k \sigma _t ||\omega - \omega _t||^2 + \lambda ||\omega ||_1\right\} . $$

As proved in McMahan et al. (2013), the minimization problem in the update rule can be solved in closed form, setting

$$\begin{aligned} \omega _{k+1,i} = {\left\{ \begin{array}{ll} 0 \ \ \ \text {if } \vert z_{k,i}\vert \le \lambda \\ -\eta _k (z_{k,i} - \texttt {sgn} (z_{k,i}) \lambda ) \text { otherwise}, \end{array}\right. } \end{aligned}$$
(11)

where \(\displaystyle z_{k,i} = d_{k,i} - \sum _{s=1}^k \sigma _s \omega _{s,i}\) and \(\texttt {sgn}(\cdot )\) is the signum function.

FTRL convergence can be proved only in the convex case, as explained in detail in McMahan (2011).
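A per-coordinate NumPy sketch of the closed-form update in Eq. (11) is given below; it shows how the rule acts as a soft-thresholding operator, producing exact zeros (hence sparsity) in the weight vector.

```python
import numpy as np

def ftrl_closed_form(z, eta, lam):
    """Component-wise closed-form FTRL update of Eq. (11): coordinates with
    |z_i| <= lambda are set exactly to zero, the others are shrunk by lambda."""
    return np.where(np.abs(z) <= lam, 0.0, -eta * (z - np.sign(z) * lam))

# The update behaves like a soft-thresholding operator on z_k.
z = np.array([0.3, -0.2, 1.5, -2.0])
print(ftrl_closed_form(z, eta=0.1, lam=0.5))   # [ 0.    0.   -0.1   0.15]
```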

6 The datasets

We have carried out our computational test using three datasets: UC Merced (Yang and Newsam 2010), CIFAR10, and CIFAR100 (Krizhevsky et al. 2009). UC Merced represents the benchmark dataset used to define the Baseline problem, namely the training of the Baseline network defined in Sect. 3. We use the Baseline problem (defined as the pair Baseline network—UC Merced) to assess the performance of the different optimization methods.

UC Merced is a balanced dataset that comprises a total of 2100 land samples divided into 21 classes, i.e., 100 images per class. The dataset images have a resolution of \(256\times 256\) pixels. The high number of classes and the limited number of samples for each class make the multi-class classification a non-trivial task.

To assess whether the computational results obtained on the Baseline problem generalize to different datasets, we have also carried out additional tests on two larger datasets: CIFAR10 and CIFAR100 (Krizhevsky et al. 2009), respectively, with \(N=\)10 and \(N=\)100 classes, both containing 60000 samples at a resolution of \(H\times W= 32 \times 32\) pixels.

Furthermore, for mini-batch methods, we also apply data augmentation, which is a commonly used technique in machine learning for image classification. It consists of random transformations of the selected mini-batch of samples with the aim of increasing the training dataset diversity and achieving better generalization capabilities. For a better understanding of this technique, we refer the reader to (Van Dyk and Meng 2001; Connor and Khoshgoftaar 2019); in our case, data augmentation involves random transformations of the selected images, such as rotation, scaling, adding noise, and changing brightness and contrast.
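As an illustration, a Keras preprocessing pipeline of the kind commonly used for this purpose is sketched below; the specific layers and parameter ranges shown are illustrative assumptions, not necessarily the exact transformations applied in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Random, label-preserving transformations applied on the fly to each mini-batch.
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),          # up to +/- 10% of a full turn
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.2),
])

# Typical usage inside a tf.data pipeline (train_ds yields (image, label) batches):
# train_ds = train_ds.map(lambda x, y: (augmentation(x, training=True), y))
```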

7 Implementation details

We implemented the proposed study using the TensorFlow 2 deep learning high-level API and its implementation of the Categorical Cross Entropy (CCE). We set the environment seed (also for the normal initializer of the convolutional kernels) to a randomly chosen value equal to 1699806, or to a specific list of values in the multistart analysis. Computational tests have been conducted using a mini-batch size \(bs = 128\), except for L-BFGS, which is a batch method, i.e., it requires the whole gradient at each iteration. Concerning the eight built-in optimizers, in Sect. 8.1 the network was trained setting the number of epochs over the whole dataset to 100. We remark that a single epoch consists of \(\frac{P}{bs}\) update steps, where P is the number of samples in the dataset. In the experiments with tuned hyperparameters in Sect. 8.2, we halved the number of epochs. In other experiments with larger problems, the number of epochs was further reduced to 30. We underline that reducing the number of epochs is a common heuristic procedure in deep learning (Diaz et al. 2017; Yu and Zhu 2020), where at an early testing phase the number of epochs is set to an arbitrary value (in our case 100) and then reduced according to the training loss decrease, so that the network is not trained when the loss has already reached values close to zero and is not further improving. This prevents any waste of computational time that could result from training the network when the loss is already extremely close to zero.

Finally, we underline that the TensorFlow implementation of the eight built-in optimizers, as well as the SciPy version of L-BFGS, uses the back-propagation algorithm to compute the gradients.

Training has been run on a 12GB NVIDIA GTX TITAN V GPU. The L-BFGS algorithm, being a full-batch method, cannot be run on a GPU due to the lack of memory storage, and it takes almost 30 s on our reference Intel i9-10900X CPU to execute an entire step, i.e., a batch containing all the training samples.

8 Computational results

We present in this section our computational experiments, divided into three blocks. In Sect. 8.1, we explain how we have tested L-BFGS and the eight optimizers briefly described in Sect. 5 on the baseline problem, i.e., training the Baseline architecture on the UC Merced dataset using the default setting of the hyperparameters. We have also carried out a multistart test on the three worst-performing algorithms (Adadelta, Adagrad, and FTRL) to assess whether their poor performance was caused only by an unfortunate weight initialization or by the inherent behavior of these optimizers on the dataset. We discuss the correlation between the test accuracy and the precision with which the problem in Eq. (1) is solved, the loss profiles produced by the algorithms, as well as the role of the data augmentation technique. Our further analysis in the following sections is focused only on five of these optimizers since, as shown in Sect. 8.1, they achieved the highest prediction accuracy. In Sect. 8.2, we describe the grid search we have carried out on the baseline network on the UC Merced dataset to tune the optimizers' hyperparameters. We show how a careful tuning aimed at finding a nearly optimal hyperparameter setting can result in significant improvements in terms of test accuracy. In Sect. 8.3, we discuss the results obtained after modifying the network architecture with respect to the baseline, investigating in particular the robustness of the hyperparameters to the increase in depth and width.

Table 2 Number of trainable parameters (variables) of the different neural architectures; they depend on H, W, and N. The values are reported in millions [M]

Finally, in Sect. 8.4, we carry out tests on the two other image classification datasets, CIFAR10 and CIFAR100 with \(H=W=32\) pixels and \(N=10, 100\), respectively. For the sake of completeness, we report in Table 2 the number of trainable parameters for all the architecture-dataset pairs.

8.1 The baseline problem with default hyperparameters

The first tests we have carried out are aimed at studying the optimizers’ performances both from an optimization perspective (i.e., the value of the final loss) and a machine learning perspective (i.e., the test accuracy). Hyperparameters have been set to their default values (Table 3), taken from the TensorFlow documentation.

Table 3 Default and tuned values of the TensorFlow built-in optimizer hyperparameters. The tuned values are reported in brackets when different from the default ones. The grid search has not been performed for the worst-performing algorithms: Adadelta, Adagrad, and FTRL

For each algorithm, we report in Fig. 2 the behavior of the training losses without and with data augmentation, and in Table 4 the test accuracy.

We noticed that, while most of the algorithms converge to points in a neighborhood of the globally optimal solution, i.e., the training loss is close to zero, Adadelta, Adagrad, and FTRL get stuck in some local minima, as can be seen in Fig. 2. This results in quite poor accuracy for Adadelta, Adagrad, and FTRL, as reported in the first column of Table 4. We highlight that FTRL is an extreme case, the descent being so slow that the loss profile looks like a horizontal line. This is not surprising because, as explained in McMahan et al. (2013), FTRL has been designed to deal with extremely sparse datasets, which is not the case for colored images. Furthermore, FTRL convergence requires a very strong convexity condition (McMahan 2011), making it impossible to predict its behavior in such a non-convex context.

The observed behavior seems to confirm what has already been pointed out in (Swirszcz et al. 2016; Yun et al. 2018): neural networks can be affected by the local minima issue, which has a direct influence on the performance metrics. Getting stuck in bad local minima often also implies an accuracy level that makes the entire network useless for the classification task. In the case of FTRL, the test accuracy is so low that the network selects the predicted class randomly.

Furthermore, data augmentation has no substantial effect in modifying the convergence endpoint. Indeed, Adadelta, Adagrad, and FTRL consistently return a bad point, while SGD, Adam, Adamax, Nadam, and RMSProp always converge to good solutions, leading to similar values of accuracy, as we can see again in Table 4. Nonetheless, data augmentation has a boosting effect on test accuracy for all five working algorithms. This improvement is obtained because data augmentation artificially increases the diversity and the quantity of the training data and, thus, enhances the network's generalization capability. Nonetheless, data augmentation also makes the task harder, and thus the training loss decrease is slightly slower, i.e., the network needs more time to learn.

Fig. 2

Training loss for the baseline problem with default hyperparameters setting: (a) without data augmentation and (b) with data augmentation. A color is assigned to each algorithm according to the legend

To investigate the behavior of Adadelta, Adagrad, and FTRL and to assess the stability of their bad performance, we have carried out another test employing a multistart procedure. To this aim, we have initialized the weights using two different distributions, Glorot Uniform (GU) (Glorot and Bengio 2010) and Lecun Normal (LN) (LeCun et al. 1989), and 16 different seed values, i.e., starting from 16 different initial points for each initialization, that is, from 32 different points in total. The three algorithms always get stuck in a point whose training loss value is quite far from zero with respect to the other algorithms. This behavior is very stable and does not change with the initialization seed.

The best accuracy values, not reported in a table for the sake of brevity, are always \(4.6\%\) for FTRL, \(18.2\%\) for Adadelta, and \(32.7\%\) for Adagrad. These results suggest that the bad behavior of Adadelta, Adagrad, and FTRL is not just caused by an unfortunate initial point. These algorithms seem to converge to points that are not good for our classification task. Hence, we have discarded Adadelta, Adagrad, and FTRL from the testing phases reported in the next sections.
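For reference, the multistart loop can be sketched as follows; build_and_train is a hypothetical helper that builds the Baseline network with the given kernel initializer, compiles it with the optimizer under study, trains it, and returns the final training loss and test accuracy (tf.keras.utils.set_random_seed assumes TensorFlow >= 2.7; the seed values shown are illustrative).

```python
import tensorflow as tf

seeds = [11, 23, 42, 1234, 1699806]          # illustrative seed values
initializers = {
    "GU": tf.keras.initializers.GlorotUniform,
    "LN": tf.keras.initializers.LecunNormal,
}

results = {}
for name, init_cls in initializers.items():
    for seed in seeds:
        tf.keras.utils.set_random_seed(seed)     # seeds Python, NumPy, and TF
        init = init_cls(seed=seed)
        # build_and_train is a hypothetical wrapper around model construction,
        # compilation with the chosen optimizer, and training.
        loss, acc = build_and_train(kernel_initializer=init)
        results[(name, seed)] = (loss, acc)
```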

Table 4 Test accuracy in % obtained with default values and with tuned values (± % increase) of the optimization algorithms, without data augmentation (W/out DA) and with data augmentation (with DA)

Concerning L-BFGS, we use the original dataset without data augmentation (which is specific to mini-batch methods). We have first trained the baseline problem using the ReLU as activation function, as well as the MaxPooling of Eq. (4) and the Dropout operations. Since the final points returned by L-BFGS are influenced by the starting point (Liu and Nocedal 1989), we ran the algorithm with different initialization seeds. In particular, we have used again the Glorot Uniform and Lecun Normal distributions and, due to the heavy computational effort, only 5 different seed values for each initialization. The training loss profile of this first set of experiments is reported in Fig. 3a. We observe that the algorithm always fails before achieving convergence: the lines in Fig. 3a stop because the returned loss was infinite at a given iteration. We argue that this is caused by a non-differentiability issue. Indeed, as already discussed, L-BFGS convergence is guaranteed exclusively when the objective function is continuously differentiable (Liu and Nocedal 1989), and the ReLU, as well as the MaxPooling operation of Eq. (4), causes the occurrence of non-differentiable points, i.e., points where the gradient is not defined. Although this could in principle happen with any other algorithm, since L-BFGS is a full-batch method, once a non-differentiable point is reached, the algorithm gets stuck.

Fig. 3

L-BFGS training loss with 5 different initialization seeds as reported in the legend, using the Lecun Normal (LN) and Glorot Uniform (GU) initialization procedures

Hence, we have also trained the baseline network in a more differentiable setting, namely using the SiLU activation function, the MeanPooling and disabling the Dropout operation. As we show in Fig. 3b, results significantly improved with almost all the initialization seeds. However, the loss does not always tend to zero and L-BFGS generally converges to points with a far worse loss function value than Adam, Adamax, Nadam, RMSProp, and SGD.

This difference can also be seen in terms of test accuracy. Indeed, when using the MaxPooling layers and the ReLU activation function, L-BFGS performs quite poorly in terms of test accuracy, reaching a maximum value of \(15.4 \%\).

When using the “differentiable” setting, we obtain better results reported in Table 5. We report the average over the 5 runs of the final training loss values and of the test accuracies.

We observe that even the most unlucky initialization, which is GU 10, results in a test accuracy of 15.6%, which is better than the highest one obtained with ReLU and MaxPooling. However, L-BFGS is much more sensitive to the initialization seed than the other built-in methods, confirming claim (ii). The final training loss, as well as the test accuracy, is not stable and may vary over a wide range of values. Although this computational result could call into question the practical effectiveness of traditional batch methods in deep learning, it also confirms our claim (i): the quality of local minima matters. Indeed, looking at Table 5, we observe a relation between the final loss value and the test accuracy: a lower final loss value usually corresponds to a higher test accuracy. In general, the accuracy achieved is not satisfactory when compared to mini-batch methods, and the training loss decrease is unstable and highly influenced by the starting point. Finally, we also remark that L-BFGS is significantly less efficient than the other built-in algorithms. Indeed, it is practically impossible to run it on a standard GPU, because one needs enough memory storage to access the entire dataset in one single step, which is possible only on CPU, and this results in slower training.

Table 5 Average value of the training loss (Avg Train value) and Average Test accuracy in % (Avg Test acc) obtained on the baseline problem with L-BFGS in the “differentiable” setting on 10 multistart runs using 5 different seeds for the two different distributions GU and LN: different training loss results in different test performances

8.2 Impact of tuning on the Baseline problem

In this section, we tune the hyperparameters of the optimization algorithms on the baseline problem, namely on the baseline architecture and the UC Merced dataset, to assess their role in the computational efficiency and, in turn, in the final test accuracy. As mentioned in the introduction, default values for hyperparameters are often obtained by maximizing the aggregated (in most cases, the average) performance across a variety of different tasks, balancing a trade-off between efficiency and adaptability to different datasets (Probst et al. 2019; Yang and Shami 2020; Bischl et al. 2023). We aim here to assess whether a specific tuning on the classification task has an influence on the algorithms' behavior. Since this turns out to be the case, in the next sections we analyze the impact of the tuning obtained on the baseline problem on other settings (architecture and/or dataset).

Table 6 Range of the grid search for the different hyperparameters. For the sake of brevity, we have omitted the Boolean hyperparameters (Amsgrad for Adam, Centered for RMSProp, and Nesterov for SGD)

We discard Adadelta, Adagrad, FTRL, and L-BFGS from further analysis due to their extremely poor performance on the baseline problem. Thus, we have carried out a grid search to tune the hyperparameters of Adam, Adamax, Nadam, RMSProp, and SGD on the baseline problem.

The grid search ranges are reported in Table 6. Concerning numerical hyperparameters, we have chosen ranges centered on the default values, resulting in almost 200 possible combinations for each algorithm. For computational reasons, we performed neither a k-fold cross-validation nor a multistart procedure, in line with the common single-run practice in computer vision (see, e.g., Gärtner et al. (2023)). Indeed, each run (including loading the dataset and the network to the GPU and the effective computational time) takes approximately 9 min (around 2 s per epoch plus the set-up time). Thus, performing the grid search over about 200 hyperparameter settings takes 1.25 days on a fully dedicated 12GB NVIDIA GTX TITAN V GPU for each of the five algorithms, so that the complete grid search over all of them requires nearly 6.25 days. Therefore, performing, e.g., a k-fold cross-validation would require \(6.25 \cdot k\) days on a fully dedicated machine, a prohibitive amount of time for standard values of k (5 or 10). Similar observations hold for a multistart procedure.
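
As an illustration of the procedure, the sketch below shows the kind of grid search we ran for a single algorithm (here Adam). The hyperparameter values shown are placeholders, not the actual grids of Table 6, and build_baseline_model, train_ds, test_ds, and EPOCHS are assumed helpers/constants.

```python
import itertools
import tensorflow as tf

# Placeholder grid centred on the Keras defaults for Adam (actual ranges: Table 6).
grid = {
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "beta_1":        [0.85, 0.9, 0.95],
    "beta_2":        [0.99, 0.999, 0.9999],
    "epsilon":       [1e-8, 1e-7, 1e-6],
}

best_acc, best_cfg = -1.0, None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    model = build_baseline_model()                   # assumed helper returning the baseline CNN
    model.compile(optimizer=tf.keras.optimizers.Adam(**cfg),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_ds, epochs=EPOCHS, verbose=0)    # single run: no cross-validation/multistart
    _, acc = model.evaluate(test_ds, verbose=0)
    if acc > best_acc:                               # selection by best test accuracy
        best_acc, best_cfg = acc, cfg
```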

The tuned values of the hyperparameters are selected according to the best test accuracy obtained and are reported in brackets in Table 3 whenever they differ from the default ones. Once the hyperparameters were tuned to the new values, we used them on the Baseline problem, halving the number of epochs.

Fig. 4

Training loss with tuned hyperparameter values for the baseline problem: (a) without data augmentation and (b) with data augmentation

We report in Fig. 4a, b the training loss profiles for the two settings without and with data augmentation. Comparing them with the corresponding training losses obtained with default values in Fig. 2a, b, we can state that tuning does not directly influence the final value of the objective function returned by the algorithms, which was already the globally optimal value near zero. However, the loss decrease is faster and nearly optimal values (near zero) are reached earlier, yielding good results despite having halved the number of epochs. In particular, in Fig. 4a the loss is almost zero already after 15–20 epochs, while in Fig. 2a this happens after 25–30 epochs. Considering the case with data augmentation, in Fig. 4b the loss is almost zero after 40 epochs, while in Fig. 2b the same is true only after approximately 60 epochs.

Fig. 5

Training Loss trends of the 5 algorithms for UC Merced on the synthetic architectures

However, our computational experience shows that the most relevant benefit of hyperparameter tuning is the gain in terms of test accuracy. In Table 4, we report in square brackets the test accuracy changes for the five optimizers, with and without data augmentation. The change is always positive except for RMSProp with data augmentation. We also observe that Adam, SGD, and Nadam show a larger improvement than Adamax and RMSProp, both without and with data augmentation.

8.3 Impact of tuning when changing the architectures

In this section, we investigate the impact of tuned versus default hyperparameters when changing the network architecture, while keeping the UC Merced dataset fixed.

In particular, we are interested in assessing how the optimizers react to the increase in depth and width, using the three synthetic configurations (Wide, Deep, Deep&Wide) described in Sect. 3, as well as in determining the impact of tuning the hyperparameters on state-of-the-art architectures such as Resnet50 and Mobilenetv2. In this experiment, we perform a single run for each algorithm starting from the same initial point, fixing the distribution and the random seed to LN 1699806, and using data augmentation, which gave better results in the previous experiments.
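
A minimal sketch of how such a run could be set up in Keras is given below; the augmentation layers and their parameters are illustrative choices, while the seed and the LecunNormal initializer follow the setting described above.

```python
import tensorflow as tf

SEED = 1699806
tf.random.set_seed(SEED)                                      # fix the framework seed

initializer = tf.keras.initializers.LecunNormal(seed=SEED)    # LN initialization

# Illustrative data-augmentation pipeline applied to the input images.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal", seed=SEED),
    tf.keras.layers.RandomRotation(0.1, seed=SEED),
])
```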

The results for the synthetic architectures are reported in Fig. 5 (training results) and in Table 7 (test accuracy), whereas the results for the state-of-the-art architectures are in Fig. 6 (training) and Table 8 (test).

In terms of training loss, on the synthetic networks the tuned version seems to reach, on average, slightly smaller values, whereas on the state-of-the-art architectures there are no noticeable differences in the final value reached. Thus, tuning the hyperparameters does not significantly improve the decrease rate.

Table 7 Test accuracy in % obtained by training the synthetic network architectures using the Default and Tuned hyperparameter values

As regards the test accuracy on the synthetic networks, in Table 7 we report, for each algorithm, the % accuracy obtained on the UC Merced dataset for the four different architectures, together with the average % over the architectures (column Avg ARCH).

From Table 7, it seems that SGD and Adamax benefit from the tuned setting on the synthetic architectures, significantly improving the average % accuracy (Avg ARCH), whereas Adam and Nadam are slightly worse on average. RMSProp deteriorates significantly, but we remark that it was also the only optimizer whose performance worsened on the Baseline problem with data augmentation (see Table 4).

The accuracies on Resnet50 and Mobilenetv2 are reported in % in Table 8. On these architectures, the tuned configuration does not perform uniformly better. However, we observe an improvement when using SGD on Mobilenetv2, whose architecture is more similar to the Baseline one. Thus, we can conclude that when the architecture is significantly different from the Baseline used for tuning the hyperparameters, the advantages are limited.

8.4 Impact of tuning when changing the datasets

In this final set of experiments, we aim to assess the role of tuning when training all the architectures on the two additional datasets CIFAR10 and CIFAR100, described in Sect. 6.

Fig. 6

Training Loss trends of the 5 algorithms for UC Merced on Resnet50 and Mobilenetv2

The training losses of the default and tuned versions of the five algorithms on CIFAR10 are reported in Fig. 7 for the synthetic architectures and in Fig. 9 for the state-of-the-art architectures. The same results on CIFAR100 are shown in Figs. 8 and 10. The test accuracies, expressed as percentages, are reported in Tables 7 and 8. For each algorithm in the two settings, Default and Tuned, we report a final row with the average over the datasets when the architecture is fixed (Avg DATA) and a final column with the average over the architectures when the dataset is fixed (Avg ARCH). The entries in the Avg ARCH columns and in the Avg DATA rows are in bold to highlight which of the default and tuned configurations wins.

Examining the training losses for both CIFAR10 and CIFAR100, we do not observe significant differences in the training loss profiles between the default and tuned configurations for most architectures. Thus, one might conclude that the optimal hyperparameters setting found in Sect. 8.2 on the Baseline network is not robust enough. However, when we consider the generalization performance, measured by the final test accuracy, the situation appears different.

Indeed, looking at Table 7 and Table 8, we observe that SGD shows a strong advantage with the tuned configuration, while the average accuracy values for other optimizers are often very close to each other. SGD is the only non-adaptive optimizer, meaning that the learning rate is not adjusted during the training. We argue that this makes SGD much more sensitive to the hyperparameters setting than other adaptive algorithms. Nonetheless, even on Adam and Nadam, the tuned configuration achieves slightly better test accuracy. A remarkable exception to this pattern is Resnet50 in Table 8, where the default configuration significantly outperforms the tuned one. This result seems to suggest that our hyperparameter configuration found on the baseline is not robust to more radical architectural changes, like in Resnet50, where residual connections (see Sect. 3) are added to each layer to prevent the vanishing gradient effect.

Table 8 Test accuracy in % obtained by training the Resnet50 and Mobilenetv2 using the Default and Tuned hyperparameter values. For each algorithm in the two settings, we report the average over the datasets when the architecture is fixed (Avg DATA) and the average over the architecture when the Dataset is fixed (Avg ARCH)
Fig. 7

Training Loss trends of the 5 algorithms for CIFAR10 on the synthetic architectures

Fig. 8

Training Loss trends of the 5 algorithms for CIFAR100 on the synthetic architectures

Fig. 9

Training Loss trends of the 5 algorithms for CIFAR10 on Resnet50 and Mobilenetv2

Fig. 10

Training Loss trends of the 5 algorithms for CIFAR100 on Resnet50 and Mobilenetv2

Looking at Tables 7 and 8, we observe that for some dataset/architecture/algorithm combinations the differences in test accuracy are too slight to claim which configuration performs best. Although in deep learning it is usual to perform a single-run test (see, e.g., Gärtner et al. (2023)), in order to assess the potential advantage of the tuned configuration over the default one we decided to carry out an additional multistart computational test on a reduced number of combinations which present, in our opinion, the most inconclusive results. To this aim, we identified in Tables 7 and 8 the dataset/architecture/algorithm combinations with the smallest accuracy differences between the tuned and the default configuration. We selected the six architectures (i.e., the four synthetic networks, Resnet50, and Mobilenetv2), the algorithms Adam, Adagrad, and Nadam on the CIFAR100 dataset, and Adamax on both CIFAR10 and CIFAR100, for a total of 30 dataset/architecture/algorithm combinations, on each of which we performed 5 runs with different seeds (0, 1000, 10000, 150000, 1698064). To reduce the computational burden of this test, we halved the number of epochs of each training compared to the original value (see Sect. 7). As a result, the test accuracy values are lower than the ones in Tables 7 and 8.
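
The protocol of this additional test can be summarized by the following sketch, where train_and_evaluate is an assumed helper wrapping model construction, training with the halved epoch budget, and evaluation on the test set:

```python
import numpy as np

SEEDS = [0, 1000, 10000, 150000, 1698064]   # seeds of the five multistart runs

def multistart_accuracy(dataset, architecture, algorithm, config):
    """Average test accuracy and standard deviation over the five seeds."""
    accs = [train_and_evaluate(dataset, architecture, algorithm, config, seed=s)
            for s in SEEDS]
    return float(np.mean(accs)), float(np.std(accs))
```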

The results are given in Table 9 where, for each dataset/architecture/algorithm combination, we report the average test accuracy and the standard deviation (in brackets) over the five multistart runs. We can conclude that the tuned configuration clearly outperforms the default one on all the synthetic networks. The differences are generally significant, particularly when considering the standard deviation. Given that we halved the number of epochs, this result suggests that, even when the difference in accuracy between the default and tuned configurations is slight, the tuned configuration tends to achieve better generalization performance more quickly. However, the advantage of the tuned configuration is smaller when the architecture changes significantly: Resnet50 and Mobilenetv2 do not show the same trend in all experiments, and it is not possible to draw an equally definitive conclusion. We believe that this is due to the structural difference of the residual networks with respect to the synthetic networks.

Table 9 Average test accuracy and its standard deviation over five multistart runs for each algorithm–dataset–architecture combination used in this test

8.5 Collective impact of tuning: performance profiles

In this section, we consider a collective representation of the training results with the aim of assessing the impact of the tuned versus the default setting in reaching the global solution. Following the underlying idea of the benchmarking method proposed in Dolan and Moré (2002) and Moré and Wild (2009), we consider a variant of the performance profiles as an additional tool of comparison between the five algorithms: Adam, Adamax, Nadam, RMSProp, and SGD.

Following (Moré and Wild 2009), given the set of problems \(\mathcal {P}\) and the set of solvers \(\mathcal {S}\), a problem \(p\in \mathcal {P}\) is solved by a solver \(s\in \mathcal {S}\) with precision \(\tau \) if

$$f(\omega _{p,s}) \le f^p_L + \tau (f(\omega ^0_p) - f^p_L),$$

where \(\omega^0_p\) is the starting point of problem \(p \in \mathcal{P}\) (the same for all solvers), \(f(\omega_{p,s})\) is the final value of the objective function in (1) after the training process, and \(f^p_L = \min_{s \in \mathcal{S}} f(\omega_{p,s})\). In our case, \(\mathcal{P}\) consists of the 18 different versions of problem (1) corresponding to all the possible combinations of the 6 network architectures (Baseline, Deep, Wide, Deep&Wide, Resnet50, Mobilenetv2) with the three datasets. Further, in our set of experiments, we have fixed the number of epochs, i.e., the computational time. Hence, differently from Moré and Wild (2009), we are interested in checking how many problems are solved to a given accuracy \(\tau\). Thus, we introduce the success rate performance profile \(\sigma_s(\tau)\) of a solver s, plotted in Fig. 11, as:

$$ \sigma _s ( \tau ) = \frac{1}{\vert \mathcal {P} \vert } \vert \{ p \in P: f(\omega _{p,s}) \le f^p_L + \tau (f(\omega ^0_p) - f^p_L)\} \vert .$$

In Fig. 11, we plot \(\sigma_s(\tau)\) for \(\tau \in [10^{-4},1]\); the higher the curve on the left, the better. Looking at the performance profiles in Fig. 11, we confirm that it is not possible to state the superiority of the tuned versions in terms of training performance. However, the tuned versions of the algorithms differ less from each other, i.e., they are more stable.
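
For completeness, below is a minimal sketch of how the success-rate profile \(\sigma_s(\tau)\) can be computed from the training logs; final_loss[p][s] and start_loss[p] are assumed dictionaries holding \(f(\omega_{p,s})\) and \(f(\omega^0_p)\), respectively.

```python
import numpy as np

def success_rate_profile(solver, taus, problems, solvers, final_loss, start_loss):
    """Fraction of problems solved by `solver` within precision tau, for each tau."""
    rates = []
    for tau in taus:
        solved = 0
        for p in problems:
            f_L = min(final_loss[p][s] for s in solvers)       # best final loss over solvers
            threshold = f_L + tau * (start_loss[p] - f_L)      # More-Wild convergence test
            if final_loss[p][solver] <= threshold:
                solved += 1
        rates.append(solved / len(problems))
    return rates

taus = np.logspace(-4, 0, num=50)   # tau in [1e-4, 1], as in Fig. 11
```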

8.6 Data availability

All the data we have presented in this section are fully reproducible from the source code, which is available on the public Github repository at https://github.com/lorenzopapa5/Computational_Issues_in_Optimization_for_Deep_networks.

Fig. 11

Success rate performance profiles \(\sigma_s(\tau)\) of the five algorithms s on the 18 problems arising from considering all the 6 architectures over the three datasets: (a) with default and (b) with tuned hyperparameter settings

9 Conclusions

In this paper, nine open-source optimization algorithms have been extensively tested in training a deep CNN on a multi-class classification task. Computational experience shows that not all the algorithms reach a neighborhood of a global solution, and some of them get stuck in local minima, independently of the choice of the starting point. Algorithms reaching a local non-global solution have test performances, i.e., accuracy on the test set, far below the minimal threshold required for such a task. This result confirms the initial claim that reaching a neighborhood of a global optimum is extremely important for generalization performance.

A fine grid search on the optimization hyperparameters leads to hyperparameter choices that give remarkable improvements in test accuracy when the network structure and the dataset do not change. Thus, using a default setting might not be the best choice.

Finally, the tests on different architectures and datasets suggest that when the architectural changes are not too radical, it may be more effective to use the tuned configuration rather than the default one. We believe that this finding can have a significant impact, especially for ML practitioners tasked with training similar models on different datasets within the same problem class, such as image classification, which is common in real-world applications. Conducting a grid search on a representative problem within a given class and tuning the hyperparameters accordingly, rather than using the default configuration, can be viewed as creating a new customized setting, which is reusable for larger instances and achieves better generalization performance.