1 Introduction

As shown by Cardon et al. [5], neural networks have made a strong comeback since 2010. Snubbed during the artificial intelligence winter, they could finally exploit their potential thanks to the fast increase in computing capacity of the last decade, which led to their dominance in machine learning. Indeed, they have been successfully applied to a wide range of problems: image recognition [26], machine translation and automatic speech recognition [1], etc. However, these networks have become deeper and increasingly complex. Architectures designed for sequential data are symptomatic of this growing complexity. Recurrent neural networks were followed by convolutional networks and long short-term memory architectures [11]. Furthermore, for data requiring bidirectional context, transformers and their efficient attention mechanisms were widely adopted [25]. The transformer-based BERT language model, in its large version, is made of 24 layers with 16 attention heads each [8]. With an embedding size of 1024, this amounts to 340 million parameters to train. Such huge architectures have to be trained on sophisticated hardware and still require a large amount of computation and time. That is why we focus on speeding up the training of neural networks: it would help generalize their use while reducing the high carbon footprint of that use [15, 23]. To do so, we bring the One-Step procedure from statistics to machine learning [16]. The main contribution of this paper is to extend and evaluate this procedure for simple neural network architectures, paving the way to its use for speeding up the training of more complex architectures.

One-Step procedure. This procedure was initially introduced for the estimation of finite-dimensional parameters from a sample of independent and identically distributed (i.i.d.) observations [16]. In this procedure, an initial guess estimator is proposed which is fast to compute but not asymptotically efficient. Then, a single Newton-type step (hence the name One-Step) is performed on the log-likelihood function in order to correct the initial estimate and reach asymptotic efficiency. With some recent developments, the One-Step procedure has been successfully generalized to more sophisticated statistical experiments such as diffusion processes [9, 13], ergodic Markov chains [14], inhomogeneous Poisson counting processes [6], and fractional Gaussian and stable noises observed at high frequency [2, 3].

For this procedure to work, the initial sequence of guess estimators is generally assumed to be \(\sqrt{n}\)-consistent (see Sect. 2.2 for the definition) and the Fisher information matrix uniformly continuous. It has recently been shown [13, 14] that, for a \(n^{\frac{\delta }{2}}\)-consistent initial sequence of guess estimators (with \(\frac{1}{2} < \delta \le 1\)) and a Lipschitz Fisher information matrix, the sequence of One-Step estimators is still consistent, asymptotically normal and efficient. In this setting, even when the initial sequence is neither rate- nor variance-efficient asymptotically, the corrected sequence is asymptotically rate- and variance-efficient. This result makes it possible to use the numerical maximum likelihood estimator (MLE) computed on a subsample (of size \(n^\delta \), \(\frac{1}{2} < \delta \le 1\)) as a fast initial sequence of guess estimators (see [4] for a practical application in the i.i.d. setting).

In the usual One-Step procedure, a single Newton-type step is performed on the log-likelihood function. In this paper, we generalize this procedure to neural networks, optimizing the least-squares functional in the regression setting and the cross-entropy in the binary classification setting. Our experiments show that this procedure speeds up the training of multi-layer perceptron neural networks, and may thus be of use for a wide range of applications.

Outline. The notations and the One-Step estimation procedure we introduce for neural networks are presented in Sect. 2. The experimental setup used to evaluate the performance of our procedure on multi-layer perceptron neural networks is described in Sect. 3; it covers both simulated and real datasets for classification and regression tasks. The models and training protocol are detailed in Sect. 4. Results are presented in Sect. 5 and show that our extended One-Step procedure at least halves the training time while preserving performance. Conclusions and perspectives for real applications are given in Sect. 6.

2 Extending the One-Step Estimation Procedure

2.1 Notations

Let \((y_i)_{i=1,\ldots ,n}\) be the response variables, belonging to \({\mathbb {R}}\) in the regression setting or to \(\{-1,1\}\) in the binary classification setting. Let \((x_i)_{i=1,\ldots ,n}\) be the explanatory variables, belonging to \({\mathbb {R}}^d\).

Let us consider a loss functional \(\ell _n(\vartheta )\) for a model f (linear model, multi-layer perceptron neural network, etc.) characterized by a finite-dimensional parameter \(\vartheta \in \Theta \subset {\mathbb {R}}^q\). For instance, in the regression setting,

$$\begin{aligned} \ell _n(\vartheta ) = \frac{1}{n} \sum _{i=1}^n \left( y_i - f(\vartheta ,x_i) \right) ^2 \end{aligned}$$
(1)

and, for the binary classification,

$$\begin{aligned} \ell _n(\vartheta ) = \frac{1}{n} \sum _{i=1}^n \log \left( 1+ e^{-y_i f( \vartheta , x_i)}\right) . \end{aligned}$$
(2)
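For concreteness, both losses translate directly into code. The following is a minimal NumPy sketch, where f stands for any model taking a parameter vector and an input; the per-example loop is purely illustrative and would be vectorized in practice:

```python
import numpy as np

def loss_regression(theta, X, y, f):
    """Least-squares loss (1) for a generic model f(theta, x)."""
    preds = np.array([f(theta, x) for x in X])
    return np.mean((y - preds) ** 2)

def loss_classification(theta, X, y, f):
    """Cross-entropy loss (2); labels y are in {-1, +1}."""
    scores = np.array([f(theta, x) for x in X])
    return np.mean(np.log1p(np.exp(-y * scores)))

# Example with a linear model f(theta, x) = theta . x:
# loss_regression(np.ones(5), X, y, lambda t, x: t @ x)
```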

Generally, the training of the model is done by simply minimizing the loss function

$$\begin{aligned} {\widehat{\vartheta }}_n = \arg \min _{\vartheta \in \Theta } \ell _n(\vartheta ). \end{aligned}$$
(3)

For multi-layer perceptron neural networks, the back-propagation equations allow us to compute the gradient \(\frac{\partial }{\partial \vartheta } \ell _n\) and the Hessian \(\frac{\partial ^2}{\partial \vartheta ^2} \ell _n\) of the function \(\ell _n\) defined in (1) and (2). In its simplest form, without any regularization, gradient descent can be used to minimize the loss function and train the model. In this paper, we rely instead on the Newton optimization method: it avoids tuning hyper-parameters such as the learning rate, or the mini-batch size in the case of stochastic gradient descent, and lets us focus on the comparison in terms of performance and computation time. This method is not commonly used in machine learning because computing the inverse of the Hessian can be expensive in computing resources. Newton-type descent, or its approximations, is however efficient for training physics-informed neural networks [17]. Speeding up this optimization method is therefore of great interest. A minimal sketch of one Newton iteration is given below.
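The sketch below performs one full-batch Newton iteration with TensorFlow, obtaining the gradient by automatic differentiation and the Hessian with a nested gradient tape. A linear model stands in for the network to keep the example short; the data and all variable names are illustrative, not taken from the paper:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = tf.constant(rng.normal(size=(1000, 5)), dtype=tf.float64)
Y = tf.constant(rng.normal(size=1000), dtype=tf.float64)
theta = tf.Variable(tf.zeros(5, dtype=tf.float64))

def loss_fn():
    # f(theta, x) is a linear model here, for illustration only
    pred = tf.linalg.matvec(X, theta)
    return tf.reduce_mean(tf.square(Y - pred))

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        loss = loss_fn()
    grad = inner.gradient(loss, theta)   # gradient of ell_n
hess = outer.jacobian(grad, theta)       # Hessian of ell_n, a 5 x 5 matrix

# One Newton iteration: theta <- theta - H^{-1} g
# (solving the linear system is cheaper than forming the inverse)
theta.assign_sub(tf.linalg.solve(hess, grad[:, None])[:, 0])
```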

2.2 Our One-Step Estimation Procedure

Let \(\vartheta \in \Theta \subset {\mathbb {R}}^q\) denote the finite-dimensional parameter to be estimated. We propose in this paper the following One-Step procedure to speed up the training of multi-layer perceptron neural networks: we first train a neural network on a subsample of the data to get a guess estimator, which we then drastically improve with a single Newton step on the whole dataset.

Guess estimator. Firstly, we propose an initial sequence of guess estimators \(({\widetilde{\vartheta }}_n, n \ge 1)\), obtained by computing the estimator (3) on a subsample of size \(\lceil n^\delta \rceil \), \( \frac{1}{2}< \delta < 1\). The parameter \(\delta \) is thus a key ingredient of the proposed One-Step method: it sets the size of the subsample used for the initial estimation.
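In code, the subsample selection is straightforward. The sketch below uses placeholder data; the stratification used for classification in Sect. 3 is omitted:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 5)), rng.normal(size=10_000)   # placeholder data

delta = 0.9                               # hyper-parameter, 1/2 < delta < 1
n = len(X)
m = math.ceil(n ** delta)                 # subsample size: ceil(n^delta)
idx = rng.choice(n, size=m, replace=False)
X_sub, y_sub = X[idx], y[idx]             # data used for the guess estimator
```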

One-Step estimators. Then the sequence of One-Step estimators \(( {\overline{\vartheta }}_n, n \ge 1)\) is defined by

$$\begin{aligned} {\overline{\vartheta }}_n = {\widetilde{\vartheta }}_n - \left( \frac{\partial ^2}{\partial \vartheta ^2} \ell _n({\widetilde{\vartheta }}_n) \right) ^{-1} \cdot \frac{\partial }{\partial \vartheta } \ell _n({\widetilde{\vartheta }}_n), \quad n \ge 1. \end{aligned}$$
(4)
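Equation (4) is a single Newton correction. A direct NumPy transcription, given callables returning the gradient and Hessian of \(\ell _n\), could read as follows; solving the linear system rather than inverting the Hessian is both cheaper and numerically safer:

```python
import numpy as np

def one_step(theta_tilde, grad_fn, hess_fn):
    """One-Step estimator (4): a single Newton correction on the full loss."""
    g = grad_fn(theta_tilde)   # gradient of ell_n at the guess
    H = hess_fn(theta_tilde)   # Hessian of ell_n at the guess
    return theta_tilde - np.linalg.solve(H, g)
```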

Heuristically, in the simplest statistical experiments (i.i.d. observations or generalized linear models with controlled covariates as in [18]), if the initial guess estimator \({\widetilde{\vartheta }}_n\) is \(n^{\frac{\delta }{2}}\)-consistent (with \(\frac{1}{2} < \delta \le 1\)), i.e. for any \(\epsilon >0\) there exists a constant C such that

$$\begin{aligned} P_\vartheta \left( n^{\frac{\delta }{2}}\left\| {\widetilde{\vartheta }}_n - \vartheta \right\| > C \right) \le \epsilon , \end{aligned}$$

and the Fisher information matrix (the expectation of the opposite of the Hessian) is Lipschitz regular, then the sequence of One-Step estimators defined by (4) is asymptotically equivalent to \({\widehat{\vartheta }}_n\) defined in (3). The proof in this setting is postponed to Appendix A. In other words, if the initial guess is statistically close enough to the true parameter and the Hessian is sufficiently regular, then a single Newton step is sufficient to reach asymptotic efficiency in the calibration procedure.
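To make the heuristic concrete, here is a self-contained sketch in the binary classification setting with a linear score \(f(\vartheta , x) = \vartheta ^\top x\), for which the gradient and Hessian of the loss (2) have closed forms. The data generation, the choice \(\delta = 0.7\) and the five Newton iterations for the guess are illustrative; in practice the guess would be computed by the optimizer of Sect. 4 on a random or stratified subsample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
# Labels in {-1, +1}, drawn with P(y = 1 | x) = sigmoid(x . theta_star)
y = np.where(rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

def grad_hess(theta, X, y):
    """Gradient and Hessian of the loss (2) for f(theta, x) = theta . x."""
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))   # sigmoid(-y_i x_i . theta)
    g = -(X * (y * s)[:, None]).mean(axis=0)
    w = s * (1.0 - s)
    H = (X * w[:, None]).T @ X / len(y)
    return g, H

# Guess estimator: a few Newton iterations on a subsample of size ceil(n^delta)
delta = 0.7
m = int(np.ceil(n ** delta))
theta = np.zeros(d)
for _ in range(5):
    g, H = grad_hess(theta, X[:m], y[:m])
    theta -= np.linalg.solve(H, g)

# One-Step correction (4) on the whole sample
g, H = grad_hess(theta, X, y)
theta_bar = theta - np.linalg.solve(H, g)
```

The guess consumes most of the budget; the final correction costs a single gradient and Hessian evaluation on the full sample.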

In the following, we show on simulated and real datasets that the sequential experimental setup described in Sect. 3, which mimics the One-Step procedure (4), can speed up the training of multi-layer perceptron neural networks while reaching the same asymptotic performance.

3 Experimental Setup

In this paper, we consider speeding up the training of neural network architectures with a sequential experimental setup which mimics the One-Step estimation procedure. To evaluate the efficiency of our approach, we consider regression and classification tasks on simulated and real datasets. We do not compare our results to the state of the art, nor do we claim our models to be better: the aim is to demonstrate that the One-Step procedure we extended has the potential to speed up training while preserving performance. We thus compare models trained using a classic Newton-type descent with models trained with the One-Step procedure described in Sect. 2.2, and measure runtime in addition to performance indicators.

In summary, we aim to demonstrate empirically that adapting the One-Step procedure to neural networks can speed up their training.

3.1 Infrastructure

All results in the paper are computed using Python 3.7.13, scikit-learn 1.0.2, keras 2.8.0 and tensorflow 2.8.2 on Google Colab with a GPU accelerator. In addition, we use the Kormos implementation of a Newton optimizer [12].

3.2 Simulations

Multiple datasets were simulated for regression and classification using randomly generated but fixed random states. At the beginning of the experiment, a master random state was fixed arbitrarily and used to generate 10 other random states for the dataset simulations; this assesses the robustness of the method while allowing other users to easily reproduce the results. The results presented in Sect. 5 are averaged over these 10 runs.
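A minimal way to implement this seeding scheme (the master seed value is illustrative):

```python
import numpy as np

master = np.random.default_rng(42)               # arbitrary fixed master state
seeds = master.integers(0, 2**31 - 1, size=10)   # one random_state per dataset
```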

In addition, we vary the sample size throughout the experiment to assess the impact of dataset size on training speed. The values tested were \(5\times 10^4\) and \(5\times 10^5\).

3.2.1 Simulated Regression Dataset

The method used was scikit-learn’s make_regression (for more information on the method, see [20]). The data generation parameters are as follows; an example call is shown after the list:

  • n_features: 50, total number of features

  • n_informative: 95, number of features used to generate the target, so as to simulate real data where some features have no impact on the target

  • noise: 0.5, standard deviation of the Gaussian noise applied to the target, which makes predicting the target a little harder
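A call matching these parameters might look as follows; n_samples and random_state are placeholders, and note that scikit-learn internally caps n_informative at n_features:

```python
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=500_000,    # 5 x 10^4 or 5 x 10^5 in our experiments
    n_features=50,
    n_informative=95,     # capped at n_features by scikit-learn
    noise=0.5,
    random_state=0,       # one of the 10 generated random states
)
```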

3.2.2 Simulated Classification Dataset

The method used was scikit-learn’s make_classification (for more information on the method, see [20]), with the generation parameters as follows; an example call is shown after the list:

  • n_features: 50, total number of features

  • n_redundant: 5, number of features generated as random linear combinations of the informative features, which helps simulate a real dataset with some level of feature collinearity

  • weights: [0.7], class weights; this yields roughly 70% negative and 30% positive examples. The imbalance simulates a real dataset with some data imbalance.
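A corresponding call could be sketched as follows, with the same placeholder conventions as above:

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500_000,
    n_features=50,
    n_redundant=5,
    weights=[0.7],        # ~70% negative / 30% positive examples
    random_state=0,       # one of the 10 generated random states
)
```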

3.3 Real Datasets

In addition to the simulations, real datasets were used to validate our approach. The Yolanda and Credit Card Fraud Detection datasets were chosen because they require little preprocessing, which eliminates the influence of different preprocessing techniques, and because they are large enough for speeding up the training to matter. We describe these datasets further in this section.

3.3.1 Regression Dataset: Yolanda

Yolanda is a dataset introduced by Guyon et al. [10] as part of an AutoML challenge. It is a subsample of the Million Song Dataset, and the regression task consists in predicting the year a song was released from 90 audio features extracted with the Echo Nest API. Most of the songs are Western commercial tracks ranging from 1922 to 2011. Yolanda contains 460,000 songs, divided into a training set of 400,000, a validation set of 30,000 and a test set of 30,000.

3.3.2 Classification Dataset: Credit Card Fraud Detection

The Credit Card Fraud Detection dataset contains transactions made by credit cards in September 2013 by European cardholders. The target is binary and highly imbalanced, with the positive class accounting for only \(0.172\%\) of all transactions. In addition, its variables are the result of a PCA transformation, making it well suited to our purpose. The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group of the Université Libre de Bruxelles on big data mining and fraud detection [7].

4 Models and Training

4.1 Preprocessing

The variables of all datasets are similar in format, allowing for homogeneous preprocessing across the board, barring some exceptions. No dataset has missing values, and all explanatory variables are purely quantitative. All variables are normalized to have a mean of 0 and a standard deviation of 1. Care is taken during normalization to avoid data leakage: the normalization statistics are computed on the training set only.

The one exception is the handling of the “Amount” variable in the Credit Card Fraud Detection dataset, which is replaced by its logarithm in order to reduce its range and limit the effect of outlying values. These steps are inspired by TensorFlow’s tutorial on that dataset [19].

After preprocessing, the datasets go through a train/test split. For classification tasks, the split is stratified on the target variable. In all cases except Fraud Detection, 10% of the data is dedicated to testing; for Fraud Detection, we use 30% given the small number of positive examples. A sketch of these steps follows.
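A sketch of this preprocessing for a classification dataset, assuming X and y already loaded (for regression, the stratify argument would be omitted; the log1p variant of the Amount transform is an assumption on our side):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Fraud Detection only: log-transform "Amount" before scaling, e.g.
# X["Amount"] = np.log1p(X["Amount"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)  # test_size=0.3 for Fraud Detection

scaler = StandardScaler().fit(X_train)   # statistics from the train set only,
X_train = scaler.transform(X_train)      # so no information leaks from the test set
X_test = scaler.transform(X_test)
```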

4.2 Model Architectures

The models change with dataset difficulty, but they share the same overall simple design. All models feature two hidden layers and a batch size equal to the size of the dataset (full-batch Newton optimization). Such simple architectures are commonly used in physics-informed neural networks, for instance to solve differential equations [22, 24]. For such networks, because the Hessian guides the convergence in the Newton method, training may actually perform better than with Adam. Furthermore, there are no hyper-parameters such as batch size and learning rate that require expensive tuning. We use Michael B. Hynes’s Python implementation of a Newton-CG optimizer for Keras from the Kormos library [12].

For all models, the sigmoid is used as the activation function of all layers (except the prediction layer of regression models). The sigmoid was chosen because it meets the usual regularity assumptions needed for the One-Step procedure to work, although other activation functions (such as ReLU) may give better empirical results.

Finally, classification models are initialized so that they predict the proportion of positive cases at initialization, by setting the output bias to \(\log (pos/neg)\), where pos is the number of positive examples and neg the number of negative ones. This helps convergence on imbalanced datasets, as described in the TensorFlow imbalanced-data tutorial [19]. In addition, a class weight of 2 on the positive class is used during training to make the positive class easier to learn. A sketch of such a model follows.
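A sketch of such a classification model in Keras; the layer sizes follow Sect. 5.2, the labels are assumed to be in \(\{0,1\}\) on the Keras side, and the Kormos Newton-CG training loop is omitted:

```python
import numpy as np
import tensorflow as tf

pos = int(y_train.sum())        # assumes binary {0, 1} labels from the split above
neg = len(y_train) - pos

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="sigmoid"),
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(
        1, activation="sigmoid",
        bias_initializer=tf.keras.initializers.Constant(np.log(pos / neg))),
])
# Positive class weighted by a factor of 2 during training:
# model.fit(..., class_weight={0: 1.0, 1: 2.0})
```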

4.3 The Experiment

After preprocessing and the train/test split, we begin by training the One-Step model (see Sect. 4.3.1). We record the training time and the performance on the test set for a task-specific metric (see Sect. 4.4). For simulated datasets, we report the average training time and test performance over the separate runs on each dataset. For real datasets, a 10-fold cross-validation is performed to get the average training time.

Then we train the control model with a callback stopping training when it overtakes the performance (on the test set) of the One-Step model trained on the same sample.

Finally, we compare the training times of the two models: one trained using the One-Step method, the other being the control model trained with a classic Newton descent. The comparison is made in terms of average training time over the 10 simulations (or the 10 cross-validation folds). We also compare the two models in terms of performance on the test set.

4.3.1 One-Step Experiments

The One-Step training experiment starts with a data split based on the value of \(\delta \), which determines the share of the train set used for the initial training: larger values of \(\delta \) allow for larger portions of the dataset. For classification, this subset is stratified on the target.

We thus initially train the model for p epochs on this subsample and then perform one final epoch on the whole dataset (the One-Step correction), as sketched below. The parameter p may vary depending on the dataset.
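Schematically, with a Keras-style model the schedule reads as follows; this sketch assumes X_train, y_train, a compiled model, and the hyper-parameters delta and p from the surrounding text, and abstracts away the Kormos Newton optimizer configuration:

```python
import math

m = math.ceil(len(X_train) ** delta)     # subsample size from delta
X_sub, y_sub = X_train[:m], y_train[:m]  # stratified beforehand for classification

model.fit(X_sub, y_sub, batch_size=m, epochs=p)                 # guess estimator
model.fit(X_train, y_train, batch_size=len(X_train), epochs=1)  # One-Step epoch
```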

4.3.2 Control Experiment

The control training experiment (written Ctrl in the tables of results) trains the model for m epochs with a callback stopping it when it reaches the One-Step model’s performance on the same sample, in terms of the loss function on the test set. The m epochs are never effectively reached: training stops as soon as the evaluation metric is overtaken.

The callback target depends on the score attained by the One-Step experiment on the corresponding data: for simulated datasets, the control score is compared at each iteration with the score attained by the One-Step procedure on the same simulated dataset; for real datasets, it is compared with the score attained on the same cross-validation fold. Such a callback can be sketched as follows.
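A minimal version of such a callback, where the target score is the One-Step model's test loss on the matching simulation or fold; the class name and usage are illustrative:

```python
import tensorflow as tf

class StopWhenBeaten(tf.keras.callbacks.Callback):
    """Stop the control run once it matches the One-Step test loss."""

    def __init__(self, target_loss):
        super().__init__()
        self.target_loss = target_loss

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get("val_loss")
        if current is not None and current <= self.target_loss:
            self.model.stop_training = True

# Usage: model.fit(..., epochs=m, validation_data=(X_test, y_test),
#                  callbacks=[StopWhenBeaten(one_step_test_loss)])
```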

4.4 Metrics

For regression tasks, we use the Mean Absolute Error (MAE) to evaluate our models. For classification tasks, given the data imbalance, we use the Area Under the Precision-Recall Curve (AUPRC).
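Both metrics are available in scikit-learn; average_precision_score is the usual summary of the precision-recall curve, and model.predict is assumed to return Keras-style predicted values or probabilities:

```python
from sklearn.metrics import average_precision_score, mean_absolute_error

mae = mean_absolute_error(y_test, model.predict(X_test).ravel())        # MAE
auprc = average_precision_score(y_test, model.predict(X_test).ravel())  # AUPRC
```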

5 Results

The One-Step procedure outperforms the classical approach in terms of computation time while reaching the same asymptotic performance in terms of precision. However, this holds only for large datasets: when the dataset is small, differences in execution time are minimal. Detailed results are given in the subsections below.

5.1 Regression on Simulation Data

The specific parameters of this experiment are as follows:

  • \(p=5\), number of epochs trained on the \(\delta \)-dependent subsample before the One-Step correction;

  • \(\delta =0.9\), which sets the share of the dataset used in the pre-One-Step iterations (a worked example follows the list);

  • Test set size: \(10\%\);

  • Model architecture: two fully-connected sigmoid layers (\(256 \rightarrow 128\)) followed by one prediction neuron.
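For instance, with \(n = 5\times 10^5\) and \(\delta = 0.9\), the initial training uses \(\lceil n^{0.9} \rceil \approx 1.35\times 10^5\) observations, i.e. roughly 27% of the data.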

In this first experiment, we vary the sample size and observe the average training time with the control approach and with the extended One-Step procedure we introduced. We can observe in Table 1 that for larger datasets, the One-Step method is faster than the control (Ctrl) for a similar performance: roughly 2.3 times faster for 500,000 observations.

Table 1 Results of the One-Step and control approaches (Ctrl) on simulated data for regression

In our second experiment, we study the impact of the hyper-parameter \(\delta \) of our One-Step procedure introduced in Sect. 2.2. This parameter controls the size of the subsample used to compute the guess estimator, and thus influences the quality of this guess. In particular, we monitor the average training time as a function of this parameter on the dataset of 500,000 observations. As one can see in Fig. 1, the lower the \(\delta \), the lower the training time. Furthermore, at this sample size, even with low \(\delta \) the guess estimators are good enough: one Newton step (One-Step) on the whole data converges to the same results as the control.

Training times remain similar across models trained with the control procedure, i.e. the classic Newton descent. Their variation is due to variations in the performance of the corresponding One-Step training runs, since the One-Step score is used in the callback.

Fig. 1 Influence of the \(\delta \) parameter on the average time to train the systems (in seconds) on a regression dataset of 500,000 observations. The mean time for Ctrl, 26.785 s, is given for comparison

5.2 Classification on Simulated Data

The specific parameters of this experiment are as follows:

  • \(p=5\), number of epochs trained on the \(\delta \)-dependent subsample before the One-Step correction;

  • \(\delta =0.85\), which sets the share of the dataset used in the pre-One-Step iterations;

  • Test set size: \(10\%\);

  • Model architecture: three fully-connected sigmoid layers (\(256 \rightarrow 128 \rightarrow 1\));

  • A bias initializer depending on the target imbalance, as per the TensorFlow imbalanced-data tutorial [19].

Table 2 Results of the One-Step and control approaches on simulated data for classification

Results in Table 2 show a trend similar to the regression experiment: on the larger dataset, average training times are significantly lower for the One-Step experiment than for the control (Ctrl), while classification performance as measured by the AUPRC is preserved. These results are thus very encouraging, which is why we aim to confirm these trends on real datasets in the rest of this section.

5.3 Regression Dataset: Yolanda

The specific parameters of this experiment are as follows:

  • \(p=8\), number of epochs trained on the \(\delta \)-dependent subsample before the One-Step correction;

  • \(\delta =0.9\), which sets the share of the dataset used in the pre-One-Step iterations;

  • Test set size: \(10\%\);

  • Model architecture: two fully-connected sigmoid layers (\(64 \rightarrow 16\)) and one prediction neuron;

  • Number of cross-validation folds: 10.

Table 3 Results on the Yolanda regression dataset

As one can see in Table 3, the gap in training time in this experiment is smaller than in the regression experiments on simulated data, despite the large size of Yolanda (400,000 observations in the train set). This is because the model considered here is significantly simpler, with fewer neurons involved. It suggests that the size of the model influences, in absolute terms, the time gained through the One-Step procedure.

5.4 Classification Dataset: Credit Card Fraud Detection

The specific parameters of this experiment are as follows:

  • \(p=4\), number of epochs trained on the \(\delta \)-dependent subsample before the One-Step correction;

  • \(\delta =0.9\), which sets the share of the dataset used in the pre-One-Step iterations;

  • Test set size: \(30\%\);

  • Model architecture: three fully-connected sigmoid layers (\(512 \rightarrow 256 \rightarrow 1\));

  • Number of cross-validation folds: 10;

  • A bias initializer depending on the target imbalance, as per the TensorFlow imbalanced-data tutorial [19].

Table 4 Results on the Credit card fraud detection classification dataset

As one can see in Table 4, the gap in this experiment is significant: One-Step is roughly five times faster to train than the control procedure (Ctrl). This is due both to the more complex model structure, with more neurons involved, and to the dataset size (over 280,000 samples).

6 Conclusions and Perspectives

In this paper, we introduced an extended One-Step procedure for the training of neural networks. To prove the relevance of this approach, we considered classic multi-layer perceptrons trained with Newton optimization, which often suffers from its computing time. On simulated and real datasets, we demonstrated that the One-Step methodology speeds up the training while keeping the same asymptotic performance in terms of precision. Our One-Step based approach appears particularly efficient on large datasets, and the speed-up is even more impressive on bigger architectures, as shown with the credit card fraud detection example.

It is worth mentioning that other experiments without early stopping were also conducted. They show that, on the aforementioned datasets, the performances of the One-Step and the control methods are similar.

To further demonstrate the relevance of the One-Step procedure, we aim at extending its application to a wider range of problems, and thus to bigger and more complex architectures. Other contributions could consider other optimizers for the initial guess; it would for example be interesting to assess the relevance of such a One-Step approach when stochastic gradient descent is used. Furthermore, it would be interesting to develop a framework to efficiently set the hyper-parameters of the One-Step procedure, namely \(\delta \), which controls the size of the subsample, and the number of Newton iterations to run on the subsample, so as to obtain the best speed-up while preserving performance. Finally, we assumed here some regularity of the functionals, using sigmoids, which helped us understand the machinery; but simulations with ReLU activation functions are also encouraging and need to be studied further theoretically.