1 Introduction

The adaptive moment estimation (Adam) method proposed in [10] is one of the more recent stochastic gradient descent methods. It belongs to the family of adaptive learning rate methods in the literature; see, for example, momentum [16], Nesterov accelerated gradient (NAG) [15], Adagrad [4], Adadelta [24], AdaMax [10], Nadam [3], AMSGrad [17], AdamW [12], QHAdam [14], and AggMo [13]. In the Adam method, decaying averages of past gradients and past squared gradients are computed and stored to give estimates of the first and second moments of the gradients, respectively. These estimates are initialized to zero vectors and are therefore biased towards zero, especially during the initial time steps and when the decay rates are small. These biases are counteracted by computing bias-corrected first and second moment estimates, which are then used to update the parameters during the calculation procedure.

The Adam method has been extensively studied across multiple fields, such as risk management, portfolio selection, and machine learning. Schiele [19] improved the accuracy of asset return estimation and the associated expected portfolio performance through a dynamic portfolio optimization framework and an artificial neural network. Ghahtarani et al. [6] reviewed recent robust portfolio selection problems from operational research and financial perspectives and presented a classification of models and methods. Veraguas et al. [23] considered stochastic optimal control problems in which a risk minimization problem for controlled diffusions was solved; they derived a dynamic programming principle that recovers central results of the risk-neutral setting, and the value of the risk minimization problem can be characterized as a viscosity solution of a Hamilton–Jacobi–Bellman–Isaacs equation. Chronopoulos et al. [2] studied a deep quantile estimator based on a neural network to forecast value-at-risk (VaR) and found significant gains over linear quantile regression, where the Adam algorithm is used to train the neural network.

On the other hand, portfolio optimization problems have been well studied in economics and finance. Baltas et al. [1] studied a robust-entropic optimal control problem for portfolio management. They provided a closed-form solution and a detailed study of the limiting behaviour via the associated stochastic differential game, thereby clarifying the effect of robustness on the optimal decisions of both players. Temocin et al. [21] considered the optimal portfolio problem with minimum guarantee protection in a defined contribution pension scheme; they compared various versions of the guarantee concept, with each guarantee framework treated through a classical stochastic control approach. Kara et al. [9] considered the robust conditional VaR under parallelepiped uncertainty in modelling the robust optimal portfolio allocation, finding that the stability of the portfolio allocation was increased and the portfolio risk was reduced. Savku and Weber [18] discussed optimal investment problems using stochastic differential game approaches. They derived regime-switching Hamilton–Jacobi–Bellman–Isaacs equations to obtain explicit optimal portfolio strategies with Feynman–Kac representations of the value functions.

In our study, we improve the convergence rate of the Adam method by reducing the number of iterations. To do so, the standard error (SE) is incorporated into the updating rule of the Adam algorithm; hence the name AdamSE algorithm. To begin, a mean-value-at-risk (mean-VaR) portfolio optimization problem for the Employees Provident Fund (EPF) is formulated. The mean, covariance and required rate of return are calculated from the weekly stock prices of 10 assets selected for the period from 2015 to 2019. The simulation results obtained by using the Adam and AdamSE algorithms are compared and discussed. In addition, we consider nine samples of the past gradients through sampling simulation, which differs from the single sample used in [20]. Therefore, different iteration numbers are required to arrive at the optimal weights, and three different confidence levels are used to report the portfolio risk for the model under study.

This paper is organized as follows. In Section 2, a mean-VaR portfolio optimization problem for the EPF is described. The weekly stock prices of 10 assets from the top 30 equity holdings list released by the EPF are utilized to calculate the mean, covariance and required rate of return, and these parameters are used to construct the portfolio model. In Section 3, the Lagrange function is defined and the first-order necessary conditions are derived; furthermore, the calculation procedures of the Adam and AdamSE algorithms are presented. In Section 4, simulation results obtained using the Adam and AdamSE algorithms are provided, and the results for the nine gradient samples are discussed. Finally, concluding remarks are given.

2 Problem Description

Consider a mean-VaR portfolio optimization problem [7, 25], which is to minimize the objective function,

$$\begin{aligned} f(w)=z_{\alpha }\sqrt{w^{\top }{\varSigma } w}\sqrt{{\varDelta } t} \end{aligned}$$
(1)

subject to the following constraints,

$$\begin{aligned}{} & {} w^{\top }\mu = R,\end{aligned}$$
(2)
$$\begin{aligned}{} & {} w^{\top }I =1,\end{aligned}$$
(3)
$$\begin{aligned}{} & {} 0 \le w\le 1, \end{aligned}$$
(4)

where \(w=(w_{1},w_{2},\dots ,w_{n})^{\top }\in \Re ^{n}\) is an n-vector of the portfolio weights, \({\varSigma }\in \Re ^{n\times n}\) is the \(n\times n\) covariance matrix of the portfolio, and \(\mu =(\mu _{1},\mu _{2},\dots ,\mu _{n})^{\top }\in \Re ^{n}\) is an n-vector of the expected return rates of the portfolio, whereas \(I=(1,1,\dots ,1)^{\top }\in \Re ^{n}\) is an n-vector of ones, and R is the minimum threshold at which investors can tolerate the expected rate of return on their portfolio.

Here, the portfolio’s VaR is given by the objective function (1), the confidence level \(\alpha \) reflects the degree of risk aversion, \(z_{\alpha }\) is the z-score for the confidence level \(\alpha \), and \({\varDelta } t\) is the holding period. Since the portfolio consists of a set of assets with uncertain stock prices, the portfolio weights are treated as random variables, with the initial weights taken to be equal (average) weights. Accordingly, this mean-VaR problem is defined as a stochastic optimization problem.

Now, a mean-VaR portfolio optimization problem is stated as follows. Consider the case where 10 stocks are selected [5] from the top 30 equity holdings list released by the EPF; these stock prices are weakly correlated. The weekly stock prices of these stocks for the period from 2015 to 2019 are retrieved from the website investing.com. Using these historical stock price data, the mean, covariance and required rate of return of the portfolio are calculated, and they are given below.

(a)

    The means of return rates

    $$\begin{aligned} \mu = \begin{pmatrix} -0.001935268 \\ -0.000349588 \\ 0.001131086 \\ -0.00147822 \\ 0.000463904 \\ 0.000831973 \\ -0.00354601 \\ -0.000959335 \\ -0.000252542 \\ 0.00302638 \end{pmatrix}. \end{aligned}$$
(b)

    The covariance of the portfolio

    $$\begin{aligned} {\varSigma } = 10^{-3}\times \begin{bmatrix} 1.192 &{}~ 0.151 &{}~ 0.297 &{}~ 0.339 &{}~ 0.106 &{}~ 0.329 &{}~ 0.198 &{}~ 0.388 &{}~ 0.213 &{}~ 0.213 \\ 0.151 &{}~ 1.094 &{}~ 0.072 &{}~ 0.205 &{}~ 0.108 &{}~ 0.143 &{}~ 0.217 &{}~ 0.375 &{}~ 0.138 &{}~ 0.136 \\ 0.297 &{}~ 0.072 &{}~ 2.805 &{}~ 0.210 &{}~ 0.091 &{}~ 0.240 &{}~ 0.394 &{}~ 0.261 &{}~ 0.116 &{}~ 0.276 \\ 0.339 &{}~ 0.205 &{}~ 0.210 &{}~ 1.594 &{}~ 0.197 &{}~ 0.307 &{}~ 0.248 &{}~ 1.004 &{}~ 0.327 &{}~ 0.215 \\ 0.106 &{}~ 0.108 &{}~ 0.091 &{}~ 0.197 &{}~ 0.339 &{}~ 0.190 &{}~ 0.139 &{}~ 0.193 &{}~ 0.131 &{}~ 0.013 \\ 0.329 &{}~ 0.143 &{}~ 0.240 &{}~ 0.307 &{}~ 0.190 &{}~ 1.299 &{}~ 0.268 &{}~ 0.407 &{}~ 0.260 &{}~ 0.134 \\ 0.198 &{}~ 0.217 &{}~ 0.394 &{}~ 0.248 &{}~ 0.139 &{}~ 0.268 &{}~ 1.804 &{}~ 0.646 &{}~ 0.144 &{}~ 0.300 \\ 0.388 &{}~ 0.375 &{}~ 0.261 &{}~ 1.004 &{}~ 0.193 &{}~ 0.407 &{}~ 0.646 &{}~ 2.829 &{}~ 0.285 &{}~ 0.188 \\ 0.213 &{}~ 0.138 &{}~ 0.116 &{}~ 0.327 &{}~ 0.131 &{}~ 0.260 &{}~ 0.144 &{}~ 0.285 &{}~ 0.537 &{}~ 0.216 \\ 0.213 &{}~ 0.136 &{}~ 0.276 &{}~ 0.215 &{}~ 0.013 &{}~ 0.134 &{}~ 0.300 &{}~ 0.188 &{}~ 0.216 &{}~ 0.851 \end{bmatrix}. \end{aligned}$$
(c)

    The required return rate

    $$\begin{aligned} R = 0.0005. \end{aligned}$$

Here, the holding period \({\varDelta } t\) = 260 days, the confidence level \(\alpha \) = 0.05 and the z-score \(z_\alpha \) = 1.645 are used in the model’s objective function (1).

Therefore, the mean-VaR portfolio optimization problem for the EPF investment is constructed by substituting the values of the mean, covariance and required return rate into (1) and (2). Define the portfolio weight vector \(w=(w_{1}, w_{2},\dots , w_{10})^{\top }\in \Re ^{10}\), regarded as a random vector; the aim is to determine the optimal weight w for these 10 assets of the portfolio such that the VaR of the portfolio is minimized.
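To make the model concrete, the following is a minimal sketch of how the objective (1) and the constraints (2)–(4) can be evaluated; the 3-asset \({\varSigma }\) and \(\mu \) below are hypothetical stand-ins for the full 10-asset data above, while \({\varDelta } t = 260\), \(z_{\alpha } = 1.645\) and \(R = 0.0005\) are as stated.

```python
import numpy as np

# Hypothetical 3-asset data used only for illustration; the full 10-asset
# EPF mean vector and covariance matrix are given in (a) and (b) above.
Sigma = 1e-3 * np.array([[1.2, 0.2, 0.3],
                         [0.2, 1.1, 0.1],
                         [0.3, 0.1, 2.8]])
mu = np.array([-0.0019, 0.0011, 0.0030])
R, dt, z_alpha = 0.0005, 260, 1.645

def var_objective(w):
    """Portfolio VaR, objective (1): z_alpha * sqrt(w' Sigma w) * sqrt(dt)."""
    return z_alpha * np.sqrt(w @ Sigma @ w) * np.sqrt(dt)

def constraints(w):
    """Residuals of (2) and (3), and feasibility of the bounds (4)."""
    return w @ mu - R, w.sum() - 1.0, bool(np.all((w >= 0) & (w <= 1)))

w0 = np.full(3, 1.0 / 3.0)   # equal initial weights, as described above
print(var_objective(w0), constraints(w0))
```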

3 Adaptive Moment Estimation with Standard Error

Define the Lagrange function,

$$\begin{aligned} L(w,\lambda )=z_{\alpha }\sqrt{w^{\top }{\varSigma } w}\sqrt{{\varDelta } t}+\lambda _{1}(R-w^{\top }\mu )+\lambda _{2}(1-w^{\top }I)+\lambda _{3}^{\top }w, \end{aligned}$$
(5)

where \(\lambda =(\lambda _{1},\lambda _{2},\lambda _{3})\) is the multiplier vector with \(\lambda _{1},\lambda _{2}\in \Re \) and \(\lambda _{3}\in \Re ^{n}\) to be determined later. From (5), the first-order necessary conditions for the model are derived as follows:

$$\begin{aligned} \frac{\partial L(w,\lambda )}{\partial w}= & {} \frac{z_{\alpha }\sqrt{{\varDelta } t}({\varSigma } w)}{\sqrt{w^{\top }{\varSigma } w}}-\lambda _{1}\mu -\lambda _{2}I+\lambda _{3}=0,\\ \frac{\partial L(w,\lambda )}{\partial \lambda _{1}}= & {} w^{\top }\mu -R=0,\nonumber \\ \frac{\partial L(w,\lambda )}{\partial \lambda _{2}}= & {} w^{\top }I-1=0,\nonumber \\ \lambda _{3}^{\top }w= & {} 0,\quad \lambda _{3}\ge 0.\nonumber \end{aligned}$$
(6)

Here, (6) is the gradient of the mean-VaR model, which is employed in the Adam and AdamSE algorithms to find the optimal weights for the mean-VaR portfolio optimization problem.
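As a sketch, the gradient (6) with respect to w and the constraint residuals can be coded directly; the function name and the simplification \(\lambda _{3}=0\) (bound constraints assumed inactive) are illustrative assumptions.

```python
import numpy as np

def mean_var_gradient(w, lam1, lam2, Sigma, mu, R, z_alpha, dt):
    """First-order conditions (6) with lambda_3 = 0 (bound constraints inactive)."""
    I = np.ones_like(w)
    sigma_p = np.sqrt(w @ Sigma @ w)                      # sqrt(w' Sigma w)
    dL_dw = z_alpha * np.sqrt(dt) * (Sigma @ w) / sigma_p - lam1 * mu - lam2 * I
    dL_dlam1 = w @ mu - R                                 # return constraint residual
    dL_dlam2 = w @ I - 1.0                                # budget constraint residual
    return dL_dw, dL_dlam1, dL_dlam2
```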

3.1 Analytical Solution for Deterministic Case

We now consider the case where the portfolio weight w is deterministic and needs to be determined. Multiplying (6) by \(w^{\top }\) and using the constraints \(w^{\top }\mu =R\) and \(w^{\top }I=1\) together with the complementarity condition \(\lambda _{3}^{\top }w=0\), we obtain the standard deviation of the portfolio as follows,

$$\begin{aligned} \sqrt{w^{\top }{\varSigma } w}=\frac{\lambda _{1}R+\lambda _{2}}{z_{\alpha }\sqrt{{\varDelta } t}}. \end{aligned}$$
(7)

Substituting (7) into (6), we obtain the weight as given below:

$$\begin{aligned} w={\varSigma }^{-1}\frac{(\lambda _{1}R+\lambda _{2})(\lambda _{1}\mu +\lambda _{2}I-\lambda _{3})}{z_{\alpha }^{2}{\varDelta } t}. \end{aligned}$$
(8)

From (7) and (8), it follows that \(\lambda _{1}\) and \(\lambda _{2}\) are given by

$$\begin{aligned} \lambda _{1}= & {} \frac{AR-B}{\sqrt{(AC-B^{2}) (AR^{2}-2BR+C)}}\, z_{\alpha }\sqrt{{\varDelta } t}, \end{aligned}$$
(9)
$$\begin{aligned} \lambda _{2}= & {} \frac{C-BR}{\sqrt{(AC-B^{2}) (AR^{2}-2BR+C)}}\, z_{\alpha }\sqrt{{\varDelta } t}, \end{aligned}$$
(10)

with \(A=I^{\top }{\varSigma }^{-1}I\), \(B=I^{\top }{\varSigma }^{-1}\mu \), \(C=\mu ^{\top }{\varSigma }^{-1}\mu \) and \(\lambda _{3}=0\).

According to the above discussion, the analytical solution of the mean-VaR portfolio optimization problem [25] defined by (1)–(4) is determined by (8), (9) and (10). However, we assume that this analytical solution is not available in our setting, since the weights are random variables whose values depend on the availability of the expected return rate and the covariance of the portfolio, and the stock prices in the portfolio are themselves uncertain and random.
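For reference, the deterministic closed form (8)–(10) can be sketched numerically as below, again on hypothetical 3-asset data and with \(\lambda _{3}=0\); the final check only verifies the equality constraints (2) and (3), since the bounds (4) may still be violated by the closed-form weights.

```python
import numpy as np

def analytical_weights(Sigma, mu, R, z_alpha, dt):
    """Closed-form weights of the deterministic case via (8)-(10), lambda_3 = 0."""
    I = np.ones(len(mu))
    Sinv = np.linalg.inv(Sigma)
    A, B, C = I @ Sinv @ I, I @ Sinv @ mu, mu @ Sinv @ mu
    D = np.sqrt((A * C - B**2) * (A * R**2 - 2 * B * R + C))
    lam1 = (A * R - B) / D * z_alpha * np.sqrt(dt)                            # (9)
    lam2 = (C - B * R) / D * z_alpha * np.sqrt(dt)                            # (10)
    w = Sinv @ (lam1 * mu + lam2 * I) * (lam1 * R + lam2) / (z_alpha**2 * dt) # (8)
    return w, lam1, lam2

# Hypothetical 3-asset example; w' mu should equal R and the weights should sum to 1.
Sigma = 1e-3 * np.array([[1.2, 0.2, 0.3], [0.2, 1.1, 0.1], [0.3, 0.1, 2.8]])
mu = np.array([-0.0019, 0.0011, 0.0030])
w, lam1, lam2 = analytical_weights(Sigma, mu, R=0.0005, z_alpha=1.645, dt=260)
print(w, w @ mu, w.sum())
```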

3.2 Adam Algorithm

The exponential moving averages of the past gradients \(m_{k}\) and the past squared gradients \(v_{k}\) are, respectively, given as follows:

$$\begin{aligned} m_{k}= & {} \beta _{1}m_{k-1}+(1-\beta _{1})g_{k},\end{aligned}$$
(11)
$$\begin{aligned} v_{k}= & {} \beta _{2}v_{k-1}+(1-\beta _{2})g_{k}^{2}, \end{aligned}$$
(12)

where \(g_{k}\) is the gradient at time step k, the parameter \(\beta _{1}\) is the exponential decay rate for the first moment (mean) estimates of the gradient, and the parameter \(\beta _{2}\) is the exponential decay rate for the second moment (uncentered variance) estimates of the gradient. Since the average of the past gradients \(m_{k}\) is the first moment, it resembles the momentum term that records the past gradients, while the squared-gradient average \(v_{k}\) is the second moment that gives different learning rates to different parameters.
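In code, one step of the moving-average recursions (11) and (12) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def update_moments(m, v, g, beta1=0.9, beta2=0.999):
    """One step of the exponential moving averages (11)-(12) for gradient g."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment estimate, Eq. (11)
    v = beta2 * v + (1.0 - beta2) * g**2     # second (uncentered) moment, Eq. (12)
    return m, v

# both moments start at zero, as noted below
m, v = update_moments(np.zeros(3), np.zeros(3), g=np.array([0.1, -0.2, 0.05]))
```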

The moment estimates are biased towards zero, especially during the initial time steps and when the decay rates are low. These biases can be counteracted by using the bias-corrected first and second moment estimates given by

$$\begin{aligned} \hat{m}_{k}= & {} \frac{m_{k}}{1-\beta _{1}^{k}},\end{aligned}$$
(13)
$$\begin{aligned} \hat{v}_{k}= & {} \frac{v_{k}}{1-\beta _{2}^{k}}. \end{aligned}$$
(14)

When the moments \(m_{k}\) and \(v_{k}\) are expanded in terms of the gradients \(g_{i}\), it is found that, after dividing by the correction factor \(1-\beta ^{k}\), the coefficients of all gradients \(g_{i}\) sum to 1, so this step is called the normalized correction. Because both moments \(m_{k}\) and \(v_{k}\) are initialized to 0 and the gradients have not yet accumulated in the first few iterations, the values of \(m_{k}\) and \(v_{k}\) are close to 0. In particular, the parameter \(\beta _{2}\) is usually set closer to 1 than \(\beta _{1}\), which, without correction, would make the initial update step size too large. Through the normalization correction, the moments \(m_{k}\) and \(v_{k}\) are enlarged so that the moment estimates \(\hat{m}_{k}\) and \(\hat{v}_{k}\) for small k are on the same level as the moment estimates obtained once the gradients have been fully accumulated at large k.
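This normalization can be checked with a few lines of arithmetic: expanding the recursion (11) shows that the coefficients on \(g_{1},\dots ,g_{k}\) sum to \(1-\beta _{1}^{k}\), so dividing by that factor in (13) makes them sum to one. A sketch with hypothetical values \(\beta _{1}=0.9\) and \(k=3\):

```python
beta1, k = 0.9, 3
# coefficient of g_i in m_k after expanding the recursion (11)
coeffs = [(1 - beta1) * beta1 ** (k - i) for i in range(1, k + 1)]
print(sum(coeffs))                                   # 0.271 = 1 - beta1**k
print(sum(c / (1 - beta1 ** k) for c in coeffs))     # 1.0 after the correction (13)
```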

The Adam algorithm updates the exponential moving averages of the past gradients \(m_{k}\) and past squared gradients \(v_{k}\) using the hyper-parameters \(\beta _{1}, \beta _{2}\in [0, 1)\) to control the exponential decay rates of the moving averages in (11) and (12), with the bias corrections (13) and (14). The Adam algorithm has the following updating rule,

$$\begin{aligned} w^{(k+1)}=w^{(k)}-\alpha _{r}\times \frac{\hat{m}_{k}}{\sqrt{\hat{v}_{k}}+\delta }, \end{aligned}$$
(15)

where \(\alpha _{r}, \delta > 0\). In the Adam algorithm, the learning rate increases or decreases depending on the gradient value of the loss function: it is lower for larger gradient values and larger for smaller gradient values. Hence, learning decelerates at steeper parts and speeds up at shallower parts of the loss function curve.

The effective learning rate of the Adam algorithm is \(\alpha _{r}(\sqrt{\hat{v}_{k}}+\delta )^{-1}\). Its value varies from one iteration to the next because the parameter \(\alpha _{r}\) is divided by the square root of the mean of the squared gradients over roughly the last \((1-\beta _{2})^{-1}\) iterations. Since the gradient of each parameter is different, the learning rate of each parameter is not the same even within the same iteration. Moreover, the update direction is determined not only by the gradient \(g_{k}\) of the current iteration but also by the average of the gradients over roughly the last \((1-\beta _{1})^{-1}\) iterations.

The parameter \(\delta \) is a small number that prevents division by zero during the implementation of the algorithm. Assuming \(\delta =0\), the effective step taken in the parameter space at iteration k is \({\varDelta }_{k}=\alpha _{r}\times \hat{m}_{k}/\sqrt{\hat{v}_{k}}\). A smaller signal-to-noise ratio (SNR), represented by \(\hat{m}_{k}/\sqrt{\hat{v}_{k}}\), indicates greater uncertainty about whether the direction of \(\hat{m}_{k}\) corresponds to the direction of the true gradient, and the effective step size is then closer to zero. When approaching an optimum, the noise in all directions becomes large relative to the signal, so the SNR tends to zero and the update step size quickly shrinks towards 0; this is called automatic annealing in [10]. When a saddle point is encountered, the noise generated by moving around the saddle point can quickly make the current point escape from it.

The calculation procedure of the Adam algorithm is summarized as Algorithm 1.

Algorithm 1 (Adam algorithm).

Remark 1

The default values for the decay rates [10] are \(\beta _{1} = 0.9\) and \(\beta _{2} = 0.999\), the smoothing term is \(\delta = 10^{-8}\), the tolerance is \(\varepsilon = 10^{-6}\), and the learning rate is \(\alpha _{r} = 0.001\).
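Since the algorithm listing in Algorithm 1 is reproduced here only as a figure, the following is a minimal sketch of the Adam loop with the default values of Remark 1; the stopping rule on the gradient norm and the function names are illustrative assumptions, not taken from the original listing.

```python
import numpy as np

def adam(grad, w, alpha_r=0.001, beta1=0.9, beta2=0.999,
         delta=1e-8, eps=1e-6, max_iter=100_000):
    """Sketch of Algorithm 1 (Adam): iterate the update rule (15)."""
    m, v = np.zeros_like(w), np.zeros_like(w)
    for k in range(1, max_iter + 1):
        g = grad(w)                                   # gradient g_k at step k
        if np.linalg.norm(g) < eps:                   # illustrative stopping rule
            break
        m = beta1 * m + (1 - beta1) * g               # (11)
        v = beta2 * v + (1 - beta2) * g ** 2          # (12)
        m_hat = m / (1 - beta1 ** k)                  # bias correction (13)
        v_hat = v / (1 - beta2 ** k)                  # bias correction (14)
        w = w - alpha_r * m_hat / (np.sqrt(v_hat) + delta)   # update rule (15)
    return w, k

# usage on a simple quadratic as a stand-in for the mean-VaR gradient (6)
w_star, iters = adam(lambda w: 2.0 * (w - 1.0), w=np.zeros(3))
```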

3.3 AdamSE Algorithm

From the perspective of sampling theory, the standard error measures the discrepancy between the sample mean and the population mean [11]. In other words, the standard error measures, via the standard deviation, how accurately a sample distribution represents the population. The standard error is defined by

$$\begin{aligned} SE=\frac{\sigma }{\sqrt{n}}, \end{aligned}$$
(16)

where \(\sigma \) is the population standard deviation and n is the sample size of the sampling distribution concerned. The standard error increases when the population standard deviation increases, and decreases when the sample size increases. According to the central limit theorem [8], as the sample size grows towards the actual population size, the sample means cluster increasingly tightly around the true population mean.
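For instance, with a hypothetical population standard deviation \(\sigma = 0.02\), (16) gives a standard error of 0.02 for n = 1, 0.01 for n = 4 and about 0.0067 for n = 9:

```python
import numpy as np

sigma = 0.02                        # hypothetical population standard deviation
for n in (1, 4, 9):
    print(n, sigma / np.sqrt(n))    # the standard error (16) shrinks as n grows
```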

From this observation, we note that the Adam algorithm uses sampled gradients. Therefore, we assume that multiple gradient samples can be generated with a fixed sample size. From this point of view, we hypothesize that the standard error can be reduced and that it is more appropriate to use than the standard deviation, because the standard error varies with the sample size whereas the standard deviation does not. Thus, to improve the updating rule of the Adam algorithm, it is assumed that the sampling distribution of the average past gradient \(m_{k}\) follows a normal distribution characterized by the bias-corrected first and second moments \(\hat{m}_{k}\) and \(\hat{v}_{k}\). The standard error of the bias-corrected first moment estimate \(\hat{m}_{k}\), analogous to (16), is then defined as

$$\begin{aligned} \hat{s}_{k}=\frac{\sqrt{\hat{v}_{k}}+\delta }{\sqrt{n}}, \end{aligned}$$
(17)

where n is the number of samples of the average past gradient \(m_{k}\) and \(\delta \) is a very small positive number that prevents division by zero during the implementation. As a result of (17), the updating rule (15) of the Adam algorithm is replaced by

$$\begin{aligned} w^{(k+1)}=w^{(k)} - \alpha _{r}\times \frac{\hat{m}_{k}}{\hat{s}_{k}}, \end{aligned}$$
(18)

as the updating rule of the AdamSE algorithm.
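Compared with (15), the only change is in the denominator: the update divides by the standard error (17) rather than by \(\sqrt{\hat{v}_{k}}+\delta \). A one-step sketch (the function name is illustrative):

```python
import numpy as np

def adamse_step(w, m_hat, v_hat, n_samples, alpha_r=0.001, delta=1e-8):
    """One AdamSE update: Eq. (18) with the standard error s_hat from (17)."""
    s_hat = (np.sqrt(v_hat) + delta) / np.sqrt(n_samples)   # (17)
    return w - alpha_r * m_hat / s_hat                       # (18)
```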

Note that the standard error is never larger than the standard deviation [22]. Thus, with a smaller standard error in the denominator, the step size of the AdamSE algorithm becomes more effective than the step size of the Adam algorithm, and this more effective step size speeds up the search for the optimum. We express this result in the following theorem.

Theorem 1

Suppose that the step size of the AdamSE algorithm is

$$\begin{aligned} {\varDelta }_{k}=\alpha _{r}\times \frac{\hat{m}_{k}}{\hat{s}_{k}}. \end{aligned}$$
(19)

Then, the effective step of the AdamSE algorithm is at least as large as the effective step of the Adam algorithm, so the convergence rate of the AdamSE algorithm is better than the convergence rate of the Adam algorithm. That is,

$$\begin{aligned} \Vert w^{(k+1)}-w^{(k)}\Vert _{\text {AdamSE}}\ge \Vert w^{(k+1)}-w^{(k)}\Vert _{\text {Adam}}. \end{aligned}$$
(20)

Proof

From (17) and (19), consider the step size of the AdamSE algorithm,

$$\begin{aligned} {\varDelta }_{k}=\alpha _{r}\times \frac{\hat{m}_{k}}{\hat{s}_{k}} =\alpha _{r}\times \frac{\hat{m}_{k}}{\frac{\sqrt{\hat{v}_{k}}+\delta }{\sqrt{n}}} =\sqrt{n}\times \alpha _{r}\times \frac{\hat{m}_{k}}{\sqrt{\hat{v}_{k}}+\delta }. \end{aligned}$$

Since \(\sqrt{n}\ge 1\) for \(n\ge 1\), the effective step of the AdamSE algorithm is at least as large as the effective step \(\alpha _{r}\times \hat{m}_{k}/(\sqrt{\hat{v}_{k}}+\delta )\) of the Adam algorithm in (15), and (20) follows. This completes the proof. \(\square \)

The calculation procedure for the AdamSE algorithm is summarized as Algorithm 2.

Algorithm 2 (AdamSE algorithm).

Remark 2

The default values for the decay rates [10] are \(\beta _{1} = 0.9\) and \(\beta _{2} = 0.999\), the smoothing term is \(\delta = 10^{-8}\), the tolerance is \(\varepsilon = 10^{-6}\), and the learning rate is \(\alpha _{r} = 0.001\). These values are the same as in the Adam algorithm.
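Analogously to the Adam sketch above, a minimal sketch of Algorithm 2 differs only in the update line, with the sample size n entering through the standard error (17); the names and stopping rule are again illustrative assumptions.

```python
import numpy as np

def adamse(grad, w, n_samples=1, alpha_r=0.001, beta1=0.9, beta2=0.999,
           delta=1e-8, eps=1e-6, max_iter=100_000):
    """Sketch of Algorithm 2 (AdamSE): Adam with the update rule (18)."""
    m, v = np.zeros_like(w), np.zeros_like(w)
    for k in range(1, max_iter + 1):
        g = grad(w)
        if np.linalg.norm(g) < eps:
            break
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat, v_hat = m / (1 - beta1 ** k), v / (1 - beta2 ** k)
        s_hat = (np.sqrt(v_hat) + delta) / np.sqrt(n_samples)   # standard error (17)
        w = w - alpha_r * m_hat / s_hat                          # update rule (18)
    return w, k
```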

4 Illustrative Results

The optimal portfolio weights after implementing the Adam and AdamSE algorithms are shown in Table 1, where only one sample \((n = 1)\) of the past gradients is employed in the AdamSE algorithm.

Table 1 Optimal portfolio weights
Table 2 Performance of algorithms
Table 3 Optimal portfolio weights for different sample sizes
Table 4 Performance of AdamSE algorithm
Table 5 Portfolio risk

Moreover, from Table 2, the AdamSE algorithm takes 43 iterations to converge, which is 76.8% fewer than the 185 iterations required by the Adam algorithm. From this result, we can see that the two algorithms give the same optimal weights for the mean-VaR model. This shows that the AdamSE algorithm performs as well as the Adam algorithm in providing an optimal solution to the mean-VaR portfolio optimization problem.

Table 3 shows the simulation results when we consider different sample sizes of past gradients, \(n = 1, 2, \dots , 9\), using the AdamSE algorithm. Although more samples could be considered, we restrict the number of samples in this simulation to avoid potential numerical problems such as divergence.

In addition, the performance of the AdamSE algorithm, measured in terms of the number of iterations for these sample sizes, is shown in Table 4. From the theoretical results, the smaller the number of iterations, the faster the algorithm converges. However, when using the AdamSE algorithm, there is no linear relationship between the number of samples and the number of iterations, since the number of iterations decreases after a sample size of 5. Thus, the optimal weights are robust solutions that are not affected by the number of iterations.

The portfolio risk of the mean-VaR model under different confidence levels is shown in Table 5. We consider only the 90%, 95% and 99% confidence levels; the portfolio risk increases as the confidence level increases. This indicates that the portfolio investment of the EPF becomes riskier at higher confidence levels, in the sense that the estimated maximum loss on the investment becomes larger.
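For completeness, the portfolio risk reported in Table 5 is simply the objective (1) evaluated at the optimal weights with the one-sided z-score of each confidence level; a sketch is given below, where w_opt and Sigma stand in for the Table 1 weights and the Section 2 covariance.

```python
import numpy as np

z_scores = {"90%": 1.282, "95%": 1.645, "99%": 2.326}   # one-sided normal quantiles
dt = 260                                                # holding period from Section 2

def portfolio_var(w, Sigma, z_alpha):
    """Objective (1) at the optimal weights: the portfolio VaR."""
    return z_alpha * np.sqrt(w @ Sigma @ w) * np.sqrt(dt)

# w_opt, Sigma = ...   # the Table 1 weights and the 10x10 covariance from Section 2
# for level, z in z_scores.items():
#     print(level, portfolio_var(w_opt, Sigma, z))
```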

5 Concluding Remarks

This paper discussed an improvement of the Adam algorithm in which the standard error is added to the updating rule. The aim is to improve the convergence rate of the Adam algorithm, so the improved algorithm is named the AdamSE algorithm. For illustration, a mean-VaR portfolio optimization problem for the EPF was formulated and solved using both the Adam and AdamSE algorithms, giving rise to their respective optimal weights, which turn out to be the same. However, the AdamSE algorithm took fewer iterations to converge. In our study, nine samples of past gradients were simulated through sampling, and the different iteration counts produced the same robust optimal weights. From these results, we conclude that the AdamSE algorithm is an efficient algorithm for handling the mean-VaR portfolio optimization problem. For future research, the practicality of the AdamSE algorithm will be investigated for solving nonlinear stochastic optimization problems.