
A simultaneous perturbation weak derivative estimator for stochastic neural networks

  • Original Paper
  • Computational Management Science

Abstract

In this paper we study gradient estimation for a network of nonlinear stochastic units known as the Little model. Many machine learning systems can be described as networks of homogeneous units, and the Little model is of a particularly general form, which includes several popular machine learning architectures as special cases. However, since no closed-form expression for its stationary distribution is known, gradient methods that work for similar models, such as the Boltzmann machine or the sigmoid belief network, cannot be used. To address this, we introduce a method for calculating derivatives for this system based on measure-valued differentiation and simultaneous perturbation. This extends previous work in which gradient estimation algorithms were presented for networks with restrictive features such as symmetry or acyclic connectivity.


Fig. 1


Notes

  1. This norm for matrices is defined as \(\Vert w\Vert _{\infty } = \sup _{\Vert u\Vert _{\infty }=1}\Vert wu\Vert _{\infty }\), where for the vectors u and wu the norm \(\Vert \cdot \Vert _{\infty }\) is defined in the usual way.

  2. http://yann.lecun.com/exdb/mnist/.
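For reference, the induced \(\infty\)-norm of footnote 1 reduces to the maximum absolute row sum of the matrix; a minimal sketch (the helper name is ours, not the paper's):

```python
def matrix_inf_norm(w):
    # induced infinity-norm: maximum over rows of the sum of absolute entries
    return max(sum(abs(e) for e in row) for row in w)

print(matrix_inf_norm([[1.0, -2.0],
                       [3.0,  4.0]]))  # 7.0
```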



Corresponding author

Correspondence to Thomas Flynn.


Appendix A: Derivations related to the Little model

A.1 Derivation of Eq. 12

Fix an \(x^0\) and a direction \(v \in \mathbb {R}^{n\times n}\times \mathbb {R}^n\). Then

$$\begin{aligned}&\nabla _{\lambda }P_{\theta + \lambda v}(x^{0},x^{1})\\&\quad = P_{\theta + \lambda v}(x^0,x^1) \nabla _{\lambda }\log P_{\theta + \lambda v}(x^0,x^1) \\&\quad = P_{\theta + \lambda v}(x^0,x^1) \sum \limits _{i=1}^{n}\nabla _{\lambda } \log \left( \sigma ( (x_{i}^{1})^{\dag }u_{i}(x^{0},\theta +\lambda v))\right) \\&\quad = P_{\theta + \lambda v}(x^0,x^1) \sum \limits _{i=1}^{n} \left( 1 -\sigma ( (x_{i}^{1})^{\dag }u_{i}(x^{0},\theta +\lambda v))\right) (x_{i}^{1})^{\dag } \nabla _{\lambda }u_{i}(x^{0},\theta +\lambda v) \\&\quad = P_{\theta + \lambda v}(x^0,x^1) \sum \limits _{i=1}^{n} \left( 1 -\sigma ( (x_{i}^{1})^{\dag }u_{i}(x^{0},\theta +\lambda v))\right) (x_{i}^{1})^{\dag } \left( \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_{i}\right) . \end{aligned}$$

Evaluating this at \(\lambda =0\) we find that

$$\begin{aligned} \nabla _{\theta }P_{\theta }(x^{0},x^{1})v = P_{\theta }(x^0,x^1) \sum \limits _{i=1}^{n}(1 -\sigma ( (x_{i}^{1})^{\dag }u_{i}(x^{0},\theta ))) (x_{i}^{1})^{\dag } \left( \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_{i}\right) . \end{aligned}$$
(24)
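As a sanity check, the directional derivative (24) can be compared against a central finite difference. The sketch below assumes the transition probability factorizes as \(P_{\theta }(x^0,x^1)=\prod _{i}\sigma ((x_{i}^{1})^{\dag }u_{i}(x^{0},\theta ))\) with \(u_{i}(x^{0},\theta ) = \sum _{j}w_{i,j}x^{0}_{j} + b_{i}\) and the convention \(x^{\dag } = 2x-1\); all function and variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def u(i, x0, w, b):
    # local field of unit i given the previous state x0 (assumed form)
    return sum(w[i][j] * x0[j] for j in range(len(x0))) + b[i]

def P(x0, x1, w, b):
    # one-step transition probability; x_dag = 2*x - 1 maps {0,1} to {-1,+1}
    p = 1.0
    for i in range(len(x1)):
        p *= sigmoid((2 * x1[i] - 1) * u(i, x0, w, b))
    return p

def dP(x0, x1, w, b, vw, vb):
    # directional derivative in the direction (vw, vb), as in Eq. (24)
    s = 0.0
    for i in range(len(x1)):
        xd = 2 * x1[i] - 1
        s += (1 - sigmoid(xd * u(i, x0, w, b))) * xd * (
            sum(vw[i][j] * x0[j] for j in range(len(x0))) + vb[i])
    return P(x0, x1, w, b) * s

x0, x1 = [1, 0], [0, 1]
w = [[0.5, -0.3], [0.2, 0.1]]
b = [0.1, -0.2]
vw = [[1.0, 0.0], [0.0, -1.0]]
vb = [0.5, 0.25]
eps = 1e-5
wp = [[w[i][j] + eps * vw[i][j] for j in range(2)] for i in range(2)]
bp = [b[i] + eps * vb[i] for i in range(2)]
wm = [[w[i][j] - eps * vw[i][j] for j in range(2)] for i in range(2)]
bm = [b[i] - eps * vb[i] for i in range(2)]
fd = (P(x0, x1, wp, bp) - P(x0, x1, wm, bm)) / (2 * eps)
print(abs(fd - dP(x0, x1, w, b, vw, vb)) < 1e-8)  # True
```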

Note also that

$$\begin{aligned} (1- \sigma ( x^{\dag }u))x^{\dag } = {\left\{ \begin{array}{ll} (1-\sigma (u)) &{}\text { if } x = 1 \\ -\sigma (u)&{}\text { if } x = 0 \end{array}\right. } \end{aligned}$$

which means

$$\begin{aligned} (1-\sigma (x^{\dag }u))x^{\dag } = x - \sigma (u). \end{aligned}$$
(25)
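Identity (25) is easy to verify numerically, again under the assumed convention \(x^{\dag } = 2x - 1\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# verify (1 - sigmoid(x_dag * u)) * x_dag == x - sigmoid(u)
# for both binary states, with the assumed mapping x_dag = 2*x - 1
for x in (0, 1):
    for u in (-1.5, 0.0, 2.0):
        x_dag = 2 * x - 1
        lhs = (1 - sigmoid(x_dag * u)) * x_dag
        assert abs(lhs - (x - sigmoid(u))) < 1e-12
print("identity (25) holds")
```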

Combining (24) and (25),

$$\begin{aligned}&\nabla _{\theta }\textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^{0},x^{1})v\nonumber \\&\quad = \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1) \textstyle \sum \limits _{i=1}^{n}(1 -\sigma ( (x_{i}^{1})^{\dag }u_{i}(x^{0},\theta )))(x_{i}^{1})^{\dag } \left( \textstyle \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_i\right) \nonumber \\&\quad = \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1) \textstyle \sum \limits _{i=1}^{n}(x_{i}^{1} -\sigma (u_{i}(x^{0},\theta ))) \left( \textstyle \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_i\right) \nonumber \\&\quad = \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\left[ \textstyle \sum \limits _{i=1}^{n}x_{i}^{1} \left( \textstyle \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_i\right) \right. \nonumber \\&\qquad \left. - \textstyle \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}v_{i,j}x^{0}_{j} + v_i\right) \right] . \end{aligned}$$
(26)

Splitting each \(v_{i,j}\) and \(v_i\) into positive and negative parts,

$$\begin{aligned}&= \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\left[ \textstyle \sum \limits _{i=1}^{n}x_{i}^{1} \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_+\right) \right. \nonumber \\&\qquad \left. - \textstyle \sum \limits _{i=1}^{n}x_{i}^{1} \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_{i})_{-}\right) \right] \nonumber \\&- \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\left[ \textstyle \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_{+}\right) \right. \nonumber \\&\qquad \left. - \textstyle \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_{i})_{-}\right) \right] \nonumber \\&= \left( \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\left[ \textstyle \sum \limits _{i=1}^{n}x_{i}^{1} \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_i)_{+}\right) \right. \right. \nonumber \\&\qquad \left. \left. + \textstyle \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_{i})_{-}\right) \right] \right) \nonumber \\&\quad - \left( \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\left[ \textstyle \sum \limits _{i=1}^{n}x_{i}^{1} \left( \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_i)_{-}\right) \right. \right. \nonumber \\&\qquad \left. \left. + \textstyle \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_{+}\right) \right] \right) \nonumber \\&= \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\sum \limits _{i=1}^{n} \left[ x_{i}^{1}\left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_{+}\right) \right. \nonumber \\&\qquad \left. 
+ \sigma (u_{i}(x^{0},\theta )) \left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_i)_{-} \right) \right] \nonumber \\&\quad - \textstyle \sum \limits _{x^{1}}e(x^{1})P_{\theta }(x^0,x^1)\textstyle \sum \limits _{i=1}^{n} \left[ x_{i}^{1}\left( \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_i)_-\right) \right. \nonumber \\&\qquad \left. + \sigma (u_{i}(x^{0},\theta ))\left( \textstyle \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_i)_{+}\right) \right] .\nonumber \\ \end{aligned}$$
(27)

Note that

$$\begin{aligned}&\sum \limits _{x^{1}}P_{\theta }(x^0,x^1)\sum \limits _{i=1}^{n} \left[ x_{i}^{1}\left( \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_{+}\right) \right. \nonumber \\&\qquad \left. + \sigma (u_{i}(x^{0},\theta )) \left( \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_{i})_{-}\right) \right] \nonumber \\&\quad =\sum \limits _{i=1}^{n}\left( \sum \limits _{x^{1}}P_{\theta }(x^0,x^1)x_{i}^{1}\right) \left( \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_{i})_{+}\right) \nonumber \\&\qquad + \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_i)_{-}\right) \nonumber \\&\quad =\sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \sum \limits _{j=1}^{n}(v_{i,j})_{+}x^{0}_{j} + (v_i)_{+}\right) \nonumber \\&\qquad + \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta )) \left( \sum \limits _{j=1}^{n}(v_{i,j})_{-}x^{0}_{j} + (v_{i})_{-}\right) \nonumber \\&\quad = \sum \limits _{i=1}^{n}\sigma (u_{i}(x^{0},\theta ))|v_i| + \sum \limits _{i=1}^{n}\sum \limits _{j=1}^{n} \sigma (u_{i}(x^{0},\theta ))|v_{i,j}|x^{0}_{j}. \end{aligned}$$
(28)

Combining (27) with (28) and the definitions (13) and (14) we obtain (12).
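The structure of (27), a difference of two nonnegative terms, is the usual weak-derivative (Hahn–Jordan) decomposition: the derivative of an expectation is rewritten as a constant times the difference of expectations under two probability measures. A minimal finite-state sketch of that idea (a generic illustration, not the paper's estimator):

```python
def weak_derivative(grad_p):
    # Hahn-Jordan decomposition of the derivative of a finite probability
    # vector: grad_p = c * (p_plus - p_minus), where p_plus and p_minus are
    # probability vectors (assumes grad_p is not identically zero)
    pos = [max(g, 0.0) for g in grad_p]
    neg = [max(-g, 0.0) for g in grad_p]
    c = sum(pos)  # equals sum(neg), since the probabilities sum to 1
    return c, [g / c for g in pos], [g / c for g in neg]

# Bernoulli(theta): p_theta = (1 - theta, theta), so d/dtheta = (-1, 1)
grad_p = [-1.0, 1.0]
e = [2.0, 5.0]  # a cost attached to the two states
c, p_plus, p_minus = weak_derivative(grad_p)
mvd = c * (sum(a * b for a, b in zip(e, p_plus))
           - sum(a * b for a, b in zip(e, p_minus)))
direct = sum(a * g for a, g in zip(e, grad_p))
print(mvd, direct)  # 3.0 3.0
```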

A.2 Derivation of Eqs. 17 and 18

We have

$$\begin{aligned} Q(x_1)= & {} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} Q(x_1,x_2,\ldots ,x_n) \nonumber \\= & {} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=1}^{n} \beta _i^{x_i} (1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=1}^{n}\alpha _{i}x_i \right) \nonumber \\= & {} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \beta _1^{x_1}(1-\beta _1)^{1-x_{1}} \left( d + \alpha _1x_1 + \sum \limits _{i=2}^{n}\alpha _{i}x_i \right) \nonumber \\= & {} \beta _1^{x_1}(1-\beta _1)^{1-x_1} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \alpha _1x_1 + \sum \limits _{i=2}^{n}\alpha _{i}x_i \right) \nonumber \\= & {} \beta _1^{x_1}(1-\beta _1)^{1-x_1} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c}\prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=2}^{n}\alpha _{i}x_i \right) \nonumber \\&\qquad + \beta _1^{x_1}(1-\beta _1)^{1-x_1} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c}\prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i}\alpha _1x_1 \nonumber \\= & {} \beta _1^{x_1}(1-\beta _1)^{1-x_1} \alpha _1x_1\frac{1}{c}\nonumber \\&\qquad + \beta _1^{x_1}(1-\beta _1)^{1-x_1} \frac{1}{c} \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i}\left( d + \sum \limits _{i=2}^{n}\alpha _{i}x_i \right) .\nonumber \\ \end{aligned}$$
(29)

To simplify this equation, note that for \(n>1\),

$$\begin{aligned}&\sum \limits _{x_1 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \prod _{i=1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=1}^{n}\alpha _ix_i\right) \nonumber \\&\quad = \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \left[ \beta _1\prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=2}^n\alpha _ix_i + \alpha _1\right) \right. \nonumber \\&\qquad \left. + (1-\beta _1)\prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=2}^n\alpha _ix_i\right) \right] \nonumber \\&\quad =\sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}}\left[ \beta _1\alpha _1\prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} + \prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=2}^n\alpha _ix_i\right) \right] \nonumber \\&\quad = \beta _1\alpha _1 + \sum \limits _{x_2 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \prod _{i=2}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=2}^n\alpha _ix_i\right) ,\nonumber \\ \end{aligned}$$
(30)

and if \(n=1\) then

$$\begin{aligned} \begin{aligned} \sum \limits _{x_1 \in \{0,1\}}\prod _{i=1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=1}^{n}\alpha _ix_i\right)&= \beta _1(d + \alpha _1) + (1-\beta _1)d \\&= \beta _1d + \beta _1\alpha _1 + d - \beta _1d = \beta _1\alpha _1 + d. \end{aligned} \end{aligned}$$
(31)

Combining Eqs. 30 and 31, we see that for any \(n\ge 1\),

$$\begin{aligned} \sum \limits _{x_1 \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \prod _{i=1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=1}^{n}\alpha _ix_i \right) = d + \sum \limits _{i=1}^{n}\beta _i\alpha _i. \end{aligned}$$
(32)
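Equation (32) is the statement that the expectation of an affine function of independent Bernoulli(\(\beta _i\)) variables equals \(d + \sum _i \beta _i\alpha _i\); a brute-force enumeration confirms it (helper name is ours):

```python
from itertools import product

def bernoulli_affine_expectation(beta, alpha, d):
    # brute-force LHS of Eq. (32): enumerate every x in {0,1}^n
    total = 0.0
    for x in product((0, 1), repeat=len(beta)):
        w = 1.0
        for bi, xi in zip(beta, x):
            w *= bi if xi else (1 - bi)
        total += w * (d + sum(a * xi for a, xi in zip(alpha, x)))
    return total

beta = [0.2, 0.7, 0.5]
alpha = [1.0, -3.0, 4.0]
d = 2.5
closed = d + sum(b * a for b, a in zip(beta, alpha))  # RHS of Eq. (32)
print(abs(bernoulli_affine_expectation(beta, alpha, d) - closed) < 1e-12)  # True
```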

Combining Eqs. 29 and 32,

$$\begin{aligned} Q(x_1)&= \beta _1^{x_1}(1-\beta _1)^{1-x_1}\frac{\alpha _1x_1}{c} + \beta _1^{x_1}(1-\beta _1)^{1-x_1}\frac{1}{c} \left( d + \sum \limits _{i=2}^{n}\beta _i\alpha _i\right) \\&= \beta _1^{x_1}(1-\beta _1)^{1-x_1}\frac{1}{c} \left[ d + \alpha _1x_1 + \sum \limits _{i=2}^{n}\beta _i\alpha _i\right] . \end{aligned}$$

In general,

$$\begin{aligned}&Q(x_k, x_{k-1},\ldots , x_1) \\&\quad = \textstyle \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} Q(x_1,\ldots ,x_k,x_{k+1},\ldots ,x_n) \\&\quad = \textstyle \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c}\prod _{i=1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=1}^{n}\alpha _{i}x_i\right) \\&\quad = \textstyle \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=k+1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \\&\quad \quad \times \beta _k^{x_k}(1-\beta _k)^{1-x_k} \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( \sum \limits _{i=1}^{k-1}\alpha _ix_i + \alpha _kx_k + d + \sum \limits _{i=k+1}^{n}\alpha _{i}x_i \right) \\&\quad = \textstyle \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=k+1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \beta _k^{x_k}(1-\beta _k)^{1-x_k}\\&\qquad \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( \alpha _kx_k + \sum \limits _{i=1}^{k-1}\alpha _ix_i\right) \\&\qquad + \textstyle \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}} \frac{1}{c} \prod _{i=k+1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \beta _k^{x_k}(1-\beta _k)^{1-x_k}\\&\qquad \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=k+1}^{n}\alpha _{i}x_i \right) \\&\quad = \textstyle \beta _k^{x_k}(1-\beta _k)^{1-x_k}\frac{1}{c} \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( \alpha _kx_k + \sum \limits _{i=1}^{k-1}\alpha _ix_i\right) \\&\qquad + \textstyle \beta _k^{x_k}(1-\beta _k)^{1-x_k}\frac{1}{c} \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \sum \limits _{x_{k+1} \in \{0,1\}}\ldots \sum \limits _{x_n \in \{0,1\}}\\&\qquad \prod _{i=k+1}^{n}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d + \sum \limits _{i=k+1}^{n}\alpha _{i}x_i \right) \\&\quad = \textstyle \beta _k^{x_k}(1-\beta _k)^{1-x_k}\frac{1}{c} \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( \alpha _kx_k + \sum \limits _{i=1}^{k-1}\alpha _ix_i\right) \\&\qquad + \textstyle \beta _k^{x_k}(1-\beta _k)^{1-x_k}\frac{1}{c} \prod _{i=1}^{k-1}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d +\sum \limits _{i=k+1}^{n}\beta _i\alpha _i\right) \\&\quad = \textstyle \frac{1}{c} \prod _{i=1}^{k}\beta _i^{x_i}(1-\beta _i)^{1-x_i} \left( d+ \sum \limits _{i=1}^{k}\alpha _ix_i + \sum \limits _{i=k+1}^{n}\beta _i\alpha _i\right) . \end{aligned}$$
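The closed-form marginal above can be checked against brute-force marginalization of \(Q\) for small \(n\) (function names are ours; \(c\) is treated as a given normalizing constant):

```python
from itertools import product

def Q(x, beta, alpha, d, c):
    # Q(x) = (1/c) * prod_i beta_i^{x_i}(1-beta_i)^{1-x_i} * (d + sum_i alpha_i x_i)
    w = 1.0
    for bi, xi in zip(beta, x):
        w *= bi if xi else (1 - bi)
    return w * (d + sum(a * xi for a, xi in zip(alpha, x))) / c

def marginal_bruteforce(xk, beta, alpha, d, c):
    # sum Q over the unfixed coordinates x_{k+1}, ..., x_n
    n, k = len(beta), len(xk)
    return sum(Q(list(xk) + list(rest), beta, alpha, d, c)
               for rest in product((0, 1), repeat=n - k))

def marginal_closed(xk, beta, alpha, d, c):
    # the closed form derived above
    k = len(xk)
    w = 1.0
    for bi, xi in zip(beta[:k], xk):
        w *= bi if xi else (1 - bi)
    return w * (d + sum(a * xi for a, xi in zip(alpha[:k], xk))
                + sum(b * a for b, a in zip(beta[k:], alpha[k:]))) / c

beta = [0.3, 0.8, 0.6, 0.5]
alpha = [2.0, -1.0, 0.5, 3.0]
d, c = 1.0, 4.2
xk = (1, 0)
print(abs(marginal_bruteforce(xk, beta, alpha, d, c)
          - marginal_closed(xk, beta, alpha, d, c)) < 1e-12)  # True
```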


About this article


Cite this article

Flynn, T., Vázquez-Abad, F. A simultaneous perturbation weak derivative estimator for stochastic neural networks. Comput Manag Sci 16, 715–738 (2019). https://doi.org/10.1007/s10287-019-00357-1
