1 Introduction

In this paper, we concentrate on stochastic optimization problems of the form

\begin{aligned} \begin{aligned}&\min _{u \in \mathcal {U}}\, \left\{ j(u)\;{:}{=}\; {\mathbb {E}}\left[ {J}(u,\varvec{\xi }) \right] = \int _\varOmega J(u, \varvec{\xi }(\omega )) \,\textrm{d}{\mathbb {P}}(\omega )\right\} \\&\text {subject to (s.t.)} \quad h_i(u) = 0 \quad i \in \mathcal {E}, \quad h_i(u) \le 0 \quad i \in \mathcal {I}. \end{aligned} \end{aligned}
(P)

Here, $$\mathcal {U}$$ is a Riemannian manifold and $$\varvec{\xi } :\varOmega \rightarrow \varXi \subset {\mathbb {R}}^m$$ is a random vector defined on a given probability space. We assume that we have deterministic constraints of the form $$\varvec{h}:\mathcal {U} \rightarrow {\mathbb {R}}^n$$, $$u \mapsto \varvec{h}(u) = (h_1(u), \dots , h_n(u))^\top$$, where we distinguish between the index set $$\mathcal {E}$$ of equality constraints and the index set $$\mathcal {I}$$ of inequality constraints.

Our investigations are motivated by applications in shape optimization, where an objective functional is supposed to be minimized with respect to a shape, i.e., a subset of $${\mathbb {R}}^{d}$$. Finding a correct model to describe the set of shapes is one of the main challenges in shape optimization. From a theoretical and computational point of view, it is attractive to optimize in Riemannian manifolds because algorithmic ideas from [1] can be combined with approaches from differential geometry as outlined in [15]. This is one of the main reasons why we focus on Riemannian manifolds in this paper. One needs to take into account that these Riemannian manifolds could also be infinite-dimensional, e.g., the space of plane curves [37,38,39, 51], the space of piecewise-smooth curves [40], and the space of surfaces in higher dimensions [3, 4, 26, 30, 36]. Often, more than one shape needs to be considered, which leads to so-called multi-shape optimization problems. As applications, we can mention electrical impedance tomography, where the material distribution of electrical properties such as electric conductivity and permittivity inside the body is examined [11, 31, 33], and the optimization of biological cell composites in the human skin [45, 46].

If one focuses on one-dimensional shapes, the above-mentioned space of plane unparametrized curves is a prominent example of an infinite-dimensional manifold. In our numerical application (cf. Sect. 3), we also focus on this shape space. Our choice of this space comes from the fact that in shape optimization, the set of permissible shapes generally does not allow a vector space structure. One should note that there is no obvious distance measure without a vector space structure, which is a central difficulty in the formulation of efficient optimization methods. If one cannot work in vector spaces, Riemannian shape manifolds are the next best option, but they come with additional difficulties; see Sect. 2.3.

A central difficulty in (P) is that the constraints lead to a stochastic optimization problem that cannot be handled using standard techniques such as gradient descent or Newton’s method; additionally, the numerical solution of the problem may be intractable on account of the expectation. In this work, we propose a stochastic augmented Lagrangian method to solve problems of the form (P). The proposed method combines the smoothing properties of the augmented Lagrangian method with a reduction in complexity granted by stochastic approximation.

The augmented Lagrangian method has been extensively studied; see [7, 8] for an introduction to the method when $$\mathcal {U} = {\mathbb {R}}^n$$. Substantial theory can be found in the literature for PDE-constrained optimization, which is related to our setting in PDE-constrained shape optimization and where convergence has been studied in function spaces; see [21,22,23,24, 47]. This theory does not apply even for deterministic counterparts of (P) since our control variable u belongs to a Riemannian manifold, not a Banach space. The study of constrained optimization on Riemannian manifolds is still nascent. There are relatively recent advances in first-order optimality conditions in KKT form, including the development of constraint qualifications analogous to the finite-dimensional setting [6, 50]. The augmented Lagrangian method has recently been developed for Riemannian manifolds [27, 35, 49]. These methods have been developed for deterministic problems, however, and therefore cannot be applied to problems of the form (P).

Stochastic approximation is a class of algorithms that originated from the paper [41] and has developed in recent decades due to its applicability to high-dimensional stochastic optimization problems. Thanks in part to applications in machine learning, these algorithms are increasingly being developed in the setting of Riemannian optimization; see, e.g., [10, 25, 42, 52, 53]. The most basic algorithm is the stochastic gradient method, which can be used to solve an unconstrained version of (P), i.e., the problem of minimizing the expectation. Recently, the stochastic gradient method was proposed to handle PDE-constrained shape optimization problems [14, 15]. In [14], asymptotic convergence was proven for optimization variables belonging to a Riemannian manifold and the connection was made to shape optimization following the ideas in [48]. However, the stochastic gradient method cannot solve problems of the form (P).

While both augmented Lagrangian and stochastic approximation methods are well-developed, the combined method—what we call the stochastic augmented Lagrangian method—is not. In the context of training neural networks, a combined stochastic gradient/augmented Lagrangian approach in the same spirit as ours can be found in the paper [13]. Our method, however, involves a novel use of the randomized multi-batch stochastic gradient method from [18, 19], where a random number of stochastic gradient steps are chosen. We use this strategy to solve the inner loop optimization problem for fixed Lagrange multipliers and penalty parameters. A central consequence of the random stopping rule from [18, 19] is that convergence rates of the expected value of the norm of the gradient can be obtained, even in the nonconvex case. The random stopping rule in combination with an outer loop procedure can be used to adaptively adjust step sizes and batch sizes for a tractable algorithm where asymptotic convergence to stationary points of the original problem is guaranteed.

The paper is structured as follows. In Sect. 2, we present the stochastic augmented Lagrangian method for optimization on Riemannian manifolds and analyze its convergence. Then, an application for our method is introduced and results of numerical tests are presented in Sect. 3. To conclude, we summarize our results in Sect. 4.

2 Optimization Approach

In this section, we introduce the stochastic augmented Lagrangian method for Riemannian manifolds. In view of our later application to shape optimization, where convexity of the objective functional j cannot be expected, we focus on providing results for the nonconvex case. First, in Sect. 2.1, we will provide background material that will be of use in our analysis. In particular, definitions and theorems from differential topology and geometry that are required in this paper will be provided. For background details, we refer to, e.g., [28, 29, 32, 34] for differential geometry and [20] for probability theory. The algorithm is presented in Sect. 2.2. Convergence of the method is proven in two parts: in Sect. 2.3, we provide an efficiency estimate for the inner loop procedure, corresponding to a randomized multi-batch stochastic gradient method. Then, in Sect. 2.4, convergence rates with respect to the outer loop procedure, which corresponds to a stochastic augmented Lagrangian method, are given.

2.1 Background and Notation

We consider the Euclidean norm $$\left\| \cdot \right\| _2$$ on $${\mathbb {R}}^n$$ throughout the paper. For a differentiable Riemannian manifold $$(\mathcal {U},\mathcal {G})$$, $$\mathcal {G}=(\mathcal {G}_u)_{u\in \mathcal {U}}$$ denotes the Riemannian metric. The induced norm is denoted by $$\left\| \cdot \right\| _{\mathcal {G}} {:}{=}\sqrt{\mathcal {G}(\cdot ,\cdot )}$$. Here and throughout the manuscript, we frequently omit the subscript u from the metric when the context is clear. The tangent space of $$\mathcal {U}$$ at a point $$u \in \mathcal {U}$$ is defined in its geometric version as

\begin{aligned} T_u\mathcal {U}=\{ c:{\mathbb {R}}\rightarrow \mathcal {U}\mid c\text { differentiable}, c(0) =u\}/\sim , \end{aligned}

where the equivalence relation for two differentiable curves $$c,\tilde{c}:{\mathbb {R}}\rightarrow \mathcal {U}$$ with $$c(0) = \tilde{c}(0) =u$$ is defined as follows: $$c \sim \tilde{c} \Leftrightarrow \tfrac{\,\textrm{d}}{\,\textrm{d}t}\phi _{\alpha }(c(t))\vert _{t=0} =\tfrac{\,\textrm{d}}{\,\textrm{d}t} \phi _{\alpha }(\tilde{c}(t))\vert _{t=0}$$ for all $$\alpha$$ with $$u \in U_\alpha$$, where $$\{(U_\alpha , \phi _\alpha )\}_\alpha \text { is an atlas of }\mathcal {U}.$$ The derivative of a smooth mapping $$f:\mathcal {U}\rightarrow \widetilde{\mathcal {U}}$$ between two differentiable manifolds $$\mathcal {U}$$ and $$\widetilde{\mathcal {U}}$$ is defined using the pushforward. At a point $$u\in \mathcal {U}$$, it is defined by $$(f_*)_u :T_u \mathcal {U} \rightarrow T_{f(u)} \widetilde{\mathcal {U}}$$ with $$(f_*)_u(c){:}{=}\frac{\textrm{d}}{\textrm{d}t} f(c(t))\vert _{t=0} = (f \circ c)'(0),$$ where $$c:I\subset {\mathbb {R}}\rightarrow \mathcal {U}$$ is a differentiable curve with $$c(0)=u$$ and $$c'(0)\in T_u \mathcal {U}$$. In particular, $$f:\mathcal {U} \rightarrow \widetilde{\mathcal {U}}$$ is called $$\mathcal {C}^k$$ if $$\psi _\beta \circ f\circ \phi _\alpha ^{ -1}$$ is k-times continuously differentiable for all charts $$(U_\alpha ,\phi _\alpha )$$ of $$\mathcal {U}$$ and $$(V_\beta ,\psi _\beta )$$ of $$\widetilde{\mathcal {U}}$$ with $$f(U_\alpha )\subset V_\beta$$. In the case $$\widetilde{\mathcal {U}}={\mathbb {R}}$$, a Riemannian gradient $$\nabla f(u) \in T_u \mathcal {U}$$ is defined by the relation

\begin{aligned} (f_*)_u w = \mathcal {G}_u(\nabla f(u), w) \quad \forall w \in T_u \mathcal {U}. \end{aligned}
(1)
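To make relation (1) concrete in a simple finite-dimensional case: on the unit sphere $$S^{n-1} \subset {\mathbb {R}}^n$$ with the induced metric, the Riemannian gradient of a restricted smooth function is the orthogonal projection of the ambient Euclidean gradient onto the tangent space $$T_u S^{n-1} = \{v \in {\mathbb {R}}^n : v^\top u = 0\}$$. The following sketch is our own illustration (the shape manifolds considered later are infinite-dimensional):

```python
import numpy as np

def sphere_grad(euclid_grad, u):
    """Riemannian gradient on the unit sphere S^{n-1} with the induced
    metric: project the ambient Euclidean gradient onto the tangent
    space T_u = { v : v . u = 0 }, in accordance with relation (1)."""
    return euclid_grad - np.dot(euclid_grad, u) * u

# f(x) = a . x restricted to the sphere; its Euclidean gradient is a.
a = np.array([1.0, 2.0, 2.0])
u = np.array([0.0, 0.0, 1.0])          # base point, ||u|| = 1
g = sphere_grad(a, u)                  # tangent to the sphere at u
```

Testing the defining relation here amounts to checking that g is tangent at u and that the Euclidean inner product with any tangent vector w reproduces the directional derivative.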

We define $$V_u\;{:}{=}\;\{v\in T_u\mathcal {U}:1\in I_{u,v}^\mathcal {U}\}$$ with $$I_{u,v}^\mathcal {U}\;{:}{=}\; \bigcup \limits _{I\in \tilde{I}_{u,v}^\mathcal {U}} I$$, where

\begin{aligned} \begin{aligned} \tilde{I}_{u,v}^\mathcal {U}\;{:}{=}\; \{I\subset {\mathbb {R}}:&I \text { open, }0\in I\text {, there exists a geodesic } c:I\rightarrow \mathcal {U} \\&\text {satisfying }c(0)=u\in \mathcal {U},\, c'(0)=v\in T_u\mathcal {U} \}. \end{aligned} \end{aligned}

Then, we denote the exponential mapping by

\begin{aligned} \exp :\bigcup \limits _{u\in \mathcal {U}}\{u\}\times V_u \rightarrow \mathcal {U},\ (u,v)\mapsto \exp _u(v)\;{:}{=}\; c(1), \end{aligned}

where $$\exp _u(v)$$ is the exponential map of $$\mathcal {U}$$ at u, which assigns to every tangent vector $$v\in V_u$$ the point c(1) and $$c:I_{u,v}^\mathcal {U}\rightarrow \mathcal {U}$$ is the unique geodesic satisfying $$c(0)=u$$ and $$c'(0)=v$$.
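As an illustration (our example, not part of the paper's setting), the exponential map has a closed form on the unit sphere $$S^{2} \subset {\mathbb {R}}^3$$: geodesics are great circles, and $$\exp _u(v) = \cos (\left\| v\right\| )\,u + \sin (\left\| v\right\| )\,v/\left\| v\right\|$$ for $$v \ne 0_u$$.

```python
import numpy as np

def sphere_exp(u, v):
    """Exponential map on the unit sphere S^2 in R^3: follow the great
    circle (geodesic) through u with initial velocity v for unit time."""
    nv = np.linalg.norm(v)
    if nv < 1e-14:                     # exp_u(0_u) = u
        return u.copy()
    return np.cos(nv) * u + np.sin(nv) * (v / nv)

u = np.array([0.0, 0.0, 1.0])          # north pole
v = np.array([np.pi / 2, 0.0, 0.0])    # tangent vector: v . u = 0
p = sphere_exp(u, v)                   # a quarter of a great circle
```

For this manifold the injectivity radius is $$\pi$$ at every point, so the map above is a diffeomorphism on $$B_\pi (0_u)$$.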

Let the length of a $$\mathcal {C}^1$$-curve $$c:[0,1]\rightarrow \mathcal {U}$$ be denoted by $$L (c) = \int _0^1 \left\| c'(t) \right\| _{\mathcal {G}}\,\textrm{d}t$$. Then the distance $$\textrm{d}:\mathcal {U} \times \mathcal {U} \rightarrow {\mathbb {R}}$$ between points $$u,q\in \mathcal {U}$$ is given by

\begin{aligned} \textrm{d}(u,q)= \inf \{L (c) :&c:[0,1]\rightarrow \mathcal {U} \text { is a piecewise smooth curve}\\&\text {with } c(0)=u \text { and } c(1)=q\}. \end{aligned}

The injectivity radius $$i_u$$ at a point $$u \in \mathcal {U}$$ is defined as $$i_u \ {:}{=}\ \sup \{r > 0 :\exp _u \vert _{B_r(0_u)} \text { is a diffeomorphism}\}$$, where $$0_u$$ denotes the zero element of $$T_u \mathcal {U}$$ and $$B_r(0_u) \subset T_u \mathcal {U}$$ is a ball centered at $$0_u\in T_u \mathcal {U}$$ with radius r. The injectivity radius of the manifold $$\mathcal {U}$$ is the number $$i(\mathcal {U}) \;{:}{=}\; \inf _{u \in \mathcal {U}} i_u.$$

The triple $$(\varOmega , \mathcal {F}, {\mathbb {P}})$$ denotes a (complete) probability space, where $$\mathcal {F} \subset 2^{\varOmega }$$ is the $$\sigma$$-algebra of events and $${\mathbb {P}}:\mathcal {F} \rightarrow [0,1]$$ is a probability measure. The expectation of a random variable $$X:\varOmega \rightarrow {\mathbb {R}}$$ is defined by $${\mathbb {E}}\left[ X \right] = \int _\varOmega X(\omega )\,\textrm{d}{\mathbb {P}}(\omega )$$. A filtration is a sequence $$\{ \mathcal {F}_n\}$$ of sub-$$\sigma$$-algebras of $$\mathcal {F}$$ such that $$\mathcal {F}_1 \subset \mathcal {F}_2 \subset \cdots \subset \mathcal {F}$$. If for an event $$F \in \mathcal {F}$$ it holds that $${\mathbb {P}}(F) = 1$$, then we say F occurs almost surely (a.s.). Given an integrable random variable $$X :\varOmega \rightarrow {\mathbb {R}}$$ and a sub-$$\sigma$$-algebra $$\mathcal {F}_n$$, the conditional expectation is denoted by $${\mathbb {E}}\left[ X | \mathcal {F}_n \right]$$, which is a random variable that is $$\mathcal {F}_n$$-measurable and satisfies $$\int _A {\mathbb {E}}\left[ X|\mathcal {F}_n \right] (\omega ) \,\textrm{d}{\mathbb {P}}(\omega ) = \int _A X(\omega ) \,\textrm{d}{\mathbb {P}}(\omega )$$ for all $$A \in \mathcal {F}_n$$.

We will frequently use the convention $$\varvec{\xi } \in \varXi$$ to denote a realization (i.e., the deterministic value $$\varvec{\xi }(\omega ) \in \varXi$$ for some $$\omega$$) of the vector $$\varvec{\xi }:\varOmega \rightarrow \varXi \subset {\mathbb {R}}^m$$; based on the context, there should be no confusion as to whether a realization or a random vector is meant. Let $$J:\mathcal {U} \times {\mathbb {R}}^m \rightarrow {\mathbb {R}}$$ be a parametrized objective as in problem (P) and define $$J_{\varvec{\xi }}\;{:}{=}\; J(\cdot ,\varvec{\xi })$$. The gradient $$\nabla _{u} J(u,\varvec{\xi })\;{:}{=}\; \nabla J_{\varvec{\xi }}(u)$$ of J with respect to u is defined by the relation

\begin{aligned} ((J_{\varvec{\xi }})_*)_{u} w = \mathcal {G}_{u} (\nabla _{u} J(u,\varvec{\xi }),w) \quad \forall w\in T_{u} \mathcal {U}. \end{aligned}
(2)

Following [14], if $$\nabla _{u} J:\mathcal {U} \times {\mathbb {R}}^m \rightarrow T_{u} \mathcal {U}$$ is $${\mathbb {P}}$$-integrable, Eq. (2) is fulfilled for all u almost surely, and $${\mathbb {E}}\left[ \nabla _{u} J(u,\varvec{\xi })\right] = \nabla j(u)$$, we call $$\nabla _{u} J$$ a stochastic gradient.
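In the Euclidean special case $$\mathcal {U} = {\mathbb {R}}^n$$ with the standard metric, a batch average of sample-path gradients is the usual unbiased estimator of $$\nabla j$$. A minimal sketch with a quadratic integrand chosen by us for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# J(u, xi) = 0.5 * ||u - xi||_2^2 with xi ~ N(mean, I), so that
# j(u) = E[J(u, xi)] has the gradient  grad j(u) = u - mean.
mean = np.array([1.0, -2.0])

def grad_J(u, xi):
    """Sample-path (stochastic) gradient of J with respect to u."""
    return u - xi

u = np.array([0.5, 0.5])
batch = rng.normal(loc=mean, size=(200_000, 2))   # realizations of xi
estimate = grad_J(u, batch).mean(axis=0)          # batch average
exact = u - mean                                  # grad j(u)
```

The batch average converges to $$\nabla j(u) = u - {\mathbb {E}}[\varvec{\xi }]$$ as the batch size grows, which is exactly the unbiasedness property $${\mathbb {E}}\left[ \nabla _{u} J(u,\varvec{\xi })\right] = \nabla j(u)$$ required above.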

Let the Lagrangian for problem (P) be the mapping $$\mathcal {L}:\mathcal {U} \times {\mathbb {R}}^n \rightarrow {\mathbb {R}}$$ defined by

\begin{aligned} \mathcal {L}(u, \varvec{\lambda })\;{:}{=}\; j(u) + \varvec{\lambda }^\top \varvec{h}(u). \end{aligned}

The gradient $$\nabla h_i(u) \in T_{u} \mathcal {U}$$ of $$h_i :\mathcal {U} \rightarrow {\mathbb {R}}$$ is defined by the relation $$((h_i)_{*})_{u} w = \mathcal {G}_u(\nabla h_i(u),w)$$ for all $$w \in T_{u} \mathcal {U}.$$ The gradient of the corresponding vector $$\varvec{h}:\mathcal {U} \rightarrow {\mathbb {R}}^n$$ is the vector $$\nabla \varvec{h} (u)= (\nabla h_1(u), \dots , \nabla h_n(u))^\top$$.

In the following, we define a Karush–Kuhn–Tucker (KKT) point.

Definition 2.1

The pair $$(\hat{u},\hat{\varvec{\lambda }})\in \mathcal {U} \times {\mathbb {R}}^n$$ is called a KKT point for problem (P) if it satisfies the following conditions:

\begin{aligned} \nabla j(\hat{u})+ \sum _{i=1}^n \hat{\lambda }_i \nabla h_i({\hat{u}})&= 0_{\hat{u}}, \end{aligned}
(3a)
\begin{aligned} h_i(\hat{u})&=0, \quad \forall i \in \mathcal {E},\end{aligned}
(3b)
\begin{aligned} h_i(\hat{u}) \le 0, \quad {\hat{\lambda }_i} \ge 0, \quad {{\hat{\lambda }}_i} h_i(\hat{u})&= 0, \quad \forall i \in \mathcal {I}. \end{aligned}
(3c)

Remark 2.1

In order for the above-formulated KKT conditions to be necessary optimality conditions for problem (P), certain constraint qualifications are required. Analogues of the linear independence (LICQ), Mangasarian–Fromovitz, Abadie, and Guignard constraint qualifications have only recently been treated on finite-dimensional manifolds; see [6, 50]. The investigation of proper constraint qualifications for the infinite-dimensional setting is an open area of research and is not further pursued in this paper. In Theorem 2.2, we will see that in certain cases, our method produces KKT points in the limit. In the absence of constraint qualifications, however, only certain asymptotic KKT (AKKT) conditions can be shown to hold in general. This is discussed in more detail in Sect. 2.4.

The closed cone corresponding to the constraints, the distance to the cone, and the projection are defined, respectively, by

\begin{aligned} \varvec{K}&\;{:}{=}\; \{ \varvec{y} \in {\mathbb {R}}^n:y_i=0 \,\, \forall i \in \mathcal {E}, y_i \le 0 \,\, \forall i \in \mathcal {I}\}, \\ {\text {dist}}_{\varvec{K}}(\varvec{y})&\;{:}{=}\; \inf _{\varvec{k} \in {\varvec{K}}} \left\| \varvec{y}-\varvec{k}\right\| _2 \quad \text {and} \quad \pi _{\varvec{K}}(\varvec{y}) \;{:}{=}\; {\text {argmin}}_{\varvec{k} \in {\varvec{K}}}\left\| \varvec{y}-\varvec{k}\right\| _2. \end{aligned}

For $$y \in {\mathbb {R}}$$, the projection onto the ith component of the closed cone $$\varvec{K}$$ has the formula $$\pi _{K_i}(y) = 0$$ if $$i \in \mathcal {E}$$, and $$\pi _{K_i}(y) = \min (0,y)$$ if $$i \in \mathcal {I}$$. We have $$\pi _{\varvec{K}}(\varvec{y}) = (\pi _{K_1}(y_1), \dots , \pi _{K_n}(y_n))^\top .$$ The normal cone of $$\varvec{K}$$ at a point $$\varvec{s} \in \varvec{K}$$ is defined by $$N_{\varvec{K}}(\varvec{s})= \{ \varvec{v} \in {\mathbb {R}}^n:\varvec{v}^\top (\varvec{s}- \varvec{y}) \ge 0 \,\, \forall \varvec{y} \in \varvec{K}\}$$; the normal cone is the empty set if $$\varvec{s}$$ is not contained in $$\varvec{K}$$. To define the augmented Lagrangian, we first introduce a slack variable $$\varvec{s} \in {\varvec{K}}$$ to obtain the equivalent, equality-constrained problem

\begin{aligned} \begin{aligned}&\min _{(u,\varvec{s}) \in \mathcal {U}\times {\varvec{K}}} \left\{ j(u)={\mathbb {E}}\left[ J(u,\varvec{\xi }) \right] \right\} \quad \text {s.t.} \quad \varvec{h}(u)-\varvec{s} = \varvec{0}. \end{aligned} \end{aligned}

The corresponding augmented Lagrangian for a fixed parameter $$\mu$$ is the mapping $$\mathcal {L}_A^{\varvec{s}}:\mathcal {U} \times {\mathbb {R}}^n \times {\mathbb {R}}^n \rightarrow {\mathbb {R}}$$ defined by

\begin{aligned} \mathcal {L}_A^{\varvec{s}}(u, \varvec{s},\varvec{\lambda };\mu )&= j(u) + \varvec{\lambda }^\top ( \varvec{h}(u)-\varvec{s}) + \frac{\mu }{2} \left\| \varvec{h}(u)-\varvec{s}\right\| _2^2 \\&=j(u) + \frac{\mu }{2} \left\| \varvec{h}(u)+ \frac{\varvec{\lambda }}{\mu } - \varvec{s} \right\| _2^2 - \frac{\left\| \varvec{\lambda }\right\| _2^2}{2\mu }. \end{aligned}

Notice that $$\min _{\varvec{s} \in \varvec{K}} \left\| \varvec{h}(u) + \tfrac{\varvec{\lambda }}{\mu } - \varvec{s}\right\| _2^2 = {\text {dist}}_{\varvec{K}}(\varvec{h}(u)+\tfrac{\varvec{\lambda }}{\mu })^2.$$ Hence, it is possible to eliminate the slack variable to obtain, again for fixed $$\mu$$, the augmented Lagrangian $$\mathcal {L}_A:\mathcal {U} \times {\mathbb {R}}^n \rightarrow {\mathbb {R}}$$ defined by

\begin{aligned} \mathcal {L}_A(u, \varvec{\lambda };\mu ) = j(u) + \frac{\mu }{2}{\text {dist}}_{\varvec{K}} \!\left( \varvec{h}(u)+ \frac{\varvec{\lambda }}{\mu }\right) ^2 - \frac{\left\| \varvec{\lambda }\right\| _2^2}{2\mu }. \end{aligned}
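The componentwise projection formula makes $$\pi _{\varvec{K}}$$, $${\text {dist}}_{\varvec{K}}$$, and hence $$\mathcal {L}_A$$ cheap to evaluate once $$j(u)$$ and $$\varvec{h}(u)$$ are known. A minimal NumPy sketch (the function names and the boolean mask is_eq marking equality indices are our own):

```python
import numpy as np

def proj_K(y, is_eq):
    """Componentwise projection onto K: 0 on equality indices,
    min(0, y_i) on inequality indices."""
    return np.where(is_eq, 0.0, np.minimum(0.0, y))

def dist_K(y, is_eq):
    """Euclidean distance of y to the closed cone K."""
    return np.linalg.norm(y - proj_K(y, is_eq))

def aug_lagrangian(j_u, h_u, lam, mu, is_eq):
    """L_A(u, lambda; mu) with the slack variable eliminated."""
    y = h_u + lam / mu
    return j_u + 0.5 * mu * dist_K(y, is_eq) ** 2 \
               - np.dot(lam, lam) / (2.0 * mu)

# One equality and one inequality constraint:
is_eq = np.array([True, False])
val = aug_lagrangian(j_u=1.0, h_u=np.array([0.3, -0.1]),
                     lam=np.array([0.5, 0.2]), mu=10.0, is_eq=is_eq)
```

Note that only the deterministic constraint values enter the penalty term; in Algorithm 1 the objective value $$j(u)$$ is replaced by samples $$J(u,\varvec{\xi })$$.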

2.2 Augmented Lagrangian Method on Riemannian Manifolds

In this section, we present Algorithm 1, which relies on stochastic approximation. For this, we need the function $$L_A :\mathcal {U} \times {\mathbb {R}}^n \times \varXi \rightarrow {\mathbb {R}}$$ defined by

\begin{aligned} L_A(u,\varvec{\lambda }, \varvec{\xi };\mu )\;{:}{=}\; J(u,\varvec{\xi })+ \frac{\mu }{2}{\text {dist}}_{\varvec{K}} \!\left( \varvec{h}(u)+ \frac{\varvec{\lambda }}{\mu }\right) ^2 - \frac{\left\| \varvec{\lambda }\right\| _2^2}{2\mu }. \end{aligned}

The stochastic augmented Lagrangian (AL) method is shown in Algorithm 1. The inner loop is an adaptation of the randomized multi-batch stochastic gradient (RSG) method from [19]. In deterministic AL methods, the inner loop is in practice only solved up to a given error tolerance, leading to an inexact augmented Lagrangian method. Deterministic termination conditions for the inner loop typically rely on conditions of the following type: $$u^{k+1}$$ is chosen as the first point of the corresponding iterative procedure satisfying

\begin{aligned} \left\| \nabla _{u} \mathcal {L}_A(u^{k+1},\varvec{w}^k; \mu _k) \right\| _{\mathcal {G}} \le \varepsilon _k \end{aligned}

with the error disappearing asymptotically, i.e., $$\varepsilon _k \rightarrow 0$$ as $$k \rightarrow \infty$$. Stochastic methods of the kind used here can only provide probabilistic error bounds; termination conditions are based on a priori estimates and result in stochastic errors. The outer loop corresponds to the augmented Lagrangian (AL) method with a safeguarding procedure as described in [21]; see also [47]. A feature of this procedure is that instead of using the Lagrange multiplier $$\varvec{\lambda }$$ in the subproblem in line 4, one chooses an element $$\varvec{w}$$ of a bounded set B, which is essential for achieving global convergence. In practice, B should be chosen in such a way that the projection onto it is easy to compute, e.g., as a box. A natural choice is $$\varvec{w}^k\;{:}{=}\; \pi _B(\varvec{\lambda }^k)$$ for a closed and convex set B. For the algorithm, we define an infeasibility measure and its induced sequence by

\begin{aligned} H(u,\varvec{\lambda }; \mu ) \;{:}{=}\; \left\| \varvec{h}(u) - \pi _{\varvec{K}} \!\left( \varvec{h}(u) +\frac{\varvec{\lambda }}{\mu } \right) \right\| _2 \quad \text {and} \quad H_{k}\;{:}{=}\; H(u^{k},\varvec{w}^{k-1};\mu _{k-1}). \end{aligned}
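Algorithm 1 itself is not reproduced in this excerpt, but its multiplier machinery can be sketched: the safeguard $$\varvec{w}^k = \pi _B(\varvec{\lambda }^k)$$ for a box B, the first-order multiplier update $$\varvec{\lambda }^{k+1} = \mu _k (\varvec{h}(u^{k+1}) + \varvec{w}^k/\mu _k - \pi _{\varvec{K}}(\varvec{h}(u^{k+1}) + \varvec{w}^k/\mu _k))$$ appearing in the proof of Lemma 2.1, and the infeasibility measure H. The following is a hedged illustration with names of our own choosing:

```python
import numpy as np

def proj_box(lam, lo, hi):
    """Safeguard: projection onto the box B = [lo, hi]^n."""
    return np.clip(lam, lo, hi)

def proj_K(y, is_eq):
    """Componentwise projection onto the closed cone K."""
    return np.where(is_eq, 0.0, np.minimum(0.0, y))

def multiplier_update(h_u, w, mu, is_eq):
    """lambda^{k+1} = mu * (h(u^{k+1}) + w/mu - pi_K(h(u^{k+1}) + w/mu))."""
    y = h_u + w / mu
    return mu * (y - proj_K(y, is_eq))

def infeasibility(h_u, w, mu, is_eq):
    """H(u, w; mu) = || h(u) - pi_K(h(u) + w/mu) ||_2."""
    return np.linalg.norm(h_u - proj_K(h_u + w / mu, is_eq))

is_eq = np.array([True, False])
h_u, w = np.array([0.0, -0.5]), np.array([0.5, 0.2])
lam_next = multiplier_update(h_u, w, mu=10.0, is_eq=is_eq)  # ~ (0.5, 0.0)
H_val = infeasibility(h_u, w, mu=10.0, is_eq=is_eq)         # ~ 0.02
```

In a typical safeguarded scheme, the penalty $$\mu _k$$ is increased whenever the infeasibility sequence $$H_k$$ fails to decrease sufficiently; the precise update rule belongs to Algorithm 1.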

2.3 Convergence of Inner Loop

To prove convergence of the RSG procedure in Algorithm 1, we make the following assumptions about the manifold, which are adapted from [14].

Assumption 1

We assume that (i) the distance $$\textrm{d}(\cdot ,\cdot )$$ is non-degenerate,

(ii) the manifold $$(\mathcal {U},\mathcal {G})$$ has a positive injectivity radius $$i(\mathcal {U})$$, and

(iii) for all $$u\in \mathcal {U}$$ and all $$\tilde{u} \in \exp _{u}(B_{i_{u}}(0_{u}))$$, the minimizing geodesic between u and $$\tilde{u}$$ is completely contained in $$\exp _{u}(B_{i_{u}}(0_{u}))$$.

As pointed out in [14], the conditions in Assumption 1, while mild for finite-dimensional manifolds, are strong for infinite-dimensional manifolds. Distances on an infinite-dimensional Riemannian manifold can be degenerate. For example, [37] shows that the reparametrization-invariant $$L^2$$-metric on the infinite-dimensional manifold of smooth planar curves induces a geodesic distance equal to zero. Any assumption regarding the injectivity radius is challenging to verify in practice. In infinite dimensions, Riemannian metrics are generally weak, such that gradients may not exist. For certain metrics, the exponential map may not be well-defined; it may even fail to be a diffeomorphism on any neighborhood, cf., e.g., [12].

In the following, a function $$g :\mathcal {U} \rightarrow {\mathbb {R}}$$ is called $$L_g$$-Lipschitz continuously differentiable if the function is $$\mathcal {C}^1$$ and there exists a constant $$L_g>0$$ such that for all $$u, \tilde{u} \in \mathcal {U}$$ with $$\textrm{d}(u,\tilde{u})< i(\mathcal {U})$$ we have

\begin{aligned} \left\| P_{1,0}\nabla {g}(\tilde{u})-\nabla {g}(u)\right\| _{\mathcal {G}}&\le {L_g} \textrm{d}(u,\tilde{u}), \end{aligned}

where $$P_{1,0}:T_{\gamma (1)}\mathcal {U} \rightarrow T_{\gamma (0)}\mathcal {U}$$ is the parallel transport along the unique geodesic such that $$\gamma (0) = u$$ and $$\gamma (1) = \tilde{u}.$$

Assumption 2

(i)

The functions j and $${h}_i$$ ($$i=1, \dots , n$$) are $$L_j$$- and $$L_{{h}_i}$$-Lipschitz continuously differentiable, respectively, and the gradients $$\nabla j$$ and $$\nabla {h}_i$$ ($$i=1, \dots , n$$) exist for all $$u \in \mathcal {U}$$.

(ii)

The function J is continuously differentiable with respect to the first argument for every $$\varvec{\xi } \in \varXi$$, the stochastic gradient $$\nabla _{u} J$$ defined by (2) exists, and there exists $$M>0$$ such that:

\begin{aligned} {\mathbb {E}}\left[ \left\| \nabla _{u} J(u,\varvec{\xi } ) - \nabla j(u)\right\| _{\mathcal {G}}^2 \right] \le M^2 \quad \forall u \in \mathcal {U}. \end{aligned}
(4)

We begin our investigations with the following useful property.

Lemma 2.1

Under Assumption 1 and assuming the gradients $$\nabla j$$ and $$\nabla {h}_i$$ ($$i=1, \dots , n$$) exist, the iterates of Algorithm 1 satisfy

\begin{aligned} \nabla _{u} \mathcal {L}_A(u^{k+1}, \varvec{w}^k; \mu _k) = \nabla _{u} \mathcal {L}(u^{k+1}, \varvec{\lambda }^{k+1}) \quad \text {for all }k. \end{aligned}

Proof

We have $$\nabla {\text {dist}}_{\varvec{K}}^2 = 2(Id _{{\mathbb {R}}^n} \!-\! \pi _{\varvec{K}})$$; see [5, Corollary 12.31]. Let $$f(u)\;{:}{=}\; \mathcal {L}_A(u,\varvec{w}; \mu )$$. Then, the chain rule yields

\begin{aligned} \begin{aligned} (f_*)_{u} v&= (j_*)_{u} {v} + \mu \sum _{i=1}^n \left( h_i(u) + \frac{w_i}{\mu } - \pi _{K_i} \!\left( h_i(u)+ \frac{w_i}{\mu } \right) \right) ((h_i)_*)_{u} v. \end{aligned} \end{aligned}

From this, thanks to the identity (1), it follows that

\begin{aligned} \nabla f(u) = \nabla j(u) + \mu \nabla \varvec{h}(u)^\top \left( \varvec{h}(u) + \frac{\varvec{w}}{\mu } - \pi _{\varvec{K}} \!\left( \varvec{h}(u)+ \frac{\varvec{w}}{\mu } \right) \right) , \end{aligned}

and using the definition of $$\varvec{\lambda }^{k+1}$$ from Algorithm 1, we obtain

\begin{aligned} \nabla _{u} \mathcal {L}_A(u^{k+1}, \varvec{w}^k;\mu _k) = \nabla j(u^{k+1}) + \nabla \varvec{h}({u^{k+1}})^\top \varvec{\lambda }^{k+1}. \end{aligned}

Using the fact that $$\nabla _{u} \mathcal {L}(u,\varvec{\lambda }) = \nabla j(u) + \nabla \varvec{h}({u})^\top \varvec{\lambda }$$, we have proven the claim. $$\square$$
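The identity of Lemma 2.1 is easy to check numerically in the Euclidean special case $$\mathcal {U} = {\mathbb {R}}^2$$ with a single inequality constraint; the data below are chosen by us purely for the check:

```python
import numpy as np

# Euclidean special case U = R^2 with one inequality constraint.
mu = 10.0
w = np.array([0.2])
is_eq = np.array([False])

def h(u):       return np.array([u[0] + u[1] - 1.0])   # h(u) <= 0
def grad_h(u):  return np.array([[1.0, 1.0]])          # Jacobian of h
def grad_j(u):  return 2.0 * u                         # j(u) = ||u||^2

def proj_K(y):  return np.where(is_eq, 0.0, np.minimum(0.0, y))

u_next = np.array([0.8, 0.7])
y = h(u_next) + w / mu
lam_next = mu * (y - proj_K(y))                        # multiplier update

# grad_u L_A(u^{k+1}, w^k; mu_k)  vs.  grad_u L(u^{k+1}, lambda^{k+1}):
grad_LA = grad_j(u_next) + mu * grad_h(u_next).T @ (y - proj_K(y))
grad_L  = grad_j(u_next) + grad_h(u_next).T @ lam_next
# The two gradients coincide, as Lemma 2.1 asserts.
```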

Now, we turn to an efficiency estimate for the inner loop. First, we define the functions

\begin{aligned} F_k(u,\varvec{\xi })&\;{:}{=}\; L_A(u,\varvec{w}^k, \varvec{\xi };\mu _k) \quad \text {and} \\ f_k(u)&\;{:}{=}\; {\mathbb {E}}\left[ L_A(u,\varvec{w}^k, \varvec{\xi };\mu _k) \right] = \mathcal {L}_A(u,\varvec{w}^k;\mu _k). \end{aligned}

Recall the conventions introduced above: in the definition of $$F_k$$, $$\varvec{\xi } \in \varXi$$ denotes a realization, whereas in the definition of $$f_k$$, $$\varvec{\xi } :\varOmega \rightarrow \varXi$$ denotes the random vector.

Lemma 2.2

Suppose that Assumptions 1 and 2 are satisfied and let $$\hat{B}_k \subset \mathcal {U}$$ be a bounded set such that $$\textrm{d}(\tilde{u},u) \le i(\mathcal {U})$$ for all $$\tilde{u}, {u} \in \hat{B}_k.$$ Then, $$f_k$$ is $$L_k$$-Lipschitz continuously differentiable with $$L_k$$ depending on $$L_j, L_{{h}_1}, \dots , L_{{h}_n}$$, and $$\hat{B}_k$$. Moreover, for all $$\tilde{u},u \in \hat{B}_k$$ with $$v\;{:}{=}\; \exp _{u}^{-1}(\tilde{u})$$, we have

\begin{aligned} f_k(\tilde{u})-f_k(u) \le \mathcal {G}(\nabla f_k(u), v) + \frac{L_k}{2}\left\| v\right\| _{\mathcal {G}}^2. \end{aligned}
(5)

Proof

Let $$P_{1,0}$$ denote the parallel transport as defined directly before Assumption 2 and set $$g_i({u})\;{:}{=}\; h_i({u}) + \frac{w^k_i}{\mu _k}-\pi _{K_i}(h_i({u}) + \frac{w^k_i}{\mu _k}).$$ Since $$h_i$$ is $$L_{h_i}$$-Lipschitz continuously differentiable and $$\hat{B}_k$$ is bounded, there exists $$C_{i,k}>0$$ such that $$\left\| \nabla h_i(u)\right\| _{\mathcal {G}} \le C_{i,k}.$$ Now, we have

\begin{aligned} \begin{aligned}&\left\| \sum _{i=1}^n P_{1,0}\nabla h_i(\tilde{u}) g_i(\tilde{u}) - \nabla h_i(u) g_i(u) \right\| _{\mathcal {G}} \\&\quad \le \sum _{i=1}^n \left\| P_{1,0} \nabla h_i(\tilde{u})-\nabla h_i(u)\right\| _{\mathcal {G}} \left| g_i(u)\right| + \left\| \nabla h_i(u)\right\| _{\mathcal {G}} \left| g_i(\tilde{u})- g_i(u) \right| \\&\quad \le \sum _{i=1}^n L_{h_i} d (u,\tilde{u}) \left| g_i(u) \right| + C_{i,k} \left| g_i(\tilde{u})- g_i(u) \right| \\&\quad \le \sum _{i=1}^n L_{h_i} d (u,\tilde{u}) \left| g_i(u) \right| + 2 C_{i,k} \left| h_i(\tilde{u})- h_i(u) \right| , \end{aligned} \end{aligned}
(6)

where in the last step, we used the nonexpansiveness of the projection operator. Notice that

\begin{aligned} \left| h_i(\tilde{u})-h_i(u) \right| \le C'_i \textrm{d}(\tilde{u},u) \end{aligned}
(7)

for some $$C'_i>0$$ since $$h_i$$ is $$\mathcal {C}^1$$. Additionally, we have

\begin{aligned} \left| g_i(u) \right| \le \left| h_i(u)+\frac{w_i^k}{\mu _k} \right| \quad (i \in \mathcal {E}) \end{aligned}
(8)

and

\begin{aligned} \left| g_i(u) \right| = {\left\{ \begin{array}{ll} h_i(u)+\frac{w_i^k}{\mu _k} & \text {if } h_i(u)+\frac{w_i^k}{\mu _k}\ge 0, \\ 0 & \text {else} \end{array}\right. } \quad (i \in \mathcal {I}). \end{aligned}
(9)

Since $$\hat{B}_k$$ is bounded, (8) and (9) together imply that there exists $$C_{i,k}'' >0$$ such that $$\left| g_i(u) \right| \le C_{i,k}''$$. As a consequence of (6) and (7), we have

\begin{aligned}&\left\| \sum _{i=1}^n P_{1,0}\nabla h_i(\tilde{u}) g_i(\tilde{u}) - \nabla h_i(u) g_i(u) \right\| _{\mathcal {G}} \le \textrm{d}(u,\tilde{u})\sum _{i=1}^n L_{{h}_i} C_{i,k}''+ 2 C_{i,k}C_i' . \end{aligned}

Setting $$\tilde{L}_{\varvec{h},k}\;{:}{=}\; \sum _{i=1}^n L_{{h}_i} C_{i,k}''+ 2 C_{i,k}C_i'$$, we have

\begin{aligned}&\left\| P_{1,0} \nabla f_k(\tilde{u}) - \nabla f_k(u) \right\| _{\mathcal {G}}\\&\quad \le \left\| P_{1,0} \nabla j(\tilde{u}) -\nabla j(u) \right\| _{\mathcal {G}} + \mu _k \left\| \sum _{i=1}^n P_{1,0}\nabla h_i(\tilde{u}) g_i(\tilde{u}) - \nabla h_i(u) g_i(u) \right\| _{\mathcal {G}}\\&\quad \le (L_j +\mu _k \tilde{L}_{\varvec{h},k}) \textrm{d}(\tilde{u},u). \end{aligned}

Therefore, $$f_k$$ is $$L_k$$-Lipschitz continuously differentiable with $$L_k\;{:}{=}\; L_j +\mu _k \tilde{L}_{\varvec{h},k}$$. Applying [14, Theorem 2.6], we obtain (5). $$\square$$

Remark 2.2

In the previous lemma, we introduced a bounded set $$\hat{B}_k$$. For the following results, we will need such sets to exist and to contain the iterates almost surely for each k. Boundedness can be ensured, e.g., by including constraints of the form $$u \in C\subset \mathcal {U}$$ for some bounded set C, or by growth conditions on the gradient in combination with a regularizer; see [16].

Our first result concerning the convergence of Algorithm 1 addresses the efficiency of the inner loop, which corresponds to a stochastic gradient method that is randomly stopped after $$R_k$$ iterations. We follow the arguments in [19, Corollary 3]. It is possible to choose non-constant step sizes $$t_{k_j}$$; see [19, Theorem 2], but for the sake of clarity we consider step sizes that are constant within the inner loop.

To handle the analysis, we interpret $$R_k$$ as a realization of a stopping time $$\tau _k :\varOmega \rightarrow \{ 1, \dots , N_k\}$$. Let $$\varvec{\xi }^{k,j}\;{:}{=}\; (\varvec{\xi }^{k,j,1}, \dots ,\varvec{ \xi }^{k,j,m_k})$$ be the batch associated with iteration j for a given outer loop k and let $$\mathcal {F}_{k,n} = \sigma (\varvec{\xi }^{\ell ,i}:\ell \in \{1, \dots , k\}, i \in \{ 1, \dots , n \})$$ define the corresponding natural filtration. We define the filtration associated with the randomly stopped stochastic process by $$\mathcal {F}^{\tau _k} =\{ \mathcal {F}_{\ell ,n \wedge \tau _k}:\ell \in \{ 1, \dots , k\}, n \in \{ 1, \dots , N_k\} \}$$.
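For constant inner step sizes, the stopping-index distribution used in [19] assigns the same mass to every index, i.e., $$R_k$$ is drawn uniformly from $$\{1, \dots , N_k\}$$; with non-constant steps $$t_{k_j}$$, index j is weighted proportionally to $$t_{k_j}(2 - L_k t_{k_j})$$. A sketch of drawing a realization (our illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_stopping_index(N_k, rng):
    """Draw a realization R_k of the stopping time tau_k on {1, ..., N_k}.

    With constant inner-loop step sizes, the probability mass from [19]
    is the same for every index, so the draw is uniform; non-constant
    steps t_{k_j} would weight index j by t_{k_j} * (2 - L_k * t_{k_j}).
    """
    return int(rng.integers(1, N_k + 1))   # uniform on {1, ..., N_k}

R_k = draw_stopping_index(50, rng)         # some index in {1, ..., 50}
```

The iterate $$u^{k+1} = z^{k,R_k}$$ returned by the inner loop is thus itself random, which is why the efficiency estimate below is stated in conditional expectation.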

Theorem 2.1

Suppose Assumptions 1 and 2 are satisfied and consider a fixed iteration k of Algorithm 1. Suppose the iterates $$\{ \varvec{z}^{k,j}\}$$ are a.s. contained in a bounded set $$\hat{B}_k \subset \mathcal {U}$$, where $$\textrm{d}(u,\tilde{u})\le i(\mathcal {U})$$ for all $$u,\tilde{u}\in \hat{B}_k$$. Then, if the step size $$t_{k}$$ satisfies $$t_{k} =\alpha _k/{L_k}$$ for $$\alpha _k \in (0,2)$$ and all k, we have

\begin{aligned} {{\mathbb {E}}} \left[ \left\| \nabla f_k(u^{k+1}) \right\| _{\mathcal {G}}^2 \big | \mathcal {F}^{\tau _k} \right] \le \frac{2L_k(f_k(u^k)-f_k^*)}{(2\alpha _k-\alpha _k^2)N_k}+ \frac{\alpha _k M^2}{(2-\alpha _k)m_k}, \end{aligned}
(10)

where $$f_k^*\;{:}{=}\; \inf _{u \in \hat{B}_k} f_k(u)$$. Moreover, if $$\hat{B}_\infty \;{:}{=}\; \cup _{k=1}^\infty \hat{B}_k$$ is bounded, $$\textrm{d}(u, \tilde{u}) \le i(\mathcal {U})$$ for all $$u, \tilde{u} \in \hat{B}_\infty$$, the maximal numbers of inner iterations $$\{ N_k\}$$ are chosen such that $$N_k = \beta _k L_k$$ for $$\beta _k >0$$, and

\begin{aligned} \sum _{k=1}^\infty \frac{1}{(2\alpha _k-\alpha _k^2)\beta _k}+ \frac{\alpha _k}{(2-\alpha _k)m_k} < \infty , \end{aligned}
(11)

then we have $$\left\| \nabla f_k(u^{k+1})\right\| _{\mathcal {G}} \rightarrow 0$$ a.s. as $$k\rightarrow \infty$$.

Proof

Let k be fixed. We define $$\delta ^{j}\;{:}{=}\; \frac{1}{m_k}\sum _{i=1}^{m_k} \nabla _{u}F_k(z^{k,j},\varvec{\xi }^{k,j,i})- \nabla f_k(z^{k,j}).$$ With $$v^j\;{:}{=}\; \exp _{z^{k,j}}^{-1}(z^{k,j+1}) = - \frac{1}{L_k m_k}\sum _{i=1}^{m_k} \nabla _{u}F_k(z^{k,j},\varvec{\xi }^{k,j,i})$$, Lemma 2.2 yields

\begin{aligned}&f_k({z}^{k,j+1})-f_k(z^{k,j}) \\&\quad \le - t_k \mathcal {G}\left( \nabla f_k(z^{k,j}), \frac{1}{m_k}\sum _{i=1}^{m_k} \nabla _{u}F_k(z^{k,j},\varvec{\xi }^{k,j,i})\right) \\&\qquad + \frac{L_kt_k^2}{2}\left\| \frac{1}{m_k}\sum _{i=1}^{m_k} \nabla _{u}F_k(z^{k,j},\varvec{\xi }^{k,j,i}) \right\| _{\mathcal {G}}^2 \\&\quad = -\frac{\alpha _k}{L_k} \left\| \nabla f_k(z^{k,j})\right\| _{\mathcal {G}}^2 -\frac{\alpha _k}{L_k} \mathcal {G}(\nabla f_k(z^{k,j}),\delta ^j) \\&\qquad + \frac{\alpha _k^2}{2 L_k} \left( \left\| \nabla f_k(z^{k,j})\right\| _{\mathcal {G}}^2 + 2\mathcal {G}(\nabla f_k(z^{k,j}),\delta ^j) + \left\| \delta ^j \right\| _{\mathcal {G}}^2 \right) \\&\quad = \left( -\frac{\alpha _k}{L_k}+\frac{\alpha _k^2}{2L_k}\right) \left\| \nabla f_k(z^{k,j}) \right\| _{\mathcal {G}}^2 + \left( -\frac{\alpha _k}{L_k}+\frac{\alpha _k^2}{L_k}\right) \mathcal {G}(\nabla f_k(z^{k,j}),\delta ^j) \\&\qquad + \frac{\alpha _k^2}{2 L_k} \left\| \delta ^j \right\| _{\mathcal {G}}^2. \end{aligned}

Taking the sum with respect to j on both sides and rearranging, we obtain

\begin{aligned} \begin{aligned}&\sum _{\ell =1}^{N_k} \left\| \nabla f_k(z^{k,\ell }) \right\| _{\mathcal {G}}^2 \le \frac{2L_k}{2\alpha _k-\alpha _k^2}( f_k(z^{k,1}) -f_k^*) \\&\qquad + \frac{2(\alpha _k-1)}{2-\alpha _k} \sum _{\ell =1}^{N_k} \mathcal {G}(\nabla f_k(z^{k,\ell }),\delta ^\ell ) +\frac{\alpha _k}{2-\alpha _k} \sum _{\ell =1}^{N_k} \left\| \delta ^\ell \right\| _{\mathcal {G}}^2 \end{aligned} \end{aligned}
(12)

since $$f_k^* \le f_k(z^{k,N_k+1})$$ and $$0< \alpha _k < 2$$. Since $$\nabla _{u}F_k$$ is a stochastic gradient, we have $${\mathbb {E}}\left[ \mathcal {G}(\nabla f_k(z^{k,j}),\delta ^j)|\mathcal {F}_{k,j} \right] =\mathcal {G} \left( \nabla f_k(z^{k,j}),{\mathbb {E}}\left[ \delta ^j|\mathcal {F}_{k,j} \right] \right) =0.$$ Notice that due to (4), we have

\begin{aligned} \begin{aligned}&{\mathbb {E}}\left[ \left\| \nabla _{u}F_k(z^{k,j},\varvec{\xi }^{k,j,i})- \nabla f_k(z^{k,j}) \right\| _{\mathcal {G}}^2 \Big \vert \mathcal {F}_{k,j}\right] \\&\quad = {\mathbb {E}}\left[ \left\| \nabla _{u}J(z^{k,j},\varvec{\xi }^{k,j,i})- \nabla j(z^{k,j}) \right\| _{\mathcal {G}}^2 \Big \vert \mathcal {F}_{k,j}\right] \\&\quad = {\mathbb {E}}\left[ \left\| \nabla _{u}J(z^{k,j},\varvec{\xi })- \nabla j(z^{k,j}) \right\| _{\mathcal {G}}^2 \right] \le M^2. \end{aligned} \end{aligned}
(13)

With (13), we obtain

\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ \left\| \delta ^j \right\| _{\mathcal {G}}^2 \big | \mathcal {F}_{k,j} \right]&= \frac{1}{m_k^2} {\mathbb {E}}\left[ \left\| \sum _{i=1}^{m_k} \left( \nabla _{u}F(z^{k,j},\varvec{\xi }^{k,j,i})- \nabla f_k(z^{k,j})\right) \right\| _{\mathcal {G}}^2 \Bigg | \mathcal {F}_{k,j}\right] \\&\le \frac{1}{m_k^2} \sum _{i=1}^{m_k} {\mathbb {E}}\left[ \left\| \nabla _{u}F(z^{k,j},\varvec{\xi }^{k,j,i})- \nabla f_k(z^{k,j}) \right\| _{\mathcal {G}}^2 \Big | \mathcal {F}_{k,j}\right] \le \frac{M^2}{m_k}, \end{aligned} \end{aligned}
(14)

where we used Jensen’s inequality, the linearity of the expectation, and (13). Taking the total expectation on both sides of (12), using (14), and applying the tower rule, cf. [20, Proposition 1.1 (a), p. 471], we get the inequality

\begin{aligned} \sum _{\ell =1}^{N_k}{\mathbb {E}}\left[ \left\| \nabla f_k(z^{k,\ell }) \right\| _{\mathcal {G}}^2 \right] \le \frac{2 L_k( f_k(z^{k,1}) -f_k^*)}{2\alpha _k-\alpha _k^2}+ \frac{\alpha _k}{2-\alpha _k}\frac{M^2 N_k}{m_k}. \end{aligned}
(15)

Due to the law of total expectation, we have

\begin{aligned} {{\mathbb {E}}} \left[ \left\| \nabla f_k(z^{k,R_k}) \right\| _{\mathcal {G}}^2 \big | \mathcal {F}^{\tau _k} \right]&= {{\mathbb {E}}} \left[ \left\| \nabla f_k(z^{k,\tau _k}) \right\| _{\mathcal {G}}^2 \big | \mathcal {F}^{\tau _k} \right] \\&= \sum _{\ell =1}^{N_k} {\mathbb {E}}\left[ \left\| \nabla f_k(z^{k,\ell }) \right\| _{\mathcal {G}}^2 \big | \mathcal {F}_{k,\ell } \right] {\mathbb {P}}\{\tau _k = \ell \} \\&= \frac{1}{N_k} \sum _{\ell =1}^{N_k} {\mathbb {E}}\left[ \left\| \nabla f_k(z^{k,\ell }) \right\| _{\mathcal {G}}^2 \right] . \end{aligned}

Note that $$f_k(z^{k,R_k}) = f_k(u^{k+1})$$ and $$f_k(z^{k,1})=f_k(u^k)$$. Returning to (15), we obtain

\begin{aligned} {{\mathbb {E}}} \left[ \left\| \nabla f_k(u^{k+1}) \right\| _{\mathcal {G}}^2 \big | \mathcal {F}^{\tau _k} \right] \le \frac{2 L_k( f_k(u^k) -f_k^*)}{(2\alpha _k-\alpha _k^2)N_k}+ \frac{\alpha _k M^2}{(2-\alpha _k)m_k}, \end{aligned}

so we have shown (10).

Now, to prove almost sure convergence, we first observe that if all iterates are contained in $$\hat{B}_\infty$$, we have

\begin{aligned} f_k(u^k) - f_k^* \le 2 \sup _{u \in \hat{B}_\infty } \left| f_k(u) \right| \le C \end{aligned}
(16)

for some $$C>0$$ due to the assumed smoothness of $$f_k$$ on $$\mathcal {U}$$. Taking the total expectation of (10), Markov’s inequality in combination with Jensen’s inequality gives

\begin{aligned} {{\mathbb {P}}} \left\{ \left\| \nabla f_k(u^{k+1}) \right\| _{\mathcal {G}} \ge \varepsilon \right\}&\le \varepsilon ^{-2} {{\mathbb {E}}} \left[ \left\| \nabla f_k(u^{k+1})\right\| _{\mathcal {G}}^2 \right] \\&\le \varepsilon ^{-2} \left( \frac{2L_kC}{(2\alpha _k-\alpha _k^2)N_k}+ \frac{\alpha _k M^2}{(2-\alpha _k)m_k}\right) . \end{aligned}

Since $$N_k=\beta _k L_k$$ and (11) holds, the sum over k of the right-hand side is finite for every $$\varepsilon >0$$. By the Borel–Cantelli lemma, this implies the almost sure convergence of $$\big \{ \left\| \nabla f_k(u^{k+1})\right\| _{\mathcal {G}} \big \}$$ to zero. $$\square$$

For the choice $$t_k = 1/L_k$$ and using (16), the efficiency estimate (10) evidently simplifies to $${{\mathbb {E}}} \left[ \left\| \nabla f_k(u^{k+1}) \right\| _{\mathcal {G}}^2 \right] \le \frac{2L_k C}{N_k}+ \frac{M^2}{m_k}.$$ In the next section, we will investigate optimality of the solution in the limit $$k \rightarrow \infty$$. Since the Lipschitz constant $$L_k$$ may become unbounded due to the penalty parameter $$\mu _k$$, the maximal number of iterations $$N_k$$ needs to be balanced appropriately in this case. To obtain almost sure convergence, we required $$N_k = \beta _k L_k$$ for $$\beta _k >0$$. Alternatively, if it can be guaranteed that $$L_k$$ is bounded for all k (for instance by bounding $$\mu _k$$), then one could (asymptotically) choose $$t_k = \alpha _k/L$$ with $$L=\sup _{k} L_k$$. Regarding complexity, it is possible to establish the inner loop’s complexity as argued in [19, Section 4.2]. We define an $$(\varepsilon _k, \eta _k)$$-solution to the problem $$\min _{u \in \mathcal {U}} \, \{ f_k(u)= {\mathbb {E}}\left[ F_k(u,\varvec{\xi }) \right] \}$$ as a point $$\hat{u}$$ that satisfies $${{\mathbb {P}}} \big \{ \left\| \nabla f_k(\hat{u}) \right\| _{\mathcal {G}}^2 \le \varepsilon _k \big \} \ge 1 - \eta _k.$$ Ignoring some constants, for the choice $$t_k=1/L_k$$, the complexity can be bounded by $$\mathcal {O} \left( (\eta _k \varepsilon _k)^{-1} + M^2 \eta _k^{-2} \varepsilon _k^{-2} \right) .$$
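Condition (11) can be checked numerically for concrete schedules. The choices $$\alpha _k = 1$$ and $$\beta _k = m_k = k^2$$ below are illustrative, not prescribed by the theorem; they make each summand equal to $$2/k^2$$, so the series converges:

```python
def term(alpha, beta, m):
    # one summand of condition (11)
    return 1.0 / ((2 * alpha - alpha**2) * beta) + alpha / ((2 - alpha) * m)

# partial sums for alpha_k = 1, beta_k = m_k = k^2: each term is 2 / k^2
partials = [sum(term(1.0, k**2, k**2) for k in range(1, n + 1))
            for n in (10, 100, 1000)]
# the partial sums increase toward 2 * pi^2 / 6, so the series is finite
```

Any schedules for which both summands are summable in k would serve equally well.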

2.4 Convergence of Outer Loop

In the final part of this section, we analyze the behavior of the outer loop of Algorithm 1 adapting arguments from [23, 47]. We define an optimality measure and its induced sequence by

\begin{aligned} r(u,\varvec{\lambda }) = \left\| \nabla _{u} \mathcal {L}(u,\varvec{\lambda }) \right\| _{\mathcal {G}} + \left\| \varvec{h}(u) - \pi _{\varvec{K}}(\varvec{h}(u) + \varvec{\lambda }) \right\| _2, \quad r_k\;{:}{=}\; r(u^k,\varvec{\lambda }^k) \end{aligned}

and make the following assumptions on iterates induced by Algorithm 1.

Assumption 3

We assume that

(i)

the sequence $$\{ u^k\}$$ is a.s. contained in a bounded set $$\hat{B}_\infty$$ such that $$d (u,\tilde{u}) \le i(\mathcal {U})$$ for all $$u,\tilde{u} \in \hat{B}_\infty$$,

(ii)

$$\left\| \nabla _{u}\mathcal {L}_A(u^{k+1},\varvec{w}^k;\mu _k)\right\| _{\mathcal {G}}\rightarrow 0$$ a.s. as $$k\rightarrow \infty$$,

(iii)

$$\{(u^k,\varvec{\lambda }^k)\}$$ converges a.s. to the set of KKT points and

(iv)

for k sufficiently large, we have $$\varvec{w}^k = \varvec{\lambda }^k$$.

Note that Theorem 2.1 implies Assumption 3(ii). Assumption 3(iii) requires that every limit point of every realization of the sequence $$\{u^k, \varvec{\lambda }^k\}$$ is a KKT point. In the absence of constraint qualifications, one can still work with asymptotic KKT (AKKT) conditions; under certain conditions, it can even be shown that they are necessary conditions (see, e.g., [23, Theorem 5.3]). We will say that a feasible point $$\hat{u}$$ satisfies the AKKT conditions if there exists a sequence $$\{ u^k\}$$ such that $$\textrm{d}(u^k,\hat{u})\rightarrow 0$$ and there exists a sequence $$\{\varvec{\lambda }^k \}$$ contained in the dual cone $$\varvec{K}^{\oplus }\;{:}{=}\; \{ \varvec{y} \in {\mathbb {R}}^n:\varvec{y}^\top \varvec{k} \ge 0 \, \forall \varvec{k} \in \varvec{K}\}$$ such that

\begin{aligned} \left\| \nabla j(u^k) + \nabla \varvec{h}(u^k)^\top \varvec{\lambda }^k \right\| _{\mathcal {G}} \rightarrow 0 \quad \text { and } \quad \pi _{\varvec{K}} (-\varvec{h}(u^k))^\top \varvec{\lambda }^k \rightarrow 0 \end{aligned}
(17)

as $$k \rightarrow \infty .$$
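For intuition, the optimality measure r from (9) above and the complementarity term $$\pi _{\varvec{K}}(-\varvec{h}(u^k))^\top \varvec{\lambda }^k$$ from (17) can be evaluated in a small Euclidean example. The quadratic objective, the single inequality constraint, and the identification of $$\varvec{K}$$ with the nonpositive orthant are illustrative assumptions:

```python
import numpy as np

def r_measure(grad_L, h_val, lam):
    """Optimality measure r(u, lam) = ||grad_u L(u, lam)||
    + ||h(u) - pi_K(h(u) + lam)||, with K the nonpositive
    orthant so that pi_K(y) = min(y, 0) componentwise."""
    proj = np.minimum(h_val + lam, 0.0)
    return np.linalg.norm(grad_L) + np.linalg.norm(h_val - proj)

# toy problem: min u^2 s.t. h(u) = u - 1 <= 0, KKT point (u, lam) = (0, 0)
u_hat, lam_hat = 0.0, np.array([0.0])
grad_L = np.array([2 * u_hat + lam_hat[0]])   # gradient of the Lagrangian
h_val = np.array([u_hat - 1.0])
r_opt = r_measure(grad_L, h_val, lam_hat)     # zero at the KKT point
comp = np.minimum(-h_val, 0.0) @ lam_hat      # complementarity term in (17)
```

Both quantities vanish at the KKT point, while a nonzero Lagrangian gradient or a violated complementarity condition makes r strictly positive.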

A fundamental difference in the stochastic variant of the augmented Lagrangian method is that limit points, as limits of the stochastic process $$(u^k, \varvec{\lambda }^k)$$, are random. In the following, we will consider a fixed limit point $$(\hat{u},\hat{\varvec{\lambda }})$$ and the corresponding set of paths converging to it. This motivates the definition of the set

\begin{aligned} E_{\hat{u},\hat{\varvec{\lambda }}}\;{:}{=}\; \{ \omega \in \varOmega :(u^k(\omega ), \varvec{\lambda }^k(\omega )) \rightarrow (\hat{u}, \hat{\varvec{\lambda }}) \text { on a subsequence} \}. \end{aligned}
(18)

Note that here, and in the following analysis, $$\omega$$ represents an outcome of the random process $$(\varvec{\xi }^{1,1}, \dots , \varvec{\xi }^{1,R_1}, \varvec{\xi }^{2,1}, \dots , \varvec{\xi }^{2,R_2}, \dots )$$ induced by sampling and random stopping.

Theorem 2.2

Suppose Assumptions 1, 2, and 3(i)–(ii) are satisfied. Let $$E\;{:}{=}\; \{ \omega \in \varOmega :\{\mu _k(\omega )\} \text { is bounded}\}$$. Then, $$\{ \varvec{\lambda }^k(\omega )\}$$ is a.s. bounded on E and any limit point $$(\hat{u}, \hat{\varvec{\lambda }})$$ of $$\{(u^k(\omega ),\varvec{\lambda }^k(\omega )):\omega \in E, k \in {\mathbb {N}}\}$$ is a KKT point. On the set $$\varOmega \backslash E$$, if a limit point $$\hat{u}$$ is feasible, then it is an AKKT point.

Proof

We will make arguments in two parts, where we distinguish between the case of bounded and unbounded $$\mu _k$$.

Case 1: Bounded $$\mu _k$$. We first show that the sequence $$\{ \varvec{\lambda }^k\}$$ is a.s. bounded. Let $$\varvec{v}^{k+1}\;{:}{=}\; \varvec{h}(u^{k+1}) +\frac{\varvec{w}^k}{\mu _k}$$ and $$\varvec{y}^{k+1} \;{:}{=}\; \pi _{\varvec{K}}(\varvec{v}^{k+1})$$. By definition of $$\varvec{\lambda }^{k+1}$$, we have

\begin{aligned} \varvec{h}(u^{k+1}) = \frac{1}{\mu _k}(\varvec{\lambda }^{k+1}-\varvec{w}^k)+ \varvec{y}^{k+1}. \end{aligned}
(19)

Now, observe that the boundedness of $$\{ \mu _k\}$$ on E implies that there exists an iterate $$\bar{k}$$ in Algorithm 1 such that $$H_{k+1}\le \tau H_k \le \tau \tilde{M}$$ is satisfied for every $$k \ge \bar{k}$$ and some $$\tilde{M}>0$$. This $$\tilde{M}$$ exists since $$\varvec{h}$$ is $$\mathcal {C}^1$$ and $$u^k$$, $$\varvec{w}^k$$, and $$\mu _k$$ are all bounded by assumption. In particular, $$H_k \rightarrow 0$$ as $$k\rightarrow \infty$$ on E. In turn, (19) combined with the definition of $$H_k$$ implies the a.s. convergence of $$\Vert \varvec{\lambda }^{k+1}-\varvec{w}^k\Vert _2/\mu _k$$ to zero, which in turn implies $$\Vert \varvec{\lambda }^{k+1}-\varvec{w}^k\Vert _2 \rightarrow 0$$ as $$k\rightarrow \infty$$. The boundedness of $$\varvec{w}^k$$ guaranteed by Algorithm 1 therefore means that $$\{ \varvec{\lambda }^k\}$$ is bounded on E.

Now, we prove that for any $$\varvec{y} \in \varvec{K}$$, there exists a nonnegative sequence $$\gamma _k$$ converging to zero such that

\begin{aligned} (\varvec{y}-\varvec{h}(u^k))^\top \varvec{\lambda }^k \le \gamma _k, \quad \omega \in E, k \in {\mathbb {N}}. \end{aligned}
(20)

With [5, Theorem 3.14], the projection formula

\begin{aligned} (\varvec{v}^{k+1}-\varvec{y}^{k+1})^\top (\varvec{y}^{k+1}-\varvec{y}) \ge 0 \end{aligned}

holds for all $$\varvec{y} \in {\varvec{K}}$$, implying that $$\varvec{\lambda }^{k+1}=\mu _{k}(\varvec{v}^{k+1}-\varvec{y}^{k+1}) \in N_{\varvec{K}}(\varvec{y}^{k+1}).$$ Now, using $$\varvec{\lambda }^{k+1} \in N_{\varvec{K}}(\varvec{y}^{k+1})$$ and (19), we have

\begin{aligned} (\varvec{y}-\varvec{h}(u^{k+1}))^\top \varvec{\lambda }^{k+1}&= \left( \varvec{y}- \frac{1}{\mu _k}(\varvec{\lambda }^{k+1}-\varvec{w}^k)-\varvec{y}^{k+1}\right) ^\top \varvec{\lambda }^{k+1}\\&\le \frac{1}{\mu _k}\left( (\varvec{w}^k)^\top \varvec{\lambda }^{k+1} - \big \Vert \varvec{\lambda }^{k+1} \big \Vert _2^2 \right) \\&= (\varvec{y}^{k+1}-\varvec{h}(u^{k+1}))^\top \varvec{\lambda }^{k+1}{=}{:}\gamma _{k+1}. \end{aligned}

We have shown (20). That $$\{\gamma _k\}$$ is a.s. a null sequence follows from the fact that $$\Vert \varvec{\lambda }^{k+1}-\varvec{w}^k\Vert _2/\mu _k$$ a.s. converges to zero.

Consider a subsequence of $$\{ (u^k(\omega ),\varvec{\lambda }^k(\omega ))\}$$ that converges to a limit point $$(\hat{u},\hat{\varvec{\lambda }})$$ for a fixed $$\omega \in E_{\hat{u}, \hat{\varvec{\lambda }}}$$. We will prove that the limit point satisfies the KKT conditions (3). Continuity of $$\nabla _{u} \mathcal {L}$$ gives $$\lim _{k \rightarrow \infty } \nabla _{u} \mathcal {L}(u^k(\omega ),\varvec{\lambda }^k(\omega )) = \nabla _{u}\mathcal {L}(\hat{u}, \hat{\varvec{\lambda }})$$ and $$\big \Vert \nabla _{u} \mathcal {L}(\hat{u}, \hat{\varvec{\lambda }}) \big \Vert _{\mathcal {G}} = 0$$ due to Assumption 3(ii). By definition, $$\nabla _{u} \mathcal {L}(\hat{u},\hat{\varvec{\lambda }}) \in T_{\hat{u}}\mathcal {U}$$, and the only element in $$T_{\hat{u}}\mathcal {U}$$ having norm zero is $$0_{\hat{u}}$$; thus (3a) is fulfilled. Since $$\gamma _k \rightarrow 0$$ a.s., (20) yields $$(\varvec{y}-\varvec{h}(\hat{u}))^\top \hat{\varvec{\lambda }} \le 0$$ for all $$\varvec{y} \in \varvec{K},$$ implying that $$\hat{\varvec{\lambda }} \in N_{\varvec{K}}(\varvec{h}(\hat{u}))$$. This immediately implies (3b)–(3c).

Case 2: Unbounded $$\mu _k$$. Consider a fixed $$\omega \in \varOmega \backslash E$$ and a sequence $$\{ u^k(\omega )\}$$ such that (possibly on a subsequence that we do not relabel) $$\textrm{d}(u^k(\omega ), \hat{u}) \rightarrow 0$$ as $$k\rightarrow \infty$$. Assumption 3(ii) gives the first AKKT condition in (17). It remains to prove that $$\pi _{\varvec{K}} (-\varvec{h}(u^k(\omega )))^\top \varvec{\lambda }^k(\omega ) \rightarrow 0$$. Now, we define

\begin{aligned} \varvec{p}^k(\omega )\;{:}{=}\; (\mu _k(\omega ) \varvec{h}(u^{k+1}(\omega ))+\varvec{w}^k(\omega ))^\top \pi _{\varvec{K}}(-\varvec{h}(u^{k+1}(\omega ))). \end{aligned}

For readability, we will suppress the dependence on $$\omega$$. Since

\begin{aligned} \varvec{\lambda }^{k+1} = \mu _k \left( \varvec{h}(u^{k+1})+\frac{\varvec{w}^k}{\mu _k} - \pi _{\varvec{K}} \!\left( \varvec{h}(u^{k+1})+\frac{\varvec{w}^k}{\mu _k} \right) \right) \end{aligned}

it is evidently enough to prove $$\varvec{p}^k \rightarrow 0$$, since due to the contraction property of the projection, we have $$\pi _{\varvec{K}}(\varvec{a}^k)^\top \varvec{b}^k \rightarrow 0$$ implies $$\pi _{\varvec{K}}(\varvec{a}^k)^\top \pi _{\varvec{K}}(\varvec{b}^k) \rightarrow 0$$ for any $$\varvec{a}^k, \varvec{b}^k \in {\mathbb {R}}^n.$$ Note that at least on a subsequence, we have $$\varvec{h}(u^{k+1}) \rightarrow \varvec{h}(\hat{u})$$ and $$\left| \varvec{h}(u^{k}) \right|$$ is bounded.

Consider first the case that $$h_i(\hat{u})<0$$. Then $$\varvec{h}(u^{k+1})\rightarrow \varvec{h}(\hat{u})$$ implies that $$w_i^k+\mu _k h_i(u^{k+1})<0$$ for k sufficiently large, implying $$\varvec{p}^k \rightarrow 0$$.

Consider now the case that $$h_i(\hat{u})=0$$. For fixed k, if $$h_i(u^{k+1}) \ge 0$$, then $$p_i^k=0$$. Otherwise, if $$h_i(u^{k+1})<0$$, then $$p_i^k = (\mu _k h_i(u^{k+1}) +w_i^k) \pi _{\varvec{K}}(-h_i(u^{k+1})) \le w_i^k \left| h_i(u^{k+1}) \right|$$. If $$h_i(u^{k+1})<0$$ for infinitely many k, then $$w_i^k \left| h_i(u^{k+1}) \right| \rightarrow 0$$, meaning $$p_i^k \rightarrow 0$$.

Since $$\varvec{p}^k$$ in both cases converges to zero and $$\omega \in \varOmega \backslash E$$ was arbitrary, we have proven the claim. $$\square$$
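The multiplier update used throughout the proof, $$\varvec{\lambda }^{k+1} = \mu _k (\varvec{v}^{k+1} - \pi _{\varvec{K}}(\varvec{v}^{k+1}))$$ with $$\varvec{v}^{k+1} = \varvec{h}(u^{k+1}) + \varvec{w}^k/\mu _k$$, can be sketched for $$\varvec{K}$$ the nonpositive orthant. The constraint values, safeguard $$\varvec{w}$$, and penalty parameter below are illustrative:

```python
import numpy as np

def multiplier_update(h_val, w, mu):
    """lambda^{k+1} = mu * (v - pi_K(v)) with v = h(u^{k+1}) + w / mu,
    K the nonpositive orthant (pi_K(y) = min(y, 0) componentwise)."""
    v = h_val + w / mu
    return mu * (v - np.minimum(v, 0.0))

# violated constraint h = 0.3: the multiplier grows to w + mu * h = 4
lam_up = multiplier_update(np.array([0.3]), w=np.array([1.0]), mu=10.0)
# strictly satisfied constraint h = -0.5: the multiplier is reset to zero
lam_zero = multiplier_update(np.array([-0.5]), w=np.array([1.0]), mu=10.0)
```

For a violated inequality constraint the update reduces to the classical rule $$\lambda \mapsto w + \mu h$$, while for a strictly satisfied constraint with small $$w/\mu$$ the projection annihilates the multiplier.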

We now turn to local convergence statements. In the spirit of a local argument, we restrict our investigations to the study around a limit point for only those realizations converging to it. Again, we consider the set $$E_{\hat{u},\hat{\varvec{\lambda }}}$$ defined in (18).

Lemma 2.3

Suppose Assumptions 1–3 hold. Let $$(\hat{u}, \hat{\varvec{\lambda }})$$ be a limit point satisfying, for some $$c_1, c_2 >0$$,

\begin{aligned} c_1 r(u,\varvec{\lambda }) \le \,\textrm{d}(u,\hat{u}) + \big \Vert \varvec{\lambda }- \hat{\varvec{\lambda }} \big \Vert _2 \le c_2 r(u,\varvec{\lambda }) \end{aligned}
(21)

for all $$(u, \varvec{\lambda })$$ with u near $$\hat{u}$$ and $$r(u,\varvec{\lambda })$$ sufficiently small. Then we have for sufficiently large k

\begin{aligned} \left( 1-\frac{c_2}{\mu _k} \right) r_{k+1} \le \left\| \nabla _{u} \mathcal {L}_A(u^{k+1}, \varvec{w}^k; \mu _k)\right\| _{\mathcal {G}} + \frac{c_2}{\mu _k} r_k \quad \text {a.s. on } E_{\hat{u},\hat{\varvec{\lambda }}}. \end{aligned}

Proof

Using Lemma 2.1 and $$\varvec{w}^k = \varvec{\lambda }^k$$, we have

\begin{aligned} r_{k+1} = \big \Vert \nabla _{u} \mathcal {L}_A(u^{k+1},\varvec{\lambda }^{k};\mu _k)\big \Vert _{\mathcal {G}} + \big \Vert \varvec{h}(u^{k+1}) - \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k+1})\big \Vert _2. \end{aligned}
(22)

Let $$\varvec{v}^{k+1}\;{:}{=}\; \varvec{h}(u^{k+1}) +\frac{\varvec{w}^k}{\mu _k}$$ and $$\varvec{y}^{k+1} \;{:}{=}\; \pi _{\varvec{K}}(\varvec{v}^{k+1})$$. Then it follows that

\begin{aligned} \big \Vert \varvec{y}^{k+1} - \pi _{\varvec{K}}(\varvec{y}^{k+1}+ \varvec{\lambda }^{k+1})\big \Vert _2 = 0 \end{aligned}

since $$\varvec{\lambda }^{k+1} \in N_{\varvec{K}}(\varvec{y}^{k+1})$$ as argued in Case 1 of the proof of Theorem 2.2. Note that $$\textrm{Id}_{{\mathbb {R}}^n} - \pi _{\varvec{K}}$$ is (firmly) nonexpansive (cf. [5, Prop. 12.27]). It is an easy exercise to deduce that the mapping $$\varvec{y} \mapsto \varvec{y} - \pi _{\varvec{K}}(\varvec{y}+\varvec{\lambda }^{k+1})$$ is nonexpansive as well, from which we can conclude

\begin{aligned} \begin{aligned}&\left| \big \Vert \varvec{h}(u^{k+1})- \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k+1})\big \Vert _2 - \big \Vert \varvec{y}^{k+1} - \pi _{\varvec{K}}(\varvec{y}^{k+1} + \varvec{\lambda }^{k+1})\big \Vert _2 \right| \\&\quad \le \big \Vert \varvec{h}(u^{k+1})- \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k+1})-\varvec{y}^{k+1} + \pi _{\varvec{K}}(\varvec{y}^{k+1} + \varvec{\lambda }^{k+1})\big \Vert _2 \\&\quad \le \left\| \varvec{h}(u^{k+1}) - \varvec{y}^{k+1}\right\| _2. \end{aligned} \end{aligned}

Using the definition of $$\varvec{y}^{k+1}$$ and $$\varvec{w}^k=\varvec{\lambda }^k$$, notice that

\begin{aligned} \begin{aligned}&\big \Vert \varvec{h}(u^{k+1})- \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k+1})\big \Vert _2 \\&\le \big \Vert \varvec{h}(u^{k+1}) - \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k}/\mu _k)\big \Vert _2\\&=\frac{1}{\mu _k} \big \Vert \mu _k \varvec{h}(u^{k+1}) +\varvec{\lambda }^k - \mu _k \pi _{\varvec{K}}(\varvec{h}(u^{k+1})+\varvec{\lambda }^{k}/\mu _k) -\varvec{\lambda }^k \big \Vert _2\\&= \frac{1}{\mu _k} \big \Vert \varvec{\lambda }^{k+1} - \varvec{\lambda }^k\big \Vert _2. \end{aligned} \end{aligned}

Returning to (22), we obtain

\begin{aligned} r_{k+1} \le \big \Vert \nabla _{u} \mathcal {L}_A(u^{k+1},\varvec{\lambda }^{k};\mu _k)\big \Vert _{\mathcal {G}} + \frac{1}{\mu _k} \left( \big \Vert \varvec{\lambda }^{k+1} - \hat{\varvec{\lambda }}\big \Vert _2 + \big \Vert \varvec{\lambda }^k - \hat{\varvec{\lambda }} \big \Vert _2\right) . \end{aligned}
(23)

Since $$\lim _{k\rightarrow \infty } \textrm{d}(u^k, \hat{u}) = 0$$ a.s. on $$E_{\hat{u}, \hat{\varvec{\lambda }}}$$, for any $$\varepsilon >0$$ there exists $$\bar{k}$$ such that $$\textrm{d}(u^k, \hat{u}) < \varepsilon$$ for all $$k\ge \bar{k}$$ a.s. Possibly choosing $$\bar{k}$$ even larger, the error bound (21) combined with the positive injectivity radius further implies $$\big \Vert \varvec{\lambda }^k(\omega ) - \hat{\varvec{\lambda }}\big \Vert _2 \le c_2 r_k$$ for almost all $$\omega \in E_{\hat{u}, \hat{\varvec{\lambda }}}$$. Using (23), we conclude that for almost all $$\omega \in E_{\hat{u}, \hat{\varvec{\lambda }}}$$,

\begin{aligned} r_{k+1} \le \big \Vert \nabla _{u} \mathcal {L}_A(u^{k+1}(\omega ),\varvec{\lambda }^{k}(\omega );\mu _k(\omega ))\big \Vert _{\mathcal {G}} + \frac{1}{\mu _k} (c_2 r_{k+1} + c_2 r_k), \end{aligned}

for k large enough. Rearranging terms proves the claim. $$\square$$

We are now ready to show the local rate of convergence. We recall the definitions of order and rate of convergence for the convenience of the reader: A sequence $$\{r_k\}$$ converging to $$r^*$$ is said to have order of convergence $$s \ge 1$$ and rate of convergence q if

\begin{aligned} {\displaystyle \lim _{k\rightarrow \infty }{\frac{\left| r_{k+1}-r^{*}\right| }{\left| r_{k}-r^{*}\right| ^{s}}}=q.} \end{aligned}

Linear convergence occurs in the case $$s=1$$ and $$q\in (0,1)$$. Moreover, superlinear convergence occurs in all cases where $$s>1$$, as well as in the case where $$s=1$$ and $$q=0$$.
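These definitions can be tested numerically on model sequences. The geometric and doubly-geometric sequences below are illustrative; with $$r^* = 0$$, consecutive ratios estimate q for $$s = 1$$:

```python
# linear convergence: r_k = 0.5^k has order s = 1 and rate q = 0.5
linear = [0.5**k for k in range(1, 30)]
ratios = [b / a for a, b in zip(linear, linear[1:])]
q_est = ratios[-1]

# quadratic convergence: r_{k+1} = r_k^2, so the s = 1 ratios tend to 0,
# which falls under the superlinear case s = 1, q = 0
quadratic = [0.5**(2**k) for k in range(1, 6)]
super_ratios = [b / a for a, b in zip(quadratic, quadratic[1:])]
```

The first sequence yields a constant ratio of 0.5 (linear), while the second produces ratios shrinking toward zero (superlinear).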

Theorem 2.3

Under the same assumptions as Lemma 2.3, assume further that $$\big \Vert \nabla _{\varvec{u}} \mathcal {L}_A(\varvec{u}^{k+1},\varvec{\lambda }^{k};\mu _k)\big \Vert _{\mathcal {G}} = o(r_k).$$ Then

1)

There exists $$\hat{\mu }_q >0$$ such that if $$\mu _k \ge \hat{\mu }_q$$ for k sufficiently large, then $$\{(\varvec{u}^k, \varvec{\lambda }^k)\}$$ converges linearly to $$(\hat{\varvec{u}}, \hat{\varvec{\lambda }})$$ a.s. on $$E_{\hat{\varvec{u}},\hat{\varvec{\lambda }}}$$ with convergence rate $$q\in (0,1)$$.

2)

If $$\mu _k \rightarrow \infty$$, then $$(\varvec{u}^k, \varvec{\lambda }^k) \rightarrow (\hat{\varvec{u}}, \hat{\varvec{\lambda }})$$ a.s. on $$E_{\hat{\varvec{u}},\hat{\varvec{\lambda }}}$$ at a superlinear rate.

Proof

Note that for k large enough, we have $$\varvec{w}^k = \varvec{\lambda }^k$$ and Lemma 2.3 gives

\begin{aligned} \left( 1-\frac{c_2}{\mu _k} \right) r_{k+1} \le \left\| \nabla _{\varvec{u}} \mathcal {L}_A(\varvec{u}^{k+1}, \varvec{w}^k; \mu _k)\right\| _{{\mathcal {G}}} + \frac{c_2}{\mu _k} r_k = o(r_k) + \frac{c_2}{\mu _k} r_k. \end{aligned}

Taking $$\mu _k$$ such that $$\mu _k-c_2>0$$ gives $$r_{k+1} \le \frac{\mu _k}{\mu _k - c_2} \left( o(r_k) + \frac{c_2}{\mu _k} r_k \right) .$$ This implies

\begin{aligned} \frac{r_{k+1}}{r_k} \le \frac{\mu _k}{\mu _k - c_2} \left( o(1) + \frac{c_2}{\mu _k} \right) =\frac{c_2}{\mu _k - c_2} + o(1). \end{aligned}

Thanks to the error bound (21), we get the corresponding rates for $$\{(\varvec{u}^k, \varvec{\lambda }^k)\}$$. $$\square$$

In practice, the assumption $$\big \Vert \nabla _{\varvec{u}} \mathcal {L}_A(\varvec{u}^{k+1},\varvec{\lambda }^{k};\mu _k)\big \Vert _{{\mathcal {G}}} = o(r_k)$$ is difficult to enforce, since one can only work with estimates $$\hat{f}_k \approx {\mathbb {E}}\big [L_A(\varvec{u}^{k+1}, \varvec{\lambda }^k, \varvec{\xi }; \mu _k) \big ] = \mathcal {L}_A(\varvec{u}^{k+1}, \varvec{\lambda }^k; \mu _k).$$ However, (10) guarantees a convergence rate in expectation, which can be used to choose appropriate sequences $$N_k$$ and $$m_k$$. A possible heuristic is shown in the following section.

3 Application and Numerical Results

In this section, we present an application to a two-dimensional fluid-mechanical problem to demonstrate the algorithm. We denote the hold-all domain as $$D=D(\varvec{u})$$, which is partitioned into $$N+1$$ disjoint subdomains $$D_1, \ldots , D_{N+1}$$, where $$D_{N+1}$$ represents the subdomain in which fluid is allowed to flow, and the other sets are obstacles around which the fluid is supposed to flow. The subdomain boundaries are defined as $$\partial D_1 = u_1$$, $$\ldots$$, $$\partial D_N = u_N$$, and $$\partial D_{N+1} = \varGamma \cup u_1 \cup \cdots \cup u_N$$, where $$\varGamma$$ is the outer boundary that is fixed and split into two disjoint parts $$\varGamma _D$$ and $$\varGamma _N$$ representing the Dirichlet and Neumann boundary, respectively.

In [15], a shape is seen as a point on an abstract manifold so that a collection of shapes can be viewed as a vector of points $$\varvec{u}=(u_1, \dots , u_N)$$ in a product manifold $$\mathcal {U}^N = \mathcal {U}_1 \times \cdots \times \mathcal {U}_N$$, where $$\mathcal {U}_i$$ are Riemannian manifolds for all $$i=1,\dots ,N$$. In the following, our shapes are the above-mentioned obstacles, leading to a multi-shape optimization problem. Since a product manifold is itself a manifold, all theoretical findings from Sect. 2 also apply to product manifolds. We will work with a (possibly infinite-dimensional) connected Riemannian product manifold $$(\mathcal {U},\mathcal {G})=(\mathcal {U}^N, \mathcal {G}^N)$$. As described in [15], the tangent space $$T\mathcal {U}^N$$ can be identified with the product of tangent spaces $$T\mathcal {U}_1\times \cdots \times T\mathcal {U}_N$$ via $$T_{\varvec{u}} \mathcal {U}^N \cong T_{u_1} \mathcal {U}_1 \times \cdots \times T_{u_N} \mathcal {U}_N.$$ Additionally, the product metric $$\mathcal {G}^N$$ on the corresponding product shape space $$\mathcal {U}^N$$ can be defined via $$\mathcal {G}^N=(\mathcal {G}^N_{\varvec{u}})_{\varvec{u}\in \mathcal {U}^N}$$, where

\begin{aligned} \mathcal {G}^N_{\varvec{u}}(\varvec{v},\varvec{w}) =\sum _{i=1}^{N} \mathcal {G}_{\pi _i(\varvec{u})}^{i}(\pi _{i_*}\varvec{v},\pi _{i_*}\varvec{w})\qquad \forall \, \varvec{v},\varvec{w} \in T_{\varvec{u}} \mathcal {U}^N \end{aligned}
(24)

and $$\pi _i:\mathcal {U}^N\rightarrow \mathcal {U}_i$$, $$i=1, \dots , N$$, correspond to canonical projections. If we work with multiple shapes $$\varvec{u}$$, the exponential map in Algorithm 1 needs to be replaced by the so-called multi-exponential map. Let $$V_{\varvec{u}}^N\;{:}{=}\; V_{u_1}\times \cdots \times V_{u_N}$$, where $$V_{u_i}\;{:}{=}\; \{v_i\in T_{u_i}\mathcal {U}_i:1\in I_{u_i,v_i}^{\mathcal {U}_i}\}$$ for all $$i=1,\dots ,N$$. Then, we define the multi-exponential map by $$\exp _{\varvec{u}}^N:V_{\varvec{u}}^N\rightarrow \mathcal {U}^N,\, \varvec{v}=(v_1,\dots ,v_N)\mapsto (\exp _{u_1}v_1,\dots ,\exp _{u_N}v_N)$$ for the vector $$\varvec{u}=(u_1,\dots ,u_N)$$, where $$\exp _{u_i}:V_{u_i}\rightarrow \mathcal {U}_i,\,v_i\mapsto \exp _{u_i}(v_i)$$ for all $$i=1,\dots ,N$$.
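As a finite-dimensional illustration of the multi-exponential map (on a product of unit circles $$S^1 \subset {\mathbb {R}}^2$$ rather than the infinite-dimensional shape space used below), the map simply acts componentwise:

```python
import numpy as np

def exp_circle(p, v):
    """Exponential map on the unit circle S^1 embedded in R^2: p lies on
    the circle and v is tangent at p (orthogonal to p)."""
    t = np.linalg.norm(v)
    if t == 0.0:
        return p
    return np.cos(t) * p + np.sin(t) * (v / t)

def multi_exp(points, tangents):
    """Multi-exponential map on the product manifold (S^1)^N: apply the
    factor-wise exponential map in each component."""
    return [exp_circle(p, v) for p, v in zip(points, tangents)]

p = np.array([1.0, 0.0])
v = np.array([0.0, np.pi / 2])          # tangent vector of length pi / 2
out = multi_exp([p, p], [v, np.zeros(2)])
# the first factor rotates a quarter turn; the second factor stays put
```

Each factor moves along its own geodesic, independently of the others, which is exactly the componentwise structure exploited by the algorithm on product shape spaces.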

The shape space we consider in the numerical experiments is the product space of plane unparametrized curves, i.e., $$\mathcal {U}^N=B_e^N(S^1,{\mathbb {R}}^2)$$. The shape space $$B_e(S^1,{\mathbb {R}}^2)$$ is defined as the orbit space of $$\textrm{Emb}(S^1,\mathbb {R}^2)$$ under the action by composition from the right of the Lie group $$\textrm{Diff}(S^1)$$, i.e., $$B_e(S^1,{\mathbb {R}}^2) \;{:}{=}\; \text {Emb}(S^1,\mathbb {R}^2) / \text {Diff}(S^1)$$ (cf., e.g., [37]). Here, $$\textrm{Emb}(S^1,{\mathbb {R}}^2)$$ denotes the set of all embeddings from the unit circle $$S^1$$ into $${\mathbb {R}}^2$$, and $$\textrm{Diff}(S^1)$$ is the set of all diffeomorphisms from $$S^1$$ into itself. In [28], it is proven that the shape space $$B_e(S^1,{\mathbb {R}}^2)$$ is a smooth manifold; together with appropriate inner products, it is even a Riemannian manifold. In our numerical experiments, we choose the Steklov–Poincaré metric defined in [43]. Originally, it is defined as a mapping between Sobolev spaces. To obtain a metric on $$B_e(S^1,{\mathbb {R}}^2)$$, the Steklov–Poincaré metric is restricted to the tangent spaces, i.e., to a mapping $$T_u B_e(S^1,{\mathbb {R}}^2) \times T_u B_e(S^1,{\mathbb {R}}^2) \rightarrow {\mathbb {R}}$$, where $$T_uB_e(S^1,\mathbb {R}^{2})\cong \left\{ h:h=\alpha \varvec{n},\, \alpha \in \mathcal {C}^\infty (S^{1})\right\}$$. Of course, one can choose a different metric on the shape space to represent the shape gradient. We focus on the Steklov–Poincaré metric due to its advantages in combination with the computational mesh (cf. [43, 46]).

The physical system on D is described by the Stokes equations under uncertainty. Note that here, flow is modeled on the domain D instead of $$D_{N+1}$$. This is done (in view of the tracking-type functional) to produce a shape derivative on the entire domain. Let $$V(D) = \left\{ \varvec{q}\in H^1( D, {\mathbb {R}}^2):\varvec{q}\vert _{\varGamma _D\cup \varvec{u}} = \varvec{0} \right\}$$ denote the function space associated to the velocity for a fixed domain D. We neglect volume forces and consider a deterministic viscosity of the fluid. Inflow $$\varvec{g}$$ on parts of the Dirichlet boundary is assumed to be uncertain and is modeled as a random field $$\varvec{g}:D \times \varXi \rightarrow {\mathbb {R}}^2$$ with regularity $$\varvec{g} \in L_{{\mathbb {P}}}^2(\varXi , H^1( D, {\mathbb {R}}^2))$$ and depending on $$\varvec{\xi } :\varOmega \rightarrow \varXi \subset {\mathbb {R}}^m$$. We will use the abbreviation $$\varvec{g}_{\varvec{\xi }} = \varvec{g}(\cdot ,\varvec{\xi })$$. For each realization $$\varvec{\xi }$$, consider Stokes flow in weak form: find $$\varvec{q}_{\varvec{\xi }} \in H^1( D, {\mathbb {R}}^2)$$ and $$p_{\varvec{\xi }} \in L^2( D)$$ such that $$\varvec{q}_{\varvec{\xi }}-\varvec{g}_{\varvec{\xi }} \in V(D)$$ and

\begin{aligned} \int _{D} \nabla \varvec{q}_{\varvec{\xi }} : \nabla \varvec{\varphi }- p_{\varvec{\xi }} {{\,\textrm{div}\,}}{\varvec{\varphi }} \,\textrm{d} \varvec{x}&= 0 \quad \forall \varvec{\varphi }\in V(D), \end{aligned}
(25a)
\begin{aligned} \int _{D} \psi {{\,\textrm{div}\,}}{\varvec{q}}_{\varvec{\xi }} \,\textrm{d} \varvec{x}&= 0 \quad \forall \psi \in L^2(D). \end{aligned}
(25b)

Here, $$\varvec{A}: \varvec{B} = \sum _{j=1}^{d} \sum _{k=1}^{d} A_{j k} B_{j k}$$ for two matrices $$\varvec{A},\varvec{B} \in {\mathbb {R}}^{d\times d}$$. The gradient and divergence operators $$\nabla$$ and $${{\,\textrm{div}\,}}$$ act with respect to the spatial variable only with $$\varvec{\xi }$$ acting as a parameter.
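As a quick sanity check of this notation, the Frobenius inner product coincides with $$\textrm{trace}(\varvec{A}^\top \varvec{B})$$; the matrices below are arbitrary examples:

```python
import numpy as np

# Frobenius inner product A : B = sum_{j,k} A_jk B_jk from the weak form (25a)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
frob = np.sum(A * B)                 # elementwise product, then sum
# equivalent formulation via the trace: A : B = trace(A^T B)
```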

For each shape $$u_i$$, $$i=1,\ldots ,N$$, we introduce one inequality constraint for the volume (see Eq. (27a)) and one inequality constraint for the perimeter (see Eq. (27b)). The volume of the domain $$D_{i}$$ is given by $${{\,\textrm{vol}\,}}(D_i) = \int _{D_i} 1 \,\textrm{d} \varvec{x}$$ and the perimeter of $$u_i$$ is given by $${{\,\textrm{peri}\,}}(u_i) = \int _{u_i} 1 \,\textrm{d} \varvec{s}.$$ Now, we suppose there is a deterministic target velocity $$\bar{\varvec{q}}$$ to be reached on the domain D. We would like to determine the optimal placement of shapes that comes closest on average to this velocity field. More precisely, we solve the problem

\begin{aligned} \min _{\varvec{u}\in B_e^N(S^1,{\mathbb {R}}^2)} \, \left\{ j(\varvec{u}) = \int _\varOmega { \int _{D} \big \Vert \varvec{q}_{\varvec{\xi }(\omega )}(\varvec{x}) + \varvec{g}_{\varvec{\xi }(\omega )}(\varvec{x})- \bar{\varvec{q}}(\varvec{x}) \big \Vert _2^2 \,\textrm{d} \varvec{x} \,\textrm{d}{\mathbb {P}}(\omega )} \right\} \end{aligned}
(26)

subject to (25) and

\begin{aligned} {{\,\textrm{vol}\,}}(D_i)&\ge \underline{\mathcal {V}}_i&\forall i=1,\ldots ,N , \end{aligned}
(27a)
\begin{aligned} {{\,\textrm{peri}\,}}(u_i)&\le \overline{\mathcal {P}}_{i}&\forall i=1,\ldots ,N. \end{aligned}
(27b)

We note that a deterministic model using a tracking-type functional in combination with Stokes flow has been studied in [9].

3.1 Shape Derivative

In the following, we compute the shape derivative of the parametrized augmented Lagrangian corresponding to the model problem defined by (25)–(27). We define $$\varvec{h}:B_e^N(S^1,{\mathbb {R}}^2) \rightarrow {\mathbb {R}}^{2 N}$$ by

\begin{aligned} \varvec{h}(\varvec{u}) = \begin{pmatrix} \varvec{h}_{V}(\varvec{u}) \\ \varvec{h}_{\mathcal {P}}(\varvec{u}) \end{pmatrix} = \begin{pmatrix} \left[ \underline{\mathcal {V}}_i - {{\,\textrm{vol}\,}}(D_i) \right] _{i\in \{1,\dots ,N\}} \\ \left[ {{\,\textrm{peri}\,}}(u_i) - \overline{\mathcal {P}}_i \right] _{i\in \{1,\dots ,N\}} \end{pmatrix}, \end{aligned}

as well as the set $$\varvec{K} \;{:}{=}\; \{ \varvec{h} \in {\mathbb {R}}^{2 N}:h_i \le 0 \,\, \forall i=1, \ldots , 2 N\}$$ and the objective $$J(\varvec{u},\varvec{\xi })\;{:}{=}\; \int _{D} \left\| \varvec{q}_{\varvec{\xi }}(\varvec{x}) + \varvec{g}_{\varvec{\xi }}(\varvec{x})- \bar{\varvec{q}}(\varvec{x}) \right\| _2^2 \,\textrm{d} \varvec{x}.$$ The parametrized augmented Lagrangian is defined by

\begin{aligned} \begin{aligned} L_{A}(\varvec{u}, \varvec{\lambda },\varvec{\xi };\mu )&= J(\varvec{u},\varvec{\xi })+ \int _{D} \nabla \varvec{q}_{\varvec{\xi }} : \nabla \varvec{\varphi }_{\varvec{\xi }} - p_{\varvec{\xi }} {{\,\textrm{div}\,}}{\varvec{\varphi }}_{\varvec{\xi }} + \psi _{\varvec{\xi }} {{\,\textrm{div}\,}}{\varvec{q}}_{\varvec{\xi }} \,\textrm{d} \varvec{x} \\&\quad + \frac{\mu }{2} {{\,\textrm{dist}\,}}_{\varvec{K}} \!\left( \varvec{h}(\varvec{u}) + \frac{\varvec{\lambda }}{\mu } \right) ^2 - \frac{\left\| \varvec{\lambda }\right\| _2^2}{2 \mu }. \end{aligned} \end{aligned}
(28)
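Since $$\varvec{K}$$ is the nonpositive orthant, the projection $$\pi _{\varvec{K}}$$ and the distance $$ {{\,\textrm{dist}\,}}_{\varvec{K}}$$ appearing in (28) reduce to componentwise operations; a minimal numpy sketch of the penalty part of (28) (function names are hypothetical):

```python
import numpy as np

def proj_K(v):
    """Projection onto K = {h : h_i <= 0}: clip positive components to zero."""
    return np.minimum(v, 0.0)

def dist_K(v):
    """Euclidean distance to K: norm of the positive parts of v."""
    return np.linalg.norm(v - proj_K(v))

def penalty_term(h, lam, mu):
    """Penalty part of (28): mu/2 * dist_K(h + lam/mu)^2 - ||lam||^2 / (2 mu)."""
    v = h + lam / mu
    return 0.5 * mu * dist_K(v) ** 2 - np.dot(lam, lam) / (2.0 * mu)
```

Only the violated (positive) components of $$\varvec{h}(\varvec{u}) + \varvec{\lambda }/\mu$$ contribute to the quadratic penalty, which is what makes the augmented Lagrangian differentiable across the constraint boundary.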

Differentiating the Lagrangian (28) with respect to $$\left( \varvec{q},p\right)$$ and setting the derivative to zero gives the weak form of the adjoint equation: find $$\varvec{\varphi }_{\varvec{\xi }} \in V(D)$$ and $$\psi _{\varvec{\xi }} \in L^2(D)$$ such that

\begin{aligned} \int _{D} 2 \tilde{\varvec{\varphi }}^\top \left( \varvec{q}_{\varvec{\xi }} + \varvec{g}_{\varvec{\xi }} - \bar{\varvec{q}} \right) + \nabla \varvec{\varphi }_{\varvec{\xi }} : \nabla \tilde{\varvec{\varphi }}+ \psi _{\varvec{\xi }} {{\,\textrm{div}\,}}{\tilde{\varvec{\varphi }}} \,\textrm{d} \varvec{x}&= 0 \quad \forall \tilde{\varvec{\varphi }} \in V(D), \end{aligned}
(29a)
\begin{aligned} \int _{D} {{\,\textrm{div}\,}}{\varvec{\varphi }}_{\varvec{\xi }} \,\tilde{\psi } \,\textrm{d} \varvec{x}&= 0 \quad \forall \tilde{\psi } \in L^2(D). \end{aligned}
(29b)

We define the space $$\mathcal {W}(D) = \{ \varvec{W} \in H^1(D, {\mathbb {R}}^2):\varvec{W}\vert _{\varGamma }=0 \}$$. We have the shape derivative

\begin{aligned}&\textrm{d}_{\varvec{u}}L_A (\varvec{u}, \varvec{\lambda }, \varvec{\xi };\mu )\left[ \varvec{W}\right] \nonumber \\&\quad = \int _{D} - \left( \nabla \varvec{q}_{\varvec{\xi }} \nabla \varvec{W} \right) : \nabla \varvec{\varphi }_{\varvec{\xi }} - \left( \nabla \varvec{\varphi }_{\varvec{\xi }} \nabla \varvec{W} \right) : \nabla \varvec{q}_{\varvec{\xi }} + \left( p_{\varvec{\xi }} {\nabla \varvec{\varphi }_{\varvec{\xi }}}^\top - \psi _{\varvec{\xi }} {\nabla \varvec{q}_{\varvec{\xi }}}^\top \right) : \nabla \varvec{W} \nonumber \\&\qquad + {{\,\textrm{div}\,}}{(}\varvec{W}) \left( \left\| \varvec{q}_{\varvec{\xi }} + \varvec{g}_{\varvec{\xi }}- \bar{\varvec{q}} \right\| _2^2 + \nabla \varvec{q}_{\varvec{\xi }} : \nabla \varvec{\varphi }_{\varvec{\xi }} - p_{\varvec{\xi }} {{\,\textrm{div}\,}}{\varvec{\varphi }}_{\varvec{\xi }} + \psi _{\varvec{\xi }} {{\,\textrm{div}\,}}{\varvec{q}}_{\varvec{\xi }} \right) \!\,\textrm{d} \varvec{x}\nonumber \\&\qquad + \mu \left( \left( \varvec{h}(\varvec{u}) + \frac{\varvec{\lambda }}{\mu } \right) - \pi _{\varvec{K}} \!\left( \varvec{h}(\varvec{u}) + \frac{\varvec{\lambda }}{\mu } \right) \right) ^\top \nonumber \\&\qquad \times \begin{pmatrix} \left[ \int _{D_i} {{\,\textrm{div}\,}}{(\varvec{W})} \,\textrm{d} \varvec{x} \right] _{i\in \{1,\dots ,N\}} \\ \left[ \int _{u_i} {{\,\textrm{div}\,}}{(\varvec{W})} - \varvec{n}^\top \nabla \varvec{W} \varvec{n} \,\textrm{d} \varvec{s} \right] _{i\in \{1,\dots ,N\}} \end{pmatrix}, \end{aligned}

where $$\left( \varvec{q}_{\varvec{\xi }},p_{\varvec{\xi }}\right)$$ and $$\left( \varvec{\varphi }_{\varvec{\xi }},\psi _{\varvec{\xi }}\right)$$ solve the state Eq. (25) and adjoint Eq. (29), respectively. The shape derivative is needed to represent the gradient with respect to the metric under consideration (cf., e.g., [15]). As described in [15], we can use the multi-shape derivative in an “all-at-once”-approach to compute the multi-shape gradient with respect to the Steklov–Poincaré metric and the mesh deformation $$\varvec{V}= \varvec{V}_{\varvec{\xi }}$$ all at once by solving

\begin{aligned} a(\varvec{V}, \varvec{W}) = \textrm{d}_{\varvec{u}}L_A (\varvec{u},\varvec{\lambda }, \varvec{\xi };\mu )[\varvec{W}] \quad \forall \varvec{W}\in \mathcal {W}(D)\cap \mathcal {C}^\infty (D,{\mathbb {R}}^2), \end{aligned}
(30)

where a is a coercive and symmetric bilinear form. The mesh deformation $$\varvec{V}$$ calculated from (30) can be viewed as an extension of the multi-shape gradient $$\varvec{v}$$ with respect to the Steklov–Poincaré metric to the hold-all domain D (for details we refer the reader to [15]).

The bilinear form that describes linear elasticity is a common choice for a due to the advantageous effect on the computational mesh (cf. [46, 48]), and is selected for the following numerical studies. The Lamé parameters are chosen as $$\hat{\lambda }=0$$ and $$\hat{\mu }$$ smoothly decreasing from 33 on $$\varvec{u}$$ to 10 on $$\varGamma$$, as obtained by the solution of Poisson’s equation on D.

To update the shapes according to Algorithm 1, we need to compute the multi-exponential map. This computation is prohibitively expensive in most applications because a calculus of variations problem must be solved or the Christoffel symbols need to be known. Therefore, we approximate it using a multi-retraction

\begin{aligned} \mathcal {R}_{\varvec{z}^{k,j}}^N:T_{\varvec{z}^{k,j}}\mathcal {U}^N\rightarrow \mathcal {U}^N,\, \varvec{v}=(v_1,\dots ,v_N)\mapsto (\mathcal {R}_{z^{k,j}_1}v_1,\dots ,\mathcal {R}_{z^{k,j}_N}v_N) \end{aligned}

to update the shape vector $$\varvec{z}^{k,j}=(z^{k,j}_1,\dots ,z^{k,j}_N)$$ in each iteration pair $$(k,j)$$. For each shape $$z_i^{k,j}$$ we use the retraction in [14, 15, 44]: $$\mathcal {R}_{z_i^{k,j}}:T_{z_i^{k,j}}\mathcal {U}^i \rightarrow \mathcal {U}^i,\, v_i \mapsto z_i^{k,j}+v_i$$ for all $$i=1,\dots ,N$$.
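Since the retraction is simply pointwise addition of the deformation to the shape, the multi-retraction acts componentwise on the product manifold; a minimal sketch, assuming each discretized shape is represented as an array of boundary points (a discretization chosen here only for illustration):

```python
import numpy as np

def retract(z, v):
    """Retraction R_z(v) = z + v for a single discretized shape."""
    return z + v

def multi_retract(zs, vs):
    """Componentwise multi-retraction on the product manifold U^N."""
    return [retract(z, v) for z, v in zip(zs, vs)]

# two shapes, each given by 4 boundary points in R^2
zs = [np.zeros((4, 2)), np.ones((4, 2))]
vs = [0.1 * np.ones((4, 2)), -0.1 * np.ones((4, 2))]
new_shapes = multi_retract(zs, vs)
```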

3.2 Numerical Results

All numerical simulations were performed on the HPC cluster HSUper using the FEniCS toolbox, version 2019.1.0 [2], and Python 3.10.10. The hold-all domain is chosen as $$D=(0,1)^2$$. We choose $$N=3$$ shapes inside the hold-all domain, which can be seen on the left-hand side of Fig. 1. The computational mesh is generated with Gmsh 4.11.1 [17], which yields 265 line elements for the outer boundary and the interfaces, and 3803 triangular elements as the discretization of D. Additionally, a new mesh was automatically generated if the mesh quality fell below a threshold of $$40\%$$. A reevaluation of all relevant values within the optimization (e.g., objective functional and geometrical constraints) after remeshing ensures that the optimization can continue. It has already been observed that this increases the number of optimization iterations (cf., e.g., [40]), but is difficult to avoid due to quickly deteriorating meshes. The target velocity is shown in Fig. 1 on the right, together with the shapes used to obtain the target velocity in white. Standard Taylor–Hood elements are used.

The values of the geometrical constraints were chosen in accordance with the shapes of the target velocity. The volumes of $$D_1$$, $$D_2$$ and $$D_3$$ were constrained to be at or above 0.035295, 0.025397 and 0.036967, and the perimeters of $$u_1$$, $$u_2$$ and $$u_3$$ to be at or below 0.72630, 0.56521 and 0.69796, respectively. The augmented Lagrangian parameters in Algorithm 1 were initialized to $$\varvec{\lambda }^1=\varvec{0}$$, $$\mu _1=10$$, $$\gamma =10$$, and $$\tau =0.9$$. The ball for the projection of Lagrange multipliers was chosen to be $$B=[-100,100]^{2N}$$.
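The projection of the Lagrange multipliers onto $$B=[-100,100]^{2N}$$ amounts to componentwise clipping; a hedged sketch of a first-order safeguarded multiplier update consistent with the projection formula for $$\varvec{K}$$ (the exact update rule is the one specified in Algorithm 1; the form below is an assumption for illustration):

```python
import numpy as np

def update_multiplier(lam, h, mu, bound=100.0):
    """Sketch of a safeguarded multiplier update: first-order step, then
    projection onto the box B = [-bound, bound]^{2N} by componentwise clipping.
    Since K is the nonpositive orthant, the step reduces to max(0, lam + mu*h)."""
    lam_new = np.maximum(lam + mu * h, 0.0)
    return np.clip(lam_new, -bound, bound)
```

With $$\mu _1=10$$ and $$\varvec{\lambda }^1=\varvec{0}$$, only violated constraints ($$h_i > 0$$) produce nonzero multipliers after the first outer iteration.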

We chose homogeneous Dirichlet boundary conditions for the velocity on the top and bottom boundaries and on $$\varvec{u}$$ (see Fig. 1, right). The inflow profile on the left boundary is modeled as an inhomogeneous Dirichlet boundary condition with $$\varvec{g}_{\varvec{\xi }}(\varvec{x}) = (\kappa (\varvec{x},\varvec{\xi }), 0)^\top$$. The horizontal component is given by the truncated Karhunen–Loève expansion

\begin{aligned} \kappa (\varvec{x},\varvec{\xi }) = -4 x_2 (x_2-1) + \sum _{\ell =1}^{100} \ell ^{-\eta -1/2} \sin (2 \pi \ell (x_2-1/2)) \xi _\ell , \end{aligned}

where $$\eta =3.5$$ and $$\xi _\ell \sim U\!\left[ -\frac{1}{2}, \frac{1}{2}\right]$$ (U[ab] being the uniform distribution on the interval [ab]). We used numpy.random from numpy 1.22.4 for the generation of all random values. For this, rng=numpy.random.default_rng(seed) is used to set the generator and then the random samples are drawn by calling rng.uniform(lowerBound, upperBound, shape). The lower and upper bounds correspond to those of the uniform distribution. The shape of the matrix of random values was set to $$(100, m_k)$$, yielding $$100 \times m_k$$ random values per stochastic gradient step, generated row by row. We chose the four different seeds 964113, 454612, 421507 and 107785. Parallelization of multiple realizations was performed via MPI using mpi4py version 3.1.4, which distributed the matrix to the $$m_k$$ processes column-wise. On the right boundary, a homogeneous Neumann boundary condition is imposed. The step size is chosen as $$t_k=\frac{20}{\mu _k}$$, the scaling of which is obtained by tuning (to avoid deterioration of the mesh, especially in the first steps of the inner loop procedure). The maximum number of inner loop iterations is chosen to be $$N_k=c_1 \cdot 2^k$$, with $$c_1=4$$ or $$c_1=25$$. The batch size is increased according to $$m_k=c_2 \cdot 2^k$$, with $$c_2=\frac{1}{2}$$ or $$c_2=5$$. Each inner loop k requires $$m_k\cdot R_k$$ solutions of the state equation, the adjoint equation, the Poisson equation for the Lamé parameter, and the deformation equation, which becomes computationally expensive for high k.
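The sampling of the inflow described above can be sketched directly with numpy's default_rng; a minimal version evaluating the truncated Karhunen–Loève expansion on a grid of $$x_2$$ values (the grid resolution is an assumption for illustration):

```python
import numpy as np

def kappa(x2, xi, eta=3.5):
    """Horizontal inflow: parabolic mean profile plus truncated KL expansion.
    x2 : array of vertical coordinates in [0, 1]
    xi : vector of 100 samples, each uniform on [-1/2, 1/2]
    """
    ell = np.arange(1, 101)                                # KL modes l = 1..100
    mean = -4.0 * x2 * (x2 - 1.0)                          # deterministic mean inflow
    basis = np.sin(2.0 * np.pi * np.outer(x2 - 0.5, ell))  # sin(2*pi*l*(x2 - 1/2))
    return mean + (basis * ell ** (-eta - 0.5)) @ xi       # decay l^(-eta-1/2)

m_k = 4                                    # batch size of the current inner loop
rng = np.random.default_rng(964113)        # one of the seeds used in the paper
xi = rng.uniform(-0.5, 0.5, (100, m_k))    # 100 x m_k matrix, generated row by row
x2 = np.linspace(0.0, 1.0, 11)
samples = np.column_stack([kappa(x2, xi[:, i]) for i in range(m_k)])
```

Each column of `samples` is one realization of the inflow profile; in the MPI setup described above, each of the $$m_k$$ processes would receive one column of `xi`.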

The obtained shapes for $$c_1=4$$ and $$c_2=\frac{1}{2}$$ for different seeds are shown in Fig. 2. For all seeds, the top-right shape $$u_2$$ is nearly identical to the shape used to obtain the target velocity; however, $$u_1$$ (left) shows differences at the bottom left and on the right-hand side, both between seeds and compared to the shape used for the target velocity, and $$u_3$$ has a different left-hand side. Differences for $$u_3$$ between the different seeds can also be observed. We investigate the optimization with the random seed 421507 further. The remesher is activated after stochastic gradient steps 4, 9, 13, 20, 26, 37, 93 and 1682. In Fig. 3, the numerical results for the objective functional estimate $$\hat{j}=\frac{1}{m_k}\sum _{i=1}^{m_k} J(\varvec{z}^{k,j},\varvec{\xi }^{k,j,i})$$ and the estimate of the $$H^1$$ norm of the mesh deformation $$\widehat{\varvec{V}}=\frac{1}{m_k}\sum _{i=1}^{m_k} \varvec{V}_{\varvec{\xi }^{k,j,i}}$$ over cumulative stochastic gradient steps are provided. Here, even for a comparatively low number of samples per step, we see a strong initial decrease in the objective functional values. The points where the inner loop is stopped due to reaching $$R_k$$ are denoted by the red vertical dashed lines in the right-hand side plot. At the later stages of the optimization, the batch size is increased up to $$m_{11}=1024$$ for $$k=11$$. This yields an increasingly accurate approximation of the mesh deformation and the objective functional value, as evidenced by the decreasing variance.
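The estimate $$\hat{j}$$ above is a plain sample mean over the current batch; a minimal sketch (here `J_eval` is a hypothetical callable returning $$J(\varvec{z},\varvec{\xi })$$ for a single sample, standing in for a PDE solve):

```python
import numpy as np

def mc_estimate(J_eval, z, xis):
    """Monte Carlo estimate (1/m_k) * sum_i J(z, xi_i) over one batch."""
    return float(np.mean([J_eval(z, xi) for xi in xis]))
```

The variance of this estimator decays like $$1/m_k$$, which is why doubling the batch size in each inner loop visibly reduces the scatter of the plotted curves.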

We provide the numerical results at the end of each inner loop for different seeds, $$c_1$$ and $$c_2$$ in Tables 1, 2 and 3. Here, we present the number of iterations until random stopping $$R_k$$, the $$H^1$$ norm of the mesh deformation for each k, which is estimated using the seed 883134 and a (larger) sample size of $$m=10024$$ as $$\widehat{\varvec{V}}= \frac{1}{m} \sum _{i=1}^m \varvec{V}_{{\varvec{\xi }}^i}$$, and the infeasibility measure $$H_k$$. Different seeds (Tables 1 and 2) behave differently regarding the mesh deformation norm estimate, the penalty factor and the infeasibility; however, the mesh deformation norm estimate and the infeasibility measure were overall reduced by orders of magnitude. We attribute the intermediate increases in these values to the effect of the randomness on the stochastic gradient. Using larger batch sizes (Table 3, left) yielded lower mesh deformation norms at a significantly increased computational cost, indicating a very strong influence of the randomness on the objective functional that can be reduced by larger sample sizes, cf. also Fig. 3. An increased iteration limit $$N_k$$ (Table 3, right) did not seem to improve the result, which is expected due to the strong influence of the randomness on the objective functional.

As an additional numerical experiment, we investigated the influence of the choice of B for the projection of Lagrange multipliers. Instead of $$B=[-100,100]^{2N}$$ we chose $$B=[-0.1,0.1]^{2N}$$. The batch size and maximum number of inner loop iterations match those in Table 2, left. Therefore, the random samples were exactly the same in both cases, but the optimization problem changes since the Lagrange multipliers are different. We did not see any notable improvement in performance by choosing the smaller set.

4 Conclusion

In this paper, we introduced a novel method for solving constrained optimization problems under uncertainty, where the optimization variable belongs to a Riemannian (shape) manifold. The objective functional is formulated as an expectation and the constraints are deterministic. Our work is motivated by applications in PDE-constrained shape optimization, where uncertainty enters the problem in the form of a random PDE, and geometric constraints are introduced to avoid trivial solutions. The optimization variable—the shape—is understood as an element of a Riemannian shape manifold.

Using the framework of Riemannian manifolds allows us to rigorously prove the convergence of our method, which we call the stochastic augmented Lagrangian method. This algorithm consists of a batch stochastic gradient method with random stopping in an inner loop, combined with an augmented Lagrangian method in an outer loop. The inherently nonconvex character of our underlying application is the reason for introducing random stopping, and it allows us to prove convergence rates in expectation even in the absence of convexity. A price that is paid for the guaranteed convergence rates is that the inner loop procedure becomes increasingly expensive. While this is a disadvantage, the method still outperforms the standard approach used in sample average approximation, where a sample is drawn once and the corresponding problem is solved using all samples. The stochastic approximation approach used here dynamically samples over the course of optimization, allowing us to use dramatically fewer samples, especially in the first iterations of the augmented Lagrangian procedure. To our knowledge, our method is the first to solve this kind of shape optimization problem under uncertainty. Since this approach is quite new, the results of this paper leave room for future research. In particular, there are a few open questions from differential geometry that are outside the scope of the paper but that came up while formulating our theory. It is still unclear whether Assumption 1 is satisfied for the manifold used in our application. In particular, we require connectivity and the existence of a bounded injectivity radius of the shape space under consideration.