Analysis of Stochastic Gradient Descent in Continuous Time

Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system -- a gradient flow -- represents the gradient descent part; the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process: we show that it converges weakly to the gradient flow with respect to the full target function as the learning rate approaches zero. We give conditions under which the stochastic gradient process with constant learning rate is exponentially ergodic in the Wasserstein sense. Then we study the case where the learning rate goes to zero sufficiently slowly and the single target functions are strongly convex. In this case, the process converges weakly to the point mass concentrated in the global minimum of the full target function, indicating consistency of the method. We conclude with a discussion of discretisation strategies for the stochastic gradient process and numerical experiments.


Introduction
The training of models with big data sets is a crucial task in modern machine learning and artificial intelligence. The training is usually phrased as an optimisation problem. Solving this problem with classical optimisation algorithms, such as gradient descent or the (Gauss-)Newton method, is usually infeasible; see [51]. Those methods require evaluations of the loss function with respect to the full big data set in each iteration, which leads to an immense computational cost.
Stochastic optimisation algorithms that only consider a small fraction of the data set in each step have been shown to cope well with this issue in practice; see, e.g., [10,13,56]. The stochasticity of the algorithms is typically induced by subsampling: the aforementioned small fraction of the data set is picked randomly in every iteration. Aside from higher efficiency, this randomness can have a second effect: the perturbation introduced by subsampling can allow the iterates to escape local extrema and saddle points. This is highly relevant for target functions in, e.g., deep learning, since those are often non-convex; see [15,67].
Due to the randomness in the updates, the sequence of iterates of a stochastic optimisation algorithm forms a stochastic process rather than a deterministic sequence. Stochastic properties of these processes have hardly been studied in the literature so far; see [4,22,32] for earlier studies. However, understanding these properties seems crucial for the construction of efficient stochastic optimisation methods.
In this work, we study the stochastic processes generated by the stochastic gradient descent (SGD) algorithm. More precisely, the contributions of this work are:
1. We construct the stochastic gradient process (SGP), a continuous-time representation of SGD. We show that SGP is a sensible continuum limit of SGD and discuss SGP from a biological viewpoint: a model of the same type is used to model growth and phenotypes of clonal populations living in randomly fluctuating environments.
2. We study the long-time behaviour of SGP: we give assumptions under which SGP with constant learning rate has a unique stationary measure and converges to this measure in the Wasserstein distance at exponential rate. In this case, SGP is exponentially ergodic. If the learning rate is decreasing to zero and additional assumptions hold, we prove that SGP converges weakly to the Dirac measure concentrated in the global optimum.
3. We discuss discretisation strategies for SGP. Those will allow us to derive practical optimisation algorithms from SGP. We also discuss existing algorithms that can be retrieved in this way.
4. We illustrate and investigate the stochastic gradient process and its stationary regime alongside stochastic gradient descent in numerical experiments.
This work is organised as follows: we introduce notation and background in the remainder of §1. In §2, we introduce the stochastic gradient process and justify our model choice. We study the long-time behaviour of SGP in §3. After discussing discretisation strategies for SGP in §4, we give numerical experiments in §5 and conclude the work in §6.

Stochastic gradient descent
Let (X, ‖ · ‖) := (R^K, ‖ · ‖_2), let ⟨·, ·⟩ be the associated inner product, and let B_X := B(X, ‖ · ‖) be the Borel σ-algebra on X. Functions defined throughout this work will be assumed to be measurable with respect to appropriate σ-algebras. Let Φ̄ : X → R be some function attaining a global minimum in X. We assume that Φ̄ is of the form

Φ̄ := (1/N) Σ_{i=1}^{N} Φ_i.

Here, N ∈ N := {1, 2, . . .}, N ≥ 2, and Φ_i : X → R is some continuously differentiable function, for i in the index set I := {1, . . . , N}. In the following, we aim to solve the unconstrained optimisation problem

θ* ∈ argmin_{θ∈X} Φ̄(θ). (1)
Optimisation problems as given in (1) frequently arise in data science and machine learning applications. Here, Φ̄ represents the negative log-likelihood or loss function of some training data set y with respect to some model. An index i ∈ I typically refers to a particular fraction y_i of the data set y. Then, Φ_i(θ) represents the negative log-likelihood of only this fraction y_i given the model parameter θ ∈ X, or the associated loss, respectively. For optimisation problems of this kind, we employ the stochastic gradient descent algorithm, which was proposed by Robbins and Monro [56]. We sketch this method in Algorithm 1. In practice, it is implemented with an appropriate termination criterion.
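Algorithm 1 itself is not reproduced here; for concreteness, the following is a minimal sketch of SGD with uniform subsampling, assuming the potentials are supplied as gradient callables. All names and the quadratic toy potentials are illustrative choices, not taken from the paper.

```python
import random

def sgd(grad_phis, theta0, learning_rate, n_steps, seed=0):
    """Minimal SGD sketch (cf. Algorithm 1): in each step, pick an index i
    uniformly at random from I and take an explicit Euler step for the
    gradient flow of the potential Phi_i."""
    rng = random.Random(seed)
    N = len(grad_phis)
    theta = float(theta0)
    for k in range(n_steps):
        i = rng.randrange(N)                             # uniform subsampling
        theta -= learning_rate(k) * grad_phis[i](theta)  # explicit Euler update
    return theta

# Toy potentials Phi_i(theta) = (theta - a_i)^2 / 2; the full target
# (1/N) * sum_i Phi_i is minimised at mean(a) = 1.5.
a = [0.0, 1.0, 2.0, 3.0]
grads = [lambda th, ai=ai: th - ai for ai in a]
theta_star = sgd(grads, 10.0, lambda k: 1.0 / (k + 10), 5000)
```

With the decreasing learning rate η_k = 1/(k + 10), the iterates approach the global minimum 1.5 of the full target, up to a small stochastic fluctuation.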
In this work, we consider gradient descent algorithms as time-stepping discretisations of certain gradient flows, whose potentials are the respective target functions Φ̄, Φ_1, . . . , Φ_N. Thus, we refer to these target functions as potentials. In SGD, the potentials of these gradient flows are randomly switched after every time step.
We now comment on the meaning of the learning rate η k .

Remark 1
In the gradient flow setting, the learning rate η k has two different interpretations/objectives: (i) It represents the step size of the explicit Euler method that is used to discretise the underlying gradient flow.
(ii) It represents the length of the time interval in which the flow follows a certain potential Φ i at the given iteration k, i.e. the time between two switches of potentials.
Recently, several authors, e.g. [27,39,60], have studied the behaviour of algorithms and methods in their continuum limit, i.e. the limit as η_k ↓ 0. The advantage of such a study is that numerical aspects, e.g. those arising from the time discretisation, can be neglected. Also, a new spectrum of tools is available to analyse, understand, and interpret the continuous system. If the continuous system is a good representation of the algorithm, we can sometimes use results in the continuous setting to improve our understanding of the discrete setting. Under some assumptions, a diffusion process is a good choice for a continuous-time model of SGD. Diffusion processes, such as Langevin dynamics, are traditionally used in statistical physics to represent the motion of particles; see, e.g., §8 in [62].

Diffusions and piecewise-deterministic Markov processes
Under assumptions discussed in [32,46], one can show that the sequence of iterates of the SGD algorithm, with, say, constant (η_k)_{k=1}^∞ ≡ η, can be approximated by a stochastic differential equation of the following form:

dθ(t) = −∇Φ̄(θ(t)) dt + √η Σ(θ(t))^{1/2} dW(t). (2)

Here, Σ(θ) ∈ R^{K×K} is symmetric and positive semi-definite for θ ∈ X, and W : [0, ∞) → X is a K-dimensional Brownian motion. 'Can be approximated' means that, as η goes to zero, the approximation of SGD by such a diffusion process is precise in a weak sense. In the following remark, we give a (rather coarse) intuitive explanation of how this diffusion process can be derived using the Central Limit Theorem and discretisation schemes for stochastic differential equations.
Remark 2 Let η ≈ 0. Then, for some k ∈ N, we have

θ_k ≈ θ_0 − ηk · (1/k) Σ_{j=1}^{k} ∇Φ_{i_j}(θ_0),

where i_1, . . . , i_k are the randomly chosen indices. The term (1/k) Σ_{j=1}^{k} ∇Φ_{i_j}(θ_0) is now the sample mean of a finite sample of independent and identically distributed (i.i.d.) random variables with finite variance. Hence, by the Central Limit Theorem,

(1/k) Σ_{j=1}^{k} ∇Φ_{i_j}(θ_0) ≈ ∇Φ̄(θ_0) + k^{−1/2} γ_0,

where γ_0 ∼ N(0, Σ(θ_0)) and the covariance matrix is given by Σ(θ_0) := Cov[∇Φ_i(θ_0)], with i ∼ Unif(I). Thus,

θ_k ≈ θ_0 − ηk ∇Φ̄(θ_0) − √(ηk) √η γ_0,

which is the first step of an Euler-Maruyama discretisation of the diffusion process in (2) with step size ηk. See, e.g., [47] for details on discretisation strategies for stochastic differential equations.
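The heuristic in Remark 2 can be probed numerically. Below is a sketch of an Euler-Maruyama simulation of a scalar analogue of the diffusion (2), where Σ(θ) is taken to be the variance of the subsampled gradients; the setup, names, and parameter values are our own illustrative choices.

```python
import math
import random

def euler_maruyama_sgd_diffusion(grad_phis, theta0, eta, n_steps, dt, seed=1):
    """Euler-Maruyama sketch for a scalar analogue of the diffusion (2):
    d theta = -mean_i grad Phi_i(theta) dt + sqrt(eta * Sigma(theta)) dW,
    where Sigma(theta) is the variance of the subsampled gradients."""
    rng = random.Random(seed)
    N = len(grad_phis)
    theta = float(theta0)
    for _ in range(n_steps):
        gs = [g(theta) for g in grad_phis]
        mean_g = sum(gs) / N                            # full-potential gradient
        var_g = sum((g - mean_g) ** 2 for g in gs) / N  # Sigma(theta), scalar case
        theta += -mean_g * dt + math.sqrt(eta * var_g * dt) * rng.gauss(0.0, 1.0)
    return theta

# Quadratic toy potentials Phi_i(theta) = (theta - a_i)^2 / 2: the drift pulls
# theta towards mean(a) = 1.0, with stochastic fluctuations of order sqrt(eta).
a = [0.0, 2.0]
grads = [lambda th, ai=ai: th - ai for ai in a]
theta_T = euler_maruyama_sgd_diffusion(grads, 5.0, eta=0.01, n_steps=2000, dt=0.01)
```

As η ↓ 0, the noise term vanishes and the scheme reduces to an Euler discretisation of the full gradient flow, in line with the discussion below.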
The diffusion view (2) of SGD has been discussed by Li et al. [43,44,45] and Mandt et al. [48,49]. Moreover, it forms the basis of the Stochastic Gradient Langevin MCMC algorithm [49,69]. A diffusive continuous-time version of stochastic gradient descent also arises when the underlying target functional itself contains a continuous data stream; see [64,65]; this however is not the focus of the present work.
Unfortunately, the process of slowly switching between a finite number of potentials in the pre-asymptotic phase of SGD is not represented in the diffusion. Indeed, the diffusion represents an infinite number of switches within any strictly positive time horizon. In SGD this is only the case as η_k ↓ 0; see [11]. The pre-asymptotic phase, however, is vital for the robustness of the algorithm and its computational efficiency. Moreover, the SGD algorithm is sometimes applied with a constant learning rate; see [14]. Here, the regime η_k ↓ 0 is never reached. Finally, one motivation for this article has been the creation of new stochastic optimisation algorithms, where the switching between a finite number of potentials/data sets is a crucial element to reduce computational cost and memory complexity. Replacing the subsampling by full sampling and adding Gaussian noise is not viable in large-data applications.
In this work, we aim to propose a continuous-time model of SGD that captures the switching between the finite number of potentials. To this end, we separate the two roles of the learning rate: the gradient flow discretisation and the waiting time between two switches of potentials; see Remark 1 (i) and (ii), respectively. We proceed as follows: 1. We let the discretisation step width go to zero and thus obtain a gradient flow with respect to some potential Φ_i.
2. We randomly replace Φ i by another potential Φ j after some strictly positive waiting time.
Hence, we take the continuum limit only in the discretisation of the gradient flows, but not in the switching of potentials. This gives us a continuous-time dynamic in which the randomness is not introduced by a diffusion, but by an evolution according to a potential that is randomly chosen from a finite set. This non-diffusive approach should give a better representation of the pre-asymptotic phase. Moreover, since we do not require the full potential in this dynamical system, we obtain a representation that is immediately relevant for the construction of new computational methods.
We will model the waiting time T between two switches as a random variable following a failure distribution, i.e. T has survival function

P(T > t) = 1[t ≥ 0] exp(−∫_{t_0}^{t_0+t} ν(u) du) + 1[t < 0], (3)

where t ∈ R, t_0 ≥ 0 is the current time, ν : [0, ∞) → (0, ∞) is a hazard function that depends on time, and 1[·] represents the indicator function: 1[true] := 1 and 1[false] := 0. We denote P(T ∈ ·) =: π_wt(·|t_0). Note that when ν is constant, T is exponentially distributed. Then, we obtain a so-called Markov switching process; see, e.g. [3,5,6,17,71]. Markov switching processes are a subclass of piecewise deterministic Markov processes (PDMPs). PDMPs were first introduced by Davis [19] as 'a general class of non-diffusion stochastic models'; see also [20]. They play a crucial role in the modelling of biological, economic, technical, and physical systems; e.g., as a model for internet traffic [30] or in risk analysis [38]. See also §2.4, where we discuss a particular biological system that is modelled by a PDMP. Furthermore, PDMPs have recently gained attention in the Markov chain Monte Carlo literature as an efficient way of sampling from otherwise inaccessible probability distributions; see, e.g., [8,26,54].
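A waiting time with a survival function of this form can be sampled by drawing a standard exponential variable and numerically inverting the cumulative hazard; the following sketch does this by Riemann-sum accumulation (the step size, cap, and function names are our own choices).

```python
import random

def sample_waiting_time(nu, t0, rng, dt=1e-3, t_max=1e3):
    """Sample T with survival function
    P(T > t) = exp(-int_{t0}^{t0+t} nu(u) du) for t >= 0, by accumulating
    the integrated hazard until it exceeds a standard exponential draw."""
    e = rng.expovariate(1.0)    # P(E > x) = exp(-x)
    H, t = 0.0, 0.0
    while H < e and t < t_max:
        H += nu(t0 + t) * dt    # Riemann sum of the cumulative hazard
        t += dt
    return t

rng = random.Random(0)
# Constant hazard nu = 2.0: T should be Exp(2)-distributed, with mean 1/2.
samples = [sample_waiting_time(lambda u: 2.0, 0.0, rng) for _ in range(2000)]
mean_T = sum(samples) / len(samples)
```

For a constant hazard the routine reproduces the exponential distribution, which is the Markov switching case mentioned above; a time-dependent ν yields the general failure distribution.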

From discrete to continuous
In the following, we give a detailed description of the two PDMPs that will be discussed throughout this article: one PDMP will represent SGD with constant learning rate, the other models SGD with decreasing learning rate. Then, we will argue why we believe that these PDMPs give an accurate continuous-time representation of the associated SGD algorithms. Finally, we give a biological interpretation of the PDMPs discussed in this section.

Definition and well-definedness
Let (Ω, A, P) be a probability space on which all random variables in this work are defined. We now define two continuous-time Markov processes (CTMPs) on I := {1, . . . , N} that will model the switching of the data sets in our PDMPs. For details on continuous-time Markov processes on finite state spaces, we refer to [1]. We start with the constant learning rate. Let λ > 0 be a positive constant and let i : Ω × [0, ∞) → I be the CTMP on I with transition rate matrix

A := λ(1_{N×N} − N · Id_I)

and with initial distribution i(0) ∼ Unif(I). Here, Id_I is the identity matrix in R^{N×N} and 1_{N×N} ∈ R^{N×N} is the matrix of ones; i.e. the process jumps from any state to any other given state at rate λ. Let M_t : I × 2^I → [0, 1] be the Markov kernel representing the semigroup of (i(t))_{t≥0}, i.e. M_t(B | i_0) := P(i(t) ∈ B | i(0) = i_0), for B ⊆ I, i_0 ∈ I, and t ≥ 0.
This Markov kernel can be represented analytically by solving the associated Kolmogorov forward equation. We do this in Lemma 5 in Appendix A and show that

M_t({i} | i_0) = 1/N + (1[i = i_0] − 1/N) exp(−Nλt), (5)

where i, i_0 ∈ I, t ≥ 0. Moreover, note that the waiting time between two jumps of the process (i(t))_{t≥0} is given by an exponential distribution with rate (N − 1)λ, i.e. π_wt(·|t_0) = Exp((N − 1)λ). The CTMP (i(t))_{t≥0} will represent the switching among potentials in the SGD algorithm with constant learning rate.

Now, we move on to the case of a decreasing learning rate. Let µ : [0, ∞) → (0, ∞) be a non-decreasing, positive, and continuously differentiable function, with µ(t) → ∞, as t → ∞. We assume furthermore that for any t > 0 there is some constant C_t > 0: We define j : Ω × [0, ∞) → I to be the inhomogeneous CTMP with time-dependent transition rate matrix B : [0, ∞) → R^{N×N} given by

B(t) := µ(t)(1_{N×N} − N · Id_I).

Again, we assume that the initial distribution j(0) ∼ Unif(I). Analogously to (5), we can compute the associated Markov transition kernel in this setting. First note that, since (j(t))_{t≥0} is not homogeneous in time, it is not sufficient to construct the Markov kernel with respect to the state of the Markov process at time t_0 = 0. Indeed, we get a kernel of type M_{t|t_0}(· | j_0), where j_0 ∈ I and t ≥ t_0 ≥ 0. This kernel is given by

M_{t|t_0}({j} | j_0) = 1/N + (1[j = j_0] − 1/N) exp(−N ∫_{t_0}^{t} µ(u) du), (8)

where j, j_0 ∈ I and t ≥ t_0 ≥ 0; see again Lemma 5 in Appendix A. In this case, the waiting time at time t_0 ≥ 0 between two jumps is distributed according to the failure distribution π_wt in (3), with ν ≡ (N − 1)µ. The CTMP (j(t))_{t≥0} represents the potential switching when SGD has decreasing learning rates. Based on these Markov jump processes, we can now define the stochastic gradient processes that will act as continuous-time versions of SGD as defined in Algorithm 1.
Definition 1 (SGP) Let θ_0, ξ_0 ∈ X. We define
(i) the stochastic gradient process with constant learning rate (SGPC) as a solution of the initial value problem

dθ(t)/dt = −∇Φ_{i(t)}(θ(t)) (t ≥ 0), θ(0) = θ_0;

(ii) the stochastic gradient process with decreasing learning rate (SGPD) as a solution of the initial value problem

dξ(t)/dt = −∇Φ_{j(t)}(ξ(t)) (t ≥ 0), ξ(0) = ξ_0.

Also, we use the denomination stochastic gradient process (SGP) when referring to (i) and (ii) at the same time.
We illustrate the processes (i(t)) t≥0 and (θ(t)) t≥0 in Figure 1. We observe that SGP constructs a piecewise smooth path that is smooth between jumps of the underlying CTMP.
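For additional intuition, SGPC can be simulated exactly when the potentials are quadratic, since the gradient flow between jumps is then available in closed form; the quadratic potentials and all parameter values below are illustrative choices, not taken from the paper.

```python
import math
import random

def simulate_sgpc(a, lam, theta0, t_end, seed=3):
    """Sketch of SGPC for quadratic potentials Phi_i(theta) = (theta - a_i)^2 / 2:
    between jumps of the index process i(t), theta follows the gradient flow
    d theta/dt = -(theta - a_i), whose solution over a time span h is
    a_i + (theta - a_i) * exp(-h). Jumps occur after Exp((N - 1) * lam)
    waiting times; the new index is uniform among the other states."""
    rng = random.Random(seed)
    N = len(a)
    i = rng.randrange(N)                 # i(0) ~ Unif(I)
    theta, t = float(theta0), 0.0
    while t < t_end:
        tau = rng.expovariate((N - 1) * lam)          # random waiting time
        h = min(tau, t_end - t)
        theta = a[i] + (theta - a[i]) * math.exp(-h)  # exact flow for time h
        t += h
        if h == tau:                                  # a jump occurred
            i = (i + rng.randrange(1, N)) % N         # uniform among other states
    return theta

# Four data fractions with Phi_i minimised at a_i; the full potential is
# minimised at mean(a) = 1.5. Fast switching (large lam) keeps theta close to it.
a = [0.0, 1.0, 2.0, 3.0]
theta_T = simulate_sgpc(a, lam=50.0, theta0=10.0, t_end=5.0)
```

With a large switching rate λ, the trajectory hovers near the minimiser of the full potential, in line with Theorem 1 below; with a small λ, it oscillates visibly between the individual potentials, as in Figure 1.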
In order to show that the dynamics in Definition 1 are well-defined, we require regularity assumptions on the potentials (Φ_i)_{i∈I}. After stating those, we immediately move on with proving well-definedness in Proposition 1.

Figure 1: Cartoon of SGPC: the process (i(t))_{t≥0} is a right-continuous, piecewise constant process on the set I, whereas the process (θ(t))_{t≥0} on X is continuous and piecewise smooth. The pieces on which the processes are constant resp. smooth are identical, since the dynamic of (θ(t))_{t≥0} is controlled by (i(t))_{t≥0}. Note that T_0 is the initial time and the increments T_k − T_{k−1} are the random waiting times.
Assumption 1 For any i ∈ I, let Φ i : X → R be continuously differentiable, i.e. Φ i ∈ C 1 (X; R), and let ∇Φ i be locally Lipschitz continuous.
Proposition 1 Let Assumption 1 hold. Then, the initial value problems in Definition 1 have P-almost surely unique solutions, i.e. the processes (θ(t))_{t≥0} and (ξ(t))_{t≥0} are well-defined.
Proof. We first discuss the process (θ(t))_{t≥0}. Let T_0 = 0 and T_1, T_2, . . . be the jump times of (i(t))_{t≥0}. Let k ∈ N. Note that the increments satisfy T_k − T_{k−1} ∼ Exp((N − 1)λ). Hence, P(T_k − T_{k−1} > 0) = 1. By Assumption 1, the (∇Φ_i)_{i=1}^N are locally Lipschitz continuous. Hence, the process (θ(t))_{t≥0} can be defined iteratively on the intervals [T_{k−1}, T_k), with initial values θ(T_{k−1}) := θ(T_{k−1}−), where f(x−) := lim_{x′↑x} f(x′) and θ(T_0−) := θ_0. Iterative application of the Picard-Lindelöf theorem for k ∈ N gives unique existence of the trajectory. Picard-Lindelöf can be applied since ∇Φ_i is locally Lipschitz continuous for any i ∈ I by Assumption 1.
The proof for (ξ(t))_{t≥0} is partially analogous. Importantly, we now need to make sure that P(lim_{k→∞} T_k = ∞) = 1. Otherwise, (j(t))_{t≥0} would only be well-defined up to a possibly finite explosion time T_∞ := lim_{k→∞} T_k < ∞. Under our assumptions, (j(t))_{t≥0} is indeed 'non-explosive'; we prove this in Lemma 6 in Appendix A. Moreover, let k ∈ N. Then, we have P(T_k − T_{k−1} > 0 | T_{k−1} = t_{k−1}) = 1 for any t_{k−1} ≥ 0. This is implied by the continuous differentiability of µ. Thus, we also have P(T_k − T_{k−1} > 0) = 1. Then, as for (θ(t))_{t≥0}, we can again employ Picard-Lindelöf iteratively to show the P-a.s. well-definedness of (ξ(t))_{t≥0}. □

Choice of model
In this section, we reason why the dynamical systems in Definition 1 are sensible continuous-time models for SGD given in Algorithm 1 with constant, resp. decreasing learning rate.
Gradient flow. The update in line 5 of Algorithm 1 is an explicit Euler update of the gradient flow with respect to the potential Φ_i, for some i ∈ I. In our model, we replace this discretised gradient flow with the continuous dynamic; i.e. we replace the discrete update θ_k := θ_{k−1} − η_k ∇Φ_i(θ_{k−1}) by the flow of dθ(t)/dt = −∇Φ_i(θ(t)).

Uniform sampling. We aim to accurately represent the uniform sampling from the index set I, given in line 4 of the algorithm. Indeed, at each point in time t ∈ [0, ∞), we can show that both i(t) ∼ Unif(I) and j(t) ∼ Unif(I).
Proposition 2 For any t ≥ 0, we have i(t) ∼ Unif(I) and j(t) ∼ Unif(I).
Proof. To prove this proposition, we need to show that Unif(I) is stationary with respect to the Markov transition kernels M_t and M_{t|t_0} given in (5) and (8), for i ∈ I and 0 ≤ t_0 ≤ t. We show only the decreasing learning rate case; the proof for the constant learning rate proceeds analogously. Using (8), a calculation gives:

Σ_{j_0∈I} (1/N) M_{t|t_0}({j} | j_0) = Σ_{j_0∈I} (1/N) [1/N + (1[j = j_0] − 1/N) exp(−N ∫_{t_0}^{t} µ(u) du)] = 1/N,

for any j ∈ I and 0 ≤ t_0 ≤ t, since the terms 1[j = j_0] − 1/N sum to zero over j_0 ∈ I. □

Hence, the CTMPs (i(t))_{t≥0}, (j(t))_{t≥0} indeed represent the uniform sampling among the data set indices i ∈ I.
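The stationarity of Unif(I) can also be checked numerically against a matrix exponential of the rate matrix A; the truncated-series implementation below is only for illustration.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, t, terms=60):
    """Truncated power series for exp(t*A)."""
    n = len(A)
    M = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [r[:] for r in M]
    tA = [[t * x for x in r] for r in A]
    for k in range(1, terms):
        term = mat_mul(term, [[x / k for x in r] for r in tA])  # (tA)^k / k!
        M = [[M[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return M

# Rate matrix A = lam * (ones - N * Id): off-diagonal entries lam,
# diagonal entries -(N - 1) * lam.
N, lam, t = 4, 2.0, 0.7
A = [[lam * (1.0 - N * (i == j)) for j in range(N)] for i in range(N)]
M = mat_exp(A, t)

# If Unif(I) is stationary, propagating the uniform distribution through
# the kernel M_t returns the uniform distribution again.
row = [sum(M[i0][i] for i0 in range(N)) / N for i in range(N)]
```

The vector `row` equals (1/N, . . . , 1/N) up to floating-point error, and the entries of `M` match the closed-form kernel (5).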
Markov property. The trajectory (θ k ) ∞ k=0 generated by Algorithm 1 satisfies the Markov property, i.e. the distribution of the current state given information about previous states is equal to the distribution of the current state given only information about the most recent of the previous states. By the particular structure we chose for the continuous-time processes (θ(t), i(t)) t>0 and (ξ(t), j(t)) t>0 , we indeed retain the Markov property.
Proposition 3 The processes (θ(t), i(t))_{t>0} and (ξ(t), j(t))_{t>0} satisfy the Markov property.
Proof. This follows from the particular choice of the waiting time distribution; see, e.g., the discussion in §3 of [19]. □

Choosing random waiting times between switches allows us to analyse SGD as a PDMP. However, this choice comes at some cost: in Algorithm 1, the waiting times are all deterministic, a feature we thus do not represent in SGP. We briefly discuss a continuous-time version of SGD with deterministic waiting times in Remark 6 as a potential extension of the SGP framework, but do not consider it otherwise in this work. In the next two paragraphs, we will, thus, explain how we connect the deterministic waiting times in SGD and the random waiting times in SGP.
Constant learning rate. We have defined (θ(t))_{t≥0} as a continuous-time representation of the trajectory returned by Algorithm 1 with constant learning rate η_k ≡ η. The hazard function of the waiting time distribution of (i(t))_{t≥0} is constant, ν ≡ (N − 1)λ. The waiting time T is the time (i(t))_{t≥0} remains in a certain state. Note that the hazard function satisfies

ν(u) = lim_{δ↓0} (1/δ) P(T ≤ u + δ | T > u),

where T is a waiting time; see, e.g., §21 in [20]. Hence, the hazard function describes the rate of events happening at time u ≥ 0. In SGD with constant learning rate, the waiting time between two switches is constantly η. Hence, the number of data switches in a unit time interval is 1/η. We mimic this behaviour by choosing λ in the matrix A such that it satisfies (N − 1)λ = 1/η. Indeed, we set λ := 1/((N − 1)η).
Decreasing learning rate. Let now (η_k)_{k=1}^∞ ∈ (0, ∞)^ℕ be a non-increasing sequence of learning rates, with lim_{k→∞} η_k = 0. Moreover, we assume that Σ_{k=1}^∞ η_k = ∞. Similarly to the last paragraph, we now try to find a rate function (µ(t))_{t≥0} such that the PDMP (ξ(t))_{t≥0} represents the SGD algorithm with the sequence of learning rates (η_k)_{k=1}^∞. To go from discrete to continuous time, we need to define a function H that interpolates the sequence of learning rates, i.e. H(t_k) := η_k, where t_k := Σ_{j=1}^{k} η_j, for k ∈ N. The t_k are chosen like this since the η_k themselves represent the time stepsizes in the sequence of learning rates. H could, for instance, be chosen as a sufficiently smooth interpolant between the η_k. As in the case of the constant learning rate, we now argue via the hazard function of the waiting time distribution and set ν(t) := (N − 1)µ(t) := 1/H(t).

Approximation of the exact gradient flow. We now consider SGD, i.e. Algorithm 1. If the learning rate η ↓ 0, we discretise the gradient flow precisely. Moreover, the waiting time between two data switches goes to zero; intuitively, we switch the data set infinitely often in any finite time interval. By the Law of Large Numbers, we should then anticipate that the limiting process behaves like the full gradient flow

dζ(t)/dt = −∇Φ̄(ζ(t)) (t ≥ 0),

with initial value ζ(0) = ζ_0 := θ_0 as chosen in SGPC and Φ̄ := (1/N) Σ_{i=1}^N Φ_i being the full potential. This behaviour can also be seen in the diffusion approximation of SGD (2), where the stochastic part disappears as η ↓ 0.
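The construction of the hazard from the learning-rate sequence can be sketched as follows; the text leaves the choice of interpolant open, so the piecewise constant H below is only one illustrative variant.

```python
import bisect
import itertools

def make_hazard(etas):
    """Build t_k = eta_1 + ... + eta_k and a piecewise constant interpolant
    H with H(t) := eta_{k+1} on [t_k, t_{k+1}) (t_0 := 0). The hazard is then
    nu(t) = 1/H(t), matching the constant-rate identity (N - 1)*lam = 1/eta."""
    t_grid = list(itertools.accumulate(etas))  # t_1, t_2, ...
    def nu(t):
        k = min(bisect.bisect_right(t_grid, t), len(etas) - 1)
        return 1.0 / etas[k]
    return nu

etas = [1.0 / k for k in range(1, 101)]  # eta_k = 1/k
nu = make_hazard(etas)
```

Since the η_k decrease to zero, the resulting hazard ν is non-decreasing and grows without bound, matching the assumptions placed on µ above.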
So we should now show that this is also true for SGPC. Indeed, we will give assumptions under which the SGPC (θ(t))_{t≥0} converges weakly to (ζ(t))_{t≥0}, as η ↓ 0. Weak convergence of (θ(t))_{t≥0} to (ζ(t))_{t≥0} means that E[F((θ(t))_{t≥0})] → E[F((ζ(t))_{t≥0})], as η ↓ 0, for any bounded, continuous function F mapping from the associated path space to R. To show weak convergence, we need a stronger smoothness assumption concerning the potentials Φ_i. We denote the Hessian of Φ_i by HΦ_i, for i ∈ I.

Assumption 2 For any i ∈ I, let Φ_i ∈ C^2(X; R) and let ∇Φ_i, HΦ_i be continuous.

Note that Assumption 1 is already implied by Assumption 2.
Theorem 1 Let θ_0 = ζ_0 and let Assumption 2 hold. Then, the stochastic gradient process (θ(t))_{t≥0} converges weakly to the full gradient flow (ζ(t))_{t≥0}, as the learning rate η ↓ 0.

We prove Theorem 1 rigorously in §2.3. We illustrate the result in Figure 2, where we can see that, indeed, as η decreases, the processes converge to the full gradient flow. Following our reasoning above, we assert that SGPC, resp. SGPD, are suitable continuous-time representations of SGD with constant, resp. decreasing, learning rate.

Proof of Theorem 1
We prove Theorem 1 using the perturbed test function theory. In particular, we apply a result from [40] that we summarise below. We note that a similar technique is used to derive the Infinite Swapping Markov Chain Monte Carlo technique; see [24] for details from the statistical mechanics viewpoint and [42] for the discrete-time MCMC viewpoint. In the following, we adapt the notation of [40].

Assumption 3
We consider the following three assumptions:
(i) Let G and ∇_x G be continuous and bounded on X′ × Y, where X′ ⊆ X is bounded;
(ii) let Ḡ : X → X be continuously differentiable and, for any 0 ≤ t < t′ < ∞ and x ∈ X′, let

∫_t^{t′} E[G(x, ξ_ε(u)) | ξ_ε(s), s ≤ t] du → (t′ − t) Ḡ(x)

in probability, as ε ↓ 0; and
(iii) let (ξ_ε(t))_{t≥0} be tight with respect to ε.
The associated result then reads:

Theorem 2 Let Assumption 3 (i)-(iii) hold for the process (x_ε(t))_{t≥0} given by dx_ε(t)/dt = G(x_ε(t), ξ_ε(t)). Then, (x_ε(t))_{t≥0} converges weakly to (x̄(t))_{t≥0}, as ε ↓ 0, where dx̄(t)/dt = Ḡ(x̄(t)) and x̄(0) = x_ε(0).

Proof. The proof uses the perturbed test function method; see [40, Theorem 4.1]. □

To prove Theorem 1, we now show that Assumption 3 (i)-(iii) hold for SGPC. Then, Theorem 2 will imply weak convergence.
Moreover, we define ε := 1/λ and ξ_ε(t) := e_{i(t)}, where e_i is the i-th unit vector in Y. Then, we have ∇Φ_{i(t)} = G(·, ξ_ε(t)). Assumption 3(i) is now immediately implied by Assumption 2; note that any continuous function on X = R^K is bounded on a bounded subset of X. The tightness in Assumption 3(iii) follows from (ξ_ε(t))_{t≥0} being a càdlàg process taking values in the finite set {e_i : i ∈ I}; see Theorem 16.8 in [9]. To show Assumption 3(ii), we employ the explicit representation of the transition kernel M_t of (i(t))_{t≥0} given in (5). Since (ξ_ε(t))_{t≥0} is a Markov process and homogeneous in time, we assume without loss of generality that t = 0. Let now i_0 ∈ I. Then, we have for s ∈ [0, t] the following expression for the conditional expectation:

E[G(x, ξ_ε(s)) | ξ_ε(0) = e_{i_0}] = Σ_{i∈I} M_s({i} | i_0) ∇Φ_i(x) = ∇Φ̄(x) + exp(−Nλs)(∇Φ_{i_0}(x) − ∇Φ̄(x)).

Now we integrate the resulting function on [0, t]:

∫_0^t E[G(x, ξ_ε(s)) | ξ_ε(0) = e_{i_0}] ds = t ∇Φ̄(x) + ((1 − exp(−Nλt))/(Nλ)) (∇Φ_{i_0}(x) − ∇Φ̄(x)) → t ∇Φ̄(x)

almost surely, as ε ↓ 0, which implies Assumption 3(ii), with Ḡ := ∇Φ̄. Finally, we note that θ(0) = ζ(0); hence, Theorem 2 implies that (θ(t))_{t≥0} converges weakly to (ζ(t))_{t≥0}, as η ↓ 0. □

Stochastic gradient descent in nature
PDMPs are popular models for random or uncertain processes in biological systems; see Chapter 1 of [58] for an overview. In the following, we briefly discuss a biological system that is modelled by a dynamical system corresponding to the SGP. This model was proposed by Kussell and Leibler [41]. The modelled biological system consists of clonal populations that diversify to survive in randomly fluctuating environments.
Diversified bet-hedging. In the following, we consider clonal populations, such as bacteria or fungi, that live in fluctuating environments, i.e., environments that are subject to temporal change. Examples are the fluctuation of temperature and light during the day-night-cycle or a different supply of nutrients; see [2,12]. We define the set of environments to be I := {1, . . . , N }.
Here, populations typically adapt their phenotypes to retain a high fitness in any environment. If the fluctuations within I are irregular or even random, the organisms in a population cannot adapt to the changes in the environment sufficiently fast; see, e.g., [41]. To prevent extinction and retain high fitness in such fluctuating environments, some populations employ so-called diversified bet-hedging strategies; see, e.g., [31,52,59,63]. That means, rather than relying on homogeneous switching of phenotypes in the population, the population has heterogeneous phenotypes that are developed and switched based on the current environment i ∈ I or even completely randomly.
A PDMP model. Next, we briefly explain the way Kussell and Leibler [41] model the growth of such a population and the phenotype distribution among its individuals. There is a set of N phenotypes, which we identify with I; the i-th phenotype is the one with the highest fitness in environment i, for i ∈ I. The fluctuation between environments is modelled by a CTMP (i(t))_{t≥0} on I with a certain transition rate matrix. Let θ_0 ∈ X := R^N. Here, the i-th component θ_0^{(i)} of θ_0 describes the number of organisms in the population having phenotype i ∈ I. Given that we are currently in environment k ∈ I, we assume that organisms with phenotype i grow at rate f_i^{(k)} ≥ 0 and that organisms switch from phenotype i to j at rate H_{j,i}^{(k)}. Knowing this, we define the matrix A^{(k)} := diag(f^{(k)}) + H^{(k)}, for k ∈ I. Given an initial vector θ_0 ∈ (0, ∞)^N of phenotypes, we can now model the number of organisms with a particular phenotype via the dynamical system

dθ(t)/dt = A^{(i(t))} θ(t) (t ≥ 0), θ(0) = θ_0. (13)

The dynamical system (13) is a Markov switching process closely related to SGP. Indeed, we have a homogeneous ODE whose right-hand side is switched according to a CTMP. The different environments in the population model represent the different subsamples of the data set that are trained with SGP. While the population aims to reach a high fitness in the current environment, SGP aims to optimise an underlying model with respect to the partition of the data set that is currently subsampled. Overall, SGP aims at solving a certain optimisation problem. In general, there is no ad hoc equivalent optimisation problem in the population dynamic: positive growth rates (f^{(k)})_{k∈I} should lead to ‖θ(t)‖ → ∞, as t → ∞. Moreover, the flows in (13) are likely no gradient flows with an underlying scalar potential. However, diversified bet-hedging strategies also aim at long-term high fitness overall; see [52].
Hence, both SGP and diversified bet-hedging aim to improve a system overall by improving it in randomly switching situations. Therefore, we believe that bet-hedging provides a good background for interpreting SGP.
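A toy version of the population model (13) can be simulated directly; the following sketch uses two environments and two phenotypes with invented rates (all numbers are illustrative, not taken from [41]).

```python
import random

def simulate_population(growth, switch, env_rate, theta0, t_end,
                        dt=1e-3, seed=7):
    """Sketch of the bet-hedging model (13): d theta/dt = A^{(k)} theta,
    where A^{(k)} combines growth rates f^{(k)} and phenotype switching,
    while the environment k jumps at rate env_rate. Here switch[k][i][j]
    is the rate of switching from phenotype j to i; its columns sum to
    zero. Forward Euler time stepping; two environments assumed."""
    rng = random.Random(seed)
    n = len(theta0)
    k = rng.randrange(len(growth))       # initial environment
    theta = list(theta0)
    t, next_jump = 0.0, rng.expovariate(env_rate)
    while t < t_end:
        if t >= next_jump:               # environment fluctuation
            k = 1 - k                    # two environments: flip
            next_jump += rng.expovariate(env_rate)
        f, H = growth[k], switch[k]
        dtheta = [f[i] * theta[i] + sum(H[i][j] * theta[j] for j in range(n))
                  for i in range(n)]
        theta = [theta[i] + dt * dtheta[i] for i in range(n)]
        t += dt
    return theta

# Two environments, two phenotypes; phenotype i grows fastest in environment i.
growth = [[1.0, 0.2], [0.2, 1.0]]
H_sym = [[-0.1, 0.1], [0.1, -0.1]]       # symmetric switching at rate 0.1
switch = [H_sym, H_sym]
theta_T = simulate_population(growth, switch, env_rate=1.0,
                              theta0=[1.0, 1.0], t_end=5.0)
```

As the discussion above anticipates, the positive growth rates make the total population grow without bound; the randomly switched linear flows play the role that the randomly switched gradient flows play in SGP.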

Long-time behaviour
PDMPs have been the subject of extensive studies throughout the last decades, ever since they were introduced by Davis [19]. Many of the results derived in the past also apply to SGP. Hence, the PDMP view of SGD gives us access to a large set of analytical tools. Those allow us to study mixing properties and the long-time behaviour of the algorithm, such as convergence to stationary distributions and ergodicity.
In the following, we will use tools provided by [3,5,6,17,40] to study the long-time behaviour of SGP. Indeed, we will give assumptions under which the processes generated by SGPC and SGPD have a unique stationary measure and are ergodic or exponentially ergodic. For SGPD, we especially discuss the convergence to the minimum of Φ̄. After proving our assertions, we discuss the required assumptions in the context of linear least squares estimation problems.

Preliminaries
We collect some notation and basic facts that will be required in the following. First, we define a distance measure on X: for some q ∈ (0, 1], let

d′(θ, θ′) := min{1, ‖θ − θ′‖^q} (θ, θ′ ∈ X). (14)

Note that d′ is a metric on X and that (X, d′) forms a Polish space, i.e. it is separable and complete. Let π, π′ be two probability measures on (X, B_X). We define the Wasserstein(-1) distance between those measures by

d_W(π, π′) := inf_{H∈Coup(π,π′)} ∫_{X×X} d′(θ, θ′) dH(θ, θ′),

where Coup(π, π′) is the set of couplings of π, π′. This is the set of probability measures H on (X × X, B_X ⊗ B_X), with H(· × X) = π and H(X × ·) = π′. Note that, due to the boundedness of d′, the distance d_W is well-defined for any two probability measures π, π′ on (X, B_X). Indeed, the boundedness of d′ also implies that convergence in d_W is equivalent to weak convergence on (X, B_X). Finally, note that d′ being a metric implies that d_W is a metric as well. For details, see Chapter 6 in the book by Villani [68]. Moreover, we define the Dirac measure concentrated in θ_0 ∈ X by δ(· − θ_0) := 1[θ_0 ∈ ·]. Next, we define the flow ϕ_i : X × [0, ∞) → X associated to the i-th potential Φ_i, for i ∈ I. In particular, ϕ_i satisfies

∂ϕ_i(θ_0, t)/∂t = −∇Φ_i(ϕ_i(θ_0, t)), ϕ_i(θ_0, 0) = θ_0,

for any i ∈ I and θ_0 ∈ X. Similarly, we define the Markov kernels associated with the processes (θ(t), i(t))_{t>0} and (ξ(t), j(t))_{t>0}:

C_t(· | θ_0, i_0) := P((θ(t), i(t)) ∈ · | θ(0) = θ_0, i(0) = i_0),
D_{t|t_0}(· | ξ_0, j_0) := P((ξ(t), j(t)) ∈ · | ξ(t_0) = ξ_0, j(t_0) = j_0),

where t ≥ t_0 ≥ 0. We now note two different assumptions on the convexity of the Φ_i; a weak and a strong version.
Assumption 4 (Convexity) For every i ∈ I, there is a κ_i ∈ R, with

⟨∇Φ_i(θ) − ∇Φ_i(θ′), θ − θ′⟩ ≥ κ_i ‖θ − θ′‖² (θ, θ′ ∈ X), (15)

with either
(i) κ_1 + · · · + κ_N > 0 and, for every θ_0 ∈ X, some bounded S ∈ B_X, S ∋ θ_0, such that ϕ_i(θ′_0, t) ∈ S for all θ′_0 ∈ S, t ≥ 0, i ∈ I; or
(ii) min{κ_1, . . . , κ_N} > 0.

In the strong version (ii), we assume that all of the potentials {Φ_i}_{i∈I} are strongly convex. In the weak version (i), strong convexity of some potentials is sufficient; however, we need to ensure additionally that none of the flows escapes to infinity. The set S, in which we trap the process, is called positively invariant for (ϕ_i)_{i∈I}. The uniform strong convexity in Assumption 4(ii) indeed implies the existence of such a set for all θ_0 ∈ X. Both Assumption 4(i) and (ii) are quite strong. As we have mentioned before, optimisation problems in machine learning are often non-convex. However, we focus on convex optimisation problems in this study. Strong convexity implies, for instance, that the associated flows contract exponentially:

Lemma 1 Inequality (15) for some i ∈ I implies that the corresponding flow contracts exponentially, i.e.

‖ϕ_i(θ_0, t) − ϕ_i(θ′_0, t)‖ ≤ exp(−κ_i t) ‖θ_0 − θ′_0‖ (θ_0, θ′_0 ∈ X, t ≥ 0).
Proof. This is implied by Lemma 4.1 given in [17]. □ Given this background, we now study the ergodicity of SGP. We commence with the case of a constant learning rate.
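As a quick numerical illustration of Lemma 1, consider a one-dimensional quadratic potential (our own toy choice, not one of the Φ_i above), whose gradient flow is available in closed form; for this linear flow, the contraction bound even holds with equality:

```python
import math

# Hypothetical strongly convex potential Phi(theta) = (kappa/2) * theta^2,
# kappa > 0. Its gradient flow d/dt phi = -kappa * phi has the closed form
# phi(theta0, t) = exp(-kappa * t) * theta0.
kappa = 2.0

def flow(theta0, t):
    # exact solution of the gradient flow ODE
    return math.exp(-kappa * t) * theta0

theta0, theta0_prime, t = 1.5, -0.5, 0.7
lhs = abs(flow(theta0, t) - flow(theta0_prime, t))
rhs = math.exp(-kappa * t) * abs(theta0 - theta0_prime)
# contraction estimate of Lemma 1 (with equality for this linear flow)
assert lhs <= rhs + 1e-12
```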

Constant learning rate
Under Assumption 4(i), the SGP (θ(t), i(t))_{t≥0} has a unique stationary measure π_C on (Z, BZ) := (X × I, BX ⊗ 2^I), and it contracts towards this measure in the Wasserstein distance d_W. As the Markov process contracts exponentially, we say that it is exponentially ergodic.
We now state this result more precisely. Theorem 3 Let Assumptions 2 and 4(i) hold. Then, (θ(t), i(t))_{t≥0} has a unique stationary measure π_C on (Z, BZ). Moreover, there exist κ′, c > 0 and q ∈ (0, 1] with

d_W(π_C, C_{t|0}(· | θ₀, i₀)) ≤ c · exp(−κ′ t)   (t ≥ 0),

for any i₀ ∈ I and θ₀ ∈ X. The proof of this theorem follows similar lines as the proof of Theorem 5; thus, we prove both in §3.4. Note that in Theorem 3, q influences the metric d that is defined in (14) and that is part of the Wasserstein distance d_W. This result implies that SGPC converges very quickly to its stationary regime. For estimates of the constants in Theorem 3, we refer to [5]. Determining the stationary measure π_C may be rather difficult in practice; see [18,25]. We give numerical illustrations in §5.
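To make this stationary regime tangible, the following sketch simulates a toy SGPC with two hypothetical quadratic potentials; all constants below (the minima ±1, the learning rate, the horizon) are illustrative choices of ours, not taken from the paper's examples. Long-time samples settle between the two minima, near the minimiser 0 of the full potential:

```python
import math
import random

random.seed(0)

# Toy SGPC: two hypothetical potentials Phi_i(theta) = (theta - a_i)^2 / 2
# with minima a_1 = 1, a_2 = -1; the full potential (their mean) is
# minimised at theta* = 0.
a = [1.0, -1.0]
eta = 0.1  # constant learning rate = mean waiting time between switches

def flow(i, theta, t):
    # exact gradient flow of Phi_i: phi_i(theta, t) = a_i + exp(-t)(theta - a_i)
    return a[i] + math.exp(-t) * (theta - a[i])

def sample_sgpc(theta0, T):
    theta, i, t = theta0, random.randrange(2), 0.0
    while t < T:
        w = min(random.expovariate(1.0 / eta), T - t)  # Exp(1/eta) waiting time
        theta = flow(i, theta, w)
        i = 1 - i                                      # jump to the other data set
        t += w
    return theta

samples = [sample_sgpc(-1.5, 10.0) for _ in range(2000)]
mean = sum(samples) / len(samples)
# in the stationary regime, mass concentrates near theta* = 0 for small eta
```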

Decreasing learning rate
Next, we study the longtime behaviour of SGP with decreasing learning rate. Here, we are less interested in the convergence of SGP to some abstract probability measure. Instead, we study the convergence of SGPD to the minimum θ* ∈ X of the full potential Φ̄. Hence, we aim to analyse the behaviour of d_W(δ(· − θ*), D_{t|0}(· | ξ₀, j₀)), as t → ∞. Here, we have anticipated that the Dirac measure δ(· − θ*) is the stationary measure of SGPD as t → ∞. This can be motivated by Theorem 1, where SGPC converges to the full gradient flow as η ↓ 0. Two aspects of SGPD make the analysis of this distance significantly more involved than that of SGPC. First, the process is inhomogeneous in time, a case hardly discussed in the literature. We use the following standard idea to resolve this issue: (i) We define a homogeneous Markov process on an extended state space X × R, where the transition rate matrix of (j(t))_{t≥0} does not depend on time, but on the current position of the extended process.
Second, as t → ∞, the rate matrix B(t) degenerates: the diagonal entries go to −∞, the off-diagonal entries go to ∞. This case is not covered by [17] or, to the best of our knowledge, related literature on PDMPs. However, we have discussed a closely related problem in Theorem 1. To apply the perturbed test function theory, we proceed in three further steps: (ii) We define an auxiliary Markov jump process with bounded transition rate matrix.
(iii) We show that the PDMP based on this Markov jump process converges to a unique stationary measure at exponential rate.
(iv) We show that this stationary measure approximates δ(· − θ * ) at any precision. Also, we show that the auxiliary PDMP approximates SGPD.
Finally, we obtain the following result: Theorem 4 Let Assumptions 2 and 4(ii) hold. Then, there is a function α : [0, 1) → [0, ∞) that is continuous at 0 and satisfies α(0) = 0. Moreover, for any ε ∈ (0, 1), there are a probability measure π_ε and constants κ′, c > 0, q ∈ (0, 1] such that the corresponding ergodicity estimate holds for any j₀ ∈ I and ξ₀ ∈ X. Hence, as t → ∞, the state ξ(t) of the SGPD converges weakly to the Dirac measure concentrated in the minimum θ* of the full target function Φ̄. To prove this theorem, we now walk through steps (i)–(iv). Using several auxiliary results, we are then able to give a proof of Theorem 4.
One can easily see that this definition of SGPD is equivalent to our original Definition 1(ii). Note furthermore that the dynamic is defined such that if {∇Φ_i}_{i∈I} satisfies Assumption 4(i) (resp. (ii)), then {Ψ_i}_{i∈I} does as well.
(iii) Ergodicity of the auxiliary process. The following theorem shows that the auxiliary process (ξ^ε(t), j^ε(t))_{t≥0} converges at exponential rate to its unique stationary measure.
(iv) Weak convergence of the auxiliary process. The last preliminary step consists in showing that the auxiliary process (ξ^ε(t))_{t≥0} approximates the SGPD (ξ(t))_{t≥0}. Moreover, the same needs to hold for the respective stationary measures.

Proofs of Theorem 3 and Theorem 5
The proof of Theorem 3 proceeds by verifying the assumptions of Theorem 1.4 in [17], which implies exponential ergodicity of the PDMP. Under the same assumptions, Corollary 1.11 of [5] implies uniqueness of the stationary measure. We state the necessary assumptions below and then proceed with the proof.

Proof. [Proof of Theorem 3] Assumption 5(i) is satisfied by standard properties of homogeneous continuous-time Markov processes on finite sets. Assumption 5(ii) is implied by Assumption 4(i); see also the proof of Lemma 2.2 in [17]: let G ∈ Coup(π, π′) and choose a coupling H of the push-forward measures π C^{(i)}_t and π′ C^{(i)}_t. By Assumption 4(i) and Lemma 1, the flows contract exponentially, which yields the required contractivity in the Wasserstein distance. As the contraction factor exp(−κ_i t) depends neither on H nor on G, we finally obtain the asserted bound. Concerning Assumption 5(iii), we employ the boundedness of the flows in Assumption 4(i). □

Now we move on to the proof of Theorem 5. It is conceptually similar to the proof of Theorem 3: it relies on verifying the necessary assumptions of Theorem 3.3 in [17]. The uniqueness of the stationary measure follows under the same assumptions from Corollary 1.16 in [5]. We state these assumptions below.

Assumption 6
We consider the following four assumptions: (i) there is a transition rate matrix B leading to a positive recurrent, irreducible Markov chain; (ii) the Markov kernels representing the different gradient flows, C^{(i)}_t(· | θ₀) := δ(· − ϕ_i(θ₀, t)), are exponentially contracting in d_W, i.e. any two probability measures π, π′ on (X, BX) satisfy

d_W(π C^{(i)}_t, π′ C^{(i)}_t) ≤ exp(−κ_i t) d_W(π, π′)

for any t > 0 and κ₁ = ··· = κ_N > 0; (iii) the Markov kernel D^ε_{t|0} has a finite first absolute moment, i.e.

∫_X ‖ξ‖ D^ε_{t|0}(dξ × I | ξ₀, j₀) < ∞

for t ≥ 0, ξ₀ ∈ X, and j₀ ∈ I; and (iv) the transition rate matrix B^ε is bounded, and sup_{i∈I} Σ_{j∈I} B^ε(τ)_{i,j} 1[i ≠ j] is Lipschitz continuous with respect to τ ∈ (ε, 1]. Note that Assumption 6(i)–(iii) closely corresponds to Assumption 5.
Proof. [Proof of Theorem 5] The matrix B in Assumption 6(i) can be chosen explicitly; it indeed induces a positive recurrent, irreducible continuous-time Markov chain. Assumption 6(ii) can be proven analogously to Assumption 5(ii) in the proof of Theorem 3; here, Assumption 4(i) is replaced by Assumption 4(ii). Next, we prove Assumption 6(iv). First, note that sup_{i∈I} Σ_{j∈I} B^ε(τ)_{i,j} 1[i ≠ j] can be written in closed form. Boundedness and Lipschitz continuity of this function follow from the boundedness of τ ∈ (ε, 1] and the continuous differentiability of μ. Assumption 6(iii) follows from Lemma 1.14 in [5] and Assumption 6(iv). □

Proof of Proposition 4
In this subsection, we prove Proposition 4. First, we show weak convergence (ξ^ε(t))_{t≥0} ⇒ (ξ(t))_{t≥0} in the sense of (12). Given this result, we are able to construct the function α and thus prove Proposition 4(i). Part (ii) of the proposition relies on showing that (ξ^ε(t))_{t≥0} approximates the underlying gradient flow, as discussed in Theorem 1.
for any A ∈ BX and J ⊆ I.
Proof. The first part follows from Lemma 1. The second part is implied by the Banach Fixed-Point Theorem and by the stationarity of θ* with respect to ∇Φ̄. □ We now proceed to prove the second part of the main proposition. Proof.
These assumptions imply that θ(t) ⇒ π̄, as ε ↓ 0 and t → ∞; see Theorem 6.5 in [40]. As (θ(t))_{t≥0} has a unique stationary measure, we have π_C ⇒ π̄. We now prove these four assertions. (i) and (ii) follow immediately from Lemma 4, with π̄ := δ(· − θ*). (iii) is implied by Theorem 2. Due to the strong convexity assumed in Assumption 4(ii), the process cannot escape a certain compact set; see Lemma 1.14 in [5] for details. This implies the tightness needed in (iv). Finally, note that π_C ⇒ π̄ already implies convergence in d_W. Hence, we can construct a function α accordingly. □

Linear least squares problems
In this section, we illustrate the theoretical results of §§3.2–3.5 with an abstract example. In particular, we show that Assumptions 2 and 4 hold for linear least squares problems under weak assumptions. Such problems appear in (regularised) linear or polynomial regression. Let Y := R^M, y ∈ Y, and let G : X → Y be a linear operator. Y is the data space, y is the observed data set, and G is the parameter-to-data map. We consider the problem of estimating

θ* ∈ argmin_{θ∈X} ½‖Gθ − y‖², (21)

which is called a linear least squares problem.
We aim to solve this problem by the stochastic gradient descent algorithm. Indeed, we define

Φ_i(θ) := ½‖G_i θ − y_i‖²   (θ ∈ X, i ∈ I),

where y_i is an element of another Euclidean vector space Y_i := R^{M_i} and G_i : X → Y_i is a linear operator, for i ∈ I. We assume that these are given such that Y = Y₁ × ··· × Y_N, (y_i)_{i∈I} = N · y, and [G₁ᵀ, …, G_Nᵀ]ᵀ = N · G. To define the SGP, we now need to derive the gradient field. This is given by the associated normal equations:

∇Φ_i(θ) = G_iᵀ(G_i θ − y_i)   (θ ∈ X, i ∈ I).

These vector fields are linear and thus satisfy Assumption 2. Now we discuss Assumption 4. Let i ∈ I. Note that G_iᵀG_i is symmetric and positive semi-definite. We have

⟨θ − θ′, ∇Φ_i(θ) − ∇Φ_i(θ′)⟩ = ‖G_i(θ − θ′)‖² ≥ κ_i ‖θ − θ′‖²   (θ, θ′ ∈ X),

where κ_i ≥ 0 is the smallest eigenvalue of G_iᵀG_i. This implies that Assumption 4(i) holds if there is some i ∈ I with G_iᵀG_i strictly positive definite. Furthermore, Assumption 4(ii) holds if the matrix G_iᵀG_i is strictly positive definite for all i ∈ I. Strict positive definiteness of G_iᵀG_i is satisfied if dim Y_i ≥ dim X and G_i has full rank, for i ∈ I. The inequality dim Y_i ≥ dim X is not restrictive, as SGD is typically applied in settings with very large data sets. If the G_i do not have full rank, one could add a Tikhonov regulariser to the target function in (21).
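The verification above can be sketched numerically; the split into G_i, y_i below is an illustrative random instance of our own, and the κ_i are read off as the smallest eigenvalues of G_iᵀG_i:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical split of a linear least squares problem into N = 4
# sub-problems Phi_i(theta) = 0.5 * ||G_i theta - y_i||^2; the dimensions
# and data are illustrative.
dim_X, dim_Yi, N = 3, 5, 4
G = [rng.standard_normal((dim_Yi, dim_X)) for _ in range(N)]
y = [rng.standard_normal(dim_Yi) for _ in range(N)]

def grad_Phi(i, theta):
    # gradient field from the normal equations: G_i^T (G_i theta - y_i)
    return G[i].T @ (G[i] @ theta - y[i])

# Assumption 4(ii) holds if every G_i^T G_i is strictly positive definite;
# kappa_i is its smallest eigenvalue (eigvalsh returns ascending order).
kappas = [np.linalg.eigvalsh(Gi.T @ Gi)[0] for Gi in G]
```

Since dim Y_i = 5 ≥ dim X = 3 and random Gaussian matrices have full rank almost surely, all κ_i come out strictly positive here.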

From continuous to discrete
In the previous sections, we have introduced and discussed SGP mainly as an analytical tool and abstract framework to study SGD. However, we can also apply SGP more immediately in practice.
To this end, we need to consider the following computational tasks: (i) discretisation of the deterministic flows (ϕ_i)_{i∈I}; (ii) discretisation of the continuous-time Markov processes (i(t))_{t≥0}, resp. (j(t))_{t≥0}. The discretisation of the (ϕ_i)_{i∈I} consists in the discretisation of several homogeneous ODEs. The discretisation of ODEs has been studied extensively; see, e.g., [33]. Thus, we focus on (ii) and discuss a sampling strategy for the CTMPs in §4.1.
A different aspect is the following: note that when specifying strategies for (i) and (ii), we implicitly construct a stochastic optimisation algorithm. Since we have introduced SGP as a continuous-time variant of SGD, one of these algorithms should be the original SGD algorithm. Indeed, in §4.2 we will explain a rather crude discretisation scheme which allows us to retrieve SGD. Well-known algorithms beyond SGD that can be retrieved from SGP are discussed in §4.3.

Applying SGP
We now briefly explain a strategy that allows us to sample the CTMPs (i(t)) t≥0 and (j(t)) t≥0 . Without loss of generality, we focus on the second case, (j(t)) t≥0 .
Indeed, we give a sampling strategy in Algorithm 2. It commences by sampling an initial value j(0). This value remains constant for the duration of a random waiting time. After this waiting time is over, we sample the next value of the process from a uniform distribution on all states but the current one. This value is kept constant for another random waiting time, and so on. This strategy goes back to Gillespie [29]; see also [55] for this and other sampling strategies for CTMPs on discrete spaces.
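A minimal sketch of this sampling strategy, for the constant-rate case and with illustrative values of N, η, and the horizon T, might look as follows:

```python
import random

random.seed(0)

# Gillespie-type sampling of a CTMP on I = {0, ..., N-1}: exponential
# waiting times with mean eta, then a jump to a uniformly chosen state
# other than the current one (N, eta, T are illustrative).
N, eta, T = 5, 0.5, 20.0

def sample_ctmp_path(T):
    t = 0.0
    i = random.randrange(N)                   # initial value i(0) ~ Unif(I)
    path = [(t, i)]
    while True:
        t += random.expovariate(1.0 / eta)    # random waiting time, mean eta
        if t >= T:
            return path
        # jump with probability one: uniform on all states but the current one
        i = random.choice([j for j in range(N) if j != i])
        path.append((t, i))

path = sample_ctmp_path(T)
# consecutive states always differ, unlike in discrete-time SGD
assert all(u[1] != v[1] for u, v in zip(path, path[1:]))
```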
The potentially most challenging step in Algorithm 2 is the sampling from the distribution π_wt(· | t₀) in line 4. In the case of SGPC, i.e. if η is constant, this sampling comes down to sampling from an exponential distribution. In SGPD, the sampling can be performed using the quantile function of π_wt(· | t₀), if accessible; we sketch the method below. If the quantile function is not accessible, strategies such as rejection sampling may be applicable; see [57] for details. In the following, we first consider the case where 1/η(·) is an affine function and then the case where η scales exponentially in time. Both cases satisfy the growth condition in (6); thus, our theory applies to the SGPD employing either of these learning rate functions.
We can again compute the quantile function, where s ∈ (0, 1) and t₀ ≥ 0, and use it to estimate the mean and standard deviation of the distribution for a = b = 1 and t₀ ∈ [0, 10]; see Figure 3.
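For the affine case 1/η(t) = a + bt, the inverse-transform step can be sketched as follows. This is a reconstruction under our own assumptions, not the paper's equation (22) verbatim: integrating the rate over [t₀, t₀ + w] gives the survival function exp(−(a + bt₀)w − bw²/2), and the quantile follows from the quadratic formula (a, b, t₀ below are illustrative):

```python
import math
import random

random.seed(0)

def quantile(s, t0, a=1.0, b=1.0):
    # Solve (b/2) w^2 + (a + b*t0) w = -log(1 - s) for w >= 0.
    c = a + b * t0
    L = -math.log(1.0 - s)
    return (-c + math.sqrt(c * c + 2.0 * b * L)) / b

def sample_waiting_time(t0):
    # inverse-transform sampling: push a uniform sample through the quantile
    return quantile(random.random(), t0)

def cdf(w, t0, a=1.0, b=1.0):
    # CDF of the waiting time given current time t0
    return 1.0 - math.exp(-((a + b * t0) * w + 0.5 * b * w * w))

# sanity check: plugging the quantile back into the CDF returns s
s, t0 = 0.3, 2.0
assert abs(cdf(quantile(s, t0), t0) - s) < 1e-12
```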

Retrieving SGD from SGP
Now, we discuss how the SGP dynamic needs to be discretised to retrieve the SGD algorithm. To this end, we list some features that we need to keep in mind: the waiting times between switches of the data sets are deterministic in SGD and random in SGP; the processes (i(t))_{t≥0} and (j(t))_{t≥0} in SGP jump with probability one after the waiting time is over, i.e. i(t) ≠ i(s) if exactly one jump occurred in (t, s]; in SGD, however, it is possible to have a data set picked from the sample twice in a row. Finally, we need to discretise the flows (ϕ_i)_{i∈I} using the explicit Euler method. We approximate the process (j(t))_{t≥0} by a piecewise constant process (ĵ(t))_{t≥0} with i.i.d. states j₀, j₁, … ∼ Unif(I) on a deterministic grid (t_k)_{k=0}^∞. Note that with this definition of the sequence (η_k)_{k=1}^∞, we obtain precisely the discrete learning rate defined in Algorithm 1. See our discussion in §2.2 for the choice of (ĵ(t))_{t≥0} as an approximation of (j(t))_{t≥0}. If we employ (ĵ(t))_{t≥0} and an explicit Euler discretisation with step length η_k in step k ∈ N to discretise the respective flows (ϕ_i)_{i∈I}, we obtain precisely the process defined in Algorithm 1.
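The discretisation just described can be sketched as follows; the quadratic potentials, their minima a_i, and the particular decreasing learning rate are illustrative stand-ins of ours:

```python
import random

random.seed(0)

# Explicit Euler discretisation of the flows plus i.i.d. uniform subsampling
# recovers SGD. The potentials Phi_i(theta) = (theta - a_i)^2 / 2 are
# illustrative; theta* of the full potential is the mean of the a_i.
a = [2.0, -1.0, 0.5]
N = len(a)

def grad(i, theta):
    # grad Phi_i(theta) = theta - a_i
    return theta - a[i]

def sgd(theta, num_steps, eta_fn):
    for k in range(1, num_steps + 1):
        i = random.randrange(N)                    # j_k ~ Unif(I), repeats allowed
        theta = theta - eta_fn(k) * grad(i, theta)  # explicit Euler step of phi_i
    return theta

# with a decreasing learning rate, the iterates approach theta* = 0.5
theta_T = sgd(-1.5, 5000, lambda k: 1.0 / (10.0 + k))
```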

Beyond SGD
In §4.2, we have discussed how to discretise the SGP (ξ(t)) t≥0 to obtain the standard SGD algorithm. It is also possible to retrieve other stochastic optimisation algorithms by employing other discretisation strategies for the flows (ϕ i ) i∈I . Note, e.g., that when replacing the explicit Euler discretisation of the flows (ϕ i ) i∈I in §4.2 by an implicit Euler discretisation, we obtain the stochastic proximal point algorithm; see, e.g., Proposition 1 of [7] for details.
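For a quadratic potential, one can verify directly that an implicit Euler step coincides with the proximal map; the following sketch uses an illustrative one-dimensional potential of our own choosing:

```python
# Sketch: for Phi_i(theta) = (theta - a_i)^2 / 2, one implicit Euler step
# theta+ = theta - eta * grad Phi_i(theta+) can be solved exactly, and it
# coincides with the proximal map of eta * Phi_i (values are illustrative).

def implicit_euler_step(theta, a_i, eta):
    # theta+ = theta - eta*(theta+ - a_i)  =>  theta+ = (theta + eta*a_i)/(1+eta)
    return (theta + eta * a_i) / (1.0 + eta)

def prox_step(theta, a_i, eta):
    # argmin_z { (z - a_i)^2/2 + (z - theta)^2 / (2*eta) }, solved in closed form
    return (theta + eta * a_i) / (1.0 + eta)

theta, a_i, eta = -1.5, 0.5, 0.25
assert implicit_euler_step(theta, a_i, eta) == prox_step(theta, a_i, eta)
print(implicit_euler_step(theta, a_i, eta))  # -1.1
```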
Using higher-order methods instead of explicit/implicit Euler, we obtain higher-order stochastic optimisation methods; those have been discussed by Song et al. [66]. Adaptive learning rates for SGD are conceptually similar to adaptive step size algorithms in ODE solvers, but follow different ideas in practice; see [23,46].
Linear-complexity SGD-type methods, like Stochastic Average Gradient (SAG) [61], Stochastic Variance Reduced Gradient (SVRG) [35], or SAGA [21], remind us of multistep integrators for ODEs: the update does not only depend on the current state of the system, but also on past states. On the other hand, variance reduction in the discretisation of stochastic dynamical systems is, e.g., the object of Multilevel Monte Carlo path sampling, as proposed by Giles [28].

Numerical experiments
We now aim to build an intuition for the stationary measures π_C, π_ε (Theorems 3 and 5), study the convergence of the Markov processes, and compare SGP with SGD.

Figure 4: Estimated stationary measures of SGD and SGPC with different η ∈ {1, 10⁻¹, 10⁻², 10⁻³} and initial value θ₀ = −1.5. The results are based on kernel density estimations with 10⁴ samples each of θ(10) for SGPC and θ_k with k = 10/η for SGD. Note that for SGD with η = 1, the samples are concentrated in 3 points, which is why we plot a histogram rather than a density.

Constant learning rate
Approaching the optimisation problem in Example 3, we now employ SGPC with initial value θ₀ = −1.5 and η ∈ {1, 10⁻¹, 10⁻², 10⁻³}. We sample from this process using Algorithm 2 for the CTMP (i(t))_{t≥0} and the analytical solution of the gradient flows (ϕ_i)_{i∈I}. Throughout this section, we use the Matlab function ksdensity to compute kernel density estimates; all of these are based on Gaussian kernel functions with boundary correction at {−2, 2}, where necessary. We now sample SGPC as discussed above and collect the samples θ(10), i.e. the value of the process at time t = 10. In Figure 4, we show kernel density estimates based on 10⁴ of these samples. For large η, the density has mass all over the invariant set of the (ϕ_i)_{i∈I}. As η is reduced, the densities become more and more concentrated around the optimum θ*.
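The experiments use Matlab's ksdensity; a minimal Gaussian kernel density estimator, without the boundary correction mentioned above and with illustrative stand-in samples, can be sketched in Python:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

# illustrative stand-ins for the collected samples theta(10)
samples = rng.normal(loc=0.5, scale=0.1, size=10_000)

def kde(x, data, bandwidth=None):
    # Gaussian kernel density estimate on the grid x
    if bandwidth is None:
        # Silverman-style rule of thumb
        bandwidth = 1.06 * data.std() * len(data) ** (-1 / 5)
    z = (x[:, None] - data[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * math.sqrt(2 * math.pi))

grid = np.linspace(-2, 2, 401)
density = kde(grid, samples)
# the estimated density should integrate to roughly one over the grid
integral = float((density * (grid[1] - grid[0])).sum())
```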
Next, we compare SGPC with SGD. To this end, we compute kernel density estimates of 10⁴ samples of the associated SGD outputs. In particular, we run SGD with the same learning rates up to iterate 10/η. For η = 1, numerical artifacts seem to dominate SGD. For smaller η, the densities obtained from both algorithms behave very similarly: we only see a slightly larger variance in SGP. Indeed, the variances of θ(10) for η ∈ {10⁻¹, 10⁻², 10⁻³} seem to depend linearly on η and differ between the algorithms only by about a factor of 1.3; see the estimates in Table 1.

Table 1: Sample variances of 10⁴ samples of θ(10) in SGPC and θ_{10/η} in SGD.
We next take a look at the sample paths of said SGPC runs; consider Figure 5. As anticipated, and as already shown in Figure 2, a smaller η leads to faster switching and to a sample path that well approximates the full gradient flow; a large η leads to slow switching. It is difficult to recognise the actual speed of convergence shown in Theorem 3. However, we see that each of the chains indeed reaches a stationary regime. The time at which those regimes are reached depends strongly on η: for η = 1, we seem to reach said regime almost immediately; for the smallest learning rate η = 10⁻³, it appears to take up to t ≈ 3.5. What does this mean from a computational point of view? The approach with a small learning rate is computationally inefficient: the large number of switches makes the discretisation of the sample paths computationally expensive, and the slow convergence to the stationary regime implies that we need to run the process for a relatively long time. For large η, however, we are not able to identify the optimal point; see Figure 4. Hence, with a large and constant η, the method is ineffective.

Decreasing learning rate
In SGPD, we can solve the efficiency problem of SGPC noted at the end of §5.1: we start with a large η, which is decreased over time. Hence, we should expect fast convergence in the beginning and accurate estimation of θ* later on. To test this assertion, we return to the problem defined in Example 3.
We study two different time-dependent learning rates: a rational rate that is the reciprocal of an affine function, as in Example 1, as well as an exponential learning rate, as in Example 2. We sample from the associated waiting time distributions using the quantile functions (22) and (23), respectively. Note that, as mentioned before, the reciprocal of both learning rate functions satisfies the growth condition in (6). All the other specifications are identical to the ones given in §5.1; we set, e.g., ξ₀ := −1.5 as an initial value for the process.

Figure 5: Sample paths of SGPC as in Figure 4. Left: four sample paths (θ(t))_{t≥0}; right: associated distances between sample paths and optimal point, i.e. (|θ(t) − 0.5|)_{t≥0}.

In Figure 6, we show single sample paths of the processes (ξ(t), j(t))_{t≥0} with the different learning rate functions. In both cases, we can see that the waiting times between jumps in (j(t))_{t≥0} go down as t increases: the (vertical) jumps become denser over time. For small t > 0, one can also recognise the coupling between (ξ(t))_{t≥0} and (j(t))_{t≥0}. If we compare the paths with the different learning rate functions, we see that the exponential rate allows for much larger steps in the beginning and then decreases quite quickly; the rational rate leads to fast switching early on, which decreases further rather slowly over time. Note that these plots are essentially realistic versions of the cartoon in Figure 1. Next, we look at the distribution of ξ(t) for particular t > 0. In Figure 7, we plot kernel density estimates for the distributions of ξ(1/4), ξ(1/2), ξ(1), ξ(2), ξ(4), ξ(8), and ξ(10); those estimates are each based on 10⁴ independent Monte Carlo samples. Hence, we show how the distribution of the processes evolves over time. We observe that the process starting at ξ(0) = −1.5 moves away from that state and slowly approaches the optimal point θ* = 0.5.
Doing so, it starts with a large variance that is slowly decreased over time. This is consistent with what we have observed in Figure 4 and Table 1. In case of the exponential learning rate, this behaviour is much more pronounced: we start with a much higher variance but end up at t = 10 with a smaller variance.
In Figure 8, we additionally compare the distribution of the constant learning rate process with η = 10 −3 with the exponential and rational rate processes at the time at which their learning rate is approximately equal to 10 −3 . We see that the states of the constant and rational rate processes have almost the same distribution, which is what we would hope to see. The exponential learning rate process has a larger variance.
To study the performance of SGPD quantitatively, we estimate the mean and standard deviation of the absolute error |ξ(t) − 0.5| at t = 1, 2, …, 10 using 10⁴ Monte Carlo samples. For context, we also performed 10⁴ runs of the associated discrete-time SGD algorithms; the learning rate sequences (η_k)_{k=1}^∞ are chosen as suggested in (25). We show the results in Figure 9. In the exponential, continuous case, we see an exponential convergence rate. In all other settings, the rates are sublinear. For the discrete settings, this is exactly what we would expect based on the literature; see [34] and the references therein. Interestingly, the rational, continuous case appears to be less efficient than the rational, discrete case. This could imply that the learning rate function should be chosen according to the convergence rate of the underlying deterministic dynamical system.

Figure 6: A sample path of (ξ(t), j(t))_{t≥0}, as specified in §5.2. The top two figures refer to the rational learning rate; the bottom two figures refer to the exponential rate.

Conclusions
We have proposed the stochastic gradient process as a natural continuum limit of the popular stochastic gradient descent algorithm. It arises when replacing the explicit Euler updates by the exact gradient flows and the waiting times between data switches by appropriate random waiting times. This continuous-time model is a piecewise-deterministic Markov process. It represents the uniform subsampling from a finite set of potentials after strictly positive waiting times, the Markovian nature of SGD, the switching of potentials, and the approximation of the full deterministic gradient flow. Moreover, the process has an interpretation in population dynamics. Within this continuum limit, we are able to study Wasserstein ergodicity in the case of strongly convex target functions. In the case of constant learning rates, we obtain exponential ergodicity. A similar result has been established by [22] in discrete time. In the case of decreasing learning rates, we could show weak convergence to the minimiser of the target function. Our results do not allow us to assess the convergence rate in that case. Numerical experiments indicate that it depends on the underlying data switching process and could in certain cases be exponential as well.
In the numerical experiments, we compared samples from SGP with samples from SGD. Here, we observed, for instance, strong similarities between the stationary measures of the two processes. Indeed, we claim that our continuum limit is a good representation of stochastic gradient descent in the long-time limit. Here, we have been able to sample accurately from SGP, as the flows admit analytical representations. In most practical cases, we would need to construct a discrete stochastic optimisation algorithm from SGP using an ODE integrator. Following this machinery, one can also retrieve known stochastic optimisation algorithms, showing that SGP is also a generalisation of those.
We conclude this work with four remarks. Here, we discuss possible extensions of the stochastic gradient process framework.
Remark 3 (Global, non-convex) Throughout our long-time analysis, we have required strong convexity of the target functions. In practical applications, e.g. the training of deep neural networks, convexity is too strong an assumption. If certain Hörmander bracket conditions are satisfied, exponential ergodicity may also be shown without the strong convexity assumption; see, e.g., [3,17]. This does not yet imply that the processes will converge to the global optimum if η ↓ 0. However, we remark that the densities in the numerical illustrations in §5 very much remind us of a simulated annealing scheme, where η controls the variance of the target measure; see e.g. §5.2.3 of [57]. In some cases, simulated annealing is able to find global extrema of non-convex target functions; see [70]. Hence, this connection may fertilise future research in this direction.

Figure 8: Densities of the constant learning rate SGPC of Figure 4, the rational learning rate SGPD ξ(9.99), and the exponential learning rate SGPD ξ(6.91). The densities are estimated with 10⁴ samples.
Remark 4 (Constrained) SGD has been successfully applied in constrained optimisation, typically by projecting each update onto the space of feasible vectors. This is difficult to represent in the SGP setting, as the projection would need to be part of the piecewise ODEs. However, PDMPs on bounded sets already appear in the original paper by Davis [19]; there, a jump is introduced as soon as the boundary of the feasible set is reached. In SGP, one could introduce a jump in the continuous-time Markov processes (i(t))_{t≥0} and (j(t))_{t≥0} as soon as the boundary is hit. Hence, the data set would be randomly switched until the process moves away from the boundary, or until the boundary point is stationary for the process.
Remark 5 (Gradient-free) In this work, we cover only methods that are fundamentally based on discretised gradient flows. Other stochastic optimisation algorithms are based on other underlying dynamics, such as ensemble-based methods or evolutionary algorithms. Consider, for instance, the ensemble Kalman inversion framework, which was proposed by Schillings and Stuart [60] as a continuum limit of some ensemble Kalman filter. Using our SGP view, one may be able to analyse subsampling in ensemble Kalman inversion, as proposed by [37]. Non-Markovian settings arise, e.g., when adapting the learning rate throughout the algorithm, as in the celebrated AdaGrad algorithm [23]. Another non-Markovian extension is the following: in the present work, we have decided to switch the potentials in the SGPs after random waiting times. While this allowed us to study SGP as a (piecewise-deterministic) Markov process, it did not retain SGD's property of jumping after deterministic waiting times. If we model the waiting times deterministically, the processes (i(t))_{t≥0}, (j(t))_{t≥0} become general renewal processes and are non-Markovian. Especially since deterministic waiting times are easier to handle in practice, the resulting 'renewal stochastic gradient processes' are highly interesting objects for future studies.

A Auxiliary results concerning CTMPs
In this appendix, we give a brief derivation of the Markov kernel describing the processes (i(t))_{t≥0} and (j(t))_{t≥0}. Moreover, we discuss the non-explosiveness of (j(t))_{t≥0}, i.e. we show that the sequence of jump times (T_k)_{k=1}^∞ satisfies P(lim_{k→∞} T_k = ∞) = 1.
We commence with the discussion of the Markov kernels.
For details, we refer to the fundamental work by Kolmogorov [36, Equations (47), (52)]. The initial condition (29) is obviously satisfied. Moving on to (28): due to symmetry, it is sufficient to consider the cases j = j₀ and j ≠ j₀; in either case, the identity follows by a direct computation. Hence, M_{t|t₀} is indeed the Markov kernel describing the transitions of the CTMP (j(t))_{t≥0}. □ We now move on to proving the non-explosiveness of (j(t))_{t≥0}.
Proof. In the following, we construct a CTMP (k(t))_{t≥0} on N which has the same jump times (T_k)_{k=0}^∞ as (j(t))_{t≥0}. Then, we show that (k(t))_{t≥0} satisfies the assumptions of Proposition 1 in [16] on any compact interval in [0, ∞); this implies our assertion. Let (k(t))_{t≥0} be the CTMP on N with transition rate matrix Λ(t) : R^N → R^N (t ≥ 0). We now need to check the following assertions for t > t′ ≥ 0.