## 1 Introduction

The climate is changing. In particular, the loss of water in the soil and in the ice sheets is dangerous to humankind because it fuels droughts as well as sea level rise. Thus, the geosciences take care to monitor the mass transport of the Earth, see e. g. Fischer and Michel (2013b), Flechtner et al. (2020), IPCC (2014), Lin et al. (2018), NASA (2020), Sneeuw and Saemian (2019), Wiese et al. (2020) as well as the results of the DFG SPP 1257 (2006-2014) coordinated by Ilk and Kusche, see e. g. Kusche et al. (2012). The mass transport is obtained by modelling the gravitational potential from time-dependent satellite data, e. g. from the GRACE and GRACE-FO satellite missions, see e. g. Devaraju and Sneeuw (2017), Flechtner et al. (2014a, 2014b), NASA Jet Propulsion Laboratory (2020), Schmidt et al. (2008), Tapley et al. (2004), The University of Texas at Austin, Centre for Space Research (2020).

Mathematically, we can approximate the Earth’s surface with the unit sphere $$\Omega$$. The gravitational potential f on the Earth’s surface is usually expanded in spherical harmonics (SHs) $$Y_{n,j},\ n\in {\mathbb {N}}_0,\ j=-n,\ldots ,n,$$ such that we have

\begin{aligned} f = \sum _{n=0}^\infty \sum _{j=-n}^n \left\langle f,Y_{n,j}\right\rangle _{\mathrm {L}^2(\Omega )}Y_{n,j} \end{aligned}

which holds in the $$\mathrm {L}^2(\Omega )$$-sense. Correspondingly, at a satellite orbit radius $$\sigma >1$$, we upward continue this representation at every point $$\eta \in \Omega$$ via the operator $${\mathcal {T}}$$ and obtain the gravitational potential V outside the Earth, see e. g. Baur (2014), Freeden and Michel (2004), Moritz (2010),

\begin{aligned} V\left( \sigma \eta \right)&= \left( {\mathcal {T}}f \right) \left( \sigma \eta \right) \\&= \sum _{n=0}^\infty \sum _{j=-n}^n \left\langle f,Y_{n,j}\right\rangle _{\mathrm {L}^2(\Omega )}\sigma ^{-n-1}Y_{n,j}\left( \eta \right) ,\nonumber \end{aligned}
(1)

which actually holds pointwise. Obviously, $${\mathcal {T}}$$ has exponentially decreasing singular values. The well-known EGM2008 model represents the gravitational potential as given by (1) truncated at degree $$N=2190$$. For a satellite at 500 km height, this leads to a singular value of

\begin{aligned} \sigma ^{-N-1} = 0.927230^{2191} = 1.28279 \cdot 10^{-72}. \end{aligned}
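As a plausibility check of this value, the following Python sketch recomputes it. It assumes a mean Earth radius of 6371 km, which is not stated in the text but reproduces the base 0.927230 to the printed digits; all variable names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, not given in the text
HEIGHT_KM = 500.0         # satellite height above the Earth's surface
N = 2190                  # truncation degree of EGM2008

# Orbit radius relative to the unit sphere that models the Earth's surface.
sigma = (EARTH_RADIUS_KM + HEIGHT_KM) / EARTH_RADIUS_KM

# Use logarithms so that the order of magnitude is directly visible.
log10_singular_value = -(N + 1) * math.log10(sigma)
singular_value = 10.0 ** log10_singular_value

print(f"1/sigma      = {1.0 / sigma:.6f}")
print(f"sigma^(-N-1) = {singular_value:.3e}")
```

The result depends sensitively on the assumed radius because the base is raised to the power 2191, so small deviations in the last printed digits are to be expected.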

This behaviour is mathematically deduced and, thus, does not depend on a specific experiment setting. Naturally, we are much more interested in the inverse problem of the downward continuation, i. e. given values of the potential V at $$\sigma \eta$$, we are interested in the values of f at $$\eta$$. However, the inverse of $${\mathcal {T}}$$ has exponentially increasing singular values. Thus, the solution does not depend continuously on the data, see e. g. Engl et al. (1996), Louis (1989), Michel (2005, 2022), Rieder (2003). This means that, mathematically, the downward continuation is an exponentially ill-posed inverse problem and needs sophisticated methods to be tackled. We are interested in further developing such methods for ill-posed inverse problems. Thus, we consider the continuation of the gravitational potential and not of the potential gradients here because the former has a higher instability. Moreover, the classical downward continuation problem using satellite data serves our need for a well-understood ill-posed inverse problem, though we are aware that, in recent years, research has also concentrated on the likewise important airborne downward continuation, see, e.g. Novák et al. (2001), Eicker (2008), Naeimi (2013), Lieb (2017).
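The instability can be illustrated in the SH coefficient domain, where (1) shows that the upward continuation multiplies each degree-n coefficient by $$\sigma ^{-n-1}$$. The following Python sketch is only a toy illustration: the two coefficients, the noise level and the assumed Earth radius of 6371 km are hypothetical choices, and the unregularized inversion is shown solely to demonstrate the error amplification.

```python
def upward(coeffs, sigma):
    """Apply the upward continuation to SH coefficients {(n, j): value}."""
    return {(n, j): c * sigma ** (-(n + 1)) for (n, j), c in coeffs.items()}


def naive_downward(coeffs_up, sigma):
    """Invert the upward continuation without any regularization."""
    return {(n, j): c * sigma ** (n + 1) for (n, j), c in coeffs_up.items()}


sigma = (6371.0 + 500.0) / 6371.0  # assumption: mean Earth radius of 6371 km
f = {(2, 0): 1.0, (200, 0): 1.0}   # hypothetical coefficients on the surface

noise = 1e-8                       # hypothetical perturbation of the data
v_noisy = {key: c + noise for key, c in upward(f, sigma).items()}

f_recovered = naive_downward(v_noisy, sigma)
errors = {key: abs(f_recovered[key] - f[key]) for key in f}

for (n, j), err in sorted(errors.items()):
    print(f"degree {n:3d}: error after naive inversion = {err:.3e}")
```

The same data perturbation is amplified by a factor of $$\sigma ^{n+1}$$, so the error at degree 200 exceeds that at degree 2 by several orders of magnitude.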

One possible approach for ill-posed inverse problems is given by the Inverse Problem Matching Pursuit (IPMP) algorithms, i. e. here the Regularized (Orthogonal) Functional Matching Pursuit (R(O)FMP) algorithm. To the knowledge of the authors, only our own group developed matching pursuits for ill-posed inverse problems. Thus, for more details, we refer to e. g. Fischer and Michel (2013a), Michel (2015), Michel and Orzlowski (2017), Michel and Telschow (2016). Using discrete values of V, these methods iteratively build an approximation of f as a best basis expansion drawn from dictionary elements. A dictionary $${\mathcal {D}}$$ is an intentionally redundant set of trial functions. For spherical tasks like the downward continuation, it may contain SHs, Slepian functions (SLs), radial basis functions (RBFs) and/or radial basis wavelets (RBWs). Then, $${\mathcal {D}}$$ contains global functions as well as localized ones and band-limited as well as non-band-limited ones. In each iteration, the next dictionary element is chosen such that it minimizes the Tikhonov–Phillips functional regarding the current residual and approximation. In the ROFMP algorithm, it additionally fulfils, to a certain extent, orthogonality relations with the previously chosen basis elements. Thus, in this case, the coefficients are updated regularly to maintain their optimality. Note that the methods can in principle also be used for spatio-temporal data. For developing them further, however, we do without a temporal aspect.

The IPMP algorithms usually work with a large finite, a priori chosen subset of the infinitely many possible trial functions as their dictionary. This is obviously an obstacle for users who are new to the incorporated trial functions. Moreover, it also leads to a long runtime and a high storage demand which both prevent us from considering realistic experiment settings (e.g. using $$5 \times 10^6$$ data points). Further, we also recognize that a manually chosen dictionary possibly biases the obtained approximation, though without alternatives this can hardly be quantified.

Thus, we further developed the IPMP algorithms by adding a novel dictionary learning technique. Note, however, that our task at hand differs from established dictionary learning challenges: instead of solving pure approximation problems, we have to deal with the operator of the inverse problem; moreover, we intend to focus on learning established trial functions in contrast to defining new trial functions by learning values at grid points. Note that the former provides a higher comparability with well-known models. Thus, we cannot straightforwardly use previous dictionary learning strategies, see e. g. Aharon et al. (2006), Bruckstein et al. (2009), Engan et al. (1999a, 1999b), Prünte (2008), Rubinstein et al. (2010) but need to develop our own approach.

Here, the dictionary learning add-on for the SLs, RBFs and RBWs consists of a 2-step process of (first global then local) nonlinear constrained optimization problems in order to compute optimized candidates for these types of trial functions. We utilize the NLOpt library, see Johnson (2019), for this as, to the knowledge of the authors, it is reliable and offers a large range of local and global optimization algorithms to choose from. A candidate for the SHs, however, can be obtained by comparing the values of the respective objective function for different SHs. Then, we learn particular functions as well as a maximal SH degree. All of these candidates together constitute again a finite dictionary and we can proceed with the (remaining) routine of the IPMP algorithms in this iteration. After termination, we obtain an approximation of f in an optimized best basis whose elements can be re-used as a learnt dictionary in future runs of the IPMP algorithms. The IPMP algorithms which include the novel learning add-on are called the Learning Inverse Problem Matching Pursuit (LIPMP) algorithms, i. e. the Learning Regularized (Orthogonal) Functional Matching Pursuit (LR(O)FMP) algorithm. Note that the LRFMP algorithm here is an advanced version of the one presented in Michel and Schneider (2020).

In the sequel, we formally but briefly introduce the SHs, SLs, RBFs and RBWs as well as a dictionary in Sect. 2. Then, we explain the idea of the LIPMP algorithms which includes an overview of the IPMP algorithms and the novel learning add-on in Sect. 3. Particularly, we focus on how the 2-step optimization process fits into the routine of the IPMP algorithms. Then, we show the applicability and efficiency of the add-on as well as the learnt dictionary in a series of experiments in Sect. 4.

This paper is based on Schneider (2020) and was partly presented at the EGU2020: Sharing Geoscience Online (Schneider and Michel 2020).

## 2 Basics

Let $${\mathbb {R}}$$ be the set of all real numbers, $${\mathbb {R}}^d$$ be the real, d-dimensional vector space, $${\mathbb {N}}$$ be the set of all positive integers and $${\mathbb {N}}_0$$ be that of all non-negative integers. We denote $$\Omega :=\{x\ \in {\mathbb {R}}^3 :\ |x|=1\}$$ as the two-dimensional unit sphere and $${\mathbb {B}}:={\mathbb {B}}^3 :=\{x\ \in {\mathbb {R}}^3 :\ |x|< 1\}$$ the open unit ball. Furthermore, we can represent $$\eta (\varphi ,t) \in \Omega$$ as usual in spherical coordinates via the longitude $$\varphi \in [0,2\pi [$$ and the polar distance $$t = \cos (\theta ) \in [-1,1]$$ for the latitude $$\theta \in [0,\pi ]$$.

### 2.1 Spherical harmonics

Spherical harmonics (SHs) $$Y_{n,j}$$ are (global) polynomials on $$\Omega$$. They have a distinct degree $$n \in {\mathbb {N}}_0$$ and order $$j=-n,\ldots ,n$$. In practice, we usually choose the fully normalized SHs

\begin{aligned} Y_{n,j}(\eta (\varphi ,t)) :=p_{n,j}P_{n,|j|}(t) \left\{ \begin{matrix} \cos (j\varphi ),&{}j\le 0\\ \sin (j\varphi ),&{}j> 0\end{matrix} \right. \end{aligned}
(2)

for $$\eta (\varphi ,t) \in \Omega$$, an $$\mathrm {L}^2(\Omega )$$-normalization $$p_{n,j}$$ and the associated Legendre functions $$P_{n,|j|}$$. An example is given in Fig. 4a. For further details, the reader is referred to, e. g., Freeden et al. (1998), Freeden and Schreiner (2009), Müller (1966).

### 2.2 Slepian functions

Slepian functions (SLs) are band-limited, spatially optimally localized trial functions. Here, the spatial localization region shall be a spherical cap parametrized with $$c = \cos (\theta ) \in [-1,1]$$ and its centre $$A(\alpha ,\beta ,\gamma )\varepsilon ^3\in \Omega$$. Note that $$\theta$$ is the polar angle between vectors pointing at the apex and at an arbitrary point of the base and $$\alpha ,\ \gamma \in [0,2\pi [$$ and $$\beta \in [0,\pi ]$$ denote the Euler angles, $$A\in \mathrm {SO}(3)$$ is a rotation matrix and $$\varepsilon ^3=(0,0,1)^\mathrm {T}$$ is the North Pole. Then, we obtain for each localization region $$R :=R(c,A(\alpha ,\beta ,\gamma )\varepsilon ^3) \in {\overline{{\mathbb {B}}}}^4 :=\left\{ x \in {\mathbb {R}}^4\ :\ |x|\le 1\right\}$$ a set of $$k=1,\ldots ,(L+1)^2$$ Slepian functions defined by

\begin{aligned} g^{(k,L)}(R,\eta ) :=\sum _{l=0}^L\sum _{m=-l}^l g^{(k,L)}_{l,m}(R) Y_{l,m}(\eta ),\ \eta \in \Omega , \end{aligned}

where $$L \in {\mathbb {N}}_0$$ is the band-limit. The Fourier coefficients $$g^{(k,L)}_{l,m}(R),\ l=0,\ldots ,L,\ m=-l,\ldots ,l,$$ are obtained from the eigenvectors of the related algebraic eigenvalue problem of optimizing a band-limited, in SHs expanded function in the region R. An example is given in Fig. 4a. Note that a commuting operator provides a stable computation of these values if the localization region is a spherical cap. For further details, the reader is referred to, e. g., Albertella et al. (1999), Grünbaum et al. (1982), Michel (2013), Seibert (2018), Simons and Dahlen (2006).

### 2.3 Radial basis functions and wavelets

As examples for non-band-limited localized trial functions, we consider Abel–Poisson kernels $$K(x,\cdot )$$ (APKs) and wavelets $$W(x,\cdot )$$ (APWs) due to their closed form. That means, we have

\begin{aligned} K(x,\eta )&:=\frac{1-|x|^2}{4\pi (1+|x|^2-2x\cdot \eta )^{3/2}} \end{aligned}

and

\begin{aligned} W(x,\eta )&:=K(x,\eta ) - K(|x|x,\eta ) \end{aligned}

for $$\eta \in \Omega$$ and with the characteristic parameter $$x\in {\mathbb {B}}$$. The kernels act as low pass filters, whereas the wavelets are band pass filters. Examples are given in Fig. 4b. For further details, the reader is referred to, e. g., Freeden et al. (1998), Freeden and Michel (2004), Freeden and Schreiner (1998, 2009), Freeden and Windheuser (1996), Michel (2013), Windheuser (1995).
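The two closed forms translate directly into code. The following Python sketch evaluates the Abel–Poisson kernel and wavelet pointwise; the function names and the tuple representation of points are illustrative choices, not taken from any published implementation.

```python
import math


def dot(a, b):
    """Euclidean inner product of two 3-vectors given as sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def abel_poisson_kernel(x, eta):
    """K(x, eta) for x in the open unit ball and eta on the unit sphere."""
    r2 = dot(x, x)
    return (1.0 - r2) / (4.0 * math.pi * (1.0 + r2 - 2.0 * dot(x, eta)) ** 1.5)


def abel_poisson_wavelet(x, eta):
    """W(x, eta) = K(x, eta) - K(|x| x, eta)."""
    r = math.sqrt(dot(x, x))
    scaled = tuple(r * xi for xi in x)
    return abel_poisson_kernel(x, eta) - abel_poisson_kernel(scaled, eta)


north = (0.0, 0.0, 1.0)
south = (0.0, 0.0, -1.0)
x = (0.0, 0.0, 0.5)  # hypothetical characteristic parameter

print(abel_poisson_kernel(x, north), abel_poisson_kernel(x, south))
print(abel_poisson_wavelet(x, north))
```

The kernel is largest at the point $$x/|x|$$ on the sphere and becomes more concentrated as $$|x|$$ approaches 1, whereas $$x=0$$ yields the constant $$1/(4\pi )$$.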

### 2.4 A dictionary

A dictionary is a set of trial functions. Usually, we intentionally choose a redundant system such that it contains global as well as local functions and band-limited as well as non-band-limited functions, respectively. In this way, we are able to tailor an approximation in a heterogeneous basis which, thus, combines the best of several worlds. Then, the specific needs of the solution can in all probability be met more flexibly. In other words, the different trial functions are used to recover different aspects of the signal: global ones model major trends within the signal and local ones recover detail structures.

A dictionary can be specified as follows. We first define subsets for each type of trial function under investigation. Here, we set

\begin{aligned}&{[}N]_{\mathrm {SH}} :=\left\{ Y_{n,j}\ :\ (n,j) \in N \subseteq {\mathcal {N}}\right\} ,\\&{\mathcal {N}}:=\left\{ (n,j)\ :\ n\in {\mathbb {N}}_0,\ j=-n,\ldots ,n\right\} \end{aligned}

for SHs,

\begin{aligned}&{[}S]_{\mathrm {SL}} :=\left\{ g^{(k,L)}(R,\cdot )\ :\ (R,(k,L)) \in S \subseteq {\overline{{\mathbb {B}}}}^4 \times {\mathcal {L}}\right\} ,\\&{\mathcal {L}} :=\left\{ (k,L)\ :\ L\in {\mathbb {N}}_0,\ k=1,\ldots ,(L+1)^2\right\} \end{aligned}

for SLs,

\begin{aligned} {[}B_K]_{\mathrm {APK}}&:=\left\{ K(x,\cdot )\ :\ x \in B_K \subseteq {\mathbb {B}}\right\} \end{aligned}

for APKs and

\begin{aligned} {[}B_W]_{\mathrm {APW}}&:=\left\{ W(x,\cdot )\ :\ x \in B_W \subseteq {\mathbb {B}}\right\} \end{aligned}

for APWs. Such subsets are called trial function classes. The union of the defined trial function classes gives us the dictionary $${\mathcal {D}}$$:

\begin{aligned} {\mathcal {D}}:=[N]_{\mathrm {SH}} \cup [S]_{\mathrm {SL}} \cup [B_K]_{\mathrm {APK}} \cup [B_W]_{\mathrm {APW}}, \end{aligned}

confer Michel (2022), Michel and Schneider (2020), Schneider (2020). In general, it is not necessary that $${\mathcal {D}}$$ is finite. We denote an infinite dictionary by $${\mathcal {D}}^{\mathrm {Inf}}$$. Note that, depending on the actual choice of $$[N]_\mathrm {SH},\ [B_K]_\mathrm {APK}$$ and $$[B_W]_\mathrm {APW}$$, $${\mathcal {D}}$$ may be complete in certain function spaces like $$\mathrm {L}^2(\Omega )$$, see e. g. Freeden et al. (1998), Michel (2013), Schneider (2020) and the previously mentioned references regarding the SHs, APKs and APWs. This naturally holds for $${\mathcal {D}}^{\mathrm {Inf}}$$. In particular, each subsystem $$[N]_{\mathrm {SH}},\ [B_K]_{\mathrm {APK}}$$ and $$[B_W]_{\mathrm {APW}}$$ can be complete in $$\mathrm {L}^2(\Omega )$$ – take $$N={\mathcal {N}}$$ and dense subsets $$B_K,\ B_W \subset {\mathbb {B}}$$. This is the mentioned and desired redundancy of the dictionary. The representation of an unknown function in an overcomplete dictionary $${\mathcal {D}}$$ is, therefore, not unique. The algorithmic choice and the associated objective function determine which specific representation is obtained eventually. The details of this are explained in the following.

## 3 The LIPMP algorithms

We consider the downward continuation of the gravitational potential from satellite data to the Earth’s surface. That is, mathematically speaking, we consider the ill-posed inverse problem $$y={\mathcal {T}}_\daleth f$$ with the data $$y \in {\mathbb {R}}^\ell ,\ y_i = V(\sigma \eta ^i)$$, the satellite orbit radius $$\sigma >1$$, grid points $$\eta ^i \in \Omega$$ for $$i=1,\ldots , \ell$$ at the Earth’s surface and the operator $${\mathcal {T}}_\daleth f :=(({\mathcal {T}}f)(\sigma \eta ^i))_{i=1,\ldots ,\ell }$$ with the upward continuation operator $${\mathcal {T}}$$ as in (1). Thus, $${\mathcal {T}}_\daleth$$ is the corresponding evaluation operator of the upward continuation operator on a grid of $$\ell$$ points. Our task is to approximate the gravitational potential f at the Earth’s surface $$\Omega$$.

The LIPMP algorithms introduce an add-on to the established IPMP algorithms. The remaining routines coincide. Thus, we give a short overview of the latter’s strategy.

### 3.1 The underlying IPMP algorithms

Note that the IPMP algorithms discussed here have been developed by the Geomathematics Group Siegen in the past decade. Thus, this subsection is based on the previously mentioned literature from our group. We summarize here the main aspects to enable a general understanding of the methods necessary to describe the learning add-on. To avoid repetition of previously published content, we refer to the respective literature on the RFMP and ROFMP, respectively, for implementation details.

Due to the (severe) ill-posedness of the inverse problem at hand (see Sect. 1), it is indispensable to consider a regularization for the downward continuation. The IPMP algorithms utilize a Tikhonov–Phillips regularization which is an established and well-performing choice, e. g. for the downward continuation. The Tikhonov–Phillips regularization aims to solve the (approximative) regularized normal equation by minimizing the so-called Tikhonov–Phillips functional which consists of a data error and a penalty term.

In general, the minimization can be done via various approaches. We consider the IPMP algorithms for this here. Using an initial (guessed) approximation $$f_0$$, e. g. $$f_0 \equiv 0$$, the methods iteratively build a minimizer as a linear combination of weighted dictionary elements $$d_n \in {\mathcal {D}}$$:

\begin{aligned} f_N&= f_0 + \sum _{n=1}^N \alpha _n d_n \end{aligned}
(3)

in the case of the RFMP algorithm and

\begin{aligned} f_N^{(N)}&= f_0 + \sum _{n=1}^N \alpha _n^{(N)} d_n, \end{aligned}
(4)

in the case of the ROFMP algorithm. Recall that, due to the aforementioned redundancy in the dictionary, the approximation is usually built in a heterogeneous basis, i. e. from a mixture of global as well as local and band-limited as well as non-band-limited trial functions. Note that the superscript (N) in (4) refers to an update of the coefficients in each iteration step due to the orthogonality process: to improve the efficiency of the RFMP, a pre-fitting technique is included. This is a usual approach with matching pursuits, see, e.g. Vincent and Bengio (2002). Pre-fitting means that, in each iteration of the ROFMP, the previously chosen weights are updated to prevent the algorithm from picking the same trial function more than once. Note that, in practice, we usually consider the iterated (L)ROFMP algorithm which restarts the pre-fitting technique after a prescribed number of iterations for practical as well as theoretical reasons. For readability, in the sequel, we do without an additional subscript that would be necessary for this restart process.

The respective residuals are

\begin{aligned} R^{N+1}&:=R^N - \alpha _{N+1}{\mathcal {T}}_\daleth d_{N+1} \end{aligned}

for the RFMP algorithm and, in the case of the ROFMP algorithm,

\begin{aligned} R^{N+1}&:=R^N - \alpha _{N+1}^{(N+1)}{\mathcal {P}}_{{\mathcal {V}}_N^\perp } {\mathcal {T}}_\daleth d_{N+1}, \end{aligned}

where $${\mathcal {P}}_{{\mathcal {V}}_N^\perp }$$ is the orthogonal projection onto the orthogonal complement

\begin{aligned} {\mathcal {V}}_N^\perp&:=\left( \mathrm {span} \{{\mathcal {T}}_\daleth d_n\ :n=1,\ldots ,N\}\right) ^{\perp _{{\mathbb {R}}^\ell }}. \end{aligned}

In both cases, we have $$R^0 = y-{\mathcal {T}}_\daleth f_0$$ which yields $$R^0 = y$$ if $$f_0 \equiv 0$$. In each iteration N, we choose the weights $$\alpha _{N+1} \in {\mathbb {R}}$$ and $$\alpha _{N+1}^{(N+1)} \in {\mathbb {R}}$$, respectively, as well as the basis function $$d_{N+1} \in {\mathcal {D}}$$ such that we minimize, for $$\lambda >0$$, the Tikhonov–Phillips functional

\begin{aligned} (\alpha , d) \mapsto \left\| R^N - \alpha {\mathcal {T}}_\daleth d \right\| ^2_{{\mathbb {R}}^\ell } + \lambda \left\| f_N+\alpha d\right\| ^2_{{\mathcal {H}}_2(\Omega )} \end{aligned}

for the RFMP algorithm and, for the ROFMP algorithm,

\begin{aligned} (\alpha , d)&\mapsto \left\| R^N - \alpha {\mathcal {P}}_{{\mathcal {V}}_N^\perp }{\mathcal {T}}_\daleth d \right\| ^2_{{\mathbb {R}}^\ell } \\&\qquad + \lambda \left\| f_N^{(N)}+\alpha \left( d-b_n^{(N)}(d) \right) \right\| ^2_{{\mathcal {H}}_2(\Omega )}, \end{aligned}

where the projection is given by

\begin{aligned} {\mathcal {P}}_{{\mathcal {V}}_N^\perp }{\mathcal {T}}_\daleth d = {\mathcal {T}}_\daleth d - \sum _{n=1}^N \beta _n^{(N)}(d) {\mathcal {T}}_\daleth d_n \end{aligned}
(5)

with the projection coefficients $$\beta _n^{(N)}(d)$$. These projection coefficients are given by

\begin{aligned}&\beta _N^{(N)}(d) :=\frac{\left\langle {\mathcal {T}}_\daleth d, {\mathcal {P}}_{{\mathcal {V}}^\perp _{N-1}} {\mathcal {T}}_\daleth d_N \right\rangle _{{\mathbb {R}}^\ell }}{\left\| {\mathcal {P}}_{{\mathcal {V}}^\perp _{N-1}} {\mathcal {T}}_\daleth d_N\right\| _{{\mathbb {R}}^\ell }^2} \\&\text {and} \qquad \beta _n^{(N)}(d) :=\beta _n^{(N-1)}(d) - \beta _N^{(N)}(d) \beta _n^{(N-1)}(d_N) \end{aligned}

for $$n=1,\ldots ,N-1.$$ With these, we define

\begin{aligned} b_n^{(N)}(d) :=\sum _{n=1}^N \beta _n^{(N)}(d) d_n. \end{aligned}
(6)

Note that, in contrast to (5), the expansion (6) need not be a projection. For the penalty term, we use the norm of the Sobolev space $${\mathcal {H}}_2(\Omega ) \subset \mathrm {L}^2(\Omega )$$ which is the completion of the set of all square-integrable functions for which

\begin{aligned} \sum _{n=0}^\infty \sum _{j=-n}^n (n+0.5)^4 \langle F, Y_{n,j} \rangle ^2_{\mathrm {L}^2(\Omega )} =:\Vert F\Vert ^2_{{\mathcal {H}}_2(\Omega )} \end{aligned}

is finite, see e. g. Freeden et al. (1998), Michel (2013).
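For a band-limited function given by finitely many SH coefficients, this norm and the associated inner product reduce to weighted sums. The following Python sketch shows this under the assumption that the coefficients $$\langle F, Y_{n,j} \rangle _{\mathrm {L}^2(\Omega )}$$ are stored in a dictionary keyed by (n, j); the storage format and function names are illustrative.

```python
def h2_inner_product(coeffs_f, coeffs_g):
    """H_2(Omega) inner product of two functions given by SH coefficients.

    Both arguments map (n, j) to the L2(Omega) coefficient of Y_{n,j};
    missing keys are treated as zero coefficients.
    """
    common = coeffs_f.keys() & coeffs_g.keys()
    return sum((n + 0.5) ** 4 * coeffs_f[(n, j)] * coeffs_g[(n, j)]
               for (n, j) in common)


def h2_norm_squared(coeffs):
    """Squared H_2(Omega) norm, i.e. the weighted sum of squared coefficients."""
    return h2_inner_product(coeffs, coeffs)


example = {(0, 0): 1.0, (1, 0): 2.0}  # hypothetical coefficients
print(h2_norm_squared(example))
```

The weight $$(n+0.5)^4$$ grows quickly with the degree, so the penalty term damps high-degree contributions much more strongly than low-degree ones.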

In practice, we utilize that the minimization above is equivalent to the maximization of

\begin{aligned} \mathrm {RFMP}(d;N)&:=\frac{\left( \left\langle R^N, {\mathcal {T}}_\daleth d\right\rangle _{{\mathbb {R}}^\ell } - \lambda \left\langle f_N,d \right\rangle _{{\mathcal {H}}_2(\Omega )} \right) ^2}{\left\| {\mathcal {T}}_\daleth d\right\| _{{\mathbb {R}}^\ell }^2 + \lambda \left\| d\right\| ^2_{{\mathcal {H}}_2(\Omega )}}\nonumber \\&=:\frac{\left( A_N(d)\right) ^2}{B_N(d)} \end{aligned}
(7)

and

\begin{aligned}&\mathrm {ROFMP}(d;N) \\&\quad :=\frac{\left( \left\langle R^N, {\mathcal {P}}_{{\mathcal {V}}_N^\perp }{\mathcal {T}}_\daleth d \right\rangle _{{\mathbb {R}}^\ell } - \lambda \left\langle f_N^{(N)},d-b_n^{(N)}(d) \right\rangle _{{\mathcal {H}}_2(\Omega )} \right) ^2}{\left\| {\mathcal {P}}_{{\mathcal {V}}_N^\perp }{\mathcal {T}}_\daleth d\right\| _{{\mathbb {R}}^\ell }^2 + \lambda \left\| d-b_n^{(N)}(d)\right\| ^2_{{\mathcal {H}}_2(\Omega )}} \nonumber \\&\quad =:\frac{\left( A_N^{(N)}(d)\right) ^2}{B_N^{(N)}(d)},\nonumber \end{aligned}
(8)

respectively, where we obtain the weights via

\begin{aligned} \alpha _{N+1}&:=\frac{A_N(d_{N+1})}{B_N(d_{N+1})} \qquad \mathrm {and} \qquad \alpha _{N+1}^{(N+1)} :=\frac{A_N^{(N)}(d_{N+1})}{B_N^{(N)}(d_{N+1})}, \end{aligned}

respectively. The IPMP algorithms most commonly terminate if the relative data error falls below a certain threshold like the noise level or if a certain number of iterations is reached.
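The RFMP iteration with a finite dictionary can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes $$f_0 \equiv 0$$, a fixed number of iterations, and that the vectors $${\mathcal {T}}_\daleth d$$ and the Gram matrix of the $${\mathcal {H}}_2(\Omega )$$-inner products have been precomputed; the names `td`, `gram` and `rfmp` are hypothetical. The orthogonalization of the ROFMP is not included.

```python
def rfmp(y, td, gram, lam, n_iter):
    """Greedy RFMP sketch for a finite dictionary of M elements.

    y     : data vector of length ell
    td    : td[m] is the vector T d_m of length ell (assumed precomputed)
    gram  : gram[m][k] = <d_m, d_k> in H_2(Omega) (assumed precomputed)
    lam   : regularization parameter lambda > 0 (lam = 0 gives plain matching pursuit)
    Returns the accumulated weight per dictionary element and the final residual.
    """
    m_total = len(td)
    weights = [0.0] * m_total  # f_N = sum_m weights[m] * d_m, starting from f_0 = 0
    residual = list(y)
    # The denominator B(d) does not depend on the iteration.
    denom = [sum(v * v for v in td[m]) + lam * gram[m][m] for m in range(m_total)]

    for _ in range(n_iter):
        best_m, best_value, best_a = 0, -1.0, 0.0
        for m in range(m_total):
            data_term = sum(r * v for r, v in zip(residual, td[m]))
            penalty_term = sum(gram[m][k] * weights[k] for k in range(m_total))
            a = data_term - lam * penalty_term      # A_N(d_m)
            value = a * a / denom[m]                # objective (A_N)^2 / B_N
            if value > best_value:
                best_m, best_value, best_a = m, value, a
        alpha = best_a / denom[best_m]
        weights[best_m] += alpha
        residual = [r - alpha * v for r, v in zip(residual, td[best_m])]
    return weights, residual


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, residual = rfmp([3.0, 0.0, 1.0], identity, identity, lam=0.0, n_iter=2)
print(weights, residual)
```

In this toy setting the dictionary elements are mapped to orthonormal vectors, so two iterations without regularization already reproduce the data; with $$\lambda >0$$ the chosen weights shrink accordingly.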

The dictionary $${\mathcal {D}}$$ is finite and manually (though influenced by experience) chosen in most of the previous publications on an IPMP algorithm as the use of an infinite dictionary $${\mathcal {D}}^{\mathrm {Inf}}$$ was an open question at that time. Then, the maximizer of (7) and (8), respectively, is found by comparing the objective function for all dictionary elements.

Obviously, the dictionary in particular causes some difficulties for a wider use of these methods: first of all, a user who is inexperienced with the presented trial functions might have a hard time choosing a well-working dictionary. Moreover, though the choice can be improved by experience, a very large finite dictionary is usually still used as it most likely works better. However, its preprocessing causes high computational costs, particularly with respect to runtime and storage demand. Finally, such a dictionary is naturally prone to produce a biased approximation $$f_N$$ and $$f_N^{(N)}$$, respectively, though this is hard to quantify as we lack practical alternatives. In contrast, if we automated the choice of (i.e. learnt) a finite dictionary, we would remedy the repercussions of a lack of experience. For this automation, it is vital to implement the use of all possible functions from (some of) the trial function classes, i.e. the use of an infinite dictionary. At least from a theoretical point of view, an infinite dictionary should also be less prone to bias than a finite one. However, whether the computational costs can be reduced as well probably depends on the specific realization of using an infinite dictionary. Note that the reduction in computational costs is a first step to applying the methods in competitive experiments. This again might, in future research, allow for a quantification of any existing bias.

To enable the use of an infinite dictionary, the LIPMP algorithms were developed. In particular, the LIPMP algorithms expand the IPMP methods to an infinite dictionary via a learning add-on.

We start by considering how an infinite dictionary $${\mathcal {D}}^{\mathrm {Inf}}$$ can be introduced into the routine of the IPMP algorithms. From this, the learnt dictionary follows naturally. If we run an IPMP algorithm with the learning add-on, we say that we run an LIPMP (Learning IPMP) algorithm. Thus, note that we obtain the LRFMP and the LROFMP with this approach.

The infinite dictionary $${\mathcal {D}}^{\mathrm {Inf}}$$ for the downward continuation of satellite data is defined by

\begin{aligned} {\mathcal {D}}^{\mathrm {Inf}}:=\big [{\widetilde{N}}\big ]_{\mathrm {SH}} \cup [S]_{\mathrm {SL}} \cup [B_K]_{\mathrm {APK}} \cup [B_W]_{\mathrm {APW}} \end{aligned}
(9)

with

\begin{aligned}&{\widetilde{N}} :=\{(n,j)\ :\ n \le {\overline{N}},\ j=-n,\ldots ,n\}, \\&S :={\mathbb {B}}^4 \times {\mathcal {L}} \qquad \text {and}\qquad B_K :=B_W :={\mathbb {B}}\end{aligned}

for fixed $${\overline{N}},\ L \in {\mathbb {N}}_0$$. The trial function class of the SHs is still finite. This is due to the discrete nature of their characteristic parameters: the degree n and the order j. Nonetheless, the other trial function classes are truly infinite. This means, in the case of a SL, that the choice of the parameters of the centre $$\alpha ,\ \beta ,\ \gamma$$ and the size c of the spherical cap are arbitrary, while their band-limit is fixed and finite in analogy to the SH choice. In the case of APKs and APWs, the centres can be chosen from all points in the ball $${\mathbb {B}}$$.

The main obstacle for using $${\mathcal {D}}^{\mathrm {Inf}}$$ is the determination of the maximizer of (7) and (8), respectively, in the truly infinite trial function classes. For this, we introduce an additional optimization step into the routine. In particular, in this step, we determine a finite dictionary of (optimized) candidates $${\mathcal {C}}$$ from the infinite $${\mathcal {D}}^{\mathrm {Inf}}$$. Then, $${\mathcal {C}}$$ acts as a finite dictionary and we can proceed in the current iteration just like in an iteration of the respective IPMP algorithm. Therefore, after termination (which obeys the same rules as in the IPMP algorithms), we obtain an approximation $$f_N$$ and $$f_N^{(N)}$$, respectively, in a best basis of optimized dictionary elements. The latter constitute the learnt (finite) dictionary which can be used in future runs of the IPMP algorithms.

Due to the different nature of the classes, we distinguish a strategy for the SHs and one for the remaining three trial function classes. The approach to learn SHs has already been explained to a certain extent in Michel and Schneider (2020). For completeness and some additional insights, we summarize it here again. We propose to learn a maximal SH degree as well as particular SHs simultaneously. The trial function class $$[{\widetilde{N}}]_\mathrm {SH}$$ includes all SHs up to a degree $${\overline{N}}$$ (see (9)). The value of $${\overline{N}}$$ should be chosen much larger than is sensible for the data y. Then, we again follow the previous approach and determine and compare the values of (7) and (8), respectively, for all SHs up to degree $${\overline{N}}$$. Hence, after termination, we have a certain set of SHs that are used in the representation (3) and (4), respectively. Most likely, the algorithms will have determined a smaller, properly learnt maximal SH degree $$\nu \in {\mathbb {N}},\ \nu < {\overline{N}},$$ on their own. Moreover, we obtain a set of distinct SHs used in the approximation and, thus, contained in the learnt dictionary. Note that this approach demands a finite starting dictionary $${\mathcal {D}}^\mathrm {s}$$ in the LIPMP algorithms which contains (at least) $$[{\widetilde{N}}]_\mathrm {SH}$$.

For the remaining trial function classes of the SLs, APKs and APWs, we determine each candidate by solving nonlinear constrained optimization problems. Note that, in the following, N denotes the iteration number. The objective functions in the N-th iteration are $$\mathrm {IPMP}(d(z);N)$$ where d(z) denotes a SL, APK or APW, respectively, and we have

\begin{aligned} \mathrm {IPMP}(d(z);N) :=\left\{ \begin{array}{ll} \mathrm {RFMP}(d(z);N), &{}\text {if LRFMP is used,}\\ \mathrm {ROFMP}_S(d(z);N), &{}\text {if LROFMP is used.} \end{array} \right. \end{aligned}

The placeholder $$d(z)(\cdot )$$ stands for either $$g^{(k,L)}(R,\cdot )$$ with $$z=R(c,\alpha ,\beta ,\gamma ) \in {\mathbb {B}}^4$$ or for $$K(x,\cdot )$$ and $$W(x,\cdot )$$ with $$z=x\in {\mathbb {B}}$$. Hence, we maximize $$\mathrm {IPMP}(d(z);N)$$ with respect to the characteristic parameter vector z of the trial functions SL, APK and APW, respectively.

As $$\mathrm {ROFMP}(\cdot ; \cdot )$$ is not well-defined for previously chosen dictionary elements, we use $$\mathrm {ROFMP}_S(d(z);N)$$ which is the product of $$\mathrm {ROFMP}(\cdot ;\cdot )$$ from (8) and a spline to avoid neighbourhoods of critical basis functions from their respective trial function class. Let $$\varepsilon$$ be the size of such a neighbourhood. To avoid a neighbourhood of a current z, such a spline is given by

\begin{aligned} \left( S_{z^{(n)}} \right) |_{[0,\varepsilon ]}&\equiv 0,\\ \left( S_{z^{(n)}}\right) |_{(\varepsilon ,2\varepsilon )}(\tau )&:=10\left( \frac{\tau }{\varepsilon } -1\right) ^3 -15\left( \frac{\tau }{\varepsilon }-1\right) ^4 + 6\left( \frac{\tau }{\varepsilon }-1\right) ^5 \\&\qquad \text {and} \qquad \left( S_{z^{(n)}} \right) |_{[2\varepsilon ,2]} \equiv 1, \end{aligned}

where $$\tau = \Vert z - z^{(n)} \Vert ^2$$ denotes the squared distance between the current value z and a previously chosen value $$z^{(n)}$$. In the N-th iteration, this yields

\begin{aligned} \mathrm {ROFMP}_S(d(z);N)&:=\mathrm {ROFMP}(d(z);N) \\&\qquad \qquad \times \prod _{n=1}^{N} S_{z^{(n)}}\left( \left\| z - z^{(n)} \right\| ^2\right) . \end{aligned}

Note that the product on the right-hand side only considers those $$d(z^{(n)})$$ which are from the same trial function class as d(z). Then, for each truly infinite trial function class, we solve the maximization problem $$\mathrm {IPMP}(d(z);N) \rightarrow \max !$$ in each iteration N.
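The spline and the damped objective can be sketched in a few lines of Python. The sketch assumes that the value of $$\mathrm {ROFMP}(d(z);N)$$ is already available as a number and that the list of previous parameter vectors only contains those from the same trial function class; the function names are illustrative.

```python
def smooth_cutoff(tau, eps):
    """Spline S: 0 on [0, eps], smooth quintic rise on (eps, 2 eps), 1 beyond."""
    if tau <= eps:
        return 0.0
    if tau >= 2.0 * eps:
        return 1.0
    s = tau / eps - 1.0
    return 10.0 * s ** 3 - 15.0 * s ** 4 + 6.0 * s ** 5


def rofmp_s(rofmp_value, z, previous_params, eps):
    """Multiply the ROFMP objective value by the splines of all previous choices.

    previous_params must only hold parameter vectors z^(n) of previously
    chosen elements from the same trial function class as d(z).
    """
    factor = 1.0
    for z_prev in previous_params:
        tau = sum((a - b) ** 2 for a, b in zip(z, z_prev))  # squared distance
        factor *= smooth_cutoff(tau, eps)
    return rofmp_value * factor


chosen = [[0.1, 0.0, 0.0]]  # hypothetical previously chosen centre
print(rofmp_s(2.0, [0.1, 0.0, 0.0], chosen, eps=0.01))
print(rofmp_s(2.0, [0.9, 0.0, 0.0], chosen, eps=0.01))
```

The quintic on $$(\varepsilon ,2\varepsilon )$$ has vanishing first and second derivatives at both ends, so the damped objective stays smooth enough for gradient-based solvers.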

Note that we have to model the corresponding constraints on the parameter vectors z as well. For the SLs, we have

\begin{aligned} (c,\alpha ,\beta ,\gamma ) \in [-1,1] \times [0,2\pi ] \times [0,\pi ] \times [0,2\pi ]. \end{aligned}

For the APKs and APWs, we obtain

\begin{aligned} |z| < 1. \end{aligned}

With these constraints and the objective function $$\mathrm {IPMP}(\cdot ;\cdot )$$, we can solve each optimization problem with any established and suitable approach for nonlinear optimization. Note, however, that if a gradient-based method is chosen, we have to determine the gradients of $$\mathrm {IPMP}(\cdot ;\cdot )$$ with respect to the characteristic parameters of the trial function under investigation. As we showed for the kernels and $$\mathrm {RFMP}(\cdot ;\cdot )$$ in the first publication on the LIPMP algorithms, Michel and Schneider (2020), this can be done with the standard rules of differentiation. In particular, as we have seen in (7) and (8), the objective function of such a nonlinear approximation is given as a quotient of certain terms. Thus, the first step is to use the quotient rule. Then, we have to consider the derivatives of the single terms of the numerator and denominator. These terms fall into one of two categories: either we consider an $${\mathbb {R}}^\ell$$- or an $${\mathcal {H}}_2(\Omega )$$-inner product. In the first case, the interesting aspect is the derivative of $${\mathcal {T}}_\daleth d(z)$$ with respect to z. For the APK and APW, this is straightforwardly obtained. For the SLs, we need a bit more effort as we need to differentiate the Wigner rotation matrices and the Gauß algorithm to obtain the derivative of the coefficients. When differentiating the $${\mathcal {H}}_2(\Omega )$$-inner products, we have to take a closer look at the derivatives of the projection coefficients and of terms of the form $$\langle d_n, d(z) \rangle _{{\mathcal {H}}_2(\Omega )}$$ and $$\langle d(z), d(z) \rangle _{{\mathcal {H}}_2(\Omega )}$$, where $$d_n$$ is any previously chosen trial function and d(z) is the current trial function to be optimized. The former problem reduces again to the derivative of $${\mathcal {T}}_\daleth d(z)$$ which is discussed in the first case. The derivative of $$\langle d_n, d(z) \rangle _{{\mathcal {H}}_2(\Omega )}$$ for the SLs has already been discussed there as well, as it only depends on the derivative of their coefficients. For the APKs and APWs, the derivative of $$\langle d_n, d(z) \rangle _{{\mathcal {H}}_2(\Omega )}$$ effectively reduces to the terms

\begin{aligned} \nabla _x |x|^{mn} Y_{n,j}\qquad \text {and} \qquad \nabla _x |x|^{mn} P_{n} \end{aligned}

for $$x\in {\mathbb {B}},\ m\in \{1,2\},\ n\in {\mathbb {N}}_0$$ and spherical harmonics $$Y_{n,j}$$ as well as Legendre polynomials $$P_n$$. Note, however, that this reduction includes a discussion of exchanging limits. With the well-known spherical formulation of $$\nabla$$, the terms are easily derived. Under closer inspection, the resulting expressions turned out to be well-defined at any possible singularity. Last but not least, the derivative of $$\langle d(z), d(z) \rangle _{{\mathcal {H}}_2(\Omega )}$$ follows directly, as the exchange of limits has been discussed in the former case. The full derivation is published in Schneider (2020). As it is quite lengthy, we abstain from repeating it here.
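As a brief sketch of the two rules involved, let A(z) and B(z) be generic placeholders for the numerator and the denominator of the objective function (this notation is ours for illustration and is not taken from (7) and (8)). The quotient rule then reads

\begin{aligned} \nabla _z \frac{A(z)}{B(z)} = \frac{B(z)\, \nabla _z A(z) - A(z)\, \nabla _z B(z)}{B(z)^2}. \end{aligned}

Writing $$x = r\xi$$ with $$r=|x|>0$$ and $$\xi \in \Omega$$, the gradient decomposes into a radial and a tangential part, $$\nabla _x = \xi \, \partial _r + r^{-1}\nabla ^*_\xi$$, where $$\nabla ^*$$ denotes the surface gradient. For the first of the terms above, this gives

\begin{aligned} \nabla _x \left( r^{mn}\, Y_{n,j}(\xi )\right) = mn\, r^{mn-1}\, Y_{n,j}(\xi )\, \xi + r^{mn-1}\, \nabla ^*_\xi Y_{n,j}(\xi ). \end{aligned}

The term with $$P_n$$ can be treated analogously. For $$n=0$$, both summands vanish, so the factor $$r^{mn-1}$$ is only needed for $$mn\ge 1$$.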

In practice, we solve the optimization problems using the NLOpt library, see Johnson (2019). In particular, as advised there, we solve them in a 2-step optimization procedure. That is, we first determine a global solution (with a derivative-free method) and then refine it using a gradient-based local method. We include both solutions in the learnt set of candidates $${\mathcal {C}}$$ in case one of the solvers runs into problems. If the global solver needs a sensible starting solution, we can include a selection of SLs, APKs or APWs in the starting dictionary as well. This starting solution can then also be included in $${\mathcal {C}}$$. However, these should not have a major impact on the learnt dictionary.
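The 2-step idea can be illustrated with a minimal, self-contained sketch. It does not use NLOpt: the coarse grid search is only a stand-in for a derivative-free global solver, and the projected gradient ascent with finite-difference gradients is only a stand-in for a gradient-based local solver. The toy objective and all function names are hypothetical.

```python
import itertools


def global_step(objective, bounds, points_per_axis=21):
    """Derivative-free stand-in: evaluate the objective on a regular grid."""
    axes = [
        [lo + (hi - lo) * k / (points_per_axis - 1) for k in range(points_per_axis)]
        for lo, hi in bounds
    ]
    return max(itertools.product(*axes), key=objective)


def numerical_gradient(objective, z, h=1e-6):
    """Central finite differences (analytic gradients would be used in practice)."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grad.append((objective(z_plus) - objective(z_minus)) / (2 * h))
    return grad


def local_step(objective, z0, bounds, step=0.1, tol=1e-8, max_iter=5000):
    """Projected gradient ascent with step halving, started at the global solution."""
    z = list(z0)
    value = objective(z)
    for _ in range(max_iter):
        grad = numerical_gradient(objective, z)
        trial = [min(max(zi + step * gi, lo), hi)
                 for zi, gi, (lo, hi) in zip(z, grad, bounds)]
        trial_value = objective(trial)
        if trial_value > value + tol:   # accept only a real improvement
            z, value = trial, trial_value
        elif step > tol:                # otherwise shrink the step
            step /= 2
        else:
            break
    return tuple(z)


def two_step(objective, bounds):
    """Return both solutions so that both can enter the set of candidates."""
    z_global = global_step(objective, bounds)
    z_local = local_step(objective, z_global, bounds)
    return z_global, z_local


def toy_objective(z):
    """Hypothetical smooth objective with its maximum at (0.31, -0.47)."""
    return 1.0 / (1.0 + (z[0] - 0.31) ** 2 + (z[1] + 0.47) ** 2)


candidates = two_step(toy_objective, [(-1.0, 1.0), (-1.0, 1.0)])
```

Returning both the global and the refined solution mirrors the choice of keeping both in the set of candidates, so that a failure of one step does not discard the result of the other.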

In Michel and Schneider (2020), we proposed the use of certain additional features to guide the learning process. Though the features proved to be helpful in certain learning settings, from our experience, using a 2-step optimization, i. e. solving the described optimization problems first globally and then locally, as well as using more diverse trial function classes reduces the need for some rather manual features. Nonetheless, some of them, like an iterative application of the learnt dictionary (i. e. allowing only the first N dictionary elements in the N-th iteration of the IPMP when the learnt dictionary is used), are particularly helpful when we have to balance the tradeoff between numerical accuracy and runtime. Thus, one should bear in mind that, in some cases, it can be helpful to guide the learning process. We explain in the description of the experiments which few additional features we use here.

In Fig. 1, a schematic representation of the learning method is given. The starting point is the red circle which represents the initialization of the experiment parameters. Then, the iteration process starts (’next iteration’). In each iteration, the method steps into the trial function class under consideration (’find optimal APK/APW/SH/SL’) and solves the respective optimization problem, e.g. via a 2-step procedure. The solutions are passed to ’learnt set of candidates’ which builds the finite dictionary of candidates. From there, the common routine of an iteration step of the respective IPMP algorithm is executed: ’choose a best candidate as $$d_{n+1}$$’, ’check termination criteria’ and, if the method does not terminate yet, ’updates of IPMP’. After the updates (e.g. of the residual and possibly the weights), the method steps into the next iteration. When it terminates, we obtain a ’learnt dictionary and approximation’ (green circle).
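The control flow described above can be mimicked in a strongly simplified toy setting. The following sketch is not the LIPMP itself: it works on a 1-D interval, uses an unregularized matching-pursuit objective, and replaces the trial function classes by hypothetical cosines (discrete parameter) and bumps (continuous parameter). The grid-based searches are stand-ins for the global and local solvers.

```python
import math

T = [i / 199 for i in range(200)]  # hypothetical 1-D data points


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bump(z, width=0.05):
    """Localized trial function with a continuous parameter z (its centre)."""
    return [math.exp(-((t - z) / width) ** 2) for t in T]


def cosine(k):
    """Global trial function with a discrete parameter k (its frequency)."""
    return [math.cos(math.pi * k * t) for t in T]


def evaluate(atom):
    kind, param = atom
    return bump(param) if kind == "bump" else cosine(param)


def score(residual, d):
    """Matching-pursuit objective: squared projection of the residual onto d."""
    return dot(residual, d) ** 2 / dot(d, d)


def best_bump(residual):
    """'find optimal ...': coarse search first, then a finer search nearby."""
    coarse = max((k / 50 for k in range(51)),
                 key=lambda z: score(residual, bump(z)))
    fine = max((coarse + (k - 50) * 4e-4 for k in range(101)),
               key=lambda z: score(residual, bump(z)))
    return [("bump", coarse), ("bump", fine)]


def best_cosine(residual, max_k=10):
    k_best = max(range(max_k + 1), key=lambda k: score(residual, cosine(k)))
    return [("cos", k_best)]


def learn(y, rel_tol=1e-3, max_iter=20):
    residual, dictionary, coeffs = list(y), [], []
    norm_y = math.sqrt(dot(y, y))
    for _ in range(max_iter):                                    # 'next iteration'
        candidates = best_bump(residual) + best_cosine(residual)  # set of candidates
        atom = max(candidates, key=lambda a: score(residual, evaluate(a)))
        d = evaluate(atom)                                        # chosen d_{n+1}
        alpha = dot(residual, d) / dot(d, d)
        residual = [r - alpha * v for r, v in zip(residual, d)]   # update
        dictionary.append(atom)
        coeffs.append(alpha)
        if math.sqrt(dot(residual, residual)) <= rel_tol * norm_y:  # termination
            break
    return dictionary, coeffs, residual


y = [2.0 * c + 1.5 * b for c, b in zip(cosine(3), bump(0.37))]
learnt_dictionary, coefficients, final_residual = learn(y)
```

The returned list of chosen atoms plays the role of the learnt dictionary, and the coefficients together with the atoms form the approximation.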

By construction, the LIPMP algorithms yield an approximate solution of the inverse problem as well as a learnt dictionary for this problem. Hence, they can be used as standalone approximation algorithms or as auxiliary algorithms to determine a finite dictionary automatically.

Moreover, they inherit the convergence results of the IPMP algorithms (see the literature mentioned above). In particular, for infinitely many iterations in the LRFMP algorithm, we have convergence of the approximation to the solution of the Tikhonov–Phillips regularized normal equation. This means: the determined approximation is stable with respect to noise in the data and it tends (theoretically) to the exact (unstable) solution, if the regularization parameter tends (together with the noise) to zero.

### 3.3 Discussion of approaches for GRACE data

In this paper, we propose the (L)IPMPs for modelling the mass transport on the Earth using, e.g. GRACE data. Naturally, we should discuss this approach with respect to the common modelling of such data via spherical harmonics or mascons.

Approximating GRACE data via spherical harmonics is the traditional approach but has been shown to produce a North-South striping in the gravitational field caused, e.g. by the mission design as well as processing strategies, see, e.g. Chen et al. (2021). In our research, we have also seen this with Level 2 data. That is why we include a very basic destriping ansatz (see Sect. 4.1) which is sufficient here. Nevertheless, there exist various methods (see, for instance, the references given below or in Chen et al. (2021)) which take care of the North-South stripes in a much more sophisticated manner. However, these methods resulted in signal loss, see Chen et al. (2021), Watkins et al. (2015).

Watkins et al. (2015) show that constrained mascons yield a higher resolution (thus, less signal loss) without the need for post-processing destriping methods. Other mascon approaches, see, e.g. Luthcke et al. (2013), Save et al. (2016), also supersede the spherical harmonics approach in that respect. Though the details may vary with respect to a specific mascon approach, the general principle is as follows: the Earth’s surface is paved with specific patches, for instance spherical caps or hexagons. For each of these patches, a mass value is determined. Hence, a mascon is some kind of finite element on the Earth’s surface. Along with noise constraints and regularization, they are used as basis functions to approximate the gravitational potential via GRACE Level 1b measurements (e.g. the k-band range rate). Note that, see Chen et al. (2021), the mascon approach can be transformed into a spherical harmonic ansatz which enables a spectral representation of the results.

The question is now how the (L)IPMP approach fits into these concepts. We include spherical harmonics in the dictionary of an (L)IPMP. We do not use finite elements here (although it is in principle possible). However, we utilize RBFs and RBWs which are local functions as well. Thus, one could say that, with the (L)IPMPs, we combine the general ideas of both approaches into one algorithm.

There are two things to be noted: first, because the RBFs and RBWs can only be represented in a spherical harmonic series, we obviously cannot—without loss of accuracy—transform our approximation into a pure and finite spherical harmonic one. However, from the mathematical point of view there is also no necessity to do this. Moreover, previous studies (Michel and Telschow 2014, 2016; Telschow 2014) showed that, dependent on the experiment setting, distinguishing the IPMP approximation into the different trial function classes yields a multiscale representation. Second, Watkins et al. (2015) discussed the size of the mascon patches in use and concluded that the chosen size was a compromise between regions of low and high latitude (i.e. the equatorial and the polar area) and that future research should investigate mascons with flexible sizes of—in their case—spherical caps. This is interesting because, when comparing the mascon approach and the LIPMPs, the size of the spherical cap resembles the scale of the local dictionary elements used. However, the LIPMPs include by construction all scales of RBFs and RBWs. That means, the LIPMPs do not necessitate similar compromises in the basis functions but already implement what appeared sensible in the NASA research.

Thus, we assume that our method could in all probability be competitive with the established spherical harmonic as well as mascon approach if we use Level 1b data as well.

## 4 Numerical results

We first summarize the general setting of the experiments. Then, we consider the performance of the LIPMP methods when using different noise levels. Next, we show the results of the LIPMP algorithms as standalone approximation methods. Finally, we compare a manually chosen and a learnt dictionary in the IPMP algorithms. Note that our test scenarios here shall serve as proofs of concept in the sense that the main features of the add-on are demonstrated. In a continuing project, we investigate the behaviour for more realistic data and for other applications.

### 4.1 Experiment setting in general

We use the unit sphere as an approximate relative surface of the Earth. Then, we can use (1) for the evaluation of the operator. Of course, this is a simplification of reality. However, it suffices for our purposes. Moreover, it allows us to use the mentioned singular value decomposition. Using other geometries (e.g. an ellipsoid or the geoid) would lead to an enormous numerical burden with respect to the evaluation of the operator as well as the regularization terms. As data, we use the EGM2008, see e. g. Pavlis et al. (2012), as well as GRACE data from May 2008 as expansion coefficients in (1). In both cases, we use the degrees equal to or greater than 3 up to the highest given one (i.e. degree 2190 and order 2159 for the EGM2008 and degree and order 60 for GRACE). We omit degree 2 because the signal then shows visible local features instead of being dominated by the Earth’s ellipticity. Further, we evaluate the respective expansion on a regularly distributed Reuter grid of 12684 points. For an example of a Reuter grid, see Fig. 4d.

The question arises how the used resolution of the EGM2008 and the GRACE data fits the number of data points. With respect to (2), we see that a spherical harmonic $$Y_{n,j}$$ has 2|j| extrema at the equator. This means we have maximally 120 extrema for the GRACE data and 4318, respectively, for the EGM2008 data. This corresponds to a resolution of $$\approx 360$$ km and $$\approx 10$$ km, respectively, at a satellite height of 500 km. From the definition of the Reuter grid in Michel (2013), we see that, for an even number $$N \in {\mathbb {N}}$$, we obtain 2N grid points at the equator (i.e. $$\theta = \pi /2$$). In our case, we have $$N=100$$ which leads to 12684 grid points distributed over the sphere. Thus, at the equator, we have a resolution of 200 grid points, or $$\approx 215$$ km. Obviously, we undersample the EGM2008 and oversample the GRACE data (in particular, as we also destripe the GRACE data). We aim to implement the (L)IPMPs in future research more efficiently such that we can increase our resolution of the data in use. For the following tests, it was not yet possible to increase the number of data points adequately. However, because we present an over- and an undersampled problem, the results are sufficient for the proof of concept we intend to show.
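The arithmetic behind these resolutions can be retraced with a short sketch. It assumes a mean Earth radius of 6371 km and the commonly used construction of the Reuter grid (rows at the polar angles $$i\pi /N$$ with $$\lfloor 2\pi /\arccos ((\cos (\pi /N)-\cos ^2\theta _i)/\sin ^2\theta _i)\rfloor$$ points each, plus both poles). The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius


def spacing_km(count, height_km=0.0):
    """Arc length between `count` equally spaced points on a great circle."""
    return 2.0 * math.pi * (EARTH_RADIUS_KM + height_km) / count


def reuter_row_sizes(n):
    """Points per latitude row of a Reuter grid with parameter n (poles included)."""
    dtheta = math.pi / n
    rows = [1]  # north pole
    for i in range(1, n):
        theta = i * dtheta
        arg = (math.cos(dtheta) - math.cos(theta) ** 2) / math.sin(theta) ** 2
        arg = max(-1.0, min(1.0, arg))
        # the small tolerance guards against rounding just below an integer
        rows.append(math.floor(2.0 * math.pi / math.acos(arg) + 1e-9))
    rows.append(1)  # south pole
    return rows


rows = reuter_row_sizes(100)
print("grid points at the equator:", rows[50])
print("total number of grid points:", sum(rows))
for label, count in [("GRACE extrema", 2 * 60),
                     ("EGM2008 extrema", 2 * 2159),
                     ("Reuter grid points", rows[50])]:
    print(label, round(spacing_km(count, height_km=500.0), 1), "km")
```

Calling `spacing_km` with `height_km=0.0` gives the corresponding spacings at the Earth’s surface instead of at satellite height.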

The Driscoll–Healy grid was used for obtaining the approximation error. At the Earth’s surface, the EGM2008 has a resolution of $$\approx 9$$ km while the GRACE data attain $$\approx 334$$ km. The Driscoll–Healy grid we use has a resolution of $$\approx 111$$ km. Thus, we gain more information about the solution when using the Driscoll–Healy grid. However, note that we need to sample more points when looking at the approximation error in order to evaluate whether our approximation suits the solution well in between data points. Moreover, note that a regular grid does not pose an additional challenge for the methods as it avoids critical data gaps. However, in Sect. 4.3, we also discuss an experiment with an irregular grid. For the IPMPs, irregular grids were already discussed in Michel and Telschow (2014, 2016), Telschow (2014). Thus, the methods themselves can be compared more easily.

For the relative root mean square error (RMSE)

\begin{aligned} \frac{\sqrt{\frac{\sum _{i=1}^{65341} (f_N({\tilde{\eta }}^i) - f({\tilde{\eta }}^i))^2}{65341}}}{\sum _{i=1}^{65341} (f({\tilde{\eta }}^i))^2}, \end{aligned}

we utilize grid points $${\tilde{\eta }}^i$$ of an equi-angular Driscoll–Healy grid, see e. g. Driscoll and Healy (1994), Michel (2013), of 65341 points where f is the exact solution (presumed according to the EGM2008 or GRACE data) and $$f_N$$ is our approximation after N iterations. For an example of a Driscoll–Healy grid, see Fig. 4d. Note that, for a meaningful analysis of the absolute error, we have to use a different grid with many more grid points than the data are given at. We choose the Driscoll–Healy grid because its distribution is obviously very different. Besides the relative RMSE, we also consider the relative data error $$\Vert R^N\Vert _{{\mathbb {R}}^\ell }/\Vert R^0\Vert _{{\mathbb {R}}^\ell }$$ and the absolute approximation error.

Moreover, for the GRACE data, we utilize the arithmetic mean of the Level 2 Release 05 provided by the GFZ, JPL and UTCSR as it was advised in Sakumura et al. (2014). Further, we smooth the data with a Cubic Polynomial Spline of order 5, see, e. g. Schreiner (1996), Freeden et al. (1998), Fengler et al. (2007), to remove the North-South striping. We are aware that there exist many and more sophisticated methods to remove satellite striping, see, e.g. Davis et al. (2008), Klees et al. (2008), Kusche (2007). However, here, we aim to show the competitiveness of the methods and do not strive for discovering new geoscientific phenomena. Thus, this very basic destriping approach suffices for our needs.

If not stated otherwise, the data are modelled on a 500 km satellite height and are perturbed with $$5\%$$ Gaussian noise, such that we have perturbed data $$y^\delta$$ given by

\begin{aligned} y^\delta _i = y_i \cdot \left( 1 + 0.05 \cdot \varepsilon _i\right) \end{aligned}
(10)

for the unperturbed data $$y_i$$ and a Gaussian distributed random number $$\varepsilon _i$$. Certainly, for a more realistic scenario, one could also use specific GRACE-related noise instead.
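A minimal sketch of the perturbation (10), assuming that $$\varepsilon _i$$ is drawn from a standard normal distribution (the text only specifies a Gaussian distribution). The function name and the seed parameter are hypothetical.

```python
import random


def perturb(data, noise_level=0.05, seed=None):
    """Return y_i * (1 + noise_level * eps_i) with eps_i ~ N(0, 1) (assumed)."""
    rng = random.Random(seed)
    return [y * (1.0 + noise_level * rng.gauss(0.0, 1.0)) for y in data]


noisy = perturb([1.0, -2.5, 0.3], noise_level=0.05, seed=42)
```

Since the noise is multiplicative, data values equal to zero remain unperturbed in this model.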

The algorithms terminate if the relative data error falls below the noise level, increases above 2 or if 1000 iterations are reached. The latter two criteria are necessary because we tested different regularization parameters and some of them turned out to be inappropriate and yielded a numerically diverging sequence. We implemented the iterated (L)ROFMP algorithm and restarted the prefitting procedure after 100 iterations. Amongst the tested regularization parameters, we chose the one that minimized the relative RMSE if the relative data error reached the noise level at termination.
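The three termination criteria can be written down compactly. The following sketch uses hypothetical function and parameter names; the default values are the ones stated above.

```python
import math


def relative_data_error(residual, initial_residual):
    """Euclidean norm of the current residual relative to the initial one."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return norm(residual) / norm(initial_residual)


def termination_reason(rel_error, iteration, noise_level=0.05,
                       divergence_bound=2.0, max_iterations=1000):
    """Return why the iteration stops, or None if it should continue."""
    if rel_error < noise_level:
        return "noise level reached"
    if rel_error > divergence_bound:
        return "diverged"
    if iteration >= max_iterations:
        return "iteration limit reached"
    return None
```

Returning the reason instead of a plain boolean makes it easy to keep only those runs that stopped at the noise level when selecting the regularization parameter.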

The optimization problems are solved by the ORIG_DIRECT_L (globally) and the SLSQP (locally) algorithms from the NLOpt, see Johnson (2019). As it is advised, we narrow the constraints by $$10^{-8}$$. Due to the regularization, the narrowing can be relatively small. Further, we set some termination criteria for the optimization procedures. We found the following values to be useful in practice: we limit the absolute tolerance for the change of the objective function between two successive iterates as well as the tolerance between the iterates themselves to $$10^{-8}$$. Moreover, we allow 5000 function evaluations and 200 seconds for each optimization.

With respect to the SLs, APKs and APWs, we forbid choosing two trial functions of the same type which are as close as $$\varepsilon = 5 \times 10^{-4}$$ or closer in one (L)ROFMP step. The distance between two trial functions is obtained as the distance between their characteristic parameters. In the case of the APKs and APWs, we compute the Euclidean norm of $$x-x^{(n)}$$. In the case of SLs, we use $$\Vert z-z^{(n)}\Vert ^2 = (c-c^{(n)}) + \arccos ({\overline{z}}\cdot {\overline{z}}^{(n)})$$, where $${\overline{z}} = (\alpha ,\beta ,\gamma )$$ and $${\overline{z}}^{(n)} = (\alpha ^{(n)},\beta ^{(n)},\gamma ^{(n)})$$. From our experience, a value smaller than $$5 \times 10^{-4}$$ appears to be too small to prevent ill-definedness in the objective function. Further, we use the same regularization parameter for learning and applying the learnt dictionary. The regularization parameter is constant unless anything different is stated. We apply the learnt dictionary iteratively (confer Michel and Schneider 2020). That means, in the N-th step of the IPMP, only the first N learnt dictionary elements can be chosen.
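For the APKs and APWs, the closeness check and the iterative application of the learnt dictionary can be sketched as follows. The function names are hypothetical, and only the Euclidean case is covered; the distance for the SLs is not reproduced here.

```python
import math

EPSILON = 5e-4  # threshold stated in the text


def is_too_close(x, chosen_centres, eps=EPSILON):
    """True if x lies within eps (Euclidean distance) of a previously chosen centre."""
    return any(math.dist(x, x_n) <= eps for x_n in chosen_centres)


def admissible_elements(learnt_dictionary, step):
    """In the N-th step, only the first N learnt dictionary elements may be chosen."""
    return learnt_dictionary[:step]


chosen = [(0.0, 0.0, 0.94)]
print(is_too_close((0.0, 0.0, 0.9403), chosen))
print(admissible_elements(["d_1", "d_2", "d_3"], step=2))
```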

As the starting dictionary, we use

\begin{aligned} \left[ N^\mathrm {s}\right] _\mathrm {SH}&= \left\{ Y_{n,j}\ \left| \ n=0,\ldots ,100;\ j=-n,\ldots ,n \right. \right\} \\ \left[ S^\mathrm {s}\right] _\mathrm {SL}&= \left\{ \left. g^{(k,5)}\left( \left( c,A(\alpha ,\beta ,\gamma )\varepsilon ^3\right) ,\cdot \right) \right| \right. \\&\quad \quad \quad \quad \quad \quad c\in \left\{ \frac{\pi }{4},\frac{\pi }{2}\right\} ,\ \alpha \in \left\{ 0,\frac{\pi }{2},\pi ,\frac{3\pi }{2}\right\} ,\\&\quad \quad \quad \quad \quad \quad \beta \in \left\{ 0,\frac{\pi }{2},\pi \right\} ,\ \gamma \in \left\{ 0,\frac{\pi }{2},\pi ,\frac{3\pi }{2}\right\} ,\\&\quad \left. \quad \quad \quad \quad \quad k=1,\ldots ,36\right\} \\ \left[ B_K^\mathrm {s}\right] _\mathrm {APK}&= \left\{ \frac{K(x,\cdot )}{\Vert K(x,\cdot )\Vert _{\mathrm {L}^2(\Omega )}}\ \left| \ |x| = 0.94,\ \frac{x}{|x|} \in X^\mathrm {s} \right. \right\} \\ \left[ B_W^\mathrm {s}\right] _\mathrm {APW}&= \left\{ \frac{W(x,\cdot )}{\Vert W(x,\cdot )\Vert _{\mathrm {L}^2(\Omega )}}\ \left| \ |x| = 0.94,\ \frac{x}{|x|} \in X^\mathrm {s} \right. \right\} \\ {\mathcal {D}}^{\mathrm {s}}&= \left[ N^\mathrm {s}\right] _{\mathrm {SH}} \cup \left[ S^\mathrm {s}\right] _{\mathrm {SL}} \cup \left[ B_K^\mathrm {s}\right] _{\mathrm {APK}} \cup \left[ B_W^\mathrm {s}\right] _{\mathrm {APW}} \end{aligned}

with a regularly distributed Reuter grid $$X^\mathrm {s}$$ of 123 grid points. Thus, the starting dictionary contains 13903 trial functions. This allows the experiments of the LIPMP algorithms to run on an HPC node of 48 GB RAM with 12 CPUs.
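The stated size of the starting dictionary can be retraced by counting the parameter combinations of each class. The helper names below are hypothetical; the counts are read off the definition above.

```python
def count_sh(max_degree):
    """Number of spherical harmonics Y_{n,j} with n = 0..max_degree, j = -n..n."""
    return (max_degree + 1) ** 2


def count_sl(n_c, n_alpha, n_beta, n_gamma, n_k):
    """Number of Slepian functions: one per combination of (c, alpha, beta, gamma, k)."""
    return n_c * n_alpha * n_beta * n_gamma * n_k


reuter_points = 123  # size of the grid X^s for the APK and APW centres
starting_dictionary_size = (
    count_sh(100)                 # SHs up to degree 100
    + count_sl(2, 4, 3, 4, 36)    # SLs
    + reuter_points               # APKs with |x| = 0.94
    + reuter_points               # APWs with |x| = 0.94
)
print(starting_dictionary_size)
```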

### 4.2 Results for different noise levels

In the described setting, we run the LIPMP algorithms with the EGM2008 data and different noise levels to investigate the influence of noise on the results before we analyse the learning methods as standalone approximation methods as well as compare the results from a learnt and a manually chosen dictionary in the next sections. In particular, we tested no noise, $$1\%$$, $$2\%$$, $$3\%$$, $$5\%$$ and $$10\%$$ of Gaussian noise for the downward continuation from 500 km using EGM2008 data. The results are given in Table 1. There we give the determined regularization parameter, the completed iterations as well as the relative RMSE and data error for the mentioned noise levels for both the LRFMP and the LROFMP. We computed new random numbers (confer (10)) but ran each algorithm only once for each noise level. Of course, we realize that a more thorough way of evaluating the methods’ behaviour for different noise levels would be to create, for each noise level, a sufficiently large number of perturbations and run each algorithm for each perturbation. Note that each run of an LIPMP would include a search for a regularization parameter as well. However, we abstain from this due to the associated high demand on calculation time. In view of our aim to further increase the efficiency of the (L)IPMPs, such computationally complex validations can hopefully become feasible in the near future. Note that, nonetheless, by generating different random numbers for each noise level, we also include a little variation in the noise used here.

In Table 1, we see that for decreasing noise

• the regularization parameter decreases,

• the number of completed iterations increases and

• —most importantly—the relative RMSE decreases,

• though it cannot reach similar values as if no satellite height were used (see Sect. 4.3).

Note that, for noise equal to or higher than $$2\%$$, we present here the results where the methods terminate when the noise level is reached and, amongst those, the relative RMSE is lowest. In the experiments with less noise, the methods never reached the noise level for the tested regularization parameters before 1000 iterations. Hence, we present the results with the lowest RMSE even though the noise level is not reached (yet). Furthermore, note that the number of completed iterations is generally lower for the LROFMP. This is in accordance with previous publications on the non-learning ROFMP and, thus, can also be expected for the learning variant. However, this suggests that, in the case of $$2\%$$ noise, the result of the LRFMP could be improved if we allowed more than 1000 iterations. Similarly, this could also be assumed for $$1\%$$ noise as the number of completed iterations increases with decreasing noise. However, again for efficiency reasons as well as for a better comparability with Sects. 4.3 and 4.4, we stick with the chosen maximal number of 1000 iterations here.

All in all, the most important result from these experiments is that the relative RMSE decreases with decreasing noise level. This shows that the performance of the LIPMP algorithms is influenced by noise only in the expected way. Further, we also see that the influence of the satellite height appears to be more significant to remaining errors than the noise level (compare with the results of the pure approximation in Sect. 4.3). Both of these influences are well-understood and minimized with a regularization but cannot sensibly be abandoned. With this in mind, we next analyse the approximations of the learning methods as well as the learnt dictionary.

### 4.3 The LIPMP algorithms as standalone approximation methods

By construction, the learning algorithms themselves incorporate the maximization of the same objective function which also occurs in the IPMP. Thus, they should be usable as standalone approximation algorithms. We investigate this next: we consider the approximation of EGM2008-based surface data as well as the downward continuation of regularly and irregularly distributed EGM2008-based satellite data by the LRFMP as well as the LROFMP algorithm. Moreover, we verify the downward continuation of contrived data by the LROFMP algorithm. Due to the orthogonality procedure, we assume that the LROFMP algorithm is more suited to distinguish the contrived data.

The irregularly distributed grid has already been used in Michel and Telschow (2014) and simulates a denser data distribution on the continents. It is given in Fig. 4d and includes 6968 grid points. The contrived data consist of three SHs and three APKs:

\begin{aligned} f&= Y_{9,5}\left( \cdot \right) + Y_{5,5}\left( \cdot \right) + Y_{2,0}\left( \cdot \right) \\&\quad + {\widetilde{K}}\left( x\left( 0.5,\frac{3\pi }{2},\frac{\pi }{4}\right) ,\cdot \right) + {\widetilde{K}}\left( x\left( 0.75,2\pi ,-\frac{\pi }{4}\right) ,\cdot \right) \\&\quad + {\widetilde{K}}\left( x\left( 0.9,\frac{\pi }{2},\frac{\pi }{4}\right) ,\cdot \right) , \end{aligned}

where the notation $$x(r,\varphi ,\theta )$$ with the radius r, the longitude $$\varphi$$ and the latitude $$\theta$$ is used. $${\widetilde{K}}$$ stands for the $$\mathrm {L}^2(\Omega )$$-normalized APKs. The data are again perturbed by $$5\%$$ Gaussian noise. Correspondingly, the starting dictionary (for the test with contrived data only) is given as

\begin{aligned} \left[ N^\mathrm {s}\right] _\mathrm {SH}&= \left\{ \left. Y_{n,j}\ \right| \ n=0,\ldots ,10;\ j=-n,\ldots ,n \right\} \\ \left[ B_K^\mathrm {s}\right] _\mathrm {APK}&= \left\{ \frac{K(x,\cdot )}{\Vert K(x,\cdot )\Vert _{\mathrm {L}^2(\Omega )}}\ \left| \ |x| = 0.94,\ \frac{x}{|x|} \in X^\mathrm {s} \right. \right\} \\ {\mathcal {D}}^{\mathrm {s}}&= \left[ N^\mathrm {s}\right] _{\mathrm {SH}} \cup \left[ B_K^\mathrm {s}\right] _{\mathrm {APK}}, \end{aligned}

where $$X^\mathrm {s}$$ is a regularly distributed Reuter grid of 6 points. Then, the APKs included in the contrived data are not contained in the starting dictionary. We allow a maximum of 100 iterations here because the data consist of only six trial functions.
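To make the contrived APKs concrete, the following sketch evaluates an $$\mathrm {L}^2(\Omega )$$-normalized Abel–Poisson kernel. It assumes the usual definition $$K(x,\eta ) = \sum _{n=0}^\infty \frac{2n+1}{4\pi }|x|^n P_n\left( \frac{x}{|x|}\cdot \eta \right)$$ with its well-known closed form, and it interprets $$x(r,\varphi ,\theta )$$ with a geographic latitude $$\theta$$. Both are assumptions for illustration, and all function names are hypothetical.

```python
import math


def to_cartesian(r, lon, lat):
    """Point x(r, lon, lat) with radius r, longitude lon and latitude lat (assumed)."""
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))


def apk(x, eta):
    """Closed form of the Abel-Poisson kernel for |x| < 1 and |eta| = 1."""
    r2 = sum(c * c for c in x)
    x_dot_eta = sum(a * b for a, b in zip(x, eta))
    return (1.0 - r2) / (4.0 * math.pi * (1.0 + r2 - 2.0 * x_dot_eta) ** 1.5)


def apk_l2_norm(x):
    """L2 norm on the unit sphere: sqrt(sum_n (2n+1)/(4 pi) |x|^(2n))."""
    r2 = sum(c * c for c in x)
    return math.sqrt((1.0 + r2) / (4.0 * math.pi)) / (1.0 - r2)


def normalized_apk(x, eta):
    return apk(x, eta) / apk_l2_norm(x)


def apk_series(x, eta, max_degree=200):
    """Truncated Legendre series of the kernel (needs 0 < |x| < 1), as a cross-check."""
    r = math.sqrt(sum(c * c for c in x))
    t = sum(a * b for a, b in zip(x, eta)) / r
    p_prev, p = 1.0, t                      # P_0 and P_1
    total = 1.0 + 3.0 * r * t               # terms for n = 0 and n = 1
    for n in range(1, max_degree):
        p_prev, p = p, ((2 * n + 1) * t * p - n * p_prev) / (n + 1)  # P_{n+1}
        total += (2 * n + 3) * r ** (n + 1) * p
    return total / (4.0 * math.pi)


x = to_cartesian(0.5, 3.0 * math.pi / 2.0, math.pi / 4.0)
eta = (0.0, 0.0, 1.0)
print(normalized_apk(x, eta), apk(x, eta) - apk_series(x, eta))
```

The closer |x| is to 1, the more localized the kernel becomes, which is why the centres at radius 0.94 in the starting dictionary differ noticeably in scale from the contrived ones at 0.5, 0.75 and 0.9.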

In Table 2, we give an overview of the results. The type of experiment is abbreviated: “approximation” stands for the experiment where no satellite height is included, “downward continuation, (ir-)regular grid” stands for the use of 500 km satellite height and an (ir-)regularly distributed grid and “contrived data” is self-explanatory. Further, we state the regularization parameter, the completed iterations, the respective maximally learnt SH degrees and the relative RMSEs as well as data errors. In Figs. 2a to 3a, we see the absolute approximation errors obtained in the different experiments. Figure 3b shows the given and chosen APKs of the experiment with contrived data.

Generally, the remaining errors are situated in regions where we expect them to be. In the case of the approximation of surface data, Fig. 2a, and the downward continuation of regularly distributed satellite data, Fig. 2b, c, respectively, we find deviations from the solution particularly in the Andean region, the Himalayas and the Pacific Ring of Fire. This is reasonable as the gravitational potential contains much more local structure there which is highly influenced by the noise and the satellite height. Note that we included the same results in Fig. 2b, c twice: the left-hand side of Fig. 2b can be compared to Fig. 5a and the right-hand side to Fig. 5b, i.e. the approximation of the LIPMP can be compared with the respective approximations of the IPMP with a manually chosen or a learnt dictionary. Figure 2c compares the solution of the LRFMP and the LROFMP as standalone approximation algorithms with each other. Further, we find that, in the case of approximating surface data, i. e. using potential data which are not damped due to satellite height, the methods obtain much better relative RMSEs while still exhibiting larger data errors than in the case of the downward continuation, see Table 2. The latter is clear, since more local structures are visible on the surface and appear relatively larger with respect to the noisy data. Obviously, they also need more iterations in this case. Again, as the data contain more information in this case, this behaviour can be expected. Nonetheless, these experiments show that the LIPMP algorithms can, indeed, be used as standalone approximation algorithms.

Similar results are obtained for the downward continuation of irregularly distributed satellite data, Fig. 3a. However, we notice that some additional errors occur here in comparison with the results of the regularly distributed data, Fig. 2b. In particular, these are located mostly in areas with larger data gaps, see, e. g., the North Atlantic and the Indian Ocean. This indicates that the LIPMP algorithms are able to distinguish smoother and rougher regions on their own and, thus, prevent local gaps from having a global influence.

Finally, we consider the test with the contrived data. First of all, we note that the LROFMP algorithm is able to approximate these data as well, see Table 2. Moreover, we saw that the SHs are obtained exactly. Further, the APKs are either clustered around the solutions or have a very small coefficient, see Fig. 3b. Note that those few wrongly chosen APKs may be caused by the present noise. The SHs are easier to distinguish, most likely because of their orthogonality. Hence, we see that the LROFMP algorithm is also able to distinguish global trends and local anomalies.

### 4.4 Learning a dictionary

Finally, we consider the competitiveness of the learnt dictionary. For this, we extend the general setting from Sect. 4.1 in the following way. We compare the learnt dictionary with a manually chosen dictionary which is similar to those in previous publications, see e. g. Telschow (2014):

\begin{aligned} \left[ N^\mathrm {m}\right] _\mathrm {SH}&= \left\{ \left. Y_{n,j}\ \right| \ n=0,\ldots ,25;\ j=-n,\ldots ,n \right\} \\ \left[ S^\mathrm {m}\right] _\mathrm {SL}&= \left\{ \left. g^{(k,5)}\left( \left( c,A(\alpha ,\beta ,\gamma )\varepsilon ^3\right) ,\cdot \right) \right| \right. \\&\quad \quad \quad \quad \quad \quad c\in \left\{ \frac{\pi }{4},\frac{\pi }{2}\right\} ,\ \alpha \in \left\{ 0,\frac{\pi }{2},\pi ,\frac{3\pi }{2}\right\} ,\\&\quad \quad \quad \quad \quad \quad \beta \in \left\{ 0,\frac{\pi }{2},\pi \right\} ,\ \gamma \in \left\{ 0,\frac{\pi }{2},\pi ,\frac{3\pi }{2}\right\} ,\\&\quad \quad \quad \quad \quad \quad \left. k=1,\ldots ,36\right\} \\ \left[ B_K^\mathrm {m}\right] _\mathrm {APK}&= \left\{ \frac{K(x,\cdot )}{\Vert K(x,\cdot )\Vert _{\mathrm {L}^2(\Omega )}}\ \left| \ |x| \in Z,\ \frac{x}{|x|} \in X^\mathrm {m} \right. \right\} \\ \left[ B_W^\mathrm {m}\right] _\mathrm {APW}&= \left\{ \frac{W(x,\cdot )}{\Vert W(x,\cdot )\Vert _{\mathrm {L}^2(\Omega )}}\ \left| \ |x| \in Z,\ \frac{x}{|x|} \in X^\mathrm {m} \right. \right\} \end{aligned}

such that

\begin{aligned} {\mathcal {D}}^{\mathrm {m}}&= \left[ N^\mathrm {m}\right] _{\mathrm {SH}} \cup \left[ S^\mathrm {m}\right] _{\mathrm {SL}} \cup \left[ B_K^\mathrm {m}\right] _{\mathrm {APK}} \cup \left[ B_W^\mathrm {m}\right] _{\mathrm {APW}} \end{aligned}

with a regularly distributed Reuter grid $$X^\mathrm {m}$$ of 4551 grid points and

\begin{aligned} Z= \{0.75,\ 0.80,\ 0.85,\ 0.89,\ 0.91, 0.93,\ 0.94,\ 0.95,\ 0.96,\ 0.97\}. \end{aligned}

All in all, the manually chosen dictionary contains 95152 trial functions. We undertake this comparison because it is most sensible as we have explained in Michel and Schneider (2020): a comparison with the best dictionary of a sensibly large set of random dictionaries cannot seriously be put into practice due to high memory demand and a long runtime. Note that, in some of the literature on the IPMP algorithms mentioned before, the approaches have been compared to traditional methods like splines which is why we abstain from this here. Further note that, due to the size of the manually chosen dictionary, the respective tests ran on a node with 512 GB RAM and 32 CPUs which is a much higher memory demand than the LIPMP algorithms had.
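The stated size can be retraced by counting the parameter combinations of each class, assuming that every combination is counted as a separate dictionary element:

\begin{aligned} \left| {\mathcal {D}}^{\mathrm {m}}\right|&= \underbrace{26^2}_{\mathrm {SH}} + \underbrace{2\cdot 4\cdot 3\cdot 4\cdot 36}_{\mathrm {SL}} + \underbrace{10\cdot 4551}_{\mathrm {APK}} + \underbrace{10\cdot 4551}_{\mathrm {APW}} \\&= 676 + 3456 + 45510 + 45510 = 95152. \end{aligned}

Here, 10 is the number of radii in Z and 4551 is the number of grid points in $$X^\mathrm {m}$$.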

In Table 3, we see a summary of the results of the experiments. We compare the IPMP algorithms with the manually chosen dictionary (*), the learnt dictionary (**), a learnt dictionary when using the non-stationary regularization parameter $$\lambda _N = \lambda _0 \cdot \Vert y\Vert _{{\mathbb {R}}^\ell }/N$$ for the iteration $$N\in {\mathbb {N}}$$ (“non-stationary learnt”; ***), and a learnt dictionary where only the SHs, APKs and APWs were considered (“learnt-without-Slepian-functions”; ****). We give the regularization parameter, the size of the dictionary, the number of completed iterations, the maximal SH degree included in the dictionary, the relative data error and RMSE at termination and the needed CPU-runtime in hours. Note that the size of the learnt dictionaries is given as a “less than or equal to” value since elements may be contained multiple times. Moreover, we identify the following aspects:

• Due to our termination criteria, the relative data error was at the noise level in all cases.

• In all experiments, the relative RMSE is about the same size. In comparison with Michel and Schneider (2020), we conclude that the IPMP algorithms produce better results if more trial function classes are available. Further, the learnt dictionary yields similar results. In Figs. 5 and 6, we also see that, in all cases, the remaining errors lie within regions of higher local structures, i. e. the Andean region, the Himalayas and the Pacific Ring of Fire in the case of EGM2008 data as well as the Amazon basin in the case of the GRACE data. These detail structures cannot be represented because of the noise and the satellite height (confer Sect. 4.2).

• The non-stationary learnt as well as the learnt-without-Slepian-functions dictionary produce results which are similar to the others regarding the relative RMSE such that these settings could be explored in future research. However, to quantify the influence of the non-stationary regularization parameter on the approximation, the number of tests here is too low.

• The learnt dictionary is less than $$1\%$$ of the size of the manually chosen dictionary.

• The maximal SH degree of the learnt dictionaries is a truly learnt degree.

• For the LRFMP algorithm, the CPU-time needed for learning and applying the learnt dictionary is lower than or similar to the CPU-time needed for applying the manually chosen dictionary. In particular, without the Slepian functions, the needed CPU-time is much smaller.

• For the LROFMP algorithm, there exist settings which have a smaller runtime as well. In particular, for the GRACE data, this is always the case. Similarly, the learnt-without-Slepian-functions dictionary is also learnt in a much shorter time for the EGM2008 data. However, there are two cases for the EGM2008 data where the runtime is higher than for the manually chosen dictionary. This could be caused by the non-stationary regularization parameter, the orthogonality procedure itself and / or the use of the Slepian functions.
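
The non-stationary regularization parameter of experiment (***) can be illustrated by a minimal sketch. It only evaluates $$\lambda _N = \lambda _0 \cdot \Vert y\Vert /N$$ with the Euclidean norm of the data vector; the function name `nonstationary_lambda` and the sample values are assumptions made for illustration and are not taken from the experiments in Table 3.

```python
import math


def nonstationary_lambda(lambda_0, y, iteration):
    """Return lambda_N = lambda_0 * ||y|| / N for the iteration N >= 1.

    lambda_0  : initial regularization parameter (hypothetical value below)
    y         : data vector, given as a sequence of floats
    iteration : iteration index N, starting at 1
    """
    if iteration < 1:
        raise ValueError("iteration must be a positive integer")
    # Euclidean norm of the data vector
    y_norm = math.sqrt(sum(v * v for v in y))
    return lambda_0 * y_norm / iteration


# Illustrative data only: ||y|| = 5, so the parameter decays like 0.5 / N.
y = [3.0, 4.0]
schedule = [nonstationary_lambda(0.1, y, n) for n in (1, 2, 5)]
```

The parameter decays like $$1/N$$, so early iterations are regularized more strongly than later ones.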

It appears that, when a learnt dictionary is used, the ROFMP algorithm is not as superior to the RFMP algorithm as it seemed in previous publications. Thus, if we need to learn a dictionary (despite the LIPMP algorithms being sufficient algorithms themselves), we advise learning a dictionary for the RFMP in the same setting as it will be applied to (experiment (**) in Table 3) since it yields solid and comparable results. Note that the RFMP is overall easier to implement and needs less runtime. Further, the experiments (***) and (****) in Table 3 yield starting points to improve the result or the runtime and, thus, should be borne in mind as well.

## 5 Conclusion and outlook

The downward continuation of the gravitational potential from satellite data is important for many reasons such as monitoring climate change. One approach for this is presented by the IPMP algorithms. They iteratively seek the minimizer of the Tikhonov–Phillips functional and, in this way, obtain a weighted linear combination of dictionary elements as an approximation. For practical use, the IPMP algorithms had to be improved regarding the automation of the dictionary choice, the runtime and the storage demand. For this reason, the novel LIPMP algorithms include an add-on such that an infinite dictionary can be used. Further, a finite dictionary can be learnt as well.
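
The iterative minimization of a Tikhonov-type functional over a dictionary can be sketched in a toy setting. The sketch below is not the implementation used in this paper: it assumes a finite dictionary of plain Euclidean vectors, a Euclidean penalty norm instead of a Sobolev norm, and precomputed images $${\mathcal {T}}d$$ of the dictionary elements. All names (`dot`, `rfmp`, the sample data) are hypothetical. In each step, it picks the element $$d$$ and the coefficient $$\alpha$$ that minimize $$\Vert R - \alpha {\mathcal {T}}d\Vert ^2 + \lambda \Vert F + \alpha d\Vert ^2$$ for the current residual $$R$$ and approximation $$F$$.

```python
def dot(u, v):
    """Euclidean inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def rfmp(y, dictionary, images, lam, iterations):
    """Toy regularized matching pursuit (illustrative assumptions only).

    dictionary[i] : vector representing the trial function d_i
    images[i]     : vector representing T d_i in the data space
    lam           : regularization parameter (Euclidean penalty assumed)
    """
    residual = list(y)
    approx = [0.0] * len(dictionary[0])
    chosen = []
    for _ in range(iterations):
        best = None
        for i, (d, td) in enumerate(zip(dictionary, images)):
            numerator = dot(residual, td) - lam * dot(approx, d)
            denominator = dot(td, td) + lam * dot(d, d)
            # Decrease of the functional if d_i is chosen with optimal alpha
            gain = numerator ** 2 / denominator
            if best is None or gain > best[0]:
                best = (gain, i, numerator / denominator)
        _, i, alpha = best
        approx = [f + alpha * c for f, c in zip(approx, dictionary[i])]
        residual = [r - alpha * t for r, t in zip(residual, images[i])]
        chosen.append((i, alpha))
    return approx, residual, chosen


# Toy operator T = diag(1, 0.5), mimicking decaying singular values.
dictionary = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 0.5]]
y = [2.0, 1.0]
approx, residual, chosen = rfmp(y, dictionary, images, lam=0.0, iterations=2)
```

With `lam=0.0`, the two steps recover the exact solution of this toy problem; with a positive `lam`, the coefficients are damped and the value of the functional decreases in every step.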

Our numerical results in this paper are meant to be a proof of concept. They show that the non-learning IPMP algorithms with a learnt or a manually chosen dictionary as well as the LIPMP algorithms yield good results. However, the LIPMP algorithms have additional advantages in terms of CPU-runtime, storage demand, sparsity and the consequences of the number of different types of trial functions in use. Hence, we suggest that the IPMP algorithms with a manually chosen dictionary may be used if those aspects are not critical because these methods are easier to implement. Otherwise, we advise including the add-on, i. e. using the LIPMP algorithms either for obtaining a learnt dictionary or as a standalone approach. However, the former is probably redundant in the light of the latter. In particular, we prefer the LRFMP algorithm as it was presented here because it has a lower runtime than the LROFMP algorithm and is easier to implement.

Here, we showed only results of a learnt dictionary that is applied to the same data again. An even more interesting use of a learnt dictionary probably arises if we have many similar data sets, as for instance from long-running satellite missions like GRACE. Applying a learnt dictionary to unseen data is, in principle, possible with our approach as well. However, this most likely demands very lengthy tests for suitable regularization parameters. In the light of the LIPMP algorithms as standalone approximation algorithms, it does not seem sensible to put that much effort into learning a finite dictionary for future use on unseen data. An exception might be the case where the computation time is restricted when unseen data arrive but is much less restricted beforehand.

In an ongoing project, we are working on the use of $$5\times 10^6$$ data points, true satellite tracks, observable-related noise and the downward continuation of the gravitational force (i. e. the gradient of the potential) in the LIPMP algorithms, which could show the competitiveness of the algorithms in a more realistic setting. If this actually includes the use of data that is not obtained from SHs, we might then also be able to quantify whether the LIPMP approximations contain any bias. Moreover, we are interested in applying the algorithms to other geoscientific tasks, e. g. traveltime tomography from seismology. As both of these current research aspects inevitably work with big data, we are able to tackle them today only due to the significant improvements regarding storage demand and runtime made by the LIPMP algorithms.