A first approach to learning a best basis for gravitational field modelling

Gravitational field modelling is an important tool for inferring past and present dynamic processes of the Earth. Functions on the sphere, such as the gravitational potential, are usually expanded in terms of either spherical harmonics or radial basis functions (RBFs). The (Regularized) Functional Matching Pursuit ((R)FMP) and its variants use an overcomplete dictionary of diverse trial functions to build a best basis as a sparse subset of the dictionary and compute a model, for instance, of the gravity field, in this best basis. One advantage is, thus, that the dictionary may contain both spherical harmonics and RBFs. Moreover, these methods provide a way to obtain an approximate and stable solution of an ill-posed inverse problem, such as the downward continuation of gravitational data from the satellite orbit to the Earth's surface, but also of other inverse problems in geomathematics and medical imaging. A remaining drawback is that, in practice, the dictionary has to be finite and, so far, could only be chosen by a rule of thumb or by trial and error. In this paper, we develop a strategy for automatically choosing a dictionary by a novel learning approach. We utilize a non-linear constrained optimization problem to determine best-fitting RBFs (Abel-Poisson kernels). For this, we use the Ipopt software package with an HSL subroutine. Details of the algorithm are explained and first numerical results are shown.


Introduction
The gravitational potential is an important observable in the geosciences as it is used as a reference for multiple static and dynamic phenomena of the complex Earth system. The EGM2008 gives us a high-precision model in spherical harmonics, i.e. polynomials, up to degree 2190 and order 2159, see [36,39]. From satellite missions like GRACE or its successor GRACE-FO, we have time-dependent models of the potential, see, for example, [10,35,45,47]. These data enable a visualization of mass transports on the Earth, such as seasonal short-term phenomena like the wet season in the Amazon basin as well as long-term phenomena like climate change. Therefore, gravity field modelling, and especially the downward continuation of satellite data, is one of the most important mathematical problems in physical geodesy, see, for instance, [2,22].
From a mathematical point of view, the gravitational potential F on the approximately spherical Earth's surface can be modelled as a Fourier expansion in a suitable basis, for example in the mentioned spherical harmonics Y_{n,j}, n ∈ N_0, j = −n, ..., n. If we assume the Earth to be the closed unit ball, we obtain, for σ > 1, a pointwise representation of the potential as

(T F)(ση) = Σ_{n=0}^∞ Σ_{j=−n}^{n} σ^{−(n+1)} ⟨F, Y_{n,j}⟩_{L²(Ω)} Y_{n,j}(η)   (1)

for the unit sphere Ω and η ∈ Ω, see, for example, [2,13,34,48]. This gives us the potential in outer space, including a satellite orbit. The inverse problem of the downward continuation of this potential is given as follows: if data values V(ση) = (T F)(ση), σ > 1, are known, determine the function F on Ω. For more details on inverse problems in general, see the classical literature, for example, [5,26,42]. The mathematical challenges that occur in the downward continuation are well known. First of all, the operator T has exponentially decreasing singular values due to σ > 1 in (1). Thus, the inverse operator, which we need for the downward continuation, has exponentially increasing singular values. For this reason, the inverse problem is called exponentially ill-posed. In particular, it violates the third characteristic of a well-posed problem according to Hadamard (continuous dependence on the data). Furthermore, the existence of F is only ensured if V is in the range of T. However, if F exists, then it is unique. Therefore, sophisticated algorithms need to be used to solve the problem of the downward continuation of satellite data. Previous studies showed that the (Regularized) Functional Matching Pursuit ((R)FMP), the (Regularized) Orthogonal Functional Matching Pursuit ((R)OFMP) as well as the latest (Regularized) Weak Functional Matching Pursuit ((R)WFMP) are possible approaches for this and other inverse problems, see, for instance, [3,6,7,8,9,17,19,20,21,29,30,31,32,48].
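The effect of the exponentially decaying singular values can be made concrete in a few lines. The following sketch (ours, not from the paper) treats T as a diagonal operator on spherical harmonic coefficients, as in (1), and shows how a naive inversion amplifies data noise; the orbit height σ and the noise level are illustrative assumptions.

```python
import numpy as np

# T acts diagonally on spherical harmonic coefficients: degree n is
# damped by sigma^-(n+1). Inverting T therefore amplifies a degree-n
# data error by sigma^(n+1), which grows exponentially in n.
sigma = 1.1                      # assumed relative satellite height
degrees = np.arange(0, 61)

singular_values = sigma ** (-(degrees + 1.0))   # decay of T
amplification = 1.0 / singular_values           # growth of the inverse

noise_level = 1e-6               # hypothetical relative data noise
propagated_noise = noise_level * amplification  # error after naive inversion

# the error explodes with the degree, so regularization is mandatory
assert np.all(np.diff(propagated_noise) > 0)
print(f"noise amplification at degree 60: {amplification[-1]:.2e}")
```

Already at degree 60, the amplification factor is of the order of several hundreds for this mild σ, which illustrates why a regularization strategy is indispensable.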
In the sequel, we will write Inverse Problem Matching Pursuit (IPMP) if we refer to any of the mentioned algorithms. Although the core routine of these algorithms is well established by now, there are still possibilities to improve their performance.
One of these possibilities is given due to the following circumstances. The IPMPs are based on a finite dictionary D of suitable trial functions from which they build a best basis and, eventually, the approximate solution in terms of this best basis. Originally, matching pursuits utilized a dictionary consisting of vectors from finite-dimensional spaces. The first matching pursuit was developed by S.G. Mallat and Z. Zhang (1993). The ROFMP is additionally based on works of P. Vincent and Y. Bengio (2002) and Y.C. Pati et al. (1993). The RWFMP inherits ideas from V.N. Temlyakov (2000). The idea of using dictionaries instead of finding a representation of a signal in an a-priori given basis can be summed up as follows [27]: human language gives us nearly infinite possibilities to describe the very same thing in the real world. However, these descriptions vary in length and rigour. This idea can be transferred to mathematics, for instance, to gravity field modelling. We can model the gravitational potential in spherical harmonics as given in (1). However, if we look for a model in a best basis from a dictionary D, the representation of the signal might be sparser and/or more precise. In particular, the reduction to those basis functions which are essential increases the interpretability of the obtained model. Further, numerical experiments showed that the obtained solution is more accurate and stable. These aims can be achieved by the IPMPs.
Further characteristics of the IPMPs can be summed up from previous publications as follows: they represent a regularization for ill-posed inverse problems; they can combine different kinds of trial functions, e.g. global and localized ones, in their solution; they can be used for pure interpolation / approximation as well as for linear and non-linear ill-posed inverse problems; they work with single-source data and are also capable of a joint inversion of multiple-source data; the data can be given on different geometries, like a sphere, a ball, or an interval of the real line; they yield an approximating function and not a discretized approximation; they build this solution iteratively without the need to invert a matrix or solve a large linear system of equations; the orthogonal variant yields a linear combination of orthogonal trial functions; the runtime can be improved by preprocessing certain data; they can be combined with a weak strategy to cut runtime without significant loss of accuracy; and their implementation is easy and can be parallelized very well.
In practical tests for diverse applications, very good approximations could be achieved not only for the downward continuation but also for other ill-posed inverse problems, for instance, the (linear as well as non-linear) inverse gravimetric problem or the inversion of MEG and EEG data, see, for instance, [6,7,19,21,23,24].
However, the experiments also revealed a sensitivity of the result regarding the choice of the dictionary, for example concerning the runtime and the convergence behaviour. Therefore, the main focus of this paper is on a first dictionary learning strategy for the downward continuation of gravitational potential data.
Previous works on dictionary learning considered discretized approximation problems. In this case, the dictionary can be interpreted as a matrix. The approaches aimed to obtain a solution of the approximation problem and a sparse dictionary matrix simultaneously. For more details, see, for example, [4,40,43].
However, a particular feature of the IPMPs is that their solution is a linear combination of established trial functions. We neither want to discretize the dictionary elements, i.e. the trial functions, nor do we want to modify them. In the latter case, the comparability with traditional models in these trial functions would be lost. Furthermore, with the use of scaling functions and/or wavelets in the dictionary, the IPMP generates a solution in a multiscale basis. This allows a multiresolution analysis of the obtained model, revealing hidden local detail structures, as was shown in, for example, [6,7,8,9,31,48]. Moreover, we do not only consider interpolation / approximation problems, but also ill-posed inverse problems. Thus, a dictionary matrix would not contain the basis elements themselves, but, for example, their upward continued values. Applying previous strategies like MOD or K-SVD would only alter the upward continued values and leave us with the question of how to downward continue them. All in all, this shows that learning a dictionary for the IPMPs requires the development of a different strategy.
For a first approach to learning a dictionary, we concentrate in this paper on the RFMP as the basic IPMP. For this algorithm, we develop a procedure to determine a best basis for the gravitational potential from different types of infinitely many trial functions. We choose to learn dictionary elements from spherical harmonics and Abel-Poisson kernels as radial basis functions. In particular, while previously a discrete grid of centres of the RBFs had to be chosen a-priori, which could have biased the obtained numerical result, we now allow every point on the unit sphere to be a centre of an RBF. Equally, the localization parameter of the Abel-Poisson kernel is now determined from an interval instead of a finite set. Our continuous, i.e. non-discrete, learning ansatz produces a 'best dictionary' with which the RFMP can be run. We call this procedure the Learning (Regularized) Functional Matching Pursuit (L(R)FMP). The results show that the use of a learnt dictionary in the RFMP gives us a higher sparsity and better results with less storage demand.

This paper is structured as follows. On the way to a detailed description of our learning approach, we define some fundamental basics in Section 2. We introduce the trial functions under investigation as well as a general form of a dictionary and the basic principles of the RFMP. With these aspects explained, we state our learning strategy in Section 3. We motivate its idea and explain how it is embedded into the established theory of dictionary learning. Then we define its routine and give the necessary derivatives. These are derived in Appendix A. We end Section 3 by introducing some additional learning features that guide the learning process positively in practice. In Section 4, we describe experiments for which we learn a dictionary and compare the results of the learnt dictionary with the results which a manually chosen dictionary yields.
Finally, we conclude this paper in Section 5 with an outlook on how we want to further develop this first learning approach.
2 Some mathematical tools for learning a dictionary

Trial functions under consideration for dictionaries
First of all, to develop a learning strategy, we have to define which trial functions we want to consider as possible dictionary elements, i.e. what the learnt dictionary shall consist of. Our aim is to determine which well-known trial functions are most suitable for a problem at hand. For gravity field modelling, it is sensible to determine suitable spherical harmonics as well as Abel-Poisson kernels. Examples of these trial functions are given in Figure 1. Spherical harmonics, or fully normalized spherical harmonics for practical purposes, are global trial functions, see, for instance, [12,14,15,28,33]. An example is given on the left-hand side of Figure 1. They are defined for a unit vector ξ ∈ Ω as

Y_{n,j}(ξ(ϕ, t)) := sqrt( (2n + 1)/(4π) · (n − |j|)!/(n + |j|)! ) P_{n,|j|}(t) · { √2 cos(jϕ), j < 0;  1, j = 0;  √2 sin(jϕ), j > 0 },   (2)

where ξ(ϕ, t) is the representation of ξ ∈ Ω in polar coordinates (ϕ, t), where t = cos ϑ and ϑ is the polar distance (co-latitude). Further, the definition uses the associated Legendre functions given by

P_{n,j}(t) := (1 − t²)^{j/2} (d/dt)^j P_n(t),   (3)

where P_n denotes the n-th Legendre polynomial. Abel-Poisson kernels are defined for a particular unit vector ξ ∈ Ω and a scaling parameter h ∈ [0, 1) as (with x = hξ)

P(x, η) := (1/(4π)) · (1 − |x|²) / (1 + |x|² − 2x · η)^{3/2}   (4)

for any unit vector η ∈ Ω, see, for example, [15, pp. 108-112] or [14, p. 103 and 441]. These kernels are more localized than polynomials, as one can see on the right-hand side of Figure 1. It is visible that they are radial basis functions, that is, they have one maximum whose descent depends on the distance to the centre ξ = x/|x| of the extremum. In that way, they are zonal functions and can be viewed as 'hat' functions. Dependent on the parameter h = |x|, the 'hat' varies in size. Thus, the functions have different scales of localization. For more details and examples, see, for instance, [15, p. 111] or [28, p. 117].
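For illustration, the closed form of the Abel-Poisson kernel is straightforward to evaluate numerically. The following sketch (our code; the chosen centre and scale are arbitrary examples) confirms the 'hat' behaviour: the kernel peaks at η = ξ and decays away from the centre.

```python
import numpy as np

# Evaluate the Abel-Poisson kernel in its closed form
#   P(x, eta) = (1/(4*pi)) * (1 - |x|^2) / (1 + |x|^2 - 2 x.eta)^(3/2)
# with x = h*xi inside the open unit ball and eta on the unit sphere.
def abel_poisson(x, eta):
    x, eta = np.asarray(x, float), np.asarray(eta, float)
    r2 = np.dot(x, x)
    assert r2 < 1.0, "kernel is only defined inside the unit ball"
    return (1.0 - r2) / (4.0 * np.pi * (1.0 + r2 - 2.0 * np.dot(x, eta)) ** 1.5)

xi = np.array([0.0, 0.0, 1.0])   # centre direction of the kernel
h = 0.9                          # localization parameter (close to 1: peaked)
peak = abel_poisson(h * xi, xi)                          # at the centre
off = abel_poisson(h * xi, np.array([0.0, 1.0, 0.0]))    # 90 degrees away
assert peak > off                # the 'hat' decays away from its centre
```

Increasing h towards 1 sharpens the hat, while h close to 0 yields an almost constant function, in line with the different scales of localization mentioned above.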
In this paper, we consider dictionaries consisting of spherical harmonics and Abel-Poisson kernels. We introduce here a notation for building blocks of spherical dictionaries.

Definition 1. Let N ⊆ {(n, j) | n ∈ N_0, j = −n, ..., n} be a set of spherical harmonic degrees and orders and let K ⊂ B_1(0) := {x ∈ R³ | |x| < 1} be a set of admissible parameters rξ, r ∈ [0, 1), ξ ∈ Ω. We set

[N]_SH := {Y_{n,j} | (n, j) ∈ N}

for spherical harmonics Y_{n,j} and

[K]_APK := {P(rξ, ·) | rξ ∈ K}

for Abel-Poisson kernels P(rξ, ·). We define a dictionary as

D := [N]_SH ∪ [K]_APK.

We call [·]_* a trial function class.
Note that N and K may be finite or infinite.

Basic principles of linear ill-posed inverse problems
This subsection is mainly based on [5,19,28,42]. First of all, we recall the definition of a linear inverse problem. As we explained in the introduction, the problem of the downward continuation of satellite data for gravity field modelling is a linear ill-posed inverse problem. As most inverse problems from practice are ill-posed, there exists a large body of theory on how to solve these problems nonetheless. In the sequel, we will use the approach by Tikhonov.
Its idea is to find the best approximate solution instead of the true solution. The ill-posedness is treated with an additional penalty term. The best approximate solution is the minimizer of the Tikhonov functional for a vanishing regularization parameter. The functional is given by

‖T F − y‖²_Y + λ ‖F‖²_X   (5)

for a linear operator T : X → Y between Hilbert spaces, data y ∈ Y and a regularization parameter λ > 0.
In the case of satellite data, we set Y = R^ℓ, i.e. we are only given data at ℓ discrete points related to the unit sphere. For the space X, we propose to use a Sobolev space. One reason for this choice is that such a space enforces more smoothness of the solution than, e.g., the L²(Ω)-space. This has proven to yield better results. Further, it was shown that the IPMPs also profit theoretically from the use of this space, see [19]. Specifically, we will use the Sobolev space H_2.

Definition 5. On the set

H((n + 0.5)²; Ω) := { F ∈ L²(Ω) | Σ_{n=0}^∞ Σ_{j=−n}^{n} (n + 0.5)^4 ⟨F, Y_{n,j}⟩²_{L²(Ω)} < ∞ },

we define an inner product via

⟨F, G⟩_{H2} := Σ_{n=0}^∞ Σ_{j=−n}^{n} (n + 0.5)^4 ⟨F, Y_{n,j}⟩_{L²(Ω)} ⟨G, Y_{n,j}⟩_{L²(Ω)}.

The completion of H((n + 0.5)²; Ω) with respect to ⟨·, ·⟩_{H2} is called the Sobolev space H_2.
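For band-limited functions, the H_2 inner product reduces to a finite weighted sum over spherical harmonic coefficients. A minimal sketch (ours; the coefficient layout is an assumption for illustration):

```python
import numpy as np

# squared H_2 norm of a band-limited function from its spherical harmonic
# coefficients; coeffs[n] holds the 2n+1 coefficients of degree n
def h2_norm_squared(coeffs):
    return sum((n + 0.5) ** 4 * np.sum(np.asarray(c, float) ** 2)
               for n, c in enumerate(coeffs))

coeffs = [[1.0], [0.0, 2.0, 0.0]]   # degrees 0 and 1
value = h2_norm_squared(coeffs)     # = 0.5^4 * 1 + 1.5^4 * 4
```

With the weights (n + 0.5)^4, high degrees are penalized strongly, which is exactly the smoothing effect mentioned above.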
Next, for practical purposes, we give a short overview of the main principles of the RFMP algorithm, which is a regularization method for linear ill-posed problems. For more details on any IPMP, see, for example, [6,7,8,9,19,20,21,30,31,32,48]. For theoretical discussions of the IPMPs, we refer to this literature. Here, we will concentrate on the practical aspects of the RFMP.
The underlying idea of this matching pursuit is to build a solution as a linear combination of dictionary elements by iteratively minimizing a Tikhonov functional. In theory, we consider the linear inverse problem T F = V, for instance, as given in (1). In practice, we consider the particular case Y = R^ℓ. Then the problem is formulated as follows. We have a relative satellite height σ > 1, a set of grid points {η^{(i)}}_{i=1,...,ℓ} ⊂ Ω and data values y_i for each grid point η^{(i)}. The operator T is exchanged by a finite system of related functionals ℸ_i for which ℸ_i F = (T F)(ση^{(i)}) = y_i holds for i = 1, ..., ℓ. We use the Hebrew letter Dalet (ℸ) to emphasize that the functionals (ℸ_i)_{i=1,...,ℓ} represent a discretization of the operator T. Summarized, we consider the linear inverse problem ℸF = y for the operator ℸ := (ℸ_i)_{i=1,...,ℓ} and y ∈ R^ℓ.
Further, we assume that F ∈ H_2. A regularization parameter is denoted by λ. Additionally, we need an a-priori defined dictionary D as given in Definition 1. Then the aim of the RFMP is to iteratively minimize the Tikhonov functional

‖y − ℸ(F_n + αd)‖²_{R^ℓ} + λ ‖F_n + αd‖²_{H2}   (6)

for an element d ∈ D of the dictionary, a real coefficient α and a current approximation F_n. In practice, this means we start with an initial approximation F_0, e.g. F_0 ≡ 0, and iteratively determine F_{n+1}:

F_{n+1} := F_n + α_{n+1} d_{n+1}, where (α_{n+1}, d_{n+1}) minimizes (6).

It can be shown, see, for example, [6,29,31], that the minimization with respect to α and d of the Tikhonov functional (6) is equivalent to a maximization of a certain quotient with respect to d. For the RFMP, this quotient is given by

( ⟨R^n, ℸd⟩_{R^ℓ} − λ ⟨F_n, d⟩_{H2} )² / ( ‖ℸd‖²_{R^ℓ} + λ ‖d‖²_{H2} )

in the n-th iteration step, where R^n := y − ℸF_n is the residual.
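For a finite dictionary, one step of such a pursuit then reduces to an argmax over this quotient followed by a residual update. A schematic sketch in our own notation, assuming the upward continued values ℸd, the inner products ⟨F_n, d⟩_{H2} and the norms ‖d‖²_{H2} are precomputed for every dictionary element:

```python
import numpy as np

# One greedy RFMP-type step: pick the dictionary element d maximizing
#   ( <R^n, Td> - lam * <F_n, d>_H2 )^2 / ( ||Td||^2 + lam * ||d||_H2^2 )
# and update the coefficient and the residual accordingly.
def rfmp_step(residual, Td, inner_Fn_d, norm_d_sq, lam):
    # Td: (num_dict, ell) upward continued dictionary elements
    # inner_Fn_d: (num_dict,) values <F_n, d>_H2
    # norm_d_sq: (num_dict,) values ||d||_H2^2
    numerator = (Td @ residual - lam * inner_Fn_d) ** 2
    denominator = np.einsum('ij,ij->i', Td, Td) + lam * norm_d_sq
    k = int(np.argmax(numerator / denominator))
    alpha = (Td[k] @ residual - lam * inner_Fn_d[k]) / denominator[k]
    return k, alpha, residual - alpha * Td[k]

# toy usage with random stand-in data (for illustration only)
rng = np.random.default_rng(0)
Td = rng.standard_normal((5, 20))
y = rng.standard_normal(20)
k, alpha, r = rfmp_step(y, Td, np.zeros(5), np.ones(5), lam=0.1)
assert np.linalg.norm(r) < np.linalg.norm(y)   # the data misfit decreases
```

The coefficient formula is exactly the minimizer of (6) in α for fixed d, so each step decreases the Tikhonov functional.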

The learning approach
In this section, we refer to a linear inverse problem ℸF = y for y ∈ R^ℓ as described above. The term F_n represents the current approximation of the RFMP at iteration step n. A dictionary element is denoted by d, spherical harmonics by Y_{n,j} and Abel-Poisson kernels by P(x, ·), as we introduced them in the last section.

About learning dictionaries for inverse problems
We motivate our learning algorithm as follows. We consider an infinite set of trial functions from which we want to learn dictionary elements in the LRFMP. We set

D_inf := [N]_SH ∪ [B_1(0)]_APK

and, in analogy to the RFMP, greedily choose

(α_{n+1}, d_{n+1}) := argmin_{α ∈ R, d ∈ D_inf} ( ‖y − ℸ(F_n + αd)‖²_{R^ℓ} + λ ‖F_n + αd‖²_{H2} ).   (7)

If we chose the dictionary element d_i and its coefficient α_i in this greedy manner in each iteration step, in a perfect world and for n → ∞, then this would be equal to

F_∞ := Σ_{i=1}^∞ α_i d_i = argmin_{F ∈ span(D_inf)} ( ‖y − ℸF‖²_{R^ℓ} + λ ‖F‖²_{H2} ).

Unfortunately, this is practically impossible to solve for several reasons. First of all, the problem is of the type of the travelling salesman problem, which is known to be NP-hard, see, for instance, [16, p. 114]. Thus, we simply cannot be sure that picking trial functions in a greedy manner also yields a true greedy algorithm and, thus, the optimal solution F_∞. Secondly, in practice, we can only compute a finite linear combination, i.e. an optimal best-n-term approximation if at all. And thirdly, if we work with infinitely many trial functions, we cannot preprocess any data. Computing everything needed on the fly, however, is expensive.
However, if we consider the RFMP, we know that its solution is a good approximation of

argmin_{F ∈ span(D)} ( ‖y − ℸF‖²_{R^ℓ} + λ ‖F‖²_{H2} )

for a finite dictionary D. Note that the approximation is also an n-term approximation. That means the structure of the RFMP yields a good approximation of the best-n-term solution dependent on a fixed finite dictionary. Therefore, if we extend the RFMP to the infinite set of functions D_inf in the way defined in (7), we are able to approximate a solution F_n of

min_{F ∈ span(D_inf)} ( ‖y − ℸF‖²_{R^ℓ} + λ ‖F‖²_{H2} ).   (8)

This is the main objective of the LRFMP in practice. How can we learn a finite dictionary D from this solution? If we have approximated the solution by F_n, we know which basis elements we need for this approximation. In this way, we know which dictionary elements should at least be contained in the learnt finite dictionary D for its use in the RFMP. Therefore, a learnt dictionary should contain at least these n elements. More generally, we can set an upper bound D_max > n for the size of the learnt dictionary D. Furthermore, the approximation F_n should be in the span of the learnt dictionary D such that the solution of (8) can be reproduced. Moreover, the dictionary D naturally needs to be a subset of D_inf. Taking all things into consideration, we can write the objective of the dictionary learning process as

min_{D ⊆ D_inf, |D| ≤ D_max} min_{F ∈ span(D)} ( ‖y − ℸF‖²_{R^ℓ} + λ ‖F‖²_{H2} ).

This is a doubled minimization problem for a predefined dictionary size D_max. From another point of view, the task it presents is to find a dictionary and build an approximation from its elements that minimize the Tikhonov functional simultaneously. A doubled minimization problem is a common approach in the field of dictionary learning, see, for instance, [1,4,40,43]. Further, referring to the criteria for dictionary learning from Aharon et al. (2006), this can be thought of as a well-defined goal of the learning approach.
Thus, the starting point for our learning algorithm is (7). We choose this generalization for the learning approach because we want to learn the dictionary itself as well as maintain the unique characteristic of the RFMP of combining various but established trial functions. Moreover, for this minimization, we choose to follow the structure of the RFMP for two reasons. First of all, as we pointed out above, the RFMP is built to solve the same minimization problem, but over a finite set of functions. Secondly, we search for a well-working dictionary for the RFMP. It is intuitive to expect better results when applying the learnt dictionary if we include the behaviour of the RFMP in the learning process. In this way, we use as much information and structure as we have in order to obtain an efficient learning routine. Hence, the idea of our learning algorithm, the LRFMP, is to iteratively minimize a Tikhonov functional over an infinite set of trial functions, similarly to the RFMP, in order to model the solution in a best basis. The chosen basis elements form the learnt dictionary, which can be applied in runs of the RFMP.

Figure 2: Schematic representation of the basic learning algorithm.

A first learning algorithm
An overview of the structure of the learning algorithm is shown in Figure 2. We start in the red circle ('start'), where we initialize the LRFMP similarly to the initialization which the RFMP needs. This means the initialization includes the necessary preprocessing and setting of parameters (similarly as described, for example, in [48]) as well as setting parameters for the learning. The latter learning parameters include, most importantly, a starting dictionary and smoothing properties. Then we step into the first iteration, in which we want to minimize the Tikhonov functional in order to find d_1. As we also want to learn a dictionary, the steps up to choosing d_1 differ from the established RFMP: we choose d_1 from (in the case of the RBFs) infinitely many trial functions instead of a finite a-priori selection of trial functions. This is done by first computing a candidate for d_1 within each trial function class we consider. In Figure 2, this is shown by the boxes 'spherical harmonics' and 'radial basis functions' which lead to 'set of candidates'. Mathematically, in this step, we seek

d^C_1 := argmin_{d ∈ C, α ∈ R} ( ‖y − ℸ(F_0 + αd)‖²_{R^ℓ} + λ ‖F_0 + αd‖²_{H2} ),

where C denotes one trial function class, i.e. C = [N]_SH the set of spherical harmonics or C = [B_1(0)]_APK the set of Abel-Poisson kernels. Then we have again a finite (but optimized) set of trial functions and can choose d_1 from this set of candidates in the common fashion of the RFMP by comparing how well each one minimizes the Tikhonov functional. The candidate that minimizes the Tikhonov functional among all candidates is chosen as d_1 ('choose best candidate as d_{n+1}'). Then we compute the necessary updates of the RFMP routine as described, for example, in [48]. Next, we check the termination criteria for the learning algorithm. We adopt the termination criteria of the RFMP, which are, up to now, the norm of the residual or the size of the currently chosen coefficient (this would be α_1 at this stage) being smaller than a given threshold, or a maximal number of iterations.
If they are not fulfilled, we search for the next element d_2 in the same manner as we found d_1. Otherwise, a termination criterion is fulfilled. In this case, we stop the RFMP and, thus, the learning of the dictionary. We obtain the same output as in a non-learning RFMP, which is an approximation of the given signal. Additionally, the learnt dictionary is defined as the set of all elements chosen in this approximation. For the sake of completeness, note that, for an arbitrary iteration step n, the objective to seek α_{n+1} and d_{n+1} is given by

d_{n+1} := argmin_{d ∈ D_inf} min_{α ∈ R} ( ‖y − ℸ(F_n + αd)‖²_{R^ℓ} + λ ‖F_n + αd‖²_{H2} )

and

α_{n+1} := argmin_{α ∈ R} ( ‖y − ℸ(F_n + αd_{n+1})‖²_{R^ℓ} + λ ‖F_n + αd_{n+1}‖²_{H2} ),

respectively. Before we can state the learning algorithm itself, we first have to define an objective function for the determination of the candidates.

Definition 6. For the Sobolev space H_2, we define the objective function of the RFMP in the n-th iteration step as

RFMP(d; n) := ( ⟨R^n, ℸd⟩_{R^ℓ} − λ ⟨F_n, d⟩_{H2} )² / ( ‖ℸd‖²_{R^ℓ} + λ ‖d‖²_{H2} ),   (9)

where d is a trial function, R^n is the current residual, F_n the current approximation and ℸ, λ depend on the linear inverse problem.
Theorem 7. The minimization of the Tikhonov functional in the n-th step of the RFMP with respect to a trial function d and a real coefficient α as seen in (5) is equivalent to the maximization of RFMP(·; n) with respect to a trial function d.
Proof. The proof is analogous to the corresponding proofs in [6,29].
All in all, we can now state the learning algorithm. We will explain the steps of this algorithm in further detail below. Note that we determine a preliminarily optimal Abel-Poisson kernel from a discretely parametrized dictionary [K]_APK and use it as a starting point for the optimization procedure of a continuously parametrized ansatz which uses [B_1(0)]_APK.

Algorithm 8. We obtain a learnt dictionary for the RFMP as follows. Let ℸF = y be the linear inverse problem under investigation, H_2 the Sobolev space from Definition 5 and λ the regularization parameter.
(S0) initialize: at least one termination criterion (maximal number of iterations I and/or prescribed accuracy of the relative data error and/or minimal size of the coefficients); data vector y; initial approximation F_0; sets N as well as K as in Definition 1.
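The loop described above and summarized in Algorithm 8 can be sketched as follows. This is a strongly simplified, runnable toy (our code): trial 'functions' are plain vectors, λ = 0, and each class-wise candidate search is a finite argmax, whereas the actual LRFMP optimizes the RBF candidate continuously.

```python
import numpy as np

# Toy sketch of the LRFMP loop from Figure 2: per iteration, compute one
# candidate per trial function class, take the best candidate as the next
# dictionary element, update the residual, and check the termination criteria.
def lrfmp(y, classes, max_iter=50, tol=1e-8):
    residual, expansion, dictionary = y.copy(), [], []
    quotient = lambda d: (d @ residual) ** 2 / (d @ d)  # unregularized objective
    for _ in range(max_iter):
        candidates = [max(cls, key=quotient) for cls in classes]
        d = max(candidates, key=quotient)      # 'choose best candidate'
        alpha = (d @ residual) / (d @ d)       # optimal coefficient
        if abs(alpha) < tol:                   # termination criterion
            break
        residual = residual - alpha * d        # RFMP update step
        expansion.append((alpha, d))
        dictionary.append(d)                   # the learnt dictionary grows
    return expansion, dictionary, residual

rng = np.random.default_rng(1)
classes = [list(rng.standard_normal((4, 10))),   # stand-in 'spherical harmonics'
           list(rng.standard_normal((3, 10)))]   # stand-in 'radial basis functions'
y = rng.standard_normal(10)
_, learnt, r = lrfmp(y, classes, max_iter=20)
assert np.linalg.norm(r) < np.linalg.norm(y)
```

The output mirrors the description in the text: an approximation of the signal plus the set of chosen elements, which constitutes the learnt dictionary.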

Determination of candidates
In practice, the question remains how the candidates in each trial function class under consideration are determined. Hence, we need to explain what is done in S3.1 to S3.3.
First of all, we consider the determination of a spherical harmonic candidate. We want to seek the best-fitting function among all spherical harmonics up to a certain degree ν ∈ N, where the size of ν has to be learnt. Algorithm 8 defines which specific spherical harmonics with a degree up to ν we insert into the learnt dictionary.
The idea is to allow the choice of spherical harmonics up to a degree N ∈ N (i.e. N = {(n, j) | n ∈ N_0, n ≤ N, j = −n, ..., n}) which will probably not be exhausted in practice. For example, the data resolution can provide a threshold up to which degree a resolution appears to be realistic (as is also done for other gravitational models like the EGM or models based on CHAMP, GRACE and GOCE data). If the LRFMP chooses only spherical harmonics with a lower degree ν, we have a truly learnt bound ν < N. However, note that the higher we choose N, the more expensive the preprocessing of the algorithm becomes. The candidate d^SH_{n+1} which the LRFMP chooses in each iteration step can be determined as in the RFMP by comparing RFMP(Y_{m,k}; n) for all spherical harmonics up to degree N.
Unfortunately, for meaningful results with respect to learning spherical harmonics, we lack appropriate data sets. Both the EGM2008 and the GRACE data are given to us only as coefficients of a representation in spherical harmonics. Thus, if we allow the LRFMP to choose an optimal spherical harmonic with an arbitrarily high degree, the algorithm naturally tends to choose more spherical harmonics than other trial functions. If we were able to work with data not given in this polynomial representation, we would expect this effect to diminish.
Next, we consider the determination of the candidate d^APK_{n+1} from the Abel-Poisson kernels in S3.3. In this case, the minimization of the Tikhonov functional is modelled as a non-linear constrained optimization problem. The solution of this problem yields the respective candidate. Note that we do not seek the minimizer of a function, but the minimizer of a functional among a set of functions. Therefore, we have to define the trial functions dependent on their characteristics, as we did in (4).
The model of the optimization problem is given as follows. Due to Theorem 7, in the n-th step of the LRFMP, we consider the optimization problem RFMP(P(x, ·); n) → max! for learning a dictionary for the RFMP. The optimization with respect to a trial function d can be modelled by an optimization with respect to the characteristics of each trial function. However, these characteristics yield a constraint for the optimization problem. Abel-Poisson kernels are given as K_h(ξ · η) = P(hξ, η), h ∈ [0, 1) and ξ ∈ Ω, in (4), see, for example, [11, p. 132] or [13, p. 52]. Here, ξ is the centre of the radial basis function and h is the parameter which controls the localization. Therefore, the kernels are well-defined only in the interior of the unit ball and the constraint is given by ‖x‖²_{R³} < 1 for x = hξ.

Definition 9. The optimal candidate d^APK_{n+1} among the set of Abel-Poisson kernels is given by the solution of the optimization problem

max_{x ∈ R³} RFMP(P(x, ·); n)  subject to  ‖x‖²_{R³} < 1.

Note that the maximizer is not necessarily unique. In this case, we use one representative among the maximizers.
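As a toy illustration of this constrained candidate search, the following sketch (our code) maximizes the simplified, unregularized quotient ⟨R^n, ℸd⟩²/‖ℸd‖² over x subject to ‖x‖² < 1, using scipy's SLSQP as a stand-in for Ipopt; the data grid, the residual values and the bound 0.99 are illustrative assumptions. For the upward continued kernel we use the identity (T P(x, ·))(ση) = P(x/σ, η)/σ, which follows from (1) and (4).

```python
import numpy as np
from scipy.optimize import minimize

sigma = 1.05                                     # assumed satellite height
eta = np.array([[0., 0., 1.], [0., 1., 0.],      # assumed data grid on the sphere
                [1., 0., 0.], [0., 0., -1.]])
residual = np.array([1.0, 0.2, 0.1, 0.0])        # assumed current residual R^n

def upward_kernel(x):
    # closed form of P(x/sigma, eta)/sigma on all grid points at once
    xs = x / sigma
    r2 = xs @ xs
    return (1 - r2) / (4 * np.pi * sigma * (1 + r2 - 2 * eta @ xs) ** 1.5)

def neg_quotient(x):                             # -( <R^n, Td>^2 / ||Td||^2 )
    td = upward_kernel(x)
    return -(residual @ td) ** 2 / (td @ td)

constraint = {"type": "ineq", "fun": lambda x: 0.99 - x @ x}  # ||x||^2 < 1
result = minimize(neg_quotient, x0=np.array([0., 0., 0.5]),
                  method="SLSQP", constraints=[constraint])
x_opt = result.x                  # centre direction and scale of the candidate
assert x_opt @ x_opt <= 0.99 + 1e-6
```

Since the residual is concentrated near the north pole in this toy setup, the optimizer moves the kernel parameter x towards that direction, i.e. towards a kernel centred where the data misfit is largest.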
We prefer a gradient-based approach. Thus, we have to compute the derivatives of (9) with respect to x ∈ R³. In general, this can be done by applying the quotient rule and computing the derivatives of the inner products and norms in the numerator and denominator separately. We state the results of this derivation at this point. A detailed derivation is given in Appendix A.
Theorem 10. We define some abbreviation terms and state their derivatives. Let R^n be the current residual of size ℓ and F_n the current approximation, i.e. F_n = Σ_{i=1}^n α_i d_i for d_i being a spherical harmonic Y_{n_i,j_i} or an Abel-Poisson kernel P(x^{(i)}, ·). Moreover, let T be the upward continuation operator and σ the respective satellite orbit. Further, let the data be given on a point grid {ση^{(i)}}_{i=1,...,ℓ}, η^{(i)} ∈ Ω. We consider the Tikhonov functional with a penalty term dependent on the norm of the Sobolev space H_2. At last, P_n denotes a Legendre polynomial and ε^r, ε^ϕ, ε^t represent the common local orthonormal basis vectors (up, East and North) in R³, see (20). Then we have for an Abel-Poisson kernel P(x, ·) and x(r, ϕ_x, t_x) ∈ B̄_1(0) the terms as well as

b_2(P(x, ·)) := ‖P(x, ·)‖²_{H2} = Σ_{n=0}^∞ (2n + 1)/(4π) (n + 0.5)^4 |x|^{2n}.

With respect to the derivative of a_2, we have a further representation. For a proof, see Appendix A.
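The series for b_2 converges geometrically for |x| < 1 and can therefore be evaluated by truncation. A small sketch (ours; the truncation degree is an assumption chosen generously):

```python
import numpy as np

# b_2 from Theorem 10: squared H_2 norm of the Abel-Poisson kernel P(x, .),
# evaluated by truncating sum_{n>=0} (2n+1)/(4*pi) * (n+0.5)^4 * |x|^(2n);
# for |x| < 1 the terms decay geometrically, so a moderate n_max suffices.
def b2(x, n_max=500):
    n = np.arange(n_max + 1)
    r2 = np.dot(x, x)
    return np.sum((2 * n + 1) / (4 * np.pi) * (n + 0.5) ** 4 * r2 ** n)

# the more localized the kernel (larger |x|), the larger its Sobolev norm
assert b2(0.8 * np.array([0., 0., 1.])) > b2(0.5 * np.array([0., 0., 1.])) > 0
```

This monotonicity is exactly why the penalty term discourages overly peaked kernels with |x| close to 1.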
Theorem 11. With the abbreviations and derivatives from Theorem 10, the partial derivatives ∂_{x_j}, j = 1, 2, 3, of RFMP(·; n) are given by applying the quotient rule to these terms.

Proof. We only apply the common rules for derivatives.
Thus, for Abel-Poisson kernels, we have determined the gradient of the modelled objective functions RFMP(·; n), n ∈ N, analytically, such that we are able to use a gradient-based optimization method. We use the primal-dual interior-point filter line-search method Ipopt.
To enable parallelization, we installed the linear solver ma97 from the HSL package. For more details on the Ipopt algorithm and the HSL package, see [18,37,50,52,53,54]. In practice, we set a few options manually, which we explain in the necessary contexts later. In all other cases, we use the default option values. However, note that, according to [50], we can only expect to obtain local solutions of the optimization problem. Furthermore, this algorithm uses a starting point to start its procedure towards an optimal Abel-Poisson kernel. This starting point is denoted by d^{APK,loc}_{n+1} in Algorithm 8 and is computed in S3.2. We obtain d^{APK,loc}_{n+1} by comparing the objective value RFMP(P(x, ·); n) for a selection of kernels given to the algorithm in the discrete dictionary [K]_APK and choosing the one with the highest value. As we have computed this kernel anyway, we also make use of it in S3.4 in case the optimization fails to find a better kernel.
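The warm start computed in S3.2 is, in essence, a brute-force search over the discrete dictionary [K]_APK. A schematic sketch (our code; the product grid of centres and scales and the objective stand-in are illustrative):

```python
import numpy as np

# warm start for the continuous optimization: evaluate the objective on a
# discrete set of kernel parameters x = h*xi and return the best one
def warm_start(objective, centres, scales):
    candidates = [h * xi for xi in centres for h in scales]
    values = [objective(x) for x in candidates]
    return candidates[int(np.argmax(values))]

# toy usage: a stand-in objective peaked near x = (0, 0, 0.9)
target = np.array([0.0, 0.0, 0.9])
objective = lambda x: -np.sum((x - target) ** 2)
centres = [np.array(v, float) for v in
           [(0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 0, -1)]]
x0 = warm_start(objective, centres, scales=[0.5, 0.7, 0.9])
assert np.allclose(x0, target)   # grid point closest to the peak is picked
```

Handing this grid optimum to Ipopt as a starting point reduces the risk of ending up in a poor local solution, and the grid optimum remains available as a fallback if the continuous optimization fails.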

Additional features for practice
The previously presented algorithm gives us a first and basic learning technique. During its development, we faced several problems depending on the choice of the data and the given inverse problem. In order to overcome these difficulties, we introduced some additional features, which will be explained in the following.
First of all, the data of the monthly variation of the gravitational potential provided by GRACE attain very small values. In our experiments, these values lie in the interval [−0.1, 0.1]. When inserting these data into the objective function for determining an optimal Abel-Poisson kernel, the Ipopt solver at first fails to find a solution. Thus, we set the option obj_scaling_factor to −10^{10}. For the EGM data, we only use −1 to perform a maximization instead of a minimization. Note that the scaled objective is only used internally to support the optimizer.
Next, to cut runtime and reduce possible round-off errors, we implemented a restart method. We initiate a new run of the algorithm by resetting F_E to zero after a previously chosen iteration number E. Note that, in contrast to the restart procedure of the ROFMP, see, for example, [48], we also reset the regularization term λ‖F_E‖²_{H_2} to zero. In our experiments, we used E = 250.
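A schematic of the restart feature, with `step` as a placeholder for one (L)RFMP iteration (the real method updates a function expansion, not a scalar):

```python
def lrfmp_with_restart(step, n_total, E):
    """Sketch of the restart feature: every E iterations, the current
    approximation F is reset to zero, so the regularization term
    lambda * ||F||^2 restarts from zero as well.  `step` maps the
    iteration number and current approximation to an additive update;
    it is a hypothetical stand-in for one (L)RFMP iteration."""
    F = 0.0
    restarts = 0
    for n in range(n_total):
        if n > 0 and n % E == 0:
            F = 0.0            # restart: reset F_E (and the penalty) to zero
            restarts += 1
        F += step(n, F)
    return F, restarts
```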
Furthermore, we observed that the learnt dictionary heavily depends on the regularization parameter. Thus, when applying the learnt dictionary, we have to use the same regularization parameter as during learning. Moreover, in contrast to previous works on the IPMPs, see, for instance, [48], a non-stationary regularization parameter is necessary both when learning a dictionary and when applying it. In the previous works, the idea of a non-stationary, decreasing regularization parameter was introduced in order to emphasize accuracy instead of smoothness of the obtained approximation; however, the improvement of the results did not justify the additional computational expense of choosing a parameter and a decrease routine. Nonetheless, we reconsidered this idea since the main aim of the LRFMP is to learn a dictionary, and a decreasing parameter appears to guide the learning process positively. Thus, we use a non-stationary regularization parameter λ_n = λ_0 ‖R^0‖ / (n + 1), where n is the current iteration number, λ_0 is an optimal initial regularization parameter and R^0 is the initial residual. Our experiments show that, with a decreasing regularization parameter, we determine a dictionary which yields a better approximation when applied. Hence, with a learnt dictionary, the use of a non-stationary regularization parameter has an impact on the result of the RFMP.
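A decreasing schedule of this form can be sketched as follows; the reading λ_n = λ_0 ‖R^0‖ / (n + 1) is our reconstruction of the formula and the exact scaling in the paper may differ:

```python
def lambda_schedule(lambda0, r0_norm, n):
    """Non-stationary regularization parameter: decreases with the
    iteration number n so that later iterations emphasize data fit over
    smoothness.  Reconstructed as lambda_n = lambda0 * ||R^0|| / (n + 1);
    r0_norm is the norm of the initial residual R^0."""
    return lambda0 * r0_norm / (n + 1)
```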
Next, for the case of a high satellite orbit or of the seasonal variation in the gravitational potential obtained via GRACE, we developed a dynamic dictionary approach. Note that previous literature on dictionary learning, see, for example, [40], mentions that the structure of the input data could or even should be considered when learning a dictionary. We developed two strategies for learning a dynamic dictionary whose combination works well in the experiments under consideration.
First of all, we demand that the first 250 learnt dictionary elements are spherical harmonics. After these 250 iterations, the current residual R^n has a rougher structure than the initial residual R^0; the optimization routine then finds a sensible solution more easily and, therefore, learns a better dictionary. Additionally, when we use the learnt dictionary, we allow the RFMP to choose only from the first n + 1 learnt dictionary elements in its n-th iteration step. In this case, the order in which the LRFMP chose the dictionary elements has to be preserved, so that the optimized trial function chosen in the n-th step of the LRFMP can be chosen in the RFMP as well. Additionally, we save runtime, as the dictionary is small at the beginning. Finally, this also treats possible complications which might otherwise arise from the non-stationary regularization parameter: since we decrease this parameter, very localized trial functions are chosen in later iterations of the LRFMP. If we allowed the RFMP to choose them prematurely, this would only reduce the data error without improving the approximation, as we have seen in practice. By restricting the RFMP to the first n + 1 learnt dictionary elements, we prevent it from choosing very localized kernels prematurely.
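The restriction to the first n + 1 learnt elements can be sketched as below; the dictionary is simply an ordered list, preserving the order in which the LRFMP chose its elements:

```python
def allowed_dictionary(learnt_dictionary, n):
    """In iteration n of the RFMP, only the first n + 1 learnt elements
    (kept in the order the LRFMP chose them) may be selected.  This keeps
    very localized kernels, which are learnt late under a small
    regularization parameter, from being chosen prematurely, and keeps
    the search cheap in early iterations."""
    return learnt_dictionary[: n + 1]
```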
With these features included in the LRFMP, we run our experiments.

Setting of the experiments
In this section, we present the results obtained with the learnt dictionary compared to a manually chosen dictionary applied in the standard RFMP. We use data from the EGM2008 as well as GRACE satellite data. The EGM2008 data is evaluated up to degree 1500. For the GRACE data, we computed the mean field from 2003 to 2013, averaged from the release 5 products of JPL, GFZ and CSR as proposed in [44]. This mean field was subtracted from the data corresponding to May 2008. Note that, traditionally, the wet season in the Amazon basin is about to end in May, such that we can expect a concentration of masses in this region. Additionally, we smoothed the data with a cubic polynomial scaling function (see, for instance, [46] and [15, p. 295]) of scale 5.
In all experiments, we compute the data on an equidistributed Reuter grid of 12684 data points, see, for instance, [41] and [28, p. 137]. We choose the constant regularization parameter λ of the RFMP and the starting regularization parameter λ_0 of the LRFMP (see Subsection 3.4) such that they yield the lowest relative approximation error after 2000 iterations. In detail, we choose λ = 10^{−2} for both experiments as well as λ_0 = 10^{−4} for the experiment with EGM data and λ_0 = 10^{−1} for the experiment with GRACE data.
The manually chosen dictionary with which we compare the learnt dictionary is built similarly to [48]. From experience, we know that it is better to cut down the number of scales r than the number of centres ξ of the Abel-Poisson kernels. Thus, the starting dictionary contains 11330 dictionary elements in the case of EGM2008 data and 4850 dictionary elements in the case of GRACE data. We use the Ipopt optimizer with the linear solver ma97 from HSL. For EGM data, we demand a desired and an acceptable tolerance of 10^{−4} for the optimal solution. For GRACE data, we demand these tolerances to be only 10^0 due to the scaling of the objective function explained in Subsection 3.4. As we need to ensure that the constraint is not violated during the optimization process, we set r² < 0.98999999² as well as the options theta_max_fact = 10^{−2} and watchdog_shortened_iter_trigger = 0 in practice. For details on these options, see the Ipopt documentation [50].
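For reference, the stated solver settings collected as an options dictionary; the option spellings follow the Ipopt documentation, while how they are passed depends on the interface in use (e.g. `add_option` in the cyipopt bindings):

```python
# Ipopt options for the EGM experiment, as stated in the text.
ipopt_options_egm = {
    "linear_solver": "ma97",                 # HSL solver, enables parallelization
    "tol": 1.0e-4,                           # desired tolerance (EGM data)
    "acceptable_tol": 1.0e-4,                # acceptable tolerance (EGM data)
    "theta_max_fact": 1.0e-2,
    "watchdog_shortened_iter_trigger": 0,
}

# For GRACE data, only the tolerances differ due to the objective scaling.
ipopt_options_grace = dict(ipopt_options_egm, tol=1.0, acceptable_tol=1.0)
```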
We terminate the LRFMP as well as the RFMP in all cases when one of the following termination criteria is fulfilled: either the relative data error is less than 10^{−8} or 3000 iterations are reached. We also terminate the LRFMP when the last chosen coefficient α_{n+1} is less than 10^{−5}, since we experienced that, otherwise, the additionally learnt dictionary elements do not improve the solution and we can easily stop the learning process at this stage. Note that, by stopping after at most 3000 iterations, we also limit the learnt dictionary to at most 3000 learnt trial functions. This means that the learnt dictionary is much smaller than the manually chosen dictionary with which we compare it. Further, note that, due to its size, the manually chosen dictionary obviously has a much larger storage demand.
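The combined stopping rule can be sketched as below; comparing α_{n+1} in absolute value is our assumption, and all threshold defaults are the values from the text:

```python
def should_terminate(rel_data_error, n, alpha_next,
                     tol=1.0e-8, max_iter=3000, min_coeff=1.0e-5,
                     learning=False):
    """Stopping rule for the (L)RFMP: terminate when the relative data
    error falls below tol or when max_iter iterations are reached.
    During learning (LRFMP), additionally terminate when the last chosen
    coefficient alpha_{n+1} becomes smaller than min_coeff (compared in
    absolute value here, which is our reading of the criterion)."""
    if rel_data_error < tol or n >= max_iter:
        return True
    if learning and abs(alpha_next) < min_coeff:
        return True
    return False
```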
The results which we compare here are obtained as follows: in the first case, we use the RFMP with the manually chosen dictionary D_m. In the second case, we first learn a dictionary D^{*,•} (see Algorithm 8) for • ∈ {EGM, GRACE} by using the LRFMP (which requires the starting dictionary D^{s,•}) and then run the RFMP with the learnt dictionary D^{*,•}. The major question is: does the learnt dictionary yield better results than the manually chosen one?
The plots shown in this paper were produced with MATLAB. Note that the colour scale for the results of the GRACE data is flipped in comparison to the results of the EGM data. This is done in order to emphasize wet regions in blue and drier regions in red.

Results
In Figure 3, the results of the two experiments are shown. The first row shows the results with the EGM data; the second row depicts the results with the GRACE data. In the left-hand column, the solution is given. The middle column shows the absolute approximation error of the RFMP with the manually chosen dictionary. The right-hand column depicts this error for the RFMP with the learnt dictionary. We adjusted the scales of the values for a better comparison.
Obviously, in both cases the algorithm is able to construct a good approximation. The remaining errors occur mainly within regions with more localized structures. In the case of the EGM data, these are in particular the Andean region as well as the Himalayas and the borders of the tectonic plates in Asia. In the case of the monthly GRACE data, the masses in the Amazon basin originating from the ending wet season show the strongest structure. As we only allow 3000 iterations, it is to be expected that such regions cannot be approximated perfectly.
Particularly interesting are the results in the right-hand column which were obtained with the learnt dictionary. Clearly, in both scenarios, the approximation error is notably reduced. This is, in particular, also the case in the regions with localized anomalies.
In Table 1, the relative approximation and data errors after 3000 iterations of the RFMP with the manually chosen dictionary and with the learnt dictionary are given. Furthermore, the currently needed CPU runtimes for the experiments are presented in the last column. We notice that, with the learnt dictionary, we do not only obtain a smaller relative approximation error but also a smaller relative data error.
We state the CPU runtime in hours for the sake of completeness. Although we see that computing and preprocessing the learnt dictionary takes less time than preprocessing the manually chosen dictionary, these results are to be interpreted with care because we currently do not work with optimized code. All in all, the results show that we are able to learn a dictionary which yields a smaller data error as well as a smaller approximation error than a manually chosen dictionary. In addition, we obtain these results with a sparser dictionary, a lower storage demand and an appropriate CPU runtime.

Conclusion and Outlook
We started our investigations by aiming to improve our methods for gravity field modelling. In practice, we considered the RFMP for now. We expected to reduce the computational demands of the RFMP and the approximation error if a learnt dictionary is used rather than a manually chosen 'rule-of-thumb' dictionary.
In this paper, we presented a first approach to learn a dictionary of spherical harmonics and Abel-Poisson kernels for the downward continuation of gravitational data. In the numerical tests, we used data generated from EGM2008 and GRACE models. The idea of our learning approach is to iteratively minimize a Tikhonov functional over an infinite set of these trial functions. We do so by using non-linear constrained optimization techniques. Our results show that we obtain better results with respect to the relative data and approximation error when applying the learnt dictionary than when using a manually chosen dictionary in the RFMP. Moreover, we obtain these results with a sparser dictionary and less storage demand. Further, even non-optimized code yields satisfactory runtime results.
In future research, we aim to transfer this learning approach to the ROFMP and to extend it to Abel-Poisson wavelets and Slepian functions. Further, the presented learning algorithm can be viewed as one component of a structured learning technique. The question is whether we can determine an optimized learnt dictionary by learning from the results of the strategy presented here. In this way, we aim to build a learning hierarchy for the dictionary, where the present technique represents the second level and the first level still needs to be developed. This could also lead to an automation of the additional features which we explained in Subsection 3.4. An additional objective is to obtain a dictionary for GRACE from a given set of training data and to apply it to new test data, such that we can provide an optimal dictionary for GRACE-FO satellite data. Further, we need to consider theoretical aspects of the learning algorithm, like determining a quantitative measure for the quality of a dictionary and investigating how the learnt dictionary of the LRFMP is related to that measure. In addition, with respect to practical aspects, we plan to optimize our code in order to obtain more meaningful runtime results.

A Gradient of the objective function with respect to Abel-Poisson kernels
In this appendix, we prove Theorem 10.

First considerations
We discuss the terms for the upward continuation operator as given in (1). Note that the Euclidean inner product of two vectors is emphasized by using a '·' at the particular positions. Additionally, we make use of the following basic aspects.
In geomathematics, a common orthonormal basis in ℝ³ is given by

ε^r = (√(1−t²) cos φ, √(1−t²) sin φ, t)^T,
ε^φ = (−sin φ, cos φ, 0)^T,
ε^t = (−t cos φ, −t sin φ, √(1−t²))^T,

see, for example, [28, p. 86]. Note that it holds ∂_φ ε^r = √(1−t²) ε^φ and ∂_t ε^r = (1/√(1−t²)) ε^t. For the gradient ∇, we will use the Cartesian definition as well as its decomposition into a radial and an angular part,

∇ = ε^r ∂/∂r + (1/r) ∇*,   where   ∇* = ε^φ (1/√(1−t²)) ∂/∂φ + ε^t √(1−t²) ∂/∂t,

see, for instance, [28, p. 87]. Next, we consider the following recurring inner products.
see, for instance, [48, p. 114]. Finally, we note one specific property of the fully normalized spherical harmonics, see, for instance, [25]. Now, we can compute the terms in Theorem 10.
The term a_1(P(x, ·)) and its derivative. Obviously, for the formulation of a_1(P(x, ·)) in (10), we only have to show that (25) holds for any η ∈ Ω. We start at the left-hand side of (25).
With ε^r = x/|x|, this is the formulation of (19).
The term b_1(P(x, ·)) and its derivative. The formulation of b_1(P(x, ·)) as in (12) is obvious when we take (26) into account. The derivative of b_1(P(x, ·)) as in (16) is obtained by a direct computation. The term b_2(P(x, ·)) and its derivative. The formulation of b_2(P(x, ·)) as in (13) follows from considerations which use (23) and P_n(1) = 1, see, for instance, [28, p. 49]. The gradient with respect to x is then obtained as follows.
This is in accordance with (17). Hence, Theorem 10 is proven.