1 Introduction

In the past three decades, tens of thousands of global navigation satellite system (GNSS) continuously operating reference stations (CORS) have been established worldwide. The coordinate time series derived from the daily GNSS data of these CORSs have been used to study tectonic motion, non-tectonic deformation, inter-seismic strain, post-seismic deformation, slow slip events, surface mass-induced displacements, and other geodynamics-driven surface displacements. The coordinate time series from selected stable sites have also been used to establish and maintain a geodetic reference frame (Altamimi et al. 2011, 2016).

GNSS site motions are related to a number of physical variables, including the positions of the Sun and the Moon relative to the site, earth rotation parameters such as polar motion, temperature, atmospheric pressure, and hydrological parameters (Petit and Luzum 2010). Although these physical factors are dealt with to some extent in the processing of daily GNSS coordinates, residual effects and unconsidered factors remain. Dong et al. (2002) subtracted seasonal terms generated by known geophysical sources including the pole tide, ocean tide, atmospheric mass loading, non-tidal ocean mass loading, and snow and soil moisture mass loading. They speculated that the remaining residuals are caused by mismodelling of atmospheric delay, by bedrock thermal expansion, and by glacier surge and internal ice flow, depending on site location and time. GNSS coordinate time series have conventionally been modelled by the least squares (LS) fitting method. In this approach, the time series are fitted as a function of harmonic terms with constant annual and semi-annual amplitudes and phases, a linear trend, jumps (offsets, steps, discontinuities), and other possible variables (Bock and Melgar 2016; Heflin et al. 2020). Related studies suggest that time-varying seasonal signals may be determined with a nonparametric annual signal (Tesmer et al. 2009; Freymueller 2009), piecewise continuous linear polynomials (Davis et al. 2012), singular spectrum analysis (Chen et al. 2013), or a flexible semi-parametric model (Bennett 2008). However, biases and large noise remain after the seasonal variation is removed by LS fitting models, and the LS fitting precision in the vertical component has remained between 4 and 6 mm for a long time (Heflin et al. 2020; Davis et al. 2006; Chen et al. 2013). In addition, how well the LS model can predict the time series into the future is rarely discussed.

In the traditional modelling of GNSS time series, the functional models are expressed as linear (or linearised) equations for unknown parameters and independent variables. For instance, in the realisation of the international terrestrial reference frame (ITRF), the station position kinematic model contains parameters such as position offsets, velocity, amplitudes and phases of annual and semi-annual signals, and deformation breaks (Altamimi et al. 2018, 2021). LS modelling of the time series determines the unknown parameters over a continuous time series period. The functional and statistical models are given, and the input data are fitted to these models to output the solutions. However, when we try to fit the GNSS coordinate time series with many other physical variables, regardless of independent and dependent variables, we cannot assume a linear relationship between the unknown parameters and the potential impact factors or physical variables. With machine learning (ML), input data and expected solutions are trained to output the models, which are the relationships between the coordinate time series in three components and various physical variables. In other words, various physical factors or causes are taken into consideration without explicit mathematical equations. In addition, in traditional modelling approaches such as LS and Kalman filter (KF) processing, only data samples over a limited time period can effectively contribute to the modelling and prediction. In contrast, ML principles allow for making good use of historical data for modelling and prediction of GNSS time series. The challenge is how to properly apply ML approaches to modelling and prediction of the GNSS time series with different physical variables. We note that some ML approaches have already been used to model and predict GNSS time series, with some success. For example, Alevizakou et al. (2018) precisely forecasted the positions of 1000 GNSS stations in the short and long terms using two types of artificial neural networks (ANNs), and Wang et al. (2021) achieved good prediction results for the XJSS station using the long short-term memory (LSTM) approach, with a prediction precision of 3.2 mm for 294 days. However, the essence of these studies is the same as that of traditional methods: they only use the time variable to model the GNSS time series, without considering other physical variables related to site motions.

Some efforts have been made to use ML approaches to model the relationship between physical factors and GNSS time series or residuals, for example, to correct environmental influences and improve positioning accuracy. Mohammednour and Özdemir (2020) trained an ANN model using surface meteorological parameters and the number of observed satellites as inputs to eliminate the troposphere delay. With this model applied, the 3D position accuracy of the GNSS receiver is improved by about 20%. Ruttner et al. (2021) applied a temporal convolutional neural network (TCN) to find the connection between raw meteorological parameters and GNSS height residuals. They suggest that the trained TCN can reduce GNSS height residuals at almost the same level as physical models. Current research commonly uses meteorological parameters such as temperature, humidity, wind speed, and pressure to model GNSS time series or residuals. However, we argue that precise modelling of GNSS time series requires more comprehensive physical factors, as they can also have a significant impact on the time series. Hydrologic surface loading, which can produce displacements of several millimetres in GNSS time series (Herring et al. 2016), is one example. Therefore, more comprehensive physical factors are used in this research.

There have been some studies into the application of different ML approaches, such as the boosting tree (BT), gradient boosting decision tree (GBDT), LSTM, and support vector machine (SVM), to modelling problems in geodetic data analytics. Li et al. (2020) proposed using the BT algorithm to model orbit prediction (OP) errors. This ML-derived OP error model was used to modify future physics-based OP results. With this model applied, the accuracy of the physics-based OP results over the future 7 days is improved by at least 50%. Furthermore, Li et al. (2021) applied the GBDT and convolutional neural networks (CNN) to model the underlying orbit error patterns. With the correction by model-predicted errors, the accuracy of OP over the future 14 days is improved by more than 75%, 90%, and 90% in the along-track, cross-track, and radial directions, respectively. Yang et al. (2019) proposed an LSTM neural network-based dynamic model to predict landslide displacement and claimed that it can effectively predict the displacement of stepwise landslides in the Three Gorges Reservoir Area. Dörterler and Faruk Bay (2018) applied ML techniques to vehicular location prediction (VLP) and developed an ANN-based VLP model for cooperative active safety systems. Compared with other models based on KF and numerical integration methods (Caveney 2010), this ANN-based model shows better performance.

We hypothesise that ML principles can guide the modelling and prediction of GNSS time series well, given the similar nature of the error modelling problems resolved in other applications. Various well-developed ML algorithms have emerged over the past decades. In our experiments, the GBDT, LSTM, and SVM were chosen as suitable approaches because of their proven performance in time series analysis. In using these ML approaches for GNSS time series modelling and prediction, the physical factors considered in this study include polar motion, the Sun and Moon ephemerides, temperature, atmospheric pressure, and hydrology. This contribution seeks to answer the key specific research questions in the workflow of machine learning: (1) how to define an ML problem with the input and output variables and how to prepare the datasets; (2) what modelling and prediction performance the ML models can achieve, and how much improvement they offer over the traditional LS fitting and prediction results.

Section 2 describes the GBDT, LSTM, and SVM principles and their evaluation methods. Section 3 prepares the input and output data for ML training and defines the ML problem. In Sect. 4, the modelling and prediction results of the ML models and LS models are presented and evaluated. Finally, Sect. 5 concludes the paper.

2 Methodology

In a supervised ML problem, models are trained from labelled samples. These independent and identically distributed samples used for training are known as training data and can be represented as:

$$\begin{aligned} D_{\mathrm{Trn}} = \{(x_1, y_1), (x_2,y_2),\ldots ,(x_N,y_N)\} \end{aligned}$$
(1)

where N indicates the number of samples in the training dataset and \(x_n \in {\mathbf {R}}^{12}, n = 1,2, \ldots ,N\), represents the input vector. Each \(x_n\) consists of one or more features. In this paper, the features include a total of 12 variables: the time variable t; \(P_n^x\) and \(P_n^y\), representing polar motion; \(\mathrm{RA}_n^{S}\), \(\mathrm{DEC}_n^S\), and \(\mathrm{DEL}_n^S\), representing the astrometric right ascension, declination, and apparent range of the Sun’s centre with respect to the observing site in the international celestial reference frame (ICRF); \(\mathrm{RA}_n^{M}\), \(\mathrm{DEC}_n^{M}\), and \(\mathrm{DEL}_n^{M}\), similarly representing the astrometric right ascension, declination, and apparent range of the Moon’s mass centre with respect to the observing site in the ICRF; and \(\mathrm{TEM}_n\), \(\mathrm{AP}_n\), and \(\mathrm{HYD}_n\), representing the surface temperature, surface atmospheric pressure, and hydrology at the GNSS site. These variables are introduced in detail in Sect. 3.2. For non-ANN ML approaches such as GBDT and SVM, the input vector \(x_n\) can be represented as

$$\begin{aligned} \begin{aligned} x_n =&[ t_n, P_n^x, P_n^y, \mathrm{RA}_n^{S}, \mathrm{DEC}_n^S, \mathrm{DEL}_n^S, \mathrm{RA}_n^{M},\\&\mathrm{DEC}_n^{M}, \mathrm{DEL}_n^{M}, \mathrm{TEM}_n, \mathrm{AP}_n, \mathrm{HYD}_n]^T. \end{aligned} \end{aligned}$$
(2)

The input matrix with two dimensions (\(12 \times N\)) can then be expressed as \(\varvec{X}=(x_1, x_2, \ldots , x_N)\). It should be noted that the input matrix for the LSTM approach has three dimensions, as the time step is involved as another dimension. However, in this study, the time step for the LSTM is set to 1, so the input matrix for the LSTM can also be regarded as a two-dimensional matrix in this case.

Here \(y_n \in {\mathbf {R}}\) is the corresponding output, also called the response, which in this study is the vertical coordinate of the GNSS site. The purpose of ML is to find an underlying mapping relationship between \(\varvec{X}=(x_1,x_2,\ldots ,x_N)\) and \(Y=(y_1,y_2,\ldots ,y_N)\), i.e. \({\hat{Y}} = {\hat{f}}(\varvec{X})\), where \({\hat{Y}}\) is the predicted output, which usually differs slightly from Y because the ML model usually cannot represent the relationship with absolute accuracy. Once the relationship between \(\varvec{X}\) and Y is established, the prediction process gives the output \({\hat{y}}_{N+1}\) through \({\hat{y}}_{N+1} = {\hat{f}}(x_{N+1})\), where \(x_{N+1}\) is the corresponding new input vector.

In practice, the input vector \(x_n\) is usually normalised before training. Normalisation is a data preparation technique often applied in ML. The goal of normalisation is to bring the feature values in the input dataset to a common scale, without distorting differences in the ranges of values (Jalal et al. 2020). There are two reasons that the dataset should be normalised before ML training (Wang and Balog 2016): (1) objective functions of some ML algorithms will not work properly without normalisation if the input values vary widely; (2) gradient descent can converge faster with normalisation than without it. The standard score normalisation, also called the z-score, is applied in this study. If the mean and standard deviation (STD) of the input vector \(x_n\) are known, the normalised input can be obtained via:

$$\begin{aligned} x_n'=\frac{x_n-\mu }{\sigma } \end{aligned}$$
(3)

where \(\mu \) and \(\sigma \) are the mean and STD of each feature in \(x_n\), respectively.
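As an illustration, the following Python sketch stacks the 12 daily series into an input matrix and applies the z-score normalisation of Eq. 3 feature-wise; the variable names are placeholders rather than identifiers from the study’s actual code, and the rows-as-samples layout is the transpose of the \(12 \times N\) matrix used in the text.

```python
import numpy as np

def zscore_normalise(X):
    """Apply Eq. (3) feature-wise: subtract each column's mean and
    divide by its standard deviation, giving all inputs a common scale."""
    mu = X.mean(axis=0)      # per-feature mean
    sigma = X.std(axis=0)    # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

def build_input_matrix(t, px, py, ra_s, dec_s, del_s,
                       ra_m, dec_m, del_m, tem, ap, hyd):
    """Stack the 12 daily series (each of length N) into an N x 12 matrix."""
    return np.column_stack([t, px, py, ra_s, dec_s, del_s,
                            ra_m, dec_m, del_m, tem, ap, hyd])
```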

A common challenge of ML modelling is to address the overfitting problem. Overfitting occurs when an ML model fits its training data very closely but generalises poorly to new datasets. This situation happens when training runs for too many epochs or when the model is too complex, so that “noise” or irrelevant information is learned by the ML model. To prevent overfitting, part of the dataset is set aside as the “validation dataset” to check for overfitting. For convenience, the training and validation datasets are together referred to as the modelling dataset and denoted as \(D_{md}\), as they are used together for training a model in this context. The mapping relationship \({\hat{Y}}={\hat{f}}\left( {\mathbf {X}}\right) \) can be different if different ML approaches are used. The GBDT, LSTM and SVM approaches are introduced in the next subsections.

2.1 Gradient boosting decision tree

Boosting is an ensemble technique that converts weak learners (base learners) into strong ones in an iterative fashion. Gradient boosting is a competitive, robust, and interpretable boosting algorithm developed by Friedman (2001, 2002). It is typically used with fixed-size classification and regression trees (CART) (Breiman et al. 1984) as base learners, a combination known as GBDT. GBDT has been one of the most commonly used techniques in Kaggle competitions and has achieved excellent performance in many scientific applications, such as orbit prediction of space debris (Li et al. 2020, 2021), real-time GNSS precipitable water vapour sensing (Zheng et al. 2022), and GPS signal reception classification (Sun et al. 2020). Therefore, we hypothesise that the GBDT could perform well in modelling and prediction of GNSS time series.

A boosting tree model can be represented as the additive model of decision trees

$$\begin{aligned} f_{M}(x)=\sum _{m=1}^{M} T\left( x; \Theta _{m}\right) \end{aligned}$$
(4)

where \(T\left( x ; \Theta _{m}\right) \) is the decision tree with its parameters \(\Theta _{m}\) and M indicates the number of trees.

The model can be constructed by using the boosting algorithm to fit the training data in a forward stage-wise manner:

$$\begin{aligned} f_{0}(x)= & {} 0 \nonumber \\ f_{m}(x)= & {} f_{m-1}(x)+T\left( x ; \Theta _{m}\right) , \quad m=1,2, \ldots , M \nonumber \\ f_{M}(x)= & {} \sum _{m=1}^{M} T\left( x ; \Theta _{m}\right) \end{aligned}$$
(5)

In the mth step of the algorithm, with the current model \(f_{m-1}(x)\) known, the parameters \({\hat{\Theta }}_{m}\) of the mth decision tree can be determined by minimising

$$\begin{aligned} {\hat{\Theta }}_{m}=\arg \min _{\Theta _{m}} \sum _{i=1}^{N} L\left( y_{i}, f_{m-1}\left( x_{i}\right) +T\left( x_{i} ; \Theta _{m}\right) \right) \end{aligned}$$
(6)

where \(L(y, f(x))\) is the loss function, usually the mean squared error (MSE):

$$\begin{aligned} L\left[ y, f_{m}(x)\right]= & {} \frac{1}{2}\left[ y-f_{m}(x)\right] ^{2} \nonumber \\= & {} \frac{1}{2}\left[ r-T\left( x ; \Theta _{m}\right) \right] ^{2} \end{aligned}$$
(7)

where \(r=y-f_{m-1}(x)\) is the fitting residual of step \(m-1\).

The GBDT algorithm solves Eq. 6 by iteratively creating a base learner \(T\left( x ; \Theta _{m}\right) \) that points in the negative gradient direction, which can be expressed as

$$\begin{aligned} {\tilde{y}}=-\left[ \frac{\partial L[y, f(x)]}{\partial f(x)}\right] _{f(x)=f_{m-1}(x)} \end{aligned}$$
(8)

\({\tilde{y}}\) is also called the pseudo-residual because it approximates the residual.

In the GBDT method, the current base learner \(T\left( x ; \Theta _{m}\right) \) is fitted to the pseudo-residuals of the previous base learner \(T\left( x ; \Theta _{m-1}\right) \), thus compensating for the weakness of each individual learner. Finally, the desired GBDT model is built in the forward stage-wise manner.
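As a concrete illustration, the following sketch trains such a model with scikit-learn’s GradientBoostingRegressor, using the hyperparameters reported later in Sect. 4.1 (100 base learners of maximum depth 3, learning rate 0.1). It is a minimal example with stand-in data, not the study’s actual code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_norm = rng.standard_normal((3000, 12))  # stand-in for the normalised inputs
y = rng.standard_normal(3000)             # stand-in for the height series (mm)

gbdt = GradientBoostingRegressor(
    n_estimators=100,      # number of trees M in Eq. (4)
    max_depth=3,           # size of each CART base learner
    learning_rate=0.1,     # shrinkage applied at each boosting stage
    loss="squared_error",  # MSE loss of Eq. (7); named "ls" in old scikit-learn
)
gbdt.fit(X_norm, y)        # forward stage-wise fitting of Eq. (5)
y_hat = gbdt.predict(X_norm)
```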

2.2 Long short-term memory

The LSTM is an artificial recurrent neural network (RNN) architecture. It has been widely applied in time series anomaly detection and prediction (Malhotra et al. 2015), machine translation (Wu et al. 2016), speech recognition and synthesis (Fan et al. 2014), and handwriting recognition (Carbune et al. 2020), among many other applications. For a classic RNN with one hidden layer fed with the input dataset \(\varvec{X} = (x_1, x_2, \ldots , x_N )\), the hidden vector sequence \(H = (h_1 , h_2, \ldots , h_N )\) and the output sequence \({\hat{Y}}= (\hat{y_1}, \hat{y_2}, \ldots , \hat{y_N})\) are computed by iterating the following equations from \(n=1\) to N:

$$\begin{aligned} h_{n}= & {} {\mathcal {A}}\left( W_{xh} x_{n}+W_{hh} h_{n-1}+b_{h}\right) \end{aligned}$$
(9)
$$\begin{aligned} {\hat{y}}_{n}= & {} {\hat{f}}_{\mathrm{LSTM}}(x_n)=W_{hy} h_{n}+b_{y} \end{aligned}$$
(10)

where \({\mathcal {A}}\) is the hidden layer function, which is usually a sigmoid function in conventional RNNs; \(W_{xh}\), \(W_{hh}\) and \(W_{hy}\) indicate the weight matrices between the input and hidden vectors, between hidden vectors at different time steps, and between the hidden and output vectors, respectively; \(b_h\) and \(b_y\) are the bias vectors of the hidden and output layers, respectively.

Theoretically, a large enough RNN should be sufficient to generate arbitrarily complex sequences (Vincent et al. 2010). However, the application of RNNs is limited by vanishing or exploding gradient problems (Bengio et al. 1994). The LSTM NN was therefore proposed by Hochreiter and Schmidhuber (1997) to overcome these problems. Unlike the traditional ANN, which typically consists of an input layer, one or more hidden layer(s), and an output layer, the basic unit of the LSTM NN hidden layer is the memory block (shown in Fig. 1), composed of a memory cell and three gates: the input gate, forget gate, and output gate (Gers and Schmidhuber 2000). The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. In this way, useful information can be retained, and useless information can be eliminated. The hidden vector \(h_n\) of the LSTM differs from that of the conventional RNN and is obtained as follows:

$$\begin{aligned} i_{n}= & {} \sigma \left( W_{x i} x_{n}+W_{h i} h_{n-1}+W_{c i} c_{n-1}+b_{i}\right) \end{aligned}$$
(11)
$$\begin{aligned} f_{n}= & {} \sigma \left( W_{x f} x_{n}+W_{h f} h_{n-1}+W_{c f} c_{n-1}+b_{f}\right) \end{aligned}$$
(12)
$$\begin{aligned} c_{n}= & {} f_{n} c_{n-1}+i_{n} \tanh \left( W_{x c} x_{n}+W_{h c} h_{n-1}+b_{c}\right) \end{aligned}$$
(13)
$$\begin{aligned} o_{n}= & {} \sigma \left( W_{x o} x_{n}+W_{h o} h_{n-1}+W_{c o} c_{n}+b_{o}\right) \end{aligned}$$
(14)
$$\begin{aligned} h_{n}= & {} o_{n} \tanh \left( c_{n}\right) \end{aligned}$$
(15)

where \(\sigma \) is the logistic sigmoid function, defined as \(\sigma (x)=\frac{1}{1+e^{-x}}\); \(i_n\), \(f_n\), \(c_n\), and \(o_n\) are the input gate, forget gate, cell activation, and output gate vectors, respectively; \(b_i\), \(b_f\), \(b_o\), and \(b_c\) are their corresponding bias vectors; W denotes the weight matrices, whose subscripts have the obvious meaning: for instance, \(W_{hi}\) indicates the hidden–input gate matrix and \(W_{xo}\) the input–output gate matrix.

The output \({\hat{y}}_n\) can then be calculated from Eq. 10, and \({\hat{Y}}= (\hat{y_1}, \hat{y_2}, \ldots , \hat{y_N})\) is obtained by iterating Eqs. 10–15 from \(n = 1\) to N.
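To make Eqs. 11–15 concrete, the following NumPy sketch performs one forward step of an LSTM cell, including the peephole terms \(W_{ci}\), \(W_{cf}\), and \(W_{co}\) shown above; the weight and bias containers are assumed to be given.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_n, h_prev, c_prev, W, b):
    """One forward step of Eqs. (11)-(15); W and b are dicts of weight
    matrices and bias vectors keyed by the subscripts used in the text."""
    i_n = sigmoid(W["xi"] @ x_n + W["hi"] @ h_prev + W["ci"] @ c_prev + b["i"])    # Eq. (11)
    f_n = sigmoid(W["xf"] @ x_n + W["hf"] @ h_prev + W["cf"] @ c_prev + b["f"])    # Eq. (12)
    c_n = f_n * c_prev + i_n * np.tanh(W["xc"] @ x_n + W["hc"] @ h_prev + b["c"])  # Eq. (13)
    o_n = sigmoid(W["xo"] @ x_n + W["ho"] @ h_prev + W["co"] @ c_n + b["o"])       # Eq. (14)
    h_n = o_n * np.tanh(c_n)                                                       # Eq. (15)
    return h_n, c_n
```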

Fig. 1 LSTM neural network architecture (Ma et al. 2015)

2.3 Support vector machine

The SVM is another widely applied ML approach, proposed by Cortes and Vapnik (1995). The principle of the SVM is to transform input data into a high-dimensional feature space through a nonlinear transformation realised by the kernel function and then perform linear regression within the feature space (Smola and Schölkopf 2004). SVM regression is considered a nonparametric technique as it relies on kernel functions. SVM algorithms can be characterised by their usage of kernels, the absence of local minima, and the number of support vectors (Ribeiro 2005). In this study, the linear epsilon-insensitive SVM (\(\varepsilon \)-SVM) regression is implemented. The goal of the \(\varepsilon \)-SVM is to find a function \({\hat{f}}_{\mathrm{SVM}}(x_n)\) that deviates from \(y_n\) by no more than \(\varepsilon \) while being as flat as possible. The regression function can be represented as:

$$\begin{aligned} {\hat{f}}_{\mathrm{SVM}}(\varvec{X})=\omega \cdot \varphi (\varvec{X})+s \end{aligned}$$
(16)

where \(\omega \) is the weight vector; \(\varphi (\varvec{X})\) indicates a nonlinear mapping function that maps the input space \(\varvec{X}\) to high dimensional feature space; s is a scalar threshold. The SVM model performs linear regression in the high-dimensional feature space by \(\varepsilon \)-insensitive loss. The coefficients \(\omega \) and s then can be estimated through minimising the regularised risk function:

$$\begin{aligned} \begin{aligned}&J(\omega )=\frac{\Vert \omega \Vert ^{2}}{2}\\&s. t.\,\, \forall n: \left\{ \begin{array}{l} y_{n}-\omega \cdot \varphi \left( x_{n}\right) -s \le \varepsilon \\ \omega \cdot \varphi \left( x_{n}\right) +s-y_{n} \le \varepsilon \end{array}\right. \end{aligned} \end{aligned}$$
(17)

It is likely that a function \(f(\varvec{X})\) that perfectly satisfies Eq. 17 for all points does not exist. Therefore, slack variables \(\zeta _n\) and \(\zeta _n^*\) are introduced for each point to deal with otherwise infeasible constraints. These slack variables allow regression errors up to the values of \(\zeta _n\) and \(\zeta _n^*\) while still meeting the required conditions. The objective function then becomes (Vapnik 2013):

$$\begin{aligned} \begin{aligned}&J(\omega )=\frac{\Vert \omega \Vert ^{2}}{2}+C \sum _{n=1}^{N}\left( \zeta _{n}+\zeta _{n}^{*}\right) \\&s. t.\,\, \forall \, n: \left\{ \begin{array}{l} y_{n}-\omega \cdot \varphi \left( x_{n}\right) -s \le \varepsilon +\zeta _{n}\\ \omega \cdot \varphi \left( x_{n}\right) +s-y_{n} \le \varepsilon +\zeta _{n}^*\\ \zeta _{n} \ge 0 \\ \zeta _{n}^{*} \ge 0 \end{array}\right. \\ \end{aligned} \end{aligned}$$
(18)

where the positive constant C is the box constraint, which determines the penalty imposed on samples with errors exceeding \(\varepsilon \).

The optimisation problem is solved in its dual form. The dual formula is constructed by introducing non-negative Lagrangian multipliers \(\alpha _n\) and \(\alpha _n^*\) into the primal function for each observation \(x_n\). We minimise the dual formula:

$$\begin{aligned} \begin{aligned}&L(\alpha )=\frac{1}{2} \sum _{i=1}^{N} \sum _{j=1}^{N}\left( \alpha _{i}-\alpha _{i}^{*}\right) \left( \alpha _{j}-\alpha _{j}^{*}\right) G\left( x_{i}, x_j\right) \\&\qquad \qquad +\varepsilon \sum _{i=1}^{N}\left( \alpha _{i}+\alpha _{i}^{*}\right) +\sum _{i=1}^{N} y_{i}\left( \alpha _{i}^{*}-\alpha _{i}\right) \\&\,\,s. t.\, \forall n: \left\{ \begin{array}{l} \sum _{n=1}^{N}\left( \alpha _{n}-\alpha _{n}^{*}\right) =0 \\ 0 \le \alpha _{n} \le C \\ 0 \le \alpha _{n}^{*} \le C \end{array}\right. \\ \end{aligned} \end{aligned}$$
(19)

where the Gaussian kernel function

$$\begin{aligned} G\left( x_{i}, x_{j}\right) =\exp \left( -\left\| x_{i}-x_{j}\right\| ^{2}\right) \end{aligned}$$
(20)

is used as the kernel function of SVM. The SVM model obtained by minimising Eq. 19 is then given as:

$$\begin{aligned} {\hat{f}}_{\mathrm{SVM}}(x)=\sum _{i=1}^{N}\left( \alpha _{i}-\alpha _{i}^{*}\right) G\left( x_{i}, x\right) +s \end{aligned}$$
(21)

The sequential minimal optimisation (SMO) approach (Platt 1998) is then applied to solve this optimisation problem, and with appropriate parameters (e.g. C and \(\varepsilon \)) the SVM model can finally be determined.
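An equivalent regression can be sketched with scikit-learn’s SVR, whose libsvm backend solves the dual problem of Eq. 19 with an SMO-type algorithm; setting gamma to 1 reproduces the kernel of Eq. 20, while the C and epsilon values below are purely illustrative, not the values tuned in the study.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_norm = rng.standard_normal((3000, 12))  # stand-in for the normalised inputs
y = rng.standard_normal(3000)             # stand-in for the height series (mm)

# Gaussian (RBF) kernel with gamma=1 matches Eq. (20): exp(-||xi - xj||^2).
svm = SVR(kernel="rbf", gamma=1.0, C=1.0, epsilon=0.1)
svm.fit(X_norm, y)
y_hat = svm.predict(X_norm)
```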

3 Defining the ML problem for GNSS time series modelling

In the workflow of machine learning, the input and output datasets must be clearly defined first: that is, what data are input for training and what data are to be predicted. For GNSS time series modelling and prediction, the output variable Y is the time series of one of the three coordinate components, the height or vertical component in this context. The choice of the input variables is neither straightforward nor unique, because there are many potential features associated with GNSS site motions. Theoretically, the trained model would be more accurate but more complex if more input features were included in \(\varvec{X}\). However, if there is too much irrelevant information, overfitting may occur. In this sense, it is necessary to carefully consider which factors have an impact on GNSS site motion. In the following, we discuss how the input and output variables are collected.

3.1 Output data—GNSS height time series

The experimental study for this paper was implemented using daily solutions of site coordinates provided by the Jet Propulsion Laboratory (JPL) (Heflin et al. 2020). The JPL uses the GipsyX software (Bertiger et al. 2020) to process GNSS observations collected from over 2000 receivers in precise point positioning (PPP) mode (Zumberge et al. 1997) with JPL 3.0 final orbits (Dietrich et al. 2018), applying IERS (International Earth Rotation and Reference Systems Service) 2010 solid earth tides (Petit and Luzum 2010; Eanes 1983; Mathews et al. 2002), GPT2w (global pressure and temperature 2 wet) troposphere mapping functions and nominals (Böhm et al. 2015), the GYM95 (GPS yaw attitude model-95) satellite yaw model (Bar-Sever 1996), and IGS (International GNSS Service) antenna phase centre maps (Rothacher and Mader 2002). The site coordinate time series provided by the JPL consist of coordinate displacements from which the reference coordinates (a set of constant values) have been subtracted. According to Heflin et al. (2020), coarse outliers are detected and removed from the time series, breaks are exhaustively detected using the “F test” method, and all parameters, including breaks and seasonal terms, are finally determined through LS fitting. New solutions are added every week, and they are freely and openly available at https://sideshow.jpl.nasa.gov/post/series.html.

Nine GNSS sites were chosen as our study objects because of the long time span and low missing rate of their datasets: six sites in Australia (MOBS, SYDN, PERT, DARW, KARR, and HOB2), two sites in the USA (P053 and CEDA), and one site in Brazil (POVE). As shown in Fig. 2, the latitude span of these nine sites is from \(48.7^{\circ }\)N to \(37.8^{\circ }\)S. Among them, the six Australian sites are close to the ocean and the other three sites are located inland. The vertical coordinate time series of these nine stations with a duration of 9 years (from 1 January 2008 to 31 December 2016) were used as the output data of the modelling datasets. In addition, the time series from 1 January 2017 to 1 January 2021, totalling 4 years of data, were used as test datasets to assess the prediction performance of the derived models. Samples on some dates are missing due to instrument failure or other reasons. For example, the PERT station is missing 294 days of data in total between 3 January 2012 and 24 October 2012 (see Fig. 6c). However, the adopted ML models are able to train on these discontinuous datasets because the time series is modelled not with the time variable only but with physical factors. This is also an advantage of the ML models compared with the traditional LS methods.

Fig. 2 Distribution of the applied GNSS sites

3.2 Input data—impact factors

Based on the IERS Conventions (2010) (Petit and Luzum 2010), GNSS site motions are divided into three categories: (1) conventional displacements mainly related to tidal motions (mostly with near-diurnal and semidiurnal frequencies) and other displacements with longer periods, which can be accurately modelled; (2) non-tidal motions associated with changing environmental loads; (3) displacements that affect the internal reference points within the observing instruments.

Fig. 3 Relationships between impact factors and input datasets

Fig. 4 Impact factor time series (input datasets) for the MOBS site

GNSS site displacements are conventionally caused by the solid earth tides (SET), the ocean tidal loading (OTL), the atmospheric pressure loading (APL) arising from the attraction of the Sun and the Moon, and the pole tide and ocean pole tide loading caused by changes in the centrifugal potential due to polar motion (Petit and Luzum 2010). These displacements can be modelled and corrected from the GNSS time series by means of corresponding models. However, displacements caused by some tidal loadings may not be considered and corrected in some GNSS observation processing cases. Furthermore, even with all corrections applied, these displacements cannot be fully corrected because of model and estimation uncertainties. For example, after the correction of the SET-caused displacement, residuals of up to 2.0 mm in the vertical direction can remain in the GNSS time series (Watson et al. 2006). For the GNSS time series used in this research, as described in Heflin et al. (2020), only the SET displacement is corrected, by using the IERS 2010 SET model (Eanes 1983; Mathews et al. 2002). As a result, the SET displacement residuals, along with other displacements caused by the OTL, APL, pole tide, and ocean pole tide, still remain in these vertical coordinate time series. The six sites located in Australia are particularly affected by the OTL as they are close to the ocean. Studies show that time-varying deformations of the Earth produced by the OTL can reach 100 mm in some coastal regions (Li et al. 2014; Melachroinos et al. 2008). Atmospheric tides also have an impact on site motions, especially on the vertical component. The amplitude of the vertical deformation caused by the APL is comparable to that of some OTL effects and should be considered when modelling site motions (Petit and Luzum 2010). The root mean square (RMS) of the average variations in the vertical component is 2.6 mm, but peak-to-peak variations can reach 40 mm (Capaldo et al. 2014). Dong et al. (2002) indicate that the impact of the APL on oceanic islands and near coasts is smaller than within the continents. Their study shows that sites on the eastern Antarctic coast have strong semi-annual variations from the APL with amplitudes of more than 1 mm. In addition, the pole tide can cause station coordinate variations of a couple of centimetres. The deformation caused by the ocean pole tide, which is generated by the centrifugal effect of polar motion on the oceans, is typically no greater than 1.8, 0.5, and 0.5 mm in the radial, north, and east directions, respectively, although it can occasionally be larger (Petit and Luzum 2010).

Non-tidal motions are associated with changing environmental loads, e.g. from hydrology, the atmosphere, and the ocean (Singh et al. 2021). Hydrology can cause GNSS site motions by means of water mass change. In continental interiors, water is stored in lakes, rivers, and groundwater reservoirs, as well as in soil, vegetation, and snowpack (Puskas et al. 2017). The mass of water varies seasonally (Fovell and Fovell 1993) and geographically (Miller 1994) and can cause displacements of several millimetres in GNSS time series (Herring et al. 2016). The non-tidal atmospheric loading is the part of the APL excluding the conventional diurnal \(S_1\) and semidiurnal \(S_2\) tidal components. The IERS 2010 conventions do not recommend including non-tidal loading (NTL) deformations in the calculation of the reference point displacement because their modelling accuracy is rather low (Petit and Luzum 2010). However, in this study, these NTL effects are expected to be captured in the GNSS site motions by introducing NTL-related variables, including hydrology and surface atmospheric pressure data, into the ML models.

In addition to the tidal and non-tidal loadings, the GNSS site position is also affected by bedrock thermal expansion caused by surface temperature changes. Surface temperature changes alter the internal temperature of the cement piers on which GNSS antennas are installed and, through heat conduction, the temperature of the bedrock, thereby changing the vertical displacement of GNSS stations. According to the research of Yan et al. (2010), the impact of temperature change on GNSS reference stations can reach 2.8 mm. Therefore, temperature change is a non-negligible factor causing annual changes in the vertical displacement of GNSS sites.

Temperature not only directly causes site deformation through the bedrock thermal expansion mechanism, but also indirectly affects the site position by driving changes in hydrology and atmospheric pressure. For example, comparing the temperature and hydrology time series shown in Fig. 4, it is found that within each 1-year time span, the hydrological variable consistently reaches its minimum when the temperature is the highest of the year.

Based on the above knowledge of the factors causing GNSS site motions, it is necessary to represent these factors by several datasets that can be used as input data in the ML training. As noted in Sect. 2, the GNSS site motions can be represented by 12 variables in an ML model. Figure 3 illustrates the relationships between these variables and the physical factors. The backward arrows indicate that each variable can represent the influence of one or more physical factors. It must be noted, however, that the input variable time t is common to all the factors. The site motions caused by the OTL and the APL are represented by the ephemerides of the Sun and Moon, calculated by the JPL’s HORIZONS system, including their right ascension, declination, and distance with respect to the studied GNSS site in the ICRF. The coordinates of the celestial ephemeris pole relative to the IERS Reference Pole (IRP) are provided by the IERS and used to represent the site motions caused by the pole tide and ocean pole tide. The surface temperature and atmospheric pressure data are obtained from the National Centres for Environmental Prediction (NCEP) reanalysis products. The spatial resolution of these products is \({2.5}^\circ \times {2.5}^\circ \), and the temporal resolution is 1 day. The hydrological variable is represented by the terrestrial water storage consisting of soil water, snow water equivalent, canopy water, and groundwater, obtained from the Global Land Data Assimilation System (GLDAS) products provided by the National Aeronautics and Space Administration (NASA). These products have a spatial resolution of \({0.25}^\circ \times {0.25}^\circ \) and a temporal resolution of 1 day.

All these impact variables have the same temporal resolution as the GNSS time series, which is 1 day. The variables for the MOBS site are illustrated in Fig. 4. It can be seen from the figure that while the three variables related to the lunar ephemeris have a 1-month cycle, the other variables have seasonal variations. It can also be observed that there are missing values in the variable time series. These values are removed from the datasets, together with the epochs corresponding to the missing values of the GNSS time series. This ensures that every input vector consisting of the 12 input values at each epoch has a corresponding output value.
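The epoch alignment described above can be sketched with pandas as follows; factors and height are hypothetical date-indexed containers for the 12 daily variables and the GNSS height series, not objects from the study’s actual processing.

```python
import pandas as pd

def build_dataset(factors: pd.DataFrame, height: pd.Series):
    """Join the 12 impact-factor series with the GNSS height series on
    their daily epochs and drop every epoch with a missing value, so that
    each 12-element input vector has a corresponding output value."""
    data = factors.join(height.rename("up"), how="inner").dropna()
    X = data.drop(columns="up").to_numpy()
    y = data["up"].to_numpy()
    return X, y
```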

4 Experimental results and numerical analysis

The workflow of the data analysis is shown in Fig. 5. The modelling dataset for each site is normalised and denoted as \(D_m^s\), where s indicates the site name; for example, if \(s= \) MOBS, \(D_m^s\) is the modelling dataset for the MOBS site. The first 80% of the timespan of \(D_m^s\) is used as training data and the remaining 20% as validation data (a chronological split is sketched below). \(D_m^s\) is then trained with the GBDT, LSTM NN, and SVM approaches, respectively. The modelling and prediction performance of the GBDT, LSTM, and SVM models is evaluated on the modelling datasets and test datasets, respectively.
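A chronological split along these lines might look like the following sketch; the 80/20 ratio follows the text, while the helper itself is illustrative.

```python
def chronological_split(X, y, train_frac=0.8):
    """Use the first 80% of the modelling timespan for training and the
    remaining 20% for validation."""
    n_train = int(len(y) * train_frac)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```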

4.1 Training models

The GBDT algorithm is implemented with scikit-learn, a machine learning library (Pedregosa et al. 2011). Parameters are tuned by grid search, and the results show that the GBDT model achieves its best performance when employing 100 base learners with a maximum depth of 3 and a learning rate of 0.1. The LSTM NN is constructed with the Keras package (Chollet 2015) using TensorFlow (Abadi et al. 2016) as the backend. This LSTM NN has three layers: the first two are LSTM layers with 64 and 32 neurons, respectively, and the last is a dense layer. LSTM models for each site are obtained by training \(D_m^s\) on the LSTM NN, with the time step set to 1 and the number of training epochs set to 150, as the performance of the model no longer improves on the validation dataset after 150 epochs of training. The modelling dataset \(D_m^s\) is also trained with the SVM algorithm implemented in the Machine Learning Toolbox provided by MATLAB. Compared with the LSTM algorithm, the SVM has fewer hyperparameters to tune. As mentioned in Sect. 2.3, the Gaussian kernel function is used as the kernel function, and the SMO is chosen as the solver of the applied SVM.
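The LSTM network described above could be assembled in Keras roughly as follows; the layer sizes, epoch count, and input shape follow the text, whereas the optimiser and loss are assumptions, as they are not reported here.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two LSTM layers (64 and 32 neurons) followed by a dense output layer;
# input shape (1, 12): a time step of 1 and 12 physical variables.
model = keras.Sequential([
    keras.Input(shape=(1, 12)),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")  # assumed training settings
# model.fit(X_train.reshape(-1, 1, 12), y_train, epochs=150,
#           validation_data=(X_val.reshape(-1, 1, 12), y_val))
```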

Fig. 5 Workflow of the study case

The ML models are trained on a computer with an Intel Core i7 processor running at 2.60 GHz. On this platform, the GBDT, LSTM, and SVM algorithms take 0.7 s, 29.3 s, and 0.5 s on average to train a model, respectively. Clearly, the GBDT and SVM are time-efficient algorithms in this GNSS time series modelling case, whereas the LSTM algorithm, as an ANN approach, is time-consuming.

As mentioned in Sect. 2, these models could describe the underlying relationship \({\hat{Y}} = {\hat{f}}(\varvec{X})\) between input data \(\varvec{X}\) (physical variables) and output data Y (GNSS site vertical displacement or motion). For a given input vector x, the estimated output value \({\hat{y}}\) can be obtained through the model: \({\hat{y}} = {\hat{f}}(x)\).

Fig. 6 Fitting and prediction results derived from GBDT, LSTM, SVM, and LS models for 9 GNSS sites

The GBDT-fitting results \({\hat{Y}}_{GBDT, m}^{s}\) and GBDT-predicted results \({\hat{Y}}_{GBDT, t}^{s}\) are obtained through \({\hat{Y}}_{GBDT, m}^{s} = {\hat{f}}_{GBDT}^{s}(\varvec{X}_m^{s})\) and \({\hat{Y}}_{GBDT, t}^{s}= {\hat{f}}_{GBDT}^{s}(\varvec{X}_t^{s})\), respectively, where \(\varvec{X}_m\) and \(\varvec{X}_t\) indicate input data from the modelling dataset and test dataset, respectively. The fitting and prediction results of LSTM and SVM are obtained in the same way.

The fitting performance of the GBDT, LSTM, and SVM models is compared with an LS model applied by Heflin et al. (2020). The LS-fitting results are denoted as \({\hat{Y}}_{LS,m}^s\).

4.2 Fitting performance evaluation

To verify the performance of the GBDT, LSTM, SVM, and LS models, the root mean square error (RMSE) is adopted as the evaluation indicator and can be calculated through

$$\begin{aligned} \hbox {RMSE}=\sqrt{\frac{1}{M} \sum _{i=1}^{M}\left( y_{i}-{\hat{y}}_{i}\right) ^{2}} \end{aligned}$$
(22)

where \(y_i\) is the measured vertical coordinate at epoch i; \({\hat{y}}_{i}\) is the corresponding model-derived value; M is the number of epochs.
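In code, Eq. 22 amounts to a one-line helper:

```python
import numpy as np

def rmse(y, y_hat):
    """Eq. (22): root mean square error over M epochs."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))
```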

Fig. 7 Comparison of fitting precisions of GBDT, LSTM, SVM, and LS models

The fitting and prediction results derived from the three ML methods are illustrated in Fig. 6 and all show good consistency with the measured values. The LS curves are also plotted over the fitting period, but prediction curves are unavailable from the datasets collected. The statistics of fitting precision (FP) for the GBDT, LSTM, SVM, and LS methods (Fig. 7) show that all three ML methods achieve consistently better fitting performance than the LS method for each studied site. Among them, the GBDT achieves slightly better fitting precision. The fitting RMSEs of the GBDT models for the nine studied sites range from 2.7 mm (MOBS and P053 sites) to 4.3 mm (DARW site), with an average of 3.4 mm, an improvement of 36% compared with the 5.3 mm achieved by the LS method. The fitting results from the LSTM and SVM methods show similar improvements: the average FPs of the LSTM and SVM are 3.5 and 3.7 mm, improvements of 34% and 30% over the LS method, respectively.

In addition to the higher FPs achieved by the ML models relative to the LS models, the ML models have two advantages when modelling time series: (1) there is no need to detect breaks or discontinuities, and (2) there is no need to bridge missing data. As a result, discontinuous datasets can be used in ML processing without special treatment. When applying the LS models, it is necessary to detect the breaks in the time series, as these breaks significantly affect the LS fitting performance. For instance, according to our experiments with these nine sites, the LS FP decreases by 14% without break detection. However, if the ML models only use previous positions as input data, as in Alevizakou et al. (2018) and Wang et al. (2021), the continuity of the time series is still required. In contrast, the adopted ML models work well even when there are missing values in the modelling dataset, as these models use physical variables as inputs. In fact, when training GBDT models with the scikit-learn library, the modelling samples are randomly split into training and validation samples (the illustration of data division shown in Fig. 4 is inaccurate in this case). Therefore, missing samples barely affect model training. For example, although the PERT and KARR sites lost 315 and 172 days of data in the training data, respectively, the FPs of the GBDT models for these two sites still reach 3.1 and 3.2 mm. These results show no signs that the discontinuity of the modelling dataset has affected the adopted ML models. It should be noted that the disadvantages of the proposed ML approaches are also obvious: (1) more time is required for model training, and (2) collection of various physical variables is required. The physical variables were collected from different institutions and stored in different formats with different spatial or temporal resolutions. Therefore, the proposed ML models require considerable preparatory work compared with traditional methods such as the LS method.

4.3 Prediction performance evaluation

A model’s ability to generalise is critical to its success. Prediction performance refers to the model’s generalisation ability, i.e. how properly it adapts to unseen data in the time series. The right-hand part of Fig. 6 illustrates the prediction results of the three ML models for the future 48 months against the measured data. Although not plotted, our computation shows that the prediction errors of the LS models can grow to several centimetres within 24 months, indicating a significant limitation of the LS models for predicting GNSS site motions. In contrast, the good consistency between the predicted and measured data samples illustrates that the ML models are able to predict vertical displacements from the established physical variables.

Fig. 8 Prediction precision of GBDT, LSTM and SVM models versus predicted periods

The prediction results of the ML models shown in Fig. 6 also clearly reveal grouped sample outliers of 20–60 mm at the DARW and POVE sites. These outliers could be caused by data processing, antenna height changes, or site motion due to physical events such as earthquakes. The real causes require further investigation or cross-reference with results from different GNSS processing systems. For instance, the DARW time series from the Geoscience Australia database (https://gnss.ga.gov.au/network) did not present similar grouped outliers over the period of 31 January 2018 to 16 April 2018.

To evaluate the generalisation capability of these ML models in prediction, the 48-month test data period was equally divided into eight 6-month parts in chronological order. The prediction errors of the ML models over each part are grouped together to obtain RMSE values as the prediction precision (PP); a sketch of this windowed statistic follows below. The grouped sample outliers at the DARW and POVE sites are removed from the RMSE statistics. As shown in Fig. 8, the RMSE values for the GBDT, LSTM, and SVM methods show the prediction errors growing from 4–6 mm within the first 12 months to 5–8 mm in the second 12 months and 6–15 mm in the last 24 months. The three ML models have roughly similar prediction performance over the first 24 months. Over the next 24 months, however, the prediction precision varies with the ML model and the site. For instance, the LSTM models show slightly better prediction capability at most of the tested sites, while the GBDT models show superior prediction performance at the PERT and POVE sites. Unlike traditional LS models based only on the time variable for the input data (Bock and Melgar 2016; Heflin et al. 2020), the prediction performance of the same ML models can vary because of the different impacts of the physical factors depending on site location and local environment.
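The windowed statistic can be sketched as follows; eight equal chronological parts correspond to the eight 6-month windows of the 48-month test period.

```python
import numpy as np

def windowed_rmse(y, y_hat, n_windows=8):
    """Split the test-period errors into equal chronological parts and
    return one RMSE (prediction precision, Eq. 22) per part."""
    errors = np.asarray(y) - np.asarray(y_hat)
    return [float(np.sqrt(np.mean(e ** 2)))
            for e in np.array_split(errors, n_windows)]
```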

While generalisation capability is the essential assessment of ML models, the prediction performance of the GNSS coordinate time series is critical for deformation analysis and for the maintenance of a regional or global geodetic reference frame (Altamimi et al. 2011, 2016). As shown above, the comparison between predicted and measured values can detect and identify slow or sudden changes in the new samples, which could be displacements caused by site deformation or outliers caused by data treatments and antenna problems. A geodetic reference frame is realised by a large number of regionally or globally distributed geodetic reference stations with respect to a particular time epoch. For instance, the ITRF 2020 is referred to the epoch 2015.0 (Altamimi et al. 2018) and provides the station position kinematic model to compute the station position on other dates. The kinematic model, perhaps updated regularly with new time series data, includes terms for linear displacements by velocity, post-seismic deformation, and the annual and semi-annual signals (Altamimi et al. 2016). The fitted and predicted coordinate time series from ML models can alternatively be used to compute the current station coordinates of the reference stations with high precision.

4.4 Feature importance analysis

Some ML algorithms can derive feature importance (FI) parameters for each physical variable, which indicate the degree to which the variable affects the GNSS site displacements. For tree-based ensemble methods, the importance of a feature is computed as the (normalised) total reduction of the criterion brought by that feature. It is also known as the Gini importance or mean decrease in impurity (MDI) (Louppe et al. 2013). As the GBDT approach achieves better modelling and good short-term prediction performance compared with the other two ML approaches, the FIs derived from the GBDT models are used to study the relationship between the physical variables and GNSS site motions.
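With a fitted scikit-learn GBDT model, the MDI importances can be read off directly; the sketch below expresses them as percentages keyed by the 12 variables of Eq. 2.

```python
FEATURE_NAMES = ["t", "Px", "Py", "RA_S", "DEC_S", "DEL_S",
                 "RA_M", "DEC_M", "DEL_M", "TEM", "AP", "HYD"]

def feature_importance_percent(gbdt):
    """Return a fitted model's MDI feature importances as percentages,
    in the order of the 12 input variables of Eq. (2)."""
    return {name: 100.0 * fi
            for name, fi in zip(FEATURE_NAMES, gbdt.feature_importances_)}
```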

Table 1 Feature importance (%) for 12 input variables
Fig. 9 Relationship between feature importance of time and vertical motion speed

Fig. 10 Satellite images around DARW, P053, CEDA, and POVE sites (images from Google Maps)

The FIs for the 12 physical variables (Table 1) reveal that the time and hydrological variables are the two dominant factors affecting GNSS site displacements. It can be found from Fig. 6 and Table 1 that the FI of the time variable (FIOT) is higher when the site vertical linear motion is faster (e.g. the PERT site). The vertical linear motion speeds of the nine sites are calculated by least squares estimates (Heflin et al. 2020) and are shown in Fig. 9 along with the FIOT. Figure 9 roughly illustrates the positive correlation between the FIOT and site motion speed. For instance, the FIOT and vertical motion speed reach the highest values at the PERT site and the lowest values at the CEDA and POVE sites. However, the site motion speed could be dominated by other factors, such as the hydrological factor, as shown at P053.

The feature importance parameters indicate that the hydrologic variable could be another dominant factor. The hydrologic variable shows large feature importance at the DARW, P053, CEDA, and POVE sites, especially at the POVE site, where the feature importance of the hydrologic variable (FIOH) is over 70%. The high FIOHs of these four sites correspond with relevant features of their locations and terrains. The satellite images around the four sites extracted from Google Maps, as shown in Fig. 10, show that they are all close to water bodies (reservoirs, lakes, rivers, etc.). Water mass changes can cause elastic deformation at the Earth’s surface (Puskas et al. 2017). In addition, groundwater pumping can drive poroelastic deformation in agricultural regions (Argus et al. 2014). The DARW site is located in the south of Darwin, Australia. There are two reservoirs near this site: the Manton Dam, with a volume of 13.3 gigalitres (GL), is less than 1 km from the site, and the Darwin River Dam, which can hold up to 259.0 GL of water, is within 10 km of the site. The CEDA site, located in the state of Utah, USA, is about 20 km from the Great Salt Lake, one of the largest saltwater lakes in the world (Ghosal et al. 2021). The P053 site is located in Whitewater, Montana, USA. There are two water bodies about 30 km south of this site, the Nelson Reservoir and Lake Bowdoin, and the site is about 100 km from the Missouri River. Site displacements could also be affected by poroelastic deformation caused by groundwater pumping, as this site is located in an agricultural region of Montana; a large volume of groundwater extraction for agricultural irrigation could have a significant impact on site motion. The POVE site is located in Porto Velho, Rondônia, Brazil, which lies on the east bank of the Madeira River within the Amazon Rainforest and features a tropical monsoon climate. This site is also in an agricultural region. For all these reasons, the hydrologic variable achieves the highest feature importance at this site.

In addition, Table 1 indicates that the FI of the Sun’s position (\(\mathrm{RA}^S\), \(\mathrm{DEC}^S\) and \(\mathrm{DEL}^S\)) (FIOS) and the FI of the Moon’s position (\(\mathrm{RA}^M\), \(\mathrm{DEC}^M\) and \(\mathrm{DEL}^M\)) (FIOM) are all relatively small and vary from site to site. This result is not inconsistent with the strong influence of the Sun’s and Moon’s tide-generating forces on vertical displacement (Thurman 1994). We should note that the SET, which is the most influential tidal loading and can reach up to 0.4 m (Lambeck 1988), was corrected from the GNSS time series by means of the IERS 2010 SET model when processing the GNSS observations (Petit and Luzum 2010; Eanes 1983; Mathews et al. 2002). The small FIOS and FIOM results therefore only reflect the residual effects of these variables.

5 Conclusion

Long-term GNSS time series from widely distributed reference stations have played a fundamental role in geophysics and geodynamics studies, deformation monitoring, and the maintenance of regional and global geodetic reference frames. Over the past decades, GNSS coordinate time series have conventionally been fitted by the least squares (LS) model as a function of the time variable. This paper has proposed to model the underlying pattern between GNSS vertical time series and physical variables through the GBDT, LSTM, and SVM machine learning approaches. To form the machine learning problem, we turn the major impact factors into the input variables and the vertical time series into the outputs. Twelve variables related to the vertical site motion are identified, including the time variable, polar motion coordinates, Sun and Moon coordinates, and temperature, atmospheric pressure, and hydrologic parameters. The experimental results derived from nine GNSS sites have demonstrated that all the trained GBDT, LSTM, and SVM models are capable of capturing most features of the underlying site motion pattern. Compared with the LS model proposed by Heflin et al. (2020), the ML models have shown evidently higher fitting precision at the studied sites. The average fitting precisions of the GBDT, LSTM, and SVM models are 3.4, 3.5, and 3.7 mm, improvements of 36%, 34%, and 30%, respectively, with respect to the LS fitting results. In addition, all the ML approaches show consistent performance in modelling GNSS time series.

The GBDT, LSTM, and SVM approaches have also demonstrated good generalisation capability: the trained ML models are able to predict the vertical displacements using the established physical variables over the prediction period. The prediction precision of the ML models is slightly lower than the fitting precision, ranging from about 4 to 15 mm as the prediction term extends from 6 to 48 months. High prediction performance can enhance the detection of large displacement samples and the prediction of the current station coordinates of reference stations with high precision.

Overall, the ML models have shown evident improvement in the modelling of GNSS time series with respect to the conventional LS model. In addition, this paper has shown that ML models can be used to predict the vertical motion over long future periods, thus enhancing and extending the applications of GNSS time series analysis in geodesy and geodynamics. Future research may extend the adoption of different ML models to fit and predict GNSS time series in all three coordinate components and other types of geodetic time series. This may enable the inclusion of more and different impact factors or features for individual sites or data types and generate more complete knowledge about using ML approaches for geodetic data analytics and applications.