1 Introduction

Radio signals emitted by Global Navigation Satellite System (GNSS) satellites propagate through the atmosphere before being received on Earth. The signals are delayed and bent when travelling through the neutral atmosphere and this delay can be estimated (Nilsson et al. 2013). Typically, the tropospheric delays in zenith direction, i.e. the delays in the lowest part of the atmosphere, are split into a zenith hydrostatic delay (ZHD) and a non-hydrostatic or zenith wet delay (ZWD). The sum of ZHD and ZWD is the zenith total delay (ZTD), which amounts to roughly 2.4 m for an observer at mean sea level. The hydrostatic part makes up the majority of the ZTD and can be modelled to high accuracy with analytical methods, e.g. by using the equation of Saastamoinen (Saastamoinen 1972b). Although the wet part only contributes up to 0.40 m, it is more variable in space and time. At present there is no analytical model of ZWD with sufficient accuracy, and hence, it is typically estimated empirically.
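As a rough numerical illustration of the split into ZHD and ZWD, the following sketch evaluates the Saastamoinen hydrostatic delay in the form widely quoted in the GNSS literature. The constants (0.0022768 m/hPa, 0.00266, 0.28e-6 per metre), the function name, and the example wet delay are assumptions for illustration and are not taken from this study.

```python
import math


def saastamoinen_zhd(pressure_hpa, lat_deg, height_m):
    """Zenith hydrostatic delay in metres (widely quoted Saastamoinen form)."""
    # Latitude- and height-dependent gravity correction (assumed constants).
    f = 1.0 - 0.00266 * math.cos(2.0 * math.radians(lat_deg)) - 0.28e-6 * height_m
    return 0.0022768 * pressure_hpa / f


zhd = saastamoinen_zhd(1013.25, 45.0, 0.0)  # standard pressure at mean sea level
zwd = 0.10                                  # hypothetical wet delay in metres
ztd = zhd + zwd                             # zenith total delay
```

For standard sea-level pressure the hydrostatic part comes out at about 2.3 m, which is consistent with a ZTD of roughly 2.4 m once a wet delay of the order of 0.1 m is added.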

Estimating ZWD with high accuracy is important for GNSS positioning (Ibrahim and El-Rabbany 2011; Wilgan 2015; Hadas et al. 2013), as it represents a major source of error. Furthermore, ZWD is proportional to the water vapour content along the signal path, and therefore plays an important role in GNSS meteorology (Bevis et al. 1992), with applications in weather monitoring, forecasting, and climate research (Bevis et al. 1994; Karabatić et al. 2011; Benevides et al. 2013; Seco et al. 2012; Zhao et al. 2018).

Therefore, many studies have investigated new methods to improve state-of-the-art ZWD models, such as the Hopfield model (Hopfield 1971), the Saastamoinen model (Saastamoinen 1972a), the global pressure and temperature (GPT)2w model and the GPT3 model (Böhm et al. 2015; Landskron and Böhm 2018), to name a few. Recently, machine learning (ML) approaches have also been used to construct models of tropospheric delays. Zhang et al. (2022) proposed a transformer-based global ZTD forecasting model while Yang et al. (2021) established a regional ZTD model based on the GPT3 model and artificial neural networks (ANNs). ANNs have also been used in studies by Mohammed (2021) and Selbesoglu (2020) to predict ZWDs. More recently, Ding (2022) developed a global ZWD model using neural networks that led to a better accuracy compared to state-of-the-art ZWD models (Yang et al. 2021; Böhm et al. 2015).

In this study, an ML-based model is trained based on ZWD observations of 10,718 GNSS stations during the year 2019. The reference ZWD is taken from the Nevada Geodetic Laboratory (NGL) (Blewitt et al. 2018) and the input features are the geographical location of the GNSS station, as well as the reference time epoch and meteorological variables, in particular, specific humidity on six pressure levels obtained from the ERA5 data set (Hersbach et al. 2020). The proposed model reaches centimetre-level accuracy in spatio-temporal interpolation mode, i.e. when predicting the ZWD at arbitrary spatial locations within the reference period. The temporal prediction accuracy is \(\approx 2\times \) lower, but still reasonable, when extrapolating to potentially unknown atmospheric conditions outside the reference period.

Compared to the existing, relatively small-scale studies, our model is much broader. First of all, many more GNSS stations have been used to create and evaluate the established ML-based model. While Zhang et al. (2022) and Mohammed (2021) only used 505 stations, the ML-based ZWD model proposed in this study is based on 13,402 globally distributed stations (10,718 training stations and 2684 test stations). The higher number of stations leads to better performance and better generalisation of the model, especially in regions with a sparse GNSS station network. Second, in contrast to all previously mentioned ML models, our proposed ZWD model does not rely on prior ZWDs or ZWD properties to make its predictions. It is based entirely on meteorological variables, position, and time information. Therefore, the model can be applied anywhere on Earth, not only at the locations of existing GNSS stations, opening up a wide variety of applications ranging from climate research to more accurate navigation with low-cost GNSS devices, including smartphones. For the present study, the model utilised post-processed meteorological data that is available with a temporal lag of five days. Forecasts of the meteorological values are also available and could replace the reprocessed values if the model is to be used for ZWD forecasting. Here, we focus on global spatial modelling of ZWDs within a given year.

In Sect. 2, the reference ZWD as well as the meteorological variables are presented. Section 3 introduces the methodology by giving an overview of the algorithms used (Sect. 3.1), the setup (Sect. 3.2), and the validation metrics (Sect. 3.3). In Sect. 4, the ZWD predictions of the final model are shown, discussed, and thoroughly evaluated (Sects. 4.1, 4.2, 4.3, 4.4). Furthermore, several comparisons with independently computed ZWDs (Sect. 4.5) and baseline models (Sect. 4.6) have been carried out. Section 5 contains a discussion of the global model by comparing it to regional (Sect. 5.1.1) and monthly models (Sect. 5.1.2). Furthermore, the global model is applied to a different year, thus making temporal predictions (Sect. 5.2), and the applicability of the model is explained (Sect. 5.3). Sections A.1 and A.2 in the appendix explain the comparison of different ML algorithms and the feature selection process in more detail. Finally, Sect. 6 summarises the findings of the study and gives an outlook on future plans and further improvements.

2 Data

2.1 Zenith wet delay

Zenith wet delay (ZWD) estimates have been provided by the Nevada Geodetic Laboratory (NGL) since 1994 for a global network of GNSS stations with a temporal resolution of five minutes (Blewitt et al. 2018). NGL processes the GNSS measurements by using Jet Propulsion Laboratory’s (JPL) GipsyX 1.0 software (Bertiger et al. 2020) and JPL’s Repro 3.0 orbits and clocks. The tropospheric delay is calculated using the Vienna mapping function 1 (VMF1) (Boehm et al. 2006), having separate mapping functions for the ZHD and ZWD, and its gridded map products of a-priori ZHD and ZWD. Additionally, north–south and east–west gradients are estimated together with ZWD as piece-wise constants. The processing assumes a correct ZHD, with the residual delay in the ZWD. Thus, small errors in a-priori ZHD do not affect the final ZTD but might affect ZWD. The details on the data processing strategy can be found in http://geodesy.unr.edu/gps/ngl.acn.txt (last access: 25 January 2024).

For the present study, we use the ZWDs of 13,440 GNSS stations from the year 2019. To match the temporal resolution of the meteorological variables (see Sect. 2.2), the ZWD data set is down-sampled to an hourly resolution by taking the ZWD values at every full hour. This leads to a total of 117,734,400 potential samples (8760 hourly time steps \(\times \) 13,440 stations). Since not all GNSS stations are recording continuously, 21,462,376 are missing, resulting in a total of 96,272,024 available samples. Although the ML algorithm is resilient against a moderate number of outliers (see Sect. 4.2), a rigorous outlier detection procedure for the ZWD data has been established. It employs four different filters: (1) The 1 % ZWD estimates with the highest uncertainties (\({>}\,3.5\) mm), i.e. the standard deviations according to the product, were removed (968,173 samples). (2) Negative ZWD estimates were removed because they are physically meaningless (396,619 samples). Those estimates are likely due to ZHD modelling errors. (3) All ZWD estimates were removed that differ from the 5-hour floating median by more than 3\(\times \) their standard deviation (2,494,624 samples). (4) There are 573 sites that have at least two co-located stations within a distance of 1 km, covering a total of 1300 stations (roughly 10 % of all stations). For each of those sites, the median ZWD estimate per hour was calculated, and co-located stations with an offset above 5 mm from that median were removed (27 stations, 140,123 samples).

Cumulatively, the procedure flagged 3,922,694 unique outlier samples, or 4.1 % of the ZWD data set. After discarding them, we are left with 13,402 GNSS stations that still have observations (92,349,330 samples).
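The sample-wise filters (1) to (3) can be sketched for a single station as follows. This is a minimal illustration under several assumptions that the text does not specify: the 1 % criterion is applied through its equivalent 3.5 mm threshold, the floating median uses a centred window that is truncated at the series boundaries, and "their standard deviation" is read as the formal uncertainty of each estimate. The function name and the toy values are hypothetical.

```python
from statistics import median


def flag_outliers(zwd, sigma, sigma_max=3.5, window=5, k=3.0):
    """Return one boolean per hourly sample; True marks a flagged outlier.

    zwd, sigma: equally long lists of ZWD estimates and their formal
    standard deviations in mm (hourly, without gaps in this toy example).
    """
    half = window // 2
    flags = []
    for i, (value, s) in enumerate(zip(zwd, sigma)):
        # Centred window, truncated at the series boundaries (assumption).
        neighbours = zwd[max(0, i - half): i + half + 1]
        too_uncertain = s > sigma_max                          # filter (1)
        negative = value < 0.0                                 # filter (2)
        off_median = abs(value - median(neighbours)) > k * s   # filter (3)
        flags.append(too_uncertain or negative or off_median)
    return flags


zwd = [100.0, 101.0, 150.0, 102.0, -1.0, 103.0]   # toy ZWD series in mm
sigma = [1.0, 1.0, 1.0, 4.0, 1.0, 1.0]            # toy formal uncertainties in mm
flags = flag_outliers(zwd, sigma)
```

In this toy series the spike of 150 mm, the sample with an uncertainty of 4 mm, and the negative estimate are flagged, while the remaining samples are kept. Filter (4) operates on groups of co-located stations and is not covered by this sketch.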

The distribution of the GNSS stations is illustrated in Fig. 1. It can be seen that the spatial distribution is far from homogeneous. Most stations are located in the northern hemisphere, especially in North America and Europe. However, in Asia (except for Japan, South Korea, and Nepal) the density of GNSS stations is very limited. In the southern hemisphere, the distribution of the stations is much sparser. In particular, in Africa and South America, only very few stations are available.

Figure 2 shows the completeness of the 13,402 utilised GNSS stations and the number of ZWD estimates per hour for the year 2019. The completeness is calculated by dividing the number of samples of each time series by the full number of hourly epochs in 2019 (i.e. 8760). The median completeness is 93 %; 58 % of all stations have a completeness of over 90 %, and only 17 % of all stations have a completeness of less than 50 %.

The variation of the number of ZWD estimates per hour is small throughout the year. The average number is 10,542 (out of a possible maximum of 13,402) with a standard deviation of 372, which means that, on average, ZWD estimates from 79 % of all stations are available every hour.
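The completeness measure and the average hourly availability follow directly from the sample counts given in Sect. 2.1. The short sketch below reproduces that arithmetic; the variable and function names are chosen for illustration only.

```python
HOURS_2019 = 365 * 24        # 8760 hourly epochs in 2019
N_STATIONS = 13_402          # stations remaining after outlier removal
N_SAMPLES = 92_349_330       # samples remaining after outlier removal


def completeness(n_station_samples, n_epochs=HOURS_2019):
    """Fraction of hourly epochs for which a station provides a ZWD estimate."""
    return n_station_samples / n_epochs


mean_per_hour = N_SAMPLES / HOURS_2019           # average number of estimates per hour
mean_availability = mean_per_hour / N_STATIONS   # average share of stations per hour
```

The quotient of the total sample count and the number of epochs gives about 10,542 estimates per hour, i.e. roughly 79 % of the 13,402 stations.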

Fig. 1 Distribution of the 13,402 utilised GNSS stations

Fig. 2 Completeness of the 13,402 utilised GNSS stations (left) and the number of ZWD estimates per hour at the 13,402 utilised GNSS stations for the year 2019 (right)

Several papers have examined the quality of NGL’s troposphere product by comparing it to other products. A study by Ding and Chen (2020) used the NGL troposphere data to evaluate the performance of the empirical troposphere model GPT3 and thus, assessed the accuracy of the NGL troposphere products. They compared 26 representative common stations of the International GNSS Service (IGS) and NGL and concluded that NGL’s ZTD has the same accuracy as IGS’s ZTD. Thus, they state that NGL’s troposphere product can be used as a reference to evaluate troposphere models. In another study by Ding et al. (2022), a very high level of agreement between precipitable water vapour (PWV) data derived from radiosonde measurements and GNSS-derived PWV from NGL has been found. Since PWV can be derived directly from ZWD, it is also an indicator of the good quality of the NGL’s ZWD. The characteristic differences in tropospheric delays between NGL products and numerical weather model ray-tracing are discussed in Ding et al. (2023). They found that in most regions the products correspond well, although in some high-altitude regions such as the Andes, the differences reach the cm-level. A recently published study (Yuan et al. 2023) carried out data screening of NGL’s ZTD values and flagged fewer than 0.5 % of observations as outliers, which again demonstrated the good quality of NGL’s troposphere product. These studies support the use of the product as a basis for a global ZWD model. Furthermore, there exists no comparable GNSS tropospheric data set in terms of dense global coverage, which is essential for our study. However, when using our ZWD predictions, the shortcomings of the NGL data must be taken into account as our model uses it as reference data.

2.2 Meteorological variables

The meteorological variables are provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) data set (Hersbach et al. 2020). ERA5 is the fifth-generation ECMWF atmospheric reanalysis of the global climate covering the period from 1940 to the present. It provides hourly estimates for a large number of atmospheric, land, and oceanic climate variables on a regular latitude-longitude grid of 0.25 degrees. The data can be accessed through the Climate Data Store and are available either as single level data (roughly at surface level) or on 37 pressure levels ranging from 1000 hPa to 1 hPa. Based on expert knowledge, for our study, we selected the variables at a pressure level of 1000 hPa, except for specific humidity, where six pressure levels (1000, 900, 800, 650, 450, and 300 hPa) are used, as further discussed in Sect. A.2 in the appendix. Table 1 lists the variables that are used within this study.

Table 1 Meteorological variables from the ERA5 data sets that are used in this study

3 Methodology

In order to utilise machine learning (ML) for the determination of ZWD, three important questions have to be clarified. First, a predictive set of input features must be chosen. Second, a suitable ML algorithm has to be identified. Third, the best-performing hyper-parameters for that algorithm have to be found.

Each of these aspects depends on the others, but the huge number of possible combinations precludes an exhaustive search. We follow standard practice and take an iterative approach to determine the best-performing setup. As a start, the 13,402 stations are randomly split into 80 % training (10,718) and 20 % test stations (2684). The test stations are only utilised in the final evaluation, presented in Sect. 4, to have an independent data set. For the experiments carried out to find the best model setup, we rely only on the training stations. To that end, they are further split into five folds of equal size, four for training and one for model validation.
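The station-wise split can be sketched as follows. The important point is that whole stations, not individual samples, are assigned to the training, validation, and test portions. The function name, the seed, and the toy number of 100 stations are assumptions for illustration; the sketch does not reproduce the exact station counts of this study.

```python
import random


def split_stations(station_ids, test_fraction=0.2, n_folds=5, seed=0):
    """Randomly split station IDs into training/test sets and training folds."""
    rng = random.Random(seed)
    ids = list(station_ids)
    rng.shuffle(ids)
    n_test = round(test_fraction * len(ids))
    test, train = ids[:n_test], ids[n_test:]
    # Distribute the training stations over n_folds (nearly) equal folds.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds


train, test, folds = split_stations(range(100))   # 100 hypothetical stations
validation = folds[0]                             # one fold for validation
fit_stations = [s for fold in folds[1:] for s in fold]  # four folds for fitting
```

Rotating the role of the validation fold over all five folds yields a standard five-fold cross-validation on the training stations.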

In the first investigation, several different ML algorithms were tested that are introduced in Sect. 3.1. In these runs, an initial set of features was selected based on expert knowledge. In total, the 12 meteorological variables at a pressure level of 1000 hPa listed in Table 1 were selected, as well as nine position and time variables describing the geographical location of the GNSS station and sample time epoch, as further discussed in Sect. 3.2.

For every ML algorithm, a hyper-parameter tuning based on grid search was carried out to optimise the predictive performance on the validation set. The results of this initial comparison are presented in detail in Sect. A.1 in the appendix. Based on that investigation, the XGBoost method was found to be the most promising candidate, in line with many other ML tasks based on relatively low-dimensional feature sets (Yan et al. 2020; Lundberg et al. 2018; Hengl et al. 2017; Xia et al. 2017; Ziȩba et al. 2016).

In the second step, we performed a detailed feature selection for XGBoost. Several combinations of meteorological variables were studied, as further discussed in Sect. A.2 in the appendix. We found that specific humidity values at six different pressure levels, in combination with the previously mentioned representations of the position and time variables, provide good prediction performance. A full list of the features used in our final ZWD model is given in Sect. 4 in Table 2.

The hyper-parameters were then tuned again to find the best setting for the adapted feature set. Neither the hyper-parameters themselves nor the overall performance changed significantly, which points to the robustness and generality of the model. The hyper-parameters that lead to the most accurate predictions are listed in Sect. 4 in Table 3.

3.1 Algorithms

To cover a broad range of ML schemes, we tested four representative methods from the vast pool of possible ML algorithms: a linear method, an exemplar-based method, a neural network, and a tree-based ensemble approach:

  • Least Absolute Shrinkage and Selection Operator (LASSO) regression (Tibshirani 1996)

  • K-Nearest Neighbours (KNN) (Fix and Hodges 1989; Cover and Hart 1967)

  • Multilayer Perceptron (MLP) (Rosenblatt 1957; Rumelhart et al. 1986; LeCun et al. 2012)

  • Extreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016)

XGBoost is a tree-based ensemble learning scheme. Shallow regression trees as weak learners are combined into a strong learner with gradient boosting, i.e. the trees are sequentially learned such that they correct prediction errors of the previous stage. XGBoost is also known for its ability to capture highly nonlinear dependencies, as well as for its computational efficiency and scalability. It has achieved state-of-the-art results across a wide range of prediction tasks (Chen and Guestrin 2016).
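The principle of gradient boosting with shallow trees can be illustrated with a deliberately simplified sketch: depth-1 regression trees (stumps) on a single feature are fitted sequentially to the residuals of the current ensemble, which are the negative gradients of the squared loss. This is not the XGBoost implementation, which additionally uses a regularised objective and second-order gradient information; all names and the toy data are hypothetical.

```python
def fit_stump(x, residual):
    """Depth-1 regression tree minimising the squared error on one feature."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # keeps both sides non-empty
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - mean_l) ** 2 for r in left)
               + sum((r - mean_r) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_l, mean_r)
    return best[1:]


def fit_boosting(x, y, n_trees=50, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current prediction."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residual = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient
        thr, mean_l, mean_r = fit_stump(x, residual)
        stumps.append((thr, mean_l, mean_r))
        pred = [pi + learning_rate * (mean_l if xi <= thr else mean_r)
                for xi, pi in zip(x, pred)]
    return base, learning_rate, stumps


def predict(model, xi):
    """Sum the shrunken contributions of all stumps on top of the base value."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(ml if xi <= thr else mr
                                      for thr, ml, mr in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]                          # toy nonlinear target
model = fit_boosting(x, y)
mse_mean = mse(y, [sum(y) / len(y)] * len(y))      # constant baseline
mse_boost = mse(y, [predict(model, xi) for xi in x])
```

Each added stump corrects part of the remaining error, so the training error of the ensemble falls below that of the constant baseline even though every individual tree is a weak learner.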

A short description, as well as a comparison of the results of the other ML methods, can be found in Sect. A.1 in the appendix.

Additionally, two widely used methods for spatial interpolation are selected to serve as baseline models:

  • Ordinary Kriging (Krige 1951)

  • 3D Delaunay triangulation (Delaunay 1934)

Kriging is a spatial interpolation technique. It estimates the best linear unbiased prediction (BLUP) at an unobserved location as a weighted average of the nearby observations (Krige 1951). The weights are derived via a kernel function ("variogram") that specifies the spatial covariance structure of the target variable.

Another well-known approach to spatial interpolation in 3D is to explicitly link the 3D locations into a tetrahedral mesh with the 3D Delaunay method (Delaunay 1934), then linearly interpolate within each tetrahedron.

Table 2 List of features utilised in the final XGBoost model

These baseline models were only applied to the geographical information (latitude, longitude, height) for each time step to provide a comparison with the ML models operating on the geographical as well as meteorological parameters.

3.2 Setup

As already described at the beginning of Sect. 3, all available GNSS stations are randomly split into training, validation, and test stations. For each portion, a target vector y (\(y_{train}\), \(y_{val}\), \(y_{test}\)) and a feature matrix X (\(X_{train}\), \(X_{val}\), \(X_{test}\)) are created.

Regardless of the ML method used, the learning setup in our study is always the same. The vector y, of length \([\#samples]\), is created by concatenating the station ZWD time series and represents the regression targets, in our case ZWD estimates from NGL. Missing data are not filled in but simply discarded.

The feature matrix X has dimension \([\#samples \times \#features]\) and is composed of position and time variables (i.e. the geographical location \(\phi \), \(\lambda \), h of the GNSS station and the sampling timestamp) and the corresponding meteorological variables, found by nearest-neighbour lookup in the ERA5 grids.
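A nearest-neighbour lookup on a regular 0.25° grid reduces to rounding the station coordinates to grid indices. The sketch below assumes a grid layout that is not specified in the text: rows run from 90° N to 90° S and columns from 0° E eastwards. The function name is hypothetical.

```python
def nearest_grid_index(lat_deg, lon_deg, step=0.25):
    """Row and column of the nearest node on a regular global grid.

    Assumed layout (not specified in the text): rows run from 90 N to 90 S,
    columns from 0 E up to (but excluding) 360 E, both with spacing `step`.
    """
    n_lon = round(360.0 / step)
    row = round((90.0 - lat_deg) / step)
    # Wrap negative longitudes into [0, 360) and close the grid at 360 = 0.
    col = round((lon_deg % 360.0) / step) % n_lon
    return row, col


row, col = nearest_grid_index(47.4, 8.5)   # hypothetical station location
node_lat = 90.0 - row * 0.25               # latitude of the selected grid node
node_lon = col * 0.25                      # longitude of the selected grid node
```

For the hypothetical station at 47.4° N, 8.5° E the lookup selects the grid node at 47.5° N, 8.5° E, whose meteorological values would then be used as features.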

Three variables are extracted from the timestamps of each observation, which are in UTC: absolute time as a continuous, real-valued number (t); the day of the year (doy); and the hour of the day (hod). The rationale is that periodic daily and yearly signals are to be expected in ZWD time series, which are represented more directly in terms of hod and doy. To account for the cyclic nature of doy, hod, and \(\lambda \), the former two are normalised to the range \([0,\;2\pi )\) and all three are then transformed to pairs of \(\sin (\cdot )\) and \(\cos (\cdot )\) values, resulting in two features per variable.
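The construction of the nine position and time features can be sketched as follows. The exact normalisation of doy and hod, the unit of the absolute time t, and the ordering of the features are assumptions made for this illustration; the function name is hypothetical.

```python
import calendar
import math
from datetime import datetime, timezone


def position_time_features(lat_deg, lon_deg, height_m, epoch_utc):
    """Nine position and time features for one sample (illustrative only)."""
    days = 366 if calendar.isleap(epoch_utc.year) else 365
    # Normalise doy and hod to [0, 2*pi); longitude is converted to radians.
    doy_angle = 2.0 * math.pi * (epoch_utc.timetuple().tm_yday - 1) / days
    hod_angle = 2.0 * math.pi * epoch_utc.hour / 24.0
    lon_angle = math.radians(lon_deg)
    t = epoch_utc.timestamp() / 86400.0   # absolute time in days (assumed unit)
    return [
        lat_deg, height_m,
        math.sin(lon_angle), math.cos(lon_angle),
        t,
        math.sin(doy_angle), math.cos(doy_angle),
        math.sin(hod_angle), math.cos(hod_angle),
    ]


epoch = datetime(2019, 1, 1, 6, tzinfo=timezone.utc)
features = position_time_features(47.4, 8.5, 500.0, epoch)
```

With the sine and cosine pair, the last hour of a day and the first hour of the next day are mapped to neighbouring points on the unit circle, which a single linear hod value could not express.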

Following best practice, the feature matrix X is standardised by subtracting the mean feature vector and scaling each feature dimension to unit variance, before feeding it to the ML algorithms.
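Standardisation can be sketched in a few lines. The sketch assumes, as is common practice but not stated explicitly in the text, that the mean and the standard deviation are estimated on the training features only and are then reused for the validation and test features. All names and values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Column-wise mean and standard deviation of a feature matrix."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]   # guard against constant columns
    return means, stds


def transform(rows, means, stds):
    """Subtract the mean and scale every feature to unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # toy feature matrix
means, stds = fit_scaler(X_train)
X_train_std = transform(X_train, means, stds)
X_test_std = transform([[2.0, 40.0]], means, stds)  # reuses the training statistics
```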

3.3 Validation metrics

All quantitative results are computed on the validation fold(s) of the training set during model comparison, hyper-parameter tuning (Sect. A.1), and feature selection (Sect. A.2). Only the evaluation of the final model (Sect. 4) uses the test stations, to have an independent data set. For each test station i, we calculate the root mean squared error (RMSE; eq. (1)) and the mean absolute error (MAE; eq. (2)) between the predicted ZWD \(\hat{y}\) and the (NGL-based) reference value y. As global summary statistics, the station-wise RMSEs and MAEs are combined by calculating their weighted means (WRMSE; eq. (3) and WMAE; eq. (4)), with weights proportional to the number of samples per station (\(\#samples_i\)). WRMSE and WMAE serve as overall performance metrics.

$$\begin{aligned} \text {RMSE}_i&= \sqrt{\frac{\sum _{j=1}^{\#samples_i}(y_{i,j}-\hat{y}_{i,j})^{2}}{\#samples_i}} \end{aligned}$$
(1)
$$\begin{aligned} \text {MAE}_i&= \frac{\sum _{j=1}^{\#samples_i}\left| y_{i,j}-\hat{y}_{i,j}\right| }{\#samples_i} \end{aligned}$$
(2)
$$\begin{aligned} \text {WRMSE}&= \frac{\sum _{i=1}^{\#stations}(\#samples_{i} \cdot \text {RMSE}_{i})}{\sum _{i=1}^{\#stations}\#samples_{i}} \end{aligned}$$
(3)
$$\begin{aligned} \text {WMAE}&= \frac{\sum _{i=1}^{\#stations}(\#samples_{i} \cdot \text {MAE}_{i})}{\sum _{i=1}^{\#stations}\#samples_{i}} \end{aligned}$$
(4)
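Equations (1) to (4) translate directly into code. The following sketch computes the station-wise errors and their sample-weighted means for two hypothetical stations; the function names and the toy values are chosen for illustration.

```python
import math


def station_errors(y_true, y_pred):
    """RMSE (eq. 1), MAE (eq. 2), and sample count of one station."""
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return rmse, mae, len(diffs)


def weighted_mean(values, weights):
    """Mean weighted by the number of samples per station (eqs. 3 and 4)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Two hypothetical stations with reference and predicted ZWD in mm.
stations = [
    ([100.0, 110.0], [103.0, 106.0]),
    ([50.0, 60.0, 70.0, 80.0], [50.0, 60.0, 70.0, 82.0]),
]
rmses, maes, counts = zip(*(station_errors(y, y_hat) for y, y_hat in stations))
wrmse = weighted_mean(rmses, counts)
wmae = weighted_mean(maes, counts)
```

Because the weights are proportional to the number of samples, a station with a long, complete time series contributes more to WRMSE and WMAE than a station with many gaps.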

4 Results

In this section, results for our final, best-performing global model, based on XGBoost, are presented. The features used in that model are listed in Table 2 and the hyper-parameters are given in Table 3. In the appendix, Sects. A.1 and A.2 give detailed insights into why XGBoost was selected, how the hyper-parameters were chosen, and how the features were selected.

Table 3 Hyper-parameters of the final XGBoost model

4.1 Internal validation

To evaluate how well our model is able to reproduce the behaviour of the reference solution used as training target, its predictions at the test stations (\(\hat{y}_{test}\)) are compared to the corresponding reference values from NGL (\(y_{test}\)). In Fig. 3, XGBoost predictions of ZWD are plotted against NGL reference values. The values cluster tightly along the identity line (white, dashed) across the entire range from 0 to 400 mm ZWD, with barely any outliers. Moreover, positive and negative deviations from the ideal diagonal are symmetric, meaning that the model does not systematically over- or underpredict anywhere in the relevant range.

Fig. 3 Comparison of predicted ZWD values to reference values at the test stations

Figure 4 displays a histogram of the station-wise RMSE and MAE values. The distribution is skewed towards 0, meaning that most stations have small errors, while there are few stations with significantly larger errors. The weighted means of the error distributions, corresponding to WRMSE and WMAE, are 8.1 mm and 6.1 mm, respectively. Upon inspection, most of the stations with large errors (> 20 mm) are located near the coast or on islands, predominantly in tropical or subtropical regions. We speculate that in those areas the meteorological parameters may be less accurate.

Fig. 4 Distribution of station-wise RMSE and MAE at test stations, relative to NGL reference values. Vertical lines denote WRMSE (8.1 mm) and WMAE (6.1 mm). Stations with errors larger than 20 mm are grouped in the last bin

The spatial distribution of the test stations’ RMSE values is depicted in Fig. 5. The MAE distribution exhibits a very similar pattern and is not separately shown. A number of interesting trends can be seen. Test stations in areas with a dense GNSS station network (conterminous USA, Europe, Japan, and south-eastern Australia) tend to have lower errors. As a uniform random train/test split is used, the distribution of training stations is comparable to the one in the figure. In other words, predictive skill is better in areas with a high density of GNSS stations (and thus many training examples), as expected. This behaviour is further studied in Sect. 5.1.1, by constructing regional models. We also note that, at comparable (low) station density, the errors tend to be higher in tropical regions than in the Arctic and Antarctic, which can be explained by the much larger absolute ZWD values and their variability, another expected behaviour that we revisit in Sect. 5.1.2.

Fig. 5 Spatial distribution of the test stations’ RMSEs w.r.t. NGL reference values

To better understand the behaviour of the ML model, the station-wise RMSE and MAE values of the test stations are related to the geographical similarity with the nearest training stations. For each test station, the Euclidean distance and the height difference to the nearest training location are determined. Then, the correlations between these values and the ZWD errors (both RMSE and MAE) are computed, see Fig. 6. The intuition behind this investigation is to see how strongly the predictive skill of the model depends on having a training station close by. All four correlation coefficients lie in the range of roughly 0.30 to 0.33. In other words, having a nearby station does play a certain role, but the model does not just memorise the training station values (in which case the correlation with distance would have to be higher). We point out that the observed correlations are likely skewed, due to the imbalanced distribution of the distances with many more stations from areas with dense GNSS networks, and consequently also small station-to-station distances. In addition, we also computed the correlation with the absolute station height, which is very low (\(\le \) 0.06). We speculate that several effects related to the absolute station height might cancel each other out. Overall, there are fewer stations at higher altitudes, thus fewer samples to train the model. Furthermore, maintenance of stations at higher altitudes is in general more difficult, which might affect the quality of their observations. However, at higher altitudes, the ZWDs are typically smaller and consequently have smaller errors.

Fig. 6 RMSE and MAE at test stations, plotted against the distance to the nearest training station (left), the height difference to the nearest training station (middle), and the absolute height (right). Correlation coefficients are given in the upper right corner of each plot

4.2 Robustness against outliers

Despite the thorough outlier detection scheme described earlier (Sect. 2.1), some outliers may remain in the data set. In ML applications based on large data sets, it is not feasible to perform a manual outlier detection by individually inspecting every time series. Instead, the models are designed to be robust and/or to include automatic quality control.

To evaluate the robustness of the model against noisy data and outliers empirically, the following investigations have been conducted. First, the ZWDs of some training stations were modified to produce erroneous data. Next, a new XGBoost model was trained based on the altered ZWDs. Finally, the model was evaluated on the unaltered test stations. This test was conducted for outlier rates of 1 % (107 stations) and 5 % (536 stations), significantly higher than the actual outlier ratio of NGL, which lies below 0.5 % according to previous studies (Yuan et al. 2023). The ZWDs were perturbed by various levels of white noise, and systematically biased by values between 1 and 20 mm. The artificial degradations thus reflect the entire range from small deviations that would be expected due to coordinate estimation errors up to untypically large values. Table 4 lists the WRMSE changes for all tested perturbation levels of the training data, as a way to quantify the robustness of the proposed ML approach.
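The degradation of the training targets can be sketched as follows. A random subset of training stations is selected according to the outlier rate, and the ZWD series of those stations are shifted by a constant bias and/or perturbed by Gaussian white noise. The function name, the seed, and the toy data are hypothetical, and the sketch makes no claim about how the perturbations were combined in the actual experiments.

```python
import random


def perturb_stations(zwd_by_station, outlier_rate, bias_mm=0.0,
                     noise_std_mm=0.0, seed=0):
    """Return a degraded copy of the training targets and the affected IDs."""
    rng = random.Random(seed)
    ids = sorted(zwd_by_station)
    n_bad = round(outlier_rate * len(ids))
    bad = set(rng.sample(ids, n_bad))        # stations that receive bad data
    degraded = {}
    for sid, series in zwd_by_station.items():
        if sid in bad:
            degraded[sid] = [v + bias_mm + rng.gauss(0.0, noise_std_mm)
                             for v in series]
        else:
            degraded[sid] = list(series)     # untouched copy
    return degraded, bad


# 100 hypothetical training stations with a constant ZWD of 100 mm.
clean = {sid: [100.0, 100.0, 100.0] for sid in range(100)}
degraded, bad = perturb_stations(clean, outlier_rate=0.05, bias_mm=10.0)
```

A model trained on the degraded targets is then evaluated against the unaltered test stations, so that any change in WRMSE can be attributed to the injected errors.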

Table 4 Change in WRMSE based on modified ZWD training data. Positive values indicate a degradation in WRMSE. The reference WRMSE based on the unaltered ZWD is 8.1 mm

For all tested white noise levels and biases, the resulting WRMSE changes are within 0.1 mm in the case of 1 % of artificial outliers. Even with the exaggerated 5 % outlier tests, the changes in terms of WRMSE are within 0.2 mm for all tested white noise levels and for biases up to 10 mm, before growing to 0.3 mm and 0.6 mm for biases of 15 mm and 20 mm, respectively. In conclusion, the ML model based on XGBoost is robust enough for the application and can deal with a reasonable number of outliers and poor-quality data. We attribute that robustness to the large sample size combined with the inherent tolerance of the model to label noise in the ZWDs, as well as to hyper-parameters tuned on unseen data with cross-validation.

4.3 Global spatial ZWD predictions

The proposed ML model can be applied at any location on Earth and at any desired time, as long as the meteorological input variables are available. This ability is visualised by predicting global maps of ZWD with a 0.25\(^\circ \) spatial resolution and 1-hour temporal resolution. Figure 7 shows the resulting ZWD maps for 00:00 UTC on the first day of each month in 2019.

Fig. 7 Global ZWD maps for 00:00 UTC on the first day of every month in 2019

The maps reveal the expected large-scale patterns, with overall higher values in the tropics and lower values in the polar regions. Additionally, regional weather phenomena can be distinguished, such as the South Asian monsoon that affects the Indian subcontinent from June to September. A further ZWD pattern over Central and Western Africa can probably be attributed to the seasonal displacements of the Inter-tropical Convergence Zone (ITCZ), which drives rainfall. When comparing the ZWD maps to the rainfall maps shown in the study by Dezfuli (2017), we find many similar patterns, which qualitatively corroborate the (relative) ZWD distribution predicted by our model.

4.4 Feature importance

Figure 8 illustrates the feature importance in the XGBoost model, representing the relative number of times a particular feature appears in a tree. It reveals that the three most important features are the specific humidities at pressure levels 900 hPa, 650 hPa, and 1000 hPa, highlighting that the humidity at the lower part of the atmosphere is the primary influence factor for ZWD. The most predictive pressure level of 900 hPa corresponds rather well to the 433 m average station height of the data set, suggesting that the specific humidity in the immediate environment of the station is of particular importance. Among the position and time features, ellipsoidal height (h) plays the biggest role, while latitude (\(\phi \)) and longitude (\(\lambda \)) have less impact. In a dedicated experiment, the nine position and time features were omitted altogether. This roughly doubled both the RMSE and the MAE, showing that they do play a significant role, despite their relatively low feature importance (see Table 12 in Sect. A.2).

4.5 External validation

To further assess the learned model, an inter-comparison with three independent ways to estimate ZWD was performed: vertical integration of (1) ERA5 data, (2) radiosonde observations, and (3) estimation of ZWDs from the Very Long Baseline Interferometry (VLBI) analysis.

For ERA5, we obtain hourly temperature, humidity, and geopotential fields at 37 pressure levels and apply the approach of Zus et al. (2012) to the test stations. The deviations between the resulting ZWD estimates and our XGBoost model have a WRMSE of 9.1 mm and a WMAE of 6.9 mm, only about 12 % to 13 % higher than the values obtained in the internal validation against NGL (8.1 mm and 6.1 mm), the reference for our study. To also quantify how well ZWDs from ERA5 and NGL agree, we compute the statistics for the difference between them and obtain a WRMSE of 10.8 mm and a WMAE of 8.3 mm for the test stations (see Table 5). Our XGBoost model thus reproduces NGL results better than a direct integration of ERA5.

In a similar fashion, ZWDs from radiosonde observations were obtained by integrating the vertical profiles of wet refractivity, which can be computed using pressure, temperature, and relative humidity data from radiosonde measurements (Zhang et al. 2021). The radiosonde-based ZWDs were then inter-compared to the other data sets. The radiosonde data are provided by the Integrated Global Radiosonde Archive (IGRA) (Durre et al. 2006, 2018). For the year 2019, 790 radiosonde stations are available. However, their geographical locations do not coincide with the GNSS stations. To minimise the influence of spatial ZWD variability, radiosonde locations were only paired with GNSS stations if they lie within a radius of 20 km, which leaves us with 116 station pairs. At those locations, the differences between ZWDs from radiosondes and from NGL have a WRMSE of 14.5 mm and a WMAE of 11.5 mm. This result further highlights that ZWD estimates from existing retrieval methods exhibit noticeable discrepancies. The (average) deviations between the radiosonde results and our XGBoost model are very similar, with a WRMSE of 15.0 mm and a WMAE of 11.7 mm for the 116 station pairs. Finally, the deviations between radiosondes and ERA5 integration (calculated over the 116 radiosonde stations) amount to a WRMSE of 11.0 mm and a WMAE of 8.2 mm. A better agreement of radiosonde measurements with ERA5 (than with NGL or XGBoost) was expected, as the radiosonde measurements are assimilated in ERA5 (Virman et al. 2021).
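The pairing of radiosonde sites with GNSS stations can be sketched with a great-circle distance. The sketch assumes a spherical Earth with a radius of 6371 km and pairs each radiosonde site with its nearest GNSS station, provided that station lies within 20 km; whether the study used the nearest station or another rule is not stated, and all names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def pair_stations(radiosondes, gnss, max_km=20.0):
    """Pair each radiosonde site with its nearest GNSS station within max_km."""
    pairs = []
    for rs_id, (rlat, rlon) in radiosondes.items():
        dist, g_id = min((haversine_km(rlat, rlon, glat, glon), gid)
                         for gid, (glat, glon) in gnss.items())
        if dist <= max_km:
            pairs.append((rs_id, g_id, dist))
    return pairs


radiosondes = {"RS1": (47.0, 8.0), "RS2": (0.0, 0.0)}    # hypothetical sites
gnss = {"G1": (47.1, 8.0), "G2": (10.0, 10.0)}           # hypothetical stations
pairs = pair_stations(radiosondes, gnss)
```

In this toy configuration only the first radiosonde site has a GNSS station within the 20 km radius (about 11 km away), so a single pair is formed and the second site is discarded.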

Table 5 WRMSEs [mm] (upper triangle, black) and WMAEs [mm] (lower triangle) for the inter-comparison experiment. The numbers in brackets refer to the number of stations that have been compared

Additionally, ZWDs estimated during VLBI analysis are compared to ZWDs of NGL, XGBoost, and ERA5. All VLBI R1/R4 sessions of the year 2019 have been analysed based on the bkg2023a solution. In total, 104 sessions containing 24 stations (excluding the station NOTO) with 25,100 samples were included. The agreement between ZWDs of NGL and VLBI is the best with a WRMSE of 5.4 mm and a WMAE of 4.1 mm. Comparing the ZWDs of VLBI to our XGBoost model and the ZWDs of ERA5, respectively, reveals that our XGBoost model performs better (WRMSE of 10.6 mm, WMAE of 7.9 mm) than a vertical integration of ERA5 (WRMSE of 12.3 mm, WMAE of 9.0 mm). The inter-comparison is summarised in Table 5. In summary, the higher compatibility of the proposed model with NGL and ERA5 was expected, given that the former served as its training target and the latter is based on the same meteorological data set. Why the discrepancies between NGL and ERA5 are lower than between either of them and the radiosonde data is less clear. This may be due to the distribution of the radiosonde locations, or it might hint at systematic observation biases.

We point out that our ML approach can, in principle, be trained with any desired ZWD data set as the regression target. Given the significant differences between retrieval methods, further research may be needed to determine to what extent training data from different sources can be mixed.

Fig. 8 Feature importance in the final ZWD model

4.6 Baseline model comparisons

To compare our meteorologically informed ZWD predictions to conventional spatial interpolation techniques, we also fit ZWD fields to the set of GNSS training stations with two baseline models, Ordinary Kriging and Delaunay triangulation (see Sect. 3.1 for brief descriptions of those standard techniques).

For Ordinary Kriging, we employ the implementation available in SciKit-GStat (v.1.0.1) (Mälicke et al. 2021; Mälicke 2022). As for the previous learning algorithms, the regression targets are the ZWD values from NGL, but the input in this case is only the geographical location of the stations (latitude, longitude, height). The Delaunay interpolation is based on the implementation in SciPy (v.1.8.0) (Virtanen et al. 2020). Again, the training stations serve as the vertices that define the tetrahedral regions, from which the ZWD values at the test locations are obtained by barycentric interpolation within the relevant tetrahedron.
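To illustrate the barycentric interpolation step, the following sketch interpolates a value inside a single tetrahedron. It is a minimal, self-contained example and not the SciPy routine used in this study: the lookup of the relevant tetrahedron in the Delaunay triangulation is omitted, and the function names are hypothetical.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def barycentric_weights(vertices, point):
    """Barycentric weights of `point` w.r.t. a tetrahedron (4 vertices in 3D)."""
    v0 = vertices[0]
    # Matrix A with columns v1-v0, v2-v0, v3-v0, so that A @ w = point - v0.
    a = [[vertices[j + 1][k] - v0[k] for j in range(3)] for k in range(3)]
    rhs = [point[k] - v0[k] for k in range(3)]
    det = _det3(a)
    if abs(det) < 1e-12:
        raise ValueError("degenerate tetrahedron")
    weights = []
    for j in range(3):
        # Cramer's rule: replace column j of A by the right-hand side.
        a_j = [[rhs[k] if c == j else a[k][c] for c in range(3)]
               for k in range(3)]
        weights.append(_det3(a_j) / det)
    # The weight of the first vertex follows from the sum-to-one constraint.
    return [1.0 - sum(weights)] + weights


def interpolate_in_tetrahedron(vertices, values, point):
    """Linear (barycentric) interpolation of `values` given at the vertices."""
    w = barycentric_weights(vertices, point)
    if min(w) < -1e-9:
        raise ValueError("point lies outside the tetrahedron")
    return sum(w_i * z_i for w_i, z_i in zip(w, values))
```

Because the weights are non-negative and sum to one inside the tetrahedron, the interpolated ZWD always lies between the smallest and largest value at the four surrounding stations; no extrapolation beyond the convex hull of the training stations is possible.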

The ZWD values predicted by Ordinary Kriging and by Delaunay interpolation are then compared to the NGL values at the test stations, see Table 6. For the former, we obtain a WRMSE of 19.6 mm and a WMAE of 14.7 mm; for the latter, we get similar values of 18.3 mm WRMSE and 13.7 mm WMAE. These values are more than twice as high as those of the XGBoost model (8.1 mm WRMSE, 6.1 mm WMAE). This was expected and confirms that the meteorological observations contribute important information about ZWD that is missing when simply interpolating the ZWD values observed at the sparse locations of the GNSS station network. The importance of meteorological data is in line with the findings of the variable importance study in Sect. A.2.

5 Discussion

The following subsections discuss the global model by comparing its performance to specialised models, namely, regional (Sect. 5.1.1) and monthly models (Sect. 5.1.2). Additionally, the global model was tested for a different year and its performance was evaluated (Sect. 5.2). Sections A.1 and A.2 in the appendix contain further details about the performance of different ML algorithms and give more insights about the feature selection.

5.1 Global versus specialised models

The final model presented in Sect. 4 is a global model that is based on 10,718 GNSS stations worldwide processed by NGL for the year 2019. In addition to creating a global model for the whole year, regional and monthly models were also generated. With these more specialised, (spatially or temporally) local models we investigate the prediction quality in more detail. Moreover, by comparing such specialised models to the monolithic, global one we are able to study the associated trade-offs. For instance, a single, global model has a larger training set and may be beneficial in regions with few stations, while on the other hand, it faces a more difficult task, as it must cover a broader range of geographical and meteorological conditions.

The following sections investigate how well the regional and monthly models perform relative to the global model.

5.1.1 Regional models

In total, six continental models were created that cover North America, South America, Europe, Africa, Asia, and Australia. These models were trained and evaluated with the respective subsets of the training and test sets defined for the global model. For a meaningful comparison, the global model was also evaluated separately for each continental subset of the test data. Results are shown in Table 7.

It can be seen that the best performance was achieved in Europe (WRMSE of 6.9 mm) and North America (WRMSE of 7.2 mm), the two regions with the highest number and the highest density of GNSS stations. The lowest performance was obtained for South America (WRMSE of 14.5 mm) and Africa (WRMSE of 13.4 mm), which have the lowest number of stations. We note that the results for Asia may not be representative, since both the training and the test sets are dominated by a small region comprising Japan, South Korea, and Nepal.

Table 6 Comparison between ZWDs from meteorologically informed XGBoost, Delaunay interpolation, and Kriging

Overall, the performance gaps between the global model and its local counterparts are very small, indicating that generalising across the entire globe with a single model is indeed justified. In more detail, the results confirm our expectations: the local models perform slightly better in regions with enough stations, as they can fit the specific, narrower set of local conditions; but that small edge vanishes in regions with very few stations, as the global model benefits from the information contributed by the much larger set of more distant training stations. Further research is needed to comprehensively assess whether, for geographically restricted, high-accuracy applications in regions with many GNSS stations, localised models may bring a significant advantage.

Table 7 Performance per geographical region, for both specialised regional XGBoost models and the global model

5.1.2 Monthly models

Following previous studies (Sun et al. 2019; Ding 2022) that investigated variations of the model accuracy across different seasons and latitudes, we also split the training and test sets in time. As the seasonal cycles vary across the globe, we prefer not to split into somewhat arbitrary seasons, but instead, train and test a separate model for each calendar month of the year 2019. Again, the same train/test split is used as for the global model, just further subdivided into monthly subsets. The results of this experiment are summarised in Table 8.
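The monthly subdivision of the test data can be sketched as follows. This is a simplified illustration with hypothetical function names; it uses a plain RMSE per month, not the weighted WRMSE reported in this study.

```python
import math
from collections import defaultdict
from datetime import datetime


def split_by_month(samples):
    """Group (timestamp, observed, predicted) samples by calendar month."""
    subsets = defaultdict(list)
    for timestamp, observed, predicted in samples:
        subsets[timestamp.month].append((observed, predicted))
    return dict(subsets)


def rmse(pairs):
    """Unweighted root-mean-square error of (observed, predicted) pairs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


def monthly_rmse(samples):
    """RMSE per calendar month, e.g. to evaluate one model month by month."""
    subsets = split_by_month(samples)
    return {month: rmse(pairs) for month, pairs in sorted(subsets.items())}


# Usage with made-up values (ZWD in mm):
example = [
    (datetime(2019, 1, 1, 0), 100.0, 103.0),
    (datetime(2019, 1, 1, 1), 110.0, 107.0),
    (datetime(2019, 7, 1, 0), 200.0, 200.0),
]
per_month = monthly_rmse(example)
```

The same grouping serves both purposes described above: the monthly training subsets are used to fit the specialised models, and the monthly test subsets are used to evaluate the specialised and the global model on identical data.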

It can be seen that a more local view tends to simplify the modelling problem: in all months the monthly models achieve slightly better performance than the global one, which is evaluated separately for each month of the test data. Moreover, the errors of the monthly models remain below the (spatially and temporally) global average error for all months except June, July, August, and September (when the largest ZWDs occur in the densely observed northern hemisphere).

Again, the differences are very small and confirm that a single, global model can capture the seasonal variations of ZWD. As before, it may be interesting to investigate in a further study how much of an advantage can be gained when the training period is extended by including the same month from multiple years.

Table 8 Performance per month of 2019, for both specialised monthly XGBoost models and the global model

To analyse the seasonal behaviour of ZWD predictions in different parts of the world, the performance is evaluated separately for the polar zones, the tropical zone, the northern temperate zone, and the southern temperate zone. Table 9 lists the climate zones, their latitude limits, and the number of (test) stations per zone. The results of the analysis are depicted in Fig. 9.

The errors are highest in the tropical zone but of a similar magnitude throughout the year. This makes sense since atmospheric water content and ZWDs are highest in the tropics, which on the one hand increases the magnitude of potential ZWD variations, and on the other hand, means that similar relative errors translate to higher absolute errors. Adding to that, the number of stations in the tropical zone is particularly low. It also makes sense that the errors in the tropical zone show only very little variability throughout the year, as a consequence of the stable climatic conditions without marked seasonality.

For the polar regions, the observed behaviour is plausible, too, with much lower errors presumably due to the dry atmosphere, and a relatively stronger seasonal signal. Due to the very low number of stations in the Arctic and Antarctic, we refrain from further interpretations.

The performance in the northern and southern temperate zones follows the expected pattern. In both zones, the errors exhibit a pronounced seasonal signal, dropping during the winter months and increasing over the summer when temperature and humidity (and thus also ZWD) are higher. The slightly higher accuracy in the northern hemisphere is likely not due to climatic influences, but explained by the much higher number of training stations.

Table 9 Latitude range and number of (test) stations for the utilised climatic zones

Fig. 9 Performance of monthly models evaluated in different climate zones of the world

Figure 9 raises the question of whether it is appropriate to use a random sample of test stations for the evaluation of the model. For example, if the selected stations in the northern temperate zone only observed during the summer months, significant biases might be introduced in the evaluation. However, due to the large sample size, it is unlikely that such biases appear. Furthermore, in our case, almost all stations observe year-round (see Fig. 2). Still, to ensure that no bias is present, the evaluation was additionally carried out using only test stations with a completeness of at least 95 %. The resulting accuracy agrees at the sub-millimetre level with the one obtained over all test stations.
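Such a completeness filter can be sketched as follows. The definition of completeness used here, the number of available hourly samples divided by the 8760 hours of 2019, is an assumption of this sketch, as are the function and variable names.

```python
HOURS_IN_2019 = 365 * 24  # 8760 hourly epochs in a non-leap year


def select_complete_stations(sample_counts, threshold=0.95,
                             n_expected=HOURS_IN_2019):
    """Return the station IDs whose data completeness reaches the threshold.

    `sample_counts` maps a station ID to its number of hourly ZWD samples.
    """
    return sorted(
        station
        for station, n_samples in sample_counts.items()
        if n_samples / n_expected >= threshold
    )


# Usage with made-up station IDs and sample counts:
counts = {"STA1": 8760, "STA2": 8400, "STA3": 8000}
complete = select_complete_stations(counts)
```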

5.2 Temporal predictions

As noted previously, the focus of our work lies on the global, spatially explicit modelling of ZWDs. In this context, it is important to emphasise that our current model is trained on data from the year 2019 only, and is therefore unaware of inter-annual variability. To quantify this limitation, we test its capability to predict ZWD in a different year, namely 2020. To this end, we obtain the meteorological variables for 2020 at the test stations and apply the model trained for 2019 to them.

The resulting WRMSE and WMAE values are 14.2 mm and 10.6 mm, respectively. They still lie in a very reasonable range (cf. the inter-comparison of ZWD models in Sect. 4.5), but are nonetheless roughly 1.75 times as high as those for 2019 (8.1 mm and 6.1 mm). Thus, to obtain the highest accuracy for a certain time period, it is necessary to retrain the model with the corresponding NGL data.

Figure 10 depicts the daily average RMSE over all test stations from January 2019 until December 2020. The clearly higher errors, as well as the higher variability and a sudden jump at New Year 2020, indicate significant temporal over-fitting of the current model to the conditions of 2019. In other words, although our approach is able to model the spatial distribution of ZWD, the data from one particular year are not enough to learn a general model that covers the entire range of relevant meteorological conditions anywhere on Earth for multiple years or even decades. This is not surprising given the large inter-annual variability of the weather in large parts of the globe and the existence of major atmospheric phenomena that do not occur every year, such as the El Niño Southern Oscillation (ENSO) or the Northern Annular Mode (NAM) with their different phases and magnitudes. We point out that this shortcoming can be addressed quite easily by extending the training data to cover multiple years (if necessary even at the cost of fewer samples per year). Additionally, the hyper-parameter tuning would have to be modified to utilise not only spatially, but also temporally independent validation data. Together, these two measures would almost certainly mitigate the problem, which makes them a promising, if obvious, direction for a future extension of our model.
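A validation split that is independent both spatially and temporally could look as follows. This is a hypothetical sketch of one possible design, not a procedure applied in this study: validation samples come only from held-out stations in held-out years, and samples that share either a station or a year with the validation set are excluded from training.

```python
def spatio_temporal_split(samples, val_stations, val_years):
    """Split samples so that training and validation share no station and no year.

    Each sample is a tuple whose first two entries are (station_id, year);
    any further entries (features, target) are passed through unchanged.
    """
    train, val = [], []
    for sample in samples:
        station, year = sample[0], sample[1]
        in_val_station = station in val_stations
        in_val_year = year in val_years
        if in_val_station and in_val_year:
            val.append(sample)
        elif not in_val_station and not in_val_year:
            train.append(sample)
        # Mixed cases (held-out station in a training year, or vice versa)
        # are discarded to keep the two sets fully independent.
    return train, val


# Usage with made-up station IDs and years:
data = [("A", 2018), ("A", 2019), ("B", 2018), ("B", 2019)]
train_set, val_set = spatio_temporal_split(data, {"B"}, {2019})
```

Discarding the mixed cases costs training data; a less strict alternative would keep them in the training set and accept that validation is then independent in only one of the two dimensions.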

Fig. 10 Time series of daily average RMSE for 2019 (blue) and 2020 (orange)

5.3 Applicability of our ZWD model

There are several ways in which users will be able to utilise the presented model, depending on their needs and applications.

First, a gridded data product is available that provides hourly ZWDs on a regular grid of 0.25 degrees. This data set may be useful for meteorological studies and other applications that require dense, global ZWD values. It could potentially also be used in future weather forecasting.

Second, the trained XGBoost model is provided directly. With it, users can estimate ZWD for specific locations and times but must ensure that they supply the correct inputs. Importantly, the specific humidity values at the various pressure levels are to be taken from the ERA5 data set, so as to match the data characteristics during model training. We do not recommend the use of other, user-generated specific humidity values, as these would require retraining of the model. Furthermore, to obtain realistic ZTD values, ZHDs have to be calculated with VMF1, matching the NGL processing of the training data.

Finally, we provide a web interface through which users can upload their location (latitude, longitude, height) and time information; the interface then calculates the corresponding ZWD based on the XGBoost model and ERA5 input. That interface, as well as the gridded data products at 0.25\(^\circ \) resolution, are available at the Geodetic Prediction Center of ETH Zurich, https://gpc.ethz.ch/Troposphere/ (last access: 25 January 2024).

6 Conclusions and outlook

In this study, a global ML-based ZWD model is presented that achieved a performance of 8.1 mm and 6.1 mm in terms of WRMSE and WMAE, respectively, for the test stations for the year 2019. The model utilised the XGBoost algorithm with the geographical location, time epoch, and specific humidity at six pressure levels as its input features. It was trained based on hourly ZWD measurements from 10,718 GNSS stations provided by NGL for the year 2019 and evaluated against ZWD measurements from 2684 GNSS stations for the same year. The huge number of training stations ensured that the model generalised well and led to a good performance, even for regions with a sparse GNSS station network.

We verified the performance based on a thorough inter-comparison with three independent methods for determining ZWD: computation of ZWDs via (1) vertical integration of ERA5 data, (2) radiosonde measurements, and (3) ZWD estimation in VLBI analyses. Our model has the best agreement with NGL (WRMSE of 8.1 mm), which was expected since NGL serves as the reference. However, the comparison also shows good agreement of our model with ERA5 (WRMSE of 9.1 mm) and VLBI (WRMSE of 10.6 mm). The WRMSE with respect to the radiosonde measurements is 15.0 mm. Note that the individual comparisons are based on different numbers of stations (see Table 5).

We assume the NGL ZWD estimates to represent the ground truth. One critical question to be answered in future studies is to what extent this assumption actually holds. In particular, errors in ZHD modelling are bound to propagate into the ZWD estimates and cause local biases, which could in turn propagate into our model. A study by Ding et al. (2023) indicates that such regional biases might exist, for example, in the Andes region. Since our model has access to the geographical location, such errors would normally remain localised and not propagate to other regions. A more detailed analysis of the ZWD quality in difficult terrain is required to ascertain how the local accuracy in such regions differs from the global one. That being said, our study nevertheless demonstrates that the proposed model delivers ZWD values globally and with high accuracy in most regions of the Earth. We also note that updating the model is straightforward: if different, better ZWD values for sufficiently many reference stations become available, all one has to do is retrain the model with those values.

To further assess the quality of the global model, regional (continental) and monthly models were also investigated. The differences in WRMSE and WMAE between the global model and these specialised models were very small, on average 0.3 mm for the regional and 0.4 mm for the monthly models. This indicates that the global model performs reasonably well for all regions of the Earth and over the full year. Concerning the regional models, areas with a dense GNSS station network and a high number of stations (e.g. Europe, North America) show a better performance than areas with a sparse network and a low number of stations (e.g. South America, Africa). Concerning the monthly models, the ZWD accuracy of stations located in the northern and southern temperate zones is worse during the corresponding summer months, likely explained by the higher water vapour content and thus higher variations in ZWDs.

One major advantage of the proposed model is that, in contrast to other ZWD models, it does not depend on prior ZWDs. Thus, it can be applied anywhere on Earth, opening up the possibility to use it for a wide range of applications in the field of positioning and possibly also for weather monitoring and forecasting. Furthermore, once the model is trained, calculating ZWDs from the input features is computationally inexpensive, making it attractive for low-cost or low-power devices. These properties, together with the better performance compared to ZWD computed from ERA5, make the ML model superior to alternative options.

While our model was designed for spatial modelling, additional experiments were conducted regarding its potential for temporal predictions. We found that the performance noticeably drops when applying the model to data outside the training period. This can be explained by the fact that it is only trained on data from one year, and therefore unaware of inter-annual variability. To overcome this limitation, a future extension will need to train the model on multiple years and to choose a temporally independent validation set. We are confident that in this way temporal generalisation can be achieved, leading to improved predictions at previously unobserved points in time. In addition, this study used specific humidity from ERA5 reanalysis data, which only become available within about five days of real time; this suffices to demonstrate the concept of our model but is a limitation for real-time applications. Preliminary investigations towards a temporal forecasting model of ZWD suitable for real-time applications have already been presented in Crocetti et al. (2023) at the European Geosciences Union (EGU) 2023. It appears that feeding an adapted ML model with meteorological forecast data from the ECMWF Integrated Forecasting System (IFS), in combination with training data from multiple years, improves ZWD estimation across time and does not lead to a significant performance loss. However, this real-time setting is still under investigation and will be reported in future work.