Introduction

Global Navigation Satellite System (GNSS) is widely used in various fields, such as precision agriculture (Du et al. 2023), smart transportation (Anand et al. 2021) and autonomous vehicles (de Miguel et al. 2020). Multipath has long been recognized as one of the major error sources for urban GNSS navigation and positioning because it depends strongly on the surrounding environment. When multipath interference occurs, reflected signals superimposed on the direct-path (DP) signal distort the autocorrelation function and hence add a bias to the pseudorange measurement. To mitigate the effect of multipath, many measures have been proposed, such as advanced antenna techniques (Xie et al. 2017; Granger et al. 2008; Wang et al. 2019) and baseband signal processing techniques (Van Dierendonck et al. 1992; Garin 1996; McGraw et al. 1999; Lee 2002; Wu et al. 2012; Van Nee et al. 1994; Weill 2002). Three-dimensional (3D) building models are also increasingly used to aid navigation and improve localization accuracy in densely built-up areas as an effective means of multipath and non-line-of-sight (NLOS) mitigation (Zhang et al. 2020).

In recent years, a large number of multipath mitigation methods at the baseband signal processing level have been proposed. These methods can be broadly grouped into two categories: (1) advanced correlator and discriminator designs that directly produce a result in which the multipath is suppressed or eliminated, such as the narrow correlator (Van Dierendonck et al. 1992), multipath elimination technology (MET) (Townsend et al. 1994), the strobe edge correlator (Garin 1996), the high resolution correlator (HRC) (McGraw et al. 1999) and code correlation reference waveforms (CCRW) technology (Lee 2002; Wu et al. 2012); and (2) multipath parameter estimation, which estimates individual multipath components and subtracts them from the total received signal to restore the DP signal, such as the multipath estimating delay-lock loop (MEDLL) (Van Nee et al. 1994), the vision correlator (Fenton et al. 2005) and the multipath mitigation technique (MMT) (Weill 2002).

MEDLL was proposed by NovAtel in 1994. It uses multiple correlators to sample a series of correlation values and obtain the correlation peak envelope. MEDLL estimates multipath signal parameters based on the maximum likelihood estimation criterion. As an effective multipath mitigation method, MEDLL can eliminate about 90% of the multipath effect in a narrow correlator receiver in some cases (Xie 2009). However, because it relies on the maximum likelihood criterion, the performance of MEDLL depends heavily on the shape of the autocorrelation function and is therefore affected by the front-end bandwidth setting and the carrier-to-noise ratio (CNR) of the received signal.

Machine learning has been widely used in the GNSS field, for example in spoofing detection (Semanjski et al. 2019; Aissou et al. 2022), signal acquisition (Borhani-Darian et al. 2023) and positioning enhancement (Kanhere et al. 2022). In recent years, machine learning has also been applied to the multipath problem. On the one hand, various intelligent methods for multipath detection have been proposed. Hsu (2017) proposed a support vector machine (SVM)-based LOS/MP/NLOS classifier which uses signal strength and pseudorange residue as the features; Quan et al. (2018) employed a convolutional neural network (CNN) for feature extraction in multipath detection for kinematic GPS positioning; Savas et al. (2019) proposed a multipath detection algorithm based on K-means clustering so that no a priori training data is required; Suzuki et al. (2020) constructed a machine learning-based multipath/NLOS classifier that discriminates NLOS multipath signals from the output of the multiple GNSS signal correlators of a software GNSS receiver.

At a deeper level, scholars have attempted to employ machine learning algorithms to mitigate the effect of multipath directly. Phan et al. (2013) employed support vector regression (SVR) to estimate multipath error by making use of the connections between multipath error and satellite relative elevation angle and azimuth angle. At the baseband signal processing level, Orabi et al. (2020) proposed a neural network-based delay-locked loop (NNDLL), which builds a neural network discriminator using a multilayer perceptron (MLP) to estimate code phase directly. Li et al. (2022) proposed a deep network correlator, where a convolutional layer is used to achieve standard correlation, and then an MLP is employed to filter out the multipath effects on the autocorrelation function. These two attempts show the potential of machine learning assistance at the baseband signal processing level. However, both of them focus on correlator or discriminator design, while the application of machine learning to multipath parameter estimation is less investigated. Utilizing the learning capability of machine learning, this study aims to propose a machine learning-based multipath parameter estimator that achieves better robustness than traditional multipath parameter estimation algorithms such as MEDLL. Besides, both the NNDLL and the deep network correlator employ a large number of correlators, which limits the feasibility of these algorithms in receivers. Therefore, the balance between algorithm performance and computational effort is also an issue to be considered.

We develop a random forest (RF)-based multipath parameter estimator, where random forest regression is employed to estimate the parameters of multipath for GNSS signals. In Qi et al. (2023), we presented preliminary results of parameter estimation and multipath mitigation performance evaluation of the RF-based estimator in the one-multipath case. In this paper, we test the algorithm in greater depth and detail. First, the multipath mitigation performance of the RF-based estimator is evaluated using MEDLL as a benchmark in one-multipath and three-multipath simulation cases, respectively. The effects of both the CNR and the front-end bandwidth setting of the received signals are considered simultaneously. The simulation results show that both MEDLL and the RF-based estimator can mitigate the effect of multipath to some extent compared to the case without multipath correction. Compared to MEDLL, the RF-based estimator is hardly affected by the front-end bandwidth and shows better robustness. In the experiments with real data, the delay-locked loop (DLL) with the RF-based estimator achieves both lower root mean square error (RMSE) and lower standard deviation of positioning errors than the conventional DLL and MEDLL, which again validates the effectiveness of our algorithm. We also test the performance of the RF-based parameter estimator with different numbers of correlators. The results show that the proposed algorithm works effectively even when only 21 correlators are used.

The remainder of this paper is organized as follows. First, we introduce the GNSS signal model in the presence of multipath. Then, we present the methodology and the architecture of the RF-based multipath parameter estimator and how it works in a DLL. After that, simulation results comparing the performance of the RF-based estimator and MEDLL in one-multipath and three-multipath cases are presented and discussed. In addition, the multipath mitigation performance of the DLL with the RF-based estimator is investigated with real data. Finally, conclusions are drawn.

Signal model

The down-converted intermediate-frequency signal at the receiver can be modeled as

$$\begin{array}{*{20}c} {r_{\Sigma } \left( t \right) = \mathop \sum \limits_{i = 0}^{M} a_{i} p\left( {t - \tau_{i} } \right)\cos \left( {2\pi f_{IF} t + \phi_{i} } \right) + n\left( t \right)} \\ \end{array}$$
(1)

where the 0-th signal is the DP signal; \(p\left( t \right)\) is the pseudo-random noise (PRN) code of a specific satellite; \(a_{i}\), \(\tau_{i}\) and \(\phi_{i}\) are the amplitude, code phase and carrier phase of the i-th signal; \(f_{IF}\) is the intermediate frequency. The amplitude of the DP signal \(a_{0}\) is assumed to be unity. The relative amplitude of the i-th reflected signal satisfies \(a_{i} \in \left( {0, 1} \right)\), which accounts for the attenuation caused by reflection. \(n\left( t \right)\) is the measurement noise process, which can be modeled as additive white Gaussian noise (AWGN).

Correlating the received signal with local replica yields

$$\begin{array}{*{20}c} {R_{\Sigma } \left( t \right) = \mathop \sum \limits_{i = 0}^{M} a_{i} R\left( {t - \tau_{i} } \right)e^{{j\phi_{i} }} + N\left( t \right)} \\ \end{array}$$
(2)

where \(R\left( {t - \tau_{i} } \right)\) is the autocorrelation function of the PRN code shifted by code phase delay \(\tau_{i}\); \(N\left( t \right)\) is the low-pass filtered noise.

\(R_{\Sigma } \left( t \right)\) is a complex signal whose real and imaginary parts are realized in the two arms of a receiver. Since the specific value of the carrier phase is not of concern in the code tracking loop, only the real part of \(R_{\Sigma } \left( t \right)\) is used, given by

$$\begin{array}{*{20}c} {R_{\Sigma I} \left( t \right) = R\left( {t - \tau_{0} } \right) + \mathop \sum \limits_{i = 1}^{M} a_{i} R\left( {t - \tau_{i} } \right)\cos \overline{\phi }_{i} + N_{I} \left( t \right)} \\ \end{array}$$
(3)

where \(\overline{\phi }_{i}\) is the relative carrier phase of i-th reflected signal with respect to the DP signal.

For each signal component, the change in carrier phase acts on the amplitude of the autocorrelation function. Here we perform an invertible transformation, letting

$$\begin{array}{*{20}c} {\alpha_{Ii} = a_{i} \cos \overline{\phi }_{i} } \\ \end{array}$$
(4)

Then, equation (3) becomes

$$\begin{array}{*{20}c} {R_{\Sigma I} \left( t \right) = R\left( {t - \tau_{0} } \right) + \mathop \sum \limits_{i = 1}^{M} \alpha_{Ii} R\left( {t - \tau_{i} } \right) + N_{I} \left( t \right)} \\ \end{array}$$
(5)

For the \(R_{\Sigma I} \left( t \right)\) shown in (5), the multipath parameters to be estimated at time \(k\) are \(\theta^{k} = \left\{ {\alpha_{I1} , \cdots , \alpha_{IM} ,\tau_{1} , \cdots ,\tau_{M} } \right\}\). Neglecting noise, the autocorrelation function of the DP signal can be restored by

$$\begin{array}{*{20}c} {R\left( {t - \tau_{0} } \right) = R_{\Sigma I} \left( t \right) - \mathop \sum \limits_{i = 1}^{M} \hat{\alpha }_{Ii} R\left( {t - \hat{\tau }_{i} } \right)} \\ \end{array}$$
(6)

where \(\hat{\alpha }_{Ii} , \hat{\tau }_{i} \in \hat{\theta }^{k}\) are the parameter estimates of the i-th multipath at time \(k\).
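As a minimal illustration of (5) and (6), the following Python sketch assumes an idealized triangular autocorrelation function (infinite front-end bandwidth, no noise) and perfect parameter estimates. The function names `ideal_acf`, `composite_acf` and `restore_dp`, the sampling grid, and the example values (one reflected signal with \(\alpha_{I1} = 0.5\) delayed by 0.6 chips) are illustrative assumptions, not the receiver implementation.

```python
def ideal_acf(tau):
    """Idealized triangular PRN autocorrelation; tau is in chips (assumption)."""
    return max(0.0, 1.0 - abs(tau))


def composite_acf(t, tau0, multipaths):
    """Noise-free Eq. (5): DP term plus the sum of alpha_Ii * R(t - tau_i).

    multipaths is a list of (alpha_Ii, tau_i) pairs, tau_i in chips.
    """
    return ideal_acf(t - tau0) + sum(
        alpha * ideal_acf(t - tau) for alpha, tau in multipaths
    )


def restore_dp(t, r_sigma_i, estimates):
    """Eq. (6): subtract the reconstructed reflected components."""
    return r_sigma_i - sum(
        alpha_hat * ideal_acf(t - tau_hat) for alpha_hat, tau_hat in estimates
    )


# Example: DP at tau0 = 0, one constructive multipath (alpha = 0.5, 0.6 chips).
tau0 = 0.0
true_paths = [(0.5, 0.6)]
grid = [-3.0 + 0.3 * k for k in range(21)]  # sampling lags in chips

distorted = [composite_acf(t, tau0, true_paths) for t in grid]
# With perfect estimates, the restored samples equal the DP autocorrelation.
restored = [restore_dp(t, r, true_paths) for t, r in zip(grid, distorted)]
residual = max(abs(r - ideal_acf(t - tau0)) for t, r in zip(grid, restored))
print(f"largest restoration residual: {residual:.2e}")
```

In practice the estimates are imperfect and noise is present, so the restored function only approximates \(R\left( {t - \tau_{0} } \right)\); the sketch shows the bookkeeping only.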

Methodology

Multipath parameter estimation is essentially a regression problem. We employ random forest regression to construct the multipath parameter estimator. This section describes the principle of random forest regression and how the RF-based parameter estimator works in the code tracking loop.

Random forest regression

Random forest is an ensemble learning algorithm proposed by Breiman in 2001 (Breiman 2001). As a successful general-purpose classification and regression method, it can be applied to a wide range of prediction problems and has just a few parameters to tune (Biau et al. 2016). RF consists of multiple decision trees as base learners that operate as an ensemble. A technique called bagging (a contraction of bootstrap-aggregating) (Breiman 1996) is implemented in RF. Bagging creates multiple subsets of the original data by sampling with replacement, and trains a base learner on each subset. The final decision of RF would be obtained by aggregating the predictions of all the base learners. For classification tasks, the output of RF is the class selected by most trees. In the case of a regression problem, the final output is obtained by averaging of all the predictions from individual decision trees.

The general framework of random forest regression is nonparametric regression estimation whose input is a random vector \(X \in {\upchi } \subset {\mathbb{R}}^{n}\), and the goal is to predict the response \(Y \in {\mathbb{R}}\) by estimating the regression function \(m\left( x \right) = {\mathbb{E}}\left\{ {Y|X = x} \right\}\). In regression, mean square error (MSE) shown in (7) is often chosen as the impurity function (IF), which is used to determine the split nodes of decision trees.

$$\begin{array}{*{20}c} {IF = \frac{1}{{n_{j} }}\mathop \sum \limits_{i = 1}^{{n_{j} }} \left( {Y_{i} - \hat{Y}_{i} } \right)^{2} } \\ \end{array}$$
(7)

where \({n}_{j}\) is the number of samples in the input space.

For a training dataset \(D = \left\{ {\left( {X_{1} , Y_{1} } \right), \cdots ,\left( {X_{n} , Y_{n} } \right)} \right\}\) of random variables distributed as the independent prototype pair \(\left( {X,Y} \right)\), the goal of random forest regression is to construct an estimate \(m_{D} : \chi \to Y\), which satisfies

$$\begin{array}{*{20}c} {\mathop {\lim }\limits_{n \to \infty } {\mathbb{E}}\left[ {m_{D} \left( X \right) - m\left( X \right)} \right]^{2} = 0} \\ \end{array}$$
(8)

The training dataset of the j-th decision tree \(D_{j}\) is obtained by sampling with replacement from the original dataset. The classification and regression trees (CART)-split criterion (Breiman et al. 1984) is used to construct individual trees by choosing the best split points. Suppose \(s\) is the split point for a feature variable \(q\); the input space is then split into two subspaces \(R_{1} \left( {q,s} \right) = \left\{ {X|X^{\left( q \right)} \le s} \right\}\) and \(R_{2} \left( {q,s} \right) = \left\{ {X\left| {X^{\left( q \right)} } \right. > s} \right\}\). The goal of CART regression is to find the optimal \(\left( {q^{*} ,s^{*} } \right)\), which minimizes the sum of \(IF\left( {R_{1} } \right)\) and \(IF\left( {R_{2} } \right)\). The optimization function in CART regression is written as

$$\begin{array}{*{20}c} {J\left( {q,s} \right) = \mathop {\min }\limits_{{c_{1} }} \mathop \sum \limits_{{X_{i} \in R_{1} \left( {q,s} \right)}} \left( {Y_{i} - c_{1} } \right)^{2} + \mathop {\min }\limits_{{c_{2} }} \mathop \sum \limits_{{X_{i} \in R_{2} \left( {q,s} \right)}} \left( {Y_{i} - c_{2} } \right)^{2} } \\ \end{array}$$
(9)

where

$$\begin{array}{*{20}c} {\left\{ {\begin{array}{*{20}c} {\hat{c}_{1} = ave\left( {Y_{i} |X_{i} \in R_{1} \left( {q,s} \right)} \right)} \\ {\hat{c}_{2} = ave\left( {Y_{i} |X_{i} \in R_{2} \left( {q,s} \right)} \right)} \\ \end{array} } \right.} \\ \end{array}$$
(10)
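The search for \(\left( {q^{*} ,s^{*} } \right)\) in (9) and (10) can be sketched as an exhaustive scan over features and candidate thresholds. The sketch below is a simplified illustration: the function names `sse` and `best_split` are hypothetical, and placing candidate thresholds at the midpoints between consecutive distinct feature values is a common convention assumed here, not a detail taken from this paper.

```python
def sse(values):
    """Sum of squared deviations from the mean, i.e. min over c of sum (Y_i - c)^2."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)  # the minimizing constant, Eq. (10)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, Y):
    """Return (q, s, cost) minimizing J(q, s) in Eq. (9), or None if no split exists.

    X is a list of feature vectors, Y the list of scalar responses.
    """
    best = None
    for q in range(len(X[0])):
        distinct = sorted({row[q] for row in X})
        # Candidate thresholds: midpoints between neighboring distinct values.
        for lo, hi in zip(distinct, distinct[1:]):
            s = 0.5 * (lo + hi)
            left = [y for row, y in zip(X, Y) if row[q] <= s]   # R_1(q, s)
            right = [y for row, y in zip(X, Y) if row[q] > s]   # R_2(q, s)
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (q, s, cost)
    return best


# Toy data: the response steps from 0 to 1 between x = 1 and x = 2 on feature 0;
# feature 1 is constant and therefore offers no candidate split.
X_toy = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
Y_toy = [0.0, 0.0, 1.0, 1.0]
print(best_split(X_toy, Y_toy))  # (0, 1.5, 0.0)
```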

The above partitioning process is repeated for each subspace until the stopping condition is satisfied, at which point a decision tree has been generated. Assume that the whole input space is finally divided into M regions \(R_{1} , R_{2} , \cdots ,R_{M}\) (here M denotes the number of leaf regions, not the number of reflected signals). After training, the prediction of the j-th decision tree \(m_{{D_{j} }} \left( x \right)\) with \(x\) as the input is

$$\begin{array}{*{20}c} {m_{{D_{j} }} \left( x \right) = \mathop \sum \limits_{m = 1}^{M} \hat{c}_{m} I\left( {x \in R_{m} } \right)} \\ \end{array}$$
(11)

where

$$\begin{array}{*{20}c} {\hat{c}_{m} = ave\left( {Y_{i} |x_{i} \in R_{m} } \right)} \\ \end{array}$$
(12)

The output of the random forest regression is calculated by

$$\begin{array}{*{20}c} {m_{D} \left( x \right) = \frac{1}{p}\mathop \sum \limits_{j = 1}^{p} m_{{D_{j} }} \left( x \right)} \\ \end{array}$$
(13)

where \(p\) is the number of decision trees in the random forest.
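Equations (11) to (13) can be illustrated with a compact bagging sketch in pure Python. It is a teaching example under stated assumptions: the names `fit_tree`, `predict_tree`, `fit_forest` and `predict_forest` are hypothetical, trees are limited by a simple depth rule, and the per-split random feature subsampling of Breiman's random forest is omitted, so only the bootstrap-and-average part described above is shown.

```python
import random


def _sse(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def fit_tree(X, Y, max_depth=3):
    """Tiny CART regressor as nested dicts; leaves store the region mean, Eq. (12)."""
    if max_depth == 0 or len(set(Y)) == 1:
        return {"value": sum(Y) / len(Y)}
    best = None
    for q in range(len(X[0])):
        distinct = sorted({row[q] for row in X})
        for lo, hi in zip(distinct, distinct[1:]):
            s = 0.5 * (lo + hi)
            left = [i for i, row in enumerate(X) if row[q] <= s]
            right = [i for i, row in enumerate(X) if row[q] > s]
            cost = _sse([Y[i] for i in left]) + _sse([Y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, q, s, left, right)
    if best is None:  # all feature vectors identical: cannot split further
        return {"value": sum(Y) / len(Y)}
    _, q, s, left, right = best
    return {
        "q": q,
        "s": s,
        "left": fit_tree([X[i] for i in left], [Y[i] for i in left], max_depth - 1),
        "right": fit_tree([X[i] for i in right], [Y[i] for i in right], max_depth - 1),
    }


def predict_tree(tree, x):
    """Eq. (11): return the mean of the leaf region that contains x."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["q"]] <= tree["s"] else tree["right"]
    return tree["value"]


def fit_forest(X, Y, p=100, seed=0, max_depth=3):
    """Train p trees, each on a bootstrap sample D_j drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(p):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([X[i] for i in idx], [Y[i] for i in idx], max_depth))
    return forest


def predict_forest(forest, x):
    """Eq. (13): average the predictions of the individual trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)
```

For a multi-output target such as \(\theta^{k}\), the same averaging is applied to each component of the label.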

Random forest-based multipath parameter estimator

When multipath interference occurs, the autocorrelation function is distorted. The dashed line in Fig. 1 depicts an autocorrelation function for constructive multipath interference. At the same time, the distorted autocorrelation function contains all the information about the multipath, such as its relative amplitude and code delay with respect to the DP signal. Therefore, the equally spaced sampling points of the autocorrelation function are chosen as the input features \(x_{I} \left( t \right)\) of the random forest for multipath parameter estimation. The points in Fig. 1 illustrate the sampling points used as features. The architecture of the RF-based parameter estimator is shown in Fig. 2. \(\hat{\theta }\) at time \(k\) is calculated by

$$\begin{array}{*{20}c} {\hat{\theta }^{k} = \frac{1}{p}\mathop \sum \limits_{j = 1}^{p} \hat{\theta }_{j}^{k} } \\ \end{array}$$
(14)

where \(\hat{\theta }_{j}^{k}\) is the prediction from the j-th decision tree at time \(k\).

Fig. 1 Autocorrelation function with one constructive multipath

Fig. 2 Diagram of RF-based parameter estimator. \(p\) is the number of decision trees

The architecture of tracking loops with the RF-based parameter estimator is shown in Fig. 3. The equally spaced sampling points of the autocorrelation function are obtained with a multi-correlator. The multi-correlator outputs are then sent to the RF-based estimator to calculate the multipath parameter estimates \(\hat{\theta }^{k}\). After that, the autocorrelation function components of the reflected signals are reconstructed from \(\hat{\theta }^{k}\) and subtracted from the total autocorrelation function. Note that a fixed reference autocorrelation function \(R_{ref}\) is employed to reconstruct these components. Theoretically, the amplitude of the autocorrelation function of the DP signal is the same as the amplitude of \(R_{ref}\), which is unity.

Fig. 3 Architecture of tracking loops with the RF-based multipath parameter estimator

Simulations

In this section, we evaluate the performance of the RF-based multipath parameter estimator through simulations. The classic multipath parameter estimator, MEDLL, is used as a benchmark. MEDLL uses the maximum likelihood estimation criterion to estimate multipath parameters from multiple samples of the complex autocorrelation function. It estimates the relative amplitude, relative carrier phase and code delay of the multipath signal separately. For the \(R_{\Sigma } \left( t \right)\) given in (2), the estimates for the i-th signal in MEDLL are shown in (15) (Van Nee et al. 1994)

$$\begin{array}{*{20}c} {\left\{ {\begin{array}{*{20}c} {\hat{\tau }_{i} = \mathop {\max }\limits_{\tau } \Re \left[ {\left( {R_{\Sigma } \left( t \right) - \mathop \sum \limits_{{\begin{array}{*{20}c} {m = 0} \\ {m \ne i} \\ \end{array} }}^{M} \hat{a}_{m} R\left( {t - \hat{\tau }_{m} } \right)e^{{j\hat{\phi }_{m} }} } \right)e^{{ - j\hat{\phi }_{i} }} } \right],} \\ {\hat{a}_{i} = \Re \left[ {\left( {R_{\Sigma } \left( {\hat{\tau }_{i} } \right) - \mathop \sum \limits_{{\begin{array}{*{20}c} {m = 0} \\ {m \ne i} \\ \end{array} }}^{M} \hat{a}_{m} R\left( {\hat{\tau }_{i} - \hat{\tau }_{m} } \right)e^{{j\hat{\phi }_{m} }} } \right)e^{{ - j\hat{\phi }_{i} }} } \right],} \\ {\hat{\phi }_{i} = \arg \left( {R_{\Sigma } \left( {\hat{\tau }_{i} } \right) - \mathop \sum \limits_{{\begin{array}{*{20}c} {m = 0} \\ {m \ne i} \\ \end{array} }}^{M} \hat{a}_{m} R\left( {\hat{\tau }_{i} - \hat{\tau }_{m} } \right)e^{{j\hat{\phi }_{m} }} } \right).} \\ \end{array} } \right.} \\ \end{array}$$
(15)

where the notation \(\Re \left( \cdot \right)\) is the real part of a complex value.

Since the specific value of the carrier phase is not calculated separately in our algorithm, only the accuracies of the relative amplitude and code phase delay estimates are compared. Note that the relative amplitude estimated by MEDLL is the relative amplitude of the reflected signal with respect to the DP signal, \(a_{i}\), while the relative amplitude calculated by the RF-based estimator is \(\alpha_{Ii}\) defined in (4). Root mean square error (RMSE) is used to assess the accuracy of parameter estimation. The RMSE of \(\tau_{i}\) over \(m\) moments is shown below:

$$\begin{array}{*{20}c} {RMSE\left( {\tau_{i} } \right) = \sqrt {\frac{1}{m}\mathop \sum \limits_{k = 1}^{m} \left( {\hat{\tau }_{i}^{k} - \tau_{i}^{k} } \right)^{2} } } \\ \end{array}$$
(16)
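Equation (16) translates directly into a few lines of Python. The helper name `rmse` and the example numbers below are illustrative assumptions, not values from the experiments.

```python
import math


def rmse(estimates, truths):
    """Root mean square error over m epochs, as in Eq. (16)."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(est - true) ** 2 for est, true in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical code delay estimates (chips) against hypothetical true delays.
tau_hat = [0.6, 0.6]
tau_true = [0.5, 0.7]
print(rmse(tau_hat, tau_true))  # approximately 0.1 chips
```

The same function applies unchanged to the relative amplitude estimates.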

The shape of the autocorrelation function is influenced by both the CNR and the front-end bandwidth setting of the received signal. Figures 4 and 5 show the autocorrelation functions subjected to a constructive multipath with different CNRs and different front-end bandwidths. The relative amplitude of the multipath is 0.5 and the relative code delay is 0.6 chips. When the CNR is low, the shape of the autocorrelation function is largely affected by noise. While a small front-end bandwidth can filter out high-frequency noise components in the received signal, the high-frequency signal components are also filtered out, which rounds the peak of the autocorrelation function. Therefore, both CNR and front-end bandwidth are considered in the performance evaluation. A software defined receiver (SDR) developed by Borre et al. (2007) is employed as the test platform.

Fig. 4 Autocorrelation functions affected by multipath at different CNRs

Fig. 5 Autocorrelation functions affected by multipath for different front-end bandwidths

Performance evaluation in one-multipath case

Random forest regression is a supervised learning algorithm, so its performance depends heavily on the quantity and quality of the dataset. We expect the RF-based estimator to be able to accurately estimate the parameters of reflected signals when multipath interference occurs, while not introducing new error sources in the multipath-free case. For this purpose, a dataset \(D_{M1}\) containing 80,000 samples is generated for the training of the random forest, of which 30% are multipath-free samples and 70% are one-multipath samples (including 15% short multipath samples with code delay less than 0.2 chips and 55% long multipath samples). The details of the multipath parameter settings in \(D_{M1}\) are shown in Table 1. The label \(\theta^{k}\) of the k-th sample in \(D_{M1}\) is \(\left\{ {\alpha_{I1} , \tau_{1} } \right\}\). The parameters of each sample are sampled uniformly within the range of values. The sample points of the autocorrelation function with 0.3 chips spacing from −3 to +3 chips are taken as the features \(x_{I}\). All these samples in \(D_{M1}\) are obtained in simulation for the signal with a sample frequency of 58 MHz and CNRs ranging from 18 to 35 dB-Hz.
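The composition of \(D_{M1}\) stated above can be checked with a short calculation. The helper `dataset_composition` and the category names are hypothetical labels introduced only for this illustration.

```python
def dataset_composition(total, percentages):
    """Split a dataset size into per-category counts from integer percentages."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percentages.items()}


d_m1_counts = dataset_composition(
    80_000,
    {"multipath_free": 30, "short_multipath": 15, "long_multipath": 55},
)
print(d_m1_counts)
# {'multipath_free': 24000, 'short_multipath': 12000, 'long_multipath': 44000}
```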

Table 1 Details of the multipath parameter settings in one-multipath case training set

The sampling spacing determines the number of correlators required. Considering that it is difficult to implement a large number of correlators in a receiver, the sampling spacing is set to 0.3 chips to balance the feasibility and the performance of the RF-based estimator. In total, 21 correlators are required in this case. The feasibility of the RF-based estimator in receivers is discussed in detail in the Discussion section.
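The relation between sampling spacing and correlator count follows from the span of −3 to +3 chips: a span of 6 chips sampled every 0.3 chips gives 6 / 0.3 + 1 = 21 taps. A small sketch, in which the function name `correlator_delays` and its default arguments are assumptions made for illustration:

```python
def correlator_delays(spacing, start=-3.0, stop=3.0):
    """Return the equally spaced correlator lags (chips) covering [start, stop]."""
    span = stop - start
    n_intervals = round(span / spacing)
    if abs(n_intervals * spacing - span) > 1e-9:
        raise ValueError("spacing must divide the span evenly")
    return [start + k * spacing for k in range(n_intervals + 1)]


delays = correlator_delays(0.3)
print(len(delays))                    # 21 correlators for 0.3-chip spacing
print(len(correlator_delays(0.1)))    # 61 correlators for 0.1-chip spacing
```

Halving the spacing roughly doubles the number of correlators, which is the hardware cost that the 0.3-chip choice is meant to limit.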

All 80,000 samples in \(D_{M1}\) are used to train the random forest regressor. Each decision tree would be generated by minimizing the optimization function in (9) to find the best split points. Bootstrap aggregating ensures the diversity of base learners. The number of estimators is set to 100, so a random forest regressor consisting of 100 decision trees is generated after training, which is the multipath parameter estimator for one multipath case. Details of random forest generation can be found in Methodology.

First, the parameter estimation performance of the RF-based estimator is evaluated and compared with MEDLL. Eighteen test sets are generated with CNRs ranging from 18 to 35 dB-Hz, each consisting of 2000 samples. The multipath parameter settings for the samples in the test sets are the same as those in the training set. To explore the impact of front-end bandwidth, four series of comparison experiments on parameter estimation are conducted at front-end bandwidth settings of 2 MHz, 4 MHz, 8 MHz and 20 MHz, respectively. The results are shown in Figs. 6 and 7.

Fig. 6 RMSE of relative amplitude estimates

Fig. 7 RMSE of relative code phase delay estimates

As shown in Figs. 6 and 7, the RF-based parameter estimator achieves both lower amplitude error and lower code phase delay error than MEDLL in all tests. In terms of the effect of CNR, the estimation errors of both relative amplitude and code phase delay obtained by the RF-based estimator increase as the CNR decreases; the relative code delay estimation errors of MEDLL show the same trend, but the amplitude errors obtained by MEDLL do not differ much across CNRs. Concerning the effect of front-end bandwidth, for weak signals with CNR less than 22 dB-Hz, the RF-based estimator obtains smaller parameter estimation errors with low front-end bandwidth settings, while for signals whose CNR is greater than 22 dB-Hz, the largest parameter estimation errors are obtained with a front-end bandwidth of 2 MHz. Unlike the RF-based estimator, MEDLL always obtains its largest parameter estimation error for signals with a front-end bandwidth of 2 MHz, which indicates that MEDLL estimates parameters poorly for signals with low front-end bandwidths.

We then evaluate the multipath mitigation performance of the RF-based parameter estimator and compare it with MEDLL. A series of tests without multipath correction is performed as the baseline. All these tests are conducted using signals with different CNRs and front-end bandwidths. In these tests, a 1 chip spacing normalized early-minus-late (NEML) discriminator, a 0.1 chips spacing narrow correlator and a 0.05 chips spacing HRC are employed for code discrimination, respectively. The results are shown in Figs. 8, 9, and 10.
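To make the uncorrected baseline concrete, the following sketch computes the code tracking bias that one constructive multipath induces in an early-minus-late discriminator. It rests on several simplifying assumptions: an ideal triangular autocorrelation function (infinite bandwidth), no noise, and an unnormalized early-minus-late output, whose zero crossing coincides with that of the normalized form as long as the sum of the early and late outputs is positive. The function names and the ±0.5 chip search bracket are hypothetical choices, and the HRC is not modeled.

```python
def ideal_acf(tau):
    """Idealized triangular PRN autocorrelation; tau in chips (assumption)."""
    return max(0.0, 1.0 - abs(tau))


def composite(tau, alpha, delay):
    """DP autocorrelation plus one reflected component of relative amplitude alpha."""
    return ideal_acf(tau) + alpha * ideal_acf(tau - delay)


def eml_output(eps, spacing, alpha, delay):
    """Early-minus-late output when the prompt replica is offset by eps chips."""
    half = 0.5 * spacing
    return composite(eps - half, alpha, delay) - composite(eps + half, alpha, delay)


def tracking_bias(spacing, alpha, delay, lo=-0.5, hi=0.5, iterations=60):
    """Locate the discriminator zero crossing by bisection within [lo, hi]."""
    f_lo = eml_output(lo, spacing, alpha, delay)
    f_hi = eml_output(hi, spacing, alpha, delay)
    if f_lo * f_hi > 0:
        raise ValueError("bracket does not contain a zero crossing")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (eml_output(mid, spacing, alpha, delay) < 0) == (f_lo < 0):
            lo = mid  # zero crossing lies in the upper half
        else:
            hi = mid  # zero crossing lies in the lower half
    return 0.5 * (lo + hi)


# Multipath used for Figs. 4 and 5: relative amplitude 0.5, code delay 0.6 chips.
wide = tracking_bias(spacing=1.0, alpha=0.5, delay=0.6)
narrow = tracking_bias(spacing=0.1, alpha=0.5, delay=0.6)
print(f"1 chip spacing:    bias = {wide:.3f} chips")    # 0.200
print(f"0.1 chips spacing: bias = {narrow:.3f} chips")  # 0.025
```

Under these idealized assumptions the narrower spacing already reduces the bias, which is why the benefit of any additional multipath correction has to be judged separately for each discriminator.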

Fig. 8 Color map of the pseudorange error distributions when 1 chip spacing NEML discriminator is used for discrimination in the one-multipath case

Fig. 9 Color map of the pseudorange error distributions when 0.1 chips spacing narrow correlator is used for discrimination in the one-multipath case

Fig. 10 Color map of the pseudorange error distributions when 0.05 chips spacing HRC is used for discrimination in the one-multipath case

As shown in Fig. 8, both MEDLL and the RF-based estimator achieve lower pseudorange error than the case without multipath correction when the 1 chip spacing NEML discriminator is used for code discrimination. The pseudorange error obtained by both MEDLL and the RF-based estimator increases as the CNR decreases. In terms of the effect of front-end bandwidth, the pseudorange error obtained by MEDLL varies greatly with the front-end bandwidth, whereas the pseudorange error obtained using the RF-based estimator is almost stable. In particular, when the front-end bandwidth of the received signals is below 6 MHz, the RF-based parameter estimator achieves much lower pseudorange error than MEDLL.

As shown in Fig. 9, the pseudorange error distributions when the 0.1 chips spacing narrow correlator is used are similar to those in Fig. 8. However, when the front-end bandwidth is set below 4 MHz, MEDLL produces even larger pseudorange errors than the case without multipath correction.

Figure 10 shows the pseudorange error distributions when the 0.05 chips spacing HRC is employed for code discrimination. In this test, both MEDLL and the RF-based estimator can still further mitigate the pseudorange error caused by multipath. It is worth noting that the three subplots in Fig. 10 show similar pseudorange error distributions, and the maximum pseudorange errors due to multipath are at the same level. This is due to the characteristics of the HRC. According to McGraw et al. (1999), the HRC is able to eliminate the pseudorange error caused by multipath whose code delay is greater than 0.1 chips.

Both MEDLL and the RF-based estimator can mitigate multipath to some extent, and the multipath mitigation performance of both is affected by the CNR of the received signal: the larger the CNR, the better the performance. The performance of MEDLL is heavily influenced by the front-end bandwidth setting, whereas the RF-based estimator is hardly affected. This can be explained by the fact that the performance of MEDLL is largely determined by the shape of the autocorrelation function. When the front-end bandwidth is low, the noise is filtered out, but the high-frequency components of the PRN code are also removed, which rounds the peak of the autocorrelation function.

Performance evaluation in three-multipath case

The simulation results in the one-multipath case have verified the superiority of the RF-based multipath parameter estimator. However, there are often multiple reflected signals in the received signal in urban environments. It is essential to test the performance of the proposed algorithm in multiple-multipath cases. Next, the performance of RF-based parameter estimator is evaluated under the three-multipath case and compared with MEDLL. Similarly, a series of simulations without multipath correction are performed as the benchmark.

Because the number of reflected signals may change due to the relative motion of the satellites and the receiver, a dataset \(D_{M2}\) containing 80,000 samples is built for RF training. The details of the multipath parameter settings in \(D_{M2}\) are shown in Table 2. \(D_{M2}\) considers multipath-free, one-multipath, two-multipath, and three-multipath examples, which are set up as follows: if the code delay of the first reflected signal is larger than 1.5 chips, the relative amplitude and code delay of the remaining two reflected signals are set to 0; if the code delay of the second reflected signal is greater than 1.5 chips, the relative amplitude and code delay of the third reflected signal are set to 0. As with the setup in dataset \(D_{M1}\), the sample points of the autocorrelation function with a spacing of 0.3 chips from −3 to +3 chips are taken as inputs \(x_{I}\), and all these samples are simulated for signals with a sampling frequency of 58 MHz and CNRs ranging from 18 to 35 dB-Hz. The label \(\theta^{k}\) of the k-th sample in \(D_{M2}\) is represented as \(\left\{ {\alpha_{I1} , \tau_{1} ,\alpha_{I2} , \tau_{2} ,\alpha_{I3} , \tau_{3} } \right\}\). The multipath parameters of each sample are also sampled uniformly within the range of values. All samples in \(D_{M2}\) are used for the training of the random forest and the number of estimators is set to 100.
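The rule that turns a three-multipath draw into a one- or two-multipath example can be sketched as follows. Only the 1.5-chip threshold and the zeroing rule come from the description above; the function names `truncate_paths` and `to_label` and the example parameter values are hypothetical, and the actual parameter ranges of Table 2 are not reproduced here.

```python
def truncate_paths(paths, cutoff=1.5):
    """Zero out all reflected signals after the first one whose delay exceeds cutoff.

    paths is a list of (alpha_Ii, tau_i) pairs ordered from the first to the
    last reflected signal, with tau_i in chips.
    """
    result = list(paths)
    for i, (_, tau) in enumerate(paths[:-1]):
        if tau > cutoff:
            for j in range(i + 1, len(result)):
                result[j] = (0.0, 0.0)
            break
    return result


def to_label(paths):
    """Flatten to the label order {alpha_I1, tau_1, alpha_I2, tau_2, alpha_I3, tau_3}."""
    return [value for pair in paths for value in pair]


# First reflected signal beyond 1.5 chips: only one multipath remains.
print(to_label(truncate_paths([(0.5, 1.6), (0.4, 0.5), (0.3, 0.7)])))
# [0.5, 1.6, 0.0, 0.0, 0.0, 0.0]

# Second reflected signal beyond 1.5 chips: two multipaths remain.
print(to_label(truncate_paths([(0.5, 0.4), (0.4, 1.7), (0.3, 0.7)])))
# [0.5, 0.4, 0.4, 1.7, 0.0, 0.0]
```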

Table 2 Details of the multipath parameter settings in the three-multipath case

We also consider the effects of CNR and front-end bandwidth. As before, a 1 chip spacing NEML discriminator, a 0.1 chips spacing narrow correlator and a 0.05 chips spacing HRC are used for code discrimination, respectively. The simulation results are shown in Figs. 11, 12 and 13. According to the results, the pseudorange errors in the three-multipath case show distributions similar to those in the one-multipath case. The difference is that MEDLL obtains its best multipath mitigation performance when the 1 chip spacing NEML discriminator is used for code discrimination, while the RF-based estimator achieves its best performance when the 0.05 chips spacing HRC is used. Besides, there are many localized extreme points in the pseudorange error distributions obtained with MEDLL. To explore why this happens, additional tests for a set of signals with a fixed front-end bandwidth are conducted.

Fig. 11 Color map of the pseudorange error distributions when 1 chip spacing NEML discriminator is used for discrimination in the three-multipath case

Fig. 12 Color map of the pseudorange error distributions when 0.1 chips spacing narrow correlator is used for discrimination in the three-multipath case

Fig. 13 Color map of the pseudorange error distributions when 0.05 chips spacing HRC is used for discrimination in the three-multipath case

Since MEDLL achieves its best performance at a front-end bandwidth of 10 MHz in the above simulations, the front-end bandwidth of the received signals is set to 10 MHz in the following tests. Considering the randomness of noise, 20 repetitions of the test are conducted. The results are shown in Figs. 14, 15 and 16. Each point in these figures is the mean value of the pseudorange errors in 20 trials, and the error bars are the standard deviations.

Fig. 14 Pseudorange errors at different CNRs when 1 chip spacing NEML discriminator is used for discrimination

Fig. 15 Pseudorange errors at different CNRs when 0.1 chips spacing narrow correlator is used for discrimination

Fig. 16 Pseudorange errors at different CNRs when 0.05 chips spacing HRC is used for discrimination

According to the results in Figs. 14, 15 and 16, the standard deviations of the pseudorange errors obtained with the RF-based estimator and with no multipath correction are small in all three groups of tests, and both the mean value and the standard deviation decrease smoothly as the CNR increases. In contrast, the mean value of the pseudorange errors obtained by MEDLL fluctuates strongly, and its standard deviations are much larger than those of the RF-based estimator and the no-correction case. These results indicate that MEDLL is more susceptible to random noise and less robust than the RF-based estimator in the multiple-multipath case. We speculate that this is also the reason for the localized extreme points in the pseudorange error distributions of MEDLL in Figs. 11 and 12.

Overall, the performance of MEDLL is also strongly affected by the front-end bandwidth of the received signals in the multiple-multipath case. The RF-based estimator demonstrates better multipath mitigation performance than both MEDLL and the no-correction case when the 0.1 chips spacing narrow correlator or the 0.05 chips spacing HRC is used for discrimination, whereas MEDLL performs better with the 1 chip spacing NEML discriminator. MEDLL is less robust in the multiple-multipath case than both the RF-based estimator and the no-correction case.

Discussion

The RF-based multipath parameter estimator takes equally spaced sample points of the autocorrelation function as its inputs. The sample points are obtained through a multi-correlator, and the sample spacing determines the number of correlators used. However, it is difficult to implement a large number of correlators in a receiver. Here, a feasibility evaluation is carried out to test the performance of the RF-based estimator with differently spaced samples as the inputs. Ten tests are conducted to evaluate the performance of the RF-based estimator and compare it with MEDLL. The sampling spacing settings and the corresponding numbers of correlators required in the different tests are shown in Table 3. The parameters of the samples in each training and testing dataset are the same as those in the one-multipath case, except for the number and the sampling spacing of the inputs. The CNR and front-end bandwidth of the received signals are set to 35 dB-Hz and 20 MHz, respectively. The RMSEs of the relative amplitude estimates and relative code delay estimates for the different sampling spacing settings are shown in Fig. 17.

Table 3 Sampling spacing settings of the multi-correlator and the corresponding number of correlators required

Fig. 17 RMSEs of multipath parameter estimates for inputs with different sampling spacing

As shown in Fig. 17, the parameter estimation error obtained by the RF-based estimator remains at the same level when the sampling spacing is no larger than 0.3 chips; when the sampling spacing is larger than 0.3 chips, the estimation error of the code delay rises slightly with increasing sampling spacing. In contrast, the estimation errors of both the relative amplitude and the code delay obtained by MEDLL increase sharply with increasing sampling spacing. The results show that the RF-based estimator is only slightly affected by the sampling spacing of the inputs, and that it retains good multipath mitigation performance even when only a limited number of correlators is used. According to the results, a sampling spacing of 0.3 chips is a good choice to balance the feasibility and performance of the RF-based estimator. The number of correlators required at this spacing is 21.
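The figure of 21 correlators follows directly from covering the interval from −3 to +3 chips with equally spaced samples. The short sketch below reproduces this arithmetic; the function name `correlator_count` is hypothetical, and the spacings other than 0.3 chips are illustrative values rather than the settings of Table 3.

```python
def correlator_count(spacing_chips, span_chips=3.0):
    """Number of equally spaced correlators covering [-span, +span] chips."""
    # round() guards against floating-point error in the division.
    return round(2 * span_chips / spacing_chips) + 1


for spacing in (0.1, 0.3, 0.5, 1.0):
    print(f"{spacing:.1f} chips -> {correlator_count(spacing)} correlators")
```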

Experiments

The above simulations are performed under ideal assumptions, i.e., the noise in the received signals is assumed to be AWGN and the autocorrelation function is not distorted except by multipath. However, the noise in real data is not exactly AWGN, and the shape of the autocorrelation function might be unpredictably deformed by various disturbances. To verify whether the proposed RF-based parameter estimator still works effectively in real urban environments, experiments on real data are conducted. The experimental setup is as follows: the data file was collected at 20 Msps using a USRP N210 in an urban area of Hong Kong (Tsim Sha Tsui East, 21/12/2022). Figure 18 shows the skyplot mask of visible Global Positioning System (GPS) satellites. The front-end bandwidth of the received signal is 2 MHz, and the average CNR is 39 dB-Hz.

Fig. 18 Skyplot mask of visible GPS satellites at the data collection site

Data preprocessing

Different from the signal model in (1), the GNSS signal model of real data is represented as

$$\begin{array}{*{20}c} {r_{\Sigma } \left( t \right) = \sqrt {2P} \mathop \sum \limits_{i = 0}^{M} a_{i} d\left( {t - \tau_{i} } \right)p\left( {t - \tau_{i} } \right)\cos \left( {2\pi f_{IF} t + \phi_{i} } \right) + n\left( t \right)} \\ \end{array}$$
(17)

where \(P\) is the signal power and \(d\left( \cdot \right)\) is the navigation message.

Correlating the received signal with the local replica yields

$$\begin{array}{*{20}c} {R_{\Sigma } \left( t \right) = \sqrt {2P} \mathop \sum \limits_{i = 0}^{M} a_{i} d\left( {t - \tau_{i} } \right)R\left( {t - \tau_{i} } \right)e^{{j\phi_{i} }} + N\left( t \right)} \\ \end{array}$$
(18)

At this time, \(R_{\Sigma I} \left( t \right)\) is represented as

$$\begin{array}{*{20}c} {R_{\Sigma I} \left( t \right) = \sqrt {2P} d\left( {t - \tau_{0} } \right)\left\{ {R\left( {t - \tau_{0} } \right) + \mathop \sum \limits_{i = 1}^{M} \alpha_{Ii} R\left( {t - \tau_{i} } \right)} \right\} + N_{I} \left( t \right)} \\ \end{array}$$
(19)

where

$$\begin{array}{*{20}c} {\alpha_{Ii} = \frac{{d\left( {t - \tau_{i} } \right)}}{{d\left( {t - \tau_{0} } \right)}}a_{i} \cos \overline{\phi }_{i} } \\ \end{array}$$
(20)

Due to the uncertainties in signal power and the presence of navigation messages, the relative amplitude of the reflected signal components with respect to the reference autocorrelation function may not lie in the range \(\left( {0, 1} \right)\). For the relative amplitude of the reflected signal to fall in \(\left( {0, 1} \right)\), so that the RF-based estimator works properly, the value of \(\sqrt {2P}\) must satisfy \(0 < \sqrt {2P} < 1\). However, this condition is not always satisfied in real data. We therefore preprocess the original autocorrelation function by letting

$$\begin{array}{*{20}c} {R^{\prime}_{\Sigma I} \left( t \right) = \beta R_{\Sigma I} \left( t \right)} \\ \end{array}$$
(21)

where

$$\begin{array}{*{20}c} {\beta = 0.2\frac{{\max \left( {R_{ref} } \right)}}{{\max \left( {\left| {R_{\Sigma I} \left( t \right)} \right|} \right)}}} \\ \end{array}$$
(22)
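The scaling in (21) and (22) can be sketched as follows. The ideal triangular reference, the 21-tap grid and the raw correlator values are assumptions made only for illustration, and the function name `scale_correlation` is hypothetical; the factor 0.2 is the one in (22).

```python
def tri(x):
    """Ideal triangular autocorrelation, used here as R_ref (an assumption)."""
    return max(0.0, 1.0 - abs(x))


def scale_correlation(r_sigma_i, r_ref, target=0.2):
    """Apply (21)-(22): scale the measured correlation so that its peak
    magnitude equals `target` times the peak of the reference."""
    beta = target * max(r_ref) / max(abs(v) for v in r_sigma_i)
    return beta, [beta * v for v in r_sigma_i]


# 21 taps from -3 to +3 chips at 0.3-chip spacing.
taps = [round(-3.0 + 0.3 * k, 10) for k in range(21)]
r_ref = [tri(t) for t in taps]

# Hypothetical raw in-phase correlator outputs: arbitrary gain, a negative
# navigation bit, and one reflection (amplitude 0.4, delay 0.45 chips).
r_raw = [-1500.0 * (tri(t) + 0.4 * tri(t - 0.45)) for t in taps]

beta, r_scaled = scale_correlation(r_raw, r_ref)
print(beta, max(abs(v) for v in r_scaled))
```

The sign introduced by the navigation bit is left untouched by this step; it is handled separately by the multiplication by \(\pm 1\).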

Since the effect of the navigation message on the autocorrelation function can be eliminated by multiplying by \(\pm 1\), and neglecting noise, the autocorrelation function of the DP signal can be restored by

$$\begin{array}{*{20}c} {R\left( {t - \tau_{0} } \right) = R^{\prime}_{\Sigma I} \left( t \right) - \left( {P^{\prime} - 1} \right)R\left( {t - \tau_{M + 1} } \right) - \mathop \sum \limits_{i = 1}^{M} \alpha_{Ii}^{\prime} R\left( {t - \tau_{i} } \right)} \\ \end{array}$$
(23)

where \(\tau_{M + 1}\) satisfies \(\left| {\tau_{M + 1} - \tau_{0} } \right| \to 0\); \(P^{\prime} = \beta \sqrt {2P}\); \(\alpha_{Ii}^{\prime} = \beta \sqrt {2P} \alpha_{Ii}\).

The multipath parameters to be estimated at time \(k\) are \(\theta^{k} = \left\{ {P^{\prime},\alpha_{I1}^{\prime} ,\alpha_{I2}^{\prime} , \cdots ,\alpha_{IM}^{\prime} ,\tau_{M + 1} , \tau_{1} , \cdots ,\tau_{M} } \right\}\).

The estimate of \(P^{\prime}\) serves two purposes: (1) to restore the amplitude of \(R\left( {t - \tau_{0} } \right)\) to unity with respect to \(R_{ref}\), and (2) to calculate the relative amplitude estimates of the multipaths. The former has no effect on multipath mitigation; on the contrary, any estimation error in \(\tau_{M + 1}\) would introduce a new error source into code discrimination. The term \(\left( {P^{\prime} - 1} \right)R\left( {t - \tau_{M + 1} } \right)\) is therefore ignored in the restoration of the DP signal, and \(R\left( {t - \tau_{0} } \right)\) is finally restored by

$$\begin{array}{*{20}c} {R\left( {t - \tau _{0} } \right) = R_{{\Sigma I}}^{\prime } \left( t \right) - \sum\limits_{{i = 1}}^{M} {\hat{\alpha }^{\prime } _{{Ii}} } R\left( {t - \hat{\tau }_{i} } \right)} \\ \end{array}$$
(24)

where \(\hat{\alpha }^{\prime}_{Ii} , \hat{\tau }_{i} \in \hat{\theta }^{k}\) are the parameter estimates of the i-th multipath at time \(k\).

The amplitude of the restored \(R\left( {t - \tau_{0} } \right)\) may be smaller than the original autocorrelation function, but this does not affect the discrimination results.
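The restoration in (24) amounts to subtracting one scaled and shifted replica of the reference autocorrelation per estimated reflection. The sketch below assumes an ideal triangular \(R\), a direct path at \(\tau_{0} = 0\), no noise, and hypothetical parameter values; the function name `restore_direct_path` is likewise hypothetical.

```python
def tri(x):
    """Ideal triangular autocorrelation R (an assumption); x in chips."""
    return max(0.0, 1.0 - abs(x))


def restore_direct_path(taps, r_scaled, estimates):
    """Apply (24): subtract each estimated reflected replica.

    estimates is a list of (scaled amplitude, delay) pairs.
    """
    return [r - sum(a * tri(t - tau) for a, tau in estimates)
            for t, r in zip(taps, r_scaled)]


taps = [round(-3.0 + 0.3 * k, 10) for k in range(21)]

# Hypothetical scaled signal: P' = 0.2 and two reflections.
p_scaled = 0.2
reflections = [(0.10, 0.45), (0.05, 0.90)]
r_scaled = [p_scaled * tri(t) + sum(a * tri(t - tau) for a, tau in reflections)
            for t in taps]

# With perfect estimates, only the direct-path term P' R(t) remains.
restored = restore_direct_path(taps, r_scaled, reflections)
peak_index = max(range(len(restored)), key=restored.__getitem__)
print(taps[peak_index], restored[peak_index])
```

The restored function keeps the amplitude \(P^{\prime}\) rather than unity, consistent with the remark above that the reduced amplitude does not affect discrimination.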

Experimental results

The preprocessed data are then processed using the open-source SDR with a conventional DLL, MEDLL and the DLL with the RF-based estimator for code tracking, respectively. Both MEDLL and the RF-based estimator use the three-multipath signal model. The filter bandwidth of the DLLs is set to 2 Hz. Here, the 1 chip spacing NEML discriminator is employed for code discrimination. The statistical results of 2D and 3D positioning over 180 epochs are shown in Tables 4 and 5.

Table 4 2D positioning statistical results
Table 5 3D positioning statistical results

According to the positioning results, the SDR using the DLL with the RF-based estimator obtains lower positioning errors than those using MEDLL and the conventional DLL, and it achieves an 8.5% lower RMSE for 2D positioning and an 8.7% lower RMSE for 3D positioning than the SDR with the conventional DLL over 180 epochs. Although the SDR with MEDLL obtains a lower mean absolute error than that with the conventional DLL, both its RMSE and its standard deviation are much larger than those of the SDR employing the DLL with the RF-based estimator or the conventional DLL.

The RF-based parameter estimator still performs well for real data collected in an urban environment with a front-end bandwidth of 2 MHz. Although the SDR employing MEDLL achieves a lower mean absolute error than the SDR with the conventional DLL in this experiment, its larger RMSE and standard deviation indicate the poor robustness of MEDLL compared to the RF-based parameter estimator.

Computational complexity

In this subsection, we analyze the computational complexity of the RF-based estimator and MEDLL. To ensure a fair performance comparison, both MEDLL and the RF-based estimator use 21 correlators from −3 to +3 chips in the above simulations and experiments, so the cost of the correlation computation is the same for both. In the following, we test the time complexity of the parameter estimation procedure for the RF-based estimator and MEDLL.

First, we test the time cost of a single parameter estimation with MEDLL and the RF-based estimator. The time costs are measured on a desktop with an Intel Core i9-12900K 24-Core Processor CPU @ 3.19 GHz and 32.00 GB of memory. The average time cost over 100 runs of both methods is shown in Table 6. As shown in Table 6, the RF-based parameter estimator spends much less time on parameter estimation than MEDLL. This is because the decision trees in the random forest make predictions in parallel, which effectively reduces the runtime. MEDLL also takes more time for the autocorrelation function correction because it identifies the signal with the shortest delay as the LOS signal before performing the correction.

Table 6 Average time cost of MEDLL and RF-based estimator for 100 runs

The training of the random forest is also time consuming, but since the model can be trained offline, it does not affect the real-time performance. As for loading the model, in practice the model only needs to be read once before use, so this cost is not considered. Once read, the random forest model is kept in memory while in use, which requires more memory than MEDLL.

We also theoretically analyze the time complexity of MEDLL and the RF-based parameter estimator. The focus here is on the trend of the time complexity as the number of multipaths increases. In the worst case, the time complexities of random forest building and prediction are \(O\left( {N_{T} K\tilde{N}^{2} \log^{2} \tilde{N}} \right)\) and \(O\left( {N_{T} N} \right)\), respectively, where \(N_{T}\) denotes the number of decision trees, \(N\) denotes the number of samples in the training set and \(K\) is the number of variables randomly drawn at each node. Here \(\tilde{N} = 0.632N\), because bootstrap samples contain, on average, 63.2% unique samples (Louppe 2014). The complexity of a multiple-output decision tree is the same as that of a single-output decision tree (Pliakos et al. 2018). Thus, the time complexity of the RF-based parameter estimator remains constant as the number of multipaths increases. As for MEDLL, its time complexity is \(O\left( {T^{M} } \right)\), where \(T\) is the number of iterations and \(M\) is the number of multipaths. Therefore, the runtime of MEDLL increases exponentially with the number of multipaths.
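The two trends above can be checked numerically with a short sketch. The 63.2% figure is the limit of \(1 - \left( {1 - 1/N} \right)^{N}\), i.e., \(1 - e^{ - 1}\). The value \(T = 10\) and the function names are hypothetical, and the cost expressions only mirror the orders of growth quoted above rather than measured runtimes.

```python
import math


def bootstrap_unique_fraction(n):
    """Expected fraction of distinct samples in a bootstrap draw of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def forest_prediction_cost(n_trees, n_train):
    """Worst-case order N_T * N; it does not depend on the number of multipaths."""
    return n_trees * n_train


def medll_cost(iterations, n_multipaths):
    """Order T^M as quoted in the text."""
    return iterations ** n_multipaths


print(f"unique fraction for N = 80000: {bootstrap_unique_fraction(80000):.4f}")
print(f"limit 1 - 1/e:                 {1.0 - math.exp(-1.0):.4f}")

for m in (1, 2, 3):
    rf = forest_prediction_cost(100, 80000)  # 100 trees, 80,000 samples
    print(f"M = {m}: RF order {rf}, MEDLL order {medll_cost(10, m)}")
```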

Conclusion

This study proposes a random forest-based multipath parameter estimator that can estimate and mitigate multipath for GNSS signals in urban environments. On one hand, the RF-based estimator demonstrates better robustness than MEDLL: its multipath mitigation performance is little affected by the front-end bandwidth of the received signals or by noise randomness. The simulation results show that the RF-based estimator exhibits much better multipath mitigation performance than MEDLL for signals with front-end bandwidths below 6 MHz. In the experiments on real data with a front-end bandwidth of 2 MHz, the SDR employing the DLL with the RF-based estimator reduces the 2D and 3D positioning RMSE by 8.5% and 8.7%, respectively, over 180 epochs compared to the SDR with the conventional DLL, whereas MEDLL does not show superiority. On the other hand, compared to the machine learning-based methods mentioned before, the RF-based estimator focuses on multipath parameter estimation rather than correlator or discriminator design. It requires only 21 correlators in total, which makes it feasible to implement in receivers. However, the current algorithm addresses only the multipath mitigation problem for pseudorange measurements; multipath mitigation for the carrier phase remains to be explored.