Introduction

Dense national and regional GNSS networks provide an outstanding opportunity to detect the kinematics of recent plate motions and crustal deformations (Caporali et al. 2009; Nocquet 2012; Devoti et al. 2017; Kenyeres et al. 2019). The removal of outlier velocity vectors that differ markedly from their local neighborhood, due to possible anthropogenic effects or monumentation issues, is vital for developing a reliable geophysical interpretation. To briefly depict the variety of available solutions, several approaches exist, from the Euler-parameter estimation-related χ2 and F-ratio tests proposed by Stein and Gordon (1984) and Nocquet et al. (2001), through the frame-invariant approach of Xu et al. (2000), to the strain rate pattern-derived solution of Araszkiewicz et al. (2016). The scope of the current research is to find a suitable classification approach that is capable of detecting outliers in GNSS velocity data to support the quality control of the EPN densification products (Kenyeres et al. 2019).

Outlier detection as a classification problem is clearly not trivial, and it significantly depends on the selection approach as described in the classification frameworks of Zhang et al. (2007) and Boukerche et al. (2020). To obtain a suitable classification, we propose an outlier detection procedure that considers GNSS velocity data only, interpreted within the well-defined classification framework as proposed by Zhang et al. (2007).

We introduce a solution to perform spatial outlier detection with multiple non-spatial attributes. This includes the characterization of the input data and its related attributes and presents the mapping functions between spatial and non-spatial attributes proposed by Lu et al. (2003a, 2003b). We present a feature engineering procedure and a related classification approach applied to GNSS data to detect and remove outliers. The proposed solution also delivers the feature extraction process, which transforms the initial attributes into features, and presents the related classification algorithm. We highlight the outlier detection capability of the proposed solution compared to existing classification algorithms on an arbitrarily generated synthetic dataset. In addition, we also demonstrate the proposed procedure on a real dataset of the EPN Densification (Kenyeres et al. 2019).

Problem statement

The characterization of the outlier detection of GNSS velocity field data is twofold. On the one hand, the research scope needs to be investigated as proposed by Zhang et al. (2007). The initial data type, the characteristics of the possible outliers, and the detection approach shall be clarified.

According to this framework, the initial data may be considered as spatially referenced samples with multiple non-spatial attributes that may contain multivariate local outliers, which behave inconsistently compared to their neighbors (Shekhar et al. 2003; Lu et al. 2003b; Zhang et al. 2007). The input data cannot be seen as a pre-labeled dataset because it has no pre-defined attribute indicating whether the corresponding sample is regular (inlier) or an outlier; consequently, an unsupervised learning approach should be applied. We assume that our non-outlier samples are derived from a multivariate normal distribution (MVN) and that the corresponding Mahalanobis distances follow a χ2q distribution with q degrees of freedom. This problem can be modeled by parametric methods, and outliers that are not derived from the model can be rejected.

On the other hand, it is important to specify the procedure of the classification and describe the processing chain of its input features. Generally, this refers to feature engineering, which consists not only of data preparation but also of creating features that better represent the underlying problem. In the following section, we propose a solution that indicates the main steps of the feature engineering described above and provides the outlier filtered GNSS velocity field.

Proposed solution

As we discussed before, feature engineering is a key aspect of creating processes and features which support outlier detection. To describe such a processing chain, we review the main input attributes, the corresponding mapping functions and the detailed evaluation procedure to obtain the input features of the classification algorithm, called feature extraction. Then we introduce the implemented classification algorithm for outlier detection.

Dataset and pre-processing

We present here the investigated attributes and describe the a priori data-related pre-processing steps. The input dataset includes GNSS station velocities, characterized by spatial and non-spatial attributes. These attributes describe the estimated velocity field expressed in the European Terrestrial Reference Frame 2000 (ETRF2000) with local tangent plane (LTP) coordinates. These attributes are as follows:

  • Spatial Point: represented for a GNSS station as \(x=\left(\varphi ,\lambda \right)\), such that latitudes \(\varphi \in \Phi \) and longitudes \(\lambda \in \Lambda \) are expressed in the ETRF2000 reference frame and sets of \(\Phi \) and \(\Lambda \) determined by the available dataset. Thus, the set of spatial points can be described as \(X=\left\{{x}_{i}:0\le i\le m-1\wedge m\in N\right\}\), where m equals the number of GNSS stations in the input dataset.

  • A priori attributes: parameter set that is used in the a priori filtering. These are the length of the time series expressed in weeks, the serial solution number of the corresponding velocity components and the ID of the GNSS station.

  • Velocity related attributes: represent the northing, easting and up (NEU) velocity components expressed in the local tangent plane (LTP) for the corresponding GNSS station x, such that \(v\left(x\right)=\left({v}_{N},{v}_{E},{v}_{U}\right)\in {R}^{3}\) for each x in X. The velocity component related standard deviations, namely \(\sigma \left(x\right)=\left({\sigma }_{N},{\sigma }_{E},{\sigma }_{U}\right)\in {R}^{3}\), are derived from the full variance–covariance matrix of the CATREF solution (Altamimi et al. 2007). The velocity and uncertainty components are expressed in mm/yr in the ETRF2000 frame.

As a data pre-processing procedure, we perform data cleaning and filtering based on the a priori information and the corresponding a priori attributes. We keep only stations with a time series length of at least 3 years, as Legrand et al. (2012) and Kenyeres et al. (2019) suggest. The resulting pre-processed dataset acts as the input for the feature engineering procedure; a minimal sketch of this filtering step is given below. The related initial attributes are the velocity components and the corresponding standard deviations for each x in the set X.
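
The following sketch illustrates this a priori filtering step, assuming the station records are held in a pandas DataFrame; the column names (ts_length_weeks, lat, lon, vN, vE, vU, sigN, sigE, sigU) are hypothetical and not part of the published dataset.

import pandas as pd

# A priori filtering sketch: keep only stations whose time series spans
# at least 3 years (~156 GPS weeks). Column names are hypothetical.
MIN_WEEKS = 3 * 52

def preprocess(stations: pd.DataFrame) -> pd.DataFrame:
    """Drop short time series and keep the initial attributes."""
    kept = stations[stations["ts_length_weeks"] >= MIN_WEEKS].copy()
    cols = ["lat", "lon", "vN", "vE", "vU", "sigN", "sigE", "sigU"]
    return kept[cols]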

Mapping functions

The different functions that describe the spatial attribute mapping to the non-spatial attributes will be defined here. These mapping functions can be interpreted as:

  • Attribute function: defined as e: X → Rd, where the attribute function e maps the spatial dataset X to the d-dimensional attribute space Rd (Lu et al. 2003a, 2003b), where d represents the number of selected attributes (six in our case) to be investigated. These attributes describe the velocity and standard deviation values registered to each x in X.

  • Feature function: defined as f: X → Rr, where the feature function f maps the spatial dataset X to a lower dimensional (d > r) feature space. We describe its details in the following Feature extraction subsection.

  • Distance function: defined as the metric \({d}_{H}:X\times X\to R\); given \(A,B\in X\), the corresponding distance between spatial points A and B is given by the Haversine metric (Pedregosa et al. 2011; Van Brummelen 2013). Please note that the resulting (great-circle) distances may be affected by a 0.5% error compared to using ellipsoidal metrics.

  • Neighborhoods: defined as the function kNN: X → X, which retrieves the \(k\in N\) nearest neighbors (kNN) of each x in X (Lu et al. 2003a) according to the distance metric defined by the Distance function (Van Brummelen 2013), using the BallTree algorithm (Pedregosa et al. 2011; Mahapatra and Chakraborty 2015). The selection of \(k\in N\) corresponds to choosing the quasi-lowest k that yields maximum classification accuracy, as presented in the Synthetic dataset subsection.

  • Neighborhood function: defined as the function g: X → Rd, representing a mapping that returns specific statistical moments (e.g., average or median) of the local spatial environment represented by the kNN neighbors of each GNSS station, for each related attribute (Lu et al. 2003a, 2003b). Both the presented neighborhoods and the neighborhood function represent the spatial environment of each x in X.

  • Comparison function: defined as the function h: X → Rd, which is responsible for comparing all features to their local environment based on the neighborhood function (g); thus, h = e − g, as Lu et al. (2003a, 2003b) suggest. The non-outlier results of the comparison function (h(x)) are assumed to follow an MVN. Besides, we also assume that the possible outlier samples do not follow the MVN; in this context, the outlier detection can be performed via parametric methods. A minimal sketch of the neighborhood and comparison mappings is given after this list.
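
The sketch below illustrates how the Neighborhoods, Neighborhood and Comparison functions could be composed with scikit-learn's BallTree and the Haversine metric; the function names and the (n × d) attribute matrix E are illustrative assumptions, not the published processing chain.

import numpy as np
from sklearn.neighbors import BallTree

def knn_indices(lat_deg, lon_deg, k=12):
    """Neighborhoods: k nearest stations under the Haversine metric.
    BallTree expects (lat, lon) in radians; the query returns the point
    itself first, so it is dropped to keep the k nearest other stations."""
    coords = np.radians(np.column_stack([lat_deg, lon_deg]))
    tree = BallTree(coords, metric="haversine")
    _, idx = tree.query(coords, k=k + 1)
    return idx[:, 1:]

def comparison_scores(E, neighbors):
    """Comparison function h = e - g: the attribute matrix E (n x d) minus
    the feature-wise median of each station's kNN environment (the
    neighborhood function g with the median moment)."""
    G = np.stack([np.median(E[nbrs], axis=0) for nbrs in neighbors])
    return E - G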

Feature extraction

After considering the characteristics of the input dataset and the related mapping functions, we present here the procedure of constructing features that explain the GNSS velocity field from the most informative aspect. This feature extraction requires a complex approach considering the characteristics of the local neighbors as well as indicating the possible outliers in the feature space.

Considering the dataset and the pre-processing steps described in the related subsection, let us denote \({X}_{CATREF}=\left\{{x}_{i}:0\le i\le m-1\wedge m\in N\right\}\), where m represents the number of samples of the GNSS stations in the CATREF output of the minimum constrained cumulative solution of dense GNSS network measurements (Altamimi et al. 2002; Kenyeres et al. 2019). XCATREF also underwent the previously discussed pre-processing step to only consider samples where the observation length is at least 3 years, as Blewitt and Lavallée (2002) suggest.

Let us consider xi, the i-th element of X, where \(0\le i\le n-1\) and \(n\in N\), and apply attribute function e(x), which maps xi to the Rd attribute space, such that:

$$ y_{i} = e\left( {x_{i} } \right) = \left( {y_{i,0} , \cdots y_{i,j} , \cdots ,y_{i,d - 1} } \right)^{T} $$
(1)

where yi represents the feature vector of the corresponding i-th x, such that the yj component represents the j-th attribute described in the set of velocity components and standard deviations. Applying (1) to all x in X, we write:

$$ e\left( X \right) = \left( {e\left( {x_{0} } \right), \cdots e\left( {x_{i} } \right), \cdots e\left( {x_{n - 1} } \right)} \right)^{T} = \left( {y_{ij} } \right) \in R^{n \times d} , $$
(2)

where the matrix indices are \(0\le i\le n-1\) for rows and \(0\le j\le d-1\) for columns.

To implement the median algorithm proposed by Lu et al. (2003a, 2003b), let us denote:

$$ g_{j} \left( {x_{i} } \right) = median\left\{ {e_{j} \left( x \right):x \in kNN\left( {x_{i} } \right)} \right\} $$
(3)

that is, the neighborhood function gj(xi) retrieves the median of the j-th features (2) of all x in the local spatial set. This set is defined by the neighborhood mapping function kNN(xi), representing the k-nearest neighborhood of the i-th x of X in the Haversine metric.

To obtain results that highlight the effects of the possible outliers, let us consider the evaluation of the comparison function, coinciding with the difference of the attribute (2) and neighborhood functions (Lu et al. 2003a, 2003b) (3):

$$ h\left( x \right) = e\left( x \right) - g\left( x \right) $$
(4)

where h(x) represents the non-spatial feature of each x spatial point in X compared to its local environment. Reasonably, the possible outliers may imply higher h(x) scores compared to the regular/non-outlier samples.

To minimize the effect of the outliers on the feature mean, we apply a feature-wise robust centering technique to obtain a quasi-standardized form of (4) via:

$$ Y_{ctr} = h\left( x \right) - median\left( {h\left( x \right)} \right) $$
(5)

where data centering is performed by feature-wise subtraction of the feature median, and the related scaling is neglected (Pedregosa et al. 2011; Raghav et al. 2020).
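
In NumPy terms, (5) amounts to the following minimal sketch; H stands for the stacked h(x) scores and is an assumed variable name.

import numpy as np

def robust_center(H):
    """Feature-wise robust centering (5): subtract the column medians of
    the h(x) scores; no scaling is applied."""
    return H - np.median(H, axis=0)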

Furthermore, we assume that the velocity outliers are due to station monumentation weaknesses and may also be related to data quality (multipath, antenna problems, external noise) issues. None of these error sources are correlated; they contribute to the results independently of each other. The combination process (Kenyeres et al. 2019) may generate a low mathematical correlation between them, but we consider it well negligible compared to the independent station-specific biases.

To remove the correlation between features, we implement an orthogonal linear transformation called Principal Component Analysis (PCA) using Singular Value Decomposition (SVD). In addition to revealing the internal structure of the dataset while finding directions (principal components) of its maximum variances, PCA is also capable of projecting a high-dimensional dataset to a lower-dimensional subspace while keeping most of the information (Jolliffe 2002; Wall et al. 2002; Salem and Hussein 2019).

The implementation of PCA is directly linked to SVD if the related principal components are derived from the covariance matrix of Yctr (Wall et al. 2002). Let us consider the centered feature matrix Yctr defined in (5), then define the empirical sample covariance matrix \(C\in {R}^{d\times d}\) of Yctr, which is proportional to the symmetric matrix (1/(n−1)) YctrT Yctr. We diagonalize C as \(C=W\Lambda {W}^{T}\), where \(W\in {R}^{d\times d}\) is a matrix built up from the eigenvectors of C, such that Wj is the j-th eigenvector of the empirical sample covariance matrix (Jolliffe 2002), and \(\Lambda \in {R}^{d\times d}\) is a diagonal matrix whose diagonal elements λj, arranged in descending order, are the eigenvalues of C. These W eigenvectors can be seen as the principal directions or axes (principal components) of the data, and projections of the data onto these principal axes return the principal component scores, i.e., the transformed variables of the data (Jolliffe 2002; Wall et al. 2002; Salem and Hussein 2019):

$$ T_{d} = Y_{ctr} W $$
(6)

where \(T_{d} \in R^{n \times d}\) represents the resulting principal component scores.

In addition, let us denote the singular value decomposition of Yctr:

$$ Y_{ctr} = U\Sigma V^{T} $$
(7)

where \(U\in {R}^{n\times n}\) can be seen as the left singular vectors and \(\Sigma \in {R}^{n\times d}\) as the rectangular diagonal matrix of the corresponding singular values, while \(V\in {R}^{d\times d}\) contains the right singular vectors of Yctr. Full SVD components are evaluated by the divide-and-conquer approach of the LAPACK algorithm (Demmel 1989; Jones et al. 2001; Pedregosa et al. 2011). As a standard result of linear algebra states, the right singular vectors correspond to the W eigenvectors of the matrix YctrT Yctr (Wall et al. 2002; Datta 2004). This leads us to substitute (7) into (6), which yields:

$$ T_{d} = Y_{ctr} \cdot W = U \cdot \Sigma \cdot W^{T} \cdot W = U \cdot \Sigma $$
(8)

where \({T}_{d}\in {R}^{n\times d}\) equals the principal component score matrix described by (6).

As Jolliffe (2002) indicates, the variance of the j-th principal component can be expressed through the related j-th eigenvalue of C. Since the singular values of Yctr are the non-negative square roots of the eigenvalues of YctrT Yctr (Datta 2004; Wall et al. 2002), and the eigenvalues of C are proportional to the eigenvalues of YctrT Yctr (Jolliffe 2002; Salem and Hussein 2019), we can discuss the variance and variance ratio of the SVD-derived principal components.

Since (8) involves all principal components, the related explained variance ratio yields 1 (Jolliffe 2002; Salem and Hussein 2019). To perform dimensionality reduction and obtain the truncated solution of (8), consider \(s={\Sigma }_{jj}=\left({s}_{0},...{s}_{j},...{s}_{d-1}\right)\in {R}^{d}\), the vector of singular values, which are the main diagonal elements of Σ. According to this, we keep the first r singular values (Salem and Hussein 2019), where the

$$ \frac{{\mathop \sum \nolimits_{j = 0}^{r - 1} s_{j}^{2} }}{{\mathop \sum \nolimits_{j = 0}^{d - 1} s_{j}^{2} }} \ge 0.98 $$
(9)

constraint (9) is satisfied with the smallest r < d. It specifies the first r singular values and the corresponding principal components whose cumulative ratio of explained variance accounts for at least 98% of the full variance of the investigated feature space.

Thus, the truncated format of (8) with the first r principal components can be generalized as:

$$ T_{r} = U_{r} \cdot \Sigma_{r} $$
(10)

where \({T}_{r}\in {R}^{n\times r}\) is the truncated principal score matrix, \({U}_{r}\in {R}^{n\times n}\) is the corresponding left singular vector matrix and \({\Sigma }_{r}\in {R}^{n\times r}\) represents the rectangular diagonal matrix of the first r largest singular values (Jolliffe 2002; Pedregosa et al. 2011; Wall et al. 2002). The resulting truncated and uncorrelated features describe the investigated features in the most informative aspect.
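
A compact NumPy sketch of (6)–(10), under the assumption that the centered feature matrix Y_ctr is available as an (n × d) array, could look as follows; it uses the thin SVD for convenience and is not the exact implementation used for the published products.

import numpy as np

def pca_scores_svd(Y_ctr, var_ratio=0.98):
    """Principal component scores via the SVD of the centered feature
    matrix, (7)-(8), truncated to the first r components whose cumulative
    explained variance ratio satisfies the 98% threshold of (9)."""
    U, s, Vt = np.linalg.svd(Y_ctr, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(ratio, var_ratio)) + 1
    T_r = U[:, :r] * s[:r]   # equals Y_ctr @ Vt[:r].T, see (8) and (10)
    return T_r, r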

To perform spatial outlier filtering, let us consider mapping the set of spatial point x in X to the non-spatial feature space described in (10), i.e.:

$$ f\left( {x_{i} } \right) = \left( {f_{0} \left( {x_{i} } \right), \cdots ,f_{j} \left( {x_{i} } \right), \cdots ,f_{r - 1} \left( {x_{i} } \right)} \right)^{T} , $$
(11)

where the feature function f(xi) maps the i-th spatial point to the feature space described in (10) and fj retrieves the j-th feature regarding the indices of \(0\le i\le n-1\) and \(0\le j\le r-1\).

Classification algorithm

In this subsection, we present the classification algorithm and the related model of the applied parametric method. According to this we denote the squared Mahalanobis distance of the features described in (11), as Lu et al. (2003a, b), Zhang et al. (2007), Leys et al. (2018) and Etherington (2019) suggest, as:

$$ d_{Mahalanobis}^{2} \left( x \right) = \left( {f\left( x \right) - \mu_{MCD} } \right)^{T} \cdot \Sigma_{MCD}^{ - 1} \cdot \left( {f\left( x \right) - \mu_{MCD} } \right) $$
(12)

where \(\mu_{MCD} \in R^{r}\) and \(\Sigma_{MCD} \in R^{r \times r}\) are respectively the robust mean vector and the robust covariance matrix estimated by the minimum covariance determinant (MCD) algorithm. The implementation of the MCD-derived robust mean vector and robust covariance matrix is requisite due to the presence of possible outliers, which can significantly affect the sample mean and covariance as well (Rousseeuw and Driessen 1999; Pedregosa et al. 2011; Zhang et al. 2007).
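
A minimal sketch of (12) based on scikit-learn's MinCovDet estimator, assuming the truncated score matrix T_r from (10) as input:

from sklearn.covariance import MinCovDet

def robust_sq_mahalanobis(T_r, random_state=0):
    """Squared robust Mahalanobis distances (12): location and covariance
    are estimated with the Minimum Covariance Determinant estimator."""
    mcd = MinCovDet(random_state=random_state).fit(T_r)
    return mcd.mahalanobis(T_r)   # returns the squared distances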

Considering that the squared Mahalanobis distance described by (12) follows a \({\chi }^{2}\) distribution with r degrees of freedom, namely \({\chi }_{r}^{2}\) (Lu et al. 2003a, b; Zhang et al. 2007), we denote the corresponding cumulative distribution function (CDF) of \(d_{Mahalanobis}^{2} \left( x \right) \sim \chi_{r}^{2}\) as \(CDF_{{\chi^{2} }} \left( {r,d_{Mahalanobis}^{2} \left( x \right)} \right)\). To perform the hypothesis testing, let us consider the null hypothesis that the sample is derived from the test distribution, in our case \(\chi_{r}^{2}\). To detect the statistically significant samples, i.e., those with an observation probability at least as extreme under the \({\chi }_{r}^{2}\) distribution, we consider p-values below a pre-defined significance level (\(\alpha =1e-8\)). Samples with low p-values imply statistical significance, which allows us to reject the null hypothesis at the given significance level. To obtain such p-values (\(\wp \)), we subtract \({CDF}_{{\chi }^{2}}\) from 1:

$$ \wp \left( x \right) = 1 - CDF_{{\chi^{2} }} \left( {r,d_{Mahalanobis}^{2} \left( x \right)} \right) $$
(13)

where the resulting p-value can be utilized during the classification. Thus, the generalized classification function with a given significance level (\(\alpha \)) can be expressed with (13) as a hypothesis test:

$$ clf\left( x \right) = \left\{ {\begin{array}{*{20}c} { - 1} & {if\quad\wp \left( x \right) \le \alpha } \\ 0 & {else} \\ \end{array} } \right. $$
(14)

where the statistically significant samples with \(\wp \left( x \right) \le \alpha\) are flagged with −1 and act as spatial outliers, since they do not follow the test distribution. Regular samples pass (14) with a zero flag.
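
In SciPy terms, (13) is the survival function of the χ2 distribution, and (14) reduces to a thresholding step; the sketch below assumes the squared distances d2 and the reduced dimension r from the previous steps.

import numpy as np
from scipy.stats import chi2

ALPHA = 1e-8   # significance level used in this study

def classify(d2, r, alpha=ALPHA):
    """p-values via the chi^2 survival function (13) and the resulting
    flags of (14): -1 for spatial outliers, 0 for regular samples."""
    p_values = chi2.sf(d2, df=r)   # equals 1 - CDF
    return np.where(p_values <= alpha, -1, 0), p_values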

Application of the proposed solution to a synthetic dataset

We demonstrate the capabilities of the outlier detection procedure presented above by applying it to a synthetic dataset. The characteristics of the artificially generated velocity field and the evaluation strategy are shown, and the efficiency of the proposed filtering process is analyzed.

Synthetic dataset

The motivation to present such an artificially created and pre-labeled dataset is to compare our proposed procedure to other methods and estimate its prediction accuracy.

The generated synthetic velocity field is built up from a regular part contaminated with systematic biases and random errors. The regular part is expected to be conserved in the outlier detection procedure, while the errors are supposed to be marked as outliers. Regarding spatial density, three sample sets were generated (sparse, normal and dense cases), each representing a fraction of the initial synthetic dataset obtained via a stratified K-fold split approach that preserves the sample percentage of each pre-defined class (Pedregosa et al. 2011). The presented synthetic datasets are available in Appendix I of the Supplementary Information and visualized in Fig. 1.
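
One possible way to derive such class-preserving subsets with scikit-learn's StratifiedKFold is sketched below; the cumulative-union construction of the sparse, normal and dense cases is an illustrative assumption rather than the exact recipe used for Appendix I.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def density_subsets(X, y, random_state=0):
    """Split the synthetic dataset into three folds that preserve the
    inlier/outlier proportions, then build sparse, normal and dense
    subsets as cumulative unions of those folds."""
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
    folds = [test_idx for _, test_idx in skf.split(X, y)]
    sparse = folds[0]
    normal = np.concatenate(folds[:2])
    dense = np.concatenate(folds)   # the full dataset
    return sparse, normal, dense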

Fig. 1
figure 1

Synthetic horizontal velocity field test cases with normal, dense and sparse sampling. Color labels indicate the pre-defined supervised classes of inliers (blue) and outliers (red)

Parameter test

As we highlighted in the previous Mapping functions subsection, it is requisite to determine the number k of stations to be involved in the kNN Neighborhoods mapping function. Here we demonstrate the strategy for evaluating such \(k\in N\). To obtain the quasi-lowest k that yields maximum, or at least 95%, classification accuracy, we executed the proposed RMD-based outlier detection over a list of parameter settings for the above three cases. The evaluated accuracy corresponds to the accuracy score implemented by Pedregosa et al. (2011). The related results are displayed in Fig. 2:

Fig. 2
figure 2

Classification accuracy scores of parameter testing (k). Given test cases of the k number of nearest neighbors are also indicated

As Fig. 2 implies, the selection of k = 12 can be seen as a reasonable choice, since the classification accuracy exceeds 95% at k = 12 in all test cases. In contrast, selecting k < 12 leads to a steep decrease in the accuracy score, while selecting k > 12 yields approximately 98.5% classification accuracy at the cost of a larger kNN size.
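
The parameter test itself can be written as a simple sweep; in the sketch below, classify_fn(k) stands for any hypothetical wrapper around the RMD pipeline sketched earlier that returns the −1/0 flags for a given neighborhood size.

from sklearn.metrics import accuracy_score

def parameter_test(classify_fn, y_true, k_values=range(2, 31)):
    """Evaluate the classification accuracy of the RMD pipeline for a
    range of neighborhood sizes k against the pre-defined labels of the
    synthetic dataset."""
    return {k: accuracy_score(y_true, classify_fn(k)) for k in k_values}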

Comparison between the proposed solution and existing solutions

To rank the performance of the proposed RMD-based outlier filtering approach, we implemented and evaluated several existing supervised classification algorithms (Amancio et al. 2014; Uddin et al. 2019; Varoquaux et al. 2020). Using the synthetic dataset, we generated the accuracy scores of the presented cases and compared them to our RMD-based approach. The evaluated accuracy corresponds to the accuracy score implemented by Pedregosa et al. (2011), the same as discussed in the previous subsection. The corresponding results of the RMD-based approach compared to the supervised classification approaches are summarized in Table 1.

Table 1 Summary of comparison of classification accuracy scores of the test cases. Abbreviations stand for: RMD – Robust Mahalanobis Distance (proposed solution), KNN – k-Nearest Neighbors, Linear SVM – Linear Support Vector Machine, RBF SVM – Radial Basis Function Support Vector Machine, DT – Decision Tree, RF – Random Forest, NNET – Neural Network, ADA – AdaBoost, NB – Naive Bayes and QDA – Quadratic Discriminant Analysis. Related parameter sets of the supervised classifiers are derived from Varoquaux et al. (2020)

Advantages and disadvantages

In this subsection, we discuss the advantages and disadvantages of the proposed solution and the demonstrated supervised techniques as well. These empirical considerations are based on the comparison results, which are presented in the previous subsection.

The theoretical details and limitations of the demonstrated supervised classification approaches are presented in Demsar (2006), Amancio et al. (2014) and Uddin et al. (2019). As Amancio et al. (2014) discuss, the kNN classifier mostly outperforms the other investigated methods, even in higher dimensions. Related to this, Uddin et al. (2019) highlight that Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM) classifiers can also have superior prediction accuracy. This corresponds to our empirical results as well: kNN has one of the largest accuracy scores, in parallel with the DT and RF classifiers. On the other hand, it is important to note that these supervised approaches may require parameter optimization via parameter tuning using grid search or other optimization techniques. Most importantly, each demonstrated supervised technique requires pre-labeled data for the learning process.

Concerning the advantages and disadvantages of the proposed RMD outlier filtering approach, we emphasize that it does not require any pre-defined labels. Moreover, due to the application of a parametric method and the presented feature engineering procedure, it is capable of detecting outliers with multiple attributes. The classification accuracy score of the RMD-based approach is comparable to the KNN, DT and RF classifiers in the three investigated cases. As far as its disadvantages are concerned, we must mention that using \({\chi }_{r}^{2}\) may be too conservative an approach; it may flag many inliers as outliers, as Hardin and Rocke (2005) discuss. To overcome this problem, we interpreted the outliers at the \(\alpha =1e-8\) significance level. Besides, the proposed solution is only designed for spatial outlier detection, namely to define inlier and outlier clusters from the non-spatial features. Its main advantage is that it does not require pre-defined labels in the input dataset. Also, the comparison of the proposed unsupervised RMD-based approach to the described supervised methods is possible only because the synthetic dataset (Fig. 1) has pre-defined labels.

Application of the proposed solution to real dataset

As a further innovation, we analyzed the current multi-year velocity result of the EPN Densification dataset up to GPS week 2050 to demonstrate the capabilities of the introduced RMD-based classification algorithm on a real dataset as well. The input dataset is available in the Appendix II of the Supplementary Information. This European continental-scale multi-year solution includes 31 national network position and velocity solutions of 26 European GNSS Analysis Centers (ACs), combined with CATREF software (Altamimi et al. 2007) following the processing chain described in Kenyeres et al. (2019). Both horizontal and vertical (UP) velocity field solutions are expressed in the European Terrestrial Reference Frame 2000 (ETRF2000) (Altamimi et al. 2017).

According to this, we illustrate the RMD-based outlier detection technique on the tectonic unit of the Apennine Terrane (Serpelloni et al. 2007; Carminati et al. 2010; Palano et al. 2014; Minissale et al. 2019). The related tectonic settings are presented and described, and are also available in vector format, in Appendix III of the Supplementary Information. The related horizontal velocity field is visualized in Fig. 3.

Fig. 3
figure 3

Horizontal velocity field of Apennine Terrane tectonic unit before the RMD based outlier detection. GNSS station GRO1 (red) and its k-nearest neighborhood (yellow) are also indicated

After the steps described in the Dataset and pre-processing subsection above, we perform the spatial mapping of the spatial dataset to the attribute space as expressed in (2). To detect spatial outliers in the feature space resulting in (11), we implement the median algorithm proposed by Lu et al. (2003a, b), as discussed in (3)–(4). As the next step, to obtain the principal component scores, we implemented (5)–(8), as Fig. 4 illustrates.

Fig. 4
figure 4

Scatter plots of the feature engineering procedure up to PCA. The top panel indicates the robust-centered h(x) scores (5) of the velocity components (NEU → ENU). The middle panel represents the robust-centered h(x) scores of the corresponding standard deviations. The bottom panel indicates the space of the scores of the (first three) principal components (11). The test sample (green label) and its kNN environment (graphs) are also highlighted

Corresponding dimensionality reduction can be achieved by implementing (9)–(11), i.e., keeping the first r principal components where the resulting cumulative variance ratio explains at least 98% of the full variance, as Fig. 5 illustrates.

Fig. 5
figure 5

Scree plot of the dimensionality reduction of the initial attributes using PCA. The first four principal components explain 99.976% of the variance

Considering the bottom panel of Fig. 4, these features highlight the possible spatial outliers. By assuming that the regular samples follow the MVN and calculating the related squared Mahalanobis distances (12) corresponding to \({\chi }_{r}^{2}\) (Fig. 6), a parametric classification model can be obtained (13)–(14).

Fig. 6
figure 6

χ2-related statistics of the Apennine Terrane tectonic unit, representing the corresponding Mahalanobis-distance fitting to χ2 model. The top panel shows the theoretical and observed PDF values. The bottom panel represents the related CDF and the SF (survival) functions

According to these results, 15 out of 257 samples can be identified as spatial outliers in the presented Apennine Terrane tectonic unit. The related horizontal velocity field is illustrated in Fig. 7.

Fig. 7
figure 7

Results of the RMD based classification indicating the regular (blue) and outlier (red) horizontal velocities. Outlier detection yields 15 outliers out of 257 samples

As Fig. 7 implies, some of the detected outliers are affected by errors in the horizontal and/or vertical components, as well as by extreme values of the related standard deviations. The proposed outlier detection procedure is applied to the current EPN densification solution for each tectonic unit described in Appendix III of the Supplementary Information. The results are available on the co-developed and regularly updated webpage at https://epnd.sgo-penc.hu/velociraptor/.

Conclusions

We implemented the presented feature engineering process and the related RMD-based classifier to solve the outlier detection of GNSS velocity field data. In order to estimate the performance of the proposed RMD-based outlier detection algorithm, we generated a pre-labeled synthetic dataset with different error contributions and with three different spatial resolutions. We performed a parameter test on the number of the k-nearest neighbors (k) and found that the classification procedure reaches > 95% prediction accuracy with the lowest k-value at k = 12. We also presented a comparison of the unsupervised RMD-based approach to supervised classifiers. This highlighted that the about 94.5% mean prediction accuracy of the RMD technique is 1.7–5.3% less than the accuracy of the tested supervised classifiers, which yielded 96.2–99.8% mean accuracy scores. The main difference is that the investigated set of supervised techniques requires pre-defined outlier labels, while the proposed RMD-based approach works without any a priori labeling.

As a further innovation, we integrated the proposed RMD-based method into the EPN Densification products workflow as a quality checking algorithm to remove local outliers. We demonstrated the outlier detection procedure step by step on real data related to the tectonic unit of the Apennine Terrane. In addition, we launched a webpage that interactively displays the current EPN Densification solution filtered with the proposed RMD technique. Regarding the scope of further development, it is reasonable to test the RMD-based classification on different datasets (Kreemer et al. 2014), as well as to involve more (even time-domain-related) parameters and attributes of the corresponding multivariate GNSS velocity data, like the non-diagonal elements of the full VC matrix.