Introduction

Technologies related to the acquisition of spatial data have grown exponentially and are still following this trend today. Spatial data are enabled when the information recorded by a sensor is linked to a conventional spatial reference system, usually cartographically defined as a coordinate reference system (CRS); such information is referred to as geoinformation. This makes it possible to map information from the CRS to the real world and vice versa. Global Navigation Satellite Systems (GNSS), formerly available solely for military applications through the United States' Global Positioning System (GPS) constellation, are now publicly accessible from several providers and with unprecedented accuracy. Accurate GNSS, together with a trend towards lighter, less expensive and metrically more accurate sensors, produces high volumes of geospatial data. Crowd-sourcing solutions and sensors distributed in smart cities create and use large volumes of spatial data [1]. Datasets of unstructured points are a common direct or indirect output of such technologies.

Analysis of point clouds has become a focus of scientific investigation, also due to laser scanner technology. Laser scanners, from fixed, mobile or airborne platforms, can acquire several thousand points per second, sampling objects and creating 3D representations. Laser-based 3D measurement technology is still improving at a fast rate; an example is the introduction of single photon-count sensors [2], which multiply the number of measurements a sensor can provide per unit of time, potentially producing even larger datasets. Datasets with a large number of unstructured points can also be produced in a photogrammetric workflow, e.g. via dense matching after aligning images using structure from motion (SfM) [3]. The datasets analysed in this paper are derived from photogrammetry and from direct GNSS measurements, but the approach can also be applied to datasets from laser scanners.

In this scenario, outliers play an important role in the first phases of processing: a point dataset must be cleaned of outliers for the subsequent modelling steps to be successful. Optimal outlier removal has been thoroughly investigated [4,5,6,7,8] and remains a subject of investigation in many fields, such as fraud detection, medicine, pattern recognition and measurement error detection. Methods can be divided into supervised [9] and unsupervised; the two methods tested here belong to the unsupervised category.

Test data

Nowadays many spatially-enabled sensors can produce massive datasets that easily contain millions of points with attributes. In this study two quite different examples of such surveys were tested. The first dataset is the product of a photogrammetric procedure (SfM) for creating a 3D model from overlapping imagery taken by a remotely piloted airborne system (RPAS). The second dataset consists of trajectory data collected from vehicles every second via GNSS. This type of data is commonly referred to as Floating Car Data (FCD) and is becoming a very important part of smart-city frameworks.

SfM point dataset

A dense point dataset can be obtained from dense matching after reconstruction of a sparse 3D scene via photogrammetric SfM techniques [10]. This remote sensing method has strong support from open source libraries and software [11]. In this investigation, images were acquired with an RPAS flight carried out in July 2017 over an area with dense conifer forest, grass and some buildings, as shown in Fig. 1. The final point density is ~31 points per square meter. This dataset was chosen because it contains many characteristics that make defining outliers challenging: the terrain surface has flat and steep parts, some areas have dense vegetation while others have none, and there are buildings and roads. From Fig. 1, top right, it is also evident that there are several outliers, i.e. points clearly not belonging to either the ground plane or the top surface. To have full control, outliers were removed manually to create a digital surface model (DSM) (Fig. 1, bottom left and right, and Fig. 2 in green). CloudCompare [12] was used to manually produce the clean DSM.

Fig. 1

Point clouds from RPAS flight – outliers are clearly visible on the top, cleaned dataset is shown in the bottom as meshed surface

Fig. 2

SfM dataset – left: clean DSM surface in green, random unclustered outliers in red, clustered random outliers in blue – right: only outliers

Artificial outliers were added to define a final control dataset (Fig. 2). Two types of outliers were created: (i) randomly positioned single points at a distance of between 1 and 200 m from the DSM, and (ii) randomly positioned clusters of 2 to 30 points, with the cluster centre placed randomly between 2 and 200 m above the DSM (red and blue respectively in Fig. 2). R [13] was used to simulate the outliers and add them to the dataset by randomly picking a non-outlier point and transforming its position according to the rules described above.
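The following R sketch illustrates this simulation; the sampled counts, the vertical direction of the single-point offsets and the scatter of cluster members are assumptions for illustration, not the exact code used.

# Sketch of the artificial-outlier simulation. Assumptions: 'dsm' is an
# n x 3 matrix of clean DSM coordinates; offsets are drawn uniformly and
# applied to the Z coordinate; counts and cluster spread are illustrative.
set.seed(42)

# (i) unclustered outliers: displaced single points, 1-200 m from the DSM
n_single    <- 500
picked      <- dsm[sample(nrow(dsm), n_single), ]
picked[, 3] <- picked[, 3] + runif(n_single, min = 1, max = 200)

# (ii) clustered outliers: 2-30 points per cluster, centres 2-200 m above the DSM
n_clusters <- 50
clusters <- do.call(rbind, lapply(seq_len(n_clusters), function(i) {
  centre    <- dsm[sample(nrow(dsm), 1), ]
  centre[3] <- centre[3] + runif(1, min = 2, max = 200)
  n_pts     <- sample(2:30, 1)
  # scatter cluster members around the centre (spread is an assumption)
  sweep(matrix(rnorm(n_pts * 3, sd = 0.5), ncol = 3), 2, centre, `+`)
}))

control <- rbind(dsm, picked, clusters)  # final control dataset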

FCD – Floating Car data

Most movement in an urban environment is constrained to the road network. Thanks to recent developments in navigation technology, GNSS sensors now represent a low-cost, efficient and already widespread tool for collecting such movement information from different types of objects, including pedestrians and vehicles (cars, bicycles, buses, ...) [14], especially when compared with more traditional traffic monitoring methods such as loop detectors or automatic plate number recognition [15]. GNSS sensors can record at high rates, e.g. one position per second, so that the continuous movement of the tracked object is recorded as a trajectory containing a sequence of sampled points. This type of surveying is extremely important for estimating hazard situations, e.g. integrated with remote sensing [16] or with geographic information systems (GIS) [17, 18].

This type of data is gaining importance as new paradigms are implemented in real scenarios. Big-data processing for smart cities can be applied to high volumes of data from multiple sensors, which are analysed to obtain in-depth information on the many dynamic aspects of mobility and other factors.

Such data can be corrupted by noise [15] due to the well-known problems encountered by GNSS in urban environments (e.g. obstructions, multipath). Critical information can be extracted only if a proper preliminary cleaning of spurious data/outliers is performed. To underline this key step, in this work the FCD of the public transportation system of the city of Turin (Italy) were analysed. The preliminary step for the impedance map calculation consisted of removing all information not referable to the actual path of the lines (see Fig. 3).

Fig. 3

FCD of BUS lines in urban environment. Right is BUS line 11, left is BUS line 39 – outliers are in green

To test outlier detection, the FCD from two bus lines were used: line 11 and line 39. The methods were applied to 2D and 3D data: the two dimensions of the 2D case were the geospatial positions (latitude and longitude provided by GNSS), while the 3D case added the estimated velocity of the vehicle at each point as the third dimension.

Methods

Many outlier detection methods exist in the literature; in this study we focus on unsupervised methods based on local density metrics of points. The rationale behind the two tested methods is that, in large datasets of 3D points, the number of outliers is much lower than the number of correct points, and the correct points are more densely clustered than the outliers; outliers can therefore be detected with metrics that represent the mutual distance between neighbouring points. In the next sub-sections the two methods are described in depth.

According to the definition by Hawkins [5], “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”. In point datasets from laser scanners, SfM and other spatial sensors, outliers can be produced by incorrect processing, multipath, or unwanted objects [4] such as birds or dust particles. In SfM in particular, which produced the first dataset, outliers can come from mismatches of keypoint descriptors, which are common when using a small number of targets or none at all – e.g. with smartphones – or when the image geometry is sub-optimal [19, 20].

There are several ways to remove outliers, using unsupervised, semi-supervised or even manual methods. Many users still prefer to remove outliers manually [21], but here the target is a high degree of automation, so the two tested methods are unsupervised.

We tested two methods: (i) statistical outlier removal (SOR) and (ii) local outlier factor (LOF). Four predictors – two per method – were tested: the SOR and LOF values of each point, and the absolute differences of the SOR and LOF values from their respective medians, referred to as SOR2 and LOF2. The hypothesis behind these last two predictors is that most points are correct, so the median of the distribution of SOR and LOF values reflects correctness, and points whose SOR or LOF value is far from the median are likely outliers. The threshold for optimal results is calculated using ROC curves and applied to flag outliers.
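As a minimal sketch (variable names are ours, not from the paper's code), the two derived predictors and the final flagging step look as follows:

# sor_val, lof_val: per-point SOR and LOF values (numeric vectors)
sor2 <- abs(sor_val - median(sor_val))  # SOR2: absolute deviation from median
lof2 <- abs(lof_val - median(lof_val))  # LOF2: absolute deviation from median

# flag outliers with the ROC-derived threshold (T_sor2 is a placeholder
# for the threshold selected in the ROC section)
is_outlier <- sor2 > T_sor2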

All the methods described above require finding the K nearest neighbours (KNN) of each point. Building a spatial metric structure over large point sets is critical for detecting KNN in an acceptable time span. K-d tree structures and approximate nearest-neighbour search methods are used internally by the implementations: the nabor package for R (wrapping libnabo) [22], and the fast library for approximate nearest neighbours (FLANN) [23] in the Point Cloud Library (PCL) [24].
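A minimal example of the KNN query with the nabor package; note that when the query set equals the data set, the first neighbour of each point is the point itself, so K + 1 neighbours are requested and the self-match is dropped:

library(nabor)  # R wrapper around libnabo (k-d tree built internally)

# pts: n x 3 matrix of point coordinates
K  <- 60
nn <- knn(data = pts, query = pts, k = K + 1)
nn_dists <- nn$nn.dists[, -1]  # drop the zero self-distance in column 1
nn_idx   <- nn$nn.idx[, -1]    # indices of the K nearest neighbours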

Statistical outlier removal (SOR)

The SOR method is a distance-based approach that assigns to each point a probability of being an outlier by comparing its distances to its neighbours. The statistic used here is the local density, calculated from the distances to a user-defined number K of nearest neighbours [8] (in this paper referred to as KNN). By definition, outliers should be significantly distant from the main distribution of points (see Fig. 4). The SOR filter for this work was implemented as an R function using the nabor package [22] for fast calculation of KNN distances; the SOR filter is also fully implemented in PCL.
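A sketch of the per-point SOR statistic, assuming (consistently with the threshold interpretation given in the Results) that it is the mean distance to the K nearest neighbours, reusing nn_dists from the query above:

# per-point SOR value: mean distance to the K nearest neighbours
sor_val <- rowMeans(nn_dists)

# classic SOR filtering (as in PCL) keeps points whose mean distance does not
# exceed the global mean by more than a multiple of the standard deviation;
# the multiplier (here 1.0) is a user choice
keep <- sor_val <= mean(sor_val) + 1.0 * sd(sor_val)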

Fig. 4

Schema of nearest neighbours with LOF reachability distance for KNN = 3 (modified from Breunig et al. [25])

Local outlier factor (LOF)

The local outlier factor (LOF) algorithm, as described in [25, 26], is an unsupervised method that assigns a score to each point by computing its local density deviation with respect to its neighbours in a cluster. An outlier, or a group of outliers, has a substantially lower density than its neighbours and thus a LOF value significantly greater than the rest (see Eq. 4).

The number of neighbours is typically chosen to be greater than the minimum number of points a cluster can contain, so that other points can be local outliers relative to this cluster. In practice, such information is available if the user is knowledgeable about the data. This is likely in the two presented cases, as the SfM point density and the GNSS recording rate provide estimates of the respective point densities. The LOF method also has the advantage of limiting statistical fluctuations [27].

Fundamentally, three steps are necessary to extract the LOF value of each point. First, for each point Pi the distance to each of its K nearest points Pj is calculated and defined as K.dist:

$$ K.{dist}_{i,j}= dist\left({P}_i,{P}_j\right) $$
(1)

where the K-distance of point Pi is the distance between Pi and its Kth nearest point Pj.

The second step calculates the reachability distance (R.dist) between every point and each of its K neighbours. The reachability distance is the maximum of two values: the K-distance of the considered neighbour and the actual distance between the point and that neighbour (see Fig. 4).

$$ R. dist\left({P}_i,{P}_{K^{th}}\right)=\max \left(K.{dist}_{K^{th}}\left({P}_{K^{th}}\right);K.{dist}_i\right) $$
(2)

The local reachability density (LRD) of each point is then defined as the inverse of the average reachability distance of point Pi to its neighbours. In Eq. 3, the numerator is the cardinality of the KNN set Nk(Pi).

$$ LRD\left({P}_i\right)=\frac{\left\Vert {N}_k\left({P}_i\right)\right\Vert }{\sum_{P_j\in {N}_k\left({P}_i\right)}R. dist\left({P}_i,{P}_j\right)} $$
(3)

In the last step, the LOF value of each point is calculated by comparing the LRD value of the point with the LRD values of its K neighbours:

$$ LOF\left({P}_i\right)=\frac{\sum_{P_j\in {N}_k\left({P}_i\right)}\frac{LRD\left({P}_j\right)}{LRD\left({P}_i\right)}}{\left\Vert {N}_k\left({P}_i\right)\right\Vert } $$
(4)

In this work the LOF method was implemented as a new filter in the Point Cloud Library (PCL); the source code is available in a GitHub repository for inclusion in PCL builds [28]. PCL is a “standalone, large scale, open project for 2D/3D image and point cloud processing. PCL is released under the terms of the BSD license, and thus free for commercial and research use” [24, 29]. PCL provides an ideal framework for processing large point datasets. Finding nearest neighbours is an essential step in these methods; spatial metric structures allow approximate nearest-neighbour matching with binary trees and are implemented in PCL via the FLANN library [30, 31].
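For illustration, a compact R transcription of Eqs. 1–4 built on the same KNN query shown earlier (the paper's actual filter is the C++ implementation in PCL [28]); it is vectorised but not otherwise optimised:

# LOF following Eqs. 1-4; nn_idx / nn_dists as returned by nabor::knn
# with the self-match removed; n points, K neighbours per point
n      <- nrow(nn_idx)
k_dist <- nn_dists[, K]  # Eq. 1: distance to the Kth nearest neighbour

# Eq. 2: reachability distance of Pi w.r.t. each neighbour Pj is the
# maximum of Pj's K-distance and the actual distance dist(Pi, Pj)
reach <- pmax(matrix(k_dist[nn_idx], nrow = n), nn_dists)

# Eq. 3: local reachability density = K / sum of reachability distances
lrd <- K / rowSums(reach)

# Eq. 4: LOF = average ratio between the neighbours' LRD and the point's LRD
lof_val <- rowSums(matrix(lrd[nn_idx], nrow = n)) / (K * lrd)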

Software implementation

A PCL build was integrated into a graphical user interface (GUI) for processing two data formats: LAS/LAZ point clouds and the SQLite database format. The former was used for the SfM dataset and implemented using LASlib (with LASzip), a “C++ programming API for reading / writing LIDAR data stored in standard LAS or in compressed LAZ format (1.0 - 1.3)” [32]. Both LASlib and LASzip are released under the terms of the GNU Lesser General Public License. The latter format, SQLite, was used to read the FCD data and was implemented using the dedicated public-domain library: “SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine” [33]. The GUI was developed in C++ using the Qt framework. The PCL build includes the local implementation of the LOF method, which was thus applied to the analysed point datasets via the GUI. The GUI also provides information on the process via a log (see Fig. 5).
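For reference, FCD can also be read from SQLite in a few lines of R (the GUI itself uses the C++ library); the file, table and column names below are hypothetical:

library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver

# 'fcd.sqlite' and the table/column names are hypothetical
con <- dbConnect(RSQLite::SQLite(), "fcd.sqlite")
fcd <- dbGetQuery(con, "SELECT lat, lon, speed FROM fcd WHERE line = 11")
dbDisconnect(con)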

Fig. 5

LOF implementation GUI

Results and discussion

The best method and combination of parameters (KNN and threshold) is the one with the highest number of true positives and true negatives and the lowest number of false positives and false negatives. In this investigation the points to be detected are outliers, therefore positives are outliers and negatives are inliers. Two types of errors are possible when predicting a binary response (inliers vs. outliers): false outliers (type I errors, measured by the false positive rate, FPR) and false inliers (type II errors, i.e. missed outliers, measured by the false negative rate, FNR). Particular attention is given here to false inliers (FN): points which are outliers but are incorrectly classified as inliers. This is because, in further processing of point datasets, this type of error has worse consequences than false outliers. The Receiver Operating Characteristic (ROC) curve is used to assess overall performance and to define the best-performing threshold, and the false negative rate is analysed in depth.

ROC curves

“The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a continuous measurement for predicting a binary outcome” [34]. In particular, it allows intuitive evaluation of a metric at varying thresholds, which is exactly what is needed in this study, where continuous metrics (SOR and LOF) are used to discriminate between outliers and inliers. ROC curves have long been used in signal detection theory and its applications [35]. For a predictor consisting of a single continuous measurement, convention dictates that a point tests positive as an outlier when the value of its predictor (LOF or SOR) exceeds a fixed threshold T:

$$ T = \phi \in \mathbb{R}, \qquad \phi \in \left[ V_{\min}, V_{\max} \right] $$
(5)

where V is the value of LOF, LOF2, SOR or SOR2; Vmin is the lowest and Vmax the highest value in the set.

The two axes of the ROC graph are respectively:

$$ \begin{aligned} ROC_{x1}(\phi) &= FPR_{outlier}(\phi) = \frac{FP(\phi)}{FP(\phi) + TN(\phi)} \\ ROC_{x2}(\phi) &= TNR_{outlier}(\phi) = 1 - FPR_{outlier}(\phi) \\ ROC_{y1}(\phi) &= TPR_{outlier}(\phi) = \frac{TP(\phi)}{FN(\phi) + TP(\phi)} \\ ROC_{y2}(\phi) &= FNR_{outlier}(\phi) = 1 - TPR_{outlier}(\phi) \end{aligned} $$
(6)

Since the threshold T has to be determined, TPR is plotted as a function of FPR for all possible values of V. This is applied to the SfM dataset and to the two FCD datasets to determine the optimal value of T in each case. The optimal T is chosen from the ROC curve of the configuration providing the highest area under the curve (AUC). The AUC is a single combined measure of sensitivity and specificity, allowing effective comparison between results [36]. Specific results for the two datasets are reported in the next sections.
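A self-contained R sketch of this computation, sweeping the threshold over all observed predictor values and integrating the curve with the trapezoidal rule (equivalent results can be obtained with packages such as pROC); it is quadratic in the number of unique values and meant for illustration only:

# ROC curve and AUC for a continuous predictor 'score' (e.g. SOR2 or LOF2)
# and a logical ground truth 'is_out' (TRUE = outlier)
roc_auc <- function(score, is_out) {
  thr <- sort(unique(score), decreasing = TRUE)  # candidate thresholds
  tpr <- sapply(thr, function(t) mean(score[is_out]  > t))  # TPR(phi)
  fpr <- sapply(thr, function(t) mean(score[!is_out] > t))  # FPR(phi)
  # close the curve at (0,0) and (1,1), then trapezoidal integration
  fx <- c(0, fpr, 1); fy <- c(0, tpr, 1)
  auc <- sum(diff(fx) * (head(fy, -1) + tail(fy, -1)) / 2)
  list(thr = thr, tpr = tpr, fpr = fpr, auc = auc)
}

res <- roc_auc(sor2, truth)  # 'truth' flags the simulated outliers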

SfM point dataset

The plots in Figs. 6 and 7 allow interpretation and discussion of the performance of the four predictors applied to the SfM dataset. The overall best performance was achieved by SOR2 and LOF2. In both cases higher values of KNN improve accuracy, up to KNN = 50, beyond which there are only very slight improvements.

Fig. 6

ROC Curves of the four predictors

Fig. 7

AUC values and best threshold values calculated from ROC curves at different KNN values for all four predictors

Figures 6 and 7 show SOR2 and LOF2 as having similar accuracies for outlier detection but, as mentioned in the data description, two classes of outliers were artificially added to the SfM dataset: clustered and unclustered (see Fig. 2). A more in-depth analysis can thus be carried out to assess whether the methods behave differently with respect to the two classes. Manually inserting outliers gives full control; e.g. the distance of each outlier from the DSM is known, so a correlation between detection rate and distance from the inliers could be tested. Table 1 reports the number of undetected (FN) outliers for each class, and Table 2 provides the complete confusion matrix. It is worth noting that the methods behave differently with respect to the outlier class. Almost all unclustered outliers were detected by SOR, SOR2 and LOF2, whereas LOF had a very low detection rate. The opposite is true for clustered points, where LOF2 provides the best result (lowest FNR – see Table 1). The reachability distance used in LOF to determine local density is designed to produce more stable results within clusters [25]. This is clearly inferred from the low AUC values obtained when KNN is too low (Fig. 7): when KNN is below the number of points per cluster used for the artificial clustered outliers (30 points), AUC is low; accuracy increases significantly in LOF2 once KNN exceeds this value.

Table 1 Number of undetected (FN) outliers and false negative rate for the two classes, clustered and unclustered (see Fig. 2)
Table 2 Confusion matrix of results where two outlier classes, clustered and unclustered (see Fig. 2), are defined in the control data and matched against the results of the four predictors – in green the numbers of correctly detected outliers/inliers (true positives and true negatives respectively); percentages represent the true positive rate

Threshold values give insight into the point density that separates inliers from outliers. SOR2 is the overall best method: at its best KNN value (Fig. 7, KNN = 60) the threshold is ~0.25 m. This is interpreted as the average distance between a point and its 60 nearest neighbours, and it determines whether the point is an inlier (below 0.25 m) or an outlier (above 0.25 m). LOF values represent local density, with values near 1 considered inliers and higher values tending to indicate outliers [25]. This can also be seen intuitively in Fig. 8, where the median LOF values of all points (1 million inliers and 2596 outliers) are very close to 1. For the SfM dataset, the LOF threshold is 1.1 at its best KNN value (Fig. 7, KNN = 10). This value reflects results from the literature, where, for datasets with low local fluctuations, LOF values above 1.1 are likely to be outliers [25], whereas in datasets with varying densities, i.e. high local fluctuations, higher LOF values might still indicate inliers. The SOR2 and LOF2 thresholds are defined by distance from the respective medians, which are shown in Fig. 8.

Fig. 8

Median values in SOR and LOF distributions at different values of KNN

LOF2 performed close to SOR2, and both outperformed LOF and SOR. This indicates that assigning to each point a metric based on the absolute difference from the median improves the ability to discern outliers from inliers.

FCD – Floating Car data

Figure 9 summarizes the results of the ROC curves for the FCD data by reporting AUC values at different KNN values. The different point distributions (Fig. 3) clearly affect which method and which KNN provide the highest outlier detection rate (TPR – see Table 3). Overall performance also differed between lines. Line 11 had the best results with low KNN (Fig. 9, KNN = 3) using the SOR or SOR2 methods; using the third dimension did not significantly change the AUC or TPR values. Line 39 had lower AUC and TPR values, and its best method was LOF with the highest KNN of 70 neighbours, leaving out the third dimension, i.e. the velocity of the vehicle. It is worth noting that velocity either did not improve accuracy or, for line 39, even decreased it, so this metric is not useful when the aim is to detect points that do not belong to the original route.

Fig. 9

AUC values of two bus lines, using only latitude and longitude (2d) and also velocity (3d); LOF2 and SOR2 use normalized values as predictors

Table 3 True positive rate – TPR – for the different bus lines and methods at optimal KNN (see Fig. 9); in green the best result for each line and 2d/3d combination, with a thick border the overall best for each bus line

Figure 10 is a visual representation of bus line 11 before and after application of the SOR2 method with KNN = 3. Although not visible in the 2D image, several overlapping points are present at what looks like a single isolated point outside the main track.

Fig. 10

Bus line 11 before and after applying the SOR filter using two dimensions (latitude and longitude)

Conclusions

Two objectives were reached in the presented investigation: the implementation of the LOF method in the open-source PCL library, with its integration in a GUI, and the testing of the LOF method against the SOR method on two datasets that are very diverse in terms of technology, point density and distribution. It is worth noting that outlier detection remains a topic of high interest, owing to the many technologies that provide datasets with large numbers of unstructured points.

Results are mixed, with the two datasets yielding their best performances with different methods and threshold types. This indicates that, very likely, the type of point distribution, i.e. the local density fluctuation, influences the choice of method for detecting outliers. In the SfM point dataset, LOF2 clearly performed close to SOR2, both with high KNN values, and both outperformed LOF and SOR; this indicates that assigning to each point a metric based on the absolute difference from the median improves the ability to discern outliers from inliers. The FCD datasets showed the opposite behaviour: the best results were obtained with low KNN values in all cases except the 2D dataset of line 39, where the highest KNN performed best. SOR performed best for line 11, whereas for line 39 SOR2 at the lowest KNN did best on the 3D dataset and LOF did best on the 2D dataset, again with the lowest and highest KNN respectively. This seemingly erratic behaviour reflects the very different datasets chosen for testing, which was one of the objectives of this investigation. As mentioned, SfM has a much more consistent density, whereas FCD has higher density fluctuations. This can explain why thresholds on absolute differences from the median (SOR2 and LOF2) outperformed thresholds on the raw LOF and SOR values for the SfM dataset, whereas this was not the case for the FCD datasets. It is worth mentioning that points at the border of a dataset can be perceived as outliers; this “margin” effect can be ignored in most cases, because the objects of interest in a survey are usually not at its margin, but it should be considered when planning a survey.

An aspect worth noting is that in the SfM dataset the AUC value of the best methods (LOF2 and SOR2) levels out at higher KNN values. This is important because it indicates that the result at the best KNN = 60 is not appreciably better than the result at KNN = 20; since processing is much faster at the lower value, users can choose it instead. Another interesting point is that at and above KNN = 20 results are good and seem to stabilize, i.e. they do not deteriorate with higher KNN values. Experimentation stopped at KNN = 70, partly due to long processing times; future tests might increase KNN to see if, and when, deterioration occurs. This behaviour is likely related to the median LOF value (Fig. 8, right), which becomes stable at KNN ≥ 20, meaning that at least 20 neighbours are necessary, for the SfM dataset, to represent the local fluctuation. In the SfM dataset, while LOF2 improves with KNN, LOF is constant at KNN 10–20 and deteriorates at KNN > 20. KNN values in the 10–20 range highlight this difference between LOF and LOF2, likely due to the way the thresholds are calculated: using the absolute difference from the median LOF as the threshold improves the efficiency of the method, whereas the LOF value alone is not enough to discriminate outliers from inliers.

Other practical considerations are necessary to select the proper approach for removing outliers. The dataset must be analysed to understand whether outliers or inliers can be modelled in any systematic way. For example, SfM datasets are more prone to outliers along the Z axis, whereas the floating car dataset has outliers that are sensitive to planar offsets, caused by vehicles taking routes different from the usual track. A careful evaluation of the dataset source therefore helps to determine which descriptors can be added to improve results. In the FCD point dataset the third dimension is velocity, but this feature did not improve results with respect to using only the planar 2D spatial coordinates. It is very likely that better results can be achieved with specific descriptors extracted from the dataset: for example, the floating car dataset has a linear character, so a degree-of-linearity descriptor computed from neighbouring points could be added and would likely improve results. The focus of this paper is to assess two generic algorithms, not to evaluate specific use cases, but it is worth noting that tailored descriptors can help in detecting outliers.

The bottom line is that there is no one method that suits all cases, nor a single best number of nearest neighbours (KNN) for these two methods. The best KNN values depend strongly on the local density of points. As mentioned, to choose an ideal KNN, enough neighbours must be used to represent the local fluctuations. This may seem trivial, but it is important to keep in mind. The differences in AUC and TPR values show that the ideal combination of method and KNN must be chosen according to the characteristics of the dataset and the type of outliers expected (clustered or not).