Abstract
Data points situated near a cluster boundary are called boundary points, and they can represent useful information about the process generating the data. Existing boundary point detection methods cannot differentiate boundary points from outliers, as they are affected by the presence of outliers as well as by the size and density of clusters in the dataset. They also require tuning of one or more parameters, and this tuning demands prior knowledge of the number of outliers in the dataset. In this research, a boundary point detection method called BPF is proposed which can effectively differentiate boundary points from outliers and core points. BPF combines the well-known outlier detection method Local Outlier Factor (LOF) with a Gravity value to calculate the BPF score. Our proposed algorithm StaticBPF can detect the top-m boundary points in the given dataset. Importantly, StaticBPF requires tuning of only one parameter, i.e., the number of nearest neighbors \((k)\), and can employ the same \(k\) used by LOF for outlier detection. This paper also extends BPF to streaming data and proposes StreamBPF. StreamBPF employs a grid structure for improving k-nearest neighbor computation and an incremental method of calculating the BPF scores of a subset of data points in a sliding window over data streams. In evaluation, the accuracy of StaticBPF and the runtime efficiency of StreamBPF are evaluated on synthetic and real data, where they generally performed better than their competitors.
1 Introduction
Clustering is a data mining technique that divides a dataset into subsets such that the data belonging to each subset share some similar properties [24]. These clusters of data may represent a useful phenomenon, and cluster analysis extracts such useful features of the data. For example, in a customer dataset, a cluster of data objects may represent a specific customer behavior, or in an image dataset, a cluster of images may share similar properties.
Outlier detection is related to clustering as an outlier is defined as a data object that does not belong to any cluster and deviates from the majority of the data objects [15]. An outlier is often situated in an isolated region of the data space. Many techniques have been proposed for outlier detection based on different properties of the dataset. For example, distance [3, 20], density [5, 17], angle [23, 32] and isolation [14, 28].
There are several research efforts targeting the problems of clustering [1, 7, 12] and outlier detection [5, 29, 33, 39] in static and streaming data. However, limited research has been dedicated to boundary point detection. In [26], border or boundary points are defined as points which are located at the extremes of a class region or near free pattern space. In other words, boundary points are located at the border of a cluster, forming its boundary. Hence, boundary point detection can be defined as the task of detecting the points which are situated at the boundary of a cluster [41].
Detecting boundary points may provide useful information about the system generating the data. Consider the example of a disease detection system in which the normal data objects represent healthy patients, while patients who have contracted a certain disease are represented by outliers. In this example, the boundary points may represent patients showing abnormal symptoms who have not yet developed the disease. Consequently, closely monitoring such boundary cases may reveal interesting information about the disease. Similar motivating examples have been presented in the related papers [9, 27, 34, 41].
Amongst many outlier detection techniques [21, 23, 28, 30], Local Outlier Factor (LOF) [5] is one of the most popular and competitive density-based outlier detection methods [43]. It detects outliers based on the relative density of a target object within its local neighborhood. Core points are situated in the inner region of a cluster, whereas outliers are isolated points in less dense regions. LOF detects outliers by calculating LOF scores, where outliers receive larger LOF scores (\(>1\)) than other points.
The problem of boundary point detection requires detecting the boundary points while ignoring the core points and outliers. In our previous work [19], we proposed the Boundary Point Factor (BPF) method, which calculates a boundary point score called the BPF score to effectively identify boundary points. In a nutshell, BPF calculates the BPF score of a given point as the ratio of its Gravity value to its LOF score. The Gravity value is calculated as the norm of the average of the unit vectors from the given point to its k-nearest neighbors. Based on our proposed formulation, the BPF scores of boundary points tend to be greater than those of outliers and core points. As a result, boundary points can be distinguished from outliers and core points based on their BPF scores. We propose the \({\textit{BPF}}\) algorithm for static datasets, which calculates BPF scores and outputs the top-m boundary points. The following are the advantages of our proposed method [19]:

BPF is robust to the presence of outliers and to clusters of different sizes and densities.

BPF can be used together with LOF for detecting both boundary points and outliers, as BPF shares the k-nearest neighbors and LOF computation.

BPF has one tunable parameter \(k\) (the number of nearest neighbors), where the value of \(k\) tuned for boundary point detection with BPF can also be used for outlier detection with LOF.
This paper extends BPF to the problem of boundary point detection over data streams and provides more extensive experimental results to verify the effectiveness of BPF. It is important to clarify that in our previous work, BPF referred both to the method of calculating the BPF score and to the algorithm that outputs the top-m boundary points in a static dataset. In this work, we refer to the algorithm detecting the top-m boundary points via BPF as StaticBPF, whereas BPF refers to the method of calculating the BPF score by combining Gravity and LOF.
In streaming data, the task of boundary point detection becomes more challenging due to the high arrival rate of data. As a result, a faster method is desirable which can calculate the BPF scores of newly arriving points and update the BPF scores of points affected by the arrival of new points or the expiration of old points. This paper proposes a grid-based, runtime-efficient method to address the problem of fast boundary point detection over data streams. The challenges are to speed up the computation of the k-nearest neighbors and BPF scores of the newly arriving points and of the points affected by arrivals and expirations due to window slides. To address this problem, we propose StreamBPF, which employs a grid structure to efficiently compute the k-nearest neighbors and uses an incremental method of computing BPF scores. This paper is an extension of our previous work [19], and the following are the key extensions in this work:

1.

Quantitative evaluation of StaticBPF on 2- and high-dimensional synthetic and real data. In our previous work, we demonstrated the accuracy by showing the detected boundary points on 2-d synthetic and real data. In this paper, in addition to the previous results, the accuracy is reported quantitatively w.r.t. precision, recall, F1 score, area under the precision-recall curve (AUC PR) and area under the ROC curve (AUC ROC) on all datasets.

2.

Proposal of a boundary point detection method over data streams named \({\textit{StreamBPF}}\). Our previous contribution is suitable for static datasets, and it is computationally expensive to use it for streaming data. Therefore, in this paper, we propose StreamBPF that (a) uses a grid structure to improve the k-nearest neighbors computation, and (b) incrementally computes BPF scores, adopting the observations in [33].

3.

Runtime performance evaluation of StreamBPF on synthetic and real data, and comparison with StaticBPF and other methods.
2 Related work
BORDER [41] is one of the boundary point detection algorithms. It identifies boundary points by exploiting the observation that boundary points have a smaller number of reverse k-nearest neighbors (\({\textit{RN}}_k\)) than core points. However, the computation of \({\textit{RN}}_k\) is expensive, and therefore the authors proposed to use the G-ordering kNN join method [40] to speed it up. BORDER was found to be effective on datasets without outliers. On datasets with outliers, BORDER cannot differentiate between outliers and boundary points, as both tend to have a small number of \({\textit{RN}}_k\). To address this shortcoming, BRIM [35] considers the \({\textit{eps}}\)-neighborhood to successfully detect the boundary points in datasets with many outliers. Given a distance \({\textit{eps}}\), BRIM uses the observation that, since a boundary point is located at the edge of a dense region, its \({\textit{eps}}\)-neighborhood is distributed in either the positive or the negative direction of the diameter line which divides its \({\textit{eps}}\)-neighborhood into two parts. Furthermore, boundary points tend to have a denser \({\textit{eps}}\)-neighborhood than outliers. The major drawback of BRIM is that it cannot perform well on datasets with clusters of different densities and scales due to its fixed \({\textit{eps}}\) value.
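The reverse k-nearest-neighbor observation underlying BORDER can be sketched in a few lines. The snippet below is a brute-force illustration only (not the G-ordering kNN join of [40]; all function names are ours): extreme points tend to collect fewer reverse neighbors than interior points.

```python
import numpy as np

def knn_indices(X, k):
    """Brute-force k-nearest-neighbor indices (excluding the point itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def reverse_knn_counts(X, k):
    """RN_k(p): how many points include p among their k nearest neighbors."""
    counts = np.zeros(len(X), dtype=int)
    for nbrs in knn_indices(X, k):
        counts[nbrs] += 1
    return counts

# Points on a 1-d line: the two extremes play the role of 'boundary' points
# and collect fewer reverse neighbors than the interior point.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
counts = reverse_knn_counts(X, k=2)
```

As the paragraph notes, an isolated outlier would also score a low reverse-neighbor count here, which is exactly why BORDER cannot separate the two cases.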
Recently, Li et al. proposed BPDAD [27] for detecting outliers and boundary points based on geometrical measures. BPDAD exploits two important observations: outliers and boundary points have lower local densities and a smaller variance of angles than their neighbors. Consequently, BPDAD outputs outliers and boundary points together and does not specifically detect boundary points. BorderShift [9] is another boundary point detection algorithm which uses similar observations regarding the densities of outlier, boundary and core points. BorderShift employs the Parzen window (kernel density estimation) to estimate the local density of a point and the MeanShift vector to determine the direction of the dense region. It effectively detects boundary points provided that its three parameters (\(k\), \(\lambda _1\) and \(\lambda _2\)) are tuned appropriately. In particular, tuning \(\lambda _1\) and \(\lambda _2\) can be difficult as it requires prior information about the number of outliers in the dataset.
For high-dimensional data, [6, 34] project the data onto lower dimensions for boundary point detection. However, similar to BorderShift [9], their parameter tuning depends on prior knowledge of the number of outliers in the dataset. A more desirable technique is easy to tune and does not require any prior information about the data distribution of the given dataset: it should take a dataset with or without outliers as input and output the top-m boundary points.
Our previous work [19] introduced the BPF method and experimentally showed its effectiveness on static datasets. However, in [19], we did not consider the problem of detecting boundary points in streaming data. To the best of our knowledge, no method of boundary point detection for streaming data has been proposed. Since LOF is one of the key components of BPF, incremental computation of LOF can improve the runtime performance of BPF and make it suitable for streaming data.
There are many outlier detection methods for streaming data based on different observations [2, 8, 16, 18, 22, 39, 42]. ILOF is the extension of LOF to data streams, which can incrementally update the LOF scores of points [33]. ILOF proposed two algorithms, Insertion and Deletion, which are applied when a new point is inserted into or deleted from the dataset, respectively. One disadvantage of ILOF is that it requires a large amount of memory. Consequently, to improve the space complexity, [37] proposed a memory-efficient method called MiLOF which stores a summary of past data to reduce memory consumption. Since MiLOF stores the summary of past data as cluster centers, its accuracy may degrade with time. DILOF [29] addressed this problem by preserving the density information of past data. Another attempt to improve LOF for streaming data is [13], which employs a cube-based method to approximate the LOF scores of incoming points. All these methods are approximations of the ILOF algorithm. ILOF is related to our proposed method for streaming data, as we need to update LOF scores in order to update the BPF scores of points. Hence, we adopt the observations given in [33] to update the LOF scores. Furthermore, we propose a grid structure to improve the runtime of the k-nearest neighbors computation.
3 Preliminaries
This section briefly introduces the definitions related to Local Outlier Factor (LOF). The readers may refer to [5] for further details. Furthermore, (Table ) shows the list of important symbols used in this paper.
Given a \(d\)-dimensional point \(p\) in the dataset \(D\), let \(k\) represent the number of nearest neighbors. The LOF score of \(p\), \({\textit{LOF}}(p)\), can be calculated using two key concepts: the reachability distance \({\textit{reachdist}}_k(p,o)\) and the local reachability density \({\textit{lrd}}_{k}(p)\). The following definitions present these concepts, followed by the definition of \({\textit{LOF}}\).
Definition 1
(Reachability distance) The reachability distance of a point \(p\) w.r.t. point \(o\) is defined as:

$$\begin{aligned} {\textit{reachdist}}_k(p,o) = \max \{{\textit{kdist}}(o), {\textit{dist}}(p,o)\} \end{aligned}$$

where \({\textit{kdist}}(o)\) is the distance from \(o\) to its \(k\)-th nearest neighbor and \({\textit{dist}}(p,o)\) is the distance between points \(p\) and \(o\).
Definition 2
(Local reachability density) The local reachability density of a point \(p\), denoted as \(\textit{lrd}_{k}(p)\), is defined as:

$$\begin{aligned} \textit{lrd}_{k}(p) = \left( \frac{\sum _{o\in N_{k}(p)} {\textit{reachdist}}_k(p,o)}{|N_{k}(p)|} \right) ^{-1} \end{aligned}$$

where \(N_{k}(p)\) is the set of \(k\)-nearest neighbors of \(p\) and \(|N_{k}(p)|\) represents the cardinality of \(N_{k}(p)\).
Intuitively, the local reachability density is an estimation of the density of \(p\) w.r.t. its neighbors \(o\in N_k(p)\). More concretely, \(\textit{lrd}_k(p)\) is the reciprocal of the average reachability distance from \(p\) to its \(k\)-nearest neighbors. Therefore, the larger the reachability distances of \(p\), the smaller \(\textit{lrd}_k(p)\) becomes. Based on Definitions 1 and 2, we can define the local outlier factor (LOF).
Definition 3
(Local Outlier Factor (LOF)) The Local Outlier Factor of a point \(p\) is defined as:

$$\begin{aligned} {\textit{LOF}}(p) = \frac{\sum _{o\in N_{k}(p)} \frac{\textit{lrd}_{k}(o)}{\textit{lrd}_{k}(p)}}{|N_{k}(p)|} \end{aligned}$$
\({\textit{LOF}}(p)\) is the outlier factor of the point \(p\), which indicates its degree of outlierness. If \(p\) is a core point, \({\textit{LOF}}(p)\) is close to 1, and if \(p\) is a boundary point, \({\textit{LOF}}(p)\) is slightly greater than that of core points but still close to 1. If \(p\) is an outlier, \({\textit{LOF}}(p)\) is greater than 1. The details about the range of the LOF score are explained in [5].
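For intuition, Definitions 1-3 can be implemented in a few lines of brute-force Python. This is an illustrative sketch only (function and variable names are ours; no indexing optimizations are used): reach-dist, then lrd, then LOF.

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF (Definitions 1-3): reach-dist, lrd, then LOF."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)
    nbrs = order[:, :k]                            # N_k(p)
    kdist = dist[np.arange(n), order[:, k - 1]]    # k-dist(p)

    # lrd_k(p) = 1 / mean reach-dist from p to its k nearest neighbors
    lrd = np.empty(n)
    for p in range(n):
        reach = np.maximum(kdist[nbrs[p]], dist[p, nbrs[p]])
        lrd[p] = 1.0 / reach.mean()

    # LOF(p) = mean of lrd(o)/lrd(p) over o in N_k(p)
    return np.array([lrd[nbrs[p]].mean() / lrd[p] for p in range(n)])

# A tight cluster plus one isolated point: cluster members get LOF close
# to 1, while the isolated point's LOF is much larger.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), [[5.0, 5.0]]])
scores = lof_scores(X, k=5)
```

The isolated point at (5, 5) receives the largest score, matching the behavior described above for outliers.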
4 Boundary point factor (BPF)
This section introduces the definitions related to the Boundary Point Factor (BPF) method and explains the basic observations about it. Firstly, the definition of Gravity \(G(p)\) of a point is given.
Definition 4
(Gravity) Given a point \(p\in D\), the set of \(k\)-nearest neighbors \(N_k(p)\) of \(p\), and the norm \(\Vert \cdot \Vert \), the Gravity of \(p\) is defined as:

$$\begin{aligned} G(p) = \left\Vert \frac{1}{|N_k(p)|}\sum _{o\in N_k(p)} \frac{o-p}{\Vert o-p \Vert } \right\Vert \end{aligned}$$
Intuitively, Gravity (\(G(p)\)) is a scalar value which indicates how the neighborhood of \(p\) is distributed. Consider the boundary point \(p\) shown in Fig. . Taking the average of the unit vectors originating from \(p\) toward its k-nearest neighbors results in a single vector (blue arrow). Calculating \(G(p)\) then yields a larger Gravity value than for the core point \(q\), which is surrounded by points in all directions and therefore has a smaller \(G(q)\). Generally, the Gravity value of an outlier depends on the data distribution. Hence, the following inequality is expected to hold for the boundary point \(p\) and core point \(q\):

$$\begin{aligned} G(p) > G(q) \end{aligned}$$
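The Gravity computation described here (the norm of the averaged unit vectors toward the k nearest neighbors) can be sketched as follows; the code is a hypothetical brute-force illustration with names of our own choosing.

```python
import numpy as np

def gravity(X, k):
    """G(p): norm of the mean unit vector from p to its k nearest neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nbrs = np.argsort(dist, axis=1)[:, :k]
    G = np.empty(len(X))
    for i in range(len(X)):
        vecs = X[nbrs[i]] - X[i]
        units = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        G[i] = np.linalg.norm(units.mean(axis=0))
    return G

# 3x3 grid: the center point (index 4) is surrounded in all directions, so
# its unit vectors cancel out (G near 0); a corner point's neighbors all
# lie to one side, so its mean unit vector survives (large G).
X = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
G = gravity(X, k=4)
```

On this grid the corner point's Gravity is large while the center's is essentially zero, mirroring the boundary-vs-core contrast in the figure.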
Similarly, the LOF scores of data points w.r.t. their \(k\)-neighborhood can be calculated using Definition 3. As shown in [5], the LOF scores of core and boundary points are close to 1 (\({\textit{LOF}}(q), {\textit{LOF}}(p) \approx 1\)) and the LOF scores of outliers are greater than 1. Consequently, for the points \(p\), \(q\) and \(r\) shown in Fig. 1, the following inequality is expected to hold:

$$\begin{aligned} {\textit{LOF}}(r)> {\textit{LOF}}(p) > {\textit{LOF}}(q) \end{aligned}$$
Based on these observations, Boundary Point Factor (BPF) score is defined as follows.
Definition 5
(Boundary Point Factor (BPF) score) Given a point \(p\in D\), \({\textit{LOF}}(p)\) and Gravity \(G(p)\), the Boundary Point Factor score of \(p\), \({\textit{BPF}}(p)\), is calculated as follows:

$$\begin{aligned} {\textit{BPF}}(p) = \frac{G(p)}{{\textit{LOF}}(p)} \end{aligned}$$
From Definition 5, the following inequality is expected to hold for the boundary point \(p\), core point \(q\) and outlier \(r\):

$$\begin{aligned} {\textit{BPF}}(p)> {\textit{BPF}}(q) \quad \text {and} \quad {\textit{BPF}}(p) > {\textit{BPF}}(r) \end{aligned}$$
Namely, the BPF scores of boundary points are likely to be greater than those of core points and outliers. Hence, the boundary points in a dataset can be identified using BPF scores.
5 StaticBPF
BPF can be applied to static datasets for boundary point detection. The algorithm that uses BPF for detecting the top-\(m\) boundary points from static datasets is named StaticBPF. This section presents the algorithm steps and runtime complexity of StaticBPF, and shows the results of the accuracy evaluation.
5.1 Algorithm
The main idea of the StaticBPF algorithm is to calculate the BPF scores of all points in the dataset \(D\). Firstly, the algorithm calculates the \(k\)-neighborhood of every point in \(D\). Next, for each point in \(D\), it calculates the BPF score based on LOF and Gravity according to Definition 5. After calculating the BPF scores, the algorithm sorts all points in descending order of their BPF scores. Given the parameter \(m\), StaticBPF outputs the list \({\mathbb {C}}\) of the top-\(m\) boundary points in \(D\). The steps are given in Algorithm 1.
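The steps above can be sketched end to end. The following is our own illustrative brute-force implementation (not the authors' code): it computes LOF and Gravity, takes their ratio as the BPF score, and returns the indices of the top-m points.

```python
import numpy as np

def static_bpf(X, k, m):
    """Sketch of StaticBPF: BPF(p) = G(p)/LOF(p), return indices of top-m scores."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)
    nbrs, kdist = order[:, :k], dist[np.arange(n), order[:, k - 1]]

    # lrd and LOF (Definitions 1-3), then Gravity (Definition 4)
    lrd = np.array([1.0 / np.maximum(kdist[nbrs[p]], dist[p, nbrs[p]]).mean()
                    for p in range(n)])
    lof = np.array([lrd[nbrs[p]].mean() / lrd[p] for p in range(n)])
    grav = np.empty(n)
    for p in range(n):
        v = X[nbrs[p]] - X[p]
        grav[p] = np.linalg.norm((v / np.linalg.norm(v, axis=1, keepdims=True)).mean(axis=0))

    bpf = grav / lof                       # Definition 5
    return np.argsort(-bpf)[:m]            # descending BPF order

# Uniform disc plus a distant outlier (index 300): the top-ranked points
# should sit near the rim of the disc, not at its center or at the outlier.
rng = np.random.default_rng(1)
r, t = np.sqrt(rng.uniform(0, 1, 300)), rng.uniform(0, 2 * np.pi, 300)
X = np.vstack([np.c_[r * np.cos(t), r * np.sin(t)], [[6.0, 6.0]]])
top = static_bpf(X, k=20, m=30)
```

The outlier gets a high Gravity but also a high LOF, so its BPF ratio stays small and it is excluded from the top-m list, which is the intended behavior of the algorithm.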
5.2 Runtime complexity
Let \(n\) represent the number of points in a \(d\)-dimensional dataset. The most computationally expensive task for StaticBPF is the k-nearest neighbors search for each point, which has a complexity of \(O(n^2d)\). However, this can be improved to \(O(n\log n)\) by using a suitable indexing technique [31, 36]. The same observation applies to other boundary point detection methods such as BORDER [41], BRIM [35], BPDAD [27] and BorderShift [9]. As explained, the BPF score is calculated based on the Gravity and LOF scores of the \(n\) points, where the complexities of calculating the Gravity and LOF scores are \(O(nkd)\) and \(O(nk)\), respectively. Hence, the overall runtime complexity of the StaticBPF algorithm without an indexing structure is \(O(n^2d + nkd + nk)\).
5.3 Evaluation of StaticBPF
In this section, we show the experimental evaluation of StaticBPF. The experiments are conducted on 2- and high-dimensional synthetic datasets as well as on real datasets.
5.3.1 Experimental setup
This section explains the steps of obtaining the ground truth and tuning the parameters for all methods.
In order to perform quantitative evaluation, it is fundamental to have the ground truth boundary points for all datasets. Therefore, we applied the following steps on each synthetic and real dataset to obtain the ground truth:

1.
Apply DBSCAN [12] to identify clusters/classes and outliers in the dataset. Remove the outliers from the dataset.

2.
Apply BORDER [41] on each identified cluster/class in the dataset. Consider as boundary points those points whose boundary scores are less than or equal to \(\alpha \)% of the average boundary score of that cluster. Repeat this process for each cluster in the given dataset.

3.
Consider all the points obtained in step 2 as the top-\(m\) ground truth boundary points of the dataset.
It may be noted that the outliers detected by DBSCAN are removed only temporarily in order to apply BORDER and detect the top-\(m\) boundary points. After that, the removed outliers are reinserted into the dataset.
DBSCAN is used to detect the outliers in all datasets. In the synthetic datasets, we randomly introduced a fixed number of outliers; however, there may exist additional outliers relative to the clusters. Therefore, we applied DBSCAN to obtain all outliers in the given dataset. The parameters \({\textit{MinPts}}\) and \({\textit{eps}}\) of DBSCAN are tuned using the k-distance plot, as suggested in [38]. We checked the “elbow” values of \({\textit{eps}}\) at fixed \({\textit{MinPts}}\) and chose an appropriate \({\textit{eps}}\) such that the number of outliers detected by DBSCAN is greater than or equal to the number of randomly introduced outliers. In the real datasets, we chose a large value of \({\textit{eps}}\) to detect the outstanding outliers.
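The k-distance plot heuristic mentioned above can be sketched as follows, assuming a brute-force distance computation (function names are ours): the sorted k-th-nearest-neighbor distances stay flat over dense regions and jump sharply at the outliers, and the elbow of that curve is read off as eps.

```python
import numpy as np

def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest neighbor.

    Plotting this curve and picking eps at the 'elbow' is the usual
    heuristic for tuning DBSCAN (with MinPts = k)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    kth = np.sort(d, axis=1)[:, k - 1]
    return np.sort(kth)

# Dense cluster plus two far-away outliers: the curve is flat for the 50
# cluster points and jumps sharply for the two outliers at the tail.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.05, (50, 2)), [[3, 3]], [[-3, 3]]])
curve = k_distance_curve(X, k=4)
```

Any eps chosen inside the jump separates the two outliers from the cluster, which is the selection rule described in the paragraph above.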
In order to obtain the ground truth boundary points via BORDER, we need to choose an appropriate value of its parameter \(k\). To do so, we tuned \(k\) of BORDER to detect the outliers according to the ground truth outliers obtained by DBSCAN in step 1. The value of \(k\) which gives the best accuracy in terms of precision, recall and F1 is used to obtain the ground truth boundary points in step 2. This is reasonable because BORDER uses the observation that boundary points have a smaller number of reverse k-neighbors than core points; however, this observation also holds for outliers. Hence, tuning \(k\) of BORDER for outlier detection on a dataset and then applying BORDER with the tuned \(k\) on the same dataset after removing the outliers can detect the boundary points with reasonable accuracy.
For the synthetic 2-d and high-dimensional datasets, \(\alpha =80\)% and \(\alpha =60\)%, respectively. For the 2-d data, we further checked by visual inspection whether the ground truth boundary points cover the boundary regions of the clusters. For the real Biomed [10] and Cancer [11] datasets, we considered \(\alpha =80\)%.
We compared the accuracy of StaticBPF with BorderShift [9], BPDAD [27], BRIM [35] and BORDER [41]. For all methods, we chose the parameter value ranges shown in Table . BPDAD uses fixed parameter values and automatically outputs the boundary points; therefore, we report the number of boundary points output by BPDAD for all datasets. For BRIM, there is no suggested range for \({\textit{eps}}\), a distance parameter such that the points within \({\textit{eps}}\) distance of a target point are considered its neighborhood. We considered 5 values for \({\textit{eps}}\) within the range of the minimum and average distance between all points in the given dataset. For BorderShift, we tuned its \(k\) parameter according to the suggested range. However, tuning its \(\lambda _1\) and \(\lambda _2\) parameters requires prior knowledge of the number of outliers in the dataset. Since in our experiments the number of outliers is known, we show the results of BorderShift for the best \(\lambda _1\) and \(\lambda _2\). Without this knowledge, however, tuning \(\lambda _1\) and \(\lambda _2\) can be challenging.
For quantitative evaluation, precision, recall, F1 score, area under the ROC curve (AUC ROC) and area under the precision-recall curve (AUC PR) are used. AUC ROC and AUC PR are affected by the ranking of points w.r.t. their scores. In BorderShift, \(\lambda _1\) and \(\lambda _2\) are start and end pointers, respectively, to the boundary points occurring in the list of points sorted in ascending order of the scores. We considered the ranking from \(\lambda _2\) backwards to the start of the list to calculate AUC ROC and AUC PR. This ranking is reasonable since, given that \(\lambda _1\) and \(\lambda _2\) are tuned, points occurring before \(\lambda _1\) are likely nearer to the core points, while points after \(\lambda _2\) are likely farther from the core points or may be outliers. For StaticBPF, BORDER and BRIM, we ranked points according to their calculated scores. AUC ROC and AUC PR cannot be reported for BPDAD as its output is not based on a ranking.
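For reference, AUC ROC over a score ranking can be computed directly as a Mann-Whitney-style rank statistic: the probability that a randomly chosen boundary point outranks a randomly chosen non-boundary point. The sketch below uses hypothetical scores and labels of our own making.

```python
def auc_roc(scores, labels):
    """AUC ROC from a score ranking: fraction of (positive, negative)
    pairs where the positive outranks the negative, ties counted as half
    (Mann-Whitney U formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical BPF scores: boundary points (label 1) mostly outrank the rest.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
```

A perfect ranking yields 1.0; the example above yields 8/9, reflecting the one misranked pair.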
5.3.2 2-Dimensional synthetic data
The evaluation of StaticBPF on 2-d synthetic datasets is of two types: (1) visual demonstration of the detected boundary points, and (2) quantitative evaluation of accuracy in terms of various metrics. The details of the 2-d datasets are given in Table .
Figures , , and show the top-m boundary points (red points) detected by all methods, and Tables , , and report the corresponding quantitative accuracy at the optimal parameter values, where the bold numbers represent the best results.
Figure 2 shows the results on the Diamonds dataset (used in [9, 35, 41]). StaticBPF and BorderShift clearly detect the boundary points, while the other methods are not as accurate. A few outliers are also detected as boundary points by StaticBPF, but most of the detected points lie on the clear boundaries of the shapes. Referring to the quantitative accuracy in Table 4, StaticBPF shows performance comparable to BorderShift; however, tuning the best parameters of BorderShift is difficult if the number of outliers is not known. Moreover, BRIM shows accuracy comparable to StaticBPF and BorderShift, while the accuracy of BORDER and BPDAD is influenced by the outliers. On the rings [34] dataset (Fig. 3 and Table 5), StaticBPF performs better than the other methods.
The Mix1, Mix2 and Mix3 datasets consist of shapes having different numbers of points. Mix2 and Mix3 are similar to the dataset used in the LOF paper [5], with clusters having points from uniform and normal distributions. The difference between Mix2 and Mix3 is the number of outliers. The purpose of using these datasets is to demonstrate the robustness of StaticBPF on datasets having clusters of arbitrary shapes and densities. On Mix1 (Fig. 4 and Table 6), StaticBPF performs better than the other methods, while BorderShift shows comparable performance. On Mix2 (Fig. 5 and Table 7), BORDER performs better than StaticBPF in terms of precision, recall and F1 score. This is due to the presence of a small number of outliers (only 19). However, when we increased the number of outliers in Mix3 (84 outliers), the performance of BORDER dropped due to the presence of outliers, whereas StaticBPF showed almost consistent performance, as shown in Table . We omit the visualization of the boundary points detected by all methods on the Mix3 dataset to avoid redundancy. The results on Mix2 and Mix3 suggest that StaticBPF is robust to outliers. For the other methods, the parameters were adjusted to give the most consistent results.
On the Mix4 dataset, the ground truth is obtained by considering the shape of the clusters. Mix4 has two circular clusters of uniformly distributed points with different radii and numbers of points. The small and large clusters have radii \(r_1=1\) and \(r_2=2\) with 1400 and 800 points, respectively. We obtained the ground truth boundary points using two methods: (i) applying BORDER (as explained in Sect. 5.3.1), and (ii) adjusting a margin \(\beta \) to cover the region close to the boundary of the circular clusters, where the points lying between radius \(r_1-\beta \) and \(r_1\) in the small cluster and between \(r_2-\beta \) and \(r_2\) in the large cluster are considered boundary points, with \(\beta =0.05\) for the small and \(\beta =0.2\) for the large cluster. Tables and show the results for the ground truth obtained via BORDER and by adjusting \(\beta \), respectively. Moreover, we show the change in accuracy w.r.t. the change in parameters, where the best and suboptimal results are shown for each method.
Referring to the overall results in Tables 9 and 10, StaticBPF outperforms all methods and shows almost consistent accuracy across different \(k\) values. Similarly, the accuracy of BORDER is also consistent with changing \(k\). On the other hand, the accuracy of BRIM changes with \({\textit{eps}}\), and that of BorderShift changes with \(\lambda _1\) and \(\lambda _2\) at fixed \(k\). \(\lambda _1\) and \(\lambda _2\) cover the top-\(m\) points, and their values were changed with an interval of 200 points. Hence, the results suggest that tuning \(\lambda _1\) and \(\lambda _2\) is difficult when the number of outliers is not known. Also, the ground truth obtained via BORDER is reasonable, as the accuracy does not change significantly when we instead consider the shapes of the clusters to obtain the ground truth.
Furthermore, on the Mix4 dataset we show that the \(k\) value tuned for \({\textit{StaticBPF}}\) for boundary point detection can be used with LOF for outlier detection. LOF showed a consistent precision, recall and F1 score of 0.94 for detecting 200 outliers at \(k=50,100,150\). This shows that LOF and StaticBPF can work with the same \(k\) value.
5.3.3 High-dimensional synthetic data
This subsection shows the results of the accuracy evaluation on synthetic high-dimensional data of dimensionality 10, 20 and 50. In order to obtain clusters of different sizes and densities, the generated high-dimensional data have three spherical clusters of different radii with data points drawn from Gaussian distributions. The details are shown in Table 3. The results are shown in Tables , and , and the accuracy is reported for different parameter values to show its change w.r.t. the parameters.
Overall, StaticBPF performs consistently better than the other methods on high-dimensional data. It can be observed that a larger \(k\) gives better results for \({\textit{StaticBPF}}\) according to all metrics. On the other hand, BORDER shows consistent accuracy upon increasing \(k\) for all dimensionalities. However, BorderShift struggles to detect the boundary points in high-dimensional data, and choosing appropriate \(\lambda _1\) and \(\lambda _2\) affects its accuracy. Similarly, tuning \({\textit{eps}}\) of BRIM is also difficult, as the accuracy changes with \({\textit{eps}}\).
Furthermore, we measured the outlier detection accuracy of LOF in terms of precision, recall and F1 score on the high-dimensional datasets. On the 10-d dataset, LOF's accuracy is 0.82 for \(k=150,200,250\). On the 20-d dataset, the accuracy is 0.79, 0.78 and 0.77 for \(k=100,200,250\), respectively. On the 50-d dataset, the accuracy is 0.79 for \(k=150,200,250\). Hence, the results suggest that a \(k\) appropriate for \({\textit{BPF}}\) boundary point detection can also be used with \({\textit{LOF}}\) for outlier detection on high-dimensional datasets.
5.3.4 Real data
This section presents the results of the accuracy evaluation on real datasets. Unlike outlier detection, there are no benchmark real datasets available for verifying the performance of boundary point detection. Therefore, we use the same datasets as the related work for evaluation.
Firstly, we show the accuracy of StaticBPF and its competitors quantitatively on the real datasets Biomed [10] and Cancer [11], as used in [6, 9, 34]. Ground truth labels of boundary points are not available for these datasets; therefore, we applied the method explained in Sect. 5.3.1 to obtain the ground truth. Moreover, duplicate points are removed from the Cancer data prior to ground truth extraction.
Table shows the details of these datasets. As for the synthetic datasets, we show three measurements at three parameter values for all methods to show the fluctuation in accuracy w.r.t. the parameter values. We tuned BORDER and StaticBPF in the range \(k\in [50,190]\) with an interval of 10 on the Biomed dataset. On the Cancer dataset, the range is \(k\in [50,250]\) with an interval of 10 for both methods. The results in Tables and show that StaticBPF has comparable or better accuracy than its competitors on the Biomed and Cancer datasets. Moreover, our method shows better accuracy with larger \(k\), and a similar trend may be observed with BORDER. However, BorderShift's accuracy depends on how \(\lambda _1\) and \(\lambda _2\) are tuned for the fixed \(k\).
The outlier detection accuracy of LOF in terms of precision, recall and F1 score on the Biomed data is 0.86 for \(k=150,160,170\). On the Cancer data, the accuracy of LOF is 0.37, 0.43 and 0.57 for \(k=190,200,210\), respectively.
Secondly, we demonstrate the accuracy of StaticBPF visually on the MNIST and ORL datasets, as used in [9, 34]. For both datasets, we show the BPF and LOF scores, respectively, at the top of each image. Similar to [9, 34], the goal of this experiment is to show the top boundary points w.r.t. the BPF scores.
The Olivetti Research Laboratory (ORL) face dataset [4] consists of 400 images of \(92\times 112\) pixels of frontal faces of 40 people, where each pixel represents a gray value in the range 0–255. The images of frontal faces are considered core images, while the images with left and right profile faces are considered boundary images. Each image is transformed from \(92\times 112\) to \(1\times 10{,}304\) by concatenating each subsequent row to the previous row of the image. Figure a, b shows the top 40 images with the largest and smallest BPF scores, respectively. It may be observed that the majority of the top 40 faces are non-frontal images. The bottom 40 images are shown for comparison between frontal and non-frontal images.
We further show the effectiveness of StaticBPF visually on the MNIST [25] dataset of handwritten digits as used in [9, 34]. The dataset contains 60,000 images in the training set and 10,000 images in the testing set, each of \(28\times 28\) pixels. We selected digit ‘3’ from the testing set of MNIST, which contains 1010 images of digit ‘3’; these are considered the core and boundary images. The easily recognizable 3s can be regarded as cores, while distorted 3s can be boundary images. We also selected 100 random images of digits 0, 2, 4, 6, 7 and 9 from the testing set, which may be considered outliers. The preprocessing is performed in the same way as for the ORL dataset.
Figure 7a shows the top 50 boundary points detected by StaticBPF, i.e. those with the largest BPF scores. The boundary images are distorted 3s and they are ranked higher, while core and outlier images are assigned smaller BPF scores. Consequently, as can be seen in Fig. 7b, the bottom 50 digits are a mix of core and outlier digits. Hence, StaticBPF can discriminate boundary points from core points and outliers based on the BPF scores. It can also be observed that the LOF scores of core and boundary points are close to 1, while outliers have LOF scores \(> 1\).
6 StreamBPF
Applying StaticBPF to a data stream can be computationally expensive. To address this problem, this section presents StreamBPF. Firstly, the problem of detecting boundary points in streaming data is defined, and then the proposed method is explained.
6.1 Problem definition
Data streams and sliding windows are defined in the following definitions.
Definition 6
(Data stream) A data stream is defined as an unbounded sequence of d-dimensional data points \(p_1,p_2,\ldots ,p_{t},p_{t+1},\ldots \), where \(p_t\in {\textbf{R}}^d\).
In order to bound the unbounded data stream, we define a count-based sliding window which maintains a fixed number (\(n\)) of the most recent points.
Definition 7
(Count-based sliding window) A count-based sliding window \(W_t\) of size \(n\) at time step \(t\) is the set of points \(\{p_{t-n+1},p_{t-n+2},\ldots ,p_{t-1},p_{t}\}\).
The proposed method is not restricted to the count-based sliding window and other types of window may be used. However, for simplicity, we focus on a count-based sliding window of a fixed size \(n\) which slides at every time step. Hence, the problem of detecting boundary points in streaming data can be defined as follows.
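As an illustration, the count-based sliding window of Definition 7 can be sketched in Python. This is a minimal sketch; the class name and interface are ours, not part of the proposed method.

```python
from collections import deque

class CountBasedWindow:
    """Holds the n most recent points; appending when the window is full
    evicts the oldest point, matching a count-based window that slides
    at every time step (Definition 7)."""

    def __init__(self, n):
        self.points = deque(maxlen=n)

    def slide(self, p):
        """Insert the new arrival p and return the expired point, if any."""
        expired = self.points[0] if len(self.points) == self.points.maxlen else None
        self.points.append(p)
        return expired

w = CountBasedWindow(3)
for point in [(0.1,), (0.2,), (0.3,)]:
    w.slide(point)
expired = w.slide((0.4,))   # the oldest point (0.1,) expires
```

Using `deque(maxlen=n)` makes eviction of the oldest point automatic, which is exactly the count-based semantics assumed here.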
Definition 8
(Boundary points detection in a data stream) Given a window of size \(n\) which slides at every time step and a parameter \(m\), boundary points detection over the data stream is the problem of identifying the top-\(m\) boundary points having the largest BPF scores in the current window \(W_t\) whenever the window slides.
At time step \(t\), given \(W_t\) and a new arrival point \(p\), we need to calculate \({\textit{BPF}}(p)\) and update the BPF scores of each existing point \(q\in W_t{\setminus }\{p\}\). \({\textit{BPF}}(p)\) can be calculated from scratch by calculating \(N_k(p)\), \(G(p)\) and \({\textit{LOF}}(p)\). Similarly, for each remaining point \(q\in W_t{\setminus }\{p\}\), \(G(q)\), \({\textit{LOF}}(q)\) and \({\textit{BPF}}(q)\) are calculated. A naïve approach is to calculate the BPF scores of all points in \(W_t\) from scratch and repeat this computation for each window slide. However, this approach is computationally expensive and intolerable for streaming data.
To drastically improve the overall runtime performance, StreamBPF employs (1) grid-based k-nearest neighbor computation and (2) incremental computation. We discuss the grid-based approach in Sect. 6.2 and present the incremental computation in Sect. 6.3. StreamBPF is presented in Sect. 6.4.
6.2 Grid structure
We propose a grid structure to improve the runtime performance of the k-nearest neighbor computation, which contributes to the overall runtime improvement of the proposed method. The following subsections present the definitions and the algorithm.
6.2.1 Definitions
The grid is a cell-based structure that divides a multidimensional data space into cells of equal size. Consider that the data in d-dimensional space are normalized to the range [0, 1] in each dimension. The grid cell length can be defined as follows:
Definition 9
(Cell length) Cell length \(l\) is the length of each side of a d-dimensional cell.
The value of \(l\) determines the resolution of the grid and it can affect the runtime performance of our proposed method. Given an appropriate value of \(l\), each point in the window can be assigned a specific cell identified by a unique cell key.
Definition 10
(Cell key) Given a point \(p=(p_1,p_2,\ldots ,p_d)\in W_t\) and cell length \(l\), \(p\) belongs to a cell addressed by a d-dimensional key called the cell key, denoted as \(K\), where \(K=\left( \lfloor p_1/l\rfloor ,\lfloor p_2/l\rfloor ,\ldots ,\lfloor p_d/l\rfloor \right) \).
Hence, the cell key \(K\) is the address of a cell in the grid which groups a subset \({\mathbb {S}}_i\) of points in \(W_t\). Next, the grid of cells can be defined as follows.
Definition 11
(Grid) Given a window \(W_t\) at time step \(t\), the grid can be given as \({\mathbb {G}}_t=\{C_1,C_2,\ldots \}\) with \(C_i=(K_i,{\mathbb {S}}_i)\), where \(K_i\) is the d-dimensional cell key of the \(i\)th cell containing a subset \({\mathbb {S}}_i\) of points in \(W_t\).
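Definitions 10 and 11 can be sketched as follows. This is a minimal Python sketch: the flooring formula for the key and the dictionary representation of the grid are our reading of the definitions, and the function names are illustrative.

```python
import math

def cell_key(p, l):
    """Cell key of a point p (normalized to [0,1] per dimension): the floor
    of each coordinate divided by the cell length l (our reading of
    Definition 10)."""
    return tuple(math.floor(x / l) for x in p)

# The grid as a dictionary from cell keys K_i to point subsets S_i, so only
# populated cells are ever stored (Definition 11).
grid = {}

def grid_insert(p, l=0.05):
    grid.setdefault(cell_key(p, l), []).append(p)

grid_insert((0.12, 0.34))
grid_insert((0.13, 0.33))   # falls into the same cell as the first point
grid_insert((0.77, 0.41))   # a different, separately stored cell
```

A plain dictionary keyed by cell keys realizes the property that only populated cells exist, so the number of stored cells never exceeds the number of points in the window.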
The proposed method maintains the grid structure \({\mathbb {G}}_t\) where a cell \(C_i \in {\mathbb {G}}_t\) exists only if it contains at least one point. In other words, only the cells populated by points are stored, so, depending on the cell length \(l\), \({\mathbb {G}}_t\) contains at most \(n\) cells. The number of points in a cell \(C_i\) is denoted \(|{\mathbb {S}}_i|\). Next, the definition of the minimum distance (\({\textit{MinDist}}\)) is given as follows.
Definition 12
(\({\textit{MinDist}}\)) Given the coordinates of the bottom-left and top-right corners of a cell \(C_i\), \(X=(x_1, x_2,\ldots ,x_d)\) and \(Y=(y_1, y_2,\ldots ,y_d)\), respectively, and a point \(p=(p_1,p_2,\ldots ,p_d)\), the minimum distance between \(p\) and \(C_i\) can be defined as \({\textit{MinDist}}(p,C_i)=\sqrt{\sum _{j=1}^{d} m_j^2}\), where \(m_j = x_j-p_j\) if \(p_j<x_j\), \(m_j = p_j-y_j\) if \(p_j>y_j\), and \(m_j = 0\) otherwise.
\({\textit{MinDist}}(p,C_i)\) is the distance between a point \(p\) and the closest edge of the cell \(C_i\). This is illustrated in Fig. 8 with an example of 2-dimensional data mapped onto the grid \({\mathbb {G}}_t\). \({\mathbb {G}}_t\) contains only the populated cells (\(C_1,C_2,C_3,C_4,C_5,C_6,C_7\)); no other cells need to be initialized or stored. The point \(p\) is mapped to the cell \(C_1=(1,1)\), and the \({\textit{MinDist}}\)s are represented as the lengths of the arrows originating from \(p\) to the closest edges of the nearby cells.
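Definition 12 translates directly into code. This is a sketch with illustrative names: the per-dimension case analysis is the standard point-to-rectangle minimum distance, matching the reconstruction above.

```python
import math

def min_dist(p, bottom_left, top_right):
    """MinDist between point p and the cell with corners X=bottom_left and
    Y=top_right: in each dimension the gap is x_j - p_j if p lies below the
    cell, p_j - y_j if above, and 0 if p_j lies inside [x_j, y_j]."""
    s = 0.0
    for pj, xj, yj in zip(p, bottom_left, top_right):
        if pj < xj:
            s += (xj - pj) ** 2
        elif pj > yj:
            s += (pj - yj) ** 2
        # otherwise this coordinate contributes nothing to the distance
    return math.sqrt(s)

inside = min_dist((0.5, 0.5), (0.4, 0.4), (0.6, 0.6))  # p is inside: 0.0
left = min_dist((0.1, 0.5), (0.4, 0.4), (0.6, 0.6))    # gap to the left edge
```

A point inside the cell has \({\textit{MinDist}}=0\), which is why the cell containing the query point is always visited first in the search described next.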
6.2.2 Grid-based k-nearest neighbor computation algorithm
A naïve way of calculating the k-nearest neighbors of a point \(p \in W_t\) is to compute the pairwise distances \({\textit{dist}}(p,q)\) (where \(q\in W_t {\setminus } \{p\} \)) to the \(n-1\) other points in \(W_t\). Consequently, \(n-1\) points have to be scanned to calculate the distances, which are then sorted in ascending order to obtain the k nearest points. However, the proposed grid structure groups all points into cells by partitioning the space, where the number of cells is much smaller than \(n\).
Consider the example in Fig. 8 where the k-nearest neighbors (\(k=5\)) have to be calculated for the target point \(p\), which belongs to the cell \(C_1=(1,1)\). The \({\textit{MinDist}}\) from \(p\) to each other cell (excluding \(C_1\)) can be calculated according to Definition 12. The cells are sorted in ascending order of \({\textit{MinDist}}\) and the numbers of points are summed cell by cell until \(k\) or more points are collected. For example, the cells in Fig. 8 sorted by \({\textit{MinDist}}\) from \(p\) are \(C_4,C_2,C_5,C_7,C_3,C_6\), and the 5th neighbor of \(p\) lies in \(C_5\) (\(|{\mathbb {S}}_4|+|{\mathbb {S}}_2|+|{\mathbb {S}}_5|=8\)). Next, the distances from \(p\) to all points in \(C_4\), \(C_2\) and \(C_5\) are calculated and sorted in ascending order. The distance from \(p\) to its 5th nearest neighbor so far is set to \({\textit{current}}\_{\textit{kdist}}(p)\). Note that \({\textit{current}}\_{\textit{kdist}}(p)\) is not necessarily \({\textit{kdist}}(p)\). For example, the point in cell \(C_7\) is within the 5 nearest neighbors of \(p\), but \(C_7\) is at a larger \({\textit{MinDist}}\) than \(C_4\), \(C_2\) and \(C_5\). To address this, referring to the example in Fig. 8, we need to check the remaining cells (i.e. \(C_3,C_6,C_7\)) whose \({\textit{MinDist}}\)s from \(p\) are within \({\textit{current}}\_{\textit{kdist}}(p)\). As \({\textit{current}}\_{\textit{kdist}}(p)\) is the maximum possible value of \({\textit{kdist}}(p)\), we only need to check the cells which lie within \({\textit{current}}\_{\textit{kdist}}(p)\). If any cell \(C_i\in {\mathbb {G}}_t{\setminus } \{C_1,C_2,C_4,C_5\}\) has \({\textit{MinDist}}(p,C_i) < {\textit{current}}\_{\textit{kdist}}(p)\), then we calculate the distances from \(p\) to the points in \(C_i\) and update \({\textit{current}}\_{\textit{kdist}}(p)\).
Therefore, we continue this process until a cell appears whose \({\textit{MinDist}}\) is greater than \({\textit{current}}\_{\textit{kdist}}(p)\), at which point \({\textit{current}}\_{\textit{kdist}}(p)\) is taken as \({\textit{kdist}}(p)\). Hence, the k-nearest neighbors of \(p\) are obtained in this manner.
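The search procedure above can be sketched as follows. This is a simplified reading of Algorithm 2, not the paper's exact pseudocode: helper names are ours, cells are sorted up front rather than traversed lazily, and candidate distances are re-sorted after each cell.

```python
import math

def cell_key(p, l):
    return tuple(math.floor(x / l) for x in p)

def min_dist(p, key, l):
    """MinDist from p to the cell addressed by key (Definition 12)."""
    lo = [c * l for c in key]
    hi = [(c + 1) * l for c in key]
    return math.sqrt(sum((lo[j] - p[j]) ** 2 if p[j] < lo[j]
                         else (p[j] - hi[j]) ** 2 if p[j] > hi[j] else 0.0
                         for j in range(len(p))))

def grid_knn(p, grid, l, k):
    """Visit cells in ascending MinDist from p, collect candidate distances
    until k or more points are seen, then keep refining while a cell's
    MinDist is below current_kdist; stop as soon as it is not."""
    cells = sorted(grid.items(), key=lambda kv: min_dist(p, kv[0], l))
    dists = []                       # sorted candidate distances from p
    for key, points in cells:
        if len(dists) >= k and min_dist(p, key, l) > dists[k - 1]:
            break                    # no closer point can exist in later cells
        dists.extend(math.dist(p, q) for q in points if q != p)
        dists.sort()
    return dists[:k]                 # distances to the k nearest neighbors

pts = [(0.1, 0.1), (0.12, 0.1), (0.3, 0.3), (0.52, 0.5), (0.77, 0.41)]
grid = {}
for q in pts:
    grid.setdefault(cell_key(q, 0.2), []).append(q)
knn = grid_knn((0.1, 0.1), grid, 0.2, 2)
brute = sorted(math.dist((0.1, 0.1), q) for q in pts if q != (0.1, 0.1))[:2]
```

Because cells are visited in ascending \({\textit{MinDist}}\) order, the early-exit test is exactly the stopping condition of the text: once a cell's \({\textit{MinDist}}\) exceeds the current k-distance, no unvisited cell can contain a closer point.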
The proposed grid structure can significantly reduce the time complexity. The grid computes \({\textit{MinDist}}\) to each cell \(C_i \in {\mathbb {G}}_t\) instead of distances to the \(n\) points in the window, where the number of cells in the grid can be much smaller than \(n\) (\(|{\mathbb {G}}_t|\ll n\)), provided that an appropriate cell length \(l\) is chosen. In addition, \({\textit{MinDist}}(p, C_i) < {\textit{current}}\_{\textit{kdist}}(p)\) does not hold for most of the cells, which means that we do not need to check the points inside those cells.
Algorithm 2 presents the grid-based k-nearest neighbor computation method. Given an input point \(p\), the algorithm obtains the cell key \(K_j\) of the target point \(p\) using Definition 10 (line 1). If the cell key already exists in \({\mathbb {G}}_t\), then \(p\) is inserted in that cell; otherwise a new cell is initialized and inserted in \({\mathbb {G}}_t\) (lines 3–8). The \({\textit{MinDist}}\)s of \(p\) to all cells are calculated and the cells are sorted in ascending order of \({\textit{MinDist}}\) (lines 10–12). After that, the sorted cells are traversed and the running sum of the numbers of points in the traversed cells (\({\textit{count}}\)) is maintained. Also, each traversed cell is inserted in \({\mathbb {N}}\) (lines 15–17). If \({\textit{count}}\) becomes greater than or equal to \(k\) (line 18), then the distance of \(p\) to each point in the cells in \({\mathbb {N}}\) is calculated (lines 19–21). The distances are sorted in ascending order and \({\textit{current}}\_{\textit{kdist}}(p)\) is obtained (line 22). Thereafter, the remaining cells that are within \({\textit{current}}\_{\textit{kdist}}(p)\) are checked to update \({\textit{current}}\_{\textit{kdist}}(p)\). If a cell \(C_i\) has \({\textit{MinDist}}(p,C_i)\le {\textit{current}}\_{\textit{kdist}}(p) \) (line 27), the distances between \(p\) and the points in \({\mathbb {S}}_i\) are calculated and \({\textit{current}}\_{\textit{kdist}}(p)\) is updated (lines 28–30). If a cell is encountered with \({\textit{MinDist}} \ge {\textit{current}}\_{\textit{kdist}}(p)\), the algorithm stops and no further cells are checked (line 32). Finally, Algorithm 2 returns the k-nearest neighbors of \(p\).
When a point \(p\) expires from \(W_t\), the cell key of \(p\) is calculated and \(p\) is removed from that cell. If the cell becomes empty after the removal of \(p\), the cell is deleted from \({\mathbb {G}}_t\).
Another important observation is that the k-neighborhoods of only a limited number of points in \(W_t\) have to be updated. Refer to Fig. 9a, where \(k=5\) and \(r\) is the 5th neighbor of \(q\). Consider that \(p\) arrives in the neighborhood \(N^{{\textit{old}}}_{k}(q)\) (indicated by the solid circle in Fig. 9a) of an existing point \(q\in W_t {\setminus } \{p\}\). Consequently, \({\textit{dist}}(q,p)<{\textit{kdist}}(q)\) holds in this case, and therefore the neighborhood can be updated to \(N^{{\textit{new}}}_{k}(q)\), where \(r\) is removed and \(o\) becomes the new kth neighbor of \(q\) (indicated by the dotted circle in Fig. 9a). On the other hand, no k-neighborhood update is required for the points \(q\in W_t {\setminus } \{ p\}\) which do not satisfy the condition \({\textit{dist}}(q,p)<{\textit{kdist}}(q)\) (indicated by square points). Therefore, the k-neighborhoods of all the points affected by the arrival of \(p\) can be updated in this way. However, if a point \(p\) expires from \(N_k(q)\) (as shown in Fig. 9b), then Algorithm 2 is invoked to find the new k-neighborhood of \(q\). Hence, Algorithm 2 is invoked for only a limited number of points in \(W_t\).
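The arrival-side update can be sketched as follows. This is a minimal sketch under the assumption that a point's neighbor list is kept sorted by distance; the function name and list representation are illustrative, not from the paper.

```python
import math

def on_arrival(p, q, nn, k):
    """If the new point p falls inside q's current k-neighborhood, i.e.
    dist(q, p) < kdist(q), insert p and drop the old k-th neighbor;
    otherwise q's neighborhood is unchanged (cf. Fig. 9a).
    nn is q's k nearest neighbors sorted by distance to q."""
    kdist = math.dist(q, nn[-1])            # distance to current k-th neighbor
    if math.dist(q, p) < kdist:
        nn = sorted(nn + [p], key=lambda o: math.dist(q, o))[:k]
        return nn, True                     # neighborhood updated
    return nn, False                        # no update required

q = (0.0, 0.0)
nn = [(1.0, 0.0), (2.0, 0.0)]               # q's 2 nearest neighbors, sorted
updated, changed = on_arrival((0.5, 0.0), q, nn, 2)      # p is close: update
unchanged, changed_far = on_arrival((3.0, 0.0), q, nn, 2)  # p is far: no-op
```

The cheap test \({\textit{dist}}(q,p)<{\textit{kdist}}(q)\) is what keeps the arrival case inexpensive; only the expiration case falls back to the full grid search of Algorithm 2.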
6.3 Incremental computation
The BPF scores of a subset of points in \(W_t\) are affected by a window slide due to the change in their k-neighbors, which affects their Gravity values and LOF scores. As a result, the affected points should be identified and their BPF scores recalculated. In a window of \(n\) points, the Gravity values of only those points whose k-neighborhoods have changed due to the window slide have to be updated. However, the window slide affects the LOF scores of a larger number of points.
For a point \(q\in W_t\) whose k-neighborhood has changed, the LOF score of \(q\) as well as those of the points which contain \(q\) in their k-neighborhoods should be recalculated. A detailed explanation and analysis of the change in LOF scores is presented in [33]. We adopt the observations in [33] to collect the affected points whose LOF scores should be recalculated. The following is a summary of these observations.

Given that \(p\) arrives and \(r\) expires from \(W_t\), the k-neighborhoods of the points affected by the arrival of \(p\) and the expiration of \(r\) have to be updated. The update of a k-neighborhood results in the update of its \({\textit{kdist}}\). Consequently, the update of \({\textit{kdist}}\) requires the update of \({\textit{reachdist}}_k\), \({\textit{lrd}}_k\) and \({\textit{LOF}}\) according to Definitions 1, 2 and 3, respectively. Let \({\mathbb {U}}\) represent all points whose k-neighborhoods have changed; then the LOF scores of all points in \({\mathbb {U}}\) should be updated.

Let \({\mathbb {U}}\) be the set of points whose k-neighborhoods have changed, and let \(q \in {\mathbb {U}}\) and \(o\in N_k(q)\). According to Definition 1, if \(q\in N_k(o)\), then the change affects the reachability distance of \(o\) w.r.t. \(q\) (\({\textit{reachdist}}(o,q)\)). Subsequently, \({\textit{reachdist}}(o,q)\) affects \({\textit{lrd}}(o)\) and \({\textit{LOF}}(o)\). Therefore, \(o\) is also considered an affected point.

According to Definition 3, the LOF score of a point \(q\) should be updated if \({\textit{lrd}}(q)\) or the \({\textit{lrd}}\) of any of its k-nearest neighbors \(o\in N_k(q)\) changes. If \({\textit{lrd}}(o)\) has changed, then all points containing \(o\) in their k-nearest neighborhoods are considered affected and their LOF scores should be updated.
The subset of points whose LOF scores are affected by the window slide can be obtained using the observations given above. Consequently, the BPF scores of all the affected points should be updated.
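The three observations can be summarized in code. This is a simplified sketch over precomputed neighbor maps: `knn` maps each point to its k-neighbor set and `rknn` to its reverse k-neighbor set; both names and the map representation are ours, not from [33].

```python
def affected_points(updated, knn, rknn):
    """Collect the points whose LOF (and hence BPF) scores must be
    recomputed after a window slide, per the three observations above."""
    affected = set(updated)            # observation 1: U itself
    lrd_changed = set(updated)
    for q in updated:                  # observation 2: any o with q in N_k(o)
        for o in knn[q]:
            if q in knn[o]:            # reachdist(o,q), lrd(o), LOF(o) change
                affected.add(o)
                lrd_changed.add(o)
    for o in lrd_changed:              # observation 3: reverse neighbors of
        affected |= rknn.get(o, set()) # any point whose lrd changed
    return affected

# Toy example: a's neighborhood changed; b is a mutual neighbor of a, and c
# has a as a neighbor, so all three need fresh LOF scores.
knn = {'a': {'b'}, 'b': {'a'}, 'c': {'a'}}
rknn = {'a': {'b', 'c'}, 'b': {'a'}}
result = affected_points({'a'}, knn, rknn)
```

When the slide size is small, the affected set is typically a small fraction of the window, which is what makes the incremental computation in Algorithm 3 pay off.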
6.4 Algorithm
This section presents the overall algorithm StreamBPF, which uses the grid structure to calculate k-nearest neighbors and the incremental method to update BPF scores. The algorithm steps are given as Algorithm 3.
At each time step, StreamBPF maintains the k-neighbors of each point as \({\textit{KNN}}_t=\{N_k(q)\mid q\in W_t \}\). Also, it maintains the local reachability densities, LOF scores, Gravity values and BPF scores as \({\mathbb {B}}_t =\{\left( {\textit{lrd}}(q),{\textit{LOF}}(q),G(q),{\textit{BPF}}(q)\right) \mid q\in W_t\}\). Given a window \(W_t\) of size \(n\), a new arrival point \(p\) and an expired point \(r\), StreamBPF removes \(r\) from \({\textit{KNN}}_t\) and \({\mathbb {B}}_t\) (line 1). For the remaining points which satisfy the condition given on line 4, StreamBPF updates their neighborhoods (lines 5–6). For the points which satisfy the condition on line 8, StreamBPF invokes Algorithm 2 (line 9). All points whose k-neighborhoods have been updated are stored in \({\mathbb {U}}\). The algorithm calculates the Gravity values of all points whose neighborhoods have changed (line 14). Next, the reverse k-nearest neighbors of all points are calculated (line 16). Thereafter, all the points whose LOF scores are affected by the window slide are collected (lines 17–25) and the \({\textit{lrd}}\)s of these points are calculated using Definition 2 (lines 26–28). Also, the \({\textit{lrd}}\) of \(p\) is calculated (line 29). Finally, the LOF scores, Gravity values and BPF scores of \(p\) and the affected points are calculated (lines 30–33). The scores are sorted in descending order of BPF score and the top-m boundary points in \(W_t\) are returned (lines 34–35).
Algorithm 3 considers the arrival and expiration of one point at each time step. However, it can easily be extended to the case in which more than one point arrives and expires at each time step.
6.5 Evaluation of StreamBPF
In this section, the runtime performance of StreamBPF is evaluated on synthetic and real data. Since the accuracy evaluation of StaticBPF is already given in Sect. 5.3 and StreamBPF computes exactly the same BPF scores as StaticBPF, we do not evaluate the accuracy of StreamBPF further. Moreover, to the best of our knowledge, there are no other methods of boundary points detection for streaming data. Therefore, we compare StreamBPF with its variants, as summarized in Table , in order to demonstrate the performance improvement achieved by the proposed grid structure and incremental method.
All algorithms are implemented in Java 1.8, and the experiments were executed on a workstation with 32 GB memory and an Intel Core i7-7700 CPU running Windows 10 Pro (64-bit).
6.5.1 Synthetic data streams
We evaluated the CPU runtime by changing the number of nearest neighbors (\(k\)), slide size (\(w\)), window size (\(n\)) and dimensionality (\(d\)) on synthetic data streams. Table shows the ranges of the parameters, where the bold and underlined values are the defaults unless otherwise stated. The synthetic data stream consists of 3 Gaussian clusters of different means and standard deviations, and randomly generated outliers. In the initial window \(W_0\), \(n\) points randomly sampled from the three Gaussian clusters and outliers are loaded, and their BPF scores are calculated. Thereafter, at each time step, the \(w\) oldest points are deleted from the window and \(w\) new points drawn from the same distribution are inserted. The CPU runtime of each window is recorded from \(W_1\) until all points in the data stream have arrived. For all experiments, the window slides 100 times and the average CPU runtime is reported in seconds. Moreover, all datasets are normalized to the range [0,1] and the default value of the grid cell length is \(l=0.05\). In the experiments evaluating the effect of dimensionality, we adjusted the cell length (\(l\)) to 0.05, 0.2, 0.45, 0.45 and 0.45 for dimensionality 2, 10, 20, 50 and 100, respectively. The results are shown in Fig. .
Figure 10a shows the effect of increasing the slide size (\(w\)) on CPU runtime in logarithmic scale. At \(w=1\), StreamBPF performs more than \(2\times \) faster than \({\textit{GridBPF}}\) and \({\textit{IncBPF}}\), and more than \(150\times \) faster than StaticBPF. This is because the expiration and arrival of one point affect the k-nearest neighborhoods and LOF scores of only a small fraction of points. Consequently, the BPF scores of a small number of points have to be updated. Moreover, the grid structure reduces the runtime of the k-nearest neighbor computation. However, when \(w\) increases, the runtimes of \({\textit{StreamBPF}}\) and \({\textit{GridBPF}}\) become comparable, as the BPF scores of a larger number of points have to be updated. Hence, the incremental computation becomes less useful at larger \(w\). Similarly, the runtime of \({\textit{IncBPF}}\) increases with \(w\) for the same reason. However, \({\textit{StreamBPF}}\) still performs better than \({\textit{IncBPF}}\) due to the proposed grid structure.
In Fig. 10b–d, the results of \({\textit{StaticBPF}}\) are not included for clarity and the runtime is shown in linear scale.
Figure 10b shows the impact of increasing the number of nearest neighbors \(k\) used to calculate BPF scores. Overall, the runtime of all methods increases with \(k\) as more points have to be processed to calculate the BPF scores. However, \({\textit{StreamBPF}}\) consistently performs better than the other methods due to the proposed grid structure and incremental computation.
Figure 10c, d shows the impact of increasing the window size (\(n\)) and dimensionality (\(d\)), respectively. It may be observed that the relative runtime of \({\textit{StreamBPF}}\) improves as \(n\) and \(d\) increase. This is because when \(n\) is increased at fixed \(k\) and \(w\), the k-neighborhoods and LOF scores of a smaller number of points have to be updated in comparison with the total number of points in the window. Consequently, the BPF scores of a small fraction of points have to be updated. Furthermore, \({\textit{StreamBPF}}\)’s runtime does not increase drastically with \(d\), as the incremental computation avoids redundant computation, unlike \({\textit{GridBPF}}\).
6.5.2 Real data streams
For evaluation on real data streams, we simulated data streams from the real datasets MNIST and ORL. Originally, the MNIST and ORL datasets have 1010 and 400 data points, respectively. We preprocessed these datasets as explained in Sect. 5.3.4; hence, the dimensionalities of MNIST and ORL are 784 and 10,304, respectively. We scaled these datasets by randomly sampling points, allowing data points to repeat, in order to simulate streaming data. We set the window size \(n=3000\), and the grid cell length \(l=0.01\) for MNIST and \(l=0.02\) for ORL. Also, we set \(k=50\) as StaticBPF showed reasonable accuracy on ORL and MNIST at this value. The results of the runtime evaluation are shown in Fig. .
On MNIST, StreamBPF performs \(8\times \) and \(1.9\times \) faster than GridBPF and IncBPF, respectively, while IncBPF performs \(4\times \) faster than GridBPF. On ORL, \({\textit{StreamBPF}}\) performs \(10\times \) and \(3\times \) faster than GridBPF and IncBPF, respectively, while \({\textit{IncBPF}}\) performs \(2.5\times \) faster than \({\textit{GridBPF}}\). The performance improvement of \({\textit{StreamBPF}}\) and \({\textit{IncBPF}}\) over \({\textit{StaticBPF}}\) and \({\textit{GridBPF}}\) can be attributed to the incremental computation, which updates only the BPF scores of the points affected by the window slide. As \(w=1\) and \(k=50\), a small fraction of the points are affected by the window slide, and therefore a smaller number of BPF scores have to be updated. Hence, the incremental computation contributed mainly to the performance improvement of \({\textit{StreamBPF}}\) and \({\textit{IncBPF}}\). The further improvement in \({\textit{StreamBPF}}\)’s performance is due to the grid structure. Since these data streams contained duplicate points, the same points were mapped onto the same cells. As a result, the grid structure used fewer cells to cover all the points in the window. Hence, fewer cells were processed to calculate the k-nearest neighbors of a target point, resulting in the faster performance observed in these experiments.
7 Conclusion
In conclusion, this paper has targeted the problem of detecting boundary points in static and streaming data. Firstly, we proposed a boundary points detection method called BPF which calculates BPF scores based on Gravity values and LOF scores to identify boundary points in a dataset. Boundary points get larger BPF scores than core points and outliers. Based on BPF, we proposed StaticBPF, which can detect the top-m boundary points in a static dataset. We evaluated \({\textit{StaticBPF}}\) on synthetic and real data, where we showed its accuracy visually and quantitatively using various metrics. StaticBPF showed comparable or better results than other methods on synthetic and real datasets. Overall, StaticBPF was found to be robust to the presence of clusters of different shapes, sizes and densities. Furthermore, we showed experimentally that the parameter \(k\) tuned for \({\textit{StaticBPF}}\) can also be used with LOF for outlier detection.
Secondly, this paper proposed StreamBPF for boundary points detection over streaming data. StreamBPF employs a grid structure for the k-nearest neighbor computation and an incremental computation method to update the BPF scores of the points affected by the window slide. The experimental results showed that the proposed grid structure can effectively speed up the k-nearest neighbor computation. Also, the incremental computation method was found to be advantageous when the number of points affected by the window slide is small. Overall, the results suggest that StreamBPF can be used for boundary points detection in streaming data.
As a future direction, developing an approximate method of detecting boundary points would be an interesting way to extend this work.
References
Ackermann MR, Märtens M, Raupach C et al (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithmics (JEA) 17:2–1
Angiulli F, Fassetti F (2007) Detecting distance-based outliers in streams of data. In: Proceedings of the 16th ACM conference on information and knowledge management, pp 811–820
Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng 17(2):203–215
AT&T. The ORL database of faces. https://camorl.co.uk/facedatabase.html
Breunig MM, Kriegel HP, Ng RT et al (2000) LOF: identifying density-based local outliers. In: Proc. 2000 ACM SIGMOD international conference on management of data, pp 93–104
Cao X (2021) High-dimensional cluster boundary detection using directed Markov tree. Pattern Anal Appl 24(1):35–47
Cao F, Estert M, Qian W et al (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the 2006 SIAM international conference on data mining. SIAM, pp 328–339
Cao L, Yang D, Wang Q et al (2014) Scalable distance-based outlier detection over high-volume data streams. In: 2014 IEEE 30th international conference on data engineering. IEEE, pp 76–87
Cao X, Qiu B, Xu G (2019) BorderShift: toward optimal mean-shift vector for cluster boundary detection in high-dimensional data. Pattern Anal Appl 22(3):1015–1027
Cox L. Biomedical data. http://lib.stat.cmu.edu/datasets/
Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Ester M, Kriegel HP, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp 226–231
Gao J, Ji W, Zhang L et al (2020) Cube-based incremental outlier detection for streaming computing. Inf Sci 517:361–376
Hariri S, Kind MC, Brunner RJ (2019) Extended isolation forest. IEEE Trans Knowl Data Eng 33(4):1479–1489
Hawkins DM (1980) Identification of outliers, vol 11. Springer, Berlin
Ishida K, Kitagawa H (2008) Detecting current outliers: continuous outlier detection over time-series data streams. In: International conference on database and expert systems applications. Springer, Berlin, pp 255–268
Jin W, Tung AK, Han J (2001) Mining top-n local outliers in large databases. In: Proc. 7th ACM SIGKDD international conference on knowledge discovery and data mining, pp 293–298
Khalique V, Kitagawa H (2021) VOA*: fast angle-based outlier detection over high-dimensional data streams. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 40–52
Khalique V, Kitagawa H (2022) BPF: an effective cluster boundary points detection technique. In: Strauss C, Cuzzocrea A, Kotsis G et al (eds) Database and expert systems applications. Springer, Cham, pp 404–416
Knorr EM, Ng RT, Tucakov V (2000) Distancebased outliers: algorithms and applications. Very Large Data Bases J 8(3–4):237–253
Knox EM, Ng RT (1998) Algorithms for mining distancebased outliers in large datasets. In: Proceedings of the international conference on very large data bases, pp 392–403
Kontaki M, Gounaris A, Papadopoulos AN et al (2011) Continuous monitoring of distancebased outliers over data streams. In: 2011 IEEE 27th International conference on data engineering. IEEE, pp 135–146
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proc. 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 444–452
Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data (TKDD) 3(1):1–58
LeCun Y, Cortes C. Mnist. http://yann.lecun.com/exdb/mnist/
Li Y, Maguire L (2010) Selecting critical patterns based on local geometrical and statistical information. IEEE Trans Pattern Anal Mach Intell 33(6):1189–1201
Li X, Wu X, Lv J et al (2018) Automatic detection of boundary points based on local geometrical measures. Soft Comput 22(11):3663–3674
Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: 2008 eighth IEEE international conference on data mining. IEEE, pp 413–422
Na GS, Kim D, Yu H (2018) DILOF: effective and memory efficient local outlier detection in data streams. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1993–2002
Papadimitriou S, Kitagawa H, Gibbons PB et al (2003) LOCI: fast outlier detection using the local correlation integral. In: Proceedings 19th international conference on data engineering. IEEE, pp 315–326
Papadopoulos A, Manolopoulos Y (1997) Performance of nearest neighbor queries in R-trees. In: International conference on database theory. Springer, Berlin, pp 394–408
Pham N, Pagh R (2012) A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In: Proc. 18th ACM SIGKDD international conference on knowledge discovery and data mining, pp 877–885
Pokrajac D, Lazarevic A, Latecki LJ (2007) Incremental local outlier detection for data streams. In: 2007 IEEE symposium on computational intelligence and data mining. IEEE, pp 504–515
Qiu B, Cao X (2016) Clustering boundary detection for high dimensional space based on space inversion and Hopkins statistics. Knowl Based Syst 98:216–225
Qiu BZ, Yue F, Shen JY (2007) BRIM: an efficient boundary points detecting algorithm. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, Berlin, pp 761–768
Roussopoulos N, Kelley S, Vincent F (1995) Nearest neighbor queries. In: Proceedings of the 1995 ACM SIGMOD international conference on management of data, pp 71–79
Salehi M, Leckie C, Bezdek JC et al (2016) Fast memory efficient local outlier detection in data streams. IEEE Trans Knowl Data Eng 28(12):3246–3260
Schubert E, Sander J, Ester M et al (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst (TODS) 42(3):1–21
Tran L, Fan L, Shahabi C (2016) Distancebased outlier detection in data streams. Proc VLDB Endow 9(12):1089–1100
Xia C, Lu H, Ooi BC et al (2004) GORDER: an efficient method for KNN join processing. In: Proc. 30th international conference on very large data bases, vol 30, pp 756–767
Xia C, Hsu W, Lee ML et al (2006) BORDER: efficient computation of boundary points. IEEE Trans Knowl Data Eng 18(3):289–303
Yoon S, Lee JG, Lee BS (2019) NETS: extremely fast outlier detection from a data stream via set-based processing. Proc VLDB Endow 12(11):1303–1315
Zimek A, Schubert E, Kriegel HP (2012) A survey on unsupervised outlier detection in highdimensional numerical data. Stat Anal Data Min ASA Data Sci J 5(5):363–387
Acknowledgements
This work was partly supported by JSPS KAKENHI Grant Numbers JP19H04114, JP22H03694 and JP22K19802, JST CREST Grant Number JPJCR22M2, NEDO Grant Number JPNP20006, and AMED Grant Number JP21zf0127005.
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Khalique, V., Kitagawa, H. & Amagasa, T. BPF: a novel cluster boundary points detection method for static and streaming data. Knowl Inf Syst 65, 2991–3022 (2023). https://doi.org/10.1007/s10115-023-01854-1