Half-space mass: a maximally robust and efficient data depth method
Abstract
Data depth is a statistical method that models a data distribution in terms of center-outward ranking rather than density or linear ranking. Despite considerable academic interest, its applications are hampered by the lack of a method that is both robust and efficient. This paper introduces Half-Space Mass, a significantly improved version of half-space depth. To the best of our knowledge, Half-Space Mass is the only data depth method that is both robust and efficient. We also reveal four theoretical properties of Half-Space Mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution, (ii) it has a unique maximum point, which can be regarded as a median, (iii) the median is maximally robust, and (iv) its estimation extends to a higher dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of Half-Space Mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than half-space depth, and it runs orders of magnitude faster than \(L_2\) depth, an existing maximally robust location estimator. In clustering, the Half-Space Mass version of K-means overcomes three weaknesses of K-means.
Keywords
Half-space mass · Mass estimation · Data depth · Robustness

1 Introduction
“Most important for the selection of a depth statistic in applications are the questions of computability and - depending on the data situation - robustness.” - Karl Mosler (2013)
Data depth (Liu et al. 1999) is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. In 1975, Tukey (1975) proposed a way to define multivariate median in a data cloud, known as half-space depth or Tukey depth. Since then it has been extensively studied. Donoho and Gasko (1992) have revealed the breakdown point of Tukey median; Zuo and Serfling (2000) have compared it to various competitors and Dutta et al. (2011) have investigated the properties of half-space depth. Meanwhile, the concept of data depth has been adopted for multivariate statistical analysis since it provides a nonparametric approach that does not rely on the assumption of normality (Liu et al. 1999).
- (i)
It is concave in a user-defined region that covers the source density distribution or the data cloud. An example is shown in Fig. 1.
- (ii)
It has a unique maximum point, which can be regarded as a multi-dimensional median.
- (iii)
Its median, which has a breakdown point equal to \(\frac{1}{2}\), is maximally robust.
- (iv)
It extends the information carried in a dataset to a higher dimensional space in which such dataset has a zero-volume convex hull.
Its maximal robustness leads directly to better performance in anomaly detection than half-space depth.
Compared to the existing maximally robust \(L_2\) depth, it runs orders of magnitude faster.
Compared to the distance-based K-means clustering method, the half-space mass-based version overcomes three weaknesses of K-means (Tan et al. 2014), enabling it to find clusters of varying densities and sizes, as well as clusters in the presence of noise.
The rest of the paper is organized as follows. Section 2 introduces the formal definitions of half-space mass as well as the proposed implementation. Sections 3 and 4 provide its theoretical properties and proofs, respectively. Section 5 discusses the relationship between half-space mass and other data depth methods. Section 6 describes applications of half-space mass in anomaly detection and clustering. Section 7 reports the empirical evaluations. Section 8 discusses its relation to mass estimation and Sect. 9 concludes the paper.
2 Half-space mass
2.1 Definitions
Notations

Symbol | Meaning
---|---
\(\mathfrak {R}^d\) | A d-dimensional real space |
\(\ell \) | A direction in \(\mathfrak {R}^d\) |
x | A one-dimensional point in \(\mathfrak {R}\) |
\(\mathbf{x}\) | A point in \(\mathfrak {R}^d\) |
D | A dataset, where \(|D| = n\) |
\(\mathbf{X}\) | A point in D |
\({\mathcal {D}}\) | A subset of D, where \(|{\mathcal {D}}| = \psi \) |
t | Number of half-spaces sampled for estimation |
R | A convex region covering a source density F or a dataset D |
\(\lambda \) | A parameter that determines the size of R |
\(P_F(\cdot )\) | A probability mass function of a probability density distribution F |
\(P_D(\cdot )\) | An empirical probability mass function of a dataset D |
\(HM(\cdot |F)\) | Half-space mass function given F |
\(HM(\cdot |D)\) | Half-space mass function given D |
Let \(F(\mathbf{x})\) be a probability density on \(\mathbf{x} \in \mathfrak {R}^d\), \(d \ge 1\); \(R \subset \mathfrak {R}^d\) be a convex and closed region covering the domain of F; and H be a closed half-space formed by separating \(\mathfrak {R}^d\) with a hyperplane that intersects R. Note that the probability mass of H computed with respect to F is \(0 \le P_F(H)=P_F(H \cap R) \le 1\).
Definition 1
Half-space mass is defined as the expected probability mass of a randomly selected half-space H that is defined for R and contains the query point \(\mathbf{x}\), where every such half-space is equally likely. This definition bears a clear similarity to that of half-space depth (Tukey 1975): while half-space depth takes the minimum probability mass over all random half-spaces containing the query point \(\mathbf{x}\) as the depth value (see its definition in Table 2 in Sect. 5), half-space mass takes the expectation. This key difference gives half-space mass more desirable properties, which are discussed in Sects. 3 and 4.
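The contrast between the two definitions is easy to see in one dimension. The following sketch is our own illustration, not the paper's code: it estimates half-space mass by Monte Carlo over random split points and computes the exact univariate half-space depth, for a sample standing in for F.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)       # a 1-D sample standing in for F
lo, hi = data.min(), data.max()    # region R = data range (lambda = 1)

def hm_1d(x, data, t=5000, rng=rng):
    """Monte Carlo estimate of half-space mass: the average, over t random
    split points s in R, of the fraction of data on the same side as x."""
    s = rng.uniform(lo, hi, size=t)
    mass_left = (data[None, :] < s[:, None]).mean(axis=1)  # mass left of each s
    return float(np.where(x < s, mass_left, 1.0 - mass_left).mean())

def hd_1d(x, data):
    """Exact 1-D half-space depth: the smaller of the two tail fractions."""
    return min((data <= x).mean(), (data >= x).mean())
```

Note that the depth is exactly 0 everywhere outside the data range, whereas the mass still decreases smoothly there; this is a first glimpse of the extension property discussed in Sect. 3.4.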
In practice, an i.i.d. sample D is usually given instead of the source density distribution F. The sample version of \(HM(\mathbf{x}|F)\) is obtained by replacing F with D as follows.
Definition 2
We also propose a computation-friendly version to estimate \(HM(\mathbf{x}|D)\). Instead of using the whole dataset D to calculate \(P_D(H_i)\) in (1), a small subsample \({{\mathcal {D}}}_i \subset D\) with size \(|{{\mathcal {D}}}_i| = \psi \ll |D|\) is randomly selected from D without replacement for \(i = 1,\ldots ,t\). Let \(R_i\) be a convex region covering \({{\mathcal {D}}}_i\), \(H_i(\mathbf{x})\) be a randomly selected half-space containing \(\mathbf{x}\) and intersecting \(R_i\), for \(i = 1,\ldots ,t\).
Definition 3
2.2 Implementation
In general, half-space mass is a concave function in R, as will be shown in Sects. 3 and 4; therefore it provides distinct center-outward ordering in the region R, while concavity outside of R is not guaranteed.
When concavity needs to be guaranteed in a region larger than the convex hull of D, a larger R is desirable. To this end, we propose a projection-based algorithm to estimate \(HM(\mathbf{x}|D)\) in which the region R or \(R_i\) is determined by a size parameter \(\lambda \): the ratio of the diameter of R to that of the convex hull of D along every direction. The value of \(\lambda \) must be at least 1. When \(\lambda = 1\), R or \(R_i\) is the convex hull of D or \({{\mathcal {D}}}_i\); the larger \(\lambda \) is, the further R or \(R_i\) expands beyond that convex hull.
Algorithm 1 is the training procedure of \(\widetilde{HM}(\cdot |D)\). The half-space is implemented as follows: a random subsample \(\mathcal{D}_i\) is projected onto a random direction \(\ell \) in \(\mathfrak {R}^d\), t times. For each projection, a split point s is randomly selected within a range adjusted by \(\lambda \), and the numbers of points that fall on either side of s are recorded.
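A compact sketch of this training-and-scoring procedure follows. This is our reading of Algorithms 1 and 2; the function names and the exact handling of the split range are our assumptions, not the paper's pseudocode.

```python
import numpy as np

def train_hm(D, t=1000, psi=None, lam=1.0, seed=0):
    """Sketch of Algorithm 1: for each of t models, project a random
    subsample of size psi onto a random direction, pick a random split
    point s in a range widened by lambda, and record the mass on each side."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    psi = n if psi is None else psi
    models = []
    for _ in range(t):
        Di = D[rng.choice(n, size=psi, replace=False)]
        ell = rng.normal(size=d)               # random direction (unnormalised)
        proj = Di @ ell
        mid = (proj.max() + proj.min()) / 2.0
        half = (proj.max() - proj.min()) / 2.0
        s = rng.uniform(mid - lam * half, mid + lam * half)
        models.append((ell, s, (proj < s).mean()))
    return models

def score_hm(x, models):
    """Sketch of Algorithm 2: average, over the t models, of the recorded
    mass on the side of the split that contains x."""
    return float(np.mean([m_left if x @ ell < s else 1.0 - m_left
                          for ell, s, m_left in models]))
```

With \(\psi \ll |D|\) this corresponds to the computation-friendly version \(HM^*\); with \(\psi = |D|\) it is the sample version.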
2.3 Parameter setting
Here we provide a general guide for setting the parameters. The parameter t affects the accuracy of the estimation: the larger t is, the more accurate the estimation. In high-dimensional datasets, or datasets that are significantly more elongated in some directions than others, t should be set to a large value in order to gather sufficient information from all directions.
For the rest of this paper, we use Algorithms 1 and 2 to estimate half-space mass. The parameter \(\lambda \) is set to 1 by default unless mentioned otherwise.
3 Properties of half-space mass
We list four theoretical properties of half-space mass in this section, which are concavity in region R, unique median, the median having breakdown point equal to \(\frac{1}{2}\), and extension across dimension. Proofs of the lemma and theorems stated in this section can be found in Sect. 4.
3.1 Concavity
Lemma 1
HM(x|F) under Definition 1 is a concave function for any finite F in any finite R in a univariate real space \(\mathfrak {R}\).
Using this lemma, we can obtain the following theorem on the concavity of the multi-dimensional half-space mass distribution.
Theorem 1
\(HM(\mathbf{x}|F)\) under Definition 1 is a concave function for any finite F in any finite, convex and closed \(R \subset \mathfrak {R}^d\).
Similarly, \(HM(\mathbf{x}|D)\) is also concave in the convex region R covering D.
3.2 Unique median
Based on Theorem 1, a unique location in R which has the maximum half-space mass value is guaranteed, as stated in the following theorem:
Theorem 2
The “center” of a given density F based on half-space mass \(\mathbf{x}^* := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x}|F)\) is a unique location in R, given that F covers an area more than a straight line in \(\mathfrak {R}^d\).
3.3 Breakdown point
We define a location estimator based on half-space mass as follows: \(T(D) := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x} | D)\). It is a maximally robust estimator with properties given in the following theorem:
Theorem 3
The breakdown point of T, \(\epsilon (T, D) > \frac{n-1}{2n-1} \rightarrow \frac{1}{2}\) as \(n \rightarrow \infty \).
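A small numerical illustration of this robustness (our own demonstration, using an inline Monte Carlo estimator; not one of the paper's experiments): contaminate a cluster of n points with a distant cluster amounting to nearly half the data, and compare half-space mass at the original centre against the badly shifted sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
D = rng.normal(size=(n, 2))                 # normal cluster around the origin
Q = rng.normal(size=(40, 2)) + 100.0        # contamination: 40 distant points
DQ = np.vstack([D, Q])

def hm(x, data, t=5000, seed=7):
    """Compact Monte Carlo estimate of half-space mass (lambda = 1)."""
    rng = np.random.default_rng(seed)
    val = 0.0
    for _ in range(t):
        ell = rng.normal(size=data.shape[1])
        proj = data @ ell
        s = rng.uniform(proj.min(), proj.max())
        m_left = (proj < s).mean()
        val += m_left if x @ ell < s else 1.0 - m_left
    return val / t

mean = DQ.mean(axis=0)   # the non-robust estimate, dragged towards Q
```

The sample mean lands far from the original centre, while the half-space mass surface remains higher at that centre than at either the shifted mean or the contaminating cluster, consistent with the breakdown point above.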
3.4 Extension across dimension
Dutta et al. (2011) reveal that, for a dataset of size n in a \(d > n\) dimensional space, the d-dimensional volume of the convex hull of the dataset is zero, so half-space depth behaves degenerately, taking the value 0 almost everywhere in \(\mathfrak {R}^d\). In such cases, half-space depth does not carry any useful statistical information.
On the other hand, the definition of half-space mass enables it not only to rank locations outside the convex hull of the training dataset in the lower dimensional space where this convex hull has positive volume, but also to extend the ranking of locations to a higher dimensional space where the convex hull has zero volume.
4 Proofs
This section provides the proofs for the lemma and theorems given in the last section. The proofs for Lemma 1, Theorems 1, 2 and 3 are presented in the following four subsections.
4.1 Proof of Lemma 1
4.2 Proof of Theorem 1
\(HM(\mathbf{x}|F,\ell )\) is equivalent to the univariate mass distribution on \(\ell \) where F is projected onto \(\ell \). Accordingly, from Lemma 1, for all \(\mathbf{x} \in R\), it is concave in the direction of \(\ell \) and constant in any direction perpendicular to \(\ell \); thus \(HM(\mathbf{x}|F,\ell )\) is concave in R. Since a sum (or expectation) of concave functions is also concave, \(HM(\mathbf{x}|F)\) is concave in R.
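The argument can be written as a single chain, using \(HM(\mathbf{x}|F) = E_{\ell}[HM(\mathbf{x}|F,\ell)]\) and applying Lemma 1 direction by direction (this is a restatement of the proof above, not an additional result):

```latex
\begin{aligned}
HM(\alpha \mathbf{x}_1 + (1-\alpha)\mathbf{x}_2 \mid F)
  &= E_{\ell}\!\left[ HM(\alpha \mathbf{x}_1 + (1-\alpha)\mathbf{x}_2 \mid F, \ell) \right] \\
  &\ge E_{\ell}\!\left[ \alpha\, HM(\mathbf{x}_1 \mid F, \ell)
       + (1-\alpha)\, HM(\mathbf{x}_2 \mid F, \ell) \right]
     && \text{(Lemma 1, for each } \ell\text{)} \\
  &= \alpha\, HM(\mathbf{x}_1 \mid F) + (1-\alpha)\, HM(\mathbf{x}_2 \mid F),
\end{aligned}
```

for all \(\mathbf{x}_1, \mathbf{x}_2 \in R\) and \(\alpha \in [0,1]\), which is exactly concavity of \(HM(\cdot |F)\) in R.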
4.3 Proof of Theorem 2
Here we prove Theorem 2 by contradiction.
But since F covers an area more than a straight line, there will always exist an \(\ell \) and \(\mathbf{x}\) such that \(\mathbf{x}^{\ell } \in L^{\ell }\) and \(F^{\ell }(\mathbf{x}^{\ell })>0\), which contradicts (7). Therefore, there is one unique location that has the maximum half-space mass value in R.
4.4 Proof of Theorem 3
Suppose for a size n dataset D, a contaminating set Q of size \(n-1\) is strategically chosen. Let U denote the convex hull of D, and \(U^{\ell }\) denote its projection segment on a line along direction \(\ell \), assuming U has a finite volume in \(\mathfrak {R}^d\).
For any \(\ell \), the median of the projection of \( D \cup Q \) onto \(\ell \) lies within \(U^{\ell }\): if it lay outside \(U^{\ell }\), then at least n of the \(2n-1\) points would be on one side of it, contradicting the definition of the median. Since Ting et al. (2013) have shown that the univariate mass is maximised at the median, the maximum value of \(HM(\mathbf{x}^{\ell }| D^{\ell } \cup Q^{\ell })\) occurs in the segment \(U^{\ell }\) for every \(\ell \).
For a given query point \(\mathbf{x}\), let \({{\mathcal {L}}}_{\mathbf{x}}^- = \{\ell : \mathbf{x}^{\ell } \notin U^{\ell } \}\) denote the set of directions in \(\mathfrak {R}^d\) on which the projection of \(\mathbf{x}\) lies outside of the projection of the convex hull of D, and \({{\mathcal {L}}}_{\mathbf{x}}^+ = \{\ell : \mathbf{x}^{\ell } \in U^{\ell } \}\) denote the rest of the directions.
For any \(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}}\), the one-dimensional mass \(HM(\mathbf{x}^{\ell }| D^{\ell } \cup Q^{\ell })\) increases as \(\mathbf{x}^{\ell }\) moves a small enough distance towards \(U^{\ell }\), since it is a concave function whose maximum occurs somewhere in the segment \(U^{\ell }\).
The location estimator T(D) lies within U, the convex hull of D; hence if the distance between \(T(D \cup Q)\) and T(D) were infinite, the distance between \(T(D \cup Q)\) and U would also be infinite. Suppose then that \(\mathbf{x}^* = T(D \cup Q)\) is infinitely far away from U. The solid angle of U at \(\mathbf{x}^*\) is 0, so almost surely \(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}^*}\) for all \(\ell \in \mathfrak {R}^d\), and \(HM(\mathbf{x}^*| D \cup Q) = E_{ {{\mathcal {L}}}^-_{\mathbf{x}^*}}[HM({\mathbf {x^*}}^{\ell }| D^{\ell } \cup Q^{\ell })]\). Any movement of finite length from \(\mathbf{x}^*\) towards U increases the one-dimensional mass values \(HM(\mathbf{x}^{\ell }| D^{\ell } \cup Q^{\ell })\) for all \(\ell \in {{\mathcal {L}}}^-_{\mathbf{x}} \), and thus increases the mass value \(HM(\mathbf{x}| D \cup Q)\), contradicting the assumption that \(HM(\mathbf{x}^*| D \cup Q)\) is the maximum. Therefore \(T(D \cup Q)\) can only be finitely far away from T(D) for a contaminating dataset Q of size \(n-1\).
By the same argument, any contaminating dataset Q of any size between 1 and \(n-1\) added to a dataset D of size n can only cause a finite shift of the location estimator T. Therefore \(\epsilon (T, D) > \frac{n-1}{2n-1}\).
5 Relation to other data depth methods
Definitions of half-space mass (\(HM(\cdot )\)), half-space depth (\(HD(\cdot )\)) and \(L_2\) depth (\(L_2D(\cdot )\)) with a given dataset D
Depth function | Definition | Equation |
---|---|---|
Half-space mass | The expectation of probability mass of all half-spaces covering \(\mathbf{x}\) | \(\displaystyle { HM(\mathbf{x}|D) = E_{{\mathcal {H}}(\mathbf{x})}[P_D(H)]}\) |
Half-space depth | The minimum of probability mass of all half-spaces covering \(\mathbf{x}\) (Tukey 1975) | \(\displaystyle { HD(\mathbf{x}|D) = \min _{H \in {\mathcal {H}}(\mathbf{x})}[P_D(H)]}\) |
\(L_2\) depth | The reciprocal of 1 plus the average of \(L_2\) distances between \(\mathbf{x}\) and each data point in D (Mosler 2013) | \(\displaystyle {L_2D(\mathbf{x}|D) = \bigg ( 1 + \frac{1}{|D|}\sum \nolimits _{\mathbf{X} \in D} ||\mathbf{x} - \mathbf{X}||_2 \bigg )^{-1} }\) |
Medians of half-space mass, half-space depth and \(L_2\) depth and their properties
Depth function | Multivariate median | Breakdown point; median unique? | Extension across dimension | Time complexity |
---|---|---|---|---|
Half-space mass | The point \(\mathbf{x}\) which has the largest expected probability mass of all half-spaces covering \(\mathbf{x}\). | \(\frac{1}{2}\); unique | Yes | O(nt) (sample version) \(O(\psi t)\) (computation-friendly version) |
Half-space depth | The point \(\mathbf{x}\) which maximizes the minimum probability mass of all half-spaces covering \(\mathbf{x}\). | \([1/(1+d),1/3]\); Not unique (Aloupis 2006) | No | O(nt) [An implementation as in Eq. (8)] |
\(L_2\) depth | The point which minimizes the sum of Euclidean distances to all points in a given data set. | \(\frac{1}{2}\); unique (Lopuhaa and Rousseeuw 1991) | Yes | \(O(n^2)\) |
It is interesting to note the similarity between half-space mass and half-space depth: both are based on the probability mass of half-spaces. The main difference is taking the expectation rather than the minimum over the probability mass of half-spaces, which leads to the improved breakdown point and the unique median shown in Table 3.
\(L_2\) depth and half-space mass share the same four properties: concavity, a unique median that is maximally robust, and a distribution that extends across dimensions in which the data's convex hull has zero volume. The key difference is the core mechanism: one employs half-spaces and the other uses distances. Avoiding distance calculations leads directly to the advantage of half-space mass in time complexity, as shown in Table 3.
The implementation of \(L_2\) depth is straightforward: given a query point \(\mathbf{x}\), compute the average of the Euclidean distances to all points in D. The output of \(L_2D(\mathbf{x}|D)\) is computed as specified in Table 2.
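Following the formula in Table 2, this is a direct sketch (the function name is ours):

```python
import numpy as np

def l2_depth(x, D):
    """L2 depth (Table 2): reciprocal of 1 plus the average Euclidean
    distance from the query x to every point in D."""
    return 1.0 / (1.0 + np.linalg.norm(D - x, axis=1).mean())
```

Scoring every point of D this way costs O(n) distances per query, hence the \(O(n^2)\) total in Table 3.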
6 Applications of half-space mass
We demonstrate the applications of half-space mass in two tasks: anomaly detection and clustering, in the following two subsections.
6.1 Anomaly detection
The application of half-space mass to anomaly detection is straightforward since the distribution of half-space mass is concave with center-outward ranking. Once every point in the dataset has been scored, the points can be sorted; those close to the outer fringe of the distribution, i.e., those with low scores, are more likely to be anomalies.
The above property is the same for half-space depth and \(L_2\) depth. Thus, all three methods can be directly applied to anomaly detection.
6.2 Clustering
We provide a simple algorithm utilizing half-space mass in clustering. This algorithm is designed in a fashion that is similar to the K-means clustering algorithm.
Let \(\mathbf{X}_i \in D, i = 1,...,n\) denote data points in dataset D and \(Y_i \in \{1,...,K\}\) denote the cluster labels, where K is the number of clusters. Let \(G_k := \{ \mathbf{X}_i \in D : Y_i = k \}\), where \(k \in \{1,...,K\}\), denote the points in the k-th group.
K-means clustering algorithm (Jain 2010) is provided in Algorithm 4 for comparison. The K-mass algorithm and the K-means algorithm share the same algorithmic structure. They differ only in the action required in each of the two steps in the iteration process.
Note that when considering K-means as an EM (Expectation-Maximisation) algorithm (Kroese and Chan 2014), K-means implements the expectation step in line 3 and the minimisation step in lines 4–6 in Algorithm 4. Similarly, K-mass implements the expectation step in line 3 and the maximisation step in lines 4–6 in Algorithm 3.
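Algorithms 3 and 4 are not reproduced in this excerpt, so the following is a sketch of K-mass as we read it: the expectation step assigns each point to the group whose half-space mass model scores it highest, and the maximisation step refits one model per group. The farthest-point seeding, the guard for empty groups, and all names are our assumptions.

```python
import numpy as np

def train_hm(G, t=500, lam=3.0, seed=0):
    """Half-space mass model for one group (projections as in Algorithm 1,
    with psi = |G| for simplicity and a split range widened by lambda)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(t):
        ell = rng.normal(size=G.shape[1])
        proj = G @ ell
        mid = (proj.max() + proj.min()) / 2.0
        half = max((proj.max() - proj.min()) / 2.0, 1e-12)
        s = rng.uniform(mid - lam * half, mid + lam * half)
        models.append((ell, s, (proj < s).mean()))
    return models

def score_hm(x, models):
    return float(np.mean([m if x @ e < s else 1.0 - m for e, s, m in models]))

def k_mass(D, K, iters=5, seed=0):
    """Sketch of K-mass: seed K groups, then alternate an expectation step
    (assign each point to the group giving it the highest mass) and a
    maximisation step (refit each group's half-space mass model)."""
    rng = np.random.default_rng(seed)
    seeds = [0]                                   # farthest-point seeding
    for _ in range(K - 1):
        dmin = np.min([np.linalg.norm(D - D[s], axis=1) for s in seeds], axis=0)
        seeds.append(int(np.argmax(dmin)))
    labels = np.argmin([np.linalg.norm(D - D[s], axis=1) for s in seeds], axis=0)
    for _ in range(iters):
        models = []
        for k in range(K):
            Gk = D[labels == k]
            if len(Gk) < 2:                       # guard against empty groups
                Gk = D[rng.choice(len(D), size=2, replace=False)]
            models.append(train_hm(Gk, seed=int(rng.integers(1 << 30))))
        labels = np.array([int(np.argmax([score_hm(x, m) for m in models]))
                           for x in D])
    return labels
```

Only the two steps inside the loop differ from K-means; the surrounding structure is identical, as the text notes.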
7 Empirical evaluations
In this section, we conduct experiments to investigate the advantages of utilizing half-space mass in anomaly detection and clustering, first with artificial data sets and second with real datasets. In both cases, robustness is the key determinant for half-space mass to gain advantage over its contenders.
To simplify notation, we hereafter use HM and \(HM^*\) to denote the sample version (\(\psi = |D|\)) and the computation-friendly version (\(\psi \ll |D|\)) of half-space mass, respectively; HD and \(L_2D\) denote half-space depth and \(L_2\) depth.
7.1 Anomaly detection
In this section, half-space mass, half-space depth and \(L_2\) depth are used for anomaly detection. That is, given a dataset, HM is constructed as described in Algorithms 1 and 2; HD and \(L_2D\) are constructed as described in Sect. 5. Then, each of the models is used to score each point in the dataset. In all cases, points with low mass/depth scores are more likely to be anomalies. The points are then ranked by the scores produced by each model.
Area under the ROC curve (AUC) is used to measure the detection accuracy of an anomaly detector. \(AUC=1\) indicates that the anomaly detector ranks all anomalies in front of normal points; \(AUC=0.5\) indicates that the anomaly detector is a random ranker. Visualizations are used to show the impact of robustness. When comparing AUC values in the second experiment, a t-test with \(5\,\%\) significance level is conducted based on AUC values of multiple runs.
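Since low mass/depth scores flag anomalies, the AUC can be computed on the negated scores via the rank-sum (Mann–Whitney) identity, which is equivalent to the area under the ROC curve. The helper below is our own; it is not from the paper.

```python
import numpy as np

def auc_low_is_anomalous(scores, is_anomaly):
    """AUC of a detector whose LOW scores indicate anomalies, via the
    rank-sum (Mann-Whitney) identity; ties are ignored in this sketch."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_anomaly, dtype=bool)
    ranks = np.argsort(np.argsort(-scores)) + 1   # rank 1 = most normal-looking
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    u = ranks[y].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

An AUC of 1 means every anomaly received a lower score than every normal point; 0.5 corresponds to random ranking.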
The t parameter for both HM and HD is set to 5000 in the experiments, which is sufficiently large since further increases of t yield no noticeable AUC improvement. \(L_2\) depth has no parameters to set.
7.1.1 Anomaly detection with artificial data
The AUC results, presented in the first row of Fig. 5, show that both HM and \(L_2D\) performed much better than HD. In this example, all three methods failed to detect some local anomalies, but only HD also failed to detect the anomaly cluster on the right; the other two methods separated the anomaly cluster from the normal points perfectly.
The second row of the plots in Fig. 5 shows the contour maps of mass/depth values when normal points contaminated with noise were used to train the anomaly detectors; and the third row of the plots shows the contour maps when normal data points only were used to train the anomaly detectors.
The contrast between the second and third rows of plots is a testament to the impact of robustness. Being maximally robust, HM and \(L_2D\) keep their contour maps centered inside the normal cluster. In contrast, the contour map of HD is significantly stretched towards the anomaly cluster. This resulted in many clustered anomalies (on the right) being scored with depth values as high as those of many normal points, which impaired its ability to detect anomalies. Anomalies are contamination to the distribution of normal points; this example shows the poor ranking outcomes that such contamination inflicts on an anomaly detector that is not robust.
7.1.2 Anomaly detection with benchmark datasets
Here we evaluated the performance of HM, \(HM^*\), HD and \(L_2D\) in anomaly detection using nine benchmark datasets (Lichman 2013). AUC values and runtime results are shown in Table 4. The figures are the average of 10 runs except for \(L_2D\) which is a deterministic method. Boldface figures in the HM, \(HM^*\) and \(L_2\) columns indicate that the differences are significant compared to HD; while boldface figures in the HD column indicate that the differences are significant compared to any of the other methods.
In comparison with HD, both HM and \(HM^*\) have 7 wins and 2 losses, which is evidence that half-space mass performed better than HD in most datasets.
Note that HM and \(L_2D\) have similar AUC results. This is not surprising since both have the same four properties shown in Table 3.
\(HM^*\) using \(\psi =10\) performed comparably with HM in seven out of the nine data sets. This suggests that the performance of \(HM^*\) can be further improved by tuning \(\psi \).
Anomaly detection performance with the benchmark datasets, where n is data size, d is the number of dimensions, and “ano” is the percentage of anomalies
Dataset | n | d | ano (%) | AUC: HM | AUC: \(HM^*\) | AUC: HD | AUC: \(L_2\) | Runtime (s): HM | Runtime (s): \(HM^*\) | Runtime (s): HD | Runtime (s): \(L_2\)
---|---|---|---|---|---|---|---|---|---|---|---
Mulcross | 262144 | 4 | 10.00 | 1.00 | 1.00 | 0.86 | 1.00 | 30.3 | 26.3 | 30.3 | 2213.0 |
Satellite | 6435 | 36 | 31.60 | 0.61 | 0.62 | 0.57 | 0.62 | 1.1 | 0.8 | 1.2 | 11.2 |
Shuttle | 49097 | 9 | 7.15 | 0.99 | 0.99 | 0.92 | 0.99 | 5.4 | 5.3 | 5.2 | 133.5 |
Smtp | 95156 | 3 | 0.03 | 0.77 | 0.73 | 0.83 | 0.78 | 6.9 | 8.0 | 6.7 | 218.9 |
Isolet | 7797 | 617 | 3.85 | 0.82 | 0.85 | 0.68 | 0.84 | 24.9 | 13.4 | 25.0 | 229.1 |
Mfeat | 2000 | 649 | 10.00 | 0.92 | 0.93 | 0.56 | 0.92 | 5.6 | 3.3 | 5.7 | 17.8 |
Covertype | 286048 | 10 | 0.96 | 0.87 | 0.78 | 0.92 | 0.87 | 45.7 | 35.3 | 44.5 | 5251.3 |
Http | 567497 | 3 | 0.39 | 1.00 | 1.00 | 0.99 | 1.00 | 55.1 | 57.3 | 54.4 | 7794.4 |
Dbworld | 64 | 4702 | 45.31 | 0.78 | 0.78 | 0.53 | 0.79 | 2.0 | 2.1 | 2.0 | 0.1 |
Note that HD performed poorly in all three high-dimensional datasets. Our investigation suggests that as the number of dimensions increases, an increasing percentage of points appears at the outer fringe of the convex hull covering the dataset. Because HD assigns the same lowest depth value to all these points, they cannot be meaningfully ranked. This is why the AUC results of HD in these three datasets are close to 0.5, equivalent to random ranking. In a nutshell, HD is more prone to the curse of dimensionality than HM or \(L_2D\).
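The effect is easy to reproduce (our own illustration, not the paper's experiment): count how many points are extreme along at least one random projection direction, a cheap lower bound on the number of convex-hull vertices.

```python
import numpy as np

def frac_extreme(n, d, t=2000, seed=0):
    """Fraction of n standard-Gaussian points in d dimensions that are the
    minimum or maximum along at least one of t random directions -- every
    such point is a vertex of the convex hull."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    extreme = np.zeros(n, dtype=bool)
    for _ in range(t):
        proj = X @ rng.normal(size=d)
        extreme[proj.argmin()] = extreme[proj.argmax()] = True
    return extreme.mean()
```

In two dimensions only a small fraction of 100 points is extreme, while in 50 dimensions nearly every point is; HD assigns all of those points the same lowest depth value.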
The training and testing times of HM and \(HM^*\) with subsample size \(\psi = 10\)
Dataset | n | d | Training time (s): HM | Training time (s): \(HM^*\) | Testing time (s): HM | Testing time (s): \(HM^*\)
---|---|---|---|---|---|---
mulcross | 262,144 | 4 | 9.291 | 0.073 | 21.009 | 26.227 |
satellite | 6435 | 36 | 0.429 | 0.082 | 0.671 | 0.718 |
shuttle | 49,097 | 9 | 1.545 | 0.073 | 3.855 | 5.227 |
smtp | 95,156 | 3 | 1.639 | 0.071 | 5.261 | 7.929 |
isolet | 7797 | 617 | 11.953 | 0.509 | 12.947 | 12.891 |
mfeat | 2000 | 649 | 2.810 | 0.426 | 2.790 | 2.874 |
covertype | 286,048 | 10 | 15.632 | 0.080 | 30.068 | 35.220 |
http | 567,497 | 3 | 17.706 | 0.072 | 37.394 | 57.228 |
dbworld | 64 | 4702 | 1.315 | 1.370 | 0.685 | 0.730 |
In summary, half-space mass is the best anomaly detector among the three methods: it has significantly better detection accuracy than HD and runs orders of magnitude faster than \(L_2D\).
7.2 Clustering
This section reports the empirical evaluation of K-mass in comparison with K-means. The first experiment examines the three scenarios in which K-means is known to have difficulty finding all clusters: clusters of different sizes, clusters of different densities, and clusters in the presence of noise. The second experiment evaluates the clustering performance using eight real datasets (Lichman 2013; Franti et al. 2006).^{3}
In every trial on a dataset, K-mass or K-means is executed for 40 runs and we report the best clustering result. Clustering performance is measured in terms of the F-measure, and visualizations of the clustering results are presented for the two-dimensional datasets.
K-mass employs \(HM^*\) with \(\psi =5\) and \(t=2000\) as defaults in all experiments; it uses \(\lambda = 3\) in the first experiment and \(\lambda = 1.6\) in the second experiment. Recall that \(\lambda \) controls the size of the region R relative to the convex hull of the subsample. Because the subsample size is \(\psi =5\), the region must be enlarged (using \(\lambda > 1\)) in order to cover points that lie outside the subsample's convex hull. For the stopping criterion p, both K-mass and K-means use \(p=1\) in the first experiment and search for the best result with \(p=0.98\) and 1 in the second experiment.
7.2.1 Clustering with artificial data
Figures 7, 8 and 9 show the clustering results of K-mass and K-means on three artificial datasets, representing scenarios having clusters with different sizes, densities and the presence of noise, respectively.
In scenario 2, the four clusters are of equal density but of different sizes, as shown in Fig. 8. K-mass separated the four clusters well, but K-means failed to converge to the global optimum because of its tendency to split half-way between group centers.
Scenario 3 demonstrates the importance of robustness in clustering. The dataset consists of four clusters of equal size and density, with noise scattered around them. Figure 9 shows that K-mass, despite an F-measure \({<}1\) because the noise points were assigned to the nearest clusters, was able to separate the four clusters perfectly, while K-means wrongly assigned many points of the four clusters. This is because K-means is not robust against outliers, so its group centers are easily influenced by noise.
In summary, K-mass perfectly separated the four clusters in all three scenarios, while K-means failed to do so in each of them.
7.2.2 Clustering with real datasets
Clustering with real datasets
Dataset | n | d | K | K-mass | K-means | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Best F | p | time | l | Best F | p | time | l | ||||
Iris | 150 | 4 | 3 | 0.933 | 1 | 0.40 | 4 | 0.920 | 0.98 | 0.001 | 3 |
Seeds | 210 | 7 | 3 | 0.923 | 0.98 | 0.53 | 5 | 0.919 | 0.98 | 0.001 | 2 |
Column | 310 | 6 | 3 | 0.684 | 0.98 | 2.13 | 18 | 0.675 | 0.98 | 0.002 | 4 |
Banknote | 1372 | 4 | 2 | 0.725 | 0.98 | 0.59 | 4 | 0.602 | 0.98 | 0.012 | 8 |
Breast | 699 | 9 | 2 | 0.963 | 0.98 | 0.44 | 4 | 0.961 | 0.98 | 0.002 | 2 |
Dim | 1024 | 1024 | 16 | 1.000 | 1 | 29.16 | 2 | 1.000 | 1 | 0.308 | 2 |
Wdbc | 569 | 30 | 2 | 0.934 | 0.98 | 0.59 | 5 | 0.929 | 0.98 | 0.004 | 5 |
Wine | 178 | 13 | 3 | 0.944 | 0.98 | 0.86 | 8 | 0.966 | 1 | 0.002 | 4 |
8 Discussion
Mass estimation (Ting et al. 2013) was recently proposed as an alternative to density estimation in data modeling. It has significant advantages over density estimation in efficiency and/or efficacy in various data mining tasks such as anomaly detection, clustering, classification and information retrieval (Ting et al. 2013). Despite this success, the formal definition of mass is univariate only and its theoretical analysis is limited to two properties: (i) its mass distribution is concave, and (ii) its maximum mass point is equivalent to median (Ting et al. 2013).
Half-space mass can be viewed as a generalisation of univariate mass estimation to multi-dimensional spaces, and it has four properties rather than the two revealed previously. One-dimensional mass estimation is defined as the weighted probability mass (see the details in the Appendix). In one dimension, half-space splits reduce to binary splits, and half-space mass reduces to the weighted probability mass defined in Ting et al. (2013).
The two additional properties of half-space mass, i.e., maximal robustness and extension across dimension, are important in understanding the behaviour of any algorithms designed based on half-space mass, as we have shown in the empirical evaluation section.
The proof of concavity in Lemma 1 uses the same idea as the concavity proof presented by Ting et al. (2013); the other ideas in this paper are new.
Ting et al. (2013) also gave a definition of higher-level mass estimation, which can be viewed as a localised version of level-1 mass estimation. We have limited our exposition to level-1 mass estimation in this paper so as to have a direct comparison with data depth and its properties. As a result, it is limited to modeling a unimodal distribution with a unique maximum as the median. On datasets with multi-modal distributions, HM will be outperformed by existing density-based anomaly detectors. We believe that HM can be extended to higher-level mass estimation as shown in the one-dimensional case (Ting et al. 2013), which could be regarded as a localized data depth method (Agostinelli and Romanazzi 2011). We will explore higher-level mass estimation using half-space mass in the near future.
The successful application of half-space mass in K-mass implies that other data depth methods may also be applicable in K-mass. Our investigation reveals that because half-space depth can only provide its estimations within the convex hull of a given data set (i.e., the lack of the fourth property stated in Sect. 3.4), it could not be applied to K-mass. A K-mass version using \(L_2\) depth exhibits a better convergence property than K-mass. However, its performance is in general worse than both K-mass and K-means.^{4} Another drawback of \(L_2\) depth is that it is very costly to compute in large datasets.
Despite all the advantages of K-mass over K-means shown in this paper, a caveat is in order: we do not have a proof that K-mass always converges, as K-means does.
9 Conclusions
This paper makes three key contributions:
First, we propose the first formal definition of half-space mass, a significantly improved version of half-space data depth; to our knowledge, it is the only data depth method that is both robust and efficient.
Second, we reveal four theoretical properties of half-space mass: (i) half-space mass is concave in a convex region; (ii) it has a unique median; (iii) the median is maximally robust; and (iv) its estimation extends to a higher dimensional space in which the convex hull of the training data occupies zero volume.
Third, we demonstrate applications of half-space mass in two tasks: anomaly detection and clustering. In anomaly detection, it outperforms the popular half-space depth because it is more robust and able to extend across dimensions; and it runs orders of magnitude faster than \(L_2\) depth. In clustering, we introduce K-mass by using half-space mass, instead of a distance function, in the expectation and maximisation steps of K-means. We show that K-mass overcomes three weaknesses of K-means. The maximal robustness of half-space mass contributes directly to these outcomes in both tasks.
Footnotes
- 1.
We suspect that the result on the covertype dataset is due to a similar reason, but we could not visualize it because of its dimensionality.
- 2.
When comparing a fixed-size vector to a scalar in Matlab, the runtime of the comparison is not constant: it varies significantly with the value of the scalar. The closer the scalar is to the median of the numbers in the vector, the longer the comparison takes. Because \(HM^*\) uses a small subsample for projection, the split points \(s_i\) in Algorithm 1 are selected within a narrower range than if the whole dataset were used. Thus \(s_i\) lies near the median of the whole dataset more often in \(HM^*\) than in HM. As a result, the comparisons take significantly longer in \(HM^*\) than in HM in the testing stage. This effect is dampened in high-dimensional datasets, however, because high dimensionality makes the range after projection much wider, even for a small subsample. This irregularity will not occur if another programming language is used.
- 3.
- 4.
The best F-measures out of 40 runs using \(L_2\) depth in clustering on the eight datasets are: 0.947 (iris), 0.905 (seeds), 0.626 (column), 0.595 (banknote), 0.939 (breast), 1 (dim), 0.896 (wdbc), 0.943 (wine).
Acknowledgments
This project is partially supported by a grant from the U.S. Air Force Research Laboratory, under agreement # FA2386-13-1-4043, awarded to Kai Ming Ting. It is also partially supported by JSPS KAKENHI Grant Number 25240036, awarded to Takashi Washio. Bo Chen and Gholamreza Haffari are grateful to National ICT Australia (NICTA) for their generous funding, as part of the Machine Learning Collaborative Research Projects. Bo Chen is also supported by a scholarship from the Faculty of Information Technology, Monash University.
References
- Agostinelli, C., & Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference, 141(2), 817–830.
- Aloupis, G. (2006). Geometric measures of data depth. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 147–158.
- Donoho, D. L., & Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20(4), 1803–1827.
- Dutta, S., Ghosh, A. K., & Chaudhuri, P. (2011). Some intriguing properties of Tukey’s half-space depth. Bernoulli, 17(4), 1420–1434.
- Franti, P., Virmajoki, O., & Hautamaki, V. (2006). Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1875–1881.
- Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
- Kroese, D. P., & Chan, J. C. C. (2014). Statistical modeling and computation. New York: Springer.
- Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Liu, R. Y., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth: descriptive statistics, graphics and inference. The Annals of Statistics, 27(3), 783–840.
- Lopuhaa, H. P., & Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19(1), 229–248. doi:10.1214/aos/1176347978.
- Mosler, K. (2013). Depth statistics. In C. Becker, R. Fried, & S. Kuhnt (Eds.), Robustness and complex data structures. Festschrift in honour of Ursula Gather (pp. 17–34). Berlin: Springer.
- Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining (2nd ed.). Pearson Education, Ltd.
- Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2010). Mass estimation and its applications. In Proceedings of KDD’10: The 16th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 989–998).
- Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2013). Mass estimation. Machine Learning, 90(1), 127–160.
- Tukey, J. W. (1975). Mathematics and picturing data. In Proceedings of the 1975 international congress of mathematics, Vol. 2 (pp. 523–531).
- Zuo, Y., & Serfling, R. (2000). General notions of statistical depth function. Annals of Statistics, 28, 461–482.