Halfspace mass: a maximally robust and efficient data depth method
Abstract
Data depth is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. While there is a lot of academic interest, its applications are hampered by the lack of a method that is both robust and efficient. This paper introduces halfspace mass, a significantly improved version of halfspace data depth. To the best of our knowledge, halfspace mass is the only data depth method which is both robust and efficient. We also reveal four theoretical properties of halfspace mass: (i) its resultant mass distribution is concave regardless of the underlying density distribution; (ii) its maximum point is unique and can be regarded as the median; (iii) the median is maximally robust; and (iv) its estimation extends to a higher dimensional space in which the convex hull of the dataset occupies zero volume. We demonstrate the power of halfspace mass through its applications in two tasks. In anomaly detection, being a maximally robust location estimator leads directly to a robust anomaly detector that yields better detection accuracy than halfspace depth; it also runs orders of magnitude faster than \(L_2\) depth, an existing maximally robust location estimator. In clustering, the halfspace mass version of Kmeans overcomes three weaknesses of Kmeans.
Keywords
Halfspace mass · Mass estimation · Data depth · Robustness
1 Introduction
“Most important for the selection of a depth statistic in applications are the questions of computability and, depending on the data situation, robustness.” - Karl Mosler (2013)
Data depth (Liu et al. 1999) is a statistical method which models data distribution in terms of center-outward ranking rather than density or linear ranking. In 1975, Tukey (1975) proposed a way to define a multivariate median in a data cloud, known as halfspace depth or Tukey depth. Since then it has been extensively studied. Donoho and Gasko (1992) have revealed the breakdown point of the Tukey median; Zuo and Serfling (2000) have compared it to various competitors; and Dutta et al. (2011) have investigated the properties of halfspace depth. Meanwhile, the concept of data depth has been adopted for multivariate statistical analysis since it provides a nonparametric approach that does not rely on the assumption of normality (Liu et al. 1999). This paper introduces halfspace mass, a significantly improved version of halfspace depth, which has the following four properties:
 (i)
It is concave in a user-defined region that covers the source density distribution or the data cloud. An example is shown in Fig. 1.
 (ii)
It has a unique maximum point, which can be regarded as a multidimensional median.
 (iii)
Its median, which has a breakdown point equal to \(\frac{1}{2}\), is maximally robust.
 (iv)
It extends the information carried in a dataset to a higher dimensional space in which such a dataset has a zero-volume convex hull.

Its maximal robustness leads directly to better performance in anomaly detection than halfspace depth.

Compared to the existing maximally robust \(L_2\) depth, it runs orders of magnitude faster.

Compared to the distance-based Kmeans clustering method, the halfspace mass-based version overcomes three weaknesses of Kmeans (Tan et al. 2014), finding clusters of varying densities and sizes, as well as clusters in the presence of noise.
The rest of the paper is organized as follows. Section 2 introduces the formal definitions of halfspace mass as well as the proposed implementation. Sections 3 and 4 provide its theoretical properties and proofs, respectively. Section 5 discusses the relationship between halfspace mass and other data depth methods. Section 6 describes applications of halfspace mass in anomaly detection and clustering. Section 7 reports the empirical evaluations. Section 8 discusses its relation to mass estimation and Sect. 9 concludes the paper.
2 Halfspace mass
2.1 Definitions
Notations
\(\mathfrak {R}^d\)  A d-dimensional real space 
\(\ell \)  A direction in \(\mathfrak {R}^d\) 
x  A one-dimensional point in \(\mathfrak {R}\) 
\(\mathbf{x}\)  A point in \(\mathfrak {R}^d\) 
D  A dataset, where \(|D| = n\) 
\(\mathbf{X}\)  A point in D 
\({\mathcal {D}}\)  A subset of D, where \(|{\mathcal {D}}| = \psi \) 
t  Number of halfspaces sampled for estimation 
R  A convex region covering a source density F or a dataset D 
\(\lambda \)  A parameter that determines the size of R 
\(P_F(\cdot )\)  The probability mass function of a probability density distribution F 
\(P_D(\cdot )\)  The empirical probability mass function of a dataset D 
\(HM(\cdot \mid F)\)  Halfspace mass function given F 
\(HM(\cdot \mid D)\)  Halfspace mass function given D 
Let \(F(\mathbf{x})\) be a probability density on \(\mathbf{x} \in \mathfrak {R}^d\), \(d \ge 1\); \(R \subset \mathfrak {R}^d\) be a convex and closed region covering the domain of F; and H be a closed halfspace formed by separating \(\mathfrak {R}^d\) with a hyperplane that intersects R. Note that the probability mass of H computed with respect to F is \(0 \le P_F(H)=P_F(H \cap R) \le 1\).
Definition 1
Halfspace mass can be conceptualized as the expectation of the probability mass of a randomly selected halfspace H which is defined with respect to R and contains the query point \(\mathbf{x}\), given that every such halfspace is equally likely. This definition bears a certain similarity to that of halfspace depth (Tukey 1975). While halfspace depth takes the minimum of the probability mass over random halfspaces containing the query point \(\mathbf{x}\) as the depth value (see its definition in Table 2 in Sect. 5), halfspace mass takes its expectation. This key difference gives halfspace mass more desirable properties, which are discussed in Sects. 3 and 4.
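In symbols, and consistent with the empirical form given in Table 2 (Sect. 5), Definition 1 can be stated as

\[ HM(\mathbf{x} \mid F) = E_{{\mathcal {H}}(\mathbf{x})}[P_F(H)], \]

where \({\mathcal {H}}(\mathbf{x})\) denotes the set of halfspaces that contain \(\mathbf{x}\) and intersect R.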
In practice, an i.i.d. sample D is usually given instead of the source density distribution F. The sample version of \(HM(\mathbf{x} \mid F)\) is obtained by replacing F with D as follows.
Definition 2
We also propose a computation-friendly version to estimate \(HM(\mathbf{x} \mid D)\). Instead of using the whole dataset D to calculate \(P_D(H_i)\) in (1), a small subsample \({{\mathcal {D}}}_i \subset D\) of size \(|{{\mathcal {D}}}_i| = \psi \ll |D|\) is randomly selected from D without replacement for \(i = 1,\ldots ,t\). Let \(R_i\) be a convex region covering \({{\mathcal {D}}}_i\), and let \(H_i(\mathbf{x})\) be a randomly selected halfspace containing \(\mathbf{x}\) and intersecting \(R_i\), for \(i = 1,\ldots ,t\).
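One reading of the two estimators described above, in symbols (the first corresponds to Eq. (1)):

\[ HM(\mathbf{x} \mid D) = \frac{1}{t} \sum _{i=1}^{t} P_D(H_i(\mathbf{x})) \quad \text {and} \quad \widetilde{HM}(\mathbf{x} \mid D) = \frac{1}{t} \sum _{i=1}^{t} P_{{\mathcal {D}}_i}(H_i(\mathbf{x})), \]

where \(H_i(\mathbf{x})\) is the ith randomly sampled halfspace containing \(\mathbf{x}\), and \(P_{{\mathcal {D}}_i}\) is the empirical probability mass function of the subsample \({{\mathcal {D}}}_i\).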
Definition 3
2.2 Implementation
In general, halfspace mass is a concave function in R, as will be shown in Sects. 3 and 4; therefore it provides a distinct center-outward ordering in the region R, while concavity outside of R is not guaranteed.
When concavity needs to be guaranteed in a region larger than the convex hull of D, a larger R is desirable. To this end, we propose a projection-based algorithm to estimate \(HM(\mathbf{x} \mid D)\) in which the region R or \(R_i\) is determined by a size parameter \(\lambda \): the ratio of the diameters of R and the convex hull of D along every direction. The value of \(\lambda \) must be greater than or equal to 1. When \(\lambda = 1\), R or \(R_i\) is the convex hull of D or \({{\mathcal {D}}}_i\); the larger \(\lambda \) is, the further R or \(R_i\) expands beyond the convex hull of D or \({{\mathcal {D}}}_i\).
Algorithm 1 is the training procedure of \(\widetilde{HM}(\cdot \mid D)\). The halfspaces are implemented as follows: a random subsample \(\mathcal{D}_i\) is projected onto a random direction \(\ell \) in \(\mathfrak {R}^d\), t times. For each projection, a split point s is randomly selected within a range adjusted by \(\lambda \); then the numbers of points that fall on either side of s are recorded.
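To make the procedure concrete, the training and scoring steps described above can be sketched in Python (a minimal sketch, not the authors' code; the function names and the exact \(\lambda \)-adjusted split-range convention are our assumptions):

```python
import numpy as np

def train_halfspace_mass(D, t=5000, psi=None, lam=1.0, rng=None):
    """Sketch of the training step (Algorithm 1): sample t (subsample,
    direction, split) triples and record the mass on each side of each split."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = D.shape
    psi = n if psi is None else min(psi, n)
    model = []
    for _ in range(t):
        idx = rng.choice(n, size=psi, replace=False)   # subsample D_i
        ell = rng.normal(size=d)
        ell /= np.linalg.norm(ell)                     # random direction
        proj = D[idx] @ ell                            # project D_i onto ell
        lo, hi = proj.min(), proj.max()
        mid, half = (lo + hi) / 2, lam * (hi - lo) / 2
        s = rng.uniform(mid - half, mid + half)        # split in lam-adjusted range
        m_left = np.mean(proj < s)                     # mass fraction on each side
        model.append((ell, s, m_left, 1.0 - m_left))
    return model

def halfspace_mass(x, model):
    """Sketch of the scoring step (Algorithm 2): average, over the t splits,
    the mass of the halfspace that contains x."""
    score = 0.0
    for ell, s, m_left, m_right in model:
        score += m_left if x @ ell < s else m_right
    return score / len(model)
```

With \(\psi = n\) this approximates the sample version HM; with \(\psi \ll n\) it approximates the computation-friendly version \(HM^*\).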
2.3 Parameter setting
Here we provide a general guide for setting the parameters. The parameter t affects the accuracy of the estimation: the larger t is, the more accurate the estimation. For high dimensional datasets, or datasets that are significantly more elongated in some directions than in others, t should be set to a large value in order to gather sufficient information from all directions.
For the rest of this paper, we use Algorithms 1 and 2 to estimate halfspace mass. The parameter \(\lambda \) is set to 1 by default unless mentioned otherwise.
3 Properties of halfspace mass
We list four theoretical properties of halfspace mass in this section, which are concavity in region R, unique median, the median having breakdown point equal to \(\frac{1}{2}\), and extension across dimension. Proofs of the lemma and theorems stated in this section can be found in Sect. 4.
3.1 Concavity
Lemma 1
\(HM(x \mid F)\) under Definition 1 is a concave function for any finite F in any finite R in a univariate real space \(\mathfrak {R}\).
Using this lemma, we can obtain the following theorem on the concavity of the multidimensional halfspace mass distribution.
Theorem 1
\(HM(\mathbf{x} \mid F)\) under Definition 1 is a concave function for any finite F in any finite, convex and closed \(R \subset \mathfrak {R}^d\).
Similarly, \(HM(\mathbf{x} \mid D)\) is also concave in the convex region R covering D.
3.2 Unique median
Based on Theorem 1, a unique location in R which has the maximum halfspace mass value is guaranteed, as stated in the following theorem:
Theorem 2
The “center” of a given density F based on halfspace mass, \(\mathbf{x}^* := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x} \mid F)\), is a unique location in R, given that F covers an area larger than a straight line in \(\mathfrak {R}^d\).
3.3 Breakdown point
We define a location estimator based on halfspace mass as follows: \(T(D) := \mathop {{{\mathrm{arg\,max}}}}\nolimits _{\mathbf{x}} HM(\mathbf{x} \mid D)\). It is a maximally robust estimator with the property given in the following theorem:
Theorem 3
The breakdown point of T satisfies \(\epsilon (T, D) > \frac{n-1}{2n-1} \rightarrow \frac{1}{2}\) as \(n \rightarrow \infty \).
3.4 Extension across dimension
Dutta et al. (2011) reveal that, for a dataset of size n in a \(d > n\) dimensional space, the d-dimensional volume of the convex hull of the dataset is zero, so halfspace depth behaves anomalously, having 0 measure almost everywhere in \(\mathfrak {R}^d\). In such cases, halfspace depth does not carry any useful statistical information.
On the other hand, the definition of halfspace mass enables it not only to rank locations outside the convex hull of the training dataset in the lower dimensional space where this convex hull has positive volume, but also to extend the ranking of locations to a higher dimensional space where the convex hull has zero volume.
4 Proofs
This section provides the proofs for the lemma and theorems given in the last section. The proofs for Lemma 1, Theorems 1, 2 and 3 are presented in the following four subsections.
4.1 Proof of Lemma 1
4.2 Proof of Theorem 1
\(HM(\mathbf{x} \mid F,\ell )\) is equivalent to the univariate mass distribution on \(\ell \) where F is projected onto \(\ell \). Accordingly, from Lemma 1, for all \(\mathbf{x} \in R\), it is concave in the direction of \(\ell \) and constant in directions perpendicular to \(\ell \). Thus, \(HM(\mathbf{x} \mid F,\ell )\) is concave in R. Since a sum of concave functions is also concave, \(HM(\mathbf{x} \mid F)\) is concave in R.
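The final step uses the decomposition of the multidimensional halfspace mass as an expectation over directions,

\[ HM(\mathbf{x} \mid F) = E_{\ell }\left[ HM(\mathbf{x} \mid F, \ell ) \right], \]

a mixture of functions each concave in R, which is therefore concave in R.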
4.3 Proof of Theorem 2
Here we prove Theorem 2 by contradiction.
But since F covers an area larger than a straight line, there will always exist an \(\ell \) and \(\mathbf{x}\) such that \(\mathbf{x}^{\ell } \in L^{\ell }\) and \(F^{\ell }(\mathbf{x}^{\ell })>0\), which contradicts (7). Therefore, there is one unique location that has the maximum halfspace mass value in R.
4.4 Proof of Theorem 3
Suppose that, for a dataset D of size n, a contaminating set Q of size \(n-1\) is strategically chosen. Let U denote the convex hull of D, and \(U^{\ell }\) denote its projection segment on a line along direction \(\ell \), assuming U has a finite volume in \(\mathfrak {R}^d\).
For any \(\ell \), the median of the projection of \(D \cup Q\) on \(\ell \) lies within \(U^{\ell }\): if it lay outside \(U^{\ell }\), at least n of the \(2n-1\) points would be on one side of the median, contradicting the definition of the median. Since Ting et al. (2013) have shown that the univariate mass is maximised at its median, the maximum value of \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\) occurs in the segment \(U^{\ell }\) for all \(\ell \).
For a given query point \(\mathbf{x}\), let \({{\mathcal {L}}}_{\mathbf{x}}^{-} = \{\ell : \mathbf{x}^{\ell } \notin U^{\ell } \}\) denote the set of directions in \(\mathfrak {R}^d\) on which the projection of \(\mathbf{x}\) lies outside the projection of the convex hull of D, and \({{\mathcal {L}}}_{\mathbf{x}}^{+} = \{\ell : \mathbf{x}^{\ell } \in U^{\ell } \}\) denote the rest of the directions.
For any \(\ell \in {{\mathcal {L}}}^{-}_{\mathbf{x}}\), the one-dimensional mass \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\) increases as \(\mathbf{x}^{\ell }\) moves a small enough distance towards \(U^{\ell }\), since it is a concave function whose maximum occurs somewhere in the segment \(U^{\ell }\).
The location estimator T(D) is within U, the convex hull of D. If the distance between \(T(D \cup Q)\) and T(D) were infinite, then the distance between \(T(D \cup Q)\) and U would also be infinite. Thus, suppose \(\mathbf{x}^* = T(D \cup Q)\) is infinitely far away from U. Then the solid angle of U over \(\mathbf{x}^*\) is 0, so almost surely \(\ell \in {{\mathcal {L}}}^{-}_{\mathbf{x}^*}\) for all \(\ell \in \mathfrak {R}^d\), and \(HM(\mathbf{x}^* \mid D \cup Q) = E_{ {{\mathcal {L}}}^{-}_{\mathbf{x}^*}}[HM({\mathbf {x^*}}^{\ell } \mid D^{\ell } \cup Q^{\ell })]\). Any movement of finite length from \(\mathbf{x}^*\) towards U increases the one-dimensional mass values \(HM(\mathbf{x}^{\ell } \mid D^{\ell } \cup Q^{\ell })\), \(\forall \ell \in {{\mathcal {L}}}^{-}_{\mathbf{x}} \); thus it increases the mass value \(HM(\mathbf{x} \mid D \cup Q)\), which contradicts the assumption that \(HM(\mathbf{x}^* \mid D \cup Q)\) is the maximum. Therefore \(T(D \cup Q)\) can only be finitely far away from T(D) for a contaminating dataset Q of size \(n-1\).
By the same reasoning, a contaminating dataset Q of any size between 1 and \(n-1\), combined with a dataset D of size n, can only cause a finite shift of the location estimator T. Therefore \(\epsilon (T, D) > \frac{n-1}{2n-1}\).
5 Relation to other data depth methods
Definitions of halfspace mass (\(HM(\cdot )\)), halfspace depth (\(HD(\cdot )\)) and \(L_2\) depth (\(L_2D(\cdot )\)) with a given dataset D
Depth function  Definition  Equation 

Halfspace mass  The expectation of the probability mass of all halfspaces covering \(\mathbf{x}\)  \(\displaystyle { HM(\mathbf{x} \mid D) = E_{{\mathcal {H}}(\mathbf{x})}[P_D(H)]}\) 
Halfspace depth  The minimum of the probability mass of all halfspaces covering \(\mathbf{x}\) (Tukey 1975)  \(\displaystyle { HD(\mathbf{x} \mid D) = \min _{H \in {\mathcal {H}}(\mathbf{x})}[P_D(H)]}\) 
\(L_2\) depth  The reciprocal of 1 plus the average of the \(L_2\) distances between \(\mathbf{x}\) and each data point in D (Mosler 2013)  \(\displaystyle {L_2D(\mathbf{x} \mid D) = \bigg ( 1 + \frac{1}{|D|}\sum \nolimits _{\mathbf{X} \in D} \Vert \mathbf{x} - \mathbf{X}\Vert _2 \bigg )^{-1} }\) 
Medians of halfspace mass, halfspace depth and \(L_2\) depth and their properties
Depth function  Multivariate median  Breakdown point; median unique?  Extension across dimension  Time complexity 

Halfspace mass  The point \(\mathbf{x}\) which has the largest expected probability mass of all halfspaces covering \(\mathbf{x}\).  \(\frac{1}{2}\); unique  Yes  O(nt) (sample version); \(O(\psi t)\) (computation-friendly version) 
Halfspace depth  The point \(\mathbf{x}\) which maximizes the minimum probability mass of all halfspaces covering \(\mathbf{x}\).  \([1/(1+d),1/3]\); Not unique (Aloupis 2006)  No  O(nt) [An implementation as in Eq. (8)] 
\(L_2\) depth  The point which minimizes the sum of Euclidean distances to all points in a given data set.  \(\frac{1}{2}\); unique (Lopuhaa and Rousseeuw 1991)  Yes  \(O(n^2)\) 
It is interesting to note the similarity between halfspace mass and halfspace depth: both are based on the probability mass of halfspaces. The main difference is taking the expectation rather than the minimum over the probability mass of halfspaces. This leads to the improvements in breakdown point and uniqueness of the median shown in Table 3.
\(L_2\) depth and halfspace mass share the same four properties: concavity, a unique median which is maximally robust, and a distribution that extends to dimensions in which the convex hull has zero volume. The key difference is the core mechanism: one employs halfspaces and the other uses distances. Computing without distance calculations leads directly to the advantage of halfspace mass in time complexity, as shown in Table 3.
The implementation of \(L_2\) depth is straightforward: given a query point \(\mathbf{x}\), compute the average of the Euclidean distances to all points in D. The output of \(L_2D(\mathbf{x} \mid D)\) is computed as specified in Table 2.
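Following that description, \(L_2\) depth can be sketched directly (a sketch; the function name is ours):

```python
import numpy as np

def l2_depth(x, D):
    """L2 depth of query point x w.r.t. dataset D: the reciprocal of 1 plus
    the average Euclidean distance from x to every point in D (Table 2)."""
    dists = np.linalg.norm(D - x, axis=1)   # ||x - X||_2 for every X in D
    return 1.0 / (1.0 + dists.mean())
```

Scoring a single query costs O(n), so scoring every point in D costs \(O(n^2)\), in line with Table 3.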
6 Applications of halfspace mass
We demonstrate the applications of halfspace mass in two tasks: anomaly detection and clustering, in the following two subsections.
6.1 Anomaly detection
The application of halfspace mass to anomaly detection is straightforward since the distribution of halfspace mass is concave with a center-outward ranking. Once every point in the given dataset has been scored, the points can be sorted; those close to the outer fringe of the distribution, i.e., having low scores, are more likely to be anomalies.
The above property is the same for halfspace depth and \(L_2\) depth. Thus, all three methods can be directly applied to anomaly detection.
6.2 Clustering
We provide a simple algorithm utilizing halfspace mass in clustering. This algorithm is designed in a fashion that is similar to the Kmeans clustering algorithm.
Let \(\mathbf{X}_i \in D, i = 1,...,n\) denote data points in dataset D and \(Y_i \in \{1,...,K\}\) denote the cluster labels, where K is the number of clusters. Let \(G_k := \{ \mathbf{X}_i \in D : Y_i = k \}\), where \(k \in \{1,...,K\}\), denote the points in the kth group.
Kmeans clustering algorithm (Jain 2010) is provided in Algorithm 4 for comparison. The Kmass algorithm and the Kmeans algorithm share the same algorithmic structure. They differ only in the action required in each of the two steps in the iteration process.
Note that when considering Kmeans as an EM (ExpectationMaximisation) algorithm (Kroese and Chan 2014), Kmeans implements the expectation step in line 3 and the minimisation step in lines 4–6 in Algorithm 4. Similarly, Kmass implements the expectation step in line 3 and the maximisation step in lines 4–6 in Algorithm 3.
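A self-contained sketch of the Kmass iteration is given below (our own illustration, not the authors' code: a compact halfspace-mass scorer is inlined for each group, and the random initialisation and iteration cap are our assumptions; as Sect. 8 notes, Kmass is not proven to converge):

```python
import numpy as np

def fit_mass(G, t=100, lam=3.0, rng=None):
    """Fit a compact halfspace-mass model (random direction + split pairs)
    to group G; lam > 1 enlarges the split range beyond G's projected range
    so points outside G's convex hull still receive informative scores."""
    rng = np.random.default_rng() if rng is None else rng
    model = []
    for _ in range(t):
        ell = rng.normal(size=G.shape[1])
        ell /= np.linalg.norm(ell)
        proj = G @ ell
        lo, hi = proj.min(), proj.max()
        mid, half = (lo + hi) / 2, max(lam * (hi - lo) / 2, 1e-9)
        s = rng.uniform(mid - half, mid + half)
        m_left = np.mean(proj < s)
        model.append((ell, s, m_left, 1.0 - m_left))
    return model

def mass(x, model):
    """Average mass of the halfspace (side of each split) containing x."""
    return float(np.mean([ml if x @ ell < s else mr
                          for ell, s, ml, mr in model]))

def k_mass(D, K, iters=20, seed=0):
    """Kmass sketch: Kmeans' two-step structure, with the distance-to-center
    assignment replaced by argmax of per-group halfspace mass."""
    rng = np.random.default_rng(seed)
    Y = rng.integers(0, K, size=len(D))            # random initial labels
    for _ in range(iters):
        groups = [D[Y == k] for k in range(K)]
        # re-seed an (almost) empty group with random points so fitting works
        groups = [g if len(g) >= 2 else D[rng.choice(len(D), 2)]
                  for g in groups]
        models = [fit_mass(g, rng=rng) for g in groups]   # maximisation step
        Y_new = np.array([int(np.argmax([mass(x, m) for m in models]))
                          for x in D])                    # expectation step
        if np.array_equal(Y_new, Y):                      # labels stabilised
            break
        Y = Y_new
    return Y
```

The per-group models play the role of the group centers in Kmeans; no distance computation is involved in the assignment step.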
7 Empirical evaluations
In this section, we conduct experiments to investigate the advantages of utilizing halfspace mass in anomaly detection and clustering, first with artificial data sets and second with real datasets. In both cases, robustness is the key determinant for halfspace mass to gain advantage over its contenders.
To simplify notation, we hereafter use HM and \(HM^*\) to denote the sample version (\(\psi = |D|\)) and the computation-friendly version (\(\psi \ll |D|\)) of halfspace mass, respectively; HD and \(L_2D\) denote halfspace depth and \(L_2\) depth, respectively.
7.1 Anomaly detection
In this section, halfspace mass, halfspace depth and \(L_2\) depth are used for anomaly detection. That is, given a dataset, HM is constructed as described in Algorithms 1 and 2, and HD and \(L_2D\) are constructed as described in Sect. 5. Then, each of the models is used to score each point in the dataset. In all cases, points with low mass/depth scores are more likely to be anomalies. The final ranking of the points is based on the scores produced by each model.
The area under the ROC curve (AUC) is used to measure the detection accuracy of an anomaly detector. \(AUC=1\) indicates that the anomaly detector ranks all anomalies ahead of the normal points; \(AUC=0.5\) indicates that the anomaly detector is a random ranker. Visualizations are used to show the impact of robustness. When comparing AUC values in the second experiment, a t-test at the \(5\,\%\) significance level is conducted on the AUC values of multiple runs.
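For reference, the AUC can be computed directly from anomaly scores via the standard rank-sum identity (a sketch assuming no tied scores; since low mass/depth indicates anomaly, the negated mass/depth values should be passed as the anomaly scores):

```python
import numpy as np

def auc_from_scores(anomaly_scores, labels):
    """AUC via the rank-sum (Mann-Whitney) identity: the probability that a
    randomly chosen anomaly is scored higher than a randomly chosen normal
    point. Assumes no tied scores; higher score = more anomalous."""
    scores = np.asarray(anomaly_scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2)
                 / (n_pos * n_neg))
```

A detector that ranks every anomaly ahead of every normal point yields 1.0; a random ranker yields about 0.5.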
The parameter t for both HM and HD is set to 5000 in the experiments, which is sufficiently large since further increases of t yield no noticeable AUC improvement. \(L_2\) depth requires no parameter setting.
7.1.1 Anomaly detection with artificial data
The AUC results, presented in the first row of Fig. 5, show that both HM and \(L_2D\) performed much better than HD. In this example, all three methods failed to detect some local anomalies, but HD also failed to detect the anomaly cluster on the right, while the other two methods separated the anomaly cluster from the normal points perfectly.
The second row of the plots in Fig. 5 shows the contour maps of mass/depth values when normal points contaminated with noise were used to train the anomaly detectors; and the third row of the plots shows the contour maps when normal data points only were used to train the anomaly detectors.
The contrast between the second and third rows of the plots is a testament to the impact of robustness. Being maximally robust, the contour maps of HM and \(L_2D\) remain centered inside the normal cluster. In contrast, the contour map of HD is significantly stretched towards the anomaly cluster. As a result, many clustered anomalies (on the right) were scored with depth values as high as those of many normal points, impairing its ability to detect anomalies. Anomalies are contamination of the distribution of normal points, and an anomaly detector which is not robust to contamination often produces poor rankings when detecting anomalies. This example shows the impact contamination has on an anomaly detector which is not robust.
7.1.2 Anomaly detection with benchmark datasets
Here we evaluate the performance of HM, \(HM^*\), HD and \(L_2D\) in anomaly detection using nine benchmark datasets (Lichman 2013). AUC values and runtime results are shown in Table 4. The figures are the averages of 10 runs, except for \(L_2D\), which is a deterministic method. Boldface figures in the HM, \(HM^*\) and \(L_2\) columns indicate that the differences are significant compared to HD, while boldface figures in the HD column indicate that the differences are significant compared to all of the other methods.
In comparison with HD, both HM and \(HM^*\) have 7 wins and 2 losses, which is evidence that halfspace mass performed better than HD in most datasets.
Note that HM and \(L_2D\) have similar AUC results. This is not surprising since both have the same four properties shown in Table 3.
\(HM^*\) using \(\psi =10\) performed comparably to HM in seven of the nine datasets. This suggests that the performance of \(HM^*\) can be further improved by tuning \(\psi \).
Anomaly detection performance with the benchmark datasets, where n is data size, d is the number of dimensions, and “ano” is the percentage of anomalies
Dataset  n  d  ano (%)  AUC  Runtime (seconds)  

HM  \(HM^*\)  HD  \(L_2\)  HM  \(HM^*\)  HD  \(L_2\)  
Mulcross  262144  4  10.00  1.00  1.00  0.86  1.00  30.3  26.3  30.3  2213.0 
Satellite  6435  36  31.60  0.61  0.62  0.57  0.62  1.1  0.8  1.2  11.2 
Shuttle  49097  9  7.15  0.99  0.99  0.92  0.99  5.4  5.3  5.2  133.5 
Smtp  95156  3  0.03  0.77  0.73  0.83  0.78  6.9  8.0  6.7  218.9 
Isolet  7797  617  3.85  0.82  0.85  0.68  0.84  24.9  13.4  25.0  229.1 
Mfeat  2000  649  10.00  0.92  0.93  0.56  0.92  5.6  3.3  5.7  17.8 
Covertype  286048  10  0.96  0.87  0.78  0.92  0.87  45.7  35.3  44.5  5251.3 
Http  567497  3  0.39  1.00  1.00  0.99  1.00  55.1  57.3  54.4  7794.4 
Dbworld  64  4702  45.31  0.78  0.78  0.53  0.79  2.0  2.1  2.0  0.1 
Note that HD performed poorly on all three high dimensional datasets. Our investigation suggests that, as the number of dimensions increases, an increasing percentage of points appears at the outer fringe of the convex hull covering the dataset. Because HD assigns the same lowest depth value to all these points, they cannot be meaningfully ranked. This is why the AUC results of HD on these three datasets are close to 0.5, equivalent to random ranking. In a nutshell, HD is more prone to the curse of dimensionality than HM or \(L_2D\).
The training and testing times of HM and \(HM^*\) with subsample size \(\psi = 10\)
Dataset  n  d  Training time (seconds)  Testing time (seconds)  

HM  \(HM^*\)  HM  \(HM^*\)  
mulcross  262,144  4  9.291  0.073  21.009  26.227 
satellite  6435  36  0.429  0.082  0.671  0.718 
shuttle  49,097  9  1.545  0.073  3.855  5.227 
smtp  95,156  3  1.639  0.071  5.261  7.929 
isolet  7797  617  11.953  0.509  12.947  12.891 
mfeat  2000  649  2.810  0.426  2.790  2.874 
covertype  286,048  10  15.632  0.080  30.068  35.220 
http  567,497  3  17.706  0.072  37.394  57.228 
dbworld  64  4702  1.315  1.370  0.685  0.730 
In summary, halfspace mass is the best anomaly detector among the three methods: it has significantly better detection accuracy than HD and runs orders of magnitude faster than \(L_2D\).
7.2 Clustering
This section reports the empirical evaluation of Kmass in comparison with Kmeans. The first experiment examines the three scenarios in which Kmeans is known to have difficulty finding all clusters, i.e., clusters with different sizes, clusters with different densities, and the presence of noise. The second experiment evaluates the clustering performance on eight real datasets (Lichman 2013; Franti et al. 2006).^{3}
In every trial on a dataset, Kmass or Kmeans is executed for 40 runs and we report the best clustering result. The clustering performance is measured in terms of F-measure, and visualizations of the clustering results are presented for two-dimensional datasets where possible.
Kmass employs \(HM^*\) with \(\psi =5\) and \(t=2000\) as defaults in all experiments; it uses \(\lambda = 3\) in the first experiment and \(\lambda = 1.6\) in the second. Recall that \(\lambda \) controls the size of the region covering the dataset. Because the subsample size is \(\psi =5\), the convex hull must be enlarged (using \(\lambda > 1\)) in order to cover points which lie outside it. For the stopping criterion p, both Kmass and Kmeans use \(p=1\) in the first experiment and search for the best result with \(p=0.98\) and 1 in the second experiment.
7.2.1 Clustering with artificial data
Figures 7, 8 and 9 show the clustering results of Kmass and Kmeans on three artificial datasets, representing scenarios having clusters with different sizes, densities and the presence of noise, respectively.
In scenario 2, the four clusters are of equal density but different sizes, as shown in Fig. 8. Kmass separated the four clusters well, but Kmeans failed to converge to the global optimum because of its tendency to split halfway between group centers.
Scenario 3 demonstrates the importance of robustness in clustering. The dataset consists of four clusters of equal size and density, with noise scattered around the four clusters. Figure 9 shows that Kmass, despite having an F-measure \({<}1\) because the noise points were assigned to the nearest clusters, separated the four clusters perfectly, while Kmeans wrongly assigned many points of the four clusters. This is because Kmeans is not robust to outliers, so the group centers are easily influenced by noise.
In summary, Kmass perfectly separated the four clusters in all three scenarios, while Kmeans failed to do so.
7.2.2 Clustering with real datasets
Clustering with real datasets
Dataset  n  d  K  Kmass  Kmeans  

Best F  p  time  l  Best F  p  time  l  
Iris  150  4  3  0.933  1  0.40  4  0.920  0.98  0.001  3 
Seeds  210  7  3  0.923  0.98  0.53  5  0.919  0.98  0.001  2 
Column  310  6  3  0.684  0.98  2.13  18  0.675  0.98  0.002  4 
Banknote  1372  4  2  0.725  0.98  0.59  4  0.602  0.98  0.012  8 
Breast  699  9  2  0.963  0.98  0.44  4  0.961  0.98  0.002  2 
Dim  1024  1024  16  1.000  1  29.16  2  1.000  1  0.308  2 
Wdbc  569  30  2  0.934  0.98  0.59  5  0.929  0.98  0.004  5 
Wine  178  13  3  0.944  0.98  0.86  8  0.966  1  0.002  4 
8 Discussion
Mass estimation (Ting et al. 2013) was recently proposed as an alternative to density estimation in data modeling. It has significant advantages over density estimation in efficiency and/or efficacy in various data mining tasks such as anomaly detection, clustering, classification and information retrieval (Ting et al. 2013). Despite this success, the formal definition of mass is univariate only and its theoretical analysis is limited to two properties: (i) its mass distribution is concave, and (ii) its maximum mass point is equivalent to median (Ting et al. 2013).
Halfspace mass can be viewed as a generalisation of univariate mass estimation to multidimensional spaces, and it has four properties rather than the two revealed previously. One-dimensional mass estimation is defined as the weighted probability mass (see the details in the Appendix). In one-dimensional space, halfspace splits reduce to binary splits, and halfspace mass reduces to the weighted probability mass defined in Ting et al. (2013).
The two additional properties of halfspace mass, i.e., maximal robustness and extension across dimension, are important in understanding the behaviour of any algorithms designed based on halfspace mass, as we have shown in the empirical evaluation section.
The proof of concavity in Lemma 1 uses the same idea as the concavity proof presented by Ting et al. (2013). The other ideas in this paper are new.
Ting et al. (2013) also gave a definition of higher-level mass estimation, which can be viewed as a localised version of level-1 mass estimation. We have limited our exposition to level-1 mass estimation in this paper so that there is a direct comparison with data depth and its properties. As a result, it is limited to modeling data with a unimodal distribution, which has a unique maximum as the median. On datasets with multimodal distributions, HM will be outperformed by existing density-based anomaly detectors. We believe that HM can be extended to higher-level mass estimation, as shown in the one-dimensional case (Ting et al. 2013), which could be regarded as a localized data depth method (Agostinelli and Romanazzi 2011). We will explore higher-level mass estimation using halfspace mass in the near future.
The successful application of halfspace mass in Kmass implies that other data depth methods may also be applicable in Kmass. Our investigation reveals that because halfspace depth can only provide estimations within the convex hull of a given dataset (i.e., it lacks the fourth property stated in Sect. 3.4), it cannot be applied to Kmass. A Kmass version using \(L_2\) depth exhibits a better convergence property than Kmass; however, its performance is generally worse than both Kmass and Kmeans.^{4} Another drawback of \(L_2\) depth is that it is very costly to compute on large datasets.
Despite all the advantages of Kmass over Kmeans shown in this paper, a caveat is in order here: we do not have a proof that Kmass will always converge like Kmeans.
9 Conclusions
This paper makes three key contributions:
First, we propose the first formal definition of halfspace mass, a significantly improved version of halfspace data depth; to the best of our knowledge, it is the only data depth method which is both robust and efficient.
Second, we reveal four theoretical properties of halfspace mass: (i) halfspace mass is concave in a convex region; (ii) it has a unique median; (iii) the median is maximally robust; and (iv) its estimation extends to a higher dimensional space in which the training data's convex hull has zero volume.
Third, we demonstrate applications of halfspace mass in two tasks: anomaly detection and clustering. In anomaly detection, it outperforms the popular halfspace depth because it is more robust and can provide estimates beyond the convex hull of the training data; and it runs orders of magnitude faster than \(L_2\) depth. In clustering, we introduce K-mass by using halfspace mass, instead of a distance function, in the expectation and maximisation steps of K-means. We show that K-mass overcomes three weaknesses of K-means. The maximal robustness of halfspace mass contributes directly to these outcomes in both tasks.
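The K-mass alternation just described can be sketched as follows. This is a hedged toy implementation, not the paper's algorithm: `mass_score` is a simplified stand-in for halfspace mass (random directions with a \(\lambda\)-extended split range), and each cluster's "model" is simply its current member set rather than a trained ensemble of halfspaces.

```python
import numpy as np

def mass_score(D, q, n_trials=200, lam=1.0, rng=None):
    # Toy halfspace-mass score of query q w.r.t. point set D: average,
    # over random directions, of the fraction of D on q's side of a
    # split point drawn from a lambda-extended projected range.
    rng = rng if rng is not None else np.random.default_rng(0)
    total = 0.0
    for _ in range(n_trials):
        u = rng.normal(size=D.shape[1])
        u /= np.linalg.norm(u)
        p, pq = D @ u, q @ u
        mid = (p.min() + p.max()) / 2.0
        half = max((p.max() - p.min()) / 2.0, 1e-9)
        s = rng.uniform(mid - (1 + lam) * half, mid + (1 + lam) * half)
        total += np.mean(p < s) if pq < s else np.mean(p >= s)
    return total / n_trials

def kmass(X, k, n_iters=10, seed=0):
    # K-means-style alternation: the "expectation" step assigns each
    # point to the cluster under whose member set it has the highest
    # mass score; the "maximisation" step simply updates each cluster's
    # member set. Note: convergence is not guaranteed in general.
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))
    for _ in range(n_iters):
        clusters = [X[labels == j] for j in range(k)]
        new = np.array([
            int(np.argmax([mass_score(C, x, rng=rng) if len(C) else -np.inf
                           for C in clusters]))
            for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

The design point this sketch illustrates is the substitution itself: wherever K-means compares squared distances to centroids, K-mass compares mass scores under per-cluster models, so no explicit centroid (and no distance function) is needed.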
Footnotes
 1.
We suspect that the result on the covertype dataset is due to a similar reason, but we could not visualize it because of its high dimensionality.
 2.
When comparing a fixed-size vector to a scalar in Matlab, the runtime of the comparison is not constant: it varies significantly with the value of the scalar. The closer the scalar is to the median of the numbers in the vector, the longer the comparison takes. Because \(HM^*\) uses a small subsample for projection, the split points \(s_i\) in Algorithm 1 are selected within a narrower range than if the whole dataset were used. Thus \(s_i\) lies near the median of the whole dataset more often in \(HM^*\) than in HM. As a result, the comparisons take significantly longer in \(HM^*\) than in HM in the testing stage. This effect is dampened in high-dimensional datasets, however, because the high dimensionality makes the projected range much wider, even for a small subsample. This irregularity would not occur if another programming language were used.
 3.
 4.
The best F-measures out of 40 runs using \(L_2\) depth in clustering on the eight datasets are: 0.947 (iris), 0.905 (seeds), 0.626 (column), 0.595 (banknote), 0.939 (breast), 1 (dim), 0.896 (wdbc), 0.943 (wine).
Acknowledgments
This project is partially supported by a grant from the U.S. Air Force Research Laboratory, under agreement # FA23861314043, awarded to Kai Ming Ting. It is also partially supported by JSPS KAKENHI Grant Number 25240036, awarded to Takashi Washio. Bo Chen and Gholamreza Haffari are grateful to National ICT Australia (NICTA) for their generous funding, as part of the Machine Learning Collaborative Research Projects. Bo Chen is also supported by a scholarship from the Faculty of Information Technology, Monash University.
References
 Agostinelli, C., & Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference, 141(2), 817–830.
 Aloupis, G. (2006). Geometric measures of data depth. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 147–158.
 Donoho, D. L., & Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Annals of Statistics, 20(4), 1803–1827.
 Dutta, S., Ghosh, A. K., & Chaudhuri, P. (2011). Some intriguing properties of Tukey's halfspace depth. Bernoulli, 17(4), 1420–1434.
 Fränti, P., Virmajoki, O., & Hautamäki, V. (2006). Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1875–1881.
 Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
 Kroese, D. P., & Chan, J. C. C. (2014). Statistical modeling and computation. New York: Springer.
 Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
 Liu, R. Y., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference. The Annals of Statistics, 27(3), 783–840.
 Lopuhaä, H. P., & Rousseeuw, P. J. (1991). Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. Annals of Statistics, 19(1), 229–248. doi:10.1214/aos/1176347978.
 Mosler, K. (2013). Depth statistics. In C. Becker, R. Fried, & S. Kuhnt (Eds.), Robustness and complex data structures: Festschrift in honour of Ursula Gather (pp. 17–34). Berlin: Springer.
 Tan, P.-N., Steinbach, M., & Kumar, V. (2014). Introduction to data mining (2nd ed.). Pearson Education, Ltd.
 Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2010). Mass estimation and its applications. In Proceedings of KDD'10: The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 989–998).
 Ting, K. M., Zhou, G.-T., Liu, F., & Tan, J. S. C. (2013). Mass estimation. Machine Learning, 90(1), 127–160.
 Tukey, J. W. (1975). Mathematics and picturing data. In Proceedings of the International Congress of Mathematicians, Vol. 2 (pp. 523–531).
 Zuo, Y., & Serfling, R. (2000). General notions of statistical depth function. Annals of Statistics, 28, 461–482.