Abstract
This paper introduces mass estimation—a base modelling mechanism that can be employed to solve various tasks in machine learning. We present the theoretical basis of mass and efficient methods to estimate mass. We show that mass estimation solves problems effectively in tasks such as information retrieval, regression and anomaly detection. The models that use mass in these three tasks perform at least as well as, and often better than, eight state-of-the-art methods in terms of task-specific performance measures. In addition, mass estimation has constant time and space complexities.
Keywords
Mass estimation · Density estimation · Information retrieval · Regression · Anomaly detection
1 Introduction
‘Estimation of densities is a universal problem of statistics (knowing the densities one can solve various problems).’ —Vapnik (2000).
Density estimation has been the base modelling mechanism used in many techniques designed for tasks such as classification, clustering, anomaly detection and information retrieval. For example, in classification, density estimation is employed to estimate the class-conditional density function (or likelihood function) p(x|j) or the posterior probability p(j|x)—the principal function underlying many classification methods, e.g., mixture models, Bayesian networks and Naive Bayes. Examples of density estimation include kernel density estimation, k-nearest neighbours density estimation, maximum likelihood procedures and Bayesian methods.
Ranking data points in a given data set in order to differentiate core points from fringe points in a data cloud is fundamental in many tasks, including anomaly detection and information retrieval. Anomaly detection aims to rank anomalous points higher than normal points; information retrieval aims to rank points similar to a query higher than dissimilar points. Many existing methods (e.g., Bay and Schwabacher 2003; Breunig et al. 2000; Zhang and Zhang 2006) have employed density to provide the ranking; but density estimation is not designed to provide a ranking.

A mass distribution stipulates an ordering from core points to fringe points in a data cloud. In addition, this ordering accentuates the fringe points with a concave function derived from data, resulting in fringe points having markedly smaller mass than points close to the core points.

Mass estimation is more efficient than density estimation because mass is computed by simple counting, and it requires only a small sample through an ensemble approach. Density estimation (often used to estimate p(x|j) and p(j|x)) requires a large sample size in order to produce a good estimate, and it is computationally expensive in terms of time and space complexities (Duda et al. 2001).
Mass estimation has two advantages in relation to efficacy and efficiency. First, the concavity property mentioned above ensures that fringe points are ‘stretched’ to be farther from the core points in a mass space—making it easier to separate fringe points from those points close to core points. This property in mass space can then be exploited by a machine learning algorithm to achieve a better result for the intended task than applying the same algorithm in the original space without this property. We show the efficacy of mass in improving the task-specific performance of four existing state-of-the-art algorithms in information retrieval and regression tasks. The significant improvements are achieved through a simple mapping from the original space to a mass space using the mass estimation mechanism introduced in this paper.
Second, mass estimation offers to solve a ranking problem more efficiently, using the ordering derived from the data directly—without expensive distance (or related) calculations. An example of inefficient application is in anomaly detection tasks, where many methods have employed distance or density to provide the required ranking. LOF (Breunig et al. 2000), an existing state-of-the-art density-based anomaly detector with quadratic time complexity, takes more than five hours to complete a job involving half a million data points; yet the mass-based anomaly detector we introduce here completes it in less than 20 seconds! Section 6.3 provides the details of this example.
The rest of the paper is organised as follows. Section 2 introduces mass and mass estimation, together with their theoretical properties. We also describe methods for one-dimensional mass estimation. We extend one-dimensional mass estimation to multidimensional mass estimation in Sect. 3. We provide an implementation of multidimensional mass estimation in Sect. 4. Section 5 describes a mass-based formalism which serves as a basis for applying mass to different data mining tasks. We realise the formalism in three different tasks: information retrieval, regression and anomaly detection, and report the empirical evaluation results in Sect. 6. The relations to kernel density estimation, data depth and other related work are described in Sects. 7, 8 and 9, respectively. We provide conclusions and suggest future work in the last section.
2 Mass and mass estimation
Data mass or mass, in its simplest form, is defined as the number of points in a region. Any two groups of data in the same domain have the same mass if they have the same number of points, regardless of the characteristics of the regions they occupy (e.g., density, shape or volume). Mass in a given region is thus defined by a rectangular function which has the same value for the entire region in which the mass is measured.
To estimate the mass for a point, and thus the mass distribution of a given data set, a more sophisticated form is required. The intuition is based on the simplest form described above, but multiple (overlapping) regions covering a point are generated. The mass for the point is then derived from an average of the masses from all regions covering the point. We show two ways to define these regions. The first is to generate all possible regions through binary splits from the given data points; the second is to generate random axis-parallel regions within the confines of the space covered by a data sample. The first is described in this section and the second in Sect. 3.
Each region can be defined at multiple levels, where a higher-level region covering a point has a smaller volume than that of a lower-level region covering the same point. We show that the mass distribution has special properties: (i) the mass distribution defined by level-1 regions is a concave function which has the maximum mass at the centre of the data cloud, irrespective of its density distribution, including uniform and U-shaped distributions; and (ii) higher-level regions are required to model multimodal mass distributions.
Note that mass is not a probability mass function, and it does not provide a probability, as the probability density function does through integration.
Symbols and notations
\(\mathcal{R}^{u}\)  A real domain of u dimensions 
x  A one-dimensional instance in \(\mathcal{R}\) 
x  An instance in \(\mathcal{R}^{u}\) 
D  A data set of x, where |D|=n 
\(\mathcal{D}\)  A subset of D, where \(|\mathcal{D}| = \psi\) 
z  An instance in \(\mathcal{R}^{t}\) 
D′  A data set of z 
c  The ensemble size used to estimate mass 
h  Level of mass distribution 
t  Number of mass distributions in \(\widetilde{\mathbf{mass}}(\cdot )\) 
m _{ i }(⋅)  Mass base function defined using binary split s _{ i } 
mass(⋅)  Mass function which returns a real value in one-dimensional mass space 
\(\widetilde{\mathbf{mass}}(\cdot)\)  Mass function which returns a vector of t values in t-dimensional mass space 
2.1 Mass distribution estimation
In this section, we first show in Sect. 2.1.1 a mass distribution estimation that uses binary splits in the one-dimensional setting, where each binary split separates the one-dimensional space into two non-empty regions. In Sect. 2.1.2, we then generalise the treatment using multiple levels of binary splits.
2.1.1 Mass distribution estimation using binary splits
Here, we employ a binary split to divide the data set into two separate regions and compute the mass in each region. The mass distribution at point x is estimated to be the sum of all ‘weighted’ masses from regions occupied by x, as a result of n−1 binary splits for a data set of size n.
Let x _{1}<x _{2}<⋯<x _{ n−1}<x _{ n } on the real line,^{1} \(x_{i} \in\mathcal{R}\) and n>1. Let s _{ i } be the binary split between x _{ i } and x _{ i+1}, yielding two non-empty regions with masses \(m_{i}^{L}\) and \(m_{i}^{R}\).
Definition 1
The mass base function \(m_{i}(x)\), defined w.r.t. the binary split \(s_{i}\), is \(m_{i}(x) = m_{i}^{L} = i\) if x is on the left of \(s_{i}\), and \(m_{i}(x) = m_{i}^{R} = n-i\) if x is on the right of \(s_{i}\).
Definition 2
The mass of a point \(x_{a}\) is the sum of the weighted masses from all n−1 binary splits: \(\mathit{mass}(x_{a}) = \sum_{i=1}^{n-1} m_{i}(x_{a})\,p(s_{i})\).  (1)
Example
For a given data set, p(s _{ i }) can be estimated on the real line as p(s _{ i })=(x _{ i+1}−x _{ i })/(x _{ n }−x _{1})>0, as a result of a random selection of splits based on a uniform distribution.^{2}
For a point x∉{x _{1},x _{2},…,x _{ n−1},x _{ n }}, mass(x) is defined as an interpolation between two masses of adjacent points x _{ i } and x _{ i+1}, where x _{ i }<x<x _{ i+1}.
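The level-1 estimate can be sketched directly from these definitions. The following minimal Python sketch (our own illustration, not the authors' code) sums, over the n−1 binary splits, the mass of the region containing x weighted by p(s_i), interpolating when x falls inside a split interval as described above.

```python
def level1_mass(points, x):
    """Level-1 mass of x w.r.t. a one-dimensional data set (a sketch).

    mass(x) = sum_i m_i(x) * p(s_i), where split s_i lies between the i-th
    and (i+1)-th sorted points and p(s_i) is proportional to the gap.
    """
    s = sorted(points)
    n = len(s)
    span = s[-1] - s[0]
    total = 0.0
    for i in range(1, n):                  # split s_i separates s[:i] from s[i:]
        gap = s[i] - s[i - 1]
        if x <= s[i - 1]:
            m = i                          # x is in the left region (i points)
        elif x >= s[i]:
            m = n - i                      # x is in the right region (n - i points)
        else:
            # x lies inside the split interval: a uniformly placed split puts x
            # on the left with probability (s[i] - x) / gap, so interpolate
            m = (i * (s[i] - x) + (n - i) * (x - s[i - 1])) / gap
        total += m * gap / span            # weight by p(s_i) = gap / span
    return total
```

On the uniformly spaced points 1..5 this gives mass 3.5 at the median and 2.5 at both fringes, illustrating the concavity claimed below even under a uniform density.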
Theorem 1
Proof
Theorem 2
mass(x _{ a }) is a concave function defined w.r.t. {x _{1},x _{2},…,x _{ n }}, when p(s _{ i })=(x _{ i+1}−x _{ i })/(x _{ n }−x _{1}) for n>2.
Proof
We only need to show that the gradient at x _{ a } is non-increasing, i.e., g(x _{ a }) ≥ g(x _{ a+1}) for each a.
Corollary 1
A mass distribution estimated using binary splits stipulates an ordering, based on mass, of the points in a data cloud from x _{ n/2} (with the maximum mass) to the fringe points (with the minimum mass on either side of x _{ n/2}), irrespective of the density distribution, including the uniform density distribution.
Corollary 2
The concavity of mass distribution stipulates that fringe points have markedly smaller mass than points close to x _{ n/2}.
The implication from Corollary 2 is that fringe points are ‘stretched’ to be farther away from the median in a mass space than in the original space—making it easier to separate fringe points from those points close to the median. The mass space is mapped from the original space through mass(x). This property in mass space can then be exploited by a machine learning algorithm to achieve a better result for the intended task than applying the same algorithm in the original space without this property. We will show that this simple mapping significantly improves the performance of four existing algorithms in information retrieval and regression tasks in Sects. 6.1 and 6.2.
Equation (1) is sufficient to provide a mass distribution corresponding to a unimodal density function or a uniform density function. To better estimate multimodal mass distributions, multiple levels of binary splits need to be carried out. This is provided in the following.
2.1.2 Level-h mass distribution estimation
If we treat the mass estimation defined in the last subsection as level-1 estimation, then level-h estimation can be viewed as localised versions of the basic level-1 estimation.
Definition 3
Here a high-level mass distribution is computed recursively by using the mass distributions obtained at lower levels. A binary split s _{ i } in a level-h (h>1) mass distribution produces two level-(h−1) mass distributions: (a) \(\mathit{mass}_{i}^{L}(x,h{-}1)\)—the mass distribution on the left of split s _{ i }, defined using {x _{1},…,x _{ i }}; and (b) \(\mathit{mass}_{i}^{R}(x,h{-}1)\)—the mass distribution on the right, defined using {x _{ i+1},…,x _{ n }}. Equation (1) is the mass distribution at level-1.
As a result, only the mass for the first point x _{1} needs to be computed using (3). Note that it is more efficient to compute the mass distribution from the above equation, which has time complexity O(n ^{ h+1}); the computation using (3) has complexity O(n ^{ h+2}).
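The recursion just described can be sketched as follows. This is an illustrative Python implementation under two simplifying assumptions not spelled out in the text: a subset with fewer than two points contributes its size as mass, and a query falling strictly inside a split interval is assigned to the right-hand region.

```python
def level1_mass(points, x):
    # level-1 estimate: sum of region masses weighted by p(s_i)
    s = sorted(points)
    n = len(s)
    span = s[-1] - s[0]
    total = 0.0
    for i in range(1, n):
        gap = s[i] - s[i - 1]
        if x <= s[i - 1]:
            m = i
        elif x >= s[i]:
            m = n - i
        else:
            m = (i * (s[i] - x) + (n - i) * (x - s[i - 1])) / gap
        total += m * gap / span
    return total

def levelh_mass(points, x, h):
    """Level-h mass: each split s_i recursively spawns a level-(h-1)
    estimation on the sub-sample on x's side of the split (a sketch)."""
    s = sorted(points)
    n = len(s)
    if n <= 1:                   # assumption: a tiny subset contributes its size
        return float(n)
    if h <= 1:
        return level1_mass(s, x)
    span = s[-1] - s[0]
    total = 0.0
    for i in range(1, n):        # split s_i lies between s[i-1] and s[i]
        p = (s[i] - s[i - 1]) / span
        side = s[:i] if x <= s[i - 1] else s[i:]   # mass_i^L or mass_i^R
        total += levelh_mass(side, x, h - 1) * p
    return total
```

The O(n^{h+1}) cost is visible in the sketch: each of the n−1 splits spawns a recursive call on a subset of up to n−1 points.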
Definition 4
A level-h mass distribution stipulates an ordering of the points in a data cloud from α-core points to the fringe points. Let the α-neighbourhood of a point x be defined as N _{ α }(x)={y∈D ∣ dist(x,y)≤α} for some distance function dist(⋅,⋅). Each α-core point x ^{∗} in a data cloud has the highest mass value ∀x∈N _{ α }(x ^{∗}). A small α defines local core point(s); and a large α, which covers the entire value range for x, defines global core point(s).
For an h>1 mass distribution, though there is no longer a guarantee of a concave function as a whole, our simulation shows that each cluster within the data cloud (if clusters exist) exhibits a concave function, which becomes more distinct (as a concave function) as h increases. An example is shown in Fig. 3(b), which has a trimodal density distribution. Notice that the h>1 mass distributions have three α-core points for some α, e.g., 0.2. Other examples are shown in Figs. 3(c) and 3(d).
Traditionally, one can estimate the coreness or fringeness of non-uniformly distributed data to some degree by using density or distance (but not under a uniform density distribution). Mass allows one to do this for any distribution without density or distance calculation—the key computational expense in all methods that employ them. For example, in Fig. 3(c), which has a skewed density distribution, the distinction between near fringe points and far fringe points is less obvious using density, unless distances are computed to reveal the difference. In contrast, the mass distribution depicts the relative distance from x _{ median } using the fringe points’ mass values, without further calculation.
Figure 3(d) shows an example with clustered anomalies which are denser than the normal points (shown in the bigger cluster on the left of the figure). Anomaly detection based on density will identify all these clustered anomalies as more ‘normal’ than the normal points, because anomalies are defined as points having low density. In sharp contrast, h=1 mass estimation will correctly rank them as anomalies: they have the third lowest mass values and are interpreted as points at the fringe of the data cloud of normal points, which have higher mass values.
This section has described the properties of mass distribution from a theoretical perspective. Though it is possible to estimate mass distributions using (1) and (3), these estimations are limited by their high computational cost. We present a practical mass estimation method in the next subsection. We use the terms ‘mass estimation’ and ‘mass distribution estimation’ interchangeably hereafter.
2.2 Practical one-dimensional level-h mass estimation
Here we devise an approximation to (3) using random subsamples from a given data set.
Definition 5
\(\mathit{mass}(x,h \mid \mathcal{D})\) is the approximate mass distribution for a point \(x \in\mathcal{R}\), defined w.r.t. \(\mathcal{D} = \{x_{1},\ldots, x_{\psi}\}\), where \(\mathcal{D}\) is a random subset of the given data set D, \(\psi \ll |D|\), and h<ψ.
Only relative, not absolute, mass is required to provide an ordering between instances. For h=1, the relative mass is defined w.r.t. the median, and the median is a robust estimator (Aloupis 2006); that is why small subsamples produce a good estimator for ordering. While this reasoning cannot be applied to h>1 (or to the multidimensional mass estimation discussed in the next section) because the notion of median is undefined there, our empirical results in Sect. 6 show that all these mass estimations using small subsamples produce good results.
To show the relative performance of mass(x,1) and \(\mathit{mass}(x,1 \mid \mathcal{D})\), we compare the ordering results based on mass values in two separate data sets: a one-dimensional Gaussian density distribution and the COREL data set; each data set has 10000 data points. Figure 4(b) shows the correlation (in terms of Spearman’s rank correlation coefficient) between the orderings provided by mass(x,1) using the entire data set and \(\mathit{mass}(x,1 \mid \mathcal{D})\) using ψ=8. They achieve very high correlations when c≥100 for both data sets.
The ability to use a small sample, rather than a large sample, is a key characteristic of mass estimation.
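The subsample-and-ensemble scheme of Definition 5 and the rank-correlation check of Fig. 4(b) can be reproduced in miniature. The sketch below is our own assumption-laden illustration, not the authors' code: it averages the level-1 estimate over c=100 random subsamples of size ψ=8 and compares the resulting ordering with the full-data ordering via a hand-rolled Spearman coefficient.

```python
import random

def level1_mass(points, x):
    # level-1 mass estimate, as sketched earlier in Sect. 2.1.1
    s = sorted(points)
    n = len(s)
    span = s[-1] - s[0]
    total = 0.0
    for i in range(1, n):
        gap = s[i] - s[i - 1]
        if x <= s[i - 1]:
            m = i
        elif x >= s[i]:
            m = n - i
        else:
            m = (i * (s[i] - x) + (n - i) * (x - s[i - 1])) / gap
        total += m * gap / span
    return total

def approx_mass(x, subsamples):
    # mass(x, 1 | D_j) averaged over the c subsamples in the ensemble
    return sum(level1_mass(sub, x) for sub in subsamples) / len(subsamples)

def spearman(a, b):
    # Spearman's rank correlation (no ties expected for continuous data)
    def ranks(v):
        r = [0] * len(v)
        for rank, i in enumerate(sorted(range(len(v)), key=v.__getitem__)):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    va = sum((p - ma) ** 2 for p in ra)
    vb = sum((q - mb) ** 2 for q in rb)
    return cov / (va * vb) ** 0.5

random.seed(0)
data = [random.gauss(0, 1) for _ in range(500)]
subsamples = [random.sample(data, 8) for _ in range(100)]   # psi = 8, c = 100
exact = [level1_mass(data, x) for x in data]
approx = [approx_mass(x, subsamples) for x in data]
rho = spearman(exact, approx)
```

The correlation `rho` should approach 1 as c grows, mirroring the high correlations reported for c ≥ 100.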
3 Multidimensional mass estimation
Ting and Wells (2010) describe a way to generalise the one-dimensional mass estimation we have described in the last section. We reiterate the approach in this section, but the implementation we employ (to be described in Sect. 4) differs. Section 9 provides the details of these differences.
The approach proposed by Ting and Wells (2010) eliminates the need to compute the probability of a binary split, p(s _{ i }); and it gives rise to randomised versions of (1), (3) and (5).
The idea is to generate multiple random regions which cover a point; the mass for that point is then estimated by averaging the masses from all those regions. We show that random regions can be generated using axis-parallel splits called half-space splits. Each half-space split is performed on a randomly selected attribute in a multidimensional feature space. For an h-level split, each half-space split is carried out h times recursively along every path in a tree structure. Each h-level (axis-parallel) split generates 2^{ h } non-overlapping regions. Multiple h-level splits are used to estimate mass for each point in the feature space.
The multidimensional mass estimation requires two functions. First, it needs a function that generates random regions covering each point in the feature space. This function is a generalisation of the binary split into half-space splits, or 2^{ h }-region splits when h levels of half-space splits are used. Second, a generalised version of the mass base function is used to define mass in a region. The formal definition follows.
Let x be an instance in \(\mathcal{R}^{u}\). Let T ^{ h }(x) be one of the 2^{ h } regions in which x falls; T ^{ h }(⋅) is generated from the given data set D, and \(T^{h}(\cdot \mid \mathcal{D})\) is generated from \(\mathcal{D} \subset D\); and let m be the number of training instances in the region.
Here every \(T_{k}^{h}\) is generated randomly with equal probability. Note that p(s _{ i }) in (1) has the same assumption.
Figure 5(b) shows the contour map for h=32 on the same data set. It demonstrates that multidimensional mass estimation can use a high level h to model a multimodal distribution.
We show in Sect. 6 that both \(\mathit{mass}(x,h \mid \mathcal{D})\) and \(\mathbf{m}(T^{h}(\mathbf{x} \mid \mathcal{D}))\) (in (5) and (9), respectively) can be employed effectively for three different tasks: information retrieval, regression and anomaly detection, through the mass-based formalism described in Sect. 5. We shall describe the implementation of multidimensional mass estimation in the next section.
4 Half-Space Trees for mass estimation
This section describes the implementation of T ^{ h } using Half-Space Trees. Two variants are provided. We have used the second variant of Half-Space Tree to implement the multidimensional mass estimation.
4.1 Half-Space Tree
The binary half-space split ensures that every split produces two equal-size half-spaces, each containing exactly half of the mass before the split under a uniform mass distribution. This characteristic enables us to compute the relationship between any two regions easily. For example, the mass in every region shown in Fig. 6(a) is the same, and it is equivalent to the original mass divided by 2^{3} because three levels of binary half-space splits have been applied. A deviation from the uniform mass distribution allows us to rank the regions based on mass. Figure 6(b) provides such an example, in which a ranking of regions based on mass provides an ordering of the degrees of anomaly in each region.
Definition 6
A Half-Space Tree is a binary tree in which each internal node makes a half-space split into two equal-size regions, and each external node terminates further splits. All nodes record the mass of the training data in their own regions.
Let T ^{ h }[i] be a Half-Space Tree with depth level i; and let m(T ^{ h }[i]), or m[i] for short, be the mass in one of the regions at level i.
The relationship between any two regions is expressed using mass with reference to m[0] at depth level 0 (the root) of a Half-Space Tree.
Mass is estimated using m[ℓ] only if a Half-Space Tree has all external nodes at the same depth level. The estimation is based on the augmented mass, m[ℓ]×2^{ ℓ }, if the external nodes have differing depth levels. We describe two such variants of Half-Space Tree below.
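As a quick numerical illustration of this normalisation (our own example, not from the paper): under a uniform mass distribution a region at depth level ℓ holds n/2^ℓ of the n training points, so multiplying by 2^ℓ makes external nodes at different depths comparable.

```python
def augmented_mass(m, level):
    # m[l] * 2**l: normalise the mass recorded at depth level l
    return m * (2 ** level)

# Under a uniform mass distribution over n = 64 points, a depth-2 region holds
# 16 points and a depth-3 region holds 8; both recover the root mass of 64
# once augmented, so external nodes at different depths become comparable.
```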
HS*-Tree: based on augmented mass. Unlike HS-Tree, the second variant, HS*-Tree, has external nodes with differing depth levels. The mass estimated from an HS*-Tree is defined in equation (10) in order to account for the different depths. We call this augmented mass because the mass is augmented in the calculation by the depth level in HS*-Tree, as opposed to the mass alone in HS-Tree.
In a special case of HS*-Tree, the tree growing process at a branch terminates to form an external node only if the training data size at the branch is 1 (i.e., the size limit is set to 1). Here the mass estimated depends on the depth level only, i.e., 2^{ ℓ } or simply ℓ. In other words, the depth level becomes a proxy for mass in HS*-Tree when the size limit is set to 1. An example of HS*-Tree with the size limit set to 1 is shown in Fig. 7(b).
Since the two variants have similar performance, we focus on HS*-Tree only in this paper because it builds a smaller-sized tree than HS-Tree, which may grow many branches with zero mass; this saves on training time and memory space requirements.
4.2 Algorithm to generate Half-Space Trees
Half-Space Trees estimate a mass distribution efficiently, without density or distance calculations or clustering. We first describe the training procedure, then the testing procedure, and finally the time and space complexities.
Constructing a single Half-Space Tree is almost identical to constructing an ordinary decision tree^{3} (Quinlan 1993), except that no splitting selection criterion is required at each node.
Given a work space, an attribute q is randomly selected to form an internal node of a Half-Space Tree (line 4 in Algorithm 2). The split point of this internal node is simply the midpoint between the minimum and maximum values of attribute q (i.e., min _{ q } and max _{ q }), defined by the work space (line 5). Data are filtered through one of the two branches depending on which side of the split the data reside (lines 6–7). This node building process is repeated for each branch (lines 9–12 in Algorithm 2) until a size limit or a depth limit is reached to form an external node (lines 1–2 in Algorithm 2). The training instances at the external node at depth level ℓ form the mass \(\mathbf{m}(T^{h}(\mathbf{x} \mid \mathcal{D}))\) to be used during the testing process for x. The parameters are set as follows: \(S = \log_{2}(|\mathcal{D}|)-1\) and \(h=|\mathcal{D}|\) for all the experiments conducted in this paper.
Ensemble. The proposed method uses a random subsample \(\mathcal{D}\) to build one Half-Space Tree (i.e., \(T^{h}(\cdot \mid \mathcal{D})\)), and multiple Half-Space Trees are constructed from different random subsamples (using sampling without replacement) to form an ensemble.
Testing. During testing, a test instance x traverses each Half-Space Tree from the root to an external node, and the mass recorded at the external node is used to compute its augmented mass (see (11) below). This testing is carried out for all Half-Space Trees in the ensemble, and the final score is the average score over all trees, as expressed in (12) below.
Mass needs to be augmented with the depth ℓ of a Half-Space Tree in order to ‘normalise’ the masses from different depths in the tree.
Time and space complexities. Because it involves no evaluations or searches, a Half-Space Tree can be generated quickly. In addition, a good performing Half-Space Tree can be generated using only a small subsample (of size ψ) from a given data set of size n, where ψ≪n. An ensemble of Half-Space Trees has training time complexity O(chψ), which is constant for an ensemble with fixed subsample size ψ, maximum depth level h and ensemble size c. It has time complexity O(chn) during testing. The space complexity for Half-Space Trees is O(chψ), which is also constant for an ensemble with fixed subsample size, maximum depth level and ensemble size.
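To make the procedure concrete, here is a compact Python sketch of the whole pipeline: training with a random attribute and midpoint split that terminates on a size or depth limit, testing with the augmented mass m·2^ℓ, and ensemble averaging. The node layout, the parameter values and the min-max initialisation of the work space are our own illustrative choices, not the paper's exact settings.

```python
import random

def build_hs_tree(data, work_space, depth, depth_limit, size_limit, rng):
    # External node: record the mass (training count) and its depth level.
    if depth >= depth_limit or len(data) <= size_limit:
        return {'mass': len(data), 'level': depth}
    q = rng.randrange(len(work_space))            # randomly selected attribute
    lo, hi = work_space[q]
    split = (lo + hi) / 2.0                        # midpoint of the work space
    left = [p for p in data if p[q] < split]
    right = [p for p in data if p[q] >= split]
    node = {'attr': q, 'split': split}
    work_space[q] = (lo, split)                    # narrow the work space, recurse
    node['left'] = build_hs_tree(left, work_space, depth + 1,
                                 depth_limit, size_limit, rng)
    work_space[q] = (split, hi)
    node['right'] = build_hs_tree(right, work_space, depth + 1,
                                  depth_limit, size_limit, rng)
    work_space[q] = (lo, hi)                       # restore before returning
    return node

def tree_score(node, x):
    # Traverse to an external node; score with the augmented mass m * 2**level.
    while 'attr' in node:
        node = node['left'] if x[node['attr']] < node['split'] else node['right']
    return node['mass'] * (2 ** node['level'])

def ensemble_score(trees, x):
    # Average augmented mass over the ensemble; low scores flag anomalies.
    return sum(tree_score(t, x) for t in trees) / len(trees)

def build_ensemble(data, c, psi, depth_limit, rng):
    trees = []
    for _ in range(c):
        sub = rng.sample(data, psi)                # subsample without replacement
        ws = [(min(p[q] for p in sub), max(p[q] for p in sub))
              for q in range(len(data[0]))]
        trees.append(build_hs_tree(sub, ws, 0, depth_limit, 1, rng))
    return trees
```

On a two-dimensional data set with a dense cluster and one distant point, the distant point receives a markedly lower average augmented mass than the cluster points, which is exactly the ranking an anomaly detector needs.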
5 Mass-based formalism
The data ordering expressed as a mass distribution can be interpreted as a measure of relevance with respect to the concept underlying the data, i.e., points having high mass are highly relevant to the concept and points having low mass are less relevant. In tasks whose primary aim is to rank points in a database with reference to a data profile, mass provides the ideal ranking measure without distance or density calculations. In anomaly detection, high mass signifies normal points and low mass signifies anomalies; in information retrieval, high (low) mass signifies that a database point is highly (less) relevant to the query. Even in tasks whose primary aim is not ranking, the transformed mass space can be better exploited by existing algorithms because the transformation stretches concept-irrelevant points farther away from relevant points in the mass space.
We introduce a formalism in which mass can be applied to different tasks in this section, and provide the empirical evaluation in the following section.
 C1
 The first component constructs a number of mass distributions in a mass space. A mass distribution \(\mathit{mass}(x^{d},h \mid \mathcal{D})\) for dimension d in the original feature space is obtained using our proposed one-dimensional mass estimation, as given in Definition 5. A total of t mass distributions is generated, which forms \(\widetilde{\mathbf{mass}}(\mathbf{x}) \rightarrow\mathcal{R}^{t}\), where t≫u. This procedure is given in Algorithm 3. Multidimensional mass estimation \(\mathbf{m}(T^{h}(\mathbf{x} \mid \mathcal{D}))\) (replacing one-dimensional mass estimation \(\mathit{mass}(x^{d},h \mid \mathcal{D})\)) can be used to generate the mass space similarly; see the note in Algorithm 3.
 C2
 The second component maps the data set D in the original space of u dimensions into a new data set D′ in the t-dimensional mass space using \(\widetilde{\mathbf{mass}}(\mathbf{x}) = \mathbf{z}\). This procedure is described in Algorithm 4.
 C3

The third component employs a decision rule to determine the final outcome for the task at hand. It is a taskspecific decision function applied to z in the new mass space.
The formalism becomes a blueprint for different tasks. Components C1 and C3 are mandatory in the formalism, but component C2 is optional, depending on the task.
In our experiments described in the next section, the mapping from u dimensions to t dimensions using Algorithm 3 is carried out one dimension at a time when using one-dimensional mass estimation, and all u dimensions at a time when using multidimensional mass estimation. Each such mapping produces one dimension in the mass space and is repeated t times to get a t-dimensional mass space. Note that randomisation gives different variations to each of the t mappings. The first randomisation occurs at step 2 in Algorithm 3 in selecting a random subset of the data. Additional randomisation is applied to attribute selection at step 3 in Algorithm 3 for one-dimensional mass estimation, or at step 4 in Algorithm 2 for multidimensional mass estimation.
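Components C1 and C2 for the one-dimensional case can be sketched as follows. The helper names and the level-1 estimator are our own, and h=1 is assumed for simplicity; each of the t mass-space dimensions comes from a randomly selected attribute and a random subsample, as described above.

```python
import random

def level1_mass(points, x):
    # one-dimensional level-1 mass estimate (Sect. 2.1.1)
    s = sorted(points)
    n = len(s)
    span = s[-1] - s[0]
    total = 0.0
    for i in range(1, n):
        gap = s[i] - s[i - 1]
        if x <= s[i - 1]:
            m = i
        elif x >= s[i]:
            m = n - i
        else:
            m = (i * (s[i] - x) + (n - i) * (x - s[i - 1])) / gap
        total += m * gap / span
    return total

def build_mass_space(D, t, psi, rng):
    # C1 sketch: t mass distributions, each built from a randomly selected
    # attribute and a random subsample of size psi (sampled without replacement)
    u = len(D[0])
    models = []
    for _ in range(t):
        d = rng.randrange(u)
        sub = sorted(row[d] for row in rng.sample(D, psi))
        models.append((d, sub))
    return models

def map_to_mass_space(x, models):
    # C2 sketch: z = mass~(x), a point in the t-dimensional mass space
    return [level1_mass(sub, x[d]) for d, sub in models]
```

A data set D in R^u is mapped row by row, giving D′ in R^t with t ≫ u; a task-specific decision function (C3) is then applied in the mass space.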
6 Experiments
We evaluate the performance of MassSpace and MassAD for three tasks in the following three subsections. We denote an algorithm A using one-dimensional and multidimensional mass estimations as A′ and A″, respectively.
In information retrieval and regression tasks, the mass estimation uses ψ=8 and t=1000. These settings are obtained by examining the rank correlation shown in Fig. 4(b)—having a high rank correlation between mass(x,1) and \(\mathit{mass}(x,1 \mid \mathcal{D})\). Note that this is done before any method is applied, and no further tuning of the parameters is carried out after this step. In anomaly detection tasks, ψ=256 and t=100 are used so that they are comparable to those used in a benchmark method for a fair comparison. In all tasks, h=1 is used for one-dimensional mass estimation, since a higher h cannot be afforded because of the high cost O(ψ ^{ h }); h=ψ is used for multidimensional mass estimation in order to reduce one parameter setting.
All the experiments were run in Matlab and conducted on a Xeon processor running at 2.66 GHz with 48 GB memory. The performance of each method was measured in terms of the task-specific performance measure and runtime. Paired t-tests at the 5 % significance level were conducted to examine whether the difference in performance between two algorithms under comparison is significant.
Note that we treated information retrieval and anomaly detection as unsupervised learning tasks. Classes/labels in the original data were used as ground truth for evaluation of performance only; they were not used in building mass distributions. In regression, only the training set was used to build mass distributions in step 1 of Algorithm 5; the mapping in step 2 was conducted for both the training and testing sets.
6.1 Content-based image retrieval
We use a Content-Based Image Retrieval (CBIR) task as an example of information retrieval. The MassSpace approach is compared with three state-of-the-art CBIR methods that deal with relevance feedback: a manifold-based method MRBIR (He et al. 2004), and two recent techniques for improving similarity calculation, i.e., Qsim (Zhou and Dai 2006) and InstR (Giacinto and Roli 2005); we employ the Euclidean distance to measure the similarity between instances in these two methods. The default parameter settings are used for all these methods.
Our experiments were conducted using the COREL image database (Zhou et al. 2006) of 10000 images, which contains 100 categories, each with 100 images. Each image is represented by a 67-dimensional feature vector, which consists of 11 shape, 24 texture and 32 color features. To test the performance, we randomly selected 5 images from each category to serve as the queries. For a query, the images within the same category were regarded as relevant and the rest as irrelevant. For each query, we performed up to 5 rounds of relevance feedback. In each round, 2 positive and 2 negative feedbacks were provided. This relevance feedback process was repeated 5 times with 5 different series of feedbacks. Finally, the average results with one query and in different feedback rounds were recorded. The retrieval performance was measured in terms of the Break-Even-Point (BEP) (Zhou and Dai 2006; Zhou et al. 2006) of the precision-recall curve. The online processing time reported is the time required in each method for a query plus the stated number of feedback rounds. The reported result is an average over 5×100 runs for query only, and an average over 5×100×5 runs for query plus feedbacks. The offline costs of constructing the one-dimensional mass estimation and the mapping of 10000 images were 0.27 and 0.32 seconds, respectively. The multidimensional mass estimation and the corresponding mapping took 1.72 and 5.74 seconds, respectively.
CBIR results (in BEP×10^{−2}). An algorithm A using one-dimensional and multidimensional mass estimations is denoted as A′ and A″, respectively. Note that a high BEP is better than a low BEP
MRBIR″  MRBIR′  MRBIR  Qsim″  Qsim′  Qsim  InstR″  InstR′  InstR  

One query  12.65  10.70  9.69  12.38  10.35  7.78  12.38  10.35  7.78 
Round 1  16.58  14.24  12.72  19.18  15.46  10.59  13.88  13.33  9.40 
Round 2  18.41  16.05  13.90  21.98  17.58  11.81  15.12  14.95  9.99 
Round 3  19.69  17.34  14.75  23.67  18.71  12.59  16.19  16.07  10.36 
Round 4  20.48  18.20  15.33  24.65  19.50  13.16  16.88  16.93  10.78 
Round 5  21.15  19.86  15.71  25.42  19.96  13.55  17.49  17.58  11.05 
The BEP results clearly show that the MassSpace approach achieves a better retrieval performance than that using the original space in all three methods MRBIR, Qsim and InstR, for one query and all rounds of relevance feedbacks. Paired ttests with 5 % significance level also indicate that the MassSpace approach significantly outperforms each of the three methods in all experiments, without exception. These results show that the mass space provides useful additional information that is hidden in the original space.
The results also show that the multi-dimensional mass estimation provides better information than the one-dimensional mass estimation—MRBIR″, Qsim″ and InstR″ give better retrieval performance than MRBIR′, Qsim′ and InstR′, respectively; the only exceptions occur in the higher feedback rounds for InstR′, with minor differences.
CBIR results (online time cost in seconds)
MRBIR″  MRBIR′  MRBIR  Qsim″  Qsim′  Qsim  InstR″  InstR′  InstR  

One query  0.714  0.785  0.364  0.715  0.822  0.093  0.715  0.822  0.093 
Round 1  0.762  0.893  0.696  0.207  0.208  0.035  0.197  0.198  0.026 
Round 2  0.763  0.893  0.696  0.228  0.231  0.058  0.200  0.200  0.028 
Round 3  0.763  0.893  0.696  0.257  0.259  0.086  0.200  0.200  0.028 
Round 4  0.764  0.893  0.696  0.291  0.294  0.122  0.200  0.200  0.028 
Round 5  0.764  0.893  0.697  0.335  0.341  0.167  0.200  0.200  0.028 
6.2 Regression
In this experiment, we compare support vector regression (Vapnik 2000) employing the original space (SVR) with that employing the mapped mass space (SVR″ and SVR′). SVR is the ϵ-SVR algorithm with an RBF kernel, implemented in LIBSVM (Chang and Lin 2001). SVR is chosen here because it is one of the top-performing models.
Regression results (the smaller the better for MSE)
Data size  MSE (×10^{−2})  W/D/L  

SVR″  SVR′  SVR  SVR″  SVR′  
tic  9822  5.56  5.58  5.62  18/0/2  17/0/3 
wine_white  4898  1.08  1.21  1.36  20/0/0  20/0/0 
quake  2178  2.87  2.86  2.92  17/0/3  18/0/2 
wine_red  1599  1.50  1.62  1.62  19/0/1  11/0/9 
concrete  1030  0.28  0.33  0.57  20/0/0  20/0/0 
In each data set, we randomly sampled two-thirds of the instances for training and the remaining one-third for testing. This was repeated 20 times and we report the average result of these 20 runs. The data set, whether in the original space or the mass space, was min-max normalized before an ϵ-SVR model was trained. To select optimal parameters for the ϵ-SVR algorithm, we conducted a 5-fold cross-validation based on mean squared error using the training set only. The kernel parameter γ was searched in the range {2^{−15},2^{−13},2^{−11},…,2^{3},2^{5}}; the regularization parameter C in the range {0.1,1,10}, and ϵ in the range {0.01,0.05,0.1}. We measured regression performance in terms of mean squared error (MSE) and runtime in seconds. The runtime reported is the runtime for SVR only. The total cost of mass estimation (from the training set) and mapping (of training and testing sets) in the largest data set, tic, was 1.8 seconds for one-dimensional mass estimation, and 8.5 seconds for multi-dimensional mass estimation. The cost of normalisation and the parameter search using 5-fold cross-validation was not included in the reported result for all SVR″, SVR′ and SVR.
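The search grid above is small enough to enumerate exhaustively. A sketch of the grid and the min-max normalisation step (pure Python for illustration only; the experiments themselves used LIBSVM's ϵ-SVR, and the helper names are ours):

```python
# Parameter ranges from the 5-fold cross-validation search above.
gammas = [2.0 ** k for k in range(-15, 6, 2)]   # 2^-15, 2^-13, ..., 2^5
Cs = [0.1, 1, 10]
epsilons = [0.01, 0.05, 0.1]

# Every (gamma, C, epsilon) combination evaluated by the search.
grid = [(g, C, e) for g in gammas for C in Cs for e in epsilons]

def min_max(column):
    """Min-max normalise one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

The grid has 11 × 3 × 3 = 99 candidate settings per training set; each is scored by 5-fold cross-validated MSE and the best is refit on the full training set.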
The result is presented in Table 4. SVR′ performs significantly better than SVR in terms of MSE in all data sets; the only exception is the wine_red data set. SVR″ performs significantly better than SVR in all data sets, without exception. SVR″ also generally performs better than SVR′.
Regression results (time in seconds)
#Dimension  Processing time  Factor increase  

SVR″  SVR′  SVR  time(SVR″)/time(SVR)  time(SVR′)/time(SVR)  #dimension  
tic  85  23.4  26.6  11.9  2.0  2.2  12 
wine_white  11  8.2  9.2  4.2  2.0  2.2  91 
quake  3  2.5  3.4  1.0  2.5  3.4  333 
wine_red  11  1.7  2.6  1.0  1.6  2.5  91 
concrete  8  1.2  2.3  0.9  1.3  2.6  125 
6.3 Anomaly detection
This experiment compares MassAD with four state-of-the-art anomaly detectors: isolation forest or iForest (Liu et al. 2008), a distance-based method ORCA (Bay and Schwabacher 2003), a density-based method LOF (Breunig et al. 2000), and one-class support vector machine (or 1-SVM) (Schölkopf et al. 2000). MassAD was built with t=100 and ψ=256, the same default settings as used in iForest (Liu et al. 2008), which also employs a multi-model approach. The parameter settings employed for ORCA, LOF and 1-SVM were as stated by Liu et al. (2008).
Data characteristics of the data sets in anomaly detection tasks. The percentage in brackets indicates the percentage of anomalies
Data size  #Dimension  Anomaly class  

Http  567497  3  Attack (0.4 %) 
Forest  286048  10  Class 4 (0.9 %) vs class 2 
Mulcross  262144  4  2 clusters (10 %) 
Smtp  95156  3  Attack (0.03 %) 
Shuttle  49097  9  Classes 2, 3, 5, 6, 7 (7 %) vs class 1 
Mammography  11183  6  Class 1 (2 %) 
Annthyroid  7200  6  Classes 1, 2 (7 %) 
Satellite  6435  36  3 Smallest classes (32 %) 
MassAD and iForest were implemented in Matlab and tested on a Xeon processor running at 2.66 GHz. LOF was written in Java on the ELKI platform version 0.4 (Achtert et al. 2008); ORCA was written in C++ (www.stephenbay.net/orca/). The experiments for ORCA, LOF and 1-SVM used the same experimental setting but a slightly slower 2.3 GHz machine, the same machine used by Liu et al. (2008).
AUC values for anomaly detection
MassAD  iForest  ORCA  LOF  1-SVM  

Mass″  Mass′  
Http  1.00  1.00  1.00  0.36  0.44  0.90 
Forest  0.90  0.92  0.87  0.83  0.56  0.90 
Mulcross  0.26  0.99  0.96  0.33  0.59  0.59 
Smtp  0.91  0.86  0.88  0.87  0.32  0.78 
Shuttle  1.00  0.99  1.00  0.55  0.55  0.79 
Mammography  0.86  0.37  0.87  0.77  0.71  0.65 
Annthyroid  0.75  0.71  0.82  0.68  0.72  0.63 
Satellite  0.77  0.62  0.71  0.65  0.52  0.61 
Again, the multi-dimensional version of MassAD generally performs better than the one-dimensional version, with five wins, one draw and two losses. Most importantly, the worst performance in the Mulcross data set can be easily ‘corrected’ using a better parameter setting—by using ψ=8, instead of 256, the multi-dimensional version of MassAD improves its detection performance from 0.26 to 1.00 in terms of AUC.^{4}
It is also noteworthy that the multi-dimensional MassAD significantly outperforms the traditional density-based, distance-based and SVM anomaly detectors in all data sets, except two: Annthyroid, when compared to ORCA; and Mulcross, whose poor performance was discussed earlier. The above observations validate the effectiveness of our proposed mass estimation on anomaly detection tasks.
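The AUC values above can be read as the probability that a randomly chosen anomaly is scored above a randomly chosen normal point. A minimal rank-sum sketch of this measure (function and variable names are our illustration, not from any of the compared implementations):

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic.

    labels: 1 for anomaly, 0 for normal. Counts the fraction of
    (anomaly, normal) pairs in which the anomaly receives the
    higher score, with ties counted as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every anomaly outranks every normal point; 0.5 is no better than random ordering.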
Runtime (seconds) for anomaly detection
MassAD  iForest  ORCA  LOF  1-SVM  

Mass″  Mass′  
Http  168  18  74  9487  18913  35872 
Forest  63  10  39  6995  10853  9738 
Mulcross  52  10  38  2512  5432  7343 
Smtp  27  4  13  267  540  987 
Shuttle  20  3  8  157  368  333 
Mammography  21  1  3  4  39  11 
Annthyroid  7  1  3  2  9  4 
Satellite  13  1  3  9  10  9 
Training time and testing time (seconds) for MassAD and iForest, using t=100 and ψ=256
Training time  Testing time  

MassAD  iForest  MassAD  iForest  
Mass″  Mass′  Mass″  Mass′  
Http  16.2  14.3  14.4  151.8  3.3  59.6 
Forest  10.3  8.2  8.6  53.1  2.0  30.8 
Mulcross  9.1  7.9  8.1  42.8  2.1  29.4 
Smtp  5.4  3.9  3.5  21.9  0.6  9.9 
Shuttle  6.1  3.1  2.8  14.1  0.3  5.6 
Mammography  8.4  1.3  1.2  12.8  0.1  1.8 
Annthyroid  3.1  1.3  1.1  3.4  0.1  1.5 
Satellite  6.6  1.2  1.6  5.9  0.0  1.9 
A comparison of time and space complexities. The time complexity includes both training and testing. n is the given data set size and u is the number of dimensions. For MassAD and iForest, the first part of the summation is the training time and the second the testing time
Time complexity  Space complexity  

MassAD (multi-dimensional)  O(t(ψ+n)h)  O(tψh) 
MassAD (one-dimensional)  O(t(ψ^{h+1}+n))  O(tψ) 
iForest  O(t(ψ+n)⋅log(ψ))  O(tψ⋅log(ψ)) 
ORCA  O(un⋅log(n))  O(un) 
LOF  O(un^{2})  O(un) 
In contrast to ORCA and LOF (distance-based and density-based methods), the time and space complexities of both MassAD and iForest are independent of the number of dimensions u.
6.4 Constant time and space complexities
Runtime (seconds) for sampling, \(\mathit{mass}(x,1|\mathcal{D})\) and \(\mathit{mass}(x,3|\mathcal{D})\), where t=1000 and ψ=8
Data size  Sampling  \(\mathit{mass}(x,1|\mathcal{D})\)  \(\mathit{mass}(x,3|\mathcal{D})\)  

Http  567497  138.30  0.33  10.96 
Shuttle  49097  16.16  0.39  10.97 
COREL  10000  1.23  0.27  11.03 
tic  9822  1.09  0.43  11.14 
concrete  1030  0.18  0.31  10.95 
The results show that the sampling time increased linearly with the size of the given data set, and it took significantly longer (in the largest data set) than the time to construct the mass distribution—which was constant, regardless of the given data size. Note that the training time provided in Table 9 includes both the sampling time and mass estimation time, and it is dominated by the sampling time for large data sets.
The memory required for each construction of \(\mathit{mass}(x,h|\mathcal{D})\) is that needed to store one lookup table of size ψ, which is constant.
The constant time and space complexities apply to the multi-dimensional mass estimation too.
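To illustrate why a single lookup table of size ψ suffices, here is a direct, unoptimised sketch of level-1 mass over a subsample, consistent with the h=1 definition: each of the ψ−1 binary splits assigns x the size of the region it falls in, and these are averaged. The function name and the uniform p(s_i)=1/(ψ−1) weighting are our illustration:

```python
def mass_1d(subsample, x):
    """Level-1 mass of x w.r.t. a subsample of size psi.

    Split s_i (between the i-th and (i+1)-th sorted values) puts
    i points on the left and psi - i on the right; mass_i(x) is
    the size of the region containing x. Averaging over all
    psi - 1 splits yields a concave profile that peaks at the
    core and falls towards the fringe.
    """
    s = sorted(subsample)
    psi = len(s)
    total = 0
    for i in range(1, psi):
        split = (s[i - 1] + s[i]) / 2.0
        total += i if x < split else psi - i
    return total / (psi - 1)
```

In practice the region masses are precomputed once into a lookup table of size ψ, so the construction cost depends on ψ only and each query reduces to a binary search, regardless of the given data size n.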
6.5 Runtime comparison between one-dimensional and multi-dimensional mass estimations
6.6 Summary
The above results in all three tasks show that the orderings provided by mass distributions deliver additional information about the data that would otherwise be hidden in the original features. The additional information, which accentuates fringe points with a concave function (or an approximation to a concave function in the case of multi-dimensional mass estimation), improves the task-specific performance significantly, especially in the information retrieval and regression tasks.
Using Algorithm 5 for the information retrieval and regression tasks, the runtime is expected to be higher because the new space has many more dimensions than the original space (t≫u). It should be noted that the runtime increase (linear or worse) is solely a characteristic of the existing algorithms used, and is not due to the mass-space mapping, which has constant time and space complexities.
We believe that a more tailored approach that better integrates the information provided by mass (into the C3 component of the formalism) for a specific task can potentially further improve the current level of performance, in terms of either the task-specific performance measure or the runtime. We have demonstrated this ‘direct’ application using Algorithm 6 for the anomaly detection task, in which MassAD performs equally well as or significantly better than four state-of-the-art methods in terms of task-specific performance measures, and the one-dimensional mass estimation executes faster than all other methods.
Why does one-dimensional mapping work when tackling multi-dimensional problems? We conjecture that if there is no or little interaction between features, then the one-dimensional mapping will work because the ordering that accentuates the fringe points in each original dimension makes it easy for existing algorithms to exploit. When there are strong interactions between features, one-dimensional mapping might not achieve good results. Indeed, our results in all three tasks show that multi-dimensional mass estimation does perform better than one-dimensional mass estimation in general, in terms of task-specific performance measures.
The ensemble method for mass estimation usually needs only a small sample to build each model in an ensemble. Note, however, that the total number of points used to build all t models, tψ, could exceed n when ψ>n/t.
The key limitation of the one-dimensional mass estimation is its high cost when a high value of h is applied. This can be avoided by implementing it using a tree structure rather than a lookup table, as we have done using Half-Space Trees, which reduces the time complexity from O(t(ψ^{h+1}+n)) to O(th(ψ+n)).
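A minimal Half-Space Tree sketch along those lines (the dict-based nodes and function names are our illustration; the paper's Algorithm 2 also randomises the initial work range): each internal node cuts the current range of a randomly chosen attribute at its mid-point, and every node records the number of training points it covers, i.e. its mass.

```python
import random

def build(points, lo, hi, depth, rng):
    """Grow a Half-Space Tree of the given depth over points in the
    axis-parallel box [lo, hi]; each node stores its training mass."""
    node = {"mass": len(points)}
    if depth == 0:
        return node
    q = rng.randrange(len(lo))           # random attribute
    cut = (lo[q] + hi[q]) / 2.0          # deterministic mid-point cut
    node["q"], node["cut"] = q, cut
    hi_l = list(hi); hi_l[q] = cut       # left half-space range
    lo_r = list(lo); lo_r[q] = cut       # right half-space range
    node["left"] = build([p for p in points if p[q] < cut],
                         lo, hi_l, depth - 1, rng)
    node["right"] = build([p for p in points if p[q] >= cut],
                          lo_r, hi, depth - 1, rng)
    return node

def mass_at(node, p):
    """Look up the leaf mass for a test point."""
    while "q" in node:
        node = node["left"] if p[node["q"]] < node["cut"] else node["right"]
    return node["mass"]
```

Because the cut is the mid-point of the current range, building a tree of depth h over a subsample of size ψ costs O(ψh) and a query costs O(h), giving the O(th(ψ+n)) total over t models.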
7 Relation to kernel density estimation
A comparison of kernel density estimation and mass estimation. Kernel density estimation requires two parameter settings: the kernel function K(⋅) and the bandwidth h_{w}; mass estimation has one: h
\(\mbox{Kernel density}(x) = \frac{1}{nh_{w}}\ \sum_{i=1}^{n} K\bigl(\frac{x-x_{i}}{h_{w}}\bigr)\) 
\(\mathit{mass}(x, h) =\left \{ \begin{array}{l@{\quad}l} \sum_{i=1}^{n-1} \mathit{mass}_{i}(x,h-1)\, p(s_{i}), & h > 1 \\ \sum_{i=1}^{n-1} m_{i}(x)\, p(s_{i}), & h = 1 \end{array} \right .\) 

Aim: Kernel estimation aims to estimate the probability density; mass estimation aims to estimate an ordering from the core points to the fringe points.

Kernel function: While kernel estimation can use different kernel functions for probability density estimation, we doubt that mass estimation requires a different base function, for two reasons. First, a more sophisticated function is unlikely to provide a better ordering than the simple rectangular function. Second, the rectangular function keeps the computation simple and fast. In addition, a kernel function must be fixed (i.e., have user-defined values for its parameters); e.g., the rectangular kernel function has a fixed width or a fixed per-unit size. The rectangular function used in mass, by contrast, has no parameter and no fixed width.

Sample size: Kernel estimation and other density estimation methods require a large sample size to estimate the probability accurately (Duda et al. 2001). Mass estimation using \(\mathit{mass}(x,h|\mathcal{D})\) needs only a small sample size in an ensemble to estimate the ordering accurately.
Here we present the results using Gaussian kernel density estimation in place of the one-dimensional mass estimation, with the same subsample size in an ensemble approach. The bandwidth parameter is set to the standard deviation of the subsample; all other parameters are unchanged.
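The replacement estimator can be sketched as follows (pure Python for illustration; the bandwidth is the subsample's standard deviation, as stated above, and the function name is ours):

```python
import math

def gaussian_kde(subsample, x):
    """Gaussian kernel density estimate at x, with bandwidth h_w
    set to the standard deviation of the subsample."""
    n = len(subsample)
    mu = sum(subsample) / n
    hw = math.sqrt(sum((v - mu) ** 2 for v in subsample) / n)
    norm = 1.0 / (n * hw * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - v) / hw) ** 2)
                      for v in subsample)
```

As with mass, the ensemble averages this estimate over t subsamples of size ψ; the difference is that the resulting scores estimate density, not an ordering-oriented mass.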
CBIR results (in BEP×10^{−2})
(a) Compare with Qsim^{K} (using kernel density estimation), Qsim^{D} (using data depth), Qsim^{LD} (using local data depth)  

Qsim″  Qsim′  Qsim^{K}  Qsim^{D}  Qsim^{LD}  Qsim  
One query  12.38  10.35  2.90  10.39  7.60  7.78 
Round 1  19.18  15.46  3.01  15.02  10.95  10.59 
Round 2  21.98  17.58  2.74  17.16  12.50  11.81 
Round 3  23.67  18.71  2.54  18.37  13.42  12.59 
Round 4  24.65  19.50  2.42  19.20  14.03  13.16 
Round 5  25.42  19.96  2.34  19.74  14.36  13.55 
(b) Compare with InstR^{K}, InstR^{D} and InstR^{LD}  

InstR″  InstR′  InstR^{K}  InstR^{D}  InstR^{LD}  InstR  
One query  12.38  10.35  2.90  10.39  7.60  7.78 
Round 1  13.88  13.33  2.91  13.05  8.71  9.40 
Round 2  15.12  14.95  2.55  14.73  9.68  9.99 
Round 3  16.19  16.07  2.25  15.98  10.28  10.36 
Round 4  16.88  16.93  2.06  16.82  10.78  10.78 
Round 5  17.49  17.58  1.99  17.50  11.17  11.05 
8 Relation to data depth
There is a close relationship between the proposed mass and data depth (Liu et al. 1999): they both delineate the centrality of a data cloud (as opposed to compactness in the case of the density measure). The properties common to both measures are: (a) the centre of a data cloud has the maximum value of the measure; (b) an ordering from the centre (having the maximum value) to the fringe points (having the minimum values).
However, there are two key differences. First, until recently (see Agostinelli and Romanazzi 2011), data depth always modelled a given data set with one centre, regardless of whether the data is unimodal or multimodal; whereas mass can model both unimodal and multimodal data by setting h=1 or h>1. Local data depth (Agostinelli and Romanazzi 2011) has a parameter (τ) which allows it to model multimodal as well as unimodal data. However, the performance of local data depth appears to be sensitive to the setting of τ (see the discussion of the comparison below). In contrast, a single setting of h in mass estimation produced good task-specific performance in three different tasks in our experiments.
Second, mass is a simple and straightforward measure, and it has efficient estimation methods based on axis-parallel partitions only. Data depth has many different definitions, depending on the construct used to define depth; the constructs include Mahalanobis, convex hull, simplicial and half-space (Liu et al. 1999), all of which are expensive to compute (Aloupis 2006). This has been the main obstacle in applying data depth to real applications in multi-dimensional problems. For example, Ruts and Rousseeuw (1996) compute the contour of data depth of a data cloud for visualization, and employ depth as the anomaly score to identify anomalies; because of its computational cost, this is limited to small data sizes only. In contrast to the axis-parallel partitions used in mass estimation, half-space data depth^{5} (Tukey 1975), for example, requires considering all half-spaces, which demands high computational time and space.
To provide a comparison, we replace the one-dimensional mass estimation (defined in Algorithm 3) with data depth (defined by simplicial depth; Liu et al. 1999) and local data depth (defined by simplicial local depth; Agostinelli and Romanazzi 2011). We repeat the experiments using the data depth and local data depth implementations in R by Agostinelli and Romanazzi (2011) (accessible from r-forge.r-project.org/projects/localdepth). Both data depths are computed using the same ensemble approach, using sample size ψ to build each of the t models.^{6} The number of simplices used for the empirical estimation is set to 10000 for all runs. Default settings are used for all other parameters (i.e., the membership of a data point in simplices is evaluated in the “exact” mode rather than the approximate mode, and the tolerance parameter is fixed to 10^{−9}). Note that local depth uses an additional parameter τ to select candidate simplices: a simplex having volume larger than τ is excluded from consideration. As the performance of local depth is sensitive to τ, we set the quantile order of τ to 10 %, the low end of the 10 %–30 % range suggested by Agostinelli and Romanazzi (2011). Because data depth and local data depth are estimated using the same procedure, their runtimes are the same.
The task-specific performance result for information retrieval is provided in Table 13. Note that local data depth could produce worse retrieval results than those in the original feature space. Data depth performed close to the one-dimensional mass estimation, but significantly worse than the multi-dimensional mass estimation.
Table 15 shows the result in anomaly detection. Data depth performed worse than both versions of mass estimation in six out of eight data sets; local data depth performed worse than multi-dimensional mass estimation in five out of eight data sets; against one-dimensional mass estimation, local data depth has four wins and four losses. Note that though local data depth achieved the best result in two data sets, it also produced the worst results in three data sets (Http, Forest and Shuttle), significantly worse than all the others.
CBIR results (online time cost in seconds)
(a) Compare with Qsim^{K}, Qsim^{D}, Qsim^{LD}  

Qsim″  Qsim′  Qsim^{K}  Qsim^{D}  Qsim^{LD}  Qsim  
One query  0.715  0.822  0.820  0.840  0.829  0.093 
Round 1  0.207  0.208  0.224  0.237  0.226  0.035 
Round 2  0.228  0.231  0.279  0.288  0.276  0.058 
Round 3  0.257  0.259  0.348  0.355  0.343  0.086 
Round 4  0.291  0.294  0.435  0.438  0.425  0.122 
Round 5  0.335  0.341  0.547  0.543  0.531  0.167 
(b) Compare with InstR^{K}, InstR^{D} and InstR^{LD}  

InstR″  InstR′  InstR^{K}  InstR^{D}  InstR^{LD}  InstR  
One query  0.715  0.822  0.820  0.840  0.829  0.093 
Round 1  0.197  0.198  0.203  0.215  0.206  0.026 
Round 2  0.200  0.200  0.205  0.216  0.206  0.028 
Round 3  0.200  0.200  0.206  0.217  0.207  0.028 
Round 4  0.200  0.200  0.207  0.218  0.208  0.028 
Round 5  0.200  0.200  0.207  0.218  0.208  0.028 
Anomaly detection: MassAD vs DensityAD and DepthAD (AUC)
MassAD  DensityAD  DepthAD  

Mass″  Mass′  Density  Depth  LDepth  
Http  1.00  1.00  0.99  0.98  0.52 
Forest  0.90  0.92  0.70  0.85  0.49 
Mulcross  0.26  0.99  1.00  0.99  0.93 
Smtp  0.91  0.86  0.59  0.92  0.93 
Shuttle  1.00  0.99  0.90  0.87  0.72 
Mammography  0.86  0.37  0.27  0.36  0.79 
Annthyroid  0.75  0.71  0.80  0.58  0.86 
Satellite  0.77  0.62  0.61  0.59  0.69 
Anomaly detection: MassAD vs DensityAD and DepthAD (time in seconds)
MassAD  DensityAD  DepthAD  

Mass″  Mass′  Density  Depth  LDepth  
Http  168  18  17  38  38 
Forest  63  10  10  31  31 
Mulcross  52  10  10  31  31 
Smtp  27  10  10  26  26 
Shuttle  20  4  4  25  25 
Mammography  21  3  3  24  24 
Annthyroid  7  1  1  23  23 
Satellite  13  1  1  23  23 
9 Other work based on mass
iForest (Liu et al. 2008) and MassAD share some common features: both are ensemble methods which build t models, each from a random sample of size ψ, and both combine the outputs of the models through averaging during testing. Although iForest (Liu et al. 2008) employs as the anomaly score the path length an instance traverses from the root of a tree to its leaf, we have shown that the path length used in iForest is in fact a proxy for mass (see Sect. 4.1 for details). In other words, iForest is a kind of mass-based method; that is why MassAD and iForest have similar detection accuracy. Multi-dimensional MassAD has the closest resemblance to iForest because of its use of trees. The key conceptual difference is that MassAD is just one application of the more fundamental concept of mass introduced here, whereas iForest is for anomaly detection only. In terms of implementation, the key difference is how the cut-off value is selected at each internal node of a tree: iForest selects the cut-off value randomly, whereas a Half-Space Tree selects a mid-point deterministically (see step 5 in Algorithm 2).
How easily can the proposed formalism be applied to other tasks? In addition to the tasks reported in this paper, we have applied mass estimation ‘directly’, using the proposed formalism, to solve problems in content-based multimedia information retrieval (Zhou et al. 2012) and clustering (Ting and Wells 2010). While an ‘indirect’ application is straightforward, simply running existing algorithms in the mass space, a ‘direct’ application requires a complete rethink of the problem and produces a totally different algorithm. However, this rethinking of a problem in terms of mass often results in a more efficient, and sometimes more effective, algorithm than existing ones. We provide a brief description of the two applications in the following two paragraphs.
In addition to the mass-space mapping we have shown here (i.e., components C1 and C2), Zhou et al. (2012) present a content-based information retrieval method that assigns a weight (based on iForest, and thus mass) to each new mapped feature w.r.t. a query, and then ranks objects in the database according to their weighted average feature values in the mapped space. The method also incorporates relevance feedback, which modifies the ranking through re-weighted features in the mapped space. This method forms the third component of the formalism stated in Sect. 5. This ‘direct’ application of mass has been shown to be significantly better than the ‘indirect’ approach of Sect. 6.1, in terms of both the task-specific measure and the runtime (Zhou et al. 2012). It is interesting to note that, unlike existing retrieval systems which rely on a metric, the new mass-based method does not employ one; as far as we know, it is the first information retrieval system that does not use a metric.
Ting and Wells (2010) use a variant of the Half-Space Trees we have employed here and apply mass directly to solve clustering problems. It is the first mass-based clustering algorithm, and it is unique because it uses neither a distance nor a density measure. In this task, as in anomaly detection, only two components are required. After building a mass model (the C1 component), the C3 component consists of linking instances with non-zero mass connected by the mass model and making each group of connected instances a separate cluster; all other unconnected instances are regarded as noise. This mass-based clustering algorithm has been shown to perform as well as DBSCAN (Ester et al. 1996) in terms of clustering performance, but it runs orders of magnitude faster (Ting and Wells 2010).
The earlier version of this paper (Ting et al. 2010) establishes the properties of mass estimation in the one-dimensional setting only, and uses it in all three tasks. This paper extends one-dimensional mass estimation to multi-dimensional mass estimation using the same approach as described by Ting and Wells (2010), and implements multi-dimensional mass estimation using Half-Space Trees. This paper also reports new experiments using the multi-dimensional mass estimation, and shows its advantage over one-dimensional mass estimation in the three tasks reported earlier (Ting et al. 2010). These related works show that mass estimation can be implemented in different ways, using tree-based or non-tree-based methods.
10 Conclusions and future work
This paper makes two key contributions. First, we introduce a base measure, mass, and delineate its three properties: (i) a mass distribution stipulates an ordering from core points to fringe points in a data cloud; (ii) this ordering accentuates the fringe points with a concave function—a property that can be easily exploited by existing algorithms to improve their task-specific performance; and (iii) the mass estimation methods have constant time and space complexities. Density estimation has been the base modelling mechanism employed in many techniques thus far. Mass estimation introduced here provides an alternative, and it is better suited for the many tasks which require an ordering rather than a probability density estimate.
Second, we present a mass-based formalism which forms a basis for applying mass to different tasks. The three tasks to which we have successfully applied it (i.e., information retrieval, regression and anomaly detection) are just examples of its application. Mass estimation has potential in many other applications.
Footnotes
 1.
In data having a pocket of points of the same value, an arbitrary order can be ‘forced’ by adding increasing multiples of an insignificantly small value ϵ to each subsequent point of the pocket, without changing the general distribution.
 2.
The estimated mass(x) values can be calibrated to a finite data range Δ by multiplying by a factor (x_{n}−x_{1})/Δ.
 3.
However, they are for different tasks: decision trees are for supervised learning tasks; Half-Space Trees are for unsupervised learning tasks.
 4.
Mulcross produces anomaly clusters rather than scattered anomalies. Detecting anomaly clusters is more effective using a low ψ setting when the multi-dimensional version of MassAD is employed.
 5.
Zuo and Serfling (2000) define half-space data depth (HD) of a point x in \(\mathcal{R}^{u}\) w.r.t. a probability measure P on \(\mathcal{R}^{u}\) as the minimum probability mass carried by any closed half-space containing x:
$$HD(x;P) = \mathit{inf}\bigl\{P(H): H \mbox{ a closed half-space}, x \in H\bigr\}, \quad x \in \mathcal{R}^u $$
In the language of data depth, the one-dimensional mass estimation may be interpreted as a kind of average probability mass of half-spaces containing x, weighted by the mass covered by each half-space. But the one-dimensional mass estimation defined in (1) allows mass to be computed by a summation of n−1 components from the given data set of size n, whereas data depth does not. In addition, our implementation of multi-dimensional mass estimation using a tree structure with axis-parallel splits cannot be interpreted using any of the constructs employed by data depth.
 6.
Our experiments indicate that using the entire data set to estimate data depth or local data depth produces worse results than using an ensemble approach. This result is shown in the Appendix.
Acknowledgements
This work is supported by the Air Force Research Laboratory, under agreement numbers FA23860914014, FA23861014052 and FA23861114112. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The anonymous reviewers have provided many helpful comments to improve the clarity of this paper.
References
 Achtert, E., Kriegel, H.-P., & Zimek, A. (2008). ELKI: a software system for evaluation of subspace clustering algorithms. In Proceedings of the 20th international conference on scientific and statistical database management (pp. 580–585).
 Agostinelli, C., & Romanazzi, M. (2011). Local depth. Journal of Statistical Planning and Inference, 141, 817–830.
 Aloupis, G. (2006). Geometric measures of data depth. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 147–158.
 Asuncion, A., & Newman, D. (2007). UCI machine learning repository.
 Bay, S. D., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of ACM SIGKDD (pp. 29–38).
 Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: identifying density-based local outliers. In Proceedings of ACM SIGMOD (pp. 93–104).
 Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines.
 Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley.
 Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of ACM SIGKDD (pp. 226–231).
 Giacinto, G., & Roli, F. (2005). Instance-based relevance feedback for image retrieval. In Advances in NIPS (pp. 489–496).
 He, J., Li, M., Zhang, H., Tong, H., & Zhang, C. (2004). Manifold-ranking based image retrieval. In Proceedings of ACM Multimedia (pp. 9–16).
 Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In Proceedings of IEEE ICDM (pp. 413–422).
 Liu, R., Parelius, J. M., & Singh, K. (1999). Multivariate analysis by data depth. The Annals of Statistics, 27(3), 783–840.
 Quinlan, J. (1993). C4.5: programs for machine learning. San Mateo: Morgan Kaufmann.
 Rocke, D. M., & Woodruff, D. L. (1996). Identification of outliers in multivariate data. Journal of the American Statistical Association, 91(435), 1047–1061.
 Ruts, I., & Rousseeuw, P. (1996). Computing depth contours of bivariate point clouds. Computational Statistics and Data Analysis, 23(1), 153–168.
 Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., & Platt, J. C. (2000). Support vector method for novelty detection. In Advances in NIPS (pp. 582–588).
 Simonoff, J. S. (1996). Smoothing methods in statistics. Berlin: Springer.
 Ting, K. M., & Wells, J. R. (2010). Multi-dimensional mass estimation and mass-based clustering. In Proceedings of IEEE ICDM (pp. 511–520).
 Ting, K. M., Zhou, G.-T., Liu, F. T., & Tan, S. C. (2010). Mass estimation and its applications. In Proceedings of ACM SIGKDD (pp. 989–998).
 Tukey, J. W. (1975). Mathematics and picturing data. In Proceedings of the international congress on mathematics (Vol. 2, pp. 525–531).
 Vapnik, V. N. (2000). The nature of statistical learning theory (2nd ed.). Berlin: Springer.
 Zhang, R., & Zhang, Z. (2006). BALAS: empirical Bayesian learning in the relevance feedback for image retrieval. Image and Vision Computing, 24(3), 211–223.
 Zhou, G.-T., Ting, K. M., Liu, F. T., & Yin, Y. (2012). Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognition, 45, 1707–1720.
 Zhou, Z.-H., Chen, K.-J., & Dai, H.-B. (2006). Enhancing relevance feedback in image retrieval using unlabeled data. ACM Transactions on Information Systems, 24(2), 219–244.
 Zhou, Z.-H., & Dai, H.-B. (2006). Query-sensitive similarity measure for content-based image retrieval. In Proceedings of IEEE ICDM (pp. 1211–1215).
 Zuo, Y., & Serfling, R. (2000). General notions of statistical depth function. The Annals of Statistics, 28, 461–482.