Abstract
The theoretical discussion in the previous chapter establishes that the error of an outlier detector can be decomposed into the squared bias and the variance. Ensemble methods attempt to reduce the overall error by reducing either the squared bias or the variance.
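For reference, the decomposition being invoked can be written in generic notation (the symbols used in the previous chapter may differ slightly): for a point X with unknown ground-truth score f(X) and detector output g(X), and with expectations taken over the random choice of training data,
\[
\mathbb{E}\big[(g(X) - f(X))^2\big] \;=\; \underbrace{\big(\mathbb{E}[g(X)] - f(X)\big)^2}_{\text{squared bias}} \;+\; \underbrace{\mathbb{E}\big[\big(g(X) - \mathbb{E}[g(X)]\big)^2\big]}_{\text{variance}}.
\]
Variance-reduction ensembles leave the first term largely untouched and shrink the second by averaging over multiple randomized detectors.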
Bagging goes a ways toward making a silk purse out of a sow’s ear, especially if the sow’s ear is twitchy.
Leo Breiman
Notes
1. An analytical explanation for data-centric variance reduction is provided in Sect. 3.3.1.
2. For example, in the case of Fig. 3.3c, each individual detector is run on a smaller data set with a correspondingly adjusted value of k. A detector built on this smaller data set will have larger variance.
3. An example of such an adjusted k-nearest neighbor detector is discussed in [6]. In this case, when subsamples of the data are drawn at sample fraction f, the value of k is also adjusted with fraction f. In such a case, the bias becomes more predictable and increases only slightly at smaller subsample sizes.
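A minimal sketch of this kind of adjustment, assuming a standard exact k-NN distance score and using illustrative function and parameter names of our own (not the reference implementation of [6]):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def subsampled_knn_scores(X, k=50, f=0.1, seed=None):
    """Exact k-NN outlier scores computed against a subsample of fraction f,
    with k scaled down by the same fraction."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = rng.choice(n, size=max(int(f * n), 2), replace=False)
    k_adj = max(int(round(k * f)), 1)        # adjust k with the sample fraction
    k_adj = min(k_adj, len(sample))          # guard against very small subsamples
    nn = NearestNeighbors(n_neighbors=k_adj).fit(X[sample])
    dist, _ = nn.kneighbors(X)               # distances from every point to the subsample
    # Points that happen to lie inside the subsample match themselves at distance
    # zero; a fuller implementation would exclude such self-matches.
    return dist[:, -1]                       # k_adj-th nearest-neighbor distance as score
```

Averaging such scores over many independently drawn subsamples is what produces the variance-reduction effect discussed in this chapter.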
4. For example, if a clustering-based detector is used, some of the normal clusters might persistently disappear at small values of n, each time a training data set of size n is drawn from the base distribution. This will cause bias.
5. Even for simple functions such as computing the mean of a set of data points, the standard deviation decreases as the inverse of the square root of the number of data points. Complex functions (such as those computed in outlier detection) generally have a much larger standard deviation.
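As a concrete instance of this scaling: for n i.i.d. points with standard deviation \(\sigma\), the standard deviation of the sample mean is
\[
\operatorname{SD}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \;=\; \frac{\sigma}{\sqrt{n}},
\]
so quadrupling the data only halves the fluctuation; outlier scores are far more complex functions of the data and typically fluctuate considerably more at the same sample size.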
6. The description in this chapter is repeated almost identically in Sect. 6.3.1.3 of Chap. 6. This is because this detector is important both from the perspective of variance-reduction methods and from that of base detectors. The description is repeated in both places for the benefit of readers who have electronic access to only one of the chapters of this book.
7. Although the same dimension may be used to split a node along a path of the tree, this becomes increasingly unlikely with increasing dimensionality. Therefore, in very high-dimensional data, the outlier score is roughly equal to the number of dimensions required to isolate a point with random splits. For a subsample of 256 points, roughly 8 dimensions are required to isolate a typical point, but only 4 or 5 dimensions might be sufficient for an outlier. The key lies in the power of the ensemble, because at least some of the components will isolate the point in a local subspace of low dimensionality.
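The rough arithmetic behind these numbers, under the balanced-split approximation implicit in the note, is
\[
\text{splits needed to isolate a typical point} \;\approx\; \log_2 \psi \;=\; \log_2 256 \;=\; 8
\]
for a subsample of size \(\psi = 256\), whereas a point lying in a sparse region may already be isolated after 4 or 5 splits and therefore receives a noticeably shorter path length.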
8. Perhaps the claims in [73] were motivated by the superior performance [41] of the isolation forest on smaller samples from the Mulcross data generator [58]. This is caused by a swamping effect of other data instances on the algorithm. Another related behavior is masking. However, the claims in [41] are very specific to a particular data set, and it is impossible to predict the size of the base data at which phenomena like swamping or masking might occur.
9. The (roughly similar) boxplots show random variations.
10. The prediction is even rougher for LOF because of reachability smoothing and the quirky harmonic normalization.
11. For example, if we have an anomalous cluster (i.e., a contaminant) and a normal cluster, the bias at a particular subsample size and value of k would depend on the relative size of the anomalous cluster.
12. Perhaps it is not so surprising. The rare class often has a very different mean from the normal class (and, by implication, from the full data). Features in classification data tend to have high Fisher's scores [8], indicating at least partial separability on individual dimensions. This separability is sharpened over multiple dimensions. The increasing tendency of tiny classes to be separable with increasing dimensionality is also well known [43, 48]. Even partial separability can often guarantee a decent AUC for a Pareto-extreme detector. See also the comments on real benchmarks on p. 267.
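For reference, one common form of the Fisher score of a single feature j for the two groups (rare versus normal) is
\[
F_j \;=\; \frac{\big(\mu_{j,\text{rare}} - \mu_{j,\text{normal}}\big)^2}{\sigma_{j,\text{rare}}^2 + \sigma_{j,\text{normal}}^2},
\]
where the \(\mu\) and \(\sigma^2\) terms denote per-group means and variances; a large value indicates that the rare class is at least partially separable along that dimension alone.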
13. In high-dimensional space, it is often more meaningful and general to talk of data manifolds describing the distribution structure, rather than of clusters.
14. The work in [66] states that the result in [6] is based on “stretching” results from density estimation. This is incorrect, because the notion of density estimation is not discussed in [6]. It is recognized that the proportional adjustment factor is a heuristic one and differs slightly between the average k-NN detector and the exact k-NN detector, especially at small values of k. For some algorithms, such as LOF, the exact adjustment factor is even more difficult to compute analytically.
15. Another example of a less visible design choice is that parametrization is generally necessary to create flexible learning methods in the supervised setting. However, parameter-free methods are highly desirable in the unsupervised setting. Unfortunately, this comes at the price that parameter-free methods often work well only at specific data sizes because of their inflexible design. Non-monotonicity in performance with increasing data size is also common for data types like graphs and networks, where controlling performance as the data grows is hard.
16. The Bayes optimal error rate is the smallest error that is theoretically achievable by any classifier on a particular data distribution. In other words, the Bayes error rate is the irreducible error caused by the vagaries of the underlying data distribution.
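In the two-class case this can be written explicitly: if \(P(y = c \mid x)\) denotes the true class posterior, then
\[
\mathrm{Err}_{\text{Bayes}} \;=\; \mathbb{E}_{x}\big[\, 1 - \max_{c} P(y = c \mid x) \,\big],
\]
and no classifier, however well trained, can achieve a smaller expected error on data drawn from this distribution.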
17. The LOF paper does suggest the use of k-distinct-distances as a possible fix for this problem. The implementation from the LMU group that proposed LOF [74] also allows \(\infty \) scores. However, this issue is only an extreme case of a pervasive problem with LOF, which arises whenever k data points are close together by chance at small values of k.
18. This constraint is essential for subsampling but not quite as essential for bagging. Even for a data set with a single point, one can create a bagged data set containing between 50 and 1000 points by re-sampling it. However, oversampling a data set does not provide significant advantages in diversity, and therefore we maintained exactly the same approach as used in subsampling by constraining the maximum size of the bag to the original data set size.
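A minimal sketch of the constrained bag-drawing step described here, with illustrative (not the book's) function and parameter names:

```python
import numpy as np

def draw_constrained_bag(X, min_size=50, max_size=1000, seed=None):
    """Draw one bagged training set by sampling with replacement, choosing a
    target size between min_size and max_size but never exceeding the size
    of the original data set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    target = int(rng.integers(min_size, max_size + 1))  # candidate bag size
    size = min(target, n)                                # cap at the original data set size
    idx = rng.integers(0, n, size=size)                  # indices drawn with replacement
    return X[idx]
```

Without the final cap, re-sampling could inflate even a tiny data set to an arbitrary size, which (as the note observes) would add little diversity.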
19. As we will see later, these attempts to create more stable combination methods were largely unsuccessful, because the results did not seem to show a significant improvement over the mean and in fact worsened in many cases. Nevertheless, we include these experiments because there are some useful lessons to be learned from them. Furthermore, the differences were small enough that we do not consider these results to be definitive in terms of relative performance.
20. We used geometric subsampling because it was one of the best-performing variance-reduction methods. However, the basic trends were quite similar across other types of subsampling.
21.
22.
References
C. C. Aggarwal. Outlier Analysis, Second Edition, Springer, 2017.
C. C. Aggarwal and P. S. Yu. Outlier Detection in Graph Streams, ICDE Conference, 2011.
C. C. Aggarwal. Outlier Ensembles: Position Paper, ACM SIGKDD Explorations, 14(2), pp. 49–58, December, 2012.
C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast Algorithms for Projected Clustering. ACM SIGMOD Conference, 1999.
C. C. Aggarwal and P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, 2000.
C. C. Aggarwal and S. Sathe. Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17(1), June 2015.
C. C. Aggarwal. Recommender Systems: The Textbook, Springer, 2016. [Chapter 6 on Ensemble-Based Systems]
C. C. Aggarwal. Data Mining: The Textbook, Springer, 2015.
C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
C. C. Aggarwal and P. S. Yu. Outlier Detection in High Dimensional Data, ACM SIGMOD Conference, 2001.
F. Angiulli, C. Pizzuti. Fast outlier detection in high dimensional spaces, PKDD Conference, 2002.
E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, 36(1), pp. 1–38, 1998.
S. Bay. Nearest Neighbor Classification from Multiple Feature Subsets. Intelligent Data Analysis, 2(3), pp. 191–209, 1999.
G. Biau, F. Cerou, and A. Guyader. On the Rate of Convergence of the Bagged Nearest Neighbor Estimate. Journal of Machine Learning Research, 11, pp. 687–712, 2010.
L. Breiman. Random Forests. Machine Learning, 45(1), pp. 5–32, 2001.
L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123–140, 1996.
M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-based Local Outliers, ACM SIGMOD Conference, 2000.
G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1), pp. 5–20, 2005.
R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute Bagging: Improving Accuracy of Classifier Ensembles by using Random Feature Subsets. Pattern Recognition, 36(6), pp. 1291–1302, 2003.
P. Buhlmann. Bagging, Subagging and Bragging for Improving some Prediction Algorithms, Recent advances and trends in nonparametric statistics, Elsevier, 2003.
P. Buhlmann, B. Yu. Analyzing Bagging. Annals of Statistics, pp. 927–961, 2002.
A. Buja and W. Stuetzle. Observations on Bagging. Statistica Sinica, 16(2), 323, 2006.
J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier Detection with Autoencoder Ensembles. SIAM Conference on Data Mining, 2017.
A. Criminisi, J. Shotton, and E. Konukoglu. Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Microsoft Research Cambridge, Tech. Rep. MSRTR-2011-114, 5(6), 12, 2011.
M. Datar, N. Immorlica, P. Indyk, V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. ACM Annual Symposium on Computational Geometry, pp. 253–262, 2004.
M. Denil, D. Matheson, and N. De Freitas. Narrowing the Gap: Random Forests In Theory and in Practice. ICML Conference, pp. 665–673, 2014.
C. Desir, S. Bernard, C. Petitjean, and L. Heutte. One Class Random Forests. Pattern Recognition, 46(12), pp. 3490–3506, 2013.
T. Dietterich. Ensemble Methods in Machine Learning, First International Workshop on Multiple Classifier Systems, 2000.
A. Emmott, S. Das, T. Dietterich, A. Fern, and W. Wong. Systematic Construction of Anomaly Detection Benchmarks from Real Data. arXiv:1503.01158, 2015. https://arxiv.org/abs/1503.01158
M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? The Journal of Machine Learning Research, 15(1), pp. 3133–3181, 2014.
J. Friedman, and P. Hall. On bagging and nonlinear estimation. Journal of statistical planning and inference, 137(3), pp. 669–683, 2007.
P. Geurts. Variance Reduction Techniques, Chapter 4 of unpublished PhD Thesis entitled “Contributions to decision tree induction: bias/variance tradeoff and time series classification.” University of Liege, Belgium, 2002. http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2002/Geu02/
M. Grill and T. Pevny. Learning Combination of Anomaly Detectors for Security Domain. Computer Networks, 2016.
S. Guha, N. Mishra, G. Roy, and O. Schrijver. Robust Random Cut Forest Based Anomaly Detection On Streams. ICML Conference, pp. 2712–2721, 2016.
Z. He, S. Deng and X. Xu. A Unified Subspace Outlier Ensemble Framework for Outlier Detection, Advances in Web Age Information Management, 2005.
T. K. Ho. Random decision forests. Third International Conference on Document Analysis and Recognition, 1995. Extended version appears as “The random subspace method for constructing decision forests” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), pp. 832–844, 1998.
T. K. Ho. Nearest Neighbors in Random Subspaces. Lecture Notes in Computer Science, Vol. 1451, pp. 640–648, Proceedings of the Joint IAPR Workshops SSPR’98 and SPR’98, 1998. http://link.springer.com/chapter/10.1007/BFb0033288
F. Keller, E. Muller, K. Bohm. HiCS: High-Contrast Subspaces for Density-based Outlier Ranking, IEEE ICDE Conference, 2012.
M. Kopp, T. Pevny, and M. Holena. Interpreting and Clustering Outliers with Sapling Random Forests. Information Technologies Applications and Theory Workshops, Posters, and Tutorials (ITAT), 2014.
A. Lazarevic, and V. Kumar. Feature Bagging for Outlier Detection, ACM KDD Conference, 2005.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest. ICDM Conference, 2008. Extended version appears as “Isolation-based Anomaly Detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 3, 2012.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. On Detecting Clustered Anomalies using SCiForest. Machine Learning and Knowledge Discovery in Databases, pp. 274–290, Springer, 2010.
C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval, Cambridge University Press, 2008. [Also see Exercises 14.16 and 14.17]
G. Martinez-Munoz and A. Suarez. Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognition, 43, pp. 143–152, 2010.
P. Melville, R. Mooney. Creating Diversity in Ensembles Using Artificial Data. Information Fusion, 6(1), 2005.
B. Micenkova, B. McWilliams, and I. Assent. Learning Outlier Ensembles: The Best of Both Worlds Supervised and Unsupervised. ACM SIGKDD Workshop on Outlier Detection and Description, ODD, 2014.
B. Micenkova, B. McWilliams, and I. Assent. Learning Representations for Outlier Detection on a Budget. arXiv preprint arXiv:1507.08104, 2015.
F. Moosmann, B. Triggs, and F. Jurie. Fast Discriminative Visual Codebooks using Randomized Clustering Forests. Neural Information Processing Systems, pp. 985–992, 2006.
R. Motwani and P. Raghavan. Randomized Algorithms. Chapman and Hall/CRC, 2012.
E. Muller, M. Schiffer, and T. Seidl. Statistical Selection of Relevant Subspace Projections for Outlier Ranking. ICDE Conference, pp. 434–445, 2011.
E. Muller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm. Outlier Ranking via Subspace Analysis in Multiple Views of the Data, ICDM Conference, 2012.
H. Nguyen, H. Ang, and V. Gopalakrishnan. Mining Ensembles of Heterogeneous Detectors on Random Subspaces, DASFAA, 2010.
H. Nguyen, E. Muller, J. Vreeken, F. Keller, and K. Bohm. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. SIAM International Conference on Data Mining (SDM), pp. 198–206, 2013.
S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos, LOCI: Fast outlier detection using the local correlation integral, ICDE Conference, 2003.
T. Pevny. Loda: Lightweight On-line Detector of Anomalies. Machine Learning, 102(2), pp. 275–304, 2016.
J. Pickands. Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), pp. 119–131, 1975.
J. Pickands. Multivariate extreme value distributions. Proceedings of the 43rd Session International Statistical Institute, 2, pp. 859–878, 1981.
D. Rocke and D. Woodruff. Identification of Outliers in Multivariate Data. Journal of the American Statistical Association 91, 435, pp. 1047–1061, 1996.
L. Rokach. Pattern classification using ensemble methods, World Scientific Publishing Company, 2010.
R. Samworth. Optimal Weighted Nearest Neighbour Classifiers. The Annals of Statistics, 40(5), pp. 2733–2763, 2012.
S. Sathe and C. Aggarwal. Subspace Outlier Detection in Linear Time with Randomized Hashing. ICDM Conference, 2016.
G. Seni and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, Synthesis Lectures in Data Mining and Knowledge Discovery, Morgan and Claypool, 2010.
B. M. Steele. Exact Bootstrap k-Nearest Neighbor Learners. Machine Learning, 74(3), pp. 235–255, 2009.
M. Sugiyama and K. Borgwardt. Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems, pp. 467–475, 2013.
S. C. Tan, K. M. Ting, and T. F. Liu. Fast Anomaly Detection for Streaming Data. IJCAI Conference, 2011.
K. M. Ting, T. Washio, J. Wells, and S. Aryal. Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Machine Learning Journal, August 2016.
K. M. Ting, G. T. Zhou, F. T. Liu, and S. C. Tan. Mass Estimation and its Applications. ACM KDD Conference, pp. 989–998, 2010.
K. M. Ting, Y. Zhu, M. Carman, and Y. Zhu. Overcoming Key Weaknesses of Distance-Based Neighbourhood Methods using a Data Dependent Dissimilarity Measure. ACM KDD Conference, 2016.
Y. Wang, S. Parthasarathy, and S. Tatikonda. Locality sensitive outlier detection: a ranking driven approach. ICDE Conference, pp. 410–421, 2011.
D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–72, 1997.
K. Wu, K. Zhang, W. Fan, A. Edwards, and P. Yu. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection. IEEE ICDM Conference, pp. 600–609, 2014.
Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC Press, 2012.
A. Zimek, M. Gaudet, R. Campello, J. Sander. Subsampling for efficient and effective unsupervised outlier detection ensembles, KDD Conference, 2013.