Abstract
The theoretical discussion in the previous chapter establishes that the error of an outlier detector can be decomposed into the squared bias and the variance. Ensemble methods attempt to reduce the overall error by reducing either the squared bias or the variance.
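For reference, the decomposition being invoked can be written in generic notation (the symbols used in the previous chapter may differ slightly): for a point X with unknown ground-truth score f(X) and detector output g(X), and with expectations taken over the random choice of training data,
\[
\mathbb{E}\big[(g(X) - f(X))^2\big] \;=\; \underbrace{\big(\mathbb{E}[g(X)] - f(X)\big)^2}_{\text{squared bias}} \;+\; \underbrace{\mathbb{E}\big[\big(g(X) - \mathbb{E}[g(X)]\big)^2\big]}_{\text{variance}}.
\]
Variance-reduction ensembles leave the first term largely untouched and shrink the second by averaging over multiple randomized detectors.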
Bagging goes a ways toward making a silk purse out of a sow’s ear, especially if the sow’s ear is twitchy.
Leo Breiman
Notes
1. An analytical explanation for data-centric variance reduction is provided in Sect. 3.3.1.
2. For example, in the case of Fig. 3.3c, each individual detector is run on a smaller data set with a correspondingly adjusted value of k. A detector built on this smaller data set will have larger variance.
3. An example of such an adjusted k-nearest neighbor detector is discussed in [6]. In this case, when subsamples of the data are drawn at sample fraction f, the value of k is also adjusted with fraction f. In such a case, the bias becomes more predictable and increases only slightly at smaller subsample sizes.
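A minimal sketch of this kind of adjustment, assuming a standard exact k-NN distance score and using illustrative function and parameter names of our own (not the reference implementation of [6]):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def subsampled_knn_scores(X, k=50, f=0.1, seed=None):
    """Exact k-NN outlier scores computed against a subsample of fraction f,
    with k scaled down by the same fraction."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = rng.choice(n, size=max(int(f * n), 2), replace=False)
    k_adj = max(int(round(k * f)), 1)        # adjust k with the sample fraction
    k_adj = min(k_adj, len(sample))          # guard against very small subsamples
    nn = NearestNeighbors(n_neighbors=k_adj).fit(X[sample])
    dist, _ = nn.kneighbors(X)               # distances from every point to the subsample
    # Points that happen to lie inside the subsample match themselves at distance
    # zero; a fuller implementation would exclude such self-matches.
    return dist[:, -1]                       # k_adj-th nearest-neighbor distance as score
```

Averaging such scores over many independently drawn subsamples is what produces the variance-reduction effect discussed in this chapter.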
4. For example, if a clustering-based detector is used, some of the normal clusters might persistently disappear at small values of n, each time a training data set of size n is drawn from the base distribution. This will cause bias.
5. Even for simple functions such as computing the mean of a set of data points, the standard deviation decreases as the inverse of the square root of the number of data points. Complex functions (such as those computed in outlier detection) generally have a much larger standard deviation.
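As a concrete instance of this scaling: for n i.i.d. points with standard deviation \(\sigma\), the standard deviation of the sample mean is
\[
\operatorname{SD}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \;=\; \frac{\sigma}{\sqrt{n}},
\]
so quadrupling the data only halves the fluctuation; outlier scores are far more complex functions of the data and typically fluctuate considerably more at the same sample size.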
6. The description in this chapter is repeated almost identically in Sect. 6.3.1.3 of Chap. 6. This is because this detector is important both from the perspective of variance-reduction methods and from that of base detectors. The description is repeated in both places for the benefit of readers who have electronic access to only one of the chapters of this book.
7. Although the same dimension may be used to split a node along a path of the tree, this becomes increasingly unlikely with increasing dimensionality. Therefore, in very high-dimensional data, the outlier score is roughly equal to the number of dimensions required to isolate a point with random splits. For a subsample of 256 points, roughly 8 dimensions are required to isolate a typical point, but only 4 or 5 dimensions might be sufficient for an outlier. The key lies in the power of the ensemble, because at least some of the components will isolate the point in a local subspace of low dimensionality.
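The rough arithmetic behind these numbers, under the balanced-split approximation implicit in the note, is
\[
\text{splits needed to isolate a typical point} \;\approx\; \log_2 \psi \;=\; \log_2 256 \;=\; 8
\]
for a subsample of size \(\psi = 256\), whereas a point lying in a sparse region may already be isolated after 4 or 5 splits and therefore receives a noticeably shorter path length.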
8. Perhaps the claims in [73] were motivated by the superior performance [41] of the isolation forest on smaller samples from the Mulcross data generator [58]. This is caused by a swamping effect of other data instances on the algorithm. Another related behavior is masking. However, the claims in [41] are very specific to a particular data set, and it is impossible to predict the size of the base data at which phenomena like swamping or masking might occur.
9. The (roughly similar) boxplots show random variations.
10. The prediction is even rougher for LOF because of reachability smoothing and the quirky harmonic normalization.
11. For example, if we have an anomalous cluster (i.e., a contaminant) and a normal cluster, the bias at a particular subsample size and value of k would depend on the relative size of the anomalous cluster.
12. Perhaps it is not so surprising. The rare class often has a very different mean from the normal class (and, by implication, from the full data). Features in classification data tend to have high Fisher's scores [8], indicating at least partial separability on individual dimensions. This separability is sharpened over multiple dimensions. The increasing tendency of tiny classes to be separable with increasing dimensionality is also well known [43, 48]. Even partial separability can often guarantee a decent AUC for a Pareto-extreme detector. See also the comments on real benchmarks on p. 267.
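For reference, one common form of the Fisher score of a single feature j for the two groups (rare versus normal) is
\[
F_j \;=\; \frac{\big(\mu_{j,\text{rare}} - \mu_{j,\text{normal}}\big)^2}{\sigma_{j,\text{rare}}^2 + \sigma_{j,\text{normal}}^2},
\]
where the \(\mu\) and \(\sigma^2\) terms denote per-group means and variances; a large value indicates that the rare class is at least partially separable along that dimension alone.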
13. In high-dimensional space, it is often more meaningful and general to talk of data manifolds describing the distribution structure, rather than of clusters.
14. The work in [66] states that the result in [6] is based on “stretching” results from density estimation. This is incorrect, because the notion of density estimation is not discussed in [6]. It is recognized that the proportional adjustment factor is a heuristic one and differs slightly between the average k-NN detector and the exact k-NN detector, especially at small values of k. For some algorithms, such as LOF, the exact adjustment factor is even more difficult to compute analytically.
15. Another example of a less visible design choice is that parametrization is generally necessary to create flexible learning methods in the supervised setting. However, parameter-free methods are highly desirable in the unsupervised setting. Unfortunately, this comes at the price that parameter-free methods often work well only at specific data sizes because of their inflexible design. Non-monotonicity in performance with increasing data size is also common for data types like graphs and networks, where controlling performance as the data grows is hard.
16. The Bayes optimal error rate is the smallest error that is theoretically achievable by any classifier on a particular data distribution. In other words, the Bayes error rate is the irreducible error caused by the vagaries of the underlying data distribution.
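In the two-class case this can be written explicitly: if \(P(y = c \mid x)\) denotes the true class posterior, then
\[
\mathrm{Err}_{\text{Bayes}} \;=\; \mathbb{E}_{x}\big[\, 1 - \max_{c} P(y = c \mid x) \,\big],
\]
and no classifier, however well trained, can achieve a smaller expected error on data drawn from this distribution.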
17. The LOF paper does suggest the use of k-distinct-distances as a possible fix for this problem. The implementation from the LMU group that proposed LOF [74] also allows \(\infty \) scores. However, this issue is only an extreme case of a pervasive problem with LOF, which arises whenever k data points are close together by chance at small values of k.
18. This constraint is essential for subsampling but not quite as essential for bagging. Even for a data set with a single point, one can create a bagged data set containing between 50 and 1000 points by re-sampling it. However, oversampling a data set does not provide significant advantages in diversity, and therefore we maintained exactly the same approach as used in subsampling by constraining the maximum size of the bag to the original data set size.
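A minimal sketch of the constrained bag-drawing step described here, with illustrative (not the book's) function and parameter names:

```python
import numpy as np

def draw_constrained_bag(X, min_size=50, max_size=1000, seed=None):
    """Draw one bagged training set by sampling with replacement, choosing a
    target size between min_size and max_size but never exceeding the size
    of the original data set."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    target = int(rng.integers(min_size, max_size + 1))  # candidate bag size
    size = min(target, n)                                # cap at the original data set size
    idx = rng.integers(0, n, size=size)                  # indices drawn with replacement
    return X[idx]
```

Without the final cap, re-sampling could inflate even a tiny data set to an arbitrary size, which (as the note observes) would add little diversity.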
19. As we will see later, these attempts to create more stable combination methods were largely unsuccessful, because the results did not seem to show a significant improvement over the mean and in fact worsened in many cases. Nevertheless, we include these experiments because there are some useful lessons to be learned from them. Furthermore, the differences were small enough that we do not consider these results to be definitive in terms of relative performance.
20. We used geometric subsampling because it was one of the best-performing variance-reduction methods. However, the basic trends were quite similar across other types of subsampling.
21.
22.
References
C. C. Aggarwal. Outlier Analysis, Second Edition, Springer, 2017.
C. C. Aggarwal and P. S. Yu. Outlier Detection in Graph Streams, ICDE Conference, 2011.
C. C. Aggarwal. Outlier Ensembles: Position Paper, ACM SIGKDD Explorations, 14(2), pp. 49–58, December, 2012.
C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast Algorithms for Projected Clustering. ACM SIGMOD Conference, 1999.
C. C. Aggarwal and P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, 2000.
C. C. Aggarwal and S. Sathe. Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17(1), June 2015.
C. C. Aggarwal. Recommender Systems: The Textbook, Springer, 2016. [Chapter 6 on Ensemble-Based Systems]
C. C. Aggarwal. Data Mining: The Textbook, Springer, 2015.
C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.
C. C. Aggarwal and P. S. Yu. Outlier Detection in High Dimensional Data, ACM SIGMOD Conference, 2001.
F. Angiulli, C. Pizzuti. Fast outlier detection in high dimensional spaces, PKDD Conference, 2002.
E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, 36(1), pp. 1–38, 1998.
S. Bay. Nearest Neighbor Classification from Multiple Feature Subsets. Intelligent Data Analysis, 2(3), pp. 191–209, 1999.
G. Biau, F. Cerou, and A. Guyader. On the Rate of Convergence of the Bagged Nearest Neighbor Estimate. Journal of Machine Learning Research, 11, pp. 687–712, 2010.
L. Breiman. Random Forests. Machine Learning, 45(1), pp. 5–32, 2001.
L. Breiman. Bagging Predictors. Machine Learning, 24(2), pp. 123–140, 1996.
M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-based Local Outliers, ACM SIGMOD Conference, 2000.
G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1), pp. 5–20, 2005.
R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute Bagging: Improving Accuracy of Classifier Ensembles by using Random Feature Subsets. Pattern Recognition, 36(6), pp. 1291–1302, 2003.
P. Buhlmann. Bagging, Subagging and Bragging for Improving some Prediction Algorithms, Recent advances and trends in nonparametric statistics, Elsevier, 2003.
P. Buhlmann, B. Yu. Analyzing Bagging. Annals of Statistics, pp. 927–961, 2002.
A. Buja and W. Stuetzle. Observations on Bagging. Statistica Sinica, 16(2), 323, 2006.
J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier Detection with Autoencoder Ensembles. SIAM Conference on Data Mining, 2017.
A. Criminisi, J. Shotton, and E. Konukoglu. Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Microsoft Research Cambridge, Tech. Rep. MSRTR-2011-114, 5(6), 12, 2011.
M. Datar, N. Immorlica, P. Indyk, V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. ACM Annual Symposium on Computational Geometry, pp. 253–262, 2004.
M. Denil, D. Matheson, and N. De Freitas. Narrowing the Gap: Random Forests In Theory and in Practice. ICML Conference, pp. 665–673, 2014.
C. Desir, S. Bernard, C. Petitjean, and L. Heutte. One Class Random Forests. Pattern Recognition, 46(12), pp. 3490–3506, 2013.
T. Dietterich. Ensemble Methods in Machine Learning, First International Workshop on Multiple Classifier Systems, 2000.
A. Emmott, S. Das, T. Dietterich, A. Fern, and W. Wong. Systematic Construction of Anomaly Detection Benchmarks from Real Data. arXiv:1503.01158, 2015. https://arxiv.org/abs/1503.01158
M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? The Journal of Machine Learning Research, 15(1), pp. 3133–3181, 2014.
J. Friedman, and P. Hall. On bagging and nonlinear estimation. Journal of statistical planning and inference, 137(3), pp. 669–683, 2007.
P. Geurts. Variance Reduction Techniques, Chapter 4 of unpublished PhD Thesis entitled “Contributions to decision tree induction: bias/variance tradeoff and time series classification.” University of Liege, Belgium, 2002. http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2002/Geu02/
M. Grill and T. Pevny. Learning Combination of Anomaly Detectors for Security Domain. Computer Networks, 2016.
S. Guha, N. Mishra, G. Roy, and O. Schrijver. Robust Random Cut Forest Based Anomaly Detection On Streams. ICML Conference, pp. 2712–2721, 2016.
Z. He, S. Deng and X. Xu. A Unified Subspace Outlier Ensemble Framework for Outlier Detection, Advances in Web Age Information Management, 2005.
T. K. Ho. Random decision forests. Third International Conference on Document Analysis and Recognition, 1995. Extended version appears as “The random subspace method for constructing decision forests” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), pp. 832–844, 1998.
T. K. Ho. Nearest Neighbors in Random Subspaces. Lecture Notes in Computer Science, Vol. 1451, pp. 640–648, Proceedings of the Joint IAPR Workshops SSPR’98 and SPR’98, 1998. http://link.springer.com/chapter/10.1007/BFb0033288
F. Keller, E. Muller, K. Bohm. HiCS: High-Contrast Subspaces for Density-based Outlier Ranking, IEEE ICDE Conference, 2012.
M. Kopp, T. Pevny, and M. Holena. Interpreting and Clustering Outliers with Sapling Random Forests. Information Technologies Applications and Theory Workshops, Posters, and Tutorials (ITAT), 2014.
A. Lazarevic, and V. Kumar. Feature Bagging for Outlier Detection, ACM KDD Conference, 2005.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest. ICDM Conference, 2008. Extended version appears as “Isolation-based Anomaly Detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 3, 2012.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. On Detecting Clustered Anomalies using SCiForest. Machine Learning and Knowledge Discovery in Databases, pp. 274–290, Springer, 2010.
C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval, Cambridge University Press, 2008. [Also see Exercises 14.16 and 14.17]
G. Martinez-Munoz and A. Suarez. Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognition, 43, pp. 143–152, 2010.
P. Melville, R. Mooney. Creating Diversity in Ensembles Using Artificial Data. Information Fusion, 6(1), 2005.
B. Micenkova, B. McWilliams, and I. Assent. Learning Outlier Ensembles: The Best of Both Worlds Supervised and Unsupervised. ACM SIGKDD Workshop on Outlier Detection and Description, ODD, 2014.
B. Micenkova, B. McWilliams, and I. Assent. Learning Representations for Outlier Detection on a Budget. arXiv preprint arXiv:1507.08104, 2015.
F. Moosmann, B. Triggs, and F. Jurie. Fast Discriminative Visual Codebooks using Randomized Clustering Forests. Neural Information Processing Systems, pp. 985–992, 2006.
R. Motwani and P. Raghavan. Randomized Algorithms. Chapman and Hall/CRC, 2012.
E. Muller, M. Schiffer, and T. Seidl. Statistical Selection of Relevant Subspace Projections for Outlier Ranking. ICDE Conference, pp. 434–445, 2011.
E. Muller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm. Outlier Ranking via Subspace Analysis in Multiple Views of the Data, ICDM Conference, 2012.
H. Nguyen, H. Ang, and V. Gopalakrishnan. Mining Ensembles of Heterogeneous Detectors on Random Subspaces, DASFAA, 2010.
H. Nguyen, E. Muller, J. Vreeken, F. Keller, and K. Bohm. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. SIAM International Conference on Data Mining (SDM), pp. 198–206, 2013.
S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos, LOCI: Fast outlier detection using the local correlation integral, ICDE Conference, 2003.
T. Pevny. Loda: Lightweight On-line Detector of Anomalies. Machine Learning, 102(2), pp. 275–304, 2016.
J. Pickands. Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), pp. 119–131, 1975.
J. Pickands. Multivariate extreme value distributions. Proceedings of the 43rd Session International Statistical Institute, 2, pp. 859–878, 1981.
D. Rocke and D. Woodruff. Identification of Outliers in Multivariate Data. Journal of the American Statistical Association 91, 435, pp. 1047–1061, 1996.
L. Rokach. Pattern classification using ensemble methods, World Scientific Publishing Company, 2010.
R. Samworth. Optimal Weighted Nearest Neighbour Classifiers. The Annals of Statistics, 40(5), pp. 2733–2763, 2012.
S. Sathe and C. Aggarwal. Subspace Outlier Detection in Linear Time with Randomized Hashing. ICDM Conference, 2016.
G. Seni and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, Synthesis Lectures in Data Mining and Knowledge Discovery, Morgan and Claypool, 2010.
B. M. Steele. Exact Bootstrap k-Nearest Neighbor Learners. Machine Learning, 74(3), pp. 235–255, 2009.
M. Sugiyama and K. Borgwardt. Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems, pp. 467–475, 2013.
S. C. Tan, K. M. Ting, and T. F. Liu. Fast Anomaly Detection for Streaming Data. IJCAI Conference, 2011.
K. M. Ting, T. Washio, J. Wells, and S. Aryal. Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Machine Learning Journal, August 2016.
K. M. Ting, G. T. Zhou, F. T. Liu, and S. C. Tan. Mass Estimation and its Applications. ACM KDD Conference, pp. 989–998, 2010.
K. M. Ting, Y. Zhu, M. Carman, and Y. Zhu. Overcoming Key Weaknesses of Distance-Based Neighbourhood Methods using a Data Dependent Dissimilarity Measure. ACM KDD Conference, 2016.
Y. Wang, S. Parthasarathy, and S. Tatikonda. Locality sensitive outlier detection: a ranking driven approach. ICDE Conference, pp. 410–421, 2011.
D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–72, 1997.
K. Wu, K. Zhang, W. Fan, A. Edwards, and P. Yu. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection. IEEE ICDM Conference, pp. 600–609, 2014.
Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC Press, 2012.
A. Zimek, M. Gaudet, R. Campello, J. Sander. Subsampling for efficient and effective unsupervised outlier detection ensembles, KDD Conference, 2013.