
Variance Reduction in Outlier Ensembles


Abstract

The theoretical discussion in the previous chapter establishes that the error of an outlier detector can be decomposed into the squared bias and the variance. Ensemble methods attempt to reduce the overall error by reducing either the squared bias or the variance.
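For reference, the decomposition has the standard form below, written here in generic notation (the chapter's own symbols may differ): f(X) denotes the true outlier score of point X, g(X, D) the detector's output when trained on data set D, and the expectation is over random draws of D.

```latex
E_{\mathcal{D}}\big[\{g(X,\mathcal{D}) - f(X)\}^2\big]
  = \underbrace{\big\{E_{\mathcal{D}}[g(X,\mathcal{D})] - f(X)\big\}^2}_{\text{squared bias}}
  + \underbrace{E_{\mathcal{D}}\big[\{g(X,\mathcal{D}) - E_{\mathcal{D}}[g(X,\mathcal{D})]\}^2\big]}_{\text{variance}}
```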

Bagging goes a ways toward making a silk purse out of a sow’s ear, especially if the sow’s ear is twitchy.

Leo Breiman


Notes

  1. An analytical explanation for data-centric variance reduction is provided in Sect. 3.3.1.

  2. For example, in the case of Fig. 3.3c, each individual detector is run on a smaller data set with a correspondingly adjusted value of k. A detector trained on this smaller data set will have larger variance.

  3. An example of such an adjusted k-nearest neighbor detector is discussed in [6]. In this case, when subsamples of the data are drawn at sample fraction f, the value of k is scaled by the same fraction f. As a result, the bias becomes more predictable and increases only slightly at smaller subsample sizes (see the sketch below).
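To make the mechanics concrete, here is a minimal Python sketch of subsampling with a proportionally adjusted k for an exact k-NN distance detector, following the proportional rule attributed to [6] above. The function names and the parameter defaults (base_k, n_components, f) are illustrative assumptions, not the book's reference implementation.

```python
import numpy as np

def knn_outlier_scores(train, test, k):
    """Exact k-NN detector: score = distance to the k-th nearest training point."""
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    dists.sort(axis=1)
    return dists[:, k - 1]

def subsampled_ensemble_scores(data, base_k=50, n_components=100, f=0.1, seed=0):
    """Average exact k-NN scores over random subsamples, adjusting k by fraction f."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    k_adj = max(1, int(round(base_k * f)))        # k scaled by the sample fraction f
    scores = np.zeros(n)
    for _ in range(n_components):
        idx = rng.choice(n, size=int(f * n), replace=False)
        # Points that fall inside the subsample score against themselves;
        # a sketch can ignore this small distortion.
        scores += knn_outlier_scores(data[idx], data, k_adj)
    return scores / n_components                  # averaging reduces variance
```

With f = 0.1 and base_k = 50, each component detector uses k = 5 on a tenth of the data, so the expected locality of the k-NN distance stays roughly comparable across subsample sizes.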

  4. For example, if a clustering-based detector is used, some of the normal clusters might persistently disappear at small values of n each time a training data set of size n is drawn from the base distribution. This will cause bias.

  5. Even for simple functions such as computing the mean of a set of data points, the standard deviation decreases as the inverse square-root of the number of data points (recalled below). Complex functions (such as those computed in outlier detection) generally have a much larger standard deviation.
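The baseline fact invoked here: for n i.i.d. data points with variance \(\sigma^2\), the sample mean satisfies

```latex
\operatorname{Var}(\bar{X}) \;=\; \operatorname{Var}\!\Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i\Big) \;=\; \frac{\sigma^2}{n},
\qquad
\operatorname{sd}(\bar{X}) \;=\; \frac{\sigma}{\sqrt{n}} .
```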

  6. The description in this chapter is repeated almost identically in Sect. 6.3.1.3 of Chap. 6. This is because this detector is important both from the perspective of variance-reduction methods and from that of base detectors. The description is repeated in both places for the benefit of readers who have electronic access to only one of the chapters of this book.

  7. Although the same dimension may be used to split a node more than once along a path of the tree, this becomes increasingly unlikely with increasing dimensionality. Therefore, in very high-dimensional data, the outlier score is roughly equal to the number of dimensions required to isolate a point with random splits. For a subsample of 256 points, roughly 8 dimensions are required to isolate a typical point, but only 4 or 5 dimensions might be sufficient for an outlier. The key is the power of the ensemble, because at least some of the components will isolate the point in a local subspace of low dimensionality (see the sketch below).
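The following Python sketch shows the core isolation-tree recursion behind this discussion: a random dimension, a random split, and a path length that is short for easily isolated points. It is a stripped-down illustration of the mechanism in [41], not the reference implementation (which also normalizes by an expected path length); the subsample size of 256 and the height limit are common defaults used for illustration.

```python
import numpy as np

def isolation_path_length(point, data, rng, depth=0, max_depth=20):
    """Length of the random-split path needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.integers(data.shape[1])            # random splitting dimension
    lo, hi = data[:, dim].min(), data[:, dim].max()
    if lo == hi:                                 # no split possible on this dimension
        return depth
    split = rng.uniform(lo, hi)                  # random split value
    side = data[:, dim] < split
    subset = data[side] if point[dim] < split else data[~side]
    return isolation_path_length(point, subset, rng, depth + 1, max_depth)

def isolation_score(point, data, n_trees=100, subsample=256, seed=0):
    """Average path length over trees; shorter paths indicate stronger outliers."""
    rng = np.random.default_rng(seed)
    lengths = []
    for _ in range(n_trees):
        idx = rng.choice(len(data), size=min(subsample, len(data)), replace=False)
        lengths.append(isolation_path_length(point, data[idx], rng))
    return np.mean(lengths)
```

With a subsample of 256 points, a typical point needs about log2(256) = 8 random splits to be isolated, matching the arithmetic in the note above.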

  8. Perhaps the claims in [73] were motivated by the superior performance [41] of the isolation forest on smaller samples from the Mulcross data generator [58]. This is caused by a swamping effect of other data instances on the algorithm. Another related behavior is masking. However, the claims in [41] are very specific to a particular data set, and it is impossible to predict the size of the base data at which phenomena like swamping or masking might occur.

  9. The (roughly similar) boxplots show random variations.

  10. The prediction is even rougher for LOF because of reachability smoothing and the quirky harmonic normalization.

  11. For example, if we have an anomalous cluster (i.e., a contaminant) and a normal cluster, the bias at a particular subsample size and value of k would depend on the relative size of the anomalous cluster.

  12. Perhaps it is not so surprising. The rare class often has a very different mean from the normal class (and, by implication, from the full data). Features in classification data tend to have high Fisher scores [8], indicating at least partial separability on individual dimensions. This separability is sharpened over multiple dimensions. The increasing tendency of tiny classes to be separable with increasing dimensionality is also well known [43, 48]. Even partial separability can often guarantee a decent AUC for a Pareto-extreme detector. See also the comments on real benchmarks on p. 267.

  13. In high-dimensional space, it is often more meaningful and general to talk of data manifolds describing the distribution structure rather than clusters.

  14. The work in [66] states that the result in [6] is based on “stretching” results from density estimation. This is incorrect, because the notion of density estimation is not discussed in [6]. It is recognized that the proportional adjustment factor is a heuristic, and that it differs slightly between the average k-NN detector and the exact k-NN detector, especially at small values of k. For some algorithms, like LOF, the exact adjustment factor is even more difficult to compute analytically.

  15. Another example of a less visible design choice is that parametrization is generally necessary to create flexible learning methods in the supervised setting, whereas parameter-free methods are highly desirable in the unsupervised setting. Unfortunately, this comes at a price: because of their inflexible design, parameter-free methods often work well only at specific data sizes. Non-monotonic performance with increasing data size is also common for data types like graphs and networks, where controlling performance as the data grows is hard.

  16. The Bayes optimal error rate is the smallest possible error that is theoretically achievable by any classifier on a particular data distribution (stated formally below). In other words, the Bayes error rate is the irreducible error caused by the vagaries of the underlying data distribution.
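In symbols, for a class label y and feature vector X, the Bayes error rate is

```latex
\mathrm{Err}_{\text{Bayes}} \;=\; 1 \;-\; \mathbb{E}_{X}\Big[\max_{y}\; P(y \mid X)\Big],
```

i.e., the error of the classifier that always predicts the most probable class at each point X; no learner can do better on average.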

  17. The LOF paper does suggest the use of k-distinct-distances as a possible fix for this problem. The implementation from the LMU group that proposed LOF [74] also allows \(\infty \) scores. However, this issue is only an extreme case of a pervasive problem with LOF, which arises whenever k data points are close together by chance at small values of k.

  18. This constraint is essential for subsampling, but not quite as essential for bagging. Even from a data set with a single point, one can create a data set of between 50 and 1000 points by resampling it. However, oversampling a data set does not provide significant diversity advantages, and we therefore maintained exactly the same approach as in subsampling by constraining the maximum bag size to the original data set size (sketched below).
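A minimal sketch of the bag-construction rule described here, assuming the bag size is first drawn uniformly between 50 and 1000 (mirroring the subsampling setup the note refers to) and then capped at the original data set size; sampling is with replacement, as in standard bagging. The function name and the size bounds are illustrative assumptions.

```python
import numpy as np

def draw_bag(data, rng, min_size=50, max_size=1000):
    """Draw one bag with replacement; the bag size is capped at the data size."""
    size = int(rng.integers(min_size, max_size + 1))  # tentative bag size
    size = min(size, len(data))                       # the constraint discussed above
    idx = rng.choice(len(data), size=size, replace=True)
    return data[idx]
```

Without the cap, a bag larger than the data itself would only replicate points, which (as the note observes) adds little diversity.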

  19. As we will see later, these attempts to create more stable combination methods were largely unsuccessful: the results did not show a significant improvement over the mean and in fact worsened in many cases. Nevertheless, we include these experiments because there are some useful lessons to be learned from them. Furthermore, the differences were small enough that we do not consider these results definitive in terms of relative performance.

  20. We used geometric subsampling because it was one of the best-performing variance reduction methods (see the sketch below). However, the basic trends were quite similar for other types of subsampling.
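For illustration, a sketch of one plausible reading of geometric subsampling, assuming the subsample size is drawn log-uniformly (i.e., uniformly over a geometric scale of sizes) between a minimum and a cap; the chapter's exact bounds and drawing scheme may differ, and the names and defaults here are assumptions.

```python
import numpy as np

def geometric_subsample(data, rng, min_size=50, max_size=1000):
    """Subsample without replacement; size drawn uniformly on a log scale."""
    hi = min(max_size, len(data))
    lo = min(min_size, hi)                        # guard for very small data sets
    size = int(round(np.exp(rng.uniform(np.log(lo), np.log(hi)))))
    idx = rng.choice(len(data), size=size, replace=False)
    return data[idx]
```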

  21. http://archive.ics.uci.edu/ml/datasets.html

  22. http://www.ipd.kit.edu/~muellere/HiCS/

References

  1. C. C. Aggarwal. Outlier Analysis, Second Edition, Springer, 2017.

  2. C. C. Aggarwal and P. S. Yu. Outlier Detection in Graph Streams, ICDE Conference, 2011.

  3. C. C. Aggarwal. Outlier Ensembles: Position Paper, ACM SIGKDD Explorations, 14(2), pp. 49–58, December 2012.

  4. C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast Algorithms for Projected Clustering, ACM SIGMOD Conference, 1999.

  5. C. C. Aggarwal and P. S. Yu. Finding Generalized Projected Clusters in High Dimensional Spaces, ACM SIGMOD Conference, 2000.

  6. C. C. Aggarwal and S. Sathe. Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17(1), June 2015.

  7. C. C. Aggarwal. Recommender Systems: The Textbook, Springer, 2016. [Chapter 6 on Ensemble-Based Systems]

  8. C. C. Aggarwal. Data Mining: The Textbook, Springer, 2015.

  9. C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.

  10. C. C. Aggarwal and P. S. Yu. Outlier Detection in High Dimensional Data, ACM SIGMOD Conference, 2001.

  11. F. Angiulli and C. Pizzuti. Fast Outlier Detection in High Dimensional Spaces, PKDD Conference, 2002.

  12. E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, 36(1), pp. 1–38, 1998.

  13. S. Bay. Nearest Neighbor Classification from Multiple Feature Subsets, Intelligent Data Analysis, 2(3), pp. 191–209, 1999.

  14. G. Biau, F. Cerou, and A. Guyader. On the Rate of Convergence of the Bagged Nearest Neighbor Estimate, Journal of Machine Learning Research, 11, pp. 687–712, 2010.

  15. L. Breiman. Random Forests, Machine Learning, 45(1), pp. 5–32, 2001.

  16. L. Breiman. Bagging Predictors, Machine Learning, 24(2), pp. 123–140, 1996.

  17. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-based Local Outliers, ACM SIGMOD Conference, 2000.

  18. G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity Creation Methods: A Survey and Categorisation, Information Fusion, 6(1), pp. 5–20, 2005.

  19. R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute Bagging: Improving Accuracy of Classifier Ensembles by Using Random Feature Subsets, Pattern Recognition, 36(6), pp. 1291–1302, 2003.

  20. P. Buhlmann. Bagging, Subagging and Bragging for Improving Some Prediction Algorithms, Recent Advances and Trends in Nonparametric Statistics, Elsevier, 2003.

  21. P. Buhlmann and B. Yu. Analyzing Bagging, Annals of Statistics, pp. 927–961, 2002.

  22. A. Buja and W. Stuetzle. Observations on Bagging, Statistica Sinica, 16(2), 323, 2006.

  23. J. Chen, S. Sathe, C. Aggarwal, and D. Turaga. Outlier Detection with Autoencoder Ensembles, SIAM Conference on Data Mining, 2017.

  24. A. Criminisi, J. Shotton, and E. Konukoglu. Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Microsoft Research Cambridge, Tech. Rep. MSRTR-2011-114, 5(6), 12, 2011.

  25. M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions, ACM Annual Symposium on Computational Geometry, pp. 253–262, 2004.

  26. M. Denil, D. Matheson, and N. De Freitas. Narrowing the Gap: Random Forests in Theory and in Practice, ICML Conference, pp. 665–673, 2014.

  27. C. Desir, S. Bernard, C. Petitjean, and L. Heutte. One Class Random Forests, Pattern Recognition, 46(12), pp. 3490–3506, 2013.

  28. T. Dietterich. Ensemble Methods in Machine Learning, First International Workshop on Multiple Classifier Systems, 2000.

  29. A. Emmott, S. Das, T. Dietterich, A. Fern, and W. Wong. Systematic Construction of Anomaly Detection Benchmarks from Real Data, arXiv:1503.01158, 2015. https://arxiv.org/abs/1503.01158

  30. M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?, Journal of Machine Learning Research, 15(1), pp. 3133–3181, 2014.

  31. J. Friedman and P. Hall. On Bagging and Nonlinear Estimation, Journal of Statistical Planning and Inference, 137(3), pp. 669–683, 2007.

  32. P. Geurts. Variance Reduction Techniques, Chapter 4 of unpublished PhD thesis entitled “Contributions to decision tree induction: bias/variance tradeoff and time series classification,” University of Liege, Belgium, 2002. http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2002/Geu02/

  33. M. Grill and T. Pevny. Learning Combination of Anomaly Detectors for Security Domain, Computer Networks, 2016.

  34. S. Guha, N. Mishra, G. Roy, and O. Schrijver. Robust Random Cut Forest Based Anomaly Detection on Streams, ICML Conference, pp. 2712–2721, 2016.

  35. Z. He, S. Deng, and X. Xu. A Unified Subspace Outlier Ensemble Framework for Outlier Detection, Advances in Web Age Information Management, 2005.

  36. T. K. Ho. Random Decision Forests, Third International Conference on Document Analysis and Recognition, 1995. Extended version appears as “The Random Subspace Method for Constructing Decision Forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), pp. 832–844, 1998.

  37. T. K. Ho. Nearest Neighbors in Random Subspaces, Lecture Notes in Computer Science, Vol. 1451, pp. 640–648, Proceedings of the Joint IAPR Workshops SSPR’98 and SPR’98, 1998. http://link.springer.com/chapter/10.1007/BFb0033288

  38. F. Keller, E. Muller, and K. Bohm. HiCS: High-Contrast Subspaces for Density-based Outlier Ranking, IEEE ICDE Conference, 2012.

  39. M. Kopp, T. Pevny, and M. Holena. Interpreting and Clustering Outliers with Sapling Random Forests, Information Technologies Applications and Theory Workshops, Posters, and Tutorials (ITAT), 2014.

  40. A. Lazarevic and V. Kumar. Feature Bagging for Outlier Detection, ACM KDD Conference, 2005.

  41. F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation Forest, ICDM Conference, 2008. Extended version appears as “Isolation-based Anomaly Detection,” ACM Transactions on Knowledge Discovery from Data (TKDD), 6(1), 3, 2012.

  42. F. T. Liu, K. M. Ting, and Z.-H. Zhou. On Detecting Clustered Anomalies Using SCiForest, Machine Learning and Knowledge Discovery in Databases, pp. 274–290, Springer, 2010.

  43. C. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval, Cambridge University Press, 2008. [Also see Exercises 14.16 and 14.17]

  44. G. Martinez-Munoz and A. Suarez. Out-of-Bag Estimation of the Optimal Sample Size in Bagging, Pattern Recognition, 43, pp. 143–152, 2010.

  45. P. Melville and R. Mooney. Creating Diversity in Ensembles Using Artificial Data, Information Fusion, 6(1), 2005.

  46. B. Micenkova, B. McWilliams, and I. Assent. Learning Outlier Ensembles: The Best of Both Worlds Supervised and Unsupervised, ACM SIGKDD Workshop on Outlier Detection and Description (ODD), 2014.

  47. B. Micenkova, B. McWilliams, and I. Assent. Learning Representations for Outlier Detection on a Budget, arXiv preprint arXiv:1507.08104, 2014.

  48. F. Moosmann, B. Triggs, and F. Jurie. Fast Discriminative Visual Codebooks Using Randomized Clustering Forests, Neural Information Processing Systems, pp. 985–992, 2006.

  49. R. Motwani and P. Raghavan. Randomized Algorithms, Chapman and Hall/CRC, 2012.

  50. E. Muller, M. Schiffer, and T. Seidl. Statistical Selection of Relevant Subspace Projections for Outlier Ranking, ICDE Conference, pp. 434–445, 2011.

  51. E. Muller, I. Assent, P. Iglesias, Y. Mulle, and K. Bohm. Outlier Ranking via Subspace Analysis in Multiple Views of the Data, ICDM Conference, 2012.

  52. H. Nguyen, H. Ang, and V. Gopalakrishnan. Mining Ensembles of Heterogeneous Detectors on Random Subspaces, DASFAA, 2010.

  53. H. Nguyen, E. Muller, J. Vreeken, F. Keller, and K. Bohm. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection, SIAM International Conference on Data Mining (SDM), pp. 198–206, 2013.

  54. S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast Outlier Detection Using the Local Correlation Integral, ICDE Conference, 2003.

  55. T. Pevny. Loda: Lightweight On-line Detector of Anomalies, Machine Learning, 102(2), pp. 275–304, 2016.

  56. J. Pickands. Statistical Inference Using Extreme Order Statistics, The Annals of Statistics, 3(1), pp. 119–131, 1975.

  57. J. Pickands. Multivariate Extreme Value Distributions, Proceedings of the 43rd Session International Statistical Institute, 2, pp. 859–878, 1981.

  58. D. Rocke and D. Woodruff. Identification of Outliers in Multivariate Data, Journal of the American Statistical Association, 91(435), pp. 1047–1061, 1996.

  59. L. Rokach. Pattern Classification Using Ensemble Methods, World Scientific Publishing Company, 2010.

  60. R. Samworth. Optimal Weighted Nearest Neighbour Classifiers, The Annals of Statistics, 40(5), pp. 2733–2763, 2012.

  61. S. Sathe and C. Aggarwal. Subspace Outlier Detection in Linear Time with Randomized Hashing, ICDM Conference, 2016.

  62. G. Seni and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy through Combining Predictions, Synthesis Lectures in Data Mining and Knowledge Discovery, Morgan and Claypool, 2010.

  63. B. M. Steele. Exact Bootstrap k-Nearest Neighbor Learners, Machine Learning, 74(3), pp. 235–255, 2009.

  64. M. Sugiyama and K. Borgwardt. Rapid Distance-Based Outlier Detection via Sampling, Advances in Neural Information Processing Systems, pp. 467–475, 2013.

  65. S. C. Tan, K. M. Ting, and T. F. Liu. Fast Anomaly Detection for Streaming Data, IJCAI Conference, 2011.

  66. K. M. Ting, T. Washio, J. Wells, and S. Arya. Defying the Gravity of Learning Curve: A Characteristic of Nearest Neighbour Anomaly Detectors, Machine Learning, August 2016.

  67. K. M. Ting, G. T. Zhou, F. T. Liu, and S. C. Tan. Mass Estimation and Its Applications, ACM KDD Conference, pp. 989–998, 2010.

  68. K. M. Ting, Y. Zhu, M. Carman, and Y. Zhu. Overcoming Key Weaknesses of Distance-Based Neighbourhood Methods Using a Data Dependent Dissimilarity Measure, ACM KDD Conference, 2016.

  69. Y. Wang, S. Parthasarathy, and S. Tatikonda. Locality Sensitive Outlier Detection: A Ranking Driven Approach, ICDE Conference, pp. 410–421, 2011.

  70. D. Wolpert and W. Macready. No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–72, 1997.

  71. K. Wu, K. Zhang, W. Fan, A. Edwards, and P. Yu. RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection, IEEE ICDM Conference, pp. 600–609, 2014.

  72. Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms, Chapman and Hall/CRC Press, 2012.

  73. A. Zimek, M. Gaudet, R. Campello, and J. Sander. Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles, KDD Conference, 2013.

  74. http://elki.dbs.ifi.lmu.de/


Author information

Correspondence to Charu C. Aggarwal.


Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Aggarwal, C.C., Sathe, S. (2017). Variance Reduction in Outlier Ensembles. In: Outlier Ensembles. Springer, Cham. https://doi.org/10.1007/978-3-319-54765-7_3


  • DOI: https://doi.org/10.1007/978-3-319-54765-7_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54764-0

  • Online ISBN: 978-3-319-54765-7

  • eBook Packages: Computer Science (R0)
