
Data Mining and Knowledge Discovery, Volume 30, Issue 6, pp 1520–1555

Discovering outlying aspects in large datasets

  • Nguyen Xuan Vinh
  • Jeffrey Chan
  • Simone Romano
  • James Bailey
  • Christopher Leckie
  • Kotagiri Ramamohanarao
  • Jian Pei

Abstract

We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover which aspects (i.e., subsets of features, or subspaces) make the query object most outlying? Outlying aspects mining can be used to explain any data point of interest, which may itself be an inlier or an outlier. In this paper, we investigate several open challenges faced by existing outlying aspects mining techniques and propose novel solutions, including (a) how to design effective scoring functions that are unbiased with respect to dimensionality yet computationally efficient, and (b) how to efficiently search the exponentially large space of all possible subspaces. We formalize the concept of dimensionality unbiasedness, a desirable property of outlyingness measures. We then characterize existing scoring measures, as well as our newly proposed ones, in terms of efficiency, dimensionality unbiasedness and interpretability. Finally, we evaluate the effectiveness of different methods for outlying aspects discovery and demonstrate the utility of our proposed approach on large real and synthetic data sets.
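
To make the problem statement concrete, the following is a minimal sketch of outlying aspects mining on a toy data set. It is not the scoring function or search strategy proposed in the paper: the score (a k-nearest-neighbour distance standardised against the reference points in the same subspace) and the exhaustive enumeration of small subspaces are assumptions chosen for illustration, and all function names are hypothetical. Standardising per subspace is one simple way to keep scores comparable across subspaces of different dimensionality.

```python
import itertools
import math
import statistics


def knn_distance(point, data, dims, k):
    """Distance from `point` to its k-th nearest row of `data`, using only `dims`."""
    dists = sorted(
        math.sqrt(sum((point[d] - row[d]) ** 2 for d in dims)) for row in data
    )
    return dists[k - 1]


def z_score_outlyingness(query, data, dims, k=3):
    """Query's kNN distance, standardised by the reference points' own
    leave-one-out kNN distances in the same subspace (illustrative choice)."""
    ref_scores = [
        knn_distance(row, data[:i] + data[i + 1:], dims, k)
        for i, row in enumerate(data)
    ]
    mu = statistics.mean(ref_scores)
    sigma = statistics.pstdev(ref_scores)
    q = knn_distance(query, data, dims, k)
    return (q - mu) / sigma if sigma > 0 else 0.0


def top_outlying_subspaces(query, data, max_dim=2, k=3, top=3):
    """Score every subspace with at most `max_dim` features; return the best.
    Exhaustive enumeration is only feasible for small `max_dim`."""
    scored = []
    for size in range(1, max_dim + 1):
        for dims in itertools.combinations(range(len(query)), size):
            scored.append((z_score_outlyingness(query, data, dims, k), dims))
    scored.sort(reverse=True)
    return scored[:top]


# Toy reference set: features 0 and 1 are identical, feature 2 is a shuffled grid.
data = [(i / 20, i / 20, ((i * 8) % 21) / 20) for i in range(21)]
# The query is unremarkable in every single feature but lies off the diagonal.
query = (0.25, 0.75, 0.5)
best = top_outlying_subspaces(query, data, max_dim=2, k=3, top=1)
print(best)
```

In this toy case the query looks ordinary in each feature taken alone, and only the joint subspace of features 0 and 1 exposes it. The number of candidate subspaces grows exponentially with the number of features, which is why exhaustive enumeration as sketched here does not scale.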

Keywords

  • Outlying aspects mining
  • Subspace selection
  • Outlier explanation

Notes

Acknowledgments

This work is supported by the Australian Research Council via Grant Numbers FT110100112 and DP140101969.


Copyright information

© The Author(s) 2016

Authors and Affiliations

  • Nguyen Xuan Vinh (1)
  • Jeffrey Chan (1)
  • Simone Romano (1)
  • James Bailey (1)
  • Christopher Leckie (1)
  • Kotagiri Ramamohanarao (1)
  • Jian Pei (2)

  1. The University of Melbourne, Melbourne, Australia
  2. Simon Fraser University, Burnaby, Canada
