Data Mining and Knowledge Discovery, Volume 29, Issue 5, pp 1116–1151

Mining outlying aspects on numeric data

  • Lei Duan
  • Guanting Tang
  • Jian Pei
  • James Bailey
  • Akiko Campbell
  • Changjie Tang

Abstract

When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object \(o\) in a multidimensional numeric data set \(O\), in which subspace is \(o\) most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in the subspace. A minimal subspace where the query object is ranked the best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial. A naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real data and synthetic data demonstrates that our method is effective and efficient.
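To make the problem statement concrete, the following is a minimal sketch of the naïve approach the abstract describes: estimate every object's density in each subspace, rank the objects, and keep the minimal subspaces where the query object ranks best. It is not the heuristic method developed in the paper. The Gaussian product kernel, the Scott-style bandwidth, the `max_dim` cap, and all function names are assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np


def kde_density(sub, bandwidths):
    """Unnormalized Gaussian product-kernel density at every row of `sub`.

    The normalizing constant is identical for all objects within one
    subspace, so dropping it does not change the density ranking.
    """
    diff = (sub[:, None, :] - sub[None, :, :]) / bandwidths  # shape (n, n, d)
    return np.exp(-0.5 * (diff ** 2).sum(axis=2)).sum(axis=1)


def density_rank(data, query_idx, subspace):
    """Rank of the query object's density in `subspace` (1 = lowest density)."""
    sub = data[:, list(subspace)]
    n, d = sub.shape
    sigma = sub.std(axis=0, ddof=1)
    sigma[sigma == 0] = 1.0  # guard against constant dimensions
    # Assumed bandwidth choice (Scott-style rule of thumb), one value per dimension.
    h = sigma * n ** (-1.0 / (d + 4))
    dens = kde_density(sub, h)
    return 1 + int((dens < dens[query_idx]).sum())


def outlying_aspects(data, query_idx, max_dim):
    """Brute-force search over all subspaces with at most `max_dim` dimensions.

    Returns the best rank found and the minimal subspaces achieving it.
    Subspaces are visited in order of increasing size, so a superset of an
    already recorded best subspace is never added.
    """
    best_rank, best = None, []
    for size in range(1, max_dim + 1):
        for subspace in combinations(range(data.shape[1]), size):
            r = density_rank(data, query_idx, subspace)
            if best_rank is None or r < best_rank:
                best_rank, best = r, [subspace]
            elif r == best_rank and not any(set(s) <= set(subspace) for s in best):
                best.append(subspace)
    return best_rank, best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 3))
    data[0] = [0.0, 0.0, 8.0]  # query object: unusual only in dimension 2
    print(outlying_aspects(data, query_idx=0, max_dim=3))
```

The sketch enumerates every subspace up to `max_dim` and computes all pairwise kernel values in each one, which is the cost the abstract identifies as prohibitive when the dimensionality is high.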

Keywords

Outlying aspect; Outlyingness degree; Kernel density estimation; Subspace search

Notes

Acknowledgments

The authors thank the editor and the anonymous reviewers for their invaluable comments, which helped to improve this paper. Lei Duan's research is supported in part by the Natural Science Foundation of China (Grant No. 61103042) and the China Postdoctoral Science Foundation (Grant No. 2014M552371). Work by Lei Duan at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. Jian Pei's and Guanting Tang's research is supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey's work is supported by an ARC Future Fellowship (FT110100112). All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Lei Duan (1), corresponding author
  • Guanting Tang (2)
  • Jian Pei (2)
  • James Bailey (3)
  • Akiko Campbell (4)
  • Changjie Tang (1)

  1. School of Computer Science, Sichuan University, Chengdu, China
  2. School of Computing Science, Simon Fraser University, Burnaby, Canada
  3. Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
  4. Pacific Blue Cross, Burnaby, Canada
