Data Mining and Knowledge Discovery, Volume 31, Issue 1, pp 134–163

Outlying property detection with numerical attributes

  • Fabrizio Angiulli
  • Fabio Fassetti
  • Giuseppe Manco
  • Luigi Palopoli

Abstract

The outlying property detection problem (OPDP) is the problem of discovering the properties that distinguish a given object, known in advance to be an outlier in a database, from the other objects in the database. The problem has recently been analyzed for categorical attributes only. Numerical attributes, however, are highly relevant and widely used in databases. In this paper, we therefore analyze the OPDP in a setting where numerical attributes are also taken into account, a relevant case left open in the literature. As major contributions, we introduce a measure of object exceptionality, present an efficient parameter-free algorithm to compute it, and propose a unified framework for mining exceptional properties in the presence of both categorical and numerical attributes.

Keywords

Outlier detection, Outlying properties, Kernel density estimation, Clustering

Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. DIMES Department, University of Calabria, Rende, Italy
  2. Institute of High Performance Computing and Networks (ICAR-CNR), Rende, Italy
