Data Mining and Knowledge Discovery

Volume 28, Issue 2, pp 519–568

Exploiting domain knowledge to detect outliers



We present a novel definition of outlier whose aim is to embed available domain knowledge in the process of discovering outliers. Specifically, given background knowledge encoded as a set of first-order rules, together with a set of positive and negative examples, our approach aims at singling out the examples that show abnormal behavior. The proposed technique is unsupervised, since no examples of normal or abnormal behavior are given, although it has connections with supervised learning, since it is based on induction from examples. We provide a notion of compliance of a set of facts with respect to the background knowledge and a set of examples, which is exploited to detect the examples that prevent the generalization of the induced hypothesis from improving. By testing compliance with respect to both the direct and the dual concept, we are able to distinguish among three kinds of abnormalities: irregular, anomalous, and outlier observations. This allows us to provide a finer characterization of the anomaly at hand and to single out subtle forms of anomalies. Moreover, we are able to provide explanations for the abnormality of an observation, which make intelligible the reasons underlying its exceptionality. We present both exact and approximate algorithms for mining abnormalities. The approximate algorithms improve execution time while guaranteeing good accuracy. Finally, we discuss peculiarities of the novel approach, present examples of mined knowledge, analyze the scalability of the algorithms, and provide a comparison with noise-handling mechanisms and some alternative approaches.


Outlier detection · Unsupervised methods · Knowledge representation · Concept learning



Copyright information

© The Author(s) 2013

Authors and Affiliations

  1. DIMES Department, University of Calabria, Rende, Italy
