# Exploiting domain knowledge to detect outliers

## Abstract

We present a novel definition of outlier that embeds available domain knowledge in the process of discovering outliers. Specifically, given background knowledge, encoded as a set of first-order rules, and a set of positive and negative examples, our approach aims to single out the examples that show abnormal behavior. The proposed technique is *unsupervised*, since no examples of normal or abnormal behavior are given, although it has connections with *supervised* learning, since it is based on induction from examples. We provide a notion of compliance of a set of facts with respect to the background knowledge and a set of examples, which is exploited to detect the examples that prevent the generalization of the induced hypothesis from improving. By testing compliance with respect to both the direct and the dual concept, we are able to distinguish among three kinds of abnormalities, namely *irregular*, *anomalous*, and *outlier* observations. This allows us to provide a finer characterization of the anomaly at hand and to single out subtle forms of anomalies. Moreover, we provide *explanations* for the abnormality of an observation, which make the reasons for its exceptionality intelligible. We present both exact and approximate algorithms for mining abnormalities. The approximate algorithms improve execution time while guaranteeing good accuracy. Finally, we discuss peculiarities of the novel approach, present examples of the knowledge mined, analyze the scalability of the algorithms, and compare the approach with noise-handling mechanisms and some alternative approaches.

## Keywords

Outlier detection, Unsupervised methods, Knowledge representation, Concept learning