Journal of Intelligent Information Systems, Volume 36, Issue 2, pp 217–247

Outlier detection by example

  • Cui Zhu
  • Hiroyuki Kitagawa
  • Spiros Papadimitriou
  • Christos Faloutsos

Abstract

Outlier detection is a useful technique in such areas as fraud detection, financial analysis and health monitoring. Many recent approaches detect outliers according to reasonable, pre-defined concepts of an outlier (e.g., distance-based, density-based, etc.). However, the definition of an outlier differs between users or even datasets. This paper presents a solution to this problem by including input from the users. Our OBE (Outlier By Example) system is the first that allows users to provide examples of outliers in low-dimensional datasets. By incorporating a small number of such examples, OBE can successfully develop an algorithm by which to identify further outliers based on their outlierness. Several algorithmic challenges and engineering decisions must be addressed in building such a system. We describe the key design decisions and algorithms in this paper. In order to interact with users having different degrees of domain knowledge, we develop two detection schemes: OBE-Fraction and OBE-RF. Our experiments on both real and synthetic datasets demonstrate that OBE can discover values that a user would consider outliers.

Keywords

Outlier detection, Outlier example, Data mining, Machine learning

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Cui Zhu (1)
  • Hiroyuki Kitagawa (2)
  • Spiros Papadimitriou (3)
  • Christos Faloutsos (4)

  1. College of Computer Science, Beijing University of Technology, Beijing, People’s Republic of China
  2. Graduate School of Systems and Information Engineering, Center for Computational Sciences, University of Tsukuba, Ibaraki, Japan
  3. IBM T.J. Watson, Hawthorne, USA
  4. Carnegie Mellon University, Pittsburgh, USA
