A Fast Greedy Algorithm for Outlier Mining

  • Zengyou He
  • Shengchun Deng
  • Xiaofei Xu
  • Joshua Zhexue Huang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)

Abstract

The task of outlier detection is to find small groups of data objects that are exceptional when compared with the remaining large amount of data. Recently, the problem of outlier detection in categorical data was defined as an optimization problem, and a local-search heuristic-based algorithm (LSA) was presented. However, as is the case with most iterative algorithms, LSA is still very time-consuming on very large datasets. In this paper, we present a very fast greedy algorithm for mining outliers under the same optimization model. Experimental results on real datasets and large synthetic datasets show that (1) our new algorithm performs comparably to state-of-the-art outlier detection algorithms in identifying true outliers, and (2) our algorithm can be an order of magnitude faster than LSA.
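
The abstract does not state the optimization model or the greedy rule, so the following is only an illustrative sketch, not the authors' implementation. It assumes an entropy-based objective: choose k records whose removal minimizes the summed Shannon entropy of the attributes in the remaining data. Under that assumption, a greedy procedure makes k passes and, in each pass, removes the single record whose removal gives the lowest remaining entropy. The names `attribute_entropy`, `total_entropy`, and `greedy_outliers` are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(counts, n):
    """Shannon entropy of one attribute, given its value counts over n records."""
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)


def total_entropy(counters, n):
    """Sum of per-attribute entropies (assumed objective, see lead-in)."""
    if n == 0:
        return 0.0
    return sum(attribute_entropy(c, n) for c in counters)


def greedy_outliers(records, k):
    """Greedily pick k record indices whose removal lowers entropy the most.

    records: list of equal-length tuples of categorical values.
    Returns the chosen indices in the order they were selected.
    """
    if not records:
        return []
    num_attrs = len(records[0])
    # One value-frequency table per attribute, updated as records are removed.
    counters = [Counter(r[j] for r in records) for j in range(num_attrs)]
    remaining = set(range(len(records)))
    outliers = []

    for _ in range(min(k, len(records))):
        n = len(remaining)
        best_idx, best_entropy = None, float("inf")
        for i in sorted(remaining):
            # Tentatively remove record i, score the rest, then restore it.
            for j, v in enumerate(records[i]):
                counters[j][v] -= 1
            e = total_entropy(counters, n - 1)
            for j, v in enumerate(records[i]):
                counters[j][v] += 1
            if e < best_entropy:
                best_idx, best_entropy = i, e
        # Commit the best removal found in this pass.
        for j, v in enumerate(records[best_idx]):
            counters[j][v] -= 1
        remaining.remove(best_idx)
        outliers.append(best_idx)

    return outliers


if __name__ == "__main__":
    data = [("a", "x")] * 5 + [("b", "y")]
    print(greedy_outliers(data, 1))
```

For clarity, this sketch recomputes the full entropy for every candidate. A faster variant would update only the terms affected by the removed record's attribute values.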

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Zengyou He (1)
  • Shengchun Deng (1)
  • Xiaofei Xu (1)
  • Joshua Zhexue Huang (2)
  1. Department of Computer Science and Engineering, Harbin Institute of Technology, China
  2. E-Business Technology Institute, The University of Hong Kong, Hong Kong, China
