Outlier Detection with Kernel Density Functions

  • Longin Jan Latecki
  • Aleksandar Lazarevic
  • Dragoljub Pokrajac
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4571)

Abstract

Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First, we modify a nonparametric density estimate with a variable kernel to obtain a robust local density estimate. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Experiments on several simulated data sets show that the proposed approach can outperform two widely used outlier detection algorithms, LOF and LOCI.
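The abstract does not give the estimator itself, so the following is only a minimal sketch of the general idea it describes: estimate a local density with a variable-bandwidth kernel, then compare each point's density with that of its neighbors. The function name `local_density_outlier_scores`, the parameter `k`, the Gaussian kernel, and the choice of the k-th nearest-neighbor distance as bandwidth are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def local_density_outlier_scores(X, k=5):
    """Illustrative score: mean density of a point's neighbors / its own density.

    Assumed design (not the paper's exact method): Gaussian kernels centered
    on the k nearest neighbors, each with a bandwidth equal to that
    neighbor's own k-th nearest-neighbor distance.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Pairwise Euclidean distances (O(n^2) memory; fine for a small demo).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # k nearest neighbors of each point; column 0 is the point itself
    # (assuming no exact duplicate points), so it is skipped.
    order = np.argsort(dist, axis=1)
    nn = order[:, 1:k + 1]

    # Variable bandwidth: distance from each point to its k-th neighbor.
    h = np.maximum(dist[np.arange(n), nn[:, -1]], 1e-12)

    # Local density at each point: average of Gaussian kernels centered on
    # its neighbors, each kernel using the neighbor's bandwidth.
    d_nn = dist[np.arange(n)[:, None], nn]  # shape (n, k)
    h_nn = h[nn]                            # shape (n, k)
    kern = np.exp(-0.5 * (d_nn / h_nn) ** 2) / ((2 * np.pi) ** (d / 2) * h_nn ** d)
    density = kern.mean(axis=1)

    # Large score: the point is much less dense than its neighborhood.
    return density[nn].mean(axis=1) / np.maximum(density, 1e-200)
```

Under these assumptions, a point lying far from a cluster receives a score much larger than the scores of the cluster members, which stay near 1. The value of `k` controls how local the comparison is and would need tuning on real data.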

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Longin Jan Latecki (1)
  • Aleksandar Lazarevic (2)
  • Dragoljub Pokrajac (3)

  1. CIS Dept., Temple University, Philadelphia, PA 19122, USA
  2. United Technology Research Center, 411 Silver Lane, MS 129-15, East Hartford, CT 06108, USA
  3. CIS Dept., CREOSA and AMRC, Delaware State University, Dover, DE 19901, USA
