Data Mining and Knowledge Discovery, Volume 12, Issue 2–3, pp 203–228

Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

  • Matthew Eric Otey
  • Amol Ghoting
  • Srinivasan Parthasarathy


Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.


Keywords: outlier detection, anomaly detection, distributed data mining, mining dynamic data, mixed-attribute data sets, data streams



Copyright information

© Springer Science+Business Media, Inc. 2006

Authors and Affiliations

  • Matthew Eric Otey (1)
  • Amol Ghoting (1)
  • Srinivasan Parthasarathy (1)
  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
