Data Mining and Knowledge Discovery

Volume 20, Issue 2, pp 259–289

A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes

  • Anna Koufakou
  • Michael Georgiopoulos


Outlier detection has attracted substantial attention in many applications and research areas; prominent applications include network intrusion detection and credit card fraud detection. Many existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions. Such datasets tend to be sparse, so traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes into consideration the sparseness of the dataset, and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed outlier detection method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature, and that the speedup achieved by its distributed version is very close to linear.
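The claim that Euclidean distance and nearest neighbor lose meaning as the number of dimensions grows can be illustrated with a small experiment. The sketch below is not the method proposed in the paper; it only measures the relative contrast, (d_max - d_min) / d_min, between the farthest and nearest of a set of random points as seen from one random query point. The function names, the uniform synthetic data, the point count, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def relative_contrast(num_points, num_dims, rng):
    """Return (d_max - d_min) / d_min for Euclidean distances from one random
    query point to num_points uniformly random points in the unit hypercube."""
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


def contrast_by_dimension(dims=(2, 10, 100, 1000), num_points=200, seed=0):
    """Map each dimensionality to its relative contrast (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    return {d: relative_contrast(num_points, d, rng) for d in dims}


if __name__ == "__main__":
    for d, c in contrast_by_dimension().items():
        print(f"dims={d:5d}  relative contrast={c:.3f}")
```

On synthetic data of this kind the contrast typically shrinks sharply as the dimensionality increases, so the nearest and farthest points become hard to tell apart by distance alone. Real mixed-attribute datasets may behave differently, which is why this should be read as motivation rather than as evidence about any particular dataset.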


Outlier detection, Anomaly detection, Data mining, Distributed data sets, Mixed attribute data sets, High-dimensional data sets





Copyright information

© The Author(s) 2009

Authors and Affiliations

  1. U.A. Whitaker School of Engineering, Florida Gulf Coast University, Fort Myers, USA
  2. School of EECS, University of Central Florida, Orlando, USA
