Data Mining and Knowledge Discovery, Volume 28, Issue 1, pp 190–237

Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection

Abstract

Outlier detection research sees many new algorithms every year. Many of them appear to differ only slightly from existing methods and are accompanied by experiments that show them to “clearly outperform” the others. However, few approaches come with a clear analysis of existing methods and a solid theoretical differentiation. Here, we provide a formalized method of analysis that allows for a theoretical comparison and generalization of many existing methods. Our unified view improves understanding of both the shared properties and the differences of outlier detection models. By abstracting the notion of locality from the classic distance-based notion, our framework facilitates the construction of abstract methods for many special data types that are usually handled with specialized algorithms. In particular, spatial neighborhood can be seen as a special case of locality. We therefore compare and generalize approaches to spatial outlier detection in detail. We also discuss temporal data such as video streams and graph data such as community networks. Since we reproduce the results of specialized approaches with our general framework, and even improve upon them, our framework provides reasonable baselines against which to evaluate the true merits of specialized approaches. At the same time, seeing spatial outlier detection as a special case of local outlier detection opens up new potential for the analysis and advancement of methods.

Keywords

Local outlier, Spatial outlier, Video outlier, Network outlier

Copyright information

© The Author(s) 2012

Authors and Affiliations

  • Erich Schubert (1)
  • Arthur Zimek (2)
  • Hans-Peter Kriegel (1)

  1. Ludwig-Maximilians-Universität München, Munich, Germany
  2. Department of Computing Science, University of Alberta, Edmonton, Canada
