Detecting and ranking outliers in high-dimensional data

  • Amardeep Kaur
  • Amitava Datta

Abstract

Detecting outliers in high-dimensional data is a challenging problem. In such data, the outlying behaviour of a data point can only be detected in locally relevant subsets of the data dimensions. These subsets of dimensions are called subspaces, and their number grows exponentially with the dimensionality of the data. A data point that is an outlier in one subspace can appear normal in another. To characterise an outlier, it is therefore important to measure its outlying behaviour by the number of subspaces in which it shows up as an outlier. This additional detail can help a data analyst decide whether to remove, fix or keep an outlier unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data, based on a recent density-based clustering algorithm called SUBSCALE. We also rank outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm makes no assumptions about the underlying data distribution and can adapt to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with more than 82% precision as well as recall.
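
The ranking idea described above, scoring a point by the number of subspaces in which it looks like an outlier, can be illustrated with a minimal sketch. This is not the SUBSCALE-based algorithm of the paper: it assumes a naive density test (a point is outlying in a subspace if it has fewer than min_pts neighbours within epsilon along every dimension of that subspace) and enumerates only small subspaces by brute force. The function names, the parameters epsilon, min_pts and max_subspace_size, and the toy data are all hypothetical.

```python
from itertools import combinations


def outlier_scores(points, epsilon, min_pts, max_subspace_size=2):
    """Return {point index: number of subspaces in which the point is outlying}.

    Assumption: a point is "outlying" in a subspace when fewer than min_pts
    other points lie within epsilon of it along every dimension of that
    subspace. This is a naive stand-in for a density-based test.
    """
    n_dims = len(points[0])
    scores = {i: 0 for i in range(len(points))}
    for size in range(1, max_subspace_size + 1):
        # Enumerate every subspace (subset of dimensions) of this size.
        for dims in combinations(range(n_dims), size):
            for i, p in enumerate(points):
                neighbours = sum(
                    1
                    for j, q in enumerate(points)
                    if j != i and all(abs(p[d] - q[d]) <= epsilon for d in dims)
                )
                if neighbours < min_pts:
                    scores[i] += 1
    return scores


def rank_outliers(scores):
    """Sort points with a non-zero score, strongest outlier first."""
    flagged = [(i, s) for i, s in scores.items() if s > 0]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))


# Toy data: four points form a dense group, point 4 deviates in one
# dimension only, and point 5 deviates in every dimension.
points = [
    (1.0, 1.0, 1.0),
    (1.1, 0.9, 1.0),
    (0.9, 1.1, 1.1),
    (1.0, 1.0, 0.9),
    (1.0, 5.0, 1.0),
    (9.0, 9.0, 9.0),
]
scores = outlier_scores(points, epsilon=0.3, min_pts=2)
ranking = rank_outliers(scores)
print(ranking)  # [(5, 6), (4, 3)]
```

With three dimensions and subspaces of size one and two, there are six subspaces in total. Point 5 is outlying in all six, while point 4 is outlying only in the three subspaces that include its deviating dimension, so it ranks lower. Brute-force enumeration like this becomes infeasible as dimensionality grows, which is the exponential growth in subspaces that the abstract refers to.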

Keywords

  • Data mining
  • Outlier detection
  • High-dimensional data

Copyright information

© Indian Institute of Technology Madras 2018

Authors and Affiliations

  1. School of Computer Science and Software Engineering, University of Western Australia, Perth, Australia
