Outliers in High Dimensional Data

  • N. N. R. Ranga Suri
  • Narasimha Murty M
  • G. Athithan
Chapter
Part of the Intelligent Systems Reference Library book series (ISRL, volume 155)

Abstract

This chapter addresses one of the research issues connected with the outlier detection problem, namely the dimensionality of the data. More specifically, the focus is on detecting outliers embedded in subspaces of high dimensional categorical data. To this end, algorithms for unsupervised selection of feature subsets in the categorical data domain are presented. A detailed discussion on devising the necessary measures for assessing the relevance and redundancy of categorical attributes/features follows. An experimental study of these algorithms on benchmark categorical data sets demonstrates their efficacy for outlier detection.
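The relevance/redundancy idea mentioned in the abstract can be sketched as follows. This is an illustrative instantiation only, not the chapter's exact measures or algorithm: it estimates mutual information between pairs of categorical attributes from co-occurrence counts, then greedily selects features that are relevant (high average mutual information with the other attributes) while penalizing redundancy (mutual information with attributes already selected), in the style of mRMR-type schemes. The function names `mutual_information` and `select_features` are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats between two categorical attributes,
    given as parallel sequences of observed values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_features(rows, k):
    """Greedily pick k column indices from `rows` (a list of tuples of
    categorical values), balancing relevance (average MI with all other
    columns) against redundancy (average MI with columns already chosen)."""
    cols = list(zip(*rows))          # transpose: one sequence per attribute
    d = len(cols)
    mi = [[mutual_information(cols[i], cols[j]) for j in range(d)]
          for i in range(d)]
    relevance = [sum(mi[i][j] for j in range(d) if j != i) / (d - 1)
                 for i in range(d)]
    selected = [max(range(d), key=lambda i: relevance[i])]
    while len(selected) < k:
        remaining = [i for i in range(d) if i not in selected]
        def score(i):
            redundancy = sum(mi[i][j] for j in selected) / len(selected)
            return relevance[i] - redundancy
        selected.append(max(remaining, key=score))
    return selected
```

On a toy data set where the first two attributes are copies of each other and the third is independent, the greedy step skips the redundant copy and picks the independent attribute instead, which is the behavior such measures are designed to produce.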

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • N. N. R. Ranga Suri (1)
  • Narasimha Murty M (2)
  • G. Athithan (3)
  1. Centre for Artificial Intelligence and Robotics (CAIR), Bangalore, India
  2. Department of Computer Science and Automation, Indian Institute of Science (IISc), Bangalore, India
  3. Defence Research and Development Organization (DRDO), New Delhi, India
