Adaptive Clustering for Outlier Identification in High-Dimensional Data

  • Srikanth Thudumu (Email author)
  • Philip Branch
  • Jiong Jin
  • Jugdutt (Jack) Singh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11945)

Abstract

High-dimensional data brings new challenges and opportunities in clinical, scientific and industrial domains. However, the curse of dimensionality that comes with the increased dimensions makes outlier identification extremely difficult because data points become scattered. Furthermore, clustering high-dimensional data is challenging because of interference from irrelevant dimensions: a dimension may be relevant for some clusters and irrelevant for others. To address the curse of dimensionality in outlier identification, this paper presents a novel technique that generates candidate subspaces from the high-dimensional space and refines the identification of potential outliers in each subspace using an iterative adaptive clustering approach. Our experimental results show that the technique is effective.
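The scattering effect mentioned in the abstract can be illustrated with a small experiment. The sketch below is not the paper's method; it only demonstrates the well-known distance concentration behind the curse of dimensionality, assuming uniformly random points in a unit hypercube. The function name `relative_contrast` and the chosen sample sizes are hypothetical and picked for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (d_max - d_min) / d_min for Euclidean distances
    from one random query point to n_points uniform random points."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


# Fixed seed so the run is repeatable (an illustrative choice).
rng = random.Random(0)

# Relative contrast for an increasing number of dimensions.
contrasts = {d: relative_contrast(500, d, rng) for d in (2, 20, 200, 2000)}

for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

As the number of dimensions grows, the nearest and farthest neighbours of the query end up at almost the same distance, so a point that is far from everything is hard to tell apart from an ordinary point. This is one reason why searching for outliers in lower-dimensional subspaces, as the abstract proposes, can be more informative than working in the full space.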

Keywords

Outlier detection; High-dimensionality problem; Adaptive clustering; Big data

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Srikanth Thudumu (1), Email author
  • Philip Branch (1)
  • Jiong Jin (1)
  • Jugdutt (Jack) Singh (2)

  1. Swinburne University of Technology, Hawthorn, Australia
  2. State Government of Sarawak, Kuching, Malaysia
