Adaptive Clustering for Outlier Identification in High-Dimensional Data
Abstract
High-dimensional data brings new challenges and opportunities for domains such as clinical, scientific, and industrial data. However, the curse of dimensionality that accompanies a growing number of dimensions makes outlier identification extremely difficult because the data points become scattered. Furthermore, clustering high-dimensional data is challenging because of the interference of irrelevant dimensions: a dimension may be relevant for some clusters and irrelevant for others. To address the curse of dimensionality in outlier identification, this paper presents a novel technique that generates candidate subspaces from the high-dimensional space and refines the identification of potential outliers in each subspace using a novel iterative adaptive clustering approach. Our experimental results show that the technique is effective.
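The abstract gives no algorithmic details, so the following is only an illustrative sketch of the general idea (candidate subspaces, then iterative clustering-based refinement in each one), not the authors' method. Everything in it is an assumption made for illustration: the candidate subspaces are simply all pairs of dimensions, the clustering is a small k-means with a deterministic farthest-point start, and a point is flagged when its distance to the nearest centroid exceeds `c` times the median inlier distance. The names `refine_outliers` and `subspace_outlier_votes` are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def refine_outliers(points, k=2, c=3.0, max_rounds=10):
    """Assumed refinement loop: cluster the current inliers, flag points far
    from every centroid, drop them, and recluster until nothing new is flagged."""
    flagged = set()
    for _ in range(max_rounds):
        inliers = [p for i, p in enumerate(points) if i not in flagged]
        if len(inliers) < k:
            break
        centroids = kmeans(inliers, k)
        dists = [min(math.dist(p, ctr) for ctr in centroids) for p in points]
        # Assumed cutoff rule: c times the median inlier distance.
        cutoff = c * median(d for i, d in enumerate(dists) if i not in flagged)
        new = {i for i, d in enumerate(dists) if d > cutoff} - flagged
        if not new:
            break
        flagged |= new
    return flagged


def subspace_outlier_votes(points, subspace_size=2, k=2, c=3.0):
    """Count, per point index, in how many candidate subspaces it is flagged.
    Candidate subspaces here are all dimension subsets of the given size,
    which is a stand-in for whatever subspace generation the paper uses."""
    votes = {}
    n_dims = len(points[0])
    for dims in combinations(range(n_dims), subspace_size):
        projected = [tuple(p[d] for d in dims) for p in points]
        for i in refine_outliers(projected, k=k, c=c):
            votes[i] = votes.get(i, 0) + 1
    return votes


# Toy data: two tight clusters plus one point (index 8) that fits neither.
data = [
    (0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1),
    (10, 10, 10), (11, 10, 11), (10, 11, 10), (11, 11, 11),
    (0, 10, 5),
]
votes = subspace_outlier_votes(data)
print(votes)
```

On this toy data, only the point at index 8 is flagged, and it is flagged in all three two-dimensional subspaces. Reclustering after removing flagged points matters because a single outlier can pull a centroid toward itself in the first round, which inflates the distances of genuine cluster members.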
Keywords
Outlier detection, High-dimensionality problem, Adaptive clustering, Big data