Clustering categorical data in projected spaces

Bouguessa, Mohamed

doi:10.1007/s10618-013-0336-8

Published: 01 August 2013

Volume 29, pages 3–38 (2015)

Data Mining and Knowledge Discovery

Mohamed Bouguessa¹

Abstract

The problem of clustering categorical data has been widely investigated and appropriate approaches have been proposed. However, the majority of the existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying relevant dimensions for each cluster, (3) lack of an outlier detection mechanism and (4) dependence on a set of parameters that need to be properly tuned. Most of the existing approaches are inadequate for dealing with these four issues in a unified framework. This motivates our effort to propose a fully automatic projected clustering algorithm for high-dimensional categorical data which is capable of facing the four aforementioned issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase of the algorithm is based on a probabilistic approach that exploits the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, the clustering process is based on a novel quality function that allows the identification of projected clusters of low dimensionality embedded in a high-dimensional space without any parameter setting by the user. The suitability of our proposal is demonstrated through empirical studies using synthetic and real data sets.
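
The first phase described above can be illustrated with a minimal sketch. The abstract does not say which quantity the beta mixture models, so the code assumes (hypothetically) that every object already carries an outlier score in (0, 1), with higher scores meaning more outlying. It fits a two-component beta mixture with an EM-style loop whose M-step uses weighted method-of-moments estimates rather than exact maximum likelihood, and it flags the objects assigned to the high-score component. The function names, the initial split at the mean score, and the 0.5 posterior threshold are illustrative choices, not the paper's procedure.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed through log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def moments_to_beta(mean, var):
    """Method-of-moments estimate of (a, b) from a mean and a variance."""
    mean = min(max(mean, 1e-6), 1 - 1e-6)
    # A beta distribution requires var < mean * (1 - mean); clamp to stay valid.
    var = min(max(var, 1e-6), 0.99 * mean * (1 - mean))
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def fit_two_beta_mixture(scores, n_iter=50):
    """Fit a two-component beta mixture to scores in (0, 1).

    Returns (params, resp): params holds (weight, a, b) for the high-score
    and low-score components, and resp[i] is the posterior probability that
    score i belongs to the high-score (assumed outlier) component.
    """
    eps = 1e-6
    xs = [min(max(s, eps), 1 - eps) for s in scores]
    n = len(xs)
    # Assumed initialisation: scores above the overall mean start as "high".
    overall_mean = sum(xs) / n
    resp = [1.0 if x > overall_mean else 0.0 for x in xs]
    params = []
    for _ in range(n_iter):
        # M-step: weighted moments of each component, mapped to (a, b).
        params = []
        for r in (resp, [1 - v for v in resp]):
            w = max(sum(r), eps)
            mean = sum(ri * x for ri, x in zip(r, xs)) / w
            var = sum(ri * (x - mean) ** 2 for ri, x in zip(r, xs)) / w
            params.append((w / n, *moments_to_beta(mean, var)))
        # E-step: posterior probability of the high-score component.
        (w_hi, a_hi, b_hi), (w_lo, a_lo, b_lo) = params
        resp = []
        for x in xs:
            p_hi = w_hi * beta_pdf(x, a_hi, b_hi)
            p_lo = w_lo * beta_pdf(x, a_lo, b_lo)
            total = p_hi + p_lo
            resp.append(p_hi / total if total > 0 else 0.5)
    return params, resp


if __name__ == "__main__":
    # Synthetic scores: 200 low-score objects and 50 high-score objects.
    random.seed(0)
    scores = [random.betavariate(2, 30) for _ in range(200)]
    scores += [random.betavariate(30, 2) for _ in range(50)]
    params, resp = fit_two_beta_mixture(scores)
    flagged = [i for i, r in enumerate(resp) if r > 0.5]
    print(len(flagged), "objects flagged as outliers")
```

The moment-based M-step keeps the sketch dependency-free; an exact maximum-likelihood update would need a numerical solver for the beta shape parameters.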

Notes

http://archive.ics.uci.edu/ml/.

Acknowledgments

The author gratefully thanks Dr. Giuseppe Manco for providing the implementation of AT-DC, Dr. Tengke Xiong for providing the implementation of DHCC, and Dr. Andy M. Yip for providing the Primate and Aging Human Brain data sets. The author would also like to thank the reviewers for their valuable comments and important suggestions. This work is supported by research grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information

Authors and Affiliations

Department of Computer Science, University of Quebec at Montreal, Montreal, QC, Canada
Mohamed Bouguessa

Corresponding author

Correspondence to Mohamed Bouguessa.

Additional information

Responsible editor: Sugato Basu.

About this article

Cite this article

Bouguessa, M. Clustering categorical data in projected spaces. Data Min Knowl Disc 29, 3–38 (2015). https://doi.org/10.1007/s10618-013-0336-8

Received: 15 December 2011
Accepted: 22 July 2013
Published: 01 August 2013
Issue Date: January 2015
DOI: https://doi.org/10.1007/s10618-013-0336-8
