Automatic Subspace Clustering of High Dimensional Data
 Rakesh Agrawal,
 Johannes Gehrke,
 Dimitrios Gunopulos,
 Prabhakar Raghavan
Abstract
Data mining applications place special requirements on clustering algorithms, including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
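The abstract does not spell out how dense regions in subspaces are located, so the following is only a minimal sketch of the general grid-and-density idea commonly associated with CLIQUE, not the authors' implementation. It assumes each dimension is cut into `xi` equal-width intervals and that a grid cell is "dense" when it holds more than a fraction `tau` of the points; the function name `dense_units` and both parameter names are hypothetical. As a simplification, candidates are generated per subspace rather than per individual cell, and the cluster-forming and DNF-minimization steps are omitted.

```python
from collections import defaultdict
from itertools import combinations


def dense_units(points, xi=4, tau=0.2):
    """Return {subspace: set of dense cells}.

    A subspace is a sorted tuple of dimension indices; a cell is a tuple of
    interval indices, one per dimension of that subspace.
    """
    n, d = len(points), len(points[0])
    lo = [min(p[j] for p in points) for j in range(d)]
    hi = [max(p[j] for p in points) for j in range(d)]

    def interval(p, j):
        # Equal-width binning; a zero-width dimension falls into interval 0.
        width = (hi[j] - lo[j]) / xi or 1.0
        return min(int((p[j] - lo[j]) / width), xi - 1)

    # Grid coordinates of every point in the full space.
    grid = [tuple(interval(p, j) for j in range(d)) for p in points]

    def count_dense(subspaces):
        # One pass over the data: count points per cell of each subspace.
        counts = defaultdict(int)
        for g in grid:
            for s in subspaces:
                counts[(s, tuple(g[j] for j in s))] += 1
        found = defaultdict(set)
        for (s, cell), c in counts.items():
            if c / n > tau:
                found[s].add(cell)
        return found

    result = {}
    level = count_dense([(j,) for j in range(d)])
    while level:
        result.update(level)
        k = len(next(iter(level)))
        candidates = set()
        # Join k-dimensional subspaces that agree on their first k-1 dimensions.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                s = a + (b[-1],)
                # Prune: every k-dimensional projection must contain dense cells.
                if all(sub in level for sub in combinations(s, k)):
                    candidates.add(s)
        level = count_dense(sorted(candidates))
    return result
```

The bottom-up search relies on monotonicity: if a cell is dense in a subspace, its projection onto any subset of those dimensions is at least as populated, so subspaces without dense cells can be skipped when building higher-dimensional candidates. Because each pass only counts points per cell, the result does not depend on the order of the input records.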
 Journal

Data Mining and Knowledge Discovery
Volume 11, Issue 1, pp 5–33
 Cover Date
 2005-07-01
 DOI
 10.1007/s10618-005-1396-1
 Print ISSN
 1384-5810
 Online ISSN
 1573-756X
 Publisher
 Kluwer Academic Publishers
 Keywords

 subspace clustering
 clustering
 dimensionality reduction
 Authors

 Rakesh Agrawal (1)
 Johannes Gehrke (2)
 Dimitrios Gunopulos (3)
 Prabhakar Raghavan (4)
 Author Affiliations

 1. IBM Almaden Research Center, 650 Harry Road, San Jose, CA, 95120
 2. Computer Science Department, Cornell University, Ithaca, NY
 3. Department of Computer Science and Eng., University of California Riverside, Riverside, CA, 92521
 4. Verity, Inc., Germany