Bregman Bubble Clustering: A Robust Framework for Mining Dense Clusters

  • Joydeep Ghosh
  • Gunjan Gupta
Part of the Intelligent Systems Reference Library book series (ISRL, volume 23)

Abstract

In classical clustering, each data point is assigned to at least one cluster. However, in many applications only a small subset of the available data is relevant to the problem, and the rest must be ignored to obtain good clusters. Certain non-parametric density-based clustering methods find the most relevant data as multiple dense regions, but such methods are generally limited to low-dimensional data and do not scale well to large, high-dimensional datasets. They also rely on a specific notion of “distance”, typically Euclidean or Mahalanobis distance, which further limits their applicability. On the other hand, the recent One Class Information Bottleneck (OC-IB) method is fast and works with a large class of distortion measures known as Bregman divergences, but it can find only a single dense region. This paper presents a broad framework for finding k dense clusters while ignoring the rest of the data. It includes a seeding algorithm that can automatically determine a suitable value for k. When k is forced to 1, our method gives rise to an improved version of OC-IB with optimality guarantees. We provide a generative model that yields the proposed iterative algorithm for finding k dense regions as a special case. Our analysis reveals an interesting and novel connection between the problem of finding dense regions and exponential mixture models: a hard model corresponding to k exponential mixtures with a uniform background results in a set of k dense clusters. The proposed method is a highly scalable algorithm for finding multiple dense regions that works with any Bregman divergence, thus extending density-based clustering to a variety of non-Euclidean problems not addressable by earlier methods. We present empirical results on three artificial, two microarray, and one text dataset to show the relevance and effectiveness of our methods.
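To make the idea of "k dense clusters while ignoring the rest of the data" concrete, the following is a minimal sketch in Python with NumPy. It is not the chapter's algorithm: it omits the seeding step, the generative model, and any optimality guarantee. The function names, the parameter `s` (how many points to keep), and the fixed iteration count are all assumptions made for illustration. The sketch relies on one well-known property of Bregman divergences: the arithmetic mean of a set of points minimizes their total divergence to a single representative.

```python
import numpy as np

def squared_euclidean(x, mu):
    """Bregman divergence generated by phi(z) = ||z||^2."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return np.sum(diff * diff, axis=-1)

def generalized_kl(x, mu):
    """Generalized I-divergence, generated by phi(z) = sum(z log z).
    Assumes strictly positive inputs."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

def dense_clusters_sketch(X, init_centers, s,
                          divergence=squared_euclidean, n_iter=20):
    """Hypothetical helper: alternate between (a) keeping only the s points
    that are cheapest to assign to their nearest centre and (b) moving each
    centre to the mean of the points it kept. Points left out get label -1."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # cost[i, j] = divergence from point i to centre j
        cost = np.stack([divergence(X, c) for c in centers], axis=1)
        nearest = cost.argmin(axis=1)
        best = cost.min(axis=1)
        keep = np.argsort(best)[:s]          # the s cheapest points overall
        labels = np.full(len(X), -1)
        labels[keep] = nearest[keep]
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):                 # leave empty clusters unchanged
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Passing `divergence=generalized_kl` would swap the distortion measure without touching the rest of the loop, which is the practical appeal of working with the Bregman family; that variant assumes strictly positive data and centres.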

Keywords

  • Local Search
  • Dense Region
  • Dense Cluster
  • Cluster Representative
  • Class Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Joydeep Ghosh (1)
  • Gunjan Gupta (2)

  1. Department of Electrical & Computer Engineering, The University of Texas at Austin, Austin, USA
  2. Microsoft, One Microsoft Way, Redmond, USA
