Journal of Global Optimization, Volume 39, Issue 3, pp 323–346

A novel clustering approach and prediction of optimal number of clusters: global optimum search with enhanced positioning

  • Meng Piao Tan
  • James R. Broach
  • Christodoulos A. Floudas
Original Paper

Abstract

Cluster analysis of genome-wide expression data from DNA microarray hybridization studies is a useful tool for identifying biologically relevant gene groupings (DeRisi et al. 1997; Weiler et al. 1997). It is hence important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. In this study, we describe a novel clustering algorithm framework based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search (Floudas et al. 1989; Floudas 1995), which includes a procedure to determine the optimal number of clusters to be used. The approach involves a pre-clustering of data points to define an initial number of clusters, followed by the iterative solution of a Linear Programming problem (the primal problem) and a Mixed-Integer Linear Programming problem (the master problem), both derived from a Mixed-Integer Nonlinear Programming problem formulation. Badly placed data points are removed to form new clusters, thus ensuring tight groupings amongst the data points and incrementing the number of clusters until the optimum number is reached. We apply the proposed clustering algorithm to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae and compare the results with those obtained with some commonly used clustering algorithms. Our algorithm compares favorably against these algorithms in terms of intra-cluster similarity and inter-cluster dissimilarity, often considered two key tenets of clustering. Furthermore, our algorithm can predict the optimal number of clusters, and the biological coherence of the predicted clusters is analyzed through gene ontology.
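The two quality criteria named in the abstract, intra-cluster similarity and inter-cluster dissimilarity, can be illustrated with a small sketch. This is not the formulation used in the paper: the function names (`centroid`, `sq_dist`, `cluster_scores`), the use of squared Euclidean distance, and the toy data are all assumptions made purely for illustration. A lower intra-cluster sum indicates tighter clusters, and a higher inter-cluster sum indicates better separated clusters.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_scores(data: List[Point], labels: List[int]) -> Tuple[float, float]:
    """Return (intra, inter) for a hard assignment of points to clusters.

    intra: sum of squared distances from each point to its cluster center.
    inter: sum of squared distances between every pair of cluster centers.
    """
    groups: Dict[int, List[Point]] = {}
    for point, label in zip(data, labels):
        groups.setdefault(label, []).append(point)
    centers = {label: centroid(pts) for label, pts in groups.items()}

    intra = sum(sq_dist(p, centers[label]) for p, label in zip(data, labels))
    ordered = sorted(centers)
    inter = sum(
        sq_dist(centers[a], centers[b])
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
    )
    return intra, inter


# Hypothetical toy data: two well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = cluster_scores(data, [0, 0, 1, 1])  # groups nearby points together
bad = cluster_scores(data, [0, 1, 0, 1])   # groups distant points together
print(good, bad)
```

On this toy data the sensible assignment yields a smaller intra-cluster sum and a larger inter-cluster sum than the poor one, which is the sense in which one clustering "compares favorably" against another under these two criteria.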

Keywords

Clustering, Microarray data, Optimization

References

  1. Adams W.P. and Sherali H.D. (1990). Linearization strategies for a class of zero-one mixed integer programming problems. Operat. Res. 38(2): 217–226
  2. Aggarwal A. and Floudas C.A. (1990). Synthesis of general separation sequences - nonsharp separations. Comput. Chem. Eng. 14: 631–653
  3. Beer M. and Tavazoie S. (2004). Predicting gene expression from sequence. Cell 117: 185–198
  4. Bezdek J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York
  5. Brooke A., Kendrick D. and Meeraus A. (1988). GAMS: A User’s Guide. The Scientific Press, San Francisco, CA
  6. Carpenter G. and Grossberg S. (1990). ART3: hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3: 129–152
  7. Ciric A.R. and Floudas C.A. (1989). A retrofit approach of heat exchanger networks. Comput. Chem. Eng. 13: 703–715
  8. Claverie J. (1999). Computational methods for the identification of differential and coordinated gene expression. Human Mol. Genet. 8: 1821–1832
  9. Davies D.L. and Bouldin D.W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Machine Intell. 1(4): 224–227
  10. Dempster A.P., Laird N.M. and Rubin D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B. 39(1): 1–38
  11. DeRisi J.L., Iyer V.R. and Brown P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686
  12. Dhillon I.S. and Guan Y. (2003). Information theoretic clustering of sparse co-occurrence data. In: Proceedings of the Third IEEE International Conference on Data Mining (ICDM)
  13. Dunn J.C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernet. 3: 32–57
  14. Dunn J.C. (1974). Well separated clusters and optimal fuzzy partitions. J. Cybernet. 4: 95–104
  15. Duran M.A. and Odell P.L. (1974). Cluster Analysis: A Survey. Springer Verlag, New York
  16. Eisen M.B., Spellman P.T., Brown P.O. and Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. U.S.A. 95(25): 14863–14868
  17. Floudas C.A., Akrotirianakis I.G., Caratzoulas S., Meyer C.A. and Kallrath J. (2005). Global optimization in the 21st Century: advances and challenges. Comput. Chem. Eng. 29: 1185–1202
  18. Floudas C.A. (2000). Deterministic Global Optimization: Theory, Algorithms, and Applications. Kluwer Academic Publishers
  19. Floudas C.A. (1995). Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Oxford University Press
  20. Floudas C.A., Aggarwal A. and Ciric A.R. (1989). Global optimum search for nonconvex NLP and MINLP problems. Comput. Chem. Eng. 13(10): 1117–1132
  21. Floudas C.A. and Anastasiadis S.H. (1988). Synthesis of general distillation sequences with several multicomponent feeds and products. Chem. Eng. Sci. 43: 2407–2419
  22. Floudas C.A. and Grossmann I.E. (1987). Synthesis of flexible heat exchanger networks with uncertain flow rates and temperatures. Comput. Chem. Eng. 11: 319–336
  23. Geoffrion A.M. (1973). Generalized Benders decomposition. J. Optim. Theory Appl. 10(4): 237
  24. Goodman L. and Kruskal W. (1954). Measures of association for cross classifications. J. Am. Stat. Assoc. 49: 732–764
  25. Gower J.C. and Ross G.J.S. (1969). Minimum spanning trees and single-linkage cluster analysis. Appl. Stat. 18: 54–64
  26. Halkidi M., Batistakis Y. and Vazirgiannis M. (2002). Cluster validity methods: Part 1. SIGMOD Record 31(2): 40–45
  27. Hansen P. and Jaumard B. (1997). Cluster analysis and mathematical programming. Math. Program. 79: 191–215
  28. Hartigan J.A. (1975). Clustering Algorithms. John Wiley & Sons, New York
  29. Hartigan J.A. and Wong M.A. (1979). Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. J. Roy. St. C. 28: 100–108
  30. Herrero J., Valencia A. and Dopazo J. (2001). A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17(2): 126–136
  31. Heyer L.J., Kruglyak S. and Yooseph S. (1999). Exploring expression data: identification and analysis of co-expressed genes. Genome Res. 9: 1106–1115
  32. Hubert L. and Schultz J. (1976). Quadratic assignment as a general data-analysis strategy. Br. J. Math. Stat. Psychol. 29: 190–241
  33. Jaccard P. (1912). The distribution of flora in the alpine zone. New Phytol. 11: 37–50
  34. Jain A.K., Murty M.N. and Flynn P.J. (1999). Data clustering: a review. ACM Comput. Surv. 31(3): 264–323
  35. Jain A.K. and Dubes R.C. (1988). Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, Prentice-Hall, Inc., Englewood Cliffs, New Jersey
  36. Johnson R.E. (2001). The role of cluster analysis in assessing comparability under the US transfer pricing regulations. Business Economics (April)
  37. Jung Y., Park H., Du D. and Drake B.L. (2003). A decision criterion for the optimal number of clusters in hierarchical clustering. J. Global Optimiz. 25: 91–111
  38. Kirkpatrick S., Gelatt C.D. and Vecchi M.P. (1983). Optimization by simulated annealing. Science 220(4598): 671–680
  39. Kohonen T. (1984). Self Organization and Associative Memory. Springer Information Science Series, Springer Verlag, Berlin, Heidelberg, New York
  40. Kohonen T. (1997). Self-Organizing Maps. Springer Verlag, Berlin
  41. Kokossis A.C. and Floudas C.A. (1994). Optimization of complex reactor networks - II. Nonisothermal operation. Chem. Eng. Sci. 49: 1037–1051
  42. Leisch F., Weingessel A. and Dimitriadou E. (1998). Competitive learning for binary valued data. In: Niklasson L., Bodén M. and Ziemke T. (eds.) Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, pp. 779–784. Skövde, Sweden, Springer
  43. Likas A., Vlassis N. and Verbeek J.L. (2003). The global K-means clustering algorithm. Pattern Recogn. 36: 451–461
  44. Lin X., Floudas C., Wang Y. and Broach J.R. (2003). Theoretical and computational studies of the glucose signaling pathways in yeast using global gene expression data. Biotechnol. Bioeng. 84(7): 864–886
  45. Lukashin A.V. and Fuchs R. (2001). Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics 17(5): 405–414
  46. MacQueen J. (1967). Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297
  47. Metropolis N., Rosenbluth A., Rosenbluth M., Teller A. and Teller E.J. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1091
  48. Paules G.E. IV and Floudas C.A. (1989). APROS: Algorithmic development methodology for discrete-continuous optimization problems. Oper. Res. J. 37: 902–915
  49. Pauwels E.J. and Frederix G. (1999). Finding salient regions in images: non-parametric clustering for image segmentation and grouping. Comput. Vision Image Understand. 75: 73–85
  50. Pipenbacher P., Schliep A., Schneckener S., Schonhuth A., Schomburg D. and Schrader R. (2002). ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 18(Suppl 2): S182–S191
  51. Rand W.M. (1971). Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336): 846–850
  52. Rousseeuw P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comp. App. Math. 20: 53–65
  53. Ruspini E.H. (1969). A new approach to clustering. Inf. Control 15: 22–32
  54. Schneper L., Düvel K. and Broach J.R. (2004). Sense and sensibility: nutritional response and signal integration in yeast. Curr. Opin. Microbiol. 7(6): 624–630
  55. Sherali H.D. and Desai J. (2005a). A global optimization RLT-based approach for solving the hard clustering problem. J. Global Optimiz. 32(2): 281–306
  56. Sherali H.D. and Desai J. (2005b). A global optimization RLT-based approach for solving the fuzzy clustering problem. J. Global Optimiz. 33(4): 597–615
  57. Slonim N., Atwal G.S., Tkačik G. and Bialek W. (2005). Information based clustering. Proc. Nat. Acad. Sci. U.S.A. 102(51): 18297–18302
  58. Sokal R.R. and Michener C.D. (1958). A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38: 1409–1438
  59. Sorlie T., Tibshirani R., Parker J., Hastie T., Marron J.S., Nobel A., Deng S., Johnsen H., Pesich R., Geisler S., Demeter J., Perou C.M., Lonning P.E., Brown P.O., Borresen-Dale A.L. and Botstein D. (2003). Repeated observations of breast tumor subtypes in independent gene expression data sets. Proc. Nat. Acad. Sci. U.S.A. 100: 8418–8423
  60. Tishby N., Pereira F. and Bialek W. (1999). The information bottleneck method. In: Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377
  61. Troyanskaya O.G., Dolinski K., Owen A.B., Altman R.B. and Botstein D. (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Nat. Acad. Sci. U.S.A. 100: 8348–8353
  62. Wang Y., Pierce M., Schneper L., Guldal C.G., Zhang X., Tavazoie S. and Broach J.R. (2004). Ras and Gpa2 mediate one branch of a redundant glucose signaling pathway in yeast. PLoS Biol. 2(5): 610–622
  63. Weiler J., Gausepohl H., Hauser N., Jensen O.N. and Hoheisel J.D. (1997). Hybridization-based DNA screening on peptide nucleic acid (PNA) oligomer arrays. Nucleic Acids Res. 25: 2792–2799
  64. Wu Z. and Leahy R. (1993). An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 15(11): 1101–1113
  65. Xu R. and Wunsch D. (2005). Survey of clustering algorithms. IEEE Trans. Neural Networks 16(3): 645–678
  66. Zahn C.T. (1971). Graph theoretical methods for detecting and describing gestalt systems. IEEE Trans. Comput. C-20: 68–86
  67. Zhang B., Hsu M. and Dayal U. (1999). K-Harmonic Means – A Data Clustering Algorithm. Hewlett-Packard Research Laboratory Technical Report (June)
  68. Zhang B. (2000). Generalized K-Harmonic Means: Boosting in Unsupervised Learning. Hewlett-Packard Research Laboratory Technical Report (October)

Copyright information

© Springer Science+Business Media, Inc. 2007

Authors and Affiliations

  • Meng Piao Tan (1)
  • James R. Broach (2)
  • Christodoulos A. Floudas (1)
  1. Department of Chemical Engineering, Princeton University, Princeton, USA
  2. Department of Molecular Biology, Princeton University, Princeton, USA
