V-cluster algorithm: A new algorithm for clustering molecules based upon numeric data

  • Full-length paper
Molecular Diversity

Summary

Clustering molecules based on numeric data, such as gene-expression data, physicochemical properties, or theoretical data, is very important in drug discovery and other life sciences. Most approaches use hierarchical clustering algorithms, non-hierarchical algorithms (for example, K-means and K-nearest neighbor), and other similar methods (for example, the Self-Organizing Map (SOM) and the Support Vector Machine (SVM)). These approaches are non-robust (their results are not consistent) and computationally expensive. This paper reports a new, non-hierarchical algorithm called the V-Cluster Algorithm (V stands for vector). This algorithm produces rational, robust results while reducing computational complexity. Similarity measurement and data normalization rules are also discussed, along with case studies. When molecules are represented as a set of numeric vectors, the V-Cluster Algorithm clusters the molecules in three steps: (1) ranking the vectors by their overall intensity levels, (2) computing cluster centers based on neighboring density, and (3) assigning molecules to their nearest cluster center. The program is written in C/C++ and runs on Windows 95/NT and UNIX platforms. With the V-Cluster program, the user can quickly complete the clustering process and easily examine the results using thumbnail graphs, superimposed intensity curves of vectors, and spreadsheets. Multi-functional query tools have also been implemented.
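
The three steps listed in the summary can be illustrated with a minimal Python sketch. The summary does not give the actual rules, so everything specific here is an assumption made for illustration: "overall intensity" is taken to be the sum of a vector's components, "neighboring density" the number of other vectors within a user-chosen `radius`, similarity the Euclidean distance, and normalization a per-column min-max scaling. The names `normalize` and `v_cluster_sketch` are hypothetical and do not come from the V-Cluster program.

```python
import math


def normalize(vectors):
    """Scale each column to [0, 1] (assumed rule; the paper's own rules may differ)."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(v, lo, hi)]
        for v in vectors
    ]


def v_cluster_sketch(vectors, radius):
    """Return (center indices, cluster label per vector) for a list of numeric vectors."""
    n = len(vectors)

    # Step 1 (assumed): rank vectors by overall intensity, here the component sum.
    order = sorted(range(n), key=lambda i: sum(vectors[i]), reverse=True)

    # Step 2 (assumed): density = number of other vectors within `radius`.
    density = {
        i: sum(
            1
            for j in range(n)
            if j != i and math.dist(vectors[i], vectors[j]) <= radius
        )
        for i in range(n)
    }
    # Visit vectors from densest to sparsest. sorted() is stable, so ties keep
    # the intensity ranking from step 1 and the result is deterministic.
    centers = []
    for i in sorted(order, key=lambda i: -density[i]):
        # Accept a vector as a center only if it is outside every existing center's radius.
        if all(math.dist(vectors[i], vectors[c]) > radius for c in centers):
            centers.append(i)

    # Step 3: assign every vector to its nearest center.
    labels = [
        min(range(len(centers)), key=lambda k: math.dist(v, vectors[centers[k]]))
        for v in vectors
    ]
    return centers, labels


if __name__ == "__main__":
    # Hypothetical data: two well-separated groups of three vectors each.
    data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
    centers, labels = v_cluster_sketch(data, radius=1.0)
    print(centers, labels)
```

Because the sketch has no random initialization, repeated runs on the same input give the same clusters. This is only meant to show how a ranking-based procedure can avoid the run-to-run variation mentioned above; it says nothing about how the published algorithm achieves its robustness.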



Author information

Corresponding author

Correspondence to Jun Xu.

About this article

Cite this article

Xu, J., Zhang, Q. & Shih, CK. V-cluster algorithm: A new algorithm for clustering molecules based upon numeric data. Mol Divers 10, 463–478 (2006). https://doi.org/10.1007/s11030-006-9023-7

  • DOI: https://doi.org/10.1007/s11030-006-9023-7
