Summary
Clustering molecules based on numeric data, such as gene-expression data, physicochemical properties, or theoretical data, is important in drug discovery and other life sciences. Most approaches use hierarchical clustering algorithms, non-hierarchical algorithms (for example, K-means and K-nearest neighbor), or other related methods (for example, the Self-Organizing Map (SOM) and the Support Vector Machine (SVM)). These approaches are not robust (their results are not consistent) and are computationally expensive. This paper reports a new non-hierarchical algorithm called the V-Cluster Algorithm (V stands for vector). The algorithm produces rational, robust results while reducing computational complexity. Similarity measurement and data normalization rules are also discussed, along with case studies. When molecules are represented as a set of numeric vectors, the V-Cluster Algorithm clusters them in three steps: (1) ranking the vectors by their overall intensity levels, (2) computing cluster centers based on neighboring density, and (3) assigning each molecule to its nearest cluster center. The program is written in C/C++ and runs on Windows 95/NT and UNIX platforms. With the V-Cluster program, the user can quickly complete the clustering process and easily examine the results using thumbnail graphs, superimposed intensity curves of vectors, and spreadsheets. Multi-functional query tools have also been implemented.
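The three steps named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the summary does not define the quantities involved, so several choices here are assumptions made only for illustration: "overall intensity" is taken to be the Euclidean norm of a vector, "neighboring density" is taken to be the number of vectors within a user-supplied `radius`, and centers are picked greedily in order of decreasing density. The names `intensity`, `distance`, `v_cluster_sketch`, and `radius` are hypothetical.

```python
import math


def intensity(v):
    """Overall intensity of a vector (assumption: Euclidean norm)."""
    return math.sqrt(sum(x * x for x in v))


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def v_cluster_sketch(vectors, radius):
    """Illustrative three-step clustering; returns (center indices, labels)."""
    n = len(vectors)

    # Step 1: rank the vectors by overall intensity (ascending).
    order = sorted(range(n), key=lambda i: intensity(vectors[i]))

    # Step 2: compute cluster centers from neighboring density.
    # Assumption: density = number of vectors within `radius` (self included).
    density = {
        i: sum(1 for j in range(n) if distance(vectors[i], vectors[j]) <= radius)
        for i in order
    }
    centers = []
    # sorted() is stable, so ties in density keep the intensity ranking.
    for i in sorted(order, key=lambda i: -density[i]):
        # Accept a candidate only if it is not close to an existing center.
        if all(distance(vectors[i], vectors[c]) > radius for c in centers):
            centers.append(i)

    # Step 3: assign every vector to its nearest cluster center.
    labels = [
        min(range(len(centers)), key=lambda k: distance(v, vectors[centers[k]]))
        for v in vectors
    ]
    return centers, labels


if __name__ == "__main__":
    data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.0], [5.0, 5.0], [5.1, 4.9]]
    centers, labels = v_cluster_sketch(data, radius=0.5)
    print(centers)  # [2, 3]
    print(labels)   # [0, 0, 0, 1, 1]
```

Because the sketch uses no random initialization, repeated runs on the same input return the same centers and labels, which is the sense of "robust" used in the summary. Computing the density as written takes a number of distance evaluations quadratic in the number of vectors, so the sketch says nothing about the reduced complexity claimed for the actual algorithm.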
Cite this article
Xu, J., Zhang, Q. & Shih, CK. V-cluster algorithm: A new algorithm for clustering molecules based upon numeric data. Mol Divers 10, 463–478 (2006). https://doi.org/10.1007/s11030-006-9023-7
DOI: https://doi.org/10.1007/s11030-006-9023-7