V-cluster algorithm: A new algorithm for clustering molecules based upon numeric data

Xu, Jun; Zhang, Qiang; Shih, Chen-Kon

doi:10.1007/s11030-006-9023-7

Full-length paper
Published: 01 August 2006

Volume 10, pages 463–478, (2006)

Molecular Diversity

Jun Xu¹,²,³,
Qiang Zhang¹ &
Chen-Kon Shih¹

Summary

Clustering molecules based on numeric data, such as gene-expression data, physicochemical properties, or theoretical data, is very important in drug discovery and other life sciences. Most approaches use hierarchical clustering algorithms, non-hierarchical algorithms (for example, K-means and K-nearest neighbor), and other similar methods (for example, the Self-Organizing Map (SOM) and the Support Vector Machine (SVM)). These approaches are not robust (their results are not consistent) and are computationally expensive. This paper reports a new, non-hierarchical algorithm called the V-Cluster (V stands for vector) Algorithm. This algorithm produces rational, robust results while reducing computing complexity. Similarity measurement and data normalization rules are also discussed, along with case studies. When molecules are represented as a set of numeric vectors, the V-Cluster Algorithm clusters the molecules in three steps: (1) ranking the vectors based upon their overall intensity levels, (2) computing cluster centers based upon neighboring density, and (3) assigning molecules to their nearest cluster center. The program is written in C/C++ and runs on Windows 95/NT and UNIX platforms. With the V-Cluster program, the user can quickly complete the clustering process and easily examine the results by means of thumbnail graphs, superimposed intensity curves of vectors, and spreadsheets. Multi-functional query tools have also been implemented.
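
The three steps named in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the summary does not give the actual definitions, so every specific here is an assumption made for illustration: "overall intensity" is taken to be the Euclidean norm, "neighboring density" is taken to be the number of vectors within a fixed `radius`, a center is taken to be the mean of a dense vector and its neighbors, and the names `v_cluster_sketch`, `radius`, and `min_neighbors` are hypothetical.

```python
import math


def intensity(v):
    """Overall intensity of a vector (assumption: Euclidean norm)."""
    return math.sqrt(sum(x * x for x in v))


def v_cluster_sketch(vectors, radius, min_neighbors=1):
    """Illustrative three-step clustering; returns (centers, labels).

    All thresholds and definitions are assumptions, not the published method.
    """
    if not vectors:
        return [], []

    # Step 1: rank the vectors by overall intensity, highest first.
    ranked = sorted(range(len(vectors)),
                    key=lambda i: intensity(vectors[i]), reverse=True)

    # Step 2: walk the ranking and derive centers from neighboring density.
    centers = []
    for i in ranked:
        v = vectors[i]
        # Neighborhood of v (includes v itself, at distance 0).
        neighbors = [w for w in vectors if math.dist(v, w) <= radius]
        if len(neighbors) - 1 < min_neighbors:
            continue  # too sparse to seed a cluster
        if any(math.dist(v, c) <= radius for c in centers):
            continue  # already covered by an existing center
        dim = len(v)
        centers.append(tuple(sum(w[k] for w in neighbors) / len(neighbors)
                             for k in range(dim)))

    # Fallback (assumption): if nothing is dense, use the top-ranked vector.
    if not centers:
        centers.append(tuple(vectors[ranked[0]]))

    # Step 3: assign every vector to its nearest center.
    labels = [min(range(len(centers)),
                  key=lambda c: math.dist(v, centers[c]))
              for v in vectors]
    return centers, labels
```

Calling `v_cluster_sketch(data, radius=1.0)` on two well-separated groups of points yields one center per group. Because the ranking and the density test are deterministic, repeated runs on the same input give the same partition. That is the kind of consistency the summary contrasts with less robust methods, although this sketch makes no claim about the real algorithm's behavior or cost.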

Author information

Authors and Affiliations

Boehringer Ingelheim Pharmaceuticals, Inc., 900 Ridgebury Road, Ridgefield, Connecticut, 06877-0368, USA
Jun Xu, Qiang Zhang & Chen-Kon Shih
Research Center of Modernization of Chinese Traditional Medicine, Department of Chemistry, Central South University, Lu Shan Road, Chang Sha, 410083, P.R. China
Jun Xu
Discovery Partners International, Inc., 9640 Towne Centre Drive, San Diego, CA, 92121, U.S.A.
Jun Xu

Corresponding author

Correspondence to Jun Xu.

About this article

Cite this article

Xu, J., Zhang, Q. & Shih, CK. V-cluster algorithm: A new algorithm for clustering molecules based upon numeric data. Mol Divers 10, 463–478 (2006). https://doi.org/10.1007/s11030-006-9023-7

Received: 03 December 2005
Accepted: 06 March 2006
Published: 01 August 2006
Issue Date: August 2006
DOI: https://doi.org/10.1007/s11030-006-9023-7
