Abstract
rCOSA is a software package interfaced to the R language. It implements statistical techniques for clustering objects on subsets of attributes (COSA) in multivariate data. The main output of COSA is a dissimilarity matrix that can subsequently be analyzed with a variety of proximity analysis methods. Our package extends the original COSA software (Friedman and Meulman, 2004) by adding functions for hierarchical clustering methods, least squares multidimensional scaling, partitional clustering, and data visualization. Although many publications cite the COSA paper by Friedman and Meulman (2004), only a small number of them actually use the COSA program. This can be attributed to the original implementation being difficult to install and use; moreover, the available software is out of date. Here, we introduce an up-to-date software package and clear guidance for this advanced technique. The software package and related links are freely available at: https://github.com/mkampert/rCOSA.
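As a rough illustration of what analyzing a dissimilarity matrix with a proximity method can look like, the sketch below runs average-linkage hierarchical clustering on a small matrix. It is plain Python, not R, and it neither uses rCOSA nor reproduces its interface: the function `average_linkage` and the toy matrix `D` are hypothetical, and the values in `D` are made up for the example rather than produced by COSA.

```python
from itertools import combinations


def average_linkage(D, n_clusters):
    """Agglomerative clustering on a symmetric dissimilarity matrix.

    D is a list of lists with D[i][j] the dissimilarity between objects
    i and j. Clusters are merged until n_clusters remain.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average dissimilarity over all cross-cluster object pairs.
            total = sum(D[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; a < b, so deleting b leaves a in place.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy dissimilarities: objects 0-2 are mutually close,
# as are objects 3-4 (illustrative values only, not COSA output).
D = [
    [0.00, 0.10, 0.20, 0.90, 0.80],
    [0.10, 0.00, 0.15, 0.85, 0.90],
    [0.20, 0.15, 0.00, 0.95, 0.80],
    [0.90, 0.85, 0.95, 0.00, 0.10],
    [0.80, 0.90, 0.80, 0.10, 0.00],
]

groups = average_linkage(D, n_clusters=2)
print(groups)
```

Because the function only needs the pairwise dissimilarities, the same matrix could equally be passed to multidimensional scaling or a partitional method, which is the kind of downstream analysis the abstract refers to.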
References
AITCHISON, J. (1986), The Statistical Analysis of Compositional Data, London: Chapman and Hall.
AMORIM, R.C. (2015), “Feature Relevance in Ward’s Hierarchical Clustering Using the Lp Norm”, Journal of Classification, 32, 46–62.
ANDREWS, J.L., and MCNICHOLAS, P.D. (2014), “Variable Selection for Clustering and Classification”, Journal of Classification, 31(2), 136–153.
BOUVEYRON, C., and BRUNET, C. (2012), “Simultaneous Model-Based Clustering and Visualization in the Fisher Discriminative Subspace”, Statistics and Computing, 22(1), 301–324.
DAMIAN, D., ORESICS, M., VERHEIJ, E., MEULMAN, J. J., FRIEDMAN, J., ADOURIAN, A., MOREL, N., SMILDE, A., and VAN DER GREEF, J. (2007), “Applications of a New Subspace Clustering Algorithm (COSA) in Medical Systems Biology”, Metabolomics, 3(1), 69–77.
DE LEEUW, J., and HEISER, W.J. (1982), “Theory of Multidimensional Scaling”, in Handbook of Statistics (Vol. 2), eds. P. Krishnaiah and L. Kanal, Amsterdam, The Netherlands: North-Holland, pp. 285–316.
DE SARBO, W., CARROLL, J., CLARK, L., and GREEN, P. (1984), “Synthesized Clustering: A Method for Amalgamating Clustering Bases with Differential Weighting of Variables”, Psychometrika, 49, 57–78.
DE SOETE, G. (1985), “OVWTRE: A Program for Optimal Variable Weighting for Ultrametric and Additive Tree Fitting”, Journal of Classification, 5, 101–104.
DE SOETE, G., DE SARBO, W., and CARROLL, J. (1985), “Optimal Variable Weighting for Hierarchical Clustering: An Alternating Least-Squares Algorithm”, Journal of Classification, 2, 173–192.
FRIEDMAN, J.H., and MEULMAN, J.J. (2004), “Clustering Objects on Subsets of Attributes”, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(part 4), 815–849.
GOWER, J.C. (1966), “Some Distance Properties of Latent Roots and Vector Methods Used in Multivariate Analysis”, Biometrika, 53, 325–338.
HEISER, W.J. (1995), “Convergent Computation by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis”, in Recent Advances in Descriptive Multivariate Analysis, ed. W. Krzanowski, Oxford: Oxford University Press, pp. 157–189.
JAIN, A. (2010), “Data Clustering: 50 Years Beyond K-Means”, Pattern Recognition Letters, 31(8), 651–666.
KOHONEN, T. (2001), Self-Organizing Maps, Berlin, Heidelberg: Springer Verlag.
LEE, J., LENDASSE, A., and VERLEYSEN, M. (2004), “Nonlinear Projection with Curvilinear Distances: Isomap Versus Curvilinear Distance Analysis”, Neurocomputing, 57, 49–76.
MEULMAN, J.J. (1986), A Distance Approach to Nonlinear Multivariate Analysis, Leiden: DSWO Press.
MEULMAN, J. (1992), “The Integration of Multidimensional Scaling and Multivariate Analysis with Optimal Transformations”, Psychometrika, 57, 539–565.
R CORE TEAM (2014), “R: A Language and Environment for Statistical Computing”, R Foundation for Statistical Computing, Vienna, Austria, www.R-project.org/.
SAMMON, J.W. (1969), “A Nonlinear Mapping for Data Structure Analysis”, IEEE Transactions on Computers, C-18, 401–409.
SEBESTYEN, G.S. (1962), Decision-Making Processes in Pattern Recognition, New York: The Macmillan Company.
STEINLEY, D., and BRUSCO, M. (2008), “Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures”, Psychometrika, 73(1), 46–62.
SZEPANNEK, G. (2013), “orclus: ORCLUS Subspace Clustering”, R package version 0.2-5, CRAN.R-project.org/package=orclus.
TORGERSON, W. (1952), “Multidimensional Scaling: I. Theory and Method”, Psychometrika, 17, 713–726.
WARD JR, J.H. (1963), “Hierarchical Grouping to Optimize an Objective Function”, Journal of the American Statistical Association, 58(301), 236–244.
WILLIAMS, G., HUANG, J.Z., CHEN, X., WANG, Q., and XIAO, L. (2014), “wskm: Weighted k-Means Clustering”, R package version 1.4.19, CRAN.R-project.org/package=wskm.
WITTEN, D.M., and TIBSHIRANI, R. (2010), “A Framework for Feature Selection in Clustering”, Journal of the American Statistical Association, 105(2), 713–726.
YOUNG, F., and HOUSEHOLDER, A. (1938), “Discussion of a Set of Points in Terms of Their Mutual Distances”, Psychometrika, 3, 19–22.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Kampert, M.M., Meulman, J.J. & Friedman, J.H. rCOSA: A Software Package for Clustering Objects on Subsets of Attributes. J Classif 34, 514–547 (2017). https://doi.org/10.1007/s00357-017-9240-z
DOI: https://doi.org/10.1007/s00357-017-9240-z