Abstract
Variable selection is a problem of increasing interest in many areas of multivariate statistics, such as classification, clustering and regression. In contrast to supervised classification, variable selection in cluster analysis is much more difficult because usually nothing is known about the true class structure. Moreover, in clustering, variable selection is closely tied to the main problem of determining the number of clusters K inherent in the data. Here we present a very general bottom-up approach to variable selection in clustering that starts with univariate investigations of stability. The hope is that the structure of interest is contained in only a small subset of the variables. "Very general" means that we use only non-parametric resampling techniques for validation, looking for clusters that can be reproduced to a high degree under resampling schemes. Our proposed technique can therefore be applied to almost any cluster analysis method.
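The univariate first step of such a bottom-up scheme might be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the abstract does not name a clustering method, a stability measure or a resampling scheme. The sketch assumes a one-dimensional k-means, the adjusted Rand index as the agreement measure, and a bootstrap in which cluster centers fitted on each resample are used to re-assign the original observations. All function names and the toy data are hypothetical.

```python
import random
from math import comb


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on a single variable; returns the k centers."""
    s = sorted(values)
    # Deterministic start: centers at evenly spaced sample quantiles.
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[j].append(v)
        # An empty group keeps its previous center.
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return centers


def assign(values, centers):
    """Label each value with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c])) for v in values]


def adjusted_rand(a, b):
    """Adjusted Rand index of two labelings of the same observations."""
    cells, rows, cols = {}, {}, {}
    for x, y in zip(a, b):
        cells[(x, y)] = cells.get((x, y), 0) + 1
        rows[x] = rows.get(x, 0) + 1
        cols[y] = cols.get(y, 0) + 1
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(len(a), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # both partitions trivial
        return 1.0
    return (index - expected) / (max_index - expected)


def univariate_stability(values, k=2, n_boot=50, seed=0):
    """Mean agreement between the reference partition and bootstrap partitions."""
    rng = random.Random(seed)
    reference = assign(values, kmeans_1d(values, k))
    scores = []
    for _ in range(n_boot):
        resample = [rng.choice(values) for _ in values]  # draw with replacement
        boot = assign(values, kmeans_1d(resample, k))    # re-label the original data
        scores.append(adjusted_rand(reference, boot))
    return sum(scores) / len(scores)


# Toy data (assumed): one variable with two well-separated groups, one pure noise.
rng = random.Random(1)
structured = [rng.gauss(0, 1) for _ in range(60)] + [rng.gauss(8, 1) for _ in range(60)]
noise = [rng.uniform(0, 1) for _ in range(120)]
data = {"structured": structured, "noise": noise}

stability = {name: univariate_stability(col) for name, col in data.items()}
ranking = sorted(stability, key=stability.get, reverse=True)
print(ranking, stability)
```

In a bottom-up search, the most stable single variables would then serve as starting points for building larger variable subsets. How candidates are combined and when the search stops is left open here, since the abstract does not specify it.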
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Mucha, HJ., Bartel, HG. (2016). Bottom-Up Variable Selection in Cluster Analysis Using Bootstrapping: A Proposal. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_11
DOI: https://doi.org/10.1007/978-3-319-25226-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics (R0)