Abstract
Factor clustering methods have been proposed for clustering large data sets. Among them, factor probabilistic distance clustering (FPDC) shows interesting performance. The method is based on two main steps: a Tucker3 decomposition of the distance array, followed by probabilistic distance (PD) clustering on the resulting factors. The aim of this paper is to apply FPDC to large behavioral and social data sets in order to obtain homogeneous and well-separated clusters of individuals. The goal is to evaluate the stability and the robustness of the method on these data sets. Stability refers to the invariance of the results across iterations of the method. Robustness refers to the sensitivity of the method to errors in the data. Both characteristics are evaluated using bootstrap resampling.
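As a rough illustration of the bootstrap evaluation described in the abstract, the sketch below runs a basic PD-clustering iteration on bootstrap resamples and compares each resulting partition with a reference partition. It is not the authors' implementation: the Tucker3 step of FPDC is omitted, the Rand index is an assumed agreement measure chosen here for simplicity, and the function names (`pd_clustering`, `membership`, `rand_index`, `bootstrap_stability`) are hypothetical.

```python
import numpy as np

def membership(X, centers):
    """PD-clustering memberships: p_k(x) * d_k(x) is constant over k, rows sum to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True)

def pd_clustering(X, k, n_iter=100, tol=1e-6, seed=0):
    """Basic PD-clustering on raw data (no Tucker3 factor step)."""
    rng = np.random.default_rng(seed)
    # Start from distinct rows so two centers never coincide at initialization.
    uniq = np.unique(X, axis=0)
    centers = uniq[rng.choice(len(uniq), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        p = membership(X, centers)
        u = p**2 / d                      # weights used in the center update
        new_centers = (u.T @ X) / u.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, membership(X, centers)

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree (assumed measure)."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[iu] == same_b[iu]))

def bootstrap_stability(X, k, n_boot=20, seed=0):
    """Cluster bootstrap resamples and compare each result with a reference partition."""
    rng = np.random.default_rng(seed)
    ref_centers, ref_p = pd_clustering(X, k, seed=seed)
    ref_labels = ref_p.argmax(axis=1)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        centers_b, _ = pd_clustering(X[idx], k, seed=seed + b + 1)
        # Assign the original points to the centers fitted on the resample.
        labels_b = membership(X, centers_b).argmax(axis=1)
        scores.append(rand_index(ref_labels, labels_b))
    return np.array(scores)

# Usage on synthetic data (two Gaussian groups, purely illustrative).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 1.0, (60, 3)), rng.normal(6.0, 1.0, (60, 3))])
demo_scores = bootstrap_stability(X_demo, k=2, n_boot=10)
print("mean agreement with reference partition:", demo_scores.mean())
```

Agreement values close to 1 across resamples would suggest a stable partition under this assumed measure. A robustness check along the same lines could perturb the data (for example by adding noise or outliers) before reclustering.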
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Tortora, C., Marino, M. (2014). Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets. In: Vicari, D., Okada, A., Ragozini, G., Weihs, C. (eds) Analysis and Modeling of Complex Data in Behavioral and Social Sciences. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-06692-9_29
DOI: https://doi.org/10.1007/978-3-319-06692-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06691-2
Online ISBN: 978-3-319-06692-9
eBook Packages: Mathematics and Statistics (R0)