Abstract
The goal of model-order selection is to select the model variant that generalizes best from training data to unseen test data. In unsupervised learning, where no labels are available, computing the generalization error of a solution poses a conceptual problem, which we address in this paper. We formulate the principle of “minimum transfer costs” for model-order selection. This principle renders the concept of cross-validation applicable to unsupervised learning problems. As a substitute for labels, we introduce a mapping from objects of the training set to objects of the test set, which enables the transfer of training solutions. We explain and investigate our method by applying it to well-known problems such as singular-value decomposition, correlation clustering, Gaussian mixture models, and k-means clustering. Our principle finds the optimal model complexity in controlled experiments and in real-world problems such as image denoising, role mining, and detection of misconfigurations in access-control data.
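The transfer idea described in the abstract can be illustrated with a minimal sketch for k-means. It is not the authors' implementation, and several choices are assumptions made for illustration: the mapping from test to training objects is a nearest-neighbor mapping, the cost is the mean squared distance to the transferred centroid, and the names `kmeans`, `transfer_cost`, and `select_order` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment so that labels match the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def transfer_cost(X_train, X_test, k):
    """Cost of the training solution when transferred to the test set."""
    centroids, train_labels = kmeans(X_train, k)
    # Assumed mapping: each test object is mapped to its nearest training object.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest_train = d2.argmin(axis=1)
    # Each test object inherits the cluster of its mapped training object.
    transferred = train_labels[nearest_train]
    return float(((X_test - centroids[transferred]) ** 2).sum(axis=1).mean())


def select_order(X_train, X_test, candidates):
    """Return (best_k, costs), where best_k minimizes the transfer cost."""
    costs = {k: transfer_cost(X_train, X_test, k) for k in candidates}
    return min(costs, key=costs.get), costs


# Synthetic demo data: three well-separated Gaussian blobs (an assumption).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
rng.shuffle(X)  # shuffle rows before splitting into two halves
X_train, X_test = X[:90], X[90:]

best_k, costs = select_order(X_train, X_test, candidates=range(1, 7))
print(best_k, costs)
```

In this sketch a single train/test split is used; repeating the split and averaging the transfer costs, as in ordinary cross-validation, would give a less noisy estimate.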
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Frank, M., Chehreghani, M.H., Buhmann, J.M. (2011). The Minimum Transfer Cost Principle for Model-Order Selection. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science(), vol 6911. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23780-5_37
DOI: https://doi.org/10.1007/978-3-642-23780-5_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23779-9
Online ISBN: 978-3-642-23780-5
eBook Packages: Computer Science, Computer Science (R0)