Abstract
We study an open problem in spectral clustering: automatically determining the number of clusters. We generalize the method for selecting the scale parameter proposed in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with distance learning methodology. Values of the scale parameter, estimated by clustering samples drawn from the data, are treated as a cluster stability indicator: the number of clusters corresponding to the most concentrated distribution of these values is accepted as the “correct” number of clusters. Several numerical experiments were conducted to evaluate the proposed spectral clustering approach and its application to the writing style determination problem. The reported results demonstrate the high potential of the proposed method.
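The selection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the scale parameter is picked from a fixed grid by the NJW heuristic (smallest k-means distortion in the spectral embedding), concentration is measured by the standard deviation of the logarithm of the chosen values, and the names `njw_embedding`, `kmeans_distortion`, `best_sigma` and `select_k` are hypothetical.

```python
import numpy as np

def njw_embedding(X, k, sigma):
    """Row-normalised top-k eigenvectors of the NJW normalised affinity matrix."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                      # NJW sets A_ii = 0
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    Y = vecs[:, -k:]                              # k largest eigenvectors
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return Y / np.maximum(norms, 1e-12)

def kmeans_distortion(Y, k, rng, n_iter=50):
    """Plain Lloyd k-means; returns the mean squared distance to the nearest centre."""
    centres = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = Y[labels == j].mean(axis=0)
    dist = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return dist.min(axis=1).mean()

def best_sigma(X, k, sigmas, rng):
    """NJW-style choice: the grid value giving the tightest clusters in the embedding."""
    scores = [kmeans_distortion(njw_embedding(X, k, s), k, rng) for s in sigmas]
    return sigmas[int(np.argmin(scores))]

def select_k(X, k_values, sigmas, n_samples=20, sample_size=60, seed=0):
    """Return the k whose sampled scale parameters are most concentrated.

    Concentration is measured here by the std of log(sigma); this measure
    is an assumption of the sketch, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    spread = {}
    for k in k_values:
        chosen = []
        for _ in range(n_samples):
            idx = rng.choice(len(X), size=sample_size, replace=False)
            chosen.append(best_sigma(X[idx], k, sigmas, rng))
        spread[k] = float(np.std(np.log(chosen)))
    return min(spread, key=spread.get), spread

# Toy usage on three synthetic Gaussian blobs (illustrative data only).
demo_rng = np.random.default_rng(42)
demo_X = np.vstack([demo_rng.normal(c, 0.3, size=(50, 2))
                    for c in ((0, 0), (4, 0), (2, 4))])
demo_k, demo_spread = select_k(demo_X, [2, 3, 4], [0.3, 0.5, 1.0, 2.0],
                               n_samples=10, sample_size=60)
print(demo_k, demo_spread)
```

The grid of candidate scale values should be matched to the scale of the data: when the value is far too small the affinity matrix degenerates and the distortion criterion becomes uninformative. A single random k-means initialization is used for brevity; several restarts would make the per-sample estimate less noisy.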
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Avros, R., Soffer, A., Volkovich, Z., Yahalom, O. (2013). An Approach to Model Selection in Spectral Clustering with Application to the Writing Style Determination Problem. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_2
DOI: https://doi.org/10.1007/978-3-642-54105-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54104-9
Online ISBN: 978-3-642-54105-6
eBook Packages: Computer Science (R0)