An Approach to Model Selection in Spectral Clustering with Application to the Writing Style Determination Problem

Avros, Renata; Soffer, Avi; Volkovich, Zeev; Yahalom, Orly

doi:10.1007/978-3-642-54105-6_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 415)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

We study an open problem in spectral clustering: automatically determining the number of clusters. We generalize the method for selecting the scale parameter proposed in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with the distance learning methodology. Values of the scale parameter, estimated by clustering samples drawn from the data, serve as a cluster stability indicator: the number of clusters whose estimates have the most concentrated distribution is accepted as the “correct” number of clusters. Several numerical experiments were conducted to evaluate the proposed spectral clustering approach and its application to the writing style determination problem. The results demonstrate the high potential of the proposed method.
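The idea summarized in the abstract can be illustrated with a rough sketch. The code below is not the authors' implementation: it assumes the standard NJW embedding with a Gaussian affinity, picks the scale parameter from a fixed grid by minimum k-means distortion (the selection rule suggested in the original NJW paper), and uses the standard deviation of the selected values across random subsamples as a stand-in for "concentration". The function names (`kmeans`, `njw_embed`, `njw_cluster`, `sigma_spread`), the grid of scales, the farthest-point initialization, and the sample sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def kmeans(Y, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization.
    Returns the labels and the total within-cluster distortion."""
    centers = [Y[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = Y[labels == j].mean(axis=0)
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(len(Y)), labels].sum()

def njw_embed(X, k, sigma):
    """NJW embedding: rows of the top-k eigenvectors of D^{-1/2} A D^{-1/2},
    normalized to unit length."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    V = vecs[:, -k:]
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

def njw_cluster(X, k, sigmas):
    """Try every scale in the grid; keep the one with the tightest clusters."""
    best = None
    for sigma in sigmas:
        labels, distortion = kmeans(njw_embed(X, k, sigma), k)
        if best is None or distortion < best[2]:
            best = (labels, sigma, distortion)
    return best[0], best[1]

def sigma_spread(X, k, sigmas, rng, n_samples=10, sample_size=40):
    """Spread (standard deviation) of the scale selected on random subsamples.
    Using the standard deviation as the concentration measure is an assumption."""
    chosen = []
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        chosen.append(njw_cluster(X[idx], k, sigmas)[1])
    return float(np.std(chosen))
```

Under these assumptions, one would compute `sigma_spread(X, k, sigmas, rng)` for each candidate `k` and accept the `k` with the smallest spread. On a coarse grid several candidates may tie at zero spread, so a finer grid or a different concentration measure may be needed in practice.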

Author information

Authors and Affiliations

Department of Software Engineering, ORT-Braude College of Engineering, Karmiel, Israel
Renata Avros, Avi Soffer, Zeev Volkovich & Orly Yahalom

Editor information

Editors and Affiliations

IST - Technical University of Lisbon, Av.Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Jan L. G. Dietz
Informatics Research Centre, Henley Business School, University of Reading, RG6 6UD, UK
Kecheng Liu
INSTICC and IPS, Estefanilha, Setúbal, Portugal
Joaquim Filipe

Cite this paper

Avros, R., Soffer, A., Volkovich, Z., Yahalom, O. (2013). An Approach to Model Selection in Spectral Clustering with Application to the Writing Style Determination Problem. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_2

DOI: https://doi.org/10.1007/978-3-642-54105-6_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54104-9
Online ISBN: 978-3-642-54105-6
eBook Packages: Computer Science, Computer Science (R0)
