An Approach to Model Selection in Spectral Clustering with Application to the Writing Style Determination Problem

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2012)

Abstract

We study an open problem in spectral clustering: automatically determining the number of clusters. We generalize the method for selecting the scale parameter offered in the Ng-Jordan-Weiss (NJW) algorithm and reveal a connection with the distance learning methodology. Values of the scale parameter, estimated by clustering samples drawn from the data, are treated as a cluster stability indicator: the number of clusters whose estimates have the most concentrated distribution is accepted as the “correct” number of clusters. Several numerical experiments evaluate the proposed spectral clustering approach and its application to the writing style determination problem. The reported results indicate that the proposed method has high potential.
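
As a rough illustration of the procedure summarized above, the following Python sketch selects a scale parameter per subsample and then compares how concentrated the selected values are for each candidate number of clusters. It is not the authors' implementation. The rule of picking the scale that gives the smallest k-means distortion in the embedded space follows the NJW algorithm; the function names, the candidate grid `sigmas`, the subsampling fraction `frac`, and the use of the coefficient of variation as the concentration measure are assumptions made here for illustration only.

```python
import numpy as np


def njw_embedding(X, k, sigma):
    """NJW embedding: top-k eigenvectors of D^-1/2 A D^-1/2, rows scaled to unit length."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))   # Gaussian affinity with scale sigma
    np.fill_diagonal(A, 0.0)
    d = np.maximum(A.sum(axis=1), 1e-12)   # guard against isolated points
    L = A / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L)            # eigenvalues come back in ascending order
    Y = vecs[:, -k:]
    norms = np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return Y / norms


def kmeans_distortion(Y, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares over a few random k-means restarts."""
    best = np.inf
    for _ in range(n_init):
        centers = Y[rng.choice(len(Y), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = dist.argmin(axis=1)
            new = np.array([
                Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new, centers):
                break
            centers = new
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        best = min(best, dist.min(axis=1).sum())
    return best


def select_sigma(X, k, sigmas, rng):
    """NJW-style choice: the scale whose embedding yields the tightest clusters."""
    scores = [kmeans_distortion(njw_embedding(X, k, s), k, rng) for s in sigmas]
    return sigmas[int(np.argmin(scores))]


def estimate_k(X, candidates, sigmas, n_samples=10, frac=0.7, seed=0):
    """Return the candidate k whose selected scales vary least across subsamples.

    Assumption: concentration is measured by the coefficient of variation.
    """
    rng = np.random.default_rng(seed)
    m = min(len(X), max(int(frac * len(X)), max(candidates) + 1))
    spread = {}
    for k in candidates:
        chosen = np.array([
            select_sigma(X[rng.choice(len(X), size=m, replace=False)], k, sigmas, rng)
            for _ in range(n_samples)
        ])
        spread[k] = chosen.std() / chosen.mean()
    return min(spread, key=spread.get), spread


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    # Two synthetic, well-separated groups of points (illustrative data only).
    X = np.vstack([demo_rng.normal(0.0, 0.3, (40, 2)),
                   demo_rng.normal(4.0, 0.3, (40, 2))])
    k_hat, spread = estimate_k(X, [2, 3, 4], np.linspace(0.2, 3.0, 8))
    print(k_hat, spread)
```

With a coarse grid of scales, several candidates can tie at zero spread, in which case this sketch simply returns the first of them; a finer grid or a different concentration measure may behave differently.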

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Avros, R., Soffer, A., Volkovich, Z., Yahalom, O. (2013). An Approach to Model Selection in Spectral Clustering with Application to the Writing Style Determination Problem. In: Fred, A., Dietz, J.L.G., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2012. Communications in Computer and Information Science, vol 415. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54105-6_2

  • DOI: https://doi.org/10.1007/978-3-642-54105-6_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54104-9

  • Online ISBN: 978-3-642-54105-6

  • eBook Packages: Computer Science, Computer Science (R0)
