Effect of Dimensionality Reduction on Different Distance Measures in Document Clustering

  • Mari-Sanna Paukkeri
  • Ilkka Kivimäki
  • Santosh Tirunagari
  • Erkki Oja
  • Timo Honkela
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7064)

Abstract

In document clustering, semantically similar documents are grouped together. Document collections are often very high-dimensional, with thousands or tens of thousands of terms, so it is common to reduce the original dimensionality before clustering for computational reasons. Cosine distance is widely regarded as the best choice for measuring distances between documents in k-means clustering. In this paper, we experiment with three dimensionality reduction methods and a selection of distance measures, and show that after reduction to small target dimensionalities, such as 10 or below, the superiority of the cosine measure no longer holds. Also, for small dimensionalities, PCA performs better than SVD as the dimensionality reduction method. We also show how l2 normalization affects the different distance measures. The experiments are run on three document sets in English and one in Hindi.
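
The ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the toy count matrix, the function names, and the target dimensionality of 10 are assumptions made for illustration only. The sketch shows two well-known facts: for l2-normalized vectors the squared Euclidean distance equals twice the cosine distance, and PCA differs from a plain truncated SVD projection in that it centers the columns first.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit Euclidean length (all-zero rows are left as-is)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def cosine_distance(a, b):
    """One minus the cosine of the angle between vectors a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def reduce_svd(X, k):
    """Project rows onto the top-k right singular vectors of X (no centering)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

def reduce_pca(X, k):
    """PCA via SVD of the column-centered matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Hypothetical toy "document-term" matrix: 50 documents, 200 terms.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(50, 200)).astype(float)

# After l2 normalization, squared Euclidean distance is 2 * cosine distance.
Xn = l2_normalize(X)
a, b = Xn[0], Xn[1]
sq_euclid = float(np.sum((a - b) ** 2))
cos_dist = float(cosine_distance(a, b))

# Reduce to a small target dimensionality (10 is an illustrative choice).
Z_svd = reduce_svd(Xn, 10)
Z_pca = reduce_pca(Xn, 10)
```

Note that the reduced vectors in `Z_svd` and `Z_pca` are generally no longer unit length, so the equivalence between Euclidean and cosine distance does not carry over automatically to the reduced space.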

Keywords

document clustering, dimensionality reduction, distance measure

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Mari-Sanna Paukkeri (1)
  • Ilkka Kivimäki (2)
  • Santosh Tirunagari (1)
  • Erkki Oja (1)
  • Timo Honkela (1)
  1. Department of Information and Computer Science, Aalto University School of Science, Aalto, Finland
  2. ISYS/LSM, Université de Louvain, Louvain-la-Neuve, Belgium
