Clustering Polish Texts with Latent Semantic Analysis

  • Marcin Kuta
  • Jacek Kitowski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6114)

Abstract

Document clustering is an important technique in Natural Language Processing (NLP). This paper presents the performance of partitional and agglomerative algorithms applied to clustering a large number of Polish newspaper articles. We investigate different representations of the documents. The paper focuses on the applicability of Latent Semantic Analysis to such clustering for Polish.

Keywords

document clustering, latent semantic analysis, part-of-speech tagging, natural language processing

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Marcin Kuta
    • 1
  • Jacek Kitowski
    • 1
  1. Institute of Computer Science, AGH University of Science and Technology, Kraków, Poland
