Information Retrieval, Volume 9, Issue 1, pp 33–53

Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

  • Tuomo Korenius
  • Jorma Laurikkala
  • Martti Juhola
  • Kalervo Järvelin

Abstract

Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) that was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one by considering (A) all the relevant documents and (B) only the highly relevant documents as relevant. Single linkage (SL) was the worst method: it created many tiny clusters, and, consequently, searches based on it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's (WM) methods returned reasonably sized clusters, typically of 18–32 documents. Their recall (A: 27–52%, B: 50–82%) was higher than that of the SL clusters, and their precision (A: 83–90%, B: 18–21%) was comparable to it. The AL and WM clustering had 1–8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1–9% less effective than NN. However, the differences were not statistically significant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Under assessment B, the AL and WM clustering are better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that the documents are appropriately preprocessed.
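To make the two binary assessments concrete, the following minimal sketch collapses graded relevance into the liberal view (A) and the strict view (B), then scores one retrieved cluster. It rests on assumptions that are not taken from the article: the four levels are coded 0 to 3 with 3 meaning highly relevant, the document ids and grades are invented toy data, and the function names `collapse` and `precision_recall` are hypothetical. It does not perform any clustering.

```python
def collapse(grades, threshold):
    """Return the ids of documents counted as relevant.

    Assumption: grades are integers 0..3 (0 = not relevant, 3 = highly
    relevant). threshold=1 mimics assessment A, threshold=3 assessment B.
    """
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one retrieved cluster."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical graded judgments for one query (toy data, not from the study).
grades = {"d1": 3, "d2": 2, "d3": 1, "d4": 0, "d5": 3, "d6": 1}

# Hypothetical best-matching cluster returned for that query.
cluster = ["d1", "d2", "d4", "d5"]

relevant_a = collapse(grades, threshold=1)  # A: every relevant grade counts
relevant_b = collapse(grades, threshold=3)  # B: only highly relevant counts

pr_a = precision_recall(cluster, relevant_a)
pr_b = precision_recall(cluster, relevant_b)

print("A: precision=%.2f recall=%.2f" % pr_a)
print("B: precision=%.2f recall=%.2f" % pr_b)
```

On this toy data the same cluster scores higher precision and lower recall under A than under B, which is the same direction as the ranges reported above, because B shrinks the set of documents counted as relevant.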

Keywords

  • Hierarchical clustering
  • Graded relevance
  • Finnish language
  • Principal components analysis



Copyright information

© Springer Science + Business Media, Inc. 2006

Authors and Affiliations

  • Tuomo Korenius (1)
  • Jorma Laurikkala (1)
  • Martti Juhola (1)
  • Kalervo Järvelin (2)

  1. Department of Computer Sciences, University of Tampere, Finland
  2. Center for Advanced Studies, University of Tampere, Finland
