Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

Korenius, Tuomo; Laurikkala, Jorma; Juhola, Martti; Järvelin, Kalervo

doi:10.1007/s10791-005-5720-6

Published: January 2006

Volume 9, pages 33–53 (2006)

Information Retrieval

Tuomo Korenius¹, Jorma Laurikkala¹, Martti Juhola¹ & Kalervo Järvelin²

Abstract

Search facilitated by agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) whose dimensionality was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one in two ways: (A) by considering all relevant documents relevant and (B) by considering only the highly relevant documents relevant. Single linkage (SL) was the worst method: it created many tiny clusters, and, consequently, searches based on it had high precision and low recall. Complete linkage (CL), average linkage (AL), and Ward's method (WM) returned reasonably sized clusters, typically of 18–32 documents. Their recall (A: 27–52%, B: 50–82%) was higher than that of the SL clusters, and their precision (A: 83–90%, B: 18–21%) was comparable to it. The AL and WM clustering had 1–8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1–9% less effective than NN. However, the differences were not statistically significant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Under assessment B, the AL and WM clustering are better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that the documents are appropriately preprocessed.
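The pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy term vectors, the 0–3 numeric coding of the four relevance levels, the cosine metric, the fixed cut into two clusters (standing in for the heuristic cut), and the helper names `fit_pca`, `best_cluster` and `precision_recall` are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def fit_pca(X, k):
    """Return the column mean and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def best_cluster(docs, labels, query):
    """Label of the cluster whose centroid is most cosine-similar to the query."""
    best, best_sim = None, -np.inf
    for lab in np.unique(labels):
        c = docs[labels == lab].mean(axis=0)
        sim = c @ query / (np.linalg.norm(c) * np.linalg.norm(query) + 1e-12)
        if sim > best_sim:
            best, best_sim = lab, sim
    return best

def precision_recall(retrieved, grades, threshold):
    """Collapse graded relevance at `threshold`, then compute binary P and R."""
    relevant = grades >= threshold
    hits = int(relevant[retrieved].sum())
    return hits / max(len(retrieved), 1), hits / max(int(relevant.sum()), 1)

# Toy collection (assumed data): 20 documents over 6 terms, two topics.
rng = np.random.default_rng(0)
topic_a = rng.normal(0, 0.05, (10, 6)) + np.array([1, 1, 1, 0, 0, 0])
topic_b = rng.normal(0, 0.05, (10, 6)) + np.array([0, 0, 0, 1, 1, 1])
X = np.vstack([topic_a, topic_b])
# Assumed coding of the four levels: 0 = irrelevant ... 3 = highly relevant.
grades = np.array([3] * 4 + [1] * 6 + [0] * 10)
query = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Reduce documents and query with the same principal components.
mean, comps = fit_pca(X, 2)
Z = (X - mean) @ comps.T
q = (query - mean) @ comps.T

# Average linkage (AL); the cut into 2 clusters replaces the heuristic cut.
tree = linkage(Z, method="average", metric="cosine")
labels = fcluster(tree, t=2, criterion="maxclust")

# Retrieve the best-matching cluster and score it under A and B.
retrieved = np.flatnonzero(labels == best_cluster(Z, labels, q))
p_a, r_a = precision_recall(retrieved, grades, threshold=1)  # assessment A
p_b, r_b = precision_recall(retrieved, grades, threshold=3)  # assessment B
print(f"A: precision={p_a:.2f} recall={r_a:.2f}")
print(f"B: precision={p_b:.2f} recall={r_b:.2f}")
```

Passing `method="single"` or `method="complete"` to `linkage` gives the SL and CL variants. SciPy's `method="ward"` requires Euclidean distances, so the metric would have to change for WM.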

Author information

Authors and Affiliations

Department of Computer Sciences, University of Tampere, FIN-33014 University of Tampere, Finland
Tuomo Korenius, Jorma Laurikkala & Martti Juhola
Center for Advanced Studies, University of Tampere, FIN-33014 University of Tampere, Finland
Kalervo Järvelin

Corresponding author

Correspondence to Tuomo Korenius.

About this article

Cite this article

Korenius, T., Laurikkala, J., Juhola, M. et al. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Inf Retrieval 9, 33–53 (2006). https://doi.org/10.1007/s10791-005-5720-6

Received: 26 January 2004
Revised: 10 December 2004
Accepted: 13 December 2004
Issue Date: January 2006
DOI: https://doi.org/10.1007/s10791-005-5720-6
