Abstract
Search facilitated by agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) whose dimensionality was reduced with principal components analysis. The dendrograms were cut heuristically to find an optimal partition, and the clusters of that partition were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one in two ways: (A) by treating all relevant documents as relevant, and (B) by treating only the highly relevant documents as relevant. Single linkage (SL) was the worst method. It created many tiny clusters, and, consequently, searches based on it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's (WM) methods returned reasonably sized clusters, typically of 18–32 documents. Their recall (A: 27–52%, B: 50–82%) was higher than that of the SL clusters, and their precision (A: 83–90%, B: 18–21%) was comparable to it. AL and WM clustering were 1–8% more effective than nearest neighbor searching (NN), and SL and CL were 1–9% less effective than NN. However, the differences were not statistically significant. When evaluated with the liberal assessment A, the results suggest that AL and WM clustering offer better retrieval ability than NN. Under assessment B, AL and WM clustering are better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that the documents are appropriately preprocessed.
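The collapse of the graded assessment into binary relevance, and its effect on the precision and recall of a retrieved cluster, can be illustrated with a minimal sketch. It is not the authors' implementation. The 0–3 coding of the four relevance levels, the function names, and the toy document identifiers are assumptions made for illustration only; the abstract states only that scheme A counts all relevant documents and scheme B counts only the highly relevant ones.

```python
def binarize(grades, scheme):
    """Collapse graded relevance into a set of relevant document ids.

    Assumption: grades are coded 0 (irrelevant) to 3 (highly relevant).
    Scheme "A" accepts any grade >= 1; scheme "B" accepts only grade 3.
    """
    if scheme not in ("A", "B"):
        raise ValueError("scheme must be 'A' or 'B'")
    threshold = 1 if scheme == "A" else 3
    return {doc for doc, grade in grades.items() if grade >= threshold}


def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved cluster against a relevant set."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical graded assessments for one query (not data from the study).
grades = {"d1": 3, "d2": 2, "d3": 1, "d4": 0, "d5": 3, "d6": 1, "d7": 1}

# Hypothetical best-matching cluster returned for that query.
cluster = ["d1", "d5", "d2", "d4"]

for scheme in ("A", "B"):
    p, r = precision_recall(cluster, binarize(grades, scheme))
    print(f"assessment {scheme}: precision={p:.2f} recall={r:.2f}")
```

With this toy data the same cluster scores higher precision under A and higher recall under B, because B shrinks the set of documents counted as relevant.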
Cite this article
Korenius, T., Laurikkala, J., Juhola, M. et al. Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments. Inf Retrieval 9, 33–53 (2006). https://doi.org/10.1007/s10791-005-5720-6
DOI: https://doi.org/10.1007/s10791-005-5720-6