Clustering in a News Corpus

  • Richard Elling Moe
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)


We adapt the Suffix Tree Clustering method for application to a corpus of Norwegian news articles. Specifically, suffixes are replaced with n-grams, and we propose a new measure of cluster similarity as well as a scoring function for base clusters. These modifications lead to substantial improvements in effectiveness and efficiency compared to the original algorithm.
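As a rough illustration of what replacing suffixes with n-grams might look like, the sketch below indexes word n-grams and groups the documents that share each one, in the spirit of the base clusters of Suffix Tree Clustering. It is not the implementation from the paper: the whitespace tokenizer, the `max_n` limit, the function names, and the toy documents are all assumptions made for the example. The proposed similarity measure and scoring function are not shown, since the abstract does not define them.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Return every contiguous run of n words in tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def base_clusters(docs, max_n=3):
    """Map each word n-gram (length 1..max_n) to the documents containing it.

    Assumption for this sketch: a phrase forms a base cluster only if at
    least two documents share it.
    """
    clusters = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenization (assumed)
        for n in range(1, max_n + 1):
            for gram in word_ngrams(tokens, n):
                clusters[gram].add(doc_id)
    return {gram: ids for gram, ids in clusters.items() if len(ids) >= 2}


# Hypothetical toy corpus, not taken from the Norwegian news corpus.
docs = {
    "a": "the minister resigned on friday",
    "b": "the minister resigned after criticism",
    "c": "storm hits the west coast",
}
shared = base_clusters(docs)
```

Capping the phrase length at `max_n` bounds the number of indexed phrases per document, whereas indexing every suffix makes each phrase run to the end of the text.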


News Article; Original Algorithm; Tree Cluster; Online Newspaper; Discrepancy Precision
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Richard Elling Moe, Department of Information Science and Media Studies, University of Bergen, Norway
