Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

  • Mark Hall
  • Paul Clough
  • Mark Stevenson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7489)


Large digital libraries have become available in recent years through digitisation and aggregation projects. These large collections present a challenge to new users who wish to discover what the collections contain. Subject classification can help with this task; however, in large collections it is frequently incomplete or inconsistent. Automatic clustering algorithms offer a solution, but the question remains whether they produce clusters that are sufficiently cohesive and distinct to support discovery and exploration in digital libraries. In this paper we present a novel approach to investigating cluster cohesion that is based on identifying intruders in a cluster. The results from a human-subject experiment show that clustering algorithms produce clusters that are sufficiently cohesive to be used where no (consistent) manual classification exists.


Digital Library; Topic Model; Latent Dirichlet Allocation; Automatic Cluster; Pointwise Mutual Information
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Mark Hall (1, 2)
  • Paul Clough (2)
  • Mark Stevenson (1)
  1. Department for Computer Science, Sheffield University, Sheffield, UK
  2. Information School, Sheffield University, Sheffield, UK
