The Semantic GrowBag Algorithm: Automatically Deriving Categorization Systems

  • Jörg Diederich
  • Wolf-Tilo Balke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4675)


Keyword search for relevant objects in digital libraries often returns result sets that are far too large. Based on the metadata associated with such objects, the faceted search paradigm allows users to structure and filter the result set, for example, using a publication type facet to show only books or videos. These facets usually focus on clear-cut characteristics of digital items; however, it is very difficult to organize the actual semantic content information into such a facet as well. The Semantic GrowBag approach, presented in this paper, uses the keywords provided by many authors of digital objects to automatically create lightweight topic categorization systems as a basis for a meaningful and dynamically adaptable topic facet. Using such emergent semantics enables an alternative way to filter large result sets according to the objects’ content without the need to manually classify all objects with respect to a pre-specified vocabulary. We present the details of our algorithm using the DBLP collection of computer science documents and show some experimental evidence about the quality of the achieved results.


Keywords: faceted search, category generation, higher-order co-occurrence



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Jörg Diederich (1)
  • Wolf-Tilo Balke (1)
  1. L3S Research Center and Leibniz Universität Hannover, Hanover, Germany
