Keyqueries for Clustering and Labeling

  • Tim Gollub
  • Matthias Busse
  • Benno Stein
  • Matthias Hagen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9994)

Abstract

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process; the same queries then serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine.

Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus \(\chi^2\). While the derived clusters are of comparably high quality, the evaluation of the corresponding cluster labels reveals large differences in explanatory power. In a user study with 49 participants, the labels generated by our approach have significantly higher discriminative power, which makes the computed clusters easier for humans to tell apart.
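The core idea of using queries both as clustering features and as cluster labels can be illustrated with a minimal sketch. It is not the pipeline evaluated in the paper: the abstract does not spell out the three steps, so everything here is an assumption made for illustration. The `retrieve` function is a toy keyword matcher standing in for the reference search engine (not an ESA-based engine), `kmeans` is plain Lloyd's k-means without query constraints, and picking the label by best F1 overlap between a query's result set and a cluster is a hypothetical criterion. The documents and queries are made up.

```python
def retrieve(query, docs):
    """Toy keyword search (assumption): ids of docs containing every query term."""
    terms = set(query.split())
    return {i for i, text in enumerate(docs) if terms <= set(text.split())}

def query_features(docs, queries):
    """Represent each document by the binary vector of queries that retrieve it."""
    results = [retrieve(q, docs) for q in queries]
    return [[1.0 if i in r else 0.0 for r in results] for i in range(len(docs))]

def kmeans(vectors, seeds, iterations=10):
    """Plain Lloyd's k-means (squared Euclidean distance, fixed seed documents)."""
    centroids = [list(vectors[s]) for s in seeds]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            assignment[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def label_cluster(cluster, docs, queries):
    """Hypothetical labeling rule: the query whose result set best matches the cluster (F1)."""
    def f1(q):
        hits = retrieve(q, docs)
        tp = len(hits & cluster)
        return 2 * tp / (len(hits) + len(cluster)) if tp else 0.0
    return max(queries, key=f1)

# Made-up example data.
docs = [
    "jaguar car engine speed",
    "jaguar car dealer price",
    "jaguar cat jungle habitat",
    "jaguar cat prey jungle",
]
queries = ["jaguar", "car", "cat", "jungle", "jaguar car", "jaguar cat"]

vectors = query_features(docs, queries)
assignment = kmeans(vectors, seeds=[0, 2])
clusters = [{i for i, a in enumerate(assignment) if a == k} for k in range(2)]
labels = [label_cluster(c, docs, queries) for c in clusters]
print(clusters, labels)
```

Because every label in this sketch is a query, submitting it to the same toy search engine returns the documents of the cluster it describes, which is the kind of consistency with a reference search engine that the abstract refers to.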

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Tim Gollub (1)
  • Matthias Busse (1)
  • Benno Stein (1)
  • Matthias Hagen (1)

  1. Bauhaus-Universität Weimar, Weimar, Germany
