Keyqueries for Clustering and Labeling

  • Conference paper
Information Retrieval Technology (AIRS 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9994)

Abstract

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process; the same queries then serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine.

Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus \(\chi^2\). While the derived clusters are of comparably high quality, the evaluation of the corresponding cluster labels reveals great diversity in their explanatory power. In a user study with 49 participants, the labels generated by our approach have significantly higher discriminative power, which makes the computed clusters easier for humans to separate.
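The idea of using queries both as clustering features and as cluster labels can be illustrated with a minimal sketch. It is not the pipeline evaluated in the paper: the toy corpus, the conjunctive keyword `search` function, the brute-force `candidate_queries` generator, and the F1-based choice in `best_label` are all assumptions made for illustration, whereas the paper itself reports noun phrase queries against an ESA-based search engine.

```python
from itertools import combinations

# Hypothetical toy corpus; the documents are invented for this sketch.
DOCS = {
    "d1": "jaguar car dealer price",
    "d2": "jaguar car engine review",
    "d3": "jaguar cat habitat rainforest",
    "d4": "jaguar cat prey rainforest",
}


def search(query, docs=DOCS):
    """Toy reference search engine (assumption): return the ids of all
    documents that contain every term of the query."""
    terms = set(query.split())
    return {doc_id for doc_id, text in docs.items() if terms <= set(text.split())}


def candidate_queries(docs=DOCS, max_len=2):
    """Enumerate all queries of up to max_len vocabulary terms."""
    vocab = sorted({term for text in docs.values() for term in text.split()})
    for length in range(1, max_len + 1):
        for terms in combinations(vocab, length):
            yield " ".join(terms)


def query_features(doc_id, docs=DOCS):
    """Represent a document by the set of queries that retrieve it."""
    return {q for q in candidate_queries(docs) if doc_id in search(q, docs)}


def f1(result, cluster):
    """F1 overlap between a query's result set and a cluster."""
    hits = len(result & cluster)
    if hits == 0:
        return 0.0
    precision = hits / len(result)
    recall = hits / len(cluster)
    return 2 * precision * recall / (precision + recall)


def best_label(cluster, docs=DOCS):
    """Pick the query whose result set matches the cluster best,
    preferring shorter queries when the F1 scores tie."""
    return max(
        candidate_queries(docs),
        key=lambda q: (f1(search(q, docs), cluster), -len(q.split())),
    )


if __name__ == "__main__":
    print(best_label({"d1", "d2"}))
    print(sorted(query_features("d1"))[:5])
```

Because a label chosen this way is itself a query, submitting it to the same search engine retrieves the documents of the cluster, which is the kind of consistency with a reference search engine that the abstract refers to.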

Notes

  1. http://project.carrot2.org.

  2. Claudio Carpineto, Giovanni Romano: Ambient Data set (2008), http://credo.fub.it/ambient/.

Corresponding author

Correspondence to Matthias Hagen.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Gollub, T., Busse, M., Stein, B., Hagen, M. (2016). Keyqueries for Clustering and Labeling. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_4

  • DOI: https://doi.org/10.1007/978-3-319-48051-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48050-3

  • Online ISBN: 978-3-319-48051-0

  • eBook Packages: Computer Science, Computer Science (R0)