Keyqueries for Clustering and Labeling

Gollub, Tim; Busse, Matthias; Stein, Benno; Hagen, Matthias

doi:10.1007/978-3-319-48051-0_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9994)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine.

Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means using noun phrase queries against an ESA-based search engine performs best. In the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus \(\chi^2\). While the derived clusters are of comparably high quality, the evaluation of the corresponding cluster labels reveals great diversity in explanatory power. In a user study with 49 participants, the labels generated by our approach have significantly higher discriminative power, which makes the computed clusters easier for humans to tell apart.
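
The idea of using queries as clustering features that double as cluster labels can be illustrated with a small sketch. This is not the authors' implementation. It rests on several assumptions: the toy corpus and candidate queries are invented, a conjunctive keyword matcher stands in for the reference search engine, plain k-means with fixed seeds replaces the query-constrained variant, and choosing the label by F1 between a query's result set and the cluster is only one plausible selection rule.

```python
from typing import Dict, List, Set

# Toy corpus (hypothetical data, not taken from the paper).
DOCS: Dict[int, str] = {
    0: "jaguar car engine speed",
    1: "jaguar car dealer price",
    2: "jaguar cat jungle habitat",
    3: "jaguar cat prey jungle",
}

# Candidate queries (hypothetical); each is a space-separated keyword set.
QUERIES: List[str] = ["jaguar", "car", "cat jungle", "price"]


def search(query: str) -> Set[int]:
    """Toy stand-in for a search engine: match documents containing every term."""
    terms = query.split()
    return {d for d, text in DOCS.items() if all(t in text.split() for t in terms)}


def query_features(doc_id: int) -> List[int]:
    """Represent a document by which candidate queries retrieve it (1) or not (0)."""
    return [1 if doc_id in search(q) else 0 for q in QUERIES]


def sqdist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: Dict[int, List[int]], seeds: List[int], iterations: int = 10) -> Dict[int, int]:
    """Plain k-means with fixed seed documents (assumption: keeps the sketch deterministic)."""
    centroids = [[float(x) for x in vectors[s]] for s in seeds]
    assignment: Dict[int, int] = {}
    for _ in range(iterations):
        # Assign each document to its nearest centroid.
        assignment = {
            d: min(range(len(centroids)), key=lambda c: sqdist(v, centroids[c]))
            for d, v in vectors.items()
        }
        # Recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [vectors[d] for d, a in assignment.items() if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def f1(result: Set[int], cluster: Set[int]) -> float:
    """F1 between a query's result set and a cluster."""
    overlap = len(result & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(result)
    recall = overlap / len(cluster)
    return 2 * precision * recall / (precision + recall)


def best_label(cluster: Set[int]) -> str:
    """Pick the candidate query whose result set best matches the cluster."""
    return max(QUERIES, key=lambda q: f1(search(q), cluster))


def cluster_and_label() -> Dict[str, List[int]]:
    """Cluster the toy corpus on query features and label each cluster with a query."""
    vectors = {d: query_features(d) for d in DOCS}
    assignment = kmeans(vectors, seeds=[0, 2])
    clusters: Dict[int, Set[int]] = {}
    for d, c in assignment.items():
        clusters.setdefault(c, set()).add(d)
    return {best_label(members): sorted(members) for members in clusters.values()}


if __name__ == "__main__":
    print(cluster_and_label())
```

In this sketch the query "jaguar" retrieves every document and therefore describes no cluster well, while "car" and "cat jungle" each retrieve exactly one cluster. That is the sense in which a label stays consistent with the search engine: submitting the label as a query returns the documents it names.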

Notes

1. http://project.carrot2.org.
2. Claudio Carpineto, Giovanni Romano: Ambient Data set (2008), http://credo.fub.it/ambient/.

Author information

Authors and Affiliations

Bauhaus-Universität Weimar, Weimar, Germany
Tim Gollub, Matthias Busse, Benno Stein & Matthias Hagen

Corresponding author

Correspondence to Matthias Hagen.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Shaoping Ma
Renmin University of China, Beijing, China
Ji-Rong Wen
Tsinghua University, Beijing, China
Yiqun Liu
Renmin University of China, Beijing, China
Zhicheng Dou
Tsinghua University, Beijing, China
Min Zhang
Yahoo Labs, Sunnyvale, California, USA
Yi Chang
Renmin University of China, Beijing, China
Xin Zhao

About this paper

Cite this paper

Gollub, T., Busse, M., Stein, B., Hagen, M. (2016). Keyqueries for Clustering and Labeling. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_4

DOI: https://doi.org/10.1007/978-3-319-48051-0_4
Published: 15 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48050-3
Online ISBN: 978-3-319-48051-0
eBook Packages: Computer Science, Computer Science (R0)
