Abstract
In this paper, we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process; the same queries then serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine.
Our approach combines different methods in a three-step pipeline. Overall, a query-constrained variant of k-means that issues noun phrase queries against an ESA-based search engine performs best. For the evaluation, we introduce a soft clustering measure as well as a freely available extended version of the Ambient dataset. We compare our approach to two often-used baselines, descriptive k-means and k-means plus \(\chi^2\). While the derived clusters are of comparably high quality, the evaluation of the corresponding cluster labels reveals a great diversity in explanatory power. In a user study with 49 participants, the labels generated by our approach show significantly higher discriminative power, which makes the computed clusters easier for humans to tell apart.
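The core idea of using queries both as features and as labels can be illustrated with a minimal sketch. It is not the pipeline evaluated in the paper: the toy documents, the conjunctive `search` function standing in for the reference search engine, the single-term candidate queries, and the use of F1 to pick a label are all assumptions made for illustration, and the clusters are taken as given rather than computed by a query-constrained k-means.

```python
# Toy collection (hypothetical data, not from the Ambient dataset).
DOCS = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle habitat",
    "d4": "jaguar cat prey forest",
}

# Candidate queries: here simply every vocabulary term (an assumption;
# the paper reports best results with noun phrase queries).
QUERIES = sorted({term for text in DOCS.values() for term in text.split()})


def search(query, docs):
    """Toy keyword search: a document matches if it contains every query term."""
    terms = set(query.split())
    return {doc_id for doc_id, text in docs.items() if terms <= set(text.split())}


def query_features(docs, queries):
    """Represent each document by the queries that retrieve it (binary vector)."""
    results = [search(q, docs) for q in queries]
    return {d: [1 if d in r else 0 for r in results] for d in docs}


def f1(result, cluster):
    """F1 overlap between a query's result set and a cluster."""
    hits = len(result & cluster)
    if hits == 0:
        return 0.0
    precision = hits / len(result)
    recall = hits / len(cluster)
    return 2 * precision * recall / (precision + recall)


def best_label(cluster, docs, queries):
    """Pick the query whose result set best reproduces the cluster."""
    return max(queries, key=lambda q: f1(search(q, docs), cluster))


if __name__ == "__main__":
    features = query_features(DOCS, QUERIES)
    print(features["d1"])
    for cluster in ({"d1", "d2"}, {"d3", "d4"}):
        print(sorted(cluster), "->", best_label(cluster, DOCS, QUERIES))
```

Because a label chosen this way is itself a query, submitting it to the same search engine retrieves (approximately) the cluster it describes, which is the kind of consistency with a reference search engine that the abstract refers to.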
Notes
Claudio Carpineto, Giovanni Romano: Ambient Data set (2008), http://credo.fub.it/ambient/.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Gollub, T., Busse, M., Stein, B., Hagen, M. (2016). Keyqueries for Clustering and Labeling. In: Ma, S., et al. Information Retrieval Technology. AIRS 2016. Lecture Notes in Computer Science, vol 9994. Springer, Cham. https://doi.org/10.1007/978-3-319-48051-0_4
DOI: https://doi.org/10.1007/978-3-319-48051-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48050-3
Online ISBN: 978-3-319-48051-0
eBook Packages: Computer Science (R0)