Candidate Document Retrieval for Web-Scale Text Reuse Detection

Hagen, Matthias; Stein, Benno

doi:10.1007/978-3-642-24583-1_35

Matthias Hagen & Benno Stein

Conference paper


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7024)

Abstract

Given a document d, the task of text reuse detection is to find those passages in d that also appear, in identical or paraphrased form, in other documents. To solve this problem at web scale, keywords representing d’s topics have to be combined into web queries. The retrieved web documents can then be passed to a text reuse detection system for in-depth analysis. We focus on query formulation as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents’ quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].
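To make the idea of combining keywords into web queries concrete, the following is a minimal sketch of a naive chunking scheme. It is not the query formulation strategy presented in the paper, nor the maximal termset baseline; the function name `formulate_queries` and the parameter `max_terms` are hypothetical and chosen only for illustration.

```python
from typing import List


def formulate_queries(keywords: List[str], max_terms: int = 3) -> List[str]:
    """Group keywords into web queries of at most `max_terms` terms each.

    Hypothetical illustration only: consecutive keywords are chunked so
    that every distinct keyword is covered by exactly one query.
    """
    if max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    # Drop duplicate keywords while keeping the original order.
    unique = list(dict.fromkeys(keywords))
    # Slice the keyword list into consecutive chunks and join each chunk
    # into one space-separated query string.
    return [
        " ".join(unique[i:i + max_terms])
        for i in range(0, len(unique), max_terms)
    ]


if __name__ == "__main__":
    for query in formulate_queries(
        ["text", "reuse", "detection", "web", "query"], max_terms=2
    ):
        print(query)
```

Each resulting string would be submitted to a search engine, and the retrieved documents would form the candidate set for the in-depth comparison. A scheme like this ignores how many results each query returns, which is one reason a more careful formulation strategy is needed in practice.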

Extended version of an ECDL 2010 poster paper [10].


References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of VLDB 1994, pp. 487–499 (1994)
2. Bar-Yossef, Z., Gurevich, M.: Random sampling from a search engine’s index. JACM 55(5) (2008)
3. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Proc. of AI 2000, pp. 40–52 (2000)
4. Bendersky, M., Croft, W.B.: Finding text reuse on the web. In: Proc. of WSDM 2009, pp. 262–271 (2009)
5. Brants, T., Franz, A.: Web 1T 5-gram Version 1. LDC2006T13 (2006)
6. Carmel, D., Yom-Tov, E., Darlow, A., Pelleg, D.: What makes a query difficult? In: Proc. of SIGIR 2006, pp. 390–397 (2006)
7. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: Proc. of SIGIR 2002, pp. 299–306 (2002)
8. Dasdan, A., D’Alberto, P., Kolay, S., Drome, C.: Automatic retrieval of similar content using search engine query interface. In: Proc. of CIKM 2009, pp. 701–710 (2009)
9. Grozea, C., Gehl, C., Popescu, M.: ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection. In: Proc. of PAN 2009, pp. 10–18 (2009)
10. Hagen, M., Stein, B.M.: Capacity-constrained query formulation. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds.) ECDL 2010. LNCS, vol. 6273, pp. 384–388. Springer, Heidelberg (2010)
11. Hauff, C., Hiemstra, D., de Jong, F.: A survey of pre-retrieval query performance predictors. In: Proc. of CIKM 2008, pp. 1419–1420 (2008)
12. He, B., Ounis, I.: Inferring query performance using pre-retrieval predictors. In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 43–54. Springer, Heidelberg (2004)
13. Kasprzak, J., Brandejs, M.: Improving the Reliability of the Plagiarism Detection System: Lab Report for PAN at CLEF 2010. In: Proc. of PAN 2010 (2010)
14. Pôssas, B., Ziviani, N., Ribeiro-Neto, B.A., Meira Jr., W.: Maximal termsets as a query structuring mechanism. In: Proc. of CIKM 2005, pp. 287–288 (2005)
15. Scholer, F., Garcia, S.: A case for improved evaluation of query difficulty prediction. In: Proc. of SIGIR 2009, pp. 640–641 (2009)
16. Seo, J., Croft, W.B.: Local text reuse detection. In: Proc. of SIGIR 2008, pp. 571–578 (2008)
17. Stein, B., Hagen, M.: Introducing the user-over-ranking hypothesis. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 503–509. Springer, Heidelberg (2011)
18. Wu, X., Kumar, V.: The Top Ten Algorithms in Data Mining. CRC Press, Boca Raton (2009)
19. Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P.G., Koudas, N., Papadias, D.: Query by document. In: Proc. of WSDM 2009, pp. 34–43 (2009)

Author information

Authors and Affiliations

Faculty of Media, Bauhaus-Universität Weimar, Germany
Matthias Hagen & Benno Stein

Editor information

Editors and Affiliations

Università di Pisa, Italy
Roberto Grossi
Consiglio Nazionale delle Ricerche, Area della Ricerca di Pisa, Istituto di Scienza e Tecnologia dell’Informazione “Alessandro Faedo”, Via Giuseppe Moruzzi 1, 56124, Pisa, Italy
Fabrizio Sebastiani & Fabrizio Silvestri


Cite this paper

Hagen, M., Stein, B. (2011). Candidate Document Retrieval for Web-Scale Text Reuse Detection. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_35

DOI: https://doi.org/10.1007/978-3-642-24583-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1
eBook Packages: Computer Science, Computer Science (R0)
