Candidate Document Retrieval for Web-Scale Text Reuse Detection

  • Matthias Hagen
  • Benno Stein
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7024)


Given a document d, the task of text reuse detection is to find those passages in d that also appear, in identical or paraphrased form, in other documents. To solve this problem at web scale, keywords representing d’s topics have to be combined into web queries. The retrieved web documents can then be passed to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents’ quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].
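To make the query formulation step concrete, the following is a minimal illustrative sketch of combining keywords into conjunctive queries. It is neither the heuristic proposed in the paper nor the maximal termset baseline. It assumes a hypothetical capacity (the maximum number of results a query may return to be considered useful) and a hypothetical hit estimator, here replaced by a toy in-memory corpus; the names formulate_queries, estimate_hits, and capacity are introduced for illustration only.

```python
from typing import Callable, FrozenSet, List, Sequence


def formulate_queries(
    keywords: Sequence[str],
    estimate_hits: Callable[[FrozenSet[str]], int],
    capacity: int,
) -> List[FrozenSet[str]]:
    """Level-wise enumeration of conjunctive keyword queries (illustrative).

    A query is accepted when its estimated hit count lies in 1..capacity.
    Queries with too many hits are extended by one more keyword; queries
    with zero hits are dropped. Supersets of accepted queries are skipped.
    """
    accepted: List[FrozenSet[str]] = []
    frontier = [frozenset([k]) for k in keywords]
    seen = set(frontier)
    while frontier:
        next_frontier = []
        for query in frontier:
            # Skip queries that only add terms to an already accepted query.
            if any(a <= query for a in accepted):
                continue
            hits = estimate_hits(query)
            if hits == 0:
                continue  # over-specific: nothing would be retrieved
            if hits <= capacity:
                accepted.append(query)
                continue
            # Too many hits: make the query more specific.
            for k in keywords:
                bigger = query | {k}
                if k not in query and bigger not in seen:
                    seen.add(bigger)
                    next_frontier.append(bigger)
        frontier = next_frontier
    return accepted


# Toy stand-in for a search engine (assumption): each document is a term set.
corpus = [
    {"text", "reuse", "web"},
    {"text", "reuse", "query"},
    {"text", "plagiarism"},
    {"web", "query"},
]


def toy_hits(query: FrozenSet[str]) -> int:
    # Number of toy documents that contain every term of the query.
    return sum(1 for doc in corpus if query <= doc)


queries = formulate_queries(["text", "reuse", "web"], toy_hits, capacity=1)
for q in queries:
    print(" ".join(sorted(q)))
```

In a realistic setting every call to the hit estimator would cost a web query or an index lookup, and the number of candidate keyword combinations grows quickly with the number of keywords, which is why the number of submitted queries is the cost measure highlighted in the abstract.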


Similar Document, Query Formulation, Internal Estimation, Plagiarism Detection, Candidate Query
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. of VLDB 1994, pp. 487–499 (1994)
  2. Bar-Yossef, Z., Gurevich, M.: Random sampling from a search engine’s index. JACM 55(5) (2008)
  3. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Proc. AI 2000, pp. 40–52 (2000)
  4. Bendersky, M., Croft, W.B.: Finding text reuse on the web. In: Proc. of WSDM 2009, pp. 262–271 (2009)
  5. Brants, T., Franz, A.: Web 1T 5-gram Version 1. LDC2006T13 (2006)
  6. Carmel, D., Yom-Tov, E., Darlow, A., Pelleg, D.: What makes a query difficult? In: Proc. of SIGIR 2006, pp. 390–397 (2006)
  7. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: Proc. of SIGIR 2002, pp. 299–306 (2002)
  8. Dasdan, A., D’Alberto, P., Kolay, S., Drome, C.: Automatic retrieval of similar content using search engine query interface. In: Proc. of CIKM 2009, pp. 701–710 (2009)
  9. Grozea, C., Gehl, C., Popescu, M.: ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection. In: Proc. of PAN 2009, pp. 10–18 (2009)
  10. Hagen, M., Stein, B.M.: Capacity-constrained query formulation. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds.) ECDL 2010. LNCS, vol. 6273, pp. 384–388. Springer, Heidelberg (2010)
  11. Hauff, C., Hiemstra, D., de Jong, F.: A survey of pre-retrieval query performance predictors. In: Proc. of CIKM 2008, pp. 1419–1420 (2008)
  12. He, B., Ounis, I.: Inferring query performance using pre-retrieval predictors. In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 43–54. Springer, Heidelberg (2004)
  13. Kasprzak, J., Brandejs, M.: Improving the Reliability of the Plagiarism Detection System: Lab Report for PAN at CLEF 2010. In: Proc. of PAN 2010 (2010)
  14. Pôssas, B., Ziviani, N., Ribeiro-Neto, B.A., Meira Jr., W.: Maximal termsets as a query structuring mechanism. In: Proc. of CIKM 2005, pp. 287–288 (2005)
  15. Scholer, F., Garcia, S.: A case for improved evaluation of query difficulty prediction. In: Proc. of SIGIR 2009, pp. 640–641 (2009)
  16. Seo, J., Croft, W.B.: Local text reuse detection. In: Proc. of SIGIR 2008, pp. 571–578 (2008)
  17. Stein, B., Hagen, M.: Introducing the user-over-ranking hypothesis. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 503–509. Springer, Heidelberg (2011)
  18. Wu, X., Kumar, V.: The Top Ten Algorithms in Data Mining. CRC Press, Boca Raton (2009)
  19. Yang, Y., Bansal, N., Dakka, W., Ipeirotis, P.G., Koudas, N., Papadias, D.: Query by document. In: Proc. of WSDM 2009, pp. 34–43 (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Matthias Hagen (1)
  • Benno Stein (1)
  1. Faculty of Media, Bauhaus-Universität Weimar, Germany
