Cross-Language High Similarity Search: Why No Sub-linear Time Bound Can Be Expected

  • Maik Anderka
  • Benno Stein
  • Martin Potthast
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5993)

Abstract

This paper contributes to an important variant of cross-language information retrieval, called cross-language high similarity search. Given a collection D of documents and a query q in a language different from the language of D, the task is to retrieve the documents in D that are highly similar to q. Use cases for this task include cross-language plagiarism detection and translation search.

Current research in cross-language high similarity search compares q with the documents in D in a multilingual concept space, which requires a linear scan of D. Monolingual high similarity search, by contrast, can be tackled in sub-linear time, either by fingerprinting or by “brute force n-gram indexing”, as done by Web search engines. We argue that neither fingerprinting nor brute force n-gram indexing can be applied to cross-language high similarity search, and that a linear scan is inevitable. Our findings are based on theoretical and empirical insights.
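
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word 3-gram choice, the cosine measure, and the threshold are all assumptions made for illustration, and the concept vectors are assumed to be precomputed by some multilingual retrieval model. The monolingual lookup touches only the postings of the query's n-grams, whereas the concept-space comparison visits every document in D.

```python
from collections import defaultdict
from math import sqrt


def word_ngrams(text, n=3):
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_ngram_index(docs, n=3):
    """Monolingual case: map each n-gram to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def lookup_candidates(index, query, n=3):
    """Collect documents sharing an n-gram with the query; only matching postings are read."""
    candidates = set()
    for gram in word_ngrams(query, n):
        candidates |= index.get(gram, set())
    return candidates


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def linear_scan(concept_vectors, query_vector, threshold=0.9):
    """Cross-language case: compare the query against every document vector in D."""
    return [doc_id for doc_id, vec in concept_vectors.items()
            if cosine(query_vector, vec) >= threshold]
```

In this sketch, a query in another language shares no word n-grams with the indexed documents, so the index lookup returns nothing; the comparison has to fall back to the concept vectors, and the work grows with the size of D.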

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Maik Anderka (1)
  • Benno Stein (1)
  • Martin Potthast (1)
  1. Faculty of Media, Bauhaus University Weimar, Weimar, Germany