Parallel and Distributed Document Overlap Detection on the Web

  • Conference paper
Applied Parallel Computing. New Paradigms for HPC in Industry and Academia (PARA 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1947)

Abstract

The proliferation of digital libraries and the availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed, or used to create plagiarised assignments and conference papers. This paper presents a new two-stage approach for identifying overlapping documents. The first stage identifies a set of candidate documents, which are compared in the second stage using a matching engine. The matching engine's algorithm is based on suffix trees and modifies the known matching statistics algorithm. Parallel and distributed approaches are discussed for both stages, and performance results are presented.
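
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how candidates are selected, so `select_candidates` (shared word bigrams) is a hypothetical stand-in, and `matching_statistics` uses plain substring search instead of a suffix tree, computing the standard (unmodified) matching statistics. The names `shingles`, `overlap_fraction`, and `min_len` are likewise assumptions made for illustration.

```python
def shingles(doc, k=2):
    """Set of k-word sequences in doc (hypothetical candidate-stage feature)."""
    words = doc.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def select_candidates(query, corpus, k=2):
    """Stage 1 (assumed): keep documents sharing at least one word k-gram."""
    q = shingles(query, k)
    return [name for name, doc in corpus.items() if q & shingles(doc, k)]


def matching_statistics(text, query):
    """ms[i] = length of the longest substring of query starting at i
    that occurs anywhere in text.

    Naive version: substring search stands in for the suffix tree.
    It relies on the property ms[i] >= ms[i-1] - 1, so the match
    length is carried over between positions instead of restarting.
    """
    ms = []
    length = 0
    for i in range(len(query)):
        length = max(length - 1, 0)
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def overlap_fraction(text, query, min_len=3):
    """Stage 2 (sketch): fraction of query characters covered by a match
    of at least min_len characters found in text."""
    if not query:
        return 0.0
    covered = [False] * len(query)
    for i, length in enumerate(matching_statistics(text, query)):
        if length >= min_len:
            for j in range(i, i + length):
                covered[j] = True
    return sum(covered) / len(query)
```

Under these assumptions, a suspicious document would first be passed through `select_candidates`, and `overlap_fraction` would then be called once per surviving candidate. Those per-candidate calls are independent of each other, which is what makes the second stage a natural target for parallel or distributed execution.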

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Monostori, K., Zaslavsky, A., Schmidt, H. (2001). Parallel and Distributed Document Overlap Detection on the Web. In: Sørevik, T., Manne, F., Gebremedhin, A.H., Moe, R. (eds) Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. PARA 2000. Lecture Notes in Computer Science, vol 1947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-70734-4_25

  • DOI: https://doi.org/10.1007/3-540-70734-4_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41729-3

  • Online ISBN: 978-3-540-70734-9

  • eBook Packages: Springer Book Archive
