Abstract
The proliferation of digital libraries and the availability of electronic documents on the Internet have created new challenges for computer science researchers and professionals. Documents are easily copied and redistributed, or used to create plagiarised assignments and conference papers. This paper presents a new two-stage approach for identifying overlapping documents. The first stage identifies a set of candidate documents, which the second stage compares using a matching engine. The matching engine's algorithm is based on suffix trees and modifies the known matching statistics algorithm. Parallel and distributed approaches are discussed for both stages, and performance results are presented.
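The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the abstract does not say how candidates are selected, so stage one here uses shared word chunks; stage two computes standard matching statistics with a naive substring search instead of a suffix tree, and does not reproduce the paper's modification of that algorithm. All function names and parameters (`shingles`, `candidates`, `matching_statistics`, `overlap_ratio`, `k`, `min_shared`, `min_len`) are hypothetical.

```python
from typing import Dict, List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of k-word chunks of a text (assumed chunking scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def candidates(suspect: str, corpus: Dict[str, str],
               k: int = 3, min_shared: int = 1) -> List[str]:
    """Stage 1 (assumed method): keep documents that share at least
    min_shared word chunks with the suspect document."""
    suspect_chunks = shingles(suspect, k)
    return [name for name, doc in corpus.items()
            if len(suspect_chunks & shingles(doc, k)) >= min_shared]


def matching_statistics(suspect: str, source: str) -> List[int]:
    """Stage 2: ms[i] is the length of the longest substring of `suspect`
    starting at position i that occurs somewhere in `source`.

    This naive version is far slower than a suffix-tree traversal,
    which obtains the same values in linear time.
    """
    ms = []
    for i in range(len(suspect)):
        length = 0
        while (i + length < len(suspect)
               and suspect[i:i + length + 1] in source):
            length += 1
        ms.append(length)
    return ms


def overlap_ratio(ms: List[int], min_len: int) -> float:
    """Fraction of suspect positions covered by matches of at least min_len."""
    covered = [False] * len(ms)
    for i, m in enumerate(ms):
        if m >= min_len:
            for j in range(i, i + m):
                covered[j] = True
    return sum(covered) / len(ms) if ms else 0.0
```

In this sketch, only the documents returned by `candidates` would be passed to `matching_statistics`, which keeps the expensive comparison off most of the collection. Both stages are independent per document, so each could in principle be spread across workers.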
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Monostori, K., Zaslavsky, A., Schmidt, H. (2001). Parallel and Distributed Document Overlap Detection on the Web. In: Sørevik, T., Manne, F., Gebremedhin, A.H., Moe, R. (eds) Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. PARA 2000. Lecture Notes in Computer Science, vol 1947. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-70734-4_25
DOI: https://doi.org/10.1007/3-540-70734-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41729-3
Online ISBN: 978-3-540-70734-9
eBook Packages: Springer Book Archive