Abstract
Translated or cross-lingual plagiarism is the translation of someone else’s work or words without marking it as such or without giving credit to the original author. Cross-lingual plagiarism is not new, but only in recent years, owing to the rapid development of natural language processing, have the first algorithms appeared that tackle the difficult task of detecting it. Most of these algorithms use machine translation to compare texts written in different languages. We propose a different method, which can effectively detect translations between language pairs for which machine translation still produces low-quality results. The new algorithm presented in this paper is based on information retrieval (IR) and a dictionary-based similarity metric. Preprocessing the candidate documents for the IR is computationally intensive but easily parallelizable, and we propose a desktop Grid solution for this task. As the application is time sensitive and the desktop Grid peers are unreliable, a resubmission mechanism is used, which ensures that all jobs of a batch finish within a reasonable time period without dramatically increasing the load on the whole system.
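To make the idea of a dictionary-based similarity metric more concrete, the following is a minimal sketch and not the metric defined in the paper. It assumes a bilingual dictionary that maps each source-language word to a set of possible translations, and it scores a text pair by the fraction of distinct source words that have at least one translation in the target text. The function name `dictionary_similarity`, the scoring rule, and the toy Hungarian-English dictionary are all illustrative assumptions.

```python
from typing import Dict, Iterable, Set


def dictionary_similarity(
    source_tokens: Iterable[str],
    target_tokens: Iterable[str],
    dictionary: Dict[str, Set[str]],
) -> float:
    """Hypothetical metric: share of distinct source words that have at
    least one dictionary translation present in the target text."""
    source_words = {token.lower() for token in source_tokens}
    target_words = {token.lower() for token in target_tokens}
    if not source_words:
        return 0.0
    # A source word counts as matched if any of its translations
    # occurs among the target words.
    matched = sum(
        1
        for word in source_words
        if dictionary.get(word, set()) & target_words
    )
    return matched / len(source_words)


# Toy Hungarian-English dictionary, invented for illustration only.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "kutya": {"dog", "hound"},
    "fut": {"run", "runs"},
    "kert": {"garden", "yard"},
}

score = dictionary_similarity(
    ["kutya", "fut", "kert"],
    ["the", "dog", "runs", "home"],
    TOY_DICTIONARY,
)
print(f"similarity: {score:.2f}")
```

Because a score of this kind needs no machine translation, only a word-level dictionary, it remains usable for language pairs where translation quality is poor. A practical system would presumably also need stemming and some tolerance for word order, which this sketch omits.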
Cite this article
Pataki, M., Marosi, A.C. Searching for Translated Plagiarism with the Help of Desktop Grids. J Grid Computing 11, 149–166 (2013). https://doi.org/10.1007/s10723-012-9224-5
DOI: https://doi.org/10.1007/s10723-012-9224-5