DNA Sequences Representation Derived from Discrete Wavelet Transformation for Text Similarity Recognition
Recognizing text similarity, also referred to as duplicate document detection, is regarded as the most important approach to plagiarism detection, a problem that has risen dramatically in the era of the digital revolution. Aiming to contribute an efficient plagiarism detection system, we investigate a new approach to text similarity mining based on a DNA sequence representation derived from the Discrete Wavelet Transformation (DWT). The contribution of the paper is threefold. First, we convert the raw source documents into a unique set of floating-point number series, called DeoxyriboNucleic Acid (DNA) sequences, using the DWT. Second, the test documents supplied as input are converted into the same DNA-based structure. Lastly, a text similarity discovery algorithm is applied to the resulting DNA sequences by computing the Euclidean distance between them. The experimental results show that the proposed method detects text similarity with very high precision on the standard PAN dataset (Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection).
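The three steps in the abstract (encode the source, encode the test document, compare by Euclidean distance) can be illustrated with a minimal sketch. The abstract does not say how text is mapped to numbers, which wavelet is used, or how many decomposition levels are applied, so the choices below are assumptions made only for illustration: character code points as the raw signal, a Haar wavelet, two decomposition levels, and truncation to the shorter sequence before measuring distance. The function names are hypothetical and are not taken from the paper.

```python
import math


def text_to_signal(text):
    """Map text to a numeric series (assumption: lowercase character code points)."""
    return [float(ord(ch)) for ch in text.lower()]


def haar_dwt(signal):
    """One level of the Haar DWT; returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        # Pad odd-length input by repeating the last sample.
        signal = signal + signal[-1:]
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def encode(text, levels=2):
    """Build a floating-point 'DNA sequence' from the approximation coefficients.

    Assumption: only the approximation band is kept after `levels` passes.
    """
    signal = text_to_signal(text)
    for _level in range(levels):
        if len(signal) < 2:
            break
        signal, _detail = haar_dwt(signal)
    return signal


def euclidean_distance(a, b):
    """Euclidean distance over the common prefix of two sequences (assumption)."""
    n = min(len(a), len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])))


if __name__ == "__main__":
    source = encode("the quick brown fox jumps over the lazy dog")
    copied = encode("the quick brown fox jumps over the lazy dog")
    edited = encode("the quick brown cat jumps over the lazy dog")
    print(euclidean_distance(source, copied))
    print(euclidean_distance(source, edited))
```

Under these assumptions, identical passages yield a distance of zero, and the distance grows as the test passage diverges from the source. A real system would still need a threshold on that distance, which the abstract does not state.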
Keywords: Text similarity; Discrete Wavelet Transformation; Text analysis and mining; Plagiarism system; Euclidean measurement
This research is funded by Funds for Science and Technology Development of the University of Danang under grant number B2017-DN01-07.