DNA Sequences Representation Derived from Discrete Wavelet Transformation for Text Similarity Recognition

  • Phan Hieu Ho
  • Ngoc Anh Thi Nguyen
  • Trung Hung Vo
Part of the Studies in Computational Intelligence book series (SCI, volume 769)


Recognizing text similarity, that is, detecting duplicated documents, is considered the most important solution for plagiarism detection, a problem that has risen dramatically in the recent era of digital revolution. With the aim of contributing an efficient plagiarism detection system, we investigate a new approach to text similarity mining via a DNA sequence representation derived from the Discrete Wavelet Transformation (DWT). The contribution of the paper is threefold. Firstly, we convert the raw source materials into a unique set of floating-point number series, called DeoxyriboNucleic Acid (DNA) sequences, using DWT. Secondly, the same DNA-based structure is generated for the input test documents. Lastly, a text similarity discovery algorithm is run on the given input DNA strings by computing the Euclidean distance between them. The experimental results demonstrate the advantages of the proposed method, which detects text similarity with very high precision on the standard PAN dataset (Plagiarism Analysis, Authorship Identification, and Near-Duplicate detection).
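The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the character encoding, the wavelet family, the number of decomposition levels, or the sequence length, so the Haar wavelet, the code-point encoding, the fixed length of 64, and the function names `text_to_signal`, `haar_dwt`, and `dwt_distance` are all assumptions made only for illustration.

```python
import math


def text_to_signal(text, length=64):
    """Assumed encoding: lowercase code points, truncated or zero-padded to `length`."""
    values = [float(ord(ch)) for ch in text.lower()[:length]]
    return values + [0.0] * (length - len(values))


def haar_dwt(signal, levels=3):
    """Return the Haar approximation coefficients after `levels` decomposition steps.

    The Haar wavelet is an assumption; the source only says "DWT".
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(signal)
    for _ in range(levels):
        # Orthonormal Haar low-pass step: scaled sum of each adjacent pair.
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2)
                  for i in range(0, len(approx), 2)]
    return approx


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length coefficient series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dwt_distance(source, suspect, length=64, levels=3):
    """Steps 1-3: encode both texts, transform them, and compare the results."""
    source_seq = haar_dwt(text_to_signal(source, length), levels)
    suspect_seq = haar_dwt(text_to_signal(suspect, length), levels)
    return euclidean_distance(source_seq, suspect_seq)


if __name__ == "__main__":
    original = "the quick brown fox"
    print(dwt_distance(original, original))
    print(dwt_distance(original, "the quick brown fax"))
    print(dwt_distance(original, "zzzzzzzzzzzzzzzzzzz"))
```

In this sketch a smaller distance indicates more similar texts: identical inputs give a distance of zero, and a near-duplicate scores lower than unrelated text. A real system would also need a decision threshold and a way to compare passages of different lengths, neither of which the abstract describes.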


Text similarity, Discrete Wavelet Transformation, Text analysis and mining, Plagiarism system, Euclidean measurement



This research is funded by Funds for Science and Technology Development of the University of Danang under grant number B2017-DN01-07.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. The University of Danang, Danang City, Vietnam
