Detecting Near Duplicates in Software Documentation


Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates many “near duplicate” fragments, i.e., chunks of text that were copied from a single source and later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further use. At the same time, they are hard to detect manually because of their fuzzy nature. In this paper we give a formal definition of near duplicates and present an algorithm for detecting them in software documents. The algorithm builds on exact software clone detection: the clone detection tool Clone Miner was adapted to detect exact duplicates in documents, and our algorithm then uses these exact duplicates to construct near ones. We evaluate the proposed algorithm on the documentation of 19 open source and commercial projects. The evaluation covers a range of documentation types: design and requirement specifications, programming guides, API documentation, and user manuals. Overall, it shows that all kinds of software documentation contain a significant number of both exact and near duplicates. We then report on a manual analysis of the near duplicates detected in the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the strengths and weaknesses of the algorithm, and discuss the benefits of duplicate management in software documents.
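
The idea of constructing near duplicates from exact ones can be illustrated with a minimal Python sketch. It is not the authors' algorithm and does not use Clone Miner: repeated word n-grams stand in for exact duplicates, and two exact duplicates are joined when one follows the other within a few tokens in at least two places. The function names and the parameters `n` and `max_gap` are assumptions made for this illustration.

```python
import re
from collections import defaultdict


def exact_duplicates(tokens, n):
    """Return every word n-gram that occurs at least twice, with its start offsets."""
    starts = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        starts[tuple(tokens[i:i + n])].append(i)
    return {gram: pos for gram, pos in starts.items() if len(pos) > 1}


def near_duplicates(tokens, n=3, max_gap=2):
    """Join exact duplicates A and B into near-duplicate groups (hypothetical helper).

    A group is reported when B starts at most max_gap tokens after the end of A
    in at least two places. Each fragment is a (start, end) token span.
    """
    dups = exact_duplicates(tokens, n)
    groups = []
    for a, a_starts in dups.items():
        for b, b_starts in dups.items():
            if a == b:
                continue
            spans = []
            for s in a_starts:
                # First occurrence of B that begins shortly after this occurrence of A.
                t = next((t for t in b_starts if 0 <= t - (s + n) <= max_gap), None)
                if t is not None:
                    spans.append((s, t + n))
            if len(spans) > 1:
                groups.append(spans)
    return groups


# Sample text invented for this illustration.
text = ("First reset the board and hold the button for five seconds. "
        "Then reset the board or press the button for ten seconds.")
tokens = re.findall(r"\w+", text.lower())
groups = near_duplicates(tokens)
for group in groups:
    for start, end in group:
        print(" ".join(tokens[start:end]))
```

On the sample text, the exact parts "reset the board" and "the button for" are joined into one group of two fragments that differ only in the words between them. A practical detector would work with maximal exact duplicates rather than fixed-length n-grams and would need to merge overlapping groups, which this sketch leaves out.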



This work is partially supported by RFBR grant no. 16-01-00304.

Author information



Corresponding authors

Correspondence to D. V. Luciv, D. V. Koznov, G. A. Chernishev, A. N. Terekhov, K. Yu. Romanovsky, or D. A. Grigoriev.

Additional information

The article is published in the original.

About this article

Cite this article

Luciv, D.V., Koznov, D.V., Chernishev, G.A. et al. Detecting Near Duplicates in Software Documentation. Program Comput Soft 44, 335–343 (2018).


  • software documentation
  • near duplicates
  • documentation reuse
  • software clone detection