
Programming and Computer Software, Volume 44, Issue 5, pp 335–343

Detecting Near Duplicates in Software Documentation

  • D. V. Luciv
  • D. V. Koznov
  • G. A. Chernishev
  • A. N. Terekhov
  • K. Yu. Romanovsky
  • D. A. Grigoriev

Abstract

Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates many “near duplicate” fragments, i.e., chunks of text that were copied from a single source and later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further use. At the same time, they are hard to detect manually due to their fuzzy nature. In this paper, we give a formal definition of near duplicates and present an algorithm for their detection in software documents. The algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents, and our algorithm then uses these exact duplicates to construct near ones. We evaluate the proposed algorithm on the documentation of 19 open source and commercial projects. The evaluation covers various documentation types: design and requirement specifications, programming guides and API documentation, and user manuals. Overall, it shows that all kinds of software documentation contain a significant number of both exact and near duplicates. Next, we report on a manual analysis of the near duplicates detected in the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the algorithm's strengths and weaknesses, and discuss the benefits of duplicate management in software documents.

Keywords:

software documentation, near duplicates, documentation reuse, software clone detection

Notes

ACKNOWLEDGMENTS

This work is partially supported by RFBR grant no. 16-01-00304.


Copyright information

© Pleiades Publishing, Ltd. 2018

Authors and Affiliations

  1. Saint Petersburg State University, St. Petersburg, Russia
