Interactive Near Duplicate Search in Software Documentation

Programming and Computer Software

Abstract

Various software features such as classes, methods, requirements, and tests often have similar functionality, which can lead to the emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder documentation maintenance, so detecting duplicates in software documentation is an important task. Solving it makes planned reuse possible, as well as the creation and use of templates for unifying and automatically generating documentation. In this paper, we present an approach to the interactive detection of near duplicates that involves the user in order to make the search meaningful. It includes a new formal definition of a near duplicate, a pattern-based search algorithm, and a proof of its completeness. We also report the results of experiments on a collection of documents from several industrial projects.

Notes

  1. We only consider groups that consist of fragments longer than four tokens, because, according to our experiments [15], this particular constraint filters out many false positives.

  2. The value \(1/\sqrt{3} \approx 0.577\) was selected for the convenience of the proofs that follow. Our experiments showed that if the similarity measure is less than 1/2, then, for smaller patterns (up to 15–20 tokens), the algorithm produces many non-meaningful matches; the lower bound we have selected is only slightly larger than 1/2.

  3. Here and below, we do not round the lengths of the intervals to integers, in order to save space. Nevertheless, all proofs can be performed with rounded values as well.
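
The two thresholds in Notes 1 and 2 can be illustrated with a minimal sketch. It rests on several assumptions that these notes do not confirm: tokens are whitespace-separated lowercase words, the similarity measure is the ratio computed by Python's difflib (the module listed as reference [34]), and the lower bound is inclusive. The names tokenize and is_candidate_match are hypothetical and are not taken from the authors' tool.

```python
import math
from difflib import SequenceMatcher

# Note 1: only fragments longer than four tokens are considered.
MIN_TOKENS = 5
# Note 2: lower bound on similarity, 1/sqrt(3), roughly 0.577.
MIN_SIMILARITY = 1 / math.sqrt(3)


def tokenize(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()


def is_candidate_match(pattern, fragment):
    """Return True if fragment passes both thresholds against pattern.

    Assumption: similarity is difflib's ratio over token sequences,
    i.e. 2 * M / T, where M is the number of matched tokens and T is
    the total number of tokens in both sequences.
    """
    p, f = tokenize(pattern), tokenize(fragment)
    if len(p) < MIN_TOKENS or len(f) < MIN_TOKENS:
        return False  # too short, likely a false positive (Note 1)
    ratio = SequenceMatcher(None, p, f, autojunk=False).ratio()
    return ratio >= MIN_SIMILARITY  # inclusive bound is an assumption


if __name__ == "__main__":
    pattern = "the function returns zero on success and nonzero otherwise"
    fragment = "the function returns one on success and nonzero otherwise"
    print(is_candidate_match(pattern, fragment))
```

Comparing token lists rather than raw characters keeps the length filter and the similarity bound in the same unit (tokens), which is the unit both notes use.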

REFERENCES

  1. Brooks, F.P., The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley, 1975.

  2. Parnas, D.L., Precise documentation: The key to Better Software, The Future of Software Engineering, Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 125–148.

  3. Bassett, P.G., Framing Software Reuse: Lessons from the Real World, Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997.

  4. Jarzabek, S. and Pettersson, U., Research journey towards industrial application of reuse technique, ICSE, 2006, pp. 608–611.

  5. Irshad, M., Petersen, K., and Poulding, S.M., A systematic literature review of software requirements reuse approaches, Inform. Software Technol., 2018, vol. 93, pp. 223–245.

  6. Horie, M. and Chiba, S., Tool support for crosscutting concerns of API documentation, Proceedings of the 9th International Conference on Aspect-Oriented Software Development, New York, NY, USA: ACM, 2010, pp. 97–108.

  7. Oumaziz, M.A., Charpentier, A., Falleri, J.-R., and Blanc, X., Documentation reuse: Hot or not? An empirical study, Mastering Scale and Complexity in Software Reuse: 16th International Conference on Software Reuse, Springer, 2017, pp. 12–27.

  8. Koznov, D.V. and Romanovsky, K.Yu., DocLine: A method for software product lines documentation development, Programming and Computer Software, 2008, vol. 34, no. 4, pp. 216–224.

  9. Romanovsky, K., Koznov, D., and Minchin, L., Refactoring the documentation of Software Product Lines, Lect. Notes Computer Science, Berlin, Heidelberg: Springer-Verlag, 2011, vol. 4980, pp. 158–170.

  10. Jarzabek, S. and Dan, D., Documentation Reuse: Managing Similar Documents, FedCSIS, 2017, pp. 1325–1334.

  11. Juergens, E., Deissenboeck, F., Feilkas, M., Hummel, B., Schaetz, B., Wagner, S., Domann, C., and Streit, J., Can clone detection support quality assessments of requirements specifications?, Proceedings of ACM/IEEE 32nd International Conference on Software Engineering, 2010, vol. 2, pp. 79–88.

  12. Nosál’, M. and Porubän, J., Reusable software documentation with phrase annotations, Central Eur. J. Comp. Sci., 2014, vol. 4, no. 4, pp. 242–258.

  13. Nosál’, M. and Porubän, J., Preliminary report on empirical study of repeated fragments in internal documentation, Proceedings of Federated Conference on Computer Science and Information Systems, 2016, pp. 1573–1576.

  14. Wingkvist, A., Lowe, W., Ericsson, M., and Lincke, R., Analysis and visualization of information quality of technical documentation, Proceedings of the 4th European Conference on Information Management and Evaluation, 2010, pp. 388–396.

  15. Koznov, D., Luciv, D., Basit, H.A., Lieh, O.E., and Smirnov, M., Clone Detection in Reuse of Software Technical Documentation, International Andrei Ershov Memorial Conference on Perspectives of System Informatics (2015), Springer Nature, 2016, vol. 9609 of Lecture Notes in Computer Science, pp. 170–185.

  16. Luciv, D.V., Koznov, D.V., Chernishev, G.A., Terekhov, A.N., Romanovsky, K.Yu., and Grigoriev, D.A., Detecting near duplicates in software documentation, Programming and Comput. Software, 2018, vol. 44, no. 5.

  17. Koznov, D.V., Luciv, D.V., and Chernishev, G.A., Duplicate management in software documentation maintenance, Proceedings of V International Conference Actual Problems of System and Software Engineering, vol. 1989: CEUR Workshop Proceedings, 2017, pp. 195–201.

  18. Duplicate Finder. http://www.math.spbu.ru/user/kromanovsky/docline/index.html.

  19. Luciv, D.V., Koznov, D.V., Chernishev, G.A., Basit, H.A., Romanovsky, K.Yu., and Terekhov, A.N., Poster: Duplicate finder toolkit, Proceedings of the International Conference on Software Engineering (ICSE 2018), 2018, pp. 171–172.

  20. Basit, H.A., Puglisi, S.J., Smyth, W.F., Turpin, A., and Jarzabek, S., Efficient token based clone detection with flexible tokenization, Proceedings of the 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers, New York, NY, USA: ACM, 2007, pp. 513–516.

  21. Rago, A., Marcos, C., and Diaz-Pace, J.A., Identifying duplicate functionality in textual use cases by aligning semantic actions, Software and Systems Modeling, 2016, vol. 15, no. 2, pp. 579–603.

  22. Ukkonen, E., Finding approximate patterns in strings, J. Algorithms, 1985, vol. 6, no. 1, pp. 132–137.

  23. Broder, A.Z., On the resemblance and containment of documents, Compression and Complexity of Sequences 1997, Proceedings, IEEE, 1997, pp. 21–29.

  24. Wu, S. and Manber, U., Fast Text Searching: Allowing Errors, Commun. ACM, 1992, vol. 35, no. 10, pp. 83–91.

  25. Landau, G.M. and Vishkin, U., Fast string matching with k differences, J. Comput. System Sci., 1988, vol. 37, no. 1, pp. 63–78.

  26. Myers, G., A Fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM, 1999, vol. 46, no. 3, pp. 395–415.

  27. Levenshtein, V., Binary codes capable of correcting spurious insertions and deletions of ones, Problems Inform. Transmission, 1965, vol. 1, pp. 8–17.

  28. Smyth, W., Computing Patterns in Strings, Addison-Wesley, 2003.

  29. Bergroth, L., Hakonen, H., and Raita, T., A survey of longest common subsequence algorithms, String Processing and Information Retrieval, 2000 (SPIRE 2000), Proceedings, Seventh International Symposium on, 2000, pp. 39–48.

  30. Leskovec, J., Rajaraman, A., and Ullman, J.D., Mining of Massive Datasets, Cambridge: Cambridge Univ. Press, 2014.

  31. Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge: Cambridge Univ. Press, 1997.

  32. Ratcliff, J.W. and Metzener, D.E., Pattern matching: The Gestalt approach, Dr. Dobb’s J., 1988, vol. 13, no. 7, pp. 46–72.

  33. Abboud, A., Backurs, A., and Williams, V.V., Tight hardness results for LCS and other sequence similarity measures, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, 2015, pp. 59–78.

  34. Python DiffLib module. https://docs.python.org/3/library/difflib.html.

  35. Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E., Replacing suffix trees with enhanced suffix arrays, J. Discrete Algorithms, 2004, vol. 2, no. 1, pp. 53–86.

  36. Špakov, O. and Miniotas, D., Visualization of eye gaze data using heat maps, Elektronika ir elektrotechnika, 2007, pp. 55–58.

  37. Luciv, D.V., Detecting Near Duplicates in Software Documentation, 2017. arXiv: 1711.04705.

  38. Boyer, R.S. and Moore, J.S., A fast string searching algorithm, Commun. ACM, 1977, vol. 20, no. 10, pp. 762–772.

  39. Pandoc: A Universal Document Converter. https://pandoc.org/.

  40. Drobintsev, P.D., Kotlyarov, V.P., and Letichevsky, A.A., A formal approach to test scenarios generation based on guides, Automatic Control and Computer Sciences, 2014, vol. 48, no. 7, pp. 415–423.

  41. Pakulin, N.V. and Tugaenko, A.N., Model-based testing of Internet Mail Protocols, Proc. Inst. System Programming, 2011, vol. 20, pp. 125–141.

  42. Gorovoy, V.A., Bolotnikova, E.S., and Gavrilova, T.A., To a method of evaluating ontologies, J. Comput. Systems Sci. Int., 2011, vol. 50, no. 3, pp. 448–461.

Funding

This work is partially supported by RFBR grant 16-01-00304.

Author information

Corresponding authors

Correspondence to D. V. Luciv, D. V. Koznov, A. A. Shelikhovskii, K. Yu. Romanovsky, G. A. Chernishev, A. N. Terekhov, D. A. Grigoriev, A. N. Smirnova, D. V. Borovkov or A. I. Vasenina.

About this article

Cite this article

Luciv, D.V., Koznov, D.V., Shelikhovskii, A.A. et al. Interactive Near Duplicate Search in Software Documentation. Program Comput Soft 45, 346–355 (2019). https://doi.org/10.1134/S0361768819060045

  • DOI: https://doi.org/10.1134/S0361768819060045
