Interactive Near Duplicate Search in Software Documentation

Luciv, D. V.; Koznov, D. V.; Shelikhovskii, A. A.; Romanovsky, K. Yu.; Chernishev, G. A.; Terekhov, A. N.; Grigoriev, D. A.; Smirnova, A. N.; Borovkov, D. V.; Vasenina, A. I.

doi:10.1134/S0361768819060045

Published: 03 December 2019

Programming and Computer Software, Volume 45, pages 346–355 (2019)

Abstract

Various software features such as classes, methods, requirements, and tests often have similar functionality. This can lead to the emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder documentation maintenance. Therefore, the task of duplicate detection in software documentation is important. Solving it enables planned reuse, as well as the creation and use of templates for unifying and automatically generating documentation. In this paper, we present an approach for interactive detection of near duplicates that involves the user in order to conduct a meaningful search. It includes a new formal definition of a near duplicate, a pattern-based search algorithm, and the proof of its completeness. Moreover, we present the results of experiments on a collection of documents from several industrial projects.

Notes

We only consider groups that consist of fragments longer than four tokens because, according to our experiments [15], this constraint filters out many false positives.
The value \(1/\sqrt{3} \approx 0.577\) was selected for the convenience of the proofs that follow. Our experiments showed that if the similarity measure is less than 1/2, then for smaller patterns (up to 15–20 tokens) the algorithm produces many non-meaningful matches; the lower bound we have selected is only slightly larger than 1/2.
Here and below, we do not round the lengths of the intervals to integers, to save space. Nevertheless, all proofs can be performed with rounded values as well.
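
The two thresholds mentioned in these notes can be illustrated with a minimal sketch. This is not the authors' algorithm: the similarity measure itself is not defined in this excerpt, so the sketch assumes the ratio computed by Python's difflib (which appears in the reference list) as a stand-in, assumes whitespace tokenization, and assumes the lower bound is inclusive. The names `tokenize`, `similarity`, `is_candidate`, `MIN_TOKENS`, and `THRESHOLD` are hypothetical.

```python
import math
from difflib import SequenceMatcher

# Fragments must be longer than four tokens (first note).
MIN_TOKENS = 5
# Lower bound on similarity from the second note, about 0.577.
THRESHOLD = 1 / math.sqrt(3)


def tokenize(text):
    """Assumption: lowercase, whitespace-separated tokens."""
    return text.lower().split()


def similarity(a_tokens, b_tokens):
    """Assumption: difflib's ratio 2*M/T over token lists stands in
    for the paper's similarity measure, which this excerpt omits."""
    return SequenceMatcher(None, a_tokens, b_tokens, autojunk=False).ratio()


def is_candidate(pattern, fragment):
    """Return True if both texts are long enough and similar enough."""
    p, f = tokenize(pattern), tokenize(fragment)
    if len(p) < MIN_TOKENS or len(f) < MIN_TOKENS:
        return False  # too short: likely a false positive
    # Assumption: the bound is treated as inclusive.
    return similarity(p, f) >= THRESHOLD
```

For instance, `is_candidate("returns the number of items in the list", "returns the number of elements in the queue")` compares two eight-token fragments that share six tokens in order, giving a ratio of 0.75, which is above the bound. A three-token fragment is rejected regardless of its similarity.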

REFERENCES

1. Brooks, F.P., The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley, 1975.
2. Parnas, D.L., Precise documentation: The key to better software, The Future of Software Engineering, Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 125–148.
3. Bassett, P.G., Framing Software Reuse: Lessons from the Real World, Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997.
4. Jarzabek, S. and Pettersson, U., Research journey towards industrial application of reuse technique, ICSE, 2006, pp. 608–611.
5. Irshad, M., Petersen, K., and Poulding, S.M., A systematic literature review of software requirements reuse approaches, Inform. Software Technol., 2018, vol. 93, pp. 223–245.
6. Horie, M. and Chiba, S., Tool support for crosscutting concerns of API documentation, Proceedings of the 9th International Conference on Aspect-Oriented Software Development, New York, NY, USA: ACM, 2010, pp. 97–108.
7. Oumaziz, M.A., Charpentier, A., Falleri, J.-R., and Blanc, X., Documentation reuse: Hot or not? An empirical study, Mastering Scale and Complexity in Software Reuse: 16th International Conference on Software Reuse, Springer, 2017, pp. 12–27.
8. Koznov, D.V. and Romanovsky, K.Yu., DocLine: A method for software product lines documentation development, Programming and Computer Software, 2008, vol. 34, no. 4, pp. 216–224.
9. Romanovsky, K., Koznov, D., and Minchin, L., Refactoring the documentation of Software Product Lines, Lect. Notes Computer Science, Berlin, Heidelberg: Springer-Verlag, 2011, vol. 4980, pp. 158–170.
10. Jarzabek, S. and Dan, D., Documentation Reuse: Managing Similar Documents, FedCSIS, 2017, pp. 1325–1334.
11. Juergens, E., Deissenboeck, F., Feilkas, M., Hummel, B., Schaetz, B., Wagner, S., Domann, C., and Streit, J., Can clone detection support quality assessments of requirements specifications?, Proceedings of ACM/IEEE 32nd International Conference on Software Engineering, 2010, vol. 2, pp. 79–88.
12. Nosál’, M. and Porubän, J., Reusable software documentation with phrase annotations, Central Eur. J. Comp. Sci., 2014, vol. 4, no. 4, pp. 242–258.
13. Nosál’, M. and Porubän, J., Preliminary report on empirical study of repeated fragments in internal documentation, Proceedings of Federated Conference on Computer Science and Information Systems, 2016, pp. 1573–1576.
14. Wingkvist, A., Lowe, W., Ericsson, M., and Lincke, R., Analysis and visualization of information quality of technical documentation, Proceedings of the 4th European Conference on Information Management and Evaluation, 2010, pp. 388–396.
15. Koznov, D., Luciv, D., Basit, H.A., Lieh, O.E., and Smirnov, M., Clone Detection in Reuse of Software Technical Documentation, International Andrei Ershov Memorial Conference on Perspectives of System Informatics (2015), Springer Nature, 2016, vol. 9609 of Lecture Notes in Computer Science, pp. 170–185.
16. Luciv, D.V., Koznov, D.V., Chernishev, G.A., Terekhov, A.N., Romanovsky, K.Yu., and Grigoriev, D.A., Detecting near duplicates in software documentation, Programming and Comput. Software, 2018, vol. 44, no. 5.
17. Koznov, D.V., Luciv, D.V., and Chernishev, G.A., Duplicate management in software documentation maintenance, Proceedings of V International Conference Actual Problems of System and Software Engineering, vol. 1989: CEUR Workshop Proceedings, 2017, pp. 195–201.
18. Duplicate Finder. http://www.math.spbu.ru/user/kromanovsky/docline/index.html.
19. Luciv, D.V., Koznov, D.V., Chernishev, G.A., Basit, H.A., Romanovsky, K.Yu., and Terekhov, A.N., Poster: Duplicate finder toolkit, Proceedings of the International Conference on Software Engineering (ICSE 2018), 2018, pp. 171–172.
20. Basit, H.A., Puglisi, S.J., Smyth, W.F., Turpin, A., and Jarzabek, S., Efficient token based clone detection with flexible tokenization, Proceedings of the 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers, New York, NY, USA: 2007, pp. 513–516.
21. Rago, A., Marcos, C., and Diaz-Pace, J.A., Identifying duplicate functionality in textual use cases by aligning semantic actions, Software and Systems Modeling, 2016, vol. 15, no. 2, pp. 579–603.
22. Ukkonen, E., Finding approximate patterns in strings, J. Algorithms, 1985, vol. 6, no. 1, pp. 132–137.
23. Broder, A.Z., On the resemblance and containment of documents, Compression and Complexity of Sequences 1997, Proceedings, IEEE, 1997, pp. 21–29.
24. Wu, S. and Manber, U., Fast Text Searching: Allowing Errors, Commun. ACM, 1992, vol. 35, no. 10, pp. 83–91.
25. Landau, G.M. and Vishkin, U., Fast string matching with k differences, J. Comput. System Sci., 1988, vol. 37, no. 1, pp. 63–78.
26. Myers, G., A Fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM, 1999, vol. 46, no. 3, pp. 395–415.
27. Levenshtein, V., Binary codes capable of correcting spurious insertions and deletions of ones, Problems Inform. Transmission, 1965, vol. 1, pp. 8–17.
28. Smyth, W., Computing Patterns in Strings, Addison-Wesley, 2003.
29. Bergroth, L., Hakonen, H., and Raita, T., A survey of longest common subsequence algorithms, String Processing and Information Retrieval, 2000 (SPIRE 2000), Proceedings, Seventh International Symposium on, 2000, pp. 39–48.
30. Leskovec, J., Rajaraman, A., and Ullman, J.D., Mining of Massive Datasets, Cambridge: Cambridge Univ. Press, 2014.
31. Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge: Cambridge Univ. Press, 1997.
32. Ratcliff, J.W. and Metzener, D.E., Pattern matching: The Gestalt approach, Dr. Dobb’s J., 1988, vol. 13, no. 7, pp. 46–72.
33. Abboud, A., Backurs, A., and Williams, V.V., Tight hardness results for LCS and other sequence similarity measures, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, 2015, pp. 59–78.
34. Python DiffLib module. https://docs.python.org/3/library/difflib.html.
35. Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E., Replacing suffix trees with enhanced suffix arrays, J. Discrete Algorithms, 2004, vol. 2, no. 1, pp. 53–86.
36. Špakov, O. and Miniotas, D., Visualization of eye gaze data using heat maps, Elektronika ir elektrotechnika, 2007, pp. 55–58.
37. Luciv, D.V., Detecting Near Duplicates in Software Documentation, 2017. arXiv: 1711.04705.
38. Boyer, R.S. and Moore, J.S., A fast string searching algorithm, Commun. ACM, 1977, vol. 20, no. 10, pp. 762–772.
39. Pandoc: A Universal Document Converter. https://pandoc.org/.
40. Drobintsev, P.D., Kotlyarov, V.P., and Letichevsky, A.A., A formal approach to test scenarios generation based on guides, Automatic Control and Computer Sciences, 2014, vol. 48, no. 7, pp. 415–423.
41. Pakulin, N.V. and Tugaenko, A.N., Model-based testing of Internet Mail Protocols, Proc. Inst. System Programming, 2011, vol. 20, pp. 125–141.
42. Gorovoy, V.A., Bolotnikova, E.S., and Gavrilova, T.A., To a method of evaluating ontologies, J. Comput. Systems Sci. Int., 2011, vol. 50, no. 3, pp. 448–461.

Funding

This work is partially supported by RFBR grant 16-01-00304.

Author information

Authors and Affiliations

St. Petersburg University, 199034, St. Petersburg, Russia
D. V. Luciv, D. V. Koznov, A. A. Shelikhovskii, K. Yu. Romanovsky, G. A. Chernishev, A. N. Terekhov, D. A. Grigoriev, A. N. Smirnova, D. V. Borovkov & A. I. Vasenina

Corresponding authors

Correspondence to D. V. Luciv, D. V. Koznov, A. A. Shelikhovskii, K. Yu. Romanovsky, G. A. Chernishev, A. N. Terekhov, D. A. Grigoriev, A. N. Smirnova, D. V. Borovkov or A. I. Vasenina.

About this article

Cite this article

Luciv, D.V., Koznov, D.V., Shelikhovskii, A.A. et al. Interactive Near Duplicate Search in Software Documentation. Program Comput Soft 45, 346–355 (2019). https://doi.org/10.1134/S0361768819060045

Received: 28 June 2018
Revised: 05 September 2018
Accepted: 05 September 2018
Published: 03 December 2019
Issue Date: November 2019
DOI: https://doi.org/10.1134/S0361768819060045
