Abstract
Software features such as classes, methods, requirements, and tests often have similar functionality, which can lead to the emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder documentation maintenance. Therefore, the task of duplicate detection in software documentation is important. Solving it makes planned reuse possible, as well as the creation and use of templates for unification and automatic generation of documentation. In this paper, we present an approach for interactive detection of near duplicates that involves the user in order to conduct a meaningful search. It includes a new formal definition of a near duplicate, a pattern-based search algorithm, and the proof of its completeness. Moreover, we demonstrate the results of experiments on a collection of documents from several industrial projects.
Notes
We only consider groups that consist of fragments longer than four tokens, because, according to our experiments [15], this particular constraint filters out many false positives.
The value \(1/\sqrt{3} \approx 0.577\) was selected for the convenience of the proofs that follow. Our experiments showed that if the similarity measure is less than 1/2, then, for smaller patterns (up to 15–20 tokens), the algorithm produces many non-meaningful matches; the lower bound we have selected is only slightly larger than 1/2.
Here and below, we do not round the lengths of the intervals to integers, in order to save space. Nevertheless, all proofs can be performed with rounded values as well.
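The two constraints in these notes (fragments longer than four tokens, and a similarity of at least \(1/\sqrt{3}\)) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the whitespace tokenizer and the names `tokenize`, `similarity`, and `is_candidate` are hypothetical, and the similarity measure is the ratio from Python's difflib module, used here only as a stand-in for the measure defined in the paper.

```python
import math
from difflib import SequenceMatcher

MIN_TOKENS = 5                      # "longer than four tokens"
MIN_SIMILARITY = 1 / math.sqrt(3)   # lower bound from the notes, about 0.577


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def similarity(pattern_tokens, fragment_tokens):
    # Stand-in measure (an assumption): difflib's ratio, 2*M/T, where M is the
    # number of matched tokens and T is the total length of both sequences.
    matcher = SequenceMatcher(None, pattern_tokens, fragment_tokens, autojunk=False)
    return matcher.ratio()


def is_candidate(pattern, fragment):
    # Reject short fragments first, then apply the similarity lower bound.
    p, f = tokenize(pattern), tokenize(fragment)
    if len(p) < MIN_TOKENS or len(f) < MIN_TOKENS:
        return False
    return similarity(p, f) >= MIN_SIMILARITY


if __name__ == "__main__":
    pattern = "returns the number of elements in this list"
    print(is_candidate(pattern, "returns the number of elements in this set"))
    print(is_candidate(pattern, "throws an exception when the index is invalid"))
```

Under these assumptions, the length check runs before the similarity check, so very short fragments are discarded regardless of how closely they match the pattern.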
REFERENCES
Brooks, F.P., The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley, 1975.
Parnas, D.L., Precise documentation: The key to better software, The Future of Software Engineering, Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 125–148.
Bassett, P.G., Framing Software Reuse: Lessons from the Real World, Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997.
Jarzabek, S. and Pettersson, U., Research journey towards industrial application of reuse technique, ICSE, 2006, pp. 608–611.
Irshad, M., Petersen, K., and Poulding, S.M., A systematic literature review of software requirements reuse approaches, Inform. Software Technol., 2018, vol. 93, pp. 223–245.
Horie, M. and Chiba, S., Tool support for crosscutting concerns of API documentation, Proceedings of the 9th International Conference on Aspect-Oriented Software Development, New York, NY, USA: ACM, 2010, pp. 97–108.
Oumaziz, M.A., Charpentier, A., Falleri, J.-R., and Blanc, X., Documentation reuse: Hot or not? An empirical study, Mastering Scale and Complexity in Software Reuse: 16th International Conference on Software Reuse, Springer, 2017, pp. 12–27.
Koznov, D.V. and Romanovsky, K.Yu., DocLine: A method for software product lines documentation development, Programming and Computer Software, 2008, vol. 34, no. 4, pp. 216–224.
Romanovsky, K., Koznov, D., and Minchin, L., Refactoring the documentation of Software Product Lines, Lect. Notes Computer Science, Berlin, Heidelberg: Springer-Verlag, 2011, vol. 4980, pp. 158–170.
Jarzabek, S. and Dan, D., Documentation Reuse: Managing Similar Documents, FedCSIS, 2017, pp. 1325–1334.
Juergens, E., Deissenboeck, F., Feilkas, M., Hummel, B., Schaetz, B., Wagner, S., Domann, C., and Streit, J., Can clone detection support quality assessments of requirements specifications?, Proceedings of ACM/IEEE 32nd International Conference on Software Engineering, 2010, vol. 2, pp. 79–88.
Nosál’, M. and Porubän, J., Reusable software documentation with phrase annotations, Central Eur. J. Comp. Sci., 2014, vol. 4, no. 4, pp. 242–258.
Nosál’, M. and Porubän, J., Preliminary report on empirical study of repeated fragments in internal documentation, Proceedings of Federated Conference on Computer Science and Information Systems, 2016, pp. 1573–1576.
Wingkvist, A., Lowe, W., Ericsson, M., and Lincke, R., Analysis and visualization of information quality of technical documentation, Proceedings of the 4th European Conference on Information Management and Evaluation, 2010, pp. 388–396.
Koznov, D., Luciv, D., Basit, H.A., Lieh, O.E., and Smirnov, M., Clone Detection in Reuse of Software Technical Documentation, International Andrei Ershov Memorial Conference on Perspectives of System Informatics (2015), Springer Nature, 2016, vol. 9609 of Lecture Notes in Computer Science, pp. 170–185.
Luciv, D.V., Koznov, D.V., Chernishev, G.A., Terekhov, A.N., Romanovsky, K.Yu., and Grigoriev, D.A., Detecting near duplicates in software documentation, Programming and Computer Software, 2018, vol. 44, no. 5.
Koznov, D.V., Luciv, D.V., and Chernishev, G.A., Duplicate management in software documentation maintenance, Proceedings of V International Conference Actual Problems of System and Software Engineering, vol. 1989: CEUR Workshop Proceedings, 2017, pp. 195–201.
Duplicate Finder. http://www.math.spbu.ru/user/kromanovsky/docline/index.html.
Luciv, D.V., Koznov, D.V., Chernishev, G.A., Basit, H.A., Romanovsky, K.Yu., and Terekhov, A.N., Poster: Duplicate finder toolkit, Proceedings of the International Conference on Software Engineering (ICSE 2018), 2018, pp. 171–172.
Basit, H.A., Puglisi, S.J., Smyth, W.F., Turpin, A., and Jarzabek, S., Efficient token based clone detection with flexible tokenization, Proceedings of the 6th Joint Meeting on European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering: Companion Papers, New York, NY, USA: 2007, pp. 513–516.
Rago, A., Marcos, C., and Diaz-Pace, J.A., Identifying duplicate functionality in textual use cases by aligning semantic actions, Software and Systems Modeling, 2016, vol. 15, no. 2, pp. 579–603.
Ukkonen, E., Finding approximate patterns in strings, J. Algorithms, 1985, vol. 6, no. 1, pp. 132–137.
Broder, A.Z., On the resemblance and containment of documents, Compression and Complexity of Sequences 1997, Proceedings, IEEE, 1997, pp. 21–29.
Wu, S. and Manber, U., Fast Text Searching: Allowing Errors, Commun. ACM, 1992, vol. 35, no. 10, pp. 83–91.
Landau, G.M. and Vishkin, U., Fast string matching with k differences, J. Comput. System Sci., 1988, vol. 37, no. 1, pp. 63–78.
Myers, G., A Fast bit-vector algorithm for approximate string matching based on dynamic programming, J. ACM, 1999, vol. 46, no. 3, pp. 395–415.
Levenshtein, V., Binary codes capable of correcting spurious insertions and deletions of ones, Problems Inform. Transmission, 1965, vol. 1, pp. 8–17.
Smyth, W., Computing Patterns in Strings, Addison-Wesley, 2003.
Bergroth, L., Hakonen, H., and Raita, T., A survey of longest common subsequence algorithms, String Processing and Information Retrieval, 2000 (SPIRE 2000), Proceedings, Seventh International Symposium on, 2000, pp. 39–48.
Leskovec, J., Rajaraman, A., and Ullman, J.D., Mining of Massive Datasets, Cambridge: Cambridge Univ. Press, 2014.
Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge: Cambridge Univ. Press, 1997.
Ratcliff, J.W. and Metzener, D.E., Pattern matching: The Gestalt approach, Dr. Dobb’s J., 1988, vol. 13, no. 7, pp. 46–72.
Abboud, A., Backurs, A., and Williams, V.V., Tight hardness results for LCS and other sequence similarity measures, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, 2015, pp. 59–78.
Python DiffLib module. https://docs.python.org/3/library/difflib.html.
Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E., Replacing suffix trees with enhanced suffix arrays, J. Discrete Algorithms, 2004, vol. 2, no. 1, pp. 53–86.
Špakov, O. and Miniotas, D., Visualization of eye gaze data using heat maps, Elektronika ir elektrotechnika, 2007, pp. 55–58.
Luciv, D.V., Detecting Near Duplicates in Software Documentation, 2017. arXiv: 1711.04705.
Boyer, R.S. and Moore, J.S., A fast string searching algorithm, Commun. ACM, 1977, vol. 20, no. 10, pp. 762–772.
Pandoc: A Universal Document Converter. https://pandoc.org/.
Drobintsev, P.D., Kotlyarov, V.P., and Letichevsky, A.A., A formal approach to test scenarios generation based on guides, Automatic Control and Computer Sciences, 2014, vol. 48, no. 7, pp. 415–423.
Pakulin, N.V. and Tugaenko, A.N., Model-based testing of Internet Mail Protocols, Proc. Inst. System Programming, 2011, vol. 20, pp. 125–141.
Gorovoy, V.A., Bolotnikova, E.S., and Gavrilova, T.A., To a method of evaluating ontologies, J. Comput. Systems Sci. Int., 2011, vol. 50, no. 3, pp. 448–461.
Funding
This work is partially supported by RFBR grant 16-01-00304.
Cite this article
Luciv, D.V., Koznov, D.V., Shelikhovskii, A.A. et al. Interactive Near Duplicate Search in Software Documentation. Program Comput Soft 45, 346–355 (2019). https://doi.org/10.1134/S0361768819060045