Detecting Near Duplicates in Software Documentation

Luciv, D. V.; Koznov, D. V.; Chernishev, G. A.; Terekhov, A. N.; Romanovsky, K. Yu.; Grigoriev, D. A.

doi:10.1134/S0361768818050079

Published: 21 September 2018

Programming and Computer Software, Volume 44, pages 335–343 (2018)

Abstract

Contemporary software documentation is as complicated as the software itself. Over its lifecycle, documentation accumulates many “near duplicate” fragments, i.e., chunks of text that were copied from a single source and later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further use. At the same time, they are hard to detect manually because of their fuzzy nature. In this paper, we give a formal definition of near duplicates and present an algorithm for detecting them in software documents. The algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents, and our algorithm then uses these exact duplicates to construct near ones. We evaluate the proposed algorithm on the documentation of 19 open source and commercial projects. The evaluation covers various documentation types: design and requirement specifications, programming guides and API documentation, and user manuals. Overall, it shows that all kinds of software documentation contain a significant number of both exact and near duplicates. We then report on a manual analysis of the near duplicates detected in the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the algorithm's strengths and weaknesses, and discuss the benefits of duplicate management in software documents.
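The abstract does not say how exact duplicates are combined into near ones, so the following is only an illustrative sketch and not the authors' algorithm. It assumes that exact duplicate groups are already available as token intervals (in the paper they come from the adapted Clone Miner), and it joins two groups into one near duplicate group when each occurrence of the second group follows the matching occurrence of the first within a small gap. The function name `merge_groups`, the `max_gap` parameter, and the sample text are all hypothetical.

```python
from typing import List, Optional, Tuple

# Half-open token range [start, end) inside one document (assumed representation).
Interval = Tuple[int, int]


def merge_groups(first: List[Interval],
                 second: List[Interval],
                 max_gap: int) -> Optional[List[Interval]]:
    """Join two exact duplicate groups into one near duplicate group.

    Succeeds only if both groups have the same number of occurrences and,
    pairing occurrences in document order, every occurrence of `second`
    starts at most `max_gap` tokens after the paired occurrence of `first`
    ends. The tokens inside the gap are the part that may differ.
    Returns the joined intervals, or None if the groups do not line up.
    """
    if len(first) != len(second):
        return None
    merged: List[Interval] = []
    for (s1, e1), (s2, e2) in zip(sorted(first), sorted(second)):
        gap = s2 - e1
        if gap < 0 or gap > max_gap:
            return None
        merged.append((s1, e2))
    return merged


# Hypothetical input: two sentences that differ in a single token.
tokens = ("open the file in read mode then close it . "
          "open the socket in read mode then close it .").split()

# Exact duplicates written out by hand for this text:
group_a = [(0, 2), (10, 12)]    # "open the"
group_b = [(3, 10), (13, 20)]   # "in read mode then close it ."

near = merge_groups(group_a, group_b, max_gap=2)
for start, end in near or []:
    print(" ".join(tokens[start:end]))
```

Here the one-token gap ("file" versus "socket") is the varying part, and the two joined intervals cover both sentences in full. A real detector would also need to bound how much of a fragment may vary; that criterion belongs to the formal definition given in the full paper and is not reproduced here.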

ACKNOWLEDGMENTS

This work is partially supported by RFBR grant no. 16-01-00304.

Author information

Authors and Affiliations

Saint Petersburg State University, 199034, St. Petersburg, Russia
D. V. Luciv, D. V. Koznov, G. A. Chernishev, A. N. Terekhov, K. Yu. Romanovsky & D. A. Grigoriev

Corresponding authors

Correspondence to D. V. Luciv, D. V. Koznov, G. A. Chernishev, A. N. Terekhov, K. Yu. Romanovsky or D. A. Grigoriev.

Additional information

¹The article is published in the original.

About this article

Luciv, D.V., Koznov, D.V., Chernishev, G.A. et al. Detecting Near Duplicates in Software Documentation. Program Comput Soft 44, 335–343 (2018). https://doi.org/10.1134/S0361768818050079

Received: 08 August 2017
Published: 21 September 2018
Issue Date: September 2018
DOI: https://doi.org/10.1134/S0361768818050079
