Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities

Li, Zhi-Xing; Yu, Yue; Wang, Tao; Yin, Gang; Mao, Xin-Jun; Wang, Huai-Min

doi:10.1007/s11390-020-9935-1

Regular Paper
Published: 30 January 2021

Volume 36, pages 191–206, (2021)

Journal of Computer Science and Technology

Zhi-Xing Li¹,
Yue Yu¹,
Tao Wang¹,
Gang Yin¹,
Xin-Jun Mao² &
…
Huai-Min Wang¹

Abstract

Communication and coordination among open source software (OSS) developers who do not work in the same physical location have always been challenging. The pull-based development model, the state-of-the-art collaborative development mechanism, provides high openness and transparency to improve the visibility of contributors’ work. However, because the model is parallel and uncoordinated, more than one contributor may still submit duplicate contributions that solve the same problem. If not detected in time, duplicate pull-requests cause contributors and reviewers to waste time and energy on redundant work. In this paper, we propose an approach that combines textual and change similarities to automatically detect duplicate contributions in the pull-based model at submission time. For a newly arriving contribution, we first compute the textual similarity and the change similarity between it and each existing contribution. Our method then returns a list of candidate duplicates: the existing contributions most similar to the new one in terms of the combined textual and change similarity. The evaluation shows that, on average, 83.4% of the duplicates can be found using the combined textual and change similarity, compared with 54.8% using only textual similarity and 78.2% using only change similarity.
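
The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measures or how they are combined, so everything concrete here is an assumption made for illustration: TF-IDF cosine similarity stands in for textual similarity, Jaccard overlap of changed file paths stands in for change similarity, and the two are mixed with a hypothetical weight `alpha`. The function names, the dictionary fields (`id`, `text`, `files`), and the sample pull-requests are likewise invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build smoothed TF-IDF vectors (term -> weight) for a list of token lists."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def rank_candidates(new_pr, existing_prs, alpha=0.5, k=3):
    """Return the k existing PRs most similar to new_pr as (id, score) pairs.

    Assumption: the combined score is a weighted sum of the two similarities;
    the paper's actual measures and combination may differ.
    """
    docs = [tokenize(pr["text"]) for pr in [new_pr] + existing_prs]
    vecs = tfidf_vectors(docs)
    scored = []
    for pr, vec in zip(existing_prs, vecs[1:]):
        text_sim = cosine(vecs[0], vec)                       # textual similarity
        change_sim = jaccard(new_pr["files"], pr["files"])    # change similarity
        scored.append((pr["id"], alpha * text_sim + (1 - alpha) * change_sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical pull-requests, invented for this example.
existing = [
    {"id": 101, "text": "Fix null pointer in config parser",
     "files": ["src/config/parser.py"]},
    {"id": 102, "text": "Update README badges",
     "files": ["README.md"]},
]
new = {"id": 103, "text": "Fix crash from null pointer in config parser",
       "files": ["src/config/parser.py", "tests/test_parser.py"]}

ranked = rank_candidates(new, existing)
print(ranked)
```

In this toy input, pull-request 101 shares both wording and a changed file with the new contribution, so it is ranked ahead of 102, which shares neither. Using both signals lets a candidate that is described differently but touches the same files still surface near the top of the list.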

Author information

Authors and Affiliations

Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha, 410073, China
Zhi-Xing Li, Yue Yu, Tao Wang, Gang Yin & Huai-Min Wang
Laboratory of Software Engineering for Complex Systems, College of Computer, National University of Defense Technology, Changsha, 410073, China
Xin-Jun Mao

Corresponding author

Correspondence to Yue Yu.

Supplementary Information

ESM 1

(PDF 688 kb)

About this article

Cite this article

Li, ZX., Yu, Y., Wang, T. et al. Detecting Duplicate Contributions in Pull-Based Model Combining Textual and Change Similarities. J. Comput. Sci. Technol. 36, 191–206 (2021). https://doi.org/10.1007/s11390-020-9935-1

Received: 15 August 2019
Accepted: 03 January 2021
Published: 30 January 2021
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11390-020-9935-1
