Abstract
We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are very distinctive. Across PyPI’s 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The validation uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
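The trial procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `max_packages` cutoff used to decide what counts as "non-frequent", the rule of stopping at the first trial that yields candidates, and the toy data are all assumptions made for illustration. The sketch assumes an in-memory inverted index from global identifier to the set of packages defining it, and a popularity score per package.

```python
import random


def rare_identifiers(identifiers, index, max_packages=50):
    """Keep identifiers known to the index that occur in few packages.

    max_packages is a hypothetical cutoff for "non-frequent".
    """
    return [i for i in identifiers if 0 < len(index.get(i, ())) <= max_packages]


def guess_origin(files, index, popularity, trials=5, sample_size=3,
                 max_packages=50, seed=None):
    """Return the most popular candidate origin package, or None.

    files: dict mapping a Python file path to its set of global identifiers.
    index: dict mapping an identifier to the set of packages defining it.
    popularity: dict mapping a package name to a numeric popularity score.
    Assumption: stop at the first trial that yields any candidate.
    """
    rng = random.Random(seed)
    file_names = sorted(files)
    if not file_names:
        return None
    for _ in range(trials):
        chosen_file = rng.choice(file_names)
        pool = rare_identifiers(sorted(files[chosen_file]), index, max_packages)
        if len(pool) < sample_size:
            continue  # not enough rare identifiers in this file; try again
        sample = rng.sample(pool, sample_size)
        # Candidates are the packages that define every sampled identifier.
        candidates = set.intersection(*(set(index[i]) for i in sample))
        if candidates:
            # Inspect only the top result after ranking by popularity.
            return max(sorted(candidates), key=lambda p: popularity.get(p, 0))
    return None


# Toy data (hypothetical package and identifier names).
toy_index = {
    "parse_wheel": {"pkgA", "pkgB"},
    "WheelBuilder": {"pkgA"},
    "resolve_markers": {"pkgA", "pkgB"},
    "main": {f"pkg{n}" for n in range(100)},  # too frequent to be useful
}
toy_files = {"builder.py": {"parse_wheel", "WheelBuilder", "resolve_markers", "main"}}
toy_popularity = {"pkgA": 10, "pkgB": 3}

print(guess_origin(toy_files, toy_index, toy_popularity, seed=0))
```

Intersecting the package sets of the sampled identifiers is what makes rare identifiers effective here: a single identifier that appears in only one package already pins the candidate set to that package.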
Data Availability
A replication package for this paper is available at https://doi.org/10.5281/zenodo.7637703 (Sun et al. 2023). This replication package contains all pairs (identifier, product) found in PyPI.
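As a rough illustration of how such (identifier, product) pairs could be used, the sketch below builds an inverted index from them and computes the share of identifiers that occur in at most k products, the kind of statistic reported in the abstract. The on-disk format of the replication package is not described here, so the functions take a plain in-memory iterable of pairs. The function names and the toy pairs are hypothetical.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each identifier to the set of products in which it appears."""
    index = defaultdict(set)
    for identifier, product in pairs:
        index[identifier].add(product)
    return dict(index)


def share_used_in_at_most(index, k):
    """Fraction of identifiers that appear in at most k products."""
    if not index:
        return 0.0
    hits = sum(1 for products in index.values() if len(products) <= k)
    return hits / len(index)


# Toy pairs (hypothetical names, not taken from the replication package).
toy_pairs = [
    ("parse_wheel", "pkgA"), ("parse_wheel", "pkgB"),
    ("WheelBuilder", "pkgA"),
    ("Resolver", "pkgB"),
    ("run", "pkgA"), ("run", "pkgB"), ("run", "pkgC"), ("run", "pkgD"),
]
toy_pair_index = build_index(toy_pairs)
print(share_used_in_at_most(toy_pair_index, 1))
print(share_used_in_at_most(toy_pair_index, 3))
```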
Notes
https://pypi.org/, accessed 2021-11-15
http://ctags.sourceforge.net/, retrieved 2022-09-22
Note that this copy might not have been done directly from the corpus; it is, however, a copy of the same entity that exists in the corpus.
Projects do not reside in PyPI, but PyPI links to their actual location.
https://pypi.org/project/pypi-simple/, accessed 2021-10-25
https://warehouse.pypa.io/api-reference/json.html, accessed 2021-10-25
https://ctags.io/, accessed 2021-10-25. Universal Ctags 0.0.0 (2015) derived from Exuberant Ctags 5.8.
At that time Debian Buster was already shipped as a “stable” release, so while it is possible that its content has changed since, modifications are expected to be minimal according to Debian release processes.
https://docs.libraries.io/overview.html#sourcerank, accessed 2021-12-07
Titled: “Scraping tripadvisor review, len container change, no such element Unable to locate element”, https://stackoverflow.com/questions/68878857, accessed 2022-01-16
References
Arnaoudova V, Eshkevari LM, Di Penta M, Oliveto R, Antoniol G, Guéhéneuc YG (2014) Repent: analyzing the nature of identifier renamings. IEEE Transactions on Software Engineering 40(5):502–532
The Python Packaging Authority (2022) Packaging python projects. https://packaging.python.org/en/latest/tutorials/packaging-projects/ Sept 2022
Binkley D, Davis M, Lawrie D, Maletic JI, Morrell C, Sharif B (2013) The impact of identifier style on effort and comprehension. Empirical Software Engineering 18(2):219–276
Bose RPJC, Phokela KK, Kaulgud V, Podder S (2019) Blinker: a blockchain-enabled framework for software provenance. In: 2019 26th Asia-pacific software engineering conference (APSEC) IEEE p 1–8
Butler G, Grogono P, Shinghal R, Tjandra I (1995) Retrieving information from data flow diagrams. In: Proceedings of 2nd working conference on reverse engineering IEEE p 22–29
Butt AS, Fitch P (2020) ProvONE+: a provenance model for scientific workflows. In: International conference on web information systems engineering, Springer, p 431–444
Caneill M, German DM, Zacchiroli S (2017) The debsources dataset: two decades of free and open source software. Empirical Software Engineering 22:1405–1437
Caprile B, Tonella P (2000) Restructuring program identifier names. In: ICSM, p 97–107
Cordy JR, Roy CK (2011) The NiCad clone detector. In: The 19th IEEE international conference on program comprehension, ICPC 2011, Kingston, ON, Canada, June 22-24, 2011, IEEE Computer Society, p 219–220
Cosmo RD, Zacchiroli S (2017) Software heritage: why and how to preserve software source code. In: Proceedings of the 14th international conference on digital preservation, iPRES 2017, Kyoto, Japan, September 25-29, 2017
Dang YB, Cheng P, Luo L, Cho A (2008) A code provenance management tool for ip-aware software development. In: Companion of the 30th international conference on software engineering, p 975–976
Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, p 183–192
Davies J, German DM, Godfrey MW, Hindle A (2013) Software bertillonage. Empirical Software Engineering 18(6):1195–1237
Deissenboeck F, Pizka M (2006) Concise and consistent naming. Software Quality Journal 14(3):261–282
Di Penta M, German DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: 2010 7th IEEE working conference on mining software repositories (MSR 2010), IEEE, p 151–160
Gabel M, Su Z (2010) A study of the uniqueness of source code. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on foundations of software engineering, p 147–156
Gautam P, Saini H (2016) Various code clone detection techniques and tools: a comprehensive survey. In: International conference on smart trends for information technology and computer communications, Springer, p 655–667
Gharehyazie M, Ray B, Filkov V (2017) Some from here, some from there: cross-project code reuse in github. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR), IEEE, p 291–301
Godfrey MW (2015) Understanding software artifact provenance. Science of Computer Programming 97:86–90
Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Transactions on Software Engineering 31(2):166–181
Gupta A, Suri B (2018) A survey on code clone, its behavior and applications. In: Networking communication and data knowledge engineering, Springer, p 27–39
Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81
Hofmeister J, Siegmund J, Holt DV (2017) Shorter identifier names take longer to comprehend. In: 2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER), IEEE, p 217–227
Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28(7):654–670
Kapdan M, Aktas M, Yigit M (2014) On the structural code clone detection problem: a survey and software metric based approach. In: International conference on computational science and its applications, Springer, p 492–507
Lawrie D, Morrell C, Feild H, Binkley D (2007) Effective identifier names for comprehension and memory. Innovations in Systems and Software Engineering 3(4):303–318
Li Z, Lu S, Myagmar S, Zhou Y (2006) Cp-miner: finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering 32(3):176–192
Manning CD, Raghavan P, Schutze H (2009) An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England
McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Software Eng 38(5):1069–1087
Miles S, Groth P, Munroe S, Moreau L (2011) Prime: a methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology (TOSEM) 20(3):1–42
Missier P, Belhajjame K, Cheney J (2013) The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of the 16th international conference on extending database technology, p 773–776
Missier P, Dey S, Belhajjame K, Vicenttín VC, Ludäscher B (2013) D-prov: extending the PROV provenance model with workflow structure. In: 5th USENIX workshop on the theory and practice of provenance (TaPP 13)
Moreau L, Clifford B, Freire J, Futrelle J, Gil Y, Groth P, Kwasnikowska N, Miles S, Missier P, Myers J et al (2011) The open provenance model core specification (v1.1). Future Generation Computer Systems 27(6):743–756
Nguyen S, Phan H, Le T, Nguyen TN (2020) Suggesting natural method names to check name consistencies. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, p 1372–1384
Ombredanne P (2020) Free and open source software license compliance: tools for software composition analysis. Computer 53(10):105–109
Ossher J, Sajnani H, Lopes C (2011) File cloning in open source java projects: the good, the bad, and the ugly. In: 2011 27th IEEE international conference on software maintenance (ICSM), IEEE, p 283–292
Perez D, Chiba S (2019) Cross-language clone detection by learning over abstract syntax trees. In: 2019 IEEE/ACM 16th international conference on mining software repositories (MSR), IEEE, p 518–528
Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119
Pietri A, Spinellis D, Zacchiroli S (2019) The software heritage graph dataset: public software development under one roof. In: Proceedings of the 16th international conference on mining software repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, IEEE / ACM, p 138–142
Rosen L (2005) Open source licensing, volume 692. Prentice hall
Rousseau G, Cosmo RD, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25:2930–2959
Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s School of Computing TR 541(115):64–68
Saini M, Verma R, Singh A, Chahal KK (2020) Investigating diversity and impact of the popularity metrics for ranking software packages. J Softw Evol Process 32(9)
Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: Proceedings of the 38th international conference on software engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, ACM, p 1157–1168
Securosis, L.L.C. (2021) Open source development and application security survey [online]. https://securosis.com/assets/library/reports/Securosis_OpenSourceSurvey_Analysis.pdf, Accessed 14 June 2021
Sheneamer A, Kalita J (2016) A survey of software clone detection techniques. International Journal of Computer Applications 137(10):1–21
Sneed HM (1996) Object-oriented cobol recycling. In: Proceedings of WCRE’96: 4rd working conference on reverse engineering, IEEE, p 169–178
Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L. Rev. 2:191
Sun Y, German D, Zacchiroli S (2023) Dataset for “Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code”. https://doi.org/10.5281/zenodo.7637703 February 2023
Synopsys (2020) 2020 open source security and risk analysis report (OSSRA). Technical Report, Synopsys. Accessed 15 April 2020
Warintarawej P, Huchard M, Lafourcade M, Laurent A, Pompidor P (2015) Software understanding: automatic classification of software identifiers. Intelligent Data Analysis 19(4):761–778
Wendel H, Kunde M, Schreiber A (2010) Provenance of software development processes. In: International provenance and annotation workshop, Springer, p 59–63
Yuan Y, Guo Y (2012) Boreas: an accurate and scalable token-based approach to code clone detection. In: Proceedings of the 27th IEEE/ACM international conference on automated software engineering, p 286–289
Zimmermann T (2020) A first look at an emerging model of community organizations for the long-term maintenance of ecosystems’ packages. In: Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, ICSEW’20, New York, NY, USA, 2020. Association for Computing Machinery, p 711–718
Ethics declarations
Conflict of interests
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Communicated by: Janet Siegmund.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, Y., German, D. & Zacchiroli, S. Using the uniqueness of global identifiers to determine the provenance of Python software source code. Empir Software Eng 28, 107 (2023). https://doi.org/10.1007/s10664-023-10317-8
DOI: https://doi.org/10.1007/s10664-023-10317-8