Using the uniqueness of global identifiers to determine the provenance of Python software source code

Empirical Software Engineering

Abstract

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are very distinct. Across PyPI’s 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The validation uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
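
The lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy index, the popularity scores, the `max_packages` cutoff used to approximate "non-frequent", and the choice to stop at the first trial that yields a candidate are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_index(pairs):
    """Map each global identifier to the set of packages that define it."""
    index = defaultdict(set)
    for identifier, package in pairs:
        index[identifier].add(package)
    return index


def one_trial(identifiers, index, rng, k=3, max_packages=3):
    """Pick k non-frequent identifiers and intersect their package sets.

    "Non-frequent" is approximated here (an assumption) as appearing in
    at most `max_packages` packages of the index.
    """
    usable = sorted(
        name for name in set(identifiers)
        if 0 < len(index.get(name, ())) <= max_packages
    )
    if len(usable) < k:
        return set()
    chosen = rng.sample(usable, k)
    return set.intersection(*(index[name] for name in chosen))


def find_origin(files, index, popularity, trials=5, max_packages=3, seed=0):
    """Run up to `trials` trials; return the most popular candidate or None.

    `files` maps a file name to the global identifiers it defines.
    Stopping at the first non-empty trial is an assumption of this sketch.
    """
    rng = random.Random(seed)
    names = sorted(files)
    for _ in range(trials):
        identifiers = files[rng.choice(names)]
        candidates = one_trial(identifiers, index, rng, max_packages=max_packages)
        if candidates:
            # Rank by the (hypothetical) popularity score and keep the top result.
            return max(sorted(candidates), key=lambda p: popularity.get(p, 0))
    return None


# Toy data: "alpha" and its fork share three identifiers; "main" is common.
pairs = [
    ("parse_manifest", "alpha"), ("ManifestError", "alpha"), ("load_manifest", "alpha"),
    ("parse_manifest", "alpha-fork"), ("ManifestError", "alpha-fork"),
    ("load_manifest", "alpha-fork"),
    ("main", "alpha"), ("main", "beta"), ("main", "gamma"),
]
index = build_index(pairs)
popularity = {"alpha": 20, "alpha-fork": 3}
files = {"manifest.py": ["parse_manifest", "ManifestError", "load_manifest", "main"]}
origin = find_origin(files, index, popularity, max_packages=2)
```

In this toy run the three distinctive identifiers narrow the candidates to the package and its fork, and the popularity ranking breaks the tie in favor of the more popular one.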

Data Availability

A replication package for this paper is available at https://doi.org/10.5281/zenodo.7637703 (Sun et al. 2023). This replication package contains all (identifier, product) pairs found in PyPI.
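
As a hedged illustration of how such pairs relate to the uniqueness figures in the abstract, the sketch below computes the share of identifiers that occur in at most a given number of products. The in-memory tuple format and the function name `package_count_shares` are assumptions; the actual file layout of the replication package is not described here.

```python
from collections import defaultdict


def package_count_shares(pairs, thresholds=(1, 3)):
    """Return, per threshold t, the share of identifiers found in at most t products."""
    products_by_identifier = defaultdict(set)
    for identifier, product in pairs:
        products_by_identifier[identifier].add(product)
    total = len(products_by_identifier)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for products in products_by_identifier.values() if len(products) <= t) / total
        for t in thresholds
    }


# Toy pairs (hypothetical): "a" and "d" are unique, "b" is in 2 products, "c" in 4.
pairs = [
    ("a", "p1"),
    ("b", "p1"), ("b", "p2"),
    ("c", "p1"), ("c", "p2"), ("c", "p3"), ("c", "p4"),
    ("d", "p2"),
]
shares = package_count_shares(pairs)
```

On the full dataset, the thresholds 1 and 3 would correspond to the 76% and 93% figures reported in the abstract.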

Notes

  1. https://pypi.org/, accessed 2021-11-15

  2. http://ctags.sourceforge.net/, retrieved 2022-09-22

  3. Note that this copy might not have been done directly from the corpus; it is, however, a copy of the same entity that exists in the corpus.

  4. Projects do not reside in PyPI, but PyPI links to their actual location.

  5. https://pypi.org/project/pypi-simple/, accessed 2021-10-25

  6. https://warehouse.pypa.io/api-reference/json.html, accessed 2021-10-25

  7. https://ctags.io/, accessed 2021-10-25. Universal Ctags 0.0.0 (2015) derived from Exuberant Ctags 5.8.

  8. At that time Debian Buster was already shipped as a “stable” release, so while it is possible that its content has changed since, modifications are expected to be minimal according to Debian release processes.

  9. https://docs.libraries.io/overview.html#sourcerank, accessed 2021-12-07

  10. Titled: “Scraping tripadvisor review, len container change, no such element Unable to locate element”, https://stackoverflow.com/questions/68878857, accessed 2022-01-16

References

  • Arnaoudova V, Eshkevari LM, Di Penta M, Oliveto R, Antoniol G, Guéhéneuc YG (2014) Repent: analyzing the nature of identifier renamings. IEEE Transactions on Software Engineering 40(5):502–532

  • The Python Packaging Authority (2022) Packaging Python projects. https://packaging.python.org/en/latest/tutorials/packaging-projects/ Sept 2022

  • Binkley D, Davis M, Lawrie D, Maletic JI, Morrell C, Sharif B (2013) The impact of identifier style on effort and comprehension. Empirical Software Engineering 18(2):219–276

  • Bose RPJC, Phokela KK, Kaulgud V, Podder S (2019) Blinker: a blockchain-enabled framework for software provenance. In: 2019 26th Asia-Pacific software engineering conference (APSEC), IEEE, p 1–8

  • Butler G, Grogono P, Shinghal R, Tjandra I (1995) Retrieving information from data flow diagrams. In: Proceedings of 2nd working conference on reverse engineering, IEEE, p 22–29

  • Butt AS, Fitch P (2020) ProvONE+: a provenance model for scientific workflows. In: International conference on web information systems engineering, Springer, p 431–444

  • Caneill M, German DM, Zacchiroli S (2017) The Debsources dataset: two decades of free and open source software. Empirical Software Engineering 22:1405–1437

  • Caprile B, Tonella P (2000) Restructuring program identifier names. In: ICSM, p 97–107

  • Cordy JR, Roy CK (2011) The NiCad clone detector. In: The 19th IEEE international conference on program comprehension, ICPC 2011, Kingston, ON, Canada, June 22-24, 2011, IEEE Computer Society, p 219–220

  • Cosmo RD, Zacchiroli S (2017) Software heritage: why and how to preserve software source code. In: Proceedings of the 14th international conference on digital preservation, iPRES 2017, Kyoto, Japan, September 25-29, 2017

  • Dang YB, Cheng P, Luo L, Cho A (2008) A code provenance management tool for IP-aware software development. In: Companion of the 30th international conference on software engineering, p 975–976

  • Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, p 183–192

  • Davies J, German DM, Godfrey MW, Hindle A (2013) Software bertillonage. Empirical Software Engineering 18(6):1195–1237

  • Deissenboeck F, Pizka M (2006) Concise and consistent naming. Software Quality Journal 14(3):261–282

  • Di Penta M, German DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: 2010 7th IEEE working conference on mining software repositories (MSR 2010), IEEE, p 151–160

  • Gabel M, Su Z (2010) A study of the uniqueness of source code. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on foundations of software engineering, p 147–156

  • Gautam P, Saini H (2016) Various code clone detection techniques and tools: a comprehensive survey. In: International conference on smart trends for information technology and computer communications, Springer, p 655–667

  • Gharehyazie M, Ray B, Filkov V (2017) Some from here, some from there: cross-project code reuse in GitHub. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR), IEEE, p 291–301

  • Godfrey MW (2015) Understanding software artifact provenance. Science of Computer Programming 97:86–90

  • Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Transactions on Software Engineering 31(2):166–181

  • Gupta A, Suri B (2018) A survey on code clone, its behavior and applications. In: Networking communication and data knowledge engineering, Springer, p 27–39

  • Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81

  • Hofmeister J, Siegmund J, Holt DV (2017) Shorter identifier names take longer to comprehend. In: 2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER), IEEE, p 217–227

  • Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28(7):654–670

  • Kapdan M, Aktas M, Yigit M (2014) On the structural code clone detection problem: a survey and software metric based approach. In: International conference on computational science and its applications, Springer, p 492–507

  • Lawrie D, Morrell C, Feild H, Binkley D (2007) Effective identifier names for comprehension and memory. Innovations in Systems and Software Engineering 3(4):303–318

  • Li Z, Lu S, Myagmar S, Zhou Y (2006) CP-Miner: finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering 32(3):176–192

  • Manning CD, Raghavan P, Schutze H (2009) An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England

  • McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering 38(5):1069–1087

  • Miles S, Groth P, Munroe S, Moreau L (2011) Prime: a methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology (TOSEM) 20(3):1–42

  • Missier P, Belhajjame K, Cheney J (2013) The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of the 16th international conference on extending database technology, p 773–776

  • Missier P, Dey S, Belhajjame K, Vicenttín VC, Ludäscher B (2013) D-prov: extending the PROV provenance model with workflow structure. In: 5th USENIX workshop on the theory and practice of provenance (TaPP 13)

  • Moreau L, Clifford B, Freire J, Futrelle J, Gil Y, Groth P, Kwasnikowska N, Miles S, Missier P, Myers J et al (2011) The open provenance model core specification (v1.1). Future Generation Computer Systems 27(6):743–756

  • Nguyen S, Phan H, Le T, Nguyen TN (2020) Suggesting natural method names to check name consistencies. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, p 1372–1384

  • Ombredanne P (2020) Free and open source software license compliance: tools for software composition analysis. Computer 53(10):105–109

  • Ossher J, Sajnani H, Lopes C (2011) File cloning in open source Java projects: the good, the bad, and the ugly. In: 2011 27th IEEE international conference on software maintenance (ICSM), IEEE, p 283–292

  • Perez D, Chiba S (2019) Cross-language clone detection by learning over abstract syntax trees. In: 2019 IEEE/ACM 16th international conference on mining software repositories (MSR), IEEE, p 518–528

  • Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119

  • Pietri A, Spinellis D, Zacchiroli S (2019) The software heritage graph dataset: public software development under one roof. In: Proceedings of the 16th international conference on mining software repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, IEEE / ACM, p 138–142

  • Rosen L (2005) Open source licensing, volume 692. Prentice Hall

  • Rousseau G, Cosmo RD, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25:2930–2959

  • Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s School of Computing TR 541(115):64–68

  • Saini M, Verma R, Singh A, Chahal KK (2020) Investigating diversity and impact of the popularity metrics for ranking software packages. J Softw Evol Process 32(9)

  • Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: Proceedings of the 38th international conference on software engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, ACM, p 1157–1168

  • Securosis, L.L.C. (2021) Open source development and application security survey [online]. https://securosis.com/assets/library/reports/Securosis_OpenSourceSurvey_Analysis.pdf, Accessed 14 June 2021

  • Sheneamer A, Kalita J (2016) A survey of software clone detection techniques. International Journal of Computer Applications 137(10):1–21

  • Sneed HM (1996) Object-oriented COBOL recycling. In: Proceedings of WCRE’96: 3rd working conference on reverse engineering, IEEE, p 169–178

  • Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L. Rev. 2:191

  • Sun Y, German D, Zacchiroli S (2023) Dataset for “Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code”. https://doi.org/10.5281/zenodo.7637703, February 2023

  • Synopsys (2020) 2020 open source security and risk analysis report (OSSRA). Technical Report, Synopsys. Accessed 15 April 2020

  • Warintarawej P, Huchard M, Lafourcade M, Laurent A, Pompidor P (2015) Software understanding: automatic classification of software identifiers. Intelligent Data Analysis 19(4):761–778

  • Wendel H, Kunde M, Schreiber A (2010) Provenance of software development processes. In: International provenance and annotation workshop, Springer, p 59–63

  • Yuan Y, Guo Y (2012) Boreas: an accurate and scalable token-based approach to code clone detection. In: Proceedings of the 27th IEEE/ACM international conference on automated software engineering, p 286–289

  • Zimmermann T (2020) A first look at an emerging model of community organizations for the long-term maintenance of ecosystems’ packages. In: Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, ICSEW’20, New York, NY, USA, 2020, Association for Computing Machinery, p 711–718

Author information

Corresponding author

Correspondence to Daniel German.

Ethics declarations

Conflict of interests

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Communicated by: Janet Siegmund.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, Y., German, D. & Zacchiroli, S. Using the uniqueness of global identifiers to determine the provenance of Python software source code. Empir Software Eng 28, 107 (2023). https://doi.org/10.1007/s10664-023-10317-8

  • DOI: https://doi.org/10.1007/s10664-023-10317-8
