Using the uniqueness of global identifiers to determine the provenance of Python software source code

Sun, Yiming; German, Daniel; Zacchiroli, Stefano

doi:10.1007/s10664-023-10317-8

Using the uniqueness of global identifiers to determine the provenance of Python software source code

Published: 20 July 2023

Volume 28, article number 107, (2023)
Cite this article

Empirical Software Engineering Aims and scope Submit manuscript

241 Accesses
Explore all metrics

Abstract

We consider the problem of identifying the provenance of free/open source software (FOSS) and specifically the need of identifying where reused source code has been copied from. We propose a lightweight approach to solve the problem based on software identifiers—such as the names of variables, classes, and functions chosen by programmers. The proposed approach is able to efficiently narrow down to a small set of candidate origin products, to be further analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Packaging Index) open source ecosystem we find that globally defined identifiers are very distinct. Across PyPI’s 244 K packages we found 11.2 M different global identifiers (classes and method/function names—with only 0.6% of identifiers shared among the two types of entities); 76% of identifiers were used only in one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to a maximum of 3 products within 89% of the cases. We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages; this approach uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen python file of the subject software package, then ranks results using a popularity index and requires to inspect only the top result. In our experiments, this method is effective at finding the true origin of a project with a recall of 0.9 and precision of 0.77.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Defending Against Package Typosquatting

Software provenance tracking at the scale of public source code

Article 29 May 2020

Analysis of license inconsistency in large collections of open source projects

Article 03 December 2016

Data Availability

A replication package for this paper is available at https://doi.org/10.5281/zenedo.7637703 (Sun et al. 2023). This replication package containes all pairs (identifiers, product) found in PyPI

Notes

https://pypi.org/, accessed 2021-11-15
http://ctags.sourceforge.net/, retrieved 2022-09-22
Note that this copy might not have been done directly from the corpus; it is, however, a copy of the same entity that exists in the corpus.
Projects do not reside in PyPI, but PyPI links to their actual location.
https://pypi.org/project/pypi-simple/, accessed 2021-10-25
https://warehouse.pypa.io/api-reference/json.html, accessed 2021-10-25
https://ctags.io/, accessed 2021-10-25. Universal Ctags 0.0.0 (2015) derived from Exuberant Ctags 5.8.
At that time Debian Buster was already shipped as a “stable” release, so while it is possible that its content has changed since, modifications are expected to be minimal according to Debian release processes.
https://docs.libraries.io/overview.html#sourcerank, accessed 2021-12-07
Titled: “Scraping tripadvisor review, len container change, no such element Unable to locate element”, https://stackoverflow.com/questions/68878857, accessed 2022-01-16

References

Arnaoudova V, Eshkevari LM, Di Penta M, Oliveto R, Antoniol G, Guéhéneuc YG (2014) Repent: analyzing the nature of identifier renamings. IEEE Transactions on Software Engineering 40(5):502–532
Article Google Scholar
The Python Packaging Authority (2022) Packaging python projects. https://packaging.python.org/en/latest/tutorials/packaging-projects/ Sept 2022
Binkley D, Davis M, Lawrie D, Maletic JI, Morrell C, Sharif B (2013) The impact of identifier style on effort and comprehension. Empirical Software Engineering 18(2):219–276
Article Google Scholar
Bose RPJC, Phokela KK, Kaulgud V, Podder S (2019) Blinker: a blockchain-enabled framework for software provenance. In: 2019 26th Asia-pacific software engineering conference (APSEC) IEEE p 1–8
Butler G, Grogono P, Shinghal R, Tjandra I (1995) Retrieving information from data flow diagrams. In: Proceedings of 2nd working conference on reverse engineering IEEE p 22–29
Butt AS, Fitch P (2020) ProvONE+: a provenance model for scientific workflows. In: International conference on web information systems engineering, Springer, p 431–444
Caniell M, German DM (2017) Zacchiroli S (2017) The debsources dataset: two decades of free and open source software. Empirical Software Engineering 22:1405–1437
Article Google Scholar
Caprile B, Tonella P (2000) Restructuring program identifier names In icsm, p 97–107
Cordy JR, Roy CK (2011) The NiCad clone detector. In: The 19th IEEE international conference on program comprehension, icpc 2011, kingston, on, canada, June 22-24, 2011, IEEE Computer Society, p 219–220
Cosmo RD, Zacchiroli S (2017) Software heritage: why and how to preserve software source code. In: Proceedings of the 14th international conference on digital preservation, iPRES 2017, Kyoto, Japan, September 25-29, 2017
Dang YB, Cheng P, Luo L, Cho A (2008) A code provenance management tool for ip-aware software development. In: Companion of the 30th international conference on software engineering, p 975–976
Davies J, German Dm, Godfrey MW, Hindle A (2011) Software bertillonage: Finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, p 183–192
Davies J, German DM, Godfrey MW, Hindle A (2013) Software bertillonage. Empirical Software Engineering 18(6):1195–1237
Article Google Scholar
Deissenboeck F, Pizka M (2006) Concise and consistent naming. Software Quality Journal 14(3):261–282
Article Google Scholar
Penta MD, German DM, Antoniol G (2010) Identifying licensing of jar archives using a code-search approach. In: 2010 7th IEEE working conference on mining software repositories (MSR 2010), IEEE, p 151–160
Gabel M, Su Z (2010) A study of the uniqueness of source code. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on foundations of software engineering, p 147–156
Gautam P, Saini H (2016) Various code clone detection techniques and tools: a comprehensive survey. In: International conference on smart trends for information technology and computer communications, Springer, p 655–667
Gharehyazie M, Ray B, Filkov V (2017) Some from here, some from there: cross-project code reuse in github. In: 2017 IEEE/ACM 14th international conference on mining software repositories (MSR), IEEE, p 291–301
Godfrey MW (2015) Understanding software artifact provenance. Science of Computer Programming 97:86–90
Article Google Scholar
Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Transactions on Software Engineering 31(2):166–181
Article Google Scholar
Gupta A, Suri B (2018) A survey on code clone, its behavior and applications. In: Networking communication and data knowledge engineering, Springer, p 27–39
Harutyunyan N (2020) Managing your open source supply chain-why and how? Computer 53(6):77–81
Article Google Scholar
Hofmeister J, Siegmund J, Holt DV (2017) Shorter identifier names take longer to comprehend. In: 2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER), IEEE, p 217–227
Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28(7):654–670
Article Google Scholar
Kapdan M, Aktas M, Yigit M (2014) On the structural code clone detection problem: a survey and software metric based approach. In: International conference on computational science and its applications, Springer, p 492–507
Lawrie D, Morrell C, Feild H, Binkley D (2007) Effective identifier names for comprehension and memory. Innovations in Systems and Software Engineering 3(4):303–318
Article Google Scholar
Li Z, Lu S, Myagmar S, Zhou Y (2006) Cp-miner: finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering 32(3):176–192
Article Google Scholar
Manning CD, Raghavan P, Schutze H (2009) An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England
MATH Google Scholar
McMillan C, Grechanik M, Poshyvanyk D, Fu C, Xie Q (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans Software Eng 38(5):1069–1087
Article Google Scholar
Miles S, Groth P, Munroe S, Moreau L (2011) Prime: a methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology (TOSEM) 20(3):1–42
Article Google Scholar
Missier P, Belhajjame K, Cheney J (2013) The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of the 16th international conference on extending database technology, p 773–776
Missier P, Dey S, Belhajjame K, Vicenttín VC, Ludäscher B (2013) D-prov: extending the PROV provenance model with workflow structure. In: 5th USENIX workshop on the theory and practice of provenance (TaPP 13)
Moreau L, Clifford B, Freire J, Futrelle J, Gil Y, Groth P, Kwasnikowska N, Miles S, Missier P, Myers J et al (2011) The open provenance model core specification (v1. 1). Future Generation Computer Systems 27(6):743–756
Article Google Scholar
Nguyen S, Phan H, Le T, Nguyen TN (2020) Suggesting natural method names to check name consistencies. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, p 1372–1384
Ombredanne Philippe (2020) Free and open source software license compliance: Tools for software composition analysis. Computer 53(10):105–109
Article Google Scholar
Ossher J, Sajnani H, Lopes C (2011) File cloning in open source java projects: the good, the bad, and the ugly. In: 2011 27th IEEE international conference on software maintenance (ICSM),IEEE, p 283–292
Perez D, Chiba S (2019) Cross-language clone detection by learning over abstract syntax trees. In: 2019 IEEE/ACM 16th international conference on mining software repositories (MSR), IEEE, p 518–528
Phipps S, Zacchiroli S (2020) Continuous open source license compliance. Computer 53(12):115–119
Article Google Scholar
Pietri A, Spinellis D, Zacchiroli S (2019) The software heritage graph dataset: public software development under one roof. In: Proceedings of the 16th international conference on mining software repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, IEEE / ACM, p 138–142
Rosen L (2005) Open source licensing, volume 692. Prentice hall
Rousseau G, Cosmo RD, Zacchiroli S (2020) Software provenance tracking at the scale of public source code. Empirical Software Engineering 25:2930–2959
Article Google Scholar
Roy CK, Cordy JR (2007) A survey on software clone detection research. Queen’s School of Computing TR 541(115):64–68
Google Scholar
Saini M, Verma R, Singh A, Chahal KK (2020) Investigating diversity and impact of the popularity metrics for ranking software packages. J. Softw Evol Process, 32(9)
Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: Proceedings of the 38th international conference on software engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, ACM, p 1157–1168
Securosis, L.L.C. (2021)Open source development and application security survey [online]. https://securosis.com/assets/library/reports/Securosis_OpenSourceSurvey_Analysis.pdf, Accessed 14 June 2021
Sheneamer A, Kalita J (2016) A survey of software clone detection techniques. International Journal of Computer Applications 137(10):1–21
Article Google Scholar
Sneed HM (1996) Object-oriented cobol recycling. In: Proceedings of WCRE’96: 4rd working conference on reverse engineering, IEEE, p 169–178
Stewart K, Odence P, Rockett E (2010) Software package data exchange (SPDX) specification. IFOSS L. Rev. 2:191
Article Google Scholar
Sun Y, German D, Zacchiroli (2023) Dataset for ”Using the Uniqueness of Global Identifiers to Determine the Provenance of Phyton Software Source Code" https://doi.org/10.5281/zenedo.7637703 February 2023
Synopsys (2020) 2020 open source security and risk analysis report (OSSRA).Technical Report, Synopsys. Accessed 15 April 2020
Warintarawej P, Huchard M, Lafourcade M, Laurent A, Pompidor P (2015) Software understanding: automatic classification of software identifiers. Intelligent Data Analysis 19(4):761–778
Article Google Scholar
Wendel H, Kunde M, Schreiber A (2010) Provenance of software development processes. In: International provenance and annotation workshop, Springer, p 59–63
Yuan Y, Guo Y (2012) Boreas: an accurate and scalable token-based approach to code clone detection. In: Proceedings of the 27th IEEE/ACM international conference on automated software engineering, p 286–289
Zimmermann T (2020) A first look at an emerging model of community organizations for the long-term maintenance of ecosystems’ packages. In: Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, ICSEW’20, New York, NY, USA, 2020. association for computing machinery. p 711-718

Download references

Author information

Authors and Affiliations

University of Victoria, Victoria, Canada
Yiming Sun & Daniel German
LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France
Stefano Zacchiroli

Authors

Yiming Sun
View author publications
You can also search for this author in PubMed Google Scholar
Daniel German
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Zacchiroli
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Daniel German.

Ethics declarations

Conflict of interests

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Communicated by: Janet Siegmund.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Sun, Y., German, D. & Zacchiroli, S. Using the uniqueness of global identifiers to determine the provenance of Python software source code. Empir Software Eng 28, 107 (2023). https://doi.org/10.1007/s10664-023-10317-8

Download citation

Accepted: 01 March 2023
Published: 20 July 2023
DOI: https://doi.org/10.1007/s10664-023-10317-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Using the uniqueness of global identifiers to determine the provenance of Python software source code

Abstract

Access this article

Similar content being viewed by others

Defending Against Package Typosquatting

Software provenance tracking at the scale of public source code

Analysis of license inconsistency in large collections of open source projects

Data Availability

Notes

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Using the uniqueness of global identifiers to determine the provenance of Python software source code

Abstract

Access this article

Similar content being viewed by others

Defending Against Package Typosquatting

Software provenance tracking at the scale of public source code

Analysis of license inconsistency in large collections of open source projects

Data Availability

Notes

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation