Empirical Software Engineering, Volume 18, Issue 6, pp 1195–1237

Software Bertillonage

Determining the provenance of software development artifacts
  • Julius Davies
  • Daniel M. German
  • Michael W. Godfrey
  • Abram Hindle


Deployed software systems are typically composed of many pieces, not all of which may have been created by the main development team. Often, the provenance of included components, such as external libraries or cloned source code, is not clearly stated, and this uncertainty can introduce technical and ethical concerns that make it difficult for system owners and other stakeholders to manage their software assets. In this work, we motivate the need to recover the provenance of software entities using a broad set of techniques that could include signature matching, source code fact extraction, software clone detection, call flow graph matching, string matching, and historical analyses. We liken our provenance goals to those of Bertillonage, a simple and approximate forensic analysis technique based on biometrics that was developed in 19th-century France before the advent of fingerprinting. As an example, we have developed a fast, simple, and approximate technique called anchored signature matching for identifying the source origin of binary libraries within a given Java application. This technique involves a type of structured signature matching performed against a database of candidates drawn from the Maven2 repository, a 275 GB collection of open source Java libraries. To show that the approach is both valid and effective, we conducted an empirical study on 945 jars from the Debian GNU/Linux distribution, as well as an industrial case study on 81 jars from an e-commerce application.


Keywords: Reuse; Provenance; Code evolution; Code fingerprints



We thank Dr. Anton Chuvakin of Security Warrior Consulting for his advice on PCI DSS.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Julius Davies (1)
  • Daniel M. German (2)
  • Michael W. Godfrey (3)
  • Abram Hindle (4)
  1. Department of Computer Science, University of British Columbia, Vancouver, Canada
  2. Department of Computer Science, University of Victoria, Victoria, Canada
  3. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
  4. Department of Computing Sciences, University of Alberta, Edmonton, Canada
