Empirical Software Engineering, Volume 19, Issue 4, pp 885–925

Conducting quantitative software engineering studies with Alitheia Core

  • Georgios Gousios
  • Diomidis Spinellis


Quantitative empirical software engineering research benefits greatly from processing large open source software repository data sets. However, the diversity of repository management tools and the long history of some projects render working with those data sets a tedious and error-prone exercise. The Alitheia Core analysis platform preprocesses repository data into an intermediate format on which researchers can run custom analysis tools. Alitheia Core automatically distributes the processing load across multiple processors while providing programmatic access to the raw data, the metadata, and the analysis results. The tool has been successfully applied to hundreds of medium to large open-source projects, enabling large-scale empirical studies.
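The workflow the abstract describes — preprocess heterogeneous repository data into a uniform intermediate format, then fan researcher-supplied metric plug-ins out over it in parallel — can be sketched in a few lines. This is an illustrative sketch only, not the actual Alitheia Core API (which is Java/OSGi-based); the record schema and metric names below are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical intermediate format: one uniform record per repository
# revision, regardless of which version control tool produced it.
REVISIONS = [
    {"id": 1, "files_changed": 3, "lines_added": 120, "lines_removed": 10},
    {"id": 2, "files_changed": 1, "lines_added": 15,  "lines_removed": 40},
]

# Researcher-supplied "plug-ins": each maps a revision record to a value.
METRICS = {
    "churn": lambda r: r["lines_added"] + r["lines_removed"],
    "touch": lambda r: r["files_changed"],
}

def analyze(revisions, metrics, workers=4):
    """Apply every metric to every revision, spreading work over a pool."""
    def run(rev):
        return rev["id"], {name: f(rev) for name, f in metrics.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, revisions))

results = analyze(REVISIONS, METRICS)
print(results[1]["churn"])  # → 130
```

Because metrics only see the intermediate records, the same plug-in runs unchanged against projects hosted in different repository systems — the design property the platform relies on for large-scale studies.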


Keywords: Quantitative software engineering · Software repository mining


Acknowledgments and Availability

The authors would like to thank the anonymous reviewers for their many excellent suggestions for improving this paper and Alitheia Core. Mirko Böhm and Vassileios Karakoidas have contributed code and ideas over the years that helped improve Alitheia Core. Panos Louridas, in addition to several contributions, implemented the Java and Python parsers. Martin Pinzger made Evolizer available for evaluation.

The full source code for Alitheia Core is available at:
A snapshot of the data used to conduct the case study presented in this work can be found at:
The raw project repository mirrors in the format Alitheia Core expects can be found at:

This research has been co-financed by the European Union (European Social Fund, ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF), Research Funding Program Thalis: Athens University of Economics and Business, Software Engineering Research Platform.


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Software and Computer Technology, Delft University of Technology, Delft, Netherlands
  2. Department of Management Science and Technology, Athens University of Economics and Business, Athens, Greece