A Model of the Commit Size Distribution of Open Source

  • Carsten Kolassa
  • Dirk Riehle
  • Michel A. Salim
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7741)


A fundamental unit of work in programming is the code contribution (“commit”) that a developer makes to the code base of the project they are working on. We use statistical methods to derive a model of the probability distribution of commit sizes in open source projects, and we show that the model is applicable to projects of different sizes. We use both graphical and statistical methods to validate the goodness of fit of our model. By measuring and modeling a fundamental dimension of programming, we help improve software development tools and our understanding of software development.


Open Source, Probability Distribution Function, Generalized Pareto Distribution, Chunk Size, Project Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
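The abstract does not spell out the model, but the keyword list names the generalized Pareto distribution, and the abstract mentions statistical goodness-of-fit checks. The sketch below illustrates what such a check could look like, assuming a generalized Pareto model and a Kolmogorov-Smirnov statistic. The function names, the parameter values (SHAPE, SCALE), and the synthetic "commit sizes" are all hypothetical and are not taken from the paper.

```python
import math
import random


def gpd_cdf(x, shape, scale, loc=0.0):
    """CDF of the generalized Pareto distribution."""
    z = (x - loc) / scale
    if z <= 0:
        return 0.0
    if shape == 0:
        # Limiting case: exponential distribution.
        return 1.0 - math.exp(-z)
    base = 1.0 + shape * z
    if base <= 0:
        # Only reachable for shape < 0: x lies beyond the upper endpoint.
        return 1.0
    return 1.0 - base ** (-1.0 / shape)


def gpd_sample(rng, shape, scale, loc=0.0):
    """Draw one value by inverting the CDF (inverse transform sampling)."""
    u = rng.random()  # uniform on [0, 1)
    if shape == 0:
        return loc - scale * math.log(1.0 - u)
    return loc + scale * ((1.0 - u) ** (-shape) - 1.0) / shape


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the model with the empirical step just after and just before x.
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


# Hypothetical parameters and synthetic data, for illustration only.
SHAPE, SCALE = 0.5, 10.0
rng = random.Random(42)
sizes = [gpd_sample(rng, SHAPE, SCALE) for _ in range(5000)]

d = ks_statistic(sizes, lambda x: gpd_cdf(x, SHAPE, SCALE))
print(f"KS statistic against the generating model: {d:.4f}")
```

A small statistic means the empirical distribution stays close to the model everywhere. With real commit data the parameters would first have to be estimated, which this sketch leaves out.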




Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Carsten Kolassa (1)
  • Dirk Riehle (2)
  • Michel A. Salim (2)

  1. RWTH Aachen, Germany
  2. Friedrich-Alexander-University Erlangen-Nürnberg, Germany
