Abstract

The quantitative analysis of software projects can provide insights that let us better understand open source and other software development projects. An important variable in such analyses is the commit size, the amount of work contributed in a single commit. Unfortunately, the commit size cannot be measured after the fact; it can only be estimated. This paper presents several algorithms for estimating the commit size. Our performance evaluation shows that simple, straightforward heuristics are superior to the more complex text-analysis-based algorithms. Not only are the heuristics significantly faster to compute, they also deliver more accurate estimates of commit sizes. Based on this experience, we design and present an algorithm that improves on the heuristics, can be computed equally fast, and is more accurate than any of the prior approaches.
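To illustrate why the commit size can only be estimated, the following minimal Python sketch bounds it from the output of a line-based diff. It is not one of the algorithms evaluated in the paper. It assumes that the commit size counts every added, removed, or changed line once, and that a line-based diff reports a changed line as one removed line plus one added line. The function names `count_added_removed` and `commit_size_bounds` are hypothetical and chosen for this illustration only.

```python
import difflib


def count_added_removed(old_lines, new_lines):
    """Count the lines a line-based diff reports as added and as removed."""
    added = removed = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return added, removed


def commit_size_bounds(added, removed):
    """Return (lower, upper) bounds on the commit size.

    Assumption: a changed line shows up as one removed plus one added line,
    so between 0 and min(added, removed) of the reported pairs are changes.
    If all possible pairs are changes, the size is max(added, removed);
    if none are, the size is added + removed.
    """
    return max(added, removed), added + removed


old = ["x = 1", "y = 2", "print(x)"]
new = ["x = 1", "y = 3", "print(x)", "print(y)"]
added, removed = count_added_removed(old, new)
lower, upper = commit_size_bounds(added, removed)
print(f"added={added}, removed={removed}, size in [{lower}, {upper}]")
```

In this example one line was changed and one was added, so the true size under the stated assumption is 2, while the diff alone only narrows it to the interval between the two bounds. Any estimator has to pick a value in that interval.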

Copyright information

© IFIP International Federation for Information Processing 2009

Authors and Affiliations

  • Philipp Hofmann (1)
  • Dirk Riehle (1)
  1. SAP Research, SAP Labs LLC, Palo Alto, USA