Cliff Walls: An Analysis of Monolithic Commits Using Latent Dirichlet Allocation

  • Landon J. Pratt
  • Alexander C. MacLean
  • Charles D. Knutson
  • Eric K. Ringger
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 365)


Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. However, there are still many challenges that can affect the quality of artifact-based studies, especially those studies examining software evolution. Large commits, which we refer to as “Cliff Walls,” are one significant threat to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits. We also found that corrective maintenance tasks, such as bug fixes, did not play a significant role in the creation of large commits.
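As a rough illustration of the topic-extraction step described above, the following is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, using only the Python standard library. It is not the authors' implementation: the toy commit messages, the function names `lda_gibbs` and `top_words`, the topic count, and the hyperparameters `alpha` and `beta` are all assumptions chosen for demonstration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    alpha and beta are symmetric Dirichlet priors (illustrative values).
    Returns (vocab, doc_topic counts, topic_word counts).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    topics = list(range(num_topics))
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n]]
        for row in topic_word
    ]


# Made-up commit log messages, for illustration only (not data from the study).
messages = [
    "merge branch release into trunk",
    "merge trunk changes into branch",
    "import initial code from vendor",
    "import vendor library code",
    "regenerate javadoc documentation",
    "update generated documentation javadoc",
]
docs = [message.split() for message in messages]
vocab, doc_topic, topic_word = lda_gibbs(docs, num_topics=3)

for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {' '.join(words)}")
```

In a workflow like the one the abstract outlines, the per-document counts in `doc_topic` would indicate which topic dominates each large commit's log message. On a corpus this small the recovered topics are noisy and depend on the seed, so the printed output should be read only as a demonstration of the mechanics.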


Keywords (added by machine, not by the authors): Latent Dirichlet Allocation; Version Control; Open Source Software Project; Version Control System; Prevalent Topic



Copyright information

© IFIP International Federation for Information Processing 2011

Authors and Affiliations

  • Landon J. Pratt (1)
  • Alexander C. MacLean (1)
  • Charles D. Knutson (1)
  • Eric K. Ringger (1)

  1. Computer Science Department, Brigham Young University, Provo, USA
