From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in Open Source Software projects

Abstract

Complex software development projects rely on the contribution of teams of developers, who are required to collaborate and coordinate their efforts. The productivity of such development teams, i.e., how their size is related to the produced output, is an important consideration for project and schedule management as well as for cost estimation. The majority of studies in empirical software engineering suggest that - due to coordination overhead - teams of collaborating developers become less productive as they grow in size. This phenomenon is commonly paraphrased as Brooks’ law of software project management, which states that “adding manpower to a software project makes it later”. Outside software engineering, the non-additive scaling of productivity in teams is often referred to as the Ringelmann effect, which is studied extensively in social psychology and organizational theory. Conversely, a recent study suggested that in Open Source Software (OSS) projects, the productivity of developers increases as the team grows in size. Attributing it to collective synergetic effects, this surprising finding was linked to the Aristotelian quote that “the whole is more than the sum of its parts”. Using a data set of 58 OSS projects with more than 580,000 commits contributed by more than 30,000 developers, in this article we provide a large-scale analysis of the relation between size and productivity of software development teams. Our findings confirm the negative relation between team size and productivity previously suggested by empirical software engineering research, thus providing quantitative evidence for the presence of a strong Ringelmann effect. Using fine-grained data on the association between developers and source code files, we investigate possible explanations for the observed relations between team size and productivity. 
In particular, we take a network perspective on developer-code associations in software development teams and show that the magnitude of the decrease in productivity is likely to be related to the growth dynamics of co-editing networks which can be interpreted as a first-order approximation of coordination requirements.



Notes

  1. See https://developer.github.com/v3/

  2. Notably, we avoid a bias towards large inactivity times by not taking into account developers who committed only once.

  3. Please note the log scale on the x-axis.

  4. Effectively, we accept a less than 10 % chance of falsely excluding from the team at time t a developer who eventually commits after more than 295 days of inactivity.

  5. This is commonly due to the initial stages of a project, when a relatively large code base is submitted with the first few commits.

  6. Note that even though a regression model may produce a statistically significant relation between these two measures, due to the large variation, such a result should not be mistaken as evidence that the mean commit contribution can be replaced by the number of commits.

  7. All logarithms are in base 10.

  8. Please note that we avoid regressing binned averages of log〈c〉 (or log〈n〉), as this would reduce the high variability in the data and would thus yield a spuriously large value of r².

  9. Notably, all of the 15 large projects exhibit negative coefficients, which indicates the presence of the Ringelmann effect.

  10. See https://github.com/zendframework/zf1/wiki/Contributing-to-Zend-Framework-1

  11. Developers who use Git input this basic information in the configuration of their Git clients.

  12. The differences pertain to the hashes of the corresponding physical files, which are irrelevant for us.

  13. Note that this method is valid also when changes are merged from other local branches, and not from forked repositories. The corresponding merge commit still contains parent pointers linking it to these branches.

  14. Note that due to the sensitivity of the Levenshtein index, it varies across several orders of magnitude regardless of the project size, hence a log transformation of 〈c′〉 is justified.

  15. See (1) and (2).

  16. I.e., to predict the productivity of a team of arbitrary size given the productivity of a team of a certain size.

  17. See Section 5.2 for an introduction of the NMI.

  18. Statistical significance is judged by the bootstrap approach presented in Section 5.2.

  19. \(\hat {\alpha }\) refers to the slope of a bootstrap sample, whereas α is the slope of the original regression model.

  20. In particular, C B added 10 new lines, together with other changes in the range [10–40].

References

  1. Adams P, Capiluppi A, Boldyreff C (2009) Coordination and productivity issues in free software: the role of Brooks' law. In: IEEE international conference on software maintenance, ICSM 2009. doi:10.1109/ICSM.2009.5306308, pp 319–328

  2. Alali A, Kagdi H, Maletic J (2008) What’s a typical commit? a characterization of open source software repositories. In: The 16th IEEE International Conference on program comprehension. doi:10.1109/ICPC.2008.24. ICPC 2008, pp 182–191

  3. Albrecht AJ (1979) Measuring application development productivity. In: Proceedings of the joint SHARE, GUIDE and IBM application development symposium, pp 83–92

  4. Arafat O, Riehle D (2009) The commit size distribution of open source software. In: 42nd Hawaii International Conference on system sciences, HICSS’09. doi:10.1109/HICSS.2009.421, pp 1–8

  5. Banker RD, Kauffman RJ (2004) 50th anniversary article: The evolution of research on information systems: a fiftieth-year survey of the literature in management science. Manag Sci 50(3):281–298. doi:10.1287/mnsc.1040.0206


  6. Banker R D, Kemerer C F (1989) Scale economies in new software development. IEEE Trans Softw Eng 15(10):1199–1205


  7. Banker R D, Slaughter S A (1997) A field study of scale economies in software maintenance. Manag Sci 43(12):1709–1725


  8. Banker R D, Chang H, Kemerer C F (1994) Evidence on economies of scale in software development. Inf Softw Technol 36(5):275–282


  9. Blackburn J D, Scudder G D, Van Wassenhove L N (1996) Improving speed and productivity of software development: a global survey of software developers. IEEE Trans Softw Eng 22(12):875–885


  10. Blincoe K, Valetto G, Goggins S (2012) Proximity: a measure to quantify the need for developers’ coordination. In: Proceedings of the ACM 2012 conference on computer supported cooperative work, CSCW ’12. doi:10.1145/2145204.2145406. ACM, New York, pp 1351–1360

  11. Blincoe K, Valetto G, Damian D (2013) Do all task dependencies require coordination? the role of task properties in identifying critical coordination needs in software projects. In: Proceedings of the 2013 9th joint meeting on foundations of software engineering, ESEC/FSE 2013. ACM, New York. doi:10.1145/2491411.2491440, pp 213–223

  12. Blincoe KC (2014) Timely and efficient facilitation of coordination of software developers' activities. PhD thesis, Drexel University, Philadelphia. AAI3613734


  13. Boehm B W (1984) Software engineering economics. IEEE Trans Software Eng 10(1):4–21. doi:10.1109/TSE.1984.5010193


  14. Boehm BW, Clark H, Brown R, Chulani MR, Steece B (2000) Software cost estimation with Cocomo II with Cdrom, 1st edn. Prentice Hall PTR, Upper Saddle River


  15. Brooks FP (1975) The mythical man-month. Addison-Wesley

  16. Cataldo M, Herbsleb J (2013) Coordination breakdowns and their impact on development productivity and software failures. IEEE Trans Softw Eng 39(3):343–360. doi:10.1109/TSE.2012.32


  17. Cataldo M, Wagstrom PA, Herbsleb JD, Carley KM (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of the 2006 20th anniversary conference on computer supported cooperative work, CSCW ’06. doi:10.1145/1180875.1180929. ACM, New York, pp 353–362

  18. Cataldo M, Herbsleb JD, Carley KM (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: Proceedings of the second ACM-IEEE international symposium on empirical software engineering and measurement, ESEM ’08. doi:10.1145/1414004.1414008. ACM, New York, pp 2–11

  19. Chidambaram L, Tung L L (2005) Is out of sight, out of mind? An empirical study of social loafing in technology-supported groups. Inf Syst Res 16(2):149–168. doi:10.1287/isre.1050.0051


  20. Comstock C, Jiang Z, Davies J (2011) Economies and diseconomies of scale in software development. J Softw Maint Evol Res Pract 23(8):533–548. doi:10.1002/smr.526


  21. Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in GitHub: transparency and collaboration in an open software repository. In: Proceedings of the ACM 2012 conference on computer supported cooperative work, CSCW '12. doi:10.1145/2145204.2145396. ACM, New York, pp 1277–1286

  22. Earley PC (1989) Social loafing and collectivism: a comparison of the United States and the People's Republic of China. Adm Sci Q 565–581

  23. German DM (2006) A study of the contributors of PostgreSQL. In: Proceedings of the 2006 international workshop on mining software repositories. ACM, pp 163–164

  24. Gousios G, Kalliamvakou E, Spinellis D (2008) Measuring developer contribution from software repository data. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. doi:10.1145/1370750.1370781 . ACM, New York, pp 129–132

  25. Gousios G, Vasilescu B, Serebrenik A, Zaidman A (2014) Lean GHTorrent: GitHub data on demand. In: Proceedings of the 11th working conference on mining software repositories, MSR 2014. doi:10.1145/2597073.2597126. ACM, New York, pp 384–387

  26. Harison E, Koski H (2008) Does open innovation foster productivity? Evidence from open source software (OSS) firms. Tech. rep., ETLA discussion paper

  27. Hindle A, German DM, Holt R (2008) What do large commits tell us?: a taxonomical study of large commits. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. doi:10.1145/1370750.1370773. ACM, New York, pp 99–108

  28. Hofmann P, Riehle D (2009) Estimating commit sizes efficiently. In: Boldyreff C, Crowston K, Lundell B, Wasserman A (eds) Open source ecosystems: diverse communities interacting, IFIP advances in information and communication technology, vol 299. doi:10.1007/978-3-642-02032-2_11. Springer, Berlin, pp 105–115

  29. Ingham AG, Levinger G, Graves J, Peckham V (1974) The ringelmann effect: Studies of group size and group performance. J Exp Soc Psychol 10(4):371–384. doi:10.1016/0022-1031(74)90033-X


  30. Jackson J M, Harkins S G (1985) Equity in effort: an explanation of the social loafing effect. J Pers Soc Psychol 49(5):1199


  31. Kalliamvakou E, Gousios G, Blincoe K, Singer L, German D M, Damian D (2014) The promises and perils of mining GitHub. In: Proceedings of the 11th working conference on mining software repositories, MSR 2014. doi:10.1145/2597073.2597074. ACM, New York, pp 92–101

  32. Karau S J, Williams K D (1993) Social loafing: a meta-analytic review and theoretical integration. J Personal Soc Psychol 65(4):681


  33. Karau S J, Williams K D (1995) Social loafing: research findings, implications, and future directions. Curr Dir Psychol Sci 4(5):134–140


  34. Koenker R (1981) A note on studentizing a test for heteroskedasticity. J Econ 17(1):107–112


  35. Kravitz D A, Martin B (1986) Ringelmann rediscovered: the original article. J Pers Soc Psychol 50(5):936–941


  36. Latane B, Williams K, Harkins S (1979) Many hands make light the work: The causes and consequences of social loafing. J Pers Soc Psychol 37(6):822


  37. Lerner J, Tirole J (2002) Some simple economics of open source. J Ind Econ 50(2):197–234. doi:10.1111/1467-6451.00174


  38. Levenshtein V I (1966) Binary codes capable of correcting deletions, insertions and reversals. In: Soviet physics doklady, vol 10, p 707

  39. Lin M, Lucas H, Shmueli G (2013) Research commentary - too big to fail: large samples and the p-value problem. Inf Syst Res 24(4):906–917


  40. Maxwell K, Van Wassenhove L, Dutta S (1996) Software development productivity of european space, military, and industrial applications. IEEE Trans Softw Eng 22(10):706–718 . doi:10.1109/32.544349


  41. Mockus A, Fielding RT, Herbsleb J (2000) A case study of open source software development: the Apache server. In: Proceedings of the 22nd international conference on software engineering, ICSE '00. doi:10.1145/337180.337209. ACM, New York, pp 263–272

  42. Mockus A, Fielding R T, Herbsleb J D (2002) Two case studies of open source software development: Apache and Mozilla. ACM Trans Softw Eng Methodol 11(3):309–346. doi:10.1145/567793.567795


  43. Paiva E, Barbosa D, Roberto Lima J, Albuquerque A (2010) Factors that influence the productivity of software developers in a developer view. In: Sobh T, Elleithy K (eds) Innovations in computing sciences and software engineering. doi:10.1007/978-90-481-9112-3_17. Springer, Netherlands, pp 99–104

  44. Premraj R, Shepperd M, Kitchenham B, Forselius P (2005) An empirical analysis of software productivity over time. In: 11th IEEE International Symposium software metrics, 2005. doi:10.1109/METRICS.2005.8

  45. Ringelmann M (1913) Recherches sur les moteurs animes: Travail de l’homme. Annales de l’Institut National Agronomique 12(1):1–40


  46. Robles G, Koch S, González-Barahona JM (2004) Remote analysis and measurement of libre software systems by means of the cvsanaly tool. In: 2nd ICSE workshop on remote analysis and measurement of software systems (RAMSS), pp 51–55

  47. Scholtes I, Mavrodiev P, Schweitzer F (2015) From Aristotle to Ringelmann (dataset)

  48. Shepperd J (1993) Productivity loss in performance groups - a motivation analysis. Psychol Bull 113(1):67–81. doi:10.1037/0033-2909.113.1.67


  49. Shiue Y C, Chiu C M, Chang C C (2010) Exploring and mitigating social loafing in online communities. Comput Hum Behav 26(4):768–777. doi:10.1016/j.chb.2010.01.014


  50. Sornette D, Maillart T, Ghezzi G (2014) How much is the whole really more than the sum of its parts? 1 + 1 = 2.5: superlinear productivity in collective group actions. PLoS ONE 9(8):e103023. doi:10.1371/journal.pone.0103023


  51. Steiner I D (1972) Group process and productivity. Social psychology monographs. Academic

  52. Stigler G J (1958) The economies of scale. J Law Econ 1:54


  53. von Krogh G, Spaeth S, Lakhani K R (2003) Community, joining, and specialization in open source software innovation: a case study. Res Policy 32(7):1217–1241. doi:10.1016/S0048-7333(03)00050-7


  54. Wagner J (1995) Studies of individualism-collectivism - effects on cooperation in groups. Acad Manag J 38(1):152–172


  55. Williams K, Karau S (1991) Social loafing and social compensation - the effects of expectations of coworker performance. J Pers Soc Psychol 61(4):570–581. doi:10.1037/0022-3514.61.4.570


  56. Williams K, Harkins S, Latane B (1981) Identifiability as a deterrent to social loafing - 2 cheering experiments. J Personal Soc Psychol 40(2):303–311. doi:10.1037/0022-3514.40.2.303


  57. Wolf T, Schroter A, Damian D, Panjer L D, Nguyen T H (2009) Mining task-based social networks to explore collaboration in software teams. IEEE Software 26(1):58–66. doi:10.1109/MS.2009.16


  58. Yetton P, Bottger P (1983) The relationships among group size, member ability, social decision schemes, and performance. Organ Behav Hum Perform 32(2):145–159. doi:10.1016/0030-5073(83)90144-7



Acknowledgments

Ingo Scholtes and Frank Schweitzer acknowledge support from the Swiss National Science Foundation (SNF), grant number CR31I1_140644/1.

Author information

Corresponding author

Correspondence to Ingo Scholtes.

Additional information

Communicated by: Burak Turhan

Appendix A

A1 Data Set

Table 3 summarises the 58 projects in our data set. For each project we show the project name and programming language, the time span of the data retrieved, as indicated by the times of the first and last retrieved commits, the total number of commits and the total number of unique developers during the analysed time span.

Table 3 Summary of the 58 OSS projects in our data set

A2 GitHub Data Set

For each project in our data set we queried the GitHub API with the query https://api.github.com/repos/<owner>/<repo>/commits?page=<n>, where <owner> is a GitHub user account and <repo> is the name of a project repository belonging to this user. The query returns a paginated JSON list of the 30 most recent commits in the master branch of the project. By varying the parameter <n>, we control the pagination and can trace back the commit history until the very first commit.

Each element in the JSON list represents a commit with all Git-relevant information (see Section 3.1). More specifically, it contains the names and email addresses of both the author and the committer.Footnote 11 The author is the person who authored the code in the commit, and the committer is the one with write permissions in the repository who merged the commit into the project code base. These two identities may differ when pull requests are considered, as the developers requesting the pull typically do not have write access. Since we quantify contributions in terms of the amount of code written, we take the author email from the commit data as a unique identifier for individual developers. In cases where the author email is empty, we conservatively skip the commit.
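The extraction of the author identifier from one element of this JSON list can be sketched as follows. The sample payload is a hypothetical, abbreviated stand-in for a real API response; `author_email` is our own helper name, not part of the GitHub API:

```python
import json

def author_email(commit_json):
    """Return the author email used as a developer identifier, or None.

    Follows the rule described above: commits with an empty author
    email are conservatively skipped.
    """
    email = commit_json.get("commit", {}).get("author", {}).get("email", "")
    return email or None

# A minimal stand-in for one element of the paginated JSON list
# returned by /repos/<owner>/<repo>/commits (fields abbreviated).
sample = json.loads("""
{
  "sha": "abc123",
  "commit": {
    "author":    {"name": "Alice", "email": "alice@example.org"},
    "committer": {"name": "Bob",   "email": "bob@example.org"}
  },
  "parents": [{"sha": "def456"}]
}
""")

print(author_email(sample))  # alice@example.org
```

Note that the author and committer fields differ here, as they would for a merged pull request.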

The commit SHA contained in the JSON list can be used to execute a commit-specific query in the GitHub API of the form:

https://api.github.com/repos/<owner>/<repo>/commits/<SHA>

The result is again a JSON list which provides detailed information about the list and diffs of all files changed in the commit. We retrieve this additional information and use it to (i) quantify the precise contribution to the source code at the level of individual characters and (ii) construct the time-varying coordination networks of developers who have co-edited files (see Section 5.1).

A2.1 Merged Pull Requests

Upon merging a pull request, typically through the GitHub interface, the commit tree of the project is modified by including a special merge commit. The basics of this process are illustrated in Fig. 13.

Fig. 13

Simplified illustration of merging a pull request. A potential contributor forks the main branch of a project (light blue) into his/her own local repository (green). After some activity in both repositories a pull request is created and merged as indicated by the dashed arrow. This results in two commits - C5’ and C6 - in the main branch. The merge commit C6 (dark blue) has two parent links and should be excluded

In this example, a potential contributor forks the main branch after the second commit. Subsequent changes are then made to the master branch and to the remote repository, represented by commits C4 and C5 respectively. After C5, the potential contributor creates a pull request asking for the changes in C5 to be incorporated into the main code base. Assuming the pull request is approved and no conflicts exist, C5 is merged by creating two commits - C5' and C6. C5' is almost identical to C5 in that it has the same author and committer fields as well as diffs.Footnote 12 C6 is a special merge commit that contains the same diffs as C5 and C5', but differs in the author and committer information. The author and committer in C6 are those of the maintainer who approved and merged the pull request, and not those of the developer who originally wrote the code in C5 and C5'. Therefore, including commit C6 in the analysis would wrongly attribute the contained diff to the maintainer and inflate his/her contribution in terms of code written.

We deal with this problem by noticing that merge commits always have at least two parent pointers - one to the replicated commit from the forked repository, and one to the last commit in the main branch. In some cases, when changes are merged from more than one remote branch, the merge commit will have a parent pointer to each of these remotes. Since the parent pointers are also available in our data set, we exclude all commits that have two or more parent pointers.Footnote 13

An additional complication is that Git also allows integrating changes by so-called rebasing. Different from pull requests, which generate a merge commit, in rebasing all changes are applied on top of the last commit of the branch being rebased into. The result is a single commit with only one parent link that is added at the end of the rebased branch and that incorporates these changes. Since we cannot distinguish the developer who rebased from those who authored the changes, we exclude such commits from our analysis. Even though the parent pointer rule cannot be applied here, most well-structured projects contain indicative commit messages that can be used to this end. We exclude all commits with commit messages that contain any of the keywords "merge pull request", "merge remote-tracking", and "merge branch", regardless of punctuation.
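Both exclusion rules can be sketched in a few lines; the commit representation and the helper name `is_excluded` are our own illustration, not the authors' implementation:

```python
import re

# Keywords whose presence in a commit message marks a merge/rebase
# (matched case-insensitively, ignoring punctuation as described above).
MERGE_KEYWORDS = ("merge pull request", "merge remote-tracking", "merge branch")

def is_excluded(commit):
    """Apply both exclusion rules to one commit dict.

    commit: {"parents": [...], "message": "..."}
    """
    # Rule 1: merge commits have two or more parent pointers.
    if len(commit.get("parents", [])) >= 2:
        return True
    # Rule 2: rebased changes are detected via indicative commit messages;
    # strip punctuation (keeping word characters, whitespace and hyphens).
    msg = re.sub(r"[^\w\s-]", " ", commit.get("message", "")).lower()
    return any(kw in msg for kw in MERGE_KEYWORDS)

commits = [
    {"parents": ["a"], "message": "Fix typo in README"},
    {"parents": ["a", "b"], "message": "C6"},                   # merge commit
    {"parents": ["a"], "message": "Merge branch 'feature-x'"},  # rebase-style
]
print([is_excluded(c) for c in commits])  # [False, True, True]
```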

We note that all summary statistics regarding the number of commits in this paper (e.g. Table 3) are calculated after applying the above two exclusion methods.

A3 Model Fits for Project-Wise Scaling of Productivity

For each project in our data set, we estimated the model in (2) relating the team size s to the mean team-member contribution 〈c′〉. For a small number of those projects, the team size s varies in a rather narrow range, which calls into question the logarithmic transformation of both s and 〈c′〉 in the linear model of (2).Footnote 14 We thus additionally use a model variation with a logarithmic transformation of 〈c′〉, while keeping s linear, i.e.:

$$ \log\langle c^{\prime}\rangle=\hat{\beta}_{3}+\hat{\alpha}_{3}\cdot s $$
(6)

We denote this model as Log-Lin, while referring to the original model in which we perform a logarithmic transformation of both parameters as Log-Log.

For each project, we fit both models and select the one which yields the larger coefficient of determination r² as the appropriate model for this project. The resulting project-dependent scaling coefficients are summarized in Table 4.

Table 4 Estimation of two linear models for Fig. 6 with single-commit developers removed

The result confirms that our finding of decreasing returns to scale at the aggregate level (Section 4.1) also holds for individual projects. Virtually all projects exhibit negative scaling of the mean team-member contribution with the team size, except for two projects for which no significant scaling coefficient could be determined. At any rate, the absence of significant positive coefficients for any of the projects allows us to conclude that there is no evidence for super-linear scaling in our data set.
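The model selection of this appendix - fitting both the Log-Log and Log-Lin variants and keeping the one with the larger coefficient of determination - can be sketched as follows. Plain least squares is used here as a simple stand-in for the robust estimation discussed in Appendix A5, and all data are synthetic:

```python
import numpy as np

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def select_scaling_model(s, c_mean):
    """Fit Log-Log and Log-Lin variants and keep the better one by r^2.

    s: team sizes; c_mean: mean team-member contribution <c'>.
    Returns (model name, slope, r^2).
    """
    y = np.log10(c_mean)
    fits = {}
    for name, x in (("Log-Log", np.log10(s)), ("Log-Lin", np.asarray(s, float))):
        slope, intercept = np.polyfit(x, y, 1)
        fits[name] = (slope, r_squared(y, slope * x + intercept))
    best = max(fits, key=lambda k: fits[k][1])
    return best, fits[best][0], fits[best][1]

# Synthetic example: <c'> ~ s^(-0.3), i.e. a Ringelmann-type decrease.
rng = np.random.default_rng(0)
s = np.arange(2, 200)
c = 1000.0 * s ** -0.3 * 10 ** rng.normal(0, 0.05, s.size)
name, slope, r2 = select_scaling_model(s, c)
print(name, round(slope, 2))  # expect Log-Log with slope near -0.3
```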

A4 Effect of One-Time Contributors

In order to quantify the extent to which our results on team productivity may be influenced by contributors who committed to a project only once, we identified single-commit developers in all of the studied projects. Figure 14 shows the fraction of one-time contributors in all of the studied projects, validating the intuition that they comprise a sizable part of the development team.

Fig. 14

Fraction of commits submitted by one-time contributors, i.e., developers who never contributed a second commit

In order to ensure that our results about the scaling of productivity are not qualitatively affected by the large fraction of single-commit developers, we additionally filtered the commit logs of all projects, removing the commits of all developers who committed only once. In this way, we focus on the contributions of a core team that excludes single-commit developers. Using the filtered commit logs, we then recomputed all model fits in the paper. In Tables 5 and 6, we report the resulting scaling exponents. We observe no qualitative changes regarding our observation of decreasing returns to scale. We additionally reanalyzed all individual projects, again filtering out all contributions by single-commit developers. We report the project-wise scaling exponents as the bracketed values in Table 4, again not observing any qualitative changes of our results for individual projects.

Table 5 Estimation of two linear models for Fig. 6 with single-commit developers removed
Table 6 Estimation of two linear models for Fig. 7 with single-commit developers removed

A5 Inference Versus Prediction from Linear Models

In Section 4.1 we introduced two linear modelsFootnote 15 as a means of quantifying the negative trends observed in Figs. 6 and 7. In particular, we introduced

$$\begin{array}{@{}rcl@{}} \log\langle n\rangle = \beta_{0} + \alpha_{0} \cdot \log s + \epsilon_{0}\\ \log\langle c\rangle = \beta_{1} + \alpha_{1} \cdot \log s + \epsilon_{1}\\ \log\langle n^{\prime}\rangle = \beta_{2} + \alpha_{2} \cdot \log s + \epsilon_{2}\\ \log\langle c^{\prime}\rangle = \beta_{3} + \alpha_{3} \cdot \log s + \epsilon_{3} \end{array} $$
(7)

where 〈n〉 is the mean number of commits per active developer, 〈n′〉 is the mean number of commits per team member, 〈c〉 is the mean commit contribution per active developer, 〈c′〉 is the mean contribution per team member, and 𝜖 0,1,2,3 denote the errors of the models.

We note that for these models to provide reliable predictionsFootnote 16 the following conditions must be met: (a) Var(𝜖 0,1,2,3 | log s) = σ² for all s (homoskedasticity), (b) 𝜖 0,1,2,3 ∼ 𝒩(0, σ²) (normality of the error distribution) and (c) E(𝜖 0,1,2,3 | log s) = 0 (the linear model is correct).

We test for homoskedasticity by running the Koenker studentised version of the Breusch-Pagan test (Koenker 1981). This test regresses the squared residuals on the predictor in (7) and uses the more widely applied Lagrange Multiplier (LM) statistic instead of the F-statistic. Although more sophisticated procedures, e.g. White's test, would account for a non-linear relation between the residuals and the predictor, we find that the Breusch-Pagan test is sufficient to detect heteroskedasticity in our data. The consequence of violating the homoskedasticity assumption is that the estimated variance of the slopes α 0,1,2,3 will be biased, hence the statistics used to test hypotheses will be invalid. Thus, to account for the presence of heteroskedasticity, we use robust methods to calculate heteroskedasticity-consistent standard errors. More specifically, we use an MM-type robust regression estimator, as provided by the R package robustbase.
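A minimal sketch of the Koenker studentised Breusch-Pagan statistic for a single-predictor regression such as (7): the LM statistic is n·R² of the auxiliary regression of the squared residuals on the predictor. The data below are synthetic, and this is an illustration rather than the authors' code:

```python
import numpy as np

def koenker_bp_lm(x, y):
    """Koenker's studentised Breusch-Pagan LM statistic.

    Regress y on x by OLS, then regress the squared residuals on x;
    the statistic is n * R^2 of that auxiliary regression, asymptotically
    chi-square with 1 degree of freedom for a single predictor.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)
    e2 = (y - (slope * x + intercept)) ** 2
    a, b = np.polyfit(x, e2, 1)
    fitted = a * x + b
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - np.mean(e2)) ** 2)
    return len(x) * r2

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 500)
homo = 2 * x + rng.normal(0, 1.0, x.size)        # constant error variance
hetero = 2 * x + rng.normal(0, 1.0, x.size) * x  # error variance grows with x
print(koenker_bp_lm(x, homo), koenker_bp_lm(x, hetero))
# An LM value well above the 5% chi-square(1) critical value (~3.84)
# signals heteroskedasticity; the second value should be much larger.
```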

As for normality of the errors 𝜖 0,1,2,3, a violation of this assumption would render exact t and F statistics incorrect. However, our use of a robust MM estimator addresses possible non-normality of residuals, as it is resistant to the influence of outliers.

The last assumption pertains to the overall feasibility of the linear model. A common way to assess it is to plot the residuals from estimating (7) versus the fitted values, commonly known as a Tukey-Anscombe plot. A strong trend in the plot is evidence that the relationship between the dependent and independent variable is not captured well by a linear model. As a result, predicting the dependent variable from the calculated slope is likely to be unreliable, especially if the relationship between the variables is highly non-linear.

In Fig. 15 we show the Tukey-Anscombe plots for the four regression models in (7). While we cannot readily observe a prominent trend, we nevertheless see two qualitatively different regimes. Specifically, the residuals in the lower ranges are close to zero, while they are relatively symmetrically distributed beyond this range. Looking at the line fits in Figs. 6 and 7, we see that the reason for this discrepancy is the outliers in the region of large team sizes, which fall close to the fitted regression lines. Therefore, the residuals corresponding to these outliers will be close to zero. Investigating these specific data points reveals that they belong exclusively to the Specs project.

Fig. 15

Residuals versus fitted values for (7). The titles above each plot correspond to the respective scatterplots in Figs. 6 and 7

To actually quantify a possible trend in Fig. 15, we calculate the normalized mutual information (NMI) between the residuals and the fitted values.Footnote 17 As expected, the NMI is rather low - 0.04 (top-left), 0.02 (top-right), 0.04 (bottom-left) and 0.03 (bottom-right) - an indication that there is no pronounced systematic error in the linear model. However, even though the NMI values are low, we find that they are all statistically different from zero at p = 0.05.Footnote 18
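A simple binning estimator of the NMI between two continuous samples, such as residuals and fitted values, might look as follows. The bin count and the square-root-of-entropies normalization are our own assumptions; the paper's estimator may differ:

```python
import numpy as np

def nmi(x, y, bins=20):
    """Normalized mutual information between two continuous samples,
    estimated from a 2-d histogram (a simple binning estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / (px[:, None] * py[None, :])[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy)

rng = np.random.default_rng(2)
a = rng.normal(size=5000)
b = rng.normal(size=5000)             # independent of a
c = a + 0.1 * rng.normal(size=5000)   # strongly dependent on a
print(round(nmi(a, b), 2), round(nmi(a, c), 2))
# The first value is near zero; the second is much larger.
```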

Therefore, despite the evidence against a systematic error in these linear models, assumption (c) is technically not satisfied. We thus conservatively avoid using the linear models for prediction. Since the NMI values are rather low, however, the regression models are sufficient for our purpose of simply quantifying the observed negative trends. We argue that the practical significance of such small effect sizes is negligible with respect to introducing strong systematic errors that could obscure a salient non-linear relationship. Effectively, we can only retroactively infer a significant negative relationship between team size and productivity, but cannot forecast team production given team size. We caution that such inference is also subject to high variability, as indicated by the low r² values (see Section 5.2), and is thus valid only on average.

Finally, an argument against the significance of the slopes in (7) is the relatively large sample size of N=13998. Known as the “p-value problem” (Lin et al. 2013), the issue pertains to applying small-sample statistical inference to large samples. Statistical inference is based on the notion that under a null hypothesis a parameter of interest equals a specific value, typically zero, which represents “no effect”. In our example, we are interested in estimating the slopes α 0,1,2,3 with an associated null hypothesis that sets them to zero. It is precisely this representation of the “no effect” by a particular number that becomes problematic with large samples. In large samples the standard error of the estimated parameter becomes so small that even tiny differences between the estimate and the null hypothesis become statistically significant. Hence, unless the estimated parameter is equal to the null hypothesis with an infinite precision, there is always a danger that the statistical significance we find is due to random fluctuations in the data. One way to alleviate the issue is to consider the size of the effect (as we did above) and assess whether the practical significance of the effect is important for the context at hand, even if it is significant in the strict statistical sense.

Another way is to demonstrate that the size and significance of the effect cannot arise from a random fluctuation. To this end we again resort to a bootstrap approach. For each scatter plot in Figs. 6 and 7, we generate 10,000 bootstrap samples by shuffling the data points. We then estimate the regression models on each bootstrap sample and record the corresponding slope estimate \(\hat {\alpha }_{0,1,2,3}\), regardless of its statistical significance (see footnote 19). We find that the slopes of the 10,000 bootstrapped regression models are restricted to the ranges [−0.02, 0.02], [−0.04, 0.04], [−0.04, 0.03] and [−0.04, 0.06] for \(\hat {\alpha }_{0}\), \(\hat {\alpha }_{1}\), \(\hat {\alpha }_{2}\) and \(\hat {\alpha }_{3}\), respectively. Comparing these ranges to the empirical slopes in Tables 1 and 2, we see that once the relationship between team size and productivity is eliminated, we cannot reproduce the strength of the negative trend found in the dataset. It is precisely the information destroyed by the shuffling procedure that accounts for the statistical significance of α_{0,1,2,3}. Hence, it is safe to conclude that our analysis does not suffer from spuriously significant results introduced by large samples.
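The shuffling test can be sketched in a few lines of Python. This is illustrative code on synthetic data, not the paper's actual pipeline; the function and variable names are our own:

```python
import numpy as np

def shuffled_slope_range(x, y, n_boot=10_000, seed=0):
    """Destroy the x-y association by permuting y, then record the OLS
    slope of each shuffled sample. The returned (min, max) range shows
    how large a slope can arise from random fluctuation alone."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        y_shuf = rng.permutation(y)
        # degree-1 polyfit returns (slope, intercept) of the LS line
        slopes[i] = np.polyfit(x, y_shuf, 1)[0]
    return slopes.min(), slopes.max()

# Synthetic data with a genuine negative trend (illustration only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 2000)
y = -0.5 * x + rng.normal(0.0, 1.0, 2000)

lo, hi = shuffled_slope_range(x, y, n_boot=1000)
true_slope = np.polyfit(x, y, 1)[0]
print(lo, hi, true_slope)
```

When the empirical slope falls far outside the shuffled range, as in the paper's analysis, the observed trend cannot be attributed to random fluctuation.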

A6 Calculating Overlapping Source Code Regions

Our method of identifying overlapping source code changes between co-edits of the same file is based on the information in the chunk header of a diff between two versions of a committed file. Such a file diff shows only those portions of the file that were actually modified by a commit. In git parlance these portions are known as chunks. Each chunk is preceded by one line of header information enclosed between @@…@@, which indicates the lines of the file that were modified by the commit. Therefore, from all chunk headers within a file diff we can obtain the line ranges affected by the commit and eventually calculate the overlapping source code regions between two different commits to the same file.

As a concrete example, assume a productivity time window of 7 days in which the file foo.txt was modified first by developer A and then by developer B, in commits C_A and C_B, respectively. The diff of foo.txt in commit C_A may contain the following chunk header:

@@ -10,15 +10,12 @@

The content of the header is split in two parts identified by “−” and “+”: −10,15 and +10,12. The two pairs of numbers indicate the line ranges outside of which the two versions of foo.txt (before and after C_A) are identical. More specifically, -10,15 means that starting at line 10, C_A made changes to a region spanning 15 lines, i.e., it affected the line range [10 - 24]. The result of these changes is given in the second part of the header: +10,12 indicates that starting at line 10 in the new state of the file, a region of 12 lines, i.e., the line range [10 - 21], replaces the old line range [10 - 24]. Beyond these 12 lines, the old and the new state of foo.txt are identical, provided there are no more chunks in the file diff. Therefore, the line range [10 - 24] in the old state of foo.txt and the line range [10 - 21] in the new state after C_A are the only differences introduced by the commit. This could be caused, for example, by the removal of three lines from the line range [10 - 24], together with other modifications in the same range.
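Extracting these line ranges from a chunk header can be sketched in a few lines of Python. This is our own illustrative parser, not the paper's implementation; it assumes git's convention that a `start,count` pair covers lines `start` through `start+count-1`, and that an omitted count defaults to 1:

```python
import re

# Matches the "-start,count +start,count" pair in a chunk header such as
# "@@ -10,15 +10,12 @@"; the ",count" part may be omitted and defaults to 1.
CHUNK_HEADER = re.compile(r"@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def chunk_ranges(header):
    """Return ((old_start, old_end), (new_start, new_end)) as inclusive
    line ranges, where start,count covers lines start..start+count-1."""
    m = CHUNK_HEADER.search(header)
    if m is None:
        raise ValueError(f"not a chunk header: {header!r}")
    old_start, old_count = int(m.group(1)), int(m.group(2) or 1)
    new_start, new_count = int(m.group(3)), int(m.group(4) or 1)
    return ((old_start, old_start + old_count - 1),
            (new_start, new_start + new_count - 1))

print(chunk_ranges("@@ -3,4 +3,6 @@"))   # ((3, 6), (3, 8))
print(chunk_ranges("@@ -7 +7,2 @@"))     # ((7, 7), (7, 8))
```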

Since C_A comes prior to C_B in our example, we associate the second part of the chunk header, i.e., line range [10 - 21], with C_A, as it represents the state of foo.txt after the changes from C_A were applied and before those from C_B. Now assume that the diff of foo.txt in C_B has only one chunk with the following header:

@@ -10,30 +10,40 @@

In other words, lines [10 - 39] from the old state of foo.txt were modified by C_B, and the changes are reflected in lines [10 - 49] in the new state of foo.txt after C_B (see footnote 20). Note that lines [10 - 39] represent the state of foo.txt after C_A, but before C_B. Therefore, to compute the overlapping source code ranges between C_B and C_A, we need to compare the line ranges [10 - 39] and [10 - 21] and calculate their overlap. In this case, the overlap is 12 lines, which is the weight we attribute to the coordination link from developer B to developer A in this particularly simple example. The procedure described above is applied to all pairs of commits by different developers that have edited a common file within a given productivity time window of 7 days. Processing the chunk information in this way thus allows us to extract line-based, weighted and directed co-editing networks which capture the association between developers and source code regions.
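The overlap computation itself reduces to intersecting two inclusive integer ranges. A minimal sketch (our own helper, not the paper's implementation), using neutral example ranges:

```python
def range_overlap(a, b):
    """Number of lines shared by two inclusive line ranges (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

# E.g., the post-commit range of an earlier commit intersected with the
# pre-commit range of a later commit to the same file; the result would
# serve as the weight of the directed co-editing link.
print(range_overlap((5, 14), (10, 30)))   # 5  (lines 10-14)
print(range_overlap((5, 14), (20, 30)))   # 0  (disjoint ranges)
```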

Cite this article

Scholtes, I., Mavrodiev, P. & Schweitzer, F. From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in Open Source Software projects. Empir Software Eng 21, 642–683 (2016). https://doi.org/10.1007/s10664-015-9406-4

Keywords

  • Software engineering
  • Repository mining
  • Productivity factors
  • Social aspects of software engineering
  • Open source software
  • Coordination