Evolutionary trends of developer coordination: a network approach

Empirical Software Engineering

Abstract

Software evolution is a fundamental process that transcends the realm of technical artifacts and permeates the entire organizational structure of a software project. By means of a longitudinal empirical study of 18 large open-source projects, we examine and discuss the evolutionary principles that govern the coordination of developers. By applying a network-analytic approach, we found that the implicit and self-organizing structure of developer coordination is ubiquitously described by non-random organizational principles that defy conventional software-engineering wisdom. In particular, we found that: (a) developers form scale-free networks, in which the majority of coordination requirements arise among an extremely small number of developers, (b) developers tend to accumulate coordination requirements with more and more developers over time, presumably limited by an upper bound, and (c) initially developers are hierarchically arranged, but over time, form a hybrid structure, in which core developers are hierarchically arranged and peripheral developers are not. Our results suggest that the organizational structure of large projects is constrained to evolve towards a state that balances the costs and benefits of developer coordination, and the mechanisms used to achieve this state depend on the project’s scale.


Notes

  1. The second-order Markov chain is more complex because it includes the random variable $X_{t-1}$ in the model, but the vast majority of the variance in our data is explained by the first-order Markov chain. We concluded that the increase in model complexity is not justified by the improvement in the model’s fit.

  2. We chose the number of synthetic data sets to generate so that the p value is precise to two decimal places.

  3. http://www.cnet.com/news/mozilla-pushes-back-firefox-3-6-4-0-deadlines/

  4. https://eclipse.org/

  5. http://siemens.github.io/codeface/

References

  • Albert R, Barabási A L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47


  • Arafat O, Riehle D (2009) The commit size distribution of open source software. In: Proceedings of international conference on systems sciences. IEEE, pp 1–8

  • Arias T B C, van der Spek P, Avgeriou P (2011) A practice-driven systematic review of dependency analysis solutions. Empir Softw Eng 16(5):544–586


  • Arnold R S, Bohner SA (1993) Impact analysis-towards a framework for comparison. In: Proceedings of international conference on software maintenance. IEEE, pp 292–301

  • Atkinson A B (1970) On the measurement of inequality. J Econ Theory 2(3):244–263

  • Baeza-Yates R, Ribeiro-Neto B et al (1999) Modern information retrieval. Addison-Wesley

  • Barabási A L, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512


  • Barabási A L, Jeong H, Néda Z, Ravasz E, Schubert A, Vicsek T (2002) Evolution of the social network of scientific collaborations. Phys A Stat Mech Appl 311(3):590–614

  • Bavota G, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) An empirical study on the developers’ perception of software coupling. In: Proceedings of international conference on software engineering. IEEE, pp 692–701

  • Begel A, Khoo YP, Zimmermann T (2010) Codebook: discovering and exploiting relationships in software repositories. In: Proceedings of international conference on software engineering. ACM, pp 125–134

  • Bernard H R, Killworth P D, Evans M J, McCarty C, Shelley G A (1988) Studying social relations cross-culturally. Ethnology 27(2):155–179


  • Bishop CM (2006) Pattern recognition and machine learning. Springer

  • Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D U (2006) Complex networks: structure and dynamics. Phys Rep 424(4):175–308


  • Boehm BW (ed) (1989) Software risk management. IEEE

  • Brooks FP (1995) The mythical man-month. Addison-Wesley

  • Cataldo M, Herbsleb JD (2013) Coordination breakdowns and their impact on development productivity and software failures. IEEE Trans Softw Eng 39(3):343–360


  • Cataldo M, Wagstrom P A, Herbsleb J D, Carley K M (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of conference on computer supported cooperative work. ACM, pp 353–362

  • Cataldo M, Herbsleb J D, Carley K M (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: Proceedings of international symposium on empirical software engineering and measurement. ACM, pp 2–11

  • Cataldo M, Mockus A, Roberts J A, Herbsleb J D (2009) Software dependencies, work dependencies, and their impact on failures. IEEE Trans Softw Eng 35(6):864–878


  • Clauset A, Shalizi C R, Newman M E J (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703


  • Conway ME (1968) How do committees invent. Datamation 14(4):28–31


  • Crowston K, Howison J (2005) The social structure of free and open source software development. First Monday 10(2)

  • Crowston K, Wei K, Li Q, Howison J (2006) Core and periphery in free/libre and open source software team communications. In: Proceedings of international conference on system sciences. IEEE, p 118.1

  • Crowston K, Kangning W, Howison J, Wiggins A (2012) Free/libre open-source software development: what we know and what we do not know. ACM Computing Surveys 44(2):7:1–7:35


  • DiBona C, Ockman S, Stone M (eds) (1999) Open sources: voices from the open source revolution. O’Reilly Media & Associates, Inc

  • Dinh-Trong T T, Bieman J M (2005) The FreeBSD project: a replication case study of open source development. IEEE Trans Softw Eng 31(6):481–494

  • Dorogovtsev SN, Mendes JF (2013) Evolution of networks: from biological nets to the Internet and WWW. Oxford University Press

  • Erdős P, Rényi A (1959) On random graphs. Publ Math 6:290–297


  • Espinosa J A, Slaughter S A, Kraut R E, Herbsleb J D (2007) Familiarity, complexity, and team performance in geographically distributed software development. Organ Sci 18(4):613–630


  • Foucault M, Palyart M, Blanc X, Murphy G C, Falleri J R (2015) Impact of developer turnover on quality in open-source software. In: Proceedings of international symposium on foundations of software engineering, ACM, pp 829–841

  • Godfrey M W, Tu Q (2000) Evolution in open source software: a case study. In: Proceedings of international conference on software maintenance. IEEE, pp 131–142

  • Goldstein M, Morris S, Yen G (2004) Problems with fitting to the power-law distribution. Eur Phys J B-Condensed Matter 41(2):255–258


  • Hinds P, McGrath C (2006) Structures that work: social structure, work structure and coordination ease in geographically distributed teams. In: Proceedings of conference on computer supported cooperative work. ACM, pp 343–352

  • Huang N E, Shen Z, Long S R, Wu M C, Shih H H, Zheng Q, Yen N C, Tung C C, Liu H H (1998) The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc Royal Society London A Math Phys Eng Sci 454(1971):903–995


  • Huselid M A (1995) The impact of human resource management practices on turnover, productivity, and corporate financial performance. Acad Manag J 38(3):635–672


  • Hynninen P, Piri A, Niinimaki T (2010) Off-site commitment and voluntary turnover in GSD projects. In: Proceedings of international conference on global software engineering. IEEE, pp 145–154

  • Jeong H, Tombor B, Albert R, Oltvai Z N, Barabási A L (2000) The large-scale organization of metabolic networks. Nature 407(6804):651–654


  • Jermakovics A, Sillitti A, Succi G (2011) Mining and visualizing developer networks from version control systems. In: Proceedings of international workshop on cooperative and human aspects of software engineering. ACM, pp 24–31

  • Joblin M, Mauerer W, Apel S, Siegmund J, Riehle D (2015) From developer networks to verified communities: a fine-grained approach. In: Proceedings of international conference on software engineering. IEEE, pp 563–573

  • Joblin M, Apel S, Hunsen C, Mauerer W (2016) Classifying developers into core and peripheral: an empirical study on count and network metrics, preprint at arXiv:1604.00830


  • Koch S (2004) Profiling an open source project ecology and its programmers. Electron Mark 14(2):77–88


  • Kotter JP (2014) Accelerate: building strategic agility for a faster-moving world. Harvard Business Review Press

  • Lehman MM, Ramil JF (2001) Rules and tools for software evolution planning and management. Ann Softw Eng 11(1):15–44


  • Lehman M M, Ramil JF, Wernick PD, Perry DE, Turski WM (1997) Metrics and laws of software evolution: the nineties view. In: Proceedings of software metrics symposium. IEEE, pp 20–32

  • López L, Robles G, Jesús, Herraiz I (2006) Applying social network analysis techniques to community-driven libre software projects. Int J Inf Technol Web Eng 1(3):27–48


  • Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26


  • Manning C D, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press

  • Martinez-Romo J, Robles G, Gonzalez-Barahona J M, Ortuño-Perez M (2008) Using social network analysis techniques to study collaboration between a FLOSS community and a company. In: Open source development, communities and quality. Springer, pp 171–186

  • Mauerer W, Jaeger M C (2013) Open source engineering processes. Inf Technol 55(5):196–203


  • Meneely A, Williams L (2011) Socio-technical developer networks: should we trust our measurements? In: Proceedings of international conference on software engineering. ACM, pp 281–290

  • Meneely A, Williams L, Snipes W, Osborne J (2008) Predicting failures with developer networks and social network analysis. In: Proceedings of foundations of software engineering. ACM, pp 13–23

  • Mens T, Fernández-Ramil J, Degrandsart S (2008) The evolution of Eclipse. In: Proceedings of international conference on software maintenance. IEEE, pp 386–395

  • Mockus A (2010) Organizational volatility and its effects on software defects. In: Proceedings of international symposium on foundations of software engineering. ACM, pp 117–126

  • Mockus A, Fielding R T, Herbsleb J (2000) A case study of open source software development: the Apache server. In: Proceedings of international conference on software engineering. IEEE, pp 263–272

  • Mockus A, Fielding RT, Herbsleb JD (2002) Two case studies of open source software development: Apache and Mozilla. ACM Trans Softw Eng Meth 11(3):309–346

  • Olbrich S, Cruzes D S, Basili V, Zazworka N (2009) The evolution and impact of code smells: a case study of two open source systems. In: Proceedings of the international symposium on empirical software engineering and measurement. IEEE, pp 390–400

  • Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling measures for impact analysis. Empir Softw Eng 14(1):5–32


  • Ravasz E, Barabási A L (2003) Hierarchical organization in complex networks. Phys Rev E 67(2):026112

  • Robles G, Gonzalez-Barahona J, Herraiz I (2009) Evolution of the core team of developers in libre software projects. In: Proceedings of international working conference on mining software repositories. IEEE, pp 167–170

  • Schilling A, Laumer S, Weitzel T (2012) Who will remain? An evaluation of actual person-job and person-team fit to predict developer retention in FLOSS projects. In: Proceedings of international conference on system sciences. IEEE, pp 3446–3455

  • Scholtes I, Mavrodiev P, Schweitzer F (2016) From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in open source software projects. Empir Softw Eng 21(2):642–683

  • Sosa M E, Eppinger S D, Rowles C M (2004) The misalignment of product architecture and organizational structure in complex product development. Manage Sci 50(12):1674–1689


  • Stevens W P, Myers G J, Constantine L L (1974) Structured design. IBM Syst J 13(2):115–139


  • Terceiro A, Rios L R, Chavez C (2010) An empirical study on the structural complexity introduced by core and peripheral developers in free software projects. In: Proceedings of Brazilian symposium on software engineering. IEEE, pp 21–29

  • Toral S, Martínez-Torres M, Barrero F (2010) Analysis of virtual communities supporting OSS projects using social network analysis. Inf Softw Technol 52(3):296–303

  • Yu Y, Benlian A, Hess T (2012) An empirical study of volunteer members’ perceived turnover in open source software projects. In: Proceedings of international conference on system sciences. IEEE, pp 3396–3405


Acknowledgments

This work has been supported by the German Research Foundation (AP 206/4, AP 206/5, AP 206/6).

Author information

Corresponding author

Correspondence to Mitchell Joblin.

Additional information

Communicated by: David Lo

Appendices

Appendix A: Function-level Semantic Coupling

To determine function-level semantic coupling, we first extracted the implementation of each function in the system, including all source code and comments. We then employed well-established text-mining preprocessing operations, with minor modifications for our specific domain requirements. In this framework, each function is treated as a “document” in the text-mining sense of the word, and the document collection is processed using the following operations.

The preprocessing stage primarily focuses on reducing word diversity and eliminating words that carry little information. Stemming is used to reduce words to their root form by removing suffixes (e.g., “ing”, “ly”, “er”) from each word in the document. Stemming is necessary because, even though a root word may take several forms through added suffixes, it typically refers to a similar concept in all forms. In software engineering, there are a number of variable-naming conventions, such as letter-case-separated (e.g., CamelCase) or delimiter-separated words, that need to be tokenized appropriately. We added preprocessing stages that specifically handle the tokenization of popular naming conventions. For example, the function identifiers “get_user” and “getUser” are each separated into the two words “get” and “user”. This matters because, without accounting for variable-naming conventions, getters and setters that interact with the same attribute would be incorrectly understood as distinct concepts. The final stage of preprocessing removes words that are known not to contain useful information based on a-priori knowledge of the language. For example, words such as “the” are not helpful in determining the domain concept of a document. Removing these words benefits both the computational complexity and the results by reducing the problem’s dimensionality and attenuating noise in the data.
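The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the suffix list, and the function names `split_identifier`, `naive_stem`, and `preprocess` are assumptions chosen for the example, and the suffix stripping is a crude stand-in for a real stemming algorithm.

```python
import re

# Hypothetical stop-word and suffix lists, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
SUFFIXES = ("ing", "ly", "er")


def split_identifier(identifier):
    """Split delimiter-separated and camelCase identifiers into lowercase words."""
    words = []
    # First split on anything that is not a letter (underscores, digits, ...).
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Then insert a space at each lower-to-upper boundary: getUser -> get User.
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part)
        words.extend(word.lower() for word in spaced.split())
    return words


def naive_stem(word):
    """Strip one known suffix if enough of the word remains afterwards."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize text, split identifiers, drop stop words, and stem the rest."""
    tokens = []
    for raw in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        for word in split_identifier(raw):
            if word not in STOP_WORDS:
                tokens.append(naive_stem(word))
    return tokens
```

With this sketch, `preprocess("get_user")` and `preprocess("getUser")` both yield the words "get" and "user", so a getter and a setter for the same attribute share the term "user".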

After the preprocessing stage, we arrange all remaining data into a term–document matrix for mathematical convenience. A term–document matrix is an M×N matrix with rows representing terms and columns representing documents. An element $TD_{i,j}$ of the term–document matrix is non-zero when document $d_j$ contains term $t_i$. All elements of the term–document matrix are integer weights that indicate the frequency of occurrence of a given term in a given document. We then apply a weight transformation to the term–document matrix based on the occurrence statistics of each term. Intuition suggests that not all terms in a document are equally important for identifying the domain concept. The goal of the weighting transformation is to increase the influence of terms that help to identify distinct concepts and to decrease the influence of the remaining terms. The particular weighting scheme we applied is called term frequency–inverse document frequency:

$$ \textit{tf-idf}_{t,d} = \textit{tf}_{t} \times \log\frac{N}{\textit{df}_{t}}. \tag{9} $$

The term $\textit{tf}_{t}$ represents the global term frequency across all documents. The second term is the logarithm of the inverse document frequency, where N is the number of documents in the total collection and $\textit{df}_{t}$ is the number of documents in which term t appears. Upon closer inspection, one can recognize that Equation 9 is: (a) greatest when a term is very frequent, but only appears in a small number of documents, (b) lowest when a term is present in all documents, and (c) between these two extreme cases when a term is infrequent in one document or occurs in many documents.
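To make the three cases concrete, the following sketch builds a small term–document count matrix and computes the weight of Equation 9 for each term. It rests on several assumptions: the toy documents are invented, the natural logarithm is used because the text does not state a base, and the function names are hypothetical. It illustrates the formula as written (global term frequency times log inverse document frequency), not the authors' tooling.

```python
import math


def term_document_matrix(documents):
    """Return (terms, td), where td[i][j] counts term i in document j."""
    terms = sorted({term for doc in documents for term in doc})
    row = {term: i for i, term in enumerate(terms)}
    td = [[0] * len(documents) for _ in terms]
    for j, doc in enumerate(documents):
        for term in doc:
            td[row[term]][j] += 1
    return terms, td


def tfidf_weights(td):
    """Per-term weight tf_t * log(N / df_t) computed from a count matrix."""
    n_docs = len(td[0])
    weights = []
    for counts in td:
        tf = sum(counts)                       # global frequency of the term
        df = sum(1 for c in counts if c > 0)   # documents containing the term
        weights.append(tf * math.log(n_docs / df) if df else 0.0)
    return weights


# Toy, already preprocessed documents (assumed data).
docs = [
    ["parse", "parse", "parse", "token"],
    ["user", "token"],
    ["user", "file", "token"],
]
terms, td = term_document_matrix(docs)
weights = dict(zip(terms, tfidf_weights(td)))
```

Here "parse" is frequent but confined to one document and receives the largest weight (case a), "token" occurs in every document and receives a weight of zero (case b), and "file" and "user" fall in between (case c).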

Even for a modest-sized software project, the number of terms in the implementation vocabulary easily runs into the thousands. The problem with this becomes evident when adopting the vector-space model, in which a document is a vector in a space spanned by the terms that comprise the document collection. Fortunately, this very high-dimensional space is extremely sparse, which allows us to project the documents into a lower-dimensional subspace and thereby makes the semantic similarity computation tractable. We achieve this using latent semantic indexing, a matrix decomposition technique that relies on the singular value decomposition. An added benefit of this technique is that it is capable of correctly resolving the relationships of synonymy and polysemy in natural language (Baeza-Yates et al. 1999). Furthermore, there is evidence that latent semantic indexing is valid and reliable in the software-engineering domain (Bavota et al. 2013).
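As a rough sketch of the projection step, latent semantic indexing can be expressed as a truncated singular value decomposition of the term–document matrix. The example below assumes NumPy is available; the matrix values, the number of retained dimensions `k`, and the function name `lsi_project` are illustrative assumptions, since the text does not report these details.

```python
import numpy as np


def lsi_project(td, k):
    """Project documents (columns of td) onto the top-k latent dimensions."""
    matrix = np.asarray(td, dtype=float)
    # Thin SVD: matrix = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Row j of the result is document j expressed in the k-dimensional space.
    return (np.diag(s[:k]) @ vt[:k, :]).T


# Toy matrix (assumed data): documents 0 and 1 share one vocabulary,
# documents 2 and 3 share another.
td = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
docs_k = lsi_project(td, k=2)
```

In this toy case the four-dimensional term space is reduced to two latent dimensions, one per group of documents, so documents 0 and 1 end up pointing in the same direction while documents 0 and 2 are orthogonal.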

In the final step of the analysis, we determine semantic coupling by computing the similarity between all document vectors projected onto the lower-dimensional subspace obtained by applying latent semantic indexing. We operationalize the similarity between two document vectors in the latent space using cosine similarity:

$$ \text{similarity}(\mathbf{d}_{\mathbf{a}}, \mathbf{d}_{\mathbf{b}}) = \frac{\mathbf{d}_{\mathbf{a}} \cdot \mathbf{d}_{\mathbf{b}}}{\left\| \mathbf{d}_{\mathbf{a}} \right\|\left\| \mathbf{d}_{\mathbf{b}} \right\|}, \tag{10} $$

where the numerator is the dot product of the two document vectors and the denominator is the product of their magnitudes. Intuitively, cosine similarity reflects the angle between the two document vectors; it equals 1 when the two vectors are parallel and 0 when they are orthogonal. Two source-code artifacts are then considered to be semantically coupled if their cosine similarity exceeds a given threshold. We experimented extensively with a number of thresholds by manually inspecting the results and judging whether the functions were, in fact, semantically related, using architectural knowledge of a well-known project (see Note 5). We found that a threshold of 0.65 was able to identify most semantic relationships with only a very small number of false positives. We deliberately chose the threshold to favor avoiding false positives over avoiding false negatives.
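The thresholding step can be sketched as below. The 0.65 threshold is the value reported in the text; the example vectors and the function names `cosine_similarity` and `coupled_pairs` are assumptions made for illustration, and returning 0.0 for a zero-length vector is a convention of this sketch rather than something the text specifies.

```python
import math

THRESHOLD = 0.65  # threshold value reported in the text


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (Equation 10)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Convention of this sketch: a zero vector is similar to nothing.
    return dot / norm if norm else 0.0


def coupled_pairs(vectors, threshold=THRESHOLD):
    """Return index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    return [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if cosine_similarity(vectors[i], vectors[j]) > threshold
    ]


# Hypothetical document vectors in a two-dimensional latent space.
vectors = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]]
pairs = coupled_pairs(vectors)
```

Only the first two vectors point in nearly the same direction, so `pairs` contains the single pair `(0, 1)`. Vector length does not matter, only direction.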

Appendix B: Analysis Window Selection

We chose a sliding-window approach in our study to generate the time-resolved series of developer networks. Another option would have been to analyze the project using non-overlapping windows, but this can lead to problematic edge discontinuities between the analysis windows. For example, a set of several related changes to the software could be divided between two different analysis windows, even though the changes occurred temporally close together. For this reason, a sliding-window approach is superior to the alternative for our purposes, but we also recognized that overlapping windows could influence the appearance of developer transitions (see Section 3.3), because a commit can appear in two contiguous analysis windows. To test whether the overlapping windows distort the overall outcome, we compared all Markov chains using non-overlapping windows with those using overlapping windows. The comparison revealed that, in all projects, our conclusion that core developers are more stable than peripheral developers holds regardless of which windowing strategy is used. In most cases, using non-overlapping windows increased the probability that a core or peripheral developer leaves the project, but peripheral developers are always significantly more likely to leave. For example, in QEMU, using overlapping windows, core and peripheral developers leave the project with a probability of 0.5 % and 10.9 %, respectively. With non-overlapping windows, this changes to 13 % for core and 55 % for peripheral developers. The data for both sets of Markov chains are included on the supplementary Web site.
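The difference between the two windowing strategies can be illustrated with a short sketch. The window length of 90 days, the step of 45 days, the dates, and the function names are assumptions picked for the example; this appendix does not state the parameters used in the study.

```python
from datetime import date, timedelta


def analysis_windows(start, end, length_days, step_days):
    """Return half-open (window_start, window_end) pairs that fit in [start, end)."""
    windows = []
    length = timedelta(days=length_days)
    current = start
    while current + length <= end:
        windows.append((current, current + length))
        current += timedelta(days=step_days)
    return windows


def windows_containing(commit_date, windows):
    """Return the windows whose half-open interval contains the commit date."""
    return [w for w in windows if w[0] <= commit_date < w[1]]


start = date(2020, 1, 1)
end = start + timedelta(days=182)

# A step smaller than the window length yields overlapping (sliding) windows.
sliding = analysis_windows(start, end, length_days=90, step_days=45)
# A step equal to the window length yields non-overlapping windows.
disjoint = analysis_windows(start, end, length_days=90, step_days=90)

commit = start + timedelta(days=60)
```

With these assumed parameters, the commit on day 60 falls into two of the three sliding windows but into only one of the two non-overlapping windows, which is the effect that could influence the observed developer transitions.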

About this article

Cite this article

Joblin, M., Apel, S. & Mauerer, W. Evolutionary trends of developer coordination: a network approach. Empir Software Eng 22, 2050–2094 (2017). https://doi.org/10.1007/s10664-016-9478-9

  • DOI: https://doi.org/10.1007/s10664-016-9478-9
