Abstract
Software evolution is a fundamental process that transcends the realm of technical artifacts and permeates the entire organizational structure of a software project. By means of a longitudinal empirical study of 18 large open-source projects, we examine and discuss the evolutionary principles that govern the coordination of developers. By applying a network-analytic approach, we found that the implicit and self-organizing structure of developer coordination is ubiquitously described by non-random organizational principles that defy conventional software-engineering wisdom. In particular, we found that: (a) developers form scale-free networks, in which the majority of coordination requirements arise among an extremely small number of developers, (b) developers tend to accumulate coordination requirements with more and more developers over time, presumably limited by an upper bound, and (c) initially developers are hierarchically arranged, but over time, form a hybrid structure, in which core developers are hierarchically arranged and peripheral developers are not. Our results suggest that the organizational structure of large projects is constrained to evolve towards a state that balances the costs and benefits of developer coordination, and the mechanisms used to achieve this state depend on the project’s scale.
Notes
1 The second-order Markov chain is more complex because it includes the random variable $X_{t-1}$ in the model, but the vast majority of the variance in our data is explained by the first-order Markov chain. We concluded that the increase in model complexity is not justified by the improvement in the model’s fit.
2 We chose the number of synthetic data sets to generate such that the p value is precise to two decimal places.
References
Albert R, Barabási A L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47
Arafat O, Riehle D (2009) The commit size distribution of open source software. In: Proceedings of international conference on systems sciences. IEEE, pp 1–8
Arias T B C, van der Spek P, Avgeriou P (2011) A practice-driven systematic review of dependency analysis solutions. Empir Softw Eng 16(5):544–586
Arnold R S, Bohner SA (1993) Impact analysis-towards a framework for comparison. In: Proceedings of international conference on software maintenance. IEEE, pp 292–301
Atkinson A B (1970) On the measurement of inequality. J Econ Theory 2(3):244–263
Baeza-Yates R, Ribeiro-Neto B et al (1999) Modern information retrieval. Addison-Wesley
Barabási A L, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Barabási A L, Jeong H, Néda Z, Ravasz E, Schubert A, Vicsek T (2002) Evolution of the social network of scientific collaborations. Phys A Stat Mech Appl 311(3):590–614
Bavota G, Dit B, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2013) An empirical study on the developers’ perception of software coupling. In: Proceedings of international conference on software engineering. IEEE, pp 692–701
Begel A, Khoo YP, Zimmermann T (2010) Codebook: discovering and exploiting relationships in software repositories. In: Proceedings of international conference on software engineering. ACM, pp 125–134
Bernard H R, Killworth P D, Evans M J, McCarty C, Shelley G A (1988) Studying social relations cross-culturally. Ethnology 27(2):155–179
Bishop CM (2006) Pattern recognition and machine learning. Springer
Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang D U (2006) Complex networks: structure and dynamics. Phys Rep 424(4):175–308
Boehm BW (ed) (1989) Software risk management. IEEE
Brooks FP (1995) The mythical man-month. Addison-Wesley
Cataldo M, Herbsleb JD (2013) Coordination breakdowns and their impact on development productivity and software failures. IEEE Trans Softw Eng 39(3):343–360
Cataldo M, Wagstrom P A, Herbsleb J D, Carley K M (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of conference on computer supported cooperative work. ACM, pp 353–362
Cataldo M, Herbsleb J D, Carley K M (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In: Proceedings of international symposium on empirical software engineering and measurement. ACM, pp 2–11
Cataldo M, Mockus A, Roberts J A, Herbsleb J D (2009) Software dependencies, work dependencies, and their impact on failures. IEEE Trans Softw Eng 35(6):864–878
Clauset A, Shalizi C R, Newman M E J (2009) Power-law distributions in empirical data. SIAM Rev 51(4):661–703
Conway ME (1968) How do committees invent. Datamation 14(4):28–31
Crowston K, Howison J (2005) The social structure of free and open source software development. First Monday 10(2)
Crowston K, Wei K, Li Q, Howison J (2006) Core and periphery in free/libre and open source software team communications. In: Proceedings of international conference on system sciences. IEEE, p 118.1
Crowston K, Kangning W, Howison J, Wiggins A (2012) Free/libre open-source software development: what we know and what we do not know. ACM Computing Surveys 44(2):7:1–7:35
DiBona C, Ockman S, Stone M (eds) (1999) Open sources: voices from the open source revolution. O’Reilly Media & Associates, Inc
Dinh-Trong T T, Bieman J M (2005) The FreeBSD project: a replication case study of open source development. IEEE Trans Softw Eng 31(6):481–494
Dorogovtsev SN, Mendes JF (2013) Evolution of networks: from biological nets to the Internet and WWW. Oxford University Press
Erdős P, Rényi A (1959) On random graphs. Publ Math 6:290–297
Espinosa J A, Slaughter S A, Kraut R E, Herbsleb J D (2007) Familiarity, complexity, and team performance in geographically distributed software development. Organ Sci 18(4):613–630
Foucault M, Palyart M, Blanc X, Murphy G C, Falleri J R (2015) Impact of developer turnover on quality in open-source software. In: Proceedings of international symposium on foundations of software engineering. ACM, pp 829–841
Godfrey M W, Tu Q (2000) Evolution in open source software: a case study. In: Proceedings of international conference on software maintenance. IEEE, pp 131–142
Goldstein M, Morris S, Yen G (2004) Problems with fitting to the power-law distribution. Eur Phys J B-Condensed Matter 41(2):255–258
Hinds P, McGrath C (2006) Structures that work: social structure, work structure and coordination ease in geographically distributed teams. In: Proceedings of conference on computer supported cooperative work. ACM, pp 343–352
Huang N E, Shen Z, Long S R, Wu M C, Shih H H, Zheng Q, Yen N C, Tung C C, Liu H H (1998) The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc Royal Society London A Math Phys Eng Sci 454(1971):903–995
Huselid M A (1995) The impact of human resource management practices on turnover, productivity, and corporate financial performance. Acad Manag J 38(3):635–672
Hynninen P, Piri A, Niinimaki T (2010) Off-site commitment and voluntary turnover in GSD projects. In: Proceedings of international conference on global software engineering. IEEE, pp 145–154
Jeong H, Tombor B, Albert R, Oltvai Z N, Barabási A L (2000) The large-scale organization of metabolic networks. Nature 407(6804):651–654
Jermakovics A, Sillitti A, Succi G (2011) Mining and visualizing developer networks from version control systems. In: Proceedings of international workshop on cooperative and human aspects of software engineering. ACM, pp 24–31
Joblin M, Mauerer W, Apel S, Siegmund J, Riehle D (2015) From developer networks to verified communities: a fine-grained approach. In: Proceedings of international conference on software engineering. IEEE, pp 563–573
Joblin M, Apel S, Hunsen C, Mauerer W (2016) Classifying developers into core and peripheral: an empirical study on count and network metrics, preprint at arXiv:1604.00830
Koch S (2004) Profiling an open source project ecology and its programmers. Electron Mark 14(2):77–88
Kotter JP (2014) Accelerate: building strategic agility for a faster-moving world. Harvard Business Review Press
Lehman MM, Ramil JF (2001) Rules and tools for software evolution planning and management. Ann Softw Eng 11(1):15–44
Lehman M M, Ramil JF, Wernick PD, Perry DE, MW (1997) Metrics and laws of software evolution the nineties view. In: Proceedings of software metrics symposium. IEEE, pp 20–32
López L, Robles G, Jesús, Herraiz I (2006) Applying social network analysis techniques to community-driven libre software projects. Int J Inf Technol Web Eng 1(3):27–48
Louridas P, Spinellis D, Vlachos V (2008) Power laws in software. ACM Trans Softw Eng Methodol 18(1):2:1–2:26
Manning C D, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
Martinez-Romo J, Robles G, Gonzalez-Barahona J M, Ortuño-Perez M (2008) Using social network analysis techniques to study collaboration between a FLOSS community and a company. In: Open source development, communities and quality. Springer, pp 171–186
Mauerer W, Jaeger M C (2013) Open source engineering processes. Inf Technol 55(5):196–203
Meneely A, Williams L (2011) Socio-technical developer networks: should we trust our measurements? In: Proceedings of international conference on software engineering. ACM, pp 281–290
Meneely A, Williams L, Snipes W, Osborne J (2008) Predicting failures with developer networks and social network analysis. In: Proceedings of foundations of software engineering. ACM, pp 13–23
Mens T, Fernández-Ramil J, Degrandsart S (2008) The evolution of Eclipse. In: Proceedings of international conference on software maintenance. IEEE, pp 386–395
Mockus A (2010) Organizational volatility and its effects on software defects. In: Proceedings of international symposium on foundations of software engineering. ACM, pp 117–126
Mockus A, Fielding R T, Herbsleb J (2000) A case study of open source software development: the Apache server. In: Proceedings of international conference on software engineering. IEEE, pp 263–272
Mockus A, Fielding RT, Herbsleb JD (2002) Two case studies of open source software development: Apache and Mozilla. ACM Trans Softw Eng Meth 11(3):309–346
Olbrich S, Cruzes D S, Basili V, Zazworka N (2009) The evolution and impact of code smells: a case study of two open source systems. In: Proceedings of the international symposium on empirical software engineering and measurement. IEEE, pp 390–400
Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling measures for impact analysis. Empir Softw Eng 14(1):5–32
Ravasz E, Barabási A L (2003) Hierarchical organization in complex networks. Phys Rev E 67(2):026112
Robles G, Gonzalez-Barahona J, Herraiz I (2009) Evolution of the core team of developers in libre software projects. In: Proceedings of international working conference on mining software repositories. IEEE, pp 167–170
Schilling A, Laumer S, Weitzel T (2012) Who will remain? An evaluation of actual person-job and person-team fit to predict developer retention in FLOSS projects. In: Proceedings of international conference on system sciences. IEEE, pp 3446–3455
Scholtes I, Mavrodiev P, Schweitzer F (2016) From Aristotle to Ringelmann: a large-scale analysis of team productivity and coordination in open source software projects. Empir Softw Eng 21(2):642–683
Sosa M E, Eppinger S D, Rowles C M (2004) The misalignment of product architecture and organizational structure in complex product development. Manage Sci 50(12):1674–1689
Stevens W P, Myers G J, Constantine L L (1974) Structured design. IBM Syst J 13(2):115–139
Terceiro A, Rios L R, Chavez C (2010) An empirical study on the structural complexity introduced by core and peripheral developers in free software projects. In: Proceedings of Brazilian symposium on software engineering. IEEE, pp 21–29
Toral S, Martínez-Torres M, Barrero F (2010) Analysis of virtual communities supporting OSS projects using social network analysis. Inf Softw Technol 52(3):296–303
Yu Y, Benlian A, Hess T (2012) An empirical study of volunteer members’ perceived turnover in open source software projects. In: Proceedings of international conference on system sciences. IEEE, pp 3396–3405
Acknowledgments
This work has been supported by the German Research Foundation (AP 206/4, AP 206/5, AP 206/6).
Additional information
Communicated by: David Lo
Appendices
Appendix A: Function-level Semantic Coupling
To determine function-level semantic coupling, we first extracted the implementation of each function in the system, including all source code and comments. We then employed well-established text-mining preprocessing operations with minor modifications for our specific domain requirements. In this framework, each function is treated as a “document” in the text-mining sense of the word, and the document collection is processed using the following operations.
The preprocessing stage primarily focuses on reducing word diversity and eliminating words that contain little information. Stemming is used to reduce words to their root form by removing suffixes (e.g., “ing”, “ly”, “er”, etc.) from each word in the document. Stemming is necessary because, even though a root word may take several forms by adding suffixes, it typically refers to a relatively similar concept in all forms. In software engineering, there are a number of variable-naming conventions, such as letter-case separated (e.g., CamelCase) or delimiter-separated words, that need to be tokenized appropriately. We added preprocessing stages to specifically handle proper tokenization of popular naming conventions. For example, the function identifiers “get_user” and “getUser” are each separated into the two words “get” and “user”. One simple example of why this is important is that getters and setters interacting with the same attribute would be incorrectly understood as distinct concepts if the variable-naming conventions were ignored. The final stage of the preprocessing is to remove words that are known not to contain useful information based on a-priori knowledge of the language. For example, words such as “the” are not helpful in determining the domain concept of a document. Removing these words benefits both the computational complexity and the results by reducing the problem’s dimensionality and attenuating noise in the data.
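The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the tooling used in the study: the stop-word list, the suffix list (taken from the examples in the text), and the helper names `split_identifier`, `naive_stem`, and `preprocess` are all assumptions, and the suffix stripping is far cruder than an established stemmer.

```python
import re

# Illustrative stop-word and suffix lists; both are assumptions.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}
SUFFIXES = ("ing", "ly", "er")


def split_identifier(identifier):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        # Insert a break before each capital that follows a lowercase letter.
        words.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part).split())
    return [w.lower() for w in words]


def naive_stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize identifiers, drop stop words, and stem the remaining words."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_identifier(raw))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, `preprocess("the getUser and set_user")` yields the same word "user" for both the getter and the setter, which is the effect the tokenization stage is meant to achieve.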
After the preprocessing stage, we arrange all remaining data into a term–document matrix, for mathematical convenience. A term–document matrix is an $M \times N$ matrix with rows representing terms and columns representing documents. For example, an element $TD_{i,j}$ of the term–document matrix is non-zero when document $d_j$ contains term $t_i$. All elements of the term–document matrix are integer weights that indicate the frequency of occurrence of a given term in a given document. We then apply a weight transformation to the term–document matrix based on the statistics of occurrence for each term. Intuition suggests that not all terms in a document are equally important with regard to identifying the domain concept. The goal of the weighting transformation is to increase the influence of terms that help to identify distinct concepts and to decrease the influence of the remaining terms. The particular weighting scheme we applied is called term frequency-inverse document frequency:
$$\text{tf-idf}_t = tf_t \cdot \log\frac{N}{df_t} \tag{9}$$
The term $tf_t$ represents the global term frequency across all documents. The second term is the logarithm of the inverse document frequency, where $N$ is the number of documents in the total collection and $df_t$ is the number of documents in which term $t$ appears. Upon closer inspection, one can recognize that Equation 9 is: (a) greatest when a term is very frequent, but only appears in a small number of documents, (b) lowest when a term is present in all documents, and (c) between these two extreme cases when a term is infrequent in one document or occurs in many documents.
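The behavior described in cases (a) to (c) can be checked on a toy term–document matrix. The sketch below assumes the weight is applied entry-wise, i.e., each raw count is scaled by its term's factor $\log(N/df_t)$; this is the common textbook variant and may differ in detail from the weighting used in the study. The function name `tf_idf` and the example counts are hypothetical.

```python
import math


def tf_idf(term_document):
    """Scale each raw count by log(N / df_t) for its term (row)."""
    n_docs = len(term_document[0])
    weighted = []
    for row in term_document:
        # df_t: number of documents in which the term occurs at least once.
        df = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([count * idf for count in row])
    return weighted


# Rows are terms, columns are documents (functions); counts are made up.
counts = [
    [3, 0, 0, 0],  # frequent, but confined to one document
    [1, 1, 1, 1],  # present in every document
    [1, 1, 0, 0],  # appears in half of the documents
]
weights = tf_idf(counts)
```

In this example, the term confined to a single document receives the largest weight, the term present in every document is weighted to zero, and the third term lies between the two.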
Even for a modest-sized software project, the number of terms in the implementation vocabulary easily runs into the thousands. The problem with this becomes evident when adopting the vector-space model, where we consider a document as a vector that exists in a space spanned by the terms that comprise the document collection. Fortunately, this very high-dimensional space is extremely sparse, which allows us to project the documents into a lower-dimensional subspace and thereby makes the semantic similarity computation tractable. We achieve this using latent semantic indexing, a matrix decomposition technique that relies on the singular value decomposition. An added benefit of this technique is that it is capable of correctly resolving the relationships of synonymy and polysemy in natural language (Baeza-Yates et al. 1999). Furthermore, latent semantic indexing has been shown to be valid and reliable in the software-engineering domain (Bavota et al. 2013).
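For orientation, the projection can be written in its standard textbook form. The rank $k$ of the subspace is an assumed free parameter (no value is stated here), and the symbols $U_k$, $\Sigma_k$, $V_k$, and $\hat{d}_j$ are introduced only for this illustration:

```latex
% Truncated singular value decomposition of the weighted M x N matrix TD.
% k (the number of retained singular values) is an assumed free parameter.
TD \approx TD_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d}_j = U_k^{\top} d_j \in \mathbb{R}^{k}
```

Here $U_k$ is $M \times k$, $\Sigma_k$ is $k \times k$, $V_k$ is $N \times k$, and $\hat{d}_j$ is the representation of document $d_j$ in the $k$-dimensional latent space.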
In the final step of the analysis, we determine semantic coupling by computing the similarity between all document vectors projected onto the lower-dimensional subspace obtained by applying latent semantic indexing. We operationalize the similarity between two document vectors in the latent space using cosine similarity
$$\text{similarity}(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\lVert \vec{d}_i \rVert \, \lVert \vec{d}_j \rVert}$$
where the numerator is the dot product of the two document vectors and the denominator is the product of their magnitudes. Intuitively, cosine similarity expresses the angle between the two document vectors; it equals 1 when the two vectors are parallel and 0 when they are orthogonal. Two source-code artifacts are then considered to be semantically coupled if the cosine similarity exceeds a given threshold. We experimented extensively with a number of thresholds by manually inspecting the results and judging whether the functions were, in fact, semantically related using architectural knowledge of a well-known project. We found that a threshold of 0.65 was able to identify most semantic relationships with only a very small number of false positives. We did, however, cautiously choose the threshold to avoid false positives rather than false negatives.
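The thresholding step can be illustrated with a short sketch. The vectors below are made-up two-dimensional stand-ins for latent-space document vectors, and the function names `cosine_similarity` and `coupled_pairs` are hypothetical; only the threshold of 0.65 comes from the text.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coupled_pairs(vectors, threshold=0.65):
    """Return name pairs whose cosine similarity exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine_similarity(vectors[a], vectors[b]) > threshold
    ]


# Hypothetical latent-space vectors for three functions.
vectors = {
    "get_user": [1.0, 0.2],
    "set_user": [0.9, 0.3],
    "draw_window": [0.0, 1.0],
}
print(coupled_pairs(vectors))
```

Only the getter and setter point in nearly the same direction, so they form the single coupled pair; both pairs involving `draw_window` fall well below the threshold.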
Appendix B: Analysis Window Selection
We chose to use a sliding-window approach in our study to generate the time-resolved series of developer networks. Another option would have been to analyze the project using non-overlapping windows, but this can lead to problematic edge discontinuities between the analysis windows. For example, a set of several related changes to the software could be divided between two different analysis windows, even though the changes occurred temporally close together. For this reason, a sliding-window approach is superior to the alternative for our purposes, but we also recognized that overlapping windows could influence the appearance of developer transitions (see Section 3.3), because a commit can appear in two contiguous analysis windows. To test whether the overlapping windows distort the overall outcome, we compared all Markov chains using non-overlapping windows with those using overlapping windows. The comparison revealed that, in all projects, our conclusion that core developers are more stable than peripheral developers holds regardless of which windowing strategy is used. In most cases, using non-overlapping windows increased the probability that a core or peripheral developer leaves the project, but peripheral developers are always significantly more likely to leave. For example, in QEMU, using overlapping windows, core and peripheral developers leave the project with probabilities of 0.5 % and 10.9 %, respectively. With non-overlapping windows, these probabilities change to 13 % for core and 55 % for peripheral developers. The data for both sets of Markov chains are included on the supplementary Web site.
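The difference between the two windowing strategies can be made concrete with a small sketch. The window length, step size, and date range below are placeholder values chosen for illustration, not the study's settings, and the helper names `analysis_windows` and `windows_containing` are hypothetical.

```python
from datetime import date, timedelta


def analysis_windows(start, end, length_days, step_days):
    """Yield (window_start, window_end) pairs that fit inside [start, end)."""
    length = timedelta(days=length_days)
    step = timedelta(days=step_days)
    current = start
    while current + length <= end:
        yield current, current + length
        current += step


def windows_containing(day, windows):
    """Return the windows whose half-open interval contains the given day."""
    return [w for w in windows if w[0] <= day < w[1]]


start, end = date(2015, 1, 1), date(2015, 7, 1)
# Window length and step are placeholder values, not the study's settings.
overlapping = list(analysis_windows(start, end, 90, 45))
disjoint = list(analysis_windows(start, end, 90, 90))

commit_day = date(2015, 3, 1)
print(len(windows_containing(commit_day, overlapping)))
print(len(windows_containing(commit_day, disjoint)))
```

With a step smaller than the window length, the same commit falls into two contiguous windows, which is the effect that could influence the observed developer transitions; with a step equal to the window length, each commit falls into exactly one window.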
Cite this article
Joblin, M., Apel, S. & Mauerer, W. Evolutionary trends of developer coordination: a network approach. Empir Software Eng 22, 2050–2094 (2017). https://doi.org/10.1007/s10664-016-9478-9
DOI: https://doi.org/10.1007/s10664-016-9478-9