Stochastic actor-oriented modeling for studying homophily and social influence in OSS projects

Abstract

Open Source Software (OSS) projects are communities in which people “learn the ropes” from each other. The social and technical activities of developers evolve together, and as developers link to each other they become organized in a network of changing socio-technical connections. Traces of those activities, or behaviors, are typically visible to all, both in project repositories and in the communication among developers. Thus, in principle, it may be possible to study those traces to tell which observable socio-technical behaviors of developers are responsible for the formation of persistent links between them. It may also be possible to tell the extent to which those links participate in the spread of potential behavioral influences. Since OSS projects change in both social and technical activity over time, static approaches that either ignore time or reduce it to a few slices are frequently inadequate for studying these networks. On the other hand, ad hoc dynamic approaches are often only loosely supported by theory and can yield misleading findings. Here we adapt stochastic actor-oriented models (SAOMs) from social network analysis. These models enable the study of the interplay between behavior, influence, and network architecture in dynamic networks in a statistically sound way. We apply them in case studies of two Apache Software Foundation projects, with code ownership and developer productivity as the behaviors of interest. We find evidence of significant social selection effects (homophily) in both projects, but in different directions. However, we find no evidence for the spread (social influence) of either code ownership or developer productivity behaviors through the networks.

Notes

  1. These behaviors have been studied before as important indicators of bugs, socio-technical structure, and overall software quality (Curtis et al. 1988; Fong Boh et al. 2007; Weyuker et al. 2008; Bird et al. 2011; Rahman and Devanbu 2011), and are discussed further in Section 5.

  2. The problem of causal inference is not limited to the study of epidemiological processes.

  3. Also, these alternative models generally lack fundamental statistical data fitting ability.

  4. http://groups.yahoo.com/groups/stocnet/

  5. To test this assumption, we built a separate model for each pair of consecutive waves and estimated its parameters. These parameter estimates were then compared across models. The tests showed no significant departures that would indicate non-constant parameter estimates across time.

  6. Not to be confused with the SIENA model rate parameter, described in Section 5.1.4.

  7. Note that this is equivalently the smallest such β.

  8. Axis2/Java had high fluctuations in activity (both social and technical) towards the beginning of its lifetime. As a result, models that included these time periods proved difficult to estimate; in particular, the number of ministeps required before arriving at the time for the next wave became too large. As we are interested in the average social and technical behaviors of projects, the offending waves were removed from the analysis for this project.

  9. The authors also attempted to separate waves by an equal number of days. However, this led to extreme skew and imbalance in network size (nodes and ties), as projects tend to have “bursty” activity; earlier waves were much less varied than later waves.

  10. In particular, there were instabilities in estimating the network rate parameters – the number of “chances” an actor has to change its ties.

  11. The choice of 8 waves is likely specific to our data – SIENA supports any number of waves, though time complexity increases with more waves.

  12. We initially built models on 3 projects: Ant, Axis2/Java, and Derby. However, Derby results were similar to Axis2/Java results (e.g. positive behavioral selection). As we are interested in presenting case studies of the application of the SAOM method in OSS, we only discuss results for Ant and Axis2/Java.

  13. http://git-scm.com/

  14. An exception to this rule exists if the ego × alter interaction selection effect is most significant. In this case, one must control for the lower level structures of ego value and alter value when estimating the interaction effect. This is standard practice in general statistical modeling.

  15. Note that z_j is used here to represent actor j’s behavioral value, while the z parameter is missing from the function signature. This is to emphasize that z_j here is treated as a covariate and is not modeled by the network objective function.

  16. For file ownership behavior in Axis2/Java (Table 6), adding the average alter influence effect caused high instability in estimation of the model. As a result, the model for high file ownership behavior only includes the linear shape and quadratic shape parameters. The exclusion of this parameter from the model should not appreciably affect our outcomes or goodness of fit, as the score-type test of this parameter suggested insignificance.

  17. Evidence of clustering initially raised a concern with the authors that the constructed networks had extreme levels of clustering. Further analysis showed that this was not the case; the clustering is at an acceptable level according to prior work in these social networks.

  18. Recall that we did not estimate average alter influence for the high file ownership model in this case.
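
Note 5 describes comparing parameter estimates across models fitted to consecutive waves. The notes do not say how that comparison was carried out, so the following is only a minimal sketch of one simple way it could be done: a two-sided z-test on the difference between two estimates, treating them as independent (an approximation, since adjacent periods share a wave). The function names and the numbers in `period_estimates` are hypothetical and are not taken from the paper.

```python
import math


def compare_estimates(est_a, se_a, est_b, se_b):
    """Two-sided z-test for the difference between two estimates.

    Assumes the two estimates are independent and approximately normal.
    Returns the z statistic and the two-sided p-value.
    """
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return z, p


def flag_heterogeneity(estimates, alpha=0.05):
    """Return adjacent period pairs whose estimates differ at level alpha."""
    flagged = []
    for i in range(len(estimates) - 1):
        (a, se_a), (b, se_b) = estimates[i], estimates[i + 1]
        z, p = compare_estimates(a, se_a, b, se_b)
        if p < alpha:
            flagged.append((i, i + 1, z, p))
    return flagged


# Hypothetical (estimate, standard error) pairs for one effect,
# one pair per model fitted to a pair of consecutive waves.
period_estimates = [(0.42, 0.15), (0.35, 0.18), (0.51, 0.20)]

if __name__ == "__main__":
    print(flag_heterogeneity(period_estimates))
```

An empty result from `flag_heterogeneity` would be consistent with parameters that stay constant over time; with many effects and periods, a correction for multiple comparisons would also be worth considering.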

References

  1. Anderson RM, May RM, Anderson B (1992) Infectious diseases of humans: dynamics and control, vol 28. Wiley Online Library

  2. Baerveldt C, de la Rúa F, Van de Bunt GG, et al (2010) Why and how selection patterns in classroom networks differ between students. The potential influence of networks size preferences, level of information, and group membership, vol 19, pp 273–298

  3. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512

  4. Barthélemy M, Barrat A, Pastor-Satorras R, Vespignani A (2005) Dynamical patterns of epidemic outbreaks in complex heterogeneous networks. J Theor Biol 235(2):275–288

  5. Basili VR, Caldiera G (1995) Improve software quality by reusing knowledge and experience. Sloan Manag Rev:55–64

  6. Batagelj V, Bren M (1995) Comparing resemblance measures. J Classif 12(1):73–90

  7. Berardo R (2014) The evolution of self-organizing communication networks in high-risk social-ecological systems. Int J Commons 8(1):236–258

  8. Bettenburg N, Hassan AE (2010) Studying the impact of social structures on software quality. In: 2010 IEEE 18th international conference on program comprehension (ICPC). IEEE, pp 124–133

  9. Bird C, Gourley A, Devanbu P, Gertz M, Swaminathan A (2006) Mining email social networks. In: Proceedings of the 2006 international workshop on Mining software repositories. ACM, pp 137–143

  10. Bird C, Nagappan N, Gall H, Murphy B, Devanbu P (2009) Putting it all together: Using socio-technical networks to predict failures. In: ISSRE’09. 20th international symposium on software reliability engineering, 2009. IEEE, pp 109–119

  11. Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. ACM, pp 4–14

  12. Bird C, Pattison D, D’Souza R, Filkov V, Devanbu P (2008) Latent social structure in open source projects. In: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, pp 24–35

  13. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex networks: structure and dynamics. Phys Rep 424(4):175–308

  14. Price DDS (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27(5):292–306

  15. Cardy JL, Grassberger P (1985) Epidemic models and percolation. J Phys A Math Gen 18(6):L267

  16. Cataldo M, Wagstrom PA, Herbsleb JD, Carley KM (2006) Identification of coordination requirements: implications for the design of collaboration and awareness tools. In: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work. ACM, pp 353–362

  17. Cheadle JE, Stevens M, Williams DT, Goosby BJ (2013) The differential contributions of teen drinking homophily to new and existing friendships: an empirical assessment of assortative and proximity selection mechanisms. Soc Sci Res 42(5):1297–1310

  18. Cherry S, Robillard PN (2008) The social side of software engineering: a real ad hoc collaboration network. Int J Hum Comput Stud 66(7):495–505

  19. Cohen-Cole E, Fletcher JM (2008) Detecting implausible social network effects in acne, height, and headaches: longitudinal analysis. BMJ 337

  20. Cohen-Cole E, Fletcher JM (2008) Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic. J Health Econ 27(5):1382–1387

  21. Crowston K, Howison J (2005) The social structure of free and open source software development. First Monday 10(2)

  22. Curtis B, Krasner H, Iscoe N (1988) A field study of the software design process for large systems. Commun ACM 31(11):1268–1287

  23. Davis JA (1970) Clustering and hierarchy in interpersonal relations: Testing two graph theoretical models on 742 sociomatrices. Am Sociol Rev:843–851

  24. De Souza C, Froehlich J, Dourish P (2005) Seeking the source: software source code as a social and technical artifact. In: Proceedings of the 2005 international ACM SIGGROUP conference on Supporting group work. ACM, pp 197–206

  25. Ducheneaut N (2005) Socialization in an open source software community: a socio-technical analysis. Comput Supported Coop Work (CSCW) 14(4):323–368

  26. Fong Boh W, Slaughter SA, Espinosa JA (2007) Learning from experience in software development: a multilevel analysis. Manag Sci 53(8):1315–1331

  27. Gharehyazie M, Posnett D, Filkov V (2013) Social activities rival patch submission for prediction of developer initiation in oss projects. In: 2013 29th IEEE international conference on software maintenance (ICSM). IEEE, pp 340–349

  28. Gharehyazie M, Posnett D, Vasilescu B, Filkov V (2014) Developer initiation and social interactions in oss: A case study of the apache software foundation. Empir Softw Eng:1–36

  29. Goeminne M, Mens T (2013) A comparison of identity merge algorithms for software repositories. Sci Comput Program 78(8):971–986

  30. Greenan CC (2014) Diffusion of innovations in dynamic networks. J R Stat Soc: Ser A (Statistics in Society)

  31. Halliday TJ, Kwak S (2009) Weight gain in adolescents and their peers. Econ Hum Biol 7(2):181–190

  32. Hintze JL, Nelson RD (1998) Violin plots: a box plot-density trace synergism. Am Stat 52(2):181–184

  33. Holland PW, Leinhardt S (1971) Transitivity in structural models of small groups. Comparative Group Studies

  34. Holme P (2003) Network dynamics of ongoing social relationships. EPL (Europhys Lett) 64(3):427

  35. Hong Q, Kim S, Cheung S, Bird C (2011) Understanding a developer social network and its evolution. In: 2011 27th IEEE international conference on software maintenance (ICSM). IEEE, pp 323–332

  36. Jaccard P (1912) The distribution of the flora in the alpine zone. 1. New Phytol 11(2):37–50

  37. Jackson MO, Rogers BW (2007) Meeting strangers and friends of friends: How random are social networks? Am Econ Rev:890–915

  38. Johnson B, Song Y, Murphy-Hill E, Bowdidge R (2013) Why don’t software developers use static analysis tools to find bugs? In: 2013 35th international conference on software engineering (ICSE). IEEE, pp 672–681

  39. Koskinen J, Edling C (2012) Modelling the evolution of a bipartite network: peer referral in interlocking directorates. Soc Networks 34(3):309–322

  40. Kouters E, Vasilescu B, Serebrenik A, van den Brand MG (2012) Who’s who in gnome: using lsa to merge software repository identities. In: 2012 28th IEEE international conference on software maintenance (ICSM). IEEE, pp 592–595

  41. Lazega E, Mounier L, Tubaro P, et al (2011) Norms, advice networks and joint economic governance: the case of conflicts among shareholders at the commercial court of paris. Does Economic Governance Matter:46–70

  42. Lopez-Fernandez L, Robles G, Gonzalez-Barahona JM, et al (2004) Applying social network analysis to the information in cvs repositories. In: International workshop on mining software repositories. IET, pp 101–105

  43. Lospinoso J (2010) Testing and modeling time heterogeneity in longitudinal studies of social networks. A tutorial in rsiena

  44. Lospinoso J (2012) Statistical models for social network dynamics. Ph.D. thesis, Oxford University

  45. Lospinoso JA, Schweinberger M, Snijders TA, Ripley RM (2011) Assessing and accounting for time heterogeneity in stochastic actor oriented models. ADAC 5(2):147–176

  46. Madey G, Freeh V, Tynan R (2002) The open source software development phenomenon: an analysis based on social network theory. AMCIS 2002 Proc:247

  47. Manski CF (1993) Identification of endogenous social effects: the reflection problem. Rev Econ Stud 60(3):531–542

  48. Meneely A, Williams L (2009) Secure open source collaboration: an empirical study of linus’ law. In: Proceedings of the 16th ACM conference on Computer and communications security. ACM, pp 453–462

  49. Meneely A, Williams L, Snipes W, Osborne J (2008) Predicting failures with developer networks and social network analysis. In: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, pp 13–23

  50. Mockus A (2007) Large-scale code reuse in open source software. In: First International Workshop on Emerging Trends in FLOSS Research and Development, 2007. FLOSS’07. IEEE, pp 7–7

  51. Nagappan N, Murphy B, Basili V (2008) The influence of organizational structure on software quality: an empirical case study. In: Proceedings of the 30th international conference on Software engineering. ACM, pp 521–530

  52. Newman ME (2002) Spread of epidemic disease on networks. Phys Rev E 66(1):016128

  53. Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86(14):3200

  54. Pinzger M, Nagappan N, Murphy B (2008) Can developer-module networks predict failures? In: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, pp 2–12

  55. Rahman F, Devanbu P (2011) Ownership, experience and defects: a fine-grained study of authorship. In: Proceedings of the 33rd international conference on software engineering. ACM, pp 491–500

  56. Ripley RM, Snijders TA, Boda Z, Vörös A, Preciado P (2014) Manual for siena version 4.0. University of Oxford

  57. Rogers EM (2010) Diffusion of innovations. Simon and Schuster

  58. Ruths J, Ruths D (2014) Control profiles of complex networks. Science 343(6177):1373–1376

  59. Scacchi W, Feller J, Fitzgerald B, Hissam S, Lakhani K (2006) Understanding free/open source software development processes. Software Process: Improvement and Practice 11(2):95–105

  60. Schweinberger M (2012) Statistical modelling of network panel data: Goodness of fit. Br J Math Stat Psychol 65(2):263–281

  61. Schweinberger M, Snijders TA (2007) Markov models for digraph panel data: Monte Carlo-based derivative estimation. Computational statistics & data analysis 51(9):4465–4483

  62. Shalizi CR, Thomas AC (2011) Homophily and contagion are generically confounded in observational social network studies. Sociol Methods Res 40(2):211–239

  63. Shi H, Duan Z, Chen G (2008) An sis model with infective medium on complex networks. Physica A: Statistical Mechanics and its Applications 387(8):2133–2144

  64. Singh PV (2010) The small-world effect: The influence of macro-level properties of developer collaboration networks on open-source project success. ACM Trans Softw Eng Methodol (TOSEM) 20(2):6

  65. Snijders T, van Duijn M (1997) Simulation for statistical inference in dynamic network models. In: Simulating social phenomena. Springer, pp 493–512

  66. Snijders T, Steglich C, Schweinberger M (2007) Modeling the coevolution of networks and behavior

  67. Snijders TA (1996) Stochastic actor-oriented models for network change. J Math Sociol 21(1–2):149–172

  68. Snijders TA (2001) The statistical evaluation of social network dynamics. Sociol Methodol 31(1):361–395

  69. Snijders TA (2005) Models for longitudinal network data. Models and methods in social network analysis 1:215–247

  70. Snijders TA (2014) Siena algorithms

  71. Snijders TA, Van de Bunt GG, Steglich CE (2010) Introduction to stochastic actor-based models for network dynamics. Soc Networks 32(1):44–60

  72. Snijders TA, Koskinen J, Schweinberger M, et al (2010) Maximum likelihood estimation for social network dynamics. Ann Appl Stat 4(2):567–588

  73. Snijders TA, Lomi A, Torló VJ (2013) A model for the multiplex dynamics of two-mode and one-mode networks, with an application to employment preference, friendship, and advice. Soc Networks 35(2):265–276

  74. Steglich C, Snijders TA, Pearson M (2010) Dynamic networks and behavior: separating selection from influence. Sociol Methodol 40(1):329–393

  75. Storey MA, Treude C, van Deursen A, Cheng LT (2010) The impact of social media on software engineering practices and tools. In: Proceedings of the FSE/SDP workshop on Future of software engineering research. ACM, pp 359–364

  76. Vasilescu B, Serebrenik A, Goeminne M, Mens T (2014) On the variation and specialisation of workload: a case study of the gnome ecosystem community. Empir Softw Eng 19(4):955–1008

  77. Veenstra R, Dijkstra JK, Steglich C, Van Zalk MH (2013) Network–behavior dynamics. J Res Adolesc 23(3):399–412

  78. Vespignani A (2012) Modelling dynamical processes in complex socio-technical systems. Nat Phys 8(1):32–39

  79. Wasserman S (1980) A stochastic model for directed graphs with transition rates determined by reciprocity. Sociol Methodol 11:392–412

  80. Wasserman S (1994) Social network analysis: methods and applications, vol 8. Cambridge university press

  81. Wasserman S, Iacobucci D (1988) Sequential social network data. Psychometrika 53(2):261–282

  82. Weyuker EJ, Ostrand TJ, Bell RM (2008) Do too many cooks spoil the broth? using the number of developers to enhance defect prediction models. Empir Softw Eng 13(5):539–559

  83. Xuan Q, Devanbu PT, Filkov V (2014) Converging work-talk patterns in online task-oriented communities. arXiv:1404.5708

  84. Xuan Q, Filkov V (2014) Building it together: synchronous development in oss. In: Proceedings of the 36th international conference on software engineering. ACM, pp 222–233

  85. Zeggelink E (1994) Dynamics of structure: an individual oriented approach. Soc Networks 16(4):295–333

  86. Zhang H, Fu X (2009) Spreading of epidemics on scale-free networks with nonlinear infectivity. Nonlinear Anal Theory, Methods & Applications 70(9):3273–3278

Acknowledgments

The authors thank Mohammad Gharehyazie for sharing his ASF project data, Saheel Godhane for fruitful discussion during the early stages of the project, and the anonymous reviewers for their helpful suggestions on prior versions of the manuscript. The authors gratefully acknowledge support from the Air Force Office of Scientific Research, award FA955-11-1-0246, and a faculty grant from UC Davis. The authors are thankful for generous support from UC Davis.

Author information

Corresponding author

Correspondence to David Kavaler.

Additional information

Data and scripts used in this work can be found at http://web.cs.ucdavis.edu/~filkov/software/ASF-Siena/.

Communicated by: Emerson Murphy-Hill

Cite this article

Kavaler, D., Filkov, V. Stochastic actor-oriented modeling for studying homophily and social influence in OSS projects. Empir Software Eng 22, 407–435 (2017). https://doi.org/10.1007/s10664-016-9431-y

Keywords

  • Actor oriented models
  • Apache
  • Open source
  • Social influence
  • Social selection
  • Homophily
  • Siena