The Babel of Software Development: Linguistic Diversity in Open Source

  • Bogdan Vasilescu
  • Alexander Serebrenik
  • Mark G. J. van den Brand
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8238)

Abstract

Open source software (OSS) development communities are typically highly specialized, on the one hand, and experience high turnover, on the other. The combination of specialization and turnover can cause parts of a system implemented in a certain programming language to become unmaintainable if knowledge of that language disappears together with the retiring developers.

Inspired by measures of linguistic diversity from the study of natural languages, we propose a method to quantify the risk of not having maintainers for code implemented in a certain programming language. To illustrate our approach, we studied risks associated with different languages in Emacs, and found examples of low risk due to high popularity (e.g., C, Emacs Lisp); low risk due to similarity with popular languages (e.g., C++, Java, Python); or high risk due to both low popularity and low similarity with popular languages (e.g., Lex). Our results show that methods from the social sciences can be successfully applied in the study of information systems, and open numerous avenues for future research.
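The abstract does not spell out the measure itself. As a rough illustration of the kind of linguistic-diversity measure it alludes to, the sketch below computes Greenberg's unweighted monolingual diversity index, A = 1 − Σ pᵢ², where pᵢ is the share of "speakers" of language i. It is applied here to programming languages. The function name and the developer counts are hypothetical, chosen only for illustration; this is not the authors' risk metric and the numbers are not data from the Emacs study.

```python
def greenberg_diversity(speakers_per_language):
    """Greenberg's unweighted monolingual index A = 1 - sum(p_i ** 2).

    A is the probability that two individuals drawn at random (with
    replacement) do not share a language, assuming each individual
    "speaks" exactly one language.
    """
    total = sum(speakers_per_language.values())
    if total <= 0:
        raise ValueError("at least one speaker is required")
    return 1.0 - sum((n / total) ** 2 for n in speakers_per_language.values())


# Hypothetical counts of developers per primary language (illustrative only).
developers = {"C": 6, "Emacs Lisp": 3, "Lex": 1}

print(round(greenberg_diversity(developers), 2))  # 0.54
```

A value near 0 means almost everyone shares one language; a value near 1 means knowledge is spread thinly across many languages. This simple index ignores both multilingual developers and similarity between languages, two aspects the abstract indicates the proposed method takes into account.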

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Bogdan Vasilescu (1)
  • Alexander Serebrenik (1)
  • Mark G. J. van den Brand (1)
  1. Eindhoven University of Technology, The Netherlands
