Antecedents of open source software defects: A data mining approach to model formulation, validation and testing

Information Technology and Management

Abstract

This paper develops, tests, and validates a model for the antecedents of open source software (OSS) defects, using Data Mining and Text Mining. The public archives of OSS projects are used to access historical data on over 5,000 active and mature OSS projects. Using domain knowledge and exploratory analysis, a wide range of variables is identified from the process, product, resource, and end-user characteristics of a project to ensure that the model is robust and considers all aspects of the system. Multiple Data Mining techniques are used to refine the model, and the data are enriched through Text Mining for knowledge discovery from qualitative information. The study demonstrates the suitability of Data Mining and Text Mining for model building. Results indicate that project type, end-user activity, process quality, team size, and project popularity have a significant impact on the defect density of operational OSS projects. Since many organizations, both for-profit and not-for-profit, are beginning to use Open Source Software as an economic alternative to commercial software, these results can inform decisions about which software an organization can reasonably maintain.

Notes

  1. We acknowledge that there are several large-scale OSS projects that are developed by more traditional teams of software programmers. Such projects provide face-to-face communication opportunities, user conferences, and other avenues of collaboration that are not confined to online development. However, the vast majority of OSS projects, especially those hosted by SourceForge.net, remain small-scale projects developed through online collaboration.

  2. Since the data were obtained from a third party, they were independently validated by randomly verifying variables against the actual SF dataset. The results were also compared with another independent extraction of variables from the same warehouse, to verify the queries used for data extraction.

  3. We use the raw KSLOC as an indicator of project size, with the acknowledgement that program size can be language dependent. However, KSLOC remains one of the most commonly used measures of software size, especially when the purpose is to control for project scale [28].

  4. All SQL queries are available from the first author upon request.
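As a rough illustration of how raw KSLOC (note 3) can control for project scale, the sketch below normalizes a defect count by size. It assumes the common definition of defect density as defects per thousand source lines of code; the function name, project names, and numbers are hypothetical and are not taken from the study's data or its exact operationalization.

```python
def defect_density(defect_count: int, sloc: int) -> float:
    """Return defects per thousand source lines of code (KSLOC).

    Assumes defect density = defects / KSLOC, a common convention;
    the study's exact definition may differ.
    """
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    ksloc = sloc / 1000.0
    return defect_count / ksloc


# Hypothetical projects: (name, reported defects, raw source lines)
projects = [
    ("alpha", 120, 48_000),
    ("beta", 15, 2_500),
]

densities = {name: defect_density(d, s) for name, d, s in projects}
for name, density in densities.items():
    print(f"{name}: {density:.2f} defects/KSLOC")
```

In this made-up example the smaller project has fewer defects in absolute terms but a higher density, which is the kind of scale effect that normalizing by KSLOC is meant to expose.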

References

  1. Abdel-Hamid T (1989) The dynamics of software project staffing: a system dynamics based simulation approach. IEEE Trans Softw Eng 15(2):109–119

  2. Abdel-Hamid T (1992) Investigating the impacts of managerial turnover/succession on software project performance. J Manage Inf Syst 9(2):127–144

  3. Banker RD, Datar SM, Kemerer CF (1991) A model to evaluate variables impacting the productivity of software maintenance projects. Manage Sci 37(1):1–18

  4. Banker RD, Datar SM, Kemerer CF, Zweig D (2002) Software errors and software maintenance management. Inf Technol Manage 3(1–2):25–41

  5. Banker RD, Davis GB, Slaughter SA (1998) Software development practices, software complexity and software maintenance performance. Manage Sci 44(4):433–450

  6. Barki H, Hartwick J (1994) Measuring user participation, user involvement, and user attitude. MIS Q 18(1):59–82

  7. Barry EJ, Kemerer CF, Slaughter SA (2006) Environmental volatility, development decisions, and software volatility: a longitudinal analysis. Manage Sci 52(3):448–464

  8. Bennett KH, Rajlich VT (2002) Software maintenance and evolution: a roadmap. In: 22nd ICSE, Limerick, Ireland, pp 75–87

  9. Biyani S, Santhanam P (1998) Exploring defect data from development and customer usage on software modules over multiple releases. IBM T.J. Watson Research Center, Yorktown Heights

  10. Boehm B (1987) Improving software productivity. IEEE Comput 20(1):43–57

  11. Brooks FP (1995) The mythical man-month. Addison-Wesley, Reading

  12. Bryant A, Charmaz K (2007) Grounded theory in historical perspective: an epistemological account. In: Bryant A, Charmaz K (eds) The Sage handbook of grounded theory. Sage, Los Angeles, pp 31–57

  13. Cearley DW, Fenn J, Plummer DC (2005) Gartner’s positions on the five hottest IT topics and trends in 2005. Gartner Inc, Stamford

  14. Chiarini-Tremblay M, Berndt DJ, Foulis P, Luther S (2005) Utilizing text mining techniques to identify fall related injuries. In: Eleventh Americas conference on information systems, Omaha, NE, 11–14 August 2005, pp 1497–1504

  15. Chulani S, Ray B, Santhanam P, Leszkowicz R (2003) Metrics for managing customer view of software quality. In: Proceedings of ninth international software metrics symposium, pp 189–198

  16. Conway ME (1968) How do committees invent? Datamation 14(4):28–31

  17. Crowston K, Annabi H, Howison J (2003) Defining open source project success. In: International conference of information systems, Seattle, WA, 14–17 December 2003

  18. Deerwester EA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci Technol 41(6):391–401

  19. Dempsey BJ, Weiss D, Jones P, Greenberg J (2002) Who is an open source software developer? Commun ACM 45(2):67–72

  20. Dinh-Trong TT, Bieman JM (2005) The FreeBSD project: a replication case study of open source development. IEEE Trans Softw Eng 31(6):481–495

  21. Eastwood A (1993) Firm fires shots at legacy systems. Comput Canada 19(2):17

  22. Erlich L (2000) Leveraging legacy system dollars for e-business. IT Pro 17–23

  23. Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) Data mining and knowledge discovery in databases. Commun ACM 39(11):37–54

  24. Feller J, Fitzgerald B (2002) Understanding open source software development. Addison-Wesley, London

  25. Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689

  26. Fenton NE, Pfleeger S (1991) Software metrics: a rigorous approach. Chapman & Hall, New York

  27. German DM (2006) An empirical study of fine-grained software modifications. Empir Softw Eng 11:369–393

  28. Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: International workshop on principles of software evolution. ACM, Vienna, Austria

  29. Graves TL, Karr AK, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661

  30. Gremillion LL (1984) Determinants of program repair maintenance requirements. Commun ACM 27(8):826–832

  31. Harter DE, Krishnan MS, Slaughter SA (2000) Effects of process maturity on quality, cycle time, and effort in software product development. Manage Sci 46(4):451–466

  32. Hartwick J, Barki H (1994) Explaining the role of user participation in information system use. Manage Sci 40(4):440–465

  33. Herbsleb J, Zubrow D, Goldenson D, Hayes W, Paulk M (1997) Software quality and the capability maturity model. Commun ACM 40(6):30–40

  34. Herbsleb JD, Moitra D (2001) Global software development. IEEE Softw 18(2):16–20

  35. Hosmer DW, Lemeshow S (2000) Applied logistic regression. Wiley, New York

  36. Huntley CL (2003) Organizational learning in open-source software projects: an analysis of debugging data. IEEE Trans Eng Manage 50(4):485–493

  37. IEEE-STD-1061 (1993) IEEE standard for a software quality metrics methodology. Institute of Electrical and Electronics Engineers, Inc, New York

  38. Ives B, Olson MH (1984) User involvement and MIS success: a review of research. Manage Sci 30(5):586–603

  39. Jensen C, Scacchi W (2005) Collaboration, leadership, control, and conflict negotiation and the Netbeans.org open source software development community. In: Proceedings of the 38th annual Hawaii international conference on system sciences, HICSS '05, Hawaii, p 196.2

  40. Jensen C, Scacchi W (2004) Data mining for software process discovery in open source software development communities. In: Proceedings of workshop on mining software repositories, Edinburgh, Scotland, pp 449–462

  41. Popstojanova K, Trivedi K (2001) Architecture based approach to reliability assessment of software systems. Perform Eval 45:179–204

  42. Khoshgoftaar T, Munson J, Lanning D (1993) A comparative study of predictive models for program changes during system testing and maintenance. In: International conference on software maintenance

  43. Koch S (2007) Software evolution in open source projects—a large-scale investigation. J Softw Maint Evol Res Prac 19(6):361–382

  44. Li PL, Shaw M, Herbsleb J, Ray B, Santhanam P (2004) Empirical evaluation of defect projection models for widely-deployed production software systems. In: Proceedings of the 12th ACM SIGSOFT international symposium on foundations of software engineering, Newport Beach, CA, pp 263–272

  45. Lientz BP, Swanson EB (1980) Software maintenance management. Addison-Wesley, Reading

  46. Lind RK, Vairavan K (1989) An experimental investigation of software metrics and their relationship to software development effort. IEEE Trans Softw Eng 15(5):649–653

  47. Madey G, SourceForge.net Research Data Archive, http://www.nd.edu/~oss/Data/data.html. Accessed May 2007

  48. McConnell S (1999) Open source methodology: ready for prime time? IEEE Softw 16(4):6–8

  49. Melouk S, Raja U, Keskin B (forthcoming) Managing resource allocation and task prioritization in a large scale virtual development project. Inform Res Manage J

  50. Menard SW (2002) Applied logistic regression analysis. Sage, Thousand Oaks

  51. Michlmayr M, Senyard A (2006) A statistical analysis of defects in Debian and strategies for improving quality in free software projects. In: Bitzer J, Schröder PJH (eds) The economics of open source software development, pp 131–148

  52. Mockus A, Fielding RT, Herbsleb J (2002) Two case studies of open source software development: Apache and Mozilla. ACM Trans Softw Eng Methodol (TOSEM) 11(3):309–346

  53. Mockus A, Weiss D, Zhang P (2003) Understanding and predicting effort in software projects. In: International conference on software engineering

  54. Musa J, Iannino A, Okimoto K (1990) Software reliability. McGraw-Hill, New York

  55. Pang A (2008) Worldwide Enterprise Applications 2008–2012 Forecast Update and 2007 Vendor Shares, IDC, Editor, p 55

  56. Park RM (1992) Software size measurement: a framework for counting source statements. CMU-SEI-92-TR-2. Software Engineering Institute, Pittsburgh

  57. Paulson JW, Succi G, Eberlein A (2004) An empirical study of open-source and closed-source software products. IEEE Trans Softw Eng 30(4):246–256

  58. Plekhanova V (1999) Capability and compatibility measurement in software process improvement. In: 2nd European software measurement conference (FESMA’99), Amsterdam, The Netherlands

  59. Pressman R (2004) Software engineering: a practitioner’s approach with bonus chapter on agile development. McGraw-Hill Science/Engineering/Math, New York

  60. Raja U, Hale DP, Hale JE (2009) Modeling software evolution defects: a time series approach. J Softw Maint Evol Res Pract 21(1):49–71

  61. Raymond ES, O’Reilly T (2001) Cathedral and the Bazaar, http://www.tuxedo.org/~esr/writings/cathedral-bazaar/cathedral-bazaar

  62. Roberts JA, Hann I-H, Slaughter S (2006) Understanding the motivations, participation, and performance of open source software developers: a longitudinal study of the apache projects. Manage Sci 52(7):984–999

  63. Robles G, Amor JJ, Gonzalez-Barahona JM, Herraiz I (2005) Evolution and growth in large libre software projects. In: Eighth international workshop on principles of software evolution (IWPSE’05). IEEE Computer Society, pp 165–174

  64. Sabherwal R, Jeyaraj A, Chowa C (2006) Information system success: individual and organizational determinants. Manage Sci 52(12):1849–1864

  65. Scacchi W (2004) Free and open source development practices in the game community. IEEE Softw 21(1):59–66

  66. Scacchi W (2004) Understanding free/open source software evolution: applying, breaking and rethinking the laws of software evolution. In: Madhavji NH et al (eds) Software evolution. Wiley, New York

  67. Slaughter S, Harter DE, Krishnan MS (1998) Evaluating the cost of software quality. Commun ACM 41(8):67–73

  68. Stamelos I, Angelis L, Oikonomou A, Bleris GL (2002) Code quality analysis in open source software development. Inf Syst J 12:43–60

  69. Stewart KJ, Ammeter AP (2006) Impacts of license choice and organizational sponsorship on user interest and development activity in open source software projects. Inf Syst Res 17(2):126–144

  70. Subramaniam C, Sen R, Nelson ML (2009) Determinants of open source software project success: a longitudinal study. Decis Support Syst 46(2):576–585

  71. Swanson EB, Dans E (2000) System life expectancy and the maintenance effort: exploring their equilibration. MIS Q 24(2):277–297

  72. von Krogh G, Spaeth S, Lakhani KR (2003) Community, joining, and specialization in open source software innovation: a case study. Res Policy 32(7):1217–1241

  73. Williams CC, Hollingsworth JK (2005) Automatic mining of source code repositories to improve bug finding techniques. IEEE Trans Softw Eng 31(6):466–480

  74. Ying ATT, Murphy GC, Ng R, Chu-Carroll MC (2004) Predicting source code changes by mining change history. IEEE Trans Softw Eng 30(9):574–586

  75. Zelkowitz MV, Shaw AC, Gannon JD (1979) Principles of software engineering and design. Prentice Hall Inc, Englewood Cliffs

  76. Zimmermann T, Zeller A, Weissgerber P, Diehl S (2005) Mining version histories to guide software changes. IEEE Trans Softw Eng 31(6):429–445

Author information

Corresponding author

Correspondence to Uzma Raja.

About this article

Cite this article

Raja, U., Tretter, M.J. Antecedents of open source software defects: A data mining approach to model formulation, validation and testing. Inf Technol Manag 10, 235–251 (2009). https://doi.org/10.1007/s10799-009-0062-5

  • DOI: https://doi.org/10.1007/s10799-009-0062-5
