Abstract
This paper develops, tests, and validates a model for the antecedents of open source software (OSS) defects, using Data and Text Mining. The public archives of OSS projects are used to access historical data on over 5,000 active and mature OSS projects. Using domain knowledge and exploratory analysis, a wide range of variables is identified from the process, product, resource, and end-user characteristics of a project to ensure that the model is robust and considers all aspects of the system. Multiple Data Mining techniques are used to refine the model, and the data is enriched by the use of Text Mining for knowledge discovery from qualitative information. The study demonstrates the suitability of Data Mining and Text Mining for model building. Results indicate that project type, end-user activity, process quality, team size, and project popularity have a significant impact on the defect density of operational OSS projects. Since many organizations, both for-profit and not-for-profit, are beginning to use Open Source Software as an economic alternative to commercial software, these results can inform decisions about which software an organization can reasonably maintain.
Notes
We acknowledge that there are several large-scale OSS projects that are developed by more traditional teams of software programmers. Such projects provide face-to-face communication opportunities, user conferences, and other avenues of collaboration that are not confined to online development. However, the vast majority of OSS projects, especially those hosted by SourceForge.net, remain small-scale projects developed through online collaboration.
Since the data was obtained from a third party, it was independently validated by randomly verifying variables against the actual SourceForge (SF) dataset. The results were also compared with another independent extraction of variables from the same warehouse, to verify the queries used for data extraction.
We use the raw KSLOC as an indicator of project size, with the acknowledgement that program size can be language dependent. However, KSLOC remains one of the most commonly used measures of software size, especially when the purpose is to control for project scale [28].
All SQL queries are available from the first author upon request.
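As an illustration of the size normalization mentioned in the KSLOC note above, the following is a minimal sketch, assuming defect density is computed as reported defects per thousand source lines of code. The function name and the sample figures are hypothetical and are not taken from the study.

```python
def defect_density(defect_count: int, sloc: int) -> float:
    """Return defects per thousand source lines of code (KSLOC).

    Assumption: density = defects / KSLOC; the study's exact
    operationalization may differ.
    """
    if sloc <= 0:
        raise ValueError("sloc must be positive")
    ksloc = sloc / 1000.0  # convert raw line count to KSLOC
    return defect_count / ksloc


# Hypothetical project: 42 reported defects in 12,000 source lines
print(defect_density(42, 12_000))  # 3.5
```

Because raw KSLOC is language dependent, as the note acknowledges, densities computed this way are most comparable across projects written in similar languages.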
References
Abdel-Hamid T (1989) The dynamics of software project staffing: a system dynamics based simulation approach. IEEE Trans Softw Eng 15(2):109–119
Abdel-Hamid T (1992) Investigating the impacts of managerial turnover/succession on software project performance. J Manage Inf Syst 9(2):127–144
Banker RD, Datar SM, Kemerer CF (1991) A model to evaluate variables impacting the productivity of software maintenance projects. Manage Sci 37(1):1–18
Banker RD, Datar SM, Kemerer CF, Zweig D (2002) Software errors and software maintenance management. Inf Technol Manage 3(1–2):25–41
Banker RD, Davis GB, Slaughter SA (1998) Software development practices, software complexity and software maintenance performance. Manage Sci 44(4):433–450
Barki H, Hartwick J (1994) Measuring user participation, user involvement, and user attitude. MIS Q 18(1):59–82
Barry EJ, Kemerer CF, Slaughter SA (2006) Environmental volatility, development decisions, and software volatility: a longitudinal analysis. Manage Sci 52(3):448–464
Bennett KH, Rajlich VT (2002) Software maintenance and evolution: a roadmap. In: 22nd ICSE, Limerick, Ireland, pp 75–87
Biyani S, Santhanam P (1998) Exploring defect data from development and customer usage on software modules over multiple releases. IBM T.J. Watson Research Center, Yorktown Heights
Boehm B (1987) Improving software productivity. IEEE Comput 20(1):43–57
Brooks FJ (1995) The mythical man-month. Addison-Wesley, Reading
Bryant A, Charmaz K (2007) Grounded theory in historical perspective: an epistemological account. In: Bryant A, Charmaz K (eds) The Sage handbook of grounded theory. Sage, Los Angeles, pp 31–57
Cearley DW, Fenn J, Plummer DC (2005) Gartner’s positions on the five hottest IT topics and trends in 2005. Gartner Inc, Stamford
Chiarini-Tremblay M, Berndt DJ, Foulis P, Luther S (2005) Utilizing text mining techniques to identify fall related injuries. In: Eleventh Americas conference on information systems, Omaha, NE, 11–14 August 2005, pp 1497–1504
Chulani S, Ray B, Santhanam P, Leszkowicz R (2003) Metrics for managing customer view of software quality. In: Proceedings of ninth international software metrics symposium, pp 189–198
Conway ME (1968) How do committees invent? Datamation 14(4):28–31
Crowston K, Annabi H, Howison J (2003) Defining open source project success. In: International conference of information systems, Seattle, WA, 14–17 December 2003
Deerwester EA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci Technol 41(6):391–401
Dempsey BJ, Weiss D, Jones P, Greenberg J (2002) Who is an open source software developer? Commun ACM 45(2):67–72
Dinh-Trong TT, Bieman JM (2005) The FreeBSD project: a replication case study of open source development. IEEE Trans Softw Eng 31(6):481–495
Eastwood A (1993) Firm fires shots at legacy systems. Comput Canada 19(2):17
Erlich L (2000) Leveraging legacy system dollars for e-business. IT Pro 17–23
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) Data mining and knowledge discovery in databases. Commun ACM 39(11):37–54
Feller J, Fitzgerald B (2002) Understanding open source software development. Addison-Wesley, London
Fenton NE, Neil M (1999) A critique of software defect prediction models. IEEE Trans Softw Eng 25(5):675–689
Fenton NE, Pfleeger S (1991) Software metrics: a rigorous approach. Chapman & Hall, New York
German DM (2006) An empirical study of fine-grained software modifications. Empir Softw Eng 11:369–393
Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: International workshop on principles of software evolution. ACM, Vienna, Austria
Graves TL, Karr AK, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661
Gremillion LL (1984) Determinants of program repair maintenance requirements. Commun ACM 27(8):826–832
Harter DE, Krishnan MS, Slaughter SA (2000) Effects of process maturity on quality, cycle time, and effort in software product development. Manage Sci 46(4):451–466
Hartwick J, Barki H (1994) Explaining the role of user participation in information system use. Manage Sci 40(4):440–465
Herbsleb J, Zubrow D, Goldenson D, Hayes W, Paulk M (1997) Software quality and the capability maturity model. Commun ACM 40(6):30–40
Herbsleb JD, Moitra D (2001) Global software development. IEEE Softw 18(2):16–20
Hosmer DW, Lemeshow S (2000) Applied logistic regression. Wiley, New York
Huntley CL (2003) Organizational learning in open-source software projects: an analysis of debugging data. IEEE Trans Softw Eng 50(4):485–493
IEEE-STD-1061 (1993) IEEE standard for a software quality metrics methodology. Institute of Electrical and Electronics Engineers, Inc, New York
Ives B, Olson MH (1984) User involvement and MIS success: a review of research. Manage Sci 30(5):586–603
Jensen C, Scacchi W (2005) Collaboration, leadership, control, and conflict negotiation and the Netbeans.org open source software development community. In: Proceedings of the 38th annual Hawaii international conference on system sciences, HICSS ’05, Hawaii, p 196.2
Jensen C, Scacchi W (2004) Data mining for software process discovery in open source software development communities. In: Proceedings of workshop on mining software repositories, Edinburgh, Scotland, pp 449–462
Popstajanova K, Trivedi K (2001) Architecture based approach to reliability assessment of software systems. Perform Eval 45:179–204
Khoshgoftaar T, Munson J, Lanning D (1993) A comparative study of predictive models for program changes during system testing and maintenance. In: International conference on software maintenance
Koch S (2007) Software evolution in open source projects—a large-scale investigation. J Softw Maint Evol Res Prac 19(6):361–382
Li PL, Shaw M, Herbsleb J, Ray B, Santhanam P (2004) Empirical evaluation of defect projection models for widely-deployed production software systems. In: Proceedings of the 12th ACM SIGSOFT twelfth international symposium on foundations of software engineering, Newport Beach, CA, pp 263–272
Lientz BP, Swanson EB (1980) Software maintenance management. Addison-Wesley, Reading
Lind RK, Vairavan K (1989) An experimental investigation of software metrics and their relationship to software development effort. IEEE Trans Softw Eng 15(5):649–653
Madey G, SourceForge.net Research Data Archive, http://www.nd.edu/~oss/Data/data.html. Accessed May 2007
McConnell S (1999) Open source methodology: ready for prime time? IEEE Softw 16(4):6–8
Melouk S, Raja U, Keskin B (forthcoming) Managing resource allocation and task prioritization in a large scale virtual development project. Inform Res Manage J
Menard SW (2002) Applied logistic regression analysis. Sage, Thousand Oaks
Michlmayr M, Senyard A (2006) A statistical analysis of defects in Debian and strategies for improving quality in free software projects. In: Bitzer J, Schröder PJH (eds) The economics of open source software development, pp 131–148
Mockus A, Fielding RT, Herbsleb J (2002) Two case studies of open source software development: Apache and Mozilla. ACM Trans Softw Eng Methodol (TOSEM) 11(3):309–346
Mockus A, Weiss D, Zhang P (2003) Understanding and predicting effort in software projects. In: International conference on software engineering
Musa J, Iannino A, Okimoto K (1990) Software reliability. McGraw-Hill, New York
Pang A (2008) Worldwide Enterprise Applications 2008–2012 Forecast Update and 2007 Vendor Shares, IDC, Editor, p 55
Park RM (1992) Software size measurement: a framework for counting source statements. CMU-SEI-92-TR-2. Software Engineering Institute, Pittsburgh
Paulson JW, Succi G, Eberlein A (2004) An empirical study of open-source and closed-source software products. IEEE Trans Softw Eng 30(4):246–256
Plekhanova V (1999) Capability and compatibility measurement in software process improvement. In: 2nd European software measurement conference (FESMA’99), Amsterdam, The Netherlands
Pressman R (2004) Software engineering: a practitioner’s approach with bonus chapter on agile development. McGraw-Hill Science/Engineering/Math, New York
Raja U, Hale DP, Hale JE (2009) Modeling software evolution defects: a time series approach. J Softw Maint Evol Res Pract 21(1):49–71
Raymond ES, O’Reilly T (2001) Cathedral and the Bazaar, http://www.tuxedo.org/~esr/writings/cathedral-bazaar/cathedral-bazaar
Roberts JA, Hann I-H, Slaughter S (2006) Understanding the motivations, participation, and performance of open source software developers: a longitudinal study of the apache projects. Manage Sci 52(7):984–999
Robles G, Amor JJ, Gonzalez-Barahona JM, Herraiz I (2005) Evolution and growth in large libre software projects. In: Eighth international workshop on principles of software evolution (IWPSE’05). IEEE Computer Society, pp 165–174
Sabherwal R, Jeyaraj A, Chowa C (2006) Information system success: individual and organizational determinants. Manage Sci 52(12):1849–1864
Scacchi W (2004) Free and open source development practices in the game community. IEEE Softw 21(1):59–66
Scacchi W (2004) Understanding free/open source software evolution: applying, breaking and rethinking the laws of software evolution. In: Madhavji NH et al (eds) Software evolution. Wiley, New York
Slaughter S, Harter DE, Krishnan MS (1998) Evaluating the cost of software quality. Commun ACM 41(8):67–73
Stamelos I, Angelis L, Oikonomou A, Bleris GL (2002) Code quality analysis in open source software development. Inf Syst J 12:43–60
Stewart KJ, Ammeter AP (2006) Impacts of license choice and organizational sponsorship on user interest and development activity in open source software projects. Inf Syst Res 17(2):126–144
Subramaniam C, Sen R, Nelson ML (2009) Determinants of open source software project success: a longitudinal study. Decis Support Syst 46(2):576–585
Swanson EB, Dans E (2000) System life expectancy and the maintenance effort: exploring their equilibration. MIS Q 24(2):277–297
von Krogh G, Spaeth S, Lakhani KR (2003) Community, joining, and specialization in open source software innovation: a case study. Res Policy 32(7):1217–1241
Williams CC, Hollingsworth JK (2005) Automatic mining of source code repositories to improve bug finding techniques. IEEE Trans Softw Eng 31(6):466–480
Ying ATT, Murphy GC, Ng R, Chu-Carroll MC (2004) Predicting source code changes by mining change history. IEEE Trans Softw Eng 30(9):574–586
Zelkowitz MV, Shaw AC, Gannon JD (1979) Principles of software engineering and design. Prentice Hall Inc, Englewood Cliffs
Zimmermann T, Zeller A, Weissgerber P, Diehl S (2005) Mining version histories to guide software changes. IEEE Trans Softw Eng 31(6):429–445
Cite this article
Raja, U., Tretter, M.J. Antecedents of open source software defects: A data mining approach to model formulation, validation and testing. Inf Technol Manag 10, 235–251 (2009). https://doi.org/10.1007/s10799-009-0062-5
DOI: https://doi.org/10.1007/s10799-009-0062-5