Exploring fault types, detection activities, and failure severity in an evolving safety-critical software system

Abstract

Many papers have been published on the analysis and prediction of software faults and/or failures, but few have established links from software faults (i.e., the root causes) to (potential or observed) failures and addressed multiple attributes. This paper aims to fill this gap by studying the types of faults that caused software failures, the activities taking place when faults were detected or failures were reported, and the severity of failures. Furthermore, it explores the associations among these attributes and the trends within releases (i.e., pre-release and post-release) and across releases. The results are based on data extracted from a safety-critical NASA mission that follows an evolutionary development process. In particular, we analyzed 21 large-scale software components, which together constitute over 8,000 files and millions of lines of code. The main insights include: (1) Only a few fault types were responsible for the majority of failures pre-release, post-release, and across releases. Analysis and testing activities detected the majority of failures caused by each fault type. (2) The distributions of fault types differed for pre-release and post-release failures. (3) The percentage of safety-critical failures was small overall, and their relative contribution increased on-orbit. (4) Both post-release failures and safety-critical failures were more heavily associated with coding faults than with any other fault type. (5) Components that experienced a high number of failures in one release were not necessarily among the most failure-prone components in the subsequent release. (6) Components that experienced more failures pre-release were more likely to fail post-release, both overall and for each release.

Notes

  1. Anomaly is defined as anything observed in the documentation or operation of software that deviates from expectations based on previously verified software products or reference documents (ISO/IEC/IEEE 24765 2010).

  2. The hypothesis in the case of the second replication study (Grbac et al. 2013) was tested conditionally using only faults detected during site testing (i.e., information on faults detected during product operation was not available).

  3. It should be noted that the dataset considered in our previous study (Hamill and Goseva-Popstojanova 2009) was slightly larger than the one in this paper (i.e., 2,858 rather than 2,558 nonconformance SCRs) since further analysis of additional fields (i.e., detection activity and severity) led us to exclude some SCRs from the analysis. Nevertheless, the results and observed trends remain the same, with only insignificant changes in some values. For consistency, all results given in this paper are based on the new sample of 2,558 nonconformance SCRs.

  4. In most of this paper, integration faults and data problems are grouped together in integration and interface faults.

  5. The power of a statistical test is defined as the probability of rejecting the null hypothesis when it is false, that is, the probability of identifying a pattern in the data, assuming it exists.

  6. If a large effect size [i.e., correlation coefficient \(|\rho |=0.5\) (Cohen 1988)] is assumed, the two-sided test of statistical significance of the Spearman correlation coefficient with statistical power 80 % would require 32 components, which is larger than our sample sizes.

  7. The software unit is considered at different levels of granularity: files (Ostrand and Weyuker 2002; Pighin and Marzona 2003) or modules (Biyani and Santhanam 1998).

  8. It should be noted that although component 20, which is the largest in size, has the highest number of both pre-release and post-release faults, in general, component size in LOC did not appear to be linearly correlated with either pre-release or post-release faults; that is, in some cases, smaller components had more pre-release failures and/or more post-release failures than larger components. In other words, component size is not the reason behind the positive correlation between pre-release failures and post-release failures (a minimal illustration of this type of correlation analysis is sketched after these notes).

  9. It should be noted that there are no missing data in Figs. 13, 14, 15, and 16. Rather, some components have gone through fewer releases than others. For example, component 8 had only two releases, while component 20 had seven releases.
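
To illustrate the type of correlation analysis referred to in notes 6 and 8, the following is a minimal sketch, not the study's actual analysis; it computes the Spearman rank correlation between per-component pre-release and post-release failure counts using made-up numbers.

    # Minimal sketch: Spearman rank correlation between per-component
    # pre-release and post-release failure counts. The counts below are
    # hypothetical; they are not data from the study.
    from scipy.stats import spearmanr

    pre_release_failures = [120, 45, 80, 10, 200, 65, 30, 150]    # hypothetical
    post_release_failures = [15, 4, 11, 1, 35, 8, 2, 22]          # hypothetical

    rho, p_value = spearmanr(pre_release_failures, post_release_failures)
    print(f"Spearman rho = {rho:.2f}, two-sided p = {p_value:.4f}")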

References

  • Andersson, C., & Runeson, P. (2007). A replicated quantitative analysis of fault distributions in complex software systems. IEEE Transactions on Software Engineering, 33(5), 273–286.

  • Bell, R., Ostrand, T. J., & Weyuker, E. J. (2006). Looking for bugs in all the right places. In: International symposium on software testing and analysis (pp. 61–72).

  • Biyani, S. H., & Santhanam, P. (1998). Exploring defect data from development and customer usage on software modules over multiple releases. In: 9th IEEE international symposium on software, reliability engineering (pp. 316–320).

  • Blaikie, N. W. H. (2003). Analyzing quantitative data. Thousand Oaks, CA: SAGE.

  • Chillarege, R., Bhandari, I., Chaar, J., Halliday, M., Moebus, D., Ray, B., et al. (1992). Orthogonal defect classification—A concept for in-process measurement. IEEE Transactions on Software Engineering, 18(11), 943–956.

  • Christmansson, J., & Chillarege, R. (1996). Generation of an error set that emulates software faults based on field data. In: 26th international symposium on fault tolerant, computing (pp. 304–313).

  • Cochran, W. G. (1954). Some methods for strengthening the common \(\chi ^2\) tests. Biometrics, 10, 417–451.

  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates Inc.

  • Duraes, J. A., & Madeira, H. S. (2006). Emulation of software faults: A field data study and a practical approach. IEEE Transactions on Software Engineering, 32(11), 849–867.

  • Eldh, S., Punnekkat, S., Hansson, H., & Jonsson, P. (2007). Component testing is not enough—A study of software faults in telecom middleware. Testing of software and communicating systems, LNCS, 4581, 74–89.

  • Fenton, N., & Ohlsson, N. (2000). Quantitative analysis of faults and failures in a complex software system. IEEE Transactions on Software Engineering, 26(8), 787–814.

  • Goseva-Popstojanova, K., Hamill, M., & Perugupalli, R. (2005). Large empirical case study of architecture-based software reliability. In: 16th IEEE international symposium on software, reliability engineering (pp. 43–52).

  • Grbac, T. G., Runeson, P., & Huljenic, D. (2013). A second replicated quantitative analysis of fault distributions in complex software systems. IEEE Transactions on Software Engineering, 39(4), 462–476.

  • Hamill, M., & Goseva-Popstojanova, K. (2009). Common trends in fault and failure data. IEEE Transactions on Software Engineering, 35(4), 484–496.

  • Hamill, M., & Goseva-Popstojanova, K. (2013). Exploring the missing link: An empirical study of software fixes. Software Testing, Verification and Reliability. doi:10.1002/stvr.1518.

  • ISO/IEC/IEEE 24765. (2010). Systems and software engineering vocabulary. Geneva, Switzerland: ISO.

  • Kitchenham, B. (2008). The role of replications in empirical software engineering – A word of warning. Empirical Software Engineering, 13(2), 219–221.

  • Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., Emam, K. E., et al. (2002). Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8), 721–734.

  • Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Thousand Oaks, CA: SAGE Publications Inc.

  • Leszak, M., Perry, D., & Stoll, D. (2002). Classification and evaluation of defects in a project retrospective. Journal of Systems and Software, 61, 173–187.

  • Lutz, R. R., & Mikulski, I. C. (2004). Empirical analysis of safety critical anomalies during operation. IEEE Transactions on Software Engineering, 30(3), 172–180.

  • NASA-GB-8719.13. (2004). NASA software safety guidebook. www.hq.nasa.gov.office/codeq/doctree/8719.13.pdf. Accessed 12 April 2014.

  • Ostrand, T. J., & Weyuker, E. J. (2002). The distribution of faults in a large industrial software system. In: ACM international symposium on software testing and analysis (pp. 55–64).

  • Ostrand, T. J., Weyuker, E. J., & Bell, R. (2005a). Where the bugs are. In: ACM SIGSOFT international symposium on software testing and analysis (pp. 86–96).

  • Ostrand, T. J., Weyuker, E. J., & Bell, R. (2005b). Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering, 31(4), 340–355.

  • Petersen, K., & Wohlin, C. (2009). Context in industrial software engineering research. In: 3rd international symposium on empirical software engineering and measurement. ESEM ’09 (pp. 401–404).

  • Pighin, M., & Marzona, A. (2003). An empirical analysis of fault persistence through software releases. In: International symposium on empirical software engineering (pp. 206–211).

  • Robson, C. (2002). Real world research. Ontario: Blackwell.

  • Runeson, P., & Höst, M. (2009). Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2), 131–164.

  • Runeson, P., Höst, M., Rainer, A., & Regnell, B. (2012). Case study research in software engineering: Guidelines and examples. Hoboken, NJ: Wiley.

  • Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.

  • Shihab, E., Jiang, Z. M., Ibrahim, W. M., Adams, B., & Hassan, A. E. (2010). Understanding the impact of code and process metrics on post-release defects: a case study on the Eclipse project. In: 2010 ACM-IEEE international symposium on empirical software engineering and measurement (pp. 4:1–4:10).

  • Shull, F. J., Carver, J. C., Vegas, S., & Juristo, N. (2008). The role of replications in empirical software engineering. Empirical Software Engineering, 13(2), 211–218.

  • Yin, R. K. (2014). Case study research: Design and methods. Thousand Oaks, CA: SAGE.

  • Yu, W. D. (1998). A software fault prevention approach in coding and root cause analysis. Bell Labs Technical Journal, 3(2), 3–21.

  • Zheng, J., Williams, L., Nagappan, N., Snipes, W., Hudepohl, J. P., & Vouk, M. A. (2006). On the value of static analysis for fault detection in software. IEEE Transactions on Software Engineering, 32(4), 240–253.

  • Zhivich, M., & Cunningham, R. K. (2009). The real cost of software errors. IEEE Security & Privacy, 7(2), 87–90.

  • Zimmermann, T., Premraj, R., & Zeller, A. (2007). Predicting defects for Eclipse. In: 3rd international workshop on predictor models in software engineering. paper #9, 7 p.

Acknowledgments

This work was funded in part by the NASA Office of Safety and Mission Assurance, Software Assurance Research Program, under a grant managed through the NASA IV&V Facility in Fairmont, WV. Goseva-Popstojanova's work was also supported in part by National Science Foundation Grant 0916284, with funds from the American Recovery and Reinvestment Act of 2009, and by the WVU ADVANCE Sponsorship Program funded by the National Science Foundation ADVANCE IT Program award HRD-100797.

We thank the NASA personnel for their invaluable support: Jill Broadwater, Pete Cerna, Randolph Copeland, Susan Creasy, James Dalton, Bryan Fritch, Nick Guerra, John Hinkle, Lynda Kelsoe, Gary Lovstuen, Tom Macaulay, Debbie Miele, Lisa Montgomery, James Moon, Don Ohi, Chad Pokryzwa, David Pruett, Timothy Plew, Scott Radabaugh, David Soto, and Sarma Susarla.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NASA personnel and the funding agencies (NASA and NSF).

Author information

Corresponding author

Correspondence to Katerina Goseva-Popstojanova.

Additional information

This work was done while Maggie Hamill was affiliated with West Virginia University.

Appendix: Background on \(\chi ^2\) test for contingency tables

For any two attributes (i.e., random variables \(X\) and \(Y\) with \(n\) and \(m\) categories, respectively), we use the \(\chi ^2\) test to explore whether the distributions of failures across the \(m\) categories of \(Y\) are the same for each of the \(n\) samples (i.e., categories) of \(X\). That is, we test the null hypothesis that there is no association between the two bases of classification. We build an \(m \times n\) contingency table using the observed frequencies for each pair of categories \(X_i\) and \(Y_j\), and then calculate the standard \(\chi ^2\) statistic, whose distribution under the null hypothesis is approximated by a chi-square distribution with \((m-1)(n-1)\) degrees of freedom. When used for contingency comparisons, the chi-square test is a nonparametric test, since it compares entire distributions rather than parameters of distributions.

Note that the \(\chi ^2\) test is applicable to data in a contingency table only if the expected frequencies are sufficiently large. Specifically, for contingency tables with more than one degree of freedom, the \(\chi ^2\) statistic may be used if no cell has an expected frequency <1 and fewer than 20 % of the cells have an expected frequency <5 (Cochran 1954). Rare categories that are not of interest can be combined in order to increase the expected frequencies in the affected cells. However, rare categories that are considered important will necessitate a larger sample size, both to be able to run the \(\chi ^2\) test and to achieve adequate power (Kraemer and Thiemann 1987).
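
As an illustration, the following is a minimal Python sketch (not the implementation used in the study) of how the \(\chi ^2\) test of association, together with Cochran's applicability check, could be carried out with SciPy. The contingency table, which pairs hypothetical fault types with hypothetical severity levels, is made up for illustration only.

    # Minimal sketch: chi-square test of association for an m x n contingency
    # table, plus Cochran's (1954) applicability check (no expected frequency
    # below 1, and fewer than 20% of cells with an expected frequency below 5).
    # The observed counts below are hypothetical, not data from the study.
    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: hypothetical fault types; columns: hypothetical severity levels.
    observed = np.array([
        [30, 14,  6],
        [22, 18, 10],
        [12,  9,  4],
        [40, 25, 15],
    ])

    chi2_stat, p_value, dof, expected = chi2_contingency(observed)

    # Cochran's rule of thumb for tables with more than one degree of freedom.
    applicable = (expected >= 1).all() and (expected < 5).mean() < 0.20

    print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4f}")
    print(f"Cochran applicability rule satisfied: {applicable}")
    if applicable and p_value <= 0.05:
        print("Reject the null hypothesis of no association (alpha = 0.05).")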

The power of a statistical test is the probability that a false null hypothesis will be rejected, that is, that the test will detect an existing pattern in the data. Although an intuitive response may be to choose a power value around 99 % so as to be almost certain of demonstrating significance when the alternative hypothesis is true, this is usually impractical because the number of instances required per category becomes prohibitive. Generally, statistical power in the 70–90 % range is considered acceptable (Kraemer and Thiemann 1987).
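
For completeness, here is a brief sketch of how the power of a \(\chi ^2\) test of association can be approximated via the noncentral chi-square distribution; the effect size, sample size, and table dimensions in the example are hypothetical and are not taken from the study.

    # Approximate power of a chi-square test of association. With Cohen's
    # effect size w, the noncentrality parameter is lambda = N * w^2 and the
    # degrees of freedom are (m - 1) * (n - 1). All numbers are hypothetical.
    from scipy.stats import chi2, ncx2

    def chi2_power(w, n_obs, n_rows, n_cols, alpha=0.05):
        dof = (n_rows - 1) * (n_cols - 1)
        critical_value = chi2.ppf(1 - alpha, dof)   # rejection threshold under H0
        noncentrality = n_obs * w ** 2              # shift under the alternative
        return 1 - ncx2.cdf(critical_value, dof, noncentrality)

    # E.g., a medium effect (w = 0.3), 200 observations, and a 4 x 3 table.
    print(f"approximate power: {chi2_power(0.3, 200, 4, 3):.2f}")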

Once the \(\chi ^2\) statistic has been calculated, we determine the probability, under the null hypothesis, of obtaining a value at least as large as the calculated \(\chi ^2\) value. If this probability is equal to or less than the significance level \(\alpha \) (in our case \(\alpha =0.05\)), the null hypothesis is rejected. Since the null hypothesis states that there is no relationship between the two variables (i.e., that any correlation observed in the sample is due to chance), rejecting it implies that the distribution of failures across the \(m\) categories differs significantly for the \(n\) samples and suggests that there is some correlation between the two variables. Therefore, we also measure the extent of this correlation. We use the contingency coefficient \(C\) as the measure of correlation, since it is uniquely useful when the information about at least one of the attributes is categorical (i.e., given on a nominal scale). The contingency coefficient \(C\) does not require underlying continuity for the categories used to measure either attribute. Moreover, it has the same value regardless of how the categories are arranged in the rows and columns. The contingency coefficient \(C\) is calculated as:

$$\begin{aligned} C=\sqrt{\frac{\chi ^2}{N+\chi ^2}} \end{aligned}$$
(1)

where \(N\) is the total number of observations and the value of the \(\chi ^2\) statistic is computed from the contingency table (Siegel and Castellan 1988). Since the test of significance for the contingency coefficient \(C\) is based solely on the \(\chi ^2\) statistic, it follows that if the null hypothesis that the \(n\) samples come from the same distribution is rejected, then the calculated \(C\) value is significant.

It is important to note that, unlike other measures of correlation, the maximum value of \(C\) is not equal to 1; rather, it depends on the size of the table. Specifically, the maximum value of the contingency coefficient, \(C_{\rm max}\), is given by:

$$\begin{aligned} C_{\rm max}=\sqrt[4]{\frac{m-1}{m} \cdot \frac{n-1}{n}} \end{aligned}$$
(2)

where \(m\) is the number of rows and \(n\) is the number of columns in the contingency table. Hence, even small values of \(C\) may often be evidence of a statistically significant correlation between the variables. Furthermore, \(C\) values for contingency tables of different sizes are not directly comparable. However, normalizing \(C\) by the corresponding \(C_{\rm max}\), as in Eq. (3), ensures that the range is between 0 and 1; hence, \(C^*\) values for tables of different sizes can be compared (Blaikie 2003).

$$\begin{aligned} C^*=C/C_{\rm max}. \end{aligned}$$
(3)
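
The following short sketch shows how Eqs. (1)-(3) could be computed in Python for a given contingency table; it reuses SciPy's chi-square routine, and the example table is again hypothetical.

    # Contingency coefficient C (Eq. 1), its maximum value C_max (Eq. 2), and
    # the normalized coefficient C* = C / C_max (Eq. 3) for an m x n table.
    # The observed counts are hypothetical, for illustration only.
    import numpy as np
    from scipy.stats import chi2_contingency

    def normalized_contingency_coefficient(observed):
        observed = np.asarray(observed)
        m, n = observed.shape                              # rows, columns
        chi2_stat, _, _, _ = chi2_contingency(observed)
        total = observed.sum()                             # N, total observations
        c = np.sqrt(chi2_stat / (total + chi2_stat))       # Eq. (1)
        c_max = ((m - 1) / m * (n - 1) / n) ** 0.25        # Eq. (2)
        return c, c_max, c / c_max                         # Eq. (3)

    observed = [[30, 14, 6], [22, 18, 10], [12, 9, 4], [40, 25, 15]]
    c, c_max, c_star = normalized_contingency_coefficient(observed)
    print(f"C = {c:.3f}, C_max = {c_max:.3f}, C* = {c_star:.3f}")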

Cite this article

Hamill, M., Goseva-Popstojanova, K. Exploring fault types, detection activities, and failure severity in an evolving safety-critical software system. Software Qual J 23, 229–265 (2015). https://doi.org/10.1007/s11219-014-9235-5
