Software Quality Journal, Volume 15, Issue 3, pp. 327–344

Software quality estimation with limited fault data: a semi-supervised learning perspective

  • Naeem Seliya
  • Taghi M. Khoshgoftaar

Abstract

We address the important problem of software quality analysis when there is limited software fault or fault-proneness data. A software quality model is typically trained using software measurement and fault data obtained from a previous release or a similar project. Such an approach assumes that fault data are available for all the training modules, yet various issues in software development may limit the availability of fault-proneness data for some of them. Consequently, the labeled training dataset may be too small to capture the software quality trends of the project, and a model trained on it alone may not provide useful predictions. We investigate semi-supervised learning with the Expectation Maximization (EM) algorithm for software quality estimation with limited fault-proneness data. The hypothesis is that the knowledge stored in the software attributes of the unlabeled program modules will aid in improving software quality estimation. Software data collected from a large NASA software project are used during the semi-supervised learning process, and the resulting software quality model is evaluated with multiple test datasets collected from other NASA software projects. Compared to software quality models trained only with the available set of labeled program modules, the EM-based semi-supervised learning scheme improves the generalization performance of the software quality models.
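As a concrete illustration of the scheme the abstract describes, below is a minimal sketch of EM-based semi-supervised classification of program modules. It assumes a two-class Gaussian naive Bayes model over the software metrics; the function name, model choice, convergence rule, and the numpy-only implementation are illustrative assumptions, not the authors' exact method.

    # Sketch of semi-supervised EM for two-class fault-proneness estimation,
    # assuming a Gaussian naive Bayes model over the software metrics
    # (an illustrative stand-in, not the paper's exact implementation).
    import numpy as np

    def em_semi_supervised(X_l, y_l, X_u, n_iter=50, tol=1e-4):
        """X_l, y_l: metrics and 0/1 fault-proneness labels of labeled modules.
        X_u: metrics of modules whose fault-proneness is unknown."""
        X = np.vstack([X_l, X_u])
        y_l = np.asarray(y_l, dtype=int)
        n_l = len(y_l)
        # Responsibilities: labeled rows are clamped one-hot,
        # unlabeled rows start uniform.
        R = np.full((len(X), 2), 0.5)
        R[:n_l] = np.eye(2)[y_l]
        prev_ll = -np.inf
        for _ in range(n_iter):
            # M-step: priors, per-class means, and diagonal variances from
            # the hard labels plus the current soft labels.
            Nk = R.sum(axis=0)
            pi = Nk / Nk.sum()
            mu = (R.T @ X) / Nk[:, None]
            var = (R.T @ X**2) / Nk[:, None] - mu**2 + 1e-6
            # E-step: posterior class probabilities under the current model.
            log_p = np.stack(
                [np.log(pi[k])
                 - 0.5 * ((X - mu[k])**2 / var[k]
                          + np.log(2 * np.pi * var[k])).sum(axis=1)
                 for k in range(2)], axis=1)
            ll = np.logaddexp(log_p[:, 0], log_p[:, 1])
            R = np.exp(log_p - ll[:, None])
            R[:n_l] = np.eye(2)[y_l]  # keep the labeled modules clamped
            if ll.sum() - prev_ll < tol:
                break
            prev_ll = ll.sum()
        # Soft fault-proneness estimates for the unlabeled modules.
        return pi, mu, var, R[n_l:]

The essential design choice mirrors the paper's premise: the responsibilities of the labeled modules stay clamped to their known fault-proneness labels, while the unlabeled modules contribute soft, iteratively refined labels to every parameter update, so the software attributes of the unlabeled modules shape the final model.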

Keywords

Semi-supervised learning · Software quality estimation · Unlabeled data · Software metrics · Expectation maximization

Acknowledgements

We thank the anonymous reviewers for their comments and suggestions, which helped improve this paper. We are grateful to Liam Mayron, Lili Zhao, and Renee Zuleta for their assistance with editorial reviews. We thank the staff of the NASA Metrics Data Program for making the software measurement data available.


Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Computer and Information Science, University of Michigan – Dearborn, Dearborn, USA
  2. Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA
