A transfer cost-sensitive boosting approach for cross-project defect prediction
Software defect prediction is regarded as one of the crucial tasks for improving software quality, since it allows valuable quality-assurance resources to be allocated to fault-prone modules. Building a predictor, however, requires a sufficient amount of historical data. When such data are not available within a company, cross-project defect prediction (CPDP) can be employed, in which data from other companies are used to build the predictor. In this setting, a transfer learning technique, which extracts common knowledge from source projects and transfers it to a target project, can enhance the prediction performance. Defect data also suffer from the class imbalance problem, which makes it difficult for a learner to predict defects, and the impact of imbalanced data under cross-project settings has not been investigated in depth. We propose a transfer cost-sensitive boosting method that considers both knowledge transfer and class imbalance for CPDP when a small amount of labeled target data is available. The proposed approach performs boosting that assigns weights to the training instances based on both their distributional characteristics and the class imbalance. Through comparative experiments with transfer learning and class imbalance learning techniques, we show that the proposed model provides significantly higher defect detection accuracy while retaining good overall performance. The results indicate that combining transfer learning with class imbalance learning is highly effective for improving prediction performance under cross-project settings. The proposed approach can help to design an effective prediction model for CPDP; the improved defect prediction performance can in turn direct software quality assurance activities and reduce costs, so that software quality can be managed effectively.
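The abstract does not give the algorithm itself, so the sketch below only illustrates the general idea it describes: a boosting loop that re-weights training instances using both a transfer criterion (TrAdaBoost-style down-weighting of source instances that disagree with the target) and a cost-sensitive criterion (AdaCost-style extra weight on misclassified defective target modules). The function name `transfer_cost_boost`, the `cost_ratio` parameter, and the decision-stump base learner are illustrative assumptions, not the authors' published method.

```python
import numpy as np

def _fit_stump(X, y, w):
    # Weighted decision stump: exhaustive search over (feature, threshold,
    # polarity) for the lowest weighted 0/1 error. O(n^2 d), fine for a demo.
    best, best_err = (0, 0.0, 1), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = (pol * (X[:, j] - t) > 0).astype(int)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, t, pol), err
    return best

def _stump_predict(stump, X):
    j, t, pol = stump
    return (pol * (X[:, j] - t) > 0).astype(int)

def transfer_cost_boost(Xs, ys, Xt, yt, n_rounds=10, cost_ratio=2.0):
    """Illustrative transfer cost-sensitive boosting (assumed design).

    Source instances (Xs, ys) lose weight when misclassified, as in
    TrAdaBoost (they look less relevant to the target distribution);
    target instances (Xt, yt) gain weight when misclassified, and
    misclassified defective target modules (label 1) gain extra weight
    via cost_ratio, in the spirit of AdaCost.
    """
    X, y = np.vstack([Xs, Xt]), np.concatenate([ys, yt])
    n_s = len(ys)
    w = np.ones(len(y)) / len(y)
    # TrAdaBoost's fixed shrinkage factor for misclassified source instances
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = _fit_stump(X, y, w / w.sum())
        miss = _stump_predict(stump, X) != y
        # Error is measured on the (small) labeled target portion only.
        err = np.clip(w[n_s:][miss[n_s:]].sum() / w[n_s:].sum(), 1e-10, 0.499)
        beta_t = err / (1.0 - err)
        w[:n_s] *= beta_src ** miss[:n_s]              # shrink bad source
        cost = np.where(y[n_s:] == 1, cost_ratio, 1.0)  # defects cost more
        w[n_s:] *= beta_t ** (-(miss[n_s:] * cost))     # grow bad target
        w /= w.sum()
        stumps.append(stump)
        alphas.append(np.log(1.0 / beta_t))
    def predict(Xq):
        votes = sum(a * (2 * _stump_predict(s, Xq) - 1)
                    for a, s in zip(alphas, stumps))
        return (votes > 0).astype(int)
    return predict
```

Measuring the boosting error on target instances only, while still fitting each weak learner on the combined data, is what lets the source projects contribute structure without dominating the small labeled target set.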
Keywords: Boosting · Class imbalance · Cost-sensitive learning · Cross-project defect prediction · Software defect prediction · Transfer learning
This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science, ICT and Future Planning, MSIP) (No. NRF-2013R1A1A2006985) and by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-15-0144, Development of Autonomous Intelligent Collaboration Framework for Knowledge Bases and Smart Devices).