Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning
Cross-company defect prediction (CCDP) is a practical approach that trains a prediction model on one or more projects of a source company and then applies the model to a target company. Unfortunately, large amounts of irrelevant cross-company (CC) data usually make it difficult to build a high-performance prediction model. Moreover, brute-force use of CC data that are poorly related to within-company data may degrade the model's performance. To address these issues, we aim to provide an effective solution for CCDP. First, we propose a novel semi-supervised clustering-based data filtering method (the SSDBSCAN filter) to filter out irrelevant CC data. Second, based on the filtered CC data, we introduce the multi-source TrAdaBoost algorithm, an effective transfer learning method, into CCDP for the first time, importing knowledge from multiple sources rather than a single one to avoid negative transfer. Experiments on 15 public datasets indicate that (1) the proposed SSDBSCAN filter achieves better overall performance than the compared data filtering methods; (2) the proposed CCDP approach achieves the best overall performance among all tested CCDP approaches; and (3) the proposed CCDP approach performs significantly better than within-company defect prediction models.
Keywords: Cross-company defect prediction; Transfer learning; SSDBSCAN; Multi-source TrAdaBoost
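The data filtering idea in the abstract, keeping only the CC instances that are relevant to the within-company (WC) data, can be illustrated with a deliberately simplified sketch. This is not the SSDBSCAN filter itself: it replaces semi-supervised density-based clustering with a plain distance threshold, and the function names, the `eps` parameter, and the toy metric vectors are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_cc_data(cc_rows, wc_rows, eps):
    """Keep each CC row that lies within `eps` of at least one WC row.

    Simplified stand-in for a clustering-based relevance filter
    (hypothetical helper, not the paper's SSDBSCAN filter).
    """
    return [
        row for row in cc_rows
        if any(euclidean(row, wc) <= eps for wc in wc_rows)
    ]


# Toy two-metric instances; the values are invented for illustration.
wc = [(1.0, 1.0), (2.0, 1.5)]
cc = [(1.1, 0.9), (8.0, 9.0), (2.2, 1.4), (5.0, 5.0)]

# Only the CC rows close to some WC row survive the filter.
kept = filter_cc_data(cc, wc, eps=0.5)
```

In this toy run, the two CC rows far from every WC row are discarded. A real filter would typically normalize the metrics first and choose the neighborhood size from the data rather than fixing it by hand.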
This work is partly supported by grants from the National Natural Science Foundation of China (61070013, 61300042, U1135005, 71401128), the Fundamental Research Funds for the Central Universities (Nos. 2042014kf0272, 2014211020201) and the Natural Science Foundation of Hubei (2011CDB072).
Compliance with ethical standards
Conflicts of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants or animals performed by any of the authors.