Data Transformation in Cross-project Defect Prediction

Zhang, Feng; Keivanloo, Iman; Zou, Ying

doi:10.1007/s10664-017-9516-2

Published: 14 April 2017

Empirical Software Engineering, volume 22, pages 3186–3218 (2017)

Abstract

Software metrics rarely follow a normal distribution. Therefore, software metrics are usually transformed prior to building a defect prediction model. To the best of our knowledge, the impact that the transformation has on cross-project defect prediction models has not been thoroughly explored. A cross-project model is built from one project and applied to another project. In this study, we investigate whether cross-project defect prediction is affected by applying different transformations (i.e., log and rank transformations, as well as the Box-Cox transformation). The Box-Cox transformation subsumes log and other power transformations (e.g., square root), but has not been studied in the defect prediction literature. We propose an approach, namely Multiple Transformations (MT), to utilize multiple transformations for cross-project defect prediction. We further propose an enhanced approach, MT+, which uses the parameter of the Box-Cox transformation to determine the most appropriate training project for each target project. Our experiments are conducted on three publicly available data sets (i.e., AEEEM, ReLink, and PROMISE). Compared to the random forest model built solely using the log transformation, our MT+ approach improves the F-measure by 7%, 59%, and 43% for the three data sets, respectively. In summary, our major contributions are three-fold: 1) we conduct an empirical study on the impact that data transformation has on cross-project defect prediction models; 2) we propose an approach to utilize the various information retained by applying different transformation methods; and 3) we propose an unsupervised approach to select the most appropriate training project for each target project.
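
To make the three transformations named in the abstract concrete, the following is a minimal sketch in Python of the log, rank, and Box-Cox transformations applied to a single metric. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `log(1 + x)` offset, the average-rank handling of ties, and the grid search used to estimate the Box-Cox parameter are all choices made here for illustration. The Box-Cox form used is the standard one from Box and Cox (1964): (x^λ − 1)/λ for λ ≠ 0, and ln x for λ = 0.

```python
import math


def log_transform(values):
    """log(1 + x). The +1 offset is an assumption so zero-valued metrics stay defined."""
    return [math.log1p(v) for v in values]


def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def box_cox(values, lam):
    """Standard Box-Cox transformation; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def estimate_lambda(values, grid=None):
    """Pick the lambda that maximizes the Box-Cox profile log-likelihood.

    A plain grid search over [-2, 2] in steps of 0.1 (an illustrative choice).
    Assumes the values are positive and not all identical.
    """
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]
    n = len(values)
    sum_log = sum(math.log(v) for v in values)

    def log_likelihood(lam):
        y = box_cox(values, lam)
        mean = sum(y) / n
        variance = sum((t - mean) ** 2 for t in y) / n
        return -n / 2 * math.log(variance) + (lam - 1) * sum_log

    return max(grid, key=log_likelihood)


if __name__ == "__main__":
    metric = [1, 4, 9, 120, 3500]  # hypothetical values of one software metric
    print(log_transform(metric))
    print(rank_transform(metric))
    print(box_cox(metric, 0.5))
    print(estimate_lambda(metric))
```

With λ = 0 the Box-Cox transformation reduces to the natural log, and with λ = 0.5 it is a shifted and scaled square root, which is the sense in which it subsumes the log and other power transformations. How MT+ uses the estimated parameter to choose a training project is described in the full article and is not reproduced here.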

Author information

Authors and Affiliations

School of Computing, Queen’s University, Kingston, Ontario, Canada
Feng Zhang
Department of Electrical and Computer Engineering, Queen’s University, Kingston, Ontario, Canada
Iman Keivanloo & Ying Zou

Corresponding author

Correspondence to Feng Zhang.

Additional information

Communicated by: Tim Menzies

About this article

Cite this article

Zhang, F., Keivanloo, I. & Zou, Y. Data Transformation in Cross-project Defect Prediction. Empir Software Eng 22, 3186–3218 (2017). https://doi.org/10.1007/s10664-017-9516-2

Published: 14 April 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10664-017-9516-2
