1 Introduction

Developing high-quality software is, of course, of great importance; however, achieving this goal is difficult due to time and financial constraints. Detecting and fixing software flaws using the most effective quality assurance methods is a time-consuming and costly task. As a result, instead of inspecting all components of a system, software defect prediction can be used to discover which components are the most likely to be defective, so that quality assurance processes can be applied only to these selected components. This leads to lower testing costs and faster delivery of software products (Tunkel and Herbold 2022; Hryszko and Madeyski 2017; Mona et al. 2023).

Problem relevance: It can be expensive as well as time-consuming to test all components of large software systems. Predicting software faults helps to direct testing efforts toward modules or areas of the software that are more likely to contain flaws. This contributes to finding flaws early in the software development lifecycle, thus reducing the chances of production bugs and consequently leading to improved overall product quality, more satisfied end users, and lower maintenance costs (Pachouly et al. 2022).

As opposed to within-project software defect prediction, in cross-project software defect prediction data from multiple source projects are used to predict defects in a different target project. This is particularly useful when there are not enough historical defect data available to build a reliable within-project defect prediction model, which is the case for new projects or projects with a limited bug history. Moreover, in large organizations with many projects, it might not be feasible to develop a separate model for each project. Also, by inherently benefiting from more data, cross-project defect prediction leads to more general and robust models (Wu et al. 2018; Bai et al. 2022; Zhu et al. 2020).

Replication studies are very important for the software industry because they confirm previous findings, possibly challenge some of them, or improve some of the results. When results are successfully replicated in many studies, the confidence of the researchers involved increases, perhaps leading to more informed decision-making in software development. Even if the results of an original study are not confirmed, a replication study is useful for future research by pointing out possible issues that need to be handled with more care. Last but not least, replication studies encourage researchers to publicly share datasets and tools, which reduces duplicated implementations of the same tool or method and ultimately leads to faster progress in the software industry (Shepperd et al. 2018).

In the last few years, machine learning has been widely used for software defect prediction due to its ability to learn patterns and relationships in data. For example, Zhang et al. (2018) utilize a combination of classifiers for the cross-project software defect prediction problem. They address the cross-project setting because, as they argue, machine learning techniques require sufficient volumes of data to be effective, and, in the context of a new software project, this is rarely the case. To this end, in cross-project software defect prediction, a model is trained on data coming from several projects and then applied to the target project.

Zhang et al. (2018) evaluate the performance of seven composite algorithms on ten open source software systems from the PROMISE repository and obtain better results compared to other approaches from the literature. The study proposes a combined classifier approach for cross-project software defect prediction to achieve a more effective discovery of classes that are more likely to be buggy. The results show an improvement with respect to the state-of-the-art. The aim of the present replication is to do the following:

  • confirm or refute the original results

  • investigate the impact of preprocessing techniques on cross-project defect prediction models

  • expand the experiments from the original study by applying feature selection techniques

  • compare the results obtained with and without preprocessing.

The paper is organized as follows: Sect. 2 presents the background and related work, as well as the original study and replication in software engineering, in general. Section 3 presents the research questions together with the experimental methodology and the evaluation approaches of the results. Section 4 presents the experimental setup, the results obtained, the analysis, and the comparison with the replicated paper. The threats to validity are discussed in Sect. 5, while Sect. 6 presents the conclusions of this paper together with ideas for future work.

2 Background and related work

This section outlines the concepts underlying the methods employed in this research.

2.1 Terminology and concepts

Due to the increasing complexity of modern software, the reliability of software is becoming increasingly important. Building highly reliable software requires extensive testing and debugging. However, because both the budget and the available time are limited, these activities must be prioritized for efficiency. As a result, software defect prediction techniques, which predict the presence of faults, are now frequently employed to assist developers in prioritizing their testing and debugging efforts.

The process of developing classifiers to predict whether bugs exist in a certain piece of source code is referred to as software defect prediction (Pan et al. 2019; Li et al. 2017). The predicted outcomes can help developers prioritize their testing and debugging efforts. The first step is to collect the source code from the software repositories and extract various features. As mentioned by Suresh et al. (2012) and Lamba and Mishra (2017), these features are typically software metrics such as those in Table 1.

Table 1 Metrics used as features for software defect prediction

The motivation for using these metrics resides in the fact that a good object-oriented design (a good internal structure), expressed in terms of design principles, heuristics, and rules, aims to achieve quality attributes such as reliability, maintainability, and reusability. Thus, assessing the internal structure of a software system allows us to predict its external behavior. The selected metrics are classified by Brito e Abreu and Melo (1996) and Basili et al. (1996) into four internal characteristics that are essential to object orientation, i.e., coupling, inheritance, cohesion, and structural complexity; these characteristics have a strong impact on software quality, and in particular on software reliability, through the prediction of fault-prone classes. Therefore, the metrics considered in our study have been selected based on these arguments.

The next step is to train a classifier using the instances (source code files) with features and labels. Afterwards, the classifier is ready to predict if a new source code file is buggy or not.

There are several machine learning techniques for software defect prediction discussed in the literature, but in this paper we focus on Logistic Regression (LR, Han et al. (2022)), Bayesian Network (BN, Okutan and Yildiz (2014)), Radial Basis Function Network (RBFN, Bezerra et al. (2007)), Multi Layer Perceptron (MLP, Ceylan et al. (2006)), Alternating Decision Trees (ADTree, Nelson et al. (2011)), and Decision Tables (DT, Liu et al. (2010)). We restrict our study to these classifiers because they are the ones used in the paper we are replicating, namely the proposal by Zhang et al. (2018).

Logistic Regression (LR) models the relationship between features and labels as a parametric distribution \(P(y \vert x)\), where y refers to the label of a data point and \(x = (x_1,x_2, \dots ,x_n)\) is a point from the n-dimensional feature space (Han et al. 2022). In the binary classification case, we define \(p(y=1 \vert x)\) and \(p(y=0 \vert x)\) as follows:

$$\begin{aligned} p(y=1 \vert x)&=\frac{1}{1+\exp \left( w_0+ \sum _{i=1}^n w_i x_i\right) }, \\ p(y=0 \vert x)&=\frac{\exp \left( w_0+ \sum _{i=1}^n w_i x_i\right) }{1+\exp \left( w_0+ \sum _{i=1}^n w_i x_i\right) }, \end{aligned}$$

where \(w=(w_0,w_1,w_2,\dots , w_n)\) is the weight vector associated with x, \(w_0\) being the bias. The task of an LR classifier is to learn the weight vector w, and there are several approaches to this end, such as gradient descent, the family of BFGS algorithms like L-BFGS (Moritz et al. 2016), or conjugate gradient (Hestenes and Stiefel 1952). A new instance is labelled with either 0 or 1, according to the larger of the two probabilities above.
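For illustration, the following minimal Python sketch computes the two probabilities exactly as defined above and labels an instance by the larger of the two; the weight vector and feature values are invented for the example, and the study itself relies on the Weka implementation rather than custom code.

```python
import numpy as np

def lr_probabilities(w, x):
    """Class probabilities following the convention of the formulas above:
    w = (w_0, w_1, ..., w_n) with w_0 the bias, x = (x_1, ..., x_n)."""
    z = w[0] + np.dot(w[1:], x)           # w_0 + sum_i w_i * x_i
    p_y1 = 1.0 / (1.0 + np.exp(z))        # p(y = 1 | x)
    p_y0 = np.exp(z) / (1.0 + np.exp(z))  # p(y = 0 | x); p_y0 + p_y1 == 1
    return p_y0, p_y1

# Hypothetical learned weights and feature vector, for illustration only.
w = np.array([-0.4, 0.8, -1.2, 0.05])
x = np.array([1.0, 0.5, 3.0])
p0, p1 = lr_probabilities(w, x)
label = 1 if p1 > p0 else 0               # choose the larger probability
print(p0, p1, label)
```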

Bayesian Network (BN) is a probabilistic graphical model that represents a set of variables and their conditional dependencies through a directed acyclic graph (DAG) (Okutan and Yildiz 2014; Carvalho et al. 2007). Each node of the DAG represents a unique random variable, and each edge corresponds to a conditional dependency. Let us denote a Bayesian network B by the triple \(B=(X,G,\theta )\), where:

  • \(X=(X_1,X_2,\dots , X_n)\) is a random vector, each random variable \(X_i\) taking values from a domain \(D_i\)

  • \(G=(V,E)\) is a DAG with vertices V equal to X and edges E representing direct conditional dependencies between variables

  • \(\theta = \{ \theta _{x_i \vert \prod _{x_i}} \}_{x \in D}\), where \(\theta _{x_i \vert \prod _{x_i}}=P_B(x_i \vert \prod _{x_i})\) for each possible value \(x_i\) from \(X_i\) and \(\prod _{x_i}\) from \(\prod _{X_i}\), \(\prod _{X_i}\) being the set of parents of \(X_i\) in G.
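As a small illustration of the factorization encoded by \(B=(X,G,\theta )\), the Python sketch below multiplies the conditional probabilities of each variable given its parents; the two-variable network and its probability values are invented for illustration and are not a model trained in this study.

```python
# Factorization P_B(x_1, ..., x_n) = prod_i P_B(x_i | parents(x_i)).
parents = {"HighCoupling": (), "Defective": ("HighCoupling",)}
cpt = {  # (variable, parent values, value) -> conditional probability (illustrative numbers)
    ("HighCoupling", (), True): 0.3,
    ("HighCoupling", (), False): 0.7,
    ("Defective", (True,), True): 0.6,
    ("Defective", (True,), False): 0.4,
    ("Defective", (False,), True): 0.1,
    ("Defective", (False,), False): 0.9,
}

def joint_probability(assignment):
    """Multiply theta_{x_i | parents(x_i)} over all variables in the network."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[(var, pa_values, assignment[var])]
    return p

print(joint_probability({"HighCoupling": True, "Defective": True}))  # 0.3 * 0.6 = 0.18
```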

Radial basis function network (RBFN) is a particular class of artificial neural network that uses radial basis functions as activation functions (Bezerra et al. 2007). Its main advantages are a short training phase and a reduced sensitivity to the order in which training data are presented. An RBF network consists of an input layer, one hidden layer, and an output layer. The hidden layer typically has a higher dimensionality than the input layer because it takes the input, in which the pattern might not be linearly separable, and maps it into a new higher-dimensional space in which it is more likely to be linearly separable.
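A minimal sketch of the hidden-layer computation follows, assuming Gaussian radial basis functions with hand-picked centers, width, and output weights (none of which come from the study):

```python
import numpy as np

def rbf_hidden_layer(x, centers, sigma):
    """Gaussian radial basis activations for one input vector x:
    each hidden unit j responds to the distance between x and its center c_j."""
    dists = np.linalg.norm(centers - x, axis=1)
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))

# Illustrative values only: three hidden units over a 2-dimensional feature space.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
hidden = rbf_hidden_layer(np.array([0.9, 1.1]), centers, sigma=0.5)
output = hidden @ np.array([0.2, 0.7, 0.1])  # linear output layer with fixed weights
print(hidden, output)
```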

Multi layer perceptron (MLP) is a fully connected feedforward artificial neural network (ANN) with at least three layers (an input layer, an output layer, and one or more hidden layers) that uses the backpropagation algorithm for learning (Ceylan et al. 2006). The learning process can be summarized in the following steps:

  • propagate data forward starting from the input layer and ending with the output layer

  • compute the error based on the known and computed output

  • backpropagate the error, i.e., update the weights.

MLPs are able to learn non-linear models and they may be utilized in online learning contexts.
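The three learning steps above can be illustrated with a compact NumPy sketch of backpropagation for a single hidden layer on toy data; the architecture, squared-error loss, and learning rate are illustrative choices and do not reflect the Weka MLP configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 instances with 3 features and binary labels (illustrative only).
X = rng.random((4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)  # input -> hidden
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)  # hidden -> output
lr = 0.1

for _ in range(1000):
    # 1. propagate the data forward through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. compute the error between the known and the computed output
    delta_out = (out - y) * out * (1 - out)    # gradient of the squared error
    # 3. backpropagate the error, i.e., update the weights
    delta_h = (delta_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;   b1 -= lr * delta_h.sum(axis=0)

print(np.round(out, 2))  # predictions after training
```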

Alternating decision trees (ADTree) represent a generalization of other types of decision trees such as the classical Decision Trees, Voted Decision Trees, and Voted Decision Stumps (Nelson et al. 2011). An ADTree consists of an alternation of decision nodes and prediction nodes. Decision nodes specify a predicate condition, while prediction nodes contain a single number. The root and leaves of ADTrees will always be prediction nodes. An instance is classified by an ADTree by finding paths in the tree from the root node to leaf nodes, where all the decision nodes in between the root and the leaf nodes are evaluated as true based on the feature values of the instance. The values of the in-between prediction nodes along the corresponding paths are then summed, and this sum is used to determine the class label.

Decision tables (DT) organize knowledge in a spreadsheet format, using columns and rows, and can be denoted as a triple \(DT=(C,A,R)\), where C is the set of conditions or constraints, A is the set of actions, and R is the set of rules (Liu et al. 2010). Each rule triggers certain actions according to which conditions are fulfilled. Decision tables can be derived from decision trees, but the converse is not true. As opposed to decision trees, decision tables can include more than one “or” condition. They can be a great visual tool for domain experts from various fields who have no prior knowledge of computer science.
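A decision table \(DT=(C,A,R)\) can be sketched as a simple rule lookup; the conditions, actions, and rules below are invented purely for illustration.

```python
# Minimal sketch of a decision table DT = (C, A, R) as a rule lookup.
conditions = ("high_complexity", "low_test_coverage")        # C
actions = {"inspect": "schedule a code inspection",          # A
           "skip": "no extra action"}
rules = {                                                     # R: condition values -> action
    (True, True): "inspect",
    (True, False): "inspect",
    (False, True): "inspect",
    (False, False): "skip",
}

def decide(**observed):
    key = tuple(observed[c] for c in conditions)
    return actions[rules[key]]

print(decide(high_complexity=True, low_test_coverage=False))  # -> schedule a code inspection
```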

Ensemble learning aims to improve predictive performance by combining several base models produced by different base classifiers (Polikar 2009). There are several types of ensemble techniques, such as bagging, boosting, stacking, and voting.

Voting classifiers aggregate the output of each base classifier, typically by majority vote. Other aggregation possibilities include MaxVoting, where the classifier with the highest confidence score produces the output of the ensemble, and average voting, where the output of the ensemble is based on the average confidence score of the base classifiers.
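The sketch below shows one plausible reading of the three aggregation schemes (majority vote, MaxVoting, and average voting) applied to the predicted probabilities of hypothetical base classifiers; it is not the exact Weka aggregation used in the replicated experiments.

```python
import numpy as np

# Predicted probability of the "defective" class from three hypothetical base
# classifiers for four instances (values invented for illustration).
probs = np.array([
    [0.9, 0.2, 0.6, 0.4],  # classifier 1
    [0.7, 0.4, 0.3, 0.8],  # classifier 2
    [0.2, 0.1, 0.7, 0.9],  # classifier 3
])

# Majority vote over the hard 0/1 labels of the base classifiers.
majority = (np.mean(probs >= 0.5, axis=0) > 0.5).astype(int)

# MaxVoting: the classifier with the highest confidence decides each instance.
confidence = np.maximum(probs, 1 - probs)
best = np.argmax(confidence, axis=0)
max_vote = (probs[best, np.arange(probs.shape[1])] >= 0.5).astype(int)

# Average voting: label by the mean confidence score of the base classifiers.
avg_vote = (probs.mean(axis=0) >= 0.5).astype(int)

print(majority, max_vote, avg_vote)
```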

2.2 Related work

Various studies have investigated software defect prediction. Several researchers combined soft computing approaches to address the problem of software fault prediction, and some of them performed comparative analyses of these approaches when used in isolation. The results show that soft computing approaches tend to identify faults in the software development process with a high accuracy rate, both when used in isolation and when combined. Other related studies proposed new approaches to the problem of defect prediction. In what follows, we outline several such studies, highlighting the proposed approaches, the datasets used, the metrics considered, and the results obtained.

Pushphavathi (2017) outlines the use of integrated combined approaches for software defect prediction, instead of single classifier/clustering. The experiments use a combination of Genetic Algorithm (GA), Fuzzy c-means classifier (FCM) and Random forest (RF).

The study by Rai et al. (2017) presents an approach to predict software quality using fuzzy logic. Three categories of metrics are used, one for each phase of development: the requirements phase, the design phase, and the end of the coding phase. The results after each phase are used as input to the fuzzy inference system of the next phase.

Sharma and Chandra (2019) perform a comparative analysis of various soft computing approaches for software fault prediction. The study by Gupta and Gupta (2017) aims to investigate how the presence of overlapped instances in a dataset influences the classifier’s performance, and how to deal with the class overlapping problem. The classifiers perform significantly better when the evaluation measure takes effort into account. In the study by Rhmann (2020), the applicability of hybrid search-based algorithms for cross-project defect prediction is investigated. The results showed that hybrid search-based algorithms are more suitable for cross-project defect prediction than for within-project defect prediction.

Mustaqeem and Saqib (2021) propose a hybrid machine learning approach combining Principal Component Analysis (PCA) and Support Vector Machines (SVM) to overcome the limitations of existing approaches. They use the NASA datasets from the PROMISE repository to conduct their research. In the proposal by Bowes et al. (2018), the purpose is to find effective classification techniques for predicting software defects. The authors show that even though overall performance metrics may appear similar, there are significant differences between the four classifiers in terms of the specific defects they detect or fail to detect.

Related studies on software defect prediction introduce new approaches to the problem of defect prediction. Wu (2018) proposes the cost-sensitive kernelized semi-supervised dictionary learning (CKSDL) approach and compares it, in the within-project defect prediction (WSDP) scenario, with traditional machine learning-based SDP methods such as SVM, C4.5, and Naive Bayes to observe the performance gain. Sun et al. (2018) propose a method termed geodesic flow kernel software defect prediction (GFKSDP) to solve the problem of different distributions between the source domain and the target domain. In CPDP methods based on transfer learning, the feature distributions differ between projects even though the projects share the same feature metrics; thus, the source data and target data are mapped to a new feature space with lower dimensions. The proposed approach is compared with the TNB and NN-filter algorithms proposed by Zimmermann et al. (2009), and the results show that TNB performed better than NN-filter and GFKSDP better than TNB. The datasets used are AEEEM and Relink.

Guo et al. (2018) propose the cost-effectiveness curve as a new predictive performance measure. The experiment is based on publicly available scripts and data sets including 6 open-source projects. There are cases in which models with strong classification capability perform poorly when considering the effort.

Wang et al. (2020) propose using a powerful representation learning algorithm, deep learning, to learn the semantic representations of programs automatically from source code files and code changes.

Amasaki et al. (2021) present a preliminary evaluation of just-in-time (JIT) cross-project defect prediction (CPDP), which investigates suspicious code commits that could introduce defects into a product. The results revealed that two of the CPDP approaches are safer than the baseline, which is inconsistent with a previous study.

An improved back propagation (BP) neural network is proposed by Liu et al. (2020) and has been shown to be effective, improving the prediction accuracy degraded by the imbalanced category distribution of within-project data. Ryu et al. (2017) propose a transfer cost-sensitive boosting method that considers both knowledge transfer and class imbalance for CPDP when given a small amount of labeled target data. Through comparative experiments with transfer learning and class imbalance learning techniques, the proposed model provides significantly higher defect detection accuracy while retaining better overall performance.

Cross-project defect prediction can be employed when there are not sufficient historical data within a company, with data from other companies being used to build predictors. NezhadShokouhi et al. (2020) propose a new SDP model based on the Mahalanobis distance and metric learning to overcome high-dimensional imbalanced datasets. They demonstrate that SDP datasets described by Halstead and McCabe’s cyclomatic metrics are strictly non-Gaussian and multimodal.

Soe et al. (2018) apply the random forest (RF) algorithm, implemented in the RapidMiner machine learning tool, to the PROMISE dataset for software defect prediction. Herbold et al. (2022) aim to provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of frequently used functions on the results. The SZZ algorithm is the de facto standard for labeling bug fixing commits and finding inducing changes for defect prediction data.

Amasaki (2018) investigates whether feeding multiple older versions of a project is effective for cross-version defect prediction (CVDP) using cross-project defect prediction (CPDP) approaches. The investigation also involves performance comparisons of the CPDP approaches under the CVDP setting. Madeyski and Kawalerowicz (2017) present the idea of continuous defect prediction, according to which the prediction should be triggered with every build of a project. They create a custom dataset based on information from every commit of 1,265 projects. The dataset is publicly available for future research.

Yan et al. (2017) compare supervised with unsupervised methods for SDP using PROMISE dataset. This is a replication study and the replicated paper found that unsupervised learning is better for effort-aware change-level defect prediction, but the current paper could not verify that this is also true for effort-aware file-level defect prediction (within-project). The conclusions of the replicated paper hold, according to the authors, in a cross-project setting for file-level defect prediction.

Fu and Menzies (2017) perform a replication study of a paper which reported that unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. The authors of the replication study repeated the experiments and could not confirm the findings from the original paper on a project-by-project basis (in the original paper, the results were grouped across N projects). Nevertheless, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction, and hence, the authors suggest that perhaps a combination of unsupervised learners could be used in order to achieve comparable performance to supervised ones.

Aljamaan and Alazba (2020) and Li et al. (2019) compare tree-based ensembles for software defect prediction on 11 public datasets. The goal was to check which ensemble is better and if an ensemble is better than one individual tree. They found no significant improvement of the ensembles with respect to the individual trees.

Ha et al. (2019) compare the performance of several clustering algorithms for software defect prediction and they study the impact of several feature selection techniques. They found that using Kernel Principal Component Analysis or Non-Negative Matrix Factorization for feature selection gives better performance than using all features in the case of unlabeled data.

Humphreys and Dam (2019) achieve very good results by using self-attention transformer encoders for software defect prediction. A significant advantage of the proposed approach is the fact that the model has the capability of explaining its prediction so as to provide insights and potential causes of the decisions which might be of great interest for the industry.

Table 2 Related work on defect prediction approaches

Table 2 contains a structured overview of the approaches used for defect prediction together with the datasets used. Most of the approaches used the PROMISE dataset and almost all also used the NASA dataset. There are also studies that do not specify the dataset used. Several studies investigated defect prediction based on source code metrics using both the PROMISE and NASA datasets; however, only one study used change metrics. Regarding the approaches, they range from fuzzy logic and Naive Bayes to hybrid AI techniques.

We would like to emphasize that our replication is based on the results obtained by Zhang et al. (2018), which were published earlier than some of the recent work on defect prediction mentioned above. The aim of our replication study is not to improve aspects of the proposed approach but to confirm the results obtained previously and to provide insights regarding specific preprocessing conditions. Thus, the motivation of this research paper is as follows: first, to replicate the findings by Zhang et al. (2018), and, secondly, to investigate the influence of using various preprocessing techniques in the context of the software defect prediction problem.

2.3 The original study

The approach by Zhang et al. (2018) was selected for the replication study, in which the authors present an empirical study for cross-project defect prediction using composite algorithms. The main purpose of the replicated paper is to investigate the effectiveness of the CODEP algorithm proposed by Panichella et al. (2014) and compare it to classical ensemble classifiers, such as Voting, Bagging, Boosting and Random Forest with various configurations. There are two sets of experiments: the first set is performed on 10 open-source software projects from the PROMISE repository, and the second one on five NASA datasets. Since the context of the study was cross-project defect prediction, the methodology was to train the classifiers on nine of the PROMISE datasets and test them on the tenth dataset. Each of the 10 datasets was, in turn, a testing dataset, while the other nine were being used for training. To evaluate the results, several performance measures were considered including F-measure and AUC. The same approach was applied in the second set of experiments that used NASA datasets. The Voting classifier using the max aggregator function is reported to significantly outperform CODEP in terms of F-measure.
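The train-on-nine, test-on-one protocol can be sketched as a leave-one-project-out loop. In the sketch below the project data are randomly generated stand-ins for the PROMISE metrics and the classifier is only a placeholder, since the original and replicated experiments rely on Weka implementations of the composite algorithms.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
projects = ["ant", "camel", "ivy", "jedit", "log4j",
            "lucene", "poi", "prop", "tomcat", "xalan"]

def fake_project(n=120, n_features=20):
    """Stand-in for one PROMISE project: random metric values and defect labels."""
    df = pd.DataFrame(rng.random((n, n_features)),
                      columns=[f"m{i}" for i in range(n_features)])
    df["defective"] = rng.integers(0, 2, n)
    return df

data = {p: fake_project() for p in projects}

for target in projects:                       # each project is, in turn, the test set
    train = pd.concat([data[p] for p in projects if p != target])
    test = data[target]
    clf = RandomForestClassifier(random_state=0)
    clf.fit(train.drop(columns="defective"), train["defective"])
    pred = clf.predict(test.drop(columns="defective"))
    score = clf.predict_proba(test.drop(columns="defective"))[:, 1]
    print(target,
          round(f1_score(test["defective"], pred), 3),
          round(roc_auc_score(test["defective"], score), 3))
```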

2.4 Replication in software engineering

Replication is one of the fundamental elements of the experimental method: experiments are re-executed to confirm, refine, or challenge previous findings.

Shepperd et al. (2018) introduced several aspects that must be met for experiments to be considered replications. They outlined these aspects as follows: the authors must specify the original experiments to be replicated; replication studies should include experiments that allow external validity to be extended; both experiments must have common research questions (shared structures and interventions); and the analysis must involve a comparison of the original and new results, which are either consistent or inconsistent with the original experiment.

How to replicate software engineering experiments has also been researched by Carver (2010) and by Carver et al. (2014), who provide four elements that might be included in a replication report: (1) information on the original study, providing sufficient context for understanding the replication; (2) information about the replication itself, helping readers understand its important specifics; (3) a comparison of commonalities and differences between the original results and the results obtained within the replication; and (4) conclusions drawn across the series of studies, giving readers insights that may not be obvious from a single study.

Gómez et al. (2014) established that there are four dimensions of replication: operationalization, population, protocol, and experimenters. Fagerholm et al. (2019) recommend meticulous documentation of replication designs.

The operationalization dimension describes the act of translating a construct into its manifestation. Cause constructs are operationalized into treatments (they show the similarity of replication to baseline experiments), and effect constructs are operationalized into metrics and measurement procedures.

The population dimension refers to the experimental objects. Replications investigate the boundaries of the properties of experimental objects for which the results hold.

The protocol dimension refers to the experimental elements that, over time, may differ in replication, such as experimental design, experimental objects, guidelines, instruments of measurement, and data analysis.

The experimenters dimension relates to the people involved in the experiment. Gómez et al. (2014) specify that, when a replication is performed, the observed results should be independent of the experimenter and of the people who played each role.

The dimensions described above are contextualized within our replication plan in Sect. 3.1.

In conclusion, replication in software engineering is of paramount importance and is performed both to confirm the results and to find additional information about the replicated approach.

3 Research design and analysis

The replication approach is detailed in this section, providing an overview, then outlining the research questions, and further detailing the replication experiments along with the experimental objects. Finally, the approaches used for the analysis of the results of the experiment are presented.

3.1 Overview

The original study by Zhang et al. (2018) was replicated as an operational replication, as described by Carver et al. (2014), in which the researchers were changed and the same experimental objects, with some variation, were used. The protocol of the original study was followed. The replication study aims to address the internal and external validity threats of the original study and adds new information on the preprocessing step along with the feature selection approach.

The elements of the software engineering experimental configuration, as outlined by Gómez et al. (2014), that is, the four dimensions (operationalization, population, protocol, and experimenters), are described next in the context of our replication design.

Dimension: operationalization - Regarding operationalization, namely the cause constructs, our replication experiments are the same as the baseline experiment in that they use one of the datasets used in the original paper.

Dimension: population - With respect to population, we used the same experimental objects as in the original paper, namely, 10 Java projects with the same characteristics regarding the defect-prone instances.

Dimension: protocol - With reference to the protocol, some elements of the experimental protocol did not vary (the same 10 Java projects, the same source code, i.e., the Weka (Witten et al. 2016) implementation, and the same data analysis techniques, i.e., the F-measure and AUC metrics), while others varied in the replication (preprocessing, feature selection).

Dimension: experimenters - With regard to the experimenters, i.e., the people involved in the experiments, the replication was performed by researchers different from those of the original study. The people performing each role (experiment design, experiment execution and measurement, interpretation) were also varied.

3.2 Research questions

The research questions were formulated using the Goal Question Metric (GQM) framework, as introduced by Basili and Rombach (1988), which delineates the essential steps for defining objectives prior to commencing any software measurement activity. First, the measurement Goals are outlined and a suitable model linked to the entities being measured is selected; for each goal, Questions are derived to ascertain whether the goal is met; finally, the Metrics are chosen based on the selected model.

The goals of the study are the following:

  • G1—To confirm the original study by performing the same experiments but changing the researchers.

  • G2—To provide new insight or information on cross-project defect prediction using composite algorithms when different preprocessing methods are used and also considering partial features.

The research questions are as follows:

  • RQ1: Are the results of the original study confirmed by changing only the experimenters?

  • RQ2: What is the influence of using the normalization (min-max, z-standardization) technique on the results obtained?

  • RQ3: What is the impact of using feature selection on the results obtained?

The first research question targets replication of the original experiments, thus being a confirmatory replication, whereas the next two questions are integrated into the exploratory replication with the aim of yielding new knowledge about the composite algorithms used.

The following metrics are used to answer the research questions:

  • M1—F-measure: a metric that provides a single score that takes into consideration both precision and recall. Details are provided in Sect. 3.3.3.

  • M2—Area under the curve: a metric for assessing the quality of binary classification models, with a score ranging from 0 to 1, where a higher score indicates that the model distinguishes positive and negative instances more effectively. Details are provided in Sect. 3.3.3.

3.3 Experimental design, experimental objects and analysis

This section provides details about the experiments: the main elements of the experiments, the objects used in the experiments, and the methods used to analyze the results.

3.3.1 Experimental design

We conducted a replication study of the approach used by Zhang et al. (2018) through three sets of experiments. Figure 1 sketches the design of our replication experiments: Experiments Set1 is conducted with the aim of providing a simple replication (duplicating the exact conditions), Experiments Set2 is performed with two different pre-processing methods, and Experiments Set3 is operationalized with the aim of studying feature selection with and without pre-processing.

Fig. 1 Design of experiments

Regarding the composite algorithms investigated in the original study by Zhang et al. (2018), we selected the three best-performing composite algorithms for our experiments: average voting (AvgVoting), maximum voting (MaxVoting), and bootstrap aggregation with J48 trees (BaggingJ48).
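A minimal sketch of the pre-processing used in Experiments Set2 and of the feature selection added in Experiments Set3 is shown below; the scikit-learn transformers and the univariate selection of k features are stand-ins, since the experiments themselves are run in Weka and the exact feature selection method is not fixed here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.default_rng(0).random((100, 20))    # placeholder metric values
y = np.random.default_rng(1).integers(0, 2, 100)  # placeholder defect labels

# Set2: two pre-processing variants of the raw feature matrix.
X_minmax = MinMaxScaler().fit_transform(X)        # min-max normalization to [0, 1]
X_zscore = StandardScaler().fit_transform(X)      # z-standardization (zero mean, unit variance)

# Set3: feature selection applied on top of the standardized (or normalized) data.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X_zscore, y)
print(X_selected.shape)  # (100, 10)
```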

3.3.2 Experimental objects

Our approach uses the same dataset as the replicated study to evaluate the composite algorithms on defect prediction for ten Java projects, that is, ant, camel, ivy, jedit, log4j, lucene, poi, prop, tomcat, and xalan, which all belong to the PROMISE repository. The 10 projects are all in the Java domain and contain the same set of features as those listed in Table 1. Each dataset contains a set of classes labeled as defective or clean, together with their corresponding metrics, namely LOC and the Chidamber and Kemerer (CK) metrics, as provided in Table 1.

Table 3 summarizes the statistics for each project. The columns correspond to the project name (Name), the release version of each project (Release), the total number of classes in each project (Instances), the number of defective classes in each project (Defective Instances), and the percentage of defective classes.

Table 3 Dataset used

The dataset, namely the projects, along with the files needed for the designed experiments according to Sect. 3.3.1 are provided in the figshare link: Vescan et al. (2024).

3.3.3 Analysis of experiments

In order to compare our results with those of the original article, the same evaluation measures are used as by Zhang et al. (2018), namely Precision, F-measure and Area under the curve (AUC). This section describes them.

Precision is a metric that evaluates the proportion of accurate positive predictions made by a model out of all positive predictions. It is calculated by dividing the number of True Positives (TP) by the sum of True Positives (TP) and False Positives (FP):

$$\begin{aligned}Precision = \frac{TP}{TP+FP}. \end{aligned}$$

Precision is a useful metric in the context of imbalanced datasets, as it reflects the model’s ability to accurately identify positive instances as indeed positive. On the other hand, Recall measures the model’s ability to detect positive instances, that is, the proportion of the actual positives that were identified correctly: \(Recall = \frac{TP}{TP+FN}\), where FN stands for False Negative.

In order to accurately evaluate the effectiveness of a classifier, both precision and recall must be considered. Since improving either of them typically deteriorates the other, the F-measure (also referred to as the F1-score) is employed. The F-measure is a metric that provides a single score that takes into account both precision and recall, and it is defined as follows:

$$\begin{aligned} F_{1}=\frac{2 \times precision \times recall }{precision + recall}. \end{aligned}$$

Area under the curve (AUC) is another commonly used metric to assess the quality of binary classification models. The area under the Receiver Operating Characteristic (ROC) curve assesses a model’s performance by plotting the True Positive Rate (TPR) and False Positive Rate (FPR) at different threshold values. The AUC score ranges from 0 to 1, with a higher score indicating that the model distinguishes positive and negative instances more effectively. The True Positive Rate (TPR) is a synonym for Recall, while the False Positive Rate (FPR) is defined as follows: \(FPR=\frac{FP}{FP+TN}\).
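A small sketch computing these measures from hard 0/1 predictions follows; the label vectors are illustrative, and AUC additionally requires predicted scores rather than hard labels (it can be computed, e.g., with scikit-learn's roc_auc_score).

```python
import numpy as np

def precision_recall_f1_fpr(y_true, y_pred):
    """Compute the measures defined above from hard 0/1 predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)      # also the True Positive Rate (TPR)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)         # False Positive Rate
    return precision, recall, f1, fpr

# Illustrative ground-truth labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_recall_f1_fpr(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.25)
```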

4 Replication results

This section outlines the steps for the execution of the experiments along with the results obtained. A comparison is provided between the original and replicated experimental results.

4.1 Experiment execution

The results of the designed and executed experiments are provided and discussed in this section. As explained in Sect. 3.3.1, there are three sets of experiments; the first set of experiments uses no pre-processing, the second set of experiments uses two pre-processing techniques, and the last set of experiments also considers feature selection in both cases (no pre-processing and with pre-processing).

Results for AvgVoting F-measure, AUC and Precision are shown in Tables 4, 5 and 6 respectively.

Table 4 AvgVoting—F-measure
Table 5 AvgVoting—AUC
Table 6 AvgVoting—precision

When comparing the F-measure results for the AvgVoting method in Table 4 for the 10 projects, results similar to the original ones are obtained when using raw data without preprocessing, the average value over all projects being approximately the same, namely 0.29 (first two lines in the table). When comparing the results of the original study with the results obtained after preprocessing, it is observed that the original results are approximately the same (but slightly better) in \(40\%\) of the cases, while in the rest of the cases the preprocessing techniques show clear improvements (see Fig. 2). If feature selection is also used on standardized or normalized data, the results are improved in most cases; only two projects (jedit and prop) are not improved, as shown in Fig. 3.

Fig. 2 AvgVoting: F-measure results original vs standardized vs normalized

Fig. 3 AvgVoting: F-measure results feature selection vs standardized vs normalized

Results for MaxVoting F-measure, AUC and Precision are shown in Tables 7, 8 and 9 respectively.

Table 7 MaxVoting—F-measure
Table 8 MaxVoting—AUC
Table 9 MaxVoting—precision

When comparing the F-measure results for the MaxVoting method in Table 7 for the 10 projects (except project ant), results similar to the original ones are obtained when using raw data without preprocessing, the average value over all projects being the same (0.412, first two lines in the table). When comparing the original results (and our replication with the same raw data) with the feature selection approach, the best F-measure values are obtained in 40% of the cases for the original approach, in around 30% of the cases for the replication, and in 30% of the cases for the feature selection-based approach (see Fig. 4).

Fig. 4 MaxVoting - F-measure results original vs standardized vs normalized

When comparing the replicated experiment (without preprocessing) with the normalized and standardized preprocessing, the best F-measure is obtained in 60% of the cases for the standardized data, in 30% of the cases for the replication, and in 10% of the cases for the normalized data. However, when comparing the approaches incorporating feature selection, the best F-measure is obtained in 40% of the cases for the raw, non-preprocessed data, and in 30% of the cases each for the standardized and the normalized approaches with feature selection (see Fig. 5).

Fig. 5 MaxVoting - F-measure results feature selection - original vs standardized vs normalized

Results for BaggingJ48 F-measure, AUC and Precision are shown in Tables 10, 11 and 12 respectively.

When comparing the F-measure results for the BaggingJ48 method in Table 10 for the 10 projects, results similar to the original ones are obtained when using raw data without preprocessing, the average value over all projects being around 0.2 (first two lines in the table). When comparing the original results with the results obtained after preprocessing, we observe that the original results are approximately the same (but slightly better) in 40% of the cases, while in the rest of the cases the preprocessing techniques show clear improvements (see Fig. 6). If we also use feature selection on standardized or normalized data, the results are improved in most cases; only two projects (jedit and poi) are not improved, as shown in Fig. 7.

Fig. 6 BaggingJ48: F-measure results original vs standardized vs normalized

Fig. 7 BaggingJ48: F-measure results feature selection vs standardized vs normalized

Table 10 BaggingJ48—F-measure
Table 11 BaggingJ48—AUC
Table 12 BaggingJ48—Precision

4.2 Research questions analysis

This section answers each of the considered research questions using the results of the experiments conducted.

RQ1: Are the results of the original study confirmed by changing only the experimenters?

The experiments designed to answer this question are those that used raw data and no pre-processing (normalization and/or feature selection). The results of the current study reveal that the original results are replicated when using raw data without any pre-processing. The MaxVoting approach again obtained the best performance in terms of F-measure.

RQ2: What is the influence of using the normalization (min-max, z-standardization) technique on the results obtained?

The experiments selected to answer this question are those that used the two normalization-related preprocessing techniques. The experimental results reveal that applying a normalization technique (min-max normalization or z-standardization) yields better results than using no normalization.

RQ3: What is the impact of using feature selection on the results obtained?

The experiments selected to answer this question are those that used the feature selection preprocessing step. When feature selection is added, the results show further improvements, except for project jedit (for all three methods: AvgVoting, MaxVoting, and BaggingJ48).

4.3 Result comparison between original and replicated studies

Comparing the original study with the results of the current replication for the three methods used (AvgVoting, MaxVoting, and BaggingJ48), we notice that the same results are obtained when no preprocessing is performed, as in the original paper. The experiments conducted with normalization revealed that the results are improved in most cases, except for project jedit.

The original experiments, conducted without preprocessing, were compared with the replicated experiments conducted both without preprocessing and with feature selection, for all three methods. This comparison revealed that the original context obtained the best results in 40% of the cases, while each of the other two contexts obtained the best results in 30% of the cases. Regarding the experiments with preprocessed (standardized and normalized) data, the results are improved when using feature selection, compared to the experiments without preprocessing.

In order to verify whether the differences between our results and the results from the replicated paper are statistically significant, a Wilcoxon signed-rank test (Wilcoxon 1992) was applied. The samples of F-measure, AUC, and Precision values obtained by us in Tables 4, 5, 6, 7, 8, 9, 10, 11, 12 were tested against the samples of values obtained in the replicated paper. In the case of Precision, since only the average value is reported in the replicated paper, the test was performed against the replication experiments performed by us (i.e., the second line of Tables 6, 9, 12). In the following cases, we obtained p-values lower than 0.05, indicating that our results differ significantly from the results obtained in the replicated paper:

  • F-measure: BaggingJ48 standardized feature selection (0.0244), BaggingJ48 normalized feature selection (0.0322). See Table 10.

  • AUC: AvgVoting normalized (0.0019), MaxVoting normalized (0.0009), MaxVoting normalized feature selection (0.0009), BaggingJ48 standardized (0.0322), BaggingJ48 normalized (0.01855), BaggingJ48 normalized feature selection (0.0029). See Tables 5, 8, 11.

  • Precision: AvgVoting normalized (0.0009), AvgVoting normalized feature selection (0.0019), MaxVoting normalized (0.00097), MaxVoting normalized feature selection (0.0009), BaggingJ48 normalized (0.0097), BaggingJ48 normalized feature selection (0.1474). See Tables 6, 9, 12.

For all other results (i.e., not listed above), the obtained p-values exceeded the significance level, 0.05, so we could not reject the null hypotheses in those cases. Consequently, in these situations, even though there are many visible improvements over the results from the replicated paper, we cannot conclude that these differences are statistically significant.
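The statistical check described above can be sketched as follows, pairing per-project scores from the replication with those reported in the replicated paper; the score values below are placeholders, not the actual table entries.

```python
from scipy.stats import wilcoxon

# Placeholder F-measure values for the ten projects (replication vs. replicated paper).
replication_scores = [0.31, 0.28, 0.40, 0.22, 0.35, 0.30, 0.27, 0.33, 0.25, 0.29]
original_scores    = [0.29, 0.27, 0.38, 0.24, 0.33, 0.31, 0.26, 0.30, 0.24, 0.28]

stat, p_value = wilcoxon(replication_scores, original_scores)
print(p_value, p_value < 0.05)  # reject the null hypothesis of equal medians at alpha = 0.05
```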

4.4 Implication for practitioners and researchers within software engineering community

In this section, we discuss the implications of our replication study’s results within the software engineering community.

For the research community, our results bring valuable insights into software defect prediction techniques, particularly in the context of preprocessing and feature selection methods. One key takeaway from our study is the confirmation of the original results using raw data, validating the effectiveness of the composite algorithms investigated. Moreover, our experiments highlight the impact of preprocessing techniques, such as normalization and standardization, on the predictive performance of the algorithms. Additionally, the integration of feature selection enhances predictive performance, showcasing the importance of feature engineering in defect prediction models.

The insights gained from our study can provide valuable guidance for practitioners in assessing the quality-related aspects of their products and processes. By considering our findings, practitioners can identify areas where enhancements can be made to optimize their software development practices and enhance overall product quality. Furthermore, developers can make informed decisions when implementing defect prediction systems, ultimately leading to improved software quality and reduced development costs. Last but not least, our study underscores the importance of adopting a systematic approach to preprocessing and feature selection in defect prediction tasks, highlighting areas for optimization and refinement in real-world applications.

5 Threats to validity

Every experimental analysis may suffer from some threats to validity and biases that can affect the results of the study. In the following, some issues are outlined which may have influenced the obtained results, and the actions taken to mitigate them are also explained.

Internal validity refers to factors that could have influenced the obtained results. Examples of such factors include parameter settings or implementation issues in general. Since this is a replication study, the experiments from the original study were reproduced using exactly the same setup, namely the same tools and methods, the same parameter settings, the same datasets, and so on. By involving all three authors in the replication of the experiments, the chances of any mistake or inaccuracy are minimized.

External validity is related to the generalization of the obtained results. As possible threats to validity in this sense, the following key issues are identified: the evaluation measures used and the datasets. The selected evaluation measures are the same as those in the original paper, but other measures, such as the Area Under the Precision-Recall Curve (AUPRC), could also be considered in the future. As in the replicated paper, all 10 datasets come from the same repository (PROMISE), but datasets from other repositories could also be considered in the future.

6 Conclusion

Success in software projects is an important challenge today. One focus of the software engineering community is to predict software failures based on the past history of classes and other code elements. However, software defect prediction techniques are effective only when sufficient data are available to build the prediction model. To mitigate this issue, cross-project defect prediction is used.

The objective of this research was, first, to replicate the experiments of the original article and, second, to study additional parameters for defect prediction in order to provide new insights into the best approach. In our study, three composite algorithms were used: AvgVoting, MaxVoting, and BaggingJ48. These algorithms combine several machine learning classifiers to improve cross-project defect prediction. The experiments used preprocessing methods (min-max normalization and z-standardization) and feature selection.

The replicated experiment results confirm the original results when using raw data for the three methods. When normalization is applied, better results than in the original study are obtained. Even better results are obtained when using feature selection. In the original paper, the MaxVoting approach shows the best performance in terms of F-measure, and BaggingJ48 shows the best performance in terms of cost effectiveness. The current experiments obtained the same ranking in terms of F-measure: MaxVoting is best, followed by AvgVoting, and then by BaggingJ48.

Our results confirm those previously obtained: the original study is validated using raw data. In addition, our study achieved better results with preprocessing and feature selection.

Some limitations encountered during the proposed replicated study are the following: Firstly, replicating the exact experimental conditions, including tool dependencies, parameter settings, and computing environments, poses a significant challenge. Limited access to such resources may lead to variations in data quality and experimental setup, potentially affecting the reproducibility and generalizability of results. Secondly, variations in preprocessing techniques, feature selection methods, and model parameters between the original and replicated studies may introduce inconsistencies in results interpretation. Addressing these limitations requires meticulous attention to transparency in reporting findings to ensure the rigor and reliability of the replication study in advancing knowledge in software defect prediction.

As future work, we plan to extend this study by considering a wider range of comparisons using more classifiers, pre-processing techniques, and evaluation metrics, as well as investigating outlier detection methods and evaluating different cross-validation strategies.