Evaluating classifiers in SE research: the ECSER pipeline and two replication studies

Automated classifiers, often based on machine learning (ML), are increasingly used in software engineering (SE) for labelling previously unseen SE data. Researchers have proposed automated classifiers that predict if a code chunk is a clone, if a requirement is functional or non-functional, if the outcome of a test case is non-deterministic, etc. The lack of guidelines for applying and reporting classification techniques for SE research leads to studies in which important research steps may be skipped, key findings might not be identified and shared, and the readers may find reported results (e.g., precision or recall above 90%) that are not a credible representation of the performance in operational contexts. The goal of this paper is to advance ML4SE research by proposing rigorous ways of conducting and reporting research. We introduce the ECSER (Evaluating Classifiers in Software Engineering Research) pipeline, which includes a series of steps for conducting and evaluating automated classification research in SE. Then, we conduct two replication studies where we apply ECSER to recent research in requirements engineering and in software testing. In addition to demonstrating the applicability of the pipeline, the replication studies demonstrate ECSER’s usefulness: not only do we confirm and strengthen some findings identified by the original authors, but we also discover additional ones. Some of these findings contradict the original ones.


Introduction
The increasing adoption of machine learning (ML) and deep learning (DL) techniques in software engineering (SE) research has brought in research methods that SE researchers are not yet fully familiar with. In particular, statistical results have become a prevalent component of many SE papers.
Books such as Darrell Huff's "How to lie with statistics" (Huff 1993) helped to make the intricacies and pitfalls of statistical results part of popular culture. Many lay people understand the drawbacks of reporting only the arithmetic mean without variance: for example, the statement 'On average, students read two books per year' is not necessarily informative, since it could be drawn from very different populations such as {2, 2, 2, 2, 2} and {0, 0, 0, 0, 10}.
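The intuition can be made concrete in a few lines of Python: the two hypothetical student populations above share the same mean but have very different spreads.

```python
from statistics import mean, pstdev

# The two hypothetical student populations from the example above.
uniform_readers = [2, 2, 2, 2, 2]
skewed_readers = [0, 0, 0, 0, 10]

# Both populations have the same arithmetic mean...
print(mean(uniform_readers), mean(skewed_readers))  # 2 2

# ...but reporting the (population) standard deviation exposes the difference.
print(pstdev(uniform_readers), pstdev(skewed_readers))  # 0.0 4.0
```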
When we bring this intuition to the machine learning for software engineering field (ML4SE), are we (SE researchers) able to recognize and avoid possible pitfalls when conducting, reviewing, and reading ML4SE research? Do we understand which results we can confidently derive from our research (think of spurious correlation versus causation), and do we disseminate our results in a fair manner? Also, can SE practitioners have confidence that the results they read will translate to similar performance in an operational setting, i.e., in the software industry?
The SE research community is increasingly aware of these challenges, and some researchers have started to address them. For example, de Oliveira Neto et al. (2019) analyzed the predominant practices in empirical SE and proposed a conceptual model for a statistical analysis workflow. Kitchenham et al. (2017) discussed the importance of properly analyzing non-normally distributed data, which are common in SE. Mahadi et al. (2022) studied the effectiveness of classifiers when applied across projects, and showed the instability of the conclusion validity results.
In a broader context, the SIGSOFT empirical standards (Ralph et al. 2020) are emerging as a response to the variety of research methods employed in SE, and to the difficulties of authors and reviewers when reporting and assessing research. We align with this perspective, and we focus on the provision of guidelines for conducting and reporting ML4SE research.
In this paper, we follow the research approach described in Section 2 to study how to rigorously conduct and report SE research that makes use of automated classifiers. A classifier is an algorithm that maps each element of a data set to one or more categories (classes) selected from a pre-defined list. Nowadays, the vast majority of classifiers employ ML and DL: they learn a classification model from a training set, then they use that model to predict the categories of a test set.
Our goal is to demonstrate the usefulness of following a simple pipeline that guides the researcher while conducting and reporting on the research results. In particular, we make the following contributions:
- We introduce the ECSER (Evaluating Classifiers in Software Engineering Research) pipeline for SE researchers to follow when conducting research with automated classifiers. ECSER adopts and consolidates recommendations from recent literature in ML and statistics, and customizes some of them, when necessary, to the context of SE research. ECSER specifically aims to assist SE researchers with limited ML background in avoiding common mistakes and in presenting their results in a credible and correct manner.
- We conduct two replication studies, one in software testing and one in requirements engineering, by applying the ECSER pipeline. In doing so, we illustrate ECSER and demonstrate its applicability and usefulness. The replications through ECSER allow us to strengthen some of the conclusions of the original papers, and also to identify additional findings, some of which contradict the original results.
- As part of ECSER, we include the metrics of overfitting and degradation for assessing the expected operational performance of the classifiers. These metrics aim to provide a credible assessment of ML4SE research results for other researchers and practitioners.
- We make available a replication package (Dell'Anna et al. 2021) that the interested reader can use to apply our pipeline, to learn about it, and as a starting point for comparing their classifiers and/or their data sets.
The rest of the paper is structured as follows. Section 2 describes our research method. Section 3 discusses related work. Section 4 presents the ECSER pipeline. Section 5 applies ECSER to the classification of functional and quality requirements, while Section 6 applies it to test case flakiness prediction. Section 7 discusses the threats to validity; Section 8 concludes the paper by listing the findings and the limitations of ECSER and by outlining future work.

Research Approach
Triggered by the increasing use of classifiers in ML4SE research, and by our personal observation on the varying styles and depth of reporting on the effectiveness of classifiers in SE research, we set our main research question:

MRQ. How can we enable SE researchers to accurately conduct and report on the evaluation of automated classifiers?
To answer this question, we follow the three steps of the design cycle of Wieringa's design science research methodology (Wieringa 2014). First, we conduct problem investigation to discover the problems with the current situation, leading to our first research sub-question:

RQ1. What are the challenges with the current research practices with classifiers in SE research?
To identify these challenges, we conduct a review of the existing literature and we rely on our own observations and experience as researchers in the ML4SE field. The answer to this research question can be found in the challenges that are listed in Section 3.
The second step of Wieringa's design cycle is to design a treatment to improve the current situation. This step is mapped to our second research sub-question:

RQ2. What is an easy-to-use, tangible artifact that can assist SE researchers when conducting and reporting research on classifiers?
We answer this question by proposing ECSER, a pipeline for SE researchers to use when constructing and evaluating classifiers. The pipeline is detailed in Section 4. The design process is guided by existing ML literature and by our experience. As evidenced in Table 1, the design of ECSER is informed both by general literature on machine learning (specifically: classifiers) and by specific SE literature that made use of classifiers. We decided to create an ML4SE-specific pipeline because we could not find an explicit step-by-step process in the literature. The main steps are mostly a consolidation of the steps suggested by major machine learning textbooks (Sheskin 2020; Flach 2012; Bishop 2006). We pay special attention to statistical techniques for a robust comparison among multiple classifiers (Demšar 2006; Benavoli et al. 2016a, 2017b), a topic that has also been discussed by prominent ML4SE researchers (Menzies and Shepperd 2019). Significant discussion of the metrics (S7) occurs both in the general ML (Japkowicz and Shah 2011; Lever 2016) and in the ML4SE (Yao and Shepperd 2020) literature. The interested reader may find detailed descriptions of the topics in the papers and books listed in Table 1.
Third, Wieringa's design cycle suggests to perform treatment validation for the design artifact, which in our case leads to the following sub-questions:

RQ3. How applicable is the pipeline to ML4SE research? RQ4. How useful is the pipeline when applied to ML4SE research?
To answer RQ3 and RQ4, we conduct two replication studies in different sub-fields of SE: (i) requirements engineering, via the classification of functional and quality requirements (Hey et al. 2020a; Kurtanovic and Maalej 2017; Dalpiaz et al. 2019); and (ii) software testing, via the detection of test case flakiness (Alshammari et al. 2021a; Pinto et al. 2020). These studies, reported in Sections 5 and 6, respectively, were selected because the authors provided ready-to-use replication packages, their classifier comparison represents well the current practice regarding classification in SE research, and these studies did not assess the operational performance of the classifiers on unseen data. These replications allow us not only to derive a number of lessons learned regarding the pipeline (the object of our study), but also to identify findings concerning the replicated studies that were not present in the original publications.

Table 1 (excerpt). Literature informing the ECSER steps:
S1. Select an evaluation method: Flach (2012)
S8. Analyze overfitting and degradation: Cawley and Talbot (2010), Bishop (2006)
S9. Visualize ROC: Boyd and Eng (2013), Goadrich et al. (2006), Lever (2016), Fawcett (2006), Flach (2012)
S10. Apply statistical significance tests: Sheskin (2020), Demšar (2006), Benavoli et al. (2017b, 2016a), Salzberg (1997), Stapor (2017), Japkowicz and Shah (2011), Good (2013); ML4SE-specific: Menzies and Shepperd (2019)

Related Work
Software engineering is one of the many fruitful domains for machine learning applications. In the early 2000s, Menzies' handbook provided practical examples of the use of machine learning for software engineering problems (Menzies 2001). Zhang and Tsai (2003) listed the software engineering tasks that are powered by machine learning as i. prediction and estimation, ii. property or model discovery, iii. transformation, iv. generation and synthesis, v. reuse library and construction, vi. requirements acquisition, and vii. capture development knowledge.
Over the past 20 years, the application of ML techniques to SE problems has become increasingly prevalent. Supervised ML techniques for classification can be easily applied by non-experts using libraries such as scikit-learn (Pedregosa et al. 2011) or Weka (Hall et al. 2009). Similarly, deep learning frameworks such as TensorFlow (Abadi et al. 2015) enable their users to quickly build large-scale neural networks for machine learning tasks such as classification, among others. The many available tutorials and code snippets allow using classifiers without fully understanding the differences between the algorithms, the validation and testing options, or how to interpret the results. Therefore, ML techniques and tools are often used without proper knowledge, and the lack of understanding of the underlying complexities of the ML models may lead to poorly reported results.
In the broader ML field, researchers warned about the possible negative consequences of uninformed classification applications and provided guidelines to follow. Salzberg (1997) lists what to avoid when comparing classifiers and recommends using multiple algorithms, a benchmark, cross-validation with parameter optimization within each fold, and the binomial test to assess statistical validity. Adams and Hand (2000) discuss the use of suitable metrics for a reliable assessment of classifier performance. Demšar (2006) focuses on comparing classifiers over multiple data sets in such a way as to obtain statistically significant conclusions. Benavoli et al. (2017b) adopt Bayesian analysis to compare classifiers, and they also discuss (Benavoli et al. 2016a) the selection of post-hoc tests based on mean-ranks when comparing classifiers. Stapor (2017) provides the basic steps of classifier evaluation and lists alternative approaches for each step. Herbold (2020) presents Autorank: a software for non-experts that automatically ranks classifiers based on their performance. These are just a few examples of the complexity of conducting a solid evaluation of a classifier's effectiveness.
Despite the variety of guidelines provided by the ML research community and by the numerous textbooks on ML and statistics (e.g., Flach 2012; Sheskin 2020; Japkowicz and Shah 2011), researchers and practitioners in domains such as biomedical research (Luo et al. 2016; Tanwani et al. 2009), combinatorial science (Siebert et al. 2020), and medicine (Alonso-Betanzos et al. 2015) have raised concerns about the use and reporting of ML techniques by non-experts. Luo et al. (2016), for instance, highlight that ML is often considered "black magic" by scholars in biomedical research and that this often leads to difficulties in interpreting the reported results and to spurious conclusions, which can compromise the credibility of other studies and discourage researchers from adopting ML techniques. SE is not dissimilar from these domains, for all are heavily affected by the emergence of ML.
To assess the situation within SE research, we conducted an exploratory mapping study (raw data in our supplementary materials (Dell'Anna et al. 2021)) of the proceedings of the International Conference on Software Engineering (ICSE) from the year 2019 through 2021. We aimed to identify those papers that use classifiers and are, therefore, conducting and reporting research on classifiers in SE.
To conduct this analysis, the three authors of this paper each independently analyzed one year of the ICSE proceedings, checking relevance through the title, abstract, and full text. We first looked at the title and read the abstract when the title suggested possible relevance. If the abstract still indicated potential relevance, we checked the full paper. Since our analysis is meant to explore the problem, we decided to rely on a single annotator per paper for this preliminary task.
For each relevant paper, we collected information regarding (i) the evaluation metrics used: precision, recall, accuracy, F-score, ROC-AUC; (ii) an explicit justification for the metrics: no, related to previous work, yes, or implicit in the type of study (e.g., the RQ mentions accuracy); (iii) the inclusion of the confusion matrix, which enables the reader to determine all other metrics; (iv) the evaluation over multiple data sets, which is important for the generalization of the results; (v) the type of baseline being used: none, own, or external; and (vi) the analysis of the statistical significance of the obtained results. Table 2 summarizes the findings resulting from the exploratory mapping study. The study confirms the popularity of machine learning, and in particular of automated classification via machine learning, for software engineering research. Out of the 376 papers accepted in the technical track of ICSE, we marked 60 as related to classification tasks (circa 16% of the accepted papers).
Unfortunately, the analysis also confirms that, as in other research fields, essential details are often omitted or poorly reported in SE research when evaluating ML-based solutions, leading to hard-to-reproduce and sometimes misleading results. The analyzed papers followed various steps for their machine learning pipelines and reported their results in different ways. Among the 60 marked papers, primarily based on ML or DL, precision and recall are the most reported performance metrics (38 times each), followed by F-score (27) and accuracy (24). Only one-fourth of the papers (14/60) explicitly justify their selection of the metrics. Many (20/60) refer back to the customary metrics in the field (thus, if previous authors use inadequate metrics, the problem propagates through the community) or to the ML literature. Twenty-three studies do not provide any explanation at all. For three papers, the justification is implicit in the kind of study, e.g., the title mentions a study on accuracy and the selected metric is therefore apparent. The confusion matrix, which provides a comprehensive analysis of the performance of a classifier and can be used to compute most performance metrics (Flach 2012), is reported in only six papers. Visualization of the results with receiver operating characteristic (ROC) plots is also quite unpopular: 3/60. Interestingly, even though 55 papers compare their results with an external baseline or with the authors' previous work, just six papers report the statistical significance of the results, a practice recommended by the machine learning community (Demšar 2006) for drawing solid conclusions.
To mitigate these and other challenges, researchers and practitioners in different fields have outlined ML guidelines tailored to their particular domains (Wang et al. 2020; Luo et al. 2016; Greener et al. 2022). The intent and benefit of such guidelines are not only to make accessible to non-experts the scattered, dense, and non-trivial ML knowledge that is essential for conducting adequate research, but also to present such knowledge via examples and case studies that are relevant for the particular domain, so as to facilitate the transfer of knowledge. We follow the same idea and, to promote better practices in SE research, we devise guidelines for conducting and evaluating classifier research in SE.
Guidelines have already proven helpful in many areas of software engineering research. Jedlitschka et al. (2008), for example, provide detailed guidelines on planning and reporting controlled experiments in SE, down to the level of the title, keywords, variables, and the discussion of the experiments. Kitchenham (2004) lists the tasks and sub-tasks for planning, conducting, and reporting systematic reviews, targeting SE researchers as the audience. Similarly, Kuhrmann et al. (2017) present guidelines on designing literature studies for SE based on the authors' experience; each process step is identified, starting from preparation, and continuing with the data collection, study selection, and conclusion steps. Garousi et al. (2019) focus on reporting grey literature and conducting multivocal literature reviews for SE. Garousi and Felderer (2017) also share their guidelines for data extraction in systematic reviews based on their experience. Petersen et al. (2015) update their previous guidelines for conducting systematic mapping studies (Petersen et al. 2008), having discovered that the existing guidelines were insufficient, and provide additional guidance to support SE researchers. Fagerholm et al. (2017) propose guidelines for using empirical studies in SE education, covering learning outcomes, planning, scheduling, and the use of empirical studies for SE research.
Concerning ML4SE, however, only limited support exists for SE researchers. The work of Agrawal et al. (2021), for example, discusses good practices for hyper-parameter optimization. Rajbahadur et al. (2021), instead, focus on the impact on classifiers of the noise in the dependent variable introduced by discretization. In the context of software analytics, Menzies and Shepperd (2019) present a list of "bad smells", a term used in the agile software community to denote surface indicators of deeper problems. Examples include the focus on statistical significance rather than effect size, lack of data visualization, dangers of overfitting, and partial reporting of results. Yao and Shepperd (2020) discuss the importance of metric selection and the issues in using common metrics such as the F1-score. Clear guidelines for reporting classification-related SE research are currently missing; the relevant advice is scattered across the literature. We argue that the lack of a standard way to report classification results, which we noted in our exploratory mapping study, is partly due to the lack of, and could be mitigated with, guidelines for reporting classification-related SE research.

ECSER: A Pipeline for Evaluating Classifiers in SE Research
We present ECSER (Evaluating Classifiers in Software Engineering Research), our pipeline for SE researchers to use when conducting and reporting on SE research that evaluates one or more classifier models (algorithms). ECSER was designed following the research method described in Section 2, in order to answer RQ2, starting from extensive research into the ML literature and our own experience. Figure 1 illustrates the ten steps of ECSER, which are organized into two macro-activities: (i) the training, validation & testing of the classifier, and (ii) the analysis of the obtained results. The steps are presented sequentially for simplicity. Feedback loops are possible between the macro-activities (see the ⇔ arrows), either when one macro-activity finishes or at any time when an issue is identified. For example, if the performance metrics (S7) show high variance, the researcher may want to backtrack to treatment design and conduct additional feature engineering, or to change the classifier algorithm, and then re-execute the pipeline from S1.

Fig. 1. The ECSER pipeline for evaluating classifiers in SE research, which can be seen as the treatment validation phase in Wieringa's design science research methodology (Wieringa 2014). Steps with a dashed border are optional. The ⇔ arrows indicate that feedback loops are always possible across the macro-activities.
Looking at ECSER through the lens of Wieringa's design science research methodology (Wieringa 2014), it defines the treatment validation phase for SE researchers who are designing classifiers as their treatment. On the left of the figure, treatment design, which is outside the scope of this paper, includes important activities such as data set selection and curation, feature engineering, and algorithm selection.

Treatment Design
This macro-activity corresponds to the steps that researchers need to conduct to build their solution and develop their data set. The selection and curation of a data set focus on identifying real-world or synthetic data to assess the performance of the classifier(s). Explicit guidelines on how to transparently and credibly conduct this step are proposed, e.g., by Hutchinson et al. (2021). Algorithm selection involves the choice of ML or DL algorithms such as Support Vector Machines, Gradient Boosting, Random Forests, and Neural Networks. Typically, studies in SE research opt for multiple algorithms so that a comparison can be made (Ghotra et al. 2015; Agrawal and Menzies 2018; Kurtanovic and Maalej 2017; Hey et al. 2020a). When ML algorithms are chosen, feature engineering is necessary to construct the features that the learning algorithm uses to predict a given data item's class(es). This is a broad topic about which entire books have been written (Dong and Liu 2018; Duboue 2020). In ML4SE research, a multitude of feature types can be derived by using project management data (Montgomery et al. 2018), code metrics (Menzies et al. 2010), change metrics (Moser et al. 2008), textual artifacts (Kurtanovic and Maalej 2017), etc.

S1. Select an Evaluation Method and Split the Data
First, a researcher needs to decide on an evaluation method, i.e., which (and how) input data will be used to evaluate the classifiers, and split the data accordingly. Several alternatives exist. The simplest and one of the most popular methods is the holdout method, where the data is split into training, validation, and test sets. In this setting, the model is trained using the training set (S2), the model's hyper-parameters are fine-tuned using the validation set (S3), and the model is evaluated on the test set (S4-S5). One disadvantage of this method is that the results are possibly unstable, for the model is validated only once on the validation set. A more robust alternative is k-fold cross-validation: a test set is extracted from the data set, and the remaining data is shuffled and split into k groups (folds). k-fold cross-validation (S2-S3) consists of repeating training and validation k times. Each time, one group is held out as the validation set, and the remaining k − 1 groups are used to train the model. The model's hyper-parameters are fine-tuned with respect to the average performance on the k folds. k-fold cross-validation can be stratified to keep the positives/negatives ratio roughly even across the groups. The resulting model is then evaluated on the test set (S4-S5). Clearly, k-fold cross-validation requires more computational effort, because k models are trained and tested.
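The two splitting strategies can be sketched with scikit-learn; the synthetic data set below is a stand-in for a labelled SE data set, and the split sizes are illustrative assumptions, not prescriptions of the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, class-imbalanced data standing in for labelled SE data.
X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=42)

# Holdout: one train/validation/test split (here 60/20/20), stratified so
# the positives/negatives ratio is preserved in every subset.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# Stratified k-fold (k = 5) on the non-test data: each of the k iterations
# holds out one fold for validation and trains on the remaining k - 1.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_train, y_train):
    pass  # S2-S3: train on train_idx, validate on val_idx
```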
Specific to SE research, projects can be used to partition the data set when the data consists of multiple sub-sets from different SE projects. This is referred to as the p-fold validation method, to emphasize the by-project splits (Dalpiaz et al. 2019). Different projects can be used as the validation and test sets for the holdout method. For p-fold cross-validation, some projects can be used as test sets and, rather than randomly shuffling the remaining data into sub-groups, the projects can be taken as units. For each of the p iterations, one project is held out for validation and the remaining ones are used for training. One of the challenges is the existence of unevenly sized projects; e.g., very small or extremely unbalanced projects may lead to unreliable results when used as the validation set. A special case of the p-fold method is the leave-one-project-out (LOPO) method: similarly to k-fold cross-validation, the data of a single project is reserved for testing while the rest is used for training. The p-fold method is more recent and hence less popular. On the other hand, dividing data per project is a realistic test setting for the software engineering domain, as it allows exploring the generality of the results.
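A by-project split can be obtained in scikit-learn by treating the project as a grouping variable; the sketch below uses `LeaveOneGroupOut` with hypothetical project labels (the data and project names are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical data set: 10 labelled items from each of 3 projects.
X = np.arange(30).reshape(30, 1)                      # dummy features
y = np.tile([0, 1], 15)                               # dummy class labels
projects = np.repeat(["proj_a", "proj_b", "proj_c"], 10)

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=projects):
    # Each iteration holds out exactly one whole project (LOPO).
    assert len(set(projects[test_idx])) == 1
```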
Consider researchers who have a data set that includes data from five distinct projects; they have several options. The simplest is the holdout method, where they would randomly set aside 20% of the data for testing and use the remaining 80% for training and validation. In this one-shot setting, the results are less reliable, since the way the data is split directly impacts the results. To reduce this effect, the researchers may opt for k-fold cross-validation; if they set k to 10, they would train and validate 10 different models and report the average results. The researchers may also wonder about the generality of the results and pose the question "How would our classifier model perform on a new project?". In this case, they may leave one project out for each fold, train the classifier model with the data from four projects, and test the model on the remaining one. Then, they would report the average results.

S2-S5. Train, Validate, Test
The decisions taken in S1 shape the following four steps: the traditional training (S2), validation (S3), and testing (S5) activities in ML, plus S4, which is introduced to emphasize the need to re-train the model after hyper-parameter tuning. After the test set is separated in S1, S2 trains the classification model with the part of the data set that is not used as a test set. To train a classifier means to identify values for its parameters that allow the classifier to correctly predict the desired output for different inputs. The parameters that need to be identified depend on the selected algorithm. For example, training a Neural Network classifier means determining the weights associated with the connections between neurons. Similarly, training a Support Vector Machine means determining the coefficients of the variables of the polynomial function that characterizes the classifier.
A difference exists depending on the validation method. For the holdout method, the training step is executed by excluding the validation set; this set is used in S3 to run hyper-parameter tuning, i.e., to identify the hyper-parameters that predict the validation set best. For cross-validation, instead, the optimal hyper-parameters are identified by iterating across the k folds. Hyper-parameters are different from the parameters of the models that are trained in S2. While model parameters characterize how the input data should be transformed into the desired output, hyper-parameters define the structure of the model that is being trained. For example, the hyper-parameters of a Random Forest include the number of decision trees in the forest or the maximum depth allowed for each decision tree. An example of a hyper-parameter of Support Vector Machines is the degree of the polynomial features that characterize the model. For Neural Networks, hyper-parameters include the number of neurons in every layer and the number of layers. Several methods can be followed to perform hyper-parameter tuning. The most basic is grid search, where a model is built for every possible combination of the hyper-parameter values that one intends to evaluate (e.g., for a Random Forest, one can consider the set {10, 20, 50, 100} for the number of decision trees and the set {5, 10, 15, 20} for the maximum depth of each decision tree, and try all possible combinations of these values), and the configuration that produces the best results on the validation set is selected. Another common alternative is random search, where, instead of a discrete set of values to explore, a statistical distribution is provided for each hyper-parameter, from which values are randomly sampled.
In this step, the researchers should be careful not to overfit their model. For details on hyper-parameter optimization, see Agrawal et al. (2021).
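Both tuning strategies are available off the shelf in scikit-learn; the sketch below reuses the illustrative Random Forest hyper-parameter values mentioned above, on a synthetic data set chosen only to make the snippet runnable.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(random_state=0)

# Grid search (S3): every combination of the listed values is evaluated
# via cross-validation; refit=True (the default) re-trains the model on
# the whole training data with the best combination found (S4).
grid = GridSearchCV(rf, param_grid={"n_estimators": [10, 20, 50, 100],
                                    "max_depth": [5, 10, 15, 20]}, cv=5)
grid.fit(X, y)

# Random search: values are sampled from distributions instead of a grid.
rand = RandomizedSearchCV(rf, param_distributions={
    "n_estimators": randint(10, 101), "max_depth": randint(5, 21)},
    n_iter=8, cv=5, random_state=0)
rand.fit(X, y)
```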
In S4, the model is re-trained using the best hyper-parameters identified in S3. A typical procedure to execute this is nested cross-validation, in which the model is trained while the hyper-parameters are optimized; in terms of ECSER, this corresponds to executing S2 and S3 at the same time. The advantage of nested cross-validation is that it may reduce the model's bias toward the data set that results from standard cross-validation. Still, researchers should pay special attention to avoiding overfitting when using nested cross-validation (Cawley and Talbot 2010).
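A minimal sketch of nested cross-validation, assuming scikit-learn and a synthetic data set: the inner loop tunes the hyper-parameters, while the outer loop scores each tuned model on data the tuning never saw.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: hyper-parameter tuning (S3) wrapped as an estimator.
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid={"max_depth": [5, 10]}, cv=3)

# Outer loop: each of the 5 outer folds re-runs the tuning on its own
# training portion (S2 and S3 at the same time) and is scored on held-out
# data, giving a less optimistic performance estimate.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```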
Finally, in S5, the optimized classification model is executed on the test set. S3 and S4 are optional in Fig. 1: although hyper-parameter tuning is a good practice that may boost classifiers' performance, and that can lead to simpler tuned classifiers (e.g., a Random Forest with fewer decision trees) that perform better than more complex untuned ones (Fu et al. 2016; Tantithamthavorn et al. 2019), it is not always effective and it may require extensive computational time (Tran et al. 2020).
A recommended practice, which helps to assess the generality of a classification model, is the inclusion of multiple data sets in the test set. When performing S5, this allows not only measuring the performance on unseen data, but also running statistical tests across these multiple data sets; we discuss this topic in S10. Table 3 summarizes the training, validation, and testing phase of ECSER (i.e., steps S1-S5) by illustrating how the classification model evolves and how the data set is split in the holdout and cross-validation settings.

S6. Report the Confusion Matrix
It is common to select a few metrics and report only them. We recommend, instead, presenting the confusion matrix, so as to maximize the usefulness and information content of the reported data (Hall et al. 2012). A confusion matrix reports the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) results of a classifier. These four values comprehensively summarize the results for all classes: the readers can use them to calculate any other metrics of interest, in addition to those that the research authors find relevant for the domain. Figure 2 illustrates a generic confusion matrix alongside an example for a case in which a classifier makes no misclassification, i.e., it accurately classifies every data point in a data set D.
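The four values can be derived from gold labels and predictions in a few lines of plain Python (the labels below are illustrative):

```python
# Sketch of S6: deriving the confusion-matrix entries for a binary classifier.
def confusion_matrix(gold, pred, positive=True):
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    tn = sum(1 for g, p in zip(gold, pred) if g != positive and p != positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    return tp, fp, tn, fn

gold = [True, True, False, False, True, False]   # manually annotated labels
pred = [True, False, False, True, True, False]   # classifier output
tp, fp, tn, fn = confusion_matrix(gold, pred)
print(tp, fp, tn, fn)  # 2 1 2 1
```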

S7. Report Metrics
Depending on the domain, the researchers will report the relevant performance metrics; some examples of such metrics are shown in Table 4 (note how they all follow from the confusion matrix), and examples of their values given different confusion matrices are reported in Table 5. Precision and recall are complementary: the former evaluates the correctness of the predicted samples, and the latter (also known as sensitivity or true positive rate) assesses the coverage of such predictions over the total number of positives. The F1 score is the harmonic mean of precision and recall. Depending on the relative human cost of correcting false positives and false negatives, the weights of precision and recall may change (Berry 2021), leading to adjusted versions of the F-Score. For example, in their ML-based approach to tracing regulatory codes to product requirements, Cleland-Huang et al. employ F2, where higher weight is given to recall (Cleland-Huang et al. 2010). Specificity (true negative rate, TN/(TN + FP)) is the ratio of the correctly identified irrelevant samples to the overall irrelevant samples. Accuracy is the ratio of correct predictions to the overall cases. This metric is most suitable when the classes are balanced (Lones 2021), as the first unbalanced example in Table 5 illustrates.

The researchers should choose the metrics based on their research goal and the problem (Yao and Shepperd 2020). Consider a system that deletes chunks of code if a classifier labels them as useless. For such a system, precision is crucial, as false positives would have catastrophic results. The recall metric might be more important for another system that marks chunks of code as smelly. Other cases may call for combining recall and precision with different weights, so the Fβ metric would be suitable. Berry (2021) discusses this issue specifically for requirements engineering problems.
In addition to reporting the performance of the optimized classifier model, the metrics are also helpful when comparing multiple classifier models. Although it is common to report the metrics for the performance of the classification model on the test set, the researchers should also share the metrics for the training and validation sets, to demonstrate the evolution of the classification model's performance and to increase the replicability of their results. Especially when the k-fold cross-validation method is adopted, presenting the mean and standard deviation of the metric values across the folds, as well as the cumulative confusion matrix across the folds (i.e., a confusion matrix where the values of TP, FP, TN, and FN are the sum of all the corresponding values obtained in every fold), provides better insight into the performance of the classification model.
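As a sketch, all the metrics discussed above follow from the four confusion-matrix values (pure Python; the TP/FP/TN/FN numbers are made up for illustration):

```python
# Sketch of S7: every metric in Table 4 is a function of TP, FP, TN, FN.
def metrics(tp, fp, tn, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity, true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_beta = ((1 + beta**2) * precision * recall /
              (beta**2 * precision + recall))    # beta > 1 weighs recall higher
    return precision, recall, specificity, accuracy, f_beta

# Illustrative unbalanced case: 13 positives, 87 negatives.
p, r, s, a, f1 = metrics(tp=8, fp=2, tn=85, fn=5)
print(round(p, 2), round(r, 3), round(a, 2))
```

Note how, in this unbalanced example, accuracy is high even though recall is modest, which is why accuracy alone is most informative when the classes are balanced.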

S8. Analyze Overfitting and Degradation
This is one of the steps of ECSER that are less common in current research. Since SE research aims at principled ways of tackling practical problems, we propose that classification studies should report on the differences in performance between training, validation, and test sets.
Overfitting. ECSER quantifies overfitting by calculating the difference between the performance on the test set and the performance on the training set, using the metrics that were employed in S7. Thus, let M be the performance metric of relevance; then overfitting = M_test − M_tr. If multiple data sets are used for testing, we can compute the average overfitting as per (1), where Test is the set of data sets used for testing and tr is the training set:

average overfitting = (1/|Test|) · Σ_{test ∈ Test} (M_test − M_tr)   (1)
Based on the practical SE task that the classifier means to support, different metrics can be used to assess overfitting. Standard overfitting metrics include accuracy, mean square error (MSE) and zero-one loss (Bishop 2006). Accuracy and zero-one loss are suggested for binary classifiers whose output consists of the label assigned to the data points (e.g., an automated classifier extracts non-functional requirements from a specification document). Metrics such as MSE and other continuous loss functions (e.g., logistic or exponential loss), instead, provide a more fine-grained evaluation of the errors of the classifier, and they are more relevant for classifiers whose output is a score or a probability that a data point belongs to a particular class (e.g., a classifier that annotates non-functional requirements with the likelihood that they refer to one of the qualities from the ISO/IEC 25010 standard). Finally, when standard overfitting metrics are less relevant for the practical SE task (e.g., if it is essential that the classifiers have high recall), we recommend using the metrics in S7 also for overfitting.
As an example, consider the first six confusion matrices from (the first six rows of) Table 5, relating to six data sets of size 12. Suppose that the first matrix is obtained by a classifier c on the training set, the second matrix is obtained by c on the validation set, and the remaining four matrices are obtained on the test sets. Suppose we are interested in the accuracy metric. The average overfitting w.r.t. accuracy is then close to 0, indicating that the trained classifier presents limited overfitting w.r.t. accuracy.
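The average overfitting computation can be sketched in a few lines of Python (the metric values below are illustrative, not taken from Table 5):

```python
# Sketch of average overfitting (1): mean difference between the metric on each
# test data set and the metric on the training set.
def average_overfitting(m_tr, m_tests):
    return sum(m_test - m_tr for m_test in m_tests) / len(m_tests)

# e.g., training accuracy 0.95 and accuracy on four test data sets:
ov = average_overfitting(0.95, [0.93, 0.90, 0.94, 0.91])
print(round(ov, 3))  # -0.03: close to 0, hence limited overfitting
```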
Degradation. This metric compares the performance on the test set and that on the validation set, using the metrics of S7. Its calculation depends on the partitioning of the two sets. If both sets consist of a single project or consist of a data set that is not explicitly split into projects, degradation is the difference in the considered performance metric M: degradation = M test −M valid . In case the test set includes multiple projects or the validation is conducted via k-fold, we suggest calculating an average degradation, similarly to average overfitting in (1).
When the test set consists of multiple projects and the validation is conducted via k-fold, we recommend calculating degradation by statistically comparing the two distributions: the metric for the multiple samples of the validation set (e.g., for each of the k folds) and the metric for the various samples of the test set. If the data are normally distributed, the independent samples T-Test can be used; otherwise, the non-parametric alternative is suggested: Mann-Whitney's U test. These tests assess whether the degradation is statistically significant, i.e., whether the p-value is below a given threshold. The researchers shall combine this result with the effect size, a statistical measure describing a phenomenon's strength. In line with Sullivan and Feinn (2012), we recommend reporting the effect size also when the p-value is above the threshold, in order to better interpret the results: this case may indicate that the population is too small to derive statistically significant results, and the researcher may want to increase the number of samples/projects. As an example of effect size, Cohen's d (Cohen 2008) computes the difference between two groups of measurements in terms of their common standard deviation; a phenomenon has no effect if |d| < 0.2; a small effect if 0.2 ≤ |d| < 0.5; an intermediate effect if 0.5 ≤ |d| < 0.8; and a large effect if |d| ≥ 0.8.
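A hedged sketch of this analysis, assuming SciPy for the Mann-Whitney U test; the validation and test score samples below are invented for illustration:

```python
# Sketch of degradation analysis: k-fold validation scores vs per-project
# test scores, compared with Mann-Whitney's U test plus Cohen's d.
from statistics import mean, stdev
from scipy.stats import mannwhitneyu

valid = [0.88, 0.90, 0.86, 0.89, 0.91, 0.87, 0.90, 0.88, 0.89, 0.86]  # k=10 folds
test = [0.70, 0.74, 0.68, 0.72, 0.69, 0.71, 0.73]                     # 7 projects

u, p = mannwhitneyu(valid, test, alternative="two-sided")

# Cohen's d: difference of means over the pooled standard deviation.
n1, n2 = len(valid), len(test)
pooled = (((n1 - 1) * stdev(valid) ** 2 + (n2 - 1) * stdev(test) ** 2)
          / (n1 + n2 - 2)) ** 0.5
d = (mean(test) - mean(valid)) / pooled
print(p < 0.05, round(d, 2))  # significant degradation, large negative effect
```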
An example of analysis of the degradation via Mann-Whitney's U test and the effect size is provided in Table 14 in Section 6.2, when discussing the degradation of three state-of-the-art automated classifiers of flaky tests.

S9. Visualize ROC
Plotting the receiver operating characteristics (ROC) (Fawcett 2006) helps the reader to visually comprehend the performance of the model. A ROC plot (an example is reported in Fig. 3) has the true positive rate (TP/P) on the y axis and the false positive rate (FP/N) on the x axis.
Each point on a ROC plot graphically summarizes a confusion matrix. Indeed, a ROC plot is a coverage plot (i.e., a plot with the number of negatives in a data set on the x axis, and the number of positives on the y axis) with normalized axes. The normalized axes allow dealing with different class distributions, so that the plot is always square, and classifiers can be compared with respect to different data sets on the same plot.
In Fig. 3, we plotted the results obtained with three classifiers on a given data set. We see that both Classifier 1 and 2 dominate Classifier 3, since Classifier 3 has lower TPR than both of them but does not have lower FPR than any of them. We also see that neither Classifier 1 nor 2 dominates the other: no clear winner can be established between them, and the selection of the classifier will depend on the relative importance of TPs and FPs in the specific problem that is considered. Since they are on the same diagonal, furthermore, Classifiers 1 and 2 have the same average recall.
A ROC plot is especially useful for evaluating the performance of multiple classifiers, as described in the following.
Fig. 3: An example of ROC plot. The three data points report the results of three classifiers on a given data set.

Testing generality across data sets. Every point in the ROC plot represents the performance of one classifier on a data set. This can be useful when the test set includes multiple SE projects: the researcher can identify, via visual analysis, the relative performance of the models on these data sets. Figure 4 reports, as an example, a ROC plot that shows the performance of three classifiers on 20 different data sets (note that for each classifier, the plot contains 20 data points). By visually inspecting the plot, we can see that the performance of Classifier 1 generalizes pretty well across all data sets: all data points are clustered together. Moreover, since all data points are close to the so-called ROC heaven (the top left corner of the plot), Classifier 1 performs almost perfectly on all data sets.
Conversely, the performance of Classifier 2 is generally poor, as for all data sets the TPR is low and the FPR is high. Despite the poor performance, however, all data points are generally clustered together, indicating that the (poor) results generalize across the data sets. Classifier 3 illustrates a different case: since the data points are not clustered together, the performance results do not generalize well across data sets and, while on some data sets the TPR is high, in other data sets it is low. Visually, however, the FPR appears to be relatively consistent across data sets, indicating that the lack of generality is mainly due to the TPR.
Exploring the sensitivity-specificity trade-off. In an ML classifier, a threshold t can be used to discriminate positives and negatives: the higher the threshold, the higher the probability that the classifier requires in order to associate an item with a class. A ROC plot can be used to visualize a so-called ROC curve, which shows the performance of the classifier with different thresholds. The area under the ROC curve (AUC) provides a summary measure that averages the accuracy across the spectrum of thresholds. It is worth noting that plotting the ROC curve could sometimes be misleading in case of data imbalance (Boyd and Eng 2013; Goadrich et al. 2006). In such cases, an alternative preferred visualization is the precision-recall (PR) curve, and the associated AUPRC (area under the precision-recall curve) measure. The best classifier in a PR curve is as close to the top right as possible (for the ROC curve, the best classifier is as close to the top left as possible), where there is the best trade-off of precision and recall (Lever 2016).

Fig. 4: An example of ROC plot reporting the performance of three classifiers on 20 different data sets composing the test set.
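A minimal sketch of these summaries with scikit-learn, on a synthetic unbalanced data set of our own making:

```python
# Sketch of S9: ROC curve points, AUC, and the AUPRC alternative for
# unbalanced data, from the scores of a probabilistic classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data with an 80/20 class imbalance (illustrative only).
X, y = make_classification(n_samples=400, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)   # one ROC point per threshold
auc = roc_auc_score(y_te, scores)                # area under the ROC curve
auprc = average_precision_score(y_te, scores)    # summary of the PR curve
print(round(auc, 2), round(auprc, 2))
```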

S10. Apply Statistical Significance Tests
Seeing a difference in the values of the metrics or on a plot does not entail a statistically significant difference in the performance of different classifiers. Proper statistical testing must be conducted to confirm that two classifiers indeed have meaningfully different performance.
Testing on a single data set. The randomization test (Good 2013) can be applied when only one data set is available for testing. It is a non-parametric method, so it can be applied even if the data are not normally distributed, or if the researchers do not know the distribution of the data. It tests the null hypothesis that the two classifiers yield the same performance. A randomization test can be conducted using any performance metric. The idea is to verify whether the results obtained with a classifier are due to random chance by randomly shuffling the data and comparing the performance obtained on the randomized data with the actual one.
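A randomization test of this kind can be sketched in pure Python; here, under the null hypothesis, the per-item predictions of the two classifiers are exchangeable, so they are swapped at random (the data and the choice of accuracy as the metric are illustrative assumptions):

```python
# Sketch of a randomization (permutation) test on a single test set.
import random

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def randomization_test(gold, pred_a, pred_b, rounds=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(accuracy(gold, pred_a) - accuracy(gold, pred_b))
    extreme = 0
    for _ in range(rounds):
        swapped_a, swapped_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:                # randomly swap the predictions
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        if abs(accuracy(gold, swapped_a) - accuracy(gold, swapped_b)) >= observed:
            extreme += 1
    return extreme / rounds                        # approximate p-value

gold   = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]           # mostly correct
pred_b = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1]           # mostly wrong
p_value = randomization_test(gold, pred_a, pred_b)
print(p_value)
```

With such a small data set the observed difference in accuracy, although large, does not reach significance at the 0.05 level, which illustrates the sample-size caveat discussed for S10.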
Testing on multiple data sets. Table 6 presents an overview of state-of-the-art methods for testing statistical significance across multiple data sets. This was assembled based on the recommendations on comparing multiple classifiers by Demšar (2006), the recent work on Bayesian statistical analysis by Benavoli et al. (2016a, 2017b), and the documentation of the Autorank Python package (Herbold 2020).
When comparing two classifiers, the simplest option is the paired samples T-Test. However, this requires each of the classifiers' results to be normally distributed (Herbold 2020), which is not a common situation when comparing classifiers (Demšar 2006); furthermore, this test is sensitive to outliers (Benavoli et al. 2016a). Non-parametric tests that make no assumptions on distribution and variance are an alternative. In particular, the most common options are Wilcoxon's Signed-Rank test and the Sign test. They both rely on ranks, rather than on the absolute difference in performance (as parametric tests do). Wilcoxon's Signed-Rank is preferable because of its stronger statistical power (Demšar 2006). An alternative approach is Bayesian analysis, a paradigm shift in statistics. Benavoli et al. (2017b) explain the limitations of tests based on null-hypothesis testing and suggest Bayesian variants of the Wilcoxon Signed-Rank and Sign tests. While powerful and less affected by Type I errors, these tests require the researcher to define a region within which two classifiers are considered equivalent in practical settings. Like the previous tests, the Bayesian variants are run in a pairwise manner; the number of comparisons to make, therefore, grows quadratically with the number of classifiers.
When comparing a group of three or more classifiers, the most common strategy is to run an omnibus test that determines whether the group of classifiers differ in a statistically significant manner, and then a post-hoc test that reveals which pairs of classifiers are significantly different. We recommend two cases, in line with Autorank's documentation (Herbold 2020): (i) if the distributions are multivariate normal (Mardia 1970;Korkmaz et al. 2014), and they have approximately the same variance (sphericity assumption), the repeated measures ANOVA test can be executed as an omnibus, followed by the post-hoc test Tukey's HSD; (ii) the standard non-parametric alternative is Friedman's omnibus test, followed by Nemenyi's post-hoc (Demšar 2006).
Based on this discussion, SE researchers can safely use non-parametric tests as they do not make assumptions of normality (unbalanced data sets are likely to break this assumption). Wilcoxon's Signed-Rank and Friedman plus Nemenyi's post-hoc are the go-to options. The more ML-savvy researchers are invited to consider all the options in Table 6 based on the necessary analyses that need to be conducted prior to employing the tests. In Sections 5 and 6, we provide several examples of application of the statistical tests described above in two case studies comparing multiple classifiers on multiple data sets.
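The non-parametric go-to options can be sketched with SciPy (the F1 scores below are invented; note that Nemenyi's post-hoc is not in SciPy and is provided, e.g., by the third-party scikit-posthocs package):

```python
# Sketch of S10 across multiple data sets: Friedman omnibus over three
# classifiers, and Wilcoxon's Signed-Rank for one pair of classifiers.
from scipy.stats import friedmanchisquare, wilcoxon

# F1 scores of three classifiers on the same 10 test data sets (illustrative).
clf_a = [0.81, 0.79, 0.84, 0.80, 0.78, 0.83, 0.82, 0.80, 0.79, 0.85]
clf_b = [0.74, 0.71, 0.76, 0.73, 0.70, 0.75, 0.77, 0.72, 0.74, 0.76]
clf_c = [0.75, 0.73, 0.74, 0.72, 0.74, 0.76, 0.73, 0.71, 0.75, 0.74]

stat, p_omnibus = friedmanchisquare(clf_a, clf_b, clf_c)  # do the groups differ?
stat_w, p_pair = wilcoxon(clf_a, clf_b)                   # pairwise follow-up
print(p_omnibus < 0.05, p_pair < 0.05)
```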
In some cases, the available data might be insufficient for providing meaningful statistical results. This is why step S10 is indicated as optional in ECSER.

Multi-class and Multi-label Classification
ECSER also applies to multi-class (2+ classes) and multi-label (1+ labels per sample) classification tasks. In multi-class problems (e.g., the problem of determining whether an app review is positive, neutral, or negative), the classes that can be attributed to data points are mutually exclusive. In multi-label problems, instead, each data point can be attributed multiple labels (e.g., a non-functional requirement can be related to both performance and security quality aspects). These problems can be reduced to several binary (2 classes, 1 label per sample) classification tasks by applying a one-vs-rest strategy, which consists of fitting a separate binary classifier for each class (or label) against all other classes (labels). This strategy, applied in Section 5, is computationally efficient (it requires training n binary classifiers, where n is the number of classes or labels), and it is the most commonly used and advisable strategy, since it provides interpretable results about the specific classes/labels. With one-vs-rest, all steps of ECSER are only affected in that they need to be repeated for each class/label. ECSER also applies, with the exception of S9, to those less common multi-label classification problems where it is relevant to evaluate the classifiers w.r.t. all labels at the same time. In particular, S1-S5 are analogous, but need to be performed with a model that produces all labels for a given sample (e.g., classifier chains (Read et al. 2011)). In S6, a multi-label confusion matrix can be reported. For S7-S8, the literature offers several metrics such as the Jaccard index, the Hamming loss, or the multi-label generalizations of precision, recall, F1, and accuracy (Sorower 2010). S9 cannot directly be applied for this particular type of task, unless one-vs-rest is applied. Finally, S10 is identical: the tests described in Table 6 can be applied w.r.t. the appropriate metrics selected for S7.
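The one-vs-rest reduction can be sketched with scikit-learn's OneVsRestClassifier (the 3-class synthetic data set and the choice of Logistic Regression are illustrative assumptions):

```python
# Sketch of the one-vs-rest strategy: one binary classifier is fitted
# per class, each discriminating that class against all the others.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # one binary classifier per class
```

Each underlying binary classifier can then go through the ECSER steps (confusion matrix, metrics, ROC) for its own class, which is what makes the per-class results interpretable.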

Case #1: Functional and Quality Requirements
As a first case study to answer RQ3 and RQ4, we take the well-known problem of classifying functional and non-functional requirements (Cleland-Huang et al. 2007), which is motivated by the importance of identifying quality aspects in a requirements specification starting from the early stages of SE. In line with recent literature (Kurtanovic and Maalej 2017;Dalpiaz et al. 2019;Hey et al. 2020a), we consider two independent classification tasks: that of identifying if a requirement contains functional (isF) and non-functional (isQ) aspects, respectively.
We apply ECSER to three of the most recent classifiers of requirements available in the field: ling17 (Dalpiaz et al. 2019), km500 (Kurtanovic and Maalej 2017) and norbert (Hey et al. 2020a). We compare these classifiers for two reasons. First, they adopt different strategies and NLP approaches for requirements classification. In particular, as we detail in Section 5.1, while ling17 leverages 17 high-level linguistic features, km500 is based on hundreds of low-level word features such as n-grams and POS n-grams, and norbert relies on a deep learning model. This aspect allows us to illustrate that the application of ECSER is independent of the type of classifiers and features being evaluated. Second, the three classifiers were recently compared on the same tasks that we consider by Hey et al. (2020a). Such comparison, however, is limited to the validation of the trained classifiers, i.e., step S3 of ECSER. We can therefore consider the work by Hey et al. (2020a) as our baseline, and use it to illustrate the usefulness of following the entire ECSER pipeline.
We make two contributions to the literature:
1. We annotate six additional data sets, four of which are released publicly (see Table 7).
2. We provide additional insights by carrying out the missing steps of ECSER, specifically, by testing the performance of the trained classifiers on unseen real-world projects.

To study the generality of these classifiers in operational contexts, we consider 13 data sets from real-world projects other than PROMISE NFR: see Table 7 for an overview. According to existing terminology (Zimmermann et al. 2009), we are therefore investigating cross-project prediction. At the expense of some replicability, we decided to use private projects, protected by non-disclosure agreements with industrial partners, to ensure that our test set consists of real, operational projects.

S1: Select an Evaluation Method and Split the Data
In line with the literature, we use PROMISE NFR as the training set in S2, and we choose the holdout method to test the trained classifiers on the 13 data sets in S5. Since we wish to explore the effectiveness of the classifiers on unseen data (S5), we do not perform hyper-parameter tuning and validation: we take the classifiers as proposed in the literature, using the optimal hyper-parameters identified by the authors. Hence, we do not carry out S3-S4.

S2-S5: Train the Model and Test the Model
We use the full PROMISE NFR data set to train both ling17 and km500. We train each model separately for the two classification tasks isF and isQ. In the case of norbert, we use the pre-trained models that the authors made available in the replication package (Hey et al. 2020b), which also used PROMISE NFR as training set after hyper-parameter optimization.
We do not go into the details of the models and their features, which can be found in the corresponding papers and our online supplementary material (alongside the code and the public data sets) (Dell'Anna et al. 2021). Table 8 provides an overview of the major differences between the classifiers. We observe that both ling17 and km500 employ SVM: while ling17 uses a fixed set of 17 high-level linguistic features (e.g., dependency types), km500 considers the top 500 low-level word features (e.g., n-grams or POS n-grams) that characterize the training set. The norbert classifier, instead, adopts a transfer learning approach and is grounded on BERT (Devlin et al. 2018), the well-known deep learning model developed by Google. This step provides us with three different trained and optimized classifiers for each classification task.
We test the classifiers by studying the predictions made by the trained models for each of the 13 requirements data sets introduced in Table 7.

Results Analysis
S6-S7: Report the Confusion Matrix and Report Metrics. Table 9 reports the confusion matrices obtained with the three classifiers on the training and testing data sets for the tasks isF and isQ. In addition to giving a comprehensive overview of the performance of the classifiers, the confusion matrix can be used to derive any metrics that the reader deems relevant, as a starting point to analyze the classifiers' performance on specific data sets, and also (as shown later) to study the models' overfitting to the training data. Table 10 reports the performance results of the three classifiers w.r.t. the three metrics that were considered in Hey et al. (2020a): precision, recall and F1-score. For the test sets, we report the mean value obtained on the 13 data sets and the standard deviation; the results reported for the training refer to the performance of the classifiers on the PROMISE NFR data set. By testing the classifiers on these data sets, we are the first to report on the performance of all three state-of-the-art classifiers on previously unseen data.
By comparing the average performance of the classifiers on the test sets (Test in  Table 10), we see that norbert consistently outperforms both ling17 and km500 for both tasks and all three metrics, with the exception of recall and isF, where ling17 achieves the best results. These results confirm the findings reported in Hey et al. (2020a).
The performance of ling17 and km500 on the test sets, instead, does not seem to indicate a clear winner: the results are comparable for precision, ling17 outperforms km500 for recall in isF, while km500 is better than ling17 for recall in isQ. These results deviate from those reported in Hey et al. (2020a) when comparing ling17 and km500 on a 75-25 split of PROMISE NFR: there, the precision and recall of km500 for both tasks were about 10% higher than those of ling17, and oscillated between 80% and 90%. Here, the average precision and recall of km500 oscillate between 55% and 76% (about 20% lower than those reported in Hey et al. (2020a)), the precision of km500 and ling17 is almost identical for both tasks, and the recall of ling17 is higher for isF and lower for isQ. This divergence of results can be attributed to the overfitting of km500, an issue which was also observed in Dalpiaz et al. (2019). Indeed, the performance of km500 on the training data shows close-to-perfect results, indicating that the top-500 features selected by km500 can almost perfectly characterize the whole set of requirements in PROMISE NFR. In S8, we discuss this aspect in more detail.
The colored cells in Table 10 highlight the best-performing models. We note that it is difficult to identify an overall winner. While in some cases norbert is considerably better than the others (e.g., recall on the test sets for isQ), in other cases the difference appears to be negligible (e.g., precision on the test sets for isF). These observations motivate the execution of the following steps of ECSER in order to further analyze these mixed results. Table 10 also reports overfitting, obtained by comparing the results on the test and training sets with respect to the metrics from S7. Since we use pre-optimized classifiers, we do not analyze the performance degradation from validation to testing. Some overfitting is expected, as the testing is done on previously unseen data, but the degree of overfitting may reveal insights.

S8: Analyze Overfitting and Degradation
In Table 10, the average reduction of precision, recall and F 1 of the ling17 classifier for isF is very close to 0. The trained model does not overfit the training data; in fact, the model tends to underfit the data. This is in line with the findings reported in Dalpiaz et al. (2019) and can be explained by the low number (n=17) of features. An overfitting value close to 0, however, may positively contribute to the generalizability of the classifier's performance to unseen requirements sets.
Almost opposite results can be identified for km500, which relies on a large number (n = 500) of low-level features such as text n-grams and POS n-grams. These features, which were the most informative when fitting the training set (Kurtanovic and Maalej 2017), lead to substantial degradation when considering the test sets, reaching over 40% degradation in recall for isF. Similar results emerge for isQ; in that case, however, ling17 also degrades by circa 20%.
Finally, although the results obtained with norbert on PROMISE NFR cannot be generalized to unseen data (differently from the result of ling17 for isF), its performance on unseen data is generally better than the performance of ling17.

S9. Visualize ROC

Figure 5 reports the ROC plots for the two tasks isF and isQ. Each point graphically summarizes the confusion matrix of a classifier on a certain data set reported in Table 9. A quick visual analysis confirms the earlier results: there is no clear-cut winner when considering the 13 test data sets; the results are mixed. Let us look at isF: norbert dominates (i.e., higher TPR, the vertical axis, and lower FPR, the horizontal axis) the others on the Helpdesk data set, while ling17 dominates norbert on the Dronology data set. In other cases, the winner depends on the relative importance of TPs and FPs: for example, see the WASP or DUAP data sets.

The ROC plot also visually highlights the degree of overfitting to the training data. For isQ, for example, the performance on the PROMISE NFR training set for km500 is almost perfect (close to the ROC heaven, where TPR is 1 and FPR is 0), while ling17 has a lower performance on the training set. However, once tested on different data, we see that the performance of km500 degrades to values that are considerably lower than those obtained on the training set, and similar to, and often even lower than, those of ling17 (e.g., compare the results on ReqView, User mgmt, ERec mgmt). This suggests that the km500 classifier faces some overfitting issues. Conversely, the results of ling17 on the test sets are more clustered in the surroundings of the training data point (PROMISE) on the ROC plot, indicating less overfitting.

S10: Apply Statistical Significance Tests
We check whether the observations made so far are statistically significant. Table 11 reports the results of the statistical tests when comparing the various classifiers. In line with Table 6, we use repeated measures ANOVA and Tukey's HSD when the assumptions of multivariate normality and similar variance are met; otherwise, we employ Friedman followed by Nemenyi's post-hoc test. Please check our online appendix for all the details of the execution of the tests; here, we only provide a summary. The yellow-highlighted cells denote statistically significant results with p-value ≤ 0.05.
Tables 10 and 11 highlight that the only case in which a classifier outperforms another on all metrics, with statistical significance, is norbert over km500 for the isF task, with a large effect size for recall and F 1 , and small for precision.
Also, by looking at the ROC plot of Fig. 5 and the statistical tests of Table 11, we can identify a finding concerning the isQ task, which is the one that originated this thread of research (Cleland-Huang et al. 2006). We can observe that there is no statistically significant difference between the three classifiers for precision and F1 for the isQ task, although the omnibus test indicates a significant difference: this should be considered a false positive, and it shows the importance of running a post-hoc test after the omnibus (Tian et al. 2018).
Furthermore, while the original paper showed that km500 could outperform ling17 using cross-validation (Hey et al. 2020a), this difference disappears when considering the test set, thereby highlighting the importance of extracting the test set and of analyzing the classifier models' performance against it, and providing evidence for the usefulness of applying ECSER.

Table 11: S10: statistical tests for the performance metrics. p_f and p_a are the p-values obtained with the Friedman and repeated measures ANOVA omnibus tests, respectively. * indicates p ≤ 0.05, ** indicates p ≤ 0.01. When the omnibus test identified no difference, the post-hoc tests (Nemenyi for Friedman, Tukey's HSD for ANOVA) were ignored. In the other cases, a cell indicates significant difference. Values within brackets indicate the interpretation of Cohen's d for the measured difference between classifiers: none indicates |d| < 0.2, small indicates 0.2 ≤ |d| < 0.5, medium indicates 0.5 ≤ |d| < 0.8, and large indicates |d| ≥ 0.8.

Summary of Findings from Case #1
In the following, we summarize the key findings from the application of ECSER to the case of classification of functional and non-functional requirements.
1. S5 of ECSER confirms the findings reported in Hey et al. (2020a): norbert outperforms both ling17 and km500 for both isF and isQ tasks on unseen data in terms of precision, recall and F1 (column Test in Table 10). The only exception is recall for isF, where ling17 achieves the best results.
2. No clear winner, however, can be identified from the analysis of the performance in Table 10 and of the ROC plots in Fig. 5: km500 fits the training set best, norbert performs best on the test set, while ling17 has the smallest overfitting.
3. The only case in which a classifier outperforms another on all metrics with statistical significance (see Table 11) is norbert over km500 for the isF task. The effect size is large on recall and F1-score, small on precision.
4. While the original paper (Hey et al. 2020a) shows comparable performance for the isQ and the isF tasks, our application of ECSER to the test set reveals that the performance on isQ is lower than that on isF (Fig. 5) and shows few statistically significant differences between the classifiers (see Table 11).

Case #2: Test Case Flakiness
The second case concerns the classification of test cases as flaky, i.e., whether their pass/fail status is non-deterministic. We apply ECSER to the three flaky test classifiers compared by Alshammari et al. (2021a): FlakeFlagger (FF), the Vocabulary-Based Approach (Pinto et al. 2020) (Voc), and a combination of the features from the previous two models (VocFF). The pipeline suggested by the original work is classifier agnostic. To demonstrate applicability, the original paper uses the following shallow machine learning techniques: Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), Multilayer Perceptron (MLP), Naive Bayes (NB), AdaBoost (Ada), and K-Nearest Neighbors (KNN). In this paper, we compare the three classifiers using the algorithm that performed best according to the original authors, i.e., Random Forests. Similarly to the requirements classification case, our replication completes ECSER going beyond the validation step (S3). We rely on the feature engineering described in the original paper (Alshammari et al. 2021a): we start from the pre-processed data (which includes populated features) that we could find in FF's online appendix (file processed data with vocabulary per test.csv in Alshammari et al. (2021b)). In the repository for our online materials, we provide a summary of the steps of ECSER that we completed for both this and the previous case study (Dell'Anna et al. 2021).

S1: Select an Evaluation Method and Split the Data
The original paper (Alshammari et al. 2021a) uses stratified k-fold as a validation method. Their data set (from the replication package) was randomly shuffled and split into 10 folds. Every fold was used as the validation set (S3) and the classifiers were trained using the remaining entries of the data set as the training set (S2). The process was repeated 10 times until results were obtained for all folds. This method seems suitable, for the available data are unbalanced, and stratification combined with the k-fold method contributes to the reliability of the performance results. We adopt the same validation method, but we first extract a test set (as recommended by ECSER) in order to obtain a more credible evaluation. As shown in Table 12, we systematically select 7 projects for the test set: the one with the 2nd highest number of flaky tests, the 4th, . . . , the 14th, leading to the projects {hbase, okhttp, hector, java-websocket, httpcore, incubator-dubbo, wro4j}. This test set has 5549 test cases, 358 of which are flaky. We made this choice without setting additional selection criteria, with the main aim of keeping a sufficiently high number of samples in the training set. The other projects are used for training and validating the model via 10-fold cross-validation.
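The systematic selection above can be sketched as follows. The project names and flaky-test counts in this snippet are hypothetical placeholders, not the actual FlakeFlagger data; the real counts are listed in Table 12 and in the original pre-processed data set:

```python
def select_test_projects(flaky_counts, k=7):
    """Pick every other project (the 2nd, 4th, ..., 14th by descending
    flaky-test count) for the held-out test set."""
    ranked = sorted(flaky_counts, key=flaky_counts.get, reverse=True)
    return [ranked[i] for i in range(1, 2 * k, 2)]  # indices 1, 3, ..., 13

# Hypothetical counts for 24 projects (placeholders for the real data).
counts = {f"p{i}": 100 - i for i in range(24)}

test_projects = select_test_projects(counts)
train_projects = [p for p in counts if p not in test_projects]
```

The remaining 17 projects then feed the 10-fold cross-validation described above.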

S2: Train the Model
To cope with data imbalance in the training set, we use the SMOTE oversampling technique (as per Alshammari et al. 2021a) before training the classifiers. We train the three classifiers FF, Voc and VocFF from Alshammari et al. (2021a). They can be used with different classification models, e.g., Decision Trees, Random Forests, Support Vector Machines, Naive Bayes, etc. The features of FF are factors identified in the literature to affect the flakiness of tests, such as the execution time, the number of covered lines of code, or the number of covered classes. Voc, instead, is a vocabulary-based approach. It differs from FF in that it uses lower-level features, such as keywords, APIs or tokens contained in the test case. VocFF, finally, combines the two approaches.
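For illustration, the following is a minimal, self-contained sketch of the SMOTE idea, i.e., synthesizing minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. In the actual replication, as in the original paper, one would use a full implementation such as the SMOTE class of the imbalanced-learn package:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch (no edge-case handling): each synthetic
    sample lies on the segment between a random minority sample and
    one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # nearest neighbors, excluding self
        j = rng.choice(nn)
        gap = rng.random()                    # interpolation coefficient in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synth)

# Toy example: 3 flaky (minority) samples in a 2-D feature space.
X_flaky = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_oversample(X_flaky, n_new=10, rng=42)
```

Each synthetic point stays within the region spanned by the original minority samples, which is why SMOTE tends to produce more plausible samples than plain duplication.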

S3-S4: Hyper-parameter Tuning, Validation, and Re-training with Optimized Parameters
To validate the model, the classifiers trained in each of the 10 folds are validated on the corresponding validation sets. In Alshammari et al. (2021a), the authors compared different sampling strategies and classification algorithms. We do not repeat hyper-parameter tuning; we use their results and apply ECSER with Random Forest as the learning algorithm and SMOTE oversampling. Consequently, we also do not execute S4, since it depends on the outcome of S3.

S5: Test the Model
We conduct the following steps of ECSER up to S10, going beyond the validation step where the original paper (Alshammari et al. 2021a) stopped. In S5, we test the classifiers on the test data we extracted in S1. Table 12 summarizes the projects for the test case flakiness case: its second and third columns quantitatively describe the 24 projects based on the counts included in the pre-processed data provided by the authors of FF (Alshammari et al. 2021a). In the following, we analyze the obtained results. Tables 13 and 14 report, respectively, the confusion matrices and the performance results obtained with the three classifiers on the training set in S2, the validation set in S3, and the test set in S5. For cross-validation and testing, the values reported in the confusion matrices are the cumulative values obtained in the 10 folds and on the 7 projects in the test set, respectively. Table 14, additionally, reports an explicit analysis of overfitting (Eq. (1)) and degradation (Mann-Whitney U test and the effect size η²) of the classifiers w.r.t. the metrics from S7. The results for the validation step are in line with those reported in the original paper (Alshammari et al. 2021a): FF and VocFF perform better than Voc w.r.t. all metrics. In particular, as noted by Alshammari and colleagues, the two FlakeFlagger-based classifiers provide an increase in precision of circa 60% and in recall of about 2% compared to Voc. VocFF, furthermore, appears to perform better than FF in terms of precision and F1. The two trained FlakeFlagger classifiers characterize the training data well (note the almost perfect results on the training data).
The test set results, however, portray a radically different picture. The rows labelled Test in Tables 13 and 14 summarize the results on the 7 test data sets via, respectively, the cumulative confusion matrix and the macro-average and standard deviation. The mean precision of FF and VocFF dropped from ∼70% during validation to 9% and 12%, respectively, during testing. The average recall dropped from ∼80% to ∼5%, and the average F1 from ∼75% to only 6% and 3% for VocFF and FF, respectively. Notably, Voc, which did not lose precision on the test set, also had a more moderate degradation w.r.t. recall and F1, and the recall of Voc (the baseline in Alshammari et al. 2021a) is about 30% higher than that of FF and VocFF.
The analysis of overfitting and degradation reported in Table 14 reveals large values of overfitting (test-train); for example, the precision of VocFF decreased by 0.87 compared to the training set, and the recall decreased by 0.95. Voc is an exception: for precision, overfitting is almost zero, while for recall, the decrease is 0.55, much less than for FF and VocFF. For degradation (test-validation), unlike Section 5, we report the statistical tests since both the validation set and the test set consist of more than one sample: the validation set refers to 10 folds, the test set consists of 7 projects. These results are in line with those of overfitting. The effect size of the degradation is large for all metrics when considering both FF and VocFF, and the p-values are below the 0.01 threshold. The degradation of Voc is smaller: while the effect size is large for recall, with statistical significance ≤0.01, there is no statistically significant degradation for precision and F1.
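The degradation analysis can be sketched as follows: the per-fold validation scores (10 folds) are compared against the per-project test scores (7 projects) with the Mann-Whitney U test, and an eta-squared effect size is estimated from the normal approximation of U. The score values below are illustrative placeholders, not the measured ones, and the eta-squared formula is one common approximation rather than the only choice:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def degradation_test(validation_scores, test_scores):
    """Mann-Whitney U test between validation and test scores, with an
    eta-squared effect size estimated via the normal approximation of U."""
    u, p = mannwhitneyu(validation_scores, test_scores, alternative="two-sided")
    n1, n2 = len(validation_scores), len(test_scores)
    mu = n1 * n2 / 2
    sigma = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    eta2 = z ** 2 / (n1 + n2)  # rough eta-squared estimate (assumption)
    return p, eta2

# Illustrative numbers only: F1 near 0.75 in validation, near 0.05 on test.
val = [0.74, 0.76, 0.75, 0.73, 0.77, 0.75, 0.74, 0.76, 0.75, 0.74]
tst = [0.06, 0.03, 0.05, 0.04, 0.07, 0.05, 0.06]
p_value, eta_squared = degradation_test(val, tst)
```

With such a complete separation between the two samples, the test reports a highly significant difference with a large effect size, mirroring the pattern observed for FF and VocFF.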
These results illustrate that while the features used by FF and VocFF characterize the training set well (i.e., they have good descriptive power), the models trained with those features do not generalize well to unseen data (i.e., they do not have good predictive power). For example, a decision tree that uses the feature execution time (used by FF and VocFF) could perfectly describe the training set by learning enough rules such as: IF 0.01 ms ≤ execution time ≤ 0.0199 ms THEN not-flaky, IF 0.02 ms ≤ execution time ≤ 0.0299 ms THEN flaky, IF 0.03 ms ≤ execution time ≤ 0.04 ms THEN not-flaky, etc. However, the learned rules will not perform well on unseen data unless the features are actual predictors of flakiness. Figure 6 reports the ROC plot that compares the three classifiers. In addition to the results obtained on the training and test sets, we also report the validation results. For ease of visualization, we do not report in the figure the labels of the data sets for which the classifiers resulted in both TPR and FPR equal to 0 (i.e., the data points lying at the origin of the axes). Details can be found in our online appendix (Dell'Anna et al. 2021). The ROC plot confirms the validation results presented in Alshammari et al. (2021a): both VocFF and FF dominate Voc, with VocFF performing slightly better than FF. Thus, combining the features of FF and Voc improved the TPR without increasing the FPR too much, resulting in a slightly better classifier.
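The descriptive-versus-predictive gap can be demonstrated with a small synthetic sketch (scikit-learn assumed; the data are random, so the single feature carries no real signal, just as hypothesized for execution time):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: execution time has no real relation to flakiness.
X_train = rng.random((200, 1))        # execution time (arbitrary units)
y_train = rng.integers(0, 2, 200)     # random flaky labels
X_test = rng.random((100, 1))
y_test = rng.integers(0, 2, 100)

# An unconstrained tree memorizes the training set via interval rules...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # descriptive power: perfect
test_acc = tree.score(X_test, y_test)     # predictive power: near chance
```

The tree achieves perfect training accuracy by carving the feature range into many tiny intervals, yet performs at roughly chance level on unseen data, exactly the failure mode observed for FF and VocFF.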

S9: Visualize ROC
Moving to the test set, however, the plot confirms the analysis done in S6-S8: for almost all projects in the test set, the results obtained by the classifiers, and in particular FF and VocFF, are clustered around the origin of the axes. This indicates both a low FPR (i.e., they performed well on the negative class, which, however, was also the predominant one) and a very low TPR, i.e., poor performance on the positive class (the one of interest for the problem). Table 15 reports the results of the Friedman omnibus test (the data are not normally distributed) and the Nemenyi post-hoc tests that compare the performance of the three test case flakiness classifiers w.r.t. the different metrics. The statistical analysis confirms the considerations made in the previous steps and, together with Table 13, allows us to derive a finding that deviates from the original paper's conclusions (in which VocFF was shown to be the best), providing further evidence of the usefulness of applying ECSER: there is no significant difference in the precision and F1 of the three classifiers when predicting the flakiness of a test. Instead, the recall of Voc is significantly higher than those of FF and VocFF.

Summary of Findings from Case #2
In the following, we briefly summarize the key findings that resulted from the application of ECSER to the case of classification of test cases as flaky.
1. Validation (S3) is not always a predictor of operational performance. While FF and VocFF outperform Voc on the validation set (see Table 13), the predictive power of their features is low, as overfitting and degradation show a significant decrease (see Table 14).
2. On the test set, there is no significant difference in the precision and F1-score of the three classifiers when predicting the flakiness of a test. Instead, the recall of Voc is significantly higher than those of FF and VocFF.

Threats to Validity
We discuss the major threats to validity following the taxonomy proposed by Wohlin et al. (2012). Conclusion Validity is concerned with the relationship between the treatment (ECSER) and the outcome: its applicability (RQ3) and usefulness (RQ4). Some of the findings we obtained via the replications depend on the reliability of the statistical tests. Following the guidelines of Table 6, we chose robust tests and derived conclusions only from results with a strong significance level and a large effect size. Another threat concerns the reliability of the measures; in our case, the labelled data. We minimized this threat by using previously labelled data and by involving multiple taggers (including a non-author) for the additional data sets in the first case. It is obviously possible that, should the data be tagged by other humans, the results could differ.
Internal Validity affects the independent variable with respect to causality. The selection of the data to be included in the training, validation, and test sets may affect the causality of the findings. To mitigate these risks, we followed different strategies. In the requirements classification case, we selected PROMISE NFR as the training set, following state-of-the-art papers in the field, and we obtained data sets from external resources for the test set, as in Dalpiaz et al. (2019). We annotated the newly introduced data sets (Table 7) with the help of an external annotator. Each data sample was tagged by two annotators, and conflicts were resolved in discussion meetings. When conflicts were difficult to resolve, all annotators discussed the annotation and resolved the conflict together. For the flaky test case classification case, we allocated a subset of the original projects to the test set. We did this in a systematic way by choosing the 2nd, 4th, . . . project with the most positives. Although the selection carries some risks, we believe the effect is negligible, since the performance on the validation sets is similar to that reported in the original paper (Alshammari et al. 2021a).
Construct Validity concerns the degree to which an experiment can be used to draw general conclusions about the concept or theory behind it. ECSER itself aims to be an approach that can be used to derive more credible conclusions than the current state-of-the-art in SE research. To this end, we include steps to report more detailed results (e.g., the confusion matrix in S6), to analyze overfitting and degradation (S8), and to measure the statistical significance (S10) of the results. The generality of the conclusions, however, also depends on the input data. ECSER minimizes the effect of the data used to train, validate, and test a classifier; it is nevertheless possible that the results with different data sets may differ. Minor threats in this category may derive from our implemented code. Slight differences in the results for the flaky tests case study may stem from a different seed selection, which was not reported in Alshammari et al. (2021a). The code of the case studies is taken from the original studies where applicable, but additional code was written for the missing steps of the pipeline. All the code is available in Dell'Anna et al. (2021).
External Validity regards the ability to generalize the results beyond the studied context. First, ECSER's applicability is assessed through two cases of classification in software testing and requirements engineering. We are aware of the limited generality that can be derived from the use of only two cases, which, however, indicate clear directions, and we invite researchers from different SE areas to apply ECSER to their cases. Furthermore, we could not study whether the statistically significant differences actually translate into a noticeable difference in practical settings, and whether statistical significance is necessary for practitioners to perceive that one classifier is better than another. This is beyond the scope of the present paper, but in-vivo studies are necessary to investigate the practical effectiveness (w.r.t. work practices) of SE classifiers.

Conclusions and Outlook
In this paper, we introduced the ECSER (Evaluating Classifiers in Software Engineering Research) pipeline to guide SE researchers in accurately evaluating and reporting their research on automated classifiers. ECSER aims to assist SE researchers in avoiding common mistakes and in presenting their results credibly and correctly. ECSER adopts and consolidates recommendations from recent literature in ML and statistics and presents them for SE researchers by illustrating the concepts with examples and two case studies from the SE community.
To demonstrate the applicability and usefulness of ECSER, we conducted two replication studies, one in software testing and one in requirements engineering, by applying the pipeline. Our replications allow us to strengthen some of the conclusions in the original papers, and they reveal findings that are not present in the original studies, some of which contradict the original results.
Our overarching goal is to improve the quality of ML4SE research by proposing more rigorous ways of conducting and reporting research. To support this, we made available a replication package (Dell'Anna et al. 2021) for the interested reader to apply ECSER in order to compare their classifiers across their data sets. The pipeline is not set in stone. We call upon SE researchers to use ECSER for additional classification problems and we welcome improvements to it.
In the rest of this section, we discuss the main findings from this research. In Section 8.1, we answer the research questions that we addressed. In Section 8.2, we discuss the limitations of ECSER and we outline future work.

Answering the Research Questions
Our main research question MRQ (How can we enable SE researchers to accurately conduct and report on the evaluation of automated classifiers?) is decomposed into four sub-questions, which we address in the following.

RQ1. What are the Challenges with the Current Research Practices with Classifiers in SE Research?
We answer RQ1 by investigating recent research output in one of the most prominent software engineering research conferences. In Section 3, we identify several challenges related to ML classifiers in software engineering research. These include the risk that SE researchers use ML tools without fully grasping the theory behind them, due to the scattered nature of ML knowledge, and the lack of arguments for selecting performance metrics. Additionally, we note that in several cases (including the two case studies in Sections 5 and 6), the evaluation of classifiers is focused on validation data rather than test data, resulting in findings and conclusions that might be misleading and not generalize well, because the validation sets are often extracted from (hence, they often reflect some properties and patterns of) the training set. Finally, we identify difficulties in verifying and reproducing the reported metrics or calculating other metrics, due to the omission of the confusion matrix and the lack of analysis of the statistical significance of the results in most recent studies.

RQ2. What is an Easy-to-Use, Tangible Artifact that can Assist SE Researchers When Conducting and Reporting Research on Classifiers?
To mitigate the challenges identified with RQ1, we introduce ECSER, a pipeline for evaluating classifiers in SE research. ECSER is a tangible artifact, in the form of a 10-step process, that brings together the scattered ML knowledge and presents it in an organized way. When conducting and reporting research on automated classifiers, SE researchers can consult the pipeline to identify the key steps to follow for a reproducible and accurate evaluation. For each step, ECSER provides alternatives to choose from based on particular cases, alongside references for further details. The pipeline is also accompanied by examples from the SE field, which are typically not part of general ML guidelines.

RQ3. How Applicable is the Pipeline to ML4SE Research?
We apply ECSER for replicating and extending two studies. The first study concerns the automated classification of functional and quality requirements (Hey et al. 2020a;Dalpiaz et al. 2019;Kurtanovic and Maalej 2017), while the second study focuses on detecting test case flakiness (Pinto et al. 2020;Alshammari et al. 2021a). Since ECSER is agnostic to the underlying classification algorithms and the features used to characterize the data, we successfully apply it to compare six different classifiers (3 per case study) with different features and underlying models.

RQ4. How Useful is the Pipeline When Applied to ML4SE Research?

The application of ECSER to the two case studies also allows us to demonstrate its usefulness. In addition to solidifying the results from the original papers, through the explicit definition of an independent test set (S1), the analysis of overfitting and degradation (S8), and the execution of statistical significance tests (S10), we identify additional findings that are not mentioned in the original studies. For the requirements classification task, for example, the norbert classifier proved to be the best overall, although the statistical results show that the differences for the isQ task are mostly not significant. For the flaky test classification task, the results are even more striking: contrary to the original results, they indicate that the baseline classifier (Voc) achieves significantly higher recall, and comparable precision, on the test set. By comparing the two cases in terms of overfitting and degradation (Table 10 vs. Table 14), we conclude that ML-based flaky test identification is an area where more research is necessary. Our results reveal limited predictive power for the used features, especially in terms of precision (in the 9%-15% range).

Limitations and Future Work
Applying all the steps of the pipeline might not be trivial. This is particularly true for step S10: statistical significance tests. In some cases, the available data might be insufficient for providing meaningful statistical results. This is why step S10 is indicated as optional in ECSER. However, the lack of sufficient data for providing meaningful statistical results should not be ignored but reflected in the discussion and conclusions drawn from the results reported in the paper. Additionally, while we attempted to provide an accessible overview of the state-of-the-art methods for testing statistical significance (e.g., see Table 6) with both recommendations and references to available implementations (e.g., the Autorank Python package (Herbold 2020), and our online appendix (Dell'Anna et al. 2021)), the selection of an appropriate statistical test and the interpretation of the results need a basic understanding of statistics and of the different tests. Future work should study the applicability of this step through human evaluations with SE practitioners having different degrees of familiarity with statistics.
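As a minimal sketch of such a test-selection step, the normality of the score distributions can drive the choice between a parametric and a non-parametric omnibus test. This is a simplification: scipy's one-way ANOVA stands in here for the repeated measures ANOVA that a full pipeline would use (e.g., via statsmodels' AnovaRM), and the illustrative score groups are hypothetical:

```python
from scipy.stats import shapiro, friedmanchisquare, f_oneway

def omnibus_test(*score_groups, alpha=0.05):
    """Choose the omnibus test from a Shapiro-Wilk normality check:
    non-normal scores fall back to the non-parametric Friedman test.
    (f_oneway is a simplified stand-in for repeated measures ANOVA.)"""
    normal = all(shapiro(g)[1] > alpha for g in score_groups)
    if normal:
        return "anova", f_oneway(*score_groups)[1]
    return "friedman", friedmanchisquare(*score_groups)[1]

# Hypothetical per-fold scores for three classifiers; the outlier in g1
# makes its distribution clearly non-normal.
g1 = [0.70, 0.72, 0.71, 0.73, 0.70, 0.72, 0.71, 0.74, 0.72, 0.05]
g2 = [0.60, 0.62, 0.61, 0.63, 0.60, 0.62, 0.61, 0.64, 0.62, 0.63]
g3 = [0.50, 0.52, 0.51, 0.53, 0.50, 0.52, 0.51, 0.54, 0.52, 0.53]

test_name, p_value = omnibus_test(g1, g2, g3)
```

Because the normality check fails, the function falls back to the Friedman test, which is the path taken in both of our case studies.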
Step S3 (hyper-parameter tuning and validation) might require more computational power and time from the researchers who apply the pipeline. The degree to which the classifiers should be optimized depends on the maturity of the evaluated classifiers. This is why steps S3-S4 of ECSER are also indicated as optional. Hyper-parameter tuning, however, has been shown to lead to simpler classifiers that perform better than more complex non-tuned ones (Fu et al. 2016; Tantithamthavorn et al. 2019), so additional effort put into executing these steps might lead to better and more interpretable outcomes. In general, applying ECSER requires more effort from researchers than just reporting, for example, the performance obtained on the test set. However, we believe that this effort is necessary for reporting solid, accurate, and non-misleading results and supporting replicability and reproducibility in ML4SE (Liu et al. 2021). In this sense, this paper attempts to minimize practitioners' efforts. In the future, we intend to conduct a controlled experiment with SE students about evaluating automated classifiers with and without the help of ECSER, to provide additional evidence about the pipeline's usefulness (also in the context of education).
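A minimal hyper-parameter tuning sketch for S3-S4, assuming scikit-learn and a synthetic imbalanced data set standing in for the real features (the grid shown is deliberately tiny; real studies would tune more parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in for the flaky-test features (90% negatives).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9],
                           random_state=0)

# S3: grid search over hyper-parameters with stratified cross-validation;
# S4 would then re-train on the full training set with the best setting
# (GridSearchCV does this automatically via refit=True, its default).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
).fit(X, y)

best = grid.best_params_  # configuration carried into S5 (testing)
```

Stratification matters here for the same reason as in S1: with few positives, unstratified folds could end up with no flaky samples at all.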
ECSER aims to concisely present the steps required to evaluate classifiers. This paper, however, does not replace ML and statistics textbooks. In terms of completeness, for example, ECSER covers the evaluation of multiple classifiers but does not discuss essential aspects of constructing classifiers, such as data set selection and curation, feature engineering, and algorithm selection. These are currently considered as an input of the pipeline (leftmost block in Fig. 1). We intend to study whether such tasks present a challenge for SE practitioners by investigating whether appropriate classifiers and training sets are used for ML4SE research. This may lead to additional guidelines in that direction. Moreover, given the depth of each technique and concept discussed, the description of the several steps of ECSER might lack details for curious readers. We provide several references for every step of ECSER to satisfy their curiosity. Our two case studies also illustrate how the pipeline is applied. As an empirical evaluation of the pipeline, we intend to assess the most critical and least obvious steps of the pipeline. In the long run, we aim to construct a comprehensive resource for researchers. Our online appendix is the first step in this direction.