Classifying generated white-box tests: an exploratory study

White-box test generation analyzes the code of the system under test, selects relevant test inputs, and captures the observed behavior of the system as expected values in the tests. However, if there is a fault in the implementation, this fault could get encoded in the assertions (expectations) of the tests. The fault is only recognized if the developer, who is using test generation, is also aware of the real expected behavior. Otherwise, the fault remains silent both in the test and in the implementation. A common assumption is that developers using white-box test generation techniques need to inspect the generated tests and their assertions, and to validate whether the tests encode any fault or represent the real expected behavior. Our goal is to provide insights about how well developers perform in this classification task. We designed an exploratory study to investigate the performance of developers. We also conducted an internal replication to increase the validity of the results. The two studies were carried out in a laboratory setting with 106 graduate students altogether. The tests were generated in four open-source projects. The results were analyzed quantitatively (binary classification metrics and timing measurements) and qualitatively (by observing and coding the activities of participants from screen captures and detailed logs). The results showed that participants tend to incorrectly classify tests encoding both expected and faulty behavior (with median misclassification rate 20%). The time required to classify one test varied broadly with an average of 2 min. This classification task is an essential step in white-box test generation that notably affects the real fault detection capability of such tools. We recommended a conceptual framework to describe the classification task and suggested taking this problem into account when using or evaluating white-box test generators.


INTRODUCTION
Due to the ever-increasing importance of software, assessment of its quality is essential. In practice, software testing is one of the most frequently used techniques to improve software quality. Thorough testing of software demands significant time and effort. To alleviate the tasks of testers and developers, several automated techniques have been proposed [2]. These advanced methods are often available as off-the-shelf tools, e.g., Pex/IntelliTest [42], Randoop [29], or EvoSuite [15]. Some of these techniques can rely only on the source/binary code to select relevant inputs. For the selected inputs these white-box test generators record the implementation's actual output in test asserts. However, if only the implementation is used, the assertions created in the generated test cases contain the observed behavior, not the expected.
As these techniques and tools evolve, more and more empirical evaluations are required to assess their usefulness. In most of the studies, the tools were evaluated in a technology-oriented setting (e.g., [23,39,44]). Only a limited number of studies involved human participants performing prescribed tasks with the tools [13,16,37].
A common aspect when evaluating the effectiveness of test generator tools is the fault detection capability of the generated tests. Related studies [14,16,28,30,36,37,39] typically employ two metrics for this purpose: 1) mutation score or 2) number of detected faults. Although mutation score has been shown to correlate with real fault detection capability [21], it has limitations to be aware of [31]. The number of detected faults is usually measured using a faulty version (with injected faults) and a fault-free version (original) of the code under test. If a generated test passes on the faulty version and fails on the original, it is considered a fault-detector. However, a fundamental problem is that we have no a priori knowledge about the correctness of the implementation in real scenarios (i.e., there is no faulty and correct version, see Figure 1). If the test generator uses only the program code, then the user of the test generator must validate each assertion in the generated test code to decide whether the test encodes an expected or a faulty behavior. Although this is very simple for trivial errors, it could be rather complex in case of slight mismatches between the implementation and its intended behavior. Note that the number of generated tests to examine could be decreased with implicit or derived oracles [4] (e.g., robustness or regression testing), but if the generated tests are used for functional testing, then in the end some of the validation needs to be performed by the developers and testers. However, it is not evident that humans can correctly identify all faults that can possibly be detected using the generated tests. Although some experiments (e.g., Fraser et al. [16]) mentioned this potential issue, most of the related studies do not consider it as a validity threat during their evaluations. The consequence of this is that the practical fault-finding capability of the test generators can be much lower than presented in experimental evaluations.
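The fault-detector criterion described above can be illustrated with a minimal sketch. The record type, the field names, and the sample outcomes are hypothetical and serve only to show the counting rule; they are not data from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedTestRun:
    """Outcome of one generated test on both code versions (hypothetical record)."""
    name: str
    passes_on_faulty: bool
    passes_on_original: bool


def is_fault_detector(run):
    # The test was generated from the faulty version, so it passes there;
    # it counts as a fault-detector only if it also fails on the original.
    return run.passes_on_faulty and not run.passes_on_original


def count_fault_detectors(runs):
    return sum(1 for run in runs if is_fault_detector(run))


# Illustrative outcomes only.
runs = [
    GeneratedTestRun("T0", passes_on_faulty=True, passes_on_original=True),
    GeneratedTestRun("T1", passes_on_faulty=True, passes_on_original=False),
    GeneratedTestRun("T2", passes_on_faulty=True, passes_on_original=False),
]
detectors = count_fault_detectors(runs)
```

Note that this rule presupposes that both versions are available, which is exactly the assumption that does not hold in real scenarios.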
Thus the question that motivated and served as a basis of our research is the following: How do developers perform in using the tests generated from code to detect faults and decide whether the implementation is correct? 1 This question is mainly motivated by the fact that the actual fault-finding capability of white-box test generator tools could be much lower than reported in already existing experiments due to the classification performance of tool users.
1 Note that if a test generated from a faulty implementation encodes a fault but passes, then the test can be considered faulty as well. Therefore classifying the tests as faulty or correct could reveal a faulty implementation.
We designed and performed an exploratory study with human participants that covers a realistic scenario resembling developers testing previously untested code with the help of test generators. The participants' task was to classify tests generated by Microsoft IntelliTest [42] according to whether they encode faulty or correct behavior for two open source projects carefully selected from GitHub. The activities of participants were recorded using both logging and screen capture, and were analyzed quantitatively and qualitatively by coding the observed behaviors in each of the videos.
Our results show that deciding whether a test encodes faulty behavior is a challenging task even in a laboratory setting. Only 2 of the 54 participants were able to classify all 15 tests correctly, and the median misclassification rate reached 33% for fault-encoding tests. Surprisingly, a large number of correct tests were also classified as faulty (misclassification rate 25%). The time required to classify one test case varied largely, but on average classifying one test required 2 minutes. Finally, the effect of the experience of the participants on the classification performance was analyzed.
In experimental research, replications and secondary studies are vital to increase the validity of results. Thus we made the dataset, the videos, the coding of behavior and the full analysis scripts available for further use [19]. The main contributions of the paper are as follows.
• We designed a new exploratory study with human participants to investigate the importance of the classification of generated tests (Sect. 3).
• We performed the study with 54 participants (Sect. 4) and analyzed the results (Sect. 5), showing evidence that classification of generated tests is not easy and humans cannot necessarily detect all faults.
• We drew conclusions from the results and gave recommendations for further studies and replications (Sect. 6).

RELATED WORK
Test generation and oracles. Anand et al. [2] present a survey about test generation methods, including those that generate tests only from binary or source code. As these methods do not have access to a specification or model, they rely on other techniques than specified test oracles [4]. For example, for certain outputs implicit oracles can be used: a segmentation fault is always a sign of a robustness fault [38], while finding a buffer overflow means a security fault [7]. Other implicit oracles include general contracts like o.equals(o) is true [29]. However, test generators usually generate numerous tests passing these implicit oracles. For handling these tests there are basically two options. On the one hand the developer could specify domain-specific partial specifications (e.g., as parameterized tests [42] or property-based tests [9]). On the other hand the tools usually record the observed output of the program for a given test input in assertions, and the developer could manually examine these asserts to check whether the observed behavior conforms to the expected behavior.
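As an illustration of implicit oracles, the following minimal sketch checks two generic properties that need no specification: construction must not crash (a robustness oracle), and equality must be reflexive (the analogue of the o.equals(o) contract). The function and class names are hypothetical and not taken from any of the cited tools.

```python
def passes_implicit_oracles(factory):
    """Return True if an object built by `factory` passes two implicit oracles."""
    try:
        obj = factory()  # robustness oracle: construction must not crash
    except Exception:
        return False
    return obj == obj  # general contract: equality must be reflexive


class BrokenEquality:
    """Hypothetical class that violates the reflexivity contract."""

    def __eq__(self, other):
        return False


ok_result = passes_implicit_oracles(lambda: [1, 2, 3])
broken_result = passes_implicit_oracles(BrokenEquality)
```

A test that passes such checks may still encode a functional fault, which is why the remaining tests need a different kind of validation.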
In our paper we consider this latter case, i.e., where there is no automatically processable full or partial specification, the generated tests were already filtered by the implicit oracles, and the remaining tests all passed on the implementation, but we cannot be sure if they encode the correct behavior. In this case derived oracles are commonly used to decrease the number of tests to manually examine or to ease the validation. For example, existing tests can be used to generate more meaningful tests [25], similarity between executions can be used to pinpoint suspicious asserts [32], or clustering techniques can be used to group potentially faulty tests [1]. Moreover, if there are multiple versions of the implementation (e.g., regression testing [47] or different implementations for the same specification [29]), tests generated from one version could be executed on the other one. However, even in this scenario, tests do not detect faults, but merely differences that need to be manually inspected (e.g., a previous test can fail on the new version not because of a fault, but because a new feature has been introduced). In summary, none of these techniques can classify all tests perfectly, and the remaining ones still need to be examined by a human.
Testing studies involving participants. Juristo et al. [20] collected testing experiments in 2004, but only a small number of the reported studies involved human subjects (e.g., Myers et al. [27], Basili et al. [5]). More recently, experiments evaluating test generator tools were performed: Fraser et al. [16] designed an experiment for testing an existing unit either manually or with the help of EvoSuite; Rojas et al. [37] investigated using test generators during development; Ramler et al. [36] compared tests written by the participants with tests generated by the researchers using Randoop; and Enoiu et al. [13] analyzed tests created manually or generated with a tool for PLCs. These experiments used mutation score or correct and faulty versions to compute fault detection capability.
Related studies. We only found two studies that are closely related to our objective. In the study of Staats et al. [41] participants had to classify invariants generated by Daikon. They found that users struggle to determine the correctness of generated program invariants (that can serve as test oracles). The object of the study was one Java class, and tasks were performed on printouts. Pastore et al. [33] used a crowdsourcing platform to recruit participants to validate JUnit test cases based on the code documentation. They found that the crowd can identify faults in the test assertions, but misclassified several harder cases. These studies suggest that classification is not trivial. Our study extends these results by investigating the problem in a setting where participants work in a development environment on a more complex project.

STUDY PLANNING

Goal and method
Our main goal was to study whether developers can use and validate the tests generated only from program code by classifying whether a test encodes a correct behavior or a fault.
As there is little empirical evidence about the topic, to understand it better we followed an exploratory and interpretivist approach [45]. We formulated the following base-rate and relationship research questions [12].
RQ1 How do users of white-box test generation perform in the classification of generated tests?
Note that RQ3 is intentionally defined so that it requires a mix of exploratory and post-hoc analyses.
As these test generator tools are not yet widespread in industry we selected an off-line context. We employed an exploratory study in a laboratory setting using students as human participants. Our research process involved both qualitative and quantitative phases. We collected data using both observational and experimental methods. The data obtained was analyzed using exploratory data analysis and statistical methods, and by coding behaviors in screen capture videos. For the design and reporting of our study, we followed the guidelines of empirical software engineering [12,45,46].

Variable selection
Understanding and validating generated tests is a rather complex task that can be affected by numerous variables. We focus on the following independent variables. For each variable possible levels are listed, from which the bold ones are selected for our study.
The following dependent variables are observed:
• Answers of participants: Classification of each test as OK (correct) or wrong (faulty).
• Activities of participants: What activities do the participants perform during the task (e.g., running tests).
• Time spent by participants: How much time do the participants spend on each individual activity.
Note that as this is exploratory research there is no hypothesis yet, and because the research questions are not causality or comparative questions, all independent variables had fixed levels (i.e., there are no factors and treatments).

Subjects (Participants)
Our goal was to recruit people who are already familiar with the concepts of unit testing and white-box test generation. We recruited participants from MSc students enrolled in one of our V&V university courses. They were suitable candidates as the course has covered testing concepts, test design, unit testing and test generation (5 × 2 hours of lectures, 3 hours of laboratory exercises and approximately 20 hours of group project work).
Participation in the study was optional, but we motivated it by giving extra points (approximately 5% in the final evaluation of the course) for participation. However, we emphasized that these points were given independently of the experiment results so as not to create any negative performance pressure.

Test generator tool
As our V&V course has laboratory exercises with the IntelliTest test generation tool, we chose this tool for the study as well. IntelliTest (formerly known as Pex [42]) is a state-of-the-art dynamic symbolic execution-based test generator. IntelliTest currently supports the C# language and is integrated into Visual Studio 2015. IntelliTest's basic concept is the parameterized unit test, which is a test method with arbitrary parameters called from the generated test cases with concrete arguments.
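The relation between a parameterized unit test and the generated test cases can be sketched as follows. This is a Python analogue for illustration only: IntelliTest itself targets C#, and the unit under test, the function names, and the concrete arguments are all hypothetical.

```python
def safe_divide(a, b):
    """Hypothetical unit under test."""
    if b == 0:
        raise ZeroDivisionError("b must be non-zero")
    return a // b


def divide_put(a, b):
    """Parameterized unit test: exercises the unit for arbitrary arguments."""
    return safe_divide(a, b)


# Generated test cases: each one calls the parameterized test with concrete
# arguments and asserts the behavior that was observed during generation.
def test_divide_1():
    assert divide_put(6, 3) == 2


def test_divide_2():
    try:
        divide_put(1, 0)
    except ZeroDivisionError:
        return  # the observed exception is recorded as the expectation
    raise AssertionError("expected ZeroDivisionError")
```

The assertions in the generated cases describe what the implementation did, not necessarily what it should do, which is the source of the classification problem studied here.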

Objects (Projects and classes)
The main requirements towards the objects were that they should be written in C#, IntelliTest should be able to explore them, and they should not be too complex, so that participants could understand them during the task. However, they should contain multiple non-trivial classes depending on each other; otherwise faults could be easily identified just by reviewing the code of the class under test. We did not find projects satisfying these requirements in previous studies, thus we searched for open source projects.
Our project selection criteria included the following.
• Shall have at least 500 stars on GitHub: this likely indicates a project that really works and filters out prototypes and non-working code.
• Should not have any relation to graphics, user interface, multi-threading, multi-platform execution: these may introduce difficulties for the test generator tool.
• Shall be written in C# language: IntelliTest only supports this language.
• Shall be compilable in a few seconds: this enables users to run fast debugging sessions during the experiment.
We decided to use two different classes from two projects with vastly different characteristics. The selection criteria for the classes were the following.
• Shall be explorable by IntelliTest without issues to have usable generated tests.
• Shall have more than 4 public methods to have a reasonable amount of generated test cases.
• Shall have at least 1 external invocation pointing outside the class, but not more than 3. This ensures a fault-injection location for the experiment.
• Shall have at least partial commented documentation to use as specification.
Based on pilots, we found that participants can examine 15 tests in a reasonable amount of time. To eliminate the bias possibly caused by tests for the same methods, we decided to have the 15 cases for 5 different methods.

Project and class selection.
Finding suitable objects turned out to be much harder than we anticipated. We selected 30 popular projects from GitHub as candidates that seemed to satisfy our initial requirements. However, we had to drop most of them: either they heavily used features not supported by IntelliTest (e.g., multithreading or graphics), or did not have inter-class dependencies, or would have required extensive configuration (e.g., manual factories, complex assumptions) to generate non-trivial test values.
Finally we kept the two most suitable projects:
• Math.NET Numerics [24] is a .NET library that offers numerical calculations in probability theory or linear algebra. It contains mostly data structures and algorithms.
• NBitcoin [26] is a more business-like library, which is available as the most complete BitCoin library for .NET.
Table 1 lists the selected classes and methods of the two projects. The Combinatorics class implements enumerative combinatorics and counting: combinations, variations and permutations, all with and without repetitions. The AssetMoney class implements the logic of the Open Asset protocol for arbitrary currencies that have a conversion ratio to BitCoin.
Most of the selected methods had method-level comments originally containing the description of correct behavior, but we extended them slightly based on the feedback from pilots. They are still not perfect, but they represent comments used in real projects.

Fault selection and injection.
To obtain fault-encoding tests from IntelliTest, faults need to be injected into the objects. There are several alternatives to obtain such faults, each of which would affect the validity of the study. As we were not able to extract meaningful faults for the selected classes from the version history of the project, we used artificial faults in a systematic way. We selected different, representative fault types [11] from the Orthogonal Defect Classification [8]. The cited survey identifies the most commonly committed fault types in real-world programs. We used this survey as the source of the selected faults: we selected each from the top quarters of the ODC categories (see Table 2). During the injection procedure we made sure that the faults 1) have no cross-effects on each other, and 2) have no effect on behavior other than the intended one. We injected three faults in both projects. In case of NBitcoin we injected the faults inside the selected class, while in case of Math.NET into the project's other classes.
Generated tests.
We generated tests with IntelliTest for each selected method using parameterized unit tests. Tests were generated from the version already containing the above faults.
There were methods where IntelliTest could not generate values that cover interesting behaviors. In these cases, we extended the parameterized unit tests with special assumptions that request at least one test case from IntelliTest with values that fulfill the preconditions. From each test case set, we selected 3 test cases for the study. We chose the most distinct cases that cover vastly different behaviors in the method under test. Each test case was given an identifier ranging from 0 to 14 (therefore both NBitcoin and Math.NET have tests T0 to T14). Furthermore, the corresponding method is indicated with a suffix in each test case identifier. Thus for the first method, three cases were generated: T0.1, T1.1 and T2.1. IntelliTest generates one test file for each method, but we moved the test cases into individual files to help tracking the participants.
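How a test generated from a faulty version ends up encoding the fault can be sketched in a few lines. The example mirrors the wrong arithmetic expression listed for fault F8 (n - k + 1 instead of n + k - 1 in the FactorialLn argument). The surrounding formula, the rounding step, the function names, and the chosen inputs are assumptions of this sketch and are not taken from the Math.NET source.

```python
import math


def factorial_ln(n):
    """Natural logarithm of n!, a stand-in for a FactorialLn-style helper."""
    return math.lgamma(n + 1)


def combinations_with_repetition(n, k, faulty=False):
    # Intended behavior: C(n + k - 1, k). The faulty variant passes the
    # wrong arithmetic expression (n - k + 1) to the first call.
    first = n - k + 1 if faulty else n + k - 1
    log_count = factorial_ln(first) - factorial_ln(k) - factorial_ln(n - 1)
    return math.floor(0.5 + math.exp(log_count))


def generate_test(n, k, faulty):
    """Mimic a white-box generator: the observed output becomes the expectation."""
    observed = combinations_with_repetition(n, k, faulty=faulty)

    def generated_test(implementation):
        return implementation(n, k) == observed

    return observed, generated_test


# The test is generated from the faulty version, so it passes on that version.
observed, generated = generate_test(3, 3, faulty=True)
passes_on_faulty = generated(
    lambda n, k: combinations_with_repetition(n, k, faulty=True))
passes_on_correct = generated(
    lambda n, k: combinations_with_repetition(n, k, faulty=False))
```

With these inputs the recorded expectation differs from the intended value of 10, yet the generated test is green on the faulty code; only a reader who knows the intended behavior can classify it as wrong.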

Environment
A Windows 7 virtual machine was used that contained the artifacts along with Visual Studio 2015 and Google Chrome. Participants were asked to use only two windows: 1) an experiment portal in Chrome, and 2) Visual Studio. We designed a special website, the experiment portal (Figure 2), in order to record the answers of the participants. It was a more reliable way to collect the results than using some mechanism in the IDE (e.g., using special comments), as participants could unintentionally delete or regenerate the test code.

[Table 2, surviving row: F8 | Wrong arithmetic expression in parameter of function call | Interface | SpecialFunctions.FactorialLn(n + k - 1) | SpecialFunctions.FactorialLn(n - k + 1)]
Participants used this portal to decide whether the test case is wrong or correct with respect to the specification. The portal displayed the test code and the method comment of the corresponding method in the class under test. Participants could record their answer using two buttons. Participants could correct their already answered cases. Questions could be skipped if a participant was not sure of the answer (however, nobody used that option).
In Visual Studio the default development environment was provided with a simple activity tracking extension. Participants got the full project with every class. Participants were asked 1) not to modify any code, 2) not to execute IntelliTest, and 3) not to use screen splitting. On the other hand, we encouraged them to use test execution and debugging to explore the code under test.

Procedure
The main procedure of the 2-hour session is as follows.
(2) Find a seat, receive a unique, anonymous identifier. Participants only receive one sheet of paper that describes both the procedure and the task with the path to the project and class under test. To obtain detailed knowledge about the participants, we designed a background questionnaire asking about the participants' experience with development and testing. Also, the questionnaire has a quiz at the end about C# and testing. In order to summarize the most important and required information, we designed a 10-minute presentation in which the procedure, the projects, the environment, the basic concepts of IntelliTest, and the rules are introduced. Furthermore, to make participants familiar with the environment and the task, a 15-minute guided tutorial is held on a simple project. This tutorial was specially elaborated to have both wrong and good answers for the test cases. During this tutorial, participants can ask anything. The main task is to classify each of the 15 generated test cases in the portal as fault-encoding (wrong) or not (okay). Finally, an exit survey is filled in that asks participants about their feelings regarding the task accomplished.
We planned to perform 2 study sessions, as the room where the study was planned to be conducted has only 40 seats available.

Data collection
We use two data collection procedures. On one hand, we extended the development environment so that it logs every window change, test execution and test debug as well. Also, we wrote a script that documents every request made to the experiment portal. On the other hand, we set up a screen recording tool to make sure that every action of the participants is recorded.
Each participant has 6 output files that are saved for data analysis.
• Answers: The answers submitted to the portal in JSON format.
• Background: e answers given in the background questionnaire in CSV format.

Data analysis
First, the raw data is processed by checking the answers of the participants and coding the screen capture videos. Next, the processed data is analyzed using exploratory techniques and statistical tests.
Analysis of answers. We analyze the answers obtained from the experiment portal using binary classification for which the confusion matrix is found in Table 3.
Video coding. We annotate every recorded video using an academic behavioral observation and annotation tool called Boris [17]. We designed a behavioral coding scheme that encodes every activity we are interested in. The coding scheme can be found in Table 4; all occurrences of these events are marked in the videos. Note that, during the video coding, we only use point events with additional modifiers (e.g., change of page in the portal is a point event along with a modifier indicating the identifier of the new page). In order to enable interval events, we created modifiers with start and end types. Coding all videos required 66 hours.
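The idea of deriving interval events from point events with start and end modifiers can be sketched as follows. The tuple layout, the activity names, and the timestamps are hypothetical; they do not reproduce the coding scheme of Table 4 or the export format of the annotation tool.

```python
def pair_intervals(events):
    """Turn point events tagged 'start'/'end' into (activity, start, end) intervals.

    `events` is a list of (timestamp_seconds, activity, modifier) tuples.
    """
    open_starts = {}
    intervals = []
    for timestamp, activity, modifier in sorted(events):
        if modifier == "start":
            open_starts[activity] = timestamp
        elif modifier == "end" and activity in open_starts:
            intervals.append((activity, open_starts.pop(activity), timestamp))
    return intervals


def total_time(intervals):
    """Sum the duration of all intervals per activity."""
    totals = {}
    for activity, start, end in intervals:
        totals[activity] = totals.get(activity, 0.0) + (end - start)
    return totals


# Illustrative events only.
events = [
    (12.0, "debugging", "start"),
    (47.5, "debugging", "end"),
    (50.0, "running test", "start"),
    (58.0, "running test", "end"),
    (90.0, "debugging", "start"),
    (100.0, "debugging", "end"),
]
intervals = pair_intervals(events)
totals = total_time(intervals)
```

Unmatched end events are ignored in this sketch; a real analysis script would more likely report them as coding errors.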
Exploratory analysis. We perform the exploratory data analysis (EDA) using R version 3.3.2 [35] and its R Markdown language to document every step and result of this phase. We employ the most common tools of EDA: box plots, bar charts, heat maps and summarizing tables with aggregated data.
Quantitative analysis. During the data analysis, we did not make a priori assumptions about the distribution of the results. Furthermore, we performed checks for normality of the distributions that yielded negative results, thus we used non-parametric tests [3]. In case of two sample groups, we employed the Mann-Whitney U test. This test checks whether two groups of independent samples are from the same population (H0) or not (H1). It is a widely-used practice to calculate effect sizes for statistical samples and tests [18]. We chose one of the most prevalent metrics, the Vargha-Delaney Â12 measure, to calculate and report effect sizes of the Mann-Whitney U test.
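The Vargha-Delaney Â12 measure has a simple definition: it estimates the probability that a value drawn from the first group is larger than a value drawn from the second group, with ties counted as one half. A minimal sketch using a direct pairwise count is shown below; the sample values are illustrative, and the study itself used R rather than this code.

```python
def vargha_delaney_a12(first, second):
    """Estimate P(x > y) + 0.5 * P(x == y) for x in `first`, y in `second`."""
    if not first or not second:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for x in first:
        for y in second:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    # `wins` equals the Mann-Whitney U statistic of the first sample,
    # so the measure is U divided by the number of pairs.
    return wins / (len(first) * len(second))


# Illustrative samples only.
group_a = [3, 4, 5, 6]
group_b = [1, 2, 3, 4]
a12 = vargha_delaney_a12(group_a, group_b)
```

A value of 0.5 means that neither group tends to produce larger values; values closer to 0 or 1 indicate a larger effect.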
For analyses where we had to deal with more than two independent sample groups, we used the Kruskal-Wallis H test, which is the extension of the Mann-Whitney U test for exactly these cases. Its null hypothesis is that the sample groups are from identical populations, consequently the alternative hypothesis is that there is at least one sample group which statistically dominates another. If there was a difference, we employed the Mann-Whitney U test for post-hoc analysis of sample group pairs.
Note that if the Mann-Whitney U test is used for cross-checks between multiple groups, one would need to correct the significance values using e.g., Bonferroni or Dunn-Sidak correction. However, we did not use the test for this kind of check, thus we did not perform significance correction.
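For reference, the two corrections mentioned above adjust the per-comparison significance level as sketched below. The significance level of 0.05 and the number of groups are illustrative assumptions; no such correction was applied in the study.

```python
def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / comparisons


def sidak_alpha(alpha, comparisons):
    """Per-comparison significance level under Dunn-Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / comparisons)


def pairwise_comparisons(groups):
    """Number of distinct group pairs when every group is compared to every other."""
    return groups * (groups - 1) // 2


# Illustrative setting: 4 groups compared pairwise at a family-wise level of 0.05.
m = pairwise_comparisons(4)
bonf = bonferroni_alpha(0.05, m)
sidak = sidak_alpha(0.05, m)
```

Both corrections lower the threshold each individual test must meet; the Dunn-Sidak variant is slightly less conservative than Bonferroni.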

Threats to validity
During the planning of our study, we identified the internal, external and construct threats to its validity. In terms of internal threats, our results might be affected by the common threats of human studies [22]. For instance, this includes the maturation effect caused by the learning of exercises, and the natural variation in human performance as well. Moreover, the students know each other and they could talk about the tasks in the study between the two sessions (see Section 4). We eliminated this threat by using different projects and faults at each occasion. The data collection and analysis procedure might also affect the results, however we validated the video logs by R scripts and the portal functions by testing. The generalization of our results might be hindered by some factors. The performances of students and professional users of white-box test generators may differ. Yet, involving students is common in software engineering experiments [40], and results suggest that professional experience does not necessarily increase performance [10]. Our graduate students typically have at least 6 months of work experience, thus they are on the level of junior developers. Another threat to external validity is that the specification is given in comments, and not in a program specification. However, our goal was to carefully select open-source projects, which in general do not have formal or clear specifications of behavior. This decision on one hand may reduce the genericity of results for projects with formal specifications, but on the other hand, it increases the genericity for open-source software. The fault injection procedure could have effects on the genericity of the results, however we selected this method after thinking through several other alternatives (such as GitHub issues). The threats to the construct validity in our study are concerned with the independent variables. It might be the case that some of the variables we selected do not affect the difficulty of classification of generated white-box tests. We addressed this threat by carefully analyzing related studies and experiments in terms of design and results in order to obtain the most representative set of variables.

EXECUTION
Pilots. Our study was evaluated by two separate pilot sessions. First, we performed the tasks using ourselves as participants. After fixing the discovered issues of the design, we chose 4 PhD students having similar knowledge and experience as our intended participants to conduct a pilot in the live environment. We refined the study design based on the feedback collected (see object selection and project selection in Section 3).
Sessions. We separated our live study into two different sessions. On the first occasion the NBitcoin project was used, on the second one Math.NET. The sessions were carried out on 1st and 8th December 2016. Both sessions followed the procedure and could fit in the 2-hour slot.
Participants. Altogether 54 students volunteered of the 120 attending the course: 30 came to the first occasion (NBitcoin) and 24 to the second (Math.NET). 34 of the students had 4 years or more programming experience, while 31 participants had at least 6 months of industrial work experience. They scored 4.4 out of 5 points on average on the testing quiz of the background questionnaire.
Data collection and validation. We noticed three issues during the live sessions. In the first session, Visual Studio cached the last opened window, thus participants got three windows opened on different tabs when they started Visual Studio. In the second session, we omitted the addition of a file to the test project of Math.NET, which led to 3 missing generated test cases in Visual Studio (for method CombinationsWithRepetition). We overcame this issue by guiding the participants step-by-step on how to add that test file. This guided part lasted around 9 minutes, thus we extended the deadline to 69 minutes in that session. Finally, unexpected shutdown of two computers caused missing timing data for the first two tests for two participants (ID: 55 and 59). The rest of their experiments were recorded successfully. The experiment portal has a continuous saving mechanism, therefore their classification answers were preserved. We took all these issues into account in the timing analysis. During the validation of the recorded data we discovered only one issue. The network in the lab room went down on 1st December, and due to this the experiment portal was not able to detect every activity. This data was recovered with the coding of the videos for each participant.

RQ1: Performance in classification
To evaluate the overall performance of participants in the classification of generated tests, we employed binary classification using the confusion matrix presented in Table 3. Figure 4 presents the overall results. The figure encodes all four outcomes of evaluated answers. The first and foremost fact visible in the results is that there are numerous erroneous answers (marked with two shades of red). This implies that not only faulty cases were classified as fault-free, but also there were fault-free cases classified as faulty.
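The four outcomes of an evaluated answer can be derived as in the following minimal sketch. It assumes that "wrong" (fault-encoding) is treated as the positive class; this choice, the function names, and the sample data are assumptions of the sketch and not a reproduction of Table 3.

```python
from collections import Counter


def outcome(encodes_fault, answered_wrong):
    """Map one answer to a confusion-matrix cell ('wrong' is the positive class)."""
    if encodes_fault:
        return "TP" if answered_wrong else "FN"
    return "FP" if answered_wrong else "TN"


def confusion_counts(ground_truth, answers):
    """Count the four outcomes for one participant."""
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for test_id, encodes_fault in ground_truth.items():
        counts[outcome(encodes_fault, answers[test_id] == "wrong")] += 1
    return counts


# Illustrative data only: which tests encode a fault, and one participant's answers.
ground_truth = {"T0": False, "T1": True, "T2": False, "T3": True}
answers = {"T0": "OK", "T1": "wrong", "T2": "wrong", "T3": "OK"}
counts = confusion_counts(ground_truth, answers)
```

Under this convention, the two kinds of erroneous answers are a fault-encoding test accepted as OK and a correct test rejected as wrong.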
In case of NBitcoin, there is only one participant (ID: 10) who answered without any errors. However, there is no test which is not marked falsely by at least one of the participants. Furthermore, one can notice two patterns in the results for NBitcoin. First, tests T0.1 and T2.1 show very similar results for the same participants. This is caused by the fact that the generated test codes and their names are very similar. However, there was no injected fault in the code; both cases encode expected behaviors with respect to the specification. The other noticeable result is that T11.4 has more false answers than true ones. This test case causes an exception to occur, yet it is an expected one. Although throwing an exception is not explicitly stated in the method comment, the specification of the invoked and exception-causing method implies that.
In case of Math.NET, the overall results show similar characteristics: there is no test that was correctly classified by everyone, and there is only one participant (ID: 47) who was able to classify every test correctly. Similarly to NBitcoin, two tests show larger deviations in terms of results: T2.1 and T8.3. Taking a closer look at T2.1 (encoding a fault) reveals that its functionality was simple: participants had to examine the calculation of the binomial coefficient C(n, k). But the fault was injected in the sanity check at the beginning of the method (this sanity check is not detailed in the specification; however, the definition of the binomial coefficient implies it). In this particular test case, the test inputs should have triggered the valid sanity check. For test T8.3, the misunderstanding could come from an implementation detail called factorial cache, which pre-calculates every factorial value from 1 to 170. The original documentation states that numbers larger than 170 will overflow, but does not detail how the overflow is handled. Test T8.3 uses 171 as input, for which the implementation returns positive infinity. This is the correct behavior used consistently in the class, but probably the participants expected an overflow exception.
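The behavior behind T8.3 can be illustrated with a minimal Python sketch. It is not the Math.NET implementation; the names `_FACTORIAL_CACHE` and `cached_factorial` are hypothetical, and the sketch only assumes what the text states: values up to 170 come from a pre-computed cache, and larger inputs yield positive infinity instead of an exception.

```python
import math

# Hypothetical cache: factorials 0! .. 170! stored as double-precision floats.
# 170! (about 7.26e306) is the largest factorial that fits in a double.
_FACTORIAL_CACHE = [1.0]
for i in range(1, 171):
    _FACTORIAL_CACHE.append(_FACTORIAL_CACHE[-1] * i)


def cached_factorial(n: int) -> float:
    """Return n! as a float; inputs above 170 overflow to positive infinity."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < len(_FACTORIAL_CACHE):
        return _FACTORIAL_CACHE[n]
    # Assumption taken from the text: overflow is signalled by +infinity,
    # not by raising an exception.
    return math.inf
```

A generated test for input 171 would therefore assert positive infinity, which is the expected behavior under this convention even though a reader might anticipate an overflow exception.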
We also analyzed the data in terms of different metrics for binary classification. The most widely used ones that are independent from the number of positive and negative samples are: true positive rate (TPR), true negative rate (TNR) and Matthews correlation coefficient (MCC) [34]. A summary of these metrics is shown in Figure 5.
In terms of TPR, participants of the NBitcoin session outperformed the participants working with Math.NET. For NBitcoin, the median is 1, which means that at least half of the participants were able to classify all fault-encoding tests as faulty. In contrast, results for Math.NET show that the upper quartile starts from 0.75, which is much lower.
For TNR, the two projects show very similar results with almost the same medians and inter-quartile ranges. Only a slightly wider distribution is visible for NBitcoin. This and the results for TPR confirm that the classification was easier for NBitcoin.
MCC is basically a correlation metric between the given and the true answers, and thus takes a value between -1 and 1. If MCC is zero, then the given classification has no relationship with the true classification. For NBitcoin, the MCC values show worse results than what can be expected from the TPR and TNR values. The median is only around 0.55, which is only a moderate correlation. In case of Math.NET, the inter-quartile range is between 0.2 and 0.5, which can be considered as a low correlation between the true classification and the ones given by participants. Another interesting note is that both experiment sessions had participants with negative correlation, which indicates that the answers of these participants tended to contradict the true classification.
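The three metrics discussed above can be computed from one participant's confusion matrix as in the following minimal sketch. It assumes that the positive class is a fault-encoding test (consistent with the TPR description above); the function name and the example counts are illustrative and are not data from the study.

```python
import math


def classification_metrics(tp: int, fn: int, tn: int, fp: int):
    """Return (TPR, TNR, MCC) for one confusion matrix.

    tp: fault-encoding tests classified as faulty
    fn: fault-encoding tests classified as fault-free
    tn: correct tests classified as fault-free
    fp: correct tests classified as faulty
    """
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is defined as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return tpr, tnr, mcc


# Illustrative counts only (not taken from the study).
tpr, tnr, mcc = classification_metrics(tp=2, fn=1, tn=5, fp=1)
```

With these illustrative counts, TPR is 2/3, TNR is 5/6 and MCC is 0.5, which shows how MCC can be noticeably lower than either rate taken alone.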
Summary. The overall results of the participants showed a moderate classification capability. Many of them committed errors among their answers. Some of these errors were possibly caused by misunderstanding the specification; however, a large portion of wrong answers may have been caused by the difficulty of the problem.

RQ2: Time spent for classification
We analyzed the data obtained from the video annotations from different aspects to have an overview of the time management of participants. Note that during the time analysis we excluded the data points of participants 55 and 59, who had missing time values for T0.1 and T1.1, as these may affect the outcome of the results. Table 5 summarizes the total time and time spent on one test. Total time was calculated using the length of the recorded videos. For the test cases, we summed the time spent in the IDE on a specific test case and the time spent on the portal page of the given test. The total time spent during the sessions is very similar for the two projects. There is a roughly 17 minutes difference between the fastest and slowest participants, while the average participant required 45 and 46 minutes to finish the classification. Note that this involves every activity including the understanding of the code under test. The results show rather large deviations in the values for the test cases. The minimum in case of NBitcoin was probably caused by two factors. First, there were participants who gained understanding of the code under test, thus were able to quickly decide on some of the tests. Second, each method had 3 test cases, and the third cases could be classified in a shorter amount of time, which emphasizes the presence of a learning curve. In contrast, participants required a rather long time period to classify some of the test cases. A rough
To understand how participants managed their time budget, we analyzed their time spent on each of the possible locations (Figure 6). These locations are the following: portal pages of the tests, the Visual Studio windows including the test codes, class under test (CUT), other system under test (SUT) than CUT, and parameterized unit test (PUT). Note that we excluded the home page of the portal from this analysis, as it contains only a list of the cases, thus served only for navigation.
The results are similar for both projects, yet there are two differences to mention. It is clear that participants mostly used the test code and the corresponding specification in the portal and in Visual Studio to understand the behavior. However, in case of NBitcoin they analyzed the class under test almost as much as the test code in Visual Studio. This is not the case for Math.NET, probably because participants were already familiar with the domain knowledge of the tested class. Another difference is in the time spent with the system under test. Math.NET had its faults injected outside the class under test, and participants had to explore its dependencies to understand what causes the mismatch of behavior with respect to the specification.
In order to gain deeper insights into the time budget, we analyzed the time required for each test case (Figure 7). We calculated this metric by summing 5 related values: the time spent in the portal page of the test, the time spent in the Visual Studio window of the test, and the time spent with CUT (class under test), PUT (parameterized unit test) and SUT (other system under test than CUT) for the test case currently opened in the portal. On a high-level overview, two trends can be noticed in the values. The first one is the decreasing amount of time required as participants progressed. The second one is the first-test effect, causing the first test to have higher values for several methods.
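The per-test time metric described above can be sketched as a simple aggregation. The record layout (participant, test, location, seconds) and the location labels are hypothetical, chosen only for illustration; the actual coding scheme of the videos may differ.

```python
from collections import defaultdict

# Hypothetical labels for the five locations that count towards a test case.
COUNTED_LOCATIONS = ("portal_test_page", "vs_test_code", "cut", "put", "sut")


def time_per_test(records):
    """Sum the seconds spent at the five counted locations per (participant, test).

    records: iterable of (participant_id, test_id, location, seconds) tuples,
    where test_id is the test case currently opened in the portal.
    """
    totals = defaultdict(float)
    for participant_id, test_id, location, seconds in records:
        # Other locations (e.g. the portal home page) are excluded.
        if location in COUNTED_LOCATIONS:
            totals[(participant_id, test_id)] += seconds
    return dict(totals)


# Illustrative records only (not taken from the study).
example_records = [
    (10, "T0.1", "portal_test_page", 40.0),
    (10, "T0.1", "vs_test_code", 55.0),
    (10, "T0.1", "cut", 30.0),
    (10, "T0.1", "portal_home", 12.0),
    (10, "T0.2", "portal_test_page", 20.0),
]
```

Keeping the key as a (participant, test) pair makes it straightforward to look for the two trends mentioned above, since the values can be grouped either by test order or by method.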
Summary. The analysis of the time spent by participants pointed out that they spend roughly 100 seconds on average with the test code alone to classify a particular generated white-box test. The time spent with other parts of the code is added on top of this. Based on the results, the users of white-box test generators may have to spend a noticeable amount of time to classify the generated tests based on their correctness.

RQ3: Impacting factors of classification
In order to have a better understanding of what could impact the classification performance of participants, we applied statistical methods to different aspects of the dataset. Our goal is 1) to provide information about the potential relationships between various attributes, 2) to gain important insights into the data and 3) to define recommendations for future studies. We selected the true positive rate (TP rate, sensitivity) and true negative rate (TN rate, specificity) metrics to evaluate the performance of classification.
Project selection. Foremost, we analyzed whether the project selection has influence on the classification performance. As there were two groups of samples (NBitcoin with 30 and Math.NET with 24 samples), we used the Mann-Whitney U test. The significance of the test (U pv) along with the means, effect sizes (Â12) and the mean of standard deviations (sd) are shown in Table 6. In case of TP rate, the values have reasonably large differences between the two projects (this can also be seen in Figure 5). Based on the p-value of the Mann-Whitney U test, one can reject the null hypothesis that the values are from the same population with 95% confidence. Also, the Vargha-Delaney metric has a value of 0.651, which is considered as a medium difference between the two sample groups. In contrast, the p-value for TN rate shows there is not enough statistical significance to state that the two groups of samples are from different populations. The Â12 value also supports that they are similar, as 0.557 is considered as a small difference between the values in the sample groups.
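The comparison of two sample groups can be sketched in plain Python as follows. This is not the analysis script of the study: the p-value uses the normal approximation with tie correction and no continuity correction, which is an assumption of this sketch, and the sample values are invented for illustration.

```python
import math
from collections import Counter


def vargha_delaney_a12(xs, ys):
    """Probability that a value from xs exceeds one from ys (ties count half)."""
    wins = sum(1.0 for x in xs for y in ys if x > y)
    ties = sum(0.5 for x in xs for y in ys if x == y)
    return (wins + ties) / (len(xs) * len(ys))


def mann_whitney_u(xs, ys):
    """Return (U statistic of xs, two-sided p-value by normal approximation)."""
    m, n = len(xs), len(ys)
    total = m + n
    u = vargha_delaney_a12(xs, ys) * m * n
    # Tie correction: sum of t^3 - t over groups of equal values.
    tie_term = sum(t ** 3 - t for t in Counter(list(xs) + list(ys)).values())
    variance = m * n / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if variance <= 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - m * n / 2.0) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Illustrative TP rates for two groups (not taken from the study).
group_a = [0.5, 0.75, 1.0, 1.0]
group_b = [0.25, 0.5, 0.5, 0.75]
a12 = vargha_delaney_a12(group_a, group_b)
u_stat, p_value = mann_whitney_u(group_a, group_b)
```

The effect size and the U statistic are directly related (U equals Â12 multiplied by the product of the group sizes), so reporting both, as in Table 6, separates the question of significance from the question of magnitude.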
Programming experience. Participants filled in the background questionnaire prior to the experiment. We had several questions on their experience; one of them was regarding their programming experience measured in years. We defined 7 levels in the survey, from which they only selected 6. However, only 1 participant selected having no experience (possibly omitted answering) and 2 participants selected less than 1 year. We excluded their answers as they form a very small group to be used in a statistical test. Thus, in the end, we had 51 samples with 4 levels of programming experience: 2 years (N=9), 3 years (N=8), 4 years (N=19), 5 or more years (N=15). In order to determine if there is any difference between the sample groups, we used the Kruskal-Wallis H test. The results are shown in Table 7. In case of true positive rate, the p-value of the Kruskal-Wallis test is 0.122, which means that one cannot reject the null hypothesis; thus the data are consistent with the groups coming from identical populations. Similarly for true negative rate, the p-value is 0.182, which yields the same result as for TP rate.
Work experience. Based on the background questionnaire filled in by the participants, we analyzed the connection between their classification performance and their work experience. For the question regarding the industrial work experience, we also defined 7 levels to choose from. However, the 54 participants have only selected 5 of them, which were the following: none (N=5), less than 6 months (N=18), 7-12 months (N=9), 1-2 years (N=18) and 3-5 years (N=4). The results for this analysis are found in Table 8. Using the Kruskal-Wallis H test for the TP rate, the results show that the differences are not statistically significant enough to reject that the sample groups are from identical populations. One may note that the means are very different for the groups; however, the results are not significant enough to reject the null hypothesis. This may be caused both by the varying number of participants per group and by the lack of robustness of the mean. An interesting phenomenon can be observed for the results in TN rate. The Kruskal-Wallis test tells that there is a statistically significant difference between some of the sample groups (rejecting H0). In order to gain more information about the differences, we used the Mann-Whitney U test on the two largest sample groups (less than 6 months of experience and 1-2 years of experience). The results for this test can be found in Table 9. The values show that there is a statistically significant difference between the two groups in terms of TN rate (rejecting H0). Furthermore, the value of the Â12 statistic indicated a large difference with the value of 0.751.
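For comparisons across more than two experience levels, the Kruskal-Wallis H test can be sketched as below. The sketch is an illustration under stated assumptions, not the script used in the study: it applies the usual tie correction and obtains the p-value from the chi-square approximation with k-1 degrees of freedom, implemented here in closed form for integer degrees of freedom. All function names are hypothetical.

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        return math.exp(-half) * sum(
            half ** i / math.factorial(i) for i in range(df // 2)
        )
    extra = sum(
        half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


def kruskal_wallis(groups):
    """Return (H statistic with tie correction, approximate p-value)."""
    pooled = [value for group in groups for value in group]
    total = len(pooled)
    ranks = midranks(pooled)
    weighted_rank_sums, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (total * (total + 1)) * weighted_rank_sums - 3.0 * (total + 1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1.0 - tie_term / (total ** 3 - total)
    if correction <= 0:
        return 0.0, 1.0  # all values identical
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)
```

A significant H only indicates that at least one group differs, which is why a pairwise follow-up such as the Mann-Whitney U test on selected groups is needed to locate the difference.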
Summary. The results for RQ3 showed insights on relationships. In terms of project selection, the results may indicate that the project under test or its attributes (e.g., complexity, fault location, type of faults, etc.) have influence on the true positive rate. For participants with different programming experience, our results showed no significant differences in their classification performance. Last, we analyzed the industrial work experience of participants, which showed that the TN rate was significantly affected by the work experience in an inverse way for the two largest groups of our participants.

DISCUSSION

Implications of the results
The results for RQ1 showed that classifying the correctness of generated white-box tests could be a challenging task. The median of misclassification rate was 33% for fault-encoding tests, while 25% for correct tests. Both could be caused by several factors such as 1) the misunderstanding of the specification, 2) the misunderstanding of class behavior, or even 3) the underlying fault attributes in the software under test. However, most likely the challenge in classification is mostly due to the combination of these causal factors.
For RQ2, our results showed that participants could spend minutes only to understand the encoded behavior and functionality in the test cases even for the selected classes. Moreover, this does not include other time spent in the IDE, which was mostly used to understand the code under test. The results may imply that developers and testers could spend a non-negligible amount of time with the classification of test cases, which may reduce the time advantage provided by automatically generated white-box tests.
In terms of RQ3, we obtained interesting factors that may have influence on participant classification performance. First, our results gave a statistically significant indication that the attributes of the code under test may influence the performance. Further studies that control some attributes of the code under test as independent variables may be required to support this hypothesis. A study like this may reveal which attribute has the most influence on performance. Furthermore, our results were significant in the analysis of the relationship between the participants' work experience and their classification performance. Further studies may address this topic by controlling participant industrial experience with a reasonable number of participants.

Insights from participants' behavior
By watching all screen captures and performing video coding, we gained important insights about the user activities and behaviors during the classification of generated white-box tests. As expected, many participants employed debugging to examine the code under test. They mostly checked the exceptions being thrown, the parameterized unit tests for the test methods, and the assertions generated into the test code. This emphasizes the importance of debugging as a tool for investigating white-box test behavior. Most of the participants executed the tests to check their actual outcome. Some cases contained unexpected exceptions, which confused a few participants.
Another interesting insight we obtained is that some participants spent only seconds with the examination of the last few test cases. This could point out that they either gained understanding of the code under test by the end of the session (i.e., learning factor), or they got tired by the continuous attention required during the task.

Suggestions from exit survey
The participants in our study filled in an exit survey at the end of the sessions. They had to answer both Likert-scaled and textual questions. The results for the agreement questions showed that participants had enough time to understand the class under test and to review the generated tests. Most of them also selected that it was easy to understand the class and the tests. They agreed that the generated tests were difficult to read; however, the answers were almost equally distributed for the questions about the difficulty of the task and the confidence in their answers. This shows that they are mostly not very confident in their own answers. Furthermore, the feedback about the time and difficulty showed that our study design was appropriate in terms of these.
In their textual answers participants mentioned the difficulties in reviewing the tests and gave several suggestions to improve the test code (some of these were also reported in the literature [43]).
• "Deciding whether a test is OK or wrong when it tests an unspecified case. (e.g. comparing with null, or equality of null)"
• "Distinguishing between the variables was difficult (assetMoney, assetMoney1, assetMoney2)."
• "Tests should compare less with null and objects with themselves."
• "I think that some assertions are useless, and not asserting 'real problems', just some technical details."
• "Generated test cases doesn't seperated into Arrange, Act, Assert and should create more private methods for these concerns."
• "Generate comments into tests describing what happening."
Our recommendation for improving test generators to help developers and testers with generated assertions consists of the following.
• Instead of using the assert keyword, test generators shall use the observed or likelyAssert keywords.
• Generated tests having null inputs shall be distinguished from the others.
• Generated tests shall contain variables with more meaningful names (as already implemented in refactoring features of many IDEs).
• The generated tests shall employ the Arrange, Act, Assert pattern in the structure of generated tests.
• The tests shall contain intra-line comments that describe what the given line is responsible for.
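The recommendations above can be illustrated with a short, hypothetical generated test. The sketch is written in Python's unittest rather than in the C# of the studied projects; `binomial` is a stand-in unit under test and `assert_observed` is an invented helper that marks an assertion as observed behavior awaiting confirmation. Neither belongs to any existing test generator.

```python
import math
import unittest


def binomial(n: int, k: int) -> float:
    """Hypothetical unit under test: binomial coefficient with a sanity check."""
    if k < 0 or n < 0 or k > n:
        return 0.0
    return float(math.comb(n, k))


def assert_observed(test_case, observed_by_generator, actual):
    """Hypothetical helper: the expected value was recorded by the generator
    from the implementation, so a human still has to confirm it."""
    test_case.assertEqual(
        observed_by_generator,
        actual,
        "observed behavior changed; check it against the specification",
    )


class BinomialGeneratedTest(unittest.TestCase):
    def test_binomial_with_more_chosen_than_available_items(self):
        # Arrange: meaningful names instead of n, n1, n2
        available_items = 3
        chosen_items = 5

        # Act: call the unit under test with the generated inputs
        result = binomial(available_items, chosen_items)

        # Assert: observed value, to be validated against the specification
        assert_observed(self, 0.0, result)
```

Separating observed assertions from confirmed ones keeps the reviewer aware that a passing test only states what the implementation currently does.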

CONCLUSIONS
This paper presented an exploratory study on whether developers could validate generated white-box tests. The study, performed in a laboratory setting with 54 graduate students, resembled a scenario where junior developers having a basic understanding of test generation had to test a class in a larger, unknown project with the help of a test generator tool. The data showed that participants incorrectly classified a large number of both fault-encoding and correct tests (with median misclassification rate 33% and 25% respectively). The results confirm the findings of previous studies and broaden their validity. The implication of the results is that the actual fault-finding capabilities of the test generator tools could be much lower than reported in technology-focused experiments. Thus we suggest taking this factor into account in future studies.
An experimental study always has limitations. We collected important context variables that could affect the classification performance (e.g., experience, source code access), and defined the levels chosen in the current study that collectively reflect one possible scenario. As in our study all variables had fixed levels, this naturally limits its validity. Future studies altering these settings could help to build a "body of knowledge" [6]. Our analysis indicates that the object under study and the participants' industrial experience could be possible factors. Moreover, designing a study where participants work on a known project or perform regression testing would be important future work. Therefore we made available our full dataset, coded videos and lab package to support further analyses or replications.