Goal and method
Our main goal was to study whether developers can validate the tests generated only from program code by classifying whether a given test encodes an expected or an unexpected behavior.
As there is little empirical evidence on this topic, we followed an exploratory, interpretivist approach to understand it better (Wohlin and Aurum 2015). We formulated the following base-rate research questions (Easterbrook et al. 2008) to gather data that can direct future research or help formulate theories and hypotheses.
How do developers perform in the classification of generated tests?
How much time do developers spend on the classification of generated tests?
As these test generator tools are not yet widespread in industry, we selected an off-line context. We designed an exploratory study in a laboratory setting with students as human participants, and we performed a replication to strengthen our initial results. Our research process involved both quantitative and qualitative phases: we collected data using both observational and experimental methods, and analyzed it using exploratory data analysis and statistical methods. For the design and reporting of our study, we followed the guidelines of empirical software engineering (Easterbrook et al. 2008; Wohlin and Aurum 2015; Wohlin et al. 2012).
Understanding and classifying generated tests is a complex task whose difficulty can be affected by numerous factors. We focus on the following independent variables. For each variable, the possible levels are listed; the levels in bold are the ones we selected for our study design.
Participant source: What type of participants are recruited (students, professionals, mixed).
Participant experience: Experience in testing and test generation tools (none, basic, experienced).
Participant knowledge of objects: Whether the participant has a priori knowledge about the implementation under test (known, unknown).
Objects source: The source from which the objects are selected (open source, closed source, artificial/toy, …).
Object source code access: Whether the objects are fully visible to the participants (white-box, black-box).
Fault types: The source and type of the faults used in the objects (real, artificial, mutation-based).
Number of faults: The number of faults injected into the objects (0, 1, 2, 3, …).
Expected behavior description: How the specifications of the objects are given (code comments, text document, formal, …).
Test generator tool: Which test generator is used for generating tests (IntelliTest, EvoSuite, Randoop, …).
User activity: The allowed user activities in the study (run, debug, modify code …).
The following dependent variables are observed:
Answers of participants: Classification of each test as ok (expected) or wrong (unexpected).
Activities of participants: What activities are performed by participants during the task (e.g., running and debugging tests).
Time spent by participants: How much time participants spend with each individual activity and in each location.
Note that, as this is exploratory research, there is no hypothesis yet; and because the research questions are not causality-related or comparative, all independent variables had fixed levels (i.e., there are no factors and treatments, as opposed to a hypothesis-testing empirical evaluation).
Test generator tool
There are many off-the-shelf tools for white-box test generation, from which we chose Microsoft IntelliTest (Tillmann and de Halleux 2008). We chose it because it is a state-of-the-art, mature product with a good user interface. IntelliTest currently supports the C# language and is fully integrated into Visual Studio 2015. IntelliTest's basic concept is the parameterized unit test (PUT): a test method with arbitrary parameters that is called from the generated tests with concrete arguments. The PUT also serves as the entry point for the test generation process.
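The relationship between a PUT and the concrete tests generated from it can be sketched in a language-neutral way. The following Python analogue is purely illustrative: IntelliTest itself works on C# code, and all names below are hypothetical.

```python
# Illustrative analogue of a parameterized unit test (PUT); IntelliTest
# operates on C#, so this Python sketch only conveys the concept.

def absolute_value(x: int) -> int:
    """Code under test."""
    return -x if x < 0 else x

def put_absolute_value(x: int) -> None:
    """PUT: a test method with an arbitrary parameter. The generator
    explores this entry point and chooses concrete argument values."""
    result = absolute_value(x)
    assert result >= 0  # property that must hold for any x

# Concrete tests the generator might emit, each calling the PUT with
# specific arguments that cover different paths (x < 0 and x >= 0):
def test_absolute_value_zero():
    put_absolute_value(0)

def test_absolute_value_negative():
    put_absolute_value(-5)

test_absolute_value_zero()
test_absolute_value_negative()
```

The generated tests are thus thin wrappers: all checking logic lives in the PUT, while the generator supplies inputs that exercise distinct program paths.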
Our goal was to recruit people who were already familiar with the concepts of unit testing and white-box test generation. We recruited among MSc students enrolled in one of our V&V university courses. They were suitable candidates as they already had a BSc degree in software engineering. Furthermore, our course covered testing concepts, test design, unit testing, and test generation prior to the study (5 × 2 h of lectures, 3 h of laboratory exercises, and approximately 20 h of group project work on these topics). Throughout the course, we used IntelliTest to demonstrate white-box test generation in both the lectures and the laboratory exercises.
Participation in the study was optional. We motivated participation by giving students extra points (approximately 5% of the final course evaluation). Note that we also announced that these points would be awarded independently of the experiment results, to avoid any negative performance pressure.
Using students instead of professionals as participants has long been an actively debated topic in empirical software engineering. Falessi et al. (2018) surveyed empirical software engineering experts on whether they agree or disagree with using students. On the one hand, based on their results, using students is a valid simplification of real-world settings for a laboratory study. On the other hand, it remains a threat to validity that must be considered when interpreting the results.
Objects (projects and classes)
In terms of objects, we had to decide whether to (i) select off-the-shelf projects or (ii) give developers an implementation task (based on a predefined specification). The latter setup is more difficult to control and analyze, because the generated tests could differ between implementations. Thus, we decided to select objects from an external source and to prohibit participants from implementing or modifying any code.
The main requirements towards the objects were that (i) they should be written in C#, (ii) IntelliTest should be able to explore them, and (iii) they should not be too complex, so that participants could understand them while performing their task. We did not find projects satisfying these requirements in previous studies of IntelliTest (Pex); thus, we searched for open-source projects. Based on our requirements, project selection was performed along the following criteria:
Shall have at least 400 stars on GitHub: this likely indicates a project that actually works and excludes prototypes and non-working code.
Should not have any relation to graphics, user interfaces, multi-threading, or multi-platform execution: all of these may introduce difficulties for the test generator algorithm by design.
Shall be written in C# language: The IntelliTest version used only supports this language.
Shall compile in a few seconds: this enables participants to run fast debugging sessions during the experiment.
We decided to use two different classes from two projects with vastly different characteristics for both the original and the replicated study. The selection criteria for the classes were the following:
Shall be explorable by IntelliTest without issues to have usable generated tests.
Shall have more than 4 public methods to have a reasonable amount of generated tests.
Shall have at least partially commented documentation to be used as specification.
We conducted pilots prior to finalizing our design. We found that participants can examine 15 tests in a reasonable amount of time. To eliminate the bias possibly caused by tests for the same methods, we decided to have the 15 tests for 5 different methods (thus 3 tests for each method).
Selected projects and classes
Finding suitable objects turned out to be much harder than we anticipated. We selected 30 popular projects (Honfi and Micskei 2018) from GitHub as candidates that seemed to satisfy our initial requirements. However, we had to drop most of them: either they heavily used features not supported by IntelliTest (e.g., multi-threading or graphics) or they would have required extensive configuration (e.g., manual factories, complex assumptions) to generate non-trivial tests. Finally, we kept the two most suitable projects, which are the following:
Math.NET Numerics (MathNET 2017) is a .NET library that offers numerical calculations in areas such as probability theory and linear algebra. It contains mostly data structures and algorithms.
NBitcoin (NBitcoin 2017) is a more business-like library, described as the most complete Bitcoin library for .NET.
For the replicated study, we performed the same selection procedure on another set of open-source GitHub projects that suit the initial requirements, and finally decided on the following two.
NodaTime (NodaTime 2018) is an advanced date and time handling library that aims to replace the corresponding built-in .NET types with a richer feature set.
NetTopologySuite (NetTopologySuite 2018) is a .NET library, which implements 2-dimensional linear geometry based on a standard defined by the Open Geospatial Consortium.
Table 2 lists the selected classes of the four projects. Using the requirements for the classes, we manually analyzed each method inside them to ensure that they are suitable for the purpose. The Combinatorics class implements enumerative combinatorics and counting: combinations, variations, and permutations, all with and without repetitions. The AssetMoney class implements the logic of the Open Asset protocol for arbitrary currencies that have a conversion ratio to Bitcoin. Class Period in project NodaTime is responsible for describing and handling a given date and time period. The class CoordinateArrays of project NetTopologySuite is responsible for handling coordinates organized into an array, along with providing the corresponding operations.
Most of the selected methods originally had method-level comments describing the expected behavior. In case of missing descriptions, we extended them; they are still not perfect (nor formally complete), but based on feedback from preliminary pilot sessions (discussed later in Section 4), they tend to represent comments used in real projects. Note that we did not extend anything if the methods invoked (from the unit under test) had clear descriptions. This way, participants had to explore and understand the code more deeply to provide classification answers.
Fault selection and injection
To obtain fault-encoding tests from IntelliTest, faults need to be injected into the classes under test. There are multiple alternatives for obtaining such faults, each of which affects the validity of the study in different ways.
Historical faults extracted from issue archives would more likely represent real-world scenarios, but would make the control of the study more difficult due to the limited number and type of actual faults for the selected projects.
Artificial faults can be obtained from surveys of typical software defects, e.g., Duraes and Madeira (2006). These surveys rank and categorize the most common faults made by developers during software development. On the one hand, this enables more control over the faults and objects in the study (because of the vast number of possibilities); on the other hand, it may reduce the similarity to a real scenario if the selection and injection are performed without care (e.g., the fault is too trivial or impossible to find).
As we did not find a diverse, controllable set of historical faults for the selected classes in the GitHub version history of the projects, we used artificial faults in a systematic way. We selected representative fault types (Duraes and Madeira 2006) from the Orthogonal Defect Classification (Chillarege et al. 1992). The survey we used identifies the most commonly committed types of faults in real-world programs. We selected the actual faults from the top quarters of the ODC categories (see Table 9). During the injection procedure, we made sure that the faults (1) cause unexpected behavior, (2) have no cross-effects on each other, and (3) have no effect on behavior other than the intended one. All three properties were validated by running test generation (IntelliTest) on the classes with and without the injected faults. We injected three faults into each selected class so that fault-encoding tests were in the minority, yet present in measurable numbers.
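To illustrate what such an injected fault looks like, the following Python sketch shows a hypothetical ODC-style "wrong branch condition" fault; it is not one of the actual faults used in the study, and the function names are invented for illustration.

```python
# Hypothetical example of an injected ODC-style fault (a wrong branch
# condition); not one of the actual faults used in the study.

def is_valid_index_correct(i: int, length: int) -> bool:
    """Expected behavior: i is a valid index into a sequence of the
    given length."""
    return 0 <= i < length

def is_valid_index_faulty(i: int, length: int) -> bool:
    """Same method with an injected fault in the upper bound check."""
    return 0 <= i <= length  # injected fault: '<' changed to '<='

# Most inputs behave identically under both versions...
assert is_valid_index_correct(2, 5) == is_valid_index_faulty(2, 5)

# ...but a boundary input exposes the unexpected behavior that a
# generated test may then encode:
assert is_valid_index_correct(5, 5) is False
assert is_valid_index_faulty(5, 5) is True
```

A fault of this kind satisfies the criteria above: it causes unexpected behavior for some inputs, is local to one condition, and leaves all other behavior unchanged.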
We generated tests with IntelliTest for each selected method using parameterized unit tests (Tillmann and Schulte 2005). Tests were generated from the version already containing the selected faults. There were methods for which IntelliTest could not generate values that cover interesting or unexpected behaviors. In these cases, we extended the parameterized unit tests with special assumptions that request at least one test from IntelliTest with values fulfilling the preconditions. From each test suite, we selected 3 tests for the study for the following reasons:
Fitting in the time frame: During the pilot sessions, we measured that a single test classification takes around 2–3 min on average. Using 3 tests per method ensured that participants would most likely fit into the 1-h time frame of the study.
Learning effect: The more tests are selected for a single method, the higher the probability that participants will understand and learn the features and issues of that method. This could bias the final results, falsely indicating that the classification task is simple.
We chose the most distinct cases that cover vastly different behaviors in the method under test (e.g., from different equivalence partitions). Each test was given an identifier ranging from 0 to 14 (therefore, all four projects have tests T0 to T14). Furthermore, the corresponding method is indicated with a suffix in each test identifier. Thus, for the first method, three cases were generated: T0.1, T1.1, and T2.1. IntelliTest generates one test file for each method, but we moved the tests into individual files to ease the tracking of participant activities.
Participants worked on a Windows 7 virtual machine that contained the artifacts along with Visual Studio 2015 and Google Chrome. They were asked to use only two windows: (1) an experiment portal in Chrome (for brief test overview and answer submission) and (2) Visual Studio (for code inspection, test runs, and debugging).
We designed a special website, the experiment portal (Fig. 1), in order to record the answers of the participants. Here, participants could give their answers while analyzing the generated test code and the expected behavior. This was a more reliable way to collect the results than using some mechanism in the IDE (e.g., special comments), as participants could not unintentionally delete or regenerate the test code.
Participants used this portal to decide whether each test encodes expected behavior or not. The portal displayed the test code and the commented specification of the corresponding method. Participants recorded their answer using two buttons and could correct already answered cases. Questions could be skipped if a participant was not sure of the answer (however, nobody used that option).
In Visual Studio, the default development environment was provided with a simple activity tracking extension. Participants got the full project with every class. Participants were asked (1) not to modify any code, (2) not to execute IntelliTest, and (3) not to use screen splitting. On the other hand, we encouraged them to use test execution and debugging to explore the behavior implemented in the code under test.
The main procedure of the 2-h sessions is as follows:
Sign informed consent.
Find a seat, receive a unique, anonymous identifier.
Fill background questionnaire.
Listen to a 10-min briefing presentation and go through a 15-min guided tutorial.
Perform the assigned classification task in at most 1 h.
Fill exit survey.
Participants received only one sheet of paper, which described both the procedure and the task, with the path to the project and class under test. To obtain detailed knowledge about the participants, we designed a background questionnaire asking about their experience with development and testing. The questionnaire also ended with a quiz about C# and testing. We designed a 10-min presentation introducing the procedure, the project and class under test, the environment, the basic concepts of IntelliTest, and the rules. Participants were also warned to check the methods invoked by the method under test to obtain a full overview of both the required and the actual behavior.
To make participants familiar with the environment and the task, a 15-min guided tutorial was held on a simple project created for this specific purpose. The tutorial included both types of tests to be classified (ok and wrong). The main task was to classify each of the 15 generated tests in the portal as encoding expected (ok) or unexpected (wrong) behavior. Finally, participants filled in an exit survey about their impressions of the accomplished task.
We planned two sessions each for the original and the replicated study, as the room where the study was conducted had only 40 seats available.
We used two data collection procedures. On the one hand, we extended the development environment so that it logged every window change, test execution, and debug. Also, we wrote a script that documented every request made to the experiment portal. On the other hand, we set up a screen recording tool to make sure that every participant action is recorded.
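From such timestamped logs, the time spent in each location (as described among the dependent variables) can be derived by attributing each interval between two window changes to the earlier location. The following Python sketch illustrates the idea; the log format and location names are hypothetical, not the study's actual log schema.

```python
# Illustrative sketch of deriving time-per-location from a timestamped
# activity log; the log format and names here are hypothetical.
from collections import defaultdict

def time_per_location(events):
    """events: chronologically sorted (timestamp_seconds, location)
    window-change records. Returns seconds spent in each location,
    attributing each interval to the location active at its start."""
    totals = defaultdict(float)
    for (t0, loc), (t1, _) in zip(events, events[1:]):
        totals[loc] += t1 - t0
    return dict(totals)

# Example: a participant switches between the portal and Visual Studio.
log = [(0, "portal"), (30, "VisualStudio"),
       (150, "portal"), (180, "VisualStudio"), (300, "end")]
# → {'portal': 60.0, 'VisualStudio': 240.0}
```

The final "end" marker carries no duration of its own; it only closes the last interval.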
Each participant produced 6 output files that were saved for data analysis.
Answers: The answers submitted to the portal in JSON format.
Background: The answers given in the background questionnaire in CSV format.
Exit: The answers given in the exit survey in CSV format.
Portal log: The user activity recorded in the portal.
Visual Studio log: The user activity recorded into a CSV-like format using a custom Visual Studio extension.
Screen recorded video: The participant activity during the main session in MP4 format.
First, the raw data was processed by checking the answers of the participants along with their activities, parsing the logs and coding the screen capture videos. Next, the processed data was analyzed using exploratory techniques.
Analysis of answers
We analyzed the answers obtained from the experiment portal using binary classification; the corresponding confusion matrix is shown in Table 3.
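The core of this analysis can be sketched as follows; the data below is invented for illustration and does not reflect the study's actual results. Treating "wrong" (fault-encoding test) as the positive class, each answer is compared against the fault-injection ground truth.

```python
# Sketch of the binary classification analysis; the answers and ground
# truth below are illustrative, not the study's actual data.

def confusion_matrix(answers, truths):
    """Count classification outcomes, with 'wrong' (fault-encoding
    test) as the positive class."""
    tp = sum(1 for a, t in zip(answers, truths) if a == "wrong" and t == "wrong")
    tn = sum(1 for a, t in zip(answers, truths) if a == "ok" and t == "ok")
    fp = sum(1 for a, t in zip(answers, truths) if a == "wrong" and t == "ok")
    fn = sum(1 for a, t in zip(answers, truths) if a == "ok" and t == "wrong")
    return tp, tn, fp, fn

# One hypothetical participant's 15 answers vs. the known ground truth:
answers = ["ok"] * 10 + ["wrong"] * 5
truths  = ["ok"] * 9 + ["wrong"] * 5 + ["ok"]
tp, tn, fp, fn = confusion_matrix(answers, truths)
accuracy = (tp + tn) / len(answers)  # 13/15 here
```

From the same four counts, precision and recall follow directly (tp / (tp + fp) and tp / (tp + fn)), which is why the confusion matrix is a sufficient summary of each participant's answers.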
In the original study, we annotated every recorded video using an academic behavioral observation and annotation tool called Boris (Friard and Gamba 2016). We designed a behavioral coding scheme that encodes every activity we were interested in. The coding scheme can be found in Table 4; all occurrences of these events are marked in the videos (Fig. 2). Note that during the video coding we used only point events with additional modifiers (e.g., a change of page in the portal is a point event with a modifier indicating the identifier of the new page). To enable interval events, we created modifiers with start and end types.
We performed the exploratory data analysis (EDA) using R version 3.3.2 (R Core Team 2016) and its R Markdown language to document every step and result of this phase. We employed the most common tools of EDA: box plots, bar charts, heat maps, and summarizing tables with aggregated data.
Threats to validity
During the planning of our study, we identified the threats to its internal, external, and construct validity. In terms of internal threats, our results might be affected by the common threats of human studies (Ko et al. 2013). For instance, this includes the maturation effect caused by learning the exercises, as well as natural variation in human performance. We addressed these threats by randomly ordering the methods and generated tests to classify. Also, to reduce fatigue and boredom, participants only had to deal with 3 tests per method.
Moreover, the students might know each other and could thus discuss the tasks of the study between sessions (see Section 4). We eliminated this threat by using different projects and faults on each occasion. The data collection and analysis procedures might also affect the results; however, we validated the logs with R scripts and the portal functions by testing.
The generalization of our results (external validity) might be hindered by several factors, including the following:
Professionals or students: The performance of students and professional users of white-box test generators may differ. Yet, involving students is common in software engineering experiments (Sjøberg et al. 2005), and results suggest that professional experience does not necessarily increase performance (Dieste et al. 2017). Our graduate students typically have at least 6 months of work experience; thus, they are at the level of an average junior developer.
A priori knowledge of objects: The generalization of the results could also be affected by the (less likely) possibility that some participants had a priori knowledge of the selected projects, as they are open source; such participants could classify tests better.
Completeness of described behavior: Another threat to external validity is that the expected behavior is given in comments, not in a precise program specification. However, our goal during the study design was to carefully select open-source projects, which in general do not have formal specifications of behavior. On the one hand, this decision may reduce the generality of the results for projects with formal specifications (incomplete specifications may reduce classification performance); on the other hand, it increases their generality for open-source software.
Number and size of objects: The number and size of the classes and methods under test may affect classification performance and thus the generalization of our results. However, the objects used in the study were carefully selected using the rigorous criteria prescribed by our study design (considering the required time and learning curves). These requirements ensured that participants could finish their task in the given time frame.
Fault injection: The fault injection procedure could affect the generality of the results; however, we selected this approach after considering several alternatives along with their trade-offs, as discussed in Section 3.
Test generator tool: We used IntelliTest during our experiments, which reduces the generalization of our results. However, IntelliTest is one of the major white-box test generators. The effects of another tool can be investigated in replication studies.
User activity: Our study allowed participants to run and debug generated tests, but modification of the source code was prohibited. Although this fact reduces the similarity to a real development scenario, modifications in the code would hinder the comparability of participant results. Moreover, it would require a vastly different study design that would introduce several new threats to the validity.
The threat to construct validity in our study concerns the independent variables. It might be that some of the variables we selected have no effect on the difficulty of classifying generated white-box tests. We addressed this threat by carefully analyzing the design and results of related studies and experiments in order to obtain the most representative set of variables.