This section describes the design, experimental objects, research questions, metrics, validation framework, procedure, results and threats to validity of the empirical study conducted to evaluate SleepReplacer. We follow the guidelines by Wohlin et al. (2012) on designing and reporting empirical studies in software engineering. To allow the replication of the study, we published the tool along with the three open-source test suites at https://sepl.dibris.unige.it/SleepReplacer.php.
The goal of the empirical study is to measure the overall effectiveness of SleepReplacer in replacing thread sleeps with a particular focus on assessing: (1) the percentage of thread sleeps replaced by SleepReplacer, (2) the time required by SleepReplacer to complete such task, and (3) the human effort reduction deriving from its adoption.
The results of this study can be interpreted from multiple perspectives: Researchers, interested in empirical data about the effectiveness of a tool able to replace thread sleeps in existing Selenium WebDriver test suites; Software Testers and Project/Quality Assurance Managers, interested in evidence about the benefits of adopting SleepReplacer in their companies. The experimental objects used to evaluate SleepReplacer are four test suites associated with four web applications, described in the next section.
To validate the proposed tool, several web applications and the corresponding test suites have been used. We used a large medical web application (PRINTO) and three small open-source web applications (Addressbook, Collabtive, PPMA). Table 1 summarizes the main properties of the considered test suites. All the test suites are written in Java, use TestNG as the testing framework, and rely on Selenium WebDriver to interact with the AUT.
The PRINTO test suite has been developed in the context of a joint academia-industry project by several junior Testers (Olianas et al., 2021). It has also been carefully refined, so its source code quality is quite high. Moreover, it has been executed automatically, every night, for several months without presenting significant problems. On the contrary, the other three test suites have been developed/refined by PhD students in the context of academic research (in particular, the preliminary versions were developed in the context of an empirical study (Leotta et al., 2013) and then refined in further works such as (Olianas et al., 2021)). They are also of good quality, but they have not been refined over time as much as the PRINTO test suite. Let us provide some additional details on the web applications under test and the corresponding test suites.
Addressbook is an open-source web application for contact management that allows storing the phone number, address, and birthday of each entry in the user's contact list. It is written in PHP, uses MySQL as its database, and its test suite is composed of 27 test methods. The test suite contains in total 10 thread sleeps.
Collabtive is an open-source project management web application for small to medium-sized businesses. It enables managing the lifecycle of a project, which can be divided into tasks assigned to the different users. It is written in PHP and its test suite is composed of 40 test methods, which contain 69 thread sleeps.
PPMA (PHP Password Manager) is an open-source web application, written in PHP, that allows storing passwords for different services. Its test suite is composed of 23 test methods, which contain 82 thread sleeps.
Research questions, metrics, and procedure
Our study aims at answering the following four research questions:
RQ1: How many thread sleeps can SleepReplacer replace?
To answer our research question RQ1, it is necessary to count the original number of thread sleeps contained in each test suite associated with the selected web applications. Then, we have to count how many of them are replaced by SleepReplacer with explicit waits relying on the rules \(R_1\), \(R_2\), and \(R_3\) described in Sect. 4. Finally, we compute the proportion of thread sleeps replaced by SleepReplacer out of the original total. Thus, the metric used to answer this question is the percentage of thread sleeps replaced.
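The explicit waits introduced by the replacement rules differ from thread sleeps in that they poll a readiness condition and return as soon as it holds, instead of pausing for a fixed time. The underlying logic can be sketched in plain Java (a simulation of the polling semantics, not Selenium's actual WebDriverWait implementation; the timings are hypothetical):

```java
import java.util.function.BooleanSupplier;

public class ExplicitWaitSketch {
    // Poll a condition until it holds or the timeout expires, mirroring the
    // semantics of an explicit wait (a simplified sketch, not Selenium code).
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) return true; // stop as soon as ready
            Thread.sleep(pollMs);                      // short poll, not a fixed pause
        }
        return false; // timed out: the test would fail here with a timeout error
    }

    public static void main(String[] args) throws InterruptedException {
        long readyAt = System.currentTimeMillis() + 100; // element "loads" after ~100 ms
        long start = System.currentTimeMillis();
        boolean ok = waitUntil(() -> System.currentTimeMillis() >= readyAt, 2000, 25);
        long elapsed = System.currentTimeMillis() - start;
        // The wait ends shortly after the condition holds, well below a
        // pessimistic fixed Thread.sleep(1000) a Tester might have written.
        System.out.println(ok && elapsed < 1000);
    }
}
```

This asymmetry explains why explicit waits both reduce execution time (they stop as soon as the condition holds) and tolerate slow loads (they keep polling up to the timeout).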
RQ2: How long does it take SleepReplacer to replace the thread sleeps?
To answer our research question RQ2, it is necessary to measure the execution time required by SleepReplacer for replacing the thread sleeps. As a final measure, we report, for each test suite, the average time (expressed in minutes) that SleepReplacer requires to complete each individual thread sleep replacement.
RQ3: How much human effort is saved by using SleepReplacer?
To answer our research question RQ3, it would be necessary to know the time required by a human Tester to perform the replacement of the thread sleeps with and without SleepReplacer, and to compute the percentage of reduction. Unfortunately, since these data were not available and we were unable to design an experiment with experienced Testers specifically to answer this research question, we decided to provide an estimate-based answer. Basically, based on historical data, we computed the average time a Tester takes to replace a single thread sleep. Since the execution of our tool requires negligible human effort (it runs in the background), we considered as human effort only that deriving from the thread sleeps that SleepReplacer was unable to replace. We then computed the percentage of reduction with respect to the total time, calculated by multiplying the estimated time of a single replacement by the total number of thread sleeps contained in the original test suite.
RQ4: What is the effect of the SleepReplacer thread sleeps replacement on the overall test suite execution time?
To answer our research question RQ4, it is necessary to measure the execution time (in minutes) of each test suite before and after the thread sleep replacement (i.e., before and after the execution of SleepReplacer). In this way, we can quantify any benefit in terms of time reduction.
Settings for each web application
To run the experiment, we simply provided the four test suites as input to SleepReplacer and waited for the run to finish. The only parameter we had to set was the number of validation runs for step 2.(b) (Fig. 1).
In the PRINTO case, the large industrial test suite, a single validation run was sufficient since we knew that the test suite was stable (as detailed in the answer to RQ3, we asked an independent Tester to replace all the thread sleeps in a previous version of PRINTO, and he rarely reported flakiness problems due to replacing the thread sleeps with explicit waits); thus we were able to run the tool and produce a valid test suite, without flakiness (note that a single run, when adopting the PO pattern, often implies multiple thread sleep validations, as described below).
However, this was not the case for the other test suites, since we did not have insights about the flakiness behavior of the test methods when the thread sleeps are replaced. In real industrial settings, this information is generally known (at least as an estimate), as human Testers have an idea of the flakiness behavior of their test suites. We tried to derive this information by manually eliminating about 10% of each application's thread sleeps. The results are described for each web app in the following.
In the case of Collabtive, we needed 20 validation runs, since the replacement of some thread sleeps caused some test methods to fail non-deterministically, and this happened very rarely. For Addressbook we also decided on 20 validation runs, even if we did not expect flakiness problems, since its execution time and number of thread sleeps are very low. Indeed, we were able to run the tool with 20 validation runs for each thread sleep in just 37 min. Finally, for PPMA, since it had many more thread sleeps (82) and a longer execution time, we set SleepReplacer to only 10 validation runs.
Two reasons can explain the difference in the number of validation runs between PRINTO and the open-source web applications: (a) the test suite quality is different; indeed, as already said, the PRINTO test suite was carefully developed during a joint industrial project and is executed daily, while the other three test suites were produced only for scientific purposes; (b) the PRINTO test suite adopts the PO design pattern while the other web applications do not. Thus, in the case of the PRINTO test suite the thread sleeps are validated multiple times (even with a single test suite run), since the thread sleeps are inside the methods of the POs and multiple test methods call them, while for the other applications each thread sleep is validated only once.
As a general guideline, given that the validation time is machine time (and not far more costly human time), it is advisable to repeat the validation as many times as possible, in order to minimize the probability of introducing flakiness.
Concerning dependency management, PRINTO, the large industrial test suite, did not have dependencies, while the other three test suites did. As said, there are two options to run a dependent test method t during validation: (1) run all the test methods required to satisfy its dependencies, or (2) save the state required by t to run correctly, and restore it whenever the tool needs to run t. Since we had all three applications under test installed in Docker containers, and it is the more efficient solution, we opted for the second choice to manage dependencies in the Collabtive, Addressbook, and PPMA test suites.
Finally, we ran all the experiments on a laptop running Windows 10 with an Intel Core i3 10110U CPU (maximum clock 4.10 GHz), 16 GB of RAM, and an SSD.
RQ1: Effectiveness in Replacing the Thread Sleeps
Table 2 shows: (1) the number of thread sleeps present in the various test suites for the four considered web applications (column Total), (2) the number of thread sleeps successfully replaced by SleepReplacer (column Replaced #), and finally (3) the percentage of the replaced thread sleeps with respect to the total number of thread sleeps (column Replaced %).
In two cases out of four (i.e., for the Addressbook and PPMA web apps), the tool was able to successfully replace all the thread sleeps with the appropriate explicit wait. The minimum effectiveness of SleepReplacer was reached in the case of Collabtive where 56 out of 69 thread sleeps were replaced (corresponding to a still satisfying 81%). Finally, looking at the complex industrial PRINTO case study, SleepReplacer was able to manage 133 thread sleeps out of 146 (91%).
Similarly, in the case of Collabtive, 13 thread sleeps remained after the execution of SleepReplacer. In this case, differently from PRINTO, most of the remaining thread sleeps did not make the test methods fail deterministically when replaced; rather, their replacement introduced some flakiness. A common situation is the one represented in Listing 9: we have a click on a web element that causes the loading of another page, and we wait for it with a thread sleep. Then, some interactions write text into a form (the text “Task001”), followed by another click on the form submission button.
Our tool, as built, replaced the thread sleep with an explicit wait for the element located by the id “title” to be clickable, but during the validation the test occasionally failed on the last line (the click on the submission button). This happened because the form is loaded with an animation that makes it appear from top to bottom and, sometimes, when the “title” element is ready the animation has not ended yet; so the submission button is not clickable, and clicking it causes an ElementNotInteractableException to be thrown.
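The race described above can be reproduced with a plain-Java simulation (the timings and element names are hypothetical; in the real test suite, the fix would be to wait for the element that is actually clicked next):

```java
import java.util.function.BooleanSupplier;

public class AnimationRaceSketch {
    static void waitUntil(BooleanSupplier condition, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean() && System.currentTimeMillis() < deadline) {
            Thread.sleep(20);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long t0 = System.currentTimeMillis();
        // The "title" field appears early, but the slide-down animation ends later,
        // so the submit button becomes clickable only afterwards (simulated timings).
        BooleanSupplier titleReady  = () -> System.currentTimeMillis() - t0 >= 100;
        BooleanSupplier buttonReady = () -> System.currentTimeMillis() - t0 >= 400;

        waitUntil(titleReady, 2000);                    // what the generated wait checked
        System.out.println(buttonReady.getAsBoolean()); // false: clicking now would fail

        waitUntil(buttonReady, 2000);                   // wait on the element actually clicked
        System.out.println(buttonReady.getAsBoolean()); // true: the click is now safe
    }
}
```

Waiting on the condition that gates the final interaction, rather than on an intermediate element, removes the non-determinism.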
Thus, to answer RQ1, we can say that our tool SleepReplacer is effective in managing the automated migration from thread sleeps to explicit waits in the four considered test suites, since it was able to complete the replacement in 92% of the cases, on average.
RQ2: Time Required for Replacing the Thread Sleeps
Table 3 shows the time required for replacing the thread sleeps using SleepReplacer. The second column reports, for each web application, the total time (expressed in minutes) required by a complete execution of SleepReplacer. Columns 3–4 and 5–6 analyze the time required for replacing each thread sleep, considering respectively all the thread sleeps in the test suite and only the thread sleeps that SleepReplacer successfully replaced.
By looking at the table, it is evident that the thread sleeps contained in the three smaller test suites (i.e., the ones for the apps Collabtive, Addressbook and PPMA) required similar average times to be processed (i.e., in the order of 3 min each). Indeed, the execution time of SleepReplacer computed considering all the thread sleeps is in the range of 2.84–3.79 min per thread sleep; focusing only on the thread sleeps successfully replaced, the time increases to the range 2.84–4.66 min per thread sleep. The lower range value remains at 2.84 since it corresponds to the case of PPMA, where all the thread sleeps were successfully replaced. On the other hand, for Collabtive the value increases from 3.79 to 4.66 since about 19% of the thread sleeps of the corresponding test suite were not replaced by SleepReplacer (note that Collabtive represents the worst case from this point of view, as previously shown in Table 2).
On the contrary, in the case of the complex industrial PRINTO case study the time required to replace each thread sleep is higher: indeed, it ranges from 9.56 min per thread sleep, when considering all the thread sleeps in the test suite, to 10.50 min per thread sleep when considering only the successfully replaced ones. This can be explained by three reasons: (1) the PRINTO test suite is based on the PO pattern and thus each thread sleep is contained in the PO methods; this leads to a higher validation time since multiple test methods can use such PO methods and are thus executed; (2) the test methods are far more complex than the ones of the other three web applications, so their execution time is far higher; (3) unlike for the PRINTO test suite, for the three open-source web applications (Addressbook, Collabtive, PPMA) we saved the application state required by each test method, as explained in Sect. 5.3.1.
Thus, to answer RQ2, we can say that the time for successfully replacing a thread sleep lies in the interval 2.84–10.50 min, with an average value of about 7 min. The actual values strongly depend on the complexity of the validation step (see Sect. 5.3.1), needed to assure that the test suite produced by SleepReplacer does not present flakiness; indeed, the source code replacement itself (step 2.(a), Fig. 1) is clearly very fast. However, the obtained execution times are absolutely acceptable, since the transformation performed by SleepReplacer has to be done only once, when the test suite is restructured.
RQ3: Percentage of Reduction of Human Effort Using SleepReplacer
Since we do not have the manual thread sleep replacement times, i.e., how long it would take an independent Software Tester to manually complete the thread sleep replacement task for each web app, we decided to estimate this value using historical data. We have this information only for a previous version of the PRINTO test suite (the one with 196 thread sleeps). In that case, we asked an independent Tester to substitute all the thread sleeps with explicit waits while recording both: (1) the time required for actually replacing the thread sleeps (i.e., the time he actually worked on the test suite code) and (2) the validation time (i.e., the time spent re-executing the test suite in order to check the absence of flakiness). To substitute the 196 thread sleeps, the Tester spent: (1) 556 min on the code (i.e., 2.84 min per thread sleep) and (2) 1309 min on the validation step (i.e., 6.68 min per thread sleep). In total, the overall time required to replace the thread sleeps in that version of the PRINTO web app amounts to 1865 min (i.e., 9.52 min per thread sleep).
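As a sanity check, the per-sleep figures above follow directly from the recorded totals (a minimal sketch; the counts are those reported for the previous PRINTO version):

```java
import java.util.Locale;

public class ManualBaseline {
    public static void main(String[] args) {
        int sleeps = 196;             // thread sleeps in the previous PRINTO version
        double codeMin = 556;         // time spent on the test suite code
        double validationMin = 1309;  // time spent re-executing the suite
        System.out.printf(Locale.ROOT, "%.2f %.2f %.2f%n",
                codeMin / sleeps,                    // 2.84 min per sleep on the code
                validationMin / sleeps,              // 6.68 min per sleep for validation
                (codeMin + validationMin) / sleeps); // 9.52 min per sleep overall
    }
}
```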
In Table 4, we used these computed values to estimate, and give an indication of, the human effort required to manually execute the thread sleep replacement for the four considered applications (including PRINTO itself, since in this study we applied SleepReplacer to a different, subsequent version).
In the case of the version of the PRINTO test suite used in this study, the total estimated time is about 23 h (1390 min); however, since for a human Tester it is far more relevant to assess the actual time spent working on the test methods' source code (the validation process can mostly run in the background while working on other tasks; this is particularly true in the case of longer and consecutive validation times), we can see that the time actually spent decreases to about 7 h (414 min). For the other applications the values are proportional and still relevant: about 3 h for Collabtive and 4 h for PPMA, while Addressbook gets the shortest time of 28 min (however, it is important to remember that it has only ten thread sleeps).
Thus, to answer RQ3, we can say that, from the estimate performed with an independent Tester, we found that replacing each thread sleep in the test code required, on average, about 3 min (about 10 min when including also the validation time). In RQ1, we have said that SleepReplacer was able to replace overall 281 thread sleeps out of the 307 present in the four considered test suites. The replacement has been carried out fully automatically, without human intervention. Considering an average time of 3 min per thread sleep for a human Tester (the most conservative estimate reported before), we have that: 307 * 3 = 921 min (where 307 is the total number of thread sleeps in the four considered apps) represents the time required by a Tester to execute the complete thread sleep replacement task, fully manually, on the four test suites; 26 * 3 = 78 min is instead the time required by a Tester to replace only the thread sleeps that SleepReplacer was unable to replace. So, even if this estimate is rough, we can conclude that the human effort reduction is very high (i.e., about 92%). Note that since in industrial test suites the number of thread sleeps could be in the order of hundreds or even thousands, the human effort savings due to the adoption of SleepReplacer in those contexts would be extremely relevant.
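The effort-reduction estimate can be reproduced in a few lines (a sketch using the counts from Table 2 and the conservative 3 min per-sleep figure; note that, with a uniform per-sleep time, the reduction percentage reduces to the fraction of sleeps handled automatically):

```java
import java.util.Locale;

public class EffortEstimate {
    public static void main(String[] args) {
        int total = 307;                  // thread sleeps in the four test suites
        int replaced = 281;               // successfully replaced by SleepReplacer
        double perSleepMin = 3.0;         // conservative manual time per sleep
        double fullyManual = total * perSleepMin;           // fully manual effort
        double residual = (total - replaced) * perSleepMin; // effort left to the Tester
        double reductionPct = 100.0 * (1 - residual / fullyManual);
        System.out.printf(Locale.ROOT, "%.0f %.0f %.0f%%%n",
                fullyManual, residual, reductionPct);
    }
}
```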
RQ4: Effect of the SleepReplacer Thread Sleeps Replacement on the Overall Test Suite Execution Time
Table 5 shows the execution time of the four considered test suites before and after replacing the thread sleeps using SleepReplacer. Columns 2 and 3 provide the total execution times (measured in minutes), respectively, for the original test suites with thread sleeps and for the restructured ones. Column 4 gives the percentage of reduction achieved thanks to the explicit wait adoption. From the table, it is evident that it is always advantageous to replace thread sleeps with explicit waits; however, the magnitude of this positive effect differs considerably across the various web applications. The lowest value has been observed in the case of Addressbook, with a reduction of 13%, while the complex industrial PRINTO case study benefited more, reaching a relevant 71% reduction. Note that a 50% reduction (as in the case of PPMA) means halving the execution times.
The reason why the percentage reductions are so different probably lies in the fact that the number and frequency of the thread sleeps (i.e., the number of thread sleeps per LOC) in the considered test suites are not constant. Indeed, for instance, Addressbook required only 10 thread sleeps to run properly, while others like PPMA, even if of comparable complexity, required far more thread sleeps (in that specific case, 82). Thus, assuming that the human Tester tuned all thread sleep values optimally, having replaced more thread sleeps clearly led to a more relevant reduction in the execution time. Indeed, explicit waits automatically minimize the time to wait, while when adopting thread sleeps it is necessary to leave a small additional time margin to manage any flakiness problems.
Thus, to answer RQ4, we can say that SleepReplacer is able to produce test suites that always run faster than their original counterparts. The benefits can vary a lot and depend on the thread sleep frequency; in our experiment, from a 13% to a 71% reduction. More in detail, the magnitude of the percentage reduction heavily depends on the initial impact of the thread sleep time on the total execution time: the percentage of thread sleep time with respect to the total execution time of the test suite represents an upper bound for SleepReplacer. Thus, in the cases where the total sleep time is only a small fraction of the total test suite execution time, clearly, the benefits of using SleepReplacer are limited.
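This upper bound can be made concrete with a back-of-the-envelope computation (the suite and sleep times below are illustrative, not measured values from our study):

```java
import java.util.Locale;

public class ReductionBound {
    public static void main(String[] args) {
        double suiteMin = 60;   // hypothetical total execution time of a suite
        double sleepMin = 15;   // hypothetical total time spent inside thread sleeps
        // Best case: every waited second is removed, so the maximum possible
        // reduction equals the share of sleep time in the total execution time.
        double upperBoundPct = 100.0 * sleepMin / suiteMin;
        System.out.printf(Locale.ROOT, "%.0f%%%n", upperBoundPct);
    }
}
```

In this hypothetical suite, even a perfect replacement of every thread sleep cannot reduce the execution time by more than a quarter.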
In this subsection, we discuss the results obtained in our study, in order to highlight the benefits that the adoption of SleepReplacer can bring to the end-to-end testing process.
Results from RQ1 show that SleepReplacer is able to automatically replace from 81 to 100% of the thread sleeps in a test suite. This is a strong point in favor of the adoption of SleepReplacer, since it tells us that the large majority of thread sleeps in a test suite can be replaced automatically. Moreover, we obtained such results using only three replacement rules; this has been done to keep the approach as general as possible and to avoid ad hoc solutions tailored to the test suites used in the empirical evaluation. In a real-world scenario, Testers can easily add new replacement rules based on their specific knowledge of the test suite, in order to reach an even higher replacement rate. Moreover, the results from RQ4 show that even the test suite with the lowest replacement rate (Collabtive, 81%) obtained a significant time reduction (21%) from the use of SleepReplacer.
Results from RQ2 highlight that the time to replace a thread sleep lies in the range of 2.84–10.50 min, with an average time of 7 min. The total times for replacing all the thread sleeps in a test suite range from 36.6 min to 1396 min (approximately 23 h, for the PRINTO test suite). The high variability of the total times depends on (1) the number of thread sleeps in the test suite, (2) the presence of the Page Object pattern, and (3) the number of validation runs required. Testers who want to employ SleepReplacer to improve a test suite must keep these factors in mind and try to estimate what the total time would be. However, even if the use of SleepReplacer may become infeasible on very large test suites, with this study we showed that SleepReplacer can be used not only on small test suites, but also on real-world, medium-to-large-sized test suites, as in the PRINTO case.
Results from RQ3, although slightly hindered by the fact that the human time is only estimated, show that the adoption of SleepReplacer can lead to great time savings with respect to the manual replacement of thread sleeps, when this task is faced. In fact, excluding validation time (which can run in the background while doing other tasks), every test suite in our study requires a human replacement time ranging from 28 to 414 min. Moreover, besides the reduction of the execution time of the test suite, another point in favor of the adoption of SleepReplacer is that manual work is error-prone, while our approach guarantees to produce a working test suite.
Finally, results from RQ4 tell us that SleepReplacer achieves its goal, that is, the reduction of the test suite execution time. In fact, comparing the execution time of the original versions of the test suites with that of the versions refactored by SleepReplacer, it is possible to observe a time reduction that goes from 13 to 71%. In our opinion, this is the strongest point in favor of the adoption of SleepReplacer, because even if the execution time of SleepReplacer can be relevant, it is a "one-shot" task, while the reduction of the execution time of the test suite can be appreciated every time the test suite is executed and can bring substantial savings in development environments where test suites are executed often.
Threats to validity
The main threats to validity affecting an empirical study are as follows: Internal, External, Construct, and Conclusion validity (Wohlin et al., 2012).
Internal Validity threats concern possible confounding factors that may affect the dependent variables: in this experiment, the number of replaced thread sleeps (RQ1), the time required to replace them by SleepReplacer (RQ2), the time required by a human Tester to execute the thread sleep replacement task (RQ3), and the total test suite execution time (RQ4). Concerning RQ1 and RQ2, our tool is able to replace the thread sleeps with the explicit waits most used in practice. However, a test suite that requires different explicit waits would require extending SleepReplacer to support them: note that SleepReplacer supports this kind of extension (basically, it is sufficient to extend the rule list R), but this would impact both the RQ1 and RQ2 results. Concerning RQ3, as previously described, the whole calculation is based on an estimate and therefore is an approximation of the true value. We were forced to use an estimate because, in order to make a fair comparison, we could not replace the thread sleeps on our own, since we already knew the test suites under study well. To perform a real comparison with fair data, we would have needed experienced Testers with no knowledge of our experimental objects. Concerning RQ4, as already described in Sect. 5.3.1, the results are heavily related to the level of optimization adopted by the Tester when defining the thread sleep times. In general, extending the wait times improves test suite stability but impacts the execution time. The values found in the four considered test suites are, in our opinion, reasonable; therefore, the results obtained are generalizable to standard test suites.
External Validity threats are related to the generalization of results. All the four test suites for the web applications employed in the empirical evaluation of SleepReplacer are realistic examples covering a good fraction of the functionalities of the respective web apps. Moreover, the test suite for PRINTO has been developed in the context of an industrial project and includes 251 test methods: so its complexity is in line with standard test suites for web applications of average size.
Construct validity threats concern the relationship between theory and observation. Concerning RQ1, RQ2, and RQ4, they are due to how we measured the effectiveness of our approach with respect to the corresponding metrics. To minimize this threat, we decided to measure them objectively, in a totally automated way. Since RQ2 and RQ4 can be influenced by the load of the computer executing, respectively, SleepReplacer and the test suite, to minimize any fluctuation we repeated each measurement three times and averaged the obtained values. We estimated that three repetitions are sufficient since we noticed that the variance is minimal. Concerning RQ3, the threat is that the answer, being based on an estimate, could be prone to error. Another possible Construct validity threat is Authors' Bias. It concerns the involvement of the authors in manual activities conducted during the empirical study and the influence of the authors' expectations about the study on such activities. To make our experimentation more realistic and to reduce this threat as much as possible, we adopted four test suites containing thread sleeps that were developed independently by other people and existed before the development of SleepReplacer. Moreover, such test suites were not used during the development of SleepReplacer but only for its validation, in order to avoid any influence on the SleepReplacer implementation (e.g., including in SleepReplacer ad hoc solutions).