In this section, we present the details of our experiment from the perspective of definition and planning - the first two stages of the experiment process (Wohlin et al. 2000). To enable further replications, we share the experimental protocol and scripts at: https://doi.org/10.5281/zenodo.1193955.
Goal
Our objective is to examine whether testers manifest confirmation bias (leading to a confirmatory attitude) during testing and whether time pressure promotes the manifestation of confirmation bias. The aim of our research, according to the Goal-Question-Metric template (Basili et al. 1994), is as follows:
Analyse the functional test cases
For the purpose of examining the effects of time pressure
With respect to confirmation bias
From the point of view of researchers
In the context of an experiment run with graduate students (as proxies for novice professionals) in an academic setting.
Consequently, the research questions of our study are:
RQ1: Do testers exhibit confirmatory behaviour when designing functional test cases?
RQ2: How does time pressure impact the confirmatory behaviour of testers when designing functional test cases?
Context
We study the manifestation of confirmation bias in functional software testing and examine whether time pressure promotes the manifestation of confirmation bias in this context. The phenomenon is studied in a controlled experiment in an academic setting with first-year master’s degree students, enrolled in the Software Quality and Testing course at the University of Oulu, as proxies for novice professionals. We limit our investigation to functional (black box) testing, which was part of the curriculum of the aforementioned course. For the purposes of the experiment, we focus only on the design of functional test cases, not their execution. We aim for an implementation-independent investigation of the phenomenon, since we are interested in studying the mental approach of the study participants in designing test cases, which precedes their execution. Moreover, in system, integration and acceptance testing, software testers (by job role/title) design test cases before the code exists, using the requirements specifications. Therefore, our scope is limited to determining the type (i.e. consistent or inconsistent with the specifications) of functional tests designed by the participants, rather than their execution or fault-detection performance. We use a realistic object for the task of designing functional test cases under time pressure and no time pressure conditions.
Variables
This section elaborates on the independent and dependent variables of our study.
Independent Variable
The independent variable of our study is time pressure with two levels: time pressure (TP) and no time pressure (NTP). To decide on the duration for the two levels, we executed a pilot run (explained in detail in Section 3.8) with five postgraduate students. It took 45 min, on average, for the pilot participants to complete the task. Accordingly, we decided to allocate 30 min for the TP group and 60 min for the NTP group to operationalise time pressure.
The timing of the task was announced differently to the experimental groups. The experimenter reminded the participants in the TP group thrice of the remaining time; the first reminder was after fifteen minutes had elapsed and the rest of the reminders were given every five minutes thereafter. This was done to psychologically build time pressure. In contrast, after the initial announcement of the given duration to the NTP group, no further time reminders were made. This is in line with how Hernandez and Preston (2013) and Ask and Granhag (2007) operationalised time pressure in their studies.
Dependent Variables
Our study includes three dependent variables: c, the number of consistent test cases; ic, the number of inconsistent test cases; and temporal demand. We define these dependent variables as follows:
Consistent test case: A consistent test case tests strictly according to what has been specified in the requirements, i.e. it checks consistency with the specified behaviour. In the context of testing, this refers to: 1) the defined program behaviour for a certain input; and 2) the defined behaviour for a specified invalid input. Example: If the specifications state, “… the phone number field does not accept alphabetic characters...”, then the test case designed to validate that the phone number field does not accept alphabetic characters is considered consistent.
Inconsistent test case: An inconsistent test case tests a scenario or data input that is not explicitly specified in the requirements. We also consider test cases that reflect outside-of-the-box thinking on the tester’s part to be inconsistent. Example: If the specifications only state, “… the phone number field accepts digits...”, and the application’s behaviour for other types of input in that field is not specified, then the following test case is considered inconsistent: the phone number field accepts only the + sign from the set of special characters (e.g. to set an international call prefix).
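To make the distinction concrete, the following sketch expresses the two example test cases above as executable checks. This is purely illustrative: the experiment involved pen-and-paper test case design without execution, and validate_phone below is a hypothetical implementation of the phone number field.

```r
library(testthat)

# Hypothetical validator for the phone number field used in the examples above:
# it accepts digits, optionally preceded by a single '+' (international prefix).
validate_phone <- function(x) grepl("^\\+?[0-9]+$", x)

# Consistent test case: checks exactly what the specification states
# ("the phone number field does not accept alphabetic characters").
test_that("phone number field rejects alphabetic characters", {
  expect_false(validate_phone("abc123"))
})

# Inconsistent test case: probes behaviour the specification leaves open
# (only the '+' sign is accepted from the set of special characters).
test_that("phone number field accepts only '+' among special characters", {
  expect_true(validate_phone("+358401234567"))
  expect_false(validate_phone("(040) 123-4567"))
})
```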
In contrast to Leventhal et al. (1994), we do not consider a test case validating an input from an invalid equivalence class to be inconsistent, as long as that input is specified in the requirements. On the contrary, we consider it consistent, because the tester has exhibited confirmatory behaviour by conforming to what s/he has been instructed to validate. If test cases were classified using both our consistent/inconsistent definitions and Causevic et al.’s (2013) positive/negative definitions, the results might be the same. However, outside-of-the-box thinking is an additional aspect of our definition of inconsistent, which also considers the completeness of the requirements specification in light of the context/domain. Unlike Calikli et al. (2010a), we do not utilise any tests from psychology to measure confirmation bias. Instead, we detect its manifestation by analysing the test artefacts of the participants, and we do not directly observe how people think or what their general thinking inclinations are.
We therefore introduce the terms consistent and inconsistent to distinguish our concept from previous measurements of confirmation bias, which rest on contradictory understandings of what constitutes a positive and a negative test case. For example, Leventhal et al. (1994) and Causevic et al. (2013) use the same terminology but measure confirmation bias differently. We believe that our proposed terminology is easier to comprehend than the potentially ambiguous positive/negative terminology.
Temporal demand: We use the NASA task load index (TLX) as a measure of task difficulty as perceived by the participants. We apply the definition of temporal demand used in the NASA-TLX, i.e. the degree of time pressure felt due to the pace or tempo at which events take place (Human Performance Research Group, NASA Ames Research Center 1987). Therefore, it captures the time pressure perceived by the participants in the experimental groups.
Data Extraction and Metrics
This section elaborates on the data extraction and the metrics defined for capturing confirmation bias and temporal demand.
Proxy Measure of Confirmation Bias
We mark the functional test cases designed by the participants as either consistent (c) or inconsistent (ic). To detect the bias of participants through a proxy measure, we derive a scalar parameter z based on the (c) and (ic) test cases designed by a participant and the total counts of (all possible) consistent (C) and inconsistent (IC) test cases for the given specification:

$$z = \frac{c}{C} - \frac{ic}{IC} $$
if z > 0: the participant has designed relatively more consistent test cases
if z < 0: the participant has designed relatively more inconsistent test cases
if z = 0: the participant has designed a relatively equal number of consistent and inconsistent test cases.
The value of z expresses, in relative terms, the difference between consistent test case coverage (c/C) and inconsistent test case coverage (ic/IC). It indicates one of the above three conditions within the range [−1, +1]. In terms of confirmation bias detection, z = 0 means the absence of confirmation bias. A value of +1 indicates the maximum manifestation of confirmation bias, because only consistent test cases are designed, with complete coverage, whereas −1 indicates that only inconsistent test cases are designed, with complete coverage. We should note that although −1 is an unusual case, indicating that no consistent test cases were designed at all, it depicts a situation in which no bias has manifested. This case is unlikely to occur in practice, because it would mean that not a single test case validating the specified behaviour of the application was designed.

For the purposes of measurement, we predesigned a complete set of test cases (consistent and inconsistent) based on our expertise, from a researcher’s perspective. Our designed set comprised 18 consistent (C) and 37 inconsistent (IC) test cases. The totals of consistent and inconsistent test cases are defined in absolute numbers so that we can compare and analyse the participants’ test cases in relation to a (heuristic) baseline. To enhance the validity of the measures, we extended our predesigned set after the experiment with the valid test cases (consistent and inconsistent) designed by the participants that were missing from our set. This step is in line with Mäntylä et al. (2014), who noted that it helps to improve the validity of the results. As a result, our test set includes 18 (C) and 50 (IC) test cases in total.
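For illustration, the computation of z for a single participant can be sketched as follows. The function and the example counts of designed test cases are illustrative; only the baseline totals C = 18 and IC = 50 come from our extended measurement set.

```r
# Proxy measure of confirmation bias (z): the difference between consistent
# and inconsistent test case coverage, ranging from -1 to +1.
C  <- 18   # total consistent test cases in the extended baseline set
IC <- 50   # total inconsistent test cases in the extended baseline set

confirmation_bias_z <- function(c, ic) {
  stopifnot(c >= 0, ic >= 0, c <= C, ic <= IC)
  (c / C) - (ic / IC)
}

# An illustrative participant who designed 9 consistent and 5 inconsistent test cases:
confirmation_bias_z(c = 9, ic = 5)   # 0.5 - 0.1 = 0.4 -> relatively more consistent (z > 0)
```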
Temporal Demand
In order to capture temporal demand (TD), we used the values of the rating scale marked by the participants on the NASA-TLX sheets. The scale ranges from 0 to 100, i.e. from low to high temporal demand perceived for the task (Human Performance Research Group, NASA Ames Research Center 1987).
Hypothesis Formulation
According to the goals of our study, we formulate the following hypotheses:
H1 states that testers design more consistent test cases than inconsistent test cases.
$$H1_{A}: \mu (c) > \mu (ic) $$
and the corresponding null hypothesis is:
$$H1_{0}: \mu (c)\leq \mu (ic) $$
The directional nature of H1 is based on the findings of Teasley et al. (1994) and Causevic et al. (2013), whose experiments revealed the presence of positive test bias in software testing.
Regarding the effect of time pressure on consistent and inconsistent test cases, our second hypothesis postulates the following:
H2: (Dis)confirmatory behaviour in software testing differs between testers under time pressure and under no time pressure.
$$H2_{A}: \mu ([c, ic]_{TP})\neq \mu ([c, ic]_{NTP}) $$
$$H2_{0}: \mu ([c, ic]_{TP}) = \mu ([c, ic]_{NTP}) $$
While H2 makes a comparison in absolute terms, the third hypothesis considers the effect of time pressure on confirmation bias in relative terms, via the coverage-based measure z. Accordingly, H3 states:
H3: Testers under time pressure manifest relatively more confirmation bias than testers under no time pressure.
$$H3_{A}: \mu (z_{TP}) > \mu (z_{NTP}) $$
$$H3_{0}: \mu (z_{TP})\leq \mu (z_{NTP}) $$
The directional nature of H3 is based on the evidence from the psychology literature in which time pressure was observed to increase confirmation bias in the studied contexts (Ask and Granhag 2007; Hernandez and Preston 2013).
To validate the manipulation of the levels of the independent variable (time pressure vs no time pressure), we formulate a post-hoc sanity-check hypothesis:
H4: Testers under time pressure experience more temporal demand than testers under no time pressure.
$$H4_{A}: \mu (TD_{TP}) > \mu (TD_{NTP}) $$
$$H4_{0}: \mu (TD_{TP})\leq \mu (TD_{NTP}) $$
Design
We chose a one-factor, two-level, between-subjects experimental design for its simplicity and its adequacy for investigating the phenomenon of interest, as opposed to alternative designs. Table 2 shows the design of the experiment, where ES stands for experimental session and TP and NTP stand for the time pressure and no time pressure groups, respectively. In addition, this design is preferable because it does not introduce a confounding task-treatment interaction, and it enables the investigation of the effects of the treatment and control on the same object in parallel sessions.
Table 2 Experimental design
Participants
We employed convenience sampling in order to draw from the target population. The participants were first-year graduate-level (master’s) students registered in the Software Quality and Testing course offered as part of an international graduate degree programme at the University of Oulu, Finland, in 2015. The students provided written consent for the inclusion of their data in the experiment. All students were offered this exercise as a non-graded class activity, regardless of their consent. However, we encouraged the students to participate in the experiment by offering them bonus marks as an incentive. This incentive was announced in the introductory lecture of the course at the beginning of the term. In the data reduction step, we dropped the data of those who did not consent to participate in the experiment. This resulted in a total of 43 experimental participants.
Figure 1 presents a clustered bar chart showing the academic and industrial experience of the 43 participants in software development and testing. Along the y-axis are the percentages depicting the participants’ range of experience in the categories presented along the x-axis. The experience categories are: Academic Development Experience (ADE), Academic Testing Experience (ATE), Industrial Development Experience (IDE) and Industrial Testing Experience (ITE). The four experience ranges are: less than 6 months, between 6 months and 1 year, between 1 and 3 years, and more than 3 years.
More than 80% of the participants have less than 6 months of testing experience in both academia and industry, which is equivalent to almost no testing experience. This indicates that our participants have much less experience in testing than in development. The second highest percentages fall in the 1 to 3 years range, except for ITE. The pre-questionnaire data also shows that 40% of the participants have industrial experience, i.e. more than 6 months. Thirty-two percent reported their industrial development and testing experience based on having held developer or tester roles, rather than considering testing as part of a development activity.
Considering our participants’ experience and the degree programme in which they are enrolled, we can categorise them as proxies for novice professionals, following Salman et al. (2015).
Training
Before the actual experimental session began, the students enrolled in the course were trained on functional software testing in multiple sessions, as part of their curriculum. They were taught functional (black box) testing over two lectures. In addition, one lecture was reserved for an in-class exercise in which the students were trained for the experimental session using the same type of materials but a different object (requirements specification), to gain familiarity with the setup. Specifically, the in-class exercise consisted of designing functional test cases from a requirements specification document using the test case design template that was later used in the actual experiment. However, unlike in the actual experimental session, the students were not provided with a supporting screenshot for the requirements, which was available during the experimental sessions. One of the authors discussed the students’ test cases in the same training session to give feedback. In every lecture, we specifically taught and encouraged the students to think of inconsistent test cases. Table 3 shows the sequence of these lectures along with the major content related to the experiment. We sequenced and scheduled the lectures and the in-class exercise to facilitate the experiment.
Table 3 Training sequence
Experimental Materials
The instrumentation that we developed to conduct the experiment consisted of a pre-questionnaire, the requirements specification document with a supporting screenshot, the test case design template and a post-questionnaire. Filling in the test case design template and the post-questionnaire were pen-and-paper activities, whereas the pre-questionnaire was administered online. The experimental package consisting of all the above-mentioned materials is available in the replication package (see the DOI at the beginning of this section).
Pre/Post-questionnaires
Having background information on the participants aids in their characterisation. The pre-questionnaire was designed using an online utility and collected information regarding participants’ gender, educational background, and academic and industrial development and testing experience. The academic experience questions collected information on the development and testing performed by the participants as part of their degree programme courses. The industrial experience questions collected information on the testing and development performed by the participants in different roles (developer, tester, designer, manager or other).
We used hardcopy NASA-TLX scales to collect post-questionnaire data for two reasons. First, NASA-TLX is a well-known instrument for measuring task difficulty as perceived by the task performer (Mäntylä et al. 2014). Second, one of the load attributes with which task difficulty is measured, temporal demand, captures the time pressure experienced by the person performing the task. In this respect, NASA-TLX was also useful for assessing (via H4) how well we administered and manipulated our independent variable, i.e. time pressure, in terms of the temporal demand felt by the participants.
Experimental Object
We used a single object appropriate for the experimental design. The task of the participants was to design test cases for the requirements specification document of the experimental object, MusicFone, which has been used in several reported experiments as a means of simulating a realistic programming task, e.g. by Fucci et al. (2015). MusicFone is a GPS-based music playing application. It generates a list of recommended artists based on the artist currently being played. It also displays the artists’ upcoming concerts and allows the user to plan an itinerary based on their selection of artists and their GPS location. Hence, choosing MusicFone as the object is an attempt to mitigate one of the non-realism factors of experiments (a non-realistic task), as discussed by Sjoberg et al. (2002).
MusicFone’s requirements specification document was originally intended for a lengthy programming-oriented experiment (Fucci et al. 2015). In order to address our experimental goals and to abide by the available experiment execution time, we modified the requirements specification document so that it could be used for designing test cases within the available time frame. Leventhal et al. (1994) defined three levels of specifications: minimal (features are sketched only), positive only (program actions on valid inputs) and positive/negative (program actions on valid and any invalid input) specifications. If we relate the completeness of our object’s (MusicFone) specifications to Leventhal et al.’s (1994) levels, it is closest to a positive-only specification document. We cannot classify MusicFone into the third category because the specifications state the required behaviours but contain little information on handling non-required behaviour of the application, which qualifies our object as having a realistic level of specification (Davis et al. 1993; Albayrak et al. 2009).
In addition to the requirements specification document, we also provided a screenshot of the working version of the MusicFone application UI to serve as a conceptual prototype and enable a better understanding of the application. We ensured that the screenshot of the developed UI was consistent with the provided requirements specification, because the presence of errors, or feedback from those errors, could affect testing behaviour in terms of positive test bias (Teasley et al. 1994). In other words, if a tester finds an error, s/he might start looking for more similar errors, which would impact the testing behaviour and lead to negative testing (Teasley et al. 1994). However, the participants did not interact with the UI in our experimental setting, as we considered interaction to be part of the execution of tests rather than their design.
Test Case Design Template
To ensure consistency in the data collection, we prepared and provided a test case design template to the participants, as shown in Table 4. This template consists of three main columns: test case description, input/pre-condition and expected output/post-condition, along with an example test case. The example test case is provided so that the participants know the level of detail required when designing test cases. Furthermore, these three columns were chosen to aid us in understanding the designed test cases better during the data extraction phase, e.g. in marking them as consistent or inconsistent.
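As an illustration of how the filled-in templates map onto our data extraction, one row of the template can be represented as a simple record with the three columns described above, plus the consistency classification assigned during extraction. This is a sketch only; the example row reuses the earlier phone number example rather than the example test case given to the participants.

```r
# Hypothetical representation of one filled-in row of the test case design template
# (three columns), plus the classification added later during data extraction.
template_row <- data.frame(
  description        = "Verify that the phone number field rejects alphabetic characters",
  input_precondition = "Enter 'abc123' into the phone number field",
  expected_output    = "The input is rejected and a validation message is shown",
  stringsAsFactors   = FALSE
)
template_row$classification <- "consistent"   # assigned by the researchers, not the participants
```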
Table 4 Test case design template
Pilot Run
We executed a pilot run with five postgraduate students to achieve two objectives. The first was to decide on the durations of the two levels (TP, NTP), and the second was to improve the instrumentation of our experiment. In addition to meeting the first objective, we improved the wording of the requirements specification document to increase comprehension, based on the feedback from the participants. Two authors of this study independently marked the pilot run test cases as (in)consistent and then resolved discrepancies via discussion. This ensured that they shared a common understanding, which the first author then applied in the data extraction.
Analysis Methods
We analysed the data by preparing descriptive statistics, followed by statistical significance tests from the t-test family and an F-test (Hotelling’s T²) to test our hypotheses. Before running each statistical test, we checked whether the data met its assumptions; when the data failed to meet the assumptions, the non-parametric counterpart of the respective test was performed. Hotelling’s T² assumes the following:
1. There are no distinct subpopulations and the populations of the samples have unique means.
2. The subjects from both populations are independently sampled.
3. The data from both populations have a common variance-covariance matrix.
4. Both populations follow a multivariate normal distribution.
The Shapiro-Wilk test was used to check the univariate and multivariate normality assumptions. We report multiple types of effect sizes depending on the statistical test run and the assumptions considered by the respective effect size measure (Fritz et al. 2012). Cohen’s d (0.2 = small, 0.5 = medium, 0.8 = large) and the correlation coefficient r (0.10 = small, 0.30 = medium, 0.50 = large) are used for the univariate tests, and the Mahalanobis distance is applied for the multivariate test. It is important to note that the strength thresholds for r are not equivalent to those for d, because “Cohen requires larger effects when measuring the strength of association” (Ellis 2009). To test the directional hypotheses, we performed one-tailed tests (except for H2), and α was set to 0.05 for all significance tests. The environment used for the statistical tests and for preparing the relevant plots was RStudio ver. 0.99.892, with external packages including Hotelling (Curran 2015), rrcov, mvnormtest and profileR (Desjardins 2005). The r effect sizes were computed using the formulae provided by Fritz et al. (2012).
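For illustration, one possible mapping of the hypotheses to these tests in R is sketched below. This is a sketch only: the data frame df (with columns group, c, ic, z and TD) is hypothetical, the paired comparison for H1 is one plausible choice given that c and ic come from the same participants, and the calls assume the Hotelling and mvnormtest packages listed above.

```r
library(Hotelling)    # hotelling.test() for the multivariate comparison in H2
library(mvnormtest)   # mshapiro.test() for the multivariate normality assumption

# df is a hypothetical data frame with one row per participant:
# group ("TP"/"NTP"), c, ic, z and TD as defined earlier in this section.
tp  <- subset(df, group == "TP")
ntp <- subset(df, group == "NTP")

# H1 (one-tailed): testers design more consistent than inconsistent test cases.
t.test(df$c, df$ic, paired = TRUE, alternative = "greater")

# H2 (two-tailed, multivariate): the [c, ic] profile differs between TP and NTP.
mshapiro.test(t(as.matrix(tp[, c("c", "ic")])))    # normality check, repeated for ntp
hotelling.test(as.matrix(tp[, c("c", "ic")]), as.matrix(ntp[, c("c", "ic")]))

# H3 and H4 (one-tailed): z and TD are higher under time pressure.
shapiro.test(df$z)                                  # univariate normality check
t.test(tp$z,  ntp$z,  alternative = "greater")
t.test(tp$TD, ntp$TD, alternative = "greater")
# If normality is violated, the non-parametric counterpart is used instead, e.g.
# wilcox.test(tp$z, ntp$z, alternative = "greater")

# Effect size r from a t statistic, following Fritz et al. (2012): r = sqrt(t^2 / (t^2 + df))
effect_r <- function(t_stat, dof) sqrt(t_stat^2 / (t_stat^2 + dof))
```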