1 Introduction

Testing is a verification process that consists of the following steps: (1) generate a test case, (2) predict the outcome of the test case, (3) execute the system under test (SUT) with the test case to obtain the actual outcome and (4) compare the predicted outcome against the actual outcome to obtain a pass/fail verdict. The oracle problem describes scenarios in which carrying out steps 2 or 4 is either impossible or prohibitively expensive. This issue was initially discussed by Weyuker, who used the term 'untestable programs' (Weyuker 1982). It has been observed that many real systems suffer from the oracle problem, such as bioinformatics systems (Chen et al. 2009), service-oriented architectures (Chan et al. 2007) and systems that implement graph theory, computer graphics and compilers (Zhou et al. 2004). One might expect a system to suffer from the oracle problem if the required relationship between the input and output is highly complex and the problem being solved by the system is such that we do not know the answer in advance.

Since the oracle problem is a pervasive and non-trivial problem, it has inspired a large amount of research. Numerous techniques have been proposed as solutions to the oracle problem, including assertions, metamorphic testing, statistical hypothesis testing, N-version testing and machine learning oracles (Patel and Hierons 2017). However, the main focus has been on metamorphic testing, an approach/term introduced by T.Y. Chen (1998, 2003).

Metamorphic testing (MT) is based on a conceptually simple but powerful idea. Essentially, in metamorphic testing we have a metamorphic relation that is an expected property of the SUT, with this property mentioning a number of inputs and corresponding outputs. Let us suppose, for example, that we are to test an implementation f of the sine function. We have a number of properties of this, such as \(\sin(x) = -\sin(-x)\). Thus, for example, if \(f(0.346) \neq -f(-0.346)\) then we can deduce that f is faulty even though we do not know the values of \(\sin(0.346)\) and \(\sin(-0.346)\). In metamorphic testing, we typically run the SUT on a number of inputs \(x_1, \ldots, x_k\), observe outputs \(y_1, \ldots, y_k\), and on the basis of these, we then choose a next input \(x_{k+1}\) to use. We then apply \(x_{k+1}\), record the resultant output \(y_{k+1}\) and check that the inputs \(x_1, \ldots, x_{k+1}\) and outputs \(y_1, \ldots, y_{k+1}\) satisfy the metamorphic relation (for recent surveys on MT, see Chen et al. (2018) and Segura et al. (2016)).

In this paper, we consider the problem of testing an implementation of a uniformity statistic. A sample is a collection of data points, such as \(Sample_1 = \{1,2,2,3,4,5\}\) or \(Sample_2 = \{1,2,3,4,5,6\}\), that was produced by some stochastic process such as a random number generator. The distribution of the stochastic process is said to be uniform if every element has an equal chance of being randomly selected. Uniformity statistics are a means of measuring the extent to which a sample is uniformly distributed. Our initial interest in uniformity statistics was motivated by two aspects of software engineering. First, it has been observed that test suites with highly diverse test cases are often effective in finding faults, at least when compared with less diverse test suites (see, for example, Feldt et al. 2008). There are test generation approaches, such as adaptive random testing (Chen et al. 2004), that aim to generate highly diverse test suites, but if we can measure uniformity then there is potential, for example, to incorporate a uniformity statistic as a fitness function in search-based software testing. The second motivation is that in experiments we would like to uniformly sample from a space of subjects/examples; we might use uniformity statistics either to check the sampling mechanism used or to drive sampling. Note also that if we can test for uniformity then we can also test samples against other types of distributions (Marhuenda et al. 2005).

Uniformity statistics play an important role in many software engineering research areas. For example, in the context of software quality, Chen (2013) used the Kolmogorov-Smirnov test to evaluate the uniformity of measures of defect density. In the context of software testing, Dutra et al. (2018) developed an algorithm that generates a set of test cases such that the set is uniformly distributed and each test case in the set satisfies some constraint. They used the \(\chi^2\) uniformity statistic to evaluate their tool.

Consider a genetic algorithm that implements a crossover operator with a crossover probability of 40%. This crossover operator might be implemented as follows: a number between 0 and 1 is randomly sampled from a uniform distribution; if this number is less than or equal to 0.4, then the crossover operator will be invoked. A fault might exist in this implementation that causes the distribution to be non-uniform; in particular, the distribution might be positively skewed. The existence of such a fault would mean that the probability of crossover is greater than 40%, and assuming that 40% is the optimal crossover rate, this would lead to the genetic algorithm performing suboptimally. This demonstrates that the correctness of uniform distributions is important for tools and techniques that rely on them and by implication software engineering research that uses these tools. Examples of such tools include random number generators and Monte Carlo simulations (Harrison 2010), adaptive random testing (Liu et al. 2010), an algorithm that was proposed by Claessen et al. (2014) that generates test data that complies with the uniform distribution and a set of constraints, the cross entropy–based test case generation approach proposed by Chockler et al. (2007) and metrics for measuring information loss in a system, which were developed by Androutsopoulos et al. (2014). Confidence in the correctness of a uniform distribution can be ascertained by uniformity statistics.
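
To make this concrete, the following minimal Python sketch (ours, not taken from any particular genetic algorithm library; the constant and function names are illustrative) shows the crossover decision described above:

```python
import random

CROSSOVER_PROBABILITY = 0.4  # illustrative crossover rate from the example above

def should_crossover():
    """Decide whether to invoke the crossover operator.

    A value is drawn from a (presumed) uniform distribution over [0, 1] and
    crossover is performed when it does not exceed the crossover probability.
    A fault that skews this distribution towards 0 would silently raise the
    effective crossover rate above 40%.
    """
    return random.random() <= CROSSOVER_PROBABILITY
```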

Implementations of uniformity statistics suffer from the oracle problem because it is often infeasible to predict the expected value of a test statistic. This paper investigates a metamorphic testing and machine learning–based approach for overcoming the oracle problem for this class of SUT. Metamorphic testing was chosen because it has been found to be effective in many areas (Chen et al. 2018; Segura et al. 2016). We believe that implementations of many other types of statistics (e.g. correlation statistics) also suffer from the oracle problem, and we hope that the work outlined in this paper will feed into the development of testing techniques for other such statistics. We are not aware of any work that has used MT for the testing of implementations of statistics. However, MT has been applied to the problem of checking the output of stochastic programs (Guderlei and Mayer 2007). In this previous work, statistical hypothesis testing was used to check properties (e.g. mean, standard deviation) of samples produced with different (but related) inputs.

Others have investigated means of enhancing metamorphic testing with the use of machine learning techniques. For example, Chan et al. (2010) formulated a hybrid technique, where a machine learning oracle is applied to a set of test cases. If this oracle detects a failure, then the hybrid approach concludes that the system is faulty. However, if this oracle does not detect a failure, then the test cases that were evaluated by this oracle are then used for metamorphic testing. Another example includes Kanewala and Bieman’s technique (Kanewala and Bieman 2013, 2015), which uses machine learning to identify whether target code exhibits a particular metamorphic relation. Such applications of machine learning in the context of metamorphic testing differ from our own.

The approach developed in this paper is motivated by the following observation: a normal distribution with a mean of 0.5 and a higher standard deviation is 'flatter' over the region [0,1], and so samples drawn from such a distribution and restricted to [0,1] should adhere more strongly to uniformity. In other words, when considering normal distributions, there is a positive correlation between the standard deviation and uniformity. Regression analysis (which is a form of machine learning) can be used to build predictive models based on such correlations. We developed a metamorphic relation, which we call the regression model–based metamorphic relation, that operates by comparing the implementation of a uniformity statistic to such a predictive model; comparisons that reveal a discrepancy between the implementation and the model are an indication that a fault might be present in the uniformity statistic.
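
The following sketch, which is ours and purely illustrative, demonstrates this correlation by drawing samples from normal distributions with a mean of 0.5 and increasing standard deviations, restricting them to [0,1] and measuring their distance from the uniform distribution; here the one-sample Kolmogorov-Smirnov statistic stands in for a uniformity statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def truncated_normal_sample(sd, size=1000, mean=0.5):
    """Draw values from N(mean, sd) and keep only those that fall in [0, 1]."""
    values = rng.normal(mean, sd, size * 10)
    return values[(values >= 0) & (values <= 1)][:size]

# As the standard deviation grows, the sample becomes 'flatter' on [0, 1],
# so its distance from the uniform distribution tends to shrink.
for sd in (0.1, 0.2, 0.28):
    ks_distance, _ = stats.kstest(truncated_normal_sample(sd), "uniform")
    print(f"sd={sd:.2f}  KS distance from uniform={ks_distance:.3f}")
```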

This approach can be seen as an instance of metamorphic testing, if one uses the recent generalised definition of metamorphic testing (Chen et al. 2018), since we utilise the results of previous test executions in checking the outcome of a test, assuming that the training data is obtained from the SUT. Training data might be obtained from the SUT in the context of regression testing, or by other means, e.g. from a reference implementation, in other contexts.

The proposed technique was evaluated by applying it to implementations of 18 different uniformity statistics. As we explain in Section 4, these uniformity statistics were chosen since they are representative of different approaches to checking for uniformity. The results of the evaluation were promising; the proposed technique obtained an average mutation score of 92.96% and can obtain a false positive rate of 0, and the predictive models were reasonably accurate.

The main contributions made in this paper are:

1. A new type of metamorphic relation called the regression model–based metamorphic relation

2. 18 regression model–based metamorphic relations for uniformity statistics

3. An investigation of the effectiveness of regression model–based metamorphic relations for uniformity statistics.

The remainder of this paper is structured as follows. Section 2 outlines background material, and Section 3 introduces the proposed technique. Section 4 describes the experimental design; Section 5 presents the results of the experiments, and Section 6 explores threats to validity. Finally, Section 7 draws conclusions and outlines future work.

2 Background

2.1 The oracle problem

The oracle problem refers to situations in which it is infeasible to predict the test outcome or verify the program’s output against the predicted test outcome (Weyuker 1982). A number of techniques have been proposed to alleviate the oracle problem, including metamorphic testing, statistical hypothesis testing, N-version testing and machine learning oracles (Patel and Hierons 2017; Chen et al. 1998, 2003). This section briefly introduces these techniques.

A metamorphic test group (MG) is a sequence of test cases (Chen et al. 2018). Each test case in a metamorphic test group can be classified as either a source test case or a follow-up test case. Source test cases are produced by some, possibly arbitrary, test case generation method, whilst follow-up test cases are generated based on source test cases and possibly also the response of the SUT to these inputs (Chen et al. 2018). For example, consider a sorting algorithm. A source test case, \(tc_s\), might be a randomly generated sample, and a follow-up test case, \(tc_f\), might be the result of reversing the order of \(tc_s\). A metamorphic relation (MR) is an expected relation between source and follow-up test cases and their resulting outputs. For example, one would expect the sorting algorithm to produce the same output for \(tc_s\) and \(tc_f\).
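
As an illustration, here is a minimal Python sketch of this metamorphic relation for sorting; the function names are hypothetical and Python's built-in sorted() stands in for the SUT:

```python
import random

def check_sorting_mr(sort_function, sample_size=20):
    """Check the MR described above: sorting a source test case and sorting
    its reversal (the follow-up test case) should yield identical outputs."""
    source_test_case = [random.random() for _ in range(sample_size)]
    follow_up_test_case = list(reversed(source_test_case))
    return sort_function(source_test_case) == sort_function(follow_up_test_case)

# Example usage, with the built-in sorted() standing in for the SUT:
assert check_sorting_mr(sorted)
```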

There are a number of factors that can affect the effectiveness of a metamorphic relation (Patel and Hierons 2017). For example, Cao et al. (2013) discovered that the diversity of paths taken by test cases in a metamorphic test group is an important determinant of effectiveness. Another factor is the tightness of an MR (Liu et al. 2014), which refers to how much an MR constrains behaviour. For example, suppose that two MRs, \(MR_t\) and \(MR_l\), were developed for the max function, \(\max(a,b)\). \(MR_t\) might check that \(\max(a,b)\times 2 = \max(a\times 2, b\times 2)\), whilst \(MR_l\) might check \(\max(a,b) < \max(a\times 2, b\times 2)\) (for positive a and b). \(MR_t\) is tighter than \(MR_l\) because, if we fix one side of the relation, only one value can satisfy \(MR_t\), whilst an infinite number of values could satisfy \(MR_l\). Finally, if we use multiple MRs then it is desirable to use a diverse set of MRs (Liu et al. 2014; Chen et al. 2015).
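
For instance, the two MRs above could be expressed as the following checks (a sketch of ours, not taken from the cited work):

```python
def mr_tight(a, b):
    """Tight MR: scaling both arguments of max by 2 scales the result by 2."""
    return max(a, b) * 2 == max(a * 2, b * 2)

def mr_loose(a, b):
    """Looser MR: for positive a and b, doubling the arguments strictly
    increases the maximum; many different outputs would satisfy this."""
    return max(a, b) < max(a * 2, b * 2)
```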

N-version testing, which is also referred to as back-to-back testing, involves comparing the outputs of the SUT against the outputs of reference implementations (Weyuker 1982). One of the main issues is that a reference implementation may not be available. A number of approaches have been developed to overcome this obstacle, e.g. testability transformations (McMinn 2009), component harvesting (Hummel et al. 2006) and using previous versions of the SUT as reference implementations (Zhang et al. 2009).

The third approach that we consider in this section is statistical hypothesis testing. Statistical hypothesis testing involves executing the SUT multiple times to obtain a set of outputs, and then conducting a statistical test, e.g. t test, to compare the distribution of this set to an expected distribution (Guderlei and Mayer 2007). One issue with this approach is that it assumes that the expected distribution is already known (Mayer 2005), which might not necessarily be true. It is sometimes possible to combine this with MT, by comparing the properties of samples produced using different (but related) inputs (Guderlei and Mayer 2007). Another issue is the need to tune parameters, e.g. it is necessary to tune the significance threshold of statistics that are used by a statistical hypothesis test (Patel and Hierons 2017).

Finally, machine learning oracles are predictive models or classifiers that can be used as test oracles (Patel and Hierons 2017). For example, a machine learning oracle might be an approximation of the SUT and so can predict the output of a given input or might be able to classify a test case as passed or failed, based on patterns that can be observed in the execution trace. Machine learning oracles are derived from machine learning algorithms that have been applied to training datasets. Similar issues to those that were discussed for statistical hypothesis testing are relevant for machine learning oracles (Patel and Hierons 2017). For example, some machine learning oracles have tunable parameters (Patel and Hierons 2017). One also requires a training dataset; again one might alleviate this assumption by procuring a dataset from a domain expert or a reference implementation (Chan et al. 2006).

2.2 Uniformity statistics

In this paper, we considered five statistics that are empirical distribution function statistics: \(D_{n}^{+}\), \(D_{n}^{-}\), \(V_{n}\), \({W_{n}^{2}}\) and \({U_{n}^{2}}\). For a given value x, an empirical distribution function, which is associated with a particular sample, returns the probability that a value chosen at random from the associated sample will be less than or equal to x. Empirical distribution function statistics operate by comparing the cumulative distribution function to the empirical distribution function (Marhuenda et al. 2005).
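
To illustrate, the following is a sketch of the empirical distribution function and of a one-sided statistic in the style of \(D_{n}^{+}\); the exact formulations used by Marhuenda et al. (2005) may differ in detail:

```python
def empirical_distribution_function(sample, x):
    """Return the proportion of the sample that is less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

def d_n_plus(sample):
    """One-sided Kolmogorov-Smirnov style statistic for uniformity on [0, 1]:
    the largest amount by which the empirical distribution function exceeds
    the uniform cumulative distribution function F(x) = x."""
    ordered = sorted(sample)
    n = len(ordered)
    return max((i + 1) / n - u for i, u in enumerate(ordered))
```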

We also considered eight order statistics (Marhuenda et al. 2005): \(C_{n}^{+}\), \(C_{n}^{-}\), \(C_{n}\), \(K_{n}\), \(T_{1}\), \(T_{2}\), \(T_{1}^{\prime }\) and \(T_{2}^{\prime }\). Let \(Sample_o\) be a sample \(\{U_{(1)}, U_{(2)}, \ldots, U_{(n)}\}\) that has been arranged in ascending order. The core intuition behind order statistics is to compare each element of the sample \(U_{(i)}\) with its expected value. For the first six statistics, the expected value of \(U_{(i)}\) is defined as \(i/(n+1)\), where n is the sample size; for the last two statistics, it is \((i-1)/(n-1)\).

Spacings statistics (G(n) and Q) are computed based on spacings (Marhuenda et al. 2005). A spacing is defined to be the difference between two adjacent elements in an ordered (ascending order) sample. There are two exceptions to this. Firstly, if the two adjacent elements are the first element of the sample and the non-existent element to its left, then the spacing is defined to be the value of the first element; e.g. in the case of \(Sample_o\), the spacing for \(U_{(0)}\) and \(U_{(1)}\) would be \(U_{(1)}\). Similarly, if the two adjacent elements are the last element of the sample and the non-existent element to its right, then the spacing is the difference between 1 and the value of the last element; continuing the previous example, the spacing for \(U_{(n)}\) and \(U_{(n+1)}\) would be \(1 - U_{(n)}\).
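
A small sketch of this definition (the boundary conventions follow the description above):

```python
def spacings(sample):
    """Compute the spacings of a sample on [0, 1].

    The ordered sample is padded with 0 on the left and 1 on the right, so the
    first spacing is U_(1) itself and the last spacing is 1 - U_(n)."""
    ordered = sorted(sample)
    padded = [0.0] + ordered + [1.0]
    return [padded[i + 1] - padded[i] for i in range(len(padded) - 1)]

# For the sample {0.2, 0.5, 0.9} the spacings are approximately [0.2, 0.3, 0.4, 0.1].
print(spacings([0.5, 0.2, 0.9]))
```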

\(S_{n}^{(m)}\), A(n) and \(E_{m,n}\) are uniformity statistics that are referred to as higher order spacings statistics. In this context, a spacing is defined to be the difference between two elements of an ordered (ascending order) sample. Unlike the definition of a spacing used by the spacings statistics, the elements are not necessarily adjacent; they are separated by m − 1 elements, where m is a user-defined parameter. To illustrate, consider the sample {1,2,3,4,5,6,7,8} and m = 3. Examples of spacings include the differences between elements 1 and 4, 2 and 5, and 3 and 6 (because in all cases, there is a gap of m − 1, i.e. 2, elements). Higher order spacings statistics are computed based on these spacings (Marhuenda et al. 2005).
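
The following sketch computes such m-spacings; whether and how the sample is padded at the boundaries varies between statistics, so this illustration omits any padding:

```python
def m_spacings(sample, m):
    """Compute overlapping m-spacings of an ordered sample: the difference
    between two elements separated by m - 1 intermediate elements."""
    ordered = sorted(sample)
    return [ordered[i + m] - ordered[i] for i in range(len(ordered) - m)]

# With sample = {1, ..., 8} and m = 3, each m-spacing skips m - 1 = 2 elements,
# e.g. 4 - 1, 5 - 2 and 6 - 3.
print(m_spacings([1, 2, 3, 4, 5, 6, 7, 8], 3))
```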

3 Regression model–based metamorphic relation

3.1 Intuition

Let US denote an implementation of a uniformity statistic. Also let the training data be a set of pairs \(\langle SD_i, TSV_i\rangle\), such that \(SD_i\) is the standard deviation of a sample, \(Samp_i\), and \(TSV_i\) is the result of computing US on \(Samp_i\). A predictive model, \(PM_{US}\), can be obtained through a regression analysis of the training data. \(PM_{US}\) can predict the output of US for a given standard deviation value.

Our partial oracle operates by first generating a set of samples and executing US with these samples to obtain a set of outputs. It then computes the standard deviation of each of these samples and executes \(PM_{US}\) with these standard deviations to obtain another set of outputs. Finally, it uses the Wilcoxon signed rank test to compare the first set of outputs against the second set of outputs. A significant difference indicates failure.

Our partial oracle is an example of a metamorphic relation. The remainder of this section illustrates this by superimposing metamorphic testing terminology on to the description above. \(PM_{US}\) can be derived from information about previous executions; thus, inputs to \(PM_{US}\) can be conceptualised as source test cases. Inputs to US are akin to follow-up test cases. A comparison is performed between the outputs of the source and follow-up test cases (facilitated by the Wilcoxon signed rank test) to determine whether the system is faulty.

3.2 The approach

The standard deviation is intrinsically linked to the values of test statistics. We realised that it might be possible to use regression analysis, which is a form of machine learning, to learn the precise nature of the relationship between the standard deviation and a given test statistic, and to derive a formula that describes this relationship.

A test case for a uniformity statistic is a sample that contains 1000 elements, drawn from a normal distribution parameterised with a mean of 0.5 and a standard deviation between 0 and 0.28, such that each element is a value between 0 and 1. We generated 20,000 test cases and calculated the standard deviation of each test case (sample). Our rationale for selecting these values can be found in Section 4.4.

The following procedure can be used to acquire a formula that describes the relationship between the test statistic and the standard deviation, for a given uniformity statistic. We evenly divided the range 0 to 0.3 into 60 subranges Subranges = {[0,0.005],[0.005,0.01],...,[0.295,0.3]} (these subranges were chosen because they enable effective regression analysis to be conducted for our dataset). Each test case was assigned to a subrange, such that the standard deviation of the test case was greater than or equal to the lower bound of the subrange and less than the upper bound of the subrange. We calculated the test statistic values (using our implementations of the statistics) of all of the test cases in a given subrange and averaged these values. This average test statistic value was then associated with the midpoint of the subrange, e.g. 0.0025 is the midpoint of the first subrange in Subranges. This was repeated for all of the subranges. Finally, a regression analysis was performed based on these averages and midpoints, resulting in a mathematical formula that gives a relationship between the test statistic and the standard deviation.
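
This procedure can be sketched as follows; the sketch is our own illustration, and the polynomial fit is a stand-in for the functional forms actually used, which are listed in Table 1:

```python
import numpy as np

def build_regression_model(standard_deviations, statistic_values,
                           lower=0.0, upper=0.3, n_bins=60, degree=3):
    """Bin test cases by standard deviation, average the statistic values in
    each bin, and fit a regression through the (midpoint, average) points."""
    sds = np.asarray(standard_deviations)
    values = np.asarray(statistic_values)
    edges = np.linspace(lower, upper, n_bins + 1)
    midpoints, averages = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (sds >= lo) & (sds < hi)
        if in_bin.any():
            midpoints.append((lo + hi) / 2)
            averages.append(values[in_bin].mean())
    coefficients = np.polyfit(midpoints, averages, degree)
    return np.poly1d(coefficients)  # callable: predicted statistic value for a given SD
```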

We adopted this procedure for each test statistic (see Table 1), with the exception of \(E_{m,n}\), for which the subrange 0–0.005 was omitted, since the behaviour in this range is uncharacteristic; excluding this subrange improved the accuracy of the regression analysis.

Table 1 Mathematical models

These formulas served as predictive models (partial oracles) that can be used to check the output of the implementation of a test statistic.

3.3 Using the regression model–based metamorphic relation

Algorithm 1 describes the regression model–based metamorphic relation. Lines 1 and 2 of Algorithm 1 create two empty lists, TestStatisticValues and ModelValues. Lines 3–9 then generate NumberOfTestCases test cases and, for each of these test cases, calculate the test statistic value and the model value and store them in the aforementioned lists. Thus, in the program state of the regression model–based metamorphic relation, just before the execution of line 10, TestStatisticValues and ModelValues will contain NumberOfTestCases test statistic values and NumberOfTestCases model values respectively. Line 10 then performs a Wilcoxon signed rank test to compare TestStatisticValues and ModelValues, to determine whether the difference is statistically significant. Finally, lines 11–15 report a failure verdict if the difference is statistically significant, or otherwise report a pass verdict.

Algorithm 1 The regression model–based metamorphic relation
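
A Python sketch of Algorithm 1 is given below; it assumes that the learned model is a callable mapping a standard deviation to a predicted statistic value, and the parameter names and sample-generation details are illustrative rather than the exact ones used in our implementation:

```python
import numpy as np
from scipy.stats import wilcoxon

def regression_model_based_mr(uniformity_statistic, model, number_of_test_cases=5000,
                              sample_size=1000, mean=0.5, sd_upper=0.28, alpha=0.000005):
    """Compare the implementation under test against the predictive model."""
    rng = np.random.default_rng()
    test_statistic_values, model_values = [], []
    for _ in range(number_of_test_cases):
        sd = rng.uniform(0.0, sd_upper)
        # Generate a sample with the chosen standard deviation, restricted to [0, 1].
        candidate = rng.normal(mean, sd, sample_size * 10)
        sample = candidate[(candidate >= 0) & (candidate <= 1)][:sample_size]
        test_statistic_values.append(uniformity_statistic(sample))
        model_values.append(model(np.std(sample)))
    _, p_value = wilcoxon(test_statistic_values, model_values)
    return "fail" if p_value < alpha else "pass"
```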

4 Experimental design

The experiments described in this section were designed to address the research questions in Section 4.1.

4.1 Research questions

RQ1. How accurate are the models that are used by the regression model–based metamorphic relation? We inspected the accuracy of the models to gain insights into the effectiveness of the regression model–based metamorphic relation.

RQ2. How many false alarms does our technique raise? We evaluated the soundness of the partial oracle based on its false positive rate, i.e. the likelihood that it will incorrectly report that a failure is present.

RQ3. How effective is the regression model–based metamorphic relation at detecting faults? We measured the effectiveness of our technique by quantifying the proportion of mutants that it kills.

RQ4. To what extent does the output of the mutant have to deviate from the output of the model in order for our technique to detect the difference? We estimated the sensitivity of the regression model–based metamorphic relation.

RQ5. What impact would changing the alpha value have on the results? We conducted a sensitivity analysis to determine the impact of changing the alpha value.

RQ6. Why didn't the regression model–based metamorphic relation achieve a mutation score of 100%? The partial oracle did not kill all of the mutants; we attempted to identify the main root causes.

4.2 Experimental subjects

Our experimental subjects were implementations of 18 uniformity statistics (see Section 2) that were programmed in Python 3.6 by the authors, based on the technical details given by Marhuenda et al. (2005). Appendix A gives pseudocode for these statistics. These statistics were chosen because they are examples of four different types of statistics (Marhuenda et al. 2005).

The representativeness of the experiment was the key motivating factor for our choice of uniformity statistics. In particular, a sample of 18 statistics was deemed to be large enough to enable conclusions about generalisability to be drawn. Additionally, these statistics are representative of several different types of uniformity statistics, i.e. empirical distribution function, order, spacings and higher order spacing statistics (Marhuenda et al. 2005).

Table 2 shows the number of lines of code in each subject program. The sizes of these subject programs are representative of the typical size of an implementation of a uniformity statistic.

Table 2 Number of lines of code (LOC) in each subject program

4.3 Faults

Mutation testing is a technique that makes small syntactic changes to a system (Offutt 1992) to simulate a fault. For example, suppose that mutation testing was applied to a system \(Sys_o\) and that the technique modified abs(x) + 4 to abs(x) − 4; let \(Sys_n\) denote the modified version of \(Sys_o\). In mutation testing, abs(x) − 4 is referred to as a mutation, \(Sys_n\) is called a mutant, and if a testing technique reports a failure whilst being evaluated on \(Sys_n\), then \(Sys_n\) is said to have been killed; otherwise, it is said to have survived. Mutation testing is a widely used method for procuring large samples of mutants, and some evidence suggests that mutants are representative of real faults (Andrews et al. 2005).

We applied a well-known, automated mutation testing tool called Mutmut (Hovmöller 2017) to all of the subject programs to obtain a sample of mutants. In the context of black box testing, a mutant that has the same input–output mapping as the original program for a given test suite is equivalent to the original program with respect to that test suite. Such mutants could confound the results because it is not possible for an oracle to detect a difference between the original and mutant programs; thus, these were removed from the sample. As will be discussed in Section 4.4, we used a total of 5000 test cases. These 5000 test cases were used to conduct strong mutation testing on each mutant. Mutants were classified as equivalent if they were not killed by strong mutation testing. We also removed mutants that crashed or timed out (after 120 seconds), because testing techniques are not required to detect such mutants, and so these mutants could confound the results. After removing the equivalent, crashed and timed-out mutants, a total of 291 mutants remained. We used all of these mutants in our experiments.

Table 3 (a, b and c) shows how the mutants are distributed across the subject programs.

Table 3 Number of mutants for each subject program

4.4 Test suites

We initialised the regression model–based metamorphic relation with the following input values: SizeLowerbound = 1000, SizeUpperbound = 1000, Distribution = Normal, MeanLowerbound = 0.5, MeanUpperbound = 0.5, StandardDeviationLowerbound = 0, StandardDeviationUpperbound = 0.28 and NumberOfTestCases = 5000 (see Algorithm 1).

Since the regression model–based metamorphic relation for uniformity statistics exploits the relationship between the standard deviation and test statistic value, it is natural to restrict oneself to samples that only vary in terms of the standard deviation. This was our rationale for initialising the regression model–based metamorphic relation, such that the sample size, distribution and mean are the same across all samples, but the standard deviation varies.

Our reasons for choosing the specific values for the sample size, distribution, mean and standard deviation are as follows. A sample size of 1000 was chosen because large samples are more likely to achieve the target standard deviation. Many statistical tests assume a normal distribution; thus, generating samples that conform to a normal distribution could enable us to experiment with alternative statistical tests. Since every element in a given sample is a value between 0 and 1, it was natural to fix the mean at 0.5, the midpoint of an element's value range. Recall that the regression model–based metamorphic relation samples values in the range [0,1]; this restriction makes the random generation of samples with higher standard deviation values increasingly unlikely. Our choice of the standard deviation's upper bound was informed by this observation. As an aside, we should note that the uniformity statistics that are covered in this paper assume that the sample is drawn from [0,1], which is why our sample elements are always restricted to values in this range.

We decided to use a test suite that consisted of 5000 test cases, because this test suite size was deemed to be large enough to enable us to draw meaningful conclusions. It should be noted that the same 5000 test cases were used for each subject program, and that these test cases are different from the ones that were used in Section 3.2.

4.5 Measures

One of the main measures that was used throughout this paper is the mutation score (MS) (see, for example, Wu et al. 2005). The mutation score can be calculated as follows (equivalent mutants were not counted):

$$ \frac{\text{NumberOfKilledMutants}}{\text{TotalNumberOfMutants}} $$

We also used the false positive rate (FPR). The FPR can be determined by executing the test suite on a correct version of the SUT, recording the number of test cases that reported failures, NumberOfFalsePositiveTestCases, and finally calculating:

$$ \frac{\text{NumberOfFalsePositiveTestCases}}{\text{TestSuiteSize}} $$

4.6 Alpha

Recall that the regression model–based metamorphic relation performs a significance test (see Algorithm 1). We used an alpha value of 0.000005 for this significance test. Our motivation for choosing such a low significance threshold was to reduce the sensitivity of the test, to account for potential inaccuracies in the models.

5 Results and discussion

In this section, we address the research questions that were outlined in Section 4.1.

5.1 RQ1. How accurate are the models that are used by the regression model–based metamorphic relation?

Each panel in Fig. 1a–r is a scatterplot for one subject program. Each interval of the X-axis pertains to a test case, and intervals of the Y-axis correspond to uniformity statistic values. Black dots represent the correct test statistic value, and grey dots represent the model's prediction.

Fig. 1 A series of comparisons between each statistic and its respective model

Figure 1a–r reveals that, for many subject programs, the test statistic and model are almost indistinguishable, e.g. \({W_{n}^{2}}\) (Fig. 1d). These models are therefore highly accurate. Recall that the models were created by analysing the relationship between the test statistic value and one factor, i.e. the standard deviation. The existence of cases where the statistic and model are less similar, e.g. \(S_{n}^{(m)}\) (Fig. 1e), suggests that other factors might also be important. Thus, there may be scope to improve the accuracy of the models by considering such factors. Interestingly, however, despite the dissimilarities between the statistics and models in these cases, one can observe that the grey dots largely overlap with the cluster of black dots. This indicates that the model can provide reasonably accurate predictions even for these statistics.

We conducted one Mann-Whitney U test per scatterplot in Fig. 1a–r to compare the actual values against the predicted values. We also applied Benjamini-Hochberg correction to all of the p values that were returned by these tests. Promisingly, with the exception of Mann-Whitney U tests for four uniformity statistics (\(S_{n}^{(m)}\), G(n), Q and A(n)), none of the Mann-Whitney U tests reported a significant difference. This supports our observation that most models are relatively accurate, and that some models are more accurate than others.

The inaccuracy of some models may make partial oracles that are based on these models more susceptible to false positives. We investigate this possibility in the next section.

5.2 RQ2. How many false alarms does our technique raise?

Only four of the subject programs reported false positives: \(T_{2}^{\prime }\), \(E_{m,n}\), Q and A(n). An investigation revealed that the cause of these false positives was an inappropriate choice of alpha values. We found that changing the significance thresholds to 3.8921426976752E-39, 1.09300719274052E-12, 2.08955179387564E-07 and 4.13728711820944E-09, for \(T_{2}^{\prime }\), \(E_{m,n}\), Q and A(n) respectively, eliminates all false positives, makes no difference to any mutant classifications for \(T_{2}^{\prime }\), Q and A(n), and only affects 2/17 of the mutant classifications for \(E_{m,n}\) (these mutant classifications will be discussed in the next section). These results demonstrate that an appropriate choice of alpha value is important.

Given that our approach incorporates stochastic sample generation, it is possible for it to produce and process extreme/uncharacteristic samples. Such samples could potentially cause false positives. The above results demonstrate that, given an appropriate choice of alpha value, the technique is very robust despite this. This suggests that the probability of such samples being generated is low and/or that such samples are unlikely to have significant influence. This can be attributed to the large number of samples that were used and the large sizes of these samples.

5.3 RQ3. How effective is the regression model–based metamorphic relation at detecting faults?

Figure 2 is a bar chart in which each interval of the Y-axis pertains to one subject program. The X-axis communicates the number of mutants that were used to exercise the regression model–based metamorphic relation. Each bar was partitioned into the number of mutants that were killed (stripe fill) and the number of mutants that survived (solid fill).

Fig. 2 Total number of mutants killed by the regression model–based metamorphic relation

Table 4 provides another lens on the data in Fig. 2. Figure 2 and Table 4 demonstrate that the regression model–based metamorphic relation was very effective for all of the subject programs, achieving an average mutation score of 92.96%. Even though there is some variation in the technique’s effectiveness across the subject programs, the technique is relatively consistent. These results are promising and suggest that the regression model–based metamorphic relation is an effective and reliable approach to solving the oracle problem in implementations of uniformity statistics.

Table 4 Mutation score of each subject program

5.4 RQ4. To what extent does the output of the mutant have to deviate from the output of the model in order for our technique to detect the difference?

The following procedure can be applied to a mutant, Mut. Let TS be a set of pairs \(\langle Output_i, Prediction_i\rangle\), where \(Output_i\) is the output of Mut for a given test case \(tc_i\) and \(Prediction_i\) is the model's corresponding prediction. For each pair \(\langle Output_i, Prediction_i\rangle\), the absolute difference between \(Output_i\) and \(Prediction_i\) can be computed. Let AbsDiffs denote the set of all of these absolute differences. The following tuple can be obtained by calculating descriptive statistics of AbsDiffs: (Mean, StandardDeviation, Skewness, Kurtosis, Min, Max).
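
This tuple computation can be sketched as follows (an illustration of ours, using SciPy's sample skewness and kurtosis; the exact estimators we used may differ):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def deviation_profile(outputs, predictions):
    """Return (Mean, StandardDeviation, Skewness, Kurtosis, Min, Max) of the
    absolute differences between a program version's outputs and the model's
    predictions."""
    abs_diffs = np.abs(np.asarray(outputs) - np.asarray(predictions))
    return (abs_diffs.mean(), abs_diffs.std(), skew(abs_diffs),
            kurtosis(abs_diffs), abs_diffs.min(), abs_diffs.max())
```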

The aforementioned procedure can be applied to all of the killed mutants of a program, Prog, to obtain a set of tuples, ProgKilledTuples, and applied to the original program to obtain one more tuple, ProgOriginalTuple. The absolute difference between each tuple in ProgKilledTuples and ProgOriginalTuple can be computed. Let \(Mean_k\), \(StandardDeviation_k\), \(Skewness_k\), \(Kurtosis_k\), \(Min_k\) and \(Max_k\) be the smallest absolute differences that were observed during this iterative process. Similarly, let ProgSurvivedTuples be the result of applying the aforementioned procedure to all of the survived mutants of Prog. The absolute difference between each tuple in ProgSurvivedTuples and ProgOriginalTuple can be computed. Let \(Mean_s\), \(StandardDeviation_s\), \(Skewness_s\), \(Kurtosis_s\), \(Min_s\) and \(Max_s\) be the largest absolute differences that were observed during this iterative process. Finally, the following boolean values can be calculated for Prog: \(Mean_k \leq Mean_s\), \(StandardDeviation_k \leq StandardDeviation_s\), \(Skewness_k \leq Skewness_s\), \(Kurtosis_k \leq Kurtosis_s\), \(Min_k \leq Min_s\) and \(Max_k \leq Max_s\). A boolean value of "True" indicates that there exists at least one killed mutant that is at least as close (if not closer) to the original program than at least one survived mutant, based on the descriptive statistic that the boolean value corresponds to. Alternatively, a boolean value of "False" means that all of the survived mutants were closer to the original program than all of the killed mutants, based on the descriptive statistic that the boolean value corresponds to. Table 5 presents these boolean values for each subject program that did not have a mutation score of 100% and did not report a false positive.

Table 5 A comparison between killed and survived mutants

Unsurprisingly, Table 5 shows that for most subject programs, all survived mutants were closer to the original program than the killed mutants were, based on at least one descriptive statistic. An interesting observation is that there exist two subject programs (\(K_{n}\) and \({U_{n}^{2}}\)) for which the following holds for each descriptive statistic: there exists at least one killed mutant that is at least as close (if not closer) to the original program than at least one survived mutant. Closer inspection of the raw data revealed that for both of these subject programs, there existed at least one killed mutant that was closer to the original program, in terms of all descriptive statistics, than at least one survived mutant.

Non-parametric statistics have been designed to be applicable in a wider range of scenarios than their parametric counterparts. However, the cost of their enhanced generalisability is a reduction in statistical power, and by implication non-parametric statistics are more likely to report that there is no effect, when there actually is. Since the Wilcoxon signed rank test is a non-parametric statistic, this might explain why some killed mutants were closer to the original program than some survived mutants.

5.5 RQ5. What impact would changing the alpha value have on the results?

Recall from Section 5.2 that one's choice of alpha value is an important determinant of the partial oracle's effectiveness. In this section, we explore this observation further.

Each scatterplot in Fig. 3 plots the log of the p value (used for the pass/fail verdict) that was reported for each mutant (black markers) and the log of the p value (used for the pass/fail verdict) that was obtained for the original program (grey markers). The original program and mutants along the X-axis are arranged in ascending order of p value. Thus, black markers to the left of a grey marker represent mutants that obtained a lower p value than the original program, and black markers to the right of a grey marker pertain to mutants that have a higher p value than the original program. A logarithmic scale was used in these scatterplots for presentation purposes; in these scatterplots, it is assumed that \(\log(0) = 0\).

Fig. 3 A series of scatterplots that plot mutants against p values

Figure 3 can be used to conduct a sensitivity analysis as follows. One can superimpose a horizontal line at any point on the graph to represent the alpha value. With the exception of log(0), black markers that appear below the line represent killed mutants, and black markers that appear above the line represent survived mutants. Similarly, if the grey marker appears below the line, then there is a false positive, and if it appears above the line, then there is not a false positive.

A number of observations can be made from Fig. 3. For example, in every scatterplot, almost all, if not all mutants have a lower p value than the original program. This is promising because it demonstrates that the partial oracle can distinguish between mutants and the original program fairly accurately.

It can also be observed that the p value of the original program varies dramatically across subject programs. This means that choosing the optimal alpha value for a given subject program will be difficult. However, the results that were reported in Section 5.2 demonstrated that the partial oracle can be effective in most cases with an alpha value of 0.000005; thus, finding the optimal alpha value is not necessary if one opts to use this alpha value. It is also worth noting that in most cases, most mutants obtained a p value of 0, which means that changing the alpha value to be arbitrarily close to 0 would have relatively little impact in most cases. Thus, using such an alpha value might be a viable strategy.

5.6 RQ6. Why didn’t the regression model-based metamorphic relation achieve a mutation score of 100%?

The regression model–based metamorphic relation was not able to kill all of the mutants. This section enumerates the key factors that were responsible for this outcome. Firstly, Sections 5.2 and 5.5 demonstrated that our choice of alpha value had an impact on the technique’s ability (or lack thereof) to kill a mutant.

We conducted a similar analysis to the one that produced Table 5. This analysis differed in terms of its definitions of \(Mean_k\), \(StandardDeviation_k\), \(Skewness_k\), \(Kurtosis_k\), \(Min_k\) and \(Max_k\). In particular, instead of being defined as the smallest absolute differences between ProgKilledTuples and ProgOriginalTuple, they are defined as the average absolute differences between these tuples. Similarly, \(Mean_s\), \(StandardDeviation_s\), \(Skewness_s\), \(Kurtosis_s\), \(Min_s\) and \(Max_s\) are redefined to be the average absolute differences between ProgSurvivedTuples and ProgOriginalTuple, instead of the largest absolute differences. Every cell in the newly produced table was "False", which suggests that survived mutants are generally more similar to the original program than killed mutants. This indicates that the subtlety of mutants is another important factor.

Recall that some killed mutants were closer to the original program than some survived mutants (see Section 5.4). Also recall that one explanation for this is that our implementation of the technique incorporated the Wilcoxon signed rank test, which is a non-parametric statistic and so is particularly susceptible to type 2 errors. This is another factor that could affect our technique's ability to kill mutants. The use of more powerful statistical tests might alleviate this factor; we intend to explore the impact of using alternative statistics in future work.

5.7 Discussion

N-version testing is a testing technique in which the output produced by the program under test in response to a test case, tc, is compared with the output of a reference implementation for tc. N-version testing is a potential alternative to the regression model–based metamorphic relation; however, it has a number of limitations. For example, to the best of our knowledge, there are no readily available implementations for a large number of the uniformity statistics that are covered in this paper. This necessitates the development of reference implementations and thus could introduce correlated failures, i.e. the program under test and the reference implementation might contain the same faults and thus produce the same failures.

Since uniformity statistics operate on floating point numbers, it is necessary to define a similarity threshold. The relationship between standard deviation and uniformity is not linear in most cases. This means that an appropriate similarity threshold for one area of the output domain might not be appropriate for other regions of the output domain. This reduces the viability of applying N-version testing to uniformity statistics.

Since it can be difficult to obtain reference implementations, the option to use N-version testing might not be available. In such cases, it is possible for a domain expert to construct a training dataset for the regression model–based metamorphic relation or obtain this dataset through simulations. This means that the option to use regression model–based metamorphic relation can still be available in situations in which the option to use N-version testing is not.

6 Threats to validity

This section outlines the main threats to validity.

6.1 Internal validity

A number of tools were used to create our partial oracles, to carry out the experiments and to analyse the experimental results, e.g. SPSS (IBM 2015), R (Community 2017), Microsoft Excel and Mutmut (Hovmöller 2017). These tools might contain errors that could affect our results. However, all of these tools are widely used and highly reputable, and so it is unlikely that such faults exist. Additionally, such errors might also exist in our own code, e.g. subject programs, partial oracles, test case generators and analysis scripts. To reduce the likelihood of this, we thoroughly tested all of our code. Where possible, we also used automated tools to reduce the potential for bias to be introduced by the person performing the experiment.

6.2 External validity

There are three main threats to external validity. Firstly, even though steps were taken to ensure that a diverse range of uniformity statistics were included in our experiments (empirical distribution function, order, spacings and higher order spacing statistics (Marhuenda et al. 2005)), we did not include discrete uniformity statistics (Steele et al. 2005). Our results might have been different for such statistics; thus, this may threaten the generalisability of our results. We will seek to address this in future work.

Secondly, metamorphic testing is often conducted with several metamorphic relations, but in this paper, we only leveraged one metamorphic relation. Our reasoning is twofold. Firstly, we were unable to identify any other general metamorphic relations. Secondly, even though it might have been possible to construct metamorphic relations that would be useful for individual subject programs, such metamorphic relations would be less generally useful.

In our experiments, we only leveraged test cases that were generated from a normal distribution. Our results might have been different had we drawn test cases from a different distribution, e.g. a bimodal distribution. We intend to investigate the impact of changing the distribution on our technique in future work.

Finally, all of the subject programs that were used in our experiments were developed by the authors. This is a threat to validity, because our implementations might not be representative of third-party implementations. We attempted to alleviate this threat by including third-party implementations in our experiments, but we were unable to find any suitable implementations. We therefore decided to compare the functionality of some of our implementations with corresponding R code. In order to do so, we randomly generated 100 test cases and evaluated stat0064, stat0067, stat0068, stat0069, stat0070, stat0071 and stat0077 from R's PoweR package (see the documentation of the PoweR package for more details), as well as our implementations of these statistics. We then compared the outputs that were produced by our implementations and R's implementations, using the Mann-Whitney U test. None of the tests produced a significant result at an alpha value of 0.05. This suggests that our implementations are representative of third-party implementations.

6.3 Construct validity

The two main metrics used in this paper are the mutation score and the false positive rate. These metrics are widely used by the research community. The first of these metrics is used so frequently that some consider it to be a de facto standard (Segura et al. 2016).

6.4 Statistical validity

We used a number of statistical tests, including Fisher’s exact test and the Wilcoxon signed rank test (Pallant 2007) in this paper. These statistical tests are non-parametric and were used partly because their assumptions had been fulfilled by the data, and also because their parametric counterparts were not applicable. Given that these statistics were applied a large number of times in certain analyses, each of these analyses had a high probability of falsely reporting at least one significant difference. To that end, we applied Benjamini-Hochberg correction to these analyses to account for this.

7 Conclusions

Implementations of uniformity statistics suffer from the oracle problem because it is infeasible to predict the test statistic value of a given sample. In this paper, we attempted to address this by formulating a partial oracle called the regression model–based metamorphic relation.

We also investigated the effectiveness of the regression model–based metamorphic relation. We discovered that, when appropriate alpha values are used, the technique can achieve mutation scores ranging from 77.78 to 100% (tending towards higher mutation scores) and produce no false alarms. Our experiments also revealed that the models that the regression model–based metamorphic relation is based on are highly accurate. These results are promising and suggest that the regression model–based metamorphic relation is a viable means of ameliorating the oracle problem in implementations of uniformity statistics.

The predictive models used by the regression model–based metamorphic relation were primarily created based on the relationship between the test statistic value and standard deviation. There might be scope to incorporate other features of the sample, instead of just the standard deviation, to improve their accuracy further; improvements in accuracy would allow us to tune the partial oracles to be more sensitive. Thus, one avenue of future work might include investigating the viability of these methods.

The regression model–based metamorphic relation is dependent on predictive models that are learned from a dataset. In practice, one could obtain such a dataset from the SUT, but models that are derived from such a dataset would only be suitable for regression testing. If the tester wishes to apply these partial oracles in a more general context, they would either have to acquire such a dataset from elsewhere, e.g. from a reference implementation, or request a domain expert to create the model based on their knowledge of the system. Unfortunately, a scenario may exist in which neither a reference implementation nor a domain expert is available. In such situations, these partial oracles are rendered inapplicable. This is an important limitation of the approach, and one that we wish to address in future work.

We believe that implementations of other classes of statistics, e.g. correlation statistics, might have similar oracle problems. The partial oracles that were presented in this paper are only suitable for uniformity statistics. Thus, future work that explores methods of alleviating the oracle problem for implementations of other classes of statistics may be invaluable.