A mapping study on testing non-testable systems

The terms “Oracle Problem” and “Non-testable system” interchangeably refer to programs in which the application of test oracles is infeasible. Test oracles are an integral part of conventional testing techniques; thus, such techniques are inoperable in these programs. The prevalence of the oracle problem has inspired the research community to develop several automated testing techniques that can detect functional software faults in such programs. These techniques include N-Version testing, Metamorphic Testing, Assertions, Machine Learning Oracles, and Statistical Hypothesis Testing. This paper presents a Mapping Study that covers these techniques. The Mapping Study presents a series of discussions about each technique, from different perspectives, e.g. effectiveness, efficiency, and usability. It also presents a comparative analysis of these techniques in terms of these perspectives. Finally, potential research opportunities within the non-testable systems problem domain are highlighted within the Mapping Study. We believe that the aforementioned discussions and comparative analysis will be invaluable for new researchers that are attempting to familiarise themselves with the field, and be a useful resource for practitioners that are in the process of selecting an appropriate technique for their context, or deciding how to apply their selected technique. We also believe that our own insights, which are embedded throughout these discussions and the comparative analysis, will be useful for researchers that are already accustomed to the field. It is our hope that the potential research opportunities that have been highlighted by the Mapping Study will steer the direction of future research endeavours.


Introduction
In software testing, a test input is generated for the System Under Test (SUT), an expected test outcome is determined for this test input, and the SUT is executed with this test input to obtain an output. This output is finally compared to the expected test outcome. If this comparison reveals any discrepancies, then the SUT is deemed to be faulty. The mechanism that is responsible for predicting the expected test outcome and performing this comparison task is called an oracle. Software testing is based on the assumption that an oracle is always available (Davis and Weyuker 1981). The terms "Non-testable systems" (Davis and Weyuker 1981) and "oracle problem" (Liu et al. 2014a) are interchangeably used to describe situations in which it is infeasible to apply oracles. Conventional testing techniques are ineffective under such circumstances.
Several automated testing techniques that can detect functional software faults in nontestable systems have been proposed. A sizeable amount of research has been conducted on these techniques in the context of the oracle problem. Our Mapping Study endeavours to collect, collate, and synthesise this research to satisfy three major objectives. The first objective is to present a series of discussions about each of these techniques, from different perspectives, e.g. effectiveness, usability, and efficiency. The second objective is to perform a series of comparisons between these techniques, based on the above perspectives. The final objective is to identify research opportunities. This paper begins by outlining the Review Protocol (see Section 2). A series of discussions revolving around each testing technique are presented across Sections 3 to 7. These techniques are compared in Section 8. Potential future research directions are also presented across these sections. Additionally, Section 9 discusses related work. Threats to validity are discussed in Section 10, and our conclusions are outlined in Section 11.

Review protocol
To conduct our Mapping Study, we first defined a Review Protocol, based on the guidelines of Kitchenham (2007), Popay et al. (2006), Higgins and Green (2011), and Shepperd (2013). This section presents our Review Protocol. In particular, Sections 2.1, 2.2, 2.3 and 2.4 outline the scope, search procedure, data extraction approach, and quality appraisal methodology, respectively. Finally, a brief overview of the synthesis, which forms the majority of the remainder of this paper, is presented in Section 2.5.

Scope
The scope of this Mapping Study was originally automated testing and debugging techniques that have been designed to detect functional software faults in non-testable systems. Our Review Protocol (presented in Sections 2.2 to 2.4) is based on this scope. We decided to narrow the scope of our synthesis (i.e. Sections 2.5 to 8) to enhance the focus of the Mapping Study. In particular, our revised scope does not consider debugging techniques. We realised that Specification-based Testing and Model-based Testing depend on the availability of a specification or model, and that the oracle problem implies that these are not available. To that end, we also decided to omit these techniques from the scope of our synthesis.

Search
Papers that have a broad focus, e.g. frameworks or systematic reviews, must contribute a relatively substantial amount of relevant content. For example, a paper is not deemed to be relevant if all of its relevant material comprises only a short aside.
Duplicates must be excluded. We consider rewrites and preliminary/older versions of the same papers to be duplicates. We also consider journal papers that extend conference papers to be duplicates, as well as published chapters of theses. Preference is given to published over non-published papers, the most up-to-date version, and the paper from the most reputable source. If both papers are published in reputable journals, the most detailed one is taken, and in the case that they are precise duplicates of each other, an arbitrary choice is made.
The paper must be written in English.
The paper must be accessible.
The paper must have been published before mid 2014.
We conducted a search in mid 2014 to find relevant papers (that were available before and during mid 2014). We achieved this by applying several search methods in parallel and iteratively, and checking the relevance of each search result returned by these search methods. Relevance checking followed an iterative process, where successive iterations checked the paper in escalating levels of detail (Shepperd 2013) against the Inclusion and Exclusion Criteria; if enough evidence could be accrued during an early iteration to classify the paper, then the process terminated prematurely. The iterations were as follows: {title}; {abstract, introduction, conclusion}; {the entire paper}. The remainder of this section outlines these search methods.
One of our methods included using the search strings listed in Table 2 to query six research repositories: Brunel University Summon, ScienceDirect (using the Computer Science Discipline filter), ACM (queried using Google's "site:" function), IEEE, Google (twice: once with the filter on, and once with it off), and CiteSeerX (with each search term prefixed with "text:"). Let Results(RR, SS) denote the papers that were returned by a research repository, RR, in response to a search string, SS. It would have been infeasible to manually check the relevance of all of the papers in Results(RR, SS). Thus, we used the following terminating condition: the first occurrence of 50 consecutive irrelevant results. We retained papers that were found to be relevant before the terminating condition was satisfied.
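The terminating condition described above can be sketched as a simple scan over the ranked results. The names `collect_relevant` and `is_relevant` are hypothetical, and the sketch assumes results arrive in the order the repository returned them:

```python
def collect_relevant(results, is_relevant, limit=50):
    """Scan ranked search results, keeping relevant papers, and stop
    at the first run of `limit` consecutive irrelevant results."""
    relevant, consecutive_irrelevant = [], 0
    for paper in results:
        if is_relevant(paper):
            relevant.append(paper)
            consecutive_irrelevant = 0  # a hit resets the run counter
        else:
            consecutive_irrelevant += 1
            if consecutive_irrelevant >= limit:
                break  # terminating condition satisfied
    return relevant
```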
During the search process, we became aware of several techniques that had been used to solve the oracle problem. We postulated that other studies on these techniques in the context of the oracle problem may also have been conducted. Thus, a specialised search string for each technique was prepared; these search strings supplemented those in Table 2.
Every paper in the reference list of each relevant paper was also checked for relevance. Again, we retained papers that were found to be relevant.
We compiled a list of all of the authors that had contributed to the papers that were deemed to be relevant. Each of these authors had at least one list of their publications publicly available. Examples of such lists include: the author's personal web page, CV, DBLP, Google Scholar, and the repository in which the author's study was originally discovered. We selected one list per author, based on availability and completeness, and checked all of the publications on this list for relevance.

Table 2 Search strings:
- ((Stochastic OR (Non-deterministic OR nondeterministic OR "non deterministic" OR non-determinism OR nondeterminism OR "non determinism")) AND (System OR Software OR Program OR Application OR Algorithm) AND Testing
- ((Stochastic OR (Non-deterministic OR nondeterministic OR "non deterministic" OR non-determinism OR nondeterminism OR "non determinism")) AND (System OR Software OR Program OR Application OR Algorithm) AND (("Check" OR "Checking") OR ("Verification" OR "Verify"))
- ((Stochastic OR (Non-deterministic OR nondeterministic OR "non deterministic" OR non-determinism OR nondeterminism OR "non determinism")) AND (System OR Software OR Program OR Application OR Algorithm) AND ("Fault Localisation" OR "Fault Localization")
- ("Random output" OR "Randomised output" OR "Randomized output" OR "Randomized algorithm" OR "Randomised algorithm") AND (Systems OR Software OR Programs OR Applications OR Algorithms) AND Testing
- ("Probabilistic System" OR "Probabilistic Program" OR "Probabilistic algorithm") AND (("Check" OR "Checking") OR ("Verification" OR "Verify"))
- (("NonTestable" OR "Non-Testable" OR "Non Testable") OR ("Oracle Problem" OR "no oracle") OR ("Pseudo-oracle" OR "Pseudo oracle")) AND (Testing OR (("Check" OR "Checking") OR ("Verification" OR "Verify")) OR ("fault localisation" OR "Fault localization") OR ("Debugging" OR "Debug") OR ("Fault detection" OR "Failure detection" OR "Mutant detection" OR "Defect detection" OR "Detecting Faults" OR "Detecting Failures" OR "Detecting Mutants" OR "Detecting Defects") OR ("Validating" OR "Validate"))
Finally, all of the authors were emailed a list of their papers that had been discovered by the search, accompanied with a request for confirmation regarding the completeness of the list. This enabled the procurement of cutting edge works in progress, and also reduced publication bias (Kitchenham 2007).
The various search methods described above led to the discovery of several papers that we did not have access to. We were able to obtain some of these papers by contacting the authors. The rest of these papers were omitted from the Mapping Study. The search methods also returned what we believed were duplicate research papers. The authors of these papers were asked to confirm our suspicions, and duplicates were removed.
In total, our search methods collectively procured 141 papers.

Data extraction
We used the data extraction form in Table 3 to capture data, from relevant papers, that was necessary to appraise study quality and address the research aims. Unfortunately, many papers did not contain all of the required data; thus requests were sent to authors to obtain missing data. Where this approach was unsuccessful, assumptions were made based on the paper and the author's other work. For example, if the authors had not reported the number of mutants used, but tended to use 1000+ in other papers, one can assume that a significant number of mutants were used in the study.

The data extraction form included the following questions:
- What evidence is there to suggest that the parameters of the experiment were representative, and were they adequately described?
- What evidence is there to suggest that the experimental set-up, conduct, and experiment output analysis were appropriate, robust, and unbiased?
- Have adverse effects been reported, and if so, how were they mitigated?
- Are the arguments compelling, critical, and supported by internal and external evidence?

The form also recorded an executive summary and the noteworthy points made about each technique (Technique 1 to Technique n).

Quality
Our quality instrument is presented in Table 4. Each relevant paper was checked against this quality instrument. Papers that were found to have severe methodological flaws, and to have taken minimal steps to mitigate bias, were deemed to be of low quality. Relatively little research has been conducted on the oracle problem; thus, many relevant studies are exploratory. Certain study design choices may have been unavoidable in such studies, and may cause a quality instrument to label these studies as low quality. This means that these valuable studies could be rejected, despite the fact that they may have been of the highest attainable quality at the time. To account for this, papers that were deemed to be of low quality were only discarded if they did not make a novel contribution. This led to the elimination of 4 papers. Thus, a total of 137 papers were deemed to be suitable for our synthesis.

Synthesis overview
Synthesis involves analysing and explaining the data that was obtained by the data extraction form to address the research aims. Narrative Synthesis was used because it is ideal for theory building (Shepperd 2013) and explanations. The synthesis was conducted according to the guidelines of Popay et al. (2006), Cruzes and Dyba (2011), Da Silva et al. (2013), and Barnett-Page and Thomas (2009), and is presented across Sections 3 to 8. The Mapping Study process revealed that five umbrella testing techniques have been developed to alleviate the oracle problem: N-version Testing, Metamorphic Testing, Assertions, Machine Learning, and Statistical Hypothesis Testing. Thus, our synthesis focuses on these techniques. The research community has conducted a different amount of research on each technique, in the context of the oracle problem. For example, Metamorphic Testing has received more attention than any other testing technique. Naturally, the amount of attention that is afforded to each technique, by our synthesis, was determined by the amount of research that has been conducted on that technique.
The disproportionate attention that has been given to Metamorphic Testing suggests that this technique may have numerous interesting research avenues. Although less attention has been afforded to the other techniques, they are known to be effective in some situations in which Metamorphic Testing is not (see Section 8). Thus, the amount of space devoted to each technique does not reflect how promising it is. However, it does mean that it is unlikely that all of the useful research avenues that are associated with these techniques have been explored.
Sections 3-7 present a series of discussions about each technique, and Section 8 compares these techniques. The discussions pertaining to each technique are organised into a set of high level issues, e.g. effectiveness, efficiency, and usability. Some terms that are used to describe certain issues by one research community may be used to describe different issues by other research communities. We would therefore like to clarify how such terms are used in this paper; in particular efficiency and cost. Efficiency is used to describe the amount of computational resources that are consumed or the amount of time required to perform a task, whilst cost is used in reference to monetary costs. Although effort/manual labour can be discussed in the context of cost, effort/manual labour is presented as a usability issue in this paper.

N-version testing
Let S be the SUT. Another system, S R , is said to be a reference implementation (RI) of S, if it implements some of the same functionality as S. In N-version Testing, S and S R are executed with the same test case, such that this test case executes the common functionality in these systems. The outputs of S and S R that result from these executions are compared. N-version Testing reports a failure, if these outputs differ (Weyuker 1982). If one has access to multiple RIs, then this process can be repeated once for each RI. N-version Testing was originally developed to alleviate the oracle problem. One form of oracle problem includes situations where the test outcome is unpredictable. Such an oracle problem can arise if the SUT has been designed to discover new knowledge (Weyuker 1982), e.g. machine learning algorithms. Since an RI mimics the SUT to generate its own output, N-Version Testing does not require the tester to have prior knowledge about the test outcome. This makes it viable for such oracle problems.
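The core comparison loop of N-version Testing can be sketched as follows. The name `n_version_test` is hypothetical, and the sketch assumes the SUT and RIs are callable on the same inputs and produce directly comparable outputs:

```python
def n_version_test(sut, reference_impls, test_inputs):
    """Run each test input through the SUT and every reference
    implementation (RI); report a possible failure whenever any
    pair of outputs disagrees (Weyuker 1982)."""
    failures = []
    for x in test_inputs:
        outputs = [sut(x)] + [ri(x) for ri in reference_impls]
        # A discrepancy between any output and the SUT's output
        # flags this input for investigation.
        if any(o != outputs[0] for o in outputs[1:]):
            failures.append(x)
    return failures
```

Note that a reported discrepancy only indicates that *some* system is faulty; identifying which one is a separate debugging problem (see Section 3.3).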

Effectiveness of N-version testing
N-version Testing is fundamentally a black-box testing technique. It is therefore not surprising that some have found that N-version Testing cannot test the flow of events (Nardi and Delamaro 2011), and cannot detect certain fault types, e.g. coincidental correctness 1 (Brilliant et al. 1990), since white-box oracle information is necessary to achieve these feats. Let S be a system. In the future, S may be modified due to software maintenance; S' denotes the modified version of S. Faults may be introduced into S' during maintenance. Spectrum-based Fault Localisation is a debugging technique that represents the system's execution trace as program spectra. Tiwari et al. (2011) suggested using Spectrum-based Fault Localisation to obtain the program spectra of S and S', and comparing these program spectra. Disparities between these program spectra can be an indication of a fault in S'. In their approach, S' is essentially the SUT, and S acts as a reference implementation. Thus, their approach can be perceived to be a modified version of N-version Testing, in which program spectra are compared instead of outputs. Since program spectra can represent event flows, this modified version of N-version Testing may be able to test the flow of events. However, there is little evidence to suggest that this approach can generalise outside of regression testing. Thus, we believe that feasibility studies that explore the use of this approach in other contexts would be valuable.
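A minimal sketch of this style of spectrum comparison, under the simplifying assumption that a program spectrum is just the set of statements exercised by a run (the helper names are hypothetical):

```python
def spectrum(trace):
    """Toy program spectrum: the set of statements exercised by a run.
    Real spectra can be richer, e.g. hit counts or branch profiles."""
    return set(trace)

def compare_spectra(trace_old, trace_new):
    """Disparities between the spectra of S and its modified version S'
    can indicate a fault introduced during maintenance; the symmetric
    difference captures statements exercised by only one version."""
    return spectrum(trace_old) ^ spectrum(trace_new)
```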
Let S denote the SUT and S_R be a reference implementation of S, such that S and S_R have faults that result in the same failed outputs, o_S and o_SR, respectively. Since N-version Testing detects a fault by checking whether o_S and o_SR differ (Manolache and Kourie 2001), it cannot detect these faults. This is referred to as a correlated failure. Numerous guidelines for reducing the likelihood of correlated failures are available. The remainder of this section explores these guidelines.
A fault is more likely to be replicated in both S and S R if the same team develop both, because they might be prone to making the same types of mistakes (Murphy et al. 2009a). Thus, one guideline includes using independent teams for each system (Manolache and Kourie 2001), e.g. using 3rd party software as S R . However, this does not eliminate the problem completely because independent development teams can also make the same mistakes (Murphy et al. 2009a). This could be because certain systems are susceptible to particular fault types. Thus, another guideline involves diversifying the systems to reduce the overlap of possible fault types across systems (Manolache and Kourie 2001). This can be achieved by using different platforms, algorithms (Lozano 2010), design methods, software architectures, and programming languages (Manolache and Kourie 2001) for each system. For example, pointer related errors are less likely to lead to correlated failures if S and S R are encoded in C++ and Java, respectively.
The third guideline we consider revolves around manipulating the test suite. Some test inputs lead to correlated failures, and others do not (Brilliant et al. 1990). Thus, the chance of detecting a fault depends on the ratio of inputs that lead to a correlated failure (CF) to inputs that do not (we refer to non-correlated failures as standard failures (SF)). Since multiple faults may collectively contribute to populating CF and diminishing SF (Brilliant et al. 1990), one could adopt a strategy of re-executing the test suite when a fault is removed, because this may improve the CF:SF ratio. To demonstrate, let f1 and f2 represent two faults in the SUT, and {1, 2, 3, 4, 5} be a set of inputs that lead to a correlated failure as a result of f1. Further suppose that {5, 6} is the set of inputs that can detect f2. Since f1 causes input 5 to lead to a correlated failure, only input 6 can detect f2; thus, by removing f1, the number of inputs that can be used to detect f2 doubles, since 5 would no longer lead to a correlated failure.
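The f1/f2 example above can be made concrete with a small sketch; the input sets are the hypothetical ones from the example:

```python
# Hypothetical input sets from the f1/f2 example.
f1_correlated = {1, 2, 3, 4, 5}  # inputs that f1 turns into correlated failures
f2_detecting = {5, 6}            # inputs that could expose f2

def detectable(fault_inputs, correlated):
    """Inputs that can still reveal a fault: those not masked by a
    correlated failure caused by another fault."""
    return fault_inputs - correlated

before = detectable(f2_detecting, f1_correlated)  # while f1 is present
after = detectable(f2_detecting, set())           # once f1 is removed
```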
Although the guidelines discussed above (i.e. using independent development teams, diversifying the systems, and test suite manipulation) can reduce the number of correlated failures, the extent to which they do varies across systems. This is because different systems have outputs with different cardinalities, 2 which have been observed to influence the incidence of correlated failures (Brilliant et al. 1990).

Usability of N-version testing
The only manual tasks in N-version Testing are procuring RIs and debugging; thus discussions regarding usability will revolve around these issues. This section discusses the former, and the latter is covered in Section 3.3.
At its inception, the recommended method of procuring RIs for the purpose of N-version Testing was development (Weyuker 1982). Developing RIs can require substantial time and effort (Hummel et al. 2006). Many solutions that might reduce the labour intensiveness of this task have been proposed. For example, Davis and Weyuker (1981) recognised that the performance of an RI is not important, because it is not intended to be production quality code. They also realised that programs that are written in high-level programming languages have poorer runtime efficiency, but are quicker and easier to develop (Davis and Weyuker 1981). To that end, they suggest using such languages for the development of RIs. Similarly, we suspect that it might be possible to sacrifice other quality attributes, to make RI development faster and easier.
Solutions that can eliminate development effort completely have also been proposed. For example, it has been reported that previous versions of the same system (an approach that has been widely adopted in practice), or other systems that implement the same functionality, could be used as RIs. RIs could also be automatically generated, e.g. through Testability Transformations (McMinn 2009). Component Harvesting is an alternative procurement strategy. It involves searching online code repositories with some of the desired RI's syntax and semantics specification (Hummel et al. 2006). Hummel et al. (2006) assert that this substantially reduces procurement effort. However, this may not always be true; other activities may be introduced that will offset effort gains. For example, an RI's relevant functionality may depend on irrelevant functionality; such dependencies must be removed. The SUT and RIs must also have a common input-output structure (Chen et al. 2009a). Thus, it may be necessary to standardise the structure of the input and output. Atkinson et al. (2011) remark that the effectiveness of the search depends on how well the user specifies the search criteria. It is therefore possible for the search to return systems that cannot be used as RIs. Additionally, systems that have unfavourable legal obligations may also be returned; using these systems as RIs may therefore be infeasible. Identifying and removing such systems from the search results may be labour intensive.
Suitable RIs may not exist (Murphy et al. 2009a); this means that Component Harvesting may be inapplicable in some cases. Additionally, the technique is limited to simple RIs, i.e. RIs that are limited in terms of scale and functionality. This means that the technique can only support simple SUTs.
Although Testability Transformations and Component Harvesting can substantially improve the usability of N-version Testing, these techniques clearly have limited generalisability, i.e. the former only caters for a limited range of faults, and the latter for a limited range of systems. Further research that results in improvements in their generalisability could add significant value. For example, Component Harvesting might be extended to more complex RIs as follows: since the semantics of simple RIs are understood, it may be possible to automatically combine multiple simple RIs into a more complex RI.

Cost of N-version testing
The cost of N-version Testing is a divisive issue. The cost of obtaining RIs is particularly contentious. Many claim that the RI procurement process is expensive because it may involve the re-implementation of the SUT (Hoffman 1999). However, others argue that this process can be inexpensive because it can be automated by procurement strategies like Component Harvesting (Hummel et al. 2006). However, as discussed in Section 3.2, these strategies are only applicable under certain conditions and so manual re-implementation may be necessary in some situations. This means that the cost of obtaining RIs can vary.
In manual testing, the tester must manually verify the output of a test case. In N-version Testing, this process is automated (Brilliant et al. 1990). This means that test execution can be cheaper in N-version Testing in comparison to manual approaches; thus N-version Testing could be cheaper if a large number of test cases are required. It might be necessary to generate additional test cases because of software maintenance (Assi et al. 2016). Thus, the requirement for a larger number of test cases might be correlated with update frequency. Update cost can be exacerbated by N-version Testing because changes may have to be reflected across all RIs (Lozano 2010). This cost may be further exacerbated, depending on the RI's maintainability (Manolache and Kourie 2001). This may offset the cost effectiveness gains obtained from cheaper test cases in some scenarios.
Let V1 be the SUT and R1 be an RI based on V1. Suppose that V1 was updated to become V2. Some test cases that are applicable for V1 (and by implication, R1) may also be applicable for V2. Thus, instead of updating R1 to be consistent with V2, one could simply restrict testing to these test cases. This might alleviate update costs. However, such an approach clearly cannot cater for new functionality (Hummel et al. 2006).
The impact of increasing the number of RIs on cost effectiveness is also unclear. A failure's cost can be high (Manolache and Kourie 2001), which means substantial cost savings may be obtained by detecting a fault that could result in such a failure. Since the number of RIs is positively correlated with effectiveness (Brilliant et al. 1990), the chance of obtaining these cost savings can improve if more RIs are used. However, as mentioned above, developing an RI can be expensive (Lozano 2010). Thus, increasing the number of RIs will inherently increase development cost. Clearly, the direction of the correlation between the number of RIs and cost is dependent on whether a sufficiently expensive fault is found.
When a failure is reported, it is unclear which system is the source (Chen et al. 2009a); this means that one must debug multiple systems. Thus, using more RIs can lead to an increase in debugging costs. However, using multiple RIs enables the establishment of a voting system, where each RI (and the SUT) votes for its output (Oliveira et al. 2014a). Systems that are outnumbered in a vote are likely to be incorrect. Thus, debugging effort can be directed and therefore minimised. Unfortunately, correct systems can be outnumbered in the vote; therefore a voting system may have limited impact in some situations.
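A minimal sketch of such a voting system, assuming outputs are hashable; the helper name `vote` is hypothetical:

```python
from collections import Counter

def vote(sut_output, ri_outputs):
    """Majority vote over the SUT's output and the RI outputs.
    Outputs that are outnumbered identify the prime debugging
    suspects, though the majority itself may be wrong."""
    all_outputs = [sut_output] + ri_outputs
    winner, _ = Counter(all_outputs).most_common(1)[0]
    suspects = [o for o in all_outputs if o != winner]
    return winner, suspects
```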

Content-based image retrieval
Some systems produce graphical outputs. The correctness of graphical outputs can be verified by comparing them to reference images (Delamaro et al. 2013). Reference images could be obtained from RIs. Oliveira et al. (2014b) proposed combining a Content-based Image Retrieval System with feature extractors and similarity functions to enable the automated comparison of such outputs with reference images, based on their critical characteristics.
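A toy sketch of this idea, with a deliberately simple feature extractor (a coarse intensity histogram) and similarity function (histogram intersection) standing in for the real components used by Oliveira et al. (2014b); all names, the bin count, and the threshold are illustrative assumptions:

```python
def histogram(pixels, bins=4, max_val=256):
    """Toy feature extractor: a coarse intensity histogram over a
    flat list of greyscale pixel values in [0, max_val)."""
    h = [0] * bins
    for p in pixels:
        h[p * bins // max_val] += 1
    return h

def similarity(a, b):
    """Toy similarity function: normalised histogram intersection."""
    return sum(min(x, y) for x, y in zip(a, b)) / max(sum(a), 1)

def matches_reference(output_pixels, reference_pixels, threshold=0.9):
    """Compare the SUT's graphical output against a reference image
    via the extracted features; flag a mismatch below the threshold."""
    return similarity(histogram(output_pixels),
                      histogram(reference_pixels)) >= threshold
```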
Unfortunately, the application of their technique is not fully automated. For example, one must acquire appropriate feature extractors and similarity functions (Oliveira et al. 2014b). However, some of this manual effort may not always be necessary. For instance, many feature extractors and similarity functions are freely available (Oliveira et al. 2014b), thus one may not have to develop these, if these free ones are appropriate.

Metamorphic testing
In Metamorphic Testing (MT), a set of test cases, called the Metamorphic Test Group (MTG), is generated. MTG has two types of test cases. Source test cases are arbitrary and can be generated by any test case generation strategy (Chen et al. 1998b; Guderlei and Mayer 2007b), whilst follow up test cases are generated based on specific source test cases and a Metamorphic Property. A Metamorphic Property is an expected relationship between source and follow up test cases. For example, consider a self-service checkout that allows a customer to scan product barcodes and automatically calculates the total price. The Metamorphic Property might state that a shopping cart that consists of two instances of the same product type should cost more than a shopping cart with just one. Let B1 and B2 be instances of the same product, and SC1 = {B1, B2} denote a shopping cart containing both instances. Suppose that SC1 is a source test case. Based on this Metamorphic Property and source test case, MT might use subset selection to derive two follow up test cases: SC2 = {B1} and SC3 = {B2}. Thus, the MTG may consist of SC1, SC2, and SC3. The Metamorphic Property in conjunction with the MTG is called a Metamorphic Relation (MR). MRs are evaluated by executing the MTG and checking that the Metamorphic Property holds (Kanewala and Bieman 2013b) between these executions; in this case checking that the price of SC1 is greater than the price of SC2, and the price of SC1 is greater than the price of SC3.
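The self-service checkout example can be sketched as follows; `total_price`, `check_subset_relation`, and the price table are hypothetical names introduced for illustration:

```python
def total_price(cart, prices):
    """Hypothetical SUT: sums the price of each scanned barcode."""
    return sum(prices[barcode] for barcode in cart)

def check_subset_relation(cart, prices):
    """Metamorphic check: a two-item cart (the source test case) must
    cost strictly more than either of its one-item subsets (the
    follow up test cases)."""
    source = cart                                # SC1
    followups = [[cart[0]], [cart[1]]]           # SC2 and SC3
    source_total = total_price(source, prices)
    return all(source_total > total_price(f, prices) for f in followups)
```

If the check returns False, the Metamorphic Relation is violated and a fault is reported, all without predicting the exact expected total.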
A permutation relation is an MR where changes in the input order have a predictable effect on the output. For example, consider a sort function, Sort(I), where I is a list of integers. A permutation relation might develop the following source and follow up test cases: Sort(1, 3, 2) and Sort(3, 1, 2), with the expectation that their outputs are the same. Some refer to MT as Symmetric Testing in situations where only permutation relations are used (Gotlieb 2003).
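A permutation relation for a sort function can be checked with a sketch of this shape (the helper name is hypothetical):

```python
import random

def check_permutation_relation(sort_fn, source):
    """Symmetric-testing check: sorting a random permutation of the
    source input (the follow up test case) must yield the same output
    as sorting the source itself."""
    followup = source[:]
    random.shuffle(followup)
    return sort_fn(source) == sort_fn(followup)
```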
Like N-version Testing, MT was created to alleviate the oracle problem. In particular, MT attempts to resolve oracle problems where the test outcome is unpredictable due to a lack of prior knowledge. As has been made apparent above, MT does not rely on predicted test outcomes to verify the correctness of the SUT. Thus, MT can operate in the presence of this oracle problem. MT has also been shown to be effective for a large range of different oracle problems, including complex (Guderlei and Mayer 2007b) (i.e. systems that involve non-trivial processing operations) and data intensive systems (Chen et al. 2009a), because the process of evaluating an MR can be inexpensive.

Effectiveness of metamorphic testing
Experiments on MT's effectiveness have produced varied results, ranging from 5% (Murphy et al. 2009a) to 100% (Segura et al. 2011) mutation scores. Several factors that influence effectiveness, and thus may explain this disparity, have been reported. These factors can broadly be categorised as follows: coverage, characteristics, the problem domain, and faults. This section explores these factors. For generalisability purposes, our discussions are limited to implementation independent issues.

Coverage
Numerous strategies for maximising the coverage of MT are available. For example, it has been observed that some MRs place restrictions on source test cases. Thus, one's choice of MRs could constrain a test suite's coverage. Coverage could be maximised by limiting the usage of such MRs. Núñez and Hierons (2015) observed that certain MRs target specific areas of the SUT. This means that increasing the number of MRs that are used, such that the additional MRs focus on areas of the SUT that are not checked by other MRs, could increase coverage. Merkel et al. (2011) state that since testing resources are finite, there is a trade-off between the number of MRs and test cases that can be used. Therefore, increasing the number of MRs to implement the above strategy could limit the test suite size. Thus, the aforementioned coverage gains could be offset. The optimal trade-off is context dependent.
Let P be a program consisting of three paths, P = {{s1, s2}, {s2}, {s2, s3}}, and let MR1 and MR2 be MRs that each have an MTG consisting of two test cases. Suppose that MR1's MTG covers the first and second path and thereby executes statements s1 and s2, and that MR2's MTG covers the first and third path and so covers all three statements. This demonstrates that an MR's MTG can obtain greater coverage if the paths that are traversed by each of its test cases are different (Cao et al. 2013). Several guidelines have been proposed for designing MRs that have such MTGs. For example, white-box analysis techniques (Dong et al. 2013), or coverage information generated by regression testing (Cao et al. 2013), could assist in the identification of MRs whose MTGs contain test cases that traverse different paths. MRs that use a similar strategy to the SUT tend to have MTGs with similar source and follow up test cases (Mayer and Guderlei 2006), and thus should be avoided. Different MRs can have different MTG sizes (Cao et al. 2013). It seems intuitive that MRs whose MTGs consist of a larger number of test cases are more likely to contain test cases that traverse dissimilar paths.
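The path example above can be sketched as follows; the program, paths, and statement names are the hypothetical ones from the text, with each path modelled simply as the set of statements it executes:

```python
# Toy illustration: each path is the set of statements it executes;
# an MTG's coverage is the union over the paths its test cases take.
PATHS = {
    "p1": {"s1", "s2"},
    "p2": {"s2"},
    "p3": {"s2", "s3"},
}

def mtg_coverage(paths_taken):
    # Union of the statements executed by the MTG's test cases.
    cov = set()
    for p in paths_taken:
        cov |= PATHS[p]
    return cov

# MR1's MTG traverses p1 and p2, covering only s1 and s2.
assert mtg_coverage(["p1", "p2"]) == {"s1", "s2"}
# MR2's MTG traverses p1 and p3, covering all three statements.
assert mtg_coverage(["p1", "p3"]) == {"s1", "s2", "s3"}
```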

Characteristics
An MR has many characteristics that can be manipulated to improve its effectiveness. For example, it has been observed that decreasing the level of abstraction of an MR can improve its fault detection capabilities (Jiang et al. 2014). This section explores these characteristics and their relationships with effectiveness.
MRs can vary in terms of granularity, e.g. application or function level. In a study conducted by Murphy et al. (2013), it can be observed that MRs that are defined at the application level can detect more faults than MRs that are defined at the function level, in some systems. This means that MRs that were defined at a higher level of granularity were more effective for these systems. Interestingly, the converse was also observed for other systems (Murphy et al. 2013), and so the most effective level of granularity might depend on the system. Regardless, both MR types found different faults (Murphy et al. 2013), and thus, both can add value in the same context.
It has been reported that an MR that captures a large amount of the semantics of the SUT (i.e. an MR that reflects the behaviours of the SUT to a greater degree of completeness and accuracy) can be highly effective (Mayer and Guderlei 2006). Let MRr and MRp be two MRs, such that MRr captures more of the semantics of the SUT than MRp. This suggests that MRr might be more effective than MRp. We believe that certain test cases can capture some of the semantics of the SUT. Let tc be such a test case. It may therefore be possible for MRp to obtain a comparable level of effectiveness to MRr, if MRp is evaluated based on tc, because the additional semantics in tc may counteract the deficit of such semantics in MRp. However, recall that some MRs place restrictions on test inputs; this may limit the scope for using test cases like tc with MRs like MRp.
Another widely reported characteristic is strength. Let MR1 and MR2 be two MRs, such that MR1 is theoretically stronger than MR2. This means that if one can confirm that MR1 holds with respect to the entire input domain, then this implies that MR2 also holds with respect to the entire input domain (Mayer and Guderlei 2006). This implies that MR1 can detect all of the faults that can be detected by MR2, in addition to other faults. Some regard MRs like MR2 to be redundant (Sim et al. 2013). Interestingly, a study conducted by Chen et al. (2004a) compared the failure detection rates of nine MRs. The weakest MR obtained the highest failure detection rate for 15/16 of the faults, whilst the strongest MR obtained the lowest failure detection rate for 13/16 of the faults. This suggests that strong MRs are not necessarily more effective than weak MRs (Chen et al. 2004a), and weak MRs are therefore not redundant. Mayer and Guderlei (2006) realised that weak MRs can have more failure revealing test cases than stronger MRs. This may explain why weak MRs can be more effective.
Black-box MT emphasises the development of strong MRs. It is therefore not surprising that the observation that weak MRs can be more effective than strong MRs led Chen et al. (2004a) to conclude that black-box MT should be abandoned. Proponents of this argument view an understanding of the algorithm structure as necessary (Chen et al. 2004a). Although this argument has a strong theoretical foundation, Mayer and Guderlei (2006) have questioned the practicality of the position. One must consider all input-output pairs to deduce the relative strength of one MR to another, which can be infeasible in practice (Mayer and Guderlei 2006). Thus, opponents contend that categorising MRs based on their strength is impractical, and by implication, deciding to abandon black-box MT based solely on MR strength is nonsensical (Mayer and Guderlei 2006). Whilst we agree with Mayer and Guderlei (2006) that it may be impractical to determine whether one MR is stronger than another, we disagree with the notion that this threatens the validity of the argument of Chen et al. (2004a), since knowledge about the relative strength of two MRs is not necessary to leverage the advice of Chen et al. (2004a).
Tightness is another major characteristic (Liu et al. 2014b); tighter MRs have a more precise definition of correctness. For example, a tight MR may check X == (Y × 2); only one answer is acceptable. A looser MR may check (X > 2); whilst (X ≤ 2) indicates a fault, an infinite number of answers are acceptable. Therefore, tighter MRs are more likely to be effective (Merkel et al. 2011). Although tight MRs are preferable, they may be unavailable. For example, consider a non-deterministic system that returns a random output that is approximately two times larger than the input. A tight MR is not available because predicting the precise output is impossible, however, the following loose MR can be used: output < (input × 4) (Murphy et al. 2009b).
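A loose MR of this kind can be sketched as follows; the non-deterministic SUT (noisy_double) is hypothetical, returning roughly twice its input:

```python
import random

def noisy_double(x):
    # Hypothetical non-deterministic SUT: returns roughly 2*x with
    # random noise, so no tight oracle can predict the exact output.
    return x * random.uniform(1.5, 2.5)

def loose_mr_holds(x):
    # Loose MR from the text: output < input * 4 (for positive inputs).
    # Many outputs are acceptable, but output >= 4*x indicates a fault.
    return noisy_double(x) < x * 4

assert all(loose_mr_holds(x) for x in range(1, 100))
```

The check is weaker than a tight equality, but it remains applicable despite the non-determinism.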
Another important characteristic is the soundness of an MR. A sound MR is one that is expected to hold for all input values. Conversely, an unsound MR is only expected to hold for a subset of the input values (Murphy and Kaiser 2010). Unlike sound MRs, unsound MRs are prone to producing false positives (Murphy and Kaiser 2010). It might be advisable to avoid using such MRs, to curtail false positives. However, it has been reported that MRs that are less sound might be capable of detecting faults that cannot be detected by MRs that are more sound (Murphy and Kaiser 2010). Thus, such MRs might add value.

Problem domain
It has been reported that MT is more effective when one uses multiple MRs, instead of just one MR (Merkel et al. 2011). Since MRs are domain specific (Chen et al. 2009a), the total number of potential MRs in one problem domain can be different than in another. For example, Chen et al. (2004a) found nine MRs for Dijkstra's Algorithm, whilst Guderlei and Mayer (2007a) could only find one MR for the inverse cumulative distribution function. Therefore, the problem domain is likely to directly influence MT's effectiveness.
Specialised variants of MT have been developed to account for the characteristics of certain problem domains. For example, Murphy et al. (2009a) propose Metamorphic Heuristic Oracles to account for floating point inaccuracies and non-determinism. This approach involves allowing MT to interpret values that are similar, as equal (Murphy and Kaiser 2010). The definition of "similar" is context dependent (Murphy et al. 2009a), thus general guidance is limited.

Faults
MT and its variants can detect a diverse range of faults, e.g. MT can find faults in configuration parameters (Núñez and Hierons 2015) and specifications (Chen et al. 2009a), and Statistical Metamorphic Testing (see Section 7.3) can find faults that can only be detected by inspecting multiple executions. However, MRs are necessary, but not sufficient (Chen et al. 2003b); they are not effective for all fault types, e.g. coincidentally correct faults (Cao et al. 2013; Yoo 2010).
Specifications can be used as a source of inspiration for the MR identification process (Jiang et al. 2014; Liu et al. 2014a). It has been reported that the effectiveness of MT can be compromised by errors in the specification (Liu et al. 2014a). This could be because errors in the specification may propagate to the MRs, if the MRs have been designed based on the specification. The same specification errors may have also propagated into the SUT, thus there might be scope for correlated failures (see Section 3.1). This might explain why MT cannot find certain faults. One might reduce this risk by using other sources of inspiration, e.g. domain knowledge (Chen et al. 2009a) or the implementation.

Mishra et al. (2013) observed that students performed better on class assignments revolving around equivalence partitions and boundary value analysis, when compared to MT. This suggests that MT might be more difficult to grasp than other testing techniques. This could be because MT requires a wide skillset to operate. Poon et al. (2014) claim that MR implementation requires limited technical expertise. However, others have stated that the tester might not be competent enough to implement MRs, which indicates that developing MRs might be difficult. These conflicting conclusions suggest that the difficulty of MR development might be context dependent.

Prerequisite skills and knowledge
One's domain expectations might not necessarily match the implementation details of the SUT. This disparity might be a result of an intended design decision. For example, the SUT's precision may be compromised in favour of efficiency. If one is not aware of such design decisions and designs MRs purely based on domain expectations, the MRs might erroneously interpret this disparity as a failure. Thus, knowledge about the implementation details of the SUT might be important.
Domain experts can identify more MRs, and more effective MRs, more productively than non-domain experts (Chen et al. 2009c). This suggests that domain knowledge is also important. Therefore, if one lacks adequate domain knowledge, it is advisable to consult domain experts (Liu et al. 2012). MRs are identified in booms and slumps; the SUT is investigated during a slump to develop new intuitions that can be used to identify MRs, and such MRs are defined in boom periods (Chen et al. 2016). This iterative process affords further opportunities to continuously supplement one's domain knowledge.
An experiment conducted by Zhang et al. (2009) found that different developers can identify different MRs. This is not surprising, because different people have different domain knowledge. It may therefore be advisable to leverage a team (Poon et al. 2014), because this may ensure greater coverage of the domain knowledge. A small team, e.g. one consisting of three people, has been shown to be sufficient (Liu et al. 2014a).

Effort
A number of factors affect the effort required to apply MT. For example, it has been observed that an MR that has been developed for one system might be reusable in another system. Thus, MT might be easier to apply in situations in which MRs that were developed for other systems are available. Another example is MTG size. Since it is not apparent which test case in the MTG reveals the failure, all test cases must be considered during debugging. This means that effort can be substantially reduced if the MTG size is reduced. Alternatively, Liu et al. (2014b) proposed a method that could provide some indication of the likelihood that a particular test case in the MTG executed the fault. Their method deems a test case to be more likely to have executed the fault if it was executed by more violated MRs. This could be used to direct debugging effort. Another alternative is Semi-Proving, which is covered in Section 4.6.
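The idea behind the method of Liu et al. (2014b) might be sketched as follows; the function and test case names are illustrative, and this is only one simple reading of the approach:

```python
from collections import Counter

def rank_suspicious_test_cases(violated_mtgs):
    # violated_mtgs: for each violated MR, the test cases in its MTG.
    # A test case that appears in more violated MRs' MTGs is deemed
    # more likely to have executed the fault, so it is ranked first.
    counts = Counter(tc for mtg in violated_mtgs for tc in mtg)
    return [tc for tc, _ in counts.most_common()]

# t2 appears in all three violated MRs' MTGs, so inspect it first.
violations = [["t1", "t2"], ["t2", "t3"], ["t2", "t4"]]
assert rank_suspicious_test_cases(violations)[0] == "t2"
```

The ranking can then be used to direct debugging effort towards the most suspicious test case.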
The most significant factor affecting effort is believed to be the difficulty of MR identification; thus, most research has been conducted on this factor. For example, Chen et al. (2016) found that MR identification is difficult because inputs and outputs must be considered simultaneously. They alleviated this by automating input analysis, thereby constraining the tester's attention to outputs (Chen et al. 2016). The technique specifies a set of characteristics, called "Categories"; each is associated with inputs that manipulate it. These inputs are subdivided into "choices"; all inputs belonging to a particular choice manipulate the characteristic in the same way.
A test frame is a set of constraints that define a test case scenario. Pairs of test frames (that correspond to source and follow up test cases) can be automatically generated by grouping various categories and choices together, such that they are "Distinct" and "Relevant" (i.e. marginally different). For example, let Function(a, b, c, x, y, z) be a function with six input variables; a distinct and relevant pair may differ by only one of these variables, e.g. z. These test frames produce test cases that are executed to obtain a set of outputs, which can be manually checked for relationships. The process of automatically generating test frames and manually analysing them is iterative (Chen et al. 2016); since an infeasible number of pairs typically exist, the terminating condition is the tester's satisfaction with the identified pool of MRs (Chen et al. 2016).
The empirical evidence is promising; people's performance with respect to MR identification improved, and novices achieved a comparable level of performance to experts (Chen et al. 2016). However, the technique has an important limitation; it can currently only support MRs that are composed of one source and follow up test case (Chen et al. 2016). Kanewala and Bieman (2013b) alternatively propose training Machine Learning classifiers to recognise operation sequence patterns that are correlated with particular MRs. Such a classifier can predict whether unseen code exhibits a particular MR. Results have been promising; the technique has a low false positive rate, and can identify MRs even when faults are present in the SUT (Kanewala and Bieman 2013b).
Although this technique achieves greater automation (Kanewala and Bieman 2013b) than the approach devised by Chen et al. (2016), additional human involvement is introduced elsewhere. For example, training datasets are necessary for the machine learning classifiers, and obtaining these can be difficult (Chen et al. 2016). One may wish to extend the classifier with a graph kernel, which has parameters that might have to be tuned to improve accuracy. Furthermore, since each classifier is associated with one MR type (Kanewala and Bieman 2013b), these additional tasks must be repeated for each MR type.

Efficiency of metamorphic testing
There is a time cost associated with test case generation and execution (Chen et al. 2014c). As discussed in Section 4.1.1, different MRs have different MTG sizes. This means that some MRs might incur greater time costs than others. Thus, one might improve the efficiency of MT by restricting oneself to MRs with smaller MTGs. However, as was discussed in Section 4.1.1, MRs with larger MTGs might obtain greater coverage, thus such a restriction might reduce the effectiveness of the technique. Alternatively, one might consider using parallel processing -the test cases in the MTG can be executed simultaneously (Murphy and Kaiser 2010).
Other approaches include combining MRs in various ways to make more efficient use of test cases. For example, one could use the same test cases for different MRs (Chen et al. 2014c; Wu 2005). Combination relations are another possible method. Liu et al. (2012) suggested defining a new MR that is composed of multiple MRs; for ease of reference, we call such an MR a "combination relation". By evaluating this single MR, one implicitly evaluates all of the constituent MRs, and thus makes more efficient use of test cases. Logic dictates that a single MR that embodies multiple MRs would have a level of effectiveness that is equivalent to the sum of its constituent parts (Liu et al. 2012). Interestingly however, it has been found that such an MR can actually obtain a higher level of effectiveness than its constituent MRs (Liu et al. 2012). This could be because one MR in the combination relation may empower another. For example, MRs MRn and MRc may be effective for numerical and control flow faults, respectively; combining the two may extend MRn's capability to control flow faults.
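One simple reading of a combination relation can be sketched as follows, using a hypothetical cosine SUT and two standard trigonometric MRs; evaluating the compound check implicitly evaluates both constituents on the same source test case:

```python
import math

def sut(x):
    # Hypothetical SUT: a cosine implementation.
    return math.cos(x)

def mr_even(x, tol=1e-9):
    # Constituent MR1: cos(-x) == cos(x)
    return abs(sut(-x) - sut(x)) <= tol

def mr_shift(x, tol=1e-9):
    # Constituent MR2: cos(x + 2*pi) == cos(x)
    return abs(sut(x + 2 * math.pi) - sut(x)) <= tol

def combination_relation(x):
    # The combination relation: one compound check that reuses the
    # same source test case x for both constituent MRs.
    return mr_even(x) and mr_shift(x)

assert combination_relation(0.7)
```

This is only a sketch of the idea; Liu et al. (2012) define combination relations more generally than a simple conjunction.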

Combination relations
Conversely, effectiveness can deteriorate; Liu et al. (2012) observed that including a loose MR in a combination relation can reduce the combination relation's overall effectiveness. Thus, they advocate only combining MRs that have similar tightness. They also observed that some MRs might "cancel" out other MRs either partially or completely (Liu et al. 2012). This may also explain why the effectiveness of a combination relation might deteriorate. These observations suggest that one may be limited in one's choice regarding which MRs can be combined, and by implication, the technique might be inapplicable in some scenarios (Monisha and Chamundeswari 2013).
Different MRs can accommodate different subsets of the input domain (Dong et al. 2007). Since an input must be suitable for all MRs in the combination relation, additional restrictions might have to be placed on constituent MRs. For example, suppose that MR1 and MR2 are two MRs in a combination relation. MR1 can accommodate five test inputs, {ta, tb, tc, td, te}, and MR2 can only accommodate three test inputs, {ta, tb, tf}. In this situation, it is not possible to use the inputs {tc, td, te, tf}. This means that MRs that can accommodate larger subsets may be more useful (Dong et al. 2007). This could also explain why a combination relation's effectiveness might deteriorate.
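The example above can be expressed directly as a set intersection; the input names mirror the hypothetical ones in the text:

```python
# Hypothetical input domains for two MRs in a combination relation.
mr1_inputs = {"ta", "tb", "tc", "td", "te"}
mr2_inputs = {"ta", "tb", "tf"}

# Only inputs acceptable to every constituent MR can be used.
usable = mr1_inputs & mr2_inputs
unusable = (mr1_inputs | mr2_inputs) - usable

assert usable == {"ta", "tb"}
assert unusable == {"tc", "td", "te", "tf"}
```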

Metamorphic runtime checking
In Metamorphic Runtime Checking, MRs are instrumented in the SUT, and evaluated during the SUT's execution. One of the benefits of this approach is that MRs are evaluated in the context of the entire SUT (Murphy et al. 2013). This can improve the effectiveness of MT. To illustrate, Murphy et al. (2013) observed that MRs that are evaluated in one area of the system, could detect faults in other areas.
Unfortunately, unintended side effects can be introduced during instrumentation (Murphy et al. 2013). For example, consider a function, F(x), and a global counter variable, I, that is incremented every time F(x) is executed. A follow up test case that executes F(x) will inadvertently affect I's state. Thus, sandboxing may be advisable (Murphy and Kaiser 2010).
Sandboxes introduce additional performance overheads (Murphy and Kaiser 2009). However, since Metamorphic Runtime Checking uses test data from the live system (Murphy et al. 2009b), the generation of a source test case is no longer necessary. These efficiency gains may offset the losses from the performance overheads incurred from sandboxes.
To improve the efficiency of the approach further, some have suggested parallel execution (Murphy and Kaiser 2010). It has been observed that the number of times each MR is evaluated depends on the number of times each function is invoked (Murphy et al. 2013). To illustrate, let f1() and f2() be two functions in the same system, such that f1() is always invoked twice as many times as f2() because of the control flow of the system. Suppose that MRs MR1 and MR2 are evaluated each time f1() and f2() are invoked, respectively. Since MR1 is evaluated twice as many times as MR2, MR1 would add more performance overhead than MR2. Thus, one could prioritise MRs like MR2 over MRs like MR1 to improve performance.

Semi-proving
An MR's verdict only indicates the SUT's correctness for one input. Semi-Proving attempts to use symbolic execution to enable such a verdict to generalise to all inputs (Chen et al. 2011b).
In Semi-Proving, each member of an MR's MTG, MetTestGrp = {tc1, tc2, ..., tcn}, is expressed, using symbolic inputs, as constraints that represent multiple concrete test cases. Each test case, tci, in the MTG is symbolically executed, resulting in a set of symbolic outputs, Oi = {oi1, oi2, ..., oim}, and corresponding symbolic constraints, Ci = {ci1, ci2, ..., cim}, on which each output is predicated. Let CP be the Cartesian product of the Ci, i.e. CP = C1 × C2 × ... × Cn. For each combination comb = (c1a, c2b, ..., cnz) in CP, the conjunction of all members of comb results in either a contradiction or an agreement. For each agreement, Semi-Proving checks whether the MR is satisfied or violated under the conditions represented by comb.
Since all concrete executions represented by a symbolic execution are accounted for, it is possible to prove the correctness for the entire input domain, with respect to a certain property (Chen et al. 2011b). However, this might not always be feasible. For example, in some systems, certain loops, arrays, or pointers could cause such a large number of potential paths to exist, that it would be infeasible for Semi-Proving to check them all exhaustively (Chen et al. 2011b). To alleviate this problem, one could restrict the application of the technique to specific program paths, replace some symbolic inputs with concrete values, use summaries of some of the SUT's functions instead of the functions themselves, or restrict the technique with upper-bounds (Chen et al. 2011b). Chen et al. (2011b) realised that the correctness of some symbolic test cases can be inferred from others. For example, consider the max function and the following two symbolic test cases: max(x, y) and max(y, x). Since these test cases are equivalent, only one must be executed to deduce the correctness of both. Optimising resource utilisation through this strategy may also alleviate the problem.
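The max example can be sketched as follows; note that this uses sampled concrete executions (one per path) rather than true symbolic execution, purely to illustrate why only one of the two equivalent test cases needs to be executed:

```python
def sut_max(a, b):
    # Hypothetical SUT under Semi-Proving analysis: two paths,
    # one for a >= b and one for a < b.
    return a if a >= b else b

# The two symbolic test cases max(x, y) and max(y, x) are equivalent
# up to renaming of inputs, so a verdict for one can be inferred for
# the other; here we merely confirm this on one instance per path.
for x, y in [(1, 2), (2, 1), (3, 3)]:
    assert sut_max(x, y) == sut_max(y, x)
```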
Obtaining such coverage can improve the fault detection effectiveness of MT (Chen et al. 2011b). Improvements in effectiveness for subtle faults, e.g. missing path faults, have been reported to be particularly noteworthy by several researchers (Chen et al. 2011b; Gotlieb and Botella 2003). Another advantage of greater coverage is improved debugging information. In particular, there is greater scope for the precise failure causing conditions (Chen et al. 2011b) and test cases (Liu et al. 2014b) to be identified. Whether this improves debugging productivity is questionable though; investigating this information requires manual inspection of multiple (possibly all) execution paths (Chen et al. 2011b).

Heuristic test oracles
Heuristic Test Oracles are a loose variant of Metamorphic Testing. In this approach, expected input-output relationships are initially identified, e.g. input increase implies output decrease. The SUT is then executed multiple times with different inputs, to obtain a set of outputs. These inputs and outputs are used in conjunction with each other to check whether the expected input-output relationship holds (Hoffman 1999).
Thus, Heuristic Test Oracles can only be applied to systems that have predictable relationships between inputs and outputs (Hoffman 1999). Some systems may not have relationships that span the entire input domain. In such situations, it might be possible to define heuristics for a subset of the input domain (Hoffman 1999). For example, Sine's input domain can be split into three subdomains: Subdomain One = {0 ≤ i ≤ 90}, Subdomain Two = {90 ≤ i ≤ 270}, and Subdomain Three = {270 ≤ i ≤ 360}. A positive correlation between the input and output can be observed in Subdomain One and Subdomain Three, whilst a negative correlation is assumed in Subdomain Two (Hoffman 1999).
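This subdomain heuristic can be sketched as follows; the SUT (a degree-based sine) is hypothetical, and the oracle simply checks the expected correlation between successive inputs and outputs within each subdomain:

```python
import math

def sut_deg_sin(i):
    # Hypothetical SUT: sine of an angle given in degrees.
    return math.sin(math.radians(i))

def heuristic_oracle(lo, hi, expect_increasing, step=1):
    # Heuristic: within a subdomain, successive outputs should follow
    # the expected correlation between input and output.
    for i in range(lo, hi, step):
        a, b = sut_deg_sin(i), sut_deg_sin(i + step)
        if expect_increasing and b < a:
            return False
        if not expect_increasing and b > a:
            return False
    return True

assert heuristic_oracle(0, 90, expect_increasing=True)
assert heuristic_oracle(90, 270, expect_increasing=False)
assert heuristic_oracle(270, 360, expect_increasing=True)
```

A fault that breaks the expected monotonicity in any subdomain would cause the corresponding check to fail.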
It has been reported that these oracles are effective (Lozano 2010). Some have also claimed that these oracles are faster and easier to develop (Hoffman 1999) and maintain (Lozano 2010) than N-version Testing based oracles. Heuristic Test Oracles also have high reuse potential (Hoffman 1999), thus implementation may be bypassed completely in some cases.

Assertions
Assertions are Boolean expressions that are directly embedded into the SUT's source code (Baresi and Young 2001). These Boolean expressions are based on the SUT's state variables, e.g. X > 5, where X is a state variable. Assertions are checked during the execution of the SUT, and may either evaluate to true or false; false indicates that the SUT is faulty (Harman et al. 2013). Our general discussions on Assertions in this section are based on the above definition. We are aware that some people use alternative definitions. For example, some definitions allow one to augment the SUT, e.g. by introducing auxiliary variables (see Section 5.1), and other definitions consider runtime exceptions to be "free" assertions. Our discussions regarding such alternative definitions of the technique will be clearly indicated in the text.
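A minimal illustration, assuming a hypothetical withdraw function whose balance must never go negative; the assertion checks a property of a state variable without predicting the exact expected output:

```python
def withdraw(balance, amount):
    new_balance = balance - amount
    # Assertion over a state variable: a negative balance indicates a
    # fault, even though the exact expected output is not predicted.
    assert new_balance >= 0, "invariant violated: balance went negative"
    return new_balance

assert withdraw(100, 30) == 70
```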
Unlike N-version Testing and Metamorphic Testing, Assertions were not originally designed to alleviate the oracle problem. However, it has been observed that in order to evaluate an assertion, one does not have to predict the test outcome (Baresi and Young 2001). This means that assertions are applicable to certain classes of oracle problem, e.g. for situations in which it is not possible to predict the test outcome.

Effectiveness of assertions
Several characteristics of Assertions have been found to influence effectiveness. For example, one characteristic is that Assertions must be embedded in source code (Sim et al. 2014). Unfortunately, this can cause unintended side effects that manifest false positives, e.g. additional overheads (Kanewala and Bieman 2014) may cause premature time-outs. Thus, one must carefully write assertions to avoid side effects (Murphy and Kaiser 2009).
Assertions can be written in independent programming or specification languages, e.g. assertions can be written in Anna, and be instrumented in a program written in Ada (Baresi and Young 2001). Some languages are particularly intuitive for certain tasks, e.g. LISP for list manipulation. One could exploit this by writing Assertions in the most apposite language for the types of tasks to be performed. This might reduce the chance of introducing unintended side effects. Unfortunately, this approach can also increase the chance of introducing unintended side effects, if it causes a deterioration in readability. One can use polymorphism; assertions can be specified in a parent class, and a child class can inherit assertions from the parent class (Araujo et al. 2014). By using such a strategy, one can isolate assertions (in parent classes) from the system's source code (in child classes); this might alleviate readability issues.
The code coverage of Assertions can be limited, depending on the nature of the program. For example, let List be an array. To test List, an Assertion may assert that some property holds for all members of List. It may be infeasible to evaluate this Assertion if List has a large number of elements (Baresi and Young 2001). Thus, it may be infeasible for assertions to be used in areas of the code that have large arrays. Consider another example; Assertions can only check a limited range of properties that are expected to hold at particular points in the program, e.g. Age ≥ 0 (Harman et al. 2013). This means their coverage could be limited. According to some alternative definitions of Assertions, auxiliary variables can be introduced into the system, for the purpose of defining Assertions (Baresi and Young 2001). Introducing auxiliary variables might create new properties that can be checked by Assertions, and thus alleviate the problem. For example, suppose that x is a variable in the system, and y is a newly introduced auxiliary variable; we might include an assertion such as x > y.
Another facet of coverage is oracle information. One aspect of oracle information is the types of properties that can be checked by a technique. For example, Assertions can check the characteristics of the output or a variable, e.g. range checks (Sim et al. 2014), or how variables might be related to one another (Kanewala and Bieman 2013a), e.g. X = Y. This makes Assertions particularly effective for faults that compromise the integrity of data that is assigned to variables (Murphy and Kaiser 2010). Another aspect of oracle information is the number of executions that test data is drawn from. Test data from multiple executions is necessary for certain faults, e.g. faults in the output distributions of a probabilistic algorithm. Assertions are unable to detect such faults because they are restricted to one execution.
Some believe that Assertions can be used to detect coincidentally correct faults (Kanewala and Bieman 2013a). This supposition probably stems from the fact that Assertions have access to internal state information, and thus could detect failures in internal states that do not propagate to the output. To the best of our knowledge, there is no significant empirical evidence demonstrating that Assertions can cope with coincidental correctness. Thus, investigating this might be a useful future research direction.
It has been observed that the detection of some coincidentally correct faults may require oracle information from multiple states (Patel and Hierons 2015). Based on an alternative definition of Assertions, Baresi and Young (2001) report that Assertions can check multiple states, if they are used in conjunction with state caching. However, they also remark that state caching may be infeasible, if a large amount of data must be cached. In such situations, Assertions cannot detect such coincidentally correct faults. Additionally, they observed that Assertions cannot correlate events between two modules that do not share a direct interface. This means that assertions may not be able to check certain states simultaneously, which may render them incapable of detecting certain coincidentally correct faults. These observations demonstrate that, despite the fact that assertions have access to internal state information, they may not necessarily be effective for coincidental correctness, even when state caching is feasible.
MT has access to information from multiple executions. Sim et al. (2014) combined MT and Assertions, such that Assertions are evaluated during the execution of a Metamorphic Test's source and follow up test cases. This integration may alleviate some of the oracle information coverage issues described above.

Usability of assertions
One key skill that is a part of many developers' repertoires is program comprehension, i.e. the capability to understand the logic of a program by inspecting the source code. Developers also have experience with modifying source code, e.g. to add new functionality. Therefore, developers will be comfortable with comprehending and modifying the system's source code; these tasks are integral to the application of Assertions. This led Zhang et al. (2009) to conclude that constructing assertions can be more natural than developing oracles from other approaches like Metamorphic Testing. However, Assertions assume that the tester has knowledge about the problem domain, or the SUT's implementation details (Kanewala and Bieman 2013a). This means that an assertion could require more effort to construct in situations in which the tester has limited knowledge regarding these areas, since they would have to first acquire this knowledge. Other factors that affect the effort required to construct an assertion include the level of detail at which the assertion is specified (Araujo et al. 2014) and the programming language's expressiveness (Nardi and Delamaro 2011).
Some tools can support the development of assertions, e.g. the assert keyword in some programming languages (Baresi and Young 2001), and invariant detection tools. Invariant detection tools can be used to automatically generate assertions. They work by conducting multiple executions and recording consistent conditions (Kanewala and Bieman 2013a); these conditions are assumed to be invariant and can therefore be encoded as assertions. It is typically infeasible to consider all executions; thus only a subset is used. Variant conditions may be consistent across this subset and thus may be misinterpreted as invariant. Thus, invariant detection tools can produce spurious assertions (Murphy et al. 2013). Therefore, the manual inspection of suggestions from these tools is necessary (Kanewala and Bieman 2013a). Unfortunately, manual inspection can be error prone; cases have been observed where 50% of the incorrect invariants that were proposed by such a tool were misclassified by the manual inspection process (Harman et al. 2013).
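The mechanism behind invariant detection, and how spurious invariants arise from an unrepresentative subset of executions, can be illustrated with a minimal sketch. The SUT and the candidate conditions are illustrative assumptions; real tools such as Daikon work over much richer condition grammars.

```python
# Minimal sketch of invariant detection: candidate conditions are checked
# over a subset of executions, and any condition that held in every observed
# execution is reported as a (possibly spurious) invariant.
def sut_abs(x):
    return abs(x)

# Candidate conditions over (input, output) pairs.
candidates = {
    "output >= 0": lambda x, y: y >= 0,          # a genuine invariant of abs
    "output == input": lambda x, y: y == x,      # only holds for x >= 0
}

def detect_invariants(sut, inputs):
    survivors = dict(candidates)
    for x in inputs:
        y = sut(x)
        # Discard any condition violated by this execution.
        survivors = {name: cond for name, cond in survivors.items()
                     if cond(x, y)}
    return set(survivors)

# With a biased input subset, the variant condition "output == input"
# survives every observed execution and is misreported as an invariant.
print(detect_invariants(sut_abs, [1, 2, 3]))
# A more representative subset eliminates it.
print(detect_invariants(sut_abs, [1, -2, 3]))
```

This is exactly why the suggestions of such tools require manual inspection: nothing in the procedure distinguishes a true invariant from a condition that merely happened to hold over the sampled executions.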

Multithreaded programs
Interference in multi-threaded environments can cause assertions to produce false positives (Araujo et al. 2014). Several guidelines have been proposed to circumvent this. Firstly, assertions can be configured to evaluate under safe conditions, e.g. when access to all required data has been locked by the thread (Araujo et al. 2014). Secondly, the application of assertions can be restricted to blocks of code that are free from interference (Araujo et al. 2014). Recall that assertions can add performance overheads. This is problematic in multithreaded environments, because these overheads can introduce new interleavings or remove important ones. This can be alleviated by load balancing (Araujo et al. 2014).
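The first guideline can be sketched as follows, assuming a hypothetical shared account whose invariant is that credits and debits always balance. The assertion is only evaluated while the thread holds the lock on all of the data it reads, so another thread's half-finished update cannot trigger a false positive.

```python
# Sketch: an assertion over shared data is evaluated under a "safe condition",
# namely while the evaluating thread holds the lock on all required data.
import threading

class Account:
    def __init__(self):
        self.lock = threading.Lock()
        self.credits = 0
        self.debits = 0

    def transfer(self, amount):
        with self.lock:  # safe condition: all required data is locked
            self.credits += amount
            self.debits -= amount
            # Evaluated under the lock, this cannot observe a half-finished
            # update from another thread, so it cannot raise a false positive.
            assert self.credits + self.debits == 0, "balance invariant broken"

account = Account()
threads = [threading.Thread(target=account.transfer, args=(i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(account.credits)  # 0 + 1 + ... + 7 = 28
```

Evaluated outside the lock, the same assertion could fire between the two updates of another thread, which is precisely the interference problem described above.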

Further discussion
Research on Assertions in the context of the oracle problem is scarce. Most studies either combine them with other techniques or use them as a benchmark. This implies that Assertions are assumed to be at least moderately effective for non-testable programs, but this assumption is largely unsubstantiated. Thus, empirical studies that test this assumption may be valuable.
The literature reported in this Mapping Study did not present guidelines for assertion use in non-testable systems. We therefore believe that future work that establishes such guidelines in the context of the oracle problem will be valuable.

Machine Learning
Machine Learning (ML) Oracle approaches leverage ML algorithms, in different ways, for testing purposes. One method involves training a machine learning algorithm, on a training dataset, to identify patterns that are correlated with failure. The SUT can be executed with a test case, and this trained machine learning algorithm can then be used to check for such patterns in this test case execution. For example, Chan et al. (2009) constructed a training dataset, in which each data item corresponded to an individual test case, and consisted of a set of features that characterised the input and output of this test case. Each data item was also marked as "passed" or "failed". A classifier was trained on this training dataset and so became capable of classifying test cases that were executed by the SUT, as either passed or failed. Another method involves training a machine learning algorithm to be a model of the SUT; thus, the ML algorithm becomes akin to a reference implementation in N-version Testing (Oliveira et al. 2014a).
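The first method, as in the Chan et al. (2009) example, can be sketched with a hand-rolled 1-nearest-neighbour classifier. The choice of classifier, the features (input magnitude, output magnitude), and the data are all illustrative assumptions; the study itself does not prescribe them.

```python
# Sketch of an ML Oracle: each training item is a feature vector
# characterising a test case's input and output, labelled "passed" or
# "failed". A new execution is classified by its nearest training neighbour.
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(training_set, features):
    # training_set: list of (feature_vector, label) pairs
    _, label = min(training_set,
                   key=lambda item: distance(item[0], features))
    return label

# Features: (input magnitude, output magnitude). The failing samples here
# show an anomalously large output for a small input.
training_set = [
    ((1.0, 1.1), "passed"), ((2.0, 2.1), "passed"),
    ((1.0, 9.0), "failed"), ((2.0, 8.5), "failed"),
]

# Classify two fresh executions of the SUT by their feature vectors.
print(classify(training_set, (1.5, 1.6)))
print(classify(training_set, (1.5, 8.8)))
```

Note that the oracle never predicts the exact expected output; it only judges whether an execution's features resemble those of previously observed passes or failures, which is what allows it to operate without a conventional oracle.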
ML techniques were not originally developed for testing non-testable programs, but they can be applied to such programs (Kanewala and Bieman 2013a). To illustrate, ML Oracles draw their oracle information from training datasets, which can be obtained even when information about the expected test outcome is not available prior to execution. This allows them to test systems for which the expected test outcome is not known before execution.

Design and application of machine learning oracles
Several factors affect the effectiveness of ML Oracles. The first set of factors concerns the composition of the training dataset. It has been reported that the balance of passed and failed test cases can affect bias. Datasets can also vary in terms of size; larger datasets have less bias and are less susceptible to the negative effects of noise (Frounchi et al. 2011).
A training dataset must often be reduced to a set of features that characterise it, because the raw form of the training dataset is seldom appropriate for ML. This is typically achieved by using one or more feature extractors. The second set of factors revolves around the number of feature extractors one uses. Several trends between the number of feature extractors used and effectiveness can be observed: improvement, stagnation, and decline. Two of the feature extractors used in a study conducted by Frounchi et al. (2011) were the Tanimoto Coefficient (TC) and Scalable ODI (SODI). In this study, it was observed that the feature set {TC} was the most effective for negative classifications, while the set {TC, SODI} was the most effective for positive classifications. Clearly, adding SODI to a set of feature extractors that contains just TC can increase the accuracy of one type of classification, but decrease that of another. This implies that adding a feature extractor can improve an ML Oracle's overall effectiveness if the gain in one classification type more than offsets the losses in the others, or reduce it if the losses outweigh the gain. This might explain the improvement and decline trends.
Let G = {fe_1, fe_2, ..., fe_j} be a group of feature extractors, such that all fe_i ∈ G are highly correlated with one another. Using multiple members of G is unlikely to significantly improve classification accuracy (Frounchi et al. 2011); this could explain the stagnation trends, and suggests that one should limit the number of members of G that are used by an ML Oracle. Different feature extractors may have different quality attributes, e.g. levels of efficiency (Frounchi et al. 2011) or generalisability, and so some may be more favourable than others in certain contexts. Thus, one may consider choosing a subset of G based on the quality attributes offered by its different feature extractors. Techniques like wrappers and filters can identify and remove feature extractors that will not significantly improve classification accuracy (Frounchi et al. 2011), and thus can purge excess members of such a group.
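A simple filter of the kind mentioned above can be sketched as follows. Each candidate extractor is compared, via Pearson correlation over its outputs on a sample of executions, against the extractors already selected; highly correlated candidates are discarded. The 0.95 threshold, the extractor names, and their outputs are illustrative assumptions.

```python
# Sketch of a correlation-based filter for purging redundant feature
# extractors: a candidate is kept only if it is not highly correlated with
# any extractor already selected.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_extractors(outputs, threshold=0.95):
    # outputs: dict mapping extractor name -> its values over sample executions
    selected = []
    for name, values in outputs.items():
        if all(abs(pearson(values, outputs[kept])) < threshold
               for kept in selected):
            selected.append(name)
    return selected

outputs = {
    "TC":   [0.1, 0.4, 0.5, 0.9],
    "TC2":  [0.2, 0.8, 1.0, 1.8],   # perfectly correlated with TC: redundant
    "SODI": [0.9, 0.1, 0.7, 0.2],   # weakly correlated with TC: kept
}
print(filter_extractors(outputs))
```

This mirrors the reasoning above: TC2 belongs to the same highly correlated group G as TC, so retaining both would add cost without significantly improving classification accuracy.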
Naturally, one would expect that one major factor affecting effectiveness is the choice of ML algorithm. However, it has been reported that this choice does not have a significant impact on effectiveness, and thus, these algorithms might be interchangeable (Frounchi et al. 2011).

Limitations
ML Oracles have several limitations, and to the best of our knowledge, these limitations have not yet been resolved by the community. Recall that ML Oracles either predict the output of the SUT and then compare this prediction to the SUT's output, or they classify the output of the SUT as correct or incorrect. This means that such oracles are fundamentally used for black-box testing. It is therefore not surprising that examples of these oracles cannot test event flow (Nardi and Delamaro 2011). For similar reasons, such oracles would be hindered by coincidental correctness. Some have also observed that the negative impact of coincidental correctness on ML Oracles can be exacerbated if the ML Oracle is trained on features of the internal program structure (Kanewala and Bieman 2013a; Chan et al. 2010). It has also been reported that some variants of ML Oracles, e.g. those based on Neural Networks, are incapable of testing non-deterministic systems or streams of events (Nardi and Delamaro 2011). We believe that resolving these limitations might be useful avenues for future work.

Usability of machine learning oracles
The previous section revealed that in order to leverage ML Oracles, one must obtain appropriate training datasets, an ML algorithm, and apposite feature extractors. This section explores the user-friendliness of these activities.
We begin by considering training dataset procurement. One approach might involve obtaining an RI of the SUT and then generating the training dataset from this RI (Chan et al. 2006). RIs have several characteristics that influence dataset quality, e.g. the correctness of the RI. To illustrate, an RI might have a fault, which means that some of the training samples in the dataset may characterise incorrect behaviours (i.e. failures that manifested from this fault), but be marked as correct behaviours. This reduction in dataset quality can limit the effectiveness of an ML Oracle. For example, the SUT might have the same fault as the RI (Kanewala and Bieman 2013a), and this can lead to correlated failures. In addition, it has been observed that the extent to which an RI is similar to the SUT is correlated with accuracy, and that oracles based on similar RIs can be accurate, effective, and robust (Kanewala and Bieman 2013a). These discussions reveal that one must consider a large number of factors during the RI selection process, which could be difficult.
The nature of the data in the training dataset could also have an impact on effort. As discussed above, one aspect of a dataset's composition is test suite balance (in terms of the proportion of passed to failed test cases). If one's dataset is imbalanced, it may be necessary to expend additional effort to obtain additional data to supplement and balance the dataset. The output of an RI characterises correct behaviours, and the output of mutants of an RI characterises incorrect behaviours. Thus, if one lacks passed test cases, one could execute an RI with test cases, and if one lacks failed test cases, one could execute failure revealing test cases over a set of mutants of an RI. However, one may have to construct raw datasets manually, if a suitable RI does not exist (see Section 3.2).
The nature of the input and output data used and produced by an ML algorithm can differ from that of the SUT. Thus, it could be necessary to translate inputs that are used by the SUT into a form that is compatible with the ML algorithm and to translate outputs into a form that is amenable for comparison with the SUT's output (Pezzè and Zhang 2014). If such translations are necessary, the developers of ML Oracles must either write additional programs to automate this translation task, or perform the translation task manually.
Experts may have to manually label each training sample in the dataset, if one uses a supervised machine learning algorithm to train one's ML Oracle. An example of this can be found in the work conducted by Frounchi et al. (2011). This obviously means that larger datasets will require substantially more effort to prepare, in these situations. If multiple experts are used, then there is scope for disagreement (Frounchi et al. 2011). The resolution of these disagreements will also add to the overall effort required to apply the technique.
We finally consider feature extractor selection. One's choice of feature extractors is an important determinant of the effectiveness of ML Oracles. For this reason, many believe that a domain expert should be involved in this process (Kanewala and Bieman 2013a). If the developer is not a domain expert, consultation may be necessary.

Debugging
ML Oracles can report false positives, which means testers may waste time investigating phantom faults. ML Oracles can also produce false negatives (Kanewala and Bieman 2013a). False negatives introduce a delay, which means they can also waste resources. Some ML Oracles have tuneable thresholds; modifying these thresholds can influence the incidence of false positives and negatives (Nardi and Delamaro 2011), which can enable the management of such classification errors. Unfortunately, the optimal threshold values vary across systems (Nardi and Delamaro 2011).
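The effect of a tuneable threshold can be made concrete with a small sketch. Here the oracle is assumed to score each execution (higher meaning more failure-like) and to flag the execution when the score exceeds the threshold; the scores and true labels are illustrative.

```python
# Sketch of a tuneable-threshold ML Oracle: raising the threshold trades
# false positives for false negatives, and vice versa.
def classify(score, threshold):
    return "failed" if score > threshold else "passed"

# (failure-likeness score, true label) pairs for a set of executions.
executions = [(0.2, "passed"), (0.45, "passed"),
              (0.55, "failed"), (0.8, "failed")]

def error_counts(threshold):
    fp = sum(1 for s, label in executions
             if classify(s, threshold) == "failed" and label == "passed")
    fn = sum(1 for s, label in executions
             if classify(s, threshold) == "passed" and label == "failed")
    return fp, fn

print(error_counts(0.4))  # low threshold: a false positive, no false negatives
print(error_counts(0.6))  # high threshold: no false positives, a false negative
```

Because the score distributions of passing and failing executions differ between systems, the threshold that best balances the two error types is system specific, which matches the observation above that optimal values vary.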

Metamorphic machine learning
Metamorphic Machine Learning merges MT and ML, such that an ML Oracle evaluates each member of the MTG, before they are used to evaluate the MR. The integration of MT with ML has been found to improve the effectiveness of ML (Chan et al. 2010). However, the level of this improvement depends on the quality of the ML Oracle. To illustrate, Chan et al. (2010) observed that the extent of the improvement for ML Oracles that used more training data (and were therefore of higher quality) was lower. They rationalised that this was because there was less scope for MT to offer an improvement. Since ML can detect a fault before all of the test cases in the MTG have been executed (Chan et al. 2010), one could argue that the union of MT and ML can also enhance the efficiency of MT, because it may not be necessary to execute all test cases to detect a fault.

Statistical hypothesis testing
In Statistical Hypothesis Testing (SHT), the SUT is executed multiple times to obtain numerous outputs, which are aggregated using summary statistics, e.g. mean and variance. These aggregated values characterise the distribution of this set of outputs and are compared (using a statistical test, e.g. Mann-Whitney U) to values that delineate the expected distribution. Comparisons that do not yield significant differences can be interpreted as evidence that the SUT behaved correctly (Ševčíková et al. 2006), and significant differences are evidence of the contrary.
The test outcome of a system can be unpredictable because of non-determinism, which means that such systems are instances of the oracle problem. SHT was developed to resolve this specific type of oracle problem (Guderlei and Mayer 2007a). SHT recognises that such systems may have a typical output distribution, and that information about this typical output distribution may be available prior to execution, even if information about the test outcome of a single execution is not. Since it conducts testing by checking the SUT's output distribution against the typical output distribution, it can be applied in situations where it is not possible to predict the test outcome of a single execution.
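The SHT workflow can be sketched as follows, assuming the expected output distribution is known in advance: the non-deterministic SUT here is a fair six-sided die (expected mean 3.5), the summary statistic is the sample mean, and the comparison is a two-sided z-test at the 5% level. All of these choices, including the 1.96 critical value, are illustrative assumptions.

```python
# Sketch of Statistical Hypothesis Testing: execute the SUT many times,
# summarise the outputs, and test the summary against the expected
# distribution rather than checking any single execution.
import random

def sut_die(rng):
    return rng.randint(1, 6)  # correct implementation of a fair die

def faulty_die(rng):
    return min(rng.randint(1, 6), 5)  # fault: never rolls a six

def sht(sut, runs=10000, expected_mean=3.5, expected_sd=1.7078, seed=0):
    rng = random.Random(seed)
    outputs = [sut(rng) for _ in range(runs)]
    observed_mean = sum(outputs) / runs
    se = expected_sd / runs ** 0.5            # standard error of the mean
    z = (observed_mean - expected_mean) / se  # two-sided z-test statistic
    return abs(z) < 1.96  # True: no significant difference at the 5% level

print(sht(sut_die))     # usually True; ~5% chance of a false positive
print(sht(faulty_die))  # False: the mean shifts to ~3.33, which is detected
```

Note that even the correct SUT is flagged in roughly 5% of runs at this significance level, which anticipates the false positive / false negative trade-off discussed later in this section.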

Assumptions
The generalisability of SHT is limited (Sim et al. 2013), because the technique makes several assumptions that may not always hold. For example, the SUT or input generation method must be non-deterministic (Mayer and Guderlei 2004). Thus, the technique is not applicable to scenarios in which the SUT is deterministic, and random testing is not used. Another example of such an assumption is that the expected output distribution is known (Mayer 2005). One could use reference implementations (RIs) to determine the expected distribution (Guderlei and Mayer 2007a), if this assumption does not hold. Unfortunately, the negative issues that are associated with the use of RIs may also affect SHT, if this approach is used, e.g. correlated failures. Another issue could be that an RI may not be available (Guderlei and Mayer 2007a), thus the technique might not always be applicable.
The statistical techniques used in SHT make assumptions about the data. This means that some statistics may not be applicable to certain data samples that are produced by a system because these data samples may not satisfy the assumptions of these statistics. To illustrate, Ševčíková et al. (2006) investigated a simulation package and found that the data that was produced by this system either adhered to Normal or Poisson distributions. Parametric statistics assume that the distribution is Normal and so may not be applicable to all of the data samples produced by their simulation package.
In situations in which a test statistic's assumptions have been broken, one could use a different statistic that does not make such an assumption. For example, one could use a non-parametric statistic if the data is not normally distributed. However, it has been reported that non-parametric statistics are less effective (Guderlei et al. 2007), thus doing so may compromise the effectiveness of SHT. Alternatively, it might be possible to modify data samples to satisfy the broken assumptions. For example, Ševčíková et al. (2006) used a test statistic that assumed that variance was constant across all dimensions of an output, but remarked that such an assumption may not always hold. They also stated that log or square root transformations could be used to stabilise the variance (Ševčíková et al. 2006); performing such transformations may therefore resolve the issue. Ševčíková et al. (2006) also compared the performance of Pearson's χ² with a statistic they called LRTS_poisson and found that the latter was more powerful. This suggests that the effectiveness of SHT is partly dependent on the choice of statistical test, and that one should always opt for the most effective applicable statistical test.

Effectiveness of statistical hypothesis testing
The summary statistics that characterise the distributions are also an important determinant of effectiveness. To illustrate, Guderlei et al. (2007) found that variance was more effective than mean. Interestingly, they also observed that the variance and mean detected mutants that the other failed to detect. This indicates that one should use multiple summary statistics.
SHT's performance was abysmal in an experiment conducted by Guderlei et al. (2007). In this experiment, SHT only considered characteristics of the SUT's output, instead of the entire output. The authors suspect that this explains SHT's performance. This suggests that one should maximise the amount of data being considered by SHT to enhance its effectiveness. However, one of the findings of an experiment conducted by Ševčíková et al. (2006) was that tests that considered fewer dimensions of the output could be more effective. This indicates the converse, i.e. reducing some of the data being considered by SHT could improve effectiveness. These conflicting observations suggest that the most appropriate amount of data to expose SHT to is context dependent. We believe that future work that establishes a set of guidelines with respect to the most apposite amount of data to make available to SHT would be valuable.
Yoo (2010) exposed a variant of SHT, called Statistical Metamorphic Testing (see Section 7.3), to different datasets. He observed that the dataset that offered the worst performance may have had outliers and suggested that this may explain its comparatively poorer performance to other datasets. This suggests that the nature of the data is also important.
SHT is necessary, but not sufficient; false positives and negatives are possible (Ševčíková et al. 2006). In SHT, one has control over the significance level. Higher significance levels result in more false positives, but fewer false negatives (Ševčíková et al. 2006) and vice versa. Thus, one can tune the significance level to enable better management of these classification errors.
When SHT reports a failure, it is unclear which test cases are incorrect (Mayer 2005); thus manual inspection of each is necessary. This suggests that reducing the sample size might be beneficial from a debugging perspective. However, it has also been observed that increasing the sample size can lead to an increase in the number of faults that can be detected by the technique (Guderlei et al. 2007); thus, reducing the sample size could lead to a reduction in effectiveness. Unsurprisingly, it has been reported that SHT can be very resource intensive, because it requires a large number of executions to produce stable results (Guderlei et al. 2007). This means reducing the sample size might also be beneficial from an efficiency viewpoint, but doing so may compromise the stability of the technique. There are clearly several trade-offs associated with the sample size that might affect the effectiveness of the technique.

Statistical metamorphic testing
Recall that SHT assumes that one either has knowledge about the expected output distribution, or a reference implementation that can determine it. Guderlei and Mayer (2007a) combined SHT with MT to weaken this assumption; the integrated approach is called Statistical Metamorphic Testing. The approach operates as follows: for a given MR, the source and follow-up test cases are executed multiple times to obtain two or more sets of outputs. Each set is aggregated into one statistical value, and a statistical hypothesis test is evaluated based on these values. This integrated approach also enhances MT's capability to operate on non-deterministic systems (Yoo 2010).
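The procedure can be sketched as follows. The non-deterministic SUT (a subsampling mean estimator), the MR "doubling every input should double the output", the sample sizes, and the two-sample z-test at the 5% level are all illustrative assumptions; Guderlei and Mayer (2007a) and Yoo (2010) discuss more principled choices of statistical test.

```python
# Sketch of Statistical Metamorphic Testing: the source and follow-up test
# cases are each executed repeatedly, and the two output samples are compared
# statistically instead of via an exact equality check.
import random

def sut_mean_estimate(data, rng):
    # Non-deterministic SUT: estimates the mean of `data` from a random subsample.
    sample = [rng.choice(data) for _ in range(30)]
    return sum(sample) / len(sample)

def faulty_sut(data, rng):
    # Hypothetical fault: a constant offset, which breaks the MR below.
    return sut_mean_estimate(data, rng) + 1.0

def smt(sut, data, runs=200, seed=0):
    rng = random.Random(seed)
    source = [sut(data, rng) for _ in range(runs)]
    # MR: doubling the inputs should double the output, so halving the
    # follow-up outputs should recover the source output distribution.
    follow_up = [sut([2 * x for x in data], rng) / 2 for _ in range(runs)]
    mean_s = sum(source) / runs
    mean_f = sum(follow_up) / runs
    var_s = sum((x - mean_s) ** 2 for x in source) / (runs - 1)
    var_f = sum((x - mean_f) ** 2 for x in follow_up) / (runs - 1)
    z = (mean_s - mean_f) / ((var_s / runs + var_f / runs) ** 0.5)
    return abs(z) < 1.96  # True: the MR holds statistically at the 5% level

data = [1.0, 4.0, 7.0, 10.0]
print(smt(sut_mean_estimate, data))  # usually True; ~5% false positive rate
print(smt(faulty_sut, data))         # the offset shifts the two samples apart
```

Note that no expected output distribution is needed: the follow-up sample itself plays the role of the expected distribution, which is exactly how the integration weakens SHT's assumption.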
The integration of these techniques can clearly be advantageous in some respects, e.g. from a generalisability perspective. However, the union of these techniques can also be detrimental in other ways. For example, it was reported that in Statistical Metamorphic Testing, the most appropriate statistical analysis is dependent on the MR (Yoo 2010). This means one must expend additional effort to determine the most appropriate statistical analysis for each MR, which is an otherwise unnecessary task in standard MT. Yoo (2010) investigated the effectiveness of Statistical Metamorphic Testing and found that it is affected by choice of statistical hypothesis test and choice of test cases. He also noted that Statistical Metamorphic Testing was incapable of detecting faults that failed to propagate to the output i.e. cases of coincidental correctness.

Comparing techniques
Sections 3 to 7 described a series of techniques that were devised to alleviate the oracle problem. Each technique was explored in terms of its effectiveness, efficiency, and usability. Sections 8.1, 8.2, and 8.3 compare these techniques on the basis of these issues.

Effectiveness
Certain faults can only be detected by assessing specific oracle information, e.g. specific test cases may be necessary to detect certain faults. Table 5 shows that different techniques have access to different oracle information and thus may find different faults. For example, since Assertions only evaluate the SUT based on oracle information from a single execution, they cannot detect faults that require oracle information from multiple executions. Statistical Metamorphic Testing has access to such information and so can detect such faults. However, some MRs place restrictions on which test cases can be used. Let TC_F be the set of test cases that can manifest a particular fault F. If an MR's restrictions prevent it from using members of TC_F, then it will not be able to detect F. Assertions do not have this restriction and so might be able to detect F. Clearly, practitioners should select testing techniques based on the types of faults that their system is prone to. This highlights some research opportunities; in particular, it may be possible to extend the types of faults that one technique can detect by combining it with another technique that uses different oracle information. An example of this was presented at the end of Section 5.1.
Although different techniques can find different types of faults, they might not be able to detect all instances of these faults. Every technique has limitations in terms of coverage (see Table 5), thus this potential explanation applies to all of the techniques. Alternatively, correlated failures may explain this phenomenon for a subset of the techniques (see Table 5). A large amount of research has been conducted on reducing correlated failures for N-version Testing, but very little has been done in the context of other techniques that are known to experience correlated failures. We therefore believe that such research could be a valuable asset to the community.
Table 5 outlines an example of a design and application option for each technique. Sections 3 to 7 revealed that some techniques have more design and application options than others. Such techniques offer a greater degree of control; this might enable better optimisation of the technique for different contexts. However, it may be more difficult to find a suitable design and mode of application for such techniques.
One's choices regarding a technique's design and application options can have both a positive and negative impact. For example, increasing the number of feature extractors used by an ML Oracle can lead to improvements in one type of classification, but reductions in another. Unfortunately, guidelines on how one should exploit many of these types of design and application options for their context have not been proposed. Research that leads to the establishment of such guidelines would be useful.
Unfortunately, to the best of our knowledge, empirical data regarding the effectiveness of Assertions for coincidental correctness is unavailable. We therefore believe that significant value can be gained by studying this technique in the context of coincidental correctness and the oracle problem. Interestingly, Sections 3 to 7 suggest that the remaining techniques can be ineffective for coincidental correctness (see Table 5). Thus, research that explores methods of reducing the impact of coincidental correctness on these techniques would be valuable. For example, Clark and Hierons (2012) and Androutsopoulos et al. (2014) developed a series of metrics that estimate the probability of encountering coincidental correctness on particular program paths. Such metrics can be used to select test cases that are less susceptible to coincidental correctness.

Efficiency
Table 6 reveals that the contexts in which the different techniques perform particularly poorly may differ. For example, the feature extractors that are available in a certain context may be particularly inefficient, but the SUT in this context may have very few large arrays. This means that Assertions and Machine Learning may be efficient and inefficient in this context, respectively. Conversely, in another context, the SUT may contain an abundance of these programming constructs, and so Assertions may be inefficient, but the feature extractors available in this context may be efficient. The following paragraphs give examples of reasons that may explain why the efficiency of each technique might vary in different contexts (Table 6).

N-Version Testing
RIs must be developed to replicate the functionality of the SUT, but they do not necessarily have to mimic other quality attributes, including efficiency. Thus, in some situations, one might develop an RI to be as efficient as (or more efficient than) the SUT, but in another situation, one might opt to disregard efficiency completely.
Metamorphic Testing
Different MRs have different MTG sizes. Therefore, MT's efficiency in a particular context will be partly determined by the MTG sizes of the MRs that are applied in that context.

Assertions
Assertions can be inefficient at testing large arrays. Thus, the overall efficiency of Assertions in a particular context, will be determined by the number of large arrays in the SUT that must be checked by the technique.
Machine Learning
Some feature extractors are more efficient than others; since the appropriate choice of feature extractors is domain specific, the efficiency of ML may vary across domains.

Statistical Hypothesis Testing
There is a trade-off between efficiency and result stability (which is determined by sample size). In one situation, the tester might require greater result stability than in another, and thus might have to sacrifice efficiency to a greater extent.

Usability
Table 7 demonstrates that the effort required to apply each technique can vary. Techniques may differ in terms of the contexts in which they are difficult to use. For instance, it may not be possible to obtain an RI via component harvesting for the SUT, and so manual construction of an RI may be necessary in a certain context. In the same context, all of the assumptions of a statistic being used in SHT may be satisfied by the data, so data transformation tasks are unnecessary. N-version Testing may require substantial effort in such a scenario, but SHT may not. The converse is also possible. Table 7 also shows that the required expertise for different techniques varies. Thus, one's choice of technique may partly depend on the expertise currently available. For example, if one lacks knowledge about machine learning, but has an adequate understanding of statistics, then one may be more inclined to select Statistical Hypothesis Testing instead of Machine Learning Oracles.

Related work
Although several Systematic Literature Reviews that target associated areas exist, each one explores the subject matter from a different perspective and thus offers a distinct contribution. For example, many systematic reviews had different scopes, which means they surveyed different sets of papers. Some had a more constrained scope: Nardi and Delamaro (2011) were restricted to Dynamical Systems, Baresi and Young (2001) to specification- and model-based testing, and others to Scientific Software. Harman et al. (2013) had a wider scope, e.g. they accounted for non-automated solutions like crowd sourcing. However, they had different relevance criteria and a different search strategy, so their systematic review procured different studies.

Table 7 Situational effort that may be associated with each technique, and the skills and knowledge that might be required to use it

N-Version Testing
Situational effort: In some situations, it might be necessary to implement an RI from scratch, but this may not be necessary in other situations.
Required skills and knowledge: The capability to write programs in different programming languages.

Metamorphic Testing
Situational effort: The tester may have to spend additional time and effort studying the domain to acquire domain knowledge in situations in which a domain expert is not available.
Required skills and knowledge: Domain knowledge.

Assertions
Situational effort: The amount of effort required to write assertions in a given context will depend on the programming language being used in that context.
Required skills and knowledge: Domain knowledge.

Machine Learning
Situational effort: Certain ML tasks are necessary in some situations, but are unnecessary in others, e.g. labelling dataset items.
Required skills and knowledge: Knowledge about machine learning.

Statistical Hypothesis Testing
Situational effort: Tasks like data transformation may be necessary in some situations, but not others.
Required skills and knowledge: Knowledge about statistics.

Different systematic reviews also conducted different types of synthesis. Harman et al. (2013) and Nardi and Delamaro (2011) conducted a higher level synthesis, which means their synthesis was effective for finding high level research opportunities, e.g. measurements for oracles (Harman et al. 2013), but less capable of identifying lower level research opportunities like a technique's relationship with specific fault types. Baresi and Young (2001) performed a low level synthesis, but the nature of their data is different, e.g. they explored multiple specification languages from a high level view, instead of a finer grained inspection of issues that generalise to all specifications. Finally, some systematic reviews have additional or different objectives. For example, Pezzè and Zhang (2014) and Oliveira et al. (2014a) endeavoured to establish a taxonomy to classify oracles, and Harman et al. (2013) examined trends in research on the oracle problem.
Since our Mapping Study takes a unique perspective on the Oracle Problem in terms of the combination of scope, type of synthesis and objectives, it also offers a distinct contribution. In particular, our Mapping Study surveyed the literature on automated testing techniques that can detect functional software faults in non-testable systems. It also presented a series of discussions about each technique, from different perspectives like effectiveness and usability, performed a set of comparisons between these techniques, and identified research opportunities.

Threats to validity
This section outlines the main threats to validity and how they were mitigated. Threats are organised by Mapping Study phase.

Search
Since the first author was unfamiliar with the problem domain at the outset, relevance misclassifications were possible. To reduce this possibility, edge case papers were conservatively kept for more detailed analysis, after more knowledge had been accrued.
Many of the titles and abstracts did not give sufficient information about the true intent or scope of the paper, which may have led to misclassifications. To reduce the impact of this threat, the authors of known relevant papers were emailed our list of their relevant papers and asked to confirm that the list was comprehensive.
Another threat stems from the restrictions that were placed on the search, e.g. the number of research repositories that were searched. These restrictions were necessary to keep the study feasible. To reduce their impact, we applied several other search strategies, e.g. perusing reference lists.
The search facilities offered by many repositories were flawed, which means that they may not have returned all relevant studies. Where necessary, workarounds were used to address this problem, e.g. using Google's "site:" operator to search the ACM DL.
There are also threats to repeatability; web content is ever growing, and thus the rankings of web pages are ever changing, which means that, in a repeated search, 50 consecutive irrelevant results may appear earlier than in the first search, or only after significantly more results have been examined.
Including grey literature is an important step towards combating publication bias (Kitchenham 2007) and obtaining cutting edge research. We used resources like Google and CiteSeerX to obtain such literature.
Determining the relevance of a paper is a subjective task. To reduce subjectivity, an inter-rater reliability test was conducted independently by two researchers on the Relevance Inclusion and Exclusion Criteria, on a sample of 12 papers. The results of this test were used to increase the precision of our criteria.
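To illustrate the kind of agreement measure on which such an inter-rater reliability test can be based, the following sketch computes Cohen's kappa over relevance verdicts from two raters. The function and the 12-paper sample below are hypothetical illustrations, not the study's actual data or procedure.

```python
# Illustrative sketch: Cohen's kappa as an inter-rater reliability measure
# for Include ("I") / Exclude ("E") relevance decisions on a paper sample.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of items both raters labelled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical verdicts for a 12-paper sample.
rater_1 = ["I", "I", "E", "I", "E", "E", "I", "I", "E", "I", "E", "I"]
rater_2 = ["I", "I", "E", "E", "E", "E", "I", "I", "E", "I", "I", "I"]

print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.657 on this sample
```

A low kappa value would indicate that the Inclusion and Exclusion Criteria leave too much room for interpretation, prompting the kind of refinement described above.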

Data extraction
The nature of the data being captured was broad, and none of the available data extraction forms were flexible enough to capture all of the important data. Thus, a data extraction form was specifically developed for this Mapping Study, with appropriate inbuilt flexibility.
Some of the papers did not report all of the data that was necessary to complete the data extraction form, and we were unable to elicit some of this data from the authors of these papers. In such cases, it was necessary to make assumptions about the data. Although these assumptions were informed, there is a chance that they may have been incorrect.

Quality criteria
None of the available quality instruments were suitable, and the adoption of an inappropriate quality instrument may lead to inaccurate classifications. Thus, an appropriate quality instrument was developed. The design of our quality instrument was based on the guidelines of Kitchenham (2007), and drew on 27 examples of quality instruments as well as domain knowledge.
Measuring the quality of a paper involves some degree of subjectivity. To that end, two researchers independently conducted a test of inter-rater reliability on the quality instrument, on a sample of 12 papers. We used the results of this test to fine tune our quality criteria.

Throughout the process
Many decisions were necessarily subjective; several practices were adopted to decrease potential bias introduced through subjectivity. For example, as mentioned above, inter-rater reliability tests were conducted on several critical, subjective parts of the process. The review protocol was also defined prior to starting the Mapping Study, which enabled most subjective decisions to be taken before the data had been explored.
Additionally, where possible, subjectivity in processes was reduced through careful design, e.g. the relevance Inclusion and Exclusion Criteria are based on relatively objective guidelines.
We contacted the authors of the papers that were covered by the Mapping Study, at various stages of the process, by email, to elicit information and/or to obtain confirmation on various issues. In some cases, it was not possible to contact an author, and a large proportion of the authors did not reply (an author was not considered to have replied if no response arrived within a month of the last email that was sent). Given the large number of authors, human error is also possible, i.e. we may have failed to email a small number of them. Additionally, even though many of the authors did reply, some of their responses only addressed a subset of the issues. This could affect our results: for example, authors with whom we did not establish contact might have had a paper that should have been included in the Mapping Study, and had authors addressed all of the issues, certain assumptions about their work would not have been necessary.

Conclusion
In this paper, we conducted a mapping study on automated testing techniques that can detect functional software faults in non-testable systems. In particular, five techniques were considered: N-Version Testing, Metamorphic Testing, Assertions, Machine Learning, and Statistical Hypothesis Testing.
A series of discussions revolving around issues like effectiveness, efficiency, and usability were presented for each technique, and these techniques were compared against each other on these issues. It is our hope that this material will be a useful resource for researchers that are attempting to familiarise themselves with, or navigate, the field. We have embedded our own insights, which emerged from an analysis of these discussions and comparisons, throughout the material; we therefore believe that the material will also be beneficial for researchers that are already accustomed to the field. Finally, we envisage that the material could assist practitioners in selecting the most apposite technique for their context, as well as help them make an informed decision on the most beneficial method of applying the chosen technique in their particular situation. To exemplify the latter, a practitioner that has decided to use Metamorphic Testing under tight deadlines could consult Section 4 for strategies on improving the efficiency of Metamorphic Testing, e.g. reducing an MR's MTG size, and for the ramifications of using these strategies; in this case, reducing MTG size might reduce the effectiveness of Metamorphic Testing.
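To make the Metamorphic Testing terminology concrete, the following hypothetical sketch (not taken from any of the surveyed papers) tests Python's math.sin against the metamorphic relation sin(x) = sin(π − x). No oracle is needed for any individual output; only the relation between two executions is checked. The n_source_inputs parameter plays the role of the MTG size trade-off discussed above.

```python
# Illustrative sketch of Metamorphic Testing; the SUT here is math.sin.
# MR: sin(x) == sin(pi - x), checked without predicting either output.
import math
import random

def metamorphic_test_sin(n_source_inputs, tolerance=1e-9):
    """Run the MR sin(x) = sin(pi - x) over a metamorphic test group (MTG).

    n_source_inputs controls the MTG size: a smaller MTG is cheaper to
    execute, but may be less likely to expose a fault.
    """
    rng = random.Random(0)  # fixed seed, for repeatability
    violations = []
    for _ in range(n_source_inputs):
        x = rng.uniform(-math.pi, math.pi)   # source test case
        follow_up = math.pi - x              # follow-up test case
        if abs(math.sin(x) - math.sin(follow_up)) > tolerance:
            violations.append(x)             # MR violated: likely fault
    return violations

print(len(metamorphic_test_sin(100)))  # 0 violations for a correct sin
```

A faulty implementation of sin would be likely to violate this MR for at least some source inputs, which is how Metamorphic Testing detects faults despite the absence of an oracle.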
The aforementioned material also highlighted several potential research opportunities, which may serve to steer the direction of future research endeavours. These include the opportunity for new testing techniques that can tolerate coincidental correctness in non-testable systems, more empirical studies that explore the effectiveness of Assertions in the context of the oracle problem, and the development of context specific guidelines on how each technique should be used. It is our hope that this Mapping Study will raise awareness of these research opportunities in the research community and that researchers will pursue them. Researchers may also use the results of the Mapping Study to increase confidence in the novelty of the contributions they might make by pursuing these research opportunities.