1 Introduction

One of the significant recent advances in software testing has been in the area of automation. It is now possible to automatically generate test data for an arbitrary system that achieves a remarkably high level of coverage. However, there is a downside to this: unless the expected outputs are specified completely and explicitly, the outputs from all this automatically generated data need to be checked to determine whether they are correct. In theory, this should just be a matter of comparing the results with those defined by the complete (and, ideally, machine processable) system specification, but unfortunately such an artefact rarely exists. As a result, a huge amount of human effort is needed if there is a large set of test cases. Involving people in the process is expensive and possibly error-prone, and therefore some other strategy for building an oracle (a mechanism for determining the (in)correctness of outputs associated with inputs) needs to be developed.

Anomaly detection is a general set of strategies that can be used to detect unusual values or outliers in large data sets. It has been employed successfully in various research areas such as cyber-intrusion detection, fraud detection, industrial damage detection, image processing, system health monitoring, event detection in sensor networks, and detecting eco-system disturbances (Chandola et al. 2009). The aim of the work reported in this paper is to investigate whether software bugs generate an anomalous pattern of behaviour that can be distinguished from normal (non-buggy) behaviour. If this is the case, then the possibility of detecting bugs automatically can be raised.

This paper reports on an extended experiment into the use of a range of clustering-based anomaly detection techniques to support the construction of a test oracle. In the first part of the study, a range of clustering algorithms are applied to just the test case input/output pairs of three systems and the effectiveness of this approach is evaluated. In the second part of the study the test case input/output pairs are augmented with their associated execution traces with the aim of improving the accuracy of the approach and the results of a second experiment investigating and evaluating this are presented.

The main findings of the first study provide evidence supporting the feasibility of using anomaly detection techniques to separate passing and failing test results, and confirm the findings of our preliminary study in the area (Rafig and Roper 2015). The results vary considerably between systems and, to a lesser extent, between algorithms, but show that smaller than average-sized clusters exhibit both a far higher density of failures and a sample of the range of faults in the program. The practical implication is that the task of checking outputs can be reduced to a fraction of that normally required: the majority of failures in a system can be observed by inspecting a minority of the results. Reducing the proportion of failures in the output also demonstrates that the approach is robust to a decline in the failure intensity rate.

2 Principles and an illustrative example

The concept behind the approach explored in this paper is illustrated in Fig. 1. The Software Under Test (SUT) is executed with test case inputs which generate corresponding outputs. The paths taken by the test cases through the SUT are represented by the wavy lines in the figure, and some of these might encounter faults in the software (represented by * symbols) which may cause some of the outputs to fail. It is also possible to collect this trace information as part of the testing process. So after running the tests we have a set of input test cases, outputs from the tests, and (if we choose to collect them) execution traces. At this stage it is unknown which of the tests have passed and which have failed unless all of the outputs are examined to see if the results are as expected. The approach presented in this paper explores the application of clustering to separate the passing and failing tests into distinct clusters: the failing outputs (being less frequent) gather in the smaller clusters and the more frequent passing tests group into larger clusters. Checking the results then proceeds by examining the contents of the smallest clusters first, as these should be more likely to contain the failing outputs. In this way, failing outputs are identified sooner and the process of checking results becomes more efficient.

Fig. 1 Principles of using clustering to automatically classify failing outputs. The program under test is run on a set of inputs which will generate outputs and optional traces, and may encounter bugs in the program (the *’s). The pass/fail status of the outputs is unknown and the aim is to automatically separate these using clustering strategies

To illustrate this idea, a small example is presented which demonstrates the principles behind the approach and the potential benefits it offers to the software engineer. This illustration is taken from Defects4J, a collection of open-source systems, faults, and an infrastructure for running and profiling tests. Part of the tutorial documentation for Defects4J contains the infamous triangle problem, a well-known testing example which takes three integer inputs and returns the corresponding type of triangle represented by these three values: equilateral, isosceles, scalene or invalid. This program comes with several faults in the form of mutants that may be applied and 35 test cases. In the example here the fault takes the form of the line in the program containing:

figure c

being replaced by:

figure d

The tester is of course unaware of this and runs the tests to produce the results shown in Table 1. In this illustration no use is made of JUnit or any other unit testing framework in order to specify the expected results. Although this is not necessarily good practice, it is not unusual: the ISTQB Worldwide Software Testing Practices Report surveyed 3200 test managers and technical staff from 89 countries and found that unit testing tools were employed in just under 43 % of organisations (ISTQB 2016). Furthermore, it is not always easy to specify the results of a test (resulting in partial or incomplete oracles), and testing is also carried out at many levels (integration, system, acceptance, regression etc.) where the tests may not be defined using one of the nUnit family of frameworks.

Table 1 Inputs and actual outputs for the triangle example

In typical circumstances, the tester would then proceed to work through the outputs one by one to check whether each test passed or failed. Instead of doing this, we first group the related inputs and outputs together into 35 vectors (<0,1301,1,INVALID>, <1108,1,1,INVALID> ... <1108,2,2,ISOSCELES>) and then apply a clustering algorithm. This groups the data into 4 clusters, illustrated by Fig. 2: one large cluster containing 29 items and 3 much smaller ones containing 1, 2 and 3 items, respectively. By concentrating first of all on the small clusters, the tester would find two failing outputs after examining just six results: these are T34 and T35, which appear in cluster 3 along with the passing case T13 (for information, cluster 1 contains T8 and cluster 2 contains T10 and T14). At this point the programmer may feel that they have enough evidence that the program is not working and choose to stop examining test results and work on debugging the program. This evidence has been obtained after looking at just a fraction of all test outputs, saving the developer time and making the testing process much more efficient.
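The inspection strategy described above can be sketched in a few lines of Python. This is only an illustrative sketch: the cluster membership is the one reported in the text, while the function name `inspection_order` and the label `0` for the large cluster are assumptions introduced here.

```python
from collections import defaultdict

def inspection_order(assignments):
    """Return test ids ordered so that members of the smallest clusters come first."""
    clusters = defaultdict(list)
    for test_id, cluster_id in assignments.items():
        clusters[cluster_id].append(test_id)
    ordered = sorted(clusters.values(), key=len)   # smallest cluster first
    return [t for members in ordered for t in members]

# Membership of the three small clusters as reported in the text;
# every other test is placed in the large cluster (labelled 0 here by assumption).
small = {"T8": 1, "T10": 2, "T14": 2, "T13": 3, "T34": 3, "T35": 3}
assignments = {f"T{i}": small.get(f"T{i}", 0) for i in range(1, 36)}
failing = {"T34", "T35"}

order = inspection_order(assignments)
inspected, found = 0, set()
for test_id in order:
    inspected += 1
    if test_id in failing:
        found.add(test_id)
    if found == failing:
        break

print(inspected)  # prints 6
```

Under these assumptions both failing tests are reached after six inspections, matching the count given in the text.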

The purpose of the work reported in this paper is to explore whether this approach scales up to larger systems where there are hundreds or thousands of test cases: do failing outputs tend to gather in the smaller clusters meaning that developers can confidently focus their efforts on just a small proportion of test results, and in the case where there are multiple failures then what proportion of these feature in the small clusters?

Fig. 2 Clusters generated from triangle example data. The clustering technique groups the test inputs and outputs into four clusters. There are two failing tests which appear in cluster 3. The remaining 33 are all passing tests

3 Background and related work

The automatic generation of test oracles is an important problem in the software testing area, but it has received considerably less attention than other testing problems such as the generation of test cases. Three extensive reviews of test oracles exist: by Baresi and Young (2001), by Pezzè and Zhang (2005), and by Barr et al. (2015), who classified the existing literature on test oracles into three broad categories: specified oracles, implicit oracles, and derived oracles. Specified oracles are test oracles obtained from a formal specification of the system behaviour. For instance, Doong and Frankl developed the ASTOOT tool, which generates test suites along with test oracles from algebraic specifications (Doong and Frankl 1994). In their work, test oracles can be generated by the ASTOOT tool and then used to verify the equivalence between two different execution scenarios. Specified oracles are effective in finding system failures but their success depends heavily on the availability of a formal specification of the system behaviour. However, the vast majority of systems lack an accurate, complete and up-to-date machine readable specification. Therefore, the applicability of specified oracles is limited.

Implicit oracles are test oracles generated without requiring any domain knowledge or formal specification. Hence, they can be applied to all runnable programs in the general sense. For example, in the fuzzing approach proposed by Miller et al. (1990), the main principle is to generate random inputs and attack the system to find faults which cause the system to crash. If a crash is spotted, then the fuzz tester reports the crash together with the set of inputs or input sequences that caused it. The fuzzing approach is widely used to detect security vulnerabilities such as buffer overflows and memory leaks.

Derived oracles are synthesised from properties of the system under test, or several artefacts other than the specification (e.g. documentation and system execution information), or other versions of the system under test. For instance, metamorphic testing has been used to test search engines such as Google and Yahoo (Zhou et al. 2012), and the BERT tool may be used to identify behavioural differences between two versions of a program from examining inputs, outputs, return values and program states (Jin et al. 2010)—a promising regression testing approach but one which relies on the presence of a previous reference version of the software, which may not always be available or suitable.

Our work is rooted in the area of derived oracles from system executions; therefore, the related work can be divided into two main sections: test oracles based on invariant detection and test oracles based on anomaly detection.

3.1 Test oracles based on invariant detection

Program behaviours can be automatically checked against given invariants for violations. Therefore, invariants can be used as test oracles to distinguish correct from incorrect outputs. Invariants are often inserted into the code by the developers, but this again can be a costly exercise and an additional burden at the time of coding. Daikon can be used to learn and infer invariants from program executions dynamically by using a collection of inputs (test cases), monitoring key values (class attributes, method entry and exit points, loop invariants etc.) and then making inferences from this large set of observations (Ernst et al. 2007). Sekar et al. (2001) proposed an approach to learn Finite State Automata (FSA) by using sequences of system calls. Their approach deals with system security and is aimed at detecting anomalous sequences of system calls which are likely to point to intrusion attempts and malware. Hangal and Lam build up invariants over program variables from the executions of the passed tests and then use any violations of these invariants to identify potential bugs (using the DIDUCE tool) (Hangal and Lam 2002).
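The general idea of using learned invariants as an oracle can be illustrated with a deliberately simple sketch. It assumes that the only kind of invariant is a numeric range per variable, which is far narrower than what Daikon or DIDUCE infer; the function names and the variables `size` and `index` are hypothetical and are not taken from either tool.

```python
def learn_ranges(passing_runs):
    """Infer a [min, max] range for every variable observed in the passing runs."""
    ranges = {}
    for run in passing_runs:
        for name, value in run.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges

def violations(run, ranges):
    """Return the names of variables whose values fall outside the learned range."""
    return sorted(
        name for name, value in run.items()
        if name in ranges and not (ranges[name][0] <= value <= ranges[name][1])
    )

# Hypothetical observations of two variables in three passing executions.
passing_runs = [
    {"size": 3, "index": 0},
    {"size": 5, "index": 4},
    {"size": 4, "index": 2},
]
ranges = learn_ranges(passing_runs)

# A new execution whose index lies outside anything seen so far.
suspicious = violations({"size": 4, "index": 7}, ranges)
print(suspicious)  # prints ['index']
```

A violation reported this way is only a hint that the execution is unusual; it still has to be confirmed as a genuine failure.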

3.2 Test oracles based on anomaly detection techniques

Chandola et al. (2009) define anomaly detection as a matter of spotting patterns in data that correspond to abnormal behaviour. This concept is illustrated in Fig. 3: the unfilled circles represent regions of normal behaviour, whereas the filled points represent anomalous data. The aim of the work reported in this paper is to investigate whether software bugs generate a non-conformant pattern of behaviour that can be distinguished from the conformant or normal behaviour. In other words, in Fig. 3, do the groups of unfilled points correspond to passed tests and the filled ones to failures? If this is the case, then the possibility of detecting bugs automatically can be raised.

Fig. 3 Principle of anomaly detection. Non-anomalous items (unfilled circles) group together in larger clusters while anomalous ones (filled circles) are left isolated

The main principle of creating test oracles in this context is to hypothesise a formal model of program behaviours from sets of observations. There is a large body of work on using anomaly detection strategies such as clustering and classification techniques to support software testing tasks. However, these typically operate on quite different types of data set (e.g. execution traces), or utilise semi-supervised or supervised learning strategies (such as the presence of a previous version of the program). Consequently, the application of anomaly detection strategies in this context has not been extensively investigated (for a detailed review of anomaly detection techniques and applications see the work of Chandola et al. (2009)). The remainder of this section discusses some recent work in this area, classified into three main categories: (1) unsupervised learning techniques; (2) semi-supervised learning techniques; (3) supervised learning techniques.

Unsupervised learning techniques do not require training data and thus are most widely applicable. The techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the test data. If this assumption is not true, then such techniques suffer from a high false alarm rate. Examples of such work include that of Dickinson et al. (2001, 2001) who demonstrated the advantage of automated clustering of execution profiles over random selection for finding failures by using function caller/callee feature profiles as the basis for cluster formation. This work is in turn based on that of Podgurski et al. (1999), who used cluster analysis of profiles and stratified random sampling to calculate estimates of software reliability and found that failures were often isolated in small clusters based on unusual execution profiles. Our work is similar to this and explores the same observed hypothesis about the distribution of failures over clusters, but we investigate the use of test case input/output pairs (and input/output pairs combined with execution profiles) from the system under test instead of execution profiles alone.

Semi-supervised learning techniques typically assume that training data has labelled instances for only the normal class (i.e. a subset of passing test cases needs to be identified). A model is built for the class that corresponds to normal behaviour and then used to identify anomalies in the unlabelled test data. Podgurski et al. investigated how bugs could be classified when represented by failed tests that had the same cause (Podgurski et al. 2003). Their approach was based on the analysis of the execution profiles corresponding to reported failures and was built on top of their earlier unsupervised learning system. Bowring and colleagues proposed an automatic classification of program behaviours using execution data, which aimed at reverse engineering a more abstract description of a system’s behaviour (Bowring et al. 2004).

Supervised learning techniques assume the availability of a training data set which has labelled instances for normal as well as anomaly classes and are therefore the least generally applicable. However, they have been used successfully in regression testing, where a reference version of the software exists which makes accurate data labelling possible. For example, Vanmali et al. (2002) trained a multi-layer neural network on the original software application by using randomly generated test data that conformed to the specification. When new versions of the original application were created and regression testing was required, the tested code was executed on the test data to yield outputs that were compared with those of the neural network. Frounchi et al. explored the possibility of using supervised learning as a test oracle (Briand 2008) within the image processing domain.

4 Clustering techniques

Clustering aims to partition a population of objects, each containing various attributes, into groups in such a way that objects with similar values are placed in the same cluster, whereas those with dissimilar ones are placed in different clusters. The similarity of objects can be decided by using different distance metrics (discussed in more detail in Sect. 5.2). In this work the objects of interest are observations from program executions (test inputs and outputs and execution traces) and the aim of clustering is to separate the passing and failing executions. There is a very large variety of approaches towards clustering and so far this work has explored the use of the following algorithms: agglomerative hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN) and expectation-maximization clustering (EM). The following subsections give a brief description of each approach. For further details on the techniques the reader is referred to, for example, the work of Han et al. (2012) or Witten and Frank (2005).

4.1 Agglomerative hierarchical clustering

The agglomerative hierarchical algorithm is an example of a clustering approach that aims to build a hierarchy of objects. The core principle of this type of clustering method is that the objects are more related to nearby objects (as defined by the distance metric) than to objects farther away. A hierarchical clustering method can be either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.

Agglomerative hierarchical clustering initially assigns each object to its own cluster, calculates the distance between each pair of clusters, and combines the most similar ones. This process is repeated, building larger and larger clusters at higher levels of the hierarchy, until no close similarity or dissimilarity between two clusters can be found.

Divisive hierarchical clustering operates in the opposite fashion, initially assigning all objects into one cluster and then dividing this main cluster into smaller ones based on object dissimilarity until no further splits can be made.

In both approaches the user can specify the desired number of clusters as a termination condition.
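The bottom-up merging process can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes numeric points, Euclidean distance, and single linkage (the distance between two clusters is that of their closest members), and it uses the desired number of clusters `k` as the termination condition.

```python
import math

def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]   # every object starts in its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(x, y) for x in a for y in b)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # combine the most similar pair
        del clusters[j]
    return clusters

# Hypothetical data: one dense group of four points and two isolated points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (20, 0)]
clusters = agglomerative(points, 3)
sizes = sorted(len(c) for c in clusters)
print(sizes)  # prints [1, 1, 4]
```

The quadratic search for the closest pair keeps the sketch short; practical implementations cache the pairwise distances.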


4.2 Density-based spatial clustering of applications with noise (DBSCAN)

Density-based spatial clustering of applications with noise is an example of a density-based clustering approach, grouping together those objects that are close neighbours, which allows it to find arbitrarily shaped clusters. Unlike agglomerative hierarchical clustering, the number of clusters is determined automatically after specifying two key parameters: the minimum number of points in a cluster and the distance between them. The approach also supports the notion of an outlier, that is, an object not belonging to any cluster. A cluster is defined as containing at least a minimum number of points (MinPts), every pair of points of which either lies within a user specified distance (\(\epsilon\)) of each other or is connected by a series of points that each lie within distance \(\epsilon\) of the next point in the chain. Smaller values of \(\epsilon\) yield denser clusters. Depending on the value of \(\epsilon\) and the minimum cluster size, it is possible that some objects will not belong to any cluster (these outliers are considered as noise).
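A compact sketch of this procedure is given below. It follows the commonly used formulation of DBSCAN in terms of core points (points with at least `min_pts` neighbours within `eps`, counting the point itself), which is an assumption about details the description above leaves open; the data and parameter values are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it belongs to no cluster."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # prints [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (20, 20) is reported as noise, which is the kind of object the approach in this paper would inspect first.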

4.3 Expectation-maximization (EM) clustering algorithm

The EM clustering algorithm is an example of a probability-based clustering approach. In contrast to an approach such as k-means clustering, in which a fixed number of clusters (k) is given at the outset and objects are assigned to those clusters so that the means across clusters (for all objects) are as different from each other as possible, EM works purely from the set of objects without any a priori information to find the most likely set of clusters from a probabilistic perspective. EM operates iteratively, assigning data objects to clusters and updating the parameters of the probability distributions governing the various clusters until the best model is found.
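The alternation between assigning objects to clusters and updating the distribution parameters can be illustrated with a small sketch. It makes strong simplifying assumptions that are not part of the description above: the data are one-dimensional, each cluster is a Gaussian, and the number of components is fixed at two (selecting the number of clusters automatically is not shown). The function name and data are hypothetical.

```python
import math

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM; return (means, hard labels)."""
    means = [min(data), max(data)]   # simple deterministic initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]

    def density(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * density(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)   # floor avoids a degenerate component
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels

# Hypothetical data: six values near 1 and two values near 10.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 9.8, 10.2]
means, labels = em_two_gaussians(data)
print(labels)  # prints [0, 0, 0, 0, 0, 0, 1, 1]
```

The two values near 10 end up in their own small component, while the majority of the data forms the larger one.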

5 Experimental evaluation

Two main experiments were run to evaluate the effectiveness of clustering techniques in separating failing and passing tests.

Experiment 1: In the first experiment the input to the clustering algorithms consisted of just the test case inputs along with their associated outputs.

Experiment 2: The second experiment extended this by adding to the input/output pairs their corresponding execution trace.

The main hypothesis under investigation is: “Normal data instances belong to large and dense clusters, while anomalies (failures) either belong to small or sparse clusters”. In other words, is the execution data which falls outside the clusters or appears in small (sparse) clusters indicative of bugs? Data about the distribution of failures over clusters, the impact of the number of clusters, the density of clusters, and the number of faults revealed per cluster were analysed to examine this hypothesis. This section describes the framework used for the experiments.
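One simple way of examining this hypothesis for a given clustering is to compare the failure density of smaller-than-average clusters with that of the remaining clusters. The sketch below shows such a calculation; the function name, the split at the average cluster size, and the data are illustrative assumptions rather than the exact analysis procedure of the study.

```python
def failure_density_by_size(clusters, failing):
    """Split clusters at the average size; return (small, large) failure densities."""
    average = sum(len(c) for c in clusters) / len(clusters)

    def density(group):
        members = [t for c in group for t in c]
        if not members:
            return 0.0
        return sum(t in failing for t in members) / len(members)

    small = [c for c in clusters if len(c) < average]
    large = [c for c in clusters if len(c) >= average]
    return density(small), density(large)

# Hypothetical clustering of nine tests, three of which fail.
clusters = [["t1", "t2", "t3", "t4", "t5", "t6"], ["t7", "t8"], ["t9"]]
failing = {"t3", "t8", "t9"}

small_density, large_density = failure_density_by_size(clusters, failing)
print(round(small_density, 2), round(large_density, 2))  # prints 0.67 0.17
```

In this made-up example two of the three tests in small clusters fail, against one in six in the large cluster, which is the pattern the hypothesis predicts.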

5.1 Subject programs

Versions of three subject programs were used in this study: the NanoXML XML parser system, the scalable internet event notification architecture system (Siena), and the Sed stream editor. All are available from the Software Infrastructure Repository (SIR), are non-trivial systems, have several versions with well-documented faults embedded within them (these are either real or seeded, typically coming from faults in previous versions of the system), and also come with test suites. The last point is an important factor, as having sets of good, but independently created, tests is vital for this experiment.

NanoXML NanoXML is a non-GUI based XML parser written in Java. NanoXML consists of a component library and an application JXML2SQL which takes as input an XML file and either transforms it into a html file, showing the contents in tabular form, or into an SQL file. NanoXML has 24 classes, five versions (although the fourth version was excluded as it contains no faults), each containing multiple faults—seven in each of versions 1–3 and eight in version 5—and 214 test cases. The error rates in all faulty versions ranged from 31 to 39 % (the error rate is the proportion of the supplied test cases which will fail due to the seeded faults).

Siena Scalable internet event notification architecture (Siena) is an Internet-scale event notification middleware for distributed event-based applications deployed over wide-area networks. Siena is responsible for selecting notifications that are of interest to clients (as expressed in client subscriptions) and then delivering those notifications to the clients via access points. Siena contains 26 classes (nine in its core and 17 which constitute an application), 567 test cases and seven faulty versions: three with a single fault and four with multiple faults. Versions with multiple faults (V1, V3, V5 and V7) have been excluded from this experiment for the time being because of the absence of a fault matrix (a simple way of establishing which test cases are responsible for revealing which fault). Therefore, only V2, V4 and V6 are included in the experiment, each having a single fault and an error rate of 17 %.

Sed Stream editor (Sed) is a Unix utility that parses and transforms text using a simple compact programming language. Sed takes a set of commands and a text stream and performs some operation (or set of operations) on the input stream. Sed is typically used for extracting part of a file using pattern matching or substituting multiple occurrences of a string within a file. Sed is written in C, has 225 functions, 370 test cases and seven versions with multiple faults. Only one version was used in this experiment: version 5 which contains 4 faults and has an error rate of 18 %.

5.2 Experimental set-up

The main components of the experiment were: a set of programs with known failures, a set of test case inputs for each program, a way to determine whether an execution of each test was successful or not (passed or failed), and a mechanism for recording the execution trace taken through the program by each test. The seeded versions of the subject programs were run on the test cases to produce the associated outputs, and Daikon (Ernst et al. 2007) was used to obtain the execution traces. The resulting set of test case input/output pairs was augmented with their associated execution traces, transformed to reduce the volume of data (traces are often very large), and then analysed using several clustering algorithms. Knowing which data objects corresponded to failed test cases enabled us to determine how well the clustering algorithms performed. Each of these steps is described in more detail below.

Test case input/output pair collection The subject programs come with Test Specification Language (TSL) test suites and tools to run these automatically (details are available from the SIR repository and the article by Do et al. (2005)). Test cases which failed to produce any output were discarded (seven out of 214 for NanoXML, 73 out of 567 for Siena, and seven out of 370 for Sed, giving final test case numbers of 207, 494 and 363, respectively).

Execution trace collection Daikon was used to instrument the subject programs in order to collect the execution traces used in the second experiment. For the subject programs, we executed each test case to produce its associated execution trace. Daikon allows programs to be monitored and traced at varying levels of granularity, but for this study we extracted sequences of method invocations (entry points) and method exits in the order they occurred during test execution.

Identification of failures The NanoXML and Sed systems come with matrices which map test cases to failures corresponding to faults, which makes the identification of faults effectively automatic. Siena has no such fault matrix, so the test outputs of the original version were compared with those of the faulty versions to find the failing tests.
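The comparison against the original version, together with the error rate defined earlier (the proportion of test cases that fail), amounts to a few lines of code. The sketch below uses hypothetical test identifiers and outputs; the function names are introduced here for illustration only.

```python
def failing_tests(reference_outputs, faulty_outputs):
    """Return ids of tests whose output differs from that of the reference version."""
    return sorted(
        test_id for test_id, output in faulty_outputs.items()
        if reference_outputs.get(test_id) != output
    )

def error_rate(failing, total_tests):
    """Proportion of the test cases that fail."""
    return len(failing) / total_tests

# Hypothetical outputs of the original and a faulty version for four tests.
reference = {"t1": "ok", "t2": "3 events", "t3": "none", "t4": "ok"}
faulty = {"t1": "ok", "t2": "2 events", "t3": "none", "t4": "ok"}

failing = failing_tests(reference, faulty)
rate = error_rate(failing, len(faulty))
print(failing, rate)  # prints ['t2'] 0.25
```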

Data transformation To be acceptable to the various clustering algorithms, the data requires processing before it can be analysed. The processing procedures differ from one data type to another—for instance numeric data sometimes requires normalisation. All systems used in the experiment work with textual input and produce textual output. Very often there is little semantic information in such data and a lot of noise, so to minimise the content (and redundancy) but still retain any uniqueness, the data (test case input/output pairs) were transformed by a simple process of tokenisation. The tokenisation method is widely used in the area of text mining to produce a suitable set of attribute vectors to build a classification model (a problem not dissimilar to the one we are dealing with) and is also suggested by Witten and Frank (2005). Several transformation methods such as hash coding, Huffman coding strategies and others were examined, but tokenisation turned out to be the most suitable one. Table 2 shows an example of this for NanoXML and Siena. Notice that the parameters for Siena commands were all encoded as “1” as they remained unchanged between input and output.

The Sed test data (input/output pairs) consist of a command line which contains 2 main parts: the parameters identifying the operations to be performed and a text file that needs to be modified, which therefore forms part of both the input and the output. Therefore, the data were transformed in a slightly different way compared to NanoXML and Siena: all input components remained unchanged except the filename (e.g. “../inputs/default.in”), which was encoded as the token “<1>” as the file itself contains only the text to be modified. Trying to tokenise the file to be modified (and its modified version) failed to reduce the size of the output sufficiently, and so for the output part the diff utility (a data comparison tool) was used to calculate the differences between the input text file and its modified form (this process reports how to change the first file to make it match the second file, with the specific operations that need to be performed such as “a” for add and “c” for change). The magnitude of the compression achieved by this method is hard to quantify, depending as it does on the file and the modifications, but it typically yielded a much smaller representation of the output data. Table 2 shows an example of this coding strategy.
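A rough sketch of this coding is shown below. It does not call the diff utility itself: as an assumption made for illustration, Python's `difflib.SequenceMatcher` is used to obtain line-level operations, which are then mapped to the letters used by diff. The command line and file contents are hypothetical; only the filename and the “<1>” token come from the text.

```python
import difflib

def encode_command(command, filename):
    """Replace the input filename in the command line with the token <1>."""
    return command.replace(filename, "<1>")

def diff_summary(before, after):
    """Summarise line-level changes as (operation, old lines, new lines) tuples."""
    codes = {"replace": "c", "delete": "d", "insert": "a"}   # letters used by diff
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (codes[tag], before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"                                    # keep only the changes
    ]

# Hypothetical command and file contents before and after running it.
encoded = encode_command("sed -e s/world/there/ ../inputs/default.in",
                         "../inputs/default.in")
before = ["hello world", "second line", "third line"]
after = ["hello there", "second line", "third line", "fourth line"]
summary = diff_summary(before, after)

print(encoded)  # prints sed -e s/world/there/ <1>
print(summary)
```

Only the changed lines appear in the summary, so unchanged text (usually the bulk of the file) does not contribute to the encoded output.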

As explained earlier, each test input/output pair was augmented with its associated execution trace. Such traces are often very long (hundreds, and in some cases thousands of entries), and each entry in a sequence is often a full Java method signature including package name, class name, method name, and parameters (along with their respective long signatures). This required more compression than could be provided by simple tokenisation, so the trace compression algorithm developed by Nguyen et al. (2013) was used. The algorithm replaces the method entry and exit values in a sequence with their hash keys, which usually consist of just 1 or 2 characters. It takes the occurrence frequency into account, assigning shorter hash keys to the entries that are most frequent. Table 3 shows a sample of sequences for one of the collected traces and their hash key values (for space reasons, just 3 sequences are included rather than all sequences of that trace), which are then concatenated to produce one single string. The obtained trace from the example in this table is 0LA37...
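The principle of giving the shortest keys to the most frequent entries can be illustrated as follows. This is not the algorithm of Nguyen et al. (2013), whose details are not reproduced here; the key alphabet, the function names, and the trace entries are all assumptions made for the sake of the example.

```python
from collections import Counter
from itertools import count, product
import string

ALPHABET = string.digits + string.ascii_uppercase   # assumed key alphabet

def short_keys():
    """Yield keys of increasing length: '0', '1', ..., 'Z', '00', '01', ..."""
    for length in count(1):
        for chars in product(ALPHABET, repeat=length):
            yield "".join(chars)

def compress_trace(entries):
    """Map each distinct trace entry to a key, shortest keys for the most frequent."""
    frequencies = Counter(entries)
    keys = short_keys()
    table = {entry: next(keys) for entry, _ in frequencies.most_common()}
    return "".join(table[entry] for entry in entries), table

# Hypothetical trace: entry and exit events of two methods A and B.
entries = ["A:ENTER", "B:ENTER", "B:EXIT", "B:ENTER", "B:EXIT", "A:EXIT"]
compressed, table = compress_trace(entries)
print(compressed)  # prints 201013
```

Because the compressed traces are only compared for equality in this work, the sketch does not attempt to make the concatenated string decodable when keys of different lengths are mixed.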

Table 2 Example coding of input/output pairs

Finally, all the data items can be combined into vectors that form the input to the clustering algorithm. These vectors are built from two components for the first experiment (test input and output) and three for the second experiment (test input, output and execution trace). So if the NanoXML example from Table 2 above generated the trace fragment shown in Table 3, then the vectors would take the form of <FCRSSNRSS, F> for experiment 1 and <FCRSSNRSS, F, 0LA37...> for experiment 2. This structure is repeated for all the input/output [trace] combinations for each test case.

Table 3 Example coding of sequence traces

Perform clustering Agglomerative hierarchical clustering was used in experiment 1. The second experiment extended this to include also DBSCAN and EM clustering. Agglomerative hierarchical clustering has been used by other researchers for some similar types of problem and shown to perform reasonably well [e.g. Dickinson et al. (2001), Dickinson et al. (2001), Yan et al. (2010), Yoo et al. (2009)] and is also recommended by Witten and Frank (2005) as the most suitable solution for nominal and string data (which the coding systems produce for two subject programs). In contrast, DBSCAN and EM were chosen because of their ability to determine the number of clusters automatically rather than have to specify them at the outset (one of the limitations observed in the first experiment).

A range of distance measures was initially explored, including Euclidean distance, Minkowski distance, Manhattan distance and edit distance, in order to establish the most suitable measure for the experiments proper. The first three were similar in terms of performance and principle. However, edit distance did not perform well: agglomerative hierarchical clustering consistently assigned all input/output pairs to one cluster even when the cluster count was increased. After exploring these alternatives, Euclidean distance was settled on as the measure of (dis)similarity between two objects. The WEKA toolkit (footnote 5) used in this study computes this by converting all nominal attributes into binary numeric attributes. So, an attribute with k values is transformed into k binary attributes (using the one-attribute-per-value approach) (Witten and Frank 2005). Thus, all attribute values are numeric: each is either an original numeric attribute or a synthetic binary attribute that is treated as numeric. The squared Euclidean distance sums the squared differences between these attributes: a zero sum indicates agreement (similarity), whereas a non-zero sum indicates dissimilarity.

The consequence of choosing Euclidean distance is that nominal or categorical data (such as the inputs, outputs and traces used in these experiments) are only considered equal if they are identical. Any form of difference, no matter how small or large, causes them to be considered unequal. This means that two traces may differ in just one method call out of thousands but are considered as different as two that had no method calls in common. This might seem an odd decision but the rationale behind this is that even a slight difference in an execution trace may be indicative of an error. Using other measures would mean such a difference was hardly perceptible and could easily be missed. The impact of this decision, along with other distance measures, is something that needs to be explored further in the future.
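A minimal sketch of the one-attribute-per-value coding and the resulting squared Euclidean distance is given below. It follows the description in the text rather than WEKA's actual implementation, and the trace strings in the example records are invented. It illustrates the consequence discussed above: a trace that differs in one symbol is exactly as far from the reference record as a trace that shares nothing with it.

```python
def one_hot(value, domain):
    """One-attribute-per-value coding: k possible values become k binary attributes."""
    return [1.0 if value == candidate else 0.0 for candidate in domain]


def encode(record, domains):
    """Concatenate the binary coding of every nominal attribute in a record."""
    vector = []
    for value, domain in zip(record, domains):
        vector.extend(one_hot(value, domain))
    return vector


def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical <input, output, trace> records; the trace strings are invented.
records = [
    ("FCRSSNRSS", "F", "ABCD"),  # reference record
    ("FCRSSNRSS", "F", "ABCE"),  # trace differs in one symbol
    ("FCRSSNRSS", "F", "WXYZ"),  # trace shares no symbols
]
# The domain of each attribute is the set of values observed for it.
domains = [sorted({record[i] for record in records}) for i in range(3)]
vectors = [encode(record, domains) for record in records]

near = squared_euclidean(vectors[0], vectors[1])
far = squared_euclidean(vectors[0], vectors[2])
same = squared_euclidean(vectors[0], vectors[0])
```

Under this coding each differing nominal attribute contributes the same fixed amount to the distance, regardless of how different the underlying strings are.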

In addition to a similarity metric, agglomerative hierarchical clustering requires a linkage metric which is used to determine when clusters should be merged or split. There are three approaches: Single Linkage calculates the minimum distance between an object in one cluster and an object in another, Average Linkage computes the mean distance between objects in the two clusters, and Complete Linkage is based on the maximum distance between objects. All three are explored in this study.
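The three linkage metrics can be written down directly from their definitions. The sketch below is illustrative only: the function names are assumptions, and the example uses one-dimensional points with absolute difference as the distance purely to keep the arithmetic easy to follow.

```python
from itertools import product


def pairwise_distances(cluster_a, cluster_b, distance):
    """Distances between every object in one cluster and every object in the other."""
    return [distance(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b, distance):
    """Minimum distance between an object in one cluster and an object in the other."""
    return min(pairwise_distances(cluster_a, cluster_b, distance))


def average_linkage(cluster_a, cluster_b, distance):
    """Mean distance over all cross-cluster pairs of objects."""
    distances = pairwise_distances(cluster_a, cluster_b, distance)
    return sum(distances) / len(distances)


def complete_linkage(cluster_a, cluster_b, distance):
    """Maximum distance between an object in one cluster and an object in the other."""
    return max(pairwise_distances(cluster_a, cluster_b, distance))


def absolute_difference(a, b):
    return abs(a - b)


# Two small hypothetical clusters of one-dimensional points.
left, right = [0.0, 1.0], [3.0, 5.0]
single = single_linkage(left, right, absolute_difference)
average = average_linkage(left, right, absolute_difference)
complete = complete_linkage(left, right, absolute_difference)
```

At each step an agglomerative algorithm merges the pair of clusters with the smallest linkage value, so the choice of metric changes which clusters are considered closest.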

Number of clusters For agglomerative hierarchical clustering, the number of clusters needs to be provided as a parameter. This can clearly have a significant impact: too many clusters result in fragmentation and too few in over-generalisation. Therefore, a number of different cluster counts were explored based on a percentage of the number of subject program test cases: 1, 5, 10, 15, 20 and 25 %.

The number of clusters for EM is determined automatically by cross validation, a technique often used in classification (Witten and Frank 2005). A given data set is firstly divided into m parts. Next, \(m-1\) parts are used to build a clustering model, and the remaining part used to test the quality of the clustering. This process is repeated m times to derive clusterings of k clusters by using each part in turn as the test set. The average of the quality measure is taken as the overall quality measure. Then, the overall quality measure with respect to different values of k is compared to find the best number of clusters that fits the data.
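The cross-validation procedure just described can be sketched as a generic skeleton. The model fitting and the quality measure are deliberately left as caller-supplied functions (`fit` and `score` are hypothetical names), since the text does not fix them; a held-out log-likelihood would be one plausible choice of quality measure for EM.

```python
def folds(data, m):
    """Divide the data into m parts of roughly equal size (round-robin)."""
    return [data[i::m] for i in range(m)]


def cross_validated_quality(data, k, m, fit, score):
    """Average quality of k-cluster models, each built on m - 1 parts and
    tested on the remaining part, using every part once as the test set."""
    parts = folds(data, m)
    qualities = []
    for i, test_part in enumerate(parts):
        training = [x for j, part in enumerate(parts) if j != i for x in part]
        model = fit(training, k)
        qualities.append(score(model, test_part))
    return sum(qualities) / m


def choose_k(data, candidate_ks, m, fit, score):
    """Return the candidate number of clusters with the best overall quality."""
    return max(
        candidate_ks,
        key=lambda k: cross_validated_quality(data, k, m, fit, score),
    )
```

Here a larger `score` is assumed to mean a better clustering; if the chosen quality measure is an error, its sign would need to be flipped.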

The DBSCAN algorithm uses two specified parameters (\(\epsilon\): the radius parameter, and MinPts: the neighbourhood density threshold—see Sect. 4.2) to determine the number of clusters automatically. For our experiments, we found that the parameters which gave the best results were \(\epsilon\) = 1.5 and MinPts = 1.
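To make the role of the two parameters concrete, the following is a compact textbook-style DBSCAN sketch, not the WEKA implementation used in the study. It assumes numeric points, Euclidean distance, and that a point counts as a member of its own neighbourhood; the example points are invented.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)  # None marks an unvisited point
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels


# Hypothetical two-dimensional points forming two groups and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (20, 0)]
labels_min1 = dbscan(points, eps=1.5, min_pts=1)
labels_min2 = dbscan(points, eps=1.5, min_pts=2)
```

Under these assumptions, MinPts = 1 makes every point a core point, so an isolated point becomes a singleton cluster instead of being labelled as noise; this is one way in which a large number of very small clusters can arise.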

Small cluster size One of the key elements of this research is the hypothesis that failures tend to congregate in small clusters. But what is a small cluster? For these initial studies, small is defined as less than or equal to the mean of the cluster size (the remainder being considered as large). For the purposes of this experiment, all clusters were examined to determine the proportion of failures contained therein, but in practice it is envisaged that only small clusters would be inspected and larger ones ignored. The definition of ‘small’ and ‘large’ is quite coarse in this instance. One of the topics of future research is to define more accurately what can be considered to be small clusters.

5.3 Evaluation of clustering techniques

The performance of the clustering algorithms can be assessed by looking at the way that failures are distributed over the small clusters (the definition of “small” is flexible so what follows is a general definition). To capture this more accurately for this experiment, we used the F-measure, a combination of Precision and Recall (measures widely used in the information science domain). These measures in turn rely on the concepts of true positives (TP), false positives (FP) and false negatives (FN), which are defined in this context as follows:


  • True positive (TP): a failing test result that appears in a small cluster.

  • False positive (FP): a passing test result that appears in a small cluster.

  • False negative (FN): a failing test result that appears in a large (i.e. not small) cluster.

Precision is defined as the ratio of “correctly clustered” failures (i.e. failures that appear in small clusters) to the sum of all the entries in the small clusters:

$$\begin{aligned} \mathrm{Precision\,(PR)} = \frac{(\mathrm{TP})}{\mathrm{(TP + FP)}} \end{aligned}$$

Recall is the ratio of “correctly clustered” failures to the total number of true failures (failures appearing in both small and large clusters):

$$\begin{aligned} \mathrm{Recall\,(RE)} = \frac{\mathrm{(TP)}}{\mathrm{(TP + FN)}} \end{aligned}$$

The F-measure—the harmonic mean of precision and recall—combines these two as follows:

$$\begin{aligned} \mathrm{F\text{-}measure} = 2 \times \frac{\mathrm{(PR \times RE)}}{\mathrm{(PR + RE)}} \end{aligned}$$

In this study, we have defined small clusters as those being of average size or less (i.e. the total number of passing and failing outputs divided by the number of clusters). Further work in this area will explore other values of small.

To illustrate the process of the evaluation, we introduce a small example which shows how the small cluster size, precision, recall and F-measure are computed. Assume that a system under test generates 21 data points during execution of its set of test cases. The system contains three faults (referred to as F1, F2 and F3) which cause failures that appear in the output four, four and two times respectively. The remaining 11 test outputs were all passes (we do not need to distinguish amongst these). Assume further that after applying clustering, six clusters were created which grouped the outputs as follows: (f1, f2, f3, p, p, p), (p, p, p, p, p), (f1, f2, p, p), (f1, f2, p), (f2, f3), (f1), where fn corresponds to a failure associated with fault n and p corresponds to a passing execution. This can be illustrated graphically as shown in Fig. 4 (where the clusters are sorted in increasing order of size on the y-axis and the “cluster count” legend is just an arbitrary value allocated to a cluster). This representation allows us to see the distribution of failures over the clusters.

Fig. 4 Evaluation example

The key values are computed as follows:

  • Small clusters are those of average size or less (i.e. (number of data points)/(number of clusters)). In the above example, the average cluster size is (21/6) = 3.5, so the small clusters are all those containing \(\le\) 3 data points (i.e. clusters 1, 2 and 3).

  • Precision: Five of the outputs in the 3 small clusters are failures (TPs) and one is a pass (FP), so PR = 5/(5 + 1) = 0.83

  • Recall: Five of the outputs in the 3 small clusters are failures (TPs) but 5 failures also ended up being allocated to the “large” clusters (FNs), so RE = 5/(5 + 5) = 0.5

  • The F-measure is then \(2 \times (0.83 \times 0.5)/(0.83 + 0.5) = 0.62\).
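The calculation in this example can be reproduced with a short script. It is a sketch that encodes the six clusters listed above and applies the definitions of small cluster, TP, FP and FN from this section; the function and variable names are assumptions.

```python
# The six clusters of the worked example: "fn" is a failure caused by fault n,
# and "p" is a passing execution.
clusters = [
    ["f1", "f2", "f3", "p", "p", "p"],
    ["p", "p", "p", "p", "p"],
    ["f1", "f2", "p", "p"],
    ["f1", "f2", "p"],
    ["f2", "f3"],
    ["f1"],
]


def is_failure(label):
    return label != "p"


def evaluate(clusters):
    """Return (mean size, TP, FP, FN, precision, recall, F-measure)."""
    mean_size = sum(len(c) for c in clusters) / len(clusters)
    small = [c for c in clusters if len(c) <= mean_size]
    large = [c for c in clusters if len(c) > mean_size]
    tp = sum(is_failure(x) for c in small for x in c)      # failures in small clusters
    fp = sum(not is_failure(x) for c in small for x in c)  # passes in small clusters
    fn = sum(is_failure(x) for c in large for x in c)      # failures in large clusters
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return mean_size, tp, fp, fn, precision, recall, f_measure


mean_size, tp, fp, fn, precision, recall, f_measure = evaluate(clusters)
```

Without intermediate rounding the F-measure comes out as 0.625; the value of 0.62 in the text results from first rounding the precision to 0.83.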

6 Experiment 1 (clustering test input/output pairs): results and discussion

This first experiment explored the use of clustering to group data composed just of test case inputs and their associated outputs.

6.1 Distribution of failures

Fig. 5 Hierarchical clustering algorithm with average linkage for NanoXML (version 1)
Fig. 6 Hierarchical clustering algorithm with average linkage for NanoXML (version 2)
Fig. 7 Hierarchical clustering algorithm with average linkage for NanoXML (version 3)
Fig. 8 Hierarchical clustering algorithm with average linkage for NanoXML (version 5)
Fig. 9 Hierarchical clustering algorithm with average linkage for Siena (version 2)
Fig. 10 Hierarchical clustering algorithm with average linkage for Sed (version 5)

The first question to explore is whether failures are distributed in a random pattern or whether they tend to congregate in the smaller clusters as hypothesised. Figures 5, 6, 7, 8, 9 and 10 show bar charts representing the cluster size and composition for all versions of NanoXML, Siena (faulty version 2), and Sed using agglomerative hierarchical clustering with average linkage. In several cases (NanoXML versions 2 and 3 and Siena version 2) it can be seen that failures in the test input/output pair population tend to cluster together and that these clusters tend to be the smaller ones. This effect is less pronounced in NanoXML versions 1 and 5, where the smallest clusters also tend to contain more of the passing cases. The pattern for Sed is quite different: there are a very large number of small clusters rather than a gradually increasing distribution as in the other cases, and these contain a mixture of both passing and failing cases. Overall there is some support for the main hypothesis behind this work, that failures tend to gravitate towards the smaller clusters, but it is by no means universal. The following sections examine this in more detail.

6.2 Failures found versus cluster counts and cluster sizes

Table 4 Composition of small clusters in terms of failures and F-measure versus cluster size for hierarchical clustering with different linkage metrics for NanoXML
Table 5 Composition of small clusters in terms of failures and F-measure versus cluster size for hierarchical clustering with different linkage metrics for Siena and Sed

To investigate this observation further, we examined the population of input/output pairs that were in small clusters (defined as being of average size or less) and corresponded to failures. Tables 4 and 5 show, for varying numbers and sizes of clusters over all systems and for the three different linkage metrics that may be used with agglomerative hierarchical clustering (Average, Single and Complete), the percentage of all data points corresponding to failures. The first column (Cluster Count %) defines the number of clusters the algorithm is charged with creating, expressed as a percentage of the number of test cases. So, for NanoXML a value of 10 in the Cluster Count % corresponds to 21 as it has 207 tests, for Siena this would be 50 as it has 494 test cases, and for Sed which has 370 tests it would be 37. The second column (Cluster Size %) is the average size of the clusters that are created by the algorithms, again expressed in terms of the number of tests. So as the values in the Cluster Count % column increase, so does the number of clusters created, which leads to a corresponding decrease in the average size of the clusters. The subsequent columns refer to the version number of the program. Note that the faults in Siena changed the same output data in all versions, even though they are distinct faults, so only the results from one version are considered since there is nothing to be gained from examining the other versions.

Considering the results for NanoXML (Table 4), the data shows that when the cluster counts are between 15 and 25 % of the number of test cases (corresponding to cluster sizes of around 3 % of the number of test cases—i.e. around six data points for NanoXML), well over 55 % of the data points are failures irrespective of which linkage metric is used, and over 60 % when the average linkage metric is employed. For Siena (Table 5) a similar pattern emerges but the best results are at the higher cluster count levels (20–25 %, possibly due to the larger number of test cases which gives an average cluster size of around 4) and tend to be over 70 %. The results for Sed (Table 5) are less dramatic and although a similar trend is displayed the failure density never reaches 50 %, peaking at just over 40 % when the complete linkage metric is used with an average cluster size of about 3. From the graphs shown earlier (Fig. 10), it was observed that Sed contained a very large number of small clusters and only one large cluster, rather than a steadily increasing cluster size which suggests that the data is very fragmented and the algorithm is clearly struggling to form larger groups of data items.

Even with the results from Sed, the findings lend support to the main hypothesis of this paper: As the number of clusters increases and their average size decreases, so the failure density of the small (less than average) sized clusters tends to increase. One case where this is not quite true is version 3 of NanoXML where the largest clusters contained the most failures: the input-output pairs corresponding to failures are so distinct from the rest that they were all grouped into one cluster (an impressive but probably unusual case!).

Table 6 Percentage of failures and F-measure for EM algorithm
Table 7 Percentage of failures and F-measure for DBSCAN algorithm (note: for NanoXML \(\epsilon\) = 0.9 and MinPts = 2; for Siena and Sed \(\epsilon\) = 1.5 and MinPts = 1)

Tables 6 and 7 show the results of clustering test inputs and outputs using the Expectation Maximisation and DBSCAN algorithms, respectively. Unlike agglomerative hierarchical clustering, neither of these algorithms requires the number of clusters to be specified in advance. The results show that EM performs well with all versions of NanoXML but less so with Siena and very poorly with Sed. Interestingly, for NanoXML the number of clusters created is close to the best number when specified for agglomerative hierarchical clustering. The results for DBSCAN are weaker for NanoXML and very poor for Siena but extremely encouraging for Sed, generating both a very high failure density in the smallest clusters and a reasonable F score. In the case of Sed, DBSCAN has generated a very large number of small clusters (matching the pattern observed earlier in Fig. 10), almost twice the number that was explored using agglomerative hierarchical clustering, which confirms our earlier observations about the data being very fragmented.

Fig. 11 Percentage of failures found over the smallest clusters for all NanoXML versions using single linkage
Fig. 12 The average percentage of failures found over the smallest clusters for all Siena versions using linkage metrics
Fig. 13 The average percentage of failures found over the smallest clusters for Sed using linkage metrics

Although the general pattern is for failure intensity to increase as the cluster size decreases (a trend which can also be observed in Figs. 11, 12 and 13, which present the percentage of failures found in the small clusters with different cluster counts in the subject programs and are essentially a graphical summary of the data in Tables 4 and 5), there are cases where the failure intensity peaks and then begins to drop (although not substantially) as the clusters are forced to fragment. An important lesson from this study is that the number of clusters is crucial: too few and the technique may be ineffective, but too many may cause the failure intensity to diminish as the clusters are forced to fragment. Identifying the ideal number of clusters (or similarly, the best parameters for algorithms such as DBSCAN) needs further empirical investigation.

6.3 Failure density of smallest clusters

From the perspective of supporting the practising software engineer in their work and also in the construction of a test oracle, the interesting question concerns the return on investment: how many outputs need to be examined before a reasonable number of failures are observed? To answer this we examined in more detail the proportion of failing outputs appearing in the smallest sized clusters. The absence of a fault matrix for Siena makes this very time consuming to compute; therefore, only the results for the highest failure density clusters for NanoXML and Sed have been calculated so far. The results are summarised in Tables 8 and 9, which show the cluster size (the three values correspond to the absolute size of the cluster, the number of clusters of that size, and the size of the cluster proportional to the test set size) and details of the failures found (the proportion, the actual failures indicated by ‘Fn’, and the number of occurrences of each failure). Failures associated with a new fault (i.e. not previously encountered) are indicated in bold font. The final column shows the cumulative count of unique faults observed (via their associated failures) over the total number of faults in the system. So, for instance, the first entry of Table 8 shows that for Version 1 using 25 % of the number of test cases to define the number of clusters, there were 13 clusters each of size 1 corresponding to 0.48 % of the number of test cases, containing failures 1 (three times), 2 and 6 (once each), giving a cumulative count of 3 out of a total of 7.
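The bookkeeping behind the cumulative fault column can be sketched as follows. The sketch walks through the small clusters in increasing order of size, recording for each the failure density, the failures not previously encountered, and the running count of unique faults. For brevity it reports one row per cluster rather than grouping clusters of equal size as the tables do, and it reuses the toy clusters from the evaluation example in Sect. 5.3 rather than any data from the study; all names are assumptions.

```python
def small_clusters_by_size(clusters):
    """Clusters of at most average size, in increasing order of size."""
    mean_size = sum(len(c) for c in clusters) / len(clusters)
    return sorted((c for c in clusters if len(c) <= mean_size), key=len)


def cumulative_fault_table(clusters, total_faults):
    """One row per small cluster: (size, failure density, new failures, cumulative)."""
    seen = set()
    rows = []
    for cluster in small_clusters_by_size(clusters):
        failures = [label for label in cluster if label != "p"]
        new_failures = sorted(set(failures) - seen)  # not previously encountered
        seen.update(failures)
        density = len(failures) / len(cluster)
        rows.append((len(cluster), density, new_failures, f"{len(seen)}/{total_faults}"))
    return rows


# Toy data: "fn" is a failure caused by fault n and "p" is a passing output.
clusters = [
    ["f1", "f2", "f3", "p", "p", "p"],
    ["p", "p", "p", "p", "p"],
    ["f1", "f2", "p", "p"],
    ["f1", "f2", "p"],
    ["f2", "f3"],
    ["f1"],
]
rows = cumulative_fault_table(clusters, total_faults=3)
```

In this toy case all three faults have been observed after inspecting the two smallest clusters, which is the kind of early saturation the cumulative column is intended to expose.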

Table 8 shows that on average over all four versions a fair proportion of the failures—45 % (13/29)—are contained within the very smallest clusters (formed from just one or two items). This is encouraging from a test oracle perspective: out of 43 outputs, 23 correspond to failures giving a failure density of 53 %. This initially good rate tails off until the cluster size reaches 4 and additional failures appear in the outputs (except for version 5). By this point an average of 66 % (19/29) of the failures have appeared in the clusters, albeit at the expense of having to examine more non-failing outputs and encountering duplicate failing outputs (but still giving a failure density of around 59 %). This failure density figure, combined with the fact that clusters tend to contain outputs associated with the same failure, means that in practice less than half of the outputs from a small cluster need to be checked before a failing output is encountered.

The results for Sed (Table 9) are less impressive but nevertheless encouraging. Even though the failure density is lower than for NanoXML, the failures are well represented in the smallest clusters: by examining these 3 out of the 4 failures would be encountered. On the downside the outputs of 62 small clusters (all of size 1) need to be checked, but this is still far less work than examining all 370 test outputs.

Table 8 Failure distribution over less than average-sized clusters for NanoXML
Table 9 Failure distribution over less than average-sized clusters for Sed version 5

Of course, there are still additional failing outputs embedded in the larger clusters which cannot be ignored. This is clearly a weakness of the approach and one of the main topics of future work is to explore how these can be teased out into smaller clusters. A further feature of the clustering is that there are often a number of independent clusters associated with the same failure (typically separated because the input/output pairs have different attribute values). This is also a challenge since finding the same failure appearing in several clusters can be quite frustrating for the individual charged with the task of checking outputs. Merging them together is not the answer as this will typically result in a larger cluster which may escape scrutiny, so some way of indicating similarity between them needs to be explored.

7 Experiment 2 (clustering test input/output pairs and execution traces): results and discussion

A second experiment was run to investigate whether collecting additional data in the form of the execution traces associated with each test case would improve the accuracy of the clustering performed in the first experiment by increasing in particular the failure density of the small clusters. Since this trace data can be quite extensive, it was compressed as described in Sect. 5.2. Apart from collecting and including this additional trace data in the clustering, all other aspects of this experiment were identical to the previous experiment.

7.1 Distribution of failures over clusters

Fig. 14 Hierarchical clustering algorithm with single linkage for NanoXML (Version 1)
Fig. 15 Hierarchical clustering algorithm with single linkage for NanoXML (Version 2)
Fig. 16 Hierarchical clustering algorithm with single linkage for NanoXML (Version 3)
Fig. 17 Hierarchical clustering algorithm with single linkage for NanoXML (Version 5)
Fig. 18 Hierarchical clustering algorithm with single linkage for Siena (Version 2)
Fig. 19 Hierarchical clustering algorithm with average linkage for Sed (Version 5)

Again, the first major question to explore is whether failures are distributed in a random pattern over the clusters or whether they gravitate towards the small clusters as hypothesised. To examine this, a sample of the results is shown visually; space prohibits the inclusion of all the results, but the full set is available online (footnote 6). Figures 14, 15, 16, 17, 18 and 19 show bar charts of the cluster composition for NanoXML (all faulty versions), Siena (just faulty version 2 as 4 and 6 produce an identical pattern as mentioned before) and Sed, where failing outputs are coloured blue and passing ones yellow. In these cases the cluster count for NanoXML is set at 15 % of the number of test cases (producing approximately 30 clusters), 20 % for Siena (producing just under 100 clusters) and 25 % for Sed (producing just over 90 clusters). In all cases the results are from agglomerative hierarchical clustering (DBSCAN and EM clustering algorithms were also used but tended to perform relatively poorly, something which is explored in more detail later).

It can be seen from these results that as in experiment 1 the failure data do tend to cluster together and these clusters are the smaller ones in most cases. There are some exceptions to this: for example for NanoXML version 5 the very smallest clusters are dominated by non-failing outputs, whereas the converse is true for the other versions, and in all cases of NanoXML some failures creep into the largest clusters. The results for Siena are more consistent with a clear tendency for failures to gravitate towards the small clusters and away from the larger ones. The results for Sed are similar to experiment 1—many small clusters and one large cluster but this time with a few intermediate-sized ones. It must be stressed that these are selected, and very high-level, results (although others reflect a similar pattern) but it would seem that a substantial number of failures congregate in small clusters. The detailed composition of these small clusters is examined in more detail in the next section.

7.2 Failure composition of small clusters

This apparent tendency for failures to gravitate towards the smaller clusters needs to be explored in more detail: the precise degree to which it occurs; the impact of the different clustering algorithms and parameters (especially the number of clusters); and particularly the way that multiple failures are distributed (for example, in the case of several failures do they all appear in the small clusters or is one failure dominant?). To explore this further, we examined the population of the small clusters (defined as being of average size or less) for each of the algorithms, identified the percentage of these clusters that correspond to failures, and also used the F-measure to answer the point about the way that multiple failures are clustered.

Table 10 Percentage of failures and F-measure versus cluster size for hierarchical clustering with different linkage metrics for NanoXML
Table 11 Percentage of failures and F-measure versus cluster size for hierarchical clustering with different linkage metrics for Siena
Table 12 Percentage of failures and F-measure versus cluster size for hierarchical clustering with different linkage metrics for Sed

Tables 10, 11 and 12 show, for NanoXML, Siena and Sed respectively, the results of applying agglomerative hierarchical clustering for different linkage metrics with varying numbers of clusters. The tables show the percentage of all data points in small (less than average sized) clusters that correspond to failures, and the F-measure for the small clusters. The percentage figure gives an indication of the failure density and the F-measure adds to this by considering the range of faults that are revealed by failures that appear in the small clusters (for NanoXML there are seven faults in versions 1–3 and eight in version 5). The first column (Count) defines the number of clusters the algorithm is charged with creating expressed as a percentage of the number of test cases. The second column (Size) is the average size of the clusters again in terms of the number of test cases. The cluster count figure has to be supplied as a parameter, whereas the size figure is a consequence of the number of clusters and is not controllable. The subsequent columns refer to the version number of the programs and % and F refer to the percentage of failures and the F-measure. The figures in bold italics indicate the high values.

The data for NanoXML shows an interesting bi-modal response: the best results occur when there is either the smallest number of clusters (1 % which corresponds to two clusters) or when the cluster counts range between 10 and 25 % of the number of test cases (yielding between 20 and approximately 50 clusters). When the cluster count is very small, the algorithm will generate two large clusters (these will be of similar size but the smaller one is always treated as the small cluster) and in some cases one of these is composed entirely of failures and the other of passing outputs (those where the F-measure has a value of 1). In other words, the algorithm has managed to perfectly separate the passing and failing executions. These impressive clusterings were investigated in more detail, and it was found that in versions 2, 3 and 5 of NanoXML, all failing outputs follow exactly the same path through the program (despite being different faults generating distinct outputs) and the algorithms were perfectly separating the results based upon the execution trace. Even though this is probably a rare occurrence, it clearly demonstrates the power that execution traces can bring to this process.

As the cluster count increases, the results tend to drop quite dramatically until they pick up at around the 15 % level (\(\pm\)5 %) before tailing off again. In this range, the average small cluster sizes are between 2.73 and 6.39 % of the number of test cases (around 5 to 13 elements), and it is worth noting that well over 60 % of the data points, and sometimes far more, are failures. This again lends support to the experimental hypothesis behind this work that failures tend to congregate in small clusters. Another notable point is the fact that the F-measure tends to vary in line with the percentage of failures (and in all but one case the highest F-measure is also the highest percentage of failures), indicating that the failures associated with the numerous faults are evenly distributed across the small clusters. This is important as it could have been the case that the small clusters were dominated by a small and unrepresentative number of failures. The exact composition of these clusters will be explored in more detail later. It is also notable that both the linkage metrics and the versions of the program have an impact on the results, but the best overall and most stable results are produced by using the single linkage metric with a cluster count set at 15 % of the number of test cases.

The results for Siena (Table 11) tend to follow a similar pattern: in some cases the smallest number of clusters (5) tends to perform well and again manages to perfectly separate the data (once again this result is down to the passing and failing outputs being completely separable by their traces), but in other cases (with the single linkage metric) it performs very poorly. The data for Siena also support the key hypothesis behind this paper with the cluster counts between 5 and 25 % of the number of test cases consisting of over 70 % failures. In contrast to NanoXML there is less of an impact of version (probably due to each having just a single failure) but like NanoXML the linkage metrics influence the findings, with the single linkage producing the least consistent results and the complete linkage the best. The reasons behind this are unclear and need further investigation.

The picture for Sed is similar to that for experiment 1—a gradual increase in failure density and F-measure as the cluster size drops but a much lower overall failure density value than was observed in the other two projects. Including the trace information has not produced any dramatic results as with NanoXML and Siena as there is no dominant pattern of traces arising from failing executions.

Table 13 Percentage of failures and F-measure versus cluster size for DBSCAN clustering algorithm
Table 14 Percentage of failures and F-measure versus cluster size for EM clustering algorithm

The results of using DBSCAN and EM to perform the clustering are shown in Tables 13 and 14, respectively. The first column (systems) defines the subject programs with their version number. The second and third columns identify, as in the previous tables, the number of clusters and the average small cluster size again in terms of the percentage of test cases. The key difference in this case is that the cluster count is determined automatically by the algorithm. The final column shows the percentage of failures in the small clusters and the F-measure for each algorithm. With the exception of version 1, DBSCAN performed well on NanoXML: for version 2 the result was equal to the best found using agglomerative hierarchical clustering, and versions 3 and 5 were close to the best. It is also notable that the cluster count chosen was 15 %, identified as the best compromise for agglomerative hierarchical clustering. The trace information in version 1 is far more diverse, which may explain the less impressive performance in this case. The results for Siena are consistent but far inferior to those produced by most of the different cluster size parameters using agglomerative hierarchical clustering. Sed produced the most disappointing results for this algorithm, far worse than when it was operating on test inputs and outputs alone, which suggests that the clustering is fragmenting the data further; this needs to be explored in future work. The findings for EM are very disappointing, with the odd exception of NanoXML Version 1. In the majority of cases, the algorithm failed to apportion any of the failures into the smallest clusters and also elected to use a very small number of clusters.

7.3 Fault density of smallest clusters

As in experiment 1, the practical utility of the approach and the return on investment was explored: how many outputs need to be examined before a reasonable number of failures and associated faults are observed? To answer this we examined in more detail the precise composition of failing data appearing in the smallest sized clusters—in other words which failing outputs appeared in which clusters.

Table 15 Failure distribution over less than average-sized clusters for NanoXML

The results of this analysis for NanoXML (with a clustering size of 15 % using agglomerative hierarchical clustering; see footnote 7) are shown in Table 15, which takes the same form as Table 8 in Sect. 6.3. The three figures in the leftmost column show the absolute size of the cluster, the number of clusters of that size, and the size of the cluster as a proportion of the test set size (note that the table is presented in increasing order of cluster size and includes only clusters of less than average size). The second column identifies the failures found (indicated by 'Fn') and the number of occurrences of each failure. Failures associated with new faults (i.e. those not previously encountered) are indicated by a bold font. The final column is a cumulative count of the number of faults observed after examining the cluster, over the total number of faults in the system. For example, the first entry of Table 15 shows that for NanoXML version 1, using 15 % of the number of test cases to define the number of clusters, there were ten clusters each of size 1 (corresponding to 0.67 % of the number of test cases), containing failures 1 and 2 (twice each) and 6 (once), giving a cumulative count of 3.
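The way such a table is assembled can be illustrated with a minimal sketch. It is not the tooling used in the study: the function name fault_discovery_table, the input layout (one list per cluster, holding None for a passing test or a fault label such as "F1" for a failing one) and the toy data are all assumptions made for illustration, and each failure is assumed to be labelled with the single fault it reveals.

```python
from collections import defaultdict


def fault_discovery_table(clusters, total_faults):
    """Summarise less-than-average-sized clusters in increasing size order.

    clusters: list of lists (hypothetical layout); each inner list has one
    entry per test case, None for a pass or a fault label for a failure.
    total_faults: number of faults known to be present in the program.
    """
    n_tests = sum(len(c) for c in clusters)
    average_size = n_tests / len(clusters)

    # Group the small clusters by their absolute size.
    by_size = defaultdict(list)
    for cluster in clusters:
        if len(cluster) < average_size:
            by_size[len(cluster)].append(cluster)

    seen_faults = set()
    rows = []
    for size in sorted(by_size):
        group = by_size[size]
        # Count how often each failure label occurs in clusters of this size.
        counts = {}
        for cluster in group:
            for label in cluster:
                if label is not None:
                    counts[label] = counts.get(label, 0) + 1
        seen_faults.update(counts)
        rows.append({
            "cluster_size": size,
            "n_clusters": len(group),
            "size_pct": 100.0 * size / n_tests,
            "failures": counts,
            "cumulative_faults": f"{len(seen_faults)}/{total_faults}",
        })
    return rows


# Toy data, not taken from the study: five clusters over eleven test cases.
toy_clusters = [["F1"], ["F2"], [None], ["F1", None], [None] * 6]
for row in fault_discovery_table(toy_clusters, total_faults=3):
    print(row)
```

Reading the rows from top to bottom mirrors the order in which a developer would inspect the clusters, so the last field shows how quickly distinct faults accumulate.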

The NanoXML results show a number of failures appearing in the smallest clusters with additional ones appearing after examining just a few more clusters (with the exception of version 5). This is an important finding as it suggests that those failures which are going to be observed tend to appear relatively early in the ordering of clusters. This has important practical implications: collectively these smallest clusters correspond to between 25 and 30 % of the total output of the system, and the observed failures appear in an even smaller grouping, which means that the majority of failures in a system can be identified by looking at between one-fifth and one-quarter of the output—a substantial saving in effort for the developer.

The results for Siena are included in Table 16, although since Siena contains just the one fault the impact is less pronounced (and just one version is included since the results for the other two are similar). However, it does show that the observed failures also tend to be concentrated early on in the small clusters and have the same implications as the NanoXML results.

The findings for Sed are shown in Table 17. The pattern is similar to the first experiment but the number of clusters to be examined has dropped very slightly. Again there are clear practical benefits: 75 % of the program’s failures are concentrated in about 16 % of its results.

Table 16 Failure distribution over less than average-sized clusters for Siena
Table 17 Failure distribution over less than average-sized clusters for Sed version 5

7.4 Impact of failure density

One key factor in this study is the failure density. As mentioned in Sect. 5.1, this is between 31 and 39 % for NanoXML and 17 % for Siena. This failure rate is a product of the combination of test cases supplied for the two systems and the nature of the faults embedded within the systems. However, in practical terms, this may be too high. The expectation is that this approach would be applied to a relatively mature system which may not have many obvious faults, and which consequently has a much smaller failure rate. Furthermore, an assumption behind anomaly detection is that anomalous events are relatively rare, whereas in these experiments the failure rate has been fairly high, which may represent a difficult case for the successful application of clustering techniques. To explore the impact of this, we took one version of each of two of the systems, NanoXML V3 and Siena V4 (Sed was ignored as it demonstrated a similar failure rate to Siena), and randomly pruned out fault-revealing test cases to systematically reduce the failure rates to 10, 5 and 1 % for each system.
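The pruning step can be sketched as follows. This is an illustrative reconstruction rather than the procedure actually used: it assumes that every passing test is retained and that only failing tests are sampled, and the function name prune_to_failure_rate, the (test_id, is_failing) tuple layout and the fixed seed are hypothetical. With p passing tests and a target rate r, the number of failing tests f to keep follows from f / (p + f) = r, i.e. f = r p / (1 - r).

```python
import random


def prune_to_failure_rate(tests, target_rate, seed=0):
    """Randomly drop failing tests until the suite has roughly target_rate failures.

    tests: list of (test_id, is_failing) pairs (hypothetical layout).
    target_rate: desired failure rate as a fraction, e.g. 0.05 for 5 %.
    Assumption: all passing tests are kept; only failing tests are pruned.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")

    passing = [t for t in tests if not t[1]]
    failing = [t for t in tests if t[1]]

    # Solve f / (p + f) = r for f, then round to a whole number of tests.
    keep = round(target_rate * len(passing) / (1.0 - target_rate))
    keep = min(keep, len(failing))  # cannot keep more failures than exist

    rng = random.Random(seed)  # seeded so that the pruning is repeatable
    return passing + rng.sample(failing, keep)


# Toy suite, not taken from the study: 900 passing and 300 failing tests.
suite = [(i, False) for i in range(900)] + [(900 + i, True) for i in range(300)]
for rate in (0.10, 0.05, 0.01):
    pruned = prune_to_failure_rate(suite, rate)
    achieved = sum(1 for _, failed in pruned if failed) / len(pruned)
    print(rate, len(pruned), round(achieved, 4))
```

Because the number of retained failures must be an integer, the achieved rate only approximates the target for small suites.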

Table 18 NanoXML V3 with reduced failure rate
Table 19 Siena V4 with reduced failure rate

The results for this part of the investigation are shown in Tables 18 and 19. For each system, these show the cluster count, again expressed as a percentage of the number of test cases (but note that the actual number of clusters will decrease as the failure rate decreases, since test cases are being pruned from the suite), together with the percentage of failures found and the F-measure over the small clusters for failure rates of 10, 5 and 1 %. Both systems exhibit a similar distinctive pattern: as the failure rate decreases the recall (percentage of failures found) tends to remain high, but the F-measure drops as the cluster count increases. The reason behind this is that with an increase in the number of clusters the false positive rate also increases, as more passing tests become classified into the small clusters. This has an important practical implication for the technique: if the system under investigation is expected to have a low failure rate, then the cluster count (if specified as a parameter) should be very small, but as the expected failure rate increases then so should the number of clusters. Further experimentation is required in order to validate this observation.
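The interaction between cluster count, recall and F-measure can be made concrete with a small sketch. It assumes the standard definitions (precision as the fraction of tests in small clusters that fail, recall as the fraction of all failures that land in small clusters, and the F-measure as their harmonic mean) and treats every cluster of less than average size as "small"; the function name small_cluster_scores and the toy clusterings are hypothetical and are not data from the study.

```python
def small_cluster_scores(clusters):
    """Return (precision, recall, f_measure) over less-than-average-sized clusters.

    clusters: list of lists of booleans (hypothetical layout), one entry per
    test case, True for a failing test and False for a passing one.
    """
    n_tests = sum(len(c) for c in clusters)
    average_size = n_tests / len(clusters)
    small = [c for c in clusters if len(c) < average_size]

    flagged = sum(len(c) for c in small)          # tests a developer would inspect
    true_positives = sum(sum(c) for c in small)   # failing tests among them
    total_failures = sum(sum(c) for c in clusters)

    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / total_failures if total_failures else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The same ten toy tests (two failing) clustered with few and with many clusters.
few_clusters = [[True], [True], [False] * 8]
many_clusters = [[True], [True], [False], [False], [False] * 6]

print(small_cluster_scores(few_clusters))
print(small_cluster_scores(many_clusters))
```

In the second clustering both failures are still isolated, so recall is unchanged, but two passing tests have also fallen into singleton clusters; precision, and with it the F-measure, drops accordingly.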

8 Threats to validity

The main threat to the validity of this study is the limited number and types of subject programs used in our experiments, along with their associated faults and failure rates (although some investigation of the impact of reducing the failure rate has been undertaken). The input/output pairs of the subject programs were string data, and the programs themselves were of moderate size. The coding scheme also poses a potential threat, but this was created by examining a subset of inputs and outputs in ignorance of whether they were passing or failing pairs and then applied automatically to the remainder of the data set. This is relatively early work in this area, and the aim is to mitigate these threats by exploring a wider range of systems in the near future.

9 Conclusions and future work

This paper has presented an extension of our preliminary study (Rafig and Roper 2015) and investigated several clustering techniques, namely the agglomerative hierarchical, DBSCAN and EM clustering algorithms, to build an automated test oracle. The input/output pairs investigated initially were augmented with execution traces with the aim of improving the proportion of unique failures in the smaller clusters.

The study confirmed our earlier findings (Rafig and Roper 2015): in several cases small (less than average sized) clusters contained more than 60 % of failures (and often a substantially higher proportion). As well as having a higher failure density, they also contained a spread of failures in the cases where there were multiple faults in the programs. The results provide us with some useful guidelines in terms of specifying the number of clusters as a parameter to the algorithms. Over both experiments agglomerative hierarchical clustering produced the most consistently good results, although performance varied according to which linkage metric was used (and also varied with experiment). The results for DBSCAN were also generally encouraging, particularly since the number of clusters does not need to be supplied as a parameter.

The results also demonstrate important practical consequences: the task of checking test outputs may potentially be reduced significantly to examining a relatively small proportion of the data to discover a large proportion of the failures. The approach has also been shown to be robust to a drop in the failure rate—all the way down to 1 % of the output—and initial results suggest that when the failure rate is likely to be low then the number of clusters should also be small.

Future research will be devoted to further empirical investigation of the effectiveness of our approach as an automated oracle, to corroborate the findings and to increase their external validity, particularly by exploring a wider range of programs and faults. Additional work includes exploring other anomaly detection strategies such as classification (mainly based on semi-supervised learning) with the aim of increasing the failure detection ability and reducing the false positive rate.