In this section, we present a synthesis of the data extracted from the primary studies, in order to provide detailed answers to the research questions.
Context
Addressed Problem (RQ 1.1)
Figure 6 presents the paper distribution across the different addressed problems. Overall, 11 main problems were identified:
Realism of Test Input Data
Input generation should be targeted at creating input data that can expose faults in the considered system, while remaining representative of real-world scenarios (Udeshi and Chattopadhyay 2019; Tian et al. 2018). Indeed, a fault exposed by a test input that cannot occur in practice is not a real fault. Udeshi and Chattopadhyay (2019) propose a test input generation approach that mutates inputs so that the result conforms to a given grammar, which characterises the validity domain. Tian et al. (2018) produce artificial inputs that represent real driving scenes under different conditions.
A further challenge is assessing whether the results obtained in a simulated environment also transfer to the real world (de Oliveira Neves et al. 2016; Wolschke et al. 2018; Li et al. 2016). Two works propose to generate realistic test scenarios either from in-field data (de Oliveira Neves et al. 2016) or by mining test cases from real-world traffic situations or traffic simulators (Wolschke et al. 2018).
Test Adequacy Criteria
Twelve papers define metrics to measure how adequate a test suite is for assessing the quality of an MLS, and often exploit them to drive test input generation. Classical adequacy criteria based on the code’s control flow graph are ineffective for NNs, since 100% control flow coverage of the code of an NN can typically be reached with a few inputs. Hence, researchers have defined novel test adequacy criteria specifically targeted to neural networks (Kim et al. 2019; Ma et al. 2018b, 2019; Sekhon and Fleming 2019; Sun et al. 2018a, b; Pei et al. 2017; Shen et al. 2018; Guo et al. 2018; Xie et al. 2019).
Behavioural Boundaries Identification
Similar inputs may unexpectedly trigger different behaviours of an MLS. A major challenge is identifying the boundaries between different behaviours in the input space (Mullins et al. 2018; Tuncali and Fainekos 2019), which is related to boundary-value analysis in software testing (Young and Pezzè 2005). For instance, Tuncali and Fainekos (2019) investigate similar scenarios that trigger different behaviours of autonomous vehicles in safety critical settings, e.g., nearly avoidable vehicle collisions.
Scenario Specification and Design
For scenario-based test cases, one fundamental challenge is related to the specification and design of the environment in which the MLS operates. In fact, only a high fidelity simulation of the environment can produce realistic and meaningful synthetic data (Klueck et al. 2018; Fremont et al. 2019; Majumdar et al. 2019).
Oracle
Overall, we found 13 papers in our pool that tackle the oracle problem for MLSs (Zheng et al. 2019; Xie et al. 2011; Nakajima and Bui 2016, 2018, 2019; Qin et al. 2018; Cheng et al. 2018b; Ding et al. 2017; Gopinath et al. 2018; Murphy et al. 2007a, 2008; Saha and Kanewala 2019; Xie et al. 2018). The challenge is to assess the correctness of an MLS’s behaviour, which may be stochastic, due to the non-deterministic nature of training (e.g., the random initialisation of weights or the use of stochastic optimisers), and which depends on the choice of the training set. The vast majority of the proposed oracles leverage metamorphic relations among input data to decide whether the execution with new inputs is a pass or a fail, under the assumption that such new inputs share similarities with inputs having known labels (Xie et al. 2011; Cheng et al. 2018b; Ding et al. 2017; Saha and Kanewala 2019).
Faults and Debugging
Eight works considered in our mapping are related to faults in MLSs. Six of them address the problems of studying and defining the spectrum of bugs in MLSs, and of automating the debugging of MLSs (Cheng et al. 2018a; Zhang et al. 2018a; Ma et al. 2018c; Odena et al. 2019; Dwarakanath et al. 2018; Eniser et al. 2019). Concerning the former, two studies in our pool present empirical studies on the bugs affecting MLSs (Cheng et al. 2018a; Zhang et al. 2018a). Indeed, the very notion of a fault for an MLS is more complex than in traditional software. The code that builds the MLS may be bug-free, yet the system might still deviate from the expected behaviour due to faults introduced in the training phase, such as the misconfiguration of some learning parameters or the use of an unbalanced/non-representative training set (Humbatova et al. 2020; Islam et al. 2019).
Regarding debugging automation, four studies address the problem of debugging an MLS (Ma et al. 2018c; Odena et al. 2019) or localising the faults within an MLS (Dwarakanath et al. 2018; Eniser et al. 2019). The challenge in this case is to unroll the hidden decision-making policy of the ML model, which is driven by the data it is fed with. Two other papers (Li et al. 2018; Rubaiyat et al. 2018) investigate how to inject faults into MLSs in order to obtain faulty versions of the system under test.
Regression Testing
Five papers deal with the regression testing problem in the context of MLSs (Byun et al. 2019; Shi et al. 2019; Zhang et al. 2019; Groce et al. 2014; Wolschke et al. 2017), i.e., the problem of selecting a small set of test scenarios that ensure the absence of mis-behaviours on inputs that were managed correctly by the previous version of the MLS. The works by Byun et al. (2019) and by Shi et al. (2019) both propose a test prioritisation technique to reduce the effort of labelling new instances of data. Groce et al. (2014) deal with test selection for MLSs, whereas Wolschke et al. (2017) perform test minimisation by identifying nearly-similar (likely redundant) behavioural scenarios in the training set.
Online Monitoring and Validation
Eight works address the problem of online monitoring for validating the input at runtime. Since during development/training it is impossible to foresee all potential execution contexts/inputs that the MLS may be exposed to, it is essential to keep monitoring the effectiveness of the system after it is deployed “in the field”, possibly preventing misbehaviours when an anomalous/invalid input is being processed by the MLS.
Six of them leverage anomaly detection techniques to identify unexpected execution contexts during the operation of MLSs (Henriksson et al. 2019; Patel et al. 2018; Aniculaesei et al. 2018; Wang et al. 2019; Bolte et al. 2019; Zhang et al. 2018b), whereas two papers are related to online risk assessment and failure probability estimation for MLSs (Strickland et al. 2018; Uesato et al. 2019).
Cost of Testing
Performing MLS testing can be particularly costly, especially in resource-constrained settings (e.g., during system or in-field testing) and in the presence of high-dimensional data. Eight papers tackle this problem in the automotive domain (Abdessalem et al. 2016, 2018a; Beglerovic et al. 2017; Zhao and Gao 2018; Bühler and Wegener 2004; Murphy et al. 2009; Abeysirigoonawardena et al. 2019; Tuncali et al. 2018). In this domain, comprehensive in-field testing is prohibitively expensive in terms of required time and resources. Therefore, simulation platforms are typically used to test MLSs, since they allow re-testing new system releases under a large number of conditions, as well as in challenging and dangerous circumstances (e.g., adverse weather, or adversarial pedestrians suddenly crossing the road) (Stocco et al. 2020).
Integration of ML Models
Two papers in our pool test the interplay of different ML models within the same system (Abdessalem et al. 2018b; Zhang et al. 2016). Abdessalem et al. (2018b) address the functional correctness of multiple ML models interacting within autonomous vehicles. Differently, Zhang et al. (2016) focus on different levels of metamorphic testing applied to two different computer vision components within a pipeline.
Data Quality Assessment
MLSs may exhibit inadequate behaviours due to poor training data, i.e., inputs that are not representative of the entire input space. At the same time, low-quality test data may produce misleading information about the quality of the MLS under test. Hence, a key step towards improving MLS quality is achieving high training/test data quality (Ma et al. 2018d; Udeshi et al. 2018).
Testing Levels (RQ 1.2)
Figure 7 illustrates graphically the paper distribution across testing levels. Five works (7%) manipulate only the input data, i.e., they perform input level testing (Bolte et al. 2019; Byun et al. 2019; Henriksson et al. 2019; Wang et al. 2019; Wolschke et al. 2018). The majority of the papers (64%) operate at the ML model level (model level testing) (Cheng et al. 2018a; Ding et al. 2017; Du et al. 2019; Dwarakanath et al. 2018; Eniser et al. 2019; Gopinath et al. 2018; Groce et al. 2014; Guo et al. 2018; Kim et al. 2019; Li et al. 2018; Ma et al. 2018b, c, 2018d, 2019; Murphy et al. 2007a, b, 2008, 2008, 2009; Nakajima and Bui 2016, 2018, 2019; Odena et al. 2019; Patel et al. 2018; Pei et al. 2017; Qin et al. 2018; Saha and Kanewala 2019; Sekhon and Fleming 2019; Shen et al. 2018; Shi et al. 2019; Spieker and Gotlieb 2019; Strickland et al. 2018; Sun et al. 2018a, b; Tian et al. 2018; Udeshi and Chattopadhyay 2019; Udeshi et al. 2018; Uesato et al. 2019; Xie et al. 2018, 2019, 2011; Zhang et al. 2018a, b, 2019; Zhao and Gao 2018), whereas 27% operate at the system level (Abdessalem et al. 2016, 2018a; Abeysirigoonawardena et al. 2019; Aniculaesei et al. 2018; Beglerovic et al. 2017; Bühler and Wegener 2004; Cheng et al. 2018b; Fremont et al. 2019; Klueck et al. 2018; Li et al. 2016; Majumdar et al. 2019; Mullins et al. 2018; de Oliveira Neves et al. 2016; Rubaiyat et al. 2018; Tuncali et al. 2018, 2019; Wolschke et al. 2017; Zhang et al. 2016; Zheng et al. 2019). Only one work considers multiple interacting ML models at the integration level (Abdessalem et al. 2018b). This result indicates that ML models are mostly tested “in isolation”, whereas it would be also important to investigate how failures of these components affect the behaviour of the whole MLS (i.e., whether model level faults propagate to the system level).
Domains (RQ 1.3)
Figure 8 illustrates the paper distribution across the MLS domains. More than half of the analysed papers (56%) propose and evaluate a technique which is domain-agnostic, i.e., in principle it may be applicable to any MLS (Aniculaesei et al. 2018; Byun et al. 2019; Cheng et al. 2018a, b; Du et al. 2019; Eniser et al. 2019; Guo et al. 2018; Henriksson et al. 2019; Kim et al. 2019; Li et al. 2018; Ma et al. 2018b, c, 2018d, 2019; Murphy et al. 2007a, 2007b, 2008, 2008, 2009; Nakajima and Bui 2016, 2018, 2019; Odena et al. 2019; Pei et al. 2017; Saha and Kanewala 2019; Sekhon and Fleming 2019; Shen et al. 2018; Shi et al. 2019; Sun et al. 2018a, b; Tian et al. 2018; Udeshi and Chattopadhyay 2019; Uesato et al. 2019; Xie et al. 2018, 2019, 2011; Zhang et al. 2018a, 2019; Zhao and Gao 2018). Around 30% of the proposed approaches are designed for autonomous systems (Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Klueck et al. 2018; Li et al. 2016; Mullins et al. 2018; de Oliveira Neves et al. 2016; Patel et al. 2018; Strickland et al. 2018; Wolschke et al. 2017; Fremont et al. 2019), including self-driving cars (Bolte et al. 2019; Majumdar et al. 2019; Rubaiyat et al. 2018; Wolschke et al. 2018; Zhang et al. 2018b) and ADAS (Tuncali et al. 2018, 2019; Abdessalem et al. 2016, 2018a, b).
The prevalence of autonomous systems, and in particular self-driving cars, indicates that safety-critical domains are those in highest demand of techniques to ensure the dependability and reliability of such systems, with testing approaches specifically designed for their peculiar features.
Algorithms (RQ 1.4)
Figure 9 illustrates the paper distribution across the ML algorithms to which the proposed testing solutions are applied. In some papers, the proposed technique has been applied to more than one algorithm. The majority of techniques are generically applicable to NNs (25 papers), i.e., regardless of the purpose for which the NN is used (Byun et al. 2019; Ding et al. 2017; Du et al. 2019; Eniser et al. 2019; Gopinath et al. 2018; Guo et al. 2018; Kim et al. 2019; Li et al. 2018; Ma et al. 2018b, c, d, 2019; Odena et al. 2019; Pei et al. 2017; Sekhon and Fleming 2019; Shen et al. 2018; Spieker and Gotlieb 2019; Sun et al. 2018a, b; Tian et al. 2018; Uesato et al. 2019; Wang et al. 2019; Xie et al. 2011; Zhang et al. 2016, 2018b). Only one paper (Du et al. 2019) specifically targets Recurrent Neural Networks (RNNs), which indicates that the SE literature has barely considered the testing of NNs that process sequential data. The second most prevalent category (17 papers) concerns autonomous driving algorithms (Abeysirigoonawardena et al. 2019; Aniculaesei et al. 2018; Bolte et al. 2019; Fremont et al. 2019; Klueck et al. 2018; Li et al. 2016; Majumdar et al. 2019; Mullins et al. 2018; de Oliveira Neves et al. 2016; Patel et al. 2018; Rubaiyat et al. 2018; Strickland et al. 2018; Tuncali et al. 2018, 2019; Wolschke et al. 2017, 2018; Zhao and Gao 2018). The prevalence of NNs matches the growing popularity and success of this approach to machine learning. Since NNs are general function approximators, they can be applied to a wide range of problems. Hence, testing techniques that prove to be effective on NNs will have a very wide range of application scenarios.
Proposed Approach
In the following, we present an overview of the approaches proposed in the analysed papers. We provide information on general properties of these approaches such as their generated artefacts, context model and public availability. Moreover, we focus on the specific attributes of the testing process such as the input generation method, test adequacy criteria and the oracle mechanism adopted in each of the papers.
Test Artefacts (RQ 2.1)
Overall, 55 out of 70 papers generate some artefact as a part of their approach. Figure 10 reports the types of artefacts that have been produced by two or more works. Nearly 60% (33 out of 55) of the papers present various methods to generate test inputs for the MLS under test (Abdessalem et al. 2016, 2018a, b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Du et al. 2019; Eniser et al. 2019; Fremont et al. 2019; Guo et al. 2018; Klueck et al. 2018; Li et al. 2016; Ma et al. 2018c, 2019; Majumdar et al. 2019; Murphy et al. 2007a, b; de Oliveira Neves et al. 2016; Odena et al. 2019; Pei et al. 2017; Sekhon and Fleming 2019; Sun et al. 2018a, b; Tian et al. 2018; Tuncali et al. 2018, 2019; Udeshi et al. 2018, 2019; Wolschke et al. 2017, 2018; Xie et al. 2019; Zhang et al. 2018b; Zheng et al. 2019). However, what a test input represents differs across the proposed approaches and depends heavily on the domain of the tested system. As per our analysis, the most popular forms of test input are images and test scenario configurations. Inputs in the form of images are generally used with classification systems or with lane keeping assistance systems of self-driving cars, which aim to predict the steering angle from an image of a road taken by the camera sensor. When the MLS under test handles scenarios with two or more interacting objects, the input for such a system is a test scenario configuration. For example, in the paper by Abdessalem et al. (2018b), the input of the self-driving car simulation is a vector of configurations for each of the objects involved, such as the initial position of the car, the initial position of the pedestrians, the positions of the traffic signs, and the degree of fog.
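To make the notion of a scenario-configuration input concrete, the following minimal sketch shows how such a vector of configurations could be represented in code. All field names and values are hypothetical; they only illustrate the kind of parameters described above and are not the actual interface used by Abdessalem et al. (2018b).

```python
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    """Hypothetical scenario configuration for an ADAS simulation,
    loosely inspired by the description of Abdessalem et al. (2018b).
    Field names are illustrative, not the authors' actual API."""
    car_start_x: float              # initial position of the ego car (m)
    car_start_y: float
    pedestrian_start_x: float       # initial position of the pedestrian (m)
    pedestrian_start_y: float
    traffic_sign_positions: tuple   # positions of the traffic signs
    fog_density: float              # degree of fog in [0, 1]

# A single test input is one point in this configuration space:
scenario = ScenarioConfig(0.0, 0.0, 35.0, 2.5, ((50.0, 3.0),), 0.4)
```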
In 12 out of 55 (22%) papers, the produced artefact is an oracle (Qin et al. 2018; Xie et al. 2018; Ding et al. 2017; Dwarakanath et al. 2018; Murphy et al. 2008, 2009; Nakajima and Bui 2016, 2018, 2019; Saha and Kanewala 2019; Xie et al. 2011; Zhang et al. 2016). The main focus of 11 papers from this list is a set of metamorphic relationships (MRs), which are then used to generate a metamorphic oracle. Only one work (Qin et al. 2018) proposes a differential oracle based on program mirroring.
Compared to input generation, the oracle problem in MLS testing has received substantially less attention, indicating the need for further approaches to produce effective MLS oracles. System level oracles are particularly difficult to define, being extremely domain specific (e.g., in the self-driving car domain, they require the definition of safe driving conditions and thresholds). Moreover, they often take the form of continuous quality functions (e.g., quality of driving metrics) rather than binary ones (e.g., the car crashing or not).
Test Adequacy (RQ 2.2)
Test adequacy criteria have been used in 24 papers out of 70 (Du et al. 2019; Mullins et al. 2018; Murphy et al. 2007a, 2009; Pei et al. 2017; Li et al. 2018; Qin et al. 2018; Abeysirigoonawardena et al. 2019; Udeshi et al. 2018, 2019; Ma et al. 2018b, 2019; de Oliveira Neves et al. 2016; Xie et al. 2018, 2011; Eniser et al. 2019; Odena et al. 2019; Li et al. 2016; Uesato et al. 2019; Nakajima 2018; Zhang et al. 2019; Nakajima and Bui 2016; Sekhon and Fleming 2019; Dwarakanath et al. 2018). Overall, 28 test adequacy criteria were used or proposed in such papers. The work by Pei et al. (2017) was the first to propose using neuron activations as part of an adequacy criterion. A neuron is considered activated if its output value is higher than a predefined threshold. Neuron coverage (NC) of a set of test inputs is defined as the proportion of activated neurons over all neurons when all available test inputs are supplied to the NN. The authors suggest that, at a high level, this metric is similar to test coverage of traditional systems, as it measures the parts of the NN’s logic exercised by the input data.
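As an illustration, the following minimal sketch computes neuron coverage from pre-recorded activation values; the layer-wise data layout and the default threshold are our own assumptions for the example, not details prescribed by Pei et al. (2017).

```python
import numpy as np

def neuron_coverage(activations, threshold=0.5):
    """Minimal sketch of Neuron Coverage.

    `activations` is assumed to be a list of 2-D arrays, one per layer,
    with shape (n_inputs, n_neurons_in_layer): the (scaled) neuron outputs
    recorded while running the whole test suite through the network.
    A neuron counts as covered if it exceeds `threshold` for at least
    one test input.
    """
    covered, total = 0, 0
    for layer_acts in activations:
        covered += int(np.sum(np.any(layer_acts > threshold, axis=0)))
        total += layer_acts.shape[1]
    return covered / total
```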
Ma et al. (2018b) propose a set of five fine-grained adequacy criteria that they classify into neuron-level and layer-level. They use the activation values of each neuron obtained from the training data and divide the range of values observed for each neuron into k buckets. The ratio of the number of buckets covered by the test inputs to the overall number of buckets (k multiplied by the number of neurons) defines the k-multi-section neuron coverage (KMNC). If the activation value of a neuron is not in the range found in the training data, it is said to fall into a corner-case region: if the activation value is higher than the maximum of the range, it is in the upper corner-case region; if it is lower than the minimum of the range, it belongs to the lower corner-case region. Strong neuron activation coverage (SNAC) is defined as the ratio of the number of neurons whose upper corner-case region is covered to the overall number of neurons. Neuron boundary coverage (NBC) is defined as the ratio of the number of covered corner-case regions, counting the upper and lower region of each neuron separately, to the total number of corner-case regions (the number of neurons multiplied by two).
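The following sketch illustrates how these three criteria could be computed for a single layer from recorded activation values. It is a simplified illustration under our own assumptions about the data layout, not the reference implementation by Ma et al. (2018b).

```python
import numpy as np

def kmnc_snac_nbc(train_acts, test_acts, k=10):
    """Simplified sketch of KMNC, SNAC and NBC for one layer.

    `train_acts` and `test_acts` are assumed to have shape
    (n_inputs, n_neurons). Returns (KMNC, SNAC, NBC).
    """
    lo, hi = train_acts.min(axis=0), train_acts.max(axis=0)
    n_neurons = train_acts.shape[1]

    # KMNC: fraction of the k buckets per neuron hit by at least one test input.
    width = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    buckets = np.zeros((n_neurons, k), dtype=bool)
    in_range = (test_acts >= lo) & (test_acts <= hi)
    idx = np.clip(((test_acts - lo) / width * k).astype(int), 0, k - 1)
    for n in range(n_neurons):
        buckets[n, idx[in_range[:, n], n]] = True
    kmnc = buckets.sum() / (k * n_neurons)

    # SNAC: neurons whose upper corner-case region is hit, over all neurons.
    upper = np.any(test_acts > hi, axis=0)
    snac = upper.sum() / n_neurons

    # NBC: covered upper and lower corner-case regions over 2 * n_neurons.
    lower = np.any(test_acts < lo, axis=0)
    nbc = (upper.sum() + lower.sum()) / (2 * n_neurons)
    return kmnc, snac, nbc
```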
Kim et al. (2019) note that neuron coverage and k-multi-section neuron coverage are not practically useful, as they carry little information about individual inputs. They argue that it is not self-evident that a higher NC indicates a better input, as some inputs naturally activate more neurons. They also note that KMNC does not capture how far the neuron activations go beyond the observed range, making it hard to assess the value of each input. To overcome these limitations they propose a new metric, Surprise Adequacy (SA), which aims to quantify the degree of surprise (i.e., novelty with respect to the training set) of the neuron activation vector. Surprise adequacy has two variations: likelihood-based and distance-based. The Distance-based Surprise Adequacy (DSA) is calculated using the Euclidean distance between the activation traces of a given input and the activation traces observed during training. The Likelihood-based Surprise Adequacy (LSA) uses kernel density estimation to approximate the probability density of each activation value, and obtains the surprise of the input as its (log-inverse) probability, computed using the estimated density.
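The sketch below shows simplified versions of LSA and DSA, assuming numpy and scipy are available and that activation traces have already been recorded. The original technique additionally restricts the computation to the traces of the predicted class and filters out low-variance neurons; those details are omitted here, so this is an approximation of the idea rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def lsa(train_traces, new_trace):
    """Simplified Likelihood-based Surprise Adequacy: kernel density
    estimation over the training activation traces (shape: n_samples x
    n_neurons), surprise = negative log of the density at the new trace.
    Assumes n_samples is large enough for a non-singular KDE."""
    kde = gaussian_kde(train_traces.T)
    density = kde(new_trace.reshape(-1, 1))[0]
    return -np.log(density + 1e-12)

def dsa(train_traces, train_labels, new_trace, new_label):
    """Simplified Distance-based Surprise Adequacy: distance to the nearest
    training trace of the predicted class, normalised by the distance from
    that trace to the nearest trace of any other class (two classes assumed
    to be present)."""
    same = train_traces[train_labels == new_label]
    other = train_traces[train_labels != new_label]
    d_same = np.linalg.norm(same - new_trace, axis=1)
    x_a = same[d_same.argmin()]
    d_a = d_same.min()
    d_b = np.linalg.norm(other - x_a, axis=1).min()
    return d_a / d_b
```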
Neuron Coverage, KMNC and Surprise Adequacy are all metrics that target feed-forward DL systems. The only work that addresses coverage criteria for Recurrent Neural Networks (RNNs) is the one by Du et al. (2019). In this work, the authors model an RNN as an abstract state transition system to characterise its internal behaviours. Based on the abstract model, they propose five coverage criteria, two of which address the coverage of states and three the coverage of transitions.
Figure 11 shows how often each adequacy criterion was used. Overall, the data indicate a relatively wide adoption of the proposed adequacy criteria. Indeed, the availability of ML-specific ways to measure the adequacy of the test data is crucial for MLS testing. Only a few papers adopted adequacy criteria for black-box testing, e.g., scenario coverage, which is useful when white-box access is not available and the behaviour of the whole system is of interest.
Test Input Generation (RQ 2.3)
Overall, 48 out of 70 papers describe how they generate inputs. As some papers use more than one input generation technique, our final list contains 52 elements, as illustrated in Fig. 12.
Our analysis (see Fig. 12) shows that the most widely applied technique for input generation is input mutation (Murphy et al. 2008, b; Odena et al. 2019; Rubaiyat et al. 2018; Tian et al. 2018; Ding et al. 2017; Du et al. 2019; Dwarakanath et al. 2018; Nakajima and Bui 2016; Nakajima 2019; Saha and Kanewala 2019; Xie et al. 2011, 2018, 2019; de Oliveira Neves et al. 2016; Guo et al. 2018) (16 out of 52, 31%), which consists of creating new inputs by applying semantic information-preserving transformations to existing inputs. The majority of papers using input mutation are on metamorphic testing (Ding et al. 2017; Du et al. 2019; Dwarakanath et al. 2018; Murphy et al. 2008; Nakajima and Bui 2016; 2019; Saha and Kanewala 2019; Xie et al. 2011, 2018, 2019; de Oliveira Neves et al. 2016; Guo et al. 2018) (11 out of 16), and the corresponding transformations are defined by a metamorphic relationship. Examples of such input mutations are affine transformations (Tian et al. 2018), changes of the pixel values (Nakajima 2019) and cropping (Ding et al. 2017) for images, or alterations that mimic environmental interference for audio files (Du et al. 2019), designed so that they introduce changes that are imperceptible to humans. In contrast, the approach by Rubaiyat et al. (2018) changes input images by simulating environmental conditions such as rain, fog, snow, and occlusion created by mud/snow on the camera. The work by Tian et al. (2018) also transforms input images by mimicking different real-world phenomena like camera lens distortions, object movements, or different weather conditions. Their goal is to automatically generate test inputs that maximise neuron coverage. Similarly, the work by Guo et al. (2018) has the optimisation objective of reaching higher neuron coverage, while also exposing exceptional behaviours. To achieve this goal, they mutate input images and keep the mutated versions that contribute to a certain increase in neuron coverage for subsequent fuzzing. The applied mutations have to be imperceptible to humans, while the predictions of the MLS for the original and mutated inputs should differ (i.e., the MLS exhibits a misbehaviour).
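As an illustration of input mutation combined with a metamorphic check, the sketch below applies two simple label-preserving image transformations and flags a mutant whose prediction differs from the original. The specific transformations and parameter values are illustrative choices of ours, not the exact operators used in the cited papers.

```python
import numpy as np

def mutate_brightness(image, delta=15):
    """Semantic-preserving mutation: shift pixel intensities slightly.
    `image` is a uint8 array; the expected label is assumed unchanged."""
    return np.clip(image.astype(int) + delta, 0, 255).astype(np.uint8)

def mutate_translate(image, shift=2):
    """Semantic-preserving mutation: shift the image a few pixels to the
    right (a crude affine transformation), padding with edge values."""
    pad = ((0, 0), (shift, 0)) + ((0, 0),) * (image.ndim - 2)
    return np.pad(image, pad, mode="edge")[:, :image.shape[1]]

def violates_mr(model_predict, image, mutants):
    """The metamorphic relation here is 'the prediction stays the same
    under imperceptible transformations'; any disagreement is flagged
    as a potential misbehaviour."""
    original = model_predict(image)
    return any(model_predict(m) != original for m in mutants)
```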
Another widely used methodology to generate test inputs is the search-based approach (Abdessalem et al. 2016, 2018a, b; Bühler and Wegener 2004; Udeshi et al. 2018; Tuncali and Fainekos 2019; Eniser et al. 2019; Mullins et al. 2018; Pei et al. 2017; Udeshi and Chattopadhyay 2019; Sekhon and Fleming 2019; Beglerovic et al. 2017) (13 out of 52, 25%). In six papers the generation of inputs using a search-based approach aims to detect collision scenarios for autonomous driving systems. Therefore, their fitness functions use metrics such as distance to other static or dynamic objects (Bühler and Wegener 2004; Abdessalem et al. 2016, 2018b, 2018a), time to collision (Tuncali and Fainekos 2019; Beglerovic et al. 2017; Abdessalem et al. 2016), speed of the vehicle (Tuncali and Fainekos 2019; Abdessalem et al. 2018b) or level of confidence in the detection of the object in front of the vehicle (Abdessalem et al. 2016, 2018a). In contrast, Mullins et al. (2018) aim to identify test inputs for an autonomous system that are located in its performance boundaries, i.e., in the regions of the input space where small alterations to the input can cause transitions in the behaviour, resulting in major performance changes.
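A minimal sketch of the search-based idea is shown below: a (1+1)-style hill climber perturbs scenario parameters and keeps mutants that reduce the minimum distance between the ego vehicle and another actor, i.e., that move the scenario closer to a collision. The `simulate` function stands in for a driving simulator and is an assumption of this sketch, as is the simple fitness definition.

```python
import random

def fitness(simulate, scenario):
    """Illustrative fitness for collision-oriented testing: the smaller the
    minimum distance to the other actor observed during the simulated run,
    the riskier (fitter) the scenario. `simulate` is assumed to return the
    per-timestep distances for a scenario given as a dict of floats."""
    return min(simulate(scenario))  # minimise: 0 means a collision

def hill_climb(simulate, scenario, steps=100, sigma=0.5):
    """(1+1) search: perturb one scenario parameter at a time and keep the
    mutant if it brings the vehicle closer to a collision."""
    best, best_fit = scenario, fitness(simulate, scenario)
    for _ in range(steps):
        cand = dict(best)
        key = random.choice(list(cand))
        cand[key] += random.gauss(0.0, sigma)
        cand_fit = fitness(simulate, cand)
        if cand_fit < best_fit:
            best, best_fit = cand, cand_fit
    return best, best_fit
```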
The majority of the works that use adversarial input generation (5 papers out of 52, 10%) employ existing state-of-the-art attack methods to generate such inputs (Cheng et al. 2018a; Kim et al. 2019; Wang et al. 2019; Zhang et al. 2019). In contrast, the work by Abeysirigoonawardena et al. (2019) takes a more targeted approach, which aims to create adversarial self-driving scenarios that expose poorly-engineered or poorly-trained self-driving policies, and therefore increase the risk of collision with simulated pedestrians and vehicles. While adversarial inputs can successfully trigger misbehaviours of the MLS under test, they are often very unlikely or even impossible to occur in reality, unless the system is under attack by a malicious user. However, security verification and validation of MLSs is a research area of its own, which the present systematic mapping does not cover.
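As an example of the kind of off-the-shelf attack these works reuse, the sketch below implements the Fast Gradient Sign Method (FGSM) for a Keras classifier. FGSM is a standard attack from the adversarial machine learning literature, not a technique proposed by the surveyed papers, and the epsilon value and the assumption that the model outputs class probabilities are our own choices.

```python
import tensorflow as tf

def fgsm(model, images, labels, eps=0.01):
    """Sketch of FGSM: perturb each input in the direction that increases
    the loss. `images` is a batched float tensor in [0, 1]; `labels` holds
    the true class index of each image; the model is assumed to output
    class probabilities."""
    x = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(labels, model(x))
    grad = tape.gradient(loss, x)
    # One signed gradient step, then clip back to the valid pixel range.
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)
```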
Test Oracles (RQ 2.4)
Figure 13 provides an overview of the types of oracles that have been adopted with MLSs. The most popular type of oracle is the metamorphic oracle, used in 22 out of 50 (44%) papers (Aniculaesei et al. 2018; Ding et al. 2017; Du et al. 2019; Dwarakanath et al. 2018; Guo et al. 2018; Murphy et al. 2008, 2008, 2009; Nakajima and Bui 2016, 2018, 2019; Saha and Kanewala 2019; Tian et al. 2018; Udeshi et al. 2018; Xie et al. 2011, 2018, 2019; Zhang et al. 2016, 2018b; Sun et al. 2018a; Tuncali et al. 2018, 2019). A central element of a metamorphic oracle is a set of metamorphic relationships that are derived from the innate characteristics of the system under test. The new test inputs are generated from the existing ones using MRs so that the outputs for these inputs can be predicted. Out of 22 papers adopting a metamorphic oracle, 11 focus on proposing and evaluating novel MRs for different kinds of MLS. However, these papers mostly consider classical supervised learning algorithms, such as k-nearest neighbours, naive Bayes classifier, support vector machine, and ranking algorithms. The work by Xie et al. (2018) proposes MRs for unsupervised machine learning algorithms, such as clustering algorithms. The remaining papers (11 out of 22) use MRs that are already available in the literature or that encode well-known domain-specific properties of the system.
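For illustration, the sketch below encodes one classic MR for classical supervised classifiers, in the spirit of Xie et al. (2011): consistently permuting the feature columns of both training and test data should not change the predictions of a k-nearest-neighbours classifier. The use of scikit-learn and the specific classifier are our own choices for the example, not prescribed by the cited work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def check_permutation_mr(X_train, y_train, X_test, seed=0):
    """Metamorphic check: permuting the feature order consistently in the
    training and test data should leave a kNN classifier's predictions
    unchanged. A mismatch is a test failure, and no ground-truth labels
    for X_test are needed."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X_train.shape[1])

    base = KNeighborsClassifier().fit(X_train, y_train).predict(X_test)
    follow = (KNeighborsClassifier()
              .fit(X_train[:, perm], y_train)
              .predict(X_test[:, perm]))
    return np.array_equal(base, follow)
```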
In 10 out of 50 (20%) papers, a domain-specific failure of the MLS under test is used as an oracle (Abdessalem et al. 2016, 2018a, b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Odena et al. 2019; Rubaiyat et al. 2018; Uesato et al. 2019; Li et al. 2016). In general, failure is regarded as the weakest form of oracle. However, only one of the analysed papers (Odena et al. 2019) conforms to such a definition, i.e., it considers the crash of the system under test. In all remaining cases, more complex, domain-specific deviations from the expected behaviour are adopted, such as collisions with pedestrians or other vehicles, not stopping at a stop sign, or exceeding the speed limit.
A differential or cross-referencing oracle is a type of “pseudo-oracle” (Davis and Weyuker 1981) in which multiple implementations of the same algorithm are compared against each other. If the results are not the same, then one or more of the implementations may contain a defect. This type of oracle was used in six analysed papers (12%) (Murphy et al. 2007b, 2007a; Pei et al. 2017; Sekhon and Fleming 2019; Udeshi and Chattopadhyay 2019; Qin et al. 2018). While the work by Qin et al. (2018) proposes a program synthesis approach that constructs twin “oracle-alike mirror programs”, the remaining papers find different implementations of the MLS under test and use them to cross-check the results. A drawback of this type of oracle is the attribution of the fault when the considered implementations produce different results. This was the case in the work by Murphy et al. (2007b), where the authors commented that “there was no way to know which output was correct”. On the other hand, three papers from our pool (Pei et al. 2017; Udeshi and Chattopadhyay 2019; Sekhon and Fleming 2019) take advantage of such disagreements: inputs that produce different outputs across the implementations are particularly interesting and worth further investigation by the developers. Pei et al. (2017) and Sekhon and Fleming (2019) use such differential behaviour, along with a coverage criterion, as part of a joint optimisation problem aimed at generating erroneous corner-case scenarios for MLSs. Similarly, Udeshi and Chattopadhyay (2019) propose an approach that, given a pair of ML models and a grammar encoding their inputs, searches the input space for inputs that expose differential behaviours.
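The core mechanism of a differential oracle can be sketched as follows; `implementations` maps names to prediction functions and is a placeholder for the independently developed models or programs compared in the cited works.

```python
def differential_oracle(implementations, test_inputs):
    """Minimal sketch of a differential (pseudo-)oracle: run the same inputs
    through several independent implementations and flag every input on
    which their (hashable) outputs disagree as worth manual investigation."""
    suspicious = []
    for x in test_inputs:
        outputs = {name: predict(x) for name, predict in implementations.items()}
        if len(set(outputs.values())) > 1:
            suspicious.append((x, outputs))
    return suspicious
```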
Another commonly used oracle for classifiers (6 papers out of 50, 12%) is the misclassification of manually labelled inputs (Gopinath et al. 2018; Fremont et al. 2019; Ma et al. 2018c, 2019; Zhang et al. 2019; Shi et al. 2019). While using human labels as an oracle is a straightforward approach (especially for data-driven systems such as MLSs), it may require substantial manual effort. Another type of oracle observed during our analysis is mutation killing, which is used in three papers (6%) that either propose (Ma et al. 2018d; Shen et al. 2018) or evaluate (Cheng et al. 2018a) mutation operators for MLSs.
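As a rough illustration of mutation killing at the model level, the sketch below applies a simple weight-perturbation mutation operator to a Keras classifier and checks whether the test set “kills” the mutant. The operator, noise scale and killing condition are simplified stand-ins inspired by, but not identical to, the operators proposed in the cited papers.

```python
import numpy as np
import tensorflow as tf

def mutate_weights(model, layer_index=0, scale=0.05, seed=0):
    """Illustrative mutation operator: add small Gaussian noise to one
    weight tensor of a cloned Keras model and return the mutant."""
    rng = np.random.default_rng(seed)
    mutant = tf.keras.models.clone_model(model)
    mutant.set_weights(model.get_weights())      # copy the trained weights
    weights = mutant.get_weights()
    weights[layer_index] = weights[layer_index] + rng.normal(
        0.0, scale, weights[layer_index].shape)
    mutant.set_weights(weights)
    return mutant

def is_killed(model, mutant, X_test):
    """Simplified killing condition: the mutant is killed if at least one
    test input makes its predicted class differ from the original model's."""
    return bool(np.any(model.predict(X_test).argmax(axis=1)
                       != mutant.predict(X_test).argmax(axis=1)))
```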
Access to the System (RQ 2.5)
The proposed testing approaches require different levels of access to the MLS under test. In 29 cases out of 70 (41%), it is enough to have a black-box access to the system (Abdessalem et al. 2016, 2018a; Abeysirigoonawardena et al. 2019; Aniculaesei et al. 2018; Beglerovic et al. 2017; Bolte et al. 2019; Bühler and Wegener 2004; Xie et al. 2018; Fremont et al. 2019; Klueck et al. 2018; Li et al. 2016; Majumdar et al. 2019; Mullins et al. 2018; Murphy et al. 2008, 2009; Nakajima 2018; de Oliveira Neves et al. 2016; Patel et al. 2018; Qin et al. 2018; Rubaiyat et al. 2018; Tuncali et al. 2018, 2019; Udeshi and Chattopadhyay 2019; Uesato et al. 2019; Wolschke et al. 2018; Wolschke et al. 2017; Zhao and Gao 2018; Zheng et al. 2019; Zhang et al. 2016). In 11 cases (16%), along with the inputs and outputs of MLS, training data should also be available: black-box access is not sufficient and data-box access to the system should be provided (Cheng et al. 2018b; Ding et al. 2017; Henriksson et al. 2019; Saha and Kanewala 2019; Udeshi et al. 2018; Xie et al. 2011; Dwarakanath et al. 2018; Groce et al. 2014; Zhang et al. 2018b; Murphy et al. 2007b; Spieker and Gotlieb 2019). This is mostly the case for the papers on metamorphic testing, in which the authors propose metamorphic relationships that change the training data in some specific way and then analyse the changes in the output of the retrained system. Another example of data-box access is the work by Groce et al. (2014), where the distance between the training data and a test input is used as a metric to address the test selection problem. The remaining 30 cases (43%) require white-box access to the system (Abdessalem et al. 2018b; Byun et al. 2019; Eniser et al. 2019; Kim et al. 2019; Ma et al. 2018b, 2018c, 2018d, 2019; Murphy et al. 2007a, 2008; Nakajima and Bui 2016; Odena et al. 2019; Pei et al. 2017; Sekhon and Fleming 2019; Shi et al. 2019; Strickland et al. 2018; Sun et al. 2018a, 2018b; Tian et al. 2018; Wang et al. 2019; Zhang et al. 2018a, 2019; Li et al. 2018; Du et al. 2019; Shen et al. 2018; Gopinath et al. 2018; Guo et al. 2018; Xie et al. 2019; Cheng et al. 2018a; Nakajima 2019), as they need information on the internal state of the trained model. The most prevalent examples in this category are the approaches that need the values of neuron activations to measure some adequacy criteria (Pei et al. 2017; Ma et al. 2018b) or the weights and biases of the model to apply mutation operators on them (Ma et al. 2018d; Shen et al. 2018).
Context Model (RQ 2.6)
As MLSs can be complex systems that have to operate in various environments and interact with different dynamic objects, they need to be able to determine the context in which they are operating, and adapt their behaviour to context-related norms and constraints. In our pool of papers, 17 works (24%) model the context in which the MLS operates. All of them are in the autonomous driving domain (Abdessalem et al. 2016, 2018a, b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bolte et al. 2019; Bühler and Wegener 2004; Cheng et al. 2018b; Fremont et al. 2019; Klueck et al. 2018; Li et al. 2016; Mullins et al. 2018; de Oliveira Neves et al. 2016; Tuncali et al. 2018, 2019; Wolschke et al. 2017; Zhao and Gao 2018). The context models presented in these papers vary in terms of complexity and number of considered actors. For example, the work by Bühler and Wegener (2004) addresses the problem of autonomous parking and provides a context model in which the car is the only dynamic object and the environment is represented just as a set of geometric points that define the parking space. Similarly, in the work by Beglerovic et al. (2017) the goal of the ADAS is to avoid collisions and the information about the environment is provided in the form of the geometric coordinates of the static obstacles. In contrast, in the work by Abdessalem et al. (2018a) the authors address more complicated scenarios and their context model is much more detailed. In addition to the ADAS under test, they consider additional mobile objects such as pedestrians and other cars. Moreover, they take into account roadside objects, such as trees, parked cars, traffic signs, and describe the environment in terms of types of road, weather conditions and scene light. Considering a complex context can increase the realism of, and thus the confidence in, the testing process, but it is more expensive, especially because there is no standard context model that can be reused.
Availability (RQ 2.7)
Availability of the proposed solutions is important in the field of software testing, as new research contributions very often rely on existing tools and prototypes. The availability of the research artefacts (i.e., experimental data, source code) of the papers on MLS testing is a strong indicator of how effectively future research will build on and compare with the currently existing work. Our results do not paint an optimistic picture on this issue, as for 50 out of 70 (71%) papers there is no available artefact. For 20 papers (29%), the proposed solutions are available in open-source format (Bühler and Wegener 2004; Cheng et al. 2018b; Klueck et al. 2018; Bolte et al. 2019; Fremont et al. 2019; Tuncali et al. 2018, 2019; Wolschke et al. 2017; Uesato et al. 2019; Abdessalem et al. 2016, 2018a, b; Zhao and Gao 2018; Abeysirigoonawardena et al. 2019; Strickland et al. 2018; Patel et al. 2018; Mullins et al. 2018; de Oliveira Neves et al. 2016; Beglerovic et al. 2017; Li et al. 2016), with the exception of one (Zheng et al. 2019) where the tool is available, but the source code is not. It is worth noting that all such papers were published between 2017 and 2019, which may indicate a growing positive trend towards open research in MLS testing.
Evaluation
In the following, we describe the empirical evaluations of the testing solutions proposed by the analysed primary studies. We provide an overview of the evaluation types and methods, as well as information about the objects and the experimental setup.
Evaluation Type (RQ 3.1)
We split the analysed studies based on the type of experimental evaluation: academic or industrial. As shown in Fig. 14, the vast majority of the studies (56) carry out an evaluation in an academic context. Three studies (Bühler and Wegener 2004; Byun et al. 2019; Zheng et al. 2019) perform an evaluation on proprietary systems belonging to an industrial partner. Five works (Murphy et al. 2007a, 2008, Abdessalem et al. 2016, 2018a, b) adopted a combined approach, featuring an evaluation in both academic and industrial contexts, while six works (Zhang et al. 2016; Aniculaesei et al. 2018; Zhao and Gao 2018; Wolschke et al. 2017, 2018, Nakajima 2018) contained no empirical evaluation of any sort. Taking into account that ML is widely used in industry and its applications are growing exponentially, the collected data suggest that industrial research should be given greater focus.
Evaluation Method (RQ 3.2)
We observed a number of different evaluation methods, including experiments with no humans involved, experiments with human evaluation and proof of concept studies. The category of experiments without humans is the largest one, featuring 49 works (70% of the total number of papers). Eleven papers (16%) (Bolte et al. 2019; Cheng et al. 2018b; Fremont et al. 2019; Klueck et al. 2018; Li et al. 2016; Majumdar et al. 2019; Murphy and Kaiser 2008; Nakajima and Bui 2016; Sekhon and Fleming 2019; Tuncali et al. 2018, 2019) provided small-scale exemplary proofs of viability for the proposed approach and are united under the “proof of concept” category. An instance of such an evaluation method is the work by Klueck et al. (2018), where the authors provide an example of their test case generation process focusing only on one selected ontology and its conversion into a combinatorial testing input model. Four studies out of 70 (6%) (Abdessalem et al. 2016, 2018a, b; Groce et al. 2014) included human evaluation in their empirical approach, while the remaining six (9%) (Zhang et al. 2016; Aniculaesei et al. 2018; Zhao and Gao 2018; Wolschke et al. 2017, 2018; Nakajima 2018) carried out no evaluation of any kind (see Fig. 15). Overall, the young discipline of ML testing seems to have already adopted a quite demanding evaluation standard, targeting the empirical method of controlled experiments. When operating at the system level, experiments with humans become critical to assess the validity and severity of the reported failures.
ML Models (RQ 3.3, RQ 3.4, RQ 3.5)
We have analysed the usage of existing ML models for the purpose of evaluating the proposed testing approaches. We found that 43 out of 70 papers (61%) contain a mention of the adopted ML models. Most of them are publicly available; only 26% of these papers used models that are not open-source.
Figure 16 depicts the most popular models (to keep the size of the picture small, we report only the models that were used at least three times in the evaluation of the retrieved papers; the full list is available in our replication package (Riccio et al. 2019)). Our results show that authors tend to reuse widely adopted and open-source models, such as LeNet (LeCun et al. 1998), VGG (Simonyan and Zisserman 2014), or ResNet (He et al. 2016). In three studies (Murphy et al. 2007b; Murphy et al. 2008; Murphy et al. 2009), Murphy et al. used Martirank (Gross et al. 2006), a ranking implementation of the Martingale Boosting algorithm (Long and Servedio 2005).
Among the steering angle prediction models for self-driving cars, five papers (Kim et al. 2019; Tian et al. 2018; Zhang et al. 2018b; Pei et al. 2017; Majumdar et al. 2019) trained their models using the datasets provided by Udacity. Udacity is a for-profit educational organisation that helps students expand their machine learning skills and apply them to the area of autonomous driving. Udacity contributed to an open-source self-driving car project by releasing a self-driving car simulation platform, and by introducing a series of challenges related to various aspects of autonomous driving (e.g., lane keeping, object localisation and path planning). According to our results, the model by the team Chauffeur is the most used (3 papers) (Kim et al. 2019; Tian et al. 2018; Zhang et al. 2018b).
In 58% of the cases, researchers did not use pre-trained models, but rather performed the training themselves. It can be noticed that the most widely used architectures are convolutional networks, which in turn suggests that image recognition is a frequent subject for studies and scientific experiments.
Training Dataset (RQ 3.6)
We studied the datasets that were used to train the ML models considered in the evaluation of the proposed approaches. Out of the 70 relevant studies, 28 (40%) did not use or did not mention any specific dataset. The remaining 42 (60%) studies mention 48 datasets of different size and nature, among which 10 are custom ones. Based on their content, these datasets can be classified into a number of broad categories, illustrated in Fig. 17. We report only categories that contain two or more instances. We can notice that digit recognition is the most frequently tackled task, closely followed by the general image classification problem. Among the remaining categories, the number of datasets used for tasks related to autonomous driving indicates a growing interest in the training of NN-based autonomous vehicles.
Concerning the datasets used, MNIST (LeCun and Cortes 2010) (a labelled dataset of hand-written digits with a training set of 60,000 examples and a test set of 10,000 examples) is the most popular (29%), which is not surprising given the frequent occurrence of digit recognition among the evaluation tasks. CIFAR-10 (Krizhevsky et al. 2009) (a dataset of colour images in 10 classes with a training set of 50,000 images and a test set of 10,000 images) is also widely adopted (16%). In the autonomous driving and ADAS domains, the datasets of real-world driving images released by Udacity are the most used (7%). Among the rarely used datasets are those targeting more specific application domains, such as Drebin (Android malware) (Arp et al. 2014), Cityscapes (a set of stereo video sequences recorded in street scenes) (Cordts et al. 2016) or the UCI (University of California, Irvine) ML Repository: Survive (medical) (Dua and Graff 2017). Moreover, we noticed that the creation of a custom, tailored dataset is also a relatively frequent practice (21% of all datasets used).
System & System Availability (RQ 3.7, RQ 3.8)
A relatively small subset (27%) of the papers studies the behaviour of the whole MLS (Abdessalem et al. 2016, 2018a, b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Cheng et al. 2018b; Fremont et al. 2019; Li et al. 2016; Majumdar et al. 2019; Mullins et al. 2018; Murphy and Kaiser 2008; de Oliveira Neves et al. 2016; Rubaiyat et al. 2018; Strickland et al. 2018; Tuncali et al. 2018, 2019; Uesato et al. 2019; Zheng et al. 2019). From the data in Fig. 18, we can notice that advanced driver-assistance systems (ADAS) are the most widely evaluated MLSs, along with other types of autonomous vehicles (ROS) and unmanned underwater vehicles (UUV). This confirms the high interest of the academic and industrial communities in autonomous systems.
As for availability, 74% of the systems used in the relevant literature are closed-source. The largest proportion of open-source systems consists of advanced driver-assistance systems, contained in the ADAS category.
Simulator (RQ 3.9)
About one fourth of the analysed studies (27%) (Abdessalem et al. 2016, 2018a, b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Cheng et al. 2018b; Fremont et al. 2019; Li et al. 2016; Majumdar et al. 2019; Mullins et al. 2018; Murphy and Kaiser 2008; de Oliveira Neves et al. 2016; Rubaiyat et al. 2018; Strickland et al. 2018; Tuncali et al. 2018, 2019; Uesato et al. 2019; Zheng et al. 2019) make use of a simulator. In such cases, experiments are conducted in a computer simulation of the physical environment, to avoid the costs and issues associated with experiments conducted in the field, as well as to reduce the time necessary to run the experiments. Only one study (Li et al. 2016) independently developed a custom simulator that specifically suits its requirements, while the others adopted and, in some cases, modified existing simulators.
From the adopted solutions (see Fig. 19), the PreScan (International 2017) and Matlab simulation platforms stand out in terms of the number of mentions, each being used by three works (Abdessalem et al. 2016, 2018a, b; Beglerovic et al. 2017; Bühler and Wegener 2004; Tuncali and Fainekos 2019), followed by the Carla (Dosovitskiy et al. 2017) and Udacity simulators, used by two works (Abeysirigoonawardena et al. 2019; Cheng et al. 2018b; Majumdar et al. 2019; Zhang et al. 2018b). The latter simulators are designed specifically for autonomous driving systems. Despite the prevalence of simulators tailored for ADAS and autonomous driving systems, in the list we can see a flight simulator for unmanned aerial vehicles (X-Plane 11) (Byun et al. 2019) and a robot simulator (V-REP, now discontinued and replaced by its successor CoppeliaSim) (Strickland et al. 2018).
Failure Type (RQ 3.10)
An important aspect in the evaluation of any testing solution is the definition of failure. While for 18 (26%) (Aniculaesei et al. 2018; Bolte et al. 2019; Cheng et al. 2018b; Dwarakanath et al. 2018; Gopinath et al. 2018; Henriksson et al. 2019; Klueck et al. 2018; Nakajima and Bui 2016, 2018, 2019; de Oliveira Neves et al. 2016; Spieker and Gotlieb 2019; Wang et al. 2019; Wolschke et al. 2018, 2017; Zhang et al. 2016, 2018a; Zhao and Gao 2018) papers this information is not applicable or not available, 52 (74%) works provide a description of the failure types detected in the experimental evaluation. In case the object of evaluation is an ML model, the failure type is defined based on the type of prediction (e.g., misclassification for a classifier). As the focus is shifted from isolated ML components to MLSs such as autonomous vehicles, the failure type frequently involves the choice of domain-specific parameters (e.g., the amount of deviation from the center line for a lane keeping assistance system).
In line with the results about Training Dataset (Section 5.3.4) and ML Models (Section 5.3.3), one of the most frequent failure types is misclassification. The second most popular category is domain-specific failures defined for autonomous vehicles and driving assistance systems. This set of failures combines several specific instances, the most frequent being deviation from the expected steering angle (five papers (Ma et al. 2018c; Patel et al. 2018; Pei et al. 2017; Tian et al. 2018; Zhang et al. 2018b)) and number of collisions (10 papers (Abdessalem et al. 2018b; Abeysirigoonawardena et al. 2019; Beglerovic et al. 2017; Bühler and Wegener 2004; Li et al. 2016; Majumdar et al. 2019; Rubaiyat et al. 2018; Tuncali et al. 2018; Tuncali and Fainekos 2019; Uesato et al. 2019)). The importance of metamorphic and mutation testing for MLS is reflected in the relatively large number of mentions of the associated types of failures: five papers (Xie et al. 2018; Murphy et al. 2008, 2009; Saha and Kanewala 2019; Xie et al. 2011) used the violation of metamorphic relationships, whereas three papers (Cheng et al. 2018a; Ma et al. 2018d; Shen et al. 2018) used the notion of mutation killing. The full picture is shown in Fig. 20.
Metrics (RQ 3.11)
In this section, we discuss the metrics most frequently adopted to evaluate the proposed approaches. The exhaustive list, obtained from 60 papers (85%), covers a wide range of metrics, depending on the task considered in the evaluation as well as on the testing approach, and can be divided into eight main categories: (1) Effectiveness, (2) Coverage, (3) Similarity, (4) Failures, (5) Mutation Score, (6) Error Rate, (7) Time, (8) Domain Expert Opinion. The categories are listed in descending order with regard to the number of papers that used such metrics in the evaluation; metrics used only once or twice, or that are too specific to be classified, are left aside. The numbers for each category are presented in Fig. 21.
In the Effectiveness category, metrics based on loss and accuracy are the most often adopted ones (ten papers (Eniser et al. 2019; Ma et al. 2018c; Strickland et al. 2018; Li et al. 2018; Ding et al. 2017; Byun et al. 2019; Udeshi and Chattopadhyay 2019; Xie et al. 2019; Kim et al. 2019; Wang et al. 2019)), while precision, recall and F-measure appear in three works (Zheng et al. 2019; Fremont et al. 2019; Mullins et al. 2018) and AUROC in only two papers (Kim et al. 2019; Wang et al. 2019). Coverage is the second most extensive class (14 papers: Sun et al. 2018a, 2018b; Cheng et al. 2018b; Du et al. 2019; de Oliveira Neves et al. 2016; Mullins et al. 2018; Ma et al. 2018b, 2019; Tian et al. 2018; Pei et al. 2017; Sekhon and Fleming 2019; Guo et al. 2018; Xie et al. 2019; Kim et al. 2019), in which we can distinguish a relatively large family of neuron coverage metrics (Ma et al. 2018b; Tian et al. 2018; Pei et al. 2017; Sekhon and Fleming 2019; Guo et al. 2018; Xie et al. 2019; Kim et al. 2019). The variety of metrics stemming from neuron coverage is reviewed in detail in Section 5.2.2. The Time category includes the execution time or the time spent to generate a desired number of inputs; it served as a performance evaluation metric in four papers (Shi et al. 2019; Zhang et al. 2019; Udeshi et al. 2018; Guo et al. 2018). Three studies (Abdessalem et al. 2016, 2018a, b) use domain expert opinions as qualitative indicators to evaluate the proposed approaches. The time performance of an MLS testing approach is particularly important when dealing with complex MLSs, such as self-driving cars, because even in a simulation environment the budget of available system executions is severely limited. Human feedback is also quite important, since failure scenarios might be useless if they fall outside the validity domain.
Comparative Study (RQ 3.12)
In general, comparative studies are important as they show the differences (e.g., pros and cons) between a novel technique and the state of the art. They can guide researchers and practitioners in the choice of an appropriate solution for their specific needs. In our list of relevant papers, 26 (37%) (Abeysirigoonawardena et al. 2019; Bühler and Wegener 2004; Byun et al. 2019; Cheng et al. 2018a; Ding et al. 2017; Fremont et al. 2019; Groce et al. 2014; Guo et al. 2018; Kim et al. 2019; Ma et al. 2018b, 2018c, 2019; Mullins et al. 2018; Saha and Kanewala 2019; Sekhon and Fleming 2019; Shi et al. 2019; Spieker and Gotlieb 2019; Sun et al. 2018a, 2018b; Tuncali et al. 2018, 2019; Udeshi and Chattopadhyay 2019; Udeshi et al. 2018; Xie et al. 2011; Zhang et al. 2019; Zheng et al. 2019) include a comparative evaluation. Test input selection, generation and prioritisation are the techniques most frequently involved in comparative evaluations. For other MLS testing techniques, comparative studies are generally lacking, which may be related to the scarce availability of baseline MLS testing solutions as open-source tools (see results for RQ 2.7).
Experimental Data Availability (RQ 3.13)
Availability of experimental data is crucial for replication studies and for projects that build on top of existing solutions. Unfortunately, only a fraction of the authors of the considered studies made their experimental data publicly available (even though we counted cases in which the data is only partially accessible as “available”). This fraction comprises 13 papers out of 70 (19%) (Abdessalem et al. 2018b; Byun et al. 2019; Eniser et al. 2019; Henriksson et al. 2019; Kim et al. 2019; Pei et al. 2017; Shi et al. 2019; Sun et al. 2018a, 2018b; Tian et al. 2018; Udeshi and Chattopadhyay 2019; Udeshi et al. 2018; Zhang et al. 2018a). This is a rather negative result, which may hinder the growth of research in the field.
Time Budget (RQ 3.14)
Training and testing of MLSs is known to be a very expensive and time-consuming process, so a rough estimate of the average time spent to conduct an experiment in the field is quite useful. However, only 8 (11%) studies (Ma et al. 2018c; Abdessalem et al. 2018a, 2018b; Sun et al. 2018b; Du et al. 2019; Tuncali and Fainekos 2019; Xie et al. 2019; Mullins et al. 2018) report information about the time budget, ranging from 0.5 h to 50 h, with an average of 16.4 h. These values give an order of magnitude of the time budget necessary when testing complex MLSs.
Software Setup (RQ 3.15)
Concerning the software libraries and frameworks used for the implementation and evaluation of the proposed approaches, only 22 (31%) papers (Beglerovic et al. 2017; Bühler and Wegener 2004; Byun et al. 2019; Xie et al. 2018; Ding et al. 2017; Eniser et al. 2019; Groce et al. 2014; Kim et al. 2019; Ma et al. 2018b; Ma et al. 2018d; Murphy et al. 2009; Odena et al. 2019; Pei et al. 2017; Saha and Kanewala 2019; Shi et al. 2019; Spieker and Gotlieb 2019; Tuncali et al. 2018; Tuncali and Fainekos 2019; Udeshi et al. 2018; Xie et al. 2019; Xie et al. 2011; Zhang et al. 2018b) provide such information. As illustrated in Fig. 22, Keras and Tensorflow are the most used frameworks, followed by Matlab. The Ubuntu operating system was explicitly mentioned in six papers (Udeshi et al. 2018; Byun et al. 2019; Eniser et al. 2019; Pei et al. 2017; Murphy et al. 2009; Guo et al. 2018).
Hardware Setup (RQ 3.16)
Only 19 (27%) papers (Bolte et al. 2019; Byun et al. 2019; Xie et al. 2018; Ding et al. 2017; Eniser et al. 2019; Fremont et al. 2019; Guo et al. 2018; Kim et al. 2019; Ma et al. 2018b, 2018d; Murphy et al. 2009; Pei et al. 2017; Qin et al. 2018; Shi et al. 2019; Spieker and Gotlieb 2019; Sun et al. 2018a, b; Udeshi et al. 2018; Xie et al. 2019) contain information on the hardware setup used to conduct the experiments. In three works (Ma et al. 2018d, 2018b; Spieker and Gotlieb 2019), experiments were run on a cluster of computers, while nine mention the specific GPU model that was used (Bolte et al. 2019; Byun et al. 2019; Fremont et al. 2019; Guo et al. 2018; Pei et al. 2017; Ding et al. 2017; Ma et al. 2018b, 2018d; Xie et al. 2019). Interestingly, all of the GPU models reported in the papers are NVIDIA products, with the GeForce series being mentioned five times (Bolte et al. 2019; Byun et al. 2019; Fremont et al. 2019; Guo et al. 2018; Pei et al. 2017) and the Tesla series four times (Ding et al. 2017; Ma et al. 2018b, 2018d; Xie et al. 2019). We conjecture that this result is influenced by the adoption of CUDA as the parallel computing platform and programming model of NVIDIA graphics processing units (GPUs), since TensorFlow supports NVIDIA GPU cards through CUDA.
Demographics
In the following, we report and comment on some statistics about the papers considered in this study, including year of publication, venue, authors, affiliations and affiliation countries. Note that for preprint versions that were later published at a conference, journal or workshop, we always refer to the latter version. The reported data, as well as the number of citations, were collected from Google Scholar on May 11th, 2020.
Year of Publication
When analysing the year of publication of the papers considered in this study, we distinguish between papers that have been made available exclusively in preprint archives and papers that have been accepted for publication in a peer-reviewed journal, conference or workshop. The aggregated publication years are shown in Fig. 23. The trend apparent from this figure is that the number of papers is rapidly increasing in recent years, showing a growing interest and attention of the software engineering community towards testing MLSs.
Venue and Venue Type
Table 6 shows the publication venues of the papers considered in this systematic mapping. More than half of the papers (40 out of 70) were published at conferences; only a relatively small number at workshops (10) or journals (8). The high number (30) of papers that we found only in arXiv at the time when we downloaded all relevant works (February 27th, 2019), some of which (18/30) were published later in a peer-reviewed venue, as well as the high number of papers published at conferences/workshops, indicate the importance of fast knowledge transfer in a quickly evolving field like ML testing. It is common practice for researchers working on ML testing to continuously check for new arXiv submissions relevant for their research.
Table 6 Venues represented in the mapping (number of papers shown within brackets)
Authors
We aggregate statistics about the authors of the considered papers without taking the order of authors into account. Overall, 241 distinct authors contributed to the 70 analysed papers; the average number of authors per paper was 4.34. On average, an author contributed to 1.26 of the papers, with 205 authors contributing to one paper, 18 to two papers and 11 to three papers. The following six authors contributed to more than three papers:
- 6 papers: Kaiser, G. (Columbia University) (Murphy et al. 2007a, 2007b, 2008, 2008, 2009; Xie et al. 2011)
- 6 papers: Murphy, C. (Columbia University) (Murphy et al. 2007a, 2007b, 2008, 2008, 2009; Xie et al. 2011)
- 5 papers: Liu, Y. (Nanyang Technological University) (Ma et al. 2018b, 2018d, 2019; Xie et al. 2018, 2019)
- 4 papers: Li, B. (University of Illinois at Urbana–Champaign) (Ma et al. 2018b, 2018d, 2019; Xie et al. 2019)
- 4 papers: Ma, L. (Harbin Institute of Technology) (Ma et al. 2018b, 2018d, 2019; Xie et al. 2019)
- 4 papers: Xue, M. (Nanyang Technological University) (Ma et al. 2018b, 2018d, 2019; Xie et al. 2019)
Affiliations
Overall, the authors who contributed to the papers considered in this study work for 84 distinct organisations. On average, each of these organisations contributed to 1.68 papers and each paper was written by authors from 2.0 different organisations. The organisations which contributed to the most papers are:
- 8 papers: Columbia University (Tian et al. 2018; Pei et al. 2017; Murphy et al. 2007a, 2007b, 2008, 2008, 2009; Xie et al. 2011)
- 6 papers: Nanjing University (Shi et al. 2019; Xie et al. 2018, 2011; Cheng et al. 2018a; Shen et al. 2018; Qin et al. 2018)
- 6 papers: Nanyang Technological University (Xie et al. 2018, 2019; Ma et al. 2018b, 2018d, 2019; Du et al. 2019)
- 5 papers: Carnegie Mellon University (Gopinath et al. 2018; Ma et al. 2018b, 2018d, 2019; Xie et al. 2019)
- 5 papers: Harbin Institute of Technology (Ma et al. 2018b, 2018d, 2019; Xie et al. 2019; Du et al. 2019)
- 5 papers: Kyushu University (Ma et al. 2018b, 2018d, 2019; Xie et al. 2019; Du et al. 2019)
- 5 papers: University of Illinois at Urbana-Champaign (Ma et al. 2018b, 2018d, 2019; Zheng et al. 2019; Xie et al. 2019)
For-Profit Organisations
It is notable that besides universities, we also observed various contributions from for-profit companies. This is particularly evident for papers in the automotive domain. For-profit organisations that contributed to papers in this domain include: IEE S.A. Contern (Luxembourg) (Abdessalem et al. 2016, 2018a, 2018b), AVL List (Austria) (Klueck et al. 2018; Beglerovic et al. 2017), Volkswagen (Germany) (Bolte et al. 2019), DaimlerChrysler (Germany) (Bühler and Wegener 2004) and Toyota (USA) (Tuncali et al. 2018). This finding is encouraging, but we argue that more industrial involvement should be actively promoted, especially by non-profit organisations (e.g., through collaborations with the industrial sector), because of the growing number of industrial products/services that embed some ML technology and demand dedicated ML testing techniques. Insights from industry can help researchers steer their work towards relevant topics that can be applied in practice. Moreover, industrial data sets are crucial to evaluate the proposed techniques in realistic and relevant contexts.
Countries
We analysed the number of papers by country, considering the country of the affiliation indicated in each paper. If a paper was written by multiple authors from different countries, we counted that paper for all represented countries. As above, we did not take the author order into account. Table 7 shows the number of papers by country. At least one author of 31 papers was affiliated to a research institution in the United States of America (USA), more than any other country. USA is followed by China (20 papers), Japan (9 papers), Singapore (8 papers) and Australia (7 papers). Germany (6 papers) is the most active European country. Table 7 reports the paper distribution by continent. The most active continent is America (33 papers), followed by Asia (29 papers) and Europe (19 papers).
Table 7 Countries (ISO3) of authors' affiliations (number of papers shown within brackets)
Citation Counts
We collected the citation counts from Google Scholar on May 11th, 2020. If multiple versions of a paper were available, we aggregated the citation counts. On average, the considered papers were cited 37.62 times with a median of 13.5 citations. The most cited papers in our pool are the following:
- 377 citations: DeepXplore: Automated Whitebox Testing of Deep Learning Systems by Pei et al. (2017)
- 354 citations: DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars by Tian et al. (2018)
- 159 citations: Testing and validating machine learning classifiers by metamorphic testing by Xie et al. (2011)
- 113 citations: Properties of machine learning applications for use in metamorphic testing by Murphy et al. (2008)
- 111 citations: DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems by Zhang et al. (2018b)
Figure 24 illustrates the distribution of the number of citations per paper. It is quite remarkable that the two most cited papers were published in 2017 and 2018: in the two to three years since their first publication they were cited more than 350 times, once more indicating the rapid growth of and the increasing interest in the area of MLS testing.