1 Introduction

Cyber-Physical Systems (CPSs) leverage physical capabilities from hardware components as well as computational and artificial intelligence from software components to operate in complex and dynamic environments, potentially involving humans (Baheti and Gill 2011). Specifically, CPSs continuously collect sensor data from the surrounding environment and analyze them to control physical actuators at run-time (Baheti and Gill 2011; Academies of Sciences 2017).

CPSs find application in many domains ranging from Robotics and Transportation to Healthcare and are expected to drastically improve the quality of life of citizens and the economy (Chen 2017). For instance, self-driving cars (SDCs), an emerging application of CPS in transportation, are expected to impact our society profoundly by drastically reducing human errors that currently cause more than 90% of driving accidents, improving passenger comfort, and limiting pollution (Kalra and Paddock 2016). Currently, one of the main factors limiting the widespread usage of SDCs is the lack of adequate testing. Releasing SDCs equipped with defective software poses the risk that they might become erratic, which has already led to some fatal crashes (Baheti and Gill 2011; Guardian 2018). Testing automation is crucial for ensuring the safety and reliability of software, including the one controlling SDCs (Kalra and Paddock 2016; Kim et al. 2019). However, most developers rely on human-written test cases to assess SDCs’ behavior. This practice has several limitations and drawbacks: (i) difficulty in testing SDCs in representative and safety-critical scenarios (Guardian 2018; The-Washington-Post 2019; Ingrand 2019); (ii) difficulty in assessing SDC’s behavior in different environments and execution conditions (Kalra and Paddock 2016). As a consequence, SDC practitioners in the field are facing a fundamental development challenge: observability, testability, and predictability of the behavior of SDCs are highly limited (Guardian 2018; The-Washington-Post 2019; Ingrand 2019). Thus, new testing practices and tools are needed to find SDC faults earlier during development and, eventually, support the widespread usage of autonomous driving.

Simulation environments can potentially address several of the challenges mentioned above (BeamNG GmbH 2022; Bondi et al. 2018; Dosovitskiy et al. 2017; Nvidia 2020) since simulation-based testing is more efficient than and can be as effective as traditional field operational testing (Afzal et al. 2020; Dosovitskiy et al. 2017). Additionally, simulation-based testing results are easier to replicate and can support established model-in-the-loop (MiL), software-in-the-loop (SiL), and hardware-in-the-loop (HiL) development strategies. Consequently, an increasingly large number of commercial and open-source simulation environments have been delivered to the market to conduct testing in the autonomous driving domain (Dosovitskiy et al. 2017; BeamNG GmbH 2022) as well as other CPS domains (Shin et al. 2018). For such reasons, our work focuses on simulation-based testing in the context of SDCs.

1.1 Problem Statement and Research Questions

Simulation environments enable automated test generation and execution (Gambi et al. 2019). However, the potential size of the testing space of simulation environments is, in principle, infinite, which poses several challenges and questions for exercising SDC behaviors adequately (e.g., Which SDC test cases should be selected to identify faults efficiently? Is it possible to characterize safety-critical SDC tests?) (Birchler et al. 2023, 2022, 2022c; Abdessalem et al. 2018b; Gambi et al. 2019). The time budget devoted to testing activities is usually limited, making the identification of faults particularly challenging in the SDC domain since the execution of simulation-based tests is considerably slower compared to other forms of tests (e.g., unit and system tests of traditional software systems).

For instance, testing how an ego-car handles a driving scenario can easily take several minutes (Panichella et al. 2021; Birchler et al. 2022, 2022c); in contrast, running a unit or system test of a traditional software system takes only (milli)seconds. It is important to point out that simulation-based testing exercises the subject at the system level, which involves all components and not just a unit, and simulates the environment from which the test subject takes its inputs. Therefore, it is paramount that developers test SDCs cost-effectively, for example, by using test suites optimized to reduce testing effort or by improving existing automated test generators’ efficiency without affecting their ability to identify faults (Yoo and Harman 2010; Nucci et al. 2020; Abdessalem et al. 2018b).

In this paper, we investigate techniques to improve the cost-effectiveness of simulation-based testing in the context of SDCs. Specifically, we focus on techniques that employ Machine Learning (ML) models for supporting test case selection (TCS), addressing the following main challenges: (i) to leverage test case characteristics as well as ad-hoc SDC test case metrics to best characterize unsafe (fault-revealing) and safe (non-fault-revealing) SDC test cases; (ii) to identify suitable ML models that can reliably predict the SDCs’ behavior before executing those test cases; (iii) to experiment with the usage of such ML strategies to effectively distinguish unsafe test cases from safe ones; (iv) to integrate the proposed ML-based approach into the context of an industrial organization in the automotive domain, thus demonstrating its applicability in industrial settings.

We are interested in testing the safety of SDCs; therefore, we deem as relevant those scenarios that expose a fault (e.g., an SDC drives off the road). We call those scenarios unsafe. Consequently, our TCS techniques exploit ML models to classify SDC test cases as unsafe (i.e., likely to expose a fault) or safe.

To address the aforementioned challenges, in this paper, we seek to answer the following research questions:

  • RQ1: To what extent is it possible to identify safe and unsafe SDC test cases before executing them? Answering RQ1 is important to understand whether, and to what extent, it is possible to classify test cases for SDCs before executing them and by only considering static input features (i.e., referred to as Road Characteristics). We investigate the use of ML models for classifying test cases and study their application in the context of Lane Keeping, the fundamental requirement in autonomous driving. Specifically, in testing lane-keeping systems, unsafe scenarios cause self-driving cars to depart their lane (Gambi et al. 2019; Birchler et al. 2022, 2022c), and input features describe the geometry of a road as a whole (i.e., Road Features).

  • RQ2: Does SDC-Scissor improve the cost-effectiveness of simulation-based testing of SDCs? RQ2 investigates whether SDC-Scissor improves the cost-effectiveness of simulation-based testing of SDCs, compared to baseline approaches. Hence, in the context of RQ2, we investigated whether SDC-Scissor reduces the time dedicated to executing irrelevant (safe) tests without affecting testing effectiveness.

  • RQ3: What is the actual upper bound on the precision and recall of ML techniques in identifying SDC safe and unsafe test cases when using static SDC features? In RQ1 and RQ2, we focus on the feasibility and cost-effectiveness of using SDC Road Characteristics as features for classifying SDC test cases before executing them. In RQ3, we explore a complementary aspect, namely whether there is an actual upper bound on the precision and recall of ML techniques in identifying SDC safe and unsafe test cases when using static SDC features (available before executing the tests). Hence, after identifying the best ML models for classifying safe and unsafe test cases compared to baseline approaches (in RQ1 and RQ2), we answer RQ3 by (i) designing additional SDC test case features, called Diversity Metrics, which are more complex than the simple road characteristics used for training the ML models in RQ1 and RQ2; and (ii) leveraging hyperparameter tuning strategies to find the optimal configurations of the most promising ML models (as observed in RQ1 and RQ2).

We conducted our investigation using the freely available SDC simulator BeamNG.tech (BeamNG GmbH 2022) (elaborated in Section 2). We selected BeamNG.tech because it can execute procedurally generated driving scenarios, and it was recently adopted as the reference simulator in the ninth and tenth editions of the Search-Based Software Testing tool competition (Panichella et al. 2021; Devroey et al. 2022).

Complementary to the investigation of the aforementioned research questions, we investigate the extent to which SDC-Scissor can be integrated into the context of industrial organizations in the automotive domain. Specifically, to perform such an investigation, we generate SDC test cases and assess the ability of SDC-Scissor to generate signals compatible with the CAN Bus protocol (CIA 2017; Boumiza and Braham 2019; Gundu and Maleki 2022) used in the AICAS organization (details about the AICAS company, their protocol, as well as the design and results of our integration study, are provided in Section 6).

1.2 Summary of Results & Paper Contributions

SDC-Scissor avoided the execution of 50% of unnecessary tests and identified more failure-triggering test cases compared to two baseline strategies.

SDC-Scissor outperformed the baseline across all test pools; with the Logistic model, we achieved an accuracy of 70%, a precision of 65%, and a recall of 80% (Table 12) in selecting unsafe tests.

Our assessment of SDC-Scissor shows that it successfully selects test cases independently of the AI engine used or the driving style, with the Logistic model providing the most stable results. Our results also show that the knowledge is not transferable from one AI engine to another, i.e., SDC-Scissor performed worse when training ML models on data from a specific AI engine and testing on data from a different AI engine. However, from the discussion of our results (in RQ3), we also observed that there is an upper bound on the extent to which static SDC features can be used to predict SDC testing outcomes. Finally, the integration of SDC-Scissor into the AICAS use case allowed us to demonstrate that the proposed approach can automate the testing process of such a large automotive company, coping with the need to complement their hardware-based simulation (based on the CAN Bus protocol) with simulation-based testing automation. The contributions of this paper can be summarized as follows:

  • Selection of SDC test cases (RQ1): We investigated new methods for test case selection in the SDC domain. We first computed SDC features that can be used to characterize safe and unsafe test cases before executing them. Hence, we introduced SDC-Scissor, which leverages ML models to support test case selection for SDCs and enhance testing cost-effectiveness.

  • SDC-Scissor’s Cost-effectiveness (RQ2): We compared the proposed approach against two distinct baseline approaches to demonstrate the testing cost-effectiveness of SDC-Scissor. The first one is a random baseline approach that selects tests randomly. The second baseline selects tests based on their road length, which means that test cases with long roads are preferred based on the intuitive assumption that long roads have a higher probability of being unsafe.

  • Offline vs. Real-time Training (RQ2): We investigated two opposite setups for SDC test case selection that leverage ML models trained on offline data (i.e., trained on a large static dataset) and real-time data (i.e., dynamically generated tests).

  • Upper-bound of SDC static features (RQ3): We empirically investigated whether there is an actual upper-bound on the precision and recall of ML techniques in identifying SDC safe and unsafe test cases when using static SDC features (available before executing the tests).

  • Integration of SDC-Scissor in an Industrial Use Case (analysis detailed in Section 6): We integrated SDC-Scissor into the development context of the AICAS use case, demonstrating that the proposed tool can automate the testing process of such a large automotive company.

To foster the replicability of our study, we built a large dataset of labeled test cases (Khatiri et al. 2021) that can be used for replicating our results and promoting further research. Furthermore, SDC-Scissor is publicly available on GitHub and can be used together with the dataset to replicate our results.

Paper Structure

The paper proceeds as follows: Section 2 provides background about CPS simulation technologies, regression testing, the simulation-based testing of Lane Keeping systems used in the context of our study, automated test generation in the context of SDCs, and a summary of the main terminology used in our study. Section 3 presents the approach proposed in this paper. Section 4 describes the empirical study design, while Section 5 presents its main results. Section 6 provides a brief background on AICAS, the industrial organization involved in our study, details their signal-based CAN Bus protocol, and elaborates on the design and results of SDC-Scissor’s integration within the AICAS organization. Section 7 reflects on the results reported in Sections 5 and 6, providing complementary insights and discussing future work for researchers and SDC developers. Section 8 discusses related work, while Section 9 discusses the threats that could affect the validity of our results. Finally, Section 10 concludes the paper and outlines future research directions.

2 Background

This section introduces background elements to make this paper self-contained. It presents the main approaches to SDC simulation (Section 2.1) and discusses automated testing of Lane Keeping systems (Section 2.2). Finally, it concludes with a recap of the terminology used in the rest of this paper (Section 2.3).

2.1 CPS Simulation Technologies

Several simulation technologies have been developed to support developers in various stages of the design and validation of CPSs. Those technologies provide various levels of accuracy and realism at different execution costs, i.e., more accurate simulations generally require larger computational power. In the domain of self-driving cars, developers resort to abstract simulation models (González et al. 2018; Sontges and Althoff 2018; Althoff et al. 2017), rigid-body simulations (Loquercio et al. 2020; Zapridou et al. 2020), and soft-body simulations (Gambi et al. 2019; Riccio and Tonella 2020) among others.

Basic simulation models, like MATLAB and Simulink models as well as abstract driving scenarios (Althoff et al. 2017), have been mainly utilized for model-in-the-loop simulations, benchmarking of trajectory planners, and Hardware/Software co-design. They implement fundamental abstractions (e.g., signals, motion primitives) but target mostly non-real-time executions and lack photo-realism, which limits their applicability for testing SDC systems.

Rigid-body simulations approximate the physics of bodies by modeling entities as undeformable bodies (Abdessalem et al. 2018b). Rigid-body simulations implement a very coarse approximation of reality and can simulate only basic object motions and rotations. Consequently, rigid-body simulations cannot simulate realistic and critical scenarios (e.g., car crashes, inertia) accurately, even when they are combined with rendering engines to achieve photo-realistic simulations (Dosovitskiy et al. 2017; Bondi et al. 2018; Xu et al. 2019).

Soft-body simulations improve over rigid-body simulations and can simulate a wide range of simulation cases in addition to primitive body motions and rotations. As stated by Dalboni and Soldati (Dalboni and Soldati 2019), soft-body simulations can simulate body deformations, anisotropic mass distributions, and inertia, which are essential in many CPS domains. For SDCs, soft-body simulations are a better fit for simulating safety-critical driving scenarios (Gambi et al. 2019) and, like rigid-body simulations, they can be coupled with powerful rendering engines to achieve photo-realism (e.g., BeamNG GmbH (2022)). Consequently, in our work, we leverage soft-body simulations for simulation-based testing of SDCs.

2.2 Simulation-Based Testing of Lane Keeping Systems

In this paper, we study how SDC-Scissor can optimize the testing of the software that controls self-driving cars using physically accurate driving simulations. Specifically, we focus on testing Lane Keeping systems (LKS) that implement one of the fundamental features of autonomous driving.

Simulation-based testing requires creating relevant testing scenarios and reifying them into concrete executions (Li et al. 2016). In accordance with current research on automated testing of LKS (Panichella et al. 2021; Gambi et al. 2022), we consider scenarios that take place on a sunny day on single, flat roads surrounded by plain green grass. Consequently, tests take the form of the following driving task: driving without going off the lane from a given starting position, i.e., the beginning of a road, to a target position, i.e., the end of that road.

The roads defining these driving tasks are obtained by interpolating road points using cubic-splines to obtain a smooth road spine, i.e., the road’s center line (see Fig. 1). Driving simulators use the road spines to implement the actual driving tasks to execute.
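The following is a minimal sketch of how sparse road points could be interpolated into a dense road spine with a parametric cubic spline. The library choice (SciPy), the example road points, and the sampling density are illustrative assumptions, not the exact procedure implemented by the testing infrastructure.

```python
# Minimal sketch: interpolate sparse road points into a smooth road spine
# (the road's center line) using a parametric cubic spline.
import numpy as np
from scipy.interpolate import splprep, splev

road_points = [(10, 10), (40, 30), (70, 25), (100, 60), (120, 110)]  # hypothetical (x, y) points
x, y = zip(*road_points)

# Fit a parametric cubic spline (k=3) exactly through the road points (s=0).
tck, _ = splprep([x, y], k=3, s=0)

# Sample the spline densely to obtain the road spine used by the simulator.
u = np.linspace(0.0, 1.0, 200)
spine_x, spine_y = splev(u, tck)
spine = list(zip(spine_x, spine_y))
print(f"Road spine with {len(spine)} interpolated points")
```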

Fig. 1: Virtual roads for testing Lane Keeping systems. The white dots represent the road points, the (central) yellow lines represent the interpolated road spine, the triangles represent the starting locations, and the squares represent the target locations

In this context, unsafe tests correspond to virtual roads that expose problems in the ego-vehicle while driving autonomously on them, for instance, causing it to drive off-road or invade the opposite lane. As discussed in the next Section, SDC-Scissor extracts a set of features from the road spine and road points that enable it to predict whether the corresponding virtual road will expose a problem in the ego-vehicle before the test execution.

SDC-Scissor relies on the open-source testing infrastructure developed for the CPS testing competition of the SBST (Search-Based Software Testing) workshop (Panichella et al. 2021). This infrastructure can automatically implement executable simulations from the road spines, execute them, and collect their results (e.g., pass/fail). We opted for this infrastructure for two main reasons: (1) It utilizes the BeamNG.tech simulator (BeamNG GmbH 2022); hence, it can execute physically accurate and photo-realistic driving simulations. (2) It has already been used to benchmark several automatic test generators (see Panichella et al. (2021) and Gambi et al. (2022)); hence, it enables us to study the generality of SDC-Scissor. SDC-Scissor uses Frenetic (Castellano et al. 2021) as the main test generator, which uses a genetic algorithm for defining road points on a Cartesian plane.

The open-source testing infrastructure developed for the CPS testing competition (Panichella et al. 2021) enables driving agents to drive simulated vehicles and get programmatic control over running simulations (e.g., pause/resume simulations, move objects around). We consider two different driving agents as test subjects for our evaluation: the first is the driving agent shipped with BeamNG.tech, which we refer to as BeamNG.AI, and the second is an open-source trajectory planner, which we refer to as Driver.AI (Gambi et al. 2019). As explained by the BeamNG.tech developers, a parameter called the “risk factor” (RF) controls the driving style of BeamNG.AI: low RF values (e.g., 0.7) result in smooth driving, whereas high RF values (e.g., 1.2 and above) result in edgy driving that may lead the ego-car to “cut corners”. Driver.AI instead analyzes the road geometry and plans the car trajectory by computing, for each turn, the maximum safe driving speed (v) using the standard formula for centripetal force on flat roads with static friction (μ) (CNX 2021):

$$ v = \sqrt{\mu \times r \times g} $$
(1)

where r is the turn radius and g is the free-fall acceleration.
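As a worked example of Eq. (1), the following snippet computes the maximum safe cornering speed for a hypothetical turn; the friction coefficient and turn radius are illustrative values, not parameters taken from our experiments.

```python
# Worked example of Eq. (1): maximum safe cornering speed on a flat road.
import math

def max_safe_speed(mu: float, radius: float, g: float = 9.81) -> float:
    """v = sqrt(mu * r * g), in m/s."""
    return math.sqrt(mu * radius * g)

v = max_safe_speed(mu=0.8, radius=50.0)  # e.g., dry asphalt, 50 m turn radius
print(f"max safe speed: {v:.1f} m/s ({v * 3.6:.0f} km/h)")  # ~19.8 m/s (~71 km/h)
```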

Driver.AI relies on the user to provide the value of the friction coefficient, as well as information about the maximum acceleration and deceleration of the ego-car. In our evaluation, we estimated those values empirically following a trial-and-error approach. It is important to mention that, at the moment, neither BeamNG.tech nor Driver.AI has previous versions of their driving agents. This means that their behavior can only be altered or investigated by experimenting with the parameters already discussed in the context of our study. As a consequence, our regression testing strategy primarily focuses on enabling SDC test selection, with the main goal of reducing the effort required to detect faults. For future work, assuming new versions of both BeamNG.tech and Driver.AI are delivered, we plan to experiment with consecutive versions of these AI agents so that it is possible to investigate the potential fault-detection capability of both of them.

2.3 Article Terminology

To avoid any confusion in terminology, it is important to note that, in the rest of the paper, we refer to the simulation-based test cases generated by SDC-Scissor as test cases. Test cases are virtual roads composed of a sequence of road segments, as exemplified in Fig. 1. Formally, road segments refer to (parametric) portions of the roads of test cases; hence, they can be straight segments (no curvature), left turns (positive curvature), or right turns (negative curvature).

We refer to test cases that have been executed and evaluated in simulation as executed test cases. Then, if a test is passed successfully, we refer to it as a passing test, and if it failed, potentially revealing some issues with the system under test, we refer to it as a failing test.

On the other hand, as we elaborate more in the next sections, SDC-Scissor automatically assigns labels to the test cases regarding them being likely to fail or pass without executing them. In this context, we refer to the test cases which are considered by SDC-Scissor to be likely to pass as safe test cases and the ones that are considered likely to fail as unsafe test cases.

Regarding the features used in SDC-Scissor, static (road) features refer to any test case features that can be calculated without running any simulations, i.e., they are suitable for predicting test results (simulation outcomes) before running simulations. As discussed in detail in the next section, we propose to use two different sets of road features: road characteristics and diversity metrics.

Regarding the experiments to answer RQ2, we discuss offline experiments, which involve test selection from a previously generated (offline) pool of test cases, in Section 4.2.2. We conducted the offline experiment in two experimental setups that mimic the issues of having a limited testing budget in the context of SDCs: 1) FIX, in which the total number of test cases that can be executed in the simulation environment is fixed to a certain number; 2) REACH, in which we continue executing test cases until we reach a certain number of failing tests.

As discussed later in Section 5.3, we complement RQ2 evaluations with real-time experiments, in which we study the application of SDC-Scissor to automated test generation, i.e., the test pool is being generated in real-time, and only the unsafe tests are being kept and executed. There, we have two experimental setups: 1) with a pre-trained ML model. 2) with an adaptive ML model that could be retrained with the correct labels of the generated test cases.

3 The SDC-Scissor Approach

In this section, we first overview SDC-Scissor’s software architecture and its main usage scenarios (Section 3.1); next, we describe the selected features used as inputs to SDC-Scissor (Section 3.2); finally, we explain how SDC-Scissor uses these features to classify test cases before executing them (Section 3.3).

3.1 SDC-Scissor Architecture Overview

SDC-Scissor supports two main usage scenarios: Benchmarking and Prediction. In the Benchmarking scenario, SDC developers (or testers) leverage SDC-Scissor to determine the best ML model(s) to classify SDC simulation-based tests as safe or unsafe. In the Prediction scenario, instead, SDC-Scissor uses the most promising ML model(s) to classify newly generated test cases.

The SDC-Scissor software architecture (Fig. 2) implements these scenarios by means of five main software components, which have the following main responsibilities and relations:

  (i) SDC-Test Generator generates SDC simulation-based test cases.

  (ii) SDC-Test Executor executes the tests and stores the test results, i.e., safe or unsafe labels, to allow training of the ML models.

  (iii) SDC-Features Extractor extracts the input features from the SDC simulation-based test cases.

  (iv) SDC-Benchmarker uses these features and the collected labels to train the selected ML models and determines which ML model best predicts the tests that are more likely to detect faults.

  (v) SDC-Predictor uses the trained ML models to classify newly generated test cases, thus achieving cost-effective SDC simulation-based testing via test selection.

Fig. 2: Overview of SDC-Scissor’s software architecture
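To make the component responsibilities concrete, the following is a hypothetical Python skeleton of how the five components could be wired together; all function names, signatures, and data structures are illustrative assumptions and do not reflect SDC-Scissor’s actual API.

```python
# Hypothetical skeleton of SDC-Scissor's five components (Benchmarking scenario);
# names and signatures are illustrative only.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SDCTestCase:
    road_points: List[Tuple[float, float]]
    label: Optional[str] = None  # "safe" / "unsafe", assigned after execution

def generate_tests(n: int) -> List[SDCTestCase]:
    """(i) SDC-Test Generator: produce n procedurally generated virtual roads."""
    raise NotImplementedError

def execute_tests(tests: List[SDCTestCase]) -> None:
    """(ii) SDC-Test Executor: run each test in the simulator and set its label."""
    raise NotImplementedError

def extract_features(tests: List[SDCTestCase]) -> List[List[float]]:
    """(iii) SDC-Features Extractor: compute static road features per test."""
    raise NotImplementedError

def benchmark(features, labels):
    """(iv) SDC-Benchmarker: cross-validate candidate ML models, return the best one."""
    raise NotImplementedError

def predict_unsafe(model, new_tests: List[SDCTestCase]) -> List[bool]:
    """(v) SDC-Predictor: classify unexecuted tests; True means 'unsafe'."""
    raise NotImplementedError
```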

3.2 SDC Test Case Features

SDC Test Case Road Characteristics - Features Set 1

(Used in RQ1, RQ2, and RQ3). To predict whether test cases are likely to be safe or unsafe before their execution, we use a set of simple static features extracted from the global characteristics (referred to as Road Characteristics) of the virtual roads used as test cases. We extract two types of Road Characteristics, describing the main road attributes (see Table 1) and descriptive statistics about the road composition (see Table 2). Exemplary road attributes we consider are the total length of the virtual road, its starting and target positions on the map, and the count of left and right turns. To calculate road statistics, instead, we adopt the following procedure: (1) We extract the driving path that the ego-car must follow during the test execution; this path defines the test case and contains the road segments that the ego-car must traverse to reach the target position from the starting position. (2) We extract metrics such as segment length, road angle, and pivot radius from the road segments. (3) We compute descriptive statistics by applying standard aggregation functions (e.g., minimum, maximum, average) to the collected road segment metrics.

Table 1 Road attributes extracted by the SDC-Features Extractor
Table 2 Road statistics extracted by the SDC-Features Extractor
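The following is a minimal sketch of steps (2) and (3) above, aggregating per-segment metrics into descriptive road statistics; the segment values and the attribute names used here are illustrative, the actual feature set being the one listed in Tables 1 and 2.

```python
# Minimal sketch: aggregate per-segment metrics (e.g., segment length and turn
# angle) into descriptive road statistics for one test case.
import statistics

segments = [
    {"length": 42.0, "angle": 0.0,   "pivot_radius": float("inf")},  # straight
    {"length": 35.5, "angle": 48.0,  "pivot_radius": 47.0},          # left turn
    {"length": 28.3, "angle": -63.0, "pivot_radius": 25.5},          # right turn
]

def road_statistics(segments):
    # Only a subset of the metrics (length, angle) is aggregated in this sketch.
    lengths = [s["length"] for s in segments]
    angles = [abs(s["angle"]) for s in segments]
    return {
        "num_segments": len(segments),
        "total_length": sum(lengths),
        "min_seg_length": min(lengths),
        "max_seg_length": max(lengths),
        "avg_seg_length": statistics.mean(lengths),
        "num_left_turns": sum(1 for s in segments if s["angle"] > 0),
        "num_right_turns": sum(1 for s in segments if s["angle"] < 0),
        "max_abs_angle": max(angles),
    }

print(road_statistics(segments))
```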

SDC Test Case: Diversity Metrics - Features Set 2 (Used in RQ3)

To predict whether test cases are likely to be safe or unsafe before their execution, we also designed a new set of road features called Diversity Metrics. Specifically, for each road segment, we calculate the area spanned between the segment’s direct line (connecting its start and end points) and the actual road. The concept is illustrated in Fig. 3, where the green area represents the diversity of a single road segment. The curly braces indicate the segments of the road; a segment consists of road points marked as red diamonds, and the yellow lines represent the direct paths between the start and end points of each segment. Concretely, we used Shapely (Sean 2022), an open-source Python library for geometric calculations, to compute these areas: for each identified segment, we define a Shapely Polygon object that includes the segment’s road points and the direct segment line, and read its area property. With this approach, we retrieve the area (referred to as diversity in our context) of each segment. On this basis, we calculate two additional features: (i) Full Road Diversity and (ii) Mean Road Diversity. As described in Table 3, the Full Road Diversity is computed by summing up the areas spanned by all segments of a road, whereas the Mean Road Diversity is the mean of these areas. The main assumption behind these features is that a road is more diverse, and therefore more likely to be unsafe, if the spanned area is larger.

Fig. 3: Road diversity as the area (green) between the road (black) and the direct segment line (yellow)

Table 3 Diversity features extracted by the SDC-Features Extractor
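The following is a minimal sketch of the per-segment diversity computation with Shapely: closing a polygon over a segment’s road points implicitly adds the straight start-end chord, so the polygon’s area is the area between the curved segment and that chord. The segment coordinates below are illustrative only.

```python
# Minimal sketch of the diversity metrics of Table 3 using Shapely.
from shapely.geometry import Polygon

def segment_diversity(segment_points):
    """Area between the curved segment and its straight start-end chord."""
    return Polygon(segment_points).area

def road_diversity(segments):
    areas = [segment_diversity(seg) for seg in segments]
    full_road_diversity = sum(areas)                          # Full Road Diversity
    mean_road_diversity = full_road_diversity / len(areas)    # Mean Road Diversity
    return full_road_diversity, mean_road_diversity

# Example: two curved segments, each given as a list of (x, y) road points.
segments = [
    [(0, 0), (5, 2), (10, 3), (15, 2), (20, 0)],
    [(20, 0), (25, -3), (30, -4), (35, -3), (40, 0)],
]
print(road_diversity(segments))
```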

3.3 The SDC-Scissor’s Workflow

As described in Section 2, SDC-Scissor leverages an existing, open-source, and extensible SDC testing infrastructure to execute the test cases (SDC-Test Executor). Likewise, it relies on existing test generation algorithms integrated with that infrastructure to automatically generate the test cases to optimize (SDC-Test Generator). Hence, SDC-Scissor can already be used to improve the cost-effectiveness of several test generators.

During Benchmarking, SDC-Scissor utilizes the SDC-Test Generator and SDC-Test Executor to collect the necessary data for training the ML models, i.e., labeled test cases; next, it relies on the SDC-Benchmarker to determine the ML models that best classify the SDC test cases as safe or unsafe, as described below. Given a set of labeled test cases and the corresponding input features extracted by the SDC-Features Extractor, the SDC-Benchmarker trains and evaluates an ensemble of standard ML models using the well-established sklearn library. Next, it assesses each ML model’s quality using K-fold cross-validation on the whole dataset. Finally, it identifies the best-performing ML models according to Precision, Recall, and F-score metrics (Birchler et al. 2022) and outputs the best (trained) models as well as the features needed to operate them.

SDC-Scissor can work with various ML models. In this study, we consider ML models that have been successfully used for defect prediction or other classification problems in Software Engineering (Bezerra et al. 2007; Kaur and Malhotra 2008; Panichella et al. 2015; Sorbo et al. 2016; Rani et al. 2021; Panichella and Ruiz 2020). Specifically, we consider Naive Bayes (that applies Bayes’ theorem to train a probabilistic classifier) (Caruana and Niculescu-mizil 2006), Logistic Regression (that uses a logistic function to model the probability of observing a certain class) (Sammut and Webb 2011), J48 (that creates a decision tree following the well-known C4.5 algorithm) (Frank et al. 2005; Sorbo et al. 2022), and Random Forests (that uses an ensemble of decision trees) (Ho 1998).
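A minimal sketch of this benchmarking step with scikit-learn is shown below: the classifiers mentioned above are evaluated with 10-fold cross-validation and ranked by precision, recall, and F-score. The feature matrix X and label vector y are placeholders to be replaced with the extracted features and safe/unsafe labels, and DecisionTreeClassifier stands in for the C4.5-style J48 model.

```python
# Minimal sketch of the benchmarking step: 10-fold cross-validation of the
# candidate classifiers on static road features, ranked by F-score.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier        # C4.5-like decision tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X = np.random.rand(500, 10)               # placeholder feature matrix
y = np.random.randint(0, 2, size=500)     # placeholder labels (1 = unsafe)

models = {
    "NaiveBayes": GaussianNB(),
    "Logistic": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10,
                            scoring=("precision", "recall", "f1"))
    print(f"{name}: precision={scores['test_precision'].mean():.2f} "
          f"recall={scores['test_recall'].mean():.2f} "
          f"f1={scores['test_f1'].mean():.2f}")
```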

During Prediction, SDC-Scissor takes as input the (trained) ML Models and the definition of the features needed to use them. Next, it generates new test cases using SDC-Test Generator and utilizes SDC-Features Extractor to extract the necessary features. Finally, it invokes SDC-Predictor for classifying safe or unsafe test cases before executing them.

In the next section, we describe the studies we conducted to evaluate the benefits of using SDC-Scissor for test selection in the context of SDCs. After that, we present and discuss the achieved results.

4 Study Design

In this paper, we investigate Machine Learning-based test selection techniques for improving the cost-effectiveness of simulation-based testing of SDCs.

The first challenge (RQ1) we focus on is to investigate whether, and to what extent, it is possible to classify test cases for SDCs as safe or unsafe before executing them, i.e., only considering input features such as the ones discussed in Section 3, by conducting offline and real-time experiments. Specifically, we investigate the use of ML models for classifying test cases in the context of Lane Keeping systems (see Section 2).

The second challenge we focus on is devising techniques that effectively leverage features extracted from SDC test cases to reduce testing costs while keeping testing effectiveness high. Hence, we investigate whether SDC-Scissor improves the cost-effectiveness of simulation-based testing of SDCs, compared to baseline approaches (RQ2).

A further aspect we investigate is whether there is an upper bound on the precision and recall achieved by ML techniques in identifying SDC safe and unsafe test cases when using static SDC features (available before executing the tests). Hence, we focus on investigating whether fine-tuning the ML algorithms (e.g., calculating derived features and performing hyper-parameter tuning) improves SDC-Scissor’s ability to discern safe test cases from unsafe ones (RQ3).

Finally, to investigate the practical usefulness of SDC-Scissor, we integrated our tool into the context of an industrial organization in the automotive domain (details of such an investigation are reported in Section 6).

In the following sections, we describe the dataset used in our study and the steps we followed to address these challenges.

4.1 SDC Test Cases Dataset Preparation

To enable the prediction of safe and unsafe SDC test cases, we used SDC-Scissor for executing the generated test cases and collected labels (safe/unsafe) from the test results (pass/fail). As reported in Table 4, we generated a dataset with 14,175 data rows with full road features, obtained from simulations of 8,500 tests using two driving agents and four configurations. As the table shows, SDC-Scissor takes the AI engines’ inputs to generate the test cases, which leads to test cases having different road configurations and, as a consequence, different sets of road segments composing them. The test cases, their labels, and the SDC features characterizing them are the main data used for conducting our experiments. An overview of the data is reported in Table 4.

Table 4 Dataset summary of SDC test cases on segment level and full road level (composed by segments)

4.2 Research Method

We designed a set of experiments to answer our research questions:

  • Machine Learning-based Experiments (RQ1): The first set of experiments investigates whether ML models trained with the selected SDC test case features can identify safe and unsafe test cases before their execution.

  • Offline Experiments (RQ2): The second set of experiments investigates if and how much SDC-Scissor improves the cost-effectiveness of SDC simulation-based testing compared to baseline approaches.

  • Real-Time Experiments (RQ2): In these experiments, we train an adaptive model based on data observed while executing the tests and compare it with a pre-trained model.

  • Optimization Experiments (RQ3): The third set of experiments investigates how SDC-Scissor performance improves by adding new SDC features and tuning ML Models hyperparameters. Specifically, in RQ3, we focus on investigating whether there is an actual upper bound on the precision and recall achieved by the ML techniques in identifying SDC safe and unsafe test cases when using static SDC features (available before executing the tests).

4.2.1 Machine Learning-based Experiments (RQ1)

In the context of RQ1, we study whether ML models can be used to predict safe or unsafe test cases and which combinations of features allow us to achieve more accurate predictions. As discussed in Section 3, we integrated into SDC-Scissor several ML models, and in the context of our work, we experimented with Logistic Regression (Tolles and Meurer 2016), the J48 (Frank et al. 2005), the Random Forest (Ho 1998), and the Naive Bayes (Caruana and Niculescu-mizil 2006) as ML models. We trained the ML models mentioned above using a training and test sets split strategy for each of the configurations listed in Table 4 separately. We evaluated the performance of each ML model by computing the standard metrics of precision, recall, and F-score (Baeza-Yates and Ribeiro-Neto 2011; Bezerra et al. 2007; Ceylan et al. 2006; Kaur and Malhotra 2008; Canfora et al. 2013; Panichella et al. 2015).

Rebalancing of Training Data

Since unsafe scenarios are an exception –not the norm– when generating random tests, the raw data we collected with SDC-Scissor is unbalanced toward safe cases. Therefore, we re-balanced the training data (in the case of the training and test sets split strategy) to avoid skewed distributions that would otherwise bias the ML models towards one specific class. Specifically, we adopted random oversampling, a re-balancing technique proven to be robust (Ling and Li 1998), to supplement the training data with multiple copies of some of the minority classes.
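A minimal sketch of random oversampling is shown below; it duplicates minority-class training samples until both classes are equally represented. The use of scikit-learn’s resample utility and the NumPy-array inputs are assumptions for illustration; the paper does not prescribe a specific implementation.

```python
# Minimal sketch: random oversampling of the minority class in the training data.
# X_train is a 2D NumPy feature matrix, y_train a 1D NumPy label array.
import numpy as np
from sklearn.utils import resample

def oversample_minority(X_train, y_train, random_state=42):
    classes, counts = np.unique(y_train, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    X_min, y_min = X_train[y_train == minority], y_train[y_train == minority]

    # Duplicate minority-class samples (with replacement) up to the majority count.
    X_min_up, y_min_up = resample(X_min, y_min,
                                  replace=True,
                                  n_samples=int(counts.max()),
                                  random_state=random_state)
    X_bal = np.vstack([X_train[y_train == majority], X_min_up])
    y_bal = np.concatenate([y_train[y_train == majority], y_min_up])
    return X_bal, y_bal
```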

Size of the Training Dataset

To study how the training set size affects the ML models’ performance, we created balanced training datasets of increasing size (Table 5). In contrast, we generated the test datasets used to evaluate the ML models by randomly sampling the data points not included in the training datasets. Notably, we did not re-balance the test datasets, to preserve the underlying class distribution in the data.

Table 5 Model training dimensions

We also study the effects of different training strategies on each ML model’s performance. To do so, we evaluated the ML models using standard K-fold cross-validation (Refaeilzadeh et al. 2009). In particular, we set K = 10 (i.e., 10-fold cross-validation) and utilize all the available data in each configuration.

4.2.2 Offline Experiments (RQ2)

To answer RQ2, we investigate whether SDC-Scissor improves the cost-effectiveness of simulation-based testing of SDCs, compared to baseline approaches. The quality focus is to understand whether SDC-Scissor reduces the time dedicated to executing safe (irrelevant) tests without affecting testing effectiveness (i.e., its ability to identify unsafe tests) compared to such baselines.

SDC-Scissor can use pre-trained models to classify safe and unsafe test cases. Therefore, we designed experiments to analyze how using pre-trained ML models for selecting (existing) test cases improves regression testing. For those experiments, we consider the combinations of ML models and features that achieve the best results in the context of RQ1 (see Section 5.1). In addition, we contextualize the results achieved by SDC-Scissor using a baseline approach that performs a random selection of test cases. Notably, random selection is considered one of the standard baselines for evaluating test selection strategies (Shin et al. 2018; Yoo and Harman 2010). Finally, we also compare SDC-Scissor against a slightly more intelligent baseline approach that selects test cases by ordering the test to be executed considering their road length (in decreasing order). The conjecture of this second baseline is that the longer the road, the higher the probability of observing a fault.

Studying the effectiveness of SDC-Scissor offline requires test cases and executions; therefore, we used a dataset with known test execution times. Due to the lack of backward compatibility of BeamNG.tech, we generated a new dataset to complement our evaluation (see Table 10) using the most recent version of BeamNG.tech. For all other evaluations, we used the data reported in Table 6. In summary, the separate new dataset consists of 3,559 tests, with 2,225 safe and 1,334 failing tests, labeled with BeamNG.AI (RF 1.5). As reported in Table 6, we created a Training Set accounting for 80% of the whole dataset and used the remaining 20% of the data for testing. We created a balanced Training Set, but we purposely created four unbalanced Test Pools with different distributions of unsafe cases, ranging from few (5% of the testing data) to many (70% of the testing data). In creating our test pools, we under-sampled safe test cases (e.g., Test Pool (30/70)) since the number of unsafe test cases was smaller than the total number of test cases in our complete dataset. Our conjecture is that using different Test Pool compositions allows us to assess SDC-Scissor’s performance in various settings.

Table 6 Offline experiment dataset: test pools with different distributions of unsafe cases, ranging from few (5% of the testing data) to many (70% of the testing data)

Experimental Setups of Offline Experiments

We conducted the offline experiment in two experimental setups, referred to as FIX and REACH. Since they mimic the issues of having a limited testing budget in the context of SDCs, we believe they are representative. We repeated the experiments in both setups 30 times to increase the confidence in the achieved results.

The FIX setup investigates the benefits of using SDC-Scissor when the resources allocated for testing are limited, i.e., the amount of test cases that can be executed in the simulation environment is fixed to a value S (e.g., S = 5,6,etc.). The process we followed to experiment with the FIX setup is illustrated in Fig. 4 alongside the baseline processes. The baseline approach draws tests from the test pool (randomly or by considering their road length) and adds them to the test suite until the test suite reaches the target size S. SDC-Scissor, instead, samples the tests from the test pool but adds them to the test suite only if the ML model predicts that they are unsafe; as before, the process ends when the test suite reaches the target size S. In this setup, more effective techniques select larger portions of unsafe tests; therefore, we evaluate the performance of SDC-Scissor using the ratio of unsafe to safe test cases in the final test suites compared to the baseline approaches.

Fig. 4: FIX experiment overview
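A minimal sketch of the FIX selection loop described above is shown below; `model` and `extract_features` are placeholders for a trained classifier and the feature-extraction step, and the "unsafe" label string is an assumption made for illustration.

```python
# Minimal sketch of the FIX setup: draw candidate tests from the pool and keep
# only those the ML model predicts as unsafe, until the suite reaches size S.
import random

def build_fix_suite(test_pool, model, extract_features, suite_size):
    pool = list(test_pool)
    random.shuffle(pool)
    suite = []
    for test in pool:
        if len(suite) >= suite_size:
            break
        if model.predict([extract_features(test)])[0] == "unsafe":
            suite.append(test)
    return suite
```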

The REACH experiment, instead, investigates the ability of SDC-Scissor to reduce the time to identify at least N unsafe test scenarios. In our experiment, we set N = 10 since identifying that many unsafe test cases potentially requires the execution of many more (safe) test cases. The process we followed to experiment with the REACH setup is illustrated in Fig. 5 alongside the random baseline approach. As before, the baseline randomly samples tests from the test pool and executes them until N unsafe tests have been identified. REACH, instead, executes only those tests that are predicted to be unsafe by the ML models. In this setup, more effective techniques identify N unsafe tests sooner; therefore, we consider the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) predicted by the ML models. Having information about TP, TN, FP, and FN enables us to count how many tests were needed to reach the goal, how long it took to do so, and how much time was wasted in evaluating safe test cases.

Fig. 5: REACH experiment overview
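The REACH loop can be sketched in the same spirit; `execute` is a placeholder returning the test outcome and its simulation duration, and all names are illustrative assumptions.

```python
# Minimal sketch of the REACH setup: execute only tests predicted as unsafe
# until N actually failing tests have been observed, tracking time spent.
def reach_unsafe_tests(test_pool, model, extract_features, execute, n_unsafe=10):
    found, executed, total_time = 0, 0, 0.0
    for test in test_pool:
        if found >= n_unsafe:
            break
        if model.predict([extract_features(test)])[0] != "unsafe":
            continue  # predicted safe: skipped, not executed
        outcome, duration = execute(test)
        executed += 1
        total_time += duration
        if outcome == "fail":
            found += 1
    return found, executed, total_time
```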

4.2.3 Real-Time Experiments (RQ2)

We complement the previous Offline Experiments to answer RQ2, which focuses on applying SDC-Scissor to regression test case selection, with Real-Time Experiments in which we study the application of SDC-Scissor to automated test generation.

We conducted the Real-Time Experiments according to the following procedure: (i) SDC-Scissor generates random test cases; (ii) for each newly generated test case, SDC-Scissor classifies it as safe/unsafe; and (iii) test cases classified as safe are filtered out before generating the next test case, whereas test cases classified as unsafe are executed. As the test subject, we used BeamNG.AI in the moderate configuration (RF equal to 1.5), as this configuration is a compromise between overly conservative and overly aggressive driving styles.

A cost-effective test generator devotes more time to executing (likely) unsafe tests that can expose defects rather than executing safe test cases, which might not contribute any additional insight into the behavior of the SDC under test. Correctly identifying unsafe test cases, therefore, is paramount and depends on the quality of the ML model used as a classifier which, in turn, depends on the technique employed by the ML models and the data used to train them. Particularly relevant in this context is whether the ML model is predefined and fixed or allowed to be updated online as new data become available. The trade-off between these two configurations is that ML models have little operational costs once trained but may miss relevant behaviors; on the contrary, dynamically retrained ML models can cope with missing training data but at the cost of additional time spent in retraining them. Therefore, we compare the following two approaches:

  • Pre-trained Model in which we used the best performing model identified during the Machine Learning-based Experiments (Section 5.1). We trained this model using the re-balanced dataset for the case of BeamNG.AI RF 1.5, as this is the configuration of the test subject used for this experiment.

  • Adaptive Model in which we also used the best-performing model identified during the Machine Learning-based Experiments (Section 5.1), but trained with only 60 randomly generated test cases. After this initial training, we retrain the ML model after executing the predicted unsafe test cases, using the newly collected ground-truth labels for those test cases. Figure 6 illustrates this process; see also the sketch after the figure. Notably, since the ML model may be inaccurate, this process collects both positive and negative labels.

Fig. 6: Overview of the adaptive model configuration for the real-time experiments
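The following is a minimal sketch of the adaptive-model loop, assuming retraining after every executed test and treating all helper functions (test generation, execution, feature extraction) as placeholders; the actual retraining frequency and interfaces in SDC-Scissor may differ.

```python
# Minimal sketch of the adaptive configuration: start from a model trained on a
# small bootstrap set, filter out predicted-safe tests, execute the rest, and
# retrain with the ground-truth label collected from each executed test.
def adaptive_loop(generate_test, execute, extract_features, model, X0, y0, budget):
    X, y = list(X0), list(y0)          # bootstrap data (e.g., 60 labeled tests)
    model.fit(X, y)
    spent = 0.0
    while spent < budget:
        test = generate_test()
        feats = extract_features(test)
        if model.predict([feats])[0] != "unsafe":
            continue                   # predicted safe: filtered out, not executed
        outcome, duration = execute(test)
        spent += duration
        X.append(feats)
        y.append("unsafe" if outcome == "fail" else "safe")
        model.fit(X, y)                # retrain with the newly collected label
    return model
```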

As before, we contextualize the results achieved by SDC-Scissor using a baseline approach that implements plain vanilla random generation, i.e., it does not filter the test cases.

We ran each configuration on a dedicated machine equipped with an Intel Core i5-6600K (3.5 GHz), 16 GB RAM, and an NVIDIA GeForce GTX 1070 GPU and set the test generation time budget to six hours.

During each execution of the experiment, we stored all the tests generated by SDC-Scissor so we could execute the test cases filtered out by SDC-Scissor post-mortem to calculate metrics such as accuracy, precision, and recall.

Table 7 provides an overview of the metrics used for the evaluation of SDC-Scissor across the various configurations. Those metrics include the count of unsafe tests found during each experiment (true positives), true negatives, false positives, and false negatives. Additionally, we consider how SDC-Scissor allocated the time budget to run safe and unsafe test cases, generate test cases, and rebuild the ML models.

Table 7 Evaluation metrics for the real-time experiments

In the second study, SDC-Scissor leverages real-time data (i.e., dynamically generated tests) and continuously (re-)trained ML models; this setup lets us evaluate the application of the proposed technique for automated test generation. As described before, in both setups, we compared the time-saving ability of SDC-Scissor with respect to the random selection strategy as well as its ability to detect more faults while allocating lower test execution costs.

4.2.4 Optimization Experiments (RQ3)

RQ3 investigates whether there is an upper bound on the precision and recall of ML techniques in identifying SDC safe and unsafe test cases when using SDC test case features available before executing the tests. A range of different optimization algorithms can be used to achieve potentially better results with respect to the default configuration of parameters of the ML models. Two of the most common hyperparameter tuning methods are Random Search and Grid Search (Bergstra et al. 2011; Bergstra and Bengio 2012; Adnan et al. 2022). Grid search performs better for spot-checking combinations that are known to perform well. Therefore, we experiment with Grid search as a hyperparameter optimization approach and investigate how SDC-Scissor’s performance improves when it employs fine-tuned ML models. Specifically, with Grid Search, we experimented with several parameter combinations for the best ML models using a 10-fold validation setting, as summarized below.

For the Decision Tree (J48) we covered all possible combinations of the following parameters:

  • C (confidenceFactor): Is the confidence factor, and we experimented with values [0.001,0.01,0.05,0.1,0.5]

  • M (minNumObj): Is the minimum number of instances in a leaf, and we experimented with values [1,10,20,50,100]

  • R (reducedErrorPruning): Reduced error pruning is an alternative algorithm for pruning that focuses on minimizing the statistical error of the tree. We experimented with the following values [yes,no]

  • S (subtreeRaising): This is a specific method of pruning whereby a whole set of branches further down the tree are moved up to replace branches that were grown above it. We experimented with the following values of it [yes,no]

For the Random Forest, we covered all possible combinations of the following parameters:

  • I (numIterations): Is the number of trees in the forest, and we experimented with values [5,10,100,1000,2000]

  • K (numFeatures): Is the max number of features considered for splitting a node, and we experimented with values [0,10,100,500,1000]

  • depth: Is the maximum depth of the tree (0 unlimited), and we experimented with values [0,5,10,20]

  • M (minNumObj): Is the minimum number of instances in a leaf, and we experimented with values [1,10,20,50,100]

For the Gradient Boosting, we covered all possible combinations of the following parameters:

  • ’loss’ = [’log_loss’, ’deviance’, ’exponential’]

  • ’learning_rate’ = [0.01, 0.1, 0.2, 0.4]

  • ’n_estimators’ = [10, 100, 1000]

  • ’criterion’ = [’friedman_mse’, ’squared_error’, ’mse’]

For the Logistic Regression, we covered all possible combinations of the following parameters:

  • ’penalty’ = [’l1’, ’l2’, ’elasticnet’, ’none’]

  • ’dual’ = [True, False]

  • ’max_iter’ = [10, 100, 1000]

  • ’solver’ = [’newton-cg’, ’lbfgs’, ’liblinear’, ’sag’, ’saga’]

For the Support Vector Machine, we covered all possible combinations of the following parameters:

  • ’penalty’ = [’l1’, ’l2’]

  • ’loss’ = [’hinge’, ’squared_hinge’]

  • ’dual’ = [True, False]

It is important to note that we perform Grid Search (with a 10-fold cross-validation strategy) over all experiments (for a total of over 700 experimented combinations of parameters) and use the best combination of features and ML model from Section 4.2.1.
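As an illustration, the following sketch runs a grid search with 10-fold cross-validation for the Random Forest using scikit-learn’s GridSearchCV; the grid loosely mirrors the parameter ranges listed above, but the scikit-learn parameter names and the reduced value sets are assumptions made for brevity (the lists above use Weka-style option names for the tree-based models).

```python
# Minimal sketch of hyperparameter tuning with grid search and 10-fold CV for
# the Random Forest. X and y are placeholders for the static features and the
# safe/unsafe labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(500, 10)               # placeholder feature matrix
y = np.random.randint(0, 2, size=500)     # placeholder labels (1 = unsafe)

param_grid = {
    "n_estimators": [10, 100, 1000],       # I (numIterations), reduced
    "max_features": ["sqrt", None],        # K (numFeatures), adapted
    "max_depth": [None, 5, 10, 20],        # depth (None = unlimited)
    "min_samples_leaf": [1, 10, 20, 50],   # M (minNumObj), reduced
}

search = GridSearchCV(RandomForestClassifier(), param_grid,
                      cv=10, scoring="f1", n_jobs=-1)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best F-score:   ", round(search.best_score_, 3))
```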

Section 5 elaborates on the achieved experimental results for all research questions, while Section 7 reflects on the results reported in such section, providing complementary insights, findings, and implications.

5 Results

This section presents the achieved results organized by research questions, while Section 7 discusses them in depth.

5.1 Machine Learning-Based Experiments (RQ1)

In this section, we discuss the results of RQ1. Specifically, we describe the results achieved using the Road Characteristics listed in Section 3.2 as input features to build the ML models.

5.1.1 Machine Learning-Based Experiments with Road Characteristics

We evaluated the ML models trained using Road Characteristics as the main SDC features with four splits of training and test data, as summarized in Table 5. However, for the sake of readability, we report here only the results achieved by the best-performing configuration, i.e., 80% training and 20% for testing. The full results can be found in our replication package (Khatiri et al. 2021). Table 8 reports Precision, Recall, and F-score for both unsafe and safe labels separately to study how the ML models can classify each case (i.e., the experiments summarized in Table 5). It is important to note that in all experiments reported in Table 5, we rebalanced the training data (as discussed in Section 4.2.1).

Table 8 Performance of the ML models trained using road features

Regarding the BeamNG.AI dataset, with Risk Factor 1.5, the ML model performing the best in terms of F-score is Logistic (with 71% for both labels), followed by Random Forest (between 68%–69% for both labels). The other models, instead, achieved lower F-score values.

Regarding the Driver.AI dataset, we observe that the ML models achieved lower accuracy (49.1%) than on the BeamNG.AI dataset. This result can be explained by how unbalanced the Driver.AI dataset is: since Driver.AI drives carefully, its dataset comprises mainly safe scenarios, and the predictions of the ML models tested on it are biased toward safe predictions.

Comparing the F-scores achieved by the ML models against the Driver.AI and BeamNG.AI datasets shows this problem more evidently: the ML models performed comparably well for safe and unsafe classes against the BeamNG.AI dataset, whereas they performed well only for the safe test class in the case of Driver.AI. However, we can observe some similarities between all ML models in terms of F-score values when trained on the Driver.AI dataset and the BeamNG.AI dataset. For instance, for both datasets, Logistic and Random Forest tend to achieve better results. In both cases, and especially in the case of Driver.AI, most ML models struggle to classify unsafe test cases compared to safe ones.


5.1.2 Analysis of Relevant Features

Although the ML models trained using the road features can effectively classify the test cases as safe or unsafe, it is crucial to know the level of contribution of each of these features. We analyzed the road features for the BeamNG dataset discussed in Table 8 using two popular feature evaluation methods: information gain and correlation. While the detailed analysis results are reported in Appendix A, we summarise the main findings here.


5.1.3 Impact of Risk Factor (RF)

To better understand how SDC-Scissor’s performance is affected by varying RF values, we compared its performance on the BeamNG datasets with RF 1, 1.5, and 2 separately. While we report the details in Appendix B, here we summarise the main findings.


5.1.4 Knowledge Transfer Between Different Driving Agents

We also studied the ability of the ML models to transfer knowledge from one driving agent to another by training the ML models with one AI’s dataset (BeamNG RF 1.5) and testing them with another AI’s dataset (Driver.AI), and vice versa. While we report the details in Appendix C, here we summarise the main findings.


5.2 Offline Experiments (RQ2)

In this section, we discuss the results of RQ2. Specifically, we focus on devising techniques that effectively leverage features extracted from SDC test cases to minimize testing costs while keeping testing effectiveness high. For this reason, we investigate whether SDC-Scissor improves the cost-effectiveness of simulation-based testing of SDCs, compared to baseline approaches (RQ2). Hence, we report the results of the FIX and REACH experiments (detailed in Section 4.2.2). Additionally, we report the results of the comparison between various ML models against the baseline approaches (described in Section 4.2.2) by considering different test pool compositions.

5.2.1 FIX Experiment results

The goal of this experiment is to optimize the usage of the available resources in terms of test execution time and effectiveness. Figure 7 compares the ratio of unsafe tests selected for execution using different ML models against the first baseline approach (random selection) across different test pool compositions. As can be observed from the figure, the Logistic model outperformed the baseline in all test pool compositions (described in Section 4). Figure 8 illustrates that, with fewer unsafe test cases in the pool, we observe larger improvements in the number of selected unsafe tests using ML models over the baseline. In the pool with the fewest unsafe tests, the Logistic model finds 133% more unsafe tests than the baseline approach. In the more balanced test pool, Logistic finds 50% more unsafe tests, while in the pool with more unsafe than safe tests, it identifies 30% more unsafe tests. The Logistic model performs slightly better than the other models in all compositions except one (0.3/0.7), where Random Forest performed best.

Fig. 7: Comparison of the Logistic model and the baseline across different test pool compositions

Fig. 8: Number of executed unsafe scenarios during the experiments on a) Test Pool (0.05/0.95), b) Test Pool (0.3/0.7), c) Test Pool (0.7/0.3)

The confusion matrices in Table 9 further illustrate the concrete results in terms of effectiveness for the various pool compositions. In the pool with only 0.05 unsafe tests (Table 9-a), the Logistic model achieved 10 false negatives and 260 true negatives; this means that the model avoided the execution of 549 safe tests (considering that safe test cases take around 24 seconds on average to execute), thus potentially reducing the cost spent on the less critical scenarios by more than 200 minutes in total. However, the number of false positives is still high, with 263 false positives identified cumulatively. As can be observed in Table 9-b, for the test pool 0.7/0.3, the Logistic model achieved over 260 true positives and only 37 false positives. We observe that the precision correlates with the dataset composition: for datasets with more unsafe tests, the precision for unsafe tests is higher, while for datasets with fewer unsafe tests we obtain the opposite effect. Figure 7 shows that the performance of both the ML models and the baseline depends on the test pool composition; both perform better in test pools with more unsafe tests. Thus, according to our results, designing an appropriate test pool composition is of critical importance to achieve accurate prediction results.
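As a back-of-the-envelope check of the reported saving, combining the 549 avoided safe executions with the average execution time of roughly 24 seconds per safe test gives

\[ 549 \times 24\,\text{s} \approx 13{,}176\,\text{s} \approx 220\,\text{min}, \]

which is consistent with the saving of more than 200 minutes reported above.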

Table 9 Confusion matrix for logistic model, cumulative over 30 rounds for a) Test pool (0.05/0.95), b) Test pool (0.7/0.3)

We also assessed the cost-effectiveness of SDC-Scissor against a second baseline whose selection strategy is based on the road length, under the assumption that the longer the road, the more likely the test is to be unsafe. In contrast to the random baseline, which selects the tests randomly from the test set, the second baseline orders the tests according to the road length and selects the longest ones. In Table 10, the cost-effectiveness of SDC-Scissor is compared to both baselines. The Random Forest and Logistic models have the best cost-effectiveness compared to both baselines, with a selection of 80% unsafe tests. On the other hand, the SVM and Naive Bayes models have a worse selection than both baselines, selecting only 40% unsafe tests each, whereas the random and road-length baselines select on average 42.6% and 60% unsafe tests, respectively.

Table 10 Cost-effectiveness \(\left(\frac{\#failing}{\#passing}\right)\) of SDC-Scissor against a random baseline and a road-length-dependent baseline
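For clarity, a minimal sketch of the cost-effectiveness measure of Table 10 and of the road-length baseline follows, assuming each candidate test is represented as a dictionary carrying a pre-computed road_length entry and that the safe/unsafe outcomes become available only after execution; the field names are illustrative.

    def cost_effectiveness(outcomes):
        # Cost-effectiveness as in Table 10: #failing / #passing over the executed
        # tests, where outcomes contains the observed results ('unsafe' or 'safe').
        failing = sum(1 for o in outcomes if o == "unsafe")
        passing = sum(1 for o in outcomes if o == "safe")
        return failing / passing if passing else float("inf")

    def road_length_baseline(tests, budget):
        # Second baseline: select the `budget` tests with the longest roads.
        return sorted(tests, key=lambda t: t["road_length"], reverse=True)[:budget]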

5.2.2 REACH Experiment

The goal of this experiment is to investigate whether the usage of ML models allows for reducing the total test execution time. By reducing the total test execution costs, a testing pipeline can spend more testing time on more safety-critical test cases. The task in this experiment was to identify, as early as possible, ten unsafe tests while minimizing the number of executed test cases. To perform the various comparisons, for each experimented strategy, we collected the number of test cases required to reach ten unsafe cases as well as the cumulative cost (i.e., the execution time) to run all the executed test cases (i.e., until the tenth unsafe scenario was identified). Furthermore, we collected the execution time for both safe and unsafe test cases. The conjecture behind this analysis is that the testing cost spent on safe cases should be as limited as possible, whereas the test cost dedicated to unsafe cases is beneficial for identifying flaws of SDCs in virtual environments.
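A minimal sketch of this protocol is shown below, assuming a trained model exposing scikit-learn's predict_proba (class 1 denoting unsafe), a list of candidate tests carrying their pre-computed feature vectors, and a hypothetical execute(test) helper that runs the scenario in the simulator and returns the outcome together with its execution time.

    def reach(model, tests, execute, target_unsafe=10):
        # Execute the tests most likely to be unsafe first and stop as soon as
        # `target_unsafe` unsafe outcomes have been observed.
        ranked = sorted(tests,
                        key=lambda t: model.predict_proba([t["features"]])[0][1],
                        reverse=True)
        executed, unsafe_found, total_cost = 0, 0, 0.0
        for test in ranked:
            outcome, duration = execute(test)
            executed += 1
            total_cost += duration
            if outcome == "unsafe":
                unsafe_found += 1
            if unsafe_found >= target_unsafe:
                break
        return executed, total_cost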

Figures 9 and 10 provide an overview of the performance of the baseline compared to the Logistic model (the best-performing model in the previous experiments) across different test pool compositions. Table 11 summarizes the results of the REACH experiment. We observed that the Logistic model performed better across all test pool compositions. The test costs strictly depend on the number of tests that must be executed before identifying the minimum set of 10 unsafe tests. Although the difference in the number of required tests tends to be larger in the pool with fewer unsafe tests (between 171 and 98.5 tests in the 0.05/0.95 pool, and between 14 and 11 tests in the 0.7/0.3 pool), SDC-Scissor allows for reducing the test execution time dedicated to less critical tests when the test pool contains more unsafe tests. Figure 11 shows that in the pool with fewer unsafe tests, more execution time is dedicated to less critical tests: the execution time for these less critical tests is 85% higher for the baseline than for the Logistic model. In the larger pool, the Logistic model selects 80% unsafe tests, whereas the baselines select only 42.6% and 60%, respectively.

Fig. 9 Comparison of the logistic model with the baseline across the different test pools

Fig. 10 Time spent on the execution of safe tests, logistic model vs. baseline across different test pools

Table 11 Results of the REACH experiments comparing the logistic model and the baseline in various test pool compositions (safe/unsafe test ratio)
Fig. 11 Time spent on executing each safe and unsafe test case for different models in a) test pool (0.7/0.3) and b) test pool (0.05/0.95)

In Section 7, we discuss further results of RQ2, providing additional insights on this research question.

5.3 Real-Time Experiments (RQ2)

In this section, we present the results of the real-time experiments, where we compare the results of a pre-trained model and a real-time model with the baseline approach.

Baseline vs. Pre-trained and Adaptive Models

Figure 12 gives an overview of the results achieved by the compared approaches. We observe that the baseline executed the highest number of test cases (472), while the pre-trained model executed more test cases (405) than the real-time approach (378). Figure 12 also summarizes our main observations, which we elaborate on in the next paragraphs.

Fig. 12 Comparison of the metrics for the different real-time approaches in a 6-hour run: a) distribution of the generated test cases, b) distribution of the time spent across the different tasks

The pre-trained and real-time models apply machine learning-based test selection, which leads to numerous rejected (i.e., non-executed) test cases: the real-time and pre-trained approaches rejected 588 and 309 tests, respectively. The baseline uses 98% of the time to execute test cases, and only 2% to generate them. The pre-trained and real-time approaches use more time for test generation (6% and 11%, respectively). In addition to the longer test generation process, these two approaches allocate time for the prediction and evaluation of tests (4% pre-trained, 5% real-time), which the baseline does not need to perform. In contrast to the pre-trained approach, the real-time approach continuously retrains the machine learning model with the newly executed tests.
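To make the difference between the approaches concrete, the following is a minimal sketch of the real-time (adaptive) loop, assuming hypothetical generate_test, extract_features, and execute helpers standing in for SDC-Scissor's test generator, feature extraction, and simulator interface; the sketch retrains after every executed test, as described above, and omits the generation and prediction time that the real pipeline accounts for.

    from sklearn.linear_model import LogisticRegression

    def real_time_loop(time_budget_s, seed_features, seed_labels,
                       generate_test, extract_features, execute):
        # Adaptive strategy: generate tests, reject those predicted safe,
        # execute the rest, and retrain the model after every executed test.
        X, y = list(seed_features), list(seed_labels)   # initial dataset (e.g., 60 tests)
        model = LogisticRegression(max_iter=1000).fit(X, y)
        spent = 0.0
        while spent < time_budget_s:
            test = generate_test()
            features = extract_features(test)
            if model.predict([features])[0] == 0:       # predicted safe: reject, do not execute
                continue
            outcome, duration = execute(test)           # run the scenario in the simulator
            spent += duration
            X.append(features)
            y.append(1 if outcome == "unsafe" else 0)
            model = LogisticRegression(max_iter=1000).fit(X, y)  # continuous retraining
        return model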

Interestingly, although the baseline executes more test cases, both the pre-trained and real-time approaches found more unsafe test cases (baseline 195, pre-trained 265, real-time 256). The pre-trained model was able to find 35% more unsafe test cases while executing only 49% of the safe tests. In Fig. 12, we can observe that the baseline spends only 34% of the time running unsafe tests, while 64% of the test time is spent executing safe tests. In contrast, our proposed approaches dedicate more than 50% of the time to unsafe tests, which is positive since, in a testing environment, the goal is to find more errors in less time (in our case, exposing more weaknesses of the SDC).


Adaptive vs. Pre-trained Model

Figure 12 shows that the testing time allocation of the pre-trained and real-time models is similar, but the real-time model spends more time on test generation (11%) than the pre-trained one (6%). The pre-trained model is based on the previously generated dataset of 5,643 test cases (comprising 3,559 valid test descriptions, as described in Section 4), whereas the real-time model started by generating an initial dataset of 60 test cases, as described in Section 4. Table 12 shows that the pre-trained model achieved a higher accuracy (72.1%) than the real-time model (69%). The lower accuracy explains the higher number of test cases generated by the real-time model (962 generated tests vs. 714 for the pre-trained model). Although the pre-trained model has higher accuracy in general and a higher recall for unsafe tests, it found only 3.13% more unsafe tests than the real-time model.

Table 12 Comparison between pre-trained and real-time models

Training Costs: Pre-trained and Adaptive Models vs. Random Baseline

From a qualitative point of view, the training cost is essentially zero for the random baseline, while it is greater than zero for the pre-trained and adaptive models. It is important to mention that, for all results discussed in Section 5.3, and for both the adaptive and pre-trained models, we did not include the cost required for training the ML models on the training data. This choice was made because the cost of training the best ML model can be considered negligible compared to the cumulative cost of generating and executing all tests. Indeed, the average cost of training the Logistic Regression model (i.e., the best ML model) on 60 test cases is about 0.139 seconds, whereas the cost of training the same ML model on 5,643 tests (for the offline model) is about 0.685 seconds. However, since other ML models, or particular settings of the same ML model (e.g., different from its standard configuration), could incur considerably higher training costs, we discuss this topic in the threats to validity (Section 9).
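For completeness, such training costs can be measured with a sketch like the following, assuming the 60-test seed dataset (X_small, y_small) and the 5,643-test dataset (X_full, y_full) are already loaded; the measured values obviously depend on the hardware used.

    import time
    from sklearn.linear_model import LogisticRegression

    def training_time(X, y, repetitions=10):
        # Average wall-clock time needed to fit the Logistic Regression model.
        start = time.perf_counter()
        for _ in range(repetitions):
            LogisticRegression(max_iter=1000).fit(X, y)
        return (time.perf_counter() - start) / repetitions

    print(f"real-time seed dataset: {training_time(X_small, y_small):.3f} s")
    print(f"offline dataset:        {training_time(X_full, y_full):.3f} s")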

Training Dataset Preparation: Pre-trained and Adaptive Models vs. Random Baseline

It is important to report that the comparison of SDC-Scissor and the random baseline does not take into account the time (i.e., the cost) required for the training dataset preparation in the real-time experiments. From a qualitative point of view, the cost of preparing the training data is essentially zero for the random baseline (since no training is needed), while it is non-negligible for the pre-trained and adaptive models. The preparation of the training data includes (i) the time required for the design, implementation, and testing of the road-characteristic extraction in SDC-Scissor (i.e., one week of full-time work), and (ii) the cost of automatically extracting such features from all test cases (158 seconds). In total, this required us (i.e., the first author of this work) around one week of work. Hence, while both the pre-trained and adaptive models are more cost-effective than the random baseline when selecting test cases, the training data preparation represents a substantial cost to be sustained upfront, which becomes beneficial only over a long period of test execution time. In the context of regression testing, when a new update of a large component of the SDC software is developed, a well-prepared training dataset lowers the testing cost of that component.

5.4 Optimization Experiments (RQ3)

In RQ3, we investigate whether there is an actual upper bound on the ability of ML techniques to identify safe and unsafe SDC test cases when using static SDC features (i.e., features available before executing the tests). We performed a grid search for the Random Forest, J48, Gradient Boosting, Logistic, Naive Bayes, and Support Vector classifiers to identify the best hyper-parameters for each model. Table 13 summarizes the results of the grid search by reporting the F-score (F1) for safe and unsafe test cases as well as the averaged F-score.
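A minimal sketch of this optimization step is given below, assuming the road-feature matrix X and the safe/unsafe labels y are already loaded; the parameter grids are illustrative and not necessarily the ones explored for Table 13.

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    candidates = [
        (RandomForestClassifier(random_state=0),
         {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}),
        (LogisticRegression(max_iter=1000),
         {"C": [0.01, 0.1, 1, 10]}),
        (SVC(),
         {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    ]

    for estimator, grid in candidates:
        # Exhaustively evaluate each parameter combination with 10-fold cross-validation.
        search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=10)
        search.fit(X, y)
        print(type(estimator).__name__, search.best_params_, round(search.best_score_, 3))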

Table 13 Best ML model configurations after a Grid search

The two best models in terms of averaged F-score are the Gaussian Naive Bayes (F1 = 60.0%) and the J48 Decision Tree (F1 = 59.5%) classifiers. Although these two models have similar averaged F-scores, their per-class performance differs. For the unsafe tests, the J48 Decision Tree achieved an F-score of 70.3%, but for the safe tests it achieved only 42.6%. The Naive Bayes model, in contrast, achieved an F-score of 41.0% for the unsafe tests and 71.0% for the safe tests.

For the two best models according to their averaged F-score, we show the corresponding confusion matrices in Figs. 13 and 14. Furthermore, a detailed overview of their precision, recall, and F-scores for both classes is reported in Table 14.

Fig. 13 Confusion matrix for the Gaussian Naive Bayes model

Fig. 14 Confusion matrix for the J48 decision tree model

Table 14 Best ML models with recall, precision, and F-score

Both confusion matrices show a similar distribution. The models identify most of the truly unsafe test scenarios (1,677 and 1,650 cases, respectively), but for the safe tests they have a low true positive rate, with only 409 and 516 correctly predicted safe tests.


6 Integration of SDC-Scissor in the Industrial Use Case

6.1 Experiments Involving an Industrial Use Case (AICAS)

We investigate the extent to which SDC-Scissor can be integrated into the context of industrial organizations in the automotive domain, addressing one of the open questions in simulation-based testing of SDCs (Birchler et al. 2022, 2022c; Gambi et al. 2019; Abdessalem et al. 2018b). We identified the AICAS company as an ideal use case for this investigation. AICAS develops JamaicaCAR, an OSGi-based technology for the automotive sector, currently running in more than five million cars worldwide. A pressing challenge for AICAS concerns the need to combine simulation and HiL testing protocols to optimize testing costs. Specifically, AICAS aims to reduce testing costs by automatically generating inputs, i.e., signals, compatible with the Controller Area Network (CAN) Bus protocol (CIA 2017) in simulated environments.

Based on the trajectories computed by the planning module of the SDC, the control module typically takes charge of the longitudinal and lateral control of the vehicle and generates appropriate control commands (e.g., steering, acceleration, braking) that it sends to the related hardware components of the SDC via the CAN Bus (see Fig. 15).

Fig. 15 CAN bus in the context of an SDC

To allow validation of the described scenarios, AICAS provided us with devices under test (DuT) equipped to communicate via the CAN Bus. We connected the devices to the CAN bus and the CAN bus to a driving simulator that allowed us to generate the appropriate signals (see Fig. 16). The devices act as a validation context for the described automotive scenarios.

Fig. 16 AICAS's Jamaica EDP validation setup

There are several main advantages of integrating test cases generated by SDC-Scissor in the testing workflow of AICAS:

  • Increased level of test automation: Currently, AICAS inputs are manually generated or designed by testers and developers within the organization. Using an integrated framework such as SDC-Scissor enables the automatic generation of test cases, increasing the automation and the diversity of the generated SDC scenarios.

  • Increased level of realism: Most of the signals manually inserted into the CAN Bus protocol by AICAS testers and developers do not reflect a realistic set of driving signals (e.g., the provided acceleration and steering angle of the vehicle do not reflect a real driving scenario, which makes the inputs in most cases too random or unrealistic).

Integration Steps

To investigate the extent to which SDC-Scissor can be integrated into the context of AICAS, we extended SDC-Scissor with a CAN Bus code pipeline (see the full pipeline in Fig. 17), which automates the following steps:

  • SDC Test Case Generation and Storage (Steps 1-2): As visualized in Fig. 18, we first use SDC-Scissor to generate 3,559 SDC test cases (with BeamNG.AI and RF 1.5, i.e., moderate driving), execute them, and store the corresponding execution log in a JSON file (i.e., simulation.full.json, containing all information concerning the tests generated and executed by SDC-Scissor; see Fig. 18), which constitutes the dataset of our experiments.

  • SDC Test Data Conversion & Generation of CAN Playback Data (Steps 3-5): In this stage (visualized in Fig. 19), we convert the execution log in the JSON file (i.e., simulation.full.json, generated by SDC-Scissor) to CAN Playback Data (i.e., the file simulation.canplayback.*).

  • Transmission of CAN-based Signals (Step 6): The messages (i.e., the CAN Playback Data) generated in the previous step are then transmitted to the CAN Device according to defined timestamps, consistent with the ones generated by SDC-Scissor while executing the SDC test cases. Specifically, referring to the specified CAN database (i.e., <.dbc>), we converted SDC-Scissor test case data (i.e., <simulation.full.json>) to CAN messages (i.e., <simulation.canplayback.csv>). Using a specified CAN interface device, the logged CAN frames are played back to external CAN bus devices. This final step allows us to send realistic SDC signals concerning the driving scenarios (i.e., the SDC test cases generated by SDC-Scissor) to the CAN Device in an automated fashion.

Fig. 17 CAN Bus code pipeline integrated into SDC-Scissor

Fig. 18 SDC-Scissor's CAN bus code pipeline: SDC test case generation and storage

Fig. 19 SDC-Scissor's CAN bus code pipeline: SDC test data conversion and generation of CAN playback data

From a technological point of view, the definition and implementation of the pipeline in Fig. 17 required us to leverage the following libraries: (i) python-can, which allows controlling various CAN interface devices from the Python environment; and (ii) cantools, which supports encoding and decoding CAN messages against a CAN database (from the device to the simulator, and vice versa).
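To give a concrete, heavily simplified impression of Steps 3-6, the following sketch uses python-can and cantools to encode simulated driving signals from a simulation.full.json log into CAN frames and transmit them on a CAN interface; the CAN database, the message and signal names, and the log structure are illustrative assumptions, and timestamp-faithful playback is omitted.

    import json

    import can
    import cantools

    # Hypothetical CAN database and message/signal layout; the real .dbc used
    # with the AICAS devices defines its own message and signal names.
    db = cantools.database.load_file("vehicle.dbc")
    message = db.get_message_by_name("DrivingState")

    # Hypothetical CAN interface device (here, a SocketCAN channel).
    bus = can.Bus(interface="socketcan", channel="can0")

    with open("simulation.full.json") as f:
        log = json.load(f)

    for record in log["records"]:   # illustrative structure of the execution log
        # Encode the simulated driving signals into a CAN frame via the database.
        data = db.encode_message(message.name, {
            "Steering": record["steering"],
            "Throttle": record["throttle"],
            "Brake": record["brake"],
        })
        bus.send(can.Message(arbitration_id=message.frame_id,
                             data=data, is_extended_id=False))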

6.2 Industrial Use Case (AICAS): Integration Results

To investigate the extent to which SDC-Scissor can be integrated into the context of AICAS, we extended SDC-Scissor with the CAN Bus code pipeline described in Section 6.1 and shown in Fig. 17. The development and integration of this pipeline in the AICAS context required around five months of work, from the design of the pipeline to its implementation and integration, including the time needed to run all the experiments reported in this article (i.e., the generation of the test cases by SDC-Scissor, their execution, the analysis of the data, etc.).

Table 15 reports the details of the test cases generated by SDC-Scissor. Specifically, we generated around 3,600 test cases, which required a total execution time of 12 h, 17 m, and 11 s, with an average simulation time of 12.428 seconds per test case and a maximum observed simulation time of 21.4 seconds.

Table 15 Dataset summary

The most challenging steps of the integration of SDC-Scissor into the AICAS context were the SDC Test Data Conversion & Generation of CAN Playback Data (Steps 3-5, shown in Fig. 17) and the Transmission of CAN-based Signals (Step 6, shown in Fig. 17). The main aspect that made this task challenging was the need for signal conversion and mapping between SDC-Scissor's signals and the CAN Playback Data. As shown in Fig. 20, for each signal generated by SDC-Scissor, we had to generate a corresponding value mapped to the CAN Playback module.

Fig. 20 Mapping between SDC-Scissor's signals and CAN Playback Data

Based on the simulation-based signals generated by the implemented SDC-Scissor pipeline, we were able to generate appropriate control commands (e.g., steering, acceleration, brake) and send them to the related hardware components of the SDC via the CAN Bus. Table 16 reports the details of SDC-Scissor's integration process. Specifically, for all 3,600 generated test cases, which required a total execution time of 12 h, 17 m, and 11 s, SDC-Scissor required a total of 52.391 seconds to perform the automated signal conversion, mapping, and transmission of the CAN messages.

Table 16 Results of the Integration Process

As visualized in Fig. 21, SDC-Scissor requires 14.721 ms on average to translate simulation-based signals into CAN-compatible signals. In comparison, with the current manual signal generation process, AICAS developers and testers require on average 1-2 days to design and generate a sequence of CAN signals corresponding to 10-15 test cases generated by SDC-Scissor (according to the qualitative assessment of our main contacts within AICAS). In addition to the test automation that SDC-Scissor enables in the AICAS context, the generation of a more realistic sequence of SDC signals (corresponding to the signals of a realistic SDC driving in a virtual test case) is vital for identifying safety-critical scenarios to be executed and tested via the CAN Bus protocol.

Fig. 21 Performance of conversion and transmission time

7 Discussion

This section discusses additional factors that can influence the results of the various research questions, providing further insights and findings about them. It also provides a concrete discussion of directions for future research in the field.

7.1 Discussion of Experiments Using Road Characteristics as Input Features to the ML Models

As observed in the experiments conducted for RQ1, SDC-Scissor is able to classify safe and unsafe test cases in both the Driver.AI and the BeamNG.AI datasets, with the Logistic and Random Forest models achieving the most reliable results in terms of F-score values for both labels. Moreover, we also observed that the road characteristics extracted by SDC-Scissor contribute differently to identifying safe and unsafe test cases. The characteristics concerning the pivot radius (min, mean, std, median), the sum of the turn angles, the number of left and right turns, and the total length of the road are among the most important features, all of which belong to the set of road features.

In the context of RQ1, other factors can impact the results of SDC-Scissor, such as (i) the risk factor (RF) of the SDCs and (ii) the ability of the ML models to transfer knowledge from one driving agent to another (i.e., between the BeamNG.AI RF 1.5 dataset and the Driver.AI dataset). Finally, we complemented the previous Offline Experiments, which focus on applying SDC-Scissor to regression test case selection, with Real-Time Experiments in which we studied the application of SDC-Scissor to automated test generation.

7.2 Further Remarks and Future Directions

This work can have relevant implications for developers and researchers. Hence, this final discussion offers further remarks on the results of all research questions, with a specific focus on future directions stemming from RQ3 and RQ4 for developers and researchers.

Concerning developers, the designed tool allows identifying specific problems that need to be carefully monitored in simulation environments at testing time. These include, for instance, the need to cope with testing multiple hardware versions and with diversified test inputs to verify correctness under realistic conditions. It is also of paramount importance to be able to generate inputs that lead to different safety-critical situations in a safe manner (i.e., without harming humans). SDC-Scissor allows generating and identifying test cases that can cause the SDC to fail according to different safety criteria (in the context of this work, we focus on the lane-keeping feature as the main safety criterion, but further criteria can be easily integrated and tested).

The integration of SDC-Scissor into the AICAS use case demonstrates that the proposed approach can automate the testing process of a large automotive company, coping with the need to complement its hardware-based testing (based on the CAN Bus protocol) with simulation-based testing automation. Specifically, SDC-Scissor addresses two pressing challenges of AICAS: (i) the need for an increased level of test automation (currently, AICAS inputs are manually generated or designed by its testers and developers), with test cases automatically generated to increase the diversity of the SDC scenarios; and (ii) the need for an increased level of realism, since most of the signals manually inserted into the CAN Bus protocol by AICAS testers and developers do not reflect a realistic set of driving signals (e.g., the provided acceleration and steering angle of the vehicle do not reflect a real driving scenario, which makes the inputs in most cases too random or unrealistic).

To enable the detection and fixing of SDC bugs during the evolution of SDCs, developers can focus on configuring SDC-Scissor to test different combinations of simulators and AI agents in diversified test cases, in order to identify faults in the AI engine and in the connected hardware of the system. Of course, we expect that test cases for assessing and detecting SDC bugs can vary between different organizations. To perform such new experiments, SDC-Scissor can be used to generate new test cases with an increased level of realism, for instance by including obstacles in the generated tests, so as to observe both the behavior of the SDCs and the ability of SDC-Scissor to identify safe and unsafe test cases in more articulated scenarios.

From the discussion of the results of RQ3, we identified an upper bound on the extent to which static SDC features (i.e., features available before executing the tests) can be used to predict SDC testing outcomes. This represents a relevant topic for both developers and researchers to investigate in the future. On the one hand, we may argue that novel static SDC features need to be designed to achieve better results (in terms of precision, recall, and F-score). On the other hand, we also observed in RQ3 that using different SDC features and hyper-parameter optimization strategies does not lead to drastically better results. Given the complexity of the simulation environment and its simulated physics, we argue that, to cope with the upper bound of static SDC features, better results can be achieved by combining static metrics with runtime SDC metrics (i.e., metrics available during the execution of SDC tests). The rationale behind this implication is that only limited information is available before execution to determine whether SDC test cases will fail, and achieving better results requires metrics that become available during test execution. For instance, one could consider using the average distance, speed, and steering angle in the proximity of an SDC failure (namely, a crash or a violation of the safety criterion, such as the lane-keeping feature).

Concerning researchers, this work triggers activities towards better testing and analysis of SDCs. First and foremost, given the identified safe and unsafe test cases, it can be used to derive higher-order, SDC-specific mutation operators (Jia and Harman 2009). For example, integrating obstacles and different fault detection strategies related to other safety criteria (different from the lane-keeping feature) during the execution of test cases could lead to mutants that change the test case outcome towards more faulty SDC behaviors. More complicated would be dealing with runtime adjustments of SDC test cases, which may need to be instantiated by perturbing the SDC behavior during the testing process.

The work could also foster the development of SDC-specific static analysis tools that look for recurring problems observed in failing test cases. Complementary empirical research could investigate the difficulty (e.g., duration) of fixing SDC-specific bugs and develop tools that guide developers in allocating the appropriate development effort to various types of SDC bugs. In the context of SDCs, the usage of SDC-Scissor can help researchers (and developers) gain a deeper knowledge of SDC bugs and their root causes, which is facilitated by their high reproducibility. Specifically, being able to reproduce a bug is crucial during bug triaging and debugging tasks but not always possible in field testing (Bettenburg et al. 2007; Huang et al. 2013; Zimmermann et al. 2010; Panichella 2015).

Fixing or addressing SDC-specific bugs and automatically assessing the correctness of the SDC behavior represent a critical challenge for developers and researchers. Hence, future studies should look at further safety-related bugs due to the uncertainty of SDC behavior, concerning, for instance, the effect of different SDC initializations on the SDC test case outcomes. During our experiments, we also noticed non-deterministic behavior of the test outcomes, also known as flaky tests. Concretely, depending on the definition of a failing test for SDC-Scissor, we observed 1% to 5% flaky test cases, which we discarded when creating our dataset. Future research should address the concern of flaky tests in virtual environments, since they lower the reliability of simulation-based tests of safety-critical systems such as SDCs.

Finally, SDC developers heavily rely on different experts (who need to have both software and hardware knowledge) to assess the correctness of SDC test outcomes. As the judgment of the experts highly depends on their experience and domain knowledge, such human oracles may be unreliable or subjective. This human-based assessment can be supported by reproducible SDC regression testing frameworks, such as SDC-Scissor, to mitigate the effect of subjective assessments of the correctness of SDC test outcomes.

8 Related Work

SDC-Scissor improves CPS testing cost-effectiveness by identifying and discarding likely irrelevant (i.e., safe) tests. Therefore, SDC-Scissor's main application areas are (automated) test generation and regression test selection. Specifically, SDC-Scissor employs Machine Learning models to classify tests as safe or unsafe before their execution. Research has yielded many approaches to reduce testing efforts (Elberzhager et al. 2012; Zhang et al. 2020). These approaches can be classified into the following categories: test case selection (Chen and Lau 1996), test suite reduction, test case minimization (Rothermel et al. 1998), and test case prioritization (Rothermel et al. 1999). Test case selection identifies subsets of the available tests relevant (or necessary) for testing a given change in the code; test suite reduction removes redundant test cases from existing test suites, thus leading to smaller test suites that execute faster; test case minimization removes irrelevant statements from the tests, reducing their size; finally, test case prioritization approaches rank test cases by their likelihood of detecting faults, so that their execution can lead to finding faults sooner.

Most of the available approaches focus on regression testing and do not employ Machine Learning (Yoo and Harman 2012). Only recently (Pan et al. 2022) have we observed a positive increase in the number of proposed approaches that rely on ML to select and prioritize test cases; however, those approaches focus mostly on traditional software systems (e.g., Roper (2019)), and the problem of reducing the testing effort for Cyber-Physical Systems remains open (Sadri-Moshkenani et al. 2022). In particular, compared to traditional software systems, CPSs face additional challenges due to their continuous interaction with the environment and the tight coupling between the hardware and software components comprising them. Hence, standard testing approaches are ineffective, inefficient, or inapplicable (Briand et al. 2016).

Testing of CPSs typically follows the X-in-the-loop paradigm (Matinnejad et al. 2013), which involves a great deal of simulation and takes the form of model-in-the-loop (MiL), software-in-the-loop (SiL), and hardware-in-the-loop (HiL) testing, depending on the level of abstraction adopted to represent the CPS's software and hardware components and the relevant environmental elements. Considering the specific requirements of X-in-the-loop testing, researchers have proposed various optimization techniques tailored to CPSs. We discuss the most relevant examples in the following and point interested readers to the survey by Sadri-Moshkenani et al. (2022) for a more detailed discussion.

Effective CPS testing requires the generation of test cases that effectively stress the system under test to systematically find critical and challenging scenarios (Gambi et al. 2019). However, many of the proposed approaches (e.g., Panichella et al. (2021), Gambi et al. (2022), Gambi et al. (2019), and Li et al. (2020)) rely on randomization to generate tests and require the execution of all the generated tests. As we showed in our evaluation, without proper support (e.g., SDC-Scissor), those approaches struggle to efficiently identify relevant scenarios. Abdessalem and co-authors, instead, augmented the traditional evolutionary search algorithms commonly used for automated test generation with Machine Learning models to improve the cost-effectiveness of CPS testing, evaluating their approaches on SDC collision avoidance. Specifically, Abdessalem et al. (2016) used Artificial Neural Networks to predict test cases' fitness without executing them. By doing so, they could avoid the lengthy execution of test cases that might not contribute much towards achieving the testing goals (i.e., finding problems in the system under test). More recently, Abdessalem et al. (2018a) employed a Decision Tree to guide test generation: during the test generation, they train a Decision Tree that identifies regions of the test input space that likely lead to critical test cases. Compared to Abdessalem et al.'s work, we adopt a similar approach but investigate the use of different Machine Learning models to classify tests as safe or unsafe. Additionally, we apply SDC-Scissor to a different problem, i.e., testing the SDC lane-keeping system.

In traditional settings, test selection and prioritization are performed by computing test similarity or test adequacy (i.e., code coverage). However, given the complexity of test inputs for CPSs (e.g., simulated environments), computing those metrics is technically challenging. Consequently, new similarity metrics and procedures to compute them have been proposed. For instance, Arrieta et al. (2016, 2018a) proposed to measure the similarity between test cases based on the so-called signal values of all the states of the simulation-based test cases. Moreover, traditional test adequacy metrics may not be adequate for CPSs based on Artificial Intelligence and Deep Learning. Because of this, current research efforts focus on identifying domain-specific heuristics to select test cases. For instance, Arrieta et al. (2018b) and Shin et al. (2018) proposed to select test cases based on high-level objectives such as requirement coverage, the risk of damaging CPS hardware components, and test execution times.

Compared to those studies, we investigate a different CPS domain and different test selection objectives.

Regarding test selection objectives, we focus on improving the cost-effectiveness of simulation-based tests to assess safety requirements. In contrast, previous studies prioritized the execution of tests based on their fault-detection capability (Arrieta et al. 2019) or selected tests based on signal diversity (Arrieta et al. 2016, 2018a, 2018b), which requires test execution. Since executing simulation-based tests in the SDC domain is prohibitively expensive, we face the challenge of selecting test cases before their execution. Consequently, our techniques consider only the initial state of the car and the road features (e.g., geometry, lane markings), as those features are available without executing the tests in the simulator.

9 Threats to Validity

Threats to internal validity may concern, as for previous work (Gambi et al. 2019; Birchler et al. 2022, 2022c), the cause-effect relationships between the technologies used to generate the scenarios and their elements and the corresponding results, which strictly depend on the realism of our scenarios. Indeed, we did not recreate all the elements that can be found on real roads (e.g., weather conditions). However, to increase our internal validity, we used both BeamNG.AI and Driver.AI as test subjects. They both leverage a good knowledge of the roads, which means that they do not suffer from the limitations of vision-based lane-keeping systems. For future work, we plan to leverage the new BeamNG features, which allow experimenting with test cases involving traffic lights as well as other cars and static objects. Moreover, we plan to experiment with consecutive versions of BeamNG.AI and Driver.AI (when they become available), so that it is possible to investigate the potential fault-detection capability of both of them. Currently, this is not possible since neither BeamNG.AI nor Driver.AI has previous versions of its driving agent. Furthermore, since testing involves the underlying assumption that there will be no malicious attack on the system, future work should be conducted on more cautious driving AIs, with the goal of detecting unsafe scenarios with a lower risk factor. A reckless driving style can be considered malicious behavior, which is, to a certain extent, provoked by the RF 2 configuration.

The current implementation of the diversity feature does not take into account the actual length of the road. Theoretically, a short road can have a higher diversity than a longer one, which contradicts the assumption that a long road is generally less safe since there is more space in which the vehicle can reach an unsafe state.

The performance of the ML techniques used in our experiments may depend on the setting of their hyper-parameters. We initially leveraged their default settings, knowing that the obtained results could represent a lower bound for the classification performance. Then, we experimented with grid search as a hyper-parameter optimization approach (RQ3) to investigate potential optimal combinations of parameters for the selected ML models. Finally, threats to external validity concern the generalization of our findings. Although (i) the number of test cases used in our study is relatively large compared to previous studies (Gambi et al. 2019), and (ii) we experimented with different AI engines (i.e., BeamNG.AI and Driver.AI) and integrated SDC-Scissor into the development context of the AICAS use case (demonstrating that the proposed tool can automate the testing process of a large automotive company), we cannot claim that our results generalize to the universe of open-source CPS simulation environments in other domains. Therefore, further replications are desirable, and so are further studies considering more data as well as other CPS domains.

As discussed in Section 7, for all results in Section 5.3 and for both the adaptive and pre-trained models, we did not include the cost required for training the ML models on the training data. This choice was made since the cost of training the best ML model can be considered negligible compared to the cumulative cost of generating and executing all tests. However, this could be a threat to the external validity of our results, since other ML models, or particular settings of the same ML model (e.g., different from its standard configuration), could incur considerably higher training costs. Another threat could be related to the evaluation metrics used in our study (precision, recall, and F-score), which could provide biased performance measures. Hence, for future work, we plan to leverage additional metrics such as the Matthews Correlation Coefficient (MCC), which is reported to be a well-known measure for unbiased performance assessment. To minimize potential threats to external validity, in conducting our experimental evaluation we followed established experimental guidelines. In addition, we considered an additional baseline approach that selects test cases by ordering the tests to be executed according to their road length (in decreasing order).

10 Conclusions and Future Work

Regression testing for SDCs is particularly challenging due to the cost of running many driving scenarios in simulation. To improve the cost-effectiveness of regression testing, we introduced a test case selection approach, called SDC-Scissor, that relies on a set of SDC road features extracted from driving scenarios prior to running the tests in the context of the BeamNG SDC simulation environment. Then, SDC-Scissor uses ML approaches to select the test cases having a higher likelihood of experiencing unsafe situations.

We empirically investigated the performance of SDC-Scissor and compared it with baseline approaches (RQ1). Our assessment shows that SDC-Scissor successfully selects test cases independently of the AI engine used and of the risk level (i.e., the driving style), with the Logistic model providing the most stable results. Interestingly, our results also show that knowledge is not transferable from one AI engine to another: SDC-Scissor performed worse when ML models were trained on data from one AI engine and tested on data from a different one.

Our findings also suggest that SDC-Scissor can reduce the number of executed tests required to find at least 10 unsafe tests (RQ2). Specifically, SDC-Scissor outperformed the baseline across all test pools; it selected unsafe cases using the Logistic model with an accuracy of 70%, a precision of 65%, and a recall of 80%. In terms of running time, we observed that SDC-Scissor is able to select test scenarios in a cost-effective manner compared to two baseline approaches (RQ2). We experimented with grid search as a hyper-parameter optimization approach to investigate potential optimal combinations of parameters for the selected ML models (RQ3). Our results show that there is an upper bound corresponding to an average F-score of 60%, achieved with the J48 and Naive Bayes classifiers. Complementarily, compared to previous studies, we integrated SDC-Scissor into the development context of the AICAS use case, demonstrating that the proposed tool can automate the testing process of a large automotive company.

As future work, we plan to replicate our study on further SDC datasets, AI engines, and SDC features. Moreover, we plan to perform new empirical studies on further CPS domains to investigate how SDC-Scissor performs when safety criteria concern new types of safety-critical faults different from those investigated in this study. Finally, we want to investigate different meta-heuristics and multi-objective approaches (Canfora et al. 2013, 2015) to enable test case generation based on the designed feature sets.