Machine learning-based test selection for simulation-based testing of self-driving cars software

Simulation platforms facilitate the development of emerging Cyber-Physical Systems (CPS) like self-driving cars (SDC) because they are more efficient and less dangerous than field operational test cases. Despite this, thoroughly testing SDCs in simulated environments remains challenging because SDCs must be tested in a large number of long-running test cases. Past results on software testing optimization have shown that not all test cases contribute equally to establishing confidence in test subjects’ quality and reliability, and the execution of “safe and uninformative” test cases can be skipped to reduce testing effort. However, this problem is only partially addressed in the context of SDC simulation platforms. In this paper, we investigate test selection strategies to increase the cost-effectiveness of simulation-based testing in the context of SDCs. We propose an approach called SDC-Scissor (SDC coSt-effeCtIve teSt SelectOR) that leverages Machine Learning (ML) strategies to identify and skip test cases that are unlikely to detect faults in SDCs before executing them. Our evaluation shows that SDC-Scissor outperforms the baselines. With the Logistic model, we achieve an accuracy of 70%, a precision of 65%, and a recall of 80% in selecting tests leading to a fault, and we improve testing cost-effectiveness. Specifically, SDC-Scissor avoided the execution of 50% of unnecessary tests and outperformed two baseline strategies. Complementary to existing work, we also integrated SDC-Scissor into the context of an industrial organization in the automotive domain to demonstrate how it can be used in industrial settings.


Introduction
Cyber-physical systems (CPSs) leverage physical capabilities from hardware components as well as computational and artificial intelligence from software components to operate in complex and dynamic environments, potentially involving humans [12]. Specifically, CPSs collect, analyze, and leverage sensor data from the surrounding environment continuously to control physical actuators at run-time [4,12]. CPSs find application in various domains ranging from Robotics and Transportation to Healthcare and are expected to drastically improve the quality of life of citizens and the economy [20].
Among various and emerging CPS application domains, the usage of self-driving cars (SDCs) in transportation is expected to impact our society profoundly. Human errors cause more than 90% of driving accidents (e.g., driving while under the influence of alcohol, fatigue, and other distractions) [39]; hence, automated driving systems such as SDCs have the potential to reduce such errors and eliminate most accidents. However, the recent fatal crashes involving self-driving cars suggest that the advertised large-scale adoption of SDCs appears optimistic and premature [12,34]. One of the main factors limiting the usage of autonomous driving solutions is the lack of adequate testing. Consequently, the risk of releasing SDCs equipped with defective software, which might become erratic and lead to fatal crashes, is still quite high [34].
Testing automation is crucial for ensuring the safety and reliability of SDCs [39,42]. However, most developers rely on human-written test cases (at unit and system levels) to assess SDCs' behavior. This practice has several limitations and drawbacks: (i) limited possibility to repeat tests under the same conditions [42]; (ii) difficulty in testing SDCs in representative and safety-critical scenarios [34,37,59]; (iii) difficulty in assessing SDCs' behavior in different environments and execution conditions [39].
As a consequence, SDC practitioners in the field are facing a fundamental development challenge: observability, testability, and predictability of the behavior of SDCs are highly limited [34,37,59]. Thus, new testing practices and tools are needed to find SDC faults earlier during development and, eventually, support the widespread usage of autonomous driving.
The utilization of simulation environments can potentially address several of the challenges mentioned above [1,13,15,25] since simulation-based testing is more efficient than, and can be as effective as, traditional field operational testing [5,25]. Additionally, simulation-based testing can support and complement well-established hardware-in-the-loop (HiL), model-in-the-loop (MiL), and software-in-the-loop (SiL) development strategies. Consequently, an increasingly large number of commercial and open-source simulation environments have been delivered to the market to conduct testing in the autonomous driving domain [13,25] as well as other CPS domains [55].

Problem Statement and Summary of Results
The usage of simulation environments enables automated test generation and execution [31]. However, the size of the testing space of simulation environments is in principle infinite, which poses the main open challenge of exercising the SDC behaviors adequately [3,31]. The budget devoted to testing activities is usually limited, making the identification of faults particularly challenging in the SDC domain since the execution of simulation-based tests is considerably slower compared to other forms of tests (e.g., unit tests) [27,63]. Therefore, it is paramount that developers test SDCs cost-effectively, for example, by using test suites optimized to reduce testing effort without affecting their ability to identify faults in SDCs using simulations both in nominal operating conditions and corner cases [3,48,64].
In this paper, we investigate test case selection (TCS) techniques to improve the cost-effectiveness of simulation-based testing in the context of SDCs. Specifically, we focus on techniques that employ Machine Learning (ML) models to optimize the TCS cost-effectiveness. The main challenges we focus on while designing such ML-based test case selection strategies for SDCs are as follows: (i) the definition of the features that best characterize faulty and non-faulty SDC test scenarios; (ii) the identification of suitable ML models that can reliably predict the SDCs' behavior before executing those test scenarios; and (iii) the usage of ML strategies to effectively distinguish relevant (faulty) from irrelevant (non-faulty) test scenarios.
We are interested in testing the safety of SDCs; therefore, we deem as relevant those scenarios that expose a fault (e.g., an SDC drives off the road). We call those scenarios unsafe. Consequently, our TCS techniques exploit ML models to classify SDC test scenarios as unsafe (i.e., likely to expose a fault) or safe.
In this paper, we seek to answer the following research questions:
RQ1: To what extent is it possible to identify safe and unsafe test scenarios for SDCs before executing them?
We focus on designing input features for driving scenarios, i.e., features that concern the SDC simulation-based tests and can be extracted before their execution. Thus, we propose SDC-Scissor, a framework that leverages the aforementioned features to train machine-learning models that classify test scenarios as safe or unsafe. Specifically, to distinguish between safe and unsafe test scenarios, we focus on lane-keeping functionalities in which unsafe scenarios cause a self-driving car to depart its lane [31], and we investigate features that either describe the geometry of a road as a whole (i.e., full road) or describe properties of the road segments comprising it. Finally, we investigate the accuracy of SDC-Scissor in classifying safe and unsafe test scenarios of SDCs.
RQ2: Does SDC-Scissor improve the cost-effectiveness of simulation-based testing of SDCs?
We investigated whether SDC-Scissor reduces the testing time dedicated to executing irrelevant (safe) tests while keeping a high test cost-effectiveness (i.e., identifying the same or a higher number of unsafe tests without impacting test costs). We study SDC-Scissor's behavior in two opposite setups and contextualize our findings by comparing the results against a random baseline approach (i.e., the scenarios are randomly generated, selected, and executed). In the first study, SDC-Scissor leverages ML models trained on offline data (i.e., trained on a large static dataset); this setup lets us evaluate the application of the proposed technique for regression testing. In the second study, SDC-Scissor leverages real-time data (i.e., dynamically generated tests) and continuously (re-)trained ML models; this setup lets us evaluate the application of the proposed technique for automated test generation. In both setups, we compared the time-saving ability of SDC-Scissor with that of the random selection strategy, as well as its ability to detect more faults while incurring lower test execution costs.
We conducted our investigation using the freely available SDC simulator BeamNG.tech [13] (elaborated in Section 2) and the open-source tool AsFault [31]. We selected BeamNG.tech because it can execute procedurally generated driving scenarios, and it was recently adopted as the reference simulator in the ninth edition of the Search-Based Software Testing tool competition [50]. We selected AsFault because it can automatically generate test scenarios to assess SDCs' lane-keeping and is compatible with BeamNG.tech. It is important to note that in the rest of the paper we will refer to test scenarios generated with AsFault as test cases, to avoid any confusion in terminology.
Our results show that SDC-Scissor achieved high prediction accuracy (between 72% and 96% in predicting unsafe test cases). Moreover, SDC-Scissor avoided the execution of 50% of unnecessary tests and identified 35% more unsafe test cases than the random baseline approach.
Our assessment of SDC-Scissor shows that it successfully selects test cases independently of the AI engine used and of the risk level (i.e., driving style), with the Logistic model providing the most stable results. Interestingly, our results also show that the knowledge is not transferable from one AI engine to another, i.e., SDC-Scissor performed worse when training ML models on data from a specific AI engine and testing on data from a different AI engine. Finally, SDC-Scissor does not introduce significant computational overhead in the SDC testing process, which is critical to SDC development in industrial settings.

Paper contributions
The contributions of this paper can be summarized as follows:
1. Feature sets: We qualitatively and quantitatively investigated essential input features that can be used to characterize safe and unsafe test cases before executing them.
[...] that can be used for replication purposes and future research [41].

Paper structure
The paper proceeds as follows: Section 2 provides some background about regression testing and automated test generation in the context of SDCs and CPSs. Section 3 describes the empirical study design, while Section presents its main findings. Section 5 discusses related work, while Section discusses the threats that could affect the validity of our results. Finally, Section 7 concludes the paper and outlines future research directions.

Background on Regression Testing and Simulation for CPSs
This section (i) briefly discusses the test optimization approaches in traditional systems and reflects upon existing test selection strategies in the context of CPSs; then, (ii) introduces background elements to make this paper self-contained.

Software Testing Optimization
Research has yielded many approaches to optimize testing. However, most of the available approaches focus on regression testing and have found application only in traditional software systems [65]. These approaches can be classified into the following categories: test case selection [21], test suite reduction and test case minimization [53], and test case prioritization [54].
Test case selection approaches identify subsets of available tests relevant (or necessary) for testing a given change in the code. Test suite reduction approaches remove redundant test cases from existing test suites, leading to smaller test suites that can execute faster, while test case minimization approaches remove irrelevant statements from the tests, reducing their size. Finally, test case prioritization approaches rank test cases by the likelihood of detecting faults such that their execution can lead to finding faults sooner.
Compared to traditional software systems, CPSs face additional challenges due to their continuous interactions with the environment and the tight coupling between the hardware and software components comprising them. Consequently, when it comes to testing, standard testing approaches are ineffective, inefficient, or inapplicable [16]. Testing of CPSs has typically been performed following X-in-the-loop paradigms [46] that in practice take the form of model in the loop (MiL), software in the loop (SiL), and hardware in the loop (HiL). For MiL testing, most of the software and hardware components, sensors, and other relevant environmental elements are abstracted using models, such that testing can focus on assessing the correctness of the control algorithms governing the CPSs. For SiL testing, only the hardware components and the environments are abstracted using physically accurate simulators; in this case, the behavior and integration of CPS software components can be tested in realistic execution conditions. Finally, HiL testing focuses on checking the integration of hardware and software components in production-like, either simulated or real, environments. Notably, X-in-the-loop testing involves a great deal of simulation.
Considering the specific need for X-in-the-loop development of CPSs, researchers proposed testing optimization techniques tailored for CPSs [7–10,56]. Test selection (or prioritization) for traditional systems is typically performed by computing the test similarity or test adequacy (i.e., code coverage). However, given the complexity of test inputs for CPSs (e.g., simulated environments), computing traditional similarity metrics based on lexicographic similarity of test code and test inputs is technically challenging and may not be adequate. Consequently, new similarity metrics and procedures to compute them have been proposed recently. For instance, Arrieta et al. [7,9] proposed to measure the similarity between the test cases based on the so-called signal values of all the states for the simulation-based test cases.
Adopting code coverage as a proxy for test adequacy in CPSs, which are based on artificial intelligence and deep learning, is not adequate; hence, selecting tests purely driven by code coverage is bound to produce ineffective test suites. Because of this, current research efforts focus on different heuristics to select test cases. Arrieta et al. [8] proposed to select tests guided by objectives such as requirement coverage and test execution times; they applied test selection in the context of multi-objective test case generation for CPSs and showed improvement over baseline approaches. Complementarily, Shin et al. [56] proposed an approach for acceptance test selection for a satellite system that selects relevant test cases based on two objectives: (i) the risks of causing hardware damage, and (ii) the number of test cases executed within a given time budget.
Compared to those studies, we investigate (1) a different CPS domain and (2) different test selection objectives. Specifically, we investigate lane-keeping in self-driving car simulation environments, whereas previous work focused on industrial tanks [8,10], satellites [55], electric windows [7], and cruise controllers [8]. Regarding test selection objectives, we focus on improving the cost-effectiveness of simulation-based tests to assess safety requirements. In contrast, previous studies prioritized the execution of tests based on their fault-detection capability [10] or selected tests based on signal diversity [7–9]; both approaches require at least one execution of the test cases. Since executing simulation-based tests in the SDC domain is prohibitively expensive, we face the challenge of selecting test cases before their execution. Consequently, our techniques consider only the initial state of the car and the characteristics of the roads (e.g., geometry, lane markings), as those features are available without executing the tests in the simulator.

Background on CPS Simulation
Several simulation technologies have been developed to support developers in various stages of the design and validation of CPSs. For instance, in the self-driving cars domain, developers resort to basic simulation models [33,58], rigid-body simulations [45,67], and soft-body simulations [29,52], among others.
Basic simulation models, like MATLAB and Simulink models, have been mainly utilized for model-in-the-loop simulations and Hardware/Software co-design. They implement fundamental abstractions (e.g., signals) but target mostly non-real-time executions and generally lack photo-realism; consequently, their usage as a means for testing lane-keeping and other vision-based systems is limited.
Rigid-body simulations approximate the physics of bodies by modeling entities as undeformable bodies or as compositions of a limited number of rigid three-dimensional objects such as boxes, cylinders, and convex meshes [3]. Rigid-body simulations implement a coarse approximation of reality; hence, they can efficiently simulate basic object motions and rotations and scale well in the number of simulated entities (e.g., vehicles). However, they can simulate breaks only inaccurately and cannot simulate body deformation at all. Soft-body simulations, instead, can simulate deformable and breakable objects and fluids; hence, they can handle a wide range of simulation cases in addition to primitive body motions and rotations. Mass-spring systems and the finite element method (FEM) are the main approaches to simulate solid objects, while the finite volume method (FVM) and the finite difference method (FDM) are the main approaches for simulating fluids [47]. For simulating SDCs, mass-spring systems and FEM are the most suited soft-body approaches since they target solid objects. Both approaches model solid objects as a composition of (many) atomic elements interacting with each other and reacting to external forces. Therefore, the simulations follow a bottom-up approach: the high-level behavior of the simulated objects emerges from the simulation of the behavior of the atomic elements comprising them.
Fig. 1 Example of a simple test case by AsFault [31]
Rigid-body vs. soft-body SDC simulations. Both rigid- and soft-body simulations can be effectively combined with powerful rendering engines to implement photo-realistic simulations [13,15,25,62]; consequently, both approaches are viable solutions for simulating SDCs. However, soft-body simulations can simulate a wider variety of physical phenomena compared to rigid-body simulations. For example, soft-body simulations can model body deformations, fractures, vibrations, anisotropic mass distributions, and inertia, which are essential in many CPS scenarios. Soft-body simulations are also very versatile. As stated by Dalboni and Soldati [24] speaking about mass-spring systems, using the elementary description of target systems as collections of nodes and beams (i.e., springs), it is possible to contemplate all the laws of mechanics that rule the physical world. Consequently, soft-body simulations can simulate different materials and other phenomena, such as aerodynamics and pressured volume changes, that are relevant in many CPS domains. Soft-body simulations are more accurate than rigid-body simulations; hence, they are more computationally demanding and less scalable in the number of simulated bodies. Consequently, soft-body simulations are less suitable than rigid-body simulations for simulating complicated traffic scenarios where the movement of the simulated entities is generally unrestricted (i.e., there are no collisions between the simulated entities). In contrast, soft-body simulations are a better fit for implementing safety-critical scenarios (e.g., car crashes [29]) and focused scenarios in which high simulation accuracy, even in extreme situations, matters the most (e.g., simulating an unbalanced load of trucks or driving with a flat tire).

SDC Test Case Generation Environment
Manually creating adequate test scenario suites for SDCs is a complex and laborious task as it requires testers to have experience in multiple domains, including CPS, simulations, physics, and 3D object modeling. In order to tackle this issue, Gambi et al. [32] proposed a search-based approach for procedurally generating driving scenarios (as in Figure 1) for testing lane-keeping systems. This approach is implemented in an open-source tool called AsFault [31]. This section briefly summarizes how AsFault generates simulation-based tests, as we will use it to evaluate our ML-based test selection techniques.
AsFault generates virtual roads for testing lane-keeping systems and leverages a genetic algorithm to refine those virtual roads until they become so challenging for the system under test (i.e., the lane-keeping system driving the ego-car) that they cause the ego-car to drive off the lane. When this happens, we say that AsFault identified an Out of Bound Episode (OBE), i.e., a safety-critical issue. AsFault reports the total count of OBEs and the road segments where OBEs have been observed for each generated scenario. In our experiments, we use this information in order to label test scenarios as safe (causing no OBEs) or unsafe (causing at least one OBE). AsFault relies on BeamNG.tech [13] for executing the generated tests as physically accurate and photo-realistic driving simulations (see Figure 1).
We consider two lane-keeping systems as test subjects for our evaluation: the first, BeamNG.AI, is the driving agent shipped with BeamNG.tech, and the second, Driver.AI, is a trajectory planner shipped with AsFault. These test subjects have perfect knowledge of the virtual roads and drive the ego-car by computing an ideal driving trajectory to stay in the center of the lane while driving within a configurable speed limit. As explained by BeamNG.tech developers, a parameter called the “aggression” factor controls the driving style of BeamNG.AI: low aggression values (e.g., 0.7) result in smooth driving, whereas high aggression values (e.g., 1.2 and above) result in an edgy driving that may lead the ego-car to “cut corners”. Driver.AI instead analyzes the road geometry and plans the car trajectory by computing for each turn the maximum safe driving speed (v) using the standard formula for centripetal force on flat roads with static friction (μ) [22]:

$$v = \sqrt{\mu g r}$$

where r is the turn radius and g is the free-fall acceleration. Driver.AI relies on the user to provide the value of the friction coefficient, as well as information about the maximum acceleration and deceleration of the ego-car. In our evaluation, we estimated those values empirically following a trial-and-error approach.

This section describes the design of our empirical study, including the preparation of the training and testing datasets, the adopted research method, and the experimental settings. The following section, instead, elaborates on the achieved experimental results.
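
To illustrate the centripetal-force relation that Driver.AI uses to bound the speed in each turn, the following minimal Python sketch evaluates v = sqrt(μ g r) for a few turn radii. The function name `max_safe_speed` and the friction value 0.8 are hypothetical choices for illustration; they are not taken from AsFault or from the values estimated in the study.

```python
import math

G = 9.81  # free-fall acceleration in m/s^2


def max_safe_speed(radius_m: float, friction: float, g: float = G) -> float:
    """Maximum speed (m/s) on a flat turn with static friction: v = sqrt(mu * g * r)."""
    if radius_m <= 0 or friction <= 0:
        raise ValueError("radius and friction must be positive")
    return math.sqrt(friction * g * radius_m)


if __name__ == "__main__":
    # The friction coefficient 0.8 is an assumed, illustrative value.
    for radius in (20.0, 50.0, 100.0):
        v = max_safe_speed(radius, friction=0.8)
        print(f"r = {radius:5.1f} m -> v_max = {v:5.2f} m/s ({v * 3.6:5.1f} km/h)")
```

Because the speed grows with the square root of the radius, quadrupling the turn radius only doubles the admissible speed, which is why sharp turns dominate the speed plan.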

Dataset Preparation
As discussed in Section 2, we used AsFault to generate the test cases that form our dataset. AsFault provides a set of attributes that can describe some aspects of the generated test cases. We consider those attributes as potential input features and refine them as described in the remainder of this section. We also used AsFault for executing the test cases to obtain the required labels (safe or unsafe) to train the ML models.

SDC Test Case Feature Sets and Labeling
To predict whether test simulations likely result in safe or unsafe test cases before their execution, we designed two sets of input features: Full Road and Road Segment features.The former set of features concerns global attributes of the virtual roads used as test cases, while the latter focuses on the local characteristics of the road segments forming the virtual roads.The goal is to understand whether ML models trained using global features have the same prediction power as ML models trained using local features or not.
• Full Road Features describe global characteristics of SDC test cases, such as the total length of the virtual road, its starting and target positions on the map, and the count of left and right turns.
We extract two types of full road features describing the main road attributes (see Table 1) and some statistics about the road composition (see Table 2). We calculate road statistics in three steps. First, we extract the driving path that the ego-car must follow during the test execution; this path defines the test case and contains the road segments that the ego-car must traverse to reach the target position from the starting position. Next, we extract the available metrics from each road segment (i.e., length for straight road segments and road angle and pivot radius for turns). Finally, we compute the statistics by applying standard aggregation functions (e.g., minimum, maximum, average) on the collected road segment metrics.
• Road Segment Features describe particular characteristics of the road segments that compose a test case (see Table 3). Given the path that the ego-car must follow, we determine features that describe single road segments (e.g., is this the first segment in the path or the last one?) and features that correlate adjacent road segments (e.g., is the segment before this one a left turn? is the segment after this one a sharper turn?).
For each considered test case, we extract one full road data point and multiple road segment data points and label them as unsafe, if AsFault reports an OBE during the simulation, or safe otherwise.
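
The three-step computation of the full road statistics and the OBE-based labeling can be sketched as follows. This is a minimal illustration under stated assumptions: the `Segment` class, the feature names, and the choice of returning zeros for absent metrics are hypothetical and do not reproduce AsFault's data model or the exact feature sets of Tables 1 to 3.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Segment:
    kind: str                       # "straight", "left", or "right"
    length: Optional[float] = None  # straight segments only
    angle: Optional[float] = None   # turns only, in degrees
    radius: Optional[float] = None  # turns only (pivot radius)


def _stats(name, values):
    """Minimum, maximum, and mean of a metric; zeros if the metric never occurs."""
    if not values:
        return {f"{name}_min": 0.0, f"{name}_max": 0.0, f"{name}_mean": 0.0}
    return {f"{name}_min": min(values), f"{name}_max": max(values),
            f"{name}_mean": mean(values)}


def full_road_features(path):
    """Aggregate per-segment metrics of a driving path into one data point."""
    turns = [s for s in path if s.kind != "straight"]
    features = {
        "num_left_turns": sum(s.kind == "left" for s in path),
        "num_right_turns": sum(s.kind == "right" for s in path),
        "num_straights": sum(s.kind == "straight" for s in path),
    }
    features.update(_stats("length", [s.length for s in path if s.kind == "straight"]))
    features.update(_stats("angle", [s.angle for s in turns]))
    features.update(_stats("radius", [s.radius for s in turns]))
    return features


def label(obe_count):
    """A test case is unsafe if at least one OBE was reported."""
    return "unsafe" if obe_count > 0 else "safe"
```

All features in this sketch depend only on the road description, so they can be computed before the test is executed; only the label requires a simulation run.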

Test Scenario Dataset
To achieve a comprehensive set of test cases, we considered many randomly generated test cases and test subjects in multiple configurations (Section 2.3). We considered random tests to have an unbiased sampling of the test space and multiple test subjects and configurations to draw conclusions about the generalizability of the proposed techniques. As reported in Table 4, we generated 8,500 test cases and collected labels from 14,100 simulations; in total, we collected approximately 163,000 road and road segment data points. This section briefly summarizes the process we followed to build this dataset.
As described in Section 2, BeamNG.AI's driving style can be influenced by setting its aggression factor (AF). Therefore, we considered three AF values ranging from cautious (AF 1.0) to moderate (AF 1.5) to reckless (AF 2.0). Using different values for the aggression factor enables us to study the effectiveness of our techniques concerning various SDCs' driving styles. To study the generality of our techniques, instead, we consider a second test subject, Driver.AI. Specifically, we tested Driver.AI with the same test cases used for testing BeamNG.AI in the moderate configuration. This way, we can directly compare the results achieved by both test subjects. From the data in Table 4, we make the following two observations. First, the number of unsafe tests increased with increasingly large values of BeamNG.AI's aggression

Machine Learning-based Experiments
We study whether ML models can predict if a scenario is safe or unsafe and which combinations of features yield the most accurate predictions. Therefore, we train various ML models and classify the test cases generated by AsFault while testing BeamNG.AI and Driver.AI. We used Weka [28] to train and evaluate standard ML models that have been successfully used for defect prediction in software engineering in the past (e.g., [14,19,40]):
• Logistic Regression, which uses a logistic function to model the probability of observing a certain class [60].
• J48, which creates a decision tree following the well-known C4.5 algorithm [28].
• Random Forest, which uses an ensemble of decision trees [36].
• Naive Bayes, which applies Bayes' theorem to train a probabilistic classifier [18].
We trained the ML models mentioned above using a training and test set split strategy, separately for each of the configurations listed in Table 4. We evaluated the performance of each ML model by computing the standard metrics of precision, recall, and accuracy [11,14,17,19,40,49], computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

In these formulas, TP denotes the true positives (i.e., unsafe tests correctly identified), and FP denotes the false positives (i.e., safe tests misclassified as unsafe). Conversely, TN denotes the true negatives (i.e., safe tests correctly identified), and FN denotes the false negatives (i.e., unsafe tests misclassified as safe).
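
As a worked illustration of these metrics, the sketch below counts TP, FP, TN, and FN from lists of actual and predicted labels, treating "unsafe" as the positive class, and then computes precision, recall, and accuracy. The function names and the sample labels are illustrative assumptions and are not part of the Weka-based pipeline used in the study.

```python
def confusion_counts(actual, predicted, positive="unsafe"):
    """Return (TP, FP, TN, FN) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # unsafe test correctly identified
            else:
                fp += 1  # safe test misclassified as unsafe
        else:
            if a == positive:
                fn += 1  # unsafe test misclassified as safe
            else:
                tn += 1  # safe test correctly identified
    return tp, fp, tn, fn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


if __name__ == "__main__":
    actual = ["unsafe", "unsafe", "safe", "safe", "unsafe"]
    predicted = ["unsafe", "safe", "unsafe", "safe", "unsafe"]
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, tn, fn))
```

In a test selection setting, recall indicates how many fault-revealing tests survive the selection, while precision indicates how much of the executed budget is spent on fault-revealing tests.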
Since unsafe scenarios are an exception (not the norm) when generating random tests, the raw data we collected with AsFault are unbalanced toward safe cases. Therefore, we rebalanced the training data to avoid skewed distributions that would otherwise bias the ML models towards one specific class. Specifically, we adopted random oversampling, a rebalancing technique proven to be robust [44], to supplement the training data with multiple copies of some of the minority-class instances. To study how the training set size affects the ML models' performance, we created balanced data sets of increasing size (Table 5). We generate the test sets necessary to evaluate the ML models by randomly sampling the data points not included in the training set. Notably, we did not rebalance the test set, in order to preserve the underlying class distribution in the data.
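
Random oversampling, as described above, can be sketched in a few lines: instances of each minority class are duplicated at random until every class is as frequent as the majority class. The function `random_oversample` is a hypothetical, library-free illustration and is not the rebalancing implementation used in the study.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):  # zero iterations for the majority class
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


if __name__ == "__main__":
    rows = list(range(10))
    labels = ["safe"] * 8 + ["unsafe"] * 2
    _, balanced_labels = random_oversample(rows, labels)
    print(Counter(balanced_labels))
```

Applying the procedure to the training data only, as the text prescribes, keeps duplicated instances out of the test set and preserves the original class distribution there.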
We also study the effects of different training strategies on each ML model's performance. To do so, instead of creating a balanced dataset for each configuration, we evaluated the ML models using standard K-fold cross-validation [51]. In particular, we set K = 10 and utilize all the available data in each configuration.
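
The mechanics of K-fold cross-validation can be sketched as an index generator: the data are shuffled once, partitioned into K folds, and each fold serves as the test set exactly once while the remaining folds form the training set. The generator `k_fold_indices` is a plain-Python assumption for illustration and does not reproduce Weka's cross-validation procedure.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

Averaging a metric over the K test folds uses every data point for evaluation once, which is why this strategy can exploit all the available data in each configuration.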

Offline Experiments
To answer RQ2, we need to understand how frequently SDC-Scissor selects unsafe test cases and whether it devotes more execution time to them than to safe test cases. SDC-Scissor can use pre-trained models to classify safe and unsafe test cases. However, it can also retrain the ML models on the fly, as new data are collected from test executions. Therefore, we plan experiments to analyze how using pre-trained ML models for selecting (existing) test cases improves regression testing. Likewise, we plan experiments to assess the effectiveness of SDC-Scissor as a means of dynamically adding new test cases to automatically generated test suites (see Section 3.2.3). For those experiments, we consider the combinations of ML models and features that achieve the best results in the context of RQ1 (see Section ). Finally, we contextualize the results achieved by SDC-Scissor using a baseline approach that performs a random selection of test scenarios. Notably, random selection is considered one of the standard baselines for evaluating test selection strategies [55,64].
Studying the effectiveness of SDC-Scissor offline requires test cases and executions; therefore, we used the data previously collected in the context of RQ1. Specifically, we consider the data generated while testing BeamNG.AI in the moderate configuration (AF 1.5). We chose this configuration because it provides a large number of test cases and executions (see Complete Set in Table 6). From the Complete Set, we created a Training Set, accounting for 80% of the data, and we used the remaining 20% of the data for testing. We created a balanced Training Set, but we purposely created four unbalanced Test Pools with a proportion of unsafe cases ranging from few (5% of the testing data) to many (70% of the testing data). Our conjecture is that using different Test Pool compositions allows assessing SDC-Scissor's performance in various settings.
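
One way to realize the 80/20 split and the unbalanced Test Pools is sketched below. The text does not specify how the pools were sampled, so the functions `split_train_test` and `make_test_pool`, the pool size, and the sampling without replacement are all assumptions made for illustration.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into a training and a testing part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def make_test_pool(tests, unsafe_ratio, pool_size, seed=0):
    """Sample a pool of (test_id, label) pairs with a given share of unsafe tests."""
    rng = random.Random(seed)
    unsafe = [t for t in tests if t[1] == "unsafe"]
    safe = [t for t in tests if t[1] == "safe"]
    n_unsafe = round(pool_size * unsafe_ratio)
    n_safe = pool_size - n_unsafe
    if n_unsafe > len(unsafe) or n_safe > len(safe):
        raise ValueError("not enough tests to build the requested pool")
    pool = rng.sample(unsafe, n_unsafe) + rng.sample(safe, n_safe)
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    tests = [(i, "unsafe" if i % 4 == 0 else "safe") for i in range(200)]
    for ratio in (0.05, 0.70):
        pool = make_test_pool(tests, ratio, pool_size=40)
        print(ratio, sum(label == "unsafe" for _, label in pool))
```

Varying only the unsafe share while holding the pool size fixed isolates the effect of class prevalence on the selection strategy.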
We conducted the offline experiment in two experimental setups, referred to as FIX and REACH, and repeated the experiments in both setups thirty times to increase the confidence in the achieved results.
The FIX setup investigates the benefits of using SDC-Scissor when the resources allocated for testing are limited, i.e., the number of test cases that can be executed in the simulation environment is fixed to a target size S. The process we followed to experiment with the FIX setup is reported in Figure 2 alongside the baseline process. The baseline draws tests from the test pool at random and adds them to the test suite until the test suite reaches the target size S. FIX, instead, samples the tests from the test pool but adds them to the test suite only if the ML model predicts that they are unsafe; as before, the process ends when the test suite reaches the target size S. In this setup, more effective techniques select larger portions of unsafe tests; therefore, we evaluate the performance of SDC-Scissor using the ratio of unsafe to safe test cases in the final test suites.
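The two selection processes of the FIX setup can be sketched as follows. The sketch assumes a synthetic pool with a single hypothetical feature (`min_radius`) and a deliberately imperfect stand-in predictor; neither corresponds to the trained models used in our experiments. For simplicity, it reports the share of unsafe tests in the suite rather than the unsafe-to-safe ratio.

```python
import random

def baseline_suite(pool, size, rng):
    """Baseline: draw `size` tests from the pool at random."""
    return rng.sample(pool, size)

def fix_suite(pool, size, predict_unsafe, rng):
    """FIX: visit the pool in random order, keep only predicted-unsafe tests."""
    suite = []
    for test in rng.sample(pool, len(pool)):  # random order, no repetition
        if len(suite) == size:
            break
        if predict_unsafe(test):
            suite.append(test)
    return suite

def unsafe_fraction(suite):
    """Share of truly unsafe tests in the final suite."""
    return sum(t["unsafe"] for t in suite) / len(suite)

# Synthetic pool: roads with min_radius < 40 are unsafe (assumed ground truth).
pool = [{"min_radius": r, "unsafe": r < 40} for r in range(1, 101)]
# Imperfect stand-in classifier: flags every road with min_radius < 50.
predict_unsafe = lambda t: t["min_radius"] < 50

S = 10  # target suite size
random_suite = baseline_suite(pool, S, random.Random(1))
selected_suite = fix_suite(pool, S, predict_unsafe, random.Random(1))
```

Comparing `unsafe_fraction(selected_suite)` with `unsafe_fraction(random_suite)` over repeated runs mirrors the comparison we report for this setup.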
The REACH setup investigates the ability of SDC-Scissor to reduce the time to identify at least N unsafe test scenarios. We conjecture that testing time should be spent on executing unsafe test cases, as those help developers expose problems of SDCs earlier. In our experiment, we set N = 10, since the time to identify that many unsafe test cases potentially requires the execution of many more (safe) test cases. The process we followed to experiment with the REACH setup is reported in Figure 3 alongside the baseline process. As before, the baseline randomly samples tests from the test pool and executes them until N unsafe tests have been identified. REACH, instead, follows a similar process but executes only those tests that are predicted to be unsafe by the ML models. In this setup, more effective techniques identify N unsafe tests sooner; therefore, we consider the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) predicted by the ML models. Having information about TP, TN, FP, and FN enables us to count how many tests were needed to reach the goal, how long it took to do so, and how much time was wasted in evaluating safe test cases.
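The REACH process and its bookkeeping can be sketched as follows. The pool, the `min_radius` feature, the stand-in predictor, and the execution time of unsafe tests are assumptions made for the example; only the loop structure follows the description above.

```python
import random

SAFE_SECONDS, UNSAFE_SECONDS = 24.0, 10.0  # assumed per-test execution times

def reach(pool, n_unsafe, predict_unsafe, rng):
    """Process tests in random order until `n_unsafe` unsafe tests are executed."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for test in rng.sample(pool, len(pool)):
        if counts["TP"] == n_unsafe:
            break
        if predict_unsafe(test):
            # Predicted unsafe: the test is executed.
            counts["TP" if test["unsafe"] else "FP"] += 1
        else:
            # Predicted safe: the test is skipped.
            counts["FN" if test["unsafe"] else "TN"] += 1
    executed = counts["TP"] + counts["FP"]
    wasted = counts["FP"] * SAFE_SECONDS          # time spent on safe tests
    total = wasted + counts["TP"] * UNSAFE_SECONDS
    return counts, executed, total, wasted

# Synthetic pool: roads with min_radius < 40 are unsafe (assumed ground truth).
pool = [{"min_radius": r, "unsafe": r < 40} for r in range(1, 101)]

# The baseline executes every sampled test; the model skips predicted-safe ones.
baseline = reach(pool, 10, lambda t: True, random.Random(7))
model = reach(pool, 10, lambda t: t["min_radius"] < 50, random.Random(7))
```

Because both runs use the same random order and the stand-in predictor has no false negatives, the model-based run can only execute fewer safe tests than the baseline in this example; with a real classifier, false negatives may delay reaching N.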

Real-Time Experiments
We complement the previous Offline Experiments, which focus on applying SDC-Scissor to regression test case selection, with Real-Time Experiments, which study the application of SDC-Scissor to automated test generation.
We conducted the Real-Time Experiments according to the following procedure: (i) AsFault generates random test cases; (ii) for each newly generated test case, SDC-Scissor classifies it as safe/unsafe; and (iii) we filter out test cases classified as safe before generating the next test case, whereas we execute the test cases classified as unsafe. As test subject, we used BeamNG.AI in the moderate configuration (AF equal to 1.5), as this configuration is a compromise between overly conservative and overly aggressive driving styles.
A cost-effective test generator devotes more time to executing (likely) unsafe tests that can expose defects rather than executing safe test cases, which might not contribute any additional insight into the behavior of the SDC under test. Correctly identifying unsafe test cases, therefore, is paramount and depends on the quality of the ML model used as a classifier which, in turn, depends on the technique employed by the ML models and the data used to train them. Particularly relevant in this context is whether the ML model is predefined and fixed or allowed to be updated online as new data become available. The trade-off between these two configurations is that pre-trained ML models have low operational costs once trained but may miss relevant behaviors; on the contrary, dynamically retrained ML models can cope with missing training data but at the cost of additional time spent in retraining them. Therefore, we compare the following two approaches:
• Pre-trained Model, in which we used the best performing model identified during the Machine Learning-based Experiments (Section 3.2.1). We trained this model using the re-balanced dataset for the case of BeamNG.AI AF 1.5, as this is the configuration of the test subject used for this experiment.
• Adaptive Model, in which we also used the best performing model identified during the Machine Learning-based Experiments (Section 3.2.1) but trained with only 60 randomly generated test cases. After this initial training, we retrain the ML model after executing the predicted unsafe test cases using the newly collected ground truth labels for those test cases. Figure 4 illustrates this process. Notably, since the ML model may be inaccurate, this process collects both positive and negative labels.
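The Adaptive Model loop can be sketched as follows. To keep the example self-contained, a one-feature threshold classifier (`ThresholdModel`) stands in for the actual ML model, `generate_test` stands in for AsFault, and `simulate` stands in for the execution in BeamNG; all three are hypothetical placeholders. The sketch also assumes retraining after every executed test, which is one possible reading of the procedure.

```python
import random
from statistics import mean

class ThresholdModel:
    """Stand-in classifier: predicts unsafe when min_radius is below a learned cut."""
    def fit(self, samples):
        unsafe = [s["min_radius"] for s in samples if s["unsafe"]]
        safe = [s["min_radius"] for s in samples if not s["unsafe"]]
        # With only one class seen so far, execute everything (cut = +inf).
        self.cut = (mean(unsafe) + mean(safe)) / 2 if unsafe and safe else float("inf")
        return self

    def predict_unsafe(self, test):
        return test["min_radius"] < self.cut

def generate_test(rng):
    """Placeholder for the test generator."""
    return {"min_radius": rng.uniform(5.0, 100.0)}

def simulate(test):
    """Placeholder for the simulation; returns the ground-truth label."""
    return test["min_radius"] < 40.0

def adaptive_run(n_generated, rng, n_initial=60):
    # Bootstrap: execute 60 random tests to obtain the initial training data.
    training = []
    for _ in range(n_initial):
        test = generate_test(rng)
        test["unsafe"] = simulate(test)
        training.append(test)
    model = ThresholdModel().fit(training)
    executed, skipped = [], []
    for _ in range(n_generated):
        test = generate_test(rng)
        if model.predict_unsafe(test):
            test["unsafe"] = simulate(test)   # label may be safe (false positive)
            executed.append(test)
            training.append(test)
            model.fit(training)               # retrain with the new label
        else:
            skipped.append(test)              # filtered out, never executed
    return model, executed, skipped

model, executed, skipped = adaptive_run(200, random.Random(3))
```

Only executed tests receive labels, so the model learns from its true and false positives; skipped tests never feed back into training.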
As before, we contextualize the results achieved by SDC-Scissor using a baseline approach that implements plain vanilla random generation, i.e., it does not filter the test cases.
We ran each configuration on a dedicated machine equipped with an Intel Core i5-6600K (3.5 GHz), 16 GB RAM, and an NVIDIA GeForce GTX 1070 GPU, and set the test generation time budget to six hours.
During each execution of the experiment, we stored all the tests generated by AsFault so we can execute the test cases filtered out by SDC-Scissor postmortem to calculate metrics such as accuracy, precision, and recall.
Table 7 provides an overview of the metrics used for the evaluation of SDC-Scissor across the various configurations. Those metrics include the count of unsafe tests found during each experiment (true positives), true negatives, false positives, and false negatives. Additionally, we consider how SDC-Scissor allocated the time budget to run safe and unsafe test cases, generate test cases, and rebuild the ML models.

Results
This section reports, for each research question, the obtained results and the main findings.

Table 7 Evaluation metrics for the Real-Time Experiments.

Metric | Description | Range
Number of Unsafe Tests Execution | The number of unsafe tests the approach simulated during the experiment | 0-N
Number of Safe Tests Execution | The number of safe tests the approach simulated during the experiment | 0-N

Machine Learning-based Experiments
In this section, we discuss the results of RQ1. Specifically, we first describe the results achieved using Full Road (Section 4.1.1) and Road Segment features (Section 4.1.2) to build the ML models; next, we describe the effects of using various training and test configurations (Section 3).

Full Road Features
We evaluated the ML models trained using Full Road features with four different splits of training and test data (see Table 5). However, for the sake of readability, we only report the ML models' performance metrics (i.e., accuracy, precision, recall, and F1 score) of the best performing configuration (i.e., 80% for training and 20% for testing) in Table 8. The full results can be found in our replication package [41]. In the table, we report an aggregate value of accuracy (column Acc.), but we present precision, recall, and F1 score separately for unsafe and safe labels. We do so because the test set follows the original distribution of safe and unsafe test cases; hence, it is unbalanced and highly biased towards safe cases. Consequently, reporting an aggregated value for those metrics may not be representative, as the strong presence of safe cases would dominate the results. Regarding the BeamNG.AI datasets, we can observe that the ML models' accuracy improved for increasing AF levels. For instance, with AF 2, SDC-Scissor reached a precision of 99.7% for tests predicted as unsafe. The dataset composition seems to be the key factor explaining this result, since setting the aggression factor to higher values resulted in significantly more unsafe cases. Conversely, the small number of safe cases improved accuracy and precision for unsafe cases, but this gain was counterbalanced by a decrease in the precision of safe predictions.
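To make the reporting choice above concrete, the following sketch computes precision, recall, and F1 score separately per class on a small, made-up set of predictions (eight safe and two unsafe tests). The labels and the helper name `per_class_metrics` are illustrative only and do not reproduce the values of Table 8.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 score treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Unbalanced toy test set, biased towards safe cases.
y_true = ["safe"] * 8 + ["unsafe"] * 2
y_pred = ["safe"] * 7 + ["unsafe"] + ["unsafe", "safe"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
unsafe_metrics = per_class_metrics(y_true, y_pred, "unsafe")
safe_metrics = per_class_metrics(y_true, y_pred, "safe")
```

In this toy example the accuracy is 0.8 and all safe-class metrics are 0.875, while all unsafe-class metrics are only 0.5, which shows how an aggregate value can hide weak performance on the minority class.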
Finally, we can observe a similarity between the ML models' F1 score for safe and unsafe classes for the BeamNG.AI AF 1.5 case. This result can be explained by looking at how evenly distributed the safe and unsafe classes are, which illustrates the importance of having unbiased datasets for training and testing the models.
We can also observe that the ML models achieved lower accuracy for Driver.AI (49.1%) than for BeamNG.AI AF 1.5 (67.9%). This result can be explained by looking at how unbalanced the Driver.AI dataset is. Since Driver.AI drives carefully, its dataset comprises mainly safe scenarios, and the predictions of the ML models tested on it are biased towards safe predictions. Comparing the F1 score achieved by the ML models against the Driver.AI and BeamNG.AI AF 1.5 datasets shows this problem more evidently: the ML models performed comparably well for safe and unsafe classes against the BeamNG.AI dataset, whereas they performed well only for the safe test class in the case of Driver.AI. This result supports the observation that the more safely the SDC under test drives, the harder it becomes to predict unsafe test cases.
Finding 1. The accuracy of SDC-Scissor is influenced by the driving agents, their driving style, and the diversity of datasets. For example, for more aggressive driving agents, the accuracy achieved by the ML models was higher than for cautious driving agents. Hence, predicting unsafe test cases is harder for cautious drivers than for reckless ones. Consequently, improving testing of SDCs is more challenging for less aggressive driving agents.
We studied the ability of the ML models to transfer knowledge from one driving agent to another by using the BeamNG AF 1.5 dataset to train the ML models but using the Driver.AI test set, generated from the same set of virtual roads, to evaluate them, and vice versa. As can be observed in Table 9, the knowledge from one driving agent is not transferable to another. Table 9 shows that the ML models trained on Driver.AI and evaluated on BeamNG performed significantly worse than the same models trained on BeamNG exclusively (from 67.9% to 41% on average). However, when training the ML models on the BeamNG.AI dataset and evaluating them using the Driver.AI datasets, the ML models performed only slightly worse (from 49.1% to 47.8% on average). Interestingly, when using both datasets together, the results show a compromise between the accuracy achieved when training on the different AI engines separately: BeamNG 67.9%, Driver.AI 49.1%, combined datasets 55.5%.
Finding 2. Our results show that the knowledge is not transferable from one driving agent to another, i.e., SDC-Scissor performed worse when training ML models on data from a specific driving agent and testing them on data from a different one. Despite this, ML models trained on the BeamNG data performed only slightly worse when evaluated on the Driver.AI data.
We are interested in finding the most suitable ML model to be used by SDC-Scissor; therefore, we consider the accuracy achieved by the various ML models across all datasets. As can be seen from Figure 5, the various ML models achieved comparable accuracy. The Logistic model has the highest median value (69.8%) but did not perform drastically better than the worst performing model, i.e., the Naive Bayes model (63.3%). However, the Logistic model's accuracy seems to be more stable, as the variability associated with its results is smaller than that of the other models.
Finding 3. No machine learning model outperformed the others in terms of accuracy. However, among them, the Logistic model provided the most stable results.

Road Segment Features
We investigated the use of local features describing road segments for training ML models that can accurately predict whether test cases are safe or unsafe. However, training the ML models using road segment features showed opposite results when predicting safe and unsafe test cases. As Table 10 shows, the ML models achieved very high precision while predicting safe test cases but were imprecise while predicting unsafe test cases. One possible explanation of this behavior is that virtual roads mainly consisted of "safe" road segments, i.e., road segments that belong to roads in which no OBE was observed. Hence, the number of safe road segments was significantly higher than the number of unsafe ones. As a result, the ML models were biased and consistently favored safe predictions in all experiments.
Finding 4. Although using road segment features to train ML models achieved better accuracy (85%) than using full road features, the ML models trained using road segment features achieved very low precision for the critical, unsafe class. The high accuracy of ML models trained using road segment features is an artifact of the strong bias towards safe cases of the data and has no practical benefits in SDC testing optimization, where the goal is to invest more effort in executing unsafe test cases.

Analysis of Relevant Features.
In our study, we considered two sets of features: full road and road segment features. Although the ML models trained using these feature sets can effectively classify the test cases as safe or unsafe, it is crucial to know the contribution of each of these features. For instance, more profound knowledge of the features may help to define better-suited feature sets. Hence, we analyzed in detail the full road features using the BeamNG AF 1.5 dataset.
Table 11 reports the results of using two popular feature evaluation methods: information gain and correlation. We order the features based on their evaluation scores and set a threshold (0.01 for information gain and 0.1 for correlation) for each evaluation method to select only the features with the highest contribution. It can be seen from Table 11-A and Table 11-B that the ordering and the relative score of the features are similar in most of the top cases among the two methods. Specifically, the top eight features are precisely the same in both methods, with a slight change in the order between ranks 2 to 4. Additionally, we note that the remaining features above the thresholds differ in just one feature, i.e., "std angle", which ranked lower by correlation than by information gain (rank 14 vs. 10).
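The correlation-based part of this analysis can be sketched as follows: each feature is scored by the absolute Pearson correlation with the binary safe/unsafe label, features below the 0.1 threshold are dropped, and the rest are ranked. The feature values below are synthetic, the feature names are only loosely modeled on our full road features, and the helper names are assumptions; information gain is not covered by this sketch.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_features(features, labels, threshold=0.1):
    """Keep features with |correlation| >= threshold, highest score first."""
    scores = {name: abs(pearson(values, labels)) for name, values in features.items()}
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

labels = [1, 1, 1, 0, 0, 0]  # 1 = unsafe, 0 = safe (synthetic)
features = {
    "total_angle": [300, 150, 310, 120, 280, 100],
    "min_radius": [10, 20, 30, 40, 50, 60],
    "noise": [1, 2, 3, 3, 2, 1],  # uncorrelated with the label by construction
}
ranking = rank_features(features, labels)
```

On this toy data, `min_radius` ranks first (score about 0.88), `total_angle` second (about 0.49), and `noise` is filtered out by the threshold.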
Overall, we observe that almost all full road features contributed to distinguishing safe from unsafe test cases. Also, among the statistical features that we reported in Table 2, features concerning the pivot radius tend to be more critical and relevant for the distinction of the classes. The minimum and average radius of the pivots are among the most contributing features, while the statistics concerning the turn angles start appearing only from rank 10.
Finding 5. The defined features contribute differently to characterizing the safe and unsafe scenarios. The statistics concerning the pivot radius (min, mean, std, median), the sum of the turn angles, the number of left and right turns, and the total length of the road are among the most important features, all of which belong to the set of full road features.

Offline Experiments
In this section, we discuss the results of RQ2. Specifically, we report the results of the FIX and REACH experiments (detailed in Section 3). Additionally, we report the results of the comparison between various ML models against the baseline approach (described in Section 3) by considering different test pool compositions.

FIX Experiment results
The goal of this experiment is to optimize the usage of the available resources in terms of testing execution time and effectiveness. Figure 6 compares the ratio of unsafe tests selected for execution using different ML models against the baseline approach (random selection) across different test pool compositions. As can be observed from the figure, the Logistic model outperformed the baseline in all test pool compositions (described in Section 3). Figure 7 illustrates that with fewer unsafe test cases in the pool, we observe improvements in the number of selected unsafe tests using ML models over the baseline. In the pool with the fewest unsafe tests, the Logistic model finds 133% more unsafe tests compared to the baseline approach. In the more balanced testing pool, Logistic finds 50% more unsafe tests, while in the pool with more unsafe than safe tests, it identifies 30% more unsafe tests. The Logistic model performs slightly better than the other models in all compositions except one (0.3/0.7), where Random Forest performed the best.
The confusion matrices in Table 12 further illustrate the concrete results in terms of effectiveness with the various pool compositions. In the pool with only 5% unsafe tests (Table 12-a), the Logistic model achieved 10 false negatives and 260 true negatives; this means that the model avoided the execution of 549 safe tests (considering that safe test cases take around 24 seconds on average to be executed), thus potentially reducing cost by more than 200 minutes in total in the less critical scenario. However, the number of false positives is still high, with a cumulative 263 false positives identified. As can be observed in Table 12-b, for the Test Pool 0.7/0.3 the Logistic model achieved over 260 true positives and only 37 false positives. We observe that the precision correlates with the dataset composition: for datasets having more unsafe tests, the precision for unsafe tests is higher, whereas for datasets having fewer unsafe tests, we observe the opposite effect. Figure 6 shows that the performance of both the ML models and the baseline depends on the test pool composition. The baseline and ML models perform better in test pools with more unsafe tests. Thus, according to our results, designing an appropriate test pool composition is of critical importance to achieving accurate prediction results.
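The time saving quoted above follows from a simple multiplication, which the following lines reproduce. The constant names are chosen for this example; the two input values are the ones stated in the text.

```python
AVG_SAFE_TEST_SECONDS = 24   # approximate average execution time of a safe test
skipped_safe_tests = 549     # safe tests whose execution was avoided

# Convert the avoided execution time from seconds to minutes.
saved_minutes = skipped_safe_tests * AVG_SAFE_TEST_SECONDS / 60
```

The result is about 219.6 minutes, consistent with the "more than 200 minutes" stated above.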

REACH Experiment
The goal of this experiment is to investigate whether the usage of ML models allows reducing the total test execution time. By reducing the total test execution costs, a testing pipeline would be able to spend more testing time on more safety-critical test cases. The task in this experiment was to identify, as early as possible, 10 unsafe tests while minimizing the number of total executed test cases. To perform the various comparisons, for each experimented strategy, we collected the information about the number of test cases required to reach 10 unsafe cases as well as the cumulative cost (i.e., the execution time) to run all the test cases (i.e., until the final unsafe scenario was identified). Further, we collected information concerning the execution time for both safe and unsafe test cases. The conjecture behind this analysis is that the testing cost concerning safe cases should be limited as much as possible, whereas the testing cost dedicated to unsafe cases is beneficial to identify flaws of SDCs in virtual environments. Figure 8 and Figure 9 provide an overview of the performance of the baseline compared to the Logistic model (the best performing model in previous experiments) across different test pool compositions. Table 13 summarizes the results for the REACH experiment. We observed that the Logistic model performed better across all test pool compositions. The test costs strictly depend on the required number of tests to be executed before identifying the minimal set of 10 unsafe tests. Although the difference in the number of required tests tends to be higher in the pool with fewer unsafe tests (in the 0.05/0.95 pool between 171 and 98.5 tests, in the 0.7/0.3 pool between 14 and 11 tests), SDC-Scissor allows reducing test execution time dedicated to less critical tests when the test pool presents more unsafe tests. Figure 10 shows that in the pool with fewer unsafe tests, more test execution time is dedicated to less critical tests. The test execution time spent on these less critical tests is 85% higher in the baseline than in the Logistic model. In the larger pool, the Logistic model reduces the unnecessary execution time by 170%.

Real-Time Experiments
In this section, we discuss the results of the real-time experiments, where we compare the results of a pre-trained model and a real-time model with the baseline approach, as described in Section 3.
Baseline vs. Pre-trained and Adaptive Models. Figure 11 gives an overview of the results achieved by the experimented models. We observe that the baseline executed the highest number of test cases (472). The pre-trained model ran more test cases (405) than the real-time approach (378). Figure 11 summarizes our main observations, as elaborated in the next paragraphs.
The pre-trained and real-time models apply a machine learning-based test selection, which leads to numerous rejected (i.e., non-executed) test cases: the real-time and pre-trained models rejected 588 and 309 tests, respectively. The baseline uses 98% of the time to execute test cases; only 2% is dedicated to generating test cases. The pre-trained and real-time approaches use more time for test generation (6% pre-trained, 11% real-time approach). In addition to the longer test generation process, these two approaches allocate time for predictions and evaluation of tests (pre-trained 4%, real-time 5%), which the baseline does not need to perform. Compared to the pre-trained approach, the real-time approach continuously trains the machine learning model with new tests. Interestingly, although the baseline executes more test cases, both the pre-trained and real-time approaches found more unsafe test cases (baseline 195, pre-trained 265, real-time 256). The pre-trained model was able to find 35% more unsafe test cases, executing only 49% safe tests. In Figure 11, we can observe that the baseline only spends 34% of the time running unsafe tests, while 64% of the test time was spent on executing safe tests. In contrast, our proposed approaches dedicated more than 50% of the time to unsafe tests, which is positive, since, in a testing environment, the goal is to find more errors in less time (in our case, this corresponds to exposing more weaknesses in safety-critical tests).
Finding 8. Our results show that even though the baseline approach executes more test cases, both the real-time and the pre-trained (i.e., offline) models integrated into SDC-Scissor are able to find more unsafe tests than the baseline. The time investment of predicting the outcome of test cases and generating more tests is beneficial for testing purposes. The pre-trained model was able to find 35% more unsafe tests than the baseline, with the baseline only dedicating 34% of the time budget to assessing unsafe tests. The offline model spends 52% of the time running unsafe test cases and only 38% running safe test cases.
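The relative gains discussed above can be recomputed from the reported counts of executed and unsafe tests, as in the following sketch. The dictionary layout and the helper name `relative_gain` are assumptions made for the example; the counts are the ones reported in the text.

```python
def relative_gain(value, reference):
    """Percentage by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100.0

# Executed tests and unsafe tests found within the six-hour budget.
executed = {"baseline": 472, "pre-trained": 405, "real-time": 378}
unsafe_found = {"baseline": 195, "pre-trained": 265, "real-time": 256}

# Gain in unsafe tests found, relative to the baseline.
gains = {
    name: relative_gain(count, unsafe_found["baseline"])
    for name, count in unsafe_found.items()
    if name != "baseline"
}
# Share of executed tests that turned out to be unsafe, in percent.
unsafe_share = {name: unsafe_found[name] / executed[name] * 100.0 for name in executed}
```

With these counts, the pre-trained model finds about 35.9% more unsafe tests than the baseline (reported as 35% above), and roughly two thirds of the tests executed by the two ML-based approaches are unsafe, compared to about 41% for the baseline.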
Adaptive vs. Pre-trained Model. Figure 11 shows that the testing time allocation for the pre-trained and real-time models is similar, but the real-time model spends more time on test generation (11%) than the pre-trained one (6%). The pre-trained model is based on the previously generated dataset with 5,643 test cases (as described in Section 3), whereas the real-time model started with generating an initial dataset of 60 test cases, as described in Section 3. Table 14 shows that the pre-trained model achieved a higher accuracy (72.1%) than the real-time model (69%). The lower accuracy explains the higher number of test cases generated by the real-time model (tests generated: real-time 962, pre-trained 714). Although the pre-trained model has higher accuracy in general and higher unsafe recall, it only found 3.13% more unsafe tests than the real-time model.

Related Work
In this section, we briefly discuss relevant research concerning the following topics: (i) academic and industry studies concerning DevOps limitations and practices in the context of general CPSs; (ii) studies proposing, using, or evaluating simulation environments for testing SDCs.

Limitations of DevOps for CPSs
Contemporary DevOps pipelines allow improving the communication between Development (Dev) and Operations (Ops) during the software development process, enabling continuous improvements to existing and new products [35]. Several researchers and practitioners advocate DevOps as a promising approach for CPSs development [23,61]. However, both in traditional [23,66] and CPS application domains, the state-of-the-art of DevOps is still forming [61], and emerging practices need validation in the wild. A recent survey by Torngren and Sellgren [61] discusses how CPSs' engineering deals with the inner complexity of CPSs' design and the challenges that arise from the environments in which CPSs operate. According to them, while (semi-automated) integration happens through software, there are several distinguishing characteristics between software and physical systems that make co-designing hardware and software hard. Those characteristics include entirely different approaches, techniques, abstractions, platforms, faults & failure modes, and development practices. They conclude that to cope with the foreseen demand of CPSs at scale and in multiple domains, CPS development and testing need rapid prototyping, code/test generation, and various testing phases [57] encapsulating X-in-the-Loop (XiL) activities. In XiL, the 'X' indicates the target of development and testing and typically refers to model (MiL), software (SiL), and hardware (HiL). A typical CPS development pipeline needs to efficiently and effectively integrate various XiL activities to support development and evolution [57,61].
In this paper, we investigated ways to improve XiL activities by proposing simulation-based test scenario generation and selection approaches for SDCs, equipped with machine-learning-based strategies for enabling the selection of the most relevant (or critical) test cases (or scenarios). An optimal test case selection needs to target the identification of relevant (i.e., potentially unsafe) test cases, while optimizing the simulated testing pipeline, to ensure higher safety for the final products.

Simulation-based Testing of SDCs
Given the danger and ineffectiveness of physical testing [39], researchers and practitioners devised simulation-based testing approaches for SDCs. This provides the opportunity to conveniently test the control software of an SDC in a (realistic enough) simulated environment, and in a diverse set of (automatically generated) test cases. Additionally, using simulation, the systems under test can face tests that might be otherwise too expensive, too hard, too risky, or impossible to recreate in real life [25].
Consequently, simulation is becoming one of the cornerstones in developing and validating SDCs, as it is heavily utilized in various XiL activities and across the entire development life-cycle.Simulation is currently used to support the initial inception and requirement analysis (MiL), Hardware/Software Co-design (MiL), design and testing of software components (SiL), training and validation of Machine Learning components (SiL), and testing and validation of the deployed system (HiL).
Abdessalem et al. [2] proposed an approach for test scenario generation for SDCs based on a combination of evolutionary search algorithms and decision tree classification models. Their goal is to leverage classification models to guide the search-based generation of tests faster towards critical test cases. Also, search algorithms refine classification models so that the models can accurately characterize critical regions (i.e., the regions of a test input space that are likely to contain the most critical test cases). They evaluate their approach by generating test cases for an automated emergency braking system, and use PreScan [38], a commercial SDC simulator, for the test execution.
Li et al. [43] proposed AV-FUZZER, a testing framework to find the safety violations of an autonomous vehicle in the presence of an evolving traffic environment. They leverage domain knowledge of vehicle dynamics and genetic algorithms to minimize the safety potential of an autonomous vehicle over its projected trajectory and design a local fuzzer that increases the exploitation of local optima in the areas where highly likely safety-hazardous situations are observed. For evaluating their proposed framework, they use an Unreal Engine based real-time simulation platform, LGSVL [26], that is capable of simulating complex urban and freeway driving scenarios using a library of urban layouts, buildings, pedestrians, and vehicles.
Search-based testing struggles to generate test scenarios with peculiar features, e.g., a rear-end car crash, that developers might need to validate or debug their implementations, and manually creating such test scenarios is time-consuming and cumbersome. Hence, Gambi et al. devised AC3R [30], an approach to derive scenario-based tests from simulations of real car crashes. AC3R leverages natural language processing and a custom ontology to extract information from police reports that describe car crashes and uses basic kinematics to plan the intercepting vehicles' trajectories. It uses BeamNG to automatically simulate the whole environment that re-enacts the car crashes.
As mentioned before in Section 2, we leveraged AsFault [31] to generate test cases and BeamNG [13] to run them in simulation in our study. Complementary to such previous studies, we specifically focused on designing and integrating an approach in the SDC testing pipeline to recognize and exclude safe scenarios without executing them, based on machine learning models. This research direction is relevant to save valuable testing and processing resources, and to allocate computing power and time to execute more critical (i.e., unsafe) and potentially risky test cases.

Threats to Validity
Threats to internal validity may concern, as for previous work [32], the cause-effect relationships between the technologies used to generate the scenarios and their elements and the corresponding results, which strictly depend on the realism of our scenarios. Indeed, since we used AsFault, we did not recreate all the elements that can be found in real roads (e.g., weather conditions). However, to increase our internal validity, we used both BeamNG.AI and Driver.AI as test subjects. They both leverage a good knowledge of the roads, which means that they do not suffer from limitations of vision-based lane-keeping systems. For future work, we plan to leverage the new BeamNG features, which allow experimenting with test cases that include traffic lights as well as other cars and static objects.
Finally, threats to external validity concern the generalization of our findings. Although, compared to previous studies, (i) the number of experimented test cases in our study is relatively larger [32] and (ii) we experimented with different AI engines (i.e., BeamNG.AI and Driver.AI), we cannot claim that our results can be generalized to the universe of general open-source CPS simulation environments in other domains. Therefore, further replications are desirable, as are further studies considering more data as well as other CPS domains. To further minimize potential threats to external validity, in conducting our experimental evaluation, we followed the guidelines by Arcuri et al. [6], which suggest comparing results with randomized test generation algorithms (our baseline approach in RQ2), and repeated the experiments several times.

Conclusions and Future Work
Regression testing for SDCs is particularly challenging due to the cost of run-ning many test driving scenarios.To improve the cost-e ectiveness of regres-sion testing, we introduced a test case selection approach, called SDC-Scissor, that relies on a set of SDC road features extracted from driving scenarios prior to running the tests in the context of BeamNG SDC simulation environment.Then, SDC-Scissor uses ML approaches to select the test cases having a higher likelihood to experience unsafe situations.
We empirically investigated the performance of SDC-Scissor and compared it with baseline approaches.Our assessment of SDC-Scissor shows that SDC-Scissor successfully selects test cases independently from the AI engine used or di erent risk levels (i.e., di erent driving styles), with the Logistic model providing the more stable results.Interestingly, our results also show that the knowledge is not transferable from one AI engine to another one, i.e., SDC-Scissor performed worse when training ML models on data from a speci c AI engine and testing on data from a di erent AI engine.Moreover, among the de ned features to train the ML models, the one that contribute the most in the accuracy of SDC-Scissor are the concerning the set of full road features.
Our findings also suggest that SDC-Scissor can reduce the number of executed tests required to find at least 10 unsafe tests. Specifically, SDC-Scissor outperformed the baseline across all test pools, with the Logistic model reducing the unnecessary execution time dedicated to safe tests by 170%. In terms of running time, we observed that SDC-Scissor is able to select test scenarios in a cost-effective manner compared to a random baseline approach.
As future work, we plan to replicate our study on further SDC datasets, AI engines, and SDC features. Moreover, we plan to perform new empirical studies on further CPS domains to investigate how SDC-Scissor performs when safety criteria concern new types of safety-critical faults, different from those investigated in this study. Finally, we want to investigate different meta-heuristics to enable test case generation based on the designed feature sets.

Fig. 4 Overview of the Adaptive Model configuration for the Real-Time Experiments.

Fig. 6 Comparison of the Logistic model and the baseline across different test pool compositions.

Fig. 8 Comparing the Logistic model with the baseline across the different test pools.

Fig. 9 Fig. 10
Fig. 9 Time spent for the execution of safe tests, Logistic model vs. baseline, across different test pools.

Fig. 11 Comparison of the metrics for different real-time approaches in a 6-hour run: a) distribution of generated test cases; b) distribution of time spent across different tasks.

Finding 9. The offline model achieved an accuracy of 72.1%, which is higher than that of the real-time model (69%). Nevertheless, a real-time approach can achieve results similar to those of an offline model, with the real-time model finding only 3.13% fewer unsafe tests than the offline model. In achieving such results, the real-time model used an initial set of only 60 test cases, whereas the offline model leveraged 5,643 tests.
2. Selection of SDC test cases: We investigated new methods in the area of SDCs for test case selection. Hence, we introduced SDC-Scissor, which leverages ML models to improve testing cost-effectiveness via test case selection.
3. Offline vs. Real-time Training: We investigated two opposite setups for SDC test case selection that leverage ML models trained on off-line data (i.e., trained on a large static dataset) and real-time data (i.e., dynamically generated tests).
4. Replication package: We built a large dataset of labeled test cases

Table 1
Full Road Attributes. In the table, we report for each feature its name, description, type, and range. We computed the range empirically.

In this paper, we investigate Machine Learning-based test selection techniques for improving the cost-effectiveness of simulation-based testing for SDCs. The first challenge concerns the identification of features that can be used to predict whether test cases are safe or unsafe (RQ1). We focus on extracting features from test case definitions (e.g., road features and road segment features), i.e., features available before executing the simulations. The second challenge is devising techniques that effectively leverage such test input features to minimize testing costs while keeping testing effectiveness high (RQ2). Specifically, we investigate two alternative setups (explained later in this section): pre-trained ML models (referred to as offline training later), which may find application in regression testing, and real-time retrained ML models, which are suitable for automated test generation.
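To make the notion of features available before executing the simulations more concrete, the following sketch derives a few road-level features from a road given as a polyline of (x, y) points. The polyline representation, the feature names (`road_length`, `num_left_turns`, `num_right_turns`, `total_angle_deg`), and the turn tolerance are assumptions made for illustration; they are not the exact feature definitions of the tables in this paper.

```python
import math

def road_features(points, tol_deg=1e-6):
    """Compute simple full-road features from a polyline of (x, y) points."""
    pairs = list(zip(points, points[1:]))
    # Total road length: sum of distances between consecutive points.
    length = sum(math.dist(p, q) for p, q in pairs)
    # Heading (in radians) of each straight piece of the polyline.
    headings = [math.atan2(q[1] - p[1], q[0] - p[0]) for p, q in pairs]
    left = right = 0
    total_angle = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        # Signed heading change wrapped to [-180, 180] degrees;
        # a positive value is a counter-clockwise (left) turn.
        delta = math.degrees(math.atan2(math.sin(h1 - h0), math.cos(h1 - h0)))
        if delta > tol_deg:
            left += 1
        elif delta < -tol_deg:
            right += 1
        total_angle += abs(delta)
    return {
        "road_length": length,
        "num_left_turns": left,
        "num_right_turns": right,
        "total_angle_deg": total_angle,
    }

# Hypothetical road: three straight pieces joined by two 90-degree left turns.
print(road_features([(0, 0), (10, 0), (10, 10), (0, 10)]))
```

Because such values depend only on the road definition, they can be computed for every candidate test without starting the simulator.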

Table 2
Full Road Statistics. In the table, we report for each feature its name, description, type, and range. We computed the range empirically.

Table 3
Road Segment Features. In the table, we report for each feature its name, description, type, and range. We computed the range empirically.

Table 4
Second, testing Driver.AI resulted in fewer unsafe cases than testing BeamNG.AI in the moderate configuration. The above observations suggest that the aggression factor strongly influences the safety of BeamNG.AI; hence, changing its value likely results in different driving styles. At the same time, Driver.AI drives more cautiously than BeamNG.AI in the moderate configuration. Therefore, different test subjects indeed drive differently on the same roads.
3.2 Research Method
We designed three sets of experiments to answer our research questions. The first set of experiments (i.e., Machine Learning-based Experiments) investigates whether ML models trained with global and local road features can identify safe and unsafe test cases before their execution (RQ1). The second and third sets of experiments (i.e., Offline Experiments and Real-Time Experiments) investigate if and how much SDC-Scissor improves the cost-effectiveness of SDC simulation-based testing (RQ2).
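Whether a model can identify safe and unsafe test cases is typically summarized with accuracy, precision, and recall, the metrics reported for the Logistic model. The sketch below computes them from actual and predicted labels, treating "unsafe" as the positive class; the label strings, the function name, and the example data are illustrative assumptions.

```python
def selection_metrics(actual, predicted, positive="unsafe"):
    """Accuracy, precision, and recall with `positive` as the target class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Share of tests predicted unsafe that really were unsafe.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of truly unsafe tests that the model flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical outcomes of five executed tests and the model's predictions.
actual = ["unsafe", "unsafe", "safe", "safe", "unsafe"]
predicted = ["unsafe", "safe", "unsafe", "safe", "unsafe"]
print(selection_metrics(actual, predicted))
```

In a test selection setting, recall indicates how many fault-revealing tests survive the selection, while precision indicates how much simulation time is still spent on safe tests.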

Table 6
Offline Experiment Dataset

Table 8
Performance of the ML models trained using full road features. The results refer to the 80/20 split between training and test data. The best results are shown in boldface.

Table 9
ML models' accuracy on mixed datasets.

Table 10
Performance of the ML models trained using road segment features. The best results are shown in boldface.

Table 13
Results of the REACH experiments comparing the Logistic model and the baseline. Execution time is reported in seconds, and the values are averaged across the experiment repetitions.

Finding 7. We investigated whether SDC-Scissor can reduce the number of executed tests required to find at least N unsafe tests. Our results show that SDC-Scissor outperformed the baseline across all test pools, with the Logistic model reducing the unnecessary execution time dedicated to safe tests by 170%. SDC-Scissor performed better compared to the baseline when test pools are characterized by fewer unsafe tests.
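The quantity discussed here, the effort needed to find at least N unsafe tests, can be sketched as follows. The example counts the executed tests and the simulation time spent on safe tests until N unsafe tests are found, once for an unfiltered pool and once for the subset predicted unsafe. The pool contents, the tuple layout, and the use of the pool order in place of a random baseline order are assumptions chosen to keep the example deterministic.

```python
def run_until_n_unsafe(ordered_tests, n):
    """Execute tests in order until n unsafe ones are found.

    ordered_tests: list of (is_unsafe, exec_seconds) tuples.
    Returns (tests_executed, unsafe_found, seconds_spent_on_safe_tests).
    """
    executed = found = 0
    safe_seconds = 0.0
    for is_unsafe, seconds in ordered_tests:
        if found >= n:
            break
        executed += 1
        if is_unsafe:
            found += 1
        else:
            safe_seconds += seconds
    return executed, found, safe_seconds

# Hypothetical test pool: (predicted_unsafe, is_unsafe, exec_seconds).
pool = [
    (False, False, 40.0),
    (True, True, 35.0),
    (False, False, 50.0),
    (True, False, 45.0),
    (True, True, 30.0),
    (False, True, 55.0),
]

# Baseline stand-in: execute every test in pool order.
baseline = [(unsafe, secs) for _, unsafe, secs in pool]
# Selection: execute only the tests predicted to be unsafe.
selected = [(unsafe, secs) for pred, unsafe, secs in pool if pred]

print(run_until_n_unsafe(baseline, 2))
print(run_until_n_unsafe(selected, 2))
```

In this toy pool, selection reaches two unsafe tests after fewer executions and with less time spent on safe tests than the unfiltered order, which is the kind of saving the REACH experiments measure.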

Table 14
Comparison between pre-trained and real-time models.