1 Introduction

The increasing complexity of machine learning (ML) based systems necessitates rigorous design and validation approaches to ensure the correctness and trustworthiness of a system. Unlike traditional algorithms composed of explicitly specified logical rules, ML algorithms are data-driven and implicitly derive their own inferences. Their “reasoning” and results are often unpredictable and difficult to interpret, resulting in a loss of transparency of the system. This renders powerful techniques used in traditional software development, e.g. unit testing and regression testing, either ineffective or in need of substantial modification.

The data-collection process can be quite expensive or, in some cases, impossible; simulations therefore meet the remaining demand with synthetic data. Quality assurance for ML-based systems requires both a substantial volume of synthetic data and verifiable quality metrics for it. This paper presents a systematic and domain-agnostic methodology for synthetic data generation that addresses two aspects of data quality: the transparency and the diversity of the scenarios behind the data. The methodology is based on formal application scenario descriptions, augmented with formal scenario variation descriptions, and experimentable digital twins. Application scenarios describe the environment, the entities, actions, goals and the initial configuration of an experiment. The digital twin of a system, in turn, is its comprehensive representation, i.e. it collects the set of knowledge representations of a system that may belong to different domains and cater to diverse functionalities. The digital twin can be simulated within virtual testbeds, platforms that provide various simulation functionalities, to create an experimentable digital twin (EDT) [14]. The proposed methodology integrates the two concepts in an iterative manner.

2 State of the Art

To achieve simulation-based variation, a target scenario is typically explicitly modelled in the simulation platform of choice, and its parameters are varied accordingly—e.g. the steering angle and acceleration of a constant turn rate and acceleration (CTRA) model [16]. At the other end of the spectrum, adversarial methodologies are increasingly used to challenge ML-based systems, where another ML system iteratively generates adversarial configurations for the system-under-test [5].
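
As an illustration of such explicit parameter variation, the following minimal Python sketch propagates a simplified CTRA-style motion model while varying turn rate and acceleration; the parameter ranges, step sizes and the simplified update are illustrative assumptions and not taken from [16].

import math
import random

def propagate_ctra(x, y, heading, v, a, omega, dt):
    """One step of a simplified constant turn rate and acceleration (CTRA) style model."""
    if abs(omega) < 1e-6:  # straight-line motion as the turn rate approaches zero
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
    else:
        x += (v / omega) * (math.sin(heading + omega * dt) - math.sin(heading))
        y += (v / omega) * (math.cos(heading) - math.cos(heading + omega * dt))
    return x, y, heading + omega * dt, v + a * dt

# Vary the semantically meaningful parameters (turn rate, acceleration) directly.
for _ in range(10):
    omega = random.uniform(-0.3, 0.3)   # rad/s, illustrative range
    a = random.uniform(-2.0, 2.0)       # m/s^2, illustrative range
    state = (0.0, 0.0, 0.0, 10.0)       # x, y, heading, speed
    for _ in range(50):
        state = propagate_ctra(*state, a=a, omega=omega, dt=0.1)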

Both ends of the spectrum lack a generic, platform-independent and semantically meaningful description of the scenario and of the parameters to be varied. Jian et al. use a configurable scene grammar to describe static scenes, with stochasticity as part of the description to capture possible scene variations [11]. Fremont et al. developed SCENIC, a probabilistic programming language for describing dynamic scenarios with variation, which can be integrated into a simulation engine [9]. Fremont et al. [10] use the formal probabilistic description in SCENIC for test-case generation of autonomous vehicle safety scenarios. Another prevalent approach for describing dynamic scenes is found in the automotive industry in the OpenSCENARIO standard [2]. The PEGASUS methodology [3] defines logical scenarios as a supplement to the OpenSCENARIO description to specify parameter variation. In contrast to SCENIC, the decoupled description of the scenario and the scenario variation offers a higher potential for systematization and optimization of scenario variation, as will be seen later in this paper. The PEGASUS methodology, however, does not address complex probability distributions and inter-parameter constraints for parameter variation, which are addressed in this paper.

There is a vast body of literature and frameworks for validating ML models by exploring certain parameter spaces, regardless of whether the parameters are semantically meaningful. The VERIFAI framework allows the user to define an abstract feature space as input, which it varies to run falsification tests on the ML model [7]. DeepXplore varies inputs for deep learning systems to explore the resulting neuron coverage, and can find the inputs that contribute most to differential behavior [13].

3 The Scenario Variation Methodology

As data quality plays a vital role in quality assurance for ML-based systems, the data generation process should incorporate maximum transparency and formalism, as with quality control for conventional software. Furthermore, the process should allow the identification and control of data quality metrics such as data accuracy, understandability, correctness and context coverage [8]. The scenario variation methodology, summarized in Fig. 1, affords the designer control over these factors via a systematic workflow and semantically meaningful control parameters. The following sections describe each step of the workflow.

Fig. 1 The scenario variation methodology for synthetic data generation

3.1 Scenario Configuration

The scenario configuration stage involves the definition of the basic application scenario. This paper follows the classification of Dahmen et al., which divides scenarios into abstract, logical and concrete scenarios [6].

Abstract Scenario

The abstract scenario provides the description of an environment and defines the participating entities, actions and goals. Certain parameters at this level are abstract, i.e. either undefined or assigned preliminary values. The abstract scenario must be specified in a human-readable and formal syntax (e.g. a standardized XML schema like OpenSCENARIO [2]) and must be semantically complete and consistent. For example, the abstract scenario of a vehicle performing a lane-change maneuver may be (informally) described with abstract parameters \(p_1\)–\(p_4\):

Given a road with \(p_1\) lanes, the actor car, with initial position on lane \(p_2\) and velocity \(p_3\), moves to lane \(p_1-1\) after \(p_4\) minutes have passed.
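
As a minimal sketch (not part of the paper's XML-based tooling), such an abstract lane-change scenario could be captured in a small data structure whose abstract parameters remain unbound until the logical and concrete stages; all names below are illustrative assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class LaneChangeAbstractScenario:
    """Abstract scenario: entities, actions and goals with unbound parameters p1-p4."""
    num_lanes: Optional[int] = None            # p1: abstract, fixed only in a concrete scenario
    initial_lane: Optional[int] = None         # p2
    initial_velocity: Optional[float] = None   # p3, m/s
    lane_change_delay: Optional[float] = None  # p4, minutes

    def target_lane(self) -> int:
        # The goal is expressed relative to the abstract parameters (lane p1 - 1).
        return self.num_lanes - 1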

Logical Scenario

The logical scenario uses the abstract parameters to specify rules for scenario variation and likewise follows a formal syntax and is semantically meaningful. Maqbool et al. [12] introduced an XML-based test specification that defines logical scenarios via a dedicated meta-model, allowing hierarchical modeling of parameter ranges, probability distributions, and inter-parameter mathematical and logical constraints. The example in Scenario 1 illustrates this approach. A generic speed distribution element is defined for vehicle speeds in urban settings. The two abstract parameters, speed_vehicle_1 and speed_vehicle_2, inherit the attributes of this element, and speed_vehicle_2 overwrites the distribution. A mathematical constraint is additionally specified between the abstract parameters: regardless of the values chosen for the abstract parameters, the constraint must hold.

Scenario 1: Logical scenario with a generic speed distribution element and an inter-parameter constraint (listing)
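
The following minimal Python sketch mimics the structure described above; it is not the XML syntax of [12], and the distribution parameters and the constraint are illustrative assumptions only.

# Generic, reusable distribution element for urban vehicle speeds (values assumed).
urban_speed = {"type": "uniform", "low": 8.0, "high": 14.0}  # m/s

logical_scenario = {
    "parameters": {
        # speed_vehicle_1 inherits the generic urban speed element unchanged.
        "speed_vehicle_1": dict(urban_speed),
        # speed_vehicle_2 inherits the element but overwrites the distribution.
        "speed_vehicle_2": {**urban_speed, "type": "gaussian", "mean": 11.0, "std": 1.5},
    },
    # Inter-parameter constraint that every concrete scenario must satisfy (assumed form).
    "constraints": [lambda p: p["speed_vehicle_1"] <= p["speed_vehicle_2"]],
}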

Logical Scenario Design

The logical scenario discussed above is well-equipped to generate possible, impossible, probable and improbable scenarios. As Fig. 1 illustrates, both domain expertise and historical data may serve as sources for the design of a logical scenario. Logical scenario design by domain expertise is exemplified in [17], where sets of possible parameter values are derived by listing and clustering the pre-conceived situations the system may encounter. An example of design by historical data can be seen in [16], where a driving study from BMW is used to estimate probable driver inputs for a car within a sharp curve. The logical scenario methodology fully supports both approaches via the specification of parameter distributions and constraints, while using a platform-independent and formal syntax to do so. The third method for logical scenario design illustrated in Fig. 1, via feedback from scenario evaluation, is discussed in Sect. 3.4.

3.2 Scenario Variation

The scenario variation stage uses the logical scenarios to generate concrete scenarios. Concrete scenarios have concrete values for the previously abstract parameters, distributed according to the logical scenario specification. This stage uses sampling techniques to generate samples that follow the specified parameter space as closely as possible. This contribution uses the Markov chain Monte Carlo variants proposed by Maqbool et al. [12] to generate the samples.
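
As a rough illustration of constraint-aware sampling (a generic Metropolis–Hastings sketch, not the specific MCMC variants of [12]), the following Python snippet draws speed pairs for the logical scenario sketched in Sect. 3.1; the proposal width and the density values are illustrative assumptions.

import math
import random

def target_density(p):
    """Unnormalized density of the logical scenario: uniform x Gaussian, zero if a constraint fails."""
    s1, s2 = p["speed_vehicle_1"], p["speed_vehicle_2"]
    if not (8.0 <= s1 <= 14.0) or s1 > s2:             # range and inter-parameter constraint
        return 0.0
    return math.exp(-0.5 * ((s2 - 11.0) / 1.5) ** 2)   # Gaussian factor for speed_vehicle_2

def metropolis_hastings(n_samples, step=0.5):
    current = {"speed_vehicle_1": 10.0, "speed_vehicle_2": 12.0}  # feasible start
    samples = []
    while len(samples) < n_samples:
        proposal = {k: v + random.gauss(0.0, step) for k, v in current.items()}
        accept = target_density(proposal) / max(target_density(current), 1e-12)
        if random.random() < accept:
            current = proposal
        samples.append(dict(current))   # each accepted state is one concrete parameter set
    return samples

concrete_scenarios = metropolis_hastings(1000)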

The scenario-based approach with decoupled abstract, logical and concrete scenarios helps to impart understandability to ML data, as concrete scenarios provide a unique, formal and human-readable basis behind each data-set. Secondly, the distributions and constraints offer control over data accuracy—they control the similarity between the concrete scenarios and the desired realistic distribution. Accuracy of the data is further ensured by the digital twin approach for simulating the concrete scenarios, discussed in the next section.

3.3 Scenario Evaluation

The scenario evaluation stage brings the concrete scenarios to life using simulation techniques. The authors propose the use of experimentable digital twins (EDTs) to match the flexible and multi-domain nature of the scenarios. The EDT of a system is its digital twin implemented as a simulation model in a virtual testbed that offers diverse simulation functionalities. EDTs collect various aspects of the system and can be easily reconfigured for different application contexts throughout the training and validation process. Additionally, EDTs offer scalability in the level of detail (e.g. simulation realism, sensor resolution) and computing resources [15]. Figure 2 illustrates the EDT of a rover on extra-terrestrial terrain modeled in the multi-domain simulation software VEROSIM. The figure illustrates how EDT-based simulation allows the fusion of environment generation, multi-body dynamics and various perception sensors.

Fig. 2 Diverse sensors mounted on an extra-terrestrial rover. The sensor positions are labeled, and the black box illustrates the rendered output of the camera and stereo-camera

The flexible and modular nature of EDTs makes them an ideal fit for the parameterized and iterative scenario evaluation methodology in Fig. 1. For instance, scenario design iterations can be performed on simplistic models without expensive sensor rendering and can be seamlessly upgraded as required in subsequent iterations. The EDT-based scenario evaluation stage generates the ground truth data and the replay data. The ground truth data is annotated and labeled by the simulation and serves as ML input data, whereas the replay data contains the simulation events and results that may be used for post-analysis of a particular simulation run. EDTs thus further impart control over the accuracy and coherency of the synthetic data-sets through flexibility in the realism, scope and configuration of the simulation entities.
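
A minimal sketch of how these two outputs of a simulation run could be organized is given below; the field names are illustrative assumptions, not the VEROSIM data model.

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class GroundTruthSample:
    """Annotated simulation output serving as ML input data."""
    sensor_data: Dict[str, Any]   # e.g. rendered camera frame, LiDAR scan
    labels: Dict[str, Any]        # e.g. object poses, class annotations
    concrete_scenario_id: str     # link back to the human-readable concrete scenario

@dataclass
class ReplayRecord:
    """Simulation events and results kept for post-analysis of a run."""
    concrete_scenario_id: str
    events: List[Dict[str, Any]] = field(default_factory=list)  # timestamped simulation events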

3.4 Scenario Redesign

As previously mentioned, design by domain expertise and design by historical data are not always feasible in practical scenario design. Simulation results provide valuable insight into the effect of the parameters and open the way towards iterative scenario design. Parameterization via logical scenarios makes every scenario viable for iterative redesign, and this iterative process can be carried out by a domain expert or an optimization algorithm. Consider the example (illustrated in detail in Sect. 4) of an automotive simulation, where the ML designer requires data-sets from both accidental and non-accidental situations, but the desired scenario distribution for such data-sets is unknown. Random simulation within the complete parameter space can offer insight and allow the scenario designer to set the desired parameter bounds. Various optimization- and heuristic-based algorithms can be used for this purpose [1, 4].
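
The following Python sketch outlines one possible redesign loop along these lines; the simulation call and the bound-tightening heuristic are crude placeholders and not the cited algorithms [1, 4].

import random

def redesign_bounds(simulate, bounds, n_runs=200, target_share=0.5):
    """One redesign iteration: explore the current parameter bounds uniformly at random,
    then tighten the bounds if collision and no-collision outcomes are unbalanced.
    `simulate` stands for an EDT-based run returning True if the run ends in a collision."""
    runs = []
    for _ in range(n_runs):
        params = {name: random.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        runs.append((params, simulate(params)))
    collisions = [p for p, hit in runs if hit]
    share = len(collisions) / n_runs
    if abs(share - target_share) < 0.1 or not collisions:
        return bounds  # already balanced (or no collisions observed); keep the bounds
    new_bounds = {}
    for name, (lo, hi) in bounds.items():
        # Assumed heuristic: re-center each range on the median collision sample and shrink it.
        values = sorted(p[name] for p in collisions)
        mid, half = values[len(values) // 2], (hi - lo) / 4
        new_bounds[name] = (max(lo, mid - half), min(hi, mid + half))
    return new_bounds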

3.5 ML Training and Validation

Once the concrete scenarios in the scenario design phase have acquired sufficient characteristics, the EDT-based simulations can be used to generate ground truth or input data for training and validating ML-based systems. The scenario variation methodology suggests another feedback loop after training or validating the ML system. This loop can be utilized, e.g., to iteratively find critical scenarios for the ML model using the same heuristics as in Sect. 3.4.
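
One conceivable form of this feedback loop, sketched below under the assumption that a per-scenario validation error of the trained model is available, biases the next logical scenario iteration towards the worst-performing scenarios.

def select_critical_scenarios(scenarios, model_error, top_k=20):
    """Rank simulated concrete scenarios by the trained model's error and keep the worst ones.
    The selected critical scenarios can then inform the parameter bounds or distributions
    of the next logical scenario iteration."""
    ranked = sorted(scenarios, key=lambda s: model_error(s), reverse=True)
    return ranked[:top_k]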

4 Application Examples

Two examples from the space and automotive domains are presented to illustrate the multi-domain capability of the scenario variation methodology. In both examples, the OpenSCENARIO standard is adapted to describe the abstract scenario, whereas the logical scenario is specified via the test specification in [12]. VEROSIM is used as the EDT-based simulation software.

Fig. 3 Logical scenario design of a satellite docking maneuver

Rendezvous and Docking Scenario

The rendezvous and docking (RvD) maneuver, illustrated in Fig. 3a, requires a chaser shuttle to scan a target satellite via LiDAR and determine the relative pose. The ML-based pose-estimation algorithm is to be trained via synthetic LiDAR scans with ground-truth information. The specifications of the LiDAR scanner carried by the chaser and of the ML model pose two constraints. Firstly, all measurements must be taken with the chaser inside a flight corridor. The flight corridor is specified as a cone with its apex on the satellite, length \(l_C\) and radius \(r_C\), see Fig. 3b. Secondly, the closer the chaser is to the satellite, the higher the likelihood that it is near the corridor's center axis. The simulated datasets should reflect this distribution.
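
One way to write the first requirement formally, assuming \(\theta\) denotes the cone half-angle and using the unit vector \(\vec{n}\) along the corridor axis and the vector \(\vec{d}\) from the target to the chaser introduced below, is:

\[ \vec{n}\cdot\vec{d} \;\le\; l_C, \qquad \angle\big(\vec{n},\vec{d}\big) \;\le\; \theta \;=\; \arctan\!\left(\frac{r_C}{l_C}\right). \]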

Scenario 2: Logical scenario for the rendezvous and docking maneuver (listing)

The abstract scenario is modeled via the teleport action of OpenSCENARIO—both chaser and satellite are teleported to initial positions, and the initial coordinates of the chaser are defined as abstract parameters. The logical scenario is specified as in Scenario 2. Lines 3–4 enforce requirement 1 via inter-parameter constraints. As Fig. 3 illustrates, given the unit vector \(\vec {n}\) along the corridor axis and the vector \(\vec {d}\) from the target to the chaser, the dot product of the vectors—the distance between the target and the chaser along the corridor axis—must be less than the corridor length. Secondly, the angle between \(\vec {n}\) and \(\vec {d}\) must be less than the corridor angle \(\theta \), so that the chaser always remains within the corridor bounds. Lines 5–6 implement requirement 2. The “RvD_Dist” distribution is implemented in the scenario variation engine by extending the meta-model in [12] and is simply referenced in the logical scenario. The implementation uses a Gaussian distribution dependent on the distance between the chaser and the target. The resulting concrete scenarios, generated with unique chaser positions, are illustrated in Fig. 3c.
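
For illustration, a plain rejection-sampling sketch of chaser positions with a distance-dependent Gaussian spread is given below; it is not the “RvD_Dist” implementation of the scenario variation engine, and all numeric values are assumptions.

import math
import random

def sample_chaser_position(l_c, r_c, max_tries=10000):
    """Sample a chaser position inside the flight-corridor cone (apex at the target,
    corridor axis assumed along +x). The lateral offset follows a Gaussian whose
    spread grows with the distance to the target, so close-range positions cluster
    near the center axis (requirement 2)."""
    theta = math.atan2(r_c, l_c)                     # cone half-angle
    for _ in range(max_tries):
        d_axial = random.uniform(0.1 * l_c, l_c)     # distance along the axis (assumed range)
        sigma = 0.3 * d_axial * math.tan(theta)      # distance-dependent lateral spread (assumed)
        lateral = abs(random.gauss(0.0, sigma))
        if lateral <= d_axial * math.tan(theta):     # stay within the corridor (requirement 1)
            phi = random.uniform(0.0, 2.0 * math.pi) # rotation about the corridor axis
            return (d_axial, lateral * math.cos(phi), lateral * math.sin(phi))
    raise RuntimeError("no feasible chaser position found")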

Automotive Collision Avoidance

In the second use-case, a collision-avoidance ML model must steer a vehicle away from an incoming truck via an evasive maneuver. The ML model needs sufficient samples of both collision and non-collision scenarios for training, otherwise it risks over-fitting to a particular case. To find the target logical scenario, an initial logical scenario is set up with the velocity of the car \(v\) and the point of curvature \(p_{arc}\) as abstract parameters. The point of curvature is the evasion-inducing point within the spline trajectory of the vehicle. The first iteration of the logical scenario, illustrated in Scenario 3, assigns suitable ranges to the two abstract parameters with uniform probability distribution functions (PDFs). The resulting concrete scenarios and their EDT simulations are illustrated in Fig. 4a and b, respectively. The percentage of no-collision scenarios is much lower than that of collision scenarios, risking an ML model that over-fits to a particular case. Based on the results, the next iteration of logical scenario design can either impose new parameter ranges or an appropriate PDF to ensure a sufficient number of both collision and non-collision scenarios. A bi-modal Gaussian PDF with two means located within the highest collision and no-collision densities is chosen. The Gaussian PDF allows the scenario designer more flexibility by providing a finer balance between the area of the sampling region and the frequency of outlier sampling. The logical scenario is reformulated in the second iteration of Scenario 3. The resulting concrete scenarios and simulations are illustrated in Fig. 4c and d and show an equal distribution of accident and no-accident scenarios. With the desired logical distribution now found, the number of concrete scenarios can be further increased and the simulation can be made more complex by adding realistic sensor EDTs.

Scenario 3: Logical scenario iterations for the collision avoidance use-case (listing)

Fig. 4 Iterative scenario design for collision avoidance
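
A minimal sketch of the second-iteration sampling idea described above, using a bi-modal Gaussian mixture over the car velocity, is given below; the mixture means, weights and standard deviations are illustrative assumptions, not the values of Scenario 3.

import random

def sample_bimodal(mean_collision, mean_no_collision, std=1.0, weight=0.5):
    """Draw one value from a two-component Gaussian mixture, with one mode placed
    in the region of highest collision density and one in the region of highest
    no-collision density."""
    mean = mean_collision if random.random() < weight else mean_no_collision
    return random.gauss(mean, std)

# Illustrative use: concrete velocities for the car (values assumed, in m/s).
velocities = [sample_bimodal(mean_collision=22.0, mean_no_collision=14.0) for _ in range(100)]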

5 Conclusions

This contribution introduced a methodology for synthetic data generation based on formal scenarios, semantic parameter variation and experimentable digital twins (EDTs). The methodology brings transparency and formality to the data generation process and delivers control over data quality via scenario distributions and EDT configurations. A human-readable concrete scenario behind each synthetic data-set imparts a higher degree of understanding of the data. The proposed logical scenarios allow a formal specification of the scenario distribution. They can support domain expertise, historical data, as well as iterative methods to derive the scenario distribution. EDTs concurrently provide a simulation platform to simulate the scenarios throughout the scenario design process, offering high flexibility in simulation perspective, complexity and scale. Future work will carry out further research on the iterative design of logical scenarios using metrics from trained machine learning models, and explore techniques to derive exploratory, exploitative and adversarial logical scenarios.