Abstract
Synthetic data is an indispensable supplement to difficult-to-acquire real data in meeting the substantial data demand of machine learning based systems. Because data plays the key role in machine learning models, objective and maintainable data quality metrics are vital for the quality assurance of the whole system. This paper introduces a systematic and domain-neutral methodology for the generation of synthetic data, based on formalized scenario variation and experimentable digital twins. The methodology uses human-readable scenarios and semantically meaningful parameter variations to describe the possible entities, actions and events to be simulated, whereas experimentable digital twins bring the scenarios to life by integrating the various domains of a system, such as mechanics, sensors, actuators and communication, under one platform that can be simulated as a whole. Scenario description and digital twin simulation are carried out iteratively to derive the optimal distribution of synthetic data. Scenarios and experimentable digital twins can thus together serve as a medium to systematically cover diverse application scenarios, test dangerous situations and find faults within a system.
1 Introduction
The increasing complexity of machine learning (ML) based systems necessitates rigorous design and validation approaches to ensure the correctness and trustworthiness of a system. Unlike traditional algorithms composed of specified logical rules, ML algorithms are data-driven and implicitly derive their own inferences. The “reasoning” and results are often unpredictable and difficult to interpret, resulting in a loss of transparency of the system. This renders powerful techniques used in traditional software development, e.g. unit testing and regression testing, either ineffective or in need of serious modification.
The data-collection process can be quite expensive or, in some cases, impossible; simulations therefore cover the unmet demand with synthetic data. Quality assurance for the ML-based system requires both a substantial volume of synthetic data and verifiable metrics of its quality. This paper presents a systematic and domain-agnostic methodology for synthetic data generation that addresses two aspects of data quality: transparency and the diversity of scenarios behind the data. The methodology is based on formal application scenario descriptions, appended with formal scenario variation descriptions, and experimentable digital twins. Application scenarios describe the environment, the entities, actions, goals and the initial configuration of an experiment. The digital twin of a system, in turn, is its comprehensive representation, i.e. it collects the set of knowledge representations of a system that may belong to different domains and cater to diverse functionalities. The digital twin can be simulated within virtual testbeds, platforms that provide various simulation functionalities, to create the experimentable digital twin (EDT) [14]. The proposed methodology integrates the two concepts in an iterative manner.
2 State of the Art
To achieve simulation-based variation, a target scenario is typically modelled explicitly in the simulation platform of choice, and its parameters are varied accordingly, e.g. the steering angle and acceleration of a constant turn rate and acceleration (CTRA) model [16]. At the other end of the spectrum, adversarial methodologies are increasingly used to challenge ML-based systems, where another ML-system iteratively generates adversarial configurations for the system-under-test [5].
Both ends of the spectrum lack a generic, platform-independent and semantically meaningful description of the scenario and of the parameters to be varied. Jiang et al. use a configurable scene grammar to describe static scenes, with stochasticity as part of the description to capture the possible scene variation [11]. Fremont et al. developed SCENIC, a probabilistic programming language for describing dynamic scenarios with variation, which can be integrated into a simulation engine [9]. Fremont et al. [10] use the formal probabilistic description in SCENIC for test-case generation of autonomous vehicle safety scenarios. Another prevalent approach for describing dynamic scenes is found in the automotive industry in the OpenSCENARIO standard [2]. The PEGASUS methodology [3] defines logical scenarios as a supplement to the OpenSCENARIO description in order to specify parameter variation. In contrast to SCENIC, the decoupled description of the scenario and the scenario variation offers a higher potential for systematization and optimization of scenario variation, as will be seen later in this paper. The PEGASUS methodology, however, does not deal with complex probability distributions and inter-parameter constraints for parameter variation, both of which are addressed in this paper.
There is a vast body of literature and frameworks for the validation of ML-models by exploring certain parameter spaces, regardless of whether the parameters are semantically meaningful. The VERIFAI framework allows the user to define an abstract feature space as input, which it varies to run falsification tests on the ML-model [7]. DeepXplore varies the inputs of deep learning systems to explore the resulting neuron coverage, and can find the inputs that contribute most to differential behavior [13].
3 The Scenario Variation Methodology
As data quality plays a vital role in quality assurance for ML-based systems, the data generation process should incorporate maximum transparency and formalism, as with quality control for conventional software. Furthermore, the process should allow the identification and control of data quality metrics such as data accuracy, understandability, correctness and context coverage [8]. The scenario variation methodology, summarized in Fig. 1, affords the designer control over these factors via a systematic workflow and semantically meaningful control parameters. The following sections go through each of the steps.
3.1 Scenario Configuration
The scenario configuration stage involves the definition of the basic application scenario. This paper follows the classification of Dahmen et al., which divides scenarios into abstract, logical and concrete scenarios [6].
Abstract Scenario
The abstract scenario provides the description of an environment and defines the participating entities, actions and goals. Certain parameters at this level are abstract, i.e. either undefined or assigned preliminary values. The abstract scenario must be specified in a human-readable and formal syntax (e.g. a standardized XML-Schema like OpenSCENARIO [2]), and must be semantically complete and consistent. For example, the abstract scenario of a vehicle performing a lane-change maneuver may be (informally) described with the abstract parameters \(p_1\) to \(p_4\):
Given road with \(p_1\) lanes, the actor car with the initial position on lane \(p_2\) and velocity \(p_3\) moves to lane \(p_1-1\) after \(p_4\) minutes have passed.
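The step from abstract parameters to a concrete description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `AbstractParameter`, the function `concretize` and the parameter descriptions are hypothetical names chosen here and are not part of OpenSCENARIO or of the authors' tooling.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AbstractParameter:
    """A named placeholder; `preliminary` may stay None at the abstract level."""
    name: str
    description: str
    preliminary: Optional[float] = None


# Hypothetical encoding of the four abstract parameters of the lane-change example.
LANE_CHANGE_PARAMETERS = (
    AbstractParameter("p_1", "number of lanes on the road"),
    AbstractParameter("p_2", "initial lane of the actor car"),
    AbstractParameter("p_3", "initial velocity of the actor car"),
    AbstractParameter("p_4", "minutes before the lane change"),
)

TEMPLATE = (
    "Given road with {p_1} lanes, the actor car with the initial position on "
    "lane {p_2} and velocity {p_3} moves to lane {target_lane} after {p_4} "
    "minutes have passed."
)


def concretize(values: Dict[str, float]) -> str:
    """Bind every abstract parameter to a value and render the scenario text."""
    missing = [p.name for p in LANE_CHANGE_PARAMETERS if p.name not in values]
    if missing:
        raise ValueError(f"unbound abstract parameters: {missing}")
    # The target lane p_1 - 1 is derived from p_1, as in the informal description.
    return TEMPLATE.format(target_lane=values["p_1"] - 1, **values)


print(concretize({"p_1": 3, "p_2": 1, "p_3": 50, "p_4": 2}))
```

A scenario with unbound parameters is rejected, which mirrors the requirement that a concrete scenario leaves no parameter abstract.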
Logical Scenario
The logical scenario uses the abstract parameters to specify rules for scenario variation; it likewise follows a formal syntax and is semantically meaningful. Maqbool et al. [12] introduced an XML-based test specification to define logical scenarios via a dedicated meta-model that allows hierarchical modeling of parameter ranges, probability distributions, and inter-parameter mathematical and logical constraints. The example in Scenario 1 illustrates this approach. A generic speed distribution element is defined for vehicle speeds in urban settings. The two abstract parameters, speed_vehicle_1 and speed_vehicle_2, inherit the attributes of this element, and speed_vehicle_2 overrides the distribution. A mathematical constraint is additionally specified between the abstract parameters: regardless of the values chosen for the abstract parameters, the constraint must hold.
![figure a](http://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-031-10071-0_11/MediaObjects/521998_1_En_11_Figa_HTML.png)
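The structure of Scenario 1 (a generic element, two parameters that inherit from it, one override and one inter-parameter constraint) can be mirrored in plain Python as a rough sketch. The class names, the numeric range, the unit and the concrete constraint below are assumptions made for illustration; they do not reproduce the XML meta-model of [12].

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ParameterSpec:
    """Range and distribution of one abstract parameter."""
    name: str
    lower: float
    upper: float
    distribution: str = "uniform"

    def derive(self, name: str, **overrides) -> "ParameterSpec":
        """Inherit all attributes, overriding only those given."""
        return replace(self, name=name, **overrides)

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


@dataclass
class LogicalScenario:
    parameters: Dict[str, ParameterSpec]
    constraints: List[Callable[[Dict[str, float]], bool]]

    def is_valid(self, sample: Dict[str, float]) -> bool:
        """A sample is valid if it is in range and satisfies every constraint."""
        in_range = all(
            spec.contains(sample[name]) for name, spec in self.parameters.items()
        )
        return in_range and all(check(sample) for check in self.constraints)


# Assumed values: a generic urban speed element with a range of 0 to 50 km/h.
urban_speed = ParameterSpec("urban_speed", lower=0.0, upper=50.0)
speed_vehicle_1 = urban_speed.derive("speed_vehicle_1")
speed_vehicle_2 = urban_speed.derive("speed_vehicle_2", distribution="gaussian")

scenario = LogicalScenario(
    parameters={p.name: p for p in (speed_vehicle_1, speed_vehicle_2)},
    # Assumed constraint: vehicle 2 is never faster than vehicle 1.
    constraints=[lambda s: s["speed_vehicle_2"] <= s["speed_vehicle_1"]],
)
```

Keeping the constraints separate from the per-parameter specifications reflects the decoupling described above: ranges and distributions can be changed without touching the relations that must always hold.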
Logical Scenario Design
The logical scenario discussed above is well-equipped to generate possible, impossible, probable and improbable scenarios. As Fig. 1 illustrates, both domain expertise and historical data may be taken as sources for the design of a logical scenario. Logical scenario design by domain expertise is exemplified in [17], where sets of possible parameter values are derived by listing and clustering the pre-conceived situations the system may encounter. An example of design by historical data can be seen in [16], where a driving study from BMW is used to estimate probable driver inputs for a car within a sharp curve. The logical scenario methodology can fully support both approaches via the specification of parameter distributions and constraints, while using a platform-independent and formal syntax to do so. The third method for logical scenario design illustrated in Fig. 1, feedback from scenario evaluation, is discussed in Sect. 3.4.
3.2 Scenario Variation
The scenario variation stage uses the logical scenarios to generate concrete scenarios. Concrete scenarios have concrete values for the previously abstract parameters, distributed according to the logical scenario specification. This stage uses sampling techniques to generate samples distributed as closely as possible to the specified parameter space. This contribution uses the variants of Markov-Chain-Monte-Carlo proposed by Maqbool et al. [12] to generate the samples.
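To make the sampling step more tangible, the following sketch shows a textbook random-walk Metropolis sampler that treats constraint violations as zero-density regions. It is not the MCMC variant of [12]; the target density, the parameter names `v1` and `v2`, the ranges and the step size are assumptions chosen for illustration, and burn-in and thinning are omitted for brevity.

```python
import math
import random
from typing import Callable, Dict, List

Sample = Dict[str, float]


def metropolis(
    density: Callable[[Sample], float],
    is_valid: Callable[[Sample], bool],
    start: Sample,
    n_samples: int,
    step: float = 1.0,
    seed: int = 0,
) -> List[Sample]:
    """Random-walk Metropolis; proposals violating a constraint get zero density."""
    if not is_valid(start) or density(start) <= 0.0:
        raise ValueError("start must be a valid sample with positive density")
    rng = random.Random(seed)
    current, current_density = dict(start), density(start)
    samples: List[Sample] = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal around the current sample.
        proposal = {k: v + rng.gauss(0.0, step) for k, v in current.items()}
        proposal_density = density(proposal) if is_valid(proposal) else 0.0
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() * current_density < proposal_density:
            current, current_density = proposal, proposal_density
        samples.append(dict(current))
    return samples


def density(s: Sample) -> float:
    # Assumed unnormalised target: independent Gaussians around 30 and 20.
    return math.exp(-((s["v1"] - 30.0) ** 2 + (s["v2"] - 20.0) ** 2) / (2 * 5.0 ** 2))


def is_valid(s: Sample) -> bool:
    # Assumed range and inter-parameter constraint: 0 <= v2 <= v1 <= 50.
    return 0.0 <= s["v2"] <= s["v1"] <= 50.0


concrete = metropolis(density, is_valid, {"v1": 30.0, "v2": 20.0}, n_samples=1000, step=3.0)
```

Because invalid proposals are never accepted, every entry of `concrete` satisfies the constraints, while the accepted values follow the specified density within the valid region.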
The scenario-based approach with decoupled abstract, logical and concrete scenarios helps to impart understandability to ML data, as concrete scenarios provide a unique, formal and human-readable basis behind each data-set. Secondly, the distributions and constraints offer control over data accuracy: they can control the similarity between the concrete scenarios and the desired realistic distribution. Accuracy of the data is further ensured by the digital twin approach for simulating the concrete scenarios, discussed in the next section.
3.3 Scenario Evaluation
The scenario evaluation stage brings the concrete scenarios to life using simulation techniques. The authors propose the use of experimentable digital twins (EDT) to match the flexible and multi-domain nature of the scenarios. The EDT of a system is the digital twin implemented as a simulation model in a virtual testbed that offers diverse simulation functionalities. EDTs collect various aspects of the system and can be easily reconfigured for different application contexts throughout the training and validation process. Additionally, EDTs offer scalability in the level of detail (e.g. simulation realism, sensor resolution) and computing resources [15]. Figure 2 illustrates the EDT for a rover on an extra-terrestrial terrain modeled in the multi-domain simulation software VEROSIM. The figure illustrates how the EDT-based simulation allows the fusion of environment generation, multi-body dynamics and various perception sensors.
The flexible and modular nature of EDTs makes them an ideal fit for the parameterized and iterative scenario evaluation methodology in Fig. 1. For instance, scenario design iterations can be performed on simplistic models without expensive sensor rendering, and the models can be seamlessly upgraded as required in subsequent iterations. The EDT-based scenario evaluation stage generates the ground truth data and the replay data. The ground truth data is annotated and labeled by the simulation and serves as ML input data, whereas the replay data contains the simulation events and results that may be used for the post-analysis of a particular simulation run. EDTs thus impart further control over the accuracy and coherency of the synthetic data-sets through flexibility in the realism, scope and configuration of simulation entities.
3.4 Scenario Redesign
Design-by-domain-expertise and design-by-historical-data are not always feasible in practical scenario design. Simulation results provide valuable insight into the effect of parameters and open the way towards iterative scenario design. Parameterization via logical scenarios makes every scenario viable for iterative redesign, and this iterative process can be carried out by a domain expert or an optimization algorithm. Consider the example (illustrated in detail in Sect. 4) of an automotive simulation, where the ML-designer requires data-sets from both accidental and non-accidental situations, but the desired scenario distribution for such data-sets is unknown. Random simulation within the complete parameter space can offer insight and allow the scenario designer to set the desired parameter bounds. Various optimization- and heuristic-based algorithms can be used for this purpose [1, 4].
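The simplest form of this feedback loop, random simulation over the complete parameter space followed by reading off parameter bounds per outcome, may be sketched as follows. The function `toy_simulation` is a hypothetical stand-in for an EDT run, and its collision rule, the parameter ranges and all helper names are assumptions made purely for illustration.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def explore(
    simulate: Callable[[Sample], str],
    bounds: Bounds,
    n_runs: int,
    seed: int = 0,
) -> List[Tuple[Sample, str]]:
    """Sample the complete parameter space uniformly and label every run."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        sample = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        runs.append((sample, simulate(sample)))
    return runs


def bounds_for(runs: List[Tuple[Sample, str]], outcome: str) -> Bounds:
    """Tightest axis-aligned box around all runs with the given outcome."""
    hits = [sample for sample, label in runs if label == outcome]
    if not hits:
        raise ValueError(f"no run produced outcome {outcome!r}")
    return {
        name: (min(s[name] for s in hits), max(s[name] for s in hits))
        for name in hits[0]
    }


def toy_simulation(sample: Sample) -> str:
    # Hypothetical rule, not a physical model: fast cars that evade late collide.
    return "collision" if sample["v"] * sample["p_arc"] > 20.0 else "no_collision"


# Assumed full parameter space: velocity v and a normalised point of curvature p_arc.
full_space = {"v": (10.0, 50.0), "p_arc": (0.0, 1.0)}
runs = explore(toy_simulation, full_space, n_runs=500)
collision_bounds = bounds_for(runs, "collision")
```

A domain expert could inspect `collision_bounds` directly, whereas an optimization algorithm would replace the uniform exploration with a guided search.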
3.5 ML Training and Validation
Once the concrete scenarios in the scenario design phase have acquired the desired characteristics, the EDT-based simulations can be used to generate ground truth or input data for training and validating ML-based systems. The scenario variation methodology suggests another feedback loop after training or validating the ML-system. This loop can be utilized, e.g., to iteratively find critical scenarios for the ML-model using the same heuristics as in Sect. 3.4.
4 Application Examples
Two examples from the space and automotive domains are presented to illustrate the multi-domain capability of the scenario variation methodology. In both examples, the OpenSCENARIO standard is adapted to describe the abstract scenario, whereas the logical scenario is specified via the test specification in [12]. VEROSIM is used as the EDT-based simulation software.
Rendezvous and Docking Scenario
The rendezvous and docking (RvD) maneuver, illustrated in Fig. 3a, requires a chaser shuttle to scan a target satellite via LiDAR and determine the relative pose. The ML-based pose-estimation algorithm is to be trained via synthetic LiDAR scans with ground-truth information. The specifications of the chaser's LiDAR scanner and of the ML-model impose two requirements. Firstly, all measurements must be taken such that the chaser is within a flight corridor. The flight corridor is specified via a cone with its apex on the satellite, length \(l_C\) and radius \(r_C\), see Fig. 3b. Secondly, the closer the chaser is to the satellite, the higher the likelihood that it is at the center of the corridor. The simulated datasets should reflect this distribution.
![figure b](http://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-031-10071-0_11/MediaObjects/521998_1_En_11_Figb_HTML.png)
The abstract scenario is modeled via the teleport action of OpenSCENARIO: both chaser and satellite are teleported to initial positions, and the initial coordinates of the chaser are defined as abstract parameters. The logical scenario is specified as in Scenario 2. Lines 3–4 enforce the first requirement via inter-parameter constraints. As Fig. 3 illustrates, given the unit vector \(\vec {n}\) along the corridor axis and the vector \(\vec {d}\) from the target to the chaser, the dot product of the two vectors, i.e. the distance between the target and the chaser along the corridor axis, must be less than the corridor length \(l_C\). In addition, the angle between \(\vec {n}\) and \(\vec {d}\) must be less than the corridor angle \(\theta \), so that the chaser is always within the corridor bounds. Lines 5–6 implement the second requirement. The “RvD_Dist” distribution is implemented in the scenario variation engine by extending the meta-model in [12], and is simply referred to in the logical scenario. The implementation uses a Gaussian distribution dependent on the distance between the chaser and target. The resulting concrete scenarios, each with a unique chaser position, are illustrated in Fig. 3c.
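The two corridor constraints and a distance-dependent Gaussian can be written down compactly. The sketch below is an illustration only and does not reproduce “RvD_Dist”: it assumes that the corridor angle is the half-angle of the cone, \(\theta = \arctan (r_C / l_C)\), that the sampler's corridor axis is the +x axis, and that the lateral standard deviation grows linearly with the axial distance (the factor `spread` is hypothetical).

```python
import math
import random
from typing import Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_corridor(d: Vec, n: Vec, l_c: float, r_c: float) -> bool:
    """True if the offset d (target to chaser) lies inside the corridor cone.

    n must be a unit vector along the corridor axis.
    """
    along = dot(n, d)                 # distance along the corridor axis
    length = math.sqrt(dot(d, d))
    if length == 0.0 or along <= 0.0:
        return False                  # at the apex or behind the target
    theta = math.atan2(r_c, l_c)      # assumed half-angle of the cone
    angle = math.acos(min(1.0, along / length))
    return along < l_c and angle < theta


def sample_position(l_c: float, r_c: float, rng: random.Random,
                    spread: float = 0.5) -> Vec:
    """Rejection-sample a chaser position for a corridor along the +x axis."""
    while True:
        x = rng.uniform(0.0, l_c)
        # Assumed rule: lateral spread shrinks as the chaser approaches the target.
        sigma = spread * r_c * x / l_c
        d = (x, rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        if in_corridor(d, (1.0, 0.0, 0.0), l_c, r_c):
            return d


rng = random.Random(0)
positions = [sample_position(l_c=10.0, r_c=2.0, rng=rng) for _ in range(200)]
```

Rejection sampling keeps the constraint check and the distribution separate, which matches the way the logical scenario lists constraints and distributions as independent elements.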
Automotive Collision Avoidance
In the second use-case, a collision avoidance ML-model must steer a vehicle through an evasive maneuver to avoid an incoming truck. The ML-model needs sufficient samples of both collision and non-collision scenarios for training; otherwise it runs the risk of over-fitting to a particular case. To find the target logical scenario, an initial logical scenario is set up with the velocity of the car v and the point of curvature \(p_{arc}\) as abstract parameters. The point of curvature is the evasion-inducing point within the spline trajectory of the vehicle. The first iteration of the logical scenario, illustrated in Scenario 3, assigns suitable ranges to the two abstract parameters with uniform probability distribution functions (PDF). The resulting concrete scenarios and their EDT simulations are illustrated in Fig. 4a and b respectively. The share of no-collision scenarios is much lower than that of collision scenarios, which may cause the ML-model to over-fit to the over-represented class. Based on these results, the next iteration of logical scenario design can impose either new parameter ranges or an appropriate PDF to ensure sufficient numbers of both collision and non-collision scenarios. A bi-modal Gaussian PDF with its two means located in the regions of highest collision and no-collision density is chosen. The Gaussian PDF gives the scenario designer more flexibility by providing a finer balance between the area of the sampling region and the frequency of outlier sampling. The logical scenario is formulated in the second iteration of Scenario 3. The resulting concrete scenarios and simulations are illustrated in Fig. 4c and d, and show an equal distribution of accident and no-accident scenarios. With the desired logical distribution now found, the number of concrete scenarios can be increased further and the simulation can be made more complex by adding realistic sensor EDTs.
![figure c](http://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-031-10071-0_11/MediaObjects/521998_1_En_11_Figc_HTML.png)
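The effect of replacing the uniform PDF with a bi-modal Gaussian PDF can be reproduced with a small mixture sampler. The following is a minimal sketch under assumed values: the two means, the standard deviations, the weights and the velocity range are placeholders and are not the values used in Scenario 3.

```python
import random
from typing import List, Tuple


def sample_bimodal(
    modes: Tuple[Tuple[float, float], Tuple[float, float]],
    weights: Tuple[float, float],
    bounds: Tuple[float, float],
    n: int,
    seed: int = 0,
) -> List[float]:
    """Draw n values from a two-component Gaussian mixture, truncated to bounds."""
    rng = random.Random(seed)
    lower, upper = bounds
    values: List[float] = []
    while len(values) < n:
        # Pick one of the two (mean, sigma) modes, then sample from it.
        mean, sigma = rng.choices(modes, weights=weights, k=1)[0]
        value = rng.gauss(mean, sigma)
        if lower <= value <= upper:  # reject samples outside the allowed range
            values.append(value)
    return values


# Assumed modes: one centred in a no-collision region, one in a collision region.
velocities = sample_bimodal(
    modes=((15.0, 3.0), (40.0, 3.0)),
    weights=(0.5, 0.5),
    bounds=(10.0, 50.0),
    n=1000,
)
```

The standard deviations control the trade-off mentioned above: larger values widen the sampled region around each mode, while smaller values reduce how often outliers far from the two means are drawn.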
5 Conclusions
This contribution introduced a methodology for synthetic data generation based on formal scenarios, semantic parameter variation and experimentable digital twins (EDT). The methodology provides transparency and formality to the data generation process, and delivers control over data quality via the scenario distribution and the EDT configuration. A human-readable concrete scenario behind each synthetic data-set imparts a higher degree of understanding about the data. The proposed logical scenarios allow a formal specification of the scenario distribution. They can support domain expertise, historical data, as well as iterative methods to derive the scenario distribution. EDTs concurrently provide a platform to simulate the scenarios throughout the scenario design process, offering high flexibility in simulation perspective, complexity and scale. Future work will investigate the iterative design of logical scenarios using metrics from trained machine learning models, and explore techniques to derive exploratory, exploitative and adversarial logical scenarios.
References
1. Abdessalem, R.B., Nejati, S., Briand, L.C., Stifter, T.: Testing vision-based control systems using learnable evolutionary algorithms. In: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 1016–1026. IEEE (2018)
2. ASAM: ASAM OpenSCENARIO. https://www.asam.net/standards/detail/openscenario/ (2021). Accessed 10 Aug 2021
3. Audi, A.G., Volkswagen, A.G.: Description of the Pegasus-method
4. Ben Abdessalem, R., Nejati, S., Briand, L.C., Stifter, T.: Testing advanced driver assistance systems using multi-objective search and neural networks. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 63–74 (2016)
5. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.: Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35(1), 53–65 (2018)
6. Dahmen, U., Osterloh, T., Roßmann, J.: Generation of virtual test scenarios for training and validation of AI-based systems (in press). In: IEEE International Conference on Progress in Informatics and Computing. IEEE (2021)
7. Dreossi, T., Fremont, D.J., Ghosh, S., Kim, E., Ravanbakhsh, H., Vazquez-Chanlatte, M., Seshia, S.A.: VERIFAI: a toolkit for the design and analysis of artificial intelligence-based systems. arXiv:1902.04245 (2019)
8. Felderer, M., Russo, B., Auer, F.: On testing data-intensive software systems. In: Security and Quality in Cyber-Physical Systems Engineering, pp. 129–148. Springer (2019)
9. Fremont, D.J., Dreossi, T., Ghosh, S., Yue, S., Sangiovanni-Vincentelli, A.L., Seshia, S.A.: Scenic: a language for scenario specification and scene generation. In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 63–78 (2019)
10. Fremont, D.J., Kim, E., Pant, Y.V., Seshia, S.A., Acharya, A., Bruso, X., Wells, P., Lemke, S., Lu, Q., Mehta, S.: Formal scenario-based testing of autonomous vehicles: from simulation to the real world. In: 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8 (2020)
11. Jiang, C., Qi, S., Zhu, Y., Huang, S., Lin, J., Lap-Fai, Y., Terzopoulos, D., Zhu, S.-C.: Configurable 3d scene synthesis and 2d image rendering with per-pixel ground truth using stochastic grammars. Int. J. Comput. Vis. 126(9), 920–941 (2018)
12. Maqbool, O., Roßmann, J.: Formal scenario-driven logical spaces for randomized synthetic data generation (in press). In: MODELSWARD (2022)
13. Pei, K., Cao, Y., Yang, J., Jana, S.: DeepXplore: automated whitebox testing of deep learning systems. In: Proceedings of the 26th Symposium on Operating Systems Principles, pp. 1–18 (2017)
14. Schluse, M., Priggemeyer, M., Atorf, L., Rossmann, J.: Experimentable digital twins-streamlining simulation-based systems engineering for industry 4.0. IEEE Trans. Ind. Inf. 14(4), 1722–1731 (2018)
15. Thieling, J., Roßmann, J.: Scalable sensor models and simulation methods for seamless transitions within system development: from first digital prototype to final real system. IEEE Syst. J. (2020)
16. Wagner, S., Groh, K., Kuhbeck, T., Dorfel, M., Knoll, A.: Using time-to-react based on naturalistic traffic object behavior for scenario-based risk assessment of automated driving. In: 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1521–1528. IEEE (2018)
17. Weber, H., Bock, J., Klimke, J., Roesener, C., Hiller, J., Krajewski, R., Zlocki, A., Eckstein, L.: A framework for definition of logical scenarios for safety assurance of automated driving. Traffic Inj. Prev. 20(sup1), S65–S70 (2019)
Acknowledgements
This work is part of the project “KImaDiZ”, supported by the German Aerospace Center (DLR) with funds of the German Federal Ministry of Economics and Technology (BMWi), support code 50 RA 1934.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Maqbool, O., Roßmann, J. (2023). Scenario-Driven Data Generation with Experimentable Digital Twins. In: Schüppstuhl, T., Tracht, K., Fleischer, J. (eds) Annals of Scientific Society for Assembly, Handling and Industrial Robotics 2022. MHI 2022. Springer, Cham. https://doi.org/10.1007/978-3-031-10071-0_11
DOI: https://doi.org/10.1007/978-3-031-10071-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10070-3
Online ISBN: 978-3-031-10071-0
eBook Packages: Intelligent Technologies and Robotics (R0)