1 Introduction

The expanding use of machine learning (ML) in safety-critical applications has led to an urgent need for rigorous design methodologies that can ensure the reliability of systems with ML components  [15, 17]. Such a methodology would need to provide tools for modeling the system, its requirements, and its environment; for analyzing a design to find failure cases; for debugging such cases; and finally for synthesizing improved designs.

The VerifAI toolkit  [1] provides a unified framework for all of these design tasks, based on a simple paradigm: simulation driven by formal models and specifications. The top-level architecture of VerifAI is shown in Fig. 1. We first define an abstract feature space describing the environments and system configurations of interest, either by explicitly defining parameter ranges or using the Scenic probabilistic environment modeling language  [6]. VerifAI then generates concrete tests by searching this space, using a variety of algorithms ranging from random sampling to global optimization techniques. Finally, we simulate the system for each test, monitoring the satisfaction or violation of a system-level specification; the results of each test are used to guide further search, and any violations are recorded in a table for automated analysis (e.g. clustering) or visualization. This architecture enables a wide range of use cases, including falsification, fuzz testing, debugging, data augmentation, and parameter synthesis; Dreossi et al.  [1] demonstrated all of these applications individually through several small case studies.
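To make this paradigm concrete, the following minimal sketch shows the shape of the analysis loop. It uses hypothetical sampler, simulate, and monitor helpers rather than VerifAI's actual API; it is an illustration of the architecture in Fig. 1, not toolkit code.

```python
# Minimal sketch of the simulation-driven analysis loop (hypothetical helpers,
# not VerifAI's API): sample a point from the abstract feature space, simulate
# it, monitor the specification, and feed the result back to the sampler.
def analysis_loop(sampler, simulate, monitor, num_tests):
    failures = []
    for _ in range(num_tests):
        point = sampler.next()          # concrete test from the feature space
        trace = simulate(point)         # closed-loop simulation of the system
        rho = monitor(trace)            # robustness of the system-level spec
        sampler.update(point, rho)      # feedback guides subsequent tests
        if rho < 0:                     # violation: record for later analysis
            failures.append((point, rho))
    return failures
```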

Fig. 1. Architecture of VerifAI.

In this paper, we provide an integrated case study, applying VerifAI to a complete design flow for a large, realistic system from industry: TaxiNet, an experimental autonomous aircraft taxiing system developed by Boeing for the DARPA Assured Autonomy project. This system uses a neural network to estimate the aircraft’s position from a camera image; a controller then steers the plane to track the centerline of the runway. The main requirement for TaxiNet, provided by Boeing, is that it keep the plane within 1.5 m of the centerline; we formalized this as a specification in Metric Temporal Logic (MTL)  [11]. Verifying this specification is difficult, as the neural network must be able to handle the wide range of images resulting from different lighting conditions, changes in runway geometry, and other disturbances such as tire marks on the runway.

Our case study illustrates a complete iteration of the design flow for TaxiNet, analyzing and debugging an existing version of the system to inform an improved design. Specifically, we demonstrate:

  1. Modeling the environment of the aircraft using the Scenic language.

  2. Falsifying an initial version of TaxiNet, finding environment conditions under which the aircraft significantly deviates from the centerline.

  3. Analyzing counterexamples to identify distinct failure cases and diagnose potential root causes.

  4. Testing the system in a targeted way to confirm these root causes.

  5. Designing a new version of the system by retraining the neural network based on the results of falsification and debugging.

  6. Validating that the new system eliminates some of the failure cases in the original system and has higher overall performance.

Following the procedure above, we were able to find several scenarios where TaxiNet exhibited unsafe behavior. For example, we found the system could not properly handle intersections between runways. More interestingly, we found that TaxiNet could get confused when the shadow of the plane was visible, which happened only at certain times of day and under certain weather conditions. We stress that these types of failure cases are meaningful counterexamples that could easily arise in the real world, unlike pixel-level adversarial examples  [8]; we are able to find such cases because VerifAI searches through a space of semantic parameters  [3]. Furthermore, these counterexamples are system-level, demonstrating undesired behavior of the complete system rather than simply its ML component. Finally, our work differs from other work on validation of cyber-physical systems with ML components (e.g.  [19]) in that we address a broader range of design tasks (including debugging and retraining as well as testing) and also allow designers to guide search by encoding domain knowledge using Scenic.

For our case study, we extend VerifAI in two ways. First, we interface the toolkit to the X-Plane flight simulator  [12] in order to run closed-loop simulations of the entire system, with X-Plane rendering the camera images and simulating the aircraft dynamics. More importantly, we extend the Scenic language to allow it to be used in combination with VerifAI’s active sampling techniques. Previously, as in any probabilistic programming language, a Scenic program defined a fixed distribution  [6]; while adequate for modeling particular scenarios, this is incompatible with active sampling, where we change how tests are generated over time in response to feedback from earlier tests. To reconcile these two approaches, we extend Scenic with parameters that are assigned by an external sampler. This allows us to continue to use Scenic’s convenient syntax for modeling, while now being able to use not only random sampling but optimization or other algorithms to search the parameter space.

Adding parameters to Scenic enables important new applications. For example, in the design flow we described above, after finding through testing some rare event which causes a failure, we need to generate a dataset of such failures in order to retrain the ML component. Naïvely, we would have to manually write a new Scenic program whose distribution was concentrated on these rare events (as was done in  [6]). With parameters, we can simply take the generic Scenic program we used for the initial testing, and use VerifAI’s cross-entropy sampler  [1, 14] to automatically converge to such a distribution  [16]. Alternatively, if we have an intuition about where a failure case may lie, we can use Scenic to encode this domain knowledge as a prior for cross-entropy sampling, helping the latter to find failures more quickly.

In summary, the novel contributions of this paper are:

  • The first demonstration on an industrial case study of an integrated toolchain for falsification, debugging, and retraining of ML-based autonomous systems.

  • An interface between VerifAI and the X-Plane flight simulator.

  • An extension of the Scenic language with parameters, and a demonstration using it in conjunction with cross-entropy sampling to learn a Scenic program encoding the distribution of failure cases.

We begin in Sect. 2 with a discussion of our extension of Scenic with parameters and our X-Plane interface. Section 3 presents the experimental setup and results of our case study, and we close in Sect. 4 with some conclusions and directions for future work.

2 Extensions of VerifAI

Scenic with Parameters. To enable search algorithms other than random sampling to be used with Scenic, we extend the language with external parameters assigned by an external sampler. A Scenic program can specify an external sampler to use; this sampler defines the allowed types of parameters, which can then be used in the program in place of any distribution. The default external sampler provides access to the VerifAI samplers and defines parameter types corresponding to VerifAI’s continuous and discrete ranges. Thus, for example, one could write a Scenic program which picks the colors of two cars randomly according to some realistic distribution, but chooses the distance between them using VerifAI’s Bayesian Optimization sampler.
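As a hedged illustration of this last example (not taken from the paper), the sketch below leaves the gap between two cars to the external sampler while keeping an ordinary Scenic distribution for the colors. The Car class and color property are assumed to come from a world-model library, and the VerifaiRange type and the 'bo' sampler name follow recent VerifAI releases, so the exact identifiers may differ from the version described here.

```scenic
# Hedged sketch: colors use a normal Scenic distribution, while the gap is an
# external parameter filled in by VerifAI's Bayesian Optimization sampler.
param verifaiSamplerType = 'bo'          # select the Bayesian Optimization sampler
gap = VerifaiRange(5, 15)                # distance (m) chosen by the external sampler
leadCar = Car with color Uniform('white', 'silver', 'black', 'red')
ego = Car behind leadCar by gap, with color Uniform('white', 'silver', 'black', 'red')
```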

The semantics of external parameters is simple: when sampling from a Scenic program, the external sampler is first queried to provide values for all the parameters; the program is then equivalent to one without parameters, and can be sampled as usual.

X-Plane Interface. Our interface between X-Plane and VerifAI uses the latter’s client-server architecture for communicating with simulators. The server runs inside VerifAI, taking each generated feature vector and sending it to the client. The client runs inside X-Plane and calls its APIs to set up and execute the test, reporting back information needed to monitor the specifications. For our client, we used X-Plane Connect  [18], an X-Plane plugin providing access to X-Plane’s “datarefs”. These are named values which represent simulator state, e.g., positions of aircraft and weather conditions. Our interface exposes all datarefs to Scenic, allowing arbitrary distributions to be placed on them. We also set up the Scenic coordinate system to be aligned with the runway, performing the appropriate conversions to set the raw position datarefs.
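A hedged sketch of the client side of this interface is shown below, using the X-Plane Connect Python client. The dataref names are standard X-Plane ones but should be checked against the simulator version in use, and the conversion to the runway-aligned frame is elided.

```python
import xpc  # NASA's X-Plane Connect Python client

# Hedged sketch: set weather and time datarefs for one test, then read back the
# aircraft's raw position (to be converted to the runway-aligned frame).
client = xpc.XPlaneConnect()
try:
    client.sendDREF("sim/time/zulu_time_sec", 14 * 3600)   # time of day (UTC seconds)
    client.sendDREF("sim/weather/cloud_type[0]", 3)        # cloud type of the lowest layer
    client.sendDREF("sim/weather/rain_percent", 0.25)      # 25% rain
    x = client.getDREF("sim/flightmodel/position/local_x")[0]
    z = client.getDREF("sim/flightmodel/position/local_z")[0]
    # ... convert (x, z) to cross-track error in the runway frame ...
finally:
    client.close()
```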

3 TaxiNet Case Study

3.1 Experimental Setup

TaxiNet’s neural network estimates the aircraft’s position from a camera image; the camera is mounted on the right wing and faces forward. Example images are shown in Fig. 2. From such an image, the network estimates the cross-track error (CTE), the left-right offset of the plane from the centerline, and the heading error (HE), the angular offset of the plane from directly down the centerline. These estimates are fed into a handwritten controller which outputs (the equivalent of) a steering angle for the plane.
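The controller itself is handwritten by Boeing and not public; the following minimal sketch only illustrates how CTE and HE estimates can be turned into a steering command (gains, sign conventions, and limits are purely illustrative).

```python
# Illustrative proportional controller (not Boeing's): steer back toward the
# centerline based on estimated cross-track error (m) and heading error (deg).
def steering_command(cte_est, he_est, k_cte=0.015, k_he=0.008, limit=0.3):
    raw = -k_cte * cte_est - k_he * he_est   # sign conventions are illustrative
    return max(-limit, min(limit, raw))      # clamp to the allowed steering range
```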

Fig. 2. Example input images to TaxiNet, rendered in X-Plane. Left/right = clear/cloudy weather. Top/bottom = 12 pm/4 pm.

The Boeing team provided the Berkeley team with an initial version of TaxiNet without describing which images were used to train it. In this way, the Berkeley team was not aware in advance of potential gaps in the training set and the corresponding potential failure cases. For the retraining experiments, we used training and validation sets of the same sizes as for the original model, as well as identical training hyperparameters.

The semantic feature space defined by our Scenic programs and searched by VerifAI was 6-dimensional, made up of the following parameters:

  • the initial position and orientation of the aircraft (in 2D, on the runway);

  • the type of clouds, out of 6 discrete options ranging from clear to stormy;

  • the amount of rain, as a percentage, and

  • the time of day.

Given values for these parameters from VerifAI, the test protocol we used in all of our experiments was identical: we set up the initial condition described by the parameters, then simulated TaxiNet controlling the plane for 30 s.

The main requirement for TaxiNet provided by Boeing was that it should always track the centerline of the runway to within 1.5 m. For many of our experiments we created a greater variety of test scenarios by allowing the plane to start up to 8 m off of the centerline: in such cases we required that the plane come within 1.5 m of the centerline within 10 s and then stay there for the remainder of the simulation. We formalized these two specifications as MTL formulas \(\varphi _{\text {always}}\) and \(\varphi _{\text {eventually}}\) respectively:

$$ \varphi _{\text {always}}= \square (\text {CTE} \le 1.5) \qquad \varphi _{\text {eventually}}= \lozenge _{[0, 10]} \square (\text {CTE} \le 1.5) $$

While both of these specifications are true/false properties, VerifAI uses a continuous quantity \(\rho \) called the robustness of an MTL formula  [4]. Its crucial property is that \(\rho \ge 0\) when the formula is satisfied, while \(\rho \le 0\) when the formula is violated, so that \(\rho \) provides a metric of how close the system is to violating the property. The exact definition of \(\rho \) is not important here, but as an illustration, for \(\varphi _{\text {always}}\) it is (the negation of) the greatest deviation beyond the allowed 1.5 m achieved over the whole simulation.
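As a concrete illustration, the robustness values for these two formulas can be computed from a trace of CTE samples as follows; this is a standard MTL robustness computation over the CTE predicate, not VerifAI's monitor implementation.

```python
# Robustness of the two specifications over a trace of (time, CTE) samples.
def rho_always(cte_trace, bound=1.5):
    """rho of  G(CTE <= 1.5): the worst-case margin over the whole trace."""
    return min(bound - cte for _, cte in cte_trace)

def rho_eventually_always(cte_trace, bound=1.5, window=10.0):
    """rho of  F_[0,10] G(CTE <= 1.5): best, over start times t <= 10 s, of the
    worst-case margin over the remainder of the trace."""
    best_suffix, suffix_margins = float("inf"), []
    for t, cte in reversed(cte_trace):              # running min over suffixes
        best_suffix = min(best_suffix, bound - cte)
        suffix_margins.append((t, best_suffix))
    return max(r for t, r in suffix_margins if t <= window)
```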

For additional experimental results, see the Appendix of the full version  [5].

3.2 Falsification

In our first experiment, we searched for conditions in the nominal operating regime of TaxiNet which cause it to violate \(\varphi _{\text {eventually}}\). To do this, we wrote a Scenic program \(\mathcal {S}_{\text {falsif}}\) modeling that regime, shown in Fig. 3. We first place a uniform distribution on time of day between 6 am and 6 pm local time (approximate daylight hours). Next, we determine the weather. Since only some of the cloud types are compatible with rain, we put a joint distribution on them: with probability 2/3, there is no rain, and any cloud type is equally likely; otherwise, there is a uniform amount of rain between \(25\%\) and \(100\%\), and we allow only cloud types consistent with rain. Finally, we position the plane uniformly up to 8 m left or right of the centerline, up to 2000 m down the runway, and up to \(30^\circ \) off of the centerline. These ranges ensured that (1) the plane began on the runway and stayed on it for the entire simulation when tracking succeeded, and (2) it was always possible to reach the centerline within 10 s and so satisfy \(\varphi _{\text {eventually}}\).

Fig. 3. Generic Scenic program \(\mathcal {S}_{\text {falsif}}\) used for falsification and retraining.
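To give a feel for the kind of Scenic code involved, the following hedged sketch mirrors the distributions described above; it is illustrative only, and in particular it elides the joint cloud/rain distribution and uses a generic Object in place of whatever aircraft class the actual program in Fig. 3 uses.

```scenic
# Hedged sketch in the spirit of the falsification scenario (not Fig. 3 itself).
param time_of_day = Range(6 * 60, 18 * 60)     # minutes after midnight: 6 am to 6 pm
param cloud_type  = Uniform(0, 1, 2, 3, 4, 5)  # 6 cloud settings, clear to stormy
param rain_percent = 0                         # the real program couples rain with clouds:
                                               # no rain w.p. 2/3, else Range(25, 100) with
                                               # rain-compatible cloud types only

# lateral offset from the centerline (m), distance down the runway (m), heading offset
ego = Object at Range(-8, 8) @ Range(0, 2000), facing Range(-30, 30) deg
```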

However, it was quite easy to find falsifying initial conditions within this scenario. We simulated over 4,000 runs randomly sampled from \(\mathcal {S}_{\text {falsif}}\), and found many counterexamples: in only 55% of the runs did TaxiNet satisfy \(\varphi _{\text {eventually}}\), and in 9.1% of runs, the plane left the runway entirely. This showed that TaxiNet’s behavior was problematic, but did not explain why. To answer that question, we analyzed the data VerifAI collected during falsification, as we explain next.

3.3 Error Analysis and Debugging

VerifAI builds a table which stores for each run the point sampled from the abstract feature space and the resulting robustness value \(\rho \) (see Sect. 3.1) for the specification. The table is compatible with the pandas data science library  [13], making visualization easy. While VerifAI contains algorithms for automatic analysis of the table (e.g., clustering and Principal Component Analysis), we do not use them here since the parameter space was low-dimensional enough to identify failure cases by direct visualization.
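For instance, a plot like Fig. 4 can be produced directly from the table with a few lines of pandas. In the sketch below, runs stands for the recorded per-run entries, and the column names ('time_of_day' in hours, 'rho') are illustrative stand-ins for the sampled features and robustness value stored by VerifAI.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hedged sketch: scatter rho against time of day, overlay the 30-minute median curve.
df = pd.DataFrame(runs)                               # one row per simulation
ax = df.plot.scatter(x="time_of_day", y="rho", color="orange", alpha=0.4)

df["bin"] = (df["time_of_day"] // 0.5) * 0.5          # 30-minute bins (time in hours)
df.groupby("bin")["rho"].median().plot(ax=ax, color="red")

ax.axhline(0.0, linestyle="--")                       # rho = 0 separates pass from fail
plt.show()
```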

We began by plotting TaxiNet’s performance as a function of each of the parameters in our falsification scenario. Several parameters had a large impact on performance:

  • Time of day: Figure 4 plots \(\rho \) vs. time of day, each orange dot representing a run during falsification; the red line is their median, using 30-min bins (ignore the blue dots for now). Note the strong time-dependence: for example, TaxiNet works well in the late morning (almost all runs having \(\rho > 0\) and so satisfying \(\varphi _{\text {eventually}}\)) but consistently fails to track the centerline in the early morning.

  • Clouds: Figure 5 shows the median performance curves (as in Fig. 4) for 3 of X-Plane’s cloud types: no clouds, moderate “overcast” clouds, and dark “stratus” clouds. Notice that at 8 am TaxiNet performs much worse with stratus clouds than no clouds, while at 2 pm the situation is reversed. Performance also varies quite irregularly when there are no clouds — we will analyze why this is the case shortly.

  • Distance along the runway: The green data in Fig. 6 show performance as a function of how far down the runway the plane starts (ignore the orange/purple data for now). TaxiNet behaves similarly along the whole length of the runway, except around 1350–1500 m, where it veers completely off of the runway (\(\rho \approx -30\)). Consulting the airport map, we find that another runway intersects the one we tested with at approximately 1450 m. Images from the simulations show that at this intersection, both the centerline and edge markings of our test runway are obscured.

Fig. 4. Performance of TaxiNet as a function of time of day, before and after retraining. (Color figure online)

These visualizations identify several problematic behaviors of TaxiNet: consistently poor performance in the early morning, irregular performance at certain times depending on clouds, and an inability to handle runway intersections. The first and last of these are easy to explain as being due to dim lighting and obscured runway markings. The cloud issue is less clear, but VerifAI can help us to debug it and identify the root cause.

Inspecting Fig. 5 again, observe that performance at 2–3 pm with no clouds is poor. This is surprising, since under these conditions the runway image is bright and clear; the brightness itself is not the problem, since TaxiNet does very well at the brightest time, noon. However, comparing images from a range of times, we noticed another difference: shortly after noon, the plane’s shadow enters the frame, and moves across the image over the course of the afternoon. Furthermore, the shadow is far less visible under cloudy conditions (see Fig. 2). Thus, we hypothesized that TaxiNet might be confused by the strong shadows appearing in the afternoon when there are no clouds.

Fig. 5. Median TaxiNet performance by time of day, for different cloud types. (For clarity, individual runs are not shown as dots in this figure.)

Fig. 6. TaxiNet performance by distance along the runway. Solid lines are medians. The lowest median value for original TaxiNet clipped by the bottom of the chart is \(-32\). (Color figure online)

Fig. 7. TaxiNet performance (with fixed plane position) by time of day, with and without shadows.

To test this hypothesis, we wrote a new Scenic scenario with no clouds, varying only the time of day; we used VerifAI’s Halton sampler  [9] to get an even spread of times with relatively few samples. We then ran two experiments: one with our usual test protocol, and one where we disabled the rendering of shadows in X-Plane. The results are shown in Fig. 7: as expected, in the normal run there are strong fluctuations in performance during the afternoon, as the shadow is moving across the image; with shadows disabled, the fluctuations disappear. This confirms that shadows are a root cause of TaxiNet’s irregular performance in the afternoon.

Figures 4 and 6 show that there are failures even at favorable times and runway positions. We diagnosed several additional factors leading to such cases, such as starting at an extreme angle or further away from the centerline; see the Appendix  [5] for details.

Finally, we can use VerifAI for fault localization, identifying which part of the system is responsible for an undesired behavior. TaxiNet’s main components are the neural network used for perception and the steering controller: we can test which is in error by replacing the network with ground-truth CTE and HE values and re-running the counterexamples we found above. Doing this, we found that the system always satisfied \(\varphi _{\text {eventually}}\); therefore, all the failure cases were due to mispredictions by the neural network. Next, we use VerifAI to retrain the network and improve its predictions.

3.4 Retraining

The easiest approach to retraining using VerifAI is simply to generate a new generic training set using the falsification scenario \(\mathcal {S}_{\text {falsif}}\) from Fig. 3, which deliberately includes a wide variety of different positions, lighting conditions, and so forth. We sampled new configurations from the scenario, capturing a single image from each, to form new training and validation sets with the same sizes as for original TaxiNet. We used these to train a new version of TaxiNet, \(\mathcal {T}_{\text {generic}}\), and evaluated it as in the previous section, obtaining much better overall performance: out of approximately 4,000 runs, 82% satisfied \(\varphi _{\text {eventually}}\), and only 3.9% left the runway (compared to 55% and 9.1% before). A variant of \(\mathcal {T}_{\text {generic}}\) using VerifAI’s Halton sampler, \(\mathcal {T}_{\text {Halton}}\), was even more robust, satisfying \(\varphi _{\text {eventually}}\) in 83% of runs and leaving the runway in only 0.6% (a \(15\times \) improvement over the original model). Furthermore, retraining successfully eliminated the undesired behaviors caused by time-of-day and cloud dependence: the blue data in Fig. 4 shows the retrained model’s performance is consistent across the entire day, and in fact this is the case for each cloud type individually.

However, this naïve retraining did not eliminate all failure cases: the orange data in Fig. 6 shows that \(\mathcal {T}_{\text {Halton}}\) still does not handle the runway intersection well. To address this issue, we used a second approach to retraining: over-representing the failure cases of interest in the training set using a specialized Scenic scenario  [6].

We altered \(\mathcal {S}_{\text {falsif}}\) as shown in Fig. 8, increasing the probability of the plane starting 1200–1600 m along the runway, a range which brackets the intersection; we also emphasized the range 0–400 m, since Fig. 6 shows the model also has difficulty at the start of the runway. We trained a specialized model \(\mathcal {T}_{\text {specialized}}\) using training data from this scenario together with the validation set from \(\mathcal {T}_{\text {generic}}\). The new model had even better overall performance than \(\mathcal {T}_{\text {Halton}}\), with 86% of runs satisfying \(\varphi _{\text {eventually}}\) and 0.5% leaving the runway. This is because performance near the intersection is significantly improved, as shown by the purple data in Fig. 6; however, while the plane rarely leaves the runway completely, it still typically deviates several meters from the centerline. Furthermore, performance is worse than \(\mathcal {T}_{\text {generic}}\) and \(\mathcal {T}_{\text {Halton}}\) over the rest of the runway, suggesting that larger training sets might be necessary for further performance improvements.

Fig. 8. Position distribution emphasizing the runway beginning and intersection. Probabilities corresponding to the original scenario (Fig. 3) shown in comments.
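A hedged sketch of a position distribution in this spirit is shown below. The weights are illustrative rather than the values actually used in Fig. 8; the comments give the probability each range would receive under the original uniform distribution over 0–2000 m.

```scenic
# Hedged sketch: weighted mixture of ranges for distance down the runway.
distance = Options({
    Range(0, 400):     0.3,   # emphasized: beginning of the runway   (originally 0.2)
    Range(400, 1200):  0.2,   #                                       (originally 0.4)
    Range(1200, 1600): 0.4,   # emphasized: brackets the intersection (originally 0.2)
    Range(1600, 2000): 0.1,   #                                       (originally 0.2)
})
ego = Object at Range(-8, 8) @ distance, facing Range(-30, 30) deg
```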

While in this case it was straightforward to write the Scenic program in Fig. 8 by hand, we can also learn such a program automatically: starting from \(\mathcal {S}_{\text {falsif}}\) (Fig. 3), we use cross-entropy sampling to move the distribution towards failure cases. Applying this procedure to \(\mathcal {T}_{\text {generic}}\) for around 1200 runs, VerifAI indeed converged to a distribution concentrated on failures. For example, the distribution of distances along the runway gave \(\sim \)79% probability to the range 1400–1600 m, 16% to 1200–1400 m, and 5% to 0–200 m, with all other distances getting only \(\sim \)1% in total. Referring back to Fig. 6, we see that these ranges exactly pick out where \(\mathcal {T}_{\text {Halton}}\) (and \(\mathcal {T}_{\text {generic}}\)) has the worst performance.
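The following minimal sketch illustrates the cross-entropy idea over a single discretized parameter (distance down the runway). It is self-contained apart from a hypothetical simulate_and_monitor helper returning \(\rho \) for one run, and it illustrates the technique rather than VerifAI's implementation.

```python
import random

bins = [(200 * i, 200 * (i + 1)) for i in range(10)]   # 200 m bins over a 2000 m runway
weights = [1.0 / len(bins)] * len(bins)                 # start from the uniform prior

def cross_entropy_step(weights, num_samples=50, alpha=0.9):
    """One smoothed cross-entropy update toward the distribution of failures."""
    failures = [0] * len(bins)
    for _ in range(num_samples):
        i = random.choices(range(len(bins)), weights)[0]    # sample a bin
        lo, hi = bins[i]
        rho = simulate_and_monitor(random.uniform(lo, hi))  # hypothetical: run one test
        if rho < 0:                                         # count bins producing failures
            failures[i] += 1
    total = sum(failures) or 1
    return [(1 - alpha) * w + alpha * f / total for w, f in zip(weights, failures)]
```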

Finally, we also experimented with a third approach to retraining, namely augmenting the existing training and validation sets with additional data rather than generating completely new data as we did above. The augmentation data can come from counterexamples from falsification  [2], from a handwritten Scenic scenario, or from a failure scenario learned as we saw above. However, we were not able to achieve better performance using such iterative retraining approaches than simply generating a larger training set from scratch, so we defer discussion of these experiments to the Appendix  [5].

4 Conclusion

In this paper, we demonstrated VerifAI as an integrated toolchain useful throughout the design process for a realistic, industrial autonomous system. We were able to find multiple failure cases, diagnose them, and in some cases fix them through retraining. We interfaced VerifAI to the X-Plane flight simulator, and extended the Scenic language with external parameters, allowing the combination of probabilistic programming and active sampling techniques. These extensions are publicly available  [1, 7].

While we were able to improve TaxiNet’s rate of satisfying its specification from 55% to 86%, a 14% failure rate is clearly not good enough for a safety-critical system (noting of course that TaxiNet is a simple prototype not intended for deployment). In future work, we plan to explore a variety of ways to further improve performance, including repeating our falsify-debug-retrain loop (of which we showed only a single iteration), increasing the size of the training set, and choosing a more complex neural network architecture. We also plan to further automate error analysis, building on clustering and other techniques (e.g.,  [10]) available with VerifAI and Scenic, and to incorporate white-box reasoning techniques to improve the efficiency of search.