Explaining autonomous driving with visual attention and end‑to‑end trainable region proposals

Autonomous driving is advancing at a fast pace, with driving algorithms becoming more and more accurate and reliable. Despite this, it is of utmost importance to develop models that offer a degree of explainability, so that they can be trusted, understood and accepted by researchers and, especially, society. In this work we present a conditional imitation learning agent based on a visual attention mechanism that provides visually explainable decisions by design. We propose different variations of the method, relying on end-to-end trainable region proposal functions that generate regions of interest to be weighed by an attention module. We show that visual attention can improve driving capabilities while providing explainable decisions.


Introduction
Although autonomous driving vehicles are starting to become a reality, their diffusion worldwide is still slowed down by how such advancements are perceived by society. To make autonomous vehicles pervasive in everyday life, it is fundamental that the algorithms and learning models guiding their decisions are trustworthy, transparent and fully understandable. In other words, it is of paramount importance that the technologies the end user will rely on are explainable. Explainability in autonomous driving has been widely studied in recent years, especially regarding the machine learning and computer vision algorithms that make autonomous navigation possible (Omeiza et al. 2021; Zablocki et al. 2021; Cultrera et al. 2020). Explanations can be provided in different forms and styles, e.g. presenting factual, contrastive or counterfactual evidence to support cause-effect relationships (Lim and Dey 2009) or showing the sensitivity of the decision with reference to parts of the input (Omeiza et al. 2021).
A simple yet effective way to interpret decisions, especially for computer vision based applications, is to provide a visual explanation of what the model is focusing on. Ex-post methods, such as Grad-CAM (Selvaraju et al. 2017), attempt to demystify "black-box" models by locating the pixels in the input image that are most relevant to a decision. Such methods, however, despite being widely used for explaining pretrained classifiers, have been shown to be hard to adapt to regression tasks (Letzgus et al. 2021). A better alternative is to exploit a model specifically designed to be explainable. Visual attention has been widely used for this purpose, integrating into a model a mechanism that weighs regions or parts of the input to establish their importance explicitly (Anderson et al. 2018).
In this work, we present a study on how different types of visual attention can be exploited to explain the decisions of a driving agent. We propose a conditional imitation learning approach capable of learning driving policies from RGB frames, trained with an attention block that weighs image regions based on their importance for the task. We design different region proposal functions, trained end-to-end along with the driving agent. A preliminary version of this work was described in Cultrera et al. (2020), introducing the first visual attention based driving agent in the literature that learned to assign attention weights to a static grid of regions of interest in the input image. This work differs substantially from Cultrera et al. (2020) in several aspects: (i) we overcome the limitation of having static proposals by developing different dynamic region proposal functions based on either Region Proposal Networks (Ren et al. 2015) or Spatial Transformer Networks (Jaderberg et al. 2015); (ii) we provide a comparison with ex-post explainability methods, showing the importance of explicitly modeling visual attention to obtain meaningful interpretations; (iii) we show that the learned attention maps can be used to retrieve hard examples by framing the problem as an anomaly detection task.

Related works
Imitation learning is an approach for learning a policy that reflects a behaviour by analyzing demonstrations performed by an expert. Prior work has often exploited this paradigm in the automotive domain, where a driving policy can be learned by attempting to replicate steering commands for urban navigation (Bojarski et al. 2016) or by following high level commands such as turn or go straight (Codevilla et al. 2018). This type of task has also been addressed by Liang et al. (2018) with reinforcement learning. Such approaches learn a mapping between what is perceived by the ego-vehicle and the output controls. However, to foster generalization an intermediate representation can be used, such as low-dimensional affordances as in Sauer et al. (2018) or perception indicators related to the surrounding environment as proposed by Chen et al. (2015).
Different types of sensor data are often exploited and additional synthetic data can be gathered from simulators to train driving models (Codevilla et al. 2018; Berlincioni et al. 2019; Lee et al. 2018; Berlincioni et al. 2021). Several approaches exploit additional data rather than RGB frames alone, e.g., considering depth (Xiao et al. 2019), semantic segmentation (Li et al. 2018) or LiDAR data (Haris and Glowacz 2022) as inputs or to perform multi-task learning (Xiao et al. 2019; Yang et al. 2018; Codevilla et al. 2019; Ishihara et al. 2021; Greco et al. 2022). The importance of model architecture and image features has also been stressed in the literature, benchmarking different convolutional networks (Orden and Visser 2021) or learning to generate features capable of generalizing across different environmental conditions (Guo et al. 2021). Temporal modeling is also used by Eraqi et al. (2017), George et al. (2018) and Xu et al. (2017). Zhang and Cho (2016), instead, developed an agent that can gracefully fall back to a safe policy when dangerous scenarios emerge.
One of the biggest problems of training imitation learning methods end-to-end is the inability to explain a model's behaviour. In fact, since driving is a safety critical domain, explainability is becoming of prominent interest in automotive research (Kim and Canny 2017; Xu et al. 2020; Marchetti et al. 2022; Bojarski et al. 2017). Xu et al. (2020) predict vehicle actions such as slowing down and provide a textual explanation. Marchetti et al. (2022) exploit memory augmented neural networks to forecast agent positions and reason about cause-effect relationships in motion patterns. In Kim et al. (2020), a model is proposed to generate pieces of advice (e.g., "wet road") that are then converted into actions (e.g., "slow down").
Qualitative explanations are also a common way of providing interpretable results by letting the model attend to portions of the input image. Examples can be found in Kim and Canny (2017) and Chen et al. (2017). In the former, salient regions are extracted from a saliency model to condition the output by weighing feature maps, whereas the latter exploits a biologically inspired cognitive model. However, neither is end-to-end trainable and both require separate training steps to compute attention. Differently from these approaches, we learn attention end-to-end instead of using an external source of saliency to weigh intermediate network features. Dong et al. (2021) use a transformer's self attention mechanism to correlate frames to previously observed images and infer the action to be taken. In our work, instead, we learn to generate region proposals that are scored to highlight the most relevant regions of the observed scene. This indicates how well steering controls can be predicted based on the corresponding attended regions.
Proposals have been introduced in the literature for object detection tasks by leveraging low-level image characteristics to localize salient regions (Uijlings et al. 2013; Zitnick and Dollár 2014; Cuffaro et al. 2016). Learning strategies have also been proposed, such as Region Proposal Networks (RPN) (Ren et al. 2015), where the network is trained as a class-agnostic detector. Similarly, Spatial Transformer Networks (STN) (Jaderberg et al. 2015) learn to focus on salient parts of the image by learning affine transformations instead of generating proposals. In this work we integrate both RPNs and STNs in our visual attention module to highlight regions relevant for the driving task.

Problem statement
We address the autonomous driving problem in urban environments with a vision-based imitation learning strategy. In imitation learning an agent is trained to learn a policy by attempting to replicate demonstrations $D$ performed by an expert (Attia and Dayan 2018). Demonstrations are made of the $i$th state observation $z_i$ and the action $a_i$ performed at that instant. Therefore a demonstration is an input-output pair $D = (z_i, a_i)$. The goal is to learn a policy function capable of mapping the observations to output actions, $\pi : Z \to A$. Here $Z$ is the set of possible observations and $A$ represents the possible actions (Argall et al. 2009). In an automotive context, the expert is a human driver and the policy to be learned is "driving safely". In general, the agent has access only to a representation of the surrounding environment, e.g., an RGB frame of the scene from an egocentric point of view. Actions, instead, are driving controls that allow the vehicle to follow the desired path, e.g. steering angle and throttle. In practice, the imitation learning framework is made possible by pre-recorded driving sessions performed by expert drivers, which yield a collection of (frame, driving-controls) pairs, acting as demonstrations.
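As a hedged illustration of the learning problem described above, the policy can be written as the minimizer of an imitation loss over the recorded demonstrations. The parametrization $\theta$, the number of demonstrations $N$ and the per-sample loss $\ell$ (for instance an absolute or squared error on the steering angle) are assumptions introduced here for illustration; the text does not specify them.

```latex
\pi_{\theta^{*}} = \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(\pi_{\theta}(z_i),\, a_i\big),
\qquad z_i \in Z,\; a_i \in A
```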
This end-to-end approach is particularly effective for its ability to learn a safe driving policy without the need to provide explicit safe driving rules, such as 'to turn right follow the right edge of the roadway'. Yet this hinders a true understanding of the reasons why a certain driving action is adopted, and this contrasts with the increasing demand for explainability that is rapidly emerging in the autonomous driving domain.

Model overview
The goal of our proposed model is to learn a driving policy capable of imitating the expert by producing the steering angle values required to comply with a given high level command. Following prior works (Codevilla et al. 2018; Sauer et al. 2018), to ensure system scalability on high-level input commands, we divide our architecture into multiple branches, with a separate head for each type of command.
We equip each branch with a visual attention mechanism to make the model interpretable, so as to explain the estimated maneuvers. In particular, we rely on a region proposal function $\mathcal{R}$ that generates Regions of Interest (RoI) in the input image. The visual attention module then scores each RoI, assigning an importance to each region, thus highlighting portions of the image that are relevant for addressing the driving task. The model, which is end-to-end trainable, first extracts a global feature map $f$ from the input image with a convolutional neural network backbone. Based on $f$, the region proposal function $\mathcal{R}$ outputs a set of $R$ relevant RoIs $\{BB_i\}_{i=1}^{R}$ in the form of bounding boxes $BB_i = [x_i, y_i, h_i, w_i]$, where $x_i$ and $y_i$ denote the upper-left coordinates of the box and $h_i$ and $w_i$ its height and width.
At this point, we obtain a region descriptor $r_i$ for each RoI in the image by applying RoI Pooling (Girshick 2015) on the feature map generated by the convolutional backbone. The pooling is applied only on the portion of the convolutional feature map identified by the $i$th bounding box. Pooled features are then weighed using an attention layer and concatenated together, yielding a global descriptor which is condensed into a lower dimensional space with a dense block. The block is followed by a dense regressor that generates steering angle predictions as outputs.
The multi-head architecture that we propose is depicted in Fig. 1.

Visual attention
An attention scheme allows a model to attend only to relevant parts of the input. When the input is an image, this translates into a task-driven saliency over pixels or image regions. In automotive applications, explicitly attending to specific parts of the observed scene aids the decision making process by taking into account environmental cues or surrounding objects, such as turns, intersections or other agents. Thus, a visual attention module must weigh RoIs by generating a probability distribution according to their importance for the driving task. To make the attention task-driven, we learn it directly from the data with end-to-end training, that is, using an attention layer inside a model that generates driving commands from the input image. To foster the model's interpretability, we use a branched architecture with multiple prediction heads. Each head generates steering angles for a different high level command. As a consequence, by integrating a separate attention layer in each head, we obtain different ways of attending to elements in the scene, depending on the high level command.
To perform attention over image regions, we first flatten all RoI-pooled region features $r_i$ and stack them in a single vector $r$, describing the whole scene. We feed $r$ to a dense layer that generates a different logit for each image region. Logits are then normalized using a softmax activation function, yielding a set of RoI weights $\alpha = \{\alpha_1, \ldots, \alpha_R\}$, where $R$ is the number of regions:

$$\alpha = \mathrm{Softmax}(r \cdot W_a + b_a)$$

Here $W_a$ and $b_a$ are respectively the weights and biases of the fully connected attention layer. We use the softmax function since it dampens the less relevant logits while sharpening the most relevant ones. This means that the model is able to concentrate only on a restricted subset of regions. We obtain a final feature $r_a$, where the importance of each region is weighed by the respective attention value, by concatenating RoI features scaled with $\alpha$:

$$r_a = [\alpha_1 r_1, \alpha_2 r_2, \ldots, \alpha_R r_R]$$

Our attention block architecture is shown in Fig. 2.
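The attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimension `d`, the random initialization and the helper names `softmax` and `attend` are hypothetical, and the scaled features are concatenated in region order.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(region_feats, W_a, b_a):
    """Weigh RoI features with a softmax attention layer.

    region_feats: (R, d) array, one flattened RoI-pooled descriptor per region.
    W_a: (R * d, R) weights and b_a: (R,) biases of the dense attention layer.
    Returns the attention weights (R,) and the concatenated scaled features (R * d,).
    """
    r = region_feats.reshape(-1)           # stack all region features in one vector
    alpha = softmax(r @ W_a + b_a)         # one weight per region, summing to 1
    r_a = (region_feats * alpha[:, None]).reshape(-1)  # scale each region, then concatenate
    return alpha, r_a

# Toy usage with hypothetical sizes: 48 regions, 16 features each.
rng = np.random.default_rng(0)
R, d = 48, 16
feats = rng.normal(size=(R, d))
W_a = rng.normal(scale=0.01, size=(R * d, R))
b_a = np.zeros(R)
alpha, r_a = attend(feats, W_a, b_a)
```

Because the weights form a probability distribution over regions, they can be read directly as a per-region importance score for a given frame.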

Multi-head architecture
To comply with different high level commands, the autonomous vehicle must exhibit different behaviors. For example, when reaching an intersection the agent can turn left or right or can keep going straight, depending on the route it has to follow. To keep our architecture as flexible and extensible as possible, we structure our model as a branched multi-head regressor where each head predicts steering values for a different high level command. This allows new heads to be easily plugged in to address additional high level commands. This is a common design which has been shown to outperform single-headed models (Codevilla et al. 2018; Sauer et al. 2018). We use an attention layer in each head, so that the model can learn to attend to different cues, depending on the command. This solution has the advantage of improving the explainability of the model, since the attention maps are conditioned on the commands, highlighting what is important for different tasks.
To decide which head to use, we feed the command as input to the model and use it as a selector to pick the correct branch. During training, this has the effect of guiding backpropagation only through the part of the model that is actually used to generate the output. At the same time, the convolutional backbone is shared among commands, so it is updated for each sample, regardless of the high-level command. This enables end-to-end training of the model.
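The branch selection described above can be illustrated with a deliberately simplified sketch. Each head is reduced here to a single linear regressor on a shared feature vector, which is an assumption made for brevity (in the model each head also contains its own attention layer and dense block). The command names follow those mentioned in the experimental section, and the class `BranchedRegressor` is hypothetical.

```python
import numpy as np

COMMANDS = ("follow", "left", "right", "straight")

class BranchedRegressor:
    """Toy multi-head regressor: one linear head per high level command."""

    def __init__(self, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = {c: rng.normal(scale=0.1, size=feat_dim) for c in COMMANDS}

    def predict(self, shared_feat, command):
        # The command acts as a selector: only one head produces the output.
        return float(shared_feat @ self.heads[command])

    def head_gradients(self, shared_feat, command, target):
        # Gradient of the squared error with respect to each head's weights.
        # Heads that were not selected receive a zero gradient.
        err = self.predict(shared_feat, command) - target
        return {
            c: (2.0 * err * shared_feat if c == command else np.zeros_like(shared_feat))
            for c in COMMANDS
        }

model = BranchedRegressor(feat_dim=8)
feat = np.ones(8)  # stand-in for the output of the shared backbone
grads = model.head_gradients(feat, "left", target=0.3)
```

In this toy setting only the "left" head is updated for the sample, while a shared backbone (not modeled here) would receive gradients for every sample.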

Region proposals
The shared convolutional backbone, after extracting features from input images, is followed by a RoI pooling layer (Girshick 2015). Each Region of Interest generated by the region proposal function $\mathcal{R}$ can exhibit a different size and aspect ratio. The RoI pooling layer extracts a fixed-size descriptor $r_i$ for each proposal by dividing the region into a number of cells on which a pooling function is applied. Here we adopt the max-pooling operator over 4 × 4 cells.
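To make the pooling step concrete, the following is a minimal NumPy sketch of RoI max pooling over a 4 × 4 grid of cells. It assumes boxes are given as (x, y, h, w) in feature-map cells and lie inside the map; the function name `roi_max_pool` and the rounding of cell borders are illustrative choices, not the authors' implementation.

```python
import numpy as np

def roi_max_pool(fmap, box, out_size=4):
    """Max-pool one region of a feature map into a fixed out_size x out_size grid.

    fmap: (C, H, W) feature map.
    box:  (x, y, h, w), upper-left corner plus height and width, in feature-map cells.
    """
    x, y, h, w = box
    ys = np.linspace(y, y + h, out_size + 1)  # cell borders along the vertical axis
    xs = np.linspace(x, x + w, out_size + 1)  # cell borders along the horizontal axis
    out = np.empty((fmap.shape[0], out_size, out_size), dtype=fmap.dtype)
    for i in range(out_size):
        y0 = int(np.floor(ys[i]))
        y1 = max(int(np.ceil(ys[i + 1])), y0 + 1)  # keep every cell at least 1 row tall
        for j in range(out_size):
            x0 = int(np.floor(xs[j]))
            x1 = max(int(np.ceil(xs[j + 1])), x0 + 1)
            out[:, i, j] = fmap[:, y0:y1, x0:x1].max(axis=(1, 2))
    return out

# Toy usage: a 2-channel 8 x 12 map and a region of height 6 and width 8.
fmap = np.arange(2 * 8 * 12, dtype=float).reshape(2, 8, 12)
pooled = roi_max_pool(fmap, box=(2, 1, 6, 8))
```

Whatever the size or aspect ratio of the box, the descriptor has the same shape, which is what allows regions to be stacked and fed to the attention layer.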
RoI generation is a fundamental step in our pipeline, since extracting good RoIs allows the attention mechanism, explained in Sect. 3.2, to correctly select salient regions of the image. We propose different formulations for $\mathcal{R}$. First, we use a static grid of fixed boxes at different scales. We then propose a fully learnable formulation, making $\mathcal{R}$ a neural network capable of generating task-driven dynamic proposals, depending on image content. In this case, we follow two alternative approaches to build the region proposal function, relying on Region Proposal Networks (RPN) (Ren et al. 2015) and Spatial Transformer Networks (STN) (Jaderberg et al. 2015). In the following we provide an overview of each formulation.

Static proposals
The simplest formulation to obtain a set of region proposals is to let $\mathcal{R}$ yield a static grid of fixed handcrafted RoIs. To obtain meaningful RoIs we use a multi-scale regular grid spanning the image. Here, we assume the image to have height H and width W. We generate the grid by sliding boxes of variable size over the input image with a given stride. For this purpose we use four types of windows, as illustrated in Fig. 3:
- BIG H: horizontal boxes with height H/2 and width W; these boxes cover the whole width of the image, spanning from top to bottom with a 60px vertical stride;
- BIG V: vertical boxes with height H, width W/2 and a horizontal stride equal to W/2; this yields two regions dividing the image into a left and a right side;
- MEDIUM: boxes with height H/2 and width W/2, with an area equal to a quarter of the image; we slide the box on the two horizontal halves of the image with a horizontal stride of 60px;
- SMALL: boxes with height H/2 and width W/4, with a 3px stride in the horizontal and vertical directions.
The shape of the four boxes is designed to consider different aspects of the scene. The BIG scale is coarse and is intended to focus on structural elements in the scene (e.g. vertical for traffic signs or buildings and horizontal for forthcoming intersections). The MEDIUM and SMALL scales instead focus on smaller details such as approaching vehicles or distant turns. Overall, we use a grid of 48 image regions: 2 BIG V, 6 BIG H, 8 MEDIUM and 32 SMALL.
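A static grid of this kind can be generated with a generic sliding-window helper. The sketch below is only illustrative: the function `sliding_boxes` is hypothetical, and the example covers just the BIG V windows, whose stride is expressed relative to the image width. It does not attempt to reproduce the counts of the other window types, which depend on the resolution at which the pixel strides are applied.

```python
def sliding_boxes(img_h, img_w, box_h, box_w, stride_y, stride_x):
    """Enumerate boxes (x, y, h, w) obtained by sliding a window over the image.

    Only windows that fit entirely inside the image are kept.
    """
    boxes = []
    for y in range(0, img_h - box_h + 1, stride_y):
        for x in range(0, img_w - box_w + 1, stride_x):
            boxes.append((x, y, box_h, box_w))
    return boxes

# BIG V windows on a 200 x 88 frame (the input resolution given in the
# training details): full height, half width, horizontal stride W/2.
H, W = 88, 200
big_v = sliding_boxes(H, W, box_h=H, box_w=W // 2, stride_y=H, stride_x=W // 2)
```

With these settings the helper returns the left and right halves of the frame, matching the two BIG V regions listed above.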

Dynamic proposals
Casting $\mathcal{R}$ as a static proposal generator has evident limitations, since we are imposing a strong prior on the regions that the model can attend to. However, dynamically generating region proposals is not trivial. In fact, once proposals are cropped, all the spatial information is lost. Indeed, when the proposals are static, the network learns to correlate the relative position of each feature with its underlying semantics. This is possible since the $i$th feature to be attended will always correspond to the same spatial coordinates. The fact that the model is learning positional cues based on the order in which RoI features are presented to the attention model can be easily demonstrated by shuffling the boxes during inference. In Sect. 9, we show that by doing so, the model is unable to generate meaningful steering commands.
To overcome this limitation, we simply concatenate a spatial cue to the input image by adding two additional channels containing the normalized x and y frame coordinates. This allows the model to take into account absolute RoI positions and to be invariant to proposal ordering. We adopt this technique to enable the generation of dynamic proposals that vary depending on the image content. We propose two alternative formulations for $\mathcal{R}$: Region Proposal Networks (Ren et al. 2015) and Spatial Transformer Networks (Jaderberg et al. 2015).
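The spatial cue described above amounts to appending two coordinate channels to each frame. A minimal NumPy sketch follows; the helper name `add_coord_channels`, the channel order (x first, then y) and the channels-last layout are assumptions, while the [0, 1] normalization follows the range stated in the ablation study.

```python
import numpy as np

def add_coord_channels(img):
    """Append normalized x and y coordinate channels to an (H, W, C) image."""
    h, w = img.shape[:2]
    # ys varies along rows, xs along columns; both are normalized to [0, 1].
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, h), np.linspace(0.0, 1.0, w), indexing="ij")
    coords = np.stack([xs, ys], axis=-1).astype(img.dtype)
    return np.concatenate([img, coords], axis=-1)

frame = np.zeros((88, 200, 3), dtype=np.float32)  # RGB frame at the training resolution
frame_xy = add_coord_channels(frame)              # shape (88, 200, 5)
```

Since every pixel now carries its own position, features pooled from a region retain where that region came from, independently of the order in which regions are presented.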
Region Proposal Networks - A Region Proposal Network (RPN) (Ren et al. 2015) is a convolutional architecture for generating proposals given a convolutional feature map of an image. An RPN has a regression and a classification layer. The regression layer modifies a set of anchors with predefined sizes and aspect ratios to generate bounding box coordinates. The classification head instead is used to assign objectness scores to proposals.
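As an illustration of how a regression layer can modify anchors, the sketch below applies the box parameterization commonly used in Faster R-CNN style detectors (center offsets scaled by the anchor size, log-scale width and height). Whether the model described here uses exactly this parameterization is an assumption; the function name `decode_boxes` and the (cx, cy, w, h) layout are illustrative.

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Turn anchors and predicted offsets into boxes.

    anchors: (N, 4) array of (cx, cy, w, h).
    deltas:  (N, 4) array of (tx, ty, tw, th) predicted by the regression layer.
    """
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]  # shift center by a fraction of the width
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]  # shift center by a fraction of the height
    w = anchors[:, 2] * np.exp(deltas[:, 2])           # rescale width
    h = anchors[:, 3] * np.exp(deltas[:, 3])           # rescale height
    return np.stack([cx, cy, w, h], axis=1)

anchors = np.array([[100.0, 44.0, 64.0, 64.0]])
shifted = decode_boxes(anchors, np.array([[0.5, 0.0, 0.0, 0.0]]))
```

Zero offsets return the anchor unchanged, so the regression layer only has to learn corrections relative to the predefined anchor shapes.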
Although RPN represents an effective method to generate RoIs, its original formulation has two limitations for our purposes: (i) a strong supervision signal is needed, namely ground truth bounding boxes; (ii) it relies on RoI Pooling to extract features, which is not differentiable with respect to the box coordinates. The authors originally overcame these limitations for an object detection task by adopting a two-step training, i.e. pretraining the RPN with ground truth class-agnostic boxes. Since we do not know a priori which regions might be considered useful by the model, for the purpose of our work it is essential that the model can be trained end-to-end and discover relevant proposals in a task-driven fashion.
As a solution, instead of standard RoI Pooling, we use a differentiable RoI pooling layer called Precise RoI Pooling (Jiang et al. 2018). Precise RoI Pooling is an integration-based pooling strategy based on bilinear interpolation and allows the gradient to be backpropagated through the bounding-box coordinates. This makes the regression head differentiable and allows the proposal generation to be fully task-driven.
In addition, unlike Faster R-CNN (Ren et al. 2015), which generates a large number of boxes and then thresholds them based on the predicted objectness, we completely remove the classification head and retain all the generated proposals. This stems from the fact that, without direct supervision, the classification head is unable to provide effective objectness scores. To control the number of boxes we act on the stride of the convolutions and on the number of anchors used to generate the proposals. Further details are provided in Sect. 5.
Spatial Transformer Networks - Spatial Transformer Networks (STN) (Jaderberg et al. 2015) allow a model to learn spatial transformations on input feature maps. The effectiveness of STN derives from the fact that transformations are learned without direct supervision, in a task-driven fashion. In detail, an STN is made up of three main blocks. A Localization Network is responsible for predicting the parameters of the transformation matrix $T$. It takes a feature map as input and is formed by a convolutional or fully connected block followed by a regressor. A Grid Generator then uses the affine transformation matrix $T$ to output a parameterized sampling grid $T(G)$, where $G$ is a regular grid corresponding to image coordinates. The final output transformation is obtained using a Grid Sampler which applies the sampling grid $T(G)$ on the input feature map. This operation is achieved through bilinear interpolation. Overall, the transformation performed by the STN is an affine transformation that maps points in the input feature map into warped positions. Therefore, using an STN as region proposal function does not require a RoI Pooling step, as the transformation directly outputs the features of the region of interest.
We constrain the transformation to avoid skew and rotation, making it of the form:

$$T = \begin{bmatrix} s & 0 & T_x \\ 0 & s & T_y \end{bmatrix}$$

where $s$ is the scale factor, and $T_x$, $T_y$ are the translation parameters. In our model, each branch dedicated to a high-level command is equipped with an STN generating $R$ spatial transformations (i.e. proposals).
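A transformation restricted to an isotropic scale and a translation can be sketched as follows. The code builds a 2 × 3 matrix with scale `s` on the diagonal, zero off-diagonal terms and translations `tx`, `ty`, then applies it to a regular grid in normalized coordinates. The [-1, 1] coordinate convention follows the original STN formulation; the function names and the output grid size are assumptions, and the bilinear sampling step is omitted.

```python
import numpy as np

def constrained_affine(s, tx, ty):
    """2 x 3 affine matrix with isotropic scale and translation only (no skew, no rotation)."""
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]])

def sampling_grid(theta, out_h, out_w):
    """Apply theta to a regular grid G of normalized coordinates in [-1, 1].

    Returns an (out_h, out_w, 2) array of (x, y) source locations to sample from.
    """
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    G = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # homogeneous coordinates
    return G @ theta.T

theta = constrained_affine(s=0.5, tx=0.2, ty=-0.1)
grid = sampling_grid(theta, out_h=4, out_w=4)
```

Under this constraint each transformation selects an axis-aligned window centered at (tx, ty) with half-extent s in normalized coordinates, which is why it can be read as a region proposal.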

Training details
The shared backbone is composed of five convolutional layers with ELU activations. The first three layers have respectively 24, 36 and 48 kernels of size 5 × 5 with stride 2 and are followed by two other layers with 64 filters of size 3 × 3 with stride 1. All input images are resized to a resolution of 200 × 88 pixels. In the RPN model we control the number of boxes by changing the stride in the convolutional block. In particular we use stride 2 to generate 108 boxes. The anchor sizes and aspect ratios used for training are respectively {64, 80, 100} and {0.5, 1}. The RPN produces a feature vector of size 1024 for each box. A linear reduction layer is used to reduce its size to 512.
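The relation between architecture, stride and number of boxes can be checked with a short calculation. The sketch below assumes unpadded convolutions in the backbone, a strided RPN convolution whose output size is rounded up, and one box per anchor at each spatial position (3 sizes × 2 aspect ratios). None of these choices is stated in the text, so this is a plausibility check rather than a description of the actual implementation; all function names are hypothetical.

```python
import math

def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1

def backbone_map_size(width=200, height=88):
    """Spatial size of the backbone output for the five layers listed above."""
    layers = [(5, 2), (5, 2), (5, 2), (3, 1), (3, 1)]  # (kernel, stride)
    for kernel, stride in layers:
        width = conv_out(width, kernel, stride)
        height = conv_out(height, kernel, stride)
    return width, height

def rpn_box_count(stride, n_sizes=3, n_ratios=2):
    """Number of boxes when every anchor is kept at every strided position."""
    w, h = backbone_map_size()
    return math.ceil(w / stride) * math.ceil(h / stride) * n_sizes * n_ratios

counts = {s: rpn_box_count(s) for s in (1, 2, 3)}
```

Under these assumptions the backbone output is 18 × 4, and the counts for strides 1, 2 and 3 come out as 432, 108 and 72, in line with the numbers of boxes reported for the RPN variants.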

Dataset
To train and evaluate our autonomous agent, we use the CARLA Simulator (Dosovitskiy et al. 2017).
CARLA is an open-source platform conceived and designed for research in autonomous driving. It provides a realistic reconstruction of urban and suburban environments that includes two Towns. It also offers the possibility of setting different weather conditions and times of day. We use data from Codevilla et al. (2018), in which Town01 is used for training and Town02 for validation. The dataset was recorded using four different weather conditions. For each example in the training set, measurements of steering, throttle, brake, and high-level command are provided. To test the abilities of an autonomous agent, Codevilla et al. (2018) also released a test benchmark composed of separate driving episodes. The benchmark is goal-oriented: for each episode the autonomous agent is asked to reach a goal point on the map, given a starting point, within a certain time limit. Both Towns are included. For each Town the benchmark is divided into four tasks: (i) Go Straight: driving along a straight road; (ii) One Turn: the destination is reached by making either a right or left turn; (iii) Navigation: to reach the destination an agent has to drive along a longer route, in which there might be several turns; (iv) Navigation dynamic: the same as Navigation but with other vehicles and pedestrians. For each task, there are 25 episodes, replicated in six different weather conditions, four of which are already seen in training and two used only for testing. In total, the benchmark consists of 600 episodes for Town01 and 600 for Town02, for a total of 1200 episodes.

Experimental results
In this section we discuss the experimental results obtained by our proposed models. First of all, we compare the driving success rates on the CARLA Benchmark using different types of attention; the results are reported in Table 1. It must be noted that in this work we are not interested in obtaining the best results on a driving benchmark, but rather in offering a comprehensive analysis of how attention mechanisms can be integrated into a driving model and which benefits this can provide. Nonetheless, in Table 2, we provide a comparison with other state-of-the-art methods. Since our focus is on explainability and attention, our model does not come equipped with bells and whistles such as data augmentation or high-level input representations (e.g., depth, semantic segmentation).
An additional characterization of the results, further motivating the success of STN over RPN, can be given by looking at the proposals generated by the models along with their attention weights. This is of particular interest since it provides a degree of explainability with respect to the outputs of the model. Figure 4 shows attention heatmaps obtained by accumulating the top 5 proposals over the entire validation set, consisting of 74,000 frames. STN offers spatially fine grained explanations compared to the other methods. Attention is focused on the horizon and sidewalks for the follow command, the centerline for the straight command and the bottom left centerline for the left and right commands. For the turn left command the model looks at the centerline when the road is still straight, yet also focuses ahead on the left side to anticipate the curve. On the contrary, for the turn right command, the model keeps the lane by following the bottom left part of the centerline and looks at the right side when the curve is visible. Interestingly, in CARLA the centerline is often interrupted at intersections. The model is therefore exploiting this cue to perform the driving task. This bimodal distribution of attention is even more visible in Fig. 5, which shows the distribution of all the proposals generated for each model. For RPN, instead, the distribution of the boxes is mostly focused on the left and right edges of the road at the horizon, using bigger and coarser RoIs. The static model, on the other hand, yields an attention map that has a regular, axis-aligned distribution, with the right side of the road getting higher attention for all the high level commands. Samples of attention on single frames are shown in Fig. 6. A frame-by-frame breakdown of a right turn is also provided in Fig. 7 to show how the attention changes when approaching an intersection.
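Heatmaps of this kind can be obtained by painting the most attended boxes of each frame onto a shared accumulator. The sketch below is a hedged illustration: the function `accumulate_heatmap` is hypothetical, boxes are assumed to be (x, y, h, w) in pixels, and adding the attention weight (rather than a constant) inside each box is an assumption about how proposals are accumulated.

```python
import numpy as np

def accumulate_heatmap(heat, boxes, alpha, top_k=5):
    """Add the top_k most attended boxes of one frame to the accumulator.

    heat:  (H, W) array updated in place.
    boxes: (R, 4) array of (x, y, h, w) in pixels.
    alpha: (R,) attention weights of the frame.
    """
    top = np.argsort(alpha)[::-1][:top_k]  # indices of the highest weights
    for i in top:
        x, y, h, w = (int(v) for v in boxes[i])
        heat[y:y + h, x:x + w] += alpha[i]
    return heat

# Toy frame with six disjoint 10 x 10 boxes; only the five most attended ones are drawn.
heat = np.zeros((88, 200))
boxes = np.array([[x, 0, 10, 10] for x in (0, 20, 40, 60, 80, 100)])
alpha = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])
accumulate_heatmap(heat, boxes, alpha)
```

Calling the function once per frame and per command over a whole set of frames yields one heatmap per high level command.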

Comparison with feature attribution methods
Our proposed model has been specifically designed to provide visual explanations of its predictions. What the model is performing is therefore an importance attribution of regions in the pixel space. Nonetheless, several methods for feature attribution exist in the literature and have been successfully used to provide ex-post visual explanations of deep learning models (Bach et al. 2015; Selvaraju et al. 2017; Shrikumar et al. 2017; Sundararajan et al. 2017). Here, we compare the attributions provided by our model, previously discussed in Fig. 6, with off-the-shelf feature attribution methods, namely DeepLift (Shrikumar et al. 2017), LRP (Bach et al. 2015) and Integrated Gradients (Sundararajan et al. 2017). We apply such methods on the baseline architecture without the explicit attention module.
Qualitative results for the explainability methods described above are shown in Fig. 8. DeepLift (Shrikumar et al. 2017), LRP (Bach et al. 2015) and Integrated Gradients (Sundararajan et al. 2017) generate a sparse attention that focuses mostly on the road surface. Although this is a comprehensible behavior for a model without attention, it does not suggest much in terms of interpretability. Overall, the attention maps are quite similar across the various explainability methods, and it is difficult to identify patterns that can help to understand the motivations of a particular behavior. This experiment suggests that using attention mechanisms based on proposal generation, rather than ex-post explanation methods, allows us to obtain a visual explanation of what leads to a prediction. Furthermore, it should be considered that these ex-post explainability methods have limitations. In particular, they are not well suited for regression models. Evidence of these limitations can be found in the work of Letzgus et al. (2021). This consideration further confirms that using an explicit attention mechanism leads to considerable benefits in terms of interpretability.

Ablation studies
In this section we validate the importance of some of the components of our proposed models. Here, we perform the experiments on a subset of the CARLA driving benchmark, namely testing only on the Training conditions and New weather splits. First of all we experimentally validate the intuition, introduced in Sect. 4.2, according to which the model based on static proposals learns to derive spatial information from the order of the boxes. In Table 3 we show that simply shuffling the order in which the boxes are presented makes the model unable to emit meaningful steering commands. To overcome this limitation, we retrain the model adding two additional channels to the input, representing normalized spatial coordinates ranging from 0 to 1. The overall success rate is almost on par with the original method, even if the order of the boxes is shuffled.
We now study the effect of the number of boxes in our dynamic proposal functions. Controlling the number of boxes with STN is straightforward, since the STN can be modified to generate multiple transformation matrices. For RPN instead, since we do not use the classification head, we change the number of boxes by modifying the stride of the convolutions and the number of anchors. In particular we use stride = 1 to generate 432 boxes and stride = 3 to generate 72 boxes. The reference RPN model, used in Table 1, has stride = 2 and generates 108 boxes, which is comparable to STN, which uses 100 boxes. In Table 4 we show the results of the models when varying the number of boxes for STN and RPN. In both cases, we obtain the best results when using approximately 100 proposals.

Retrieving failed episodes
The proposed attention mechanism is beneficial to the explainability of the driving behaviour and can also support the identification of anomalous conditions, anticipating possible driving failures. Inspired by prior work (Yang et al. 2022), we address the problem of detecting anomalies using convolutional autoencoders. We trained two networks with the same architecture, the first fed with RGB frames and the second with attention maps produced by our model. We generate attention maps using the STN model, overlaying each generated box on a reference black image and weighing the RoI with the corresponding attention value.
To test the models we used a test set consisting of 600 episodes extracted from the CARLA benchmark, 300 of which were successfully completed by the model. Failed episodes contain collisions with pedestrians, cars or other objects, and/or unusual maneuvers. Our assumption is that failed episodes will contain out-of-the-ordinary events, making the predicted attention anomalous. We thus leverage the reconstruction error of the autoencoders to detect such anomalies. We treat this task as a retrieval task, aiming at automatically identifying failed episodes. To evaluate the task, for each episode we take the maximum reconstruction error and use it to generate precision-recall curves, as shown in Fig. 9. The model trained on attention maps reaches an AUC on the precision-recall curve of 71.53, while the model trained on RGB reaches only 50.06. Similarly, computing Average Precision, we obtain 56.24 using attention maps and 37.97 with RGB frames. This experiment demonstrates that modeling attention is
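The retrieval protocol described above (one score per episode given by the maximum reconstruction error, followed by ranking) can be sketched as below. The non-interpolated Average Precision used here is one common definition and may differ from the exact variant behind the reported numbers; the function names are hypothetical.

```python
def episode_score(frame_errors):
    """Score an episode by its largest per-frame reconstruction error."""
    return max(frame_errors)


def average_precision(scores, labels):
    """Non-interpolated Average Precision for a ranked retrieval task.

    `labels` uses 1 for failed episodes (the items to retrieve) and 0 for
    successful ones. Episodes are ranked by decreasing score.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one failed episode is required")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this failed episode is retrieved.
            precision_sum += hits / rank
    return precision_sum / total_positives
```

A perfect ranking, in which every failed episode scores higher than every successful one, gives an Average Precision of 1.0 under this definition.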

Conclusions
In this paper we describe the architecture of an end-to-end trainable driving system capable of generating driving controls (e.g. steering angle) from an RGB frame representing the scene captured by a vision system. The architecture is designed to implement an attention mechanism that selects the regions of the RGB frame that are most relevant for the prediction of the driving controls. This contributes to the explainability of the model by showing which regions of the observed scene are used for driving. Such an indication can help improve the training process and also offers the potential for detecting anomalies in the observed scene, anticipating potential driving failures. The accuracy of different region proposal mechanisms is reported by measuring the driving success rate on the CARLA benchmark, demonstrating that region proposal by STN (Jaderberg et al. 2015) yields the best overall success rate compared both to RPN (Ren et al. 2015) and to a fixed frame partitioning scheme. Reported experiments also demonstrate that the proposed attention mechanism leads to considerable benefits in terms of interpretability compared to methods providing ex-post visual explanations of deep learning models (Shrikumar et al. 2017; Bach et al. 2015; Sundararajan et al. 2017).

Fig. 1 A convolutional backbone generates a feature map. Then, a region proposal function extracts RoIs that are pooled and weighed by an attention layer. Separate region proposal and attention modules

Fig. 2 Attention block. A weight vector is generated by a linear layer with softmax activation. The final descriptor is a concatenation of RoI features scaled with attention weights

Fig. 3 Four sliding windows are used to generate a multi-scale grid of RoIs. Colors indicate the box type: BIG V (red), BIG H (green), MEDIUM (yellow) and SMALL (blue)

Fig. 4 Heatmap for the top five boxes, ranked by attention. The heatmap is the result of accumulating top proposals over the entire validation set

Fig. 5 Fig. 6 Fig. 7
Fig. 5 Heatmap of the proposal distribution. The heatmap is obtained by accumulating all proposals on the validation set, irrespective of their attention weight

Fig. 9 Precision-recall curves for detecting failed episodes

we compare results of a vanilla model without attention against models with static and dynamic proposals. The model without attention does not generate any proposals and directly feeds the global feature map to the final multi-head architecture. Explicitly modeling attention leads to significantly increased driving performance. Interestingly, the model with the static proposal function obtains good results, improving by a large margin compared to the attention-less baseline. As for the dynamic proposal functions, the STN proposal obtains the best overall success rate, with the notable exception of New weather and new Town, where static proposals yield slightly better results. Surprisingly, RPN performs worse than all the other proposal-based methods. We attribute this to multiple factors: (i) proposals are generated based on local features corresponding to anchors, whereas STN performs global reasoning; (ii) anchors must be handcrafted, thus diminishing the expressiveness of the model; (iii) training an RPN without direct supervision on box positions may not be enough, especially since there is no specific foreground object the box coordinate regressor can adhere to.

Table 1
Percentage of completed tasks using static proposals, RPN and STN

Table 2
Comparison with the state of the art, measured in percentage of completed tasks. Additional sources of data used by a model are identified by superscripts: ⧫ (depth), ⋄ (semantic segmentation), † (temporal modeling), ⋆ (different training data). The best result per task is shown in bold for methods using only RGB frames as input

Table 4
Ablation study. We vary the number of proposals produced by STN and RPN. Both STN and RPN perform better using a number of boxes around 100. In general, STN can obtain higher driving accuracy than RPN even with a low number of proposals