1 Introduction

Vehicle autonomy has always been a subject of interest, whether it be for the task of driving [1,2,3], marine navigation [4,5,6] or even aerial agents [7, 8]. The focus of this work is on self-driving cars using deep learning. Over the last decade, supervised deep learning methods have achieved exemplary performance in various applications. One domain of interest to us is vehicle control, where convolutional network architectures have proven to be instrumental in directly mapping images to steering commands [9, 10]. Despite their great success, supervised deep learning methods are prone to two severe limitations:

  1. They require tremendous amounts of labeled training data. In the context of a vision-based driving application, this means having a camera placed at the front of the car which records a video as the vehicle is being driven. The corresponding steering angle executed by the driver is also recorded and time-synchronized with each image frame. The necessity of having an expert driver who collects data from a vehicle with such a controlled setup limits the scalability of the underlying application.

  2. The trained neural network model struggles at inference time when it encounters data that is not in the distribution of the training set. In the context of self-driving cars, such out-of-distribution data constitutes scenarios where the vehicle is driving off-course [11,12,13].

It is worth noting that the second limitation is a consequence of the first, since anomalous off-course data is scarce during data collection with an expert driver: traffic rules and lane regulations must be strictly followed to ensure safety. During inference, the network predictions may cause the car to diverge far from the boundaries of the driving lane. Since no such aberrant scenarios are present in the training set, the car may be unable to take corrective measures to recover from this deviation.

In this work, we propose a framework which addresses these problems associated with supervised deep learning models. The framework merely requires an unlabeled sequence of RGB images from which the vehicle trajectory is determined using visual odometry. The steering labels can then be inferred by applying Model Predictive Control (MPC) using this trajectory and modeling the kinematics of the ego-vehicle. Meanwhile, off-course images are synthesized at novel views and are also included in the training set to enhance robustness. The primary contributions of our framework are enumerated below:

  1. We demonstrate training a lateral vehicle control network from only an unlabeled sequence of images. This eliminates the need for a specialized setup on the car that has to retrieve the steering angles executed by the driver. Moreover, the task of synchronizing the recording of the images with the corresponding steering angles is also eliminated.

  2. The data collection process does not necessitate driving off-course for recording anomalous cases. Rather, images at divergent positions are synthesized using only on-course image sequences and included in the training data.

  3. We demonstrate in the experiments that the proposed fusion of MPC with synthesized novel views leads to improved robustness at inference time.

2 Related work

2.1 Supervised methods for vehicle control

Toromanoff et al. [1, 10] train a supervised network that directly maps image data to steering commands. One limitation of such methods is scalability, as they require an expert driver whose steering maneuvers must be recorded during data collection. Moreover, the performance of such models tends to be constrained to the domain on which they were trained [14]. Attempts to partially circumvent this problem involve using multiple laterally displaced cameras while recording data [3], shearing the images [15] or using a fish-eye camera to generate laterally displaced images [10]. An appropriate label correction is applied to each of the laterally displaced images. However, such approaches may cause visual distortions or are constrained by the maximum lateral displacement in the images. For robustness, [16] adapts the strategy from [17] by injecting noise into the control command and letting the expert recover from these disturbances during data collection. While this method may be expedient in controlled environments, it is impractical and too dangerous to deploy in the real world: the injected noise may cause the vehicle to veer off-course and result in a collision. In our approach, we synthesize images at arbitrary locations from a single on-course trajectory without the need to drive off-course. Moreover, recording the steering commands of the driver is not required; instead, the steering labels are inferred from MPC.

2.2 Methods not requiring supervision

Zhang et al. [18,19,20] train a neural network in a simulated setting for vehicle control using Reinforcement Learning (RL). RL methods do not require explicit supervision, but they involve a random exploration of the environment as part of learning a suitable policy [21]. The trajectory resulting from the exploration strategy may cause the driving agent to depart from the driving lane, thereby violating traffic rules and causing accidents. One way of partially alleviating this issue is to use training data from a virtual environment, which tends to be less intrusive [22]. Accordingly, the authors of [23] train an RL model in a virtual environment and evaluate it in the real world. However, an intermediate semantic representation is required to translate virtual images to the real world.

In comparison, our method is capable of training directly on images that the model is expected to see at inference and does not require any additional semantic information. Kendall et al. [11] also demonstrate training an RL policy directly in the real world. However, the approach necessitates a safety driver to seize control whenever the car diverges from the lane. Another issue with RL-based control policies is that they tend to require tremendous amounts of data and computational resources for training [19, 24]. In [25], supervised data is not needed and training can potentially be done on real world data. However, they additionally utilize images from multiple trajectories by aligning them to the reference. In contrast, our framework only requires a single reference trajectory from which additional images at arbitrary positions are synthesized.

2.3 Path planning and longitudinal control

The task of autonomous driving can be decomposed into two integral components: 1) high level planning to determine the optimal path for the driving agent to reach its destination, and 2) the low level steering commands executed by the ego-vehicle using data received from its immediate surroundings. Li et al. [26, 27] demonstrate path planning methods for generating safe vehicle trajectories. In contrast, our work is concerned with predicting low level vehicle control commands. These low level commands can be further separated into lateral and longitudinal control. Matute et al. [28, 29] generate velocity profiles such that possible passenger discomfort resulting from sudden acceleration/deceleration is mitigated. Similar to our case, they also test their approach in simulation. However, our network focuses on predicting the steering command for lateral vehicle control and does not require state estimation at inference time. Uebel et al. [30] also determine the longitudinal dynamics of the vehicle. However, they additionally take the current and future states of multiple traffic lights ahead into account to enable energy-efficient driving. Our framework only requires a single RGB camera for immediate lateral control; it is not focused on long term velocity profiling and trajectory planning.

2.4 Methods combining deep learning and optimal control

Bansal et al. [31] combine optimal control with deep learning for vision-based navigation of a robot in an indoor setting. It is assumed that the environment is static and the robot state is perfectly known. Mohseni et al. [32] combine MPC with an ensemble of neural networks for collision avoidance. The sensory input to the network is obtained from Lidar, which tends to be more expensive [33]. In [34], a network is trained to predict the output obtained from MPC for controlling the moisture content produced by a paper-making machine, with training data generated in simulation. In comparison, our framework is not limited to the synthetic domain and can equally be applied to real world data. We show image synthesis for the real world KITTI dataset [35].

3 Framework

Figure 1 provides a high-level overview of our proposed framework. Note that the ultimate goal of the framework is to train a neural network that takes an image as input and predicts the appropriate steering angle for lateral vehicle control. The network is trained only from an unlabeled sequence of images, which are obtained from a camera setup placed at the front of a car. No ground truth steering labels are available. Rather, the steering labels are inferred from MPC. The framework comprises four main components:

  • Visual Odometry, which provides the trajectory traversed by the vehicle using a sequence of RGB images.

  • View Synthesis, that generates additional images at arbitrary positions lateral to the original (on-course) trajectory.

  • Model Predictive Control, which supplies the steering labels for both the synthesized images and those in the original trajectory.

  • Neural Network, which maps the input images to the lateral steering commands.

Fig. 1

Overview of the proposed framework describing the steps for data and label generation (Steps 1-4), training (Step 5) and inference. Step 1: A sequence of images obtained from a camera rigidly attached to the car is fed through a visual odometry algorithm. It generates the pose of each image along the ego-vehicle’s trajectory. We refer to it as the reference trajectory. Step 2: Images at arbitrary positions are synthesized. Step 3: The positions of the synthesized images are aligned in relation to the reference trajectory. Step 4: MPC generates the appropriate steering angles at locations of the reference and synthesized images. Step 5: A network is trained to predict the appropriate steering angle with an image frame as input. The target labels for training are obtained from MPC for each frame. Inference: Once the training is done, only the neural network component of the entire framework is used at inference time to make steering predictions from raw images directly taken from the camera. Note that no view synthesis, visual odometry or MPC calculation is required at the inference step

We first describe the motivation for using model predictive vehicle control and how it works in Section 3.1. Its limitations with regard to real time implementation are also discussed (in Section 3.1.3). Next, view synthesis (Section 3.2) and visual odometry (Section 3.3) are explained as solutions to overcoming these limitations. Finally, Section 3.4 gives information about the neural network.

3.1 Model predictive control (MPC)

We use MPC to implicitly acquire the steering labels. Note that MPC is an optimization based approach, whereas our neural network is trained with a learning based approach. As depicted in Fig. 1, we therefore use MPC to provide the target labels which are then used to train the neural network.

In driving-related tasks, MPC has been shown to outperform other controllers [36,37,38], as it allows desired behaviours to be achieved by appropriately adjusting the cost function [39]. We therefore treat MPC for the task of self-driving as a receding horizon problem. As depicted in Fig. 2, the controller optimizes the future set of actions that need to be executed for the ego-vehicle (point A) to reach a goal state (point C) along the original reference trajectory (in purple). The first control action is executed and the vehicle attains a new state at time T1. The process is repeated at the new vehicle state. Successive optimizations with the updated goal state (point C) for timestamps Ti, where i ∈ [1, n], reduce the deviation between the ego-vehicle (point A) and the closest state (point B) on the reference. Note that we only execute the first action rather than all actions predicted by MPC, because the motion model of the ego-vehicle is only an approximation of the real world. Attempting to execute all actions in order to follow the entire path predicted by MPC may therefore cause the ego-vehicle to deviate far from the reference.

Fig. 2

Depicts MPC treated as a receding horizon problem. At time T0, the ego-vehicle is at position A. B is the point on the purple reference trajectory closest to A. The dotted blue line shows the estimated trajectory of the car if the sequence of control actions optimized by MPC were executed. The first action predicted by the controller is executed and the ego-vehicle attains a new state. The process is repeated, while the distance between point B and the goal state is maintained at D. The state at some later time Tn is also shown. Figure should be viewed in colour

We now define the motion model of the ego-vehicle and cost function used for the optimization.

3.1.1 Motion model

We use the bicycle model [40] to describe the kinematics of a 4-wheeled ego-vehicle with planar motion. The state of the ego-vehicle is described by its orientation (𝜃) and location coordinates X, Y. It can be controlled by regulating the acceleration (a) and steering angle (δ). We assume there is no slip in the vehicle, which is a valid assumption for vehicles that execute turns at low or moderate speed [41]. If L is the wheelbase and V is the ego-vehicle’s velocity, then the equations of motion can be formulated as:

$$ \dot{X} = V \cos\theta; \quad \dot{Y} = V \sin\theta; \quad \dot{V} = a; \quad \dot{\theta} = \frac{V \tan\delta}{L} $$
(1)
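To make the kinematics concrete, the following is a minimal discrete-time sketch of (1) using forward-Euler integration. The state layout, wheelbase and timestep values are illustrative assumptions on our part, not taken from the paper:

```python
import numpy as np

def bicycle_step(state, accel, delta, L=2.7, dt=0.05):
    """One forward-Euler step of the kinematic bicycle model in (1).

    state : (X, Y, V, theta) -- position, velocity and heading
    accel : longitudinal acceleration a
    delta : steering angle in radians
    L, dt : wheelbase and integration timestep (assumed values)
    """
    X, Y, V, theta = state
    return np.array([
        X + V * np.cos(theta) * dt,          # X_dot = V cos(theta)
        Y + V * np.sin(theta) * dt,          # Y_dot = V sin(theta)
        V + accel * dt,                      # V_dot = a
        theta + V * np.tan(delta) / L * dt,  # theta_dot = V tan(delta) / L
    ])
```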

3.1.2 Cost function

The cost function aims to produce the optimal sequence of control actions such that the difference between the goal state and an ego-vehicle state at any of the N timesteps into the horizon ahead is minimized. The state at each of the N timesteps ahead can be estimated by iteratively applying the equations of the motion model. Longer horizons ought to produce better estimates of the control actions to be taken. However, this comes at the expense of longer optimization cycles. If Xg, Yg and 𝜃g describe the state variables of the goal state, then the objective function to be minimized can be expressed as:

$$ \underset{a,\delta}{\min} \sum\limits_{i=1}^{N} \left[ \alpha_{1} (X_{i} - X_{g})^{2} + \alpha_{2} (Y_{i} - Y_{g})^{2} + \alpha_{3} (\theta_{i} - \theta_{g})^{2} \right], $$
(2)

where \(\alpha _{1-3} \in \mathbb {R}\) are values to appropriately weight the different terms in the cost function. Also, note that the cost function is dynamically modified to bring the goal state closer to the current state when making turns. This is to ensure that the optimization does not cause the vehicle to cut corners, while simultaneously reducing the velocity and abiding by the no-slip condition.

The optimization yields a sequence of N control actions. However, only the first is executed and the vehicle attains a new state. This process is repeated at this new vehicle state.
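This receding-horizon loop can be sketched as below, reusing the `bicycle_step` helper from the sketch above. This is a minimal illustration with a generic solver; the horizon length, weights and control bounds are assumptions, and the optimizer actually used in the paper is not specified here:

```python
import numpy as np
from scipy.optimize import minimize

def mpc_first_action(state, goal, N=10, alphas=(1.0, 1.0, 1.0)):
    """Optimize N (accel, steer) pairs per (2); return only the first pair.

    state : current (X, Y, V, theta);  goal : (X_g, Y_g, theta_g).
    """
    a1, a2, a3 = alphas
    Xg, Yg, theta_g = goal

    def cost(u):
        s, total = np.array(state, dtype=float), 0.0
        for accel, delta in u.reshape(N, 2):   # roll the motion model forward
            s = bicycle_step(s, accel, delta)
            total += (a1 * (s[0] - Xg) ** 2 + a2 * (s[1] - Yg) ** 2
                      + a3 * (s[3] - theta_g) ** 2)
        return total

    bounds = [(-1.0, 1.0), (-0.5, 0.5)] * N    # accel/steer limits (assumed)
    res = minimize(cost, np.zeros(2 * N), bounds=bounds)
    return res.x[0], res.x[1]                  # execute only the first (a, delta)
```

At each new vehicle state the function is called again, which is precisely the receding-horizon behaviour described above.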

3.1.3 MPC limitations and solutions

Optimization Time: In principle, the entire set of N control actions could be executed after a single MPC optimization. This would relieve the computational burden of optimizing at each time step. However, the ego-vehicle model is only an approximation of the real world. Therefore, attempting to execute control actions that lie farther into the future may yield an ego-vehicle trajectory which significantly deviates from the estimated trajectory. Hence, in order to obtain an optimal solution, the calculations need to be performed at each time step, which may not be feasible for real time execution. To alleviate this issue, we perform the optimizations offline and train a neural network to predict the first control action in the sequence. Offline computation allows inference time to remain the same, irrespective of how long the optimization cycles were when training the network. The input to the network is an image corresponding to the ego-vehicle position.

Reference Trajectory: The reference trajectory required to implement MPC in real time is no longer needed when a neural network is used at inference. Nevertheless, it would still be required to determine the target labels for training the network. We utilize a visual odometry system in order to track the moving camera and obtain the reference trajectory for the training stage. Please see Section 3.3 for further details on visual odometry.

Localization: Even if the reference trajectory is known, another issue with real time MPC implementation is that the position of the ego-vehicle needs to be localized against the reference. Methods used for localization would incur additional cost and further constrain the hardware resources. In contrast, the neural network directly maps the input image to the appropriate steering command, thereby obviating the intermediate localization step. Nevertheless, to make the network robust to deviations from the reference, we would still like to train it with images at arbitrary locations that are not on the reference trajectory. For this, we formulate a self-supervised pipeline which uses images from the reference trajectory to synthesize images at arbitrary locations away from the reference. Hence, we invert the problem: instead of localizing the car in relation to the reference in real time, we synthesize images at desired locations and train the network offline. Please see Section 3.2 on view synthesis.

3.2 View synthesis

Figure 3 provides the schematics of how images at novel view points are generated using a single image from the reference trajectory traversed by the car. This image is first fed to a depth estimation network to output the corresponding depth. The depth estimation problem is framed as a view synthesis problem by constraining the network to perform image synthesis using an intermediary variable, namely the depth map. If It is the target image and \(I_{t^{\prime }\to t}\) is the reconstructed image from the source \(I_{t^{\prime }}\), then the objective function to be minimized can be formulated as a combination of photometric reconstruction error Lp and edge-aware depth smoothness Ls, as follows [42]:

$$ {L}_{D}(I_{t}, I_{t^{\prime}\to t}) = \mu {L}_{p}(I_{t}, I_{t^{\prime}\to t}) + \lambda {L}_{s}(I_{t}) $$
(3)

For the photometric reconstruction error Lp, we use a combination of SSIM and L1 loss terms, and set α = 0.85. μ denotes the per-pixel mask that eliminates pixels which remain stationary between adjacent frames in the sequence.

$$ {L}_{p}(I_{t}, I_{t^{\prime}\to t}) = \underset{t^{\prime}}{\min} pe(I_{t}, I_{t^{\prime}\to t}) $$
(4)
$$ pe(I_{a}, I_{b}) = \alpha \frac{1-\text{SSIM}(I_{a}, I_{b})}{2} + (1-\alpha) \left\lVert I_{a} - I_{b} \right\rVert_{1} $$
(5)

Edge-aware smoothness is used to prevent shrinking of the predicted depth, where \(d_{t}^{*} = d_{t} / \overline {d_{t}}\) denotes the mean-normalized inverse depth. The combined loss LD is averaged over each pixel, scale, and batch.

$$ {L}_{s}(I_{t}) = \left| \partial_{x}d_{t}^{*} \right| e^{-\left| \partial_{x}I_{t} \right|} + \left| \partial_{y}d_{t}^{*} \right| e^{-\left| \partial_{y}I_{t} \right|} $$
(6)
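The photometric terms (4)-(5) can be sketched in PyTorch as follows. The 3×3 average-pool SSIM is a common simplification in this line of work [42]; the window size and constants here are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def ssim_loss(a, b, C1=0.01 ** 2, C2=0.03 ** 2):
    """Returns (1 - SSIM)/2 per pixel, computed over 3x3 neighbourhoods."""
    mu_a, mu_b = F.avg_pool2d(a, 3, 1, 1), F.avg_pool2d(b, 3, 1, 1)
    sig_a = F.avg_pool2d(a * a, 3, 1, 1) - mu_a ** 2
    sig_b = F.avg_pool2d(b * b, 3, 1, 1) - mu_b ** 2
    sig_ab = F.avg_pool2d(a * b, 3, 1, 1) - mu_a * mu_b
    num = (2 * mu_a * mu_b + C1) * (2 * sig_ab + C2)
    den = (mu_a ** 2 + mu_b ** 2 + C1) * (sig_a + sig_b + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(target, reconstructions, alpha=0.85):
    """Eqs. (4)-(5): per-pixel minimum of pe over all source reconstructions."""
    errs = [alpha * ssim_loss(target, r).mean(1, keepdim=True)
            + (1 - alpha) * (target - r).abs().mean(1, keepdim=True)
            for r in reconstructions]
    return torch.cat(errs, dim=1).min(dim=1).values
```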
Fig. 3

Describes the schematics of synthesizing images at novel view points using a source image from the reference trajectory

This depth map is then projected to yield a 3D coloured point cloud. Multiple images can then be synthesized from this point cloud at desired imaginary positions [43]; a sketch of this procedure is given after this paragraph. Note that certain regions in the synthesized image may be beyond the visible field of view of the reference image, in which case the synthesized image may contain voids. We make sure that the imaginary camera position is chosen such that the void regions do not occlude the drivable regions in the image. The depth estimation network can be trained in an entirely un/self-supervised manner [42, 44]. However, such methods tend to suffer from bleeding artifacts at object boundaries [45], which is further exacerbated for thin objects [46]. Hence, attempting to synthesize images at novel views may produce curved boundaries for certain edges. Nevertheless, for the task of lateral vehicle control, the trained neural network tends to focus on the high level features of the image [47]. Hence, for all intents and purposes, the network trained with synthesized images yields similar performance to the network trained with original images. This is also shown in Table 1 in Section 4 and further discussed in Section 5.7.
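A bare-bones version of this back-projection and re-projection step is sketched below, assuming a pinhole camera with known intrinsics K. The nearest-neighbour forward warping is a simplification; real pipelines handle occlusions and sub-pixel splatting more carefully:

```python
import numpy as np

def synthesize_view(img, depth, K, T_new_from_src):
    """Warp a source image to a laterally displaced virtual camera.

    img   : (H, W, 3) source image;  depth : (H, W) estimated depth
    K     : (3, 3) intrinsics;  T_new_from_src : (4, 4) SE(3) transform
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    # Back-project pixels into a coloured 3D point cloud in the source frame.
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(-1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    # Move the cloud into the virtual camera frame and re-project.
    cam = (T_new_from_src @ pts_h)[:3]
    proj = K @ cam
    uv = np.round(proj[:2] / proj[2]).astype(int)
    out = np.zeros_like(img)   # pixels never written to remain voids
    ok = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (cam[2] > 0)
    out[uv[1, ok], uv[0, ok]] = img.reshape(-1, 3)[ok]
    return out
```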

3.3 Visual odometry

Recall that by using a neural network, we are absolved from the requirement of having a reference trajectory at inference time. The network can also generalize to control the vehicle in new, unseen environments where no reference trajectory is available. This is done by shifting the requirement of having the reference trajectory from inference time to the offline training of the network. The trajectory can be obtained by running a state-of-the-art visual odometry system [48,49,50]. These approaches generate the 6 Degree-of-Freedom (DoF) pose information for each frame in the trajectory. Since the camera setup is rigidly attached to the car, the pose of the camera can also be used to determine the pose of the car at the corresponding frame. Using stereo image pairs can additionally provide the scale information [51, 52]. The camera poses are represented by a transformation matrix T belonging to the special Euclidean group SE(3) of rigid body motions, comprising a rotation \(\mathbf{R} \in SO(3)\) and a translation \(\mathbf{t} \in \mathbb{R}^{3}\).
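For completeness, a small sketch of how such a pose matrix is assembled and how the relative pose between two frames is obtained (the helper names are our own):

```python
import numpy as np

def make_pose(R, t):
    """Assemble a 4x4 SE(3) transform from R in SO(3) and t in R^3."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def relative_pose(T_world_a, T_world_b):
    """Pose of frame b expressed in frame a: T_ab = T_wa^{-1} @ T_wb."""
    return np.linalg.inv(T_world_a) @ T_world_b
```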

3.4 Neural network

The neural network is trained to take an image as input and produce the appropriate steering angle (δ) for lateral vehicle control. The training is done by minimizing the L1 loss between the value predicted by the model and the first δ of the N steering commands optimized by model predictive control. The input image can either be one from the reference trajectory or a synthesized one.
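A minimal training sketch is given below. The paper does not prescribe a specific architecture, so the small CNN here is purely illustrative; only the input/output interface and the L1 objective follow the text:

```python
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    """Illustrative stand-in CNN: RGB image in, scalar steering angle out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(48, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def train_step(model, optimizer, images, mpc_deltas):
    """One step: L1 loss against the first MPC steering command per image."""
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(images).squeeze(1), mpc_deltas)
    loss.backward()
    optimizer.step()
    return loss.item()
```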

4 Experiments

There are many publicly available real world datasets [35, 53,54,55,56] that have been instrumental for benchmarking and evaluating various autonomous driving tasks such as object detection/tracking, SLAM, semantic segmentation etc. However, none of these real world benchmarks provide steering labels for the evaluation of vehicle control algorithms. More importantly, they also do not facilitate interaction with the environment. This interaction is necessary to quantitatively evaluate sensorimotor control methods for true driving quality. Codevilla et al. [57] conducted extensive studies demonstrating that the true driving quality obtained from online interaction does not necessarily correlate with offline evaluation. In fact, they showed that two models with similar offline performance metrics can have significantly different driving behaviours. Hence, offline evaluation on static images of the available real world benchmarks would be inadequate for our task of evaluating online sensorimotor control. In contrast, simulation engines [58,59,60,61] aid online interaction with the driving environment. They are therefore increasingly being used to quantitatively evaluate contemporary vehicle control algorithms [62, 63]. We use the CARLA simulator [60] for evaluation and benchmark against the strategies from [16] and [3]. Further details on our experimental setup on the CARLA simulator are given in the following subsections.

Note that in addition to the quantitative results on CARLA, we also provide qualitative evaluation on the real world KITTI dataset. Please refer to Section 4.4 for more details.

4.1 CARLA simulator

The CARLA simulator has been designed to validate various autonomous driving algorithms encompassing both perception and control. It also supports a diverse suite of sensors. The primary objective of this paper is to demonstrate how a vehicle control model can be trained merely from an unlabeled sequence of images. Therefore, the RGB camera is the only category of sensor used in this paper. Note that the CARLA simulator also provides the ability to violate traffic rules, which would otherwise be infeasible and costly in the real world. Hence, we are able to compare the performance of our method with the supervised approach adapted from [16], wherein noise is injected into the control signal during data collection. This causes the vehicle to swerve off-course, thereby breaking traffic rules. The action taken by the expert driver to bring the vehicle back on-course is recorded. We show that the performance of our model is comparable to this supervised method, despite the fact that our method does not require dangerous traffic violations during data collection. Recall from Section 1 that we defined the ego-vehicle to be on-course if it is driving safely within its own lane; otherwise, it is considered to be off-course. Driving off-road or in the lane of oncoming traffic is categorized as off-course.

4.2 Data collection

We place the camera setup at the front of the ego-vehicle. Images of size 1200 × 600 with a field of view of 110° are recorded as the vehicle traverses the road in autopilot mode. Note that recording the steering angle executed by the ego-vehicle during data collection is not needed for our method. Nevertheless, for the purpose of comparison with the supervised method, these steering commands are also collected. We use [51] as the visual odometry algorithm for determining the reference trajectory. This is needed for determining the control values with MPC (Section 3.1), which serve as the training labels for our neural network (Section 3.4). Meanwhile, [42] is used to train the depth estimation model in an entirely self-supervised manner. The estimated depth is then utilized to warp the original colour image as if the scene were seen from a new perspective.

4.3 Quantitative evaluation results

Table 1 reports the online evaluation results for different starting positions, both on trajectories that the model had seen during training and on the testing trajectories which were not seen during training. Each episode is run for up to 250 timesteps for natural turns. The online metric used for evaluation is the mean ratio of time the ego-vehicle remains within its driving lane [47]. The car is considered to be within its own lane if no portion of it is in the lane of the oncoming traffic or off the road, and it does not collide with other traffic participants/obstacles. For the purpose of comparison, we additionally report the evaluation scores for other model configurations. They are described in further detail in Section 5.
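For concreteness, the metric can be computed as below, assuming the simulator exposes a per-timestep boolean flag indicating whether the car satisfies the in-lane condition:

```python
def in_lane_ratio(in_lane_flags):
    """Fraction of timesteps the car stays fully within its lane (one episode)."""
    return sum(in_lane_flags) / len(in_lane_flags)

def mean_in_lane_ratio(episodes):
    """Mean ratio over all evaluation episodes, as reported in Table 1."""
    return sum(in_lane_ratio(ep) for ep in episodes) / len(episodes)
```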

Table 1 Ratio of time the car remains within its driving lane

4.4 Qualitative results

Evaluating the online performance of the neural network for steering angle prediction is not possible on existing real world datasets, since it requires interaction with the captured environment. Nevertheless, we can depict the qualitative performance of the other three components of our framework, namely view synthesis, visual odometry and MPC, on the real world KITTI dataset [35]. The results of visual odometry for calculating the vehicle pose have been included in the supplementary material. Moreover, view synthesis and control labels from MPC are depicted in the supplementary video on the KITTI dataset. For visual odometry and view synthesis, images at their native resolution of 1241 × 376 are used without further pre-processing. However, for view synthesis, locations at farther distances from the source image lead to visible voids at the boundaries. This is because the field of view (FOV) of the source image does not capture the entire FOV of the synthesized images. Therefore, for illustration purposes, we center crop the image to 1000 × 376 in the video. We conduct the same qualitative evaluation on CARLA as well. These results can also be visualized in the video.

5 Discussion

In this section, we make some observations on the experimental results of our approach when compared with other models.

5.1 Single trajectory model

The power of our framework comes from its ability to synthesize additional off-course data from a single on-course reference trajectory. To examine the significance of utilizing the synthesized images, we train another model only with images from the single reference trajectory. Note that the reference trajectory is always on-course. As can be seen in Table 1, this model’s performance drops significantly on both the train and test sequences. There are two plausible explanations for this:

1. Over-fitting: Firstly, the single trajectory model was trained with fewer images, which has the tendency to lead to over-fitting. But then, why does online evaluation on the training trajectories yield dismal results, despite the fact that images in these trajectories were seen by the model during training? As alluded to in [57], the training loss is an offline evaluation metric and does not necessarily correlate with the true driving quality at inference. So, if the car were to deviate even slightly from the reference training trajectory at inference, it would come across a scene whose corresponding image was not available in the training set. This brings us to our second point.

2. Lack of anomalous scenarios in the training data: The training data for the single trajectory model did not contain such anomalous off-course scenarios. Therefore, the model has difficulty determining the correct action to take to return the ego-vehicle to its original course. This causes the model to take wrong decisions, leading the ego-vehicle to go further astray. This accumulation of errors eventually causes a violation of traffic rules or even a collision, despite the evaluation being conducted on the same trajectories the model had seen in the training set.

5.2 Effect of perturbations

To further investigate this issue with the single trajectory model, we introduce perturbations into the predicted steering command. This is to see if and how well the model can recover from deviations on the training trajectories. As can be seen in Fig. 4, the greater the intensity of the perturbations, the more difficult it is for the model to execute a recovery and the worse the performance. This is in contrast to our model, which is additionally trained with synthesized images representing potential anomalous driving scenarios. Our model therefore has the ability to cater to such circumstances and bring the vehicle back to its original course. It maintains fairly consistent performance even in light of perturbations.

Fig. 4

Shows the effect on the performance of the various model configurations when different levels of perturbation are added to the final steering command. The performance is reported as the mean of the ratio of time the car remains within its driving lane. Higher ratio values correspond to better performance of a method. Note that perturbations are introduced regularly every 15 timesteps for a continuous duration of 5 timesteps. In the remaining 10 timesteps of the cycle, no perturbations are added, to allow the models to recover from deviations. Figure should be viewed in colour

5.3 Online MPC

This is the only approach in Table 1 that is not data driven. Rather, it adapts the approach described in Section 3.1 by performing real time optimizations at every ego-vehicle state. It assumes that the ground truth reference trajectory and the ego-vehicle state are always perfectly known, as they can be easily obtained from the CARLA simulator. It is not surprising to see that this online MPC approach outperforms our method on the training set. However, the ground truth trajectory and precise ego-vehicle state are unknown in the real world. One can determine the ego-vehicle state in relation to the reference trajectory using visual re-localization methods such as [64], or by using visual image descriptors [65, 66] with the PnP algorithm in a RANSAC scheme. Although these visual localization methods are highly accurate, determining the ego-vehicle state to centimeter accuracy, they are slower when compared to our network (please also see Section 5.8 on ‘Computational Cost’). Moreover, our framework neither requires a ground truth trajectory nor needs to determine the ego-vehicle state at inference time. Instead of using the ground truth trajectory, our method derives the reference trajectory from visual odometry. In addition, instead of localizing the ego-vehicle against the reference in real time, we synthesize images at arbitrary locations from the reference offline. Note that since the online MPC approach requires a reference trajectory, its performance for unseen test trajectories cannot be reported. However, Fig. 4 shows that it is robust to perturbations on the “training” trajectories.

5.4 Supervised network

Table 1 also compares the performance of our method with a supervised model, which is trained with ground truth steering labels recorded during data collection. Note that our approach does not require supervised labels, yet it far outperforms the network trained with supervision. The supervised model suffers from the same issue as the single trajectory model described earlier in Section 5.1: it was trained with only images on the reference trajectory. Hence, if the vehicle deviates off-course, it is not capable of returning to course. The model is also not robust against perturbations, as depicted in Fig. 4.

5.5 3-Camera model

This approach is adapted from [3]. It is also a supervised approach trained with ground truth steering labels, but it uses three cameras during data collection. Here, two cameras are placed to the left and right of the central camera at a distance of 0.3 meters each. This provides additional deviated trajectories during training. The steering labels for the images from these two cameras are obtained by adding a slight bias to the steering label of the central camera. The bias is a hyper-parameter and in our case is obtained by respectively taking the mean of the steering values obtained by MPC for all the left and right camera images. The performance of this model is better than both the supervised and single trajectory models, as can be seen in Table 1 and Fig. 4. This is because the images from the deviated trajectories expand the training distribution. However, this comes at the cost of using three time-synchronized cameras. Moreover, note that the performance of this model is still worse than our approach. One explanation is that our method of view synthesis can create an arbitrary number of additional trajectories for training rather than only three.

5.6 Supervised network with noise injection

We would like to investigate whether the lack of images depicting anomalous scenarios is the pivotal reason for the constrained performance of the supervised model. For this, we adapt the strategy of [16]. It is similar to the supervised model, except that noise is injected into the steering command during data collection, causing the car to swerve. The corrective action executed by the expert driver to bring the vehicle back to its original course is recorded. Table 1 shows that the performance of such a model trained with noise injection improves significantly. However, note that attempting to collect supervised data with noise injection in real traffic may be extremely dangerous. The injected noise may not only cause the ego-vehicle to violate traffic rules but also surprise other traffic participants, potentially causing them to make erroneous decisions. Moreover, it would necessitate an expert driver with specialized skills to take immediate corrective actions. This is neither a pragmatic nor a scalable solution.

5.7 Our method + ground truth data

This method is similar to ours, except that the ground truth reference trajectory and images are used for training. Recall that our approach uses visual odometry to determine the reference trajectory and synthesizes images at different locations. Despite this, our model’s performance is on par with the model trained with ground truth data.

5.8 Computational cost

Note that the online MPC method assumed that the state of the vehicle in relation to the ground truth trajectory is precisely known. Although this information can easily be extracted from the simulation, it will not be available in the real world. As an alternative, state-of-the-art visual localization methods can be used to determine the ego-vehicle state in relation to a reference trajectory. Among such methods, one could use visual descriptors such as [66] with the PnP algorithm in a RANSAC setting. This achieves up to centimeter (cm) accuracy. Figure 5 shows that the method achieves a cumulative re-localization of 98% within 50 cm.

Fig. 5

Shows the cumulative error of re-localization within 3 meters. The error is the translation norm, in meters, between the predicted and ground truth positions

Although very accurate, it is much slower than our network. This re-localization approach runs at a frequency of 11 Hz on a GeForce GTX TITAN X 12 GB GPU and 1 Hz on an Intel(R) Xeon(R) E5-2637 CPU. In contrast, our network runs faster, at 25 Hz on a GPU and 16 Hz on a CPU with the same specifications. This is because our network at inference time directly maps the RGB image to the appropriate steering commands, completely eliminating the state estimation process at inference.

6 Conclusion

In this paper, we presented a framework for training a lateral vehicle control network from an unlabeled sequence of RGB images. The approach demonstrated improved robustness from using additional images. These views were synthesized from only the available on-course data but appeared as if emerging from a deviated traversal of the vehicle. Hence, data collection did not have to violate traffic rules to record such deviated aberrant situations. Moreover, steering labels were inferred from MPC rather than being recorded by an expert driver. Experimental results demonstrated that our approach yields on par performance with methods that rely on additional data collection and supervision.