End-to-End Learning with Memory Models for Complex Autonomous Driving Tasks in Indoor Environments

The interest in autonomous vehicles has increased exponentially in recent years. While Lidar is a proven autonomous driving technology, end-to-end learning approaches have become popular as computer performance has improved. NVIDIA's PilotNet, a fully end-to-end method, has demonstrated the ability to predict speed and steering angle from camera images alone, matching the performance of Lidar-based methods in simple driving tasks. A significant drawback, however, is its lack of past spatiotemporal information, which makes its performance error-sensitive, especially in complex driving tasks. Spurred by this deficiency, this paper introduces two novel models, CNN+LSTM and CNN3D, aimed at complex autonomous driving tasks in indoor environments.


Introduction
The interest in autonomous vehicles (AVs) has increased exponentially in recent years. Even though AVs cannot fully replace human drivers at the current stage, there is an irreversible trend towards full driving automation. AVs are an important research area because they have the potential to make transport safer and more efficient while also reducing carbon emissions. In 2018, the US Department of Transportation National Highway Traffic Safety Administration (NHTSA) reported that 94% of severe traffic accidents in the US are caused by human error [1]. AVs or computer-assisted driving systems can significantly reduce accident fatalities and thus promise to address this issue [2]. However, many challenges remain, as AVs must not only outperform human drivers but also be economical. Hence, a technology that solves both problems simultaneously is still an open research question. Currently, most AV companies use Lidar to collect data. The Lidar sensor measures distances to the surrounding environment accurately but is extremely expensive [3]; an alternative sensor is the camera. Since computers can now learn from large datasets, using a camera to collect self-driving data for deep learning algorithms is cheap and greatly reduces the time spent programming traditional driving algorithms. If camera-based deep learning algorithms can reach the performance of Lidar-based algorithms, affordable, marketable AVs become far more likely.
In 2016, NVIDIA introduced PilotNet, an end-to-end Convolutional Neural Network (CNN, an architecture widely used in image analysis [4]) that takes the raw pixels of a single front-camera image as input and produces steering commands as output [5,6]. This network has proven incredibly powerful, but there is still much room for improvement, e.g., adding memory to increase driving consistency so it can deal with more complex scenarios and perform a sequence of actions smoothly. The hypothesis is that driving through a sequence of actions needs memory to perform well. The upgrade to PilotNet can be implemented in two ways: extend the 2D convolutional layers to 3D convolutional layers (CNN3D), or add Long Short-Term Memory (LSTM) after the 2D convolutional layers. Both methods have the potential to achieve better steering results through richer spatiotemporal information. LSTM was developed to address common RNN shortcomings. An LSTM uses a cell, an input gate, an output gate, and a forget gate for its computations. These features allow it to achieve long-term memory and mitigate the "Vanishing Gradient Problem" [7]. LSTM networks are typically used to classify and predict temporal information, and they can also cope well with lags of unknown duration between the relevant events in a time series. This relative insensitivity to gap length is a significant advantage of LSTM over plain RNNs and other sequence learning methods in many fields. CNN3D is a neural network that uses 3D filters (kernels). This network is beneficial in image processing because 3D filters can extract spatial and temporal features from a sequence of image inputs and hence capture motion information [8].
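For reference, the standard LSTM cell update, which is the formulation assumed throughout this paper, is:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

Because the cell state $c_t$ is updated additively rather than through repeated multiplication, gradients can flow across many time steps, which is what mitigates the vanishing gradient problem.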
In 2017, an RC car equipped with a Lidar sensor and a Raspberry Pi, named ModCar (Fig. 1), was developed as a research project at the University of Western Australia (UWA). This project provided a Lidar-based autonomous driving baseline. Later, the project utilized NVIDIA's PilotNet and performed end-to-end deep learning research by fitting the vehicle with a wide-angle camera, providing a camera-based autonomous driving baseline. The vehicle now has a Lidar drive mode, a manual mode, and a PilotNet drive mode. These features satisfy the prerequisites for further autonomous driving research.
In simple driving tasks, such as lane following and driving between two walls, PilotNet works fine and matches the performance of the Lidar-based autonomous driving algorithm. However, for complex driving tasks in indoor environments, such as turning around at a dead end or recovering from a malfunction, PilotNet fails. This paper continues the research on end-to-end visual navigation and proposes two novel models, CNN+LSTM and CNN3D, aimed at complex autonomous driving tasks in indoor environments. Our source code is available at: https://github.com/zhi-hui-lai/ModCarProject.git.

Related Work
Among the earliest groundwork for end-to-end driving is ALVINN [9], which maps images to direct steering control through a fully connected network. NVIDIA later extended this approach with a deep Convolutional Neural Network (CNN) called PilotNet [5]. After PilotNet, researchers attempted numerous different approaches to improve end-to-end driving performance.
Different input configurations have been tried. Maqueda et al. employed an event camera to collect images over a specified time interval as an input to a CNN [10]. Codevilla et al. utilized conditional imitation learning and recorded commands as extra neural network input [11]. Wang et al. also implemented imitation learning and described a subgoal to guide their angle-branched architecture [12]. Yang et al. applied a multi-modal multi-task network that included feedback speed sequence as additional input [13]. Wang et al. proposed an auto-encoder for eliminating irrelevant roadside objects in images before feeding into a CNN [14]. Drews et al. integrated CNN, cost maps, particle filter, IMUs, and wheel speed measurement in their autonomous system [15].
Chi et al. developed a neural network architecture combining spatio-temporal convolution, convolutional LSTM, and LSTM [16]. Hou et al. proposed a novel fast recurrent fully convolutional network (F-RFCN) [17]. Okamoto et al. proposed a combined self-driving approach that converts a frontal-view image to a top-view image with a CNN and then feeds it to a dynamic Bayesian network (DBN) to compute affordances [18]. Different output configurations have also been tried (not necessarily direct control output); these approaches fall into the partially end-to-end learning category. Chen et al. replaced direct control commands with affordance information in their neural network output [19]. Chen et al. inserted a cognitive map between the CNN and RNN in their neural network to detect traffic conditions [20]. Weiss et al. developed a novel end-to-end deep learning model, AdmiralNet, combining the original PilotNet, LSTM, and 3D convolution. This network works well on a photorealistic F1 racing simulator and a 1/10-scale racecar testbed. Unlike the original PilotNet, which predicts steering commands, they used AdmiralNet to predict Bezier curves directly from pixels and compared four different deep learning methods: PilotNet (pixel-to-control), CNN-LSTM (pixel-to-control), CNN-LSTM (pixel-to-waypoint), and CNN-LSTM (pixel-to-Bezier curve trajectory). The results are impressive, with the AdmiralNet Bezier Curve Predictor outperforming all other methods [21]. This experiment shows that the performance of an autonomous driving system can be significantly improved by adding memory and predicting trajectories rather than control commands.
Some researchers added auxiliary tasks along with the control command prediction task. Xu et al. presented a novel FCN-LSTM network that takes advantage of semantic segmentation (as an auxiliary task) [22]. Chen et al. considered auxiliary tasks, including image segmentation, transfer learning, optical flow, and additional vehicle information [23].
Some researchers experimented with end-to-end methods on different tasks. Pierre proved the feasibility of robotfollowing with an end-to-end approach by comparing several models, including CNN, stacked CNN, Conv3D, and RNN [24]. Jeong et al. proposed a classifier to predict whether the next lane is free [25].
Jaritz et al. used deep reinforcement learning to drive a robot in the WRC6 racing game [26]. Liang et al. applied Controllable Imitative Reinforcement Learning (CIRL) to drive a robot in the CARLA simulator [27]. Sallab et al. implemented deep reinforcement learning to drive a robot in the TORCS racing game [28].
Nie et al. proposed an Integrated Multimodality Fusion Deep Neural Networks (IMF-DNN) framework that can handle both objection detection (as an auxiliary task) and driving behaviors (steering angle and speed) prediction by fusing camera and Lidar data [29]. Sobh et al. tested several camera lidar fusion methods in the CARLA simulator and found the fusion of Polar Grid Mapping (PGM) and semantic segmentation performed the best [30].
This paper aims to solve new tasks in an indoor environment with memory models, including driving centered between two walls and making a three-point turn at a dead end. Several models are investigated: PilotNet, CNN+LSTM (ours), the CNN+LSTM of Weiss et al. [21], and CNN3D (ours). Weiss' CNN-LSTM (pixel-to-Bezier curve trajectory) model performs excellently in fast racing games; however, it requires positioning equipment on the actual car to collect waypoints, which our ModCar does not have, so this paper compares only the pixel-to-control methods. Figure 2 presents the simplified ModCar control flow. UWA's ModCar uses the RoBIOS GUI [31] as the low-level control program, so users can efficiently operate it via a touch screen or gamepad. The main interface is shown in Fig. 3a; there are three driving modes: manual, camera neural network, and Lidar. As presented in Fig. 3b, the manual drive mode uses the gamepad to control the speed and steering of the robot, and images can be recorded at a 30Hz frame rate. The Lidar drive mode is visualized in Fig. 3d; it uses the Lidar sensor to measure distances to the surrounding environment and feeds them into a fine-tuned algorithm that calculates the speed and steering of the robot. The camera neural network drive mode is illustrated in Fig. 3c; it uses the camera to capture images and feeds them into a trained neural network that generates the speed and steering of the robot.

Neural Network Models
This section introduces the neural network models implemented in the project. Previous work experimented with the original PilotNet, using speed as an additional output and dropout for regularization of the deep network. We adopt this architecture with extra batch normalization, as illustrated in Fig. 4a. The input is an image in YUV color space with a width of 200 pixels, a height of 66 pixels, and 3 color channels. Input normalization converts the image pixel values from [0, 255] 'uint8' type to [-1, 1] 'float' type. The model uses the ELU activation function instead of ReLU for better performance [32]; its output range is (-1, ∞) when alpha equals one. If the input is not normalized, the activation function can cause training instabilities [33], so the next two models also apply input normalization.
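As a concrete reference, a minimal Keras sketch of this PilotNet-style baseline follows; it assumes the layer sizes of the original NVIDIA architecture, and the dropout rate and dense sizes are assumptions rather than the exact settings of Fig. 4a:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pilotnet():
    # Input: 66x200 YUV image, already scaled from [0, 255] to [-1, 1].
    inputs = layers.Input(shape=(66, 200, 3))
    x = inputs
    # Five convolutional layers as in the original PilotNet,
    # each followed by batch normalization and ELU activation.
    for filters, kernel, stride in [(24, 5, 2), (36, 5, 2), (48, 5, 2),
                                    (64, 3, 1), (64, 3, 1)]:
        x = layers.Conv2D(filters, kernel, strides=stride)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ELU()(x)
    x = layers.Flatten()(x)
    for units in (100, 50, 10):
        x = layers.Dense(units)(x)
        x = layers.ELU()(x)
        x = layers.Dropout(0.2)(x)  # dropout rate is an assumption
    # Two outputs: steering angle and speed.
    outputs = layers.Dense(2)(x)
    return models.Model(inputs, outputs)
```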
The CNN3D model (Fig. 4b) uses 3D convolutional layers instead of 2D ones, with the additional dimension serving as a timeline. Specifically, when the input is an image sequence, the model obtains both spatial and temporal information. The GlobalAveragePooling3D layer is similar to the GlobalAveragePooling2D layer but pools over one additional dimension.
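A sketch of how such a 3D-convolutional network might look in Keras, assuming a sequence of five 66x200 YUV frames as input; the filter counts and kernel shapes are assumptions, not the exact architecture of Fig. 4b:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn3d(seq_len=5):
    # Input: a sequence of seq_len frames, each 66x200x3.
    inputs = layers.Input(shape=(seq_len, 66, 200, 3))
    x = inputs
    # 3D convolutions slide over (time, height, width) simultaneously,
    # so each filter sees both spatial structure and motion.
    for filters in (24, 36, 48):
        x = layers.Conv3D(filters, kernel_size=(3, 5, 5),
                          strides=(1, 2, 2), padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ELU()(x)
    # GlobalAveragePooling3D averages over time, height and width,
    # leaving one value per filter and no extra parameters to fit.
    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(50)(x)
    x = layers.ELU()(x)
    outputs = layers.Dense(2)(x)  # steering angle and speed
    return models.Model(inputs, outputs)
```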
The CNN+LSTM model (Fig. 4c) inserts an LSTM layer between the 2D convolutional layers and the fully connected (dense) layers. Furthermore, the input changes from a single image to an image sequence (in this case, five images in a row): the current image and the past four images are passed through the convolutional layers separately and then fed into the LSTM layer together. The LSTM layer extracts information along the timeline and helps the model make better decisions. This model also uses GlobalAveragePooling2D layers in place of Flatten layers; the latter transform a multi-dimensional tensor into a one-dimensional tensor, which risks overfitting, especially for complex models. The GlobalAveragePooling2D layer avoids this problem because it averages out spatial information and has no parameters to optimize. Batch normalization transforms the data to zero mean and unit variance, reducing the "Vanishing or Exploding Gradient Problem" [33], allowing a higher learning rate and speeding up the entire training process.
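A minimal Keras sketch of this idea, applying the same convolutional stack to every frame via TimeDistributed before the LSTM; the filter counts and LSTM width are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(seq_len=5):
    # Input: five consecutive 66x200x3 frames.
    inputs = layers.Input(shape=(seq_len, 66, 200, 3))
    x = inputs
    # The same 2D convolutional stack is applied to every frame
    # independently via TimeDistributed.
    for filters, kernel, stride in [(24, 5, 2), (36, 5, 2), (48, 5, 2),
                                    (64, 3, 1), (64, 3, 1)]:
        x = layers.TimeDistributed(
            layers.Conv2D(filters, kernel, strides=stride))(x)
        x = layers.TimeDistributed(layers.BatchNormalization())(x)
        x = layers.ELU()(x)
    # GlobalAveragePooling2D per frame yields one feature vector
    # per time step: shape (seq_len, 64).
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    # The LSTM aggregates the per-frame features along the timeline.
    x = layers.LSTM(64)(x)
    x = layers.Dense(50)(x)
    x = layers.ELU()(x)
    outputs = layers.Dense(2)(x)  # steering angle and speed
    return models.Model(inputs, outputs)
```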

Simulation
Before any practical experiments, it is necessary to try the neural network models in simulation, e.g., on the EyeSim platform [34]. EyeSim is a mobile robot simulator for the EyeBot family; it can simulate a mobile car fitted with a camera, Lidar, a PSD sensor, and an LCD screen for camera and data display.

Data Collection
A well-designed autonomous driving algorithm is preferable to manual button-controlled driving for obtaining neural network training data, as it reduces human error. Thus, we apply a simulated Lidar-based algorithm to drive the ModCar in the maze map depicted in Fig. 5. During this process, we use the camera to capture images and store them with a labeled steering angle and speed until adequate training images are captured; the speed and steering angle distributions are illustrated in Fig. 6. The simulation experiment collected 5,110 clockwise driving images and 5,166 anti-clockwise driving images (10,276 images in total). Note that the Lidar drive mode captures 700 images per lap at a frame rate of 10Hz; thus, 10,276 images require 15 laps to collect, giving an adequate training data density. The captured images with labeled steering angle and speed are stored in the npy format, a NumPy-array format that is compact and expandable with additional information such as labels. The speed values (100, 150, 250) correspond to (backward, stop, forward), and the steering angle values (100, 150, 200) correspond to (turning right, straight, turning left). Before neural network training, the images are randomly split into three datasets: training, validation, and testing, with the latter containing 0.4% of the total images; the remaining images are split in a 4:1 training-to-validation ratio.
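A sketch of how the labeled samples might be stored and split, assuming each .npy record holds the image together with its two labels; the file layout and helper names are hypothetical, not the project's actual interfaces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Discrete label encodings used throughout the project.
SPEED = {'backward': 100, 'stop': 150, 'forward': 250}
STEER = {'right': 100, 'straight': 150, 'left': 200}

def save_sample(path, image, speed, steering):
    # One expandable record per frame: image plus its two labels.
    np.save(path, {'image': image, 'speed': speed, 'steering': steering})

def split_dataset(samples, test_ratio=0.004):
    # Hold out 0.4% of all images for testing, then split the rest 4:1.
    rest, test = train_test_split(samples, test_size=test_ratio,
                                  shuffle=True, random_state=0)
    train, val = train_test_split(rest, test_size=0.2,
                                  shuffle=True, random_state=0)
    return train, val, test
```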

Data Preprocessing
The driving program captures 320x240 RGB images via the RoBIOS API. As shown in Fig. 7 (left), each image is first converted to the YUV color space, then passed through a Gaussian blurring filter, and finally resized to 200×66 pixels with the OpenCV resize function. The YUV color space helps adjust the brightness. The Gaussian blurring filter smooths the resizing process and prevents image distortion [35]. Given that 200×66 is the input size of the original PilotNet and the entire architecture is built around this size, no modification is needed there. Additionally, the smaller size reduces the computational load, especially when generating image sequences for the CNN+LSTM and CNN3D models.
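A minimal OpenCV sketch of this pipeline; the Gaussian kernel size is an assumption:

```python
import cv2
import numpy as np

def preprocess(rgb_image):
    # Convert the 320x240 RGB frame to YUV, as in the PilotNet pipeline.
    yuv = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2YUV)
    # Gaussian blur smooths the subsequent downsampling step.
    blurred = cv2.GaussianBlur(yuv, (3, 3), 0)
    # Resize to the 200x66 (width x height) input expected by the network.
    resized = cv2.resize(blurred, (200, 66))
    # Scale pixel values from [0, 255] uint8 to [-1, 1] float.
    return resized.astype(np.float32) / 127.5 - 1.0
```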

Data Augmentation
Data augmentation increases the dataset's size by generating data variants. This can improve the model's generalization and robustness [36], but excessive augmentation introduces so much noise that it prevents the model from learning. Thus, the extent of data augmentation must be chosen carefully. As illustrated in Fig. 7 (right), our simulation experiment uses two data augmentation types: blurring and flipping. Blurring helps the model learn more holistic features without overlearning the minutiae [35], while image flipping balances the uneven distribution of left and right steering data and eliminates the dataset's inherent bias, assisting the model's learning [37]. Each augmentation is randomly applied to 5% of the training data in a single batch.
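A sketch of these two augmentations; since the steering encoding (100=right, 150=straight, 200=left) is symmetric around 150, we assume a horizontal flip maps a label s to 300 - s:

```python
import cv2
import random

def augment(image, steering):
    # Randomly blur 5% of the samples in a batch.
    if random.random() < 0.05:
        image = cv2.GaussianBlur(image, (5, 5), 0)
    # Randomly flip 5% of the samples horizontally; reflecting the
    # steering label around the straight value 150 gives 300 - steering.
    if random.random() < 0.05:
        image = cv2.flip(image, 1)  # 1 = horizontal flip
        steering = 300 - steering
    return image, steering
```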

Image Sequence Generation
To generate an image sequence for the CNN+LSTM and CNN3D models, we arrange the collected pictures in order and slide a sequence-length window from the beginning to the end, one image at a time. Each sequence uses the labels of its last image.
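A minimal sketch of this sliding-window scheme:

```python
import numpy as np

def make_sequences(images, labels, seq_len=5):
    # Slide a window of seq_len frames over the ordered recording;
    # each sequence is labeled with the labels of its last frame.
    sequences, seq_labels = [], []
    for end in range(seq_len, len(images) + 1):
        sequences.append(images[end - seq_len:end])
        seq_labels.append(labels[end - 1])
    return np.array(sequences), np.array(seq_labels)
```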

Model Training
The model training employs the bagging strategy. Bagging, or bootstrap aggregation, refers to randomly sampling the dataset with replacement in each batch [33], significantly reducing the variance of the resulting model. In addition, overfitting is avoided and training time saved by including early stopping, learning rate decay, batch normalization, and dropout. This project has four neural network models to train. We first fit the data to each model with its specific inputs and then train it on Ubuntu 18.04 with one NVIDIA TITAN X (Pascal) GPU. Once training finishes, we copy the models to the EyeSim model folder for further use. Table 1 lists the basic parameter settings for model training.
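A sketch of such a training setup with early stopping and learning-rate decay, assuming a compiled Keras model and an MSE loss on (speed, steering); the optimizer, patience, and decay values shown are assumptions, not the settings of Table 1:

```python
import tensorflow as tf

def train(model, train_ds, val_ds, epochs=100):
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')
    callbacks = [
        # Stop once the validation loss stops improving.
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                         restore_best_weights=True),
        # Halve the learning rate when the validation loss plateaus.
        tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                             factor=0.5, patience=5),
    ]
    return model.fit(train_ds, validation_data=val_ds,
                     epochs=epochs, callbacks=callbacks)
```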

Validation and Comparison Methods
Two kinds of metrics are considered. Open-loop: split the data into training and validation sets with a 0.8 training ratio, then calculate the Mean Square Error (MSE) of each model. Closed-loop: the ModCar driving program has a pause function; each time the robot hits a wall or stops moving, the pause button is pressed and counted as an intervention. We set the manual reset time to 5 seconds; the elapsed time is the total time minus the pause time.
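Under these definitions, the autonomy figure reported later can be computed in the style of NVIDIA [5]; assuming each intervention costs the 5-second human reset time (HRT), a plausible form is:

$$
\text{Autonomy} = \left(1 - \frac{\text{NoI} \times 5\,\mathrm{s}}{\text{elapsed time}}\right) \times 100\%
$$

where NoI is the number of interventions.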

Practical Experiments
This project considers three practical experiments: driving in a rectangular loop, and turning around in each of two corridors, one of which is a non-trained environment.

Data Collection
We apply the Lidar-based algorithm to drive ModCar on the 4th floor of the EECE building and the ground floor of the CME building at UWA.
The 4th floor of the EECE building has a rectangular loop on which the Lidar drive mode can run continuously (Fig. 8, red zone). We first prioritized right turns in the Lidar drive mode and collected 30,679 clockwise driving images, then prioritized left turns and collected 30,518 anti-clockwise driving images (61,197 images in total). The Lidar drive mode captures 590 images per lap at a frame rate of 10Hz; thus, 61,197 images require 104 laps to collect. This data density far exceeds the simulated data density because the real environment is complicated and the data is unevenly distributed. As illustrated in Fig. 9, the steering angle in most pictures is labeled 150, i.e., the car is driving in a straight line. This imbalance causes the model to ignore the few turns and prevents correct learning. Thus, both more data and data augmentation are needed to compensate for this bias.
The ground floor of the CME building has a corridor (Fig. 10, red zone) that is not wide enough for the robot to make a complete U-turn, so the robot uses a three-point turn strategy, as shown in Fig. 11. It stays in the middle between the walls on both sides. If it encounters a dead end, it turns left until it approaches the wall, turns right, reverses, and finally turns left to the center of the road to repeat the initial steps. For this scenario, we trained two versions of the neural network models. For version one (V1), we collected 47,838 images. The Lidar drive mode captures 560 images in one round trip at a frame rate of 10Hz; thus, 47,838 images need 86 round trips to collect. The data density is sufficient, but due to design defects in the Lidar driving program, the robot has a 50% chance of turning left and a 50% chance of turning right when encountering a dead end. This inconsistent behavior is not conducive to training the neural network models, so data cleaning is used to reduce its effect: we manually delete the images where the car turns right and duplicate the images where the car turns left to compensate for the loss. For version two (V2), we collected 75,845 images; at 560 images per round trip and 10Hz, this requires 135 round trips. In this version, we use a different strategy to eliminate the inconsistent behavior of the Lidar program, namely image flipping: flip all right turns into left turns. The collected speed and steering angle distributions are presented in Fig. 12.
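A sketch of the V2 cleaning strategy, assuming right-turn frames can be identified from their steering labels; as in the simulation augmentation, flipping mirrors the image and reflects the steering label around the straight value 150 (an assumption about the implementation, not the project's exact code):

```python
import cv2

def unify_turns(image, steering, speed):
    # V2 strategy: convert every right turn (steering < 150) into a
    # left turn by mirroring the image horizontally and reflecting
    # the label around 150, so 100 becomes 200.
    if steering < 150:
        image = cv2.flip(image, 1)
        steering = 300 - steering
    return image, steering, speed
```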

Data Preprocessing
As shown in Fig. 13, the practical data preprocessing follows essentially the same process as in simulation, except that the images are taken through a fisheye lens and therefore require additional processing to remove lens distortion. Specifically, the images are cropped by 25% at the top and 31.25% at the bottom [38]. Apart from this step, the images are resized to 230x66 instead of 200x66 to allow additional data augmentation.
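A sketch of this step, assuming the crop fractions apply to the 240-pixel image height; the exact crop anchor is an assumption:

```python
import cv2

def crop_and_resize(image):
    # Input: 320x240 fisheye frame. Remove 25% from the top and
    # 31.25% from the bottom to discard the most distorted regions.
    h = image.shape[0]                  # 240
    top = int(0.25 * h)                 # 60
    bottom = h - int(0.3125 * h)        # 240 - 75 = 165
    cropped = image[top:bottom, :]      # 105 rows remain
    # Resize slightly wider than the network input (230x66 instead
    # of 200x66) to leave room for shift augmentation.
    return cv2.resize(cropped, (230, 66))
```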

Data Augmentation
The practical experiments use additional data augmentation processes: brightness changing (Fig. 14) and image shifting and rotating (Fig. 15). As the brightness at both locations changes with sunlight, we randomly modify the image brightness within the range [-20%, +20%], applying the modification to 5% of the training data in a single batch. In addition, the images collected by the Lidar autonomous driving algorithm always show the car at the center of the road, so there is not enough diversity of lane positions; when the car deviates from the trained trajectory, it cannot recover. While NVIDIA addressed this diversity problem with image shifting and rotating in 2020 [39], this paper uses a simpler method to achieve the same effect.
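A sketch of the brightness modification, plus a random-crop shift that exploits the extra 30 pixels of width from preprocessing; this crop-based shift is our assumed reading of the "simpler method," not a confirmed implementation detail:

```python
import random
import numpy as np

def augment_practical(image):
    # Brightness: scale pixel values by a random factor in
    # [0.8, 1.2] (i.e. -20%..+20%), applied to 5% of a batch.
    if random.random() < 0.05:
        factor = random.uniform(0.8, 1.2)
        image = np.clip(image.astype(np.float32) * factor,
                        0, 255).astype(np.uint8)
    # Shift: crop a random 200-pixel-wide window out of the
    # 230-pixel-wide frame, simulating lateral displacement.
    offset = random.randint(0, image.shape[1] - 200)
    return image[:, offset:offset + 200]
```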

Model Training
Practical model training is almost identical to the simulation setup, except that the CME models use a speed weight α=1 and a steering weight β=2, and one more model (Weiss' CNN+LSTM) is trained. Weiss' CNN+LSTM architecture is similar to our CNN+LSTM, but it connects the prediction layer directly after the LSTM layer. Once model training is finished, we copy the models to the Raspberry Pi model folder for deployment.

Validation and Comparison Methods
The practical validation and comparison methods are almost identical to the simulation ones, except that the CME experiments are evaluated by NoI instead of Autonomy, because NoI is more representative in this case: the assumption HRT = 5s does not hold, as the robot may stop in place or shake back and forth, and such cases should count toward HRT but are difficult to measure. Table 2 shows the simulation results.

Open-loop Test
The CNN+LSTM model delivers the best performance in the open-loop test, but the metrics used do not necessarily correlate with real driving performance. Because the Lidar drive mode used to collect data is not the optimal solution, a neural network model may surpass Lidar's performance during the learning process; conversely, the more closely a model mimics the Lidar drive mode, the more it is restricted by the Lidar driving performance.

Closed-loop Test
In the closed-loop test, each model runs six laps to calculate the mean lap time and is then subjected to two 10-minute autonomy tests. The results show that the CNN+LSTM model achieves the shortest time per lap; since each model learns from the same speed samples, this suggests that the CNN+LSTM model attains a better solution and travels a shorter route. Overall, all neural network models except Weiss' CNN+LSTM drive flawlessly through both a clockwise and a counterclockwise 10-minute autonomy test without requiring any human intervention. Additionally, all neural network models are much faster than the Lidar model they learned from. Furthermore, the results indicate that the Lidar drive algorithm does not apply optimal route planning, deliberately keeping a distance from each obstacle, which significantly limits its driving performance. Notably, adding optimal route planning to this method would require significant effort, whereas the neural networks learn it automatically.

Prediction
The prediction process is very similar for all neural network models; here we use the CNN+LSTM model as an example. Although only the last image of each sequence is shown, the corresponding predictions use five consecutive images as input. As presented in Fig. 16, the predicted steering angles differ significantly from the actual ones, which allows the neural network models to complete a lap faster than the Lidar model.

Obstacle Avoidance Test
In this trial, we add new walls as obstacles to test the model's ability to avoid obstacles in the simulation map.
All neural network models react to obstacles and change their routes, indicating that they have successfully learned to recognize walls and plan routes around them. For illustration purposes, the new walls (obstacles) in Fig. 17 are marked in red. Although the Lidar model avoids these obstacles, it is sometimes too sensitive and forces the robot to stop and turn around. This suggests that avoiding obstacles well with the Lidar algorithm requires several adjustments every time it encounters a new environment, limiting its effectiveness. By contrast, the neural network models adapt well to the new environment without any adjustment. Table 3 shows the EECE rectangular loop results.

Open-loop Test
In the open-loop test, all models attain similar performance: the CNN+LSTM model has the best validation loss, while PilotNet has the best training loss.

Closed-loop Test
In the closed-loop test, each model runs three laps clockwise and three laps anti-clockwise to calculate the mean lap time and autonomy. All neural network models can achieve nearly full autonomy except for some failures caused by environmental changes, e.g., broken light bulbs and new obstacles.

Prediction
All neural network models can drive autonomously, even with some incorrect predictions; here we use the CNN+LSTM model as an example. In Fig. 18, the left prediction shows the model drifting from the trained trajectory and therefore producing different steering angles; still, the image shifting and rotating augmentation gives the models the ability to recover. The right prediction shows the robot driving under poor lighting conditions, yet the models identify the core features and make the right decision.

CME Corridor Results
Tables 4 and 5 show the CME corridor results.

Open-loop Test
In the open-loop test, the proposed CNN+LSTM model performs best, while Weiss' CNN+LSTM has the highest validation loss, consistent with the closed-loop test results.

Closed-loop Test
In closed-loop test version one (V1), each model runs six laps to calculate the mean lap time and autonomy. Among the neural network models, only the CNN3D and CNN+LSTM models can complete a lap without human intervention; the PilotNet and Weiss' CNN+LSTM models keep the car in the middle between the walls when going straight but jitter at the dead end and cannot turn around. In closed-loop test version two (V2), the models are tested in a trained corridor with new obstacles and in a non-trained corridor. Each model runs six laps in the trained corridor (C1) and three laps in the non-trained corridor (C2) to calculate the mean lap time and NoI. None of the neural network models can complete a lap without human intervention, because the corridor was free during training but contained new, unknown obstacles during the experiment. However, the proposed CNN+LSTM and CNN3D models achieve a lower mean lap time and NoI. The PilotNet model keeps the car in the middle between the walls when going straight but jitters at the dead end and cannot turn around; Weiss' CNN+LSTM performs even worse than PilotNet. Figure 19 illustrates two predictions where the image is labeled with the wrong speed due to Lidar algorithm flaws, yet the CNN3D model predicts correctly. The left picture shows a case where the robot should drive forward, but the image is labeled backward; the right picture shows a case where the robot should turn left, but the image is labeled straight. Such interference increases the training difficulty, but models with image sequences show greater tolerance.

Saliency Map
As shown in Fig. 20, all neural network models have successfully highlighted the walls' edges in the saliency map.

Conclusion
This research verifies the significance of end-to-end learning for autonomous driving systems and the improvement of the original PilotNet model by adding LSTM or 3D convolutional layers; however, the architecture must be well designed. This research also verifies the feasibility of end-to-end driving in indoor environments. The improvements are:
• Complex motions: The models can produce more complex motions such as a three-point turn at a dead end, in which the robot needs to make a series of moves to turn around. These actions form a temporal sequence, but PilotNet has no access to temporal information; for it, these actions look like a single input corresponding to multiple outputs. Both the CNN+LSTM and CNN3D models solve this problem by virtue of their processing of temporal information.
• Recovery from failure: Even if the robot does not succeed in a single turnaround at a dead end, the CNN+LSTM and CNN3D models repeatedly try until they succeed, whereas PilotNet gets stuck there.
• Better driving performance: In all experiments, the proposed CNN+LSTM and CNN3D models show similar performance, with the CNN+LSTM model slightly better, and both models have better closed-loop performance than PilotNet.
Note that in these experiments the performance of the CNN+LSTM and CNN3D models is limited by the Raspberry Pi's computing power: at low FPS the model cannot turn in time if the speed is high. Additionally, the inconsistent gap between time steps caused by speed decay increases the difficulty of temporal information extraction.
Future work will include the following tasks:
• Waypoint prediction: Predict waypoints using the Inertial Measurement Unit (IMU) to obtain future locations and optimize the path, and use IMU-odometry fusion to derive the pose and direction of the vehicle.
• Ego-state input: Add different types of input, i.e., past steering angles and speeds. One feature of the Lidar drive mode that no model can currently emulate is braking. The robot in this project simulates the braking effect by reversing its motor, but when images with braking are used as a dataset, even a model with image sequences as input cannot tell when to brake. When judging the brakes, a driver considers the immediate view as well as the speed and direction, so adding past speed and steering can help the neural network model learn more complex driving behaviors.
• Parking: Train the neural network models to park. The action of turning around at a dead end is similar to parking, so the CNN+LSTM and CNN3D models may be able to learn parking actions. Still, parking is more challenging to learn with only the front camera, because a driver usually needs to judge the scene behind the vehicle. Installing an additional camera at the robot's rear and using images from multiple cameras as input may solve this problem.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions. The authors also acknowledge the assistance received during the ModCar experiments and would like to express their gratitude to Nvidia for donating the two GPUs used for the neural network training.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.