Imitation learning based decision-making for autonomous vehicle control at traffic roundabouts

The essence of developing an advanced driver assistance system is to learn human-like decisions that enhance driving safety. Joining a roundabout smoothly and in a timely manner is a challenging task even for human drivers. In this paper, we propose a novel imitation learning based decision-making framework that provides recommendations for joining roundabouts. The proposed approach takes observations from a monocular camera mounted on the vehicle as input and uses deep policy networks to decide when is the best time to enter a roundabout. The domain expert guided learning framework not only improves the decision-making but also speeds up the convergence of the deep policy networks. We evaluate the proposed framework by comparing it with state-of-the-art supervised learning methods, including conventional methods such as SVM and kNN, and deep learning based methods. The experimental results demonstrate that the imitation learning based decision-making framework outperforms the supervised learning methods and can be applied in driving assistance systems to facilitate better decision-making when approaching roundabouts.


Introduction
The development of autonomous vehicles (AVs) and advanced driver assistance systems (ADAS) plays a key role in contemporary intelligent transportation systems. An autonomous vehicle captures environmental data via sensors to navigate without human intervention [2]. As highlighted in [44], AVs can not only carry out basic manipulations, such as acceleration, deceleration, braking, forward and backward movement, turning and other conventional vehicle functions, but also accomplish high-level tasks, such as mission planning, path planning, intelligent obstacle avoidance and other human-like behaviors. Although many AV manufacturers have made significant progress, e.g. the Google self-driving car in the U.S. [21], VisLab's BRAiVE in Italy [9] and Jaguar Land Rover in the U.K. [26], it is still a great challenge for AVs to make decisions in complex environments, e.g. a busy urban environment with multiple junctions [37] or with numerous objects moving in various directions [3].
A traffic roundabout is a looping junction in which road traffic is restricted to travel in one direction around a central island, with priority given to vehicles that have already entered the roundabout [29]. Roundabouts in the U.K. have seen many accidents due to human drivers' misjudgements of the speed, distance or intention of approaching vehicles [14]. In addition, there are many different types of roundabouts, e.g., mini-roundabouts, signalized roundabouts and non-signalized roundabouts [37]. Therefore, it is challenging to provide intelligent recommendations for joining a roundabout without disrupting the traffic flow.
In recent years, artificial intelligence and machine learning methods have been widely applied to decision-making at complex junctions [14]. Qi et al. [40] use convolutional neural networks (CNNs) to detect vehicles so that an AV can make decisions based on the environmental context. In [15], a behavioral rule-based model is built that takes vehicle angles, speeds and crossroad diameters into consideration to deal with issues arising at crossroads. In [44], an adaptive tactical behavior planner (ATBP) is proposed to simulate human-like motion behaviours at non-signalized roundabouts by analysing an individual driver's historical navigation patterns. In [16], Gritschneder et al. design a reinforcement learning framework that generates optimal actions via a multi-layer perceptron neural network, based on observations obtained from a GPS system that reflect the position and motion of the nearest vehicles. Imitation learning (IL) has become one of the most popular learning frameworks due to its ability to leverage domain expert knowledge [23]. An IL model shares a similar idea with reinforcement learning but avoids the randomized trial mechanism that the reinforcement learning framework uses when optimizing control actions. It is therefore more suitable for tasks that cannot afford the costs incurred by random trials. Furthermore, it can speed up the training of the deep control policy models compared with conventional supervised learning models. Thus, many IL systems, such as [19,28,38,48], have been proposed to control AVs in on-road and off-road tasks.
In this paper, we propose an IL-based decision-making system that provides intelligent recommendations for joining a roundabout in a timely and safe manner. Specifically, a deep learning based IL system is trained to learn how human drivers manipulate vehicles based on observations of other vehicles in roundabouts. In addition, we investigate how different backbone architectures, such as VGG-16 and ResNet-18, affect the learning performance. The novelty of our paper is as follows: (1) we propose an imitation learning-based decision-making system (ILBDM) for joining roundabouts in a timely and safe manner. To the best of our knowledge, this is the first system that provides guidance for drivers at roundabouts by exploiting imitation learning; (2) we provide a new roundabout-entering dataset for AV research. As data is the main driving force for the development of new deep learning-based algorithms, our work paves the way to solving a difficult high-level control task; and (3) we evaluate the proposed ILBDM system comprehensively to demonstrate the superior performance of imitation learning over supervised learning for a sequential decision-making task.
This paper is organized as follows. Section 2 describes the related work, which covers intelligent transportation systems, neural computational models for autonomous vehicle decision-making, and autonomous vehicle decision-making at roundabouts. In Section 3, the proposed ILBDM system is explained in detail. After presenting the overall IL framework, we provide the technical details of extracting observations from the driving environment, including car detection, motion feature extraction and backbone network architectures, and of how we set up the reward scheme to train the system. In Section 4, we evaluate the performance of the proposed framework under different backbone network architectures. Furthermore, the experimental results demonstrate that the proposed method outperforms systems based on supervised learning algorithms. In Section 5, conclusions are drawn and ideas for future research are discussed.

Intelligent transportation system
Intelligent transportation systems (ITS) have become key to reducing the negative impact of traffic congestion and pollution, which are the most serious contemporary issues caused by rapid urbanisation [13,17]. According to [13], ITS can address these issues through (1) routing optimisation, (2) intelligent traffic light control, and (3) decentralised multi-agent communications. Routing optimisation is an active research field of ITS. Many optimisation algorithms, such as the genetic algorithm [5], the ant colony algorithm [1] and the particle swarm optimisation algorithm [10], have been proposed for optimal path planning in vehicle navigation. Intelligent traffic light control provides another solution for reducing traffic congestion. Chen et al. [10] propose a real-time traffic light control algorithm that adjusts both the sequence and the length of traffic lights using several traffic factors, including traffic volume, waiting time, and traffic density. Vallati et al. [47] design a PDDL+ encoding planning module to optimise traffic light control for congestion caused by unexpected accidental events. In [25], two intelligent traffic light control schemes are used in fog computing to resist malicious vehicles and to avoid single-point failure.
In addition to these ITS systems, advanced vehicle control systems have emerged as a technology that helps to solve these traffic problems and to enhance driving safety.

Neural computational autonomous vehicle control
Control policy neural networks have been widely used in autonomous vehicle control systems since the work of [39]. In [39], a three-layer back-propagation neural network named ALVINN is proposed that takes road images as input and produces travel directions as output. In [8], a deep learning network called PilotNet is used to estimate steering angles by extracting salient objects from visual input data. Reinforcement learning (RL) approaches have been deployed in many AV systems in recent years, e.g., [12,53,54]. Wolf et al. [54] propose a deep Q-network (DQN) policy network to steer a vehicle in a simulated driving environment. In [12], several deep reinforcement learning methods, including DQN, Deep Deterministic Actor Critic and Deep Attention Reinforcement Learning, are trained to control a vehicle in The Open Racing Car Simulator (TORCS) to demonstrate the feasibility of the RL framework for AV control tasks. In [53], an RL model predictive control neural network is trained to drive a vehicle around an elliptical dirt track at the Georgia Tech Autonomous Racing Facility. Although RL based methods do not need any labeled training data, most of them have to be trained in a simulated environment to reduce the costs incurred by the exploration steps in the RL framework.
Imitation learning (IL) is an appealing deep learning framework in which a policy network is learned under the guidance of a human domain expert, which speeds up network convergence and enforces strong constraints on the mapping between input observations and output actions. In [59], Zeng et al. use LIDAR data and high-definition maps to find trajectories that minimize predefined losses. The work in [43] combines an imitative model with goal-directed planning to outperform direct IL methods. In [7], a model named ChauffeurNet is trained by taking advantage of both the human expert's driving data and synthesized perturbations of that data. In [11], Codevilla et al. assume that both perceptual input and driver intention are required to make optimal decisions. Therefore, a conditional imitation learning based model is proposed to take driver intention into account in the decision-making process.

Autonomous vehicle decision-making at roundabouts
As one of the most difficult decision-making tasks, vehicle control at roundabouts has attracted significant attention during this decade [18,34,35,37,50,51]. In [18], low-level texture features and motion features are extracted from monocular video sequences to detect and track moving vehicles in roundabouts. The method is tested on the BRAiVE AV/ADAS system and achieves good accuracy at real-time processing speed. In [35], a panoramic stereo-vision based system is designed to detect oncoming vehicles and calculate the time-to-contact, i.e. the estimated time of a potential collision with the ego vehicle. In [37], Okumura et al. propose an action planning method for AVs merging into a roundabout. In this work, four learning inputs (the approaching car's speed, the difference in heading between the vehicle and the road, the distance from the vehicle to the merge point, and the distance from the vehicle to the nearest branch point) support AVs in making the right "enter", "wait" and "merge" decisions. In [49], a grid-based image processing approach (GBIPA) is proposed to characterize traffic situations so that machine learning algorithms can learn the roundabout joining criteria. Approaching car features (position, direction and speed) are extracted by GBIPA as learning inputs, and the classifiers trained with GBIPA are evaluated on test videos captured at roundabouts, where the SVM yields the best performance with a classification accuracy of 90.28%. In [50], Wang et al. design a human-like decision-making system for mini-roundabouts based on both front-view and side-view cameras. In addition, [51] extends the work in [49] and proposes a multi-grid-based image processing approach using multiple cameras (MGC), which deals with two issues: (1) the autonomous vehicle can swiftly change its position and orientation when reaching a roundabout, and (2) drivers' views and behaviors can vary. The proposed MGC includes grids of different sizes to boost accuracy and to protect the autonomous vehicle when entering a roundabout.

Proposed ILBDM system
In this work, we design an IL based decision-making algorithm to facilitate intelligent decisions on entering roundabouts. Considering that vehicle control at roundabouts can be formulated as a sequence of decisions, the IL method is more suitable for the task than conventional supervised learning methods. In particular, the system learns a neural computational model from human expert data, which imposes strong constraints on the search of the solution space when updating the deep policy network. It differs from our previous work [49][50][51] in two respects. First, the IL based model learns to maximize the expected rewards when taking an action at a timestamp, whilst our previous work uses supervised learning methods and thus makes an IID assumption about the control actions at individual timestamps. Secondly, we investigate whether deep policy backbone networks outperform conventional decision models, such as SVM and kNN classifiers, for the roundabout decision-making task.
The proposed ILBDM system is a fast and reliable imitation learning-based approach. In particular, we deploy the Deep Q-Learning from Demonstrations (DQfD) method [22] as our IL system. Although labelled data from domain experts are used as guidance, as in the supervised learning framework, IL emphasizes that decision-making is a continuous process in which decisions made in the past influence decisions made in the future. Therefore, a well-learned function maps states to actions so as to maximise the expected discounted rewards over the entire decision-making process. Following the assumption of the reinforcement learning framework, the task is formulated as a Markov decision process for IL learning. Here, a tuple (S, A, R, T, γ) consists of a set of states S, a set of actions A, a reward function R(s, a), a transition function T(s, a, s') = P(s'|s, a), and a discount factor γ. A policy network π is learned to recommend actions by maximizing the cumulated discounted rewards, which can be expressed as a function Q^π(s, a):

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^{\pi}(s', a') \qquad (1)$$

Here, Q^π(s, a) represents the expected cumulated discounted rewards, R(s, a) represents the immediate reward when taking action a in state s, s' represents the state at the next timestamp, and Q^π(s', a') is the expected maximum reward when taking action a' in state s'.
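To make the recursion behind Q^π(s, a) concrete, the following is a minimal tabular sketch, assuming the usual Bellman backup with a maximum over next actions. The two states, the reward table and the uniform transition probabilities are purely illustrative assumptions and are not taken from the experiments; only the discount factor 0.8 is the value reported later in the paper.

```python
# Toy MDP: the states, rewards and transition probabilities below are
# illustrative assumptions, not values from the paper.
GAMMA = 0.8  # discount factor reported for the experiments

STATES = ["gap_closed", "gap_open"]
ACTIONS = ["wait", "go"]

# R[(s, a)]: immediate reward for taking action a in state s.
R = {
    ("gap_closed", "wait"): 1.0, ("gap_closed", "go"): 0.0,
    ("gap_open", "wait"): 0.0, ("gap_open", "go"): 1.0,
}

# P[(s, a)][s_next]: transition probability P(s' | s, a), uniform here.
P = {(s, a): {"gap_closed": 0.5, "gap_open": 0.5} for s in STATES for a in ACTIONS}


def bellman_backup(q):
    """One sweep of Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    new_q = {}
    for s in STATES:
        for a in ACTIONS:
            expected_next = sum(
                prob * max(q[(s_next, a_next)] for a_next in ACTIONS)
                for s_next, prob in P[(s, a)].items()
            )
            new_q[(s, a)] = R[(s, a)] + GAMMA * expected_next
    return new_q


q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(200):  # repeat until the values stop changing noticeably
    q = bellman_backup(q)

# Greedy policy: pick the action with the largest Q value in each state.
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
print(greedy)
```

In the actual system the table is replaced by a deep policy network, so this sketch only illustrates the quantity that network is trained to approximate.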
An overview of the proposed framework is illustrated in Fig. 1. A monocular camera system mounted at the front of our vehicle captures video sequences of the driving environment. The raw data are fed into a pre-processing pipeline that extracts effective observed states from the environment for decision-making at roundabouts. These form the state space S. At each timestamp, an action a ∈ A is selected by a deep policy network to maximize the expected cumulated rewards over the driving sequence, guided by the demonstrations generated by the expert driver. The goal of ILBDM is to learn a policy π that imitates the expert policy π_E, given demonstrations from the expert driver. A demonstration is defined as a sequence of state-action pairs that results from a policy interacting with the environment, d = {s_1, a_1, s_2, a_2, ...}.
Regarding the loss function, the proposed system learns a policy by minimizing the Huber loss [46] over the set of demonstrations with respect to the policy. The Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. It is described by the following equation:

$$L_{\delta}(y_t, Q_t) = \begin{cases} \frac{1}{2}(y_t - Q_t)^2, & |y_t - Q_t| \le \delta \\ \delta\,|y_t - Q_t| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases} \qquad (2)$$

where y_t is the target output defined by the target network:

$$y_t = R(s_t, a_t) + \gamma \max_{a'} Q'(s_{t+1}, a') \qquad (3)$$

Here Q_t = Q(s_t, a_t) represents the Q value from the deep policy network, Q' denotes the target network, and δ is the control parameter which can be tuned in Eq. 2. The IL training network minimizes the loss until the model converges.
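As an illustration of the loss described above, the sketch below computes a one-step target and the corresponding Huber loss for a single transition. It is a simplified scalar version under stated assumptions: the default δ = 1.0 and the example Q values are hypothetical, and the full DQfD objective in [22] combines further terms (n-step, large-margin and regularization losses) that are not shown here.

```python
def huber_loss(y_t, q_t, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones.

    delta=1.0 is an assumed default; the paper treats it as a tunable parameter.
    """
    err = abs(y_t - q_t)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * err - 0.5 * delta ** 2


def td_target(reward, next_q_values, gamma=0.8, terminal=False):
    """One-step target r + gamma * max_a' Q_target(s', a'); only r at the last step."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical numbers: reward 1 for matching the expert, and target-network
# Q values for the two actions ("wait", "go") in the next state.
y = td_target(reward=1.0, next_q_values=[2.0, 3.0])
small = huber_loss(y, q_t=3.0)    # error below delta: quadratic branch
large = huber_loss(y, q_t=-2.0)   # error above delta: linear branch
print(y, small, large)
```

The linear branch is what keeps a single badly mispredicted frame from dominating the gradient, which is the robustness property the text refers to.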

Perceptional observations
Extracting effective observed states improves the reliability of an intelligent decision-making system, as it makes the system insensitive to noise from the complicated driving environment. Two modules extract perceptional observations from the driving sequences: a vehicle detection module, which uses the Faster R-CNN network [42] to detect vehicles at roundabouts, and a motion extraction module, which extracts their movement features using the optical flow algorithm in [32]. Examples of observations extracted from driving sequences are illustrated in Fig. 2. Here, the detected ROIs are used as filters so that the vehicle movements estimated by the optical flow algorithm can be extracted as the input of the decision-making DL policy networks.
The vehicle detection module is one of the key modules for the roundabout entering decision-making process. In our work, the Faster R-CNN method originally proposed in [42] is adapted to detect the vehicle regions of interest. Faster R-CNN is a two-stage CNN based detection method that includes a Region Proposal Network (RPN) for proposal selection and a classifier to verify the objects among these candidates. The RPN uses the first 13 convolutional layers of the VGG-16 network to generate feature maps and two three-layer regressors to locate the anchor boxes that have high object scores as object proposals. Following the selection stage, these proposals are further verified by a classifier to decide whether there is a vehicle in each candidate box. In our work, a pre-trained model downloaded from [56] is used to detect vehicles at roundabouts.
Fig. 2 Observation stage of the proposed ILBDM system. Images in the first row show vehicle detection results, images in the second row visualize the optical flow of the corresponding frames (hue represents optical flow direction and saturation represents optical flow magnitude, following [32]), and images in the third row show the masked optical flow used as the observations
After experimental comparisons of several state-of-the-art methods, Faster R-CNN is selected for the vehicle detection module in our system, as it is the most effective method for processing our collected data in terms of both accuracy and processing speed. The comparison with single shot detection (SSD) [31], Inception [57] and Mask R-CNN [60] is shown in Table 1. In particular, 1000 frames covering different weather conditions and various types of roundabouts, extracted from 20 randomly selected sequences, are tested with the four algorithms. The precision of Faster R-CNN is 2.5% higher than that of Mask R-CNN, 8.52% higher than that of Inception, and 23.55% higher than that of SSD. In terms of detection time, Faster R-CNN spends 0.12 s per image, which is 0.03 s faster than Mask R-CNN and 1.06 s faster than Inception. Although SSD has the shortest detection time of all the algorithms, its false negative (FN) rate is far from satisfactory.
It is important to reduce the FN count to a minimal level, considering that a missed vehicle is riskier than a false detection (FP). Therefore, we re-set the parameters of the Faster R-CNN method to ensure that a minimal FN rate is achieved. As presented in Table 1, we achieve 14 FNs at the cost of accepting 74 FPs when designing our vehicle detection module. Because the frame rate is 30 frames per second, this number of false negatives is acceptable, as on average about one vehicle is missed in every 100 frames. Although there are more false detections (FPs) of vehicles in the sequences, as illustrated in Fig. 3 (a-c), they have little impact on the final decision-making, because the filtered optical flow features are used as the perceptional observations and the movements in these falsely detected regions are not significant (illustrated in Fig. 3 (d-f)).
Motion extraction is the second core module for perceptional observation extraction. As mentioned in [24], the velocities of approaching vehicles are the most important feature when a decision is made to enter a roundabout. In our system, the optical flow algorithm in [32] is deployed to extract features that represent the vehicle movements. Due to its accuracy and robustness, this method has been widely used in many motion-based applications, e.g., [52,55]. Figure 2 illustrates the estimated optical flow when our vehicle approaches a roundabout. Here, we use a color-map scheme to visualize the optical flow based on both its magnitude and its direction. A blue hue indicates that the main direction of the optical flow is to the left, while a red hue indicates that the optical flow at the corresponding pixels is to the right. This demonstrates that the movement feature can be an efficient representation for the control task. Due to the complicated environment at roundabouts, movements of irrelevant objects could easily distract the decision-making on entering the roundabout. Therefore, the ROIs from the vehicle detection module are used as masks to filter the optical flow features, as illustrated in the third row of Fig. 2.
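The masking step described above can be sketched as follows. This is a minimal illustration, assuming the optical flow is available as an H × W × 2 array of per-pixel displacements and that the detector returns axis-aligned pixel boxes; the function name `mask_flow`, the box format and the tiny example array are hypothetical and do not reproduce the implementation used in the experiments.

```python
import numpy as np


def mask_flow(flow, boxes):
    """Zero the optical flow outside the detected vehicle boxes.

    flow  : float array of shape (H, W, 2) holding per-pixel (dx, dy).
    boxes : iterable of (x1, y1, x2, y2) pixel boxes, with x2 and y2 exclusive.
    """
    h, w = flow.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        # Clip every coordinate to the image so boxes that are partly or
        # fully outside the frame cannot index from the wrong side.
        x1, x2 = min(max(x1, 0), w), min(max(x2, 0), w)
        y1, y2 = min(max(y1, 0), h), min(max(y2, 0), h)
        mask[y1:y2, x1:x2] = True
    # Broadcast the 2-D mask over the two flow channels.
    return flow * mask[..., None]


# Tiny synthetic example: uniform flow and one 2 x 2 detection box.
flow = np.ones((4, 6, 2), dtype=np.float32)
masked = mask_flow(flow, [(1, 1, 3, 3)])
print(masked[..., 0])
```

Motion inside false-positive boxes still passes through such a mask, which is why the text notes that those regions matter little only when their flow magnitude is small.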

Decision making policy network backbones
Many DL based backbone networks have been developed for various learning tasks during this decade, e.g., AlexNet [4], GoogleNet [6], the VGG family (including VGG-16 and VGG-19) [33], the ResNet family (including ResNet-18, ResNet-34, ResNet-50 and ResNet-101) [45], and DenseNet [58]. Considering that the deep policy network in our proposed system is required to output decisions at an acceptable processing speed, we select three backbone networks as the candidate policy networks: a simple CNN, VGG-16 and ResNet-18. The architectures of these backbones are illustrated in Table 2. The CNN architecture is the default backbone used in DQfD. This architecture is concise and efficient for non-linear mappings and has thus been deployed in many classification and control tasks, e.g. [30] and [22]. As illustrated in Table 2, the network has three convolution layers followed by average pooling, with ReLU as its activation function. The first convolution layer contains 6 kernels of size 5 × 5, the second contains 16 kernels of size 5 × 5 and the third contains 120 kernels of size 5 × 5.
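To give a feel for how small the simple CNN backbone is, the following sketch counts the trainable parameters of its three convolution layers from the kernel counts and sizes given above. The number of input channels is not stated in the text, so the value of 3 used here is an assumption, and any layers after the convolutions are not counted.

```python
def conv_params(in_channels, out_channels, kernel=5):
    """Weights plus one bias per output channel for a square convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


IN_CHANNELS = 3          # assumption: three-channel observation image
KERNELS = [6, 16, 120]   # kernels per convolution layer, as described in the text

total = 0
c_in = IN_CHANNELS
for c_out in KERNELS:
    n = conv_params(c_in, c_out)
    print(f"{c_in:>3} -> {c_out:>3} channels: {n} parameters")
    total += n
    c_in = c_out  # the next layer consumes this layer's output channels
print("total convolution parameters:", total)
```

Under this assumption the convolutional part has roughly fifty thousand parameters, far fewer than VGG-16 or ResNet-18, which is consistent with its shorter training time reported later.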
VGG-16 is a popular convolutional neural network model designed by Simonyan and Zisserman [60]. As illustrated in Table 2, VGG-16 adopts a deeper network structure, which has 9 convolution layers. Max pooling is used in the network because it makes changes in images easier to capture, emphasizes local differences, and describes edge textures better. VGG-16 achieves a good trade-off between precision and processing speed as a backbone architecture for a real-time system.
The ResNet architecture plays a dominant role in many recent vision classification and control tasks [20]. Residual learning effectively reduces the impact of the vanishing gradient problem and allows the network to focus on learning detailed patterns. In our work, ResNet-18 is used as one of the backbone architectures due to its suitability for fast decision-making.

Decision making reward scheme
A reward scheme can be used to learn different driving strategies by setting rewards that encourage preferred behaviors. For example, setting larger rewards for the "Go" action when the demonstration provides "Go" guidance leads to a more aggressive driving behavior, while setting larger rewards for the correct "Wait" action leads to a more cautious driving behavior. In the ILBDM system, the reward scheme enters Eq. 1 through R(s, a). The return from a state is defined as the sum of discounted future rewards from time t:

$$R_t = \sum_{i=t}^{T} \gamma^{\,i-t} R(s_i, a_i)$$

where T is the time-step when the AV approaches a roundabout and γ ∈ [0, 1] is the discount factor. Note that γ is set to 0.8 in our experiments. In this work, we adopt a balanced reward scheme to train the system. A positive reward of 1 is provided at each step if the AV/ADAS's action is consistent with that of the human expert driver before entering the roundabout, i.e., true positive and true negative (correct prediction of "Go" and "Wait"), and a reward of 0 is given for an inconsistent decision (false "Go" and false "Wait").
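The balanced reward scheme and the discounted return can be illustrated with a short sketch. The three-step action sequences below are invented for illustration only; the reward values (1 for agreement with the expert, 0 otherwise) and γ = 0.8 are those stated in the text.

```python
GAMMA = 0.8  # discount factor reported in the text


def step_reward(agent_action, expert_action):
    """Balanced scheme: 1 when the agent matches the expert, otherwise 0."""
    return 1.0 if agent_action == expert_action else 0.0


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**(i - t) * r_i from the current step t to the last step T."""
    total = 0.0
    # Accumulate from the last step backwards: G_i = r_i + gamma * G_{i+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical approach to a roundabout: the agent enters one step too early.
expert = ["wait", "wait", "go"]
agent = ["wait", "go", "go"]
rewards = [step_reward(a, e) for a, e in zip(agent, expert)]
print(rewards, discounted_return(rewards))
```

An asymmetric scheme, for example a larger reward for a correct "Wait", could be expressed by changing `step_reward` alone, which is how the cautious or aggressive behaviors mentioned above would be encouraged.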

Experimental results
In this section, we present the experimental results to demonstrate the performance of the ILBDM system. The section covers the experimental settings, the convergence of the reward and loss during training, the decision-making results, and the comparison with benchmarking methods under the supervised learning framework, which include SVM, kNN, and three deep learning based classifiers. The DL based classifiers use the same backbone networks as our system.

Experimental settings
All the videos for this study are real-life driving recordings produced by a camera fixed on the right window of the ego vehicle in order to capture the road conditions on the right side of the car. Video captured in this setting reflects the usual view of a driver at a roundabout in the UK, where priority is given to vehicles approaching from the right [27]. Nextbase 312GW cameras were used due to their reputation for quality and their wide application in traffic experiments [49][50][51]. Nearly 50 different roundabouts across Leicestershire, UK, were filmed over 18 months, from October 2016 to April 2018. The recording time frames were 9 am to 11 am and 3 pm to 6 pm. The morning period normally provides video of satisfactory quality in natural daylight. The afternoon period provides busy peak-hour traffic, which maximizes the complexity at the roundabouts. To evaluate the ILBDM performance, the experiments were run on a computer with an Intel Core i7-7700 CPU operating at 2.80 GHz and a GTX 1060 graphics card. The TensorFlow deep learning framework [41] is adopted in this paper. For data collection, images of 1920 × 1080 pixels at a frame rate of 30 fps were taken from 130 videos recorded as the AV/ADAS approaches a roundabout. The video recorder is the main sensor used in this experiment. Furthermore, the data were split into training and test datasets. The training dataset contains 16,380 images (10,415 for waiting before entering a roundabout and 5,965 for going into the roundabout) and the test dataset contains 1,800 images (1,000 for waiting before entering a [...] Table 3. The benchmarking algorithms include both traditional machine learning techniques and DL based supervised learning classifiers. Support vector machine (SVM) and k-Nearest Neighbor (kNN) [36] are the two classical ML methods in the comparison. Here, the SVM classifier is an RBF SVM with γ = 0.5. For the kNN method, the k closest matching examples in the training dataset are retrieved by comparing Euclidean distances in the feature space, and they determine the decision for the test image. The value of k is set to 5 in our experiment. Furthermore, we compare with supervised learning based DL classifiers that use the same backbone networks in order to demonstrate the advantage of the IL based framework.
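The two conventional baselines can be illustrated with a minimal sketch: a kNN vote with k = 5 using Euclidean distance, and the RBF kernel value that an SVM with γ = 0.5 would compute between two feature vectors. The 2-D feature vectors and labels below are invented for illustration, and the kernel form exp(-γ‖a − b‖²) is assumed to be the convention behind the reported γ.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training samples closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel exp(-gamma * ||a - b||^2); gamma=0.5 as reported for the SVM."""
    return math.exp(-gamma * euclidean(a, b) ** 2)


# Hypothetical 2-D features; the real inputs are the extracted observations.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.0, 0.8)]
train_y = ["wait", "wait", "wait", "go", "go", "go", "go"]

print(knn_predict(train_x, train_y, (0.9, 0.9)))
print(knn_predict(train_x, train_y, (0.1, 0.1)))
print(rbf_kernel((0.0, 0.0), (1.0, 1.0)))
```

Both baselines classify each frame independently of the frames before it, which is the IID assumption that the IL formulation is meant to relax.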

Model training
The models are trained using different learning algorithms: 1) a traditional machine-learning based approach, i.e. the RBF SVM (the other conventional ML method, kNN, has no training stage as it makes decisions by retrieval); 2) deep learning-based supervised classifiers (CNN, VGG-16 and ResNet-18); and 3) the ILBDM system (DQfD with different CNN policy networks, i.e., CNN, VGG-16 and ResNet-18). For supervised learning methods (machine-learning based approaches and deep learning-based networks), one can easily track the performance of a model during training by evaluating it on the training and validation sets. Fig. 4 shows the convergence of the training-set accuracy and the loss of DQfD with each of the three backbone networks. The input image size is 224 × 224, the learning rate is 0.0005, and the number of epochs is 20. The convergence of the rewards and losses during the training of ILBDM with the different CNN policy networks is also illustrated in Fig. 4. All the backbone networks converge after around 200k-300k iterations. The training times of the methods are shown in Table 4. Training the supervised DL methods with the three backbone networks takes 1.35, 3.25 and 4.35 hours respectively. For the ILBDM system, training with the three backbones takes 2.15, 6.45 and 7.25 hours respectively. This shows that the imitation learning based methods converge more slowly than the supervised learning methods. However, the inference times of the networks in the ILBDM system are similar to those of the other classifiers.

Comparison results
We first evaluate the proposed ILBDM system by comparing it with conventional ML methods. The comparison results are shown in Table 4. We use SVM and kNN as classifiers to process the same observations extracted from the images. The accuracy rates of SVM and kNN are 76.23% and 81.03% respectively. In addition, we test the same data with our previous work, GBIPA-SC-NR [49], which is a grid-based decision-making algorithm. After extracting measurements from grids that divide an image evenly, three conventional classifiers, namely SVM, kNN and a multi-layer perceptron (MLP) artificial neural network (ANN), are used to classify the data into decisions. Their accuracy rates are 87.62%, 77.62%, and 81.49% respectively. GBIPA-SC-NR outperforms the same classifiers applied to the observation data because the dimension of its feature space is much lower, which reduces the impact of the curse of dimensionality. For the proposed ILBDM system, the accuracy rates are 96.21%, 93.32% and 89.56% respectively. These results demonstrate that the overall performance of the proposed system is significantly better than that of the conventional ML methods. Furthermore, we compare the results of the proposed system with supervised learning based methods that use the same backbone networks. Their accuracy rates are 87.36%, 92.57% and 83.08% for CNN, VGG-16 and ResNet-18 respectively. Although there are variations in the results, all of the networks in the ILBDM system outperform their counterparts under the supervised learning framework. This demonstrates the effectiveness of the IL framework for the roundabout joining task. From a theoretical point of view, supervised learning models learn non-linear mapping functions that project observations captured by the vision sensor to decisions. However, they do not consider temporal context when making decisions. In contrast, imitation learning methods implicitly learn temporal contextual features, as their models treat the outputs as a sequence of actions. This characteristic makes imitation learning more suitable for this decision-making task.
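The per-backbone gain of the IL framework over supervised learning follows directly from the accuracy rates quoted above, as the short calculation below shows. The pairing of the three ILBDM figures with CNN, VGG-16 and ResNet-18 assumes they are listed in the same order as the supervised results, which the text confirms explicitly only for DQfD-CNN.

```python
# Accuracy (%) per backbone, as reported in the text. The backbone order of
# the IL figures is assumed to match the order of the supervised figures.
il = {"CNN": 96.21, "VGG-16": 93.32, "ResNet-18": 89.56}
supervised = {"CNN": 87.36, "VGG-16": 92.57, "ResNet-18": 83.08}

gains = {name: round(il[name] - supervised[name], 2) for name in il}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Under this pairing the gain ranges from under one percentage point for VGG-16 to almost nine for the simple CNN, so the size of the improvement depends strongly on the backbone.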
In addition, Table 4 shows the AV/ADAS decision time for the four groups of learning approaches. The deep learning methods and the proposed ILBDM system make the decision to enter a roundabout faster than the traditional machine learning algorithms and GBIPA-SC-NR. The fastest decision time is achieved by DQfD-CNN in the proposed ILBDM class at 0.1035 s, which is 1.1873 s faster than the ANN in the GBIPA-SC-NR class. Therefore, Table 4 shows that the proposed ILBDM approach performs remarkably well in terms of both decision accuracy and inference time.

Discussion
According to the literature, deep imitation learning combines the advantages of both the supervised learning and the reinforcement learning frameworks. Therefore, in this work we propose an imitation learning based system, ILBDM, and show that it outperforms all the compared supervised learning methods on the decision-making task of joining roundabouts. The contributions of our work are fourfold. First, high-quality data were collected for the experiments. In particular, a significant amount of real-world data covering roughly 50 roundabouts was recorded at different times on different days. The data reflect real traffic conditions, which increases the likelihood that the techniques can be applied in practice. To the best of our knowledge, this is the first large real-world dataset for this challenging task. Although the data in [34] contain 50 different roundabouts, which is comparable to our work, they were generated with a driving simulator. Second, the proposed ILBDM makes decisions effectively, which shows its potential for real-world application. The accuracy rate of the proposed system based on DQfD-CNN reaches 96.21%, which is significantly better than the other state-of-the-art algorithms. Third, the proposed ILBDM can work with cars moving at different speeds. ILBDM provides vehicle detection and optical flow modules to determine the speed and position of approaching cars, so the speed and distance of oncoming cars can be tracked, measured and used as effective observational states. Fourth, the proposed ILBDM system improves on our previous grid-based method, GBIPA-SC-NR: both the accuracy and the inference time are improved significantly.
Real-time processing is vital for an autonomous decision-making model. In our work, the total execution time for planning from one frame is 0.43 seconds (0.2 seconds for optical flow extraction, 0.12 seconds for car detection and 0.11 seconds for the action network). Because the purpose of this work is not to develop a fully autonomous decision-making model that replaces the human driver, but a decision augmentation tool that facilitates safe behavior by the human driver, we believe this performance is acceptable. In future work, we will investigate replacing the optical flow estimation module with a 3D-CNN and speeding up the processing to achieve real-time performance, so that the approach could potentially serve a fully autonomous driving system. Furthermore, since ILBDM can learn from individuals' driving styles and behaviors, the system has the potential to model different types of human-like decisions. We will therefore also collect more training data covering different driving styles, so that drivers' real-world behaviors can be learned and simulated.
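The per-frame latency budget above can be written out as a short calculation. This is a minimal sketch with illustrative names, and it assumes the three stages run one after another on each frame (no pipelining or parallel execution), which the text does not state explicitly.

```python
# Per-stage execution times in seconds, as quoted in the text.
# Stage names are illustrative; sequential execution is an assumption.
stage_seconds = {
    "optical_flow": 0.20,
    "car_detection": 0.12,
    "action_network": 0.11,
}

# Total time to produce one decision from one frame.
total_seconds = sum(stage_seconds.values())

# Decisions per second if frames are processed strictly one at a time.
frames_per_second = 1.0 / total_seconds

# The slowest stage is the natural first target for optimisation.
bottleneck = max(stage_seconds, key=stage_seconds.get)

print(f"total per frame: {total_seconds:.2f} s")
print(f"throughput: {frames_per_second:.2f} frames/s")
print(f"bottleneck stage: {bottleneck}")
```

Optical flow extraction is the largest single contribution to the budget, which is consistent with the plan to replace that module first.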

Conclusion
In this paper, we present an imitation learning based decision-making system named ILBDM that enables an AV/ADAS to make the most suitable decisions for joining roundabouts in a timely and safe manner. The ILBDM system has an effective observation extraction pipeline, which includes vehicle detection based on Faster R-CNN and motion feature extraction from optical flow. It trains deep policy networks based on several popular backbone networks, including CNN, VGG-16 and ResNet-18, to recommend actions that maximize the cumulative return over a sequence of decisions. The learned networks in the proposed ILBDM system were evaluated on 130 real-world videos. The results demonstrate that the proposed ILBDM system can effectively help an AV/ADAS make the most suitable decisions when approaching a roundabout. Furthermore, we believe that the proposed framework has the potential to be adapted and deployed in other high-level autonomous vehicle control tasks once the corresponding data are collected.

Declarations
Competing interests All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version. This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue. The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.