Autonomous Overtaking for Intelligent Vehicles Considering Social Preference Based on Hierarchical Reinforcement Learning

Because overtaking is a complex maneuver for intelligent vehicles, a safe and efficient automated overtaking system (AOS) is vital for avoiding accidents caused by driver error. Existing AOSs rarely consider the longitudinal reactions of the overtaken vehicle (OV) during overtaking. This paper proposes a novel AOS based on hierarchical reinforcement learning, in which the longitudinal reaction is given by a data-driven social preference estimation. The AOS incorporates two modules that function in different overtaking phases. The first module, based on a semi-Markov decision process and motion primitives, is built for motion planning and control. The second module, based on a Markov decision process, is designed to enable vehicles to make proper decisions according to the social preference of the OV. The proposed AOS and its modules are verified experimentally on realistic overtaking data. The test results show that the proposed AOS can realize safe and effective overtaking in scenes built from realistic data, and can flexibly adjust lateral driving behavior and lane-change position when the OVs have different social preferences.


Introduction
Safety and efficiency are two ultimate goals pursued by intelligent transportation systems. In order to prevent accidents caused by human errors and improve traffic efficiency, advanced driving assistance systems (ADAS) have been proposed and developed in the field of autonomous driving for more than a decade.
As one of the most challenging driving scenarios, overtaking accounts for a large proportion of traffic accidents and could lead to serious consequences [1]. According to the investigation, about 4-10% of all traffic accidents occur during overtaking [2]. Therefore, a reliable automated overtaking system (AOS) is indispensable for ADAS to avoid overtaking accidents.
For a reliable AOS, two types of frameworks have been developed, i.e., the rule-based control framework and the classic overtaking framework. The former is done by defining and learning specific rules. In Refs. [3,4], rule-based control frameworks were used as speed and steering modules to realize the vehicle control according to the specific rules. However, restricted by its poor adaptability to complex road conditions, the rule-based control framework is still far from real applications. Hence, the latter with better adaptability has raised increasing attention in recent years [5,6].
A complete classic overtaking framework requires three parts: decision making, motion planning and control [7]. As the high-level role, the decision-making part is designed to guarantee correctness and accuracy. To achieve this, rule-based methods were proposed in Refs. [8, 9], where decisions were made according to rules set in advance. However, limited by a finite set of rules, rule-based methods only perform well in simple scenarios and adapt poorly to the uncertainty of dynamic traffic.
Recently, with the development of machine learning, reinforcement learning based (RL-based) methods have become prevalent, where the uncertainty of dynamic traffic and the real-time problem can be addressed simultaneously by the Markov decision process (MDP) of RL. In Refs. [10][11][12], RL-based overtaking frameworks were built to make decisions for longitudinal and lateral driving behaviors, such as "acceleration" and "left-turn". In Refs. [13,14], deep reinforcement learning (DRL), which combines RL with neural networks, was adopted in behavioral decision-making, improving the adaptability to different overtaking situations. Although these works improve adaptability, they ignore the reactions of overtaken vehicles (OVs) during overtaking, which may lead to unexpected decisions by the AOS.
When it comes to overtaking maneuvers like a lane change, reactions of OVs are pivotal for the decision-making of AOS. To describe these reactions, social preferences have been increasingly considered in ADAS research [15,16], where the general approach is to classify drivers into several categories according to their driving aggressiveness. In Ref. [17], two types of drivers were proposed, namely annoying and cautious drivers. In Ref. [18], three styles were assumed, i.e., aggressive, normal and cautious. However, these classifications are subjective and lack a data basis. In addition, these preferences were not considered in the decision-making process.
This paper proposes an MDP-based module to ensure that a proper overtaking decision can be made when OVs have different social preferences. In Ref. [19], an SMDP-based decision-making module and motion primitives (MPs) were proposed, which solved the motion-planning and control problems for a complete classic overtaking framework. Hence, by combining the two decision-making modules, a complete hierarchical reinforcement learning based (HRL-based) AOS considering social preferences can be developed. The contributions of this paper are as follows:
(1) An MDP-based decision-making module is proposed. In this module, social preferences of OV are extracted by data-driven methods, which makes them objective. After training under these preferences, the module can make proper decisions as motion commands of the host vehicle (HV) according to the real-time reactions of OV.
(2) A complete HRL-based AOS combining the two modules is proposed. In this AOS, the high-level module is the MDP-based module proposed in this paper, and the low-level module is the semi-Markov decision process based (SMDP-based) decision module proposed in Ref. [19]. The two modules operate in their corresponding phases during overtaking, thus completing a safe overtaking task.
The rest of this paper is organized as follows. The framework of the AOS is introduced in Sect. 2. Section 3 describes the SMDP-based module and MPs for motion planning and control. Section 4 elaborates the MDP-based decision-making module considering social preferences. Section 5 presents the tests and analysis for the decision-making modules and the AOS. Section 6 gives the conclusion and future work.

Whole Framework of AOS
As shown in Fig. 1, a typical overtaking maneuver includes three phases. In Start Phase, HV (the grey car) changes to the adjacent lane. In Parallel Driving Phase, HV drives in parallel with OV (the blue car) and then executes overtaking. Finally, HV drives back to the original lane in End Phase. For a smart AOS, the generated strategies should be able to deal with different situations in the three phases. Hence, this problem is suitable to be formulated as an RL problem, where the optimal strategy can be learned by training an AOS to obtain more rewards in each situation.
The proposed HRL-based overtaking framework is shown in Fig. 2. Following the formulation of the RL problem [20], the environment and agent need to be defined. Hence, this overtaking framework consists of two main parts: the environment and the proposed AOS (the agent in RL). The environment receives control commands (actions) from the AOS and provides the system with the information of HV and OV at each time step (states). Thus, the AOS can be trained in this environment. As the agent in RL, the HRL-based AOS needs to output proper decisions and control commands. Considering the differences among the three phases, two modules, the MDP-based and SMDP-based modules, are set to deal with different phases. Specifically, the overtaking phase is first identified from the current state. For Start Phase and End Phase, the SMDP-based module operates to select the optimal MP from the MP library. Module training is done by an SMDP Q-learning algorithm proposed in Ref. [21], where the balance between overtaking efficiency and comfort is the main concern. In Parallel Driving Phase, the interaction between HV and OV is considered to deal with the uncertainty caused by OV. The MDP-based module considering social preferences is designed to output proper decision commands, including "Lane Keeping", "Lane Changing" and "Give Up Overtaking". To achieve this function, a semi-model-based improved Q-learning algorithm is proposed, where the social preference is given by the following data-driven method.
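The phase-based division of labor described above can be illustrated with a minimal sketch. The `Phase` enum, the function name `select_module` and the returned string tags are hypothetical names introduced only for this illustration; how the overtaking phase is actually identified from the state is not specified here and is left out.

```python
from enum import Enum


class Phase(Enum):
    """The three overtaking phases named in the text."""
    START = "Start Phase"
    PARALLEL = "Parallel Driving Phase"
    END = "End Phase"


def select_module(phase: Phase) -> str:
    """Return a tag for the module that handles the given phase (hypothetical helper)."""
    if phase in (Phase.START, Phase.END):
        # Start and End Phases: the SMDP-based module selects an MP from the library.
        return "SMDP"
    # Parallel Driving Phase: the MDP-based module outputs a decision command.
    return "MDP"


if __name__ == "__main__":
    for phase in Phase:
        print(phase.value, "->", select_module(phase))
```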
To obtain the data-driven social preference, three units are built: the social preference clustering unit (SPCU), the state transition probability statistics unit (STPSU), and the social preference prediction unit (SPPU). In SPCU, a large amount of naturalistic overtaking data is clustered offline into three categories of social preferences, namely altruism, prosocial and egoism. These labeled data are then passed to STPSU and SPPU, where the former is designed for offline training and the latter for online estimation. Specifically, during training, the state transition probabilities of OV with different social preferences are calculated in STPSU. OV then drives according to these calculated probabilities, which increases the complexity of the training situations. In online overtaking, however, the trained SPPU needs to estimate the real-time social preference of OV. Given the estimated result, the AOS can make the proper decision during overtaking.
Considering the social preference, the complete overtaking policy of HV is as follows. If OV is altruistic or prosocial, HV can complete overtaking: the SMDP-based module outputs the optimal motion primitives in Start Phase and End Phase, and the MDP-based module decides the lane-change position in Parallel Driving Phase. If OV is egoistic, the MDP-based module outputs the "Give Up Overtaking" command in Parallel Driving Phase, and HV quits overtaking for the sake of driving safety.
According to the description above, the design of the two modules is of paramount importance. In the next two sections, training two modules will be detailed.

SMDP-Based Module and MP
In Ref. [19], the SMDP-based decision-making module has been developed for selecting MP to achieve safe and efficient motion planning and control. To clearly illustrate the procedure of AOS later on, this section will make a brief introduction to this module and MP.

Development of MPs
MP refers to the control value sequence of vehicles, consisting of throttling, braking and steering. The purpose of MP is to solve the inconsistency problem between the planned and tracked trajectories. In short, to track a desired trajectory, the vehicle only needs to execute the corresponding MP. Herein, MPs for overtaking are extracted and stored offline. Thus, the generated MPs can be used for vehicle control directly.

Definition of RL Elements
The definitions of the elements in the SMDP Q-learning algorithm are presented in this subsection, including the state space, option space and reward function.
where $K_t$ is a constant parameter, and $t_m$ represents the duration of the executed MP.
The comfort reward is evaluated by the average lateral acceleration during each overtaking phase. A higher average lateral acceleration means a higher jerk during overtaking, which is undesired for AOS. Hence, this reward is defined as Eq. (4).
where $K_c$ is a constant parameter, and $a_{yaw}$ represents the average lateral acceleration.
Following the description in Ref. [21], the optimal position of lane change can be calculated. Therefore, in Parallel Driving Phase, the absolute value of the difference between the lane-change position of HV and the calculated optimal position is used to evaluate the lane-change position. The reward of lane-change position in Parallel Driving Phase can be calculated by Eq. (5). In Start Phase or End Phase, this reward is set to 0.
where $K_d$ is a constant parameter, $x_{OV}$ and $x_{HV}$ are the longitudinal positions of OV and HV, respectively, and $T_r$ is a duration of 3 s [22].
To encourage safe driving during training, the agent will be punished greatly in case of collision, where a large negative value will be given as the reward value. The collision reward is shown in Eq. (6).

SMDP Q-Learning Algorithm
In SMDP, a tuple $(S, O, R, P, D)$ needs to be formulated, where $S$, $O$, $R$, $P$ and $D$ denote the state space, option space, reward function, state transition probability and initial state distribution, respectively.
In this case, the options in the option space are the MP candidates. In this module, the MP selection is limited according to the corresponding overtaking phase. The option space consists of $O_S$, $O_P$ and $O_E$, the option subgroups for Start Phase, Parallel Driving Phase and End Phase, respectively. Each subgroup includes many candidate options (motion primitives here). Each MP can be represented by $o = [\langle ste_1, thr_1, bra_1 \rangle, \ldots, \langle ste_n, thr_n, bra_n \rangle]$, where $ste$, $thr$ and $bra$ indicate the control values of steering, throttling and braking, and $n$ depends on the length of the motion primitive.
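As an illustration of the MP representation and the phase-limited option selection described above, a minimal data-structure sketch is given below. The class `MotionPrimitive`, the helper `options_for_phase`, the phase tags and all numeric control values are assumptions made for this example and are not taken from the MP library of Ref. [19].

```python
from dataclasses import dataclass
from typing import List, Tuple

# One control step <ste, thr, bra>; the value ranges used below are assumptions.
ControlStep = Tuple[float, float, float]


@dataclass(frozen=True)
class MotionPrimitive:
    """A motion primitive: a fixed sequence of n control steps for one phase."""
    phase: str                         # hypothetical tags: "start", "parallel", "end"
    controls: Tuple[ControlStep, ...]  # (<ste_1, thr_1, bra_1>, ..., <ste_n, thr_n, bra_n>)

    def __len__(self) -> int:
        # n, the length of the motion primitive
        return len(self.controls)


def options_for_phase(library: List[MotionPrimitive], phase: str) -> List[MotionPrimitive]:
    """Restrict the option space to the subgroup of the current overtaking phase."""
    return [mp for mp in library if mp.phase == phase]


# Illustrative library with made-up control values.
library = [
    MotionPrimitive("start", ((0.10, 0.30, 0.0), (0.05, 0.30, 0.0))),
    MotionPrimitive("parallel", ((0.00, 0.40, 0.0),)),
    MotionPrimitive("end", ((-0.10, 0.20, 0.0), (-0.05, 0.20, 0.0), (0.00, 0.20, 0.0))),
]

if __name__ == "__main__":
    for mp in options_for_phase(library, "end"):
        print(mp.phase, len(mp), mp.controls)
```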
To seek the optimal MP in various situations, the SMDP Q-learning algorithm has been proposed. In its iterative update equation, $Q(S_t, O_t)$ is the Q value for the state-option pair $(S_t, O_t)$ at time $t$. After training, the strategy for MP selection can be generated.
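For orientation, the textbook form of the SMDP Q-learning update is sketched below. It is the standard update from the options literature, in which the bootstrap term is discounted by the option duration; it may differ in detail from the iterative equation used in Ref. [21]. The function name, the dictionary-based Q table and the values of `alpha` and `gamma` are assumptions for this example.

```python
from collections import defaultdict


def smdp_q_update(q, state, option, reward, next_state, next_options,
                  duration, alpha=0.1, gamma=0.9):
    """One standard SMDP Q-learning step.

    Q(s, o) <- Q(s, o) + alpha * (r + gamma**duration * max_o' Q(s', o') - Q(s, o))

    `reward` is the reward accumulated while the option (here, an MP) was
    executing, and `duration` is the number of time steps it lasted.
    """
    best_next = max((q[(next_state, o)] for o in next_options), default=0.0)
    target = reward + (gamma ** duration) * best_next
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]


if __name__ == "__main__":
    q = defaultdict(float)           # unseen state-option pairs start at 0
    q[("s1", "mp_a")] = 10.0
    new_value = smdp_q_update(q, "s0", "mp_b", reward=1.0, next_state="s1",
                              next_options=["mp_a"], duration=2,
                              alpha=0.5, gamma=0.5)
    print(new_value)
```

Discounting by `gamma ** duration` rather than by `gamma` is what distinguishes the SMDP update from the one-step MDP update: an MP that takes longer to execute pushes the bootstrapped value further into the future.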

MDP-Based Module Considering Social Preferences
As mentioned in Sect. 2, the MDP-based module is designed for Parallel Driving Phase to estimate the social preference of OV. To do this, a data-driven method for social preference classification is adopted, and a semi-model-based improved Q-learning algorithm is proposed.

MDP Q-Learning Algorithm
Following the definition of social preference in Sect. 2, three types of social preferences are defined, namely egoism, prosocial and altruism, to distinguish different driving behaviors of OV. Since the reaction of OV is mainly reflected in the speed variation, the average speed is used as the basis for the classification [23][24][25]. To obtain an objective result, the social preference classification is done through clustering. As one of the most classic clustering algorithms, the K-means algorithm performs well in classification tasks with low data dimensions and small data numbers [26]. Hence, K-means is adopted to classify the overtaking data. After clustering, the driving data of OV in Parallel Driving Phase are labeled with the three social preferences.
To incorporate the three social preferences in training, the state transition probabilities corresponding to different preferences are required. Hence, as shown in Fig. 3, a grid space is set to accumulate the transition counts, where each grid is regarded as a state. In Parallel Driving Phase, the state transitions of OV are counted, and the state transition probability of OV is calculated according to Eq. (9), where $G_t^k$ indicates that OV is in grid $k$ at time $t$, and $G_{t+1}^{k+n}$ indicates that OV is in grid $k+n$ at time $t+1$. $P(G_{t+1}^{k+n} \mid G_t^k)$ is the probability that OV, being in grid $k$ at time $t$, is in grid $k+n$ at time $t+1$, that is, the state transition probability. The function count(⋅) counts the number of observed state transition events. $G$ represents the set of all grid positions. Hence, for the three social preferences, the corresponding transition probabilities can be obtained.
When OV is egoistic, HV should consider giving up overtaking. Therefore, the action space incorporates three actions for lateral driving behaviors, namely "Lane Keeping", "Lane Changing" and "Give Up Overtaking".
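The count-based estimate of the state transition probability described above can be sketched as follows. This is a minimal illustration that assumes each OV trajectory is already given as a sequence of grid indices, one per time step; the function name and the input format are hypothetical, and the normalization (dividing by the number of transitions leaving each grid) is the usual relative-frequency estimate rather than a transcription of Eq. (9).

```python
from collections import Counter, defaultdict


def estimate_transition_probs(trajectories):
    """Estimate P(grid k+n at t+1 | grid k at t) by relative frequency.

    `trajectories` is an iterable of grid-index sequences (assumed format).
    Returns {k: {n: probability}}, where n is the grid offset after one step.
    """
    counts = defaultdict(Counter)
    for grids in trajectories:
        # Count every consecutive pair (grid at t, grid at t+1).
        for current, nxt in zip(grids, grids[1:]):
            counts[current][nxt - current] += 1

    probs = {}
    for current, offsets in counts.items():
        total = sum(offsets.values())
        probs[current] = {n: c / total for n, c in offsets.items()}
    return probs


if __name__ == "__main__":
    # Two made-up OV trajectories in grid coordinates.
    demo = [[6, 7, 8], [6, 6, 7]]
    for grid, dist in sorted(estimate_transition_probs(demo).items()):
        print(grid, dist)
```

Running the estimator separately on the data labeled with each social preference would yield one transition table per preference.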

Definition of RL Elements
For decision-making of the lane-change position, different rewards or penalties should be given according to different actions. Specifically, for action "Lane Keeping", a small penalty should be given to avoid vehicles always keeping the lane without changing lanes. Therefore, the reward is set as -1. But for action "Lane Changing", punishment should be given according to the gap between the position of lane change and the optimal position of the lane change. The reward can be calculated by Eq. (11).
where $K_l$ is a constant parameter, and $G_{HV}$ and $G_{OV}$ represent the grid positions of HV and OV at the current time, respectively. $C$ denotes the expected position error, which is calculated by Eq. (12), where $g$ is the grid length, $T_r$ denotes the duration of 3 s [22], K is a constant parameter, and OV is the social preference judged.
Fig. 3 Grid space for the state transition probability calculation
For action "Give Up Overtaking", a large penalty should be given if OV is not egoistic. If HV takes the "Give Up Overtaking" action when OV is not egoistic, the reward is set to -1000. If HV takes the "Give Up Overtaking" action when OV is egoistic, the reward is set to 20.
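The action-dependent rewards described in this subsection can be summarized in a short sketch. The constants -1, -1000 and 20 are taken from the text. The lane-change penalty of Eq. (11) is not reproduced: its magnitude is passed in as the hypothetical parameter `lane_change_penalty`, and the function name and signature are likewise assumptions for this example.

```python
def mdp_reward(action: str, ov_is_egoistic: bool, lane_change_penalty: float = 0.0) -> float:
    """Reward of the MDP-based module for one decision (illustrative sketch).

    `lane_change_penalty` stands in for the magnitude of the Eq. (11) penalty,
    which depends on the gap between the actual and optimal lane-change position.
    """
    if action == "Lane Keeping":
        # Small constant penalty so that HV does not keep its lane forever.
        return -1.0
    if action == "Lane Changing":
        # Penalty grows with the distance from the optimal lane-change position.
        return -abs(lane_change_penalty)
    if action == "Give Up Overtaking":
        # Giving up is rewarded only when OV is egoistic.
        return 20.0 if ov_is_egoistic else -1000.0
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    print(mdp_reward("Lane Keeping", ov_is_egoistic=False))
    print(mdp_reward("Lane Changing", ov_is_egoistic=False, lane_change_penalty=3.5))
    print(mdp_reward("Give Up Overtaking", ov_is_egoistic=True))
    print(mdp_reward("Give Up Overtaking", ov_is_egoistic=False))
```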

Improved Q-Learning Algorithm
Given the state transition probabilities calculated in Sect. 4.1, a decision-making module considering social preferences can be trained offline. As an off-policy learning algorithm based on MDP, the Q-learning algorithm is used here [27]. Because the social preference of OV is considered, the Q function needs to involve the state of OV. Hence, in the iterative formula of the algorithm, $S_t^{HV}, S_t^{OV} \in S$ represent the vehicle states and $A_t^{HV}$ represents the action of HV at time step $t$. The expected Q value needs to be rewritten in terms of $P(S_{t+1}^{OV} \mid S_t^{OV})$, the state transition probability of OV from time $t$ to time $t+1$. When the state transition probability of OV is known, the expected Q value of each time step is adjusted according to the possible states of OV in the next time step. The improved Q-learning algorithm is shown in Table 1.
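One way to read the idea above is as a Q-learning update whose bootstrap term is an expectation over the next OV state. The sketch below follows that reading under stated assumptions: the Q table is a dictionary keyed by (HV state, OV state, action), the OV transition table has the form {ov_state: {next_ov_state: probability}}, and `alpha` and `gamma` are illustrative values. It is not a transcription of the algorithm in Table 1.

```python
from collections import defaultdict

ACTIONS = ("Lane Keeping", "Lane Changing", "Give Up Overtaking")


def expected_next_value(q, next_hv_state, ov_state, ov_transition_probs):
    """Sum over next OV states of P(s_ov' | s_ov) * max_a Q(s_hv', s_ov', a)."""
    total = 0.0
    for next_ov_state, prob in ov_transition_probs[ov_state].items():
        best = max(q[(next_hv_state, next_ov_state, a)] for a in ACTIONS)
        total += prob * best
    return total


def improved_q_update(q, hv_state, ov_state, action, reward, next_hv_state,
                      ov_transition_probs, alpha=0.1, gamma=0.9):
    """Q-learning step whose target averages over the possible next OV states."""
    key = (hv_state, ov_state, action)
    target = reward + gamma * expected_next_value(
        q, next_hv_state, ov_state, ov_transition_probs)
    q[key] += alpha * (target - q[key])
    return q[key]


if __name__ == "__main__":
    q = defaultdict(float)
    q[(1, 5, "Lane Keeping")] = 2.0
    q[(1, 6, "Lane Changing")] = 4.0
    probs = {5: {5: 0.5, 6: 0.5}}  # made-up OV transition table
    print(improved_q_update(q, 0, 5, "Lane Keeping", reward=-1.0,
                            next_hv_state=1, ov_transition_probs=probs,
                            alpha=0.5, gamma=1.0))
```

Using the known OV transition probabilities in the target, instead of a single sampled next OV state, is what makes the update "semi-model-based" in this reading: only the OV part of the dynamics is modeled.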

Real-Time Estimation of Social Preference
To estimate the real-time social preference of OV, a real-time classifier is built by two models, including the support vector machine model (SVM) with linear kernel [28] and maximum entropy model (MEM) based on logistic regression [29]. In this classifier, two models are trained by the previously labeled data offline. After training, this classifier can be used for real-time estimation according to the speed of OV.
To summarize Sect. 3 and Sect. 4, the SMDP-based module and the MDP-based module are illustrated, respectively. The former is used for motion planning and control, and the latter is used for decision-making in Parallel Driving Phase. By combining two modules, a complete AOS can be built.
In this AOS, the social preference of OV will be considered, ensuring a safe overtaking maneuver.

AOS Training and Experiments
In this section, the proposed MDP-based decision-making module and AOS are tested in the urban driving simulator CARLA which has been widely used for the development, training and test of ADAS [30]. In Sect. 5.1, state transition probabilities of OV with different social preferences in the Parallel Driving Phase are calculated. During training, the motions of OV will follow these state transition probabilities. Two algorithms, the proposed improved Q-learning algorithm and deep Q network (DQN) algorithm, are utilized in the decision-making module. In Sect. 5.2, the proposed AOS combining two decision-making modules is tested in the real-data-based simulation to realize overtaking tasks.

Tests for MDP-Based Decision-Making Module Considering Social Preference
In this section, state transition probabilities of OV with three social preferences are counted and calculated by the proposed MDP-based module. The data processing and calculation are as follows.
To obtain an objective classification, the overtaking data for Parallel Driving Phase are selected from the dataset NGSIM [31]. NGSIM is a public dataset of highway traffic in the US, which has been widely used in many tests. To extract overtaking scenarios like this case, the dataset is preprocessed manually. Considering the speed range in which overtaking occurs, data with the speed range of 6-18 m/s for HV are selected as valid data. In all valid data, speeds of OV (at the beginning of Parallel Driving Phase) are divided into several ranges, which are detailed in Table 2.
It can be seen that the amount of data increases first and then decreases with the increase of the speed of OV. For the speed range "0-3", this condition may be unfavorable for overtaking. For example, OV may have just braked. For the speed range "Over 18", this condition indicates that the speed of OV is much higher than the speed of HV and overtaking is difficult to be executed. Therefore, the data in these two cases are ignored. Hence, the remaining 373 sets of data are used for this case.
To obtain an objective basis for the classification into the three categories, the average speed of OV over about 6 s after entering Parallel Driving Phase is calculated and recorded for each data set. Then, taking the average speed as the characteristic quantity, K-means is applied to cluster the valid data. The clustering results are shown in Fig. 4. Cluster 1 contains the data with relatively low average speeds of OV; Cluster 2 consists of the data in which the average speeds of OV are relatively moderate; Cluster 3 represents the data in which the average speeds of OV are relatively high. Thus, the data in Clusters 1, 2 and 3 are labeled as altruistic, prosocial and egoistic, respectively.
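The clustering step above can be illustrated with a small one-dimensional K-means written in plain Python. This is a sketch under stated assumptions: the initial centers are spread deterministically over the sorted data (the initialization used in the paper is not stated), the sample speeds are made up, and the function names are hypothetical. The labeling follows the text: the lowest-speed cluster is altruistic, the middle one prosocial and the highest one egoistic.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster scalar values with K-means (k >= 2) and return sorted centers."""
    ordered = sorted(values)
    # Deterministic initialization: spread the initial centers over the sorted data.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to the nearest center (squared Euclidean distance).
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[nearest].append(v)
        # Recompute each center as the mean of its cluster; keep empty ones in place.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def label_by_center(value, centers, labels=("altruistic", "prosocial", "egoistic")):
    """Label a value by its nearest center; centers must be sorted ascending."""
    nearest = min(range(len(centers)), key=lambda i: (value - centers[i]) ** 2)
    return labels[nearest]


if __name__ == "__main__":
    # Made-up average OV speeds in m/s.
    speeds = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0, 14.0, 15.0, 16.0]
    centers = kmeans_1d(speeds)
    print(centers)
    print([label_by_center(s, centers) for s in (4.5, 9.5, 15.5)])
```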
After clustering the valid data into three categories, state transition probability of OV under each social preference and grid can be calculated by Eq. (9). The results are shown in Table 3. Since the amount of data of grid 1-5 or over 21 is small, which will contribute to unnecessary error, the state transition probabilities of these grids are neglected. Hence, the mean state transition probabilities of egoistic, prosocial and altruistic OV are 0.4628, 0.3156 and 0.2321. During training, OV will react according to these mean probabilities.
The decision-making module based on the improved Q-learning algorithm is tested in CARLA. A simulated straight two-lane road is selected for overtaking. The initial speeds of HV and OV are set to 14 and 8 m/s, respectively. The initial position of OV is 7-11 grids ahead of HV in each episode. The parameter settings of the algorithm are shown in Table 4.
The average reward of each episode is shown in Fig. 5. The purple, yellow, and red curves are the training results when OV is altruistic, egoistic and prosocial, respectively. The average reward values in the three cases gradually approach a fixed value, which shows that the improved Q-learning algorithm converges. The speeds of convergence in the three cases are different. With an egoistic OV, the algorithm converges the fastest, in about 220 episodes. However, convergence is much slower when OV is altruistic or prosocial, taking about 300 episodes. This result shows that the convergence speed of the algorithm is related to the setting of the reward function, and that the algorithm converges faster with a fixed reward value than with a non-fixed reward value.
The heat maps of decision results for the improved Q-learning algorithm are shown in Fig. 6. When the social preference of OV is egoistic (as shown in Fig. 6(b)), the highlighted part with red or green color denotes that the decision is "Give Up Overtaking". When the social preference of OV is altruistic or prosocial (as shown in Fig. 6(a) and (c)), the highlighted part represents the decision "Lane Changing". The decision in the non-highlighted parts is "Lane Keeping".

Fig. 4 Clustering results of K-means
In Fig. 6(a) and (c), the highlighted part has an approximately linear shape, which signifies that the decision results are correct because this decision is made according to Eq. (12). Some small highlighted spots emerge because the motion of OV is driven by state transition probabilities; their appearance is common and inevitable. In Fig. 6(b), since the social preference of OV is egoistic, the decision-making module should make the decision of "Give Up Overtaking" at any time. The large highlighted area in the figure shows that the decision result is reasonable.
To test the performance for continuous decision outputs, the DQN algorithm is also applied to build the module [32]. The initial speeds of HV and OV are set to 14 and 8 m/s, respectively. The initial position of OV is 8 grids ahead of HV in each episode. The parameter settings of the algorithm are shown in Table 5. The exploration rate of the DQN algorithm is set to decrease gradually as the number of episodes increases. The exploration rate is calculated as a function of the episode number, where $k$ represents the number of episodes, and $\varepsilon_k$ denotes the exploration rate at episode $k$.
The average reward of each episode is shown in Fig. 7. The yellow, red and purple curves are the training results when OV is egoistic, prosocial and altruistic, respectively. After about 4000 episodes, the average reward in the three conditions also approaches a fixed value, which indicates that the DQN algorithm has converged. In the three conditions, the convergence rate of the algorithm is almost the same.
Fig. 7 Training results of DQN algorithm
Comparing the training results of the two algorithms, the improved Q-learning algorithm converges after about 300 episodes while the DQN algorithm converges after about 2000 episodes. These results show that the convergence rate of the DQN algorithm is significantly lower than that of the improved Q-learning algorithm.
The heat maps of decision results of the DQN algorithm are shown in Fig. 8. As with the improved Q-learning algorithm, when OV is altruistic or prosocial, the highlighted part of the decision results is approximately linear for both algorithms, which means that the decision results conform to the optimal overtaking position defined by Eq. (12). When OV is egoistic, the highlighted part shows that both algorithms can make an accurate decision of "Give Up Overtaking". The difference between the two sets of heat maps is that, no matter which type of vehicle is overtaken, the highlighted area in the heat maps of the DQN algorithm is significantly larger than that of the improved Q-learning algorithm, which demonstrates that the safe overtaking decision made by DQN covers more situations (states). There are two possible reasons. First, the training time of the DQN algorithm is longer (caused by the difficulty of training DQN itself), so the amount of training data generated is larger and the model is trained more thoroughly. Second, the exploration rate of the DQN algorithm is much higher than that of the improved Q-learning algorithm, which means that the module based on the DQN algorithm explores more overtaking scenarios. These results demonstrate that the proposed module is suitable for both discrete and continuous states.

Tests for AOS
Since the decision of the AOS needs to consider the social preferences of OV, real-time estimation of the social preferences is required. Given the labeled data after clustering, two classifiers are established. One is based on SVM and the other is based on MEM. By inputting all labeled data into a fit multiclass model based on SVM and MEM, the classifier is trained. The number of clusters is the same as the number of defined social preferences, both of which are 3. The distance between cluster center and data points is measured by the square of Euclidean distance.
The classification results of the classifiers based on SVM and MEM are presented in Fig. 9. Classifications 1, 2 and 3 refer to the different social preferences, namely egoistic, prosocial and altruistic, respectively. In Fig. 9(a), the SVM-based classifier clearly separates the social preferences at two critical speeds, 7.4 and 11.4 m/s. OVs with speeds from 0 to 7.4 m/s are considered altruistic; OVs with speeds from 7.4 to 11.4 m/s are considered prosocial; OVs with speeds over 11.4 m/s are considered egoistic.
The classification results of the MEM-based classifier are shown in Fig. 9(b). Labels 1, 2 and 3 correspond to the social preferences egoism, prosocial and altruism, respectively. The two critical speeds are 7 and 11 m/s. Notably, even though the total amount of data is adequate, the probabilities of the three categories in the speed range 0-0.9 m/s are equal, which means that the MEM-based classifier performs poorly at low speeds.
Comparing the results of the SVM-based and MEM-based classifiers, some characteristics can be found. The MEM-based classifier performs poorly in prediction at low speeds, and the SVM-based classifier performs better in the classification of small sample data. Despite this, the prediction results of the two classifiers are similar. Therefore, in real-time cases, the critical speeds are set to the average values of the two classifiers: 7.2 and 11.2 m/s. According to these critical speeds, the AOS can predict the social preference of OV in real time. At the same time, the real overtaking data can be classified according to the social preference of OV.
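The averaging of the critical speeds and the resulting real-time rule can be written out as a short sketch. The threshold values are those reported above; the function names and the handling of a speed exactly equal to a critical speed (assigned to the higher category here) are assumptions for this example.

```python
SVM_THRESHOLDS = (7.4, 11.4)  # critical speeds of the SVM-based classifier, m/s
MEM_THRESHOLDS = (7.0, 11.0)  # critical speeds of the MEM-based classifier, m/s


def average_thresholds(a, b):
    """Average two pairs of critical speeds element-wise."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def classify_social_preference(speed, thresholds):
    """Map an OV speed in m/s to a social preference label.

    Boundary handling is an assumption: a speed equal to a critical
    speed falls into the higher category.
    """
    low, high = thresholds
    if speed < low:
        return "altruistic"
    if speed < high:
        return "prosocial"
    return "egoistic"


if __name__ == "__main__":
    thresholds = average_thresholds(SVM_THRESHOLDS, MEM_THRESHOLDS)
    print([round(t, 1) for t in thresholds])
    for speed in (5.0, 9.0, 12.0):
        print(speed, classify_social_preference(speed, thresholds))
```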
Combining two modules as a whole, the proposed AOS is tested in CARLA based on realistic data. The data collection is done by the BYD Tang platform from Beijing Institute of Technology. The collected data covers many overtaking scenarios in the Haidian District, Beijing. Figure 10 shows the equipment mounted on vehicles, including Camera, Lidar and GPS. In the collected dataset, OVs are marked by YOLO, and corresponding trajectories are extracted from Lidar. Based on the collected realistic data, validation can be done to mimic real overtaking. Considering validation safety, CARLA is used to reproduce the overtaking scenarios based on realistic data. Specifically, in CARLA, HV is controlled by AOS, and the operation of OV follows the collected realistic overtaking data. In all collected overtaking data, 63 groups are selected involving obvious reactions of OV. The selected 63 groups of overtaking data are classified into three social preferences, including 15 prosocial, 31 altruistic and 17 egoistic.
... Eq. (12) is 3 × 8 = 24 m. Furthermore, because OV is altruistic, the optimal overtaking distance ought to be shortened by 3 × 3 = 9 m. Hence, the optimal overtaking distance is 15 m, which is close to the test results. The same sound result can be obtained when OV is altruistic. When OV is egoistic, the results of three overtaking tests are shown in Fig. 13. In three groups, AOS makes the decision of "Give Up Overtaking" due to the aggressive behavior of OV. To this end, in these real-data-based cases, overtaking can be completed safely and effectively by the proposed AOS.

Test Results Comparison with Social-Preference-not-in-Mind AOS
To better show the improvement of the proposed AOS over a social-preference-not-in-mind AOS, a more complicated situation is set in this part, namely a changing social preference of OV. During overtaking, OV follows more than one social preference, which requires more robustness from the AOS. The AOS of Ref. [19], which ignores social preference, is selected for comparison. Specifically, the social preference of OV in this experiment moves from egoist to altruist, which mimics a commonly occurring change in driver mindset: from being unwilling to be overtaken to actively giving way. Thirty different tests are used to compare the proposed AOS (HV-S) with the social-preference-not-in-mind AOS (HV-NS, our previous work), and the numbers of successful overtakes are presented in Table 6. It can be seen that the proposed AOS can handle these cases with a changing social preference: HV-S can complete overtaking safely when OV becomes altruistic, but HV-NS cannot. This is because considering the social preference enlarges the state input of the model during training and makes its handling of realistic driving scenarios more reasonable. This result demonstrates the improved adaptability in more complicated scenarios. Figure 14 presents a case as an example. Thanks to the proposed MDP-based module, which considers social preferences, some promising improvements in adaptability are obtained.

Conclusions
This study proposes an HRL-based AOS consisting of two RL-based modules. The SMDP-based module and MPs were proposed in the previous work and are used for motion planning and control. The MDP-based module considering social preferences of OV is designed for decision-making based on the reactions of OV. Three tests are carried out to verify the proposed MDP-based module and AOS.
In the first test, objective social preferences are extracted by a data-driven method. Under these social preferences, the MDP-based module is trained by two classic RL algorithms (the improved Q-learning and DQN). The training result shows that modules based on the two RL methods are both feasible. After training, this module can estimate social preference in real time. In the second test, the whole AOS is tested based on realistic data. The result demonstrates a sound ability to interact with OV under different social preferences and to achieve safe and efficient autonomous overtaking. In the third test, the proposed AOS is compared with a social-preference-not-in-mind AOS, and the result shows an improvement in adaptability in more complicated scenarios. In future work, on-road tests with realistic traffic will be considered.
Fig. 14 Trajectories between HV-S and HV-NS when the social preference of OV moves from egoist to altruist