1 Introduction

Autonomous intelligent systems (AISs) still face significant challenges in large-scale production and application

The state of the art in AISs showcases a remarkable convergence of cutting-edge technologies in artificial intelligence (AI), sensor technologies, and intelligent control, contributing to impressive strides in promoting autonomy across diverse fields such as transportation, manufacturing, healthcare, architecture, and agriculture [1–3]. Traditional AISs relying on rule- and optimization-based technologies can perform simple tasks in specific environments [4, 5]. However, these systems become impractical for complex tasks in dynamic open environments because the required rules are difficult to obtain and the systems themselves grow too complex. To address this issue, the rapidly advancing field of AI has brought new opportunities in recent years. Integrating machine learning into AISs’ perception, decision-making, planning, and control processes makes it possible to leverage human prior knowledge and demonstrations to automatically extract potential complex rules and learn human behavior. Human prior knowledge, including decision logic, decision processes, knowledge and experience, value preferences, etc., can be encoded to guide the decision-making and behavior of AISs [6]. Human demonstrations record how humans behave in complex tasks, providing learning material for AISs. In this regard, one of the most typical approaches is imitation learning, which includes behavior cloning (BC), which infers intelligent decision models directly from human demonstration data [7]; generative adversarial imitation learning (GAIL) [8]; and inverse reinforcement learning (IRL), which infers reinforcement learning reward models [9]. However, as human needs become more complex, current AI-based AISs still grapple with substantial challenges that underscore the ever-growing complexities and uncertainties arising from interactions among the system, the environment, and humans. One major hurdle lies in the transition from the controlled and finite scenarios encountered during design and testing to the open, infinite, and uncertain environments encountered during actual operation, while another arises from the uncertainty of human needs. Therefore, enhancing the learning and evolution capabilities of AISs is urgently needed to improve their safety, reliability, and adaptability in dynamic and uncertain environments.

Engineering insights from ChatGPT

Recently, the emergence of ChatGPT has shone brightly in the field of natural language processing (NLP), sparking a new revolution in content generation. It can handle the strong uncertainties in “chat” remarkably well, such as different contents, manners, value preferences, and psychological states, and satisfy most people. Specifically, supervised fine-tuning (SFT) relies solely on textual similarity to judge the quality of answers and thus enables ChatGPT to grasp the basic structure and content of conversational tasks, achieving a “superficial resemblance”. Furthermore, to endow ChatGPT with the ability to cope with the strong uncertainties in chat and align with human values, ChatGPT utilizes reinforcement learning from human feedback (RLHF). By collecting human evaluations of the answers, it models a trustworthy alignment standard known as the human value model and uses it to assess the credibility of answers. On this basis, through fine-tuning with reinforcement learning, ChatGPT ultimately aligns with human expressions, logic, and common sense, achieving the ability to “chat like a human”. Analogously, the triumph of reliable human feedback (HF) in ChatGPT may offer insights for dealing with the strong complexity and uncertainty in the real-world deployment of AISs.

Human feedback: a denotation of feedback to deal with uncertainties

Feedback has long been proven to be the most effective mechanism for automation systems to cope with uncertainties. The diagram of feedback is depicted in Fig. 1. Traditional feedback control, based on the measurement of industrial physical signals, has achieved significant success in handling plant uncertainties (such as unmodeled dynamics and parameter uncertainties) and external disturbances in industrial automation systems. Subsequently, advancements in information technology have propelled the development of multi-modal feedback, which leverages multi-modal perception techniques to deal with environmental uncertainties such as changes in light, weather, noise, and traffic. Today, as the working environments and task objectives of systems become increasingly complex, HF is introduced into the design, improvement, and learning processes of systems to handle evaluation biases and uncertainties such as personalized demands and changing task needs. In fact, human factors have long played a role in parameter tuning for automated systems and in the development process of automobiles. Consequently, HF endows automation systems with a new connotation of feedback and provides a mechanism for integrating human intelligence and machine intelligence to jointly cope with the growing complexity and diversity of uncertainties.

Figure 1. Diagram of feedback

From ChatGPT to ID: why do we need HF in intelligent driving?

In fields like intelligent driving (ID), where AISs have already achieved notable milestones and demonstrated remarkable capabilities in perception, navigation, decision-making, and motion control, HF may make a difference. Similar to chat scenarios, intelligent driving also faces multiple uncertainties derived from vehicle parameters, the driving environment, driver diversity, and passenger preferences. Figure 2 illustrates the uncertainties in chat and ID. Existing learning-based intelligent driving technologies still struggle to cope with these endless uncertainties and face challenges such as slow learning speed, weak safety, low credibility, and poor adaptability in dynamic real-world traffic environments. On one hand, current methods mainly rely on imitation learning and reinforcement learning techniques. Imitation learning often uses supervised learning to learn driving strategies from human demonstrations in order to replicate human driving skills, while reinforcement learning guides intelligent vehicles to interact with the environment autonomously through predefined rewards and goals. However, these methods typically detach online supervision from humans during the learning process, significantly reducing learning efficiency, quality, safety, and machine credibility, and making it challenging to meet the requirements of practical driving tasks. On the other hand, current intelligent driving systems lack online learning and self-evolution capabilities, making it difficult to cope in real time with the inherent uncertainties, vulnerabilities, and openness of actual driving tasks. Considering the robustness and adaptability demonstrated by humans in complex driving scenarios, introducing HF into the learning cycle of intelligent driving is therefore expected to enhance driving intelligence.

Figure 2. Uncertainties in Chat and ID

2 HF enhanced intelligent driving: a unified framework

To augment driving intelligence so as to deal with uncertainties, we propose a unified framework for HF-enhanced intelligent driving

As shown in Fig. 3, a multi-layer HF-enhanced self-learning algorithms module is first established to endow intelligent driving with the genes of self-evolution, granting it the capability of self-learning. Subsequently, to enable intelligent vehicles to continuously interact with dynamic environments and evolve, a cloud-controlled multi-vehicle collaborative evolution virtual platform is developed and a cloud-controlled fully unmanned intelligent driving testing system is further constructed. Finally, the testing results will be analyzed to evaluate the current driving intelligence quotient (DIQ) and provide evaluation feedback for further iterative evolution of the self-learning algorithms.

Figure 3. A unified framework for HF-enhanced intelligent driving

Multi-layer HF-enhanced self-learning algorithms: born to evolve

Compared to imitation learning, reinforcement learning has advantages such as model independence, no need for labeled demonstrations, and autonomous learning, making it more suitable for learning in the complex and uncertain scenarios of intelligent driving tasks. However, it faces challenges such as low sampling efficiency, slow learning speed, difficulty in ensuring safety, and low credibility of learning results, making it hard to deploy quickly in the field of intelligent driving. Interactive reinforcement learning (Int-RL), based on human-machine interaction and collaboration, aims to learn how to perform tasks from real-time evaluations or feedback from human supervisors, and its application to real problems such as robot control and human-machine cooperation is becoming increasingly widespread. By introducing real-time human supervision or feedback into the traditional reinforcement learning process for real-time reward shaping, policy shaping, exploration guidance, and value function augmentation, interactive reinforcement learning can improve sampling efficiency, accelerate learning, strengthen safety supervision, enhance the credibility of learning results, and provide a fast and reliable online learning and evolution framework based on real-time human-machine collaboration. Therefore, the proposed framework is based on interactive reinforcement learning, which is of great significance for addressing the challenges of slow learning speed, weak safety, low credibility, and poor adaptability in dynamic open real-world traffic environments. More importantly, to leverage human feedback as much as possible for facilitating evolution, we propose a multi-layer mechanism to inject human feedback. As illustrated in Fig. 3, it includes specific modules for policy and reward function pre-training with human demonstrations, reward function adjustment based on human evaluation, online policy fine-tuning using human intervention, and human knowledge injection. First, we collect human driving data and utilize supervised learning and inverse reinforcement learning (IRL) for offline pre-training of the policy and the reward function model to learn an initial policy and a human-like reward function, respectively. With the extracted reward function, we can develop driving strategies aligned with human habits. Next, we combine the pre-trained policy and reward function with reinforcement learning for training in new scenarios while introducing online human evaluations to construct a human value function, thereby adjusting the pre-trained reward function. Through an individual user’s evaluations, the algorithm learns driving-habit preferences similar to that driver’s own. Additionally, we incorporate human online intervention during the reinforcement learning process to guide the exploration of intelligent driving vehicles, obtaining high-quality training data and accelerating the learning process; especially in complex interactive scenarios, human intervention can enhance driving capabilities in challenging situations. Finally, we model humans’ prior general driving rules and habits and inject them into the exploration process of intelligent vehicles to improve learning efficiency and enable intelligent driving to acquire human-like driving logic.
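To make the ordering of the four human-feedback layers concrete, the following is a minimal Python sketch of how they could wrap a generic interactive RL loop. It is an illustration only, not the implementation used in this work; every function name (pretrain_policy, human_intervention, rule_based_mask, etc.) and the toy dynamics are hypothetical placeholders.

```python
# Illustrative arrangement of the four HF layers around a generic RL loop.
# All names and stub bodies are hypothetical placeholders, not the authors' code.
import random

def pretrain_policy(demonstrations):                 # layer 1: BC/IRL policy pre-training (stub)
    return lambda state, actions: random.choice(actions)

def pretrain_reward(demonstrations):                 # layer 1: human-like reward via IRL (stub)
    return lambda state, action: -abs(action)

def human_evaluation(trajectory):                    # layer 2: human scores/rankings (stub)
    return random.uniform(0.0, 1.0)

def human_intervention(state, agent_action):         # layer 3: optional human override (stub)
    return None                                      # None means "no intervention"

def rule_based_mask(state, actions):                 # layer 4: prior driving rules/habits (stub)
    return actions                                   # e.g. remove unsafe lane changes

demos = []                                           # human demonstration data
policy = pretrain_policy(demos)
reward_fn = pretrain_reward(demos)

for episode in range(3):
    state, trajectory = 0.0, []
    for t in range(10):
        candidate_actions = rule_based_mask(state, [-1, 0, 1])
        action = policy(state, candidate_actions)
        override = human_intervention(state, action)
        if override is not None:                     # human takes over exploration
            action = override
        r = reward_fn(state, action)
        trajectory.append((state, action, r))
        state += 0.1 * action                        # toy transition
    score = human_evaluation(trajectory)             # used to adjust reward_fn / value model
    # ...an RL update of `policy` with (trajectory, score) would go here...
```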

Cloud-controlled training & testing platform: empowering environment interactions and feedback to boost evolution

This platform consists of two parts. The first is the cloud-controlled multi-vehicle collaborative evolution platform. It introduces several key innovations, including the development of a multi-driving-simulator in-the-loop system, enabling multi-vehicle adversarial evolution in complex driving scenarios, and facilitating the testing and evaluation of automated driving systems at different levels. Specifically, it comprises essential components and incorporates cutting-edge technologies. The main components include the Digital Twin Scenario Generation System, Environment Sensing Simulation System, Microscopic Traffic Flow Simulation System, Vehicle Hardware Embedded System, and High Fidelity Vehicle Dynamics System. The key technologies encompass Cloud Control Trajectory Optimization, Vision-Force-Somatosensory Collaboration, Extreme Weather Perception Simulation, Multi-Source Heterogeneous Sensing Modeling, Heterogeneous Information Exchange Synchronization, Multi-Vehicle Interaction and Collaboration in Hybrid Traffic, V2X Channel Simulation, Steering Wheel Force, and Driving Intelligence Evaluation. The cloud-controlled multi-vehicle cooperative platform achieves interaction, on one hand, through a host driver operating a host driving simulator and three co-drivers operating co-driving simulators, so as to reflect multi-vehicle interaction behaviors in real traffic as faithfully as possible. On the other hand, the traffic vehicles can be controlled through interactive driver models in traffic flow simulation software such as SCANeR and SUMO, or through self-developed driver models, to achieve more complex interactions with the host and co-driving simulators. The second part, the cloud-controlled, fully unmanned automated driving testing system, addresses challenges in existing intelligent driving tests, such as limited test scenario coverage, scattered test condition configurations, simplified environmental conditions, binary evaluation of driving intelligence, and fragmented evaluation of intelligent driving functions. This system is based on 5G cloud control technology for the collaborative planning and control of traffic participants. It establishes a comprehensive unmanned testing and evaluation framework tailored to intelligent connected vehicles, aiming to achieve rapid and dynamic generation of corner-case scenarios, fully automated accelerated testing, and multidimensional evaluation of intelligent driving functions. The system encompasses three key technologies: autonomous testing techniques, autonomous testing equipment, and an autonomous evaluation system. Its goal is to accelerate the testing of the intelligence of intelligent driving vehicles, providing validation support for the admission and practical implementation of high-level intelligent driving vehicles. Consequently, by integrating these components and technologies, the cloud-controlled training and testing platform provides a robust and sophisticated infrastructure for the advancement of intelligent driving systems, offering a comprehensive approach to testing, evaluation, and evolution in diverse and challenging scenarios.

Driving intelligence evaluation: guiding the evolution direction

Evaluating the intelligence quotient (IQ) of an intelligent system is crucial for determining whether evolution is necessary and the direction in which evolution should occur. Despite the explosive development of intelligent vehicles, assessing their intelligence remains challenging. This arises from the conflict between the deep reliance on test scenarios and the infinite variability of real-world scenarios. In the context of evaluating driving intelligence, the Turing test, traditionally applied to assess artificial intelligence, undergoes a transformation. We therefore propose a driving IQ evaluation scheme for intelligent driving. Instead of the conventional question-and-answer format of the Turing test, in which a human tester gauges responses to determine whether they originate from a human or a machine, in the proposed scheme the Q&A model is replaced by a scenario bank comprising predefined scenes and tasks that the vehicle encounters in simulation, on a hardware-in-the-loop (HIL) platform, or in real-world settings. In addition, typical behavior metrics are extracted to evaluate a behavior index that assesses how closely an intelligent vehicle approaches human-level driving. Finally, the evaluation process involves a two-part model: one part calculates scenario complexity according to the scenario bank, while the other assesses vehicle behavior. This model, replacing the human tester, defines the DIQ of an intelligent vehicle as a function of scenario complexity and the behavior index. Note that in the proposed framework, the driving IQ is mainly used to assess and provide feedback on the basic and universal driving intelligence of autonomous driving, promoting driving strategies that possess fundamental intelligence, such as the ability to complete specific tasks, as well as basic safety, comfort, and efficiency. As for the human evaluative feedback, on the one hand, it can be used to evaluate and provide corrective feedback on universal driving policies, encouraging the driving strategy to align with the human’s own value preferences, that is, the ability to be “self-like”. On the other hand, when autonomous vehicles encounter scenarios or driving tasks that are difficult to handle, human intervention feedback can be combined with evaluative feedback to enhance the driving policy’s ability to deal with these challenging situations.
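As a purely hypothetical illustration of the two-part evaluation model, the sketch below aggregates (scenario complexity, behavior index) pairs from a scenario bank into a single DIQ score. The functional form f and the numbers are assumptions; the paper only states that DIQ is a function of the two quantities.

```python
# Hypothetical DIQ aggregation: the paper defines DIQ as a function of scenario
# complexity and a behavior index but does not fix the functional form; a
# complexity-weighted average over the scenario bank is one simple choice.
def driving_iq(scenarios, f=lambda complexity, behavior: complexity * behavior):
    """scenarios: list of (scenario_complexity, behavior_index) pairs."""
    return sum(f(c, b) for c, b in scenarios) / len(scenarios)

scenario_bank = [(0.3, 0.9), (0.7, 0.6), (1.0, 0.4)]   # toy values
print(driving_iq(scenario_bank))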

3 Multi-layer HF enhanced self-learning algorithms

An interactive deep reinforcement learning framework

The interaction process among the intelligent driving system (agent), the traffic environment, and the human can be formulated as a sequential optimal decision problem based on a Markov decision process (MDP). A typical MDP is denoted by the quintuple \(M = \langle \mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma \rangle \), where \(\mathcal{S}\) and \(\mathcal{A}\) represent the sets of all states and actions, respectively. \(\mathcal{T}\) denotes the transition function \(P_{s_{t}s_{t + 1}}^{a_{t},a_{h,t}}\), which describes the probability of the transition from \(s_{t}\) to \(s_{t + 1}\) when applying the agent action \(a_{t}\) or the human action \(a_{h,t}\). \(\mathcal{R}\) denotes the reward function, and \(r_{s_{t}s_{t + 1}}^{a_{t},a_{h,t}}\) represents the reward of the above one-step transition. The policy of the intelligent driving system can thus be denoted as \(\pi (a_{t}|s_{t}) = P_{s_{t}}^{a_{t}}\), which describes the probability of executing \(a_{t}\) in state \(s_{t}\) of the environment.

The goal of RL is to find a policy \(\pi ^{*}\) that maximizes the expected reward of the agent. To this end, two value functions \(V_{\pi} (s_{t})\) and \(Q_{\pi} (s_{t},a_{t})\) are introduced: the former evaluates executing the policy π starting from state \(s_{t}\), while the latter starts from the state-action pair \((s_{t},a_{t})\). To find the optimal policy and the optimal values \(V^{*}(s_{t})\) and \(Q^{*}(s_{t},a_{t})\), multiple algorithms have been proposed [10]. For example, the double deep Q-network (double DQN) was proposed to reduce the overestimation phenomenon of DQN. Suppose the parameters of the Q network are \(\phi _{t}\) and the parameters of the target Q network are ψ; then \(\phi _{t}\) can be updated by:

$$ \begin{aligned} \phi _{t + 1} &= \phi _{t} + \alpha \Bigl[ r_{s_{t}s_{t + 1}}^{a_{t},a_{h,t}} \\ &\quad{} + \gamma Q \Bigl( s_{t + 1},\arg \max _{a_{t + 1} \in \mathcal{A}}Q ( s_{t + 1},a_{t + 1};\phi _{t} );\psi \Bigr) \\ &\quad{} - Q ( s_{t},a_{t};\phi _{t} ) \Bigr] \\ &\quad{} \times \nabla Q ( s_{t},a_{t};\phi _{t} ), \end{aligned} $$
(1)

where α is the learning rate.
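A minimal PyTorch sketch of the double-DQN update in Eq. (1) is given below: the online network selects the next action and the target network evaluates it, and a gradient step is taken on the squared TD error. The network size, state dimension, and hyperparameters are illustrative assumptions, not the settings used in this work.

```python
# Minimal sketch of the double-DQN update in Eq. (1); architecture and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, n_actions, gamma, alpha = 8, 3, 0.99, 1e-3

def make_q_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

q_net, target_net = make_q_net(), make_q_net()            # parameters phi_t and psi
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def double_dqn_update(s, a, r, s_next):
    """s, s_next: (B, state_dim); a: (B, 1) int64; r: (B, 1)."""
    q_sa = q_net(s).gather(1, a)                           # Q(s_t, a_t; phi_t)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax under phi_t
        target = r + gamma * target_net(s_next).gather(1, a_star)  # evaluated under psi
    loss = nn.functional.mse_loss(q_sa, target)            # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy batch
B = 4
double_dqn_update(torch.randn(B, state_dim),
                  torch.randint(0, n_actions, (B, 1)),
                  torch.randn(B, 1),
                  torch.randn(B, state_dim))
```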

Human-enhanced behavior modeling

1) Policy initialization with human demonstrations. In this part, imitation learning methods are widely used, in which an agent learns by imitating or mimicking the behavior of a demonstrator or expert. Typical methods include BC, which directly maps inputs to outputs via supervised learning; GAIL, which formulates the learning process as a game between a generator and a discriminator; and IRL, which infers a reward model while using RL techniques to optimize the policy.

Taking BC as an example, it fits a mapping from the inputs to the outputs of a dataset. The mean-squared-error loss is usually used as the loss function and is defined as:

$$ L ( x_{i},y_{i};\theta ) = \frac{1}{N}\sum _{i = 1}^{N} \bigl[ \pi _{\theta} ( y_{i}| x_{i} ) - y_{i} \bigr]^{2}, $$
(2)

where \(x_{i}\) and \(y_{i}\) are the inputs and outputs, respectively; \(\pi _{\theta} \) is the policy with parameter vector θ, and N is the batch size. Further, the gradient descent method is utilized to update the parameters of the policy network:

$$ \theta _{k + 1} = \theta _{k} - \beta \cdot \nabla _{\theta} L ( x_{i},y_{i};\theta _{k} ). $$
(3)

Therefore, BC can be used to pre-train the policy network \(\pi _{\theta _{\mathrm{pre}}}\) for intelligent driving. Compared with using reinforcement learning to explore from scratch, its benefits are obvious: firstly, offline training avoids interaction with the real environment and therefore has no impact on it; secondly, pre-training gives the initial policy model a relatively reliable level of performance.
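A minimal PyTorch sketch of BC pre-training according to Eqs. (2)-(3) follows. The demonstration data here are random stand-ins, and the network architecture and learning rate are illustrative assumptions.

```python
# Minimal sketch of BC pre-training per Eqs. (2)-(3); the demonstration data,
# architecture, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, act_dim, beta, batch_size = 8, 1, 1e-2, 64
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
optimizer = torch.optim.SGD(policy.parameters(), lr=beta)    # gradient step of Eq. (3)

# stand-in human demonstration dataset (x_i: observations, y_i: human actions)
x = torch.randn(1024, obs_dim)
y = torch.randn(1024, act_dim)

for epoch in range(5):
    idx = torch.randperm(len(x))[:batch_size]                # sample a mini-batch
    loss = nn.functional.mse_loss(policy(x[idx]), y[idx])    # MSE loss of Eq. (2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```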

2) Policy adjustment based on human intervention. Due to the random exploration of the RL agent, the learning efficiency is relatively low. To accelerate the vehicle’s learning, a human driver evaluates the vehicle’s learning level in real time and intervenes to assist learning via BC on the intervention data according to Eqs. (2)-(3).
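One hypothetical way to organize this online adjustment is sketched below: whenever the human overrides the agent, the (state, human action) pair is stored and periodically replayed for BC updates of the current policy. This DAgger-like bookkeeping is an assumption for illustration; all names are placeholders.

```python
# Hypothetical sketch of online policy adjustment from intervention data:
# human overrides are logged and replayed for BC updates per Eqs. (2)-(3).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
intervention_buffer = []                                  # (state, human_action) pairs

def step_with_human(state, agent_action, human_action=None):
    """Returns the executed action; logs the pair when the human overrides (hi = 1)."""
    if human_action is not None:                          # hi = 1
        intervention_buffer.append((state, human_action))
        return human_action
    return agent_action                                   # hi = 0

def bc_update_from_interventions(batch_size=32):
    if len(intervention_buffer) < batch_size:
        return
    idx = torch.randperm(len(intervention_buffer))[:batch_size].tolist()
    s = torch.stack([intervention_buffer[i][0] for i in idx])
    a_h = torch.stack([intervention_buffer[i][1] for i in idx])
    loss = nn.functional.mse_loss(policy(s), a_h)         # Eq. (2) on intervention data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# toy usage: the human overrides the agent's action once
step_with_human(torch.randn(8), agent_action=torch.zeros(1), human_action=torch.ones(1))
bc_update_from_interventions()
```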

Human-enhanced value augmentation

1) Reward initialization with human experiences. In the context of RL, the reward model can be a mechanism-driven model or a data-driven model. Initializing the reward model with human experiences involves leveraging human prior knowledge to design the reward function or utilizing human demonstrations to recover the reward function. Specifically, for the mechanism-driven model, the knowledge refers to humans’ prior general driving rules, habits, and preferences, such as pursuing maximum traveling efficiency, avoiding collisions, yielding at intersections, and adhering to traffic regulations. As a result, the agent gains a foundational understanding of human driving behavior. The data-driven model, in contrast, learns the reward function from observed expert data, for example through IRL. This is advantageous when explicit reward functions are hard to define, since IRL enables the extraction of reward information implicitly encoded in expert demonstrations. Moreover, the two kinds of models can be integrated: an initial reward structure with adjustable parameters is first defined, and these parameters are then learned or optimized through inverse learning. In either case, the initialization provides a meaningful starting point for the learning agent, steering its behavior towards desired outcomes from the outset. This is particularly valuable in domains like ID, where expert insights are crucial for shaping the learning process.
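The sketch below illustrates what a mechanism-driven initial reward with adjustable parameters could look like: a weighted sum of efficiency, safety, and rule-compliance terms whose weights could later be refined by inverse learning or human evaluation. The specific terms, thresholds, and default weights are assumptions for illustration only.

```python
# Hypothetical mechanism-driven initial reward: a weighted sum of efficiency,
# safety, and rule-compliance terms; the adjustable weights could later be
# refined by IRL or human evaluation. All terms and values are illustrative.
def initial_reward(speed, v_max, gap_front, min_gap, ran_red_light,
                   w_eff=1.0, w_safe=5.0, w_rule=10.0):
    efficiency = -((speed - v_max) / v_max) ** 2        # prefer traveling near v_max
    safety = -1.0 if gap_front < min_gap else 0.0       # penalize tailgating
    rule = -1.0 if ran_red_light else 0.0               # penalize traffic violations
    return w_eff * efficiency + w_safe * safety + w_rule * rule

print(initial_reward(speed=15.0, v_max=20.0, gap_front=3.0, min_gap=5.0,
                     ran_red_light=False))
```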

2) Reward adjustment based on human intervention. Even a carefully elaborated initial reward model may still fail to describe accurate driving goals, especially in complex scenarios with strong interactions. Therefore, in our framework, human supervision is introduced, and the conflicts between human interventions and the agent’s actions are utilized to adjust the initial reward model. In a typical example, the reward function of the intelligent driving agent is defined as:

$$ \begin{aligned} r_{s_{t}s_{t + 1}}^{a_{t},a_{h,t}} = \textstyle\begin{cases} w_{v} ( \frac{v_{x} - v_{\max}}{v_{\max}} )^{2} - w_{h} ( \frac{a_{t} - a_{h,t}}{d_{h}} )^{2}, \\ \quad hi = 1,a_{t} \ne a_{h,t}, \\ w_{v} ( \frac{v_{x} - v_{\max}}{v_{\max}} )^{2} + w_{h} ( 1 - \frac{a_{t} - a_{h,t}}{d_{h}} )^{2}, \\ \quad hi = 1,a_{t} = a_{h,t}, \\ w_{v} ( \frac{v_{x} - v_{\max}}{v_{\max}} )^{2}, \quad hi = 0, \end{cases}\displaystyle \end{aligned} $$
(4)

where \(w_{v}\) weights the squared difference between the current speed \(v_{x}\) and the maximum speed limit \(v_{\max}\) to reflect driving efficiency; \(w_{h}\) is the coefficient that weighs the difference between the exploration action \(a_{t}\) and the human action \(a_{h,t}\); hi is the human intervention flag; \(d_{h} = \max(a_{t} - a_{h,t})\) is the maximum difference between machine and human actions; and \(a_{t}\) and \(a_{h,t}\) belong to \(\mathcal{A}\), which is set as \([1,0, - 1]^{\mathrm{T}}\) in the simulation, representing three high-level lateral control commands: left lane change, no lane change, and right lane change, respectively. If the human intervenes, \(hi = 1\); otherwise \(hi = 0\). It is assumed that the human intervenes both to penalize and to encourage the agent’s decisions.
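For concreteness, the following is a direct transcription of Eq. (4) into Python over the action set [1, 0, −1] described above. The weights and the example call are illustrative; d_h is interpreted here as the largest possible machine-human action gap over that set.

```python
# Transcription of the intervention-shaped reward in Eq. (4); weights and the
# example call are illustrative.
ACTIONS = [1, 0, -1]                  # left lane change, no lane change, right lane change
d_h = max(ACTIONS) - min(ACTIONS)     # maximum possible machine-human action gap

def shaped_reward(v_x, v_max, a_t, a_h_t, hi, w_v=1.0, w_h=1.0):
    efficiency = w_v * ((v_x - v_max) / v_max) ** 2
    if hi == 0:                                   # no human intervention
        return efficiency
    if a_t != a_h_t:                              # intervention conflicts with the agent
        return efficiency - w_h * ((a_t - a_h_t) / d_h) ** 2
    return efficiency + w_h * (1 - (a_t - a_h_t) / d_h) ** 2   # intervention agrees

print(shaped_reward(v_x=10.0, v_max=20.0, a_t=1, a_h_t=0, hi=1))
```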

3) Alignment with human value via human evaluation. Although human intervention can be used to adjust the initial reward, it still relies on a “superficial resemblance” and requires a large amount of human manipulation. To develop a more general human-enhanced mechanism that aligns the agent with human values, as shown in Fig. 4, we propose a human evaluation based method in which human scores or rankings of the agent’s performance are collected to establish a human value model using supervised learning methods. This model replaces the adjusted reward model and is finally used to fine-tune the behavior policy through RL.

Figure 4. Alignment with human value via human evaluation
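One common realization of such a value model, sketched below under the assumption of pairwise rankings, trains a scorer on human preferences over trajectory features with a Bradley-Terry style loss, as is typical in RLHF pipelines. This is an assumption for illustration, not necessarily the supervised learning method used in this work; the feature dimension and network are placeholders.

```python
# Hypothetical human value model trained from pairwise human rankings of
# trajectories using a Bradley-Terry style loss; all settings are illustrative.
import torch
import torch.nn as nn

feat_dim = 16
value_model = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_model.parameters(), lr=1e-3)

def preference_update(traj_preferred, traj_other):
    """Each argument: (B, feat_dim) trajectory features; the first is human-preferred."""
    margin = value_model(traj_preferred) - value_model(traj_other)
    loss = -nn.functional.logsigmoid(margin).mean()      # Bradley-Terry pairwise loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy batch of human-ranked trajectory pairs
preference_update(torch.randn(8, feat_dim), torch.randn(8, feat_dim))
```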

Human-enhanced exploration & exploitation

In addition to the behavior and value models, another factor influencing the practical application of reinforcement learning (RL) is the mechanism of exploration and exploitation. Traditional random exploration and exploitation mechanisms struggle to ensure the safety of the agent during both training and deployment, and are often accompanied by inefficient exploration that significantly reduces learning efficiency. In contrast, human drivers’ learning is inherently safe and efficient. For example, novice drivers explore lane changes or overtaking behaviors under the premise of safety to achieve high traffic efficiency; obviously, people do not enhance their driving skills by continually getting into accidents. Therefore, we propose to emulate the learning process of human drivers to enhance the exploration and exploitation capabilities of agents. The related measures include constructing human-like safe driving areas and exploration mechanisms within an MPC framework, defining the driving tendencies of human drivers, and designing constraints for agent exploration and exploitation based on the general rules observed in different human tasks [11, 12].
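A minimal sketch of one such constraint mechanism is given below: candidate actions that violate simple prior driving rules are masked out before epsilon-greedy sampling, so random exploration never leaves the human-like safe driving area. The rules, thresholds, and lane convention are assumptions for illustration, not the constraints used in [11, 12].

```python
# Illustrative human-like safe exploration via rule-based action masking before
# epsilon-greedy sampling; rules and thresholds are assumptions.
import random

ACTIONS = [1, 0, -1]      # left lane change, keep lane, right lane change

def safe_actions(lane, n_lanes, gap_left, gap_right, min_gap=8.0):
    allowed = [0]                                         # keeping the lane is always allowed
    if lane > 0 and gap_left > min_gap:                   # left lane exists and gap is safe
        allowed.append(1)
    if lane < n_lanes - 1 and gap_right > min_gap:        # right lane exists and gap is safe
        allowed.append(-1)
    return allowed

def explore(q_values, lane, n_lanes, gap_left, gap_right, eps=0.1):
    allowed = safe_actions(lane, n_lanes, gap_left, gap_right)
    if random.random() < eps:                             # explore only within the safe set
        return random.choice(allowed)
    return max(allowed, key=lambda a: q_values[ACTIONS.index(a)])   # exploit within the safe set

print(explore([0.2, 0.5, 0.1], lane=1, n_lanes=3, gap_left=10.0, gap_right=4.0))
```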

4 Merging into a congested ramp: a typical example

Congested ramp scenarios have always been challenging for intelligent driving, primarily due to high-density traffic flow, frequent variations in vehicle speeds and gaps, and the need to judge accurately the opportune moment to merge into the target lane. This requires intelligent driving systems to make precise decisions swiftly in highly congested and dynamic traffic environments so that the vehicle can exit via the ramp safely and efficiently. In this part, we illustrate the effectiveness of the proposed framework in a typical congested ramp scenario, as depicted in Fig. 5. Note that the proposed framework can selectively integrate each type of human feedback method according to the needs of algorithm development. In this case study, the algorithms used are mainly based on the reinforcement learning decision-making algorithm, model predictive control algorithm, and safety assurance mechanism from our previous work [11]. Building upon this, the framework introduces real-time assessment by human supervisors to evaluate the vehicle’s learning level and determine whether to intervene with actions. It also encourages the vehicle to pursue human preferences and penalizes poor experiences that conflict with human intentions, thereby assisting the vehicle in learning.

Figure 5. A typical congested ramp scenario

Owing to the challenges posed by perception uncertainties and complex vehicle dynamics, an RL agent trained in a simplistic simulation environment encounters significant limitations when applied to real-world intelligent driving. In this study, the CarSim simulation environment is therefore adopted to incorporate a high-fidelity vehicle model that accurately reproduces real vehicle dynamics, augmenting the authenticity of the approach. To incorporate human feedback into the training process, a human-in-the-loop simulation platform is established using MATLAB/Simulink and CarSim. This platform displays the real-time driving status of both the host vehicle and the surrounding vehicles on a computer screen. Within this setup, a human supervisor intervenes by pressing keys on the keyboard to make decisions for the host vehicle. In practical applications, the human supervisor could convey such instructions through actions like using the turn signal lever or through interactive methods such as voice commands and gestures. Additionally, to create a more realistic and interactive traffic flow, the intelligent driver model (IDM) from the Simulation of Urban MObility (SUMO) software is incorporated.
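For reference, the standard IDM car-following law that drives such surrounding traffic is sketched below. The formula is the widely published IDM; the parameter values are typical defaults chosen here for a congested setting, not necessarily those configured in SUMO for this study.

```python
# Standard intelligent driver model (IDM) acceleration law; parameter values
# are typical defaults, not necessarily those used in the study.
import math

def idm_acceleration(v, v_lead, gap,
                     v0=20 / 3.6,   # desired speed (m/s); low for congested traffic
                     T=1.5,         # desired time headway (s)
                     s0=2.0,        # minimum gap (m)
                     a_max=1.5,     # maximum acceleration (m/s^2)
                     b=2.0):        # comfortable deceleration (m/s^2)
    dv = v - v_lead
    s_star = s0 + max(0.0, v * T + v * dv / (2 * math.sqrt(a_max * b)))
    return a_max * (1 - (v / v0) ** 4 - (s_star / gap) ** 2)

print(idm_acceleration(v=5.0, v_lead=3.0, gap=6.0))
```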

In the simulation, the host vehicle starts with an initial speed of 20 km/h at a distance of 120 m from the last merging point. Concurrently, the nine vehicles in the exit lane are assigned random initial speeds (ranging from 0 to 20 km/h) and gaps (ranging from 1 m to 5 m). The host vehicle is tasked with efficiently entering the ramp through strategically executed cut-in behaviors. Accordingly, the states include the positions and velocities of all vehicles, along with the distance to the last merging point, while the action space consists of the target lane IDs. To further minimize the entry time into the ramp, a duration reward is introduced. Additionally, the longitudinal and lateral motion planning tasks are executed using model predictive control (MPC). Due to the unique characteristics of this scenario, such as low speed and large steering wheel angles, a vehicle kinematics model is employed. Moreover, the MPC incorporates constant-acceleration predictions for surrounding vehicles to address the densely dynamic traffic conditions. For further details on parameter settings, please refer to [11].
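The following sketch illustrates the two prediction ingredients mentioned above: a kinematic bicycle model as the MPC prediction model for the host vehicle, and constant-acceleration extrapolation of surrounding vehicles. The wheelbase, time step, and horizon are illustrative assumptions; the parameters actually used are given in [11].

```python
# Sketch of the MPC prediction ingredients: a kinematic bicycle model for the
# host vehicle and constant-acceleration prediction of surrounding vehicles.
# Wheelbase, time step, and horizon are illustrative; see [11] for actual settings.
import math

def kinematic_step(x, y, psi, v, a, delta, L=2.7, dt=0.1):
    """One Euler step of the kinematic bicycle model (a: acceleration, delta: steering angle)."""
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi += v / L * math.tan(delta) * dt
    v += a * dt
    return x, y, psi, v

def predict_surrounding(s, v, a, horizon=20, dt=0.1):
    """Constant-acceleration prediction of a surrounding vehicle's longitudinal motion."""
    traj = []
    for _ in range(horizon):
        s += v * dt + 0.5 * a * dt ** 2
        v = max(0.0, v + a * dt)
        traj.append((s, v))
    return traj

print(kinematic_step(0.0, 0.0, 0.0, 5.0, a=0.5, delta=0.05))
print(predict_surrounding(0.0, 3.0, -0.5)[:3])
```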

Figures 6(a) and 6(b) illustrate the average episodic reward and the time taken to enter the ramp with and without human supervision, respectively, where 1ep, 5ep, and 10ep denote the results averaged over the past 1, 5, and 10 episodes during training. Figure 6(d) depicts the longitudinal distance from the host vehicle to the front and rear traffic vehicles, along with the host vehicle’s positions. It is evident that learning proceeds significantly faster when human supervision is present than when it is absent. This accelerated learning can be attributed to the high-quality intervention data provided by human online supervision during the training process, which prevents the DRL agent from engaging in meaningless random exploration. Moreover, human supervision offers appropriate rewards and penalties, further contributing to the training speed. The average time to enter the ramp ultimately converges to about 20 s, slightly faster than in the scenario without human supervision. Additionally, Fig. 6(c) presents the human intervention ratio (HIR), which denotes the proportion of human-guided data collected during training relative to the total data, including the agent’s self-exploration data. The graph reveals a rapid decline in the HIR at the beginning of training, stabilizing after a certain duration. This pattern signifies a continuous growth in human trust in the agent’s intelligence, leading to a reduction in intervention frequency. The rapid early decrease in HIR is attributed to the guidance advantage of human supervision data during the training process: as the agent’s self-exploration data is initially of lower quality, human drivers intervene more frequently, resulting in a higher proportion of human data. During this phase, the agent benefits from human-guided data through the conditional sampling mechanism to expedite and enhance learning. As the agent’s learning quality improves, human driver interventions decrease, allowing the proportion of agent data to gradually increase. This transition enables the use of a more diverse dataset, beyond human data alone, to enhance the optimality of the strategy; consequently, the agent can leverage its self-exploration data through the conditional sampling mechanism to learn more effective strategies. It can also be observed from Fig. 6(d) that the distance of the ego vehicle to the front vehicle is always greater than 0, while the distance to the rear vehicle is always less than 0 (a value less than 0 indicates that no collision occurred), showing that the ego vehicle did not collide with the surrounding vehicles throughout the training process. At the same time, the lateral position of the ego vehicle remains between −2 m and 6 m (within the lane boundary range), indicating that the ego vehicle never exceeded the lane boundaries. These results are attributable to the strong safety-constraint handling capability of the model predictive control based motion control module within this framework.

Figure 6. (a) Average reward and time to enter the ramp of the host vehicle with human supervision. (b) Average reward and time to enter the ramp of the host vehicle without human supervision. (c) Human intervention ratio. (d) Longitudinal distance from the host vehicle to the front and rear traffic vehicles, as well as the positions of the host vehicle

The case presented above shows preliminary results aimed at verifying the effectiveness of the proposed framework at the algorithmic level. The cloud-controlled training and testing platform has been built and validated. The next step will be to conduct a comprehensive verification of the proposed framework and algorithms on the cloud control platform. Besides, as tasks become more complex and rewards become sparser, human feedback will play a greater role. We will also validate the algorithm in more complex driving tasks to explore the full potential of human feedback.

5 Conclusion

Feedback has long been proven to be the most effective mechanism for automation systems to cope with uncertainties. Moving from traditional feedback to human feedback, this paper argues that HF provides a mechanism for integrating human intelligence with machine intelligence, which can enhance machine intelligence, constrain machine behavior, and jointly cope with the growing complexity and diversity of uncertainties. Moreover, a unified framework for HF-enhanced intelligent driving is proposed, involving multi-layer HF-enhanced learning algorithms, a cloud-controlled training and testing platform, and driving intelligence evaluation, and it is validated in a congested ramp scenario. It can be anticipated that HF will become a new paradigm for enhancing the intelligence of AISs and promoting their widespread application.