Performance Evaluation Method for Automated Driving System in Logical Scenario

With the continuous improvement of automated driving technology, how to evaluate the performance of an automated driving system is attracting more and more attention. Meanwhile, with the creation of scenario-based test methods, the traditional evaluation index based on a single test can no longer meet the requirements of high-level safety verification for automated driving systems, and performance evaluation of such systems in logical scenarios will become the mainstream. Based on the scenario-based test method and Turing test theory, a performance evaluation method for an automated driving system in the whole parameter space of a logical scenario is proposed. The logical scenario parameter space is partitioned according to the risk degree of the concrete scenarios, and the evaluation process in the different zones is determined. Subsequently, the anthropomorphic index in the safe zone and the collision-avoidance index in the danger zone are defined by comparing test results with human driving and ideal vehicle motion. Taking front vehicle low-speed and cut-out scenarios as examples, two automated driving algorithms are tested in the virtual environment, and the test results are evaluated both by the proposed method and by human observation. The results of the proposed method are consistent with the subjective feelings of humans; additionally, the method can be applied to scenario-based tests and the verification process of an automated driving system.


Introduction
With the continuous improvement of automated driving technology, the evaluation methods of traditional vehicles have become unable to meet the evaluation requirements of high-level automated driving systems (ADSs) [1]. To solve this problem, the PEGASUS project proposed a scenario-based test and evaluation method, which puts forward the concepts of logical and concrete scenarios according to the vehicle system design process [2]. The logical scenario refers to the scenario type described by a parameter space, and the concrete scenario is described by specific parameters. When an ADS is designed, the operational design domain (ODD) is aimed at similar driving conditions rather than specific parameters. Therefore, unlike for traditional vehicles, the performance evaluation of the ADS should be oriented to the whole parameter space instead of specific scenario parameters [3], i.e., performance evaluation in logical scenarios is the mainstream for the development of ADSs.
Performance evaluation has been an important part of ADS development [4]. In the early research on automated vehicles (AVs), when the function of the ADS was relatively simple, performance evaluation was based on the accident-free mileage under simple road conditions [5]. With the gradual development of technology, a variety of AV races were organized to compare the performance of ADSs designed by different institutions, evaluated by task-driven methods. The most famous competitions were the series organized by the Defense Advanced Research Projects Agency in the United States, which compared task completion by participating vehicles in preset test scenarios (the 2004 and 2005 desert challenges and the 2007 urban challenge) [6,7]. With the increasing complexity of the ADS function, the number of competitions is increasing, with famous races including the World Intelligent Driving Challenge, Intelligent Vehicles Future Challenge, and Grand Cooperative Driving Challenge [8,9]. This kind of task-driven evaluation method applies to specific scenario parameters. Once these parameters change, the evaluation results cannot characterize the performance of ADSs in the new scenario. Most of these task-driven evaluation indicators were based on the subjective judgment of experts; hence, these indexes could not be reused effectively in different test scenarios. This method may also encourage participating vehicles to be tailored to the competition, taking actions that cannot be applied in actual road driving. Some standards and regulations related to AVs, such as ISO 17361 and the Euro NCAP AEB test protocol [10], are also based on task-driven evaluation.
ADS performance can be evaluated by mileage-based methods, in which miles per disengagement is an important index [11]. The performance of the tested vehicle is calculated according to the total driving mileage, number of manual takeovers, and number of accidents over a year. The Department of Motor Vehicles releases relevant AV disengagement reports every year for evaluation and compares all tested vehicles. Based on the overall road test data in a year, such an evaluation is costly and time-consuming [12] and may cover many repeated scenarios during testing, which is also difficult to combine with scenario-based test methods.
The autonomy levels for unmanned systems framework for robot intelligence evaluation can also be used to measure ADS performance [13]. Proposed by the National Institute of Standards and Technology in 2003, it divides the evaluation dimensions into scenario complexity, task complexity, and human intervention. The Automated Driving Applications and Technologies for Intelligent Vehicles project carried out in Europe divides evaluation methods into four parts: technology, user-related, traffic, and impact [14]. The National Highway Traffic Safety Administration proposed a safety impact methodology [15], which explains the relationship between test data, crash scenario development, algorithm model building, testing, and benefit evaluation. These frameworks are still aimed at a specific test condition, and it is difficult to introduce a manual intervention index in ADS simulation tests.
A number of other studies have carried out ADS-related performance evaluation. Li et al. studied the weighting method of test results in different scenarios [16]. Wang et al. combined a subjective evaluation of drivers with radar charts to obtain quantitative evaluation results [17]. Sun examined the intelligence level of ADSs from the aspects of vehicle control, basic driving, basic traffic, advanced driving, and advanced traffic behavior [18]. Goodrich et al. proposed a three-dimensional intelligent space evaluation method to evaluate the autonomy of ADSs regarding mobility control, motion planning, and situational awareness [19].
Zhang et al. adopted the technique for order preference by similarity to an ideal solution to evaluate vehicle target recognition and classification capabilities according to the distance between vehicle indicators and ideal attribute values [20]. Du et al. constructed a neural network based on interactive samples of naturalistic driving data to obtain an objective representation index set to express the evaluation level of vehicle performance [21]. These methods still lack application in the logical scenario.
Besides the above issues, existing evaluation methods rarely consider the similarity between AVs and human driving, i.e., the anthropomorphic index. The similarity is mostly evaluated at specific locations without considering the overall driving trajectory. Quante et al. used human driving behavior as a benchmark to consider the performance of the measured ADS in hazardous conditions [22]. Since the invention of robotics, the comparative analysis of machine and human intelligence has become a research hotspot. As AVs are a typical representative of intelligent machines, the degree to which they drive like humans has attracted more attention. However, studies focus on humanoid control strategies [23,24] and lack quantitative evaluation indexes of the humanoid nature of the driving process. The anthropomorphic level of AVs mainly affects two aspects. For the occupant, differences between the driving behavior of the AV and that of a human affect psychological acceptance, thus leading to distrust of the ADS. For other traffic participants, a low-intelligence ADS cannot interact with them correctly, thus affecting traffic efficiency and even causing traffic accidents.
To address the above issues, this paper proposes a performance evaluation method for an ADS in the whole logical scenario parameter space. It can be combined with the scenario-based ADS test method, effectively remedying the shortcomings of existing approaches. By segmenting the logical scenario parameter space, the quantitative evaluation of ADS performance in the whole logical scenario parameter space is realized. In addition, a method is established to quantitatively evaluate the similarity between automated driving and manual driving, providing a visual comparison between different driving modes in specific test cases.
The remainder of this paper is organized as follows: Sect. 2 introduces the overall process of the proposed method. Section 3 presents the construction process of the evaluation indexes. An evaluation example using the proposed method is provided in Sect. 4. Finally, concluding remarks and a plan for future work are provided in Sect. 5.

Evaluation Method of ADS
When evaluating the performance of ADS in the logical scenario, due to the wide coverage of the parameter space, the evaluation focus varies at different locations in the space.
When measuring performance in a concrete scenario at a safe location of the parameter space, the focus should be on anthropomorphic indicators instead of the ability to avoid collisions, since the scenario parameters are so benign that an accident is almost impossible. When evaluating performance at hazardous locations, the focus is on safety, because the scenario parameters are dangerous and the ADS prioritizes the avoidance of danger over passenger comfort.
For the above reasons, this paper divides the performance evaluation of the ADS in the logical scenario into collision-avoidance and anthropomorphic indexes. This section focuses on how to segment the logical scenario parameter space according to the scenario hazard situation and the ADS performance evaluation process in different zones.

Zone Segmentation
The parameter space is divided into safe and hazardous zones according to the hazard level of the concrete scenario parameters. The safe zone is the area where ADS driving is relatively safe and a collision is almost impossible, and the hazardous zone is where collisions are highly likely to happen. The partitioning in this paper is based on time to collision (TTC), which is often used to quantitatively evaluate the level of danger while driving, and is calculated as

$$\mathrm{TTC} = \frac{d}{v_2 - v_1} \tag{1}$$

where $d$ is the relative distance between the front and ego vehicle, and $v_2$ and $v_1$ are the speeds of the ego vehicle and obstacle, respectively. TTC represents the time it takes for a collision to occur if the two vehicles continue to drive in their current state, so a smaller value implies a more dangerous driving situation. However, TTC becomes negative when the speed of the front vehicle exceeds that of the ego vehicle. Although such a situation is safe, it may be identified as dangerous because the value is less than zero; moreover, TTC is undefined when the two vehicles have identical speeds. Hence, the scenario hazard level is measured by the inverse time to collision (ITTC):

$$\mathrm{ITTC} = \frac{v_2 - v_1}{d} \tag{2}$$

The maximum ITTC during driving based on the ideal vehicle motion is used to identify dangerous concrete scenarios. Given the initial velocity $v_1$ of the obstacle ahead, the initial velocity $v_0$ and ideal deceleration $a$ of the ego car, and the initial distance $d_0$ between the obstacle and ego vehicle, it can be assumed that the ego car starts braking at the beginning of the scenario. Its real-time speed $v_2$ during driving and the distance $d$ between the ego car and the obstacle ahead can then be obtained as

$$v_2 = v_0 - a t \tag{3}$$

$$d = d_0 + v_1 t - \left(v_0 t - \tfrac{1}{2} a t^2\right) \tag{4}$$

where $t$ is the current travel time.
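The quantities above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes the obstacle keeps the constant speed v1 and the ego car brakes at constant deceleration a from t = 0, and it finds the largest ITTC by stepping through time. The function names and the time step are illustrative choices.

```python
import math


def ttc(d, v_ego, v_obs):
    """Time to collision: gap divided by closing speed (undefined for equal speeds)."""
    return d / (v_ego - v_obs)


def ittc(d, v_ego, v_obs):
    """Inverse time to collision: closing speed divided by gap."""
    return (v_ego - v_obs) / d


def max_ittc_scan(v0, v1, a, d0, dt=1e-3):
    """Largest ITTC while the ego car brakes at constant deceleration a from t = 0.

    Assumption of this sketch: the obstacle keeps the constant speed v1.
    Returns math.inf if the gap closes to zero (collision under ideal braking)
    and 0.0 if the ego car is never closing in on the obstacle.
    """
    best = 0.0
    k = 0
    while True:
        t = k * dt
        v2 = v0 - a * t                                   # ego speed at time t
        d = d0 + v1 * t - (v0 * t - 0.5 * a * t * t)      # remaining gap at time t
        if d <= 0.0:
            return math.inf
        if v2 <= v1:                                      # ego no longer closing in
            return best
        best = max(best, ittc(d, v2, v1))
        k += 1
```

For instance, `max_ittc_scan(20.0, 10.0, 5.0, 15.0)` evaluates a scenario with a 10 m/s closing speed and a 15 m initial gap under 5 m/s² braking.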
Substituting $v_2$ and $d$ into Eq. (2), the time at which ITTC reaches its maximum during driving can be calculated by taking the derivative with respect to $t$ and setting it to 0; substituting this $t$ back into Eq. (2) gives the maximum ITTC:

$$t = \frac{(v_0 - v_1) - \sqrt{2 a d_0 - (v_0 - v_1)^2}}{a} \tag{5}$$

$$\mathrm{ITTC}_{\max} = \frac{a}{\sqrt{2 a d_0 - (v_0 - v_1)^2}} \tag{6}$$

Statistics show that a collision is possible when TTC is shorter than 1.5 s [25], so 0.7 s$^{-1}$ is used as the threshold value of the maximum ITTC during driving for logical scenario parameter space partitioning. However, the maximum ITTC cannot be calculated for all concrete scenarios in the parameter space (for example, when the expression under the square root is not positive), so some locations cannot be assigned to either the safe or the hazardous zone. To find a continuous boundary, the Gaussian process (GP) is introduced to fit the complete boundary between the safe and hazardous zones [26]. GP regression is often used to infer data at unknown locations, and it uses mean and covariance functions to characterize the data distribution:

$$l(x) \sim \mathcal{GP}\big(m(x),\, k(x, x^*)\big) \tag{7}$$

where $l(x)$ is the fitting result, $m$ is the mean function, which is set to zero in this paper, and $k$ is the covariance function. The squared exponential kernel is chosen as the covariance function:

$$k(x, x^*) = \sigma_l^2 \exp\!\left(-\frac{\lVert x - x^* \rVert^2}{2 \sigma_f^2}\right) \tag{8}$$

where $\sigma_f$ is the feature length scale, $\sigma_l$ is the signal standard deviation, $x$ is the training data, and $x^*$ represents the data at the unknown location.
After GP regression fitting, a continuous boundary between the hazardous and safe locations in the parameter space can be obtained.
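As an illustration of how such a fit infers values at unknown locations, the sketch below implements the posterior mean of a zero-mean GP with a squared exponential kernel for one-dimensional inputs, using only the standard library. It is a simplified stand-in, not the paper's implementation: the one-dimensional input, the parameter names `length_scale` and `signal_std`, the small `noise` term added for numerical stability, and the sample boundary values are all assumptions of this example.

```python
import math


def sq_exp_kernel(x1, x2, length_scale, signal_std):
    """Squared exponential covariance between two scalar inputs."""
    return signal_std ** 2 * math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_posterior_mean(x_train, y_train, x_query,
                      length_scale=1.0, signal_std=1.0, noise=1e-6):
    """Zero-mean GP posterior mean at each query point: k_*^T (K + noise*I)^-1 y."""
    n = len(x_train)
    K = [[sq_exp_kernel(x_train[i], x_train[j], length_scale, signal_std)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    alpha = solve_linear(K, list(y_train))
    return [sum(sq_exp_kernel(xq, x_train[i], length_scale, signal_std) * alpha[i]
                for i in range(n)) for xq in x_query]


# Hypothetical boundary samples: velocity difference (m/s) -> boundary distance (m).
speeds = [2.0, 4.0, 6.0, 8.0, 10.0]
boundary = [4.0, 5.5, 8.0, 11.5, 16.0]
filled = gp_posterior_mean(speeds, boundary, [5.0, 7.0], length_scale=2.0)
```

Because the mean function is zero, predictions far from the training data shrink toward zero; in practice the data would typically be centered first, or a library implementation would be used.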

Evaluation Process
After partitioning the logical scenario parameter space, the evaluation process within the entire parameter space can be determined based on the concrete scenario parameter characteristics (occurrence probability and location) and the evaluation focus at different locations. Performance in the hazardous zone is reflected by the collision-avoidance index, and performance in the safe zone is measured by the anthropomorphic index. The performance evaluation process of the measured algorithm in the whole parameter space of a logical scenario is shown in Fig. 1. The steps of the evaluation process are as follows:
(1) The logical scenario parameter space is partitioned to obtain the safe and hazardous zones.
(2) Representative concrete scenario parameters are selected through combination testing and probability sampling to form the set to be tested. The simulation test platform is built, the ADS is placed under simulation conditions for traversal testing with the concrete scenario set, and the test results are obtained.
(3) The anthropomorphic assessment results of the ADS are analyzed if no collision happens in the safe zone; otherwise, only the collision-avoidance performance of the ADS under test needs to be evaluated. In that case, the safe zone collision results are put directly into the collision loss set for the collision-avoidance evaluation of the ADS, i.e., skip to step (5).
(4) When no collision occurs in the safe zone, the driving trajectory field of skilled drivers in the concrete scenario is obtained from naturalistic driving data (NDD) or driver data collected elsewhere. The trajectory field data are then analyzed at each slice along the road direction: the probability of the trajectory field peak lying at that location is determined and used, with the skilled drivers as the prior distribution, to calculate the probability of generating the trajectory field values at the other sampling points. The results are combined with NDD to obtain the anthropomorphic index within the safe zone.
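The branching in step (3) can be sketched as follows. This is only an illustration of the decision logic described above; the record layout (a dict with `zone` and `collided` keys) and the function name are hypothetical.

```python
def plan_evaluation(results):
    """Decide which indexes to compute from per-scenario test results.

    Each result is a dict with the hypothetical keys:
      'zone'     -- 'safe' or 'hazardous'
      'collided' -- True if the tested ADS collided in that concrete scenario
    """
    safe_collisions = [r for r in results
                       if r["zone"] == "safe" and r["collided"]]
    hazardous_collisions = [r for r in results
                            if r["zone"] == "hazardous" and r["collided"]]
    return {
        # The anthropomorphic index is only analyzed when the safe zone is collision-free.
        "compute_anthropomorphic": not safe_collisions,
        # Safe-zone collisions join the collision loss set used for collision avoidance.
        "collision_loss_set": safe_collisions + hazardous_collisions,
    }
```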

Evaluation Index Construction
This section describes the proposed anthropomorphic index in the safe zone and the collision-avoidance index in the hazardous zone, and then combines the two to obtain the whole performance evaluation index.

Anthropomorphic Index
The Turing test is the most common approach to assess the intelligence level of a machine, i.e., whether it can deceive a human observer by behaving appropriately [27]. For AVs, high-level intelligence is indicated by a high degree of consistency between the system driving process and the human driving process. Studies on the similarity between human and AV driving have chosen specific parameters, such as the type of operation taken by the ADS and the human [28], and the TTC with the front vehicle when the operation is taken [22]. However, these parameters are difficult to generalize to different scenarios and can only describe the similarity at a specific state rather than over the whole driving process. There are also methods that evaluate the consistency of two behaviors by the similarity of their trajectories, such as longest common subsequence [29] and dynamic time warping [30], but these only consider the similarity of trajectories and not the velocity state at the corresponding position, so driving information is insufficiently considered. For a quantitative comparison of consistency over the whole driving process, containing both track and vehicle information, the trajectory field is used to describe the vehicle motion state, and the similarity of the human and ADS driving trajectory fields is evaluated by probability analysis.
The concepts of the instantaneous field and trajectory field are defined as follows. Considering the vehicle as a mass point, the instantaneous field is formed by the influence of the vehicle's motion on the surrounding space-time; it is related to the vehicle's travel trajectory, the speed at the corresponding position, the influence time on the surroundings, and the physical parameters of the vehicle itself. The trajectory field is the sum of the influence of the instantaneous field over the entire trajectory. It is generally assumed that the different ADSs under test run on the same vehicle dynamics. Therefore, when calculating the trajectory field, the physical properties of the vehicle are not considered, and the trajectory field is mainly based on the vehicle motion states [31]. After obtaining the vehicle driving trajectory field, the degree of similarity between human and machine driving can be quantified.
When the trajectory field data for human drivers are obtained, the trajectory field value distributions at different locations are determined. Since driving behavior has the characteristics of a Gaussian distribution [32], the distribution at each location is described using a Gaussian model, which serves as the basis for similarity judgment during machine driving. The numerical distribution of the trajectory field at a given location is

$$f(\varphi) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\varphi - \mu)^2}{2 \sigma^2}\right)$$

where $\varphi$ is the value of the trajectory field and $\mu$ and $\sigma$ are the mean and standard deviation, respectively.
The magnitude of the trajectory field values alone is not enough to judge the similarity. For example, a position farther from the center of a high-speed trajectory may produce the same trajectory field value as a position closer to the center of a low-speed trajectory, which creates an illusion of high consistency and causes recognition errors. Therefore, it is necessary to obtain the location of the trajectory field peak value, i.e., the driving trajectory of the ADS. Considering these two aspects, the anthropomorphic index for a single concrete scenario based on the trajectory field is given by Eq. (12); its parameters are illustrated in Fig. 2 and defined as follows. $L$ is the distance driven along the road direction by the tested ADS: if the AV stops, it is the distance along the road direction from the starting point of AV driving to the stop position; otherwise, it is the road length of the test scenario. $L_{mean}$ is the average distance driven by the human and is calculated in the same way as $L$. $n_h$ is the number of vehicle operations performed by the tested ADS. An operation is defined as either a steering wheel angle greater than 10 degrees followed by a return to center (both steering and lane changing return to the normal position in the opposite direction), or an absolute longitudinal acceleration greater than 0.5 m/s² followed by the gas/brake pedal returning to its initial position. $n_A$ is the average number of operations performed by a human, calculated in the same way as $n_h$. $r$ is the position of a road slice; the slicing interval is 0.5 m, and when the distance between the penultimate sampling position and the termination position is less than 0.5 m, the end position is sampled directly regardless of the 0.5 m interval. $p_{t\_r}$ is the probability that the center of the trajectory field is at that position when humans drive, where the trajectory field center is again described by a Gaussian distribution (Eq. (11)). $p_{t\_2\sigma\_r}$ is the probability that the trajectory field center is at the position two standard deviations from the mean when humans drive. $p_{v\_j}$ is the probability that the value of the trajectory field at the location is generated by human driving, and $p_{v\_2\sigma\_j}$ is the probability that a trajectory field value two standard deviations from the mean is generated by human driving. $n_s$ is the number of sample points in a slice. The operation number correction factor and the driving distance correction factor in Eq. (12) are usually 1. When the ADS performs better than the human (fewer operations and longer distances), $k_3$ and $k_4$ are used to reset the operation number correction and driving distance correction factors to 1. When sampling along a slice, trajectory field values at locations far from the peak position have little influence on the similarity comparison, so there is no need to sample every location of the slice. Samples are taken within 1.5 m above and below the peak of the trajectory field, with 0.5 m as the sampling interval, i.e., seven points are selected along the slice ($n_s$ is 7). If a sampling point would exceed the road boundary, the span between the road boundary and the peak of the trajectory center is instead divided into three equal parts.
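The slicing and sampling rules above can be sketched as follows. This is a minimal illustration under stated assumptions: slices start at position 0, lateral positions are plain coordinates in meters, and the function names are hypothetical.

```python
import math

SLICE_STEP = 0.5        # m, interval between road slices
SAMPLE_STEP = 0.5       # m, lateral interval between sample points
SAMPLES_PER_SIDE = 3    # 3 below + peak + 3 above = 7 points per slice


def slice_positions(length):
    """Slice positions every 0.5 m; the end position is sampled directly."""
    n = int(math.floor(length / SLICE_STEP + 1e-9))
    positions = [i * SLICE_STEP for i in range(n + 1)]
    if length - positions[-1] > 1e-9:       # leftover shorter than one interval
        positions.append(length)
    return positions


def sample_points(peak, lower_bound, upper_bound):
    """Seven lateral sample points centered on the trajectory field peak.

    If the regular 1.5 m reach would cross a road boundary, the span between
    the peak and that boundary is split into three equal parts instead.
    """
    def side_step(room):
        reach = SAMPLE_STEP * SAMPLES_PER_SIDE
        return SAMPLE_STEP if room >= reach else room / SAMPLES_PER_SIDE

    down = side_step(peak - lower_bound)
    up = side_step(upper_bound - peak)
    below = [peak - down * k for k in range(SAMPLES_PER_SIDE, 0, -1)]
    above = [peak + up * k for k in range(1, SAMPLES_PER_SIDE + 1)]
    return below + [peak] + above
```

Each call to `sample_points` returns exactly seven positions, so the per-slice sample count stays constant even next to a road boundary.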
The probabilities of the trajectory field center and of the trajectory field value in Eq. (12) are each divided by the probability at the position two standard deviations from the mean, because differences in human driving styles lead to randomness. The greater the difference in human driving, the more dispersed the driving behavior is, and in the limit the whole space exhibits a uniform distribution. In this case, because driver behavior is dispersed, the probability of the operation selected by the ADS under the Gaussian distribution formed by driver behavior may be small, but the probability of the behavior at two standard deviations in this Gaussian distribution changes correspondingly. Dividing by the probability of the behavior at two standard deviations therefore compensates for the effect of driver behavior dispersion on the evaluation of ADS anthropomorphism.
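The effect of this normalization can be seen in a short sketch. It assumes that "probability" is read as the Gaussian density fitted to the human samples, which is an interpretation made for this example; the function name is hypothetical.

```python
from statistics import NormalDist


def normalized_likelihood(value, human_samples):
    """Density of `value` under the human Gaussian, divided by the density
    at two standard deviations from the mean.

    Assumption of this sketch: the Gaussian is fitted to `human_samples`
    (at least two distinct values are required).
    """
    dist = NormalDist.from_samples(human_samples)
    reference = dist.pdf(dist.mean + 2.0 * dist.stdev)
    return dist.pdf(value) / reference
```

With this reading, the ratio depends only on how many standard deviations the value lies from the human mean, so a widely dispersed driver population does not by itself lower the score: `normalized_likelihood(2.0, [1.0, 2.0, 3.0])` and `normalized_likelihood(20.0, [10.0, 20.0, 30.0])` give the same result.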
The overall construction process of Eq. (12) is as follows. The road is sliced along the moving direction, and seven sampling points are centered on the peak center of the measured ADS trajectory field. The probability that the trajectory field center belongs to that location in the case of human driving is calculated and used as the prior probability for the seven sampling points. After calculating the similarity between the trajectory field of the tested ADS and a human in one slice, it continues forward at 0.5 m intervals until all slices over the entire road length are calculated. The anthropomorphic results are corrected according to the number of vehicle operations and the driving distance.
In some cases, the ADS takes redundant actions, but its trajectory is still similar to that of human driving. As shown in Fig. 3, the driver changes lanes and, due to human deviation, there is a margin in the lane change position: the brown and black curves are, respectively, the most conservative and most aggressive lane change curves, and curves between these are highly consistent with human driving. Because Eq. (12) is computed slice by slice, if the measured ADS follows the red driving trajectory, its trajectory field data overlap with those of human drivers even though the actual operation differs greatly. When this happens, the ADS under test has taken redundant actions, making it necessary to use the operation factor to correct the similarity in Eq. (12).
The distance correction coefficient is also required. When the measured ADS driving distance is less than that of the human-driven vehicle, the trajectory field and location may appear similar to those of the driver if sampling simply ends at the ADS stop position. When the human driving distance is shorter than the ADS distance, the extra driving part of the ADS has no human driving behavior at the same location to be compared with.
The anthropomorphic result $D$ of the tested ADS within the safe zone in a logical scenario is given by Eq. (13), where $p_i$ is the probability that the $i$th concrete scenario occurs during naturalistic driving. A greater $D$ means a higher anthropomorphic level of the measured ADS in the safe zone and greater consistency with the human driver's result. Since the anthropomorphic index considers the physiological and psychological feelings of passengers in most situations, and the naturalistic driving probability reflects how often a driver experiences a concrete scenario on real roads, this probability is included in Eq. (13).
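The operation count that feeds this correction can be sketched from a recorded signal. The following is a simplified illustration, not the paper's procedure: it counts an operation whenever the signal magnitude exceeds the threshold and later returns to within a tolerance of its neutral value. Using the acceleration signal itself in place of pedal position, and the `center_tolerance` parameter, are assumptions of this sketch.

```python
def count_operations(signal, threshold, center_tolerance):
    """Count excursions above `threshold` that return to the neutral position.

    signal           -- sequence of steering angles (deg) or accelerations (m/s^2)
    threshold        -- magnitude that starts an operation (10 deg or 0.5 m/s^2)
    center_tolerance -- magnitude treated as "back at center" (hypothetical)
    """
    count = 0
    active = False
    for value in signal:
        magnitude = abs(value)
        if not active and magnitude > threshold:
            active = True                 # operation has started
        elif active and magnitude <= center_tolerance:
            active = False                # returned to neutral: operation complete
            count += 1
    return count


# A lane change (left, then back) followed by an unfinished steering input.
steering = [0, 5, 12, 15, 8, 0, -11, -3, 0, 12]
operations = count_operations(steering, threshold=10.0, center_tolerance=1.0)
```

An excursion that has not returned to neutral by the end of the recording is not counted.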

Collision-avoidance Index
Safety is the most important evaluation dimension in the vehicle driving process. When the situation is dangerous, avoiding accidents or minimizing collision damage is the most important goal of the ADS design, and there is no need to consider indicators such as anthropomorphism, comfort, and economy. Evaluation methods mostly focus on the vehicle behavior in a concrete scenario and rarely consider the evaluation indicators in a logical scenario.
Fig. 3 Schematic diagram of tested ADS error trajectory

The danger degree of a collision is related to the crash loss and to the location in the parameter space where the collision occurs. The closer the collision parameters are to the most dangerous point in the parameter space (and thus the farther from the boundary with the safe zone), the more unavoidable the collision at that location is, and the lower its weight should be in the overall safety assessment. Considering the crash loss and the degree of collision avoidability, a collision-avoidance index $C$ is proposed for the hazardous zone at the logical scenario layer.
First, the concept of collision loss is defined below. When a collision occurs, acceleration magnitudes cause different degrees of damage to the vehicle and driver. In scenarios where collisions are unavoidable, the ADS may reduce the peak acceleration during the collision through a series of operations, which can effectively reduce the damage of the collision. The concept of collision loss is introduced to quantify the degree of crash damage and define it as the maximum acceleration of the tested vehicle during a collision [33].
Since the collision process is highly complex, it is difficult for current commercial ADS simulation software, such as PreScan and VTD, to accurately simulate vehicle damage after a collision; doing so would require an accurate finite element model of the vehicle under test. A finite element analysis platform (ANALYSIS) can be combined with the ADS simulation platform, which passes the vehicle motion parameters during a collision to the finite element analysis software to automate the analysis of the collision. Research shows that the damage during a collision depends mainly on the relative speed of the vehicles, the relative angle, and the utilization of the bumper. The collision loss estimation method [34] is adopted, in which $L_i$ is the collision loss in the $i$th crash concrete scenario; $w$ is the bumper overlap rate of the tested car with the obstacle at the time of collision, with a minimum value of 0.5; and $v_e$ and $v_o$ are the velocities of the tested car and obstacle, respectively, at the time of collision.
The relative weights of collision loss at different locations must also be calculated. For example, the larger the initial speed of the tested car in the AEB test scenario, the more dangerous the scenario is, so the weight of the results needs to be reduced for scenarios where collisions are difficult to avoid. Hence, the concept of collision loss weights is proposed, where $r_i$ denotes the weight at the $i$th concrete scenario position; $r_i^*$ is the vector from the most dangerous point in the hazardous zone to the $i$th concrete scenario point (when there is more than one hazardous zone, the one with the minimum Euclidean distance is selected); and $r_i^{**}$ is the vector whose start point is the same most dangerous point as for $r_i^*$ and whose end point is the intersection of the line along $r_i^*$ with the boundary between the safe and hazardous zones.
Finally, the collision-avoidance index $C$ is defined by Eq. (17), where $n_c$ is the number of crashes of the tested ADS, and $U_{g\_i}$ is the collision loss of ideal vehicle motion in the $i$th concrete scenario. A greater $C$ implies a better collision-avoidance capability in the hazardous zone, where the maximum value is 1.
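The two vectors described above can be constructed numerically. The sketch below finds the boundary intersection by bisection along the ray from the most dangerous point through the scenario point, and then returns the ratio of the two vector lengths. Taking that ratio as the weight is an assumption made for this example, since the exact weight formula is not reproduced here; the two-dimensional parameter space, the `is_hazardous` predicate, and the function names are likewise hypothetical.

```python
import math


def boundary_point(danger, point, is_hazardous, tol=1e-9):
    """Intersection of the ray danger -> point with the safe/hazardous boundary.

    Assumes `point` lies in the hazardous zone and the zone is bounded along the ray.
    """
    dx, dy = point[0] - danger[0], point[1] - danger[1]

    def at(s):
        return (danger[0] + s * dx, danger[1] + s * dy)

    lo, hi = 1.0, 2.0
    for _ in range(60):                   # expand until the ray leaves the zone
        if not is_hazardous(at(hi)):
            break
        lo, hi = hi, hi * 2.0
    else:
        raise ValueError("hazardous zone appears unbounded along this ray")
    while hi - lo > tol:                  # bisection on the ray parameter
        mid = 0.5 * (lo + hi)
        if is_hazardous(at(mid)):
            lo = mid
        else:
            hi = mid
    return at(0.5 * (lo + hi))


def length_ratio(danger, point, is_hazardous):
    """|r*| / |r**|: 0 at the most dangerous point, 1 on the zone boundary."""
    if point == danger:
        return 0.0
    bx, by = boundary_point(danger, point, is_hazardous)
    r_star = math.hypot(point[0] - danger[0], point[1] - danger[1])
    r_2star = math.hypot(bx - danger[0], by - danger[1])
    return r_star / r_2star
```

Under this reading, scenarios near the most dangerous point receive a small weight and scenarios near the boundary a weight close to 1, which matches the stated aim of reducing the weight where collisions are hard to avoid.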
Once an accident occurs, the collision becomes a deterministic event for the occupants of the vehicle. Therefore, collision-avoidance performance focuses on the ability of ADS systems to avoid injury or reduce injury when a hazard occurs. For these reasons, Eq. (17) does not use the NDD probability distribution as the analysis condition for the weights as in Eq. (13), but is based on the stringency of the scenario parameters on the crash avoidance capability.
Ideal vehicle motion, which only considers the restrictions of vehicle dynamics, considers other elements like obstacle position as known true information, i.e., that perception, decision, and execution systems are performing optimally. Taking the front vehicle braking scenario as an example, once braking occurs in the front vehicle, the sensing system of the ideal vehicle senses the front vehicle's state change, the decision system makes the state control of braking for this vehicle movement, and the execution system brakes the vehicle according to the ideal pressure building curve. Since the ideal vehicle motion assumes that all subsystems of ADS operate according to the optimal state, the collision loss of the tested algorithm in the concrete scenario must be greater than that of the ideal vehicle.

Whole Performance Evaluation Index
After obtaining the anthropomorphic and collision-avoidance indexes, these can be combined to obtain the performance evaluation index of the tested ADS in the whole parameter space of the logical scenario, Eq. (18), where $a$ and $b$ are the relative weights of the two evaluation indicators, which are determined according to the test purpose and scenario (when the percentage system is selected, they sum to 100); $k_5$ and $k_6$ are correction factors, which need to be adjusted flexibly according to the scenario type and test purpose. It should be noted that for a measured ADS with a collision in the safe zone, as shown in the flowchart in Fig. 1, only the collision-avoidance index is output and the proposed anthropomorphic index is not calculated, i.e., the value of the first term in Eq. (18) is 0.
The construction of $k_5$ and $k_6$ can be considered in two situations. When the number of measured ADSs is large, a statistic such as the median or mean of the measured ADS scores can be mapped to $a/2$ or $b/2$, and the values of $k_5$ and $k_6$ are then obtained by inversion; otherwise, they can be defined as 1.
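The combination and the inversion of the correction factors can be sketched as follows. The weighted-sum form used here is an assumption of this example, inferred from the description of two weighted terms with correction factors; the function names and default weights are hypothetical.

```python
from statistics import median


def correction_factor(reference_scores):
    """Factor that maps the median reference score to half of its weight.

    From weight * k * median = weight / 2 it follows that k = 1 / (2 * median).
    """
    return 1.0 / (2.0 * median(reference_scores))


def overall_score(d_index, c_index, a=50.0, b=50.0, k5=1.0, k6=1.0,
                  safe_zone_collision=False):
    """Assumed weighted sum of the anthropomorphic and collision-avoidance terms.

    The first term is set to 0 when the ADS collided in the safe zone.
    """
    anthropomorphic_term = 0.0 if safe_zone_collision else a * k5 * d_index
    return anthropomorphic_term + b * k6 * c_index


# Hypothetical anthropomorphic results of several measured ADSs.
k5 = correction_factor([0.2, 0.4, 0.6])
score = overall_score(0.4, 0.8, k5=k5)
```

With only a few measured ADSs, both factors would simply be left at their default of 1.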

Examples of Evaluation
Two black-box ADSs are tested in the virtual environment with front vehicle low-speed and cut-out scenarios, and their performance is evaluated using the proposed method. The performance of the measured ADS is scored in two scenarios with typical parameters.

Description of Tested Logical Scenario
The schematics of the two scenarios are shown in Figs

Parameter Space Partition
The hazard level of each concrete scenario in the parameter space is calculated according to Eqs. (2)–(8). For both kinds of scenarios, a single vehicle operation (braking) is used to analyze the hazard. The braking deceleration of the vehicle is chosen as 5 m/s²; because AVs use different following strategies, the following distance is set to 3.5 times the velocity to simplify the calculation.
For the front vehicle cut-out scenario, the regulation for L3 ADS, Proposal for a new UN Regulation on uniform provisions concerning the approval of vehicles with regards to Automated Lane Keeping System, stipulates that the vehicle is allowed a risk assessment time of 0.4 s and a braking decision time of 0.75 s (from complete target detection to start of braking) in the cut-out scenario. Because the regulation specifies only the lower limit of performance, it is assumed that the ego vehicle discovers the obstacle 1 s after the front vehicle cuts out and then performs a braking operation. Based on these settings, the boundary between the hazardous and safe zones can be fitted to partition the parameter space. When selecting the braking deceleration for scenario partitioning, the maximum braking deceleration achievable on the road cannot be used directly, because emergency braking can itself trigger a high-risk situation. Research generally maintains that 5 m/s² is already a large deceleration [33], and this value is used to calculate the zone boundary. After the calculation, no dangerous situation is found in the front vehicle low-speed scenario, and the whole parameter space forms the safe zone; the partition result of the front vehicle cut-out scenario is shown in Fig. 6, where the blue part is the safe zone and the yellow part is the hazardous zone. In Fig. 6, $d_c$ and $v_t$ are the initial distance and velocity difference, respectively, between the front and ego vehicles.
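The partition logic for the cut-out scenario can be sketched with a simplified kinematic check. This is not a reproduction of Eqs. (2)–(8): it assumes that the revealed obstacle keeps a constant speed, that the velocity difference is positive when the ego vehicle is closing in, and that units are metres and metres per second. The function names are hypothetical; the 1 s discovery delay and the 5 m/s² deceleration are taken from the text.

```python
def is_hazardous(d_c, v_t, reaction_time=1.0, decel=5.0):
    """Return True if braking alone cannot prevent a collision.

    d_c: initial distance to the obstacle (m)
    v_t: closing speed between ego vehicle and obstacle (m/s)
    """
    if v_t <= 0:
        return False  # the gap is not shrinking, so no collision occurs
    # Gap consumed before braking starts plus gap consumed while the
    # closing speed is reduced to zero at constant deceleration.
    closing_distance = v_t * reaction_time + v_t ** 2 / (2 * decel)
    return closing_distance >= d_c


def partition(distances, speed_diffs):
    """Label every (d_c, v_t) grid point as 'hazardous' or 'safe'."""
    return {
        (d, v): "hazardous" if is_hazardous(d, v) else "safe"
        for d in distances
        for v in speed_diffs
    }


if __name__ == "__main__":
    zones = partition([15.0, 30.0], [5.0, 10.0])
    for point, zone in sorted(zones.items()):
        print(point, zone)
```

With these assumptions, a closing speed of 10 m/s consumes 10 m during the delay and a further 10 m during braking, so a 15 m gap is labelled hazardous and a 30 m gap safe.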

Data Collection by Human Drivers
To determine the driving trajectory field of drivers in different concrete scenarios, 10 skilled drivers are invited to collect driving data on a driving simulator (see Fig. 7). Two of the drivers are female and eight are male; all have held a driver's license for more than 2 years and are between 22 and 27 years old.
Many enterprises and institutions have collected NDD and AV test scenario data. With the increasing amount of open-source data, the analysis of human driving data does not require extra data collection.

Virtual Test Platform
The ADS simulation test platform is constructed with PreScan, CarSim, and MATLAB. CarSim provides vehicle dynamics information to the tested ADS model in MATLAB. The ADS model accepts current vehicle information and outputs the vehicle operation information to CarSim. PreScan provides the tested ADS model with a built test scenario that includes the initial vehicle speed of the ego vehicle, and the initial vehicle speed and position of the front vehicle. It accepts vehicle motion information from MATLAB. The CLI module in PreScan is used to realize the automatic scenario construction and operation of the test case. VisViewer in PreScan receives scenario and vehicle information from PreScan to realize 3D visualization.

Performance Evaluation Result of ADSs
The two ADSs are tested in front vehicle low-speed and cut-out scenarios. Discrete concrete scenarios with equal steps are selected in simulation traversal tests. During the virtual test, 10 typical scenarios are selected for each logical scenario; 10 human drivers observe the vehicle's motion from the driver's perspective and score the driving performance of the tested ADS from 1 to 10 according to their subjective perception.
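Selecting discrete concrete scenarios with equal steps amounts to building a regular grid over the logical scenario's parameter ranges. The following is a minimal sketch of that enumeration; the parameter names, ranges, and step sizes are hypothetical and are not the values used in the tests.

```python
from itertools import product


def equal_step_values(start, stop, step):
    """Return start, start + step, ... up to and including stop."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def concrete_scenarios(param_ranges):
    """Enumerate every grid point of the logical scenario parameter space.

    param_ranges maps a parameter name to a (start, stop, step) tuple.
    """
    names = list(param_ranges)
    axes = [equal_step_values(*param_ranges[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*axes)]


if __name__ == "__main__":
    # Hypothetical ranges: initial distance (m) and velocity difference (m/s).
    grid = concrete_scenarios({"d_c": (10, 30, 10), "v_t": (2, 6, 2)})
    print(len(grid))
    print(grid[0])
```

Each dictionary in the returned list describes one concrete scenario that could be passed to the simulation platform for a traversal test.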
The results of a concrete front vehicle low-speed scenario are selected to provide a more intuitive interpretation of the anthropomorphic index evaluation process. The measured ADSs and the drivers' driving trajectory fields in this scenario are shown in Figs. 8, 9 and 10. ADS 1 brakes and then follows the front vehicle at the same velocity, while ADS 2 performs a lane change. According to Eqs. (9)–(13), the anthropomorphic indexes of the two ADSs in this scenario are 0.91 and 0.4, respectively, and ADS 2 is more similar to a human-driven system than ADS 1. According to the trajectory field, it can be seen that ADS 2 could perform the lane change at the right time, with almost no speed loss during the entire process, while ADS 1 chooses to slow down and follow the front vehicle when it finds the front vehicle is too slow, which reduces the overall efficiency.
Since neither tested ADS collided in the safe zone in the two logical scenarios, the proposed anthropomorphic index $D_i$ of both in all concrete scenarios in the safe zone is calculated and substituted into Eq. (13), where $p_i$ is calculated based on open-source NDD [35], and the final results of the two ADSs in both scenarios are obtained as shown in Table 3.
The collision-avoidance index results of the two algorithms are calculated. Since the parameters selected for the scenarios are safe and the two measured ADSs have no collisions, their collision-avoidance indexes are all full scores.
Taking a and b both as 50, the full-parameter-space performance evaluation results of the two ADSs in the two scenarios are shown in Table 4.
The scoring results of the two ADSs by 10 drivers are shown in Fig. 11, from which it can be seen that the subjective feelings of the drivers (Fig. 11) are consistent with the results of the proposed method (Table 4). Scenarios 1 and 2 in Fig. 11 correspond to front vehicle low-speed and front vehicle cut-out, respectively. In Fig. 11, the drivers' subjective perception is that ADS 1 is better than ADS 2, and the scores of the front vehicle low-speed scenario are higher.

Conclusions
To adapt to current scenario-based test methods, a performance evaluation method for ADS in the whole parameter space of a logical scenario is proposed. It addresses a shortcoming of traditional methods, which treat the logical scenario parameter space as a single undivided space. The parameter space is partitioned into hazardous and safe zones according to the danger level of each concrete scenario, and the collision-avoidance and anthropomorphic indexes are determined in the hazardous and safe zones, respectively. The performance of two black-box ADSs is evaluated by the method proposed in this paper and by drivers' subjective scoring in the front vehicle low-speed scenario and the front vehicle cut-out scenario. The evaluation results reflect the tested system performance in the overall parameter space, and they are consistent with the experience of drivers. Due to the lack of extensively tested ADS algorithms, certain parameters in the current assessment indexes are obtained through mathematical analysis. As the number of tested ADS algorithms increases in the future, these parameters can be optimized by statistical means to further improve the accuracy of the assessment.