Strategic Conflict Management using Recurrent Multi-agent Reinforcement Learning for Urban Air Mobility Operations Considering Uncertainties

The rapid evolution of urban air mobility (UAM) creates heavy demand for public air transport tasks and poses great challenges to safe and efficient operation in low-altitude urban airspace. In this paper, operational conflicts are managed in the strategic phase with multi-agent reinforcement learning (MARL) in dynamic environments. To enable efficient operation, aircraft flight performance is integrated into multi-resolution airspace design, trajectory generation, conflict management, and MARL learning. The demand and capacity balancing (DCB) issue, separation conflicts, and block unavailability introduced by wind turbulence are resolved by the proposed multi-agent asynchronous advantage actor-critic (MAA3C) framework, in which recurrent actor-critic networks allow automatic action selection between ground delay, speed adjustment, and flight cancellation. The learned parameters in MAA3C are replaced with random values to compare against the performance of the trained models. Simulated training and test experiments performed on a small urban prototype and various combined use cases suggest the superiority of the MAA3C solution in resolving conflicts with complicated wind fields. The generalization, scalability, and stability of the model are also demonstrated by applying it to complex environments.


Cheng Huang (cheng-huang.huang@cranfield.ac.uk), Ivan Petrunin (i.petrunin@cranfield.ac.uk), Antonios Tsourdos (a.tsourdos@cranfield.ac.uk)
School of Aerospace, Transport and Manufacturing, Cranfield University, Cranfield, MK43 0AL, Bedfordshire, UK

Introduction
Urban air mobility enables emerging airborne missions including last-mile delivery, scheduled public air metro, and on-demand air taxis in low-altitude metropolitan regions. In addition, the rapid iteration of electric vertical takeoff and landing (eVTOL) aircraft will soon make urban air transport feasible and convenient. However, frequent flight activities in low-altitude urban airspace must be regulated with safety and security considerations, as crowded traffic flow with potential separation conflicts endangers ground users. Meanwhile, the complex urban building morphology noticeably changes the micro-climate and wind field, which directly affects fleet operation in densely occupied regions. Consequently, it is crucial to manage the traffic flow efficiently and resolve all conflicts under uncertain circumstances.
Conventionally, air traffic management (ATM) services (ATS), e.g. advisory, alerting, and collision prevention, are provided by human air traffic controllers (ATCOs) [1]. Traffic efficiency is therefore restricted by the controllers' workload in addition to the airspace structure and weather conditions. Similarly, such responsibilities in Unmanned Aircraft System (UAS) Traffic Management (UTM) are undertaken by the UAS Service Supplier (USS) [2,3]. The dynamic capacity management module in UTM further unlocks the potential of dense traffic operation. Resembling the UTM architecture, the Provider of Services to UAM (PSU) performs ATS in the UAM Operations Environment (UOE) [4,5]. Thus, similar roles and responsibilities are defined for delivering the necessary services in ATM, UTM, and UAM alike.
To achieve conflict-free and efficient traffic management, ATS is offered by the aforementioned service providers in three stages: strategic conflict management, tactical separation provision, and collision avoidance [1]. Although this structure was proposed for ATM, it also applies to UAM and UTM with some modifications. The strategic conflict management service, which is carried out several days or months in advance of the actual operation in UAM or UTM [6] compared with the typical half-year in ATM, works as the first layer and aims to reduce the pressure on the subsequent phases. If any conflict is left over from the previous stage, separation provision is responsible for keeping the separation minima between pairwise aircraft. Collision avoidance is an individual component and the last resort on board [1].
In this paper, we focus on conflict management at the strategic level. In detail, strategic conflict resolution is achieved through three components: (1) airspace organization and management, (2) demand and capacity balancing (DCB), and (3) traffic synchronization [1]. Airspace organization and management provides the regulations and procedures to flexibly accommodate various types of air activities, facilitate the seamless handling of dynamic flight trajectories, and adjust airspace boundaries. The DCB component mitigates conflicts where the demand exceeds the capacity, for the effective use of airspace resources. Finally, traffic synchronization is fully integrated with conflict management and DCB to negotiate conflict-free and orderly trajectories.
Prior to the management stage, the flight intents including aircraft information, flight path, and scheduled departure/arrival times must be submitted to PSUs from fleet operators. The PSU will provide the modification suggestions to the submitted plan after coordinating the aforementioned three strategic components.
With the airspace management module defined above, the prescribed intents are first reorganized to ensure global operational efficiency in the UOE. The airspace structure in low-altitude urban areas differs from regional ATM airspace and sectors. UAM airspace is expected to be flexible, which a grid-based solution can offer [7]. In AirMatrix [8], high-density stacked blocks fill the entire city. Similarly, air corridors can be built over buildings or ground roads. Because the number of blocks significantly affects complexity, the AirMatrix is then resampled to thin out blocks at high altitudes and in sparsely populated areas and to condense blocks around buildings [9]. As a result, the multi-resolution structure supports operations with a high degree of freedom.
The flexible airspace structure allows two strategic conflict management solutions. One directly generates conflict-free trajectories using the A* method [10] with conflict factors involved. However, efficiency is not guaranteed, because entire trajectories must be regenerated frequently in the dynamic UOE. The other follows the conventional way: trajectories are generated from the submitted flight intents, and all conflicts are then eliminated. Contrary to the first approach, it allows fine-tuning of trajectory parameters, which keeps the result closer to the initial intents of the fleet operator.
Besides airspace and trajectories, the crucial parts of strategic conflict management are the DCB issue and the separation conflict. The DCB issue is caused by a regional imbalance between demand and capacity, whereas the separation conflict results from the loss of separation minima between pairwise aircraft. When dealing with the DCB issue, the main focus is reducing the demand or increasing the capacity in the hotspot. Conventional approaches are compared in Table 1. Mathematical models, e.g. 0-1 Integer Programming (IP) [11] and Eulerian-Lagrangian models [12], offer options such as ground delay, rerouting, or airborne delay to mitigate crowded traffic in static sectors. However, these approaches for conventional air traffic flow management (ATFM) cannot handle dynamically merging and dividing airspace blocks, or intelligently select the appropriate action among ground delay, speed adjustment, and flight cancellation, in highly dynamic environments. The issue is therefore investigated with emerging multi-agent reinforcement learning (MARL) methodologies, which interact with the changing environment to learn generalized solutions. In cooperation with the trajectory-based operation (TBO) concept, trajectories can be managed effectively. For instance, agents in the multi-agent system can be defined as "peers" if their trajectories interact, and are then connected to form a network [13]. Edge-based MARL and agent-based MARL use the information propagated through the graph to solve the DCB issue. To promote collaboration between agents, unsupervised clustering and ranking are integrated with MARL to select the agent with priority [14]. Other approaches such as hierarchical MARL also provide insight into DCB resolution for manned air traffic [15].
For unmanned aircraft in UTM or UAM, the deep Q-learning network (DQN) combined with a genetic algorithm (GA) is a good attempt to resolve the congestion problem in urban low-altitude airspace considering weather changes [16]. Because of the differences in airspace structure and operational notion, the hybrid operation of manned and unmanned flights, and of scheduled and on-demand flights, poses a huge challenge for UAM.
Solving the DCB issue only alleviates regional traffic pressure; the loss of separation distance within a sector remains to be resolved. Separation is usually studied in the tactical phase, where instant heading change, speed adjustment, and flight level change are typical maneuvers. Uncertain factors are much easier to involve in this short-time stage. For this reason, the single-agent deep deterministic policy gradient (DDPG) [17] and its multi-agent version [18] can cope with conflicts between pairwise or multiple aircraft considering environmental uncertainties. Simply sending a heading change command with a specific modification angle is acceptable in wide-range airspace, but this operation becomes risky in the UOE, as (1) it requires a much higher frequency of command exchange to guide the flight, which increases the pressure on the system, and (2) flights in the UOE must be strictly planned and monitored to avoid safety issues. For these reasons, detailed information for trajectory modification should be designed for the secure and efficient operation of UAM.
In this paper, as also described in Table 1, the components of the UAM operational concept are redesigned considering urban uncertainties and aircraft performance. To achieve this, a flexible airspace structure and a refined strategy for trajectory management work together with MARL for better strategic conflict management. Flight performance, such as aircraft type and flight ability, is imposed as a strict constraint when defining those elements. The airspace structure is first adapted for UAM with dense blocks, and the multi-resolution blocks are merged or split in accordance with the aircraft dimension. To support precise operation, the Gaussian Mixture Model (GMM) is leveraged to model the speed profile of the flight, so that conflicts can be resolved by adjusting the GMM parameters.
The wind model is integrated into the environment and built through the Dryden gust model and computational fluid dynamics (CFD). The strategic conflict in this paper consists of the DCB issue, the separation conflict, and unavailable block status due to wind intensity. After carefully designing the UAM architecture, the Multi-Agent Asynchronous Advantage Actor-Critic (MAA3C) framework with recurrent actor-critic networks is proposed to train a generalized model and eliminate all conflicts in dynamic environments, where ground delay, speed adjustment, and cancellation serve as candidate actions.
Overall, the key contributions of the paper can be summarized as follows:
• Components of strategic conflict management are redesigned specifically for urban air mobility, with aircraft flight performance integrated into the multi-resolution airspace organization and spatiotemporal trajectory generation. In particular, the maximum wind speed resistance level of each aircraft is considered for flight safety.
• The Dryden gust model and computational fluid dynamics (CFD) simulation are used to model wind turbulence in a low-altitude urban area and provide regional and detailed wind profiles as the stochastic factors in the UOE.
• Cancellation, ground delay, and airborne speed adjustment constitute the three types of actions for the constructed Multi-Agent Asynchronous Advantage Actor-Critic (MAA3C) solution. The importance of each action is dynamically adjusted by the recurrent structure of MAA3C. In particular, the cancellation action is emphasized to enable stable and effective performance in the presence of wind turbulence.
The rest of the paper is outlined as follows: Section 2 defines the problem and the essential components of strategic conflict management, in addition to the wind turbulence models. The customized multi-agent reinforcement learning framework and its networks are presented in Section 3. In Section 4, we develop a small prototype for UAM simulation and analyze the training and test results under various circumstances. Finally, the paper is concluded in Section 5.

Operational Concept Components
Compared with conventional ATM, the operational environment is narrowed from large-scale high-altitude spaces to low-altitude high-density areas. Besides, the urban morphology creates additional static obstacles and dynamic micro-climate, which impact flight safety and security. Moreover, the high-density flights and uncertainties increase the airspace demand and conflict possibilities, which make it challenging to solve the DCB and conflict issues in urban regions. As a result, this paper works on a proper solution for UAM services, especially for solving the strategic DCB issue and separation conflicts in dynamic urban environments.
In this section, the essential components of conflict management are introduced step by step to construct the problem. As depicted in Fig. 1, the dynamic low-altitude airspace and the spatiotemporal trajectories are generated first, considering aircraft performance. These two components produce the initial flight schedule, from which potential conflicts between trajectories are analyzed in complicated environments once wind turbulence is involved. The conflicts are then resolved by the MARL solution, described in Section 3, which considers wind uncertainties to generate the final conflict-free flight plans. The details of each component describing the problem are explained as follows.

Airspace Organization and Management
The UAM operations environment (UOE) defines the airspace structure in which the PSU provides necessary services to aircraft for safety and efficiency. As densely distributed buildings pose challenges to secure flight, dynamic and adaptive organization is required for various air activities in low-altitude airspace. Multi-layered grids can form a flexible structure, in which the airspace is discretized into stacked blocks. Safe separation from buildings can be achieved by placing blocks away from static obstacles. For instance, the AirMatrix [8] or 4D grids [7] over buildings or roads can provide sufficient flexibility for airspace management. However, to serve accurate navigation, increasing the block resolution inevitably causes the number of controlled blocks to explode. Although a multi-resolution AirMatrix at different areas and altitudes can decrease the burden to some extent [9], the static architecture lacks operational freedom.
For convenience of conflict management, we define that an aircraft can occupy only one operation block. To further increase flexibility while achieving high-resolution operation, we place basic blocks and dynamically merge them into larger blocks according to the size of the aircraft. As in the corresponding figure, several basic blocks are merged into a large block that acts as the actual operation block for a large-size aircraft, while small aircraft continue to operate in the basic blocks. In this case, the entire structure remains dynamic and varies over time. All basic blocks constitute a static graph $G_g(V, E)$, in which the discrete blocks form the node set $V$ and the adjacency between pairwise nodes constitutes the edge set $E$. The practical operation graph is represented as a graph $G(V, E)$, in which the original basic nodes are deleted and replaced with the new merged nodes, as shown in Fig. 3. Besides, child nodes are also stored in $G$ for block retrieval and recovery. Since the graph $G$ changes over time, the merging and recovery process is critical; the detailed process is summarized in Algorithm 1.
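The merging and recovery of blocks described above can be illustrated with a minimal sketch. It is not Algorithm 1 itself: the class name `OperationGraph`, the adjacency-set representation, and the way child nodes are stored are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OperationGraph:
    """Hypothetical operation graph G(V, E) derived from the basic-block graph."""
    adj: dict = field(default_factory=dict)       # node -> set of neighbouring nodes
    children: dict = field(default_factory=dict)  # merged node -> saved child adjacency

    def add_edge(self, a, b):
        self.adj.setdefault(a, set()).add(b)
        self.adj.setdefault(b, set()).add(a)

    def merge(self, basic_nodes, merged_id):
        """Delete the basic nodes, insert one merged node, and keep the children."""
        group = set(basic_nodes)
        saved = {n: set(self.adj[n]) for n in group}
        outside = set().union(*saved.values()) - group
        for n in group:
            for nb in self.adj.pop(n):
                if nb not in group:
                    self.adj[nb].discard(n)
        self.adj[merged_id] = set(outside)
        for nb in outside:
            self.adj[nb].add(merged_id)
        self.children[merged_id] = saved

    def recover(self, merged_id):
        """Remove the merged node and restore its child nodes and their edges."""
        saved = self.children.pop(merged_id)
        for nb in self.adj.pop(merged_id):
            self.adj[nb].discard(merged_id)
        for n in saved:
            self.adj[n] = set()
        for n, neighbours in saved.items():
            for nb in neighbours:
                if nb in self.adj:  # skip neighbours currently merged elsewhere
                    self.adj[n].add(nb)
                    self.adj[nb].add(n)
```

Storing the saved adjacency of the children alongside the merged node is one simple way to make recovery exact; other bookkeeping schemes would work equally well.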

Trajectory Generation
A trajectory is typically defined by a list of key information including latitude, longitude, altitude, and time. With the detailed graph, the trajectory of flight $f$ is formulated as $traj_f = \{(node_i, t_i, v_i)\}_{i=0}^{N_f - 1}$, where $node_i \in \mathbb{R}^3$ is the node in the graph $G$ that contains the 3D position, $t_i$ is the corresponding time, $v_i \in \mathbb{R}^+$ is the speed, and $N_f$ is the number of traversed blocks of flight $f$.
We design a trajectory formulation module for strategic conflict management that takes aircraft performance into account. Time-based management (TBM) is the kernel of trajectory-based operation (TBO): it assigns specific times to points on the spatial trajectory for strategic planning. To enable efficient TBM, two components, spatial point design and time assignment, are described in detail below.

Spatial Trajectory
The spatial information of a trajectory is composed of discrete 3-dimensional points. The block center $node_i \in \mathbb{R}^3$ represents a specific point along the trajectory. The objective is to collect the list of consecutive nodes between the departure and arrival locations. Node merging happens during trajectory generation. The aircraft performance database $AC = \{ac_i\}_{i \in N_a}$ and the vertiport database $VT$ are collected first. The Dijkstra algorithm is then used to find the shortest path between the origin vertiport $Ori$ and the destination vertiport $Des$ on the graph $G_g(V, E)$. The detailed process is described in Algorithm 2.
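As an illustration of the shortest-path step, the following sketch runs Dijkstra on a small block grid. It does not reproduce Algorithm 2: the helper `grid_graph`, the integer block coordinates, the 6-neighbour adjacency, and the uniform edge cost equal to the block dimension are all assumptions.

```python
import heapq


def grid_graph(nx, ny, nz, dim=4.0, blocked=()):
    """Build a hypothetical basic-block graph: adj[node] = {neighbour: edge cost}."""
    blocked = set(blocked)
    nodes = {(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)}
    nodes -= blocked
    adj = {n: {} for n in nodes}
    for (x, y, z) in nodes:
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nb = (x + dx, y + dy, z + dz)
            if nb in nodes:
                adj[(x, y, z)][nb] = dim
                adj[nb][(x, y, z)] = dim
    return adj


def shortest_path(adj, ori, des):
    """Dijkstra: return the node list from ori to des, or None if unreachable."""
    dist = {ori: 0.0}
    prev = {}
    heap = [(0.0, ori)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == des:
            break
        for nb, cost in adj.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    if des not in dist:
        return None
    path = [des]
    while path[-1] != ori:
        path.append(prev[path[-1]])
    return path[::-1]
```

Blocks occupied by buildings can be passed through `blocked`, so that the resulting spatial trajectory only traverses usable blocks.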

Time Assignment
With the prescribed spatial points, efficient traffic flow is then achieved by the TBM. Electric vertical take-off and landing (eVTOL) aircraft and small drones outperform fixed-wing aircraft in the urban environment because of their flexible maneuvers and airborne hovering abilities. To accurately revise the arrival time at a specific location and avoid potential conflict, we construct a time-speed graph based on the Gaussian Mixture Model (GMM) as the en-route speed profile of the aircraft. The GMM is composed of $K$ independent normal distributions $\mathcal{N}$, and its probability density function (PDF) is

$$p(x) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k),$$

where $\mu_k$, $\sigma_k$, and $\phi_k$ are the mean value, variance, and weight of the $k$-th distribution, respectively.
Since the GMM curve is continuous, it needs to be discretized into as many segments as the number of traversed blocks, as in Fig. 4, assuming that the speed is constant inside a block and changes at the border between two adjacent blocks. All aircraft need to follow the advised speed at each kernel point. Given the traversed blocks and the initial departure time $t_0$, the crossing times $\{t_i\}$ at the next blocks can be obtained by

$$t_{i+1} = t_i + \frac{dim_{block}}{v_i},$$

where $dim_{block}$ is the dimension of the operation block. If a new speed profile $\{v_i\}$ is refined for conflict resolution, the new crossing times are obtained in the same way. In addition to airborne adjustment, the ground delay (GD) program directly shifts the entire trajectory in the temporal dimension by adding the delay time $\delta t$, which yields a fresh plan as in Fig. 5. In some situations, both GD and speed change are applied, and the new operation plan replaces the original one.
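The time assignment can be sketched as follows, assuming that the crossing times follow from constant speed inside each block, i.e. $t_{i+1} = t_i + dim_{block} / v_i$. The mapping from the GMM curve to physical speeds (sampling at block midpoints on $[0, 1]$ and rescaling between hypothetical bounds `v_min` and `v_max`) is an assumption for illustration, not the scaling used in the paper.

```python
import math


def gmm_pdf(x, weights, means, variances):
    """PDF of a 1-D Gaussian mixture; variances follow the text's sigma_k."""
    return sum(
        w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )


def speed_profile(n_blocks, weights, means, variances, v_min, v_max):
    """Discretize the GMM curve into one constant speed per traversed block.

    Assumption: the curve is sampled at block midpoints on [0, 1] and rescaled
    so that its peak maps to v_max and zero density maps to v_min (> 0).
    """
    xs = [(i + 0.5) / n_blocks for i in range(n_blocks)]
    pdf = [gmm_pdf(x, weights, means, variances) for x in xs]
    peak = max(pdf)
    return [v_min + (v_max - v_min) * p / peak for p in pdf]


def crossing_times(t0, speeds, dim_block):
    """Return t_0 and the border crossing time after each block."""
    times = [t0]
    for v in speeds:
        times.append(times[-1] + dim_block / v)
    return times


def ground_delay(times, delta_t):
    """Shift the whole trajectory in time by the delay delta_t."""
    return [t + delta_t for t in times]
```

For example, two 4 m blocks flown at 2 m/s and 4 m/s from $t_0 = 0$ give crossing times of 0 s, 2 s, and 3 s, and a ground delay of 5 s shifts all three by the same amount.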

Wind Turbulence
Stochastic uncertainties such as wind disturbance pose a critical challenge to flight safety. Strong wind not only deviates the aircraft from its nominal trajectory but is also likely to cause control failure or a crash. In this strategic stage, instead of adding wind to the dynamics and kinematics model of the aircraft, the wind field is used to determine whether the aircraft can take off. Any aircraft whose maximum wind speed resistance is less than the predicted wind speed at some time and location must avoid the unavailable blocks. Two wind models are leveraged to describe the wind information at different resolutions. The first uses the Dryden model to predict the regional wind speed on each aircraft [19]. Because of the complex urban morphology, the regional weather prediction model is not sufficient for local wind field prediction. Therefore, a computational fluid dynamics (CFD) model is simulated to obtain high-resolution wind flows [20]. Although the simulated scenarios cannot exactly match the frequently changing wind field, our aim is to make the solution generalized and capable of resolving similar issues.
The Dryden gust model is a commonly used mathematical model for continuous wind turbulence. In this wind model, the velocity components are defined by stochastic processes and the corresponding power spectral densities. As the wind is calculated in the vehicle frame, the transfer functions along the three body-frame axes $[u, v, w]$ are calculated as in [21] (Eq. 2), where $[\sigma_u, \sigma_v, \sigma_w]$ and $[L_u, L_v, L_w]$ are the turbulence intensities and scale lengths, and $V_a$ is the effective airspeed of the small aircraft. Following [22], a constant speed $V_{a0}$ is set for the Dryden model. From Eq. 2, different areas at the same altitude share the same wind value $\{W_t\}_{t=t_s}^{t_e}$, where $t_s$ and $t_e$ are the start and end times of operation. However, the dense distribution and morphology of buildings have a great impact on the local wind flow, for example through the street-canyon effect and induced turbulence, so various points at the same altitude must have different $\{W_t\}_{t=t_s}^{t_e}$ profiles. For this reason, numerical computation using CFD is employed to solve the dynamic Navier-Stokes equations for the changing wind field.
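To give an idea of how a Dryden gust time series can be generated, the sketch below simulates only the longitudinal component. It assumes the first-order shaping filter $H_u(s) = \sigma_u \sqrt{2 V_a / L_u} / (s + V_a / L_u)$ driven by unit-intensity white noise, which is one form found in small-UAV texts and may differ from the expression in [21]. The function name `dryden_u_gust` and its parameters are hypothetical.

```python
import math
import random


def dryden_u_gust(sigma_u, L_u, V_a, dt, n_steps, noise=None, x0=0.0):
    """Simulate the longitudinal gust u_w at a fixed time step dt.

    Under the assumed filter, the gust is a first-order Gauss-Markov process
    with stationary variance sigma_u**2 and time constant L_u / V_a, so it can
    be discretized exactly as an AR(1) recursion.
    """
    if noise is None:
        rng = random.Random(0)  # fixed seed, for reproducibility only
        noise = lambda: rng.gauss(0.0, 1.0)
    a = V_a / L_u
    decay = math.exp(-a * dt)
    scale = sigma_u * math.sqrt(1.0 - decay * decay)
    x = x0
    series = []
    for _ in range(n_steps):
        x = decay * x + scale * noise()
        series.append(x)
    return series
```

The lateral and vertical components use second-order filters in the Dryden model and are omitted here. Because the same series applies to every point at a given altitude, such a model only provides the regional wind, which is why the CFD simulation is needed for the local field.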

Conflict Management
The DCB issue and the separation conflict are the two matters in strategic conflict management. In ATM, capacity typically refers to the number of flights the ATCOs can process during a specific period. And because each ATCO is responsible for a certain number of airspace sectors, the demand also represents the capacity of the sector. In this paper, however, we use automatic tools to support decision-making, and we therefore regard the capacity of a PSU as infinite. For each block, the capacity is set to 1; in other words, every block can be occupied by only one aircraft at each moment.
The demand is obtained through entry counts. In detail, the operation period ($H$ hours) of each day is equally split into $SN$ fragments:

$$\{[sn \cdot dt, \, (sn + 1) \cdot dt)\}_{sn=0}^{SN-1},$$

where $dt = H/SN$. Any aircraft whose entry time $t_i$ ($i \in [0, N_f - 1]$) into block $j$ falls in fragment $sn \in \{0, 1, \cdots, SN - 1\}$ is added to the demand $d_{node_j, sn}$. When the demand $d_{node_j, sn}$ exceeds the capacity $c_{node_j, sn} = 1$ in any block, the DCB issue appears. For pairwise aircraft in the hotspot block, we have

$$node_{j_1} = node_{j_2},$$

where $j_1$ and $j_2$ are the block indexes of different flights in the same period. This expression also meets the condition

$$\| node_{j_1} - node_{j_2} \| < dim_{block}. \tag{5}$$

The loss of separation means that the distance between pairwise aircraft is less than the safe minima. For efficient operation with aircraft performance, dynamic separation minima are considered: the nearest spatiotemporal distance between points $node_{j_1}$ and $node_{j_2}$ of two interacting trajectories is compared with the relative margin,

$$\| node_{j_1} - node_{j_2} \| < \delta v_{max} \times \tau, \tag{6}$$

where $\delta v_{max}$ is the maximum relative speed and $\tau$ is the temporal margin. From Eqs. 5 and 6, if we keep $(v_{max} \times \tau) \gg dim_{block}$, the separation conflict and the DCB issue can be merged into one meta-problem: there is no DCB issue in any block once the separation conflicts are resolved.
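The entry-count demand and the hotspot test can be sketched as below. The plan layout (a dictionary of flights, each a list of `(block, entry_time)` pairs), the function names, and the use of a plain Euclidean distance for the separation check are assumptions for illustration; the paper's spatiotemporal distance may be defined differently.

```python
import math
from collections import defaultdict


def snapshot_index(t, dt):
    """Index sn of the fragment [sn*dt, (sn+1)*dt) that contains time t."""
    return int(t // dt)


def entry_demand(plan, dt):
    """Count entries per (block, snapshot); plan = {flight: [(block, t), ...]}."""
    demand = defaultdict(list)
    for flight, traj in plan.items():
        for block, t in traj:
            demand[(block, snapshot_index(t, dt))].append(flight)
    return demand


def hotspots(demand, capacity=1):
    """Return the (block, snapshot) pairs whose demand exceeds the capacity."""
    return {key: flights for key, flights in demand.items() if len(flights) > capacity}


def loses_separation(p1, p2, dv_max, tau):
    """Assumed check: distance between block centres below dv_max * tau."""
    return math.dist(p1, p2) < dv_max * tau
```

With a block capacity of 1, any `(block, snapshot)` key returned by `hotspots` corresponds to a DCB issue, and the listed flights are the candidates that need to act.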
When considering stochastic wind, we introduce a special term, availability, for each block, which determines whether an aircraft can access the block. If the wind speed in a specific traversed block is larger than the aircraft's maximum wind speed resistance during some period, the block is unavailable for this aircraft in that period. In addition to the separation conflict, we therefore define a new conflict, in which the planned trajectory contains unavailable blocks.
The entire initial operation plan $\{traj_f\}$ is scanned to mark the conflicted flights and the corresponding block indexes.
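A possible form of the scan for wind-related conflicts is sketched below. The data layout (wind speeds keyed by `(block, snapshot)`, one resistance value per flight) and the function name are assumptions; blocks without a wind entry are treated as calm.

```python
def scan_wind_conflicts(plan, wind, resistance, dt):
    """Mark, per flight, the indexes of traversed blocks that are unavailable.

    plan:       {flight: [(block, entry_time), ...]}
    wind:       {(block, snapshot): predicted wind speed}
    resistance: {flight: maximum wind speed resistance}
    """
    conflicts = {}
    for flight, traj in plan.items():
        limit = resistance[flight]
        unavailable = [
            i
            for i, (block, t) in enumerate(traj)
            if wind.get((block, int(t // dt)), 0.0) > limit
        ]
        if unavailable:
            conflicts[flight] = unavailable
    return conflicts
```

Because availability depends on each aircraft's own resistance, the same block can be closed for one flight and open for another in the same period.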

Multi-agent Reinforcement Learning Solution
In this section, the strategic conflict problem is formulated as a multi-agent system. The Partially Observable Markov Decision Process (POMDP) is described, and as an extension, the MAA3C with mask recurrent structure is implemented.

POMDP
In a POMDP, agent $i$ ($i \in N$) in the multi-agent system $N = \{1, \cdots, N_a\}$ only observes the local information $O^i$ from the global state $S$. Let $A^i$ be the action space of agent $i$; then $\{A^i\}_{i \in N}$ forms the joint action space $A = A^1 \times \cdots \times A^N$. The transition from the previous state to the next state is given by $P: S \times A \to S$. After executing the action, each agent obtains its reward $R^i: S \times A \times S \to \mathbb{R}$ with a discount factor $\gamma \in [0, 1]$. As a result, the tuple $\langle N, S, \{A^i\}_{i \in N}, P, \{R^i\}_{i \in N}, \gamma, \{O^i\}_{i \in N} \rangle$ establishes the POMDP [23]. The components are described in detail as follows:
• Agent: Agents are flights with conflict status in the simulated scenarios, for instance, flights losing safe separation or encountering intolerable wind. As the number of trajectories in the operation plan is large and varies, not all flights need to take actions. From this definition, the number of agents changes at each training step because of the highly dynamic environment.
We can note that the dimension $N_f \times 3$ varies with the agent. The conflict is set to $c_i = 1$ if the agent loses safe separation or the wind $W_i$ is greater than its maximum wind speed resistance in $node_i$ at time $t_i$; otherwise $c_i = 0$.
• Action $a^i_{st} \in A^i$ ($i = 1, \cdots, N_a$): Each agent receives its action according to the policy $\pi^i: O^i \to A^i$. To resolve complex conflicts in uncertain environments, ground delay, airborne speed adjustment, and cancellation are designed as candidate actions. The actions $a^i_{st}$ of all agents form the joint action. Agents cooperate to achieve global optimization and are viewed as homogeneous.
Although three types of actions are listed, only one type is allowed in each step. If ground delay is selected, the entire trajectory $traj_f$ is shifted by $\delta t$ in the temporal dimension, i.e. every crossing time $t_i$ becomes $t_i + \delta t$. When the airborne speed is to be changed, $\delta\mu_k$ and $\delta\sigma_k$ ($k \in [1, K]$) are the specific outputs for this action. Updating $\mu_k \leftarrow \mu_k + \delta\mu_k$ and $\sigma_k \leftarrow \sigma_k + \delta\sigma_k$ in the GMM gives the new speed profile. Finally, when cancellation is selected, the current agent is removed from the operation schedule.
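The effect of the three action types on the schedule can be sketched as follows. The `FlightPlan` record, the string action labels, and the small positive floor applied to the updated variances are assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FlightPlan:
    """Hypothetical record: departure time plus GMM speed-profile parameters."""
    t0: float
    means: list
    variances: list


def apply_action(schedule, flight, kind, delta_t=0.0, d_mu=None, d_sigma=None):
    """Apply exactly one action type to one agent and update the schedule in place."""
    plan = schedule[flight]
    if kind == "cancel":
        del schedule[flight]  # the agent is removed from the operation schedule
    elif kind == "ground_delay":
        plan.t0 += delta_t    # shifts every crossing time by delta_t
    elif kind == "speed_adjustment":
        plan.means = [m + dm for m, dm in zip(plan.means, d_mu)]
        # Assumption: variances are floored to stay strictly positive.
        plan.variances = [max(v + dv, 1e-6) for v, dv in zip(plan.variances, d_sigma)]
    else:
        raise ValueError(f"unknown action type: {kind}")
    return schedule
```

After a ground delay or a speed adjustment, the crossing times of the flight would be recomputed from the updated departure time and speed profile before the plan is scanned for conflicts again.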

MAA3C
With the necessary components defined for the multi-agent system, we need an efficient architecture that ensures stationarity and scalability. In a POMDP, the multi-agent system is non-stationary since other agents interact with the environment simultaneously. Several approaches can deal with the non-stationarity issue, such as centralized training, opponent modeling, and communication [24]. In this paper, an actor-critic framework is built instead of value-based Q-learning, and a centralized critic is used to cope with the non-stationarity problem. Instead of concatenating the observation matrices of all agents as in a typical centralized actor, we employ parameter sharing to reduce complexity. All agents share the same policy network, but each takes only its own observation as input. In this way, agents still remain distinct because different inputs produce different hidden features. This procedure makes the shared policy network a generalized opponent model that can deal with any type of input.
The asynchronous advantage actor-critic algorithm (A3C) [25] is selected as the basic architecture and extended to the multi-agent system (MAA3C). The single-agent A3C assesses an action with the advantage function

$$A^\pi(S, a) = Q^\pi(S, a) - V^\pi(S),$$

where $Q^\pi(S, a) > V^\pi(S)$ indicates that action $a$ gives a better return under the current policy. To enable the multi-agent extension, all agents share the common advantage function $A^\pi(S_{st}, a_{st})$. Because of the parameter sharing, the average advantage value $A^\pi_{avg} = \sum_{n=0}^{N_a} A^\pi_n(S^n_{st}, a^n_{st})$ contributes to the policy loss $J(\pi_\phi)$ and the critic loss $L(\phi_v)$, which in turn update the policy network $\phi$ and the critic network $\phi_v$. Following the standard A3C formulation [25], the losses take the form

$$J(\pi_\phi) = \mathbb{E}\left[ \{\log \pi_\phi(a^n_{st} \mid S^n_{st})\}_{avg} \, \{U - V(S^n_{st}; \phi_v)\}_{avg} \right],$$

$$L(\phi_v) = \mathbb{E}\left[ \left( \{U - V(S^n_{st}; \phi_v)\}_{avg} \right)^2 \right],$$

where $\{\log \pi_\phi(a^n_{st} \mid S^n_{st})\}_{avg}$ is the average log value over all agents with policy $\pi_\phi$, $\{U - V(S^n_{st}; \phi_v)\}_{avg}$ denotes the average advantage value $A^\pi_{avg}$, $U$ is the accumulated return, and $V$ is the critic value for state $S^n_{st}$. $\mathbb{E}[\cdot]$ refers to the expected value.
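A numerical sketch of how the shared advantage could enter the two losses is given below. It assumes that the averages are arithmetic means over the agents, that the policy term is the product of the average log-probability and the average advantage, and that the critic loss is the squared average advantage; the exact expressions used in MAA3C may differ, and the function name is hypothetical.

```python
def maa3c_loss_terms(log_probs, returns, values):
    """Compute per-step loss terms from per-agent quantities.

    log_probs: log pi(a^n | S^n) of each agent's chosen action
    returns:   accumulated return U for each agent
    values:    critic estimate V(S^n) for each agent
    """
    n_agents = len(log_probs)
    advantages = [u - v for u, v in zip(returns, values)]
    adv_avg = sum(advantages) / n_agents      # assumed: mean over agents
    logp_avg = sum(log_probs) / n_agents      # assumed: mean over agents
    policy_term = logp_avg * adv_avg          # maximized (its negative is minimized)
    critic_loss = adv_avg ** 2                # assumed squared-error form
    return policy_term, critic_loss, adv_avg
```

In a real implementation the advantage would be treated as a constant when differentiating the policy term, and the gradients would be accumulated asynchronously by the workers as in A3C.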

Recurrent Actor-Critic
The learning procedure is formulated as a multi-agent system. However, two issues remain to be addressed. One is the varying number $N_f$ of traversed blocks for each aircraft, which leads to dimension misalignment. The other is the scalability of the system, which should accept a varying number of agents. Both problems are addressed by the recurrent actor-critic networks depicted in Fig. 6. Instead of feeding the inputs one by one, the observations of all agents are stacked; specifically, the number of agents is treated as the batch size. The first layer concatenates the information of all traversed blocks, and it can process inputs of flexible length thanks to the encoding ability of recurrent neural networks. Different components in the second layer generate the speed adjustment parameters $[\delta\mu_k, \delta\sigma_k]$ ($k \in [1, K]$, $K = 2$), the ground delay $\delta t$, a specially designed variant mask, the cancellation value $Ca$, and the critic value $V^\pi(S)$. As only one action is expected to be selected at each step, a two-valued mask ($mask = 0$ or $1$) is proposed. It first decides between ground delay and speed refinement, but the cancellation $Ca$ has the highest priority: if it outputs the cancellation decision $Ca = 1$, the agent and its related parameters are removed from the flight schedule. The action at each step is therefore cancellation if $Ca = 1$, and otherwise ground delay or speed adjustment according to the value of the mask.
The two-layer structure can deal with the dynamic number of traversed blocks and the varying number of agents at the same time. In this way, the scalability of the multi-agent solution can be guaranteed, and training and testing on operation plans of different dimensions become possible. Combined with the multi-agent learning procedure, the recurrent actor-critic structure is integrated into the MAA3C framework as in Algorithm 3.
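The two ideas in this subsection, encoding a variable-length observation into a fixed-size feature and selecting a single action with cancellation taking priority, can be sketched as below. The plain tanh recurrent cell stands in for whichever recurrent unit the networks actually use, the weights are arbitrary, and the mapping of $mask = 0$ to ground delay and $mask = 1$ to speed adjustment is an assumption.

```python
import math


def rnn_encode(sequence, w_in, w_h, b):
    """Encode a sequence of any length into a fixed-size hidden state.

    Plain recurrent cell: h_t = tanh(W_in x_t + W_h h_{t-1} + b).
    """
    hidden = [0.0] * len(b)
    for x in sequence:
        hidden = [
            math.tanh(
                sum(wi * xi for wi, xi in zip(w_in[j], x))
                + sum(wh * hj for wh, hj in zip(w_h[j], hidden))
                + b[j]
            )
            for j in range(len(b))
        ]
    return hidden


def select_action(cancel, mask):
    """Cancellation has the highest priority; otherwise the mask decides."""
    if cancel == 1:
        return "cancel"
    # Assumed mapping: mask 0 -> ground delay, mask 1 -> speed adjustment.
    return "ground_delay" if mask == 0 else "speed_adjustment"
```

Because the hidden state has the same size regardless of $N_f$, the encoded features of all agents can be stacked into one batch even when their trajectories traverse different numbers of blocks.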

Use Cases
In this section, we establish a small-scale urban environment to accomplish a last-mile delivery task with hundreds of small UAVs. The workflows for low-altitude airspace organization, operation plan generation, the aircraft performance database, and the wind models are presented in detail. The proposed MAA3C solution is then applied to the prototype to evaluate its effectiveness.

Flight Performance Dataset
To enable efficient performance-based operation, flight performance plays a critical role in the process of airspace

Low-altitude Air Corridor
To enable flexible airspace management, a dense grid is preferred. Therefore, an air corridor with multi-layer blocks is constructed over the Multi-User Environment for Autonomous Vehicle Innovation (MUEAVI) road built at Cranfield University, as depicted in Fig. 7. The dimension of each block is set to 4 m × 4 m × 4 m. Ground infrastructure such as cameras, lidar, and radar has great potential to achieve resilient surveillance in urban GPS-degraded areas. We implement this low-altitude air corridor in the CARLA simulator [26] for corridor and trajectory visualization. As displayed in Fig. 8, 5 vertiports are simulated in the environment for UAV take-off and landing. Following the graph construction process for dense grids explained in Section 2.1, all 1900 blocks comprise the node set $V$ of the graph $G(V, E)$.

Operation Plan Generation
With the flexible airspace management strategy, we are now able to generate batches of trajectories, which together form the entire operation plan. In the training phase, 100 trajectories are simulated according to Algorithm 2, taking the flight performance and the air corridor structure into account. Some selected trajectories are drawn in Fig. 9. The valid operation time runs from 8:00 to 18:00, so all time assignments must stay within this range. The operation timeline is divided into 8640 snapshots, and the entry demand is counted in each 5 s snapshot. As depicted in Fig. 10, the spatiotemporal information of all trajectories is projected onto a 2D map. Each point marks the presence of an aircraft in a block during a given period, and points form consecutive trajectories when the indices of the traversed blocks are sequential. It is clear that some overlapping trajectories lead to separation conflicts. We count and visualize the traffic flow in the 3D air corridor structure in Fig. 11. Combined with the conflict map in Fig. 12, it reveals crowded traffic in the vicinity of vertiports and in the nearby airspace. There are 10 failed separations, accumulated near the blocks above vertiports.
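The per-snapshot counting described above can be sketched as follows. The data layout (a list of block entries per flight, with times in seconds since midnight) and the conflict rule (more than `capacity` flights in one block during one snapshot) are assumptions made for illustration; the paper's own separation criterion is defined elsewhere.

```python
from collections import defaultdict

SNAPSHOT_S = 5          # snapshot length in seconds (from the text)
START_S = 8 * 3600      # 8:00 expressed in seconds since midnight


def snapshot_index(t_seconds):
    """Map an absolute time to its snapshot index on the operation timeline."""
    return int((t_seconds - START_S) // SNAPSHOT_S)


def count_occupancy(trajectories):
    """trajectories: {flight_id: [(block_id, entry_time_s), ...]}.

    Returns {(block_id, snapshot): set of flight ids present}.
    """
    occupancy = defaultdict(set)
    for flight_id, visits in trajectories.items():
        for block_id, t in visits:
            occupancy[(block_id, snapshot_index(t))].add(flight_id)
    return occupancy


def find_conflicts(occupancy, capacity=1):
    """ASSUMPTION: a conflict is any block/snapshot holding more than `capacity` flights."""
    return {key: flights for key, flights in occupancy.items() if len(flights) > capacity}


# Two hypothetical flights that meet in block 11 during the second snapshot.
plan = {
    "f1": [(10, 28800), (11, 28805)],
    "f2": [(12, 28800), (11, 28807)],
}
conflicts = find_conflicts(count_occupancy(plan))
```

Summing the sizes of the occupancy sets per block gives the kind of traffic-flow count that a heat map such as Fig. 11 would display.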

Model Training
As wind turbulence is integrated into the environment, the number of conflicts changes once the wind effect is activated. To include specific wind profiles, the Dryden gust model and a CFD model are simulated to obtain regional and local wind information, respectively. The Dryden gust data for training (Dryden-I) is generated as shown in Fig. 13. To condense the information, the directional velocities are projected into a wind-rose plot in Fig. 14, which shows the frequency of each magnitude and direction. The predominant wind blows towards the east (0°) and south-east (315°). In the strategic phase, however, the magnitude poses a greater threat to operational safety than the direction. The wind-rose diagram shows that intensities over 8 m/s occur frequently, which makes the training challenging. Blocks with excessive wind speed are closed and become unavailable for certain aircraft, and this inaccessibility raises the number of conflicts from 10 to 66.
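A minimal sketch of the block-closure idea is given below. The per-aircraft wind tolerance and the function names are hypothetical; the 8 m/s value in the example merely echoes the intensity mentioned in the wind-rose discussion and is not stated in the text as the actual closure threshold.

```python
import math


def wind_magnitude(u, v, w=0.0):
    """Euclidean norm of the directional wind velocities in m/s."""
    return math.sqrt(u * u + v * v + w * w)


def closed_blocks(wind_by_block, max_wind_mps):
    """Blocks whose wind magnitude exceeds the tolerance of a given aircraft."""
    return {
        block_id
        for block_id, (u, v, w) in wind_by_block.items()
        if wind_magnitude(u, v, w) > max_wind_mps
    }


def affected_flights(routes, wind_by_block, tolerance_by_flight):
    """Flights whose planned route crosses at least one block closed for them.

    ASSUMPTION: each flight has its own tolerance, reflecting that blocks are
    unavailable only "for certain aircraft".
    """
    affected = {}
    for flight_id, blocks in routes.items():
        closed = closed_blocks(wind_by_block, tolerance_by_flight[flight_id])
        hit = [b for b in blocks if b in closed]
        if hit:
            affected[flight_id] = hit
    return affected


# Hypothetical wind snapshot: block 2 carries a 10 m/s wind.
wind = {1: (3.0, 4.0, 0.0), 2: (6.0, 8.0, 0.0), 3: (0.0, 0.0, 0.0)}
routes = {"f1": [1, 2], "f2": [1, 3]}
tolerance = {"f1": 8.0, "f2": 8.0}
result = affected_flights(routes, wind, tolerance)
```

Flights that appear in the result must be rerouted in time (delayed or slowed) or cancelled, which is why closing blocks increases the number of conflicts to resolve.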
As the Dryden gust model only provides a wide-range regional view, it cannot represent the fragmented wind field between urban buildings. Therefore, CFD simulation at the building scale serves as a complementary source for training a generalized solution. All numerical computations are carried out with OpenFOAM [27]. As shown in Fig. 15, one snapshot of the CFD simulation result displays the complex wind flow, especially the street-canyon effect between buildings, where the wind speed increases or decreases dramatically. We select several points on the map to visualize the detailed data in Fig. 16. Compared with the standard inlet velocities, the points near buildings, e.g. p2 and p3, always show high fluctuations. The point p1 behind the building keeps a lower intensity than the inlet face, which further illustrates the complicated environment in the urban area. The point p4 in the open area, in contrast, follows the same tendency as the inlet velocity because it is not sheltered. As a consequence, the number of conflicts increases sharply from 10 to 182 when the high-resolution turbulence is considered.
The conflict heat maps in Figs. 17 and 18 reveal the severe conditions that arise once the Dryden wind model and the CFD wind model are involved. The regional wind leads to discrete conflict points, whereas the CFD model only affects flights in specific areas. To eliminate all conflicts and ensure safe flight, the proposed MAA3C framework is trained with the Dryden model and the CFD model, respectively. The proposed architecture is implemented in PyTorch on Ubuntu 18.04 with an Intel Xeon W-2123 CPU @ 3.60 GHz × 8. Two models are trained until convergence with the same initial operation plan and wind models of different resolution. All parameters of the 6 recurrent neural networks in Fig. 6 are initialized as in Table 3, where the block number is set to 700. The flight number can be reset from 100 to other values for various scenarios. In the training process, all cores are used to train the model asynchronously, and the learning parameters in the child threads are synchronized with the parent thread every 100 steps.
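The periodic synchronization between child and parent can be sketched in a single-threaded form as follows. This is not the MAA3C implementation: the classes are hypothetical, parameters are plain Python lists rather than PyTorch tensors, and the choice that a worker pushes its gradient to the parent at every step is an assumption. Only the interval of 100 steps is taken from the text.

```python
SYNC_EVERY = 100  # synchronization interval in steps (from the text)


class ParentModel:
    """Shared parameters held by the parent thread."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def apply_gradient(self, grad):
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


class ChildWorker:
    """A child keeps a local copy and refreshes it every SYNC_EVERY steps."""

    def __init__(self, parent):
        self.parent = parent
        self.local_params = list(parent.params)
        self.steps = 0
        self.sync_count = 0

    def step(self, grad):
        # ASSUMPTION: gradients computed with the local copy are applied to the parent.
        self.parent.apply_gradient(grad)
        self.steps += 1
        if self.steps % SYNC_EVERY == 0:
            self.local_params = list(self.parent.params)  # pull the latest parameters
            self.sync_count += 1


parent = ParentModel([0.0])
worker = ChildWorker(parent)
for _ in range(250):
    worker.step([1.0])  # constant dummy gradient
```

Between two synchronization points the local copy is stale, which is the usual trade-off of asynchronous training: less communication in exchange for slightly outdated parameters in each child.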
The return curve in Fig. 19 converges to a relatively stable value, with all conflicts resolved in the Dryden-I model. Meanwhile, the fast-converging curve in Fig. 20 also shows the ability of MAA3C to learn effective parameters for resolving conflicts in complex turbulence.

Model Test
Although the trained models perform well during training, this alone is not sufficient to claim that the MARL framework generalizes. As a comparison with the MAA3C solution, we replace all neural networks with continuous uniform distributions to assess the value of the learned parameters. The comparison model is named RANDOM, and its architecture in Fig. 21 resembles the structure in Fig. 6. The cumulative distribution function of each random component is formulated with a uniform distribution as in Eq. 12. Therefore, the ground delay, speed change, and cancellation parameters [δμ_k, δσ_k, δt, Ca] can be generated with Eq. 13.
In particular, only those agents whose Ca value is larger than the average over all agents are canceled. Meanwhile, the mask component in RANDOM is drawn from a Bernoulli distribution with p(mask = 0) + p(mask = 1) = 1 to select between ground delay and speed change.
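A minimal sketch of such a random baseline is shown below. The sampling bounds, the Bernoulli probability of 0.5, and all names are hypothetical placeholders, since the actual ranges are fixed by Eqs. 12 and 13; only the uniform sampling, the above-average cancellation rule, and the Bernoulli mask follow the description.

```python
import random


def random_baseline(num_agents, K=2, p_mask_one=0.5, max_delay_s=600.0, rng=None):
    """Draw RANDOM-style outputs for every agent (all ranges are assumptions)."""
    if rng is None:
        rng = random.Random()
    outputs = []
    for _ in range(num_agents):
        outputs.append({
            # (delta_mu_k, delta_sigma_k) for k = 1..K, uniform in placeholder ranges
            "speed": [(rng.uniform(-1.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(K)],
            "delay": rng.uniform(0.0, max_delay_s),          # delta_t
            "ca": rng.uniform(0.0, 1.0),                     # cancellation value
            "mask": 1 if rng.random() < p_mask_one else 0,   # Bernoulli draw
        })
    # Only agents whose Ca exceeds the average over all agents are cancelled.
    mean_ca = sum(o["ca"] for o in outputs) / num_agents
    for o in outputs:
        o["cancelled"] = o["ca"] > mean_ca
    return outputs


sample = random_baseline(10, rng=random.Random(0))
```

Because the cancellation rule is relative to the mean, some flights are cancelled in every draw regardless of the traffic situation, which is consistent with RANDOM eventually removing all conflicts through iterative steps.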

Test with Dryden Model
Different scenarios are simulated to evaluate the trained models. Firstly, another two Dryden wind profiles (Dryden-II and Dryden-III) are simulated with different average intensities in addition to Dryden-I. As exhibited in Fig. 22, the average magnitude of Dryden-II is lower than that of Dryden-I, which is used for training. The wind direction in the north (90°) and south-east (300°) appears frequently during the  Table 4.
Cases 1-6 test the model with new operation plans only, without wind effect. MAA3C and RANDOM are applied to the same operation plan in each pair of cases (Cases 1 and 2, Cases 3 and 4, etc.). The test return values are shown in Fig. 26. All cases with MAA3C obtain stable return values, in contrast to the lower mean values and larger variances obtained with RANDOM. In particular, Cases 1 and 2 obtain zero reward because their operation plan contains no conflict and requires no deconfliction. For Cases 7-12, 13-18, and 19-24, three different Dryden wind profiles (Dryden-II, Dryden-I, and Dryden-III) are integrated into the environment, respectively. The total number of conflicts changes markedly with the wind influence. If Dryden-I in Cases 13-18 is regarded as the baseline, Dryden-II decreases the environment complexity and yields smaller conflict numbers in Cases 7-12. Dryden-III, however, complicates the situation, as the conflict numbers in Cases 19-24 more than double. Under those challenging circumstances, the return values in all test cases reflect the stability of the MAA3C solution. With the same wind input (Cases 7-12, 13-18, or 19-24), growth in the flight number significantly degrades the return values, yet the MAA3C solution still stabilizes the output over multiple trials. Another evident finding is that, for the same operation plan (for instance, Cases 3, 9, 15, and 21), lower-intensity wind yields a higher return value.
To analyze the reason behind this, the specific delay and cancellation numbers are plotted in Fig. 27. The number of flights that must perform actions increases as the environment becomes more stressful. For instance, for Cases 1, 3, and 5 in the same wind field, the number of flights that perform delay, speed change, and cancellation grows with the total flight number. Similarly, cases with the same operation plan, such as Cases 5, 11, 17, and 23, tend to process many more flights to cope with high-magnitude wind.
We notice that the solution is inclined to select cancellation when the wind turbulence is complicated. From Case 13 to Case 24, the cancellation percentage ranges between 12% and 26%. Note that RANDOM can also resolve all conflicts because it involves random cancellations. Its competitive results are sometimes owed to the overall architecture, which ensures that all conflicts are eliminated through iterative steps. RANDOM also balances the selection between delay and cancellation, since the generated random values always keep a certain percentage for each of these two parts.

Test with CFD Simulation Model
In addition to the regional wind model, wind fields simulated with CFD are merged into the environment for refined urban operation. Along with the numerical model CFD-wind-I used for training, a new CFD model (CFD-wind-II) is simulated as in Fig. 28. The speed curves of the 5 selected points show complex trends that change with the dynamic inlet velocities. In comparison with the speed magnitude in Fig. 16, the wind intensity drops considerably, as shown in Fig. 29. Combining all tuples of wind input and study methods gives the 18 different cases indexed in Table 5, which are used to validate the MAA3C solution in precise wind fields. Cases 25-30 evaluate the models on operation plans without wind, like Cases 1-6 in Table 4. Cases 31-36 are entirely new scenarios with the wind model CFD-wind-II, and they show fewer conflicts than Cases 37-42.
As before, Cases 25, 26, 31, and 32 are not included since they contain no conflict. The reward output from MAA3C remains stable over multiple test episodes, as displayed in Fig. 30. Both the mean and the variance of the return are better than those of RANDOM. In contrast to the MARL results, RANDOM fluctuates noticeably when the number of conflicts is large. The number of flights that need to take action increases with the total number of conflicts, as plotted in Fig. 31. An obvious observation is that all cases with MAA3C tend to execute cancellation instead of delay or speed adjustment. The cancellation proportion stays below 10% for Cases 25-38, but ranges from 10% to 19% for Cases 39-42.
Those observations imply the critical role of the weight of each component in the reward function. If cancellation is underweighted, conflicts in complicated scenarios cannot be resolved. In this paper, conflict resolution is the priority; therefore, cancellation is placed at the decisive position in the architecture. Accordingly, delay and cancellation exhibit similar selection probabilities in simple scenarios. As the environment becomes more complicated, however, MAA3C shifts the emphasis to cancellation alone in order to resolve all conflicts. RANDOM cannot achieve this learning ability when coping with dynamic environments.
Note that the simulated wind will not be identical to real wind and cannot cover all situations. However, the objective of our solution is to generalize to any operation schedule and wind input. The test results with different Dryden gusts and CFD wind models demonstrate the stability and robustness of the proposed MAA3C framework in the presence of dynamic wind turbulence. At the same time, the trained model can be flexibly applied to different operation plans with various flight numbers, which achieves scalable, resilient, and efficient operations for UAM.

Conclusion
In this paper, the strategic conflict management problem, which comprises multi-resolution airspace organization, elaborated trajectory generation, and performance-based operation, is resolved by the MARL solution in the presence of wind turbulence. The definitions of flexible airspace, adjustable trajectories with the GMM model, the Dryden gust model, and the CFD simulation model enable MAA3C to learn effective parameters. In particular, the recurrent actor-critic networks in MAA3C allow scalable input with varying numbers of flights and airspace blocks. The crucial performance-related concept is integrated into each component of strategic conflict resolution to achieve efficient operations in urban regions. This paper explores the application of MARL to operations in highly dynamic environments. Conflict management follows a procedure similar to that in ATM or UTM and involves some concepts from the UAM ConOps. However, this work focuses on the initial stage with an air corridor structure rather than completely free flight, owing to the lack of effective surveillance measures in low-altitude urban areas. It is possible to extend the proposed solution to broader airspace with more blocks. In reality, the wind flow changes

Table 4 Test cases with the Dryden wind models (Cases 6-24)
Case  Wind model  Method  Flight number  Conflict number
6     No wind     RANDOM  200            29
7     Dryden-II   MAA3C   50             1
8     Dryden-II   RANDOM  50             1
9     Dryden-II   MAA3C   100            19
10    Dryden-II   RANDOM  100            19
11    Dryden-II   MAA3C   200            35
12    Dryden-II   RANDOM  200            35
13    Dryden-I    MAA3C   50             25
14    Dryden-I    RANDOM  50             25
15    Dryden-I    MAA3C   100            51
16    Dryden-I    RANDOM  100            51
17    Dryden-I    MAA3C   200            186
18    Dryden-I    RANDOM  200            186
19    Dryden-III  MAA3C   50             61
20    Dryden-III  RANDOM  50             61
21    Dryden-III  MAA3C   100            227
22    Dryden-III  RANDOM  100            227
23    Dryden-III  MAA3C   200            439
24    Dryden-III  RANDOM  200            439
Author Contributions Cheng Huang contributed to the algorithm design, implementation, and writing of this paper; Ivan Petrunin and Antonios Tsourdos contributed to the result analysis and revision of the manuscript.
Funding This research was partially supported by grants from the Funds of China Scholarship Council (202008420248).

Availability of data and materials
The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Declarations
Ethics approval Not applicable, as this study does not contain biological applications.
Consent to participate All authors of this research paper have consented to participate in the research study.

Consent for Publication
All authors of this research paper have read and approved the submitted version.

Conflict of Interests
The authors have no relevant financial or nonfinancial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.