Multi-Start Team Orienteering Problem for UAS Mission Re-Planning with Data-Efficient Deep Reinforcement Learning

In this paper, we study the Multi-Start Team Orienteering Problem (MSTOP), a mission re-planning problem where vehicles are initially located away from the depot and have different amounts of fuel. The goal is for multiple vehicles to maximize the sum of collected profits under resource (e.g., time, fuel) consumption constraints. Such re-planning problems occur in a wide range of intelligent UAS applications where changes in the mission environment force the operation of multiple vehicles to change from the original plan. To solve this problem with deep reinforcement learning (RL), we develop a policy network with self-attention on each partial tour and encoder-decoder attention between the partial tour and the remaining nodes. We propose a modified REINFORCE algorithm where the greedy rollout baseline is replaced by a local mini-batch baseline based on multiple, possibly non-duplicate sample rollouts. By drawing multiple samples per training instance, we can learn faster and obtain a stable policy gradient estimator with significantly fewer instances. The proposed training algorithm outperforms the conventional greedy rollout baseline, even when combined with the maximum entropy objective.


Introduction
As the operational technology of Unmanned Aerial Systems (UAS) matures, there is a growing need for fast and accurate high-level decision-making for autonomous mission planning. UAS applications in logistics and surveillance (e.g., airborne reconnaissance, forest fire detection, geographical monitoring, online commerce, and drone delivery) are gaining interest [1], [2]. Prior UAS mission studies addressed variants of the vehicle routing problem formulated as NP-hard combinatorial optimization (CO) problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP).
These classical CO problems are primarily concerned with mission preplanning based on the current knowledge of the environment. However, missions in real life involve many unknown and possibly changing factors such as sudden gusts, GPS denial, unexpected threats, terrain uncertainties, fuel leakage, and hardware malfunction. Once the vehicles have left the base, it is critical to respond to unexpected environmental changes by managing mission objectives autonomously, thus prompting the need for near-optimal mission re-planning in real time. Furthermore, visiting all nodes may not be practical considering resource availability.
Instead, such applications may require vehicles to visit as many nodes as possible within a maximum duration given on each route. These characteristics of real-life applications give rise to the Multi-Start Team Orienteering Problem (MSTOP), which is a generalization of the Team Orienteering Problem (TOP) with additional degrees of freedom on launch location and available fuel for each vehicle. Many routing problems assume that all vehicles begin routing from the depot. In contrast, MSTOP models the real-life mission re-planning scenario by launching vehicles located away from the depot, each with a different amount of fuel available.
The MSTOP is formulated in the context of route planning for intelligent UAS and robotic agent systems. Given the nature of higher-level decision making, more efficient route plans for optimal assignments among agents are desirable. For example, a fleet of UAVs suppressing forest fires needs an optimal order of visiting sites to make the most of its limited volume of extinguishing water. The fleet may also need to update its assigned spots frequently, as wildfires can spread unpredictably, which calls for re-planning the routes.
Another application is the efficient operation of unmanned delivery drones. If a delivery drone is to visit a number of sites to deliver multiple parcels, the order of sites to be visited can be optimized so that operational revenue is maximized. On top of that, a scheduled delivery site can be modified at the request of the customer, and the drones already in delivery then require a new mission plan. In this manner, the MSTOP belongs to a general higher-level planning framework for a wide range of applications in UAS and robotic systems.
Various traditional approaches have been applied to solve CO problems so far. For example, exact algorithms are generally based on branch-and-bound or branch-and-cut approaches to obtain optimal solutions. However, finding an optimal solution may take an inordinate amount of time when the problem size grows. Approximate algorithms rapidly produce near-optimal solutions that are often tailored for specific CO problems. Heuristic approaches utilize domain expertise to design hand-crafted strategies for progressively constructing a solution. These approaches may not be straightforwardly applicable to other routing problems.
The deep reinforcement learning (RL) approach has recently emerged as a fast and powerful heuristic solver to find near-optimal solutions to many CO problems. This paper aims to develop a deep RL-based construction framework for solving the MSTOP. We propose a data-efficient training methodology that improves the solution quality and learning speed. To demonstrate the effectiveness of our training methodology, we experiment on two classical CO problems: TSP and CVRP. These experiments confirm that our training methodology outperforms the conventional methodology in [3] and is comparable to the state-of-the-art policy optimization with multiple optima for reinforcement learning (POMO) [4] while using significantly less data. In addition, we identify the asymmetry in the solution representation of MSTOP and use it to further improve performance during inference. With this advanced inference strategy, our model can generate high-quality solutions in a notably short time, bringing us a step closer to real-time mission re-planning.
In summary, our primary contributions are threefold. First, we explore the MSTOP, a routing problem that reflects a real-life mission re-planning scenario, using a data-driven method (deep reinforcement learning). Specifically, we follow the Transformer's encoder-decoder architecture [5]. We use a standard encoder with a multi-head attention mechanism. For the decoder, we adapt the decoding strategy in [6], the current state-of-the-art deep RL solver for TSP, and adopt a nested inner/outer loop framework similar to [7]. We name our policy network the Deep Dynamic Transformer Model (DDTM).
Second, we propose a data-efficient training approach based on a baseline derived from multiple instances generated by applying linear coordinate transformations to a single instance. These augmented instances are distinct in their raw form since each node in the 2D Cartesian plane has been transformed. As a graph, however, they are identical because the lengths between the nodes are preserved. We replace the greedy rollout baseline with a local mini-batch mean (obtained by rolling out all augmented instances) and combine it with the maximum entropy RL method [8,9]. Our proposed methodology outperforms the computationally expensive greedy rollout baseline [3] and significantly expedites the learning process.
Finally, we improve the efficiency of the inference phase by using instance augmentation tailored for the MSTOP. Unlike TSP and CVRP, solutions to MSTOP are inherently asymmetric since the order of vehicles breaks the symmetry in the solution representations (see Fig. 1). We utilize the asymmetry in MSTOP solutions by permuting all vehicle orders and generating multiple rollouts for each permutation of vehicle order at the inference stage. This method is more efficient than the conventional sampling and instance-augmentation inference (using a single vehicle order).
The remainder of this paper is organized as follows. Section II briefly introduces past studies related to our work (e.g., deep RL approaches for classical CO problems). Section III formulates the MSTOP as a Mixed Integer Linear Program (MILP) and as a Markov Decision Process (MDP). Section IV describes our DDTM policy network in detail. Section V describes our proposed REINFORCE baseline and presents inference results on various routing problems. In Section VI, to corroborate the effectiveness of our method, we report an ablation study among several training baselines and present generalization results. Finally, Section VII concludes the paper and discusses future research directions.
Fig. 1 Multiple representations for an optimal solution exist in TSP and CVRP. However, for MSTOP, the order of vehicles breaks the symmetry in solution representation

Literature Review
The Team Orienteering Problem (TOP) belongs to the broader Vehicle Routing Problem with Profits (VRPP) class. A fleet of vehicles is given, but the vehicles are not required to visit all the nodes or customers. Each node is associated with a prize (profit), denoting its relative attractiveness. The objective is to find a subset of nodes that maximizes the total collected profits while satisfying a limit on the maximum duration of each route [10][11][12]. Exact algorithms to solve the TOP include approaches based on column generation and constraint branching [13] and a branch-and-price algorithm [14]. Taking the TOP as a basis, we devise the MSTOP by extending it with two additional degrees of freedom: launch locations of vehicles and remaining fuel for each vehicle. The MSTOP stands in contrast to traditional CO problems in that the launch locations for each vehicle are distinct. Therefore, the problem state seen by each vehicle is naturally different at each construction step [15].
One of the early attempts to apply the deep RL approach to CO in a constructive manner is the study by Bello et al. [16]. They used the pointer network (PtrNet) architecture [17] to encode input sequences and construct the node sequence in the decoder. Their model was tested on the TSP and the 0-1 knapsack problem (KP) and yielded close-to-optimal results. The PtrNet model was further improved by Khalil et al. [18] and Nazari et al. [19]. Deudon et al. [20] used the pointer network with an attention encoder. Inspired by the Transformer model for machine translation [5], Kool et al. [3] proposed the attention model (AM) based on the Transformer architecture to solve various CO problems such as the TSP, VRP, and Orienteering Problem (OP). Cappart et al. [21] combined RL and constraint programming (CP) to solve the TSP with Time Windows (TSPTW) by learning branching strategies. Additionally, Bono et al. [15] proposed a modified Transformer model to handle dynamic and stochastic VRPs (DS-VRPs) by using online measurements of the environment to select the next vehicle online via a vehicle-customer intersection module. More recently, Li et al. [22] improved the AM to solve the Heterogeneous Capacitated VRP (HCVRP). Li et al. [23] proposed the attention-dynamic model to solve the covering salesman problem (CSP). Xu et al. [24] designed an attention model with a multiple relational attention mechanism that better captures the transition dynamics. Pan and Liu [25] designed a graph-based partially observable MDP (POMDP) that captures the changes in customer demands to solve a dynamic and uncertain VRP using a deep neural network model with a dynamic attention mechanism. Besides attention models, Wang [26] proposed a variational autoencoder-based reinforcement learning methodology using a graph reasoning network for classic vehicle routing problems.
In terms of performance, Kwon et al. [4] introduced the POMO method, which has demonstrated state-of-the-art results on TSP, CVRP, and KP. During training, the POMO decoder generates multiple heterogeneous trajectories that start at every node to maximize entropy on the first action.
The majority of past studies used policy gradient approaches, which have advantages over supervised learning (SL) [27]. Bello et al. [16] used an actor-critic algorithm to train their model. However, Kool et al. [3] showed that a greedy rollout baseline yields better results than a (learned) critic baseline. Many subsequent works, including [6], [22], [23], [24], and [7], used the greedy rollout baseline. Although the greedy rollout baseline is effective, it requires an additional forward pass of the model, increasing the computational load on the device. To leverage more data parallelism for efficient learning of training instances, Kool et al. [28,29] proposed to use a local baseline equal to the average return over k trajectories sampled without replacement from a single instance using Stochastic Beam Search. They reported that this baseline performed on par with or slightly better than the computationally expensive greedy rollout and significantly better than the batch baseline. The benefit of sampling without replacement is that the gradient estimators do not lose much final performance while learning from substantially fewer instances (the number of training instances is reduced by a factor of k).
In addition, Kwon et al. [4] used a shared baseline based on all POMO samples, taking the average tour length over n sample trajectories from a single instance, where n is the number of nodes. Like the multiple-sample baselines in [28], the POMO shared baseline is local, concentrating on a single instance. As reported in [4], their baseline is very effective since it generates n non-duplicate sample trajectories for a single instance, where n is typically larger than k in [28]. However, POMO requires an additional tensor dimension, and as the graph size n increases, the tensor size increases n-fold. Consequently, while the training time of POMO is comparable to that of REINFORCE with greedy rollout (owing to the parallel generation of trajectories), it requires more GPU memory. Moreover, POMO training may not be readily applicable to problems such as MSTOP, where we cannot simply use all the nodes as starting points for exploration.
Many strategies for efficient inference were also proposed in prior studies. Bello et al. [16] proposed the "one-shot" greedy inference and sampling strategies. Deudon et al. [20] improved their solution quality by refining it with the 2-Opt heuristic [30]. Kwon et al. [4] suggested ×8 instance augmentation to generate multiple trajectories and select the best solution to obtain better results.

Mathematical Formulation of MSTOP
This section presents the MILP formulation of MSTOP. In particular, this formulation is defined on a graph following [10]. A complete graph $G = (N, A)$ consists of the set of all nodes ($N = X \cup V$) and the set of arcs or edges ($A$). The set of nodes $N$ is the union of $X = \{0, 1, \ldots, n\}$ (customer nodes $1, \ldots, n$ and depot node $0$) and $V$, the set of vehicle launch nodes. The arc set $A$ represents the arcs among the vehicle locations and the remaining nodes.
In the MSTOP, multiple vehicles begin at locations different from the depot. Each vehicle has an available amount of fuel at the start. All vehicles have the same maximum route duration $T_{\max}$. Given the vehicle set, the MSTOP determines $K$ routes that maximize the total profits collected over the partial routes while satisfying a maximum duration constraint on each route.
Let $x_{ijk}$ be a binary variable that equals one if arc $(i, j) \in A$ is traversed by vehicle $k \in K$, and zero otherwise. Similarly, the binary variable $y_{ik}$ equals one if node $i \in X$ is visited by vehicle $k \in K$, and zero otherwise. The travel length associated with arc $(i, j)$, $t_{ij}$, is the Euclidean distance between the two nodes. $f_k$ denotes the available fuel amount at the start for each vehicle $k \in K$, and $p_i$ is the scalar prize associated with node $i \in X$. Subscript $v$ denotes a vehicle's launching node. The MILP formulation for the MSTOP is given by Eqs. (1)-(10).
Eq. (1) expresses the objective of the problem, which is maximizing the collected profit from routes. Eqs. (2)-(10) present the constraints of the problem. Eq. (2) ensures that all routes end at the depot. Eq. (3) guarantees that an arc enters a node and leaves from that node. Eqs. (4)-(5) ensure that a route begins at the initial vehicle location. Eq. (6) constrains the number of total routes ($K$). Eq. (7) imposes a constraint that each node is visited at most once. Eq. (8) limits the maximum duration or length for each route. Lastly, Eqs. (9)-(10) define the decision variables.
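The constraints described above can be illustrated with a small feasibility check. The following is a minimal sketch rather than the formulation itself: the function names, the dictionary-based data layout, and the choice to bound each route length by the vehicle's available fuel are assumptions made for illustration.

```python
import math

def route_length(route, coords):
    """Sum of Euclidean arc lengths t_ij along a route given as a list of node ids."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))

def evaluate_solution(routes, coords, prizes, fuel, depot=0):
    """Return the total collected prize, or None if a described constraint is violated.

    routes: dict vehicle -> [launch_node, customer, ..., depot] (hypothetical layout)
    fuel:   dict vehicle -> travel length available to that vehicle (assumed bound)
    """
    visited = set()
    total = 0.0
    for k, route in routes.items():
        if route[-1] != depot:                              # every route ends at the depot
            return None
        if route_length(route, coords) > fuel[k] + 1e-9:    # duration / fuel limit
            return None
        for node in route[1:-1]:                            # customers only
            if node in visited:                             # each node visited at most once
                return None
            visited.add(node)
            total += prizes[node]
    return total
```

For example, with two vehicles launched from nodes "vA" and "vB", the function returns the summed prize when both routes are feasible and None as soon as one route exceeds its fuel or revisits a node.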
Note that the local constraints of the formulation do not guarantee that all nodes in a route are properly connected without subtours. To generate a feasible set of routes, we add the subtour elimination constraints. However, given the nature of routing problems, adding such constraints before the optimization can significantly increase the model size for large-scale problems. As a result, we add the subtour elimination constraints in a lazy fashion [31]. This way, we can remove solutions with subtours during the optimization.

MDP Formulation of MSTOP
This section introduces the MDP formulation of the MSTOP. To apply reinforcement learning to MSTOP, we model the problem as a sequential decision-making process, where an agent performs a sequence of actions (i.e., decides which node to visit) through interactions with the surrounding environment (i.e., observing changes in the state) to maximize the cumulative reward.
In our MDP setting, a vehicle is first assigned at random. The agent selects nodes to visit starting from the initial position of the assigned vehicle. Once a partial route is constructed, the agent chooses the next vehicle starting at a different location. The complete solution is constructed by concatenating the individual partial routes. We model the MSTOP as an MDP defined by a 4-tuple <S, A, P, R>, where S denotes the state space, A the action space, P the state transition model, and R the reward model.

State Space (S):
Each state at time step $t$ is defined as a tuple $s_t = \langle X_t, V_t \rangle$. The first component of the tuple, $X_t = \{x_i^t\}$, denotes the set of all nodes, and the second component, $V_t = \{v_k^t\}$, expresses the states of all vehicles. Here, $x_i^t = (r_i, p_i^t)$ contains the information of a node, where $r_i = (x_i, y_i)$ is the location and $p_i^t$ is the prize assigned to the node. Also, $v_k^t = (r_k^t, f_k^t, O_k^t)$, where $r_k^t$ represents the vehicle location, $f_k^t$ is the vehicle's available/remaining fuel amount, and $O_k^t$ is the total prizes collected until step $t$. We denote the terminal time as $T$, at which all vehicles arrive at the depot.

Action Space (A):
The permissible set of actions in our MDP is the choice of the next node to visit by considering the vehicle's current partial route and the amount of fuel. We denote each action at time step $t$ ($a_t \in A$) as $x_j^t$ and view the action as an addition of a node to the partial route. The construction of a partial route satisfies the maximum travel duration constraint for each vehicle through an action masking policy, i.e., masking the nodes that cannot be visited.
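The action masking described above can be sketched as follows. This is an illustrative sketch under a stated assumption: a customer node is treated as visitable only if the vehicle can reach it and still return to the depot with its remaining fuel, and the depot is always allowed. The function name and arguments are hypothetical.

```python
import math

def feasible_actions(current_pos, fuel_left, coords, visited, depot=0):
    """List the nodes the current vehicle may visit next.

    Assumption: node j is feasible if dist(current, j) + dist(j, depot) <= fuel_left.
    The depot is always feasible, which presumes the vehicle can still reach it.
    """
    allowed = [depot]
    for j, pos in coords.items():
        if j == depot or j in visited:
            continue                                    # masked: depot handled above, or already visited
        need = math.dist(current_pos, pos) + math.dist(pos, coords[depot])
        if need <= fuel_left + 1e-9:
            allowed.append(j)                           # reachable with a safe return
    return allowed
```

Masked nodes would receive zero probability in the decoder, so the sampled partial route never violates the duration limit.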

State Transition Model (P: S×A→S):
The state transition model describes how the current state ($s_t$) transitions to the next state ($s_{t+1}$) when an action ($a_t$) is taken. We adopt deterministic transition dynamics, i.e., a vehicle moves to the chosen node with probability 1. Given the current vehicle $k$ and chosen action $a_t = x_j^t$ (i.e., the vehicle visits node $j$), we update the elements of $\{x_i^t\}$ and $\{v_k^t\}$ at step $t$ according to Eqs. (11)-(14).
Eq. (11) sets the prize associated with node $j$ to 0 once it is visited, and Eq. (12) updates the current location of vehicle $k$. Eq. (13) updates the available amount of fuel by subtracting $t_{ij}$ (the distance between nodes $i$ and $j$) from it. Eq. (14) updates the total prize by adding the prize value obtained at node $j$ ($p_j$).
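The four updates described for Eqs. (11)-(14) can be written as a single deterministic step function. The sketch below is illustrative: the class name, field names, and dictionary layout are assumptions, while the update logic follows the verbal description of the transition model.

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class VehicleState:
    location: tuple    # vehicle location
    fuel: float        # remaining fuel amount
    collected: float   # total prize collected so far

def step(prizes, vehicle, coords, j):
    """Deterministic transition when the current vehicle visits node j."""
    t_ij = math.dist(vehicle.location, coords[j])   # Euclidean travel length
    new_prizes = dict(prizes)                       # leave the caller's state untouched
    gained = new_prizes.get(j, 0.0)
    new_prizes[j] = 0.0                             # Eq. (11): prize of node j becomes 0
    new_vehicle = VehicleState(
        location=coords[j],                         # Eq. (12): vehicle moves to node j
        fuel=vehicle.fuel - t_ij,                   # Eq. (13): fuel decreases by t_ij
        collected=vehicle.collected + gained,       # Eq. (14): prize is accumulated
    )
    return new_prizes, new_vehicle
```

Returning new objects instead of mutating the inputs keeps earlier states available, which is convenient when several rollouts branch from the same state.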

Reward Model (R: S × A → ℝ):
We model the cumulative reward as the sum of total prizes collected from all partial routes. To be specific, the reward is the sum over all vehicles of the total prize $O_k^T$ collected by the terminal time. The termination time $T$, determined by the number of actions executed until the completion of all partial routes, defines the trajectory length.

Proposed Model and Solution Procedure
4.1 Proposed Framework
Fig. 2 explains the framework proposed to solve the MSTOP, which contains inner and outer loops. The inner loop begins at the vehicle's initial location and generates a partial route that terminates at the depot. Each partial route is a permutation of numbers ending with 0, as shown in Fig. 3. When the inner loop is finished, the outer loop updates the graph instance.

Fig. 2 Diagram explaining the proposed framework
This procedure contrasts with the models in [3], where the encoder is executed only once initially ($t = 0$). In classical CO problems, when a vehicle returns to the depot, the graph instance changes only slightly because the next vehicle starts at the same depot. However, constructing [...] that have arrived at the depot.
In the solution procedure, the encoder first processes raw features of the graph instance into a hidden representation (node-vehicle embeddings). These embeddings are then passed to the decoder, which extracts relevant information to generate a probability distribution over non-visited nodes to select the next node. This process is repeated until the depot is selected. We then update the graph following each partial route before moving on to the next vehicle.
4.2 Encoder-Decoder Architecture of DDTM
Fig. 4 presents the encoder-decoder architecture of DDTM used for MSTOP. Fig. 5 illustrates the encoder structure (for a single encoding layer). The encoder embeds the MSTOP features using separate parameters for the additional vehicle features: vehicle location and available fuel. We denote the embedded feature data as $h^{(l)}$, where $l$ is the encoder layer. The embedded data as a whole represents the graph instance, and each element in $h^{(l)}$ is a mapping corresponding to each feature. A good feature mapping needs to consider the feature's context within the graph.

Fig. 5 Encoder structure for a single encoding layer (blocks: Initial Embedding, Multi-Head Self-Attention)
For example, the node representation should contain sufficient information to be selected among its neighbors and to determine its position in the output sequence. To understand how one feature is related to another from a broader perspective, we apply multi-head self-attention, which generates enhanced feature embeddings. The encoding steps are formally expressed in Eqs. (15)-(24), where $d_k = d/H$, $d$ (= 128) is a hyperparameter, and $H$ (= 8) is the number of heads.
To compute multi-head attention, we concatenate the attention outputs of each head ($Z_h^{(l)}$). The next embedded feature, $h^{(l+1)}$, is obtained by passing $h^{(l)}$ through a feed-forward layer with batch normalization, residual connection, and ReLU activation, whose weights and biases are trainable parameters with hidden dimension $d_h$ (= 512).
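The attention and feed-forward computations of one encoding layer can be sketched in NumPy as below. This is a simplified illustration, not the DDTM implementation: the weight shapes, the function names, and the omission of batch normalization and residual connections are assumptions made to keep the example short.

```python
import numpy as np

def multi_head_self_attention(h, Wq, Wk, Wv, Wo):
    """h: (n, d). Wq, Wk, Wv: (H, d, d_k) with d_k = d / H. Wo: (H * d_k, d)."""
    H, d, d_k = Wq.shape
    heads = []
    for i in range(H):
        Q, K, V = h @ Wq[i], h @ Wk[i], h @ Wv[i]        # each (n, d_k)
        scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot products, (n, n)
        scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)                 # row-wise attention weights
        heads.append(A @ V)                               # output of one head
    return np.concatenate(heads, axis=1) @ Wo             # concatenate heads, then project

def feed_forward(h, W1, b1, W2, b2):
    """Two-layer feed-forward block with ReLU; W1: (d, d_h), W2: (d_h, d)."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2
```

With $d$ = 128 and $H$ = 8, each head works in a 16-dimensional subspace, and the feed-forward block expands to $d_h$ = 512 before projecting back to $d$.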
After $n_{enc}$ encoding layers, the final output of the encoder is the node-vehicle embedding ($h^{(n_{enc})}$), where $N'$ (= $N - N_{visited}$) is the remaining number of nodes and $K'$ is the remaining number of vehicles. After a partial route is constructed ($t > 0$), the graph instance seen by the next vehicle differs from that seen by the previous ones. We update the graph instance by recomputing the node-vehicle and graph embeddings using Eqs. (15)-(24), and mask the visited nodes using the outer product of a column mask vector, which masks visited nodes and vehicles at the depot, with a column vector of ones, applied through the Hadamard (element-wise) product for matrices.
Given the node-vehicle and graph embeddings from the encoder, the decoder produces probability distributions ($p_{t_{dec}}$) for all candidate nodes and selects the next node. Candidate nodes are those not visited by any vehicle at the start of decoding. Our decoding strategy consists of three steps based on [6] as follows:
Step 1: We begin by computing the multi-head self-attention between the current node and the nodes in the current partial route. By examining the history of visited nodes for the current node, we obtain the contextual information up to the current decoding time, $t_{dec}$. We first extract the current node embedding, where the current node is the node selected in the previous step. Since the partial route begins at the vehicle's location and ends at the depot, the order of nodes in the partial route matters. This characteristic requires the addition of positional encoding [5] to the linearly projected pair, where $PE_{t_{dec}}$ is a $d$-dimensional row vector. Each element of the vector is indexed by $i \in \{0, 1, \ldots, d-1\}$, the position along the $d$ dimension.
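The positional encoding referenced from [5] can be illustrated with the standard sinusoidal form of the Transformer. The sketch below assumes that form (sine on even dimensions, cosine on odd dimensions, base 10000); whether the decoder here uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import math

def positional_encoding(pos, d=128):
    """Sinusoidal encoding for decoding position `pos` as a d-dimensional vector.

    Even dimensions use sin and odd dimensions use cos, and each sin/cos pair
    shares one frequency, as in the original Transformer.
    """
    pe = []
    for i in range(d):
        angle = pos / (10000 ** (2 * (i // 2) / d))   # frequency shared by dims 2m and 2m+1
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```

Adding this vector to the projected embedding at each decoding time lets self-attention over the partial route distinguish the order in which nodes were visited.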
Fig. 6 illustrates decoding Step 1. There are $t_{dec}$ visited nodes in the current partial route. We first compute the self-attention between the current node and the nodes in the current partial route. Step 1 is mathematically described as follows (where $d_k = d/H$).
Fig. 6 Step 1 of the decoding procedure. The orange contour indicates the partial route at time step $t_{dec}$
Step 2: We next compute the encoder-decoder attention. We mask the nodes that cannot be visited from the current location. Fig. 7 illustrates the encoder-decoder attention in Step 2 of the decoding procedure. The following equations express Step 2.
Fig. 7 Step 2 of the decoding procedure. The blue box denotes the current node, the green contour represents the set of candidate nodes, and the red cross indicates masked nodes
The decoding Step 3 is described by the following equations and illustrated in Fig. 8.
We consider trajectories ($\tau$) whose actions are chosen by the policy. The objective of the policy optimization problem expressed in Eq. (44) uses the expectation over all possible trajectories. For a given stochastic policy ($\pi_\theta$), the trajectory probability ($p(\tau; \theta)$) represents the probability of generating a trajectory following the policy. The trajectory probability is factorized into the policy's action probabilities and the state-transition probability of the MDP defined in Section III.
Williams [32] proposed a viable estimator of the policy gradient using Monte-Carlo sampling by assuming that $R(\tau)$ is independent of $\theta$. In practice, the unbiased REINFORCE gradient estimator presented in Eq. (46) suffers from a high variance of the returns $R(\tau_i)$ and is sample inefficient since it requires many sample episodes to converge. We can overcome these issues by including a baseline ($b(s)$), an action-independent function, in the policy gradient estimation. Consequently, an unbiased estimate of the gradient with reduced variance is obtained by replacing each return $R(\tau_i)$ with $R(\tau_i) - b(s)$.
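The effect of an action-independent baseline can be checked on a toy problem. The sketch below is not taken from the paper: it assumes a one-step softmax policy over two actions and computes, exactly, the mean and variance of the single-sample REINFORCE gradient with respect to the logit of action 0. The function name and the reward values are hypothetical.

```python
def grad_stats(probs, rewards, baseline):
    """Exact mean and variance of (R - b) * d/dz0 log pi(a) under a softmax policy.

    For a softmax policy, d/dz0 log pi(a) equals 1[a == 0] - pi(0).
    """
    grads = [(r - baseline) * ((1.0 if a == 0 else 0.0) - probs[0])
             for a, r in enumerate(rewards)]
    mean = sum(p * g for p, g in zip(probs, grads))               # expectation over actions
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, grads))  # variance over actions
    return mean, var

# Same expected gradient with or without a baseline, but different variance.
mean_plain, var_plain = grad_stats([0.5, 0.5], [10.0, 11.0], baseline=0.0)
mean_base, var_base = grad_stats([0.5, 0.5], [10.0, 11.0], baseline=10.5)
```

In this toy case the expected gradient is unchanged by the baseline, while the variance drops to zero when the baseline equals the mean return, which is the motivation for choosing a baseline close to the expected return of each instance.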

Choice of REINFORCE baseline b(s)
An example of the baseline is the average return over sample trajectories ($b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)$), where $N$ is the number of samples in a mini-batch. Although the mini-batch baseline can effectively reduce variance in Gradient-Bandit algorithms [33], Kool et al. [28] showed that it performs significantly worse than other state-of-the-art baselines.
Prior studies suggest that designing an effective yet computationally tractable baseline [...]. In general, a local baseline performs significantly better than a batch baseline. In particular, a local baseline based on multiple samples without replacement is expected to perform better because non-duplicate samples are guaranteed [28,29]. This observation can be extended to POMO [4], whose local batch mean is based on N non-duplicate sample trajectories from a single instance, despite an increased tensor size. Since each POMO trajectory begins at a unique node, these samples are also guaranteed to be non-identical. These REINFORCE baselines are more data-efficient than the greedy rollout because they require fewer training instances (reduced by some factor).
A baseline that is as data-efficient as the multiple-sample baselines and computationally lighter than the POMO shared baseline would be effective. The proposed baseline meets these requirements by utilizing instance augmentation, which was first suggested in [4] for effective inference. We generate multiple augmented instances by applying coordinate transformations to a single instance. While each of these instances is distinct, the optimal tour would be identical since these transformations preserve the lengths between nodes. We then roll out sample trajectories for each of these "counterfactuals." The policy model would perceive these as distinct instances, only to arrive at similar solutions as it generates multiple rollouts in parallel. The model inherently learns to find improved solutions for a given instance based on the local batch mean.
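The instance-augmentation baseline can be sketched in two small functions. The eight transformations below are the flips and axis swaps of the unit square commonly associated with the ×8 augmentation of [4]; that this exact set is used during training here is an assumption, and the function names are hypothetical.

```python
import math

def augment_x8(coords):
    """Apply eight distance-preserving symmetries of the unit square to all nodes."""
    transforms = [
        lambda x, y: (x, y),         lambda x, y: (y, x),
        lambda x, y: (x, 1 - y),     lambda x, y: (y, 1 - x),
        lambda x, y: (1 - x, y),     lambda x, y: (1 - y, x),
        lambda x, y: (1 - x, 1 - y), lambda x, y: (1 - y, 1 - x),
    ]
    return [[f(x, y) for x, y in coords] for f in transforms]

def local_advantages(returns):
    """Return of each augmented rollout minus the mean return over the same instance."""
    b = sum(returns) / len(returns)      # local mini-batch baseline
    return [r - b for r in returns]
```

Each augmented copy would be rolled out by the sampling policy, and the resulting advantages would weight the log-probability gradients of the corresponding trajectories. Because the baseline is computed per instance, no second (greedy) forward pass is needed.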
The policy model also learns more effective heuristics because the baseline offers a more focused view on a single instance through diverse perspectives. Fig. 9 is an illustration of how our local baseline works. We believe that the proposed baseline combines the strengths of multiple-sample baselines and the POMO shared baseline. One limitation is that the sampled rollouts are not guaranteed to be non-duplicate. However, this limitation could be mitigated in large-size problems for which longer trajectories are likely to be unique.

Comparison with greedy rollout baseline:
Apart from the additional forward pass of the earlier model version, we empirically found that the greedy rollout baseline entails slightly noisy learning. The current model's (best) performance may not be replicated or generalized to another set. This finding is more apparent towards the later stages of training, especially when the model finds it difficult to surpass its greedy self and there is a noticeable lack of baseline policy updates. At this point, the model does not learn much from the competition with its greedy self.
Comparison with POMO shared baseline:
Compared to the POMO baseline, our approach is more computationally efficient since it uses a fixed local batch size that does not increase with the number of nodes.

Combining with maximum entropy objective
Training the policy model with entropy can smooth out the optimization landscape, speeding up the learning process. In some environments, it yields a better final policy [9]. It also turns out to be robust to internal algorithmic disturbances and external environmental disturbances such as changes in the dynamics and reward function [8]. We note that robustness to external disturbances is an important factor determining the generalization capability (i.e., performance on graphs of various sizes). This work combines maximum entropy RL with our instance-augmentation baseline and shows improved training and inference performance for various problem instances.
We implement the maximum entropy RL as follows. The objective aims to maximize the expected cumulative return augmented by a conditional action entropy weighted by $\alpha$, which results in a slightly different gradient [9] (trajectory view). Although Sultana et al. [34] used the entropy maximization term to train the policy with a greedy rollout baseline, we note that it has not been applied with other baselines.
By integrating the objective function with entropy and using our instance-augmentation baseline, our policy model learns a more stochastic policy that is applicable in a generalized setting. Algorithm 1 presents our proposed REINFORCE algorithm. The Adam optimizer [35] with a constant learning rate of 0.0001 is used to train the policy model parameters.

Problem Setup and Hyperparameters
This section describes the controlled experiments to solve the MSTOP using the DDTM. To observe the benefits of our instance-augmentation baseline (over greedy rollout), we conduct an ablation study on classical TSP and CVRP using the original AM.
Training DDTM to solve MSTOP: We follow the basic problem setup in [3] for the Orienteering Problem (OP), i.e., the coordinates of all customer and depot nodes are randomly sampled within the unit square. The prizes of nodes are either initialized as one (constant) or sampled from a uniform distribution between 0 and 1.
Table 2 describes the experimental details, including the graph size ($n$), the number of vehicles ($N$), and the maximum length constraint for each route ($T_{\max}$). Additionally, each vehicle in MSTOP starts at a random location within the same unit square and is given a variable remaining tour length (or equivalently fuel amount) with the distance between the current vehicle location and the depot as the lower bound. This setting ensures that the sum of the remaining tour length and the partial tour constructed henceforth is bounded above by $T_{\max}$.
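The instance setup described above can be sketched as a small generator. The sketch makes two assumptions that are not stated in the text: the remaining tour length is drawn uniformly, and its upper bound is $T_{\max}$. The function name and the returned data layout are hypothetical.

```python
import math
import random

def sample_instance(n, num_vehicles, t_max, rng=random):
    """Sample one MSTOP-style instance in the unit square.

    Returns (coords, prizes, vehicles); index 0 of coords is the depot.
    Assumption: fuel ~ Uniform[dist(start, depot), max(dist(start, depot), t_max)].
    """
    coords = [(rng.random(), rng.random()) for _ in range(n + 1)]
    prizes = [0.0] + [rng.random() for _ in range(n)]      # uniform prizes, none at the depot
    vehicles = []
    for _ in range(num_vehicles):
        start = (rng.random(), rng.random())               # launch location away from the depot
        to_depot = math.dist(start, coords[0])              # lower bound on remaining tour length
        fuel = rng.uniform(to_depot, max(to_depot, t_max))
        vehicles.append({"start": start, "fuel": fuel})
    return coords, prizes, vehicles
```

The lower bound guarantees that every vehicle can at least return directly to the depot, so the depot is always a feasible action at the first decoding step.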
For all MSTOP cases, the DDTM is initialized with nenc=4 and ndec=2, which we found to be an acceptable trade-off between computational load and the quality of learned policy.
For numerical experiments, we train 1,280,000 instances per epoch.

Training AM to solve TSP/CVRP: We adopt the problem setup prescribed in [3]. For a fair comparison, we use the same hyperparameters for training the AM policy network (except for the application of 'warmup').

Entropy weight:
To realize the benefits of the maximum entropy objective in our methodology, we need to use a suitable value for α. A very large α makes the problem close to a pure entropy maximization problem, whose optimal policy is purely random. Conversely, if α is too small, premature convergence may occur due to inadequate exploration. The α value used for training is 0.01 for both MSTOP and TSP/CVRP. We observed that this value works well on MSTOP20 (uniformly distributed prizes) and TSP50.

Inference Result
This section presents the performance of DDTM on 10,000 random MSTOP instances.
To validate our proposed methodology, we assess the performance of 1) DDTM trained with our proposed baseline and maximum entropy objective, and 2) DDTM trained with greedy rollout baseline and maximum entropy objective. The following section presents a comprehensive ablation study for various REINFORCE training baselines.
We use three decoding strategies. The greedy strategy rolls out a single greedy trajectory for each instance. The sampling strategy generates 1280 trajectories per instance and selects the best one. Finally, the instance augmentation strategy draws multiple greedy trajectories for each instance and selects the best result. To handle the inherent asymmetry in MSTOP solutions, we permute the order of starting vehicles (see Table 3). We then generate a single greedy trajectory for each vehicle order and choose the best out of N! trajectories. To expand the search space further, for each permutation we roll out eight trajectories per problem instance (by solving its augmented instances) and select the best out of 8 × N! trajectories. As illustrated in Fig. 10, this increases the chance of finding near-optimal solutions.

Fig. 10 LEFT: Routing begins with Vehicle A. RIGHT: Routing begins with Vehicle B (optimal tour found)

To the best of our knowledge, no existing algorithm specifically targets the MSTOP.
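The instance augmentation decoding strategy (eight augmented instances for each of the N! vehicle orders) can be sketched as below. Two points are assumptions: the eight coordinate transformations are the reflections and axis swaps of the unit square commonly used for this kind of augmentation, since the contents of Table 1 are not reproduced here, and `greedy_solver` is a hypothetical callable standing in for one greedy rollout of the trained policy.

```python
from itertools import permutations
import numpy as np

def unit_square_symmetries(xy):
    """Return the 8 flipped/swapped copies of coordinates in [0, 1]^2."""
    x, y = xy[..., 0], xy[..., 1]
    variants = [(x, y), (y, x), (x, 1 - y), (y, 1 - x),
                (1 - x, y), (1 - y, x), (1 - x, 1 - y), (1 - y, 1 - x)]
    return [np.stack(v, axis=-1) for v in variants]

def augmented_greedy(nodes, starts, depot, greedy_solver):
    """Best of 8 * N! greedy rollouts (illustrative).

    greedy_solver(nodes, starts, depot, order) -> (score, tour) is a
    hypothetical stand-in for a single greedy decoding pass.
    """
    n_veh = len(starts)
    # Transform depot, vehicle starts and nodes together so that all
    # pairwise distances (and hence route lengths) are preserved.
    stacked = np.vstack([depot[None, :], starts, nodes])
    best_score, best_tour, n_rollouts = -np.inf, None, 0
    for xy in unit_square_symmetries(stacked):
        d, s, c = xy[0], xy[1:1 + n_veh], xy[1 + n_veh:]
        for order in permutations(range(n_veh)):
            score, tour = greedy_solver(c, s, d, order)
            n_rollouts += 1
            if score > best_score:
                best_score, best_tour = score, tour
    return best_score, best_tour, n_rollouts
```

Since N! grows quickly, this exhaustive loop is only practical for a small number of vehicles.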
For n values of 10 and 20, we compare the results with the optimal solutions obtained using the MILP formulation introduced in Section III (implemented with Gurobi [31]). We also implement the heuristic by Tsiligirides for OP introduced in [36] with slight modification and compare the results. The MILP solution is used as the reference to compute the optimality gap.
For larger instances (n=50 and n=70), it takes prohibitively long to solve the MILP to optimality.
Therefore, the best out of the solutions obtained by various methodologies is used as a reference to compute the optimality gap.
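For a prize-maximization problem, the optimality gap relative to a reference value can be computed as in the short sketch below. The function names are illustrative, and averaging per-instance gaps is an assumption; the reported figures may instead be derived from averaged total prizes.

```python
def optimality_gap(prize, reference):
    """Gap in percent of a collected prize relative to a reference prize."""
    if reference <= 0:
        raise ValueError("reference prize must be positive")
    return 100.0 * (reference - prize) / reference

def mean_optimality_gap(prizes, references):
    """Average of per-instance gaps (assumed aggregation)."""
    gaps = [optimality_gap(p, r) for p, r in zip(prizes, references)]
    return sum(gaps) / len(gaps)

# Reference is the MILP optimum for small n, or the best known solution
# among all compared methods for larger n.
gap = optimality_gap(prize=9.5, reference=10.0)
```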
Table 4 and Table 5 summarize the experimental results for comparison. We report the average of total prizes over 10,000 test MSTOP instances. Using the greedy strategy, the DDTM finds near-optimal solutions with optimality gaps of around 4-5%. The optimality gaps for DDTM solutions obtained using the sampling strategy are 1-2%. In almost all strategies, the DDTM outperforms the heuristic by Tsiligirides. The DDTM performs best with the ×8N! instance augmentation strategy, which finds high-quality solutions much faster than the sampling strategy. Fig. 11 presents the quality (optimality gap) of the solutions obtained using the DDTM trained under the proposed methodology for 10,000 test MSTOP20 instances. The optimality gap of the DDTM solutions is 0% in more than 90% of constant-prize instances. Also, in over 90% of instances with uniformly distributed prizes, the optimality gap is smaller than 5%.

AM & Training Baselines (TSP, CVRP):
We believe that the proposed methodology is a general technique that can be used instead of the conventional greedy rollout baseline. To validate this, we perform additional experiments on the vanilla AM network using the original code to solve TSP and CVRP. For a fair comparison, we plot the learning curves on the same validation set (with seed 1234) and report the inference results on the same test set (with seed 4321) used in [3].

In particular, the proposed approach is comparable to the state-of-the-art POMO method in terms of the optimality gap. The best performance for TSP50 obtained by the proposed approach (optimality gap: 0.15%, sampling) is better than that of the POMO inference without augmentation (0.24% [4]). Similarly, in CVRP50 instances, the best result obtained by the proposed method (1.75%; sampling) outperforms the POMO inference with a single trajectory (3.52% [4]). Even on large instances (n=100), the proposed methodology (D) shows improvement over all decoding strategies.

However, optimality gaps tend to increase when tested on different graph sizes. In general, the proposed methodology (D) shows better generalization than the conventional method (B) in terms of reduced optimality gaps for changing graph sizes. Moreover, we observe that models trained under uniformly distributed prizes generalize better than their counterparts trained under constant prizes when tested on environments with different prize distributions. This is not surprising, since uniformly distributed prizes can be seen as a generalized version of constant prizes, and problems with constant prizes are generally considered easier to solve. One exception is the case of DDTM trained under MSTOP10 with uniformly distributed prizes and tested on MSTOP20 with constant prizes, where the model trained using the proposed methodology (D) performs worse than the conventional approach (B). This result might be attributed to using an entropy weight α tuned for MSTOP20 problems with uniformly distributed prizes.

Similar to Fig. 16, the proposed methodology (D) generally performs better for both changing graph sizes and prize distributions, as evidenced by larger test scores. The degree of improvement is more apparent for large-scale problems, demonstrating that the proposed methodology scales well with graph size.

which is likely a result of using an entropy weight α that is tuned for TSP50. From the various tests on different routing problems, it can be observed that our proposed methodology generally results in improved generalization performance compared to the conventional method.

Conclusion
Applying the proposed approach to missions involving cooperation between agents would also be a meaningful extension of this study [37].
Another promising subject for future study is to handle the instance-augmentation inference for problems with many vehicles.We also acknowledge that the current implementation of DDTM architecture is heavy, resulting in a longer training time compared to the original AM.
One possible resolution would be to "compress" the model [38,39] for efficient training and inference.
Funding: This work was prepared at the Korea Advanced Institute of Science and Technology, Department of Aerospace Engineering, under a research grant from the National Research Foundation of Korea (2020R1A2C1005037).

Fig. 5 Encoder structure

the current amount of fuel ($f_k^t$). We set tdec to zero at the start of the decoding for each partial route and increment it by one for each node selection within the inner loop. Since the decoding starts at the vehicle's initial location, we select the current node embedding as

Step 2 :
This step queries the next node to visit among all candidate nodes. The step uses the encoder-decoder attention between the self-attention of a partial route (output of Step 1 current vehicle embedding only (Eq. (

Step 3 :
Step 1 and Step 2 form a single decoding layer. After ndec decoding layers, the resultant output is passed to the final attention layer, where we compute single-head attention to obtain a probability distribution across all candidate nodes. The decoder receives a graph embedding ($h^{(n_{enc})}$) from the encoder, and its linear projection is added to $h_t^{(l_{dec})}$. The query is constructed from the sum. The key is obtained by a linear projection of the context node embedding in Eq. (34) without the current vehicle embedding (

Fig. 8 Step 3 of the decoding procedure. The purple boxes above candidate nodes and depot indicate the selection probability

The REINFORCE baseline is crucial in training the policy network. In this work, we propose to use the average return of sample trajectories generated by instance augmentation from a single instance as the baseline, referred to as the instance-augmentation baseline. Our baseline is a potential alternative to the existing baselines with improved training speed and reduced variance. The proposed baseline is motivated by observations of other baselines in prior works.

denotes the Shannon entropy of π's conditional distribution over actions along the trajectory, $\rho_\pi$ is the state-action marginal of the trajectory distribution induced by π, and α is the entropy weight or temperature. The maximum entropy objective function presented in Eq. (48) results

We consider three problem/policy pairs: MSTOP/DDTM, TSP/AM, and CVRP/AM. The graph sizes (n) of 10, 20, 50, and 70 are set for the MSTOP. For TSP and CVRP, we consider instances with sizes of 50 and 100. Furthermore, to check how our proposed training algorithm improves the generalization performance, we test the performance of each AM on problem instances of various sizes.

Algorithm 1 : Proposed REINFORCE Algorithm (Instance-augmentation baseline with maximum entropy objective)
1 Training: training set S, augmentation factor K, entropy weight α, E epochs, T steps per epoch, B instances per batch

Considering the GPU memory constraints, we train 1250 batches of 1024 instances (n=10, 20) for 200 epochs, 2500 batches of 512 instances (n=50) for 100 epochs, and 3333 batches of 384 instances (n=70) for 100 epochs. The instance-augmentation baseline uses a batch size reduced by a factor of 8, i.e., 128 for n=10 and n=20, 64 for n=50, and 48 for n=70, so that the total number of training instances (counting augmented instances) is the same. These training instances are generated randomly on the fly at every epoch to prevent overfitting. After each epoch, we roll out the current model (with greedy decoding) on a held-out validation set of size 10,000 and plot the learning curve to observe the training process.
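The batch arithmetic above can be checked with a few lines of Python. The dictionary layout and function name are illustrative; the numbers are those quoted in the text, with an augmentation factor of 8 taken from the stated batch-size reduction.

```python
# (number of batches per epoch, instances per batch) for each graph size n
SCHEDULE = {10: (1250, 1024), 20: (1250, 1024), 50: (2500, 512), 70: (3333, 384)}

def summarize_schedule(schedule, k=8):
    """Per-epoch totals and the reduced batch size used with K=k augmentation."""
    rows = {}
    for n, (num_batches, batch_size) in schedule.items():
        reduced = batch_size // k  # distinct instances per batch with augmentation
        rows[n] = {
            "instances_per_epoch": num_batches * batch_size,
            "reduced_batch": reduced,
            "rollouts_per_reduced_batch": reduced * k,
        }
    return rows

for n, row in summarize_schedule(SCHEDULE).items():
    print(n, row)
```

For n=10, 20, and 50 the per-epoch total is exactly 1,280,000; for n=70 the product 3333 × 384 gives 1,279,872, slightly below that figure. In every case the reduced batch multiplied by 8 recovers the original batch size.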

Fig. 14 Learning curves for MSTOP20 with uniformly distributed prizes. Dark curves are smoothed results, lighter curves are raw results

Fig. 15 Learning curves on TSP50 using the vanilla AM. Dark curves are smoothed results, lighter curves are raw results

Fig. 15 shows the learning curves for the original AM with different baselines for

Fig. 16 Generalization performance of DDTMs trained and tested between MSTOP10 and MSTOP20 environments. Models trained under (a) Constant prizes and (b) Uniformly distributed prizes. Optimality gaps reported as the performance measure.

Fig. 16 illustrates the generalization performance of DDTMs trained on MSTOP10 and

Fig. 17 presents the generalization result for DDTM trained on MSTOP50 and

Fig. 17 Generalization performance of DDTMs trained and tested between MSTOP50 and MSTOP70 environments. Models trained under (a) Constant prizes and (b) Uniformly distributed prizes. Test scores reported as the performance measure.

The Multi-Start Team Orienteering Problem (MSTOP) is introduced to address the routing problems arising in dynamic environments. An attention-based policy network model referred to as the Deep Dynamic Transformer Model (DDTM) is proposed to solve the MSTOP. The proposed learning procedure modifies the REINFORCE algorithm by introducing a new baseline with instance augmentation and combining it with the maximum entropy objective, improving its learning efficiency and inference capability. A set of numerical experiments comparing the performance of the proposed procedure with existing methodologies demonstrates its effectiveness. For a suitable value of entropy weight, the instance-augmentation baseline outperforms the conventional greedy rollout baseline in terms of inference performance, generalization performance, and training speed. The test results indicate that the proposed approach performs comparably to the current state-of-the-art POMO baseline while requiring fewer computational resources. The procedure is further applied to classical TSP and CVRP, showing the potential to be a general technique for solving various routing problems. It would be interesting to apply the proposed methodology to other asymmetric CO problems, such as the Multi-Depot VRP and Multi-Depot MSTOP, where the order of vehicles breaks the symmetry in solution representations.

[Figure: flowchart of the encoder-decoder routing procedure. Encoder: generate the problem graph embedding. Decoder: start a partial tour, select the next node, and update the state until the partial tour is complete; then update the graph embedding and repeat until all tours are constructed. Example panels show the depot and Vehicles A, B, and C.]

a partial route in an MSTOP modifies the graph instance. Not only does the next vehicle face a different set of nodes (i.e., without visited nodes), but it also starts at a different location.

Complete MSTOP solution obtained by combining individual routes; each route is constructed by a single vehicle. Opaque nodes indicate either (i) visited nodes or (ii) vehicles

The value of C in Eq. (43) is selected as 10. Consequently, the next node

Table 1
Unit square transformations

Comparison with multiple samples with/without replacements:
Our baseline does not strictly generate non-duplicate samples. However, it is unlikely to generate many duplicate samples, especially in the early stages of training. Indeed, as training proceeds, the policy network may generate duplicate samples since it learns which actions produce high-return trajectories in a more general setting.

Table 2
MSTOP problem instances of various sizes n N

Table 3
Permutations of vehicle order.Bold denotes the first vehicle to start routing.The DDTM sequentially begins routing according to the given vehicle order

Table 4
Experimental results on MSTOP (constant prizes; bold: best result)

Table 5
Experimental results on MSTOP (uniformly distributed prizes; bold: best result)

Table 6
Comparison of training time for different training strategies (per epoch, in min:sec)

Table 7 summarizes the inference test results on TSP and CVRP. Our proposed methodology (D) outperforms the other training methods across all decoding strategies in all cases.

Table 7
Test results of vanilla AM trained with different methods [3]. Values reported in [3].

Generalization Result
This section discusses the generalization capability of our training methodology. Kool et al. [3] demonstrated that the AM and greedy rollout baseline can be generalized to problems with different graph sizes, although the error increases as the graph size increases. Since training with the maximum entropy objective is known to improve the model's robustness, we conduct a comparative study on generalization performance between greedy rollout with maximum entropy objective (B) and our proposed methodology (D) to see how our proposed methodology reduces generalization error. Note that the generalization results are reported according to the instance-augmentation decoding strategy on the same test datasets as in the previous sections.