1 Introduction

As the operational technology of Unmanned Aerial Systems (UAS) matures, there is a growing need for fast and accurate high-level decision-making for autonomous mission planning. The ability to adjust to evolving mission objectives is essential for addressing the dynamic nature of real-world scenarios, enhancing safety, optimizing resources, and ensuring mission success across various industrial, civil, and defense sectors. For example, e-commerce drones for last-mile delivery should be able to optimize routes to ensure timely service. Surveillance drones monitoring ongoing traffic require adaptive responses for optimal data collection. Other potential applications include forest fire detection in emergency response, geographical monitoring for scientific research (where objectives may change based on initial findings), surveying and mapping for urban planning, airborne reconnaissance for border control, and search and rescue operations in disaster-stricken areas where UAS assist in locating and aiding survivors [1, 2].

Prior UAS mission studies addressed variants of the vehicle routing problem formulated as NP-hard combinatorial optimization (CO) problems, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). These classical CO problems are primarily concerned with mission preplanning based on the current knowledge of the environment. However, missions in real life involve many unknown and possibly changing factors such as sudden gusts, GPS denial, unexpected threats, terrain uncertainties, fuel leakage, and hardware malfunction. Once the vehicles have left the base, it is critical to respond to unexpected environmental changes by managing mission objectives autonomously, thus prompting the need for near-optimal mission re-planning in real time. Furthermore, visiting all nodes may not be practical considering resource availability. Instead, such applications may require vehicles to visit as many nodes as possible within a maximum duration given on each route. These characteristics of real-life applications give rise to the Multi-Start Team Orienteering Problem (MSTOP), a generalization of the Team Orienteering Problem (TOP) with additional degrees of freedom in the launch location and available fuel of each vehicle. Many routing problems assume that all vehicles begin routing from the depot. In contrast, the MSTOP models the real-life mission re-planning scenario by launching vehicles located away from the depot, each with a different amount of fuel available.

The MSTOP is formulated in the context of route planning for intelligent UAS and robotic agent systems. Given the nature of higher-level decision making, more efficient route plans for optimal assignments among agents are desirable. For example, a fleet of UAVs suppressing forest fires needs an optimal order of visiting sites to make the most of its limited volume of extinguishing water. The fleet may also need to update its assigned spots frequently as wildfires spread unpredictably, which calls for re-planning the routes. Another application is the efficient operation of unmanned delivery drones. If a delivery drone visits a number of sites to deliver multiple parcels, the order of visits can be optimized so that operational revenue is maximized. Moreover, a scheduled delivery site can be modified at a customer's request, and drones already en route then require a new mission plan. In this manner, the MSTOP belongs to a general higher-level planning framework for a wide range of applications in UAS and robotic systems.

Various traditional approaches have been applied to solve CO problems. For example, exact algorithms are generally based on branch-and-bound or branch-and-cut approaches to obtain optimal solutions. However, finding an optimal solution may take an inordinate amount of time as the problem size grows. Approximate algorithms rapidly produce near-optimal solutions but are often tailored to specific CO problems. Heuristic approaches utilize domain expertise to design hand-crafted strategies for progressively constructing a solution. These approaches may not be straightforwardly applicable to other routing problems.

The deep reinforcement learning (RL) approach has recently emerged as a fast and powerful heuristic solver for many CO problems. This paper aims to develop a deep RL-based construction framework for solving the MSTOP. We propose a data-efficient training methodology that improves solution quality and learning speed. To demonstrate the effectiveness of our training methodology, we experiment on two classical CO problems: TSP and CVRP. These experiments confirm that our training methodology outperforms the conventional methodology in [3] and is comparable to the state-of-the-art policy optimization with multiple optima for reinforcement learning (POMO) [4] while using significantly less data. In addition, we identify the asymmetry in the solution representation of the MSTOP and use it to further improve performance during inference. With this advanced inference strategy, our model can generate high-quality solutions in a notably short time, bringing us a step closer to real-time mission re-planning.

In summary, our primary contributions are threefold. First, we explore the MSTOP, a routing problem that reflects a real-life mission re-planning scenario, using a data-driven method (deep reinforcement learning). Specifically, we follow the Transformer’s encoder-decoder architecture [5]. We use a standard encoder with a multi-head attention mechanism. For the decoder, we adapt the decoding strategy in [6], the current state-of-the-art deep RL solver for a single vehicle TSP, and generalize the strategy to handle multiple vehicle launch locations. Our overall approach adopts the nested inner/outer loop framework similar to [7] that updates the current state after each vehicle returns to the depot to reflect the changes after a partial tour is complete. We name our neural policy network the Deep Dynamic Transformer Model (DDTM).

Second, we propose a data-efficient training approach based on a baseline derived from multiple instances generated by applying linear coordinate transformations to a single instance. These augmented instances are distinct in their raw form since each node in the 2D cartesian plane has been transformed. But, as a graph, these are identical because the lengths between the nodes are preserved. We replace the greedy rollout baseline with a local, mini-batch mean (obtained by rolling out all augmented instances) and combine it with the maximum entropy RL method [8, 9]. Our proposed methodology outperforms the computationally expensive greedy rollout baseline [3] and significantly expedites the learning process.

Finally, we improve the efficiency of the inference phase by using instance augmentation tailored for the MSTOP. Unlike TSP and CVRP, solutions to the MSTOP are inherently asymmetric since the order of vehicles breaks the symmetry in the solution representations (see Fig. 1). We exploit this asymmetry by permuting all vehicle orders and generating multiple rollouts for each permutation at the inference stage. This method is more efficient than the conventional sampling and instance-augmentation inference (which uses a single vehicle order).

Fig. 1

Multiple representations for an optimal solution exist in TSP and CVRP. However, for MSTOP, the order of vehicles breaks the symmetry in solution representation

The remainder of this paper is organized as follows. Section II briefly introduces past studies related to our work (e.g., deep RL approaches for classical CO problems). Section III formulates the MSTOP as a Mixed Integer Linear Programming (MILP) problem and a Markov Decision Process (MDP). Section IV describes our DDTM policy network in detail. Section V describes our proposed REINFORCE baseline and presents inference results on various routing problems. In Section VI, to corroborate the effectiveness of our method, we report an ablation study among several training baselines and present generalization results. Finally, Section VII concludes the paper and discusses future research directions.

2 Literature review

The Team Orienteering Problem (TOP) belongs to the broader Vehicle Routing Problem with Profits (VRPP) class. A fleet of vehicles is given, but the vehicles are not required to visit all the nodes or customers. Each node is associated with a prize (profit), denoting its relative attractiveness. The objective is to find a subset of nodes that maximizes the total collected profit while satisfying a limit on the maximum duration of each route [10,11,12]. Exact algorithms to solve the TOP include approaches based on column generation and constraint branching [13] and the branch-and-price algorithm [14]. Taking the TOP as a basis, we devise the MSTOP by extending it with two additional degrees of freedom: the launch location and the remaining fuel of each vehicle. The MSTOP stands in contrast to traditional CO problems in that the launch locations of the vehicles are distinct. Therefore, the problem state seen by each vehicle is naturally different at each construction step [15]. It is also important to note that the meaning of multi-start in the MSTOP differs from that in a number of existing works sharing the same term. For example, Lin et al. [16, 17] use the term multi-start to refer to a variant of the simulated annealing approach they used to solve the TOP, while Hapsari et al. [18] deal with the multi-objective TOP.

One of the early attempts to apply the deep RL approach to CO in a constructive manner is the study by Bello et al. [19]. They used the pointer network (PtrNet) architecture [20] to encode input sequences and construct the node sequence in the decoder. Their model was tested on the TSP and the 0–1 knapsack problem (KP) and yielded close-to-optimal results. The PtrNet model was further improved by Khalil et al. [21] and Nazari et al. [22]. Deudon et al. [23] used the pointer network with an attention encoder. Inspired by the Transformer model for machine translation [5], Kool et al. [3] proposed the attention model (AM) based on the Transformer architecture to solve various CO problems such as the TSP, VRP, and Orienteering Problem (OP). Cappart et al. [24] combined RL and constraint programming (CP) to solve the TSP with Time Windows (TSPTW) by learning branching strategies. Additionally, Bono et al. [15] proposed a modified Transformer model to handle dynamic and stochastic VRPs (DS-VRPs) by using online measurements of the environment to select the next vehicle online via a vehicle-customer intersection module. More recently, Li et al. [25] improved the AM to solve the Heterogeneous Capacitated VRP (HCVRP). Li et al. [26] proposed the attention-dynamic model to solve the covering salesman problem (CSP). Xu et al. [27] designed an attention model with multiple relational attention mechanisms that better capture the transition dynamics. Pan and Liu [28] designed a graph-based partially observable MDP (POMDP) that captures changes in customer demands to solve a dynamic and uncertain VRP using a deep neural network model with a dynamic attention mechanism. Besides attention-based models, Wang [29] proposed a variational autoencoder-based reinforcement learning methodology using a graph reasoning network for classic vehicle routing problems. In terms of performance, Kwon et al. [4] introduced the POMO method, which has demonstrated state-of-the-art results on TSP, CVRP, and KP. During training, the POMO decoder generates multiple heterogeneous trajectories that start at every node to maximize entropy on the first action.

The majority of past studies used policy gradient approaches, which have advantages over supervised learning (SL) [30]. Bello et al. [19] used an actor-critic algorithm to train their model. However, Kool et al. [3] showed that a greedy rollout baseline yields better results than a (learned) critic baseline. Many subsequent works, including [6, 25,26,27], and [7], used the greedy rollout baseline. Although the greedy rollout baseline is effective, it requires an additional forward pass of the model, increasing the computational load on the device. To leverage more data parallelism for efficient learning of training instances, Kool et al. [31, 32] proposed a local baseline equal to the average return over k trajectories sampled without replacement from a single instance using Stochastic Beam Search. They reported that this baseline performed on par with or slightly better than the computationally expensive greedy rollout and significantly better than the batch baseline. The benefit of sampling without replacement is that the gradient estimators lose little final performance while learning from substantially fewer instances (the number of training instances is reduced by a factor of k).

In addition, Kwon et al. [4] used a shared baseline based on all POMO samples, taking the average tour length over n sample trajectories from a single instance, where n is the number of nodes. Like the multiple-sample baselines in [31], the POMO shared baseline is local, concentrating on a single instance. As reported in [4], their baseline is very effective since it generates n (typically larger than k in [31]) non-duplicative sample trajectories for a single instance. However, POMO requires an additional tensor dimension, and as the graph size n increases, the tensor size increases n-fold. Consequently, while the training time of POMO is comparable to that of REINFORCE with greedy rollout (owing to the parallel generation of trajectories), it requires more GPU memory. Moreover, POMO training may not be readily applicable to problems such as the MSTOP, where we cannot simply use all the nodes as starting points for exploration.

Many strategies for efficient inference were also proposed in prior studies. Bello et al. [19] proposed the "one-shot" greedy inference and sampling strategies. Deudon et al. [23] improved their solution quality by refining it with the 2-Opt heuristic [33]. Kwon et al. [4] suggested ×8 instance augmentation to generate multiple trajectories and select the best solution.

3 Problem definition

3.1 Mathematical formulation of MSTOP

This section presents the MILP formulation of MSTOP. In particular, this formulation is defined on a graph following [10]. A complete graph G consists of the set of all nodes (N) and the set of arcs (A). We summarize the key notations used in the mathematical formulation of MSTOP in Table 1. Since each vehicle is associated with a unique starting location, we drop the subscript k in the notation \({v}_{k}\) for simplicity whenever its inclusion is implied.

Table 1 Notation table for MSTOP

In the MSTOP, multiple vehicles begin at locations different from the depot. Each vehicle has an available amount of fuel at the start. Given the vehicle set, the MSTOP determines K routes that maximize the total profits collected over the partial routes while satisfying a maximum duration constraint on each route.

In the MILP formulation below, \({x}_{ijk}\) denotes a binary variable, which equals one if arc \(\left(i,j\right)\) in A is traversed by vehicle k (in K), and zero otherwise. Also, the binary variable \({y}_{ik}\) equals one if node i (in X) is visited by vehicle k (in K) and zero otherwise. \({t}_{ij}\) is measured as the Euclidean distance between the two nodes, and the subscript v denotes a vehicle's launching node. The MILP formulation for the MSTOP is as follows:

(MILP Formulation for MSTOP)

$${\text{max}}\sum_{i\in X\backslash \left\{0\right\}}{p}_{i}\sum_{k=1}^{K}{y}_{ik},$$
(1)

subject to

$$\begin{array}{cc}\sum\limits_{i\in X\backslash \left\{0\right\}}{x}_{i0k}+{x}_{v0k}=1,& k=1,...,K,\end{array}$$
(2)
$$\begin{array}{cc}\sum\limits_{j\in X,\,j<i}{x}_{ijk}+\sum\limits_{j\in X,\,i<j}{x}_{jik}+{x}_{vik}=2{y}_{ik},& \forall i\in X\backslash \left\{0\right\},\ k=1,...,K,\end{array}$$
(3)
$$\begin{array}{cc}\sum\limits_{j\in X}{x}_{vjk}={y}_{vk},& k=1,...,K,\end{array}$$
(4)
$$\sum\limits_{k=1}^{K}{y}_{vk}=K,$$
(5)
$$\sum\limits_{k=1}^{K}{y}_{0k}=K,$$
(6)
$$\begin{array}{cc}\sum\limits_{k=1}^{K}{y}_{ik}\le 1,& \forall i\in X\backslash \left\{0\right\},\end{array}$$
(7)
$$\begin{array}{cc}\sum\limits_{\left(i,j\right)\in A,\,j<i}{t}_{ij}{x}_{ijk}+{f}_{k}\le {T}_{\text{max}},& k=1,...,K,\end{array}$$
(8)
$$\begin{array}{cc}{y}_{ik}\in \left\{0,1\right\},& \forall i\in X\cup \left\{v\right\},\ k=1,...,K,\end{array}$$
(9)
$$\begin{array}{ccc}{x}_{ijk}\in \left\{0,1\right\},& \forall \left(i,j\right)\in A,\ j<i,\ i\in X\backslash \left\{0\right\}\cup \left\{v\right\},& k=1,...,K.\end{array}$$
(10)

Equation (1) expresses the objective of the problem, which is maximizing the total profit collected along the routes. Equations (2)-(10) present the constraints of the problem. Equation (2) ensures that all routes end at the depot. Equation (3) guarantees that, for every visited node, exactly one arc enters the node and one arc leaves it. Equations (4)-(5) ensure that each route begins at the initial vehicle location. Equation (6) constrains the number of total routes (K). Equation (7) imposes the constraint that each node is visited at most once. Equation (8) limits the maximum duration or length of each route. Lastly, Eqs. (9)-(10) define the decision variables.

Note that the local constraints of the formulation do not guarantee that all nodes in a route are properly connected without subtours. To generate a feasible set of routes, we add the subtour elimination constraints. However, given the nature of routing problems, adding such constraints before the optimization can significantly increase the model size for large-scale problems. As a result, we add the subtour elimination constraints in a lazy fashion [34]. This way, we can remove solutions with subtours during the optimization.
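For concreteness, the sketch below shows how such lazy subtour elimination can be wired into a gurobipy callback. It is a minimal illustration rather than our exact model: the arc variables are simplified to a single dictionary `x`, and `find_subtour` is a hypothetical helper assumed to return a set of nodes forming a closed subtour (or None if the incumbent is subtour-free).

```python
import gurobipy as gp
from gurobipy import GRB

def subtour_elim(model, where):
    # Invoked by Gurobi at every new incumbent; reject incumbents containing subtours.
    if where == GRB.Callback.MIPSOL:
        vals = model.cbGetSolution(model._x)                         # arc-variable values
        selected = [(i, j) for (i, j) in model._x.keys() if vals[i, j] > 0.5]
        cycle = find_subtour(selected, model._nodes)                 # hypothetical helper
        if cycle is not None:
            # Lazily forbid this subtour: a node set S may use at most |S|-1 internal arcs.
            model.cbLazy(
                gp.quicksum(model._x[i, j]
                            for i in cycle for j in cycle if (i, j) in model._x)
                <= len(cycle) - 1)

# Usage sketch (model, nodes, and x are built elsewhere from the MILP of Section 3.1):
# model.Params.LazyConstraints = 1
# model._x, model._nodes = x, nodes
# model.optimize(subtour_elim)
```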

3.2 MDP formulation of MSTOP

This section introduces the MDP formulation of the MSTOP. To apply reinforcement learning to MSTOP, we model the problem as a sequential decision-making process, where an agent performs a sequence of actions (i.e., decides which node to visit) through interactions with the surrounding environment (i.e., observing changes in the state) to maximize the cumulative reward.

In our MDP setting, a vehicle is first assigned at random. The agent selects nodes to visit starting from the initial position of the assigned vehicle. Once a partial route is constructed, the agent chooses the next vehicle, which starts at a different location. The complete solution is constructed by concatenating the individual partial routes. We model the MSTOP as an MDP defined by a 4-tuple < S, A, P, R > , where S denotes the state space, A the action space, P the state transition model, and R the reward model.

State space (S)

Each state at time step t is defined as a tuple \({s}_{t}=\langle {X}_{t},{V}_{t}\rangle\). The first component of the tuple, \({X}_{t}\), denotes the set of all nodes (= {\({x}_{i}^{t}\)}), and the second component, \({V}_{t}\), expresses the states of all vehicles (= {\({v}_{k}^{t}\)}). Here, \({x}_{i}^{t}=\left({r}_{i},{p}_{i}^{t}\right)\) contains the information of a node, where \({r}_{i}=\left({x}_{i},{y}_{i}\right)\) is the location and \({p}_{i}^{t}\) is the prize assigned to the node. Also, \({v}_{k}^{t}=\left({\rho }_{k}^{t},{f}_{k}^{t},{O}_{k}^{t}\right)\) denotes the vehicle information, where \({\rho }_{k}^{t}=\left({x}_{k},{y}_{k}\right)\) represents the vehicle location, \({f}_{k}^{t}\) is the vehicle's available/remaining fuel amount, and \({O}_{k}^{t}\) is the total prize collected until step t. We denote by T the terminal time at which all vehicles have arrived at the depot.

Action space (A)

The permissible set of actions in our MDP is the choice of the next node to visit, considering the vehicle's current partial route and its remaining fuel. We denote each action at time step t (\({a}_{t}\in A\)) as \({x}_{j}^{t}\) and view the action as the addition of a node to the partial route. The construction of a partial route satisfies the maximum travel duration constraint of each vehicle through an action masking policy, i.e., masking the nodes that cannot be visited.
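As an illustration of this masking rule, the sketch below marks a node as infeasible when the remaining fuel cannot cover the leg to that node plus the return leg to the depot; the exact feasibility test used in the DDTM operates on batched tensors, and all names here are illustrative.

```python
import numpy as np

def feasibility_mask(node_xy, depot_xy, veh_xy, fuel, visited):
    """Boolean mask over nodes: True means the node may be visited next.

    Assumption: a node is feasible only if the vehicle can reach it and still
    return to the depot within its remaining fuel (tour-length budget).
    """
    to_node = np.linalg.norm(node_xy - veh_xy, axis=-1)     # current position -> node
    to_depot = np.linalg.norm(node_xy - depot_xy, axis=-1)  # node -> depot
    return ((to_node + to_depot) <= fuel) & ~visited
```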

State transition model (P: S × A → S)

The state transition model describes how the current state (st) transitions to the next state (st+1) when an action (at) is taken. We adopt deterministic transition dynamics, i.e., a vehicle moves to the chosen node with the probability of 1. Given the current vehicle k and chosen action \({a}_{t}=\left({x}_{j}^{t}\right)\) (i.e., the vehicle visits node j), we update the elements of \(\left\{{x}_{i}^{t}\right\}\) and \(\left\{{v}_{k}^{t}\right\}\) at step t as follows.

$${p}_{i}^{t+1}=\left\{\begin{array}{cc}0& i=j\\ {p}_{i}^{t}& i\ne j\end{array}\right.,$$
(11)
$${\rho }_{k}^{t+1}={r}_{j},$$
(12)
$${f}_{k}^{t+1}={f}_{k}^{t}-{t}_{ij},$$
(13)
$${O}_{k}^{t+1}={O}_{k}^{t}+{p}_{j}^{t}.$$
(14)

Equation (11) sets the prize associated with node j as 0 when visited, and Eq. (12) updates the current location of vehicle k. Equation (13) updates the available amount of fuel by subtracting tij (distance between nodes i and j) from it. Equation (14) updates the total prize by adding the prize value obtained at node j (pj).

Reward model (R: S × A \(\to {\mathbb{R}}\))

We model the cumulative reward as the sum of total prizes collected from all partial routes. To be specific, the reward is defined as \(\mathcal{R}={\sum }_{k=1}^{K}{O}_{k}^{T}\). Termination time T, determined by the number of actions executed until the completion of all partial routes, defines the trajectory length.
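The deterministic transition of Eqs. (11)-(14) amounts to the small step function sketched below; the dictionary-based state is an illustrative simplification of the batched tensor representation used in practice.

```python
import numpy as np

def step(state, k, j):
    """Vehicle k visits node j, following Eqs. (11)-(14)."""
    dist = np.linalg.norm(state["veh_pos"][k] - state["node_pos"][j])
    state["score"][k] += state["prize"][j]              # Eq. (14): collect the prize of node j
    state["prize"][j] = 0.0                             # Eq. (11): a visited node yields no further prize
    state["veh_pos"][k] = state["node_pos"][j].copy()   # Eq. (12): move the vehicle to node j
    state["fuel"][k] -= dist                            # Eq. (13): spend fuel equal to the leg length
    return state
```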

4 Proposed model and solution procedure

4.1 Proposed framework

Figure 2 illustrates the framework proposed to solve the MSTOP, which consists of inner and outer loops. The inner loop begins at the vehicle's initial location and generates a partial route that terminates at the depot. Each partial route is a permutation of node indices ending with 0, as shown in Fig. 3. When the inner loop is finished, the outer loop updates the graph instance.

Fig. 2

Diagram explaining the proposed framework

Fig. 3

Complete MSTOP solution obtained by combining individual routes – each route is constructed by a single vehicle. Opaque nodes indicate either (i) visited nodes (triangular) or (ii) vehicles that have arrived at the depot

This procedure contrasts with the models in [3], where the encoder is executed only once at the beginning (t = 0). In classical CO problems, when a vehicle returns to the depot, the graph instance changes only slightly because the next vehicle starts at the same depot. However, constructing a partial route in an MSTOP modifies the graph instance: not only does the next vehicle face a different set of nodes (i.e., without the visited nodes), but it also starts at a different location.

The rationale behind this sequential construction framework, which addresses one vehicle at a time, is grounded in empirical observations that simultaneous consideration of the next node for each vehicle can impede training convergence due to additional freedom in decision-making. During early training epochs, this additional complexity presents challenges for the model to “learn” to generate routes.

In the solution procedure, the encoder plays a pivotal role in transforming the raw features of the graph instance, encompassing mission node and vehicle data, into a hidden representation known as node-vehicle and graph embeddings. These embeddings, as computed in Eqs. (23) and (24), capture essential information about the spatial relationships and characteristics of the nodes and vehicles within the graph. The major interplay between the encoder and decoder occurs when the output of the encoder, comprising the node-vehicle embeddings and graph embedding, is sent as input to the decoder. Subsequently, the decoder leverages this information to extract relevant features, generating a probability distribution over non-visited (candidate) nodes that guides the selection of the next node in the route. This iterative process continues until the depot is chosen (i.e., completing an individual vehicle route). Following each partial route, the graph is updated before advancing to the next vehicle. Table 2 outlines key terminologies used in this section that describe the structure of DDTM.

Table 2 Summary of key terminologies for DDTM

4.2 Encoder-decoder architecture of DDTM

Figure 4 presents the encoder-decoder architecture of DDTM used for MSTOP. Figure 5 illustrates the encoder structure (for a single encoding layer). The encoder embeds the MSTOP features using separate parameters for the additional vehicle features – vehicle location and available fuel. We denote the embedded feature data as h(l), where l is the encoder layer. The embedded data as a whole represents the graph instance, and each element in h(l) is a mapping corresponding to each feature.

Fig. 4

Encoder-Decoder architecture of DDTM

Fig. 5

Encoder structure

A good feature mapping needs to consider the feature’s context within the graph. For example, the node representation should contain sufficient information to be selected among its neighbors and to determine its position in the output sequence. To understand how one feature is related to another from a broader perspective, we apply multi-head self-attention, which generates enhanced feature embeddings. The self-attention mechanism enables the encoder to effectively weigh and consider the significance of different features of the input graph. The encoding steps are formally expressed as follows.

$${h}_{0}^{\left(l\right)}=\left[{x}_{0},{y}_{0}\right]{W}_{0}^{init},$$
(15)
$$\begin{array}{cc}{h}_{i}^{\left(l\right)}=\left[{x}_{i},{y}_{i},{p}_{i}\right]{W}_{node,i}^{init}& \text{for }i\in \left\{1,...,N\right\},\end{array}$$
(16)
$$\begin{array}{cc}{\widehat{h}}_{k}^{\left(l\right)}=\left[{\widehat{x}}_{k},{\widehat{y}}_{k},{f}_{k}\right]{W}_{veh,k}^{init}& \text{for }k\in \left\{1,...,K\right\},\end{array}$$
(17)
$${h}^{\left(l\right)}=\left[{h}_{0}^{\left(l\right)},{h}_{1}^{\left(l\right)},...,{h}_{N}^{\left(l\right)},{\widehat{h}}_{1}^{\left(l\right)},...,{\widehat{h}}_{K}^{\left(l\right)}\right],$$
(18)
$$\begin{array}{ccc}{Q}_{l}={h}^{\left(l\right)}{W}_{l}^{Q},& {\text{K}}_{l}={h}^{\left(l\right)}{W}_{l}^{K},& {\text{V}}_{l}={h}^{\left(l\right)}{W}_{l}^{V},\end{array}$$
(19)
$${Z}_{l}^{h}=attention\left({Q}_{l},{K}_{l},{V}_{l}\right)={\text{Softmax}}\left(\frac{{Q}_{l}{K}_{l}^{T}}{\sqrt{{d}_{k}}}\right){V}_{l},$$
(20)

where \({d}_{k}=d/H\), with d (= 128) being the embedding dimension and H (= 8) the number of heads. To compute the multi-head attention, we concatenate the attention outputs of each head (\({Z}_{l}^{h}\)) as

$${\text{MHA}}\left({h}^{\left(l\right)}\right)=\left[{Z}_{l}^{1},{Z}_{l}^{2},...,{Z}_{l}^{H}\right]{W}_{l}^{out}.$$
(21)

The next embedded feature, h(l+1), is obtained by passing h(l) through a feed-forward layer with batch normalization, residual connection, and ReLU activation as follows,

$${\widetilde{h}}^{\left(l\right)}=BN\left({h}^{\left(l\right)}+MHA\left({h}^{\left(l\right)}\right)\right),$$
(22)
$${h}^{\left(l+1\right)}=FF\left({\widetilde{h}}^{\left(l\right)}\right)=BN\left({W}_{1}^{ff}{\text{ReLU}}\left({W}_{0}^{ff}{\widetilde{h}}^{\left(l\right)}\right)+{\widetilde{h}}^{\left(l\right)}\right),$$
(23)

where \({W}_{0}^{ff}\in {\mathbb{R}}^{d\times {d}_{h}}\) and \({W}_{1}^{ff}\in {\mathbb{R}}^{{d}_{h}\times d}\) are trainable parameters with dh (= 512). After nenc encoding layers, the final output of the encoder is the node-vehicle embedding (\({h}^{\left({n}_{enc}\right)}\)) and the graph embedding (\({\overline{h} }^{\left({n}_{enc}\right)}\)) defined as

$${\overline{h} }^{\left({n}_{enc}\right)}=\left\{\begin{array}{cc}\frac{1}{N+K+1}\left(\sum\limits_{i=0}^{N+1}{{h}_{i}}^{\left({n}_{enc}\right)}+\sum\limits_{k=1}^{K}{{\widehat{h}}_{k}}^{\left({n}_{enc}\right)}\right)& \text{if }\;t=0\\ \frac{1}{{N}^{\prime}+{K}^{\prime}+1}\left(\sum\limits_{i=0}^{N+1}{{h}_{i}}^{\left({n}_{enc}\right)}+\sum\limits_{k=1}^{K}{{\widehat{h}}_{k}}^{\left({n}_{enc}\right)}\right)& \text{if }\;t>0\end{array}\right.$$
(24)

where N’ (= N − Nvisited) is the remaining number of nodes and K’ is the remaining number of vehicles. After a partial route is constructed (t > 0), the graph instance seen by the next vehicle differs from that seen by the previous ones. We update the graph instance by computing \({h}^{\left({n}_{enc}\right)}\) and \({\overline{h} }^{\left({n}_{enc}\right)}\) using Eqs. (15)–(24), and mask the visited nodes using the outer product as,

$${\mathcal{M}}_{att}=\mathcal{M}\otimes {1}^{T}+1\otimes {\mathcal{M}}^{T}-\mathcal{M}\otimes {\mathcal{M}}^{T}\in {\mathbb{R}}^{\left(N+K\right)\times \left(N+K\right)},$$
(25)
$${Z}_{l}=attention\left({Q}_{l},{K}_{l},{V}_{l}\right)=Soft{\text{max}}\left(\frac{{Q}_{l}{K}_{l}^{T}}{\sqrt{{d}_{k}}}\odot {\mathcal{M}}_{att}\right){V}_{l},$$
(26)

where \(\mathcal{M}\in {\mathbb{R}}^{\left(N+K\right)\times 1}\) is a column mask vector that masks visited nodes and vehicles at the depot, \(1\in {\mathbb{R}}^{\left(N+K\right)\times 1}\) is a column vector of ones, and \(\odot\) is the Hadamard product for matrices.
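For readers who prefer code, one encoder layer of Eqs. (19)-(23) can be sketched in PyTorch as below, with d = 128, H = 8, and dh = 512 as in the text. This is an illustrative sketch rather than our released implementation; in particular, the Hadamard-product masking of Eq. (26) is replaced here by PyTorch's standard key-padding mask.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Masked multi-head self-attention followed by BN + residual and a feed-forward block."""

    def __init__(self, d=128, n_heads=8, d_ff=512):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.bn1 = nn.BatchNorm1d(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.bn2 = nn.BatchNorm1d(d)

    def forward(self, h, key_padding_mask=None):
        # h: (batch, N+K+1, d); key_padding_mask: True where a node/vehicle is masked out.
        attn_out, _ = self.mha(h, h, h, key_padding_mask=key_padding_mask)
        h = self.bn1((h + attn_out).transpose(1, 2)).transpose(1, 2)    # Eq. (22)
        h = self.bn2((h + self.ff(h)).transpose(1, 2)).transpose(1, 2)  # Eq. (23)
        return h
```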

Given the node-vehicle and graph embeddings by the encoder, the decoder produces probability distributions (\({p}_{t}^{dec}\)) for all candidate nodes and selects the next node. Candidate nodes are those not visited by any vehicle at the start of decoding. Our decoding strategy consists of three steps based on [6] as follows:

  • Step 1: We begin by computing the multi-head self-attention between the current node and the nodes in the current partial route. By examining the history of visited nodes for the current node, we obtain the contextual information up to the current decoding time, tdec. We first extract the current node embedding (\({\widetilde{h}}_{{t}_{dec}}\)) from the node-vehicle embeddings (\({h}^{\left({n}_{enc}\right)}\)), then concatenate it with the current amount of fuel (\({f}_{k}^{t}\)). We set tdec as zero at the start of the decoding for each partial route and increment it by one per each node selection within the inner loop. Since the decoding starts at the vehicle’s initial location, we select the current node embedding as \({\widetilde{h}}_{0}={\widehat{h}}_{k}^{\left({n}_{enc}\right)}\) and update it as \({\widetilde{h}}_{{t}_{dec}}={h}_{a}^{\left({n}_{enc}\right)}\), where a (\(:={a}_{{t}_{dec}-1}\in \left\{1,...,N\right\}\)) is the node selected in the previous step. Since the partial route begins at the vehicle’s location and ends at the depot, the order of nodes in the partial route matters. This characteristic requires the addition of positional encoding [5] (which describes the position of a node within the graph instance so that each node can have a unique representation) to the linearly projected pair to generate \(\overset\circ{h}_{{t}_{dec}}^{\left(l\right)}\in {\mathbb{R}}^{1\times d}\) as follows,

    $$\overset\circ{h}_{{t}_{dec}}^{\left(l\right)}=\left[{\widetilde{h}}_{{t}_{dec}},{f}_{k}^{{t}_{dec}}\right]{W}_{o}^{proj}+P{E}_{{t}_{dec}},$$
    (27)

    where \(P{E}_{{t}_{dec}}\) is a d-dimensional row vector. Each element of the vector is defined as

    $$P{E}_{{t}_{dec},i}=\left\{\begin{array}{cc}{\text{sin}}\left({t}_{dec}/{10000}^{2i/d}\right)& if\;i\text{ is even}\\ {\text{cos}}\left({t}_{dec}/{10000}^{2i/d}\right)& if\;i\text{ is odd}\end{array}\right.,$$
    (28)

    where \(i\in \left\{\mathrm{0,1},...,d-1\right\}\) is the position along the d dimension.

    Figure 6 illustrates the decoding Step 1. There are tdec visited nodes in the current partial route. We first compute the self-attention between \(\overset\circ{h}_{{t}_{dec}}^{\left(l\right)}\) and \(\left[\overset\circ{h}_{0}^{\left(l\right)},\overset\circ{h}_{1}^{\left(l\right)},...,\overset\circ{h}_{{t}_{dec}-1}^{\left(l\right)}\right]\in {\mathbb{R}}^{{t}_{dec}\times d}\). Step 1 is mathematically described as follows (where dk = d/H).

    $$\overset\circ{Q}_{l}={h}_{{t}_{dec}}^{\left(l\right)}{W}_{l,sa}^{Q}\in {\mathbb{R}}^{1\times {d}_{k}},{W}_{l,sa}^{Q}\in {\mathbb{R}}^{d\times {d}_{k}}$$
    (29)
    $$\begin{array}{cc}{\text{K}}_{l}=\left[\overset\circ{h}_{0}^{\left(l\right)},\overset\circ{h}_{1}^{\left(l\right)},...,\overset\circ{h}_{{t}_{dec}-1}^{\left(l\right)}\right]{W}_{l,sa}^{K}\in {\mathbb{R}}^{{t}_{dec}\times {d}_{k}},& {W}_{l,sa}^{K}\in {\mathbb{R}}^{d\times {d}_{k}},\end{array}$$
    (30)
    $$\begin{array}{cc}{\text{V}}_{l}=\left[\overset\circ{h}_{0}^{\left(l\right)},\overset\circ{h}_{1}^{\left(l\right)},...,\overset\circ{h}_{{t}_{dec}-1}^{\left(l\right)}\right]{W}_{l,sa}^{V}\in {\mathbb{R}}^{{t}_{dec}\times {d}_{k}},& {W}_{l,sa}^{V}\in {\mathbb{R}}^{d\times {d}_{k}},\end{array}$$
    (31)
    $$\overset\circ{Z}{}_{l}^{h}=attention\left({Q}_{l},{K}_{l},{V}_{l}\right)={\text{Softmax}}\left(\frac{{Q}_{l}{K}_{l}^{T}}{\sqrt{{d}_{k}}}\right){V}_{l}\in {\mathbb{R}}^{1\times {d}_{k}},$$
    (32)
    $${\left.h_{t_{dec}}^{\circ\left(l\right)}\leftarrow\text{MHA}\left(\cdot\right)\right|}_{sa}=\left[{Z^\circ}_l^1,{Z^\circ}_l^2,...,{Z^\circ}_l^H\right]W_{l,sa}^{out}\in\mathbb{R}^{1\times d},W_{l,sa}^{out}\in\mathbb{R}^{d\times d}.$$
    (33)
  • Step 2: This step queries the next node to visit among all candidate nodes. The step uses the encoder-decoder attention between the self-attention of a partial route (output of Step 1; denoted as \({h}_{{t}_{dec}}^{\circ \left(l\right)}\) for coherence) and context node embeddings (\({H}_{node}\in {\mathbb{R}}^{\left(N+2\right)\times d}\); node-vehicle embeddings with current vehicle embedding only (Eq. (34)). We mask the nodes that cannot be visited from the current location. Figure 7 illustrates the encoder-decoder attention in Step 2 of the decoding procedure. The following equations express Step 2.

    $${H}_{node}=\left[{{h}_{0}}^{\left({n}_{enc}\right)},{{h}_{1}}^{\left({n}_{enc}\right)},...,{{h}_{N}}^{\left({n}_{enc}\right)},{{\widehat{h}}_{k}}^{\left({n}_{enc}\right)}\right]\in {\mathbb{R}}^{\left(N+2\right)\times d},$$
    (34)
    $$\begin{array}{cc}{Q}_{l,att}=\overset\circ{h}_{{t}_{dec}}^{\left(l\right)}{W}_{l,att}^{Q}\in {\mathbb{R}}^{1\times {d}_{k}},& {W}_{l,att}^{Q}\in {\mathbb{R}}^{d\times {d}_{k}}\end{array},$$
    (35)
    $$\begin{array}{cc}{\text{K}}_{l,att}={H}_{node}{W}_{l,att}^{K}\in {\mathbb{R}}^{\left(N+2\right)\times {d}_{k}},& {W}_{l,att}^{K}\in {\mathbb{R}}^{d\times {d}_{k}},\end{array}$$
    (36)
    $$\begin{array}{cc}{\text{V}}_{l,att}={H}_{node}{W}_{l,att}^{V}\in {\mathbb{R}}^{\left(N+2\right)\times {d}_{k}},& {W}_{l,att}^{V}\in {\mathbb{R}}^{d\times {d}_{k}},\end{array}$$
    (37)
    $$\overset\circ{Z}{}_{l,att}^{h}=attention\left({Q}_{l,att},{K}_{l,att},{V}_{l,att}\right)={\text{Softmax}}\left(\frac{{Q}_{l,att}{K}_{l,att}^{T}}{\sqrt{{d}_{k}}}\odot {\mathcal{M}}^{T}\right){V}_{l,att}\in {\mathbb{R}}^{1\times {d}_{k}},$$
    (38)
    $${\left.h_{t_{dec}}^{\circ\left(l\right)}\leftarrow\text{MHA}\left(\cdot\right)\right|}_{att}=\left[{Z^\circ}_{l,att}^1,{Z^\circ}_{l,att}^2,...,{Z^\circ}_{l,att}^H\right]W_{l,att}^{out}\in\mathbb{R}^{1\times d},W_{l,att}^{out}\in\mathbb{R}^{d\times d}.$$
    (39)
  • Step 3: Steps 1 and 2 form a single decoding layer. After ndec decoding layers, the resultant output \({h}_{{t}_{dec}}^{\circ \left(l\right)}\) is sent to the final attention layer, where we compute a single-head attention to obtain a probability distribution over all candidate nodes. The decoder receives the graph embedding (\({\overline{h} }^{\left({n}_{enc}\right)}\)) from the encoder, and its linear projection is added to \({h}_{{t}_{dec}}^{\circ \left(l\right)}\). The query is constructed from this sum. The key is obtained by a linear projection of \({\widetilde{H}}_{node}\in {\mathbb{R}}^{\left(N+1\right)\times d}\), which is the context node embedding in Eq. (34) without the current vehicle embedding (\({{\widehat{h}}_{k}}^{\left({n}_{enc}\right)}\)). Step 3 is described by the following equations and illustrated in Fig. 8.

    $${\widetilde{H}}_{node}=\left[{h}_{0}{}^{\left({n}_{enc}\right)},{h}_{1}{}^{\left({n}_{enc}\right)},...,{h}_{N}{}^{\left({n}_{enc}\right)}\right]\in {\mathbb{R}}^{\left(N+1\right)\times d},$$
    (40)
    $$\begin{array}{cc}{Q}_{f,att}={h}_{{t}_{dec}}^{\circ \left(l\right)}{W}_{f,att}^{Q}\in {\mathbb{R}}^{1\times d},& {W}_{f,att}^{Q}\in {\mathbb{R}}^{d\times d},\end{array}$$
    (41)
    $$\begin{array}{cc}{\text{K}}_{f,att}={\widetilde{H}}_{node}{W}_{f,att}^{K}\in {\mathbb{R}}^{\left(N+1\right)\times d},& {W}_{f,att}^{K}\in {\mathbb{R}}^{d\times d},\end{array}$$
    (42)
    $${p}_{t}^{dec}={\text{Softmax}}\left(C\cdot {\text{tanh}}\left(\frac{{Q}_{f,att}{K}_{f,att}^{T}}{\sqrt{d}}\odot {\mathcal{M}}^{T}\right)\right) \in {\mathbb{R}}^{1\times \left(N+1\right)}.$$
    (43)
Fig. 6

Step 1 of the decoding procedure. The orange contour indicates the partial route at time step tdec

Fig. 7

Step 2 of the decoding procedure. The blue box denotes the current node, the green contour represents the set of candidate nodes, and the red cross indicates masked nodes

Fig. 8

Step 3 of the decoding procedure. The purple boxes above candidate nodes and depot indicate the selection probability

The value of C in Eq. (43) is set to 10. The next node \(a\in \left\{0,1,...,N\right\}\) is then chosen from the output probability distribution \({p}_{t}^{dec}\) (sampled from the categorical distribution or selected greedily), and t and tdec are incremented by one.
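Step 3 thus reduces to a clipped single-head attention; a minimal PyTorch sketch follows. For numerical stability the mask is applied here as -inf logits rather than as the Hadamard product written in Eq. (43), and tensor names are illustrative.

```python
import torch

def pointer_probs(q, node_keys, mask, C=10.0):
    """Eq. (43): clipped single-head attention producing node-selection probabilities.

    q:          (batch, 1, d)     decoder query (decoder context plus graph embedding)
    node_keys:  (batch, N+1, d)   projected depot and candidate-node embeddings
    mask:       (batch, 1, N+1)   True for nodes that may NOT be visited
    """
    d = q.size(-1)
    logits = q @ node_keys.transpose(-2, -1) / d ** 0.5   # scaled dot-product scores
    logits = C * torch.tanh(logits)                        # clip logits to [-C, C], C = 10
    logits = logits.masked_fill(mask, float("-inf"))       # forbid infeasible nodes
    return torch.softmax(logits, dim=-1)

# Sampling the next node: a = torch.distributions.Categorical(pointer_probs(q, K_f, m)).sample()
```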

5 Data-efficient training with proposed REINFORCE baseline

This section presents our proposed training methodology, which improves learning efficiency. In terms of data efficiency, our methodology requires fewer (raw) training instances per epoch than the conventional method. Since the training instances are generated on the fly, an epoch under our proposed methodology takes less time to generate the training data and transfer them to the GPU. Moreover, in terms of sample efficiency, our method reaches an equivalent performance (validation score) within fewer training epochs, i.e., with fewer training instances, than other methods.

5.1 Preliminary

Policy-gradient methods learn the policy directly and explicitly through gradient-based optimization. We define the model’s policy as a parametrized function \({\pi }_{\theta }(a|s)\), where θ denotes the trainable parameters of the model. The function is stochastic in that it defines a probability distribution of actions (a) at a given state (s). The goal of policy optimization is to maximize the expected cumulative return (sum of rewards, R(τ)) of the trajectory (\(\tau =({s}_{0},{a}_{0},{s}_{1},{a}_{1},...,{s}_{T})\)) whose actions are chosen by the policy defined as

$$J(\theta )={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}\left[R(\tau )\right]={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}\left[{\sum }_{t=0}^{T}r\left({s}_{t},{a}_{t}\right)\right].$$
(44)

The objective of the policy optimization problem expressed in Eq. (44) uses the expectation over all possible trajectories. For a given stochastic policy (\({\pi }_{\theta }\)), the trajectory probability (\(P(\tau ;{\pi }_{\theta }):=P(\tau ;\theta )\)) represents the probability of generating a trajectory following the policy. The trajectory probability is factorized as

$$P(\tau ;\theta )=\prod_{t=0}^{T}{\pi }_{\theta }({a}_{t}|{s}_{t})p({s}_{t+1}|{s}_{t},{a}_{t}),$$
(45)

where \(p\left({s}_{t+1}\text{| }{s}_{t},{a}_{t}\right)\) is the state-transition probability of the MDP defined in Section III. Williams [35] proposed a viable estimator of the policy gradient using Monte-Carlo sampling by assuming that R(τ) is independent of θ:

$${\nabla }_{\theta }J(\theta )={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}[R(\tau ){\nabla }_{\theta }{\text{log}}P(\tau ;\theta )].$$
(46)

In practice, the unbiased REINFORCE gradient estimator presented in Eq. (46) suffers from a high variance of the returns \(R({\tau }_{i})\) and is sample inefficient since it requires many sample episodes to converge. We can overcome these issues by including a baseline (b(s)), an action-independent function, in the policy gradient estimation. Consequently, an unbiased estimate of the gradient with reduced variance is expressed as

$${\nabla }_{\theta }J(\theta )={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}[(R(\tau )-b){\nabla }_{\theta }{\text{log}}P(\tau ;\theta )].$$
(47)
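Equation (47) translates directly into the surrogate loss sketched below, where `log_p` is the summed log-probability of a sampled trajectory and `baseline` is any action-independent estimate such as the ones discussed in the next subsection; this is a generic illustration, not the full training loop.

```python
import torch

def reinforce_loss(reward, baseline, log_p):
    """Surrogate loss whose gradient matches Eq. (47).

    reward:   (batch,) cumulative return R(tau) of each sampled trajectory
    baseline: (batch,) action-independent baseline b(s), detached from the graph
    log_p:    (batch,) sum of log pi_theta(a_t | s_t) along each trajectory
    """
    advantage = (reward - baseline).detach()
    return -(advantage * log_p).mean()   # minimizing this loss ascends J(theta)
```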

5.2 Choice of REINFORCE baseline b(s)

An example of the baseline is the average return over sample trajectories (\(b={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}[R(\tau )]\approx \frac{1}{N}{\sum }_{i=1}^{N}R({\tau }_{i})\)), where N is the number of samples in a mini-batch. Although the mini-batch baseline can effectively reduce variance in Gradient-Bandit algorithms [36], Kool et al. [31] showed that it performs significantly worse than other state-of-the-art baselines.

Prior studies suggest that designing an effective yet computationally tractable REINFORCE baseline is crucial for training the policy network. In this work, we propose to use, as the baseline, the average return of sample trajectories generated by instance augmentation from a single instance; we refer to it as the instance-augmentation baseline. Our baseline is a potential alternative to existing baselines, with improved training speed and reduced variance. The proposed baseline is motivated by observations of other baselines in prior works. In general, a local baseline performs significantly better than a batch baseline. In particular, a local baseline based on multiple samples drawn without replacement is expected to perform better because non-duplicate samples are guaranteed [31, 32]. This observation extends to POMO [4], whose local batch mean is based on N non-duplicate sample trajectories from a single instance, despite an increased tensor size. Since each POMO trajectory begins at a unique node, these samples are also guaranteed to be non-identical. These REINFORCE baselines are more data-efficient than the greedy rollout because they require fewer raw training instances (reduced by a factor equal to the number of trajectories sampled per instance).

A baseline that is as data-efficient as the multiple-sample baselines yet computationally lighter than the POMO shared baseline would therefore be desirable. The proposed baseline meets these requirements by utilizing instance augmentation, which was first suggested in [4] for effective inference.

Table 3 lists the coordinate transformations applied to all features (nodes, depots, and vehicle locations) to generate additional instances for a given training instance (a total of 8 instances). While each of these instances is distinct, the optimal tour is identical since the transformations preserve the distances between nodes. We then roll out sample trajectories for each of these "counterfactuals." The policy model perceives these as distinct instances, only to arrive at similar solutions as it generates multiple rollouts in parallel. The model thus learns to find improved solutions for a given instance based on the local batch mean. The policy model also learns more effective heuristics because the baseline offers a more focused view of a single instance through diverse perspectives. Figure 9 illustrates how our local baseline works. We believe that the proposed baseline combines the strengths of the multiple-sample baselines and the POMO shared baseline.
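A minimal sketch of the augmentation step is shown below. It assumes the eight transformations of Table 3 are the standard symmetries of the unit square used in [4] (axis swaps and reflections); the exact ordering in Table 3 may differ.

```python
import torch

def augment_x8(coords):
    """Return the 8 symmetry-transformed copies of a [0,1]^2 instance.

    coords: (batch, n, 2) coordinates of nodes, depot, and vehicle locations.
    Output: (8 * batch, n, 2), stacked variant by variant.
    """
    x, y = coords[..., 0:1], coords[..., 1:2]
    variants = [(x, y), (y, x), (x, 1 - y), (y, 1 - x),
                (1 - x, y), (1 - y, x), (1 - x, 1 - y), (1 - y, 1 - x)]
    return torch.cat([torch.cat(v, dim=-1) for v in variants], dim=0)

# Local baseline: with rewards reshaped to (8, batch), use rewards.mean(dim=0) per instance.
```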

Table 3 Unit square transformations
Fig. 9

Proposed REINFORCE baseline

Comparison with multiple samples with/without replacements

Our baseline does not strictly generate non-duplicate samples. However, it is much less likely to generate duplicate samples, especially in the early stages of training when the policy network \({\pi }_{\theta }\) has not yet "learned" much. Hence, our baseline promotes more "exploration" in the initial learning phase. To see this, we note that each augmented instance is associated with a distinct input embedding in the encoder output (\({h}^{\left({n}_{enc}\right)}\)). Let i denote the original instance, and let k and j denote augmented instances derived from i. For k (≠ j) and \({s}_{k}^{i}\ne {s}_{j}^{i}\) in raw form, \({h}_{k}^{\left({n}_{enc}\right),i}\ne {h}_{j}^{\left({n}_{enc}\right),i}\) in the latent space. Since a trajectory is sampled based on \({h}^{\left({n}_{enc}\right)}\), it is likely that \({\tau }_{k}^{i}\) differs from \({\tau }_{j}^{i}\). Indeed, as training proceeds, \({\pi }_{\theta }\) may generate duplicate samples since it learns which actions produce high-return trajectories in a more general setting. However, this limitation is mitigated in large problems, for which longer trajectories are likely to be unique.

Comparison with greedy rollout baseline

In the greedy rollout baseline, a baseline solution is generated by running the policy greedily, i.e., at each construction step, the node with the highest probability (where the probability distribution is obtained from an earlier version of the model) is visited. This deterministic solution trajectory serves as the baseline in the REINFORCE algorithm. While effective, the greedy rollout baseline incurs an additional forward pass of the earlier model version, which increases computation by 50%. Apart from this, we also empirically found that the greedy rollout baseline entails slightly noisy learning: the current model's (best) performance may not be replicated or generalized to another problem set. This behavior is more apparent towards the later stages of training, when the model finds it difficult to surpass its greedy self and the baseline policy is rarely updated. At this point, the model does not learn much from the competition with its greedy self.

Comparison with POMO shared baseline

Compared to the POMO baseline, our approach is more computationally efficient since it uses a fixed local batch size that does not increase with the number of nodes.

5.3 Combining with maximum entropy objective

Training the policy model with entropy can smooth out the optimization landscape, speeding up the learning process. In some environments, it yields a better final policy [9]. It is also robust to internal algorithmic disturbances and to external environmental disturbances such as perturbations of the dynamics and reward function [8]. We note that robustness to external disturbances is an important factor determining generalization capability (i.e., performance on graphs of various sizes). This work combines maximum entropy RL with our instance-augmentation baseline and shows improved training and inference performance on various problem instances.

We implement the maximum entropy RL as follows. The objective aims to maximize the expected cumulative return augmented by a conditional action entropy as

$${J}_{MaxEnt}(\theta )={\sum }_{t}{\mathbb{E}}_{\left({s}_{t},{a}_{t}\right)\sim {\rho }_{\theta }^{\pi }}[r\left({s}_{t},{a}_{t}\right)+\alpha {\mathbb{H}}({\pi }_{\theta }(\cdot |{s}_{t}))]$$
(48)

where \({\mathbb{H}}({\pi }_{\theta }(\cdot |{s}_{t}))={\mathbb{E}}_{{a}_{t}\sim {\pi }_{\theta }}[-{\text{log}}{\pi }_{\theta }({a}_{t}|{s}_{t})]=-{\sum }_{{a}_{t}}[{\pi }_{\theta }({a}_{t}|{s}_{t}){\text{log}}{\pi }_{\theta }({a}_{t}|{s}_{t})]\) denotes the Shannon entropy of conditional distribution over actions along the trajectory, \({\rho }_{\theta }^{\pi }\left({s}_{t},{a}_{t}\right)\) is the state-action marginal of trajectory distribution induced by \({\pi }_{\theta }\) and \(\alpha\) is the entropy weight or temperature. The maximum entropy objective function presented in Eq. (48) results in a slightly different gradient [9] (trajectory view):

$${\nabla }_{\theta }J(\theta )={\mathbb{E}}_{\tau \sim {\pi }_{\theta }}[R(\tau ){\nabla }_{\theta }{\text{log}}P(\tau ;\theta )+\alpha {\sum }_{t}{\nabla }_{\theta }{\mathbb{H}}({\pi }_{\theta }(\cdot |{s}_{t}))]$$
(49)

Although Sultana et al. [37] used the entropy maximization term to train a policy with a greedy rollout baseline, it has not, to our knowledge, been combined with other baselines. By integrating the entropy-augmented objective with our instance-augmentation baseline, our policy model learns a more stochastic policy that is applicable in a generalized setting. Algorithm 1 presents our proposed REINFORCE algorithm. The Adam optimizer [38] with a constant learning rate of 0.0001 is used to train the policy model parameters.

Algorithm 1

Proposed REINFORCE Algorithm (Instance-augmentation baseline with maximum entropy objective)
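Since Algorithm 1 appears only as a figure, the following PyTorch-style sketch summarizes one training step as we interpret it: roll out all eight augmented copies of each instance, use their mean return as the local baseline, and add the entropy bonus of Eq. (49) with weight α. The helpers `rollout` and `augment_x8` are placeholders, not part of a released API.

```python
import torch

def train_step(policy, optimizer, batch, alpha=0.01):
    """One REINFORCE step with the instance-augmentation baseline and entropy bonus."""
    aug = augment_x8(batch)                        # (8 * B, n, 2): eight copies per instance
    # rollout is assumed to return, per augmented copy, the collected prize, the summed
    # log-probability of the sampled trajectory, and the summed policy entropy.
    reward, log_p, entropy = rollout(policy, aug)  # each of shape (8 * B,)
    grouped = reward.view(8, -1)                   # rows = augmentations, columns = instances
    baseline = grouped.mean(dim=0, keepdim=True)   # local, per-instance baseline
    advantage = (grouped - baseline).view(-1).detach()

    loss = -(advantage * log_p).mean() - alpha * entropy.mean()  # Eq. (49), written as a loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4), as stated above.
```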

6 Experiments and discussion

To establish the effectiveness of our proposed REINFORCE algorithm, we conducted a comprehensive study, first comparing our instance-augmentation baseline with the greedy rollout baseline and subsequently comparing our instance-augmentation baseline with maximum entropy objective against the greedy rollout baseline with entropy. The strengths of our method are substantiated across various problem sets, encompassing TSP, CVRP, and MSTOP, demonstrating consistent improvement in training (in terms of both solution quality and training time) even with increasing problem sizes.

6.1 Problem setup and hyperparameters

This section describes the controlled experiments to solve the MSTOP using the DDTM. To observe the benefits of our instance-augmentation baseline (over greedy rollout), we conduct an ablation study on classical TSP and CVRP using the original AM. To this end, we consider three problem/policy pairs – MSTOP/DDTM, TSP/AM, and CVRP/AM. Graph sizes (n) of 10, 20, 50, and 70 are set for the MSTOP (Table 4), and we consider these cases with 2 and 3 vehicles. The decision to focus on scenarios involving 2 and 3 vehicles is rooted in our motivation to provide insights into optimizing the efficiency of limited resources in situations where deploying a larger fleet is impractical. For TSP and CVRP, we consider instances with sizes of 50 and 100. Furthermore, to check how our proposed training algorithm improves generalization performance, we test the performance of each AM on problem instances of various sizes.

Table 4 MSTOP problem instances of various sizes

Training DDTM to solve MSTOP

We follow the basic problem setup in [3] for the Orienteering Problem (OP), i.e., the coordinates of all customer and depot nodes are randomly sampled within a normalized [0,1] × [0,1] world. The prizes of nodes are either initialized as one (constant) or sampled from a uniform distribution between 0 and 1.

Table 4 describes the experimental details, including the graph size (n), the number of vehicles (N), and the maximum length constraint for each route (Tmax). Additionally, each vehicle in MSTOP starts at a random location within the same [0,1] × [0,1] world and is given a variable remaining tour length (or equivalently fuel amount) with the distance between the current vehicle location and the depot as the lower bound. This setting ensures that the sum of the remaining tour length and the partial tour constructed henceforth is bounded above by Tmax. For all MSTOP cases, the DDTM is initialized with nenc = 4 and ndec = 2, which we found to be an acceptable trade-off between computational load and the quality of learned policy.

For numerical experiments, we train on 1,280,000 instances per epoch. Considering the GPU memory constraints, we train 1250 batches of 1024 instances (n = 10, 20) for 200 epochs, 2500 batches of 512 instances (n = 50) for 100 epochs, and 3333 batches of 384 instances (n = 70) for 100 epochs. The instance-augmentation baseline uses a batch size reduced by a factor of 8, i.e., 128 for n = 10 and n = 20, 64 for n = 50, and 48 for n = 70, so that the total number of training instances is the same. These training instances are generated randomly on the fly at every epoch to prevent overfitting. After each epoch, we roll out the current model (with greedy decoding) on a held-out validation set of size 10,000 and plot the learning curve to observe the training process.

Training AM to solve TSP/CVRP

We adopt the problem setup prescribed in [3] and use the same hyperparameters for training the AM policy network to ensure a fair comparison (except for the application of ‘warmup’).

Entropy weight

To realize the benefits of maximum entropy in our methodology, we need a suitable value for α. A very large α value makes the objective close to a pure entropy-maximization problem, whose optimal policy is purely random. On the contrary, if α is too small, premature convergence may occur due to inadequate exploration. The α value used for training is 0.01 for both MSTOP and TSP/CVRP. We observed that this value works well on MSTOP20 (uniformly distributed prizes) and TSP50.

6.2 Inference result

This section presents the performance of DDTM on 10,000 random MSTOP instances. To validate our proposed methodology, we assess the performance of 1) DDTM trained with our proposed baseline and maximum entropy objective, and 2) DDTM trained with greedy rollout baseline and maximum entropy objective. The following section presents a comprehensive ablation study for various REINFORCE training baselines.

We use three decoding strategies. The greedy strategy rolls out a single greedy trajectory for each instance. The sampling strategy generates 1280 trajectories (per instance) and selects the best one. Finally, the instance augmentation strategy draws multiple greedy trajectories for each instance and selects the best result. To effectively handle the inherent asymmetry in MSTOP solutions, we permute the order of starting vehicles (see Table 5). Then, we generate a single greedy trajectory for each vehicle order and choose the best of the N! trajectories. To expand the search space, for each permutation, we further roll out eight trajectories for each problem instance (by solving its augmented instances) and select the best of the 8 × N! trajectories. As illustrated in Fig. 10, this increases the chance of finding near-optimal solutions.
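A sketch of this ×8N! strategy is given below: enumerate every vehicle order, greedily decode the eight augmented copies of the instance for each order, and keep the best total prize. `greedy_rollout` and `augment_x8` are placeholder helpers.

```python
import itertools

def infer_x8_nfact(policy, instance, num_vehicles):
    """Best greedy solution over all N! vehicle orders and 8 instance augmentations."""
    best_prize, best_solution = float("-inf"), None
    for order in itertools.permutations(range(num_vehicles)):    # N! vehicle orders
        aug = augment_x8(instance)                               # 8 symmetric copies
        prizes, tours = greedy_rollout(policy, aug, vehicle_order=order)
        k = int(prizes.argmax())
        if float(prizes[k]) > best_prize:
            best_prize, best_solution = float(prizes[k]), (order, tours[k])
    return best_prize, best_solution
```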

Table 5 Permutations of vehicle order. Bold denotes the first vehicle to start routing. The DDTM sequentially begins routing according to the given vehicle order
Fig. 10

LEFT: Routing begins with Vehicle A. RIGHT: Routing begins with Vehicle B (optimal tour found)

To the best of our knowledge, no existing algorithm is designed specifically for the MSTOP. For n values of 10 and 20, we compare the results with the optimal solutions obtained using the MILP formulation introduced in Section III (implemented with Gurobi [34]). We also implement, with slight modification, the heuristic by Tsiligirides for the OP introduced in [39] and compare the results. The MILP solution is used as the reference to compute the optimality gap. For larger instances (n = 50 and n = 70), it takes prohibitively long to solve the MILP to optimality. Therefore, the best of the solutions obtained by the various methodologies is used as the reference to compute the optimality gap.

Tables 6 and 7 summarize the experimental results for comparison. We report the average total prize over 10,000 test MSTOP instances. Using the greedy strategy, the DDTM finds near-optimal solutions with optimality gaps of around 4–5%. The optimality gaps of the DDTM solutions obtained using the sampling strategy are 1–2%. With almost all strategies, the DDTM outperforms the heuristic by Tsiligirides. The DDTM performs best with the ×8N! instance augmentation strategy, which finds high-quality solutions much faster than the sampling technique, demonstrating its superiority.

Table 6 Experimental results on MSTOP (constant prizes; bold: best result)
Table 7 Experimental results on MSTOP (uniformly distributed prizes; bold: best result)

Figure 11 presents the quality (optimality gap) of the solutions obtained using the DDTM trained under the proposed methodology for 10,000 test MSTOP20 instances. The optimality gap of the DDTM solutions is 0% in more than 90% of the constant-prize instances. Also, in over 90% of the instances with uniformly distributed prizes, the optimality gap is smaller than 5%. Figure 12 shows example solutions of MSTOP20 for different prize distributions. The DDTM inference solutions obtained with the ×8N! augmentation strategy are plotted on the left. The corresponding MILP solutions are presented on the right for comparison.

Fig. 11

DDTM solution quality (optimality gap) on MSTOP20 instances

Fig. 12

Example solution of MSTOP20 (uniformly distributed prizes). Numerical values next to blue nodes represent node prizes

6.3 Ablation study

The ablation study analyzes the contribution of our proposed training methodology (instance-augmentation baseline with maximum entropy objective) to training policy network models. Specifically, we compare the learning curves obtained with different baselines on the DDTM (for solving MSTOP) and the original AM (for TSP and CVRP). Each learning curve is obtained by evaluating the model on a held-out validation set of 10,000 random instances. The learning curves are plotted for four training strategies: greedy rollout baseline (A), greedy rollout baseline with maximum entropy objective (B), instance-augmentation baseline (C), and instance-augmentation baseline with maximum entropy objective (D).

DDTM & training baselines (MSTOP)

Figure 13 shows the learning curves of the four training methods, (A) to (D), on MSTOP20 with uniformly distributed prizes. Our proposed baseline (C) helps the model learn a better policy than both the greedy rollout baseline (A) and its combination with the maximum entropy objective (B). As an added benefit, the instance-augmentation baseline substantially speeds up learning, since fewer training data need to be generated. Combined with the maximum entropy objective (D), the proposed methodology significantly outperforms the remaining methods and reaches high validation scores in fewer training epochs, demonstrating its sample efficiency.

Fig. 13 Learning curves for MSTOP20 with uniformly distributed prizes. Dark curves are smoothed results; lighter curves are raw results

AM & training baselines (TSP, CVRP)

We believe that the proposed methodology is a general technique that can be used instead of the conventional greedy rollout baseline. To validate this, we perform additional experiments on the vanilla AM network using the original code to solve TSP and CVRP. For a fair comparison, we plot the learning curves on the same validation set (with seed 1234) and also report the inference results on the same test set (with seed 4321) used in [3].

Figure 14 shows the learning curves of the original AM with different baselines on TSP50. The instance-augmentation baseline (C) performs better than the greedy rollout baseline (A) and slightly worse than the greedy rollout baseline with maximum entropy objective (B). The proposed methodology (D), however, substantially improves the quality of the learned policy. Moreover, using the instance-augmentation baseline, (C) and (D), instead of the greedy rollout baseline, (A) and (B), reduces the per-epoch training time by over 30% (see Table 8). The proposed method is thus effective in expediting training while maintaining competitive performance, striking a favorable balance between training speed and solution quality.

Fig. 14 Learning curves on TSP50 using the vanilla AM. Dark curves are smoothed results; lighter curves are raw results

Table 8 Comparison of training time for different training strategies (per epoch, in min:sec); training performed on a single 3090 Ti GPU

Table 9 summarizes the inference test results on TSP and CVRP. Our proposed methodology (D) outperforms the other training methods across all decoding strategies in all cases. In particular, the proposed approach is comparable to the state-of-the-art POMO method in terms of the optimality gap. The best performance on TSP50 obtained by the proposed approach (optimality gap 0.15%, sampling) is better than that of POMO inference without augmentation (0.24% [4]). Similarly, on CVRP50 instances, the best result obtained by the proposed method (1.75%, sampling) outperforms POMO inference with a single trajectory (3.52% [4]). Even on large instances (n = 100), the proposed methodology (D) shows improvement under all decoding strategies.

Table 9 Test results of vanilla AM trained with different methods

6.4 Generalization result

This section discusses the generalization capability of our training methodology. Kool et al. [3] demonstrated that the AM trained with the greedy rollout baseline generalizes to problems with different graph sizes, although the error grows as the graph size increases. Since training with the maximum entropy objective is known to improve a model's robustness, we conduct a comparative study of generalization performance between the greedy rollout baseline with maximum entropy objective (B) and our proposed methodology (D) to examine how the proposed methodology reduces generalization error. The generalization results are reported using the instance-augmentation decoding strategy on the same test datasets as in the previous sections.

Figure 15 illustrates the generalization performance of DDTMs trained on the MSTOP10 and MSTOP20 environments for N = 2 vehicles, where the horizontal axis represents the test environment (i.e., prize distribution and graph size) and the vertical axis shows the optimality gap. Part (a) reports the performance of the DDTM trained under constant prizes, whereas part (b) corresponds to the DDTM trained under uniformly distributed prizes. The models naturally perform best when tested under the same conditions as the training environment, while optimality gaps tend to increase when tested on different graph sizes. In general, the proposed methodology (D) shows better generalization than the conventional method (B), with reduced optimality gaps under changing graph sizes. Moreover, models trained under uniformly distributed prizes generalize better than their counterparts trained under constant prizes when tested on environments with different prize distributions. This is not surprising, since uniformly distributed prizes can be seen as a generalization of constant prizes, and problems with constant prizes are generally considered easier to solve. One exception is the DDTM trained on MSTOP10 with uniformly distributed prizes and tested on MSTOP20 with constant prizes, where the model trained using the proposed methodology (D) performs worse than the conventional approach (B). This result may be attributed to using an entropy weight \(\alpha\) tuned for MSTOP20 with uniformly distributed prizes.

Fig. 15 Generalization performance of DDTMs trained and tested between MSTOP10 and MSTOP20 environments. Models trained under (a) constant prizes and (b) uniformly distributed prizes. Optimality gaps reported as the performance measure

Figure 16 presents the generalization results for DDTMs trained on the MSTOP50 and MSTOP70 environments for N = 3 vehicles, where the vertical axis represents the test score. As in Fig. 15, the proposed methodology (D) generally performs better under both changing graph sizes and changing prize distributions, as evidenced by larger test scores. The improvement is more pronounced for large-scale problems, indicating that the proposed methodology generalizes well as the graph size grows.

Fig. 16 Generalization performance of DDTMs trained and tested between MSTOP50 and MSTOP70 environments. Models trained under (a) constant prizes and (b) uniformly distributed prizes. Test scores reported as the performance measure

Figure 17 presents the generalization performance for TSP and CVRP versus graph size. For both TSP and CVRP, the proposed methodology (D) shows better generalization performance (reduced optimality gaps), except for the CVRP100 model on graph size n = 50, which is likely a result of using an entropy weight \(\alpha\) tuned for TSP50. Across the tests on different routing problems, the proposed methodology generally yields improved generalization performance compared to the conventional method.

Fig. 17 Generalization results for (a) TSP and (b) CVRP

7 Conclusion

The Multi-Start Team Orienteering Problem (MSTOP) is introduced to address routing problems arising in dynamic environments, and an attention-based policy network referred to as the Deep Dynamic Transformer Model (DDTM) is proposed to solve it. The proposed learning procedure modifies the REINFORCE algorithm by introducing a new instance-augmentation baseline and combining it with the maximum entropy objective, improving learning efficiency and inference capability. A set of numerical experiments comparing the proposed procedure with existing methodologies demonstrates its effectiveness. For a suitable value of the entropy weight, the instance-augmentation baseline outperforms the conventional greedy rollout baseline in terms of inference performance, generalization performance, and training speed. The test results indicate that the proposed approach performs comparably to the current state-of-the-art POMO baseline while requiring fewer computational resources. The procedure is further applied to the classical TSP and CVRP, showing its potential as a general technique for solving various routing problems.

Several directions remain for future work. It would be interesting to apply the proposed methodology to other asymmetric CO problems, such as the Multi-Depot VRP and Multi-Depot MSTOP, where the order of vehicles breaks the symmetry in solution representations. Applying the proposed approach to missions involving cooperation between agents would also be a meaningful extension of this study [37]. Another promising subject is handling instance-augmentation inference for problems with many vehicles. Such large problems can be tackled by breaking them into smaller, more manageable subproblems: the model trained for two or three vehicles iteratively solves portions of the larger problem, and the individual solutions are then concatenated into a comprehensive solution for the entire fleet. While this iterative approach may not yield optimal solutions, it can produce near-optimal solutions rapidly, as the model solves each subproblem on the order of 10 ms. Finally, we acknowledge that the current DDTM architecture is heavy, resulting in a longer training time compared to the original AM; one possible resolution is to "compress" the model [38, 39] for efficient training and inference.
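A minimal sketch of the decomposition idea outlined above is given below, assuming a hypothetical solve_subproblem helper (a call to the trained two- or three-vehicle model) and simple node-removal bookkeeping; it is illustrative only and not part of the current implementation.

```python
def solve_large_fleet(vehicles, nodes, solve_subproblem, group_size=3):
    """Illustrative decomposition: split the fleet into groups of two or
    three vehicles, solve each subproblem with the trained model, remove
    the visited nodes, and concatenate the partial routes."""
    routes, remaining = [], list(nodes)
    for i in range(0, len(vehicles), group_size):
        group = vehicles[i:i + group_size]
        sub_routes, visited = solve_subproblem(group, remaining)
        routes.extend(sub_routes)
        remaining = [n for n in remaining if n not in visited]
    return routes
```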