Learning 2-Opt Heuristics for Routing Problems via Deep Reinforcement Learning

Recent works using deep learning to solve routing problems such as the traveling salesman problem (TSP) have focused on learning construction heuristics. Such approaches find good quality solutions but require additional procedures such as beam search and sampling to improve solutions and achieve state-of-the-art performance. However, few studies have focused on improvement heuristics, where a given solution is improved until reaching a near-optimal one. In this work, we propose to learn a local search heuristic based on 2-opt operators via deep reinforcement learning. We propose a policy gradient algorithm to learn a stochastic policy that selects 2-opt operations given a current solution. Moreover, we introduce a policy neural network that leverages a pointing attention mechanism, which can be easily extended to more general k-opt moves. Our results show that the learned policies can improve even over random initial solutions and approach near-optimal solutions faster than previous state-of-the-art deep learning methods for the TSP. We also show that we can adapt the proposed method to two extensions of the TSP, the multiple TSP and the Vehicle Routing Problem, achieving results on par with classical heuristics and learned methods.


Introduction
The traveling salesman problem (TSP) is a well-known combinatorial optimization problem. In the TSP, given a set of locations (nodes) in a graph, we need to find the shortest tour that visits each location exactly once and returns to the departing location. The TSP is NP-hard [33] even in its Euclidean formulation, i.e., when nodes are points in 2D space. Classic approaches to solving the TSP can be classified into exact and heuristic methods. The former have been extensively studied using integer linear programming [2]; they are guaranteed to find an optimal solution but are often too computationally expensive to be used in practice. The latter are based on (meta)heuristics and approximation algorithms [3] that find solutions in less computational time, e.g., edge swaps such as k-opt [11]. However, designed heuristics require specialized knowledge, and their performance is often limited by algorithmic design.
Recent works in machine learning and deep learning have focused on learning heuristics for combinatorial optimization problems [6,27]. For the TSP, both supervised learning [18,38] and reinforcement learning [5,7,20,24,40] methods have been proposed. The idea behind the proposed methods is that a machine learning method could learn better heuristics by extracting useful information directly from data, rather than having an explicitly programmed behavior. Most approaches to the TSP have focused on learning construction heuristics, i.e., methods that generate a solution sequentially by extending a partial tour. These methods employed sequence representations [5,38], graph neural networks [18,20] and attention mechanisms [7,24,40], resulting in high-quality solutions. However, construction methods still require additional procedures such as beam search, classical improvement heuristics, and sampling to achieve such results. This limitation hinders their applicability, as they must revert to handcrafted improvement heuristics and search algorithms for state-of-the-art performance. Thus, learning improvement heuristics, i.e., methods in which a solution is improved by local moves that search for better solutions, remains relevant. Here, if we can learn a policy to improve a solution, we can use it to get better solutions from a construction heuristic or even from random solutions. Recently, a deep reinforcement learning method [40] has been proposed for such a task, achieving near-optimal results using node swap and 2-opt moves. However, that architecture has its output fixed by the number of possible moves, making it less favorable to expand to general k-opt moves, which lead to lower optimality gaps [12].
Two natural extensions of the TSP are the multiple TSP (mTSP) and the capacitated vehicle routing problem (CVRP). In the first, we consider the original problem augmented with more salesmen, constrained on the size of tours or the number of visits. The CVRP also considers multiple salesmen (vehicles), each with a maximum capacity. Customers have certain demand values that need to be fulfilled by vehicles without exceeding their total capacity. These problems are harder to solve than the TSP due to the added constraints and usually require tailored heuristics. Both problems have also been the subject of recent interest in combining machine learning and combinatorial optimization [8,16,19,34]. However, few previously proposed models can be seamlessly used in multiple routing problems [24,40].
In this work, we propose a deep reinforcement learning algorithm trained via Policy Gradient to learn improvement heuristics based on 2-opt moves. Our architecture is based on a pointer attention mechanism [38] that outputs nodes sequentially for action selection. We introduce a reinforcement learning formulation to learn a stochastic policy of the next promising solutions, incorporating the search's history information by keeping track of the current best-visited solution. Our results show that we can learn policies for the Euclidean TSP that achieve near-optimal solutions even when starting with poor quality solutions. Moreover, our approach can achieve better results than previous deep learning methods based on construction [5,7,18,20,24,29,38] and improvement [40] heuristics. Compared to [40], our method can be easily adapted to general k-opt and is more sample efficient. Our method outperforms other effective heuristics such as Google's OR-Tools [35] on simulated instances and is close to optimal solutions. Lastly, it can be easily expanded to the mTSP and CVRP.

Related Work
In machine learning, early works for the TSP have focused on Hopfield networks [14] and deformable template models [1]. However, the performance of these approaches has not been on par with classical heuristics [25]. Recent deep learning methods have achieved high performance by learning construction heuristics for the TSP. Pointer Networks (PtrNet) [38] learned a sequence model coupled with an attention mechanism trained to output TSP tours using solutions generated by Concorde [2]. In [5], the PtrNet was further extended to learn without supervision using Policy Gradient, trained to output a distribution over node permutations. Other approaches encoded instances via graph neural networks. A structure2vec (S2V) [20] model was trained to output the ordering of partial tours using deep Q-learning (DQN). Later, graph attention was employed in a hybrid approach using 2-opt local search on top of tours trained via Policy Gradient [7]. Graph attention was extended in [24] using REINFORCE [39] with a greedy rollout baseline, resulting in lower optimality gaps. Recently, the supervised approach was revisited using graph convolution networks (GCN) [18], learning probabilities of edges occurring in a TSP tour. It achieved state-of-the-art results up to 100 nodes whilst also combining with search heuristics.
Recent machine learning approaches specialized for the mTSP include [19], which proposed a neural network architecture trained via supervised learning. Combined with constraint-enforcing layers, it can achieve competitive results in comparison to OR-Tools. In [16], multi-agent reinforcement learning is used to learn an allocation of agents to nodes, and regular optimization is used to solve the TSP associated with each agent. The VRP has gained much interest since [31]. In that work, a policy gradient algorithm is proposed to generate solutions as a sequence of consecutive actions. Later, [24] extended the attention method to the VRP, outperforming [31], followed by [40], who also expanded their model to the VRP case, obtaining lower gaps. A specialized VRP model combined reinforcement and supervised learning to learn to construct solutions, outperforming [24], but was trained on different distributions of node locations [8]. Another VRP method, named neural large neighborhood search (NLNS) [15], proposed integrating learning methods and classical search. In this method, the policy is trained to reconstruct randomly destroyed solutions. Another approach, named learn to improve (L2I) [28], considered learning improvement policies by choosing from a pool of operators. Recently, deep policy dynamic programming (DPDP) [23] was proposed with the aim of combining neural heuristics with dynamic programming. The method is trained to predict edges from example solutions and outperforms previous neural approaches when solving TSPs and VRPs with 100 nodes.
Previous end-to-end methods depend on additional procedures such as beam search, classical improvement heuristics, and sampling to achieve good solutions. Thus, in this work, we encode edge information using graph convolutions and use classical sequence encoding to learn node orderings. We decode these representations via a pointing attention mechanism to learn a stochastic policy for the action selection task. In the TSP, our approach resembles classical 2-opt heuristics [10] and can outperform previous deep learning methods in solution quality and sample efficiency.

Travelling Salesman Problem
We focus on the 2D Euclidean TSP. Given an input graph, represented as a sequence of $n$ locations in a two-dimensional space $X = \{x_i\}_{i=1}^{n}$, where $x_i \in [0, 1]^2$, we are concerned with finding a permutation of the nodes, i.e., a tour $S = (s_1, \dots, s_n)$, that visits each node once (except the starting node) and has the minimum total length (cost). We define the cost of a tour as the sum of the distances (edges) between consecutive nodes in $S$ as
$$L(S) = \lVert x_{s_n} - x_{s_1} \rVert_2 + \sum_{i=1}^{n-1} \lVert x_{s_i} - x_{s_{i+1}} \rVert_2,$$
where $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm.
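As a small illustration of the tour cost, the sketch below sums the Euclidean lengths of consecutive edges and closes the tour back to its first node. The function name `tour_length` and the 0-indexed tour representation are illustrative choices, not part of the original formulation.

```python
import math

def tour_length(coords, tour):
    """Total Euclidean length of a closed tour.

    coords: list of (x, y) points; tour: a permutation of range(len(coords)).
    """
    n = len(tour)
    total = 0.0
    for i in range(n):
        x1, y1 = coords[tour[i]]
        x2, y2 = coords[tour[(i + 1) % n]]  # the last node connects back to the first
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Corners of the unit square
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(square, [0, 1, 2, 3]))  # perimeter: 4.0
print(tour_length(square, [0, 2, 1, 3]))  # crossing tour: 2 + 2*sqrt(2)
```

The second tour uses both diagonals of the square, which is exactly the kind of crossing that a 2-opt move removes.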

k-Opt Heuristic for the TSP
Improvement heuristics enhance feasible solutions through a search procedure. A procedure starts at an initial solution $S_0$ and replaces a previous solution $S_t$ by a better solution $S_{t+1}$. Local search methods such as the effective Lin-Kernighan-Helsgaun (LKH) [11] heuristic perform well for the TSP. The procedure searches for $k$ edges to remove and replace with new edges (k-opt moves), resulting in a shorter tour. A simpler version [26] considers 2-opt (Fig. 1) and 3-opt moves as alternatives, as these balance solution quality and the $O(n^k)$ complexity of the moves. Moreover, sequential pairwise operators such as k-opt moves can be decomposed into simpler l-opt ones, where $l < k$. For instance, sequential 3-opt operations can be decomposed into one, two or three 2-opt operations [11]. However, in local search algorithms, the quality of the initial solution usually affects the quality of the final solution, i.e., local search methods can easily get stuck in local optima [10]. To avoid local optima, different metaheuristics have been proposed, including Simulated Annealing and Tabu Search. These work by accepting worse solutions to allow more exploration of the search space. In general, this strategy leads to better solution quality. However, metaheuristics still require expert knowledge and may have sub-optimal rules in their design. To tackle this limitation, we propose to combine machine learning and 2-opt operators to learn a stochastic policy that improves TSP solutions sequentially. A stochastic policy resembles a metaheuristic, sampling solutions in the neighborhood of a given solution and potentially avoiding local minima. Our policy iterates over feasible solutions, and the minimum cost solution is returned at the end. The main idea of our method is that taking future improvements into account can potentially result in better policies than greedy heuristics.

Reinforcement Learning Formulation
Our formulation considers solving the TSP via 2-opt as a Markov decision process (MDP), detailed below. In our MDP, a given state $\bar{S}$ is composed of a tuple of the current solution (tour) $S$ and the lowest-cost solution $S'$ seen in the search. The proposed neural architecture (Sect. 5) approximates the stochastic policy $\pi_\theta(A \mid \bar{S})$, where $\theta$ represents trainable parameters. Each $A = (a_1, a_2)$ corresponds to a 2-opt move, where $a_1, a_2$ are node indices. Our architecture also contains a value network that outputs value estimates $V_\phi(\bar{S})$, with $\phi$ as learnable parameters. We assume TSP samples drawn from the same distribution and use Policy Gradient to optimize the parameters of the policy and value networks (Sect. 6).
States A state $\bar{S}$ is composed of a tuple $\bar{S} = (S, S')$, where $S$ and $S'$ are the current and lowest-cost solution seen in the search, respectively. That is, given a search trajectory at time $t$ and solution $S$, $S_t = S$ and $S'_t = S' = \arg\min_{S_{t'} \in \{S_0, \dots, S_t\}} L(S_{t'})$.
Actions We model actions as tuples $A = (a_1, a_2)$, where $a_1, a_2 \in \{1, \dots, n\}$, $a_2 > a_1$, correspond to index positions of solution $S = (s_1, \dots, s_n)$.
Transitions Given $A = (i, j)$, transitioning to the next state defines a deterministic change to solution $\hat{S} = (\dots, s_i, \dots, s_j, \dots)$, resulting in a new solution $S = (\dots, s_{i-1}, s_j, \dots, s_i, s_{j+1}, \dots)$ and state $\bar{S} = (S, S')$. That is, selecting $i$ and $j$ in $\hat{S}$ implies breaking the edges at positions $(i-1, i)$ and $(j, j+1)$, inserting the edges $(i-1, j)$ and $(i, j+1)$, and inverting the order of the nodes between $i$ and $j$ (Fig. 1).
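The transition can be sketched as a segment reversal on a list. This is a minimal illustration that assumes 0-indexed positions (the text uses 1-indexed positions); the helper name `two_opt_move` is hypothetical.

```python
def two_opt_move(tour, i, j):
    """Return a new tour with the segment tour[i..j] (inclusive, 0-indexed) reversed.

    Reversing the segment removes the edges entering position i and leaving
    position j, and reconnects the tour with the two new edges.
    """
    if not 0 <= i < j < len(tour):
        raise ValueError("expected 0 <= i < j < len(tour)")
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

tour = [0, 1, 2, 3, 4, 5]
print(two_opt_move(tour, 1, 3))  # [0, 3, 2, 1, 4, 5]
```

In the example, edges (0, 1) and (3, 4) are removed and edges (0, 3) and (1, 4) are added, while the nodes in between appear in inverted order.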
Rewards Similar to [40], we attribute rewards to actions that can improve upon the current best-found solution, i.e., $R_t = L(S'_t) - L(S'_{t+1})$.
Environment Our environment runs for $\mathbb{T}$ steps. For each run, we define episodes of length $T \le \mathbb{T}$, after which a new episode starts from the last state in the previous episode. This ensures access to poor quality solutions at $t = 0$, and high-quality solutions as $t$ grows.
Fig. 1 TSP solution before a 2-opt move (left), and after a 2-opt move (right). Added edges are represented in dashed lines. Note that the sequence $s_i, \dots, s_j$ is inverted
Returns Our objective is to maximize the expected return $G_t$, which is the cumulative reward starting at time step $t$ and finishing at $T$, at which point no future rewards are available, i.e., $G_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} R_{t'}$, where $\gamma \in (0, 1]$ is a discount factor.
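The reward and return definitions can be illustrated with a short sketch. It assumes a list of tour costs $L(S_0), \dots, L(S_T)$ observed during one episode and a discounted return; the function names are illustrative.

```python
def best_so_far_rewards(costs):
    """Rewards R_t = L(S'_t) - L(S'_{t+1}) from the costs of the visited solutions."""
    rewards = []
    best = costs[0]
    for cost in costs[1:]:
        new_best = min(best, cost)
        rewards.append(best - new_best)  # zero unless a new best solution is found
        best = new_best
    return rewards

def discounted_returns(rewards, gamma=0.99):
    """Returns G_t = sum over t' >= t of gamma^(t' - t) * R_t', computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

costs = [10.0, 9.0, 9.5, 8.0]  # the third solution is worse than the best seen so far
rewards = best_so_far_rewards(costs)
print(rewards)                                 # [1.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.25, 0.5, 1.0]
```

Note that moving to a worse solution (cost 9.5) is not penalized: the reward is zero, and the return at that step still credits the later improvement.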

Policy Gradient Neural Architecture
Our neural network, based on an encoder-decoder architecture, is depicted in Fig. 2. Two encoder units map each component of $\bar{S} = (S, S')$ independently. Each unit reads inputs $X = (x_1, \dots, x_n)$, where $x_i$ are the coordinates of node $s_i$ in $S$ and $S'$. The encoder then learns representations that embed both graph topology and node ordering. Given these representations, the policy decoder samples action indices $a_1, \dots, a_k$ sequentially, where $k = 2$ for 2-opt. The value decoder operates on the same encoder outputs but outputs real-valued estimates of state values. We detail the components of the network in the following sections.

Encoder
The purpose of our encoder is to obtain a representation for each node in the input graph given its topological structure and its position in a given solution. To accomplish this objective, we incorporate elements from GCN [22] and sequence embedding via recurrent neural networks (RNN) [13]. Furthermore, we use edge information to build a more informative encoding of the TSP graph.

Embedding Layer
We input two-dimensional coordinates $x_i \in [0, 1]^2$, $\forall i \in 1, \dots, n$, which are embedded to $d$-dimensional features as
$$x_i^0 = W_x x_i + b_x, \tag{1}$$
where $W_x \in \mathbb{R}^{d \times 2}$, $b_x \in \mathbb{R}^d$. We use as input the Euclidean distances $e_{i,j}$ between coordinates $x_i$ and $x_j$ to add edge information and weigh the node feature matrix. To avoid scaling the inputs to different magnitudes, we adopt symmetric normalization [22] as
$$\tilde{e}_{i,j} = \frac{e_{i,j}}{\sqrt{\sum_{j=1}^{n} e_{i,j} \sum_{i=1}^{n} e_{i,j}}}. \tag{2}$$
Then the normalized edges are used in combination with GCN layers to create richer node representations using each node's neighboring topology.
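The edge normalization can be sketched in a few lines. The sketch assumes the usual symmetric form from the GCN literature, in which each distance is divided by the square root of the product of its row sum and column sum; the helper names are illustrative.

```python
import math

def pairwise_distances(coords):
    """Euclidean distance matrix e[i][j] between all pairs of points."""
    return [[math.hypot(px - qx, py - qy) for (qx, qy) in coords]
            for (px, py) in coords]

def symmetric_normalize(e):
    """Assumed form: e[i][j] / sqrt(row_sum[i] * col_sum[j])."""
    n = len(e)
    row_sum = [sum(e[i]) for i in range(n)]
    col_sum = [sum(e[i][j] for i in range(n)) for j in range(n)]
    return [[e[i][j] / math.sqrt(row_sum[i] * col_sum[j]) for j in range(n)]
            for i in range(n)]

coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
e_norm = symmetric_normalize(pairwise_distances(coords))
# Diagonal entries stay zero and the matrix stays symmetric.
print(all(e_norm[i][i] == 0.0 for i in range(3)))
print(all(abs(e_norm[i][j] - e_norm[j][i]) < 1e-12 for i in range(3) for j in range(3)))
```

Because the distance matrix is symmetric, row and column sums coincide and the normalized matrix remains symmetric, so edge weights keep comparable magnitudes across instances of different sizes.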

Graph Convolutional Layers
Fig. 2 In the architecture, a state $\bar{S} = (S, S')$ is passed to a dual encoder where graph and sequence information are extracted. A policy decoder takes encoded inputs to query node indices and output actions. A value decoder takes encoded inputs and outputs state values. Figure as in [32]
In the GCN layers, we denote as $x_i^{\ell}$ the node feature vector at GCN layer $\ell$ associated with node $i$. We define the node feature at the subsequent layer combining features from nodes in the neighborhood $N(i)$ of node $i$ as
$$x_i^{\ell} = x_i^{\ell-1} + \sigma_r\Big(\sum_{j \in N(i)} \tilde{e}_{i,j}\,\big(W_g^{\ell} x_j^{\ell-1} + b_g^{\ell}\big)\Big), \tag{3}$$
where $W_g^{\ell} \in \mathbb{R}^{d \times d}$, $b_g^{\ell} \in \mathbb{R}^d$, $\sigma_r$ is the Rectified Linear Unit and $N(i)$ corresponds to the remaining $n - 1$ nodes of a complete TSP network. At the input to these layers, we have $\ell = 0$, and after $\mathcal{L}$ layers we arrive at representations $z_i = x_i^{\mathcal{L}}$, leveraging node features with the additional edge feature representation.
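A single graph convolution step can be sketched with plain lists. This is an illustration under stated assumptions: a residual update in which each node adds the ReLU of the edge-weighted sum of linearly transformed neighbor features. The helper names and the toy inputs are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gcn_layer(X, E, W, b):
    """Assumed update: x_i <- x_i + ReLU(sum_{j != i} E[i][j] * (W x_j + b))."""
    n, d = len(X), len(b)
    messages = [[m + bk for m, bk in zip(matvec(W, x), b)] for x in X]
    out = []
    for i in range(n):
        # Aggregate over the n - 1 other nodes of the complete graph.
        agg = [sum(E[i][j] * messages[j][k] for j in range(n) if j != i)
               for k in range(d)]
        out.append([xi + max(0.0, a) for xi, a in zip(X[i], agg)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                # toy node features
E = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]  # toy normalized edges
identity = [[1.0, 0.0], [0.0, 1.0]]
print(gcn_layer(X, E, identity, [0.0, 0.0]))
# [[1.5, 1.0], [1.0, 1.5], [1.5, 1.5]]
```

Stacking several such layers lets information from distant nodes reach each node representation, weighted by the normalized distances.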

Sequence Embedding Layers
Next, we use node embeddings $z_i$ to learn a sequence representation of the input and encode the ordering of nodes. Due to symmetry, a tour from nodes $(1, \dots, n)$ has the same cost as the tour $(n, \dots, 1)$. Therefore, we read the sequence in both orders to explicitly encode the symmetry of a solution and the order of the nodes. To accomplish this objective, we employ two Long Short-Term Memory (LSTM) networks as our RNN functions, computed using hidden vectors from the previous node in the tour and the current node embedding, resulting in
$$(h_i^{\rightarrow}, c_i^{\rightarrow}) = \mathrm{RNN}\big(z_i^{\rightarrow}, (h_{i-1}^{\rightarrow}, c_{i-1}^{\rightarrow})\big), \quad \forall i \in \{1, \dots, n\}, \tag{4}$$
$$(h_i^{\leftarrow}, c_i^{\leftarrow}) = \mathrm{RNN}\big(z_i^{\leftarrow}, (h_{i+1}^{\leftarrow}, c_{i+1}^{\leftarrow})\big), \quad \forall i \in \{n, \dots, 1\}, \tag{5}$$
where in (4) a forward RNN goes over the embedded nodes from left to right, in (5) a backward RNN goes over the nodes from right to left, and $h_i, c_i \in \mathbb{R}^d$ are hidden vectors. We point out that the RNN modules are included to impose order in the tour for the policy decoder. That is, the bi-LSTM imposes ordering for the 2-opt operation and aids node (edge swap) selection. With the bidirectional orderings, even if the same tour is observed in one of its circular permutations, the predecessor and successor information of each node is maintained, which helps edge selection, i.e., remove $(i-1, i)$, $(j, j+1)$ and add $(i-1, j)$, $(i, j+1)$. Note that a 2-opt move only requires the difference between the costs of the removed and inserted edges.
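Our representation reconnects back to the first node in the tour, ensuring we construct a sequential representation of the complete tour, i.e., $(h_0^{\rightarrow}, c_0^{\rightarrow}) = \mathrm{RNN}(z_n^{\rightarrow}, 0)$ and $(h_{n+1}^{\leftarrow}, c_{n+1}^{\leftarrow}) = \mathrm{RNN}(z_1^{\leftarrow}, 0)$. Afterwards, we combine forward and backward representations to form unique node representations in a tour as $o_i = \tanh\big((W_f h_i^{\rightarrow} + b_f) + (W_b h_i^{\leftarrow} + b_b)\big)$, where $W_f, W_b \in \mathbb{R}^{d \times d}$ and $b_f, b_b \in \mathbb{R}^d$.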

Dual Encoding
In our formulation, a state $\bar{S} = (S, S')$ is represented as a tuple of the current solution $S$ and the best solution seen so far $S'$. For that reason, we encode both $S$ and $S'$ using independent encoding layers (Fig. 2). We abuse notation and denote the sequential representation of $S'$ after going through the encoding layers with primed symbols (e.g., $h'_n$). Note that in the proposed MDP, it is necessary to know the cost of the best solution seen in the search to be able to compute the rewards. Thus, we consider that the agent has full information about the state space necessary to compute the cost improvement over the best seen solution.

Policy Decoder
We aim to learn the parameters of a stochastic policy $\pi_\theta(A \mid \bar{S})$ that, given a state $\bar{S}$, assigns high probabilities to moves that reduce the cost of a tour. Following [5], our architecture uses the chain rule to factorize the probability of a k-opt move as
$$\pi_\theta(A \mid \bar{S}) = \prod_{i=1}^{k} p_\theta(a_i \mid a_{<i}, \bar{S}), \tag{6}$$
and then uses individual softmax functions to represent each term on the RHS of (6), where $a_i$ corresponds to node positions in a tour, $a_{<i}$ represents previously sampled nodes and $k = 2$. At each output step $i$, we map the tour embedding vectors to the following query vector
$$q_i = \tanh\big((W_q q_{i-1} + b_q) + (W_o o_{i-1} + b_o)\big), \tag{7}$$
where $W_q, W_o \in \mathbb{R}^{d \times d}$ and $b_q, b_o \in \mathbb{R}^d$ are learnable parameters, $o_{i-1}$ is the representation of the previously sampled node, and a start input $o_0$ is used before any node has been sampled. Our initial query vector $q_0$ receives the tour representations from $S$ and $S'$ as $q_0 = W_s (h_n \,\|\, h'_n) + b_s$, where $\cdot \| \cdot$ represents the concatenation operation. Our query vectors $q_i$ interact with a set of $n$ vectors to define a pointing distribution over the action space. As soon as the first node is sampled, the query vector updates its inputs with the previously sampled node, using its sequential representation to select the subsequent nodes.

Pointing Mechanism
We use a pointing mechanism to predict a distribution over node outputs given encoded actions (nodes) and a state representation (query vector). Our pointing mechanism is parameterized by two learned attention matrices $K \in \mathbb{R}^{d \times d}$ and $Q \in \mathbb{R}^{d \times d}$ and a vector $v \in \mathbb{R}^d$ as
$$u_{i,j} = v^{\top} \tanh(K o_j + Q q_i), \quad j \in \{1, \dots, n\}, \tag{8}$$
where $p_\theta(a_i \mid a_{<i}, \bar{S}) = \mathrm{softmax}(C \tanh(u_i))$ predicts a distribution over $n$ actions, given a query vector $q_i$, with $u_i \in \mathbb{R}^n$. We mask the probabilities of nodes prior to the current $a_i$, as we only consider choices of nodes in which $a_i > a_{i-1}$ due to symmetry. This ensures a smaller action space for our model, i.e., $n(n-1)/2$ possible feasible permutations of the input. We clip logits in $[-C, +C]$ [5], where $C \in \mathbb{R}$ is a parameter to control the entropy of $u_i$.
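The masking and logit clipping can be illustrated on raw pointer scores. The sketch below assumes the scores have already been computed by the attention layer, applies the mask after clipping so that masked positions receive exactly zero probability, and uses `clip=10.0` purely as an illustrative value for $C$.

```python
import math
from itertools import combinations

def pointing_distribution(scores, prev_index=-1, clip=10.0):
    """Softmax over clip * tanh(score); positions <= prev_index get probability 0."""
    n = len(scores)
    if prev_index >= n - 1:
        raise ValueError("no position is left to select after prev_index")
    logits = [clip * math.tanh(s) if j > prev_index else -math.inf
              for j, s in enumerate(scores)]
    top = max(logits)                               # subtract the max for stability
    weights = [math.exp(l - top) for l in logits]   # exp(-inf) evaluates to 0.0
    total = sum(weights)
    return [w / total for w in weights]

# With equal scores, the mass is spread uniformly over the unmasked positions.
print(pointing_distribution([0.0, 0.0, 0.0, 0.0], prev_index=1))  # [0.0, 0.0, 0.5, 0.5]

# Ordered index pairs (a1 < a2) number n(n-1)/2.
n = 5
print(len(list(combinations(range(n), 2))), n * (n - 1) // 2)  # 10 10
```

Larger values of `clip` allow sharper distributions, while smaller values keep the distribution closer to uniform and encourage exploration.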

Value Decoder
Similar to the policy decoder, our value decoder works by reading tour representations from $S$ and $S'$ and a graph representation from $S$. That is, given embeddings $Z$, the value decoder works by reading the outputs $z_i$ for each node in the tour and the sequence hidden vectors $h_n, h'_n$ to estimate the value of a state as
$$V_\phi(\bar{S}) = W_r\, \sigma_r\Big(W_z \Big(\frac{1}{n} \sum_{i=1}^{n} z_i\Big) + h_v\Big) + b_r, \tag{9}$$
with $h_v = W_v (h_n \,\|\, h'_n) + b_v$, where $W_r, b_r$ and $W_v, b_v$ are learnable parameters that map the state representation to a real-valued output and the tours to a combined value representation, respectively. We use a mean pooling operation in (9) to combine node representations $z_i$ in a single graph representation. This vector is then combined with the tour representation $h_v$ to estimate current state values.

Policy Gradient Optimization
In our formulation, we maximize the expected rewards given a state $\bar{S}$, defined as $J(\theta \mid \bar{S}) = \mathbb{E}_{\pi_\theta}[G_t \mid \bar{S}]$. Thus, during training, we define the total objective over a distribution $\mathcal{S}$ of uniformly distributed TSP graphs (solutions) in $[0, 1]^2$ as $J(\theta) = \mathbb{E}_{\bar{S} \sim \mathcal{S}}[J(\theta \mid \bar{S})]$. To optimize our policy, we resort to the Policy Gradient learning rule, which provides an unbiased gradient estimate w.r.t. the model's parameters $\theta$. During training, we draw $B$ i.i.d. transitions and approximate the gradient of $J(\theta)$, indexed at $t = 0$, as
$$\nabla_\theta J(\theta) \approx \frac{1}{B} \frac{1}{T} \Big[\sum_{b=1}^{B} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(A_t^b \mid \bar{S}_t^b)\, \mathcal{A}_t^b\Big], \tag{10}$$
where the advantage function is defined as $\mathcal{A}_t^b = G_t^b - V_\phi(\bar{S}_t^b)$ and the superscript $b$ represents a transition sample from the mini-batch of size $B$, i.e., $b \in \{1, \dots, B\}$. To avoid premature convergence to a sub-optimal policy [30], we add an entropy bonus
$$H(\theta) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t=0}^{T-1} H\big(\pi_\theta(\cdot \mid \bar{S}_t^b)\big), \tag{11}$$
with $H(\pi_\theta(\cdot \mid \bar{S}_t^b)) = -\mathbb{E}_{\pi_\theta}[\log \pi_\theta(\cdot \mid \bar{S}_t^b)]$, and, similarly to (10), we normalize the values in (11), dividing by $k$, i.e., the number of indices to select ($k = 2$ for 2-opt). Moreover, we increase the length of an episode after a number of epochs, i.e., at epoch $e$, $T$ is replaced by $T_e$. The value network is trained on a mean squared error objective between its predictions and Monte Carlo estimates of the returns, formulated as an additional objective
$$\mathcal{L}_V(\phi) = \frac{1}{B} \frac{1}{T} \Big[\sum_{b=1}^{B} \sum_{t=0}^{T-1} \big\lVert G_t^b - V_\phi(\bar{S}_t^b) \big\rVert_2^2\Big]. \tag{12}$$
Afterward, we combine the previous objectives and perform gradient updates via Adaptive Moment Estimation (ADAM) [21], with $\beta_H$, $\beta_V$ representing the weights of (11) and (12), respectively. Our method is close to REINFORCE [39] with periodic episode length updates. In our case, this is beneficial, as at the start the agent learns how to behave over small episodes for easier credit assignment, later tweaking its policy over larger horizons. The complete algorithm is depicted in Algorithm 1.
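To make the combination of objectives concrete, the sketch below evaluates a single scalar loss for one trajectory from precomputed quantities. It is a hedged illustration, not the training code: the function names are hypothetical, the sign convention (minimizing the negative policy objective plus the weighted value loss minus the weighted entropy) is an assumption, and a real implementation would rely on automatic differentiation and treat the advantages as constants in the policy term.

```python
import math

def entropy(probs):
    """Entropy of a categorical distribution, -sum p log p."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def combined_loss(log_probs, returns, values, entropies, beta_v=0.5, beta_h=0.0045):
    """Scalar loss for one trajectory of length T (assumed sign convention)."""
    T = len(returns)
    advantages = [g - v for g, v in zip(returns, values)]
    policy_objective = sum(lp * a for lp, a in zip(log_probs, advantages)) / T
    value_loss = sum(a * a for a in advantages) / T  # mean squared error
    entropy_bonus = sum(entropies) / T
    loss = -policy_objective + beta_v * value_loss - beta_h * entropy_bonus
    return advantages, loss

advantages, loss = combined_loss(
    log_probs=[-1.0, -2.0], returns=[1.0, 0.5], values=[0.5, 0.5], entropies=[0.0, 0.0]
)
print(advantages)  # [0.5, 0.0]
print(loss)        # 0.3125
```

A positive advantage increases the log-probability of the sampled move, while the entropy term discourages the policy from collapsing onto a few moves too early.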

Experiments and Results
We conduct extensive experiments to investigate the performance of our proposed method. We consider three benchmark tasks, Euclidean TSP with 20, 50, and 100 nodes, denoted TSP20, TSP50, and TSP100, respectively. For all tasks, node coordinates are drawn uniformly at random in the unit square $[0, 1]^2$ during training. For validation, a fixed set of TSP instances with their respective optimal solutions is used for hyperparameter optimization. For a fair comparison, we use the same test dataset as reported in [18,24], containing 10,000 instances for each TSP size. Thus, previous results reported in [24] are comparable to ours in terms of solution quality (optimality gap). Results from [40] are not measured on the same data but use the same data generation process. Thus, we report the optimality gaps reported in the original paper. Moreover, we report the running times reported in [18,24,40]. Since time can vary due to implementations and hardware, we rerun the method of [24] on our hardware. Because it relies on supervised samples, the method of [18] is not ideal for combinatorial problems. Thus, we compare our results in more detail to [24] (running time and solution quality) and [40] (solution quality and sample efficiency).

Experimental Settings
All our experiments use a similar set of hyperparameters defined manually using the validation performance. We use a batch size $B = 512$ for TSP20 and TSP50 and $B = 256$ for TSP100 due to GPU memory. For this reason, we generate 10 random mini-batches for TSP20 and TSP50 and 20 mini-batches for TSP100 in each epoch. TSP20 trains for 200 epochs, as convergence is faster for smaller problems, whereas TSP50 and TSP100 train for 300 epochs. We use the same discount factor $\gamma = 0.99$, $\ell_2$ penalty of $1 \times 10^{-5}$ and learning rate of 0.001, decaying by 0.98 at each epoch. Loss weights are $\beta_V = 0.5$, $\beta_H = 0.0045$ for TSP20 and TSP50, and $\beta_H = 0.0018$ for TSP100. We train on an RTX 2080Ti GPU, generating random feasible initial solutions on the fly at each epoch. Each epoch takes an average time of 2 m 01 s, 3 m 05 s, and 7 m 16 s for TSP20, TSP50, and TSP100, respectively. We clip rewards to 1 to favor non-greedy actions and stabilize learning. Due to GPU memory, we employ mixed precision training [17] for TSP50 and TSP100. For comparison with [40], we train for a maximum step limit of 200. Note that our method is more sample efficient than the method proposed in [40], using 50% and 75% of the total samples for TSP20 and TSP50/100 during training. During testing, we run our policy for 500, 1000, and 2000 steps to compare to previous works. Our implementation is available online (footnote 1).
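The training budget implied by these settings can be tallied with simple arithmetic. The sketch below assumes that each mini-batch element is one randomly generated instance and that the learning rate decay is applied once per epoch starting from epoch 0; the dictionary layout and function names are illustrative.

```python
SETTINGS = {
    "TSP20": {"epochs": 200, "batches_per_epoch": 10, "batch_size": 512},
    "TSP50": {"epochs": 300, "batches_per_epoch": 10, "batch_size": 512},
    "TSP100": {"epochs": 300, "batches_per_epoch": 20, "batch_size": 256},
}

def training_instances(cfg):
    """Number of initial solutions generated over the whole training run."""
    return cfg["epochs"] * cfg["batches_per_epoch"] * cfg["batch_size"]

def learning_rate(epoch, base=1e-3, decay=0.98):
    """Exponentially decayed learning rate (assumes decay once per epoch)."""
    return base * decay ** epoch

for name, cfg in SETTINGS.items():
    print(name, training_instances(cfg))
# TSP20 1024000
# TSP50 1536000
# TSP100 1536000
```

Halving the batch size for TSP100 while doubling the number of mini-batches keeps the number of instances per epoch the same as for TSP50.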

Experimental Results and Analysis
We learn TSP20, TSP50, and TSP100 policies and depict the optimality gap and its exponential moving average in the log scale in Fig. 3. The optimality gap is averaged over 256 validation instances and 200 steps (same as training) in the figure. The results show that we can learn effective policies that decrease the optimality gap over the training epochs. We also point out that increasing the episode length improved validation performance, as we consider longer planning horizons in (10). Moreover, it is interesting to note that the optimality gap grows with the instance size, as solving larger TSP instances is harder. Additionally, we report the gaps of the best performing policies in Fig. 4. In the figure, we show the optimality gap of the best solution for 512 test instances over 2000 steps. Here, results show that we can quickly reduce the optimality gap initially, and that later steps attempt to fine-tune the best tour. In the experiments, we find the optimal solution for TSP20 instances and stay within optimality gaps of 0.1% for TSP50 and 0.7% for TSP100. Overall, our policies can be seen as a solver requiring only random initial solutions and sampling to achieve near-optimal solutions.
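For reference, the optimality gap and its exponential moving average can be computed as follows. The gap is taken as the relative excess over the optimal tour length in percent; the smoothing factor `alpha` is an illustrative assumption, since the value used for the figure is not stated.

```python
def optimality_gap(cost, optimal_cost):
    """Relative excess of a tour cost over the optimal cost, in percent."""
    return 100.0 * (cost - optimal_cost) / optimal_cost

def exponential_moving_average(values, alpha=0.1):
    """EMA with smoothing factor alpha, initialized at the first value."""
    smoothed, current = [], None
    for v in values:
        current = v if current is None else alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed

print(optimality_gap(6.0, 5.0))                             # 20.0
print(exponential_moving_average([4.0, 2.0], alpha=0.5))    # [4.0, 3.0]
```

Smoothing makes the downward trend over epochs easier to read, since the per-epoch gap depends on the randomly sampled initial solutions and actions.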
To showcase that, we compare the learned policies with classical 2-opt first improvement (FI) and best improvement (BI) heuristics, which select the first and best cost-reducing 2-opt operation, respectively. Since local search methods can get stuck in local optima, we include a version of the heuristics using restarts. We restart the search at a random solution as soon as we reach a local optimum. We run all heuristics and learned policies on 512 TSP100 instances for a maximum of 1000 steps starting from the same solutions. The boxplots in Fig. 5 depict the results. We observe that our policy (TSP100-Policy) outperforms classical 2-opt heuristics, finding tours with lower median and less dispersion. These results support our initial hypothesis that considering future rewards in the choice of 2-opt moves leads to better solutions. Moreover, our method avoids the worst case $O(n^2)$ complexity of selecting the next solution of FI and BI.
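The two classical baselines can be sketched as follows. This is a minimal illustration without restarts, assuming 0-indexed tours with the first node kept fixed; the function names and the tolerance `eps` are illustrative, and the scan over all index pairs is the $O(n^2)$ step mentioned above.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour."""
    return sum(math.dist(coords[tour[k]], coords[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))

def two_opt_delta(coords, tour, i, j):
    """Cost change of reversing tour[i..j]: added edges minus removed edges."""
    a, b = coords[tour[i - 1]], coords[tour[i]]
    c, d = coords[tour[j]], coords[tour[(j + 1) % len(tour)]]
    return (math.dist(a, c) + math.dist(b, d)) - (math.dist(a, b) + math.dist(c, d))

def local_search(coords, tour, best_improvement=True, eps=1e-12):
    """Apply cost-reducing 2-opt moves until a local optimum is reached."""
    tour = list(tour)
    n = len(tour)
    while True:
        best_delta, best_move = -eps, None
        for i in range(1, n - 1):          # position 0 stays fixed
            for j in range(i + 1, n):
                delta = two_opt_delta(coords, tour, i, j)
                if delta < best_delta:
                    best_delta, best_move = delta, (i, j)
                    if not best_improvement:
                        break              # first improvement: take it immediately
            if best_move is not None and not best_improvement:
                break
        if best_move is None:
            return tour                    # no improving move: local optimum
        i, j = best_move
        tour[i:j + 1] = reversed(tour[i:j + 1])

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
for use_best in (True, False):
    result = local_search(square, [0, 2, 1, 3], best_improvement=use_best)
    print(result, tour_length(square, result))  # [0, 1, 2, 3] 4.0
```

Both variants only ever accept improving moves, which is why they stop at the first local optimum, whereas a stochastic policy may pass through worse tours on the way to a better one.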

Comparison to Classical Heuristics, Exact and Learning Methods
We report results on the same 10,000 instances for each TSP size as in [24] and rerun the optimal results obtained by Concorde to derive optimality gaps. We compare against nearest, random and farthest insertion construction heuristics, and include the vehicle routing solver of OR-Tools [35], containing 2-opt and LKH as improvement heuristics.
We add to the comparison recent deep learning methods based on construction and improvement heuristics, including supervised [18,38] and reinforcement [5,7,20,24,40] learning methods. We note, however, that supervised learning is not ideal for combinatorial problems due to the lack of optimal labels for large problems. Works prior to [24] are presented with their reported running times and optimality gaps as in the original paper. For recent works, we present the optimality gaps and running times as reported in [18,24,40]. We report previous results using greedy, sampling and search decoding and refer to the methods by their neural network architecture. We note that the test dataset used in [40] is not the same, but the data generation process and size are identical. This fact, allied with the high number of samples, decreases the variance of the results. We focus our attention on GAT [24] and GAT-T [40] (GAT-Transformer), representing the best construction and improvement heuristic, respectively. Note that we do not include LKH for the TSP, as it achieves optimal results. Note also that, for the TSP, new works such as the ones in [23] appeared after the first version of this article and are not included in the results table.
Our results, in Table 1, show that with only 500 steps our method outperforms traditional construction heuristics, learning methods with greedy decoding and OR-Tools, achieving 0.01%, 0.36% and 1.84% optimality gap for TSP20, TSP50, and TSP100, respectively. Moreover, we outperform GAT-T while requiring half the number of steps (500 vs 1000). We note that with 500 steps, our method also outperforms all previous reinforcement learning methods using sampling or search, including GAT [7] applying 2-opt local search on top of generated tours. Our method only falls short of the supervised learning method GCN [18], using beam search and the shortest tour heuristic. However, GCN [18], similar to samples in GAT [24], uses a beam width of 1280, i.e., it samples more solutions. Increasing the number of samples (steps) increases the performance of our method. When sampling 1000 steps (280 samples short of GCN [18] and GAT [24]), we outperform all previous methods that do not employ further local search improvement and perform on par with GAT-T on TSP50, which uses 5000 samples (5 times as many samples). For TSP100, sampling 1000 steps results in a lower optimality gap (1.26%) than all compared methods. Lastly, increasing the sample size to 2000 results in even lower gaps, 0.00% (TSP20), 0.12% (TSP50) and 0.87% (TSP100).

Testing Learned Policies on Larger Instances
Since we are interested in learning general policies that can solve the TSP regardless of its size, we test the performance of our policies when learning on TSP50 instances (TSP50-Policy) and applying them to larger TSP100 instances. Results, in Table 2, show that we can extract general enough information to still perform well on 100 nodes. Similar to a TSP100-Policy, our TSP50-Policy can outperform previous reinforcement learning construction approaches and requires fewer samples. With 1000 samples, TSP50-Policy performs similar to GAT-T [40] using 3000 samples, at 1.86% optimality gap. These results are closer to optimal than previous learning methods without further local search improvement as in GCN [18]. When increasing to 2000 steps, we outperform all compared methods at 1.37% optimality gap.
Fig. 3 Optimality gaps on 256 validation instances for 200 steps over training epochs. From [32]
Fig. 4 Optimality gaps of best found tours on 512 testing instances over 2000 sampling steps. From [32]
Fig. 5 Tour costs of learned, FI and BI heuristics with restarts on TSP100 instances after 1000 steps. From [32]

Running Times and Sample Efficiency
Comparing running times is difficult due to varying hardware and implementations among different approaches. In Table 1, we report the running times to solve 10,000 instances as reported in [18,24,40] and ours. We focus on learning methods, as classical heuristics and solvers are efficiently implemented using multi-threaded CPUs. We note that our method cannot compete in speed with greedy methods, as we start from poor solutions and require sampling to find improved solutions. This is neither surprising nor discouraging, as one can see these methods as a way to generate initial solutions for an improvement heuristic like ours. We note, however, that while sampling 1000 steps, our method is faster than GAT-T [40] even though we use a less powerful GPU (RTX 2080Ti vs Tesla V100). Moreover, our method requires fewer samples to achieve superior performance. The comparison to GAT [24] is not so straightforward, as they use a GTX 1080Ti and a different number of samples. For this reason, we run GAT [24] using our hardware and report running times sampling the same number of solutions in Table 4. Our method is slower for TSP20 and TSP50 when sampling 2000 solutions. However, as we reach TSP100, our method can be computed faster and, overall, requires less time to produce shorter tours.

Ablation Study
In Table 3, we present an ablation study of the proposed method. We measure the performance at the beginning and towards the end of training, i.e., at epochs 10 and 200, rolling out policies for 1000 steps on 512 TSP50 instances over 10 trials. We point out that our main objective is to find good policies as early as possible. In other words, a good policy found earlier is considered better than one that needs more training time to obtain the same results. We observe that removing the LSTM (a) affects performance the most, leading to a large 134.42% gap at epoch 200. Removing the GCN component (b) has a lower influence but also reduces the overall quality of the policies, reaching a 0.30% optimality gap. We then test the effect of the bidirectional LSTM (c) by replacing it with a single LSTM. In this case, gaps are even higher, at 2.20%, suggesting that encoding the symmetry of the tours is important. We also compare to two variants of the proposed model, one that does not take the best solution as input (d) and one that shares the parameters of the encoding units (e). For these cases, we note that the final performance is similar to that of the proposed method, i.e., a 0.22% optimality gap. However, in our experiments, the proposed method achieves better policies faster, reaching a 3.0% gap at epoch 10, whereas (d) and (e) yield policies at the 4.55% and 5.15% level, respectively.

Generalization to Real-World TSP Instances
In Table 5, we study the performance of our method on TSPlib [36] instances. In general, these instances come from different node distributions than those seen during training, and it is unclear whether our learned policies can be reused for these cases. We compare the results of the policy trained on TSP100, sampling actions for 2000 steps, to results obtained from OR-Tools. Of the 35 instances tested, our method outperforms OR-Tools in 12. These results are encouraging, as OR-Tools is a very specialized heuristic solver. Comparing the optimality gaps of 8.61% (ours) and 3.70% (OR-Tools), we see that our learned policies are not too far from OR-Tools even though our method never trains on instances with more than 100 nodes. The difference in performance increases for large instances, indicating that fine-tuning or training policies for more nodes and different distributions can potentially reduce this difference. However, similar to the results in Table 2, our method can still achieve good results on instances with more than 100 nodes, such as ts225 (0.86% gap).

The Multiple Traveling Salesmen Problem
The multiple TSP (mTSP) [4] is an extension of the original TSP that includes a number of salesmen $m$ starting and ending their tours at a depot location. The goal is to construct tours for the $m$ salesmen such that the total cost of the tours is minimized. In our formulation, we include an extra depot node with index 0 and coordinates $x_0 \in \mathbb{R}^2$ in addition to the customer nodes $\{1, \ldots, n\}$. Since adding more salesmen without any imposed constraint would lead to the same solution as the TSP, we include two additional constraints in the problem formulation: (1) each salesman needs to be utilized in a feasible solution, and (2) in a given salesman tour, at least 2 nodes have to be visited, excluding the depot. The latter ensures that a tour cannot be formed by visiting just one node and returning to the depot, which would reduce the remaining problem to a TSP with $n - 1$ nodes. The remaining constraints are the usual TSP constraints.

Instance Generation
We follow the same instance generation procedure as for the TSP, i.e., we draw $n + 1$ nodes (including the depot) at random from a uniform distribution in the unit square.

Initial Solution Generation
We represent a solution $S$ to the mTSP as an ordered list of nodes, $S = (s_1, \ldots, s_p)$, where $s_i \in \{0, \ldots, n\}$. In our solution, each tour is represented by adding the depot index at the beginning and end of each tour without repetition. For example, a solution with two tours and $n = 5$ is represented as $S = (0, 1, 2, 0, 4, 3, 5, 0)$, where the first tour visits nodes 0, 1, 2 and 0, and the second tour visits nodes 0, 4, 3, 5 and 0. The size of a solution $p$ depends on $n$ (number of customers) and $m$ (number of salesmen) and is given by $p = n + m + 1$.
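The flat representation above can be illustrated with a short Python sketch. The helper names `split_tours` and `is_feasible` are hypothetical and not part of the original method; the sketch only decodes a solution into its tours and checks the two mTSP constraints stated earlier (every salesman is used and every tour visits at least 2 customers), assuming the depot has index 0.

```python
from typing import List

DEPOT = 0  # depot index, as in the formulation above


def split_tours(solution: List[int]) -> List[List[int]]:
    """Split a flat solution such as (0, 1, 2, 0, 4, 3, 5, 0) into tours.

    Each returned tour lists customer nodes only; the depot is implied
    at both ends of every tour.
    """
    tours, current = [], []
    for node in solution[1:]:      # skip the leading depot
        if node == DEPOT:
            tours.append(current)  # a depot closes the current tour
            current = []
        else:
            current.append(node)
    return tours


def is_feasible(solution: List[int], n: int, m: int, min_customers: int = 2) -> bool:
    """Check the solution length p = n + m + 1 and the mTSP constraints."""
    if len(solution) != n + m + 1:
        return False
    if solution[0] != DEPOT or solution[-1] != DEPOT:
        return False
    tours = split_tours(solution)
    customers = sorted(c for tour in tours for c in tour)
    return (
        len(tours) == m                              # every salesman is used
        and customers == list(range(1, n + 1))       # each customer visited once
        and all(len(t) >= min_customers for t in tours)
    )


example = [0, 1, 2, 0, 4, 3, 5, 0]
tours = split_tours(example)
ok = is_feasible(example, n=5, m=2)
```

For the example from the text, `tours` holds the two customer sequences of the two salesmen, and the length check corresponds to $p = 5 + 2 + 1 = 8$.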
We generate initial solutions by first sampling instances and then breaking the canonical ordering of nodes into $m$ tours. We start from a solution containing all the nodes, i.e., $S = (0, 1, \ldots, n)$, and find the depot positions of the tours by first computing the number of required splits, $\lfloor n/m \rfloor$. Then, for $m - 1$ depot positions (the last depot position is always at the end of the solution), we find the index of each depot and insert the depot at its corresponding index. Lastly, we add a depot to the end of the solution $S$. This procedure ensures that we have short and long tours in a given initial solution.

mTSP Neural Architecture
Encoder
We use the same encoding architecture for the mTSP as for the TSP; however, the embedding layer and the GCN layers operate only on the $n + 1$ node coordinates of the underlying instance graph, ensuring that we only encode the information about the instance. That is, here we abuse notation and define $x_i$ as the coordinates of node $i \in \{0, \ldots, n\}$. The RNN layers then take as input the graph-embedded node features and proceed to perform the solution encoding, where $z_i$ corresponds to the node features of node $s_i$, i.e., $z_i \in \{x_0, \ldots, x_n\}$ and $z_i = x_{s_i}$.

Tour Length Constraints and Masking
Without loss of generality, the first action selection masks all the depot positions and the last customer node at the end of the last tour. The second action then considers only customer node indices that are greater than the index $a_1$ and that, when selected, leave the shortest tour in the resulting solution with at least the minimum of 2 customer nodes. Let $c(S, a_1, j) = \min(c_1(S, a_1, j), \ldots, c_m(S, a_1, j))$ denote the number of customer nodes in the shortest tour in the resulting solution when applying the 2-opt operation defined by $(a_1, j)$ to a solution $S$. The masking is then applied to the logits $\tilde{u}^i_j = v^T \tanh(K o_j + Q q_i)$ according to whether $c(S, a_1, j) \geq 2$. To encode this masking, we keep track of an auxiliary indicator $b_i \in \{-1, 0, 1\}$, where $i \in \{1, \ldots, p\}$, representing whether a node is right before ($-1$), right after ($1$) or further away ($0$) from a depot when traversing the solution from left to right. Thus, checking whether $c(S, a_1, j) \geq 2$ can be achieved using the indicators $b_i$.
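A brute-force version of the minimum tour length check can be sketched in Python as follows. This is only an illustration under stated assumptions: positions are 0-based, a 2-opt move $(a_1, j)$ is assumed to reverse the segment of the solution between positions $a_1$ and $j$ (inclusive), and the function names are hypothetical. It recomputes $c(S, a_1, j)$ directly instead of using the indicator-based encoding described above.

```python
from typing import List


def two_opt(solution: List[int], i: int, j: int) -> List[int]:
    """Reverse the segment solution[i..j] (inclusive, 0-based positions)."""
    return solution[:i] + solution[i:j + 1][::-1] + solution[j + 1:]


def shortest_tour_size(solution: List[int]) -> int:
    """Number of customer nodes in the shortest tour of a flat solution."""
    sizes, count = [], 0
    for node in solution[1:]:   # skip the leading depot
        if node == 0:           # a depot closes the current tour
            sizes.append(count)
            count = 0
        else:
            count += 1
    return min(sizes)


def valid_second_indices(solution: List[int], a1: int, min_customers: int = 2) -> List[int]:
    """Positions j > a1 holding a customer node such that the 2-opt move
    (a1, j) keeps at least `min_customers` customers in every tour."""
    last = len(solution) - 1    # the final depot stays fixed
    return [
        j
        for j in range(a1 + 1, last)
        if solution[j] != 0
        and shortest_tour_size(two_opt(solution, a1, j)) >= min_customers
    ]


S = [0, 1, 2, 0, 4, 3, 5, 0]
allowed = valid_second_indices(S, a1=1)   # positions that would remain unmasked
```

In this example, choosing position 4 as the second index would produce a tour with a single customer, so that position is excluded, whereas positions 2, 5 and 6 remain selectable.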

Training and Experimental Parameters
We make a few modifications to the training parameters. Compared to the TSP, we reduce the size of the mini-batches to 64, 128 and 256 for mTSP20, 50, and 100, respectively. This modification allows for faster training when using a more complex masking operation and longer solutions. We train models on problem instances with two values of $m \in \{2, 4\}$. Similar to the TSP, we sample 10 mini-batches at each epoch and train mTSP20 for 200 epochs and mTSP50 for 300 epochs. To avoid high training times for mTSP100, we use the best learned policy on mTSP50 as a warm start for mTSP100 and train for 100 epochs. Our random initial solutions are far from optimality, with costs of 11.51, 26.98, 52.78 for $m = 2$ and 12.46, 27.94, 53.80 for $m = 4$ over the increasing instance sizes. Each epoch takes on average 2 min, 6 min, and 10 min for mTSP20, 50, and 100, respectively. We run two sets of experiments, one containing 1000 instances to mitigate the high running times of our baselines and one with 10,000 instances to be comparable with the TSP experiments. The remaining parameters of the model remain the same as for the TSP.

Experimental Results and Analysis
We apply the learned policies, sampling 2000 solutions, on each of the sets of 1000 and 10,000 instances to assess the performance of our method. We compare the performance to an Integer Linear Programming (ILP) formulation of the problem, running the Gurobi solver [9] for a maximum of 30 s per instance. We also include the highly effective LKH3 [12] heuristic as a baseline, as it balances solution quality and speed and is the state-of-the-art algorithm for several routing problems. We implement both baselines in a serialized manner. This is comparable to our results because, even though we sample actions in batches taking advantage of the batch parallelization of GPUs, we perform the 2-opt actions in series.

Comparison to Exact and Heuristics Baselines
The results for the set of 1000 instances are presented in Table 6. We observe that the learned policies are close to the performances of both Gurobi and LKH3 when solving instances with 20 nodes, with optimality gaps of 0.02% and 0.08%, respectively. Similar to the TSP, the gap increases as we increase the size of the instances. Moreover, as the instance size increases, the performance of Gurobi running for just 30 s decreases considerably, taking significantly longer (8 h) and yielding results far from LKH3. On the other hand, our learned policies remain much closer (1.69% for 2TSP100, 1.91% for 4TSP100) to the best results found by LKH3 whilst requiring less time.
We also present the results on 10,000 instances, as these should provide better estimates of the performance of our policies. These results are shown in Table 7. Since Gurobi does not scale, we only provide its results for mTSP20. The results are similar to those obtained on 1000 instances, with our model finding costs close to those found by LKH3 whilst requiring less running time than the heuristic.

The Capacitated Vehicle Routing Problem
In the Capacitated Vehicle Routing Problem (CVRP) [37], each customer node has an associated demand, and multiple routes should be constructed starting and ending at a depot. The CVRP is a generalization of the mTSP. It considers that each vehicle (salesman) has a given capacity and that tours have to be formed such that the combined demand of all customers in a tour does not exceed the capacity of the vehicle.
Similar to the mTSP, we add an extra depot node with index 0 and coordinates $x_0 \in \mathbb{R}^2$ and consider the remaining nodes as customer nodes. We adopt the same formulation as in [24,31] and define a capacity $D$ for a single vehicle traversing all the routes. We associate each customer node $i \in \{1, \ldots, n\}$ with a demand $0 \leq \delta_i \leq D$. Each route should start and end at the depot and should not exceed the total capacity of the vehicle. Similar to [24], we assume a normalized capacity $D = 1$ and use normalized demands $\tilde{\delta}_i = \delta_i / D$; this allows us to learn general policies that can be used with different capacities.

Instance Generation
For comparison, we follow [24,31] and generate node coordinates sampled uniformly at random in the unit square. The unnormalized demands $\delta_i$, where $i \in \{1, \ldots, n\}$, are sampled following a discrete uniform distribution from $\{1, \ldots, 9\}$, and the demand of the depot is $\delta_0 = 0$. Each problem size $n$ defines a different capacity $D$, with $D = 30, 40, 50$ for $n = 20, 50, 100$, respectively, and these capacities remain fixed for all instances.
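The instance generation procedure can be sketched with the Python standard library as follows. The function name `generate_cvrp_instance`, the `CAPACITY` lookup table and the seeding are illustrative choices, not part of the original implementation; demands are normalized by the capacity as in the formulation above.

```python
import random
from typing import List, Optional, Tuple

# Capacities for each problem size n, as stated in the text.
CAPACITY = {20: 30, 50: 40, 100: 50}


def generate_cvrp_instance(
    n: int, seed: Optional[int] = None
) -> Tuple[List[Tuple[float, float]], List[int], List[float]]:
    """Sample one CVRP instance with n customers plus the depot (index 0)."""
    rng = random.Random(seed)
    # n + 1 points drawn uniformly at random in the unit square.
    coords = [(rng.random(), rng.random()) for _ in range(n + 1)]
    # Depot demand is 0; customer demands are uniform over {1, ..., 9}.
    demands = [0] + [rng.randint(1, 9) for _ in range(n)]
    capacity = CAPACITY[n]
    # Normalized demands, so that the vehicle capacity becomes 1.
    normalized = [d / capacity for d in demands]
    return coords, demands, normalized


coords, demands, normalized = generate_cvrp_instance(20, seed=0)
```

Working with normalized demands means that the same capacity check (total demand of a tour at most 1) applies to every problem size.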

Initial Solution Generation
Similar to the mTSP, we represent a solution $S$ to the CVRP as an ordered list of nodes, $S = (s_1, \ldots, s_p)$, where $s_i \in \{0, \ldots, n\}$. A tour is represented by adding the depot at the start and end of each tour. However, unlike the mTSP, where the number of salesmen is fixed, in the CVRP a solution can have different lengths depending on the number of tours traversed. To allow for batching solutions, we compute the maximum length of a solution $p$. We define the maximum demand $\delta_{\max} = \max(\delta_1, \ldots, \delta_n)$, the maximum number of customers in a tour $\lfloor D / \delta_{\max} \rfloor$, the maximum number of possible tours $m_{\max} = \lceil n / \lfloor D / \delta_{\max} \rfloor \rceil$, and finally, the length of the solution is given by $p = n + m_{\max} + 1$. With our parameters, $p$ corresponds to 28, 64 and 121 for $n = 20, 50, 100$.
We generate initial solutions by first sampling the node coordinates and demands. We define an initial solution traversing nodes in the sampled order, i.e., we start with a solution $S = (0, 1, \ldots, n)$. We accumulate the sum of demands whilst traversing the nodes and construct a tour when the accumulated demand exceeds the capacity $D$. At this point, we add a depot to the solution and start a new tour with the last visited node $i$. We repeat this procedure until we visit all customer nodes. Since not all solutions have the same length, we pad the solutions with depot nodes at the end. This allows us to batch solutions respecting their maximum size $p$ and lets the algorithm add new depot locations to a solution if deemed necessary. For instance, a CVRP solution of the form $S = (0, 1, 2, 0, 3, 6, 5, 4, 0, \ldots, 0)$ represents two tours, one traversing nodes 0, 1, 2, 0 and the other traversing nodes 0, 3, 6, 5, 4, 0. The remaining depots are padded to complete the solution.
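The construction of padded initial solutions can be sketched as follows. Two assumptions are made here and should be read as such: the maximum length is taken to be $p = n + \lceil n / \lfloor D / \delta_{\max} \rfloor \rceil + 1$, which reproduces the reported values 28, 64 and 121, and a tour is closed as soon as adding the next node would exceed the capacity. The function names are hypothetical.

```python
import math
from typing import List


def max_solution_length(n: int, capacity: int, max_demand: int) -> int:
    """Assumed padding length p = n + m_max + 1."""
    per_tour = capacity // max_demand   # customers that always fit in one tour
    m_max = math.ceil(n / per_tour)     # upper bound on the number of tours
    return n + m_max + 1


def initial_solution(demands: List[int], capacity: int, p: int) -> List[int]:
    """Traverse customers 1..n in order, opening a new tour whenever the
    accumulated demand would exceed the capacity, then pad with depots."""
    n = len(demands) - 1
    solution, load = [0], 0
    for i in range(1, n + 1):
        if load + demands[i] > capacity:
            solution.append(0)          # close the tour at the depot
            load = 0                    # node i starts the next tour
        solution.append(i)
        load += demands[i]
    solution.append(0)                  # close the last tour
    solution += [0] * (p - len(solution))   # pad with depot nodes
    return solution


p20 = max_solution_length(20, 30, 9)
demo = initial_solution([0, 9, 9, 9, 9, 1], capacity=30, p=10)
```

In the small demonstration, the first three customers fill a tour with load 27, the fourth would exceed the capacity of 30 and therefore starts a second tour, and the two trailing zeros are padding.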

CVRP Neural Architecture
Embedding layer
To allow our model to use both node coordinates and demands of the nodes, we provide the normalized demands $\tilde{\delta}_i$ of each node to the embedding layer, where each $x_i$ is the coordinate of node $i \in \{0, \ldots, n\}$, and adjust the dimension of the parameter $W_x$ accordingly. The embedding layer then produces the node features from these inputs.

GCN layers
We compute the Euclidean distances using the node coordinates $x_i$ as in the TSP case and use the normalized edges $\tilde{e}_{i,j}$ to compute the graph node features, similar to the mTSP case, by applying GCN layers following Eq. (14).

RNN Layers
We adjust the dimensions and follow the same architecture as for the mTSP, i.e., Eqs. (15) and (16), in which the node features $x_i$, $i \in \{0, \ldots, n\}$, are used to compose the nodes in a solution, where $S = (s_1, \ldots, s_p)$, $s_i \in \{0, \ldots, n\}$ and $z_i = x_{s_i}$.

Capacity Constraints and Masking
To allow for only feasible solutions, we need to ensure that a 2-opt action will not create tours that violate the capacity constraints. Thus, before the action selection starts, we create a feasibility matrix $P \in \{0, 1\}^{p \times p}$, go through all possible $p(p - 1)/2$ node exchanges, and check whether each one forms a feasible solution in which the maximum demand across all tours does not exceed the capacity $D$. The masking for the first element of the action, $a_1$, and for the second element, $a_2$, is then based on $P$.
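A direct, unoptimized way to build such a feasibility matrix can be sketched in Python as follows. The sketch assumes that a 2-opt move $(i, j)$ reverses the segment between positions $i$ and $j$ (0-based, inclusive) and keeps the first and last positions fixed so that every candidate starts and ends at the depot; both are assumptions of this illustration, as are the function names.

```python
from typing import List, Sequence


def tour_loads(solution: Sequence[int], demands: Sequence[float]) -> List[float]:
    """Total demand of each tour in a flat, depot-delimited solution."""
    loads, load = [], 0.0
    for node in solution[1:]:      # skip the leading depot
        if node == 0:              # a depot closes the current tour
            loads.append(load)
            load = 0.0
        else:
            load += demands[node]
    return loads


def feasibility_matrix(
    solution: List[int], demands: Sequence[float], capacity: float = 1.0
) -> List[List[int]]:
    """P[i][j] = 1 if reversing solution[i..j] keeps every tour within capacity."""
    p = len(solution)
    P = [[0] * p for _ in range(p)]
    for i in range(1, p - 1):
        for j in range(i + 1, p - 1):
            candidate = solution[:i] + solution[i:j + 1][::-1] + solution[j + 1:]
            # Small tolerance guards against rounding with normalized demands.
            if max(tour_loads(candidate, demands)) <= capacity + 1e-9:
                P[i][j] = 1
    return P


S = [0, 1, 2, 0, 3, 4, 0, 0]       # two tours plus one padded depot
demands = [0, 5, 5, 6, 4]          # unnormalized demands, capacity 10
P = feasibility_matrix(S, demands, capacity=10)
```

In this example the move $(5, 6)$ swaps customer 4 with the padded depot, splitting the second tour in two, which illustrates how padding lets a move introduce a new depot location. Each entry requires re-evaluating the tour loads, which reflects the growth in feasibility checks with the solution length.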

Training and Experimental Parameters
We train on CVRP20 and CVRP50 instances with mini-batch sizes of 64 and 128, respectively. We do not train our policies on CVRP100 due to high training times on our hardware, but we report the performance of the policy trained on CVRP50 instances when tested on CVRP100. For the same reason, we warm-start CVRP50 with a policy trained on CVRP20 and train for an additional 200 epochs. Our initial solutions have average costs of 12.53, 29.79, 58.19 for $n = 20, 50, 100$. Each epoch takes 1 m 83 s and 7 m 30 s for instances with 20 and 50 nodes, respectively. The remaining training parameters are identical to those of the TSP.

Experimental Results and Analysis
We compare our results to other end-to-end deep learning methods and CVRP heuristics. We run our policies for 500, 1000 and 2000 steps on the same 10,000 instances as in [24]. This allows us to compare both optimality gaps and costs. We include the LKH3 baseline from the previous paper and rerun both the deep learning model and the baselines to compare running times. We also compare to the improvement method GAT-T [40] and report the objective gaps and times reported in their original paper, since no pretrained model is available. We note that whilst learning the CVRP, GAT-T starts from a nearest neighbor heuristic, with much lower costs than our initial solutions. This allows the model to experience a higher number of solutions that are closer to optimality, where the action selection is harder. We do not employ such a strategy and always start learning from randomized solutions. We also include in the comparison the improvement method L2I; however, the reported results are only averaged over 2000 instances and cannot be compared to the remaining methods. We also include in the comparison the results obtained with NLNS. Lastly, we compare to the recent DPDP, reporting results for the VRP with 100 nodes with beam sizes of 10K (10 thousand), 100K (100 thousand) and 1M (one million).

Comparison to Heuristics and Learned Baselines
We present the comparison to previously proposed methods in Table 8. Our method outperforms other reported deep reinforcement learning baselines for CVRP20. The best results are found after sampling 2000 solutions, resulting in a 0.37% gap to LKH3. Note that our policy performs better than GAT-T, even when GAT-T samples 5000 solutions. For CVRP50, our learned policy matches GAT (greedy) after sampling 500 solutions. However, GAT-T can achieve lower optimality gaps when sampling more solutions than both our proposed method and GAT. We report CVRP100 results for completeness, although we do not train on instances with 100 customer nodes. As expected, our evaluated policies are farther from the LKH3 baseline when compared to the other learned methods that train on CVRP100 instances, including DPDP 1M. However, the results show that the learned policies can generalize to instances of different sizes. An important aspect of our results in comparison to a constructive method is that we are required to check feasibility each time a solution is generated. This leads to high running times due to the polynomial growth in the number of feasibility checks as we increase the size of the instances. This issue can be alleviated by running multiple instance mini-batches in parallel, but this is not implemented in our evaluations.

Limitations and Future Work
A limitation of the proposed approach is the large sample complexity common to policy gradient methods. This causes training to be slow and requires many iterations to achieve performance levels comparable to classical heuristics. Another important limitation of our model and of other improvement heuristics is the increasing size of the state space when solving real-world problems and the increased running times when performing the feasibility checks necessary to maintain feasible solutions. The latter can slow down training and increase evaluation times considerably when the size of instances increases.
Expanding the proposed neural architecture to sample k-opt operations is an interesting topic for future work. Moreover, we aim to explore methods that can achieve better sample complexity and can accommodate more complex problems with different types of constraints without incurring the increased running times of feasibility checking. Lastly, we point out that future work on learning methods can be particularly interesting when solving problems where standard Operations Research solvers are less suitable, for example, when problems involve many stochastic elements.

Conclusion
In this work, we introduced a deep reinforcement learning approach for approximating a 2-opt improvement heuristic for three routing problems based on the TSP, namely the TSP, the multiple TSP, and the CVRP. We proposed a neural architecture with graph and sequence embedding capable of outperforming learned construction and improvement heuristics while requiring fewer samples for the TSP. Our learned heuristics also outperformed classical 2-opt and achieved similar performance to state-of-the-art classical heuristics as well as exact solvers in all problems studied.

Table 1
Performance of TSP methods w.r.t. Concorde. Type: SL supervised learning, RL reinforcement learning, S sampling, G greedy, B beam search, BS B and shortest tour, T 2-opt local search.

Table 3
Ablation studies on 512 TSP50 instances running policies for 1000 steps

Table 4
Performance of GAT [24] vs our method. Results are compared on the same hardware sampling the same number of solutions. From [32]

Table 5
Performance of OR-Tools vs our method on TSPlib instances

Table 8
CVRP results on 10,000 instances reported in [24]. * Costs are estimated from the reported gaps, and times are presented as reported in [15,40]. ** Reported costs are averaged only over 2000 instances and are not directly comparable. † Trained on CVRP50