Abstract
Endtoend training of neural network solvers for graph combinatorial optimization problems such as the Travelling Salesperson Problem (TSP) have seen a surge of interest recently, but remain intractable and inefficient beyond graphs with few hundreds of nodes. While stateoftheart learningdriven approaches for TSP perform closely to classical solvers when trained on trivially small sizes, they are unable to generalize the learnt policy to larger instances at practical scales. This work presents an endtoend neural combinatorial optimization pipeline that unifies several recent papers in order to identify the inductive biases, model architectures and learning algorithms that promote generalization to instances larger than those seen in training. Our controlled experiments provide the first principled investigation into such zeroshot generalization, revealing that extrapolating beyond training data requires rethinking the neural combinatorial optimization pipeline, from network layers and learning paradigms to evaluation protocols. Additionally, we analyze recent advances in deep learning for routing problems through the lens of our pipeline and provide new directions to stimulate future research.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
NPhard combinatorial optimization problems are the family of integer constrained optimization problems which are intractable to solve optimally at large scales. Robust approximation algorithms to popular problems have immense practical applications and are the backbone of modern industries. Among combinatorial problems, the 2D Euclidean Travelling Salesperson Problem (TSP) has been the most intensely studied NPhard graph problem in the Operations Research (OR) community, with applications in logistics, genetics and scheduling [1]. TSP is intractable to solve optimally above thousands of nodes for modern computers [2]. In practice, the Concorde TSP solver [3] uses linear programming with carefully handcrafted heuristics to find solutions up to tens of thousands of nodes, but with prohibitive execution times.^{Footnote 1} Besides, the development of problemspecific OR solvers such as Concorde for novel or understudied problems arising in scientific discovery [4] or computer architecture [5] requires significant time and specialized knowledge.
An alternate approach by the Machine Learning community is to develop generic learning algorithms which can be trained to solve any combinatorial problem directly from problem instances themselves [6,7,8]. Using classical problems such as TSP, Minimum Vertex Cover and Boolean Satisfiability as benchmarks, recent endtoend approaches [9,10,11] leverage advances in graph representation learning [12,13,14,15] and have shown competitive performance with OR solvers on trivially small problem instances up to few hundreds of nodes. Once trained, approximate solvers based on Graph Neural Networks (GNNs) have significantly favorable time complexity than their OR counterparts, making them highly desirable for realtime decisionmaking problems such as TSP and the associated class of Vehicle Routing Problems (VRPs).
1.1 Motivation
Scaling endtoend approaches to practical and realworld instances is still an open question [8] as the training phase of stateoftheart models on large graphs is extremely timeconsuming. For graphs larger than few hundreds of nodes, the gap between GNNbased solvers and simple nonlearnt heuristics is especially evident for routing problems like TSP [16, 17].
As an illustration, Fig. 1 presents the computational challenge of learning TSP on 200node graphs (TSP200) in terms of both sample efficiency and wall clock time. Surprisingly, it is difficult to outperform a simple insertion heuristic when directly training on 12.8 Million TSP200 samples for 500 hours on universityscale hardware.
We advocate for an alternative to expensive largescale training: learning efficiently from trivially small TSP and transferring the learnt policy to larger graphs in a zeroshot fashion or via fast finetuning. Thus, identifying promising inductive biases, architectures and learning paradigms that enable such zeroshot generalization to large and more complex instances is a key concern for training practical solvers for realworld problems.
1.2 Contributions
Towards endtoend learning of scaleinvariant TSP solvers, we unify several stateoftheart architectures and learning paradigms [16,17,18,19] into one experimental pipeline and provide the first principled investigation on zeroshot generalization to large instances. Our findings suggest that learning scaleinvariant TSP solvers requires rethinking the status quo of neural combinatorial optimization to explicitly account for generalization:

The prevalent evaluation paradigm overshadows models’ poor generalization capabilities by measuring performance on fixed or trivially small TSP sizes.

Generalization performance of GNN aggregation functions and normalization schemes benefits from explicit redesigns which account for shifting graph distributions, and can be further boosted by enforcing regularities such as constant graph diameters when defining problems using graphs.

Autoregressive decoding enforces a sequential inductive bias which improves generalization over nonautoregressive models, but is costlier in terms of inference time.

Models trained with expert supervision are more amenable to posthoc search, while reinforcement learning approaches scale better with more computation as they do not rely on labelled data.
Our framework and datasets are available online.^{Footnote 2} Additionally, we use our pipeline to characterize the recent stateoftheart in deep learning for routing problems and provide new directions for future research.
2 Related work
Neural combinatorial optimization
In a recent survey, Bengio et al. [8] identified three broad approaches to leveraging machine learning for combinatorial optimization problems: learning alongside optimization algorithms [20,21,22], learning to configure optimization algorithms [23, 24], and endtoend learning to approximately solve optimization problems, a.k.a. neural combinatorial optimization [6, 7].
Stateoftheart endtoend approaches for TSP use Graph Neural Networks (GNNs) [12,13,14,15] and sequencetosequence learning [25] to construct approximate solutions directly from problem instances. Architectures for TSP can be classified as: (1) autoregressive approaches, which build solutions in a stepbystep fashion [9, 16, 19, 26,27,28]; and (2) nonautoregressive models, which produce the solution in one shot [17, 18, 29,30,31]. Models can be trained to imitate optimal solvers via supervised learning or by minimizing the length of TSP tours via reinforcement learning [32].
Other classical problems tackled by similar architectures include Vehicle Routing [33, 34], Maximum Cut [9], Minimum Vertex Cover [11], Boolean Satisfiability [10, 35], and Graph Coloring [36]. Using TSP as an illustration, we present a unified pipeline for characterizing neural combinatorial optimization architectures in Section 3.
Notably, TSP has emerged as a challenging testbed for neural combinatorial optimization. Whereas generalization to problem instances larger and more complex than those seen in training has at least partially been demonstrated on nonsequential problems such as SAT, MaxCut, and MVC [10, 11], the same architectures do not show strong generalization for TSP [16, 17].
Combinatorial optimization and GNNs
From the perspective of graph representation learning, algorithmic and combinatorial problems have recently been used to characterize the expressive power of GNNs [37, 38]. An emerging line of work on learning to execute graph algorithms [39, 40] has lead to the development of provably more expressive GNNs [41] and improved understanding of their generalization capability [42, 43]. Towards tackling realistic and largescale combinatorial problems, this paper aims to quantify the limitations of prevalent GNN architectures and learning paradigms via zeroshot generalization to problems larger than those seen during training.
Novel applications
Advances on classical combinatorial problems have shown promising results in downstream applications to novel or understudied optimization problems in the physical sciences [4, 44] and computer architecture [5, 45, 46], where the development of exact solvers is expensive and intractable. For example, autoregressive architectures provide a strong inductive bias for device placement optimization problems [47, 48], while nonautoregressive models [49] are competitive with autoregressive approaches [50, 51] for molecule generation tasks.
3 Neural combinatorial optimization pipeline
NPhard problems can be formulated as sequential decision making tasks on graphs due to their highly structured nature. Towards a controlled study of neural combinatorial optimization for TSP, we unify recent ideas [16,17,18,19] via a five stage endtoend pipeline illustrated in Fig. 2. Our discussion focuses on TSP, but the pipeline presented is generic and can be extended to characterize modern architectures for several NPhard graph problems.
3.1 Problem definition
The 2D Euclidean TSP is defined as follows: “Given a set of cities and the distances between each pair of cities, what is the shortest possible route that visits each city and returns to the origin city?” Formally, given a fullyconnected input graph of n cities (nodes) in the two dimensional unit square \(S = \{x_{i}\}_{i=1}^{n}\) where each \(x_{i} \in {[0,1]}^{2}\), we aim to find a permutation of the nodes π, termed a tour, that visits each node once and has the minimum total length, defined as:
where ∥⋅∥_{2} denotes the ℓ_{2} norm.
Graph sparsification
Classically, TSP is defined on fullyconnected graphs, see Fig. 2(b). Graph sparsification heuristics based on knearest neighbors aim to reduce TSP graphs, enabling models to scale up to large instances where pairwise computation for all nodes is intractable [9] or learn faster by reducing the search space [17]. Notably, problemspecific graph reduction techniques have proven effective for outofdistribution generalization to larger graphs for other NPhard problems such as MVC and SAT [11].
Fixed size vs. variable size graphs
Most work on learning for TSP has focused on training with a fixed graph size [16, 17], likely due to ease of implementation. Learning from multiple graph sizes naturally enables better generalization within training size ranges, but its impact on generalization to larger TSP instances remains to be analyzed.
3.2 Graph embedding
A Graph Neural Network (GNN) encoder computes ddimensional representations for each node in the input TSP graph, see Fig. 2(c). At each layer, nodes gather features from their neighbors to represent local graph structure via recursive message passing [13]. Stacking L layers allows the network to build representations from the Lhop neighborhood of each node. Let \(h_{i}^{\ell }\) and \(e_{ij}^{\ell }\) denote respectively the node and edge feature at layer ℓ associated with node i and edge ij. We define the feature at the next layer via an anisotropic message passing scheme using an edge gating mechanism [52]:
where \(U^{\ell }, V^{\ell }, A^{\ell }, B^{\ell }, C^{\ell } \in \mathbb {R}^{d \times d}\) are learnable parameters, Norm denotes the normalization layer (BatchNorm [53], LayerNorm [54]), Aggr represents the neighborhood aggregation function (Sum, Mean or Max), σ is the sigmoid function, and ⊙ is the Hadamard product. As inputs \(h_{i}^{\ell =0}\) and \(e_{ij}^{\ell =0}\), we use ddimensional linear projections of the node coordinate x_{i} and the euclidean distance ∥x_{i} − x_{j}∥_{2}, respectively.
Anisotropic aggregation
We make the aggregation function anisotropic or directional via a dense attention mechanism which scales the neighborhood features \(h_{j}, \forall j \in \mathcal {N}_{i},\) using edge gates σ(e_{ij}). Anisotropic and attentionbased GNNs such as Graph Attention Networks [14], Transformers [55, 56], and Gated Graph ConvNets [52] have been shown to outperform isotropic Graph ConvNets [12] across several challenging domains [57], including TSP [16, 17].
3.3 Solution decoding
Nonautoregressive decoding (NAR)
Consider TSP as a link prediction task: each edge may belong/not belong to the optimal TSP solution independent of one another [18]. We define the edge predictor as a two layer MLP on the node embeddings produced by the final GNN encoder layer L, following Joshi et al. [17], see Fig. 2(d). For adjacent nodes i and j, we compute the unnormalized edge logits:
\(W_{1} \in \mathbb {R}^{3d \times d}, W_{2} \in \mathbb {R}^{d \times 2}\), and [⋅,⋅,⋅] is the concatenation operator. The logits \(\hat {p}_{ij}\) are converted to probabilities over each edge p_{ij} via a softmax.
Autoregressive decoding (AR)
Although NAR decoders are fast as they produce predictions in one shot, they ignore the sequential ordering of TSP tours. Autoregressive decoders, based on attention [16, 19] or recurrent neural networks [6, 26], explicitly model this sequential inductive bias through stepbystep graph traversal. We follow the attention decoder from Kool et al. [16], which starts from a random node and outputs a probability distribution over its neighbors at each step. Greedy search is used to perform the traversal over n time steps and masking enforces constraints such as not visiting previously visited nodes.
At time step t at node i, the decoder builds a context \(\hat {h}_{i}^{C}\) for the partial tour \(\pi ^{\prime }_{t^{\prime }}\), generated at time \(t^{\prime } < t\), by packing together the graph embedding h_{G} and the embeddings of the first and last node in the partial tour: \(\hat {h}_{i}^{C} = W_{C} \left [ h_{G}, h^{L}_{\pi ^{\prime }_{t1}}, h^{L}_{\pi ^{\prime }_{1}} \right ],\) where \(W_{C} \in \mathbb {R}^{3d \times d}\) and learned placeholders are used for \(h^{L}_{\pi ^{\prime }_{t1}}\) and \(h^{L}_{\pi ^{\prime }_{1}}\) at t = 1. The context \(\hat {h}_{i}^{C}\) is then refined via a standard MultiHead Attention (MHA) operation [55] over the node embeddings:
where Q,K,V are inputs to the Mheaded MHA (M = 8). The unnormalized logits for each edge e_{ij} are computed via a final attention mechanism between the context \({h_{i}^{C}}\) and the embedding h_{j}:
The \(\tanh\) is used to maintain the value of the logits within [−C,C] (C = 10) [7]. The logits \(\hat {p}_{ij}\) at the current node i are converted to probabilities p_{ij} via a softmax over all edges.
Inductive biases
NAR approaches, which make predictions over edges independently of oneanother, have shown strong outofdistribution generalization for nonsequential problems such as SAT and MVC [11]. On the other hand, AR decoders come with the sequential/tour constraint builtin and are the default choice for routing problems [16]. Although both approaches have shown close to optimal performance on fixed and small TSP sizes under different experimental settings, it is important to fairly compare which inductive biases are most useful for generalization.
3.4 Solution search
Greedy search
For AR decoding, the predicted probabilities at node i are used to select the edge to travel along at the current step via sampling from the probability distribution p_{i} or greedily selecting the most probable edge p_{ij}, i.e. greedy search. Since NAR decoders directly output probabilities over all edges independent of oneanother, we can obtain valid TSP tours using greedy search to traverse the graph starting from a random node and masking previously visited nodes. Thus, the probability of a partial tour \(\pi ^{\prime }\) can be formulated as \(p(\pi ^{\prime }) = {\prod }_{j^{\prime } \sim i^{\prime } \in \pi ^{\prime }} p_{i^{\prime }j^{\prime }}\), where each node \(j^{\prime }\) follows node \(i^{\prime }\).
Beam search and sampling
During inference, we can increase the capacity of greedy search via limited width breadthfirst beam search, which maintains the b most probable tours during decoding. Similarly, we can sample b solutions from the learnt policy and select the shortest tour among them. Naturally, searching longer, with more sophisticated techniques, or sampling more solutions allows trading off run time for solution quality. However, it has been noted that using large b for search/sampling or local search during inference may overshadow an architecture’s inability to generalize [58]. To better understand generalization, we focus on using greedy search and beam search/sampling with small b = 128.
3.5 Policy learning
Supervised learning
Models can be trained endtoend via imitating an optimal solver at each step (i.e. supervised learning). For models with NAR decoders, the edge predictions are linked to the groundtruth TSP tour by minimizing the binary crossentropy loss for each edge [17]. For AR architectures, at each step, we minimize the crossentropy loss between the predicted probability distribution over all edges leaving the current node and the next node from the groundtruth tour, following Vinyals et al. [6]. We use teacherforcing to stabilize training [59].
Reinforcement learning
Reinforcement learning is a elegant alternative in the absence of groundtruth solutions, as is often the case for understudied combinatorial problems. Models can be trained by minimizing problemspecific cost functions (the tour length in the case of TSP) via policy gradient algorithms [7, 16] or QLearning [9]. We focus on policy gradient methods due to their simplicity, and define the loss for an instance s parameterized by the model 𝜃 as \({\mathcal{L}}(\theta  s) = \mathbb {E}_{p_{\theta }(\pi  s)}\left [L(\pi )\right ]\), the expectation of the tour length L(π), where p_{𝜃}(πs) is the probability distribution from which we sample to obtain the tour πs. We use the REINFORCE gradient estimator [60] to minimize \({\mathcal{L}}\):
where the baseline b(s) reduces gradient variance. Our experiments compare standard critic network baselines [7, 19] and the greedy rollout baseline proposed by Kool et al. [16].
4 Experimental setup
We design controlled experiments to probe the unified pipeline described in Section 3 in order to identify inductive biases, architectures and learning paradigms that promote zeroshot generalization. We focus on learning efficiently from small problem instances (TSP2050) and measure generalization to a wider range of sizes, including large instances which are intractable to learn from (e.g. TSP200). Each experiment starts with a ‘base’ model configuration and ablates the impact of a specific component of the fivestage pipeline. We aim to fairly compare stateoftheart ideas in terms of model capacity and training data, and expect models with good inductive biases for TSP to: (1) learn trivially small TSPs without hundreds of millions of training samples and model parameters; and (2) generalize reasonably well across smaller and larger instances than those seen in training.
To quantify ‘good’ generalization, we additionally evaluate our models against a simple, nonlearnt furthest insertion heuristic baseline, which constructively builds a partial tour \(\pi ^{\prime }\) by inserting node i between tour nodes \(j_{1}, j_{2} \in \pi ^{\prime }\) such that the distance from node i to its nearest node j_{1} is maximized. Kool et al. [16] provide a detailed description of insertion heuristic baselines.
Training datasets
We perform ablation studies of each component of the pipeline by training on variable TSP2050 graphs for rapid experimentation. We also compare to learning from fixed graph sizes up to TSP100. Each TSP instance consist of n nodes sampled uniformly in the unit square \(S = \{x_{i}\}_{i=1}^{n}\) and \(x_{i} \in {[0,1]}^{2}\). In the supervised learning paradigm, we generate a training set of 1,280,000 TSP samples and groundtruth tours using the optimal Concorde solver as an oracle. Models are trained using the Adam optimizer for 10 epochs with a batch size of 128 and a fixed learning rate 1e − 4. For reinforcement learning, models are trained for 100 epochs on 128,000 TSP samples which are randomly generated for each epoch (without optimal solutions) with the same batch size and learning rate. Thus, both learning paradigms see 12,800,000 TSP samples in total. Considering that TSP2050 are trivial in terms of complexity as they can be solved by simpler nonlearnt heuristics, training good solvers at this scale should ideally not require billions of instances.
Model hyperparameters
For models with AR decoders, we use 3 GNN encoder layers followed by the attention decoder head, setting hidden dimension d = 128. For NAR models, we use the same hidden dimension and opt for 4 GNN encoder layers followed by the edge predictor. This results in approximately 350,000 trainable parameters for each model, irrespective of decoder type. Unless specified, most experiments use our best model configuration: AR decoding scheme and Graph ConvNet encoder with Max aggregation and BatchNorm (with batch statistics). All models are trained via supervised learning except when comparing learning paradigms.
Evaluation
We compare models on a heldout test set of 25,600 TSPs, consisting of 1,280 samples each of TSP10, TSP20, \(\dots\), TSP200. Our evaluation metric is the optimality gap w.r.t. the Concorde solver, i.e. the average percentage ratio of predicted tour lengths relative to optimal tour lengths. To compare design choices among identical models, we plot line graphs of the optimality gap as TSP size increases (along with a 99%ile confidence interval) using beam search with a width of 128. Compared to previous work which evaluated on fixed problem sizes, our protocol identifies not only those models that perform well on training sizes, but also those that generalize better than nonlearnt heuristics for large instances which are intractable to train on.
5 Results
5.1 Does learning from variable sizes help generalization?
We train five identical models on fully connected graphs of instances from TSP20, TSP50, TSP100, TSP200 and variable TSP2050. The line plots of optimality gap across TSP sizes in Fig. 3 indicates that learning from variable TSP sizes helps models retain performance across the range of graph sizes seen during training (TSP2050). Variable graph training compared to training solely on the maximum sized instances (TSP50) leads to marginal gains on small instances but, somewhat counterintuitively, does not enable better generalization to larger problems. Learning from small TSP20 is unable to generalize to large sizes while TSP100 models generalize poorly to trivially easy sizes, suggesting that the prevalent protocol of evaluation on training sizes [16, 17] overshadows brittle outofdistribution performance.
Training on TSP200 graphs is intractable within our computational budget, see Fig. 1. TSP100 is the only model which generalizes better to large TSP200 than the nonlearnt baseline. However, training on TSP100 can also be prohibitively expensive: one epoch takes approximately 8 hours (TSP100) vs. 2 hours (TSP2050) (details in Appendix B). For rapid experimentation, we train efficiently on variable TSP2050 for the rest of our study.
5.2 What is the best graph sparsification heuristic?
Figure 4 compares full graph training to the following heuristics: (1) Fixed node degree across graph sizes, via connecting each node in TSPn to its knearest neighbors, enabling GNN encoder layers/aggregators to specialize to constant degree k; and (2) Fixed graph diameter across graph sizes, via connecting each node in TSPn to its n × k%nearest neighbors, ensuring that the same number of message passing steps are required to diffuse information across both small and large graphs.
Although both sparsification techniques lead to faster convergence on training instance sizes (not shown), we find that only approach (2) leads to better generalization on larger problems than using full graphs. Consequently, all further experiments use approach (2) to operate on sparse 20%nearest neighbors graphs. Our results also suggest that developing more principled problem definition and graph reduction techniques beyond simple knearest neighbors for augmenting learningbased approaches may be a promising direction.
5.3 What is the relationship between GNN aggregation functions and normalization layers?
In Fig. 5, we compare identical models with anisotropic Sum, Mean and Max aggregation functions. As baselines, we consider the Transformer encoder on full graphs [16, 19] as well as a structureagnostic MLP on each node, which can be instantiated by not using any aggregation function in (2), i.e. \(h_{i}^{\ell +1} = h_{i}^{\ell } + \text {ReLU} \left (\textsc {Norm} \left (U^{\ell } h_{i}^{\ell } \right ) \right )\).
We find that the choice of GNN aggregation function does not have an impact when evaluating models within the training size range TSP2050. As we tackle larger graphs, GNNs with aggregation functions that are agnostic to node degree (Mean and Max) are able to outperform Transformers and MLPs. Importantly, the theoretically more expressive Sum aggregator [61] generalizes worse than structureagnostic MLPs, as it cannot handle the distribution shift in node degree and neighborhood statistics across graph sizes, leading to unstable or exploding node embeddings [39]. We use the Max aggregator in further experiments, as it generalizes well for both AR and NAR decoders (not shown).
We also experiment with the following normalization schemes: (1) standard BatchNorm which learns mean and variance from training data, as well as (2) BatchNorm with batch statistics; and (3) LayerNorm, which normalizes at the embedding dimension instead of across the batch. Figure 6 indicates that BatchNorm with batch statistics and LayerNorm are able to better account for changing statistics across different graph sizes. Standard BatchNorm generalizes worse than not doing any normalization, thus our other experiments use BatchNorm with batch statistics.
We further dissect the relationship between graph representations and normalization in Appendix D, confirming that poor performance on large graphs can be explained by unstable representations due to the choice of aggregation and normalization schemes. Using Max aggregators and BatchNorm with batch statistics are temporary hacks to overcome the failure of the current architectural components. Overall, our results suggest that inference beyond training sizes will require the development of expressive GNN mechanisms that are able to leverage global graph topology [62] while being invariant to distribution shifts in terms of node degree and other graph statistics [63].
5.4 Which decoder has a better inductive bias for TSP?
Figure 7 compares NAR and AR decoders for identical models. To isolate the impact of the decoder’s inductive bias without the inductive bias imposed by GNNs, we also show Transformer encoders on full graphs as well as structureagnostic MLPs. Within our experimental setup, AR decoders are able to fit the training data as well as generalize significantly better than NAR decoders, indicating that sequential decoding is powerful for TSP even without graph information.
Conversely, NAR architectures are a poor inductive bias as they require significantly more computation to perform competitively to AR decoders. For instance, recent models [17, 18] used more than 30 GNN layers with over 10 Million parameters. We believe that such overparameterized networks are able to memorize all patterns for small TSP training sizes [64], but the learnt policy is unable to generalize beyond training graph sizes. At the same time, when compared fairly within the same experimental settings, NAR decoders are significantly faster than AR decoders described in Section 3.3 as well as those which reembed the graph at each decoding step [9], see Fig. 8.
5.5 How do learning paradigms impact the search phase?
Identical models are trained via supervised learning (SL) and reinforcement learning (RL).^{Footnote 3} Figure 9 illustrates that, when using greedy decoding during inference, RL models perform better on the training size as well as on larger graphs. Conversely, SL models improve over their RL counterparts when performing beam search or sampling.
In Appendix C, we find that the rollout baseline, which encourages better greedy behaviour, leads to the model making very confident predictions about selecting the next node at each decoding step, even out of training size range. In contrast, SL models are trained with teacher forcing, i.e. imitating the optimal solver at each step instead of using their own prediction. This results in less confident predictions and poor greedy decoding, but makes the probability distribution more amenable to beam search and sampling, as shown in Fig. 10. Our results advocate for tighter coupling between the training and inference phase of learningdriven TSP solvers, mirroring recent findings in generative models for text [65].
5.6 Which learning paradigm scales better?
Our experiments till this point have focused on isolating the impact of various pipeline components on zeroshot generalization under limited computation. At the same time, recent results in natural language processing have highlighted the power of large scale pretraining for effective transfer learning [66]. To better understand the impact of learning paradigms when scaling computation, we double the model parameters (up to 750,000) and train on tens times more data (12.8M samples) for AR architectures. We monitor optimality gap on the training size range (TSP2050) as well as a larger size (TSP100) vs. the number of training samples.
In Fig. 11, we see that increasing model capacity leads to better learning. Notably, RL models, which train on unique randomly generated samples throughout, are able to keep improving their performance within as well as outside of training size range as they see more samples. On the other hand, SL is bottlenecked by the need for optimal groundtruth solutions: SL models iterate over the same 1.28M unique labelled samples and stop improving at a point. Beyond favorable inductive biases, distributed and sampleefficient RL algorithms [67] may be a key ingredient for learning from and scaling up to larger TSPs beyond tens of nodes.
6 Recent case studies and future work
Since the initial publication of this work [68], deep learning for routing problems has received considerable attention from the research community [27, 28, 30, 31, 69,70,71,72,73]. In this section, we highlight recent advances, characterize them using the unified pipeline presented in Fig. 2, and provide future research directions, with a focus on improving generalization to largescale and realworld instances.
As a reminder, the unified neural combinatorial optimization pipeline consists of: (1) Problem Definition → (2) Graph Embedding → (3) Solution Decoding → (4) Solution Search → (5) Policy Learning.
Leveraging equivariance and symmetries
The autoregressive Attention Model [16] sequentially constructs TSP tours as permutations of cities, but does not consider the underlying symmetries of routing problems.
Kwon et al. [27] consider invariance to the starting city in constructive heuristics: They propose to train the Attention Model with a new reinforcement learning algorithm (innovating on box 5 in Fig. 2(a)) which exploits the existence of multiple optimal tour permutations. Similarly, Ouyang, Wang, et al. [28] consider invariance with respect to rotations, reflections, and translations (Euclidean symmetry group) of the input cities: They propose a constructive approach similar to Attention Model while ensuring invariance by performing data augmentation during the problem definition stage (Fig. 2(a), box 1) and using relative coordinates during graph encoding (Fig. 2(a), box 2). Their approach shows particularly strong results on zeroshot generalization from random instances to the realworld TSPLib benchmark suite.
Future work may follow the Geometric Deep :earning blueprint [74] by designing models that respect the symmetries and inductive biases that govern the data. As routing problems are embedded in euclidean coordinates and the routes are cyclical, incorporating these contraints directly into the architectures or learning paradigms may be a principled approach to improving generalization to largescale instances greater than those seen during training.
Improved graph search algorithms
Several papers have proposed to improve the oneshot nonautoregressive approach of Joshi et al. [17] by retaining the same GNN encoder (Fig. 2(a), box 2) while replacing the graph search component of the pipeline (Fig. 2(a), box 4) with more powerful and flexible algorithms, e.g. Dynamic Programming [31] or MonteCarlo Tree Search (MCTS) [30].
Notably, the GNN + MCTS framework of Fu et al. [30] shows that the NAR approach can generalize to TSPs with up to 1000 nodes. They ensure that the predictions of the GNN encoder generalize from small to large TSP by updating the problem definition (Fig. 2(a), box 1): large problem instances are represented as many smaller subgraphs which are of the same size as the training graphs for the GNN, and then merge the GNN edge predictions before performing MCTS.
Overall, this line of work suggests that stronger coupling between the design of both the neural and symbolic/search components of models is essential for outofdistribution generalization.
Learning within local search heuristics
Recent work has explored an alternative to constructive AR and NAR decoding schemes which involves learning to iteratively improve (suboptimal) solutions or learning to perform local search [69,70,71,72,73]. Since deep learning is used to guide decisions within classical search algorithms (which are designed to work regardless of problem scale), this approach implicitly leads to better zeroshot generalization to larger problem instances compared to constructive approaches studied in our work. In particular, NeuroLKH [71] uses GNNs to improve the LinKernighanHelsgaun algorithm and demonstrates strong zeroshot generalization to TSP with 5000 nodes as well as across TSPLib instances.
A limitation of this line of work is the need for handdesigned local search heuristics, which may be missing for understudied problems. On the other hand, constructive approaches are comparatively easier to adapt to new problems by enforcing constraints during the solution decoding and search procedure (Fig. 2(a), box 4).
Learning Paradigms that promote generalization
Future work could look at novel learning paradigms which explicitly focus on generalization beyond supervised and reinforcement learning. For e.g., this work explored zeroshot generalization to larger problems, but the logical next step is to finetune the model on a small number of larger problem instances [75]. Thus, it will be interesting to explore finetuning/generalization as a metalearning problem, wherein the goal is to train model parameters specifically for fast adaptation and finetuning to new data distributions and problem sizes.
Another interesting direction could explore tackling understudied routing problems with challenging constraints via multitask pretraining on wellknown routing problems such as TSP and CVPR, followed by problemspecific finetuning. Similar to language modelling as a pretraining objective in NLP [66], the goal of pretraining for routing would be to learn generally useful neural network representations that can transfer well to novel routing problems.
7 Conclusion
Learningdriven solvers for combinatorial problems such as the Travelling Salesperson Problem have shown promising results for trivially small instances up to a few hundred nodes. However, scaling fully endtoend deep learning approaches to realworld instances is still an open question as training on large graphs is extremely timeconsuming and challenging to learn from.
This paper advocates for an alternative to expensive largescale training: training models efficiently on trivially small TSP and transferring the learnt policy to larger graphs in a zeroshot fashion or via fast finetuning. Thus, identifying promising inductive biases, architectures and learning paradigms that enable such zeroshot generalization to large and more complex instances is a key concern for tackling realworld combinatorial problems.
We perform the first principled investigation into zeroshot generalization for learning large scale TSP, unifying stateoftheart architectures and learning paradigms into one experimental pipeline for neural combinatorial optimization. Our findings suggest that key design choices such as GNN layers, normalization schemes, graph sparsification, and learning paradigms need to be explicitly redesigned to consider outofdistribution generalization. Additionally, we use our unified pipeline to characterize recent advances in deep learning for routing problems and provide new directions to stimulate future research.
Notes
The largest TSP solved by Concorde to date has 109,399 nodes with running time of 7.5 months.
For RL, we show the greedy rollout baseline. Critic baseline results are available in Appendix E
It is worth noting that classical algorithmic and symbolic components such as graph reduction, sophisticated tree search as well as posthoc local search have been pivotal and complementary to GNNs in enabling such generalization.
Distribution plots show 0, 5, 50, 95, and 100percentiles for embedding statistics at various TSP sizes, thus visualizing how the statistics changes with problem scale (implemented via TensorBoard [76]).
References
Lenstra, J.K., & Kan, A.R. (1975). Some simple applications of the travelling salesman problem. Journal of the Operational Research Society.
Applegate, D.L., Bixby, R.E., Chvatal, V., & Cook, W.J. (2006). The traveling salesman problem: A computational study.
Applegate, D., Bixby, R., Chvatal, V., & Cook, W. (2006). Concorde TSP solver.
Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., žídek, A., Nelson, A.W., Bridgland, A., & et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature.
Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J.W., Songhori, E., Wang, S., Lee, Y.J., Johnson, E., Pathak, O., Nazi, A., & et al. (2021). A graph placement methodology for fast chip design. Nature.
Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer networks. In NeurIPS.
Bello, I., Pham, H., Le, Q.V., Norouzi, M., & Bengio, S. (2017). Neural combinatorial optimization with reinforcement learning. In ICLR.
Bengio, Y., Lodi, A., & Prouvost, A. (2020). Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research.
Khalil, E., Dai, H., Zhang, Y., Dilkina, B., & Song, L. (2017). Learning combinatorial optimization algorithms over graphs. In NeurIPS.
Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., & Dill, D.L. (2019). Learning a sat solver from singlebit supervision. In ICLR.
Li, Z., Chen, Q., & Koltun, V. (2018). Combinatorial optimization with graph convolutional networks and guided tree search. In NeurIPS.
Kipf, T.N., & Welling, M. (2017). Semisupervised classification with graph convolutional networks. In ICLR.
Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., & Dahl, G.E. (2017). Neural message passing for quantum chemistry. In ICML.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. ICLR.
Battaglia, P.W., Hamrick, J.B., Bapst, V., SanchezGonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., & et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint.
Kool, W., van Hoof, H., & Welling, M. (2019). Attention, learn to solve routing problems!. In ICLR.
Joshi, C.K., Laurent, T., & Bresson, X. (2019). An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint.
Nowak, A., Villar, S., Bandeira, A.S., & Bruna, J. (2017). A note on learning algorithms for quadratic assignment with graph neural networks. arXiv preprint.
Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., & Rousseau, L.M. (2018). Learning heuristics for the TSP by policy gradient. In CPAIOR.
Gasse, M., Chételat, D., Ferroni, N., Charlin, L., & Lodi, A. (2019). Exact combinatorial optimization with graph convolutional neural networks. In NeurIPS.
Cappart, Q., Goutierre, E., Bergman, D., & Rousseau, L.M. (2019). Improving optimization bounds using machine learning: Decision diagrams meet deep reinforcement learning. In AAAI.
Chalumeau, F., Coulon, I., Cappart, Q., & Rousseau, L.M. (2021). Seapearl: A constraint programming solver guided by reinforcement learning. In CPAIOR.
Wilder, B., Dilkina, B., & Tambe, M. (2019). Melding the datadecisions pipeline: Decisionfocused learning for combinatorial optimization. In AAAI.
Ferber, A., Wilder, B., Dilkina, B., & Tambe, M. (2020). MIPaaL: Mixed integer program as a layer. In AAAI.
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). Sequence to sequence learning with neural networks. In NeurIPS.
Ma, Q., Ge, S., He, D., Thaker, D., & Drori, I. (2020). Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In AAAI workshop on deep learning on graphs.
Kwon, Y.D., Choo, J., Kim, B., Yoon, I., Gwon, Y., & Min, S. (2020). Pomo: Policy optimization with multiple optima for reinforcement learning. In NeurIPS.
Ouyang, W., Wang, Y., Weng, P., & Han, S. (2021). Generalization in deep rl for tsp problems via equivariance and local search. arXiv preprint.
Nowak, A., Folqué, D., & Estrach, J.B. (2018). Divide and conquer networks. In ICLR.
Fu, Z.H., Qiu, K.B., & Zha, H. (2021). Generalize a small pretrained model to arbitrarily large TSP instances. In AAAI.
Kool, W., van Hoof, H., Gromicho, J., & Welling, M. (2021). Deep policy dynamic programming for vehicle routing problems. arXiv preprint.
Joshi, C.K., Laurent, T., & Bresson, X. (2019). On learning paradigms for the travelling salesman problem. NeurIPS Graph Representation Learning Workshop.
Nazari, M., Oroojlooy, A., Snyder, L., & Takác, M. (2018). Reinforcement learning for solving the vehicle routing problem. In NeurIPS.
Chen, X., & Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In NeurIPS.
Yolcu, E., & Poczos, B. (2019). Learning local search heuristics for boolean satisfiability. In NeurIPS.
Huang, J., Patwary, M., & Diamos, G. (2019). Coloring big graphs with alphagozero. arXiv preprint.
Sato, R., Yamada, M., & Kashima, H. (2019). Approximation ratios of graph neural networks for combinatorial problems. In NeurIPS.
Cappart, Q., Chételat, D., Khalil, E., Lodi, A., Morris, C., & Veličković, P. (2021). Combinatorial optimization and reasoning with graph neural networks. In IJCAI.
Veličković, P., Ying, R., Padovano, M., Hadsell, R., & Blundell, C. (2020). Neural execution of graph algorithms. In ICLR.
Veličković, P., & Blundell, C. (2021). Neural algorithmic reasoning. Patterns.
Corso, G., Cavalleri, L., Beaini, D., Liò, P., & Veličković, P. (2020). Principal neighbourhood aggregation for graph nets. In NeurIPS.
Xu, K., Li, J., Zhang, M., Du, S.S., Kawarabayashi, K.i., & Jegelka, S. (2019). What can neural networks reason about?. In ICLR.
Xu, K., Li, J., Zhang, M., Du, S.S., Kawarabayashi, K.i., & Jegelka, S. (2020). How neural networks extrapolate: From feedforward to graph neural networks. In ICLR.
GómezBombarelli, R., Wei, J.N., Duvenaud, D., HernándezLobato, J.M., SánchezLengeling, B., Sheberla, D., AguileraIparraguirre, J., Hirzel, T.D., Adams, R.P., & AspuruGuzik, A. (2018). Automatic chemical design using a datadriven continuous representation of molecules. ACS central science.
Mao, H., Schwarzkopf, M., Venkatakrishnan, S.B., Meng, Z., & Alizadeh, M. (2019). Learning scheduling algorithms for data processing clusters. In ACM special interest group on data communication.
Paliwal, A., Gimeno, F., Nair, V., Li, Y., Lubin, M., Kohli, P., & Vinyals, O. (2019). Regal: Transfer learning for fast optimization of computation graphs. arXiv preprint.
Mirhoseini, A., Pham, H., Le, Q.V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., & Dean, J. (2017). Device placement optimization with reinforcement learning. In ICML.
Zhou, Y., Roy, S., Abdolrashidi, A., Wong, D., Ma, P.C., Xu, Q., Zhong, M., Liu, H., Goldie, A., Mirhoseini, A., & et al. (2019). Gdp: Generalized device placement for dataflow graphs. arXiv preprint.
Bresson, X., & Laurent, T. (2019). A twostep graph convolutional decoder for molecule generation. In NeurIPS workshop on machine learning and the physical sciences.
Jin, W., Barzilay, R., & Jaakkola, T. (2018). Junction tree variational autoencoder for molecular graph generation. In ICML.
You, J., Liu, B., Ying, Z., Pande, V., & Leskovec, J. (2018). Graph convolutional policy network for goaldirected molecular graph generation. In NeurIPS.
Bresson, X., & Laurent, T. (2018). An experimental study of neural networks for variable graphs. In ICLR Workshop.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint.
Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). Layer normalization. arXiv preprint.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
Joshi, C. (2020). Transformers are graph neural networks. The Gradient.
Dwivedi, V.P., Joshi, C.K., Laurent, T., Bengio, Y., & Bresson, X. (2020). Benchmarking graph neural networks. arXiv preprint.
François, A., Cappart, Q., & Rousseau, L.M. (2019). How to evaluate machine learning approaches for combinatorial optimization: Application to the travelling salesman problem. arXiv preprint.
Williams, R.J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Williams, R.J. (1992). Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning.
Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks?. In ICLR.
Garg, V.K., Jegelka, S., & Jaakkola, T. (2020). Generalization and representational limits of graph neural networks. In ICML.
Levie, R., Bronstein, M.M., & Kutyniok, G. (2019). Transferability of spectral graph convolutional neural networks. arXiv preprint.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In ICLR.
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. In ICLR.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.J. (2020). Exploring the limits of transfer learning with a unified texttotext transformer. JMLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint.
Joshi, C.K., Cappart, Q., Rousseau, L.M., & Laurent, T. (2021). Learning tsp requires rethinking generalization. In International conference on principles and practice of constraint programming.
Wu, Y., Song, W., Cao, Z., Zhang, J., & Lim, A. (2021). Learning improvement heuristics for solving routing problem. IEEE Transactions on Neural Networks and Learning Systems.
da Costa, P.R.d.O., Rhuggenaath, J., Zhang, Y., & Akcay, A. (2020). Learning 2opt heuristics for the traveling salesman problem via deep reinforcement learning. In Asian conference on machine learning.
Xin, L., Song, W., Cao, Z., & Zhang, J. (2021). Neurolkh: Combining deep learning model with linkernighanhelsgaun heuristic for solving the traveling salesman problem. In NeurIPS.
Ma, Y., Li, J., Cao, Z., Song, W., Zhang, L., Chen, Z., & Tang, J. (2021). Learning to iteratively solve routing problems with dualaspect collaborative transformer. In NeurIPS.
Hudson, B., Li, Q., Malencia, M., & Prorok, A. (2021). Graph neural network guided local search for the traveling salesperson problem. arXiv preprint.
Bronstein, M.M., Bruna, J., Cohen, T., & Veličković, P. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint.
Hottung, A., Kwon, Y.D., & Tierney, K. (2021). Efficient active search for combinatorial optimization problems. arXiv preprint.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., & et al. (2016). Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv preprint.
Hermans, A., Beyer, L., & Leibe, B. (2017). In defense of the triplet loss for person reidentification. arXiv preprint.
Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., & Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. In NeurIPS.
Inc, G.O. (2015). Gurobi optimizer reference manual. URL http://www.gurobi.com.
Acknowledgements
We would like to thank R. Anand, X. Bresson, V. Dwivedi, A. Ferber, E. Khalil, W. Kool, R. Levie, A. Prouvost, P. Veličković and the anonymous reviewers for helpful comments and discussions.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
None.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Code and datasets: github.com/chaitjo/learningtsp
Appendices
Appendix A: Additional Context for Fig. 1
Experimental setup
In Fig. 1, we illustrate the computational challenges of learning large scale TSP by comparing three identical models trained on 12.8 Million TSP instances via reinforcement learning. Our experimental setup largely follows Section 4. All models use identical configurations: autoregressive decoding and Graph ConvNet encoder with Max aggregation and LayerNorm. The TSP2050 model is trained using the greedy rollout baseline [16] and the Adam optimizer with batch size 128 and learning rate 1e − 4. Direct training, active search and finetuning on TSP200 samples is done using learning rate 1e − 5, as we found larger learning rates to be unstable. During active search and finetuning, we use an exponential moving average baseline, as recommended by Bello et al. [7].
Furthest insertion baseline
We characterize ‘good’ generalization across our experiments by the wellknown furthest insertion heuristic, which constructively builds a solution/partial tour \(\pi ^{\prime }\) by inserting node i between tour nodes \(j_{1}, j_{2} \in \pi ^{\prime }\) such that the distance from node i to its nearest tour node j_{1} is maximized.
We motivate our work by showing that learning from large TSP200 is intractable on universityscale hardware, and that efficient pretraining on trivial TSP2050 enables models to better generalize to TSP200 in a zeroshot manner. Within our computational budget, furthest insertion still outperforms our best models. At the same time, we are not claiming that it is impossible to outperform insertion heuristics with current approaches: reinforcement learningdriven approaches will only continue to improve performance with more computation and training data. We want to use simple nonlearnt baselines to motivate the development of better architectures, learning paradigms and evaluation protocols for neural combinatorial optimization.
Routing problems and generalization
It is worth mentioning why we chose to study TSP in particular. Firstly, TSP has stood the test of time in terms of relevance and continues to serve as an engine of discovery for general purpose techniques in applied mathematics.
TSP and associated routing problems have also emerged as a challenging testbed for learningdriven approaches to combinatorial optimization. Whereas generalization to problem instances larger and more complex than those seen in training has at least partially been demonstrated on nonsequential problems such as SAT, MaxCut, MVC [9,10,11],^{Footnote 4} the same architectures do not show strong generalization for TSP. For e.g., furthest insertion outperforms or is competitive with stateoftheart approaches for TSP above tens of nodes, see Fig. D.1.(e, f) from Khalil et al. [9] or Fig. 5 from Kool et al. [16], despite using more computation and data than our controlled study.
Appendix B: Hardware and Timings
Fairly timing research code can be difficult due to differences in libraries used, hardware configurations and programmer skill. In Table 1, we report approximate total training time and inference time across TSP sizes for the model setup described in Section 4. All experiments were implemented in PyTorch and run on an Intel Xeon CPU E52690 v4 server and four Nvidia 1080Ti GPUs. Four experiments were run on the server at any given time (each using a single GPU). Training time may vary based on server load, thus we report the lowest training time across several runs in Table 1.
Appendix C: Learning paradigms and amenity to search
Figure 10 demonstrate that SL models are more amenable to beam search and sampling, but are outperformed by RLrollout models under greedy search. In Fig. 12, we investigate the impact of learning paradigms on probability distributions by plotting histograms of the probabilities of greedy selections during inference across TSP sizes for identical models trained with SL and RL. We find that the rollout baseline, which encourages better greedy behaviour, leads to the model making very confident predictions about selecting the next node at each decoding step, even beyond training size range. In contrast, SL models are trained with teacher forcing, i.e. imitating the optimal solver at each step instead of using their own prediction. This results in less confident predictions and poor greedy decoding, but makes the probability distribution more amenable to beam search and sampling techniques.
We understand this phenomenon as follows: More confident predictions (Fig. 12b) do not automatically imply better solutions. However, sampling repeatedly or maintaining the topb most probable solutions from such distributions is likely to contain very similar tours. On the other hand, less sharp distributions (Fig. 12a) are likely to yield more diverse tours with increasing b. This may result in comparatively better optimality gap, especially for TSP sizes larger than those seen in training.
Appendix D: Visualizing node and graph embedding spaces
Our results in Section 5.3 suggest that inference beyond training sizes requires the development of GNN architectures and normalization layers that are both expressive as well as invariant to distribution shifts. We explore how node and graph embeddings for TSP graphs evolve across training distribution (TSP2050) and beyond (up to TSP200) through visualizing the statistics of the embedding spaces. Intuitively, constructing TSP tours involves decisions which are not just locally optimal, but also optimal w.r.t some global graph structure. Thus, node embeddings represent local information while graph embeddings, which are conventionally computed as the mean of node embeddings, provide global structural information.
We utilize distribution plots to study the variation in embedding statistics^{Footnote 5} of three identical models: (1) GNNMax, which represents our best model configuration from Section 5: autoregressive decoding, Graph ConvNet encoder with Max aggregation and BatchNorm with batch statistics; (2) GNNSum, which uses Sum aggregation for the Graph ConvNet and shows comparatively poor generalization beyond training size, see Fig. 5; and (3) GNNMax + learnt BN, which uses standard BatchNorm, i.e. learns statistics from the training data, and also shows comparatively poor generalization, see Fig. 6.
We draw upon work in learning embeddings for computer vision [77] to characterize embedding spaces across TSP sizes according to: (1) magnitudes, denoted by ℓ_{2} norms, indicating whether embeddings are shrinking to one magnitude or expanding outwards as TSP size increases; and (2) pairwise distances, which tells us how wellseparated the embedding are, or whether they are pulled apart/towards each other as TSP size increases.
Node embedding space
In Figs. 13 and 14, we see that GNNMax leads to the most stable node embedding norms and pairwise distances (which are calculated at an intragraph level) across TSP sizes. On the other hand, GNNSum and GNNMax + learnt BN lead to fluctuating and monotonically increasing embedding norms as size increases, e.g. compare Fig. 13b and c. Clearly, maintaining similar distributions for node embeddings across graph sizes indicates that the GNN is building meaningful representations of local structure, or, at the very least, does not break down for large graphs. This enables better generalization, as the decoder has lower chances of encountering embeddings which are statistically different than those seen during training.
Graph embedding space
Figures 15 and 16 indicate that the graph embedding space is shrinking towards a single magnitude and moving closer as graph size increases. Interestingly, with standard BatchNorm, the graph embedding magnitude monotonically increases with graph size to ranges beyond those for training graphs. On the other hand, using batch statistics for BatchNorm, as done in GNNMax and GNNSum, leads to graph embedding magnitudes converging to a single value which is within the range of values for training graphs, thus enabling better generalization. E.g. compare Fig. 15b and c.
We can further visualize this phenomenon through 2D Principal Component Analysis (PCA) plots of graph embedding spaces for GNNMax and GNNMax + learnt BN models, see Fig. 17a and b. In both cases, the graph embeddings at larger sizes have very similar magnitudes and are extremely close to each other, indicating that the model is unable to differentiate among different graphs. Thus, decoders currently lack good global structural context. Investigating better graph embeddings through pooling methods [78] could be an interesting approach towards representing global graph structure beyond training sizes.
Appendix E: Extra Results
NAR decoders and aggregation functions
In Section 5, we found that AR decoding provides a powerful sequential inductive bias for TSP and is able to generalizes well with both GNNs as well as structureagnostic encoder architectures. This result may lead one to question the need for GNNs, altogether. Interestingly, Fig. 18 illustrates a different trend for NAR architectures: GNN encoders generalize better than both Transformers and MLPs, indicating that leveraging graph structure is essential in the absence of the sequential inductive bias. (It is worth noting that, overall, all models with NAR decoders generalize poorly compared to AR architectures for our experimental setup.)
Critic baseline
Figure 19 illustrates that, for identical models, the critic baseline [7, 19] is unable to match the performance of the rollout baseline [16] under both greedy and beam search settings. We did not explore tuning learning rates and hyperparameters for the critic network, opting to use the same settings as those for the actor. In general, getting actorcritic methods to work seems to require more parameter tuning than the rollout baseline.
Scaling computation for AR and NAR architectures
In Figures 20 and 21, we present extended results for Section 5.6, where we scale model parameters and data. We observe that using larger models (up to 1.5 Million parameters) enables fitting the training dataset better. The impact of larger models is especially evident for NAR architectures. As previously noted, recent NARbased models [17, 18] used more than 30 layers with over 10 Million parameters to outperform AR architectures on fixed TSP sizes. We believe that such overparameterized networks are able to memorize all patterns for small TSP training sizes [64], but the learnt policy is unable to generalize beyond training graph sizes as NAR decoding does not provide a useful inductive bias for TSP.
Appendix F: Visualizing model predictions
As a final note, we present a visualization tool for generating model predictions and heatmaps of TSP instances, see Figures 22, 23. We advocate for the development of more principled approaches to neural combinatorial optimization, e.g. along with model predictions, visualizing the reduce costs for each edge (cheaply obtained using the Gurobi solver [79]) may help debug and improve learningdriven approaches in the future. Using reduce costs as supervision signals could also be an inexpensive alternative to running optimal solvers to create large labelled datasets.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visithttp://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Joshi, C.K., Cappart, Q., Rousseau, LM. et al. Learning the travelling salesperson problem requires rethinking generalization. Constraints 27, 70–98 (2022). https://doi.org/10.1007/s1060102209327y
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s1060102209327y