1 Introduction

The Traveling Salesman Problem (TSP) is a classic NP-hard problem in computer science and operations research that seeks the shortest possible route that visits every city exactly once and returns to the starting city [1]. Finding an optimal solution to the TSP is computationally expensive when the number of cities is large. Researchers have therefore studied a variety of heuristics and approximation algorithms that provide high-quality solutions in a reasonable amount of time.

One of the simplest and most popular heuristics is the nearest-neighbor heuristic. It starts at a randomly chosen city and repeatedly selects the nearest unvisited city as the next city to visit until no unvisited cities remain. Finally, it returns to the starting city to complete the tour. Another well-known heuristic for the TSP is the Christofides algorithm [2]. It finds an approximate solution using the minimum spanning tree of the graph representing the cities, i.e., the tree that connects all the cities with the minimum possible total edge weight, and it is guaranteed to produce a solution within a factor of 3/2 of the optimum. Another popular heuristic tool for solving the TSP is Google OR-Tools [3], which applies local search and meta-heuristics to find approximate solutions for a wide range of combinatorial optimization problems such as the TSP and vehicle routing problems. However, heuristic approaches trade optimality for computational cost and are expressed in the form of hand-crafted rules [4, 5].

For many TSP instances, Concorde is considered the fastest exact TSP solver that produces the optimal solution [6]. Concorde uses an integer programming solver with cutting planes and branch-and-bound. It assumes a symmetric TSP, where the distance between two cities is the same in both directions [7]. Gurobi also finds optimal TSP solutions, but Concorde is faster because it is specialized for TSPs [4, 8].

Many studies have sought approximate TSP solutions based on deep learning. Among them, the pioneering work is the Pointer Network [7]. It is a supervised learning-based approach that uses RNN-based encoders and decoders. In their experiments, a planar symmetric TSP is assumed, and a beam search decoding procedure removes invalid tours, such as those that visit the same city twice or skip a city. Bello et al. updated the learning parameters of an LSTM-based model with a reinforcement learning approach that uses the tour length as a negative reward signal [9]. Nazari et al. replaced the RNN encoder of the Pointer Network with an embedding to reduce computational complexity without impacting performance [10]. Joshi et al. proposed a supervised learning approach that predicts the edge probability matrix of the entire graph with a graph convolutional neural network [11]. Stohy et al. proposed a hybrid pointer network for the TSP that performs well on large-scale instances [12], but it suffers from a long inference time owing to a more complex model structure than the baseline graph pointer network [13]. Several efforts have been made to apply convolutional neural networks (CNNs) to the TSP. Researchers used 2D convolutions for the TSP but did not achieve good performance [14, 15]. Sultana et al. introduced a model that combines a 1D-CNN with an LSTM, but it is still an RNN-based model [16].

Recent attention-based Transformer models have shown good performance in various research fields [17, 18, 19, 20]. Researchers have successfully used Transformer-based models to find approximate TSP solutions [21, 22, 4, 23, 5, 24, 25]. Deudon et al. proposed an approach for solving the TSP with deep reinforcement learning: city coordinates are used as inputs, and the model is trained with reinforcement learning to predict a distribution over city sequences [21]. Kool et al. proposed a Transformer-based model consisting purely of attention blocks and trained it with REINFORCE to solve various routing problems such as the TSP and vehicle routing problems [4]. Wu et al. proposed a Transformer-based deep reinforcement learning framework that trains an improvement heuristic to iteratively refine an initial solution [22]. Researchers have also applied multiple rollouts and data augmentation to Kool's attention model [23].

Recently, Bresson et al. proposed the TSP Transformer model [5]. It is based on a standard Transformer encoder with multi-head attention and residual connections but uses batch normalization instead of layer normalization. It adopts an auto-regressive decoding approach and introduces a self-attention block in the decoder, constructing the query from all cities in the partial tour with a self-attention module [5]. They reported state-of-the-art (SOTA) performance on various TSP instances, with an optimality gap of 0.0004% for TSP50 and 0.39% for TSP100. Although the TSP Transformer achieves SOTA results on many TSP instances, it has a complex, fully-connected attention-based model structure, requires substantial GPU memory, and its training and inference times are long [26]. Recently, various studies have sought to reduce the computational complexity of standard Transformer models [27, 28, 29]. For the TSP, a recent study made the model lightweight by removing the learnable decoder [24]. A similar effort toward a lightweight TSP Transformer model was made in [25]: Yang et al. proposed a memory-efficient Transformer-based model for the TSP called Tspformer. Their model successfully reduces GPU/CPU memory usage compared with standard Transformer-based models [25], but its solution quality is not as good as that of the SOTA model.

Pan et al. proposed a constructive approach based on hierarchical reinforcement learning (H-TSP), which specializes in solving large-scale TSP instances [30]. It employs a hierarchical deep reinforcement learning approach with policies at two levels: an upper-level and a lower-level policy. While H-TSP demonstrates excellent performance on large-scale TSP instances, it requires an additional warm-up stage in which the lower-level model is pre-trained, and its training process is rather complex; it therefore demands substantial computing resources and significant training time. Ren et al. tackled the dynamic TSP using a self-supervised reinforcement learning approach [31]. They proposed a new feature extraction mechanism combining self-attention and context attention mechanisms [31]. This approach has the advantage of not requiring a manually crafted reward function.

Fig. 1 Overview of the proposed CNN-Transformer model for TSP

In this paper, we propose a novel CNN-Transformer model based on partial self-attention, which performs attention only on recently visited nodes in the decoder. The linear embedding in the standard Transformer model does not consider local spatial information and is limited in learning local compositionality. We therefore add a CNN embedding layer to the standard Transformer model to extract local spatial features of the input data, as CNNs are effective in learning the spatial invariance of nodes in Euclidean space. Second, the standard Transformer model is a fully-connected attention-based model [26], so it suffers from high computational complexity and memory consumption; its fully-connected topology also makes it weak at learning local compositionality. For the TSP, we improve the attention mechanism by proposing partial self-attention, which attends only to recently visited nodes in the decoder. Our observations reveal significant redundancy in the Transformer model's fully-connected topology for the TSP; removing these excessive attention connections improves the quality of TSP solutions. The main contributions of our paper are summarized as follows:

  • To the best of our knowledge, we propose the first CNN-Transformer-based model for learning TSP solutions. Our results show that the CNN embedding layer is very effective in learning local spatial features of various TSP instances.

  • The proposed model is based on partial self-attention that performs attention only on recently visited nodes in the decoder. Therefore, the proposed model is able to better learn local compositionality compared with the standard Transformer model that is based on the fully-connected topology.

  • By removing the redundant attention connections in the decoder, the proposed model significantly reduces the GPU memory usage and has much lower inference time compared with the standard Transformer model.

2 Proposed model

We propose a CNN-Transformer model that combines a convolutional neural network embedding layer with a standard Transformer model. The proposed CNN-Transformer model follows an encoder-decoder architecture, with a CNN embedding layer in the encoder to extract local spatial information. We also improve the attention mechanism with partial self-attention, which removes unnecessary attention connections.

Figure 1 illustrates the overall structure of the proposed model. The input of the proposed model is a planar point set \(\textbf{X}=\{\textbf{x}_1,\dots ,\textbf{x}_n\}\) with n nodes (cities), where \(\textbf{x}_i\in \mathbb {R}^2\) represents the 2D Cartesian coordinates of the points. The output of the model, denoted as \(\pmb {\pi }=\{\pi _1,\dots ,\pi _n\}\), is a sequence representing the predicted tour, where \(\pi _t\) is the node index selected at the \(t^\text {th}\) decoding step of the decoder. Let \(D(\textbf{x}_i,\textbf{x}_j)\) be the distance between nodes \(\textbf{x}_i\) and \(\textbf{x}_j\). Our goal is to minimize the total tour length \(\sum _{t=1}^{n-1} D(\textbf{x}_{\pi _t},\textbf{x}_{\pi _{t+1}})+D(\textbf{x}_{\pi _n},\textbf{x}_{\pi _1})\) while visiting each node exactly once and then returning to the starting node.
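The objective can be made concrete with a short, self-contained sketch (not the authors' code) that scores a closed tour; `coords` and `tour` are illustrative names for the coordinate array and the predicted permutation.

```python
# A minimal sketch (not the authors' code) of the tour-length objective:
# the Euclidean length of the closed tour defined by `tour`.
import numpy as np

def tour_length(coords: np.ndarray, tour: list) -> float:
    """Total length of the closed tour over `coords` (n x 2)."""
    length = 0.0
    for t in range(len(tour)):
        a = coords[tour[t]]
        b = coords[tour[(t + 1) % len(tour)]]   # wraps back to the starting node
        length += float(np.linalg.norm(a - b))
    return length

# Example: four cities on the unit square; the optimal closed tour has length 4.
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
print(tour_length(coords, [0, 1, 2, 3]))  # 4.0
```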

2.1 Encoder

The proposed encoder is composed of a CNN embedding layer and L identical encoder layers, as illustrated in Fig. 2. The CNN embedding layer extracts local spatial information from the input data points and generates embedding vectors, which are passed on to the subsequent encoder layers.

Each encoder layer consists of two sublayers: a multi-head self-attention (MHSA) sublayer and a point-wise feed-forward (FF) sublayer. The MHSA sublayer performs multi-head self-attention to capture the dependencies between nodes, and the point-wise FF sublayer applies a non-linear transformation. A residual connection [32] and batch normalization [33] are applied after each sublayer. Similar to previous studies [4, 5], we use batch normalization instead of layer normalization, as it can effectively handle a large number of nodes.

2.1.1 CNN embedding layer

We add a CNN embedding layer to the encoder to extract spatial information from the input nodes. Let \(\textbf{X}_i^{k\text{-NN}}\) be the concatenation of the feature vector of the \(i^\text {th}\) node and the feature vectors of its k nearest neighboring nodes. Then, the embedding vector of the \(i^\text {th}\) node in the encoder, \(\textbf{x}_i^\text {emb}\), is computed as the sum of the node embedding of the \(i^\text {th}\) input node, \(\textbf{x}_i \textbf{W}_\text {emb}\), and the convolution of \(\textbf{X}_i^{k\text{-NN}}\), which is expressed as:

$$\begin{aligned} \textbf{x}^{\text {emb}}_i=\textbf{x}_i\textbf{W}_{\text {emb}} + \text {Conv}(\textbf{X}^{\text {{ k}-NN}}_i)\in \mathbb {R}^d, \end{aligned}$$
(1)

where \(\textbf{W}_\text {emb}\in \mathbb {R}^{2\times d}\) is a learnable parameter for node embedding. A fixed kernel size of k+1 is used so that \(\text {Conv}(\textbf{X}_i^{k\text{-NN}})\) lies in the same d-dimensional space as the node embedding \(\textbf{x}_i \textbf{W}_\text {emb}\).
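A minimal PyTorch sketch of Eq. (1) is given below. It assumes that the k+1 coordinate vectors of each node and its k nearest neighbors are stacked and passed through a 1D convolution with kernel size k+1; the module layout, tensor shapes, and names (`CNNEmbedding`, `node_emb`, `conv`) are our own illustrative assumptions, not the authors' implementation.

```python
# Sketch of the CNN embedding layer (Eq. (1)), assuming n >= k+1 nodes.
import torch
import torch.nn as nn

class CNNEmbedding(nn.Module):
    def __init__(self, d: int = 128, k: int = 10):
        super().__init__()
        self.k = k
        self.node_emb = nn.Linear(2, d, bias=False)       # x_i W_emb
        self.conv = nn.Conv1d(2, d, kernel_size=k + 1)    # Conv(X_i^{k-NN}), kernel size k+1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, 2) city coordinates
        b, n, _ = x.shape
        dist = torch.cdist(x, x)                              # (b, n, n) pairwise distances
        knn = dist.topk(self.k + 1, largest=False).indices    # each node itself + k nearest
        neigh = torch.gather(
            x.unsqueeze(1).expand(b, n, n, 2), 2,
            knn.unsqueeze(-1).expand(b, n, self.k + 1, 2)
        )                                                     # (b, n, k+1, 2) neighbor coords
        conv_in = neigh.reshape(b * n, self.k + 1, 2).transpose(1, 2)   # (b*n, 2, k+1)
        conv_out = self.conv(conv_in).squeeze(-1).reshape(b, n, -1)     # (b, n, d)
        return self.node_emb(x) + conv_out                    # Eq. (1)
```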

2.1.2 Encoder layer

The encoder has L identical encoder layers, and the first encoder layer takes \(\textbf{x}_i^\text {emb}\) from the CNN embedding layer as input. Each encoder layer has two sublayers: an MHSA sublayer and a point-wise FF sublayer.

MHSA sublayer. The input of the MHSA sublayer of the \(l^\text {th}\) encoder layer, \(\textbf{E}^{l-1}\), is the output of the \((l-1)^\text {th}\) encoder layer. The output of the MHSA sublayer of the \(l^\text {th}\) encoder layer, \(\hat{\textbf{E}}^l\), is obtained by first applying multi-head self-attention (\(\text {MH}^l\)) to \(\textbf{E}^{l-1}\), followed by a residual connection and batch normalization (\(\text {BN}^l\)). The function \(\text {MH}^l(Q, K, V)\) takes three inputs Q, K, and V, which represent the query, key, and value vectors, respectively, and performs the multi-head attention mechanism at the \(l^\text {th}\) encoder layer. The output of the MHSA sublayer, \(\hat{\textbf{E}}^l\), is formulated as follows:

$$\begin{aligned} \begin{aligned} \hat{\textbf{E}}^l = \text {BN}^l\left( \textbf{E}^{l-1} + \text {MH}^l\left( \textbf{E}^{l-1},\textbf{E}^{l-1},\textbf{E}^{l-1}\right) \right) , \end{aligned} \end{aligned}$$
(2)

where \(\textbf{E}^0 = \{\textbf{x}_f,\textbf{x}^{\text {emb}}_1,\dots ,\textbf{x}^{\text {emb}}_n\}\in \mathbb {R}^{(n+1)\times d}\). Here, \(\textbf{E}^0\), the input of the first encoder layer, is created by concatenating the start token \(\textbf{x}_f\) with \(\{\textbf{x}_1^\text {emb},\dots ,\textbf{x}_n^\text {emb}\}\). We add \(\textbf{x}_f\) as a virtual node feature vector that learns dependencies with the other node features, so that decoding can start at the best possible location [5].

Fig. 2 Proposed encoder architecture with the CNN embedding layer and L identical encoder layers

Point-wise FF sublayer. The point-wise FF sublayer in the \(l^\text {th}\) encoder layer, which is composed of two linear projections with a ReLU activation, takes \(\hat{\textbf{E}}^l\) as input. It applies the non-linear transformation followed by a residual connection and batch normalization and produces the output \(\textbf{E}^l\), which is denoted as:

$$\begin{aligned} \textbf{E}^l = \text {BN}^l \left( \hat{\textbf{E}}^l + \text {FF}^l \left( \hat{\textbf{E}}^l \right) \right) , \end{aligned}$$
(3)

where \(\text {FF}^l\) is the FF sublayer of the \(l^\text {th}\) encoder layer.

Encoder output. Let \(\textbf{e}_i^L\) be the encoder output of the \(i^\text {th}\) node at the \(L^\text {th}\) encoder layer and f be the index of the start token. The final encoder output of the \(L^\text {th}\) encoder layer, \(\textbf{E}^L=\{\textbf{e}_f^L,\textbf{e}_1^L,\dots ,\textbf{e}_n^L\}\), is fed into the decoder.
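As a concrete illustration of Eqs. (2)-(3), the following PyTorch sketch implements one encoder layer with multi-head self-attention, a point-wise feed-forward sublayer, and residual connections followed by batch normalization. The hyperparameter values follow Sec. 3.2; the module layout is an assumption on our part, not the authors' code.

```python
# Sketch of one encoder layer (Eqs. (2)-(3)).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d: int = 128, n_heads: int = 8, d_ff: int = 512):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.bn1 = nn.BatchNorm1d(d)
        self.bn2 = nn.BatchNorm1d(d)

    def _bn(self, bn: nn.BatchNorm1d, x: torch.Tensor) -> torch.Tensor:
        # BatchNorm1d expects (batch, channels, length), so transpose around it.
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, n+1, d) -- node embeddings plus the start token x_f
        attn, _ = self.mhsa(e, e, e)              # MH(E, E, E)
        e = self._bn(self.bn1, e + attn)          # Eq. (2): residual + BN
        e = self._bn(self.bn2, e + self.ff(e))    # Eq. (3): residual + BN
        return e
```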

2.2 Decoder

We perform decoding auto-regressively, one node at a time. The decoder is comprised of four layers, each of which is followed by a residual connection and layer normalization. Figure 3 illustrates the decoding process when the current time step (t) is 10 and the number of reference vectors (m) is three.

Fig. 3 Proposed decoder architecture, showing the decoding process when the current time step \(t=10\) and the number of reference vectors \(m=3\)

The first layer is the multi-head partial self-attention (MHPSA) layer, which extracts past information by performing attention over the nodes already visited in previous steps. Unlike previous works, we use fewer reference vectors for performance and computational efficiency. The second layer is a masked multi-head attention (MMHA) layer, which performs an attention mechanism where the query is the output of the MHPSA layer and the reference vectors are the encoder outputs of the unvisited nodes. The third layer is the point-wise FF layer, which performs a linear projection and non-linear activation, similar to the point-wise FF sublayer in the encoder. The pointer layer selects the next node to visit by computing a probability distribution over the unvisited nodes.

MHPSA layer. We are motivated by the observation that recently visited nodes are more relevant to the node to be selected in the current step than nodes visited earlier. Based on this intuition, the proposed partial self-attention attends only to recently visited nodes. We expect the proposed model to better learn local compositionality compared with previous works based on fully-connected attention.

MHPSA performs self-attention using the decoder input of the current time step t, \(\textbf{h}_t\), as the query and the decoder inputs of recently visited nodes as reference vectors. The decoder input \(\textbf{h}_t\) is calculated as follows:

$$\begin{aligned} \textbf{h}_t=\textbf{e}^L_{\pi _{t-1}} + \text {PE}_t, \end{aligned}$$
(4)

where \(\pi _0 = f\) and \(\text {PE}_t\) denotes positional encoding at time step t.

The proposed partial self-attention uses the decoder inputs of only the m most recently visited nodes as reference vectors. For instance, if the current time step is t, the decoder inputs used from time step \(t-m\) to t serve as reference vectors, denoted as \(\textbf{H}_t=\{\textbf{h}_{t-m},\dots ,\textbf{h}_t\}\in \mathbb {R}^{m\times d}\). Consequently, memory usage and computation time are significantly reduced. The output of the MHPSA layer, \(\hat{\textbf{h}}_t\), is calculated as follows:

$$\begin{aligned} \hat{\textbf{h}}_t = \text {LN} \left( \textbf{h}_t + \text {MH}^{L+1} \left( \textbf{h}_t,\textbf{H}_t,\textbf{H}_t \right) \right) , \end{aligned}$$
(5)

where \(\text {LN}\) refers to layer normalization.
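The MHPSA step (Eqs. (4)-(5)) can be sketched in PyTorch as follows, with the query taken from the current decoder input and the keys/values restricted to the decoder inputs of the most recently visited nodes; m = 3 follows the example in Fig. 3, and the tensor shapes and helper name `mhpsa_step` are illustrative assumptions.

```python
# Sketch of the MHPSA step (Eqs. (4)-(5)).
import torch
import torch.nn as nn

d, n_heads, m = 128, 8, 3
mhpsa = nn.MultiheadAttention(d, n_heads, batch_first=True)
ln = nn.LayerNorm(d)

def mhpsa_step(h_hist: torch.Tensor) -> torch.Tensor:
    """h_hist: (batch, t+1, d) decoder inputs h_0, ..., h_t, where each input is
    the encoder output of the previously selected node plus its positional
    encoding (Eq. (4))."""
    h_t = h_hist[:, -1:, :]           # query: current decoder input, (batch, 1, d)
    h_ref = h_hist[:, -(m + 1):, :]   # reference vectors {h_{t-m}, ..., h_t}
    attn, _ = mhpsa(h_t, h_ref, h_ref)
    return ln(h_t + attn)             # Eq. (5): residual + layer normalization
```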

MMHA layer. The masked multi-head attention layer performs an attention mechanism using \(\hat{\textbf{h}}_t\) as the query and \(\textbf{E}^L\) as the reference vectors. We use the same masked multi-head attention layer as [5]. Let \(\zeta _t\in \mathbb {R}^{n+1}\) be the mask applied to the attention weights, whose value is one for unvisited nodes and zero for visited nodes. Then, the output of the MMHA layer is calculated as follows:

$$\begin{aligned} \tilde{\textbf{h}}_t = \text {LN} \left( \hat{\textbf{h}}_t + \text {MMH}^{L+1} \left( \hat{\textbf{h}}_t,\textbf{E}^L,\textbf{E}^L,\zeta _ t\right) \right) , \end{aligned}$$
(6)

where \(\text {MMH}(Q, K, V, \zeta )\) is a modified version of \(\text {MH}(Q, K, V)\) that takes an additional mask \(\zeta \) as input and replaces the Attention function [17] with the \(\text {MaskedAttention}\) function, formulated as follows:

$$\begin{aligned} \text {MaskedAttention}(Q, K, V,\zeta ) = \text {softmax}\left( \frac{QK^\intercal }{\sqrt{d_k}}\odot \zeta \right) V, \end{aligned}$$

where \(\odot \) denotes element-wise product operation.
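A single-head sketch of the MaskedAttention function is shown below; the multi-head wrapper and projection matrices are omitted for brevity, and the tensor shapes are illustrative assumptions.

```python
# Sketch of MaskedAttention: attention logits are multiplied element-wise by
# the visit mask zeta before the softmax, following the formula above.
import math
import torch

def masked_attention(q, k, v, zeta):
    # q: (batch, 1, d); k, v: (batch, n+1, d); zeta: (batch, 1, n+1),
    # with ones for unvisited nodes and zeros for visited nodes.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores * zeta, dim=-1)
    return weights @ v
```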

Point-wise FF layer. The input of the point-wise FF layer is \(\tilde{\textbf{h}}_t\) and the output is \(\bar{\textbf{h}}_t\), which is denoted as:

$$\begin{aligned} \bar{\textbf{h}}_t = \text {LN} \left( \tilde{\textbf{h}}_t + \text {FF}^{L+1} \left( \tilde{\textbf{h}}_t \right) \right) . \end{aligned}$$
(7)

Pointer layer. The goal of the pointer layer is to compute a probability distribution over unvisited nodes. We perform single-head attention using \(\bar{\textbf{h}}_t\) as the query and \(\textbf{E}^L\) as the reference vectors. We use the attention weights as the probability distribution, \(p_t\), that determines the next node to visit; masking is used to avoid already visited nodes. Then, \(p_t\) is computed as:

$$\begin{aligned} p_t = \text {softmax} \left( c\cdot \tanh \left( \frac{qK^\intercal }{\sqrt{d}} \right) \odot \zeta _t \right) , \end{aligned}$$
(8)

where q and K are the query and reference vectors of the single-head attention and c is a hyperparameter that controls the range of the logits [9].
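The pointer layer of Eq. (8) can be sketched as follows, with c set to the clipping value of 10 used in Sec. 3.2; the function name and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the pointer layer (Eq. (8)): clipped, masked single-head
# attention logits turned into a probability distribution over the next node.
import math
import torch

def pointer_probs(q, K, zeta, c: float = 10.0):
    # q: (batch, 1, d); K: (batch, n+1, d); zeta: (batch, 1, n+1)
    logits = c * torch.tanh(q @ K.transpose(-2, -1) / math.sqrt(q.size(-1)))
    return torch.softmax(logits * zeta, dim=-1)   # p_t over unvisited nodes
```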

During the training phase, the decoder treats \(p_t\) as a categorical distribution for sampling node indices, while during the inference phase the node with the highest probability is selected. This process is repeated n times, resulting in a sequence of node indices \(\pmb {\pi }=\{\pi _1,\dots ,\pi _n\}\), which is the final output of the model.

2.3 Model training based on reinforcement learning

We train our model via reinforcement learning, using the tour length as a negative reward; the loss function is based on the average tour length. Let \(\theta \) denote the parameters of the training model. Then, \(p(\pmb {\pi };\theta )\) is the probability that the model generates \(\pmb {\pi }\), defined as:

$$\begin{aligned} p(\pmb {\pi };\theta ) = \prod ^n_{t=1} p(\pi _t|\pi _1,\dots ,\pi _{t-1};\theta ), \end{aligned}$$
(9)

where \(p(\pi _t |\pi _1,\dots ,\pi _{t-1};\theta )\) is the probability that \(\pi _t\) is chosen from \(p_t\) at time step t. We use the REINFORCE algorithm to update \(\theta \). A duplicate of \(\theta \), denoted \(\theta _b\), serves as the baseline. We denote by \(\pmb {\pi }'\) the index sequence that the model parameterized by \(\theta _b\) generates greedily. Let \(\ell (\pmb {\pi })\) and \(\ell (\pmb {\pi }')\) be the total tour lengths of the node sequences produced by the training and baseline models, respectively. Then, the REINFORCE loss \(L(\theta )\) is optimized by gradient descent using the REINFORCE algorithm:

$$\begin{aligned} L(\theta ) = \mathbb {E}_{\pmb {\pi }\sim p(\pmb {\pi };\theta )} \left[ \ell \left( \pmb {\pi } \right) -\ell \left( \pmb {\pi }' \right) \right] . \end{aligned}$$
(10)

The gradient of \(L(\theta )\) is computed as:

$$\begin{aligned} \nabla _\theta L(\theta ) \approx \sum _{\pmb {\pi }} \left( \ell \left( \pmb {\pi } \right) - \ell \left( \pmb {\pi }' \right) \right) \nabla _\theta \log p(\pmb {\pi };\theta ). \end{aligned}$$
(11)

We optimize the training model using \(\nabla _\theta L(\theta )\) during each epoch. At the end of every epoch, we compare the average tour lengths of the training and baseline models and copy \(\theta \) to \(\theta _b\) if the average tour length of the training model is shorter than that of the baseline model.
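The training step of Eqs. (9)-(11) can be sketched as a standard REINFORCE update with a greedy rollout baseline. Here, `model.rollout`, `baseline.rollout`, and `tour_lengths` (a batched variant of the tour-length function sketched in Sec. 2) are hypothetical helpers, not the authors' API.

```python
# Sketch of one REINFORCE update (Eqs. (10)-(11)) with a greedy baseline.
import torch

def reinforce_step(model, baseline, coords, optimizer):
    # rollout() is assumed to return tours and per-step log-probabilities.
    tours, log_probs = model.rollout(coords, decode="sample")      # pi ~ p(pi; theta)
    with torch.no_grad():
        base_tours, _ = baseline.rollout(coords, decode="greedy")  # pi'
    advantage = tour_lengths(coords, tours) - tour_lengths(coords, base_tours)
    # Surrogate loss whose gradient matches Eq. (11).
    loss = (advantage.detach() * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```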

3 Experiment

3.1 Datasets

Random dataset. We assume a 2D planar symmetric TSP, where the distance between two cities is the same in both directions. For model training and validation, we use data randomly generated on the fly from a uniform distribution over \(\left[ 0, 1 \right] \times \left[ 0, 1 \right] \). We generated 10,000 test instances. We trained and tested on TSP problems from \(n = 50\) (TSP50) to 200 nodes (TSP200). The tours produced by Concorde [6] are treated as the exact solutions and serve as the labels of the test instances.

TSPLIB dataset. TSPLIB [34] is a widely used benchmark dataset for evaluating the performance of TSP solvers on a variety of real-world data with varying distributions. We choose ten 2D-Euclidean problem instances from TSPLIB that are considered relatively difficult. Let N and A be the number of nodes and the square area covered by the nodes, respectively [16]. Then, we use the critical parameter \(\frac{l}{\sqrt{N\cdot A}}\), where l is the optimal tour length, to evaluate the difficulty level of a TSPLIB instance [35, 36]. A critical parameter value close to 0.75 indicates a higher difficulty level. We normalize each TSPLIB instance so that all instances lie in \(\left[ 0, 1 \right] \times \left[ 0, 1 \right] \).
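Under the assumption that the "square area covered by the nodes" A is the bounding-box area of the instance, the critical parameter can be computed with a short sketch such as the following; `coords` and `l_opt` are illustrative names for the instance coordinates and its optimal tour length.

```python
# Sketch of the difficulty measure l / sqrt(N * A) used to select TSPLIB
# instances (bounding-box interpretation of A is our assumption).
import numpy as np

def critical_parameter(coords: np.ndarray, l_opt: float) -> float:
    n = len(coords)
    spans = coords.max(axis=0) - coords.min(axis=0)  # bounding-box side lengths
    area = float(spans[0] * spans[1])                # assumed area A covered by the nodes
    return l_opt / np.sqrt(n * area)                 # values close to 0.75 are harder
```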

3.2 Hyperparameters and decoding strategy

Hyperparameters. We did not tune the hyperparameters of the proposed model and use the same hyperparameters for all TSP problem sizes. The proposed model has six encoder layers (\(L=6\)) with \(k=10\), so the kernel size of the CNN embedding layer is 11. The model has 128 hidden dimensions (\(d=128\)), and the hidden dimension of each point-wise FF layer (and sublayer) is set to 512. The model has eight attention heads. The logit-range clipping value c in the decoder is set to 10.

For model training, we use the Adam optimizer with a fixed learning rate of 0.0001 and a batch size of 512. We train for 100 epochs; further increasing the number of epochs could improve the model's performance. Our experiments were conducted on an AMD EPYC 7513 32-core processor and a single Nvidia A6000 GPU.

Decoding strategy. At test time, we employ both greedy decoding and beam search. Beam search [37] is a breadth-first search strategy that keeps the top-B candidates at every decoding step and chooses the best solution at the end of decoding. We set the beam width (B) to 2,500 in order to compare our results with other SOTA models [5] that use the same beam width.
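A minimal sketch of beam-search decoding as described above is given below; `step_log_probs` is a hypothetical helper that returns the model's log-probabilities over unvisited next nodes for a partial tour, and starting at node 0 is an illustrative simplification.

```python
# Sketch of beam-search decoding with beam width B.
def beam_search(n: int, step_log_probs, beam_width: int = 2500):
    beams = [([0], 0.0)]                        # (partial tour, accumulated log-probability)
    for _ in range(n - 1):
        candidates = []
        for tour, score in beams:
            for node, logp in step_log_probs(tour).items():
                candidates.append((tour + [node], score + logp))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_width]         # keep the top-B partial tours
    return max(beams, key=lambda b: b[1])[0]    # best complete tour at the end
```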

Table 1 Comparison of Transformer-based models for TSP: Kool et al. [4], TSP Transformer [5], Tspformer [25], and our model. Here, MHSA and MMHA denote Multi-Head Self-Attention and Masked Multi-Head Attention, respectively. MHSA in the decoder was not used in Kool et al. [4]

3.3 List of Experiments

In our study, we conducted four experiments to evaluate the proposed model's performance. The third experiment used both the random dataset and the TSPLIB dataset, while the other experiments used only the random dataset. For Experiments 1 and 2, we employed greedy decoding to construct the tours.

Table 2 Average tour length (Obj.) and the optimality gap (Gap) of our model with various m (number of reference vectors) values for various TSP instance sizes using the random datasets. The greedy decoder is used for this experiment

Experiment 1

We test whether the proposed partial self-attention in the decoder is more effective at improving the model's performance than the fully-connected self-attention in the standard Transformer decoder, by decreasing the number of reference vectors (m) from 200 to 5.

Experiment 2

We perform an ablation study in which the CNN embedding layer is removed from our CNN-Transformer model, to test whether it is effective in extracting local spatial information and improving output performance.

Experiment 3

We compare the proposed model with other solvers, including the exact solver Concorde [6], heuristic solvers such as 2-opt search [39], Monte Carlo Tree Search (MCTS) [39], and Google OR-Tools [3], and other SOTA Transformer-based models for the TSP [4, 5, 25, 30]. Table 1 summarizes the main differences between the proposed model and other Transformer-based models. Here, 'node' means the hidden feature vector of a particular node generated by the encoder or MHSA layer of each model, and 'graph' is the average of the hidden feature vectors of all nodes.

Experiment 4

We compare the computational complexity of the proposed CNN-Transformer model with that of other Transformer-based models.

3.4 Metrics

The performance of the model was evaluated using the following metrics.

Average predicted tour length. Let \(\hat{l}_i^\text {TSP}\) be the predicted tour length of the \(i^\text {th}\) instance. Then, the average tour length is computed as \(\frac{1}{n}\sum _{i=1}^n \hat{l}_i^\text {TSP}\), where n is the total number of test instances. Here, we set n to 10,000.

Optimality gap. The optimality gap is the average relative gap between the predicted tour length and the optimal tour length, computed as \(\frac{1}{n}\sum _{i=1}^n \left( \frac{\hat{l}_i^\text {TSP}}{l_i^\text {TSP}}-1 \right) \), where n is the total number of test instances and \(l_i^\text {TSP}\) is the optimal tour length of the \(i^\text {th}\) instance produced by Concorde [6]. We set n to 10,000.
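Both evaluation metrics reduce to simple averages over the test set, as the following sketch shows; `pred` and `opt` are illustrative names for the arrays of predicted and Concorde-optimal tour lengths.

```python
# Sketch of the two evaluation metrics over the 10,000 test instances.
import numpy as np

def average_tour_length(pred: np.ndarray) -> float:
    return float(pred.mean())

def optimality_gap(pred: np.ndarray, opt: np.ndarray) -> float:
    return float((pred / opt - 1.0).mean())  # multiply by 100 for the percentage in the tables
```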

Training time. The training time is measured as the time taken to train on 10,000 instances with a batch size of 512.

Inference time. We report the inference time taken to solve the test set of 100 instances of TSP100 and TSP200. The beam width (B) was set to 2,500 and the batch size to one owing to memory capacity limitations.

GPU memory usage. The maximal memory usage of the training process is measured.

4 Results

4.1 Experiment 1

Table 2 presents the average tour length and optimality gap of our model for various m values. The proposed model performed best when m was 100 for TSP150 and 20 for TSP200. Our experimental results show that the optimal m value changes with problem size. For TSP100, the proposed model using partial self-attention with \(m=5\) achieved an optimality gap of 2.83%, whereas the model using fully-connected attention, i.e., \(m=100\), produced an optimality gap of 3.10%.

Table 3 Ablation study
Table 4 Average tour length (Obj.) and optimality gap (Gap) of various models and solvers for various TSP instance sizes using the random datasets. Results with * are reported from other papers. In the type column, H: Heuristic, RL: Reinforcement learning, G: Greedy, S: Sampling and BS: Beam search

4.2 Experiment 2

Table 3 presents the results of the ablation study. Both models use greedy decoding. For all TSP problem sizes, the proposed model outperforms its variant without the CNN embedding layer; removing the CNN embedding layer degrades overall performance. We therefore conclude that the CNN embedding layer in our model is effective in extracting spatial features and improving output performance.

4.3 Experiment 3

Random dataset. Table 4 compares the performance of the proposed model with other SOTA Transformer-based models. The table is divided into three sections: exact solver, heuristics, and deep learning models. Concorde [6], which is known to produce optimal results, shows the best performance. Among the deep learning models, the proposed model performs best with both greedy and beam search decoding. Our model significantly outperformed MCTS [39] and the 2-opt search method [39] in performance and generalization. For TSP50, the proposed model is behind the exact solver Concorde by only a 0.1% optimality gap. With beam search, our model reduces the optimality gap from 0.11% to 0.10% for TSP50 and from 1.26% to 1.11% for TSP100 compared with the TSP Transformer.

The proposed model shows competitive performance compared with H-TSP on the random datasets: it performed better for TSP50 and TSP100, whereas H-TSP performed better for TSP200. The reason is that H-TSP is a model specialized for large-scale TSP instances and shows outstanding performance on them.

TSPLIB dataset. Table 5 presents the performance of our model and other Transformer-based models on various TSPLIB instances. All models were trained on TSP50. The proposed model gives consistent results across TSPLIB instances and outperforms the TSP Transformer on most of them, so partial self-attention is also effective on real-world datasets. The proposed model outperforms H-TSP on all TSPLIB instances and outperforms the TSP Transformer on all instances except one, rd100.

The two most difficult TSPLIB instances are kroC100 and berlin52. berlin52 is known to be a hard TSP instance because many nodes are densely clustered in very small regions. For these very hard real-world instances, the proposed model showed the best performance among the compared models. Figure 4 displays the outputs of Concorde and the various models on kroC100 and berlin52. For kroC100, the optimal tour length found by Concorde is 20,749; the tour length of our model is 21,523, while that of the TSP Transformer is 21,788. For berlin52, the tour length of our model is 7,610, whereas that of the TSP Transformer is 7,637.

Table 5 Tour length (Obj.) and optimality gap (Gap) for the TSPLIB instances. Critical parameter values \(\frac{l}{\sqrt{N\cdot A}}\) close to 0.75 indicate harder TSP instances. The greedy decoder is used for this experiment
Fig. 4 Output tours of Concorde [6], TSP Transformer [5], Tspformer [25], H-TSP [30], and our model for (a) kroC100 and (b) berlin52 using the TSPLIB dataset

4.4 Experiment 4

Table 6 summarizes the overall model complexity and runtimes of various Transformer-based models for TSP100 and TSP200. The proposed model shows competitive GPU memory usage, training time, and inference time compared with the other models. The computational complexity of H-TSP is the highest among the compared models: it requires an additional warm-up stage in which the lower-level model is pre-trained and has a rather complex training process, so it demands substantial computing resources and significant training time.

H-TSP has 3.7 times more model parameters, and its training time was approximately 68 times longer than our model's for TSP200. The proposed model has more parameters than the TSP Transformer because of the added CNN embedding layer. However, our model consumes 17.8% and 19.8% less GPU memory for TSP100 and TSP200, respectively. Compared with the Tspformer model of Yang et al. [25], our model exhibits slightly higher computational complexity.

Table 6 Comparison of model parameters, GPU memory usage, and training (T) time for 10,000 TSP instances and inference (I) time for 100 TSP instances of TSP100 and TSP200. G: Greedy and BS: Beam search

5 Discussion

Recent studies have applied heuristic search algorithms such as Monte Carlo Tree Search [40, 41] or 2-opt search [21, 22] to further enhance the quality of TSP solutions. The proposed model uses beam search decoding, but other heuristic search algorithms for the TSP can be combined with it to further enhance performance. We also observed that the shortest-tour heuristic proposed in [11], which selects the shortest tour among the set of B complete tours, also improves the output performance of the model.

Traditional optimization-based solvers, such as Concorde [6], still outperform neural network models in terms of solution quality. However, neural network models exhibit faster inference times than Concorde. For example, Concorde takes approximately 16 seconds to solve 100 instances of TSP100, whereas the proposed model takes approximately 11 seconds.

Our results show that the proposed model significantly reduces memory consumption and inference time. Our model is still based on a standard Transformer with multiple Transformer blocks, to which we add a CNN embedding layer to extract spatial features. We plan to apply various lightweight techniques proposed for Transformer-based models [26, 27, 28, 42, 29] and for CNNs [43, 44, 45, 46].

The Star-Transformer model, based on a star-shaped topology, has been shown to be an effective lightweight technique for many NLP tasks [26]. We plan to achieve further performance enhancement by introducing relay nodes, as used in the Star-Transformer encoder, into our encoder. Recent work has shown that ProbSparse self-attention is also effective at making Transformer-based models lightweight [29]; we plan to improve the proposed CNN-Transformer model by introducing ProbSparse self-attention as well. Finally, the proposed CNN embedding layer performs better than the linear embedding layer but requires more computation. In a subsequent study, we plan to further lighten the CNN embedding layer using lightweight techniques from MobileNet [43, 44].

6 Conclusion

We proposed the first CNN-Transformer model based on partial self-attention in the decoder. Our model extracts and learns spatial features from the input data and produces better TSP solutions than standard Transformer-based models. Our results show that our model learns local compositionality better than the standard Transformer model, and that applying partial self-attention in the decoder significantly reduces GPU memory usage and inference time. The proposed CNN embedding layer and partial self-attention improve performance on both random and real-world datasets. We found that fully-connected self-attention in the decoder can degrade performance and that attending only to recently visited nodes in the decoder can yield better performance than fully-connected attention models. The proposed model outperforms existing SOTA Transformer-based models in various respects and shows the best performance on real-world datasets.