Abstract
Routing problems are a class of combinatorial problems with many practical applications. Recently, end-to-end deep learning methods have been proposed to learn approximate solution heuristics for such problems. In contrast, classical dynamic programming (DP) algorithms guarantee optimal solutions, but scale badly with the problem size. We propose Deep Policy Dynamic Programming (DPDP), which aims to combine the strengths of learned neural heuristics with those of DP algorithms. DPDP prioritizes and restricts the DP state space using a policy derived from a deep neural network, which is trained to predict edges from example solutions. We evaluate our framework on the travelling salesman problem (TSP), the vehicle routing problem (VRP) and TSP with time windows (TSPTW) and show that the neural policy improves the performance of (restricted) DP algorithms, making them competitive to strong alternatives such as LKH, while also outperforming most other ‘neural approaches’ for solving TSPs, VRPs and TSPTWs with 100 nodes.
Notes
- 1. If we have multiple partial solutions with the same state and cost, we can arbitrarily choose one to dominate the other(s), for example the one with the lowest index of the current node.
- 2. E.g., arriving at node i at \(t = 10\) is not feasible if node j has \(u_j = 12\) and \(c_{ij} = 3\).
- 3.
- 4.
- 5. The running time of 4000 h (167 days) is estimated from 24 min/instance [43].
- 6. For example, three nodes with a demand of two cannot be assigned to two routes with a capacity of three.
- 7. Up to a limit, as making the time windows infinitely large reduces the problem to plain TSP.
- 8.
- 9. For the symmetric TSP and VRP, we add \(\textsc {knn}\) edges in both directions. For the VRP, we also connect each node to the depot (and vice versa) to ensure feasibility.
- 10.
- 11. If all time windows are disjoint, there is only one feasible solution. Therefore, the amount of overlap in time windows determines to some extent the ‘branching factor’ of the problem and its difficulty.
- 12. Serving 100 customers in a 100\(\,\times \,\)100 grid, we empirically find that the total schedule duration including waiting (the makespan) is around 5000.
- 13. For efficiency, we use a custom function similar to torch.unique, and argsort the returned inverse, after which the resulting permutation is applied to all variables in the beam.
- 14.
- 15. Unless we have multiple expansions with the same costs, in which case we can pick one arbitrarily.
- 16. This may give slightly different results if the scoring function is inconsistent with the domination rules, i.e. if a better-scoring solution would be dominated by a worse-scoring solution but is not, since that solution is removed using the score bound before the dominances are checked.
References
Accorsi, L., Vigo, D.: A fast and scalable heuristic for the solution of large-scale capacitated vehicle routing problems. Transp. Sci. 55(4), 832–856 (2021)
Applegate, D., Bixby, R., Chvatal, V., Cook, W.: Concorde TSP Solver (2006). http://www.math.uwaterloo.ca/tsp/concorde
Bai, R., et al.: Analytics and machine learning in vehicle routing research. arXiv preprint arXiv:2102.10012 (2021)
Bellman, R.: On the theory of dynamic programming. Proc. Natl. Acad. Sci. U.S.A. 38(8), 716 (1952)
Bellman, R.: Dynamic programming treatment of the travelling salesman problem. J. ACM (JACM) 9(1), 61–63 (1962)
Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S.: Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016)
Bertsekas, D.: Dynamic Programming and Optimal Control, vol. 1. Athena Scientific (2017)
Cappart, Q., Moisan, T., Rousseau, L.M., Prémont-Schwarz, I., Cire, A.: Combining reinforcement learning and constraint programming for combinatorial optimization. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)
Chen, X., Tian, Y.: Learning to perform local rewriting for combinatorial optimization. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 6281–6292 (2019)
Cook, W., Seymour, P.: Tour merging via branch-decomposition. INFORMS J. Comput. 15(3), 233–248 (2003)
da Costa, P.R.d.O., Rhuggenaath, J., Zhang, Y., Akcay, A.: Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian Conference on Machine Learning (ACML) (2020)
Da Silva, R.F., Urrutia, S.: A general VNS heuristic for the traveling salesman problem with time windows. Discret. Optim. 7(4), 203–211 (2010)
Daumé III, H., Marcu, D.: Learning as search optimization: approximate large margin methods for structured prediction. In: International Conference on Machine Learning (ICML), pp. 169–176 (2005)
Delarue, A., Anderson, R., Tjandraatmadja, C.: Reinforcement learning with combinatorial actions: an application to vehicle routing. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33 (2020)
Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., Rousseau, L.-M.: Learning heuristics for the TSP by policy gradient. In: van Hoeve, W.-J. (ed.) CPAIOR 2018. LNCS, vol. 10848, pp. 170–181. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93031-2_12
Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959)
Dumas, Y., Desrosiers, J., Gelinas, E., Solomon, M.M.: An optimal algorithm for the traveling salesman problem with time windows. Oper. Res. 43(2), 367–371 (1995)
Falkner, J.K., Schmidt-Thieme, L.: Learning to solve vehicle routing problems with time windows through joint attention. arXiv preprint arXiv:2006.09100 (2020)
Fu, Z.H., Qiu, K.B., Zha, H.: Generalize a small pre-trained model to arbitrarily large TSP instances. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)
Gao, L., Chen, M., Chen, Q., Luo, G., Zhu, N., Liu, Z.: Learn to design the heuristics for vehicle routing problem. In: International Workshop on Heuristic Search in Industry (HSI) at the International Joint Conference on Artificial Intelligence (IJCAI) (2020)
Gasse, M., Chetelat, D., Ferroni, N., Charlin, L., Lodi, A.: Exact combinatorial optimization with graph convolutional neural networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
Gromicho, J., van Hoorn, J.J., Kok, A.L., Schutten, J.M.: Restricted dynamic programming: a flexible framework for solving realistic VRPs. Comput. Oper. Res. 39(5), 902–909 (2012)
Gromicho, J.A., Van Hoorn, J.J., Saldanha-da Gama, F., Timmer, G.T.: Solving the job-shop scheduling problem optimally by dynamic programming. Comput. Oper. Res. 39(12), 2968–2977 (2012)
Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2021). https://www.gurobi.com
van Heeswijk, W., La Poutré, H.: Approximate dynamic programming with neural networks in linear discrete action spaces. arXiv preprint arXiv:1902.09855 (2019)
Held, M., Karp, R.M.: A dynamic programming approach to sequencing problems. J. Soc. Ind. Appl. Math. 10(1), 196–210 (1962)
Helsgaun, K.: An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems: Technical report (2017)
van Hoorn, J.J.: Dynamic programming for routing and scheduling. Ph.D. thesis (2016)
Hottung, A., Bhandari, B., Tierney, K.: Learning a latent search space for routing problems using variational autoencoders. In: International Conference on Learning Representations (ICLR) (2021)
Hottung, A., Tierney, K.: Neural large neighborhood search for the capacitated vehicle routing problem. In: European Conference on Artificial Intelligence (ECAI) (2020)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML), pp. 448–456 (2015)
Joshi, C.K., Laurent, T., Bresson, X.: An efficient graph convolutional network technique for the travelling salesman problem. In: INFORMS Annual Meeting (2019)
Joshi, C.K., Laurent, T., Bresson, X.: On learning paradigms for the travelling salesman problem. In: Graph Representation Learning Workshop at Neural Information Processing Systems (NeurIPS) (2019)
Kim, M., Park, J., Kim, J.: Learning collaborative policies to solve NP-hard routing problems. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
Kok, A., Hans, E.W., Schutten, J.M., Zijm, W.H.: A dynamic programming heuristic for vehicle routing with time-dependent travel times and required breaks. Flex. Serv. Manuf. J. 22(1–2), 83–108 (2010)
Kool, W., van Hoof, H., Welling, M.: Attention, learn to solve routing problems! In: International Conference on Learning Representations (ICLR) (2019)
Kwon, Y.D., Choo, J., Kim, B., Yoon, I., Gwon, Y., Min, S.: POMO: policy optimization with multiple optima for reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
Laporte, G.: The vehicle routing problem: an overview of exact and approximate algorithms. Eur. J. Oper. Res. (EJOR) 59(3), 345–358 (1992)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., Teh, Y.W.: Set transformer: a framework for attention-based permutation-invariant neural networks. In: International Conference on Machine Learning (ICML), pp. 3744–3753. PMLR (2019)
Li, S., Yan, Z., Wu, C.: Learning to delegate for large-scale vehicle routing. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
Li, Z., Chen, Q., Koltun, V.: Combinatorial optimization with graph convolutional networks and guided tree search. In: Advances in Neural Information Processing Systems (NeurIPS), p. 539 (2018)
Lu, H., Zhang, X., Yang, S.: A learning-based iterative method for solving vehicle routing problems. In: International Conference on Learning Representations (2020)
Ma, Q., Ge, S., He, D., Thaker, D., Drori, I.: Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: AAAI International Workshop on Deep Learning on Graphs: Methodologies and Applications (DLGMA) (2020)
Ma, Y., et al.: Learning to iteratively solve routing problems with dual-aspect collaborative transformer. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
Malandraki, C., Dial, R.B.: A restricted dynamic programming heuristic algorithm for the time dependent traveling salesman problem. Eur. J. Oper. Res. (EJOR) 90(1), 45–55 (1996)
Mazyavkina, N., Sviridov, S., Ivanov, S., Burnaev, E.: Reinforcement learning for combinatorial optimization: a survey. arXiv preprint arXiv:2003.03600 (2020)
Mingozzi, A., Bianco, L., Ricciardelli, S.: Dynamic programming strategies for the traveling salesman problem with time window and precedence constraints. Oper. Res. 45(3), 365–377 (1997)
Nair, V., et al.: Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349 (2020)
Nazari, M., Oroojlooy, A., Snyder, L., Takac, M.: Reinforcement learning for solving the vehicle routing problem. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 9860–9870 (2018)
Novoa, C., Storer, R.: An approximate dynamic programming approach for the vehicle routing problem with stochastic demands. Eur. J. Oper. Res. (EJOR) 196(2), 509–515 (2009)
Nowak, A., Villar, S., Bandeira, A.S., Bruna, J.: A note on learning algorithms for quadratic assignment with graph neural networks. In: Principled Approaches to Deep Learning Workshop at the International Conference on Machine Learning (ICML) (2017)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 8026–8037 (2019)
Peng, B., Wang, J., Zhang, Z.: A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems. In: Li, K., Li, W., Wang, H., Liu, Y. (eds.) ISICA 2019. CCIS, vol. 1205, pp. 636–650. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-5577-0_51
Ropke, S., Pisinger, D.: An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. Transp. Sci. 40(4), 455–472 (2006)
Schrimpf, G., Schneider, J., Stamm-Wilbrandt, H., Dueck, G.: Record breaking optimization results using the ruin and recreate principle. J. Comput. Phys. 159(2), 139–171 (2000)
Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
Sun, Y., Ernst, A., Li, X., Weiner, J.: Generalization of machine learning for problem reduction: a case study on travelling salesman problems. OR Spectr. 43(3), 607–633 (2020). https://doi.org/10.1007/s00291-020-00604-x
Toth, P., Vigo, D.: Vehicle Routing: Problems, Methods, and Applications. SIAM (2014)
Uchoa, E., Pecin, D., Pessoa, A., Poggi, M., Vidal, T., Subramanian, A.: New benchmark instances for the capacitated vehicle routing problem. Eur. J. Oper. Res. (EJOR) 257(3), 845–858 (2017)
Vesselinova, N., Steinert, R., Perez-Ramirez, D.F., Boman, M.: Learning combinatorial optimization on graphs: a survey with applications to networking. IEEE Access 8, 120388–120416 (2020)
Vidal, T.: Hybrid genetic search for the CVRP: open-source implementation and swap* neighborhood. arXiv preprint arXiv:2012.10384 (2020)
Vidal, T., Crainic, T.G., Gendreau, M., Lahrichi, N., Rei, W.: A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Oper. Res. 60(3), 611–624 (2012)
Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2692–2700 (2015)
Wiseman, S., Rush, A.M.: Sequence-to-sequence learning as beam-search optimization. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1296–1306 (2016)
Wu, Y., Song, W., Cao, Z., Zhang, J., Lim, A.: Learning improvement heuristics for solving routing problems. IEEE Trans. Neural Netw. Learn. Syst. (2021)
Xin, L., Song, W., Cao, Z., Zhang, J.: Step-wise deep learning models for solving routing problems. IEEE Trans. Ind. Inform. (2020)
Xin, L., Song, W., Cao, Z., Zhang, J.: NeuroLKH: combining deep learning model with Lin-Kernighan-Helsgaun heuristic for solving the traveling salesman problem. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
Xu, S., Panwar, S.S., Kodialam, M., Lakshman, T.: Deep neural network approximated dynamic programming for combinatorial optimization. In: AAAI Conference on Artificial Intelligence (AAAI), vol. 34, pp. 1684–1691 (2020)
Yang, F., Jin, T., Liu, T.Y., Sun, X., Zhang, J.: Boosting dynamic programming with neural networks for solving NP-hard problems. In: Asian Conference on Machine Learning (ACML), pp. 726–739. PMLR (2018)
Acknowledgement
We would like to thank Jelke van Hoorn and Johan van Rooij for helpful discussions, and the anonymous reviewers for their helpful suggestions. This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.
Appendices
Appendix 1 The Graph Neural Network Model
For the TSP, we use the exact model from [32], which we describe here to keep this work self-contained. The model uses node and edge input features, which are transformed into initial representations of the nodes and edges. These representations are then updated sequentially using a number of graph convolutional layers, which exchange information between nodes and edges, after which the final edge representation is used to predict whether the edge is part of the optimal solution.
Input Features and Initial Representation. The model uses input features for the nodes, consisting of the (x, y)-coordinates, which are then projected into H-dimensional initial embeddings \(\mathbf {x}_i^{0}\) (\(H = 300\)). The initial edge features \(\mathbf {e}_{ij}^{0}\) are a concatenation of a \(\frac{H}{2}\)-dimensional projection of the cost (Euclidean distance) \(c_{ij}\) from i to j, and a \(\frac{H}{2}\)-dimensional embedding of the edge type: 0 for normal edges, 1 for edges connecting K-nearest neighbors (\(K = 20\)) and 2 for self-loop edges connecting a node to itself (which are added for ease of implementation).
Graph Convolutional Layers. In each of the \(L = 30\) layers of the model, the node and edge representations \(\mathbf {x}_i^{\ell }\) and \(\mathbf {e}_{ij}^{\ell }\) get updated into \(\mathbf {x}_i^{\ell + 1}\) and \(\mathbf {e}_{ij}^{\ell +1}\) [32]:

$$\mathbf{x}_i^{\ell+1} = \mathbf{x}_i^{\ell} + \text{ReLU}\left(\text{BN}\left(W_1 \mathbf{x}_i^{\ell} + \sum_{j \in \mathcal{N}(i)} \frac{\sigma(\mathbf{e}_{ij}^{\ell})}{\sum_{j' \in \mathcal{N}(i)} \sigma(\mathbf{e}_{ij'}^{\ell})} \odot W_2 \mathbf{x}_j^{\ell}\right)\right)$$

$$\mathbf{e}_{ij}^{\ell+1} = \mathbf{e}_{ij}^{\ell} + \text{ReLU}\left(\text{BN}\left(W_3 \mathbf{e}_{ij}^{\ell} + W_4 \mathbf{x}_i^{\ell} + W_5 \mathbf{x}_j^{\ell}\right)\right)$$
Here \(\mathcal {N}(i)\) is the set of neighbors of node i (in our case all nodes, including i, as we use a fully connected input graph), \(\odot \) is the element-wise product and \(\sigma \) is the sigmoid function, applied element-wise to the vector \(\mathbf {e}_{ij}^{\ell }\). \(\text {ReLU}(\cdot ) = \max (\cdot , 0)\) is the rectified linear unit and \(\text {BN}\) represents batch normalization [31]. \(W_1, W_2, W_3, W_4\) and \(W_5\) are trainable parameter matrices, where we fix \(W_4 = W_5\) for the symmetric TSP.
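As an illustration, one such layer can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the class and variable names are ours, and the normalization of the edge gates over the neighbors follows the description above.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolutional layer updating node and edge embeddings (sketch)."""

    def __init__(self, hidden_dim):
        super().__init__()
        # W1..W5 from the update equations; for the symmetric TSP one would
        # tie the parameters, e.g. self.W5 = self.W4
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W3 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W4 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W5 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.bn_node = nn.BatchNorm1d(hidden_dim)
        self.bn_edge = nn.BatchNorm1d(hidden_dim)

    def forward(self, x, e):
        # x: (n, H) node embeddings, e: (n, n, H) edge embeddings
        n, H = x.shape
        gates = torch.sigmoid(e)  # sigma(e_ij), element-wise
        gates = gates / (gates.sum(dim=1, keepdim=True) + 1e-10)  # normalize over j
        agg = (gates * self.W2(x).unsqueeze(0)).sum(dim=1)  # sum_j eta_ij ⊙ W2 x_j
        x_new = x + torch.relu(self.bn_node(self.W1(x) + agg))
        # edge update: W3 e_ij + W4 x_i + W5 x_j, broadcast over the (n, n) grid
        e_in = self.W3(e) + self.W4(x).unsqueeze(1) + self.W5(x).unsqueeze(0)
        e_new = e + torch.relu(self.bn_edge(e_in.reshape(-1, H)).reshape(n, n, H))
        return x_new, e_new

# usage with small dimensions (the paper uses H = 300, L = 30 stacked layers)
layer = GCNLayer(8)
x_out, e_out = layer(torch.randn(5, 8), torch.randn(5, 5, 8))
```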
Output Prediction. After L layers, the final prediction \(h_{ij} \in (0,1)\) is made independently for each edge (i, j) using a multi-layer perceptron (MLP), which takes \(\mathbf {e}_{ij}^{L}\) as input and has two H-dimensional hidden layers with \(\text {ReLU}\) activation and a 1-dimensional output layer, with sigmoid activation. We interpret \(h_{ij}\) as the predicted probability that the edge (i, j) is part of the optimal solution, which indicates how promising this edge is when searching for the optimal solution.
Training. For the TSP, the model is trained on a dataset of 1 million optimal solutions, found using Concorde [2], for randomly generated TSP instances. The training loss is a weighted binary cross-entropy loss that maximizes the prediction quality when \(h_{ij}\) is compared to the ground-truth optimal solution. Generating the dataset takes between half a day and a few days (depending on the number of CPU cores), and training the model takes a few days on one or multiple GPUs, but both are only required once for a desired data distribution.
1.1 Predicting Directed Edges for the TSPTW
The TSP is an undirected problem, so the neural network implementation (Footnote 10) by [32] shares the parameters \(W_4^l\) and \(W_5^l\) in Eq. (3), i.e. \(W_4^l = W_5^l\), resulting in \(\mathbf {e}_{ij}^l = \mathbf {e}_{ji}^l\) for all layers l, as for \(l = 0\) both directions are initialized the same. While the VRP is also an undirected problem, the TSPTW is directed, as the direction of the route determines the arrival times at the different nodes. To allow the model to make different predictions for the two directions, we implement \(W_5^l\) as a separate parameter, such that the model can have different representations for the edges (i, j) and (j, i). We define the training labels accordingly for directed edges: if edge (i, j) is in the directed solution, it will have label 1, whereas the edge (j, i) will not (for the undirected TSP and VRP, both labels are 1).
1.2 Dataset Generation for the TSPTW
We found that, using our DP formulation for the TSPTW, the instances by [8] were all solved optimally, even with a very small beam size (around 10). This is because there is very little overlap in the time windows as a result of the way they are generated, so very few actions are feasible, as most actions would ‘skip over other time windows’ (advance the time so much that other nodes can no longer be served) (Footnote 11). We conducted some quick experiments with a weaker DP formulation that only checks whether actions directly violate time windows, but does not check whether an action causes other nodes to become unreachable within their time windows. Using this formulation, the DP algorithm can run into many dead ends if just a single node gets skipped, and using the GNN policy (compared to a cost-based policy as in Sect. 4.4) made the difference between good solutions and no solution being found at all.
We made two changes to the data generation procedure by [8] to increase the difficulty and make it similar to [12], defining the ‘large time window’ dataset. First, we sample the time windows around the arrival times obtained when visiting the nodes in a random order without any waiting time, which differs from [8], who ‘propagate’ the waiting time (resulting from the sampled time windows). Our modification causes a tighter schedule with more overlap in time windows, and is similar to [12]. Secondly, we increase the maximum time window size from 100 to 1000, which means the time windows are on the order of 10% of the horizon (Footnote 12). This doubles the maximum time window size of 500 used by [12] for instances with 200 nodes, to compensate for having half the number of nodes that can possibly overlap a time window.
To generate the training data, for practical reasons we used DP with the heuristic ‘cost heat + potential’ strategy and a large beam size (1M), which in many cases results in optimal solutions being found.
Appendix 2 Implementation
We implement the dynamic programming algorithm on the GPU using PyTorch [53]. While PyTorch is mostly used as a deep learning framework, it can also be used to speed up generic (vectorized) computations.
2.1 Beam Variables
For each solution in the beam, we keep track of the following variables (storing them for all solutions in the beam as a vector): the cost, current node, visited nodes and (for VRP) the remaining capacity or (for TSPTW) the current time. As explained, these variables can be computed incrementally when generating expansions. Additionally, we keep a variable vector parent, which, for each solution in the current beam, tracks the index of the solution in the previous beam that generated the expanded solution. To compute the score of the policy for expansions efficiently, we also keep track of the score for each solution and the potential for each node for each solution incrementally.
We do not keep past beams in memory, but at the end of each iteration, we store the vectors containing the parents as well as the last actions for each solution on the trace. As the solution is completely defined by its sequence of actions, this allows the solution to be backtracked after the algorithm has finished. To save GPU memory (especially for larger beam sizes), we store the O(Bn)-sized trace in CPU memory.
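The backtracking over the stored trace can be sketched as follows. This is a toy illustration with hypothetical data and plain Python lists; the actual implementation operates on vectors stored in CPU memory.

```python
def backtrack(trace, best_idx):
    # trace[t] = (parents_t, actions_t): for each solution in beam t, the
    # index of its parent in beam t-1 and the action (node visited) taken.
    # Walk backwards from the best final solution, collecting actions.
    actions = []
    idx = best_idx
    for parents_t, actions_t in reversed(trace):
        actions.append(actions_t[idx])
        idx = parents_t[idx]
    return actions[::-1]  # actions in forward order define the solution

# toy trace of 3 iterations with beam size 2 (parent -1 marks the root)
trace = [([-1, -1], [0, 1]), ([0, 1], [2, 3]), ([1, 0], [4, 5])]
route = backtrack(trace, 0)  # -> [1, 3, 4]
```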
For efficiency, we keep the set of visited nodes as a bitmask, packed into 64-bit long integers (2 for 100 nodes). Using bitwise operations with the packed adjacency matrix, this allows feasible expansions to be checked quickly (although we need to unpack the mask into boolean vectors to find all feasible expansions explicitly). Figure 4a shows an example of the beam (with variables related to the policy and backtracking omitted) for the VRP.
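A sketch of the packing and the bitwise feasibility check is given below. For brevity this sketch uses a single 64-bit word per solution (so at most 63 nodes), whereas the paper packs 100 nodes into two long integers; the function names are ours.

```python
import torch

def pack_mask(visited_bool):
    # visited_bool: (B, n) boolean matrix of visited nodes, n <= 63 here
    B, n = visited_bool.shape
    weights = torch.tensor([1 << i for i in range(n)], dtype=torch.int64)
    return (visited_bool.to(torch.int64) * weights).sum(dim=1)

def feasible_expansions(mask, adj_row):
    # Nodes adjacent to the current node AND not yet visited: bitwise AND
    # of the packed adjacency row with the complement of the visited mask
    return adj_row & ~mask

mask = pack_mask(torch.tensor([[True, False, True, False]]))    # visited {0, 2}
adj_row = pack_mask(torch.tensor([[False, True, True, True]]))  # neighbors {1, 2, 3}
feas = feasible_expansions(mask, adj_row)                       # packed set {1, 3}
# unpack into booleans to enumerate the feasible expansions explicitly
unpacked = (feas.unsqueeze(1) >> torch.arange(4)) & 1
```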
2.2 Generating Non-dominated Expansions
A solution \(\boldsymbol{a}\) can only dominate a solution \(\boldsymbol{a}'\) if \({\text {visited}}(\boldsymbol{a}) = {\text {visited}}(\boldsymbol{a}')\) and \({\text {current}}(\boldsymbol{a}) = {\text {current}}(\boldsymbol{a}')\), i.e. if they correspond to the same DP state. If this is the case, then, if we denote by \({\text {parent}}(\boldsymbol{a})\) the parent solution from which \(\boldsymbol{a}\) was expanded, it holds that

$${\text{visited}}({\text{parent}}(\boldsymbol{a})) = {\text{visited}}(\boldsymbol{a}) \setminus \{{\text{current}}(\boldsymbol{a})\} = {\text{visited}}(\boldsymbol{a}') \setminus \{{\text{current}}(\boldsymbol{a}')\} = {\text{visited}}({\text{parent}}(\boldsymbol{a}')).$$
This means that only expansions from solutions with the same set of visited nodes can dominate each other, so we only need to check for dominated solutions among groups of expansions originating from parent solutions with the same set of visited nodes. Therefore, before generating the expansions, we group the current beam (the parents of the expansions) by the set of visited nodes (see Fig. 4). This can be done efficiently, e.g. using a lexicographic sort of the packed bitmask representing the sets of visited nodes (Footnote 13).
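The grouping step can be sketched as follows, here with NumPy's lexicographic sort and toy mask words (variable names are ours): sorting the packed words brings equal visited sets next to each other, so expansions from the same set form contiguous segments.

```python
import numpy as np

# each beam entry's visited set packed into two 64-bit words (toy values)
w_lo = np.array([3, 7, 3, 1, 7], dtype=np.int64)
w_hi = np.array([0, 0, 0, 0, 0], dtype=np.int64)

# lexicographic sort: the LAST key passed to lexsort is the primary key
order = np.lexsort((w_lo, w_hi))
sorted_lo = w_lo[order]

# segment boundaries: positions where the sorted key changes
new_segment = np.concatenate(([True], sorted_lo[1:] != sorted_lo[:-1]))
segment_id = np.cumsum(new_segment) - 1  # same id = same set of visited nodes
```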
Travelling Salesman Problem. For TSP, we can generate (using boolean operations) the \(B \times n\) matrix with boolean entries indicating feasible expansions (with n action columns corresponding to n nodes, similar to the \(B \times 2n\) matrix for VRP in Fig. 4), i.e. nodes that are unvisited and adjacent to the current node. If we find positive entries sequentially for each column (e.g. by calling \(\textsc {torch.nonzero}\) on the transposed matrix), we get all expansions grouped by the combination of action (new current node) and parent set of visited nodes, i.e. grouped by the DP state. We can then trivially find the segments of consecutive expansions corresponding to the same DP state, and we can efficiently find the minimum cost solution for each segment, e.g. using torch_scatter (Footnote 14).
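The column-wise enumeration can be sketched as follows (toy values; B = 3 parents, n = 4 nodes). Since torch.nonzero returns indices in row-major order, applying it to the transposed matrix yields the expansions ordered column by column, i.e. grouped by action.

```python
import torch

# feasible[b, a] = True iff node a is unvisited and adjacent to the
# current node of beam entry b (toy data)
feasible = torch.tensor([
    [False, True,  True,  False],
    [False, True,  False, True ],
    [True,  False, True,  False],
])

# nonzero on the transpose yields (action, parent) pairs grouped by action;
# if the parents are already grouped by visited set, consecutive entries
# with the same action and same parent group share a DP state
pairs = torch.nonzero(feasible.t())
actions, parents = pairs[:, 0], pairs[:, 1]
```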
Vehicle Routing Problem. For VRP, the dominance check has two dimensions (cost and remaining capacity) and additionally we need to consider 2n actions: n direct and n via the depot (see Fig. 4). Therefore, as we will explain, we check dominances in two stages: first we find (for each DP state) the single non-dominated ‘via-depot’ expansion, after which we find all non-dominated ‘direct’ expansions (see Fig. 4b).
The DP state of each expansion is defined by the expanded node (the new current node) and the set of visited nodes. For each DP state, there can be only one (Footnote 15) non-dominated expansion where the last action was via the depot, since all expansions resulting from ‘via-depot actions’ have the same remaining capacity, as visiting the depot resets the capacity (see Fig. 4b). To find this expansion, we first find, for each unique set of visited nodes in the current beam, the solution that can return to the depot with the lowest total cost (thus including the cost to return to the depot, indicated by a dashed green rectangle in Fig. 4). The single non-dominated ‘via-depot expansion’ for each DP state must necessarily be an expansion of this solution. Also observe that this via-depot solution cannot be dominated by a solution expanded using a direct action, which will always have a lower remaining vehicle capacity (assuming positive demands), as can be seen in Fig. 4b. We can thus generate the non-dominated via-depot expansion for each DP state efficiently and independently from the direct expansions.
For each DP state, all direct expansions with cost higher (or equal) than the via-depot expansion can directly be removed since they are dominated by the via-depot expansion (having higher cost and lower remaining capacity, see Fig. 4b). After that, we sort the remaining (if any) direct expansions for each DP state based on the cost (using a segmented sort as the expansions are already grouped if we generate them similarly to TSP, i.e. per column in Fig. 4). For each DP state, the lowest cost solution is never dominated. The other solutions should be kept only if their remaining capacity is strictly larger than the largest remaining capacity of all lower-cost solutions corresponding to the same DP state, which can be computed using a (segmented) cumulative maximum computation (see Fig. 4b).
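For a single DP state, the sort-plus-cumulative-maximum dominance filter can be sketched as follows (our own function name; the actual implementation runs this segmented over all DP states at once):

```python
import torch

def non_dominated(costs, capacities):
    # Direct expansions of ONE DP state: sort by cost, then keep an
    # expansion only if its remaining capacity STRICTLY exceeds the running
    # maximum capacity of all cheaper expansions (the cheapest is always kept).
    order = torch.argsort(costs)
    cap_sorted = capacities[order]
    run_max = torch.cummax(cap_sorted, dim=0).values
    keep = torch.ones_like(cap_sorted, dtype=torch.bool)
    keep[1:] = cap_sorted[1:] > run_max[:-1]
    return order[keep]  # indices of the non-dominated expansions

costs = torch.tensor([5.0, 3.0, 4.0, 6.0])
caps = torch.tensor([2.0, 1.0, 3.0, 4.0])
idx = non_dominated(costs, caps)  # expansion 0 is dominated by expansion 2
```

The TSPTW check in the next paragraph is the same computation with remaining capacity replaced by the negated current time.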
TSP with Time Windows. For the TSPTW, the dominance check has two dimensions: cost and time. Therefore, it is similar to the check for non-dominated direct expansions for the VRP (see Fig. 4b), but replacing remaining capacity (which should be maximized) by current time (to be minimized). In fact, we could reuse the implementation, if we replace remaining capacity by time multiplied by \(-1\) (as this should be minimized). This means that we sort all expansions for each DP state based on the cost, keep the first solution and keep other solutions only if the time is strictly lower than the lowest current time for all lower-cost solutions, which can be computed using a cumulative minimum computation.
2.3 Finding the Top B Solutions
We may generate all ‘candidate’ non-dominated expansions and then select the top B using the score function. Alternatively, we can generate expansions in batches and keep a streaming top B using a priority queue. We use the latter implementation, in which we can also derive a bound on the score as soon as we have B candidate expansions. Using this bound, we can already remove solutions before checking dominances, achieving some speedup in the algorithm (Footnote 16).
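The streaming selection can be sketched with Python's heapq (a simplified, CPU-side illustration of the idea; the actual implementation works on GPU tensors):

```python
import heapq

def streaming_top_b(expansion_batches, B):
    # Keep the B best-scoring expansions seen so far in a min-heap of size B.
    # Once the heap is full, heap[0][0] is a lower bound on the score needed
    # to enter, which can also be used to prune candidates before the
    # (more expensive) dominance checks.
    heap = []
    for batch in expansion_batches:
        for score, item in batch:
            if len(heap) < B:
                heapq.heappush(heap, (score, item))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)  # best first

batches = [[(0.9, "a"), (0.2, "b")], [(0.5, "c"), (0.7, "d")]]
top = streaming_top_b(batches, 2)  # -> [(0.9, "a"), (0.7, "d")]
```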
2.4 Performance Improvements
There are many possibilities for improving the speed of the algorithm. For example, PyTorch lacks a segmented sort, so we use a much slower lexicographic sort instead. An efficient GPU priority queue would also allow a large speedup, as we currently use sorting because PyTorch's top-k function is rather slow for large k. In some cases, a binary search for the k-th largest value can be faster, but this introduces undesired CUDA synchronisation points.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Kool, W., van Hoof, H., Gromicho, J., Welling, M. (2022). Deep Policy Dynamic Programming for Vehicle Routing Problems. In: Schaus, P. (eds) Integration of Constraint Programming, Artificial Intelligence, and Operations Research. CPAIOR 2022. Lecture Notes in Computer Science, vol 13292. Springer, Cham. https://doi.org/10.1007/978-3-031-08011-1_14
DOI: https://doi.org/10.1007/978-3-031-08011-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08010-4
Online ISBN: 978-3-031-08011-1
eBook Packages: Computer Science (R0)