Deep Policy Dynamic Programming for Vehicle Routing Problems

Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR 2022)

Abstract

Routing problems are a class of combinatorial problems with many practical applications. Recently, end-to-end deep learning methods have been proposed to learn approximate solution heuristics for such problems. In contrast, classical dynamic programming (DP) algorithms guarantee optimal solutions, but scale badly with the problem size. We propose Deep Policy Dynamic Programming (DPDP), which aims to combine the strengths of learned neural heuristics with those of DP algorithms. DPDP prioritizes and restricts the DP state space using a policy derived from a deep neural network, which is trained to predict edges from example solutions. We evaluate our framework on the travelling salesman problem (TSP), the vehicle routing problem (VRP) and the TSP with time windows (TSPTW), and show that the neural policy improves the performance of (restricted) DP algorithms, making them competitive with strong alternatives such as LKH, while also outperforming most other ‘neural approaches’ for solving TSPs, VRPs and TSPTWs with 100 nodes.


Notes

1. If we have multiple partial solutions with the same state and cost, we can arbitrarily choose one to dominate the other(s), for example the one with the lowest index of the current node.
2. E.g., arriving at node i at \(t = 10\) is not feasible if node j has \(u_j = 12\) and \(c_{ij} = 3\).
3. https://github.com/wouterkool/dpdp.
4. https://github.com/vidalt/HGS-CVRP.
5. The running time of 4000 h (167 days) is estimated from 24 min/instance [43].
6. For example, three nodes with a demand of two cannot be assigned to two routes with a capacity of three.
7. Up to a limit, as making the time windows infinitely large reduces the problem to the plain TSP.
8. https://github.com/sashakh/TSPTW.
9. For the symmetric TSP and VRP, we add \(\textsc {knn}\) edges in both directions. For the VRP, we also connect each node to the depot (and vice versa) to ensure feasibility.
10. https://github.com/chaitjo/graph-convnet-tsp/blob/master/models/gcn_layers.py.
11. If all time windows are disjoint, there is only one feasible solution. The amount of overlap in the time windows therefore determines, to some extent, the ‘branching factor’ and the difficulty of the problem.
12. Serving 100 customers in a \(100 \times 100\) grid, we empirically find that the total schedule duration including waiting (the makespan) is around 5000.
13. For efficiency, we use a custom function similar to torch.unique and argsort the returned inverse, after which the resulting permutation is applied to all variables in the beam.
14. https://github.com/rusty1s/pytorch_scatter.
15. Unless we have multiple expansions with the same cost, in which case we can pick one arbitrarily.
16. This may give slightly different results if the scoring function is inconsistent with the domination rules, i.e. if a better-scoring solution would be dominated by a worse-scoring solution, but is not because that solution was removed using the score bound before the dominance check.

References

1. Accorsi, L., Vigo, D.: A fast and scalable heuristic for the solution of large-scale capacitated vehicle routing problems. Transp. Sci. 55(4), 832–856 (2021)
2. Applegate, D., Bixby, R., Chvatal, V., Cook, W.: Concorde TSP Solver (2006). http://www.math.uwaterloo.ca/tsp/concorde
3. Bai, R., et al.: Analytics and machine learning in vehicle routing research. arXiv preprint arXiv:2102.10012 (2021)
4. Bellman, R.: On the theory of dynamic programming. Proc. Natl. Acad. Sci. U.S.A. 38(8), 716 (1952)
5. Bellman, R.: Dynamic programming treatment of the travelling salesman problem. J. ACM 9(1), 61–63 (1962)
6. Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S.: Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016)
7. Bertsekas, D.: Dynamic Programming and Optimal Control, vol. 1. Athena Scientific (2017)
8. Cappart, Q., Moisan, T., Rousseau, L.M., Prémont-Schwarz, I., Cire, A.: Combining reinforcement learning and constraint programming for combinatorial optimization. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)
9. Chen, X., Tian, Y.: Learning to perform local rewriting for combinatorial optimization. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 6281–6292 (2019)
10. Cook, W., Seymour, P.: Tour merging via branch-decomposition. INFORMS J. Comput. 15(3), 233–248 (2003)
11. da Costa, P.R.d.O., Rhuggenaath, J., Zhang, Y., Akcay, A.: Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In: Asian Conference on Machine Learning (ACML) (2020)
12. Da Silva, R.F., Urrutia, S.: A general VNS heuristic for the traveling salesman problem with time windows. Discret. Optim. 7(4), 203–211 (2010)
13. Daumé, H., III, Marcu, D.: Learning as search optimization: approximate large margin methods for structured prediction. In: International Conference on Machine Learning (ICML), pp. 169–176 (2005)
14. Delarue, A., Anderson, R., Tjandraatmadja, C.: Reinforcement learning with combinatorial actions: an application to vehicle routing. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33 (2020)
15. Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., Rousseau, L.-M.: Learning heuristics for the TSP by policy gradient. In: van Hoeve, W.-J. (ed.) CPAIOR 2018. LNCS, vol. 10848, pp. 170–181. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93031-2_12
16. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959)
17. Dumas, Y., Desrosiers, J., Gelinas, E., Solomon, M.M.: An optimal algorithm for the traveling salesman problem with time windows. Oper. Res. 43(2), 367–371 (1995)
18. Falkner, J.K., Schmidt-Thieme, L.: Learning to solve vehicle routing problems with time windows through joint attention. arXiv preprint arXiv:2006.09100 (2020)
19. Fu, Z.H., Qiu, K.B., Zha, H.: Generalize a small pre-trained model to arbitrarily large TSP instances. In: AAAI Conference on Artificial Intelligence (AAAI) (2021)
20. Gao, L., Chen, M., Chen, Q., Luo, G., Zhu, N., Liu, Z.: Learn to design the heuristics for vehicle routing problem. In: International Workshop on Heuristic Search in Industry (HSI) at the International Joint Conference on Artificial Intelligence (IJCAI) (2020)
21. Gasse, M., Chetelat, D., Ferroni, N., Charlin, L., Lodi, A.: Exact combinatorial optimization with graph convolutional neural networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
22. Gromicho, J., van Hoorn, J.J., Kok, A.L., Schutten, J.M.: Restricted dynamic programming: a flexible framework for solving realistic VRPs. Comput. Oper. Res. 39(5), 902–909 (2012)
23. Gromicho, J.A., van Hoorn, J.J., Saldanha-da-Gama, F., Timmer, G.T.: Solving the job-shop scheduling problem optimally by dynamic programming. Comput. Oper. Res. 39(12), 2968–2977 (2012)
24. Gurobi Optimization, LLC: Gurobi Optimizer Reference Manual (2021). https://www.gurobi.com
25. van Heeswijk, W., La Poutré, H.: Approximate dynamic programming with neural networks in linear discrete action spaces. arXiv preprint arXiv:1902.09855 (2019)
26. Held, M., Karp, R.M.: A dynamic programming approach to sequencing problems. J. Soc. Ind. Appl. Math. 10(1), 196–210 (1962)
27. Helsgaun, K.: An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Technical report (2017)
28. van Hoorn, J.J.: Dynamic programming for routing and scheduling. Ph.D. thesis (2016)
29. Hottung, A., Bhandari, B., Tierney, K.: Learning a latent search space for routing problems using variational autoencoders. In: International Conference on Learning Representations (ICLR) (2021)
30. Hottung, A., Tierney, K.: Neural large neighborhood search for the capacitated vehicle routing problem. In: European Conference on Artificial Intelligence (ECAI) (2020)
31. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning (ICML), pp. 448–456 (2015)
32. Joshi, C.K., Laurent, T., Bresson, X.: An efficient graph convolutional network technique for the travelling salesman problem. In: INFORMS Annual Meeting (2019)
33. Joshi, C.K., Laurent, T., Bresson, X.: On learning paradigms for the travelling salesman problem. In: Graph Representation Learning Workshop at Neural Information Processing Systems (NeurIPS) (2019)
34. Kim, M., Park, J., Kim, J.: Learning collaborative policies to solve NP-hard routing problems. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
35. Kok, A., Hans, E.W., Schutten, J.M., Zijm, W.H.: A dynamic programming heuristic for vehicle routing with time-dependent travel times and required breaks. Flex. Serv. Manuf. J. 22(1–2), 83–108 (2010)
36. Kool, W., van Hoof, H., Welling, M.: Attention, learn to solve routing problems! In: International Conference on Learning Representations (ICLR) (2019)
37. Kwon, Y.D., Choo, J., Kim, B., Yoon, I., Gwon, Y., Min, S.: POMO: policy optimization with multiple optima for reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)
38. Laporte, G.: The vehicle routing problem: an overview of exact and approximate algorithms. Eur. J. Oper. Res. (EJOR) 59(3), 345–358 (1992)
39. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
40. Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., Teh, Y.W.: Set transformer: a framework for attention-based permutation-invariant neural networks. In: International Conference on Machine Learning (ICML), pp. 3744–3753. PMLR (2019)
41. Li, S., Yan, Z., Wu, C.: Learning to delegate for large-scale vehicle routing. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
42. Li, Z., Chen, Q., Koltun, V.: Combinatorial optimization with graph convolutional networks and guided tree search. In: Advances in Neural Information Processing Systems (NeurIPS), p. 539 (2018)
43. Lu, H., Zhang, X., Yang, S.: A learning-based iterative method for solving vehicle routing problems. In: International Conference on Learning Representations (ICLR) (2020)
44. Ma, Q., Ge, S., He, D., Thaker, D., Drori, I.: Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In: AAAI International Workshop on Deep Learning on Graphs: Methodologies and Applications (DLGMA) (2020)
45. Ma, Y., et al.: Learning to iteratively solve routing problems with dual-aspect collaborative transformer. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
46. Malandraki, C., Dial, R.B.: A restricted dynamic programming heuristic algorithm for the time dependent traveling salesman problem. Eur. J. Oper. Res. (EJOR) 90(1), 45–55 (1996)
47. Mazyavkina, N., Sviridov, S., Ivanov, S., Burnaev, E.: Reinforcement learning for combinatorial optimization: a survey. arXiv preprint arXiv:2003.03600 (2020)
48. Mingozzi, A., Bianco, L., Ricciardelli, S.: Dynamic programming strategies for the traveling salesman problem with time window and precedence constraints. Oper. Res. 45(3), 365–377 (1997)
49. Nair, V., et al.: Solving mixed integer programs using neural networks. arXiv preprint arXiv:2012.13349 (2020)
50. Nazari, M., Oroojlooy, A., Snyder, L., Takac, M.: Reinforcement learning for solving the vehicle routing problem. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 9860–9870 (2018)
51. Novoa, C., Storer, R.: An approximate dynamic programming approach for the vehicle routing problem with stochastic demands. Eur. J. Oper. Res. (EJOR) 196(2), 509–515 (2009)
52. Nowak, A., Villar, S., Bandeira, A.S., Bruna, J.: A note on learning algorithms for quadratic assignment with graph neural networks. In: Principled Approaches to Deep Learning Workshop at the International Conference on Machine Learning (ICML) (2017)
53. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 8026–8037 (2019)
54. Peng, B., Wang, J., Zhang, Z.: A deep reinforcement learning algorithm using dynamic attention model for vehicle routing problems. In: Li, K., Li, W., Wang, H., Liu, Y. (eds.) ISICA 2019. CCIS, vol. 1205, pp. 636–650. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-5577-0_51
55. Ropke, S., Pisinger, D.: An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. Transp. Sci. 40(4), 455–472 (2006)
56. Schrimpf, G., Schneider, J., Stamm-Wilbrandt, H., Dueck, G.: Record breaking optimization results using the ruin and recreate principle. J. Comput. Phys. 159(2), 139–171 (2000)
57. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419), 1140–1144 (2018)
58. Sun, Y., Ernst, A., Li, X., Weiner, J.: Generalization of machine learning for problem reduction: a case study on travelling salesman problems. OR Spectr. 43(3), 607–633 (2020). https://doi.org/10.1007/s00291-020-00604-x
59. Toth, P., Vigo, D.: Vehicle Routing: Problems, Methods, and Applications. SIAM (2014)
60. Uchoa, E., Pecin, D., Pessoa, A., Poggi, M., Vidal, T., Subramanian, A.: New benchmark instances for the capacitated vehicle routing problem. Eur. J. Oper. Res. (EJOR) 257(3), 845–858 (2017)
61. Vesselinova, N., Steinert, R., Perez-Ramirez, D.F., Boman, M.: Learning combinatorial optimization on graphs: a survey with applications to networking. IEEE Access 8, 120388–120416 (2020)
62. Vidal, T.: Hybrid genetic search for the CVRP: open-source implementation and SWAP* neighborhood. arXiv preprint arXiv:2012.10384 (2020)
63. Vidal, T., Crainic, T.G., Gendreau, M., Lahrichi, N., Rei, W.: A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Oper. Res. 60(3), 611–624 (2012)
64. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 2692–2700 (2015)
65. Wiseman, S., Rush, A.M.: Sequence-to-sequence learning as beam-search optimization. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1296–1306 (2016)
66. Wu, Y., Song, W., Cao, Z., Zhang, J., Lim, A.: Learning improvement heuristics for solving routing problems. IEEE Trans. Neural Netw. Learn. Syst. (2021)
67. Xin, L., Song, W., Cao, Z., Zhang, J.: Step-wise deep learning models for solving routing problems. IEEE Trans. Ind. Inform. (2020)
68. Xin, L., Song, W., Cao, Z., Zhang, J.: NeuroLKH: combining deep learning model with Lin-Kernighan-Helsgaun heuristic for solving the traveling salesman problem. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
69. Xu, S., Panwar, S.S., Kodialam, M., Lakshman, T.: Deep neural network approximated dynamic programming for combinatorial optimization. In: AAAI Conference on Artificial Intelligence (AAAI), vol. 34, pp. 1684–1691 (2020)
70. Yang, F., Jin, T., Liu, T.Y., Sun, X., Zhang, J.: Boosting dynamic programming with neural networks for solving NP-hard problems. In: Asian Conference on Machine Learning (ACML), pp. 726–739. PMLR (2018)

Acknowledgement

We would like to thank Jelke van Hoorn and Johan van Rooij for helpful discussions. We would also like to thank the anonymous reviewers for their helpful suggestions. This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

Author information

Correspondence to Wouter Kool.

Appendices

Appendix 1 The Graph Neural Network Model

For the TSP, we use the exact model from [32], which we describe here to keep this paper self-contained. The model uses node input features and edge input features, which are transformed into initial representations of the nodes and edges. These representations are then updated sequentially using a number of graph convolutional layers, which exchange information between nodes and edges, after which the final edge representation is used to predict whether the edge is part of the optimal solution.

Input Features and Initial Representation. The model uses input features for the nodes, consisting of the \((x, y)\)-coordinates, which are then projected into H-dimensional initial embeddings \(\mathbf {x}_i^{0}\) (\(H = 300\)). The initial edge features \(\mathbf {e}_{ij}^{0}\) are a concatenation of a \(\frac{H}{2}\)-dimensional projection of the cost (Euclidean distance) \(c_{ij}\) from i to j, and a \(\frac{H}{2}\)-dimensional embedding of the edge type: 0 for normal edges, 1 for edges connecting K-nearest neighbors (\(K = 20\)) and 2 for self-loop edges connecting a node to itself (which are added for ease of implementation).
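
For concreteness, the construction of the initial representations could look as follows in PyTorch (a minimal sketch under our reading of this description; the random instance, the layer names and the way edge types are derived are our own assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

H, K = 300, 20
node_embed = nn.Linear(2, H)           # (x, y) coordinates -> H-dim node embedding
dist_embed = nn.Linear(1, H // 2)      # Euclidean cost c_ij -> H/2-dim projection
type_embed = nn.Embedding(3, H // 2)   # 0: normal, 1: k-nearest neighbor, 2: self-loop

coords = torch.rand(100, 2)            # a random 100-node instance (assumption)
x0 = node_embed(coords)                # initial node embeddings x_i^0
dist = torch.cdist(coords, coords)     # pairwise Euclidean distances

# Edge types: mark the K nearest neighbors of each node, and self-loops
etype = torch.zeros(100, 100, dtype=torch.long)
knn = dist.topk(K + 1, largest=False).indices[:, 1:]  # skip the node itself
etype.scatter_(1, knn, 1)
etype.fill_diagonal_(2)

# Initial edge embeddings e_ij^0: concatenation of distance and type parts
e0 = torch.cat([dist_embed(dist.unsqueeze(-1)), type_embed(etype)], dim=-1)
```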

Graph Convolutional Layers. In each of the \(L = 30\) layers of the model, the node and edge representations \(\mathbf {x}_i^{\ell }\) and \(\mathbf {e}_{ij}^{\ell }\) get updated into \(\mathbf {x}_i^{\ell + 1}\) and \(\mathbf {e}_{ij}^{\ell +1}\) [32]:

$$\begin{aligned} \mathbf {x}_i^{\ell + 1}&= \mathbf {x}_i^{\ell } + \text {ReLU} \left( \text {BN}\left( W_1^{\ell } \mathbf {x}_i^{\ell } + \sum _{j \in \mathcal {N}(i)} \frac{\sigma (\mathbf {e}_{ij}^{\ell })}{\sum _{j' \in \mathcal {N}(i)} \sigma (\mathbf {e}_{ij'}^{\ell })} \odot W_2^{\ell } \mathbf {x}_j^{\ell } \right) \right) \end{aligned}$$
(2)
$$\begin{aligned} \mathbf {e}_{ij}^{\ell + 1}&= \mathbf {e}_{ij}^{\ell } + \text {ReLU} \left( \text {BN}\left( W_3^{\ell } \mathbf {e}_{ij}^{\ell } + W_4^{\ell } \mathbf {x}_i^{\ell } + W_5^{\ell } \mathbf {x}_j^{\ell } \right) \right) . \end{aligned}$$
(3)

Here \(\mathcal {N}(i)\) is the set of neighbors of node i (in our case all nodes, including i, as we use a fully connected input graph), \(\odot \) is the element-wise product and \(\sigma \) is the sigmoid function, applied element-wise to the vector \(\mathbf {e}_{ij}^{\ell }\). \(\text {ReLU}(\cdot ) = \max (\cdot , 0)\) is the rectified linear unit and \(\text {BN}\) represents batch normalization [31]. \(W_1, W_2, W_3, W_4\) and \(W_5\) are trainable parameter matrices, where we fix \(W_4 = W_5\) for the symmetric TSP.
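
A minimal PyTorch sketch of one layer implementing Eqs. (2) and (3) is shown below, using a dense \(n \times n\) edge representation. The class and variable names are ours, and the reference implementation (see footnote 10) differs in details such as batching; this is a sketch, not the authors' code:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolutional layer following Eqs. (2)-(3)."""
    def __init__(self, H=300):
        super().__init__()
        self.W1, self.W2 = nn.Linear(H, H), nn.Linear(H, H)
        # For the symmetric TSP the paper ties W4 = W5 (one shared module instead)
        self.W3, self.W4, self.W5 = nn.Linear(H, H), nn.Linear(H, H), nn.Linear(H, H)
        self.bn_x, self.bn_e = nn.BatchNorm1d(H), nn.BatchNorm1d(H)

    def forward(self, x, e):
        # x: (n, H) node embeddings, e: (n, n, H) edge embeddings (dense graph)
        gate = torch.sigmoid(e)
        gate = gate / (gate.sum(dim=1, keepdim=True) + 1e-10)  # normalize over neighbors j
        agg = (gate * self.W2(x).unsqueeze(0)).sum(dim=1)      # sum_j gate_ij * W2 x_j
        x_new = x + torch.relu(self.bn_x(self.W1(x) + agg))    # Eq. (2)
        e_in = self.W3(e) + self.W4(x).unsqueeze(1) + self.W5(x).unsqueeze(0)
        n = x.size(0)
        e_new = e + torch.relu(self.bn_e(e_in.view(n * n, -1))).view(n, n, -1)  # Eq. (3)
        return x_new, e_new
```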

Output Prediction. After L layers, the final prediction \(h_{ij} \in (0,1)\) is made independently for each edge (ij) using a multi-layer perceptron (MLP), which takes \(\mathbf {e}_{ij}^{L}\) as input and has two H-dimensional hidden layers with \(\text {ReLU}\) activation and a 1-dimensional output layer, with sigmoid activation. We interpret \(h_{ij}\) as the predicted probability that the edge (ij) is part of the optimal solution, which indicates how promising this edge is when searching for the optimal solution.

Training. For TSP, the model is trained on a dataset of 1 million optimal solutions, found using Concorde [2], for randomly generated TSP instances. The training loss is a weighted binary cross-entropy loss that maximizes the prediction quality of \(h_{ij}\) compared to the ground-truth optimal solution. Generating the dataset takes between half a day and a few days (depending on the number of CPU cores), and training the model takes a few days on one or multiple GPUs, but both are only required once given a desired data distribution.
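
The loss computation could be realized roughly as follows (a sketch; treating the weighting as class balancing between the rare 'solution' edges and the abundant 'non-solution' edges is our assumption of how 'weighted' is meant):

```python
import torch
import torch.nn.functional as F

def weighted_bce(h, y):
    """h: (n, n) predicted edge probabilities in (0, 1);
    y: (n, n) float tensor of 0/1 ground-truth labels.
    Weight the rare positive class (edges in the optimal tour) more heavily."""
    pos_weight = (y.numel() - y.sum()) / y.sum().clamp(min=1)
    w = torch.where(y > 0, pos_weight, torch.ones_like(h))
    return F.binary_cross_entropy(h, y, weight=w)
```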

1.1 Predicting Directed Edges for the TSPTW

The TSP is an undirected problem, so the neural network implementation (see footnote 10) by [32] shares the parameters \(W_4^l\) and \(W_5^l\) in Eq. (3), i.e. \(W_4^l = W_5^l\), resulting in \(\mathbf {e}_{ij}^l = \mathbf {e}_{ji}^l\) for all layers l, as both directions are initialized identically for \(l = 0\). While the VRP is also an undirected problem, the TSPTW is directed, as the direction of the route determines the arrival times at the different nodes. To allow the model to make different predictions for different directions, we implement \(W_5^l\) as a separate parameter, such that the model can have different representations for the edges (i, j) and (j, i). We define the training labels accordingly for directed edges: if edge (i, j) is in the directed solution, it has label 1 whereas edge (j, i) does not (for the undirected TSP and VRP, both labels are 1).

1.2 Dataset Generation for the TSPTW

We found that, using our DP formulation for the TSPTW, the instances from [8] were all solved optimally, even with a very small beam size (around 10). This is because there is very little overlap in the time windows as a result of the way they are generated, so very few actions are feasible, as most actions would ‘skip over other time windows’ (advance the time so much that other nodes can no longer be served) (see footnote 11). We conducted some quick experiments with a weaker DP formulation that only checks whether actions directly violate time windows, but not whether an action causes other nodes to become unreachable within their time windows. Using this formulation, the DP algorithm can run into many dead ends if just a single node gets skipped, and using the GNN policy (compared to a cost-based policy as in Sect. 4.4) made the difference between good solutions and no solution being found at all.

We made two changes to the data generation procedure of [8] to increase the difficulty and make it similar to [12], defining the ‘large time window’ dataset. First, we sample the time windows around the arrival times obtained when visiting the nodes in a random order without any waiting time; this differs from [8], who ‘propagate’ the waiting time (resulting from the sampled time windows). Our modification causes a tighter schedule with more overlap in the time windows, and is similar to [12]. Secondly, we increase the maximum time window size from 100 to 1000, so that the time windows are on the order of 10% of the horizon (see footnote 12). This doubles the maximum time window size of 500 used by [12] for instances with 200 nodes, to compensate for having half the number of nodes that can possibly overlap a time window.

To generate the training data, for practical reasons we used DP with the heuristic ‘cost heat + potential’ strategy and a large beam size (1M), which in many cases results in optimal solutions being found.

Appendix 2 Implementation

We implement the dynamic programming algorithm on the GPU using PyTorch [53]. While PyTorch is mostly used as a deep learning framework, it can also be used to speed up generic (vectorized) computations.

2.1 Beam Variables

For each solution in the beam, we keep track of the following variables (storing them for all solutions in the beam as a vector): the cost, current node, visited nodes and (for VRP) the remaining capacity or (for TSPTW) the current time. As explained, these variables can be computed incrementally when generating expansions. Additionally, we keep a variable vector parent, which, for each solution in the current beam, tracks the index of the solution in the previous beam that generated the expanded solution. To compute the score of the policy for expansions efficiently, we also keep track of the score for each solution and the potential for each node for each solution incrementally.
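
In a tensor representation, such a beam could look as follows (hypothetical names and sizes, for illustration only):

```python
import torch

B, n = 8, 100  # hypothetical beam size and instance size
beam = {
    "cost":     torch.zeros(B),                        # cost of each partial solution
    "current":  torch.zeros(B, dtype=torch.long),      # index of the current node
    "visited":  torch.zeros(B, 2, dtype=torch.int64),  # bitmask: 2 x 64 bits for 100 nodes
    "capacity": torch.full((B,), 30.0),                # VRP: remaining capacity (or "time" for TSPTW)
    "parent":   torch.zeros(B, dtype=torch.long),      # index into the previous beam
    "score":    torch.zeros(B),                        # policy score, updated incrementally
}
```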

We do not keep past beams in memory; instead, at the end of each iteration, we store the vectors containing the parents as well as the last actions for each solution on the trace. As the solution is completely defined by its sequence of actions, this allows us to backtrack the solution after the algorithm has finished. To save GPU memory (especially for larger beam sizes), we store the O(Bn)-sized trace in CPU memory.
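
Backtracking from the stored traces is then a simple backward walk (a sketch under the representation above):

```python
def backtrack(parents, actions, idx):
    """parents[t], actions[t]: CPU tensors stored for beam step t;
    idx: index of the chosen solution in the last beam.
    Returns the sequence of actions (visited nodes) in forward order."""
    route = []
    for t in range(len(parents) - 1, -1, -1):
        route.append(int(actions[t][idx]))
        idx = int(parents[t][idx])  # follow the parent pointer one beam back
    return route[::-1]  # actions were collected from last to first
```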

For efficiency, we keep the set of visited nodes as a bitmask, packed into 64-bit long integers (two for 100 nodes). Using bitwise operations with the packed adjacency matrix, this allows us to quickly check feasible expansions (although we need to unpack the mask into boolean vectors to find all feasible expansions explicitly). Figure 4a shows an example of the beam (with variables related to the policy and backtracking omitted) for the VRP.
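
For illustration, marking a node as visited and testing which unvisited neighbors of the current node remain are a handful of bitwise operations on the packed words (a sketch; the actual code operates on entire beams at once):

```python
import torch

def mark_visited(mask, node):
    """mask: (W,) int64 words packing the visited set (W = 2 for 100 nodes)."""
    word, bit = node // 64, node % 64
    one = torch.tensor(1, dtype=torch.int64)
    mask[word] |= (one << bit)  # bit 63 lands on the sign bit, harmless for masks
    return mask

def feasible_packed(mask, adj_packed, current):
    """Packed mask of nodes that are unvisited AND adjacent to the current node.
    adj_packed: (n, W) int64 rows of the packed adjacency matrix."""
    return ~mask & adj_packed[current]
```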

Fig. 4. Implementation of DPDP for VRP (color figure online)

2.2 Generating Non-dominated Expansions

A solution \(\boldsymbol{a}\) can only dominate a solution \(\boldsymbol{a}'\) if \({\text {visited}}(\boldsymbol{a}) = {\text {visited}}(\boldsymbol{a}')\) and \({\text {current}}(\boldsymbol{a}) = {\text {current}}(\boldsymbol{a}')\), i.e. if they correspond to the same DP state. If this is the case, then, if we denote by \({\text {parent}}(\boldsymbol{a})\) the parent solution from which \(\boldsymbol{a}\) was expanded, it holds that

$$\begin{aligned} {\text {visited}}({\text {parent}}(\boldsymbol{a}))&= {\text {visited}}(\boldsymbol{a}) \setminus \{{\text {current}}(\boldsymbol{a})\} \\&= {\text {visited}}(\boldsymbol{a}') \setminus \{{\text {current}}(\boldsymbol{a}')\} \\&= {\text {visited}}({\text {parent}}(\boldsymbol{a}')). \end{aligned}$$

This means that only expansions from solutions with the same set of visited nodes can dominate each other, so we only need to check for dominated solutions among groups of expansions originating from parent solutions with the same set of visited nodes. Therefore, before generating the expansions, we group the current beam (the parents of the expansions) by the set of visited nodes (see Fig. 4). This can be done efficiently, e.g. using a lexicographic sort of the packed bitmasks representing the sets of visited nodes (see footnote 13).
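
PyTorch has no built-in lexicographic sort, but one can be composed from repeated stable sorts over the mask words, as in numpy.lexsort (a sketch; for grouping purposes the signed interpretation of the int64 words is harmless, since equal rows still end up adjacent):

```python
import torch

def lexsort_packed(masks):
    """masks: (B, W) int64 packed visited-sets. Returns a permutation that
    sorts the rows lexicographically, so equal visited-sets become adjacent."""
    order = torch.arange(masks.size(0), device=masks.device)
    # As in numpy.lexsort: stable-sort by each key, least significant word first
    for w in range(masks.size(1) - 1, -1, -1):
        idx = torch.sort(masks[order, w], stable=True).indices
        order = order[idx]
    return order
```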

Travelling Salesman Problem. For the TSP, we can generate (using boolean operations) the \(B \times n\) matrix with boolean entries indicating feasible expansions (with n action columns corresponding to the n nodes, similar to the \(B \times 2n\) matrix for the VRP in Fig. 4), i.e. nodes that are unvisited and adjacent to the current node. If we find the positive entries sequentially for each column (e.g. by calling \(\textsc {torch.nonzero}\) on the transposed matrix), we get all expansions grouped by the combination of action (new current node) and parent set of visited nodes, i.e. grouped by the DP state. We can then trivially find the segments of consecutive expansions corresponding to the same DP state, and efficiently find the minimum-cost solution for each segment, e.g. using torch_scatter (see footnote 14).
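
Keeping only the cheapest expansion per DP state is then a single scatter reduction (a sketch using torch_scatter; the `state_id` tensor assigning a consecutive group index to each expansion is a hypothetical input):

```python
import torch
from torch_scatter import scatter_min

# costs: (E,) expansion costs; state_id: (E,) consecutive DP-state index per expansion
costs = torch.tensor([5.0, 3.0, 4.0, 7.0, 6.0])
state_id = torch.tensor([0, 0, 1, 1, 1])

min_cost, argmin = scatter_min(costs, state_id)  # per-state minimum and its position
survivors = argmin  # indices of the non-dominated expansion for each DP state
```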

Vehicle Routing Problem. For VRP, the dominance check has two dimensions (cost and remaining capacity) and additionally we need to consider 2n actions: n direct and n via the depot (see Fig. 4). Therefore, as we will explain, we check dominances in two stages: first we find (for each DP state) the single non-dominated ‘via-depot’ expansion, after which we find all non-dominated ‘direct’ expansions (see Fig. 4b).

The DP state of each expansion is defined by the expanded node (the new current node) and the set of visited nodes. For each DP state, there can be only one (see footnote 15) non-dominated expansion whose last action was via the depot, since all expansions resulting from ‘via-depot actions’ have the same remaining capacity, as visiting the depot resets the capacity (see Fig. 4b). To find this expansion, we first find, for each unique set of visited nodes in the current beam, the solution that can return to the depot at lowest total cost (thus including the cost of returning to the depot, indicated by a dashed green rectangle in Fig. 4). The single non-dominated ‘via-depot expansion’ for each DP state must necessarily be an expansion of this solution. Also observe that this via-depot solution cannot be dominated by a solution expanded using a direct action, which will always have a lower remaining vehicle capacity (assuming positive demands), as can be seen in Fig. 4b. We can thus generate the non-dominated via-depot expansion for each DP state efficiently and independently of the direct expansions.

For each DP state, all direct expansions with cost higher than (or equal to) that of the via-depot expansion can be removed directly, since they are dominated by the via-depot expansion (having higher cost and lower remaining capacity, see Fig. 4b). After that, we sort the remaining (if any) direct expansions for each DP state by cost (using a segmented sort, as the expansions are already grouped if we generate them similarly to the TSP, i.e. per column in Fig. 4). For each DP state, the lowest-cost solution is never dominated. The other solutions should be kept only if their remaining capacity is strictly larger than the largest remaining capacity among all lower-cost solutions corresponding to the same DP state, which can be computed using a (segmented) cumulative maximum (see Fig. 4b).
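
For a single DP state, this filter amounts to a sort by cost followed by a running maximum over the remaining capacities (a sketch; the actual implementation performs this segmented over all states at once):

```python
import torch

def non_dominated(cost, capacity):
    """Keep expansions that are not dominated in (cost, remaining capacity):
    after sorting by cost, a solution survives only if its capacity strictly
    exceeds the best capacity among all cheaper solutions."""
    order = torch.argsort(cost)
    cap = capacity[order]
    run_max = torch.cummax(cap, dim=0).values
    keep = torch.ones_like(cap, dtype=torch.bool)
    keep[1:] = cap[1:] > run_max[:-1]
    return order[keep]

# Example: the third solution is dominated (costlier, and no extra capacity)
cost = torch.tensor([2.0, 3.0, 4.0, 5.0])
capacity = torch.tensor([1.0, 4.0, 3.0, 5.0])
print(non_dominated(cost, capacity))  # tensor([0, 1, 3])
```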

TSP with Time Windows. For the TSPTW, the dominance check has two dimensions: cost and time. It is therefore similar to the check for non-dominated direct expansions for the VRP (see Fig. 4b), but with the remaining capacity (which should be maximized) replaced by the current time (which should be minimized). In fact, we could reuse the implementation if we replace the remaining capacity by the time multiplied by \(-1\). This means that we sort all expansions for each DP state by cost, keep the first solution, and keep the other solutions only if their time is strictly lower than the lowest current time among all lower-cost solutions, which can be computed using a cumulative minimum.

2.3 Finding the Top B Solutions

We may generate all ‘candidate’ non-dominated expansions and then select the top B using the score function. Alternatively, we can generate the expansions in batches and keep a streaming top B using a priority queue. We use the latter implementation, in which we can also derive a bound on the score as soon as we have B candidate expansions. Using this bound, we can already remove solutions before checking dominances, achieving some speedup of the algorithm (see footnote 16).
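
A CPU analogue of this streaming top-B selection is a size-bounded min-heap whose root, once the heap is full, yields the score bound (a sketch; the GPU implementation processes whole batches instead):

```python
import heapq
from itertools import count

def streaming_top_b(batches, B):
    """batches: iterable of lists of (score, solution); keep the B highest-scoring.
    Once the heap holds B entries, its root is a lower bound on admissible scores."""
    heap, tie = [], count()  # the counter breaks ties between equal scores
    for batch in batches:
        for score, sol in batch:
            if len(heap) < B:
                heapq.heappush(heap, (score, next(tie), sol))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, next(tie), sol))
            # else: score is at or below the bound, discard before dominance checks
    return [(s, sol) for s, _, sol in sorted(heap, reverse=True)]
```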

2.4 Performance Improvements

There are many possibilities for improving the speed of the algorithm. For example, PyTorch lacks a segmented sort, so we use a much slower lexicographic sort instead. An efficient GPU priority queue would also allow a large speedup, as we currently use sorting because PyTorch's top-k function is rather slow for large k. In some cases, a binary search for the k-th largest value can be faster, but this introduces undesired CUDA synchronisation points.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Kool, W., van Hoof, H., Gromicho, J., Welling, M. (2022). Deep Policy Dynamic Programming for Vehicle Routing Problems. In: Schaus, P. (eds) Integration of Constraint Programming, Artificial Intelligence, and Operations Research. CPAIOR 2022. Lecture Notes in Computer Science, vol 13292. Springer, Cham. https://doi.org/10.1007/978-3-031-08011-1_14

  • DOI: https://doi.org/10.1007/978-3-031-08011-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08010-4

  • Online ISBN: 978-3-031-08011-1
