Introduction

Multi-objective optimization (MOO) problems [1, 2] are common in the real world, where the optimization of two or more objectives simultaneously is required. These problems often arise in various domains, such as engineering, finance, and logistics, where decision-makers must balance competing objectives [3, 4]. A MOO problem [5] can be formulated as follows, without any loss of generality:

$$\underset{x}{{\text{min}}}f\left(x\right)=({f}_{1}\left(x\right), {f}_{2}\left(x\right),\dots , {f}_{m}\left(x\right))$$
(1)

where \(f\left(x\right)\) consists of \(m\) objective functions and \(x\in X\subseteq {R}^{n}\) is a decision vector in the decision space \(X\). Because these \(m\) objectives are frequently in conflict, MOO seeks a collection of trade-off solutions known as Pareto optimal solutions. For two objective vectors \(u\) and \(v\), \(u\) is said to dominate \(v\) if and only if \({u}_{i}\le {v}_{i}\) for every \(i\in \{1, 2,\dots , m\}\) and \({u}_{j}< {v}_{j}\) for at least one \(j\in \{1, 2,\dots , m\}\). A solution \({x}^{*}\) is called Pareto optimal when there is no solution \(x\) such that \(f\left(x\right)\) dominates \(f\left({x}^{*}\right)\). All Pareto optimal solutions together form the Pareto set, and the corresponding objective vectors \(\{f\left({x}^{*}\right)|{x}^{*}\in \mathrm{Pareto\ set}\}\) form the Pareto front.
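To make the dominance relation concrete, the following minimal Python sketch (the function names are ours, for illustration only) checks dominance between two objective vectors and filters a candidate set down to its non-dominated subset, assuming every objective is to be minimized.

```python
from typing import List, Sequence

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if objective vector u dominates v (minimization):
    u is no worse in every objective and strictly better in at least one."""
    return all(ui <= vi for ui, vi in zip(u, v)) and any(ui < vi for ui, vi in zip(u, v))

def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Example: three objective vectors of a bi-objective problem.
front = non_dominated([(1.0, 4.0), (2.0, 3.0), (3.0, 3.5)])
# (3.0, 3.5) is dominated by (2.0, 3.0); the other two vectors are Pareto optimal.
```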

Routing problems such as multi-objective vehicle routing problems (MOVRPs) [6] and multi-objective traveling salesman problems (MOTSPs) [7] are typical MOO problems. Finding all exact Pareto optimal solutions of an MOO problem is extremely difficult: for many problems, even finding a single Pareto optimal solution can be NP-hard [1], and the number of Pareto solutions may grow exponentially with the problem size [2]. Moreover, the decision-maker’s preference among the objectives is often unknown in advance, which makes it very hard to reduce the problem to a single objective. Several approaches have been developed to approximate Pareto sets within an acceptable amount of computing time for various MOO settings. These techniques, such as the multi-objective evolutionary algorithms NSGA-II [8] and MOEA/D [9], frequently require carefully designed, problem-specific heuristics and can be highly labor-intensive. Whenever the problem formulation changes, such traditional approaches typically require a fresh search or an adaptation of the heuristic rules through iterative trial and error, even though many combinatorial optimization (CO) problems encountered in similar contexts share similar intrinsic structures [10]. In addition, the computational effort and complexity of this class of algorithms increase significantly with problem size. Evolutionary methods can search for feasible solutions, but the many iterations they require also make their computation time grow rapidly with the problem size [11, 12].

Combinatorial optimization aims to select the best decision variables in a discrete decision space, which aligns with the fundamental sequential decision-making nature of reinforcement learning (RL) [10, 13]. Deep RL, which enables both offline training and online inference of CO policies, is considered a promising approach for addressing CO problems [10, 11]. However, most current learning-based methods focus on single-objective optimization problems, while real-world applications often involve MOO [14]. In the context of MOO, the effectiveness of existing methods relies heavily on how well RL understands and exploits the weight vector. When RL encounters an unfamiliar weight vector, it may require training a new model or adapting an existing one to handle the relevant sub-problems, because the training and testing data need to follow similar distributions [13,14,15,16].

To overcome these limitations, there is a growing need for novel techniques that can successfully apply RL to the challenging domain of MOO. By developing innovative algorithms and strategies, we can enhance RL’s ability to handle diverse and complex objective functions, enabling more efficient and effective optimization in real-world scenarios [12, 16]. These advancements not only expand the applicability of RL in MOO but also facilitate more comprehensive and reliable decision-making processes [11].

One promising avenue to address these challenges is integrating dynamic programming-enhanced meta-reinforcement learning. By leveraging the power of dynamic programming [2], we can effectively capture the inherent structure of CO problems and facilitate more efficient adaptation to shifting problem formulations. This integration expedites the adjustment process and establishes a more versatile problem-solving framework. Furthermore, the essence of meta-reinforcement learning enhances this paradigm by enabling the approach to learn and generalize from past experiences. This empowers it to swiftly adapt to novel instances of MOO, thereby reducing the overhead associated with heuristic reconfiguration [17].

Our approach to tackling MOO challenges revolves around meta-learning, which enables our model to dynamically adjust parameters and acquire new tasks based on prior knowledge. The core component of our approach is the meta-value network, which guides various policies in handling a multitude of potential sub-problems. In our framework (depicted in Fig. 1), the meta-value network is trained to adapt quickly to different sub-problems through a few fine-tuning steps. This allows our model to efficiently leverage prior knowledge and rapidly adjust its parameters to address specific instances of MOO. By integrating exact dynamic programming within the meta-value network, we aim to elevate the solution quality and tackle the challenges MOO poses.

Fig. 1

An example of introducing machine learning into MOO. MOTSP instances are optimized to generate different solutions; among them, \({P}_{1}\), \({P}_{2}\), \({P}_{3}\) are different optimal trade-offs between the two objectives \({f}_{1}\left(x\right)\) and \({f}_{2}\left(x\right)\), and \({P}_{4}\) is a poor solution that should be avoided

Integrating exact dynamic programming within the meta-value network is a crucial aspect of our approach. Dynamic programming, known for its divide-and-conquer principles, allows us to decompose complex MOO problems into smaller sub-problems. By harnessing the power of dynamic programming, we can efficiently explore and exploit the decision space, optimizing the solution quality. To leverage the capabilities of neural networks, we approximate and enhance the dynamic programming principles within the meta-value network. The neural network component of our approach enables us to capture and generalize the inherent structure of the MOO problems, enhancing the adaptability and efficiency of the algorithm. By synthesizing the core principles of dynamic programming with the meta-learning paradigm, we create a robust methodology for addressing MOO challenges. The meta-learning component facilitates the rapid acquisition of new tasks and the adjustment of parameters while integrating dynamic programming principles that enhance the solution’s quality and efficiency.

Our comprehensive approach seamlessly combines the strengths of meta-learning, dynamic programming, and neural networks into a cohesive algorithm. At its core, the meta-value network serves as a guiding beacon, leveraging prior knowledge and dynamically adjusting parameters to adeptly handle the diverse sub-problems inherent in the MOO domain. This integration of dynamic programming principles within the neural network architecture elevates solution quality, enabling efficient exploration and exploitation of the decision space. By approximating and augmenting dynamic programming with neural networks, we forge a holistic methodology that unites the fundamental tenets of both paradigms, culminating in a robust and versatile solution for MOO challenges. Experimental results underscore its proficiency in enhancing solution quality, computation speed, and adaptability to a wide array of objective functions.

In summary, this paper makes the following contributions:

  • We incorporate meta-learning into the meta-critic network within the actor-critic framework. This allows us to learn various policies that tackle realistic MOO problems or tasks. The meta-learning component can serve as an adaptive plug-in that can be applied to reinforcement learning and supervised learning, enhancing their performance.

  • We propose a novel meta-attention network approximating the dynamic programming meta-value network to attain more precise policies. This meta-attention network balances accuracy, complexity, and runtime, enabling us to learn more exact policies that align with the given optimization tasks.

  • We design a Transformer-based policy network to address the challenges posed by realistic and complex routing optimization problems. This network is specifically tailored to handle such problems, providing contextual information during the encoder-decoder processing to aid in meta-learning, specifically the task-actor encoder.

The remaining sections of this paper are organized as follows: In Sect. “Related work”, we provide an introduction to multi-objective optimization, learning-based construction heuristic methods, learning-based improvement heuristic methods, dynamic programming, and the differences between the proposed method and existing methods. Sect. “Preliminaries and overview” presents the problem definition, the Markov decision process based on dynamic programming, and meta-learning background. Sect. “Methodology” provides a detailed description of the proposed method and its components, including the architecture of the policy network, the meta-learning optimization framework, and the overall optimization algorithm with its complexity analysis. In Sect. “Evaluation”, we validate the efficiency and effectiveness of the proposed method on multiple datasets and different tasks.

Related work

Multi-objective optimization

The realm of MOO has attracted increasing attention from various research communities for several decades [2]. Researchers have pursued two principal avenues in addressing MOO challenges: exact methodologies and approximation strategies. MOO often involves NP-hard problems of high dimension, rendering exact techniques impractically resource-intensive. Consequently, a wide range of heuristics and approximation methods [8, 9] have emerged to navigate this complexity, striving to yield a reasonable number of Pareto-optimal approximations within acceptable computational budgets [18]. These heuristics aim to balance solution quality and computational efficiency, producing approximate Pareto solutions that offer valuable insight into the trade-offs between competing objectives. Nevertheless, such heuristics often require meticulous handcrafting tailored to each unique problem instance. This need for painstaking design is a notable hurdle, particularly when transitioning to real-world scenarios where the problems become even more intricate.

As the field evolves, there is a growing inclination toward more adaptable and automated techniques. Relying on manually engineered heuristics for each distinct problem becomes cumbersome and nontrivial in practical applications [19]. Researchers are exploring innovative avenues that leverage the fusion of machine learning and optimization, thereby mitigating the need for bespoke heuristics [10, 20, 21]. These modern paradigms hold the potential to learn from past problem instances and generalize across a spectrum of scenarios, diminishing the burden of bespoke design efforts [3].

We broadly classify current learning-based approaches into those based on learned construction heuristics and those based on learned improvement heuristics. An extensive survey of methods at the intersection of deep RL and CO is given by Bengio et al. [13] and Mazyavkina et al. [22].

Construction heuristic

The essence of a construction heuristic lies in its ability to build solutions from scratch, meticulously exploring the realm of feasible solutions one step at a time. Vinyals et al. designed the pointer network [23], which uses an attention mechanism to learn the conditional probability of a sequence of permutations given the input. They applied supervised learning to train the model to solve TSP instances with 50 client nodes. The idea was extended by Bello et al. [24], who applied an actor-critic RL algorithm to train the pointer network and improved the performance with a fine-tuning approach based on supervised learning.

Graph neural networks [25] and attention mechanisms [26] are now widely used to craft encoding–decoding policies. Dai et al. [27] proposed a graph embedding network that learns greedy heuristics, and they applied the Q-learning algorithm to train the policy network. Li et al. [28] combined tree search and graph convolutional networks (GCN), training them in a supervised manner. Mittal et al. [29] employed supervised learning followed by RL to train policy networks modeled by GCN, enabling them to handle millions of instances through data preprocessing. Kool et al. [30] utilized a transformer [26] to model the encoder and decoder, employing the REINFORCE algorithm with a deterministic greedy rollout baseline to train the policy network. They also combined the greedy algorithm and beam search in decoding to generate a solution sequence. Joshi et al. [31] employed a GCN to generate an edge adjacency matrix representing the probability of edge occurrences in the solution, subsequently applying beam search to convert the edge probabilities into a reasonable solution. Fu et al. [32] integrated machine learning with heuristic techniques such as tree search, graph sampling, graph transformation, and heat map merging. Their approach extended a small pre-trained model to handle arbitrarily large instances of the Traveling Salesperson Problem (TSP).

Improvement heuristic

Approaches founded on improvement heuristics incorporate reinforcement learning (RL) into advanced search or aim to learn improvement operators directly [33, 34]. These methods typically take longer to solve instances (due to the extensive number of iterative operations) than construction heuristic-based approaches, yet they often yield superior results. Chen and Tian [35] proposed a method that iteratively modifies the local elements of a solution. A trainable region selection policy identifies the portions of the solution to be altered in each iteration, while a trainable rule selection policy selects an operation from a collection of feasible modification procedures in each iteration. Hottung and Tierney [36] employ predetermined manual procedures to systematically alter sections of the solution, which are then rebuilt using learned repair processes. Wu et al. [2] suggested using RL to select better solutions from defined local neighborhoods (e.g., 2-opt neighborhoods) for solving routing problems. Hottung et al. [37] learned continuous representations of solutions to discrete routing problems using conditional variational autoencoders and searched for solutions with a generic continuous optimizer. Lu et al. [38] introduced a learning-based iterative technique for the vehicle routing problem (VRP) that focuses on controller selection. Their approach learns to optimize solutions iteratively using an improvement operator based on RL. To enhance result quality, Zhao et al. [39] merged a deep reinforcement learning (DRL) model with a local search strategy, and they tested their approach on VRPTW instances of various sizes. Zheng et al. [40] combined three RL strategies (Q-learning, Sarsa, and Monte Carlo) with the well-known Lin–Kernighan–Helsgaun (LKH) algorithm for tackling the traveling salesperson problem (TSP). Similar methodologies based on improvement heuristics also encompass [41,42,43,44,45,46].

Dynamic programming

The construction and improvement heuristics mentioned above are powerful but limited in adaptability because particular procedures must be created for individual problems. Dynamic programming is a versatile and exact approach to finding an optimal solution. On the other hand, traditional dynamic programming techniques are confined to small-scale problem instances due to the curse of dimensionality; hence, they are not as popular as heuristics. For a complete examination of using classical dynamic programming to solve multi-objective problems, see Abo-Sinna [47].

If an effective and computationally efficient strategy for selecting potentially viable solutions can be devised, surpassing the limitations of dynamic programming might be possible. Currently, employing neural networks to approximate or enhance dynamic programming techniques presents a promising avenue of exploration. For instance, Yang et al. introduced a method called neural network dynamic programming (NNDP) [48].

In contrast to NNDP, the approach proposed by Xu et al. [49] eliminates the need for training test samples. Instead of training a single neural network for each problem size, they train a series of neural networks sequentially. Similar methodologies that amalgamate neural networks with dynamic programming to tackle combinatorial optimization problems include works like [50, 51], among others.

Differences from existing methods

Existing learning-based methodologies have predominantly focused on addressing single-objective combinatorial problems. Recent endeavors have sought to extend these techniques [14, 15, 52] to tackle the complexities of MOO. These approaches employ the MOEA/D [9] framework to decompose the MOO problem into a series of single-objective subproblems. Subsequently, a collection of models is established to address each subproblem independently.

Nonetheless, the exponential proliferation of Pareto solutions presents a formidable challenge, necessitating many models to identify the entire Pareto set comprehensively. In this paper, our proposed approach capitalizes on the power of meta-learning and meta-neural networks to construct versatile meta-models that exhibit superior adaptability and flexibility across diverse subproblems. By leveraging meta-learning, our models acquire the ability to adapt swiftly to new tasks based on prior knowledge, enabling efficient handling of a wide range of subproblems within an MOO problem.

Moreover, we integrate the principles of dynamic programming within the meta-model to facilitate the decomposition of MOO, ultimately enhancing the precision and accuracy of our solutions.

Preliminaries and overview

Problem setup

This paper focuses on solving two MOO problems: the multi-objective traveling salesman problem (MOTSP) and the multi-objective vehicle routing problem with time windows (MOVRPTW).

MOTSP [53] is defined on a graph \(G=(\overline{V }, \overline{E })\) with \(m\) cost matrices, where \(\overline{V }=\{0,\dots ,n\}\) is the node set (node 0 denotes the depot, and the other nodes denote customers) and \(\overline{E }=\{{e}_{ij}|i,j\in \overline{V }, i\ne j\}\) is the edge set. The \(k\)-th cost matrix \({[{c}_{i,j}^{k}]}_{n\times n}\) gives the cost of traveling from node \(i\) to node \(j\). The objective is to find a permutation \(\pi =(\pi \left(0\right), \pi \left(1\right),\dots ,\pi \left(i\right),\dots ,\pi (n))\) of the nodes that minimizes the \(m\) objective functions simultaneously. The \(k\)-th objective function is calculated as follows:

$${\varphi }_{k}\left(\pi \right)=\sum_{i=0}^{n-1}{c}_{\pi \left(i\right), \pi (i+1)}^{k}+{c}_{\pi \left(n\right), \pi (0)}^{k},\quad k=1, 2,\dots ,m$$
(2)

MOVRPTW [54] comprises multiple routes served by a fleet of identical vehicles with the same capacity. Each vehicle leaves the depot, serves customers under capacity and time-window constraints, and returns to the depot. We simultaneously optimize multiple conflicting objectives (for example, minimizing the total tour distance, the total time, etc.):

$$\underset{\pi }{{\text{min}}}\,f\left(\pi \right)=({f}_{1}\left(\pi \right), {f}_{2}\left(\pi \right),\dots ,{f}_{m}\left(\pi \right))$$
(3)

The decomposition of the multi-objective problem into multiple single-objective problems (scalar optimization subproblems) can be done using the strategy of weight sums, i.e., linear combinations of different objectives:

$$\underset{\pi }{{\text{min}}}\,f\left(\pi |{\lambda }^{j}\right)=\sum_{i=1}^{m}{\lambda }_{i}^{j}{f}_{i}\left(\pi \right)$$
(4)

where \({\lambda }^{j}=({\lambda }_{1}^{j},\dots ,{\lambda }_{m}^{j})\) is a uniformly distributed weight vector and \(f\left(\pi |{\lambda }^{j}\right)\) is the objective function of the \(j\)-th subproblem.
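As a concrete illustration of the weighted-sum decomposition in Eq. (4), the short Python sketch below builds uniformly spaced bi-objective weight vectors (the scheme used later in our experiments) and scalarizes a candidate solution's objective values; the numeric objective values shown are placeholders.

```python
import numpy as np

def uniform_weights(n_subproblems: int, m: int = 2) -> np.ndarray:
    """Uniformly spaced weight vectors for a bi-objective problem,
    running from (0, 1) to (1, 0)."""
    assert m == 2, "this sketch covers the bi-objective case only"
    w1 = np.linspace(0.0, 1.0, n_subproblems)
    return np.stack([w1, 1.0 - w1], axis=1)

def scalarize(objectives: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Eq. (4): weighted sum of the m objective values of each solution.
    objectives: (num_solutions, m); weights: (m,)."""
    return objectives @ weights

lams = uniform_weights(100)          # lambda^1 ... lambda^100
f_vals = np.array([[12.3, 7.8]])     # f_1(pi), f_2(pi) of one candidate tour (placeholder)
cost_j = scalarize(f_vals, lams[1])  # cost of the 2nd subproblem, lambda^2 = (1/99, 98/99)
```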

Dynamic programming and Markov decision processes

Dynamic programming was originally developed for multi-stage optimization problems [49]. When using it to solve a combinatorial optimization problem, we usually treat the problem as a multi-step decision problem: making an optimal decision at each step should lead to an optimal solution of the overall problem. Dynamic programming can be viewed as an improvement over basic search strategies when the problem exhibits optimal substructure and overlapping subproblems.

We denote \((S, A\left(s\right), P\left({s}{^\prime}|s, a\right), R\left(s, a\right), \gamma )\) as a discounted Markov Decision Process (MDP), where \(S\) is the set of problem states, \(A\left(s\right)\) is the set of feasible actions when in a state \(s\in S\), \(P\left({s}{^\prime}|s, a\right)\) is the transition probability of transitioning from state \(s\) to state \({s}{^\prime}\in {S}{^\prime}\subseteq S\) after taking an action \(a\in A(s)\), \(R(s, a)\) is a direct reward when taking action \(a\) in state \(s\), and \(\gamma \in [\mathrm{0,1})\) is the discount rate applied to the future reward. The Bellman equation determines the maximum value for each state:

$$V\left(s\right)=\underset{a\in A(s)}{\mathit{max}}(R\left(s, a\right)+\gamma \sum _{{s}{^\prime}\in {S}{^\prime}}P({s}{^\prime}|s, a)V({s}{^\prime}))$$
(5)

where \(V\left(s\right)\) is the value function of state \(s\). The Bellman equation can be solved by backward induction, and the final value function is produced by calculating the value function for smaller subproblems one at a time.
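The following small Python sketch illustrates the Bellman backup of Eq. (5) by value iteration on a toy tabular MDP; the transition and reward tables are randomly generated placeholders rather than anything taken from the routing problems studied here.

```python
import numpy as np

# Toy MDP: P[s, a, s'] are transition probabilities, R[s, a] immediate rewards
# (both hypothetical placeholders), gamma the discount rate.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(200):                      # value iteration until (approximate) convergence
    Q = R + gamma * P @ V                 # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V_new = Q.max(axis=1)                 # Bellman optimality backup, Eq. (5)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new
policy = Q.argmax(axis=1)                 # greedy policy w.r.t. the converged values
```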

Background on meta-learning

Meta-learning aims to generalize across tasks rather than around data points [55]. Inputs \({x}_{t}\), outputs \({a}_{t}\), a loss function \({\mathcal{L}}_{i}({x}_{t}, {a}_{t})\), a transition distribution \({P}_{i}({x}_{t}|{x}_{t-1}, {a}_{t-1})\), and an episode length \({H}_{i}\) define each task \({\Gamma }_{i}\). The distribution \(\pi ({a}_{t}|{x}_{1},...,{x}_{t};\theta )\) is modeled by a meta-learner with parameters \(\theta \). The meta-learner’s goal is to minimize its expected loss concerning \(\theta \), given a distribution over tasks \(\Gamma =P({\Gamma }_{i})\).

$$\underset{\theta }{{\text{min}}}\,{E}_{{\Gamma }_{i}\sim \Gamma }\left[\sum _{t=0}^{{H}_{i}}{\mathcal{L}}_{i}({x}_{t}, {a}_{t})\right]$$
(6)

where

$${x}_{t}\sim {P}_{i}\left({x}_{t}|{x}_{t-1}, {a}_{t-1}\right), {a}_{t}\sim \pi ({a}_{t}|{x}_{1},\dots ,{x}_{t};\theta )$$
(7)

The expected loss is optimized over tasks (or mini-batches) sampled from \(\Gamma \) to train a meta-learner. For testing, the meta-learner is evaluated on previously unseen tasks from a task distribution \(\widetilde{\Gamma }=P(\widetilde{{\Gamma }_{i}})\), which is comparable to the training task distribution \(\Gamma \).
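The sketch below illustrates the meta-training objective of Eqs. (6)–(7) in PyTorch. To keep it self-contained and runnable, the RL environment is replaced by a trivial synthetic supervised task (a random linear map standing in for a task \(\Gamma_i\)); what matters is the structure of the loop, which samples a mini-batch of tasks and minimizes the loss summed over each episode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
meta_learner = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(meta_learner.parameters(), lr=1e-3)

def sample_task():
    # A stand-in for Gamma_i ~ Gamma: each task is a random linear map y = A x.
    return torch.randn(4, 4)

for _ in range(200):                    # meta-training iterations
    opt.zero_grad()
    loss = 0.0
    for _ in range(8):                  # mini-batch of tasks Gamma_i ~ Gamma
        A, H_i = sample_task(), 10      # task parameters and episode length H_i
        x = torch.randn(H_i, 4)         # inputs x_t for this task's episode
        loss = loss + ((meta_learner(x) - x @ A.T) ** 2).mean()  # sum_t L_i(x_t, a_t)
    (loss / 8).backward()               # expected per-task loss, Eq. (6)
    opt.step()
```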

Methodology

Policy network

We use the transformer [26, 30, 56], the state-of-the-art modeling for routing optimization, as the infrastructure for policy networks. We also improve the original transformer to make it more suitable for complex routing problems.


Encoder: The encoder takes node features \({x}_{i}\) (\(i\in \overline{V }\)), including coordinates and demand, as input and first applies a linear projection \({h}_{i}^{0}={W}_{0}{x}_{i}+{b}_{0}\) (\({W}_{0}\) and \({b}_{0}\) are learnable parameters) to create an initial embedding \({h}_{i}^{0}\). The central part of the encoder consists of three self-attention blocks (SA) that generate the embedding.

$${\widehat{h}}_{i}^{node}=SA(SA(SA({h}_{i}^{0}, {H}^{0})))$$
(8)

where \({H}^{0}=({h}_{0}^{0},\dots ,{h}_{n}^{0})\) is the sequence of initial embeddings. Each block consists of a multi-head attention layer (MHA) and an element-wise fully connected layer (FF), each followed by a residual (res) connection and a batch normalization layer (BN).

$$ \begin{aligned} {h}_{i}^{l} &={BN}^{l}\left({BN}^{l-1}\left({h}_{i}^{l-1}+{FF}^{res}\left({h}_{i}^{l-1}\right)\right)\right.\\ &\quad \left. +{MHA}^{res}\left(\left\{{h}_{1}^{l-1},\dots ,{h}_{n}^{l-1}\right\},SA\right)\right) \end{aligned}$$
(9)

where \({h}_{i}^{l}\) denotes the embedding of the \(i\)-th node in layer \(l\) (\(l\in \{1,\dots ,N\}\)).

The general definition of the fully connected and residual layers is as follows:

$${MHA}^{res}\left({h}_{i},H;W\right)={h}_{i}+MHA({h}_{i},H;W)$$
(10)
$$FF\left({h}_{i};W,b\right)=max(0,W{h}_{i}+b)$$
(11)
$${FF}^{res}\left({h}_{i};W,b\right)={h}_{i}+max(0,W{h}_{i}+b)$$
(12)

We apply MHA to aggregate different types of information from other nodes, as follows:

$$\begin{aligned} & {MHA}_{i}^{l}\left(\left\{{h}_{1}^{l-1},\dots ,{h}_{n}^{l-1}\right\},SA\right)\\ &\quad =\sum_{m=1}^{{\text{M}}}{Att(h,H;{W}^{Q},{W}^{K})}_{m}{W}^{V}{H}_{m} \end{aligned}$$
(13)

where \(M\) denotes the number of heads in the MHA. The MHA is based on the self-attention operation, whose calculation is as follows:

$$Att\left(h,H;{W}^{Q},{W}^{K}\right)=softmax\left(\frac{1}{\sqrt{{d}_{K}}}{h}^{T}{({W}^{Q})}^{T}{W}^{K}{H}_{m}{|}_{m=1,\dots ,M}\right)$$
(14)

Both the W-series and b-series are learnable parameters (Fig. 2).
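A compact PyTorch sketch of one encoder block is given below; it follows the structure described in Eqs. (8)–(14) (MHA and FF sub-layers, each with a residual connection and batch normalization), while the specific dimensions and the three-feature node input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SABlock(nn.Module):
    """One encoder block: MHA + FF, each with a residual connection and BatchNorm."""
    def __init__(self, d_model=128, n_heads=8, d_ff=512):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.bn1 = nn.BatchNorm1d(d_model)
        self.bn2 = nn.BatchNorm1d(d_model)

    def forward(self, h):                        # h: (batch, n_nodes, d_model)
        a, _ = self.mha(h, h, h)                 # self-attention over all node embeddings
        h = self.bn1((h + a).transpose(1, 2)).transpose(1, 2)   # residual + BN over features
        f = self.ff(h)
        return self.bn2((h + f).transpose(1, 2)).transpose(1, 2)

# Initial embedding h_i^0 = W_0 x_i + b_0, then three stacked SA blocks (Eq. (8)).
embed = nn.Linear(3, 128)                        # e.g. (x, y, demand) node features
encoder = nn.Sequential(*[SABlock() for _ in range(3)])
node_embeddings = encoder(embed(torch.randn(2, 21, 3)))   # (batch, n_nodes, d_model)
```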

Fig. 2

An overview of the proposed framework. DPML consists mainly of a policy network (actors) and a meta-critic network. The meta-critic consists of a meta-value network and a task-actor encoder, instructing actors to learn about different policies

Decoder: The decoder accepts a context \({C}_{t}\) and selects the next node to add to the current tour by attending over the sequence \(seq=({\widehat{h}}_{0}^{node}, {\widehat{h}}_{1}^{node}, \dots , {\widehat{h}}_{n}^{node})\) at decoding step \(t\in \{1, ..., T\}\). The context \({C}_{t}\) consists of the graph embedding \({\widehat{h}}^{graph}=\frac{1}{n+1}{\sum }_{i=0}^{n}{\widehat{h}}_{i}^{node}\), the current vehicle’s remaining capacity \({Q}_{r}^{(t)}\), and the preceding node’s embedding \({\widehat{h}}^{pre}={\widehat{h}}_{pre}^{node}\):

$${C}_{t}=[{\widehat{h}}^{graph}; {Q}_{r}^{\left(t\right)}; {\widehat{h}}^{pre}]$$
(15)

The concatenation operator is denoted by \([;]\). A multi-head attention layer and a subsequent layer with only attention weights operate on a masked input sequence. The values of alternatives ruled out by hard constraints of the routing problem are set to \(-\infty \), which yields zero attention weights. At decoding step \(t\), the decoder outputs the probability \({P}_{\theta }\) of choosing node \(i\) as the decision \({\pi }_{t}\).

$${P}_{\theta }\left({\pi }_{t}=i| {\pi }_{1:t-1}, {x}_{i}\right)=Att(MHA\left({C}_{t}, seq\right), mask(seq))$$
(16)
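The sketch below illustrates one decoding step in PyTorch: the context \(C_t\) attends over the node embeddings, and nodes ruled out by the constraints are masked to \(-\infty\) so they receive zero probability, as in Eq. (16). The compatibility projection and tensor shapes follow common attention-model implementations and are not taken from our exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 128
mha = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
compat = nn.Linear(d_model, d_model, bias=False)     # single-head compatibility projection

def decode_step(context, node_emb, infeasible_mask):
    """context: (B, 1, d); node_emb: (B, n, d); infeasible_mask: (B, n) bool,
    True for nodes ruled out by capacity/time-window constraints."""
    glimpse, _ = mha(context, node_emb, node_emb,
                     key_padding_mask=infeasible_mask)            # MHA(C_t, seq)
    logits = torch.einsum('bod,bnd->bn', compat(glimpse), node_emb) / d_model ** 0.5
    logits = logits.masked_fill(infeasible_mask, float('-inf'))   # masked nodes get zero weight
    return F.softmax(logits, dim=-1)                              # P_theta(pi_t = i | ...)

B, n = 2, 21
probs = decode_step(torch.randn(B, 1, d_model), torch.randn(B, n, d_model),
                    torch.zeros(B, n, dtype=torch.bool))
next_node = probs.multinomial(1)                                  # sample the next node
```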

Meta-learning

We construct our policy network as an actor network for each training task. We also construct a core policy-guiding network, the meta-critic network, consisting of a meta-value network and a task-actor encoder. We train this meta-critic network simultaneously with multiple tasks (problem instances) using the actor-critic policy gradient algorithm. When dealing with a new task, this is akin to constructing a new actor network (the probability distributions generated by the decoding policies of different subproblem instances differ) while the meta-critic network remains intact.


Task-actor encoder: In the task-actor encoder, we input the task’s historical trajectory and get a task representation. We concatenate it with the general value network’s inputs (states and actions) and feed it into the meta-value network. While the transformer has achieved groundbreaking results in modeling sequences for supervised learning tasks, there is a distinct lack of demonstration of the transformer as a valuable RL memory. We define the task-actor encoder as a gated recurrent unit (GRU) that inputs a learning trace as a sequence of states:

$${s}_{t}=GRU({C}_{t}, {s}_{t-1}|{W}_{s}, {U}_{s}, {b}_{s})$$
(17)

where \({W}_{s}\) and \({U}_{s}\) are weight matrices and \({b}_{s}\) is a bias vector. The GRU derives a vector representation of the hidden state as follows:

$$u=sigmoid({W}_{u}{C}_{t}+{U}_{u}{s}_{t-1}+{b}_{u})$$
(18)
$$r=sigmoid({W}_{r}{C}_{t}+{U}_{r}{s}_{t-1}+{b}_{r})$$
(19)
$${s}_{t}=u\circ {s}_{t-1}+(1-u)\circ tanh({W}_{\rm T}{C}_{t}+{U}_{\rm T}\left(r\circ {s}_{t-1}\right)+ {b}_{\rm T})$$
(20)

where \(\circ \) is the element-wise multiplication.
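A minimal PyTorch sketch of the GRU-based task-actor encoder follows: it consumes a (hypothetical) trajectory of contexts \(C_t\) and returns the final hidden state as the task representation passed to the meta-value network; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_context, d_hidden = 130, 64          # C_t = [graph emb; remaining capacity; prev-node emb]
task_encoder = nn.GRU(d_context, d_hidden, batch_first=True)

# A hypothetical trajectory of T contexts from one task; the final hidden state
# serves as the task representation fed to the meta-value network.
trajectory = torch.randn(1, 20, d_context)          # (batch, T, d_context)
_, task_repr = task_encoder(trajectory)             # task_repr: (1, batch, d_hidden)
```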


Actor and Meta-critic network optimization: Policy networks for different tasks are optimized under the guidance of the meta-critic network. The policy network \({P}_{\theta }\) outputs a categorical distribution \({\pi }_{t}\) over available actions, from which actions are sampled, \({a}_{t}\sim {P}_{\theta }\). Suppose there exist \({M}^{(T)}\) different training tasks (problem instances); the update rules for each actor (policy) \(i\) and the meta-critic network \({Q}_{\phi }\) are as follows:

$$\begin{aligned} &{\theta }^{(i)}\leftarrow \underset{{\theta }^{\left(i\right)}}{\mathit{argmax}}l({\pi }_{t}^{\left(i\right)}, {a}_{t}^{(i)}){Q}_{\phi }\left({s}_{t}^{\left(i\right)}, {a}_{t}^{\left(i\right)}, {C}_{t}^{(i)}\right),\\ &\quad \forall i\in [\mathrm{1,2},\dots ,{M}^{(T)}] \end{aligned}$$
(21)
$$\begin{aligned}\phi &\leftarrow \underset{\phi }{\mathit{argmin}}\sum_{i=1}^{{M}^{(T)}}({Q}_{\phi }\left({s}_{t}^{\left(i\right)}, {a}_{t}^{\left(i\right)}, {C}_{t}^{\left(i\right)}\right)-{R}_{t}^{\left(i\right)}\\ &\quad-\gamma {Q}_{\phi }({s}_{t+1}^{\left(i\right)}, {a}_{t+1}^{\left(i\right)}, {C}_{t+1}^{(i)}))^{2}\end{aligned}$$
(22)

where \(l(\cdot ,\cdot )\) is the cross-entropy loss function, \({C}_{t}^{\left(i\right)}\) is the contextual embedding (Eq. (15)) generated by our policy network \({P}_{\theta }\), and \({a}_{t}^{\left(i\right)}={P}_{{\theta }^{(i)}}({s}_{t}^{(i)})\).
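The following schematic PyTorch functions render the update rules of Eqs. (21)–(22): each actor is updated with the cross-entropy of its sampled action weighted by the meta-critic value (a conventional actor-critic surrogate), and the shared meta-critic minimizes a squared TD error. The networks and tensors in the usage example are stand-ins, not our actual models.

```python
import torch
import torch.nn.functional as F

def actor_step(actor_opt, logits, action, q_value):
    """One actor update: cross-entropy of the sampled action weighted by the
    meta-critic value (an actor-critic surrogate corresponding to Eq. (21))."""
    loss = (F.cross_entropy(logits, action) * q_value.detach()).mean()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()

def meta_critic_step(critic_opt, q_t, reward, q_next, gamma=0.99):
    """One meta-critic update: squared TD error of Eq. (22), summed over tasks upstream."""
    loss = (q_t - reward - gamma * q_next.detach()).pow(2).mean()
    critic_opt.zero_grad(); loss.backward(); critic_opt.step()

# Toy usage with stand-in tensors: logits over 10 candidate nodes, one sampled
# action, and a scalar critic value Q_phi(s_t, a_t, C_t).
actor_head = torch.nn.Linear(16, 10)
opt_a = torch.optim.Adam(actor_head.parameters(), lr=1e-4)
actor_step(opt_a, actor_head(torch.randn(1, 16)), torch.tensor([3]), torch.tensor(0.7))
```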


Meta Graph Attention Networks: We incorporate meta-knowledge into the GAT [57] to obtain MetaGAT, which shares parameters and inputs with the policy network. The score of edge \({e}_{ij}\) depends on the hidden states of node \(i\) and node \(j\) and on the meta-knowledge of nodes and edges learned from the graph. For edge \({e}_{ij}\), we look up the hidden states of the nodes, i.e., \({\widehat{h}}_{i}^{node}\) and \({\widehat{h}}_{j}^{node}\), and the meta-knowledge \({MK}_{ij}\), which is a combination of node and edge meta-knowledge:

$${MK}_{ij}={MK}^{\left(\overline{V }\right)}\left(i\right)\parallel {MK}^{\left(\overline{V }\right)}\left(j\right)\parallel {MK}^{\left(\overline{E }\right)}\left({e}_{ij}\right)$$
(23)

Here \(\parallel \) denotes the concatenation operator, and the attention score is calculated as follows:

$${w}_{ij}=Att({\widehat{h}}_{i}^{node}, {\widehat{h}}_{j}^{node}, {MK}_{ij})$$
(24)

Here \({w}_{ij}\) is a vector indicating the importance of \({\widehat{h}}_{i}^{node}\) for \({\widehat{h}}_{j}^{node}\).


Approximate dynamic programming meta-value network: Dynamic programming can be used to solve optimization problems in two ways: the first is value-based, solving for the value function of the states and then selecting a policy that maximizes the value function; the other is to derive the optimal policy for each state directly and perform the optimal action in each state. Since our meta-learning-based RL algorithm constructs a meta-value network to guide policies across different tasks, it follows the former approach.

We implement dynamic programming using the above meta-neural networks to approximate the meta-value network. We approximate the value function \({V}_{\varphi }\) starting from the smallest problem or the state in the last decision step. MetaGAT is trained to minimize the mean square error (MSE) of the results of the value function estimation:

$$\begin{aligned}MSE&=\frac{1}{{T}_{a}^{s}}\sum _{{s}_{1}\in {S}_{1}}\sum _{{a}_{1}\in A({s}_{1})}\\ &\quad{({MetaGAT}_{1}\left({s}_{1}, {\varphi }_{1}, {a}_{1}\right)-R({s}_{1}, {a}_{1}))}^{2}\end{aligned}$$
(25)

where \({T}_{a}^{s}\) is the total number of state-action combinations, \({\varphi }_{1}\) is the parameter of \({MetaGAT}_{1}\), \({S}_{1}\) is the set of all possible states of the subproblem, and \(R({s}_{1}, {a}_{1})\) is the instantaneous reward obtained by selecting action \({a}_{1}\) in state \({s}_{1}\). Then, for the subsequent neural networks corresponding to later subproblems, the neural network \({MetaGAT}_{n}\) is trained to minimize the following MSE:

$$\begin{aligned}MSE&=\frac{1}{{T}_{a}^{s}}\sum _{{{s}_{n}\in S}_{n}}\sum _{{a}_{n}\in A({s}_{n})}\\ &\quad({MetaGAT}_{n}\left({s}_{n}, {\varphi }_{n}, {a}_{n}\right)-R\left({s}_{n}, {a}_{n}\right)\\ &\quad-{V}_{n-1}^{\mathcal{N}}({s}_{n-1}))^{2}\end{aligned}$$
(26)
$${V}_{n-1}^{\mathcal{N}}\left({s}_{n-1}\right)=\sum _{i=1}^{n-1}R({s}_{i}, \mathit{arg}\underset{a}{{\text{max}}}{MetaGAT}_{i}({s}_{i}, {\varphi }_{i}, {a}_{i}))$$
(27)

where \({s}_{i}\) is the state of subproblem \(i\) obtained by following the policies generated by the neural networks. The complete flow of the proposed method is shown in Algorithms 1, 2, and 3, where Algorithms 1 and 3 constitute the meta-learning procedure that can be used as a general framework. Algorithm 2 is a specific RL algorithm that can be replaced with other RL algorithms or supervised learning.
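A schematic rendering of the staged training in Eqs. (25)–(27) is sketched below: the first network regresses the immediate reward, and each subsequent network regresses the reward plus the value accumulated by its already-trained predecessors. The MetaGAT modules are replaced by small MLP stand-ins and the data are random placeholders, purely to illustrate the training targets.

```python
import torch
import torch.nn as nn

# Stand-ins: each MetaGAT_n is represented by a small MLP over (state, action)
# features; real states, actions, and rewards would come from the routing environment.
d_sa = 32
meta_gats = [nn.Sequential(nn.Linear(d_sa, 64), nn.ReLU(), nn.Linear(64, 1))
             for _ in range(5)]                     # one network per subproblem stage

def train_stage(n, sa_batch, reward_batch, v_prev):
    """Eq. (25) for the first stage (v_prev = 0) and Eq. (26) afterwards:
    regress MetaGAT_n(s_n, a_n) onto R(s_n, a_n) + V_{n-1}(s_{n-1})."""
    opt = torch.optim.Adam(meta_gats[n].parameters(), lr=1e-3)
    for _ in range(100):
        pred = meta_gats[n](sa_batch).squeeze(-1)
        loss = ((pred - reward_batch - v_prev) ** 2).mean()      # MSE target
        opt.zero_grad(); loss.backward(); opt.step()

# Hypothetical data for two stages; v_prev for stage 1 stands in for V_{n-1} of
# Eq. (27), i.e. the reward accumulated by greedy actions of the trained predecessors.
sa, r = torch.randn(256, d_sa), torch.rand(256)
train_stage(0, sa, r, torch.zeros(256))
train_stage(1, sa, r, torch.rand(256))
```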

Algorithm 1

Meta-learning

Algorithm 2

Actor-critic algorithm

Algorithm 3

Meta-testing


Complexity analysis: Our transformer-based encoder has a time complexity of \(O({n}^{2}*d)\), where \(d\) is the embedding dimension and \(n\) is the number of nodes: self-attention computes pairwise interactions among all \(n\) nodes, and each interaction involves operations on \(d\)-dimensional features, which accounts for the quadratic dependence on \(n\) and the linear dependence on \(d\). The decoder, in contrast, outputs one element of the solution sequence at a time by attending over the \(n\) nodes, resulting in a computational complexity of \(O(n)\) per decoding step, determined primarily by the number of nodes.

Several factors influence the training algorithm’s complexity, namely \(episode\_max\), the number of mini-batches, the batch \(B\), and \(step\_max\). Algorithm 1, consisting of a triple loop, has a time complexity of \(O(step\_max*{\text{mini}}\text{-}{\text{batches}}*episode\_max)\), where \(step\_max\) and \(episode\_max\) represent the maximum numbers of time steps and episodes, and mini-batches denotes the number of sampled mini-batches. Algorithm 2, with a double loop iterating over the batch \(B\) and \(step\_max\), has a time complexity of \(O(B*step\_max)\), where \(B\) represents the number of batches. Algorithm 3, comprising a single loop, has a time complexity of \(O(step\_max)\). Finally, the time complexity of DPML is determined by the highest among these, which is \(O(step\_max*{\text{mini}}\text{-}{\text{batches}}*episode\_max)\).


Optimality analysis: This paper’s multi-objective routing optimization problems aim to minimize the cost of each sub-problem (sub-objective function), as shown in Eq. (4).

Lemma 1

[58] A feasible solution \({\pi }^{*}\) is Pareto optimal if and only if there exists a weight vector \(\lambda >0\) such that \({\pi }^{*}\) is an optimal solution to the problem (4).

Lemma 1 states that every Pareto solution may be obtained by solving a subproblem with specific weights. Accordingly, in the best case, if every produced solution \(\pi \) is the optimal solution \({\pi }^{*}\) of problem (4) with weight vector \(\lambda \), our model can provide the complete Pareto set of the original MOO problem.

Assumption 1

A potentially robust approximation of the complete Pareto set of an MOO problem can be achieved if the proposed method is capable of effectively solving subproblems with any weight \(\lambda \).

We verify this assumption using the well-known \(\varepsilon \)-Pareto approximation approach for MOO [2, 59].

Definition 1

(\(\varepsilon \)-Pareto Domination). For an MOO problem and \(\varepsilon >0\), if \({f}_{i}\left({\pi }_{a}\right)\le \left(1+\varepsilon \right){f}_{i}\left({\pi }_{b}\right), \forall i\in \{1,\dots , m\}\), then \({\pi }_{a}\) \(\varepsilon \)-dominates \({\pi }_{b}\) (\({\pi }_{a}\) \({\prec }_{\varepsilon }\) \({\pi }_{b}\)).

This formulation is a straightforward generalization of the (1 + \(\varepsilon \))-approximation for single-objective optimization, and it allows us to define \(\varepsilon \)-approximate Pareto sets [60] as follows:

Definition 2

(\(\varepsilon \)-Approximate Pareto Set). For an \(\varepsilon >0\), a set \({\mathcal{P}}_{\varepsilon }\) is an \(\varepsilon \)-approximate Pareto set, if for any feasible solution \(\pi \), there exists a solution \({\pi }{\prime}\in {\mathcal{P}}_{\varepsilon }\) such that \({\pi }{\prime}{\prec }_{\varepsilon }\pi \).

The solutions in \({\mathcal{P}}_{\varepsilon }\) can collectively \(\varepsilon \)-dominate nearly all feasible solutions to the MOO problem [3, 59]. The \(\varepsilon \)-approximate Pareto set is therefore a sensible choice in practice when finding the exact Pareto set is intractable. Although several \(\varepsilon \)-approximate Pareto sets may exist, each MOO problem has a unique Pareto set. The effectiveness of our policy on each single-objective subproblem substantially influences its ability to identify \(\varepsilon \)-approximate Pareto sets.

Theorem 1

Let \({\pi }^{*}\) represent the optimal solution to problem (4) with weight \(\lambda \). Our method can produce an \(\varepsilon \)-approximate Pareto set \({\mathcal{P}}_{\varepsilon }\) for the MOO problem if it can produce an approximate solution \(\pi {\prec }_{\varepsilon }{\pi }^{*}\) for any weight \(\lambda \).

Proof

Let \(\mathcal{P}\) be the Pareto set of the MOO problem. Lemma 1 states that for each \({\pi }_{Pareto}\in \mathcal{P}\), there exists a weight vector \(\lambda >0\) such that \({\pi }_{Pareto}={\pi }^{*}\) is an optimal solution to subproblem (4) with weight \(\lambda \). As a result, our method can produce an approximate solution \(\pi {\prec }_{\varepsilon }{\pi }^{*}={\pi }_{Pareto}\). By producing such approximate solutions for all \({\pi }_{Pareto}\in \mathcal{P}\), our approach yields an \(\varepsilon \)-approximate Pareto set \({\mathcal{P}}_{\varepsilon }\) for the MOO problem.

Evaluation

Experiment settings

Datasets for MOTSP: DPML was evaluated on the Euclidean bi-objective TSP [61], a standard MOTSP. This problem has two objectives that must be optimized concurrently, each with its own Euclidean distance matrix. Each node has two coordinate features, either uniformly sampled from \([0, 1]\times [0, 1]\) or taken from a standard benchmark instance and normalized to \([0, 1]\times [0, 1]\). The Euclidean distance of each objective is calculated from the corresponding coordinate feature. We set the number of weight vectors \(N\) in the Pareto front construction to 100; they are uniformly distributed between (0, 1) and (1, 0), i.e., \({\lambda }^{1}=\left(0, 1\right)\), \({\lambda }^{2}=(\frac{1}{99}, \frac{98}{99})\),…, \({\lambda }^{100}=(1, 0)\).


Datasets for MOVRPTW: The instances for MOVRPTW are created in the same way as the “R” (random) group of the Solomon dataset [39]. The capacity for problem sizes 20 and 50 is set to \({Q}^{20}=500\) and \({Q}^{50}=750\), respectively. The overall time window is \([{a}_{0}=0, {b}_{0}=1000]\), and the service duration \({h}_{i}\) is uniformly set to 10; additional details are given in the supplementary material. We employ the same vehicle capacities \({Q}^{20}=30\) and \({Q}^{50}=40\) and the same validation and test sets as indicated in [30]. For an MOVRPTW solution, its total travel time (\({f}_{1}\)) is larger than its makespan (\({f}_{2}\)), so we balance the weights between these two objectives and modify the aggregation function of the \(j\)-th subproblem as follows:

$$f\left(\pi |{\lambda }^{j}\right)={\lambda }_{1}^{j}\frac{{f}_{1}(\pi )}{{f}_{1}^{*}}+{\lambda }_{2}^{j}\frac{{f}_{2}(\pi )}{{f}_{2}^{*}}$$
(28)

Here \({f}_{1}^{*}\) and \({f}_{2}^{*}\) are the global minimum values of the corresponding objectives. Euclidean instances with 4-D input and mixed instances with 3-D input are generated as the training set for our method; all are drawn from a uniform distribution over [0, 1].


Baseline and evaluation metrics: Whereas most state-of-the-art learning-based approaches and solvers are designed for single-objective optimization, the proposed methodology focuses on MOO in real-world applications. We therefore choose representative state-of-the-art learning-based methods as baselines: AM [30] (which deals with different subproblems separately), MODRL/D-AM [15], MODRL/D-EL [52], and DRL-MOA [14]. We also choose the traditional evolutionary algorithms NSGA-II [8] and MOEA/D [9], two of the most popular multi-objective evolutionary algorithms widely used in real-world applications.

We apply Hypervolume (HV) and the number of non-dominated solutions (|NDS|) to evaluate the various methods. HV is an important metric that captures both the convergence and the diversity of the Pareto front, while |NDS| reflects the diversity of the Pareto front when HV values are close. In general, methods with larger HV or |NDS| perform better.


Hyper-parameters: The policy network’s node encoder comprises three SA blocks with a dimension of 128; \({d}_{node}=128\) is the embedding dimension for all problems. We employ a hidden dimension of 256 for the decoder. As meta node- and edge-knowledge learners, we employ two FCNs (two layers with the same number of hidden cells) and perform a grid search for the number of hidden cells over {4, 8, 16, 32, 64}. We employ the same number of hidden cells for the task-actor encoder and Meta-GAT and run a grid search for this number over {16, 32, 64, 128}. The model is trained for 50 epochs, each with 1,024,000 training instances, with batch sizes \({BS}^{20}=512\) and \({BS}^{50}=128\) for problem sizes 20 and 50, respectively. We apply the Adam optimizer with a smooth learning rate decay schedule \({\eta }_{{t}_{e}}=(\frac{1}{1+\gamma {t}_{e}}){\eta }_{{t}_{e}-1}\) at epoch \({t}_{e}\), with an initial learning rate of \({10}^{-4}\) and decay factor \(\gamma =0.001\). Our model largely converges after 50 epochs, whereas comparable studies typically require 100 epochs to reach similar results. Our implementation is available at https://github.com/Anonymousauthorx/DPML.
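For reference, the learning-rate schedule \(\eta_{t_e} = \eta_{t_e-1}/(1+\gamma t_e)\) can be realized as a multiplicative per-epoch decay, for example with PyTorch's MultiplicativeLR; the sketch below uses the stated initial rate and decay factor with a stand-in model.

```python
import torch

model = torch.nn.Linear(128, 128)                     # stand-in for the policy network
opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial learning rate 1e-4
gamma = 0.001                                         # decay factor from the paper

# eta_{t_e} = eta_{t_e - 1} / (1 + gamma * t_e), applied once per epoch.
sched = torch.optim.lr_scheduler.MultiplicativeLR(
    opt, lr_lambda=lambda t_e: 1.0 / (1.0 + gamma * t_e))

for epoch in range(50):                               # 50 training epochs
    # ... one epoch over the 1,024,000 training instances ...
    opt.step()
    sched.step()
```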

Experimental results

We trained and evaluated various learning-based algorithms on 128 random instances of MOTSP-20 and MOTSP-30 with 20 and 50 nodes, respectively, to assess their performance. We also trained on random instances of MOTSP-50, MOTSP-80, and MOTSP-100 and then tested each instance independently to examine whether the presented meta-model can generalize to diverse learning tasks. Unlike previous meta-learning techniques, ours does not require task-specific parameter fine-tuning and merely requires training a meta-critic network to correspond to policies with different data distributions. In contrast, MODRL/D-AM [15], DRL-MOA [14], and MODRL/D-EL [52] are trained and tested with a one-to-one correspondence on MOTSP-30, MOTSP-80, and MOTSP-100.

“T-step” denotes updating the stochastic model and the meta-model with \(T\) steps for each sub-problem. When using DRL-MOA with T-step, we first use AM to update the initial model for 5000 steps on the first subproblem with weight vector \({\uplambda }^{1}=(0, 1)\). Then, using \(T\) update steps, the next model is derived from the previous sub-model for each of the remaining 99 sub-problems. In these experiments, we set the reference point for calculating HV to (60, 60). The average HV and |NDS| results over 128 random instances are shown in Table 1.

Table 1 Average results of HV and |NDS| for different methods trained and tested on instances with different numbers of nodes

Regarding accuracy (i.e., on MOTSP20 and MOTSP30), our technique beats AM and DRL-MOA while requiring very few update (time) steps to attain good results. The findings also suggest that the proposed meta-models have strong generalization ability when adapting to new problems of various sizes. DPML outperforms AM and DRL-MOA even though the latter are trained directly on the corresponding problems and therefore face no scaling issues. The proposed meta-critic-based meta-model captures common (data-distribution) properties of MOTSP, so a small amount of training on diverse cases is enough to provide promising results for instances of various sizes.

For MOVRPTW, we train our meta-models using Solomon instances, namely R101, R102, and R103, and test them separately. These instances have 100 clients, and their coordinates are normalized to \([0, 1]\times [0, 1]\). Table 2 shows HV and |NDS| for learning-based approaches. We see that DPML is better than the other methods in performance and generalization. Other methods can only follow the traditional training and testing paradigm of training a model with one dataset and then testing it on another. In contrast, our method can train a meta-critic using different datasets and then generate policies based on different data distributions in different test sets.

Table 2 The HV and |NDS| results were obtained by different learning-based methods on three Solomon instances

The Pareto fronts obtained by the three learning algorithms on the VRPTW instance of Solomon (R101) can be observed in Fig. 3a. The non-dominated solutions discovered by DPML are located in the lower-left corner of the figure, exhibiting a uniform dispersion. MOVRPTW, on the other hand, reveals a smaller number of non-dominated solutions due to the positive correlation between the evaluated objectives (total tour time and completion time), which implies that optimizing one objective can improve the other. Figure 3a illustrates that DPML provides decision-makers with valuable trade-off options. In Fig. 3b, the fluctuation of |NDS|, derived from the scatterplot test of the first four epochs of DPML (trained on R101 and tested on R102), can be observed as the number of time steps increases. It is noteworthy that DPML exhibits consistent fluctuations within a relatively stable range during the test, even when dealing with unknown data distributions. This demonstrates its strong adaptability in learning diverse policies through the meta-model.

Fig. 3

The left figure a shows the Pareto fronts obtained by different learning-based algorithms on the Solomon VRPTW instance, and the right figure b shows the |NDS| scatter plot obtained during testing


Effect of DPML on other datasets: We also performed comparison experiments on three commonly used benchmark MOTSP instances: kroAB100, kroAB150, and kroAB200 [14]. These instances are constructed from kroA and kroB in the TSP library. We set the reference point used to calculate HV to (90, 90) for these instances. The results are shown in Table 3.

Table 3 Different learning-based algorithms obtained HV and | NDS | results on three MOTSP instances

We compared our proposed DPML method with the state-of-the-art reinforcement learning techniques tailored explicitly for MOO, namely MODRL/D-AM and MODRL/D-EL, on the R201, R202, and R203 datasets. While MODRL/D-AM employs decomposition and AM for MOO, MODRL/D-EL takes a step further by integrating evolutionary algorithms to fine-tune its parameters based on MODRL/D-AM.

The results presented in Table 4 demonstrate the competitive performance of DPML in terms of both HV and the cardinality of the non-dominated set (|NDS|). This achievement can be attributed to the distinctive features of DPML, particularly the incorporation of meta-learning and dynamic programming. These enhancements give DPML greater flexibility and precision than its counterparts, MODRL/D-AM and MODRL/D-EL.

Table 4 Comparative results of HV and |NDS| of DPML and state-of-the-art multi-objective reinforcement learning methods on MOVRP dataset

We also compared our method with state-of-the-art evolutionary algorithms; the results are shown in Table 5.

Table 5 DPML and traditional evolutionary algorithms on MOTSP instances obtained HV and |NDS| results

The results show that DPML outperforms other learning-based methods on many MOTSP datasets and outperforms evolutionary algorithms with only a few update steps required.

Table 6 compares the results between DPML and traditional methods on the Solomon dataset. Again, it can be seen that our method requires very few update steps to outperform traditional evolutionary algorithms.

Table 6 Comparative results of HV and |NDS| of DPML and evolutionary algorithms on MOVRP dataset

Unseen weight vectors: Table 7 compares our technique to DRL-MOA on new subproblems with unseen weight vectors. Twenty weight vectors were created for the test. The \(f\) column reports the weighted-sum cost \({\uplambda }_{1}{f}_{1}+{\lambda }_{2}{f}_{2}\) of the corresponding sub-problem. The “submodel” column shows the neighboring submodels used for fine-tuning in DRL-MOA. As shown in the table, DPML beats DRL-MOA on all new subproblems, for the following reasons: DRL-MOA trains submodels that are specific to each subproblem, so a competent submodel for a new subproblem may not be obtained by fine-tuning the neighboring submodels in only a few steps.

Table 7 Comparative results of DPML and DRL-MOA when dealing with new subproblems with unseen weight vectors

The meta-model in DPML, on the other hand, learns from diverse subproblems and captures their common characteristics. As a result, it can better handle new sub-problems with unseen weight vectors. In this example, DPML only needs to keep one meta-model in memory, whereas DRL-MOA requires ten sub-models. All evidence indicates that DPML is an effective learning paradigm for MOO.


Ablation study: Due to the composition of various components in the proposed method, it is necessary to conduct ablation studies to validate the effectiveness of each component. We individually removed dynamic programming, meta-learning, and task-actor encoders to observe the performance of DPML. The experimental results of the ablation studies are presented in Table 8, with results obtained from R201, R202, and R203. When dynamic programming was removed, DPML’s HV decreased by 5.06%, 1.88%, and 1.41%, respectively. Simultaneously, |NDS| decreased by 33.33%, 30.77%, and 41.67%. Removing meta-learning decreased DPML’s HV by 9.47%, 8.78%, and 19.93%, while |NDS| decreased by 40%, 46.15%, and 50%. Similarly, when the task-actor encoders were removed, DPML’s HV decreased by 2.01%, 1.53%, and 1.38%, and |NDS| decreased by 20%, 15.38%, and 25%. These results indicate the crucial role played by the three components in the performance of DPML, with meta-learning and dynamic programming being particularly important.

Table 8 The performance of DPML is evaluated under three distinct scenarios: absence of dynamic programming (DG), absence of meta-learning (Meta), and absence of the task-actor encoder (TAE)

Discussion

A range of experiments has substantiated the efficiency and effectiveness of the proposed method, and the underlying reasons can be summarized as follows: Firstly, by leveraging meta-learning, we circumvent the need for excessive individual models, thereby streamlining the computational complexity associated with identifying the complete Pareto set. Secondly, incorporating dynamic programming principles within our meta-models empowers us to decompose and address the intricacies of MOO effectively. Integrating divide-and-conquer strategies with meta-learning enhances solution quality and enables us to navigate the MOO more efficiently.

The capacity of deep learning tends to enhance as the model’s complexity increases. However, this increased complexity also creates a computational burden, necessitating additional computing resources. Furthermore, when utilizing deep meta-reinforcement learning to simulate dynamic programming, the meticulous design of neural networks is essential to preserve their differentiability, resulting in a relatively higher overall workload. To address these concerns, we have taken measures to balance the capacity and complexity of our deep learning model. We have optimized the computational efficiency by carefully designing our neural networks to maintain differentiability, thereby reducing the overall workload. Additionally, we have carefully managed the allocation of computing resources to ensure efficient utilization while maximizing the model’s capacity for enhanced performance.

Conclusion

We propose a novel meta-reinforcement learning approach based on dynamic programming to tackle real-world multi-objective routing optimization problems. Our method introduces a policy network with a transformer architecture designed for complex routing problems. This transformer allows the policy network to capture and utilize rich contextual information from nodes and graphs. To incorporate meta-learning into the RL framework, we introduce a meta-critic network that leverages multiple tasks. This network comprises a task-actor encoder and a meta-value network, which work together to guide the policy network in handling diverse sub-problems. The meta-value network is constructed using a meta-attention network, which combines meta-knowledge to approximate the value function. Additionally, we employ a GRU-based task-actor encoder to provide historical trajectories for the meta-critics, enhancing the effectiveness of RL training in practical scenarios.