Solving two-stage stochastic route-planning problem in milliseconds via end-to-end deep learning

With the rapid development of the e-economy, ordering via online food delivery platforms has become prevalent in recent years. Nevertheless, the platforms face many challenges, such as tight time limits and uncertainty. This paper addresses a complex stochastic online route-planning problem (SORPP), which is formulated as a two-stage stochastic programming model. To meet the immediacy requirement of online operation, an end-to-end deep learning model composed of an encoder and a decoder is designed. Different network layers are adopted in the encoder to embed different problem-specific features: to extract implicit relationships, the probability mass functions of the stochastic food preparation time are processed by a convolutional neural network layer; to provide global information, the location map and rider features are handled by factorization-machine (FM) and deep FM layers, respectively; to screen out valuable information, the order features are embedded by attention layers. In the decoder, the permutation sequence is predicted by long short-term memory cells with attention and masking mechanisms. To learn a policy for finding the optimal permutation under the complex constraints of the SORPP, the model is trained in a supervised manner with labels obtained by an iterated greedy search algorithm. Extensive experiments are conducted on real-world data sets. The comparative results show that the proposed model is more efficient than meta-heuristics and yields higher-quality solutions than heuristics. This work provides an intelligent optimization technique for complex online food delivery systems.


Introduction
With the prevalence of the mobile Internet, online food delivery (OFD) apps have become increasingly popular for their convenience in daily life. An enormous number of transactions are completed via these apps every day. In 2016, the worldwide food delivery market reached €83 billion [39]. In China, one of the best-known OFD platforms, Meituan, obtained a total revenue of ¥24.7 billion for the second quarter of 2020. Over 457 million customers order food on the Meituan platform, with more than 6.3 million active restaurants to choose from [23]. In the United States, the total food sales of OFD were expected to grow by 16% from 2017 to 2022 according to Morgan Stanley Research [24]. With huge market opportunities and strong user demand, OFD will continue to develop quickly and steadily in the future.
The main operating mode of Meituan is shown in Fig. 1. When a customer orders food, the order is pushed to the corresponding restaurant and then instantly assigned by the platform to a rider with a well-planned route. The whole mode can be abstracted as an order dispatching problem and an online route-planning problem (ORPP), where the latter is the key problem of the system. The quality of the planned routes directly influences the assignment of orders to riders, and improper assignments cause a great waste of transportation resources. Besides, low-quality routes force riders to take detours or deliver food later than the estimated time of arrival (ETA) promised to customers, and therefore affect the efficiency of riders and the experience of customers. Since there are massive numbers of orders every day and each order needs to be delivered in a short period of time (usually less than 40 min), the platform must make decisions very fast, within 1 min. The computational time left for route planning is limited to the millisecond level. Consequently, the platforms need intelligent techniques that deal with these complex problems efficiently and robustly.
Extensive research on solution methods [1,29] has been carried out for the traditional route-planning problem (RPP), which is abstracted as the traveling salesman problem (TSP), including the branch and bound method [3], the 2-opt algorithm [34], the genetic algorithm (GA) [12], and ant colony optimization [13]. Compared to the traditional RPP or TSP, the ORPP is much more complex. In addition to the immediacy requirement of online operation, the ORPP is also subject to time-window constraints, precedence constraints, and so on. However, little research has been carried out on the ORPP. To minimize the total cost of the ORPP, Wang et al. [37] proposed an iterated greedy (IG) algorithm with several problem-specific heuristics. To speed up the initialization process for the same problem, they employed the extreme gradient boosting method to adaptively select appropriate constructive algorithms [36]. A problem related to the ORPP is the single vehicle pickup and delivery problem with time windows (SVPDPTW), which is an extension of the TSP and a basic version of the pickup and delivery problem (PDP). Hosny and Mumford [17] presented a GA with a duplicate gene encoding to deal with a large number of constraints. To solve the SVPDPTW with capacity constraints, Edelkamp and Gath [9] designed a nested Monte Carlo search with policy adaptation. However, these algorithms cannot satisfy the immediacy requirement of online optimization. In contrast, machine learning (ML) has turned out to be promising for solving combinatorial optimization (CO) problems effectively in short computational time. Bengio et al. [5] investigated the major methods of combining ML with traditional CO algorithms and divided them into three kinds: (a) end-to-end learning methods, which use ML to directly solve the problem; (b) ML-first methods, which apply ML to provide meaningful properties of optimization problems and guide the search direction for CO algorithms; (c) ML-alongside methods, which utilize ML during the iterative process of optimization algorithms. The end-to-end learning methods are suitable for real-time applications, while the latter two are still time-consuming due to the CO algorithms.
Recently, end-to-end machine learning methods have been explored for CO problems, especially the TSP. Vinyals et al. [35] proposed a pointer network model to tackle the Euclidean TSP with supervised learning based on the sequence-to-sequence framework. The encoder and decoder are both constructed from recurrent neural networks, which makes it possible to handle input graphs of different sizes. Nevertheless, this supervised learning model depends strongly on high-quality labels. To overcome this drawback, Bello et al. [4] used a reinforcement learning method to train a similar pointer network with the tour length as the reward signal. To avoid the influence of the input order on the model, Khalil et al. [18] employed a graph neural network to process the input data and combined it with reinforcement learning to address the problem. By modifying the pointer network with an attention mechanism and a reinforcement learning method, Kool et al. [19] obtained solutions that improve over recent heuristics for the TSP. Besides, Ma et al. [22] introduced the graph pointer network trained by reinforcement learning, which performed better than the pointer network [35] but could not dominate the attention model [19].
Most of the studies above assume that all parameters are deterministic. However, uncertainty is ubiquitous and inevitable in real life. Powell [28] published a comprehensive review of stochastic optimization. As mentioned there, stochastic problems can be solved either exactly [15] or approximately [40]. The former usually assumes that the distribution functions are additive, such as the gamma and normal distributions, or can be decomposed into multiple additive functions. The latter includes sampling methods such as Monte Carlo sampling or direct online observations (also called the data-driven approach). However, the Monte Carlo method converges very slowly: the ultimate accuracy cannot improve faster than $O(1/\sqrt{N})$, where $N$ is the number of simulation samples or replications [20]. The computational cost becomes prohibitively high if the problem is complex and requires high accuracy. Some methods have been developed to speed up stochastic optimization. For ranking and selection problems, Chen and Lee [8] proposed an optimal computing budget allocation method, which allocates the samples sequentially to optimize the selection quality under a simulation budget constraint. Besides, Bengio et al. [7] proposed a supervised learning method to find a representative scenario (RS) and transform the stochastic problem into a deterministic one, which could obtain similar solution quality with much less computing time than general algorithms. However, an RS does not always exist, and the generalization ability of the model is sometimes unsatisfactory. Although this method greatly reduces the computational time compared to sampling methods, it sacrifices accuracy to a certain extent.
As for the TSP and PDP, uncertainty mainly lies in travel time, food preparation time, service time, and customer demand. For the dynamic PDP with stochastic food preparation time, Ulmer et al. [33] assumed that the time was gamma distributed and presented a cost function approximation with time buffers to handle the uncertainty. For the stochastic TSP with pickups and deliveries, Elgesem et al. [10] assumed that the travel times were independent and normally distributed, and employed several exact methods based on Monte Carlo simulation to solve the problem. As for the green vehicle routing problem with stochastic customer demands, Niu et al. [27] generated the mean demand according to a discrete uniform distribution and proposed a membrane-inspired multi-objective algorithm to solve the problem. The studies above all assumed that the random variables followed independent and additive distributions. However, this assumption may lose some information about real data distributions. In this paper, we treat the food preparation times as stochastic variables and propose the stochastic ORPP (SORPP). The discrete distribution functions of the food preparation time are predicted by an ML model trained on historical data from Meituan. The functions are very complex: they are long-tailed, multimodal, and not additive. To the best of our knowledge, no other research considers stochastic variables with this kind of distribution.
From the literature review, it can be seen that the existing exact algorithms and meta-heuristics for related problems are inappropriate for the SORPP due to their unacceptable computational time. Although some heuristics can solve related problems quickly, they cannot guarantee the quality of the obtained solutions. Therefore, the core challenge for the SORPP is how to obtain satisfactory solutions within a very short period of time. Hence, we use an end-to-end machine learning method to solve the proposed problem efficiently. As mentioned before, the existing research on end-to-end learning can mainly be classified into two types: supervised learning [35] and reinforcement learning [4,18,19,22]. The former is relatively easy to implement, but strongly depends on the label quality. Given high-quality labels, supervised learning can imitate the "expert experience" well and learn the optimization policy to generate satisfactory solutions. The latter does not require labeling or complex feature engineering, but it is difficult to design appropriate reward/action/state functions. With an approximated policy, the reinforcement learning model may easily fall into local optima. Besides, it usually takes much longer to train than supervised learning. In the real-life situation of the Meituan platform, each rider can only carry a small number of packages, limited by the trunk capacity. If the computational time is not limited, the optimal (or near-optimal) solutions of problems on this scale are easy to find. That is, we can obtain plenty of high-quality labels for the problem. Therefore, we design an end-to-end deep learning model trained by supervised learning to solve the problem. The model is denoted as the Meituan stochastic delivery network (MSDN).
Overall, the major contributions of this paper can be summarized as follows:
1. We propose the stochastic online route-planning problem for the first time, formulated as a two-stage stochastic programming model.
2. We design an end-to-end deep learning method to solve the SORPP. In the encoder, the model produces the embeddings of all input features with specially designed network layers. In the decoder, the permutation sequence is predicted by long short-term memory (LSTM) cells with attention and masking mechanisms.
3. We present problem-specific features to improve the performance of the model.
4. We adopt the IG algorithm based on Monte Carlo sampling to obtain high-quality labels for model training.
5. We conduct extensive experiments on real-world data sets from Meituan. The results show the effectiveness and efficiency of the proposed model.
The remainder of the paper is organized as follows. The next section provides the description and formulation of the SORPP. The subsequent section introduces the IG algorithm for labeling, and the following section presents the details of the proposed MSDN. Computational results and comparisons are reported in the section after that. Finally, the paper ends with conclusions and future work in the last section.

Problem description
A problem instance of the SORPP comprises one rider and $n$ orders, denoted as $W = \{w_1, w_2, \ldots, w_n\}$. Each order specifies a pickup point (the corresponding restaurant) and a delivery point (the corresponding customer). As shown in Fig. 2, the rider starts from the current position and picks up or delivers food according to the planned route, where the food preparation time of unpicked orders is stochastic. At the scheduling time, some of the orders have already been picked up, so only their delivery points are considered. $P = \{1, \ldots, n_1\}$ is the set of pickup points and $D = \{n_1 + 1, \ldots, n_1 + n\}$ is the set of delivery points, where $n_1$ is the number of pickup points. In addition, an order is represented as a point pair consisting of its pickup point and its delivery point, where the pickup entry is set to $-1$ if the pickup point has already been visited.
In Meituan's situation, some basic constraints are given as follows.
1. Precedence constraints. The rider must pick up the food before delivering it.
2. Time-window constraints. The rider can only pick up the food after it has been prepared (hard time window) and should try to deliver it before the promised time, also called the ETA (soft time window). Since the food preparation time is stochastic, all the values sampled from the corresponding probability mass functions (PMFs) should meet the time-window constraints.
The goal of the SORPP is to minimize the expected time cost, denoted as ETC. The problem can be modeled as a two-stage stochastic program with the following notation.

Parameters:
$v$: The speed of the rider.
$n$: The number of orders.
$n_1$: The number of pickup points.
$W$: The set of all orders.
$K$: The set of scenarios.

Decision variables:
$x$, $u$: The first-stage decisions, where $u$ gives the sequence of the points.
$tw$, $to$: The second-stage decisions, i.e., the waiting time and the overtime in each scenario.

The objective function (1) minimizes the expected time cost ETC, which includes the rider's traveling time and the expected waiting time and overtime under stochastic conditions. Constraints (2) and (3) imply that each order can be picked up only once and delivered only once, and that the starting point is selected only once. Constraints (4) and (5) indicate the transformation of $x$ and the sequence of the points $u$. Constraints (6) ensure that the precedence relationships are not violated. Constraints (7)-(9) give the value ranges of the first-stage decision variables. The objective function (10) of the second stage minimizes the expected waiting time and overtime under stochastic conditions. Constraints (11) and (12) state that, in a given scenario, the arrival time at a point equals the leaving time of the previously visited point plus the traveling time between the two. Constraints (13) and (14) state that the leaving time of each point is no smaller than the arrival time and, for pickup points, the preparation time. Constraints (15) and (16) set both the leaving time and the arrival time of the starting point to 0. In addition, Constraints (17) require $tw$ to be no smaller than the difference between the stochastic food preparation time and the arrival time for pickup points. Similarly, Constraints (18) require $to$ to be no smaller than the difference between the leaving time and the ETA for delivery points. Constraints (19) and (20) give the value ranges of $tw$ and $to$ in each scenario.
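The second-stage logic described by Constraints (11)-(18) can be illustrated with a small scenario-based evaluation of a fixed route. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `expected_time_cost`, the dictionary-based inputs, and the equal weighting of scenarios are all hypothetical choices made for illustration.

```python
def expected_time_cost(route, travel, eta, scenarios):
    """Estimate ETC = travel time + E[waiting time + overtime] for a fixed route.

    route     : list of point ids starting with 0 (the rider's position)
    travel    : dict {(a, b): travel time from point a to point b}
    eta       : dict {delivery point: promised delivery time}
    scenarios : list of dicts {pickup point: sampled preparation time};
                scenarios are assumed to be equally likely (an assumption)
    """
    legs = list(zip(route, route[1:]))
    travel_time = sum(travel[a, b] for a, b in legs)

    recourse = []
    for prep in scenarios:
        leave, waiting, overtime = 0.0, 0.0, 0.0   # the start is left at time 0
        for a, b in legs:
            arrive = leave + travel[a, b]
            if b in prep:
                # Pickup point: the rider waits until the food is ready.
                waiting += max(0.0, prep[b] - arrive)
                leave = max(arrive, prep[b])
            else:
                # Delivery point: overtime is the lateness relative to the ETA.
                leave = arrive
                overtime += max(0.0, leave - eta[b])
        recourse.append(waiting + overtime)

    return travel_time + sum(recourse) / len(recourse)
```

For a single order with pickup point 1 and delivery point 2, two scenarios with preparation times 4 and 9 lead to recourse costs of 0 and 5, so the estimate is the travel time plus 2.5.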

The IG algorithm for labeling
As mentioned before, we employ supervised learning to train the model, where the labels are obtained by the IG algorithm. The IG algorithm has a powerful exploitation capability and has been successfully applied to various scheduling problems [25]. As shown in Fig. 3, the main procedure of the IG algorithm includes initialization, destruction, reconstruction, and problem-specific local search.

Initialization
The initial route is generated by the adaptive Nawaz-Enscore-Ham (aNEH) heuristic [25]. The main steps of the aNEH are as follows.
Step 2: Rank the $n$ orders by their ETAs in ascending order, $W = (w_{(1)}, w_{(2)}, \ldots, w_{(n)})$, where $w_{(i)}$ is the order with the $i$th smallest ETA. Let $i = 1$.
Step 3: Insert the pickup point of $w_{(i)}$ into every position of $\Pi_0$. For each such position, insert the delivery point of $w_{(i)}$ into every position after the related pickup point. The permutation with the minimal ETC is retained and replaces $\Pi_0$. The ETC is calculated by Monte Carlo sampling.
In this way, a solution of reasonable quality is generated.

Random destruction
In the destruction phase, $\alpha$ ($\alpha < n$) orders are randomly selected, and their pickup and delivery points are removed from the permutation sequence. The chosen point pairs constitute a list, denoted as $L_S = \{pair_i, i \in [1, \alpha]\}$, and the remaining permutation is denoted as $\Pi_R$.

Greedy reconstruction
In the reconstruction phase, $L_S$ is first shuffled to ensure sufficient randomness. Then, the pickup point (if not $-1$) and the delivery point of each order in $L_S$ are inserted into $\Pi_R$ successively. The partial solution with the minimal ETC is retained greedily.
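The insertion operator shared by the initialization and the reconstruction phase, together with the random destruction, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and `cost` is a caller-supplied stand-in for the Monte Carlo estimate of the ETC.

```python
import random

def insert_order(route, pickup, delivery, cost):
    """Return the cheapest route obtained by inserting one order.

    The delivery point is only tried at positions after its pickup point.
    A pickup of -1 means the food is already on board, so only the
    delivery point is inserted. Position 0 is the rider's start.
    """
    candidates = []
    pickup_slots = [0] if pickup == -1 else range(1, len(route) + 1)
    for i in pickup_slots:
        base = route if pickup == -1 else route[:i] + [pickup] + route[i:]
        for j in range(i + 1, len(base) + 1):
            candidates.append(base[:j] + [delivery] + base[j:])
    return min(candidates, key=cost)

def construct(orders, eta, cost):
    """Initialization: insert orders one by one in ascending ETA order."""
    route = [0]
    for pickup, delivery in sorted(orders, key=lambda o: eta[o[1]]):
        route = insert_order(route, pickup, delivery, cost)
    return route

def destroy(route, orders, alpha, rng):
    """Remove alpha random orders; return (removed pairs, remaining route)."""
    removed = rng.sample(orders, alpha)
    gone = {p for pair in removed for p in pair if p != -1}
    return removed, [p for p in route if p not in gone]

def reconstruct(remaining, removed, cost, rng):
    """Shuffle the removed orders and greedily reinsert them one by one."""
    removed = list(removed)
    rng.shuffle(removed)
    route = list(remaining)
    for pickup, delivery in removed:
        route = insert_order(route, pickup, delivery, cost)
    return route
```

Passing an explicit `random.Random` instance as `rng` keeps a destruction and reconstruction cycle reproducible, which is convenient when labels need to be regenerated.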

Problem-specific local search
To further improve the performance of the algorithm, a problem-specific local search is designed with the following two neighborhood search operators.

Backward search
Find the delivery points with the largest expected overtime and move them backward to an optimal position.

Forward search
Find the points with the most slack time and move them forward to an optimal position.

Acceptance and stopping criteria
To avoid falling into a local minimum, we employ the acceptance criterion of simulated annealing. That is, we accept not only solutions better than the current one, but sometimes also worse solutions, according to an acceptance probability. The probability is $p = e^{-(E'_{TC} - E^{best}_{TC})/T}$, where $T > 0$ is the current temperature, $E'_{TC}$ is the objective function value of a certain solution, and $E^{best}_{TC}$ is the objective function value of the best solution obtained so far. $T$ is initialized with the initial temperature $T_0$ and is updated as $T_{g+1} = c \times T_g$ at iteration $g$, where $c \in (0, 1)$ is the cooling rate.
The algorithm terminates when one of the following stopping criteria is met: the maximum number of iterations $g_{max}$ has been reached, or the best solution has not improved for $t$ consecutive iterations.
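The acceptance rule and the geometric cooling schedule can be written in a few lines. The sketch below assumes the acceptance probability and temperature update stated above; the function names and the `rng` parameter are hypothetical.

```python
import math
import random

def accept(candidate_cost, best_cost, temperature, rng=random):
    """Accept improvements always; accept a worse solution with
    probability exp(-(candidate_cost - best_cost) / temperature)."""
    if candidate_cost <= best_cost:
        return True
    p = math.exp(-(candidate_cost - best_cost) / temperature)
    return rng.random() < p

def temperatures(t0, c, g_max):
    """Yield T_0, c*T_0, c^2*T_0, ... for at most g_max iterations."""
    t = t0
    for _ in range(g_max):
        yield t
        t *= c
```

As the temperature decreases, the probability of accepting a worse solution shrinks toward zero, so the search gradually becomes purely greedy.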

MSDN for SORPP
We design an end-to-end deep learning model to solve the SORPP in an online fashion. The MSDN is composed of an encoder and a decoder, which is shown in Fig. 4. The encoder is used to produce embeddings of all input features, and the decoder can produce the sequence of the route points based on the output of the encoder.

Feature engineering
Feature engineering, a crucial part of machine learning, extracts features from raw data and transforms them into an applicable format. Suitable features can reduce the difficulty of modeling and improve the quality of the output [41]. However, designing effective features is challenging, as it usually depends on expert knowledge about the optimization problem and statistical analysis of large quantities of data.
In our paper, basic features of the SORPP include the rider and order information as follows.

Rider-related features
The location (latitude and longitude) of starting point, the average speed during the last month, the number of carried orders, and the number of pickup points and delivery points.

Order-related features
The ETAs, the locations of the pickup and delivery points, respectively, and the PMFs of the food preparation time.
Although these features are enough to define an instance of the SORPP, they cannot fully reveal the patterns of optimal solutions. For example, an order that is closer to the rider or more urgent is likely to be visited with higher priority. Therefore, we design problem-specific features to describe the urgency of orders, the position relationship (including distance and time) between orders, as well as the position relationship between orders and the rider, as shown in Fig. 5. Besides, the specific statistics of each PMF comprise the corresponding mean, sum, median, maximum, minimum, and standard deviation. By introducing these problem-specific features, the performance of the model is further improved.
We pre-process the data by cleaning missing data and replacing anomalous data with average values. Besides, all the continuous features are normalized first according to the average values and standard deviations of the training set.

Encoder
The encoder is used to transform the input into an intermediate embedding. As shown in Fig. 6, we deal with different types of features separately and then concatenate them to a final embedding. The encoder component includes three parts: rider embedding of rider features, order embedding of order features, and location embedding of the location matrix which consists of all the location information extracted from the rider and all orders.

Location embedding
Because the latitudes and longitudes are nonnumeric information, we adapt the GPS embedding (or cell coding) [21] to represent them. The main idea of GPS embedding is to divide the area into small grids with corresponding geographical information according to certain rules. Compared to geohash method [26] or one-hot representation [6], the GPS embedding can depict GPS information more accurately based on distance weights. Therefore, it can avoid the sudden change of neighborhood location and make the embedding smoother. The main steps of the GPS embedding are as follows.
Step 1: Map the latitudes and longitudes into grids of 300 m × 300 m.
Step 2: Retain the grids that are not out of vocabulary. Since some grids contain one or multiple GPS points while others may contain none, we only retain the grids with multiple GPS points according to a threshold (set as 300) to guarantee the continuity of a route.
Step 3: Transform the retained grids into a $d_1$-dimensional embedding, denoted as $Em(*)$.
Step 4: Calculate the embedding of each location based on bilinear interpolation of the 4 nearest grid vertices. For example, as shown in Fig. 7, $G$ is a certain location surrounded by 4 vertices, denoted as $G_{ij}$, $i, j \in \{1, 2\}$. The embedding of $G$ is obtained by interpolating the embeddings $Em(G_{ij})$ of these vertices.
To better reflect the location relationship between the points, we extract the location features from all the points and form a global location embedding of size $(1 + n + n_1, d_1)$ by the above method, denoted as $Em_{loc}$. Then, a factorization-machine (FM) [30] layer is employed to fuse the features. The FM finds the correlation of features by

$$y_{FM} = \langle w, x \rangle + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle V_i, V_j \rangle x_i x_j, \quad (22)$$

where $w \in R^d$ and $V_i \in R^k$ are weight vectors, $d$ is the number of features, $k$ is the given embedding length, and $x_i$ is the value of the $i$th feature. The addition unit $\langle w, x \rangle$ reflects the importance of order-1 features, and the inner product units represent the impact of order-2 feature interactions. In the location embedding, $x = Em_{loc}$, $d = n + n_1$, and $k = d_1$. The FM acts on the first axis and outputs a vector of size $d_1$, denoted as $L_{loc}$. In this way, all the locations are mapped into a reasonable embedding that reflects the spatial distance relationship between them.
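Step 4 of the GPS embedding can be illustrated with standard bilinear interpolation. The sketch below is an assumption-laden example: the function name, the use of normalized offsets `x` and `y` inside the grid cell, and the assignment of the indices $(i, j)$ to the four vertices are hypothetical, since the original interpolation formula is not reproduced here.

```python
def bilinear_embedding(x, y, vertex_emb):
    """Interpolate the embedding of a location inside one grid cell.

    x, y       : offsets of the location inside the cell, each in [0, 1]
    vertex_emb : dict {(i, j): embedding vector}, i, j in {1, 2}, where
                 (1, 1) is assumed to be the vertex at x = 0, y = 0 and
                 (2, 2) the vertex at x = 1, y = 1
    """
    weights = {
        (1, 1): (1 - x) * (1 - y),
        (2, 1): x * (1 - y),
        (1, 2): (1 - x) * y,
        (2, 2): x * y,
    }
    dim = len(vertex_emb[(1, 1)])
    return [
        sum(weights[v] * vertex_emb[v][k] for v in weights)
        for k in range(dim)
    ]
```

Because the four weights always sum to 1 and vary continuously with the location, two nearby points on either side of a grid boundary receive similar embeddings, which is the smoothness property the GPS embedding aims for.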

Rider embedding
Rider features include one continuous feature (rider speed) and five discrete features. The latitude and longitude of the starting point are processed by the method above. To avoid information loss during training, we increase the dimension of the other numeric features by transforming them into a $d_2$-dimensional learnable embedding, denoted as $e_r$. The deep FM [14] is adopted to further process the rider features thanks to its effectiveness in combining features. Deep FM consists of an FM and a deep neural network (DNN). The former learns the linear and pairwise interactions between features, and the latter learns high-order feature interactions. The FM and DNN components share the same input features. Set $x^{(0)} = [Em(0), e_r]$ as the initial input. $y_{FM}$ is calculated as in Eq. (22), and each DNN layer is calculated as

$$x^{(i+1)} = \sigma(W^{(i)} x^{(i)} + b^{(i)}),$$

where $i$ is the layer depth and $\sigma$ is an activation function, set as the ReLU function [11] in this paper. $x^{(i)}$, $W^{(i)}$, and $b^{(i)}$ are the output, model weight, and bias of the $i$th layer, respectively. Then, a dense vector is generated and fed into the activation function as follows:

$$y_{DNN} = \sigma(W^{(|H|+1)} x^{(|H|)} + b^{(|H|+1)}),$$

where $|H|$ is the number of hidden layers. The output rider embedding is denoted as $L_{rider} = [y_{FM}, y_{DNN}]$.
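The FM component used in both the location and rider embeddings combines a linear term with pairwise interactions. The sketch below shows the scalar-feature form of a factorization machine, computing the pairwise term with the well-known linear-time identity; the function names are hypothetical, and the way the paper applies the layer to embedding matrices is not reproduced here.

```python
def fm_output(w, V, x):
    """y = <w, x> + sum_{i<j} <V_i, V_j> x_i x_j.

    The pairwise sum is computed per latent factor f as
    0.5 * ((sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2),
    which costs O(k d) instead of O(k d^2).
    """
    d, k = len(x), len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(d))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(d))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_output_naive(w, V, x):
    """Direct double loop over feature pairs, used only as a cross-check."""
    d = len(x)
    y = sum(wi * xi for wi, xi in zip(w, x))
    for i in range(d):
        for j in range(i + 1, d):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            y += dot * x[i] * x[j]
    return y
```

Both functions return the same value; the first is the form usually implemented in practice because it avoids the explicit loop over feature pairs.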

Order embedding
The order embedding, which is based on the attention mechanism, is the most important part of the encoder component, since attention enables it to extract important information effectively. The order embedding contains the basic information of all the order points, which affects the generation of the permutation sequence.
Similarly, the order locations are processed by GPS embedding and turned into a $d_1$-dimensional embedding for each point. The other numeric features are transformed into a $d_3$-dimensional embedding for each pickup point or delivery point. The initial order embeddings are generated by concatenating the two.
It is typical to generate a large number of scenarios to approximately describe complex PMFs. However, the computational time of this method is unaffordable for an online system. Since the PMFs can be regarded as a 2-D image-like matrix and the convolutional neural network (CNN) is widely employed for image classification and compression [32], we adopt a CNN with one hidden layer to pre-process them and obtain global information about the food preparation time. The filter size is 3 × 3, and the output vector is concatenated with the embedding of the pickup point features, while the embeddings of the delivery points remain unchanged. In this way, the features of pickup and delivery points are turned into learnable embeddings, respectively, and a feed-forward network (FFN) [31] is employed to match their sizes.
After the above pre-processing, we employ an encoder similar to that of the attention model [19] to extract valuable information from the order features, owing to its effectiveness on the TSP. The point embeddings are updated by a multi-head self-attention (MHSA) layer and a node-wise FFN. Each layer adds a skip connection [16] and batch normalization (BN) [2]. The specific structure can be found in [19]. Then, a graph embedding is calculated as the mean of the final node embeddings, denoted as $L_{order}$.
The final embedding in the encoder consists of the graph embedding of order features, the rider embedding, and the location embedding as follows. It is used to compose the input of the decoder together with the point embeddings:

$$L = [L_{loc}, L_{rider}, L_{order}]. \quad (29)$$

Decoder
The decoder generates the permutation sequence based on an attention mechanism similar to that of pointer networks [35] and a problem-specific masking mechanism. The former allows the decoder to quickly judge the importance of each point, and the latter guarantees the feasibility of the generated solutions. The decoder consists of an LSTM cell and a softmax layer. At each timestep $t \in \{1, \ldots, n_1 + n\}$, the LSTM cell outputs a point vector $g_{o,t}$ based on the point embeddings from the encoder. The implementation of the LSTM is given in Eqs. (30)-(35), where $g_{f,t}$, $g_{i,t}$, $g_{o,t}$, $g_{c',t}$, $g_{c,t}$, and $H_t$ are the outputs of the forget gate, input gate, exposure gate, new memory cell state, final memory cell state, and hidden state, respectively. $W_{(*)}$ and $b_{(*)}$ are the weights and biases of each gate or state, and are shared among all cells at each timestep. Besides, $h_{c,t}$ is the decoder context at time $t$, which is composed of the point embedding and the final embedding from the encoder. Since the route starts at the location of the rider, $h_{c,t}$ is set according to Eq. (36).
To ensure the feasibility of the solution, the candidate points must satisfy the constraints of the SORPP. To meet the precedence constraint, a delivery point is masked (its pointer value is set to $u_{i,t} = -\infty$) if the corresponding food has not yet been picked up by the rider. Likewise, to guarantee that each point appears only once, the points that have already been visited are also masked. The pointer vector $u_{i,t}$ of the remaining points is computed with the trainable matrices $W_1$ and $W_2$. Then, a distribution over the next candidate points is generated by passing the vector into the softmax layer, and the point with the largest probability is chosen as the next point.
An example is shown in Fig. 8, which explains how a feasible permutation sequence is generated. Two orders are considered and represented by (1, 3) and (2, 4), respectively. The decoder takes as input the graph embedding and point embeddings. Note that points 3 and 4 are masked at $t = 1$ because the corresponding orders have not been picked up, and points 1, 2, and 3 are masked at $t = 4$ because they have been visited before. The permutation sequence $\Pi = \{0, 2, 1, 3, 4\}$ is constructed at the end.
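The masking mechanism can be illustrated independently of the neural network. In the sketch below, `score_fn` is a hypothetical stand-in for the pointer scores produced by the LSTM and attention layers, and `pickup_of` is an assumed mapping from each delivery point to its pickup point; only the masking and greedy selection logic follows the description above.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over point scores, with masked points forced to probability 0."""
    logits = {p: (-math.inf if p in mask else s) for p, s in scores.items()}
    top = max(logits.values())
    exps = {p: math.exp(v - top) for p, v in logits.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}

def greedy_decode(score_fn, pickup_of):
    """Build a feasible route by picking the most probable unmasked point.

    score_fn  : callable(route) -> dict {point: raw score} (stand-in model)
    pickup_of : dict {delivery point: pickup point, or -1 if already picked up}
    """
    points = set(pickup_of) | {p for p in pickup_of.values() if p != -1}
    route = [0]
    while len(route) <= len(points):
        visited = set(route)
        mask = set()
        for p in points:
            needs_pickup = p in pickup_of and pickup_of[p] != -1
            if p in visited or (needs_pickup and pickup_of[p] not in visited):
                mask.add(p)
        probs = masked_softmax(score_fn(route), mask)
        route.append(max(probs, key=probs.get))
    return route
```

With the two orders (1, 3) and (2, 4) and fixed scores that prefer point 2, then 1, then 3, the decoder produces the route 0, 2, 1, 3, 4, and points 3 and 4 cannot be selected at the first step.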

Loss function
Since the decoder can be regarded as a multi-class classification model, the cross-entropy loss function, which is widely used in multi-class classification problems, is employed in our paper. In the loss, $\pi_L$ is the permutation sequence of the label, $\pi_i$ is the $i$th point in $\pi_L$, $p_i$ is the predicted probability of $\pi_i$ at $t = i$, and $\theta$ represents all the parameters of the model. In this way, the distance between the predicted route and the label can be estimated. Given a training pair $(N, \pi_L)$, the parameters of the MSDN are learnt by minimizing the loss function over the training set.
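A sequence-level cross-entropy of this kind sums the negative log-probability that the decoder assigns to the labelled point at each step. The sketch below assumes an unnormalized sum over steps and a plain dictionary of probabilities per step; both are illustrative assumptions, since the exact form of the loss is not reproduced here.

```python
import math

def sequence_cross_entropy(step_probs, label):
    """Cross-entropy between decoder outputs and a labelled route.

    step_probs : list where step_probs[t][p] is the predicted probability
                 of choosing point p at decoding step t
    label      : labelled permutation without the leading 0
    """
    return -sum(math.log(step_probs[t][p]) for t, p in enumerate(label))
```

The loss is zero only when the decoder puts probability 1 on every labelled point, and it grows as probability mass moves to other points.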

Experimental settings
In this section, numerical experiments are conducted to evaluate the performance of the MSDN. The data sets are sampled from real historical data across China on the Meituan platform. Two kinds of training and validation sets are generated, denoted as TS1 and TS2, respectively, where TS1 follows the distribution of the real data and TS2 is equalized according to $n$. Besides, the test set is shared by all models and comparison algorithms, which will be introduced later. The data distribution is shown in Fig. 9, where the numbers around the pie chart represent the order number $n$ of a route. The PMFs are obtained from extensive historical data and are assumed to be known in this paper. Figure 10 shows an example of a PMF and its corresponding cumulative distribution function (CDF). All the parameters are set empirically.
Fig. 9 The distribution of different data sets
For the SORPP, the performance of the algorithms is evaluated in terms of computational time and solution quality. In this paper, we employ the average CPU time to measure the computational time and the following two metrics to evaluate the solution quality.

Route consistency (RC)
The RC is defined as $\mathrm{RC} = S_m / len_s$, where len_s is the sequence length and S_m is the length of the longest common prefix of two permutation sequences. In this paper, 0 is excluded, since all the sequences start from 0. For example, suppose there are two solutions A = {0, 1, 2, 3, 4} and B = {0, 1, 3, 4, 2}. Then S_m is 1 (the length of {1}), and len_s is 4 after excluding 0. Therefore, the RC of A and B equals 1/4 = 0.25. If the first nonzero points of two solutions differ, RC = 0 regardless of how similar the subsequent points are. The RC is higher if two solutions have a longer common prefix. It can be used to measure the learning ability of models by evaluating the similarity between the solutions produced by the model and the labels.
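The RC computation described above can be sketched in a few lines. This is a minimal illustration, assuming (from the worked example) that RC is the ratio of the common-prefix length to the sequence length after the leading 0 is dropped; the function name `route_consistency` is hypothetical.

```python
def route_consistency(pred, label):
    """RC = (longest common prefix length) / (sequence length), leading 0 excluded."""
    if len(pred) != len(label):
        raise ValueError("both sequences must have the same length")
    a, b = pred[1:], label[1:]          # drop the shared starting point 0
    if not a:
        raise ValueError("sequences must contain at least one point besides 0")
    prefix = 0
    for x, y in zip(a, b):
        if x != y:                      # the prefix ends at the first mismatch
            break
        prefix += 1
    return prefix / len(a)


# Worked example from the text: A and B share only the prefix {1}.
rc_ab = route_consistency([0, 1, 2, 3, 4], [0, 1, 3, 4, 2])  # 1/4 = 0.25
```

Because only the prefix counts, two routes that differ in their first nonzero point score 0 even if all later points coincide.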

Relative percentage deviation (RPD)
In the RPD, alg is the solution obtained by a certain algorithm (or model), and lab is the label obtained by the IG algorithm in this paper. Unlike the RC, the RPD is used to evaluate the solution quality of the algorithms and the model. A smaller RPD indicates better performance.
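As an illustration, the sketch below uses the common definition of the relative percentage deviation, (alg - lab) / lab × 100. This form is an assumption, as is the reading of alg and lab as objective values (for instance, the ETC of each solution); the function name `rpd` is hypothetical.

```python
def rpd(alg_value, lab_value):
    """Relative percentage deviation of an algorithm's objective from the label's.

    Assumes the common form 100 * (alg - lab) / lab with lab > 0.
    """
    if lab_value <= 0:
        raise ValueError("the label objective value must be positive")
    return 100.0 * (alg_value - lab_value) / lab_value


# A solution whose cost is 110 against a label cost of 100 deviates by 10 %.
example = rpd(110.0, 100.0)
```

Under this definition, an RPD of 0 means the algorithm matches the label, and a negative value means it found a better solution than the IG label.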

Effectiveness of special designs
In this section, we verify the effectiveness of several special designs, including data equalization, the design of the CNN layer, and the problem-specific features. The results are shown in Tables 2 and 3. From Tables 2 and 3, it can be seen that MSDN_CNN_SF_TS2 performs best on average, which indicates the superiority of this model. Besides, MSDN_CNN_SF_TS2 is better than MSDN_nCNN_SF_TS2 on most instances, which shows the effectiveness of the CNN layer in extracting information from the PMFs. As for MSDN_CNN_SF_TS1, the model trained on TS1 is superior on instances with small n, especially when n = 2, but is inferior on instances with large n. Similarly, although MSDN_CNN_SF_TS2 performs better on average when the training data are equalized, it loses accuracy on instances with small n. By comparing these two models, we can conclude that no single training set yields the best performance on every instance, which lends support to the no free lunch theorem [38] and shows the important influence of the data distribution on model performance. Besides, MSDN_CNN_BF_TS2 is clearly worse than the others. The main reason is that the problem-specific features provide more information for the model to learn the implicit relationship.
In general, we can draw the following conclusions: the CNN layer is effective in extracting the information of the PMFs; the problem-specific features significantly improve the model performance; and the distribution of the training data affects the model performance on instances of different scales.

Comparisons with other algorithms
To the best of our knowledge, few works have been carried out on the SORPP. Therefore, we adopt some algorithms commonly used for the TSP or PDP as comparison algorithms. Due to the immediacy requirement of the online problem, only heuristics and meta-heuristics are adopted, as follows. The first four heuristics are constructive algorithms, and the last three are (meta-)heuristics with an iterative process, called iterative algorithms for convenience.

Random generation (RG)
The route is constructed randomly. If the solution is infeasible, it is repaired by swapping the positions of the illegal pickup point and its corresponding delivery point.
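The random construction with swap repair can be sketched as follows. The point numbering is an assumption made for illustration: 0 is the rider position, points 1..n are pickups, and point k + n is the delivery of order k. Only the pickup-before-delivery precedence is repaired here; any other constraints of the SORPP are outside this sketch. The function name `random_generation` is hypothetical.

```python
import random


def random_generation(n, rng=None):
    """Random route over n orders, repaired so each pickup precedes its delivery."""
    rng = rng or random.Random()
    points = list(range(1, 2 * n + 1))
    rng.shuffle(points)
    route = [0] + points                      # every route starts at the rider (0)
    pos = {p: i for i, p in enumerate(route)}
    for order in range(1, n + 1):
        pickup, delivery = order, order + n   # assumed numbering convention
        if pos[delivery] < pos[pickup]:       # illegal: delivered before pickup
            i, j = pos[delivery], pos[pickup]
            route[i], route[j] = route[j], route[i]
            pos[pickup], pos[delivery] = i, j
    return route


route = random_generation(3, random.Random(0))
```

Swapping the two points of one order leaves all other points in place, so a single pass over the orders is enough to restore precedence feasibility.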

Earliest ETA first (EEF)
The order with the earliest ETA is inserted into the permutation sequence first. Its pickup and delivery points are assigned in order.
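A minimal sketch of this rule is given below. It assumes that orders are numbered 1..n, that point k is the pickup and point k + n the delivery of order k, and that each order's pickup is immediately followed by its delivery; the function name `earliest_eta_first` and the ETA values are hypothetical.

```python
def earliest_eta_first(etas):
    """Build a route by serving orders in ascending ETA.

    etas[k - 1] is the ETA of order k; the rider position is point 0.
    """
    n = len(etas)
    by_eta = sorted(range(1, n + 1), key=lambda k: etas[k - 1])
    route = [0]
    for k in by_eta:
        route.extend([k, k + n])   # pickup first, then the matching delivery
    return route


# Order 2 has the earlier ETA, so it is served before order 1.
eef_route = earliest_eta_first([30, 10])
```

Routes built this way are always precedence-feasible, since every delivery directly follows its own pickup.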

Most urgent first (MUF)
The most urgent order will be inserted into the permutation sequence first. Its pickup and delivery points are assigned consecutively. The urgency is defined similarly as in "Feature engineering".

Nearest first (NF)
The points are sorted by their distance to the rider's starting position, and the nearest point is assigned first. If the solution is infeasible, the illegal delivery point is inserted right after its corresponding pickup point.
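The NF rule and its repair step can be sketched as follows. The sketch assumes Euclidean distance on 2D coordinates and the numbering convention in which point k is the pickup and point k + n the delivery of order k; both are illustrative assumptions, as is the function name `nearest_first`.

```python
import math


def nearest_first(rider, pickups, deliveries):
    """Sort points by distance to the rider; move early deliveries after their pickup."""
    n = len(pickups)
    coords = {k + 1: pickups[k] for k in range(n)}
    coords.update({k + 1 + n: deliveries[k] for k in range(n)})
    ordered = sorted(coords, key=lambda p: math.dist(rider, coords[p]))
    route, deferred = [0], set()
    for p in ordered:
        if p > n and (p - n) not in route:
            deferred.add(p)            # delivery reached before its pickup
            continue
        route.append(p)
        if p <= n and (p + n) in deferred:
            route.append(p + n)        # insert it right after the pickup
    return route


nf_route = nearest_first((0, 0), [(5, 0), (1, 0)], [(2, 0), (3, 0)])
```

In this example, the delivery of order 1 (point 3) is nearer to the rider than its pickup (point 1), so it is deferred and placed immediately after point 1.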

IG with random initialization (IG_RG)
The solutions are initialized by RG and then improved by the destruction and reconstruction operators described in "The IG algorithm for labeling". Due to the time limit, these operators are performed only once.

IG with NF initialization (IG_NF)
It is the same as IG_RG except that solutions are initialized by NF.
We use Monte Carlo sampling to handle the uncertainty in these algorithms. To determine the number of samples, we compare the computational time and the fluctuation of the ETC under different numbers of samples, s ∈ {1, 10, 100, 1000, 10,000, 100,000}. Denoting the evaluation of a label under a certain s as a case, there are 6 × 2600 = 15,600 cases in total. To obtain credible results, each case is run 20 times to calculate the average computational time. Besides, the six cases of instance 1 under different s are each run 1000 times to evaluate the fluctuation.
The results are shown in Table 4 and Fig. 11. Table 4 shows the average computational time of all the cases grouped by n. The computational time grows multiplicatively as s increases, which shows the positive correlation between them. Figure 11 is a boxplot of the ETC of a certain permutation sequence (the label of instance 1). From the figure, it can be seen that as s decreases, the fluctuation increases rapidly and the deviation from the expectation grows quickly. Balancing precision against elapsed time, we set the number of samples to 10,000 for the compared algorithms.
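The trade-off between the number of samples s and the precision of the estimate can be illustrated with a small Monte Carlo sketch. Everything in it is hypothetical: the PMF of the preparation time, the toy waiting-time cost, and the function name `mc_expected_cost` are stand-ins chosen for illustration and are not the ETC objective of the SORPP.

```python
import random


def mc_expected_cost(values, probs, cost, s, rng=None):
    """Estimate E[cost(X)] for X drawn from a discrete PMF using s samples."""
    rng = rng or random.Random()
    draws = rng.choices(values, weights=probs, k=s)
    return sum(cost(x) for x in draws) / s


# Hypothetical PMF of a food preparation time (minutes) and a toy cost:
# the rider arrives at minute 8 and waits if the food is not ready yet.
prep_values = [5, 10, 15]
prep_probs = [0.2, 0.5, 0.3]


def waiting(prep):
    return max(0, prep - 8)


# Exact expectation, available here because the toy PMF is tiny.
exact = sum(p * waiting(v) for v, p in zip(prep_values, prep_probs))
estimate = mc_expected_cost(prep_values, prep_probs, waiting,
                            s=10_000, rng=random.Random(0))
```

The standard error of such an estimate shrinks in proportion to 1/sqrt(s) while the evaluation time grows linearly in s, which is the trade-off behind choosing a moderate number of samples.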
The comparison results of the model and the constructive algorithms are listed in Table 5, where MSDN denotes MSDN_CNN_SF_TS2 for convenience. It can be seen that the computational time of the model is similar to that of the constructive algorithms. This is because most of the time of these methods is spent on evaluating solutions by sampling, while constructing or predicting the sequence usually takes less than 1 ms. Although the time consumed is similar, the model performs much better than the constructive algorithms. Table 6 shows the comparison results of the MSDN and the iterative algorithms. It can be seen that the average RPDs of the MSDN are superior to those of IG_RG and IG_NF on all instances and are better than those of aNEH on some instances. For instances with n = 2 or 3, all these methods perform well, because it is easy for iterative algorithms or models to find the optimal solution. As the number of points in a route increases, the problem becomes harder to solve, which enlarges the average RPD. When n = 8, the model performs slightly worse than aNEH, mainly because of the lack of similar data in the training set. Although the average RPDs of the MSDN are not always the best, they are satisfactory given a computational time that is much shorter than that of the compared iterative algorithms, almost one-fiftieth of theirs. Therefore, the MSDN is more suitable for online use than the compared iterative algorithms.
In conclusion, the MSDN is more effective than the constructive algorithms and more efficient than the iterative algorithms in solving the proposed problem.

Conclusions
In this paper, we addressed, for the first time, the stochastic online route-planning problem with the minimization of the expected time cost, and established a two-stage stochastic programming model. It is a complex problem with an immediacy requirement, uncertainty, precedence constraints, time-window constraints, and so on. To solve the problem in a very short time, we designed an end-to-end deep learning method and employed different network layers to handle problem-specific features with different formats. In line with the circumstances of real-world delivery, the model was trained by supervised learning to learn the optimization policy under complex constraints, where the labels were obtained by an iterated greedy algorithm. The experimental results showed that the MSDN is effective and efficient in solving the addressed problem on real-world data sets. Therefore, this research work can provide an effective intelligent technique for complex online food delivery systems. In our future work, we will extend the problem to larger scales and propose models with better generalization ability. In addition, it would be interesting to solve the online route-planning problem with other types of uncertainty and to generalize the problem to more objectives, including other robustness criteria.