Reinforcement learning for the traveling salesman problem with refueling

The traveling salesman problem (TSP) is one of the best-known combinatorial optimization problems. Many methods derived from the TSP have been applied to study autonomous vehicle route planning with fuel constraints. Nevertheless, less attention has been paid to reinforcement learning (RL) as a potential method to solve refueling problems. This paper employs RL to solve the traveling salesman problem with refueling (TSPWR). The proposed technique comprises a model (actions, states, reinforcements) and the RL-TSPWR algorithm. Focus is given to the analysis of RL parameters and to the influence of refueling on learning routes that optimize fuel cost. Two RL algorithms, Q-learning and SARSA, are compared. In addition, RL parameter estimation is performed by Response Surface Methodology, Analysis of Variance and the Tukey test. The proposed method achieves the best solution in 15 out of 16 case studies.


Introduction
The traveling salesman problem (TSP) is one of the best-known combinatorial optimization problems and is often considered in autonomous vehicle route planning [11,19,31,48,50,65,80]. In a TSP, the sequence of autonomous agent movements should optimize a route between a set of nodes [3,16,32,33,55]. Moreover, the agent must visit each node (city) only once, with the initial and final positions (goal) of the route being the same. In this aspect, TSP generalizations encompass various aspects of mobile robotics, such as vehicle restrictions [48], dynamic environments [65] and multiple vehicles [38,80].
An important research area for autonomous vehicle route planning considers fuel constraints [35,78]. In such cases, the challenge is to define a route that ensures the vehicle completes the entire trip without running out of fuel. Following this same line, refueling problems seek to optimize fuel purchase expenditure on road routes [27,60,71].
Vehicle refueling problems have been extensively investigated [25,27,36,42,43,56,57,62,69-71]. One line of study is the fixed route vehicle refueling problem (FRVRP), where the goal is to select the refueling points on a fixed route [27,43,73]. For example, [43] have presented a linear time greedy algorithm for the FRVRP. There are also applications of the FRVRP to real problems. The authors of [60] have developed a fixed-route refueling model for a case study of a Brazilian carrier; [73] have analyzed the influence of fuel weight, congestion, and acceleration on refueling policy optimization. Other works seek to analyze the refueling policy on variable routes [27,71]. In this sense, it is worth highlighting the applications based on the TSP [67,71,82]. Suzuki [71] has presented a model that addresses the traveling salesman problem with time windows and refueling. The goal is to define a route that minimizes fuel consumption while respecting the time window for each customer [71]. Other applications of the TSP with refueling are unmanned aerial vehicles [67] and geosynchronous satellites [82]. It is important to point out that refueling problems are usually classified into four groups [27]: refueling with fixed route, refueling with variable route, TSP with uniform cost at each point and TSP with fuel cost varying across localities. In this sense, the last class can be applied to treat refueling in road networks in Brazil, where fuel price variations are found in each city according to data from the Brazilian National Petroleum Agency (ANP) (http://anp.gov.br/preco/).
In the literature, several methods have already been applied to solve refueling problems [35,56,67,72,77,82]. Levy et al. [35] have adopted the Variable Neighborhood Descent and Variable Neighborhood Search (VNS) heuristics for the vehicle routing problem with fuel constraints. The work [77] also adopts VNS to optimize a fleet with alternative fuels (gasoline or diesel vehicles). Zhang et al. [82] have used Ant Colony Optimization to solve the multiple geosynchronous satellite refueling problem. The author of [72] discusses Simulated Annealing and Tabu Search methods for the pollution routing problem (minimizing fuel consumption or pollutant emission). Other papers have presented new algorithms to optimize refueling problems [56,67]. Although reinforcement learning has proven to be a powerful tool for combinatorial optimization problems, it has received less attention as a means of solving refueling problems.
Reinforcement learning (RL) is an artificial intelligence technique with relevant applications in robotics [8,15,28-30,37], path planning [20,39,47,59,75,76] and combinatorial optimization problems [4,7,13,14,21,44,53,54,64,79], such as the TSP [1,2,18,22,41,45,52,66,81]. In RL, an agent learns from rewards and penalties while interacting with an environment [68]. One of the main topics of investigation in RL is the estimation of learning parameters, such as the learning rate (α), the discount factor (γ), the ε-greedy policy and the reinforcement function [6,17,23,24,40,54,63]. In fact, parameter definition can directly influence good route learning [5,12,52,54]. Bal and Mahalik [5] have shown how to estimate the parameters α and γ by trial and error for a simulated navigation environment. In Ottoni et al. [52], the authors have presented a systematic approach for RL parameter estimation using Response Surface Methodology (RSM). In [54], a complete factorial experiment and the Scott-Knott test have been used to find the best combination of factors (ε-greedy and reinforcement function) for the Sequential Ordering Problem. The paper [12], in turn, has proposed a method based on evolutionary computation to seek the best reinforcement function and Deep Learning network architecture for an autonomous navigation problem. Yet, no rigorous method for estimating the parameters for refueling problems has been found.
To overcome the lack of a parameter estimation framework for refueling problems, this work introduces a statistical methodology for tuning RL parameters employed on the traveling salesman problem with refueling. More specifically, we have analyzed how the RL parameters and the characteristics of refueling problems influence the learning of routes that optimize fuel cost. We have proposed an RL structure to solve the traveling salesman problem with refueling (TSPWR), through a model (actions, states, reinforcements) and the RL-TSPWR algorithm. Instances with uniform and non-uniform cost routes were worked out based on ANP data. The experiments involve simulations with two traditional RL algorithms: Q-learning [74] and SARSA [68]. In addition, RL parameter estimation is performed using statistical methods: RSM [51], Analysis of Variance (ANOVA) [49] and the Tukey test [49]. Best solutions have been found in 15 out of 16 analyzed numerical experiments.
The remainder of this paper is organized as follows. The second and the third sections present basic theoretical concepts of the RL and TSPWR, respectively. Then, the fourth section describes the proposed technique. The results are given in the fifth section and concluding remarks are delivered in the sixth section.

Reinforcement learning
Reinforcement learning (RL) is a machine-learning technique based on Markov decision processes (MDPs) [26,61,68,74]. MDPs are structured from finite sets of actions, states and reinforcements, and a state transition model. The learner agent interacts with the environment in a sequence of steps in time (t): (i) the agent receives a representation of the environment (state); (ii) selects and executes an action; (iii) receives the reinforcement signal; (iv) updates the learning matrix; (v) observes the new state of the environment [68].
In RL, the goal is to learn a policy (π) that maximizes numerical reinforcement [68]. A policy defines the agent behavior, mapping states into actions. The ε-greedy method is an example of an action selection policy adopted in RL [68]. In this method, the parameter ε (0 < ε < 1) is defined and the policy π(s) is applied according to the following equation [68]:

π(s) = a* with probability 1 − ε, or a random action a_a with probability ε,   (1)

where π(s) is the decision policy for the current state s, a* is the best estimated action for the state s at the current time and a_a is a random action selected with probability ε. SARSA [68] and Q-learning [74] are common RL algorithms. These methods are based on temporal difference (TD) learning, that is, updates do not need to refer to real-time intervals, but to successive decision-making steps. SARSA (see Algorithm 1) is an on-policy TD control algorithm, which depends on the next action (a_{t+1}) defined by the policy π(s) to update the learning matrix, according to the following equation:

Q_{t+1}(s, a) = Q_t(s, a) + α [r(s, a) + γ Q_t(s', a') − Q_t(s, a)],   (2)

where s is a state and a is an action at the current instant (t), respectively; s' is the state and a' is the action at the next instant (t + 1); Q_t(s, a) is the value at time t in the Q matrix for the state × action pair (s, a); Q_{t+1}(s, a) is the update of the learning matrix in t + 1 by executing the action a in state s; r(s, a) is the reinforcement for the execution of the pair (s, a); α is the learning rate; γ is the discount factor.
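The ε-greedy selection rule and the SARSA update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the tabular representation of Q as a dictionary keyed by (state, action) pairs is an assumption of the sketch.

```python
import random

def epsilon_greedy(Q, s, actions, eps, rng=random):
    """With probability eps pick a random action; otherwise pick the
    greedy action a* = argmax_a Q(s, a)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy TD update: Q(s,a) += alpha*(r + gamma*Q(s',a') - Q(s,a)),
    using the next action a' actually chosen by the policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

With ε = 0 the selection is purely greedy, which is useful for checking a learned policy after training.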
The parameters learning rate (α) and discount factor (γ) are adopted in several algorithms [68]. These parameters can be set between 0 and 1. The learning rate controls the speed with which new information overwrites old information, and the discount factor describes an agent's preference between current and future rewards. If γ ≈ 1, then the future rewards are highly significant. Otherwise, if γ ≈ 0, the current rewards are more relevant at the instant t than the subsequent (discounted) rewards [61,68].

1 Set the parameters: α, γ and ε
2 For each pair (s, a), initialise the matrix Q(s, a) = 0
3 Observe the state s
4 Select the action a using the ε-greedy method
5 repeat
6   Take the action a
7   Receive the immediate reward r(s, a)
8   Observe the new state s'
9   Select the new action a' using the ε-greedy method
10  Update Q(s, a) with Eq. (2)
11  s = s'; a = a'
12 until the stopping criterion is satisfied;
Algorithm 1: SARSA.

On the other hand, Q-learning (see Algorithm 2) is an off-policy TD control algorithm [61,68,74]. In that sense, it does not depend on the next action (a_{t+1}) to perform the update at the instant t, according to the following equation:

Q_{t+1}(s, a) = Q_t(s, a) + α [r(s, a) + γ max_{a'} Q_t(s', a') − Q_t(s, a)],   (3)

where max_{a'} Q(s', a') is the utility of s', that is, the maximum value in the line of Q referring to the new state.

1 Set the parameters: α, γ and ε
2 For each pair (s, a), initialise the matrix Q(s, a) = 0
3 repeat
4   Observe the state s
5   Select the action a using the ε-greedy method
6   Take the action a
7   Receive the immediate reward r(s, a)
8   Observe the new state s'
9   Update Q(s, a) with Eq. (3)
10  s = s'
11 until the stopping criterion is satisfied;
Algorithm 2: Q-learning.
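The contrast with SARSA is the bootstrap target: the off-policy update of Eq. (3) uses max_{a'} Q(s', a') rather than the action actually chosen next. A minimal sketch, with the nested-dictionary layout of Q being an assumption for illustration:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy TD update of Eq. (3):
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict mapping state -> {action: value}."""
    target = r + gamma * max(Q[s_next].values())  # utility of s'
    Q[s][a] += alpha * (target - Q[s][a])
```

Because the target takes the maximum over the next state's row, exploration (the ε-greedy behavior policy) does not bias the learned values, which is the sense in which Q-learning is off-policy.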

Traveling salesman problem with refueling
The problem considered in this work is path planning in a road network for autonomous vehicles. A mobile agent must travel through a set of cities and decide where to refuel so as to minimize the final route cost. For this, the traveling salesman problem with refueling (TSPWR) is adopted in two forms: uniform and non-uniform cost [27]. In the first case, a vehicle must visit a set of locations and return to the starting city at the end of the route, and the fuel price does not vary between route stations. In the second, the problem with non-uniform cost, there are different selling prices for fuel in the cities.
The following restrictions are considered: vehicle fuel tank capacity, minimum amount of fuel for refueling and guarantee of completing the entire route [60]. In addition, this work considers the possibility of using a tow truck in case fuel runs out between two locations, which incurs an additional cost.

Problem formulation
A mathematical formulation for the proposed problem is based on [10,60,70] and contains two decision variables: u_ij and z_ij. The variable u_ij assumes 1 if the arc (i, j) makes up the solution and 0 otherwise. Also, z_ij is a decision variable that equals 1 only if the tow truck is used between the locations i and j. This formulation is presented in the following equations:

min Σ_{j∈N} c_j l_j + Σ_{i∈N} Σ_{j∈N} g_ij z_ij   (4)

subject to:

Σ_{i∈N, i≠j} u_ij = 1, ∀ j ∈ N   (5)
Σ_{j∈N, j≠i} u_ij = 1, ∀ i ∈ N   (6)
f_j + l_j ≤ L_max, ∀ j ∈ N   (7)
f_j + l_j ≥ L_max w_j, ∀ j ∈ N   (8)
l_j ≥ l_min w_j, ∀ j ∈ N   (9)
u_ij, w_j, z_ij ∈ {0, 1}, ∀ i, j ∈ N   (10)
f_j, l_j ≥ 0, ∀ j ∈ N   (11)
sub-route elimination constraints over the set V,   (12)

where N is the set of nodes. In addition, the refueling cost in city j is c_j and l_j is the amount of fuel replenished in j. The tow truck cost on an arc (i, j) is represented by g_ij. Thus, Eq. (4) is the objective function, wherein the total route cost, given by the sum of refueling and tow truck costs, should be minimized. Equations (5) and (6) ensure that each location is visited only once. Furthermore, Eq. (7) ensures that the amount of fuel in the tank (f_j + l_j) does not exceed the maximum capacity (L_max), where f_j is the reservoir level at the time of arrival in city j. Equation (8) ensures that the vehicle fills the tank to the maximum level when refueling, where w_j = 1 if refueling occurs at location j. Besides that, Eq. (9) restricts the minimum quantity l_min for refueling. In addition, Eqs. (10) and (11) ensure that the variables u_ij, w_j and z_ij are binary and the other variables are non-negative, respectively. Finally, in Eq. (12), the set V represents any set of constraints that eliminate the formation of sub-routes.
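The objective function of Eq. (4) can be illustrated by evaluating the total cost of a fixed candidate route under simple tank dynamics. This is a hedged sketch: the fill-to-full refueling rule, the function and variable names, and the omission of the no-refueling-at-origin restriction are assumptions for illustration, not the paper's exact model.

```python
def route_cost(route, dist, price, consumption, L_max, level_ref, g_tow, f0):
    """Total cost of a route: sum of refueling costs c_j * l_j plus tow
    truck costs g_ij * z_ij, simulating the tank along the arcs.
    route: list of city indices (closed tour); dist: {(i, j): km};
    price: {j: cost per litre}; consumption: km per litre; f0: initial fuel."""
    level, cost = f0, 0.0
    for i, j in zip(route, route[1:]):
        level -= dist[(i, j)] / consumption   # fuel burned on arc (i, j)
        if level < 0:                         # ran out of fuel: z_ij = 1
            cost += g_tow                     # tow truck cost g_ij
            level = 0.0
        if level < level_ref:                 # refuel up to L_max at node j
            litres = L_max - level            # l_j, filling the tank
            cost += price[j] * litres         # c_j * l_j
            level = L_max
    return cost
```

For a two-city tour of 70 km legs with 7 km/l consumption, a 150 l tank and a 20 l starting level, the vehicle refuels once at the intermediate city, and the returned value is the objective of Eq. (4) for that tour.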

Instances
In this paper, four instances are proposed: Bahia30D, Minas24D, Minas30D and Minas57D. Each instance involves a set of cities from one of two Brazilian states (Minas Gerais and Bahia). The data are composed of the Euclidean distances between localities, calculated from the coordinates (latitude and longitude). In addition, the average diesel cost (D) in each city was defined from ANP website data obtained in December 2018. Then, the cities are described in the following format: "city (average diesel price in Reais (R$), the Brazilian currency)":

Methodology
The methodology proposed in this paper consists of four steps. First, the RL model is structured in states, actions and reinforcement functions. After that, the algorithm for solving the TSPWR with reinforcement learning (RL-TSPWR) is proposed. The following steps present the experiments and methods for tuning RL parameters. Response surface models were used to optimize α and γ, and the best combinations of the reinforcement function and ε are obtained by means of ANOVA and the Tukey test.

Reinforcement learning model
The model aims to enable the agent to learn path planning that minimizes refueling cost and distance. For this, the RL model defined for the TSPWR resolution consists of a set of states, actions and reinforcements. The wording adopted is based on previous studies that applied RL to the TSP solution: [9,41,52]. The proposed structure is as follows:

- States: locations (nodes) that the agent (traveling salesman) must visit to perform the route. In this sense, the number of states varies according to the instance nodes.
- Actions: the intention to move to another location (state) of the problem. In addition, the refueling action is performed whenever the vehicle arrives at a location with less than 25% of the maximum tank capacity (0.25 × L_max).
- Reinforcements: functions were defined to associate a cost with the movement between two localities, the refueling cost in each city and the tow truck cost. Five different types of reinforcements (R1 to R5) have been proposed, combining the following terms: the distance d_ij between cities i and j; the refueling cost c_j in node j; and the tow truck cost g_ij on an arc (i, j), where z_ij is a decision variable that equals 1 only if the tow truck is used between the locations i and j. Thus, the higher the total cost of moving and refueling, the more negative the penalty for route formation.

RL-TSPWR algorithm
This section presents the RL-TSPWR algorithm, which applies RL (Q-learning version) to the TSPWR solution (see Algorithm 3). The variables of the proposed algorithm are associated with the mathematical formulation of Eqs. (4)-(12).
In this paper, the simulated vehicle has small-truck features for all experiments, and the TSPWR constants are: maximum fuel tank capacity of 150 l (L_max = 150); average diesel consumption of 7 km/l; tow truck cost fixed at R$ 200.00 (g_ij = 200); and reference level for refueling of 25% of the maximum tank capacity (0.25 × L_max).

The RL-TSPWR starts by initializing the RL parameters, the learning matrix, the TSPWR variables/constants and the initial state (s_0) (lines 1-4). Then, the execution loops start (lines 5 and 6). Subsequently, the destination city is selected, and the action is performed (lines 7 and 8). After that, the new fuel tank level is calculated from the distance between cities (i and j) and the average consumption (km/l). In line 10, the calculation of the tow truck cost begins. If the tank level is less than zero, then the vehicle reached the destination city (j) without fuel. In that case, it is necessary to reset the tank level, assign the tow truck cost value (g_ij), and set the decision variable z_ij to 1. Otherwise, the tow truck cost for the arc (i, j) is zero (z_ij = 0). In line 17, the computation of the refueling cost is initialized. If the fuel level at the destination node (level) is less than the reference level (level_ref) and the city is not the initial one, then the vehicle must be refueled. In this way, the amount of litres (l_j) and the cost of refueling (c_j l_j) are calculated (lines 21 and 22). In addition, the tank level is updated to the maximum vehicle level (L_max). If the vehicle does not need to refuel, this cost is zero (c_j l_j = 0). Then, the total cost of the route is updated, based on the sum of the tow truck cost and refueling cost (line 27). The distance traveled on the route is also updated (line 28). Subsequently, the reinforcement is calculated, as in Eq. (13). Finally, the RL operations are carried out: observe the new state, and update the Q matrix and the current state (lines 30 and 31). Algorithm 3: RL-TSPWR Algorithm.
This is an application of the Q-learning algorithm for solving the TSPWR.
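The inner episode of Algorithm 3 can be sketched in condensed form. This is an illustrative reading, not the paper's code: the use of the negative step cost as the reinforcement signal (one plausible reading of the penalty description, not the exact R1-R5 definitions), the dictionary data structures and all function names are assumptions of the sketch.

```python
import random

def rl_tspwr_episode(Q, cities, dist, price, alpha, gamma, eps,
                     L_max=150.0, cons=7.0, g_tow=200.0):
    """One episode: epsilon-greedy city choice, tank update, tow truck
    penalty, refueling below 25% of L_max, and the Q-learning update."""
    s = cities[0]                        # initial state s0 (start city)
    level, level_ref = L_max, 0.25 * L_max
    unvisited, total = set(cities[1:]), 0.0
    while unvisited:
        cand = sorted(unvisited)
        if random.random() < eps:        # epsilon-greedy destination choice
            j = random.choice(cand)
        else:
            j = max(cand, key=lambda c: Q[(s, c)])
        level -= dist[(s, j)] / cons     # fuel burned on arc (s, j)
        step = 0.0
        if level < 0:                    # arrived empty: tow truck, z_ij = 1
            step += g_tow
            level = 0.0
        if level < level_ref:            # refuel to full below the reference
            step += price[j] * (L_max - level)
            level = L_max
        total += step
        unvisited.discard(j)
        nxt = sorted(unvisited) or [cities[0]]
        best_next = max(Q[(j, c)] for c in nxt)
        # Q-learning update with reinforcement r = -step (cost as penalty)
        Q[(s, j)] += alpha * (-step + gamma * best_next - Q[(s, j)])
        s = j
    if level - dist[(s, cities[0])] / cons < 0:  # return leg to the start
        total += g_tow                           # (no refueling at the origin)
    return total
```

Repeating this episode for E episodes yields the E × N learning iterations discussed below for the algorithm's complexity.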
Algorithm 3 executes its instructions through two repeat loops. The first repetition structure is controlled by the number of episodes (stopping criterion). The second loop, in turn, depends on the number of locations in the instance (iterations for the formation of one route). Thus, the complexity of the RL-TSPWR algorithm can be represented by the number of learning iterations (nr) needed to provide a solution:

nr = E × N,   (18)

where E is the number of episodes and N is the number of locations in the instance. Table 1 exemplifies the RL-TSPWR complexity (using 10,000 episodes) and shows the efficiency of the proposed structure. For example, the Minas57D (N = 57) instance has 7.110 × 10^74 possible solutions. In contrast, the algorithm presents a solution after a sequence of 570,000 learning iterations and only 10,000 explored routes. It is worth mentioning that the number of episodes is also a parameter that can be investigated.

Tuning of RL parameters: α and γ
The purpose of this section is to present the methodology for tuning the RL parameters (α and γ) for the TSPWR. For this, experiments with different combinations of these parameters are proposed. In addition, mathematical modeling is adopted via response surface methodology to estimate α and γ. In this stage, the experimental methodology was based on recent works: [52,54].

RL parameters experiments: α and γ
Simulations were performed using Matlab and comprised 16 groups of experiments: 2 (algorithms) × 4 (instances) × 2 (problem types). Each combination of parameters was simulated in 3 runs (repetitions) with 1000 episodes. A run is an independent repetition, that is, the learning is accumulated over the thousand episodes and always reset when starting a run. The episode performance measures are the total refueling cost and the distance of the route. In addition, the ε-greedy parameter was set to ε = 0.01 and the reinforcement function adopted was R1 (Eq. 13).

RSM
The response surface methodology (RSM) involves a set of statistical techniques for analyzing optimization problems. The structure of the second-order RSM model is presented [51] as follows:

y = β_0 + β_1 x_1 + β_2 x_2 + β_11 x_1² + β_22 x_2² + β_12 x_1 x_2 + e,   (19)

where y is the response variable, x_1 and x_2 are the independent variables, the β_n are the coefficients and the effect of the error (residual) is represented by e.
Ottoni et al. [52] have presented the mathematical modeling using RSM for the estimation of the α and γ parameters. The structure proposed by [52] is given in the following equation:

ŷ = β_0 + β_1 α + β_2 γ + β_11 α² + β_22 γ² + β_12 αγ,   (20)

where α and γ are the independent variables of the model and ŷ is the predicted response. In this work, 16 RSM models were adjusted using the R software [34,58], according to Table 2. These models aim to estimate α and γ to minimize the total cost of a route. Data referring to the lowest cost on the route (refueling + tow truck) have been used with each combination of α and γ.
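Fitting the second-order model above and locating its stationary point amounts to a least-squares fit followed by solving a 2 × 2 linear system from the gradient of the fitted surface. The sketch below uses synthetic data with a known optimum at (α, γ) = (0.6, 0.8); the data, names and the choice of a regular grid are assumptions for illustration.

```python
import numpy as np

def fit_rsm(alpha, gamma, y):
    """Least-squares fit of the second-order response surface
    y = b0 + b1*a + b2*g + b11*a^2 + b22*g^2 + b12*a*g."""
    X = np.column_stack([np.ones_like(alpha), alpha, gamma,
                         alpha**2, gamma**2, alpha * gamma])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [b0, b1, b2, b11, b22, b12]

def stationary_point(beta):
    """Solve grad(y_hat) = 0: [[2*b11, b12], [b12, 2*b22]] x = -[b1, b2]."""
    b0, b1, b2, b11, b22, b12 = beta
    A = np.array([[2 * b11, b12], [b12, 2 * b22]])
    return np.linalg.solve(A, -np.array([b1, b2]))

# Synthetic cost surface with a known minimum at alpha=0.6, gamma=0.8.
a, g = np.meshgrid(np.linspace(0.1, 0.9, 5), np.linspace(0.1, 0.9, 5))
a, g = a.ravel(), g.ravel()
y = 3 + (a - 0.6) ** 2 + 2 * (g - 0.8) ** 2
opt = stationary_point(fit_rsm(a, g, y))
```

Because the synthetic surface is exactly quadratic, the recovered stationary point matches the true optimum; with real simulation data the fit also requires the diagnostic checks (residual normality, R², coefficient significance) discussed in the results.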

Tuning of RL parameters: reinforcement function and ε
The second stage of experiments aims to analyze the influence of the reinforcement functions and the ε parameter on TSPWR learning. For that, simulations with different combinations of these parameters are proposed. ANOVA and the Tukey test were adopted to identify the best combinations of factors for the refueling problem. Besides that, the parameters (α and γ) estimated via RSM were used in the experiments in this section. The experimental and analysis methodology has been based on [54].

RL parameters experiments: reinforcement function and ε
In this step, the objective was to conduct experiments with two learning specifications: the reinforcement function and the ε parameter (ε-greedy policy). Simulations comprised 240 groups of experiments: 2 (algorithms) × 4 (instances) × 2 (problem types) × 5 (reinforcement functions) × 3 (ε values). In this respect, a total of 15 parameter combinations (R and ε) have been tested for each model (Table 2). Each experiment was simulated in 10 runs (repetitions) with 10,000 episodes. The episode performance measure is the total refueling cost of the route.
The results of these experiments were used as data for the modeling presented in the next section.

Factorial design
In this step, a factorial design was developed to estimate the factor effects (R × ε) in the TSPWR simulations. The factors analyzed are the reinforcement function (five levels) and the ε parameter (three levels) [49,54]:

y_jkl = μ + η_j + θ_k + (ηθ)_jk + ξ_jkl,   (21)

where μ is the overall mean effect, η_j is the effect of the j-th level of the reinforcement functions (j = 1, 2, 3, 4, 5), θ_k is the effect of the k-th level of the ε-greedy policy (k = 1, 2, 3), (ηθ)_jk is the effect of the interaction between η_j and θ_k, and ξ_jkl is a random error component (l = 1 to 10). An analysis of variance test was conducted to check whether there is a difference between the treatment means. The level of significance adopted was 5%. When ANOVA indicates that there is a difference between the levels of the model, the Tukey test of multiple comparisons [49] is applied.
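The two-factor fixed-effects model above decomposes the total variability into main effects, interaction and error. A sketch of the corresponding sums of squares and F statistics for a balanced design follows; the array layout data[j, k, l] (replicate l of cell (j, k)) and the synthetic data in the usage check are assumptions of the sketch.

```python
import numpy as np

def two_way_anova(data):
    """Sums of squares and F statistics for the balanced two-factor model
    y_jkl = mu + eta_j + theta_k + (eta*theta)_jk + xi_jkl.
    data has shape (J, K, L): J levels of A, K levels of B, L replicates."""
    J, K, L = data.shape
    grand = data.mean()
    mA = data.mean(axis=(1, 2))          # level means of factor A (eta_j)
    mB = data.mean(axis=(0, 2))          # level means of factor B (theta_k)
    mAB = data.mean(axis=2)              # cell means
    ss = {
        'A':  K * L * ((mA - grand) ** 2).sum(),
        'B':  J * L * ((mB - grand) ** 2).sum(),
        'AB': L * ((mAB - mA[:, None] - mB[None, :] + grand) ** 2).sum(),
        'E':  ((data - mAB[:, :, None]) ** 2).sum(),
    }
    mse = ss['E'] / (J * K * (L - 1))    # error mean square
    f = {
        'A':  (ss['A'] / (J - 1)) / mse,
        'B':  (ss['B'] / (K - 1)) / mse,
        'AB': (ss['AB'] / ((J - 1) * (K - 1))) / mse,
    }
    return ss, f
```

The sums of squares add up exactly to the total sum of squares in a balanced design, which is a useful sanity check; comparing the F statistics against the F distribution at the 5% level then reproduces the ANOVA decisions described above.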
The objective was to evaluate the performance of the parameter adjustment for the TSPWR, in comparison with the use of values adopted in the literature for RL simulations of the classic TSP (or similar). These combinations of parameters were simulated in three repetitions with 20,000 episodes for each group of experiments.

Tuning of RL parameters results: α and γ
The results of fitting the RSM models are described below. The analysis is based on the work of [52].

Adjusted models
The analysis of the adjusted models considers normality of the residuals, the coefficient of multiple determination (R²), the adjusted coefficient of multiple determination (R²_a) and the significance of the coefficients.
The first test determines whether the model residuals follow a normal distribution. Adopting the Kolmogorov-Smirnov (KS) test [46], it was observed that for all 16 models the hypothesis of residual normality (p_KS > 0.05) was accepted, according to Table 3. Then, the values of R² and R²_a were analyzed. The closer these coefficients are to 1, the better the fit of the model to the sample. Table 3 also shows the calculated values of R² and R²_a. Table 4 shows the adjusted coefficients for each model. In this sense, the significance test of the individual coefficients points out that the coefficients are highly significant in all models (p < 0.001).
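The fit measures R² and R²_a can be computed directly from observed and predicted responses. A small sketch, with function names assumed for illustration (p counts all model coefficients, including the intercept):

```python
def r_squared(y, y_hat):
    """Coefficient of multiple determination: 1 - SS_res / SS_tot."""
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    mean = sum(y) / len(y)
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """Adjusted R^2, penalizing for the p coefficients of the model:
    1 - (1 - R^2) * (n - 1) / (n - p)."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_hat)) * (n - 1) / (n - p)
```

The adjusted version only increases when an added coefficient improves the fit more than chance would, which is why both measures are reported for the 16 models.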

Stationary points
The analysis of stationary points allows us to verify the values that optimize the predicted response in the adjusted RSM models. In this respect, the estimation of the parameters α and γ amounts to a second optimization problem, minimizing the predicted cost. Table 5 shows the stationary points obtained using the R software [34,58].

Tuning of RL parameters results: reinforcement function and ε
In this section, we present the experimental results for tuning the reinforcement function and the parameter ε. Initially, some graphs are shown for the interaction between the factors. The interaction plots demonstrate the influence of these parameters (R and ε) on the TSPWR optimization process. After that, the results of ANOVA and the Tukey test for the full factorial experiment are presented.

Interaction plots analysis
Interaction plots are important tools for analyzing the influence of factors on the response variable. In this work, these graphs were used in a preliminary analysis of the factorial design results to visualize the effects of the ε-greedy policy and the reinforcement function on the TSPWR solution.
To illustrate the graphical analysis, Figs. 1 and 2 present interaction plots for models 1 and 13, respectively. It is possible to observe that the combinations R1 × 0.01 and R5 × 0.01 tend to minimize the response for the situation of Bahia30D/Non-Uniform/Q-learning (Model 1). On the other hand, in Fig. 2, referring to Minas57D/Non-Uniform/Q-learning, the best results are obtained by adopting the reward function R3 with ε = 0.01. In this respect, the simple change of instance (Bahia30D to Minas57D) directly influenced the combination performance (R × ε) for the TSPWR. Thus, the present analysis reinforces the need to adjust the ε-greedy policy and reinforcement function according to the simulated data.

Factorial design results
Analysis of the adjusted models of the full factorial experiments was carried out in three phases: (i) residual normality analysis, (ii) analysis of variance and (iii) multiple comparison test. Adopting the KS test [46], the assumption of residual normality was confirmed for all models (p_KS > 0.05). The ANOVA test was applied to check whether there is a difference between the configuration performances (R × ε) in the TSPWR optimization. The results of the analysis of variance showed that for the 16 models the factor interaction is highly significant (p < 0.001). That is, there is a statistical difference in RL performance for TSPWR resolution according to the reinforcement function and the ε parameter selected. In this sense, the Tukey multiple comparison test was then performed to identify the best combinations (R × ε) for each factorial design model. Table 6 presents the results of the Tukey test and the residual normality tests (p_KS).
In Table 6, one can identify the settings for each model (R × ε) which achieved the best results ("Tukey Test" column). Moreover, as in all situations the Tukey test indicated more than one combination, a tiebreaker criterion was used: the lowest mean solution (cost) per combination. Thus, Table 6 also presents the best configuration for each model ("Best" column) and the respective mean solution.
For example, take model 1 (Bahia30D, Q-learning, Non-Uniform). In this case, the Tukey test indicated four combinations with good performances: R1 × 0.01, R1 × 0.05, R5 × 0.01 and R5 × 0.05. Furthermore, the configuration R5 × 0.01 showed the lowest mean solution among those indicated by the multiple comparison test. On the other hand, observing model 16 (Minas57D, SARSA and Uniform), three other combinations (R3 × 0.01; R3 × 0.05; R3 × 0.10) were indicated by the Tukey test. Thus, Table 6 reveals that, depending on the simulated situation, it may be interesting to adopt different combinations of the reinforcement function and ε parameter for TSPWR optimization.
Further exploring the "Tukey Test" column in Table 6, it is important to highlight that R5 × 0.01 is the combination that appeared most often among the indicated settings (in 15 of the 16 models). This shows the relevance of reinforcement functions that have the distance between the nodes (d_ij) as a term. In the "Best" column, Table 6 presents R5 × 0.01 as the most suitable combination (4 times). The next best configurations were R3 × 0.01, R3 × 0.05 and R5 × 0.05 (with three indications each). That is, for none of the models were the reinforcement functions R2 or R4, or the parameter ε = 0.10, among the best settings for the experiments. It is also important to highlight the differences in reinforcement function performance according to the instance adopted. For example, for the Minas24D instance, in all models the best configuration ("Best" column in Table 6) contains the term R5. However, this is not repeated for the Minas57D instance, where the indicated reinforcement function was R3. Thus, one hypothesis is that the difference in the number of instance nodes directly influenced the reinforcement function performance.

RL-TSPWR parameters
In this section, the final estimated parameters for the TSPWR instances are presented. Table 7 shows the best parameters (lowest cost in Reais, the Brazilian currency) for each of the 16 situations (4 instances × 2 problems × 2 algorithms).
When observing the ε-greedy policy, the value of ε = 0.01 achieved the best results in most cases (10 times). On the other hand, for the learning rate and discount factor parameters, it is possible to define tuning ranges from the values estimated in Table 7.

Comparison with literature parameters
In this section, the results of the parameters adjusted in this paper (see Table 7) for the TSPWR are compared with the adoption of fixed parameters (α and γ) from the literature [18,41,45,66], which refer to studies that applied RL in simulations of the classic TSP (or similar). Table 8 shows the best solutions found (cost in Reais, the Brazilian currency) in this phase.
The proposed technique achieved the best results in 15 out of 16 groups of experiments, according to Table 8. This shows the capacity of the proposed methodology to tune parameters suitable for the TSPWR. In addition, it reveals the importance of performing parameter adjustment according to the conditions of the simulation (instance, algorithm and problem).
The first important aspect of this work is the TSP approach in conjunction with the refueling problem. Generally, the TSP is applied to minimize the distance on the route, as in [2,18,52]. However, less attention has been paid in the literature to the TSP with refueling [27,71].
Another relevant point of this proposal is its application to variable routes. In the literature, when refueling problems are observed specifically, many works adopt only a fixed route, as in [43,60]. In fact, applying the refueling problem on variable routes is much more complex than on fixed routes [27]. It is also worth noting that only the work of [60] has likewise considered data from Brazilian road networks in simulations. In this regard, the developed instances (Bahia30D, Minas24D, Minas30D and Minas57D) will be made available in a public database format: TSPWR-Library.

Table 9 compares this proposal with different works in the literature: I [27], II [43], III [71], IV [60], V [18], VI [2], VII [52] and VIII [54].

In addition, the proposed TSPWR modeling (Sect. 3) innovates by considering the possibility of using a tow truck if the fuel runs out between two locations. The proposed application of RL to the TSPWR is another important aspect of this paper. For this, the RL model was structured in states, actions and reward functions, considering the TSPWR characteristics. In addition, the algorithm (RL-TSPWR) for the application of RL to the TSPWR was proposed. In the literature, studies that addressed the refueling problem used other methods, such as VNS [77], Ant Colony Optimization [82] and Tabu Search [72].
We have avoided comparing RL techniques with other meta-heuristics in TSPWR resolution, since the RL methods have been carefully adjusted for application to the proposed refueling instances, while other meta-heuristics from the literature have not so far received the same adjustments. For example, to simulate a local search algorithm such as VNS [77], it would be necessary to study the best initial solution and which neighborhood structures would be adequate to generate good results for the problem in question. Likewise, an implementation of Genetic Algorithms would require the definition of evolutionary parameters (selection, reproduction and mutation) suitable for application to the proposed TSPWR instances.
To exemplify, simulations were carried out using the VNS meta-heuristic to solve the TSPWR instances. The initial solution was defined as an ordered sequence of cities, and the neighborhood structure was based on random changes in the visit order of the nodes. Under this setup, the VNS meta-heuristic achieved worse performance on all four instances: Bahia30D (4424.2), Minas24D (2972.4), Minas30D (3388.8) and Minas57D (8470.0). It is emphasized, however, that VNS is a local search algorithm that would probably perform better with tuning of the initial solution. In this respect, not requiring an initial solution is an important advantage of RL methods.
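The VNS setup described above (initial solution as the ordered city sequence, neighborhoods based on random swaps of the visit order) can be sketched as follows. This is an illustrative baseline, not the exact implementation used in the experiments: the shaking scheme (k random swaps in the k-th neighborhood) and the iteration budget are assumptions.

```python
import random

def vns_tsp(dist, k_max=3, iters=2000):
    """Basic VNS sketch for a TSP tour: start from the ordered city
    sequence and shake with k random position swaps in neighborhood k."""
    n = len(dist)

    def tour_cost(tour):
        # Cost of the closed tour, including the return to the start.
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    best = list(range(n))          # initial solution: ordered sequence
    best_cost = tour_cost(best)
    for _ in range(iters):
        k = 1
        while k <= k_max:
            cand = best[:]
            for _ in range(k):     # shaking: k random swaps (start fixed)
                i, j = random.sample(range(1, n), 2)
                cand[i], cand[j] = cand[j], cand[i]
            c = tour_cost(cand)
            if c < best_cost:      # improvement: move and restart from N_1
                best, best_cost, k = cand, c, 1
            else:                  # no improvement: widen the neighborhood
                k += 1
    return best, best_cost
```

Because only improving moves are accepted, the quality of the final tour depends strongly on the initial solution and on the chosen neighborhood structures, which is precisely the tuning burden discussed above.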
Finally, we highlight the use of statistical methods (RSM, ANOVA and Tukey Test) in the RL parameter tuning process. In comparison with other works [2,18,52,54], only this proposal adjusted all four parameters: reinforcement function, ε, α and γ.
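The F-statistic underlying the ANOVA comparison can be computed directly. A minimal sketch follows, where each group holds replicate tour costs obtained at one parameter level; the actual design in the paper crosses four factors (reinforcement function, ε, α and γ), whereas this one-way version compares levels of a single factor.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic: between-group mean square divided by
    within-group mean square, for k groups of replicate observations."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))
```

A large F-value indicates that the mean costs differ across parameter levels more than the within-level noise would explain, after which a post-hoc test such as Tukey's identifies which levels differ.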

Contributions of this paper
Based on the comparison with other works in the literature, the main contributions of this paper are:

1. A Reinforcement Learning approach to the solution of refueling problems.
2. Proposal of the RL-TSPWR algorithm.
3. A statistical methodology for tuning four RL parameters (reinforcement function, ε, α and γ), uniting concepts presented in [52] and [54].
4. A new mathematical formulation for refueling problems using tow truck cost, variable routes and non-uniform cost.
5. Development of instances (TSPWR-Library) with fuel cost data for Brazilian cities.

Conclusion
This paper has applied Reinforcement Learning to the Traveling Salesman Problem With Refueling. The contributions of this paper relative to the recent literature in the field can be summarized as: (i) a proposed TSPWR problem formulation; (ii) an algorithm for applying RL to the TSPWR resolution; (iii) development of instances based on real data from the ANP; (iv) experiments under uniform and non-uniform cost conditions; (v) tuning of RL parameters applied to the TSPWR using statistical methods.

Parameters estimated with statistical methods achieved the best solution in 15 out of 16 experimental groups. These results hold for both algorithms (Q-learning and SARSA) and for simulations with uniform and non-uniform fuel prices in each location. In addition, using ANOVA and the Tukey test it was possible to find the best combination of reinforcement function and ε-greedy policy for each instance. It is worth mentioning that the reinforcement functions obtained different performances according to the data analyzed. Nevertheless, in all cases the adjusted reinforcement function includes the distance between nodes (d_ij) term. Analyzing the ε-greedy policy, the value ε = 0.01 reached the best solutions in most cases.

In future works, experiments with more instances and vehicle types are expected. New instances based on the TSPLIB library should be investigated, and other factors, such as fuel type and vehicle model, should be analyzed. Moreover, simulations with other meta-heuristics on the TSPWR instances should be investigated; in this aspect, the computational complexity of the methods should be analyzed, and the convergence issue should also be discussed.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.