An integrated local-search/set-partitioning refinement heuristic for the Capacitated Vehicle Routing Problem

In this paper, an effective heuristic algorithm for large-scale instances of the Capacitated Vehicle Routing Problem is proposed. The technique consists in a local search method entangled with a restricted Set Partitioning problem optimization. Helsgaun’s LKH-3 algorithm has been used for the local search phase, with a number of implementation improvements. The restricted Set Partitioning formulation is solved by means of an exact commercial Integer Liner Programming solver. The resulting algorithm is able to consistently improve the solutions obtained by a state-of-the-art heuristic from the literature, as well as some of the best-know solutions maintained by the CVRPLIB website.


Introduction
Firstly introduced by Dantzig and Ramser [11], Vehicle Routing Problems (VRPs) are a class of problems calling for a minimum-cost set of vehicle routes to serve a given set of customers with known demands.
The Capacitated Vehicle Routing Problem (CVRP) is one of the most studied VRP versions, in which the transportation request consists of the distribution of goods from a single depot to a set of customers using homogeneous vehicles with a limited capacity. In the symmetric case, it can be defined on a complete undirected graph G = (V , E) with edge costs c e 's and a special depot node d. Each customer node i ∈ N = V \ {d} is characterized by its demand q i ≥ 0 which represents the amount of goods requested, while each vehicle route must start and finish at d and has to visit a set of customers whose total demand does not exceed a given capacity C. The overall number of vehicles to be used is often fixed in advance.
Historically, many mathematical formulations have been proposed for this problem [27,42]. Particularly relevant for our work is the so-called Set-Partitioning (SP) formulation, common to many other VRP variants. In the SP formulation, the objective is to find the best combination of feasible routes that partitions the customer nodes of the graph, minimizing the overall cost, i.e.: where Ω is the set of feasible routes for the CVRP, c p is the cost associated to each route p ∈ Ω, Ω i ⊂ Ω is the subset of routes that visit the customer i ∈ N , k is the required number of routes, and θ p is a binary variable which is 1 if the route p is in the optimal solution, 0 otherwise. An important aspect of the SP formulation is its generality, as it easily extends to all VRP variants where the additional constraints only affect the feasibility of the routes, hence they are implicitly represented by the route set Ω. However, a main drawback is represented by the cardinality of Ω, which grows exponentially with the number of customers. To tackle this issue, only a subset of potentially-relevant routes is explicitly generated, and optimization techniques like Column Generation [12,15] or Branch and Price [18,32] are used. Within these schemes, a Restricted SP (RSP) formulation is iteratively solved, containing only a subset of routes.
Although several advanced mathematical programming decomposition algorithms have been proposed in the last few decades, only relatively small instances-containing only few hundred customers-have been solved to optimality [42]. Problems encountered in real-life scenarios are often substantially larger, thus efficient heuristic algorithms are the only option available to obtain good-quality solutions within acceptable computing times.
The aim of our paper is to design a powerful (yet time consuming) refinement heuristic which is able to improve top-quality solutions. Thus, our method is meant to be used on top of a state-of-the-art heuristic, more than to replace it. This is very much is the spirit of other refinement heuristics from the literature, whose quality is certified by the capability of improving state-of-the-art solutions in a final post-processing step.
The paper is organized as follows. Previous literature on CVRP heuristics is sketched in Sect. 2. In Sect. 3, the comprehensive strategy of our algorithm is described, along with the modifications and improvements applied. Extensive computational results are reported in Sect. 4, showing that our method is able to consistently improve the solutions obtained by a state-of-the-art heuristic from the literature, as well as some of the best-know solutions maintained in the CVRPLIB website [31]. Some conclusions are finally drawn is Sect. 5.

Previous work
A brief outline of the CVRP heuristics that are most relevant for our work follows.
Helsgaun's [21,22] heuristic, LKH-3 (whose code can be found in the dedicated website [19]), is a penalty-based extension of the famous Lin and Kerninghan [28] heuristic (LK), able to tackle many VRP variants. Although less efficient with respect to other state-of-the-art CVRP heuristics, LKH-3 (from now on, just LKH) plays a prominent role in our work in that it is the building block of our local-search phase, so we next give a brief description of this method.
Originally designed for the Traveling Salesperson Problem (TSP), the LKH algorithm is based on the concept of r -Opt moves and r -optimality. In a r -Opt move, r edges from the current solution are replaced by other r edges in such a way that another solution is obtained [21]. A solution is said to be r -optimal if it is impossible to obtain a shorter tour by means of any r -Opt move [21]. It is also intuitive that, for 0 ≤ r ≤ r , an r -optimal tour is also r -optimal, and for a tour of n city to be optimal, it must also be n-optimal. Furthermore, is also reasonable that the probability for a r -optimal tour to be optimal grows with r [21]. However, the number of possible r -Opt moves grows rapidly with the number of nodes of the graph, making it impossible to fully explore the available moves for large values of r . For this reason, r is usually set to 2 or 3, as the algorithm rapidly loses efficiency for larger numbers. To overcome this limit, the LK heuristic introduces a scheme where the r value is decided at run-time, iteration after iteration. Initially, r is set to 2, its minimal value, and then it is gradually increased searching for new potential pairs with the following rationale: starting from the most "out-of-place" pair, the algorithm iterates searching for the new most "out-of-place" pairs of the remaining set, repeating the search multiple times [28]. If an improvement is found, the search restarts from scratch, while it stops otherwise. For further information, the reader can refer to [21] for a brief explanation, or to the original Lin and Kernighan's paper [28].
Vidal et al. [44] propose HGS, a hybrid genetic algorithm combining the effectiveness of their population based method with the Local-Search exploration of neighborhoods defined from a set of operators.
Arnold and Sörensen's [5] knowledge-guided local search (KGLS) is an effective Local-Search heuristic which adopts three different neighborhood-defining operators along with a knowledge based penalization to avoid local optima.
Christiaens and Vanden Berghe [9] develop a simple yet effective algorithm named Slack Induction by String Removals (SISR), consisting in a ruin-and-recreate local search heuristic.
In their recent work, Accorsi and Vigo [2] propose FILO, a very efficient and effective iterated local search heuristic, which through the combination of acceleration and localization techniques is able to find state-of-the-art solutions for very large scale CVRP instances in a short computing time. The algorithm adopts a large number of operator-defined neighborhoods and a combination of a ruin-and-recreate scheme coupled with simulated annealing.
Sharing some similarities with the work presented in the present paper, Subramanian et al. [40] propose Iterated Local Search with Set Partitioning (ILS-SP), a hybrid algorithm merging the effectiveness of a competitive iterated local search heuristic along with the optimization a SP formulation that tries to heuristically find the best combination of the explored routes. The adoption of a SP optimization phase has been also studied for many other heuristic techniques, as in the works of Foster et al. [16], Ryan et al. [39], Rochat et al. [37], Kelly et al. [25], De Franceschi et al. [13], or Monaci and Toth [30] for the Bin-Packing Problem.
Finally, Queiroga et al. [36] propose a heuristic working as a refinement technique to improve the solution obtained by other heuristics. Exploring a large solution neighborhood, their algorithm is able to consistently improve near-optimal solutions. The adopted technique is POPMUSIC [41], a matheuristic [14] based on the VRPSolver [33,34] exact solver for VRPs.

Algorithm Outline
The overall scheme of our approach can be subdivided into three main phases.
1. The LKH heuristic is executed, in parallel; from the solutions generated at the end of each "trial" of the core LK algorithm, routes are extracted to populate a pool (called the "route pool"). 2. Considering the Linear Programming (LP) relaxation of the SP formulation, a column-generation pricing procedure is applied to "filter" the most meaningful routes from the pool. 3. The RSP formulation, considering only the selected routes, is solved with a given time limit.
The three phases above are iterated until a global time limit expires-or a maximum number of repetitions is reached.
The described algorithm has been called Local Search-Column Generation Heuristic (LS-CGH) since it uses the LKH heuristic to generate good candidate routes that are then fed to the RSP optimization.
To better differentiate between the different types of iterations (one nested into the other), the following terms will be used: -in accordance with the naming adopted by the LKH algorithm, the term "trial" refers to a single pass of the core Lin-Kerninghan algorithm, ending when no more improving r -Opt moves can be found. -A "run" is a set of successive "trials", each starting from the perturbed solution of the previous one. -The sequence of a single execution of LKH, followed by Column Generation filtering and the RSP optimization, has been named "round".
Our LS-CGH algorithm then consists in a number of "rounds", repeating the threephase scheme multiple times. Each round is linked to the next one as it exploits the best solution found as its initial solution, and also because the route pool is maintained between rounds.
A high-level representation of the three main phases of the algorithm is given in Algorithm 1. In the pseudocode, the following functions are used: -LKH: Calls the LKH-based heuristic described in Sect. 3.1 and in Algorithm 2.
Returns the best solution found by the algorithm (S), along with a populated route pool (P). -CGFilter: Applies the column-generation inspired filtering (described in Sect. 3.2) to the route pool.

Phase 1: Lin, Kernighan and Helsgaun Heuristic
To integrate the LKH algorithm with our LS-CGH scheme-which has been implemented as multi-thread C++ project-and also to improve its efficiency, a number of customizations have been applied to the original Helsgaun's code available at [19]. A summary of the most relevant changes are reported next.
-Due to the extensive use of global variables and non-reentrant primitives in the C code, the algorithm was not "out-of-the-box" ready to be encapsulates into a multithread scheme. Therefore we have systematically modified all global variables storage making them "thread local", and we have substituted all the non-reentrant C primitives with their corresponding reentrant versions. After these changes, we were able to synchronize the code by means of a step-by-step execution implemented upon pthread barrier. -An improved synchronization has been implemented to equalize the duration of parallel "runs". -The Jonker and Volgenant's [24] mTSP-to-TSP transformation has been implemented to adapt solutions generated by the RSP optimisation and make them compatible with the current LKH instance. -A basic control interface has been added to control the execution of the LKH algorithm and to let successive LKH calls execute one after the other with a reduced overhead. -A route extraction function has been implemented to obtain a suitable amount of diversified routes to fill the route pool. -The caching system already adopted within the algorithm has been extended and slightly improved. -The CVRP penalty function has been redesigned, improving its speed while maintaining the exact same behaviour as the original one. -A Simulated Annealing (SA) scheme has been added on top of the original solution acceptance test, to improve the performance of the original algorithm and to perturb the initial solution in the attempt of escaping from local optima.
For the sake of clarity, in what follows we will call "newLKH" our modified version of the LKH. To give a clearer idea of the structure of the newLKH algorithm and of the introduced changes, a sketch of this variant is given in Algorithm 2. The overall scheme resembles the original LKH, since most of its logic is not affected by our changes. The two main additions are the route-extraction step (ExtractRoutes), and the Simulated Annealing acceptance test (SATest) called on every solution returned by the LinKernighan function. To be more specific, the following functions appear in the pseudocode: An overview of the LinKernighan function is provided in Algorithm 3, highlighting the positions of the "Penalty" and "Flip" functions (to be described in Sect. 3.1.2 and Sect. 3.4.1, respectively). The functions that appear in the pseudocode are as follows.

Algorithm 2:
High-level pseudocode for the LKH algorithm. Our newLKH version containing all the speed-related optimizations (namely: the new Penalty function, the caching system and the new Flip function) is freely available, for research purposes [8].

Speed improvements
Some of the most relevant changes aimed at speeding up the execution of the original LKH code are outlined next.
Cost function: To reduce the overhead related to the computation of distances between vertices, the LKH algorithm uses, since its first version, a clever caching system proposed by Bentley [6]. This caching system works with two arrays of the same size: one array is used to save the used distances, while in the other one the smaller of the two node indices is saved as a signature. The position of each distance-signature pair in their respective arrays is chosen with a fast hash function. Thanks to this simple mechanism, both Helsgaun and Bentley report that the time with TSP problems can be halved or more [6,21].
In the LKH original cost function, several checks are performed before calling the computationally expensive distance function. Indeed, depending on the VRP version and other internal parameters, the required distance might have already been stored by previous operations. Thus, before calling the distance function, all these fields are checked. The cache is checked as a last step, only if none of the fields contains the required value. Even though the performed checks are usually less expensive than a call to the distance function, searching all the places where the distance could have been stored (which are not located adjacently in memory) can be slower than a direct check of the cache which, very often, already contains the actual value required. For this reason, we have modified the original cost function moving the cache check ahead, in a small prologue (often inlined by the compiler even without linking time optimization, since it is defined in a shared header file) that first checks if the requested cost is already stored inside the cache. Only when this step fails, it proceeds by calling the remaining part of the cost function, performing all the field checks and, eventually, the final call to the distance function. Furthermore, since distance and signature are always accessed together, the subdivision into two distinct array have been modified into a single array containing the signature and its distance adjacent in memory, to improve the cache-locality of this system.

Forbidden function:
The "Forbidden" function tells if a given edge is part or not of the given instance. A simple example of forbidden edges is the set of edges between depot copies-note that, in the Jonker and Volgenan's mTSP-to-TSP transformation [24], multiple copies of the depot are introduced. This function is heavily used by the algorithm, as shown by our profiling. Since the caching mechanism proved to be a really effective improvement for the cost function, we have implemented an analogous mechanism for the Forbidden function, using again a small prologue to possibly skip not only all the checks made by the original one, but also the function-call overhead.
Balanced workload: As previously described, we have modified the original LKH source code to make it reentrant. The reason for this extensive modification has been the need of enabling a parallel execution of multiple instances of the LKH algorithm. However, running different threads in parallel, synchronized only at the beginning and at the end of each LKH call, often leads to an unbalanced situation where some threads take less time than others. This difference varies randomly with the status of the algorithm. To avoid the waste of potential computational resources, all the threads are synchronized such that each parallel run ends only when the slowest one has ended. In this way, fast runs (which sometimes are even twice as fast as the slowest one), can carry on with their "trials", avoiding to reach the phtread barrier early and then wait for the others to finish.
Some utility procedures have also been implemented to connect LKH with the remaining part of our LS-CGH scheme. We next describe two main components of such an interface: the route pool and the Jonker and Volgenant's solution transformation.
Route Pool: To store the routes extracted by the solutions generated by the LKH we have implemented a simple route pool. We have decided to use a data structure built on top of C++ STL std::unordered_set to avoid duplicates while keeping the best version of each route within the same group of nodes. Every route is distinguished from the others by the set of visited customers (which are saved as a sorted list), while the actual customer sequence and the length of the routes are updated every time a better "duplicate" is found.
Jonker and Volgenant's solution transformation: An important transformation, proposed by Jonker and Volgenant [24] and applied in LKH, is the mTSP-to-TSP conversion which transforms an instance with m salespersons into a TSP instance with m − 1 copies of the depot. This transformation is used to reduce the search space, decreasing the symmetry of mTSP and other problems with multiple routes (e.g., CVRP). It is easy to see that when m − 1 identical copies of the depot are introduced into the graph, for each tour there exists m! equivalent tours which only differ by the order of the depot copies. This transformation deletes part of the edge of the graph, by assigning to some selected nodes two depot copies to which they are allowed to be connected with, and by forbidding the edges to the other depot copies-thus reducing the number of possible route permutations.
A problem we encountered interfacing the RSP phase with the LKH one, concerns the compatibility of the CVRP solutions produced. Indeed, the combination of routes with the Set-Partitioning ILP optimization does not consider the Jonker and Volgenant's mTSP-to-TSP transformation [24] applied within the LKH algorithm. When the ILP optimization generates CVRP solutions, the transformation is applied to avoid the use of the forbidden edges. Our algorithm follows the general directives advised in the original Jonker and Volgenant's paper [24], namely: For any given solution, the Penalty function computes the penalty value with a computational complexity linear in the size of the CVRP solution. Inside LKH, such a solution is represented by a TSP tour containing a number of depot copies equal to the number of vehicles (following the Jonker and Volgenant [24] symmetry-breaking transformation). In what follows, the term "tour" will refer to this internal representation and it will not be a synonym for "route", which instead refers to the cycle covered by a singe vehicle.
The Penalty function is called inside the LKH to check a new proposed solution in the following way: 1. A new r -Opt move is found and stored (decomposed as a series of 2-Opt moves) within the LK function. 2. The move is applied to the best tour found in the current "trial", named current tour, obtaining a new proposed tour. 3. The penalty function is called to check the proposed tour. 4. If the proposed tour improves the penalty of the current tour, or keeps the penalty unchanged while improving its cost, it becomes the new current tour, otherwise the saved r -Opt move is reversed to obtain the original current tour.
Notice that, at any given time, the proposed and current tours are abstract concepts used to explain their role, while the tour stored in memory is actually one which is first modified and then eventually restored if it does not improve the previous one. However, due to its strict policy requiring that the infeasibility level can never increase, the Penalty function frequently rejects new candidates solutions. As a matter of fact, in almost all our tests the function rejects the proposed tour more than 95% of the times, thus representing one of the main bottlenecks for the entire algorithm. This observation enabled us to optimize the original LKH scheme by speeding-up the frequent "rejecting" case, introducing a rarely executed "update" step, thus resulting in a significant performance improvement. Indeed, the main change to the original penalty function has been the restriction of the penalty checks to only the routes "touched" by the proposed r -Opt move. Since the penalty function is called at every new potential change of the tour, these are the only routes modified between successive calls of the penalty function.
As in the original code there is no route-related data structure, a basic one has been implemented to store the route penalty for the current tour. Then, for each node, a reference to its route-data is stored, in accordance to the current CVRP solution. Thanks to this additional information, one can efficiently retrieve the current penalties of the routes touched by the proposed r -Opt move, as they appear in the current tour.
As a further optimization, we observe that route penalties need to be stored only if the current tour penalty is not yet zero. Indeed, when a feasible CVRP solution has been found (and the current penalty is, therefore, zero), then the previous cumulative penalty of any subset of routes is also zero. Therefore the previously described step can be completely avoided to further speed up the function.
Finally, when a proposed tour is accepted, an update procedure needs to be executed to restore route-data consistency.

Simulated annealing
To avoid to get stuck in local optima, the original LKH algorithm uses a so-called "kick" strategy, i.e., every time a "trial" of the core LK procedure cannot find any other move that improves the current solution, a random r -Opt move (usually a double bridge 4-Opt move [3,4,20,29]) is applied to the current solution and the LK procedure is called again. As previously explained, a single iteration of such scheme is named "trial" in the LKH context. This technique has however two shortcomings: -When LK is applied over a TSP instance that maps the VRP one, the additional constraints applied through the penalties make the search space very sparse. Therefore, although effective with true TSP instances, it can result to be not powerful enough to perturb the solution and move from the current VRP local-optima. -When a warmstart is provided to the algorithm, LKH starts from a potentially very good local optimum from which it is not able to move (especially if such a warmstart has been produced by previous iterations of the LKH algorithm itself). Therefore, a perturbing strategy able to lead the search trajectory away from this starting point and to explore new solution neighborhoods is needed.
As in the recent FILO heuristic [2], we decided to integrate a Simulated Annealing (SA) [26] scheme into LKH, motivated also by the compatibility of the original penaltybased scheme with such a technique.
Two overlapping SA schemes have been implemented, one based on the number of "trials", and one based on the LKH time limit. During the execution, the temperature is decreased for both the SAs and the smaller one is considered for the actual SA acceptance test. In this way, when both the trial and the time limits are given, the algorithm can automatically adapt to fit the tighter of the two.
Inspired again by the SA implementation in FILO [2], we have set up our SA scheme as follows: -The ratio between the initial temperature and the final one has been fixed to 100.
-Adopting the terminology introduced in Sect. 3.1, let z be the cost of the proposed solution, z be the cost of the current solution used as a starting point, and T t be the temperature at the "trial" t of the algorithm. The solution z is accepted as new current solution if where U[0, 1] is a uniform random variable in the [0, 1] range. -Two distinct temperatures are maintained during the execution, namely: T t trial which represents the trial-based SA temperature, and T t time which is the temperature of the time-based one. The actual temperature T t is computed as the minimum of the two. Therefore, the update formulas are: where MT R I AL is the maximum number of "trials", Δt is the time lasted from "trial" t and "trial" t + 1, and T M AX is the time limit for the "run". -Finally, the initial temperature is computed as the value of the best solution obtained after 50 "trials", multiplied by a factor c (say) defined as follows. As we aim for long runs, we have distinguished the initial part of the algorithm (where the objective is to find a good solution without getting stuck into local optima) from the second one (which tries to find improvements to the given initial solution).
For the first part a factor c z (say) has been used to scale the initial temperature when no initial solution is provided to the algorithm, while c w (say) is the same factor when an initial solution is present-because provided externally or from previous rounds of the algorithm. After some preliminary computational tests, we have fixed c z = 2.5 · 10 −3 and c w = 5 · 10 −4 .

Phase 2: column generation filtering
The number of routes generated during the LKH execution is typically exceedingly large, hence a technique to select the best routes is essential for the efficiency of the whole algorithm.
Considering our heuristic context, we need to balance two aspects: efficiency of the column generation phase, and RSP optimization speed. To achieve the former, a set of policies built around the common objective of finding a good and relatively small subset of routes has been defined, from which the RSP optimization could start. The initial core set of candidate routes consists in the selection of the "best" 8, 000 routes from the ordered list of all routes, sorted by non-decreasing solution costs. (Indeed, in our computational tests we have seen that values between 5, 000 and 10, 000 are adequate for fast runs where the Set-Partitioning phase needs to be fast to avoid introducing large slow-down for the whole LS-CGH algorithm.) Starting from this core set, the following filtering techniques are applied: 1. The LP relaxation of the RSP containing only the initial set of route is iteratively solved using the dual simplex algorithm. At each iteration, the reduced costs of the routes still in the route pool are computed, saving the value of the most negative one, say c min < 0. At this point, the routes with a reduced cost less than 0.8 · c min are added to the RSP, therefore inserting a number of potentially useful columns at each iteration. This pricing procedure stops when all reduced costs are nonnegative, or when a time limit is reached. 2. Since the previous policy often does not select enough routes, we also use a filtering criterion akin to the one proposed by Caprara et al. [7] for the solution of largescale set covering problems. At every pricing iteration we also select, for each customer, the ten routes with smallest (possibly positive) reduced costs. The pricing procedure stops when the time limit is reached or when the cumulative sum of the reduced costs added during the previous iteration, becomes nonnegative.
To handle the case in which the pricing procedure selects too many routes, we have set as a hard bound value equal to 16, 000, i.e., twice the initial set size.

Phase 3: restricted set partitioning problem optimization
The final step of our scheme consists in the solution of the RSP formulation. For this task we used a state-of-the-art commercial MIP solver (IBM ILOG CPLEX 12.10). Although this is an exact algorithm, it has been successfully integrated in our heuristic scheme by setting an aggressive time limit and by an early activation of its "polishing procedure" [38].
It is worth observing that, as an alternative to the SP formulation, a Set Covering formulation might be used, that would allow for route overlaps. (Note that multiple customer visits can be removed by a short-cut post-processing procedure, that for instances with costs satisfying the triangle property would even reduce the final solution cost.) However, as reported by Rochat and Taillard [37], and confirmed by our own computational tests, the Set Covering formulation is significantly slower to solve by our MIP solver, so we preferred to stay with the SP formulation.

VRP taxonomy
To position our technique within the VRP scientific literature and to give a clearer idea of its applicability to other VRP variants, we make use of the Pillac et al. [35] VRP taxonomy. Broadly speaking, VRPs can be classified by the point of view of the instance data evolution, in this sense we have static problems where all the information is known beforehand, vs. dynamic problems where the information regarding the instance is known only during the optimization. Then, we have deterministic vs. stochastic problems: in the former, all information is known exactly, while in the latter the input data is modelled in the form of random variables. From the product of this two classifications, one obtains four different classes: -static and deterministic; -dynamic and deterministic; -static and stochastic; -dynamic and stochastic.
The technique proposed in the present work specifically aims at problems of the first category: static and deterministic, as this is the nature of our local search and set partitioning phases.
More precisely, our scheme can readily be extended to all the VRP variants characterized by solutions with independent routes (i.e., variants that can be represented through the SP formulation, needed for the SP-phase of our algorithm) and supported by LKH. Here is a brief list of possible candidates: Of course, for any such VRP variant one needs to implement a specialized feasibility check for the routes found in the LKH solutions, to ensure that only feasible routes are inserted into the route pool. One could also extend our technique to other variants which are compatible with the TSP-tour representation and the LKH penalty system. In this case, the implementation would be more involved than in the previously cited variants (which are already supported by Helsgaun's algorithm) since, along with the definition of the Penalty function, also the internal data structure should be modified and extended. Similarly, all prepossessing steps (including the instance file parsing, the application of potential reductions or other preprocessing operations that can simplify the search) should be revised to account for the new variant.

New flip function
Within LKH, most VRP variants undergo an ATSP-to-TSP transformation [23], hence in what follows we will use the symmetric and asymmetric terms not to refer to the cost of the arcs in the original problem formulation, but to the cost of the arcs of the LKH internal representation of the problem. For instance, a symmetric CVRPTW instance is converted to an asymmetric one so as to remove all the finite-cost arcs which are not feasible due to the time-window constraints. In this sense, among the above-mentioned variants, only the "m-TSP" and the "CVRP" variants are viewed as symmetric problems, while all the others are asymmetric.
Within the LKH algorithm, whenever a 2-Opt move is applied, a function named Flip is called to copy a portion (segment) of the TSP tour representation in its reversed order. The operation is part of every 2-Opt move, although sometimes it could be avoided by applying more complex r -Opt moves that maintain the orientation of every part of the solution. Within LKH, every r -Opt move is decomposed into a sequence of 2-Opt moves, hence every r -Opt move must go through different "flips". If naively implemented, each flip operation has a O(n) complexity, and is often the main bottleneck of any r -Opt move-based algorithm.
To improve its overall performance, LKH exploits a clever data structure due to Fredman et al. [17]. Three versions of the Flip function are implemented, with complexity O(n) (naive doubled-linked list version), O( √ n) (two-level tree), and O( 3 √ n) (three-levels tree), respectively. In particular, the second one is usually adopted since it is able to maintain a good trade-off with the size of common instances.
We observed that most of the proposed r -Opt are rejected by the Penalty function. As the Flip function is called every time an r -Opt move is applied, in the very likely "rejection" case the solution undergoes two "flip" operations: one to produce a proposed tour, and another to restore the current tour. As a result, this function can be optimized by introducing an "update" step when a better solution is found, with a significant speedup for the most-common "rejection" case.

Computational results
In the present section, we address the following questions: -How effective are our improvements to the original LKH implementation, in particular in terms of speed? -Is our overall refinement heuristic able to improve the solutions found, in long computing times, by a state-of-the-art CVRP heuristic such as FILO? -Are we able to improve some best-known solutions from CVRPLIB library, thus providing an implicit comparison will the best methods from the literature-that arguably have been applied to the instances of this well-known library?
In the computational tests that follow, the Uchoa et al. [43] X dataset has been used. Following Queiroga et al. [36], this dataset was restricted to its largest 57 instances (called 57-X in what follows).
For speedup evaluation and for the final tests with the FILO heuristic, we also considered the XXL set [5] which contains 10 instances of size up to 30, 000 customers.
All the tests have been performed on Intel Xeon E3-1220 V2 CPUs, using up to 4 threads. We will refer to the Gap of a solution with respect to the currently Best-Known Solution (BKS), defined as: When not available, an initial solution can be obtained by using one of several constructive methods that LKH provides. In its default setting, a pseudo-random procedure is selected that takes into account the possible presence of some restrictions on the edges of the graph, like the presence of "fixed" edges. Another useful constructive CVRP algorithm implemented within LKH is the Clarke and Wright (CW) saving algorithm [10]. Our computational experience shows that, for the X dataset, the final solution quality does not depend too much on the selected constructive heuristic. For the bigger XXL instances, instead, CW is often superior to the pseudo-random one, as it starts from a solution that, even when infeasible, is of better quality. Thus, for the single-thread speedup tests described in Sect. 4.1 we use CW for the initialization. For the comparison with the original LKH in Sect. 4.2, instead, we use CW for the first thread, while for the remaining threads we use the pseudo-random one to help increasing route pool variability. Notice that, for both newLKH and LS-CGH, only the very first round makes use of such an initialization, while the best solution found is used in the other rounds.

Original LKH vs new LKH
In this section, the original LKH is compared with our modified version. The comparison only addresses the LKH phase of LS-CGH (i.e., without RSP and route extraction), both run in single-thread for the same number of "trials". As the implemented LKH changes do not alter the search trajectory between the original version and the new one (when run in single-thread mode and when the same random seed is used), the two versions visit the same solutions sequence and perform the same algorithmic steps, hence producing the same final solution.
In Table 1 the speedup achieved by the new version is reported along with the size of the instance. (Since inside the LKH each solution in represented by a TSP tour of length equal to the number of customers plus the number of vehicles, we report this figure as the size of the instance.) Along with the 57-X test-bed, the 10 XXL instances of the Belgium data-set has been considered in order to evaluate the behaviour of the algorithm for a broader range of sizes.
For each test a single "run" of the LKH was executed, starting from a near-optimal warmstart. The number of "trials" has been set to 10, 000 and 5 random seeds where tried for each instance. The reported speedup is the average of the 5 speedups obtained by each seed. Figure 1 (left) shows how the speedup scales with the size of the instance of the 57-X set. The linear increase of the speedup with the size of the instances is further confirmed in Fig. 1 (right) where also the very large XXL instances are considered.

Original LKH vs new LKH vs LS-CGH
In order to asses the effectiveness of the proposed scheme, three different variants have been compared. All the tests have been executed with the same time limit of 200 minutes, using 4 threads for both the LKH "runs" and the CPLEX solver (when used).
In Table 2 we compare the original LKH, the new LKH and our LS-CGH methods, and report the best gap reached (w.r.t the BKS) after 200 minutes. The "LKH" columns give the performance of the original LKH algorithm, without the proposed improvements and executed without the "round" subdivision adopted in our scheme. Four parallel threads with "runs" of 10, 000 "trials" have been executed, until the time limit was reached. The "newLKH" columns give instead the solutions obtained by our new LKH scheme, without the SP phase. All the improvements applied to the original algorithm have been activated and the LKH "runs" (with 50, 000 "trials" each, and a time limit of 2000 seconds for each "run") have been subdivided into "round's" of 4 parallel "runs", providing each round with the best solution found by the previous one. Finally, in the "LS-CGH" columns the results for our complete LS-CGH algorithm are reported, thus including the same setup as in the newLKH columns with the addition of the SP phase.
Both newLKH and LS-CGH show a significant decrease in the average gap, as well as a consistently lower gap for each instance in the 57-X set.
It is worth noting that the LKH algorithm involves a large number of parameters to tune: in our tests, we used the default values provided in the scripts available in Helsgaun's website. In Table 2, a significant improvement is shown already by our own version of LKH (namely, newLKH). This is due to three main factors.
-The improved time performance of the algorithm allowed for the exploration of a larger number of r -Opt moves with respect to the original LKH. -The SA in the first round, applied with a high initial temperature, takes better advantage of a large number of "trials". The search descent is therefore less steep (w.r.t. the number of "trials"), and also less prone to get stuck into local optima. -The adopted "round" subdivision, in which the best solution obtained is used as warmstart for the next "round", greatly improves the efficacy of the algorithm to refine the solutions in long runs.
Finally, with the addition of CG filtering and RSP optimization, further improvements have been obtained.

Statistical analysis of LS-CGH
A statistical analysis of percentage gaps obtained for multiple runs on a representative subset of the studied instances has been carried out. From the 57-X dataset, we have chosen seven representative instances selected as suggested by Queiroga et al. [36] so as to cover all the different characteristics considered during the generation of the whole X dataset. As to the Belgium dataset, two (Antwerp1 and Flanders1) out of the ten instances have been randomly chosen. For these two instances, simulated annealing has been disabled because, for these sizes, the time limit is not enough to get stuck into local optima. Thus, the use of simulated annealing would only make local search slower without the benefit of the broader exploration that would happen with a much longer time limit. For each instance, ten runs with different random seeds have been executed, and the corresponding box-plots are reported in Fig. 2.
According to the plot, a low variation is experienced for the Belgium instances. This can be explained by the fact that, for these very large problems, the 200-minute time limit is quite restrictive, hence the algorithm had less time to find local optima in which getting stuck. For the seven instances from the 57-X dataset, instead, the Fig. 1 On the left, the average speedup of the "New" LKH with respect to the original version for 5 random seeds on the 57-X set. On the right the same chart including the 10 XXL instances Fig. 2 Statistical analysis of percentage gap w.r.t. the best-known solution, on a representative subset of the 57-X and Belgium datasets computing time allowed let the algorithm reach several local optima, hence the higher variance due to implemented diversification mechanisms-exceptional cases being the X-n469-k138 and X-n979-k58 instances with their outliers. Figures 3 and 4 report a similar analysis for the ten instances in Table 2 for which LS-CGH got the best and worst relative gaps, respectively.  Table 2 where LS-CGH performed best in terms of relative gap  Table 2 where LS-CGH performed worst in terms of relative gap

LS-CGH as a refinement tool for FILO
To asses the ability of improving the solution obtained by state-of-the-art heuristic algorithms, our proposed scheme has been tested starting from the best solution obtained by FILO [2]. As previously described, FILO is a recent, fast and effective heuristic, especially designed for instances of very large size as those in the XXL dataset. The solutions obtained by FILO on a very large number of instances from the literature are available online [1].
Our test consisted in a long run (200 minutes) of our algorithm starting from the best solutions obtained by the 10M-iteration runs of FILO. For each instance, we selected the best solution among those produced by FILO in 10 runs with different random seeds.
As shown in Tables 3 and 4, our LS-CGH algorithm is consistently able to improve many of the solution produced by FILO, lowering the average gap to 0.076% for the largest 57 instances of the X data-set, and to 0.079% for the XXL data-set.

CVRPLIB best-known solution improvements
During the months preceding the writing of the paper, our LS-CGH algorithm was consistently and repeatedly able to improve the best-known solutions (BKSs) for a number of instances from the literature, competing with many other algorithms developed by different groups around the world. The current BKSs are maintained in the CVRPLIB website [31], where the history of the obtained improvements is also reported. As stated in the website, everyone can submit new BKSs, without a description of the applied techniques. This fact has enabled a number of different "competitors" to submit many improvements, especially for the difficult instances of the X and XXL datasets. Different techniques have been applied to these instances, both refining heuristic starting from the previous BKS, and "standalone" ones starting from scratch.
In our case, for 30 large-scale well-studied instances from the CVRPLIB, we have been able to improve the BKSs from literature several times, providing a total of 105 improved BKSs. At the time of writing (March 2021), 14 BKSs produced by our LS-CGH heuristic are still unbeaten; see Table 5. After an initial testing phase where the ensemble of proposed techniques was still incomplete, all the new BKS have been obtained using the same parameter setting, with the only exception of the overall time limit which was set to infinity. Thus, for each instance we "manually" monitored the time lasted from the last improvement, and aborted the code when no improvement was found in the least 24 hours.

Conclusions
In this work a new CVRP refining heuristic, LS-CGH, has been proposed. We use a custom parallel and optimized version of the Lin-Kerninghan-Helsgaun heuristic to generate a large pool of feasible CVRP routes, and exploit an LP-based pricing proce-   Table 5 CVRPLIB best-known solution improvements by date.  dure to "filter" the most meaningful ones to feed a Set Partitioning model producing the final CVRP solution. Our optimized version of the LKH heuristic is available, for research purposes, at https://github.com/c4v4/LKH3. The LS-CGH algorithm succeeded in improving several of the best solutions obtained by a recent state-of-the-art heuristic (FILO) in 10M iterations. In addition, a log of the best-known solutions obtained in the past months by our method is publicly available on the CVRPLIB website [31], witnessing its ability to improve 105 solutions obtained by the best CVRP heuristics internationally competing on the same testbed.
In future work, our proposed method can be adapted to other routing problems, including the Capacitated Vehicle Routing Problem with Time Windows (CVRPTW), the Capacitated Arc Routing Problem (CARP), the Vehicle Routing Problem with Backhauls (VRPB), and many others. Since LKH itself is able to address some of these VRP variants, it can be used as route generator as suggested in the present work.