1 Introduction

Solving rich vehicle routing problems (VRPs) is a vital research area due to their numerous real-life applications, including the delivery of cash to banks, industrial waste collection, school bus routing, first- and last-mile logistics, escape route planning, and many more [28, 40, 96]. In modern intelligent transportation systems (ITSes) [82], we aim to accommodate an increasing number of transportation requests while exploiting environmentally friendly approaches that reduce air pollution and road congestion [25, 47], especially in city areas. Optimizing routing schedules is thus an integral part of ITSes, which are becoming essential in smart city environments [32, 37].

The feasibility of routing plans is commonly affected by practical constraints, including maximal vehicle capacities, time windows in which the transportation requests should be handled, various precedence and inter-request dependencies, and other constraints reflected in numerous VRP formulations [73, 77]. These encompass, among others, the capacitated VRP (CVRP) [33], weighted VRP (WVRP), VRP with time windows (VRPTW), cumulative VRP with time windows (CumVRP-TW) [17], the pickup and delivery problem with time windows (PDPTW) [44], multiple-depot VRPs [61], and VRPs with heterogeneous fleets [31]. A recently trending variant is the green VRP, which focuses on environmental issues [48, 62]. One of its aspects is a fleet of electric vehicles [13], which introduces an additional constraint related to (either partial or full) recharging along the route; there are also combined variants, e.g., the electric VRPTW (EVRPTW) [21, 41]. Finally, there are formulations with occasional drivers that can complement the fleet of vehicles [15, 49].

Such rich VRPs are often considered single- or multi-objective optimization problems [84, 93], with minimizing the number of vehicles feasibly serving the requests being the primary objective, and optimizing the total travel distance being the secondary one. These objectives, however, do not fully reflect real-life use cases, and are often not enough to build a full picture of the abilities of the optimization algorithms. They do not quantify various practical aspects of ITSes that are pivotal in industrial applications (including those related to the user perspective), such as the passengers’ satisfaction, fleet utilization, optimization time, or the convergence capabilities of the solvers. Note that if a solver converges to structurally similar schedules, it may be of important practical use, especially in scenarios in which external factors trigger updates of the routing schedule. Here, rebuilding the solution from scratch would be very costly, because it could incur the necessity of redirecting vehicles that are already in operation. Thus, utilizing classic quality measures indeed allows for comparing the emerging techniques, but it poses an important challenge once a practical transportation problem is to be tackled, as exploiting those metrics alone may lead to suboptimal or impractical choices\(^{1}\).

Although there is a battery of VRP solvers that can be utilized in a range of real-life formulations (we discuss such algorithms in Section 1.1), new—deterministic, stochastic, and dynamic (e.g., updated online)—VRP variants keep emerging, as they better reflect modern ITSes [50]. This triggers research in the area, and new VRP solvers are being actively developed, which, in turn, makes their thorough verification and validation a critical yet challenging task. We address this issue in the work reported here by proposing a comprehensive validation methodology that can shed more light on the functional and non-functional abilities of optimization techniques for any rich VRP. Our contribution is discussed in detail in Section 1.2.

1.1 Related work

The algorithms for solving rich VRPs are split into approximate and exact methods. The latter are commonly based on column generation and branch-and-{bound, cut, price} approaches [3, 67]. Although they ensure obtaining optimal routing plans, such algorithms are only applicable to small instances due to the NP-hardness of VRPs. On the other hand, approximate methods elaborate high-quality, but not necessarily optimal, solutions and are of high practical importance, as they can be conveniently applied to very large VRPs. Additionally, such approaches are being actively developed in other application areas [36], as they can lead to very high-quality solutions to challenging discrete optimization problems in an affordable time. These areas include, but are not limited to, resource provisioning and task scheduling for heterogeneous applications in distributed green clouds [97], scheduling of crude oil operations in refineries [26], disassembly planning [16], solving industrial group scheduling [103] and job shop scheduling [8] problems, building commercial recommendation systems [91], bike repositioning in bike-sharing systems [30], energy-optimized partial computation offloading in mobile-edge computing [2], facial image manipulation [95], and allocating virtual machines in data centers [102].

Approximate techniques encompass the route minimization and distance minimization approaches (RM and DM, respectively), which tackle the two most important objectives of rich VRPs. The RM heuristics include construction and improvement algorithms [38]. Construction algorithms successively build the solution from scratch; typically, when none of the existing vehicles can handle a request without violating the constraints, a new one is added to the operating fleet [43]. The improvement algorithms start with a low-quality solution and iteratively improve it through relocations or exchanges of route fragments [85]. The exchange heuristics involve intra-route and inter-route exchanges, which modify the connections within a single route or between different routes [41]. Additionally, there are hybrid solvers that couple different approximate algorithms [27], or exploit exact techniques to intensify the search within the most promising parts of the solution space [12, 23]. Finally, two-phase algorithms follow two general schemes: cluster-first route-second and route-first cluster-second. The main idea is to reduce the size of the search space by combining multiple transportation requests into clusters—this way, the problem becomes easier to solve [105]. Combining vehicles from different clusters is often not allowed, but adding non-clustered customers appearing along the way to a particular cluster is possible. Moreover, such algorithms enable us to seamlessly hire additional vehicles (drivers), perhaps at a higher cost, if there is no solution after the clustering phase. Hence, they can be effectively applied in ITSes in which transportation requests may appear outside the main traffic network.

The DM approaches are often built upon various (meta)heuristic techniques involving both population-based and local search-based algorithms [4], with the latter operating on a single solution. The bio-inspired techniques encompass, among others, ant [66] and bee colony optimization [22], particle swarm optimization [100], firefly algorithms [1, 65], single- and multi-objective evolutionary algorithms [81, 87], and bat techniques [60]. There exist approaches that hybridize local (intense) optimization with the global exploration of the solution space—memetic algorithms (also referred to as hybrid genetic algorithms) utilize local-search operators to effectively balance search exploration and exploitation, and to ultimately converge to high-quality solutions faster [7, 83]. The algorithms that optimize a single solution (in contrast to the population-based techniques) involve a wide range of tabu searches [20, 39], as well as a variety of neighborhood search methods [10, 74, 92]. Finally, parallel techniques play an important role in solving different VRPs, as they can not only accelerate the computations [15, 54], but also help elaborate higher-quality routing schedules, e.g., through efficient cooperation of parallel solvers [6, 24, 51, 53, 59].

The heuristic algorithms for VRPs are of high practical importance, but they are often heavily parameterized. Their offline parameter tuning is a cumbersome and time-consuming task, as it commonly involves a trial-and-error procedure that requires executing the considered algorithm multiple times with different parameterizations. On top of that, incorrectly selected hyperparameters can easily deteriorate the performance of these methods. Additionally, in the majority of such approaches a single parameter is tuned at a time, which can result in suboptimal choices since the parameters are not independent. Although there are well-established tools for automatically configuring algorithms, such as irace [46], developing approaches that employ run-time (self-)adaptation of hyperparameter values is a blooming research area, as such adaptation may help better respond to the optimization status [71]. There are very successful algorithms that benefit from this technique in other fields, ranging from dendritic neural models for classification, approximation, and prediction that utilize Taguchi’s experimental design method to elaborate the desired parameterizations [18], through online fault detection models and strategies [101] and designing deep neural networks with the use of particle swarm optimization [45], to improving the abilities of swarm techniques while enhancing their exploration competences through adaptively selecting the most important parameters [11], and many more [88]. A preprocessing-based adaptation was proposed in [57, 58]; here, the idea was to analyze the characteristics of the problem instance and select the best algorithm parameters for a given test variant.
In a similar vein, [9] argued that an appropriate initialization is pivotal for obtaining higher-quality VRP solutions, and applied an array of machine learning techniques to learn how to use distinct features from four commonly used construction heuristics for VRPs. An interesting research pathway includes adaptations that benefit from historical data collected through GPS devices [104]. Finally, dynamic evolutionary techniques control the hyperparameter values during the execution using three common approaches: (i) in a deterministic manner, without any feedback from the running evolutionary solver, (ii) through adaptive techniques, where a feedback loop affects the hyperparameter updates, and (iii) in a self-adaptive way, where the parameters are encoded into individuals and evolve during the optimization [55]. Although dynamic approaches can respond to the search progress, such adaptations are often dependent on specific instance characteristics, hence they require a thorough validation process that could prove their generalization abilities over different VRPs.

We have been witnessing an unprecedented tsunami of “novel” heuristics, not only for VRPs but for tackling combinatorial problems in general. In an excellent paper, [80] showed that virtually all such methods are based upon a metaphor of some natural or man-made process, and that proposing new techniques in this line of research can be threatening to the field due to the lack of appropriate scientific rigor [80]. Additionally, [80] pointed out that “comparing different metaheuristic algorithms has so far been a largely unstructured affair, with testing procedures being determined on the fly and sometimes with a specific outcome in mind”. There have been attempts toward designing new benchmarks for specific VRPs [86, 99]. Uchoa et al. [86] introduced a set of 100 CVRP instances including transportation requests of different spatial distributions, and showed that they can be solved using state-of-the-art exact and heuristic techniques. On the other hand, a real-world mail delivery case of the city of Artur Nogueira was investigated in [99]. In neither of these works did the authors introduce a validation methodology that could be used to investigate the emerging techniques ([86] analyzed, however, the impact of the instance characteristics on the performance of specific techniques); instead, they focused on proposing the VRP test instances (and the way such instances are used in future works is not standardized). This is a serious flaw, as our understanding of an algorithm’s future performance strongly depends on the way we validate it, and on how carefully we select test instances so that the generalization of algorithm performance on future instances can be inferred [78].

In Table 1, we present recent algorithms for tackling various rich VRPs, alongside their experimental settings. It is common that the authors compare a new algorithm with a selected existing one (or with the best-known routing schedules) by reporting the “gap” between the obtained and previously known solutions, calculated according to a specific metric, sometimes designed on the fly.

Table 1 The algorithms for solving VRPs are commonly verified over classic quality metrics, such as the number of routes (\(K\)) and the total travel distance (\(T\))
Table 2 Symbols and notations used in this paper

1.2 Contribution

We can note that the validation procedure followed in research papers varies across studies, and there is no widely accepted standard in the field. Additionally, the practical aspects of the algorithms and of the resulting routing schedules are commonly neglected while validating new techniques in an unstructured way. We tackle these important issues and introduce a comprehensive methodology to verify and validate existing and emerging approaches for solving rich VRPs in a multi-faceted way. This methodology can be considered “comprehensive”, as it clearly defines the most important aspects of such validation procedures, being (i) the approach for generating benchmark test instances of varying difficulty, (ii) a set of metrics (which reflect practical aspects of routing schedules) for quantifying the quality of the obtained VRP solutions, and (iii) the procedure that enables us to evaluate techniques for solving VRPs in a reproducible and hands-free manner. Thus, we believe that our efforts will become an important step toward combating the reproducibility crisis in artificial intelligence research [34] by standardizing the way the community validates the emerging algorithms for tackling rich VRPs. Overall, our contribution centers around the following points:

  • We propose an end-to-end procedure for evaluating techniques for solving VRPs in a quantifiable, reproducible, and hands-free manner, with the intention of building a standardized methodology for comparing VRP solvers (Section 2). Thanks to its modular architecture, it enables us to (i) perform quantifiable, reproducible, and traceable research, (ii) automatically generate VRP benchmarks that are solvable and characterized by varying difficulty, and (iii) comprehensively investigate single- and multi-step solvers, and thoroughly validate each component of such algorithms in a quantitative, qualitative, and statistical manner.

  • We propose a benchmark generation procedure that allows us to elaborate VRP test instances of varying difficulty (that may be conveniently controlled through the generator’s parameterization), with a baseline solution proving their solvability (Section 2.2). Such instances may be located in a real-life coordinate system, and can adhere to any set of transportation constraints.

  • We propose a set of generic metrics that allow us to quantify the quality of the obtained VRP solutions in terms of their real-life practical characteristics, such as profitability, customer satisfaction, and many more (Section 2.3.1). These metrics are useful in building an understanding of the practical aspects of VRP solvers, which can ultimately influence the process of selecting a specific optimization algorithm for deployment in the target ITS. This is an essential step toward designing efficient and user-centered ITSes that meet user needs.

  • We propose a novel way of quantifying the distance between VRP solutions using new graph similarity metrics (Section 2.3.2). This quantification may be pivotal in understanding the convergence abilities of randomized solvers, and in comparing different optimizers. Note that minimizing the structural distance between the solutions obtained by an investigated algorithm is particularly relevant in dynamic VRPs, in which external factors can affect the current routing schedule. In such scenarios, practitioners would likely select the solvers that lead to structurally similar routing schedules, as local changes are always easier to implement within the operating fleet of vehicles (if rerunning a solver over a slightly modified transportation instance led to a significantly different set of routes, rescheduling the vehicles already in operation would be extremely costly).

  • We present the abilities of our framework in practice by investigating various state-of-the-art heuristics for solving PDPTW, a challenging yet representative NP-hard variant of a rich VRP that captures optimization constraints common in other VRPs. We focus on both benchmark and real-life tests that were elaborated using our generation procedure (Section 3). Our experiments are backed up with a battery of visualizations showing that the design of the proposed framework makes exporting various artifacts straightforward, and that these artifacts can be easily integrated with widely adopted tools. Additionally, we equip our paper with a video presenting our analysis tools in a step-by-step manner (the video is available at https://gitlab.com/jnalepa/standardized_vrp).
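To give a flavor of what a structural distance between routing schedules may capture, consider one simple illustrative measure: the Jaccard index over the directed edge sets of two schedules. This is not necessarily one of the metrics defined in Section 2.3.2; the data layout and all names below are hypothetical, a minimal sketch only.

```python
def edge_set(routes):
    """Collect the directed edges (i -> j) traversed by all routes in a schedule.
    `routes` is a list of routes, each route being a list of location IDs
    (with the depot at both ends). This representation is illustrative only."""
    edges = set()
    for route in routes:
        edges.update(zip(route, route[1:]))
    return edges

def jaccard_similarity(solution_a, solution_b):
    """Jaccard index of the edge sets of two schedules (1.0 = identical structure)."""
    a, b = edge_set(solution_a), edge_set(solution_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two schedules (0 is the depot) that differ by relocating a single customer:
s1 = [[0, 1, 2, 3, 0], [0, 4, 5, 0]]
s2 = [[0, 1, 2, 0], [0, 4, 5, 3, 0]]
print(jaccard_similarity(s1, s1))  # 1.0
print(jaccard_similarity(s1, s2))  # 5/9, as 5 of the 9 distinct edges are shared
```

Under such a measure, a solver whose repeated runs score close to 1.0 against each other yields schedules that are cheap to adapt when the instance changes.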

We hope that our contributions will be an important step toward building an established and standardized framework for the thorough and fair validation of new VRP solvers, and will thus allow researchers to escape the trap of testing their algorithms in an ad hoc, unstructured, and biased way. To avoid any misunderstanding, it is worth mentioning that in this manuscript we do not propose a new algorithm for solving a specific rich VRP—we contribute to the body of knowledge that focuses on the verification and validation of such techniques in scientific and practical settings.

1.3 Paper structure

The remainder of this paper is structured as follows. Section 2 presents our framework for the automatic validation of VRP techniques, together with our approaches toward the quantitative, qualitative, and statistical analysis of the investigated techniques. In Section 3, the results of our experimental study are presented and discussed in detail. The study is split into three experiments: the first investigates the functional capabilities of the entire validation pipeline by generating real-life PDPTW test instances in the Gliwice area, Poland, and ultimately solving them using selected algorithms from the literature; in the second, we show how to objectively compare selected algorithms over benchmark test instances; and the third further proves the practical utility of our system by deploying it over sample data collected from an existing Demand Responsive Transport system. Section 4 concludes the paper and highlights the most exciting future research directions that may be followed based on the results and ideas presented here. Finally, Table 2 gathers the symbols and notations used in this paper.

Fig. 1 A high-level flowchart of the proposed validation framework. The components of the Solver module are rendered in blue, whereas the Benchmark Generator and the post-processing module are annotated in orange. For each building block, we indicate its input and output characteristics

2 Method

In Section 2.1, we present a high-level overview of our framework for the automatic validation of the VRP algorithms. Its pivotal components are discussed in detail in Section 2.2 (Benchmark Generator), and in Section 2.3, which highlights our approaches toward the quantitative, qualitative, and statistical analysis of the investigated techniques. Although the introduced pipeline is independent of the underlying VRP variant, we focus on PDPTW to provide easy-to-follow concrete examples of its real-life implementation.

2.1 Overview of the validation approach

The proposed validation framework consists of the benchmark provider, solver (being an algorithm that is undergoing the analysis), and the post-processing module. The first module, referred to as the Benchmark Generator (see Fig. 1 and Section 2.2), supplies the solver with already defined benchmarks or newly generated datasets, and combines the data from both sources. The solver includes three major components, marked in blue in Fig. 1: the Initial Solution Generator (ISG), and the RM and DM components. Each solver building block operates on the same set of benchmarks (passed through the in1 entry point), and produces a number of solutions (through the out exit point). Additionally, RM and DM start off either with a set of predefined solutions provided externally (e.g., from earlier solver runs), or with those that are supplied directly from one of the previous components (through in2). The post-processing module involves several elements (discussed in Section 2.3) run independently for each solver’s component, and finally for the solver as a whole.

There are three important advantages of the framework. Firstly, it considers the solver components as black boxes, allowing us to automatically use and validate a wide range of algorithms (see Section 3). Secondly, the solver allows for a high degree of parallelism, at both the data and instruction level. Data-level parallelism stems from the fact that all the benchmarks can be solved independently, whereas instruction-level parallelism results from the potential independence of the RM and DM modules executed on a set of VRP tests. Finally, the artifacts elaborated by the post-processing module can be easily integrated with existing map engines. To simplify the discussion—although our framework is generic and can be coupled with any engine—we assume that Google Earth and the corresponding Keyhole Markup Language (KML) files are the default choices that help visualize the obtained solutions.
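The data-level parallelism described above can be sketched as follows. The solver under validation is treated as an opaque callable; `toy_solver`, the worker count, and the request representation are illustrative placeholders, not part of the framework's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

def run_solver(solver, benchmark):
    """Treat the solver as a black box: feed it one benchmark, collect one solution."""
    return solver(benchmark)

def validate(solver, benchmarks, max_workers=4):
    """Data-level parallelism: every benchmark instance can be solved independently,
    so the runs are simply dispatched to a pool of workers (order is preserved)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda b: run_solver(solver, b), benchmarks))

# A trivial "solver" that assigns every request to its own vehicle:
toy_solver = lambda requests: [[r] for r in requests]
solutions = validate(toy_solver, [["r1", "r2"], ["r3"]])
print(solutions)  # [[['r1'], ['r2']], [['r3']]]
```

Instruction-level parallelism (running RM and DM concurrently on disjoint test sets) can be layered on top of the same pool.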

Fig. 2 An algorithm for generating a single benchmark test

2.2 Benchmark generator

The Benchmark Generator (BG) operates on two (not necessarily disjoint) sets of vertices \(\mathcal {V}_1\) and \(\mathcal {V}_2\) representing the possible pickup and delivery locations, respectively. The generation process is controlled by five parameters: \(\mathcal {N}_r\), the number of requests to generate (hence, we have \(\mathcal {N}_r/ 2\) pickup-delivery pairs); \(\mathcal {N}_{vr}\), the suggested\(^{2}\) number of requests per vehicle; \(\mathcal {C}\), the vehicle’s capacity; \(\mathcal {W}\), the time window (TW) span taken with respect to the distance between successive locations; and \(\mathcal {T}\), the service time.

The benchmark instances are created using the algorithm shown in Fig. 2. Note that its input parameters are sufficient to generate a wide variety of tests having, e.g., clustered or randomly distributed locations, narrow or wide TWs, and so forth. Moreover, while generating a test \(\mathcal {B}\), we simultaneously obtain a valid (but not necessarily optimal) solution \(\mathcal {S}\), thus we ensure that the instance is solvable. First, the depot location is randomly selected (line 3), and—until the desired number of requests is reached—we repeat the following steps:

  1. Randomly select the actual number of requests per vehicle from the range \(\left[ \mathcal {N}_{vr}- s\%, \mathcal {N}_{vr}+ s\%\right] \) (line 5).

  2. Randomly select the locations \(l_j\) and \(l_k\) from \(\mathcal {V}_1\) and \(\mathcal {V}_2\), and generate some non-zero demand not exceeding \(r\%\) of \(\mathcal {C}\), where \(r\) is a hyperparameter of BG (lines 7–8).

  3. Calculate the pickup and delivery TWs (lines 9 and 12; \(t_i, t_j\) are the current times, and \(d_{ij}, d_{jk}\) denote the travel times between respective locations), update the current time (line 14), and add the requests to the benchmark test and to the corresponding solution (lines 10, 13, and 15).

The validity of the solution \(\mathcal {S}\) is ensured as follows:

  1. The precedence constraint is satisfied since within each route the pickup is always put before the delivery (see the TW definitions in lines 9 and 12, Fig. 2).

  2. The capacity constraint is satisfied because the demand never exceeds \(\mathcal {C}\); due to the consecutive pickups and deliveries, the remaining capacity can never be negative, and the “amounts” of the corresponding pickups and deliveries are equal.

  3. The TW constraint is satisfied since the TWs are built around the actual time-to-reach of a given location. The latest time-to-return to the depot is calculated after all the routes are established, based on the final deliveries.

Overall, BG generates tests of various characteristics, alongside the corresponding feasible solutions in \(O(\mathcal {N}_r)\) time.
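A minimal, self-contained sketch of this generation loop is given below. The authoritative procedure is the one in Fig. 2; here, travel times are faked with a deterministic hash so the example runs without a road network, and all names, the dictionary layout, and the default values of \(s\) and \(r\) are illustrative assumptions.

```python
import random

def generate_benchmark(v1, v2, n_r, n_vr, capacity, tw_span, service_time,
                       s=10, r=50, seed=0):
    """Simplified sketch of the Benchmark Generator (BG). Returns the test
    (a list of pickup-delivery pairs) and a feasible baseline solution."""
    rng = random.Random(seed)
    travel = lambda a, b: (hash((a, b)) % 19) + 1  # stand-in travel time, 1..19
    pairs_left = n_r // 2                          # N_r requests -> N_r/2 pairs
    test, solution = [], []
    while pairs_left > 0:
        # Step 1: draw the number of pairs served by this vehicle from N_vr +/- s%.
        k = rng.randint(max(1, int(n_vr * (1 - s / 100))),
                        max(1, int(n_vr * (1 + s / 100))))
        route, t, here = [], 0, "depot"
        for _ in range(min(k, pairs_left)):
            # Step 2: pickup/delivery locations and a non-zero demand <= r% of C.
            pickup, delivery = rng.choice(v1), rng.choice(v2)
            demand = rng.randint(1, max(1, capacity * r // 100))
            # Step 3: TWs built around the actual arrival times, so the route
            # that produced them is feasible by construction.
            t += travel(here, pickup)
            pickup_tw = (max(0, t - tw_span), t + tw_span)
            t += service_time + travel(pickup, delivery)
            delivery_tw = (max(0, t - tw_span), t + tw_span)
            t += service_time
            here = delivery
            pair = {"pickup": pickup, "pickup_tw": pickup_tw,
                    "delivery": delivery, "delivery_tw": delivery_tw,
                    "demand": demand}
            test.append(pair)
            route.append(pair)
            pairs_left -= 1
        solution.append(route)
    return test, solution

test, sol = generate_benchmark(v1=["A", "B"], v2=["C", "D"], n_r=8, n_vr=2,
                               capacity=10, tw_span=5, service_time=2)
print(len(test))  # 4 pickup-delivery pairs
```

The loop touches each pair exactly once, which mirrors the \(O(\mathcal {N}_r)\) complexity stated above.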

2.3 Quantitative, qualitative, and statistical analysis

To perform the quantitative, qualitative, and statistical analysis, we exploit the post-processing module, which elaborates various artifacts (Fig. 1). In the following subsections, we discuss the process of assessing the final VRP solutions and the capabilities of different algorithms.

2.3.1 Assessment of solutions: feasibility and quality

Before assessing the solution (and the algorithm used to produce it), we first need to ensure that the solution is feasible, i.e., that it adheres to the assumed optimization constraints. The assessment of the solution’s feasibility involves the following steps (for the considered PDPTW case):

  1. Preliminary validation—each passenger\(^{3}\) has to be picked up from and delivered to the correct location; the same passenger has to appear exactly twice within the solution (once for the pickup and once for the delivery operation).

  2. Precedence validation—for each passenger, the pickup has to occur before the corresponding delivery.

  3. Capacity validation—the capacity must never be exceeded for any vehicle.

  4. Time window validation—for each request (either pickup or delivery), the actual arrival time at the given location must not fall after the TW closing time. It is, however, allowed to arrive before the TW opens (in this situation, the vehicle waits at the location).

  5. Depot validation—each route has to start and finish at the depot; the time of arrival at the depot cannot exceed the time-to-return value, and no route can start before the depot’s TW opens.

Note that the feasibility assessment is generic enough to cover not only PDPTW but also other types of routing problems, such as CVRP (capacity validation), VRPTW (time window validation), and so forth.
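The five checks above can be sketched as a single pass over each route. The data layout, argument names, and helper structures below are assumptions made for illustration, not the framework's actual interface.

```python
def is_feasible(solution, requests, capacity, depot, depot_tw, dist, service=0):
    """Sketch of the five PDPTW feasibility checks. A stop is a dict with keys
    req, kind ('P'/'D'), loc, tw=(open, close), demand; `requests` maps each
    request id to its expected pickup ('P') and delivery ('D') locations;
    `dist` maps a location pair to travel time."""
    seen = {}
    for route in solution:
        t, load, here = depot_tw[0], 0, depot   # depot: start after the depot opens
        order = {}
        for i, stop in enumerate(route):
            seen.setdefault(stop["req"], []).append(stop["kind"])
            order.setdefault(stop["req"], {})[stop["kind"]] = i
            if stop["loc"] != requests[stop["req"]][stop["kind"]]:
                return False                    # preliminary: wrong location
            t = max(t + dist[here, stop["loc"]], stop["tw"][0])  # early -> wait
            if t > stop["tw"][1]:
                return False                    # time window violated
            t += service
            load += stop["demand"] if stop["kind"] == "P" else -stop["demand"]
            if load > capacity or load < 0:
                return False                    # capacity violated
            here = stop["loc"]
        if route and t + dist[here, depot] > depot_tw[1]:
            return False                        # depot: time-to-return exceeded
        for pos in order.values():
            if pos.get("P", float("inf")) > pos.get("D", -1):
                return False                    # precedence violated / pair split
    # preliminary: every request appears exactly twice (one pickup, one delivery)
    return all(sorted(k) == ["D", "P"] for k in seen.values()) and set(seen) == set(requests)

# A minimal feasible one-route example:
dist = {("X", "a"): 5, ("a", "b"): 5, ("b", "X"): 5}
requests = {1: {"P": "a", "D": "b"}}
route = [{"req": 1, "kind": "P", "loc": "a", "tw": (0, 10), "demand": 1},
         {"req": 1, "kind": "D", "loc": "b", "tw": (0, 20), "demand": 1}]
print(is_feasible([route], requests, capacity=2, depot="X",
                  depot_tw=(0, 30), dist=dist))  # True
```

Dropping the pickup/delivery bookkeeping and keeping only the load and TW checks yields the CVRP and VRPTW special cases mentioned above.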

Given a feasible routing schedule, we assess its quality primarily based on the number of vehicles used in the solution (\(K\)) and the total distance traveled by the vehicles (\(T\)), as these are the two common primary and secondary optimization objectives. Additionally, we capture the convergence time \(\tau \), i.e., the time required to produce the final solution. Apart from such classic (primary) quality metrics, we introduce a set of secondary metrics. These metrics carry a lot of practical significance and can be applied to assess the quality of scientific and industry-oriented VRP solvers. Our validation methodology utilizes the following metrics:

  1. Total vehicles’ round-trip time (\(\alpha \))—in real-life routing problems (i.e., located within a geographical coordinate system constrained by the features of a specific traffic network), the time required to travel a certain distance is not equal to the distance itself. Measuring the travel time is important from the point of view of obeying the regulations governing the drivers’ work times. This metric should be minimized.

  2. Average vehicles’ round-trip time (\(\beta \)), calculated as \(\alpha / K\), estimates the average duration of a round-trip. Knowing this value and the actual duration of the respective routes, we can observe the variability of the routing schedule. Consequently, we can also draw conclusions about the distribution of workload among the drivers assigned to particular routes. This metric should be minimized through the reduction of \(\alpha \), without increasing the total number of routes (\(K\)).

  3. Average number of vehicle round-trips per hour (\(\gamma \)), calculated as \(3600 / \beta \). It indicates the frequency of vehicles’ visits in a given area. It can also be computed for individual routes, to see whether certain locations are visited more frequently than others. This metric should be maximized, since more frequent visits in a given area enable the passengers to choose the most appropriate time to travel, thus making them more satisfied with the transportation system/provider.

  4. Total passengers’ travel distance (\(\delta \)), calculated as the sum of distances traveled by passengers inside a vehicle (distance registration starts at the pickup location and finishes at the delivery location). Note that this value is always smaller than \(T\), since it does not take into account the distances traveled by an empty vehicle between the depot and the first pickup (last delivery) location in any route. In some cases (e.g., the dial-a-ride scenario), this metric can be an indicator of passengers’ satisfaction. Ideally, the passengers should be delivered to their destination immediately after pickup, without the need to travel an excessive distance to other locations, as this may incur costs. This metric should be minimized.

  5. Total passengers’ travel time (\(\varepsilon \))—combined with \(\delta \), this metric shows how much time the passengers spend in transit (including the service times at the locations visited after the pickup and before the delivery). Note that for real-life problems, close proximity of certain stops does not mean that the travel times are also short (e.g., the route may include segments with heavy traffic, increasing the travel times significantly without affecting the distance traveled). When comparing two solutions, an increase in this metric indicates a change in the route schedule resulting in the delayed delivery of one or more passengers. This metric should be minimized.

  6. Average pickup-delivery pair’s travel time (\(\zeta \)), computed as \(\varepsilon / (\mathcal {N}_r/ 2)\). It shows the average time spent by a group of passengers (resulting from a pickup-delivery pair) in a vehicle. This metric should be minimized.

  7. Total cost (\(\eta \)), calculated as \(c_{K}\cdot K+ c_{T}\cdot T/ 1000\), where \(c_{K}\) is the cost of a vehicle and \(c_{T}\) is the cost per traveled kilometer (or another unit). In practice, \(c_{K}\) can correspond to the salary of a driver, the insurance costs, and so forth, while \(c_{T}\) can be the fuel cost. Both \(c_{K}\) and \(c_{T}\) can be customized to reflect, e.g., different vehicle types (electric, fuel, gas, etc.). This metric should be minimized.

  8. Average cost per route (\(\theta \)), calculated as \(\eta / K\). It helps observe the proportion of the cost incurred by the vehicle itself and by the distance it travels in a route. Moreover, it indicates the increase in traveled distance needed to balance the cost of the vehicle. This metric should be minimized; the reduction of \(\theta \) should result from the reduction of \(\eta \), and not from an increase in \(K\).

  9.

    Average number of pickup-delivery pairs per vehicle (\(\kappa \)), calculated as \((\mathcal {N}_r/ 2) / K\), estimates the average route length given in terms of the pickup-delivery pairs. This metric should be maximized, since (for a fixed number of requests) an increase in \(\kappa \) means that the value of \(K\) decreases, which is desired, as fewer vehicles incur smaller costs.

  10.

    Average number of passengers per vehicle (\(\lambda \)), calculated as \(\mathcal {D}/ K\), where \(\mathcal {D}\) denotes the total demand (number of passengers) across all pickups. By comparing this value with the vehicle’s capacity \(\mathcal {C}\), we can observe how well the available space in the vehicles is utilized. A low value indicates that smaller (and possibly cheaper) vehicles could be used to handle the routing schedule. This metric should be maximized to better utilize the vehicles’ capacity.

  11.

    Average arrival waiting time (\(\xi \)), calculated as \(t_{a}/ (\mathcal {N}_r/ 2)\), where \(t_{a}\) is the total arrival waiting time. The arrival waiting time pertains to the deliveries, and is determined as the positive difference between the actual arrival time at the delivery location and the left time window margin of this location. In practice, it indicates how long the passenger needs to wait past the optimal arrival time (left TW margin) to be dropped off. To better contextualize it within a practical transportation scenario, let us note that—in the case of a communication hub with very high traffic load—a bus may not be allowed to enter the bus stop prior to its scheduled time to avoid interference with other buses sharing the same stop. Consequently, being at the destination (in terms of geographical coordinates), the passengers may not be allowed to get off the bus earlier due to safety reasons. This metric should be minimized.

  12.

    Average departure waiting time (\(\pi \)), calculated as \(t_{d}/ (\mathcal {N}_r/ 2)\), where \(t_{d}\) is the total departure waiting time. The departure waiting time pertains to the pickups, and is determined as the positive difference between the actual arrival time at the pickup location and the left time window margin of this location. In practice, it indicates how long the passenger needs to wait past the optimal departure time (left TW margin) to be picked up. This metric should be minimized.

  13.

    Average waiting time (\(\rho \))—this metric combines the arrival and departure waiting times discussed before and is given by \((t_{a}+ t_{d}) / (\mathcal {N}_r/ 2)\). It captures the average combined difference between the actual and optimal pickup/delivery times. This metric should be minimized.
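To make the cost and utilization metrics above concrete, the following is a minimal sketch of their computation. The `Route` fields and unit costs are our own illustrative assumptions (not the paper's data structures), and distances are assumed to be given in kilometers already, so the \(/1000\) unit conversion from the cost formula is dropped.

```python
from dataclasses import dataclass

@dataclass
class Route:
    distance_km: float   # distance traveled by the vehicle on this route
    num_pd_pairs: int    # pickup-delivery pairs served by the route
    passengers: int      # total demand (passengers) picked up on the route

def secondary_metrics(routes, c_K=1.0, c_T=1.0):
    """Compute the cost/utilization metrics for a solution given as a
    list of routes; c_K is the per-vehicle cost, c_T the per-km cost."""
    K = len(routes)                              # fleet size (primary metric)
    T = sum(r.distance_km for r in routes)       # total distance (primary metric)
    pairs = sum(r.num_pd_pairs for r in routes)  # N_r / 2
    D = sum(r.passengers for r in routes)        # total demand
    eta = c_K * K + c_T * T                      # total cost
    return {
        "eta": eta,            # total cost (minimize)
        "theta": eta / K,      # average cost per route (minimize)
        "kappa": pairs / K,    # avg pickup-delivery pairs per vehicle (maximize)
        "lambda": D / K,       # avg passengers per vehicle (maximize)
    }
```

The waiting-time metrics \(\xi \), \(\pi \), and \(\rho \) would be computed analogously from the per-stop arrival times, once a time-window model is attached to the routes.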

In Table 3, we illustrate the relations between our secondary metrics and the primary metrics \(K\) and \(T\), as well as the mutual relations among secondary metrics. Except for the cost metrics \(\eta \) and \(\theta \), which combine both \(K\) and \(T\), the dependence of other metrics on \(K\) results from the fact that we calculate per-vehicle averages. Moreover, although some metrics are related to each other, they carry interpretable, practical significance, as noted before. Besides, the average metrics are handy when analyzing the variability of route characteristics within a single solution. Therefore, we believe that the proposed metrics together provide a full overview of the algorithm’s capabilities from the practical point of view.

Table 3 The relations between the proposed and primary quality metrics

The metrics can be also combined into the following objective functions (1)–(2), which are minimized:

$$\begin{aligned} \mathcal {F}_1&= \omega _1\alpha '+ \omega _2\delta ' + \omega _3\varepsilon ' + \omega _4\eta '\nonumber \\&\quad - \omega _5\kappa ' - \omega _6\lambda '+ \omega _7\xi '+ \omega _8\pi ' + \omega _9\rho ', \end{aligned}$$
(1)
$$\begin{aligned} \mathcal {F}_2&= \omega _1\alpha ' + \omega _2\delta ' + \omega _3\varepsilon ' + \omega _4\eta ' + \omega _5/\kappa '\nonumber \\&\quad + \omega _6/\lambda '+ \omega _7\xi ' + \omega _8\pi ' + \omega _9\rho ', \end{aligned}$$
(2)

where \(\omega _i \ge 0\), \(1 \le i \le 9\), are weights such that \(\sum _{i=1}^{9} \omega _i = 1\), and the parameters \(\alpha '\), \(\delta '\), \(\varepsilon '\), \(\eta '\), \(\kappa '\), \(\lambda '\), \(\xi '\), \(\pi '\), and \(\rho '\) are the aforementioned metrics scaled to the range (0, 1], obtained by dividing each metric value by the maximum value obtained in a number of runs. The functions differ in the way metrics \(\kappa \) and \(\lambda \) are handled. Since these two metrics should be maximized, while objective functions \(\mathcal {F}_1\) and \(\mathcal {F}_2\) should be minimized, we propose to take their opposite or inverse values. Although the opposite values in \(\mathcal {F}_1\) seem more natural, looking at metrics’ definitions (i.e., their dependence on the inverse of \(K\)), the objective \(\mathcal {F}_2\) turns out to be more intuitive.
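As a minimal illustration, the scaling step and the two objectives (1)–(2) can be sketched as follows; the metric names and the weight ordering are our own illustrative conventions.

```python
METRICS = ["alpha", "delta", "epsilon", "eta", "kappa", "lambda", "xi", "pi", "rho"]

def scale(runs):
    """Scale each metric to (0, 1] across runs by dividing by its maximum.
    `runs` is a list of dicts keyed by the metric names above."""
    maxima = {m: max(r[m] for r in runs) for m in METRICS}
    return [{m: r[m] / maxima[m] for m in METRICS} for r in runs]

def f1(s, w):
    """Objective (1): kappa' and lambda' enter with their opposite values."""
    return (w[0]*s["alpha"] + w[1]*s["delta"] + w[2]*s["epsilon"] + w[3]*s["eta"]
            - w[4]*s["kappa"] - w[5]*s["lambda"]
            + w[6]*s["xi"] + w[7]*s["pi"] + w[8]*s["rho"])

def f2(s, w):
    """Objective (2): kappa' and lambda' enter with their inverse values."""
    return (w[0]*s["alpha"] + w[1]*s["delta"] + w[2]*s["epsilon"] + w[3]*s["eta"]
            + w[4]/s["kappa"] + w[5]/s["lambda"]
            + w[6]*s["xi"] + w[7]*s["pi"] + w[8]*s["rho"])
```

Since the scaled metrics lie in (0, 1], the inverse terms in \(\mathcal {F}_2\) are always well-defined.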

The functions have a few crucial properties. Firstly, they carry additional insights regarding the solution’s quality, which cannot be observed based solely on the primary metrics \(K\) and \(T\)—note that within \(\mathcal {F}_1\) and \(\mathcal {F}_2\), only the metrics \(\eta \), \(\kappa \), and \(\lambda \) depend on \(K\) or \(T\) (cf. Table 3). Secondly, the functions are quite universal with respect to the coordinate space in which the problem is located. Moving from the geographical coordinates to the Euclidean space, the only differences are that \(\alpha = T\) and \(\delta = \varepsilon \). However, these changes do not affect the practical significance of either \(\mathcal {F}_1\) or \(\mathcal {F}_2\). Also, they do not diminish the added value provided by these objectives with respect to the typical primary metrics. Finally, thanks to the weights \(\omega _i\), the functions provide a great deal of flexibility as to the importance of respective metrics. This way, the functions can be adjusted to individual needs or a specific scenario.

2.3.2 Assessment of algorithms: statistical and non-functional analysis

Observing only the differences across various quality metrics is not enough without knowing whether these differences are significant in the statistical sense. Our framework employs the statistical analysis component, which verifies the statistical significance of the differences in \(K\), \(T\), \(\tau \), and possibly other metrics. Although we use the Wilcoxon signed-rank test to confront two algorithms over a set of benchmark instances, any statistical test for matched results may be equally valid and applicable here. Additionally, we can effectively utilize this component to understand the convergence abilities of randomized algorithms, as we can conveniently run tests, e.g., Kruskal-Wallis with post-hoc Dunn’s, for repeated executions of the very same optimization approach. The advantage of incorporating the statistical analysis component into the framework is that the obtained p-values provide clear evidence behind the claims concerning the optimization performance of any investigated technique.
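The paired comparison step can be sketched as below. This is a from-scratch normal-approximation variant of the Wilcoxon signed-rank test, shown only to make the procedure explicit; in practice a library routine such as scipy.stats.wilcoxon (which also handles exact p-values and tie corrections) would be used.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples, using the
    normal approximation; zero differences are dropped and tied absolute
    differences receive average ranks. A sketch, not a production routine."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    # Rank |d| ascending, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, p
```

The same routine can be applied to any pair of matched metric vectors (\(K\), \(T\), \(\tau \), or the secondary metrics) collected over a benchmark set.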

The above metrics allow us to assess the quality of the solutions at the aggregated level, i.e., for the entire benchmark set encompassing a number of separate test instances. However, they often fail to provide deeper insights that could be captured for particular instances, e.g., understanding how similar (or different) two VRP solutions with the same \(K\) and comparable \(T\) are. To answer this question, we propose a novel idea of using new graph similarity metrics.

Let \(G_1 = (V_1, E_1)\) and \(G_2 = (V_2, E_2)\) be two directed graphs representing the VRP solutions, with \(V_1, V_2\) being the sets of labeled vertices, and \(E_1, E_2\) denoting the sets of (weighted) edges. The edges are not explicitly labeled, but they acquire the labels resulting from the vertices they connect, i.e., the label of an edge connecting vertices \(v_1, v_2\) is \((v_1, v_2)\). Let \(L_1, L_2\) denote the sets of edge labels for graphs \(G_1, G_2\), respectively. We propose the following similarity metrics (3), (4), and (5), based on counting the number of common walks:

$$\begin{aligned} \mathcal {K}_1 = \frac{|L_1 \cap L_2|}{|L_1 \cup L_2|}, \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {K}_2 = \frac{\sum _{L_1 \cap L_2} 1 / (\left| o_1 - o_2\right| + 1)}{|L_1 \cup L_2|}, \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {K}_3 = \frac{2 \cdot \sum _{L_1 \cap L_2} w_e}{\sum _{L_1} w_e+ \sum _{L_2} w_e}, \end{aligned}$$
(5)

where \(o\) is the edge offset within a route, relative to the depot, \(w_e\) denotes the weight of the corresponding edge, and \(\mathcal {K}_1\) can be treated as the Jaccard index calculated for graphs.

Fig. 3

Example solutions for a given VRP. Node 0 represents the depot, while nodes 1–2, 3–4, \(\ldots \), 11–12 denote the paired pickups and deliveries

The intuition behind these metrics is that the solution graphs \(G\) are given as unordered sets of routes of the form \(v_1, w_1, v_2, \ldots , w_{l-1}, v_l\), where \(v_i \in V\), \(w_i\) are the edge weights, \(v_1 = v_l\) is the depot, and \(l\) is the route length. Consequently, to express the similarity between two such graphs, we need to look for the same route segments represented by the same labels.

The similarity metrics \(\mathcal {K}_1\)–\(\mathcal {K}_3\) can be computed efficiently in \(O(|E_1| + |E_2|)\) time. Their values are also easy to interpret. In particular, \(\mathcal {K}= 1\) means that \(G_1 \equiv G_2\), \(\mathcal {K}= 0\) is obtained if \(L_1 \cap L_2 = \emptyset \), and the closer the value is to 1, the greater the similarity between \(G_1\) and \(G_2\). The metric \(\mathcal {K}_1\) counts the number of common edge labels in two graphs. However, it does not consider the relative offsets of the common edges in the graphs. On the other hand, \(\mathcal {K}_2\) adjusts the similarity metric based on the offsets of edges relative to the depot. It gives preference (larger impact on the \(\mathcal {K}_2\) value) to the edges that appear at the same offsets in the compared graphs. Finally, \(\mathcal {K}_3\) puts the analysis into the real-world context by including the weights, e.g., the distances. Overall, the three similarity metrics give us a comprehensive and quantifiable view of the distance between any pair of directed graphs that represent the solutions of any VRP.
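Assuming a solution is stored as a list of depot-anchored node sequences with a separate edge-weight map (a toy representation of our own, not the framework's internal format), the three metrics can be sketched as:

```python
def edge_info(routes, weight):
    """routes: list of node sequences, each starting and ending at the depot 0;
    weight: dict mapping an edge label (u, v) to its weight (e.g., distance).
    Returns {label: (offset_from_depot, weight)}; the first edge of a route
    has offset 0, matching the worked example in the text."""
    info = {}
    for route in routes:
        for off, (u, v) in enumerate(zip(route, route[1:])):
            info[(u, v)] = (off, weight[(u, v)])
    return info

def similarities(sol1, sol2, weight):
    """Compute (K1, K2, K3) for two solutions sharing one weight map."""
    e1, e2 = edge_info(sol1, weight), edge_info(sol2, weight)
    common = e1.keys() & e2.keys()
    union = e1.keys() | e2.keys()
    k1 = len(common) / len(union)                       # Jaccard index (3)
    k2 = sum(1 / (abs(e1[l][0] - e2[l][0]) + 1)         # offset-aware (4)
             for l in common) / len(union)
    k3 = (2 * sum(weight[l] for l in common)) / (       # weight-aware (5)
        sum(w for _, w in e1.values()) + sum(w for _, w in e2.values()))
    return k1, k2, k3
```

Since edge labels are stored in hash sets, the intersections and unions above realize the claimed \(O(|E_1| + |E_2|)\) complexity.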

Table 4 Our similarity metrics obtained for the solutions shown in Fig. 3
Fig. 4

Distribution of \(\mathcal {K}_1\)–\(\mathcal {K}_3\) values for all two-route graphs with 4 pairs of pickups and deliveries. The size of each circle corresponds to the number of times a given value of \(\mathcal {K}_1\)–\(\mathcal {K}_3\) was observed

Let us illustrate the similarity metrics’ performance with an example. In Fig. 3, we depict five example solutions to a given VRP. For a human expert, it may look obvious that the initial three solutions are similar to each other, while the other two seem substantially different; however, the precise quantification of such differences remains an open challenge. Additionally, for a large set of test instances, manual comparison is tedious and quickly becomes infeasible. To tackle this issue, we can calculate \(\mathcal {K}_1\)–\(\mathcal {K}_3\) (Table 4). Consider, for example, the graphs \(G_1\) and \(G_3\). The set of labels for \(G_1\) is (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 0), (0, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 0), while the set of labels for \(G_3\) is (0, 3), (3, 4), (4, 1), (1, 2), (2, 5), (5, 6), (6, 0), (0, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 0). The union of these two sets is of size 17, while the intersection is of size 11. Hence, \(\mathcal {K}_1 = 0.65\). The offsets of common labels for the two graphs are as follows: ((1, 2), 1, 3), ((3, 4), 3, 1), ((5, 6), 5, 5), ((6, 0), 6, 6), ((0, 7), 0, 0), ((7, 8), 1, 1), ((8, 9), 2, 2), ((9, 10), 3, 3), ((10, 11), 4, 4), ((11, 12), 5, 5), ((12, 0), 6, 6). Therefore, the numerator of \(\mathcal {K}_2\) is 9.66, and so \(\mathcal {K}_2 = 0.57\). Finally, the sum of weights for the common labels is 432 (note that we include each weight twice, once for each graph), while the sum of all weights for \(G_1\) and \(G_3\) is 558, and so \(\mathcal {K}_3 = 0.77\).

In more general terms, based on Table 4, we can observe that \(\mathcal {K}_1\) considers \(G_2\) and \(G_3\) to be equally similar to \(G_1\). The \(\mathcal {K}_2\) metric provides the best means for differentiating \(G_5\) from \(G_1\)–\(G_3\). Finally, we can appreciate how the real-life route characteristics, reflected in the weights associated with the edges, affect the values of \(\mathcal {K}_3\). Using the proposed similarity metrics, we obtain quantitative and objective information in a repeatable way. They also help us draw conclusions about the routing algorithms. If an algorithm produces very similar solutions every time it runs, we conclude it has good convergence and search space exploitation capabilities. On the other hand, if the solutions differ between runs, the exploration capabilities prevail.

To gain more insight into the proposed metrics, we performed the following experiment. We generated all possible two-route solutions to a PDP instance with 8 requests (4 pairs of pickups and deliveries). Then, we calculated \(\mathcal {K}_1\)–\(\mathcal {K}_3\) for all obtained pairs of solutions and collected the frequencies of all unique similarity values for each metric—these frequencies are plotted in Fig. 4. Looking at the plots, we observe that the metrics differ in the amount of detail provided, expressed as the number of produced unique values—\(\mathcal {K}_1\) gives only 10 unique values, while \(\mathcal {K}_3\) generates over \(38\cdot 10^3\) values. Moreover, although the frequencies of the border values (0.0 and 1.0) are the same for \(\mathcal {K}_1\)–\(\mathcal {K}_3\), their contribution relative to the other values differs significantly depending on the similarity metric. Let us also note that if we were to compare the solutions using classic metrics like \(K\) or \(T\), we would gain much less useful information. On the one hand, the \(K\)-based comparison would simply treat all the solutions as being exactly the same. On the other hand, the \(T\)-based comparison produces almost \(24\cdot 10^3\) unique values (absolute differences between the \(T\)’s of two solutions), but they are harder to interpret. For example, the value of 0.0 indicates that the graphs are considered the same, but for graphs having completely different edges there is no specific value to be expected (as opposed to the value of 0.0 in the case of \(\mathcal {K}_1\)–\(\mathcal {K}_3\)). To be more precise, graphs having no common edges according to our metrics produce differences in \(T\) in the range [0.23, 168.75]—note that the overall range of values for this metric is [0.0, 178.49], so looking at the differences in \(T\) does not help to tell whether two graphs are similar or have nothing in common.

2.3.3 Assessment of solutions and algorithms: visualization and exploration

Apart from the quantitative analysis, the framework also facilitates graphical analysis of the solutions. It exploits the KML files which can be conveniently visualized in a number of widely available tools, such as Google Earth. Using these artifacts, we can not only display the solution on a map, but also assess the solutions visually and compare them with the outcomes provided by human experts.

The visualization of obtained solutions in the framework comes in two flavors. The first one presents a complete solution, with the respective routes placed in separate folders. This way, the user can explore the solution easily by showing/hiding complete routes. The second visualization method involves generation of individual routes as separate KML files, to facilitate more interactive exploration. Using the presentation mode, the user can navigate between successive stops to fully understand (and validate) the architecture of the obtained route.

3 Experiments

Our study includes three experiments. In Experiment 1, we investigate the functional abilities of the entire validation pipeline through generating real-life PDPTW instances located in the Gliwice area, Poland, and solving them using selected state-of-the-art techniques (Section 3.2). Although we can use our BG to synthesize instances located on any map, we focused on the area that covers the bus stops utilized by Blees’sFootnote 4 employees who travel to the office on a daily basis—this approach brings a real-life aspect into our experimentation. Experiment 2 walks us through the process of confronting selected algorithms over known benchmark test instances, and shows how our methodology can be used to easily compare new and existing optimization techniques over widely-used and established test instances (Section 3.3). In both experiments, we take the PDPTW as an example of a rich VRP [55] (Section 3.1). Our validation approach is, however, independent of the underlying problem formulation and can be easily adjusted to handle other VRPs through updating the optimization objectives and constraints. Finally, Experiment 3 further proves the practical utility of our system over sample data collected from an operating Demand Responsive Transport system.

3.1 Problem formulation

The PDPTW is defined on a directed graph \(G=(V,E)\), with a set \(V\) of \(C = \left| \mathcal {V}_1\right| +\left| \mathcal {V}_2\right| +1\) vertices and a set of edges \(E\). The vertices \(v_i\), \(i\in \{1,2,\dots ,\left| \mathcal {V}_1\right| +\left| \mathcal {V}_2\right| \}\), represent the locations of the requests, and \(v_0\) indicates the location of the depot. The set of edges \(E=\{(v_i,v_j)\,|\,v_i,v_j\in V, v_i \ne v_j\}\) represents the travel links between particular locations. The travel costs \(c_{i,j}\), \(i,j \in \{0,1,\dots ,C\}\), \(i \ne j\), are equal to the distances (in the Euclidean metric) between the travel points. Each request \(h_i\), \(i\in \{1,2,\dots ,\mathcal {N}_r/2\}\), is a coupled pair of pickup (\(\mathcal {V}_1\)) and delivery (\(\mathcal {V}_2\)) locations, with the pickup and delivery demands, service times (\(s_i\)), and time windows within which the corresponding pickup and delivery should start (\([e_i, l_i]\)), where \(\mathcal {V}_1\cup \mathcal {V}_2= V\setminus \{v_0\}\) and \(\mathcal {V}_1\cap \mathcal {V}_2= \emptyset \). For each request \(h_i\), the amounts of delivered (\(q^d(h_i)\)) and picked up (\(q^p(h_i)\)) goods are defined, where \(q^d(h_i)=-q^p(h_i)\). The pickup must always occur before the corresponding delivery, and both must be served in the same route. The fleet of \(K\) homogeneous vehicles with the capacity \(\mathcal {C}\) serves the customers, and each route starts and finishes at the depot.

The PDPTW solution \(\sigma \) is a set of routes, where each route r is \(r=\left<v_0,v_{r(1)},v_{r(2)},\dots ,v_{n+1}\right>\), and it starts and finishes in the depot (therefore, we have \(v_0 = v_{n+1}\)).

3.1.1 Objectives

The PDPTW is commonly considered to be a two-objective discrete optimization problem, with the primary objective of minimizing the fleet size \(K\), where \(K\ge K_\textrm{min}\) (\(K_\textrm{min} = \left\lceil \mathcal {D}/\mathcal {C}\right\rceil \), and \(\mathcal {D}=\sum _{i=1}^{\mathcal {N}_r/2}{q^d(h_i)}\) denotes the total delivery demand that is to be served).

The second objective (6) is to minimize the total distance (T) traveled by all the vehicles serving the transportation requests:

$$\begin{aligned} T=\sum \limits _{i=0}^C { \sum \limits _{j=0}^C { \sum \limits _{k=1}^K {x_{(i,j,k)}c_{(i,j)}}}}. \end{aligned}$$
(6)

In the PDPTW, there are three decision variables (7), (8), and (9):

$$\begin{aligned} x_{(i,j,k)} (i,j \in \{0,1,\dots ,C\}, i \ne j, \forall k \in K), \end{aligned}$$
(7)
$$\begin{aligned} a_{i} (i \in \{0,1,\dots ,C\}),~\text {and} \end{aligned}$$
(8)
$$\begin{aligned} \hat{w} _{i} (i \in \{1,2,\dots ,C\}). \end{aligned}$$
(9)

If the k-th vehicle travels from \(v_i\) to \(v_j\), where \(i\ne j\), then \(x_{(i,j,k)}=1\) (it is 0 otherwise). The two other decision variables, \(a_i\) and \(\hat{w}_i\), indicate the arrival and the waiting times at \(v_i\) (additionally, we can observe that \(a_0 = e_0\)).

Let \(\sigma _A\) and \(\sigma _B\) be two feasible PDPTW solutions. If we consider the two aforementioned objectives, then \(\sigma _A\) is of a higher quality than \(\sigma _B\) if \((K(\sigma _A) < K(\sigma _B))\), or \((K(\sigma _A) = K(\sigma _B)\) and \(T(\sigma _A) < T(\sigma _B))\).
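This hierarchical comparison can be written down directly; representing a solution by its \((K, T)\) pair is an illustrative simplification of our own.

```python
def better(sol_a, sol_b):
    """True iff sol_a is of higher quality than sol_b under the hierarchical
    objective: fewer routes wins; with equal fleet size, shorter total
    distance wins. Solutions are (K, T) tuples in this sketch."""
    (K_a, T_a), (K_b, T_b) = sol_a, sol_b
    return K_a < K_b or (K_a == K_b and T_a < T_b)
```

Note that this is exactly the lexicographic order on \((K, T)\) with strict inequality in both components, so Python's built-in tuple comparison `sol_a < sol_b` would behave identically here.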

3.1.2 Constraints

The constraints (10)–(15) may be expressed as:

$$\begin{aligned} \sum \limits _{k=1}^K {\sum \limits _{j=0, j \ne i}^C {x_{(i,j,k)}} } = \sum \limits _{k=1}^K {\sum \limits _{i=0, i \ne j}^C {x_{(i,j,k)}} } = 1 \quad (\forall i, j \in V), \end{aligned}$$
(10)
$$\begin{aligned} \sum \limits _{j=1}^C { x_{(i,j,k)}} = \sum \limits _{j=1}^C { x_{(j,i,k)}} = 1 \quad (i = 0, \forall k \in K), \end{aligned}$$
(11)
$$\begin{aligned} \sum \limits _{j=1}^C { \sum \limits _{k=1}^K {x_{(i,j,k)} = K}} \quad (i = 0), \end{aligned}$$
(12)
$$\begin{aligned} \sum \limits _{i=1}^C { q^d(h_i) \sum \limits _{j=0, j \ne i}^C {x_{(i,j,k)} \le \mathcal {C}}} \quad (\forall k \in K), \end{aligned}$$
(13)
$$\begin{aligned} e_i \le a_i + \hat{w}_i \le l_i \quad (\forall i \in V),~\text {and} \end{aligned}$$
(14)
$$\begin{aligned} \max \{ a_{i} + s_{i} + c_{i,j}, e_j\} \le l_j, \end{aligned}$$
(15)

for all pairs \((i, j)\) such that \(i \ne j\) and \(x_{(i,j,k)}=1\), for a given \(k\). Every customer is visited exactly once (10), all routes start and finish at the depot (11), and the fleet size is equal to \(K\) (12). Also, if a vehicle arrives at a certain travel point, then it departs from the same point. The capacity (13) and the time window constraints (14)–(15) must hold for each route as well. Therefore, the total amount of goods being delivered cannot exceed \(\mathcal {C}\), and the service of each customer must be started before its time window closes.
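A route-level feasibility check covering the capacity constraint (13), the time-window constraints (14)–(15), and the pickup-before-delivery rule can be sketched as follows. The data layout (dictionaries for demands, time windows, service times, and travel costs) is our own assumption, not the framework's representation.

```python
def feasible_route(route, pickups, demand, e, l, s, c, capacity):
    """Check one route against the PDPTW constraints.
    route:    node sequence starting and ending at the depot 0
    pickups:  {delivery_node: its pickup_node}
    demand:   {node: +q at pickups, -q at deliveries}
    e, l:     time-window margins [e[i], l[i]] per node
    s:        service time per node; c: travel time/cost per edge (i, j)
    """
    load, t, seen = 0, e[0], set()
    for u, v in zip(route, route[1:]):
        # Arrive at v, waiting until its time window opens if necessary (14).
        t = max(t + s[u] + c[(u, v)], e[v])
        if t > l[v]:
            return False      # service would start after the window closes (15)
        if v in pickups and pickups[v] not in seen:
            return False      # delivery before its pickup in the same route
        seen.add(v)
        load += demand.get(v, 0)
        if load > capacity:
            return False      # capacity exceeded (13)
    return True
```

Constraints (10)–(12) are properties of the whole solution rather than of a single route, and would be checked once over the set of routes (each customer appearing exactly once, and the number of routes equal to \(K\)).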

3.2 Experiment 1: end-to-end processing

To execute the complete pipeline, we exploited the sets of real-life pickup and delivery locations (\(\mathcal {V}_1\) and \(\mathcal {V}_2\)), being the bus stops, alongside the pickup points of the Blees’s employees located in the Gliwice area, Poland (they were provided via a KML map file). The benchmark instances (25 in totalFootnote 5) were generated using BG with the following parameters controlling the difficulty of the tests: \(\mathcal {N}_r= 100\), \(\mathcal {N}_{vr}= 10\), \(\mathcal {C}= 200\), \(\mathcal {W}= 0.1\), \(\mathcal {T}= 90\), \(s= 50\), and \(r= 25\) (see Section 2.2).

The solver module included ISG, one RM algorithm and two DM algorithms. Each component was run 10 times for no more than two minutes per execution. The solutions produced by each technique constituted the initial population for the next component in the pipeline. First, ISG assigned each pickup-delivery pair to a separate vehicle (generating “virtual” vehicles), and then reduced the number of vehicles until the final fleet size was obtained. The RM algorithm employed our ejection-based approach, in which requests are ejected from a randomly selected route, and then feasibly reinserted into other routes [56]. Finally, DM involved a local search (LS) algorithm focusing on the in/out relocate/exchange moves that were iteratively employed to minimize the total travel distance of a feasible solution, and our memetic algorithm (MA), being a hybrid evolutionary technique which exploits local search moves to intensify the optimization in the most promising parts of the search space [55]. The RM and DM approaches continued their execution until the time limit was reached. Note that the parameterization of the solvers was not optimized, and we kept the hyperparameter values as suggested in the corresponding papers in all experiments.

Fig. 5

The primary and secondary metrics—the values are averaged over 10 runs (each point is a single benchmark instance). The distance metrics (\(T\), \(\delta \)—passengers’ travel distance) are given in km, while time metrics are either in seconds (\(\tau \), \(\xi \), \(\pi \), \(\rho \)—waiting times) or in hours (\(\alpha \), \(\beta \), \(\varepsilon \), \(\zeta \)—the total and average vehicles’ and passengers’ travel times). We also render \(\eta \)—the total solution’s cost, \(\kappa \)—average number of pickup-delivery pairs per vehicle, and \(\lambda \)—average number of passengers per vehicle. Finally, we render objective functions \(\mathcal {F}_1\) and \(\mathcal {F}_2\) calculated for \(\omega _1 = \omega _2 = \ldots = \omega _9\)

In Fig. 5, we depict the primary metrics (\(K\), \(T\) [in km], and \(\tau \) [in seconds]), alongside our secondary metrics obtained using all investigated techniques (see the detailed results in the supplement available at https://gitlab.com/jnalepa/standardized_vrp). As expected, the main drop in \(K\) is observed between ISG and RM, without further changes in the DM component, as it focuses on minimizing \(T\). Although the distance \(T\) may effectively decrease in both RM and DM, its change in the RM component results from the corresponding change in \(K\). Note that this change is also reflected in the \(T\), \(\alpha \), \(\gamma \), and \(\eta \) metrics. Finally, we can appreciate that ISG consistently converges very fast, while LS and MA are far more stable than RM in this respect.

Table 5 The ranking values obtained using the investigated algorithms for all secondary metrics, averaged across all the generated tests

The secondary metrics presented in Fig. 5 bear a lot of practical significance (the ranking valuesFootnote 6 of all methods obtained over all secondary metrics are given in Table 5; here we do not report the ranking values for \(\mathcal {F}_1\) and \(\mathcal {F}_2\) for brevity). The vehicles’ round-trip time \(\alpha \) indicates the number of drivers needed to handle a day-long routing plan, whereas \(\delta \) and \(\varepsilon \) show whether the passengers are transported directly to their destinations or have to visit some other stops on their way. The total cost \(\eta \) encompasses the costs resulting from each vehicle (e.g., the driver’s salary or vehicle insurance) and traveled distance (e.g., fuel)—in Fig. 5, unit costs were assumed, but they can be customized according to one’s needs. Finally, \(\lambda \) shows the used capacity of the vehicles, and its maximization can lead to better utilization of the available fleet. Such secondary metrics not only shed more light on practical characteristics of the solutions, but also allow us to consistently pick the algorithm that outperforms the others according to the preferred measures (Table 5). It is of note that the convergence times of the specific modules exploited in the current version of the optimization framework vary significantly. For the ISG module, the run time was always below a second, for the RM module it ranged from about 13 s up to 90 s, for the DM (LS) module it went from 105 s to 112 s, while for the MA module it could be as short as 65 s and as long as 108 s. Let us note, however, that the modules are replaceable, so the convergence times, and consequently the running times of the complete execution pipeline may vary from the ones reported here, depending on the applied algorithmic solutions.
Since the modules were executed sequentially, the total convergence time of the algorithm (understood as the complete framework of four modules) ranged from 201 s to 305 s (with the average of 263 s and the standard deviation of 26.5 s).

With equal weights, the combined objective functions \(\mathcal {F}_1\) and \(\mathcal {F}_2\) behave similarly to each other, with the difference in magnitude resulting from the varying treatment of the \(\kappa \) and \(\lambda \) metrics. However, due to the flexibility of \(\mathcal {F}_1\) and \(\mathcal {F}_2\) resulting from the weights, and also thanks to the customization of \(\eta \)’s parameters, we can model different real-life scenarios. For instance, let us assume that we are interested in passengers’ satisfaction measured by the smallest waiting time, and that the passengers are less willing to be late at the dropoff point than at the pickup. We can then assume the following weights: \(\omega _7 = 0.5\), \(\omega _8 = 0.25\), \(\omega _9 = 0.25\), and \(\omega _i = 0\) otherwise. One of the benchmarks reports the same \(K\) for all 10 runs, with the difference in \(T\) between the best and worst solution being almost 5.7 km. However, the smallest waiting time (i.e., the best objective function value) is achieved for the largest \(T\). Therefore, in this scenario, the passengers can accept a longer route to reduce their waiting time, and they will pick a different solution than the one based on the metrics \(K\) and \(T\).

Fig. 6

The box plots representing the distribution of the similarity metrics \(\mathcal {K}_1\)–\(\mathcal {K}_3\) for our benchmarks. The values are averaged over 10 runs

In Fig. 6, we present the distribution of values reported by \(\mathcal {K}_1\)–\(\mathcal {K}_3\) for our benchmarks. Note that the ISG component consistently produces the same solution (all metrics report perfect similarity). Furthermore, the solutions produced by other components differ significantly from the initial ones (as indicated by low similarities for pairs ISG/RM, ISG/LS, and ISG/MA). We observe that LS and MA tend to produce similar results, which indicates their good convergence capabilities.

A set of example visualizations (based on the Google Earth engine and KML files generated by the post-processing module) of the obtained solutions (for a single test instance) is rendered in Fig. 7. Such visualizations not only help us better understand the layout of the final solution, but also allow us to manually fine-tune the obtained routing schedules if necessary, perhaps in some specific parts of the map.

Fig. 7

Example solutions obtained using ISG, RM, LS, and MA for a single test. Different colors show the routes served by separate vehicles

Finally, our validation approach enables us to perform the statistical analysis to see whether the observed differences in primary (or secondary) metrics are significant. Since the values of \(K\) for RM, LS, and MA are tied, the p-values could not be calculated (Table 6). For other cases, we can observe that employing different algorithms leads to significantly different final solutions and convergence abilities (at \(p<0.01\)). The two-step optimization methodology used in this work toward solving the rich VRPs is clearly better than the single-objective optimization algorithms in which either the number of routes \(K\) or the total distance \(T\) is optimized, since they usually ignore the other metrics. The flexibility of our framework allows us to easily build such optimization cascades (and replace specific algorithmic components, e.g., the algorithm for minimizing the number of routes, the traveled distance, or both). Since the RM and DM modules focus on optimizing different aspects, their combined performance provides us with solutions that are competitive with respect to the state-of-the-art ones. Overall, this experiment shows that the artifacts generated by all framework components can be thoroughly analyzed to give us a detailed overview of the algorithms’ behavior and capabilities.

Table 6 The p-values obtained using the Wilcoxon signed-rank test

3.3 Experiment 2: comparing algorithms

In this experiment, we confront the aforementioned DM algorithms and compare them in more detail over a widely-used Li and Lim’s PDPTW benchmark. Here, our aim is to present the flexibility of the proposed methodology, in which we can exploit existing benchmarks that have been already utilized in the literature to compare emerging solvers. Hence, from the scientific point of view, we can conveniently extract the quality measures obtained over the known data to directly compare them with those reported in other papers. We used 60 instances of the 200-request tests with clustered, randomized, and mixed locations (C, R, and RC groups, respectively) with small vehicle capacities and short TWs (C1, R1, and RC1), and with larger capacities and wider TWs (C2, R2, RC2)—in each group, we have ten distinct test casesFootnote 7. Both LS and MA algorithms were supplied with the same initial population composed of 10 solutions generated beforehand by the RM component. The collected \(K\), \(T\), and \(\tau \) metrics for this experiment are shown in Fig. 8. The distribution of \(K\) and \(T\) for both algorithms is similar, and the Wilcoxon signed-rank test confirms that there are no significant differences in these two metrics (the p-value amounts to 0.12).

Fig. 8

The results (\(K\), \(T\) [in km], and \(\tau \) [in seconds]) obtained for LS and MA algorithms solving the Li and Lim benchmarks. The values were averaged over 10 runs, and each point corresponds to a single test instance

Fig. 9

The results (\(K\) and \(T\) [in km]) for all Li and Lim’s groups. The values for the predefined solutions (Init.), LS and MA were averaged over the best and worst solutions obtained for each test instance (within 10 executions). The WB values were averaged over the world’s best known solutions

Fig. 10

The results of the profitability analysis for the LS and MA algorithms, with \(c_{K}= 1\) (upper charts) or \(c_{K}= 500\) (lower charts). The value of \(c_{T}= 1\) for all charts. We use black (C1), green (C2), red (R1), cyan (R2), blue (RC1), and magenta (RC2) to denote the different Li and Lim’s groups

Table 7 The ranking values obtained using LS and MA for all secondary metrics, averaged across all Li and Lim’s instances
Table 8 Results of correlation analysis for Li and Lim’s benchmarks, expressed as percentages of all analyzed cases within each group

The validation framework enables us to aggregate the artifacts at various levels—we render the comparison of \(K\)’s and \(T\)’s in Fig. 9 for each Li and Lim’s group separately. It also shows the best known results available in the literature (WB)—here, it is important to note that the WB schedules were obtained using various algorithms (i.e., not a single algorithm), hence they should be considered the current (known) quality upper bound of \(K\) and \(T\). Both LS and MA provide similar \(K\)’s, regardless of the tests’ characteristics. On the other hand, the results are more diverse in \(T\), especially for MA and C2 (compared with C1), R2 (vs. R1), and RC2 (vs. RC1), which indicates that some algorithms are more efficient in tackling specific VRP instances and converge much better in such situations. Hence, knowing their characteristics beforehand may ultimately help us select the best-suited optimization technique in either a fully or semi-automated way, perhaps exploiting additional expert knowledge concerning, e.g., other features of the transportation ecosystem.

Fig. 11

The box plots presenting the distributions of our similarity metrics a) \(\mathcal {K}_1\), b) \(\mathcal {K}_2\), and c) \(\mathcal {K}_3\) for all Li and Lim’s groups. We confront the solutions reported by LS and MA with the initial routing schedules (LS and MA boxes), and quantify the distance between the solutions elaborated using the same algorithm (for ten separate runs) for each test instance (Init., LS/LS, and MA/MA). The latter boxes (Init., LS/LS, MA/MA) may help understand the convergence abilities of the corresponding algorithm

Fig. 12

Histogram of the number of requests (n) per hour (h) in the DRT-based benchmark. We assume the middle of the TW to determine the hour of the request

In Fig. 10, we present the results of the transportation system profitability analysis for the Li and Lim’s benchmarks. We evaluate the profitability function \(\mathcal {P}= c_{\text{ p }}\cdot \lambda - \theta \), where \(c_{\text{ p }}\) is the value introduced by a single passenger (e.g., the price of a ticket). Note that for unit values of \(c_{K}\) and \(c_{T}\), we always get a positive value of \(\mathcal {P}\) (denoting the obtained profits). For larger \(c_{K}\)’s, the transport becomes profitable, regardless of the benchmark type, only when \(c_{\text{ p }}\ge 8\). Besides its practical significance, the profitability analysis also allows us to differentiate between various problem types, forming clusters of similar \(\mathcal {P}\) values for related problem instances. Finally, Table 7 gathers the ranking values obtained for all secondary metrics, averaged over all problem instances in the Li and Lim’s groups. This analysis helps us select the algorithm according to the most important optimization criteria—as an example, if we consider \(\zeta \), \(\eta \), or \(\pi \) to be the more important factors of our ITS, we should pick MA as our algorithm of choice, because it leads to better solutions in these quality metrics. Note that in this example we confront LS and MA (two example solvers)—we do not include WB here because, as mentioned earlier, WB is actually a collection of various algorithms, not a single optimization technique.
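A hedged sketch of the profitability evaluation is given below. We assume here (this is our reading, not a formula stated explicitly in this section) that \(\theta \) is the operating cost \(\theta = c_{K}\cdot K + c_{T}\cdot T\), with \(K\) the number of vehicles, \(T\) the total distance, and \(\lambda \) the number of served passengers; all numeric values in the example are hypothetical.

```python
def profitability(c_p, n_passengers, K, T, c_K=1.0, c_T=1.0):
    """P = c_p * lambda - theta, assuming theta = c_K * K + c_T * T
    (assumed cost model, see the lead-in)."""
    theta = c_K * K + c_T * T
    return c_p * n_passengers - theta

# With unit costs, a schedule is profitable whenever revenue exceeds K + T:
p_unit = profitability(1.0, 200, 4, 100.0)            # positive
# With a large per-vehicle cost, the same schedule becomes unprofitable:
p_heavy = profitability(1.0, 200, 4, 100.0, c_K=500.0)  # negative
```

Sweeping \(c_{\text{ p }}\) under a fixed cost model is what produces the break-even thresholds visible in Fig. 10.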

The results obtained for the Li and Lim’s benchmarks allow us to observe the significance of the proposed secondary metrics. Firstly, taking \(\varepsilon \), \(\delta \), and \(\rho \) as measures of the passengers’ satisfaction, we note that even for a greater \(K\), we could see large improvements in these metrics (e.g., for the lc1_2_10 benchmark we obtained reductions of 3300 s, 1075 m, and 33 s in \(\varepsilon \), \(\delta \), and \(\rho \), respectively). Secondly, given two solutions with (\(K_1\), \(T_1\)) and (\(K_2\), \(T_2\)), such that \(K_1 = K_2\) and \(T_1 \approx T_2\), we often observe large differences in \(\varepsilon \), \(\delta \), and \(\rho \). For example, for the lc2_2_2 benchmark, a difference in \(T\) of approximately 3 m corresponded to a reduction of over 8200 s in \(\varepsilon \) and of over 3700 m in \(\delta \). Similar observations also apply to the other groups, e.g., for the lr2_2_9 benchmark an increase in \(T\) by less than 0.5 m corresponds to a reduction of 1170 m in \(\delta \).

To get a better view of the correlation between the primary and secondary metrics, we analyzed how a change in \(T\) (for the same value of \(K\)) is reflected in the changes of \(\rho \), \(\delta \), and \(\varepsilon \), i.e., the proposed quality metrics that may reflect the passengers’ satisfaction. The results, collected in Table 8, show that directly proportional changes of the secondary metrics with respect to changes in \(T\) constitute 44.7%–71.8% (55.7% on average) of the cases for LS, and 46.5%–74.2% (58.8% on average) of the cases for MA. Note also that for LS, the share of directly proportional changes is always greater in the C2, R2, and RC2 groups than in the other ones. We observe a similar trend for MA, except for the RC groups. Finally, we note that a certain portion of the analyzed cases, both for LS and MA, turned out to be indifferent to the changes in \(T\), as indicated in the columns marked with ‘−’ in Table 8.
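This sign-concordance analysis can be sketched as follows (the delta values below are hypothetical, and the exact tie-handling in Table 8 may differ): for pairs of solutions with equal \(K\), we count how often a change in \(T\) is mirrored by a change of the same sign in a secondary metric, and how often the metric does not change at all.

```python
def concordance(delta_t, delta_m):
    """Fractions of (delta T, delta metric) pairs that are directly
    proportional (same sign) and indifferent (metric unchanged,
    the '-' column of Table 8)."""
    n = len(delta_t)
    proportional = sum(1 for dt, dm in zip(delta_t, delta_m) if dt * dm > 0)
    indifferent = sum(1 for dm in delta_m if dm == 0)
    return proportional / n, indifferent / n

# Hypothetical paired changes in T and in a secondary metric:
prop, indiff = concordance([1.0, -2.0, 3.0, 0.5], [5.0, -1.0, -2.0, 0.0])
```

Repeating this count per benchmark group yields the per-group percentages reported in Table 8.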

Fig. 13

Visualization of the best solution obtained for the DRT-based benchmark

In Fig. 11, we gather the values obtained using our similarity metrics \(\mathcal {K}_1\)–\(\mathcal {K}_3\) over all Li and Lim’s groups. These distributions show, on the one hand, the convergence capabilities of the algorithms for test instances of specific characteristics (Init., LS/LS, and MA/MA). On the other hand, calculating the similarities for pairs of solutions obtained using different techniques (LS and MA vs. Init., and LS/MA) objectively indicates how different the solutions elaborated for the corresponding test cases are. We can appreciate that the \(\mathcal {K}\)’s are consistently largest for LS/LS, MA/MA, and LS/MA, which means that the optimization leads to similar solutions for all groups. Interestingly, the \(\mathcal {K}\)’s increase for LS and MA when compared to the initial population, which indicates that the optimization leads to a known part of the solution space, and the elaborated schedules seem to be not far from some initial solutions. To further verify the abilities of the proposed similarity metrics, we ran the classic random walk and labeled random walk graph kernels in this scenario [75]. Not only were they unable to capture the graph similarities (consistently reporting 1.0, hence perfect similarity), but they also ran much slower—it took 36.20 s on average to calculate the kernels for all solutions and all variants in Fig. 11 in the case of the vanilla random walk (the labeled random walks tended to time out here, running for more than 60 min for each pair of schedules), whereas it was only 0.11 s, 0.23 s, and 0.16 s on average for our metrics \(\mathcal {K}_1\)–\(\mathcal {K}_3\). The classic kernels suffer from high time complexity, with the labeled version also being very memory-intensive. Furthermore, these kernels do not consider the characteristics of the solution graphs, such as the fact that the walks should start and finish only at the depot, or that they are finite. Finally, the labeled kernel fails to report zero similarity for a pair of graphs having no common labels.
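The exact definitions of \(\mathcal {K}_1\)–\(\mathcal {K}_3\) are given earlier in the paper; as a simplified, hedged stand-in, the snippet below scores two routing schedules by the Jaccard index of their directed edges. Each route is a finite stop sequence that starts and finishes at the depot (id 0 here), so the desired structural properties hold by construction, and—unlike the labeled kernel—the score is zero for schedules sharing no edges.

```python
def edges(schedule):
    """Directed edges of a schedule given as a list of routes,
    each route being a sequence of stop ids (depot = 0 at both ends)."""
    return {(r[i], r[i + 1]) for r in schedule for i in range(len(r) - 1)}

def edge_similarity(s1, s2):
    """Jaccard index of the directed-edge sets of two schedules."""
    e1, e2 = edges(s1), edges(s2)
    if not e1 and not e2:
        return 1.0  # two empty schedules are identical
    return len(e1 & e2) / len(e1 | e2)

identical = edge_similarity([[0, 1, 2, 0]], [[0, 1, 2, 0]])  # 1.0
disjoint = edge_similarity([[0, 1, 2, 0]], [[0, 3, 4, 0]])   # 0.0
```

Such set-based metrics run in time linear in the number of edges, which is consistent with the large speed gap observed against the random walk kernels.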

3.4 Experiment 3: real-life use case

To further prove the practical utility of our system, we collected sample data from an existing Demand Responsive Transport (DRT) system operated by Blees in the northwestern part of Poland. To this end, we constructed a problem instance with only two vehicles, each able to carry 8 passengers at a time, which were to handle a set of 60 requests (30 pickup-delivery pairs) spread over a whole working day starting at 7 a.m. and ending at 7 p.m. The service time at each stop was 60 seconds, and the time windows spanned from 2 to 8 minutes. The hourly distribution of the number of requests is shown in Fig. 12. It can be observed that the peak hours occurred at 10 a.m. and 4 p.m., with the least popular hours being 9 a.m., 11 a.m., and 3 p.m.
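The DRT-based instance described above can be sketched as a plain data structure; the stop identifiers are hypothetical, while the remaining constants come from the text (2 vehicles, capacity 8, 30 pickup-delivery pairs, 60 s service time, 2–8 min time windows, a 7 a.m. to 7 p.m. day).

```python
from dataclasses import dataclass

@dataclass
class Request:
    pickup: int            # pickup stop id (hypothetical)
    delivery: int          # delivery stop id (hypothetical)
    tw_start: int          # seconds since 7 a.m.
    tw_end: int            # tw_end - tw_start in [120, 480] seconds
    service_time: int = 60  # seconds spent at each stop

N_VEHICLES = 2
CAPACITY = 8               # passengers per vehicle
N_REQUESTS = 30            # pickup-delivery pairs (60 stops in total)
DAY_LENGTH = 12 * 3600     # 7 a.m. to 7 p.m., in seconds

# Example request with a 5-minute time window starting at 10 a.m.:
example = Request(pickup=1, delivery=2, tw_start=3 * 3600, tw_end=3 * 3600 + 300)
```

Encoding the instance this way makes it straightforward to feed into any PDPTW solver wrapped by the framework.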

Given this problem instance, we were able to solve and evaluate the obtained solutions using the proposed framework. Some of the most important findings are as follows:

  1. The ISG module produced 10 unique solutions, with the value of the \(T\) metric ranging from 323183.4 m to 366984.9 m, while in all the remaining modules, each run produced the same solution.

  2. The final best solution (shown in Fig. 13), obtained using the DM module (the solution was not improved by the MA module), yielded a \(T\) metric equal to 313282.1 m, which is an improvement of around 3.1% with respect to the best solution produced by the ISG module.

  3. To satisfy all the requests within their time windows, two vehicles were needed. This is consistent with the outcomes recorded by the DRT system, operating a single vehicle, in which a number of delayed pickups/deliveries were observed.
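The relative improvement reported in the second finding can be checked directly from the \(T\) values given above (in metres):

```python
# T of the best ISG solution and of the final DM solution (in metres)
best_isg = 323183.4
final_dm = 313282.1

# relative improvement in percent, ~3.06%, i.e., "around 3.1%"
improvement = (best_isg - final_dm) / best_isg * 100
```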

4 Conclusions and future work

Verification and validation of existing and emerging algorithms for tackling rich VRPs in ITSes trigger various practical challenges. We make a step toward establishing a widely adopted way of thoroughly examining VRP solvers, and propose an approach that allows us not only to tackle any rich VRP, but also to generate benchmark tests that are proven to be solvable. We introduced novel graph similarity metrics and showed that they can be effectively utilized to quantify the distance across the obtained solutions, hence to understand the convergence capabilities of the investigated algorithms. Finally, our secondary quality metrics, which couple the standard measures commonly used to assess routing schedules, highlight the practical characteristics of the solutions and allow us to perform real-life investigations, such as the profitability analysis of the ITS. To experimentally show the abilities of the proposed techniques, we focused on the pickup and delivery problem with time windows and executed a thorough experimental study, effectively split into three experiments. In the first experiment, we investigated the functional abilities of the entire validation pipeline proposed in this paper, and generated real-life PDPTW test instances in the Gliwice area, Poland, which were later solved using selected algorithms from the literature. Afterwards, in the second experiment, we showed how to objectively compare the algorithms over well-established benchmark test instances in an unbiased and fair way. Finally, the third experiment focused on proving the practical utility of our system by deploying it over the sample data collected from an existing Demand Responsive Transport system currently operating in the northwestern part of Poland.
Although our framework can be easily adjusted to other VRPs, we showed that the artifacts generated at the different processing steps of our validation chain enable practitioners to perform quantitative, qualitative, and statistical analyses of the algorithms in the context of constrained pickup and delivery problems. We believe that the proposed validation process is a comprehensive comparison protocol that could help us perform reproducible, fully quantifiable, and traceable research in the area of smart delivery systems and ITSes, and that it will ultimately make following good research practices easier and more straightforward. Effectively, we believe that our approach will be an important step toward combating the reproducibility crisis in artificial intelligence research.

The research reported in this manuscript constitutes an interesting point of departure for future developments. Solving rich VRPs is an extremely vital field, and a number of efficient algorithms for tackling various formulations of such transportation problems emerge monthly—it would certainly be interesting to experimentally prove the flexibility of the proposed validation framework over a variety of intelligent algorithms and variants of VRPs. Additionally, as selecting appropriate parameter values in transportation solvers may play a key role in ensuring their high-quality operation, incorporating this step into our benchmarking approach for the best-performing state-of-the-art algorithms could help enhance their abilities even further. This would help us solidify the design of the framework through testing the interfaces between its components, as such components would correspond to specific scenarios (i.e., the algorithm, problem formulation, and test data tuples). Also, parallel and distributed algorithms have been blooming in the field, leading to the world’s best routing schedules in, e.g., VRPTW and PDPTW, as reported on the SINTEF website for the widely-used benchmarksFootnote 8. Investigating such techniques in our framework would be an exciting research pathway that could indeed shed more light on their practical aspects, as deploying them in hardware-constrained environments (e.g., without dozens of parallel processors that can effectively cooperate) may become an obstacle to their fast adoption in industry. We believe that deploying our framework in practical transportation scenarios can help practitioners make more informed and robust decisions concerning the design of their intelligent transportation systems.