Introduction

Metaheuristics are widely used for optimization in hydrology (Jahandideh-Tehrani et al. 2020; Maier et al. 2014), especially for conceptual catchment runoff models. Among the various kinds of metaheuristics, Particle Swarm Optimization (PSO) (Kennedy and Eberhart 1995) and Differential Evolution (DE) (Storn and Price 1997) are two landmark examples of Swarm Intelligence and Evolutionary Algorithms, respectively (Boussaid et al. 2013). Both were proposed in the mid-1990s and gained widespread popularity in hydrological applications (Jahandideh-Tehrani et al. 2020; Okkan and Kirdemir 2020b; Maier et al. 2014; Kisi et al. 2010; Xu et al. 2022). DE also turned out to be a stepping stone for the development of the Markov Chain Monte Carlo-based Differential Evolution Adaptive Metropolis (DREAM) approach (Vrugt et al. 2009). Both DE and PSO are popular and widely considered effective, and they have frequently been hybridized into a single algorithm (Xin et al. 2012; Parouha and Verma 2022). Such PSO-DE hybrids have been used in water-related applications like optimal localization of hydrocarbon reservoir wells (Nwankwor et al. 2013), design of water distribution systems in large cities (Sedki and Ouazar 2012), design of hydraulic structures (Singh and Duggal 2015), or suspended sediment load estimation (Mohammadi et al. 2021).

Because the performance of various optimization methods may be highly uneven for a particular application, one may find numerous large-scale comparisons among optimization algorithms in the literature (Kazikova et al. 2021; Tharwat and Schenck 2021; Ezugwu et al. 2020; Price et al. 2019; Bujok et al. 2019; Piotrowski et al. 2017a), including guidelines on how to organize such comparisons (Swan et al. 2022; LaTorre et al. 2021). One may also find many comparison studies in which various kinds of metaheuristics are applied to different types of catchment runoff models. Among papers published during the last few years, Jahandideh-Tehrani et al. (2021) compared PSO against the Genetic Algorithm, Adnan et al. (2021) tested PSO against the Grey Wolf optimizer, Tikhamarine et al. (2020) compared PSO against the Harris Hawks optimizer, Okkan and Kirdemir (2020a) proposed a hybrid PSO algorithm and compared it against five metaheuristics (the basic PSO, the basic DE, the Genetic Algorithm, the Invasive Weed Algorithm, and the Artificial Bee Colony method), Hong et al. (2018) compared DE against the Genetic Algorithm, Piotrowski et al. (2017b) compared a large field of 26 diverse metaheuristics, and Tigkas et al. (2016) compared Shuffled Complex Evolution, the Genetic Algorithm and Evolutionary Annealing. Good reviews of older studies may be found in Meier et al. (2019) and Reddy and Kumar (2020).

There are, however, two main problems with the application of DE and PSO algorithms in hydrology. First, despite the popularity of both PSO and DE in water-related studies, no paper has directly compared various variants from the PSO and DE families of methods for catchment runoff modelling. Second, plenty of DE and PSO algorithms have appeared in the recent decade (Das et al. 2016; Bonyadi and Michalewicz 2017; Bilal et al. 2020; Shami et al. 2022), and many of them perform much better than the basic DE and PSO versions (e.g. Tanabe and Fukunaga 2014; Piotrowski et al. 2017a; Bujok et al. 2019). However, many hydrological applications use only the simplest, over 20-year-old versions of either DE or PSO. As a result, one cannot find out which kind of algorithm is de facto more efficient in solving hydrological problems, especially in the calibration of rainfall-runoff models.

In the present paper, we aim at a detailed and thorough comparison of DE versus PSO algorithms applied to the calibration of rainfall-runoff models. One may also find plenty of other Evolutionary Algorithms applied to this task (Cantoni et al. 2022; Okkan and Kirdemir 2020a, b; Kumar et al. 2019; Dakhlaoui et al. 2012; Gan and Biftu 1996), but the present study is restricted to a comparison solely between DE and PSO variants. Instead of using historical versions of DE and PSO, we test relatively recently proposed variants that may currently be considered the state of the art. For comparison purposes, we have selected five DE and five PSO variants that were proposed between 2012 and 2022. These ten algorithms are applied to the calibration of two conceptual rainfall-runoff models, namely HBV (Hydrologiska Byråns Vattenavdelning model; Bergström 1976; Lindström 1997) and GR4J (modèle du Génie Rural à 4 paramètres Journalier; Perrin et al. 2003). The research is performed on the Kamienna catchment, located in the central part of Poland. We mainly focus on the relative performance of DE and PSO algorithms in the calibration of hydrological models, as we wish to find out which family of methods performs better for this task. A direct comparison between the two hydrological models is considered of secondary importance in this paper.

Methodology and materials

Rainfall-runoff models

We consider two lumped conceptual catchment runoff models that are built of interconnected reservoirs, with mathematical transfer functions describing the transfer of water between the reservoirs and into the river.

HBV

The HBV model with a snow routine, proposed by Bergström and Forsman (1973), has been used in dozens of countries around the world. In the majority of these applications, modified versions of the original HBV model have been used (Bergström 1976; Bergström and Lindström 2015). A block diagram of the particular version of the HBV model applied in this paper is shown in Fig. 1. A detailed description of the HBV model components for the version adopted in this paper is given in Piotrowski et al. (2017b).

Fig. 1 Structure, conceptual storages and parameters of the HBV model

The input variables of the model are daily precipitation, average daily air temperature, and daily potential evapotranspiration (PET). Precipitation can take the form of rain, snow, or a mixture of snow and rain, which is described using the threshold temperature (TT) and the temperature interval (TTI). At temperatures below the lower limit (TT − 0.5·TTI) only snow occurs, and at temperatures above the upper limit (TT + 0.5·TTI) only rain falls. Between these limits, precipitation is a mixture of rain and snow, with the snow fraction decreasing linearly from 100% at the lower limit to 0% at the upper limit.
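The rain–snow partition described above can be written compactly. The following is a minimal sketch, with the function and variable names being our own illustration:

```python
def partition_precipitation(precip, temp, tt, tti):
    """Split daily precipitation into snow and rain.

    A sketch of the threshold-temperature scheme described above:
    pure snow below TT - 0.5*TTI, pure rain above TT + 0.5*TTI,
    and a linear mixture in between.
    """
    lower = tt - 0.5 * tti
    upper = tt + 0.5 * tti
    if temp <= lower:
        snow_fraction = 1.0
    elif temp >= upper:
        snow_fraction = 0.0
    else:
        # linear decrease from 100% snow at the lower limit to 0% at the upper one
        snow_fraction = (upper - temp) / (upper - lower)
    snow = snow_fraction * precip
    return snow, precip - snow
```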

The HBV model used here has five state variables, representing the storage of snow, melt water, soil moisture, fast runoff, and baseflow, and 13 parameters, defined in Table 1, whose values are determined using the selected optimization procedures. The parameters are grouped into four categories: (1) snow process parameters (TT, TTI, CFMAX, CFR and WHC); (2) soil moisture parameters (FC, LP, β); (3) rapid runoff process parameters (KF, α); and (4) slow runoff (baseflow) parameters (PERC, KS) (Fig. 1).

Table 1 Parameters of the HBV model

GR4J

The original GR4J conceptual model is a daily lumped four-parameter catchment runoff model that accounts for changes in soil moisture and can be used for temperatures above zero (Perrin et al. 2003). Since our study concerns a catchment located in Polish climatic conditions, where snow plays an important role, the original model is extended by adding a snow module (Fig. 2). However, the original name GR4J is retained in this paper. This extended version has seven parameters, three of which (TT, TTI, CFMAX) relate to the snow routine. All GR4J parameters are listed in Table 2 with a brief description. A detailed description of the GR4J model can be found in Perrin et al. (2003).

Fig. 2 Structure, conceptual storages and parameters of the GR4J model

Table 2 Parameters of the GR4J model

The input variables to the GR4J model are the same as those of the HBV model. Similarly to the HBV model, precipitation may take the form of rainfall, snowfall, or a mixture of the two. Snowmelt is assumed to be directly proportional to the temperature excess over the threshold and is computed by means of the degree-day method.
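The degree-day computation admits a very short sketch; the signature below is our own illustration, with CFMAX (the melt factor, in mm per °C per day) and TT taken from the snow-routine parameters listed in Table 2:

```python
def degree_day_melt(temp, snowpack, cfmax, tt):
    """Daily snowmelt proportional to the temperature excess over TT
    (degree-day method), capped by the available snowpack."""
    potential_melt = cfmax * max(temp - tt, 0.0)
    return min(potential_melt, snowpack)
```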

Optimization algorithms

This paper focuses on a direct comparison between two families of optimization algorithms, Particle Swarm Optimization (PSO) and Differential Evolution (DE), for conceptual rainfall-runoff model calibration. After a quarter century of research, hundreds of DE and PSO variants can be found in the literature (Das et al. 2016; Bonyadi and Michalewicz 2017). Variants of DE and PSO may differ greatly from each other and are often much more complicated than the basic versions of these algorithms. In this study, we assess modern DE and PSO variants, not their historical, simple versions. Five relatively recently proposed DE variants and five PSO variants were selected for the calibration of the HBV and GR4J models. In the brief introduction below, we define only the classical simple variants, outline the main differences between DE and PSO, and give a brief guide to more advanced DE and PSO algorithms. For a detailed description of the DE and PSO variants being compared, readers are referred to the source papers.

Differential evolution and its variants

The classical Differential Evolution algorithm (Storn and Price 1997) defines the movement of a population of NP individuals (solution vectors) in a D-dimensional decision space, where D is the number of parameters to be optimized, in a search for the global optimum. In generation g = 0, the NP individuals \({\mathbf{x}}_{i,g} = \left\{ x_{i,g}^{1}, \ldots, x_{i,g}^{D} \right\}\), i = 1,…,NP, are initialized at random according to the uniform distribution:

$$x_{i,0}^{j} = L^{j} + \mathrm{rand}_{i}^{j}\left(0,1\right) \cdot \left( U^{j} - L^{j} \right), \qquad j = 1, \ldots, D; \quad i = 1, \ldots, \mathrm{NP}.$$
(1)

Here, \(\mathrm{rand}_{i}^{j}(0,1)\) is a random value within the (0,1) interval, generated separately for each j-th element of the i-th individual. Lj and Uj are the lower and upper bounds that define the subset \(\prod_{j = 1}^{D} \left[ L^{j}, U^{j} \right]\) of the search space RD. After initialization of the population of solutions, in each generation g every individual makes a move across the search space following three operations: mutation, crossover and selection. In the basic DE, the mutation is defined as:

$${\mathbf{v}}_{i,g} = {\mathbf{x}}_{r1,g} + F \cdot \left( {{\mathbf{x}}_{r2,g} - {\mathbf{x}}_{r3,g} } \right),$$
(2)

and is followed by the crossover:

$$u_{i,g}^{j} = \begin{cases} v_{i,g}^{j} & \text{if}\quad \mathrm{rand}_{i}^{j}\left(0,1\right) \le \mathrm{CR} \quad \text{or}\quad j = j_{\mathrm{rand},i} \\ x_{i,g}^{j} & \text{otherwise} \end{cases}$$
(3)

In Eq. (2), r1, r2 and r3 are three mutually different integers \(\left( r1 \ne r2 \ne r3 \ne i \right)\) that are randomly chosen from the range [1, NP]. In Eq. (3), jrand,i is another integer, randomly selected within [1, D]. Note that the basic DE variant has three control parameters, NP, CR and F, which need to be defined by the user.

Because the search space is often bounded (i.e. the values of model parameters to be calibrated are restricted within some range), some verification is needed after the crossover to check whether the new solution ui,g is within the bounds (Kononova et al. 2021). If ui,g turns out to be outside the bounds, it has to be forced back into the search domain, e.g. by using one of the methods discussed in Helwig et al. (2013) and Kadavy et al. (2022).
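Two of the simplest repair strategies, clamping to the violated bound and reflecting back into the domain, can be sketched as follows; this illustrates the general idea only and is not necessarily the repair method used by the compared variants.

```python
import numpy as np

def clamp(u, lower, upper):
    """Project every out-of-bounds coordinate onto the nearest bound."""
    return np.clip(u, lower, upper)

def reflect(u, lower, upper):
    """Mirror out-of-bounds coordinates back into [lower, upper]."""
    u = np.where(u < lower, 2.0 * lower - u, u)
    u = np.where(u > upper, 2.0 * upper - u, u)
    # a final clip guards against reflections that overshoot the opposite bound
    return np.clip(u, lower, upper)
```

After the repair, the objective function is evaluated for the (now bounded) solution ui,g, yielding its goodness of fit f(ui,g), which represents the quality of the solution. Finally, a selection operation is performed so that only the better of xi,g and ui,g enters the next generation: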

$$\mathbf{x}_{i,g + 1} = \begin{cases} \mathbf{u}_{i,g} & \text{if}\quad f\left( \mathbf{u}_{i,g} \right) \le f\left( \mathbf{x}_{i,g} \right) \\ \mathbf{x}_{i,g} & \text{otherwise} \end{cases}$$
(4)

After repeating the above procedures for each individual in the population, the NP individuals proceed to generation g + 1. The algorithm repeats the same steps in subsequent generations until a stopping condition is reached. In the present study, the stopping condition is a maximum number of objective function calls, set to 20,000.
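To make Eqs. (1)–(4) concrete, the following is a minimal sketch of the classical DE loop under a budget of 20,000 function calls. It illustrates only the basic 1997 algorithm, not the five advanced variants compared in this study; the objective function f, the bounds, the clamping bound repair, and the control parameter values (NP = 50, F = 0.5, CR = 0.9) are illustrative assumptions.

```python
import numpy as np

def basic_de(f, lower, upper, pop_size=50, F=0.5, CR=0.9, max_calls=20_000):
    """Classical DE (Storn and Price 1997): mutation (Eq. 2),
    binomial crossover (Eq. 3) and greedy selection (Eq. 4)."""
    rng = np.random.default_rng()
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = len(lower)
    # Eq. (1): uniform random initialization within the bounds
    x = lower + rng.random((pop_size, D)) * (upper - lower)
    f_x = np.array([f(xi) for xi in x])
    calls = pop_size
    while calls + pop_size <= max_calls:
        for i in range(pop_size):
            # Eq. (2): three mutually distinct indices, all different from i
            r1, r2, r3 = rng.choice([k for k in range(pop_size) if k != i],
                                    size=3, replace=False)
            v = x[r1] + F * (x[r2] - x[r3])
            # Eq. (3): binomial crossover with one guaranteed mutant coordinate
            j_rand = rng.integers(D)
            mask = rng.random(D) <= CR
            mask[j_rand] = True
            u = np.where(mask, v, x[i])
            u = np.clip(u, lower, upper)      # simple bound repair (clamping)
            f_u = f(u)
            calls += 1
            # Eq. (4): the trial solution replaces x_i only if not inferior
            if f_u <= f_x[i]:
                x[i], f_x[i] = u, f_u
    best = int(np.argmin(f_x))
    return x[best], f_x[best]
```

The five DE variants compared in this study depart from this loop mainly in the ways outlined below: adaptive F and CR, a variable population size, and multiple search or crossover strategies.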

The majority of modern DE variants are much more complicated than the basic version from 1997 defined above (e.g. see Mohamed et al. 2021). A detailed review of DE variants may be found in Das et al. (2016), Al-Dabbagh et al. (2018) or Opara and Arabas (2018). Modern variants often adaptively modify the control parameters F and CR (Brest et al. 2006; Tanabe and Fukunaga 2014; Zuo et al. 2021; Ghosh et al. 2022), use a variable population size NP (Tanabe and Fukunaga 2014; Piotrowski 2017; Polakova et al. 2019), implement multiple search strategies (Qin et al. 2009; Wu et al. 2018; Yi et al. 2022) or crossover strategies (Zaharie 2009; Islam et al. 2012; Wang et al. 2022), or introduce new procedures or operators proposed for other metaheuristics (Piotrowski 2018; Caraffini and Neri 2019; Cai et al. 2020). DE is also sometimes hybridized with other metaheuristics (Gong et al. 2010; Xin et al. 2012; Awad et al. 2017). In the present study, we compare the five advanced DE variants defined in Table 3. A detailed description of these algorithms may be found in the source papers. The control parameters of the algorithms are the same as suggested in the source papers, and the population sizes used are given in Table 3.

Table 3 Compared variants of differential evolution and particle swarm optimization algorithms

Particle swarm optimization variants

Particle Swarm Optimization (Kennedy and Eberhart 1995) is a very popular stochastic population-based algorithm, inspired by the behavior of swarms of animals. In PSO, the solutions (called particles) move across the D-dimensional search space all the time, but remember the best location they have visited so far. As in DE, the initial positions xi,0 of the NP PSO particles (i = 1,…, NP) are usually generated randomly within the bounds of the search space (Eq. (1)). However, in PSO each particle has an associated velocity vector. Depending on the specific PSO variant, the initial velocity vi,0 of each particle is either set to 0 or generated from some pre-specified interval, which frequently depends on the difference between the upper and lower bounds of the search space. The fitness value f(xi,0) is evaluated for each newly generated particle. Then, in each generation g, the particles move through the search space according to the following equations:

$$\begin{aligned} v_{i,g + 1}^{j} &= w \cdot v_{i,g}^{j} + c_{1} \cdot \mathrm{rand1}_{i,g}^{j}\left(0,1\right) \cdot \left( \mathrm{pbest}_{i,g}^{j} - x_{i,g}^{j} \right) + c_{2} \cdot \mathrm{rand2}_{i,g}^{j}\left(0,1\right) \cdot \left( \mathrm{gbest}_{g}^{j} - x_{i,g}^{j} \right), \\ x_{i,g + 1}^{j} &= x_{i,g}^{j} + v_{i,g + 1}^{j}, \end{aligned}$$
(5)

where j = 1,…,D, and pbesti,g and gbestg are the best position visited during the search by the i-th particle and the best position visited by any particle in the swarm, respectively. rand1i,gj(0,1) and rand2i,gj(0,1) are two random numbers generated at each generation from the [0,1] interval, separately for each i and j index, and c1 and c2 are acceleration coefficients (algorithm parameters to be set by the user). As may be seen, for each i-th particle three vectors are remembered: its current position xi,g, the best position pbesti,g visited by the i-th particle since the initialization of the search, and the i-th particle's current velocity vi,g. The parameter w is the so-called inertia weight, first introduced by Shi and Eberhart (1998). As in the case of DE, modern PSO variants are often much more complicated than the initial version; for a survey, readers are referred to Bonyadi and Michalewicz (2017), Cheng et al. (2018), and Shami et al. (2022). Modern PSO variants use different topologies, by which we mean the communication structure between individuals (Lynn et al. 2018; Xia et al. 2020; Li et al. 2022), theoretically or empirically modify the values of control parameters (Clerc and Kennedy 2002; Harrison et al. 2018; Piotrowski et al. 2020; Cleghorn and Stapleberg 2022; Meng et al. 2022), introduce novel equations for the movement of particles (Santos et al. 2020; Li et al. 2021; Houssein et al. 2021), bring together ensembles of different PSO variants (Lynn and Suganthan 2017; Wu et al. 2019; Liu and Nishi 2022), or hybridize PSO with other algorithms (Aydilek 2018; Sengupta et al. 2019; Xu et al. 2019; Dziwinski and Bartczuk 2020). The details of the five specific PSO variants used in the current study are given in the second half of Table 3.
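For symmetry with the DE sketch above, here is a minimal sketch of the classical inertia-weight PSO of Eq. (5). Again, it illustrates the basic algorithm rather than the five variants of Table 3; the zero initial velocities, the clamping bound repair, and the parameter values (w = 0.729, c1 = c2 = 1.494, a combination frequently used in the literature) are our assumptions.

```python
import numpy as np

def basic_pso(f, lower, upper, pop_size=50, w=0.729, c1=1.494, c2=1.494,
              max_calls=20_000):
    """Classical inertia-weight PSO, Eq. (5). A sketch only."""
    rng = np.random.default_rng()
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = len(lower)
    x = lower + rng.random((pop_size, D)) * (upper - lower)   # as in Eq. (1)
    v = np.zeros((pop_size, D))     # zero initial velocities (one common option)
    f_x = np.array([f(xi) for xi in x])
    pbest, f_pbest = x.copy(), f_x.copy()
    best = int(np.argmin(f_pbest))  # index of the swarm-best position (gbest)
    calls = pop_size
    while calls + pop_size <= max_calls:
        for i in range(pop_size):
            r1, r2 = rng.random(D), rng.random(D)
            # Eq. (5): inertia, cognitive and social components
            v[i] = (w * v[i]
                    + c1 * r1 * (pbest[i] - x[i])
                    + c2 * r2 * (pbest[best] - x[i]))
            x[i] = np.clip(x[i] + v[i], lower, upper)  # simple bound repair
            f_x[i] = f(x[i])
            calls += 1
            # unlike DE, the particle keeps its new position however poor it is;
            # only its memory (pbest) is updated conditionally
            if f_x[i] < f_pbest[i]:
                pbest[i], f_pbest[i] = x[i].copy(), f_x[i]
                if f_pbest[i] < f_pbest[best]:
                    best = i
    return pbest[best], f_pbest[best]
```

The five PSO variants of Table 3 modify mainly the topology (i.e. which particles contribute to the social term) and the handling of the control parameters, as outlined above.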

Major differences between DE and PSO

Technically, PSO is a major algorithm within the Swarm Intelligence family, which is based on the communal behavior of animals, and DE is a form of Evolutionary Computation, which is based on the evolutionary principles of life. However, such inspiration-focused differences are irrelevant from an optimization point of view (Tzanetos and Dounias 2021; Molina et al. 2020; Sorensen 2015). What is important is that, although both types of algorithms are population-based metaheuristics (Del Ser et al. 2019; Boussaid et al. 2013), they perform the search in very different ways.

First of all, in DE each individual evaluates a new solution in every generation, but moves to the new location only if it is not inferior to the solution at which the individual was located at the beginning of the generation. This means that the DE population may test new locations, but stays in the former ones until some promising region of the search space is found. Because the probability of visiting a particular part of the search space is a function of the current location of individuals in the DE population, such a lack of movement may lead to stagnation (Weber et al. 2009) and hamper proofs of convergence (Hu et al. 2016; Opara and Arabas 2018). However, this feature ensures that each individual is always located in the best place it has visited so far, and the location of the whole population is a kind of space-based memory of high-quality solutions. On the contrary, a PSO particle moves in each generation and stays in the new location, irrespective of how poor it is. As a result, the particle requires additional memory in which the best solution it has visited so far is stored. PSO particles may fly all around and may have a problem returning to promising solutions (Van den Bergh and Engelbrecht 2006). This has inspired researchers to determine the relations between the trajectories of PSO particles and the values of control parameters or topologies in an analytical way (Clerc and Kennedy 2002; Harrison et al. 2018; Cleghorn and Stapleberg 2022).

Another main difference between DE and PSO is the crossover, Eq. (3). In almost all DE variants, the sampled solution is a mix of the former solution and a solution that results from the initial move (which in most DE variants is an extended version of Eq. (2)). The crossover is useful, as it allows some information from the previous solution to be kept within the newly tested one. It limits diversity but enhances the chance of finding a better solution; without it, the number of successful steps in DE would often be very low, and the population could stagnate for a long time in its former location. As PSO performs moves all the time, crossover is not necessary (although it has been tested in some PSO variants, e.g. Engelbrecht 2016; Gong et al. 2016; Molaei et al. 2021).

Finally, the majority of new DE algorithms use adaptive control parameters and population size modification schemes (Al-Dabbagh et al. 2018). Although, as noted above, adaptive and variable population-size PSO variants are also numerous, they are not clearly superior to the variants with fixed but carefully chosen values (Bonyadi 2020). PSO variants that adaptively modify the acceleration coefficients (c1, c2) are relatively rare (for examples, see Harrison et al. 2018) and do not improve the performance much.

Description of the study area

The study is performed on the Kamienna River catchment, which is located in the Central Vistula basin in the Polish Upland area and covers 2020 km² (Fig. 3).

Fig. 3 The Kamienna River catchment, flow gauging station and meteorological stations

The main river of the catchment is the Kamienna (a left tributary of the Vistula River), whose sources are located at the border of the Masovian and Świętokrzyskie provinces, above the town of Skarżysko-Kamienna, in a mountainous area. The river is 156 km long and runs from west to east, predominantly through the Świętokrzyskie Province. The catchment elevation varies from about 130 to 600 m a.m.s.l. There are large variations in the longitudinal slope of the channel: in the upper part of the Kamienna River the slope is around 10%, and this part has a mountainous character up to Skarżysko-Kamienna, from where the slope gradually decreases, reaching about 0.7% near Kunów (Lenar-Matyas et al. 2006). The catchment area is prone to natural and human hazards (FramWat 2019). Human activities have focused on increasing water retention in the catchment by constructing many small artificial reservoirs and two large ones: Wióry and Brody Iłżeckie.

According to the Köppen–Geiger climate classification (adapted by Peel et al. 2007), the climate of the Kamienna catchment is "cold", with no dry season and a warm summer. Annual areal precipitation for the period 1968–2018 varies from 410 to 920 mm, with a long-term annual mean of 600 mm, while the long-term monthly mean varies from about 30 to 90 mm (Senbeta and Romanowicz 2021). The minimum and maximum precipitation occur in winter and summer, respectively. The mean monthly temperature in the watershed over the same observational period varies from −3.1 to 18.3 °C, with the minimum and maximum in January and July, respectively (Senbeta and Romanowicz 2021). The land use structure of the study catchment is dominated by agriculture (46.3%); a significant part of the area is also occupied by forest and semi-natural land (43.3%); the remaining parts are artificial land and water bodies, 10% and 0.4%, respectively.

Dataset

The data used include daily hydrological and climatological variables, namely streamflow, air temperature, precipitation and potential evapotranspiration (PET), in and around the watershed. These data were collected for the historical period 1968–1982, during which the catchment could be considered free from anthropogenic influences; after 1982, artificial reservoirs were constructed in the catchment, which changed the flow regime. The periods 1968–1970, 1971–1976 and 1977–1982 were used for warm-up, calibration and validation, respectively. Hydroclimatic data were obtained from the Institute of Meteorology and Water Management (https://dane.imgw.pl/).
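Assuming the daily series are stored in a table indexed by date, the warm-up/calibration/validation split can be expressed as below; the file and column names are hypothetical.

```python
import pandas as pd

# hypothetical file with daily streamflow, temperature, precipitation and PET
data = pd.read_csv("kamienna_daily.csv", index_col="date", parse_dates=True)

warmup      = data.loc["1968":"1970"]   # spins up the model states, not scored
calibration = data.loc["1971":"1976"]   # seen by the objective function
validation  = data.loc["1977":"1982"]   # held out to detect overfitting
```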

A temperature-based method was used to estimate PET at each meteorological station. As both the HBV and GR4J models are lumped, temperature, precipitation and PET over the catchment were averaged using the Thiessen polygon method.
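Once the Thiessen weights (the fractions of the catchment area assigned to the individual stations) are known, the lumped forcing is simply an area-weighted average of the station records. A minimal sketch, with hypothetical weights for three stations:

```python
import numpy as np

# hypothetical Thiessen weights: fractions of the catchment area
# assigned to each meteorological station (they sum to 1)
weights = np.array([0.35, 0.40, 0.25])

def areal_average(station_series):
    """Lump station records (rows: days, columns: stations)
    into a single catchment-average daily series."""
    return np.asarray(station_series) @ weights
```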

Comparison criteria

Both the HBV and GR4J models are calibrated using the mean square error (MSE). As a result, we compare algorithms using exactly the same criterion that was used as the objective function during the search. Each algorithm is run 30 times on each model (HBV and GR4J). The mean, median, best, and worst performances from these 30 runs, obtained for the calibration and validation datasets, are used for comparison. This is a frequently adopted compromise between confidence in the quality of the results (the more runs, the more reliable the results) and applicability (more runs mean more computation time). As shown in Vecek et al. (2017), the number of runs has only a moderate impact on the final conclusions of such research. In addition, we also report the standard deviation of the results over the 30 runs. This allows us to discuss the average performance, the extremes, and the consistency of the solutions found by a particular optimization algorithm.
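Both the criterion and the run statistics are straightforward to reproduce. A minimal sketch, assuming observed and simulated are arrays of daily flows and run_results holds the 30 final MSE values of one algorithm:

```python
import numpy as np

def mse(observed, simulated):
    """Mean square error: the calibration objective and the comparison criterion."""
    return float(np.mean((np.asarray(observed) - np.asarray(simulated)) ** 2))

def summarize_runs(run_results):
    """Statistics reported for the 30 independent runs of each algorithm."""
    r = np.asarray(run_results)
    return {"mean": r.mean(), "median": np.median(r),
            "best": r.min(), "worst": r.max(), "std": r.std(ddof=1)}
```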

Results and discussion

Calibration of the HBV model

Each time a model is calibrated, some data need to be set aside and kept unavailable to the calibration algorithm; we call this dataset the validation (or testing) data. This validation dataset is important because it allows potential overfitting of the model to be detected. For obvious reasons, we want the model to work correctly not just for the data used during calibration (the calibration set), but also for future, unknown data. Therefore, the validation dataset is needed to verify the practical effects of calibration. Thus, the discussion of the results may be divided into two parts: the first covers the comparison based on the calibration data, and the second the comparison based on the validation data (see Table 4).

Table 4 MSE results obtained for the HBV model

Based on the calibration dataset, two algorithms, namely PPSO and HARD-DE, appear to be the best for the HBV model. When comparing performance based on the mean or median of 30 runs, the results obtained by PPSO are the best. In particular, the low median obtained by PPSO (14.039) shows that this algorithm often leads to high-quality results. Three algorithms (HARD-DE, MDE_pBX and L-SHADE) achieve an equal median (14.505), indicating that all of them frequently find a similar, although sub-optimal, solution. However, according to the average values, HARD-DE performs better than MDE_pBX and L-SHADE.

In contrast, OLSHADE-CS leads to by far the poorest results, with a median MSE (20.365) that is over 30% higher than those of PPSO and HARD-DE. In terms of the mean and median, the PSO-based PPSO looks like the winner, and the DE-based OLSHADE-CS like the worst method. However, this does not imply the superiority of PSO over DE according to the mean and median measures, as the four remaining PSO variants (DEPSO, EPSO, PSO-sono and TAPSO) perform more poorly than the majority of DE variants (HARD-DE, MDE_pBX, EnsDE, and often L-SHADE) on the calibration dataset.

The mean and median are not the only metrics for comparing algorithms. Many users would rather be interested simply in the best results found. When one compares the best and the worst solutions obtained over 30 runs, HARD-DE becomes the winner. Indeed, the best solution found by HARD-DE (12.158) is about 7% better than the best solution found by PPSO (13.047). Moreover, HARD-DE never found a solution worse than its median (14.505), while the worst solution found by PPSO is over 10% poorer (16.161). We may also observe that the ranking of algorithms based on the best solutions found is generally different from the rankings based on the mean or median. One should especially note that TAPSO and EPSO were able to find best solutions with lower MSE than the best solutions found by all DE algorithms except HARD-DE. However, DEPSO and PSO-sono were not able to outperform the DE methods. Hence, according to the ranking based on the best solutions found, DE still outperforms PSO in general, but the relative positions of specific algorithms are different, and the whole picture is more complicated.

The superiority of DE over PSO algorithms on the calibration dataset is probably an effect of the behavior of both families of algorithms. In the recent, efficient DE variants, the control parameters are often adaptive; hence they are flexibly modified during the search (Das et al. 2016; Al-Dabbagh et al. 2018), whereas the control parameters of PSO are more frequently fixed throughout the whole search (e.g. Clerc and Kennedy 2002; Harrison et al. 2018). This flexibility of the DE control parameters may give a DE algorithm additional chances to cope with the complicated fitness landscape of each specific problem, where PSO variants with fixed control parameter values are more conservative. Another difference that may partly explain the better performance of DE is the selection operator. DE algorithms reject poorer solutions found during the search and move only to better locations. Hence, the current DE population is composed of solutions that are better than all their predecessors. On the contrary, PSO algorithms keep moving all the time and may produce final generations in both better and poorer locations. As a result, PSO variants may be considered more chaotic, and less effective in finding the precise location of the optima.

The results obtained for the validation dataset are more than twice as poor as those obtained for the calibration dataset. This is, however, attributable to the data rather than the calibration process. Considering the mean and median measures, the PPSO algorithm is the winner for the validation dataset, as it was for the calibration dataset. However, the quality of solutions found by HARD-DE, MDE_pBX and L-SHADE is frequently not confirmed on the validation dataset. All three algorithms achieved the second-best median on the calibration dataset, but their median MSE for the validation dataset ranks only 6th–8th, and is 10% poorer than the median MSE obtained by PPSO. In contrast to the calibration dataset, the PSO-based algorithms do not perform more poorly than the DE-based ones on the validation dataset. This may suggest that finding the exact optimum on the calibration dataset is of moderate importance when the incoming data change significantly (e.g. Beven 2006; Beven et al. 2022). Nonetheless, the overall best solution found by any method for the validation dataset again belongs to HARD-DE (32.350) and is again about 8% better than the best solution found by PPSO (35.099). This means that, in some sense, DE is still better than PSO on the validation data, as one of the DE variants is able to find a much better result than all other competing algorithms. Whether one prefers to look at the mean or at the best results is up to the user's taste.

Calibration of the GR4J model

In contrast to the HBV model, the calibration of the GR4J model seems to be much simpler, and almost all algorithms, apart from OLSHADE-CS, lead to almost the same median and best results. Only the mean MSE varies slightly, and the DE-based methods (excluding OLSHADE-CS), especially HARD-DE, achieve clearly better mean results than the PSO algorithms (Table 5). This indicates that the algorithms compete in terms of failures rather than in finding the best results. The poor performance of OLSHADE-CS may be due to its very slow convergence, and may be a side-effect of the fact that OLSHADE-CS was initially tested on, and probably fitted to, problems with a very large number of allowed function calls (see Kumar et al. 2022 for the initial tests).

Table 5 Results obtained for the GR4J model

The mean MSE obtained by HARD-DE (15.819) is about 5% better than the mean MSE obtained by PPSO (16.567). This difference also holds for the validation dataset, where HARD-DE is about 8% better than PPSO. However, for the validation dataset, surprisingly, the best solution found by PPSO is better than the best solution found by HARD-DE. Moreover, the overall best solution of the GR4J model for the validation dataset is found by another PSO-based method, PSO-sono. This may look like the opposite finding to the one noted for the HBV model. Nonetheless, the results obtained for the GR4J model are much less diversified than those for the HBV model. This may be due to the much smaller number of parameters to be optimized; a small number of parameters may lead to smaller differences in performance between algorithms.

Conclusion

The present paper compares numerous state-of-the-art variants of the PSO and DE optimization methods applied to the calibration of catchment runoff models. In the literature dealing with computational optimization methods, no broader comparison of performance between the PSO and DE families had been presented so far. We chose five DE and five PSO variants proposed between 2012 and 2022. These ten algorithms were applied to the calibration of two conceptual rainfall-runoff models, HBV and GR4J, on the Kamienna catchment located in the central part of Poland. We aimed at finding out whether DE or PSO algorithms are better suited for the calibration of rainfall-runoff models. We therefore focused on the relative performance of algorithms from the two families in the calibration of hydrological models, rather than on comparing the results obtained by the two conceptual rainfall-runoff models.

We show that the results obtained by the different optimizers are roughly similar for the GR4J model, which has very few parameters. For the GR4J model, one may rather point to an inferior algorithm, OLSHADE-CS, than to a winner, as many optimizers performed very similarly. No clear difference between the PSO and DE methods could be found. This is probably because the GR4J model has a low number of parameters that are relatively simple to calibrate. However, among the best results found over many runs, those found by two PSO variants (PPSO and PSO-sono) are better than those found by their DE competitors.

In the case of the HBV model, the results were much more diverse. OLSHADE-CS again showed the poorest performance, but the results obtained by the other algorithms were diversified. Which method could be called the winner depends on whether one focuses on the calibration or the validation dataset, and whether one is interested in the mean/median performance or in finding the best possible solution in one of 30 runs. Overall, two algorithms, the PSO-based PPSO and the DE-based HARD-DE, performed best in the HBV model calibration. A comparison between the two families of methods reveals that, in general, the DE algorithms slightly outperformed the PSO ones. The difference was, however, clearer for the calibration dataset than for the validation dataset. We may recommend using adaptive variants of algorithms for model calibration, especially those with flexible control parameters (e.g. HARD-DE) or an advanced topology (e.g. PPSO) that may automatically tune the speed of information exchange between individuals within the population managed by the algorithm. DE algorithms seem to be the more appropriate choice for the calibration of rainfall-runoff models than PSO variants, but the difference between their final performances is limited and depends on the measure used to create the ranking of algorithms.