1 Introduction

1.1 Context

Evolutionary Algorithms (EAs) are a family of algorithms inspired by the natural evolution of species. Generally, these population-based metaheuristic algorithms work over a set of individuals (solutions), called the population, and at each iteration of the algorithm they introduce several changes to evolve the population and to obtain improved solutions according to the function to optimize, which is denoted as the objective function or, equivalently, the fitness function.

Introduced by Mühlenbein and Paaß [23], Estimation of Distribution Algorithms (EDAs) [19] are an intriguing type of EA. The main characteristic of EDAs in comparison to generic EAs is the use of probability distributions instead of the usual natural evolution operators, such as recombination and mutation. In this way, EDAs start with a population, in most cases by means of sampling a uniform probability distribution over the search space. From the population, EDAs use a selection operator and obtain a subset of solutions which is used to learn a probability distribution. This distribution can be learned from scratch or by modifying the probability distribution used to sample the population at the previous iteration (such as in cGA [16]). The ideal goals of the learned probability distribution are to summarize the main features of the selected solutions and to highlight the best solutions. Finally, the learned probability distribution is sampled to obtain a new set of solutions and to generate a new population, which is used at the next iteration of the algorithm. In Algorithm 1 the general pseudocode of an EDA which learns a probability distribution from scratch at each iteration is introduced.

[Algorithm 1: general pseudocode of an EDA]
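Algorithm 1 itself is not reproduced here. To make the loop it describes concrete (sample, select, learn, sample again), the following is a minimal sketch of a univariate EDA on binary strings in the spirit of UMDA; all names and parameter values are illustrative and not taken from the paper.

```python
import random

def simple_eda(fitness, n, pop_size=100, sel_size=50, iters=50, seed=0):
    """Minimal univariate EDA on binary strings (UMDA-style sketch).
    Each iteration: sample D_i, select D_i^S, learn P_{i+1} from scratch."""
    rng = random.Random(seed)
    probs = [0.5] * n                          # P_0: uniform over {0, 1}^n
    best, best_f = None, float("-inf")
    for _ in range(iters):
        # sample the population D_i from the current distribution
        pop = [[int(rng.random() < q) for q in probs] for _ in range(pop_size)]
        # truncation selection: D_i^S = best sel_size solutions of D_i
        pop.sort(key=fitness, reverse=True)
        sel = pop[:sel_size]
        if fitness(sel[0]) > best_f:
            best, best_f = sel[0], fitness(sel[0])
        # learn P_{i+1}: marginal frequency of 1s per position in D_i^S
        probs = [sum(x[j] for x in sel) / sel_size for j in range(n)]
    return best, best_f

best, best_f = simple_eda(sum, n=10)           # OneMax as a toy fitness
assert best_f == sum(best) and 0 <= best_f <= 10
```

The learning step here estimates independent marginals; the Mallows-EDA studied later replaces it with the estimation of a probability model over permutations.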

1.2 Motivation

EDAs have been, and continue to be, designed, applied and analyzed for the solution of Combinatorial Optimization Problems (COPs). Particular attention has been paid to binary COPs, where theoretical results at different levels have been provided for different implementations of EDAs. However, this development has not been extended to other non-binary COPs, such as permutation-based COPs. In order to bridge this gap, in this paper we extend part of those results to the area of permutation-based COPs. Several works in the literature use EDAs specifically designed for permutation-based COPs, obtaining highly competitive results [5, 6, 24, 28]. However, it is still not clear which mechanisms allow them to obtain those results.

Our motivation behind this work is to present, for the first time, a theoretical analysis of EDAs designed for permutation-based COPs, and a mathematical model to study their behavior in several scenarios of increasing complexity. To the best of our knowledge, there are no theoretical studies on permutation-based EDAs. Therefore, we seek general knowledge for a better comprehension of the algorithms designed over the permutation space. Theoretical studies can focus on many different objectives, such as the limit behavior of the algorithm, runtime analysis, population sizing and so on. Inspired by the path followed by the theoretical studies of EDAs designed for binary COPs, this first study focuses on convergence analysis.

Current research on binary EDAs is often based on runtime analysis. The goal is to find bounds on the number of generations needed to sample a high-quality or optimal solution for the first time. This goal has a close connection with the practical use of the algorithms, where we would like to sample an optimal solution as soon as possible. Notice that an optimal solution can be reached by an algorithm without requiring convergence to it [32].

However, convergence analysis is a compelling starting point for original analyses: it provides insights into the studied algorithms and identifies the scenarios in which an algorithm is guaranteed, by its design, to converge to the optimal model. Moreover, this work presents, for the first time, a framework to study EDAs designed for permutation-based COPs. Let us explain our two main motivations in detail. Firstly, as previously mentioned, permutation-based EDAs have obtained highly competitive results in practice. However, there are no theoretical studies that analyze the algorithms which obtain those results. Therefore, our first objective is to study the reasons why the used algorithms achieve the presented results, and their characteristics. Secondly, many mathematical frameworks have been presented in the literature to gain insights into binary EDAs. Some of these works are referenced in Sect. 1.4. Nevertheless, permutation-based EDAs have not received the same attention from researchers and there are no mathematical frameworks in the literature to study this kind of algorithm. Hence, our second objective is to develop a mathematical model that can be used to analyze permutation-based EDAs theoretically.

While several EDAs using different probabilistic models have been designed for permutation-based COPs, we concentrate on those that use the Mallows model, as it is the one that has received the most attention in the literature. The Mallows model [20] is considered the analogue of the Gaussian distribution over the permutation space and belongs to a more general class of probability models: distance-based exponential models. The Mallows model has been used for designing EDAs for the Permutation Flowshop Scheduling Problem [5, 6] and the Vehicle Routing Problem with Time Windows [24]. In the mentioned articles, the authors design EDAs in which a Mallows model is learnt from the selected population at each iteration of the algorithm. In [6] the authors named this algorithm Mallows-EDA, whereas in [5, 24] the authors generalize and expand Mallows-EDA. However, even though the mentioned articles have presented competitive results in practice, there are no studies that analyze the behavior of the applied algorithms. We study the reasons for the results obtained by the Mallows-EDA and its characteristics.

1.3 Contribution

In this paper, we present a mathematical framework to study a Mallows-EDA and focus on the convergence behavior of the algorithm for several fitness functions. Considering the ideas presented in previous works [15, 22, 34], we will study the sequence of the expected probability distributions obtained at each iteration of the algorithm (or, equivalently, we study the behavior of the algorithm when the population size tends to infinity). In this way, the randomness is removed and the algorithm is modeled as a dynamical system. Finally, our proposed mathematical framework is used to calculate the convergence behavior of the algorithm for several fitness functions. The studied functions are the constant function, the needle in a haystack (analogous to the definition presented in [25]) and a function defined by means of a Mallows model centered at different permutations.

This work is an extensive and detailed expansion of [29] and, as far as we know, our results are the first theoretical analysis in the literature for permutation-based EDAs; they also show the obstacles to achieving high-quality theoretical results in this unexplored area. In comparison to the mentioned work, in this paper we show in detail the development of, and the reasons for using, each mathematical tool that was not explained in [29], and we extend the obtained results by presenting new scenarios and explaining why they arise. Based on the motivations explained in Sect. 1.2, this work has three main goals. Our first goal is to present a mathematical framework which allows this study to be reproduced for other distance-based exponential models and new fitness functions. Our second goal is to carry out an analysis that provides new knowledge on the convergence behavior of permutation-based algorithms. Moreover, for the objective functions analyzed in the present work, the obtained results are unexpected. We have observed that, in the scenarios in which the initial probability distribution is the uniform distribution or the fitness function is constant, the model converges to the optimal solution. However, in the rest of the studied scenarios, the algorithm can converge to a degenerate distribution not necessarily centered at the optimal solution, or to a non-degenerate probability distribution. To determine the limit behavior of the algorithm, the equations that identify the fixed points of the dynamical system are given. These results differ from the existing results in the literature for binary EDAs (for example, in [14, 15, 34], the studied algorithms converge to degenerate distributions centered at local optima or the global optimum of the studied fitness function). An exhaustive list of the obtained results can be found in Sects. 3, 4 and 5. At the beginning of each referenced section, a summary of the obtained results is given.

Finally, our third goal is to use the knowledge obtained in this study to lay the groundwork for upcoming research in this area. The presented analysis shows that, given an objective function, the initial probability distribution determines the limit behavior of the algorithm. Therefore, our first proposed algorithmic adaptation is to apply alternative initializations in order to obtain high-quality solutions. On the other hand, another proposed line of work is to analyze the expected number of iterations needed to reach a high-quality or optimal solution for the first time, connecting it with the current tendency of the theoretical studies of EDAs. These points are discussed further in Sect. 6.

1.4 Related work

EDAs have mostly been designed and studied for binary COPs. Some examples of EDAs designed for binary COPs are the Univariate Marginal Distribution Algorithm (UMDA) [23], Population-Based Incremental Learning (PBIL) [2], the Compact Genetic Algorithm (cGA) [16] and the Factorized Distribution Algorithm (FDA) [22]. Moreover, these algorithms have been complemented with theoretical studies aimed at understanding and improving them [14, 15, 22]. The first theoretical studies focused on their convergence behavior, and the current tendency is the runtime analysis of the algorithms. We highly recommend [18] for a state-of-the-art survey of binary EDAs; our ideal goal is to explore permutation-based EDAs in an analogous way.

From [18], we want to highlight three inspiring works. In [14], the authors prove that when the fitness function is unimodal, PBIL converges to the global optimum. In [15], it is proved that any discrete EDA generates a population containing an optimal solution if any solution of the search space can be generated at any iteration of the algorithm. In addition, in the same work, the authors review a dynamical system used in the literature to study UMDA and PBIL; the present work adopts this idea of studying EDAs as dynamical systems. Last but not least, in [22], the authors study the convergence behavior of the FDA under Boltzmann and truncation selection, analyzing both finite and infinite populations, which shows the influence of the infinite-population assumption on the obtained results.

The remainder of the paper is organized as follows. In Sect. 2, the basic concepts related to the Mallows model and our mathematical framework are introduced. In Sect. 3, the convergence behavior of the framework is studied for a constant objective function f. In Sect. 4, the function f analyzed is a needle in a haystack function. In Sect. 5, the function f analyzed is a Mallows model. In Sects. 3, 4 and 5, two initial distributions are considered for the analysis: the uniform distribution and a Mallows probability distribution. Finally, in Sect. 6, conclusions and future work are presented.

2 EDA based on Mallows models

The theoretical study of an EDA can focus on many different objectives, such as limit behavior, runtime analysis, population sizing and so on. In this work, the convergence behavior of the Mallows-EDA is studied. To this end, a mathematical model based on dynamical systems is presented.

2.1 Notation

The solutions of the studied optimization problems are permutations of length n. Let us denote by \(\varSigma _n\) the n-permutation space (\(|\varSigma _n|=n!=N\)) and \(f:\varSigma _n \longrightarrow {\mathbb {R}}\) the function to maximize. Let us denote by \(\sigma \) a permutation from \(\varSigma _n\) or a solution of the function f. Throughout this work, \(\sigma (i)\) represents the position of the element i in the solution \(\sigma \). The solution \(\sigma ^*\) is an optimal solution:

$$\begin{aligned} \sigma ^* = \arg \max _{\sigma \in \varSigma _n}f(\sigma ) . \end{aligned}$$

Moreover, let us define an adjacent transposition of a permutation \(\sigma \) as a swap of two consecutive elements. Additionally, \(\sigma ^{-1}\) is the inverse permutation of \(\sigma \).

A population in the algorithm is a subset (in the multiset sense) of M solutions of \(\varSigma _n\). Let us denote by \(D_i\) the population at step i and by \(D_i^S\) the individuals selected from \(D_i\). There are several ways to study EDAs, depending on how an iteration of the algorithm is described. The most common description of a step of an EDA is the following one. The algorithm starts iteration i from a population \(D_i\). Then, a subset of individuals is selected from \(D_i\) by means of the selection operator, defining \(D_i^S\). After that, a probability distribution \(P_i^L\) is learnt from \(D_i^S\) and finally a new population \(D_{i+1}\) is generated by sampling solutions from \(P_i^L\) and combining them with the solutions from \(D_i\). In Algorithm 1 the general pseudocode of an EDA is introduced. Still, there exists another possible interpretation of a step of an EDA in which probability distributions are considered the main mathematical tool to study the algorithm [25]. In this second description, the algorithm starts iteration i from a probability distribution \(P_i\) and a population \(D_i\) is sampled. Then, \(D_i^S\) is selected and finally a new probability distribution is learnt for the next iteration, \(P_i^L = P_{i+1}\). Throughout this work, the latter description is taken as the main interpretation of EDAs, for a better comprehension of Sects. 2.2 and 2.4.

The probability distributions can be represented using probability vectors. Let us denote by \(p_i(\sigma )\) the probability of \(\sigma \) under \(P_i\). Therefore, we can denote by \(P_i=(p_i(\sigma _1),\dotsc ,p_i(\sigma _N))\) the probability distribution of the population at iteration i. If we are studying EDAs with finite populations, the vector \(P_i\) can be considered the “empirical probability mass function” of \(D_i\) (and analogously for \(P_i^S\) and the population \(D_i^S\)). We must emphasize that this representation of populations by probability vectors is conceptual: it is very helpful for our proposed theoretical study, but it cannot be applied in practical EDAs due to the required memory. Moreover, the subindexes of the permutations in the probability vectors distinguish the N permutations of \(\varSigma _n\), on which a fixed order has been established. The space of possible probability vectors \(\varOmega _n\) is defined in the following way:

$$\begin{aligned} \varOmega _n= & {} \left\{ \left( p(\sigma _1),p(\sigma _2),\dotsc ,p(\sigma _N) \right) : \right. \\&\left. \sum _{j=1}^N p(\sigma _j)=1 , 0 \le p(\sigma _j) \le 1,\ j=1,\dotsc ,N \right\} . \end{aligned}$$

To avoid the trivial case, it is assumed that any initial probability vector \(P_0\) satisfies that \(p_0(\sigma _j)<1\), for \(j=1,\dotsc ,N\) (\(D_0\) is not formed only by one specific solution). Note that \(\varOmega _n\) contains degenerate distributions. Let us denote by \(1_{\sigma _k}=(1_{\sigma _k}(\sigma _1),\dotsc , 1_{\sigma _k}(\sigma _{k-1}),1_{\sigma _k}(\sigma _k),1_{\sigma _k}(\sigma _{k+1}) ,\dotsc ,1_{\sigma _k}(\sigma _N)) = (0,\dotsc ,0,1,0,\dotsc ,0) \) the degenerate probability distribution centered at \(\sigma _k\).
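As a concrete illustration of this notation (a toy example, not part of the original analysis), the probability vectors over \(\varSigma _3\) can be enumerated explicitly:

```python
from itertools import permutations

n = 3
perms = list(permutations(range(1, n + 1)))   # a fixed order over Sigma_3
N = len(perms)                                # N = n! = 6

uniform = [1.0 / N] * N                       # an interior point of Omega_3
degenerate = [1.0 if s == (1, 2, 3) else 0.0 for s in perms]  # 1_{sigma_k}

assert abs(sum(uniform) - 1.0) < 1e-12 and sum(degenerate) == 1.0
```

Both vectors belong to \(\varOmega _3\); only the degenerate one is excluded as an initial distribution \(P_0\).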

Hence, if the distributions \(P_i\) are taken as the reference of each step of an EDA, then the EDA can be considered a sequence of probability distributions, each one obtained from the previous one by a stochastic transition rule \({\mathcal {G}}\):

$$\begin{aligned} \begin{array}{ccccccl} P_0 &{} \longrightarrow &{} P_1 &{} \longrightarrow &{} P_2 &{} \longrightarrow &{} \cdots \\ &{} {\mathcal {G}} &{} &{}{\mathcal {G}} &{} &{}{\mathcal {G}} &{} \end{array} \end{aligned}$$

that is, \(P_i = {\mathcal {G}}(P_{i-1})={\mathcal {G}}^{i}(P_0),\ \forall i \in {\mathbb {N}}\). Given a probability distribution \(P_i\), the operator \({\mathcal {G}}\) outputs the probability distribution obtained after sequentially applying the sampling, the selection operator and the learning step. In this work, the considered algorithm to analyze is the Mallows-EDA [6] and the selection operator used throughout this work is a 2-tournament selection. The details are explained in Sects. 2.3 and 2.4.

Hence, our objective is to study the convergence behavior described as follows:

$$\begin{aligned} \lim _{i\longrightarrow \infty } {\mathcal {G}}^{i}(P_0) . \end{aligned}$$

2.2 EDAs based on expectations

The application of the EDA schema to optimization problems can involve an intractable variety of situations and behaviors. Due to this difficulty, and following the ideas presented in the literature, our proposed mathematical model studies the expected probability distribution generated after one iteration of the algorithm. Hence, our framework studies the deterministic function \(G:\varOmega _n \longrightarrow \varOmega _n\) which assigns to each distribution the expected output of the operator \({\mathcal {G}}:\varOmega _n \longrightarrow \varOmega _n\), similar to the idea followed in [14]:

$$\begin{aligned} P_{i+1}= & {} G(P_i) = E[{\mathcal {G}}(P_i)] = E[(a \circ \phi )(P_i)] \\= & {} \sum _{P \in \varOmega _n} a(P) \cdot p(\phi (P_i)=P). \end{aligned}$$

where a(P) is the probability distribution obtained after applying the approximation step, \(\phi \) is the selection operator and \(p(\phi (P_i)=P)\) is the probability to obtain P from \(P_i\). The details of our proposed selection operator and approximation step are explained in Sect. 2.4.

Moreover, \(P_i =G^{i}(P_0)\). By studying the expected probability distribution, the deterministic operator G removes the random drift, so repeated applications of the algorithm do not end in different probability distributions. Another equivalent interpretation of the deterministic operator G is the study of EDAs when the population sizes of \(D_i\) and \(D_i^S\) tend to infinity [9, 10, 30, 34]. By the Glivenko-Cantelli theorem [8], when the population size tends to infinity, the empirical probability distributions of \(D_i\) and \(D_i^S\) converge to the underlying probability distributions of \(D_i\) and \(D_i^S\), respectively. Under this assumption, \(P_i\) and \(P_i^S\) can be thought of as the population and the selected population at iteration i; in other words, \(P_i\) and \(P_i^S\) replace the populations \(D_i\) and \(D_i^S\) of the finite model, respectively. Therefore, our study can be thought of as the analysis of an EDA that works with the limit distributions of large populations. In Algorithm 2 the general pseudocode of an EDA based on expectations is shown.

[Algorithm 2: general pseudocode of an EDA based on expectations]

Typical selection operators \(\phi \) are n-tournament selection, proportional selection and truncation selection [4, 34].

Therefore, the operator G induces a deterministic sequence:

$$\begin{aligned} \begin{array}{ccccccl} P_0 &{} \longrightarrow &{} P_1 &{} \longrightarrow &{} P_2 &{} \longrightarrow &{} \cdots \\ &{} G &{} &{} G &{} &{} G &{} \end{array} \end{aligned}$$

and the new objective is to study

$$\begin{aligned} \lim _{i \longrightarrow \infty }G^i(P_0). \end{aligned}$$

In Sect. 2.4, the function G used throughout this work to study the convergence behavior of the algorithm is defined.

2.3 Mallows model

The Mallows model [20] is a distance-based exponential probability model over permutations. Under this model, the probability of every permutation \(\sigma \in \varSigma _n\) depends on two parameters: a central permutation \(\sigma _0\) and a spread parameter \(\theta \). The Mallows model is defined as follows:

$$\begin{aligned} P(\sigma ) = \frac{1}{\varphi (\theta ,\sigma _0)} e^{-\theta d(\sigma ,\sigma _0)} \end{aligned}$$

where d is an arbitrary distance function defined over the permutation space, \(d(\sigma ,\sigma _0)\) is the distance from \(\sigma \) to the central permutation \(\sigma _0\), and \(\varphi (\theta ,\sigma _0) = \sum _{\sigma \in \varSigma _n} e^{-\theta d(\sigma ,\sigma _0)}\) is the normalization constant. Due to this definition, the Mallows model is considered the analogue of the Gaussian distribution over permutations. To simplify notation, let us denote by \(\hbox {MM}({\sigma _0},{\theta })\) a Mallows probability distribution centered at \(\sigma _0\) with spread parameter \(\theta \). Bear in mind that when \(\theta = 0\), \(\hbox {MM}({\sigma _0},{0})\) is the uniform probability distribution for any \(\sigma _0 \in \varSigma _n\).

An important property of a Mallows model is that any two permutations at the same distance from the central permutation have the same probability value. Hence, we can group the permutations according to their distance to the central permutation.

Different distances can be used with the Mallows model, such as the Cayley distance, the Hamming distance or the Kendall tau distance [17]. The latter is the most used in the literature for the Mallows model, and it is the one we use in our EDA analysis.

Definition 1

Kendall tau distance \(d_\tau (\sigma ,\pi )\) counts the number of pairwise disagreements between \(\sigma \) and \(\pi \). It can be mathematically defined as follows:

$$\begin{aligned} d_{\tau }(\sigma ,\pi )= & {} \left| \left\{ (i,j):i<j ,(\sigma (i)< \sigma (j) \wedge \pi (i)> \pi (j)) \right. \right. \\&\left. \left. \vee (\sigma (i) > \sigma (j) \wedge \pi (i) < \pi (j)) \right\} \right| \end{aligned}$$

where \(\sigma (i)\) is the position of the element i in the permutation \(\sigma \) (and similarly with \(\sigma (j)\), \(\pi (i)\) and \(\pi (j)\)).
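A brute-force implementation of Definition 1 can serve as a reference (an illustrative sketch; the quadratic loop directly mirrors the set in the definition):

```python
def kendall_tau(sigma, pi):
    """Kendall tau distance (Definition 1): number of pairwise
    disagreements between sigma and pi, counted by brute force."""
    n = len(sigma)
    return sum((sigma[i] < sigma[j]) != (pi[i] < pi[j])
               for i in range(n) for j in range(i + 1, n))

assert kendall_tau((1, 2, 3), (1, 2, 3)) == 0
assert kendall_tau((1, 3, 2), (1, 2, 3)) == 1   # one adjacent transposition
assert kendall_tau((1, 2, 3), (3, 2, 1)) == 3   # maximum: n(n-1)/2
```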

By definition, \(\varSigma _n\) with \(d_{\tau }\) is a metric space. For simplification purposes, let us denote by \(\sigma \pi \) the composition of \(\sigma \) and \(\pi \) (i.e., \(\sigma \pi = \sigma \circ \pi \)) and \(d(\sigma ,\pi )\) the Kendall tau distance between \(\sigma \) and \(\pi \). According to the definition, the distance between two permutations is a non-negative integer between 0 and \(D =n(n-1)/2=\left( {\begin{array}{c}n\\ 2\end{array}}\right) \). A property of Kendall tau distance is that, for any \(\sigma ,\pi \in \varSigma _n\), \(d(\sigma ,\pi )+d(\pi ,I'\sigma )= d(\sigma ,I'\sigma )= D\), where \(I' = (n\ n-1\ \cdots 1)\). Consequently,

$$\begin{aligned} 2 \sum _{\pi \in \varSigma _n} d(\sigma ,\pi )= & {} \sum _{\pi \in \varSigma _n} \left( d(\sigma ,\pi )+d(\pi ,I'\sigma )\right) \\= & {} \sum _{\pi \in \varSigma _n} D = N \cdot D. \end{aligned}$$

Another property is that Kendall tau distance has the right invariance property; that is, \(d(\sigma ,\pi )=d(\sigma \rho , \pi \rho )\) for every permutation \(\sigma ,\pi ,\rho \in \varSigma _n\) [17]. Consequently, the normalization constant of the Mallows model can without loss of generality be written as \(\varphi (\theta )\).
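These two properties can be checked exhaustively for small n (an illustrative verification, with permutations represented as tuples and composition defined pointwise):

```python
from itertools import permutations

def d(s, p):  # Kendall tau: brute-force pairwise disagreements
    return sum((s[i] < s[j]) != (p[i] < p[j])
               for i in range(len(s)) for j in range(i + 1, len(s)))

def compose(s, p):  # (s o p)(i) = s(p(i)), with 1-based values
    return tuple(s[p[i] - 1] for i in range(len(p)))

n = 4
D = n * (n - 1) // 2
Iprime = tuple(range(n, 0, -1))                 # I' = (n, n-1, ..., 1)
perms = list(permutations(range(1, n + 1)))

for s in perms:
    for p in perms:
        # d(sigma, pi) + d(pi, I' sigma) = D
        assert d(s, p) + d(p, compose(Iprime, s)) == D
        # right invariance: d(sigma, pi) = d(sigma rho, pi rho)
        for r in perms[:3]:
            assert d(s, p) == d(compose(s, r), compose(p, r))
```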

Kendall tau distance can be equivalently written as

$$\begin{aligned} d(\sigma ,\pi ) = \sum _{i=1}^{n-1}V_i(\sigma ,\pi ) \end{aligned}$$

where \(V_i(\sigma ,\pi )\) is the minimum number of adjacent swaps to set the value \(\pi (i)\) in the i-th position of \(\sigma \) [21]. It is worth noting that there exists a bijection between any permutation \(\sigma \) of \(\varSigma _n\) and the vector \(\left( V_1(\sigma ,I),\dotsc ,V_{n-1}(\sigma ,I)\right) \), where I represents the identity permutation and \(V_i(\sigma ,I) \in \{0,\dotsc ,n-i\}\), \(\forall i=1,\dotsc ,n-1\). Furthermore, the components \(V_i(\sigma ,I)\) are independent when \(\sigma \) is uniform on \(\varSigma _n\).
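The decomposition and the bijection can be verified for small n, using one standard formulation of \(V_i(\sigma ,I)\) as an inversion-table entry (an illustrative sketch):

```python
from itertools import permutations

def kendall_tau(s, p):  # brute-force pairwise disagreements
    return sum((s[i] < s[j]) != (p[i] < p[j])
               for i in range(len(s)) for j in range(i + 1, len(s)))

def V(s):
    """V_i(sigma, I) for i = 1..n-1, in one standard formulation: the
    number of entries after position i that are smaller than s[i]."""
    n = len(s)
    return [sum(s[j] < s[i] for j in range(i + 1, n)) for i in range(n - 1)]

n = 4
I = tuple(range(1, n + 1))
vectors = set()
for s in permutations(I):
    v = V(s)
    assert sum(v) == kendall_tau(s, I)               # d(sigma, I) = sum V_i
    assert all(0 <= v[i] <= n - (i + 1) for i in range(n - 1))
    vectors.add(tuple(v))
assert len(vectors) == 24                            # the map is a bijection
```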

Finally, with Kendall tau distance, the Mallows model with central permutation \(\sigma _0\) and spread parameter \(\theta \) and the Mallows model with central permutation \(I'\sigma _0\) and spread parameter \(-\theta \) are equivalent [13]. Therefore, without loss of generality, we assume that \(\theta > 0\).
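To make the model concrete, the probability vector of a Mallows model over \(\varSigma _4\) can be computed directly from the definition (an illustrative sketch using the brute-force Kendall tau distance; names and parameter values are arbitrary):

```python
import math
from itertools import permutations

def kendall_tau(s, p):
    return sum((s[i] < s[j]) != (p[i] < p[j])
               for i in range(len(s)) for j in range(i + 1, len(s)))

def mallows(sigma0, theta, n):
    """Probability vector of MM(sigma0, theta) over Sigma_n (Kendall tau)."""
    perms = list(permutations(range(1, n + 1)))
    w = [math.exp(-theta * kendall_tau(s, sigma0)) for s in perms]
    phi = sum(w)                      # normalization constant varphi(theta)
    return perms, [x / phi for x in w]

perms, p = mallows((1, 2, 3, 4), theta=0.7, n=4)
assert abs(sum(p) - 1.0) < 1e-12
# permutations at the same distance from sigma0 share the same probability
by_dist = {}
for s, ps in zip(perms, p):
    by_dist.setdefault(kendall_tau(s, (1, 2, 3, 4)), set()).add(round(ps, 12))
assert all(len(v) == 1 for v in by_dist.values())
# theta = 0 gives the uniform distribution
_, u = mallows((1, 2, 3, 4), theta=0.0, n=4)
assert all(abs(x - 1 / 24) < 1e-12 for x in u)
```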

2.4 Mathematical modeling

As mentioned in the introduction, in this section we present a mathematical framework to study the convergence behavior of a Mallows-EDA by means of a deterministic operator based on expectations. Before presenting our proposed mathematical model, let us recall how the Mallows-EDA is defined in [6].

The main characteristic of the Mallows-EDA is that the learned probability distribution is a Mallows probability distribution. To learn a Mallows model, the parameters \(\sigma _0\) and \(\theta \) must be estimated, which is done by maximum likelihood estimation. The log-likelihood function for a finite population \(\{\sigma _1,\dotsc , \sigma _M \}\) is as follows [13]:

$$\begin{aligned} -M\theta \sum _{i=1}^{n-1}{\bar{V}}_i -M \log \varphi (\theta ) \end{aligned}$$
(1)

where \({\bar{V}}_i\) denotes the observed mean for \(V_i\): \({\bar{V}}_i= \sum _{j=1}^M V_i(\sigma _j,\sigma _0)/M\). As we can observe in Eq. (1), the term \(-M\theta \sum _{i=1}^{n-1}{\bar{V}}_i\) depends on \(\sigma _0\) and \(\theta \), whereas the term \(-M \log \varphi (\theta )\) only depends on \(\theta \). Therefore, for a fixed non-negative value \(\theta \), maximizing the log-likelihood function is equivalent to minimizing \(\sum _{i=1}^{n-1} {\bar{V}}_i\). This problem is also known as the rank aggregation problem or the Kemeny ranking problem, and it is NP-hard [1, 3]. This makes the theoretical analysis very complex.

Therefore, given a sample of M permutations \(\{\sigma _1,\dotsc ,\sigma _M\}\), the first step to obtain the maximum likelihood estimators of the Mallows model is to obtain a permutation \(\sigma _0\) which minimizes \(\sum _{i=1}^{n-1} {\bar{V}}_i\). Let us denote by \({\hat{\sigma }}_0\) the estimated central permutation for the previous minimization problem. Once we obtain \({\hat{\sigma }}_0\), the maximum likelihood estimator of \(\theta \), denoted by \({\hat{\theta }}\), is obtained by solving the following equation [13]:

$$\begin{aligned} \sum _{i=1}^{n-1} {\bar{V}}_i = \frac{n-1}{e^\theta -1} - \sum _{i=1}^{n-1} \frac{n-i+1}{e^{(n-i+1)\theta }-1}. \end{aligned}$$
(2)

While previous theoretical studies that use dynamical systems (e.g., [14, 33]) rely on closed formulae, this equation does not admit one. For that reason, a numerical method such as Newton-Raphson has to be used to solve it, which further illustrates the complexity of the theoretical analysis. Once \({\hat{\sigma }}_0\) and \({\hat{\theta }}\) are estimated, the Mallows model is completely defined and it is used to sample new solutions for the next iteration of the algorithm. In Algorithm 3 the general pseudocode of the Mallows-EDA defined in [6] is shown.

[Algorithm 3: pseudocode of the Mallows-EDA of [6]]
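As an illustration of the numerical step, Eq. (2) can be solved for \(\theta \) with a simple root finder. The text mentions Newton-Raphson; the sketch below uses bisection instead, exploiting the fact that the right-hand side is a decreasing function of \(\theta \). All names and values are illustrative.

```python
import math

def g(theta, n):
    """Right-hand side of Eq. (2); math.expm1 keeps e^x - 1 stable near 0."""
    return ((n - 1) / math.expm1(theta)
            - sum((n - i + 1) / math.expm1((n - i + 1) * theta)
                  for i in range(1, n)))

def estimate_theta(v_bar_sum, n, lo=1e-9, hi=50.0, tol=1e-10):
    """Solve g(theta) = sum of V-bar_i by bisection; g is decreasing, so
    the root is bracketed whenever g(hi) < v_bar_sum < g(lo)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid, n) > v_bar_sum:
            lo = mid          # g(mid) still too large: move right
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

n = 5
theta_hat = estimate_theta(v_bar_sum=2.0, n=n)   # toy value of sum V-bar_i
assert theta_hat > 0 and abs(g(theta_hat, n) - 2.0) < 1e-6
```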

Throughout this work, in order to study the convergence behavior of the Mallows-EDA based on expectations, the deterministic operator \(G= a \circ \phi \) is used. This operator is a composition of the selection operator \(\phi \) and the approximation step a used to learn the Mallows model. Hence, the operator \(\phi \) returns the expected selection probability of the solutions from \(P_i\) and the function a uses a maximum likelihood estimation method to learn a Mallows model from \(P_i^S\).

The selection operator studied in this work is the widely used 2-tournament selection, but it is worth mentioning that any selection operator based on rankings of solutions which satisfies the impartiality and no-degeneration properties defined in [10] would produce the same results. This selection operator is based on the ranking of the solutions according to the objective function f and cannot assign extreme probabilities. Given the probability distribution \(P_i\) at iteration i and assuming a maximization problem, the expected probability of selecting a solution \(\sigma \) is the sum of all the binary selections in which \(\sigma \) is chosen against a solution \(\pi \) with a lower or equal fitness value, that is:

$$\begin{aligned} p^S_i(\sigma )= & {} 2\sum _{\pi | f(\sigma ) > f(\pi )} p_i(\sigma )p_i(\pi )\nonumber \\&+ \sum _{\pi | f(\sigma ) = f(\pi )} p_i(\sigma )p_i(\pi ). \end{aligned}$$
(3)
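Equation (3) translates directly into code over a probability vector (an illustrative sketch; the fitness function below is a toy choice, not one of the functions studied in the paper):

```python
from itertools import permutations

def expected_2tournament(perms, p, f):
    """Expected 2-tournament selection probabilities, Eq. (3)."""
    pS = []
    for s, ps in zip(perms, p):
        beaten = sum(pp for t, pp in zip(perms, p) if f(t) < f(s))
        tied = sum(pp for t, pp in zip(perms, p) if f(t) == f(s))
        pS.append(2 * ps * beaten + ps * tied)
    return pS

n = 3
perms = list(permutations(range(1, n + 1)))
p = [1.0 / len(perms)] * len(perms)                        # P_i: uniform
f = lambda s: -sum(abs(s[i] - (i + 1)) for i in range(n))  # toy fitness
pS = expected_2tournament(perms, p, f)

assert abs(sum(pS) - 1.0) < 1e-12   # p^S is again a probability distribution
assert pS[0] > p[0]                 # the optimum (1, 2, 3) gains probability
```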

Once \(P_i^S\) has been calculated, the function a uses the probabilities \(p_i^S(\sigma )\) to learn a new Mallows model, which is the probability distribution of the next generation. In order to work with the probability vectors and the expected probability distributions and to estimate \(\sigma _0\) and \(\theta \), Eqs. (1) and (2) must be reformulated. To do so, \(\bar{V_i}\) is calculated as the weighted average of \(V_i(\sigma ,\sigma _0)\), using \(p^S(\sigma )\) as the proportion of the solution \(\sigma \) in the selected population. So, we have

$$\begin{aligned} {\bar{V}}_i = \sum _{\sigma \in \varSigma _n} V_i(\sigma ,\sigma _0)\cdot p^S(\sigma ). \end{aligned}$$

Therefore,

$$\begin{aligned} \sum _{i=1}^{n-1} {\bar{V}}_i= & {} \sum _{i=1}^{n-1} \sum _{\sigma \in \varSigma _n} V_i(\sigma ,\sigma _0)\cdot p^S(\sigma ) \\= & {} \sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma _0)\cdot p^S(\sigma ). \end{aligned}$$

So the maximum likelihood estimator of \(\sigma _0\) from the expected selected population is the following:

$$\begin{aligned} {\hat{\sigma }}_0= \arg \min _{\sigma \in \varSigma _n} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma )\cdot p^S(\pi ). \end{aligned}$$
(4)

The maximum likelihood estimator of \(\sigma _0\) might not be unique. In Sects. 4 and 5, we will observe some \(P^S\) probability distributions in which the estimated central permutation is not unique.
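Equation (4) can be solved by exhaustive search when n is small (an illustrative sketch; in general the underlying problem is NP-hard, as noted above, but here \(n!\) is tiny). The distributions below are toy examples:

```python
from itertools import permutations

def kendall_tau(s, p):
    return sum((s[i] < s[j]) != (p[i] < p[j])
               for i in range(len(s)) for j in range(i + 1, len(s)))

def central_permutations(perms, pS):
    """All minimizers of Eq. (4) over Sigma_n, by exhaustive search."""
    cost = {s: sum(kendall_tau(t, s) * pt for t, pt in zip(perms, pS))
            for s in perms}
    m = min(cost.values())
    return [s for s in perms if abs(cost[s] - m) < 1e-12]

n = 3
perms = list(permutations(range(1, n + 1)))
# a distribution concentrated around (1, 2, 3): its estimator is (1, 2, 3)
pS = [0.6 if s == (1, 2, 3) else 0.4 / 5 for s in perms]
assert central_permutations(perms, pS) == [(1, 2, 3)]
# under the uniform distribution every permutation is a minimizer
uniform = [1 / 6] * 6
assert len(central_permutations(perms, uniform)) == 6
```

The second example already shows a case in which the estimated central permutation is not unique.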

To estimate \(\theta \), we can use Eq. (2) in the same way as with finite populations and solve the following equation:

$$\begin{aligned} \sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma ) = \frac{n-1}{e^{\theta } -1} - \sum _{i=1}^{n-1} \frac{n-i+1}{e^{(n-i+1)\theta }-1}. \end{aligned}$$
(5)

Throughout this work, two observations related to the estimation of the spread parameter are considered. Firstly, the right-hand side of Eq. (5) is not defined when \(\theta = 0\). Still, the right-hand side of Eq. (5) tends to \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\) when \(\theta \) tends to 0 and \(\theta = 0\) is a removable singularity (see Proof in Proposition 1 of Appendix A).
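This removable singularity can be checked numerically (an illustrative sketch; `math.expm1` avoids cancellation near \(\theta = 0\)). The check also covers the symmetry \(g(\theta )+g(-\theta )=\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) used in the proof of Lemma 1:

```python
import math

def g(theta, n):
    """Right-hand side of Eq. (5); math.expm1 keeps e^x - 1 stable near 0."""
    return ((n - 1) / math.expm1(theta)
            - sum((n - i + 1) / math.expm1((n - i + 1) * theta)
                  for i in range(1, n)))

n = 6
half = n * (n - 1) / 4                      # C(n, 2) / 2
for theta in (1e-2, 1e-4, 1e-6):
    # the singularity at 0 is removable: g(theta) -> C(n, 2) / 2
    assert abs(g(theta, n) - half) < 10 * theta
    # symmetry used in the proof of Lemma 1: g(t) + g(-t) = C(n, 2)
    assert abs(g(theta, n) + g(-theta, n) - 2 * half) < 1e-6
```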

Secondly, the following lemma proves that, when the estimated central permutation is unique, the estimated spread parameter has a positive value. It is worth mentioning that Lemma 1 is independent of the objective function f and the iteration i of the algorithm.

Lemma 1

Let \(P_i\) be a Mallows probability distribution with central permutation \(\sigma _0\) and spread parameter \(\theta \ge 0\), and \(P_i^S\) the probability distribution after a 2-tournament selection over \(P_i\). Let \({\hat{\sigma }}_0\) be the unique estimator of the central permutation of \(P_{i+1}\). Then, the value \({\hat{\theta }}\) which solves the following equation

$$\begin{aligned} \sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma ) = \frac{n-1}{e^{{\hat{\theta }}} -1} - \sum _{i=1}^{n-1} \frac{n-i+1}{e^{(n-i+1){\hat{\theta }}}-1} \end{aligned}$$

is positive. Equivalently, \(\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma )\) is lower than \( \left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\).

Proof

First, let us consider the function g:

$$\begin{aligned} g(\theta ) = \left\{ \begin{array}{ll} \frac{n-1}{e^{\theta } -1} - \sum _{i=1}^{n-1} \frac{n-i+1}{e^{(n-i+1)\theta }-1}\ &{}\ \text { if } \theta \ne 0 \\ \frac{1}{2}\left( {\begin{array}{c}n\\ 2\end{array}}\right) \ &{}\ \text { if } \theta =0. \\ \end{array} \right. \end{aligned}$$
(6)

The function g is continuous and strictly decreasing, and it satisfies \(g(\theta )+g(-\theta )=\left( {\begin{array}{c}n\\ 2\end{array}}\right) \), \(\lim _{\theta \longrightarrow -\infty } g(\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \) and \(\lim _{\theta \longrightarrow \infty } g(\theta ) = 0\) (see the proof of Proposition 2 in Appendix A).
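The stated properties of g can be checked numerically. A minimal Python sketch of ours follows (`math.expm1` is used to evaluate \(e^{\theta }-1\) accurately near the removable singularity):

```python
import math

def g(theta, n):
    """The function g of Eq. (6); theta = 0 is the removable singularity."""
    if theta == 0:
        return n * (n - 1) / 4.0  # binom(n, 2) / 2
    # expm1(x) computes e^x - 1 without cancellation for small x
    return (n - 1) / math.expm1(theta) - sum(
        (n - i + 1) / math.expm1((n - i + 1) * theta) for i in range(1, n)
    )
```

For example, for \(n=5\), \(g(\theta )+g(-\theta )\) equals \(\left( {\begin{array}{c}5\\ 2\end{array}}\right) = 10\) up to floating-point error, g is decreasing, and \(g(\theta )\) approaches 0 for large \(\theta \).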

Secondly, for any \({\hat{\sigma }}_0\), the value \(\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma )\) lies in the interval \((0,\left( {\begin{array}{c}n\\ 2\end{array}}\right) )\). In particular,

$$\begin{aligned}&\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma ) + \sum _{\sigma \in \varSigma _n} d(\sigma ,I'{\hat{\sigma }}_0)\cdot p^S(\sigma ) \\&\quad = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \sum _{\sigma \in \varSigma _n} p^S(\sigma ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) . \end{aligned}$$

Considering that, by hypothesis, \({\hat{\sigma }}_0\) is the unique estimator of the central permutation of \(P_{i+1}\),

$$\begin{aligned} \sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma ) < \sum _{\sigma \in \varSigma _n} d(\sigma ,I'{\hat{\sigma }}_0)\cdot p^S(\sigma ) \end{aligned}$$

is obtained and therefore

$$\begin{aligned} \sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma ) < \frac{1}{2}\left( {\begin{array}{c}n\\ 2\end{array}}\right) . \end{aligned}$$

\(\square \)

The second observation is that in the approximation step of our algorithm, at any iteration, if \(P^S\) is a Mallows model with central permutation \(\sigma _0\) and spread parameter \(\theta \), then the learned Mallows model is the same one: \({\hat{\sigma }}_0=\sigma _0\) and \({\hat{\theta }}=\theta \). The argument to prove this observation is that the probabilities of the solutions are ordered inversely according to their distance to \(\sigma _0\). Hence, Eq. (4) obtains the minimum value at \(\sigma _0\) and it is unique. Furthermore, when \({\hat{\theta }}=\theta \), Eq. (5) is fulfilled because \(P^S\) is a Mallows model. Another way to understand this observation is that when we work with infinite population and the sampling step is not needed, the probability distribution is kept constant. To simplify notation, throughout this work, let us consider the uniform distribution as a Mallows model with central permutation \(\sigma _0 \in \varSigma _n\) and spread parameter 0.

In addition, throughout this work it is assumed that the algorithm learns the degenerate distribution \(1_{\sigma _k}\) if \(P^S=1_{\sigma _k}\). Note that \(1_{\sigma _k}\) is obtained as the limit distribution of \(\hbox {MM}({\sigma _k},{\theta })\) when \(\theta \) tends to infinity.

Once we have defined the selection operator and how we learn a new probability distribution, our operator G is defined. The schema of one iteration of the algorithm is the following:

$$\begin{aligned} P_i \mathop {\longrightarrow }^{\ \phi \ } P_i^S \mathop {\longrightarrow }^{\ a\ } P_{i+1} = G(P_i), \end{aligned}$$

where \(\phi \) is 2-tournament selection and a is the approximation step that learns a Mallows probability distribution by maximum likelihood estimation.

The aim of the following sections is to apply our proposed mathematical modeling in some scenarios. Each scenario considers an objective function f and an initial probability distribution \(P_0\). Our objective is to calculate \(G^{i}(P_0)\) when i tends to infinity. To do so, \(G^i(P_0)\) is calculated for \(i=1,2,3,\dotsc \), and the results are analyzed. In some particular cases, it is enough to calculate \(G(P_0)\) to deduce the limit behavior of the algorithm. For the most difficult cases, we study the fixed points of the algorithm and their attraction behavior, following the same ideas used in the literature (e.g., [14]).

In order to simplify the analysis and to present the tools and methods used to achieve our objectives, in this work we have considered three specific cases for the objective function. In Sect. 3, f is a constant function; in Sect. 4, f is a needle in a haystack function; and in Sect. 5, f is defined by a Mallows model. Objective functions such as the constant function and the needle in a haystack function have been used in many studies of different algorithms in the literature, whereas the Mallows model has been studied as an example of a unimodal objective function and to analyze the relation between the Mallows probability distributions learned by our dynamical system and the objective function. For these cases, we have considered \(P_0\) as a uniform distribution or a Mallows model.

3 Limiting behavior for a constant function

In these first scenarios, the function f to optimize is constant: \(f(\sigma )=c,\ \forall \sigma \in \varSigma _n\). Hence, any solution can be considered a global optimum. In this situation, it is proved that the algorithm keeps the initial probability distribution forever. We can summarize all the results from this section in Theorem 1.

Theorem 1

If f is a constant function and P a Mallows probability distribution, then \(G(P)=P\).

Proof

Starting from any Mallows model \(\hbox {MM}({\sigma _0},{\theta })\), let us observe the first iteration of the algorithm and calculate G(P). We prove that the selection operator keeps the distribution unchanged, and hence the learned parameters are \(\sigma _0\) and \(\theta \).

When f is a constant function, all the solutions are global optima. So, the selection probability of each solution is the same as the initial probability:

$$\begin{aligned} p^S(\sigma ) = p(\sigma ) , \forall \sigma \in \varSigma _n \Longrightarrow P^S=P. \end{aligned}$$

Given that \(P^S=P\), the next step of the algorithm is to estimate the parameters to learn a Mallows model from P. By the observation from Sect. 2.4 about the estimation of the parameters from a Mallows model, it is deduced that \({\hat{\sigma }}_0 = \sigma _0\) and \({\hat{\theta }} = \theta \). Consequently, it is proved that when f is a constant function, \(G(P)=P\) for any Mallows distribution P. \(\square \)
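The selection-invariance step can also be verified numerically: under a 2-tournament with two independent draws and ties broken uniformly at random (the tournament model assumed in this sketch, which is ours), a constant objective function returns exactly the input distribution.

```python
def tournament(p, f):
    """Exact 2-tournament selection probabilities.

    Two solutions are drawn independently from p; the one with the higher
    f-value is selected, and ties are broken uniformly at random.
    """
    p_s = {}
    for s in p:
        total = p[s] * p[s]  # both draws equal s
        for t in p:
            if t == s:
                continue
            win = 1.0 if f[s] > f[t] else (0.5 if f[s] == f[t] else 0.0)
            total += 2 * p[s] * p[t] * win  # s drawn together with t, s wins
        p_s[s] = total
    return p_s

# With a constant fitness function, p^S(s) = p(s)^2 + p(s)(1 - p(s)) = p(s).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
f = {"a": 1.0, "b": 1.0, "c": 1.0}
p_s = tournament(p, f)
```

The solution labels and probabilities above are arbitrary; any distribution gives \(p^S = p\) when f is constant.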

4 Limiting behavior for a needle in a haystack function

In the next case, f is a needle in the haystack function centered at \(\sigma ^*\); the function is constant except for one solution \(\sigma ^*\), which is the optimal solution. Let us define

$$\begin{aligned} f(\sigma )=\left\{ \begin{array}{ll} c\ &{}\ \ \sigma \ne \sigma ^* \\ c'\ &{}\ \ \sigma =\sigma ^* \end{array} \right. \end{aligned}$$

such that \(c' > c\).

In this section, the analysis focuses on the evolution and the convergence behavior of the algorithm when the fitness function can only take two possible values: one value for the optimal solution and a second value for any other solution. The analysis is separated into three sections. In Sect. 4.1, the case when \(P_0\) is a uniform distribution is considered. In this particular case, the main procedure of the algorithm is shown and some general results are explained. As a consequence of this analysis, the case when \(P_0\) is a Mallows model centered at \(\sigma ^*\) is analyzed in Sect. 4.2. Finally, in Sect. 4.3, \(P_0\) is a Mallows model centered at \(\sigma _0 \ne \sigma ^*\). In this case, a general observation covering the remaining Mallows models is presented. To do so, the fixed points of the algorithm are calculated.

4.1 \(P_0\) a uniform initial probability distribution

In this section it is proved that when the initial probability distribution is a Mallows distribution centered at the optimal solution of the needle in the haystack function (including the uniform distribution, viewed as the case \(\theta _0 = 0\)), the algorithm converges to the degenerate distribution centered at the optimum. The result obtained in this section can be summarized in the following lemma.

Lemma 2

Let f be a needle in a haystack function centered at \(\sigma ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _0 \ge 0\). Then, the proposed EDA always converges to the degenerate distribution centered at \(\sigma ^*\).

Proof

Let us start the proof with the case in which \(P_0\) is a uniform distribution. In order to calculate the limit behavior of the algorithm, let us start by calculating \(G(P_0)\), beginning with the computation of \(P_0^S\). There are two cases to analyze in the selection step. If \(\sigma ^*\) is chosen to take part in the tournament, then it has a function value at least as high as that of any other permutation, so \(\sigma ^*\) is always selected. The permutations \(\sigma \ne \sigma ^*\) behave in the same way as when f is a constant function. So the probability after selection is as follows:

$$\begin{aligned} p_0^S (\sigma )= \left\{ \begin{array}{ll} p_0(\sigma )(2-p_0(\sigma ))\ &{}\ \ \sigma =\sigma ^* \\ p_0(\sigma )(1-p_0(\sigma ^*))\ &{}\ \ \sigma \ne \sigma ^*. \end{array} \right. \end{aligned}$$
(7)

This same argument can be used for any iteration of the algorithm for the selection operator.
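Equation (7) can be cross-checked against an exact enumeration of the 2-tournament. The sketch below is ours and assumes two independent draws from p with ties broken uniformly at random, which is the tie-breaking rule consistent with Eq. (7); the labels are hypothetical.

```python
def niah_selection(p, star):
    """Selection probabilities of Eq. (7) for a needle-in-a-haystack function."""
    return {s: p[s] * (2 - p[s]) if s == star else p[s] * (1 - p[star]) for s in p}

def tournament(p, f):
    """Exact 2-tournament: two independent draws, ties broken uniformly."""
    win = lambda s, t: 1.0 if f[s] > f[t] else (0.5 if f[s] == f[t] else 0.0)
    return {s: p[s] * p[s] + sum(2 * p[s] * p[t] * win(s, t) for t in p if t != s)
            for s in p}
```

For any p and a single solution `star` of strictly higher fitness, the two computations agree term by term.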

After the selection probability has been computed, let us study the estimation of the parameters for the Mallows models. Let us start with the estimation of the central permutation in different iterations of the algorithm, and after that, the estimated spread parameters.

At the first iteration of Algorithm 2, in order to calculate \({\hat{\sigma }}_0\) for \(P_1\), it is necessary to calculate the solution of Eq. (4) using \(P_0^S\). Bear in mind that for any \(\sigma \ne \sigma ^*\),

$$\begin{aligned}&\sum _{\pi \in \varSigma _n \backslash \{\sigma ,\sigma ^*\}} d(\pi ,\sigma )\cdot p_0^S(\pi ) \nonumber \\&\qquad = \sum _{\pi \in \varSigma _n \backslash \{\sigma ,\sigma ^*\}} d(\pi ,\sigma ^*)\cdot p_0^S(\pi ) \end{aligned}$$
(8)

because the selection probabilities for all the permutations except \(\sigma ^*\) are the same, and the right invariance property over the Kendall tau distance ensures that the number of solutions at each distance is the same: that is, for a fixed \(d \in \{0,\dotsc ,D\}\), \(|\{ \pi \in \varSigma _n : d(\pi ,\sigma )=d \}|\) is constant for any \(\sigma \in \varSigma _n\) (see Definition 2).

Let \(\sigma \ne \sigma ^*\). Thus, \(d(\sigma ,\sigma ^*)=d>0\). Therefore, considering Eq. (8) and \(p_0^S(\sigma ^*)>p_0^S(\sigma )\),

$$\begin{aligned} \begin{aligned}&\sum _{\pi \in \varSigma _n} d(\pi ,\sigma )\cdot p_0^S(\pi ) \\&\quad = \sum _{\pi \in \varSigma _n \backslash \{\sigma ,\sigma ^*\}} d(\pi ,\sigma )\cdot p_0^S(\pi ) \\&\qquad + d \cdot p_0^S(\sigma ^*) + 0 \cdot p_0^S(\sigma ) \\&\quad > \sum _{\pi \in \varSigma _n \backslash \{\sigma ,\sigma ^*\}} d(\pi ,\sigma ^*)\cdot p_0^S(\pi ) \\&\qquad + d \cdot p_0^S(\sigma ) + 0 \cdot p_0^S(\sigma ^*) = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_0^S(\pi ), \end{aligned} \end{aligned}$$

and it proves that the maximum likelihood estimator of the central permutation is \(\sigma ^*\).

So \(P_1\) is a Mallows model with central permutation \(\sigma ^*\). Because of the uniqueness of the estimated central permutation and by Lemma 1, the estimated spread parameter of \(P_1\) is positive. In order to generalize the obtained results to any iteration of the algorithm, let us calculate the central permutation of \(P_2\). To determine \(P_1^S\), we apply Eq. (7) to \(P_1\). Accordingly, the lower the distance of a solution to \(\sigma ^*\), the higher its selection probability. Therefore, to calculate \(P_2\), we can repeat the same argument as above to prove that \({\hat{\sigma }}_0=\sigma ^*\). The same argument can be repeated for any iteration \(i > 2\).

Once it has been proved that \(\sigma ^*\) is the estimated central permutation for the learned Mallows model at any iteration of the algorithm, let us study the estimation of \(\theta \). As we have mentioned previously, there is no closed formula for the solution of Eq. (5). Hence, instead of calculating the value of \(\theta \), we follow a different avenue to prove the limiting behavior of the algorithm. Knowing by Lemma 1 that the estimated spread parameter \({\hat{\theta }}\) at any iteration of the algorithm is positive, we prove that the estimated spread parameter increases in two consecutive iterations.

In particular, Eq. (5) is analyzed to determine whether the spread parameter at iteration \(i+1\) is higher or lower than the spread parameter at iteration i. To this end, two consecutive iterations are considered and the difference between \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_i^S(\sigma )\) and \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_{i+1}^S(\sigma )\) is analyzed. Without loss of generality, let us analyze the relation when \(i=0\).

The difference between the values of the left-hand side of Eq. (5) depends on the values \(p_0^S(\sigma )\) and \(p_1^S(\sigma )\), \(\forall \sigma \in \varSigma _n\). First, remember that \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_0^S(\sigma )\) was used to calculate the spread parameter of the Mallows probability distribution \(P_1\). Hence, by the definition of the operator a, \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_0^S(\sigma )\) and \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1(\sigma )\) are the same value (this is the same argument used in Sect. 2.4 to show that estimating from a Mallows model returns the same parameters). Let us denote \({C= \sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1(\sigma )}\) and compare it with \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1^S(\sigma )\). Using Eq. (7) for \(P_1\),

$$\begin{aligned} \begin{aligned}&\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1^S(\sigma ) \\&\quad = \sum _{\sigma \in \varSigma _n \backslash \{\sigma ^*\}} d(\sigma ,\sigma ^*)\cdot p_1^S(\sigma ) \\&\quad \mathop {=}^{(\text {7})} \sum _{\sigma \in \varSigma _n \backslash \{\sigma ^*\}} d(\sigma ,\sigma ^*)\cdot p_1(\sigma )\left( 1-p_1(\sigma ^*)\right) \\&\quad = C\left( 1-p_1(\sigma ^*)\right) < C. \end{aligned} \end{aligned}$$

This implies that the left-hand side of Eq. (5) decreases between two consecutive iterations. Hence, as the function g defined in Eq. (6) is strictly decreasing in \(\theta \), the spread parameter increases after one iteration of the algorithm. So \(\theta _2\) is higher than \(\theta _1\).

Using the same reasoning for any iteration, we can observe that at each iteration \(p(\sigma ^*)\) increases, whereas \(p(\sigma )\) decreases for all \(\sigma \ne \sigma ^*\). Moreover,

$$\begin{aligned}&\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_j^S(\sigma ) \\&\quad = C\left( 1-p_1(\sigma ^*)\right) \left( 1-p_2(\sigma ^*)\right) \cdots (1-p_j(\sigma ^*)) < C\left( 1-p_1(\sigma ^*)\right) ^j \mathop {\longrightarrow }_{j \rightarrow \infty } 0. \end{aligned}$$

Consequently, \( \theta \) tends to infinity when the number of iterations increases.

Therefore, when our modeling is applied to a needle in a haystack function departing from a uniform distribution, the algorithm converges to a Mallows model with central permutation \(\sigma ^*\) and a spread parameter \(\theta \) which tends to infinity. Hence, the distribution in the limit is concentrated around \(\sigma ^*\). \(\square \)
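The limiting behavior proved above can be reproduced numerically for small n by iterating the operator G exactly over \(\varSigma _n\). The following sketch is ours: it implements the selection step via Eq. (7), estimates the central permutation via Eq. (4) by exhaustive search, and obtains the spread parameter by bisection on the strictly decreasing function g of Eq. (6).

```python
import math
from itertools import permutations

def kendall_tau(p, q):
    pos = {v: i for i, v in enumerate(q)}
    r = [pos[v] for v in p]
    return sum(1 for i in range(len(r)) for j in range(i + 1, len(r)) if r[i] > r[j])

def g(theta, n):
    # right-hand side of Eq. (5); expm1 avoids cancellation for small theta
    return (n - 1) / math.expm1(theta) - sum(
        (n - i + 1) / math.expm1((n - i + 1) * theta) for i in range(1, n))

def mallows(sigma0, theta, perms):
    w = {pi: math.exp(-theta * kendall_tau(pi, sigma0)) for pi in perms}
    z = sum(w.values())
    return {pi: v / z for pi, v in w.items()}

def G(p, star, perms, n):
    # selection step: Eq. (7) for a needle in a haystack centered at star
    ps = {pi: p[pi] * (2 - p[pi]) if pi == star else p[pi] * (1 - p[star])
          for pi in perms}
    # central permutation: Eq. (4), by exhaustive search
    s0 = min(perms, key=lambda s: sum(kendall_tau(pi, s) * ps[pi] for pi in perms))
    # spread parameter: Eq. (5), by bisection (g is strictly decreasing)
    target = sum(kendall_tau(pi, s0) * ps[pi] for pi in perms)
    lo, hi = 1e-9, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid, n) > target else (lo, mid)
    return mallows(s0, (lo + hi) / 2, perms)

n = 3
perms = list(permutations(range(n)))
star = (0, 1, 2)
p = {pi: 1 / len(perms) for pi in perms}  # P_0 uniform
trace = []
for _ in range(8):
    p = G(p, star, perms, n)
    trace.append(p[star])
# trace increases monotonically and p(sigma*) approaches 1
```

Running this for \(n=3\) shows \(p(\sigma ^*)\) growing monotonically toward 1, in agreement with Lemma 2.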

4.2 \(P_0\) a Mallows probability distribution with central permutation \(\sigma ^*\) and spread parameter \(\theta _0\)

This case is the same as the one in Sect. 4.1 after the first iteration. Hence, the algorithm converges to a degenerate distribution centered at \(\sigma ^*\).

4.3 \(P_0\) a Mallows probability distribution with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\)

Due to the difficulty of this case in comparison with the previous ones, the analysis of the convergence behavior of the algorithm is made from a new point of view. In this section, our objectives are to study the possible fixed points of the algorithm and to analyze the behavior of our dynamical system. Our first objective is to calculate the fixed points of the algorithm. A probability distribution is a fixed point of the algorithm if, after one iteration, the algorithm does not estimate a different probability distribution: that is to say, \(G(P)=P\). Consequently, the algorithm will always estimate the same probability distribution.

In Sect. 4.3, the following proof idea is used:

  (i) In Sect. 4.3.1, the fixed points of the algorithm are calculated.

    • First, it is proved that any degenerate distribution is a fixed point.

    • Then, the non-degenerate fixed points are calculated.

  (ii) In Sect. 4.3.2, the attraction of the fixed points is studied.

  (iii) Finally, in Sect. 4.3.3, the performance of the algorithm is analyzed for different initial probability distributions \(P_0\).

4.3.1 Computation of the fixed points

For the first aim of Sect. 4.3, let us calculate the fixed points of our dynamical system G. First, note that any degenerate distribution is a fixed point of the discrete dynamical system G. The selection probability departing from \(1_{\sigma _k}\) is:

$$\begin{aligned} p^S(\sigma ) = \left\{ \begin{array}{ll} 1\ &{}\ \text { if } \sigma = \sigma _k \\ 0\ &{}\ \text { otherwise.} \end{array} \right. \end{aligned}$$

Therefore, the probabilities of the solutions after the selection operator keep the same values of \(1_{\sigma _k}\), that is, \(P^S=1_{\sigma _k}=P\). Hence, bearing in mind that in Sect. 2.4 it has been assumed that the estimated model from a degenerate distribution is the same degenerate distribution, \(G(1_{\sigma _k})=1_{\sigma _k}\) is obtained.

However, the degenerate distributions are not the only fixed points of the discrete dynamical system G. By definition of G, any Mallows probability distribution for which the algorithm learns the same distribution is a fixed point; in other words, if, after the selection operator, the algorithm estimates the same central permutation and spread parameter as in the previous distribution, then the Mallows probability distribution is a fixed point. In Lemma 3, this idea is formalized, showing two conditions that are sufficient for a Mallows probability distribution to be a fixed point.

Lemma 3

Let P be a Mallows probability distribution with central permutation \(\sigma _0\) and spread parameter \(\theta _0 < \infty \). If for all \(\sigma \ne \sigma _0\),

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p^S(\pi ) < \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p^S(\pi ) \end{aligned}$$
(9)

and

$$\begin{aligned} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \end{aligned}$$
(10)

are fulfilled, then \(G(P)=P\).

Proof

By the maximum likelihood estimator of the parameters of the Mallows model, Inequality (9) ensures \({\hat{\sigma }}_0=\sigma _0\). In order to prove that \({\hat{\theta }}=\theta _0\), considering by hypothesis that P is a Mallows model and by Eqs. (5) and (10),

$$\begin{aligned} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) = \frac{n-1}{e^{\theta _0} -1} - \sum _{i=1}^{n-1} \frac{n-i+1}{e^{(n-i+1)\theta _0}-1}. \end{aligned}$$

\(\square \)

Inequality (9) ensures \({\hat{\sigma }}_0=\sigma _0\) and Eq. (10) yields \({\hat{\theta }}=\theta _0\). Inequality (9) and Eq. (10) can be chained together: for all \(\sigma \ne \sigma _0\),

$$\begin{aligned}&\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p^S(\pi ) \mathop {>}^{{\hat{\sigma }}_0=\sigma _0} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p^S(\pi )\nonumber \\&\quad \mathop {=}^{{\hat{\theta }}=\theta _0} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) . \end{aligned}$$
(11)

Lemma 3 presents a sufficient condition for a distribution to be a fixed point of the algorithm. Unfortunately, the condition is not necessary because of one very particular case: when \(G(P)=P\), it cannot be ensured that \(\sigma _0\) is the unique minimizer in Inequality (9) (other permutations may also attain the minimum), even if \({\hat{\theta }}=\theta _0\). If \(\sigma _0\) is the unique solution of Inequality (9), then the condition of Lemma 3 is also necessary. To avoid these specific scenarios and the equality case in Inequality (9), which represent zero Lebesgue measure sets, from now on we will consider that \(\sigma _0\) is the estimated central permutation. In practice, the EDA can be designed with a preference criterion for breaking ties.

Based on Lemma 3, our next objective is to obtain sufficient conditions for fixed points of the algorithm when f is a needle in a haystack function. First, we study when \({\hat{\theta }}=\theta _0\) holds, and then whether \({\hat{\sigma }}_0=\sigma _0\) is satisfied. Let us study Eq. (10):

$$\begin{aligned} \begin{aligned}&\ \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )=\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \\&\quad \mathop {\Longleftrightarrow }^{(\text {7})} \ p(\sigma ^*)\cdot d(\sigma ^*,\sigma _0) + (1-p(\sigma ^*))\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \\&\quad = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \\&\quad \Longleftrightarrow \ p(\sigma ^*)\cdot d(\sigma ^*,\sigma _0) = p(\sigma ^*)\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \\&\quad \Longleftrightarrow \ d(\sigma ^*,\sigma _0) = \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ). \end{aligned} \end{aligned}$$
(12)

From Eq. (12) we can deduce that \(\hbox {MM}({\sigma _0},{\theta _0})\) is not a fixed point if \(d(\sigma ^*,\sigma _0) \ge D/2\). This is due to the fact that the right-hand side of Eq. (5) tends to 0 when \(\theta \) tends to infinity and the supremum of \(\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi )\) is D/2. Note that this also means that if we start with \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\) such that \(d(\sigma ^*,\sigma _0) \ge D/2\), then the algorithm can only converge to a solution \(\sigma \ne \sigma _0\) such that \(d(\sigma ^*,\sigma ) < D/2\).
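Since the expected distance of a Mallows model to its own central permutation is given by the function g of Eq. (6), the fixed-point condition of Eq. (12) reduces to solving \(g(\theta _0)=d(\sigma ^*,\sigma _0)\), which is solvable precisely when \(0< d^* < D/2\). A bisection sketch of ours, under this reduction:

```python
import math

def g(theta, n):
    """Expected Kendall tau distance under a Mallows model with spread theta."""
    return (n - 1) / math.expm1(theta) - sum(
        (n - i + 1) / math.expm1((n - i + 1) * theta) for i in range(1, n))

def theta_fixed_point(d_star, n, tol=1e-12):
    """Solve g(theta) = d_star (Eq. (12)); requires 0 < d_star < D/2."""
    D = n * (n - 1) // 2
    if not 0 < d_star < D / 2:
        return None  # no non-degenerate fixed point at this distance
    lo, hi = 1e-9, 100.0
    while hi - lo > tol:  # g is strictly decreasing, so bisection applies
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid, n) > d_star else (lo, mid)
    return (lo + hi) / 2
```

For \(n=4\) (so \(D=6\)), a solution exists for \(d^*=1,2\) but not for \(d^*=3\), and a larger \(d^*\) yields a smaller spread parameter.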

Let us observe whether \({\hat{\sigma }}_0=\sigma _0\) is fulfilled when \({\hat{\theta }}=\theta _0\) and \(d(\sigma ^*,\sigma _0) < D/2\) (considering the case that the estimated central permutation is unique):

$$\begin{aligned} {\hat{\sigma }}_0= & {} \sigma _0 \Longleftrightarrow \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p^S(\pi ) \nonumber \\> & {} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p^S(\pi ), \forall \sigma \ne \sigma _0. \end{aligned}$$
(13)

The right-hand side of Inequality (13) is simplified by means of Eq. (12):

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p^S(\pi ) > d(\sigma ^*,\sigma _0), \forall \sigma \ne \sigma _0. \end{aligned}$$

By the definition of the selection probability (Eq. (7)),

$$\begin{aligned}&p(\sigma ^*)d(\sigma ^*,\sigma ) + (1-p(\sigma ^*))\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi ) \\&\quad > d(\sigma ^*,\sigma _0), \forall \sigma \ne \sigma _0. \end{aligned}$$

Solving for the summation in the left-hand side of the inequality,

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi ) > \frac{ d(\sigma ^*,\sigma _0) -p(\sigma ^*)d(\sigma ^*,\sigma )}{1-p(\sigma ^*)},\forall \sigma \ne \sigma _0. \nonumber \\ \end{aligned}$$
(14)

The value of the right-hand side of Inequality (14) varies according to \(d(\sigma ^*,\sigma )\). In order to avoid repeating the same proof for different values of \(d(\sigma ^*,\sigma )\), let us consider the maximum possible value of the right-hand side of Inequality (14), which is the worst possible case, and prove the inequality for it. Bounding the numerator \(d(\sigma ^*,\sigma _0)-p(\sigma ^*)d(\sigma ^*,\sigma )\) from above by \(d(\sigma ^*,\sigma _0)\), we obtain the following inequality:

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi ) > \frac{d(\sigma ^*,\sigma _0)}{1-p(\sigma ^*)}. \end{aligned}$$
(15)

On the left-hand side of Inequality (15), the sum depends on \(\sigma \). In order to prove the inequality for all \(\sigma \ne \sigma _0\), let us take its smallest possible value. Considering that P is a Mallows model centered at \(\sigma _0\), the probabilities are ordered according to their distance to \(\sigma _0\). So, among the solutions in \(\varSigma _n \backslash \{\sigma _0\}\), any solution \(\sigma \) at distance 1 from \(\sigma _0\) attains the lowest value of \(\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi )\), because \(d(\pi ,\sigma )=d(\pi ,\sigma _0)\pm 1\). Rewriting the previous inequality for a solution \(\sigma \) at distance 1 from \(\sigma _0\),

$$\begin{aligned} \begin{aligned}&\ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi ) = \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ) \\&\qquad + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)< d(\pi ,\sigma ) \end{array}}p(\pi ) - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}}p(\pi )> \frac{d(\sigma ^*,\sigma _0)}{1-p(\sigma ^*)} \\&\quad \mathop {\Longleftrightarrow }^{(\text {12})} \ d(\sigma ^*,\sigma _0) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)< d(\pi ,\sigma ) \end{array}}p(\pi ) \\&\qquad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}}p(\pi )> \frac{d(\sigma ^*,\sigma _0)}{1-p(\sigma ^*)} \\&\quad \Longleftrightarrow \ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0) < d(\pi ,\sigma ) \end{array}}p(\pi ) \\&\qquad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}}p(\pi ) > \frac{p(\sigma ^*)d(\sigma ^*,\sigma _0)}{1-p(\sigma ^*)} . \end{aligned} \end{aligned}$$
(16)

In order to simplify the previous equation, let us introduce some new notation and definitions.

Definition 2

For any \(\sigma \) in \(\varSigma _n\) and \(d=0,\dotsc ,D\), let us denote

$$\begin{aligned} m_n(d) = |\{ \pi \in \varSigma _n : d(\pi ,\sigma )=d \}|. \end{aligned}$$

The sequence A008302 in The On-Line Encyclopedia of Integer Sequences (OEIS) [26] shows the first values and some properties of \(m_n(d)\) numbers.
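The numbers \(m_n(d)\) can be generated as the coefficients of the polynomial \(\prod _{i=1}^{n}(1+q+\dotsb +q^{i-1})\), a standard identity consistent with sequence A008302. A short Python sketch of ours:

```python
def mahonian(n):
    """Coefficients m_n(0..D) of prod_{i=1}^{n} (1 + q + ... + q^{i-1})."""
    coeffs = [1]
    for i in range(2, n + 1):
        new = [0] * (len(coeffs) + i - 1)
        for d, c in enumerate(coeffs):
            for k in range(i):  # multiply by 1 + q + ... + q^{i-1}
                new[d + k] += c
        coeffs = new
    return coeffs
```

For example, `mahonian(4)` returns `[1, 3, 5, 6, 5, 3, 1]`, a symmetric sequence summing to \(4! = 24\).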

Definition 3

For any \(\sigma \) and \(\tau \) in \(\varSigma _n\) such that \(d(\sigma ,\tau )=1\), and \(d=0,\dotsc ,D\), let us denote

$$\begin{aligned} {\mathcal {D}}_d=\{ \pi \in \varSigma _n : d(\pi ,\sigma )=d \text { and } d(\pi ,\tau )=d+1 \} \end{aligned}$$

and \(m_n^1(d) = |{\mathcal {D}}_d|\).

The sequence of non-negative numbers \(m_n^1(d)\) has been added to the OEIS [26] (sequence A307429) by the authors of the present paper, and several of its properties are explained in Appendix B. To rewrite Inequality (16), Properties (ii), (iii) and (iv) from Appendix B have been used. These state that \({m_n(d) = m_n^1(d) + m_n^1(d-1)}\), \(m_n^1(d)={m_n^1(D-d-1)}\) and that \({m_n^1(d) > m_n^1(d-1)}\) when \(d \in \{0,\dotsc ,d_{max}\}\), where \(d_{max}=(D/2)-1\) when D is even and \(d_{max}=\lfloor D/2 \rfloor \) when D is odd. Remembering that \(\varphi (\theta )=\sum _{\sigma \in \varSigma _n}e^{-\theta d(\sigma ,\sigma _0)}\) is the normalization constant of the Mallows probability distribution, Inequality (16) can be rewritten in the following way (let us denote \(d(\sigma ^*,\sigma _0)=d^*\)):

$$\begin{aligned} \begin{aligned}&\ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)< d(\pi ,\sigma ) \end{array}}p(\pi ) - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}}p(\pi )> \frac{p(\sigma ^*)d(\sigma ^*,\sigma _0)}{1-p(\sigma ^*)} \\&\quad \Longleftrightarrow \ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)< d(\pi ,\sigma ) \end{array}} \frac{e^{-{\hat{\theta }}d(\pi ,\sigma _0)}}{\varphi ({\hat{\theta }})} - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}} \frac{e^{-{\hat{\theta }}d(\pi ,\sigma _0)}}{\varphi ({\hat{\theta }})} \\&\quad> \frac{e^{-d^* {\hat{\theta }}}}{\varphi ({\hat{\theta }})} \cdot \varphi ({\hat{\theta }}) \cdot \frac{d^*}{\varphi ({\hat{\theta }})-e^{-d^* {\hat{\theta }}}} \\&\quad \Longleftrightarrow \ (\varphi ({\hat{\theta }})-e^{-d^*{\hat{\theta }}}) \left( \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0) < d(\pi ,\sigma ) \end{array}} e^{-{\hat{\theta }}d(\pi ,\sigma _0)} \right. \\&\left. 
\qquad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma _0)> d(\pi ,\sigma ) \end{array}} e^{-{\hat{\theta }}d(\pi ,\sigma _0)} \right)> d^* \cdot \varphi ({\hat{\theta }}) \cdot e^{-d^* {\hat{\theta }}} \\&\quad \Longleftrightarrow \ (\varphi ({\hat{\theta }})-e^{-d^*{\hat{\theta }}})\sum _{i=0}^D \left( m_n^1(i)-m_n^1(i-1) \right) e^{-i {\hat{\theta }}} \\&\quad> d^*\cdot \varphi ({\hat{\theta }})\cdot e^{-d^*{\hat{\theta }}} = \sum _{i=0}^D d^*\cdot m_n(i)\cdot e^{-(d^*+i){\hat{\theta }}} \\&\quad \Longleftrightarrow \ \varphi ({\hat{\theta }})\sum _{i=0}^D \left( m_n^1(i)-m_n^1(i-1) \right) e^{-i {\hat{\theta }}} \\&\quad> \sum _{i=0}^D \left( m_n^1(i)-m_n^1(i-1) + d^*\cdot m_n(i) \right) e^{-(d^*+i){\hat{\theta }}} \\&\quad \Longleftrightarrow \ \sum _{i=0}^D \sum _{j=0}^D m_n(i)\cdot \left( m_n^1(j)-m_n^1(j-1) \right) e^{-(i+j){\hat{\theta }}}\\&\quad > \sum _{i=0}^D \left( (d^*+1)m_n^1(i) + (d^*-1)m_n^1(i-1) \right) e^{-(d^*+i){\hat{\theta }}} . \end{aligned} \end{aligned}$$
(17)

The proof of Inequality (17) is shown in Appendix C. Therefore, the learned central permutation from \(P \sim \) \(\hbox {MM}({\sigma _0},{{\hat{\theta }}})\) is \(\sigma _0\). To sum up, \(P \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\) is a fixed point if \(d(\sigma ^*,\sigma _0) < D/2\) and \(\theta _0\) fulfills Eq. (12).
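The combinatorial quantities \(m_n(d)\) and \(m_n^1(d)\) appearing in Inequality (17) can be cross-checked by brute force for small n. The sketch below is ours; it takes \(\sigma \) as the identity and \(\tau \) as an adjacent transposition of it (so \(d(\sigma ,\tau )=1\)), which suffices because the definitions do not depend on the concrete pair, and it verifies Properties (ii) and (iii).

```python
from itertools import permutations

def kendall_tau(p, q):
    pos = {v: i for i, v in enumerate(q)}
    r = [pos[v] for v in p]
    return sum(1 for i in range(len(r)) for j in range(i + 1, len(r)) if r[i] > r[j])

def m(n):
    """Brute-force m_n(d) of Definition 2, computed around the identity."""
    sigma = tuple(range(n))
    counts = [0] * (n * (n - 1) // 2 + 1)
    for pi in permutations(range(n)):
        counts[kendall_tau(pi, sigma)] += 1
    return counts

def m1(n):
    """Brute-force m_n^1(d) of Definition 3 for d = 0..D."""
    sigma = tuple(range(n))
    tau = (1, 0) + tuple(range(2, n))  # adjacent transposition: d(sigma, tau) = 1
    counts = [0] * (n * (n - 1) // 2 + 1)
    for pi in permutations(range(n)):
        if kendall_tau(pi, tau) == kendall_tau(pi, sigma) + 1:
            counts[kendall_tau(pi, sigma)] += 1
    return counts
```

For \(n=4\) one obtains \(m_4 = (1,3,5,6,5,3,1)\) and \(m_4^1 = (1,2,3,3,2,1,0)\), satisfying \(m_n(d) = m_n^1(d) + m_n^1(d-1)\) and \(m_n^1(d)=m_n^1(D-d-1)\).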

Fig. 1
Spread parameter values for which Eq. (12) (continuous lines) and Eq. (13) (dashed lines) are fulfilled. Each line corresponds to a value of n (\(n=4,5,6,7\)) and each point to a distance d (\(d=1,\dotsc , \lceil D/2 \rceil -1\))

4.3.2 Attraction of the fixed points

In Sect. 4.3.1, all the fixed points of the algorithm, degenerate and non-degenerate, have been studied. Let us call a fixed point of the dynamical system attractive if any Mallows model P near the fixed point converges to it: that is to say, any P that has the same central permutation estimator as the fixed point and a spread parameter value \(\theta \) “close” to \({\hat{\theta }}\) (in the limit sense) converges to the fixed point. In addition, from the study of the fixed points, several observations can be derived.

For example, from Eq. (12), the attraction of the non-degenerate fixed points can be fully deduced. Let us denote by \({\hat{\theta }}_{d^*}\) the minimum spread parameter value which fulfills Eq. (12) for a given \(d(\sigma ^*,\sigma _0)\). In Eq. (10), \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) are compared to determine when the estimated spread parameter remains the same. Let us denote by \({\hat{\theta }}\) the spread parameter value which fulfills Eq. (10). However, \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) can also be compared for any other spread parameter value \(\theta _0\). When \(\theta _0 < {\hat{\theta }}_{d^*}\), then \(d(\sigma ^*,\sigma _0) < \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) < \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\), and consequently the learned spread parameter is greater than \(\theta _0\); when \(\theta _0 > {\hat{\theta }}_{d^*}\), the learned spread parameter decreases. This observation shows that the non-degenerate fixed points are attractive.

Another observation is that for a sufficiently large \(\theta _0\) we obtain \(d(\sigma ^*,\sigma _0) > \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) and, consequently, \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) > \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\), which implies that \({\hat{\theta }}_0 < \theta _0\). Hence, none of the degenerate fixed points centered at \(\sigma \ne \sigma ^*\) is attractive. Consequently, the algorithm ends in a non-degenerate fixed point centered at \(\sigma \ne \sigma ^*\) or in the degenerate distribution centered at \(\sigma ^*\).

Moreover, Eq. (13) gives the condition under which \(\sigma _0\) is estimated as the central permutation. Hence, there exists a spread parameter value \({\tilde{\theta }}_{d^*}\) (dependent on \(d(\sigma ^*,\sigma _0)<D/2\)) such that if \(\theta _0 < {\tilde{\theta }}_{d^*}\), then the estimated central permutation is not \(\sigma _0\). If \(\theta _0 ={\tilde{\theta }}_{d^*}\), then the algorithm can estimate more than one central permutation and its behavior depends on which one is estimated. However, we will not focus on those exact Mallows models because they represent a set of zero Lebesgue measure. In Fig. 1, the first values of \({\hat{\theta }}\) which fulfill Eq. (12) and \({\tilde{\theta }}_{d^*}\) are displayed for \(n=4,5,6\) and 7 and their respective \(d^*\) values, illustrating the proved result. The y-axis is plotted in log scale to distinguish all the lines. In addition, for a fixed value of n, it can be verified that when d increases, Eq. (12) is fulfilled for a lower value \({\hat{\theta }}_d\), because the right-hand side of Eq. (12) is a decreasing function.

4.3.3 Convergence behavior of the algorithm

After analyzing the attraction of the fixed points, the next step is to study the evolution of the estimated Mallows models. That is, when the algorithm estimates a new central permutation different from \(\sigma _0\), is it possible to limit the number of scenarios of the algorithm in advance? Can we know which fixed point is the convergence point of the algorithm in any situation?

In many cases, the fixed point to which the algorithm converges can be identified. The main result about the convergence point of the algorithm is Lemma 4, which demonstrates that the estimated central permutation must belong to a set of solutions dependent on \(\sigma ^*\) and \(\sigma _0\). In addition, for any \(\sigma _0\), there exists a spread parameter value \({\tilde{\theta }}(\sigma _0)\) such that if \(\theta _0 < {\tilde{\theta }}(\sigma _0)\), then the algorithm estimates a new central permutation different from \(\sigma _0\).

In order to prove Lemma 4, let us consider Definition 4.

Definition 4

Let \(\varSigma _n\) be the search space with metric \(d(\cdot ,\cdot )\). Let \(\sigma \) and \(\pi \) be two solutions of \(\varSigma _n\). Then, the segment from \(\sigma \) to \(\pi \), denoted \(C(\sigma ,\pi )\), is the set of permutations \(\tau \in \varSigma _n\) such that \(\sigma \), \(\pi \) and \(\tau \) fulfill the triangle inequality with equality:

$$\begin{aligned} C(\sigma ,\pi )=\{\tau \in \varSigma _n: d(\sigma ,\tau )+d(\tau ,\pi ) = d(\sigma ,\pi ) \}. \end{aligned}$$

Let us call \(\tau \in C(\sigma ,\pi )\) a solution between \(\sigma \) and \(\pi \). Hence, \(C(\sigma ,\pi )\) is the set that includes all the permutations between \(\sigma \) and \(\pi \). Let us call the segment from \(\sigma \) to \(\pi \) unique when \(|C(\sigma ,\pi )|=d(\sigma ,\pi )+1\).

Two swaps are disjoint if the intersection of the sets of elements exchanged by each swap is empty.
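For a small n, Definition 4 can be checked by exhaustive enumeration. The following sketch (the helper names are ours, and the Kendall tau distance is assumed, as in Lemma 4) computes \(C(\sigma ,\pi )\) by brute force:

```python
from itertools import combinations, permutations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of pairs of positions ordered differently."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

def segment(sigma, pi):
    """C(sigma, pi): permutations fulfilling the triangle inequality with equality."""
    n = len(sigma)
    d = kendall_tau(sigma, pi)
    return [tau for tau in permutations(range(1, n + 1))
            if kendall_tau(sigma, tau) + kendall_tau(tau, pi) == d]

# d(sigma, pi) = 1: the segment contains only the two endpoints (d + 1 = 2
# solutions, so it is a unique segment).
print(len(segment((1, 2, 3), (2, 1, 3))))   # 2
# For the reverse of the identity, every tau satisfies
# d(I, tau) + d(tau, I') = D, so the segment is the whole search space.
print(len(segment((1, 2, 3), (3, 2, 1))))   # 6
```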

Lemma 4

Let \(d(\cdot ,\cdot )\) be the Kendall tau distance and f an objective function such that its maximal solution is \(\sigma ^*\) and for any \(\sigma ,\pi \in \varSigma _n\), \(d(\sigma ,\sigma ^*) > d(\pi ,\sigma ^*)\) if and only if \(f(\sigma ) \le f(\pi )\). Let \(P_0\) be a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0) \ge 1\), and spread parameter \(\theta _0\). Then, the operator G always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.

Before presenting the proof of Lemma 4, let us consider some preliminary ideas about our permutation space \(\varSigma _n\) and how the solutions can be organized and classified according to their description and the Kendall tau distance d. To do so, let us study the Cayley graph described by the \((\varSigma _n,d)\) metric space.

Let us denote by \(CG(V,E)\) the Cayley graph in which \(V=\varSigma _n\) and

$$\begin{aligned} E=\{ (\sigma , \pi ) \in \varSigma _n \times \varSigma _n \mid d(\sigma ,\pi )=1 \}. \end{aligned}$$

This graph has been studied in [7, 12]. Lemma 2.4 of [12] shows that there are two kinds of cycles formed in \(CG(\varSigma _n,E)\). Because the distance d has the right-invariance property, without loss of generality, let us simplify the notation and explain the two possible cycles formed by the adjacent swaps using the identity permutation I as the reference solution. Let us denote by [i] the adjacent transposition that exchanges the elements of the positions i and \(i+1\) (\(i = 1,\dotsc ,n-1\)). For example, \([i] \circ I\) represents the solution such that the elements of the positions i and \(i+1\) of I are swapped \(\left( [i] \circ I = (1 \cdots i+1\ i\ \cdots n) \right) \). Analogously, let us consider a second adjacent transposition [j].

  • If \([j] \circ [i] \circ I = [i] \circ [j] \circ I\), then there is a unique 4-cycle in \(CG(\varSigma _n,E)\) passing through I, \([i] \circ I\) and \([j] \circ I\). Moreover, the 4-cycle is formed by the following solutions:

    $$\begin{aligned}\{ I,\ [i] \circ I,\ [j] \circ [i] \circ I,\ [j] \circ I \}. \end{aligned}$$
  • If \([j] \circ [i] \circ I \ne [i] \circ [j] \circ I\), then \([i] \circ [j] \circ [i] \circ I = [j] \circ [i] \circ [j] \circ I\) and there is a unique 6-cycle in \(CG(\varSigma _n,E)\) passing through I, \([i] \circ I\) and \([j] \circ I\). Moreover, the 6-cycle is formed by the following solutions:

    $$\begin{aligned}&\{ I,\ [i] \circ I,\ [j] \circ [i] \circ I,\ [i] \circ [j] \circ [i] \circ I,\\&[i] \circ [j] \circ I,\ [j] \circ I \}. \end{aligned}$$

By construction of the cycles, the distances among the solutions of the same cycle are minimal; that is to say, the distance between two solutions of the same cycle is the number of edges separating them in the cycle.
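Both cases can be verified directly on a small instance. In this sketch (helper names ours), `adj(i, sigma)` applies the transposition by function composition, \([i] \circ \sigma \), which exchanges the elements i and \(i+1\) and is a distance-1 move under the Kendall tau distance:

```python
from itertools import combinations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of discordant pairs of positions."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

def adj(i, sigma):
    """[i] o sigma: compose sigma with the adjacent transposition [i]."""
    return tuple(i + 1 if v == i else i if v == i + 1 else v for v in sigma)

I = (1, 2, 3, 4)
assert kendall_tau(I, adj(1, I)) == 1          # one step along an edge of CG

# Disjoint adjacent transpositions ([1] and [3]) commute: a unique 4-cycle.
assert adj(3, adj(1, I)) == adj(1, adj(3, I))

# Non-disjoint ones ([1] and [2]) do not commute but fulfill the braid
# relation, closing a unique 6-cycle.
assert adj(2, adj(1, I)) != adj(1, adj(2, I))
assert adj(1, adj(2, adj(1, I))) == adj(2, adj(1, adj(2, I)))
```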

The next observation is that, considering any 4-cycle, a partition of \(\varSigma _n\) into 4 sets can be defined.

$$\begin{aligned} \{ \pi _1,\ \pi _2=[i] \circ \pi _1,\ \pi _3=[j] \circ \pi _1,\ \pi _4=[j] \circ [i] \circ \pi _1 \}. \end{aligned}$$

Without loss of generality, let us discuss the particular case \(\pi _1=I\); the same arguments can be applied to any other cycle. If \(\pi _1 = I\), then \(\pi _2 = (\cdots i+1\ i\ \cdots j\ j+1\ \cdots )\); \(\pi _3 = (\cdots i\ i+1\ \cdots j+1\ j\ \cdots )\); and \(\pi _4 = ( \cdots i+1\ i\ \cdots j+1\ j\ \cdots )\). In order to simplify the notation, the solutions of the 4-cycle can be classified according to the relative positions of the couple i and \(i+1\) and the couple j and \(j+1\). Thus, a partition \(\{S_1,S_2,S_3,S_4\}\) of \(\varSigma _n\) is defined as follows:

$$\begin{aligned} \begin{array}{l} S_1 = \{\sigma \in \varSigma _n \mid \sigma (i)<\sigma (i+1) \wedge \sigma (j)<\sigma (j+1) \};\\ S_2 = \{\sigma \in \varSigma _n \mid \sigma (i)>\sigma (i+1) \wedge \sigma (j)<\sigma (j+1) \};\\ S_3 = \{\sigma \in \varSigma _n \mid \sigma (i)<\sigma (i+1) \wedge \sigma (j)>\sigma (j+1) \};\\ S_4 = \{\sigma \in \varSigma _n \mid \sigma (i)>\sigma (i+1) \wedge \sigma (j)>\sigma (j+1) \}.\\ \end{array} \end{aligned}$$

It is evident that the partition is well-defined. Moreover, between each pair of these 4 sets a bijection can be described:

$$\begin{aligned} \begin{array}{ccccccc} S_1 &{} \longrightarrow &{} S_2 &{} \longrightarrow &{} S_3 &{} \longrightarrow &{} S_4 \\ \pi _{S_1} &{} \longmapsto &{} \pi _{S_2} &{} \longmapsto &{} \pi _{S_3} &{} \longmapsto &{} \pi _{S_4} \end{array} \end{aligned}$$
(18)

such that

$$\begin{aligned} \left\{ \begin{array}{ll} \pi _{S_1}(i)=\pi _{S_2}(i+1)=\pi _{S_3}(i)=\pi _{S_4}(i+1)&{} \\ \pi _{S_1}(i+1)=\pi _{S_2}(i)=\pi _{S_3}(i+1)=\pi _{S_4}(i)&{} \\ \pi _{S_1}(j)=\pi _{S_2}(j)=\pi _{S_3}(j+1)=\pi _{S_4}(j+1)&{} \\ \pi _{S_1}(j+1)=\pi _{S_2}(j+1)=\pi _{S_3}(j)=\pi _{S_4}(j)&{} \\ \pi _{S_1}(k)=\pi _{S_2}(k)=\pi _{S_3}(k)=\pi _{S_4}(k), &{} \\ \qquad \text { for any } k \ne i,i+1,j,j+1.\\ \end{array} \right. \end{aligned}$$

An important property of this defined partition is that if \(\sigma \in S_1\), then \(d(\pi _1,\sigma )< d(\pi _2,\sigma )=d(\pi _3,\sigma ) < d(\pi _4,\sigma )\) is fulfilled and analogously with the solutions of the sets \(S_2\), \(S_3\) and \(S_4\).

The previous idea can be repeated with two non-disjoint adjacent swaps, forming a 6-cycle and defining a partition of \(\varSigma _n\) into 6 sets, and analogously for any cycle. In addition, we can extend the idea by using just one adjacent swap. In this last case, we can define a partition of \(\varSigma _n\) into two sets and a bijection between the sets, according to the relative position of the two elements permuted by the swap. This property is the main argument of the proof of Lemma 4.
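The distance-ordering property of the partition can be verified exhaustively for the case \(\pi _1 = I\) discussed above (a sketch with our own helper names; \(n = 4\) with the disjoint adjacent swaps [1] and [3]):

```python
from itertools import combinations, permutations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of discordant pairs of positions."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

def adj(i, sigma):
    """[i] o sigma: compose sigma with the adjacent transposition [i]."""
    return tuple(i + 1 if v == i else i if v == i + 1 else v for v in sigma)

n, i, j = 4, 1, 3                      # two disjoint adjacent swaps
p1 = tuple(range(1, n + 1))            # pi_1 = I
p2, p3 = adj(i, p1), adj(j, p1)        # pi_2, pi_3
p4 = adj(j, adj(i, p1))                # pi_4

# S_1: solutions with both couples (i, i+1) and (j, j+1) in increasing order.
S1 = [s for s in permutations(range(1, n + 1))
      if s[i - 1] < s[i] and s[j - 1] < s[j]]

# For every sigma in S_1: d(pi_1,.) < d(pi_2,.) = d(pi_3,.) < d(pi_4,.).
for s in S1:
    d1, d2, d3, d4 = (kendall_tau(p, s) for p in (p1, p2, p3, p4))
    assert d1 < d2 == d3 < d4
```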

Once we know how the solutions are organized in the \((\varSigma _n,d)\) metric space, Lemma 4 is proved by induction as follows. For any solution \(\tau \notin C(\sigma ^*,\sigma _0)\), there exists another solution \(\rho _1 \in \varSigma _n\) such that \(\rho _1\) is “closer” to \(\sigma ^*\) and \(\sigma _0\) than \(\tau \) and fulfills the following inequality:

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi ) > \sum _{\pi \in \varSigma _n}d(\pi , \rho _1)p^S(\pi ). \end{aligned}$$
(19)

In this way, the argument can be applied for all the solutions not included in \(C(\sigma ^*,\sigma _0)\) and, therefore, for any solution \(\tau \notin C(\sigma ^*,\sigma _0)\), there is a solution \(\rho \in C(\sigma ^*,\sigma _0)\) such that \(\rho \) fulfills Inequality (19) with regard to \(\tau \).

Proof

For any \(\tau \notin C(\sigma ^*,\sigma _0)\), there are two possible cases: (1) there is a solution \(\rho _1\) such that \(d(\tau ,\rho _1)=1\), \(d(\tau ,\sigma ^*)=d(\rho _1,\sigma ^*)+1\) and \(d(\tau ,\sigma _0)=d(\rho _1,\sigma _0)+1\) and (2) there is no such solution \(\rho _1\).

In the first case, if i and j are the elements swapped in the adjacent transposition between \(\tau \) and \(\rho _1\), it means that any solution of \(C(\sigma ^*,\sigma _0)\) keeps the same relative order between the elements i and j as \(\rho _1\) does. So,

$$\begin{aligned} \begin{aligned}&\ \sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi )> \sum _{\pi \in \varSigma _n}d(\pi , \rho _1 )p^S(\pi ) \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}d(\pi , \rho _1 )p^S(\pi ) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\tau )> d(\pi ,\rho _1) \end{array}}p^S(\pi ) \\&\qquad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\tau )< d(\pi ,\rho _1) \end{array}}p^S(\pi )> \sum _{\pi \in \varSigma _n}d(\pi , \rho _1 )p^S(\pi ) \\&\quad \Longleftrightarrow \ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\tau )> d(\pi ,\rho _1) \end{array}}p^S(\pi ) - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\tau ) < d(\pi ,\rho _1) \end{array}}p^S(\pi ) > 0 \end{aligned} \end{aligned}$$
(20)

Let us consider the following bijection.

$$\begin{aligned} \begin{array}{ccc} S_{\tau }=\{\sigma \in \varSigma _n \mid d(\sigma ,\tau ) < d(\sigma ,\rho _1) \} &{} \longrightarrow &{} S_{\rho }\\ \qquad =\{\sigma \in \varSigma _n \mid d(\sigma ,\tau ) > d(\sigma ,\rho _1) \} \\ \sigma _{\tau } &{} \longmapsto &{} \sigma _{\rho } \end{array} \end{aligned}$$

such that \(\sigma _{\tau }(i)=\sigma _{\rho }(j),\ \sigma _{\tau }(j)=\sigma _{\rho }(i)\) and \(\sigma _{\tau }(k)=\sigma _{\rho }(k)\), for any \(k \ne i,j \). According to the relative position of i and j, \(\sigma _\rho \) is closer to \(\sigma ^*\) and \(\sigma _0\) than \(\sigma _\tau \) and therefore, \(p^S(\sigma _{\rho }) > p^S(\sigma _{\tau })\) is achieved. Consequently, Inequality (20) is obtained.

In the second case, let us suppose that there is no swap from \(\tau \) that decreases the distance to \(\sigma ^*\) and \(\sigma _0\) at the same time. First, let us consider an adjacent swap [i] from \(\tau \) that reduces the distance to \(\sigma ^*\), and let us denote \(\rho ' = [i] \circ \tau \). Then, similar to the first case, a bijection can be defined according to the relative position of the elements in the positions i and \(i+1\) in \(\tau \). Analogously, let us consider a second swap [j] from \(\tau \) that reduces the distance to \(\sigma _0\), denote \(\rho ''= [j] \circ \tau \) and define a bijection for the elements positioned at j and \(j+1\) in \(\tau \). The transpositions [i] and [j] define a unique cycle passing through \(\tau \). Moreover, by the definition of the swaps, the segment \(C(\sigma ^*,\sigma _0)\) and the bijections defined in (18), this situation can only happen when the swaps \((i\ i+1)\) and \((j\ j+1)\) are not disjoint, which implies that the formed cycle has length 6. Besides, this cycle also implies that if we denote by \(\rho _{\tau }\) the solution of the cycle furthest from \(\tau \), then \(\rho _{\tau }\) is closer to both \(\sigma ^*\) and \(\sigma _0\) than \(\tau \). Figure 2 presents the unique possible scenario. Hence, \(d(\sigma ^*,\rho _\tau )+d(\rho _\tau ,\sigma _0) < d(\sigma ^*,\tau )+d(\tau ,\sigma _0)\).

Let us rewrite the sum \(\sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi )\):

$$\begin{aligned} \begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi )&= \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau ) \\ d(\pi , \rho ')< d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \\&\quad + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ') < d(\pi , \tau ) \\ d(\pi , \rho ') = d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \\&\quad + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau ) \\ d(\pi , \rho ')> d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \\&\quad + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ') > d(\pi , \tau ) \\ d(\pi , \rho ') = d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) . \\ \end{aligned} \end{aligned}$$

We distribute the sums into two groups, depending on whether or not \(d(\pi ,\rho ')=d(\pi ,\rho '')\):

$$\begin{aligned}&\left[ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\rho ')< d(\pi , \tau ) \\ d(\pi , \rho ') = d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) +\sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\rho ')> d(\pi , \tau ) \\ d(\pi , \rho ') = d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \right] \nonumber \\&\quad + \left[ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau ) < d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau ) > d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \right] . \nonumber \\ \end{aligned}$$
(21)

To prove that the sum in the first square brackets is positive, note that for a solution \(\pi \in \varSigma _n\), if \(d(\pi , \rho ') = d(\pi , \rho '') < d(\pi , \tau )\), then \(d(\pi ,\rho _\tau ) < d(\pi ,\tau )\). So, if we denote by \((i\ i+1\ i+2)\) the set of elements which are permuted in the 6-cycle, we define the following bijection:

$$\begin{aligned} \begin{array}{ccc} S_{\tau }=\{\sigma \in \varSigma _n \mid d(\sigma ,\rho _\tau ) - d(\sigma ,\tau )=3 \} &{} \longrightarrow &{} S_{\rho }\\ \qquad =\{\sigma \in \varSigma _n \mid d(\sigma ,\tau ) - d(\sigma ,\rho _\tau ) = 3 \} \\ \sigma _{\tau } &{} \longmapsto &{} \sigma _{\rho } \end{array}, \end{aligned}$$

such that \( \sigma _{\tau }(i)=\sigma _{\rho }(i+2),\ \sigma _{\tau }(i+2)=\sigma _{\rho }(i)\) and \(\sigma _{\tau }(k)=\sigma _{\rho }(k)\), for any \(k \ne i,i+2\). Therefore, a correspondence between both sets is shown, and by the definition of the sets, \(p^S(\sigma _\rho ) > p^S(\sigma _\tau )\) is obtained for all \(\sigma _\tau \in S_\tau \).

Table 1 Classification of the behaviors of the EDA. f: needle in a haystack function centered at \(\sigma ^*\), and \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\), where \(D=n(n-1)/2\)

For the second square bracket, if \(d(\pi , \rho ')< d(\pi , \tau ) < d(\pi , \rho '')\), then \(d(\pi ,\tau )=d(\pi , \rho ') +1 = d(\pi , \rho '')-1\), and if \(d(\pi , \rho ')> d(\pi , \tau ) > d(\pi , \rho '')\), then \(d(\pi ,\tau )=d(\pi , \rho ') -1 = d(\pi , \rho '')+1\). So, the second square bracket of (21) can be rewritten in the following way:

$$\begin{aligned} \begin{aligned}&\ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau )< d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau )> d(\pi , \rho '') \end{array}}d(\pi , \tau )p^S(\pi ) \\&\quad = \ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ') \ne d(\pi , \rho '') \end{array}}d(\pi , \rho ')p^S(\pi ) \\&\qquad + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau )< d(\pi , \rho '') \end{array}}p^S(\pi ) - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau )> d(\pi , \rho '') \end{array}}p^S(\pi ) \\&\quad = \ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ') \ne d(\pi , \rho '') \end{array}}d(\pi , \rho '')p^S(\pi ) \\&\qquad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau ) < d(\pi , \rho '') \end{array}}p^S(\pi ) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau ) > d(\pi , \rho '') \end{array}}p^S(\pi ) . \end{aligned} \end{aligned}$$

Therefore, depending on \(\theta _0\), it can be ensured that

$$\begin{aligned}&\sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau )< d(\pi , \rho '') \end{array}}p^S(\pi ) - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau )> d(\pi , \rho '') \end{array}}p^S(\pi )>0 \text { or }\\&\quad - \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')< d(\pi , \tau )\\ d(\pi ,\tau ) < d(\pi , \rho '') \end{array}}p^S(\pi ) + \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi , \rho ')> d(\pi , \tau )\\ d(\pi ,\tau )> d(\pi , \rho '') \end{array}}p^S(\pi ) >0 . \end{aligned}$$

Consequently, there is a solution \(\rho _1 \in \{\rho ',\rho ''\}\) such that

$$\begin{aligned} \sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi ) > \sum _{\pi \in \varSigma _n}d(\pi , \rho _1)p^S(\pi ). \end{aligned}$$

So, for \(\tau \notin C(\sigma ^*,\sigma _0)\), there exists a solution \(\rho _1 \in \varSigma _n\) such that \(d(\rho _1,\tau )=1\) and \(\rho _1\) fulfills Inequality (19). If \(\rho _1 \notin C(\sigma ^*,\sigma _0)\), then, by the same arguments, there exists another solution \(\rho _2 \in \varSigma _n\) such that \(d(\rho _1,\rho _2)=1\) and \(\rho _2\) fulfills Inequality (19) with regard to \(\rho _1\), and so on. Because \(\tau \notin C(\sigma ^*,\sigma _0)\), at least one induction step must fulfill the first situation explained in this proof (fulfilling Inequality (20)). Consequently, the chain ends at a solution \(\rho _i \in C(\sigma ^*,\sigma _0)\) which is a better estimator than \(\rho _1,\dotsc ,\rho _{i-1}\) and \(\tau \). \(\square \)

Fig. 2 Example of the generated 6-cycle over \(\tau \) with two non-disjoint adjacent swaps

Lemma 4 shows that the algorithm estimates central permutations from the set \(C(\sigma ^*,\sigma _0)\). Bear in mind that the particular expression of f has not been used during the proof of Lemma 4, only the ranking of the solutions it induces. Therefore, for our particular case, we can deduce Corollary 1.
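Lemma 4 can be illustrated numerically on a small instance: starting from a Mallows model centered at an arbitrary \(\sigma _0\), applying a 2-tournament selection that ranks solutions by their distance to \(\sigma ^*\), and estimating the central permutation, the estimate lands in \(C(\sigma ^*,\sigma _0)\). A sketch (the helper names and the concrete \(\sigma _0\), \(\theta _0\) values are our own choices):

```python
import math
from itertools import combinations, permutations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of discordant pairs of positions."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

n, theta0 = 4, 0.3
s_star, s0 = (1, 2, 3, 4), (3, 1, 4, 2)        # example choices
perms = list(permutations(range(1, n + 1)))

# P_0 ~ MM(sigma_0, theta_0)
w = [math.exp(-theta0 * kendall_tau(s, s0)) for s in perms]
p = {s: wi / sum(w) for s, wi in zip(perms, w)}

# 2-tournament selection: f only ranks solutions by their distance to sigma*.
pS = {}
for s in perms:
    ds = kendall_tau(s, s_star)
    win = sum(p[t] for t in perms if kendall_tau(t, s_star) > ds)
    tie = sum(p[t] for t in perms if kendall_tau(t, s_star) == ds)
    pS[s] = p[s] * (2 * win + tie)

# Maximum likelihood central permutation of the learned Mallows model.
est = min(perms, key=lambda c: sum(kendall_tau(t, c) * pS[t] for t in perms))

# Lemma 4: the estimate lies on the segment C(sigma*, sigma_0).
d0 = kendall_tau(s_star, s0)
seg = [t for t in perms
       if kendall_tau(s_star, t) + kendall_tau(t, s0) == d0]
assert est in seg
```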

Corollary 1

Let f be a needle in a haystack function centered at \(\sigma ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\). Then, the EDA always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.

Proof

When f is a needle in a haystack function, then \(f(\sigma ) < f(\sigma ^*)\) for any \(\sigma \ne \sigma ^*\). Hence, the conditions of Lemma 4 are fulfilled. \(\square \)

To summarize, the operator G ends in a non-degenerate fixed point or in the degenerate distribution centered at \(\sigma ^*\). The non-degenerate fixed points are centered at solutions \(\sigma \) such that \(d(\sigma ^*,\sigma ) < D/2\). In addition, when the algorithm estimates a solution different from \(\sigma _0\), the learned central estimator is a solution from \(C(\sigma ^*,\sigma _0) \backslash \{ \sigma _0 \}\).

All the results of Sects. 4.1, 4.2 and 4.3 are summarized in Table 1. The first column indicates the section; the second and third columns describe the initial parameters of \(P_0\) (\(\sigma _0\) and \(\theta _0\)); and the last column explains the performance of the algorithm in each situation.

5 Limiting behavior for a Mallows model function

In this section, the function f to optimize is a Mallows probability distribution with central permutation \(\sigma ^*\) and, without loss of generality, spread parameter \(\theta ^* >0\). The Mallows model has been studied as an example of a unimodal objective function in which the quality of a solution depends on its distance to the central permutation. The objective of this section is to analyze the relation between the Mallows probability distributions learned by our dynamical system and the objective function; for that reason, we believe that unimodal functions are a motivating starting point. In Sect. 5.1, the initial probability distribution \(P_0\) is a uniform distribution and the procedure of the algorithm at each iteration is analyzed. In Sect. 5.2, \(P_0\) is a Mallows probability distribution centered at \(\sigma \ne \sigma ^*\). In this situation, the fixed points of the algorithm and its convergence behavior are studied, in a similar way as in Sect. 4.3.

5.1 \(P_0\) a uniform initial probability distribution

In this section it is proved that when the initial probability distribution is a Mallows model centered at the same solution as the fitness function (including the uniform distribution, \(\theta _0 = 0\)), the algorithm converges to the degenerate distribution centered at the optimum. The obtained result is summarized in the following lemma.

Lemma 5

Let f be a Mallows model centered at \(\sigma ^*\) with spread parameter \(\theta ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _0 \ge 0\). Then, the proposed EDA always converges to the degenerate distribution centered at \(\sigma ^*\).

Proof

For this particular scenario, let us study how the algorithm performs at each iteration, analogously to Sect. 4.1. Let us start the demonstration from the case in which \(P_0\) is a uniform distribution. First, in order to calculate \(P_1=G(P_0)\), let us calculate \(P_0^S\).

Bear in mind that the 2-tournament does not consider the exact function values of the solutions. In other words, by the definition of the Mallows probability distribution, a solution is selected more often if it is closer to \(\sigma ^*\), and to study the selection between two solutions, their distances to \(\sigma ^*\) are compared. With this property in mind, we can rewrite Eq. (3) in the following way: for any iteration of the algorithm i,

$$\begin{aligned} p^S_i(\sigma )= & {} 2\sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\sigma ,\sigma ^*) < d(\pi ,\sigma ^*) \end{array}} p_i(\sigma )p_i(\pi )\nonumber \\&+ \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\sigma ,\sigma ^*) = d(\pi ,\sigma ^*) \end{array}} p_i(\sigma )p_i(\pi ). \end{aligned}$$
(22)

The next step is to estimate the central permutation and spread parameter from \(P_0^S\) to learn \(P_1\). First, to estimate the central permutation, let us order the solutions increasingly according to their distance from \(\sigma ^*\). Remember that two solutions have the same probability of being selected if they are at the same distance from \(\sigma ^*\). For any \(\sigma \in \varSigma _n\),

$$\begin{aligned} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma )\cdot p_0^S(\pi ) = \sum _{d=0}^D \left( p_0^S({\tilde{\sigma }}_d) \sum _{\begin{array}{c} \pi \in \varSigma _n \\ d(\pi ,\sigma ^*)=d \end{array}}d(\pi ,\sigma ) \right) , \end{aligned}$$

where \({\tilde{\sigma }}_d\) denotes a solution at distance d from \(\sigma ^*\): \(d({\tilde{\sigma }}_d,\sigma ^*)=d\).

By Eq. (22), \(p_0^S({\tilde{\sigma }}_{0})> p_0^S({\tilde{\sigma }}_{1})> \cdots > p_0^S({\tilde{\sigma }}_{D})\). So, by Eq. (4), the maximum likelihood estimator must minimize \(\sum _{\pi \in \varSigma _n} d(\pi ,{\hat{\sigma }}_0)\cdot p_0^S(\pi )\), knowing that the selection probabilities are ordered according to their distance to \(\sigma ^*\) (the lower the distance from \(\sigma ^*\) to \(\pi \), the higher the value \(p_0^S(\pi )\)). For that reason, the maximum likelihood estimator of the central permutation is \(\sigma ^*\), and consequently, \(P_1\) follows a Mallows model with central permutation \(\sigma ^*\) and a positive spread parameter \(\theta _1\), as a consequence of Lemma 1.
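This estimation step can be reproduced numerically for a small instance (a sketch; the helper names are ours, and the uniform \(P_0\) is encoded directly):

```python
from itertools import combinations, permutations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of discordant pairs of positions."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

n = 4
center = (1, 2, 3, 4)                      # sigma*
perms = list(permutations(range(1, n + 1)))
p = {s: 1 / len(perms) for s in perms}     # P_0 uniform (theta_0 = 0)

# Eq. (22): the 2-tournament only compares distances to sigma*.
pS = {}
for s in perms:
    ds = kendall_tau(s, center)
    win = sum(p[t] for t in perms if kendall_tau(t, center) > ds)
    tie = sum(p[t] for t in perms if kendall_tau(t, center) == ds)
    pS[s] = p[s] * (2 * win + tie)

assert abs(sum(pS.values()) - 1.0) < 1e-12

# Eq. (4): the maximum likelihood central permutation minimizes the expected
# distance under p_0^S; here it is sigma* itself.
mle = min(perms, key=lambda c: sum(kendall_tau(t, c) * pS[t] for t in perms))
assert mle == center
```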

The previous arguments can be used for any iteration. Hence, \(P_i\) is a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _i > 0\), for any \(i \in {\mathbb {N}}\). In order to see the evolution of the algorithm and its convergence behavior, let us prove that \(\theta _i\) increases at each iteration. To this end, the difference between the values of the left-hand side of Eq. (5) in two consecutive iterations is analyzed: \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_i^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_{i+1}^S(\pi )\). By the same arguments used in Sect. 4.1, the equality \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_i^S(\pi ) = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_{i+1}(\pi )\) is obtained. Let us use the sequence \(m_n(0),m_n(1),\dotsc ,m_n(D)\) given in Definition 2 and simplify the notation of the probabilities. By the definition of the selection operator, for any \(\sigma \in \varSigma _n\) such that \(d(\sigma ,\sigma ^*)=d\), \(p^S(\sigma )\) can be rewritten in the following way:

$$\begin{aligned} p^S(\sigma ) = p(\sigma )\left( 2\left( 1-\sum _{i=0}^{d-1} m_n(i)p({\tilde{\sigma }}_i) \right) - m_n(d)p({\tilde{\sigma }}_d) \right) . \end{aligned}$$

Hence,

$$\begin{aligned}&\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p^S(\pi ) \\&\quad = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p(\pi )\\&\qquad +\sum _{d=1}^D m_n(d) \cdot d \cdot p({\tilde{\sigma }}_d) \\&\qquad \times \left( 1-2\sum _{i=0}^{d-1} m_n(i)p({\tilde{\sigma }}_i) - m_n(d)p({\tilde{\sigma }}_d) \right) . \end{aligned}$$

Let us define the function h:

$$\begin{aligned} h(\theta )= & {} \sum _{d=1}^D m_n(d) \cdot d \cdot p({\tilde{\sigma }}_d)\\&\times \left( 1-2\sum _{i=0}^{d-1} m_n(i)p({\tilde{\sigma }}_i) - m_n(d)p({\tilde{\sigma }}_d) \right) . \end{aligned}$$

For any \(\theta \ge 0\), \(h(\theta )\) is negative (see the proof of Proposition 5 in Appendix C). Consequently,

$$\begin{aligned} \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p^S(\pi ) < \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p(\pi ), \end{aligned}$$

and due to the fact that the function g defined in Eq. (6) is a strictly decreasing function over \(\theta \), we obtain \(\theta _{i+1} > \theta _{i}\).

Therefore, when our model is applied to a function defined as a Mallows model, departing from a uniform distribution, the algorithm converges to the degenerate distribution centered at \(\sigma ^*\). \(\square \)
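The iteration analyzed in this proof can be simulated for a small n. In this sketch (helper names ours), the spread parameter update inverts the expectation equation of Eq. (5) numerically by bisection, and the expected distance to \(\sigma ^*\) strictly decreases at every iteration, i.e., the learned spread parameter strictly increases:

```python
import math
from itertools import combinations, permutations

def kendall_tau(sigma, pi):
    """Kendall tau distance: number of discordant pairs of positions."""
    return sum(1 for a, b in combinations(range(len(sigma)), 2)
               if (sigma[a] - sigma[b]) * (pi[a] - pi[b]) < 0)

n = 4
s_star = (1, 2, 3, 4)
perms = list(permutations(range(1, n + 1)))
dist = {s: kendall_tau(s, s_star) for s in perms}

def mallows(theta):
    w = {s: math.exp(-theta * dist[s]) for s in perms}
    z = sum(w.values())
    return {s: w[s] / z for s in perms}

def select(p):
    """2-tournament: only the distances to sigma* matter (cf. Eq. (22))."""
    return {s: p[s] * (2 * sum(p[t] for t in perms if dist[t] > dist[s])
                       + sum(p[t] for t in perms if dist[t] == dist[s]))
            for s in perms}

def mean_dist(p):
    return sum(dist[s] * p[s] for s in perms)

theta = 0.0                                # P_0 uniform
for _ in range(5):
    p = mallows(theta)
    target = mean_dist(select(p))
    assert target < mean_dist(p)           # h(theta) < 0: theta must grow
    lo, hi = theta, 50.0                   # invert the expectation by bisection
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mean_dist(mallows(mid)) > target else (lo, mid)
    assert (lo + hi) / 2 > theta           # the spread parameter increases
    theta = (lo + hi) / 2
```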

5.2 \(P_0\) a Mallows probability distribution with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\)

The algorithm can exhibit many different behaviors depending on \(\sigma ^*\) and \(\sigma _0\). However, there are groups of central permutations \(\sigma _0\) for which the algorithm behaves analogously; this analogy can be obtained by means of the symmetry among the solutions of \(\varSigma _n\). Due to the difficulty of studying all of them, we work in a similar way as in Sect. 4.3. In Sect. 5.2, the following proof idea is used:

  1. (i)

    In Sect. 5.2.1, the fixed points and their attraction are calculated.

    • First, it is observed that any degenerate distribution is a fixed point.

Then, the equations that any non-degenerate fixed point must fulfill are calculated.

  2. (ii)

    In Sect. 5.2.2, the convergence behavior of the algorithm is explained and an example is shown.

A summary of all the results obtained in Sect. 5 is shown in Table 2 at the end of this section.

5.2.1 Fixed points of the algorithm and their attraction

The case \(n=2\) will not be explained because of its simplicity. From now on, let us suppose that \(n \ge 3\) and study the fixed points of our discrete dynamical system G. As in Sect. 4.3.1, let us start by observing that any degenerate distribution is a fixed point of the discrete dynamical system G, so let us focus on the non-degenerate fixed points.

For any Mallows probability distribution P, \(G(P)=P\) if and only if the estimated central permutation and spread parameter are the same as those of P. So, if Eq. (11) is fulfilled, then P is a non-degenerate fixed point. Let us study the equality of Eq. (11). We say that P is a candidate fixed point if it satisfies Eq. (10). Note that if P is a candidate fixed point, then \({\hat{\theta }} = \theta \).

$$\begin{aligned} \begin{aligned}&\ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p^S(\pi ) = \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ) \\&\quad \mathop {\Longleftrightarrow }^{(\text {22})} \ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi )\left( \sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)> d(\pi , \sigma ^*) \end{array}} 2p(\tau ) \right. \\&\left. \qquad + \sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*) = d(\pi , \sigma ^*) \end{array}} p(\tau ) \right) = \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ) \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi )\left( \sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)> d(\pi , \sigma ^*) \end{array}} p(\tau ) \right) \\&\quad = \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi )\left( \sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*) < d(\pi , \sigma ^*) \end{array}} p(\tau ) \right) \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)> d(\pi ,\sigma ^*) \end{array}}p(\pi )p(\tau )\left[ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) \right] =0 \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*) > d(\pi ,\sigma ^*) \end{array}}e^{-\theta \left( d(\pi ,\sigma _0)+d(\tau ,\sigma _0) \right) }\\&\quad \times \left[ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) \right] =0 . \end{aligned} \end{aligned}$$
(23)

As can be observed, Eq. (23) gives the first condition for a Mallows probability distribution P centered at \(\sigma _0\) to be a fixed point. Equation (23) has at least one solution \(\theta \) (depending on n, \(\sigma ^*\) and \(\sigma _0\), it may have more than one). One way to calculate the number of candidate fixed points centered at \(\sigma _0\) is to count the number of roots of Eq. (23) by Sturm's theorem [27]. The exponential polynomial in \(\theta \in [0,+\infty )\) can be transformed into a polynomial defined in (0, 1] (substituting \(x=e^{-\theta }\)) in order to apply Sturm's theorem. Moreover, the roots can be computed numerically to find the values of \(\theta \) for which \(P \sim \) \(\hbox {MM}({\sigma _0},{\theta })\) is a candidate fixed point.

Moreover, for any pair of permutations \(\pi ,\tau \) (w.l.o.g., \(d(\tau ,\sigma ^*) > d(\pi ,\sigma ^*)\)), if we consider the pair \(I'\pi ,I'\tau \), where \(I'=(n\ n-1\ \cdots \ 1)\), the following relations hold:

$$\begin{aligned} \left\{ \begin{array}{l} d(\tau ,\sigma ^*)> d(\pi ,\sigma ^*) \Longleftrightarrow D-d(I'\tau ,\sigma ^*) \\ \qquad \qquad > D-d(I'\pi ,\sigma ^*) \Longleftrightarrow d(I'\tau ,\sigma ^*) < d(I'\pi ,\sigma ^*) \\ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) = D-d(I'\pi ,\sigma _0)\\ \qquad \qquad -D+d(I'\tau ,\sigma _0)=d(I'\tau ,\sigma _0)-d(I'\pi ,\sigma _0). \end{array} \right. \end{aligned}$$

Hence,

$$\begin{aligned}&e^{-\theta \left( d(\pi ,\sigma _0)+d(\tau ,\sigma _0) \right) }\left[ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) \right] \nonumber \\&\quad = e^{-2D\theta }e^{\theta \left( d(I'\tau ,\sigma _0)+d(I'\pi ,\sigma _0) \right) }\left[ d(I'\tau ,\sigma _0)-d(I'\pi ,\sigma _0) \right] . \nonumber \\ \end{aligned}$$
(24)

Therefore, for any \(\sigma _0 \in \varSigma _n\), let us define the function H as follows:

$$\begin{aligned}&H(\sigma _0,\theta ) = \sum _{i=1}^{2D-1} H_i e^{-i\theta }\nonumber \\&\quad = \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*) > d(\pi ,\sigma ^*) \end{array}}e^{-\theta \left( d(\pi ,\sigma _0)+d(\tau ,\sigma _0) \right) }\nonumber \\&\qquad \times \left[ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) \right] . \end{aligned}$$
(25)

By Eq. (24), \(H_i = H_{2D-i}\). In addition, \(H(\sigma _0,\theta )=-H(I'\sigma _0,\theta )\) for any \(\sigma _0 \in \varSigma _n\) and \(\theta \). Consequently, \(H(\sigma _0,{\hat{\theta }})=0\) if and only if \(H(I'\sigma _0,{\hat{\theta }})=0\). So, if P is a candidate fixed point with central permutation \(\sigma _0\) and spread parameter \({\hat{\theta }}\), then a Mallows probability distribution with central permutation \(I'\sigma _0\) and spread parameter \({\hat{\theta }}\) is a candidate fixed point as well.
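These two symmetries are easy to verify computationally. The following Python sketch (our own implementation, assuming Kendall’s tau distance and \(\sigma ^*\) equal to the identity) checks that the coefficients \(H_i\) are palindromic and that H changes sign when the central permutation is composed with \(I'\), for every \(\sigma _0 \in \varSigma _4\):

```python
from itertools import permutations

def kendall(p, q):
    """Kendall's tau distance between permutations in one-line notation."""
    n = len(p)
    return sum((p[i] - p[j]) * (q[i] - q[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def h_coeffs(n, sigma0):
    """Integer coefficients H_i of Eq. (25), taking sigma* as the identity."""
    ident = tuple(range(1, n + 1))
    perms = list(permutations(ident))
    D = n * (n - 1) // 2
    H = [0] * (2 * D)
    for pi in perms:
        ap, bp = kendall(pi, ident), kendall(pi, sigma0)
        for tau in perms:
            if kendall(tau, ident) > ap:
                bt = kendall(tau, sigma0)
                H[bp + bt] += bp - bt
    return H

n = 4
D = n * (n - 1) // 2
for sigma0 in permutations(range(1, n + 1)):
    H = h_coeffs(n, sigma0)
    # I'sigma0 maps i to n + 1 - sigma0(i)
    Hrev = h_coeffs(n, tuple(n + 1 - v for v in sigma0))
    assert all(H[i] == H[2 * D - i] for i in range(1, 2 * D))  # H_i = H_{2D-i}
    assert all(H[i] == -Hrev[i] for i in range(2 * D))         # H(sigma0) = -H(I'sigma0)
print("both symmetries hold for every sigma0 in Sigma_4")
```

Since the exponentials \(e^{-i\theta }\) are linearly independent, the identity \(H(\sigma _0,\theta )=-H(I'\sigma _0,\theta )\) holds coefficient-wise, which is what the sketch tests.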

In addition, the same chain of equivalences shows that

$$\begin{aligned}&\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )< \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi ) \nonumber \\&\quad \Longleftrightarrow \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*) > d(\pi ,\sigma ^*) \end{array}}e^{-\theta \left( d(\pi ,\sigma _0)+d(\tau ,\sigma _0) \right) }\nonumber \\&\quad \times \left[ d(\pi ,\sigma _0)-d(\tau ,\sigma _0) \right] < 0 \end{aligned}$$
(26)

and analogously for the opposite inequality. Hence, as \(\theta \) tends to infinity, the sign of \(H(\sigma _0,\theta )\) is determined by the coefficient of its dominant exponential term, that is, by the nonzero \(H_i\) with the smallest index i.

Fig. 3

Candidate fixed points of our algorithm (\(\sigma \) and \(\theta \) values such that \(\hbox {MM}({\sigma },{\theta })\) fulfills Eq. (23)) and their attraction (\(n=5\)). The x-axis indexes the permutations of \(\varSigma _5\); the y-axis shows the spread parameter values

Taking into account Eq. (23) and Inequality (26), and comparing with the results of Sects. 4.3.1 and 4.3.2, some new scenarios arise. First, for a fixed central permutation \(\sigma _0\), there can be more than one candidate fixed point; hence, the algorithm can converge to more than one probability distribution centered at \(\sigma _0\). Moreover, Eq. (25) reveals the symmetry between \(\sigma _0\) and \(I'\sigma _0\). Secondly, the attraction of the candidate fixed points can be analyzed, regardless of whether or not they are actual fixed points. Inequality (26) determines whether the degenerate distribution centered at \(\sigma _0\) is an attractive fixed point and, once the attraction of the degenerate distribution is known, the attraction of all the candidate fixed points is completely defined: ordering the candidate fixed points centered at \(\sigma _0\) by their spread parameters, attractive and non-attractive points alternate, so that no two consecutive candidates have the same attraction. It remains to determine when a candidate fixed point is indeed a fixed point.

To determine whether a candidate fixed point is a fixed point, one must check whether the estimated central permutation \({\hat{\sigma }}_0\) obtained from a candidate fixed point P centered at \(\sigma _0\) is exactly \(\sigma _0\). For the central permutation to be preserved, the inequality in Eq. (11) must hold at the solution \({\hat{\theta }}\) of Eq. (23) (assuming the central permutation is unique). Hence, for all \(\sigma \ne \sigma _0\),

$$\begin{aligned} \begin{aligned}&\ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p^S(\pi )> \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ) \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi )\left( 1+\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)> d(\pi ,\sigma ^*) \end{array}}p(\tau ) \right. \\&\left. \qquad - \sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)< d(\pi ,\sigma ^*) \end{array}}p(\tau ) \right)> \sum _{\pi \in \varSigma _n}d(\pi ,\sigma _0)p(\pi ) \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)>d(\pi ,\sigma ^*) \end{array}} p(\pi )p(\tau )\left[ d(\pi ,\sigma )-d(\tau ,\sigma ) \right] \\&\quad> \sum _{\pi \in \varSigma _n} p(\pi ) \left[ d(\pi ,\sigma _0) - d(\pi ,\sigma ) \right] \\&\quad \Longleftrightarrow \ \sum _{\pi \in \varSigma _n}\sum _{\begin{array}{c} \tau \in \varSigma _n \\ d(\tau ,\sigma ^*)>d(\pi ,\sigma ^*) \end{array}} p(\pi )p(\tau )\left[ d(\tau ,\sigma )-d(\pi ,\sigma ) \right] \\&\quad < \sum _{\pi \in \varSigma _n} p(\pi ) \left[ d(\pi ,\sigma ) - d(\pi ,\sigma _0) \right] . \end{aligned} \end{aligned}$$
(27)

Inequality (27) shows us the condition to estimate \(\sigma _0\) as the learned central permutation. Even though it can be decomposed according to the distances from \(\sigma ^*\), no general closed-form criterion (independent of the particular values of the probabilities and distances) tells us in advance whether Inequality (27) is fulfilled. In fact, some experimental results show that there are candidate fixed points which do not fulfill Inequality (27).

In Fig. 3, an example of the attraction of the fixed points is shown for \(n=5\). The abscissa shows \(\sigma _0\), numerically indexed according to its distance to \(\sigma ^*\), and the ordinate represents the values of \(\theta _0\) which fulfill Eq. (23). Therefore, each dot represents a candidate fixed point. The color of the dot indicates whether or not the candidate fixed point is a fixed point, and its attraction. For every central permutation \(\sigma _0\), the degenerate fixed points are also illustrated.

To summarize, Inequality (27) determines exactly which candidates are fixed points of our dynamical system.

5.2.2 Convergence behavior of the algorithm

Before introducing the convergence behavior of the algorithm, let us state Corollary 2, deduced from Lemma 4.

Corollary 2

Let f be a Mallows model centered at \(\sigma ^*\) with spread parameter \(\theta ^*\), and let \(P_0\) be a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\). Then, the EDA always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.

Proof

When f is a Mallows model centered at \(\sigma ^*\) with spread parameter \(\theta ^*>0\), for any \(\sigma ,\pi \in \varSigma _n\), \(f(\sigma ) > f(\pi )\) if and only if \(d(\sigma ,\sigma ^*) < d(\pi ,\sigma ^*)\). Hence, the conditions of Lemma 4 are fulfilled. \(\square \)
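The monotonicity used in this proof can be checked directly on a small instance. The sketch below (with \(n=4\), Kendall’s tau distance and an arbitrary spread parameter, all illustrative choices) confirms that a Mallows model orders the solutions exactly by their distance to the central permutation:

```python
import math
from itertools import permutations

def kendall(p, q):
    """Kendall's tau distance between permutations in one-line notation."""
    n = len(p)
    return sum((p[i] - p[j]) * (q[i] - q[j]) < 0
               for i in range(n) for j in range(i + 1, n))

n, theta = 4, 0.7                           # illustrative size and spread
ident = tuple(range(1, n + 1))              # sigma* = I
perms = list(permutations(ident))
psi = sum(math.exp(-theta * kendall(p, ident)) for p in perms)
f = {p: math.exp(-theta * kendall(p, ident)) / psi for p in perms}

# f(sigma) > f(pi)  <=>  d(sigma, sigma*) < d(pi, sigma*)
assert all((f[s] > f[p]) == (kendall(s, ident) < kendall(p, ident))
           for s in perms for p in perms)
print("Mallows f ranks solutions exactly by distance to sigma*")
```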

Once we have Corollary 2 and we know the fixed points and their attraction, the behavior of the algorithm is completely determined and can be summarized as follows:

  • For any \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\), there exists a spread parameter value \(\theta '(\sigma _0)\), dependent on \(\sigma _0\), such that if \(\theta _0 < \theta '(\sigma _0)\), then Inequality (27) is violated for some \(\sigma \). In that case, by Corollary 2, the estimated central permutation after one iteration of the algorithm is a solution from \(C(\sigma ^*,\sigma _0)\backslash \{\sigma _0\}\).

  • If \(\theta _0 > \theta '(\sigma _0)\), then the algorithm estimates \(\sigma _0\) as the central permutation of the learned Mallows model. Let us classify the different possible behaviors according to the number of fixed points centered at \(\sigma _0\):

    • If there are no non-degenerate candidate fixed points centered at \(\sigma _0\) (Eq. (23) has no solution), then the only fixed point centered at \(\sigma _0\) is the degenerate distribution \(1_{\sigma _0}\). In this case, if \(1_{\sigma _0}\) is attractive, the algorithm converges to it; otherwise, the estimated spread parameter decreases until an iteration at which \({\hat{\theta }} < \theta '(\sigma _0)\) and, therefore, the estimated central permutation is no longer \(\sigma _0\), returning to the previous case.

    • If there are \(i \ge 1\) non-degenerate fixed points centered at \(\sigma _0\), then there exist i spread parameter values \({\tilde{\theta }}_1< \dotsb < {\tilde{\theta }}_i\) which solve Eq. (23) and fulfill Inequality (27). Hence, \(\theta '(\sigma _0)\) and the \({\tilde{\theta }}_j\), \(j=1,\dotsc ,i\), divide the interval \((\theta '(\sigma _0), +\infty )\) into \(i+1\) intervals.

      Let us denote by \((\theta '(\sigma _0),{\tilde{\theta }}_1)\), \(({\tilde{\theta }}_1,{\tilde{\theta }}_2)\), \(\dotsc \), \(({\tilde{\theta }}_{i-1},{\tilde{\theta }}_i)\) and \(({\tilde{\theta }}_{i},+\infty )\) the \(i+1\) resulting intervals; by \(P_k\) the non-degenerate fixed point centered at \(\sigma _0\) with spread parameter \({\tilde{\theta }}_{k}\), for \(k=1,\dotsc ,i\); and by \(1_{\sigma _0}\) the degenerate fixed point centered at \(\sigma _0\). There are two possible situations, depending on whether \(1_{\sigma _0}\) is attractive or not.

      If \(1_{\sigma _0}\) is attractive, then \(P_i\) is not attractive and, when \(\theta _0 \in ({\tilde{\theta }}_{i},+\infty )\), the algorithm converges to \(1_{\sigma _0}\). Moreover, because \(P_i\) is not attractive and by the same argument, \(P_{i-1}\) is attractive, \(P_{i-2}\) is not attractive, \(P_{i-3}\) is attractive, \(P_{i-4}\) is not attractive, and so on. Hence, if \(\theta _0 \in ({\tilde{\theta }}_{i-2}, {\tilde{\theta }}_i)\), the algorithm converges to \(P_{i-1}\); if \(\theta _0 \in ({\tilde{\theta }}_{i-4}, {\tilde{\theta }}_{i-2})\), the algorithm converges to \(P_{i-3}\); and so on.

      Additionally, if \(1_{\sigma _0}\) is not attractive, then \(P_i\) is attractive and \(P_{i-1}\) is not, and when \(\theta _0 \in ({\tilde{\theta }}_{i-1},+\infty )\), the algorithm converges to \(P_i\). Moreover, \(P_{i-2}\) is attractive and \(P_{i-3}\) is not, and when \(\theta _0 \in ({\tilde{\theta }}_{i-3}, {\tilde{\theta }}_{i-1})\), the algorithm converges to \(P_{i-2}\); and so on.

      Observe that when \(P_1\) is not attractive and \(\theta _0 \in (\theta '(\sigma _0),{\tilde{\theta }}_1)\), the algorithm estimates lower spread parameters until \({\hat{\theta }}_0 < \theta '(\sigma _0)\). In this case, the algorithm estimates a new central permutation from \(C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\).

      Figure 4 visualizes the possible situations. The horizontal line represents the possible \(\theta _0\) values. In each interval, a blue arrow indicates whether the estimated spread parameter increases or decreases, and the attraction of each fixed point can be observed. There are four possible cases, depending on the parity of i and the attraction of \(1_{\sigma _0}\). In the first two cases, i is odd, and in the first and fourth cases, \(1_{\sigma _0}\) is an attractive fixed point.

  • If \(\theta _0 = \theta '(\sigma _0)\), then the algorithm can randomly estimate \(\sigma _0\) or another \(\sigma \in C(\sigma ^*,\sigma _0)\) as the new central permutation. In the former case, if the fixed point with the lowest spread parameter centered at \(\sigma _0\) is attractive, the algorithm converges to it; otherwise, the algorithm learns a probability distribution centered at \(\sigma _0\) with spread parameter \({\hat{\theta }} < \theta _0\), and it behaves analogously to the case \(\theta _0 < \theta '(\sigma _0)\). In the latter case, the algorithm estimates a new central permutation \(\sigma \) and spread parameter \({\hat{\theta }}\), and the process continues as if \(P_0 \sim \hbox {MM}({\sigma },{{\hat{\theta }}})\).

Fig. 4

Representation of all the possible scenarios of the convergence behavior of the algorithm. The cases are divided into 4 according to the parity of the value i and the attraction of the degenerate distribution \(1_{\sigma _0}\)

Let us present an example in order to illustrate the behavior described above.

Example 1

Let us consider \(n=5\), f a Mallows model centered at \(\sigma ^*=I\), and \(P_0\) a Mallows probability distribution with central permutation \(\sigma _0=(2 1 5 4 3)\) and spread parameter \(\theta _0\). To observe the behavior of the algorithm, let us compute the candidate fixed points by Eq. (23) and, by Inequality (27), the minimum spread parameter value \(\theta '(\sigma _0)\) that allows the estimation of \(\sigma _0\) as the learned central permutation.

In this particular case, there is only one solution fulfilling Eq. (23): \({\tilde{\theta }} \approx 1.2519\). Moreover, Inequality (27) becomes an equality when \(\theta '(\sigma _0)\approx 0.2770\). Therefore, a Mallows probability distribution centered at \(\sigma _0\) with spread parameter \({\tilde{\theta }}\) is a fixed point of our mathematical modeling. In addition, if \(\theta _0 > {\tilde{\theta }}\), then \({\hat{\theta }} < \theta _0\). This last observation implies that the degenerate distribution centered at \(\sigma _0\) is not attractive and, consequently, \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\) is an attractive fixed point. Knowing the attraction of the fixed points, the value of \(\theta _0\) determines the behavior of the algorithm.

  • If \(\theta _0 < \theta '(\sigma _0)\), then \({\hat{\sigma }}_0 \in C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\). Hence, after one iteration, the algorithm restarts the process with a new central permutation and spread parameter. For example, if \(\theta _0 = 0.2760\), then the learned Mallows model after one iteration of the algorithm is \(\hbox {MM}({(12453)},{0.4016})\); and if \(\theta _0=0.2700\), then the learned Mallows model is \(\hbox {MM}({\sigma ^*},{0.3994})\).

  • If \(\theta _0 > \theta '(\sigma _0)\), then the algorithm converges to \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\) distribution.

  • If \(\theta _0 = \theta '(\sigma _0)\), then the algorithm estimates either \(\sigma _0\) or \({\hat{\sigma }}_0 \in C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\). In the first case, the algorithm converges to \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\), whereas in the second case, the algorithm estimates the \(\hbox {MM}({(12453)},{0.4023})\) probability distribution after one iteration.
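The value \({\tilde{\theta }}\) can be reproduced numerically. The Python sketch below (our own implementation, assuming Kendall’s tau distance) evaluates the left-hand side of the last line of Eq. (23) for this instance and locates its root by bisection; if the implementation is faithful, the printed value should match the reported \({\tilde{\theta }} \approx 1.2519\):

```python
import math
from itertools import permutations

def kendall(p, q):
    """Kendall's tau distance between permutations in one-line notation."""
    n = len(p)
    return sum((p[i] - p[j]) * (q[i] - q[j]) < 0
               for i in range(n) for j in range(i + 1, n))

n = 5
ident = tuple(range(1, n + 1))              # sigma* = I
sigma0 = (2, 1, 5, 4, 3)
perms = list(permutations(ident))
a = {p: kendall(p, ident) for p in perms}   # distances to sigma*
b = {p: kendall(p, sigma0) for p in perms}  # distances to sigma_0

def H(theta):
    """Left-hand side of the last line of Eq. (23)."""
    return sum(math.exp(-theta * (b[pi] + b[tau])) * (b[pi] - b[tau])
               for pi in perms for tau in perms if a[tau] > a[pi])

lo, hi = 0.5, 3.0                           # bracket chosen around the reported root
flo = H(lo)
assert flo * H(hi) < 0                      # sign change across the root
for _ in range(60):
    mid = (lo + hi) / 2
    fmid = H(mid)
    if flo * fmid > 0:
        lo, flo = mid, fmid
    else:
        hi = mid
root = (lo + hi) / 2
print(round(root, 4))                       # the paper reports ~ 1.2519
```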

For any other \(\sigma _0\), the same analysis can be repeated. All the results of Sects. 5.1 and 5.2 are summarized in Table 2, indicating the initial parameters of \(P_0\) and the resulting behavior of the algorithm.
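The described behavior can also be observed by iterating the expectation map itself. The following Python sketch (our own implementation, assuming Kendall’s tau distance and exhaustive computation over \(\varSigma _5\)) applies 2-tournament selection in expectation, refits a Mallows model by minimizing the expected distance and matching the mean distance, and iterates from \(\theta _0 = 2.0 > \theta '(\sigma _0)\); the iterates should remain centered at \(\sigma _0\) and approach the attractive fixed point:

```python
import math
from itertools import permutations

def kendall(p, q):
    """Kendall's tau distance between permutations in one-line notation."""
    n = len(p)
    return sum((p[i] - p[j]) * (q[i] - q[j]) < 0
               for i in range(n) for j in range(i + 1, n))

n = 5
ident = tuple(range(1, n + 1))                      # sigma* = I
perms = list(permutations(ident))
a = {p: kendall(p, ident) for p in perms}           # distances to sigma*
dist = {(p, q): kendall(p, q) for p in perms for q in perms}

def mallows(sigma, theta):
    """Exact MM(sigma, theta) probabilities over Sigma_5."""
    w = {p: math.exp(-theta * dist[p, sigma]) for p in perms}
    z = sum(w.values())
    return {p: w[p] / z for p in perms}

def select(p):
    """Expected 2-tournament selection: a solution survives against a worse
    solution tau with weight 2 p(tau) and against a tie with weight p(tau)."""
    return {pi: p[pi] * (sum(2 * p[t] for t in perms if a[t] > a[pi])
                         + sum(p[t] for t in perms if a[t] == a[pi]))
            for pi in perms}

def learn(ps):
    """Fit a Mallows model: central permutation by minimum expected distance,
    spread parameter by matching the mean distance (bisection)."""
    sigma = min(perms, key=lambda s: sum(ps[p] * dist[p, s] for p in perms))
    target = sum(ps[p] * dist[p, sigma] for p in perms)
    lo, hi = 1e-9, 60.0
    for _ in range(60):
        mid = (lo + hi) / 2
        m = mallows(sigma, mid)
        if sum(m[p] * dist[p, sigma] for p in perms) > target:
            lo = mid                                # mean too large -> raise theta
        else:
            hi = mid
    return sigma, (lo + hi) / 2

sigma, theta = (2, 1, 5, 4, 3), 2.0                 # theta_0 > theta'(sigma_0)
for _ in range(40):
    sigma, theta = learn(select(mallows(sigma, theta)))
print(sigma, round(theta, 4))   # expected to stay centered at sigma_0
```

Starting instead from a \(\theta _0\) below \(\theta '(\sigma _0)\), the first learned central permutation would be expected to leave \(\sigma _0\), matching the first bullet of Example 1.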

Table 2 Classification of the behaviors of the EDA. \(f \sim \) \(\hbox {MM}({\sigma ^*},{\theta ^*})\) and \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\)

6 Conclusions and future work

We have presented a mathematical modeling, based on discrete dynamical systems over the expectations, to study an EDA based on Mallows models. Under this framework, we have studied the convergence behavior of the algorithm for several objective functions and initial probability distributions. Two different approaches have been followed to study the convergence behavior. For the simplest cases, the computation of one iteration of the algorithm has been enough to prove the limit behavior, whereas for the most complex cases, the fixed points of the algorithm and their attraction have been analyzed. Overall, for the latter, a wide range of possible limit probability distributions and trajectories of the algorithm have been observed, which, given the practical success of the algorithm [5], were by no means anticipated.

The main results can be summarized as follows. When the function to optimize is constant, all Mallows probability distributions are fixed points. When the function to optimize is a needle in a haystack function centered at \(\sigma ^*\) and the initial probability distribution is a Mallows distribution centered at \(\sigma _0\), the algorithm converges either to the degenerate distribution centered at \(\sigma ^*\) or to a non-degenerate Mallows distribution centered at a permutation \(\sigma \) in the segment between \(\sigma ^*\) and \(\sigma _0\) such that the distance between \(\sigma \) and \(\sigma ^*\) is lower than \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\), with a spread parameter which fulfills the condition to be an (attractive) fixed point. Finally, when the function to optimize is a Mallows model centered at \(\sigma ^*\) and the initial probability distribution is a Mallows distribution centered at \(\sigma _0\), the algorithm converges to a Mallows distribution, centered at a permutation in the segment between \(\sigma ^*\) and \(\sigma _0\), which is an attractive fixed point. The attraction of all the fixed points provides information about the possible trajectories of the algorithm. In any case, the relation between the initial probability distribution and the objective function completely determines the convergence behavior of the algorithm. For this reason, a classification of the convergence behavior of the algorithm with respect to the parameters of the Mallows models is shown.

Although the behavior of the presented algorithm with a finite population can differ from that predicted by the expectations, the variety of scenarios revealed by the presented modeling illustrates the complexity of predicting the limit distributions of finite-population EDAs. As a first comparison between the algorithm with finite and infinite populations, an EDA with a Mallows model using finite populations and the Borda count [11] to estimate the central permutation \(\sigma _0\) could be applied and their performances contrasted. In addition, it would be intriguing to study whether other permutation-based EDAs or distance-based models achieve better convergence behavior, with fewer undesirable limit distributions.

The results obtained in this work were so unexpected that they encourage us to carry out new studies. We propose several future works, both to obtain better solutions in practice and to suggest how a runtime analysis could be performed. For example, according to the obtained results, the central permutation of the initial probability distribution determines which probability distributions can be learnt by the algorithm at each iteration. Hence, for practical purposes, we propose a careful choice of the initial population. For example, a natural proposal is to generate individuals that are as far as possible from each other, expanding the initial search for the optimal solution. This proposal can be compared with the initialization presented in [5, 31], in which the authors apply a preliminary step so as to guide the algorithm toward the optimal solution. On the other hand, if we are interested in the runtime analysis of the algorithm, it is important to take into account some knowledge that emerges from our analysis. We have observed that the estimated spread parameter value at each iteration of the algorithm can be critical. When the change in the estimated spread parameter is large, the algorithm presents several scenarios in which the learned probability distribution can be significantly different (for example, because the estimated central permutation differs in each case), and the probability of sampling the optimal solution depends on it. Conversely, when the change in the estimated spread parameter is small, if the central permutation is not the optimal solution, the probability of reaching it decreases exponentially with the spread parameter value. This observation may allow us to estimate the number of iterations required by the algorithm to converge to a model, and to decide when the algorithm should be modified to escape from its expected tendency.
Another analysis we propose is, starting from different initial probability distributions, to check whether there exists a number of iterations that ensures that the probability of sampling the optimal solution is higher than a given threshold, and to track this probability at each iteration. The obtained results could then be connected with those presented in the literature for binary EDAs, observing the similarities and differences between them.