Abstract
Estimation of Distribution Algorithms have been successfully used to solve permutation-based Combinatorial Optimization Problems. In this case, the algorithms use probabilistic models specifically designed for codifying probability distributions over permutation spaces. One class of these probability models is the class of distance-based exponential models, and one example of this class is the Mallows model. In spite of their practical success, the theoretical analysis of Estimation of Distribution Algorithms for permutation-based Combinatorial Optimization Problems has not been developed as extensively as it has been for binary problems. With this motivation, this paper presents a first mathematical analysis of the convergence behavior of Estimation of Distribution Algorithms based on Mallows models. The model removes the randomness of the algorithm in order to associate a dynamical system to it. Several scenarios of increasing complexity with different fitness functions and initial probability distributions are analyzed. The obtained results show: a) the strong dependence of the final results on the initial population, and b) the possibility to converge to non-degenerate distributions even in very simple scenarios, which has not been reported before in the literature.
Introduction
Context
Evolutionary Algorithms (EAs) are a family of algorithms inspired by the natural evolution of species. Generally, these population-based metaheuristic algorithms work over a set of individuals (solutions), called the population, and at each iteration of the algorithm they introduce several changes to evolve the population and to obtain improved solutions according to the function to optimize, which is denoted as the objective function or, equivalently, the fitness function.
Introduced by Mühlenbein and Paaß [23], Estimation of Distribution Algorithms (EDAs) [19] are an intriguing type of EA. The main characteristic of EDAs in comparison to generic EAs is the use of probability distributions instead of the usual natural evolution operators, such as recombination and mutation. In this way, EDAs start with a population, in most cases obtained by sampling a uniform probability distribution over the search space. From the population, EDAs use a selection operator and obtain a subset of solutions which is used to learn a probability distribution. This distribution can be learned from scratch or by modifying the probability distribution used to sample the population at the previous iteration (such as in cGA [16]). The ideal goals of the learned probability distribution are to summarize the main features of the selected solutions and to highlight the best solutions. Finally, the learned probability distribution is sampled to obtain a new set of solutions and to generate a new population, which is used at the next iteration of the algorithm. In Algorithm 1 the general pseudocode of an EDA which learns a probability distribution from scratch at each iteration is introduced.
Motivation
EDAs have been and are being designed, applied and analyzed in the solution of Combinatorial Optimization Problems (COPs). Particular attention has been paid to the solution of binary COPs, where theoretical results at different levels have been provided for different implementations of EDAs. However, this development has not been extended to other non-binary COPs, such as permutation-based COPs. In order to bridge this gap, in this paper we extend part of those results to the area of permutation-based COPs. Several works have been presented in the literature that use EDAs specifically designed for permutation-based COPs, obtaining strongly competitive results [5, 6, 24, 28]. However, it is still not clear which mechanisms allow them to obtain those results.
Our motivation behind this work is to present, for the first time, a theoretical analysis of EDAs designed for permutation-based COPs, and a mathematical model to study their behavior in several scenarios of increasing complexity. To the best of our knowledge, there are no theoretical studies on permutation-based EDAs. Therefore, we seek general knowledge for a better comprehension of the algorithms designed over the permutation space. Theoretical studies can focus on many different objectives, such as limit behavior of the algorithm, runtime analysis, population sizing and so on. Inspired by the path followed for the theoretical studies of EDAs designed for binary COPs, this first study focuses on convergence analysis.
Current research on binary EDAs is often based on runtime analysis. The goal is to find bounds on the number of generations needed to sample a high quality or optimal solution for the first time. This goal has a close connection with the practical use of the algorithms, where we would like to sample an optimal solution as soon as possible. Notice that an optimal solution can be reached by an algorithm without requiring convergence to it [32].
However, convergence analysis is a natural starting point for a first analysis: it gives insights into the studied algorithms and identifies the scenarios in which the algorithm is guaranteed, by design, to converge to the optimal model. Moreover, this work presents, for the first time, a framework to study EDAs designed for permutation-based COPs. Let us explain in detail our two main motivations. Firstly, as previously mentioned, permutation-based EDAs have presented strongly competitive results in practice. However, there are no theoretical studies that analyze the algorithms which obtain those results. Therefore, our first objective is to study the reasons why, and the characteristics by which, the used algorithms achieve the presented results. Secondly, many mathematical frameworks have been presented in the literature to gain insights into binary EDAs. Some of the mentioned works are referenced in Sect. 1.4. Nevertheless, permutation-based EDAs have not received the same attention from researchers and there are no mathematical frameworks in the literature to study this kind of algorithm. So, our second objective is to generate a mathematical model that can be used to analyze permutation-based EDAs theoretically.
While several EDAs that use different probabilistic models have been designed for permutation-based COPs, we concentrate on those that use the Mallows model, as it is the one that has received the most attention in the literature. The Mallows model [20] is considered the analogue of the Gaussian distribution over the permutation space and it can be included in a more general class of probability models: distance-based exponential models. The Mallows model has been used for designing EDAs in the solution of the Permutation Flowshop Scheduling Problem [5, 6] and the Vehicle Routing Problem with Time Windows [24]. In the mentioned articles, the authors design EDAs in which a Mallows model is learnt from the selected population at each iteration of the algorithm. In [6] the authors named this algorithm Mallows-EDA, whereas in [5, 24] the authors generalize and expand Mallows-EDA. However, even if the mentioned articles have presented competitive results in practice, there are no studies that analyze the behavior of the applied algorithms. We study the reasons for the results obtained by the Mallows-EDA and its characteristics.
Contribution
In this paper, we present a mathematical framework to study a Mallows-EDA and focus on the convergence behavior of the algorithm for several fitness functions. Considering the ideas presented in previous works [15, 22, 34], we will study the sequence of the expected probability distributions obtained at each iteration of the algorithm (or, equivalently, we study the behavior of the algorithm when the population size tends to infinity). In this way, the randomness is removed and the algorithm is modeled as a dynamical system. Finally, our proposed mathematical framework is used to calculate the convergence behavior of the algorithm for several fitness functions. The studied functions are the constant function, the needle in a haystack (analogous to the definition presented in [25]) and a function defined by means of a Mallows model centered at different permutations.
This work is an extensive and detailed expansion of the work [29] and, as far as we know, our results are the first theoretical analysis given in the literature for permutation-based EDAs, and show the obstacles in achieving high quality theoretical results in this unexplored area. In comparison to the mentioned work, in this paper we show in detail the development and the reasons for using each mathematical tool that was not explained in [29] and we extend the obtained results presenting new scenarios and giving the reasons for obtaining them. Based on the motivations explained in Sect. 1.2, this work has three main goals. Our first goal is to present a mathematical framework which allows this study to be reproduced for different distance-based exponential models and new fitness functions. Our second goal is to carry out an analysis so as to provide new knowledge on the convergence behavior of permutation-based algorithms. Moreover, for the objective functions analyzed in the present work, the obtained results are unexpected. We have observed that, for the scenarios in which the initial probability distribution is the uniform distribution or the fitness function is constant, the model converges to the optimal solution. However, in the rest of the studied simple scenarios, the algorithm can converge to a degenerate distribution not necessarily centered at the optimal solution, or to a non-degenerate probability distribution. To determine the limit behavior of the algorithm, the equations to recognize the fixed points of the dynamical system are shown. These obtained results differ from the existing results in the literature for binary EDAs (for example, in [14, 15, 34], the studied algorithms converge to degenerate distributions centered at local optima or the global optimum of the studied fitness function). An exhaustive list of the reached results can be found in Sects. 3, 4 and 5.
At the beginning of each referenced section, a summary of the obtained results is introduced.
Finally, our third goal is to present the obtained knowledge in this study to lay the basis for upcoming research in this area. The presented analysis shows that, given an objective function, the initial probability distribution determines the limit behavior of the algorithm. Therefore, our first proposed algorithmic adaptation is to apply alternative initializations for obtaining high quality solutions. On the other hand, another proposed work is to analyze the expected number of iterations to achieve a high quality or optimal solution for the first time and connect it with the current tendency of the theoretical studies of EDAs. These points are discussed further in Sect. 6.
Related work
EDAs have been mostly designed and studied for binary COPs. Some examples of the EDAs designed for binary COPs are the Univariate Marginal Distribution Algorithm (UMDA) [23], Population-based Incremental Learning (PBIL) [2], the Compact Genetic Algorithm (cGA) [16] and the Factorized Distribution Algorithm (FDA) [22]. Moreover, they have also been complemented with theoretical studies with the purpose of understanding and improving these algorithms [14, 15, 22]. The first theoretical studies focused on their convergence behavior and the current tendency of the theoretical studies is the runtime analysis of the algorithms. We highly recommend the work [18] for a state-of-the-art review of binary EDAs; our ideal goal is to explore permutation-based EDAs in an analogous way.
From [18], we want to highlight three inspiring works. In [14], the authors prove that when the fitness function is unimodal, PBIL converges to the global optimum. In [15], it is proved that any discrete EDA generates a population with an optimal solution if any solution of the search space can be generated at any iteration of the algorithm. In addition, in the same work, the authors review a dynamical system used in the literature to study UMDA and PBIL. In the present work, we have considered the idea of studying EDAs as dynamical systems. Last but not least, in [22], the authors study the convergence behavior of the FDA using Boltzmann and truncation selection and by analyzing finite and infinite populations, which shows the influence of the assumption of infinite populations and the differences in the obtained results.
The remainder of the paper is organized as follows. In Sect. 2, the basic concepts related to the Mallows model and our mathematical framework are introduced. In Sect. 3, the convergence behavior of the framework is studied for a constant objective function f. In Sect. 4, the function f analyzed is a needle in a haystack function. In Sect. 5, the function f analyzed is a Mallows model. In Sects. 3, 4 and 5, two initial distributions are considered for the analysis: the uniform distribution and a Mallows probability distribution. Finally, in Sect. 6, conclusions and future work are presented.
EDA based on Mallows models
The theoretical study of an EDA can focus on many different objectives, such as limit behavior, runtime analysis, population sizing and so on. In this work, the convergence behavior of the Mallows-EDA is studied. To do so, a mathematical model based on dynamical systems is presented.
Notation
The solutions of the studied optimization problems are permutations of length n. Let us denote by \(\varSigma _n\) the space of permutations of length n (\(|\varSigma _n|=n!=N\)) and by \(f:\varSigma _n \longrightarrow {\mathbb {R}}\) the function to maximize. Let us denote by \(\sigma \) a permutation from \(\varSigma _n\) or, equivalently, a solution of the function f. Throughout this work, \(\sigma (i)\) represents the position of the element i in the solution \(\sigma \). The solution \(\sigma ^*\) is an optimal solution:
\[\sigma ^* = \arg \max _{\sigma \in \varSigma _n} f(\sigma ).\]
Moreover, let us define an adjacent transposition of a permutation \(\sigma \) as a swap of two consecutive elements. Additionally, \(\sigma ^{-1}\) is the inverse permutation of \(\sigma \).
A population in the algorithm is a subset (in the multiset sense) of M solutions of \(\varSigma _n\). Let us denote by \(D_i\) the population at step i and \(D_i^S\) the selected individuals from \(D_i\). There are several ways to study EDAs which depend on how an iteration of the algorithm is described. The most common explanation of a step of an EDA is the following one. The algorithm starts the iteration i from a population \(D_i\). Then, a subset of individuals is selected from \(D_i\) by means of the selection operator and \(D_i^S\) is defined. After that, a probability distribution \(P_i^L\) is learnt from \(D_i^S\) and finally a new population \(D_{i+1}\) is generated by sampling solutions from \(P_i^L\) and combining them with the solutions from \(D_i\). In Algorithm 1 the general pseudocode of an EDA is introduced. Still, there exists another possible interpretation of a step of an EDA in which probability distributions are considered as the main mathematical tool to study the algorithm [25]. In this second description, the algorithm starts the iteration i from a probability distribution \(P_i\) and a population \(D_i\) is sampled. Then, \(D_i^S\) is selected and finally a new probability distribution is learnt for the next iteration, \(P_i^L = P_{i+1}\). Throughout this work, the last description has been considered the main interpretation of EDAs for a better comprehension of Sects. 2.2 and 2.4.
The probability distributions can be represented using probability vectors. Let us denote by \(p_i(\sigma )\) the probability of \(\sigma \) under \(P_i\). Therefore, we can denote by \(P_i=(p_i(\sigma _1),\dotsc ,p_i(\sigma _N))\) the probability distribution of the population at iteration i. If we are studying EDAs with finite populations, the vector \(P_i\) can be considered as the “empirical probability mass function” of \(D_i\) (and analogously \(P_i^S\) for the population \(D_i^S\)). We must emphasize that this representation of the populations by probability vectors is conceptual: it is really helpful for our proposed theoretical study, but it cannot be applied in practical EDAs due to the required memory (a vector has \(N=n!\) entries). Moreover, the subindexes used for the permutations of the probability vectors distinguish the N permutations of \(\varSigma _n\), for which an order has been fixed. The space of possible probability vectors \(\varOmega _n\) is defined in the following way:
\[\varOmega _n = \Big \{ (p(\sigma _1),\dotsc ,p(\sigma _N)) \ :\ \sum _{j=1}^{N} p(\sigma _j) = 1,\ 0 \le p(\sigma _j) \le 1,\ j=1,\dotsc ,N \Big \}.\]
To avoid the trivial case, it is assumed that any initial probability vector \(P_0\) satisfies \(p_0(\sigma _j)<1\), for \(j=1,\dotsc ,N\) (\(D_0\) is not formed only by one specific solution). Note that \(\varOmega _n\) contains degenerate distributions. Let us denote by \(1_{\sigma _k}=(1_{\sigma _k}(\sigma _1),\dotsc , 1_{\sigma _k}(\sigma _{k-1}),1_{\sigma _k}(\sigma _k),1_{\sigma _k}(\sigma _{k+1}) ,\dotsc ,1_{\sigma _k}(\sigma _N)) = (0,\dotsc ,0,1,0,\dotsc ,0) \) the degenerate probability distribution centered at \(\sigma _k\).
Hence, if the distributions \(P_i\) are taken as the reference of each step of an EDA, then the EDA can be considered a sequence of probability distributions, each one given by a stochastic transition rule \({\mathcal {G}}\):
\[P_0 \xrightarrow {{\mathcal {G}}} P_1 \xrightarrow {{\mathcal {G}}} P_2 \xrightarrow {{\mathcal {G}}} \cdots \]
that is, \(P_i = {\mathcal {G}}(P_{i-1})={\mathcal {G}}^{i}(P_0),\ \forall i \in {\mathbb {N}}\). Given a probability distribution \(P_i\), the operator \({\mathcal {G}}\) outputs the probability distribution obtained after sequentially applying the sampling, the selection operator and the learning step. In this work, the algorithm analyzed is the Mallows-EDA [6] and the selection operator used throughout is 2-tournament selection. The details are explained in Sects. 2.3 and 2.4.
Hence, our objective is to study the convergence behavior described as follows:
\[\lim _{i \rightarrow \infty } {\mathcal {G}}^{i}(P_0).\]
EDAs based on expectations
The application of the EDA schema to deal with optimization problems can involve an unapproachable variety of situations and behaviors. Due to this difficulty and following the ideas presented in the literature, our proposed mathematical modeling studies the expected probability distribution generated after one iteration of the algorithm. So, our proposed framework studies the deterministic function \(G:\varOmega _n \longrightarrow \varOmega _n\) which assigns the expected probability distribution of the operator \({\mathcal {G}}:\varOmega _n \longrightarrow \varOmega _n\), similar to the idea followed in [14]:
where a(P) is the probability distribution obtained after applying the approximation step, \(\phi \) is the selection operator and \(p(\phi (P_i)=P)\) is the probability to obtain P from \(P_i\). The details of our proposed selection operator and approximation step are explained in Sect. 2.4.
Moreover, \(P_i =G^{i}(P_0)\). By studying the expected probability distribution, the deterministic operator G removes the random drift: each time the algorithm is applied from the same \(P_0\), it ends in the same probability distribution. Another equivalent interpretation of the deterministic operator G is the study of EDAs when the population size of \(D_i\) and \(D_i^S\) tends to infinity [9, 10, 30, 34]. By the Glivenko-Cantelli theorem [8], when the population size tends to infinity, the empirical probability distributions of \(D_i\) and \(D_i^S\) converge to the underlying probability distributions of \(D_i\) and \(D_i^S\), respectively. Under this assumption, \(P_i\) and \(P_i^S\) can be thought of as the population and the selected population at iteration i: in other words, \(P_i\) and \(P_i^S\) replace the populations \(D_i\) and \(D_i^S\) of the finite model, respectively. Therefore, our study can be thought of as the analysis of an EDA that works with the limit distributions of large populations. In Algorithm 2 the general pseudocode of an EDA based on expectations is shown.
Typical selection operators \(\phi \) are n-tournament selection, proportional selection and truncation selection [4, 34].
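Therefore, the operator G induces a deterministic sequence:
\[P_0 \xrightarrow {G} P_1 \xrightarrow {G} P_2 \xrightarrow {G} \cdots \]
and the new objective is to study
\[\lim _{i \rightarrow \infty } G^{i}(P_0).\]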
In Sect. 2.4, the function G used throughout this work to study the convergence behavior of the algorithm is defined.
Mallows model
The Mallows model [20] is a distance-based exponential probability model over permutations. Under this model, the probability value of every permutation \(\sigma \in \varSigma _n\) depends on two parameters: a central permutation \(\sigma _0\) and a spread parameter \(\theta \). The Mallows model is defined as follows:
\[P(\sigma ) = \frac{e^{-\theta d(\sigma ,\sigma _0)}}{\varphi (\theta ,\sigma _0)},\]
where d is an arbitrary distance function defined over the permutation space, \(d(\sigma ,\sigma _0)\) is the distance from \(\sigma \) to the central permutation \(\sigma _0\), and \(\varphi (\theta ,\sigma _0) = \sum _{\sigma \in \varSigma _n} e^{-\theta d(\sigma ,\sigma _0)}\) is the normalization constant. Because the probability decays exponentially with the distance to the center, the Mallows model is considered the analogue of the Gaussian distribution over permutations. To simplify notation, let us denote by \(\hbox {MM}({\sigma _0},{\theta })\) a Mallows probability distribution with central permutation \(\sigma _0\) and spread parameter \(\theta \). Bear in mind that when \(\theta = 0\), \(\hbox {MM}({\sigma _0},{0})\) is a uniform probability distribution for any \(\sigma _0 \in \varSigma _n\).
An important property of a Mallows model is that any two permutations at the same distance from the central permutation have the same probability value. Hence, we can group the permutations according to their distance to the central permutation.
Different distances can be used with the Mallows model, such as Cayley distance, Hamming distance or, the most used distance in the literature for the Mallows model, Kendall tau distance [17], which is the one we use in our EDA analysis.
Definition 1
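Kendall tau distance \(d_\tau (\sigma ,\pi )\) counts the number of pairwise disagreements between \(\sigma \) and \(\pi \). It can be mathematically defined as follows:
\[d_\tau (\sigma ,\pi ) = \big |\{ (i,j) : i<j,\ (\sigma (i)<\sigma (j) \wedge \pi (i)>\pi (j)) \vee (\sigma (i)>\sigma (j) \wedge \pi (i)<\pi (j)) \}\big |,\]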
where \(\sigma (i)\) is the position of the element i in the permutation \(\sigma \) (and similarly with \(\sigma (j)\), \(\pi (i)\) and \(\pi (j)\)).
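As an illustration of Definition 1, the following minimal sketch counts pairwise disagreements directly. It assumes permutations are stored as 0-indexed tuples in which `sigma[i]` is the position of element `i`; the function name `kendall_tau` is hypothetical and not taken from any cited implementation.

```python
from itertools import combinations

def kendall_tau(sigma, pi):
    """Count the pairs of elements (i, j) that sigma and pi order differently.

    sigma[i] is the position of element i (0-indexed tuples, an assumption
    of this sketch).
    """
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        # The pair disagrees when the position differences have opposite signs.
        if (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
    )

identity = (0, 1, 2, 3)
reverse = (3, 2, 1, 0)
print(kendall_tau(identity, reverse))  # every one of the 4*3/2 pairs disagrees
```

The quadratic pair enumeration is chosen for readability; it is adequate for the small n used in an exhaustive analysis over \(\varSigma _n\).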
By definition, \(\varSigma _n\) with \(d_{\tau }\) is a metric space. For simplification purposes, let us denote by \(\sigma \pi \) the composition of \(\sigma \) and \(\pi \) (i.e., \(\sigma \pi = \sigma \circ \pi \)) and by \(d(\sigma ,\pi )\) the Kendall tau distance between \(\sigma \) and \(\pi \). According to the definition, the distance between two permutations is a non-negative integer between 0 and \(D =n(n-1)/2=\left( {\begin{array}{c}n\\ 2\end{array}}\right) \). A property of Kendall tau distance is that, for any \(\sigma ,\pi \in \varSigma _n\), \(d(\sigma ,\pi )+d(\pi ,I'\sigma )= d(\sigma ,I'\sigma )= D\), where \(I' = (n\ n-1\ \cdots \ 1)\). Consequently,
Another property is that Kendall tau distance has the right invariance property; that is, \(d(\sigma ,\pi )=d(\sigma \rho , \pi \rho )\) for every permutation \(\sigma ,\pi ,\rho \in \varSigma _n\) [17]. Consequently, the normalization constant of the Mallows model can without loss of generality be written as \(\varphi (\theta )\).
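The independence of the normalization constant from the central permutation can be checked numerically for small n. The sketch below enumerates \(\varSigma _n\), builds the Mallows probabilities with negative exponent \(-\theta d\) (taking \(\theta > 0\)), and compares the normalization constant for two centers with the well-known product form for Kendall tau distance. The function names (`mallows`, `phi_closed_form`) and the 0-indexed tuple representation are assumptions of this sketch.

```python
from itertools import combinations, permutations
from math import exp, isclose

def kendall_tau(sigma, pi):
    """Number of element pairs ordered differently by sigma and pi."""
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        if (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
    )

def mallows(center, theta):
    """Return ({sigma: probability}, normalization constant) by enumeration."""
    n = len(center)
    weights = {
        s: exp(-theta * kendall_tau(s, center))
        for s in permutations(range(n))
    }
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}, z

def phi_closed_form(n, theta):
    """Product form of the Kendall tau normalization constant, theta > 0."""
    out = 1.0
    for j in range(1, n + 1):
        out *= (1.0 - exp(-j * theta)) / (1.0 - exp(-theta))
    return out

_, z_a = mallows((0, 1, 2, 3), 0.5)
_, z_b = mallows((2, 0, 3, 1), 0.5)
print(isclose(z_a, z_b), isclose(z_a, phi_closed_form(4, 0.5)))
```

Enumeration costs n! evaluations, so this is only a sanity check for small n, not a practical way to evaluate \(\varphi (\theta )\).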
Kendall tau distance can be equivalently written as
\[d(\sigma ,\pi ) = \sum _{i=1}^{n-1} V_i(\sigma ,\pi ),\]
where \(V_i(\sigma ,\pi )\) is the minimum number of adjacent swaps to set the value \(\pi (i)\) in the ith position of \(\sigma \) [21]. It is worth noting that there exists a bijection between any permutation \(\sigma \) of \(\varSigma _n\) and the vector \(\left( V_1(\sigma ,I),\dotsc ,V_{n-1}(\sigma ,I)\right) \), where I represents the identity permutation and \(V_i(\sigma ,I) \in \{0,\dotsc ,n-i\}\), \(\forall i=1,\dotsc ,n-1\). Furthermore, the components \(V_i(\sigma ,I)\) are independent when \(\sigma \) is uniform on \(\varSigma _n\).
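The decomposition can be illustrated with the standard inversion vector, which is assumed here to coincide with \(V_i(\sigma ,I)\): the i-th component counts the later entries that are smaller than the i-th one. The sketch (with the hypothetical name `v_vector` and 0-indexed tuples) checks the range of each component, the bijection with \(\varSigma _n\) for n = 4, and that the components sum to the number of inversions.

```python
from itertools import permutations

def v_vector(sigma):
    """Inversion vector: v[i] counts j > i with sigma[i] > sigma[j].

    Assumed to play the role of (V_1, ..., V_{n-1}) with respect to the
    identity permutation; indices are 0-based in this sketch.
    """
    n = len(sigma)
    return tuple(
        sum(1 for j in range(i + 1, n) if sigma[i] > sigma[j])
        for i in range(n - 1)
    )

n = 4
vectors = {v_vector(s) for s in permutations(range(n))}
# All n! vectors are distinct, so the map is a bijection onto its image.
print(len(vectors))
# Component i (0-based) never exceeds n - 1 - i.
print(all(v[i] <= n - 1 - i for v in vectors for i in range(n - 1)))
# The components of the reversed permutation sum to n(n-1)/2.
print(sum(v_vector((3, 2, 1, 0))))
```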
Finally, with Kendall tau distance, the Mallows model with central permutation \(\sigma _0\) and spread parameter \(\theta \) and the Mallows model with central permutation \(I'\sigma _0\) and spread parameter \(\theta \) are equivalent [13]. Therefore, without loss of generality, we assume that \(\theta > 0\).
Mathematical modeling
As mentioned in the introduction, in this section we present a mathematical framework to study the convergence behavior of a Mallows-EDA by means of a deterministic operator based on expectations. Before presenting our proposed mathematical modeling, we describe how the Mallows-EDA is defined in [6].
The main characteristic of the Mallows-EDA is that the learned probability distribution is a Mallows probability distribution. To learn a Mallows model, the parameters \(\sigma _0\) and \(\theta \) must be estimated, which is done by maximum likelihood estimation. The log-likelihood function for a finite population \(\{\sigma _1,\dotsc , \sigma _M \}\) is as follows [13]:
\[\log L(\theta ,\sigma _0) = -M\theta \sum _{i=1}^{n-1}{\bar{V}}_i - M \log \varphi (\theta ), \qquad (1)\]
where \({\bar{V}}_i\) denotes the observed mean for \(V_i\): \({\bar{V}}_i= \sum _{j=1}^M V_i(\sigma _j,\sigma _0)/M\). As we can observe in Eq. (1), the value \(M\theta \sum _{i=1}^{n-1}{\bar{V}}_i\) depends on \(\sigma _0\) and \(\theta \), whereas the value \(M \log \varphi (\theta )\) only depends on \(\theta \). Therefore, for a fixed non-negative value \(\theta \), maximizing the log-likelihood function is equivalent to minimizing \(\sum _{i=1}^{n-1} {\bar{V}}_i\). This problem is also known as the rank aggregation problem and the Kemeny ranking problem and it is an NP-hard problem [1, 3]. This makes the theoretical analysis very complex.
Therefore, given a sample of M permutations \(\{\sigma _1,\dotsc ,\sigma _M\}\), the first step to obtain the maximum likelihood estimators of the Mallows model is to obtain a permutation \(\sigma _0\) which minimizes \(\sum _{i=1}^{n-1} {\bar{V}}_i\). Let us denote by \({\hat{\sigma }}_0\) the estimated central permutation for the previous minimization problem. Once we obtain \({\hat{\sigma }}_0\), the maximum likelihood estimator of \(\theta \), denoted by \({\hat{\theta }}\), is obtained by solving the following equation [13]:
\[\sum _{i=1}^{n-1} {\bar{V}}_i = \frac{n-1}{e^{\theta }-1} - \sum _{j=1}^{n-1} \frac{n-j+1}{e^{(n-j+1)\theta }-1}. \qquad (2)\]
Although previous theoretical studies that use dynamical systems (e.g., [14, 33]) have closed formulae, the solution of this equation does not. For that reason, a numerical method such as Newton-Raphson has to be used to solve the equation. This is another reason for the complexity of the theoretical analysis. Once \({\hat{\sigma }}_0\) and \({\hat{\theta }}\) are estimated, the Mallows model is completely defined and it is used to sample new solutions for the next iteration of the algorithm. In Algorithm 3 the general pseudocode of the Mallows-EDA defined in [6] is shown.
Throughout this work, in order to study the convergence behavior of the Mallows-EDA based on expectations, the deterministic operator \(G= a \circ \phi \) is used. This operator is a composition of the selection operator \(\phi \) and the approximation step a used to learn the Mallows model. Hence, the operator \(\phi \) returns the expected selection probability of the solutions from \(P_i\) and the function a uses a maximum likelihood estimation method to learn a Mallows model from \(P_i^S\).
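Since the equation for \({\hat{\theta }}\) has no closed-form solution, a numerical root finder is needed. The sketch below uses bisection rather than Newton-Raphson (a deliberate substitution: bisection needs no derivative and is robust because the right-hand side is monotone). It assumes the right-hand side is the standard expression for the expected Kendall tau distance under a Mallows model; the names `expected_distance`, `solve_theta` and the cap `theta_max` are hypothetical.

```python
import math

def expected_distance(n, theta):
    """Expected Kendall tau distance under a Mallows model, for theta > 0.

    Assumed form: (n-1)/(e^theta - 1) - sum_{k=2}^{n} k/(e^{k*theta} - 1).
    expm1 keeps the evaluation accurate for small theta.
    """
    total = (n - 1) / math.expm1(theta)
    for k in range(2, n + 1):
        total -= k / math.expm1(k * theta)
    return total

def solve_theta(n, mean_distance, theta_max=30.0, iters=100):
    """Bisection for theta >= 0 with expected_distance(n, theta) = mean_distance.

    Returns 0.0 when the mean distance reaches n(n-1)/4, the limit of the
    right-hand side as theta tends to 0. theta_max is an arbitrary cap
    chosen so that exp(n * theta_max) does not overflow for small n.
    """
    if mean_distance >= math.comb(n, 2) / 2:
        return 0.0
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        # The expected distance decreases as theta grows.
        if expected_distance(n, mid) > mean_distance:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(solve_theta(4, expected_distance(4, 0.7)))  # recovers a value close to 0.7
```

For n = 2 the expression reduces to \(1/(e^{\theta }+1)\), which gives a quick way to check the implementation by hand.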
The selection operator studied in this work is the widely used 2-tournament selection, but it is worth mentioning that any selection operator based on rankings of solutions which satisfies the impartiality and no-degeneration properties defined in [10] will produce the same results. This selection operator is based on the ranking of the solutions according to the objective function f and cannot assign extreme probabilities. Given the probability distribution \(P_i\) at iteration i and assuming a maximization problem, the expected probability of selecting a solution \(\sigma \) is the sum over all the binary selections in which \(\sigma \) and a solution \(\pi \) with a lower or equal fitness function value have been chosen, that is:
\[p_i^S(\sigma ) = 2 \sum _{\pi \,:\, f(\sigma )>f(\pi )} p_i(\sigma )\, p_i(\pi ) + \sum _{\pi \,:\, f(\sigma )=f(\pi )} p_i(\sigma )\, p_i(\pi ).\]
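The expected effect of 2-tournament selection on a probability vector can be sketched as follows. The sketch assumes that two solutions are drawn independently from \(P_i\), that the fitter one wins, and that ties are split evenly, which yields a factor 2 for strictly worse opponents and a factor 1 for equally fit ones. The function name `tournament_selection` and the dictionary representation are hypothetical.

```python
from itertools import permutations
from math import isclose

def tournament_selection(p, f):
    """Expected distribution after 2-tournament selection (maximization).

    p maps each solution to its probability; f is the fitness function.
    Ties are assumed to be broken uniformly at random.
    """
    selected = {}
    for s, prob_s in p.items():
        worse = sum(q for t, q in p.items() if f(t) < f(s))
        tied = sum(q for t, q in p.items() if f(t) == f(s))  # includes s itself
        selected[s] = prob_s * (2.0 * worse + tied)
    return selected

space = list(permutations(range(3)))
uniform = {s: 1.0 / len(space) for s in space}
needle = (0, 1, 2)

def needle_in_haystack(s):
    """Hypothetical fitness: 1 at the needle, 0 elsewhere."""
    return 1.0 if s == needle else 0.0

p_sel = tournament_selection(uniform, needle_in_haystack)
print(p_sel[needle], isclose(sum(p_sel.values()), 1.0))
```

With a constant fitness function every pair is a tie, so the operator returns the input distribution unchanged.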
Once \(P_i^S\) has been calculated, the function a uses the probabilities \(p_i^S(\sigma )\) to learn a new Mallows model, which is the probability distribution of the next generation. In order to work with the probability vectors and the expected probability distributions and to estimate \(\sigma _0\) and \(\theta \), Eqs. (1) and (2) must be reformulated. To do so, the value \({\bar{V}}_i\) is calculated as the weighted average of \(V_i(\sigma ,\sigma _0)\), using \(p^S(\sigma )\) as the proportion of the solution \(\sigma \) in the selected population. So, we have
\[{\bar{V}}_i = \sum _{\sigma \in \varSigma _n} V_i(\sigma ,\sigma _0)\, p^S(\sigma ).\]
Therefore,
\[\sum _{i=1}^{n-1} {\bar{V}}_i = \sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma _0)\, p^S(\sigma ).\]
So the maximum likelihood estimator of \(\sigma _0\) from the expected selected population is the following:
\[{\hat{\sigma }}_0 = \arg \min _{\sigma _0 \in \varSigma _n} \sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma _0)\, p^S(\sigma ). \qquad (4)\]
The maximum likelihood estimator of \(\sigma _0\) might not be unique. In Sects. 4 and 5, we will observe some \(P^S\) probability distributions in which the estimated central permutation is not unique.
To estimate \(\theta \), we can use Eq. (2) in the same way as with finite populations and solve the following equation:
\[\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\, p^S(\sigma ) = \frac{n-1}{e^{\theta }-1} - \sum _{j=1}^{n-1} \frac{n-j+1}{e^{(n-j+1)\theta }-1}. \qquad (5)\]
Throughout this work, two observations related to the estimation of the spread parameter are considered. Firstly, the right-hand side of Eq. (5) is not defined when \(\theta = 0\). Still, the right-hand side of Eq. (5) tends to \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\) when \(\theta \) tends to 0 and \(\theta = 0\) is a removable singularity (see Proof in Proposition 1 of Appendix A).
Considering this observation, the following lemma proves that when the estimated central permutation is unique, then the estimated spread parameter has a positive value. It is worth mentioning that Lemma 1 is independent of the objective function f and the iteration i of the algorithm.
Lemma 1
Let \(P_i\) be a Mallows probability distribution with central permutation \(\sigma _0\) and spread parameter \(\theta \ge 0\), and \(P_i^S\) the probability distribution after a 2-tournament selection over \(P_i\). Let \({\hat{\sigma }}_0\) be the unique estimator of the central permutation of \(P_{i+1}\). Then, the value \({\hat{\theta }}\) which solves the following equation
\[\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\, p^S(\sigma ) = \frac{n-1}{e^{\theta }-1} - \sum _{j=1}^{n-1} \frac{n-j+1}{e^{(n-j+1)\theta }-1}\]
is a positive value. Equivalently, \(\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma )\) is a value lower than \( \left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\).
Proof
First, let us consider the function g:
\[g(\theta ) = \frac{n-1}{e^{\theta }-1} - \sum _{j=1}^{n-1} \frac{n-j+1}{e^{(n-j+1)\theta }-1}.\]
The function g is a continuous decreasing function such that \(g(\theta )+g(-\theta )=\left( {\begin{array}{c}n\\ 2\end{array}}\right) \), \(\lim _{\theta \longrightarrow -\infty } g(\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \) and \(\lim _{\theta \longrightarrow \infty } g(\theta ) = 0\) (see Proof in Proposition 2 of Appendix A).
Secondly, for any \({\hat{\sigma }}_0\) and \({\hat{\theta }}\) parameters, \(\sum _{\sigma \in \varSigma _n} d(\sigma ,{\hat{\sigma }}_0)\cdot p^S(\sigma )\) is a value from the interval \((0,\left( {\begin{array}{c}n\\ 2\end{array}}\right) )\). In particular,
Considering that, by hypothesis, \({\hat{\sigma }}_0\) is the unique estimator of the central permutation of \(P_{i+1}\),
is obtained and therefore
\(\square \)
The second observation is that in the approximation step of our algorithm, at any iteration, if \(P^S\) is a Mallows model with central permutation \(\sigma _0\) and spread parameter \(\theta \), then the learned Mallows model is the same one: \({\hat{\sigma }}_0=\sigma _0\) and \({\hat{\theta }}=\theta \). The argument to prove this observation is that the probabilities of the solutions are ordered inversely according to their distance to \(\sigma _0\). Hence, Eq. (4) obtains the minimum value at \(\sigma _0\) and it is unique. Furthermore, when \({\hat{\theta }}=\theta \), Eq. (5) is fulfilled because \(P^S\) is a Mallows model. Another way to understand this observation is that when we work with infinite population and the sampling step is not needed, the probability distribution is kept constant. To simplify notation, throughout this work, let us consider the uniform distribution as a Mallows model with central permutation \(\sigma _0 \in \varSigma _n\) and spread parameter 0.
In addition, throughout this work it is assumed that the algorithm learns \(1_{\sigma _k}\) probability distribution if \(P^S=1_{\sigma _k}\). Note that \(1_{\sigma _k}\) is obtained as the limit distribution of \(\hbox {MM}({\sigma _k},{\theta })\) when \(\theta \) tends to infinity.
Once we have defined the selection operator and how we learn a new probability distribution, our operator G is defined. The schema of one iteration of the algorithm is the following:
where \(\phi \) is 2-tournament selection and a is the approximation step that learns a Mallows probability distribution by maximum likelihood estimation.
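One iteration of the operator G can be sketched for small n by working directly with exact probability vectors over \(\varSigma _n\). The sketch below rests on two assumptions that are not spelled out in this section: the selection step uses the usual infinite-population 2-tournament formula (the fitter of two independent draws wins, ties are split evenly), and the approximation step picks the permutation minimizing the mean distance and then matches that mean distance by bisection. All function names are hypothetical.

```python
from itertools import permutations
from math import exp


def kendall(a, b):
    """Kendall tau distance: number of index pairs that a and b order differently."""
    n = len(a)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )


def mallows(perms, center, theta):
    """Probabilities of MM(center, theta) over all permutations in perms."""
    w = [exp(-theta * kendall(q, center)) for q in perms]
    z = sum(w)
    return [x / z for x in w]


def tournament(p, f):
    """Assumed 2-tournament selection on an infinite population."""
    out = []
    for pi, fi in zip(p, f):
        worse = sum(pj for pj, fj in zip(p, f) if fj < fi)
        tied = sum(pj for pj, fj in zip(p, f) if fj == fi)
        out.append(pi * (2.0 * worse + tied))
    return out


def mean_distance(perms, p, center):
    return sum(kendall(q, center) * pq for q, pq in zip(perms, p))


def fit(perms, ps, hi=50.0):
    """Approximation step: center minimizes the mean distance, theta matches it."""
    center = min(perms, key=lambda c: mean_distance(perms, ps, c))
    target = mean_distance(perms, ps, center)
    lo = 0.0
    for _ in range(100):  # bisection on a decreasing function of theta
        mid = (lo + hi) / 2
        if mean_distance(perms, mallows(perms, center, mid), center) > target:
            lo = mid
        else:
            hi = mid
    return center, (lo + hi) / 2


def G(perms, p, f):
    """One iteration: selection followed by the approximation step."""
    center, theta = fit(perms, tournament(p, f))
    return center, theta, mallows(perms, center, theta)


perms = list(permutations(range(3)))
needle = (0, 1, 2)
f = [1.0 if q == needle else 0.0 for q in perms]  # needle in a haystack
p0 = [1.0 / len(perms)] * len(perms)              # uniform start
center, theta, p1 = G(perms, p0, f)
print(center, theta)
```

Enumerating \(\varSigma _n\) is only feasible for very small n, so this is an illustration of the deterministic model rather than a practical implementation.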
The aim of the following sections is to apply our proposed mathematical modeling in some scenarios. Each scenario considers an objective function f and an initial probability distribution \(P_0\). Our objective is to calculate \(G^{i}(P_0)\) when i tends to infinity. To do so, \(G^i(P_0)\) are calculated, for \(i=1,2,3,\dotsc \), and the results are analyzed. In some particular cases, it is enough to calculate \(G(P_0)\) to induce the limit behavior of the algorithm. For the most difficult cases, we study the fixed points of the algorithm and their attraction behavior, following the same idea used in the literature as in [14], among others.
In order to simplify the analysis and to present the tools and methods used to achieve our objectives, in this work we have considered three specific cases for the objective function. In Sect. 3, f is a constant function; in Sect. 4, f is a needle in a haystack function; and in Sect. 5, f is defined by a Mallows model. Objective functions such as the constant function and the needle in a haystack function have been used in many studies of different algorithms in the literature, whereas the Mallows model has been studied as an example of a unimodal objective function and to analyze the relation among the learned Mallows probability distributions by our dynamical system and the objective function. For these cases, we have considered \(P_0\) as a uniform distribution or a Mallows model.
Limiting behavior for a constant function
In these first scenarios, the function f to optimize is constant: \(f(\sigma )=c,\ \forall \sigma \in \varSigma _n\). Hence, any solution can be considered a global optimum. In this situation, it is proved that the algorithm keeps the initial probability distribution forever. We can summarize all the results from this section in Theorem 1.
Theorem 1
If f is a constant function and P a Mallows probability distribution, then \(G(P)=P\).
Proof
Starting from any Mallows model \(\hbox {MM}({\sigma _0},{\theta })\), let us observe the first iteration of the algorithm and calculate G(P). It is proved that the selection method keeps the same distribution, and then the learned parameters are \(\sigma _0\) and \(\theta \).
When f is a constant function, all the solutions are global optima. So, the selection probability of each solution is the same as the initial probability:
Given that \(P^S=P\), the next step of the algorithm is to estimate the parameters to learn a Mallows model from P. By the observation from Sect. 2.4 about the estimation of the parameters from a Mallows model, it is deduced that \({\hat{\sigma }}_0 = \sigma _0\) and \({\hat{\theta }} = \theta \). Consequently, it is proved that when f is a constant function, \(G(P)=P\) for any Mallows distribution P. \(\square \)
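The key step of this proof, \(P^S=P\) for a constant objective function, can be checked numerically. This is a minimal sketch, assuming the usual infinite-population 2-tournament formula in which ties are split evenly; the central permutation and spread value below are arbitrary illustrative choices.

```python
from itertools import permutations
from math import exp


def kendall(a, b):
    """Kendall tau distance: number of index pairs that a and b order differently."""
    n = len(a)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )


def tournament(p, f):
    """Assumed 2-tournament selection on an infinite population."""
    out = []
    for pi, fi in zip(p, f):
        worse = sum(pj for pj, fj in zip(p, f) if fj < fi)
        tied = sum(pj for pj, fj in zip(p, f) if fj == fi)
        out.append(pi * (2.0 * worse + tied))
    return out


perms = list(permutations(range(4)))
center, theta = (2, 0, 3, 1), 0.7      # arbitrary Mallows parameters
w = [exp(-theta * kendall(q, center)) for q in perms]
z = sum(w)
p = [x / z for x in w]

f = [5.0] * len(perms)                 # constant objective function
ps = tournament(p, f)
max_gap = max(abs(a - b) for a, b in zip(p, ps))
print(max_gap)                         # zero up to rounding error
```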
Limiting behavior for a needle in a haystack function
In the next case, f is a needle in the haystack function centered at \(\sigma ^*\); the function is constant except for one solution \(\sigma ^*\), which is the optimal solution. Let us define
such that \(c' > c\).
In this section, the analysis focuses on the evolution and the convergence behavior of the algorithm when the fitness function can only take two possible values, one value for the optimal solution and the second value for any other solution. The analysis has been separated into three sections. In Sect. 4.1, the case when \(P_0\) is a uniform distribution is considered. In this particular case, the main procedure of the algorithm is shown and some general results are explained. As a result of this analysis, the case when \(P_0\) is a Mallows model centered at \(\sigma ^*\) is analyzed, which is mentioned in Sect. 4.2. Finally, in Sect. 4.3, \(P_0\) is a Mallows model centered at \(\sigma _0 \ne \sigma ^*\). In this case, a general observation among the rest of Mallows models is explained. To do so, the fixed points of the algorithm are calculated.
\(P_0\) a uniform initial probability distribution
In this section it is proved that when the initial probability distribution is a Mallows distribution centered at the optimal solution of the needle in the haystack function the algorithm converges to the degenerate distribution centered at the optimum. The obtained result in this section can be summarized in the following lemma.
Lemma 2
Let f be a needle in a haystack function centered at \(\sigma ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _0 \ge 0\). Then, the proposed EDA always converges to the degenerate distribution centered at \(\sigma ^*\).
Proof
Let us start the demonstration from the case that \(P_0\) is a uniform distribution. In order to calculate the limit behavior of the algorithm, let us start by calculating \(G(P_0)\), starting from the computation of \(P_0^S\). In this case, there are two different cases to analyze in the selection step. If \(\sigma ^*\) is chosen to take part in the tournament, then it has an equal or higher function value than any other permutation, so \(\sigma ^*\) is always selected. For the permutations \(\sigma \ne \sigma ^*\), they behave in the same way as when f is a constant function. So the probability after selection is as follows:
This same argument can be used for any iteration of the algorithm for the selection operator.
After the selection probability has been computed, let us study the estimation of the parameters for the Mallows models. Let us start with the estimation of the central permutation in different iterations of the algorithm, and after that, the estimated spread parameters.
At the first iteration of Algorithm 2, in order to calculate \({\hat{\sigma }}_0\) for \(P_1\), it is necessary to calculate the solution of Eq. (4) using \(P_0^S\). Bear in mind that for any \(\sigma \ne \sigma ^*\),
because the selection probabilities for all the permutations except \(\sigma ^*\) are the same, and the right invariance property over the Kendall tau distance ensures that the number of solutions at each distance is the same: that is, for a fixed \(d \in \{0,\dotsc ,D\}\), \(|\{ \pi \in \varSigma _n : d(\pi ,\sigma )=d \}|\) is constant for any \(\sigma \in \varSigma _n\) (see Definition 2).
Let \(\sigma \ne \sigma ^*\). Thus, \(d(\sigma ,\sigma ^*)=d>0\). Therefore, considering Eq. (8) and \(p_0^S(\sigma ^*)>p_0^S(\sigma )\),
and it proves that the maximum likelihood estimator of the central permutation is \(\sigma ^*\).
So \(P_1\) is a Mallows model with central permutation \(\sigma ^*\). Because of the uniqueness of the estimated central permutation and by Lemma 1, the estimated spread parameter of \(P_1\) is a positive value. In order to generalize the obtained results to any iteration of the algorithm, let us calculate the central permutation of \(P_2\). To determine \(P_1^S\), we consider Eq. (7) from \(P_1\). Accordingly, for each solution, the lower the distance to \(\sigma ^*\), the higher the probability of selecting the solution is. Therefore, to calculate \(P_2\), we can repeat the same argument of Sect. 3 to prove that \({\hat{\sigma }}_0=\sigma ^*\). This same argument can be repeated for any iteration \(i > 2\).
Once it has been proved that \(\sigma ^*\) is the estimated central permutation for the learned Mallows model at any iteration of the algorithm, let us study the estimation of \(\theta \). As we have mentioned previously, there is no closed formula for the solution of Eq. (5). Hence, instead of calculating the value of \(\theta \), we follow a different avenue to prove the limiting behavior of the algorithm. Knowing by Lemma 1 that the estimated spread parameter \({\hat{\theta }}\) at any iteration of the algorithm is positive, we prove that the estimated spread parameter increases in two consecutive iterations.
Particularly, Eq. (5) is analyzed to see if the spread parameter at iteration \(i+1\) is a higher or lower value than the spread parameter at iteration i. To this end, two consecutive iterations are considered and the difference between \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_i^S(\sigma )\) and \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_{i+1}^S(\sigma )\) is analyzed. Without loss of generality, let us analyze the relation when \(i=0\).
The difference between the values of the left-hand side of (5) depends on the values \(p_0^S(\sigma )\) and \(p_1^S(\sigma )\), \(\forall \sigma \in \varSigma _n\). Firstly, remember that \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_0^S(\sigma )\) was used to calculate the spread parameter of the Mallows probability distribution \(P_1\). Hence, by the definition of the operator a it holds that \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_0^S(\sigma )\) and \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1(\sigma )\) are the same value (this argument has been used to specify that the estimated parameters of a Mallows model to learn a new Mallows model are the same). Let us denote \({C= \sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1(\sigma )}\) and compare it with \(\sum _{\sigma \in \varSigma _n} d(\sigma ,\sigma ^*)\cdot p_1^S(\sigma )\). Using Eq. (7) for \(P_1\),
This implies that the left-hand side of Eq. (5) decreases in two consecutive iterations. Hence, as the function g defined in Eq. (6) is a strictly decreasing function over \(\theta \), the spread parameter increases after one iteration of the algorithm. So, \(\theta _2\) is a higher value than \(\theta _1\).
Using the same reasoning for any iteration, we can observe that at each iteration \(p(\sigma ^*)\) increases, whereas for all \(\sigma \ne \sigma ^*\) \(p(\sigma )\) decreases. Moreover,
Consequently, \( \theta \) tends to infinity when the number of iterations increases.
Therefore, after applying our modeling departing from a uniform distribution to a needle in a haystack function, the algorithm converges to a Mallows model with central permutation \(\sigma ^*\) and a spread parameter \(\theta \) which tends to infinity. Hence, the distribution in the limit is concentrated around \(\sigma ^*\). \(\square \)
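The growth of the spread parameter can be illustrated numerically. The sketch below takes for granted the result proved above, namely that the estimated central permutation stays at \(\sigma ^*\), and tracks only \(\theta \). It assumes the usual infinite-population 2-tournament formula, under which every \(\sigma \ne \sigma ^*\) keeps the fraction \(1-p(\sigma ^*)\) of its probability, and it assumes that g is the expected distance under the Mallows model. The helper names are hypothetical.

```python
from itertools import permutations
from math import exp

n = 4
# Kendall tau distance of every permutation to sigma* = identity (inversion count).
dists = [
    sum(1 for i in range(n) for j in range(i + 1, n) if q[i] > q[j])
    for q in permutations(range(n))
]


def g(theta):
    """Expected distance to the center under a Mallows model with spread theta."""
    w = [exp(-theta * d) for d in dists]
    return sum(d * x for d, x in zip(dists, w)) / sum(w)


def solve_theta(target, hi=60.0):
    """Bisection for g(theta) = target, using that g is decreasing."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


theta, thetas = 0.0, []            # theta = 0 is the uniform distribution
for _ in range(8):
    p_star = 1.0 / sum(exp(-theta * d) for d in dists)  # probability of sigma*
    # After selection the mean distance to sigma* shrinks by the factor (1 - p*).
    target = (1.0 - p_star) * g(theta)
    theta = solve_theta(target)
    thetas.append(theta)
print(thetas)                      # an increasing sequence
```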
\(P_0\) a Mallows probability distribution with central permutation \(\sigma ^*\) and spread parameter \(\theta _0\)
This case is the same as the one in Sect. 4.1 after the first iteration. Hence, the algorithm converges to a degenerate distribution centered at \(\sigma ^*\).
\(P_0\) a Mallows probability distribution with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\)
Due to the difficulty of this case in comparison with the previous ones, the analysis of the convergence behavior of the algorithm is made from a new point of view. In this section, our objectives are to study the possible fixed points of the algorithm and to analyze the behavior of our dynamical system. Our first objective is to calculate the fixed points of the algorithm. A probability distribution is a fixed point of the algorithm if, after one iteration, the algorithm does not estimate a different probability distribution: that is to say, \(G(P)=P\). Consequently, the algorithm will always estimate the same probability distribution.
In Sect. 4.3, the following proof idea is used:

(i) In Sect. 4.3.1, the fixed points of the algorithm are calculated. First, it is proved that any degenerate distribution is a fixed point. Then, nondegenerate fixed points are calculated.

(ii) In Sect. 4.3.2, the attraction of the fixed points is studied.

(iii) Finally, in Sect. 4.3.3, the performance of the algorithm is analyzed for different initial probability distributions \(P_0\).
Computation of the fixed points
For our first aim of Sect. 4.3, let us calculate the fixed points of our dynamical system G. First, let us realize that any degenerate distribution is a fixed point of the discrete dynamical system G. The selection probability departing from \(1_{\sigma _k}\) is:
Therefore, the probabilities of the solutions after the selection operator keep the same values of \(1_{\sigma _k}\), that is, \(P^S=1_{\sigma _k}=P\). Hence, bearing in mind that in Sect. 2.4 it has been assumed that the estimated model from a degenerate distribution is the same degenerate distribution, \(G(1_{\sigma _k})=1_{\sigma _k}\) is obtained.
However, the degenerate distributions are not the only fixed points of the discrete dynamical system G. By definition of G, any Mallows probability distribution for which the algorithm learns the same distribution is a fixed point; in other words, after the selection operator, if the algorithm estimates the same central permutation and spread parameter as in the previous distribution, then the Mallows probability distribution is a fixed point. In Lemma 3, a formal result of this idea is presented, showing which two equations are sufficient to achieve a fixed Mallows probability distribution.
Lemma 3
Let P be a Mallows probability distribution with central permutation \(\sigma _0\) and spread parameter \(\theta _0 < \infty \). If for all \(\sigma \ne \sigma _0\),
and
are fulfilled, then \(G(P)=P\).
Proof
By the maximum likelihood estimator of the parameters of the Mallows model, Inequality (9) ensures \({\hat{\sigma }}_0=\sigma _0\). In order to prove that \({\hat{\theta }}=\theta _0\), considering by hypothesis that P is a Mallows model and by Eqs. (5) and (10),
\(\square \)
Inequality (9) ensures \({\hat{\sigma }}_0=\sigma _0\) and Eq. (10) obtains \({\hat{\theta }}=\theta _0\). Inequality (9) and Eq. (10) can be written consecutively: for all \(\sigma \ne \sigma _0\),
Lemma 3 presents a sufficient condition to achieve fixed points of the algorithm. Unfortunately, Lemma 3 does not present “the necessary condition” because of one very particular case: when \(G(P)=P\), it cannot be ensured that \(\sigma _0\) obtains the minimum value at Inequality (9) (perhaps there are more permutations which obtain the minimum value), even if \({\hat{\theta }}=\theta _0\). In the case that \(\sigma _0\) is the unique solution of Inequality (9), then Lemma 3 would present the necessary condition to be a fixed point. To avoid these specific scenarios and the equality case in Inequality (9), which represent zero Lebesgue measure sets, from now on we will consider that \(\sigma _0\) is the estimated central permutation. In practice, the EDA can be designed to have a preference criterion for ties.
Based on Lemma 3, our next objective is to observe the sufficient conditions to achieve fixed points of the algorithm when f is a needle in a haystack function. First, it is studied when \({\hat{\theta }}=\theta _0\), and then whether or not \({\hat{\sigma }}_0=\sigma _0\) is satisfied. Let us study Eq. (10).
From Eq. (12) we can deduce that \(\hbox {MM}({\sigma _0},{\theta _0})\) is not a fixed point if \(d(\sigma ^*,\sigma _0) \ge D/2\). This is due to the fact that the right-hand side of Eq. (5) tends to 0 when \(\theta \) tends to infinity and the supremum of \(\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi )\) is D/2. Note that this also means that if we start with \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\) such that \(d(\sigma ^*,\sigma _0) \ge D/2\), then the algorithm can only converge to a solution \(\sigma \) different from \(\sigma _0\) such that \(d(\sigma ^*,\sigma ) < D/2\).
Let us observe whether \({\hat{\sigma }}_0=\sigma _0\) is fulfilled when \({\hat{\theta }}=\theta _0\) and \(d(\sigma ^*,\sigma _0) < D/2\) (considering the case that the estimated central permutation is unique):
The right-hand side of the equation is simplified by Eq. (12):
By the definition of the selection probability (Eq. (7)),
Solving for the summation in the left-hand side of the inequality,
The value of the right-hand side of Inequality (14) can vary according to \(d(\sigma ^*,\sigma )\). In order to avoid repeating the same proof for different values of \(d(\sigma ^*,\sigma )\), let us consider the maximum possible value of the right-hand side of Inequality (14), which is the worst possible case, and prove it. Substituting the expression \(d(\sigma ^*,\sigma _0)p(\sigma ^*)d(\sigma ^*,\sigma )\) by \(d(\sigma ^*,\sigma _0)\), we obtain the following inequality:
On the left-hand side of Inequality (15), the sum depends on \(\sigma \). In order to prove for all \(\sigma \ne \sigma _0\), let us take the smallest possible value. Considering that P is a Mallows model centered at \(\sigma _0\), the probabilities are ordered according to their distance to \(\sigma _0\). So, from the set \(\varSigma _n \backslash \{\sigma _0\}\), any solution \(\sigma \) at distance 1 from \(\sigma _0\) has the lowest value \(\sum _{\pi \in \varSigma _n}d(\pi ,\sigma )p(\pi )\), because \(d(\pi ,\sigma )=d(\pi ,\sigma _0)\pm 1\). Rewriting the previous equation for a solution \(\sigma \) at distance 1 from \(\sigma _0\),
In order to simplify the previous equation, let us introduce some new notation and definitions.
Definition 2
For any \(\sigma \) in \(\varSigma _n\) and \(d=0,\dotsc ,D\), let us denote
The sequence A008302 in The On-Line Encyclopedia of Integer Sequences (OEIS) [26] shows the first values and some properties of the numbers \(m_n(d)\).
Definition 3
For any \(\sigma \) and \(\tau \) in \(\varSigma _n\) such that \(d(\sigma ,\tau )=1\), and \(d=0,\dotsc ,D\), let us denote
and \(m_n^1(d) = {\mathcal {D}}_d\).
The sequence of nonnegative numbers \(m_n^1(d)\) has been added in OEIS [26] (sequence A307429) by the authors of the present paper and several properties have been explained in Appendix B. To rewrite Inequality (16), Properties (ii), (iii) and (iv) from Appendix B have been used. These state that \({m_n(d) = m_n^1(d) + m_n^1(d-1)}\), \(m_n^1(d)={m_n^1(D-d-1)}\) and that \({m_n^1(d) > m_n^1(d-1)}\) when \(d \in \{0,\dotsc ,d_{max}\}\), where \(d_{max}=(D/2)-1\) when D is even and \(d_{max}=\lfloor D/2 \rfloor \) when D is odd. Remembering that \(\varphi (\theta )=\sum _{\sigma \in \varSigma _n}e^{-\theta d(\sigma ,\sigma _0)}\) is the normalization constant for the Mallows probability distribution, Inequality (16) can be rewritten in the following way (let us denote \(d(\sigma ^*,\sigma _0)=d^*\)):
The proof of Inequality (17) is shown in Appendix C. Therefore, the learned central permutation from \(P \sim \) \(\hbox {MM}({\sigma _0},{{\hat{\theta }}})\) is \(\sigma _0\). To sum up, \(P \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\) is a fixed point if \(d(\sigma ^*,\sigma _0) < D/2\) and \(\theta _0\) fulfills Eq. (12).
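A nondegenerate fixed point can be checked numerically in the smallest nontrivial case, \(n=3\) with \(d(\sigma ^*,\sigma _0)=1 < D/2\). The sketch below relies on several assumptions: the usual infinite-population 2-tournament formula for the selection step, the reading of Eq. (12) as the requirement that the mean distance to \(\sigma _0\) under P equals \(d(\sigma ^*,\sigma _0)\) (which is what that selection formula gives when the selected and original mean distances are equated), and a bisection solver for the spread parameter. All function names are hypothetical.

```python
from itertools import permutations
from math import exp


def kendall(a, b):
    """Kendall tau distance: number of index pairs that a and b order differently."""
    n = len(a)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )


def mallows(perms, center, theta):
    w = [exp(-theta * kendall(q, center)) for q in perms]
    z = sum(w)
    return [x / z for x in w]


def tournament(p, f):
    """Assumed 2-tournament selection on an infinite population."""
    out = []
    for pi, fi in zip(p, f):
        worse = sum(pj for pj, fj in zip(p, f) if fj < fi)
        tied = sum(pj for pj, fj in zip(p, f) if fj == fi)
        out.append(pi * (2.0 * worse + tied))
    return out


def mean_distance(perms, p, center):
    return sum(kendall(q, center) * pq for q, pq in zip(perms, p))


def solve_theta(perms, center, target, hi=50.0):
    """Bisection for the spread whose mean distance to center equals target."""
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mean_distance(perms, mallows(perms, center, mid), center) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


perms = list(permutations(range(3)))
sigma0, needle = (0, 1, 2), (1, 0, 2)   # d(sigma*, sigma0) = 1 < D/2 = 1.5
f = [1.0 if q == needle else 0.0 for q in perms]

# Candidate fixed point: mean distance to sigma0 equals d(sigma*, sigma0).
theta_hat = solve_theta(perms, sigma0, kendall(needle, sigma0))
p = mallows(perms, sigma0, theta_hat)

# One iteration: selection, then re-estimation of center and spread.
ps = tournament(p, f)
center = min(perms, key=lambda c: mean_distance(perms, ps, c))
theta = solve_theta(perms, center, mean_distance(perms, ps, center))
print(theta_hat, center, theta)
```

In this instance the re-estimated parameters coincide with \(\sigma _0\) and \({\hat{\theta }}\) up to numerical precision, even though \(\sigma ^*\) has a higher selection probability than \(\sigma _0\).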
Attraction of the fixed points
In Sect. 4.3.1, all the fixed points of the algorithm, degenerate and nondegenerate, have been studied. Let us call a fixed point of the dynamical system attractive if any Mallows model P near the fixed point converges to it: that is to say, any P that has the same central permutation as the fixed point and a spread parameter value \(\theta \) “close” to \({\hat{\theta }}\) (in the limit sense) will converge to the fixed point. In addition, from the study of the fixed points, several observations have been derived.
For example, from Eq. (12), the attraction of the nondegenerate fixed points is totally deduced. Let us denote by \({\hat{\theta }}_{d^*}\) the minimum spread parameter values which fulfill Eq. (12) according to \(d(\sigma ^*,\sigma _0)\). In Eq. (10), \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) are compared to observe when the estimated spread parameter value remains the same value. Let us denote by \({\hat{\theta }}\) the spread parameter value which fulfills Eq. (10). However, \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) can be compared for any other spread parameter value \(\theta _0\). For example, when \(\theta _0 < {\hat{\theta }}_{d^*}\), \(d(\sigma ^*,\sigma _0) < \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) < \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\), and consequently the learned spread parameter is greater than \(\theta _0\); and when \(\theta _0 > {\hat{\theta }}_{d^*}\), then the learned spread parameter decreases. This observation shows us that the nondegenerate fixed points are attractive.
Another observation is that for sufficiently large \(\theta _0\) we obtain \(d(\sigma ^*,\sigma _0) > \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\) and, consequently, \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p^S(\pi ) > \sum _{\pi \in \varSigma _n} d(\pi ,\sigma _0)p(\pi )\), which implies that \({\hat{\theta }}_0 < \theta _0\). Hence, all the degenerate fixed points centered at \(\sigma \ne \sigma ^*\) are not attractive. Consequently, the algorithm ends in a nondegenerate fixed point centered at \(\sigma \ne \sigma ^*\) or in the degenerate distribution centered at \(\sigma ^*\).
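The comparisons in the two preceding paragraphs can be reproduced numerically. The sketch below keeps the central permutation fixed at \(\sigma _0\) (it does not re-estimate it) and only follows the spread parameter. It assumes the usual infinite-population 2-tournament formula, under which the mean distance to \(\sigma _0\) after selection is \(C + p(\sigma ^*)\,(d^* - C)\), where C is the mean distance before selection. The helper names and the offset 0.2 are illustrative choices.

```python
from itertools import permutations
from math import exp

n, d_star = 4, 1                    # d_star = d(sigma*, sigma0) < D/2 = 3
# Kendall tau distance of every permutation to sigma0 = identity.
dists = [
    sum(1 for i in range(n) for j in range(i + 1, n) if q[i] > q[j])
    for q in permutations(range(n))
]


def g(theta):
    """Mean distance to sigma0 under MM(sigma0, theta)."""
    w = [exp(-theta * d) for d in dists]
    return sum(d * x for d, x in zip(dists, w)) / sum(w)


def solve_theta(target, hi=60.0):
    """Bisection for g(theta) = target, using that g is decreasing."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def learned_theta(theta0):
    """Spread learned after one needle selection step, center kept at sigma0."""
    z = sum(exp(-theta0 * d) for d in dists)
    p_star = exp(-theta0 * d_star) / z   # probability of sigma* under P
    c = g(theta0)
    # Mean distance to sigma0 after selection lies between c and d_star.
    return solve_theta(c + p_star * (d_star - c))


theta_hat = solve_theta(d_star)     # spread of the nondegenerate fixed point
below = learned_theta(theta_hat - 0.2)
above = learned_theta(theta_hat + 0.2)
print(theta_hat, below, above)
```

Starting below \({\hat{\theta }}_{d^*}\) the learned spread increases, and starting above it the learned spread decreases, in both cases without crossing \({\hat{\theta }}_{d^*}\).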
Moreover, Eq. (13) shows us the condition to estimate \(\sigma _0\) as the central permutation. Hence, there exists a spread parameter value \({\tilde{\theta }}_{d^*}\) (dependent on \(d(\sigma ^*,\sigma _0)<D/2\)) such that if \(\theta _0 < {\tilde{\theta }}_{d^*}\), then the estimated central permutation is not \(\sigma _0\). If \(\theta _0 ={\tilde{\theta }}_{d^*}\), then the algorithm can estimate more than one central permutation and its behavior will depend on the estimated central permutation. However, we will not focus on those exact Mallows models because they represent a zero Lebesgue measure set. In Fig. 1, the first values of \({\hat{\theta }}\) which fulfill Eq. (12) and \({\tilde{\theta }}_{d^*}\) are displayed for \(n=4,5,6\) and 7 and their respective \(d^*\) values, showing the proved result. The y-axis is plotted in log scale to distinguish all the lines. In addition, for a fixed value n, it can be verified that when d increases, due to the fact that the right-hand side of Eq. (12) is a decreasing function, Eq. (12) is fulfilled for a lower \({\hat{\theta }}_d\) value.
Convergence behavior of the algorithm
After analyzing the attraction of the fixed points, the next step is to study the evolution of the estimated Mallows models; that is, when the algorithm estimates a new central permutation which is different from \(\sigma _0\), is it possible to limit the number of scenarios of the algorithm in advance? Can we know which fixed point is the convergence point of the algorithm in any situation?
In many cases it is shown to which fixed point the algorithm converges. The main result that is given about the convergence point of the algorithm is Lemma 4. Lemma 4 demonstrates that the algorithm estimates a central permutation which must be in a set of solutions dependent on \(\sigma ^*\) and \(\sigma _0\). In addition, for any \(\sigma _0\), there exists a spread parameter value \({\tilde{\theta }}(\sigma _0)\) such that if \(\theta _0 < {\tilde{\theta }}(\sigma _0)\), then the algorithm estimates a new central permutation different from \(\sigma _0\).
In order to prove Lemma 4, let us consider Definition 4.
Definition 4
Let \(\varSigma _n\) be the search space with metric \(d(\cdot ,\cdot )\). Let \(\sigma \) and \(\pi \) be two solutions of \(\varSigma _n\). Then, the segment from \(\sigma \) to \(\pi \), \(C(\sigma ,\pi )\), is the set with the permutations \(\tau \in \varSigma _n\) such that \(\sigma \), \(\pi \) and \(\tau \) fulfill the equality in the triangle inequality.
Let us call \(\tau \in C(\sigma ,\pi )\) a solution between \(\sigma \) and \(\pi \). Hence, \(C(\sigma ,\pi )\) is the set that includes all the permutations between \(\sigma \) and \(\pi \). Let us call the segment from \(\sigma \) to \(\pi \) unique when \(|C(\sigma ,\pi )|=d(\sigma ,\pi )+1\).
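Definition 4 can be made concrete by enumerating segments in \(\varSigma _3\). This is a minimal sketch; the function names `segment` and `is_unique` are illustrative.

```python
from itertools import permutations


def kendall(a, b):
    """Kendall tau distance: number of index pairs that a and b order differently."""
    n = len(a)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )


perms = list(permutations(range(3)))
identity = (0, 1, 2)


def segment(a, b):
    """Permutations t attaining equality in the triangle inequality between a and b."""
    return [t for t in perms if kendall(a, t) + kendall(t, b) == kendall(a, b)]


def is_unique(a, b):
    """A segment is unique when it contains exactly d(a, b) + 1 permutations."""
    return len(segment(a, b)) == kendall(a, b) + 1


seg_adjacent = segment(identity, (1, 0, 2))   # endpoints at distance 1
seg_middle = segment(identity, (1, 2, 0))     # endpoints at distance 2
seg_reverse = segment(identity, (2, 1, 0))    # endpoints at maximum distance
print(len(seg_adjacent), len(seg_middle), len(seg_reverse))
```

The segment between a permutation and its reverse contains the whole search space, so it is not unique, whereas the other two segments in this example are.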
Two swaps are disjoint if the intersection of the sets of elements exchanged by each swap is empty.
Lemma 4
Let \(d(\cdot ,\cdot )\) be the Kendall tau distance and f an objective function such that its maximal solution is \(\sigma ^*\) and for any \(\sigma ,\pi \in \varSigma _n\), \(d(\sigma ,\sigma ^*) > d(\pi ,\sigma ^*)\) if and only if \(f(\sigma ) \le f(\pi )\). Let \(P_0\) be a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0) \ge 1\), and spread parameter \(\theta _0\). Then, the operator G always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.
Before presenting the Proof of Lemma 4, let us consider some preliminary ideas about our permutation space \(\varSigma _n\) and how the solutions can be organized and classified according to their description and the Kendall tau distance d. To do so, let us study the Cayley graph described by \((\varSigma _n,d)\) metric space.
Let us denote by CG(V, E) the Cayley graph in which \(V=\varSigma _n\) and
This graph has been studied in [7, 12]. Lemma 2.4 of [12] shows that there are two kinds of cycles formed in \(CG(\varSigma _n,E)\). Because the distance d has the right invariance property, without loss of generality, let us simplify the notation and explain the two possible cycles formed by the adjacent swaps using the identity permutation I as the reference solution. Let us denote by [i] the adjacent transposition that exchanges the elements of the positions i and \(i+1\) (\(i = 1,\dotsc ,n-1\)). For example, \([i] \circ I\) represents the solution such that elements of the positions i and \(i+1\) from I are swapped \(\left( [i] \circ I = (1 \cdots i+1\ i\ \cdots n) \right) \). Analogously, let us consider a second adjacent transposition [j].

If \([j] \circ [i] \circ I = [i] \circ [j] \circ I\), then there is a unique 4-cycle in \(CG(\varSigma _n,E)\) passing through I, \([i] \circ I\) and \([j] \circ I\). Moreover, the 4-cycle is formed by the following solutions:
$$\begin{aligned}\{ I,\ [i] \circ I,\ [j] \circ [i] \circ I,\ [j] \circ I \}. \end{aligned}$$ 
If \([j] \circ [i] \circ I \ne [i] \circ [j] \circ I\), then \([i] \circ [j] \circ [i] \circ I = [j] \circ [i] \circ [j] \circ I\) and there is a unique 6-cycle in \(CG(\varSigma _n,E)\) passing through I, \([i] \circ I\) and \([j] \circ I\). Moreover, the 6-cycle is formed by the following solutions:
$$\begin{aligned}&\{ I,\ [i] \circ I,\ [j] \circ [i] \circ I,\ [i] \circ [j] \circ [i] \circ I,\\&[i] \circ [j] \circ I,\ [j] \circ I \}. \end{aligned}$$
By the definition of the generation of the cycles, the distances among the solutions of the same cycle are minimal. That is to say, the distance between two solutions of the same cycle is the number of edges between both solutions in the cycle.
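The two kinds of cycles can be generated explicitly for \(n=4\). This is a minimal sketch in which a permutation is stored as the tuple of elements by position, so that [i] exchanges the entries at positions i and \(i+1\); the helper names `swap` and `is_closed_cycle` are illustrative.

```python
def swap(i, perm):
    """Adjacent transposition [i]: exchange the entries at positions i and i+1 (1-based)."""
    q = list(perm)
    q[i - 1], q[i] = q[i], q[i - 1]
    return tuple(q)


def is_edge(a, b):
    """True when b is obtained from a by one adjacent transposition."""
    return any(swap(i, a) == b for i in range(1, len(a)))


def is_closed_cycle(cycle):
    """True when consecutive solutions, including last and first, are joined by an edge."""
    return all(is_edge(cycle[k], cycle[(k + 1) % len(cycle)]) for k in range(len(cycle)))


I = (1, 2, 3, 4)

# Disjoint swaps [1] and [3] commute and close a 4-cycle.
commute = swap(3, swap(1, I)) == swap(1, swap(3, I))
four_cycle = [I, swap(1, I), swap(3, swap(1, I)), swap(3, I)]

# Overlapping swaps [1] and [2] do not commute but satisfy
# [1][2][1] = [2][1][2], which closes a 6-cycle.
braid = swap(1, swap(2, swap(1, I))) == swap(2, swap(1, swap(2, I)))
six_cycle = [
    I,
    swap(1, I),
    swap(2, swap(1, I)),
    swap(1, swap(2, swap(1, I))),
    swap(1, swap(2, I)),
    swap(2, I),
]
print(four_cycle)
print(six_cycle)
```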
The next observation is that considering any 4-cycle, a partition of \(\varSigma _n\) in 4 sets can be defined.
Without loss of generality, let us discuss the particular case \(\pi _1=I\); the same arguments can be applied to any other cycle. If \(\pi _1 = I\), then \(\pi _2 = (\cdots i+1\ i\ \cdots j\ j+1\ \cdots )\); \(\pi _3 = (\cdots i\ i+1\ \cdots j+1\ j\ \cdots )\); and \(\pi _4 = ( \cdots i+1\ i\ \cdots j+1\ j\ \cdots )\). In order to simplify the notation, the solutions of the 4-cycle can be classified according to the relative positions of the couple i and \(i+1\) and the couple j and \(j+1\). So, a partition \(\{S_1,S_2,S_3,S_4\}\) of \(\varSigma _n\) is defined as follows:
It is evident that the partition is well-defined. Moreover, among these 4 sets, for each pair of sets a bijection can be described:
such that
An important property of this defined partition is that if \(\sigma \in S_1\), then \(d(\pi _1,\sigma )< d(\pi _2,\sigma )=d(\pi _3,\sigma ) < d(\pi _4,\sigma )\) is fulfilled and analogously with the solutions of the sets \(S_2\), \(S_3\) and \(S_4\).
The previous idea can be repeated with two nondisjoint adjacent swaps, forming a 6-cycle and defining a partition of \(\varSigma _n\) in 6 sets, and for any cycle. In addition, we can extend the idea by using just one adjacent swap. In this last case, we can define a partition of \(\varSigma _n\) in two sets and a bijection between the sets, according to the relative position of the two elements permuted by the swap. This property is the main argument of the Proof of Lemma 4.
Once we know how the solutions are organized in \((\varSigma _n,d)\) metric space, Lemma 4 is proved by induction as follows. For any solution \(\tau \notin C(\sigma ^*,\sigma _0)\), there exists another solution \(\rho _1 \in \varSigma _n\) such that \(\rho _1\) is “closer” to \(\sigma ^*\) and \(\sigma _0\) than \(\tau \) and fulfills the following inequality:
In this way, the argument can be applied for all the solutions not included in \(C(\sigma ^*,\sigma _0)\) and, therefore, for any solution \(\tau \notin C(\sigma ^*,\sigma _0)\), there is a solution \(\rho \in C(\sigma ^*,\sigma _0)\) such that \(\rho \) fulfills Inequality (19) with regard to \(\tau \).
Proof
For any \(\tau \notin C(\sigma ^*,\sigma _0)\), there are two possible cases: (1) there is a solution \(\rho _1\) such that \(d(\tau ,\rho _1)=1\), \(d(\tau ,\sigma ^*)=d(\rho _1,\sigma ^*)+1\) and \(d(\tau ,\sigma _0)=d(\rho _1,\sigma _0)+1\) and (2) there is no such solution \(\rho _1\).
In the first case, if i and j are the elements swapped in the adjacent transposition between \(\tau \) and \(\rho _1\), it means that any solution of \(C(\sigma ^*,\sigma _0)\) keeps the same relative order between the elements i and j as \(\rho _1\) does. So,
Let us consider the following bijection.
such that \(\sigma _{\tau }(i)=\sigma _{\rho }(j),\ \sigma _{\tau }(j)=\sigma _{\rho }(i)\) and \(\sigma _{\tau }(k)=\sigma _{\rho }(k)\), for any \(k \ne i,j \). According to the relative position of i and j, \(\sigma _\rho \) is closer to \(\sigma ^*\) and \(\sigma _0\) than \(\sigma _\tau \) and therefore, \(p^S(\sigma _{\rho }) > p^S(\sigma _{\tau })\) is achieved. Consequently, Inequality (20) is obtained.
In the second case, let us suppose that there are no swaps from \(\tau \) that decrease the distance to \(\sigma ^*\) and \(\sigma _0\) at the same time. First, let us consider an adjacent swap [i] from \(\tau \) that reduces the distance to \(\sigma ^*\). Let us denote \(\rho ' = [i] \circ \tau \). Therefore, similar to the first case, a bijection can be defined according to the relative position of the elements in the positions i and \(i+1\) in \(\tau \). Analogously, let us consider a second swap [j] from \(\tau \) that reduces the distance to \(\sigma _0\), denote \(\rho ''= [j] \circ \tau \) and define a bijection for the elements positioned at j and \(j+1\) in \(\tau \). The transpositions [i] and [j] define a unique cycle passing through \(\tau \). Moreover, by definition of the swaps and the segment \(C(\sigma ^*,\sigma _0)\) and the bijections defined in (18), this situation can only happen when the swaps \((i\ i+1)\) and \((j\ j+1)\) are not disjoint, which implies that the formed cycle has length 6. Besides, this cycle also implies that if we denote by \(\rho _{\tau }\) the furthest solution of the cycle from \(\tau \), then \(\rho _{\tau }\) is closer to \(\sigma ^*\) and \(\sigma _0\) at the same time than \(\tau \). Figure 2 presents the unique possible scenario. Hence, \(d(\sigma ^*,\rho _\tau )+d(\rho _\tau ,\sigma _0) < d(\sigma ^*,\tau )+d(\tau ,\sigma _0)\).
Let us rewrite the sum \(\sum _{\pi \in \varSigma _n}d(\pi , \tau )p^S(\pi )\):
We distribute the sums in two groups, depending on whether or not \(d(\pi ,\rho ')=d(\pi ,\rho '')\).
To prove that the sum in the first square brackets is positive, note that for a solution \(\pi \in \varSigma _n\), if \(d(\pi , \rho ') = d(\pi , \rho '') < d(\pi , \tau )\), then \(d(\pi ,\rho _\tau ) < d(\pi ,\tau )\). So, if we denote by \((i\ i+1\ i+2)\) the set of elements which are permuted in the 6-cycle, we define the following bijection:
such that \( \sigma _{\tau }(i)=\sigma _{\rho }(i+2),\ \sigma _{\tau }(i+2)=\sigma _{\rho }(i)\) and \(\sigma _{\tau }(k)=\sigma _{\rho }(k)\), for any \(k \ne i,i+2\). Therefore, a correspondence between both sets is shown, and by the definition of the sets, \(p^S(\sigma _\rho ) > p^S(\sigma _\tau )\) is obtained for all \(\sigma _\tau \in S_\tau \).
For the second square bracket, if \(d(\pi , \rho ')< d(\pi , \tau ) < d(\pi , \rho '')\), then \(d(\pi ,\tau )=d(\pi , \rho ') +1 = d(\pi , \rho '')-1\), and if \(d(\pi , \rho ')> d(\pi , \tau ) > d(\pi , \rho '')\), then \(d(\pi ,\tau )=d(\pi , \rho ') -1 = d(\pi , \rho '')+1\). So, the second square bracket of (21) can be rewritten in the following way:
Therefore, depending on \(\theta _0\), it can be ensured that
Consequently, there is a solution \(\rho _1 \in \{\rho ',\rho ''\}\) such that
So, for \(\tau \notin C(\sigma ^*,\sigma _0)\), there exists a solution \(\rho _1 \in \varSigma _n\) such that \(d(\rho _1,\tau )=1\) and \(\rho _1\) fulfills Inequality (19). If \(\rho _1 \notin C(\sigma ^*,\sigma _0)\), then by the same arguments, there exists another solution \(\rho _2 \in \varSigma _n\) such that \(d(\rho _1,\rho _2)=1\) and \(\rho _2\) fulfills Inequality (19) with regard to \(\rho _1\), and so on. Because \(\tau \notin C(\sigma ^*,\sigma _0)\), at least one induction step must fulfill the first situation explained in this proof (fulfilling Inequality (20)). Consequently, \(\rho _i\) is a solution from \(C(\sigma ^*,\sigma _0)\) such that it is a better estimator than \(\rho _1,\dotsc ,\rho _{i-1}\) and \(\tau \). \(\square \)
Lemma 4 shows us that the algorithm estimates central permutations from the set \(C(\sigma ^*,\sigma _0)\). Bear in mind that during the Proof of Lemma 4, the particular expression of f has not been used. Therefore, for our particular case, we can deduce Corollary 1.
Corollary 1
Let f be a needle in a haystack function centered at \(\sigma ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\). Then, the EDA always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.
Proof
When f is a needle in a haystack function, then \(f(\sigma ) < f(\sigma ^*)\) for any \(\sigma \ne \sigma ^*\). Hence, the conditions of Lemma 4 are fulfilled. \(\square \)
To summarize, the operator G ends in a nondegenerate fixed point or in the degenerate distribution centered at \(\sigma ^*\). The nondegenerate fixed points are centered at solutions \(\sigma \) such that \(d(\sigma ^*,\sigma ) < D/2\). In addition, when the algorithm estimates a solution different from \(\sigma _0\), the learned central estimator is a solution from \(C(\sigma ^*,\sigma _0) \backslash \{ \sigma _0 \}\).
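To make the set \(C(\sigma ^*,\sigma _0)\) concrete, the following minimal sketch enumerates it by brute force. It assumes that d is the Kendall tau distance and that the segment consists of the permutations \(\tau \) with \(d(\sigma ^*,\tau )+d(\tau ,\sigma _0)=d(\sigma ^*,\sigma _0)\); both the function names and this reading of the definition are assumptions made for illustration. The permutations are those used later in Example 1.

```python
from itertools import combinations, permutations

def kendall_tau(a, b):
    """Number of index pairs that the permutations a and b order differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)

def segment(sigma_star, sigma_0):
    """Assumed definition: all tau lying on a shortest path between the two."""
    n = len(sigma_star)
    total = kendall_tau(sigma_star, sigma_0)
    return [tau for tau in permutations(range(1, n + 1))
            if kendall_tau(sigma_star, tau) + kendall_tau(tau, sigma_0) == total]

sigma_star = (1, 2, 3, 4, 5)   # identity, as in Example 1
sigma_0 = (2, 1, 5, 4, 3)      # initial central permutation of Example 1
C = segment(sigma_star, sigma_0)
print(kendall_tau(sigma_star, sigma_0), len(C))
```

Under this reading, a permutation belongs to the segment exactly when it agrees with \(\sigma ^*\) and \(\sigma _0\) on every pair that both of them order in the same way, which is why the brute-force filter above is sufficient for small n.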
All the results of Sects. 4.1, 4.2 and 4.3 are briefly shown in Table 1. In the first column, the section is shown. In the second and third columns, the initial parameters of \(P_0\) (\(\sigma _0\) and \(\theta _0\)) are described. Finally, in the last column, the explanations of the performance of the algorithm for each situation can be found.
Limiting behavior for a Mallows model function
In this section, the function f to optimize is a Mallows probability distribution with central permutation \(\sigma ^*\) and spread parameter \(\theta ^* >0\), without loss of generality. The Mallows model has been studied as an example of a unimodal objective function in which the quality of a solution depends on its distance to the central permutation. For that reason, we believe that it is a motivating starting point to study unimodal functions. The objective of this section is to analyze the relation between the Mallows probability distributions learned by our dynamical system and the objective function. In Sect. 5.1, the initial probability distribution \(P_0\) is a uniform distribution and the procedure of the algorithm at each iteration is analyzed. In Sect. 5.2, \(P_0\) is a Mallows probability distribution centered at \(\sigma \ne \sigma ^*\). In this situation, the fixed points of the algorithm and the convergence behavior of the algorithm are studied, in a similar way as in Sect. 4.3.
\(P_0\) a uniform initial probability distribution
In this section it is proved that when the initial probability distribution and the fitness functions are Mallows models centered at the same solution the algorithm converges to the degenerate distribution centered at the optimum. The obtained result is summarized in the following lemma.
Lemma 5
Let f be a Mallows model centered at \(\sigma ^*\) and spread parameter \(\theta ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _0 \ge 0\). Then, the proposed EDA always converges to the degenerate distribution centered at \(\sigma ^*\).
Proof
For this particular scenario, we have studied how the algorithm performs at each iteration, analogously to Sect. 4.1. Let us start the proof with the case in which \(P_0\) is a uniform distribution. First, in order to calculate \(P_1=G(P_0)\), let us calculate \(P_0^S\).
Bear in mind that the 2-tournament does not consider the exact function values of the solutions. In other words, by the definition of the Mallows probability distribution, a solution is selected more often if it is closer to \(\sigma ^*\), and to study the selection between two solutions, their distances to \(\sigma ^*\) are compared. With this property in mind, we can rewrite Eq. (3) in the following way: for any iteration of the algorithm i,
The next step is to estimate the central permutation and spread parameter from \(P_0^S\) to learn \(P_1\). First, to estimate \(\sigma _0\), let us order the solutions increasingly according to their distance from \(\sigma ^*\). Remember that two solutions have the same probability to be selected if they are at the same distance from \(\sigma ^*\). For any \(\sigma \in \varSigma _n\),
where \({\tilde{\sigma }}_d\) denotes a solution at distance d from \(\sigma ^*\): \(d({\tilde{\sigma }}_d,\sigma ^*)=d\).
By Eq. (22), \(p_0^S({\tilde{\sigma }}_{0})> p_0^S({\tilde{\sigma }}_{1})> \cdots > p_0^S({\tilde{\sigma }}_{D})\). So, by Eq. (4), the maximum likelihood estimator must minimize \(\sum _{\pi \in \varSigma _n} d(\pi ,{\hat{\sigma }}_0)\cdot p_0^S(\pi )\), knowing that the selection probabilities are ordered according to their distance to \(\sigma ^*\) (the lower the distance from \(\sigma ^*\) to \(\pi \), the higher the value \(p^S(\pi )\) is). For that reason, the maximum likelihood estimator of \(\sigma _0\) is \(\sigma ^*\), and consequently, \(P_1\) follows a Mallows model with central permutation \(\sigma ^*\) and a positive spread parameter \(\theta _1\), as a consequence of Lemma 1.
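The argument above can be checked numerically for a small n. The sketch below assumes that the expected 2-tournament selection of Eq. (3) takes the usual form \(p^S(\sigma ) = p(\sigma )\,\bigl (2\sum _{f(\pi )<f(\sigma )}p(\pi ) + \sum _{f(\pi )=f(\sigma )}p(\pi )\bigr )\), and it replaces the Mallows fitness by the negative distance to \(\sigma ^*\), which induces the same ranking for any \(\theta ^*>0\). The function names are hypothetical.

```python
from itertools import combinations, permutations

def kendall_tau(a, b):
    """Number of index pairs that the permutations a and b order differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)

def tournament_selection(p, rank):
    """Expected 2-tournament (assumed form): a solution wins against worse
    solutions in either draw order, and ties are split evenly."""
    selected = {}
    for s in p:
        worse = sum(p[t] for t in p if rank[t] < rank[s])
        equal = sum(p[t] for t in p if rank[t] == rank[s])
        selected[s] = p[s] * (2.0 * worse + equal)
    return selected

n = 4
sols = list(permutations(range(1, n + 1)))
sigma_star = sols[0]                                   # identity permutation
rank = {s: -kendall_tau(s, sigma_star) for s in sols}  # same order as a Mallows f
p0 = {s: 1.0 / len(sols) for s in sols}                # uniform P_0
p_sel = tournament_selection(p0, rank)

# Selection probability of one solution per distance class d = 0, ..., D.
per_distance = {-rank[s]: p_sel[s] for s in sols}
ordered = [per_distance[d] for d in sorted(per_distance)]

# Central permutation: minimiser of the expected distance under p_sel.
center = min(sols, key=lambda c: sum(kendall_tau(pi, c) * p_sel[pi] for pi in sols))
print(ordered, center)
```

For this small instance, the selection probabilities decrease with the distance to \(\sigma ^*\) and the minimizer of the expected distance is \(\sigma ^*\) itself, in agreement with the reasoning of the proof.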
The previous arguments can be used for any iteration. Hence, \(P_i\) is a Mallows model with central permutation \(\sigma ^*\) and spread parameter \(\theta _i > 0\), for any \(i \in {\mathbb {N}}\). In order to see the evolution of the algorithm and the convergence behavior, let us prove that \(\theta _i\) increases at each iteration. To this end, the difference between the values of the left-hand side of Eq. (5) in two consecutive iterations is analyzed: \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_i^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_{i+1}^S(\pi )\). By the same arguments used in Sect. 4.1, the equality \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_i^S(\pi ) = \sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)\cdot p_{i+1}(\pi )\) is obtained. Let us use the sequence \(m_n(0),m_n(1),\dotsc ,m_n(D)\) given in Definition 2 and simplify the notation of the probabilities. By definition of the selection operator, for any \(\sigma \in \varSigma _n\) such that \(d(\sigma ,\sigma ^*)=d\), \(p^S(\sigma )\) can be rewritten in the following way:
Hence,
Let us define the function h:
For any \(\theta \ge 0\), \(h(\theta )\) is a negative value (see the proof of Proposition 5 in Appendix C). Consequently,
and due to the fact that the function g defined in Eq. (6) is a strictly decreasing function over \(\theta \), we obtain \(\theta _{i+1} > \theta _{i}\).
Therefore, after applying our modeling, departing from a uniform distribution, to a function defined as a Mallows model, the algorithm converges to the degenerate distribution centered at \(\sigma ^*\). \(\square \)
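The monotone growth of \(\theta _i\) can be illustrated with a short numerical sketch. It works with the distance classes only, using the counts \(m_n(d)\) (the number of permutations at Kendall tau distance d from a fixed permutation), and it assumes the same expected 2-tournament form as in the proof, with ties between solutions at the same distance split evenly. The spread parameter is recovered by bisection from the condition that the model mean distance equals the mean distance after selection; the function names and the bisection bounds are assumptions of this sketch.

```python
import math

def distance_counts(n):
    """m_n(d) for d = 0..D: coefficients of prod_{k=1..n} (1 + x + ... + x^(k-1))."""
    counts = [1]
    for k in range(2, n + 1):
        new = [0] * (len(counts) + k - 1)
        for d, c in enumerate(counts):
            for shift in range(k):
                new[d + shift] += c
        counts = new
    return counts

def class_masses(theta, m):
    """Probability mass of each distance class under MM(sigma*, theta)."""
    w = [c * math.exp(-theta * d) for d, c in enumerate(m)]
    z = sum(w)
    return [x / z for x in w]

def mean_distance(theta, m):
    return sum(d * q for d, q in enumerate(class_masses(theta, m)))

def selected_mean_distance(theta, m):
    """Mean distance to sigma* after the expected 2-tournament selection."""
    q = class_masses(theta, m)
    total, farther = 0.0, 1.0
    for d, qd in enumerate(q):
        farther -= qd                      # mass strictly farther than d
        total += d * qd * (qd + 2.0 * farther)
    return total

def fit_theta(target, m, lo=0.0, hi=50.0):
    """Bisection on the strictly decreasing map theta -> mean_distance(theta)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_distance(mid, m) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

m = distance_counts(4)
thetas = [0.0]                             # uniform P_0
for _ in range(6):
    thetas.append(fit_theta(selected_mean_distance(thetas[-1], m), m))
print(thetas)
```

In this sketch the printed sequence is strictly increasing, which is the behavior established by the inequality \(\theta _{i+1} > \theta _i\).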
\(P_0\) a Mallows probability distribution with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\)
The algorithm can experience many different behaviors depending on \(\sigma ^*\) and \(\sigma _0\). However, there are groups of different central permutations \(\sigma _0\) such that the algorithm behaves analogously. The analogy of the analysis with different central permutations can be obtained by means of symmetry among the solutions of \(\varSigma _n\). Due to the difficulty of studying all of them, we have worked in a similar way as in Sect. 4.3. In Sect. 5.2, the following proof idea is used:

(i) In Sect. 5.2.1, the fixed points and their attraction are calculated. First, it is observed that any degenerate distribution is a fixed point. Then, the equations that any nondegenerate fixed point must fulfill are calculated.

(ii) In Sect. 5.2.2, the convergence behavior of the algorithm is explained and an example is shown.
A summary of all the results obtained in Sect. 5 is shown in Table 2 at the end of this section.
Fixed points of the algorithm and their attraction
The case \(n=2\) will not be explained because of its simplicity. From now on, let us suppose that \(n \ge 3\) and study the fixed points of our discrete dynamical system G. As in Sect. 4.3.1, let us start by observing that any degenerate distribution is a fixed point of the discrete dynamical system G, so let us focus on the nondegenerate fixed points.
For any Mallows probability distribution P, \(G(P)=P\) if and only if the estimated central permutation and spread parameter are the same as those of P. So, if Eq. (11) is fulfilled, then P is a nondegenerate fixed point. Let us study the equality of Eq. (11). We say that P is a candidate fixed point if it satisfies Eq. (10). Note that if P is a candidate fixed point, then \({\hat{\theta }} = \theta \).
As can be observed, Eq. (23) shows the first condition for a Mallows probability distribution P centered at \(\sigma _0\) to be a fixed point. Equation (23) has at least one solution \(\theta \) (depending on n, \(\sigma ^*\) and \(\sigma _0\), it may have more than one). One way to calculate the number of candidate fixed points centered at \(\sigma _0\) is to count the number of roots of Eq. (23) by Sturm’s theorem [27]. The exponential polynomial in \(\theta \in [0,+\infty )\) can be transformed into a polynomial defined in (0, 1] (transforming \(e^{-\theta }=x\)) in order to apply Sturm’s theorem. Moreover, the roots can be computed numerically to find the values of \(\theta \) for which \(P \sim \) \(\hbox {MM}({\sigma _0},{\theta })\) are candidate fixed points.
Moreover, for any pair of permutations \(\pi ,\tau \) (w.l.o.g., \(d(\tau ,\sigma ^*) > d(\pi ,\sigma ^*)\)), if we choose the pair of permutations \(I'\pi ,I'\tau \) where \(I'=(n\ n-1\ \cdots \ 1)\), the following similarities can be observed:
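The root-counting step can be sketched as follows. The code builds a Sturm chain with exact rational arithmetic and counts the distinct real roots of a polynomial in a half-open interval. Since the coefficients of Eq. (23) depend on n, \(\sigma ^*\) and \(\sigma _0\), the cubic used here is a hypothetical stand-in with known roots, not the polynomial of the paper; the function names are also assumptions. The sketch presumes a square-free polynomial of degree at least 1 whose interval endpoints are not roots.

```python
from fractions import Fraction
import math

def trim(p):
    """Drop trailing zero coefficients (coefficients are stored low degree first)."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def remainder(a, b):
    """Remainder of the polynomial division a / b."""
    a, b = trim(a), trim(b)
    while a and len(a) >= len(b):
        factor = a[-1] / b[-1]
        shift = len(a) - len(b)
        for k, c in enumerate(b):
            a[shift + k] -= factor * c
        a = trim(a)
    return a

def sturm_chain(p):
    """p, p', and then the negated remainders of consecutive divisions."""
    p = trim([Fraction(c) for c in p])
    chain = [p, trim([k * c for k, c in enumerate(p)][1:])]
    while True:
        r = remainder(chain[-2], chain[-1])
        if not r:
            return chain
        chain.append([-c for c in r])

def evaluate(p, x):
    value = Fraction(0)
    for c in reversed(p):                  # Horner scheme
        value = value * x + c
    return value

def sign_changes(chain, x):
    signs = [v > 0 for v in (evaluate(q, x) for q in chain) if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def count_roots(p, a, b):
    """Number of distinct real roots of a square-free p in (a, b]."""
    chain = sturm_chain(p)
    return sign_changes(chain, Fraction(a)) - sign_changes(chain, Fraction(b))

# Hypothetical example: 8x^3 - 22x^2 + 13x - 2 = 8 (x - 1/4)(x - 1/2)(x - 2).
poly = [-2, 13, -22, 8]
roots_in_unit = count_roots(poly, 0, 1)
# Each root x in (0, 1] corresponds to a spread parameter theta = -ln(x) >= 0.
thetas = [-math.log(x) for x in (0.25, 0.5)]
print(roots_in_unit, thetas)
```

Only the roots with \(x \in (0,1]\) are relevant after the substitution \(x=e^{-\theta }\); the third root of the toy cubic, \(x=2\), would correspond to a negative \(\theta \) and is correctly excluded by the count.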
Hence,
Therefore, for any \(\sigma _0 \in \varSigma _n\), let us define the function H as follows:
By Eq. (24), \(H_i = H_{2D-i}\). In addition, \(H(\sigma _0,\theta )=H(I'\sigma _0,\theta )\) for any \(\sigma _0 \in \varSigma _n\) and \(\theta \). Consequently, \(H(\sigma _0,{\hat{\theta }})=0\) if and only if \(H(I'\sigma _0,{\hat{\theta }})=0\). So, if P is a candidate fixed point with central permutation \(\sigma _0\) and spread parameter \({\hat{\theta }}\), then a Mallows probability distribution with central permutation \(I'\sigma _0\) and spread parameter \({\hat{\theta }}\) is a candidate fixed point as well.
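The symmetry induced by the reverse permutation can be explored numerically. The sketch below does not reproduce the similarities of the paper; it only checks two standard properties of the Kendall tau distance, under the assumption that \(I'\pi \) acts by reversing the values of \(\pi \), that is, \((I'\pi )(i) = n+1-\pi (i)\). This composition convention and the function names are assumptions.

```python
from itertools import combinations, permutations

def kendall_tau(a, b):
    """Number of index pairs that the permutations a and b order differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)

def reverse_values(pi):
    """Assumed action of I' = (n n-1 ... 1): replace each value v by n + 1 - v."""
    n = len(pi)
    return tuple(n + 1 - v for v in pi)

n = 4
D = n * (n - 1) // 2                       # maximum Kendall tau distance
sols = list(permutations(range(1, n + 1)))

# Applying I' to both arguments preserves the distance.
distance_preserved = all(
    kendall_tau(reverse_values(p), reverse_values(t)) == kendall_tau(p, t)
    for p in sols for t in sols)

# Applying I' to one argument complements the distance with respect to D.
distance_complemented = all(
    kendall_tau(reverse_values(p), t) == D - kendall_tau(p, t)
    for p in sols for t in sols)

print(distance_preserved, distance_complemented)
```

Under this convention, distances between pairs of solutions are unchanged when both are transformed, while distances to a fixed reference such as \(\sigma ^*\) are mirrored around D/2, which is consistent with the pairing of \(\sigma _0\) and \(I'\sigma _0\) described above.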
In addition, from the previous observation, it has been equivalently shown that
and analogous for the opposite inequality. So, when \(\theta \) tends to infinity, the highest exponential coefficient of \(H(\sigma _0,\theta )\) determines if the value is positive or not.
Considering all the observations of Eq. (23) and Inequality (26), in comparison with the results from Sects. 4.3.1 and 4.3.2, some new scenarios have been observed. First, for a fixed \(\sigma _0\), there can be more than one candidate fixed point. Hence, the algorithm can converge to more than one probability distribution centered at \(\sigma _0\). Moreover, from Eq. (25), similarities between \(\sigma _0\) and \(I'\sigma _0\) have been observed. Secondly, information about the attraction of the candidate fixed points has been obtained, regardless of whether they are fixed points or not. Inequality (26) allows us to study whether or not the degenerate distribution centered at \(\sigma _0\) is an attractive fixed point. Furthermore, knowing the attraction of the degenerate distribution, the attraction of all the candidate fixed points is completely determined: when the candidate fixed points centered at \(\sigma _0\) are ordered according to their spread parameters, their attraction alternates, so that no two consecutive candidate fixed points have the same attraction. Consequently, the remaining objective is to determine when a candidate fixed point is a fixed point.
To study if a candidate fixed point is a fixed point, it is necessary to observe if the estimated central permutation \({\hat{\sigma }}_0\) from a candidate fixed point P centered at \(\sigma _0\) is exactly \(\sigma _0\). So as to obtain the same central permutation, the inequality of Eq. (11) must be fulfilled at the solution \({\hat{\theta }}\) of Eq. (23) (assuming the uniqueness of the central permutation). Hence, for all \(\sigma \ne \sigma _0\),
Inequality (27) shows us the condition to estimate \(\sigma _0\) as the learned central permutation. Even though its terms can be completely separated according to their dependence on the distance from \(\sigma ^*\), no general rule has been found that tells us in advance, without knowing the particular values of the probabilities and distances, whether Inequality (27) is fulfilled or not. Actually, some experimental results show that there are candidate fixed points which do not fulfill Inequality (27).
In Fig. 3, an example of the attraction of the fixed points is shown for \(n=5\). The abscissa shows \(\sigma _0\), numerically indexed according to their distance to \(\sigma ^*\), and the ordinate represents the values of \(\theta _0\) which fulfill Eq. (23). Therefore, each dot represents a candidate fixed point. The color of the dot represents if the candidate fixed point is a fixed point or not, and its attraction. For any \(\sigma _0\) central permutation, the degenerate fixed points have been illustrated.
To summarize, Inequality (27) ensures exactly which candidates are the fixed points of our dynamical system.
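For small n, whether \(\sigma _0\) is re-estimated as the central permutation can be checked by applying one iteration of the operator G by brute force. The sketch below assumes the expected 2-tournament selection used throughout (with ties between solutions at the same distance from \(\sigma ^*\) split evenly), estimates the central permutation as the minimizer of the expected Kendall tau distance, and fits the spread parameter by bisection on the model mean distance. The function g_step and its bisection bounds are hypothetical choices for this illustration.

```python
import math
from itertools import combinations, permutations

def kendall_tau(a, b):
    """Number of index pairs that the permutations a and b order differently."""
    return sum(1 for i, j in combinations(range(len(a)), 2)
               if (a[i] - a[j]) * (b[i] - b[j]) < 0)

def g_step(sigma_star, sigma_0, theta_0):
    """One expected iteration: MM(sigma_0, theta_0) -> selection -> new estimates."""
    n = len(sigma_0)
    sols = list(permutations(range(1, n + 1)))
    # Current Mallows model MM(sigma_0, theta_0).
    w = [math.exp(-theta_0 * kendall_tau(s, sigma_0)) for s in sols]
    z = sum(w)
    p = [x / z for x in w]
    # Selection: only the ranking by distance to sigma_star matters.
    d_star = [kendall_tau(s, sigma_star) for s in sols]
    mass = [0.0] * (max(d_star) + 1)
    for ps, d in zip(p, d_star):
        mass[d] += ps
    worse = [sum(mass[d + 1:]) for d in range(len(mass))]
    p_sel = [ps * (mass[d] + 2.0 * worse[d]) for ps, d in zip(p, d_star)]
    # Central permutation: minimiser of the expected distance under p_sel.
    def expected_distance(c):
        return sum(q * kendall_tau(s, c) for s, q in zip(sols, p_sel))
    center = min(sols, key=expected_distance)
    target = expected_distance(center)
    # Spread parameter: solve E_theta[d] = target by bisection.
    dists = [kendall_tau(s, center) for s in sols]
    def model_mean(theta):
        ww = [math.exp(-theta * d) for d in dists]
        return sum(d * x for d, x in zip(dists, ww)) / sum(ww)
    lo, hi = 0.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if model_mean(mid) > target:
            lo = mid
        else:
            hi = mid
    return center, (lo + hi) / 2.0, p_sel

sigma_star = (1, 2, 3, 4, 5)
sigma_0 = (2, 1, 5, 4, 3)
center, theta_hat, p_sel = g_step(sigma_star, sigma_0, 3.0)
print(center, theta_hat)
```

With a concentrated initial model such as \(\theta _0 = 3\), more than half of the selected mass remains on \(\sigma _0\), so \(\sigma _0\) is necessarily the minimizer; for smaller values of \(\theta _0\) the same function can be used to observe when the estimated central permutation moves away from \(\sigma _0\).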
Convergence behavior of the algorithm
Before introducing the convergence behavior of the algorithm, let us state Corollary 2, deduced from Lemma 4.
Corollary 2
Let f be a Mallows model centered at \(\sigma ^*\) and spread parameter \(\theta ^*\) and \(P_0\) a Mallows model with central permutation \(\sigma _0\), where \(d(\sigma ^*,\sigma _0)=d^* \ge 1\), and spread parameter \(\theta _0\). Then, the EDA always estimates a solution \(\tau \in C(\sigma ^*,\sigma _0)\) as the central permutation of the learned Mallows model.
Proof
When f is a Mallows model centered at \(\sigma ^*\) and spread parameter \(\theta ^*>0\), for any \(\sigma ,\pi \in \varSigma _n\), \(f(\sigma ) > f(\pi )\) if and only if \(d(\sigma ,\sigma ^*) < d(\pi ,\sigma ^*)\). Hence, the conditions of Lemma 4 are fulfilled. \(\square \)
Once we have Corollary 2 and we know the fixed points and their attraction, the behavior of the algorithm is totally defined and it can be summarized in the following way:

For any \(P_0 \sim \) \(\hbox {MM}({\sigma _0},{\theta _0})\), there exists a spread parameter value \(\theta '(\sigma _0)\) dependent on \(\sigma _0\) such that if \(\theta _0 < \theta '(\sigma _0)\), then Inequality (27) is not fulfilled for all \(\sigma \). In that case, by Corollary 2, the estimated central permutation after one iteration of the algorithm is a solution from \(C(\sigma ^*,\sigma _0)\backslash \{\sigma _0\}\).

If \(\theta _0 > \theta '(\sigma _0)\), then the algorithm estimates \(\sigma _0\) as the central permutation of the learned Mallows model. Let us classify the different possible behaviors according to the number of fixed points centered at \(\sigma _0\):

If there are no nondegenerate solutions centered at \(\sigma _0\) (there are no solutions for Eq. (23)), then the only fixed point centered at \(\sigma _0\) is the degenerate distribution \(1_{\sigma _0}\). In this case, if \(1_{\sigma _0}\) is attractive, the algorithm converges to it; otherwise, the estimated spread parameter decreases until an iteration when \({\hat{\theta }} < \theta '(\sigma _0)\) and, therefore, the estimated central permutation is not \(\sigma _0\) anymore, returning back to the previous situation.

If there are \(i \ge 1\) nondegenerate fixed points centered at \(\sigma _0\), then there exist i spread parameter values \({\tilde{\theta }}_1< \cdots < {\tilde{\theta }}_i\) which solve Eq. (23) and fulfill Inequality (27). Hence, \(\theta '(\sigma _0)\) and \({\tilde{\theta }}_j\) for \(j=1,\dotsc ,i\) divide the interval \((\theta '(\sigma _0), +\infty )\) into \(i+1\) intervals.
Let us denote by \((\theta '(\sigma _0),{\tilde{\theta }}_1)\), \(({\tilde{\theta }}_1,{\tilde{\theta }}_2)\), \(\dotsc \), \(({\tilde{\theta }}_{i-1},{\tilde{\theta }}_i)\) and \(({\tilde{\theta }}_{i},+\infty )\) the \(i+1\) formed intervals; \(P_k\) the nondegenerate fixed point centered at \(\sigma _0\) and spread parameter \({\tilde{\theta }}_{k}\), for \(k=1,\dotsc ,i\); and \(1_{\sigma _0}\) the degenerate fixed point centered at \(\sigma _0\). There are two possible situations, depending on whether \(1_{\sigma _0}\) is attractive or not.
If \(1_{\sigma _0}\) is attractive, then \(P_i\) is not attractive and when \(\theta _0 \in ({\tilde{\theta }}_{i},+\infty )\), the algorithm converges to \(1_{\sigma _0}\). Moreover, because of the non-attraction of \(P_i\) and by the same argument, \(P_{i-1}\) is attractive and \(P_{i-2}\) is not attractive, \(P_{i-3}\) is attractive and \(P_{i-4}\) is not attractive, and so on. Hence, if \(\theta _0 \in ({\tilde{\theta }}_{i-2}, {\tilde{\theta }}_i)\), the algorithm converges to \(P_{i-1}\); if \(\theta _0 \in ({\tilde{\theta }}_{i-4}, {\tilde{\theta }}_{i-2})\), the algorithm converges to \(P_{i-3}\); and so on.
Additionally, if \(1_{\sigma _0}\) is not attractive, then \(P_i\) is attractive and \(P_{i-1}\) is not attractive, and when \(\theta _0 \in ({\tilde{\theta }}_{i-1},+\infty )\), the algorithm converges to \(P_i\). Moreover, \(P_{i-2}\) is attractive and \(P_{i-3}\) is not attractive, and when \(\theta _0 \in ({\tilde{\theta }}_{i-3}, {\tilde{\theta }}_{i-1})\), the algorithm converges to \(P_{i-2}\). And so on.
Observe that when \(P_1\) is not attractive and \(\theta _0 \in (\theta '(\sigma _0),{\tilde{\theta }}_1)\), the algorithm estimates lower spread parameters until \({\hat{\theta }}_0 < \theta '(\sigma _0)\). In this case, the algorithm estimates a new central permutation from \(C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\).
Figure 4 is presented in order to show a visualization of the possible situations. The horizontal line represents the possible \(\theta _0\) value. In each interval, a blue arrow tells us if the estimated spread parameter is higher or lower, and the attraction of each fixed point can be observed. There are four possible cases, depending on the parity of i and the attraction of \(1_{\sigma _0}\). In the first two cases, i is an odd number, and in the first and fourth cases, \(1_{\sigma _0}\) is an attractive fixed point.


If \(\theta _0 = \theta '(\sigma _0)\), then the algorithm can randomly estimate \(\sigma _0\) or another \(\sigma \in C(\sigma ^*,\sigma _0)\) as the new central permutation. In the former case, if the fixed point with the lowest spread parameter centered at \(\sigma _0\) is attractive, the algorithm will converge to it. Otherwise, the algorithm learns a probability distribution centered at \(\sigma _0\) and spread parameter \({\hat{\theta }} < \theta _0\), and it behaves analogously to the case \(\theta _0 < {\hat{\theta }}_0\). In the latter case, the algorithm estimates a new central permutation \(\sigma \) and spread parameter \({\hat{\theta }}\), and it behaves as in the case \(P_0 \sim \hbox {MM}({\sigma },{{\hat{\theta }}})\).
Let us present an example in order to illustrate the behavior described above.
Example 1
Let us consider \(n=5\), f a Mallows model centered at \(\sigma ^*=I\) and \(P_0\) a Mallows probability distribution with central permutation \(\sigma _0=(2 1 5 4 3)\) and spread parameter \(\theta _0\). To observe the behavior of the algorithm, let us calculate the candidate fixed points by Eq. (23) and the minimum spread parameter value \(\theta '(\sigma _0)\) that allows the estimation of \(\sigma _0\) as the learned central permutation, by Inequality (27).
In this particular case, there is only one solution which fulfills Eq. (23): \({\tilde{\theta }} \approx 1.2519\). Moreover, Inequality (27) shows that the equality is obtained when \(\theta '(\sigma _0)\approx 0.2770\). Therefore, a Mallows probability distribution centered at \(\sigma _0\) with spread parameter value \({\tilde{\theta }}\) is a fixed point of our mathematical modeling. In addition, if \(\theta _0 > {\tilde{\theta }}\), then \({\hat{\theta }} < \theta _0\). This last observation implies that the degenerate distribution centered at \(\sigma _0\) is not attractive, and consequently, \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\) is an attractive fixed point. Knowing the attraction of the fixed points, the value of \(\theta _0\) determines the behavior of the algorithm.

If \(\theta _0 < \theta '(\sigma _0)\), then \({\hat{\sigma }}_0 \in C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\). Hence, after one iteration, the algorithm restarts the process with a new central permutation and spread parameter. For example, if \(\theta _0 = 0.2760\), then the learned Mallows model after one iteration of the algorithm is \(\hbox {MM}({(12453)},{0.4016})\); and if \(\theta _0=0.2700\), then the learned Mallows model is \(\hbox {MM}({\sigma ^*},{0.3994})\).

If \(\theta _0 > \theta '(\sigma _0)\), then the algorithm converges to the \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\) distribution.

If \(\theta _0 = \theta '(\sigma _0)\), then the algorithm estimates either \(\sigma _0\) or \({\hat{\sigma }}_0 \in C(\sigma ^*,\sigma _0) \backslash \{\sigma _0 \}\). In the first case, the algorithm converges to \(\hbox {MM}({\sigma _0},{{\tilde{\theta }}})\), whereas in the second case, the algorithm estimates the \(\hbox {MM}({(12453)},{0.4023})\) probability distribution after one iteration.
For any other \(\sigma _0\), the same analysis can be repeated. All the results of Sects. 5.1 and 5.2 are briefly shown in Table 2, mentioning the initial parameters of \(P_0\) and explaining the performance of the algorithm.
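The case analysis of Sect. 5.2.2 can be condensed into a small decision routine. The sketch below takes as given the threshold \(\theta '(\sigma _0)\), the ordered spread parameters of the nondegenerate fixed points centered at \(\sigma _0\), and whether \(1_{\sigma _0}\) is attractive; it then uses the alternation of the attraction to report the limit while the central permutation stays at \(\sigma _0\). The function name, its return labels and the treatment of a start exactly at a fixed point are assumptions of this sketch; the numerical values are those of Example 1.

```python
from bisect import bisect_left

def limit_behavior(theta_0, theta_prime, fixed_thetas, degenerate_attractive):
    """Return 'recentre', 'tie', 'degenerate' or 'P_k' (k-th nondegenerate fixed point)."""
    if theta_0 < theta_prime:
        return "recentre"                  # new centre in C(sigma*, sigma_0) \ {sigma_0}
    if theta_0 == theta_prime:
        return "tie"                       # sigma_0 or another centre may be estimated
    i = len(fixed_thetas)
    if theta_0 in fixed_thetas:
        return "P_%d" % (fixed_thetas.index(theta_0) + 1)   # already at a fixed point

    def attractive(k):
        # k = 1..i are the nondegenerate fixed points, k = i + 1 is 1_{sigma_0};
        # the attraction alternates, starting from the degenerate distribution.
        return ((i + 1 - k) % 2 == 0) == degenerate_attractive

    j = bisect_left(fixed_thetas, theta_0)  # number of fixed points below theta_0
    if attractive(j + 1):                   # the upper end of the interval attracts
        return "degenerate" if j == i else "P_%d" % (j + 1)
    # Otherwise the spread parameter decreases towards the lower end.
    return "recentre" if j == 0 else "P_%d" % j

# Example 1: theta' ~ 0.2770, one fixed point at ~ 1.2519, 1_{sigma_0} not attractive.
example = [limit_behavior(t, 0.2770, [1.2519], False) for t in (0.2, 0.5, 3.0)]
print(example)
```

The label 'recentre' only states that the estimated central permutation leaves \(\sigma _0\); the subsequent behavior then depends on the new central permutation and spread parameter, to which the same routine would be applied with their own threshold and fixed points.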
Conclusions and future work
We have presented a mathematical modeling to study an EDA based on Mallows models using discrete dynamical systems based on the expectations. Under this framework, we have studied the convergence behavior of the algorithm for several objective functions and initial probability distributions. Two different approaches have been followed to study the convergence behavior. For the simplest cases, the computation of one iteration of the algorithm has been enough to prove the limit behavior, whereas for the most complex cases, the fixed points of the algorithm and their attraction have been analyzed. Overall, for the latter, a wide range of possible ending probability distributions and trajectories for the algorithm have been observed, which, given its practical success [5], were by no means anticipated.
The main results can be summarized as follows. When the function to optimize is constant, all Mallows probability distributions are fixed points. When the function to optimize is a needle in a haystack function centered at \(\sigma ^*\) and the initial probability distribution is a Mallows distribution centered at \(\sigma _0\), the algorithm converges to the degenerate distribution centered at \(\sigma ^*\) or to a nondegenerate Mallows distribution centered at a permutation \(\sigma \) in the segment between \(\sigma ^*\) and \(\sigma _0\) such that the distance between \(\sigma \) and \(\sigma ^*\) is lower than \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\) and a spread parameter which fulfills the condition to be an (attractive) fixed point. Finally, when the function to optimize is a Mallows model centered at \(\sigma ^*\) and the initial probability distribution is a Mallows distribution centered at \(\sigma _0\), the algorithm can converge to any Mallows distribution centered at a permutation in the segment between \(\sigma ^*\) and \(\sigma _0\) that is an attractive fixed point. The attraction of all the fixed points provides information in relation to the possible trajectories of the algorithm. In any case, the relation between the initial probability distribution and the objective function completely determines the convergence behavior of the algorithm. Because of that, a classification of the convergence behavior of the algorithm regarding the parameters of the Mallows model is shown.
Although the behavior of the presented algorithm with finite population can be different from that predicted from the expectations, the variety of scenarios shown in the presented modeling shows the complexity of predicting the limit distributions of finite-population EDAs. For a first comparison between the algorithm with finite and infinite populations, an EDA with Mallows model using finite populations and the Borda count [11] to estimate the central permutation \(\sigma _0\) could be applied and their performances contrasted. In addition, it would be interesting to study whether other permutation-based EDAs or distance-based models achieve better convergence results, with fewer undesirable solutions.
The results obtained in this work have been so unexpected that they encourage us to carry out new studies. We propose several future works to obtain better solutions in practice and to suggest how the runtime analysis can be performed. For example, according to the obtained solutions, the central permutation of the initial probability distribution determines which probability distributions can be learnt by the algorithm at each iteration. Then, for practical purposes, we propose a careful choice of the initial population. For example, a logical proposal is to generate individuals that are as far as possible from each other, expanding the initial search of the optimal solution. This proposal can be compared with the initialization presented in [5, 31], in which the authors apply a preliminary step so as to guide the algorithm to find the optimal solution. On the other hand, if we are interested in the runtime analysis of the algorithm, it is important to take into account some knowledge that emerges from our analysis. We have observed that the estimated spread parameter value at each iteration of the algorithm can be critical. When the change in the estimated spread parameter value is big, the algorithm presents several scenarios in which the learned probability distribution can be significantly different (because the estimated central permutation is different in each case, for example) and the probability of sampling the optimal solution depends on it. On the contrary, when the change in the estimated spread parameter value is small, if the central permutation is not the optimal solution, the probability of reaching it will exponentially decrease with the spread parameter value. This observation may allow us to estimate the number of iterations required by the algorithm to converge to a model and when the researchers should modify the algorithm to escape from the expected tendency of the algorithm.
Another analysis we propose is, starting from different initial probability distributions, to check if there exists a number of iterations that ensures the probability to sample the optimal solution is higher than a value and to track the probability at each iteration. With the obtained results, we could connect them with the presented results in the literature for binary EDAs and observe the similarities and differences between them.
References
Ali A, Meilă M (2012) Experiments with Kemeny ranking: What works when? Math Soc Sci 64(1):28–40
Baluja S (1994) Population-based incremental learning. A method for integrating genetic search based function optimization and competitive learning. Tech rep
Bartholdi J, Tovey CA, Trick MA (1989) Voting schemes for which it can be difficult to tell who won the election. Soc Choice Welfare 6(2):157–165
Blickle T, Thiele L (1996) A comparison of selection schemes used in evolutionary algorithms. Evol Comput 4(4):361–394
Ceberio J, Irurozki E, Mendiburu A, Lozano JA (2014) A distance-based ranking model estimation of distribution algorithm for the flowshop scheduling problem. IEEE Trans Evol Comput 18(2):286–300
Ceberio J, Mendiburu A, Lozano JA (2011) Introducing the Mallows model on estimation of distribution algorithms. In: International Conference on Neural Information Processing, Springer, pp 461–470
van De Vel ML (1993) Theory of Convex Structures, vol 50. Elsevier, Amsterdam
Devroye L, Györfi L, Lugosi G (2013) A Probabilistic Theory of Pattern Recognition, vol 31. Springer Science & Business Media, Berlin
Echegoyen C, Mendiburu A, Santana R, Lozano JA (2013) On the taxonomy of optimization problems under estimation of distribution algorithms. Evol Comput 21(3):471–495
Echegoyen C, Santana R, Mendiburu A, Lozano JA (2015) Comprehensive characterization of the behaviors of estimation of distribution algorithms. Theoret Comput Sci 598:64–86
Emerson P (2013) The original Borda count and partial voting. Soc Choice Welfare 40(2):353–358
Feng YQ (2006) Automorphism groups of Cayley graphs on symmetric groups with generating transposition sets. J Comb Theory, Series B 96(1):67–72
Fligner MA, Verducci JS (1986) Distance based ranking models. J Roy Stat Soc: Ser B (Methodol) 48(3):359–369
González C, Lozano JA, Larrañaga P (2000) Analyzing the population based incremental learning algorithm by means of discrete dynamical systems. Complex Systems 12:465–479
González C, Lozano JA, Larrañaga P (2002) Mathematical modeling of discrete estimation of distribution algorithms. In: Estimation of Distribution Algorithms, Springer, pp 147–163
Harik GR, Lobo FG, Goldberg DE (1999) The compact genetic algorithm. IEEE Trans Evol Comput 3(4):287–297
Kendall MG (1938) A new measure of rank correlation. Biometrika 30(1/2):81–93
Krejca MS, Witt C (2020) Theory of estimation-of-distribution algorithms. In: Theory of Evolutionary Computation, Springer, pp 405–442
Larrañaga P, Lozano JA (2002) Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation, vol 2. Springer Science & Business Media, Berlin
Mallows CL (1957) Non-null ranking models. Biometrika 44(1/2):114–130
Meilă M, Phadnis K, Patterson A, Bilmes J (2007) Consensus ranking under the exponential model. In: Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pp 285–294
Mühlenbein H, Mahnig T (1999) FDA - A scalable evolutionary algorithm for the optimization of additively decomposed functions. Evol Comput 7(4):353–376
Mühlenbein H, Paaß G (1996) From recombination of genes to the estimation of distributions I. Binary parameters. In: International Conference on Parallel Problem Solving from Nature, Springer, pp 178–187
Pérez-Rodríguez R, Hernández-Aguirre A (2019) A hybrid estimation of distribution algorithm for the vehicle routing problem with time windows. Comput Ind Eng 130:75–96
Shapiro JL (2005) Drift and scaling in estimation of distribution algorithms. Evol Comput 13(1):99–123
Sloane N The On-Line Encyclopedia of Integer Sequences. http://oeis.org
Sturm C (2009) Mémoire sur la résolution des équations numériques. In: Collected Works of Charles François Sturm, Springer, pp 345–390
Tsutsui S (2006) Node histogram vs. edge histogram: A comparison of probabilistic model-building genetic algorithms in permutation domains. In: Proceedings of the IEEE Conference on Evolutionary Computation CEC, IEEE, pp 1939–1946
Unanue I, Merino M, Lozano JA (2019) A mathematical analysis of EDAs with distance-based exponential models. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp 429–430
Vose MD (1999) The Simple Genetic Algorithm: Foundations and Theory, vol 12. MIT Press, Cambridge
Wang F, Li Y, Zhou A, Tang K (2019) An estimation of distribution algorithm for mixed-variable newsvendor problems. IEEE Trans Evol Comput 24(3):479–493
Witt C (2021) On crossing fitness valleys with majority-vote crossover and estimation-of-distribution algorithms. In: Proceedings of the 16th ACM/SIGEVO Conference on Foundations of Genetic Algorithms, pp 1–15
Zhang Q (2004) On stability of fixed points of limit models of univariate marginal distribution algorithm and factorized distribution algorithm. IEEE Trans Evol Comput 8(1):80–93
Zhang Q, Mühlenbein H (2004) On the convergence of a class of estimation of distribution algorithms. IEEE Trans Evol Comput 8(2):127–136
Acknowledgements
This research has been partially supported by the Spanish Ministry of Science and Innovation through the projects PID2019-104966GB-I00/AEI/10.13039/501100011033, PID2019-104933GB-I00/AEI/10.13039/501100011033, PID2019-106453GA-I00/AEI/10.13039/501100011033 and BCAM Severo Ochoa accreditation SEV-2017-0718; by the Basque Government through the program BERC 2022-2025 and the projects IT1504-22 and IT1494-22; and by UPV/EHU through the project GIU20/054. Imanol holds a grant from the Department of Education of the Basque Government (PRE_2021_2_0224).
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
This work is an extensive and detailed expansion of the work [29].
Appendices
Appendix A
In this appendix, several observations and properties about the right-hand side of Eq. (5) are studied and discussed.
Proposition 1
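The right-hand side of Eq. (5), denoted by \(g^*(\theta )\), is a continuous function defined in \({\mathbb {R}}\backslash \{0\}\). Moreover, \(\lim _{\theta \rightarrow 0} g^*(\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\).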
Proof
First of all, the function \(g^*\) is not defined when \(\theta =0\). Moreover, the continuity of the function is trivial (combination of scalar and exponential functions and the denominator is never zero).
Let us show that \(\lim _{\theta \rightarrow 0} g^*(\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) /2\) by means of L’Hôpital’s rule.
\(\square \)
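Therefore, we have a function which is defined in \({\mathbb {R}}\backslash \{0\}\) and has a limit when \(\theta \) tends to 0. So, the extension of the right-hand side of Eq. (5) can be defined in the following way:
$$\begin{aligned} g(\theta ) = \left\{ \begin{array}{ll} g^*(\theta ) &{} \text { if } \theta \ne 0 \\ \left( {\begin{array}{c}n\\ 2\end{array}}\right) /2 &{} \text { if } \theta = 0. \end{array} \right. \end{aligned}$$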
Proposition 2
\(g(\theta )\) is a continuous decreasing function, \(g(\theta )+g(-\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \) and \(\lim _{\theta \rightarrow +\infty }g(\theta ) = 0\).
Proof
By definition of \(g(\theta )\) and Proposition 1, it is trivial to observe that \(g(\theta )\) is a continuous function. Moreover, for any value \(\theta \ne 0\),
To prove \(g'(\theta )\) is always a negative value (\(\theta \ne 0\)), we will prove the following inequality:
Developing the expression,
and this is always true (bear in mind that the exponential function is always positive). A direct way to see that the previous inequality holds is considering the next inequality:
In Proposition 4(v), a similar result is mentioned.
Now let us prove that \(g(\theta )+g(-\theta ) = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \). By definition of \(g(\theta )\), the case \(\theta = 0\) is trivial, so let us calculate it for the remaining values.
Finally, the limit \(\lim _{\theta \rightarrow +\infty }g(\theta ) = 0\) is trivial. \(\square \)
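The properties in Proposition 2 can be checked numerically with a minimal sketch. It assumes that \(g(\theta )\) coincides with the expected Kendall tau distance under a Mallows model with spread parameter \(\theta \), computed from the counts \(m_n(d)\) of permutations at each distance; this identification and the helper names `mahonian` and `g` are assumptions made for illustration, not the closed form of Eq. (5).

```python
import math

def mahonian(n):
    """Return [m_n(0), ..., m_n(D)], the number of permutations of size n
    at each Kendall tau distance d from a fixed permutation."""
    counts = [1]
    for k in range(2, n + 1):
        new = [0] * (len(counts) + k - 1)
        for d, c in enumerate(counts):
            for j in range(k):  # the k-th item adds 0..k-1 inversions
                new[d + j] += c
        counts = new
    return counts

def g(theta, n):
    """Expected Kendall tau distance under a Mallows model (assumed form of g)."""
    m = mahonian(n)
    D = len(m) - 1
    # Shift the exponents so that every weight is at most m_n(d);
    # the shift cancels in the ratio below.
    shift = 0 if theta >= 0 else D
    weights = [c * math.exp(-theta * (d - shift)) for d, c in enumerate(m)]
    return sum(d * w for d, w in enumerate(weights)) / sum(weights)

if __name__ == "__main__":
    n = 5
    D = n * (n - 1) // 2
    for theta in (0.0, 0.5, 1.0, 2.0):
        print(theta, g(theta, n), g(theta, n) + g(-theta, n), D)
```

Under this assumption, the printed sums \(g(\theta )+g(-\theta )\) should match \(D = \left( {\begin{array}{c}n\\ 2\end{array}}\right) \), and the values of \(g\) should decrease as \(\theta \) grows.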
Appendix B
In this appendix, several properties of the sequence \(m_n^1(d)\) defined in Definition 3 are shown, where \(n \in {\mathbb {N}}\) and \(d =0,\dotsc ,D=n(n-1)/2\). The first values are shown in Table 3.
Proposition 3
For a fixed \(n \in {\mathbb {N}}\), the sequence \((m_n^1(0),\ldots ,m_n^1(D))\) satisfies the following properties:

(i)
For any distance \(d \in \{0,\dotsc ,D\}\),
$$\begin{aligned} m_n^1(d) = \sum _{i=0}^d (-1)^{d-i}m_n(i). \end{aligned}$$ 
(ii)
For any distance \(d \in \{0,\dotsc ,D\}\),
$$\begin{aligned} m_n(d) = m_n^1(d) + m_n^1(d-1). \end{aligned}$$ 
(iii)
For any distance \(d \in \{0,\dotsc ,D\}\),
$$\begin{aligned} m_n^1(d) = m_n^1(D-d-1). \end{aligned}$$ 
(iv)
For any distance \(d \in \{0,\dotsc ,D\}\) and \(n > 3\),

If D is even, let us define \(d_{max}^1=(D/2)-1\) and \(d_{max}^2=D/2\). Then,
$$\begin{aligned} \left\{ \begin{array}{ll} m_n^1(d) < m_n^1(d+1) &{} \text { when } d=0,\dotsc ,d_{max}^1-1 \\ m_n^1(d) > m_n^1(d+1) &{} \text { when } d=d_{max}^2,\dotsc ,D-1. \end{array} \right. \end{aligned}$$ 
If D is odd, let us define \(d_{max}= \lfloor D/2 \rfloor \). Then,
$$\begin{aligned} \left\{ \begin{array}{ll} m_n^1(d) < m_n^1(d+1) &{}\text { when } d=0,\dotsc ,d_{max}-1 \\ m_n^1(d) > m_n^1(d+1) &{}\text { when } d=d_{max},\dotsc ,D-1. \end{array} \right. \end{aligned}$$


(v)
For any distance \(d \in \{0,\dotsc ,D\}\) and \(n\ne 1\),
$$\begin{aligned} m_n^1(d) = \sum _{k=0}^d \sum _{j=0}^{n-1} m_{n-1}(k-j)\cdot (-1)^{d-k} . \end{aligned}$$ 
(vi)
For any distance \(d \in \{0,\dotsc ,D\}\),
$$\begin{aligned} m_n^1(d) \le m_n^1(d-1) + m_n^1(d+1). \end{aligned}$$
Proof
Properties (i)–(v) can be easily derived from Definition 3 and the characteristics of the sequence \(m_n(d)\). Finally, let us prove Property (vi), which states that \(m_n^1(i) \le m_n^1(i-1) + m_n^1(i+1)\). There exist three cases:

(a)
If \(m_n^1(i) \le m_n^1(i-1)\), the inequality is trivial.

(b)
If \(m_n^1(i) \le m_n^1(i+1)\), the inequality is trivial.

(c)
If \(m_n^1(i) > m_n^1(i-1)\) and \(m_n^1(i) > m_n^1(i+1)\), then \(m_n^1(i)\) is a single maximum (note that this case first appears when \(n=6\)).
In this particular case, D is an odd number, \(m_n^1(i) = m_n^1(\lfloor D/2 \rfloor )\) is the maximum value, \(m_n^1(i+1)= m_n^1(i-1)\); and \(m_n(\lfloor D/2 \rfloor )\) and \(m_n(\lceil D/2 \rceil )\) are the maximum values.
We present the properties and observations used to prove the last situation:

(a)
For any \(n \ge 6\), the maximum distance between two permutations in \(\varSigma _n\) is \(D(n)=n(n-1)/2\). So the difference between the maximum values for two permutations in \(\varSigma _n\) and \(\varSigma _{n-1}\) is \(D(n)-D(n-1) = n(n-1)/2 - (n-1)(n-2)/2 = n-1\).

(b)
Using the previous property and the sequence \(m_n(i)\), we can deduce the following observations:

If \(m_n(i)\) is the first maximum value for any fixed integer \(n \ge 6\), then \(m_{n-1}(i-\lfloor (n-1)/2 \rfloor )\) is the (first) maximum value.

If \(m_n(i)\) is the maximum value, then \(m_{n-1}(i)\) is located in the descending part of the sequence, that is, \(m_{n-1}(i-1)> m_{n-1}(i) > m_{n-1}(i+1)\).

Similarly, if \(m_n(i)\) is the maximum value, then \(m_{n-1}(i-n)\) is located in the ascending part of the sequence, that is, \(m_{n-1}(i-n-1)< m_{n-1}(i-n) < m_{n-1}(i-n+1)\).


(c)
For any integer values n and i, \(m_n(i) \le m_n(i+1)+m_n(i-1)\).
Once we bear the previous observations in mind, let us use Property (v).

If \(n-1 \ge i\):
$$\begin{aligned} m_n^1(i) \mathop {=}^{(\text {B.1})} \left\{ \begin{array}{ll} \sum _{j=0}^{i/2} m_{n-1}(2j) &{} \text { if } i \text { is even} \\ \sum _{j=0}^{(i-1)/2} m_{n-1}(2j+1) &{} \text { if } i \text { is odd.} \\ \end{array} \right. \end{aligned}$$ 
If \(n-1 < i\),
$$\begin{aligned} m_n^1(i) \mathop {=}^{(\text {B.1})} \left\{ \begin{array}{ll} \sum _{j=0}^{i/2} m_{n-1}(2j) - \sum _{k=0}^{(i-n-1)/2}m_{n-1}(2k+1) &{} \text { if } i \text { is even and } n \text { is odd}\\ \sum _{j=0}^{n-1} m_{n-1}(i-j) + \sum _{k=0}^{(i-n-2)/2} m_{n-1}(2k+1) - \sum _{l=0}^{(i-1)/2} m_{n-1}(2l) &{} \text { if } i \text { is odd and } n \text { is odd}\\ \sum _{j=0}^{n-1} m_{n-1}(i-j) + \sum _{k=0}^{(i-n-2)/2} m_{n-1}(2k+1) - \sum _{l=0}^{(i-2)/2} m_{n-1}(2l+1) &{} \text { if } i \text { is even and } n \text { is even}\\ \sum _{j=0}^{(i-1)/2} m_{n-1}(2j+1) - \sum _{k=0}^{(i-n-1)/2}m_{n-1}(2k+1) = \sum _{j=(i-n+1)/2}^{(i-1)/2} m_{n-1}(2j+1) &{} \text { if } i \text { is odd and } n \text { is even.}\\ \end{array} \right. \end{aligned}$$
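As a sanity check on the parity expansions above, the following sketch compares them with the alternating sum of Property (i). It is only an illustration: the helper names (`mahonian`, `m1_direct`, `m1_parity`) are hypothetical, and terms of \(m_{n-1}\) with an index outside \(\{0,\dotsc ,D(n-1)\}\) are assumed to be zero.

```python
def mahonian(n):
    """Return [m_n(0), ..., m_n(D)] (permutations counted by Kendall tau distance)."""
    counts = [1]
    for k in range(2, n + 1):
        new = [0] * (len(counts) + k - 1)
        for d, c in enumerate(counts):
            for j in range(k):
                new[d + j] += c
        counts = new
    return counts

def m1_direct(n, i):
    """Property (i): alternating partial sum of m_n."""
    m = mahonian(n)
    return sum((-1) ** (i - k) * m[k] for k in range(i + 1))

def m1_parity(n, i):
    """Parity expansions of m_n^1(i) in terms of m_{n-1}."""
    prev = mahonian(n - 1)
    m = lambda t: prev[t] if 0 <= t < len(prev) else 0  # zero outside the support
    evens = lambda hi: sum(m(2 * j) for j in range(hi + 1))
    odds = lambda hi: sum(m(2 * j + 1) for j in range(hi + 1))
    if i <= n - 1:
        return evens(i // 2) if i % 2 == 0 else odds((i - 1) // 2)
    window = sum(m(i - j) for j in range(n))  # equals m_n(i)
    if n % 2 == 1 and i % 2 == 0:
        return evens(i // 2) - odds((i - n - 1) // 2)
    if n % 2 == 1:  # i odd, n odd
        return window + odds((i - n - 2) // 2) - evens((i - 1) // 2)
    if i % 2 == 0:  # i even, n even
        return window + odds((i - n - 2) // 2) - odds((i - 2) // 2)
    # i odd, n even
    return sum(m(2 * j + 1) for j in range((i - n + 1) // 2, (i - 1) // 2 + 1))

if __name__ == "__main__":
    for n in range(2, 9):
        D = n * (n - 1) // 2
        ok = all(m1_direct(n, i) == m1_parity(n, i) for i in range(D + 1))
        print(n, ok)
```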
In order to extend the sums, let us denote by “ \(\begin{array}{c} (even)\\ \cdots \end{array}\) ” and “ \(\begin{array}{c} (odd)\\ \cdots \end{array}\) ” the coefficients with even and odd indexes, respectively.
When \(n \ge 6\), there are four possible cases depending on the parity of the integers n and i (Eqs. (B.2)–(B.5)).
Let n be an odd number and i an even number.
Let n and i be odd numbers.
Let n and i be even numbers.
Let n be an even number and i an odd number.
In all cases, the result is proved. \(\square \)
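The properties of Proposition 3 can also be checked numerically for small n. The sketch below builds \(m_n^1\) from the recurrence implied by Property (ii) and then tests Properties (iii) and (vi); the function names are hypothetical, and values of \(m_n^1\) outside \(\{0,\dotsc ,D\}\) are assumed to be zero.

```python
def mahonian(n):
    """Return [m_n(0), ..., m_n(D)] (permutations counted by Kendall tau distance)."""
    counts = [1]
    for k in range(2, n + 1):
        new = [0] * (len(counts) + k - 1)
        for d, c in enumerate(counts):
            for j in range(k):
                new[d + j] += c
        counts = new
    return counts

def m1_sequence(n):
    """Return [m_n^1(0), ..., m_n^1(D)] using m_n^1(d) = m_n(d) - m_n^1(d-1)."""
    seq, previous = [], 0
    for value in mahonian(n):
        previous = value - previous
        seq.append(previous)
    return seq

def check_symmetry(seq):
    """Property (iii): m_n^1(d) = m_n^1(D - d - 1) for d = 0, ..., D - 1."""
    D = len(seq) - 1
    return all(seq[d] == seq[D - d - 1] for d in range(D))

def check_neighbour_bound(seq):
    """Property (vi): m_n^1(d) <= m_n^1(d-1) + m_n^1(d+1)."""
    at = lambda d: seq[d] if 0 <= d < len(seq) else 0
    return all(at(d) <= at(d - 1) + at(d + 1) for d in range(len(seq)))

if __name__ == "__main__":
    for n in range(3, 8):
        seq = m1_sequence(n)
        print(n, seq, check_symmetry(seq), check_neighbour_bound(seq))
```

For \(n=4\) and \(n=5\) the printed sequences have two equal maxima, whereas for \(n=6\) a single maximum appears, in agreement with case (c) of the proof.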
Appendix C
In this appendix, all the properties of the exponential polynomials used in this work are shown. Throughout this work, the exponential polynomials have integer coefficients and the base used is e. For a fixed value n, the exponential polynomials can be denoted in the following way:
$$\begin{aligned} Pol(\theta ) = \sum _{i=0}^{2D} a_i e^{-i\theta }, \end{aligned}$$
where D is the maximum Kendall tau distance. The highest value i used in this work is 2D, which is the maximum possible sum of two Kendall tau distance values. By definition, \(Pol(0)= \sum _{i=0}^{2D} a_i\), and when \(\theta \) tends to infinity, \(Pol(\theta )\) tends to \(a_0\).
Proposition 4
The following results are true:

(i)
If \(a_i >0,\ \forall i=0,\dotsc ,2D\), then \(Pol(\theta )\) is a positive decreasing function.

(ii)
If \(a_i \ge 0,\ i=0,\ldots ,j\), and \(a_i \le 0,\ i=j+1,\ldots ,2D\), where there exists at least one positive coefficient and one negative coefficient, and \(\sum _{i=0}^j a_i < -\sum _{i=j+1}^{2D} a_i\), then there exists a positive value \(\theta \) such that \(Pol(\theta )=0\). The result is analogous for the opposite order.

(iii)
Let \(a_i \ge 0\), \(\forall i=0,\dotsc ,j_1,j_2,\dotsc ,2D\) (\(j_1 < j_2-1\)), and \(a_i < 0\), \(\forall i=j_1+1,\dotsc ,j_2-1\), where there exists at least one positive coefficient and one negative coefficient. If \(\sum _{i=0}^{j_1}a_i \ge -\sum _{i=j_1+1}^{j_2-1}a_i\) and \(-\sum _{i=j_1+1}^{j_2-1}a_i \le \sum _{i=j_2}^{2D}a_i\), then there are no positive roots. The result is analogous for the opposite order.

(iv)
If \(a_i =-a_{2D-i}\), \(\forall i = 0,\dotsc ,2D\), then \(a_D=0\) and \(Pol(0)=0\). In addition, there are no positive roots \(\theta \) (corollary of Property (iii)).

(v)
Let \(a_i > 0\), \(\forall i=0,\dotsc ,D-1,D+1,\dotsc ,2D\), \(a_i=a_{2D-i}\) and \(\sum _{i=0}^{2D}a_i = 0\). Then, \(Pol(0)=0\) and there are no positive roots \(\theta \).
Proof
All the properties can be easily proved due to the definition of the exponential function. For Property (v), the argument used in Inequality (A.2) is used. \(\square \)
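To make some of these properties concrete, the sketch below evaluates small exponential polynomials numerically. It assumes the form \(Pol(\theta ) = \sum _i a_i e^{-i\theta }\), which is consistent with \(Pol(\theta )\) tending to \(a_0\) when \(\theta \) tends to infinity; the coefficient vectors and the helper names `pol` and `positive_root` are illustrative choices, not taken from the paper.

```python
import math

def pol(coeffs, theta):
    """Evaluate sum_i a_i * exp(-i * theta) for coeffs = [a_0, a_1, ...]."""
    return sum(a * math.exp(-i * theta) for i, a in enumerate(coeffs))

def positive_root(coeffs, lo=1e-9, hi=50.0, iterations=200):
    """Bisection on [lo, hi]; returns None if there is no sign change."""
    f_lo, f_hi = pol(coeffs, lo), pol(coeffs, hi)
    if f_lo * f_hi > 0:
        return None
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = pol(coeffs, mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    grid = [0.05 * k for k in range(1, 200)]

    # Property (i): positive coefficients give a positive decreasing function.
    positive = [3, 1, 2]
    values = [pol(positive, t) for t in grid]
    print(all(v > 0 for v in values), all(a > b for a, b in zip(values, values[1:])))

    # Property (ii): non-negative coefficients first, then non-positive ones,
    # with 1 + 2 < -(-1 - 4), so a positive root is expected.
    print(positive_root([1, 2, -1, -4]))

    # Property (v): symmetric positive coefficients around a_D with total sum 0,
    # so Pol(0) = 0 and no positive root is expected.
    symmetric = [1, 2, -6, 2, 1]
    print(pol(symmetric, 0.0), all(pol(symmetric, t) > 0 for t in grid))
```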
Proving inequality (17)
Proof
To prove Inequality (17) at \({\hat{\theta }}\) value, let us analyze the following functions (for a fixed \(n \in {\mathbb {N}}\) such that \(n \ge 3\)):
\(f_1(\theta )\) is an exponential polynomial which fulfills Property (iv) of the exponential polynomials, whereas \(f_2(\theta )\) fulfills Property (i). An example of \(f_1(\theta )f_2(\theta )\) is displayed for \(n=5\) and \(d=1,\dotsc ,4\) in Fig. 5.
In order to prove Inequality (17), we have used the following result. At \(\theta _0 = {\hat{\theta }}\):
Hence, let us define a new function:
\(f_3(\theta )\) is an exponential polynomial which fulfills Property (ii) due to the fact that \(d^* < D/2\), and therefore \(\sum _{i=0}^{d^*-1} a_i < -\sum _{i=d^*+1}^D a_i\).
After defining the functions \(f_i\) for \(i=1,2,3\), let us define \(F_c(\theta )\) in the following way:
where c is a positive real value. Let us denote \(F_c(\theta )= \sum _{i=0}^{2D}b_i^c e^{-i\theta }\).
When \(c=0\), \(F_c(\theta )\) is the function associated to Inequality (17). On the interval \([0,+\infty )\), \(F_0(\theta )\) starts at \(-d^*\cdot n!\), and when \(\theta \) tends to infinity, \(F_0(\theta )\) tends to 1. Moreover, by Property (ii), it can be ensured that the equation \(F_0(\theta )=0\) is fulfilled once, at \(\theta =\theta '\).
We want to prove Inequality (17), that is to say, \(F_0({\hat{\theta }}) > 0\). By definition of \(F_c(\theta )\), this is equivalent to proving that \(F_c({\hat{\theta }})>0\) for some value c. To prove this, we will choose an appropriate value c which ensures that \(F_c(\theta )\) is positive for any \(\theta \in [0,+\infty )\). First, notice that when c tends to infinity, \(F_c(0)\) tends to \(+\infty \). Then, for \(c > M\), where M is the smallest positive number such that \(\sum _{i=0}^{2D} b_i^M = 0\), we can ensure that \(F_c(0) >0\) and that, when \(\theta \) tends to infinity, \(F_c(\theta )\) tends to 1 because \(b_0^c = 1\). Finally, one way to prove that \(F_c\) is a positive function for a particular c is to observe that, for a suitable value c, \(F_c(\theta )\) fulfills Property (iii). This can be ensured because of the inequality \(\sum _{i=0}^{d^*-1} a_i < -\sum _{i=d^*+1}^D a_i\) which \(f_3(\theta )\) fulfills.
Consequently, \(F_c(\theta ) > 0\) for any \(\theta \in [0,+\infty )\). Particularly, \(F_c({\hat{\theta }}) > 0\) which implies, by definition of \(F_c(\theta )\), that Inequality (17) is fulfilled. \(\square \)
The function h is a negative function.
Proposition 5
For any \(\theta \ge 0\), let us denote
the difference value between \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)p^S(\pi )\) and \(\sum _{\pi \in \varSigma _n} d(\pi ,\sigma ^*)p(\pi )\). Then, \(h(\theta )\) is a negative value.
Proof
The proof is based on developing the sum in two nonpositive exponential polynomials with nonnegative coefficients and comparing those coefficients one by one. On the one hand, let us denote by \(a_i\) the coefficient of \(e^{-i\theta }\) in the left-hand side of Inequality (C.1).
where
and \(\delta _{i,j+k}\) is the Kronecker delta:
On the other hand, let us denote by \(b_i\) the coefficient of \(e^{-i\theta }\) in the right-hand side of Inequality (C.1).
where
and
To prove Inequality (C.1), let us demonstrate that \({a_i \le b_i, \forall i = 1,\dotsc , 2D}\), and \(a_i < b_i\) for at least one index i. For any i, note that if \(i\ne j+k\), there is no coefficient. Otherwise, when \(j > k\), then \(2j > i\); when \(j=k\), the coefficient of the summation is the same, and when \(k > j\), then \(2k > i\). Therefore, it is demonstrated that
\(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Unanue, I., Merino, M. & Lozano, J.A. A mathematical analysis of EDAs with distance-based exponential models. Memetic Comp. 14, 305–334 (2022). https://doi.org/10.1007/s12293-022-00371-y
DOI: https://doi.org/10.1007/s12293-022-00371-y
Keywords
 Estimation of Distribution Algorithms
 Permutation-based Combinatorial Optimization Problems
 Mathematical Modeling
 Dynamical Systems
 Mallows Model