1 Introduction

Learning Automata (LA) have been used in systems that have incomplete knowledge about the Environment in which they operate [1, 5,6,7,8,9,10,11]. The learning mechanism attempts to learn from a stochastic Teacher which models the environment. In his pioneering work, Tsetlin [12] attempted to use LA to model biological learning. In general, a random action is selected based on a probability vector, and these action probabilities are updated based on the observation of the Environment’s response, after which the procedure is repeated.

The term “Learning Automata” was first publicized and rendered famous in the survey paper by Narendra and Thathachar. The goal of LA is to “determine the optimal action out of a set of allowable actions” [1]. The distinguishing characteristic of automata-based learning is that the search for the optimizing parameter vector is conducted in the space of probability distributions defined over the parameter space, rather than in the parameter space itself [2].

Concerning applications, the entire field of LA and stochastic learning has had a myriad of applications [5,6,7, 9, 10], which (apart from the many applications listed in these books) include solutions for problems in networks and communications [13,14,15,16], network call admission, traffic control, quality-of-service routing [17,18,19], distributed scheduling [20], training hidden Markov models [21], neural network adaptation [22], intelligent vehicle control [23], and even fairly theoretical problems such as graph partitioning [24]. In addition to these fairly generic applications, with a little insight, LA can be used to assist in solving (by, indeed, learning the associated parameters) the stochastic resonance problem [25], the stochastic sampling problem in computer graphics [26], the problem of determining roads in aerial images by using geometric-stochastic models [27], and various location problems [28]. Similar learning solutions can also be used to analyze the stochastic properties of the random waypoint mobility model in wireless communication networks [29], achieve spatial point pattern analysis codes for GISs [30], digitally simulate wind field velocities [31], interrogate the experimental measurements of global dynamics in magneto-mechanical oscillators [32], and to analyze spatial point patterns [33]. LA-based schemes have already been utilized to learn the best parameters for neural networks [22], to optimize QoS routing [19], and for bus arbitration [14] – to mention a few other applications.

Although many LA algorithms have been devised in the literature, those LA schemes are not able to solve deterministic optimization problems, as they suppose that the environment is stochastic. In other words, classical LA schemes resort to the assumption that the response of the environment to the same action or set of actions is stochastic. However, in deterministic optimization problems this is not the case, as the output, which is the response of the environment, is a deterministic function of the input. There have been many studies that resort to a team of LA for solving optimization problems where the objective function is noisy. Examples of those works include noise-tolerant learning of half-spaces [34] and the nonlinear fractional knapsack problem [35]. The latter stream of works shows that pursuit LA are a viable solution when the objective function is noisy. However, when the objective function to optimize is deterministic, i.e., non-noisy, evidence from the literature indicates that using a team of traditional pursuit LA yields slow convergence. For instance, Tilak et al. [36] report that a team of more than 10 traditional pursuit LA deployed for solving a deterministic combinatorial problem, namely sensor coverage, yields a very slow convergence speed. In fact, Tilak et al. state: “Even at a modest number of 10 cameras, the centralized pursuit algorithm takes a long time for the automata team to converge which makes it unsuitable for an application like distributed object tracking where fast convergence is necessary”. Furthermore, some of the authors of the current manuscript [37] have also noticed this slow convergence when solving a machine learning classification problem mapped into a combinatorial problem using a team of LA. Another important disadvantage of traditional teams of pursuit LA concerns the size of the memory required for storing the reward estimate vector. In the case of a classical team of pursuit LA, one needs a shared memory for the reward estimate vector that increases dramatically with the size of the team. For instance, for a team of N LA, each with two actions (binary-action LA), the memory space required for storing the reward probability estimates is 2^N, which becomes prohibitive as N increases [36]. This slow convergence of classical teams of pursuit LA for solving deterministic optimization problems calls for a new LA paradigm, which is the objective of this article.

In this paper, we develop a novel pursuit LA, which can be seen as the counterpart of the family of pursuit LA designed for stochastic environments [3]. While classical pursuit LAs are able to pursue the action with the highest reward estimate, our pursuit LA rather pursues the collection of actions that yield the highest performance. The theoretical analysis of the pursuit scheme does not follow classical LA proofs and can pave the way towards more schemes where LA can be applied to solve deterministic optimization problems.

We catalogue the contributions of this article as follows:

  • We devise a simple and lightweight optimization framework based on the theory of LA. In contrast to any LA scheme presented in the literature, our solution is especially designed for deterministic environments.

  • Our current solution extends the family of pursuit LA algorithms [3, 38, 39] to solve deterministic optimization problems. A common feature for all legacy pursuit algorithms is to estimate the reward probability of each action and pursue the action with the highest reward. In our current work, the environment is rather deterministic. Therefore, we opt to pursue the joint action of the team LA corresponding to the best solution found so far.

  • We provide sound theoretical results that demonstrate the convergence of our scheme under both a constant learning parameter and a time-decaying learning parameter. To the best of our knowledge, this is the first work that proposes an analysis of an LA scheme with a time-decaying learning parameter.

  • As an example of an optimization problem, we show how our scheme can be applied to solve the Max-SAT problem.

The remainder of this paper is organized as follows. In Section 2, we give an introduction to the theory of Learning Automata, which is the fundamental tool in this paper. In Section 3, we survey some related work within the field of LA and optimization. In Section 4, we present our solution, called Pursuit-LA, and provide theoretical proofs demonstrating its convergence. In Section 5, we report an experiment in which Pursuit-LA is applied to the Max-SAT problem. Section 6 concludes the article.

2 Learning automata

In the field of Automata Theory, an automaton [5,6,7, 9, 10] is defined as a quintuple composed of a set of states, a set of outputs or actions, an input, a function that maps the current state and input to the next state, and a function that maps a current state (and input) into the current output.

Definition 1: A LA is defined by a quintuple <A, B, Q, F(., .), G(.)>, where:

  1. A = {α1, α2, …, αr} is the set of outputs or actions that the LA must choose from, and α(t) is the action chosen by the automaton at any instant t.

  2. B = {β1, β2, …, βm} is the set of inputs to the automaton. β(t) is the input at any instant t. The set B can be finite or infinite. The most common LA input is B = {0, 1}, where β = 0 represents reward, and β = 1 represents penalty.

  3. Q = {q1, q2, …, qs} is the finite set of states, where Q(t) denotes the state of the automaton at any instant t.

  4. F(., .) : Q × B ↦ Q is a mapping in terms of the state and input at the instant t, such that q(t + 1) = F[q(t), β(t)]. It is called the transition function, i.e., the function that determines the state of the automaton at the subsequent time instant t + 1. This mapping can be either deterministic or stochastic.

  5. G(.) : Q ↦ A is a mapping called the output function. G determines the action taken by the automaton in a given state: α(t) = G[q(t)]. With no loss of generality, G is taken to be deterministic.

If the sets Q, B and A are all finite, the automaton is said to be finite.
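As a purely illustrative companion to Definition 1, the sketch below encodes such a finite automaton in Python, instantiated with a Tsetlin-style two-action automaton that has two states per action. All identifiers are our own and are not taken from any existing LA implementation.

```python
# Illustrative sketch of Definition 1: a finite automaton <A, B, Q, F, G>.
ACTIONS = ["alpha1", "alpha2"]          # A: set of actions
INPUTS = [0, 1]                         # B: 0 = reward, 1 = penalty
STATES = ["q1", "q2", "q3", "q4"]       # Q: q1, q2 favour alpha1; q3, q4 favour alpha2

def transition(state, beta):
    """F(q, beta): deterministic next-state map (Tsetlin-style, depth 2)."""
    i = STATES.index(state)
    if beta == 0:                       # reward: move deeper into the current action
        i = max(0, i - 1) if i < 2 else min(3, i + 1)
    else:                               # penalty: move towards (and across) the boundary
        i = i + 1 if i < 2 else i - 1
    return STATES[i]

def output(state):
    """G(q): states q1, q2 select alpha1; states q3, q4 select alpha2."""
    return ACTIONS[0] if state in ("q1", "q2") else ACTIONS[1]
```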

The Environment, E, typically, refers to the medium in which the automaton functions. The Environment possesses all the external factors that affect the actions of the automaton. Mathematically, an Environment can be abstracted by a triple <A, C, B>. A, C, and B are defined as follows:

  1. A = {α1, α2, …, αr} is the set of actions.

  2. B = {β1, β2, …, βm} is the output set of the Environment. Again, we consider the case when m = 2, i.e., with β = 0 representing a “Reward”, and β = 1 representing a “Penalty”.

  3. C = {c1, c2, …, cr} is a set of penalty probabilities, where the element ci ∈ C corresponds to the action αi.

The process of learning is based on a learning loop involving the two entities: the Random Environment (RE), and the LA, as described in Fig. 1. In the learning process, the LA continuously interacts with the Environment to process responses to its various actions (i.e., its choices). Finally, through sufficient interactions, the LA attempts to learn the optimal action offered by the RE. The actual process of learning is represented as a set of interactions between the RE and the LA.

Fig. 1 Feedback Loop of LA

The automaton is offered a set of actions, and it is constrained to choosing one of them. When an action is chosen, the Environment gives a response β(t) at time t. The automaton is either penalized or rewarded with an unknown probability ci or 1 − ci, respectively. On the basis of the response β(t), the state of the automaton q(t) is updated and a new action is chosen at time (t + 1). The penalty probability ci satisfies:

$$ {c}_i=\mathit{\Pr}\left[\beta (t)=1|\alpha (t)={\alpha}_i\right],\kern0.5em i=1,2,\dots, r. $$

We now provide a few important definitions used in the field. P(t) is referred to as the action probability vector, where P(t) = [p1(t), p2(t), …, pr(t)]T, in which each element of the vector is given by:

$$ {p}_i(t)=\mathit{\Pr}\left[\alpha (t)={\alpha}_i\right],i=1,\dots, r,\mathrm{such}\ \mathrm{that}\ \sum \limits_{i=1}^r{p}_i(t)=1\kern0.5em \forall t. $$
(1)

Given an action probability vector, P(t) at time t, the average penalty is:

$$ {\displaystyle \begin{array}{l}M(t)=E\left[\beta (t)|P(t)\right]=\mathit{\Pr}\left[\beta (t)=1|P(t)\right]\\ {}=\sum \limits_{i=1}^r\mathit{\Pr}\left[\beta (t)=1|\alpha (t)={\alpha}_i\right]\ \mathit{\Pr}\left[\alpha (t)={\alpha}_i\right]\\ {}=\sum \limits_{i=1}^r{c}_i{p}_i(t)\end{array}}. $$
(2)

The average penalty for the “pure-chance” automaton is given by:

$$ {M}_0=\frac{1}{r}\sum \limits_{i=1}^r{c}_i. $$
(3)

As t → ∞, if the average penalty M(t) is less than M0, at least asymptotically, the automaton is generally considered to be better than the pure-chance automaton. E[M(t)] is given by:

$$ E\left[M(t)\right]=E\left\{E\left[\beta (t)|P(t)\right]\right\}=E\left[\beta (t)\right]. $$
(4)
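For illustration, the following short sketch evaluates Eqs. (2) and (3) for a hypothetical three-action environment; the penalty probabilities and the action probability vector are made-up values, not taken from the paper.

```python
# Sketch: average penalty M(t) (Eq. 2) and pure-chance baseline M0 (Eq. 3).
c = [0.2, 0.5, 0.8]            # penalty probabilities c_i (hypothetical)
p = [0.6, 0.3, 0.1]            # action probability vector P(t) (hypothetical)

M_t = sum(ci * pi for ci, pi in zip(c, p))   # Eq. (2): sum_i c_i * p_i(t)
M_0 = sum(c) / len(c)                        # Eq. (3): (1/r) * sum_i c_i

print(M_t, M_0)   # 0.35 < 0.5, so this P(t) already beats pure chance
```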

A LA that performs better than a pure-chance automaton is said to be expedient.

Definition 2: A LA is considered expedient if:

$$ {\lim}_{t\mapsto \infty }E\left[M(t)\right]<{M}_0. $$

Definition 3: A LA is said to be absolutely expedient if

$$ E\left[M\left(t+1\right)|P(t)\right]<M(t), $$

implying that E[M(t + 1)] < E[M(t)].

Definition 4: A LA is considered optimal if

$$ {\lim}_{t\mapsto \infty }E\left[M(t)\right]={c}_l, $$

where cl = mini{ci}.

It should be noted that, in general, optimal LA are not attainable. Marginally sub-optimal performance, termed ϵ-optimal performance and defined below, is what LA researchers attempt to attain.

Definition 5: A LA is considered ϵ-optimal if:

$$ {\lim}_{t\mapsto \infty }E\left[M(t)\right]<{c}_l+\epsilon, $$
(5)

where ϵ > 0, and can be arbitrarily small, by a suitable choice of some parameter of the LA.

2.1 Types of learning automata

2.1.1 Deterministic learning automata

An automaton is termed a deterministic automaton if both the transition function F(., .) and the output function G(.) are deterministic. Thus, in a deterministic automaton, the subsequent state and action can be uniquely specified, provided the present state and input are given.

2.1.2 Stochastic learning automata

If, however, either the transition function F(., .) or the output function G(.) is stochastic, the automaton is termed a stochastic automaton. In such an automaton, even if the current state and input are specified, the subsequent states and actions cannot be specified uniquely. In such a case, F(., .) only provides the probabilities of reaching the various states from a given state.

In the first LA designs, both the transition and the output functions were time-invariant, and for this reason these LA were considered to be “Fixed Structure Stochastic Automata” (FSSA). Tsetlin, Krylov, and Krinsky [12] have presented notable examples of this type of automata.

Subsequently, Vorontsova and Varshavskii introduced a class of stochastic automata known in the literature as Variable Structure Stochastic Automata (VSSA). In the definition of a VSSA, the LA is wholly defined by a set of actions (one of which is the output of the automaton), a set of inputs (which is usually the response of the Environment) and a learning algorithm, T. The learning algorithm [7] operates on a vector (called the Action Probability vector).

Note that the algorithm T : [0, 1]^R × A × B → [0, 1]^R is an updating scheme, where A = {α1, α2, …, αR}, 2 ≤ R < ∞, is the set of output actions of the automaton, and B is the set of responses from the Environment. Thus, the updating is such that:

P(t + 1) = T(P(t), α(t), β(t)), where P(t) is the action probability vector, α(t) is the action chosen at time t, and β(t) is the response it has obtained.
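As a concrete example of such a mapping T, the sketch below implements the classical linear Reward-Inaction (LRI) update for an R-action VSSA. The function name and the learning-rate symbol theta are our own illustrative choices.

```python
# Sketch of a VSSA updating scheme T: P(t+1) = T(P(t), alpha(t), beta(t)).
# Here T is the classical linear Reward-Inaction (L_RI) rule: on reward
# (beta = 0) the chosen action's probability is increased and the others
# are scaled down; on penalty (beta = 1) nothing changes.
def lri_update(P, chosen, beta, theta=0.1):
    if beta == 1:                      # penalty: inaction
        return P
    return [p + theta * (1.0 - p) if i == chosen else (1.0 - theta) * p
            for i, p in enumerate(P)]
```

One can check that the probabilities still sum to one after each reward update, which is what makes this a valid mapping into the probability simplex.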

If the mapping T is chosen in such a manner that the Markov process has absorbing states, the algorithm is referred to as an absorbing algorithm. Many families of VSSA that possess absorbing barriers have been reported [7]. Ergodic VSSA have also been investigated [7, 40]. These VSSA converge in distribution and thus the asymptotic distribution of the action probability vector has a value that is independent of the corresponding initial vector. While ergodic VSSA are suitable for non-stationary environments, absorbing VSSA are preferred in stationary environments.

3 Related work

In order to put our work in the right perspective, we will briefly discuss different optimization schemes relevant to this work mostly from the field of LA.

3.1 LA for optimization

A work similar to ours is due to Thathachar and Sastry [41], where the authors use a team of LA in order to find the optimal discriminant function in a feature space. The discriminant functions are parametrized, and each has an attached parameter that is to be learned.

Subsequently, Santharam et al. [42] proposed using continuous LA in order to deal with the disadvantages of discretization, thus allowing an infinite number of actions. For an excellent review on the application of LA to the field of Pattern Recognition, we refer the reader to [43]. In [44], Zahiri devised an LA-based classifier that operates using hypercubes in a recursive manner. In [45], the authors have proposed LA optimization methods for multimodal functions. In experimental settings, these algorithms were shown to outperform genetic algorithms. In [46], the authors propose genetic LA for optimizing functions. Similarly, the work [47] proposed genetic algorithms for classifiers.

Misra and Oommen pioneered the concept of LA on a graph using pursuit LA [13, 48, 49] for solving the stochastic shortest path problem. Li [50] used a type of S Learning Automata [51] to find the shortest path in a graph. Beigy and Meybodi [52] provided the first proof in the literature that demonstrates the convergence of distributed LA on a graph for a reward inaction LA.

Concerning applications of distributed LA on a graph in the field of computer communications, we refer the reader to the work of Torkestani and collaborators [53,54,55].

3.2 Stochastic local search algorithms (SLS)

Owing to combinatorial explosion, large and complex SAT problems are hard to solve using systematic algorithms. One way to overcome the combinatorial explosion is to give up completeness. Local search algorithms are techniques which use this strategy. Local search algorithms are based on what is perhaps the oldest optimization method: trial and error. Typically, they start with an initial assignment of values to variables, generated randomly or heuristically. Satisfiability can be formulated as an optimization problem in which the goal is to minimize the number of unsatisfied clauses. Thus, the optimum is obtained when the value of the objective function equals zero, which means that all clauses are satisfied. Finite Learning Automata have been proposed as a mechanism for enhancing meta-heuristic based Max-SAT solvers. The work conducted in [56] proposes an adaptive memory-based local search algorithm that exploits various strategies in order to guide the search to achieve a suitable trade-off between intensification and diversification. Multilevel techniques [57, 58] have been applied to Max-SAT with considerable success. They progressively coarsen the problem, find an assignment, and then employ a meta-heuristic to refine the assignment on each of the coarsened problems in reverse order.

During each iteration, a new solution is selected from the neighborhood of the current one by performing a move. Choosing a good neighborhood, and a method for searching it, is usually guided by intuition, because very little theory is available as a guide. Most SLS algorithms use a 1-flip neighborhood relation, for which two truth-value assignments are neighbors if they differ in the truth value of exactly one variable. If the new solution provides a better value in light of the objective function, the new solution becomes the current one. The search terminates if no better neighboring solution can be found.

One of the most popular local search algorithms for solving SAT is GSAT [59]. GSAT begins with a randomly generated assignment of values to variables and then uses the steepest-descent heuristic to find the new variable-value assignment which best decreases the number of unsatisfied clauses. After a fixed number of moves, the search is restarted from a new random assignment. The search continues until a solution is found or a fixed number of restarts has been performed. An extension of GSAT referred to as random-walk [60] has been realized with the purpose of escaping from local optima. In a random-walk step, an unsatisfied clause is selected at random. Then, one of the variables appearing in that clause is flipped, thus effectively forcing the selected clause to become satisfied. The main idea is to decide at each search step whether to perform a standard GSAT move or a random-walk move, with a probability called the walk probability. Another widely used variant of GSAT is the WalkSAT algorithm originally introduced in [61]. It first picks an unsatisfied clause c at random and then, in a second step, selects one of the variables appearing in the selected clause based on its break count. The break count of a variable is defined as the number of clauses that would become unsatisfied by flipping that variable. If there exists a variable with a break count equal to zero, this variable is flipped; otherwise, the variable with the minimal break count is selected with a certain probability (the noise probability). The random choice of unsatisfied clauses combined with the randomness in the selection of variables enables WalkSAT to avoid local minima and to explore the search space better. Newer algorithms [62,63,64,65] have emerged that use a history-based variable selection strategy in order to avoid repeatedly flipping the same variable. Apart from GSAT and its variants, several clause-weighting based SLS algorithms [66, 67] have been proposed to solve SAT problems. The key idea is to associate the clauses of the given CNF formula with weights. Although these clause-weighting SLS algorithms differ in the manner in which clause weights are updated (probabilistically or deterministically), they all choose to increase the weights of all the unsatisfied clauses as soon as a local minimum is encountered. Clause weighting acts as a diversification mechanism rather than a way of escaping local minima. Finally, many other SLS algorithms have been applied to SAT. These include techniques such as Simulated Annealing [68, 69], Evolutionary Algorithms [70], and Greedy Randomized Adaptive Search Procedures [71]. The nature-inspired GASAT algorithm [72] is a hybrid algorithm that combines a specific crossover and a tabu search procedure. The work in [73] proposes a hybrid approach called Iterated Robust Tabu Search (IRoTS) which combines an iterated local search and tabu search.
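For readers unfamiliar with these solvers, the following is a rough, uncommitted sketch of a single WalkSAT-style flip step as described above. The data layout and names are our own, and the many refinements of the published variants are deliberately omitted.

```python
import random

def walksat_step(clauses, assignment, noise=0.5):
    """One WalkSAT-style flip. clauses: list of clauses, each a list of signed
    1-based literals (e.g. [1, -3, 4] stands for x1 OR NOT x3 OR x4);
    assignment: dict variable index -> bool, modified in place.
    Returns True when every clause is already satisfied."""

    def satisfied(clause):
        return any((lit > 0) == assignment[abs(lit)] for lit in clause)

    unsat = [c for c in clauses if not satisfied(c)]
    if not unsat:
        return True
    clause = random.choice(unsat)              # pick an unsatisfied clause

    def break_count(var):
        # clauses satisfied now that would become unsatisfied if var is flipped
        affected = [c for c in clauses
                    if var in (abs(l) for l in c) and satisfied(c)]
        assignment[var] = not assignment[var]
        broken = sum(1 for c in affected if not satisfied(c))
        assignment[var] = not assignment[var]
        return broken

    variables = [abs(l) for l in clause]
    counts = {v: break_count(v) for v in variables}
    zero = [v for v in variables if counts[v] == 0]
    if zero:
        var = random.choice(zero)              # "freebie": flipping breaks nothing
    elif random.random() < noise:
        var = random.choice(variables)         # noisy random-walk move
    else:
        var = min(variables, key=counts.get)   # greedy: minimal break count
    assignment[var] = not assignment[var]
    return False
```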

4 Our solution: Pursuit-LA

In this Section, we shall present our solution, referred to as Pursuit-LA, for solving deterministic optimization problems. In many combinatorial problems, a candidate solution can be represented using a binary vector [74]. Adopting Pursuit-LA amounts to attaching an LA to each element of the binary vector, whose decision is the action 0 or 1. The collective decision of the different LA results in a candidate solution. The solution with the highest “fitness” is pursued by the LA using the LRI scheme [7, 10]. Furthermore, we will give an example of the application of Pursuit-LA to the Max-SAT problem.
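Concretely, sampling a candidate solution from such a team of binary-action LA can be sketched as follows (illustrative Python; the identifiers are ours, and the uniform initialization simply mirrors the initial probabilities used later in this section).

```python
import random

def sample_candidate(P):
    """P is a list of m probability pairs [p_(i,0), p_(i,1)], one per bit.
    Each LA in the team independently draws its bit; the joint choice
    forms the candidate solution C(t)."""
    return [1 if random.random() < p[1] else 0 for p in P]

m = 8                                    # number of bits (hypothetical)
P = [[0.5, 0.5] for _ in range(m)]       # uniform initial probabilities
candidate = sample_candidate(P)
```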

4.1 Convergence results of the pursuit-LA

In this Section, we will consider two convergence cases of the Pursuit-LA, namely convergence under time decaying learning parameter and convergence under constant learning parameter.

4.1.1 Pursuit-LA with time-dependent parameter

At each epoch, each LA in the team chooses an action; the choices of the team are therefore synchronous. The joint action of the team of LA results in a candidate solution. The observed performance is fed back to the team of LA and used to reinforce the choice of the candidate solution yielding the highest performance. More precisely, each LA in the team is attached to one component of the binary vector forming the candidate solution, and the action coinciding with the corresponding component of the best candidate solution found so far sees its probability increase at each time instant. In this sense, the joint action probability vector of the team of LA gets biased towards the best solution found so far, hence the concept of pursuit. The choices of the team of LA are synchronous and the feedback is common to the team, which can be thought of as shared memory if one considers that the last feedback is stored in a common memory. Each LA also has a local memory to remember the best action so far (up to the current time instant), i.e., the action that has resulted in the highest performance for the team.

Let C(t) = {C1(t), …, Cm(t)} be a candidate solution at time t, where Ci takes a binary value and m is the number of bits needed to code a candidate solution. We attach an LA to each component of the candidate solution.

The action probability vector of the LA attached to component Ci at time t is Pi(t) = [p(i, 0)(t), p(i, 1)(t)], which denotes the probabilities of yielding 0 or 1 for the ith component.

The normalized feedback function (or reward strength) is given by f(C(t)), where C(t) is the candidate solution tested at instant t. The function f(.) measures the fitness of the solution, taking values in [0, 1], where 0 is the lowest possible reward and 1 is the highest reward. In other words, the fitness function is normalized.
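For the Max-SAT instantiation considered later in the paper, a natural choice of f is the fraction of satisfied clauses; the sketch below (our own data layout, not the paper's implementation) computes exactly that.

```python
def max_sat_fitness(clauses, bits):
    """Normalized fitness f(C) in [0, 1]: fraction of satisfied clauses.
    clauses: list of clauses, each a list of signed 1-based literals;
    bits: candidate solution C as a list of 0/1 values."""
    satisfied = sum(
        any((lit > 0) == bool(bits[abs(lit) - 1]) for lit in clause)
        for clause in clauses
    )
    return satisfied / len(clauses)
```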

Let C∗(t) be the solution with the highest fitness found so far, i.e., the solution with the highest fitness obtained up to time instant t.

The idea of pursuit here is to reward the LA whose actions correspond to the components of C∗(t).

We consider the LA update equations at component Ci. For all components Ci, and for, j ∈ {0, 1}, the update is given by:

$$ {p}_{\left(i,j\right)}\left(t+1\right)=\left(1-{\lambda}_t\right){\delta}_j+{\lambda}_t{p}_{\left(i,j\right)}(t) $$
(6)

where δj is defined by

$$ {\delta}_j=\left\{\begin{array}{ll}1& \mathrm{if}\ {C}_i^{\ast }(t)=j\\ {}0& \mathrm{else}\ \end{array}\right. $$
(7)

λt is the update parameter and depends on time. In Theorem 1, we will consider the conditions under which the algorithm converges when the update parameter depends on time. Further, we will give convergence results for the case of a fixed λ, i.e., one that is independent of time t.

Please note that, initially:

\( {p}_{\left(i,j\right)}(0)=\frac{1}{2} \), for j ∈ {0, 1}.

The informed reader would observe that the above update scheme corresponds to the linear Reward-Inaction LA update [1].

In fact, if \( j\ne {C}_i^{\ast }(t) \), then p(i, j)(t + 1) is reduced by multiplication by λt, which is less than 1, as per the following equation:

$$ {p}_{\left(i,j\right)}\left(t+1\right)={\lambda}_t{p}_{\left(i,j\right)}(t) $$
(8)

However, if \( j={C}_i^{\ast }(t) \), then p(i, j)(t + 1) is increased. This can be proven as follows:

$$ {p}_{\left(i,j\right)}\left(t+1\right)-{p}_{\left(i,j\right)}(t)=\left[\left(1-{\lambda}_t\right)+{\lambda}_t{p}_{\left(i,j\right)}(t)\right]-{p}_{\left(i,j\right)}(t) $$
(9)
$$ =\left(1-{\lambda}_t\right)+{p}_{\left(i,j\right)}(t)\left({\lambda}_t-1\right) $$
(10)
$$ =\left(1-{\lambda}_t\right)\left(1-{p}_{\left(i,j\right)}(t)\right)\ge 0 $$
(11)

The update scheme is called a pursuit LA and its rules obey the LRI paradigm. The idea is to always reward the transition probabilities along the best solution obtained so far.
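In code, one update step of Eqs. (6)–(7) amounts to the following sketch (our own notation; the decaying schedule shown is one of the example smoothing sequences given further below, and is not prescribed by the scheme itself).

```python
def pursuit_update(P, best_solution, lam):
    """One application of Eqs. (6)-(7): each LA pulls the probability of the
    action matching the corresponding bit of the best-so-far solution C*(t)
    towards 1, with smoothing parameter lam = lambda_t."""
    for i, c_star in enumerate(best_solution):
        for j in (0, 1):
            delta = 1.0 if c_star == j else 0.0
            P[i][j] = (1.0 - lam) * delta + lam * P[i][j]
    return P

def lambda_schedule(t, beta=1.5):
    """A time-decaying schedule of the form 1 - 1/(t+1)^beta, beta > 1
    (one of the smoothing sequences discussed later in this section)."""
    return 1.0 - 1.0 / (t + 1) ** beta
```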

With the updating formula (Equation 6), we can show that, if the optimal solution C∗ is unique, the probability distribution converges to the distribution that satisfies the following property:

$$ {p}_{ij}=\left\{\begin{array}{ll}1& \mathrm{if}\ {C}_i^{\ast }=j\\ {}0& \mathrm{else}\ \end{array}\right. $$
(12)

Intuition behind pursuit-LA

Thathachar and Sastry [75] pioneered the idea of pursuit LA, in which the action with the highest reward estimate is “pursued”. The latter work has fueled a great deal of interest in pursuit LA involving different variants [3, 38, 39]. A common feature of all these pursuit algorithms is to estimate the reward probability of each action and pursue the action with the highest reward. In our current work, the environment is rather deterministic. Therefore, we opt to pursue the joint action of the team of LA corresponding to the best solution found so far.

We will now state some theoretical results that catalogue the properties of the Pursuit-LA for both the time-varying update parameter and the fixed update parameter.

Theorem 1: The optimal solution is generated with probability 1 if the update parameter λt obeys the following condition:

$$ \sum \limits_{t=1}^{\infty}\prod \limits_{k=1}^{t-1}{\lambda}_k=\infty $$
(13)

Proof.

The proof follows similar arguments as in [76]. Using recurrence, we can obtain a lower bound on p(i, j)(t):

$$ {p}_{\left(i,j\right)}(t)\ge \prod \limits_{k=1}^{t-1}{\lambda}_k{p}_{\left(i,j\right)}(0) $$
(14)

Let pmin(0) > 0 be a lower bound on p(i, j)(0).

Let At = {C(t) ≠ C∗} be the event that, at iteration t, the candidate solution does not coincide with the optimal solution C∗.

Let BT be the event that the optimal solution is not found up to instant T.

$$ P\left({B}_T\right)=\prod \limits_{t=1}^TP\left({A}_t\right) $$
(15)
$$ P\left({B}_T\right)\le \prod \limits_{t=1}^T\left(1-\prod \limits_{k=1}^{t-1}{\left({\lambda}_k{p}_{min}(0)\right)}^m\right) $$
(16)

By resorting to (1 − u) ≤  exp (−u) we obtain

$$ P\left({B}_{\infty}\right)\le \prod \limits_{t=1}^{\infty}\left(1-\prod \limits_{k=1}^{t-1}{\left({\lambda}_k{p}_{min}(0)\right)}^m\right) $$
(17)
$$ \le \prod \limits_{t=1}^{\infty}\mathit{\exp}\left(-\prod \limits_{k=1}^{t-1}{\left({\lambda}_k{p}_{min}(0)\right)}^m\right) $$
(18)
$$ =\mathit{\exp}\left(-\sum \limits_{t=1}^{\infty}\prod \limits_{k=1}^{t-1}{\lambda}_k^m{p}_{min}{(0)}^m\right) $$
(19)

However, from our assumption

$$ \sum \limits_{t=1}^{\infty}\prod \limits_{k=1}^{t-1}{\lambda}_k=\infty $$

it follows that

$$ P\left({B}_{\infty}\right)\le 0 $$
(20)

and, since a probability is non-negative,

$$ P\left({B}_{\infty}\right)=P\left({C}^{\ast}\mathrm{never}\ \mathrm{obtained}\right)=0 $$
(21)

Examples of smoothing sequences which eventually generate the optimal solution with probability 1 (that is, which satisfy the sufficient condition of Theorem 1) include:

λt = 1 − 1/(t + 1)^β for β > 1,

and

\( {\lambda}_t=1-\frac{1}{\left(t+1\right)\mathit{\log}{\left(t+1\right)}^{\beta }} \) for β > 1.

Let t∗ be the first time instant when the optimal solution is found; from this point on, the optimal components are always reinforced. For t∗ + r, and for (i, j) such that \( j\ne {C}_i^{\ast } \), we have, using recurrence:

$$ {p}_{\left({i}^{\ast },j\right)}\left({t}^{\ast }+r\right)=\prod \limits_{k={t}^{\ast}}^{t^{\ast }+r-1}{\lambda}_k\;{p}_{\left({i}^{\ast },j\right)}\left({t}^{\ast}\right) $$
(22)

It is easy to see from the assumption on the λk (by considering the logarithm of the product) that \( {\prod}_{k=0}^{\infty }{\lambda}_k=0 \). Hence:

$$ \underset{r\to \infty }{\lim }{p}_{\left({i}^{\ast },j\right)}\left({t}^{\ast }+r\right)=0 $$
(23)

Since the probabilities of the actions at component i∗ sum to 1, it follows that, for the optimal action j∗:

$$ \underset{r\to \infty }{\lim }{p}_{\left({i}^{\ast },{j}^{\ast}\right)}\left({t}^{\ast }+r\right)=1 $$
(24)

4.2 Constant update parameter

In Theorem 2, we give the convergence result of the Pursuit-LA for the case of a fixed parameter λ that is independent of time.

Theorem 2: The optimal solution is generated with a probability that approaches 1 as the update parameter λ → 1.

Proof.

Using recurrence, we know that:

$$ {p}_{\left(i,j\right)}(t)>{\lambda}^{t-1}{p}_{\left(i,j\right)}(0) $$
(25)

Thus,

$$ P\left({A}_t\right)\le 1-{\left({\lambda}^{t-1}{p}_{min}(0)\right)}^m $$
(26)

Therefore,

$$ P\left({B}_T\right)\le \prod \limits_{t=1}^T\left(1-{\left({\lambda}^{t-1}{p}_{min}(0)\right)}^m\right) $$
(27)

Thus, by resorting again to (1 − u) ≤  exp (−u) we obtain

$$ P\left({C}^{\ast}\mathrm{never}\ \mathrm{obtained}\right)= Prob\left({B}_{\infty}\right)\le \prod \limits_{t=1}^{\infty}\mathit{\exp}\left(-{\lambda}^{\left(t-1\right)m}{p}_{min}{(0)}^m\right) $$
(28)
$$ \le \mathit{\exp}\left(-{p}_{min}{(0)}^m\sum \limits_{t=1}^{\infty }{\lambda}^{\left(t-1\right)m}\right) $$
(29)
$$ =\mathit{\exp}\left(-{p}_{min}{(0)}^m\sum \limits_{t=0}^{\infty }{\lambda}^{tm}\right) $$
(30)

Let us define \( h\left(\lambda \right)={\sum}_{t=0}^{\infty }{\lambda}^{tm} \).

$$ h\left(\lambda \right)=\sum \limits_{t=0}^{\infty }{\lambda}^{tm} $$
(31)
$$ =1/\left(1-{\lambda}^m\right) $$
(32)
$$ \underset{T\to \infty }{\lim }P\left({B}_T\right)=P\left({C}^{\ast }\ \mathrm{never}\ \mathrm{obtained}\right) $$
(33)
$$ \le \mathit{\exp}\left(-{p}_{min}{(0)}^mh\left(\lambda \right)\right) $$
(34)

Since we know that \( \underset{\lambda \to 1}{\lim }h\left(\lambda \right)=\infty \), then \( \underset{T\to \infty }{\lim }P\left({B}_T\right) \) can be made arbitrarily close to zero, if λ approaches 1.

Hence the theorem is proven. Now, let us characterize the LA probabilities at convergence.

Let t∗ be the first time instant when the optimal solution is found; from then on, the components of the best-so-far solution are always reinforced. For t∗ + r, and for (i∗, j∗) corresponding to C∗, we have:

Using recurrence, we can verify that:

$$ {p}_{\left({i}^{\ast },{j}^{\ast}\right)}\left({t}^{\ast }+r\right)={\lambda}^r{p}_{\left({i}^{\ast },{j}^{\ast}\right)}\left({t}^{\ast}\right)+\left(1-\lambda \right)\sum \limits_{i=0}^{r-1}{\lambda}^{r-i-1} $$
(35)

We remark

$$ \underset{r\to \infty }{\lim}\sum \limits_{i=0}^{r-1}{\lambda}^{r-i-1}=1/\left(1-\lambda \right) $$
(36)

Therefore,

$$ \underset{r\to \infty }{\lim }{p}_{\left({i}^{\ast },{j}^{\ast}\right)}\left({t}^{\ast }+r\right)=\left(1-\lambda \right)\times 1/\left(1-\lambda \right)=1 $$
(37)
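As a quick sanity check of Eqs. (35)–(37), note that the recurrence has the closed form p(t∗ + r) = 1 − λ^r (1 − p(t∗)), which can also be verified numerically with a throwaway sketch using made-up values:

```python
# Numerical check of Eqs. (35)-(37): with a constant lambda < 1, the pursued
# component's probability approaches 1 geometrically, since
# p(t* + r) = 1 - lambda**r * (1 - p(t*)).
lam, p_star = 0.95, 0.5          # hypothetical values
p = p_star
for r in range(200):
    p = (1.0 - lam) + lam * p    # Eq. (6) with delta_j = 1 and fixed lambda
print(p)                          # ~0.99998, i.e. 1 - 0.95**200 * 0.5
```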

4.3 Pursuit-LA with artificial barriers

In this section, we extend our Pursuit-LA with the concept of artificial barriers, introduced recently by Yazidi and Hammer [4], to avoid the lock-in probability effect. The Pursuit-LA presented in the previous section is an absorbing scheme where the team of LA will converge, after a large number of iterations, to an absorbing state composed of a vector with components either 0 or 1. This creates a challenge when it comes to tuning the learning parameter, as choosing values of the learning parameter close to 1 renders the scheme extremely slow, while choosing low values might lead to premature convergence. To allow the scheme to avoid getting locked in an absorbing state where premature convergence can take place, we introduce an upper and a lower bound for the probability of each LA in the team. Therefore, instead of allowing the LA probabilities to admit values within the interval [0, 1], we force the probabilities to lie in [pmin, pmax], where pmax is a user-defined upper bound for all the p(i, j)(t) and pmin = 1 − pmax is the counterpart lower bound. pmax needs to be chosen in the neighborhood of 1 in order to bias the exploration towards the neighborhood of the best solution found so far. We shall now give the update equations for Pursuit-LA with artificial barriers. For all components Ci, and for j ∈ {0, 1}, the update is given by:

$$ {p}_{\left(i,j\right)}\left(t+1\right)=\left(1-\lambda \right)\left({\delta}_j{p}_{max}+\left(1-{\delta}_j\right){p}_{min}\right)+\lambda {p}_{\left(i,j\right)}(t) $$
(38)

where δj is defined by

$$ {\delta}_j=\left\{\begin{array}{ll}1& \mathrm{if}\ {C}_i^{\ast }(t)=j\\ {}0& \mathrm{else}\ \end{array}\right. $$
(39)

Please note that, if we initially impose that pmin ≤ p(i, j)(0) ≤ pmax, then it is easy to prove by recurrence that the update form guarantees that, at any subsequent time t > 0, pmin ≤ p(i, j)(t) ≤ pmax.

Let us suppose that pmin ≤ p(i, j)(t) ≤ pmax and prove that pmin ≤ p(i, j)(t + 1) ≤ pmax. In fact, p(i, j)(t + 1) can be written in the form p(i, j)(t + 1) = (1 − λ)p + λp(i, j)(t), where p = pmax or p = pmin depending on whether δj = 1 or δj = 0. Therefore, p(i, j)(t + 1) is a convex combination of two quantities that both lie in the interval [pmin, pmax]. Hence, the result is proven by recurrence.

It is easy to see that the update equation (Eq. (38)) can be written differently. In fact, if \( j\ne {C}_i^{\ast }(t) \), then p(i, j)(t + 1) reduces to:

$$ {p}_{\left(i,j\right)}\left(t+1\right)=\left(1-\lambda \right){p}_{min}+\lambda {p}_{\left(i,j\right)}(t) $$
(40)

However, if \( j={C}_i^{\ast }(t) \), then p(i, j)(t + 1) reduces to:

$$ {p}_{\left(i,j\right)}\left(t+1\right)=\left(1-\lambda \right){p}_{max}+\lambda {p}_{\left(i,j\right)}(t) $$
(41)

We can show that if \( j={C}_i^{\ast }(t) \) then p(i, j)(t + 1) increases, while it decreases in the opposite case (i.e., \( j\ne {C}_i^{\ast }(t) \)). In fact,

$$ {p}_{\left(i,j\right)}\left(t+1\right)-{p}_{\left(i,j\right)}(t)=\left(1-\lambda \right)\left({\delta}_j{p}_{max}+\left(1-{\delta}_j\right){p}_{min}\right)+\lambda {p}_{\left(i,j\right)}(t)-{p}_{\left(i,j\right)}(t) $$
(42)
$$ =\left(1-\lambda \right)\left({\delta}_j{p}_{max}+\left(1-{\delta}_j\right){p}_{min}-{p}_{\left(i,j\right)}(t)\right) $$
(43)

Then, if \( j={C}_i^{\ast }(t) \):

$$ {p}_{\left(i,j\right)}\left(t+1\right)-{p}_{\left(i,j\right)}(t)=\left(1-\lambda \right)\left({p}_{max}-{p}_{\left(i,j\right)}(t)\right)\ge 0 $$
(44)

Whereas, if \( j\ne {C}_i^{\ast }(t) \):

$$ {p}_{\left(i,j\right)}\left(t+1\right)-{p}_{\left(i,j\right)}(t)=\left(1-\lambda \right)\left({p}_{min}-{p}_{\left(i,j\right)}(t)\right)\le 0 $$
(45)

It is easy to observe that whenever pmax = 1, and consequently pmin = 0, the Pursuit-LA with artificial barriers reduces to the Pursuit-LA with fixed learning parameter introduced in the previous section.
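In code, the barrier-bounded update of Eqs. (38)–(39) is a small variation of the earlier pursuit-update sketch (again purely illustrative; the default pmax = 0.9 matches the value used in the experiments below).

```python
def pursuit_update_with_barriers(P, best_solution, lam, p_max=0.9):
    """Eqs. (38)-(39): same pursuit update as before, but each probability is
    pulled towards p_max / p_min instead of 1 / 0, so the scheme is never
    absorbed and keeps exploring around the best solution found so far."""
    p_min = 1.0 - p_max
    for i, c_star in enumerate(best_solution):
        for j in (0, 1):
            target = p_max if c_star == j else p_min
            P[i][j] = (1.0 - lam) * target + lam * P[i][j]
    return P
```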

Before closing this section, it is not out of place to observe that our algorithm enjoys a low computational complexity. In fact, the proposed Pursuit-LA requires only of the order of m operations per time step. This low computational complexity is an inherent property of Reinforcement Learning algorithms in general and of LA in particular.

5 Application of pursuit-LA to max-SAT problem

In this section, we will test the performance of our algorithm for solving the Max-SAT problem, which is a well-known class of deterministic optimization problems. We will examine two main aspects of the algorithm. The first aspect is investigated in Section 5.1 and concerns the sensitivity of the algorithm to changes in the learning parameter. The second aspect we tackle is how well the current algorithm compares to other well-established state-of-the-art Max-SAT solvers. The latter aspect is treated in Section 5.2.

5.1 Effect of varying the learning parameter

We will use a benchmark from https://www.cs.ubc.ca/hoos/SATLIB/benchm.html related to the SAT-encoded Flat Graph Colouring Problems. As a criterion of convergence, we deem that the algorithm has converged whenever each of the LA has converged, meaning that each LA has an action whose probability exceeds 1 − ϵ, where ϵ denotes a small scalar. In all the experiments, we choose ϵ = 0.01. In more formal terms, we deem that the scheme has converged if, for all i, p(i, 0)(t) > 1 − ϵ or p(i, 1)(t) > 1 − ϵ. As an objective function, we resort to the percentage of satisfied clauses, which also characterizes the performance of the algorithm. In the Max-SAT problem, we would like to maximize the number of satisfied clauses.
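In code, this convergence test amounts to the following one-liner (a sketch using the same hypothetical representation of P as in the earlier sketches):

```python
def has_converged(P, eps=0.01):
    """The scheme is deemed converged when every LA has an action whose
    probability exceeds 1 - eps, i.e. p_(i,0)(t) > 1 - eps or p_(i,1)(t) > 1 - eps."""
    return all(max(p) > 1.0 - eps for p in P)
```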

Table 1 gives the performance of the Pursuit-LA and its convergence time for three values of the learning parameter: λ = 0.9, λ = 0.99 and λ = 0.999. PC denotes the percentage of satisfied clauses, while CT denotes the convergence time. In Table 1, we consider different files representing the SAT-encoded Flat Graph Colouring Problems. The first category of files, with prefix flat30, denotes a problem with 100 instances, 30 vertices, 60 edges, 3 colours, 90 variables and 300 clauses. The second category of files, with prefix flat200, denotes a problem with 100 instances, 200 vertices, 479 edges, 3 colours, 600 variables and 2237 clauses.

Table 1 Performance of the Pursuit-LA and convergence time for different values of the learning parameter

From Table 1, the general remark is that the Pursuit-LA algorithm is generally fast and yields acceptable performance even for low values of λ. The performance seems to increase as λ increases, however at the cost of a longer convergence time. For example, let us consider the file flat30–5. We observe that Pursuit-LA converges quite fast for a small learning parameter λ = 0.99, namely within 107 iterations. When we increase the parameter to λ = 0.999, the convergence time increases considerably to 11,223 iterations; however, the performance also increases substantially, from 0.9233 to 0.99. In Fig. 2, we also give the evolution of the performance of the algorithm using the same file flat30–5 for λ = 0.999. Initially, i.e., at time zero, the performance was 0.7533. As time proceeds, we observe from Fig. 2 that the performance steadily increases until converging to 0.99 after around 11,223 iterations. Please note that the first time we reach this performance is after about 8000 iterations. Nevertheless, it takes more iterations for the Pursuit-LA to converge, as the probabilities of the actions that yield this performance (0.99) keep on increasing as long as no better solution is found. This increase in probability is slow for a learning parameter as large as λ = 0.999. The remaining time to converge, after finding the best solution so far, depends on the time it takes for the smallest of these action probabilities to increase above 1 − ϵ.

Fig. 2 Example of performance over time of the Pursuit-LA for flat30–5, with λ = 0.999

5.2 Comparison against reference solvers

The benchmark instances which are used to evaluate the performance of the Pursuit-LA algorithm belong to the Random Unweighted MAX2SAT/MAX3SAT category.

The performance of the algorithm is compared to various popular solvers in the literature:

  • AdaptNovelty+: stochastic local search algorithm with an adaptive noise mechanism [77].

  • IRoTS: Iterated Robust Tabu Search algorithm [73].

  • RoTS: Robust Tabu Search algorithm [78].

  • AdaptG2WSATp: adaptive gradient-based greedy WalkSAT algorithm with promising decreasing variable heuristic [79]

  • Adaptive memory-based local search heuristic (denoted by AMLS1, AMLS2) [56].

For the reference algorithms (IRoTS, RoTS, AdaptNovelty+), we carry out the experiments using UBCSAT (version 1.1), an implementation and experimentation environment for stochastic local search algorithms for SAT and MAX-SAT. As shown by Tompkins and Hoos [80], the implementation of these reference algorithms in UBCSAT is more efficient than (or just as efficient as) the original implementations. In all tables, the first and second columns identify the problem instance and the best known objective value f∗ (number of unsatisfied clauses). The remaining columns give the results of the algorithms using three performance quality criteria which have been widely used for the performance evaluation of stochastic local search [56]:

  • The average objective function value fav over 20 independent runs.

  • The success rate sr, defined as the number of times the algorithm reaches the best known objective value over 20 runs.

  • The average number of search steps for reaching the average value.

To avoid premature convergence, we use pmax = 0.9. Furthermore, we fix λ = 0.95. We run the algorithms in epochs, each consisting of 1000 iterations. We only update the probabilities at the end of each epoch. In simple terms, we test the current probability vector during the whole epoch before updating it based on the best solution found so far over the entire epoch.
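The epoch-based procedure can be summarized by the following sketch, which is our own reconstruction of the loop described above and reuses the helper functions introduced in the earlier sketches; the stopping criterion is simplified here (the simulations actually stop on stagnation over two consecutive epochs, as discussed in Section 5.3).

```python
def run_epochs(clauses, P, lam=0.95, p_max=0.9, epoch_len=1000, n_epochs=1000):
    """Epoch-based Pursuit-LA for Max-SAT (illustrative sketch)."""
    best, best_fit = None, -1.0
    for _ in range(n_epochs):
        # test the current probability vector during a whole epoch
        for _ in range(epoch_len):
            cand = sample_candidate(P)
            fit = max_sat_fitness(clauses, cand)
            if fit > best_fit:
                best, best_fit = cand, fit
        # update the probabilities only at the end of the epoch,
        # towards the best solution found so far
        P = pursuit_update_with_barriers(P, best, lam, p_max)
        if best_fit == 1.0:           # all clauses satisfied
            break
    return best, best_fit
```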

Tables 2, 3 and 4 give the comparison results.

Table 2 Performance Comparisons: Unweighted MAX2SAT/MAX3SAT
Table 3 Performance Comparisons: Unweighted MAX2SAT/MAX3SAT
Table 4 Performance Comparison: Unweighted MAX2SAT/MAX3SAT

To summarize the results of this section, we have incorporated comparisons against some of the most established state-of-the-art schemes, including AMLS1, AMLS2, AdaptiveNovelty+, AdaptG2WSATp, RoTS and IRoTS, using different benchmarks. The results are quite conclusive and quite surprising too. Although our scheme is straightforward, and we aimed to show that it is a proof of concept of the possibility of using LA for solving deterministic problems, we have found that it performs well. It is even superior to AdaptNovelty+, which is a sophisticated state-of-the-art Max-SAT algorithm. For example, in Table 4, our Pursuit-LA consistently outperforms AdaptiveNovelty+ in terms of fav and convergence steps. Observe, for example, the results for the file s2v120c1700–2. Our Pursuit-LA achieves 250 unsatisfied clauses in 306 epochs, while AdaptiveNovelty+ achieves 261 in 25,229 iterations. The optimal number for this case is 248, and therefore our Pursuit-LA is just two unsatisfied clauses above the optimum, compared to AdaptiveNovelty+, which is thirteen unsatisfied clauses above.

5.3 Discussion

We believe that the Pursuit-LA algorithm owes its performance to two main design principles:

  • By adding artificial barriers, pmin and pmax, on the LA probabilities, we are able to achieve better results by avoiding premature convergence of the algorithm. In fact, in our initial experiments without artificial barriers, we observed low performance due to the so-called stagnation effect. Stagnation means that the best-so-far solution does not change over a relatively long number of iterations. In the face of stagnation, and in the absence of artificial barriers, the LA team gets trapped in a probability vector which corresponds to an exclusive choice of the best-so-far solution. This happens within a finite number of iterations that depends on the learning parameter.

  • Furthermore, the second appealing idea that makes the algorithm perform well is pursuing, in the probability space, the best solution found so far. In this perspective, pursuing means, at each iteration, gradually biasing the probabilistic search towards the best solution found so far. In fact, the role of the learning parameter is to adjust the quantity by which the probability changes in the direction of the best solution found so far.

Although the performance of Pursuit-LA is promising, it does not outperform the other algorithms in all scenarios; namely, RoTS and IRoTS are superior to Pursuit-LA in terms of the quality of the solution, as seen in Table 4. In fact, although the Pursuit-LA gives satisfactory solutions, it converges prematurely according to the results of Table 4. For instance, for s2v200c1200–2, f = 127 is achieved by RoTS and IRoTS, while the Pursuit-LA achieves 137 unsatisfied clauses, but within less time, namely 647 epochs, compared to 775 and 540 steps for RoTS and IRoTS respectively. Therefore, we believe that adaptively adjusting the learning parameter as well as the barriers over time, borrowing ideas from IRoTS, would boost the performance. Under artificial barriers, our stopping criterion used in the simulations in Section 5.2 is the stagnation of the search for two consecutive epochs. One can deal with stagnation using different ideas from the literature. For instance, IRoTS forces any variable whose value has not been changed over the last 10 search steps to be flipped. Such an enhancement to Pursuit-LA can be the object of future research.

The idea behind Pursuit-LA is to bias the search probability vector towards the best solution found so far. However, adequately tuning the learning parameter is a challenge and can be the object of further research. As shown by the experiments in Table 2, a small value of the learning parameter speeds up convergence at the cost of a lower-quality solution, whereas a large value of the learning parameter induces a slow convergence speed, which usually results in a good-quality solution. We have also improved the algorithm by imposing artificial barriers. By virtue of the artificial barriers, we avoid the lock-in probability effect, a phenomenon known in the field of LA in which each individual ith LA has its probability vector Pi(t) = [p(i, 0)(t), p(i, 1)(t)] converging to [0, 1] or [1, 0], which stops the search. Artificial barriers can thus help the algorithm escape local optima. However, our algorithm can be adjusted to allow more diversification of the solution by using procedures similar to those of genetic algorithms, for example mutation and crossover operations.

The Pursuit-LA algorithm is a stochastic algorithm. It has been compared in Table 2 with three stochastic algorithms, RoTS, IRoTS and AdaptNovelty+, using the software UBCSAT [80]. RoTS is an algorithm based on tabu search that repeatedly chooses the value of the tabu tuning parameter at random from a given interval during the search. The variant IRoTS, whose subsidiary local search phase and perturbation phase are both based on RoTS, uses a randomized acceptance criterion that is biased towards better-quality candidate solutions. The noise parameter p, which controls the degree of randomness of the search process, has a major impact on the performance and run-time behaviour of the original algorithm Novelty+. Unfortunately, the optimal value of p varies significantly between problem instances, and even small deviations from the optimal value can lead to substantially decreased performance. AdaptNovelty+ dynamically adjusts the noise setting p based on search progress. Our algorithm gave similar results in 4 cases when compared to RoTS, IRoTS and AdaptNovelty+, and was beaten in the remaining cases by at most 6%. However, our algorithm converges faster. We believe that the quality of the solutions given by the algorithm could be improved by finding a suitable balance between diversification and intensification. By adopting a technique similar to AdaptNovelty+, we can increase the noise by lowering pmax, thus allowing more non-greedy moves, i.e., moves not in the neighborhood of the best solution so far. The strength of the algorithm is the fact that it could be parallelized and used in a multilevel context, so that diversification and intensification could be exploited at different levels of the multilevel strategy. When the Pursuit-LA algorithm is compared to these three algorithms, one needs to compare the strategies for balancing diversification and intensification between the different algorithms and determine which of these strategies is more efficient. The comparison may lead to a hybrid approach such as the one described in [81].

Possible applications of pursuit-LA

Within the class of deterministic optimization problems, there is a large family of combinatorial problems, many of which are NP-hard. Those problems are usually solved using algorithms such as genetic algorithms, tabu search, simulated annealing and Ant Colony Optimization (ACO), to mention a few. Examples of combinatorial problems that can be solved by our proposed solution include traveling salesman problems, knapsack problems, job scheduling, graph coloring, quadratic assignment [82], etc. In the current paper, we have dealt with an optimization problem where the candidate solutions can be represented in a binary format. Nevertheless, it is straightforward to extend our solution to code non-binary solutions by using the concept of multi-action LA. In this sense, a candidate solution is coded as a vector of discrete variables, and therefore a multi-action LA can be attached to each component of the vector representing the candidate solution. A promising research direction is also to solve deterministic continuous optimization problems using the concept of pursuit. In fact, the components of the m-dimensional vectors can be drawn randomly by using a team of m individual CALA [83].

6 Conclusion

In this paper, we have presented a novel LA scheme that can solve deterministic optimization problems based on the idea of pursuit. The search for an optimal solution is conducted using a team of cooperating LA. The scheme can be seen as a game-theoretical solution to a deterministic optimization problem. Apart from being a contribution to the field of LA in itself, the scheme is lightweight and extremely simple to implement, with very little memory. Despite being appealingly simple, extensive experimental results demonstrate that it performs well compared to sophisticated state-of-the-art approaches. As future work, we aim to investigate further improvements to the performance of the pursuit scheme in terms of convergence speed and exploration of the search space, using the multilevel techniques introduced by Bouhmala [57, 58].