A team of pursuit learning automata for solving deterministic optimization problems

Learning Automata (LA) is a popular decision-making mechanism to “determine the optimal action out of a set of allowable actions” [1]. The distinguishing characteristic of automata-based learning is that the search for an optimal parameter (or decision) is conducted in the space of probability distributions defined over the parameter space, rather than in the parameter space itself [2]. In this paper, we propose a novel LA paradigm that can solve a large class of deterministic optimization problems. Although many LA algorithms have been devised in the literature, those LA schemes are not able to solve deterministic optimization problems as they suppose that the environment is stochastic. In this paper, our proposed scheme can be seen as the counterpart of the family of pursuit LA developed for stochastic environments [3]. While classical pursuit LAs can pursue the action with the highest reward estimate, our pursuit LA rather pursues the collection of actions that yield the highest performance by invoking a team of LA. The theoretical analysis of the pursuit scheme does not follow classical LA proofs, and can pave the way towards more schemes where LA can be applied to solve deterministic optimization problems. Furthermore, we analyze the scheme under both a constant learning parameter and a time-decaying learning parameter. We provide some experimental results that show how our Pursuit-LA scheme can be used to solve the Maximum Satisfiability (Max-SAT) problem. To avoid premature convergence and better explore the search space, we enhance our scheme with the concept of artificial barriers recently introduced in [4]. Interestingly, although our scheme is simple by design, we observe that it performs well compared to sophisticated state-of-the-art approaches.


Introduction
Learning Automata (LA) have been used in systems that have incomplete knowledge about the Environment in which they operate [1, 5-11]. The learning mechanism attempts to learn from a stochastic Teacher which models the environment. In his pioneering work, Tsetlin [12] attempted to use LA to model biological learning. In general, a random action is selected based on a probability vector, and these action probabilities are updated based on the observation of the Environment's response, after which the procedure is repeated.
The term "Learning Automata" was first publicized and rendered famous in the survey paper by Narendra and Thathachar. The goal of LA is to "determine the optimal action out of a set of allowable actions" [1]. The distinguishing characteristic of automata-based learning is that the search for the optimizing parameter vector is conducted in the space of probability distributions defined over the parameter space, rather than in the parameter space itself [2].
Concerning applications, the entire field of LA and stochastic learning has had a myriad of applications [5-7, 9, 10], which (apart from the many applications listed in these books) include solutions for problems in networks and communications [13-16], network call admission, traffic control, and quality of service routing [17-19], distributed scheduling [20], training hidden Markov models [21], neural network adaptation [22], intelligent vehicle control [23], and even fairly theoretical problems such as graph partitioning [24]. In addition to these fairly generic applications, with a little insight, LA can be used to assist in solving (by, indeed, learning the associated parameters) the stochastic resonance problem [25], the stochastic sampling problem in computer graphics [26], the problem of determining roads in aerial images by using geometric-stochastic models [27], and various location problems [28]. Similar learning solutions can also be used to analyze the stochastic properties of the random waypoint mobility model in wireless communication networks [29], achieve spatial point pattern analysis codes for GISs [30], digitally simulate wind field velocities [31], interrogate the experimental measurements of global dynamics in magnetomechanical oscillators [32], and analyze spatial point patterns [33]. LA-based schemes have already been utilized to learn the best parameters for neural networks [22], to optimize QoS routing [19], and to perform bus arbitration [14], to mention a few other applications.
Although many LA algorithms have been devised in the literature, those LA schemes are not able to solve deterministic optimization problems as they suppose that the environment is stochastic. In other words, classical LA schemes resort to the assumption that the response of the environment to the same action or set of actions is stochastic. However, in deterministic optimization problems, this is not the case, as the output, which is the response of the environment, is a deterministic function of the input. There have been many studies that resort to a team of LA for solving optimization problems where the objective function is noisy. Examples of those works include noise-tolerant learning of half-spaces [34] and the nonlinear fractional knapsack problem [35]. The latter stream of works shows that pursuit LA is a viable solution when the objective function is noisy. However, when the objective function to optimize is deterministic, i.e., non-noisy, evidence from the literature indicates that using a team of traditional pursuit LA yields slow convergence. For instance, Tilak et al. [36] report that a team of more than 10 traditional pursuit LA deployed for solving a deterministic combinatorial problem, namely sensor coverage, yields a very slow convergence speed. In fact, Tilak et al. state: "Even at a modest number of 10 cameras, the centralized pursuit algorithm takes a long time for the automata team to converge which makes it unsuitable for an application like distributed object tracking where fast convergence is necessary". Furthermore, some of the authors of the current manuscript [37] have also noticed this slow convergence when solving a machine learning classification problem mapped into a combinatorial problem using a team of LA. Another important disadvantage of traditional teams of pursuit LA concerns the size of the memory required for storing the reward estimate vector.
In the case of a classical team of pursuit LA, one needs a shared memory for the reward estimate vector that increases dramatically with the size of the team. For instance, for a team of N LA each with two actions (binary action LA), the memory space required for storing the reward probability estimates is $2^N$, which becomes prohibitive as N increases [36]. This slow convergence of classical teams of pursuit LA for solving deterministic optimization problems calls for a new LA paradigm, which is the objective of this article.
In this paper, we develop a novel pursuit LA, which can be seen as the counterpart of the family of pursuit LA designed for stochastic environments [3]. While classical pursuit LAs are able to pursue the action with the highest reward estimate, our pursuit LA rather pursues the collection of actions that yield the highest performance. The theoretical analysis of the pursuit scheme does not follow classical LA proofs and can pave the way towards more schemes where LA can be applied to solve deterministic optimization problems.
We catalogue the contributions of this article as follows:
• We devise a simple and lightweight optimization framework based on the theory of LA. In contrast to any LA scheme presented in the literature, our solution is especially designed for deterministic environments.
• Our current solution extends the family of pursuit LA algorithms [3,38,39] to solve deterministic optimization problems. A common feature of all legacy pursuit algorithms is to estimate the reward probability of each action and pursue the action with the highest reward. In our current work, the environment is rather deterministic. Therefore, we opt to pursue the joint action of the team of LA corresponding to the best solution found so far.
• We provide sound theoretical results that demonstrate the convergence of our scheme under both a constant learning parameter and a time-decaying learning parameter. To the best of our knowledge, this is the first work that proposes an analysis of an LA scheme with a time-decaying learning parameter.
• As an example of an optimization problem, we show how our scheme can be applied to solve the Max-SAT problem.
The remainder of this paper is organized as follows. In Section 2, we give an introduction to the theory of Learning Automata which is the fundamental tool in this paper. In Section 3, we survey some related work within the field of LA and optimization. In Section 4, we present our solution called Pursuit-LA and provide theoretical proofs demonstrating its convergence. Furthermore, we provide an experiment where we apply Pursuit-LA to the Max-SAT problem. Section 6 concludes the article.

Learning automata
In the field of Automata Theory, an automaton [5-7, 9, 10] is defined as a quintuple composed of a set of states, a set of outputs or actions, an input, a function that maps the current state and input to the next state, and a function that maps a current state (and input) into the current output.
Definition 1: A LA is defined by a quintuple $\langle A, B, Q, F(\cdot,\cdot), G(\cdot) \rangle$, where:
1. $A = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ is the set of outputs or actions that the LA must choose from, and $\alpha(t)$ is the action chosen by the automaton at any instant t.
2. $B = \{\beta_1, \beta_2, \ldots, \beta_m\}$ is the set of inputs to the automaton. $\beta(t)$ is the input at any instant t. The set B can be finite or infinite. The most common LA input is $B = \{0, 1\}$, where $\beta = 0$ represents reward, and $\beta = 1$ represents penalty.
3. $Q = \{q_1, q_2, \ldots, q_s\}$ is the set of finite states, where $Q(t)$ denotes the state of the automaton at any instant t.
4. $F(\cdot,\cdot) : Q \times B \mapsto Q$ is a mapping in terms of the state and input at the instant t, such that $Q(t+1) = F[Q(t), \beta(t)]$. It is called a transition function, i.e., a function that determines the state of the automaton at any subsequent time instant t + 1. This mapping can either be deterministic or stochastic.
5. $G(\cdot)$ is a mapping $G : Q \mapsto A$, and is called the output function. G determines the action taken by the automaton if it is in a given state as: $\alpha(t) = G[Q(t)]$. With no loss of generality, G is deterministic.
If the sets Q, B and A are all finite, the automaton is said to be finite.
The Environment, E, typically refers to the medium in which the automaton functions. The Environment possesses all the external factors that affect the actions of the automaton. Mathematically, an Environment can be abstracted by a triple $\langle A, C, B \rangle$. A, C, and B are defined as follows:
1. $A = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ is the set of actions.
2. $B = \{\beta_1, \beta_2, \ldots, \beta_m\}$ is the output set of the Environment. Again, we consider the case when m = 2, i.e., with $\beta = 0$ representing a "Reward", and $\beta = 1$ representing a "Penalty".
3. $C = \{c_1, c_2, \ldots, c_r\}$ is a set of penalty probabilities, where element $c_i \in C$ corresponds to an input action $\alpha_i$.
The process of learning is based on a learning loop involving the two entities: the Random Environment (RE), and the LA, as described in Fig. 1. In the learning process, the LA continuously interacts with the Environment to process responses to its various actions (i.e., its choices). Finally, through sufficient interactions, the LA attempts to learn the optimal action offered by the RE. The actual process of learning is represented as a set of interactions between the RE and the LA.
The automaton is offered a set of actions, and it is constrained to choosing one of them. When an action is chosen, the Environment gives out a response $\beta(t)$ at a time t. The automaton is either penalized or rewarded with an unknown probability $c_i$ or $1 - c_i$, respectively. On the basis of the response $\beta(t)$, the state of the automaton $\phi(t)$ is updated and a new action is chosen at (t + 1). The penalty probability $c_i$ satisfies: $c_i = \Pr[\beta(t) = 1 \mid \alpha(t) = \alpha_i]$, for $i = 1, 2, \ldots, r$. We now provide a few important definitions used in the field. P(t) is referred to as the action probability vector, where $P(t) = [p_1(t), p_2(t), \ldots, p_r(t)]^T$, in which each element of the vector is $p_i(t) = \Pr[\alpha(t) = \alpha_i]$, for $i = 1, \ldots, r$, such that $\sum_{i=1}^{r} p_i(t) = 1$ for all t.
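Given an action probability vector P(t) at time t, the average penalty is: $M(t) = E[\beta(t) \mid P(t)] = \sum_{i=1}^{r} c_i \, p_i(t)$. The average penalty for the "pure-chance" automaton is given by: $M_0 = \frac{1}{r} \sum_{i=1}^{r} c_i$. As $t \to \infty$, if the average penalty $M(t) < M_0$, at least asymptotically, the automaton is generally considered to be better than the pure-chance automaton. $E[M(t)]$ is given by: $E[M(t)] = E\{E[\beta(t) \mid P(t)]\} = E[\beta(t)]$. A LA that performs better than by pure-chance is said to be expedient. A LA is considered optimal if $\lim_{t \to \infty} E[M(t)] = c_l$, where $c_l = \min_i \{c_i\}$.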
It should be noted that no optimal LA exist. Marginally sub-optimal performance, also termed above as ϵ-optimal performance, is what LA researchers attempt to attain.
Definition 5: A LA is considered ϵ-optimal if: $\lim_{t \to \infty} E[M(t)] < c_l + \epsilon$, where ϵ > 0, and can be arbitrarily small, by a suitable choice of some parameter of the LA.

Deterministic learning automata
An automaton is termed as a deterministic automaton, if both the transition function F(., .) and the output function G(.) are deterministic. Thus, in a deterministic automaton, the subsequent state and action can be uniquely specified, provided the present state and input are given.

Stochastic learning automata
If, however, either the transition function F(., .) or the output function G(.) is stochastic, the automaton is termed a stochastic automaton. In such an automaton, if the current state and input are specified, the subsequent states and actions cannot be specified uniquely. In such a case, F(., .) only provides the probabilities of reaching the various states from a given state.
In the first LA designs, both the transition and output functions were time-invariant, and for this reason, these LA were considered to be "Fixed Structure Stochastic Automata" (FSSA). Tsetlin, Krylov, and Krinsky [12] have presented notable examples of this type of automata.
Subsequently, Vorontsova and Varshavskii introduced a class of stochastic automata known in the literature as Variable Structure Stochastic Automata (VSSA). In the definition of a VSSA, the LA is wholly defined by a set of actions (one of which is the output of the automaton), a set of inputs (which is usually the responsibility of the Environment) and a learning algorithm, T. The learning algorithm [7] operates on a vector (called the Action Probability vector).
Note that the algorithm $T : [0, 1]^r \times A \times B \mapsto [0, 1]^r$ is an updating scheme, where $A = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$ is the set of output actions of the automaton, and B is the set of responses from the Environment. Thus, the updating is such that $P(t + 1) = T(P(t), \alpha(t), \beta(t))$, where P(t) is the action probability vector, α(t) is the action chosen at time t, and β(t) is the response it has obtained.
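As an illustration of such a mapping T, the following is a minimal Python sketch of the classical linear Reward-Inaction update for a single automaton in a stochastic environment. The function names `lri_update` and `run_lri`, the uniform initial vector, and the convention that the parameter `lam` multiplies the probabilities of the non-chosen actions on reward are assumptions made for this sketch; they are not taken from the paper.

```python
import random


def lri_update(p, chosen, beta, lam):
    """One application of T for a linear Reward-Inaction automaton.

    p      : current action probability vector P(t), as a list
    chosen : index of the action alpha(t) selected at time t
    beta   : environment response, 0 = reward, 1 = penalty
    lam    : learning parameter in (0, 1); values near 1 mean slow learning
    """
    if beta == 1:
        # Inaction on penalty: the vector is left unchanged.
        return list(p)
    # On reward, shrink every probability, then give the freed mass to the chosen action.
    q = [lam * pk for pk in p]
    q[chosen] = 1.0 - sum(q[k] for k in range(len(p)) if k != chosen)
    return q


def run_lri(penalty_probs, lam=0.99, steps=5000, seed=0):
    """Simulate the learning loop against penalty probabilities c_1..c_r (assumed known only to the simulator)."""
    rng = random.Random(seed)
    r = len(penalty_probs)
    p = [1.0 / r] * r  # assumption: uniform initial action probabilities
    for _ in range(steps):
        a = rng.choices(range(r), weights=p)[0]
        beta = 1 if rng.random() < penalty_probs[a] else 0
        p = lri_update(p, a, beta, lam)
    return p
```

For example, `run_lri([0.2, 0.6])` simulates a two-action environment in which the first action is penalized less often; the returned vector always sums to 1.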
If the mapping T is chosen in such a manner that the Markov process has absorbing states, the algorithm is referred to as an absorbing algorithm. Many families of VSSA that possess absorbing barriers have been reported [7]. Ergodic VSSA have also been investigated [7,40]. These VSSA converge in distribution and thus, the asymptotic distribution of the action probability vector has a value that is independent of the corresponding initial vector. While ergodic VSSA are suitable for non-stationary environments, absorbing VSSA are preferred in stationary environments.

Related work
In order to put our work in the right perspective, we will briefly discuss different optimization schemes relevant to this work mostly from the field of LA.

LA for optimization
A similar work to ours is due to Thathachar and Sastry [41] where the authors use a team of LA in order to find the optimal discriminant function in a feature space. The discriminant functions are parametrized, and a parameter is attached to each that is to be learned.
Subsequently, Santharam et al. [42] proposed using continuous LA in order to deal with the disadvantages of discretization, thus allowing an infinite number of actions. For an excellent review on the application of LA to the field of Pattern Recognition, we refer the reader to [43]. In [44], Zahiri devised an LA-based classifier that operates using hypercubes in a recursive manner. In [45], the authors proposed LA optimization methods for multimodal functions. In their experimental settings, these algorithms were shown to outperform genetic algorithms. In [46], the authors propose genetic LA for optimizing functions. Similarly, the work in [47] proposed genetic algorithms for classifiers.
Misra and Oommen pioneered the concept of LA on a graph using pursuit LA [13,48,49] for solving the stochastic shortest path problem. Li [50] used a type of S Learning Automata [51] to find the shortest path in a graph. Beigy and Meybodi [52] provided the first proof in the literature that demonstrates the convergence of distributed LA on a graph for a reward inaction LA.
Concerning applications of distributed LA on a graph in the field of computer communications, we refer the reader to the work of Torkestani and collaborators [53][54][55].

Stochastic local search algorithms (SLS)
Owing to combinatorial explosion, large and complex SAT problems are hard to solve using systematic algorithms. One way to overcome the combinatorial explosion is to give up completeness. Local search algorithms are techniques which use this strategy. Local search algorithms are based on what is perhaps the oldest optimization method: trial and error. Typically, they start with an initial assignment of values to variables that is randomly or heuristically generated. Satisfiability can be formulated as an optimization problem in which the goal is to minimize the number of unsatisfied clauses. Thus, the optimum is obtained when the value of the objective function equals zero, which means that all clauses are satisfied. Finite Learning Automata have been proposed as a mechanism for enhancing meta-heuristics based Max-SAT solvers. The work conducted in [56] proposes an adaptive memory based local search algorithm that exploits various strategies in order to guide the search to achieve a suitable trade-off between intensification and diversification. Multilevel techniques [57,58] have been applied to Max-SAT with considerable success. They progressively coarsen the problem, find an assignment, and then employ a metaheuristic to refine the assignment on each of the coarsened problems in reverse order.
During each iteration, a new solution is selected from the neighborhood of the current one by performing a move. Choosing a good neighborhood and a method for searching it is usually guided by intuition because very little theory is available as a guide. Most SLS algorithms use a 1-flip neighborhood relation, for which two truth-value assignments are neighbors if they differ in the truth value of one variable. If the new solution provides a better value in light of the objective function, the new solution becomes the current one. The search terminates if no better neighbor solution can be found.
One of the most popular local search algorithms for solving SAT is GSAT [59]. GSAT begins with a randomly generated assignment of values to variables and then uses the steepest descent heuristic to find the new variable-value assignment which best decreases the number of unsatisfied clauses. After a fixed number of moves, the search is restarted from a new random assignment. The search continues until a solution is found or a fixed number of restarts has been performed. An extension of GSAT referred to as random-walk [60] has been realized with the purpose of escaping from local optima. In a random-walk step, an unsatisfied clause is selected at random. Then, one of the variables appearing in that clause is flipped, thus effectively forcing the selected clause to become satisfied. The main idea is to decide at each search step whether to perform a standard GSAT or a random-walk strategy with a probability called the walk probability. Another widely used variant of GSAT is the WalkSAT algorithm originally introduced in [61]. It first picks an unsatisfied clause c at random and then, in a second step, one of the variables with the lowest break count appearing in the selected clause is randomly selected. The break count of a variable is defined as the number of clauses that would become unsatisfied by flipping the chosen variable. If there exists a variable with break count equal to zero, this variable is flipped; otherwise, the variable with minimal break count is selected with a certain probability (noise probability). The choice of unsatisfied clauses combined with the randomness in the selection of variables enables WalkSAT to avoid local minima and to explore the search space better. New algorithms [62-65] have emerged using history-based variable selection strategies in order to avoid flipping the same variable. Apart from GSAT and its variants, several clause weighting based SLS algorithms [66,67] have been proposed to solve SAT problems.
The key idea is to associate the clauses of the given CNF formula with weights. Although these clause weighting SLS algorithms differ in the manner in which clause weights are updated (probabilistic or deterministic), they all choose to increase the weights of all the unsatisfied clauses as soon as a local minimum is encountered. Clause weighting acts as a diversification mechanism rather than a way of escaping local minima. Finally, many other SLS algorithms have been applied to SAT. These include techniques such as Simulated Annealing [68,69], Evolutionary Algorithms [70], and Greedy Randomized Adaptive Search Procedures [71]. The nature-inspired GASAT algorithm [72] is a hybrid algorithm that combines a specific crossover and a tabu search procedure. The work in [73] proposes a hybrid approach called Iterated Robust Tabu Search (IRoTS) which combines an iterated local search and tabu search.
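To make the break-count mechanism of WalkSAT described above concrete, here is a minimal Python sketch of a single search step. It is only an illustration: the clause encoding (lists of signed integers in the DIMACS style), the function names, and the common convention that a noise move picks a uniformly random variable of the selected clause are assumptions of this sketch, not details taken from [61].

```python
import random


def is_satisfied(clause, assign):
    """A clause is a list of non-zero ints; literal v means variable v is True, -v means False."""
    return any(assign[abs(lit)] == (lit > 0) for lit in clause)


def break_count(var, clauses, assign):
    """Number of currently satisfied clauses that become unsatisfied if var is flipped."""
    flipped = dict(assign)
    flipped[var] = not flipped[var]
    return sum(
        1 for c in clauses
        if is_satisfied(c, assign) and not is_satisfied(c, flipped)
    )


def walksat_step(clauses, assign, noise, rng):
    """Perform one WalkSAT-style move and return the new assignment."""
    unsat = [c for c in clauses if not is_satisfied(c, assign)]
    if not unsat:
        return assign  # every clause is already satisfied
    clause = rng.choice(unsat)
    counts = {abs(lit): break_count(abs(lit), clauses, assign) for lit in clause}
    best = min(counts.values())
    if best > 0 and rng.random() < noise:
        # Noise move (assumed convention): any variable of the clause.
        var = rng.choice(list(counts))
    else:
        # Greedy move: a variable with minimal break count (free move if it is zero).
        var = rng.choice([v for v, b in counts.items() if b == best])
    new_assign = dict(assign)
    new_assign[var] = not new_assign[var]
    return new_assign
```

Because the flipped variable always comes from the selected unsatisfied clause, that clause is satisfied after the move, whichever branch is taken.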

Our solution: Pursuit-LA
In this section, we present our solution, referred to as Pursuit-LA, for solving deterministic optimization problems. In many combinatorial problems, a candidate solution can be represented using a binary vector [74]. Adopting Pursuit-LA implies attaching an LA to each element of the binary vector, whose decision is the action 0 or 1. The collective decision of the different LA results in a solution. The solution with the highest "fitness" will be pursued by the LA using the $L_{RI}$ scheme [7,10]. Furthermore, we will give an example of application of the Pursuit-LA to the Max-SAT problem.

Convergence results of the pursuit-LA
In this Section, we will consider two convergence cases of the Pursuit-LA, namely convergence under time decaying learning parameter and convergence under constant learning parameter.

Pursuit-LA with time-dependent parameter
At each epoch, each LA in the team chooses an action; the choices of the team are therefore synchronous. The joint action of the team of LA results in a candidate solution. The observed performance is fed back to the team of LA and used to reinforce the choice of the candidate solution yielding the highest performance. More precisely, we attach one LA to each component of the binary vector forming the candidate solution, and the action that coincides with the corresponding component of the candidate solution yielding the highest performance sees its probability increase at each time instant. In this sense, the joint action probability vector of the team of LA gets biased towards the best solution found so far, hence the concept of pursuit. The feedback is common to the team and can be thought of as shared memory if one considers that the last feedback is stored in a common memory. Each LA also has a local memory to remember its action in the best solution found so far (up to the current time instant), i.e., the one that has resulted in the highest performance for the team.
Let C(t) = {C 1 (t), …, C m (t)} be a candidate solution at time t where C i takes a binary value and m the number of bits needed to code a candidate solution. We attach an LA to each component of the candidate solution.
The automaton's state probability vector at component $C_i$ is $P_i(t) = [p_{(i,0)}(t), p_{(i,1)}(t)]$, which denotes the probability to yield 0 or 1 for the i-th component.
The normalized feedback function (or reward strength) is given by f(C(t)), where C(t) is the candidate solution tested at instant t. The function f(.) measures the fitness of the solution taking values from [0, 1] where 0 is the lowest possible reward, while 1 is the highest reward. In other words, the fitness function is normalized.
Let C * (t), be the solution with highest fitness found so far, i.e., the solution with highest fitness obtained up to time instant t.
The idea of pursuit here is to reward the LA whose actions correspond to the component of the solutions in C * (t).
We consider the LA update equations at component $C_i$. For all components $C_i$, and for $j \in \{0, 1\}$, the update is given by: $p_{(i,j)}(t+1) = \lambda_t \, p_{(i,j)}(t) + (1 - \lambda_t) \, \delta_j$, where $\delta_j$ is defined by $\delta_j = 1$ if $j \in C_i^*(t)$ and $\delta_j = 0$ otherwise, and $\lambda_t \in (0, 1)$ is the update parameter, which depends on time. In Theorem 13, we will consider the conditions by which the algorithm can converge when the update parameter depends on time. Further, we will give convergence results for the case of fixed λ, i.e., independent of time t.
Please note that, initially: The informed reader would observe that the above update scheme corresponds to the linear Reward-Inaction LA update [1].
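In fact, if $j \notin C_i^*(t)$, then $p_{(i,j)}(t+1)$ is reduced by multiplying by $\lambda_t$, which is less than 1, as per the following equation: $p_{(i,j)}(t+1) = \lambda_t \, p_{(i,j)}(t)$. However, if $j \in C_i^*(t)$, then $p_{(i,j)}(t+1)$ is increased. This can be proven as follows: $p_{(i,j)}(t+1) - p_{(i,j)}(t) = \lambda_t \, p_{(i,j)}(t) + (1 - \lambda_t) - p_{(i,j)}(t) = (1 - \lambda_t)(1 - p_{(i,j)}(t)) \ge 0$. The update scheme is called pursuit LA and has rules that obey $L_{RI}$. The idea is to always reward the transition probabilities along the best solution obtained so far.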
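The pursuit step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `pursuit_update` and `pursuit_la`, the initial probabilities of 0.5, the `lam_schedule` callable, and the choice to track only the probability of action 1 for each LA (the probability of action 0 being its complement) are all hypothetical.

```python
import random


def pursuit_update(p1, best, lam):
    """Move each LA towards its bit in the best solution found so far.

    p1[i] is the probability that LA i selects action 1; since each LA has two
    actions, updating p1[i] with delta = best[i] also updates 1 - p1[i] consistently.
    """
    return [lam * p + (1.0 - lam) * (1.0 if b == 1 else 0.0) for p, b in zip(p1, best)]


def pursuit_la(fitness, m, lam_schedule, steps, seed=0):
    """Run a team of m two-action LA on a deterministic fitness function with values in [0, 1]."""
    rng = random.Random(seed)
    p1 = [0.5] * m  # assumption: unbiased initial action probabilities
    best, best_fit = None, float("-inf")
    for t in range(steps):
        # Synchronous joint action of the team: one bit per LA.
        cand = [1 if rng.random() < p1[i] else 0 for i in range(m)]
        f = fitness(cand)
        if f > best_fit:
            best, best_fit = cand, f  # update the best solution found so far
        p1 = pursuit_update(p1, best, lam_schedule(t))
    return best, best_fit, p1
```

A call such as `pursuit_la(lambda bits: sum(bits) / len(bits), 8, lambda t: 0.99, 500)` runs the team on a toy fitness function with a constant parameter; a time-dependent schedule can be passed through `lam_schedule` instead.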
With the updating formula (Equation 6), we can show that the probability distribution converges to the distribution that satisfies the following property if the optimal solution C * i is unique.
Intuition behind pursuit-LA
Thathachar and Sastry [75] pioneered the idea of pursuit LA. The action with the highest reward estimate is "pursued". The latter work has fueled a great deal of interest in pursuit LA involving different variants [3,38,39]. A common feature of all these pursuit algorithms is to estimate the reward probability of each action and pursue the action with the highest reward. In our current work, the environment is rather deterministic. Therefore, we opt to pursue the joint action of the team of LA, corresponding to the best solution found so far. We will now state some theoretical results that catalogue the properties of the Pursuit-LA for both the time-varying update parameter and the fixed update parameter.
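The optimal solution is generated with probability 1 only if the update parameter $\lambda_t$ obeys the following condition: $\sum_{t=1}^{\infty} \prod_{k=0}^{t-1} \lambda_k^{m} = \infty$.
Proof. The proof follows similar arguments as in [76]. Since every update satisfies $p_{(i,j)}(t+1) \ge \lambda_t \, p_{(i,j)}(t)$, using recurrence we can obtain a lower bound on $p_{(i,j)}(t)$: $p_{(i,j)}(t) \ge p_{min}(0) \prod_{k=0}^{t-1} \lambda_k$, where $p_{min}(0) > 0$ is a lower bound on $p_{(i,j)}(0)$. Let $A_t = \{C(t) \ne C^*\}$ be the event that, at iteration t, the candidate solution is not the optimal solution $C^*$. Since the m LA choose their actions independently, the probability of generating $C^*$ at iteration t is at least $\left(p_{min}(0) \prod_{k=0}^{t-1} \lambda_k\right)^m$.
Let $B_T = \bigcap_{t=1}^{T} A_t$ be the event that the optimal solution is not found up to instant T. Then $P(B_T) \le \prod_{t=1}^{T} \left(1 - p_{min}(0)^m \prod_{k=0}^{t-1} \lambda_k^m\right)$.
By resorting to $(1 - u) \le \exp(-u)$ we obtain $P(B_T) \le \exp\left(-p_{min}(0)^m \sum_{t=1}^{T} \prod_{k=0}^{t-1} \lambda_k^m\right)$. However, from our assumption, the sum in the exponent diverges as $T \to \infty$, and therefore $\lim_{T \to \infty} P(B_T) = 0$. Examples of smoothing sequences which eventually generate the optimal solution with probability 1 (that is, which satisfy the sufficient condition of Theorem 1) include $\lambda_t = 1 - 1/(t+1)^{\beta}$ for β > 1 and $\lambda_t = 1 - 1/\left((t+1)\log(t+1)^{\beta}\right)$ for β > 1. Let $t^*$ be the first time instant when the optimal solution is found; from then on, the optimal components are always reinforced. For $t^* + r$, for $(i^*, j)$ such that $i^* \in C^*$ and $j \notin C^*$, we have using recurrence: $p_{(i^*,j)}(t^* + r) = p_{(i^*,j)}(t^*) \prod_{k=t^*}^{t^*+r-1} \lambda_k$. It is easy to see from the assumption that $\prod_{k=0}^{\infty} \lambda_k = 0$, by considering the log of the expression described in the assumption on $\lambda_k$, so that $p_{(i^*,j)}(t^* + r) \to 0$ as $r \to \infty$.
By considering that the probabilities at component $i^*$ sum to 1, we obtain, for $j^*$ belonging to the optimal solution, $p_{(i^*,j^*)}(t^* + r) \to 1$ as $r \to \infty$.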

Constant update parameter
In Theorem 13, we give the convergence result of the Pursuit-LA for the case of fixed parameter λ that is independent of time.
The optimal solution is generated with probability 1 only if the update parameter λ → 1. Proof.
Using recurrence, we know that: Thus, Therefore, Thus, by resorting again to (1 − u) ≤ exp (−u) we obtain Let us define h α Since we know that lim Hence the theorem is proven. Now, let us characterize the LA probabilities at convergence.
Let t * the first time instant when the optimal solution is found, the optimal so far components are always reinforced. For t * + r, for (i * , j * ) ∈ C * , we have: Using recurrence, we can obtain verify We remark Therefore,

Pursuit-LA with artificial barriers
In this section, we extend our Pursuit-LA with the concept of artificial barriers introduced recently by Yazidi and Hammer [4] to avoid the lock-in probability effect. The Pursuit-LA presented in the previous section is an absorbing scheme where the team of LA will converge, after a large number of iterations, to an absorbing state composed of a vector with components either 0 or 1. This creates a challenge when it comes to tuning the learning parameter, as choosing high values of the learning parameter close to 1 renders the scheme extremely slow, while choosing low values might lead to premature convergence. To prevent the scheme from getting locked in an absorbing state where premature convergence can take place, we introduce an upper and a lower bound for the probability of each LA in the team. Therefore, instead of allowing the LA probabilities to admit values within the interval [0, 1], we force the probabilities to be located in $[p_{min}, p_{max}]$, where $p_{max}$ is a user-defined upper bound for all the $p_{(i,j)}(t)$ and $p_{min} = 1 - p_{max}$ is the counterpart lower bound. $p_{max}$ needs to be chosen in the neighborhood of 1 in order to bias the exploration towards the neighborhood of the best solution found so far. We shall now give the update equations for Pursuit-LA with artificial barriers. For all components $C_i$, and for $j \in \{0, 1\}$, the update is given by: $p_{(i,j)}(t+1) = \lambda \, p_{(i,j)}(t) + (1 - \lambda)\left[\delta_j \, p_{max} + (1 - \delta_j) \, p_{min}\right]$, where $\delta_j$ is defined by $\delta_j = 1$ if $j \in C_i^*(t)$ and $\delta_j = 0$ otherwise. Please note that, if we initially impose that $p_{min} \le p_{(i,j)}(0) \le p_{max}$, then it is easy to prove by recurrence that the update form will guarantee that, at any subsequent time t > 0, $p_{min} \le p_{(i,j)}(t) \le p_{max}$. Let us suppose that $p_{min} \le p_{(i,j)}(t) \le p_{max}$ and prove that $p_{min} \le p_{(i,j)}(t+1) \le p_{max}$. In fact, $p_{(i,j)}(t+1)$ can be written in the form $p_{(i,j)}(t+1) = (1 - \lambda) p + \lambda \, p_{(i,j)}(t)$, where $p = p_{max}$ or $p = p_{min}$ depending on whether $\delta_j = 1$ or $\delta_j = 0$. Therefore, $p_{(i,j)}(t+1)$ is a convex combination of two quantities that are both in the interval $[p_{min}, p_{max}]$.
Hence, the result is proven by recurrence.
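It is easy to see that the update equation (Eq. (38)) can be written differently. In fact, if $j \notin C_i^*(t)$, then $p_{(i,j)}(t+1)$ reduces to: $p_{(i,j)}(t+1) = \lambda \, p_{(i,j)}(t) + (1 - \lambda) \, p_{min}$. However, if $j \in C_i^*(t)$, then $p_{(i,j)}(t+1)$ reduces to: $p_{(i,j)}(t+1) = \lambda \, p_{(i,j)}(t) + (1 - \lambda) \, p_{max}$. We can show that if $j \in C_i^*(t)$ then $p_{(i,j)}(t+1)$ increases, while it decreases in the opposite case (i.e., $j \notin C_i^*(t)$). In fact, $p_{(i,j)}(t+1) - p_{(i,j)}(t) = (1 - \lambda)\left(p - p_{(i,j)}(t)\right)$, with $p = p_{max}$ or $p = p_{min}$. Then, if $j \in C_i^*(t)$, $p_{(i,j)}(t+1) - p_{(i,j)}(t) = (1 - \lambda)\left(p_{max} - p_{(i,j)}(t)\right) \ge 0$, whereas if $j \notin C_i^*(t)$, $p_{(i,j)}(t+1) - p_{(i,j)}(t) = (1 - \lambda)\left(p_{min} - p_{(i,j)}(t)\right) \le 0$. It is easy to observe that whenever $p_{max} = 1$, and consequently $p_{min} = 0$, the Pursuit-LA with artificial barriers reduces to the Pursuit-LA with fixed learning parameter introduced in the previous section.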
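The barrier update and its invariance property can be checked numerically with a short Python sketch. The function name `barrier_update` and the boolean flag `in_best` (standing for $j \in C_i^*(t)$) are hypothetical names introduced only for this illustration.

```python
def barrier_update(p, in_best, lam, p_max):
    """One Pursuit-LA update with artificial barriers for a single action probability.

    p       : current probability p_(i,j)(t), assumed to lie in [p_min, p_max]
    in_best : True if action j matches component i of the best solution so far
    lam     : constant learning parameter in (0, 1)
    p_max   : user-defined upper barrier; the lower barrier is p_min = 1 - p_max
    """
    p_min = 1.0 - p_max
    target = p_max if in_best else p_min
    # Convex combination of the current value and the barrier being pursued.
    return lam * p + (1.0 - lam) * target
```

Setting `p_max = 1.0` gives targets of 1 and 0, which is the fixed-parameter update without barriers; with `p_max < 1`, repeated rewards drive the probability towards `p_max` but never beyond it, so each LA keeps a residual chance of exploring the other action.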
Before closing this section, it is not out of place to observe that our algorithm enjoys low computational complexity. In fact, the proposed Pursuit-LA requires only on the order of m operations per time step. This low computational complexity is an inherent property of Reinforcement Learning algorithms in general and LA in particular.

Application of pursuit-LA to max-SAT problem
In this section, we will test the performance of our algorithm for solving the Max-SAT problem, which is a well-known class of deterministic optimization problems. We will examine two main aspects of the algorithm. The first aspect is investigated in Section 1 and concerns the sensitivity of the algorithm to changes in the learning parameter. The second aspect we tackle is how well the current algorithm compares to other well-established state-of-the-art MAX-SAT solvers. The latter aspect is treated in Section 2.

Effect of varying the learning parameter
We use a benchmark from https://www.cs.ubc.ca/~hoos/SATLIB/benchm.html related to the SAT-encoded Flat Graph Colouring Problems. As a convergence criterion, we consider that the algorithm has converged when each LA has converged, meaning that each LA has an action whose probability exceeds 1 − ϵ, where ϵ denotes a small scalar. In all the experiments, we choose ϵ = 0.01. In more formal terms, we deem that the scheme has converged if, for all i, $p_{(i,0)}(t) > 1 - \epsilon$ or $p_{(i,1)}(t) > 1 - \epsilon$. As an objective function, we use the percentage of satisfied clauses, which also characterizes the performance of the algorithm: in the Max-SAT problem, we would like to maximize the number of satisfied clauses. Table 1 gives the performance of the Pursuit-LA and its convergence time for three values of the learning parameter: λ = 0.9, λ = 0.99 and λ = 0.999. PC denotes the percentage of satisfied clauses while CT denotes the convergence time. In Table 1, we consider different files representing the SAT-encoded Flat Graph Colouring Problems. The first category of files, with prefix flat30, denotes a problem with 100 instances, 30 vertices, 60 edges, 3 colours, 90 variables and 300 clauses. The second category of files, with prefix flat200, denotes a problem with 100 instances, 200 vertices, 479 edges, 3 colours, 600 variables and 2237 clauses.
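The two quantities used in these experiments, the fraction of satisfied clauses (PC) and the convergence test, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, clauses are given as lists of DIMACS-style signed integers (the format in which SATLIB distributes its instances), and each LA is represented by its probability of choosing action 1.

```python
def fraction_satisfied(clauses, assignment):
    """Fraction of clauses satisfied by a 0/1 assignment.

    A literal k means variable k is true, -k means variable k is false
    (variables are numbered from 1, as in DIMACS CNF files).
    """
    satisfied = 0
    for clause in clauses:
        if any((lit > 0) == bool(assignment[abs(lit) - 1]) for lit in clause):
            satisfied += 1
    return satisfied / len(clauses)


def has_converged(p_one, eps=0.01):
    """True when every LA has one action with probability above 1 - eps."""
    return all(p > 1.0 - eps or (1.0 - p) > 1.0 - eps for p in p_one)


clauses = [[1, -2], [2, 3], [-1, -3], [-2, -3]]
print(fraction_satisfied(clauses, [1, 1, 0]))   # all four clauses hold
print(fraction_satisfied(clauses, [1, 1, 1]))   # the last two clauses fail
print(has_converged([0.995, 0.003, 0.999]))
print(has_converged([0.995, 0.5, 0.999]))
```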
From Table 1, the general remark is that the Pursuit-LA algorithm is generally fast and yields acceptable performance even for low values of λ. The performance tends to increase with λ, at the cost of a longer convergence time. For example, consider the file flat30-5. Pursuit-LA converges quickly for the smaller learning parameter λ = 0.99, namely within 107 iterations. When we increase the parameter to λ = 0.999, the convergence time increases considerably to 11,223 iterations; however, the performance also increases substantially, from 0.9233 to 0.99. In Fig. 2, we also give the evolution of the performance of the algorithm on the same file flat30-5 for λ = 0.999. Initially, i.e., at time zero, the performance is 0.7533. As time proceeds, we observe from Fig. 2 that the performance steadily increases until converging to 0.99 after around 11,223 iterations. Note that this performance is first reached after about 8000 iterations. Nevertheless, Pursuit-LA needs more iterations to converge, because the probabilities of the actions that yield this performance (0.99) keep increasing as long as no better solution is found, and with a learning parameter as large as λ = 0.999 this increase is slow. The remaining time to converge after finding the best solution depends on the time it takes for the smallest of these action probabilities to rise above 1 − ϵ.
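A rough estimate of this remaining time can be worked out as follows. The derivation assumes the absorbing update without barriers ($p_{\max} = 1$), a best solution that no longer changes after time $t_0$, and an illustrative starting value $p(t_0) = 0.5$ for the smallest pursued probability; the starting value is an assumption, not a figure reported in the experiments. For a pursued action, the update $p(t+1) = \lambda\, p(t) + (1-\lambda)$ gives

$$1 - p(t_0 + n) = \lambda^{n}\left(1 - p(t_0)\right),$$

so the probability exceeds $1 - \epsilon$ as soon as

$$n > \frac{\ln\!\left(\epsilon / (1 - p(t_0))\right)}{\ln \lambda}.$$

With $\lambda = 0.999$, $\epsilon = 0.01$ and $p(t_0) = 0.5$, this yields $n > \ln(0.02)/\ln(0.999) \approx 3910$ iterations, which is of the same order as the roughly 3,200 iterations separating the first occurrence of the best solution (about 8000) from convergence (11,223).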

Comparison against reference solvers
The benchmark instances used to evaluate the performance of the Pursuit-LA algorithm belong to Random Unweighted-MAX2SAT/MAX3SAT. The performance of the algorithm is compared to various popular solvers in the literature:
• AdaptNovelty+: stochastic local search algorithm with an adaptive noise mechanism [77].
• IRoTS: Iterated Robust Tabu Search algorithm [73].
• RoTS: Robust Tabu Search algorithm [78].
• AdaptG2WSATp: adaptive gradient-based greedy WalkSAT algorithm with promising decreasing variable heuristic [79].
• AMLS1 and AMLS2: adaptive memory-based local search heuristics [56].
For the reference algorithms (IRoTS, RoTS, AdaptNovelty+), we carry out the experiments using UBCSAT (version 1.1), an implementation and experimentation environment for stochastic local search algorithms for SAT and MAX-SAT. As shown by Tompkins and Hoos [80], the implementation of these reference algorithms in UBCSAT is more efficient than (or just as efficient as) the original implementations. In all tables, the first and second columns identify the problem instance and the best known objective value f * (number of unsatisfied clauses). The remaining columns give the results of the algorithms using three performance quality criteria which have been widely used for the performance evaluation of stochastic local search [56]. To avoid premature convergence, we use p max = 0.9. Furthermore, we fix λ = 0.95. We run the algorithms in epochs, each consisting of 1000 iterations, and we only update the probabilities at the end of each epoch. In simple terms, we test the current probability vector during the whole epoch before updating it based on the best solution found so far in the entire epoch. Tables 2, 3 and 4 give the comparison results.
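The epoch-wise procedure can be sketched as below. This is a hedged illustration, not the code used for the reported results: the function names, the initial probabilities of 0.5, the `max_epochs` cap and the random seed are assumptions; "best solution found so far" is read here as the best over all epochs; and the stopping rule (two consecutive epochs without improvement) follows the criterion mentioned in the Discussion.

```python
import random


def count_unsatisfied(clauses, assignment):
    """Number of clauses (DIMACS-style signed literals) left unsatisfied."""
    return sum(
        not any((lit > 0) == bool(assignment[abs(lit) - 1]) for lit in clause)
        for clause in clauses
    )


def pursuit_la_epochs(clauses, n_vars, lam=0.95, p_max=0.9,
                      epoch_len=1000, max_epochs=50, seed=0):
    """Epoch-wise Pursuit-LA sketch: sample for a whole epoch, update once."""
    rng = random.Random(seed)
    p_min = 1.0 - p_max
    p_one = [0.5] * n_vars            # probability that variable i is set to 1
    best, best_cost = None, len(clauses) + 1
    stagnant = 0
    for _ in range(max_epochs):
        improved = False
        for _ in range(epoch_len):    # probabilities stay frozen in an epoch
            cand = [1 if rng.random() < p else 0 for p in p_one]
            cost = count_unsatisfied(clauses, cand)
            if cost < best_cost:
                best, best_cost, improved = cand, cost, True
        # Single update per epoch, pursuing the best solution found so far.
        p_one = [lam * p + (1.0 - lam) * (p_max if b else p_min)
                 for p, b in zip(p_one, best)]
        stagnant = 0 if improved else stagnant + 1
        if stagnant >= 2:             # stop after two stagnant epochs
            break
    return best, best_cost


# Tiny instance: the first four clauses cannot all hold, so the optimum is 1.
clauses = [[1, 2], [-1, 2], [1, -2], [-1, -2], [2, 3]]
best, best_cost = pursuit_la_epochs(clauses, n_vars=3)
print(best, best_cost)
```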
To summarize the results of this section, we have incorporated comparisons against some of the most established state-of-the-art schemes, including AMLS1, AMLS2, AdaptNovelty+, AdaptG2WSATp, RoTS and IRoTS, using different benchmarks. The results are conclusive and also quite surprising. Although our scheme is straightforward, and our aim was to provide a proof of concept of the possibility of using LA for solving deterministic problems, we have found that it performs well. It is even superior to AdaptNovelty+, which is a sophisticated state-of-the-art Max-SAT algorithm. For example, in Table 4, our Pursuit-LA consistently outperforms AdaptNovelty+ in terms of $f_{av}$ and convergence steps. Observe, for example, the results for the file s2v120c1700-2. Our Pursuit-LA reaches 250 unsatisfied clauses in 306 epochs, while AdaptNovelty+ reaches 261 in 25,229 iterations. The optimal number for this case is 248; therefore, our Pursuit-LA leaves just two more unsatisfied clauses than the optimum, compared to thirteen more for AdaptNovelty+.

Discussion
We believe that the Pursuit-LA algorithm owes its performance to two main design principles:
• By adding artificial barriers, p min and p max, for the LA probabilities, we are able to achieve better results by avoiding premature convergence of the algorithm. In fact, during our first experiments without artificial barriers, we observed low performance due to the so-called stagnation effect. Stagnation means that the best-so-far solution does not change for a relatively long number of iterations. In the face of stagnation and in the absence of artificial barriers, the LA team gets trapped in a probability vector which corresponds to an exclusive choice of the best-so-far solution. This happens within a finite number of iterations that depends on the learning parameter.
• The second idea that makes the algorithm perform well is pursuing the best solution found so far in the probability space. In this perspective, pursuing means gradually biasing, at each iteration, the probabilistic search towards the best solution found so far. The role of the learning parameter is to adjust the amount by which the probabilities change in the direction of the best solution found so far.
Although the performance of Pursuit-LA is promising, it does not outperform the other algorithms in all scenarios; in particular, RoTS and IRoTS are superior to Pursuit-LA in terms of solution quality, as seen in Table 4. In fact, although Pursuit-LA gives satisfactory solutions, it converges prematurely according to the results of Table 4. For instance, for s2v200c1200-2, f * = 127 is achieved by RoTS and IRoTS, while Pursuit-LA reaches 137 unsatisfied clauses in 647 epochs, compared to 775 and 540 steps for RoTS and IRoTS respectively. Therefore, we believe that adaptively adjusting the learning parameter as well as the barriers over time, borrowing ideas from IRoTS, will boost the performance. Under artificial barriers, the stopping criterion used in the simulations in Section 2 is the stagnation of the search for two consecutive epochs. One can deal with stagnation using different ideas from the literature. For instance, IRoTS forces any variable whose value has not been changed over the last 10 search steps to be flipped. Such an enhancement to Pursuit-LA can be the object of future research.
The idea behind Pursuit-LA is to bias the search probability vector towards the best solution found so far. However, adequately tuning the learning parameter is a challenge and can be the objective of further research. As shown in the experiments in Table 1, a small value of the learning parameter speeds up convergence at the cost of a lower-quality solution, whereas a large value induces slow convergence, which usually results in a good-quality solution. We have also improved the algorithm by imposing artificial barriers. By virtue of the artificial barriers, we avoid the lock-in probability, a phenomenon known in the field of LA in which the probability vector of each individual i-th LA, $P_i(t) = [p_{(i,0)}(t), p_{(i,1)}(t)]$, converges to [0, 1] or [1, 0], which stops the search. Artificial barriers can thus help to escape local optima. In addition, our algorithm can be adjusted to allow more diversification of the solution by using procedures similar to those of genetic algorithms, for example mutation and crossover operations.
[Table fragment spilled into the text here; first row truncated, column headers not recovered]
/150 | 22 | 23 18 224 | 22 20 354 | 22 20 374 | 22 20 580
p2600/150 | 38 | 39 17 478 | 38 20 5188 | 38 20 1807 | 38 20 3564
p3675/150 | 2 | 7 3 949 | 2 20 42,153 | 2 20 5674 | 2 20 22,831
p3750/150 | 5 | 10 1 322 | 5 20 41,177 | 5 20 4737 | 5 20 9262
The Pursuit-LA algorithm is a stochastic algorithm. It has been compared in Table 2 with three stochastic algorithms, RoTS, IRoTS and AdaptNovelty+, using the software UBCSAT [80]. RoTS is an algorithm based on tabu search that repeatedly chooses the value of the tabu tuning parameter at random from a given interval during the search. The variant IRoTS, whose subsidiary local search phase and perturbation phase are both based on RoTS, uses a randomized acceptance criterion that is biased towards better-quality candidate solutions. The noise parameter, p, which controls the degree of randomness of the search process, has a major impact on the performance and run-time behaviour of the original algorithm Novelty+. Unfortunately, the optimal value of p varies significantly between problem instances, and even small deviations from the optimal value can lead to substantially decreased performance. AdaptNovelty+ dynamically adjusts the noise setting p based on search progress. It gave similar results in 4 cases when compared to both RoTS and IRoTS, and was beaten in the remaining cases by at most 6%. However, our algorithm converges faster. We believe that we could improve the quality of the solutions given by the algorithm by finding a suitable balance between diversification and intensification. By adopting a technique similar to AdaptNovelty+, we can increase the noise by lowering p max, thus allowing more non-greedy moves, i.e., moves not in the neighborhood of the best solution so far. A strength of the algorithm is that it could be parallelized and used in a multilevel context, so that diversification and intensification could be exploited at different levels of the multilevel strategy. When the Pursuit-LA algorithm is compared to these three algorithms, one needs to compare how each algorithm balances diversification and intensification, and which of these strategies is more efficient. The comparison may lead to a hybrid approach such as the one described in [81].
[Table 4 fragment spilled into the text here; first and last rows truncated, column headers not recovered]
257 20 458 | 257 20 540 | 280 0 39,770 | 273 0 236
s2v120c1700-2 | 248 | 248 20 1460 | 248 20 1097 | 261 0 25,229 | 250 12 306
s2v120c1700-3 | 239 | 239 20 538 | 239 20 1729 | 253 0 39,321 | 240 17 336
s2v120c1800-1 | 291 | 291 20 559 | 291 20 630 | 306 0 42,222 | 299 2 326
s2v120c1800-3 | 279 | 279 20 330 | 279 20 160 | 293 0 34,378 | 279 19 368
s2v120c2600-2 | 458 | 458 20 246 | 458 20 695 | 486 0 38,513 | 466 12 252
s2v120c2600-3 | 440 | 440 20 325 | 440 20 359 | 463 0
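The effect of lowering $p_{\max}$ on the amount of noise can be quantified with a short calculation. It assumes the limiting case where all $m$ automata have reached the barrier, that the automata sample their actions independently, and that the instance prefix s2v120 denotes $m = 120$ variables (an assumption based on the file name). Each sampled component then differs from the best solution with probability $1 - p_{\max}$, so that

$$\mathbb{E}[\text{number of flipped components}] = m\,(1 - p_{\max}), \qquad \Pr[\text{best solution resampled exactly}] = p_{\max}^{\,m}.$$

For $m = 120$ and $p_{\max} = 0.9$, a sampled candidate differs from the best solution in $120 \times 0.1 = 12$ components on average, and the best solution itself is resampled with probability $0.9^{120} \approx 3.2 \times 10^{-6}$. Lowering the barrier to $p_{\max} = 0.8$ doubles the expected number of flipped components to 24, i.e., it widens the neighborhood explored around the best solution.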
Possible applications of Pursuit-LA
Within the class of deterministic optimization problems, there is a large family of combinatorial problems, many of which are NP-hard. Those problems are usually solved using algorithms such as genetic algorithms, tabu search, simulated annealing and Ant-Colony Optimization (ACO), to mention a few. Examples of combinatorial problems that can be solved by our proposed solution include traveling salesman problems, knapsack problems, job scheduling, graph coloring and quadratic assignment [82]. In the current paper, we have dealt with an optimization problem where the candidate solutions can be represented in a binary format. Nevertheless, it is straightforward to extend our solution to non-binary solutions by using the concept of multi-action LA. In this sense, a candidate solution is coded as a vector of discrete variables, and a multi-action LA is attached to each component of the vector representing the candidate solution. A promising research direction is also to solve deterministic continuous optimization problems using the concept of pursuit. In fact, the components of the m-dimensional vectors can be drawn randomly by using a team of m individual continuous action-set learning automata (CALA) [83].
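One way the multi-action extension might look is sketched below. The paper does not spell out this variant, so the update shown is an assumption: each LA keeps a probability vector over its r actions and moves it towards the unit vector of the action used by the best solution found so far, without barriers. The function names `pursue` and `sample` are hypothetical.

```python
import random


def pursue(prob_vectors, best, lam):
    """Move each LA's probability vector towards the unit vector of best[i]."""
    return [
        [lam * p + (1.0 - lam) * (1.0 if a == b else 0.0)
         for a, p in enumerate(vec)]
        for vec, b in zip(prob_vectors, best)
    ]


def sample(prob_vectors, rng):
    """Draw one action index per LA according to its probability vector."""
    return [rng.choices(range(len(vec)), weights=vec)[0]
            for vec in prob_vectors]


rng = random.Random(1)
probs = [[1 / 3] * 3 for _ in range(4)]   # 4 components, 3 actions each
probs = pursue(probs, best=[0, 2, 1, 2], lam=0.9)
print([round(sum(vec), 6) for vec in probs])   # each vector still sums to 1
print(sample(probs, rng))
```

Since the update is a convex combination of two probability vectors, each vector remains a valid distribution after every step.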

Conclusion
In this paper, we have provided a novel LA that can solve deterministic optimization problems based on the idea of pursuit. The search for an optimal solution is conducted using a team of cooperative LA. The scheme can be seen as a game-theoretical solution to a deterministic optimization problem. Apart from being a contribution to the field of LA in itself, the scheme is lightweight and extremely simple to implement, and requires very little memory. Despite being appealingly simple, extensive experimental results demonstrate that it performs well compared to sophisticated state-of-the-art approaches.
As future work, we aim to investigate further improving the performance of the pursuit scheme in terms of convergence speed and exploration of the search space using Multilevel techniques introduced by Bouhmala [57,58].
Funding Information Open Access funding provided by OsloMet - Oslo Metropolitan University.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.