Parallel Exploration via Negatively Correlated Search

Effective exploration is a key to successful search. The recently proposed Negatively Correlated Search (NCS) tries to achieve this via parallel exploration, where a set of search processes are driven to be negatively correlated so that different promising areas of the search space can be visited simultaneously. Various applications have verified the advantages of such novel search behaviors. Nevertheless, a mathematical understanding is still lacking, as the previous NCS was mostly devised by intuition. In this paper, a more principled NCS is presented, showing that parallel exploration is equivalent to the explicit maximization of both the population diversity and the population solution qualities, and can be optimally obtained by partial gradient descent on both models with respect to each search process. For empirical assessment, reinforcement learning tasks, which heavily demand exploration ability, are considered. The new NCS is applied to popular reinforcement learning problems, i.e., playing Atari games, to directly train a deep convolutional network with 1.7 million connection weights in environments with uncertain and delayed rewards. Empirical results show that the significant advantages of NCS over the compared state-of-the-art methods can largely be attributed to its effective parallel exploration ability.


Section I Introduction
Negatively Correlated Search (NCS) [1] is a recently proposed Evolutionary Algorithm (EA) [2] that iteratively searches for optimal solutions. Driven by the observation that a properly diversified population can be more beneficial to search [3], NCS explicitly asks different subsets of the population to periodically share their probabilistic distributions so that they can cooperatively model and control the diversity of the whole population. As the probabilistic distribution determines how new solutions will be sampled, NCS is characterized by explicitly modeling, at the current iteration, the diversity of the next population. On this basis, NCS is capable of capturing the ongoing interactions between successive iterations and effectively controlling the diversity of the next population, distinguishing itself from traditional EAs, which only measure the diversity of the sampled population [3]. Specifically, NCS explicitly divides the population into multiple exclusive subsets, i.e., sub-populations.
The evolution of each sub-population is regarded as a separate search process and is conducted by a traditional EA for exploitation. Meanwhile, the search processes are coordinated to explore different parts of the search space by driving their probabilistic distributions to be negatively correlated. As a result, NCS has been shown to exhibit a parallel exploration search behavior in which multiple search processes are guided to search different promising areas of the search space simultaneously (see Fig. 2 in [1] for an illustration).
Although the basic idea of NCS has attracted increasing research interest [9]-[12] and has shown very promising performance on various real-world problems [4]-[8][24], the original instantiation of NCS [1] was mostly devised by intuition, lacking a mathematical explanation of why negatively correlated search processes can lead to parallel exploration and guidance on how to optimally obtain the negatively correlated search processes.
In this paper, a mathematically principled NCS framework is proposed to address this issue. The new NCS explicitly regards exploration and exploitation as two objectives of the general search procedure, and works by mathematically modeling and maximizing both a diversity model (for exploration) and a fitness model (for exploitation) of the next population. The diversity model measures the total negative correlation of the probabilistic distributions between pairwise search processes, and the fitness model describes the total expectation of the solution qualities that can be sampled under the probabilistic distributions. In other words, these two models respectively represent how different and how good the newly generated solutions can be. By maximizing the diversity model, the search processes tend to be more negatively correlated, as the "overlaps" among the probabilistic distributions become smaller. By maximizing the fitness model, the expectation of the solution qualities that can be sampled by the search processes is improved.
In practice, by employing the Natural Evolution Strategy [13] to evolve each search process, both the diversity model and the fitness model can be optimally maximized via partial gradient descent with respect to each search process. That is, each search process can independently maximize its negative correlation to the others and its expectation of sampling better solutions. On this basis, by gradient descending both models at the same time, the resultant Negatively Correlated Natural Evolution Strategy (NCNES) is able to form a parallel exploration search behavior in which different search processes evolve, in parallel, to distinct yet promising areas of the search space.
To verify the effectiveness of NCNES, the reinforcement learning problem is considered for empirical studies, as it is widely acknowledged that exploration ability has a great impact on the performance of a reinforcement learner [32]. Three popular Atari games [25], covering shooting and obstacle-avoidance tasks, are selected as the test instances. To play the Atari games, NCNES is required to directly train a deep convolutional network with 1.7 million connection weights for optimizing the policy, which imposes great challenges on NCNES as the search space is both large-scale and highly multi-modal. Even worse, the environmental rewards are highly uncertain and heavily delayed, making the training even more difficult without the help of traditional back-propagation. Empirical results show that NCNES can achieve significantly higher scores than the state-of-the-art algorithms (including both EA-based and gradient-based solutions). Furthermore, due to the parallel exploration search behavior, NCNES can facilitate the search more computationally efficiently with parallel computing resources.
The remainder of this paper is organized as follows. In Section II, the new mathematically principled NCS is presented in detail, and the weaknesses of the original NCS, which was designed by intuition, are also discussed.
An instantiation of the new NCS framework, i.e., NCNES, is described in Section III. In Section IV, the effectiveness of NCNES is verified on three reinforcement learning problems by playing Atari games. Conclusions are given in Section V.

Section II NCS for Parallel Exploration
NCS stems from re-thinking the question "how does a population facilitate the search?" Although it has been widely acknowledged that effective information sharing among the population is the key to successful cooperative search, what information to share, and how, remains an open question [14]. By mimicking cooperation among humans, NCS asks the individuals in a population to have different search behaviors, so as to avoid repetitively searching the same region of the search space. A similar idea has also been adopted in ensemble learning [15]. Each search behavior is defined as how the offspring will be sampled based on their parents, and usually can be represented as a probabilistic distribution. The mathematical correlation among distributions is utilized to statistically model the diversity of the population. As a result, by explicitly driving multiple probabilistic distributions to be negatively correlated, NCS suffices to control the diversity of the next population. To implement the above idea, it is necessary to instantiate a way of modeling the diversity and balancing it with exploitation. In the original NCS, these steps were mainly motivated by intuition, lacking mathematical explanations for in-depth analysis, and have been shown to be sub-optimal. In this section, we first provide an integrated solution to these issues, and then discuss its merits over the original NCS.

Section II.A The Mathematical Model of NCS
Basically, the idea of NCS requires the population to be exclusively grouped into N sub-populations, each of which is then evolved as a separate search process by a traditional EA, preferably one that samples solutions from an explicit probabilistic distribution [16]. To re-design NCS, let us start with a thought experiment on what kind of probabilistic distribution can better facilitate the search by covering promising areas of the search space and generating new solutions therein.
It is usually straightforward to build a simple well-defined distribution such as a Gaussian or Cauchy distribution [16]. Unfortunately, such a distribution may be incapable of capturing complex problem characteristics such as multi-modality [17], and it is usually non-trivial to properly set up one complicated distribution. Similar to a Gaussian Mixture Model [18], we can employ multiple simple distributions instead of one complicated distribution. Another advantage of constructing multiple distributions is that we can explicitly sample different solutions from them for the purpose of finding multiple optima [19]. The problem then turns into how to add new simple distributions to the first simple distribution. Clearly, the new distributions should be able to sample new solutions with high fitness values. Moreover, the new distributions should have fewer "overlaps" (correlations) with existing ones, so that they can be used to sample different regions of the solution space.
For clarity, let us construct the multi-distribution model from scratch. If we initially have one distribution p(θ_1), there is no "overlap" to worry about. Thus it is only required to sample solutions with high enough fitness values. Mathematically, this objective F(θ_1) (to be maximized) can be modeled as the expectation of the fitness values f(x) of the solutions x sampled from p(θ_1) [13], as shown in Eq. (1).

F(θ_1) = ∫ f(x) p(x|θ_1) dx    (1)

If we want to add a new distribution p(θ_2) to p(θ_1), the correlation between the two should be minimized, while the expected fitness values under both p(θ_1) and p(θ_2) are maximized. For that purpose, the following Eq. (2) should be maximized:

F(θ_1) + F(θ_2) − Corr(p(θ_1), p(θ_2))    (2)
Generalizing to N distributions yields the objective of Eq. (3):

G = ∑_{i=1}^{N} ∫ f(x) p(x|θ_i) dx + ∑_{i=1}^{N} ∑_{j=1}^{N} −Corr(p(θ_i), p(θ_j))    (3)

By maximizing the first additive term, all the distributions should be able to sample solutions with high fitness values. By maximizing the second additive term, all the distributions should be mutually negatively correlated, by which the overlaps among the N distributions are minimized. Given that the distributions reflect how new solutions are generated, the first additive term gives an expectation of how good the next population might be, and the second additive term is thus capable of modeling the diversity of the next population. On this basis, the diversity model D for all N distributions is defined as Eq. (4):

D = ∑_{i=1}^{N} D(p(θ_i)),  where  D(p(θ_i)) = ∑_{j=1}^{N} −Corr(p(θ_i), p(θ_j))    (4)

is the diversity component derived for the i-th search process. By further denoting the first additive term as ℱ and its i-th component as F(θ_i) = ∫ f(x) p(x|θ_i) dx, Eq. (3) can be re-written as Eq. (5) for clarity:

G = ℱ + D = ∑_{i=1}^{N} F(θ_i) + ∑_{i=1}^{N} D(p(θ_i))    (5)
Thus, the mathematical explanation of NCS can be expressed as maximizing the general objective G, which turns into the maximization of both the diversity model D for exploration and the fitness model ℱ for exploitation. It is highly desirable that G can be maximized in parallel, to eliminate the interdependencies among search processes and to enjoy computational acceleration. Since the distributions of the search processes are independent of each other by definition, one way to achieve the parallel maximization of G is to apply partial gradient descent to G with respect to each θ_i. The gradient of Eq. (5) can be calculated as Eq. (6):
∇_{θ_i} G = ∇_{θ_i} ℱ + ∇_{θ_i} D = ∇_{θ_i} F(θ_i) + ∇_{θ_i} D(p(θ_i))    (6)

Clearly, by applying gradient descent to G, both the diversity model D and the fitness model ℱ of each search process can be independently maximized, endowing NCS with a parallel exploration search behavior in which each search process is highly likely to evolve to an unvisited promising area of the search space.

Section II.B The New NCS Framework
To implement Eq. (6), it is required to know how to calculate ∇_{θ_i} F(θ_i) and ∇_{θ_i} D(p(θ_i)), and how to update θ_i based on them.
For ∇_{θ_i} F(θ_i), the work in [13] has derived the following formulation (Eq. (7)) that can be directly employed:

∇_{θ_i} F(θ_i) ≈ (1/λ) ∑_{j=1}^{λ} f(x_ij) ∇_{θ_i} log p(x_ij | θ_i)    (7)
where x_ij indicates the j-th solution in the i-th sub-population and λ is the number of solutions in the i-th sub-population. For more details, please refer to [13].
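For intuition, the estimator of Eq. (7) can be sketched as a Monte Carlo score-function gradient. The snippet below is a simplified illustration for an isotropic Gaussian with a fixed standard deviation (not the full parameterization of [13]); the function name is ours:

```python
import numpy as np

def nes_fitness_gradient(f, mu, sigma, lam, rng):
    """Monte Carlo estimate of grad_mu F = E[f(x) grad_mu log p(x|theta)]
    for p = N(mu, sigma^2 I), following the form of Eq. (7)."""
    xs = mu + sigma * rng.normal(size=(lam, mu.size))  # lam sampled solutions
    fs = np.array([f(x) for x in xs])
    # For a Gaussian, grad_mu log p(x|theta) = (x - mu) / sigma^2
    return (fs[:, None] * (xs - mu) / sigma**2).mean(axis=0)
```

For a linear fitness such as f(x) = x_1, the estimate approaches the true gradient (1, 0, ...) as λ grows.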
Alternatively, a parameter φ can be introduced to trade off ∇_{θ_i} F(θ_i) and ∇_{θ_i} D(p(θ_i)) for a subtle balance between exploitation and exploration, using Eq. (9):

∇_{θ_i} G = ∇_{θ_i} F(θ_i) + φ ∇_{θ_i} D(p(θ_i))    (9)
Similar to standard gradient descent methods [21], the objective function G can be maximized by optimizing the distribution parameters with Eq. (10):

θ_i ← θ_i + η ∇_{θ_i} G    (10)
where η is a step-size parameter for the gradient update.
Based on the discussions above, the new NCS framework is listed in Algorithm I and described as follows.
At the beginning stage, N probabilistic distributions are initialized to form a set of parallel search processes. At each iteration, the following steps are executed in parallel: 1) each i-th search process first generates λ candidate solutions according to its probabilistic distribution p(θ_i) at step 6; 2) the fitness values of all λ newly generated solutions are evaluated with respect to the fitness function f at step 7; 3) the gradient of the fitness model locally approximated by the i-th sub-population, i.e., ∇_{θ_i} F(θ_i), is calculated according to Eq. (7) at step 9; 4) the gradient of the diversity model with respect to the i-th sub-population, i.e., ∇_{θ_i} D(p(θ_i)), is calculated according to Eq. (8) at step 10; 5) the gradient of the general objective function, i.e., ∇_{θ_i} G, is then accumulated based on Eq. (9) at step 11; 6) the general objective function G is maximized by the gradient descent method (see Eq. (10)), as shown in step 12. Finally, the best ever-found solution x*, which is iteratively recorded (see step 8), is output as the result of NCS upon halting (see step 13).
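The loop above can be sketched as a toy instantiation of the framework. This is a minimal sketch under several assumptions not fixed by the text: isotropic fixed-variance Gaussians, plain (not natural) gradients, rank-based utilities, and the Bhattacharyya coefficient exp(−d_B) as the correlation measure; all names are ours.

```python
import numpy as np

def ncs_search(f, dim, n_proc=4, lam=20, phi=1.0, eta=0.1, iters=200, seed=0):
    """Toy sketch of the new NCS framework (Algorithm I): N parallel isotropic
    Gaussian search processes whose means ascend the gradient of G = F + phi*D."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 1.0, size=(n_proc, dim))   # one mean per search process
    sigma = 0.5                                     # shared, fixed std (simplification)
    best_x, best_f = None, -np.inf
    for _ in range(iters):
        new_mu = np.empty_like(mu)
        for i in range(n_proc):
            # steps 6-8: sample lam solutions, evaluate, record the best
            xs = mu[i] + sigma * rng.normal(size=(lam, dim))
            fs = np.array([f(x) for x in xs])
            k = int(np.argmax(fs))
            if fs[k] > best_f:
                best_f, best_x = fs[k], xs[k].copy()
            # step 9: rank-based estimate of grad F(theta_i) (cf. Eq. (7))
            u = (np.argsort(np.argsort(fs)) - (lam - 1) / 2) / lam
            grad_f = (u[:, None] * (xs - mu[i]) / sigma**2).mean(axis=0)
            # step 10: grad of the diversity component D(p(theta_i)), taking
            # Corr = exp(-d_B) with d_B = ||mu_i - mu_j||^2 / (8 sigma^2)
            grad_d = np.zeros(dim)
            for j in range(n_proc):
                if j != i:
                    diff = mu[i] - mu[j]
                    d_b = float(diff @ diff) / (8 * sigma**2)
                    grad_d += np.exp(-d_b) * diff / (4 * sigma**2)
            # steps 11-12: ascend grad G = grad F + phi * grad D (Eqs. (9)-(10))
            new_mu[i] = mu[i] + eta * (grad_f + phi * grad_d)
        mu = new_mu                                 # all processes update in parallel
    return best_x, best_f
```

Note the diversity gradient repels each mean only from nearby means (the repulsion decays with the Bhattacharyya distance), so the processes spread out without diverging.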

Section II.C The Merits of the New NCS
In the original NCS, there is no concept of a diversity model or a fitness model. However, if we look at the original NCS from this perspective, it can be found that the original NCS did not measure the expectation of the qualities of unsampled solutions as a fitness model. Instead, to improve solution qualities, it heuristically compared the fitness values of two sampled solutions for survival. This means that the original NCS cannot utilize the gradient descent method for maximizing the fitness model. Similarly, the diversity model was also maximized by such heuristic comparisons, leaving two technical issues for the original NCS, in addition to the unclear mathematical explanation.
To be specific, the original diversity model is basically a decentralized model: each search process measures its own diversity against the distributions of the other search processes, as shown in Eq. (11). To maximize each D̄(p(θ_i)) of the i-th search process, the original NCS works by comparing the diversity of the current distribution, i.e., the parent distribution p(θ_i) estimated from the parent sub-population, with that of the offspring distribution p(θ_i′) estimated from the offspring sub-population, and then selecting the larger one to update the distribution p(θ_i) for the next iteration. In order to obtain a good balance between exploration and exploitation, the fitness values are also considered during the maximization of diversity. Let X_i be the parents in the i-th search process, and X_i′ be their offspring.
Then the heuristic comparison goes as Eq. (12), where the trade-off parameter lies in (0, +∞), and f(X_i) denotes the fitness values of X_i. For more details of the original NCS, please refer to [1].
It can be clearly seen from Eq. (12) that the maximization of both the diversity and the fitness highly depends on the samplings of the candidate solutions (note that the distribution parameters θ here are also directly estimated from the sampled solutions). However, existing sampling techniques in EAs are usually randomized and may thus involve significant noise, which may mislead the maximization of both the diversity and the fitness. Another issue is that the above heuristic comparison suffers from interdependencies among search processes. Specifically, by substituting Eq. (11) into Eq. (12), it can be seen that the heuristic comparison in the i-th search process explicitly requires the parent distributions p(θ_j) of all the other j-th search processes to decide its own parent sub-population and parent distribution for the next iteration, while the heuristic comparisons in the other sub-populations also require doing so.
Consequently, the heuristic comparison in one search process is interdependent with those in the others, since the parent distributions of different search processes have to be decided sequentially. Due to the above two issues, the diversity and the fitness of each sub-population may not be maximized in parallel, possibly making the parallel exploration of NCS less effective.
Comparatively, in the new NCS, it is no longer necessary to compare the exact values of the fitness and diversity pairwise between the parent and offspring sub-populations for survival, as gradient descent mathematically provides the optimal direction for maximizing both the fitness model and the diversity model. On this basis, the random sampling noise and the interdependencies among sub-populations introduced by the original heuristic comparisons are avoided. As a result, the proposed new NCS framework successfully addresses the two technical issues of the original NCS, and brings a much clearer explanation to the idea of NCS.

Section III Negatively Correlated Natural Evolution Strategies
To instantiate the new NCS framework, the type of probabilistic distribution p(θ_i) should be specified.
In this paper, the Gaussian distribution is employed, i.e., p(θ_i) = N(μ_i, Σ_i). The underlying reasons are three-fold: 1) the Gaussian distribution is the most commonly used distribution in search [16]; 2) with the Gaussian distribution, ∇_{θ_i} F(θ_i) has an analytic closed form for efficient computation [13]; 3) the Bhattacharyya distance is also analytic for Gaussian distributions [1].
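Reason 3) can be made concrete with the standard closed-form Bhattacharyya distance between two Gaussians, sketched below (a textbook formula; the function name is ours):

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2):
    (1/8)(mu1-mu2)^T S^{-1} (mu1-mu2) + (1/2) ln(det S / sqrt(det cov1 det cov2)),
    where S = (cov1 + cov2)/2."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2
```

Identical distributions give distance 0, and the distance grows as the means separate, so a larger value indicates a smaller correlation between two search processes.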
Nevertheless, [13] notices that if the above ∇_{μ_i} G and ∇_{Σ_i} G are used as the gradients of G, there is a serious issue in directly updating μ_i and Σ_i with Eq. (10). Specifically, it can be observed that ∇_{μ_i} G ∝ Σ_i^{-1}, which means that a large Σ_i can make the learning steps of μ_i and Σ_i insignificant, while a small Σ_i can result in overly large updates of μ_i and Σ_i. This can lead to an unstable search, making it impossible to precisely locate the optimum [13]. To address this issue, [13] derives the Fisher information matrix F from the natural gradient of a population. Here we extend it to the multi-population case, where a pair of Fisher matrices F_{μ_i} and F_{Σ_i} is respectively assigned to each sub-population, as shown in Eq. (16).
With the Fisher information matrices, μ_i and Σ_i are updated using Eq. (17):

μ_i ← μ_i + η_μ F_{μ_i}^{-1} ∇_{μ_i} G,    Σ_i ← Σ_i + η_Σ F_{Σ_i}^{-1} ∇_{Σ_i} G    (17)
Notice that the above equations are computationally intensive. Specifically, the inversion of the Fisher matrix incurs a computational complexity of O(d^6) if the full covariance matrix is considered [13], where d indicates the dimensionality of the search space. To alleviate the computational costs, we simply restrict the covariance matrix and the Fisher matrix of each distribution to be diagonal. This implies that the interdependencies among decision variables are omitted. Although this may make the algorithm less robust on non-separable problems, it significantly reduces the computational complexity to O(d) and improves the scalability of the algorithm [22].
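A sketch of why the diagonal restriction yields O(d) updates: the Fisher matrix of a diagonal Gaussian is itself diagonal (1/σ² per mean coordinate and 1/(2σ⁴) per variance coordinate), so "inverting" it reduces to an elementwise rescaling. The function name and the variance-vector parameterization are our assumptions:

```python
import numpy as np

def natural_gradient_diagonal(grad_mu, grad_var, var):
    """Natural gradient F^{-1} grad for a diagonal Gaussian in O(d):
    F_mu = diag(1/var) and F_var = diag(1/(2 var^2)), so their inverses
    are elementwise multiplications instead of full matrix inversions."""
    nat_mu = var * grad_mu               # F_mu^{-1} grad_mu
    nat_var = 2.0 * var**2 * grad_var    # F_var^{-1} grad_var
    return nat_mu, nat_var
```

The rescaling counteracts the ∇_{μ_i} G ∝ Σ_i^{-1} issue noted above: dimensions with a large variance receive proportionally larger mean updates.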
Another technique adopted from [13] is the normalization of the fitness values. This is motivated by the difficulty of setting a proper trade-off parameter φ for aggregating ∇_{θ_i} F(θ_i) and ∇_{θ_i} D(p(θ_i)), as different problems may have quite varied scales of fitness values. For that purpose, the utility function in [13] is employed in this paper to reshape the fitness values in each sub-population. Specifically, for each sub-population, all λ solutions are first ranked based on their fitness values, where π(j) indicates the rank of the j-th solution. Then the utility function of each i-th sub-population, denoted as u_i, is applied to reshape the fitness of each j-th solution according to Eq. (18). After that, the utility of each solution is used to replace the term f(x_ij) in Eq. (13).
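The reshaping can be sketched as below, assuming Eq. (18) follows the standard NES utility of [13] (the exact constants are an assumption and may differ from the paper's setting):

```python
import numpy as np

def nes_utilities(fitnesses):
    """Rank-based utility reshaping in the style of NES [13]: fitness values
    are replaced by fixed utilities that depend only on ranks, making the
    trade-off parameter phi insensitive to the raw fitness scale."""
    lam = len(fitnesses)
    ranks = np.empty(lam, dtype=int)
    ranks[np.argsort(fitnesses)[::-1]] = np.arange(1, lam + 1)  # rank 1 = best
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks))
    return raw / raw.sum() - 1.0 / lam                          # zero-sum utilities
```

Because the utilities depend only on ranks, the balance between the fitness gradient and the diversity gradient is unaffected by how large or small the raw rewards happen to be.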
The step-size parameters η_μ and η_Σ can be either tuned off-line or adjusted during the search. In this paper, the following strategy is used to adjust these two parameters at each iteration:

η_μ = η_μ^init · (e − e^{t/T}) / (e − 1),    η_Σ = η_Σ^init · (e − e^{t/T}) / (e − 1)    (19)
where T is the total time budget for the whole search and t is the budget consumed so far.
Here e is the natural constant, and η_μ^init and η_Σ^init are the initial values of the two step-size parameters, respectively. With Eq. (19), both step sizes decrease over the iterations from their initial values to zero.
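The schedule can be sketched as follows, assuming the reconstructed form η_t = η^init (e − e^{t/T})/(e − 1), which equals η^init at t = 0 and 0 at t = T, consistent with the description above:

```python
import math

def decayed_step_size(eta_init, t_used, t_total):
    """Step-size schedule in the spirit of Eq. (19) (reconstructed form):
    decays smoothly from eta_init at t=0 to 0 at t=t_total, using the
    natural constant e."""
    return eta_init * (math.e - math.exp(t_used / t_total)) / (math.e - 1)
```

The decay is slow early on (favoring large exploratory steps) and accelerates toward the end of the budget.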
So far, all the details required to instantiate an NCS algorithm have been presented. To summarize, the proposed algorithm is a multi-Gaussian-distribution-based EA; each distribution drives the evolution of one sub-population with the well-established NES, and the multiple Gaussian distributions are driven to be negatively correlated by the proposed diversity model. In this regard, the proposed algorithm can also be regarded as a new variant of NES with the ability of parallel exploration. It is thus named Negatively Correlated Natural Evolution Strategies (NCNES). The detailed steps of NCNES are listed in Algorithm II for reference.

Section IV NCNES for Reinforcement Learning
EAs are intuitively promising solutions to Reinforcement Learning (RL) problems, as the population-based nature of EAs not only provides the much-needed exploration ability for RL [25], but also offers other merits such as parallel acceleration [33], noise resistance [34], and compatibility with training non-differentiable policies (e.g., trees [35]). For example, the canonical NES has been successfully shown to be a promising reinforcement learning method by playing Atari games [25]. Furthermore, RL problems are naturally good testbeds for NCNES, as performance on RL problems is highly dependent on the exploration ability of the policy optimizer [32]. On this basis, this paper empirically studies the NCNES-based solution to RL problems by playing Atari games.
For the purpose of performance assessment, the empirical studies examine three aspects: how effectively the new NCS framework facilitates the search, how much the proposed new diversity model contributes to NCNES, and how well NCNES behaves on reinforcement learning problems.

Section IV.A Reinforcement Learning
RL is a class of problems in which an agent learns to make Markov decisions so that the long-term reward is maximized. In RL, the policy can be learnt iteratively only by interacting with the environment. At each time step, the agent picks an action according to the policy and the observed state of the environment, leading to a transition from the current state to the next state, and then receives a reward as feedback to update the policy. These steps repeat until termination. To maximize the expected cumulative discounted reward in the long term, numerous RL methods have been developed over the last decades, e.g., model-based methods [26][27], value-function-based methods [28][29], and policy-search-based methods [25][30]. For more details of RL methods, please refer to [31].
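The generic interaction loop described above can be sketched as follows (a toy skeleton with hypothetical env_reset/env_step interfaces, not the Atari setup of this paper):

```python
def run_episode(policy, env_step, env_reset, max_steps=100):
    """The generic RL interaction loop: observe a state, pick an action from
    the policy, receive a reward and the next state, repeat until the episode
    terminates; the return is the cumulative (undiscounted) reward."""
    state = env_reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                    # act according to the policy
        state, reward, done = env_step(state, action)
        total_reward += reward                    # accumulate the feedback
        if done:
            break
    return total_reward
```

In the EA-based setting of this paper, this episode return plays the role of the fitness value of a candidate policy.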
Among the existing works, the policy-search-based methods that adopt deep neural networks as the policy model have drawn the most research attention due to their powerful performance [25][30]. The key problem for this type of method turns into how to train the deep network in the RL setting, which faces three major difficulties. First, the search space for training deep neural networks is highly large-scale and multi-modal; second, due to the Markov-decision-process nature of RL, the policy learning process is non-differentiable unless some differentiable functions are specially designed (e.g., the critic function in A3C [30]); last, the delayed rewards may involve considerable noise. NES is a suitable method for learning the policy due to its derivative-free, robust, and parallel features. Empirical studies on a set of Atari games have verified the advantages of NES over several state-of-the-art methods [25]. The flowchart of applying NCNES to play Atari games is shown in Fig. 1 for illustration. Basically, the agent aims to learn the policy by iteratively imposing actions on the Atari environment and receiving states and rewards in return. The policy is modeled as a deep convolutional network for the purpose of conveniently and effectively processing the high-dimensional raw pixel data received directly from the video games. NCNES is applied to optimize the connection weights of the policy network without back-propagation. The network architecture of the agent consists of three convolution layers and two fully connected layers (see Table I), as suggested by [28]. More specifically, each individual solution is represented as a vector of all the connection weights of the policy model. Accordingly, the distributions of the NCNES search processes are estimated based on these high-dimensional solutions. The training phase is divided into multiple epochs. At each epoch, the agent starts from the beginning of the game and takes a sequence of actions from the policy model to react
to the environment, so as to gain as many scores as possible until the game is over. After a game (i.e., an epoch) has finished, the reward is returned to the agent as well as to NCNES. NCNES then takes the reward of each epoch as the fitness value for that iteration to optimize the connection weights (generating a population of new policy models for the next epoch) in a parallel exploration manner, i.e., maintaining diversity among different search processes. When the training budget runs out, the final policy model is output for further usage.
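As a sanity check on the reported 1.7 million connection weights, the parameter count can be reproduced assuming the standard architecture of [28] (conv 32@8×8 stride 4, conv 64@4×4 stride 2, conv 64@3×3 stride 1 on 4×84×84 inputs, then a 512-unit fully connected layer and a linear output; the exact Table I layout may differ, and the action count is our assumption):

```python
def dqn_parameter_count(n_actions=6, in_channels=4):
    """Estimate the number of weights of the policy network, assuming the
    DQN-style architecture of [28] on 84x84 inputs."""
    def conv(cin, cout, k, size, stride):
        out = (size - k) // stride + 1
        return cout * (cin * k * k + 1), out     # weights + biases, output size
    p1, s1 = conv(in_channels, 32, 8, 84, 4)     # -> 20x20x32
    p2, s2 = conv(32, 64, 4, s1, 2)              # -> 9x9x64
    p3, s3 = conv(64, 64, 3, s2, 1)              # -> 7x7x64
    flat = s3 * s3 * 64                          # 3136 flattened features
    p4 = 512 * (flat + 1)                        # fully connected layer
    p5 = n_actions * (512 + 1)                   # linear output layer
    return p1 + p2 + p3 + p4 + p5
```

With 6 actions this gives roughly 1.69 million parameters, consistent with the ~1.7 million figure quoted in the text.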

Section IV.B NCNES for Playing Atari
From the optimization perspective, the above problem-solving procedure suffers from three kinds of difficulties. First, the search space is extremely large-scale: the deep architecture of the policy results in a huge number of connection weights to be optimized, so NCNES needs to solve a 1.7-million-dimensional real-valued optimization problem. Second, the search space is highly multi-modal, due to the complex architecture of the deep neural network and the non-uniform distribution of the rewards. Third, the feedback is quite uncertain. On one hand, the reward is heavily delayed, as the agent can only get the total reward from the environment after the game has ended, which makes it very difficult to evaluate the subtle action at each timestep of an epoch. On the other hand, the total reward involves considerable noise introduced by the randomized Atari game settings, which makes it even harder to evaluate the policy.
Due to this large-scale, uncertain, and multi-modal nature, the optimization problem is highly non-trivial.

Section IV.C Experimental Protocol
Three Atari games are selected for the empirical studies, i.e., Freeway, Enduro, and Beamrider. Three RL methods are selected for comparison, denoted as A3C [30], CES [25], and NCS-C [1], respectively. All these methods are incorporated into the policy-search-based RL framework to train the same deep neural network as NCNES does, i.e., optimizing the connection weights. Among them, A3C is a state-of-the-art gradient-based method that trains the network with traditional back-propagation. The other two algorithms are EA-based optimization methods. CES is the canonical NES that has been successfully applied to play the Atari games [25]. NCS-C is the instantiation of the original NCS framework. The well-established A3C and CES are used to demonstrate the effectiveness of NCNES at playing Atari games. CES can also be used to assess how parallel exploration facilitates the search, as NCNES can be viewed as a new NES variant with parallel exploration ability. NCS-C is used to show the advantages of the proposed new NCS framework over the original NCS.
For all comparisons, each algorithm terminates the training phase of a game when the total time budget runs out, and the final solution (policy network) is returned for testing. The quality of the final solution is measured by the testing score, i.e., the score averaged over 30 repeated runs of game-playing without time limitations. Considering that the environment of a game-playing run is randomly initialized, each experiment is repeated three times, i.e., there are three testing scores for each algorithm on each game. The total time budget is defined as the total number of game frames each algorithm is allowed to consume for training. For the three EA-based methods, the total game frames are set to 100 million. For A3C, which works quite differently with back-propagation, it would be unfair to set the same total game frames as for the EA-based methods. In this regard, we counted the game frames consumed by the well-established CES and by A3C under the same hardware conditions, in the same game, and within the same given computational run time. It was found that the ratio of consumed game frames between them is about 2.5. As a result, the total game frames for A3C are set to 40 million for fairness. To discretize the games for the agent's action execution and state acquisition, the frame skip is set to 4.
That is, for each training phase, the agent is allowed to take 25 million actions for the EA-based methods and 10 million actions for the gradient-based method.
As both CES and A3C have been successfully applied to play Atari games, we directly borrow the hyperparameter settings from the corresponding papers [25][30]. The hyperparameters of NCS-C and NCNES are given as follows. For NCS-C, the number of search processes is set to 8, the sigma is initialized to 0.01, and the learning rate of the sigma and the learning epoch are set as in its original paper [1]. To reduce the noise of the environment, each solution is re-evaluated 10 times at each epoch of the training phase, and the averaged score is returned to NCS-C as the fitness of the solution. For NCNES, the hyperparameters are listed in Table II for brevity. On Freeway, NCNES gains around three times the scores of NCS-C, and it also shows significant advantages on the other two games. This verifies the effectiveness of the mathematical NCS model. A3C performs less robustly than the other three algorithms, as its final policy model fails to gain any score in two games. This may be because population-based search can reduce the uncertainty of the algorithms themselves, by 1) frequently sampling from a small region of the search space, which plays the role of re-evaluation to some extent; and 2) only requiring the relative order of solutions to determine the search direction, which is less sensitive to evaluation noise. Although A3C can occasionally gain high scores, it is very unstable, as its score curves oscillate heavily, and it even returns very bad policy models (i.e., an averaged score of 0.0 on two games) as the final output. This might be because A3C is less resistant to the environmental noise.
Fig. 3 The score curves of four algorithms on three games, respectively.
Performance Analysis on Policy Behaviors. It is expected that the parallel exploration search behavior of NCNES can help emerge some novel yet useful behaviors that traditional policies are less likely to express. For BeamRider, the agent trained by NCNES prefers staying on the left side of the available area and gains as many as 996 points in a single testing play (see Fig. 4). The motivation behind this trick can be explained as follows: staying on the left side can prevent at most 50% of enemy attacks, and is thus beneficial to longer survival. For Enduro, the agent prefers driving in the middle of the racing track when the weather is good, so as to preserve the maximal freedom to move to both sides (see Fig. 5(a)).

Fig. 1
Fig. 1 The flowchart of NCNES based solution for playing Atari

The screenshots of the three games are shown in Fig. 2.
In Freeway, the pedestrian is controlled by three actions (up, down, and wait), aiming to avoid dangerous collisions when crossing a ten-lane highway with heavy traffic, and scores every time it succeeds in reaching the other side. The player in Enduro maneuvers a race car to avoid other racers and to achieve higher mileage in an endurance race lasting several "days" (counted in the game). The decreased visibility at night or in severe weather, and the increased car speed and traffic frequency, pose great challenges. Beamrider is a horizontally scrolling short-range shooter aimed at shooting down destroyable incoming enemies with a limited supply of torpedoes while escaping from other undefeatable enemies.

Fig. 2
Fig. 2 The screenshots of the three games, where (a) is Freeway, (b) is Enduro, and (c) is Beamrider.

To depict the curves, at the end of each epoch of the training phase, the current best policy model in terms of training score is additionally tested 30 times, and the averaged testing score is recorded for depicting the score curve. Note that this testing time is not counted into the total game-frame budget, as the score is not used to help training. The testing score is then depicted epoch by epoch to form the score curve. Generally, the score curve of an algorithm expresses the convergence speed of the optimization algorithm. It can be seen that NCNES (the red curve) can usually find a very good policy model within very few timesteps. This means that even with a much smaller time budget, NCNES can still outperform the others. For CES and NCS-C, the score curves increase much more slowly with the timesteps. This verifies that the new NCS model can facilitate the search more effectively.

When visibility decreases because it is snowy, foggy, or night, the agent prefers driving on one side of the racing track for safety, similar to human behavior (see Figs. 5(b)-(d)).

Fig.4 Tricks learned in BeamRider: the agent prefers staying in the left side of the available area.

Results and Analysis

Performance Analysis on Game Scoring.
The three repeated testing scores of each algorithm on the three games are shown in Table III. It can be clearly seen that NCNES outperforms all the compared algorithms on the three tested games, which verifies the effectiveness of NCNES on reinforcement learning problems. Comparing NCNES with CES shows that parallel exploration facilitates the search much better, as NCNES gains on average twice the scores of CES. Comparing NCNES with NCS-C, NCNES gains around three times the scores of NCS-C on Freeway.

Table III The averaged testing scores of four algorithms on three Atari games
Performance Analysis on Convergence Speed. To study from the optimization perspective, the score curves of the four algorithms on the three games are depicted in Fig. 3.