Neuroevolutionary learning in nonstationary environments

This work presents a new neuro-evolutionary model, called NEVE (Neuroevolutionary Ensemble), based on an ensemble of Multi-Layer Perceptron (MLP) neural networks for learning in nonstationary environments. NEVE makes use of quantum-inspired evolutionary models to automatically configure the ensemble members and combine their output. The quantum-inspired evolutionary models identify the most appropriate topology for each MLP network, select the most relevant input variables, determine the neural network weights and calculate the voting weight of each ensemble member. Four different approaches of NEVE are developed, varying the mechanism for detecting and treating concept drifts, including proactive drift detection approaches. The proposed models were evaluated on real and artificial datasets, and the results were compared with those of other consolidated models in the literature. The results show that the accuracy of NEVE is higher in most cases and that the best configurations are obtained using some mechanism for drift detection. These results reinforce that the neuroevolutionary ensemble approach is a robust choice for situations in which the datasets are subject to sudden changes in behaviour.


Introduction
The ability of a classifier to learn from incremental and dynamic data extracted from a nonstationary environment (when data distribution changes over time) poses a challenge to the field of computational intelligence. In the context of neural networks, the problem becomes even more complicated, since most of the existing models must be retrained when a new data block is available, using the whole set of patterns learned until then. To cope with that sort of problem, a classifier must, ideally, be able to [43]:
- Track and detect any changes in the underlying data distribution;
- Learn with new data without the need to present the whole dataset again to the classifier;
- Adjust its own parameters in order to address the detected changes in the data;
- Forget what has been learned when that knowledge is no longer useful for classifying new instances.
All these abilities seek, in one way or another, to deal with a phenomenon called concept drift [51,22]. This phenomenon defines datasets that suffer changes over time, such as when there is a change in the relevance of the variables, or when the mean and variance of the variables change.
Many approaches have been devised to accomplish some or all of the abilities mentioned above. One of the older and simpler approaches is a sliding window (not always continuous) on the input data, in which the classifier is trained with the data delimited by this window [21]. Another method is to detect deviations and, if they occur, to adjust the classifier [7]. Some models, in turn, use rule-based classifiers, like [43,59-61]. A more successful and widely used approach, though, is to use a group of different classifiers (ensemble) to cope with changes in the environment. Several different ensemble models have been proposed in the literature, including recent approaches like [56-58], and they may or may not weight each of their members. Most models using weighted classifier ensembles determine the weights for each classifier using a set of heuristics related to classifier performance on the most recent data received [22].
Although several algorithms have already been proposed in the literature for classification in concept drift scenarios (many even using ensembles), neuroevolution has still been little explored for this type of problem. Neuroevolution uses evolutionary algorithms to adjust parameters that affect the performance of artificial neural networks, such as topology, learning rate and weights, among others. In this case, each solution of the evolutionary algorithm stores a representation of these parameters, which are evolved to find the optimal network for the problem. Applied to neural network ensembles, an evolutionary algorithm is also able to dynamically adjust the entire model, a task that would be very arduous if performed manually, due to the complexity of the model.
Because of the complexity of the architecture, neuroevolutionary models based on classifier ensembles must have good computational performance and fast convergence in order to be applicable in real scenarios. This feature becomes even more relevant in nonstationary environments, since it is necessary to update the ensemble each time new data become available or when some change is detected in the data. Thus, this step must be fast so as not to compromise the overall performance of the model. To deal with this issue, an interesting and still little-explored strategy in the literature related to neuroevolutionary models is the use of quantum-inspired evolutionary algorithms. This is a class of evolutionary algorithms developed to achieve better performance in computationally intensive problems, inspired by quantum computing principles [17,18,2,39,52,8]. One of the main advantages of the quantum-inspired evolutionary models is that good solutions are obtained with a small number of evaluations. This class of algorithms has been previously used in the literature to solve combinatorial and numerical optimization problems, based on binary [18,39] and real representations [2,39,52], providing better results and using less computational effort than classical genetic algorithms [47]. Applied to neural network ensembles, quantum-inspired evolutionary algorithms can be used to model the neural networks and to determine the voting weights for each ensemble member. Thus, each time a new block of data arrives, the ensemble can be optimized, improving its classification performance for the new data.
Models for learning in nonstationary environments may or may not contain drift detection mechanisms. Most of the models found in the literature assume that the changes occur in a hidden context external to the model itself and, therefore, that the drift cannot be predicted [15]. For this reason, these models use passive or reactive approaches: they verify the occurrence of drift from the results of the model (in classification problems, the label predicted by the model is compared with the real label received) and react to it only after the error is observed. However, detecting drift in the input data before they are submitted for prediction (i.e., before receiving the true labels) seems to be a more satisfactory approach, since it allows the model to be adjusted in advance to better deal with the new scenario and avoid the classification error. For this reason, the model proposed in this work can use this proactive approach, which is an important differential compared to the existing approaches in the literature.
Given the above, the main objective of this work is to propose and develop a self-adaptive and flexible model, with good accuracy and suitable for learning in nonstationary environments. A new quantum-inspired neuroevolutionary model, based on a Multi-Layer Perceptron (MLP) neural network ensemble, will be presented for learning in nonstationary environments. The proposed model, called NEVE (Neuroevolutionary Ensemble), has the following characteristics:
- Contains a concept drift detection mechanism, with the ability to detect changes proactively or reactively. This method, already detailed in [10], allows the reaction and adjustment of the model whenever necessary;
- Performs the automatic generation of new classifiers for the ensemble, most suitable for the new input data, using the quantum-inspired evolutionary algorithm for numerical and binary optimization (QIEA-BR) [39];
- Automatically determines the voting weights of each ensemble member, using the quantum-inspired evolutionary algorithm for numerical optimization (QIEA-R) [2,52], a simplified version of QIEA-BR.
Several experiments were performed with artificial and real datasets to validate and compare the performance of the proposed model with other existing models for learning in nonstationary environments, verifying how the detection model affects the performance and accuracy of NEVE.
This work is structured in four additional sections. Section 2 presents a brief review of the literature related to the fundamentals of concept drift. It also describes the evolutionary models with quantum inspiration used in this work: QIEA-R and QIEA-BR. Section 3 presents the proposed neuroevolutionary model (NEVE) and Section 4 discusses the experimental results. Finally, Section 5 presents the conclusions of this work and possibilities of future work.
2 Literature review

Concept drift
The term concept drift can be defined informally as a change in the concept definition over time and, hence, a change in its distribution. Concept drift refers to a supervised learning scenario, where the relationship between the input data and the target variable changes over time [15]. An environment from which this kind of data is obtained is considered a nonstationary environment. Formally speaking, considering the posterior probability of a sample x belonging to a class y, according to [9] concept drift is any scenario in which this probability changes over time, that is: $P_{t+1}(y \mid x) \neq P_t(y \mid x)$.
A practical example of concept drift mentioned in [29] is detecting and filtering out spam e-mails. The description of the two classes "spam" and "non-spam" may vary over time. They are user specific, and user preferences also change over time. Moreover, the variables used at time t to classify spam may be irrelevant at t + k. In this way, the classifier must deal with "spammers", who will keep creating new forms to trick the classifier into labeling a spam as a legitimate e-mail.
Concept drift is usually classified as abrupt or gradual [15,51,54]. Abrupt drift occurs when a concept A is abruptly switched for another concept B, that is, at time t the source S1 is suddenly replaced by S2. Gradual drift, on the other hand, happens when a concept A is gradually exchanged for the other concept B. In this case, while there is no definitive change from concept A to concept B, we observe more and more occurrences of B and fewer occurrences of A. Both sources S1 and S2 are active, but as time passes, the probability of sampling from source S1 decreases as the probability of sampling from source S2 increases. At the beginning of this drift, before more instances are observed, an instance from source S2 can easily be mistaken for random noise. It is important to note that noise (or an outlier) is not considered a type of drift because it refers to an anomaly or an isolated random occurrence. In this case, there is no need to adapt the model, which should be robust to noise.
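The difference between the two kinds of drift can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function name `drifting_stream`, its parameters and the linear growth of the probability of sampling S2 are hypothetical choices, not part of the model described in this work.

```python
import random


def drifting_stream(n, drift_start, drift_end, rng):
    """Yield the source ("S1" or "S2") that generates each of n instances.

    Before drift_start only S1 is sampled; from drift_end onward only S2.
    In between, the probability of sampling S2 grows linearly (gradual drift,
    linear schedule assumed). drift_start == drift_end gives an abrupt drift.
    """
    for t in range(n):
        if t < drift_start:
            p_s2 = 0.0
        elif t >= drift_end:
            p_s2 = 1.0
        else:
            p_s2 = (t - drift_start) / (drift_end - drift_start)
        yield "S2" if rng.random() < p_s2 else "S1"


rng = random.Random(0)
abrupt = list(drifting_stream(10, 5, 5, rng))          # S1 replaced by S2 at t = 5
gradual = list(drifting_stream(1000, 200, 800, rng))   # both sources active for 200 <= t < 800
```

In the gradual case, the first few S2 instances appear among many S1 instances, which is why they can initially be mistaken for noise.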
The term "Drift Detection" refers to techniques and mechanisms for detecting drift by identifying points of change or small intervals during which the variations occur. In this case, the environment has changed sufficiently that the existing models can no longer effectively predict the behavior of the current data [15]. Several drift detection mechanisms have already been proposed in the literature, but most of them work reactively: they compare the class predicted by the classifier to the correct class label received later, noticing the drift only after its occurrence and the misclassification. Only then does the reactive detector apply a sequence of procedures to identify some change in the conditional class distribution, that is, a concept drift. Examples of reactive detectors can be found in [14,5,36,4,42,3,31,23,13,46].
Few papers use a proactive approach. [28] applies principal component analysis (PCA) for feature extraction before the drift detection. The authors discuss and show evidence that components with lower variance should be kept as the extracted features, since they are more likely to be affected by a change. The authors then choose a change detection criterion based on the semiparametric log-likelihood function that is sensitive to changes in the mean and variance of the multidimensional distributions.
In [10], we proposed a new drift detection mechanism, called DetectA (Detect Abrupt Drift), which uses a proactive detection approach. This model is used in the experiments of this work and comprises three basic steps: (i) label the patterns from the test set (an unlabelled data block), using an unsupervised method; (ii) compute some statistics from the training and test sets, conditioned to the given class labels provided in the training set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect the drift on the test set, before the real labels are obtained.
Algorithms for handling concept drift problems can be categorized in several ways. Table 1, based on [9,27,29,30], summarizes the most commonly used classifications in the literature, with their respective definitions.
Algorithms that use the passive approach (without drift detection) regularly update the model as new data arrives and a forgetting heuristic is used, independently of the existence of change. For example: in a classifier ensemble, the weights of the members are updated after each new data received (individual or in blocks), based on the recent accuracy of ensemble members. Without concept drift, the classification accuracy will be stable and the weights will converge. If any changes occur, the weights will change to reflect them, without the need for explicit detection [29].
However, this can be very costly if the amount of data that arrives is excessively large or if the application requires user feedback to label the data, which can be time-consuming. One way to reduce this problem is to use special techniques to detect changes and adapt the model only when unavoidable, using the active approach [51], also called the trigger approach. In general, when active approaches detect a drift, some action is taken, for example, configuring a window with the latest data and retraining the classifier, or adding a new classifier to the ensemble.
Thus, the active method seeks to point out when the drift occurred and allows the model to modify itself or continue learning in the same way. A disadvantage of this method is the risk of having an imperfect mechanism that can produce false alarms, which is very common particularly in noisy datasets. In the passive mechanism, the learner believes that the environment can change at any time or can be continuously changing. The algorithm then continues to learn from the environment, building and organizing its knowledge base. If a change has occurred, it is learned. If nothing happened, the existing knowledge is reinforced [9]. The majority of ensembles in the literature follow a passive schema of adaptation, whereas active approaches are usually used with single online classifiers [27]. The models [24,26,48] are examples of passive approaches and the models [14,5,36-38,32] are examples of active approaches.
Regarding data entry, it is worth emphasizing that individual patterns can be converted into batches or blocks of data. The opposite is also possible, but block data can come in large quantities, making instance-based processing very time-consuming [29].
Comparing single classifier x ensemble approaches, ensemble-based approaches are newer and tend to have better accuracy, flexibility, and efficiency than those using a single classifier [29]. It is important to remember that in massive datasets it is often preferred to use simple models, such as single classifiers, since there may not be time to execute and update an ensemble. On the other hand, some authors argue that a simple ensemble may be easier to use than certain simple adaptive classifiers, such as decision trees. When time is not the main concern, but high accuracy is required, an ensemble becomes the natural solution. For example, in mammography screening for tumors, it is acceptable to take a few minutes per image [30]. Ensemble approaches can use different methods to adapt to a concept drift.
As mentioned earlier, responding to several types of concept drift is a difficult task for a simple classifier. For this reason, several systems based on classifier ensembles have recently been proposed to deal with concept drift learning, such as [49, 48, 11, 12, 44, 24-26, 45, 9, 33, 53, 6, 50]. The main novelty proposed in this work is the possibility of using an active drift detection mechanism (DetectA) together with an ensemble of neural networks, trained and combined through quantum-inspired evolutionary algorithms, allowing automatic and dynamic adjustment of the classifiers and their weights in the ensemble, using less computational time.

Quantum-inspired evolutionary algorithms
Classical evolutionary algorithms have been used successfully to solve complex optimization problems in a wide range of fields, such as automatic circuit design and equipment, task planning, software engineering and data mining, among many others [1,2]. Among the advantages of this class of algorithms are that they do not require rigorous mathematical formulations of the problem to be optimized and that they offer a high degree of parallelism in the search process.
However, some problems are computationally costly regarding the evaluation of the fitness function during the search process, making optimization by evolutionary algorithms a slow process for situations where a fast response is desired (as in online optimization problems). In order to address these issues, quantum-inspired evolutionary algorithms have been developed. They are a class of estimation of distribution algorithms that perform better in combinatorial and numerical optimization than their homologous canonical genetic algorithms [1,2,8,17,18,39,52].
These algorithms are inspired by concepts of quantum physics, in particular the concept of superposition of states, and were initially developed for optimization problems using binary representation, such as the Quantum-Inspired Evolutionary Algorithm (QIEA-B) [17-20], which uses a chromosome formed by q-bits. Each q-bit consists of a pair of numbers $(\alpha, \beta)$, where $|\alpha|^2 + |\beta|^2 = 1$. The value $|\alpha|^2$ indicates the probability that the q-bit has value 0 when observed, while the value $|\beta|^2$ indicates the probability that the q-bit has value 1 when observed. Thus, in QIEA-B, a quantum individual is formed by M q-bits, according to (1):

$$q = \begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_M \\ \beta_1 & \beta_2 & \cdots & \beta_M \end{bmatrix} \qquad (1)$$

where $|\alpha_i|^2 + |\beta_i|^2 = 1$ for i = 1, 2, 3, ..., M.
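The observation of a q-bit chromosome can be sketched in a few lines. This is a minimal illustration, assuming each q-bit is stored as an (alpha, beta) pair; the function name `observe` and the example individual are hypothetical and are not taken from the QIEA-B references.

```python
import math
import random


def observe(q_individual, rng):
    """Collapse each q-bit (alpha, beta) to a classical bit.

    The bit is 1 with probability |beta|^2 and 0 with probability |alpha|^2.
    """
    bits = []
    for alpha, beta in q_individual:
        # Each q-bit must be normalised: |alpha|^2 + |beta|^2 = 1
        assert math.isclose(abs(alpha) ** 2 + abs(beta) ** 2, 1.0)
        bits.append(1 if rng.random() < abs(beta) ** 2 else 0)
    return bits


rng = random.Random(42)
h = 1.0 / math.sqrt(2.0)
# First q-bit always observes 0, second always 1, third is an equal superposition
q = [(1.0, 0.0), (0.0, 1.0), (h, h)]
samples = [observe(q, rng) for _ in range(2000)]
```

Each call to `observe` yields one classical binary individual, so the same quantum individual can generate many different candidate solutions.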

Table 1
Active
  Uses some drift detection mechanism, learning only when the drift is detected.
Individual input x Input in blocks
  Individual: Learn one instance at a time. They have better plasticity but poorer stability properties. They also tend to be more sensitive to noise as well as to the order in which the data are presented.
  In blocks: Requires blocks of instances to learn. They benefit from the availability of larger amounts of data, have better stability properties, but can be ineffective if the batch size is too small, or if data from multiple environments are present in the same batch. Typically use some form of windowing to control the batch size.
Single classifier x Ensemble
  Single classifier: Uses only one classifier.
  Ensemble: Combines multiple classifiers.
Quantum-inspired evolutionary algorithms were then extended to real representation, to better deal with numerical optimization problems. In these problems, the direct representation is more appropriate, in which real numbers are directly encoded in a chromosome rather than converting binary strings into numbers. With real numerical representation, the memory demand is reduced while the precision is increased [1]. Thus, the Quantum-Inspired Evolutionary Algorithm with Real Representation (QIEA-R) was developed [1,2], inspired by the concept of multiple universes of quantum physics. In this scenario, the algorithm allows performing the optimization process with a smaller number of evaluations, substantially reducing the computational cost. The next sections describe the QIEA-R and QIEA-BR models, which are better suited to neuroevolution.

Quantum-inspired evolutionary algorithm with real representation (QIEA-R)
Originally proposed in [1], this algorithm was used to solve numerical optimization benchmark problems and the neuroevolution of recurrent neural networks. The results obtained demonstrated the efficiency of this algorithm in the solution of these types of problems.
In QIEA-R, the quantum population Q(t) consists of N quantum individuals $q_i$ (i = 1, 2, 3, ..., N), which are composed of G quantum genes. Each quantum gene is formed by a probability density function (PDF), which represents the superposition of states and is used to observe the classical gene. Quantum individuals can be represented by:

$$q_i = [\, p_{i1}(x),\; p_{i2}(x),\; \ldots,\; p_{iG}(x) \,] \qquad (2)$$

where i = 1, 2, 3, ..., N, j = 1, 2, 3, ..., G and the $p_{ij}$ functions represent the probability density functions used by the QIEA-R to generate the values for the genes of the classical individuals. In other words, the $p_{ij}(x)$ function represents the probability density of observing a given value for the quantum gene when its superposition is collapsed. The probability density function used by [1] is the square pulse, a uniform function of simple geometry, which can be defined by eq. 3:

$$p_{ij}(x) = \begin{cases} \dfrac{1}{U_{ij} - L_{ij}}, & L_{ij} \le x \le U_{ij} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where $L_{ij}$ is the lower limit and $U_{ij}$ is the upper limit of the interval in which the gene j of the i-th quantum individual can collapse, i.e., assume values when observed. For the case where $p_{ij}(x)$ is a square pulse, the quantum gene can be represented by storing the position of the center point of the square pulse and its width: $\mu_{ij}$ and $\sigma_{ij}$, respectively.
The QIEA-R also uses a population of quantum individuals, which are observed to generate the classical individuals. The updating of the quantum individuals is carried out based on the evaluation of the classical individuals: $\mu_{ij}$ and $\sigma_{ij}$ are altered in order to bring the pulse to the most promising region of the search space, increasing the probability of observing a certain set of values for the classical gene in the vicinity of the most successful individuals in the classical population. The pseudocode of the QIEA-R algorithm is shown in Appendix 1.
In this work, the QIEA-R is used to evolve voting weights for each classifier member of the ensemble and thus determine the final decision of the ensemble. In this way, the chromosome will have size n, where n represents the number of ensemble members. Each gene, in turn, will represent the voting weight associated with each classifier. Further details on QIEA-R can be found in [1,2,52].
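The square-pulse representation described above can be sketched as follows. This is only an illustration: the `update` rule (move the pulse center toward the best classical gene and shrink its width by fixed factors) and the placeholder fitness are assumptions made for the example, and the actual QIEA-R update is the one given in Appendix 1 and in [1,2,52].

```python
import random


def pulse_pdf(x, mu, sigma):
    """Square-pulse density centred at mu with width sigma."""
    low, high = mu - sigma / 2.0, mu + sigma / 2.0
    return 1.0 / sigma if low <= x <= high else 0.0


def observe(q_individual, rng):
    """Generate one classical individual from a list of (mu, sigma) quantum genes."""
    return [rng.uniform(mu - sigma / 2.0, mu + sigma / 2.0) for mu, sigma in q_individual]


def update(q_individual, best, rate=0.5, shrink=0.9):
    """Hypothetical update: move each pulse toward the best gene value and narrow it."""
    return [(mu + rate * (b - mu), sigma * shrink)
            for (mu, sigma), b in zip(q_individual, best)]


rng = random.Random(1)
q = [(0.5, 1.0)] * 3                      # three voting weights, each uniform on [0, 1]
population = [observe(q, rng) for _ in range(5)]
best = max(population, key=sum)           # placeholder fitness, for illustration only
q_next = update(q, best)
```

In the voting-weight setting, the placeholder fitness would be replaced by the ensemble accuracy obtained with each observed weight vector.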

Quantum-inspired evolutionary algorithm with binary-real representation (QIEA-BR)
The main motivation for creating an algorithm with mixed representation is that many real problems cannot be solved only by numerical decisions or combinatorial decisions. More specifically in the field of neural networks, the modeling process may involve combinatorial decisions (selection of the most relevant variables to the input layer, how many neurons should be used in the middle layer, etc.) and, simultaneously, numerical decisions (optimal values for synaptic weights).
With this motivation, [40] proposed an algorithm with quantum inspiration and binary-real representation, called QIEA-BR, for the simultaneous optimization of combinatorial and numerical problems, that is, problems of mixed nature. The QIEA-BR algorithm was the first evolutionary algorithm with quantum inspiration and mixed representation proposed in the literature, and it inherits the main characteristics of its precursors, such as global problem-solving ability and probabilistic representation of the search space. This mixed representation results in high population diversity in each quantum individual and in the need for fewer individuals in the population to explore the search space.
The QIEA-BR algorithm also requires a population of quantum individuals that represents the superposition of possible states that the classical individuals can assume when observed. The quantum population Q(t), at any instant t of the evolutionary process, is formed by a set of N quantum individuals $q_i$ (i = 1, 2, 3, ..., N). Each quantum individual $q_i$ of this population is formed by L genes $g_{ij}$ (j = 1, 2, 3, ..., L). The main difference between the QIEA-BR and its predecessors is that part of the L genes is represented by q-bits, similar to QIEA-B, and another part by real quantum genes (q-real, similar to QIEA-R). Thus, the representation of a quantum individual i at any time instant t is given by:

$$q_i(t) = [\, q_i(t)_b \;\; q_i(t)_r \,]$$

where the index b represents the binary part (q-bit) and the index r represents the real part (q-real). Thus a quantum individual can be described by:
In this work, the QIEA-BR is used to perform the complete modeling of an artificial MLP neural network. The binary part selects the most appropriate input variables; defines which neurons (of a maximum number of neurons) are active in the hidden layer (1 active neuron, 0 inactive); and specifies the activation function of each neuron in the network (1 hyperbolic tangent and 0 sigmoid). The real part determines the values of all weights. Figure 1 illustrates the information that is encoded in each of the quantum genes, binary or real, of a QIEA-BR chromosome. This chromosome will be used in the neuroevolutionary models presented in Section 3.
In QIEA-BR, the evolution of the weights and activation function of a certain neuron in the quantum and classical chromosomes is conditioned to that neuron being active in the corresponding binary part. That is, the genes representing the weights and activation functions will remain unchanged by quantum and classical evolutionary process if this neuron is inactive.
The neural network created by QIEA-BR is similar to that shown in Fig. 2: the effective number of attributes in the input layer and of neurons in the hidden layer are evolved by the QIEA-BR, with the maximum size of inputs equal to the number of available attributes in the dataset (k) and the maximum number of neurons in the hidden layer (nh) configured by the user.
Thus, the number of genes is given by: where nc is the number of classes in the classification problem. In this case, the evaluation function used is the classification accuracy given by: where $C_i$ is the class of the i-th pattern, while $\hat{C}_i$ is the class predicted by the individual (MLP). When $C_i = \hat{C}_i$ the result is zero; otherwise it is equal to one. Each individual is submitted to this evaluation function, in such a way that the best individuals are those that have greater accuracy. Further details on QIEA-BR can be found in [40,52].
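The decoding of a classical individual into an MLP and its evaluation by classification accuracy can be sketched as below. The gene layout is an assumption made for illustration (k input-selection bits, nh hidden-activity bits, nh + nc activation bits, followed by biases and weights in the real part), as are the function names; the actual layout is the one illustrated in Figure 1 and detailed in [40,52].

```python
import math


def activate(z, bit):
    """Bit 1 selects the hyperbolic tangent, bit 0 the sigmoid."""
    return math.tanh(z) if bit == 1 else 1.0 / (1.0 + math.exp(-z))


def predict(bits, reals, x, k, nh, nc):
    """Decode one classical individual into an MLP and classify pattern x."""
    in_mask = bits[:k]                        # 1 = input variable selected
    hid_mask = bits[k:k + nh]                 # 1 = hidden neuron active
    act_bits = bits[k + nh:k + 2 * nh + nc]   # hidden activations, then output activations
    w_hid = reals[:nh * (k + 1)]              # per hidden neuron: bias + k weights (assumed layout)
    w_out = reals[nh * (k + 1):]              # per output neuron: bias + nh weights (assumed layout)

    hidden = []
    for h in range(nh):
        if hid_mask[h] == 0:
            hidden.append(0.0)                # inactive neurons contribute nothing
            continue
        row = w_hid[h * (k + 1):(h + 1) * (k + 1)]
        z = row[0] + sum(w * m * xi for w, m, xi in zip(row[1:], in_mask, x))
        hidden.append(activate(z, act_bits[h]))

    outputs = []
    for c in range(nc):
        row = w_out[c * (nh + 1):(c + 1) * (nh + 1)]
        z = row[0] + sum(w * hv for w, hv in zip(row[1:], hidden))
        outputs.append(activate(z, act_bits[nh + c]))
    return outputs.index(max(outputs))        # class with the highest output


def accuracy(bits, reals, X, y, k, nh, nc):
    """Fraction of patterns whose predicted class matches the true class."""
    errors = sum(1 for xi, yi in zip(X, y) if predict(bits, reals, xi, k, nh, nc) != yi)
    return 1.0 - errors / len(y)
```

Because inactive hidden neurons and unselected inputs are masked out, the same fixed-length chromosome can represent networks of different effective sizes.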

NEVE: Neuroevolutionary Model for Learning in Nonstationary Environments
This section presents the proposed new quantum-inspired neuroevolutionary model, which is a self-adaptive and flexible model based on an ensemble of MLP neural networks, where each neural network member is trained and has its parameters (topology, weights, among others) optimized by the QIEA-BR algorithm (see Section 2). This neuroevolutionary model is called NEVE (Neuroevolutionary Ensemble) and is composed of three main modules, detailed below and illustrated in Fig. 3:
- Drift Detection;
- Classifier Creation;
- Evaluation and combination weights.
The Drift Detection module is optional. If activated, for each new input data block received, the detection module checks if any drift has occurred. The model works with data blocks of configurable size. If it is necessary (or desired) to work with individual data inputs, the block size can be set to 1. However, it is important to mention that the strategy of working with one instance at a time is not the most suitable for this model, as it may compromise its computational performance. Two methods of detection were proposed: proactive and reactive detection methods, resulting in four different approaches implemented for this drift detection module.
The Classifier Creation Module is responsible for creating a new classifier, which may or may not be added to the ensemble, depending on its maximum size defined by the user. It is worth mentioning that the decision to create a new neural network is linked to the drift detection mechanism used, which will be detailed in the following subsections. If created, the new classifier is added to the ensemble if space is available or by replacing an older classifier of worse accuracy. This approach gives the ensemble the ability to learn the new data without having to analyze the old data, as well as allowing it to forget the data that is no longer needed. In short, the classifier creation module determines the complete configuration of the new MLP network ensemble member using the QIEA-BR algorithm (presented in Section 2). The algorithm selects the most relevant input variables, specifies the number of neurons in the hidden layer (respecting the maximum limit configured by the user), and determines the weights and activation functions of each neuron. The number of output neurons is equal to the number of classes in the application.
Finally, the Evaluation Module is responsible for determining the final response of the classifier ensemble by combining the results presented by the classifier members. The QIEA-R algorithm is used to determine the most suitable voting weight for each classifier dynamically. The optimization of weights allows the model to easily adapt to sudden data changes by assigning higher weights to the classifiers best suited to the current concepts that govern the data. Three possible voting methods were implemented:
- Linear Combination: It uses the QIEA-R algorithm to generate a voting weight for each classifier, which is multiplied by the output of each ensemble member (between 0 and 1), in a weighted average. The result of this weighted average is used to determine the ensemble response. If the problem has only two classes, the output is assigned to class 0 if the result is less than 0.5 and to class 1 otherwise; in the case of problems with multiple classes, the class will be the one that presents the output with the highest value;
- Weighted Majority Voting: As in the previous case, it uses the QIEA-R algorithm to generate a voting weight for each classifier. However, the outputs of the neurons from each ensemble network are first rounded (to values 0 or 1) and then multiplied by the corresponding classifier weight, thus forming a weighted average. Similar to the linear combination, in problems with only two classes, the output is defined as class 0 if the result of the weighted average is less than 0.5 and as class 1 otherwise; in the case of problems with multiple classes, the class associated with the output with the highest value is chosen;
- Simple Majority Voting: The output of each ensemble member is rounded to one of the possible classes, and the final ensemble output is the most chosen class among all classifiers. In this case, there is no need to determine voting weights.
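The three voting methods can be sketched as follows. This is a minimal illustration under stated assumptions: each member's output is a list of values in [0, 1], a two-class problem is represented by a single output value, the voting weights are positive, and all function names are hypothetical.

```python
from collections import Counter


def decide(values):
    """Threshold at 0.5 for a single value (two classes), argmax otherwise."""
    if len(values) == 1:
        return 0 if values[0] < 0.5 else 1
    return values.index(max(values))


def weighted_average(outputs, weights):
    """Weighted average of the members' outputs, computed per output neuron."""
    total = sum(weights)
    return [sum(w * o[c] for w, o in zip(weights, outputs)) / total
            for c in range(len(outputs[0]))]


def linear_combination(outputs, weights):
    return decide(weighted_average(outputs, weights))


def weighted_majority(outputs, weights):
    # Outputs are rounded to 0 or 1 before being weighted
    rounded = [[1.0 if v >= 0.5 else 0.0 for v in o] for o in outputs]
    return decide(weighted_average(rounded, weights))


def simple_majority(outputs):
    # Each member votes for one class; no voting weights are needed
    votes = Counter(decide(o) for o in outputs)
    return votes.most_common(1)[0][0]


# Two-class example: one confident member against two weakly opposed members
outputs = [[0.95], [0.45], [0.45]]
weights = [1.0, 1.0, 1.0]
```

With these example values the linear combination follows the confident member, while both majority schemes follow the two members that lean toward class 0, which shows how rounding before combining changes the decision.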
In summary, considering the detection mechanism used, four variations of the NEVE model are proposed:
- ND-NEVE, without detection;
- RD-NEVE, with reactive detection;
- PDGL-NEVE, with proactive detection and the Group Label approach;
- PDPMS-NEVE, with proactive detection and the Pattern Mean Shift approach.
The following subsections detail each of the four variations. For each one, an explanatory text and the pseudocode of the algorithm are presented.

ND-NEVE (without detection)
The first variation of NEVE, "NEVE without Detection" (ND-NEVE), as the name implies, does not use any detection mechanism. It consists of an ensemble of MLP neural networks that, with each new data block received, trains a new MLP that can be added to the ensemble if space is available.
The operation of ND-NEVE can be generalized as: when a data block t arrives (without the class labels), a new MLP network is trained using the QIEA-BR algorithm and t-1 block with the real class labels. The new network is provisionally added to the ensemble and the ensemble is tested with block t. Voting weights of all networks are determined using the QIEA-R algorithm and block t-1. The final ensemble classification is calculated using the test results with block t, the voting weights and the chosen voting method. Finally, we assume that the actual labels of block t become available and then, the permanence of the new network in the ensemble is evaluated. The pseudocode of the ND-NEVE is demonstrated in Appendix 2.
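The control flow described above can be sketched as follows. This is not the pseudocode of Appendix 2: the callables `train_member`, `fit_weights` and `vote` are hypothetical stand-ins for QIEA-BR training, QIEA-R weight optimization and the chosen voting method, models are assumed to be plain callables mapping a pattern to a class, and a block is assumed to be a pair (X, y).

```python
def accuracy(model, block):
    """Fraction of patterns in a labelled block (X, y) that the model classifies correctly."""
    X, y = block
    return sum(1 for xi, yi in zip(X, y) if model(xi) == yi) / len(y)


def nd_neve_step(ensemble, prev_block, X_t, train_member, fit_weights, vote):
    """Classify unlabelled block t using a candidate trained on labelled block t-1."""
    candidate = train_member(prev_block)             # stand-in for QIEA-BR
    provisional = ensemble + [candidate]             # candidate added provisionally
    weights = fit_weights(provisional, prev_block)   # stand-in for QIEA-R
    predictions = [vote(provisional, weights, x) for x in X_t]
    return candidate, predictions


def retain(ensemble, candidate, block, max_size):
    """Once labels of block t arrive, decide whether the candidate stays."""
    if len(ensemble) < max_size:
        return ensemble + [candidate]
    worst = min(ensemble, key=lambda m: accuracy(m, block))
    if accuracy(candidate, block) > accuracy(worst, block):
        return [m for m in ensemble if m is not worst] + [candidate]
    return ensemble
```

Splitting the step into a prediction phase and a retention phase mirrors the assumption that the real labels of block t only become available after the ensemble has classified it.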

RD-NEVE (with reactive detection)
The second variation of NEVE is "RD-NEVE (with reactive detection)". This variation uses the reactive detection mechanism, detailed in [10]. For each new data block received, the ensemble classifies it and, as soon as the real class labels are obtained, the detection mechanism checks if a drift has occurred from the previous data block. If so, a new MLP is created, which is added to the ensemble if space is available.
The operation of RD-NEVE can be generalized as follows: when a data block t arrives, the voting weights of all ensemble members are determined using the QIEA-R algorithm and block t-1. The ensemble is tested with block t and the classification results are combined with the weights calculated by QIEA-R, using the chosen voting method to determine the final ensemble classification. It is assumed that the real labels of block t become available later, so that the reactive detection can be applied [10]. If a drift has occurred in block t, a new MLP network is created using the QIEA-BR algorithm and trained with block t. The new network is added to the ensemble if space is available or if it is better than at least one of the old networks, replacing it in the ensemble. The pseudocode of RD-NEVE is given in Appendix 2.
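The admission rule for a newly trained network can be sketched as follows. This is only an illustration under stated assumptions: `update_ensemble` and `score` are hypothetical names, `score` stands for whatever quality measure is used to compare networks (for example, accuracy on the latest labelled block), and replacing the worst member is an assumption, since the text only says the new network replaces one it outperforms.

```python
def update_ensemble(ensemble, candidate, score, max_size=None):
    """Return the member list after considering `candidate`.

    ensemble : list of current members
    candidate: newly trained member
    score    : callable mapping a member to a quality value (hypothetical)
    max_size : maximum ensemble size; None means unlimited
    """
    # Space available: the candidate is simply added.
    if max_size is None or len(ensemble) < max_size:
        return ensemble + [candidate]
    # Ensemble full: find the worst current member (assumed replacement target).
    worst = min(range(len(ensemble)), key=lambda i: score(ensemble[i]))
    if score(candidate) > score(ensemble[worst]):
        return ensemble[:worst] + [candidate] + ensemble[worst + 1:]
    # Candidate is not better than any member: ensemble is unchanged.
    return ensemble

# Toy usage with member names and made-up scores.
scores = {"a": 0.7, "b": 0.5, "c": 0.6, "d": 0.4}
print(update_ensemble(["a", "b"], "c", scores.get, max_size=2))
```

Returning a new list instead of mutating the input keeps the provisional and confirmed ensembles separate, which matches the "provisionally added" wording used for some variations.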

PDGL-NEVE (with proactive detection and Group Label approach)
The third variation of NEVE is "PDGL-NEVE (with proactive detection and Group Label approach)". This variation uses the proactive mechanism of detection [10], where each new data block is clustered, using the centroids of the previous data block as the initial centroids of the algorithm. Based on the clustering results, the detection mechanism checks if a drift has occurred from the previous data block; if so, the model trains a new MLP with the new data block and the class labels suggested by the clustering algorithm.
The operation of PDGL-NEVE can be summarized as follows: when block t arrives, its instances are clustered using the real classes of block t-1 as the initial suggestion of centroids, since the real class labels of block t are still unknown. Then, the detection mechanism verifies whether a drift has occurred in block t in relation to block t-1. If so, a new MLP network is created using the QIEA-BR algorithm and trained using block t with the class labels provided by the clustering algorithm. The new network is provisionally added to the ensemble, which is tested with block t. The voting weights for all networks are determined using the QIEA-R algorithm and block t, also with the labels provided by the clustering algorithm. The classification results and weights are combined using the chosen voting method to determine the final ensemble classification. It is assumed that the real labels of block t become available later and the initial centroids for the next clustering are updated, now considering the real class labels of the data block. The permanence of the new network in the ensemble is evaluated: it stays if space is available or if it is better than at least one of the old networks, replacing it in the ensemble. The pseudocode of PDGL-NEVE is given in Appendix 2.
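The clustering step of the Group Label approach can be sketched in Python as below. The text does not name the clustering algorithm, so k-means is used here as an assumption; `class_centroids` and `seeded_kmeans` are hypothetical helper names, and the drift test of [10] is deliberately left out because its criterion is not described in this section.

```python
import numpy as np

def class_centroids(X, y, n_classes):
    # One centroid per class, computed from a labelled block (block t-1).
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def _assign(X, centroids):
    # Index of the nearest centroid for every instance.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def seeded_kmeans(X, init_centroids, n_iter=20):
    # Plain k-means (assumed algorithm), seeded with the previous block's centroids.
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        labels = _assign(X, centroids)
        for k in range(len(centroids)):
            if np.any(labels == k):  # keep the old centroid if a cluster is empty
                centroids[k] = X[labels == k].mean(axis=0)
    return _assign(X, centroids), centroids

# Block t-1 with real labels, and an unlabelled block t shifted by (1, 1).
X_prev = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_prev = np.array([0, 0, 1, 1])
X_new = X_prev + 1.0

seeds = class_centroids(X_prev, y_prev, n_classes=2)
suggested_labels, new_centroids = seeded_kmeans(X_new, seeds)
print(suggested_labels)
print(new_centroids)
```

Because cluster k starts from the centroid of class k, the cluster index can be read directly as the suggested class label for block t.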

PDPMS-NEVE (with proactive detection and Pattern Mean Shift approach)
The fourth variation of NEVE is "PDPMS-NEVE (with proactive detection and Pattern Mean Shift approach)". This variation also uses the proactive detection mechanism [10]. As in the previous variation, each new data block is clustered to verify whether a drift has occurred from the previous data block. If so, a new MLP is trained with the previous labeled data block, and the new data block is "adjusted" towards the previous data block. In other words, when a drift is detected, instead of creating a new MLP using the new data block (as performed by the Group Label approach), the old data block is used to train the network and the drift is "removed" from the new data block. While in the Group Label approach the new network is fitted to the new data, in the Pattern Mean Shift approach the new data is adjusted to the old network (trained with the old data). The pseudocode of PDPMS-NEVE is given in Appendix 2.
Briefly, the main difference between PDGL-NEVE and PDPMS-NEVE is that in PDPMS-NEVE, when a drift is detected, a new MLP is created using the previous labeled data block (and not the new data block with the labels provided in the grouping, as in the PDGL-NEVE). Then, the new data block is "adjusted" in the direction of the previous data block and it is submitted to the ensemble classification. In the PDGL-NEVE, on the other hand, the new data block is tested by the ensemble without adjusting the data. Additionally, in PDPMS-NEVE the data block used to determine the weights of each MLP is the old data block with the real labels, while in the PDGL-NEVE the new data block is used with the labels provided by the grouping.
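One possible reading of the "adjustment" step is sketched below. The exact procedure is defined in [10] and is not reproduced in this section, so the sketch rests on an assumption: each instance of the new block is translated by the difference between the previous and the new centroid of its cluster. The function name `pattern_mean_shift` and its arguments are hypothetical.

```python
import numpy as np

def pattern_mean_shift(X_new, cluster_labels, new_centroids, prev_centroids):
    """Move each instance of the new block towards the previous block.

    Assumption: the drift of a cluster is the displacement of its centroid,
    so subtracting that displacement "removes" the drift from the new data.
    """
    shift = np.asarray(prev_centroids, dtype=float) - np.asarray(new_centroids, dtype=float)
    # Each instance receives the shift of the cluster it belongs to.
    return np.asarray(X_new, dtype=float) + shift[np.asarray(cluster_labels)]

# Toy block whose two clusters drifted by (1, 1) relative to the previous block.
X_new = np.array([[1.0, 1.0], [1.0, 2.0], [6.0, 6.0], [6.0, 7.0]])
labels = np.array([0, 0, 1, 1])
new_centroids = np.array([[1.0, 1.5], [6.0, 6.5]])
prev_centroids = np.array([[0.0, 0.5], [5.0, 5.5]])

X_adjusted = pattern_mean_shift(X_new, labels, new_centroids, prev_centroids)
print(X_adjusted)
```

Under this assumption, the adjusted block can be classified by networks trained on the previous labelled block, which is the role the text assigns to the adjustment.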
This section presented the neuroevolutionary model for learning in nonstationary environments proposed in this paper and detailed its four variations. The next section describes the experiments performed with the proposed detection methods.

Experiments
To assess the ability of the proposed model to learn in nonstationary environments and also to verify the best variations and configurations of the models regarding accuracy and computational performance, five different datasets were used in different simulations and scenarios. For the experiments, the four variations of the proposed model (described in Section 3) were used: ND-NEVE, RD-NEVE, PDGL-NEVE and PDPMS-NEVE. All experiments were run using standard libraries of MATLAB, as well as its Neural Networks package to train the baseline networks.

Datasets description
The datasets used in the experiments are SEA Concepts (an artificial dataset in which the drifts are controlled) and four real datasets (Nebraska, Electricity, Cover Type and Poker Hand), where the exact moment at which a drift occurs is unknown.
The SEA Concepts dataset was artificially created by [49]. It is characterized by extensive periods without major changes in the environment, but with occasional abrupt drifts. The Nebraska dataset presents a compilation of climate measurements from the Offutt Air Force Base substation in Bellevue, Nebraska. Its objective is to predict whether it will rain, using data from the last 30 days. Both datasets are available in [41]. The Electricity dataset is extracted from the Australian New South Wales Electricity Market and the class label defines the price change relative to a moving average of the last 24 h. The purpose of the problem is to predict whether the price will go up or down. The Cover Type dataset contains information on cells of 30 × 30 meters of forest cover, extracted from the US Forest Service (USFS). Its goal is to predict the type of forest cover among seven possible values (therefore, a multi-class problem). The Poker Hand dataset has ten possible output categories, each representing a type of poker hand of 5 cards. The purpose is to identify the type of a poker hand among the ten possibilities. These datasets are available in [34]. Table 2 presents the main features of each dataset, as well as the block size and number of blocks used in the experiments.

Execution details
All executions begin at t = 0 and end when T consecutive data blocks have been presented for training and testing. Each block may be subject to different concept drift scenarios, with unknown rates and natures.
As detailed in Section 3, the QIEA-BR algorithm evolves the topology of each new neural network, which is created following the criteria of each variation of the proposed model. The number of input variables is selected by QIEA-BR among the available variables in each dataset. For all datasets, a single hidden layer was used, whose number of neurons is evolved by QIEA-BR, having a maximum value specified by the user. The number of neurons in the output layer is equal to the number of classes in each dataset. The synaptic weights and activation functions of the hidden layer and the output layer are also determined by QIEA-BR.
The parameters of the quantum evolutionary algorithms are the same as those used by [1,40]. The three voting methods detailed in Section 3 were evaluated: linear combination, weighted majority voting and simple majority voting. The maximum ensemble size is also a parameter defined by the user. Table 3 presents the configuration of the parameters used in all the experiments.
Thus, for each dataset, 72 different configurations of the model (4 × 3 × 3 × 2) were used, representing each possible combination of the parameters to be evaluated, as shown in Table 3. For each configuration, 30 simulations were performed and the average accuracy and computational time of these runs were calculated.
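The count of 72 configurations can be checked by enumerating the parameter grid. In the sketch below, the candidate values are inferred from the Results discussion (ensemble sizes of 5, 10 and unlimited; at most 5 or 10 hidden neurons) rather than copied from Table 3, so they should be read as assumptions.

```python
from itertools import product

variations = ["ND-NEVE", "RD-NEVE", "PDGL-NEVE", "PDPMS-NEVE"]
voting_methods = ["linear combination", "weighted majority", "simple majority"]
max_ensemble_sizes = [5, 10, None]   # None stands for the unlimited ensemble
max_hidden_neurons = [5, 10]

# Every combination of the four factors: 4 x 3 x 3 x 2.
configurations = list(product(variations, voting_methods,
                              max_ensemble_sizes, max_hidden_neurons))

runs_per_configuration = 30
print(len(configurations))                           # number of configurations per dataset
print(len(configurations) * runs_per_configuration)  # simulations per dataset
```

With 30 simulations per configuration, this grid amounts to 2160 simulations per dataset.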

Results
The experiments presented below aimed at investigating the differences in accuracy (the ratio of the number of correct predictions to the total number of input samples) and computational performance (execution time in seconds) among the four variations of the NEVE model, as well as the impact of the voting method, ensemble size and number of neurons in the hidden layer. Therefore, the objective of the experiment is to analyze how these modifications affect the results of the models for each dataset.
Tables 4, 5, 6 and 7 show the results of the experiments, considering accuracy and computational performance measured in seconds. It should be noted that execution time is provided only for the SEA Concepts, Nebraska and Electricity datasets. Due to the considerable size of the Poker Hand and Cover Type datasets, their execution required parallelization on several computers, making the comparison of runtime between simulations impracticable. In all cases, the observed standard deviation was less than 2%. We highlighted the best 20% of results in bold and gray and the worst 20% in italics and underlined. The analysis of Tables 4 to 7 shows that:
- In general, the ND-NEVE, RD-NEVE and PDPMS-NEVE approaches provided the best accuracy, while PDGL-NEVE had the worst accuracy;
- Considering computational performance, the ND-NEVE, RD-NEVE and PDPMS-NEVE approaches presented the best computational times, and the PDGL-NEVE approach, the worst. It was observed, however, that the dataset also has a great influence on this criterion: the slowest was Electricity, which is the dataset that has the highest number of attributes and also a greater number of blocks among the datasets evaluated;
- The best voting methods in terms of accuracy are, in this order: linear combination, weighted majority and simple majority. This shows that the quantum algorithm contributes positively to the accuracy of the model by determining the voting weights of the networks. Possibly, the early rounding performed in the weighted majority resulted in a lower average accuracy than that of the linear combination;
- As for computational performance, the best voting method was the simple majority, which was expected since this method does not require the determination of weights via the quantum algorithm;
- In general, the unlimited ensemble strategy has lower accuracy than the limited ensembles. There was no significant difference in accuracy between ensemble sizes of 5 and 10, which is a positive point, because the unlimited ensemble strategies also presented the worst computational performance, as expected. The unlimited ensemble tends to provide worse accuracy probably due to the increase in the search space of QIEA-R for determining the voting weights when there are too many networks: it is enough to observe that, in all the datasets used, there are at least 400 data blocks, which allows ensembles of 400 networks in the unlimited case;
- No substantial differences were observed either in the average accuracy or in the average computational performance between the strategies of at most 5 and at most 10 neurons in the hidden layer.
Figure 4 presents a comparative graph of the computational time for the three binary datasets: SEA, Nebraska and Electricity. It can be observed that the computational time of the ND-NEVE approach is higher than that of the others, whereas approaches with some type of detection present a similar mean computational time. This indicates that the proposed detection mechanism contributes to reducing the average execution time of the models.
The accuracy of the proposed NEVE approaches was also compared with the DWM [26], Learn++.NSE [9], RCD [16], EFPT [55] and AMANDA [56] models. We used 3 different drift detectors for the RCD algorithm: DDM [14], EDDM [5] and ECDD [42]. These simulations were carried out using MOA [35], an open source framework for data mining that includes several learning algorithms for classification, regression, clustering and concept drift detection, among others. For this comparison, we used the same block size chosen for the NEVE simulations for all the datasets. In order to make a more coherent comparison with NEVE and to discard the influence of the base classifier on the accuracy of the model, MLP neural networks were used as base classifiers in all other models. All the models were parameterized using the values indicated by the respective authors. Table 8 presents the results of the best configuration reached (in terms of accuracy) for each NEVE variation, compared to the results of the other models. We highlighted the best results, by dataset, in bold and underlined, the second best in bold and the worst in italics and underlined. When more than one value is highlighted, it means that there is no statistically significant difference in the performance of the classifiers at the 0.05 significance level, according to the Wilcoxon test. We made 30 runs for each possible configuration and each dataset. In all cases, the observed standard deviation was less than 2%.
We can see from Table 8 that the NEVE approaches obtained the best result in 2 datasets and the second best in the other 3. Apparently, the ND-NEVE and RD-NEVE approaches provide uniformly superior results in terms of accuracy. What is noticeable in this experiment, in general, is that the EFPT model is the main competitor of NEVE in terms of accuracy on the SEA, Nebraska and Electricity datasets (as the author did not perform tests with the Poker and Covtype datasets, we could not compare the models on these datasets) and the DWM model seems to be the main competitor of NEVE in terms of accuracy on the Poker and Covtype datasets. From the results presented, we can highlight that NEVE provides good results without the need for a detection method; however, by adding one, substantial gains in accuracy and computational performance can be obtained. This fact reinforces that the neuroevolutionary ensemble approach is a robust choice for situations in which datasets are subject to sudden behavioral changes.

Conclusion
This work presented a new quantum-inspired neuroevolutionary model, based on a multi-layer perceptron (MLP) neural network ensemble for learning in nonstationary environments, called NEVE (Neuro-EVolutionary Ensemble). This model can be used in conjunction with the DetectA concept drift detection model [10], which has the ability to detect changes both proactively and reactively. The use of Quantum-Inspired Evolutionary Algorithms in conjunction with NEVE allows the automatic generation of new classifiers for the ensemble (including the choice of topology, the most appropriate input variables and the weights) and the determination of the voting weights of each neural network in the ensemble.
Four different variations of NEVE were implemented: ND-NEVE (without detection), RD-NEVE (with reactive detection), PDGL-NEVE (with proactive detection and Group Label approach) and PDPMS-NEVE (with proactive detection and Pattern Mean Shift approach). These variations differ from each other in the way they detect and treat drifts, and were used in experiments with real and artificial datasets in order to evaluate which model variation and configurations achieved the best results. We varied the voting method, the maximum number of neurons in the hidden layer and the maximum size of the ensemble. It was found that the ND-NEVE, RD-NEVE and PDPMS-NEVE approaches produce the best results in terms of accuracy and computational performance. It was also observed that the linear combination is the best voting method in terms of accuracy, and simple majority voting the best in terms of computational performance. The unlimited ensemble strategy has worse accuracy and computational performance than limited ensembles, with no significant difference between ensembles of 5 and 10 networks.
Compared with other consolidated models of the literature, the accuracy of NEVE was found to be superior in most cases. It appeared that the ND-NEVE and RD-NEVE approaches provide uniformly superior results in terms of accuracy, but the addition of the detection method has, in some cases, resulted in substantial gains. This fact reinforces that the neuroevolutionary ensemble approach is a robust choice for situations in which datasets are subject to sudden behavioral changes.
As future work, we intend to integrate the creation of the neural network and the determination of voting weights into a single evolutionary process. We also intend to use NEVE in real applications, in order to validate its practical use, although it is very hard to know for sure whether a dataset contains concept drift or not.

Appendix 1 -Pseudocode of the QIEA-R algorithm
The pseudocode of the QIEA-R algorithm is shown as follows.
Adriano Soares Koshiyama received his BSc degree in Economics from UFRRJ and his MSc degree in Electrical Engineering from PUC-Rio. He is currently a PhD candidate in Computer Science at University College London (UCL), with his main research subject being Financial Computing and Analytics. His main research topics are related to Machine Learning, Statistical Methods, Optimization and Finance.