Automating model management: a survey on metaheuristics for concept-drift adaptation

This study provides an overview of the literature on automated adaptation of machine learning models via metaheuristics, in settings with concept drift. Drift-adaptation of machine learning models presents a high-dimensional optimisation problem; hence, stochastic optimisation via metaheuristics has been a popular choice for finding semi-optimal solutions with low computational costs. Traditionally, automated concept drift adaptation has mainly been studied in the literature on data stream mining; however, as data drift is prevalent in many areas, analogous solutions have been proposed in other fields. Comparing the conceptual solutions across multiple fields is thereby helpful for the overall progress in this area. The found literature is qualitatively classified in terms of relevant aspects of concept drift, adaptation/automation approach and type of metaheuristic. It is found that population-based metaheuristics are by far the most widely used optimisation methods across the domains in the retrieved literature. Methodological problems such as evaluation method and transparency in terms of concept drift type tested in the experiments are discovered and discussed. Over a ten-year period, the usage of metaheuristics in the found literature transitioned from automating single tasks in model development to full model selection in recent years. More transparency in terms of evaluation method and data characteristics is important for future comparison of solutions across drift types and patterns. Furthermore, it is proposed that future studies in this area evaluate the metaheuristics as models themselves, in order to enhance the general understanding of their performance differences in drift adaptation problems.


Introduction
Concept drift is a naturally occurring phenomenon, observed in many different fields (Žliobaitė et al. 2016; Maisenbacher and Weidlich 2017), such as security and police, financial services, telecommunications, marketing, retail, production, media and others. Concept drift refers to the change in distributions and relationships within the data (Gama et al. 2013). When drift occurs, a machine learning model cannot project the previously learnt relationships to the new reality, which leads to degrading predictive performance. Depending on the field of application, the consequences can in some cases be severe (Žliobaitė et al. 2016). As discussed in Schelter et al. (2018), models in production (providing predictions to end-users) will in these situations have to be re-trained using data from the new distribution. The amount of effort needed to reach previous performance levels might vary based on drift type, magnitude, and pattern, but is generally unpredictable. Re-training or re-developing machine learning models is in many cases performed manually by professional workers with high salaries and limited capacity (Polyzotis et al. 2018; Davenport and Patil 2012). From a business perspective, this can present a trade-off between maintenance and the development of new models. Full or partial automation of model maintenance is thereby more sustainable from a resource utilisation perspective. Tasks within automated model development (Feurer and Hutter 2019) and maintenance (Ghomeshi et al. 2019) generally consist of highly complex combinatorial optimization problems, where each step requires solving another computationally demanding optimization problem (called model training). Using exact methods is thereby either directly intractable or too costly. In this case, a group of algorithms called metaheuristics can be particularly useful, as they do not rely on assumptions about the problem structure, nor require perfect information (Bianchi et al. 2008).
These methods are not guaranteed to find a globally optimal solution, but rather aim to find a semi-optimal solution with minimal effort (being based on heuristics).
The use of metaheuristics is widespread across many application areas, such as business (Hemasian-Etefagh and Safi-Esfahani 2019), engineering (Tomoiagă et al. 2013), data stream mining (Ghomeshi et al. 2019) and automated machine learning (Feurer and Hutter 2019), amongst others. However, as the fields using metaheuristics for adaptation of machine learning models do not necessarily communicate, knowledge and findings might be fragmented. A general overview of the literature across the fields will therefore be beneficial in highlighting potential challenges in the area. Comparing the literature in terms of which optimization problem the metaheuristic aims to solve, what type of metaheuristic is used, which machine learning model is adapted, which type of concept drift is studied, and how the proposed solution is evaluated, might therefore help future research.
The contribution of this paper is thereby to: 1) help understand the general usage of metaheuristics within the literature on self-adapting machine learning models, 2) classify the use cases in terms of how the metaheuristic assists in self-adaptability, 3) compare the used methodology of performance evaluation in different settings of concept drift, and finally, 4) highlight challenges and recommend future directions of research using metaheuristics for drift-adaptation.

Research questions
Motivated by enhancing the understanding of the usage of metaheuristics for drift-adaptation across multiple fields, this study will retrieve relevant literature and analyse the suggested usage of metaheuristics, as well as their approach to evaluating the proposed algorithms, use-cases and/or frameworks. Finally, the development of trends in the found literature over time will be studied. To guide this literature review, a set of five research questions is proposed:
• RQ1: Which types of metaheuristics have been utilized for automated adaptation to concept drift?
• RQ2: What characterizes the application area of the use-cases?
• RQ3: How do the use-cases utilize metaheuristics for concept drift adaptation?
• RQ4: Which forms of concept drift were investigated?
• RQ5: How were the proposed use-cases evaluated?
The purpose of RQ1 is to get an overview of the application of various metaheuristic algorithms within concept-drift-related research. RQ2 aims at getting an overview of the application areas or overall context of the use-cases in relation to machine learning theory. RQ3 investigates how the metaheuristic algorithms were applied to help a machine learning system adapt to concept drift. RQ4 investigates which types of concept drift the use-cases were evaluated on, and RQ5 looks closer at the methods and metrics used for evaluation of the proposed methods.

Background
In the following, fundamental concepts relevant to this study will be introduced. The following sections are by no means meant to be complete definitions of the respective areas but will serve as an overview of the most important concepts. A graphical representation of the theoretical areas presented in this section can be seen in Fig. 1.

Machine learning
Machine learning is a subfield of Artificial Intelligence that focuses on developing software that learns to perform a task, rather than being hand-coded by the developer (Goodfellow et al. 2016). There are in general four different areas of machine learning, each with its own subfields: Supervised learning, Unsupervised learning, Self-supervised learning and Reinforcement learning (Chollet 2017).
In this literature review, the focus is mainly on supervised learning, which can be defined as learning a representation or parameters β in some function ŷ = f(x, β), which, given some input x, will predict an output ŷ. In this case, the parameters are found using some machine learning algorithm, and we then define the model as the function f() and its associated parameters β. However, with this parametric example, it is important to stress that many non-parametric machine learning models such as random forests and k-nearest neighbors (Hastie et al. 2001) also exist. In the following, the project phases of machine learning model development will be described with focus on the professional work involved with model development.
Fig. 1 Overview of related theory

Machine learning model development
There exist several normative frameworks for structuring machine learning projects, such as: Cross industry standard process for data mining (CRISP-DM) (Chapman et al. 2000), Sample, Explore, Modify, Model, and Assess (SEMMA) (Matignon 2007) and Knowledge discovery in databases (KDD) (Fayyad 1996). A conceptual overview has been made in Shafique and Haseeb (2014), which concludes that one framework is not necessarily superior to the other. The CRISP-DM framework does, however, include business understanding in the initial phase of the project, which helps align problem and solution. For this reason, CRISP-DM is used in the following example to illustrate the basic workflow of a machine learning project. The project work is usually carried out by one or more specialists, most often referred to as Data scientists (Davenport and Patil 2012) and Data engineers (Schelter et al. 2018). The six steps of CRISP-DM are described in the following:
Step 1: Business understanding. The first step in the CRISP-DM framework is concerned with understanding the requirements and the underlying problem from the business perspective. This insight is then used as guidance for the machine learning model, such that it provides business value. This phase has also been described as the problem definition phase (Chollet 2017).
Step 2: Data understanding. Next, relevant data is collected for initial analysis. The objective here is to look for patterns in order to form hypotheses for further testing via machine learning experiments. As argued in Chapman et al. (2000), data understanding is closely related to the business understanding, since the project plan cannot be formulated without having some level of knowledge about the data. Typically, this step would involve explorative data analysis and quality testing of one or more datasets that are available.
Step 3: Data preparation. This step is also known as the Extract, Transform and Load (ETL) step. Here, a data engineer and/or data scientist first extracts raw data (files, databases etc.) from the source systems, transforms it into a format that is usable for the model(s), and loads it prior to the modelling phase. This step often includes normalization, one-hot encoding, aggregation and other domain-specific transformations (Chollet 2017).
The data preparation also depends on the desired types of models (e.g. sequential vs. static). Parts of this step are also commonly referred to as feature preparation or feature engineering, which often also includes feature selection.
Step 4: Modelling. In the modelling phase, one or more models are trained (described in detail later). The data scientist is unlikely to know from the beginning of the project which combination of model type and hyperparameters will end up yielding the best result. This phase is thus experimental with a more or less systematic structure. It is most common to use grid-search (Raschka and Mirjalili 2019), which evaluates all combinations of a predefined set of model settings (also referred to as hyperparameters). Since the business problem is sufficiently understood at this point, it is important to define a performance metric that represents the business value of the model (Chollet 2017).
Step 5: Evaluation. This is the final step before deciding on deployment. At this point, multiple models have been trained and a set of candidate models has been found. The goal now is to evaluate the full procedure and investigate whether mistakes have been made, and/or whether the developed models actually fulfill the business requirements (Chapman et al. 2000). This can be done using in-depth analysis of the model performance. Once a model has been sufficiently tested, evaluated, and found to satisfy the business requirements, it is selected for the next phase: deployment.
Step 6: Deployment. When the best model candidate has been selected, it needs to be implemented in the system or process it was intended for. This means re-creating the full pipeline (transforming the raw data and predicting from transformed data) made in steps 1-5 in a way that enables real-time or batch prediction of new data, once it is available. Model monitoring is also included as part of deployment (Chapman et al. 2000). In the case that the model performance degrades, the process goes back to step one and continues from there, forming an infinite loop.
From a process-centric point of view, we can illustrate the CRISP-DM model development process in business process model notation (BPMN) as shown in Fig. 2. In this figure, the monitoring aspects of the life-cycle have been made more explicit, illustrating that once a model is deployed, it needs continuous monitoring and development, which will be motivated further in section 2.5.

Batch learning
The vast majority of machine learning literature is focused on what is referred to as batch learning or offline learning (Barddal et al. 2017). Here, the assumption is that the distribution of the data is stationary over time, and that samples are independent and identically distributed. This means that a classification or regression model can be trained via partitioning methods such as k-fold cross validation (Hastie et al. 2001), without a need to account for the timing of the samples. As argued in sections 2.4 and 2.5, this assumption is not always fulfilled. For steps 4 and 5 in the CRISP-DM framework illustrated earlier (Chapman et al. 2000), the task of the data scientist is to make decisions on so-called hyperparameters and evaluation method before selecting a final model candidate for deployment (step 6). An example of the training procedure is provided in section 2.2.2, followed up by model evaluation in section 2.2.3.

Stochastic gradient descent
A well-known example of a simple yet powerful machine learning algorithm is Stochastic Gradient Descent (SGD), here used to find the optimal weights of a logistic regression model. Logistic regression is a generalised linear model with a nonlinear (sigmoid) activation/link function (Goodfellow et al. 2016):
$$g(z) = \frac{1}{1 + e^{-z}}$$
The model can thus be defined as:
$$\hat{y} = g(\beta^T X)$$
where $\beta$ is a vector of learnt weights. For this problem, the optimal weights can only be found analytically in special cases (Lipovetsky 2015), and a local search method such as SGD is therefore commonly used. After learning the optimal weights, a single prediction $\hat{y}_i$ can be made from the linear combination of the weights and the inputs: $\hat{y}_i = g(\beta^T X_i)$. The weights can be learnt by minimizing a loss function, here illustrated by the binary cross entropy loss for classification problems:
$$L(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
The learning process consists of dividing the available data into multiple batches for out-of-sample validation of the model, most often performed by 3-fold cross validation (Hastie et al. 2001). The model is trained on a subset of the data called the training set $X_{train}$, by iteratively adjusting each weight $\beta_j$ with respect to the gradient of the loss function for each training example, where $L_i$ denotes the loss of the i-th example:
$$\frac{\partial L_i}{\partial \beta_j} = (\hat{y}_i - y_i) X_{ij}$$
Each update $t$ to $\beta$ is thus made based on the following update-rule:
$$\beta_j^{(t+1)} = \beta_j^{(t)} - \lambda \frac{\partial L_i}{\partial \beta_j}$$
Here, $\lambda$ is a real-valued scalar between 0 and 1, called the learning rate. The learning rate is a so-called hyper-parameter controlling the magnitude by which the weights of $\beta$ are updated with respect to the gradient of the loss function $L$. Training the model using stochastic gradient descent (SGD) is illustrated in algorithm 1. Gradient-based learning exists in a variety of forms: an alternative is the second-order learning algorithm known as Newton's method. This can lead to faster convergence, but is more computationally expensive, as it also requires calculation of the second-order derivatives (Goodfellow et al. 2016).
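The training procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation behind algorithm 1: the toy data, the function names (`sigmoid`, `predict`, `bce_loss`, `sgd`) and the default learning rate and number of epochs are hypothetical, and a constant 1 is prepended to every input so that the first weight acts as a bias term.

```python
import math
import random

def sigmoid(z):
    """Logistic link g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(beta, x):
    """y_hat = g(beta^T x) for one input vector x (x[0] is the constant 1)."""
    return sigmoid(sum(b * xj for b, xj in zip(beta, x)))

def bce_loss(beta, X, y, eps=1e-12):
    """Mean binary cross-entropy over a data set."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict(beta, xi), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Plain SGD: one weight update per training example."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = predict(beta, X[i]) - y[i]      # (y_hat_i - y_i)
            for j in range(len(beta)):
                beta[j] -= lr * err * X[i][j]     # beta_j <- beta_j - lr * dL_i/dbeta_j
    return beta

# Hypothetical toy data: class 1 whenever the single feature is positive.
data_rng = random.Random(42)
X_train = [[1.0, data_rng.uniform(-3, 3)] for _ in range(200)]
y_train = [1 if x[1] > 0 else 0 for x in X_train]

beta_hat = sgd(X_train, y_train)
loss_before = bce_loss([0.0, 0.0], X_train, y_train)   # all-zero weights
loss_after = bce_loss(beta_hat, X_train, y_train)
```

With all-zero weights every prediction is 0.5, so the initial loss equals log 2; the learnt weights should reduce it on this separable toy problem.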

Model evaluation
Performance of the model during training is most often evaluated on a second fold of the data called the validation set $X_{valid}$ using an evaluation metric. As mentioned earlier, this metric has to be in alignment with the business problem, so that the model learns to make predictions that are of business value (Chollet 2017). An example of a metric for evaluating classification models is the accuracy measure, where TP, TN, FP and FN denote the number of true positives, true negatives, false positives and false negatives:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
This metric can be biased with respect to the balance of the target class; if only 10 percent of instances in the validation set belong to the negative class, the model could achieve 90 percent accuracy by classifying all instances as positive. To help alleviate these problems, other metrics such as precision, recall and the F1-score are often used:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1-score has the advantage of being less sensitive to the balance of the target classes than accuracy. The out-of-sample evaluation can take place during training, and thus out-of-sample performance can be monitored during the training procedure. Corrections to the hyper-parameters are based only on the performance on $X_{valid}$ to avoid overfitting (Hastie et al. 2001) to $X_{train}$. The final model selection is performed using a third and unseen fold called the test set $X_{test}$.
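The imbalance example above can be checked numerically. The following Python sketch computes the four metrics from confusion counts for a classifier that labels every instance as positive on a 90/10 split; the helper names (`confusion_counts`, `classification_metrics`) are hypothetical, and undefined ratios (zero denominators) are mapped to 0 by assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with respect to the chosen positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero (an assumption)."""
    return a / b if b else 0.0

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# 90 positive and 10 negative instances; the classifier always predicts positive.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

m_pos = classification_metrics(y_true, y_pred, positive=1)  # majority class as positive
m_neg = classification_metrics(y_true, y_pred, positive=0)  # minority class as positive
```

Accuracy is 0.9 in both views. The F1-score is only informative about the failure when it is computed for the minority class (here 0.0); computed for the majority class it remains high, so the choice of positive class matters.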

Automated machine learning
Also known as AutoML, the field of automated machine learning focuses on automating as much as possible of the manual work of the data scientist with regard to the steps of the CRISP-DM framework (Chapman et al. 2000). The field of AutoML has multiple sub-branches such as Meta-learning (Khan et al. 2020), Neural Architecture Search (NAS) (Elsken et al. 2019), hyper-parameter optimization (HPO) and full-model selection (FMS). As this section is not meant to cover all AutoML methods, only HPO and FMS will be considered in the following. Automating parts of machine learning is arguably not a new problem (Bengio 2000); however, it has recently gained much popularity as machine learning has seen a boost in industry adoption, due to increased performance of algorithms and hardware (Goodfellow et al. 2016). A common problem across all machine learning projects is the combination of decisions the data scientist has to make in steps 3, 4 and 5 of CRISP-DM, based on steps 1 and 2. These decisions have a direct impact on the level of success, with respect to the performance of the models. A standard heuristic is to select an initial set of candidate settings across steps 3 and 4, train the model(s) on these settings, and evaluate the performance on a validation set (Raschka and Mirjalili 2019). The best performing settings are thereafter explored further, depending on the quality requirements and time available to the project team. The main problem of this approach is that it is time-consuming to find the best set of settings, and AutoML can thereby be of help in these cases (Feurer and Hutter 2019).

Hyper-parameter optimization problem (HPO)
Adapting the definition from Feurer and Hutter (2019): given a machine learning model M, a set of N hyperparameters (learning rate, number of iterations, etc.) with impact on the final solution can be defined. The domain of the i-th hyperparameter is denoted $\Lambda_i$, and the overall hyperparameter configuration space is $\Lambda = \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_N$. The model M instantiated with a given vector of hyperparameters $\lambda \in \Lambda$ is denoted $M_\lambda$. The task at hand is thus to find the set of hyperparameters $\lambda^*$ that minimizes the loss over the validation data, given a particular evaluation method (K-fold cross-validation or any other data partitioning scheme):
$$\lambda^* = \underset{\lambda \in \Lambda}{\arg\min} \; \mathbb{E}_{(D_{train}, D_{valid}) \sim D} \, V(\ell, M_\lambda, D_{train}, D_{valid})$$
Here, the second term $V(\ell, M_\lambda, D_{train}, D_{valid})$ measures the loss $\ell$ (e.g. cross-entropy) of the model with the specified settings $M_\lambda$ trained on the training data $D_{train}$ and evaluated on the validation data $D_{valid}$. Since this is generally defined as a batch-learning problem, the dataset D is finite and the optimization is thus over the expectation of the sample data D.
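The definition above can be made concrete with a small Python sketch. It is an illustration under assumptions, not a method from the surveyed literature: the model M is a hypothetical one-dimensional k-nearest-neighbour classifier with two hyperparameters (`k` and `weighted`), the loss is the zero-one loss, and the expectation over D is approximated by averaging V over K folds of a synthetic data set.

```python
import itertools
import random

def knn_predict(train, x, k, weighted):
    """Predict a 0/1 label for x from the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighted:
        score = sum((1 if y == 1 else -1) / (abs(xi - x) + 1e-9) for xi, y in nearest)
    else:
        score = sum(1 if y == 1 else -1 for _, y in nearest)
    return 1 if score > 0 else 0

def validation_loss(config, d_train, d_valid):
    """V(loss, M_lambda, D_train, D_valid) with the zero-one loss."""
    k, weighted = config
    errors = sum(knn_predict(d_train, x, k, weighted) != y for x, y in d_valid)
    return errors / len(d_valid)

def cross_validated_loss(config, data, n_folds=5):
    """Average V over K folds, approximating the expectation over D."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    losses = []
    for i in range(n_folds):
        d_train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        losses.append(validation_loss(config, d_train, folds[i]))
    return sum(losses) / n_folds

# Hypothetical data: label 1 when x > 0, with 15 percent label noise.
rng = random.Random(0)
data = []
for _ in range(150):
    x = rng.uniform(-1, 1)
    y = 1 if x > 0 else 0
    if rng.random() < 0.15:
        y = 1 - y
    data.append((x, y))

# Configuration space Lambda = Lambda_1 x Lambda_2 as a Cartesian product.
space = list(itertools.product([1, 3, 7, 15], [False, True]))
scores = {config: cross_validated_loss(config, data) for config in space}
best_config = min(scores, key=scores.get)   # lambda* over the discretised space
```

Because the space is small and discrete, the arg min is found here by exhaustive enumeration; the search methods discussed next replace this enumeration when the space is too large.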

Blackbox HPO-methods
As mentioned in Muñoz et al. (2015), blackbox HPO methods can be divided into deterministic and stochastic variants. The deterministic variants rely on linear algebra or geometric methods to find a locally optimal solution (due to the nonconvexity of the HPO problem) (Feurer and Hutter 2019), and can thus be re-started at other starting points to improve convergence towards a global optimum. Stochastic methods are mainly based on random variables, statistics, or metaheuristics for guiding the search in order to keep it from being trapped in a local minimum. As the number of trials grows, the probability that random search finds the global minimum approaches 1 (Muñoz et al. 2015), while it is also more efficient than standard grid search (which is restricted to a fixed set of combinations) (Bergstra and Bengio 2012). All of the local search algorithms depend on one or more hyperparameters of their own, which determine their probability of finding the global optimum within k iterations (Feurer and Hutter 2019).
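The difference between grid search and random search can be illustrated with the following Python sketch. Everything in it is assumed for illustration: `toy_loss` is a hypothetical validation loss that is sensitive to one hyperparameter (`lr`) and nearly flat in the other (`momentum`), and both methods are given the same budget of nine trials.

```python
import itertools
import random

def toy_loss(lr, momentum):
    """Hypothetical validation loss: sensitive to lr, nearly flat in momentum."""
    return (lr - 0.37) ** 2 + 0.001 * (momentum - 0.5) ** 2

def grid_search(loss, axis_values):
    """Evaluate every combination of the given per-axis values."""
    trials = list(itertools.product(*axis_values))
    return min(trials, key=lambda c: loss(*c)), trials

def random_search(loss, bounds, n_trials, seed=0):
    """Evaluate n_trials configurations drawn uniformly within the bounds."""
    rng = random.Random(seed)
    trials = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n_trials)]
    return min(trials, key=lambda c: loss(*c)), trials

grid_best, grid_trials = grid_search(toy_loss, [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]])
rand_best, rand_trials = random_search(toy_loss, [(0.0, 1.0), (0.0, 1.0)], n_trials=9)

# Number of distinct values tried for the hyperparameter that matters.
grid_lr_values = {c[0] for c in grid_trials}
rand_lr_values = {c[0] for c in rand_trials}
```

With the same budget, the grid probes only three distinct values of the influential hyperparameter, whereas random search probes nine. This is the mechanism behind the efficiency argument of Bergstra and Bengio (2012); it does not guarantee that random search wins on any single run.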

Model-based HPO methods
Another direction in HPO is multi-fidelity (model-based) methods as described in Feurer and Hutter (2019). Here, the rationale is to optimize the problem using a low-fidelity version of the problem space. This could be via a smaller subset or compressed version of the data, leading to a faster search process yielding an approximation for a λ, which might perform well on the original (full) problem space. Another model-based approach to HPO is predictive termination (Domhan et al. 2015), where another ML model is tasked with predicting the learning curve of the main model, in order to terminate training before overfitting occurs.

Full-model selection
The HPO problem was extended to Full-model selection (FMS) in Escalante et al. (2009). This problem is also known as CASHO: Combined Algorithm Selection and Hyperparameter Optimization (Feurer and Hutter 2019). In this context the HPO problem is extended by including elements from steps 3 to 5 in the CRISP-DM framework (Chapman et al. 2000) (Data preparation, Modelling and Evaluation).
In Escalante et al. (2009), the authors include the sub-tasks from CRISP-DM shown in Table 1.

Formal representation:
A single solution X_i ∈ S (where S is the total solution space), is represented in Escalante et al. (2009) in the following form: Here, X_i is an n-dimensional vector representing a particular solution, i.e. a full model including data preparation, modelling and evaluation. Each element denoted by X in the solution is a binary vector specifying the setting for each step, with the exception of X_(i,sel), which is a binary scalar specifying whether pre-processing should be done before feature selection. Additional case-specific decisions can be included by adding binary vectors to the solution space. The y-vectors contain the hyper-parameter settings for each possible combination of settings in their associated X-vectors. In the example provided by Escalante et al. (2009), X_(i,pre) represents k possible pre-processing techniques such as z-transformation, scaling or range-normalization.

Search and evaluation:
As some hyper-parameters are continuous, the total search space becomes infinitely large. However, as argued in Feurer and Hutter (2019), continuous parts of the search space can be bounded and/or discretized to reduce the overall size. Escalante et al. (2009) use Particle Swarm Optimization (PSO) (Kennedy and Eberhart 1995) for solution search. PSO is a population-based metaheuristic (see section 2.6) that relies on a fitness function F. In the case of FMS, the data scientist has to select an evaluation metric for F that represents the business value (Chollet 2017).
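To give an impression of how a PSO search proceeds, the following Python sketch implements a common global-best PSO variant with an inertia weight. It is a sketch under assumptions and does not reproduce the mixed FMS encoding of Escalante et al. (2009): the fitness F is a hypothetical stand-in (the sphere function, to be minimised), and the parameter values `w`, `c1`, `c2`, the swarm size and the bounds are illustrative choices.

```python
import random

def sphere(x):
    """Hypothetical fitness F to minimise; stands in for a validation loss."""
    return sum(v * v for v in x)

def pso(fitness, dim=2, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), seed=0):
    """Global-best PSO; returns (best position, best fitness, initial best fitness)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_f = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_f.__getitem__)
    gbest, gbest_f = pbest[g][:], pbest_f[g]    # swarm-wide best
    initial_best_f = gbest_f
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f, initial_best_f

best_x, best_f, initial_best_f = pso(sphere)
```

In an FMS setting, each particle position would instead encode a full model configuration, and evaluating F would require training and validating that model, which is what makes the number of fitness evaluations the dominant cost.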

Concept drift
In real-world machine learning applications, the assumption of stationarity in the data stream is often not fulfilled (Schelter et al. 2018), as data changes over time. This situation, called concept drift, can arise due to multiple factors, depending on the domain area. As explained in Žliobaitė et al. (2016), sources can be: adversary activities (in fraud detection), changes of preferences (in recommender systems), population change or simply a complex environment. In Akila and Reddy (2018) the authors further stress that concept drift in consumer-related data is not an error in the data, but rather a natural change in customer behavior. In business process-related data, continuous changes in organisational structures, legal regulations, and technological infrastructures are known to lead to concept drift (Maisenbacher and Weidlich 2017; Bose et al. 2011). Concept drift can be divided into two main categories (Tsymbal 2004): real concept drift and virtual concept drift. Real concept drift can be described as a change over time in the relation between the target variable y and the associated input variables X. Virtual drift, on the other hand, relates to changes in the distribution of the input variables X, without a change in the relationship between y and X (Gama et al. 2013). Given a classification model (such as the logistic regression model presented earlier), a class membership prediction can according to Bayesian theory be made using the posterior probability of a class (Gama et al. 2013). For a given class y out of c classes:
$$P(y \mid X) = \frac{P(y) \, P(X \mid y)}{P(X)}$$
where
$$P(X) = \sum_{y=1}^{c} P(y) \, P(X \mid y)$$
Real concept drift can thus be defined as a situation where the joint distribution of the target class y and the input data X is significantly different at $t_0$ compared to $t_1$:
$$\exists X : P_{t_0}(X, y) \neq P_{t_1}(X, y)$$
Virtual concept drift can be described as situations where the distribution of X changes over time, without changing the decision boundary of y:
$$P_{t_0}(X) \neq P_{t_1}(X) \quad \text{while} \quad P_{t_0}(y \mid X) = P_{t_1}(y \mid X)$$
Drift between concepts can occur in four different ways (or a given combination of them), as illustrated in Fig. 3.
Sudden or abrupt concept drift means an absolute shift between two concepts as a step between t_0 and t_1. Incremental concept drift means a steady transition between two concepts from t_0 until t_n, with multiple mixtures of the concepts present in between. At t_n the old concept is no longer present, and the new concept is thereby dominant in the data. Gradual drift refers to situations where there is a back-and-forth change between two concepts happening between t_0 and t_n, where the old concept is non-existent at t_n. Reoccurring drift means an introduction of a new concept at time t_(0+n), with a subsequent re-introduction of the original concept present at t_0. As argued in Gama et al. (2013), mixtures of multiple drift patterns can also be observed in real-world data.
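The four patterns can be mimicked with a small synthetic generator, which is also useful for testing adaptation methods on known drift. The Python sketch below is one possible reading of the descriptions above and rests on assumptions: each concept is reduced to a single mean value (`old_mean`, `new_mean`), the transition is linear, and the function names are hypothetical.

```python
import random

def drift_weight(t, pattern, start, end):
    """Share of the new concept at time t (0 = old concept only, 1 = new only)."""
    if pattern == "sudden":
        return 1.0 if t >= start else 0.0
    if pattern in ("incremental", "gradual"):
        if t < start:
            return 0.0
        if t >= end:
            return 1.0
        return (t - start) / (end - start)
    if pattern == "reoccurring":
        return 1.0 if start <= t < end else 0.0   # old concept returns at `end`
    raise ValueError(f"unknown pattern: {pattern}")

def concept_mean(t, pattern, start, end, rng, old_mean=0.0, new_mean=5.0):
    """Mean of the concept generating the sample at time t."""
    w = drift_weight(t, pattern, start, end)
    if pattern == "gradual":
        # Back-and-forth: each sample comes from one of the two pure concepts,
        # with the new concept becoming increasingly likely.
        return new_mean if rng.random() < w else old_mean
    # Sudden and reoccurring: w is 0 or 1. Incremental: blended intermediate concept.
    return (1 - w) * old_mean + w * new_mean

rng = random.Random(0)
patterns = ("sudden", "incremental", "gradual", "reoccurring")
streams = {p: [concept_mean(t, p, 50, 80, rng) for t in range(100)] for p in patterns}
```

The distinction between incremental and gradual drift shows up in the generated values: the incremental stream passes through intermediate means, while the gradual stream only ever contains the two pure concepts.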
Feature drift refers to the change in relevancy over time of each feature in a given feature space F. At a given point in time, the most discriminative feature set can be selected from the overall feature space, F_t0 ⊆ F. In the case where the most discriminative feature set changes over time, F_t0 ≠ F_t1, a feature drift is present in the data (Nguyen et al. 2012). Novel class appearance is another special case of drift where a previously unseen class is observed in the data (Webb et al. 2015). In this case P_t0(Y = c) = 0, where c is the previously unseen class, and subsequently P_t1(Y = c) > 0 at time t_1. This can for instance happen in a setting where a model is predicting activities in a business process, and a new type of activity is added to the process. This is also referred to as concept evolution.

Machine learning life-cycle management
Life-cycle management of machine learning projects extends the scope of CRISP-DM to activities performed post deployment. Being focused on ML production settings, model management primarily involves the activities listed in Table 2 (Vartak and Madden 2018). As can be seen from the comparison, there is no perfect overlap between model management and CRISP-DM, as business and data understanding are not part of model management, and maintenance is not (explicitly) part of CRISP-DM. Another distinction between the two is that machine learning life-cycle management is focused on the interplay between data engineering and machine learning (Schelter et al. 2018). Here, the main problem is how to track and utilise metadata across all six steps. One of the key challenges in model management is maintenance, where the main issue is when to retrain or redevelop models (Schelter et al. 2018). This point is usually determined through performance monitoring, which triggers a new iteration of the model management steps (Vartak and Madden 2018). This is often done offline (Schelter et al. 2018), and often performed manually in the interplay between data scientists, software engineers and site reliability engineers (Polyzotis et al. 2018). However, some commercial model management systems have support for (offline) automated HPO (Zaharia et al. 2018).
Fig. 3 Overview of drift patterns, adapted from Gama et al. (2013)

Online learning
Contrary to batch or offline machine learning, online machine learning is based on the assumption that the distribution of the data is non-stationary (Žliobaitė et al. 2016). This problem is well-studied within the field of data stream mining (Gama et al. 2013), where the objective is adaptation of (often streaming-optimised) machine learning algorithms (Barddal et al. 2017). Online learning essentially aims at automating and adapting parts of the CRISP-DM framework (Žliobaitė et al. 2016), in order to make the machine learning model robust to changes in the data.

Drift adaptation
Important to online learning is the distinction between blind and informed adaptation. Blind drift adaptation involves retraining the model either when new data appear or at fixed time intervals (Gama et al. 2013), without any form of drift-detection mechanism. This approach can be resource-intensive and potentially unnecessary. The idea of informed drift adaptation is, on the contrary, to detect when concept drift has occurred, and only then adapt the model to the new distribution. A generic model of adaptation is presented in Gama et al. (2013):
1. Predict: Predict an incoming sample/batch
2. Diagnose: Evaluate using the ground truth once it is available
3. Update: Use the new data to update the model if needed
The main objectives of drift adaptation are thus to: a) detect concept drift as early as possible, b) adapt to concept changes while ignoring noise, c) perform the operation in less than the time it takes for a new example or batch to arrive, given a fixed budget of memory and computation (Žliobaitė et al. 2016).
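The predict, diagnose and update steps can be sketched as a loop in Python. The example below is a deliberately simple illustration of informed adaptation under assumptions: the "model" is a hypothetical majority-class predictor, the stream is a noise-free label sequence with one sudden drift, and an update is triggered when the error rate over the last `window` samples exceeds `threshold`.

```python
from collections import deque

def majority(labels):
    """Hypothetical 'model': predict the majority label (ties resolve to 0)."""
    return 1 if 2 * sum(labels) > len(labels) else 0

# Noise-free stream with a sudden drift at t = 100 (an assumption for clarity).
stream = [0] * 100 + [1] * 100
window = 10
threshold = 0.5

model = majority(stream[:20])                        # initial training on 20 samples
recent_labels = deque(stream[20 - window:20], maxlen=window)
recent_errors = deque(maxlen=window)
retrains = 0
total_errors = 0

for y_true in stream[20:]:
    y_pred = model                                   # 1. Predict
    error = int(y_pred != y_true)                    # 2. Diagnose (ground truth revealed)
    total_errors += error
    recent_errors.append(error)
    recent_labels.append(y_true)
    window_full = len(recent_errors) == window
    if window_full and sum(recent_errors) / window > threshold:
        model = majority(recent_labels)              # 3. Update, only when needed
        recent_errors.clear()
        retrains += 1
```

On this stream the model is retrained once, six samples after the drift, whereas a blind strategy that retrains on every new sample would perform 180 updates. The window size and threshold control the trade-off between detection delay and robustness to noise.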

Learning modes
Learning modes for concept drift adaptation can be divided into three general forms: re-training, incremental adaptation and streaming (Gama et al. 2013). The re-training learning mode discards the existing model and re-trains a new one from scratch, based on either old and new, or only new data samples. This approach is the equivalent of batch learning using a sliding window. Incremental adaptation updates the existing model instead of starting from scratch. This learning mode can update the model using either single or multiple samples, once the ground truth (true value of y) has been revealed to the learning algorithm. Finally, the streaming mode is used in settings with a high frequency of incoming samples. In this learning mode, the algorithm uses only a few passes over each sample before it is discarded, in order to preserve memory (Gama et al. 2013).

Drift detection
As blind adaptation is a possibility, drift detection is not a necessity for concept drift adaptation. However, there are multiple advantages, such as a reduced computational load, as well as potential insight into the nature of drift in the given setting (Gama et al. 2013). Mechanisms for drift detection fall into four main categories: Sequential analysis (Page 1954; Pesaranghader and Viktor 2016), Control charts (Ross et al. 2012), Distributional tests (Bifet and Gavaldà 2007) and context-dependent methods (Bouchachia 2011). The main goal of these methods is to monitor the distribution and trigger automated adaptation (based on the learning mode). A more complete overview of drift detection methods can be seen in Gama et al. (2013).
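As an illustration of the sequential analysis category, the following Python sketch implements the Page-Hinkley test, a widely used detector derived from the cumulative sum approach of Page (1954). The choice of this particular test, the parameter values (`delta`, `threshold`) and the noise-free example streams are assumptions made for illustration; in practice the monitored signal would typically be the error of the deployed model.

```python
class PageHinkley:
    """Page-Hinkley test for detecting an increase in the mean of a signal."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta            # tolerated magnitude of change
        self.threshold = threshold    # alarm level for the test statistic
        self.n = 0
        self.mean = 0.0               # running mean of the signal
        self.cum = 0.0                # cumulative deviation from the mean
        self.cum_min = 0.0            # smallest cumulative deviation seen so far

    def update(self, x):
        """Add one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold

def first_alarm(stream, **kwargs):
    """Index of the first alarm in the stream, or None if no drift is signalled."""
    detector = PageHinkley(**kwargs)
    for t, x in enumerate(stream):
        if detector.update(x):
            return t
    return None

stationary = [0.0] * 200                  # no drift
drifting = [0.0] * 100 + [1.0] * 100      # mean shifts at t = 100
alarm_stationary = first_alarm(stationary)
alarm_drifting = first_alarm(drifting)
```

The detector raises no alarm on the stationary stream and signals the shift a few samples after it occurs. A larger `threshold` reduces false alarms at the cost of a longer detection delay.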

Metaheuristics
For complex optimization problems such as HPO or FMS/CASHO outlined earlier, finding the optimal point in a search space is computationally demanding. Due to the curse of dimensionality, the volume of the search space grows exponentially with each added dimension. This has the unfortunate side effect of an exponential increase in the computation time needed (Chen et al. 2015). However, in some cases there exists a set of solutions that are not globally optimal, but "good enough" to solve the problem at hand. In these cases, a particular class of optimization algorithms, called metaheuristics, can be used (Bianchi et al. 2008). The main motivation for these methods is to accept a reduction in solution quality in order to solve the problem with less effort (computation time). This is done by trading off exploration and exploitation using robust mechanics (Blum and Roli 2003). There are mainly three types of metaheuristics: population-based, construction-based and local search-based methods. Each of these will be described in the following.
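The exponential growth can be made concrete with a small calculation. The sketch assumes a hypothetical exhaustive grid search with a fixed number of candidate values per hyper-parameter; the function name grid_size is illustrative only.

```python
def grid_size(values_per_dim: int, n_dims: int) -> int:
    """Number of configurations in a full grid over n_dims hyper-parameters."""
    return values_per_dim ** n_dims


# With 10 candidate values per hyper-parameter, every added dimension
# multiplies the number of models that would have to be trained by 10.
for n_dims in (1, 2, 5, 10):
    print(n_dims, grid_size(10, n_dims))
```

Since each configuration requires training a model, an exhaustive search quickly becomes impractical, which is the situation where accepting a "good enough" solution becomes attractive.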

Local search methods
The local search variants rely on an initial solution and thereafter seek to improve the solution by moving towards the neighboring solutions in iterations. Simple iterative improvement tends to stop at a local minimum and yield unsatisfactory results in combinatorial optimization (Blum and Roli 2003). Multiple improvements to the base algorithm have therefore been proposed over time. An example of a local search algorithm is Simulated Annealing (SA), illustrated in algorithm 2. This method allows moves towards worse solutions, in order to avoid getting stuck in a local minimum. This is effectively a mechanism that tries to make the search more explorative. The algorithm has a temperature parameter T, which controls the probability of moving towards a worse solution than the current one. This temperature decreases during the search, making the algorithm less likely to explore (make uphill moves in minimisation), and more likely to exploit the current area of the search space. The decrease of T does not necessarily have to be monotonic and, depending on the cooling scheme, T can also increase during the search (Blum and Roli 2003). Here, s′ is the new solution, and s is the old solution. T is the temperature parameter, where T_0 is the initial temperature, which then changes based on the cooling scheme. GenerateSolution() selects an initial solution at random, and PickAtRandom() selects from the neighboring solutions N(s).
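A minimal sketch of simulated annealing for a minimisation problem is given below. It is an illustration under stated assumptions rather than a reproduction of algorithm 2: the acceptance rule exp(-delta / T) and the geometric cooling scheme T ← alpha · T are common textbook choices, the initial solution is passed in by the caller instead of being generated at random, and the function and parameter names are hypothetical.

```python
import math
import random


def simulated_annealing(cost, neighbours, s0, t0=1.0, alpha=0.95,
                        n_iter=2000, seed=0):
    """Minimise cost(s), starting from s0.

    neighbours(s) must return a non-empty list of candidate solutions N(s).
    """
    rng = random.Random(seed)
    s, best = s0, s0
    t = t0
    for _ in range(n_iter):
        s_new = rng.choice(neighbours(s))        # pick at random from N(s)
        delta = cost(s_new) - cost(s)
        # Always accept improvements; accept a worse solution with a
        # probability that shrinks as the temperature decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            s = s_new
        if cost(s) < cost(best):
            best = s
        t = max(t * alpha, 1e-12)                # geometric cooling (assumed)
    return best
```

For example, simulated_annealing(lambda x: (x - 7) ** 2, lambda x: [x - 1, x + 1], s0=-20) searches the integers for the minimiser of a simple quadratic.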

Population-based methods
This family of methods is based on creating multiple solutions (referred to as individuals), and in some cases on combining superior alternatives in order to evolve a better set of solutions in the next population. Due to the many variants in this area, only the general evolutionary computation (EC) algorithm is presented in the following. Evolutionary computation is a family of population-based methods inspired by nature, where Genetic Algorithms (GA) are the most well-known due to their inspiration from a Darwinian principle: survival of the fittest (Bianchi et al. 2008). Contrary to local search-based algorithms, EC generates multiple solutions (populations) per iteration (referred to as a generation), where randomization (called mutation) and combination (also referred to as crossover) influence the algorithm's exploration versus exploitation trade-off. An example of EC can be seen in algorithm 3.
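The interplay of selection, crossover and mutation can be sketched with a small genetic algorithm over bit strings. This is a generic illustration, not algorithm 3 itself: binary tournament selection, one-point crossover, bit-flip mutation and elitism are assumed design choices, and all names and default values are hypothetical.

```python
import random


def tournament(pop, fitness, rng):
    """Binary tournament: return the fitter of two randomly drawn individuals."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(fitness, n_bits=20, pop_size=30, generations=60, p_mut=0.05, seed=0):
    """Maximise fitness over bit strings with a simple genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]       # elitism: keep the best individual
        while len(next_pop) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            cut = rng.randrange(1, n_bits)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation keeps the search explorative.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

Calling evolve(sum) maximises the number of ones in the string. In a drift-adaptation setting, the bit string could for instance encode a feature subset, with the fitness given by the validation performance of a model trained on that subset.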

Constructive methods
Constructive metaheuristics build solutions by combining components of the solution, until a complete, satisfactory solution is found. As mentioned in Trabelsi et al. (2010), these methods often consist of greedy search, where the best elements are picked at each step. Constructive methods can also combine a constructed solution with successive local search. An example of a constructive algorithm is Ant Colony Optimization (ACO) (Dorigo and Di Caro 1999). This particular metaheuristic is inspired by the way ants search for food in nature (Bianchi et al. 2008). In general, the ants use a pheromone to mark the route they have taken. Multiple paths might then be searched, and because the pheromone evaporates over time, the shortest path ends up with the strongest presence of pheromone (Blum and Roli 2003). This behaviour is modelled by artificial agents (ants) that perform a greedy search on a graph G(V, A), where the nodes of the graph, V, are the components of the solution, and A is the set of connections between the components (Bianchi et al. 2008). An overview of the approach can be seen in algorithm 4.
ConstructAntsSolutions is the process wherein the solution is created incrementally by the agents in parallel. For each agent, the probability of going from a node k to a successor node l is P_kl, which is an increasing function of π_kl and ρ_kl(u). π_kl is the pheromone on arc (k,l), and ρ_kl(u) is the heuristic value of arc (k,l). Here, the heuristic value is a greedy estimate of the usefulness of adding (k,l) to the solution (Bianchi et al. 2008). EvaporatePheromone() decreases the pheromone π_kl on an arc every time an agent uses that particular component. This essentially prevents the algorithm from getting stuck in a local minimum. DaemonActions() are global (centralized) actions that are performed across all the agents. These vary between the particular implementations of the algorithm (Bianchi et al. 2008). An example is a local search over one or more of the solutions created by the agents, or adding more pheromone to bias the search from a non-local perspective (Blum and Roli 2003).
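How P_kl can depend on the pheromone and the heuristic value is sketched below. The text above only states that P_kl increases with both quantities; the product-of-powers rule used here is the one commonly associated with the original Ant System and is an assumption, as are the exponents alpha and beta, the evaporation rate and the function names.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Probability of moving to each successor node l from the current node k.

    pheromone[l] plays the role of pi_kl and heuristic[l] that of rho_kl(u),
    for every feasible successor l.
    """
    weights = {l: (pheromone[l] ** alpha) * (heuristic[l] ** beta)
               for l in pheromone}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}


def evaporate(pheromone, rate=0.1):
    """Reduce every pheromone value by a fixed fraction (assumed update rule)."""
    return {l: (1.0 - rate) * p for l, p in pheromone.items()}
```

With equal heuristic values, an arc carrying three times as much pheromone as its alternative is chosen three times as often, and repeated evaporation gradually removes the advantage of arcs that are no longer reinforced.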

Methodology
In the following sections, the methodology used in this study will be discussed in detail. Section 3.1 discusses the search design, as well as the number of found studies. Section 3.2 discusses the exclusion criteria, while Section 3.3 describes the analysis of the literature.

Study retrieval
The literature search has been guided by combining three main topics: metaheuristics, concept drift, and automated or adaptive forms of machine learning. The combination of the three topics has been implemented in the queries using Boolean AND. Since the last two topics exist in multiple forms with different names, synonyms have been defined in each of the three queries using Boolean OR. A cumulative search approach has been applied, as the aim has been to include as many relevant studies as possible within the three main topics mentioned above. The first query was designed for probing, whereas the second was intended to broaden the results. Finally, the third query imposed a third AND clause to restrict the results further to the field of Machine learning and Data stream mining (as the second query had many irrelevant results).
A stopping criterion was defined as obtaining more than 80 percent qualified, redundant results. The queries were performed using Google Scholar since most relevant publishers, proceedings and journals within the subject areas (Springer, Elsevier, IEEE Xplore, ACM Digital Library) have online searchable titles and abstracts.
The first query led to 258 initial hits with 25 studies retrieved for further inspection. Unfortunately, none of these results was qualified to be included due to the exclusion criteria (defined in section 3.2). Query two led to 570 hits, wherein 38 studies were retrieved for closer inspection, resulting in 17 included studies. Finally, query three led to 534 hits, with 16 studies being retrieved for closer inspection. Eight of these studies were qualified for inclusion. However, all of these studies were already included in the results of query two. This triggered the stopping criterion, with the final number of included studies being 17. An overview is presented in Table 3.

Exclusion criteria
The first criterion was included to ensure that the literature contained all the topics of interest. The second criterion was added to exclude literature that did not suggest or study an explicit method for drift adaptation of a machine learning model via metaheuristics. The third criterion ensured only peer-reviewed studies were included, as a measure of quality. Similarly, criterion four excluded works published before 2021 with fewer than two citations. This was primarily added to ensure that the suggested method has had an impact within the scientific communities. Finally, the fifth criterion was formed to ensure the content could be retrieved and analysed.

Exclusion criteria:
1. Studies that do not include all three main topics
2. Studies that do not suggest or study an explicit method for drift adaptation of a machine learning model via metaheuristics
3. Peer-review: Unpublished articles (no DOI or publisher)
4. Scientific impact: Studies with < 2 citations when published before 2021
5. Availability: Studies with inaccessible abstract or full text

Table 3: search queries and results

| Query | Search string | Hits | Retrieved | Included |
|---|---|---|---|---|
| 1 | Metaheuristic AND (Concept drift OR "Online learning") AND ("AutoML" OR "Automated Machine Learning" OR "Hyper-parameter optimization" OR "CASHO" OR "FMS") | 258 | 25 | 0 |
| 2 | Metaheuristic AND "Concept drift" AND ("Online learning" OR Adaptation OR Adaptive OR AutoML) | 570 | 38 | 17 |
| 3 | Metaheuristic AND "Concept drift" AND (adaptation OR adaptive OR AutoML OR Automation OR "Hyper-parameter optimization" OR "Online learning" OR "Full model selection") AND ("Machine learning" OR "Data stream mining") | 534 | 16 | 0 |

Analysis and classification of methods
Guided by the research questions, the found literature was reviewed and qualitatively coded. Open coding was used for RQ2 (field of application) and RQ4 (data type), whereas the rest of the coding was based on pre-determined codes from relevant theory outlined in Section 2. An overview of the research questions and their related codes can be seen below. However, additional information such as the machine learning model and the name of the metaheuristic(s) have also been included in the results.

Results
In the following, the results will be divided into four different sections: types, adaptation method, test of concept drift adaptability and chronological trends.

Types and general application areas of metaheuristics
As can be seen from Table 4, population-based metaheuristics are by far the most widely used methods across the found studies. Only two studies used local search or construction-based methods (Pinto et al. 2014;Kozak et al. 2020). The most frequently used metaheuristic is particle swarm optimization (PSO) (Kennedy and Eberhart 1995), which is applied to both continuous and discrete optimization (Bessa et al. 2018;Lan et al. 2019). In two of the use-cases, the metaheuristic is combined with replicator dynamics (Ghomeshi et al. 2019a, 2019b) in order to utilize the benefits of both approaches in an FMS problem (see Section 2.3.4). The fields of application can mainly be grouped into engineering, computer science and natural language processing. In Bessa et al. (2018), the focus is mainly to generate self-adaptable models that perform well while using as little memory as possible, so that these can be used in e.g. telemetry hardware. Another example is Rehman et al. (2019), where the focus is to have self-adaptable models that can compensate for gradual sensor malfunction in order to save costs in a gas-detection problem. The most common application area is computer science or intelligent systems: in these studies, multiple datasets from different fields are used to demonstrate the efficiency and adaptability of a given algorithm, as compared to other state-of-the-art methods. The datasets used (commonly referred to as 'benchmark' datasets) are often retrieved from the UCI Machine Learning Repository. In the natural language processing approaches, the aim is to classify textual data, which is known to have a large dimensionality as well as both feature drift and the appearance of novel classes over time. In Abid et al. (2019) the authors use AIS for classification of tweets and other social media data. In Kozak et al. (2020) and Cortez et al. (2012), the authors use email data for spam classification and automated email folder allocation, respectively.
In addition to field-specific data, most studies use simulated data to be able to test drift adaptation abilities in different scenarios (see section 4.3).

Concept drift adaptation
As seen in Table 5, most of the use-cases utilise metaheuristics for assisting in supervised learning problems. In these cases, the ground truth is either assumed to be available instantly after the classification has been made (Lan et al. 2019;Abdulkarim and Engelbrecht 2019;Pinto et al. 2014), or to be available after a delay of a number of time steps (Ghomeshi et al. 2019a). In four of the studies (Bessa et al. 2018;Yeoh et al. 2019;Abid et al. 2019;Aydogdu and Ekinci 2020) the learning method is unsupervised, which is partially due to the fact that the ground truth is most often not available instantly, and therefore the algorithm has to be able to adapt and correct itself without knowing the ground truth. These studies use a clustering approach (DBSCAN, K-means, DEN-STREAM, CLU-STREAM, KDE) in order to segment the data into clusters based on some manually specified criteria. In Aydogdu and Ekinci (2020) the authors use the information entropy of a cluster to determine if a data point belongs to the given cluster. In most of these studies, the accuracy of the clusters is evaluated using the ground truth after the experiment is performed. Only one of the studies (Pinto et al. 2014) uses reinforcement learning to find the best policy in any given situation (for electricity market trading).

Models used
In general, there are three categories of ML models applied across the use-cases, the first being well-known machine learning algorithms. The first category is mainly motivated by the well-known strengths and weaknesses of the algorithms across the various use-cases. The second category is motivated by the robustness through adaptability of ensemble algorithms in concept drift settings (Ghomeshi et al. 2019b;Gama et al. 2013). The third category is mainly proposed for the specific problems where the nature of the metaheuristic presents an advantage. In Karimi et al. (2012) the authors compare three different Harmony Classifiers (batch, incremental and improved incremental), where computation time is a main motivation for the Harmony Classifier. The approach in Kozak et al. (2020) presents a customised ensemble (a collection of models acting together) of the decision tree algorithm based on ant-colony optimization (ACO). In this example the main motivation is to improve the average performance of the decision trees while maximizing heterogeneity in the ensemble.

AutoML and drift adaptation
In some studies the application of the metaheuristic is not directly related to automated machine learning (AutoML). In these studies, the metaheuristic is only used for model optimization (training), which is not an AutoML problem, but rather a generic machine learning problem. One example is using PSO for finding the optimal weights of a neural network (MLP/NN) as an alternative to backpropagation (Abdulkarim and Engelbrecht 2019), or finding the optimal value of a decision problem, using what-if analysis (Pinto et al. 2014). Looking at Table 5 (column 5: MH adaptation), most of the use-cases utilize metaheuristics for the feature selection (FS) problem. In some use cases, the FS problem is extended to determining the size of time series windows (Lan et al. 2019;Kumar and Batra 2018;Izidio et al. 2021). In other cases the metaheuristic is used for the hyper-parameter optimization problem (HPO) in order to make the machine learning algorithm self-adaptable in the event it has to re-train. In Kumar and Batra (2018) the authors utilize low-fidelity HPO (Feurer and Hutter 2019), where the initial model candidates are evaluated on subsets of the data, to decrease the overall training time via early stopping. A general pattern across the use cases is that there are multiple phases of the suggested approach: an initialization phase, followed by an online phase. The initialization phase is a generic offline batch-learning AutoML problem. In the subsequent online phase, either FMS or sub-problems such as FS, model selection (MS), and HPO are performed through a metaheuristic acting as the high-level optimization algorithm. In Abid et al. (2019) the authors use the Artificial Immune System (AIS) metaheuristic for FMS and perform feature selection, model selection/management and novelty detection using this algorithm. For the use-cases that can be characterized as FMS (Ghomeshi et al. 2019a;Abid et al. 2019;Ghomeshi et al. 2019b;Kozak et al. 2020;Abidi et al. 2022;Izidio et al. 2021;Adnan et al. 2020), FMS is performed either in the initialization phase (as regular offline learning), or in the subsequent online phase via self-adaptive lifecycle management. This is based on either blind (Pinto et al. 2014;Kozak et al. 2020) or informed adaptation (triggering), as seen in Table 6. In two cases, FMS is managed by a combination of a metaheuristic and replicator dynamics (RD), which then handles model lifecycle management (Ghomeshi et al. 2019a, 2019b). Another example is seen in Kozak et al. (2020), where the information in the pheromone trails is used for model management (newer models have a stronger pheromone trail). Most of the use cases are based on incremental learning, with a streaming setting in mind. However, some of the settings are similar to regular batch learning (Kumar and Batra 2018;Karimi et al. 2012). In addition, most of the adaptation techniques are blind, which is not necessarily negative in terms of concept drift adaptation, but has a higher computational cost (Gama et al. 2013). The drift detection methods employed vary across the found literature, with no dominating technique.

Test of concept drift adaptability
Unfortunately, multiple studies include little information regarding the concept drift that the proposed solution is evaluated on. For the domain-specific use cases, real-world data is used (Bessa et al. 2018;Cortez et al. 2012;Izidio et al. 2021) to demonstrate the ability to function in this environment, but the nature of the concept drift in the data is not sufficiently described. This pattern is present in multiple studies where real-world data is used, except in Rehman et al. (2019), where the authors demonstrate the fluctuations of the concept (gas sensor readings) over time.

Drift types
Looking at the results from Table 7 (column two: Drift type), it is evident that the types of concept drift tested are inconsistent across the found studies. In particular, 7 of the 17 studies do not explicitly report the drift type. In Abid et al. (2019) the authors modify real-world data to model the arrival of novel classes in the target variable. This is done by adding classes in subsequent batches over the duration of the experiments. Novelty detection in both feature and input space is tested in Yeoh et al. (2019), Abid et al. (2019) and Rehman et al. (2019) and most often handled using unsupervised learning or feature selection. A commonality for the studies with textual data is the need for adaptive feature selection. This is seen in Abid et al. (2019), Kozak et al. (2020) and Cortez et al. (2012), where feature drift is a natural phenomenon in the real-world data that is investigated: novel features (new words) arrive and other features become less important or disappear. Unfortunately, none of these studies describes the type and magnitude of the drift naturally occurring in the data. In Ghomeshi et al. (2019a), Aydogdu and Ekinci (2020), Ghomeshi et al. (2019b) and Karimi et al. (2012) the authors compensate for this problem using simulated data alongside real-world data. In this way, it is possible to control the various types of concept drift, and compare the adaptability of the proposed solution in different scenarios. In Ghomeshi et al. (2019a) the authors use a rotating hyperplane to simulate real concept drift given a small set of features. This particular type of simulation makes it possible to create environments with different magnitudes of change over time. Generally, for several studies (Cortez et al. 2012;Abdulkarim and Engelbrecht 2019;Abidi et al. 2022;Izidio et al. 2021;Adnan et al. 2020) the drift type is not explicitly reported, which unfortunately limits the external validity of their results for the purposes of this study.
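The idea of a rotating hyperplane can be sketched as follows. This is a simplified two-feature illustration and not the generator used in Ghomeshi et al. (2019a): the decision boundary passes through the centre of the unit square and is rotated by a fixed angle after every sample, so that the parameter drift_per_step (a hypothetical name) controls the magnitude of change over time.

```python
import math
import random


def label(x, theta):
    """Class of point x for a hyperplane through (0.5, 0.5) whose normal
    vector has angle theta."""
    w = (math.cos(theta), math.sin(theta))
    return int(w[0] * (x[0] - 0.5) + w[1] * (x[1] - 0.5) >= 0.0)


def rotating_hyperplane(n_samples, drift_per_step=0.01, seed=0):
    """Yield (x, y) pairs while the decision boundary rotates gradually.

    drift_per_step = 0 gives a stationary concept; larger values give
    faster drift in P(y | x), i.e. real concept drift.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(n_samples):
        x = (rng.random(), rng.random())
        yield x, label(x, theta)
        theta += drift_per_step
```

Because the feature distribution stays uniform while the labelling rule changes, only the relationship between features and target drifts, and the rate of change is known by construction, which is what makes the adaptability of different solutions comparable.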

Evaluation methods
In 8 of the 14 studies with classification problems, the accuracy metric is used as the primary metric for model evaluation. As described in Section 2.2.3, the accuracy metric is biased towards the majority class, meaning that performance on minority classes is largely overlooked if accuracy is the only performance metric used. Unfortunately, this is the case in Karimi et al. (2012), Kozak et al. (2020), Kumar and Batra (2018) and Ghomeshi et al. (2019b), which means that the results found in these studies could be correlated with the balance of the target variable. None of the aforementioned studies reports the balance of the target classes; however, the number of classes of each dataset is reported in Kozak et al. (2020), Karimi et al. (2012) and Ghomeshi et al. (2019b). With this in mind, the results of these studies should be interpreted with caution. Other studies (including Adnan et al. 2020) also evaluate the classification performance using either a confusion matrix, ROC-index, F1-score or precision/recall performance, which all control for the balance of the target classes. Evaluation is most often performed using two-fold partitioning of the data (Table 7); however, this varies across the studies depending on whether incremental learning or a sliding window approach is used. In Abid et al. (2019) novelty detection capability is evaluated using a train and test period, where 2 concepts are present in the training period, and 4 concepts (two novel classes are added) in the test period. In Rehman et al. (2019) the authors use a training period of one batch (initialization phase) and subsequently use nine test batches to evaluate the online performance of the suggested approach.

Table 6 (rows recoverable from the source)

| Study | Learning mode | Adaptation | Drift detection |
|---|---|---|---|
| Ghomeshi et al. (2019a) | Incremental | Blind | None |
| Pinto et al. (2014) | Incremental | Informed | Sequential analysis |
| Abid et al. (2019) | Sliding window | Informed | Distributional test |
| Aydogdu and Ekinci (2020) | Sliding window | Blind | None |
| Ghomeshi et al. (2019b) | Sliding window | Informed | Control charts |
| Kumar and Batra (2018) | Batch | Blind | None |
| Rehman et al. (2019) | Sliding window | Blind | None |
| Kozak et al. (2020) | Incremental | Blind | None |
| Karimi et al. (2012) | Batch | | |
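The majority-class bias of accuracy discussed in this section can be illustrated with a small constructed example. The data below are invented for illustration and do not come from any of the surveyed studies; the macro-averaged recall is used here as one simple example of a metric that accounts for class balance.

```python
def accuracy(y_true, y_pred):
    """Share of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for every class that occurs in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return recalls


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Constructed example: 95 % of the samples belong to class 0 and the
# classifier always predicts class 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy(y_true, y_pred), macro_recall(y_true, y_pred))
```

The classifier never identifies the minority class, yet its accuracy is high, whereas the macro-averaged recall exposes the failure. This is why reporting the class balance, or a balance-aware metric, matters when accuracy is the primary metric.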

Chronological trends
To illustrate some of the trends in the included literature over time, each study has been qualitatively coded into categories related to the type and family of the metaheuristics used, as well as the type of concept drift and the type of AutoML problem. The results can be seen in Fig. 4.
Looking at the type of metaheuristic, it can be seen that the vast majority of the studies are population-based, as mentioned in Section 4.1. Only two of the studies use other types of metaheuristics: construction-based (Kozak et al. 2020) and local search (Pinto et al. 2014). Next, the proposed metaheuristics can be examined by categorizing which existing algorithm they are derived from, if any. Looking at the upper-left diagram in Fig. 4, it can be observed that 11 of the proposed algorithms are derived from either Genetic Algorithm (GA) or Particle-Swarm Optimization (PSO). Whereas PSO saw the most interest from 2018 to 2020, the GA-based variants are more stable over time. Neither algorithm is new, with parts of Genetic Algorithms first introduced in 1950 (Turing 1950), and Particle-Swarm Optimization proposed in 1995 (Kennedy and Eberhart 1995). As the most recent studies are not included due to the filtering criteria, it cannot be determined if the level in 2019-2020 was temporary or not.
In terms of the drift types studied, the majority of the studies in the full period include real concept drift. In some of the studies, the drift type is not reported, as mentioned in Section 4.3.1; however, there does not seem to be a temporal trend. Finally, a pattern in the type of AutoML problem can be observed in the lower-right diagram in Fig. 4: The early works found from 2012 mainly focus on conventional machine learning

Discussion and future work
The initial aim of this study was to search for literature on drift adaptation of machine learning models using metaheuristics. As the results of this literature review show, multiple problems studied in automated machine learning (FS, HPO, FMS) have been addressed and implemented as online versions in the found literature (Kozak et al. 2020; Ghomeshi et al. 2019a, 2019b). Based on the retrieved literature, the testing of concept drift adaptation in itself can be complex, and the level of detail reported seems to vary across studies. Standards for evaluating machine learning algorithms, and especially concept drift adaptation, differ across the found studies. In future research, it might be helpful to choose evaluation metrics that are unbiased with respect to the balance of the target variable (in the case of classification problems). In addition, the assumption that the ground truth is available immediately after the prediction might not hold in many situations, and it might therefore be beneficial to additionally test scenarios where the ground truth is delayed, as in Rehman et al. (2019). Using real-world data with drift in combination with simulated drift is a strength of some studies, as it provides an in-depth understanding of the performance in various situations that might occur in a real setting. However, the external validity of the results might be improved if the nature of the concept drift from real-world data sources is reported alongside the results of a given framework.
The results of this study show that multiple solutions exist for combining black-box optimisation (Feurer and Hutter 2019) with machine learning in order to automate what is referred to as model maintenance in ML life-cycle terminology (Vartak and Madden 2018). This effectively illustrates that the literature has advanced over the years from solely focusing on single tasks in the CRISP-DM framework (Chapman et al. 2000), to performing online FMS in a data stream with concept drift. However, the crucial step of aligning the project goals, evaluation metrics and objective functions remains a manual task. Amongst the metaheuristics used, population-based methods were the most common, possibly due to the benefits of parallel computation (training multiple candidate ML models simultaneously), their simplicity, or individual context-dependent strengths (Kozak et al. 2020;Abid et al. 2019). It remains unclear whether one population-based approach is better than another, as different population-based metaheuristics are not compared to each other in any of the found studies (only variants of the same algorithm, as in Ghomeshi et al. 2019a, 2019b; Karimi et al. 2012; Abdulkarim and Engelbrecht 2019). Comparing computational cost with accuracy over time, given different drift patterns, across multiple population-based metaheuristics, might further the knowledge of their differences in running costs versus performance and thereby the long-term business value.
Fig. 4 (panel descriptions) Lower-left: drift type included in the study. Lower-right: AutoML problem type

Threats to validity
A general threat to the validity of this study is a potential selection bias in the retrieval of literature. To make this potential bias more transparent, the literature search and study selection process have been documented with the full queries and the exclusion criteria. A significant limitation of the results is that they are a snapshot in time and that the search engine (Google Scholar) might be updated over time, so the results cannot be reliably reproduced. The fourth exclusion criterion (number of citations when published before 2021) also limits the findings because 1) results filtered out in this study might get more citations in the future and 2) studies with no citations might still fit all other criteria and present a valid method.

Conclusion
The results show that population-based metaheuristics are the most popular methods in the found literature. In particular, Genetic Algorithms and Particle-Swarm Optimisation are two popular metaheuristics used across multiple fields (engineering, computer science, managerial decision support, finance and social science). As none of the found studies compares different population-based metaheuristics to each other, it remains unclear whether one variant is superior to another across the problem use cases. It is therefore suggested that future research focus on comparing not only machine learning algorithms but the performance across metaheuristics as well. The proposed approaches in the found literature are evaluated using either real-world data or a combination of synthetic and real-world data sets. In terms of drift adaptation, the metaheuristics in the early literature are primarily used to automate single tasks in machine learning development, such as feature selection or hyperparameter optimisation. In more recent literature (at the time of writing), full model selection is a more widespread utilisation of metaheuristics for drift adaptation. Unfortunately, the found literature shows general signs of issues in the evaluation of machine learning models: In 4 of the 17 retrieved studies, the class distribution of the target variable is not reported, while accuracy is the only metric used for evaluating model performance. This is generally a problem, as the performance in minority classes can be neglected in data with unbalanced target class distributions. The majority of the found studies evaluate performance on real drift based on sudden, gradual or recurring drift patterns. In the studies using synthetic, or a mix of synthetic and manipulated real-world data, the formulation is generally transparent and therefore comparable to other studies testing the same problem type.
However, in multiple studies using real-world data, the characteristics of the drift tested (type and pattern) are either unknown or not reported. It is therefore suggested that future works include drift characteristics alongside the relative performance of the proposed solutions.

Funding
Open access funding provided by Norwegian University of Life Sciences

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.