Automated Adaptation Strategies for Stream Learning

Automation of machine learning model development is increasingly becoming an established research area. While automated model selection and automated data pre-processing have been studied in depth, there is a gap concerning automated model adaptation strategies when multiple strategies are available. Manually developing an adaptation strategy can be time consuming and costly. In this paper we address this issue by proposing the use of flexible adaptive mechanism deployment for the automated development of adaptation strategies. Experimental results obtained by using the proposed strategies with five adaptive algorithms on 36 datasets confirm their viability. These strategies achieve better or comparable performance to the custom adaptation strategies and to the repeated deployment of any single adaptive mechanism.


Introduction
Automated model selection has long been studied (Wasserman, 2000) and recently, notable advances in practical automated machine learning (AutoML) approaches (Hutter et al., 2011; Lloyd et al., 2014; Kotthoff et al., 2017; Mohr et al., 2018; Martin Salvador et al., 2019; Olson and Moore, 2019; Kedziora et al., 2020) have been made. In addition, automated data pre-processing in the context of complex machine learning pipeline generation and validation has also been a topic of recent interest (Feurer et al., 2015; Martin Salvador et al., 2019; Nguyen et al., 2020). There is, however, a gap concerning the automated development of a model's adaptation strategy, which is addressed in this paper. Here we define adaptation as changes in the model's training set, parameters and structure, all designed to track changes in the underlying data generating process over time. This contrasts with model selection, which focuses on parameter estimation and the choice of the model family to sample from.
With the current advances in data storage, database and data transmission technologies, learning on streaming data has become a critical part of many processes. Many models which are used to make predictions on streaming data are static, in the sense that they do not learn on current data and hence remain unchanged. However, there exists a class of models, stream learning models, which are capable of adding observations from the data stream to their training sets. Even though these models utilise the data as it arrives, situations can still arise where the underlying assumptions of the model no longer hold. We call such settings dynamic environments, where changes in data distribution (Zliobaite, 2011), changes in features' relevance (Fern and Givan, 2000) and non-symmetrical noise levels (Schmidt and Lipson, 2007) are common. These phenomena are sometimes called concept drift. It has been shown that many changes in the environment which are no longer reflected in the model contribute to the deterioration of the model's accuracy over time (Schlimmer and Granger, 1986; Street and Kim, 2001; Klinkenberg, 2004; Kolter and Maloof, 2007). This requires constant manual retraining and readjustment of the models, which is often expensive, time consuming and in some cases impossible, for example when the historical data is no longer available. Various approaches have been proposed to tackle this issue by making the model adapt itself to possible changes in the environment while avoiding complete retraining. These approaches, however, are manually designed, and the application of automated machine learning to streaming data remains scarce; this is the gap we aim to address.
Typically there are several possible ways, or adaptive mechanisms (AMs), to adapt a given model. A single iteration of adaptation is achieved by deploying one of multiple AMs (including the trivial "do nothing"), which changes the state of the existing model. Thus, during the model's operation, it is adapted by the sequential deployment of various AMs as new data arrives. We call the order of this deployment an adaptation strategy (AS). While in most of the existing research these adaptation strategies are custom (i.e. algorithm-specific) and fixed at the design stage of the algorithm, the sequential adaptation framework proposed in our earlier work (Bakirov et al., 2015) enables flexible adaptation strategies without a prescribed AM deployment order. These flexible adaptation strategies, automatically developed according to this framework, can be applied to any set of adaptive mechanisms for various machine learning algorithms. This removes the need to design custom adaptive strategies, resulting in the automation of the adaptation process. In this work we empirically show the viability of the automated adaptation strategies based on cross-validation (Bakirov et al., 2015), with the optional use of retrospective model correction (Bakirov et al., 2016).
We focus on the batch prediction scenario, where data arrives in large segments called batches. This is a common industrial scenario, especially in the chemical, microelectronics and pharmaceutical areas (Cinar et al., 2003). For the experiments we use the Simple Adaptive Batch Learning Ensemble (SABLE) (Bakirov et al., 2015) and batch versions of four popular stream learning algorithms: the Dynamic Weighted Majority (DWM) (Kolter and Maloof, 2007), the Paired Learner (PL) (Bach and Maloof, 2010), Leveraged Bagging (LB) (Bifet et al., 2010b) and BLAST (van Rijn et al., 2015). The use of these five algorithms allows us to explore different types of online learning methods: a local experts ensemble for regression in SABLE, global experts ensembles for classification in DWM and LB, switching between two models in PL, and a heterogeneous global ensemble in BLAST.
After large-scale experimentation with 5 regression and 31 classification datasets, the main finding of this work is that, in our settings, the proposed automated adaptive strategies show comparable accuracy to the custom adaptive strategies and, in many cases, to the repeated deployment of a single "best" AM. Thus, they are feasible to use for adaptation purposes, while saving the time and effort spent on designing custom strategies.
The remainder of the paper is organised as follows. Section 2 presents related work on automated machine learning and adaptive mechanisms. Section 3 presents the mathematical formulation of the framework of adaptation with multiple adaptive mechanisms in the batch streaming scenario. Section 4 introduces the algorithms used for the experimentation, including their inherent adaptive mechanisms and custom adaptation strategies. The experimental methodology, the datasets on which experiments were performed, and the results are given in Section 5. We give our final remarks in Section 6.

Related Work
This section provides the background for our research. We start with a review of relevant automated machine learning approaches, particularly those which consider the streaming data scenario. We follow up with a broad analysis of the ML literature from the adaptive mechanisms point of view, where we introduce a simple hierarchy of adaptation. We then discuss how the multiple adaptive mechanisms paradigm has been used for automating the design of predictive algorithms.

Automated machine learning for streaming data
Automated machine learning is an active research area. So far, however, it has mostly been applied to static datasets, and there are few works which consider automation for the streaming scenario. Among these, different approaches exist. One of the works predating the most recent wave of AutoML research can be found in (Kadlec and Gabrys, 2009), where a general purpose architecture to develop robust, adaptive prediction systems for autonomous operation in changing environments on streaming data was proposed. Various instantiations of this architecture followed, focusing on challenging problems from the process industry when building adaptive, predictive soft sensors (Kadlec and Gabrys, 2010, 2011; Bakirov et al., 2017).
Taking advantage of the recent wave of research in AutoML, an alternative approach to adaptation to changing environments was proposed in (Martín Salvador et al., 2016), where repeated automated deployment of Auto-WEKA for Multi-Component Predictive Systems (MCPS) to learn from new batches of data was used for life-long learning and the adaptation of complex MCPS applied to changing streaming data from process industries. Celik and Vanschoren (2020) present a development of this idea, adding drift detection and experimentation with several open source AutoML frameworks. An interesting approach closely tied to Auto-sklearn is described in (Madrid et al., 2019). The authors propose using the ensemble nature of this framework to deal with streaming data, by adapting the weights of experts and adding new ones.
Some of the other recently proposed relevant methods are primarily focused on hyper-parameter optimisation problems. For example, Veloso et al. (2018) propose hyper-parameter optimisation for streaming regression problems using the Nelder-Mead algorithm. In their experiments they optimise the hyper-parameters of one specific regression method. Carnein et al. (2020) focus on hyper-parameter selection for clustering of data in a streaming environment. They propose utilising a dynamic ensemble of different hyper-parameter configurations.
Despite the existing research, as acknowledged and discussed in a recent comprehensive and synthesising review of concepts in AutoML research and beyond (Kedziora et al., 2020), the pursuit of autonomy, described as the AutoML system's capability to independently adapt the ML solution over a lifetime of operation in changing environments, remains a lofty goal.

Adaptive mechanisms
Adapting machine learning models is an essential strategy for automatically dealing with changes in an underlying data distribution to avoid training a new model manually. Modern machine learning methods typically contain a complex set of elements allowing many possible AMs. This can increase the flexibility of such methods and broaden their applicability to various settings. However, the existence of multiple AMs also increases the decision space with regard to the adaptation choices and parameters, ultimately increasing the complexity of the adaptation strategy. A possible hierarchy of AMs is presented in Figure 1.
In a streaming data setting, to increase the accuracy, it can be beneficial to include recent data in the training set of the predictive models. On the other hand, retraining a model from scratch is often inefficient, particularly when dealing with high-throughput scenarios, or even impossible when the historical data is no longer available. For these cases, the solution is updating the model using only the available recent data. This can be done inherently by some general purpose ML algorithms, e.g. Naive Bayes, or by using stream/online algorithms, e.g. online Least Squares Estimation (Jang et al., 1997), online boosting and bagging (Oza and Russell, 2001), etc. Additionally, for non-stationary data, it becomes important to select a training set that is not only of sufficient size but also relevant to the current data. This is often achieved by a moving window (Widmer and Kubat, 1996; Klinkenberg, 2004; Zliobaite and Kuncheva, 2010) or decay approaches (Joe Qin, 1998; Klinkenberg and Joachims, 2000).
The final layer of adaptation is changing the model's parameters, e.g. the experts' combination weights in ensemble methods. These weights are often recalculated or updated throughout a model's runtime (Littlestone and Warmuth, 1994; Kolter and Maloof, 2007; Elwell and Polikar, 2011; Kadlec and Gabrys, 2011; Bakirov et al., 2017). Another group of techniques belonging to this family are methods using meta-learning for model adaptation (Nguyen et al., 2012; Rossi et al., 2014; van Rijn et al., 2015; Lemke and Gabrys, 2010). These methods generally include training a meta-model using meta-features. The meta-model is then used to select one or more predictors to calculate the final prediction. The change of the meta-model can then be seen as a change in the parameters of the predictive model.
In this work we consider the possibility of using multiple different adaptive mechanisms, most often at different levels of the hierarchy. Many modern machine learning algorithms for streaming data explicitly include this possibility. A prominent example is the adaptive ensemble methods (Wang et al., 2003; Kolter and Maloof, 2007; Scholz and Klinkenberg, 2007; Bifet et al., 2009; Kadlec and Gabrys, 2010; Elwell and Polikar, 2011; Alippi et al., 2012; Souza and Araújo, 2014; Gomes Soares and Araújo, 2015; Bakirov et al., 2017), which often feature AMs from all three levels of the hierarchy: online update of experts, changing the experts' combination weights and modification of the experts' set. Machine learning methods with multiple AMs are not limited to ensembles, but also include Bayesian networks (Castillo and Gama, 2006), decision trees (Hulten et al., 2001), model trees (Ikonomovska et al., 2010), champion-challenger schemes (Bach and Maloof, 2010), etc.

Automating design of algorithms with multiple AMs
The existence of multiple AMs raises questions with regard to how they should be deployed. This includes defining the order of deployment and the adaptation parameters (e.g. decay factors, expert weight decrease factors, etc.). It should be noted that all of the aforementioned algorithms use custom adaptive strategies, meaning that they deploy AMs in a manner specific to each of them. It follows that designing adaptive machine learning methods is a complex enterprise and an obstacle to the automation of machine learning model design. Kadlec and Gabrys (2009) present a plug-and-play architecture for pre-processing, adaptation and prediction which foresees the possibility of using different adaptation methods in a modular fashion, but does not address the method of AM selection. Bakirov et al. (2015, 2016) have presented several such methods for AM selection for their adaptive algorithm, which are discussed in detail in Section 3.2. These methods can be seen as automated adaptive strategies, applicable to all adaptive machine learning methods with multiple AMs. This allows simply using the described strategies for model adaptation, once the available AMs have been defined.

Formulation
As adaptation mechanisms can affect several elements of a model and can depend on performance several time steps back, it is necessary to clarify the concepts via a framework to avoid confusion. We assume that the data is generated by an unknown time-varying data generating process which can be formulated as:

y_τ = ψ(x_τ, τ) + ε_τ, (1)

where ψ is the unknown function, ε_τ a noise term, x_τ ∈ R^M is an input data instance, and y_τ is the observed output at time τ. Then we consider the predictive method at a time τ as a function:

ŷ_τ = f_τ(x_τ, Θ_f), (2)

where ŷ_τ is the prediction, f_τ is an approximation (i.e. the model) of ψ(x, τ), and Θ_f is the associated parameter set. Our estimate, f_τ, evolves via adaptation as each batch of data arrives, as is now explained.

Adaptation
In the batch streaming scenario considered in this paper, data arrives in batches with τ ∈ {τ_k, ..., τ_{k+1} − 1}, where τ_k is the start time of the k-th batch. If n_k is the size of the k-th batch, then τ_{k+1} = τ_k + n_k. It then becomes more convenient to index the model by the batch number k, denoting the inputs as X_k = {x_{τ_k}, ..., x_{τ_{k+1}−1}} and the outputs as y_k = {y_{τ_k}, ..., y_{τ_{k+1}−1}}. We examine the case where the prediction function f_k is static within the k-th batch. We denote the a priori predictive function at batch k as f^−_k, and the a posteriori predictive function, i.e. the adapted function given the observed output, as f^+_k. An adaptive mechanism, g(·), may thus formally be defined as an operator which generates an updated prediction function based on the batch V_k = {X_k, y_k} and other optional inputs:

f^+_k = g(V_k, f^−_k, ŷ_k, Θ_g), (3)

or, alternatively, f^+_k = f^−_k ∘ g_k for conciseness. Note that f^−_k and ŷ_k are optional arguments and Θ_g is the set of parameters of g. The function is propagated into the next batch as f^−_{k+1} = f^+_k, and predictions themselves are always made using the a priori function f^−_k. We examine the situation where a choice of multiple different AMs, G = {∅, g^1, ..., g^H}, is available. Any AM g^{h_k} ∈ G can be deployed on each batch, where h_k denotes the AM deployed at batch k. As the history of all adaptations up to the current batch k has in essence created f^−_k, we call the sequence g^{h_1}, ..., g^{h_k} an adaptation sequence. Note that we also include the option of applying no adaptation, denoted by ∅. In this formulation, only one element of G is applied to each batch of data; deploying multiple adaptation mechanisms on the same batch is accounted for by its own symbol in G. Figure 2a illustrates our initial formulation of adaptation.
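As a concrete illustration of this formulation, the following Python sketch treats a model as a callable and an AM as an operator mapping a model and a batch to an updated model. This is an illustrative sketch only, not the paper's implementation; the names `run_stream`, `no_adaptation` and `choose_am` are ours:

```python
def no_adaptation(model, batch):
    """The trivial AM (the empty-set symbol in the text): model is unchanged."""
    return model

def run_stream(model, batches, choose_am, mechanisms):
    """Batch adaptation loop: predict with the a-priori model f_minus,
    then apply one chosen AM per batch to obtain f_plus, which becomes
    the a-priori model of the next batch."""
    predictions = []
    for X_k, y_k in batches:
        y_hat = [model(x) for x in X_k]          # predictions use f_minus only
        predictions.append(y_hat)
        g = choose_am(mechanisms, model, (X_k, y_k))
        model = g(model, (X_k, y_k))             # f_plus is propagated forward
    return predictions, model
```

Any adaptation strategy from the paper can then be expressed as a particular `choose_am` policy.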

Automated adaptation strategies
In this section we present different generic automated adaptive strategies offering flexible deployment of AMs, which can be applied to any adaptive algorithm. At every batch k, an AM g^{h_k} must be chosen to deploy on the current batch of data. To obtain a benchmark performance, an adaptation strategy which minimizes the error over the incoming data batch {X_{k+1}, y_{k+1}} can be used:

h_k = argmin_{h: g^h ∈ G} ε((f^−_k ∘ g^h)(X_{k+1}), y_{k+1}), (4)

where ε denotes the chosen error measure. Since X_{k+1} and y_{k+1} are not yet available at batch k, this strategy is not applicable in practice. Also note that this may not be the overall optimal strategy which minimizes the error over the whole dataset. We refer to this strategy as Oracle.
Since the Oracle strategy cannot be realised in practice, below we list some alternatives. The simplest adaptation strategy is applying the same AM to every batch. The scheme of this strategy is given in Figure 3a. Note that this scheme fits the "Adaptation" box in Figure 2a. A more common practice (see Section 2) is applying multiple or all available adaptive mechanisms. The scheme of this strategy is given in Figure 3b, which again fits the "Adaptation" box in Figure 2a.
As introduced in (Bakirov et al., 2015), it is also possible to use V_k for the choice of g^{h_k}. Given the observations y_k, the a posteriori prediction error of each candidate AM can be calculated; however, evaluating it on the same data used for adaptation risks overfitting.3 To obtain a generalised estimate of the prediction error we apply q-fold4 cross-validation. The cross-validatory adaptation strategy (denoted as XVSelect) uses a subset (fold) of V_k for error estimation while adapting on the remaining instances. This is repeated q times, resulting in q different error values, and the AM, g^{h_k} ∈ G, with the lowest average error measure is chosen. If more than one AM has the same lowest average error, a selection among them is made randomly or utilising prior knowledge. In summary:

h_k = argmin_{h: g^h ∈ G} ε_×((f^−_k ∘ g^h)(X_k), y_k), (5)

where ε_× denotes the cross-validated error. The scheme of XVSelect is given in Figure 3c.

The next strategy can be used in combination with any of the above strategies, as it focuses on the history of the adaptation sequence and retrospectively adapts two steps back. This is called the retrospective model correction (Bakirov et al., 2016). Specifically, we set the current model to the output of the AM at batch k − 1 which would have produced the best estimate on batch k:

f^−_{k+1} = f^−_{k−1} ∘ g^{h*_{k−1}} ∘ g^{h_k}, where h*_{k−1} = argmin_{h: g^h ∈ G} ε((f^−_{k−1} ∘ g^h)(X_k), y_k). (6)

The potential draws can again be resolved randomly or using prior knowledge. Using the cross-validated error measure in Equation 6 is not necessary, because g^{h*_{k−1}} is independent of y_k. Also note the presence of g^{h_k}; retrospective correction does not in itself produce an f_{k+1} and so cannot be used for prediction unless it is combined with another strategy (g^{h_k}). This strategy can be extended to consider the sequence of r AMs while choosing the optimal state for the current batch, which we call r-step retrospective correction:

(h*_{k−r}, ..., h*_{k−1}) = argmin_{h_{k−r}, ..., h_{k−1}} ε((f^−_{k−r} ∘ g^{h_{k−r}} ∘ ... ∘ g^{h_{k−1}})(X_k), y_k). (7)

The scheme for retrospective correction is given in Figure 3d.

3 As a solid example, consider the case where f^+_k is f^−_k retrained using {X_k, y_k}. In this case y_k are part of the training set and so we risk overfitting the model if we also evaluate the goodness of fit on y_k.
4 In subsequent experiments, q = 10.
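A minimal sketch of the cross-validatory selection idea behind XVSelect, assuming models are callables and AMs are functions of a model and a batch. The fold-splitting details and the name `xv_select` are illustrative, not the paper's code:

```python
import statistics

def xv_select(model, mechanisms, X, y, q=10, error=lambda a, b: (a - b) ** 2):
    """For each AM, adapt on q-1 folds and score on the held-out fold;
    return the AM with the lowest mean cross-validated error."""
    n = len(X)
    q = min(q, n)                                   # guard against tiny batches
    fold_errors = {i: [] for i in range(len(mechanisms))}
    for fold in range(q):
        test_idx = set(range(fold, n, q))
        X_tr = [X[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        X_te = [X[i] for i in range(n) if i in test_idx]
        y_te = [y[i] for i in range(n) if i in test_idx]
        for i, g in enumerate(mechanisms):
            adapted = g(model, (X_tr, y_tr))        # adapt on the training folds
            errs = [error(adapted(x), t) for x, t in zip(X_te, y_te)]
            fold_errors[i].append(statistics.mean(errs))
    best = min(fold_errors, key=lambda i: statistics.mean(fold_errors[i]))
    return mechanisms[best]
```

Draws would additionally need a tie-breaking rule (random or prior knowledge), which is omitted here for brevity.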
Since the retrospective correction can be deployed alongside any adaptation scheme, we modify the general adaptation scheme (Figure 2a) accordingly, resulting in Figure 2b, where Figure 3d fits in the box "Correction". Notice that when using this approach, the function used to generate predictions on a batch can be different from the corrected function which is propagated as input for the next adaptation.
An important technical detail for both cross-validatory selection and retrospective correction is the resolution of draws, when two or more AMs show the same predictive performance. Draws appear frequently in classification scenarios with smaller batch sizes. In these cases, prior knowledge of the AMs' predictive performance can be used to make a selection. If no such knowledge exists, a random AM, or the AM which minimises the runtime, can be chosen.
We next examine the prediction algorithms with respective adaptive mechanisms (the set G) used in this research.
SABLE is used to address the regression problem while the other algorithms address the classification problem. We have developed batch versions of these classification algorithms, which are used in the experiments. Our selection of algorithms allows us to explore different types of online learning methods and different adaptive mechanisms, and to demonstrate that the adaptive strategies described in this paper are in fact generic and can be applied to various adaptive algorithms with multiple AMs. Below, the details of model adaptation with each algorithm are presented.

Simple Adaptive Batch Local Ensemble (SABLE) adaptation
SABLE (Bakirov et al., 2015) uses an ensemble of experts, each implemented as a linear model formed through Recursive Partial Least Squares (RPLS) (Joe Qin, 1998). To obtain the final prediction, the predictions of the base learners are combined using input/output space dependent weights (i.e. local learning), which are reflected in the descriptor of each expert. SABLE is designed for the batch streaming scenario. It supports the creation and merger of base learners.
The SABLE algorithm allows the use of five different adaptive mechanisms (including the possibility of no adaptation). AMs are deployed as soon as the true values for the batch are available and before predicting on the next batch. The SABLE AMs are described below. It should be noted that, as SABLE was conceived as an experimentation vehicle for exploring the effects of AM sequences, it does not provide a default custom adaptation strategy.
- SAM0 (No adaptation). No changes are applied to the predictive model, corresponding to ∅.
- SAM1 (Batch learning). The simplest AM augments the existing data with the data from the new batch and retrains the model. Given the predictions {ŷ^1, ..., ŷ^I} of each expert f_i ∈ F on V and the measurements of the actual values, y, V is partitioned by assigning every instance [x_j, y_j] ∈ V to the expert with the most accurate prediction on it. This creates disjoint subsets V_i, i = 1...I, with V_1 ∪ ... ∪ V_I = V. Then each expert is updated using the respective dataset V_i. This process updates experts only with the instances on which they achieve the most accurate predictions, thus encouraging the specialisation of experts and ensuring that a single data instance is not used in the training data of multiple experts.
- SAM2 (Batch learning with forgetting). This AM is similar to the one above but uses decay to reduce the weight of the experts' historical training data, making the most recent data more important. It is realised via the RPLS update with forgetting factor λ, a hyper-parameter of SABLE.
- SAM3 (Descriptors update / weights change). This AM recalculates the local descriptors using the new batch, which amounts to changing the weights of the experts.
- SAM4 (Creation of a new expert). A new expert s_new is created from V_k. It is then checked whether the newly created expert is similar to any existing expert, in which case the older expert is removed and their descriptors are merged. Finally the descriptors of all resulting experts are updated.
- SAM5. SAM2 (Batch learning with forgetting) followed by SAM4 (Creation of a new expert).
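The instance-to-expert partition used by SAM1 could be sketched as follows. The tie-breaking policy (the first best expert wins) and the function name are our assumptions, not taken from SABLE's implementation:

```python
def partition_by_best_expert(experts, X, y):
    """Assign each instance to the training subset of the expert whose
    prediction on it was the most accurate (absolute error; ties go to
    the first such expert). Returns {expert_index: [(x, y), ...]}."""
    subsets = {i: [] for i in range(len(experts))}
    for x_j, y_j in zip(X, y):
        errors = [abs(f(x_j) - y_j) for f in experts]
        best = errors.index(min(errors))
        subsets[best].append((x_j, y_j))
    return subsets
```

Each expert would then be updated (e.g. via an RPLS step) using only its own subset, so no instance appears in the training data of more than one expert.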

Batch Dynamic Weighted Majority (bDWM) adaptation
bDWM is an extension of DWM (Kolter and Maloof, 2007) designed to operate on batches of data instead of single instances as in the original algorithm. bDWM is a global experts ensemble. Assume a set of experts S = {f_1, ..., f_I} with input x and a set of all possible labels C = {c_1, ..., c_J}. Then for all i = 1...I and j = 1...J the matrix A with the following elements can be calculated:

a_ij = 1 if f_i(x) = c_j, and a_ij = 0 otherwise.

Assuming a weights vector w = {w_1, ..., w_I} for the respective predictors in S, the sum of the weights of the predictors which voted for label c_j is z_j = Σ_{i=1}^{I} w_i a_ij, and the ensemble predicts the label with the highest z_j. An adaptive model based on bDWM starts with a single expert and can be adapted using an arbitrary sequence of 8 possible AMs (including no adaptation) given below.
- DAM0 (No adaptation). No changes are applied to the predictive model, corresponding to ∅.
- DAM1 (Batch learning). After the arrival of the batch V_t at time t, each expert is updated with it.
- DAM2 (Weights update and experts pruning). Weights of experts are updated using the following rule:

w^{t+1}_i = w^t_i · e^{u^t_i},

where w^t_i is the weight of the i-th expert at time t, and u^t_i is its accuracy on the batch V_t. The weights of all experts in the ensemble are then normalised, and the experts with a weight less than a defined threshold η are removed. It should be noted that the choice of the factor e^{u^t_i} is inspired by Herbster and Warmuth (1998), although due to different algorithm settings, the theory developed there is not readily applicable to our scenario. This weights update differs from the original DWM, which uses an arbitrary factor β < 1 to decrease the weights of misclassifying experts.
- DAM3 (Creation of a new expert). A new expert is created from the batch V_t and is given a weight of 1.

bDWM (custom adaptive strategy). Having presented the separate adaptive mechanisms, we now describe bDWM, a batch version of the original DWM. It starts with a single expert with a weight of one. At time t, after the arrival of a new batch V_t, the experts make predictions and the overall prediction is calculated as shown earlier in this section. After the arrival of the true labels, all experts learn on the batch V_t (invoking DAM1), their weights are updated (DAM2) and the ensemble's accuracy u^t is calculated. If u^t is less than the accuracy of the naive majority classifier (based on all the batches of data seen up to this point) on the last batch, a new expert is created (DAM3). The schematic of this strategy is shown in Figure 4a. This scheme fits in the "Adaptation" boxes in Figures 2a and 2b.
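The DAM2 update rule above (scale by e^{u}, normalise, prune below η) can be sketched as a small helper; the return format and parameter name `eta` are illustrative:

```python
import math

def dwm_weight_update(weights, accuracies, eta=0.01):
    """Scale each expert weight by e^{u_i} (u_i = accuracy on the batch),
    normalise the weights, and drop experts whose normalised weight falls
    below the pruning threshold eta. Returns (kept_indices, new_weights)."""
    scaled = [w * math.exp(u) for w, u in zip(weights, accuracies)]
    total = sum(scaled)
    normalised = [s / total for s in scaled]
    kept = [i for i, w in enumerate(normalised) if w >= eta]
    return kept, [normalised[i] for i in kept]
```

Because accurate experts are multiplied by a larger factor e^{u}, repeated application concentrates weight on the recently well-performing experts, mirroring the intent of the original DWM's β-discounting.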

Batch Paired Learner (bPL) adaptation
bPL is an extension of PL (Bach and Maloof, 2010) designed to operate on batches of data instead of single instances as in the original algorithm. bPL maintains two learners: a stable learner, which is updated with all of the incoming data and which is used to make predictions, and a reactive learner, which is trained only on the two most recent batches. For this method, three adaptive mechanisms are available, as described below.
- PAM0 (No adaptation). No changes are applied to the predictive model, corresponding to ∅.
- PAM1 (Updating the stable learner). After the arrival of the batch V_t at time t, the stable learner is updated with it.
- PAM2 (Switching to the reactive learner). The current stable learner is discarded and replaced by the reactive learner.
bPL (custom adaptive strategy). Having presented the separate adaptive mechanisms, we now describe bPL, a batch version of the original PL. Its adaptive strategy revolves around comparing the accuracies of the stable (u^t_s) and reactive (u^t_r) learners on each batch of data. Every time u^t_s < u^t_r, a change counter is incremented. If the counter is higher than a defined threshold θ, the existing stable learner is discarded and replaced by the reactive learner, and the counter is reset to 0. As before, a new reactive learner is trained from each subsequent batch. The schematic of this strategy is shown in Figure 4b. This scheme fits in the "Adaptation" boxes in Figures 2a and 2b.
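The switching logic of the bPL custom strategy can be sketched as a per-batch step function; the name and the exact reset behaviour are our reading of the description above:

```python
def paired_learner_step(counter, acc_stable, acc_reactive, theta):
    """One batch of the bPL strategy: increment the change counter whenever
    the reactive learner beats the stable one; once the counter exceeds
    theta, signal a switch to the reactive learner and reset the counter.
    Returns (new_counter, switch_to_reactive)."""
    if acc_stable < acc_reactive:
        counter += 1
    if counter > theta:
        return 0, True
    return counter, False
```

When `switch_to_reactive` is True, the caller would replace the stable learner with the reactive one (PAM2) and continue training a fresh reactive learner on subsequent batches.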

Batch Leveraged Bagging (bLB) adaptation
bLB is an extension of LB (Bifet et al., 2010b) designed to operate on batches of data instead of single instances as in the original algorithm. Leveraged Bagging is based on the Online Bagging (Oza and Russell, 2001) algorithm, but includes improvements such as the removal of experts and the addition of new ones based on the ADWIN (Bifet and Gavaldà, 2007) change detector, randomisation at the ensemble output using output codes, etc. For this method, five adaptive mechanisms (including no change) are available, as described below.
- LAM0 (No adaptation). No changes are applied to the predictive model, corresponding to ∅.
- LAM1 (Batch learning). After the arrival of the batch V_t at time t, each expert is updated with it.
- LAM2 (Removing an existing expert and adding a new one). The expert with the lowest accuracy on the previously seen data is removed, and a new one trained on the most recent batch is added.
- LAM3. LAM1 (Batch learning) followed by LAM2 (Removing an existing expert and adding a new one).
bLB (custom adaptive strategy). Having presented the separate adaptive mechanisms, we now describe bLB, a batch version of the original LB. Its strategy invokes batch learning (LAM1) after the arrival of each batch of data. If the ADWIN change detector detects a change, the expert with the lowest accuracy on the previously seen data is removed, and a new one trained on the most recent batch is added (LAM2). The schematic of this strategy is shown in Figure 5a. This scheme fits in the "Adaptation" boxes in Figures 2a and 2b.
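The expert-replacement step (LAM2) can be sketched as follows, assuming experts and their accuracies on previously seen data are kept in parallel lists (names are illustrative):

```python
def replace_worst_expert(experts, accuracies, new_expert):
    """Drop the expert with the lowest accuracy on the data seen so far
    and append an expert trained on the most recent batch."""
    worst = accuracies.index(min(accuracies))
    return [e for i, e in enumerate(experts) if i != worst] + [new_expert]
```

In bLB this step would be triggered only when ADWIN signals a change; under the automated strategies of Section 3.2 it is instead one of the candidate AMs evaluated on every batch.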

Batch BLAST (bBLAST) adaptation
bBLAST is an extension of BLAST (van Rijn et al., 2015) designed to operate on batches of data instead of single instances as in the original algorithm. BLAST is an ensemble method using different types of base learners (as opposed to the algorithms mentioned above), with Online Performance Estimation used for the weighting. For this method, four adaptive mechanisms (including no change) are available, as described below.
- BAM0 (No adaptation). No changes are applied to the predictive model, corresponding to ∅.
- BAM1 (Batch learning). After the arrival of the batch V_t at time t, each expert is updated with it.
- BAM2 (Reweighing the experts). For every instance [x, y] ∈ V_t, the experts are reweighed according to Online Performance Estimation.
bBLAST (custom adaptive strategy). Having presented the separate adaptive mechanisms, we now describe bBLAST, a batch version of the original BLAST. bBLAST invokes the combination of BAM1 (Batch learning) followed by BAM2 (Reweighing the experts) after the arrival of each batch of data. The schematic of this strategy is shown in Figure 5b. This scheme fits in the "Adaptation" boxes in Figures 2a and 2b.

Experimental results
In the following sub-sections we describe the empirical validation of the proposed approaches. We start by describing the experimental methodology, including experiment settings, specification of datasets, evaluation strategy, libraries and base learners used. We then follow with the comparative analysis of regression and classification results of the proposed and custom adaptive strategies.

Methodology
The purpose of the experiments in this section was to evaluate the usefulness of the proposed strategies. To this end, we performed an empirical comparison of the automated adaptation strategies proposed in Section 3.2 with custom adaptive strategies and with strategies involving the repeated deployment of a single AM. The goal of the automated adaptive strategies is to obtain performance comparable to what one would obtain using a (usually protracted) manually optimised adaptive strategy, including hyper-parameter selection. Therefore, if the proposed strategies attain comparable, or not significantly worse, accuracy levels than the custom strategies, this shall be deemed a success. This section discusses the results in the order of the introduced algorithms. (All of the code except the SABLE algorithm, as well as all the datasets except Oxidizer and Drier, can be found at https://github.com/RashidBakirov/multiple-adaptive-mechanisms; SABLE and the two specified datasets could not be shared for confidentiality reasons.)

Table 1: Compared adaptation strategies.
- BestAM: For each of the AMs (e.g. DAM0 to DAM7 for the bDWM adaptation), repeatedly deploy that same AM on all of the batches, then select the best result among all of the runs. Note that this is a post-hoc strategy used for benchmark purposes, as the AM delivering the best result varies from dataset to dataset and is not known in advance.
- BestAM+RC: The same as BestAM while additionally using retrospective correction after every batch. Note that the best AM here may be different to the one from BestAM.
- XVSelect: Select an AM (i.e. one of DAM0 to DAM7 for the bDWM adaptation) based on the current data batch using the cross-validatory approach described in Section 3.2.
- XVSelect+RC: The same as XVSelect while additionally using retrospective correction after every batch.
- Custom: Use the algorithm's custom adaptive strategy.
- Custom+RC: The same as Custom while additionally using retrospective correction after every batch.

For all of the algorithms we compare the MAE/accuracy of the strategies listed in Table 1. For SABLE, the experimentation uses five real-world regression datasets listed in Table 5 in Appendix A. It has been shown (e.g. in Martin Salvador et al., 2019) that these datasets present different levels of volatility and noise. For the classification algorithms, we use five real-world datasets listed in Table 6 and 26 synthetic datasets listed in Table 7 and visualised in Figure 13 in Appendix A.
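The cross-validatory selection behind XVSelect can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each AM is represented as a function mapping a training fold to a predictor, and it picks the AM with the lowest cross-validated MAE on the current batch (all names are illustrative).

```python
import numpy as np

def xv_select(ams, X_batch, y_batch, q=2):
    """Sketch of XVSelect: pick the adaptive mechanism (AM) with the lowest
    q-fold cross-validated MAE on the current batch.  Each AM is modelled as
    a function (X_train, y_train) -> predictor; predictor(X) returns values."""
    idx = np.arange(len(X_batch))
    errors = []
    for am in ams:
        fold_err = []
        for te in np.array_split(idx, q):          # q test folds
            tr = np.setdiff1d(idx, te)             # remaining data trains the AM
            predictor = am(X_batch[tr], y_batch[tr])
            fold_err.append(np.mean(np.abs(predictor(X_batch[te]) - y_batch[te])))
        errors.append(np.mean(fold_err))
    return int(np.argmin(errors))                  # index of the selected AM
```

In the actual strategy the selected AM would then be deployed on the full batch before the next one arrives.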
For the real-world datasets we use prequential evaluation (Dawid, 1984), a standard evaluation technique for data streams. For the batch scenario it works as follows: at time t we receive the data batch X_t and predict the values/labels ŷ_t. Then the true values/labels y_t are made available, and we calculate the error/accuracy of our predictions. Subsequently {X_t, y_t} are used for adaptation. Thus, the predictions are always made on unseen data, which is not included in the training data in any form. For the synthetic datasets we generate an additional 100 test data instances for each single instance in the training data using the same distribution. The predictive accuracy on a batch is then measured on the test data relevant to that batch. This test data is not used for training or adapting the models.
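The prequential batch loop just described can be sketched as follows (a minimal sketch; `adapt` stands for whichever adaptation strategy is in use, and the names are ours, not from the accompanying code).

```python
import numpy as np

def prequential_batches(model, batches, adapt):
    """Prequential (test-then-train) evaluation over a batch stream:
    predict on the incoming batch first, score, and only then use the
    batch for adaptation, so every score is computed on unseen data.
    `model` exposes predict(X); `adapt(model, X, y)` returns the adapted model."""
    errors = []
    for X_t, y_t in batches:
        y_hat = model.predict(X_t)                   # predict before seeing y_t
        errors.append(np.mean(np.abs(y_hat - y_t)))  # per-batch MAE
        model = adapt(model, X_t, y_t)               # now adapt on {X_t, y_t}
    return model, errors
```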
For the classification algorithms, the statistical significance of differences between the results is assessed using the Friedman test with the post-hoc Nemenyi test, which are widely used to compare multiple classifiers (Demšar, 2006). The Friedman test checks for a statistical difference between the compared classifiers; if one is found, the Nemenyi test is used to identify which classifiers are significantly better than others. We report the results of the Nemenyi tests as Nemenyi plots. These plot the average rank of all methods and the critical difference per batch/base learner. Classifiers that are statistically equivalent are connected by a line.
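This testing procedure can be sketched as below, assuming an accuracy matrix with one row per dataset and one column per method; the q_alpha constants are the alpha = 0.05 values tabulated in Demšar (2006).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Critical values q_alpha for the Nemenyi test at alpha = 0.05 (Demšar, 2006)
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def friedman_nemenyi(scores):
    """scores: (N datasets x k methods) accuracy matrix.  Returns the Friedman
    p-value, the average rank of each method (rank 1 = best) and the Nemenyi
    critical difference CD; two methods whose average ranks differ by less
    than CD are considered statistically equivalent (connected in the plot)."""
    n, k = scores.shape
    _, p = friedmanchisquare(*scores.T)                   # test for any difference
    ranks = np.array([rankdata(-row) for row in scores])  # higher score -> rank 1
    avg_ranks = ranks.mean(axis=0)
    cd = Q_ALPHA_05[k] * np.sqrt(k * (k + 1) / (6.0 * n))
    return p, avg_ranks, cd
```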
For bDWM, bPL and bLB, Naive Bayes (NB) and Hoeffding Trees (HT) (Domingos and Hulten, 2000) were used as base learners. The open source libraries Prtools (Duin et al., 2007), Weka (Hall et al., 2009), MOA (Bifet et al., 2010a) and scikit-multiflow (Montiel et al., 2018) were employed. As no randomness is involved in the evaluation of the datasets, a single run was used to compute the MAE (for regression) and accuracy (for classification) values, except for bLB, where 100 runs were used for each strategy.

Simple Adaptive Batch Local Ensemble (SABLE) results
Three different batch sizes are examined for each dataset in the simulations, together with the hyper-parameters tabulated in Table 8 in Appendix A. These parameter combinations were empirically identified using grid search, optimising the performance of the Oracle strategy (Eq. 4).
The results of the experiments using SABLE for batch sizes n = 50, 100, 200 are given in Table 2. These results suggest that in most cases XVSelect and XVSelect+RC perform better than or comparably to BestAM and BestAM+RC. Overall, XVSelect or XVSelect+RC had the lowest MAE, with a significant difference, in 7 experiments out of 15. XVSelect or XVSelect+RC showed comparable performance (not significantly worse) to BestAM in 11 experiments. The cases where XVSelect and XVSelect+RC perform noticeably worse are the Drier dataset with a batch size of 100 and the Sulfur dataset with all batch sizes. We relate this to the stability of these datasets: indeed, the BestAM in all these cases is the slowly adapting sequence of SAM1, without any forgetting of old information. The difference in batch sizes is important for some datasets. This can be related to the frequency of changes and whether they happen within a batch, which can have a negative impact on XVSelect and XVSelect+RC. Retrospective correction (RC) improved the performance of XVSelect in some cases. For the deployment of a single AM, as seen in the BestAM and BestAM+RC results, RC is more useful for the larger batch sizes, presumably because more training data prevents overfitting.

Batch Dynamic Weighted Majority (bDWM) results
The results of the Nemenyi test are shown in Figure 6. (The full results tables with the accuracy values of each approach on each dataset are accessible from https://github.com/RashidBakirov/multiple-adaptive-mechanisms/tree/master/results.) For four experiments out of six, excluding the NB base learner with batch sizes of 10 and 20, XVSelect and XVSelect+RC are both ranked higher than bDWM (the Custom strategy), in some cases significantly so. For batch size 10 with NB as the base learner, bDWM performs better than both proposed approaches, and for batch size 20, better than XVSelect+RC. The addition of retrospective correction does not seem to bring an obvious benefit to the adaptive strategies; while improving the results in some experiments, in most of the cases it decreases the accuracy. In terms of batch sizes, increasing n seems to improve the performance of XVSelect with the NB base learner. (The Wilcoxon signed-rank test, used for pairwise comparisons, assumes the null distribution is symmetric; this assumption mostly holds for our data.) In general, BestAM provides the best results across all experiments, while BestAM+RC performs slightly worse. It is worth reiterating that, for all of the classification experiments, BestAM and BestAM+RC repeatedly deploy the single AM which delivers the best results for the particular settings (dataset, batch size, base learner). This AM is not known in advance, so this strategy is not attainable in practice and is used for benchmark purposes.

Batch Paired Learner (bPL) results
For bPL and bPL+RC (the Custom and Custom+RC strategies) we used a threshold of θ = 1 for all the experiments. This value was chosen as it was experimentally established that lower threshold values tend to provide better results than higher ones. At the same time, keeping θ > 0 makes use of the change counter mechanism, a characteristic feature of bPL (θ = 0 provided similar results). We present the Nemenyi plots for both base learners on all three batch sizes in Figure 7. For this algorithm too, XVSelect and XVSelect+RC show good performance and are ranked higher than bPL for all batch size and base learner combinations. For bPL adaptation, BestAM+RC performs well in all of the settings; however, the performance of BestAM is poor for the low batch sizes. Retrospective correction appears to be useful for bPL adaptation, providing improvements for BestAM and XVSelect in most settings.
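The change-counter mechanism referred to here can be sketched roughly as follows. This is our illustrative reading of one paired-learner step with threshold θ, not the paper's bPL code; a full implementation would also keep updating the stable learner on all incoming data.

```python
import numpy as np

def bpl_step(stable, reactive, X_t, y_t, counter, theta=1):
    """One batch step of a paired-learner scheme (illustrative sketch).
    The counter accumulates instances where the reactive learner is right
    and the stable one is wrong; once it exceeds theta, a change is assumed
    and the reactive learner replaces the stable one.  Learners expose
    predict(X) and fit(X, y), and have a no-argument constructor."""
    s_ok = stable.predict(X_t) == y_t
    r_ok = reactive.predict(X_t) == y_t
    counter += int((r_ok & ~s_ok).sum())
    if counter > theta:                              # change detected
        stable, counter = reactive, 0
    reactive = reactive.__class__().fit(X_t, y_t)    # reactive sees only the latest batch
    return stable, reactive, counter
```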

Batch Leveraged Bagging (bLB) results
bLB adaptation was implemented by modifying the existing code from scikit-multiflow. The default hyper-parameters were kept. We present the Nemenyi plots of the average accuracy values of 100 runs for each adaptive strategy for both base learners on all three batch sizes in Figure 8. The performance of the proposed XVSelect is consistently better than that of bLB (the Custom strategy) in all of the settings, mostly significantly so. This is even more apparent for higher batch sizes. The behaviour of RC in this case is noteworthy: XVSelect+RC performs consistently worse than XVSelect, although it still beats bLB in all of the settings bar one. On the other hand, bLB with RC (the Custom+RC strategy) is always better than bLB. It is possible that for Leveraged Bagging, combining XVSelect and RC causes the adaptation to overfit to the last batch, thus reducing the accuracy. For bLB adaptation, BestAM outperforms the proposed approaches in most of the settings; however, there are no significant differences to the performance of XVSelect.

Batch BLAST (bBLAST) results
bBLAST adaptation was implemented by modifying the existing MOA code. In contrast to the algorithms in the previous sections, bBLAST uses not a single but multiple base learning algorithms: Hoeffding Tree, Naive Bayes, Perceptron, Stochastic Gradient Descent and k Nearest Neighbour. All of the parameters of bBLAST, as well as those of the base experts, are kept at the MOA defaults. We present the Nemenyi plots of the average accuracy values of the selected adaptive strategies for all three batch sizes in Figure 9. The performance of bBLAST (the Custom strategy) is consistently better than that of the proposed adaptive strategies in all of the settings, though not significantly different from XVSelect+RC for batch sizes n = 10 and n = 20. The RC effect here is the mirror opposite of that for bLB: bBLAST with RC (Custom+RC) always performs worse than bBLAST, whereas XVSelect+RC always performs better than XVSelect. The performance of the BestAM and BestAM+RC strategies for this algorithm is markedly worse than for the others, as they are often outperformed by XVSelect and XVSelect+RC.

Summary of classification results
The conducted experiments give insight into several questions. Firstly, we are interested in whether the proposed adaptation strategies XVSelect and XVSelect+RC provide results comparable to the custom strategies or to the best results achieved by the repeated deployment of any AM. Secondly, we would like to know whether the retrospective correction really has a positive effect on the accuracy of the predictions, and if so, for which approaches. Finally, we would like to compare the performance of the adaptive strategies on the synthetic data with that on the real-world datasets. To answer the first two questions we compare the results from Sections 5.3-5.6 in Table 3, summing up the number of cases one approach was better or worse than the other across all of the algorithms, batch sizes and base learners (equal performance is represented by 0.5 in both the "Better" and "Worse" columns). In comparison to Custom, XVSelect and XVSelect+RC have better accuracy in most experiments, often with a significant difference. In these comparisons XVSelect and XVSelect+RC show similar results. Both XVSelect and XVSelect+RC perform on average worse than BestAM; however, for XVSelect, the performance is comparable (not significantly worse in the majority of cases). Furthermore, we consider the effects of RC separately for each approach, as it has been shown that they can differ. For XVSelect, deploying RC seems not to have a critical effect. The positive effect of RC is more apparent for the Custom strategy. For BestAM it should be noted that the best AM can differ depending on the dataset, and even for the same dataset it is not necessarily the case that BestAM and BestAM+RC will be based on the same AM. However, we can still say that if the best performing AM is known, the deployment of RC is likely to have a negative effect on the accuracy.
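The pairwise tallying used to build Table 3 can be expressed compactly. The helper below is our own sketch, with ties contributing 0.5 to each column as described above.

```python
def tally(results_a, results_b, higher_is_better=True):
    """Count, over paired experiments, how often approach A was better or
    worse than approach B; equal results add 0.5 to both columns
    (as in the "Better"/"Worse" columns of Table 3)."""
    better = worse = 0.0
    for a, b in zip(results_a, results_b):
        if a == b:
            better += 0.5
            worse += 0.5
        elif (a > b) == higher_is_better:
            better += 1.0
        else:
            worse += 1.0
    return better, worse
```

The `higher_is_better` flag lets the same helper serve accuracy (higher is better) and MAE (lower is better) comparisons.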
Finally, to evaluate the performance of XVSelect and XVSelect+RC on the synthetic vs. real-world data, we have compared the results on these datasets separately, across all of the algorithms and settings, using Nemenyi plots. The proposed approaches fare comparatively better on the real-world data, which we attribute to the more complicated nature of these datasets, where there may not exist a single AM that markedly optimises the performance, an observation in line with our earlier findings from (Bakirov et al., 2015). The performance of XVSelect and XVSelect+RC is comparatively worse on the synthetic data, which may be simple enough for a single-AM-based adaptive strategy to deliver good results. Even in this case, however, these two approaches outperform Custom with a significant difference.

Runtime analysis
We proceed with the analysis of the runtime performance of our approaches. First, we note that under the assumption that the processing time for every batch, including the prediction, adaptation and accuracy/error calculation, is bounded by some constant, which is the case for all of the algorithms we consider, the runtime complexity of any custom adaptive algorithm is O(n), where n is the number of batches. In this case, the runtime complexity of XVSelect is O(|G|qn), where |G| is the number of available AMs and q is the number of cross-validation folds, as for every batch every AM is used with q-fold cross-validation. Retrospective correction has a complexity of O(|G|n), as for every batch every AM is used once. Thus, XVSelect+RC has a complexity of O(|G|²qn). Since |G| and q are constants, it follows that O(|G|n) ∼ O(|G|qn) ∼ O(|G|²qn) ∼ O(n), hence all the proposed methods are of the same order of runtime complexity as the custom strategies. For the empirical runtime evaluation, we compare the performance of the XVSelect, XVSelect+RC, Custom and Custom+RC strategies on classification dataset #28 (Power Italy) for different classification algorithms with n = 50 and the NB base learner in Table 4, initially without using any parallel processing. This dataset was chosen as it is a relatively large real-world dataset. The results show that the performance of our methods varies greatly depending on the algorithm; e.g. for XVSelect+RC with 2-fold cross-validation, bPL adaptation has the fastest relative average batch processing time (only 2.69 times higher than Custom), whereas bDWM adaptation has the slowest (110.73 times higher than Custom).
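The constants hidden in these O(n) bounds can be made explicit with a small helper (illustrative only; it simply counts the AM deployments per batch implied by the complexity expressions above).

```python
def am_deployments_per_batch(strategy, n_ams, q):
    """AM deployments per batch for each strategy: Custom runs one AM,
    RC re-runs every AM once (|G|), XVSelect cross-validates every AM
    (|G|*q), and XVSelect+RC combines both (|G|^2 * q)."""
    counts = {
        "Custom": 1,
        "RC": n_ams,
        "XVSelect": n_ams * q,
        "XVSelect+RC": n_ams ** 2 * q,
    }
    return counts[strategy]
```

For bDWM (8 AMs) with 2-fold cross-validation this gives 16 deployments per batch for XVSelect versus 1 for Custom, consistent with the order-of-magnitude runtime differences reported in Table 4.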
The differences in performance are explained by the internal characteristics of the algorithms. The batch processing time for XVSelect and XVSelect+RC is proportional to the batch runtimes with single AMs (e.g. when using the Custom strategy). The longer batch runtimes are further extended by the cross-validation and retrospective correction. Therefore, XVSelect and XVSelect+RC for bDWM, which has 8 AMs and can have about 20 active experts at the same time, have a much higher relative batch runtime than for bPL, which has only three AMs and two experts. Another interesting observation is that RC does not always increase the batch processing time, as seen in the example of bPL, which inherently deploys all of the AMs even without RC. This is also the case for bDWM with XVSelect and XVSelect+RC, where it may be attributed to the AMs deployed by the XVSelect+RC strategy (e.g. fewer deployments of expert-creation AMs, which notably slow the model down).
The batch processing runtime can be improved by applying parallel processing, as both cross-validatory selection and retrospective correction are embarrassingly parallel operations. Fully parallelising the adaptive strategy, however, requires |G|q available threads, which can be prohibitive. Even a fully parallel implementation may not be as efficient as the custom strategy, because the choice of the AM can affect the performance on the subsequent batches. This can again be seen in the example of expert-creation AMs.
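A minimal sketch of the parallelised selection step is given below (assumed names, not the paper's code): since the |G|*q candidate adaptations are independent, they can simply be dispatched to a thread pool.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def xv_select_parallel(ams, X, y, q=2, max_workers=None):
    """Evaluate all (AM, fold) pairs concurrently and return the index of
    the AM with the lowest cross-validated MAE.  `ams` are functions
    (X_train, y_train) -> predictor.  Illustrative sketch only."""
    idx = np.arange(len(X))
    folds = [(np.setdiff1d(idx, te), te) for te in np.array_split(idx, q)]
    def mae(job):
        am, (tr, te) = job
        predictor = am(X[tr], y[tr])
        return np.mean(np.abs(predictor(X[te]) - y[te]))
    jobs = [(am, f) for am in ams for f in folds]     # |G|*q independent tasks
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        errs = np.fromiter(pool.map(mae, jobs), dtype=float).reshape(len(ams), q)
    return int(np.argmin(errs.mean(axis=1)))
```

In practice, thread-based parallelism pays off only when a single AM deployment is expensive relative to the dispatch overhead, which matches the bDWM Zero observations below.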
To illustrate these points a further experiment was undertaken, in which two modifications of bDWM are proposed. The first one, bDWM Lite, starts with two experts and includes only two AMs, DAM4 (weights update, experts pruning and batch learning) and DAM7 (weights update, experts pruning, batch learning and expert creation) instead of the original 8, which still allows the Custom strategy to be run. bDWM Lite allows us to test the fully parallel implementation, as it requires only 4 threads. The second modification, bDWM Zero, mimics bDWM Lite and in addition limits the ensemble to only two experts. This prevents the performance degradation caused by expert creation. We experiment with XVSelect+RC with 2-fold cross-validation and two parallelisation choices: cross-validation (XV) parallelisation, where parallel processing is applied to the cross-validation only, and full parallelisation, where in addition to the cross-validation, the retrospective correction is also run in parallel. Figure 11 shows the average batch runtimes over the whole dataset. Even without parallelisation, simply reducing the number of AMs from 8 to 2 (bDWM Lite) results in a performance increase by a factor of 6, while parallelisation increases it even further. Limiting the number of experts further reduces the average batch runtime to only 3 times that of Custom. Note that for bDWM Zero the parallelisation does not decrease the runtimes by much and that the full parallelisation does not outperform the XV-only parallelisation. This can be attributed to the already reduced runtime due to the limited number of experts and to the parallel processing overhead, which negates the increase in performance. Further insights are given in Figure 12. It can be seen that for bDWM Lite, the average runtime per batch increases as batches come in, due to the increase in experts, but gradually flattens as the number of experts stabilises around 20.
Conversely, for bDWM Zero, the runtime per batch is stable from the start.

Conclusions
To confirm our assumptions, we have empirically investigated the merit of the automated adaptation strategies XVSelect and XVSelect+RC. For this purpose we have conducted experiments on 10 real and 26 synthetic datasets, exhibiting various levels of adaptation need.
The results are promising: for the majority of these datasets, the proposed automated approaches were able to demonstrate performance comparable to or better than that of specifically designed custom algorithms and the repeated deployment of any single adaptive mechanism. It is not, however, the goal of this paper to replace existing custom strategies with the proposed ones. We rather see the benefit of the proposed strategies in their applicability to all algorithms with multiple adaptive mechanisms, so that the designer of the algorithm does not need to spend time and effort developing a custom adaptive strategy. We have analysed the cases where the proposed strategies performed relatively poorly. It is postulated that the reasons for these cases were: a) a lack of change/need for adaptation; b) insufficient data in a batch; and c) relatively simple datasets, all of which have trivial solutions. We have also identified that the choice of algorithm and base learner can affect the performance of the proposed strategies.
A benefit of the proposed generic automated adaptation strategies is that they can help designers of machine learning solutions save time by not having to devise a custom adaptive strategy. XVSelect and XVSelect+RC are essentially parameter-free, except for the number of cross-validation folds, whose choice is trivial.
Naturally, the described strategies come at some cost in runtime performance. This cost varies between algorithms and depends on the number of AMs and other factors, such as the number of experts. The runtimes can be reduced by parallelising the cross-validatory selection and retrospective correction. It is also conceivable that throughput requirements are lower in the batch learning scenario, as the data is passed to the model only after the whole batch has been accumulated.

Future Work
This research has focused on the batch scenario. Adapting the introduced automated adaptive strategies to the incremental learning scenario remains a future research question. In that case the absence of batches would, for example, pose the question of data selection for cross-validation. This could be addressed using data windows of static or dynamically changing size. Using an alternative to cross-validation could be another solution. Another useful scope of research is the semi-supervised scenario, where true values or labels are not always available. This is relevant for many applications, among them those in the process industry.
A dimension which may require more attention is the further improvement of the runtime performance of the proposed approaches. An obvious first step in this direction is discarding the less useful AMs, such as "do nothing".
Further research directions include the theoretical analysis of this line of research, where relevant expert/bandit strategies may be useful, as well as experiments with other ML tasks such as time series prediction, clustering and recommender systems. Finally, as we have observed some discrepancies in the performance of the proposed approaches across algorithms/datasets/base learners, a natural research direction is to investigate the reasons for these discrepancies. This would also include experimentation with different base learners.
In general, there is a rising tendency towards modular systems for the construction of machine learning solutions, where adaptive mechanisms are considered separate entities, along with pre-processing and predictive techniques. One of the features of such systems is easy, and often automated, plug-and-play machine learning (Kadlec and Gabrys, 2009; Kedziora et al., 2020). The generic automated adaptive strategies introduced in this paper further contribute towards this automation.

Table 5: Real-world regression datasets used in experiments.
1. Highly volatile simulation (real conditions based) of catalyst activation in a multi-tube reactor. The task is the prediction of catalyst activity; inputs are flows, concentrations and temperatures (Strackeljan, 2006).
2. Thermal oxidiser (2820 instances, 36 features): prediction of NOx exhaust gas concentration during an industrial process; moderately volatile. Input features include concentrations, flows, pressures and temperatures (Kadlec and Gabrys, 2009).
3. Industrial drier (1219 instances, 16 features): prediction of the residual humidity of the process product; relatively stable. Input features include temperatures, pressures and humidities (Kadlec and Gabrys, 2009).
4. Debutaniser column (2394 instances, 7 features): prediction of the butane concentration at the output of the column. Input features are temperatures, pressures and flows (Fortuna et al., 2005).
5. Sulfur recovery (10081 instances, 6 features): prediction of SO2 in the output of a sulfur recovery unit. Input features are gas and air flow measurements (Fortuna et al., 2003).

Table 6: Real-world classification datasets used in experiments.
- Widely used concept drift benchmark dataset thought to have seasonal and other changes as well as noise. The task is the prediction of whether the electricity price rises or falls; inputs are days of the week, times of the day and electricity demands (Harries, 1999).
- Power Italy (#28; 4489 instances, 2 features, 4 classes): prediction of the hour of the day (03:00, 10:00, 17:00 and 21:00) based on supplied and transferred power measured in Italy (Zhu, 2010; Chen et al., 2015).
- Contraceptive (#29; 4419 instances, 9 features, 3 classes): Contraceptive dataset from the UCI repository (Newman et al., 1998) with artificially added drift (Minku et al., 2010).
- Iris (#30; 450 instances, 4 features, 4 classes): Iris dataset (Anderson, 1936; Fisher, 1936) with artificially added drift (Minku et al., 2010).
- Yeast (#31; 5928 instances, 8 features, 10 classes): Yeast dataset from the UCI repository (Newman et al., 1998) with artificially added drift (Minku et al., 2010).

Table 7: Synthetic classification datasets used in experiments, with N instances and C classes, from (Bakirov and Gabrys, 2013). The "Drift" column specifies the number of drifts/changes in the data, the percentage of change in the decision boundary and its type. All datasets have 2 input features.