# Controlled permutations for testing adaptive learning models


## Abstract

We study evaluation of supervised learning models that adapt to changing data distribution over time (concept drift). The standard testing procedure that simulates online arrival of data (test-then-train) may not be sufficient to generalize about the performance, since a single test shows only how well a model adapts to one fixed configuration of changes, while the ultimate goal is to assess the adaptation to changes that happen unexpectedly. We propose a methodology for obtaining datasets for multiple tests by permuting the order of the original data. A random permutation is not suitable, as it makes the data distribution uniform over time and destroys the adaptive learning task. Therefore, we propose three controlled permutation techniques that make it possible to acquire new datasets by introducing restricted variations in the order of examples. The control mechanisms with theoretical guarantees of preserving distributions ensure that the new sets represent close variations of the original learning task. Complementary tests on such sets make it possible to analyze the sensitivity of the performance to variations in how changes happen, and in this way they enrich the assessment of adaptive supervised learning models.

### Keywords

Concept drift · Evaluation · Data streams · Permutations

## 1 Introduction

Changing distribution of data over time (concept drift [24]) is one of the major challenges for data mining applications, including marketing, financial analysis, recommender systems, spam categorization and more. As data arrives and evolves over time, constant manual adjustment of models is inefficient and, with increasing amounts of data, quickly becomes infeasible. In such situations, decision models need to have mechanisms to update or retrain themselves using recent data; otherwise, their accuracy will degrade. Research attention to such supervised learning scenarios has been rapidly increasing in the last decade; many adaptive learning models for massive data streams and for smaller-scale sequential learning problems have been developed (e.g., [6, 14, 15, 16, 26, 28]). Evaluation of adaptive models requires specific testing procedures. Since data are not uniformly distributed over time, testing needs to take into account the sequential order of data. A standard procedure for that is the test-then-train (or prequential) procedure [5], which mimics online learning. Given a sequential dataset, every example is first used for testing and then for updating the model. Suppose \((x_1,x_2,x_3,x_4)\) is our dataset. We train a model with \(x_1\). Next, the model is tested with \(x_2\) and the training set is augmented with \(x_2\). Next, we test on \(x_3\) and update the model with \(x_3\). Finally, we test on \(x_4\).
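The test-then-train loop described above can be sketched as follows. The `predict`/`update` interface and the `MajorityClass` baseline are illustrative assumptions, not models from the paper:

```python
# Sketch of the test-then-train (prequential) loop: every example is
# first used for testing and then for updating the model.
def prequential_accuracy(model, stream):
    """stream: iterable of (x, y) pairs in time order."""
    correct, tested = 0, 0
    for i, (x, y) in enumerate(stream):
        if i > 0:                        # the very first example only trains
            correct += int(model.predict(x) == y)
            tested += 1
        model.update(x, y)               # train on the example after testing
    return correct / max(tested, 1)

class MajorityClass:
    """Trivial baseline: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
```

On the four-example stream from the text, labeled ('a', 'a', 'b', 'a'), this baseline is tested on the last three examples and is correct on two of them, giving accuracy 2/3.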

The limitation of this evaluation is that it processes a dataset only once, in the fixed sequential order. The positions where and how changes happen remain fixed; thus, a single test concludes how well a model would adapt to this fixed configuration of changes. In contrast, the ultimate goal is to assess the performance on a given task online while data evolves unexpectedly. Thus, the results of *a single test* may be insufficient to conclude about the generalization performance of an adaptive model. The problem is particularly serious when several predictive models are compared, since, as we will see in the next section, the winning strategy may depend on the order of data.

Multiple tests with close variations of the original dataset could make evaluation more confident. The results from multiple tests could be used for assessing the stability of the model performance by estimating the variance of the accuracies. Multiple tests could also be used as validation sets for tuning the model parameters. However, for such an evaluation to be reliable, we need to ensure that the multiple test sets do not deviate much from the original learning problem. This study presents a methodology for generating such sets.

We propose to construct multiple test sets by permuting the data order in a controlled way to preserve local distributions, which means that the examples that were originally near to each other in time need to remain close after a permutation. We present three permutation techniques that are theoretically restricted to keep examples together. The permuted datasets can be considered semi-synthetic: while the order of the data is gently modified, the data itself is not altered. Each permutation technique creates a different type of change: sudden, re-occurring or gradual.

Our study is novel in identifying the risks of a single test, in forming a measure between the original test set and its variations and in proposing a theoretically founded way to form multiple test sets. The study contributes to the methodology of evaluating the performance of adaptive supervised learning models in online scenarios. As a result, it becomes possible to complement evaluation with an assessment of the sensitivity of the performance and also to obtain validation sets for parameter tuning. A short version of this study was published as a conference paper [29] at Discovery Science 2011.

The paper is organized as follows. In Sect. 2, we discuss evaluation of adaptive models. In Sect. 3, we present our permutations. Section 4 presents the theoretical guarantees for controlling the permutations. Section 5 experimentally demonstrates how our permutations aid in evaluating adaptive models. Section 6 discusses related work. Section 7 concludes the study.

## 2 Evaluation of adaptive learning models

The problem setting of adaptive supervised learning is as follows. Our datasets are composed of examples ordered in time; each example is described by the values of a fixed set of input variables and a target variable (the label), which the model needs to predict. Data distribution is expected to change over time. Learning models have mechanisms to *adapt* online by taking into account the new incoming examples. *Prediction accuracy* is the primary measure of the performance.

The test-then-train [12] is a standard procedure for evaluating the performance of adaptive learning models. This procedure restricts evaluation to a single run with a fixed configuration of changes in data. While different learning models have different adaptation rates, the results on a fixed test snapshot with a few changes may not be sufficient to generalize how this adaptive model would perform online on a given problem.

Evaluating adaptive models differs from evaluating learning models in the stationary setting in several key aspects. Firstly, in the stationary setting, all data are assumed to originate from the same underlying distribution, and thus, examples are exchangeable. Testing sets can be formed by combining any randomly chosen examples from a dataset; we can use ten-fold or leave-one-out cross-validation procedures or bootstrapping [25]. Adaptive models need to be tested on sequential data, where a sequence includes different distributions and examples generally are not exchangeable; hence, when forming test sets, we need to respect the order of data. Moreover, in the stationary setting, we aim at testing *the predictive power* of a model, and we measure it by the testing accuracy. In evolving settings, we aim at testing *the predictive power* as well as *the ability to adapt*, and we measure both by the same testing accuracy. One change in the data distribution over time can be considered as one testing instance for the ability to adapt. The data stream snapshots at hand may contain only a few changes. In such a case, the evaluation of the ability to adapt will be biased toward the order and timing of changes in this data.

This problem of evaluation could be alleviated if we had a very long sequence with a few distributions and many changes, or a sequence undergoing continuous (possibly gradual) change. Unfortunately, real sequential datasets often include only a few distribution changes, even though they contain plenty of examples (data streams in particular). For example, changes in sales due to seasonality, changes in suppliers or the production process, or new adversary activities in credit card usage do not happen very frequently. Changes in the economic situation or in personal interests can be even slower than yearly. Thus, the validity of conclusions from a single-run evaluation may be limited to this particular order of data, while the goal of evaluation is to generalize about adaptivity when changes happen unexpectedly, in fact for any change pattern in a given application.

## 3 Proposed permutations

We aim at permuting a dataset so that the permuted sets closely resemble the original learning problem, yet provide meaningful additional test data. We require that the examples that were near in time in the original sequence remain close to each other in the permuted sequence. We propose three controlled permutation techniques that are aimed at modifying the configuration of changes in a dataset. Our permutations are non-intrusive to the original data, as they do not need to detect or model changes in the instance space; they only manipulate the positions of examples in a sequence. Permutations will not help in situations where the available data snapshot misses some of the distributions that may happen in reality for a given application. However, our permutations will help to model various transitions and changes across the existing distributions.

The proposed permutations differ in assuming which examples in a dataset are exchangeable and highlight particular types of changes. The *time permutation* forces sudden drifts by shifting blocks of data. This permutation assumes that blocks of data are exchangeable with each other, as they represent different distributions of the original data and these distributions may occur in any order in an arbitrary data snapshot. The *speed permutation* introduces re-occurring concepts by shifting a set of examples to the end of the sequence, aiming at varying the speed of changes and modeling recurring concepts. It assumes that the distributions observed in the past may re-occur. The *shape permutation* forces gradual drifts by perturbing examples within their close neighborhood. It assumes that the examples that arrive at similar time are exchangeable with each other.

We study permutations in the following formal setting. Given is a sequential dataset consisting of \(n\) examples, an \(n\)*-sequence*. Each example has an index, which indicates its position in time; we will permute these indices. Let \(\Omega _n\) be the space of all permutations of the integers \(\{1,2,\ldots ,n\}\). The *identity* sequence \(I_{(n)}\) is an \(n\)-sequence where the indices are in order \((1,2,\ldots ,n)\).

Let \(J = (j_1,\ldots ,j_n)\) be a permutation from \(\Omega _n\). Consider a permutation function \(\pi \) such that \(\pi (m)=j_m\) and \(\pi ^{-1}(j_m)=m\). Here, \(j_m\) is the original index of the example, which is now in the position \(m\). For example, let the original sequence be \(I_{(3)} = (1,2,3)\). If \(\pi (1)=2, \,\pi (2)=3\) and \(\pi (3)=1\), then \(J=(2,3,1)\). In this case \(\pi ^{-1}(1)=3, \,\pi ^{-1}(2)=1\) and \(\pi ^{-1}(3)=2\).
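The mapping between a permutation \(J\), \(\pi\) and \(\pi^{-1}\) can be sketched as follows (the function name is illustrative):

```python
# J lists the original indices in their new order, so pi(m) = j_m is the
# original index of the example now at position m, and pi_inv(j) recovers
# the new position of original index j.
def make_pi(J):
    pi = {m: j for m, j in enumerate(J, start=1)}
    pi_inv = {j: m for m, j in pi.items()}
    return pi, pi_inv
```

For the example in the text, `make_pi((2, 3, 1))` gives `pi = {1: 2, 2: 3, 3: 1}` and `pi_inv = {1: 3, 2: 1, 3: 2}`.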

### 3.1 The time permutation
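The body of this subsection is abridged here. A sketch consistent with the overhand-shuffle analogy and with the proof of Proposition 4 (an assumption, not the paper's reference implementation) is: split the identity sequence at each of the \(n-1\) positions independently with probability \(p\) and reverse the order of the resulting blocks, so that blocks of data shift as units and create sudden drifts.

```python
import random

# Hedged sketch of the time permutation: split with probability p after
# each index, then reverse the block order; blocks stay intact.
def time_permutation(n, p, rng=random):
    blocks, start = [], 1
    for i in range(1, n):                # candidate split after index i
        if rng.random() < p:
            blocks.append(list(range(start, i + 1)))
            start = i + 1
    blocks.append(list(range(start, n + 1)))
    blocks.reverse()                     # blocks swap order: sudden drifts
    return [j for block in blocks for j in block]
```

Under this reading, the average neighbor distance of the result stays close to the theoretical value \(3-2p\) of Proposition 3 for large \(n\).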

### 3.2 The speed permutation

For each example, we determine at random with a probability \(p\) whether or not it will be lifted. Then, the lifted examples are moved to the end of the sequence keeping their original order, as illustrated in Fig. 2 (center). The procedure is given in Algorithm 2. The parameter \(p\) varies the extent of the permutation.

### 3.3 The shape permutation

The proposed permutation procedures are inspired by physical analogs in card shuffling. We model the time permutation as the overhand card shuffle [18], the speed permutation as the inverse riffle shuffle [1] and the shape permutation as a transposition shuffle [1].
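Based on the transposition-shuffle analogy and the proof sketch of Proposition 8, the shape permutation can be sketched as follows (an assumption, since the body of Sect. 3.3 is abridged here): each of the \(k\) iterations swaps one randomly chosen pair of adjacent elements, so examples are perturbed only within their close neighborhood.

```python
import random

# Hedged sketch of the shape permutation: k adjacent transpositions.
def shape_permutation(n, k, rng=random):
    J = list(range(1, n + 1))
    for _ in range(k):
        i = rng.randrange(n - 1)         # left element of the swapped pair
        J[i], J[i + 1] = J[i + 1], J[i]
    return J
```

Since one iteration makes only one edit operation, many iterations (e.g., \(k=n\)) are needed to perturb the order noticeably, in line with the discussion before Proposition 8.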

## 4 Controlling the permutations

Our next step is to define a measure capturing to what extent a permutation preserves the original distributions. With this measure, we can theoretically justify that our permutations perturb the original order to a controlled extent staying far from random permutations. This ensures that we do not lose varying data distributions, and yet significantly perturb the data, as desired.

### 4.1 Measuring the extent of permutations

A number of distance measures between two permutations exist [8, 21, 22]. They count editing operations (e.g., insert, swap, reverse) needed to arrive from one permutation at the other. Such distances are not suitable for our purpose, as they measure *absolute* change in the position of an example, while we need to measure a *relative* change. We are interested in measuring how well local distributions of data are preserved after a permutation. Preserving local distributions means that the examples that were originally near to each other in time need to remain close after a permutation. Thus, instead of measuring how far the examples have shifted, we need to measure how far they have moved from each other. If the examples move together, then local distributions are preserved.

To illustrate the requirement, consider a sequence of eight examples \((1,2,3,4,5,6,7,8)\). The measure should treat the permutation \((5,6,7,8,1,2,3,4)\) as being close to the original order. The local distributions \((1,2,3,4)\) and \((5,6,7,8)\) are preserved while the blocks are swapped. The permutation \((8,7,6,5,4,3,2,1)\) needs to be very close to the original. Although the global order has changed completely, every example locally has the same neighbors. In contrast, the permutation \((1,8,3,6,5,4,7,2)\) needs to be very distant from the original. Although half of the examples globally did not move, the neighbors are mixed and the local distributions are completely destroyed. To capture the local distribution aspects after a permutation, we introduce *the neighbor distance measure*.

**Definition 1**

The total neighbor distance (TND) between the original sequence and its permutation is defined as \(D = \sum _{i=1}^{n-1} |j_i - j_{i+1}|\), where \(j_i\) is the original position of an example that is now in the position \(i\), or \(j_i=\pi ^{-1}(i)\).

For example, the total neighbor distance between the identity permutation and \(J=(1,3,2,4)\) is \(D(J)=|1-3| + |3-2| + |2-4|= 5\).

**Definition 2**

The average neighbor distance (AND) between the original sequence and its permutation is the total neighbor distance divided by the number of adjacent pairs \(d = D/(n-1)\).

In our example, \(D(J)=5\) and \(n=4\); hence, \(d(J)=5/3\).
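Definitions 1 and 2 translate directly into code (function names are illustrative):

```python
# Total neighbor distance (Definition 1): sum of absolute differences
# between adjacent original indices in the permuted sequence.
def total_neighbor_distance(J):
    return sum(abs(a - b) for a, b in zip(J, J[1:]))

# Average neighbor distance (Definition 2): TND over the n-1 adjacent pairs.
def average_neighbor_distance(J):
    return total_neighbor_distance(J) / (len(J) - 1)
```

This reproduces the worked example, \(D((1,3,2,4))=5\) and \(d=5/3\), and also the intuition from Sect. 4.1: the full reversal \((8,7,\ldots,1)\) has \(d=1\), the same as the identity, because every example keeps its neighbors.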

In our definition, neither TND nor AND is a metric; they are aggregations of the distances between adjacent examples (which are metrics). Although these measures could be normalized to the interval \([0,1]\), we keep them in this simple form, since we are interested in differences rather than absolute values.

For comparison, the main existing distance measures between permutations are as follows (up to constants, denoted \(\sim\)):

- *Kendall distance* counts the number of swaps of neighboring elements needed to get from one permutation to the other: \(d_K \sim \sum _{i=1}^{n-1} \sum _{j=i+1}^n \mathbf{1}(\pi (j) < \pi (i))\).
- *Precede distance* counts the number of times the elements precede each other; up to a constant, it is the same as the Kendall distance: \(d_C \sim d_K\).
- *Spearman rank correlation* aggregates the squared differences between the positions of the same element in two permutations: \(d_S \sim \sum _{i=1}^n (\pi (i) - i)^2\).
- *Position distance* sums the absolute differences between the positions of the elements: \(d_P \sim \sum _{i=1}^n |\pi (i) - i|\).
- *Adjacent distance* counts the number of elements that neighbor each other in both permutations: \(d_A \sim -\sum _{i=1}^{n-1} \sum _{j=i+1}^n \mathbf{1} (|\pi (j) - \pi (i)|=1)\).
- *Exchange distance* counts the exchange operations of two elements needed to get from one permutation to the other: \(d_E \sim \sum _{i=1}^n \mathbf{1} (\pi ^*(i) \ne i)\), where \(\pi ^*\) changes as the exchange operations proceed.
- *Hamming distance* counts the number of elements in the same positions: \(d_H \sim \sum _{i=1}^n \mathbf{1} (\pi (i) = i)\).
- *Rising sequences* count the number of increasing subsequences: \(d_R \sim -\sum _{i=1}^{n-1} \mathbf{1} (\pi ^{-1}(i)>\pi ^{-1}(i+1))\).
- *Runs* count the number of increasing index subsequences: \(d_U \sim -\sum _{i=1}^{n-1} \mathbf{1} (\pi (i)>\pi (i+1))\).

None of these measures captures how *strong* a permutation is in destroying neighborhoods, while our measure does that.

### 4.2 The theoretical extent of our permutations

As we defined how to measure the extent of permutations, we can find theoretical expressions of that measure for our permutations.

**Proposition 3**

The time permutation tends to \(\lim _{n \rightarrow \infty } E(d) = 3-2p <3\), for fixed \(p \in (0,1)\).

We prove Proposition 3 in two stages. First, we assume that the number of splits is fixed, and then, we generalize to a random number of splits.

**Proposition 4**

For the time permutation with a fixed number of splits \(k\), the expected total neighbor distance is \(E(D_k) = n - 1 + 2(n-k) - \frac{2n}{k+1}\).

*Proof (of Proposition 4)*

\(E(D_k) = D(I) - k + E(H)\): starting from the identity permutation with \(D(I)\), subtract \(k\) due to the splits and add \(H\) due to the concatenation after the splits. Denote the positions of the splits in ascending order as \(k_1,k_2,\ldots ,k_k\). The total neighbor distance of the identity sequence is \(D(I_{(n)})=|1-2|+|2-3|+\cdots + |n-1 - n| = n-1\). \(H\) can be expressed as \(H = k_n - (k_{k-1} + 1) + k_k - (k_{k-2} + 1) + \cdots + k_2 - (k_0 + 1)\), where \(k_n=n\) and \(k_0=0\). After canceling out, we get \(H = n + k_k - k_1 - k\) and \(E(H) = n + E(k_k - k_1) - k\). There are \(n-1\) possible positions for the \(k\) splits. It is easy to show that the expected difference between the maximum and the minimum element of a uniformly random \(k\)-subset of \(\{1,\ldots,n-1\}\) is \(E(k_k - k_1) = n\frac{k-1}{k+1}\). Thus, \(E(H) = n - k + n\frac{k-1}{k+1}\), and \(E(D_k) = n - 1 - k + n - k + n\frac{k-1}{k+1}\). \(\square \)

*Proof (of Proposition 3)*

Now, \(k\) is a random variable following the binomial distribution \(k\sim \mathcal{B}(n,p)\); thus, the expected number of splits is \(E(k) = np\). From Proposition 4, we get \(E(D) = n- 1 + 2(n-E(k)) - \frac{2n}{E(k)+1}= n - 1 + 2np\frac{(n-np-1)}{np+1}\), and \(E(d)=E(D)/(n-1) = 1 + 2np\frac{(n-np-1)}{(np+1)(n-1)}\). \(\square \)

**Proposition 5**

The speed permutation tends to \(\lim _{n \rightarrow \infty } E(d) = 3\).

To prove Proposition 5, we need to find TND of a rising sequence.

**Definition 6**

A rising sequence is a sequence \(M=(j_1,j_2,\ldots ,j_m)\), where \(j_1<j_2<\cdots <j_m\).

**Proposition 7**

The total neighbor distance \(D\) of a rising sequence \(M\) is \(D(M) = j_m - j_1\).

*Proof (of Proposition 7)*

\(D(M) = \sum _{i=1}^{m-1} |j_{i+1} - j_i| = \sum _{i=1}^{m-1} (j_{i+1} - j_i) = j_2 - j_1 + j_3 - j_2 + \cdots + j_{m-1} - j_{m-2} + j_m - j_{m-1} = j_m - j_1\). \(\square \)

*Proof (of Proposition 5)*

A lift forms two rising sequences. We denote the lifted subsequence \(l_1,l_2,\ldots ,l_L\) and the remaining subsequence \(z_1,z_2,\ldots ,z_Z\). Since the starting sequence is in the identity order, both the lifted and the remaining subsequences are rising. The total neighbor distance after the permutation is \(E(D) = E(z_Z-z_1 + l_L - l_1 + z_Z - l_1) = 2E(z_Z) - E(z_1) + E(l_L) - 2E(l_1)\), which is the sum of the neighbor distances in the two subsequences plus their concatenation. The lifted subsequence can start with '1' with a probability \(p\), or it can start with '2' with a probability \((1-p)p\), given that '1' was not lifted, and so on. \(E(l_1) = 1p + 2(1-p)p + 3(1-p)^2p + \cdots + n(1-p)^{n-1}p \approx \frac{p}{p^2} = \frac{1}{p}\) and \(E(z_1) \approx \frac{1}{1-p}\). These sums use the identity \(\sum _{j=1}^n jp^j \approx \frac{p}{(1 - p)^2}\), which is straightforward to verify by decomposing the sum into geometric progressions. Similarly, \(E(l_L) = np + (n-1)(1-p)p + (n-2)(1-p)^2p + \cdots + (1-p)^{n-1}p \approx p\frac{np-1+ p}{p^2} = n+1-\frac{1}{p}\) and \(E(z_Z) \approx n+1-\frac{1}{1-p}\). These sums use the identity \(\sum _{j=0}^n (n-j)p^j \approx \frac{n(1 - p) - p}{(1-p)^2}\). With the terms in place, we get \(E(D) \approx 3(n+1) - \frac{3}{p(1-p)}\) and \(E(d) = \frac{E(D)}{n-1} = \frac{3(n+1)}{n-1} - \frac{3}{p(1-p)(n-1)}\). \(\square \)

For the time and the speed permutations, one iteration results in many edit operations in a sequence. Examples mix fast. In contrast, in the shape permutation, one iteration makes one edit operation; thus, examples mix slowly; hence, we need more than one iteration to perturb the order. Thus, AND of the shape permutation is also a function of the number of iterations \(k\).

**Proposition 8**

The shape permutation with \(k \le 2n\) iterations satisfies, in the prudent theoretical limit (ignoring the minus terms), \(\lim _{n \rightarrow \infty } E(d) < 5\). Note that for \(k < n\) we have \(E(d) < 3\), and for fixed \(k\), \(\lim _{n\rightarrow \infty } E(d) = 1\).

*Proof (sketch of Proposition 8)*

Denote the expected total neighbor distance after \(k\) iterations of the shape permutation as \(E(D_k)\). As we start from the identity sequence, \(E(D_0) = n-1\) and \(E(d_0) = 1\). After one iteration, \(E(D_1)=E(D_0)+2\frac{n-3}{n-1}+1\frac{2}{n-1}\): an interior transposition increases the distance by \(2\) and happens with probability \(\frac{n-3}{n-1}\), while a transposition at either end increases it by \(1\) and happens with probability \(\frac{2}{n-1}\). After two iterations, \(E(D_2) < E(D_1) + 2\frac{n-6}{n-1}+1\frac{2}{n-1} + 1\frac{2}{n-1} - 1\frac{1}{n-1} < E(D_1) + 2\). The first inequality appears because, at the start and at the end of the sequence, the examples have one neighbor instead of two, while we treat them as having two neighbors. After \(k\) iterations, \(E(D_k) < E(D_{k-1}) + 2 \approx E(D_0) + 2k\); hence, \(E(d_k) < 1 + \frac{2k}{n-1}\). \(\square \)

In order to assess how far a permutation is from random, we need the minimum possible AND and the expected AND of a random permutation.

**Proposition 9**

The minimum average neighbor distance \(d_{min}\) of a permutation of an \(n\)-sequence is \(d_{min} = 1\).

*Proof (of Proposition 9)*

A sequence of length \(n\) contains \(n-1\) adjacent pairs. Since there are no equal indices in the sequence, the distance between any two adjacent indices cannot be less than \(1\). In the identity sequence (\(1,2,3,\ldots , n\)), the distances between all the adjacent neighbors are equal to \(1\). Thus, \(d_{min} = (n-1)|1|/(n-1) = 1\). \(\square \)

**Proposition 10**

The expected average neighbor distance \(d\) of a random permutation of an \(n\)-sequence is \(E(d) = (n+1)/3\).

*Proof (of Proposition 10)*

When permuting at random, any permutation \(J\in \Omega _n\) is equally likely. Let \(J=(j_1,j_2,\ldots ,j_n)\) be a permutation. Since TND is a simple sum of pairwise distances, the expected neighbor distance of a random permutation reduces to the expected distance between two randomly chosen indices in a sequence: \(E(d) = E(|j_i - j_{i+1}|)\). We find the expected value as an average over all possible combinations \(E(|j_i - j_{i+1}|) = \sum _{u=1}^n \sum _{v=u+1}^n (v-u) / \binom{n}{2}\). The components of the numerator can be expressed as a triangular matrix \(T_{n-1 \times n-1}\), where the elements \(t_{ij}=i\) for \(j\le n-i\) and \(0\) otherwise. It can be shown that the sum of all the elements is \(S^T_n= \frac{1}{6}n(n+1)(n+2)\). Using this expression, we get \(E(|j_i - j_{i+1}|) = S^T_{n-1}/\binom{n}{2} = \frac{n+1}{3}\). \(\square \)
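Proposition 10 is easy to check numerically with a small Monte Carlo simulation (the function name is illustrative):

```python
import random

# Monte Carlo check of Proposition 10: the average neighbor distance of
# a uniformly random permutation should be close to (n + 1) / 3.
def mean_random_and(n, trials, rng):
    total = 0.0
    for _ in range(trials):
        J = list(range(1, n + 1))
        rng.shuffle(J)
        total += sum(abs(a - b) for a, b in zip(J, J[1:])) / (n - 1)
    return total / trials
```

For \(n=100\), the estimate settles near \((100+1)/3 \approx 33.7\), far above the constant-in-\(n\) values of the controlled permutations.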

In summary, our permutations roughly double or triple the average neighbor distance, which is a substantial variation for generating multiple test sets. The expected AND of a random permutation is linear in \(n\) (Proposition 10), while the maximum AND of our permutations is constant in \(n\). Since our permutations are that far from random and close to the original order, we are not losing variations in data distributions, and the order of data is perturbed only to a controlled extent.

## 5 Experiments

We explore our permutations experimentally in two parts. Firstly, we visualize the permutations on real data. Our goal is to analyze the behavior of the permutations when changes in the original sequence happen in different ways. Secondly, we test a set of adaptive learning models on real evolving datasets and their permutations. Our goal is to demonstrate what additional information becomes available as a result of testing with our permutations and how using this information reduces the risk of a single evaluation bias.

### 5.1 Visual inspection

We visualize data over time so that the effects of permutations on changes in data distribution could be observed to give an intuitive perspective on what effect the proposed permutations achieve. For simplicity of interpretation, we limit this analysis to a single input variable.

We see that all datasets permuted using the time, the speed and the shape permutation techniques closely resemble the original sequences and preserve distinct distributions of data. In contrast, a random permutation produces sequences that are uniformly distributed over time. As a result, the distinct distributions and the need for adaptivity are lost. Randomly permuted sequences represent different online learning problems that do not require adaptation over time.

### 5.2 Testing adaptive models with multiple test sets

Next, we present computational experiments to demonstrate how our permutations can aid in evaluating adaptive models. We use three real datasets with the original time order, each covering a 2–3 year period. All datasets present binary classification problems where concept drift is expected. The Chess^{1} dataset (size: \(503 \times 8\)) presents the task of predicting the outcome of a chess game. Skills of a player and types of tournaments evolve over time. The Luxembourg (see footnote 1) dataset (size: \(1901 \times 31\)) asks to predict how much time a person spends on the internet, given the person's demographic information. The task is relevant for marketing purposes. Internet usage is expected to change over time. The Electricity dataset [13] (size: \(45312 \times 8\)) is the same as in Sect. 2.

For each dataset, we generate *ten* permutations of each type (time, speed and shape) with the parameters fixed to \(p=0.5\) and \(k=n\). The code for our permutations is available online (see footnote 1). We test five adaptive classifiers: OzaBagAdwin [6] with 1 and 10 ensemble members (Oza1, Oza10), DDM [11], EDDM [4] and HoeffdingOptionTreeNBAdaptive (Hoeff) [19]. We use MOA implementations [5] of these classifiers with the Naive Bayes as the base classifier. Hoeff on the Electricity data is not reported as it runs out of memory.

Robustness of the performance can be assessed by calculating the standard deviation of the accuracy over the permuted test sets. For instance, in the Chess data, Hoeff shows the most robust performance. In the other two datasets, we can conclude that the ensemble techniques (Oza1 and Oza10) are quite robust, and the detectors (DDM and EDDM) are resilient as well.

The permutations make it possible to assess reactions to variations in changes within the vicinity of the original data. For example, Hoeff is the least accurate on the original Chess dataset and on the shape permutations, while it is the most accurate on the time permutations. The results suggest that Hoeff is better at handling sudden changes. When testing with the time permutation on the Chess data, the accuracies of Oza1, Oza10 and Hoeff are notably higher than on the original. This observation suggests that the listed classifiers are better at handling sudden drifts than incremental ones.

Finally, our permutations make it possible to assess the stability of the ranking of classifiers as well as to compare them pairwise. We see on the LU data that the ranking of accuracies is very stable; it does not vary. Drawing the conclusions in the paragraph above would have been risky or impossible on the evidence of a single test run, but with our controlled permutations it is reasonable to nominate Hoeff as the most accurate classifier.

Our final recommendations for testing practice are as follows. When evaluating a set of adaptive learning models, we suggest first running the test-then-train procedure on the original data and using these results as a baseline. We then suggest running multiple tests with permuted datasets, which inform about the robustness of a classifier to variations in changes. We advise using this information for a complementary qualitative assessment. Before applying a permutation, one needs to consider critically whether a particular type of permutation is sensible from the domain perspective of a given application.

## 6 Related work

Our study relates to three lines of research: comparing supervised learning models, measuring distance between permutations and randomization in card shuffling.

Comparing the performance of classifiers received a great deal of attention in the last decade, e.g., [7, 9, 17]; however, these discussions assume a classical classification scenario, where the data is static. A recent contribution [12] addresses issues of evaluating classifiers in the online setting. The authors present a collection of tools for comparing classifiers on streaming data (not necessarily changing data). They provide means for presenting and analyzing the results *after* the test-then-train procedure. Our work concerns the test-then-train procedure itself and thus can be seen as complementary. We are not aware of any research addressing the problem of evaluation bias for adaptive classifiers or studying how to generate multiple tests for such classifiers.

The second line of research relates to measuring distance between permutations in general [8, 21, 22] or with specific applications to bioinformatics (e.g., [10]). In Sect. 4, we reviewed and experimentally investigated the main existing distance measures. As we discussed, these distance measures quantify absolute changes in the positions of examples, while our problem requires evaluating the relative changes. Thus, we have introduced a new measure.

A large body of literature studies randomization in card shuffling (e.g., [1, 18]). These works theoretically analyze shuffling strategies to determine how many iterations are needed to mix a deck of cards to a random order. Although our datasets can be seen as decks of cards, we cannot reuse the theory of shuffling times, as it focuses on different aspects of the problem. To adapt those theoretical results for our purpose, we would need to model the probability distribution of the relations between cards. In light of this option, we argue that our choice to use the average neighbor distance is much simpler and more straightforward.

A few areas are related via terminology. Restricted permutations [3] avoid having subsequences ordered in a prescribed way; such requirements are not relevant for our permutations. Permutation tests [23] assess the statistical significance of a relation between two variables (e.g., an input variable and the label), while we assess the effects of order. Block permutations [2] detect change points in time series, while we do not aim to analyze the data content; we perturb the data order. Discovering periodicity in time series [27] is likewise based on analyzing the content of data, while we operate on the indices of a sequence. Notably, that study uses permutations of elements in time series as a baseline of no periodicity. Time series bootstrap methods [20] aim to estimate the distribution of data by re-sampling. Our time permutation is similar as a technique; however, the problem setting is different, and thus these methods are generally not directly reusable: they are designed for identically distributed dependent data, while our setting implies that the data are independent but not identically distributed.

## 7 Conclusion

We proposed a methodology for generating multiple test sets from real sequential data for evaluating adaptive models designated for online operation. We pointed out that the standard test-then-train procedure, which runs a single test per dataset, risks producing results biased toward the fixed positions of changes in the dataset. Thus, we proposed to run multiple tests with randomized copies of a data stream. We developed three permutation techniques that are theoretically restricted so that the different distributions in the original data are not lost as a result of a permutation.

Our experiments demonstrate that such multiple tests provide good means for a qualitative analysis of the performance of adaptive models. In addition to the accuracy from a single test, they make it possible to assess three more characteristics of the performance: volatility, reactivity to different ways in which changes can happen, and stability of model ranking. Our permutations make it possible to pinpoint specific properties of the performance and to explore the sensitivity of the results to the data order. Such an analysis complements evaluation and increases confidence in the assessment. Our permutations can be viewed as a form of cross-validation for evolving data.

This research opens several follow-up research directions. It would be relevant to determine which statistical tests are suitable for assessing the statistical significance of the resulting accuracies. The problem is challenging, since the results from multiple tests cannot be considered independent. Another interesting direction is to develop mechanisms that, instead of restricting permutations with an upper bound, generate permutations of a specified extent, or, going further, sample an extent from a probabilistic model (e.g., the Mallows model) and then generate a permutation accordingly.
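To make the last direction concrete, the repeated-insertion construction is a standard way to sample from a Mallows model. A minimal sketch, assuming a model centered at the identity order under the Kendall-tau distance, with a dispersion parameter `phi` (the function name and parameterization are ours, chosen for illustration):

```python
import random

def sample_mallows(n, phi, rng=random):
    """Draw a permutation of range(n) from a Mallows model centered at the
    identity order. phi in (0, 1]: values near 0 concentrate probability
    on near-identity permutations (mild reordering extents), phi = 1
    yields a uniformly random permutation."""
    perm = []
    for i in range(n):
        # Insert item i at position j with weight phi**(i - j): appending
        # at the end (j == i) preserves relative order (weight 1), while
        # each step toward the front adds one inversion (one factor phi).
        weights = [phi ** (i - j) for j in range(i + 1)]
        r = rng.random() * sum(weights)
        for j, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        perm.insert(j, i)
    return perm
```

Under such a scheme, sampling an extent would amount to drawing `phi` (or the induced number of inversions) before generating each randomized copy of the data stream, rather than bounding the permutation from above.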

## Footnotes

- 1. Available at https://sites.google.com/site/zliobaite/permutations

## Notes

### Acknowledgments

The research leading to these results has received funding from the European Commission within the Marie Curie Industry and Academia Partnerships and Pathways (IAPP) programme under grant agreement no. 251617.

### References

- 1. Aldous D, Diaconis P (1986) Shuffling cards and stopping times. Am Math Mon 93(5):333–348
- 2. Antoch J, Huskova M (2001) Permutation tests in change point analysis. Stat Probab Lett 53:37–46
- 3. Atkinson M (1999) Restricted permutations. Discret Math 195:27–38
- 4. Baena-Garcia M, del Campo-Avila J, Fidalgo R, Bifet A, Gavalda R, Morales-Bueno R (2006) Early drift detection method. In: Proceedings of the ECML PKDD workshop on knowledge discovery from data streams, pp 77–86
- 5. Bifet A, Holmes G, Kirkby R, Pfahringer B (2010) MOA: massive online analysis. J Mach Learn Res 11:1601–1604
- 6. Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavalda R (2009) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 139–148
- 7. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
- 8. Diaconis P (1988) Group representations in probability and statistics. Lecture notes–monograph series, vol 11. Institute of Mathematical Statistics, Hayward, CA
- 9. Dietterich T (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923
- 10. Durrett R (2003) Shuffling chromosomes. J Theor Probab 16(3):725–750
- 11. Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: Proceedings of the Brazilian symposium on artificial intelligence (SBIA), pp 286–295
- 12. Gama J, Sebastiao R, Rodrigues PP (2009) Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 329–338
- 13. Harries M (1999) Splice-2 comparative evaluation: electricity pricing. Technical report, The University of New South Wales
- 14. Ikonomovska E, Gama J, Dzeroski S (2011) Learning model trees from evolving data streams. Data Min Knowl Discov 23(1):128–168
- 15. Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22:371–391
- 16. Kolter J, Maloof M (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
- 17. Ojala M, Garriga G (2010) Permutation tests for studying classifier performance. J Mach Learn Res 11:1833–1863
- 18. Pemantle R (1989) Randomization time for the overhand shuffle. J Theor Probab 2(1):37–49
- 19. Pfahringer B, Holmes G, Kirkby R (2007) New options for Hoeffding trees. In: Proceedings of the 20th Australian joint conference on advances in artificial intelligence (AJCAAI), pp 90–99
- 20. Politis D (2003) The impact of bootstrap methods on time series analysis. Stat Sci 18(2):219–230
- 21. Schiavinotto T, Stutzle T (2007) A review of metrics on permutations for search landscape analysis. Comput Oper Res 34(10):3143–3153
- 22. Sorensen K (2007) Distance measures based on the edit distance for permutation-type representations. J Heuristics 13(1):35–47
- 23. Welch W (1990) Construction of permutation tests. J Am Stat Assoc 85(411):693–698
- 24. Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
- 25. Witten I, Frank E, Hall M (2011) Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, Los Altos, CA
- 26. Wozniak M (2011) A hybrid decision tree training method using data streams. Knowl Inf Syst 29(2):335–347
- 27. Vlachos M, Yu P, Castelli V, Meek Ch (2006) Structural periodic measures for time-series data. Data Min Knowl Discov 12:1–28
- 28. Zliobaite I (2011) Combining similarity in time and space for training set formation under concept drift. Intell Data Anal 15(4):589–611
- 29. Zliobaite I (2011) Controlled permutations for testing adaptive classifiers. In: Proceedings of the 14th international conference on discovery science (DS), pp 365–379