Temporal density extrapolation using a dynamic basis approach
Abstract
Density estimation is a versatile technique underlying many data mining tasks and techniques, ranging from exploration and presentation of static data, to probabilistic classification, or identifying changes or irregularities in streaming data. With the pervasiveness of embedded systems and digitisation, this latter type of streaming and evolving data becomes more important. Nevertheless, research in density estimation has so far focused on stationary data, leaving the task of extrapolating and predicting density at time points outside a training window an open problem. For this task, temporal density extrapolation (TDX) is proposed. This novel method models and predicts gradual monotonous changes in a distribution. It is based on an expansion of basis functions, whose weights are modelled as functions of compositional data over time by using an isometric log-ratio transformation. Extrapolated density estimates are then obtained by extrapolating the weights to the requested time point and querying the density from the basis functions with back-transformed weights. Our approach aims for broad applicability by neither being restricted to a specific parametric distribution, nor relying on cluster structure in the data. It requires only two additional extrapolation-specific parameters, for which reasonable defaults exist. Experimental evaluation on various data streams, synthetic as well as from the real-world domains of credit scoring and environmental health, shows that the model manages to capture monotonous drift patterns accurately and better than existing methods, while requiring no more than 1.5 times the run time of a corresponding static density estimation approach.
Keywords
Density extrapolation · Density forecasting · Data streams · Concept drift · Non-stationary data · Compositional data

1 Introduction
The extent and scope of available data continue to grow, often comprising data that is continuously generated over longer time spans, in environments that are subject to change over time (Reinsel et al. 2017). The volume and velocity of such data streams often require automated processing and analysis, while their dynamic nature and associated volatility need to be considered as well (Fan and Bifet 2013). For example, the distribution of the data might change over time, a problem that is commonly denoted as concept drift (Widmer and Kubat 1996) or population drift (Kelly et al. 1999). In prediction tasks, such drift requires adaptation mechanisms, for example by forgetting outdated, irrelevant data or models (Gama et al. 2014). However, merely attempting to avoid the negative influence that this dynamic nature can have on statistical models leaves its potential unused. In contrast, aiming to understand and to incorporate the changes in data into the statistical model itself might provide additional value, by helping to perform more accurate predictions for the future and by allowing a structured description of the occurring changes.
Several tasks and approaches have been proposed to identify change or irregularities in data, such as outlier detection [surveyed for example in Sadik and Gruenwald (2014)], anomaly detection [surveyed by Chandola et al. (2009)], or change detection [surveyed by Tran et al. (2014)]. Furthermore, some research has focused on the exploration, understanding, and exploitation of change: change diagnosis aims to estimate the spatiotemporal change rates in kernel density within a time window (Aggarwal 2005). This might be seen as an early precursor of the change mining paradigm, which was proposed afterwards by Böttcher et al. (2008) and calls for data mining approaches that provide understanding of change itself. Following this paradigm, drift mining techniques aim to provide explicit models of distributional changes over time, for example to transfer knowledge between different time points in data streams under verification latency (Hofer and Krempl 2013).
An essential technique in data science (Chacón and Duong 2018; Scott 2015) that underlies many of the algorithms for the tasks above is density estimation (Rosenblatt 1956; Whittle 1958), which is also important for exploration and presentation of data in general (Silverman 1986), as well as for probabilistic classifiers such as Parzen window classifiers (Parzen 1962). Given a set of instances sampled within a training time window from a non-stationary (i.e., drifting) data stream, the provided density estimates typically correspond to the distribution over all instances within this window. However, these estimates might also be required for specific time points, rather than time windows. Such a time point might lie within the training time window, requiring an in-sample interpolation. Alternatively, it might lie outside, requiring an out-of-sample extrapolation to past or future time points.
Research in density estimation has so far focused on providing in-sample estimates, which are often adapted to the distribution of the most recently observed samples in the training time window. In contrast, density extrapolation beyond this window has only recently received attention (Krempl 2015), although it has potential beyond simply being used in lieu of static density estimation in the applications above: most importantly, extrapolating continuing trends makes it possible to anticipate future distributional changes and to take preparatory measures. Examples are classifiers or models that anticipatively incorporate forthcoming distributional changes; active learning and change detection approaches that proactively sample in regions where change is expected to happen; or the identification of unexpected changes in drift.
Figure 1 shows a feature X in a data stream, whose density at different points in time is described by an expansion of six Gaussian density functions. The weights associated with each of these basis functions form a composition (Aitchison 1982; Egozcue et al. 2003) that sums to one; their proportions are illustrated at the back end of the figure. The change in the density at the second and third time points is modelled as a change in the distribution of the weights, as visible from the shift in the composition proportions. Modelling these weights as functions of time is the core of the approach’s fitting process, which will be elaborated on in Sect. 3.
Modelling the data stream in this way entails less computation than kernel density estimators, since we do not need to compute a full kernel matrix but only evaluate the small number of basis functions at the sample positions. This approach also allows a straightforward way of forecasting the density at future time points, since the basis weight functions only need to be extrapolated to the desired time. Furthermore, the model delivers an easily interpretable description of drift by means of the basis weight functions.
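To make this concrete, the following sketch evaluates such a basis expansion at two time points with hand-picked weights. The centres, bandwidth, and weight values are purely illustrative, not fitted values from the paper; only the structure (a weighted sum of Gaussian basis functions with weights summing to one) follows the text.

```python
import numpy as np
from scipy.stats import norm

def basis_density(x, centers, h, weights):
    """Density as a weighted expansion of M Gaussian basis functions.
    The weights are non-negative and sum to one, so the expansion
    integrates to one like a proper density."""
    phi = np.array([norm.pdf(x, loc=c, scale=h) for c in centers])  # (M, len(x))
    return weights @ phi

centers = np.linspace(-3, 3, 6)   # M = 6 basis functions (illustrative)
h = 1.0                           # shared bandwidth (illustrative)
w_t0 = np.array([0.05, 0.10, 0.35, 0.35, 0.10, 0.05])  # weights at time t0
w_t1 = np.array([0.02, 0.05, 0.20, 0.38, 0.25, 0.10])  # drifted weights at t1

x = np.linspace(-5, 5, 201)
d0 = basis_density(x, centers, h, w_t0)
d1 = basis_density(x, centers, h, w_t1)
```

Note how drift is expressed purely through the weight vector: the basis functions themselves stay fixed, which is what makes extrapolating the weights over time sufficient for forecasting the density.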
The remainder of this article is organised as follows: in Sect. 2, we review the related work, before presenting and discussing our temporal density extrapolation approach in detail in Sect. 3. An experimental evaluation of this approach is given in Sect. 4, followed by concluding remarks in Sect. 5.
2 Related work
There is a vast literature on estimating the density within a training sample (Chacón and Duong 2018; Scott 2015). This includes approaches that provide spatiotemporal density estimates for time points within the training time window, e.g., by using time-varying mixture models (Lawlor and Rabbat 2016) or spatiotemporal kernels (Aggarwal 2005). However, our focus is on the task of density extrapolation in time-evolving data streams, as recently formulated in Krempl (2015). Thus, we will restrict the following review to this density extrapolation, then review the related problem of out-of-sample density forecasting, and finally discuss the broader context within the literature on concept drift in data streams.
Density extrapolation is described in Krempl (2015), together with a sketch of the general idea of extending kernel density estimation techniques for this task. In Lampert (2015), “extrapolating distribution dynamics” (EDD) is proposed, although this approach aims at predicting the distribution only one step ahead, in the immediate future. EDD models the transformation between previous time points and applies this transformation to the most recent sample, to obtain a one-step-ahead extrapolation of the distribution dynamics.
A related problem is density forecasting [see, e.g., the survey in Tay and Wallis (2000)], where the realisations of a random variable are predicted. Applications exist in macroeconomics and finance (Tay and Wallis 2000; Tay 2015), as well as in specialised fields like energy demand forecasting (He and Li 2018; Mokilane et al. 2018). Forecasts are made either within (in-sample) or outside (out-of-sample) an observed sample and training time window. In our context, only out-of-sample forecasting is relevant. In Gu and He (2015), the ‘dynamic kernel density estimation’ in-sample forecasting approach by Harvey and Oryshchenko (2012) is extended to out-of-sample forecasting. Their method models a time-varying probability density function non-parametrically, using kernel density estimation and observation-weighting schemes derived from time series modelling. The resulting approach provides directional forecasting signals, specifically for the application of predicting the direction of stock returns. Another direction in out-of-sample forecasting is the use of histograms as approximations of the underlying distribution. Arroyo and Maté (2009) use time series of histograms, with a histogram being available for each observed time point. They propose a kNN-based method to forecast the distribution at a future time point based on the previously observed histograms. However, the method is limited to forecasting an already previously observed histogram, making it better suited for contexts with recurring patterns. Furthermore, motivated by symbolic data analysis (Noirhomme-Fraiture and Brito 2011), Dias and Brito (2015) proposed an approach that uses linear regression to forecast the density of one variable based on observed histogram data from another variable. However, our objective is an extrapolation of the same variable to future time points. Another direction is the direct modelling of the probability density.
Motivated by applications in energy markets, several approaches have been proposed for forecasting energy supply (e.g., wind power) and demand (power consumption). Bessa et al. (2012) developed a kernel density estimation model based on the Nadaraya–Watson estimator in the context of wind power forecasting. The employed kernels include the beta and gamma kernels as well as the von Mises distribution, underlining the very specialised nature of the approach for use in wind power forecasting. The work of He and Li (2018) is also targeted towards wind power forecasting, for which they propose a hybrid model consisting of a quantile regression neural network and an Epanechnikov kernel density estimator. Quantile regression is also used in the work of Mokilane et al. (2018) to predict the electricity demand in South Africa for the purpose of long-term planning, while the work of Bikcora et al. (2015) combines an ARMAX and a GARCH model to forecast the density of electricity load in the context of smart charging of electric vehicles. However, the approaches to modelling the probability density presented above all share a strong specificity to their application.
Density extrapolation and forecasting are part of the more general topic of handling non-stationarity in streaming, time-evolving data. This problem has gained particular attention in the data stream mining community, where changes in the distribution are commonly denoted as concept drift by Widmer and Kubat (1996) or population drift by Kelly et al. (1999). As discussed in the taxonomy of drift given in Webb et al. (2016), the drifting subject might be the distribution of features X conditioned on the class label Y. Such drift in the class-conditional feature distribution P(X|Y) might result in drift of the posterior distribution P(Y|X). Of particular relevance for our work is gradual drift of P(X|Y) and P(Y|X), where the distribution slowly and continuously changes over time, as opposed to sudden shift, where it changes abruptly. Many drift-adaptation techniques have been proposed that aim to incorporate the distribution of the most recently observed labelled data (X, Y) into a machine learning model, as surveyed in Gama et al. (2014). However, a challenge in some applications is that no such recent data is available for model updates (Krempl et al. 2014). For example, in stream classification true labels Y might arrive only with a considerable delay [so-called verification latency (Marrs et al. 2010) or label delay (Plasse and Adams 2016)], or might only be available during the initial training (Dyer et al. 2014). As discussed in Hofer and Krempl (2013), this requires adaptation mechanisms that use the limited available data, which is either recent but unlabelled, or labelled but old.
These adaptation mechanisms build on drift models (Krempl and Hofer 2011), which model the gradual drift over time in the posterior distribution P(Y|X) or the class-conditional feature distribution P(X|Y). They then use this for temporal transfer learning from previous time points (source domains) to current or future time points (target domains). This is part of the broader change mining paradigm, introduced by Böttcher et al. (2008), which aims to understand the changes in a time-evolving distribution, and to use this understanding to describe and predict changes. Various drift models and mechanisms for adaptation under verification latency have been proposed. However, they all model the class-conditional distribution of instances, for example by employing clustering assumptions (Krempl and Hofer 2011; Dyer et al. 2014; Souza et al. 2015), by calculating changes of a prior (Hofer and Krempl 2013) or of weights (Tasche 2014), by matching labelled and unlabelled instances (Krempl 2011; Hofer 2015; Courty et al. 2014), or by directly adapting the classifier (Plasse and Adams 2016). Thus, they are not applicable to modelling the changes in a non-parametric, multimodal distribution of unlabelled data, which is the objective of this work. The approach proposed in this work therefore complements the existing change mining literature. By providing a method for extrapolating P(X|Y), it complements the ex-post drift analysis in Webb et al. (2017) and addresses the calls for a better understanding of drift. This might be useful to assess the so-called predictability of drift (Webb et al. 2016), and to adapt a classification model in the presence of concept drift and verification latency.
3 Method
In the following subsections, we first derive in Sect. 3.1 the basic model for temporal density extrapolation, using an expansion of density functions and the ilr transformation. Then, in Sect. 3.2, we extend this model by mechanisms for instance weighting and regularisation. Finally, in Sect. 3.3, we discuss how to solve the resulting optimisation problem. For the reader’s convenience, we have summarised our notation in Table 1.
Table 1  Overview of the notation used

X  Observed sample
\(x_i\)  ith observation in sample
T  Time attribute of observed sample
\(\tau _i\)  Time value of ith observation
N  Observed instance sample size
M  Number of basis functions
h  Bandwidth
R  Order of polynomial
D  Dimensionality of feature space
\( \gamma _j\)  Weight of jth basis function
\( {\varvec{\gamma }}\)  Vector of all M basis weights
B  Matrix of regression coefficients
\(\phi _j\)  jth basis function
\({\varvec{\phi }} \)  Vector of all M basis functions
3.1 Basic model
Let \({\mathcal {X}}\) be a univariate feature for which a sample \(X=\{x_1,\ldots ,x_N\}\) of size N with associated time values \(T=\{\tau _1,\ldots ,\tau _N\}\), with \(\tau _i \in [0,1]\, \hbox {for}\, i=1,\ldots ,N\), is observed. This forms a data stream segment. Since a data stream is a dynamic setting, it is possible that the distribution of X changes as time progresses. As a result, the probability density of X at a given time point t may be unequal to that at a future time \(t + 1\). This change, referred to as concept drift, is assumed not to be limited to the observed sample, but to potentially continue with time.
Such constraints (non-negative components that sum to one) are a defining characteristic of what is referred to as compositional data (Aitchison 1982), i.e. data that describes the composition of a whole of M components. It is useful to approach the modelling of the basis weight functions as a compositional data problem, because among the methods developed for the analysis of such data there is one that enables the elegant incorporation of these constraints into the model: the isometric log-ratio (ilr) transformation (Egozcue et al. 2003). The ilr transformation was proposed by Egozcue et al. to transform compositional data from its original feature space, which for a composition of M components is an M-dimensional simplex (Aitchison 1982), to \({\mathbb {R}}^{M-1}\).
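As a concrete illustration (a minimal sketch, not the paper's implementation), the ilr transformation of Egozcue et al. (2003) and its inverse can be written for an M-part composition as follows; the sequential-binary-partition basis used here is one standard choice:

```python
import numpy as np

def ilr(w):
    """Isometric log-ratio transform of a composition w (M parts, sum 1)
    to unconstrained coordinates in R^(M-1), using a standard balance
    basis: z_j = sqrt(j/(j+1)) * ln(geomean(w_1..w_j) / w_{j+1})."""
    M = len(w)
    z = np.empty(M - 1)
    for j in range(1, M):
        gm = np.exp(np.mean(np.log(w[:j])))          # geometric mean of first j parts
        z[j - 1] = np.sqrt(j / (j + 1)) * np.log(gm / w[j])
    return z

def ilr_inv(z):
    """Map ilr coordinates back to the simplex (inverse transform):
    reassemble the centred log-ratio vector from the balance basis,
    exponentiate, and close (normalise) to sum one."""
    M = len(z) + 1
    logw = np.zeros(M)
    for j in range(1, M):
        c = np.sqrt(j / (j + 1)) * z[j - 1]
        logw[:j] += c / j
        logw[j] -= c
    w = np.exp(logw)
    return w / w.sum()

w = np.array([0.5, 0.3, 0.2])   # example composition of M = 3 basis weights
z = ilr(w)                       # two unconstrained coordinates in R^2
w_back = ilr_inv(z)              # recovers the original composition
```

Because the ilr coordinates are unconstrained real numbers, they can be modelled and extrapolated with ordinary regression over time, and the back-transformed values are guaranteed to be valid weights.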
3.2 Extension: instance weighting and regularisation
3.3 Optimisation problem and density model
This solver configuration was used in a multiple-starting-point search executed via the MultiStart function provided by the previously mentioned toolbox. For this, the ‘artificial bound’ parameter used for starting point generation is set to 2 and the number of start points to 4; the choice of the latter is a trade-off between higher optimality of the solution and shorter computational run time.
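In Python, a comparable multiple-starting-point search could be sketched as follows. This is an assumption-laden analogue, not the paper's MATLAB setup: scipy's local solver stands in for the toolbox solver, and the toy objective merely illustrates why multiple starts help on a non-convex problem.

```python
import numpy as np
from scipy.optimize import minimize

def multistart_minimize(objective, dim, n_starts=4, bound=2.0, seed=0):
    """Multiple-start local optimisation, mirroring the MultiStart setup
    described above: draw start points uniformly from [-bound, bound]^dim,
    run a local solver from each, keep the best local solution found."""
    rng = np.random.default_rng(seed)
    best = None
    for x0 in rng.uniform(-bound, bound, size=(n_starts, dim)):
        res = minimize(objective, x0, method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best

# Toy multimodal objective with two global minima at (+-1, 0); in the
# paper this role is played by the penalised model-fitting objective.
obj = lambda b: (b[0] ** 2 - 1) ** 2 + b[1] ** 2
sol = multistart_minimize(obj, dim=2)
```

With a single start, a local solver can get stuck near a poor stationary point; restarting from several random points within the artificial bound and keeping the best result is the same trade-off the text describes between solution quality and run time.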
4 Experimental evaluation
The goals of the experimental evaluation are twofold: first, to assess the quality of the densities predicted by the models for future time points, which is done by comparing the proposed method to the two other methods available in the literature for this problem; second, to investigate the behaviour of the proposed method by analysing its sensitivity with respect to the model hyperparameters and its computational run time. The experiments are conducted in MATLAB R2017b, with the exception of the EDD method, for which an implementation in Python 2.7 has graciously been provided by the author of the EDD method. All experiments are conducted on the ‘Gemini’ cluster of Utrecht University, using a PowerEdge R730 with 32 HT cores and 256 GB memory.
For the experiments, a range of data sets, both real and artificial, has been selected, exhibiting various kinds of drift over time. These data sets are discussed in detail in the following.
4.1 Data
Four different artificial data sets are generated to simulate different drift patterns. The use of artificial data in this context has the advantage that both the drift pattern and the data-generating process in general are explicitly known, so the models’ estimates can be compared to the true density.

meandrift—four components with the location parameter of all but the first component changing over time.

weightdrift—three components, mixture weight of the second decreases over time, weights of the other two increase.

sigmachange—three components with scale parameter changing over time.

staticskewnormals—four unchanging components.

dti—a ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

int_rate—the interest rate on the loan.

loan_amnt—the listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, then it will be reflected in this value.

open_acc—the number of open credit lines in the borrower’s credit file.

revol_util—Revolving line utilisation rate, or the amount of credit the borrower is using relative to all available revolving credit.
To arrive at a more suitable data set, the data was split into one subset per feature, resulting in 5 subsets, each containing one feature and the time variable. The separation by measuring station was removed, meaning that each hourly time stamp had at most 7 measurements associated with it. Each of these subsets underwent the same preprocessing. First, all entries with missing feature values were discarded. Then, all entries whose time stamp does not lie within the inclusive interval of 4 pm till 8 pm (considered the time frame of the evening rush hour by common definition) were removed, to eliminate the intraday variation in the data and instead focus on the trends on a monthly basis. To this end, the time stamps were generalised to month level by dropping the day component of the time stamp. Finally, the time variable was normalised to a scale from 0 to 1, and the feature values were Box-Cox-transformed if the fitted \(\lambda \) value was non-zero, and log-transformed if it was equal to 0. The resulting processed data set represents the transformed compound measurements during the evening rush hour in Skopje on a monthly basis, allowing for an analysis of the change in the distribution over time.
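A sketch of this per-feature preprocessing pipeline, assuming a pandas DataFrame with a 'timestamp' column; the column name, the 'pm10' feature, and the example data are hypothetical stand-ins, not the actual Skopje data:

```python
import numpy as np
import pandas as pd
from scipy import stats

def preprocess(df, feature):
    """Per-feature preprocessing as described above: drop missing values,
    keep only the evening rush hour (4 pm to 8 pm inclusive), generalise
    time stamps to month level, normalise time to [0, 1], and apply a
    Box-Cox transform (scipy fits lambda; for lambda = 0 this reduces to
    the log transform)."""
    d = df[["timestamp", feature]].dropna()
    d = d[d["timestamp"].dt.hour.between(16, 20)].copy()       # evening rush hour
    m = d["timestamp"].dt.year * 12 + d["timestamp"].dt.month  # month-level index
    d["time"] = (m - m.min()) / (m.max() - m.min())            # normalise to [0, 1]
    d[feature], _lam = stats.boxcox(d[feature])                # fitted-lambda transform
    return d

# Hypothetical measurements, sampled every 7 hours over roughly 19 months
rng = np.random.default_rng(0)
ts = pd.to_datetime("2018-01-01") + pd.to_timedelta(np.arange(2000) * 7, unit="h")
df = pd.DataFrame({"timestamp": ts, "pm10": rng.gamma(2.0, 10.0, 2000)})
out = preprocess(df, "pm10")
```

Note that `scipy.stats.boxcox` requires strictly positive values, which holds for concentration measurements; zero readings would need an offset first.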
Figure 2 illustrates the changes in the density over time on both the real and the artificial data sets. The X-axis shows the feature values (X), while the Y-axis shows the probability density of X. For the artificial data sets this is the true density; for the real data it is an approximation, as will be discussed later. The shade of the different lines indicates the time point associated with each density curve, with lighter shades representing earlier and darker shades representing later time points. The middle right plot in Fig. 2 shows the change in the approximated density on the lending club ‘dti’ data set, a fairly simple, slow shift to the right, while the plot of the ‘interest rate’ data set shows a more complex change: many small changes at multiple locations make this pattern fairly complex and noisy. In contrast to this, the drift patterns of the artificial data sets are intentionally simple, to clearly simulate specific causes of a changing distribution.
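For instance, the weightdrift pattern described above could be generated along the following lines. The component locations, scales, and weight trajectories here are illustrative choices that merely match the description (three fixed components, the second component's weight decreasing over time), not the exact parameters used in the paper:

```python
import numpy as np

def sample_weightdrift(n, rng):
    """Illustrative generator for the 'weightdrift' pattern: three fixed
    Gaussian components whose mixture weights change linearly with the
    normalised time stamp; the weights sum to one at every time point."""
    tau = rng.uniform(0, 1, n)                     # time stamps in [0, 1]
    w = np.stack([0.2 + 0.2 * tau,                 # weight increases
                  0.5 - 0.4 * tau,                 # second weight decreases
                  0.3 + 0.2 * tau], axis=1)        # weight increases
    comp = np.array([rng.choice(3, p=wi) for wi in w])  # pick a component per point
    means = np.array([-2.0, 0.0, 2.0])
    sds = np.array([0.5, 0.8, 0.5])
    x = rng.normal(means[comp], sds[comp])
    return tau, x

tau, x = sample_weightdrift(500, np.random.default_rng(0))
```

Since only the mixture weights depend on time, this generator produces exactly the kind of drift the TDX model assumes: fixed components with a drifting composition.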
To give a comparison to the proposed method regarding the quality of the predicted density, two additional methods have been employed. The first of these is a static version of the proposed method, which fits the basis expansion model with static weights instead of modelling a time-dependent weight function. As such, this model does not account for the temporal dimension of the data. The second method used is the “Extrapolating the Distribution Dynamics” (EDD) approach proposed by Lampert (2015). Since EDD is based on kernel density estimates, the number of training instances needs to be kept in mind when tuning the bandwidth parameter of the model. If the number of training instances during model selection is very different from the number of instances in the experiments’ training window, the kernel bandwidth is likely to be ill-suited, resulting in poor performance. To account for this, it was ensured that the training set presented to EDD in the experiments comprises the same number of instances as the respective model selection training set. The procedure for this is as follows: if the training window of the model selection comprises less than 10% of the entire data set, then all instances in this window are used; if the model selection training window contains more than 10% of the overall instances, then a random subsample corresponding to 10% of the data is used. When training the model during the experiments, it is ensured that the number of training instances does not exceed the number of instances used during model selection. If the experiment training window contains more instances than that, a random subsample is used; if it contains fewer, all samples are used.
The experiment setup comprises a model selection phase elaborated on in Sect. 4.2, a training phase using the selected hyperparameters and the application of the trained models to predict the density in a previously unseen segment of the data stream. Thus, on an infinite stream the model selection is done at the beginning, while training and prediction phases are repeated over time. The data sets have been partitioned into several time windows as illustrated in Fig. 3 for the experiments. Consider that \(t=0\) indicates the first time point in the data set and \(t=1\) the last time point. In this setting, we consider the data until \(t=0.8\) as available historic data and the data after that as entirely unseen. We chose the time frame [0.0, 0.5[ for model selection, because of the strongly skewed distribution of instances over time in the lending club data set. This way there is a sufficiently large initial sample for all algorithms. Therein, the interval [0.0, 0.45[ is used to train models with different hyperparameter combinations and the interval [0.45, 0.5[ is used to evaluate them. The segment of [0.5, 0.8[ was used to train the models with the previously selected hyperparameter combinations. In order to investigate the influence of the training window size on the model performance, the methods were trained once on each of the three subintervals [0.5, 0.8[, [0.6, 0.8[ and [0.7, 0.8[. These three training windows per method result in 9 final models per data set that are finally evaluated.
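The partitioning described above can be summarised as a simple mapping from normalised time stamps to experiment phases (intervals are half-open, as in the text):

```python
def phase(tau):
    """Map a normalised time stamp tau in [0, 1] to the experiment phase
    described above: model-selection training [0.0, 0.45[, model-selection
    validation [0.45, 0.5[, experiment training [0.5, 0.8[, and the unseen
    forecasting segment [0.8, 1.0]."""
    if tau < 0.45:
        return "model-selection training"
    if tau < 0.5:
        return "model-selection validation"
    if tau < 0.8:
        return "experiment training"
    return "unseen forecasting segment"
```

The three experiment training windows [0.5, 0.8[, [0.6, 0.8[ and [0.7, 0.8[ are all subintervals of the "experiment training" phase; only the window start varies.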
The quality of a model’s density estimate for a given time point is measured by the mean absolute error (MAE) between the prediction and the true density, evaluated at a previously defined set of points. This error measure was chosen because it is commonly used in forecasting and provides a straightforward interpretation. Computing the error measure on the artificial data sets is simple, since the data-generating process is entirely known and the true density can be computed. This, however, proves problematic for the real-world data, where the underlying process that generated the data is unknown and only random samples of varying size drawn from this process are available. To still evaluate the quality of the predicted density in this case, the density has to be approximated based on the observed samples.
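The error measure itself is straightforward; a minimal sketch:

```python
import numpy as np

def mae(predicted, true):
    """Mean absolute error between a predicted and a reference density,
    both evaluated at the same, previously defined set of points."""
    predicted, true = np.asarray(predicted), np.asarray(true)
    return float(np.mean(np.abs(predicted - true)))
```

For example, `mae([0.1, 0.2], [0.2, 0.4])` averages the pointwise absolute deviations 0.1 and 0.2 to 0.15.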
To approximate the true density at a time point \(t_i\), we consider the instances within a time window of size 4 around that time point to smooth potential noise, i.e. \(\{t_{i-4},\ldots ,t_i,\ldots ,t_{i+4}\}\). Based on this set of instances of size S, an ensemble of 9 smoothed histograms is created, with each histogram using a different number of bins b. The set of used bin counts consists of 8 incremental integer steps, 4 in positive and 4 in negative direction, around the result of Sturges’ formula \(b_s = (\log _2 S) +1\) (Sturges 1926). For each of these histograms we then compute the relative number of samples per bin with respect to the number of samples and divide it by the bin width. This density-scaled relative frequency is then associated with the bin centres and used to fit a cubic spline. Each smoothed histogram then approximates the density by the spline’s function value where it is greater than zero, while the density is regarded as equal to zero where the spline’s function value is smaller than or equal to zero. Each spline in the ensemble is then evaluated over the same set of points across the domain of X. The mean of the resulting 9 vectors then forms the approximation of the true density, referred to in the following simply as the ‘baseline density’.
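A sketch of this baseline density construction, using scipy's `CubicSpline`. The zero-clipping and averaging follow the description above; the synthetic test sample, the handling of the spline outside the data range (treated as zero density), and the exact bin-count bookkeeping are implementation assumptions:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def baseline_density(samples, grid):
    """Ensemble-of-histograms density approximation: 9 smoothed histograms
    with bin counts around Sturges' formula, each converted to a cubic
    spline clipped at zero, then averaged over a common evaluation grid."""
    S = len(samples)
    b_sturges = int(np.log2(S)) + 1                    # Sturges' formula
    estimates = []
    for b in range(b_sturges - 4, b_sturges + 5):      # 9 bin counts
        counts, edges = np.histogram(samples, bins=max(b, 2))
        centers = (edges[:-1] + edges[1:]) / 2
        rel_freq = counts / (S * np.diff(edges))       # density-scaled frequency
        spline = CubicSpline(centers, rel_freq, extrapolate=False)
        vals = np.nan_to_num(spline(grid), nan=0.0)    # zero outside the data range
        estimates.append(np.clip(vals, 0.0, None))     # density cannot be negative
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, 500)                     # synthetic stand-in sample
grid = np.linspace(-4.0, 4.0, 200)
dens = baseline_density(sample, grid)
```

Averaging over several bin counts reduces the sensitivity of the baseline to any single binning choice, which is the point of the ensemble.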
4.2 Model selection
Both the method proposed in this article as well as EDD (Lampert 2015) require hyperparameters that need to be determined as part of the model selection step in the experiments. The proposed method requires four hyperparameters, namely the number of basis functions M, their bandwidth h, the order of the polynomial R in the multivariate regression and the regularisation factor \(\lambda \). EDD requires the Gaussian kernel bandwidth \(\sigma \) and also the regularisation factor \(\lambda \).
In order to determine these hyperparameters, a model selection step is incorporated into the experiment design, using the data designated for the model selection phase as discussed earlier. For the \(\lambda \) hyperparameter of EDD, the author used \(\frac{1}{N}\) as a default value in his experiments on the artificial data sets. We based the parameter search space for \(\lambda \) on this suggestion by multiplying it by a scaling factor \(b \in \{0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5\}\). The parameter search space for \(\sigma \) consists of 20 evenly spaced numbers in the interval [0.001, 0.3]. For each possible combination of values in these two parameter search spaces an EDD model is trained, and the parameter values of the model that scored the lowest MAE on the validation data are selected for use in the experiments on the given data set.
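The described grid search can be sketched generically as follows. The `train` and `validate` callables are placeholders for fitting and scoring an EDD model, not the actual implementation; the toy stand-ins below merely exercise the search logic:

```python
import numpy as np
from itertools import product

def select_edd_params(train, validate, n_train):
    """Grid search for EDD's hyperparameters as described above:
    lambda = b * 1/N for scaling factors b, sigma over 20 evenly spaced
    values in [0.001, 0.3]; keep the pair with the lowest validation MAE."""
    lam_scales = [0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]
    sigmas = np.linspace(0.001, 0.3, 20)
    best, best_mae = None, np.inf
    for b, sigma in product(lam_scales, sigmas):
        lam = b / n_train
        score = validate(train(lam, sigma))    # validation MAE of this model
        if score < best_mae:
            best, best_mae = (lam, float(sigma)), score
    return best, best_mae

# Toy stand-ins: 'train' returns the parameter pair, 'validate' scores its
# distance to a hypothetical optimum at (lambda=0.001, sigma=0.1).
train = lambda lam, sigma: (lam, sigma)
validate = lambda model: (model[0] - 0.001) ** 2 + (model[1] - 0.1) ** 2
best, best_mae = select_edd_params(train, validate, n_train=1000)
```

The grid contains 7 × 20 = 140 candidate configurations per data set, each requiring one model fit and one validation pass.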
Again, the hyperparameter combination that yields the smallest MAE is selected. This is then used in the second phase of the model selection, in which R and \(\lambda \) are determined via a linear search of the parameter space for both parameters, with \(R \in \{1,2,3\}\) and \(\lambda \in \{1,2,3,4,5\}\). Note that the search spaces for these parameters are based on experience with experiments prior to those presented here. A detailed account of the model’s sensitivity to different hyperparameter values will be presented in Sect. 4.4.
4.3 Results
In this section, we first address the question of the models’ predictive quality, for which the three groups of data sets (artificial, lending club and pollution) are considered sequentially. Then the results of a series of experiments investigating the sensitivity of the proposed method with respect to its hyperparameters are presented. Finally, the computational effort of the involved methods is discussed.
4.3.1 Artificial data
The weightdrift pattern matches the assumptions of the temporal density extrapolation (TDX) model best, given that the weightdrift data is generated by a mixture distribution whose mixture weights change linearly over time. Figure 4a shows the MAE of the different models (tdx, static, edd) for the three different training window lengths (0.1, 0.2, 0.3), indicated by line style, over a range of latency values (X-axis). As mentioned earlier, the latency is the time difference between the end of the training window and the time point of the forecast. TDX performs best on this pattern for all three training window sizes, and it can be seen that the length of the training window influences the error of the model and its change over time. The TDX model with the longest training window shows both a lower error and a slower increase in error over time compared to the TDX models with window lengths of 0.2 and 0.1, respectively. The opposite effect can be observed with the static model, which performs better with a smaller training window.
On the meandrift pattern, the location parameters of the subdistributions in the mixture change linearly over time, which results in a less smooth drift compared to that of weightdrift, as seen in Fig. 2. Nonetheless, TDX also performs best on this pattern, as can be seen in Fig. 4b. Here the difference between the training window lengths for TDX is smaller. Although the model with the longest training window performs marginally worse for smaller latency values, it later consistently performs better than both models with shorter training windows.
The static models show a slightly higher error and increase in error, again with the smallest training window resulting in a lower error.
On the sigmachange data the drift is simulated by a linear change in the standard deviation of the components in the mixture distribution, resulting in the density change shown in Fig. 2. Figure 4c shows that all TDX models except the one with the shortest training window perform consistently better than the static model, showing lower error and a slower increase in error over time.
Finally, the performance on the static-skewnormals data set, which does not include any change in the distribution over time, is illustrated in Fig. 4d. Although all TDX and static models are fairly close in terms of error (differences \(<0.0025\)), the best TDX model only scores the third-lowest error. All TDX models show almost no increase in error over time, indicating that the model was able to recognise the absence of drift.
To statistically validate these results, a series of two-sided Wilcoxon signed-rank tests was performed. For each data set, we selected the best performing model of each method with respect to the training window size, based on the summed MAE over all forecasted time points. Based on this selection we considered two scenarios, “TDX vs EDD” and “TDX vs static”. For each time point within the forecasting time window, the absolute deviations of the method’s predicted density from the true/baseline density were computed at the same 200 equally spaced points within the domain of X. These error distributions were then tested with the two-sided Wilcoxon signed-rank test in order to determine whether their differences are significant at a significance level of 0.01.
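The per-time-point comparison can be sketched as follows; the evaluation grid and the two forecast vectors below are illustrative stand-ins for the actual model outputs, not the models themselves:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)          # 200 equally spaced evaluation points

# Stand-ins for the true/baseline density and two methods' forecasts at one time point
baseline = np.exp(-0.5 * ((xs - 0.5) / 0.1) ** 2)
pred_a = baseline + rng.normal(0.0, 0.01, size=xs.shape)   # hypothetical forecast A
pred_b = baseline + rng.normal(0.0, 0.03, size=xs.shape)   # hypothetical forecast B

# Absolute deviations from the baseline density, evaluated at the same points
err_a = np.abs(pred_a - baseline)
err_b = np.abs(pred_b - baseline)

# Paired two-sided Wilcoxon signed-rank test on the two error distributions
stat, p_value = wilcoxon(err_a, err_b, alternative="two-sided")
significant = p_value < 0.01             # significance level used in the paper
```

Repeating this test once per forecasted time point yields the per-time-point significance pattern reported below.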
These results show that the proposed method performs significantly better than EDD on all artificial data sets for all forecasted time points. Compared to the static model, it also scores a consistently lower MAE on weight-drift and mean-drift, with the exception of the very first time point on mean-drift. These results are significant on both data sets over all time points except for the first three on mean-drift. On the sigma-change data TDX shows a consistently lower MAE than the static model, but the differences are not significant at any of these time points. Finally, although the MAE of TDX’s predictions is only slightly larger than that of the static model for all time points on static-skewnormals, the differences are significant for all forecasted time points.
In summary, the proposed method manages to capture the drift patterns on weight-drift, mean-drift and sigma-change well and performs better than the static model. On the static-skewnormals data set the TDX models consistently have a higher error than the static model, showing that on an entirely static data set the static model is hard to surpass.
4.3.2 Lending club data
The lending club data sets are the first set of real-world data sets used in the experiments. Since the data contains 10 years’ worth of monthly data, the time variable is measured in months on these data sets, to give a better understanding of the presented time frames.
As the first of these data sets, the revolving line utilisation rate (short ‘revol_util’) is considered. A look at the evolution of the baseline density of this data set in Fig. 2 (second row, first from left) shows an interesting, non-monotonous drift pattern. A peak on the left edge of the domain of X diminishes as the density shifts to form a new peak around \(X\in [0.06,0.08]\), which then shifts left towards \(X\in [0.03,0.06]\). The results in Fig. 6 show that on the revol_util data set TDX performs well with all three window sizes, resulting in very similar errors as well as an almost identical, slight increase in error over time. Throughout the entire forecasting period all three TDX models achieve lower errors than the static models, while EDD scores an enormous error because the bandwidth selected by the grid search is unfit for this later segment of the data set. The evolution of the density on this data set, as shown in Fig. 2, provides a hint as to why the model selection resulted in a poor choice of bandwidth: the density has a very different shape at the earlier time points. This indicates that EDD’s parameters might be very sensitive to distributional changes. Looking at the predicted density as well as the baseline density in Fig. 7, one notices that even with a latency of 12 months the TDX model manages to anticipate the change in the density, while the forecast of the static model still reflects the density at the end of the training window.
A more volatile and mobile drift can be observed on the interest rate data set (short ‘int_rate’) in Fig. 2 (second row, second from left). Multiple smaller movements in the density make for a difficult, non-monotonous pattern. Figure 8 shows that within the first 7 months of latency the TDX models with 12 and 24 months of training data reach the lowest and second-lowest error, respectively. After this point the static model with a 12-month window surpasses them, remaining the lowest-error model for the rest of the forecast period. For the TDX model with a 12-month training window, a quick increase in error can be observed for latencies \(> 8\) months, which is likely a result of the non-monotonous and often reversing trends within the data: TDX continues the drift pattern learned within the training window, but the drift pattern of the data changed, and possibly even reversed.
Figure 9 shows the baseline and predicted density on the int_rate data with 5 months of latency. It is noticeable that the baseline density shows many more peaks and valleys than on the revol_util data set, which none of the methods seems to capture entirely. Inspection of the corresponding empirical CDF in Fig. 10 shows that while the baseline deviates from the ECDF in places, it remains reasonably close to the empirical distribution.
To summarise the results on the lending club data: TDX provides good forecasts on the two data sets whose drift pattern continues beyond the training window. If the drift does not continue, or is not present at all, the static version of the model provides better estimates than the extrapolating one, which is to be expected.
4.3.3 Pollution data
The second series of real-world data sets is the Skopje pollution data. These data sets are an interesting addition to the previously presented ones, as they not only show non-monotonous drift patterns, but these patterns are to some extent repeating. This proves particularly challenging for all models in the experiments, since none of them is equipped to adjust to such repeating patterns. An example of this can be seen in Fig. 12a, which shows the error on the CO data, while Fig. 12b shows the development of the baseline density within the test time frame. From the latter figure one can observe how the density around \(x=1.5\) increases, then decreases below its initial level, then increases once more, before finally decreasing as the majority of the density shifts to the right. Matching this, an increase in error can be observed for all models around latencies of 0.03 and 0.13, right when the two major increases in density occur. Due to the back-and-forth of this development, the TDX models struggle to anticipate this change, so the static model performs better overall by not trying to anticipate the movement of the drift. Nonetheless, the TDX models score only a slightly higher error for most of the forecasting time window and even perform better at the end. The test results in Fig. 5 confirm this proximity, showing that the differences between TDX and static are significant for only 14 of 25 time points.
4.4 Parameter sensitivity analysis
Since the proposed method has four hyperparameters, we discuss how sensitive it is to changes in these parameters. To this end a series of experiments was conducted in the time interval \(t\in [0.3, 0.45]\) on the int_rate and mean-drift data sets. In the first phase of these experiments, the number of basis functions M and the bandwidth h were investigated by spanning a parameter grid of 30 bandwidth values and 9 different numbers of bases, while keeping the other two parameters fixed at \(R=2\) and \(\lambda =1\). To compare the resulting models, the MAE between the model’s prediction and the true/baseline density at a latency of 0.05 was used. Figure 13a shows the MAE for different h and M values on the int_rate data set. While the extreme values for small M and h make the surface difficult to read, a region of lower error is visible for \(h\le 0.2\).
The surface in Fig. 13b shows a clearer picture for the results on the mean-drift data set. There is a region of lower error for M-values larger than 6 combined with h-values under 1, while very small h-values combined with small M-values result in higher error, reaching a maximum at the far corner of the surface at \(M=4\) and \(h=0.25\). In this case the combination of few bases and a small bandwidth strongly impacts the model performance. For M-values greater than 6 the error increases quite similarly with increasing h, eventually reaching a plateau at \(h=3\), where the bandwidth is so large that the density estimate becomes so smooth that the influence of M is no longer visible. As is also the case for static density estimation, the choice of the M and h parameters depends strongly on the data, so they are best tuned on available data.
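Such a grid evaluation can be illustrated with a simplified static variant of the basis-expansion estimator: Gaussian bases at equally spaced centres with a common bandwidth h, where only the mixture weights are fitted by a few EM-style updates. The data, the grid values, and the fitting routine below are simplifying assumptions for the sketch, not the implementation evaluated here:

```python
import numpy as np
from scipy.stats import norm

def fit_static_weights(x, centers, h, n_iter=50):
    """Fit the weights of a fixed Gaussian basis expansion by EM-style updates."""
    phi = norm.pdf(x[:, None], loc=centers[None, :], scale=h)  # (n, M) basis values
    w = np.full(len(centers), 1.0 / len(centers))
    for _ in range(n_iter):
        resp = phi * w                           # responsibilities (unnormalised)
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                    # maximum-likelihood weight update
    return w

rng = np.random.default_rng(1)
x_train = rng.normal(0.0, 1.0, size=1000)        # illustrative stand-in data
xs = np.linspace(-4, 4, 200)
true_density = norm.pdf(xs)                      # known ground truth for the MAE

results = {}
for M in [4, 6, 8, 10]:                          # numbers of basis functions
    centers = np.linspace(x_train.min(), x_train.max(), M)
    for h in [0.25, 0.5, 1.0, 2.0]:              # bandwidth values
        w = fit_static_weights(x_train, centers, h)
        est = norm.pdf(xs[:, None], loc=centers[None, :], scale=h) @ w
        results[(M, h)] = np.mean(np.abs(est - true_density))

best_M, best_h = min(results, key=results.get)   # grid point with lowest MAE
```

In the actual experiments the MAE is of course computed between the extrapolated density and the true/baseline density at the forecast latency rather than on training data.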
During all the experiments above the training window was sufficiently large, so the number of training samples was of no concern. Figure 15 shows how the error behaves on mean-drift for different numbers of samples used within the training window, employing the best parameters from Figs. 13b and 14b. In these experiments only every nth instance was used, with \(n \in \{124,62,31,16,8,4,2,1\}\). This results in sample sizes ranging from 31 to 3744, the latter being all instances with \(t\in [0.3, 0.45]\). The results show that while using all of the samples resulted in the smallest error, using only around 500 instances achieved a comparable result.
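The resulting sample sizes follow directly from the thinning scheme (a minimal check, assuming the window holds the stated 3744 instances):

```python
n_total = 3744                         # all instances with t in [0.3, 0.45]
steps = [124, 62, 31, 16, 8, 4, 2, 1]  # keep every n-th instance
sizes = [len(range(0, n_total, n)) for n in steps]
# sizes[0] == 31 (every 124th instance), sizes[-1] == 3744 (all instances)
```

The “around 500 instances” mentioned above corresponds to \(n=8\), i.e. 468 samples.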
4.5 Execution times
5 Conclusion
In this article we presented a novel approach called temporal density extrapolation (TDX) with the goal of predicting the probability density of a univariate feature in a data stream. TDX models the density as an expansion of Gaussian density basis functions, whose weights are modelled as functions of time to account for drift in the data. Fitting these time-dependent weighting functions is approached by modelling the weights of the basis expansion as compositional data. For this purpose the isometric logratio transformation is used to ensure compliance with the properties of the basis expansion weights. This approach makes it possible to extrapolate the density model to time points outside the available training window, while accounting for the changes over time that are often encountered in data streams.
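The core mechanism can be sketched as follows. The isometric logratio (ilr) transform maps the simplex-constrained basis weights to an unconstrained coordinate space, where each coordinate is modelled as a function of time and extrapolated; back-transforming yields valid weights at any requested time point. The Helmert-type basis, the toy weight trajectory, and the linear (rather than order-R polynomial) time model are illustrative simplifications, not the paper's exact parameterisation:

```python
import numpy as np

def ilr_basis(M):
    """Orthonormal (Helmert-type) basis of the (M-1)-dim ilr coordinate space."""
    V = np.zeros((M, M - 1))
    for k in range(1, M):
        V[:k, k - 1] = 1.0 / k
        V[k, k - 1] = -1.0
        V[:, k - 1] *= np.sqrt(k / (k + 1.0))
    return V

def ilr(w, V):
    return V.T @ np.log(w)               # simplex -> unconstrained coordinates

def ilr_inv(z, V):
    x = np.exp(V @ z)
    return x / x.sum()                   # closure: back to the simplex

M = 4
V = ilr_basis(M)
times = np.linspace(0.0, 1.0, 20)

# Toy trajectory of slowly drifting mixture weights over the training window
W = np.array([[0.4 - 0.2 * t, 0.3, 0.2 + 0.2 * t, 0.1] for t in times])
Z = np.array([ilr(w, V) for w in W])     # ilr coordinates per time point

# Fit each ilr coordinate as a linear function of time, then extrapolate
coef = np.polyfit(times, Z, deg=1)       # rows: slope and intercept per coordinate
z_future = coef[0] * 1.5 + coef[1]       # extrapolate to t = 1.5
w_future = ilr_inv(z_future, V)          # valid weights: positive, summing to 1
```

The back-transformed weights are guaranteed to be positive and sum to one, which is what makes extrapolating in the ilr space safe regardless of how far the weights drift.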
The evaluation shows that TDX manages to capture monotonous drift patterns, like changes in the component means or weights of a mixture distribution, better than a competing method (EDD) or a static version of the basis expansion model. Furthermore, the model also performs well on the two lending club data sets that exhibit very noticeable drift (revol_util and int_rate). All these data sets have in common that the drift in the data is both continuous and unidirectional. Data sets that show little to no drift, however, like the static artificial data set or the remaining lending club data sets, prove challenging for TDX. In these cases the static version of the model, which fits the weights of the basis expansion in a non-time-adaptive fashion, performs better, as the data generating process aligns with its stationarity assumption.
Furthermore, the results on the pollution data sets show that the method is currently not equipped to handle drift patterns that exhibit seasonality. As the model is designed to handle monotonous drift this is not surprising, but it points to another avenue for future work.
The analysis of the model’s sensitivity with regard to its hyperparameters has shown that both the number of basis functions M and their bandwidth h have to be tuned on the data, as is also necessary for the static density estimation model. For both extrapolation-specific parameters, namely the order R of the polynomial used to model the basis weights and the regularisation strength \(\lambda \), reasonable defaults exist. This reduces the effort of configuring TDX.
In summary, TDX handles its intended application, i.e., data with monotonous drift, better than the only other comparable density forecasting approach to date (EDD) and provides reliable density forecasts on such data sets. The next step in the model’s development will be the use of TDX’s density forecasts for probabilistic classification in data streams. Furthermore, we aim to extend the method’s capabilities by improving the performance on data sets with little to no drift.
Acknowledgements
We thank the Austrian National Bank for supporting our research Project No. 17028 as part of the OeNB Anniversary Fund. We thank Christoph Lampert for providing his implementation of the EDD approach and Utrecht University for providing the Gemini cluster for computations.
References
Aggarwal CC (2005) On change diagnosis in evolving data streams. IEEE Trans Knowl Data Eng 17(5):587–600
Aitchison J (1982) The statistical analysis of compositional data. J R Stat Soc Ser B Methodol 44:139–177
Arroyo J, Maté C (2009) Forecasting histogram time series with k-nearest neighbours methods. Int J Forecast 25(1):192–207
Bessa RJ, Miranda V, Botterud A, Wang J, Constantinescu EM (2012) Time adaptive conditional kernel density estimation for wind power forecasting. IEEE Trans Sustain Energ 3(4):660–669
Bikcora C, Verheijen L, Weiland S (2015) Semiparametric density forecasting of electricity load for smart charging of electric vehicles. In: 2015 IEEE conference on control applications (CCA), IEEE, pp 1564–1570
Böttcher M, Höppner F, Spiliopoulou M (2008) On exploiting the power of time in data mining. ACM SIGKDD Explor Newsl 10(2):3–11
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc Ser B 26:211–243
Chacón JE, Duong T (2018) Multivariate kernel smoothing and its applications. CRC, Boca Raton
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):15
Courty N, Flamary R, Tuia D (2014) Domain adaptation with regularized optimal transport. In: Calders T, Esposito F, Hüllermeier E, Meo R (eds) Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML-PKDD 2014), Springer, Lecture Notes in Artificial Intelligence, vol 8724, pp 370–385
Dias S, Brito P (2015) Linear regression model with histogram-valued variables. Stat Anal Data Min 8(2):75–113. https://doi.org/10.1002/sam.11260
Dyer KB, Capo R, Polikar R (2014) COMPOSE: a semisupervised learning framework for initially labeled nonstationary streaming data (special issue on learning in nonstationary and dynamic environments). IEEE Trans Neural Netw Learn Syst 25(1):12–26
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Fan W, Bifet A (2013) Mining big data: current status, and forecast to the future. SIGKDD Explor Newsl 14(2):1–5. https://doi.org/10.1145/2481244.2481246
Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):1–44
Gu W, He J (2015) A forecasting model based on time-varying probability density. In: Li M, Zhang Q, Zhang R, Shi X (eds) Proceedings of 2014 1st international conference on industrial economics and industrial security. Springer, Berlin, pp 519–525. https://doi.org/10.1007/9783662440858_75
Harvey A, Oryshchenko V (2012) Kernel density estimation for time series data. Int J Forecast 28(1):3–14
He Y, Li H (2018) Probability density forecasting of wind power using quantile regression neural network and kernel density estimation. Energy Convers Manag 164:374–384
Hofer V (2015) Adapting a classification rule to local and global shift when only unlabelled data are available. Eur J Oper Res 243(1):177–189
Hofer V, Krempl G (2013) Drift mining in data: a framework for addressing drift in classification. Comput Stat Data Anal 57(1):377–391
Kelly MG, Hand DJ, Adams NM (1999) The impact of changing populations on classifier performance. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 367–371. https://doi.org/10.1145/312129.312285
Krempl G (2011) The algorithm APT to classify in concurrence of latency and drift. In: Gama J, Bradley E, Hollmén J (eds) Advances in intelligent data analysis X. Lecture notes in computer science, vol 7014. Springer, Berlin, pp 222–233
Krempl G (2015) Temporal density extrapolation. In: Douzal-Chouakria A, Vilar JA, Marteau PF, Maharaj A, Alonso AM, Otranto E, Nicolae MI (eds) Proceedings of the 1st international workshop on advanced analytics and learning on temporal data (AALTD) co-located with ECML PKDD 2015, CEUR workshop proceedings, vol 1425. http://ceurws.org/Vol1425/paper12.pdf. Accessed 6 June 2019
Krempl G, Hofer V (2011) Classification in presence of drift and latency. In: Spiliopoulou M, Wang H, Cook D, Pei J, Wang W, Zaïane O, Wu X (eds) Proceedings of the 11th IEEE international conference on data mining workshops (ICDMW 2011), IEEE. https://doi.org/10.1109/ICDMW.2011.47
Krempl G, Zliobaitė I, Brzeziński D, Hüllermeier E, Last M, Lemaire V, Noack T, Shaker A, Sievi S, Spiliopoulou M, Stefanowski J (2014) Open challenges for data stream mining research (special issue on big data). SIGKDD Explor 16(1):1–10. https://doi.org/10.1145/2674026.2674028
Lampert CH (2015) Predicting the future behavior of a time-varying probability distribution. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 942–950. http://pub.ist.ac.at/~chl/erc/papers/lampertcvpr2015.pdf. Accessed 6 June 2019
Lawlor SF, Rabbat MG (2016) Estimation of time-varying mixture models: an application to traffic estimation. In: Proceedings of the IEEE statistical signal processing workshop, pp 1–5
Marrs G, Hickey R, Black M (2010) The impact of latency on online classification learning with concept drift. In: Bi Y, Williams MA (eds) Knowledge science, engineering and management. Lecture notes in computer science, vol 6291. Springer, Berlin, pp 459–469
Mokilane P, Galpin J, Sarma Yadavalli V, Debba P, Koen R, Sibiya S (2018) Density forecasting for long-term electricity demand in South Africa using quantile regression. S Afr J Econ Manag Sci 21(1):1–14
Noirhomme-Fraiture M, Brito P (2011) Far beyond the classical data models: symbolic data analysis. Stat Anal Data Min ASA Data Sci J 4(2):157–170
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076
Plasse J, Adams N (2016) Handling delayed labels in temporally evolving data streams. In: Big Data, IEEE, pp 2416–2424
Reinsel D, Gantz J, Rydning J (2017) Data age 2025: the evolution of data to life-critical. Technical report, IDC. https://www.seagate.com/files/wwwcontent/ourstory/trends/files/SeagateWPDataAge2025March2017.pdf. Accessed 6 June 2019
Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann Math Stat 27(3):832–837
Sadik S, Gruenwald L (2014) Research issues in outlier detection for data streams. ACM SIGKDD Explor Newsl 15(1):33–40
Scott DW (2015) Multivariate density estimation: theory, practice, and visualization, 2nd edn. Wiley, Hoboken. https://doi.org/10.1002/9781118575574.fmatter
Silverman BW (1986) Density estimation for statistics and data analysis. Monographs on statistics and applied probability. Chapman and Hall. http://nedwww.ipac.caltech.edu/level5/March02/Silverman/paper.pdf. Accessed 6 June 2019
Souza VM, Silva DF, Gama J, Batista GE (2015) Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In: Proceedings of the 2015 SIAM international conference on data mining. SIAM, pp 873–881
Sturges HA (1926) The choice of a class interval. J Am Stat Assoc 21:65–66. https://doi.org/10.1080/01621459.1926.10502161
Tasche D (2014) Exact fit of simple finite mixture models. J Risk Financ Manag 7:150–164
Tay AS (2015) A brief survey of density forecasting in macroeconomics. Macroeconomic Review, pp 92–97. Research Collection School of Economics. https://ink.library.smu.edu.sg/soe_research/1901
Tay AS, Wallis KF (2000) Density forecasting: a survey. Companion Econ Forecast 19:45–68
Tran DH, Gaber MM, Sattler KU (2014) Change detection in streaming data in the era of big data: models and issues. ACM SIGKDD Explor Newsl 16(1):30–38
Venables WN, Ripley BD (2002) Modern applied statistics with S-PLUS. Springer, Berlin
Webb G, Lee LK, Goethals B, Petitjean F (2017) Understanding concept drift. ArXiv preprint arXiv:1704.00362v1
Webb GI, Hyde R, Cao H, Nguyen HL, Petitjean F (2016) Characterizing concept drift. Data Min Knowl Discov 30(4):964–994
Whittle P (1958) On the smoothing of probability density functions. J R Stat Soc Ser B Methodol 20:334–343
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden context. Mach Learn 23(1):69–101
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.