Abstract
We develop the setting of sequential prediction based on shifting experts and on a “smooth” version of the method of specialized experts. To aggregate expert predictions, we use the AdaHedge algorithm, a version of the Hedge algorithm with an adaptive learning rate, and extend it by the meta-algorithm Fixed Share. In this way, we combine the advantages of both algorithms: (1) we bound the shifting regret, a more refined performance measure; (2) the regret bounds remain valid in the case of signed unbounded losses of the experts. Also, (3) we incorporate into this scheme a “smooth” version of the method of specialized experts, which allows us to make more flexible and accurate predictions. All results are obtained in the adversarial setting: no assumptions are made about the nature of the data source. We present results of numerical experiments on short-term forecasting of electricity consumption based on real data.
1 Introduction
We consider sequential prediction in the general framework of decision theoretic online learning or the Hedge setting by Freund and Schapire (1997), which is a variant of prediction with expert advice, see e.g. Littlestone and Warmuth (1994), Freund and Schapire (1997), Vovk (1990, 1998) and Cesa-Bianchi and Lugosi (2006).
The aggregating algorithm updates the expert weights at the end of each trial using the losses suffered by the experts in the past. In the classical setting (Freund and Schapire 1997; Vovk 1990), the weight of an expert i is updated by exponential weighting with a constant or variable learning rate \(\eta \):
where \(l_{i,t}\) is the loss suffered by the expert i at step t.
The goal of the algorithm is to design weight updates that guarantee that the loss of the aggregating algorithm is never much larger than the loss of the best expert or the best convex combination of the losses of the experts.
So, here the best expert or a convex combination of experts serves as a comparator. By a comparison vector we mean a vector \(\mathbf{q}=(q_1,\dots ,q_N)\) such that \(q_1+\cdots +q_N=1\) and all its components are nonnegative. We compare the cumulative loss of the aggregating algorithm and a convex combination of the losses \(\sum \limits _{t=1}^T (\mathbf{q}\cdot \mathbf{l}_t)\), where \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\) is a vector containing the losses of the experts at time t.
A more challenging goal is to learn well when the comparator \(\mathbf{q}\) changes over time, i.e. the algorithm competes with the cumulative sum \(\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)\), where the comparison vector \(\mathbf{q}_t\) changes over time. An important special case is when the \(\mathbf{q}_t\) are unit vectors: then the sequence of trials is partitioned into segments. In each segment the loss of the algorithm is compared to the loss of a particular expert, and this expert changes at the beginning of a new segment. The goal of the aggregation algorithm is to do almost as well as the sum of losses of the experts forming the best partition. Algorithms and bounds for shifting comparators were presented by Herbster and Warmuth (1998). This method, called Fixed Share, was generalized by Bousquet and Warmuth (2002) to the method of Mixing Past Posteriors (MPP), in which arbitrary mixing schemes are considered. In what follows, MPP mixing schemes will be used in our algorithms.
Most papers in the prediction with expert advice setting either consider uniformly bounded losses or assume the existence of a specific loss function (see Vovk 1990; Cesa-Bianchi and Lugosi 2006). But in some practical applications, this assumption is too restrictive. We allow losses at any step to be unbounded and signed. The notion of a specific loss function is not used.
AdaHedge presented by de Rooij et al. (2014) is among a few algorithms that do not have similar restrictions. This algorithm is a version of the classical Hedge algorithm of Freund and Schapire (1997) and is a refinement of the Cesa-Bianchi and Lugosi (2006) algorithm. AdaHedge is completely parameterless and tunes the learning rate \(\eta \) in terms of a direct measure of past performance.
In de Rooij et al. (2014), an upper bound for the regret of this algorithm is presented which is free of boundedness assumptions on the losses of the experts:
where \(L^*_T\) is the loss of the best expert, for other notations see Table 1 below.
In the case where the losses of the experts are uniformly bounded, the upper bound (2) takes the form \(O(\sqrt{T\ln N})\).
We emphasize that the versions of the Fixed Share and MPP algorithms presented by Herbster and Warmuth (1998) and Bousquet and Warmuth (2002) use a constant learning rate, while AdaHedge uses an adaptive learning rate tuned online.
The first contribution of this paper is the ConfHedge-1 algorithm, which combines the advantages of both these algorithms: (1) we bound the shifting regret, a more refined performance measure; (2) the regret bounds are valid in the case of signed unbounded losses of the experts.
The application we consider below, sequential short-term (one-hour-ahead) forecasting of electricity consumption, takes place in a variant of the basic problem of prediction with expert advice called prediction with specialized (or sleeping) experts. At each round, only some of the experts output a prediction while the others are inactive. Each expert is expected to provide accurate forecasts mostly under given external conditions, which can be known beforehand. For instance, in the case of the prediction of electricity consumption, experts can be specialized to a season, to a temperature range, to working days or to public holidays, etc.
The method of specialized experts was first proposed by Freund et al. (1997) and further developed by Adamskiy et al. (2012), Chernov and Vovk (2009), Devaine et al. (2013), Kalnishkan et al. (2015). With this approach, at each step t, a set of specialized experts \(E_t\subseteq \{1,\dots , N\}\) is given. A specialized expert i issues its forecasts not at all steps \(t=1,2,\dots \), but only when \(i\in E_t\). At any step, the aggregating algorithm uses forecasts of only “active (non-sleeping)” experts.
The second contribution of this paper is that we have incorporated into ConfHedge-1 a smooth generalization of the method of specialized experts. At each time moment t, we complement the forecast of expert i with a confidence level, which is a real number \(p_{i,t}\in [0,1]\).
The setting of prediction with experts that report their confidences as a number in the interval [0, 1] was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007), Gaillard et al. (2011), Gaillard et al. (2014).
In particular, \(p_{i,t}=1\) means that the expert forecast is used in full, whereas in the case of \(p_{i,t}=0\) it is not taken into account at all (the expert sleeps). When \(0<p_{i,t}<1\), the expert’s forecast is partially taken into account. For example, with a gradual drop in temperature, the corresponding specialized expert gradually loses its ability to predict electricity consumption accurately. The dependence of \(p_{i,t}\) on the values of exogenous parameters can be predetermined by a specialist in the domain or can be constructed using regression analysis on historical data.
In Sect. 2, we present the ConfHedge-1 algorithm, a loss allocation algorithm adapted to the case where the losses of the experts can be signed and unbounded. This algorithm also takes into account the confidence levels of the expert predictions. In Sect. 3.2, the ConfHedge-2 variant of this algorithm is presented for the case when the experts make forecasts and their losses are computed using a convex loss function.
In Theorem 1 we present the upper bounds for the shifting regret of these algorithms. The proof of this theorem is given in Sect. A. Some details of the proof from de Rooij et al. (2014) are presented as supplementary material in Sect. B. All results are obtained in the adversarial setting, and no assumptions are made about the nature of the data source.
In Sect. 3.3, the techniques of confidence level selection and expert training are presented. We also present the results of numerical experiments on the short-term prediction of electricity consumption with the use of the proposed algorithms.
The approach that sets confidence levels for expert predictions of electricity consumption is more general than the approach of Devaine et al. (2013), which uses “sleeping” experts. In our numerical experiments, the aggregating algorithm with soft confidence levels outperforms other versions of aggregating algorithms, including those which use sleeping experts.
2 Online loss allocation algorithm
In this section we present an algorithm for the optimal online allocation of unbounded signed losses of the experts. In Sect. 3.2, a variant of this algorithm will be presented for the case when experts make forecasts and calculate their losses using a convex loss function.
We assume that at each step t, along with the losses \(l_{i,t}\) of the experts, their confidence levels are given as a vector \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\), where \(p_{i,t}\in [0,1]\) for \(1\le i\le N\). We assume that \(\Vert \mathbf{p}_t\Vert _1>0\) for all t.
We can interpret the number \(p_{i,t}\) as the algorithm’s internal probability of following the prediction of expert i. In this case, we define the auxiliary virtual loss of the expert as a random variable \(\tilde{l}_{i,t}\) that equals \(l_{i,t}\) with probability \(p_{i,t}\) and \(h_t\) with probability \(1-p_{i,t}\), where \(h_t\) is the loss of the aggregating algorithm. Denote by \(\hat{l}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{l}_{i,t}]=p_{i,t} l_{i,t}+(1-p_{i,t})h_t\) the expectation of the virtual loss of expert i with respect to the probability distribution \(\mathbf{p}_{i,t}=(p_{i,t},1-p_{i,t})\).
At any step t we use cumulative weights \(w_{i,t}\) of the experts \(1\le i\le N\) which were computed at the previous step. The algorithm loss is defined as \(h_t=\sum \limits _{i=1}^N w_{i,t}\hat{l}_{i,t}\).
These definitions are circular: the virtual losses are determined through the loss of the algorithm, and the latter is determined through the virtual losses. Nevertheless, all these quantities can be effectively calculated using the fixed-point method proposed by Chernov and Vovk (2009). We have
Canceling the identical terms on the left- and right-hand sides, we obtain an expression for calculating \(h_t\): \(h_t=\frac{\sum _{i=1}^N w_{i,t}p_{i,t}l_{i,t}}{\sum _{i=1}^N w_{i,t}p_{i,t}}\) (3).
After the value of \(h_t\) has been calculated by formula (3), the weights can be calculated as
using the known value \(h_t\).Footnote 1 Also, we compute the mixloss \( m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t\hat{l}_{i,t}} \) and the mixability gap \(\delta _t=h_t-m_t\), which are used in the construction of the algorithm.
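As a quick numerical check, the closed-form value of \(h_t\) is indeed the fixed point of the virtual-loss definition. The following Python sketch uses arbitrary illustrative weights, confidences, and signed losses:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
w = rng.dirichlet(np.ones(N))    # current expert weights, sum to 1
p = rng.uniform(0.1, 1.0, N)     # confidence levels of the experts
l = rng.normal(size=N)           # signed, unbounded expert losses

# closed-form solution of the fixed-point equation for h_t
h = np.dot(w * p, l) / np.dot(w, p)

# virtual losses: hat l_i = p_i * l_i + (1 - p_i) * h
l_hat = p * l + (1 - p) * h

# h is reproduced as the w-weighted average of the virtual losses
assert np.isclose(np.dot(w, l_hat), h)
```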
In the MPP method of Bousquet and Warmuth (2002), a mixing scheme is defined by a vector \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots , \beta ^{t+1}_t)\), where \(\sum \limits _{s=0}^t\beta ^{t+1}_s=1\) and \(\beta ^{t+1}_s\ge 0\) for \(0\le s\le t\).
In what follows, the vector \(\mathbf{w}^\mu _t=(w^\mu _{1,t},\dots ,w^\mu _{N,t})\) denotes the normalized expert weights at step t. The corresponding posterior probability distribution \(\mathbf{w}_{t+1}=(w_{1,t+1},\dots ,w_{N,t+1})\) for step \(t+1\) is defined as the convex combination \(\mathbf{w}_{t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s \mathbf{w}^\mu _s\) with weights \(\beta ^{t+1}_s\), \(0\le s\le t\), where \(\mathbf{w}^\mu _s=(w^\mu _{1,s},\dots ,w^\mu _{N,s})\).
The vector \(\beta ^{t + 1}\) defines the weights with which the past distributions of the experts are mixed; it can be reset at each step t.
The ConfHedge-1 algorithm for mixing the posterior distributions of the experts is given below. Unlike standard exponential mixing algorithms, this algorithm uses not only the currently accumulated weights of the experts, but also mixes them with all the weights accumulated at past steps.
ConfHedge-1

Put \(w_{i,1}=w^\mu _{i,0}=\frac{1}{N}\) for \(i=1,\dots , N\), \(\varDelta _0=0\), \(\eta _1=\infty \).

FOR \(t=1,\dots ,T\)

Receive confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\) of the experts \(1\le i\le N\), where \(\Vert \mathbf{p}_t\Vert _1>0\).

Predict with the distribution \(\mathbf{w}^*_t=(w^*_{1,t},\dots ,w^*_{N,t})\), where \(w^*_{i,t}=\frac{w_{i,t}p_{i,t}}{\sum _{i=1}^N w_{i,t}p_{i,t}}\) for \(1\le i\le N\).

Receive a vector \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\) containing the losses of the experts.

Compute the loss \(h_t=(\mathbf{l}_t\cdot \mathbf{w}^*_t)\) of the algorithm.

Update the weights and the learning parameter in three stages:

Loss Update. Define \(w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-h_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-h_t)}}\) for \(1\le i\le N\).

Mixing Update. Choose a mixing scheme \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)\) and define the future weights of the experts \(w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}\) for \(1\le i\le N\).

Learning Parameter Update. Define the mixloss \(m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})h_t)}\). Let \(\delta _t=h_t-m_t\) and \(\varDelta _t=\varDelta _{t-1}+\delta _t\). Define the learning rate \(\eta _{t+1}=\ln ^*N/\varDelta _t\) for use at the next step \(t+1\).

ENDFOR
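The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: we read \(\ln ^*N\) as \(\ln N\), use the Fixed Share mixing scheme of Example 1 with \(\alpha _t=1/t\), and handle \(\eta =\infty \) as the standard limiting case of exponential weighting.

```python
import numpy as np

def conf_hedge1(losses, confidences):
    """Sketch of ConfHedge-1.  `losses` and `confidences` are (T, N)
    arrays; returns the per-step algorithm losses h_t.  Assumptions:
    ln* N is read as ln N; mixing is Fixed Share with alpha_t = 1/t."""
    T, N = losses.shape
    w = np.full(N, 1.0 / N)       # current weights w_t
    prior = np.full(N, 1.0 / N)   # w^mu_0, reused by Fixed Share
    Delta, eta = 0.0, np.inf      # Delta_0 = 0, eta_1 = infinity
    h = np.zeros(T)
    for t in range(T):
        l, p = losses[t], confidences[t]
        # fixed point of the virtual-loss definition (formula (3))
        h[t] = np.dot(w * p, l) / np.dot(w, p)
        l_hat = p * l + (1 - p) * h[t]          # virtual losses
        # Loss Update: exponential weighting at rate eta
        if np.isinf(eta):
            mask = l_hat == l_hat.min()         # limiting case
            w_mu = mask / mask.sum()
            m = l_hat.min()                     # mixloss as eta -> inf
        else:
            v = w * np.exp(-eta * (l_hat - l_hat.min()))
            w_mu = v / v.sum()
            m = l_hat.min() - np.log(v.sum()) / eta
        # Mixing Update: Fixed Share, beta_0 = alpha, beta_t = 1 - alpha
        alpha = 1.0 / (t + 2)                   # alpha for the next step
        w = (1 - alpha) * w_mu + alpha * prior
        # Learning Parameter Update
        Delta += h[t] - m                       # mixability gap >= 0
        eta = np.log(N) / Delta if Delta > 0 else np.inf
    return h
```

Since each \(h_t\) is a convex combination of the expert losses at step t, the algorithm's loss always lies between the smallest and largest expert loss of that step.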
We have \(m_t\le h_t\) by convexity of the exponential function; hence \(\delta _t\ge 0\) and \(\varDelta _t\le \varDelta _{t+1}\) for all t.
We will use the following mixing schemes by Bousquet and Warmuth (2002):
Example 1
A version of Fixed Share by Herbster and Warmuth (1998) (see also Cesa-Bianchi and Lugosi 2006; Vovk 1999) with a variable learning rate is defined by the following mixing scheme. Let a sequence \(1\ge \alpha _1\ge \alpha _2\ge \dots >0\) of parameters be given. Define \(\beta ^{t+1}_t=1-\alpha _{t+1}\) and \(\beta ^{t+1}_0=\alpha _{t+1}\) (\(\beta ^{t+1}_s=0\) for \(0<s<t\)). The corresponding prediction for step \(t+1\) is defined as \(w_{i,t+1}=(1-\alpha _{t+1})w^\mu _{i,t}+\frac{\alpha _{t+1}}{N}\) for all \(1\le i\le N\). In what follows we put \(\alpha _t=1/t\) for all t.
Example 2
Uniform Past by Bousquet and Warmuth (2002) with a variable learning rate. Put \(\beta ^{t+1}_t=1-\alpha _{t+1}\) and \(\beta ^{t+1}_s=\frac{\alpha _{t+1}}{t}\) for \(0\le s<t\). The corresponding prediction for step \(t+1\) is defined as \(w_{i,t+1}=(1-\alpha _{t+1})w^\mu _{i,t}+\frac{\alpha _{t+1}}{t}\sum \limits _{s=0}^{t-1}w^\mu _{i,s}\) for all i and t.
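The two mixing schemes can be sketched as follows, where `history[0]` is the uniform prior \(\mathbf{w}^\mu _0\) and `history[-1]` is the latest posterior \(\mathbf{w}^\mu _t\) (the list-of-arrays representation is our illustrative choice):

```python
import numpy as np

def fixed_share_mix(history, alpha):
    """Example 1: beta_t = 1 - alpha, beta_0 = alpha.
    Mix the latest posterior with the uniform prior w^mu_0."""
    return (1 - alpha) * history[-1] + alpha * history[0]

def uniform_past_mix(history, alpha):
    """Example 2: beta_t = 1 - alpha, beta_s = alpha / t for s < t.
    Mix the latest posterior with the average of all earlier ones."""
    past = np.mean(history[:-1], axis=0)
    return (1 - alpha) * history[-1] + alpha * past
```

Both updates return a probability vector whenever their inputs are probability vectors, since the mixing coefficients \(\beta ^{t+1}_s\) sum to one.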
Bousquet and Warmuth (2002) considered the notion of shifting regret with respect to a sequence \(\mathbf{q}_1,\mathbf{q}_2,\dots ,\mathbf{q}_T\) of comparison vectors: \(R_T=H_T-\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)\).Footnote 2
In the presence of confidence values, we consider the corresponding confidence shifting regret \(R^{(\mathbf{q})}_T=H_T-L^{(\mathbf{q})}_T\), where \(L^{(\mathbf{q})}_T=\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \hat{\mathbf{l}}_t)\), \(\hat{\mathbf{l}}_t=(\hat{l}_{1,t},\dots ,\hat{l}_{N,t})\), and \(\mathbf{q}_t=(q_{1,t},\dots ,q_{N,t})\) is a comparison vector at step t.
By definition this regret can be represented as
If \(p_{i,t}=1\) for all i and t then \(R^{(\mathbf{q})}_T=R_T\).
The quantity \(L^{(\mathbf{q})}_T\) depends on \(h_t\). To avoid this dependence, we will consider its lower and upper bounds: \(L^{(\mathbf{q}-)}_T\le L^{(\mathbf{q})}_T\le L^{(\mathbf{q}+)}_T\), where
Assume that the losses of the experts are bounded, for example, \(l_{i,t}\in [0,1]\) for all i and t. Using the techniques of Sect. A for \(\eta _t\sim \sqrt{\frac{\ln ^*N}{t}}\), we can prove that
where k is the number of switches of comparison vectors \(\mathbf{q}_t\) on the time interval \(1\le t\le T\).Footnote 3
Our goal is to obtain a similar bound in the absence of boundedness assumptions on the expert losses. Let the mixing scheme of Example 1 be used.
The following theorem presents the upper bounds for the confidence shifting regret in the case where no assumptions are made about the boundedness of the losses of the experts.
Theorem 1
For any T and for any sequence \(\mathbf{q}_1,\dots ,\mathbf{q}_T\) of comparison vectors,
where \(\gamma _{k,T}=(k+2)(\ln T+1)\) and k is the number of switches of the comparison vectors on the time interval \(1\le t\le T\).
The bound (7) is an analogue for the shifting experts of the bound from Cesa-Bianchi et al. (2007) and the bound (8) is an analogue of the bound (16) of Theorem 8 from de Rooij et al. (2014). Proof of Theorem 1 is given in Sects. A and B.
A disadvantage of the bounds (8) and (9) below is the presence of a term that depends quadratically on the number k of switches. Whether such a dependence is necessary is an open question. However, this term does not depend on the loss of the algorithm and has only a slowly growing multiplicative factor \(O(\ln ^2 T)\). Corollary 1 below shows that in some special cases this dependence can be eliminated.
The bound (8) of Theorem 1 can be simplified in different ways:
Corollary 1
For any T and for any sequence \(\mathbf{q}_1,\dots ,\mathbf{q}_T\) of comparison vectors,
The bound (10) depends linearly on the number of switches (for the proof see Sect. B). The bound (11) follows from (10). If the losses of the experts are uniformly bounded, then the bound (11) is of the same order as the bound (6).
An important special case of Theorem 1 is when the comparison vectors \(\mathbf{q}_t=\mathbf{e}_{i_t}\) are unit vectors and \(p_{i,t}\in \{0,1\}\), i.e., the specialists case is considered for composite experts \(i_1,\dots ,i_T\). Then the confidence shifting regret equals \(R^{(\mathbf{q})}_T=\sum \limits _{t:p_{i_t,t}=1}^T (h_t-l_{i_t,t})\) and the corresponding differences in the right-hand side of inequality (8) are \(L^+_T-L^{(\mathbf{q}-)}_T=\sum \limits _{t:p_{i_t,t}=1}^T (l^+_t-l_{i_t,t})+ \sum \limits _{t:p_{i_t,t}=0}^T s_t\) and \(L^{(\mathbf{q}+)}_T-L^-_T=\sum \limits _{t:p_{i_t,t}=1}^T (l_{i_t,t}-l^-_t) + \sum \limits _{t:p_{i_t,t}=0}^T s_t\).
The bound (10) is important if the algorithm is to be used for a scenario in which we are provided with a sequence of gain vectors \(\mathbf{g}_t\) rather than losses: we can transform these gains into losses using \(\mathbf{l}_t=-\mathbf{g}_t\), and then run the algorithm. Assume that \(p_{i,t}=1\) for all i and t. The bound then implies that we incur small regret with respect to a composite expert if it has very small cumulative gain relative to the minimum gain (see also de Rooij et al. 2014).
Similar bounds for the mixing scheme of Example 2 can also be obtained, with \(\gamma _{k,T}=(2k+3)\ln T+(k+2)\) (see Sect. A).
3 Numerical experiments
Section 3.1 presents the results of applying ConfHedge-1 to synthetic data. In Sect. 3.3, the results of the short-term prediction of electricity consumption are presented. In these experiments we use the ConfHedge-2 algorithm, a variant of the previous algorithm adapted to the case where the experts output numerical forecasts. The scheme of this algorithm is given in Sect. 3.2.
3.1 Unbounded signed losses
The first experiment was performed on synthetic data, where the one-step losses of the experts are signed, unbounded, and perturbed by additive N(0, 1) noise. The confidence levels of all experts are always equal to one. Figure 1a shows the mean values of these one-step expert losses. Figure 1b shows the cumulative losses of three individual experts and the cumulative losses of AdaHedge and ConfHedge-1. These experiments show that ConfHedge-1 is not inferior to AdaHedge and, after some time, even outperforms it.
3.2 Aggregation of expert forecasts
In this section we suppose that the losses of the experts are computed using a loss function \(\lambda (\omega ,\gamma )\) that is convex in \(\gamma \), where \(\omega \) is an outcome and \(\gamma \) is a forecast. Outcomes can belong to an arbitrary set; forecasts form a linear space.Footnote 4
Suppose that at any step t the expert forecasts \(\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})\) and their confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\) are given, where \(p_{i,t}\in [0,1]\) for all \(1\le i\le N\). Define the auxiliary virtual expert forecast \(\tilde{c}_{i,t}\), which equals \(c_{i,t}\) with probability \(p_{i,t}\) and \(\gamma _t\) with probability \(1-p_{i,t}\), where \(\gamma _t\) is the forecast of the aggregating algorithm. Then the expectation of the virtual forecast of expert i is equal to \(\hat{c}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{c}_{i,t}]=p_{i,t}c_{i,t}+(1-p_{i,t})\gamma _t\).
Define the aggregating algorithm forecast \(\gamma _t=\sum \limits _{i=1}^N w_{i,t}\hat{c}_{i,t}\).
To resolve the circularity in these definitions, we use the fixed-point method of Chernov and Vovk (2009). Canceling the identical terms on the left- and right-hand sides, we obtain \(\gamma _t=\frac{\sum _{i=1}^N p_{i,t}w_{i,t}c_{i,t}}{\sum _{i=1}^N p_{i,t}w_{i,t}}.\)
The further calculations are given in the scheme of ConfHedge-2 below.
ConfHedge-2

Define \(w_{i,1}=w^\mu _{i,0}=\frac{1}{N}\) for \(i=1,\dots , N\), \(\varDelta _0=0\), \(\eta _1=\infty \).

FOR \(t=1,\dots ,T\)

Receive the expert forecasts \(\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})\) and their confidence levels \(\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})\).

Compute the aggregating algorithm forecast \(\gamma _t=\frac{\sum _{i=1}^N p_{i,t}w_{i,t} c_{i,t}}{\sum _{i=1}^N p_{i,t}w_{i,t}}\).

Receive an outcome \(\omega _t\) and compute the expert losses \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\), where \(l_{i,t}=\lambda (\omega _t,c_{i,t})\), \(1\le i\le N\), and the algorithm loss \(a_t=\lambda (\omega _t,\gamma _t)\).

Update the expert weights and the learning parameter in three stages:

Loss Update. Define \(w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-a_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-a_t)}}\) for \(1\le i\le N\).

Mixing Update. Choose a mixing scheme \(\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)\) and define the future expert weights \(w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}\) for \(1\le i\le N\).

Learning Parameter Update. Compute the mixloss \(m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)}\). Define \(\delta _t=h_t-m_t\), where \(h_t=\sum _{i=1}^N w_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)\), and \(\varDelta _t=\varDelta _{t-1}+\delta _t\). Set the learning rate \(\eta _{t+1}=\ln ^*N/\varDelta _t\) for the next step.

ENDFOR
Let \(A_T=\sum _{t=1}^T a_t\) be the loss of ConfHedge-2. We keep the notation \(H_T=\sum _{t=1}^T h_t\) and \(L^{(\mathbf{q})}_T=\sum _{t=1}^T (\mathbf{q}_t\cdot {\hat{\mathbf{l}}}_t)\). Theorem 1 also holds for these quantities. Hence, using the same notation as in Sect. 2, we obtain the bound (16) and \(H_T-L^{(\mathbf{q})}_T\le \gamma _{k,T}\varDelta _T\).
Since by convexity of the loss function \(a_t=\lambda (\omega _t,\gamma _t)=\lambda (\omega _t,\sum _{i=1}^N w_{i,t}\hat{c}_{i,t})\le \sum _{i=1}^N w_{i,t}\hat{l}_{i,t}=h_t\) for all t, we have \(A_T\le H_T\).
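This inequality can be checked numerically for the absolute loss used in our experiments. The weights, forecasts, and outcome below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
w = rng.dirichlet(np.ones(N))     # expert weights w_t, sum to 1
p = rng.uniform(0.1, 1.0, N)      # confidence levels
c = rng.normal(size=N)            # expert forecasts c_{i,t}
omega = rng.normal()              # the outcome

# aggregated forecast gamma_t (fixed point of the definitions above)
gamma = np.dot(p * w, c) / np.dot(p, w)
a = abs(omega - gamma)            # algorithm loss a_t

# virtual forecasts reproduce gamma as their w-average
c_hat = p * c + (1 - p) * gamma
assert np.isclose(np.dot(w, c_hat), gamma)

# expected virtual losses and their weighted sum h_t
h = np.dot(w, p * np.abs(omega - c) + (1 - p) * a)
assert a <= h + 1e-12             # a_t <= h_t by convexity
```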
The confidence shifting regret of ConfHedge-2 is equal to
Upper bounds for this regret are given by the bound (8) of Theorem 1 and by Corollary 1.
3.3 Electrical load forecasting
The second group of numerical experiments was performed with the contest data of the GefCom2012 competition conducted on the Kaggle platform (Hong et al. 2014). The main objective of this competition was to predict the daily course of hourly electrical loads (demand values for electricity) in 20 regions according to temperature records at 11 meteorological stations. The databases are available at http://www.kaggle.com/datasets. The basic data were provided in the form of the table “temperature-history” with archive records of temperature monitoring at 11 meteorological stations and the table “load-history” with hourly electrical load data recorded at 20 power distribution stations of the region for the period from 01.01.2004 to 30.06.2008. Additional calendar information (seasons, days of the week, and working days vs. holidays) could also be used.
As an illustration of how the proposed expert aggregation methods perform, a simplified particular task was designed, namely, electrical load forecasting in one of the power distribution networks (Zone5) one hour ahead based on historical data and current calendar parameters. To account for temperature changes, the temperature measurements of only one meteorological station (Station9) were used; these data provided the best electrical load forecasts in the selected network on the training part of the sample.
Figure 2 shows the averaged curves of the daily electrical loads for each of the four seasons of 2004–2005 in the selected network. We see that the course of the averaged curves clearly depends on the time of day and also varies from season to season. In addition, the working day and weekend day patterns demonstrate distinct differences in the level of electricity usage. Based on this figure, a simple scheme of forming an ensemble of experts, i.e., specialized algorithms that process only strictly defined data, was chosen. The scheme includes the following categories: four times of day (night, morning, day, evening); working days and weekend days (two categories); and four seasons (winter, spring, summer, fall), giving \(4\times 2\times 4=32\) specialized experts (Stepwise Linear Regression). We also use four extra experts, each of which is focused on one of the seasons of the year, and one nonsleeping expert (Random Forest algorithm). Thus, we used a total of 37 experts.
At each moment of time, the confidence function of a given expert is calculated as the product of the confidence functions for each of its specializations. For example, Fig. 3 shows the stages of constructing the confidence function for the expert focused on night forecasting (0–6 a.m.) on the working days of January. The confidence functions synthesized in this way are used to form individual training samples for each expert at the training stage and to aggregate expert forecasts at the testing stage.
To ensure smoother switching between experts, the membership functions \(p_{i,t}\) were formed as trapezoids: the function takes the value 1 on the plateau corresponding to the selected calendar interval and varies linearly from 1 to 0 on the slopes. The slope width is a user-defined parameter.
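Such a trapezoidal confidence function, together with the product rule from the previous paragraph, can be sketched as follows. The breakpoints \(a\le b\le c\le d\) are hypothetical user-chosen parameters, and the flat representation of specializations is our illustrative simplification (it ignores, e.g., wrap-around of the hour of day):

```python
import math

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on the plateau
    [b, c], and linear from 0 to 1 (resp. 1 to 0) on the slopes."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def expert_confidence(x_values, specs):
    """Confidence of an expert = product of the memberships for each
    of its specializations (time of day, season, ...); `specs` holds
    one (a, b, c, d) tuple per specialization."""
    return math.prod(trapezoid(x, *s) for x, s in zip(x_values, specs))
```

For example, with a plateau on hours 0 to 6 and a one-hour slope on each side, `trapezoid(6.5, -1, 0, 6, 7)` returns 0.5.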
At the training stage, the following steps are taken for each algorithm (expert): (1) For every element of the training sample, a confidence level is computed; it is close to 1 if the element falls within the expert’s specialization and close to 0 otherwise. Based on these confidence levels, an individual training subsample is formed for each expert from the full training sample. (2) A forecasting model is then constructed from this individual subsample.
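Step (1) might be sketched as follows; the threshold is a hypothetical parameter, and the exact selection rule used in our experiments is domain-specific:

```python
import numpy as np

def individual_subsample(X, y, confidence, threshold=0.5):
    """Keep only the training examples whose confidence for this
    expert exceeds the threshold; the expert's forecasting model is
    then fit on the returned subsample (step 2)."""
    mask = confidence > threshold
    return X[mask], y[mask]
```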
To compare the “smooth mixing” scheme with the “sleeping experts” scheme, the experiments on expert decision aggregation were performed in two stages. First, only the scheme of mixing the sleeping and awake experts was used, i.e., the confidence level took only two values (0 or 1), and then the mixing algorithm from Sect. 2 of this work was used.
The evolution of the differences of cumulative losses \(L^1_T-L^3_T\) and \(L^2_T-L^3_T\), where \(L^1_T\) is the cumulative loss of the anytime nonsleeping Random Forest algorithm and \(L^2_T\), \(L^3_T\) are the cumulative losses of the two mixing schemes (“sleeping experts” and “smooth mixing”), is shown in Fig. 4a.
The mean cumulative losses (Mean Absolute Error – MAE) \(\frac{1}{T}L^1_T\) of Random Forest algorithm and of two schemes of expert mixing: \(\frac{1}{T}L^2_T\) and \(\frac{1}{T}L^3_T\), are shown in Fig. 4b.Footnote 5
In this experiment, the “smooth mixing” algorithm outperforms the aggregating algorithm using “sleeping experts”, and both these algorithms outperform the anytime Random Forest forecasting algorithm.
4 Conclusion
In this paper we extend the AdaHedge algorithm of de Rooij et al. (2014) to the case of shifting experts and to a smooth version of the method of specialized experts, where at any time moment each expert’s forecast is provided with a confidence level, a number between 0 and 1.
To aggregate expert predictions, we use methods of shifting experts and the AdaHedge algorithm with an adaptive learning rate. In this way, we combine the advantages of both algorithms: we bound the shifting regret, a more refined performance measure, and we do not impose restrictions on the expert losses. Also, we incorporate into this scheme a smooth version of the method of specialized experts by Blum and Mansour (2007), which allows us to make more flexible and accurate predictions.
We obtain new upper bounds for the regret of our algorithms, which generalize similar upper bounds for the case of specialized experts.
A disadvantage of Theorem 1 and of Corollary 1 is the asymmetry of the bounds (9) and (10): the first of them has a term that depends quadratically on the number k of switches. Whether such a dependence is necessary is an open question.
All results are obtained in the adversarial setting; no assumptions are made about the nature of the data source.
We present the results of numerical experiments on short-term forecasting of electricity consumption based on real data. In these experiments, the “smooth mixing” algorithm outperforms the aggregating algorithm with “sleeping experts”, and both these algorithms outperform the anytime Random Forest forecasting algorithm.
Notes
In the simple Hedge we put \(w_{i,t+1}=w^\mu _{i,t}\). Some other mixing schemes will be given below.
The notion of regret with respect to a comparison vector was first defined by Kivinen and Warmuth (1999).
Whether this bound is tight is an open question. Some lower bounds for the mixloss (for the logarithmic loss function with learning rate \(\eta =1\)) were obtained by Adamskiy et al. (2012). They show an information-theoretic lower bound on the mixloss that must hold for any algorithm and that is tight for Fixed Share.
In our experiments, the absolute loss function \(\lambda (\omega ,\gamma )=|\omega -\gamma |\) was used, where \(\omega \) and \(\gamma \) are real numbers. In practical applications, we can also use its biased variant \(\lambda (\omega ,\gamma )=\mu _1|\omega -\gamma |_{-}+\mu _2|\omega -\gamma |_{+}\), where \(|r|_{-}=-\min \{0,r\}\) and \(|r|_{+}=\max \{0,r\}\). The positive numbers \(\mu _1\) and \(\mu _2\) provide a balance of losses between the deviations of the forecasts \(\gamma \) and outcomes \(\omega \) in the positive and negative directions.
The absolute loss function was used in these experiments.
Mixloss is a very useful intermediate concept: its cumulative variant is less than or equal to the cumulative loss of the best expert (up to a small term) and, on the other hand, the cumulative mixloss is close to the cumulative loss of the aggregating algorithm. For the logarithmic loss function, the mixloss coincides with the loss of the Vovk aggregating algorithm (see Adamskiy et al. 2012; Cesa-Bianchi and Lugosi 2006; de Rooij et al. 2014).
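The biased variant of the absolute loss defined in note 4 can be transcribed directly (a sketch; the default parameter values are illustrative):

```python
def biased_absolute_loss(omega, gamma, mu1=1.0, mu2=1.0):
    """mu1 * |omega - gamma|_-  +  mu2 * |omega - gamma|_+ ,
    where |r|_- = -min(0, r) and |r|_+ = max(0, r)."""
    r = omega - gamma
    return mu1 * max(0.0, -r) + mu2 * max(0.0, r)
```

With \(\mu _1=\mu _2=1\) this reduces to the ordinary absolute loss \(|\omega -\gamma |\).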
References
Adamskiy, D., Koolen, W. M., Chernov, A., Vovk, V. (2012). A closer look at adaptive regret. In: N. H. Bshouty, G. Stoltz , N. Vayatis, & T. Zeugmann (Eds.), Algorithmic learning theory. ALT 2012. Lecture notes in Computer Science (Vol. 7568). Berlin, Heidelberg: Springer.
Blum, A., & Mansour, Y. (2007). From external to internal regret. Journal of Machine Learning Research, 8, 1307–1324.
Bousquet, O., & Warmuth, M. (2002). Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3, 363–396.
Chernov, A., & Vovk, V. (2009). Prediction with expert evaluators’ advice. In R. Gavaldà, G. Lugosi, T. Zeugmann, & S. Zilles (Eds.), Proceedings of the twentieth international conference on algorithmic learning theory. Lecture notes in computer science (Vol. 5809, pp. 8–22). Berlin: Springer.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3), 321–352.
de Rooij, S., van Erven, T., Grunwald, P., & Koolen, W. (2014). Follow the leader. If you can, hedge if you must. Journal of Machine Learning Research, 15, 1281–1316.
Devaine, M., Gaillard, P., Goude, Y., & Stoltz, G. (2013). Forecasting electricity consumption by aggregating specialized experts. Machine Learning, 90(2), 231–260.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In Proceedings of the 29th annual ACM symposium on theory of computing (pp. 334–343).
Gaillard, P., Goude, Y., & Stoltz, G. (2011). A further look at the forecasting of the electricity consumption by aggregation of specialized experts. Technical report. pierre.gaillard.me/doc/GaGoSt-report.pdf.
Gaillard, P., Stoltz, G., & van Erven, T. (2014). A second-order bound with excess losses. JMLR: Workshop and Conference Proceedings, 35, 176–196.
Herbster, M., & Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32(2), 151–178.
Hong, T., Pinson, P., & Fan, S. (2014). Global energy forecasting competition 2012. International Journal of Forecasting, 30(2), 357–363.
Kalnishkan, Y., Adamskiy, D., Chernov, A., & Scarfe, T. (2015). Specialist experts for prediction with side information. IEEE international conference on data mining workshop (ICDMW). IEEE, 1470–1477.
Kivinen, J., & Warmuth, M. K. (1999). Averaging expert predictions. In P. Fischer & H. U. Simon (Eds.), Computational learning theory: 4th European conference (EuroCOLT ’99) (pp. 153–167). Berlin: Springer.
Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
Vovk, V. (1990). Aggregating strategies. In M. Fulk and J. Case, (Eds.), Proceedings of the 3rd annual workshop on computational learning theory, 371–383, San Mateo, CA, Morgan Kaufmann.
Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2), 153–173.
Vovk, V. (1999). Derandomizing stochastic prediction strategies. Machine Learning, 35(3), 247–282.
V’yugin, V. (2017). Online aggregation of unbounded signed losses using shifting experts. Proceedings of machine learning research. 60: 1–15. http://proceedings.mlr.press/v60/
Acknowledgements
This paper is an extended version of the conference paper V’yugin (2017). This work was supported by Russian Science Foundation, project 14-50-00150.
Editors: Alexander Gammerman, Vladimir Vovk, Henrik Bostrom, and Lars Carlsson.
Appendices
Main lemma
For analysis of the mixing schemes, following Bousquet and Warmuth (2002), we use the notion of relative entropy
\( D(\mathbf{p}\Vert \mathbf{q})=\sum \limits _{i=1}^n p_i\ln \frac{p_i}{q_i}, \)
where n is an arbitrary positive integer, \(\mathbf{p}=(p_1,\dots ,p_n)\) and \(\mathbf{q}=(q_1,\dots ,q_n)\) are elements of the n-dimensional simplex of all probability distributions on a set of cardinality n. Put \(0\ln 0=0\).
Consider some properties of the relative entropy. The inequalities \(\mathbf{p}>\mathbf{q}\), \(\mathbf{p}\ge \mathbf{q}\), \(\mathbf{p}\ge \mathbf{0}\) for vectors are understood componentwise; here \(\mathbf{0}\) is the vector with zero components.
Lemma 1
(Bousquet and Warmuth 2002) For each \(\mathbf{p},\mathbf{q},\mathbf{w}\) such that \(\mathbf{q},\mathbf{w}>\mathbf{0}\),
-
\( D(\mathbf{p}\Vert \mathbf{q})\le D(\mathbf{p}\Vert \mathbf{w})+\ln \left( \sum \limits _{i=1}^n p_i\frac{w_i}{q_i}\right) . \)
-
If \(\mathbf{q}\ge r\mathbf{w}\) for some real number \(r>0\) then \( D(\mathbf{p}\Vert \mathbf{q})\le D(\mathbf{p}\Vert \mathbf{w})+\ln \frac{1}{r}. \) In particular, for \(\mathbf{p}=\mathbf{w}\) we have \(D(\mathbf{w}\Vert \mathbf{q})\le \ln \frac{1}{r}\) for each \(\mathbf{q}\ge r\mathbf{w}\).
-
Let \(\mathbf{p}\) be a probability vector, \(\mathbf{q}=\sum \limits _{i=0}^t\beta _i \mathbf{w}_i\), where \(\mathbf{w}_i>0\) for \(0\le i\le t\), \(\beta =(\beta _0,\dots ,\beta _t)\), \(\sum _{i=0}^t\beta _i=1\), and \(\beta >\mathbf{0}\). Then \( D(\mathbf{p}\Vert \mathbf{q})\le D(\mathbf{p}\Vert \mathbf{w}_i)+\ln \frac{1}{\beta _i} \) for each i. In particular, if \(\mathbf{p}=\mathbf{w}_i\) then \( D\left( \mathbf{w}_i\Vert \sum \limits _{j=0}^t\beta _j \mathbf{w}_j\right) \le \ln \frac{1}{\beta _i} \) for all i.
Proof
From concavity of the logarithm, we have
\( D(\mathbf{p}\Vert \mathbf{q})-D(\mathbf{p}\Vert \mathbf{w})= \sum \limits _{i=1}^n p_i\ln \frac{w_i}{q_i}\le \ln \left( \sum \limits _{i=1}^n p_i\frac{w_i}{q_i}\right) . \)
If \(\mathbf{q}\ge r\mathbf{w}\) then \(\sum \limits _{i=1}^n p_i\frac{w_i}{q_i}\le \sum \limits _{i=1}^n p_i\frac{w_i}{r w_i}= \frac{1}{r}\). \(\square \)
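The first two items of Lemma 1 are easy to check numerically. The sketch below (our illustration, not part of the original analysis) implements \(D(\mathbf{p}\Vert \mathbf{q})\) with the convention \(0\ln 0=0\) and verifies both bounds on example vectors.

```python
import numpy as np

def relative_entropy(p, q):
    # D(p||q) = sum_i p_i ln(p_i / q_i), with the convention 0 ln 0 = 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.2, 0.5, 0.3])
w = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

# Item 1: D(p||q) <= D(p||w) + ln(sum_i p_i w_i / q_i)
assert relative_entropy(p, q) <= relative_entropy(p, w) + np.log(np.sum(p * w / q)) + 1e-12

# Item 2: if q >= r w componentwise then D(w||q) <= ln(1/r)
r = float(np.min(q / w))
assert relative_entropy(w, q) <= np.log(1.0 / r) + 1e-12
```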
The notion of mixloss \( m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t l_{i,t}} \) and its cumulative variant \(M_T=\sum \limits _{t=1}^T m_t\) are used in the Hedge analysis. By definition \(m_t\le h_t\) for all t.
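As an illustrative sketch (assuming the standard definition \(m_t=-\frac{1}{\eta _t}\ln \sum _i w_{i,t}e^{-\eta _t l_{i,t}}\), cf. de Rooij et al. 2014), one can verify numerically both \(m_t\le h_t\) and the fact that \(m_t\) exceeds the loss of any single expert i by at most \(\frac{1}{\eta _t}\ln \frac{1}{w_{i,t}}\).

```python
import numpy as np

def mixloss(w, losses, eta):
    # m_t = -(1/eta_t) ln( sum_i w_{i,t} e^{-eta_t l_{i,t}} )
    return float(-np.log(np.dot(w, np.exp(-eta * losses))) / eta)

def hedge_loss(w, losses):
    # h_t = sum_i w_{i,t} l_{i,t}, the loss of the aggregating algorithm
    return float(np.dot(w, losses))

w = np.array([0.5, 0.3, 0.2])
l = np.array([0.9, 0.1, 0.4])
eta = 2.0
m, h = mixloss(w, l, eta), hedge_loss(w, l)
assert m <= h                                 # m_t <= h_t (Jensen's inequality)
i = int(np.argmin(l))                         # best expert this round
assert m <= l[i] + np.log(1.0 / w[i]) / eta   # m_t <= l_{i,t} + (1/eta_t) ln(1/w_{i,t})
```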
Lemma 2
(Bousquet and Warmuth 2002) For any comparison vector \(\mathbf{q}_t\),
Proof
By (14),
\(\square \)
The following lemma presents a bound for the confidence regret in terms of cumulative mixability gap.
Lemma 3
Let \(\alpha _t=\frac{1}{t}\) for all t and let the mixing scheme from Example 1 be used. Then for any T, for any sequence of losses of the experts, and for any sequence of comparison vectors \(\mathbf{q}_t\) given on-line with no more than k switches on the time interval \(1\le t\le T\),
Proof
We apply Lemmas 1 and 2 for mixing schemes of Example 1 (Fixed Share).
Let a sequence \(\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})\) of losses of the experts and a sequence of comparison vectors \(\mathbf{q}_t=(q_{1,t},\dots ,q_{N,t})\) be given on-line for \(t=1,2,\dots \). Assume that T is arbitrary and that the comparison vector \(\mathbf{q}_t\) changes k times for \(1\le t\le T\).
We let \(1<t_1<t_2<\dots <t_k\) be the subsequence of indices in the sequence of comparators \(\mathbf{q}_1,\dots ,\mathbf{q}_T\) where shifting occurs: \(\mathbf{q}_{t_j}\not = \mathbf{q}_{t_j-1}\), and \(\mathbf{q}_t=\mathbf{q}_{t-1}\) for all other steps with \(t>1\). Define also \(t_0=1\) and \(t_{k+1}=T+1\). We apply Lemma 2 for the distribution \(\beta ^{t+1}\) from Example 1. Recall that \(w_{i,1}=w^\mu _{i,0}=\frac{1}{N}\) for \(i=1,\dots , N\).
Summing (15) over the time intervals where \(\mathbf{q}_t=\mathbf{q}_{t-1}\), that is, for \(t_j+1\le t\le t_{j+1}-1\), we obtain
In transition from (17) to (18), the inequality \(w_{i,t}\ge \frac{\alpha _t}{N}\) was used, then
In transition from (18) to (19), we use the inequality (14), where \(s=t-1\),
In the transition from (19) to (20), the entropy terms within the sections telescope, and only a positive entropy term at the beginning and a negative entropy term at the end of each section remain. We also loosen the inequality (19) after division by \(\ln ^*N\).
For the beginnings of the k sections, \(t=t_1,\dots ,t_k\), set \(s=0\) and \(\beta ^{t_j}_0=\alpha _{t_j}\) in the inequality (14); then
Summing all these inequalities and canceling out the corresponding terms, we obtain
In transition from (23) to (24) we use the inequality \(D(\mathbf{q}\Vert \mathbf{w}^\mu _T)\ge 0\) for all \(\mathbf{q}\) and equality \(D(\mathbf{q}\Vert \mathbf{w}^\mu _0)=\ln N\). Then \( \sum \limits _{j=1}^k\frac{1}{\eta _{t_j}}D(\mathbf{q}_{t_j}\Vert \mathbf{w}^\mu _0)\le k\varDelta _T. \) For \(\alpha _t=\frac{1}{t}\) we use the inequality
Since \(H_T=M_T+\varDelta _T\), the bound (25) implies (16). \(\square \)
We will finish the proof of Theorem 1 at the end of Sect. B.
The corresponding bounds for mixing scheme of Example 2 can be obtained in a similar way. Since by definition \(w_{i,t}\ge \frac{\alpha _t}{Nt}\) for each t, the inequality (21) is changed to \(D(\mathbf{q}_t\Vert \mathbf{w}_t)\le \ln N+\ln T+\ln \frac{1}{\alpha _t}\). Also, the last term of the inequality (22) is replaced by \(\frac{1}{\eta _{t_j}}\ln \frac{1}{t\alpha _{t_j}}\). As a result, we obtain \(\gamma _{k,T}=(2k+3)\ln T+(k+2)\).
In the case of bounded losses \(l_{i,t}\in [0,1]\), set \(\alpha _t=\frac{1}{t}\) in (17)–(25) and obtain \(M_T-L^{(\mathbf{q})}_T\le \frac{1}{\eta _T}((k+2)\ln T+(k+1)\ln N)\). Using the Hoeffding inequality \(h_t\le m_t+\frac{\eta _t}{8}\), where \(\eta _t\sim \sqrt{\frac{\ln ^*N}{t}}\), we obtain (6).
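The Fixed Share update with \(\alpha _t=\frac{1}{t}\) and a Hoeffding-type learning rate can be sketched as follows (a simplified illustration of the mixing scheme of Example 1 in the bounded-loss case; we approximate \(\ln ^*N\) by \(\ln N\), and the function name is ours). The loop also checks the lower bound \(w_{i,t}\ge \frac{\alpha _t}{N}\) used in the proof above.

```python
import numpy as np

def fixed_share_step(w, losses, t):
    """One round of exponential weighting followed by the Fixed Share
    mixing step with alpha_t = 1/t (sketch of the scheme of Example 1)."""
    N = len(w)
    eta_t = np.sqrt(np.log(N) / t)       # Hoeffding-style rate (ln*N ~ ln N here)
    v = w * np.exp(-eta_t * losses)      # loss update
    v /= v.sum()                         # renormalize
    alpha_t = 1.0 / t                    # switching rate
    return alpha_t / N + (1.0 - alpha_t) * v   # share a fraction uniformly

N, T = 4, 50
rng = np.random.default_rng(0)
w = np.full(N, 1.0 / N)
for t in range(1, T + 1):
    losses = rng.random(N)                     # bounded losses in [0, 1]
    w = fixed_share_step(w, losses, t)
    assert abs(w.sum() - 1.0) < 1e-9           # weights remain a distribution
    assert np.all(w >= (1.0 / t) / N - 1e-12)  # w_{i,t} >= alpha_t / N
```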
Technical bounds
The derivation of the upper bound for \(\varDelta _T\) is similar to that given in de Rooij et al. (2014), except that the losses of experts are replaced by \(\hat{l}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{l}_{i,t}]\).
Let \(v_t=E_{j \sim \mathbf{w}_t} [(\hat{l}_{j,t}-E_{j\sim \mathbf{w}_t} [\hat{l}_{j,t}])^2]= \sum \limits _{j=1}^N w_{j,t} (\hat{l}_{j,t}-h_t)^2\) and \(V_T=\sum \limits _{t=1}^T v_t\).
Lemma 4
The quantity \(\delta _t\) satisfies the inequality
Proof
The inequality (26) will be proved using the Bernstein inequality (see Lemmas 3–5 of Cesa-Bianchi and Lugosi 2006). Let \(X\in [0,1]\) be a random variable and Var[X] be its variance. Then for any \(\eta >0\), we have \(\ln E[e^{-\eta (X-E[X])}]\le Var [X](e^\eta -\eta -1)\).
Consider a random variable which takes the values \(\hat{l}_{j,t}\) with probabilities \(w_{j,t}\), where \(j = 1,\dots , N\). Let us transform it so that its values belong to the segment [0, 1]: \(X_t^j = \frac{\hat{l}_{j,t} - \hat{l}_t^-}{s_t}\). Then the Bernstein inequality can be written as follows:
for each \(\eta >0\). We rewrite this inequality in more detail for \(\eta =s_t\eta _t\).
First, we simplify the left-hand side of the inequality (27)
Then the inequality (27) can be written in the form \( \eta _t \delta _t\le \frac{1}{s_t^2} v_t\left( e^{s_t \eta _t}-1-s_t\eta _t\right) , \) from which we obtain the required inequality (26). \(\square \)
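The Bernstein inequality used in this proof can be checked numerically for a discrete random variable (an illustrative sketch with made-up values and probabilities).

```python
import numpy as np

# A discrete random variable X in [0, 1]: values x_j with probabilities w_j
x = np.array([0.0, 0.3, 0.8, 1.0])
w = np.array([0.1, 0.4, 0.3, 0.2])
mean = float(np.dot(w, x))
var = float(np.dot(w, (x - mean) ** 2))

for eta in [0.1, 0.5, 1.0, 2.0]:
    lhs = np.log(np.dot(w, np.exp(-eta * (x - mean))))  # ln E[e^{-eta (X - E[X])}]
    rhs = var * (np.exp(eta) - eta - 1.0)               # Var[X] (e^eta - eta - 1)
    assert lhs <= rhs + 1e-12
```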
The inequality (26) can be presented in the form
Lemma 5
\(\left( \varDelta _T\right) ^2 \le (\ln ^*N) V_T + \left( \frac{2}{3}\ln ^*N+1\right) S_T\varDelta _T\).
Proof
We have
The bound for \(\frac{\delta _t}{\eta _t}\) is obtained using (28): \( \frac{1}{2}v_t\ge \frac{\delta _t s_t}{2 g \left( s_t \eta _t \right) }= \frac{\delta _t}{\eta _t} -s_t\varphi (s_t\eta _t)\delta _t, \) where \(\varphi (x)=\frac{e^x - \frac{1}{2} x^2 - x - 1}{xe^x - x^2 - x}\). It is not difficult to prove that \(\varphi (x)\le 1/3\).
Then, summing the inequality \(\frac{\delta _t}{\eta _t} \le \frac{1}{3} s_t \delta _t + \frac{1}{2} v_t\), and combining it with the inequality (29) we obtain the needed inequality. \(\square \)
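The claim \(\varphi (x)\le 1/3\) is easy to confirm numerically: as \(x\to 0^{+}\) the ratio tends to \(1/3\) from below, and it decreases towards 0 as \(x\to \infty \) (our sketch, not part of the original proof).

```python
import numpy as np

def phi(x):
    # phi(x) = (e^x - x^2/2 - x - 1) / (x e^x - x^2 - x)
    return (np.exp(x) - 0.5 * x ** 2 - x - 1.0) / (x * np.exp(x) - x ** 2 - x)

# phi tends to 1/3 as x -> 0+ and decreases towards 0 as x grows
xs = np.linspace(1e-3, 50.0, 100000)
assert np.all(phi(xs) <= 1.0 / 3.0 + 1e-9)
```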
Now, we obtain the bounds for \(V_T\). By definition, \(v_t\le (l_t^+-h_t)(h_t-l_t^-)\le \frac{s^2_t}{4}\).
To obtain (8), we will use the following lemma.
Lemma 6
If \(L^{(\mathbf{q})}_T\le H_T\) then \(V_T\le S_T\frac{(L_T^+-L_T^{(\mathbf{q})})(L_T^{(\mathbf{q})}-L_T^-)}{L_T^+-L_T^-}+ \gamma _{k,T}S_T\varDelta _T\).
Proof
The following inequality holds true
The inequality (30) is obtained by applying Jensen's inequality to the concave function \(B(x,y,z)=(z-y)(y-x)/(z-x)\) on the set \(x\le y\le z\) (see de Rooij et al. 2014 for details). \(\square \)
Recall that \(H_T\le L_T^{(\mathbf{q})}+\gamma _{k,T}\varDelta _T\). Then, assuming \(L_T^{(\mathbf{q})}\le H_T\), we have
Denote
By the inequality (31) and Lemma 5
We have the inequality \(\varDelta _T^2\le a+b\varDelta _T\), where \(a=S_T Q_T\ln ^*N\), \(b=(\gamma _{k,T}\ln ^*N+\frac{2}{3}\ln ^*N+1)S_T\).
Solving this inequality with respect to \(\varDelta _T\), we obtain: \( \varDelta _T\le \frac{1}{2}b + \frac{1}{2}\sqrt{b^2 + 4a}\le \sqrt{a} + b=\sqrt{S_T Q_T\ln ^*N}+\left( \left( \gamma _{k,T}+\frac{2}{3}\right) \ln ^*N+1\right) S_T. \)
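The final bound follows since the largest root of \(x^2=a+bx\) satisfies \(\frac{1}{2}b+\frac{1}{2}\sqrt{b^2+4a}\le \sqrt{a}+b\); a quick numerical check (our sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    a, b = rng.uniform(0.0, 100.0, size=2)
    delta = 0.5 * b + 0.5 * np.sqrt(b ** 2 + 4.0 * a)  # largest root of x^2 = a + b x
    assert delta ** 2 <= a + b * delta + 1e-6           # satisfies Delta^2 <= a + b Delta
    assert delta <= np.sqrt(a) + b + 1e-9               # and is at most sqrt(a) + b
```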
If \(H_T\le L^{(\mathbf{q})}_T\) then \(R^{(\mathbf{q})}_T\le 0\) and the inequality (8) holds trivially. Otherwise, by Lemma 3,
We obtain the inequality (8) using the lower and the upper bounds (5) for \(L^{(\mathbf{q})}_T\).
To obtain (7), it is sufficient to use the inequality \(V_T\le \frac{1}{4}\sum \limits _{t=1}^T s^2_t\) and a derivation similar to (32). This completes the proof of Theorem 1.
To prove the inequality (10) of Corollary 1, we simplify the inequality (30) as \(V_T\le S_T(L^+_T-H_T)\le S_T(L^+_T-L^{(\mathbf{q})}_T)\) if \(L^{(\mathbf{q})}_T\le H_T\). Finally, \( R^{(\mathbf{q})}_T\le \gamma _{k,T}\varDelta _T\le \gamma _{k,T}\sqrt{S_T(L^+_T-L^{(\mathbf{q})}_T)\ln ^*N}+ \gamma _{k,T}\left( \frac{2}{3}\ln ^*N+1\right) S_T \) for all T.
V’yugin, V., Trunov, V. Online aggregation of unbounded losses using shifting experts with confidence. Mach Learn 108, 425–444 (2019). https://doi.org/10.1007/s10994-018-5751-z