
Machine Learning, Volume 108, Issue 3, pp 425–444

# Online aggregation of unbounded losses using shifting experts with confidence

Article
Part of the following topical collections:
1. Special Issue: Conformal Prediction

## Abstract

We develop the setting of sequential prediction based on shifting experts and on a “smooth” version of the method of specialized experts. To aggregate expert predictions, we use the AdaHedge algorithm, which is a version of the Hedge algorithm with adaptive learning rate, and extend it by the meta-algorithm Fixed Share. Due to this, we combine the advantages of both algorithms: (1) we use the shifting regret which is a more optimal characteristic of the algorithm; (2) regret bounds are valid in the case of signed unbounded losses of the experts. Also, (3) we incorporate in this scheme a “smooth” version of the method of specialized experts which allows us to make more flexible and accurate predictions. All results are obtained in the adversarial setting—no assumptions are made about the nature of the data source. We present results of numerical experiments for short-term forecasting of electricity consumption based on real data.

## Keywords

On-line learning · Prediction with expert advice · Unbounded losses · Adaptive learning rate · Algorithm Hedge · Method of mixing past posteriors · Shifting experts · Specialized experts · Confidence level · Short-term prediction of electricity consumption

## 1 Introduction

We consider sequential prediction in the general framework of decision theoretic online learning or the Hedge setting by Freund and Schapire (1997), which is a variant of prediction with expert advice, see e.g. Littlestone and Warmuth (1994), Freund and Schapire (1997), Vovk (1990, 1998) and Cesa-Bianchi and Lugosi (2006).

The aggregating algorithm updates the experts' weights at the end of each trial using the losses suffered by the experts in the past. In the classical setting (Freund and Schapire 1997; Vovk 1990), the weight of an expert i is updated by exponential weighting with a constant or variable learning rate $$\eta$$:
\begin{aligned} w_{i,t+1}=\frac{w_{i,t}e^{-\eta l_{i,t}}}{\sum \limits _{j=1}^N w_{j,t}e^{-\eta l_{j,t}}}, \end{aligned}
(1)
where $$l_{i,t}$$ is the loss suffered by the expert i at step t.
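As a minimal sketch of update (1), assuming NumPy, the step can be written as follows. The function name `hedge_update` and the sample numbers are illustrative only and do not come from the paper.

```python
import numpy as np

def hedge_update(w, losses, eta):
    """One exponential-weights step, as in update (1)."""
    # Subtracting the minimum loss leaves the normalized weights unchanged
    # but keeps the exponentials in (0, 1], which avoids overflow when
    # the losses are large or negative.
    z = w * np.exp(-eta * (losses - losses.min()))
    return z / z.sum()

# Three experts with uniform prior weights and signed losses (made-up values).
w = np.full(3, 1.0 / 3.0)
w_next = hedge_update(w, np.array([0.2, -1.5, 4.0]), eta=0.5)
```

The expert with the smallest loss (here the second one) receives the largest weight after the update.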

The goal of the algorithm is to design weight updates that guarantee that the loss of the aggregating algorithm is never much larger than the loss of the best expert or the best convex combination of the losses of the experts.

So, here the best expert or a convex combination of experts serves as a comparator. By a comparison vector we mean a vector $$\mathbf{q}=(q_1,\dots ,q_N)$$ such that $$q_1+\cdots +q_N=1$$ and all its components are nonnegative. We compare the cumulative loss of the aggregating algorithm and a convex combination of the losses $$\sum \limits _{t=1}^T (\mathbf{q}\cdot \mathbf{l}_t)$$, where $$\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})$$ is a vector containing the losses of the experts at time t.
Table 1 Basic notations and definitions

| Notation | Meaning |
|---|---|
| $$N$$ | number of experts |
| $$\mathbf{l}_t=(l_{1,t},\dots , l_{N,t})$$ | loss vector at step t |
| $$\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})$$ | vector of confidences at step t |
| $${\hat{\mathbf{l}}}_t=(\hat{l}_{1,t},\dots , \hat{l}_{N,t})$$ | vector of transformed losses |
| $$l_t^-=\min _{1\le i\le N}l_{i,t}$$, $$l_t^+=\max _{1\le i\le N}l_{i,t}$$ | min and max loss |
| $$s_t=l_t^+-l_t^-$$ | loss range |
| $$\mathbf{q}_t=(q_{1,t},\dots ,q_{N,t})$$ | comparison vector at step t |
| $$\mathbf{w}^{\mu }_t=(w^\mu _{1,t},\dots ,w^\mu _{N,t})$$ | experts weights |
| $$\mathbf{w}_t=(w_{1,t},\dots ,w_{N,t})$$ | experts posterior weights at step t |
| $$\mathbf{w}^*_t=(w^*_{1,t},\dots ,w^*_{N,t})$$, where $$w^*_{i,t}=\frac{w_{i,t}p_{i,t}}{\sum _{j=1}^N w_{j,t}p_{j,t}}$$ for $$1\le i\le N$$ | the learner prediction |
| $$h_t=(\mathbf{w}^*_t\cdot \mathbf{l}_t)$$ | Hedge loss (dot product of two vectors) |
| $$m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t \hat{l}_{i,t}}$$ | mixloss |
| $$\delta _t=h_t-m_t$$ | mixability gap |
| $$\alpha _t$$ | Fixed Share parameter (we put $$\alpha _t=\frac{1}{t}$$) |
| $$L_T^-=\sum \limits _{t=1}^T l_t^-$$, $$L_T^+=\sum \limits _{t=1}^T l_t^+$$ | cumulative minimal and maximal losses |
| $$S_T=\max _{1\le t\le T} s_t$$ | maximum loss range |
| $$H_T=\sum \limits _{t=1}^T h_t$$ | algorithm cumulative loss |
| $$M_T=\sum \limits _{t=1}^T m_t$$ | cumulative mixloss |
| $$\varDelta _T=\sum _{t=1}^T\delta _t$$ | cumulative gap |
| $$R^{(\mathbf{q})}_T=\sum \limits _{t=1}^T\sum \limits _{i=1}^N q_{i,t} p_{i,t} (h_t-l_{i,t})$$ | confidence shifting regret |
| $$\eta _t=\frac{\ln ^*N}{\varDelta _{t-1}}$$ | variable learning rate, where $$\ln ^*N=\max \{1,\ln N\}$$ |
| $$0/0=0$$ | convention used throughout |

A more challenging goal is to learn well when the comparator $$\mathbf{q}$$ changes over time, i.e. the algorithm competes with the cumulative sum $$\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)$$, where comparison vector $$\mathbf{q}_t$$ changes over time. An important special case is when $$\mathbf{q}_t$$ are unit vectors, then the sequence of trials is partitioned into segments. In each segment the loss of the algorithm is compared to the loss of a particular expert and this expert changes at the beginning of a new segment. The goal of the aggregation algorithm is to do almost as well as the sum of losses of experts forming the best partition. Algorithms and bounds for shifting comparators were presented by Herbster and Warmuth (1998). This method called Fixed Share was generalized by Bousquet and Warmuth (2002) to the method of Mixing Past Posteriors (MPP) in which arbitrary mixing schemes are considered. In what follows, MPP mixing schemes will be used in our algorithms.

Most papers in the prediction with expert advice setting either consider uniformly bounded losses or assume the existence of a specific loss function (see Vovk 1990; Cesa-Bianchi and Lugosi 2006). But in some practical applications, this assumption is too restrictive. We allow losses at any step to be unbounded and signed. The notion of a specific loss function is not used.

AdaHedge, presented by de Rooij et al. (2014), is among the few algorithms that do not have such restrictions. This algorithm is a version of the classical Hedge algorithm of Freund and Schapire (1997) and is a refinement of the Cesa-Bianchi and Lugosi (2006) algorithm. AdaHedge is completely parameterless and tunes the learning rate $$\eta$$ in terms of a direct measure of past performance.

In de Rooij et al. (2014), an upper bound for the regret of this algorithm is presented which is free from boundedness assumptions on the losses of the experts:
\begin{aligned} R_T\le 2\sqrt{S_T\frac{(L_T^*-L_T^-)(L_T^+-L_T^*)}{L_T^+-L_T^-} \ln N} + \left( \frac{16}{3} \ln N + 2\right) S_T, \end{aligned}
(2)
where $$L^*_T$$ is the loss of the best expert; for the other notations see Table 1.

In the case where losses of the experts are uniformly bounded the upper bound (2) takes the form $$O(\sqrt{T\ln N})$$.

We emphasize that the versions of the Fixed Share and MPP algorithms presented by Herbster and Warmuth (1998) and Bousquet and Warmuth (2002) use a constant learning rate, while AdaHedge uses an adaptive learning rate which is tuned on-line.

The first contribution of this paper is that we present the ConfHedge-1 algorithm which combines advantages of both these algorithms: (1) we use the shifting regret which is a more optimal characteristic of the algorithm; (2) regret bounds are valid in the case of signed unbounded losses of the experts.

The application we consider below, sequential short-term (one-hour-ahead) forecasting of electricity consumption, takes place in a variant of the basic problem of prediction with expert advice called prediction with specialized (or sleeping) experts. At each round only some of the experts output a prediction while the others are inactive. Each expert is expected to provide accurate forecasts mostly under given external conditions, which can be known beforehand. For instance, in the case of the prediction of electricity consumption, experts can be specialized to a season, temperature, to working days or to public holidays, etc.

The method of specialized experts was first proposed by Freund et al. (1997) and further developed by Adamskiy et al. (2012), Chernov and Vovk (2009), Devaine et al. (2013), Kalnishkan et al. (2015). With this approach, at each step t, a set of specialized experts $$E_t\subseteq \{1,\dots , N\}$$ is given. A specialized expert i issues its forecasts not at all steps $$t=1,2,\dots$$, but only when $$i\in E_t$$. At any step, the aggregating algorithm uses forecasts of only “active (non-sleeping)” experts.

The second contribution of this paper is that we have incorporated into ConfHedge-1 a smooth generalization of the method of specialized experts. At each time moment t, we complement the expert i forecast by a confidence level which is a real number $$p_{i,t}\in [0,1]$$.

The setting of prediction with experts that report their confidences as a number in the interval [0, 1] was first studied by Blum and Mansour (2007) and further developed by Cesa-Bianchi et al. (2007),  Gaillard et al. (2011), Gaillard et al. (2014).

In particular, $$p_{i,t}=1$$ means that the expert forecast is used in full, whereas in the case of $$p_{i,t}=0$$ it is not taken into account at all (the expert sleeps). In cases where $$0<p_{i,t}<1$$ the expert’s forecast is partially taken into account. For example, with a gradual drop in temperature the corresponding specialized expert gradually loses its ability to predict electricity consumption accurately. The dependence of $$p_{i,t}$$ on the values of exogenous parameters can be predetermined by a specialist in the domain or can be constructed using regression analysis on historical data.

In Sect. 2, we present the ConfHedge-1 algorithm, which is a loss allocation algorithm adapted for the case where the losses of the experts can be signed and unbounded. This algorithm also takes into account the confidence levels of the experts' predictions. In Sect. 3.2, the ConfHedge-2 variant of this algorithm is presented for the case when experts make forecasts and calculate their losses using a convex loss function.

In Theorem 1 we present the upper bounds for the shifting regret of these algorithms. The proof of this theorem is given in Sect. A. Some details of the proof from de Rooij et al. (2014) are presented as a supplementary material in Sect. B. All results are obtained in the adversarial setting and no assumptions are made about the nature of data source.

In Sect. 3.3, the techniques of confidence level selection and experts training are presented. We also present the results of numerical experiments of the short-term prediction of electricity consumption with the use of the proposed algorithms.

The approach that sets the confidence levels for expert predictions of electricity consumption is more general than the approach used in the paper Devaine et al. (2013), which uses “sleeping” experts. In our numerical experiments the aggregating algorithm with soft confidence levels outperforms other versions of aggregating algorithms including ones which use sleeping experts.

## 2 Online loss allocation algorithm

In this section we present an algorithm for the optimal online allocation of unbounded signed losses of the experts. In Sect. 3.2, a variant of this algorithm will be presented for the case when experts make forecasts and calculate their losses using a convex loss function.

We assume that at each step t, along with the losses $$l_{i,t}$$ of the experts, their confidence levels are given as a vector $$\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})$$, where $$p_{i,t}\in [0,1]$$ for $$1\le i\le N$$. We assume that $$\Vert \mathbf{p}_t\Vert _1>0$$ for all t.

We can interpret the number $$p_{i,t}$$ as the algorithm’s internal probability of following the expert i prediction. In this case, we define the auxiliary virtual losses of the expert as a random variable
\begin{aligned} \tilde{l}_{i,t}= \left\{ \begin{array}{l} l_{i,t} \text{ with } \text{ probability } p_{i,t}, \\ h_t \text{ with } \text{ probability } 1-p_{i,t}, \end{array} \right. \end{aligned}
where $$h_t$$ is the aggregating algorithm loss. Denote by $$\hat{l}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{l}_{i,t}]=p_{i,t} l_{i,t}+(1-p_{i,t})h_t$$ the mathematical expectation of the virtual loss of an expert i with respect to the probability distribution $$\mathbf{p}_{i,t}=(p_{i,t},1-p_{i,t})$$.

At any step t we use cumulative weights $$w_{i,t}$$ of the experts $$1\le i\le N$$ which were computed at the previous step. The algorithm loss is defined as $$h_t=\sum \limits _{i=1}^N w_{i,t}\hat{l}_{i,t}$$.

These definitions contain a logical circle—virtual losses are determined through loss of the algorithm, and the latter is determined through virtual losses. Nevertheless, all these quantities can be effectively calculated using the fixed-point method proposed by Chernov and Vovk (2009). We have
\begin{aligned} h_t=\sum _{i=1}^N w_{i,t}\hat{l}_{i,t}= \sum _{i=1}^N w_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})h_t)= \sum _{i=1}^N w_{i,t} p_{i,t}(l_{i,t}-h_t)+h_t. \end{aligned}
Canceling out the identical terms on the left and on the right sides, we obtain the expression for calculating $$h_t$$:
\begin{aligned} h_t=\frac{ \sum _{i=1}^N w_{i,t}p_{i,t} l_{i,t} }{\sum _{i=1}^N w_{i,t}p_{i,t} } \end{aligned}
(3)
After the value of $$h_t$$ was calculated by the formula (3), the weights can be calculated as
\begin{aligned} w^\mu _{i,t}=\frac{ w_{i,t} e^{-\eta \hat{l}_{i,t}} }{ \sum _{s=1}^N w_{s,t} e^{-\eta \hat{l}_{s,t}} }= \frac{ w_{i,t} e^{-\eta p_{i,t}(l_{i,t}-h_t)} }{ \sum _{s=1}^N w_{s,t} e^{-\eta p_{s,t}(l_{s,t}-h_t)} } \end{aligned}
(4)
using the known value $$h_t$$. Also, we compute the mixloss $$m_t=-\frac{1}{\eta _t}\ln \sum \limits _{i=1}^N w_{i,t}e^{-\eta _t\hat{l}_{i,t}}$$ and the mixability gap $$\delta _t=h_t-m_t$$, which are used in the construction of the algorithm.
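The fixed-point computation can be illustrated with a short Python sketch, assuming NumPy and a weight vector that sums to one. The function name `confidence_step` and the sample numbers are hypothetical. The sketch computes $$h_t$$ by formula (3), the transformed losses, the weights (4), the mixloss and the mixability gap.

```python
import numpy as np

def confidence_step(w, p, losses, eta):
    """Compute h_t via (3), the weights via (4), the mixloss and the gap."""
    wp = w * p
    h = np.dot(wp, losses) / wp.sum()        # formula (3)
    l_hat = p * losses + (1.0 - p) * h       # expected virtual losses
    # exp(-eta * l_hat) is proportional to exp(-eta * p * (l - h));
    # shifting by the minimum keeps the exponentials in (0, 1].
    shift = l_hat.min()
    z = w * np.exp(-eta * (l_hat - shift))
    w_mu = z / z.sum()                       # formula (4)
    mixloss = shift - np.log(z.sum()) / eta  # -(1/eta) ln sum_i w_i exp(-eta l_hat_i)
    return h, l_hat, w_mu, mixloss, h - mixloss

# Made-up example: the third expert sleeps (p = 0).
w = np.array([0.5, 0.3, 0.2])
p = np.array([1.0, 0.5, 0.0])
losses = np.array([2.0, -1.0, 10.0])
h, l_hat, w_mu, mixloss, gap = confidence_step(w, p, losses, eta=0.7)
```

In this example the sleeping expert is assigned the transformed loss $$h_t$$, so its large raw loss has no influence on the update, and $$h_t$$ indeed satisfies the fixed-point equation $$h_t=\sum_i w_{i,t}\hat{l}_{i,t}$$.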

By the method MPP of Bousquet and Warmuth (2002), a mixing scheme is defined by a vector $$\beta ^{t+1}=(\beta ^{t+1}_0,\dots , \beta ^{t+1}_t)$$, where $$\sum \limits _{s=0}^t\beta ^{t+1}_s=1$$ and $$\beta ^{t+1}_s\ge 0$$ for $$0\le s\le t$$.

In what follows the vector $$\mathbf{w}^\mu _t=(w^\mu _{1,t},\dots ,w^\mu _{N,t})$$ represents the normalized experts weights at step t. The corresponding posterior probability distribution $$\mathbf{w}_{t+1}=(w_{1,t+1},\dots ,w_{N,t+1})$$ for step $$t+1$$ is defined as a convex combination $$\mathbf{w}_{t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s \mathbf{w}^\mu _s$$ with weights $$\beta ^{t+1}_s$$, $$0\le s\le t$$, where $$\mathbf{w}^\mu _s=(w^\mu _{1,s},\dots ,w^\mu _{N,s})$$.

The vector $$\beta ^{t + 1}$$ defines the weights by which the past distributions of experts are mixed. It can be re-set at each step t.

The ConfHedge-1 algorithm for mixing the posteriori distributions of experts is given below. Unlike standard exponential mixing algorithms, this algorithm uses not only the current accumulated weights of experts, but also mixes these weights and all the weights accumulated in past steps.

ConfHedge-1

Put $$w_{i,1}=w^\mu _{i,0}=\frac{1}{N}$$ for $$i=1,\dots , N$$, $$\varDelta _0=0$$, $$\eta _1=\infty$$.

FOR $$t=1,\dots ,T$$

Receive confidence levels $$\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})$$ of the experts $$1\le i\le N$$, where $$\Vert \mathbf{p}_t\Vert _1>0$$.

Predict with the distribution $$\mathbf{w}^*_t=(w^*_{1,t},\dots ,w^*_{N,t})$$, where $$w^*_{i,t}=\frac{w_{i,t}p_{i,t}}{\sum _{i=1}^N w_{i,t}p_{i,t}}$$ for $$1\le i\le N$$.

Receive a vector $$\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})$$ containing the losses of the experts.

Compute the loss $$h_t=(\mathbf{l}_t\cdot \mathbf{w}^*_t)$$ of the algorithm.

Update the weights and the learning parameter in three stages:

Loss Update

Define $$w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-h_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-h_t)}}$$ for $$1\le i\le N$$.

Mixing Update

Choose a mixing scheme $$\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)$$ and define future weights of the experts

$$w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}$$ for $$1\le i\le N$$.

Learning Parameter Update

Define mixloss $$m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})h_t)}$$. Let $$\delta _t=h_t-m_t$$ and $$\varDelta _t=\varDelta _{t-1}+\delta _t$$.

Define the learning rate $$\eta _{t+1}=\ln ^*N/\varDelta _t$$ for use at the next step $$t+1$$.

ENDFOR
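The scheme above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it assumes NumPy, fixes the mixing scheme to Fixed Share with $$\alpha_t=1/t$$, and treats the case $$\eta_t=\infty$$ (when $$\varDelta_{t-1}=0$$) by moving all weight to the experts with minimal transformed loss. The function name `conf_hedge_1` and the toy data are hypothetical.

```python
import numpy as np

def conf_hedge_1(losses, conf):
    """Sketch of ConfHedge-1 with Fixed Share mixing (alpha_t = 1/t).

    losses, conf: arrays of shape (T, N); every row of conf must have
    at least one positive entry. Returns the losses h_t and final weights.
    """
    T, N = losses.shape
    ln_star = max(1.0, np.log(N))
    w = np.full(N, 1.0 / N)              # w_{i,1} = 1/N
    delta_sum = 0.0                      # cumulative gap Delta_{t-1}
    h_hist = np.empty(T)
    for t in range(1, T + 1):
        l, p = losses[t - 1], conf[t - 1]
        wp = w * p
        h = np.dot(wp, l) / wp.sum()     # loss of the algorithm
        h_hist[t - 1] = h
        l_hat = p * l + (1.0 - p) * h    # transformed losses
        shift = l_hat.min()
        eta = ln_star / delta_sum if delta_sum > 0.0 else np.inf
        if np.isfinite(eta):
            # Loss Update; exp(-eta*l_hat) is proportional to exp(-eta*p*(l-h))
            z = w * np.exp(-eta * (l_hat - shift))
            mixloss = shift - np.log(z.sum()) / eta
        else:
            # Limit eta -> infinity: keep only the minimizers of l_hat
            z = w * (l_hat == shift)
            mixloss = shift
        w_mu = z / z.sum()
        # Learning Parameter Update (max guards against rounding below zero)
        delta_sum += max(h - mixloss, 0.0)
        # Mixing Update: Fixed Share with alpha_{t+1} = 1/(t+1)
        alpha = 1.0 / (t + 1)
        w = alpha / N + (1.0 - alpha) * w_mu
    return h_hist, w

# Toy data: three experts with constant losses 0, 1 and 5, full confidence.
L = np.tile(np.array([0.0, 1.0, 5.0]), (200, 1))
P = np.ones_like(L)
h_hist, w_final = conf_hedge_1(L, P)
```

On this toy input the first loss of the algorithm is the uniform average of the expert losses, and the final weight vector is dominated by the expert with zero loss.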

We have $$m_t\le h_t$$ by convexity of the exponent, then $$\delta _t\ge 0$$ and $$\varDelta _t\le \varDelta _{t+1}$$ for all t.

We will use the following mixing schemes by Bousquet and Warmuth (2002):

### Example 1

A version of Fixed Share by Herbster and Warmuth (1998) (see also Cesa-Bianchi and Lugosi 2006; Vovk 1999) with a variable learning rate is defined by the following mixing scheme. Let a sequence $$1\ge \alpha _1\ge \alpha _2\ge \dots >0$$ of parameters be given. Define $$\beta ^{t+1}_t=1-\alpha _{t+1}$$ and $$\beta ^{t+1}_0=\alpha _{t+1}$$ ($$\beta ^{t+1}_s=0$$ for $$0<s<t$$). The corresponding prediction for step $$t+1$$ is defined as
\begin{aligned} w_{i,t+1}=\frac{\alpha _{t+1}}{N}+(1-\alpha _{t+1})w^\mu _{i,t} \end{aligned}
for all $$1\le i\le N$$. In what follows we put $$\alpha _t=1/t$$ for all t.

### Example 2

Uniform Past by Bousquet and Warmuth (2002) with a variable learning rate. Put $$\beta ^{t+1}_t=1-\alpha _{t+1}$$ and $$\beta ^{t+1}_s=\frac{\alpha _{t+1}}{t}$$ for $$0\le s<t$$. The corresponding prediction for step $$t+1$$ is defined as
\begin{aligned} w_{i,t+1}=\alpha _{t+1}\sum \limits _{s=0}^{t-1}\frac{w^\mu _{i,s}}{t}+(1-\alpha _{t+1})w^\mu _{i,t} \end{aligned}
for all i and t.
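The two mixing schemes of Examples 1 and 2 can be sketched as follows, assuming NumPy and a list `w_mu_hist` that holds $$\mathbf{w}^\mu_0,\dots,\mathbf{w}^\mu_t$$ in order. The function names and the sample history are hypothetical.

```python
import numpy as np

def fixed_share(w_mu_hist, alpha):
    """Example 1: w_{t+1} = alpha * w^mu_0 + (1 - alpha) * w^mu_t."""
    return alpha * w_mu_hist[0] + (1.0 - alpha) * w_mu_hist[-1]

def uniform_past(w_mu_hist, alpha):
    """Example 2: spread alpha uniformly over w^mu_0, ..., w^mu_{t-1}."""
    if len(w_mu_hist) == 1:              # t = 0: only the prior is available
        return w_mu_hist[0].copy()
    past = np.mean(w_mu_hist[:-1], axis=0)
    return alpha * past + (1.0 - alpha) * w_mu_hist[-1]

# History for t = 2 with N = 3 experts; w^mu_0 is the uniform prior.
hist = [np.full(3, 1.0 / 3.0),
        np.array([0.7, 0.2, 0.1]),
        np.array([0.1, 0.8, 0.1])]
alpha = 1.0 / 3.0                        # alpha_{t+1} = 1/(t+1) with t = 2
w_fs = fixed_share(hist, alpha)
w_up = uniform_past(hist, alpha)
```

Since $$\mathbf{w}^\mu_0$$ is uniform, `fixed_share` reproduces the formula $$\frac{\alpha_{t+1}}{N}+(1-\alpha_{t+1})w^\mu_{i,t}$$, while `uniform_past` also returns part of the weight to experts that performed well earlier in the history.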

Bousquet and Warmuth (2002) considered the notion of shifting regret with respect to a sequence $$\mathbf{q}_1,\mathbf{q}_2,\dots ,\mathbf{q}_T$$ of comparison vectors: $$R_T=H_T-\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \mathbf{l}_t)$$.

In the presence of confidence values, we consider the corresponding confidence shifting regret $$R^{(\mathbf{q})}_T=H_T-L^{(\mathbf{q})}_T$$, where
\begin{aligned} L^{(\mathbf{q})}_T=\sum \limits _{t=1}^T (\mathbf{q}_t\cdot \hat{\mathbf{l}}_t)= \sum \limits _{t=1}^T\sum _{i=1}^N q_{i,t}\hat{l}_{i,t}= \sum _{t=1}^T\sum _{i=1}^N q_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})h_t), \end{aligned}
where $$\hat{\mathbf{l}}_t=(\hat{l}_{1,t},\dots ,\hat{l}_{N,t})$$ and $$\mathbf{q}_t=(q_{1,t},\dots ,q_{N,t})$$ is a comparison vector at step t.
By definition this regret can be represented as
\begin{aligned} R^{(\mathbf{q})}_T=\sum \limits _{t=1}^T\sum \limits _{i=1}^N q_{i,t} p_{i,t} (h_t-l_{i,t}). \end{aligned}
If $$p_{i,t}=1$$ for all i and t then $$R^{(\mathbf{q})}_T=R_T$$.
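The representation of the confidence shifting regret as a double sum follows from the definition of $$L^{(\mathbf{q})}_T$$ in one step, using only that the components of each comparison vector sum to one. Written out as a worked derivation:

```latex
\begin{aligned}
R^{(\mathbf{q})}_T
&= \sum_{t=1}^T h_t-\sum_{t=1}^T\sum_{i=1}^N q_{i,t}\bigl(p_{i,t}l_{i,t}+(1-p_{i,t})h_t\bigr) \\
&= \sum_{t=1}^T\sum_{i=1}^N q_{i,t}\bigl(h_t-p_{i,t}l_{i,t}-(1-p_{i,t})h_t\bigr)
   \qquad \text{since } \sum_{i=1}^N q_{i,t}=1 \\
&= \sum_{t=1}^T\sum_{i=1}^N q_{i,t}\,p_{i,t}\,(h_t-l_{i,t}).
\end{aligned}
```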
The quantity $$L^{(\mathbf{q})}_T$$ depends on $$h_t$$. To avoid this dependence, we will consider its lower and upper bounds: $$L^{(\mathbf{q}-)}_T\le L^{(\mathbf{q})}_T\le L^{(\mathbf{q}+)}_T$$, where
\begin{aligned} L^{(\mathbf{q}-)}_T=\sum _{t=1}^T\sum _{i=1}^N q_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})l^-_t), \nonumber \\ L^{(\mathbf{q}+)}_T=\sum _{t=1}^T\sum _{i=1}^N q_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})l^+_t). \end{aligned}
(5)
Assume that the losses of the experts are bounded, for example, $$l_{i,t}\in [0,1]$$ for all i and t. Using the techniques of Sect. A for $$\eta _t\sim \sqrt{\frac{\ln ^*N}{t}}$$, we can prove that
\begin{aligned} R^{(\mathbf{q})}_T=O\left( (k+1)\left( \ln T\sqrt{T}+\sqrt{T\ln ^*N}\right) \right) , \end{aligned}
(6)
where k is the number of switches of comparison vectors $$\mathbf{q}_t$$ on the time interval $$1\le t\le T$$.

Our goal is to obtain a similar bound in the absence of boundedness assumptions for the expert losses. Let the mixing scheme of Example 1 be used.

The following theorem presents upper bounds for the confidence shifting regret in the case where no assumptions are made about the boundedness of the losses of the experts.

### Theorem 1

For any T and for any sequence $$\mathbf{q}_1,\dots ,\mathbf{q}_T$$ of comparison vectors,
\begin{aligned} R^{(\mathbf{q})}_T\le & {} \frac{1}{2}\gamma _{k,T}\sqrt{\sum \limits _{t=1}^T s_t^2\ln ^*N}+ \gamma _{k,T}\left( \frac{2}{3}\ln ^*N+1\right) S_T, \end{aligned}
(7)
\begin{aligned} R^{(\mathbf{q})}_T\le & {} \gamma _{k,T}\sqrt{S_T\frac{(L^+_T-L^{(\mathbf{q}-)}_T) (L^{(\mathbf{q}+)}_T-L^-_T)}{L^+_T-L^-_T}\ln ^* N} \nonumber \\&+\,\gamma _{k,T}\left( \left( \gamma _{k,T}+\frac{2}{3}\right) \ln ^*N+1\right) S_T, \end{aligned}
(8)
where $$\gamma _{k,T}=(k+2)(\ln T+1)$$ and k is the number of switches of the comparison vectors on the time interval $$1\le t\le T$$.

The bound (7) is an analogue for the shifting experts of the bound from Cesa-Bianchi et al. (2007) and the bound (8) is an analogue of the bound (16) of Theorem 8 from de Rooij et al. (2014). Proof of Theorem 1 is given in Sects. A and B.
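To get a feeling for the size of the bound (7), its right-hand side can be evaluated numerically. The following sketch assumes NumPy and the Fixed Share value $$\gamma_{k,T}=(k+2)(\ln T+1)$$; the function names and the toy loss matrix are hypothetical.

```python
import numpy as np

def gamma_fixed_share(k, T):
    """gamma_{k,T} = (k + 2)(ln T + 1) for the mixing scheme of Example 1."""
    return (k + 2) * (np.log(T) + 1.0)

def bound_7(losses, k):
    """Right-hand side of (7) for a (T, N) array of expert losses."""
    T, N = losses.shape
    ln_star = max(1.0, np.log(N))
    s = losses.max(axis=1) - losses.min(axis=1)   # loss ranges s_t
    g = gamma_fixed_share(k, T)
    main = 0.5 * g * np.sqrt(np.sum(s ** 2) * ln_star)
    rest = g * (2.0 / 3.0 * ln_star + 1.0) * s.max()
    return main + rest

# Two experts with constant losses 0 and 1 over T = 100 steps, no switches.
toy = np.tile(np.array([0.0, 1.0]), (100, 1))
b = bound_7(toy, k=0)
```

Because $$\gamma_{k,T}$$ multiplies both terms, the value grows linearly in the number of switches k and logarithmically in T on top of the usual square-root term.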

A disadvantage of the bounds (8) and (9) below is in the presence of a term that depends quadratically on the number k of switches. Whether such a dependence is necessary is an open question. However, this term does not depend on the loss of the algorithm, it has only a slowly growing multiplicative factor $$O(\ln ^2 T)$$. Corollary 1 below shows that in some special cases this dependence can be eliminated.

The bound (8) of Theorem 1 can be simplified in different ways:

### Corollary 1

For any T and for any sequence $$\mathbf{q}_1,\dots ,\mathbf{q}_T$$ of comparison vectors,
\begin{aligned} R^{(\mathbf{q})}_T\le & {} \gamma _{k,T}\sqrt{S_T(L^{(\mathbf{q}+)}_T-L^-_T)\ln ^*N}+ \gamma _{k,T}\left( \left( \gamma _{k,T}+\frac{2}{3}\right) \ln ^*N+1\right) S_T, \end{aligned}
(9)
\begin{aligned} R^{(\mathbf{q})}_T\le & {} \gamma _{k,T}\sqrt{S_T(L^+_T-L^{(\mathbf{q}-)}_T)\ln ^*N}+ \gamma _{k,T}\left( \frac{2}{3}\ln ^*N+1\right) S_T, \end{aligned}
(10)
\begin{aligned} R^{(\mathbf{q})}_T\le & {} \gamma _{k,T}\sqrt{S_T(L^+_T-L^-_T)\ln ^*N}+ \gamma _{k,T}\left( \frac{2}{3}\ln ^*N+1\right) S_T. \end{aligned}
(11)

The bound (10) linearly depends on the number of switches (for the proof see Sect. B). The bound (11) follows from (10). If the losses of the experts are uniformly bounded then the bound (11) is of the same order as the bound (6).

An important special case of Theorem 1 is when the comparison vectors $$\mathbf{q}_t=\mathbf{e}_{i_t}$$ are unit vectors and $$p_{i,t}\in \{0,1\}$$, i.e., the specialists case is considered for composite experts $$i_1,\dots ,i_T$$. Then the confidence shifting regret equals $$R^{(\mathbf{q})}_T=\sum \limits _{t:p_{i_t,t}=1}^T (h_t-l_{i_t,t})$$ and the corresponding differences in the right-hand side of inequality (8) are $$L^+_T-L^{(\mathbf{q}-)}_T=\sum \limits _{t:p_{i_t,t}=1}^T (l^+_t-l_{i_t,t})+ \sum \limits _{t:p_{i_t,t}=0}^T s_t$$ and $$L^{(\mathbf{q}+)}_T-L^-_T=\sum \limits _{t:p_{i_t,t}=1}^T (l_{i_t,t}-l^-_t) + \sum \limits _{t:p_{i_t,t}=0}^T s_t$$.

The bound (10) is important if the algorithm is to be used for a scenario in which we are provided with a sequence of gain vectors $$\mathbf{g}_t$$ rather than losses: we can transform these gains into losses using $$\mathbf{l}_t=-\mathbf{g}_t$$, and then run the algorithm. Assume that $$p_{i,t}=1$$ for all i and t. The bound then implies that we incur small regret with respect to a composite expert if it has very small cumulative gain relative to the minimum gain (see also de Rooij et al. 2014).

Similar bounds can also be obtained for the mixing scheme of Example 2, where $$\gamma _{k,T}=(2k+3)\ln T+(k+2)$$ (see Sect. A).

## 3 Numerical experiments

Section 3.1 presents the results of applying ConfHedge-1 to synthetic data. In Sect. 3.3 the results of the short-term prediction of electricity consumption are presented. In these experiments we use the ConfHedge-2 algorithm, which is a variant of the previous algorithm adapted for the case where experts present numerical forecasts. The scheme of this algorithm is given in Sect. 3.2.

### 3.1 Unbounded signed losses

The first experiment was performed on synthetic data, where one-step losses of experts are signed, unbounded and perturbed by N(0, 1) additive noise. Confidence levels of all experts are always equal to one. Figure 1a shows mean values of these one-step expert losses. Figure 1b shows cumulative losses of three individual experts and cumulative losses of AdaHedge and ConfHedge-1. These experiments show that ConfHedge-1 is non-inferior to AdaHedge, and, after some time, even outperforms it.

Fig. 1 Results of the experiment on the synthetic data. Left subfigure (a) shows the mean values of one-step experts losses (lines 1, 2, and 3). Right subfigure (b) shows cumulative losses of individual experts (thin lines 1, 2 and 3), AdaHedge and ConfHedge-1 cumulative losses (thick lines 4 and 5)

### 3.2 Aggregation of expert forecasts

In this section we suppose that the losses of the experts are computed using a loss function $$\lambda (\omega ,\gamma )$$ that is convex in $$\gamma$$, where $$\omega$$ is an outcome and $$\gamma$$ is a forecast. Outcomes can belong to an arbitrary set; forecasts form a linear space.

At each step t, let the experts' forecasts $$\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})$$ and their confidence levels $$\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})$$ be given. Here $$p_{i,t}\in [0,1]$$ for all $$1\le i\le N$$. Define the auxiliary virtual experts forecasts
\begin{aligned} \tilde{c}_{i,t}= \left\{ \begin{array}{l} c_{i,t} \text{ with } \text{ probability } p_{i,t}, \\ \gamma _t \text{ with } \text{ probability } 1-p_{i,t}, \end{array} \right. \end{aligned}
where $$\gamma _t$$ is a forecast of the aggregating algorithm. Then the mathematical expectation of the forecast of any expert i is equal to $$\hat{c}_{i,t}=E_{\mathbf{p}_{i,t}}[\tilde{c}_{i,t}]=p_{i,t}c_{i,t}+(1-p_{i,t})\gamma _t$$.
Define the aggregating algorithm forecast
\begin{aligned} \gamma _t=\sum _{i=1}^N w_{i,t}\hat{c}_{i,t}. \end{aligned}
(12)
In order to get rid of the logical circle in these definitions, we use the fixed point method by Chernov and Vovk (2009). We have
\begin{aligned} \gamma _t=\sum _{i=1}^N w_{i,t}\hat{c}_{i,t}= \sum _{i=1}^N w_{i,t}(p_{i,t} c_{i,t}+(1-p_{i,t})\gamma _t)= \sum _{i=1}^N w_{i,t} p_{i,t}(c_{i,t}-\gamma _t)+\gamma _t. \end{aligned}
Cancel the same terms on the left and on the right sides and obtain
\begin{aligned} \gamma _t=\frac{\sum _{i=1}^N p_{i,t}w_{i,t} c_{i,t}}{\sum _{i=1}^N p_{i,t}w_{i,t}}. \end{aligned}
(13)
The further calculations are given in the scheme of ConfHedge-2 below.

ConfHedge-2

Define $$w_{i,1}=w^\mu _{i,0}=\frac{1}{N}$$ for $$i=1,\dots , N$$, $$\varDelta _0=0$$, $$\eta _1=\infty$$.

FOR $$t=1,\dots ,T$$

Receive the expert forecasts $$\mathbf{c}_t=(c_{1,t},\dots ,c_{N,t})$$ and their confidence levels $$\mathbf{p}_t=(p_{1,t},\dots ,p_{N,t})$$.

Compute the aggregating algorithm forecast $$\gamma _t=\frac{\sum _{i=1}^N p_{i,t}w_{i,t} c_{i,t}}{\sum _{i=1}^N p_{i,t}w_{i,t}}$$.

Receive an outcome $$\omega _t$$ and compute the experts losses $$\mathbf{l}_t=(l_{1,t},\dots ,l_{N,t})$$, where $$l_{i,t}=\lambda (\omega _t,c_{i,t})$$, $$1\le i\le N$$, and the algorithm loss $$a_t=\lambda (\omega _t,\gamma _t)$$.

Update experts weights and learning parameter in three stages:

Loss Update

Define

$$w^\mu _{i,t}=\frac{w_{i,t}e^{-\eta _t p_{i,t}(l_{i,t}-a_t)}}{\sum \limits _{s=1}^N w_{s,t}e^{-\eta _t p_{s,t}(l_{s,t}-a_t)}}$$ for $$1\le i\le N$$.

Mixing Update

Choose a mixing scheme $$\beta ^{t+1}=(\beta ^{t+1}_0,\dots ,\beta ^{t+1}_t)$$ and define future experts weights

$$w_{i,t+1}=\sum \limits _{s=0}^t\beta ^{t+1}_s w^\mu _{i,s}$$ for $$1\le i\le N$$.

Learning Parameter Update

Compute the mixloss

$$m_t=-\frac{1}{\eta _t}\ln \sum _{i=1}^N w_{i,t} e^{-\eta _t(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)}$$.

Define $$\delta _t=h_t-m_t$$, where $$h_t=\sum _{i=1}^N w_{i,t}(p_{i,t}l_{i,t}+(1-p_{i,t})a_t)$$, define also $$\varDelta _t=\varDelta _{t-1}+\delta _t$$.

After that, set the future value of the learning parameter $$\eta _{t+1}=\ln ^*N/\varDelta _t$$.

ENDFOR
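A possible Python sketch of this scheme is given below. It is an illustration under stated assumptions rather than the authors' implementation: it assumes NumPy, uses Fixed Share mixing with $$\alpha_t=1/t$$, takes the absolute error as a default convex loss, and handles $$\eta_t=\infty$$ by keeping only the experts with minimal transformed loss. The name `conf_hedge_2` and the toy data are hypothetical.

```python
import numpy as np

def conf_hedge_2(forecasts, conf, outcomes, loss=lambda y, c: np.abs(y - c)):
    """Sketch of ConfHedge-2 with Fixed Share mixing (alpha_t = 1/t).

    forecasts, conf: arrays of shape (T, N); outcomes: array of shape (T,).
    Returns the aggregated forecasts gamma_t and the final weights.
    """
    T, N = forecasts.shape
    ln_star = max(1.0, np.log(N))
    w = np.full(N, 1.0 / N)
    delta_sum = 0.0                               # cumulative gap Delta_{t-1}
    preds = np.empty(T)
    for t in range(1, T + 1):
        c, p = forecasts[t - 1], conf[t - 1]
        wp = w * p
        gamma = np.dot(wp, c) / wp.sum()          # forecast, formula (13)
        preds[t - 1] = gamma
        l = loss(outcomes[t - 1], c)              # expert losses l_{i,t}
        a = loss(outcomes[t - 1], gamma)          # algorithm loss a_t
        l_hat = p * l + (1.0 - p) * a             # transformed losses
        h = np.dot(w, l_hat)
        shift = l_hat.min()
        eta = ln_star / delta_sum if delta_sum > 0.0 else np.inf
        if np.isfinite(eta):
            z = w * np.exp(-eta * (l_hat - shift))
            mixloss = shift - np.log(z.sum()) / eta
        else:
            z = w * (l_hat == shift)              # limit eta -> infinity
            mixloss = shift
        w_mu = z / z.sum()                        # Loss Update
        delta_sum += max(h - mixloss, 0.0)        # Learning Parameter Update
        alpha = 1.0 / (t + 1)
        w = alpha / N + (1.0 - alpha) * w_mu      # Mixing Update
    return preds, w

# Toy data: expert 0 is always exact, expert 1 is always off by 10.
y = np.linspace(0.0, 1.0, 50)
C = np.column_stack([y, y + 10.0])
preds, w_final = conf_hedge_2(C, np.ones_like(C), y)
```

Each aggregated forecast is a convex combination of the expert forecasts, so on the toy data it always lies between the two columns of `C`, and the weight of the exact expert dominates at the end.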

Let $$A_T=\sum _{t=1}^T a_t$$ be the loss of ConfHedge-2. We keep the notation $$H_T=\sum _{t=1}^T h_t$$ and $$L^{(\mathbf{q})}_T=\sum _{t=1}^T (\mathbf{q}_t\cdot {\hat{\mathbf{l}}}_t)$$. Theorem 1 also holds for these quantities. Hence, using the same notation as in Sect. 2, we obtain a bound (16) and $$H_T-L^{(\mathbf{q})}_T\le \gamma _{k,T}\varDelta _T$$.

Since by convexity of the loss function $$a_t=\lambda (\omega _t,\gamma _t)=\lambda (\omega _t,\sum _{i=1}^N w_{i,t}\hat{c}_{i,t})\le \sum _{i=1}^N w_{i,t}\hat{l}_{i,t}=h_t$$ for all t, we have $$A_T\le H_T$$.

The confidence shifting regret of ConfHedge-2 is equal to
\begin{aligned} R^{(\mathbf{q})}_T=A_T-L^{(\mathbf{q})}_T=\sum \limits _{t=1}^T\sum \limits _{i=1}^N q_{i,t}p_{i,t}(a_t-l_{i,t}). \end{aligned}
The upper bounds (8) of this regret are given by Theorem 1 and by Corollary 1.

### 3.3 The electrical loads forecasting

The second group of numerical experiments was performed with the contest data of the GefCom2012 competition conducted on the Kaggle platform (Hong et al. 2014). The main objective of this competition was to predict the daily course of hourly electrical loads (demand values for electricity) in 20 regions according to temperature records at 11 meteorological stations. Databases are available at http://www.kaggle.com/datasets. The basic data were provided in the form of the table “temperature-history” with archive records of temperature monitoring at 11 meteorological stations and the table “load-history” with hourly electrical load data recorded at 20 power distribution stations of the region for the period from 01.01.2004 to 30.06.2008. The additional calendar information (seasons, days of the week, and working days vs. holidays) could also be used.

As an illustration of how the proposed expert aggregation methods perform, a simplified particular task was designed, namely, electrical load forecasting in one of the power distribution networks (Zone5) one hour ahead based on historical data and current calendar parameters. To account for temperature changes, the temperature measurements of only one meteorological station (Station9) were used; these data provided the best electrical load forecasts in the selected network on the training part of the sample.

Fig. 2 Averaged curves of the daily electrical loads for each of the four seasons of 2004–2005. The color band around each curve represents standard error of the mean. The solid lines show the average level of electricity usage for working days, and the dashed lines show the same estimates for the weekend days of the same season

Figure 2 shows the averaged curves of the daily electrical loads for each of the four seasons of 2004–2005 in the selected network. We see that the course of the averaged curves clearly depends on the time of day, and also varies from season to season. In addition, the working day and weekend day patterns demonstrate distinct differences in the level of electricity usage. Based on this figure, a simple scheme of forming an ensemble of experts, i.e., specialized algorithms that can only process strictly defined data, was chosen. The scheme includes the following categories: four times of day (night, morning, day, evening); working days and weekend days (two categories); four seasons (winter, spring, summer, fall). Altogether this gives $$4\times 2\times 4=32$$ specialized experts (Stepwise Linear Regression). We also use four extra experts, each of which is focused on one of the seasons of the year, and one nonsleeping expert (Random Forest algorithm). Thus, we used a total of 37 experts.

At each moment of time, the confidence function of a given expert is calculated as a product of the confidence functions for each of its specializations. For example, Fig. 3 shows the stages of constructing the confidence function for the expert focused on night forecasting (0–6 a.m.) on the working days of January. The confidence functions synthesized in this way are used to form individual training samples for each expert at the stage of training and to aggregate expert forecasts at the stage of testing.

To ensure a smoother switch between experts, the membership functions $$p_{i,t}$$ were formed as trapezoids: the function takes the value 1 on the plateau corresponding to the selected calendar interval and varies linearly from 1 to 0 on the slopes. The slope width is set by a user-defined parameter.

Fig. 3 Confidence function construction for the expert January$$\times$$Working days$$\times$$Night

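A trapezoidal membership function and the product rule for combining specializations can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature names (`hour`, `day_of_year`, `is_working_day`), and the slope widths are assumptions, and the sketch ignores the wrap-around of circular calendar variables (for example, from hour 23 to hour 0).

```python
from math import prod


def trapezoid(x, start, end, slope):
    """Return 1 on [start, end], falling linearly to 0 over `slope` units outside."""
    if start <= x <= end:
        return 1.0
    if slope <= 0:
        # Zero slope width gives a hard 0/1 indicator.
        return 0.0
    distance = (start - x) if x < start else (x - end)
    return max(0.0, 1.0 - distance / slope)


def confidence(features, specializations):
    """Multiply the membership values of all specializations of one expert.

    `specializations` is a list of (feature_name, start, end, slope) tuples.
    """
    return prod(
        trapezoid(features[name], start, end, slope)
        for name, start, end, slope in specializations
    )


# Hypothetical expert: January x Working days x Night (0-6 a.m.).
january_working_night = [
    ("day_of_year", 1, 31, 10),   # plateau on January, 10-day slopes (assumed)
    ("is_working_day", 1, 1, 0),  # hard indicator for working days
    ("hour", 0, 6, 2),            # plateau on 0-6 a.m., 2-hour slopes (assumed)
]

sample = {"day_of_year": 36, "is_working_day": 1, "hour": 7}
print(confidence(sample, january_working_night))
```

With a slope width of 0, every membership value is either 0 or 1, which corresponds to a confidence level that takes only two values.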
At the training stage, the following steps are taken for each algorithm (expert): (1) For every element of the training sample, a confidence level is calculated; its value is close to 1 if the element is to be considered by the expert and close to 0 if the element lies beyond the expert’s specialization. Based on the confidence level, an individual training subsample is formed for each algorithm from the full training sample. (2) Based on this individual training subsample, a forecasting model is constructed.

Fig. 4 a The evolution of differences of cumulative losses $$L^1_T-L^3_T$$ and $$L^2_T-L^3_T$$: 1 – anytime nonsleeping expert (Random Forest algorithm), 2 – ConfHedge-2 using “sleeping experts” model, 3 – ConfHedge-2 using smooth confidence levels. b The mean cumulative losses (MAE) of Random Forest (1) and of two schemes of expert mixing (2 and 3)
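The two training steps can be illustrated with a small sketch. The text does not state how the subsample is derived from the confidence level, so the threshold rule below (keep elements whose confidence is at least 0.5) is an assumption, as are all function names, the toy confidence function, and the toy mean-value model that stands in for the real forecasting model.

```python
def night_confidence(sample):
    """Toy confidence: 1 for hours 0-6, decreasing linearly to 0 at hour 8."""
    hour = sample["hour"]
    if hour <= 6:
        return 1.0
    return max(0.0, 1.0 - (hour - 6) / 2)


def select_training_subsample(samples, confidence_fn, threshold=0.5):
    """Step 1: keep the elements the expert is confident about (assumed rule)."""
    return [s for s in samples if confidence_fn(s) >= threshold]


def fit_mean_model(subsample):
    """Step 2: fit a placeholder model that always predicts the mean load."""
    if not subsample:
        raise ValueError("empty training subsample")
    mean_load = sum(s["load"] for s in subsample) / len(subsample)
    return lambda sample: mean_load


samples = [
    {"hour": 2, "load": 100.0},
    {"hour": 5, "load": 110.0},
    {"hour": 7, "load": 150.0},
    {"hour": 12, "load": 200.0},
]

night_subsample = select_training_subsample(samples, night_confidence)
night_model = fit_mean_model(night_subsample)
print(len(night_subsample), night_model({"hour": 3}))
```

Other readings are possible, for example weighting each training element by its confidence level instead of thresholding.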

To compare the scheme of “smooth mixing” with the scheme of “sleeping experts”, the experiments on expert decision aggregation were performed in two stages. In the first stage, only the scheme of mixing sleeping and awake experts was used, i.e., the confidence level took only the two values 0 and 1. In the second stage, the mixing algorithm from Section 2 of this work was used.

The evolution of the differences of cumulative losses $$L^1_T-L^3_T$$ and $$L^2_T-L^3_T$$, where $$L^1_T$$ is the cumulative loss of the anytime nonsleeping Random Forest algorithm and $$L^2_T$$, $$L^3_T$$ are the cumulative losses of the two mixing schemes (“sleeping experts” and “smooth mixing”), is shown in Fig. 4a.

The mean cumulative losses (Mean Absolute Error, MAE) $$\frac{1}{T}L^1_T$$ of the Random Forest algorithm and $$\frac{1}{T}L^2_T$$, $$\frac{1}{T}L^3_T$$ of the two schemes of expert mixing are shown in Fig. 4b (see footnote 5).
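The quantities plotted in Fig. 4 can be computed from per-step forecasts as sketched below, assuming the absolute loss mentioned in the footnotes. The function names and the toy numbers are illustrative only and are not taken from the experiments.

```python
from itertools import accumulate


def absolute_losses(outcomes, forecasts):
    """Per-step absolute loss |outcome - forecast|."""
    return [abs(o - f) for o, f in zip(outcomes, forecasts)]


def cumulative_losses(losses):
    """Cumulative loss L_T for T = 1, 2, ..."""
    return list(accumulate(losses))


def running_mae(losses):
    """Mean cumulative loss L_T / T for T = 1, 2, ..."""
    return [total / t for t, total in enumerate(accumulate(losses), start=1)]


def cumulative_difference(losses_a, losses_b):
    """Difference of cumulative losses L^a_T - L^b_T for T = 1, 2, ..."""
    return [
        a - b
        for a, b in zip(cumulative_losses(losses_a), cumulative_losses(losses_b))
    ]


# Toy data (not from the paper).
outcomes = [10, 12, 11]
forecasts_a = [9, 14, 11]   # e.g. a single nonsleeping expert
forecasts_b = [10, 13, 12]  # e.g. an aggregated forecast

losses_a = absolute_losses(outcomes, forecasts_a)
losses_b = absolute_losses(outcomes, forecasts_b)
print(cumulative_difference(losses_a, losses_b))
print(running_mae(losses_a))
```

A positive value of the difference $$L^a_T-L^b_T$$ means that forecaster b has accumulated less loss than forecaster a up to time $$T$$.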

In this experiment, the “smooth mixing” algorithm outperforms the aggregating algorithm using “sleeping experts”, and both of these algorithms outperform the anytime Random Forest forecasting algorithm.

## 4 Conclusion

In this paper we extend the AdaHedge algorithm by de Rooij et al. (2014) to the case of shifting experts and to a smooth version of the method of specialized experts, where at every time step each expert’s forecast is provided with a confidence level, which is a number between 0 and 1.

To aggregate expert predictions, we use methods of shifting experts and the AdaHedge algorithm with an adaptive learning rate. In this way, we combine the advantages of both algorithms. We use the shifting regret, which is a more optimal characteristic of the algorithm, and we do not impose restrictions on the expert losses. We also incorporate in this scheme a smooth version of the method of specialized experts by Blum and Mansour (2007), which allows us to make more flexible and accurate predictions.

We obtain new upper bounds for the regret of our algorithms, which generalize similar upper bounds for the case of specialized experts.

A disadvantage of Theorem 1 and of Corollary 1 is the asymmetry of the bounds (9) and (10): the first of them has a term that depends quadratically on the number k of switches. Whether such a dependence is necessary is an open question.

All results are obtained in the adversarial setting: no assumptions are made about the nature of the data source.

We present the results of numerical experiments on short-term forecasting of electricity consumption based on real data. In these experiments, the “smooth mixing” algorithm outperforms the aggregating algorithm with “sleeping experts”, and both of these algorithms outperform the anytime Random Forest forecasting algorithm.

## Footnotes

1. In the simple Hedge we put $$w_{i,t+1}=w^\mu _{i,t}$$. Some other mixing schemes will be given below.

2. The notion of regret with respect to a comparison vector was first defined by Kivinen and Warmuth (1999).

3. Whether this bound is tight is an open question. Some lower bounds for mixloss (for the logarithmic loss function with the learning rate $$\eta =1$$) were obtained by Adamskiy et al. (2012). They show an information-theoretic lower bound for mixloss that must hold for any algorithm and that is tight for Fixed Share.

4. In our experiments, the absolute loss function $$\lambda (\omega ,\gamma )=|\omega -\gamma |$$ was used, where $$\omega$$ and $$\gamma$$ are real numbers. In practical applications, we can also use its biased variant $$\lambda (\omega ,\gamma )=\mu _1|\omega -\gamma |_{-}+\mu _2|\omega -\gamma |_{+}$$, where $$|r|_{-}=-\min \{0,r\}$$ and $$|r|_{+}=\max \{0,r\}$$. The positive numbers $$\mu _1$$ and $$\mu _2$$ provide a balance of losses between the deviations of the forecasts $$\gamma$$ and outcomes $$\omega$$ in the positive and negative directions.

5. The absolute loss function was used in these experiments.

6. Mixloss is a very useful intermediate concept: its cumulative variant is less than or equal to the cumulative loss of the best expert (up to a small term), and, on the other hand, the cumulative mixloss is close to the cumulative loss of the aggregating algorithm. For the logarithmic loss function, the mixloss coincides with the loss of the Vovk aggregating algorithm (see Adamskiy et al. 2012; Cesa-Bianchi and Lugosi 2006; de Rooij et al. 2014).

## References

1. Adamskiy, D., Koolen, W. M., Chernov, A., & Vovk, V. (2012). A closer look at adaptive regret. In N. H. Bshouty, G. Stoltz, N. Vayatis, & T. Zeugmann (Eds.), Algorithmic learning theory. ALT 2012. Lecture notes in computer science (Vol. 7568). Berlin, Heidelberg: Springer.
2. Blum, A., & Mansour, Y. (2007). From external to internal regret. Journal of Machine Learning Research, 8, 1307–1324.
3. Bousquet, O., & Warmuth, M. (2002). Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3, 363–396.
4. Chernov, A., & Vovk, V. (2009). Prediction with expert evaluators’ advice. In R. Gavaldà, G. Lugosi, T. Zeugmann, & S. Zilles (Eds.), Proceedings of the twentieth international conference on algorithmic learning theory. Lecture notes in computer science (Vol. 5809, pp. 8–22). Berlin: Springer.
5. Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
6. Cesa-Bianchi, N., Mansour, Y., & Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3), 321–352.
7. de Rooij, S., van Erven, T., Grunwald, P., & Koolen, W. (2014). Follow the leader. If you can, hedge if you must. Journal of Machine Learning Research, 15, 1281–1316.
8. Devaine, M., Gaillard, P., Goude, Y., & Stoltz, G. (2013). Forecasting electricity consumption by aggregating specialized experts. Machine Learning, 90(2), 231–260.
9. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
10. Freund, Y., Schapire, R. E., Singer, Y., & Warmuth, M. K. (1997). Using and combining predictors that specialize. In Proc. 29th Annual ACM Symposium on Theory of Computing, 334–343.
11. Gaillard, P., Goude, Y., & Stoltz, G. (2011). A further look at the forecasting of the electricity consumption by aggregation of specialized experts. Technical report. pierre.gaillard.me/doc/GaGoSt-report.pdf.
12. Gaillard, P., Stoltz, G., & van Erven, T. (2014). A second-order bound with excess losses. JMLR: Workshop and Conference Proceedings, 35, 176–196.
13. Herbster, M., & Warmuth, M. (1998). Tracking the best expert. Machine Learning, 32(2), 151–178.
14. Hong, T., Pinson, P., & Fan, S. (2014). Global energy forecasting competition 2012. International Journal of Forecasting, 30(2), 357–363.
15. Kalnishkan, Y., Adamskiy, D., Chernov, A., & Scarfe, T. (2015). Specialist experts for prediction with side information. IEEE international conference on data mining workshop (ICDMW). IEEE, 1470–1477.
16. Kivinen, J., & Warmuth, M. K. (1999). Averaging expert prediction. In P. Fisher & H. U. Simon (Eds.), Computational learning theory: 4th european conference (EuroColt ’99). 153–167, Springer.
17. Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
18. Vovk, V. (1990). Aggregating strategies. In M. Fulk & J. Case (Eds.), Proceedings of the 3rd annual workshop on computational learning theory, 371–383, San Mateo, CA, Morgan Kaufmann.
19. Vovk, V. (1998). A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2), 153–173.
20. Vovk, V. (1999). Derandomizing stochastic prediction strategies. Machine Learning, 35(3), 247–282.
21. V’yugin, V. (2017). Online aggregation of unbounded signed losses using shifting experts. Proceedings of machine learning research. 60: 1–15. http://proceedings.mlr.press/v60/

## Copyright information

© The Author(s) 2018

## Authors and Affiliations

• Vladimir V’yugin (1)
• Vladimir Trunov (1)

1. Institute for Information Transmission Problems, Moscow, Russia
