Aggregating Algorithm for prediction of packs

This paper formulates a protocol for prediction of packs, which is a special case of on-line prediction under delayed feedback. Under the prediction of packs protocol, the learner must make a few predictions without seeing the respective outcomes and then the outcomes are revealed in one go. The paper develops the theory of prediction with expert advice for packs by generalising the concept of mixability. We propose a number of merging algorithms for prediction of packs with tight worst case loss upper bounds similar to those for Vovk’s Aggregating Algorithm. Unlike existing algorithms for delayed feedback settings, our algorithms do not depend on the order of outcomes in a pack. Empirical experiments on sports and house price datasets are carried out to study the performance of the new algorithms and compare them against an existing method.


Introduction
This paper deals with the on-line prediction protocol, where the learner needs to predict outcomes $\omega_1, \omega_2, \ldots$ occurring in succession. The learner receives feedback along the way.
In the basic on-line prediction protocol, on step $t$ the learner outputs a prediction $\gamma_t$ and then immediately sees the true outcome $\omega_t$. The quality of the prediction is assessed by a loss function $\lambda(\gamma, \omega)$ measuring the discrepancy between the prediction and the outcome or, generally speaking, quantifying the (adverse) effect when a prediction $\gamma$ confronts the outcome $\omega$. The performance of the learner is assessed by the cumulative loss over $T$ trials, $\mathrm{Loss}(T) = \sum_{t=1}^T \lambda(\gamma_t, \omega_t)$. In a protocol with delayed feedback, there may be a delay in receiving the true outcomes: the learner may need to make a few predictions before actually seeing the outcomes of past trials. We will consider a special case of that protocol where outcomes come in packs: the learner needs to make a few predictions, then all outcomes are revealed, and again a few predictions need to be made.
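The pack protocol just described can be sketched in code as follows (a minimal illustration; the function and class names, and the use of square loss for concreteness, are ours and not taken from the paper):

```python
def run_packs_protocol(learner, packs):
    """Sketch of the prediction-of-packs protocol: on each trial the
    learner makes all K_t predictions first; only then is the whole
    pack of outcomes revealed.  Square loss is used for concreteness.
    `packs` is a list of (signals, outcomes) pairs of equal length."""
    total = 0.0
    for signals, outcomes in packs:
        preds = [learner.predict(x) for x in signals]   # no feedback inside a pack
        total += sum((g - w) ** 2 for g, w in zip(preds, outcomes))
        learner.observe(signals, outcomes)              # outcomes arrive in one go
    return total

class ConstantLearner:
    """Toy learner predicting a fixed value, just to exercise the protocol."""
    def __init__(self, c):
        self.c = c
    def predict(self, x):
        return self.c
    def observe(self, signals, outcomes):
        pass  # a real learner would update its state here
```

The essential point is that `learner.observe` is called once per pack, after all the pack's predictions have been made.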
A model problem we consider is prediction of house prices. Consider a dataset consisting of descriptions of houses and sale prices. Suppose that the prices come with transaction dates; it is therefore natural to analyse this dataset in an on-line framework, trying to predict house prices on the basis of past information only.
However, let the timestamps in the dataset contain only the month of the transaction. Every month a few sales occur and we do not know the order. It is natural to try and work out all predicted prices for a particular month on the basis of past months and only then to look at the true prices. One month of transactions makes what we call a pack.
We are concerned in this paper with the problem of prediction with expert advice. Suppose that the learner has access to predictions of a number of experts. Before the learner makes a prediction, it can see the experts' predictions, and its goal is to suffer loss close to that of the retrospectively best expert.
The problem of prediction with expert advice is related to that of on-line optimisation, which has been extensively studied since [Zin03]. These approaches overlap to a very large extent and many results make sense in both frameworks. The problem of delayed feedback has mostly been studied within the on-line optimisation approach, e.g., in [JGS13, QK15]. However, in this paper we will stick to the terminology and approach of prediction with expert advice going back to [LW94] and surveyed in [CBL06]. Our starting point and main tool is Vovk's Aggregating Algorithm [Vov90, Vov98], which provides a solution optimal in a certain sense.
Our key idea is to consider a pack as a single outcome. We study mixability of the resulting game and develop a few algorithms for prediction of packs based on the Aggregating Algorithm. We obtain upper bounds on their performance and discuss optimality properties. The reason why we need different algorithms is that the situation when the pack size varies from step to step can be addressed in different ways leading to different bounds.
The key result of the theory of delayed feedback, stating that the regret multiplies by the magnitude of the delay (see [JGS13, WO02]), cannot be improved, but it receives an interpretation in the context of the theory of prediction with expert advice, with specific lower bounds of the Aggregating Algorithm type. In empirical studies our new algorithms show more stable performance than the existing algorithm based on running parallel copies of the merging procedure.
We carry out an empirical investigation on the London and Ames house prices datasets. The experiments follow the approach of [KACS15]: prediction with expert advice can be used to find relevant past information. Predictors trained on different sections of past data can be combined in the on-line mode so that prediction is carried out using relevant past data.
The paper is organised as follows. In Section 2 the theory of the Aggregating Algorithm is surveyed. In Section 3 we formulate the protocols for prediction of packs and then study the mixability of the resulting games. This analysis leads to the Aggregating Algorithm for Prediction of Packs formulated in Section 4. The theory can be applied in different ways leading to different loss bounds, hence a few variations of the algorithm. We also describe the algorithm based on running parallel copies of the Aggregating Algorithm: it is a straightforward adaptation of an existing delayed feedback algorithm to our problem. Empirical experiments are described in Section 6.
As the upper bounds on the loss are based on the theory of the Aggregating Algorithm, most of them are tight in the worst case. As a digression from the prediction with expert advice framework, in Section 5 we prove a self-contained lower bound for prediction of packs in the context of the mix loss protocol of [AKCV16].

Prediction with Expert Advice
In this section we formulate the classical problem of prediction with expert advice.
A game $G = \langle \Omega, \Gamma, \lambda \rangle$ is a triple of an outcome space $\Omega$, a prediction space $\Gamma$, and a loss function $\lambda : \Gamma \times \Omega \to [0, +\infty]$. Outcomes $\omega_1, \omega_2, \ldots \in \Omega$ occur in succession. A learner, or prediction strategy, outputs predictions $\gamma_1, \gamma_2, \ldots \in \Gamma$ before seeing each respective outcome. The learner may have access to some side information; we will say that on each step the learner sees a signal $x_t$ coming from a signal space $X$.
The framework is summarised in Protocol 1.
FOR t = 1, 2, ...
  nature announces $x_t \in X$
  learner outputs $\gamma_t \in \Gamma$
  nature announces $\omega_t \in \Omega$
  learner suffers loss $\lambda(\gamma_t, \omega_t)$
END FOR

Over $T$ trials the learner $S$ suffers the cumulative loss $\mathrm{Loss}_T = \mathrm{Loss}_T(S) = \sum_{t=1}^T \lambda(\gamma_t, \omega_t)$. In this paper we assume a full information environment. The learner knows $\Omega$, $\Gamma$, and $\lambda$. It sees all $\omega_t$ as they become available. On the other hand, we make no assumptions on the mechanism generating $\omega_t$ and will be interested in worst-case guarantees for the loss.

Now let $\{E_\theta \mid \theta \in \Theta\}$ be a set of learners working according to Protocol 1 and parametrised by $\theta \in \Theta$. We will refer to these learners as experts and to the set as the pool of experts. If the pool is finite and $|\Theta| = N$, we will refer to the experts as $E_1, E_2, \ldots, E_N$. Suppose that on each turn their predictions are made available to a learner $S$ as a special kind of side information. The learner then works according to the following protocol.
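The interaction of this prediction-with-expert-advice loop can be sketched as follows (our names throughout; the toy `MeanMerger` merely stands in for a genuine merging strategy such as the AA of the next section):

```python
def run_with_experts(merger, experts, signals, outcomes, loss):
    """Sketch of Protocol 2: on each trial the learner sees the signal and
    the experts' predictions, predicts, and then the outcome is revealed.
    Returns cumulative losses: one per expert, then the learner's last."""
    totals = [0.0] * (len(experts) + 1)
    for x, omega in zip(signals, outcomes):
        advice = [e(x) for e in experts]     # experts' predictions
        gamma = merger.predict(x, advice)    # learner's own prediction
        for n, g in enumerate(advice):
            totals[n] += loss(g, omega)
        totals[-1] += loss(gamma, omega)
        merger.update(advice, omega)         # full-information feedback
    return totals

class MeanMerger:
    """Toy merging strategy: predicts the average of the experts' advice.
    A real strategy would reweight the experts in update()."""
    def predict(self, x, advice):
        return sum(advice) / len(advice)
    def update(self, advice, omega):
        pass
```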
The goal of the learner in this setup is to suffer loss close to that of the best expert in retrospect. We look for merging strategies giving guarantees of the type $\mathrm{Loss}_T(S) \lesssim \mathrm{Loss}_T(E_\theta)$ for all $\theta \in \Theta$, all sequences of outcomes, and as many $T$ as possible.
The merging strategies we are interested in are computable in some natural sense; we will not make exact statements about computability, though. We do not impose any restrictions on experts. In what follows, the reader may substitute the clause 'for all predictions $\gamma^\theta_t$ appearing in Protocol 2' for the more intuitive clause 'for all experts'.

Aggregating Algorithm
In this section we present Vovk's Aggregating Algorithm (AA) after [Vov90, Vov98]. In this paper we restrict ourselves to finite pools of experts, but the AA can be straightforwardly extended to countable pools (by considering infinite sums) and even larger pools (by replacing sums with integrals).
The AA works as follows. It takes as parameters a prior distribution $p_1, p_2, \ldots, p_N$ (such that $p_n \ge 0$ and $\sum_{n=1}^N p_n = 1$), a learning rate $\eta > 0$, and a constant $C$ admissible for $\eta$.
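For concreteness, here is a minimal sketch of the AA specialised to the square-loss game on $[A, B]$ (discussed later in the paper), with $C = 1$ at the maximal mixable rate. The substitution formula below is the standard one for this game, adapted from $[-Y, Y]$ to $[A, B]$; all identifiers are ours:

```python
import math

def aggregating_algorithm(expert_preds, outcomes, A, B, prior=None):
    """Sketch of the AA for the square-loss game on [A, B], at the maximum
    learning rate eta = 2/(B-A)**2 for which the game is mixable (C = 1).
    expert_preds[t][n] is expert n's prediction on trial t."""
    N = len(expert_preds[0])
    eta = 2.0 / (B - A) ** 2
    w = list(prior) if prior is not None else [1.0 / N] * N
    predictions = []
    for preds, omega in zip(expert_preds, outcomes):
        s = sum(w)

        def g(o):  # generalised prediction: -(1/eta) ln sum_n w_n e^{-eta loss}
            return -math.log(sum((wn / s) * math.exp(-eta * (p - o) ** 2)
                                 for wn, p in zip(w, preds))) / eta

        # substitution: a prediction gamma with (gamma - o)^2 <= g(o) on [A, B]
        gamma = (A + B) / 2 + (g(A) - g(B)) / (2 * (B - A))
        predictions.append(gamma)
        # weight update: w_n <- w_n * exp(-eta * loss of expert n)
        w = [wn * math.exp(-eta * (p - omega) ** 2) for wn, p in zip(w, preds)]
    return predictions
```

With a single expert the substitution reproduces that expert's prediction exactly, which is a useful sanity check.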
One can check by induction that the equality holds for all $t = 1, 2, \ldots$. Dropping all terms but one on the right-hand side yields the desired inequality.
The importance of the AA follows from the results of [Vov98]. Under some mild regularity assumptions on the game, and assuming the uniform initial distribution, it can be shown that the constants in (2) are optimal. If any merging strategy achieves the guarantee $\mathrm{Loss}_T(S) \le C\,\mathrm{Loss}_T(E_n) + A \ln N$ for all $n = 1, 2, \ldots, N$, all time horizons $T$, and all outcomes, then the AA with the uniform prior distribution $p_n = 1/N$ and some $\eta > 0$ provides the guarantee with the same or lower $C$ and $A$.

Protocol
Consider the following extension of Protocol 1.
In summary, at every trial $t$ the learner needs to make $K_t$ predictions rather than one.

Suppose that the learner may get help from experts. We can extend Protocol 2 as follows.

There can be subtle variations of this protocol. Instead of getting all $K_t$ predictions from each expert at once, the learner may be getting predictions for each outcome one by one, making its own before seeing the next set of experts' predictions. For most of our analysis this does not matter, as we will see later. The learner may have to work on each 'pack' of experts' predictions sequentially, without even knowing its size in advance. The only thing that makes a difference is that the outcomes come in one go, after the learner has finished predicting the pack.

Mixability
For a game $G = \langle \Omega, \Gamma, \lambda \rangle$ and a positive integer $K$, consider the game $G^{(K)}$ with the outcome and prediction spaces given by the Cartesian products $\Omega^K$ and $\Gamma^K$ and the loss function $\lambda^{(K)}(\gamma, \omega) = \sum_{k=1}^K \lambda(\gamma_k, \omega_k)$. What are the mixability constants for this game? Let $C_\eta$ be the constants for $G$ and $C^{(K)}_\eta$ be the constants for $G^{(K)}$.
The following lemma provides an upper bound for $C^{(K)}_{\eta/K}$.

Lemma 1. For every game $G$ we have $C^{(K)}_{\eta/K} \le C_\eta$.

Proof. Let $C$ be admissible for $G$ with the learning rate $\eta$. For each $k = 1, 2, \ldots, K$ pick a prediction $\gamma_k$ solving (1), so that $e^{-\eta \lambda(\gamma_k, \omega_k)/C} \ge \sum_{n=1}^N p_n e^{-\eta \lambda(\gamma^n_k, \omega_k)}$ for every $\omega_k \in \Omega$. Multiplying these inequalities over $k$ yields $e^{-\eta \lambda^{(K)}(\gamma, \omega)/C} \ge \prod_{k=1}^K \sum_{n=1}^N p_n e^{-\eta \lambda(\gamma^n_k, \omega_k)}$.
We will now apply the generalised Hölder inequality. On measure spaces, the inequality states that $\bigl\| \prod_{k=1}^K f_k \bigr\|_r \le \prod_{k=1}^K \| f_k \|_{r_k}$, where $\sum_{k=1}^K 1/r_k = 1/r$ (this follows from the version of the inequality in Section 9.3 of [Loè77] by induction). Interpreting a vector $x = (x_1, x_2, \ldots, x_N)$ as a function on a discrete space $\{1, 2, \ldots, N\}$ and introducing on this space the measure $\mu(n) = p_n$, we obtain the inequality for weighted sums. Letting $r_k = 1$ and $r = 1/K$, and raising the resulting inequality to the power $1/K$, completes the proof.
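Written out in the notation of the lemma (our rendering), the application of Hölder's inequality with $\mu(n) = p_n$, $f_k(n) = e^{-\eta \lambda(\gamma^n_k, \omega_k)}$, $r_k = 1$, and $r = 1/K$ reads:

```latex
\left( \sum_{n=1}^N p_n \prod_{k=1}^K e^{-(\eta/K)\lambda(\gamma^n_k,\,\omega_k)} \right)^{K}
= \Bigl\| \textstyle\prod_{k=1}^K f_k \Bigr\|_{1/K}
\le \prod_{k=1}^K \| f_k \|_{1}
= \prod_{k=1}^K \sum_{n=1}^N p_n e^{-\eta\lambda(\gamma^n_k,\,\omega_k)} .
```

Raising this to the power $1/K$ bounds the mixture for $G^{(K)}$ at the learning rate $\eta/K$ by the geometric mean of the base-game mixtures at the rate $\eta$.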
Remark 1. Note that the proof of the lemma offers a constructive way of solving (1) for $G^{(K)}$ provided we know how to solve (1) for $G$. Namely, to solve (1) for $G^{(K)}$ with the learning rate $\eta/K$, we solve $K$ systems for $G$ with the learning rate $\eta$.
In order to get a lower bound for $C^{(K)}_{\eta/K}$, we need the concept of a superprediction. A superprediction is a generalised prediction minorised by some prediction, i.e., a superprediction is a function $f : \Omega \to [0, +\infty]$ such that for some $\gamma \in \Gamma$ we have $f(\omega) \ge \lambda(\gamma, \omega)$ for all $\omega \in \Omega$. The shape of the set of superpredictions plays a crucial role in determining $C_\eta$.

For a wide class of games the following implication holds: if the game is mixable (i.e., $C_\eta = 1$ for some $\eta > 0$), then its set of superpredictions is convex (Lemma 7 in [KVV04] essentially proves this for games with finite sets of outcomes).
Theorem 1. For a game $G$ with a convex set of superpredictions, any positive integer $K$, and learning rate $\eta > 0$, we have $C^{(K)}_{\eta/K} = C_\eta$.

We need to make a simple observation on the behaviour of $C_\eta$.

Lemma 3. For every game $G$, the value of $C_\eta$ is non-decreasing in $\eta$.
Proof. Suppose that (1) holds for $C$ with the learning rate $\eta_1$, and let $\eta_2 \le \eta_1$. Raising the inequality to the power $\eta_2/\eta_1 \le 1$ and using Jensen's inequality yields (1) for $\eta_2$.

Remark 2. The proof is again constructive in the following sense. If we know how to solve (1) for $G$ with a learning rate $\eta_1$ and an admissible $C$, we can solve (1) for $\eta_2 \le \eta_1$ and the same $C$.
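In symbols (our rendering of the one-line computation): if $C$ is admissible for $\eta_1$, i.e., $e^{-\eta_1 \lambda(\gamma, \omega)/C} \ge \sum_{n=1}^N p_n e^{-\eta_1 \lambda(\gamma_n, \omega)}$, then for $\eta_2 \le \eta_1$,

```latex
e^{-\eta_2 \lambda(\gamma,\omega)/C}
= \left( e^{-\eta_1 \lambda(\gamma,\omega)/C} \right)^{\eta_2/\eta_1}
\ge \left( \sum_{n=1}^N p_n e^{-\eta_1 \lambda(\gamma_n,\omega)} \right)^{\eta_2/\eta_1}
\ge \sum_{n=1}^N p_n e^{-\eta_2 \lambda(\gamma_n,\omega)} ,
```

where the last step is Jensen's inequality for the concave function $x \mapsto x^{\eta_2/\eta_1}$. Hence the same $C$ is admissible for $\eta_2$.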
Corollary 1. For every game $G$ and positive integers $K_1 \le K_2$, we have $C^{(K_1)}_{\eta/K_2} \le C^{(K_1)}_{\eta/K_1} \le C_\eta$.

Remark 3. Suppose we play the game $G^{(K_1)}$ but have to use the learning rate $\eta/K_2$ with $C$ admissible for $G$ with $\eta$. To solve (1), we can take $K_1$ solutions of (1) for $G$ with the learning rate $\eta$.

Prediction with Plain Bounds
Suppose that in Protocol 5 the sizes of all packs are equal, $K_1 = K_2 = \ldots = K$, and the number $K$ is known in advance. The proof of Lemma 1 suggests the following merging strategy, which we will call the Aggregating Algorithm for Equal Packs (AAP-e), given as Protocol 6: initialise the weights, and then on every trial read the experts' predictions, output a pack of predictions, and, once the outcomes are revealed, update the experts' weights. The resulting loss bound holds for all outcomes and experts' predictions as long as the pack size is $K$.
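A sketch of AAP-e for the square-loss game on $[A, B]$ (our code and names; following Remark 1, predictions within a pack are computed with the base rate $\eta$, while the weight update after the pack uses $\eta/K$):

```python
import math

def aap_e(expert_packs, outcome_packs, A, B, K):
    """Sketch of AAP-e: the AA applied to the pack game with learning rate
    eta/K, square loss on [A, B], all packs of the same known size K.
    expert_packs[t][n][k] is expert n's k-th prediction in pack t."""
    N = len(expert_packs[0])
    eta = 2.0 / (B - A) ** 2
    w = [1.0 / N] * N
    out = []
    for preds, outcomes in zip(expert_packs, outcome_packs):
        s = sum(w)
        pack = []
        for k in range(K):
            def g(o, k=k):  # generalised prediction for slot k, at rate eta
                return -math.log(sum((w[n] / s) *
                                     math.exp(-eta * (preds[n][k] - o) ** 2)
                                     for n in range(N))) / eta
            # standard square-loss substitution on [A, B]
            pack.append((A + B) / 2 + (g(A) - g(B)) / (2 * (B - A)))
        out.append(pack)
        # the whole pack's outcomes arrive; update weights at rate eta/K
        for n in range(N):
            pack_loss = sum((preds[n][k] - outcomes[k]) ** 2 for k in range(K))
            w[n] *= math.exp(-(eta / K) * pack_loss)
    return out
```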
Lemma 2 shows that the constants in this bound cannot be improved for equal weights, provided $G$ has a convex set of superpredictions (and $G^{(K)}$ satisfies the conditions of the optimality of the AA). Now suppose that the $K_t$ differ. To begin with, suppose that we know a $K$ upper bounding all $K_t$. Consider the following algorithm, the Aggregating Algorithm for Packs with the Known Maximum (AAP-max).
The algorithm initialises the weights, and on every trial reads the experts' predictions, outputs a pack of predictions, observes the outcomes $\omega_{t,k}$, $k = 1, 2, \ldots, K_t$, and updates the experts' weights (step (9)). The essential point here is step (9): we divide the exponent by the maximum $K$. Corollary 1 and Remark 3 imply the following result, which holds for all outcomes and experts' predictions as long as the pack size does not exceed $K$.
Clearly, the constants in this bound cannot be improved, in the same sense as above, because of the case where all packs have the maximum size $K$. However, the algorithm clearly uses a suboptimal learning rate on steps with $K_t < K$. We will address this later. Now consider the case where $K$ is not known in advance. A simple trick allows one to handle this inconvenience. Consider the following algorithm, the Aggregating Algorithm for Packs with an Unknown Maximum (AAP-incremental).
The algorithm initialises the experts' cumulative losses and maintains weights proportional to $p_n e^{-\eta\,\mathrm{Loss}_t(E_n)/K}$, so that (3) holds with $K$ equal to the maximum pack size over the first $t$ trials. Suppose that the inequality holds on trial $t$. If on trial $t+1$ the pack size does not exceed $K$, we essentially use the AA with the learning rate $\eta/K$ and maintain the inequality.
If the pack size changes to $K' > K$, we change the learning rate to $\eta/K'$. Raising (3) to the power $K/K' \le 1$ and applying Jensen's inequality yields

$e^{-\eta\,\mathrm{Loss}_t(S)/(CK')} \ge \sum_{n=1}^N p_n e^{-\eta\,\mathrm{Loss}_t(E_n)/K'}$.   (4)

Over the next trial, the inequality is maintained.
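The bookkeeping behind this trick can be sketched as follows (our code; only the maintenance of losses, weights, and the running maximum $K$ is shown, not the substitution step):

```python
import math

def aap_incremental_weights(packs, eta, prior=None):
    """Sketch of AAP-incremental's state: cumulative expert losses and the
    current maximum pack size K.  The weights are
        w_n  proportional to  p_n * exp(-eta * Loss_t(E_n) / K),
    and K simply grows when a bigger pack arrives; inequality (3) is
    preserved by raising it to the power K/K'.
    packs: list of (pack_size, per-expert losses over that pack)."""
    N = len(packs[0][1])
    p = list(prior) if prior is not None else [1.0 / N] * N
    cum = [0.0] * N
    K = 1
    for pack_size, losses in packs:
        K = max(K, pack_size)                 # learning rate eta/K shrinks as needed
        cum = [c + l for c, l in zip(cum, losses)]
    w = [p[n] * math.exp(-eta * cum[n] / K) for n in range(N)]
    s = sum(w)
    return [x / s for x in w], K
```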

Prediction with Bounds on Pack Averages
The bounds in Section 4.1 are optimal if all packs are of the same size. On packs of smaller size there is some slack.

In this section we present an algorithm that fixes this problem. However, it results in an unusual kind of bound.
Consider the following algorithm, Aggregating Algorithm for Pack Averages (AAP-current).
for every expert E n .
The value of D does not need to be known in advance; we can always expand the array as the delay increases.
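A sketch of the parallel-copies baseline (our code; a toy averaging learner stands in for each copy of AA):

```python
class MeanCopy:
    """Toy stand-in for one copy of AA: averages the experts' advice.
    A real copy would reweight its experts in update()."""
    def predict(self, advice):
        return sum(advice) / len(advice)
    def update(self, advice, outcome):
        pass

def parallel_copies(packs, D, make_copy=MeanCopy):
    """Run D independent copies round-robin over the stream of examples, so
    each copy sees an ordinary undelayed sequence.  Note that the assignment
    of examples to copies, and hence the predictions, depends on the order
    inside each pack.  packs: list of lists of (advice, outcome) pairs."""
    copies = [make_copy() for _ in range(D)]
    i = 0
    predictions = []
    for pack in packs:
        served = []
        for advice, _ in pack:
            c = copies[i % D]                 # which copy serves this example
            predictions.append(c.predict(advice))
            served.append(c)
            i += 1
        for c, (advice, outcome) in zip(served, pack):
            c.update(advice, outcome)         # feedback only after the pack
    return predictions
```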
Note that the protocol with delays is more general than the protocol of packs. On the other hand, for the parallel copies of the algorithms the order within the pack matters. This cannot be seen from (5), but obviously happens: it is important which example is picked up by each copy.

A Mix Loss Lower Bound
The loss bounds in the theorems formulated above are often tight due to the optimality of the Aggregating Algorithm. The tightness was discussed after the corresponding results.

In this section we present a self-contained lower bound formulated for the mix loss protocol of [AKCV16]. The proof sheds some more light on the extra term in the bound.

The mix loss protocol covers a number of learning settings, including prediction with a mixable loss function; see Section 2 of [AKCV16] for a discussion. Consider the following protocol, porting the mix loss Protocol 1 from [AKCV16] to prediction of packs.
In Protocol 10, the quantities $L^n_T$ are the counterparts of the experts' total losses. We shall propose a course of action for nature leading to a high value of the regret $L_T - \min_{n=1,2,\ldots,N} L^n_T$.
Lemma 5. For any $K$ arrays of $N$ probabilities $p^1_k, p^2_k, \ldots, p^N_k$, $k = 1, 2, \ldots, K$, where $p^n_k \in [0, 1]$ for all $n = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$ and $\sum_{n=1}^N p^n_k = 1$ for all $k$, there is $n$ such that $\prod_{k=1}^K p^n_k \le 1/N^K$.

Proof. Assume the converse: let $\prod_{k=1}^K p^n_k > 1/N^K$ for all $n$. By the inequality of arithmetic and geometric means, $\frac{1}{K} \sum_{k=1}^K p^n_k \ge \bigl( \prod_{k=1}^K p^n_k \bigr)^{1/K} > 1/N$ for all $n = 1, 2, \ldots, N$. Summing the left-hand side over $n$ yields $\frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N p^n_k = 1$, while summing the right-hand side over $n$ and using the assumption on the products of $p^n_k$ gives a value greater than $N \cdot (1/N) = 1$. The contradiction proves the lemma.
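The lemma is easy to check numerically (our sketch; `math.prod` requires Python 3.8+):

```python
import math
import random

def exists_small_product(P, N):
    """Check the claim of Lemma 5 for the K x N stochastic array P:
    some column n has prod_k P[k][n] <= (1/N)**K (up to float tolerance)."""
    K = len(P)
    return any(math.prod(P[k][n] for k in range(K)) <= (1.0 / N) ** K + 1e-12
               for n in range(N))

# exercise the claim on random stochastic arrays
random.seed(0)
for _ in range(1000):
    K, N = random.randint(1, 4), random.randint(2, 5)
    P = []
    for _ in range(K):
        row = [random.random() + 1e-9 for _ in range(N)]
        s = sum(row)
        P.append([x / s for x in row])
    assert exists_small_product(P, N)
```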
Here is the strategy for nature. Upon getting the probability distributions from the learner, it finds $n_0$ such that $\prod_{k=1}^{K_t} p^{n_0}_{t,k} \le 1/N^{K_t}$ and sets $\ell^{n_0}_{t,1} = \ell^{n_0}_{t,2} = \ldots = \ell^{n_0}_{t,K_t} = 0$ and $\ell^n_{t,k} = +\infty$ for all other $n$ and $k = 1, 2, \ldots, K_t$. The learner suffers loss at least $K_t \ln N$ while $L^{n_0}_t$ stays $0$. We see that over a single pack of size $K$ we can achieve the regret of $K \ln N$. Thus every upper bound on the regret must contain a term of at least $K_1 \ln N$, where $K_1$ is the size of the first pack.

Experiments
In this section, we present some empirical results. Our purpose is twofold. First, we want to study the behaviour of the algorithms described above in practice. Secondly, we want to demonstrate the power of on-line learning.

Datasets and Models
For our experiments, we used two datasets of house prices. There is a tradition of using house prices as a benchmark for machine learning algorithms going back to the Boston housing dataset. However, batch learning protocols have hitherto been used in most studies.

Recently, extensive datasets with timestamps have become available. They call for on-line learning protocols. Property prices are prone to strong movements over time and the pattern of change may be complicated. On-line algorithms should capture these patterns.

Ames House Prices
The first dataset describes the property sales that occurred in Ames, Iowa between 2006 and 2010. The dataset contains records of 2930 house sales transactions with 80 attributes, which are a mixture of nominal, ordinal, continuous, and discrete parameters (including physical property measurements) affecting the property value. The dataset was compiled by Dean De Cock for use in statistics education [DC11] as a modern substitute for the Boston Housing dataset.
There are timestamps in the dataset, but they contain only the month and the year of the purchase. The exact date is not available. Therefore, one cannot apply the on-line protocol directly to the problem, as at each time we observe a vector of outcomes instead of a single outcome. It is natural to try and work out all predicted prices for a particular month on the basis of past months and only then to see the true prices. One month of transactions makes what we call a pack in this paper. We interpret the problem as falling under Protocol 4. The prediction and outcome spaces are a real interval, $\Omega = \Gamma = [A, B]$, and the square loss function $\lambda(\gamma, \omega) = (\gamma - \omega)^2$ is used. This game is mixable, and $\eta = 2/(B - A)^2$ is the maximum learning rate such that $C_\eta = 1$ (see [Vov01]; the derivation for the interval $[-Y, Y]$ can be easily adapted to $[A, B]$).
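A quick numerical sanity check of this mixability claim (our code; the substitution formula is the standard one for the square-loss game, adapted from $[-Y, Y]$ to $[A, B]$):

```python
import math

# Check numerically that for the square-loss game on [A, B] with
# eta = 2/(B-A)**2 the substitution
#     gamma = (A+B)/2 + (g(A) - g(B)) / (2(B-A))
# satisfies (gamma - omega)**2 <= g(omega) for omega in [A, B],
# i.e. C = 1 is admissible for this eta (mixability).
A, B = 2.0, 7.0
eta = 2.0 / (B - A) ** 2
preds = [2.5, 4.0, 6.5]          # some experts' predictions
w = [0.2, 0.5, 0.3]              # their normalised weights

def g(o):                        # generalised prediction
    return -math.log(sum(wn * math.exp(-eta * (p - o) ** 2)
                         for wn, p in zip(w, preds))) / eta

gamma = (A + B) / 2 + (g(A) - g(B)) / (2 * (B - A))
for i in range(101):             # check on a grid of outcomes
    omega = A + (B - A) * i / 100
    assert (gamma - omega) ** 2 <= g(omega) + 1e-9
```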
We apply the AAP algorithms to the Ames house prices dataset. In the first set of experiments our experts are linear regression models based on only two attributes: the neighbourhood and the total square footage of the dwelling. These simple models explain around 80% of the variation in sale prices and they are very easy to train. Each expert has been trained on one month of the first year of the data. Hence there are 12 'monthly' experts.
In the second set of experiments on the Ames house dataset we use random forest (RF) models after [Bel]. A model was built for each quarter of the first year. Hence there are four 'quarterly' experts. They take longer to train but produce better results. Note that 'monthly' RF experts were not practical: training a tree requires a lot of data, and 'monthly' experts returned very poor results.
We then apply the experts to predict the prices starting from year two.

London House Prices
Another dataset that was used to compare the performance of AAP contains house prices in and around London over the period 2009 to 2014. This dataset was made publicly available by the Land Registry in the UK and was originally sourced as part of a Kaggle competition. The Property Price data consists of details for property sales and contains around 1.38 million observations. This dataset was studied before to provide reliable region predictions for Automated Valuation Models of house prices [Bel17].
As with the Ames dataset, we use linear regression models built for each month of the first year of the data as the experts of AAP. The features used in the regression models contain information about the property: property type, whether new build, and whether free- or leasehold. Along with information about the proximity to tubes and railways, the models use the English indices of deprivation 2010, which measure relative levels of deprivation. The following deprivation scores were used in the models: income; employment; health and disability; education for children and skills for adults; barriers to housing and services, with sub-domains for wider barriers and geographical barriers; crime; and the living environment score, with sub-domains for indoor and outdoor living (i.e., quality of housing and of the external environment, respectively). In addition to the general income score, separate scores for income deprivation affecting children and the older population were used.
In the second set of experiments on the London house dataset we use RF models built for each month of the first year as experts. Compared to the Ames dataset, the London house dataset contains enough observations to train RF models on one month of the data. Hence we have 12 'monthly' experts. We start by comparing the family of AAP merging algorithms against parallel copies of AA. While for the AAP algorithms the order of examples in the pack makes no difference, for parallel copies it is important. To analyse the dependency on the order, we ran the parallel copies 500 times, randomly shuffling each pack each time.

Comparison of Merging Algorithms
Figure 1a shows the histogram of total losses of the parallel copies of AA with regression experts on the Ames house dataset. The average total loss of the parallel copies is almost the same as the total losses of AAP-incremental and AAP-max. AAP-current shows the best performance among the AAP algorithms, with a slight improvement over the mean. While the performance of parallel copies can be better, the AAP family provides stable, order-independent performance, which is good on average.
There is one remarkable ordering where parallel copies show greatly superior performance. If packs are ordered by PID (i.e., as in the database), parallel copies suffer substantially lower loss. The PID (parcel identification number) is assigned to each property by the tax assessor and is related to the geographical location. When the packs are ordered by PID, parallel copies benefit from the geographical proximity of the houses: each copy happens to get similar houses.
Figure 1b shows the histogram of total losses of the algorithm with parallel copies of AA with RF experts. In this case, the average total loss of this algorithm is slightly lower than the total losses of AAP-incremental and AAP-max. AA with parallel copies ordered by PID has lower total loss than the average. AAP-current has the lowest total loss among the AAP family and even beats the parallel copies for PID-ordered packs.

Comparison of AAP-incremental and AAP-max
Figure 2a illustrates the difference in total losses of AAP-incremental and AAP-max on the Ames house prices data with regression models. AAP-incremental performs better at the beginning of the period, when the current maximum size of the pack is much lower than the maximum pack of the whole period. After that, AAP-incremental and AAP-max have similar performance and the total losses level out.
Figure 2b illustrates the difference in total losses of AAP-incremental and AAP-max on the Ames house prices data with RF experts. Figures 2c and 2d show the corresponding results for London house prices.

Comparison of AAP-current and AAP-incremental
Figure 3 illustrates the difference in total losses of AAP-current and AAP-incremental. Figures 3a and 3b show the results for Ames house prices with regression and RF experts respectively; Figures 3c and 3d, for London house prices. In all experiments AAP-current steadily outperforms AAP-incremental.
The performance of AAP-current is remarkable because by design it is not optimised to minimise the total loss. The bound of Corollary 2 is weak in comparison to that of Theorem 4. In a way, here we assess AAP-current by a measure it is not good at. Still, the optimal decisions of AAP-current produce superior performance.

Comparison of AAP with Batch Models
In this section we compare AAP-incremental with two straightforward ways of prediction, which are essentially batch. One goal we have here is to do a sanity check and verify that we are not studying the properties of very bad algorithms. Secondly, we want to show that prediction with expert advice may offer better ways of handling the available historical information. The first batch model we compare our on-line algorithms to is the seasonal model that predicts January with the linear regression model trained on January of the first year, February with the linear model of February of the first year, etc.
In the case of 'quarterly' RF experts, we compete with the seasonal model that predicts the first quarter with the RF model trained on the first quarter of the first year, and so on. Secondly, what if we train a model on the whole of the first year? This may be more expensive than training smaller models, but what do we gain in performance? The second batch model is the linear model trained on the whole first year of the data. In the case of RF experts, we compete with the RF model trained on the first year of the data.
Figure 4 shows the comparison of total losses of AAP-current and batch linear regression models for the Ames house dataset. AAP-current consistently performs better than the seasonal batch model. Thus the straightforward utilisation of seasonality does not help.
When compared to the linear regression model of the first year, AAP-current initially has higher losses, but it becomes better towards the end. This can be explained as follows: AAP-current needs time before it becomes good at prediction. These results show that we can make better use of the past data with prediction with expert advice than with models trained in the batch mode.
Table 1 shows the total losses of the algorithms (divided by $10^{12}$). The AAP algorithms always outperform the seasonal batch models. As compared to the linear regression batch models built on the first year of the data, AAP is slightly better on the Ames house dataset and slightly worse on the London house dataset. The RF batch models built on the first year of the datasets consistently outperform the AAP algorithms.
The losses quoted for the parallel copies are the means over 500 random shuffles, as explained above. The experiment was not run for London house prices as it is very time-consuming.

Conclusion
We tested the performance of AAP against the algorithm with parallel copies of AA. We found that the average performance of the algorithm with parallel copies of AA is close to the performance of AAP. AA with parallel copies ordered by PID has lower total loss than the average of AA with parallel copies, which means that a meaningful ordering can have a big impact on the performance of the algorithm. In the absence of such knowledge, the AAP algorithms provide more stable performance.
AAP-current consistently outperforms AAP-incremental and AAP-max on the two datasets. Therefore, we do not need to know the maximum size of the pack in advance.
The experiments also showed that in some cases we can make better use of the past data with AAP than with models trained in the batch mode.
This algorithm essentially applies the AA to $G^{(K)}$ with the learning rate $\eta/K$. If we extend the meaning of Loss for a strategy $S$ working in the environment specified by Protocol 4 as $\mathrm{Loss}_T(S) = \sum_{t=1}^T \sum_{k=1}^{K_t} \lambda(\gamma_{t,k}, \omega_{t,k})$, we get the following theorem.

Theorem 2. If $C$ is admissible for $G$ with the learning rate $\eta$, then the learner following AAP-e suffers loss satisfying $\mathrm{Loss}_T(S) \le C\,\mathrm{Loss}_T(E_n) + \frac{KC}{\eta} \ln \frac{1}{p_n}$.

Theorem 3. If $C$ is admissible for $G$ with the learning rate $\eta$, then the learner following AAP-max suffers loss satisfying $\mathrm{Loss}_T(S) \le C\,\mathrm{Loss}_T(E_n) + \frac{KC}{\eta} \ln \frac{1}{p_n}$, where $K$ is the known upper bound on the pack sizes.

If $C$ is admissible for $G$ with the learning rate $\eta$, then the learner following AAP-incremental suffers loss satisfying the same bound, where $K$ is the maximum pack size over $T$ trials, for all outcomes and experts' predictions. Proof. We will show by induction over time that the inequality $e^{-\eta\,\mathrm{Loss}_t(S)/(CK)} \ge \sum_{n=1}^N p_n e^{-\eta\,\mathrm{Loss}_t(E_n)/K}$ holds with $K$ equal to the maximum pack size over the first $t$ trials.

Comparison of AAP with Parallel Copies of AA

Figure 1: Histogram of total losses. (a) Regression on Ames house prices; (b) RF on Ames house prices.

Figure 2: Comparison of total losses of AAP-incremental and AAP-max. (a) Regression on Ames house prices; (b) RF on Ames house prices; (c) Regression on London house prices; (d) RF on London house prices.

Figure 3: Comparison of total losses of AAP-current and AAP-incremental. (a) Regression on Ames house prices; (b) RF on Ames house prices; (c) Regression on London house prices; (d) RF on London house prices.

Figure 4: Comparison of total losses of AAP and batch models. (a) Loss difference of AAP-current and monthly batch; (b) Loss difference of AAP-current and year batch.

Table 1: Total losses.