Efficient probabilistic reconciliation of forecasts for real-valued and count time series

Hierarchical time series are common in several applied fields. The forecasts for these time series are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.


Introduction
Time series are often organized into a hierarchy. For example, the total visitors of a country can be divided into regions, and the visitors of each region can be further divided into sub-regions. Such data structures are referred to as hierarchical time series; they are common in fields such as retail sales [Makridakis et al., 2021] and energy modelling [Taieb et al., 2021].
The forecasts for hierarchical time series should respect some summing constraints, in which case they are referred to as coherent. For instance, the sum of the forecasts for the sub-regions should match the forecast for the entire region. However, the forecasts independently produced for each time series (base forecasts) are generally incoherent.
Reconciliation algorithms [Hyndman et al., 2011, Wickramasuriya et al., 2019] adjust the incoherent base forecasts, making them coherent. Reconciled forecasts are generally more accurate than base forecasts: indeed, forecast reconciliation is a special case of forecast combination [Hollyman et al., 2021]. An important application of reconciliation algorithms is constituted by temporal hierarchies [Athanasopoulos et al., 2017, Kourentzes and Athanasopoulos, 2021], which make coherent the forecasts produced for the same time series at different temporal scales.
Probabilistic reconciliation has been addressed only recently; earlier attempts [Jeon et al., 2019, Taieb et al., 2021], though experimentally effective, lacked a strong formal justification. For the case of Gaussian base forecasts, Corani et al. [2020] obtain the reconciled distribution in analytical form, introducing the approach of reconciliation via conditioning. Panagiotelis et al. [2023] provide a framework for probabilistic reconciliation via projection; however, this approach cannot reconcile discrete distributions. Corani et al. [2023] perform probabilistic reconciliation via conditioning of count time series by adopting the concept of virtual evidence [Pearl, 1988]. However, its implementation in probabilistic programming, based on Markov Chain Monte Carlo (MCMC), is too slow on large hierarchies; moreover, it requires the base forecast distribution to be in parametric form.
The main contribution of this paper is the Bottom-Up Importance Sampling (BUIS) algorithm, which samples from the reconciled distribution obtained via conditioning, with a substantial speedup with respect to Corani et al. [2023]. BUIS can be used even when the base forecast distribution is only available through samples. This is the case of forecasts returned by models for time series of counts [Liboschik et al., 2017] or based on deep learning [Salinas et al., 2020]. We prove the convergence of BUIS to the actual reconciled distribution. An implementation of the algorithm in the R language is available in the R package bayesRecon [Azzimonti et al., 2023].
We provide two further formal contributions. The first is a definition of coherence for probabilistic forecasts that applies to both discrete and continuous distributions. The second is a novel interpretation of reconciliation via conditioning, in which the base forecast distribution is conditioned on the hierarchy constraints. This allows for a unified treatment of the reconciliation of discrete and continuous forecast distributions. We test our method extensively on temporal hierarchies, reporting positive results for both the accuracy and the efficiency of our method.
The paper is organized as follows. In Sec. 2, we introduce the notation and the reconciliation of point forecasts. In Sec. 3, we introduce our approach to reconciliation via conditioning and compare it to the existing literature. In Sec. 4, we introduce the Bottom-Up Importance Sampling algorithm. We empirically verify its correctness in Sec. 5, while in Sec. 6 we test it on different data sets. We present the conclusions in Sec. 7.
Figure 1: A hierarchy with 4 bottom and 3 upper variables.

Notation
Consider the hierarchy of Fig. 1. We denote by b = [b_1, ..., b_{n_b}]^T the vector of bottom variables, and by u = [u_1, ..., u_{n_u}]^T the vector of upper variables. We then denote by y = [u^T, b^T]^T ∈ R^n, with n = n_u + n_b, the vector of all the variables. The hierarchy can be expressed as a set of linear constraints:

$$ y = \begin{bmatrix} u \\ b \end{bmatrix} = \begin{bmatrix} A \\ I \end{bmatrix} b = S b. \quad (1) $$

We refer to I ∈ R^{n_b × n_b} as the identity matrix, to S ∈ R^{n × n_b} as the summing matrix, and to A ∈ R^{n_u × n_b} as the aggregating matrix. We can thus write the constraints as u = Ab. For example, the aggregating matrix of the hierarchy in Fig. 1 is:

$$ A = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}. $$

A point y ∈ R^n is coherent if it satisfies the constraints given by the hierarchy. We denote by 𝒮 the set of coherent points, which is a linear subspace of R^n:

$$ \mathcal{S} := \{ y \in \mathbb{R}^n : y = Sb, \; b \in \mathbb{R}^{n_b} \}. \quad (2) $$
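As an illustration, the matrices above can be built and the coherence check of (2) applied in a few lines of NumPy (a minimal sketch; the numerical values are invented):

```python
import numpy as np

# Aggregating matrix A for the hierarchy of Fig. 1:
# u1 = b1+b2+b3+b4, u2 = b1+b2, u3 = b3+b4.
A = np.array([
    [1, 1, 1, 1],   # u1: total
    [1, 1, 0, 0],   # u2: sum of the first two bottom series
    [0, 0, 1, 1],   # u3: sum of the last two bottom series
])
n_u, n_b = A.shape

# The summing matrix S stacks A on top of the identity, so that y = S b.
S = np.vstack([A, np.eye(n_b)])

def is_coherent(y, S, tol=1e-9):
    """A point y = (u, b) is coherent iff y = S b, where b is its bottom part."""
    b = y[-S.shape[1]:]
    return np.allclose(y, S @ b, atol=tol)

b = np.array([1.0, 2.0, 3.0, 4.0])
print(is_coherent(S @ b, S))                                    # True
print(is_coherent(np.array([11., 3., 7., 1., 2., 3., 4.]), S))  # False: u1 != 10
```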

Temporal hierarchies
In temporal hierarchies [Athanasopoulos et al., 2017, Kourentzes and Athanasopoulos, 2021], forecasts are generated for the same time series at different temporal scales. For instance, a quarterly time series can be aggregated to the semi-annual and the annual scale. If we are interested in predictions up to one year ahead, we compute four quarterly forecasts q̂_1, q̂_2, q̂_3, q̂_4, two semi-annual forecasts ŝ_1, ŝ_2, and an annual forecast â_1. We then obtain the hierarchy in Fig. 1. The base point forecasts, independently computed at each frequency, are b̂ = [q̂_1, q̂_2, q̂_3, q̂_4]^T and û = [â_1, ŝ_1, ŝ_2]^T.

Point forecasts reconciliation
Let us denote by ŷ = [û^T, b̂^T]^T the vector of the base (incoherent) forecasts. Note that, for ease of notation, we drop the time subscript. Point reconciliation is generally performed in two steps [Hyndman et al., 2011, Wickramasuriya et al., 2019]. First, the reconciled bottom forecasts are computed by linearly combining the base forecasts of the entire hierarchy: b̃ = Gŷ, for some matrix G ∈ R^{n_b × n}. Then, the reconciled forecasts for the whole hierarchy are given by ỹ = S b̃.
The state-of-the-art reconciliation method is MinT [Wickramasuriya et al., 2019], which defines G as:

$$ G = (S^T W^{-1} S)^{-1} S^T W^{-1}, $$

where W is the covariance matrix of the errors of the base forecasts. This method minimizes the expected sum of the squared errors of the reconciled forecasts, under the assumption of unbiased base forecasts.
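The two-step point reconciliation can be sketched as follows (assumptions: the Fig. 1 hierarchy, W = I, and invented base forecasts; MinT in practice estimates W from the base forecast errors):

```python
import numpy as np

def mint_reconcile(y_hat, S, W):
    """b_tilde = G y_hat with G = (S' W^-1 S)^-1 S' W^-1, then y_tilde = S b_tilde."""
    W_inv = np.linalg.inv(W)
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ (G @ y_hat)

# Hierarchy of Fig. 1; incoherent base forecasts (u1 = 15, but the bottom sums to 14).
A = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
S = np.vstack([A, np.eye(4)])
y_hat = np.array([15., 6., 10., 2., 3., 4., 5.])

y_tilde = mint_reconcile(y_hat, S, np.eye(7))  # W = I for illustration
print(y_tilde)  # coherent: each upper entry equals the sum of its bottom entries
```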

Probabilistic reconciliation
Probabilistic reconciliation requires a probabilistic framework, in which forecasts are in the form of probability distributions. We denote by ν̂ ∈ P(R^n) the forecast distribution for y, where P(R^n) is the space of probability measures on (R^n, B(R^n)), and B(R^n) is the Borel σ-algebra on R^n. Moreover, we denote by ν̂_u and ν̂_b the marginal distributions of, respectively, the forecasts for the upper and the bottom components of y.
The forecast distribution ν̂ may be either discrete or absolutely continuous. In the following, if there is no ambiguity, we will use π̂ to denote either its probability mass function, in the former case, or its density, in the latter. Therefore, if ν̂ is discrete, we have

$$ \hat{\nu}(F) = \sum_{x \in F} \hat{\pi}(x) $$

for any F ∈ B(R^n). Note that the sum is well-defined, as π̂(x) > 0 for at most countably many x's. On the contrary, if ν̂ is absolutely continuous,

$$ \hat{\nu}(F) = \int_F \hat{\pi}(x) \, dx $$

for any F ∈ B(R^n).

Probabilistic Reconciliation
We now discuss coherence in the probabilistic framework and our approach to probabilistic reconciliation.
Recall that a point forecast is incoherent if it does not belong to the set 𝒮, defined as in (2). Let ν̂ ∈ P(R^n) be a forecast distribution. Thus, ν̂ is incoherent if there exists a set T of incoherent points, i.e., T ∩ 𝒮 = ∅, such that ν̂(T) > 0; or, equivalently, if supp(ν̂) ⊄ 𝒮. We now define the summing map s : R^{n_b} → R^n as

$$ s(b) := Sb = (Ab, b). \quad (3) $$

The image of s is given by 𝒮. Moreover, from (3) and (1), s is injective. Hence, s is a bijective map between R^{n_b} and 𝒮, with inverse given by s^{-1}(y) = b, where y = (u, b) ∈ 𝒮. As explained in Panagiotelis et al. [2023], for any ν ∈ P(R^{n_b}) we may obtain a distribution in P(𝒮) as s_#ν, namely the pushforward of ν through s:

$$ s_\#\nu(F) = \nu(s^{-1}(F)), \quad F \in \mathcal{B}(\mathcal{S}). $$

In other words, s_# builds a probability distribution for y, supported on the coherent subspace 𝒮, from a distribution on the bottom variables b. Since s is a measurable bijective map, s_# is a bijection between P(R^{n_b}) and P(𝒮), with inverse given by (s^{-1})_# (Appendix A). We thus propose the following definition.
Definition 1. We call coherent distribution the pushforward s_#ν of any distribution ν ∈ P(R^{n_b}).
This definition works with any type of distribution. Moreover, it can be used even if the constraints are not linear, as it does not require s to be a linear map.

Probabilistic reconciliation
The aim of probabilistic reconciliation is to obtain a coherent reconciled distribution, identified by ν̃ ∈ P(R^{n_b}), from the base forecast distribution ν̂ ∈ P(R^n).
The probabilistic bottom-up approach, which simply ignores any probabilistic information about the upper series, is obtained by setting ν̃ = ν̂_b. Panagiotelis et al. [2023] propose a reconciliation method based on projection. Given a continuous map ψ : R^n → 𝒮, the reconciled distribution ν̃ ∈ P(𝒮) is defined as the pushforward of the base forecast distribution ν̂ through ψ:

$$ \tilde{\nu} = \psi_\# \hat{\nu}. $$

Hence, if y_1, ..., y_N are independent samples from ν̂, then ψ(y_1), ..., ψ(y_N) are independent samples from ν̃. The map ψ is expressed as ψ = s ∘ g, where g : R^n → R^{n_b} combines information from all the levels by projecting on the bottom level.
The map g is assumed to be of the form g(y) = d + Gy, and the parameters γ := (d, vec(G)) ∈ R^{n_b + n_b · n} are optimized through stochastic gradient descent (SGD) to minimize a chosen scoring rule. Since it relies on gradient-based optimization, this approach can only be used with continuous distributions.

Probabilistic reconciliation via conditioning
We now present our approach to probabilistic reconciliation, based on conditioning on the hierarchy constraints. Let Ŷ = (Û, B̂) be a random vector representing the probabilistic forecasts, with distribution given by ν̂, so that ν̂_u and ν̂_b are the distributions of Û and B̂.
Let us first suppose that the base forecast distribution ν̂ ∈ P(R^n) is discrete, and let π̂ be its probability mass function. We define ν̃ by conditioning on the coherent subspace 𝒮:

$$ \tilde{\nu}(F) := P(\hat{B} \in F \mid \hat{Y} \in \mathcal{S}) = \frac{\sum_{b \in F} \hat{\pi}(Ab, b)}{\sum_{b \in \mathbb{R}^{n_b}} \hat{\pi}(Ab, b)} \quad (4) $$

for any F ∈ B(R^{n_b}), provided that P(Ŷ ∈ 𝒮) > 0. The sums in (4) are well-defined, as π̂(u, b) = π̂(y) > 0 for at most countably many y's.
Hence, ν̃ is a discrete probability distribution with pmf given by

$$ \tilde{\pi}(b) = \frac{\hat{\pi}(Ab, b)}{\sum_{b' \in \mathbb{R}^{n_b}} \hat{\pi}(Ab', b')}. \quad (5) $$

Note that, if ν̂ is absolutely continuous, we have that ν̂(𝒮) = 0, since the Lebesgue measure of 𝒮 is zero. Hence, P(B̂ ∈ F | Ŷ ∈ 𝒮) is not well-defined. However, if we denote by π̂ the density of ν̂, the expression (5) is still well-posed, with the sum replaced by an integral. We thus give the following definition.

Definition 2. Let ν̂ ∈ P(R^n) be a base forecast distribution. The reconciled distribution through conditioning is defined as the probability distribution ν̃ with density

$$ \tilde{\pi}(b) = \frac{\hat{\pi}(Ab, b)}{\int_{\mathbb{R}^{n_b}} \hat{\pi}(Ab', b') \, db'}, \quad (6) $$

where π̂ and π̃ are the densities of (respectively) ν̂ and ν̃, if ν̂ is absolutely continuous, or the probability mass functions otherwise.
To rigorously derive (6) in the continuous case, we proceed as follows. Let us define the random vector Z := Û − A B̂. Note that the event {Ŷ ∈ 𝒮} coincides with {Z = 0}. The joint density of (Z, B̂) can be easily computed (Appendix A):

$$ \pi_{Z, \hat{B}}(z, b) = \hat{\pi}(z + Ab, b). $$

Then, the conditional density of B̂ given Z = 0 is given by [Çinlar, 2011, Chapter 4]:

$$ \tilde{\pi}(b) = \pi_{\hat{B} \mid Z}(b \mid 0) = \frac{\pi_{Z, \hat{B}}(0, b)}{\int \pi_{Z, \hat{B}}(0, b') \, db'} = \frac{\hat{\pi}(Ab, b)}{\int \hat{\pi}(Ab', b') \, db'}. $$

Finally, note that, if Û and B̂ are independent, (6) may be rewritten as

$$ \tilde{\pi}(b) \propto \hat{\pi}_u(Ab) \, \hat{\pi}_b(b), \quad (7) $$

where π̂_u and π̂_b are the densities of (respectively) ν̂_u and ν̂_b. This approach can be applied to both continuous and discrete distributions, yielding the same expression (6) for the reconciled distribution.
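A tiny worked example of conditioning under independence, as in (7): consider a two-node hierarchy u = b1 + b2 with independent Poisson base forecasts (the means are invented for illustration, and deliberately incoherent, since 9 ≠ 3 + 4); the reconciled pmf is computed by direct enumeration over a truncated support:

```python
from math import exp, factorial

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam_b1, lam_b2, lam_u = 3.0, 4.0, 9.0
K = 60  # truncation of the infinite sums (the neglected tail mass is negligible)

# Reconciled pmf via conditioning: pi(b1, b2) ∝ pi_u(b1+b2) * pi_b1(b1) * pi_b2(b2)
unnorm = {(b1, b2): pois_pmf(b1 + b2, lam_u) * pois_pmf(b1, lam_b1) * pois_pmf(b2, lam_b2)
          for b1 in range(K) for b2 in range(K)}
Z = sum(unnorm.values())
pi_rec = {k: v / Z for k, v in unnorm.items()}

# The reconciled bottom means move up, toward the (higher) upper base forecast.
mean_b1 = sum(b1 * p for (b1, _), p in pi_rec.items())
mean_b2 = sum(b2 * p for (_, b2), p in pi_rec.items())
print(mean_b1, mean_b2)  # both larger than the base means 3 and 4
```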
Given two coherent points y_1, y_2 ∈ 𝒮, the distribution reconciled through conditioning satisfies the following property:

$$ \frac{\tilde{\pi}(y_1)}{\tilde{\pi}(y_2)} = \frac{\hat{\pi}(y_1)}{\hat{\pi}(y_2)} $$

if π̂(y_2) ≠ 0, and π̃(y_2) = 0 if π̂(y_2) = 0; i.e., the relative probabilities of the coherent points are preserved. Moreover, reconciliation via conditioning ignores the behaviour of the base distribution outside the coherent subspace. As shown by (6), ν̃ only depends on the values of ν̂ on 𝒮. Reconciliation via conditioning is therefore invariant under modifications of the base forecast probabilities outside the coherent subspace. This constitutes a major difference with respect to the method of Panagiotelis et al. [2023] that will be thoroughly studied in future work.
In Corani et al. [2023], the authors follow an approach based on virtual evidence [Pearl, 1988] to reconcile discrete forecasts. They set the joint bottom-up distribution as a prior on the entire hierarchy, and the update is made by conditioning on the base upper forecasts, treated as uncertain observations. In contrast, we provide a unified treatment of reconciliation via conditioning for the discrete and the continuous case. Our approach has a clear interpretation, as the conditioning is done on the hierarchy constraints.

Sampling from the reconciled distribution
If the base forecasts are jointly Gaussian, then the reconciled distribution is also Gaussian. In this case, reconciliation via conditioning yields the same mean and variance as MinT [Corani et al., 2020], which is optimal with respect to the log score [Wickramasuriya, 2023].
In general, however, the reconciled distribution is not available in parametric form, hence we need to resort to sampling approaches.We propose a method based on Importance Sampling (IS, Kahn [1950], Elvira and Martino [2021]).

Importance Sampling
Let X be an absolutely continuous random variable with density p. Suppose we want to compute the expectation µ = E[f(X)], for some function f. Importance Sampling estimates µ by sampling from a different distribution q, and by weighting the samples to correct the mismatch between the target p and the proposal q.
In the following, the term density denotes either the probability mass function (for discrete distributions) or the density with respect to the Lebesgue measure (for absolutely continuous distributions). Let q be a density such that q(x) > 0 whenever f(x)p(x) ≠ 0, and let y_1, ..., y_N be independent samples drawn from q. The self-normalized importance sampling estimate [Elvira and Martino, 2021] is:

$$ \hat{\mu} = \frac{\sum_{i=1}^N w(y_i) f(y_i)}{\sum_{i=1}^N w(y_i)}, $$

where w is defined as w(y) = c · p(y)/q(y), for some (typically unknown) constant c.
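A minimal sketch of self-normalized importance sampling, under invented assumptions: the target is a standard normal known only up to a constant, the proposal is a wider normal, and we estimate E[X^2] = 1:

```python
import random, math

def p_unnorm(x):
    """Target density, known only up to a normalizing constant (here: N(0,1))."""
    return math.exp(-0.5 * x * x)

def q_sample():
    return random.gauss(0.0, 2.0)   # proposal: N(0, 2^2)

def q_pdf(x):
    return math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

def snis(f, n=100_000):
    """Self-normalized IS estimate of E[f(X)], X ~ p; the constant c cancels."""
    xs = [q_sample() for _ in range(n)]
    ws = [p_unnorm(x) / q_pdf(x) for x in xs]      # w = c p/q, with c unknown
    return sum(w * f(x) for w, x in zip(ws, xs)) / sum(ws)

random.seed(0)
est = snis(lambda x: x * x)
print(est)   # close to 1.0, the variance of N(0, 1)
```

The normalization by the sum of the weights is what makes the unknown constant c irrelevant, exactly as needed for unnormalized reconciled densities.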

Probabilistic reconciliation via IS
Let ν̃ (Definition 2) be the target distribution. We set ν̂_b as proposal distribution. Given a sample b_1, ..., b_N drawn from ν̂_b, the weights are computed as

$$ w_i = \frac{\hat{\pi}(A b_i, b_i)}{\hat{\pi}_b(b_i)}. \quad (10) $$

Then, (b_i, w̄_i)_{i=1,...,N} is a weighted sample from ν̃, where w̄_i := w_i / Σ_{j=1}^N w_j are the normalized weights. Note that (10) may be interpreted as the conditional density of Û at the point A b_i, given that B̂ = b_i. We thus draw samples (b_i)_i from the base bottom distributions, and then weight how likely they are using the base upper distributions. Under the assumption of independence between B̂ and Û, the density of ν̃ factorizes as in (7), hence:

$$ w_i = \hat{\pi}_u(A b_i). \quad (11) $$

However, IS is affected by the curse of dimensionality [Agapiou et al., 2017]. In Appendix D.2, we empirically show that IS has poor accuracy when reconciling large hierarchies. Another shortcoming of IS is that it is unreliable if the proposal distribution does not approximate the target distribution well. Indeed, we prove in Appendix E that the performance of IS degrades as the Kullback-Leibler divergence between the bottom-up and base forecast distributions (which is related to the incoherence of the base forecasts) increases. The Bottom-Up Importance Sampling (BUIS) algorithm addresses such problems.
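For concreteness, plain IS reconciliation with the weights of (11) can be sketched as follows, assuming independent Gaussian base forecasts on the Fig. 1 hierarchy (all means and standard deviations are invented, with the upper forecasts slightly incoherent):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
mu_b, sd_b = np.array([2., 3., 4., 5.]), np.ones(4)
mu_u, sd_u = np.array([16., 6., 10.]), np.full(3, 3.)   # base bottom sums to 14, not 16

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

N = 100_000
b = rng.normal(mu_b, sd_b, size=(N, 4))       # samples from the bottom base forecasts

# Weights of (11): under independence, w_i is the product over the upper nodes
# of the upper base densities evaluated at the aggregated bottom samples.
w = norm_pdf(b @ A.T, mu_u, sd_u).prod(axis=1)
w /= w.sum()

mean_rec = w @ b    # reconciled bottom means (weighted sample mean)
print(mean_rec)     # pulled upward, toward coherence with the upper forecasts
```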

Bottom-Up Importance Sampling algorithm
First, we state the main assumption of our algorithm. Assumption 1. The base forecasts of each variable are conditionally independent, given the time series observations. We leave for future work the extension of this algorithm to deal with correlations between the base forecasts. In this paper, we perform experiments with temporal hierarchies, which commonly make this assumption.
In order to simplify the presentation, we also assume that the data structure is strictly hierarchical, i.e., that every node only has one parent, and thus the hierarchy is represented by a tree. Grouped time series [Hyndman and Athanasopoulos, 2021, Chapter 11], which do not satisfy this assumption, require a more complex treatment; we discuss it in Sect. 4.5.

Algorithm 1: Bottom-Up Importance Sampling
1: Sample (b^(i))_{i=1,...,N} from π̂_b
2: for l in levels do
3:   for j = 1, ..., k_l do
4:     w^(i) ← π̂_{u_{j,l}}( Σ_{t=1}^{q_{j,l}} b^(i)_{t,(j,l)} )  for i = 1, ..., N
5:     (b^(i)_{(j,l)})_i ← Resample( (b^(i)_{(j,l)}, w^(i))_i )
6:   end for
7: end for
8: return (b^(i))_{i=1,...,N}
The BUIS algorithm exploits the hierarchical structure to split a large n_u-dimensional importance sampling problem into n_u one-dimensional problems, thus greatly alleviating the curse of dimensionality. BUIS starts by drawing a sample from the base bottom distribution ν̂_b. Then, for each level of the hierarchy, from bottom to top, it updates the sample through an importance sampling step, using the "partially" reconciled distribution as proposal.
For each level l = 1, ..., L of the hierarchy, we denote the upper variables at level l by u_{1,l}, ..., u_{k_l,l}. Moreover, for any upper variable u_{j,l}, we denote by b_{1,(j,l)}, ..., b_{q_{j,l},(j,l)} the bottom variables that sum up to u_{j,l}. In this way, we have that Σ_{l=1}^L k_l = n_u, the number of upper variables, while Σ_{j=1}^{k_l} q_{j,l} = n_b, the number of bottom variables, for each level l. Let us consider, for example, the hierarchy in Fig. 1. For the first level l = 1, we have k_1 = 2, u_{1,1} = U2, and u_{2,1} = U3. Moreover, q_{1,1} = q_{2,1} = 2, with b_{1,(1,1)} = B1, b_{2,(1,1)} = B2, b_{1,(2,1)} = B3, and b_{2,(2,1)} = B4. Alg. 1 shows the BUIS algorithm. The "Resample" step samples with replacement from the discrete distribution that assigns probability w̄^(i) = w^(i) / Σ_{j=1}^N w^(j) to the i-th sample, for all i = 1, ..., N. Note that the algorithm can be easily parallelized by drawing batches of samples on different cores. This additional step would further reduce the computational times.
We make the BUIS algorithm explicit on the simple hierarchy in Fig. 1:

1. Sample (b_j^(i))_{i=1,...,N} from π̂_{B_j}, for j = 1, 2, 3, 4
2. Compute the weights (w^(i))_{i=1,...,N} with respect to U2 as w^(i) = π̂_{U2}(b_1^(i) + b_2^(i))
3. Sample (b̃_1^(i), b̃_2^(i))_i with replacement from ((b_1^(i), b_2^(i)), w^(i))_i
4. Repeat steps 2 and 3 using B3, B4, and U3 to get (b̃_3^(i), b̃_4^(i))_i
5. Move to the next level
6. Compute the weights (w^(i))_{i=1,...,N} with respect to U1 as w^(i) = π̂_{U1}(b̃_1^(i) + b̃_2^(i) + b̃_3^(i) + b̃_4^(i))
7. Sample with replacement from ((b̃_1^(i), ..., b̃_4^(i)), w^(i))_i

In Appendix B we prove the following proposition. Proposition 1. The output of the BUIS algorithm is approximately a sample drawn from the reconciled distribution ν̃.
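The walkthrough above can be sketched in NumPy (a minimal sketch under invented Gaussian base forecasts; each weighting step evaluates a one-dimensional normal pdf, and each resampling step draws rows with replacement):

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def resample(cols, w):
    """Sample rows of `cols` with replacement, with probabilities proportional to w."""
    idx = rng.choice(len(w), size=len(w), p=w / w.sum())
    return cols[idx]

# Fig. 1 hierarchy; invented base forecast parameters (slightly incoherent).
mu_b, sd_b = np.array([2., 3., 4., 5.]), np.ones(4)
mu_u2, mu_u3, mu_u1, sd_u = 6., 10., 16., 3.

N = 100_000
b = rng.normal(mu_b, sd_b, size=(N, 4))   # step 1: sample the bottom base forecasts

# Level 1: one one-dimensional IS step per upper node, involving only its children.
w2 = norm_pdf(b[:, 0] + b[:, 1], mu_u2, sd_u)   # weights w.r.t. U2
b[:, :2] = resample(b[:, :2], w2)               # resample (b1, b2) jointly
w3 = norm_pdf(b[:, 2] + b[:, 3], mu_u3, sd_u)   # weights w.r.t. U3
b[:, 2:] = resample(b[:, 2:], w3)

# Level 2: weight the partially reconciled sample w.r.t. U1 (the total).
w1 = norm_pdf(b.sum(axis=1), mu_u1, sd_u)
b = resample(b, w1)

print(b.mean(axis=0))   # approximately the reconciled bottom means
```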

Sample-based BUIS
Sometimes the base forecasts are given as samples, without a parametric form; this is the case of models for time series of counts [Liboschik et al., 2017] or based on deep learning [Salinas et al., 2020]. BUIS can also reconcile this type of base forecasts. Since we only deal with one-dimensional densities to compute the weights, we use approximations based on samples. For discrete distributions, we use the empirical distribution. For continuous distributions, we use kernel density estimation [Chen, 2017]. Therefore, we only need to replace line 4 in Algorithm 1 with:

  Sample (u_{j,l}^(i))_{i=1,...,N} from π̂_{u_{j,l}}
  π̆ ← DensityEstimation( (u_{j,l}^(i))_{i=1,...,N} )
  w^(i) ← π̆( Σ_{t=1}^{q_{j,l}} b^(i)_{t,(j,l)} )  for i = 1, ..., N

The sample-based algorithm is slightly slower due to the density estimation step.
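A sketch of the sample-based weighting step for a continuous forecast (all sample sizes and distribution parameters are invented; the Gaussian KDE below uses Silverman's rule-of-thumb bandwidth, one common choice among many):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kde_pdf(x, samples):
    """Evaluate a Gaussian KDE, built on `samples`, at the points `x`."""
    n = len(samples)
    h = 1.06 * samples.std() * n ** (-1 / 5)        # Silverman's bandwidth
    z = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

# The upper base forecast is available only as samples u_s; its two children
# are available as samples b. We estimate the upper density and evaluate it
# at the bottom sums, in place of line 4 of Algorithm 1.
u_s = rng.normal(10.0, 2.0, size=1_000)
b = rng.normal([4.0, 5.0], 1.0, size=(4_000, 2))
w = gaussian_kde_pdf(b.sum(axis=1), u_s)
w /= w.sum()

m_rec = w @ b
print(m_rec)   # weighted (partially reconciled) bottom means
```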

More complex hierarchies: grouped time series
We refer to grouped time series when the data structure does not disaggregate in a unique hierarchical manner [Hyndman and Athanasopoulos, 2021, Chapter 11].In this case, the aggregated series cannot be represented by a single tree, as a bottom node can have more than one parent.
For instance, consider a weekly time series, for which we compute the following temporal aggregates: 2-weeks, 4-weeks, 13-weeks, 26-weeks, and 1-year. A bottom (weekly) node is thus a child of both the 2-weeks and the 13-weeks aggregates. This structure cannot be represented as a tree.
The BUIS algorithm, as described in Sec. 4.3, requires that the hierarchy is a tree, so it cannot be used in this case. Indeed, as highlighted in the proof, we need the bottom variables within each aggregation group to be independent in order to multiply their densities. If the hierarchy is not a tree, correlations between bottom variables are created when conditioning on the upper levels.
To overcome this problem, we proceed as follows. First, we find the largest sub-hierarchy within the group structure. For instance, in the example above, we consider the sub-hierarchy given by the bottom variables and by the 2-weeks, 4-weeks, and 1-year aggregates. All the other upper variables are then regarded as additional constraints. We use the BUIS algorithm on the sub-hierarchy, obtaining a sample b. Then, we compute the weights on b using the base distributions of the additional constraints. This is equivalent to performing a standard IS, where we use the output of BUIS on the hierarchical part as proposal distribution. In this way, we reduce the dimension of the IS task from n_u, the total number of upper constraints, to the number of constraints that are not included in the sub-hierarchy: in the above example, from 46 to 6. We highlight that the distribution we sample from would be the same even with different choices of sub-hierarchies. However, picking the largest one is the best choice from a computational perspective.

Experiments on synthetic data

We now empirically test the convergence of the BUIS algorithm to the true reconciled distribution. We compare BUIS with IS and with the method by Corani et al. [2023], which we implement using the library PyMC [Salvatier et al., 2016]. PyMC adopts an adaptive Metropolis-Hastings algorithm [Haario et al., 2001] for discrete distributions and the No-U-Turn Sampler (NUTS, Hoffman et al. [2014]) for continuous distributions. We performed experiments on the hierarchy of Fig. 2, implementing the IS and BUIS algorithms in Python.

Reconciling Gaussian forecasts
When dealing with Gaussian base forecasts, the reconciled distribution can be obtained in closed form [Corani et al., 2020]. We can thus check how the various algorithms approximate the exact solution. We set on each bottom node a Gaussian distribution with randomly chosen mean; for each incoherence level ϵ, the mean of each upper base forecast is set to (1 + ϵ) times the sum of the means of the corresponding bottom forecasts. We set σ_u = 3 as the standard deviation of the base forecast of each upper variable.
We run PyMC with 4 chains of 5,000 samples each. For IS and BUIS, we run multiple experiments, drawing each time a different number of samples, ranging from 10^4 to 10^6. We repeat each experiment 30 times. We then compute the 2-Wasserstein distance [Panaretos and Zemel, 2019] between the true reconciled distribution, obtained analytically, and the empirical distributions obtained via sampling. The results are reported in Fig. 3a, where we also show the 95% confidence interval over the 30 experiments. Note that the axes are in logarithmic scale.
As expected, the performance of IS and BUIS depends on the incoherence level ϵ; BUIS is also affected, since it is based on importance sampling. However, BUIS is significantly more robust than IS, and it works effectively even with an extreme incoherence level such as ϵ = 0.5. As the number of samples grows, the performance of BUIS improves, eventually outperforming the reference method based on PyMC. We confirm the results by computing the percentage error on the reconciled mean (Appendix D.1). Even with an extreme incoherence level, ϵ = 0.5, the percentage error on the mean obtained with 10^6 samples from BUIS is negligible (< 0.1%) and comparable to PyMC. In the same setup, IS achieves an error greater than 5%.
Both IS and BUIS are substantially faster than PyMC (Table 1). The computational time of BUIS with 10^5 samples is two orders of magnitude smaller than that of PyMC, while achieving comparable performance. Note that here BUIS is running on a single core. An insight into the reasons for such a speedup is given in Appendix C, where we provide a detailed comparison between IS and a bare-bones implementation of MCMC on a simple hierarchy.
We also conduct similar experiments using a larger hierarchy; the results, reported in Appendix D.2, confirm that BUIS is robust and computationally efficient.

Reconciling Poisson forecasts
We now consider discrete base forecasts. We set a Poisson distribution on each bottom variable, with mean randomly chosen in the interval [5, 10]. We denote by λ_b ∈ R^8_+ the vector of the base bottom means. As before, for each incoherence level ϵ ∈ {0.1, 0.3, 0.5}, we set the means of the upper variables as λ_u = (1 + ϵ) A λ_b. In the Poisson case, the reconciled distribution cannot be analytically computed. We thus run an extensive experiment using PyMC, with 20 chains of 50,000 samples each. We consider these samples as the true reconciled distribution.
We run the same experiments described in Sec. 5.1. Since probabilistic forecasts of count time series are typically given as samples [Liboschik et al., 2017], we also run sample-based BUIS (Sec. 4.4): we assume that the parametric form of the base distribution is unknown, and that only samples are available.
The 2-Wasserstein distances are reported in Fig. 3b. As the number of samples grows, BUIS and sample-based BUIS eventually outperform PyMC, for all levels of incoherence. As in the Gaussian case, the performance of IS deteriorates for larger values of the incoherence level. The results are confirmed by the percentage error on the reconciled mean (Appendix D.1), which is lower than 0.2% for BUIS with 10^5 samples and about 0.4% for PyMC.
In Table 1 we show the average computational times. Sample-based BUIS is slightly slower than BUIS because of the density estimation step. Note that, using 10^5 samples, BUIS and sample-based BUIS are two orders of magnitude faster than PyMC, while achieving a better performance for all incoherence levels.

Experiments on real data
We now perform probabilistic reconciliation on temporal hierarchies, using time series extracted from two different data sets: carparts, available from the R package expsmooth [Hyndman, 2018], and syph, available from the R package ZIM [Yang et al., 2018].
The carparts data set contains monthly sales of car parts. As in [Hyndman et al., 2008, Chapter 16], we remove time series with missing values, with fewer than 10 positive monthly demands, and with no positive demand in the first 15 and final 15 months. After this selection, 1046 time series are left. Note that we use less restrictive selection criteria than Corani et al. [2023], where only 219 time series from carparts were considered. Monthly data are aggregated into 2-months, 3-months, 4-months, 6-months, and 1-year levels.
The syph data set contains the weekly number of syphilis cases in the United States. We remove the time series with ADI greater than 20, where the ADI is computed as

$$ \mathrm{ADI} = \frac{1}{P} \sum_{i=1}^{P} p_i, $$

with p_i the time period between two non-zero values and P the total number of periods [Syntetos and Boylan, 2005]. We also remove the time series corresponding to the total number of cases in the US. After this selection, 50 time series are left. Weekly data are aggregated into 2-weeks, 4-weeks, 13-weeks, 26-weeks, and 1-year levels.

For both data sets, we fit a generalized linear model with the tscount package [Liboschik et al., 2017]. We use a negative binomial predictive distribution, with a first-order regression on past observations. The test set has length 1 year for both data sets. We thus compute up to 12 steps ahead at the monthly level, and up to 52 steps ahead at the weekly level. Probabilistic forecasts are returned in the form of samples.
Reconciliation is performed in three different ways. In the first case, we fit a Gaussian distribution on the returned samples; then, we follow [Corani et al., 2020] to analytically compute the Gaussian reconciled distribution. In the second case, we fit a negative binomial distribution on the samples, and we reconcile using the BUIS algorithm; since these are grouped rather than hierarchical time series, we use the method of Sec. 4.5 for grouped time series. Finally, we use the sample-based BUIS (Sec. 4.4), without fitting a parametric distribution. Although the sample-based algorithm is slightly slower, this method yields a computational gain over BUIS, as fitting a negative binomial distribution on the samples requires about 1.2 s for the monthly hierarchy and 3.9 s for the weekly hierarchy. We refer to these methods, respectively, as N, NB, and samples. Furthermore, we denote by base the unreconciled forecasts.
We use different indicators to assess the performance of each method. The mean scaled absolute error (MASE) [Hyndman, 2006] is defined as

$$ \mathrm{MASE} = \frac{\frac{1}{h} \sum_{j=1}^{h} |y_{t+j} - \hat{y}_{t+j|t}|}{Q}, $$

where h is the forecasting horizon and Q is a scaling factor, given by the in-sample mean absolute error of the naive forecast. Here, y_t denotes the value of the time series at time t, while ŷ_{t+j|t} denotes the point forecast computed at time t for time t + j. The median of the forecast distribution is used as point forecast, since it minimizes the MASE [Kolassa, 2016].
The mean interval score (MIS) [Gneiting, 2011] is defined, for any α ∈ (0, 1), as

$$ \mathrm{MIS} = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbb{1}\{y < l\} + \frac{2}{\alpha}(y - u)\,\mathbb{1}\{y > u\}, $$

where l and u are the lower and upper bounds of the (1 − α) forecast coverage interval, and y is the actual value of the time series. In the following, we use α = 0.1. MIS penalizes wide prediction intervals, as well as intervals that do not contain the true value.
Finally, the energy score [Székely and Rizzo, 2013] is defined as

$$ \mathrm{ES}(P, y) = \mathbb{E}\,\lVert s - y \rVert^{\alpha} - \frac{1}{2}\,\mathbb{E}\,\lVert s - s' \rVert^{\alpha}, $$

where P is the forecast distribution on the whole hierarchy, s, s' ∼ P are a pair of independent random variables, and y is the vector of the actual values of all the time series. The energy score is a proper scoring rule for distributions defined on the entire hierarchy [Panagiotelis et al., 2023]. We compute ES, with α = 2, using samples, as explained in Wickramasuriya [2023].
We use the skill score to compare the performance of a method with respect to a baseline, in terms of percentage improvement. We use base as the baseline method. For example, the skill score of NB on MASE is given by

$$ \mathrm{Skill}(\mathrm{NB}, \mathrm{base}) = \frac{\mathrm{MASE}(\mathrm{base}) - \mathrm{MASE}(\mathrm{NB})}{\left(\mathrm{MASE}(\mathrm{base}) + \mathrm{MASE}(\mathrm{NB})\right) / 2}. $$
Note that the skill score is symmetric and scale-independent. For each level, we compute the skill score for each forecasting horizon, and take the average. The skill scores for carparts are reported in Table 2. Both the NB and samples methods yield a significant improvement for all the indicators and for all the hierarchy levels: for both methods, the average improvement is about 20% for MASE, 40% for MIS, and 50% for ES. The skill scores for syph are reported in Table 3. As before, the average improvement of NB and samples is significant for all indicators. For both data sets, the N method performs poorly, in many cases yielding negative skill scores; as observed in Corani et al. [2023], this method does not capture the asymmetry of the base forecasts. Finally, samples appears to perform better than NB. Indeed, the step of fitting a negative binomial distribution on the forecast samples may introduce an additional source of error.
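The skill score defined above is a one-liner; a small sketch with invented MASE values shows how it reads as a symmetric percentage improvement:

```python
def skill(metric_method, metric_base):
    """Symmetric percentage improvement of a method over the baseline."""
    return (metric_base - metric_method) / ((metric_base + metric_method) / 2)

# Invented values: the method reduces MASE from 1.0 to 0.8.
print(round(skill(0.8, 1.0), 3))   # 0.222, i.e. about a 22% improvement
print(skill(1.0, 1.0))             # 0.0 when the method matches the baseline
```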

Conclusions
Our approach to probabilistic reconciliation based on conditioning allows us to treat continuous and discrete forecast distributions in a unified framework. Moreover, the proposed BUIS algorithm efficiently samples from continuous and discrete predictive distributions, provided either in parametric form or as samples. We make the BUIS algorithm available within the R package bayesRecon [Azzimonti et al., 2023].
A future research direction is how to relax the assumption of conditional independence of the base forecasts.A second one is to study the implications of ignoring the behavior of the base forecast distribution outside the coherent subspace, which is a feature of reconciliation via conditioning and constitutes a major difference from reconciliation via projection.

A Proofs
Proposition 2. Let $s \colon X \to Y$ be a measurable bijection between two measurable spaces $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$. Then the pushforward $s_\# \colon \mathcal{P}(X) \to \mathcal{P}(Y)$ is a bijection, with inverse given by $(s^{-1})_\#$.

Proof. First, we recall that the pushforward $s_\#$ is defined, for any $\nu \in \mathcal{P}(X)$ and $F \in \mathcal{Y}$, as
$$s_\# \nu(F) = \nu\!\left(s^{-1}(F)\right).$$
Hence, for any $\nu \in \mathcal{P}(X)$ and $G \in \mathcal{X}$, we have
$$(s^{-1})_\#\!\left(s_\# \nu\right)(G) = s_\# \nu\!\left(s(G)\right) = \nu\!\left(s^{-1}(s(G))\right) = \nu(G),$$
and therefore $(s^{-1})_\# \circ s_\#$ is the identity map. Analogously, for any $\mu \in \mathcal{P}(Y)$ and $F \in \mathcal{Y}$, we have
$$s_\#\!\left((s^{-1})_\# \mu\right)(F) = (s^{-1})_\# \mu\!\left(s^{-1}(F)\right) = \mu\!\left(s(s^{-1}(F))\right) = \mu(F). \qquad \square$$

Proposition 3. Let $\pi$ be the joint density of the random vector $(\hat{U}, B)$. Then the density of $(Z, B)$, where $Z := \hat{U} - A B$, is given by
$$\pi_{Z,B}(z, b) = \pi(z + Ab,\, b).$$

Proof. The joint density of $(Z, B)$ can be computed using the rule of change of variables [Billingsley, 2008, Chapter 17]. Let $H \colon \mathbb{R}^n \to \mathbb{R}^n$ be defined as $H(u, b) = (u - Ab,\, b)$. $H$ is invertible, with inverse given by $H^{-1}(z, b) = (z + Ab,\, b)$, and we have that $\left|\det J_{H^{-1}}\right| = 1$, so that $\pi_{Z,B}(z, b) = \pi\!\left(H^{-1}(z, b)\right) = \pi(z + Ab,\, b)$. $\square$

D.2 Large hierarchy
We test the IS, BUIS, and PyMC algorithms on a larger hierarchy. We build a binary hierarchy, similar to that of Fig. 2, but with 5 levels: hence, there are 32 bottom and 31 upper nodes. We use the same procedure described in Sect. 5.1 to set the Gaussian base forecasts. Using BUIS with $10^5$ samples, we achieve a small average percentage error (< 0.5%) on the reconciled means (Table 5), even with a large incoherence ($\epsilon = 0.5$).

On the other hand, the error using IS is over 20%, even with $10^6$ samples, showing that IS does not scale to large hierarchies. The results are confirmed by the plot of the 2-Wasserstein distance (Fig. 4). In conclusion, BUIS correctly samples from the reconciled distribution even for rather large hierarchies (∼ 60 nodes) and a large incoherence level ($\epsilon = 0.5$), while providing a substantial gain in computational time with respect to PyMC (Table 6).
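The hierarchy used in this experiment is fully determined by its aggregating matrix; a minimal sketch of its construction (the function name and the root-first node ordering are our own choices):

```python
import numpy as np

def binary_aggregation_matrix(levels):
    """Aggregating matrix A of a complete binary hierarchy.

    With `levels` upper levels there are 2**levels bottom series and
    2**levels - 1 upper series; row i of A sums the bottom series that
    belong to upper node i (root first, then level by level)."""
    n_bottom = 2 ** levels
    rows = []
    for lev in range(levels):          # lev 0 is the root level
        n_nodes = 2 ** lev             # upper nodes at this level
        size = n_bottom // n_nodes     # bottom leaves under each node
        for i in range(n_nodes):
            row = np.zeros(n_bottom)
            row[i * size:(i + 1) * size] = 1.0
            rows.append(row)
    return np.vstack(rows)

A = binary_aggregation_matrix(5)   # 31 upper x 32 bottom, as in this section
```

Each bottom series then contributes to exactly one upper node per level, i.e., every column of `A` sums to 5.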

E Efficiency of IS
It is well known that vanilla importance sampling is not effective for sampling from high-dimensional distributions; this prevents its use for reconciling large hierarchies. We also observe poor performance when the proposal distribution $\hat\nu_b$ is not a good approximation of the target distribution $\nu$. The following result relates the Kullback-Leibler divergence [Kullback and Leibler, 1951] between the base and the reconciled distribution to the efficiency of IS:
$$\mathrm{KL}(\hat\nu_b \,\|\, \nu) = \log \mathbb{E}[W] - \mathbb{E}[\log W]. \tag{13}$$

Proof. First, we recall that, given a pair of absolutely continuous probability distributions $\mu$ and $\nu$, the Kullback-Leibler (KL) divergence is defined as
$$\mathrm{KL}(\mu \,\|\, \nu) = \int \log\frac{p(x)}{q(x)}\, p(x)\, dx,$$
where $p$ and $q$ are the densities of, respectively, $\mu$ and $\nu$. The discrete case is completely analogous. Now, let $\hat\nu_b$ be the base bottom forecast distribution, with density $q$, and $\nu$ the reconciled distribution. From (10), the density of $\nu$ is given by $p(b) = w(b)\, q(b) / c$, where $w$ is the unnormalized importance sampling weight function and $c = \mathbb{E}_{\hat\nu_b}[w(B)]$ is the normalizing constant. Setting $W := w(B)$, with $B \sim \hat\nu_b$, we obtain
$$\mathrm{KL}(\hat\nu_b \,\|\, \nu) = \int q(b) \log\frac{q(b)}{p(b)}\, db = \int q(b) \log\frac{c}{w(b)}\, db = \log \mathbb{E}[W] - \mathbb{E}[\log W]. \qquad \square$$

Note that the right-hand side of (13) is a measure of the dispersion of the random variable $W$. Indeed, by Jensen's inequality, it is always non-negative, and it is zero when $W$ is constant a.s.; it gets larger as $W$ becomes more dispersed. In the context of measures of inequality, it is usually referred to as the Mean Logarithmic Deviation [Haughton and Khandker, 2009]. Moreover, from (10), the importance sampling weights are IID copies of $W$. Hence, the more distant the base and the reconciled distributions are, in terms of Kullback-Leibler divergence, the more dispersed the IS weights. A large dispersion of the weights leads to poor performance of importance sampling [Martino et al., 2017]. As the incoherence level $\epsilon$ grows, the distance between the distributions of $AB$ and $\hat{U}$ grows, and therefore so does the distance between $\hat\nu_b$ and $\nu$, since the reconciled distribution merges the information coming from the bottom and the upper variables.
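In practice, the dispersion of the weights is often monitored through the effective sample size, a standard IS diagnostic (not introduced in the text above, but in the spirit of Martino et al. [2017]); a minimal sketch:

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum(w^2): close to N when the weights are
    nearly constant, close to 1 when a single weight dominates."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()
```

A small ESS relative to the number of draws signals the weight degeneracy discussed above: most samples then carry negligible weight, and the IS estimate is effectively based on a handful of draws.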

Figure 2: A binary hierarchy. (Panel b: Poisson distributions.)

Figure 3: Wasserstein distance between true and empirical distributions. The axes are logarithmic.
interval $[5, 10]$, and standard deviation $\sigma_b = 2$. We denote by $\mu_b \in \mathbb{R}^8_+$ the vector of the base bottom means. We induce incoherence by setting the means of the base forecasts of the upper variables as $\mu_u = (1 + \epsilon) A \mu_b$, where $A$ is the aggregating matrix and $\epsilon$ is the incoherence level; we consider $\epsilon \in \{0.1, 0.3, 0.5\}$. Hence, if $\epsilon = 0.3$, the base upper means are 30% greater than the sum of the corresponding base bottom means.
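This incoherence-injection step can be sketched as follows (a minimal illustration mirroring the setup above; the explicit aggregating matrix of a 3-level binary hierarchy over 8 bottom series and the random seed are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Base bottom means drawn uniformly in [5, 10]
mu_b = rng.uniform(5, 10, size=8)

# Aggregating matrix of a 3-level binary hierarchy over the 8 bottom series
A = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],   # total
    [1, 1, 1, 1, 0, 0, 0, 0],   # left half
    [0, 0, 0, 0, 1, 1, 1, 1],   # right half
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

eps = 0.3                        # incoherence level
mu_u = (1 + eps) * A @ mu_b      # base upper means, inflated by 30%
```

With `eps = 0.3`, the mean of the root node is 30% larger than the sum of all bottom means, so the base forecasts are incoherent by construction.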

Table 2: Skill scores on the time series extracted from carparts, detailed by each level of the hierarchy. (Columns: N vs base, NB vs base, samples vs base.)

Table 3: Skill scores on the time series extracted from syph, detailed by each level of the hierarchy.