Bayesian Tensor Factorisations for Time Series of Counts

We propose a flexible nonparametric Bayesian modelling framework for multivariate time series of count data based on tensor factorisations. Our models can be viewed as infinite state space Markov chains of known maximal order with non-linear serial dependence through the introduction of appropriate latent variables. Alternatively, our models can be viewed as Bayesian hierarchical models with conditionally independent Poisson distributed observations. Inference about the important lags and their complex interactions is achieved via MCMC. When the observed counts are large, we deal with the resulting computational complexity of Bayesian inference via a two-step inferential strategy based on an initial analysis of a training set of the data. Our methodology is illustrated using simulation experiments and analysis of real-world data.


Introduction
We consider a time-indexed sequence of multivariate random variables of size $T$, $\{y_t\}_{t=1}^T$, taking values in $\{0, 1, \ldots\}$. We build a non-parametric model by (i) assuming that the transition probability law of the sequence $\{y_t\}$ conditional on the filtration up to time $t-1$, $\mathcal{F}_{t-1}$, is that of a Markov chain of maximal order $q$, (ii) allowing non-linear dependence on the values at the previous $q$ time points and (iii) incorporating complex interactions between lags. We propose a Bayesian model for multivariate time series of counts based on tensor factorisations. Our development is inspired by Yang and Dunson (2016) and Sarkar and Dunson (2016). Yang and Dunson (2016) introduced conditional tensor factorisation models that lead to parsimonious representations of transition probability vectors together with a simple, powerful Bayesian hierarchical formulation based on latent allocation variables. This framework was exploited in Sarkar and Dunson (2016) to build a nonparametric Bayesian model for categorical data together with an efficient MCMC inferential framework. We adopt the ideas and methods of these papers to build flexible models for time series of counts. The major difference that distinguishes our work from Sarkar and Dunson (2016) is that, unlike categorical data, we deal with time series that are infinite, rather than finite, state space Markov chains. The computational complexity of our proposed model grows as the observed counts become larger, so we propose a two-step inferential strategy in which an initial, training part of the time series data is utilised to facilitate the inference and prediction of the rest of the data.

A common way to analyse univariate time series of counts is by assuming that the conditional probability distribution of $y_t \mid y_{t-1}, \ldots, y_{t-q}$ can be expressed as a Poisson density with rate $\lambda_t$ that depends either on previous counts $y_{t-1}, \ldots, y_{t-q}$ or on previous intensities $\lambda_{t-1}, \ldots, \lambda_{t-q}$. For example, one such popular model is the Poisson autoregressive model (without covariates) of order $q$, PAR($q$):
$$y_t \sim \text{Poisson}(\lambda_t), \quad \log(\lambda_t) = \beta_0 + \sum_{i=1}^{q} \beta_i \log(y_{t-i} + 1), \qquad (1)$$
where $\beta_0, \beta_1, \ldots, \beta_q$ are unknown parameters; see Cameron and Trivedi (2001). Grunwald et al. (1995), Grunwald et al. (1997) and Fokianos (2011) discuss the modelling and properties of a PAR(1) process. Brandt and Williams (2001) generalise PAR(1) to a PAR($q$) process and apply it to the modelling of presidential vetoes in the United States. Kuhn et al. (1994) adopt such processes to model the counts of child injury in Washington Heights.
When we deal with $M$ distinct time series of counts, the PAR($q$) model is written, for $m = 1, \ldots, M$, as
$$y_{m,t} \sim \text{Poisson}(\lambda_{m,t}), \quad \log(\lambda_{m,t}) = \beta_0 + \sum_{m'=1}^{M} \sum_{i=1}^{q} \beta_{i,m'} \log(y_{m',t-i} + 1); \qquad (2)$$
see, for example, Liboschik et al. (2015). In the above equation, $q$ is fixed for each $m = 1, \ldots, M$. We will use this model formulation as a benchmark for comparison against our proposed methodology. Other approaches to modelling time series of counts include the integer-valued generalised autoregressive conditional heteroscedastic models (Heinen, 2003; Weiß, 2014) and the integer-valued autoregressive processes (Al-Osh and Alzaid, 1987). We have not dealt with these models here because a proper Bayesian evaluation of their predictive performance requires a challenging Bayesian inference task which is beyond the scope of our work.

The rest of the paper is organised as follows. We specify our model in Section 2, followed by estimation and inference details in Section 3. Simulation experiments and applications are provided in Sections 4 and 5, respectively.
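To make the benchmark concrete, the sketch below simulates a univariate PAR($q$) path under model (1). The function name, coefficient values and burn-in length are our own illustrative choices, not part of the paper.

```python
import numpy as np

def simulate_par(beta0, betas, T, seed=None, burn=100):
    """Simulate a PAR(q) path: log(lam_t) = beta0 + sum_i beta_i * log(y_{t-i} + 1)."""
    rng = np.random.default_rng(seed)
    q = len(betas)
    y = np.zeros(T + burn + q, dtype=int)
    for t in range(q, T + burn + q):
        # betas[0] is the coefficient of lag 1, betas[1] of lag 2, and so on
        loglam = beta0 + sum(b * np.log(y[t - i - 1] + 1) for i, b in enumerate(betas))
        y[t] = rng.poisson(np.exp(loglam))
    return y[burn + q:]          # discard burn-in so the path is near-stationary

# Hypothetical PAR(2) with beta_0 = 0.5, beta_1 = 0.4, beta_2 = 0.2
y = simulate_par(0.5, [0.4, 0.2], T=500, seed=1)
```

The log-link with $\log(y_{t-i}+1)$ keeps the intensity positive while allowing zero counts in the lags.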
Model Specification

The Bayesian tensor factorisation model

Univariate time series
We build a probabilistic model by assuming that the transition probability law of $y_t$ conditional on $\mathcal{F}_{t-1}$ is that of a Markov chain of maximal order $q$:
$$p(y_t \mid \mathcal{F}_{t-1}) = p(y_t \mid y_{t-1}, \ldots, y_{t-q}), \qquad (3)$$
for $t \in [q+1, T]$, where the set containing all integers from $i$ to $j$ is denoted as $[i, j]$. This formulation includes the possibility that only a subset of the previous $q$ values affects $y_t$. We follow Sarkar and Dunson (2016) and introduce a series of latent variables as follows. First, let $k_j$ denote the maximal number of clusters that the values of $y_{t-j}$ can be separated into for predicting $y_t$. To demonstrate the use of $k_j$ we present a simple example. Assume that $y_t$ depends only on $y_{t-1}$ and that the way the observed values of $y_{t-1}$ affect the density of $y_t$ is given by the following stochastic rule: if $y_{t-1} > 1$ then $y_t \sim \text{Poisson}(1)$, and if $y_{t-1} \leq 1$ then $y_t \sim \text{Poisson}(2)$. Then $k_1 = 2$, since the values of $y_{t-1}$ are separated into two clusters that determine the distribution of $y_t$. Note that if $k_j = 1$ the value of $y_{t-j}$ does not affect the density of $y_t$. The collection of all these latent variables $K := \{k_j\}_{j \in [1,q]}$ determines how past values of the time series affect the distribution of $y_t$. We also define a collection of time-dependent latent allocation random variables $Z_t := \{z_{j,t}\}_{j \in [1,q]}$, where $z_{j,t}$ specifies which of the $k_j$ clusters of $y_{t-j}$ affects $y_t$. We will write $Z_t = H$ to mean that all latent variables in $Z_t$ are equal to another collection of latent variables $H := \{h_j\}_{j \in [1,q]}$ that does not depend on $t$. Finally, denote the collection $\mathcal{H} := \{h_j \in [1, k_j], j \in [1, q]\}$, which depends on $K$. The connection among $Z_t$, $H$ and $\mathcal{H}$ is that for any $t \in [q+1, T]$, $Z_t$ is sampled with value $H \in \mathcal{H}$.
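The two-cluster example above can be checked empirically. The minimal simulation below implements the stated rule and recovers the two conditional means; it is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
y = np.zeros(T, dtype=int)
for t in range(1, T):
    # Two clusters of y_{t-1}: {0, 1} -> rate 2, {2, 3, ...} -> rate 1, so k_1 = 2
    rate = 1.0 if y[t - 1] > 1 else 2.0
    y[t] = rng.poisson(rate)

mean_low = y[1:][y[:-1] <= 1].mean()   # conditional mean when y_{t-1} <= 1, near 2
mean_high = y[1:][y[:-1] > 1].mean()   # conditional mean when y_{t-1} > 1, near 1
```

The empirical conditional means recover the two Poisson rates, confirming that only the cluster membership of $y_{t-1}$, not its exact value, matters for the law of $y_t$.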
We are now in a position to define our model. Let $\lambda_{Z_t}$ be the Poisson rate for generating $y_t$ given the random variable $Z_t$. The conditional transition probability law (3) can be written as a Bayesian hierarchical model, for $j \in [1, q]$, $H \in \mathcal{H}$ and $t \in [q+1, T]$, as
$$y_t \mid Z_t = H \sim \text{Poisson}(\lambda_H), \qquad (4)$$
$$z_{j,t} \mid y_{t-j} \sim \text{Multinomial}\big([1, k_j], \big(\pi^{(j)}_1(y_{t-j}), \ldots, \pi^{(j)}_{k_j}(y_{t-j})\big)\big). \qquad (5)$$
Expressions (4) and (5) imply that
$$p(y_t \mid y_{t-1}, \ldots, y_{t-q}) = \sum_{H \in \mathcal{H}} \text{PD}(y_t; \lambda_H) \prod_{j=1}^{q} \pi^{(j)}_{h_j}(y_{t-j}), \qquad (6)$$
with constraints $\lambda_H \geq 0$ for any $H \in \mathcal{H}$ and $\sum_{h_j=1}^{k_j} \pi^{(j)}_{h_j}(y_{t-j}) = 1$ for each combination of $(j, y_{t-j})$. Multinomial($[1, k], \pi$) is a multinomial distribution selecting a value from $[1, k]$ with probability vector $\pi$. The formulation (6) is referred to as a conditional tensor factorisation with the Poisson density $\text{PD}(y_t; \lambda_H)$ being the core tensor; see Harshman (1970), Harshman and Lundy (1994), Tucker (1966) and De Lathauwer et al. (2000) for a description of tensor factorisations. It can also be interpreted as a Poisson mixture model with $\prod_{j \in [1,q]} \pi^{(j)}_{h_j}(y_{t-j})$ being the mixture weights that depend on previous values of $y_t$.
A more parsimonious representation for our tensor factorisation model is obtained by adopting a Dirichlet process prior for the Poisson rates $\lambda_H$. Independently, for each $H \in \mathcal{H}$, we use the stick-breaking construction introduced by Sethuraman (1994),
$$\lambda_H \sim \sum_{l=1}^{\infty} \pi^*_l \, \delta(\lambda^*_l), \quad \pi^*_l = V_l \prod_{s=1}^{l-1} (1 - V_s), \qquad (7)$$
where $\delta(\cdot)$ is a Dirac delta function and, independently for $l \in [1, \infty)$, $V_l \sim \text{Beta}(1, \alpha_0)$ and $\lambda^*_l \sim \text{Gamma}(a, b)$, with $\lambda^*_l$ representing a label-clustered Poisson rate. By letting $Z^*_{Z_t}$ denote the label of the cluster that $Z_t$ belongs to at time $t \in [q+1, T]$, we complete the model formulation as
$$y_t \mid Z^*_{Z_t} = l \sim \text{Poisson}(\lambda^*_l). \qquad (8)$$
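A truncated version of the stick-breaking construction in (7) is straightforward to sketch. The truncation $V_L = 1$ (used in the priors section later) guarantees the weights sum to one; the parameter values below are illustrative assumptions.

```python
import numpy as np

def stick_breaking(alpha0, L, rng):
    """Truncated stick-breaking weights: V_l ~ Beta(1, alpha0), with V_L = 1."""
    V = rng.beta(1.0, alpha0, size=L)
    V[-1] = 1.0                                  # truncation: remaining mass to the last atom
    # w_l = V_l * prod_{s<l} (1 - V_s)
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return w

rng = np.random.default_rng(0)
w = stick_breaking(alpha0=1.0, L=100, rng=rng)
lam_atoms = rng.gamma(shape=5.0, scale=1.0, size=100)   # lambda*_l ~ Gamma(a, b) with a=5, b=1
# Each lambda_H is drawn by picking atom l with probability w_l
lam_H = rng.choice(lam_atoms, p=w)
```

Because many $\pi^*_l$ are negligible, distinct configurations $H$ frequently share the same atom $\lambda^*_l$, which is exactly the clustering of rates the Dirichlet process induces.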

Multivariate time series
The model of the previous section can be readily extended to deal with multivariate responses $\{Y_t\}_{t=1}^T$, where $Y_t = (y_{1,t}, \ldots, y_{M,t})^\top$ takes values in $\mathbb{N}_0^M$. The idea is similar to the way the univariate PAR model (1) is generalised to its multivariate counterpart (2). We assume that the transition probability, for any $\tau \in [1, M]$ and $t \in [q+1, T]$, is
$$p(y_{\tau,t} \mid \mathcal{F}_{t-1}) = p(y_{\tau,t} \mid Y_{t-1}, \ldots, Y_{t-q}). \qquad (9)$$
The idea is that each univariate time series may depend on all or some of the $q$ previous values of all, or some, of the univariate time series. Model (9) assumes that, conditional on the past $q$ values of all time series before time $t$, the $M$ univariate random variables at time $t$ are independent. The formulation requires $M$ different sets of latent variables, one for each dimension, but, other than that, its specific details have no essential difference from those in the univariate case.

Two-step inference for large counts
Imagine that, based on observed data $\{y_t\}_{t=1}^T$, one has to recursively forecast future observations $y_{T+1}, y_{T+2}, \ldots$. Clearly, the observed values of $\{y_t\}_{t=1}^T$ determine the form of our models in Sections 2.1.1 and 2.1.2, and as a result of this construction we may face the unfortunate situation in which a count that has been unobserved up to time $T$ appears in the future observations. This problem can be solved by re-estimating the model but, in cases where this is not desirable, we propose the following solution. We separate $\{y_t\}_{t=1}^T$ into two segments of size $T_1$ and $T_2$, representing the sizes of the pre-training dataset and the training dataset respectively, so that $\{y_t\}_{t \in [1, T_1]}$ and $\{y_t\}_{t \in [T_1+1, T_1+T_2]}$ are the corresponding observations in these sets. We use the pre-training dataset to cluster all the counts in the time series and the training dataset to model the time series with labelled counts.
We first define a collection of latent variables $\{w_{1:c-1}, \mu_{1:c}, c\}$ that models the pre-training data $\{y_t\}_{t \in [1, T_1]}$ as
$$p(y_t) = \sum_{i=1}^{c} w_i \, \text{PD}(y_t; \mu_i). \qquad (10)$$
Thus, (10) assumes that any $y_t$ in the pre-training dataset is distributed as a finite mixture of Poisson distributions with $c$ components, weights $w_i$ and intensities $\mu_i$. The usual latent structure for such mixture models assumes indicator variables $d_t$ representing the estimated label of the mixture component that $y_t$ belongs to, so that $p(d_t = i) = w_i$ and $y_t \mid d_t = i \sim \text{Poisson}(\mu_i)$. We exploit this finite mixture clustering of the pre-training dataset to build our model for the training dataset. We define another collection of latent variables $D_t = \{d_{j,t}\}_{j \in [1,q]}$ by setting $d_{j,t} = d_{t-j}$ for all $j \in [1, q]$ and $t \in [T_1+1+q, T_1+T_2]$. We then build a probabilistic model for the training dataset by assuming that the transition probability law of the sequence depends on the past only through the labels:
$$p(y_t \mid \mathcal{F}_{t-1}) = p(y_t \mid d_{1,t}, \ldots, d_{q,t}). \qquad (11)$$
The conditional transition probability law (11) can then be written as a Bayesian hierarchical model, for $j \in [1, q]$ and $t \in [T_1+1+q, T_1+T_2]$, as
$$y_t \mid Z_t = H \sim \text{Poisson}(\lambda_H), \qquad (12)$$
$$z_{j,t} \mid d_{j,t} \sim \text{Multinomial}\big([1, k_j], \big(\pi^{(j)}_1(d_{j,t}), \ldots, \pi^{(j)}_{k_j}(d_{j,t})\big)\big). \qquad (13)$$
Expressions (12) and (13) immediately imply that
$$p(y_t \mid d_{1,t}, \ldots, d_{q,t}) = \sum_{H \in \mathcal{H}} \text{PD}(y_t; \lambda_H) \prod_{j=1}^{q} \pi^{(j)}_{h_j}(d_{j,t}), \qquad (14)$$
with constraints $\lambda_H \geq 0$ for any $H \in \mathcal{H}$ and $\sum_{h_j=1}^{k_j} \pi^{(j)}_{h_j}(d_{j,t}) = 1$ for each combination of $(j, d_{j,t})$. It is clear that (14) is equivalent to (12) and (13). From (14), the expectation of $y_t$ given the labels is $\sum_{H \in \mathcal{H}} \lambda_H \prod_{j=1}^{q} \pi^{(j)}_{h_j}(d_{j,t})$. The rest of the model, which utilises the stick-breaking process for $\lambda_H$, is similar to the one used in Section 2.1.1.
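The point of the two-step construction is that any future count, even one never observed before time $T$, can still be mapped to one of the $c$ mixture labels. A minimal sketch, assuming a fitted two-component mixture with hypothetical weights and rates:

```python
import numpy as np

def label_count(y_new, w, mu):
    """Assign a (possibly previously unseen) count to its most probable mixture component."""
    # log w_i + log Poisson(y_new; mu_i), dropping the log(y_new!) constant shared by all i
    logp = y_new * np.log(mu) - mu + np.log(w)
    return int(np.argmax(logp))

w = np.array([0.6, 0.4])      # hypothetical fitted mixture weights
mu = np.array([2.0, 15.0])    # hypothetical fitted Poisson rates
d = label_count(30, w, mu)    # a count never seen in pre-training still gets a label
```

Because the transition law (11) depends on the past only through labels, an unseen count such as 30 is simply absorbed into the high-rate cluster, avoiding any re-estimation.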

Priors
We assign independent priors on $\pi^{(j)}(d_{j,t})$ as $\{\pi^{(j)}_1(d_{j,t}), \ldots, \pi^{(j)}_{k_j}(d_{j,t})\} \sim \text{Dirichlet}(\gamma_j, \ldots, \gamma_j)$, with $\gamma_j = 0.1$. We also follow Sarkar and Dunson (2016) in setting the priors on $K$. Notice that $\phi$ controls $p(k_j = \kappa)$ and hence the number of important lags for the proposed conditional tensor factorisation; for all our experiments throughout this paper, we set $\phi = 0.5$. Following Viallefont et al. (2002), we set the parameter $a$ of the Gamma density of $\lambda^*_l$ to the mid-range of $y_t$ in the training dataset, $a = [\min_t(y_t) + \max_t(y_t)]/2$, and $b = 1$. We set $\alpha_0 = 1$ for the Beta prior on $V_l$. Finally, we truncate the series (7) at a finite number of terms $L$ by assuming $V_L = 1$, and we set $L = 100$.

Estimation and Inference
The joint density of the general model of Section 2.2.3 can be expressed as the product of the mixture model for the pre-training data, the conditional tensor factorisation for the training data, and the priors above. The Poisson mixture model in the pre-training set is estimated with the MCMC algorithm of Marin et al. (2005). For any $t > T_1$ we then estimate the label $d_t$ from the fitted mixture. Our BTF model is a finite mixture model with an unknown number of components, due to the randomness of the random variable matrix $K$. We follow Yang and Dunson (2016) and estimate $K$ separately through a stochastic search variable selection (George and McCulloch, 1997) based on an approximated marginal likelihood. As Yang and Dunson (2016) point out, such an approach is helpful since it fixes the numbers of inclusions of the tensor, and the sampling process of $K$ can indicate whether a predictor is important. The rest of the inference proceeds by sampling all other random variables conditional on $K$ and $D$ through MCMC.

MCMC for finite Poisson mixtures
We follow the procedure in Marin et al. (2005), where $\{w_i\}_{i \in [1,c]}$ are the mixture weights and $\{\mu_i\}_{i \in [1,c]}$ are the corresponding Poisson rates. By setting conjugate Dirichlet and Gamma priors on the weights and rates respectively, all full conditional densities are available in closed form and the mixture is estimated by Gibbs sampling.
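A conjugate Gibbs sampler in the spirit of Marin et al. (2005) can be sketched as follows. The prior hyperparameters, starting values and test data are our own illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gibbs_poisson_mixture(y, c, n_iter=500, a=1.0, b=1.0, gamma=1.0, seed=0):
    """Gibbs sampler for a c-component Poisson mixture with conjugate
    Dirichlet(gamma) prior on weights and Gamma(a, b) prior on rates."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    mu = rng.gamma(a, 1.0 / b, size=c) + np.arange(c)      # spread the starting rates
    w = np.full(c, 1.0 / c)
    draws = []
    for _ in range(n_iter):
        # Allocate each observation to a component (latent label d_t);
        # the log(y!) term is constant across components and cancels
        logp = y[:, None] * np.log(mu[None, :]) - mu[None, :] + np.log(w[None, :])
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        d = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        counts = np.bincount(d, minlength=c)
        sums = np.bincount(d, weights=y, minlength=c)
        # Conjugate updates: Dirichlet for weights, Gamma for rates
        w = rng.dirichlet(gamma + counts)
        mu = rng.gamma(a + sums, 1.0 / (b + counts))
        draws.append(mu.copy())
    return np.array(draws)

rng = np.random.default_rng(2)
y = np.concatenate([rng.poisson(3, 400), rng.poisson(20, 400)])
mu_draws = gibbs_poisson_mixture(y, c=2)
```

Sorting the rate draws within each iteration sidesteps label switching when summarising the posterior.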

Important lags selection
Important lags are inferred through the variable $K = \{k_j\}_{j \in [1,q]}$. The basic calculations are as follows. Following Sarkar and Dunson (2016), the posterior of $K = \{k_j\}_{j \in [1,q]}$ can be sampled with $k_j \in \big[\max\{z_{j,t}\}_{t \in [T_1+1+q, T_1+T_2]}, c\big]$ and $n_{j,\omega} = \sum_{t \in [T_1+1+q, T_1+T_2]} 1\{d_{j,t} = \omega\}$. The levels of $d_{j,t}$ are partitioned into $k_j$ clusters $\{C_{j,r} : r = 1, \ldots, k_j\}$, with each cluster $C_{j,r}$ assumed to correspond to its own latent class $h_j = r$. With the independent Gamma$(a, b)$ priors on the mixture kernel rates $\lambda_H$ marginalised out, the likelihood of our targeted response $\{y_t\}_{t \in \mathcal{T}^*_2}$ conditional on the cluster configuration $C = \{C_{j,r}\}$ can be approximated in closed form. The moves are as follows. (i) For a randomly chosen $j$, we propose to either increase $k_j$ to $(k_j + 1)$ or decrease $k_j$ to $(k_j - 1)$. (ii) If an increasing move is proposed, we randomly split a cluster of $d_{j,t}$ into two clusters. We accept this move with an acceptance rate based on the approximated marginal likelihood.
(iii) If a decreasing move is proposed, we randomly merge two clusters of $d_{j,t}$ into a single cluster. We accept this move with an acceptance rate based on the approximated marginal likelihood. If $K^*$ and $C^*$ are the updated model index and cluster configuration, $\alpha(\cdot; \cdot)$ is the Metropolis-Hastings acceptance rate, $L(\cdot)$ is the likelihood function and $q(\cdot \to \cdot)$ is the proposal function, we obtain
$$\alpha(K, C; K^*, C^*) = \min\left\{1, \frac{L(C^*)\, q(C^* \to C)}{L(C)\, q(C \to C^*)}\right\}.$$
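On the log scale, the split/merge accept step is a standard Metropolis-Hastings ratio. The helper below is a generic sketch with our own names; it shows only the accept/reject mechanics, not the cluster bookkeeping.

```python
import numpy as np

def mh_accept(log_lik_new, log_lik_old, log_q_rev, log_q_fwd, rng):
    """Metropolis-Hastings accept/reject for a split/merge move:
    alpha = min(1, L(C*) q(C* -> C) / (L(C) q(C -> C*))), computed on the log scale."""
    log_alpha = (log_lik_new - log_lik_old) + (log_q_rev - log_q_fwd)
    # Accept when log(u) < log(alpha) for u ~ Uniform(0, 1)
    return bool(np.log(rng.random()) < min(0.0, log_alpha))
```

Working with log likelihoods and log proposal probabilities avoids underflow, which matters here because the marginal likelihoods are products over many time points.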

Full conditional densities
For given $D$ and $K$, denote by $\zeta$ a generic variable that collects all variables that are not explicitly mentioned, including $y$. The corresponding Gibbs sampling steps then draw each remaining block of parameters from its full conditional density and update $\pi^*_l$ accordingly.
• Sample $z_{j,t}$ for $j \in [1, q]$ and $t \in [T_1+1+q, T_1+T_2]$ from its full conditional density, where $H_{\cdot/j=h}$ is equal to $H$ at all positions except the $j$-th position, which takes the value $h$.

Simulation Experiments
We tested our methodology with simulated data from designed experiments against the Poisson autoregressive model (1) through the log predictive score calculated on an out-of-sample (test) dataset $\mathcal{T}$ of size $\tilde{T}$. For each model the log predictive score is estimated by
$$\text{LPS} = -\frac{1}{\tilde{T}} \sum_{t \in \mathcal{T}} \log\left(\frac{1}{N} \sum_{i=1}^{N} p^{(i)}(y_t)\right),$$
where $p^{(i)}(y_t)$ denotes the one-step-ahead estimated transition probability of observing $y_t$, $t \in \mathcal{T}$, calculated using the parameter values at the $i$-th iteration of an MCMC run with $N$ iterations in total. It measures the predictive accuracy of the model by assessing the quality of the uncertainty quantification. A model predicts better when the log predictive score is smaller; see, for example, Czado et al. (2009). For each designed scenario, we generated 10 datasets with 5,000 data points; out-of-sample predictive performance for all models was tested by using either the first 4,000 or 4,500 data points as training datasets and calculating the log predictive scores, approximated via the MCMC output, at the remaining 1,000 or 500 test data points respectively. The mean log predictive score reported in Tables 1-3 is the average log predictive score across the 10 generated datasets.
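The score above is computed directly from the MCMC output. The sketch below assumes the per-iteration predictive probabilities are stored in an $N \times \tilde{T}$ array, which is our own arrangement for illustration.

```python
import numpy as np

def log_predictive_score(pred_probs):
    """Mean negative log of the MCMC-averaged one-step-ahead predictive probabilities.

    pred_probs: array of shape (N, T_test); entry (i, t) is p^(i)(y_t), the
    predictive probability of the observed y_t at MCMC iteration i."""
    avg = pred_probs.mean(axis=0)      # Monte Carlo estimate of p(y_t | data)
    return -np.log(avg).mean()

# Two hypothetical models scored on 3 test points with N = 2 MCMC draws each:
good = np.array([[0.5, 0.4, 0.6], [0.6, 0.5, 0.5]])   # assigns high probability to the data
bad = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2]])    # assigns low probability to the data
```

Note that the MCMC draws are averaged inside the logarithm, so the score evaluates the full posterior predictive distribution rather than any single parameter draw.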
The pre-training dataset for the BTF model was chosen to be the first 3,000 points. All MCMC runs were based on the following numbers of burn-in and posterior samples respectively: 2,000 and 5,000 for fitting the Poisson mixtures on the pre-training dataset; 1,000 and 2,000 for selecting the important lags and their corresponding numbers of inclusions; and 2,000 and 5,000 for sampling the rest of the parameters. Bayesian inference for the Poisson autoregressive model was obtained with the 'rjags' package (Plummer et al., 2016) based on 5,000 burn-in and 10,000 MCMC samples respectively. We first chose the order $q$ of the model by selecting among all models with maximum order up to $q + 2$ using the AIC and BIC criteria. We set vague priors for the parameters, such as $\beta_0 \sim N(0, 10^{-6})$. For the multivariate scenarios, we fitted an $M$-dimensional multivariate Poisson autoregressive model of order $q$ that predicts $y_{\ell,t}$ with covariates $\{y_{m,t-1}\}_{m \in [1,M], m \neq \ell}$ as
$$y_{\ell,t} \sim \text{Poisson}(\lambda_{\ell,t}), \quad \log(\lambda_{\ell,t}) = \beta_{\ell,0} + \sum_{i=1}^{q} \beta_{\ell,i} \log(y_{\ell,t-i} + 1) + \sum_{m \neq \ell} \zeta_{\ell,m} \log(y_{m,t-1} + 1), \qquad (16)$$
where $\beta_{\ell,0}$, $\beta_{\ell,i}$ and $\zeta_{\ell,m}$ are unknown parameters. Table 3 shows that for all 6 scenarios the Bayesian tensor factorisation model achieves impressively better predictive performance than the Bayesian Poisson autoregressive model. The times needed to run the MCMC algorithms for the Bayesian Poisson autoregressive and BTF models are comparable: for 1,000 iterations we needed, on average, 20 seconds for the BTF model implemented with our matlab code and 25 seconds for the Bayesian Poisson autoregressive models implemented with rjags.

Univariate flu data
We compared our Bayesian tensor factorisation model to the Bayesian Poisson autoregressive model with two datasets from Google Flu Trends, each consisting of 514 weekly flu counts: one for Norway and one for Castilla-La Mancha, Spain; see Figure 1. We chose the maximum lag $q$ to be 10 for all models we applied to the data. We examined the sensitivity to the size of the pre-training data by considering three scenarios, with pre-training sizes of 103 (20%), 154 (30%) and 206 (40%) points, and compared the resulting predictive ability against the best Bayesian Poisson autoregression formulations selected with the AIC and BIC criteria. The last 103 and 52 data points were chosen for out-of-sample test comparison for each dataset. To demonstrate how our methodology works, we present MCMC results for the Norway dataset based on 154 pre-training points; results for both datasets and for all pre-training sizes are given at the end of the Section. The pre-training results are illustrated in Figure 2. Six of the 10 clusters in the left panel are barely distinguishable from each other, so we fix the number of clusters to 5; see Figure 2. Figure 3 shows some MCMC results for the rest of our Bayesian tensor factorisation model. With 411 training data points, panels (a), (b) and (c) provide strong evidence that there are two important predictors, 6 possible combinations of $(h_1, \ldots, h_q)$ and 6 unique $\lambda_{h_1,\ldots,h_q}$. Similarly, when the length of the training dataset is 462, panels (d), (e) and (f) indicate that there is evidence for only one important predictor, that the total number of possible combinations of $(h_1, \ldots, h_q)$ is either 3 or 4, and that there are 3 unique $\lambda_{h_1,\ldots,h_q}$. Model selection results for the Poisson autoregression models are illustrated in Figure 4. MCMC was based on 5,000 burn-in and 10,000 posterior samples using 'rjags'; see Plummer et al. (2016). For the resulting parameter estimates see Table 4. The average run times for the MCMC algorithms for the BTF and the Bayesian Poisson autoregression models are comparable: 1,000 iterations take approximately 20 seconds for the former with our code written in matlab, whereas the latter takes approximately 25 seconds for 1,000 iterations in the R package 'rjags'.

Multivariate flu data
We revisit the flu data of the previous subsection by jointly modelling flu cases in (i) the adjacent Swiss cantons of Bern and Aargau and (ii) five neighbouring regions in south-eastern Spain, namely Andalusia, Castilla-La Mancha, Illes Balears, Region de Murcia and Valencian Community. The data are depicted in Figure 1 and consist of 514 weekly counts from 09-Oct-2005 to 09-Aug-2015. We chose the maximum lag $q$ to be ten for all multivariate BTF models we applied to the data. The sizes of the training against the testing datasets are 411 : 103 and 462 : 52 respectively. Our BTF model used the first 154 data points as the pre-training dataset. Figures 6-9 illustrate how lags were selected in each real data application. Note that a lag is considered to be important, and thus is selected, when its corresponding relative inclusion frequency is higher than 0.5. The predictive abilities of the models compared to the Bayesian Poisson autoregression models are given in Tables 6 and 7. In the Swiss cantons it seems that the BTF model underperforms when Aargau flu cases are predicted from past flu cases of Aargau and Bern, whereas it outperforms when we predict Bern cases based on past data from Aargau and Bern. An informal justification of this behaviour is that the two series appear to have very high positive contemporaneous and lag-one correlations, so naturally the model (16), which captures these linear dependencies very well, outperforms our model. Such situations are expected when a general non-parametric model is compared to a linear model and the data generating mechanism is primarily linear. Table 7 presents the five-dimensional example of the Spanish regions, in which the counts of each region are predicted from the past counts of all five regions. Here, in eight out of ten cases BTF outperforms the Poisson autoregression model; in particular, the log predictive scores are dramatically lower in all cases with the smaller training (411) and larger test (103) sizes. This is not surprising since our model is capable of capturing the complicated five-dimensional dependencies among these Spanish regions.

• Code availability: The code will be freely available from Petros Dellaportas' web site.
• Author contributions: The code has been written by Zhongzhen Wang. Zhongzhen Wang, Petros Dellaportas and Ioannis Kosmidis contributed equally to the development of the theory.
• Licence: For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.

Figure 1 :
Figure 1: Trace plots of 514 weekly time-series data points counting flu cases in Norway, in the Swiss cantons of Aargau and Bern, and in five regions in south-eastern Spain (Andalusia, Castilla-La Mancha, Illes Balears, Region de Murcia and Valencian Community), from 09-Oct-2005 to 09-Aug-2015.

Figure 2 :
Figure 2: Fitting of a mixture of Poisson distributions. The dataset used is the pre-training data from weekly flu cases in Norway from 09-Oct-2005 to 09-Aug-2015. Panels (a) and (b) show the outcomes for a total number of clusters $c$ of 10 and 5 respectively. The top panels illustrate the Poisson rates for each cluster label, whilst the bottom panels show the corresponding log weights.

Figure 3 :
Figure 3: MCMC frequency results. In all panels, the x-axis represents the count and the y-axis the relative frequency. Top three panels: 411 training data points; bottom three panels: 462 training data points. (a,d): The relative frequency distributions of the number of important predictor(s). (b,e): The relative frequency distributions of $\prod_{j=1}^{q} k_j$, the total number of possible combinations of $(h_1, \ldots, h_q)$. (c,f): The relative frequency distributions of the number of unique $\lambda_{h_1,\ldots,h_q}$.

Figure 4 :
Figure 4: AIC and BIC scores given by PAR($q$) models, with $q$ labelled on the x-axis, for weekly flu cases in Norway from 09-Oct-2005 to 09-Aug-2015. (a): AIC scores for the scenario with 411 training and 103 testing data points; (b): BIC scores for the same scenario; (c): AIC scores for the scenario with 462 training and 52 testing data points; (d): BIC scores for the same scenario.

Figure 5 :
Figure 5: Out of sample predictive means and 95% highest credible regions (HCRs) of Bayesian tensor factorisations (BTF) and Poisson autoregressive models (PAR) compared against the Castilla-La Mancha data.The sizes of training and testing data are 411 and 103 respectively.

Figure 6 :
Figure 6: Lag selection for the Norway (left pair) and the Castilla-La Mancha (right pair) flu datasets.Each pair of figures represents (i) the inclusion proportions (y-axis) of different lags (x-axis) for the scenario with 411 training and 103 testing data points and (ii) the inclusion proportions (y-axis) of different lags (x-axis) for the scenario with 462 training data points and 52 testing data points.

Figure 8 :
Figure 8: Important lag selection for the south-eastern Spain flu dataset.Y-axis represents the inclusion proportions of different lags in x-axis for the scenario with 411 training data points and 103 testing data points.A: Andalusia; CLM: Castilla-La Mancha; IB: Illes Balears; RM: Region de Murcia; VC: Valencian Community.

Figure 9 :
Figure 9: Important lag selection for the south-eastern Spain flu dataset.Y-axis represents the inclusion proportions of different lags in x-axis for the scenario with 462 training data points and 52 testing data points.A: Andalusia; CLM: Castilla-La Mancha; IB: Illes Balears; RM: Region de Murcia; VC: Valencian Community.

Table 1 :
Table 1 presents the results of the out-of-sample comparison of predictive ability based on six generated Poisson autoregressive models based on (1). Notice that when the order $q$ is high and there are only a few true coefficients, as in cases C, E and F, the maximal order Markov structure of the BTF model achieves a comparable, satisfactory predictive performance. Given that the data generating process is based on Poisson autoregressive models, these results are very promising. Mean log predictive scores (with standard deviations in brackets) for Bayesian Poisson autoregressive models and our Bayesian tensor factorisation model (BTF) based on 10 Poisson autoregression generated datasets for each one of 6 scenarios. The AIC and BIC columns indicate that the best model has been chosen with the corresponding criterion. Models with the best performance are highlighted in bold.
Next, we generated data in which past values affect current random variables in a non-linear fashion as follows. There are $K$ important lag(s) $\{y_{t-i_1}, \ldots, y_{t-i_K}\}$ and, for given $\nu_+, \nu_-$: if $\sum_{j=1}^{K} y_{t-i_j} \geq K\nu_+$, then $y_t \sim \text{Poisson}(\nu_+)$; else $y_t \sim \text{Poisson}(\nu_-)$. We designed

Table 2 :
Mean log predictive scores (with standard deviations in brackets) for Bayesian Poisson autoregressive models and our Bayesian tensor factorisation model (BTF) based on 10 nonlinear generated datasets for each one of 6 scenarios. The AIC and BIC columns indicate that the best model has been chosen with the corresponding criterion. Models with the best performance are highlighted in bold.

6 scenarios and the results are shown in Table 2. Our proposed modelling formulation outperforms the Bayesian Poisson autoregressive model in all but one scenario. Finally, we replicated the last exercise by testing the models in a more challenging data generation mechanism in which the response is multivariate. We designed 6 different scenarios by generating an $M$-dimensional time series $\{y_{m,t}\}_{m \in [1,M]}$ and assuming that we are interested in predicting $y_{1,t}$. For $t \leq 10$, we generated $y_{m,t}$ from Poisson($\nu_-$) for each $m$; for $t > 10$, if $\sum_{i=1}^{K} y_{m_i, t-j_i} \geq \nu_-$ we generate $y_{1,t} \sim \text{Poisson}(\nu_+)$, else $y_{1,t} \sim \text{Poisson}(\nu_-)$.

Table 3 :
Mean log predictive scores (with standard deviations in brackets) for Bayesian Poisson autoregressive models and our Bayesian tensor factorisation model (BTF) based on 10 nonlinear generated datasets for each one of 6 scenarios. The AIC and BIC columns indicate that the best model has been chosen with the corresponding criterion. Models with the best performance are highlighted in bold.

Table 4 :
Means of coefficients (with standard deviations in brackets) based on 5,000 burn-in and 10,000 MCMC runs. Two scenarios with different sizes of training against testing data are shown in the columns, with their corresponding selected models indicated.
Table 5 indicates that in all pre-training size scenarios BTF outperforms, in terms of predictive ability expressed with log predictive scores, the Bayesian Poisson autoregression models. There is clearly a trade-off between good mixture estimation and adequate training size, associated with small and large pre-training sizes respectively. In our small empirical study there is evidence for some robustness of the inference procedure when the pre-training size is small, since 103 points outperform 206 points, with 154 points being the best performing pre-training size. The predictive means and 95% credible intervals of BTF and of the PAR(5) model, which had one of the best predictive performances based on the 103 test data points, are depicted in Figure 5.

Table 5 :
Log predictive scores for Bayesian Poisson autoregression models and our Bayesian tensor factorisation model (BTF) for the flu counts datasets in Norway and Castilla-La Mancha, Spain. The BTF model was run with 103, 154 and 206 pre-training data points (PTDPs). The AIC and BIC columns indicate that the best model has been chosen (in brackets) with the corresponding criterion. Training and testing data sizes appear in the second column. Models with the best performance are highlighted in bold.

Table 6 :
Log predictive scores for the Bayesian Poisson autoregressive model and the Bayesian tensor factorisation model (BTF) for multivariate flu counts datasets. The datasets include flu counts in two cantons in Switzerland, Aargau and Bern. The AIC and BIC columns indicate that the best model has been chosen (in brackets) with the corresponding criterion. Models with the best performance are highlighted in bold.