Probabilistic time series forecasts with autoregressive transformation models

Probabilistic forecasting of time series is an important matter in many applications and research fields. In order to draw conclusions from a probabilistic forecast, we must ensure that the model class used to approximate the true forecasting distribution is expressive enough. Yet, characteristics of the model itself, such as its uncertainty or its feature-outcome relationship, are no less important. This paper proposes autoregressive transformation models (ATMs), a model class inspired by various research directions that unites expressive distributional forecasts based on a semi-parametric distribution assumption with an interpretable model specification. We demonstrate the properties of ATMs both theoretically and through empirical evaluation on several simulated and real-world forecasting datasets.


Introduction
Conditional models describe the conditional distribution $F_{Y|x}(y \mid x)$ of an outcome $Y$ conditional on observed features $x$ (see, e.g., Jordan et al, 2002). Instead of modeling the complete distribution of $Y \mid x$, many approaches focus on modeling a single characteristic of this conditional distribution. Predictive models, for example, often focus on predicting the average outcome value, i.e., the expectation of the conditional distribution. Quantile regression (Koenker, 2005), which is used to model specific quantiles of $Y \mid x$, is more flexible in explaining the conditional distribution by allowing (at least theoretically) for arbitrary distribution quantiles. Various other approaches allow for an even richer explanation by, e.g., directly modeling the distribution's density $f_{Y|x}$ and thus the whole distribution $F_{Y|x}(y \mid x)$. Examples include mixture density networks (Bishop, 1994) in machine learning, or, in general, probabilistic modeling approaches such as Gaussian processes or graphical models (Murphy, 2012). In statistics and econometrics, similar approaches exist, which can be broadly characterized as distributional regression (DR) approaches (Chernozhukov et al, 2013; Foresi and Peracchi, 1995; Rügamer et al, 2020; Wu and Tian, 2013). Many of these approaches can also be regarded as conditional density estimation (CDE) models.
Modeling $F_{Y|x}(y \mid x)$ is a challenging task that requires balancing the representational capacity of the model (the expressiveness of the modeled distribution) and its risk for overfitting. While the inductive bias introduced by parametric methods can help to reduce the risk of overfitting and is a basic foundation of many autoregressive models, their expressiveness is potentially limited by this distribution assumption (cf. Figure 1).
Figure 1 Exemplary comparison of probabilistic forecasting approaches with the proposed method (ATM; with its uncertainty depicted by the darker shaded area) for a given time series (red line). While other methods are not expressive enough and tailored toward a simple unimodal distribution, our approach allows for complex probabilistic forecasts (here a bimodal distribution where the inducing mixture variable is unknown to all methods).

Our contributions
In this work, we propose a new and general class of semi-parametric autoregressive models for time series analysis called autoregressive transformation models (ATMs; Section 3) that learn expressive distributions based on interpretable parametric transformations. ATMs can be seen as a generalization of autoregressive models. We study the autoregressive transformation of order p (AT(p)) in Section 4 as the closest neighbor to a parametric autoregressive model, and derive asymptotic results for estimated parameters in Section 4.2. Finally, we provide evidence for the efficacy of our proposal both with numerical experiments based on simulated data and by comparing ATMs against other existing time series methods.

Background and Related Work
Approaches that model the conditional density can be distinguished by their underlying distribution assumption. Approaches can be parametric, such as mixture density networks (Bishop, 1994) for conditional density estimation, which learn the parameters of a pre-specified parametric distribution, or non-parametric, such as Bayesian non-parametrics (Dunson, 2010). A third line of research that we describe as semi-parametric comprises approaches that start with a simple parametric distribution assumption $F_Z$ and end up with a far more flexible distribution $F_{Y|x}$ by transforming $F_Z$ (multiple times). Such approaches have sparked great interest in recent years, triggered by research ideas such as density estimation using non-linear independent components estimation or real-valued non-volume preserving transformations (Dinh et al, 2017). A general notion of such transformations is known as normalizing flow (NF; Papamakarios et al, 2021), where realizations $z \sim F_Z$ of an error distribution $F_Z$ are transformed to observations $y$ via

$y = (f_k \circ \cdots \circ f_1)(z)$

using $k$ transformation functions. Many different approaches exist to define expressive flows. These are often defined as a chain of several transformations or an expressive neural network and allow for universal representation of $F_{Y|x}$ (Papamakarios et al, 2021). Autoregressive models (e.g., Bengio and Bengio, 1999; Uria et al, 2016) for distribution estimation of continuous variables are a special case of NFs, more precisely autoregressive flows (AFs; Kingma et al, 2016; Papamakarios et al, 2017), with a single transformation.

Transformation models
Transformation models (TMs; Hothorn et al, 2014), a similar concept to NFs, only consist of a single transformation and thereby better allow theoretically studying model properties. The transformation in TMs is chosen to be expressive enough on its own and comes with desirable approximation guarantees. Instead of a transformation from $z$ to $y$, TMs define an inverse flow $h(y) = z$. The key idea of TMs is that many well-known statistical regression models can be represented by a base distribution $F_Z$ and some transformation function $h$. Prominent examples include linear regression or the Cox proportional hazards model (Cox, 1972), which can both be seen as a special case of TMs (Hothorn et al, 2014). Various authors have noted the connection between autoregressive models and NFs (e.g., Papamakarios et al, 2021) and between TMs and NFs (e.g., Sick et al, 2021). Advantages of TMs and conditional TMs (CTMs) are their parsimony in terms of parameters, interpretability of the input-output relationship, and existing theoretical results (Hothorn et al, 2018). While mostly discussed in the statistical literature, various recent TM advancements have also been proposed in the field of machine learning (see, e.g., Van Belle et al, 2011) and deep learning (see, e.g., Baumann et al, 2021; Kook et al, 2021, 2022).

Time series forecasting
In time series forecasting, many approaches rely on autoregressive models, with one of the most commonly known linear models being autoregressive (integrated) moving average (AR(I)MA) models (see, e.g., Shumway et al, 2000). Extensions include the bilinear model of Granger and Andersen (1978) and Rao (1981), or the Markov switching autoregressive model by Hamilton (2010). Related to these autoregressive models are stochastic volatility models (Kastner et al, 2017) building upon the theory of stochastic processes. In probabilistic forecasting, Bayesian model averaging (Raftery et al, 2005) and distributional regression forecasting (Schlosser et al, 2019) are two further popular approaches, while many other Bayesian and non-Bayesian techniques exist (see, e.g., Gneiting and Katzfuss, 2014, for an overview).

Transformation models
Parametrized transformation models as proposed by Hothorn et al (2014, 2018) assume that the outcome distribution can be written as

$F_{Y|x}(y \mid x) = F_Z(h(y \mid x)), \qquad (2)$

with a pre-specified base distribution $F_Z$ and a monotonically increasing transformation function $h$. Conditional TMs (CTMs) learn $h(y \mid x)$ from the data, i.e., estimate a model for the (conditional) aleatoric uncertainty.
A convenient parameterization of $h$ for continuous $Y$ are Bernstein polynomials (BSPs; Farouki, 2012) with order $M$ (usually $M \leq 50$). BSPs are motivated by the Bernstein approximation (Bernstein, 1912) with uniform convergence guarantees for $M \to \infty$, while also being computationally attractive with only $M + 1$ parameters. BSPs further have easy and analytically accessible derivatives, which makes them a particularly interesting choice for the change of random variables. We denote the BSP basis by $a_M : \Xi \to \mathbb{R}^{M+1}$ with sample space $\Xi$. The transformation $h$ is then defined as $h(y \mid x) = a_M(y)^\top \vartheta(x)$ with feature-dependent basis coefficients $\vartheta$. This can be seen as an evaluation of $y$ based on a mixture of Beta densities $f_{Be(\kappa,\mu)}$ with different distribution parameters $\kappa, \mu$ and weights $\vartheta(x)$:

$h(y \mid x) = \frac{1}{M+1} \sum_{m=0}^{M} \vartheta_m(x)\, f_{Be(m+1,\, M-m+1)}(\tilde{y}),$

where $\tilde{y}$ is a rescaled version of $y$ to ensure $\tilde{y} \in [0, 1]$. Restricting $\vartheta_m > \vartheta_{m-1}$ for $m = 1, \ldots, M + 1$ guarantees monotonicity of $h$ and thus of the estimated CDF. Roughly speaking, using BSPs of order $M$ allows modeling polynomials of degree $M$ in $y$.
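To make this parameterization concrete, the following base-R sketch evaluates the BSP basis as a scaled mixture of Beta densities and enforces the monotonicity restriction through cumulated positive increments; the function names and the softplus-based reparameterization are illustrative assumptions, not the implementation used in the paper.

```r
# BSP basis of order M evaluated as scaled Beta densities (minimal sketch)
bsp_basis <- function(y, M, lower = min(y), upper = max(y)) {
  y_tilde <- (y - lower) / (upper - lower)                        # rescale to [0, 1]
  sapply(0:M, function(m) dbeta(y_tilde, m + 1, M - m + 1) / (M + 1))
}

# Monotonically increasing coefficients via cumulated softplus increments
monotone_coefs <- function(theta_raw) {
  cumsum(c(theta_raw[1], log1p(exp(theta_raw[-1]))))
}

y     <- seq(1, 10, length.out = 200)
A     <- bsp_basis(y, M = 6)                    # 200 x 7 basis matrix
theta <- monotone_coefs(rnorm(7))               # increasing coefficients
h     <- A %*% theta                            # monotone transformation h(y)
```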

Model definition
The transformation function $h$ can include different data dependencies. One common choice (Hothorn, 2020; Baumann et al, 2021) is to split the transformation function into two parts,

$h(y \mid x) = a(y)^\top \vartheta(x) + \beta(x), \qquad (4)$

where $a(y)$ is a pre-defined basis function such as the BSP basis (omitting $M$ for readability in the following), $\vartheta : \chi_\vartheta \to \mathbb{R}^{M+1}$ a conditional parameter function defined on $\chi_\vartheta \subseteq \chi$, and $\beta(x)$ models a feature-induced shift in the transformation function. The flexibility and interpretability of TMs stems from the parameterization

$\vartheta(x) = \sum_{j} \Gamma_j\, b_j(x), \qquad (5)$

where the matrix $\Gamma_j \in \mathbb{R}^{(M+1)\times O_j}$, $O_j \geq 1$, subsumes all trainable parameters and represents the effect of the interaction between the basis functions in $a$ and the chosen predictor terms $b_j : \chi_{b_j} \to \mathbb{R}^{O_j}$, $\chi_{b_j} \subseteq \chi$. The predictor terms $b_j$ have a role similar to base learners in boosting and represent simple learnable functions. For example, a predictor term can be the $j$th feature, $b_j(x) = x_j$, and $\Gamma_j \in \mathbb{R}^{(M+1)\times 1}$ describes the linear effect of this feature on the $M + 1$ basis coefficients, i.e., how the feature $x_j$ relates to the density transformation from $Z$ to $Y \mid x$. Other structured non-linear terms such as splines allow for interpretable lower-dimensional non-linear relationships. Various authors have also proposed neural network (unstructured) predictors to allow for potentially multidimensional feature effects or to incorporate unstructured data sources (Sick et al, 2021; Baumann et al, 2021; Kook et al, 2021). In a similar fashion, $\beta(x)$ can be defined using various structured and unstructured predictors.
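A minimal sketch of the additive parameterization (5) with two illustrative predictor terms, an intercept and a single linear feature effect; both choices, as well as the variable names, are assumptions for this toy example (and monotonicity of the resulting coefficients is not enforced here).

```r
# theta(x) = Gamma_int * b_0(x) + Gamma_lin * b_1(x) with b_0(x) = 1, b_1(x) = x1
M         <- 6
Gamma_int <- rnorm(M + 1)                 # (M+1) x 1 intercept effect on the coefficients
Gamma_lin <- rnorm(M + 1)                 # (M+1) x 1 linear effect of feature x1
theta_x   <- function(x1) Gamma_int + Gamma_lin * x1

theta_x(0.3)                              # basis coefficients for x1 = 0.3
```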

Interpretability
Relating features and their effect in an additive fashion allows one to directly assess the impact of each feature on the transformation and also whether changes in the feature merely shift the distribution in its location or whether the relationship also transforms other distribution characteristics such as variability or skewness (see, e.g., Baumann et al, 2021, for more details).

Relationship with autoregressive flows
In the notation of AFs, $h^{-1}(\cdot)$ is known as the transformer, a parameterized and bijective function. By the definition of (4), the transformer in the case of TMs is represented by the basis function $a(\cdot)$ and parameters $\vartheta$. In AFs, these transformer parameters are learned by a conditioner, which in the case of TMs are the functions $b_j$. In line with the assumptions made for AFs, these conditioners in TMs do not need to be bijective functions themselves.

Autoregressive Transformations
Inspired by TMs and AFs, we propose autoregressive transformation models (ATMs). Our work is the first to adapt TMs for time series data and thereby lays the foundation for future extensions of TMs for time series forecasting. The basic idea is to use a parameter-free base distribution $F_Z$ and transform this distribution in an interpretable fashion to obtain $F_{Y|x}$. One of the assumptions of TMs is the stochastic independence of observations, i.e., that the joint likelihood factorizes into the product of the individual observations' contributions. When $Y$ is a time series, this assumption clearly does not hold. In contrast, this assumption is not required for AFs.
Let $t \in T \subseteq \mathbb{N}_0$ be a time index for the time series for some $p \in \{1, \ldots, t\}$, $G$ a distribution, $\theta \in \Theta$ a parameter with compact parameter space $\Theta \subset \mathbb{R}^v$, and $\mathcal{F}_s$, $s \in T$, $s < t$, a filtration on the underlying probability space. Assume that the joint distribution of $Y_t, Y_{t-1}, \ldots, Y_1$ possesses the Markov property with order $p$, i.e., the joint distribution, expressed through its absolutely continuous density $f$, can be rewritten as the product of its conditionals with $p$ lags:

$f(y_t, y_{t-1}, \ldots, y_1 \mid x) = \prod_{s=p+1}^{t} f(y_s \mid y_{s-1}, \ldots, y_{s-p}, x)\, f(y_p, \ldots, y_1 \mid x). \qquad (7)$

We use $x$ to denote additional (potentially time-varying) exogenous features. Their time-dependency is omitted for better readability here and in the following. Given this autoregressive structure, we propose a time-dependent transformation $h_t$ that extends (C)TMs to account for filtration and time-varying feature information. By modeling the conditional distribution of all time points in a flexible manner, ATMs provide an expressive way to account for aleatoric uncertainty in the data.

Definition 1 Autoregressive Transformation Models
Let $h_t$, $t \in T$, be a time-dependent monotonic transformation function and $F_Z$ the parameter-free base distribution as in Definition 1 in the Supplementary Material. We define autoregressive transformation models via

$F_{Y_t \mid \mathcal{F}_{t-1}, x}(y_t \mid \mathcal{F}_{t-1}, x) = F_Z\big(h_t(y_t \mid \mathcal{F}_{t-1}, x)\big). \qquad (8)$

This can be seen as the natural extension of (2) for time series data with autoregressive property and time-varying transformation function $h_t$. In other words, (8) says that after transforming $y_t$ with $h_t$, its conditional distribution follows the base distribution $F_Z$, or vice versa, a random variable $Z \sim F_Z$ can be transformed to follow the distribution of $Y_t \mid x$ using $h_t^{-1}$.

Relationship with autoregressive models and autoregressive flows
Autoregressive models (AMs; Bengio and Bengio, 1999) and AFs both rely on the factorization of the joint distribution into conditionals as in (7). Using the CDF of each conditional in (7) as the transformer in an AF, we obtain the class of AMs (Papamakarios et al, 2021). AMs and ATMs are thus both (inverse) flows using a single transformation, but with different transformers and, as we will outline in Section 3.2, also with different conditioners.

Likelihood-based estimation
Based on (7), (8) and the change of variables theorem, the likelihood contribution of the $t$th observation $y_t$ in ATMs is given by

$\ell_t(\theta) = f_Z\big(h_t(y_t \mid \mathcal{F}_{t-1}, x)\big)\, \Big| \frac{\partial h_t(y_t \mid \mathcal{F}_{t-1}, x)}{\partial y_t} \Big|,$

and the full likelihood for $T$ observations thus by

$L(\theta) = \prod_{t=1}^{T} \ell_t(\theta), \qquad (9)$

where $\mathbf{Y}_0 = (y_0, \ldots, y_{-p+1})$ are known finite starting values and $\mathcal{F}_0$ only contains these values.
Based on (9), we define the loss of all model parameters $\theta$ as the negative log-likelihood

$-\ell(\theta) := -\sum_{t=1}^{T} \Big[ \log f_Z\big(h_t(y_t \mid \mathcal{F}_{t-1}, x)\big) + \log \Big| \frac{\partial h_t(y_t \mid \mathcal{F}_{t-1}, x)}{\partial y_t} \Big| \Big] \qquad (10)$

and use (10) to train the model. As for AFs, many special cases can be defined from the above definition, and more concrete structural assumptions for $h_t$ make ATMs an interesting alternative to other methods in practice. We will elaborate on meaningful structural assumptions in the following.
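As a minimal sketch, the likelihood contribution and the loss (10) can be written as follows for a standard normal base distribution $F_Z$ (an assumption here), given the evaluated transformation $h_t$ and its derivative with respect to $y_t$; the function names are illustrative.

```r
# Log-likelihood contribution of one observation: log f_Z(h_t) + log |dh_t/dy_t|
loglik_contrib <- function(h_t, dh_dy) dnorm(h_t, log = TRUE) + log(abs(dh_dy))

# Negative log-likelihood over all T observations, used as training loss (10)
atm_nll <- function(h_all, dh_all) -sum(loglik_contrib(h_all, dh_all))
```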

Structural assumptions
In CTMs, the transformation function $h$ is usually decomposed as $h(y \mid x) = h_1(y) + h_2(x)$, where $h_1$ is a function depending on $y$ and $h_2$ is a transformation-shift function depending only on $x$. For time-varying transformations $h_t$, our fundamental idea is that the outcome $y_t$ shares the same transformation with its filtration $\mathcal{F}_{t-1}$, i.e., the lags $\mathbf{Y}_t = (y_{t-1}, \ldots, y_{t-p})$. In other words, a transformation applied to the outcome must be equally applied to its predecessors in time to make sense of the autoregressive structural assumption. An appropriate transformation structure can thus be described by

$h_t(y_t \mid \mathcal{F}_{t-1}, x) = \underbrace{h_{1t}(y_t \mid x)}_{=: \lambda_{1t}} + \underbrace{h_{2t}\big(h_{1t} \odot \mathbf{Y}_t, x\big)}_{=: \lambda_{2t}} \qquad (11)$

for $t \in T$, where $\odot$ indicates the element-wise application of $h_{1t}$ to all lags in $\mathbf{Y}_t$. In other words, ATMs first apply the same transformation $h_{1t}$ to $y_t$ and individually to $y_{t-1}, y_{t-2}, \ldots$, and then further consider a transformation function $h_{2t}$ to shift the distribution (and thereby potentially other distribution characteristics) based on the transformed filtration. While the additivity assumption of $\lambda_{1t}$ and $\lambda_{2t}$ seems restrictive at first glance, the imposed relationship between $y_t$ and $\mathbf{Y}_t$ only needs to hold in the transformed probability space. For example, $h_{1t}$ can compensate for a multiplicative autoregressive effect between the filtration and $y_t$ by implicitly learning a log-transformation (cf. Section 5.1). At the same time, the additivity assumption offers a nice interpretation of the model, also depicted in Figure 2: after transforming $y_t$ and $\mathbf{Y}_t$, (11) implies that training an ATM is equal to fitting a regression model of the form $\lambda_{1t} = -\lambda_{2t} + \varepsilon$, with additive error term $\varepsilon \sim F_Z$ (cf. Proposition 1 in Supplementary Material A.2). This also helps explaining why only $\lambda_{2t}$ depends on $\mathcal{F}_{t-1}$: if $\lambda_{1t}$ also involved $\mathcal{F}_{t-1}$, ATMs would effectively model the joint distribution of the current time point and the whole filtration, which in turn contradicts the Markov assumption (7). Specifying $h_{1t}$ too flexibly clearly results in overfitting. As for CTMs, we therefore use a feature-driven basis function representation $h_{1t}(y_t \mid x) = a(y_t)^\top \vartheta(x)$ with BSPs $a$ and specify their weights as in (5). The additional transformation $h_{2t}$ ensures enough flexibility for the relationship between the transformed response and the transformed filtration, e.g., by using a non-linear model or neural network. An interesting special case arises for linear transformations in $h_{2t}$, which we elaborate on in Section 4 in more detail.

Interpretability
The three main properties that make ATMs interpretable are 1) their additive predictor structure as outlined in (5); 2) the clear relationship between features and the outcome through the BSP basis; and 3) the ATM's structural assumption as given in (11). As for (generalized) linear models, the additivity assumption in the predictor allows interpreting feature influences through their partial effect ceteris paribus. On the other hand, the choices of $M$ and $F_Z$ will influence the relationship between features and outcome by inducing different types of models. A normal distribution assumption for $F_Z$ and $M = 1$ will turn ATMs into an additive regression model with Gaussian error distribution (see also Section 4). For $M > 1$, features in $h_1$ will also influence higher moments of $Y \mid x$ and allow more flexibility in modeling $F_{Y|x}$. For example, a (smooth) monotonically increasing feature effect will induce rising moments of $Y \mid x$ with increasing feature values. Other choices for $F_Z$ such as the logistic distribution also allow for an easy interpretation of feature effects (e.g., on the log-odds ratio scale; see Kook et al, 2021). Finally, the structural assumption of ATMs enforces that the two previous interpretability aspects are consistent over time. We will provide an additional illustrative example in Section 5.2, further explanations in Supplementary Material B, and refer to Hothorn et al (2014) for more details on the interpretability of CTMs.

Implementation
In order to allow for a flexible choice of transformation functions and predictors $b_j$, we propose to implement ATMs in a neural network and use stochastic gradient descent for optimization. While this allows for complex model definitions, there are also several computational advantages. In a network, weight sharing for $h_{1t}$ across time points is straightforward to implement, and common optimization routines such as Adam (Kingma and Ba, 2014) prove to work well for ATMs despite the monotonicity constraints required for the BSP basis. Furthermore, as basis evaluations for a large number of outcome lags in $\mathcal{F}_{t-1}$ can be computationally expensive for large $p$ (with space complexity $\mathcal{O}(t \cdot M \cdot p)$) and add $M$ additional columns per lag to the feature matrix, an additional advantage is the dynamic nature of mini-batch training. In this specific case, it allows for evaluating the bases only during training and separately in each mini-batch. It is therefore never required to set up and store the respective matrices.
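The mini-batch idea can be sketched as follows: the BSP bases of $y_t$ and its $p$ lags are evaluated only for the indices of the current batch, so the full basis array is never stored; the batch size, $M$, and all helper names are assumptions for illustration.

```r
bsp <- function(v, M) sapply(0:M, function(m) dbeta(v, m + 1, M - m + 1) / (M + 1))

# Bases for y_t and its p lags, evaluated only at the batch indices `idx`
batch_bases <- function(y_scaled, idx, p, M = 10)
  lapply(0:p, function(lag) bsp(y_scaled[idx - lag], M))

T_len    <- 5000; p <- 3
y_scaled <- runif(T_len)                                     # stand-in for the rescaled series
batches  <- split(sample((p + 1):T_len), ceiling(seq_len(T_len - p) / 128))
# for (idx in batches) { B <- batch_bases(y_scaled, idx, p); ... gradient step ... }
```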

AT(p) Model
A particularly interesting special case of ATMs is the AT(p) model. This model class is a direct extension of the well-known autoregressive model of order p (short AR(p) model; Shumway et al, 2000) to transformation models.
Definition 2 AT(p) model We define the AT(p) model, a special class of ATMs, by setting $h_{1t}(y_t \mid x) = a(y_t)^\top \vartheta(x)$ and $h_{2t}(\mathcal{F}_{t-1}, x) = \sum_{j=1}^{p} \phi_j\, h_{1t}(y_{t-j}) + r(x)$, i.e., an autoregressive shift term with an optional exogenous remainder term $r(x)$.
As for classical time series approaches, φ j are the regression coefficients relating the different lags to the outcome and r is a structured model component (e.g., linear effects) of exogenous features that do not vary over time.

Model Details
The AT(p) model is a very powerful and interesting model class in itself, as it allows recovering the classical time series AR(p) model when setting $M = 1$, $\vartheta(x) \equiv \vartheta$ and $r(x) \equiv 0$ (see Proposition 2 in Supplementary Material A for a proof of equivalence). But it can also be extended to more flexible autoregressive models in various directions. We can increase $M$ to get a more flexible density, allowing us to deviate from the base distribution assumption $F_Z$, e.g., to relax the normal distribution assumption of AR models. Alternatively, incorporating exogenous effects into $h_{1t}$ allows estimating the density in a data-driven manner or introducing exogenous shifts in the time series using features $x$ in $r(x)$. ATMs can also recover well-known transformed autoregressive models such as the multiplicative autoregressive model (Wong and Li, 2000), as demonstrated in Section 5.1. When specifying $M$ large enough, an AT(p) model will, e.g., learn the log-transformation function required to transform a multiplicative autoregressive time series into an additive autoregressive time series on the log-scale. In general, this allows the user to learn autoregressive models without the need to find an appropriate transformation before applying the time series model. This means that the uncertainty about preprocessing steps (e.g., a Box-Cox transformation; Sakia, 1992) is incorporated into the model estimation, making parts of the pre-processing obsolete for the modeler and its uncertainty automatically available. Non-linear extensions of AT(p) models can be constructed by modeling $\mathbf{Y}_t$ in $h_{2t}$ non-linearly, allowing ATMs to resemble model classes such as non-linear AR models with exogenous terms (e.g., Lin et al, 1996). In practice, values for $p$ can, e.g., be found using a (forward) hyperparameter search by comparing the different model likelihoods.
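To illustrate the point about implicitly learned transformations, the following self-contained sketch fits an AT(1) model by maximum likelihood (via plain optim, not the neural-network implementation used in the paper) to a multiplicative AR(1) series and checks whether the learned $h_{1t}$ resembles a log transformation; all function names, the order $M$, and the monotone reparameterization are assumptions for this example.

```r
set.seed(1)
T_len <- 500
z <- as.numeric(arima.sim(list(ar = 0.5), n = T_len))   # Gaussian AR(1) on the log-scale
y <- exp(z)                                              # multiplicative AR(1) series

M <- 8; lower <- min(y) - 1e-6; upper <- max(y) + 1e-6
bsp <- function(v, M) {                                  # BSP basis as scaled Beta densities
  vt <- (v - lower) / (upper - lower)
  sapply(0:M, function(m) dbeta(vt, m + 1, M - m + 1) / (M + 1))
}

nll <- function(par) {
  theta <- cumsum(c(par[1], log1p(exp(par[2:(M + 1)]))))   # monotone BSP coefficients
  phi   <- par[M + 2]                                      # lag coefficient of the AT(1)
  h1    <- function(v) drop(bsp(v, M) %*% theta)
  dh1   <- function(v) drop(bsp(v, M - 1) %*% (M * diff(theta))) / (upper - lower)
  idx   <- 2:T_len
  h_t   <- h1(y[idx]) + phi * h1(y[idx - 1])               # h_1t(y_t) + phi * h_1t(y_{t-1})
  -sum(dnorm(h_t, log = TRUE) + log(dh1(y[idx])))          # negative log-likelihood (10)
}

fit <- optim(rep(0, M + 2), nll, method = "BFGS", control = list(maxit = 500))

# If the log shape is recovered, the fitted h_1t is almost perfectly correlated
# with log(y) on the bulk of the observed range.
theta_hat <- cumsum(c(fit$par[1], log1p(exp(fit$par[2:(M + 1)]))))
grid <- seq(quantile(y, 0.01), quantile(y, 0.99), length.out = 100)
cor(drop(bsp(grid, M) %*% theta_hat), log(grid))
```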

Asymptotic theory
An important yet often neglected aspect of probabilistic forecasts is the epistemic uncertainty, i.e., the uncertainty in model parameters. Based on general asymptotic theory for time series models (Ling and McAleer, 2010), we derive theoretical properties for AT(p)s in this section.
Let $\theta^*$ be the true value of $\theta$ and an interior point of $\Theta$. We define the following quantities involved in standard asymptotic MLE theory: let $\hat{\theta}_T = \arg\min_{\theta \in \Theta} -\ell(\theta)$ be the parameter estimator based on maximum likelihood estimation (MLE). We further state the necessary assumptions to apply the theory of Ling and McAleer (2010) for a time series $(Y_t)_{t \in T}$ with known initial values $\mathbf{Y}_0$ as defined in Section 3.

Assumption 1 Assume
(i) $(Y_t)_{t \in T}$ is strictly stationary and ergodic; with $0 < J < \infty$; (iv) $I$ is positive-definite and, for some $\xi > 0$, $E_G\{\sup_{\theta:\|\theta - \theta^*\| < \xi} J_T(\theta)\} < \infty$. Assumptions 1 are common assumptions required for many time series models. We require only these and no other assumptions since AT(p)s and their non-linear extensions are fully-parameterized time series models. This allows us to derive general statistical inference theory for AT(p) models. As stated in Hothorn et al (2018), Assumption 1(ii) holds if $a$ is not arbitrarily ill-posed. In practice, both a finite $\mathbf{Y}_0$ and Assumption 1(i) are realistic assumptions. Making two additional and also rather weak assumptions (1(iii)-(iv)) allows deriving the asymptotic normal distribution for $\hat{\theta}$. Based on the same assumptions, a consistent estimator for the covariance can be derived. The previous theorems can be proven by observing that the AT(p) model structure and all assumptions made follow the general asymptotic theory for time series models as given in Ling and McAleer (2010). See Supplementary Material A for details.
Using the above results, we can derive statistically valid uncertainty quantification (UQ). An example is depicted in Figure 3. Since $h$ is parameterized through $\theta$, it is also possible to derive the so-called structural uncertainty of ATMs, i.e., the uncertainty induced by the discrepancy between the model's CDF $F_{Y|x}(y \mid x; \theta)$ and the true CDF $F^*_{Y|x}(y \mid x)$ (Liu et al, 2019). More specifically, $h$ can be represented using a linear transformation of $\theta$, $h = \Upsilon\theta$, implying the (co-)variance $\Upsilon I^{-1} J(\theta^*) I^{-1} \Upsilon^\top$ for $\hat{h}$.
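The resulting parameter covariance can be propagated to the transformation itself; a one-line sketch of this delta-method step, where the matrix names are placeholders for the quantities defined above:

```r
# Covariance of h = Upsilon %*% theta implied by the sandwich covariance of theta
h_cov <- function(Upsilon, I_inv, J) Upsilon %*% I_inv %*% J %*% I_inv %*% t(Upsilon)
```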

Practical application
Forecasts are obtained from the estimated conditional distribution $F_Z\big(h_t(y_t \mid \mathcal{F}_{t-1}, x; \hat{\theta})\big)$, where $h_t$ is parameterized by $\theta$. In order to assess parameter uncertainty in the estimated density as, e.g., visualized in Figures 1 and 3, we propose to use a parametric bootstrap described in detail in Supplementary Material C. We will first investigate theoretical properties of ATMs and the validity of statistical inference statements using simulation studies. We then compare our approach against other state-of-the-art methods described in the previous section on probabilistic forecasting tasks in a benchmark study. Additional results can be found in Supplementary Material D (see Table D2 for an excerpt of the results).
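A generic sketch of such a parametric bootstrap; simulate_from() and fit_model() are hypothetical placeholders for model-specific code, and the concrete procedure is the one described in Supplementary Material C.

```r
parametric_bootstrap <- function(theta_hat, simulate_from, fit_model, B = 200) {
  replicate(B, {
    y_star <- simulate_from(theta_hat)   # simulate a new series from the fitted ATM
    fit_model(y_star)                    # re-estimate the parameters on the simulated series
  })
}
# Quantiles of the bootstrapped parameters (or of the implied densities) can then
# be used to form uncertainty bands such as the shaded area in Figure 1.
```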

Epistemic Uncertainty
In this experiment, we validate the theoretical results derived in Section 4.2. As in the previous experiment, we try to learn the log-transformed AR model using an AT(p = 3) model with coefficients (0.3, 0.2, 0.1). After estimation, we check the empirical distributions of $\hat{\theta}$ and $\hat{h}$ against their respective theoretical ones over 1000 simulation replications. Figure 4 depicts a quantile-quantile plot of the empirical and theoretical distributions for both $\hat{h}$ and all 4 parameters (intercept and three lag coefficients). The empirical distributions are well aligned with their theoretical counterparts as derived in Section 4.2, confirming our theoretical results.

Benchmarks
Finally, we compare our approach to its closest neighbor in the class of additive models, the ARIMA model (Hyndman et al, 2021), as well as to a simple Box-Cox transformation (BoxCox), a neural network for mean-variance estimation (MVN), and a mixture density network (MDN; Bishop, 1994).
While there are many further forecasting techniques, especially in deep learning, we purposely exclude more complex machine and deep learning approaches to compare AT(p)s with approaches of similar complexity. More specifically, the different competitors were chosen to derive the following insights: The comparison of the AT(p) model with the ARIMA model will indicate whether relaxing the parametric assumption using TMs can improve performance while both methods take time series lags into account. The comparison of our method with BoxCox, on the other hand, will show similar performance if there is no relevant information in the lags of the time series. The MVN can potentially learn time series-specific variances but is not given the lagged information as input. A good performance of the MVN will thus indicate heteroscedasticity in the data generating process which can, however, be accounted for using a parametric distributional regression approach. Finally, the MDN is an alternative approach to the AT(p) model that tries to overcome the parametric assumption by modeling a mixture of normal distributions.

Hyperparameter Setup
We define the AT(p) model by using an unconditional $\vartheta$ parameter and use the lag structure as well as a time series identifier as a categorical effect in the additive predictor of $\beta$. We further investigate different numbers of BSPs $M \in \{5, 10, 30\}$ and different numbers of lags $p \in \{1, 2, 3\}$. Model training for all models but the ARIMA model was done using 1000 epochs with early stopping and a batch size of 128. For the MDN, we define 3 mixture components and use the AT(p)'s $\beta$ as an additive predictor for the mean of every mixture component.
The MVN uses the time series identifier to learn individual means and variances. For ARIMA, we used the auto.arima implementation (Hyndman et al, 2021) and performed a step-wise search via the AICc with different starting values for the order of the AR and the MA term. For the AR term, we set the starting order to half the length of the corresponding forecasting horizon. The search space for the MA term started either with 0 or 3.
We chose the ARIMA model with the lowest AICc on the validation set. For the auto.arima model on the m4 data, we restricted the number of observations used for model selection to 242 in order to reduce the computational complexity. A larger number did not give higher log-scores.
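A sketch of this ARIMA search using forecast::auto.arima; the helper function, the AICc-based selection among candidates, and the concrete horizon are illustrative simplifications, while the step-wise AICc search with the described starting orders follows the text.

```r
library(forecast)

fit_arima_baseline <- function(y, horizon, start_q = c(0, 3)) {
  fits <- lapply(start_q, function(q0)
    auto.arima(y, start.p = floor(horizon / 2), max.p = floor(horizon / 2),
               start.q = q0, stepwise = TRUE, ic = "aicc"))
  fits[[which.min(sapply(fits, AICc))]]       # keep the candidate with the lowest AICc
}

# e.g., for an hourly m4 series with a 48-hour forecasting horizon:
# best_fit <- fit_arima_baseline(y_train, horizon = 48)
```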

Datasets
We compare approaches on commonly used benchmark datasets electricity (elec; Yu et al, 2016), traffic forecasting (traffic; Yu et al, 2016), monthly tourism (Athanasopoulos et al, 2011), the hourly m4 dataset (Makridakis et al, 2018) and currency exchange (Lai et al, 2018).A short summary of these datasets can be found in Table D3 in the Supplementary Material.

Evaluation
For each proposed method and dataset, we report the log-scores (Gneiting et al, 2007) and average results across time series and time points. The datasets are split into a training, validation, and test set by adhering to their time ordering. Evaluation windows are defined as done in the reference given for every dataset.
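For clarity, a minimal sketch of the log-score computation; `dens` is a placeholder for a method's forecast density.

```r
# Average log predictive density at the realized test observations
log_score <- function(y_obs, dens) mean(log(dens(y_obs)))

# e.g., for a Gaussian mean-variance forecast with estimates mu_hat, sigma_hat:
# log_score(y_test, function(y) dnorm(y, mean = mu_hat, sd = sigma_hat))
```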

Results
Table 2 shows the results of the comparison. Our approach always yields competitive and consistently good results while outperforming other models on most datasets.

Conclusion and Outlook
We proposed autoregressive transformation models (ATMs), a semi-parametric model class that combines expressive distributional forecasts with an interpretable model specification, and demonstrated their properties both theoretically and empirically on simulated and real-world forecasting datasets.

Appendix A Definitions, Propositions and Proofs

A.1 Definitions

The following definition of the error distribution follows Hothorn et al (2018).
Definition 3 Error Distributions Let $Z : \Omega \to \mathbb{R}$ be a $\mathcal{U}$-$\mathcal{B}$ measurable function from $(\Omega, \mathcal{U})$ to the Euclidean space with Borel $\sigma$-algebra $\mathcal{B}$ and absolutely continuous distribution $P_Z = f_Z \mu_L$ on the probability space $(\mathbb{R}, \mathcal{B}, P_Z)$, where $\mu_L$ denotes the Lebesgue measure. We define $F_Z$ and $F_Z^{-1}$ as the corresponding distribution function and its inverse, and assume $F_Z(-\infty) = 0$, $F_Z(\infty) = 1$, and $0 < f_Z(z) < \infty$ $\forall z \in \mathbb{R}$ with log-concave, twice-differentiable density $f_Z$ with bounded first and second derivatives.

A.2 Propositions
Proposition 1 (Interpretation of (11)) The ATM as defined in (8) and further specified in (11) can be seen as an additive regression model with outcome $\lambda_1 := h_{1t}(y_t)$, additive predictor $\tilde{\lambda}_2 := -h_{2t}(h_{1t} \odot \mathbf{Y}_t \mid \mathcal{F}_{t-1}, x)$ and error term $\varepsilon \sim F_Z$.

Proof We first define an additive regression model with outcome $\lambda_1$, predictor $\tilde{\lambda}_2$ and error term $\varepsilon \sim F_Z$, i.e.,

$\lambda_1 = \tilde{\lambda}_2 + \varepsilon,$

where we use $\tilde{\lambda}_2 = -\lambda_2$ instead of $\lambda_2$ for convenience and without loss of generality. This implies that $\lambda_1 - \tilde{\lambda}_2 = \lambda_1 + \lambda_2 = \varepsilon$, or equally $\lambda_1 + \lambda_2 \sim F_Z$. Optimizing this model is equal to fitting an ATM as defined in (8) with the structural assumption as defined in (11).
Proposition 2 (Equivalence of AR(p) and AT(p) models) An autoregressive model of order p (AR(p)) with independent white noise following a distribution $F_Z$ in the location-scale family is equivalent to an AT(p) model for $M = 1$, $\vartheta(x) \equiv \vartheta$, $r(x) \equiv 0$ and error distribution $F_Z$.
Proof The transformation function of an AT(p) model with BSPs of order $M$ defined on an interval $[\iota_l, \iota_u]$, $\vartheta(x) \equiv \vartheta$ and $r(x) \equiv 0$ is given by

$h_t(y_t \mid \mathcal{F}_{t-1}) = a(y_t)^\top \vartheta + \sum_{j=1}^{p} \phi_j\, a(y_{t-j})^\top \vartheta.$

We can further simplify the model by making $a(y_t)$ more explicit:

$a(y_t)^\top \vartheta = \frac{1}{M+1} \sum_{m=0}^{M} \vartheta_m\, f_{Be(m+1,\, M-m+1)}(\tilde{y}_t),$

with $\tilde{y}_t = (y_t - \iota_l)/(\iota_u - \iota_l)$ and Beta distribution density $f_{Be(\kappa,\mu)}$ with parameters $\kappa, \mu$. For simplicity and w.l.o.g. assume that $y_t \equiv \tilde{y}_t$. Setting $M$ to 1, $a(y_t)^\top\vartheta$ becomes a linear function of $y_t$. The transformation of the AT(p) model is thus given by

$h_t(y_t \mid \mathcal{F}_{t-1}) = \tilde{\theta}_1 \Big( y_t + \tilde{\theta}_0 + \sum_{j=1}^{p} \tilde{\phi}_j\, y_{t-j} \Big) \qquad (A1)$

with $\tilde{\theta}_0 = (\vartheta_0 (1 + \sum_j \phi_j))/\tilde{\theta}_1$ and $\tilde{\phi}_j = \phi_j \vartheta_1 / \tilde{\theta}_1$. From (8) we know that $h_t(y_t \mid \mathcal{F}_{t-1}) = \varepsilon_t$ with $\varepsilon_t \sim F_Z$, i.e.,

$y_t = -\tilde{\theta}_0 - \sum_{j=1}^{p} \tilde{\phi}_j\, y_{t-j} + \tilde{\theta}_1^{-1} \varepsilon_t. \qquad (A2)$

The AR(p) model with coefficients $\varphi_0, \ldots, \varphi_p$ is given by

$y_t = \varphi_0 + \sum_{j=1}^{p} \varphi_j\, y_{t-j} + \sigma \varepsilon_t, \quad \varepsilon_t \sim F_Z. \qquad (A3)$

The equivalence of (A2), obtained by combining (A1) with (8), and (A3) is then given when setting $\tilde{\theta}_0 = -\varphi_0$, $\tilde{\phi}_j = -\varphi_j$ $\forall j \in \{1, \ldots, p\}$ and $\sigma = \tilde{\theta}_1^{-1}$. Since both models find their parameters using maximum likelihood and it holds that $\tilde{\theta}_1 > 0$ (as required for $\sigma$) by the monotonicity restriction on the BSP coefficients, the models are identical up to a different parameterization.

A.3 Proof of Theorems
The provided Theorems 1-3 can be proven by observing that the AT(p) model structure and all assumptions made follow the general asymptotic theory for time series models as given in Ling and McAleer (2010). It is left to show that our setup and assumptions are equivalent to this general theory.
Proof Our setup described in Section 4 together with Assumption 1(i) corresponds to the setup described in Ling and McAleer (2010), Section 2. Our Assumptions 1(ii-iv) correspond to their Assumption 2.1. In contrast to their setup, we do not consider the case of infinite $\mathbf{Y}_0$, but the extension is straightforward by replacing initial values with some constant. Since AT(p)s and their non-linear extensions are fully-parameterized time series models (Equation 11) with parameter estimator $\hat{\theta}_T$ found by MLE, all necessary assumptions are met to apply Theorem 2.1 in Ling and McAleer (2010), including the subsequent remark, which yields the proof of our Theorems 1-3.

Appendix B Interpretability Example
In addition to the theoretical properties of ATMs described in Section 3.2, we give an illustrative example in this section to make the different interpretability aspects of ATMs more tangible.
Example 1 Assume that the true generating process is additive on a log-scale and influenced by the two previous time points $t-1$ and $t-2$. For example, $t$ can be thought of as days in a year and the process $Y_t$ is an interest rate. Assume that the interest rate is multiplicatively influenced by the year $x_t \in E$ and further differs in its mean depending on a cyclic effect of the month $\eta_t$. An example for a corresponding data generating process would be

$\log(y_t) = 0.5 \log(y_{t-1}) + 0.2 \log(y_{t-2}) + \sin(\eta_t) + \varepsilon_t.$

In this case, the transformation function $h_{1t}$ can be defined as $h_{1t}(y_t) = \log(y_t)\big(\sum_{e \in E} \theta_e I(x_t = e)\big)$ and approximated by $a(y_t)^\top \vartheta(x_t)$, where $a$ is the BSP evaluation of $y_t$ and $\vartheta$ a vector of coefficients depending on the year $x_t$. Further, $\phi_1 = 0.5$, $\phi_2 = 0.2$, and the exogenous shift $r = \sin(\eta_t)$, which in practice would be approximated using a basis function representation.
The interpretability properties listed in Section 3.2 can be explained as follows: 1. The additivity assumption in $\vartheta$ allows interpreting the individual effects of the year $x_t$ on the transformation function $h_1$ individually (ceteris paribus), as $\log(y_t)\big(\sum_{e \in E}\theta_e I(x_t = e)\big) = \sum_{e \in E}\log(y_t)\,\theta_e I(x_t = e)$. Here, this would allow statements about how a certain year $e$ influences the interest rate's density. 2. The use of the BSP basis for $a$ in combination with 1. allows visualizing a forecasted density analytically for every additive term in $\vartheta$. For example, to interpret year $e$, we evaluate $\vartheta(x_t = e)$ and visualize $h_{1t}(y) = a(y)^\top \vartheta(x_t = e)$ as a function of $y$ on a given domain of interest. 3. The structural assumption of ATMs, i.e., their separation into two transformation functions $h_1$ and $h_2$, allows interpreting both transformation functions $h_1$, $h_2$ individually (ceteris paribus).
In this example, the effect of the year can be interpreted using 1. and 2. while keeping the month fixed, and vice versa, the effect of the month can be interpreted by fixing the year. The applied transformation $h_1$ for AT(p) models further allows one to individually interpret the influence of the different lags (here these are the multiplicative effects $\phi_1 = 0.5$ and $\phi_2 = 0.2$).
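A small simulation of the data generating process from Example 1 as reconstructed above; the year effect is omitted for brevity, and the noise level as well as the day-of-year proxy for the cyclic month effect are assumptions.

```r
set.seed(42)
n     <- 3 * 365
eta   <- 2 * pi * ((seq_len(n) %% 365) / 365)      # cyclic within-year effect
log_y <- numeric(n)                                 # starting values log_y[1:2] = 0
for (t in 3:n)
  log_y[t] <- 0.5 * log_y[t - 1] + 0.2 * log_y[t - 2] + sin(eta[t]) + rnorm(1, sd = 0.1)
y <- exp(log_y)   # interest-rate-like series, multiplicative on the original scale
```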

Software
For ATMs we extended the software deepregression (Rügamer et al, 2022) by including an additional additive component for lags and used optimization techniques considered in Rügamer et al (2020); Baumann et al (2021).For ARIMA, we use the forecast R package (Hyndman et al, 2021).

Appendix E Run-time Complexity
In addition to forecasting performance comparisons, we also conduct a run-time benchmark to compare the run-time complexity of ATMs with other approaches. We use two different implementations for ATMs and measure their run-time. We contrast these run-times with the ARIMA model as implemented in the forecast R package (Hyndman et al, 2021) and additionally include Prophet from the prophet R package (Taylor and Letham, 2021) as another fast alternative method for Bayesian forecasting.
The timing benchmark results (averaged over 10 replications) for different numbers of observations $T$ are given in Table E4. Results suggest that, as expected, ATMs in a neural network are very slow compared to ARIMA, Prophet and also a plain ATM implementation in R. However, while the run-time of the other methods grows steeply with the number of observations, the neural network implementation of ATMs (ATM (neural)) with mini-batch training and early stopping increases only slightly in run-time for an exponential increase in the number of observations. Moreover, for $10^5$ observations, ATM (plain) and Prophet already yield longer run-times.

Figure 2
Illustration of a transformation process induced by the structural assumption of Section 3.2. The original data history $\mathcal{F}_{t-1}$ (red) is transformed into a base distribution (orange) using the transformation $h_{1t}$ (solid blue arrow) and then further transformed using $h_{2t}$ (dashed green arrow) to match the transformed distribution of the current time point $t$.

Figure 3
Aleatoric vs. epistemic uncertainty: different plots correspond to different orders $M$ of the BSP basis, inducing different amounts of expressiveness and aleatoric uncertainty. In each plot, the fitted density is shown in red, and model uncertainties of this density based on the epistemic uncertainty in black. Epistemic uncertainty is generated according to the results in Theorems 2 and 3.

Figure 4
Empirical evidence for the correctness of our theoretical results on PU: expected vs. observed quantiles of the transformation function $h_t$ (left; one line per dataset) and model parameters $\theta$ for the different (lagged) transformed outcomes (right; one cross per dataset) based on 1000 simulation replications. The ideal angle bisector is plotted in red.

Table 1
Average and standard deviation (brackets) of the MSE (multiplied by 100 for better readability) between estimated and true coefficients in an AR(p) model using our approach on the tampered data (bottom row) and the corresponding oracle based on the true data (Oracle).

Table 2
Mean log-scores (higher is better) across 10 different initializations with standard deviations in brackets for each method (columns) and benchmark dataset (rows). Results for ARIMA are based on only one trial as there is typically no stochasticity in its results. The best performing method per dataset is highlighted in bold.

Table D3
Summary of the benchmark datasets. tourism: The dataset is taken from Athanasopoulos et al (2011). Data is available on a monthly, quarterly, and yearly level. We used the 366 monthly series which measure tourism demand. The data is split into test and train; 67 months are the minimum available for training, and the forecasting horizon is defined to be 24 months. The starting date for each monthly series is different. See Section 4 of Athanasopoulos et al (2011) for details. m4: The dataset is taken from Makridakis et al (2018). It contains 414 time series which are summarized in the m4 hourly dataset. The split between training and test is already provided. Details on further background can be found on Wikipedia: https://en.wikipedia.org/wiki/Makridakis_Competitions. The starting point of each series is different. The minimum training length is 700 hours. The forecasting horizon is 48 hours.

Table E4
Comparison of run-times for different methods (in columns) and different numbers of observations #Obs. $T$ (in rows).