1 Introduction

According to Du and Li (1991), the INAR(p) model is defined by the recursion

$$\begin{aligned} X_t = \alpha _1 \circ X_{t-1}+ \alpha _2 \circ X_{t-2} + \ldots + \alpha _p \circ X_{t-p} + \varepsilon _t, \, t \in {\mathbb {Z}}, \end{aligned}$$
(1)

with innovation process \(\varepsilon _t \overset{\text {i.i.d.}}{\sim } G\), where the distribution G has range \({\mathbb {N}}_0=\{0,1,2,\ldots \}\). Furthermore, let \(\varvec{\alpha } = (\alpha _1, \ldots , \alpha _p)' \in (0,1)^p\) denote the vector of model coefficients with \(\sum _{i=1}^p \alpha _i < 1\) and

$$\begin{aligned}\alpha _i \circ X_{t-i} = \sum \limits _{j=1}^{X_{t-i}} Z_j^{(t,i)}, \end{aligned}$$

where “\(\circ\)” is the binomial thinning operator first introduced by Steutel and Van Harn (1979). Here, \(\left( Z_j^{(t,i)}, \, j \in {\mathbb {N}}, \, t \in {\mathbb {Z}} \right)\), \(i = 1, \ldots , p\), are mutually independent Bernoulli random variables \(Z_j^{(t,i)} \sim \text {Bin}(1, \alpha _i)\) with \(P(Z_j^{(t,i)}=1)=\alpha _i\), independent of \((\varepsilon _t, \, t \in {\mathbb {Z}})\). The special case \(p=1\) yields the INAR(1) model introduced by McKenzie (1985) and Al-Osh and Alzaid (1987). All thinning operations “\(\circ\)” are independent of each other and of \((\varepsilon _t, \, t \in {\mathbb {Z}})\). Furthermore, the thinning operations at time t and \(\varepsilon _t\) are independent of \(X_s, \, s < t\).
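To make the definition concrete, binomial thinning and the resulting INAR(1) recursion can be simulated in a few lines. The following Python sketch is purely illustrative; in particular, the Poisson(1) innovation sampler `poisson1` is a minimal helper of our own, not part of any method discussed here:

```python
import math
import random

def thin(x, alpha, rng):
    """Binomial thinning: alpha ∘ x is a sum of x independent Bernoulli(alpha) draws."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(n, alpha, innov, rng, burn_in=100):
    """Simulate n observations of X_t = alpha ∘ X_{t-1} + eps_t after a burn-in."""
    x, path = 0, []
    for _ in range(n + burn_in):
        x = thin(x, alpha, rng) + innov(rng)
        path.append(x)
    return path[burn_in:]  # drop the burn-in to approximate stationarity

def poisson1(rng):
    """Draw from Poi(1) by inversion (illustrative helper, not from the paper)."""
    u, k, p = rng.random(), 0, math.exp(-1.0)
    cum = p
    while u > cum:
        k += 1
        p /= k          # P(k) = e^{-1} / k!
        cum += p
    return k

rng = random.Random(42)
xs = simulate_inar1(200, alpha=0.5, innov=poisson1, rng=rng)
```

Dropping a generous burn-in mimics the common device (also used in the simulation study below) of discarding initial observations to approximate a stationary path.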

Most researchers deal with parametric estimation of INAR models (see, for example, Franke and Seligmann (1993), Freeland and McCabe (2005), Brännäs and Hellström (2001) and Jung et al. (2005)), i.e. they assume G to lie in some parametric class of distributions \({(G_\theta \mid \, \theta \in \Theta \subset {\mathbb {R}}^q )}\) for some finite \(q\in {\mathbb {N}}\). In contrast, Drost et al. (2009) introduced a semiparametric estimator, which keeps the parametric assumption of the binomial thinning operation but allows the innovation distribution to be estimated nonparametrically. Using empirical process theory, they derived consistency and asymptotic normality results and proved efficiency. Consequently, their estimation approach does not require any parametric assumption on the innovation distribution and thus avoids the risk of a misspecified parametric model and its undesirable consequences. The approach estimates the coefficients of the INAR model and the innovation distribution simultaneously. The resulting semiparametric maximum likelihood estimator

$$\begin{aligned}(\hat{\varvec{\alpha }}_{sp}, {\hat{G}}_{sp}) = ({\hat{\alpha }}_{sp,1}, \ldots ,{\hat{\alpha }}_{sp,p}, {\hat{G}}_{sp}(0), {\hat{G}}_{sp}(1), {\hat{G}}_{sp}(2), \ldots ), \end{aligned}$$

where \(\hat{\varvec{\alpha }}_{sp}=({\hat{\alpha }}_{sp,1}, \ldots ,{\hat{\alpha }}_{sp,p})\) denotes the vector of the estimated INAR coefficients and \(\{ {\hat{G}}_{sp}(k), \, k \in {\mathbb {N}}_0 \}\) are the estimated entries of the probability mass function (PMF) of G, maximizes the conditional likelihood function \({\mathcal {L}}(\varvec{\alpha },G)\) (equivalently, the conditional log-likelihood \(\log ({\mathcal {L}}(\varvec{\alpha },G))\)), i.e.

$$\begin{aligned} \forall n \in {\mathbb {Z}}_+: (\hat{\varvec{\alpha }}_{sp}, {\hat{G}}_{sp}) \in \underset{(\varvec{\alpha },G) \in [0,1]^p \times \tilde{{\mathcal {G}}}}{\text {arg max}} \left( \prod \limits _{t=0}^n P^{\varvec{\alpha },G}_{(X_{t-1}, \ldots , X_{t-p}), X_t} \right) .\end{aligned}$$
(2)

Here, \(\tilde{{\mathcal {G}}}\) is the set of all probability measures on \({\mathbb {Z}}_+\) and \(P^{\varvec{\alpha },G}_{(X_{t-1}, \ldots , X_{t-p}), X_t}\) are the transition probabilities under the true model parameters \(\varvec{\alpha }\) and G, i.e.

$$\begin{aligned} P^{\varvec{\alpha },G}_{(x_{t-1}, \ldots , x_{t-p}), x_t}&= {\mathbb {P}}_{\varvec{\alpha }, G} \left( \sum \limits _{i=1}^p \alpha _i \circ X_{t-i}+\varepsilon _t=x_t \mid X_{t-1}=x_{t-1}, \ldots , X_{t-p}=x_{t-p} \right) \\&= (\text {Bin}(x_{t-1}, \alpha _1) *\ldots *\text {Bin}(x_{t-p}, \alpha _p) *G )\{x_t\}, \end{aligned}$$

with \({\mathbb {P}}\) the underlying probability measure and “\(*\)” denoting the convolution of distributions. In the special case of an INAR(1) model the transition probabilities are given by

$$\begin{aligned}{\mathbb {P}}_{\varvec{\alpha }, G} (X_t=x_t \mid X_{t-1}=x_{t-1})= \sum \limits _{j=0}^{\text {min}(x_t, x_{t-1})} \left( {\begin{array}{c}x_{t-1}\\ j\end{array}}\right) \alpha ^j (1-\alpha )^{x_{t-1}-j} {\mathbb {P}}_{\varvec{\alpha }, G}(\varepsilon _t=x_t-j), \end{aligned}$$

where \(\alpha\) is the coefficient of the INAR(1) model (McKenzie 1985; Al-Osh and Alzaid 1987). For \({k < \text {min}\{ X_t - \sum \limits _{i=1}^p X_{t-i} \, \mid \, t=p+1, \ldots , n \}}\) or \({k>\text {max}\{ X_t\, \mid \,t = 1, \ldots , n\}}\), the estimates \({\hat{G}}_{sp}(k)\) are equal to 0. For further details, see Drost et al. (2009).
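The INAR(1) transition probability above translates directly into a finite sum. A minimal numerical sketch, with the innovation PMF passed as a hypothetical list `g` of its leading probabilities (entries beyond the list are treated as zero):

```python
from math import comb

def inar1_transition(x_t, x_prev, alpha, g):
    """P(X_t = x_t | X_{t-1} = x_prev) for an INAR(1) model.

    g[k] holds the innovation PMF G(k); entries beyond len(g) count as 0.
    """
    prob = 0.0
    for j in range(min(x_t, x_prev) + 1):
        # probability that binomial thinning of x_prev survivors yields j
        binom = comb(x_prev, j) * alpha ** j * (1.0 - alpha) ** (x_prev - j)
        eps = x_t - j  # the innovation must contribute the remainder
        if eps < len(g):
            prob += binom * g[eps]
    return prob
```

For fixed `x_prev`, summing over all reachable `x_t` returns 1 whenever `g` sums to 1, which is a convenient sanity check on any implementation.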

In practice, discrete probability distributions such as the Poisson, the negative binomial or the geometric distribution are often used as innovation distribution G, see Weiß (2018), Yang (2019), Al-Osh and Alzaid (1987) and Al-Osh and Alzaid (1990). The common feature of all these distributions is their smoothness in the sense that consecutive entries of their PMFs differ only slightly from each other. However, for a small sample size n, the semiparametric estimation approach of Drost et al. (2009) may lead to rather non-smooth estimated innovation distributions with unnatural gaps in their PMF. For illustration, we consider a time series containing counts of transactions of structured products (factor long certificates with leverage) from on-market and off-market trading per trading day between February 1, 2017 and July 31, 2018 (thus \(n=381\)). These data, which are plotted in Fig. 1, were first presented by Homburg et al. (2021), who derived them from the Cascade-Turnoverdata of the Deutsche Börse Group. In the upper right corner, we see the estimated innovation distribution using the semiparametric procedure of Drost et al. (2009), which turns out to be smooth. In the second row, we consider only the first 100 observations of the time series; here, the left plot indeed shows a bimodal estimated innovation distribution. In the third row, we consider only the first 20 observations. The lower-left plot shows the resulting estimated PMF, which contains an unnatural gap: \({\hat{G}}_{sp}(3)\) is estimated exactly equal to zero while its neighbors \({\hat{G}}_{sp}(2)\) and \({\hat{G}}_{sp}(4)\) are estimated positive. Hence, the resulting estimate is not smooth, in contrast to the estimated innovation distribution based on the whole time series. In general, such non-smooth innovation distributions are not common in practice; instead, smoothly estimated innovation distributions are often desired.
In this paper, we exploit this prior knowledge and take advantage of a natural qualitative smoothness assumption on the innovation distribution by proposing a version of the semiparametric estimation approach that penalizes the roughness of the innovation distribution. The resulting estimated PMFs of this approach are shown in the right plots of the second and third row, respectively. In comparison, the penalized estimation now leads to a smoother estimate of the PMF without any gaps. We have a closer look at additional real data examples in Sect. 4. For long time series, the smoothing caused by penalization is less important, because the distribution estimated without penalization will already be sufficiently smooth by itself. For short time series, however, estimation without smoothing will commonly lead to jagged estimated innovation distributions even though the true distribution behind the data may be smooth. Hence, the need for smoothing is of particular importance for short time series.

Fig. 1

From left to right and top to bottom: Plot of time series of counts of transactions of structured products per trading day, the unpenalized estimation of the corresponding innovation distribution based on the full data and the (un)penalized estimated innovation distribution for the first 100 and 20 observations, respectively

The paper is organized as follows. In Sect. 2, we introduce a penalized estimation approach using roughness penalization and propose two algorithms for the data-driven selection of the penalization parameter. Section 3 examines our estimation approach in a comprehensive simulation study, where we compare the estimation performance of the penalized and the unpenalized approach for different settings. In a real data application in Sect. 4, we analyze the monthly demand of car spare parts to illustrate our method and its practical relevance. In the conclusion in Sect. 5, we summarize the results and give an outlook on further research questions.

2 Penalized approach of fitting INAR models

Penalized estimation for count data models is a topic of current statistical research. Bui et al. (2021) consider parameter estimation in count data models using penalized likelihood methods. In a time series context, Nardi and Rinaldo (2011) studied LASSO penalization for fitting autoregressive time series models to obtain sparse solutions, i.e. solutions where some autoregressive coefficients are estimated as exactly zero. Fokianos (2010) proposed an alternative estimation scheme for INAR models based on minimizing a least-squares criterion under ridge-type constraints. Wang (2020) proposed a variable selection procedure for INAR(1) models with Poisson-distributed innovations including covariates by using penalized estimation, and Wang et al. (2021) introduced an order selection procedure for INAR(p) and INARCH(p) models, also based on penalized estimation. By contrast, in this paper, we propose a penalized estimation approach for INAR models that does not rely on a penalization of the INAR coefficients (towards zero), but on a penalization of the roughness of the innovation distribution (towards smoothness).

2.1 Penalized estimation approach using roughness penalty

The idea of our approach is to penalize the log-likelihood used in the semiparametric estimation of the INAR model according to Drost et al. (2009). Thus, we still do not assume a parametric class of distributions; we only use the assumed qualitative (i.e. nonparametric) property of smoothness. More precisely, this refers to a roughness penalization as introduced by Scott et al. (1980), which is used, e.g., by Adam et al. (2019) to develop a nonparametric approach for fitting hidden Markov models to time series of counts. We design the penalty term based on the idea of Tibshirani et al. (2005), where differences of successive parameters are penalized. In this regard, we allow for differences of order \(m \in {\mathbb {N}}\). Applied to our setting, the estimation approach based on Drost et al. (2009) now maximizes the penalized log-likelihood (compare (2))

$$\begin{aligned} \log ({\mathcal {L}}_{\text {pen}}(\alpha ,G)) = \log ({\mathcal {L}}(\alpha ,G)) - \eta \cdot d_{G,m}, \end{aligned}$$

where \(\eta > 0\) is the so-called smoothing or penalization parameter, \(d_{G,m}\) denotes a suitable measure to quantify the roughness of G and m corresponds to the order of difference. According to Tibshirani et al. (2005), a first possible roughness measure for the penalization term is based on the \(L_1\) distance (LASSO penalization), i.e.

$$\begin{aligned} d_{G,m,1} = \sum \limits _{i=m}^{\text {max}(x_1, \ldots , x_n)} \mid \Delta ^m G(i)\mid , \end{aligned}$$
(3)

where \(\Delta ^m G(i) = \Delta ^{m-1}(\Delta G(i))\) and \(\Delta G(i) = G(i)- G(i-1)\). In addition, we consider the squared \(L_2\) distance (Ridge penalization) as second roughness measure, i.e.

$$\begin{aligned} d_{G,m,2} = \sum \limits _{i=m}^{\text {max}(x_1, \ldots , x_n)} (\Delta ^m G(i))^2. \end{aligned}$$
(4)

The idea behind choosing this second roughness measure is that it does not shrink the differences of successive entries of the PMF exactly to 0 (contrary to the first roughness measure); instead, the differences merely become close to 0, which is more in line with the idea of a smooth distribution (note the analogy to penalized regression, where the \(L_1\) penalization is used for variable selection precisely because of this property, see Fahrmeir et al. (2013)). The order of the differences m is a tuning parameter. For \(m=1\), we penalize only the distance between two directly consecutive entries; for \(m=2\), the smoothness requirement is extended to triples of values, etc.
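The two roughness measures in (3) and (4) are simple functions of the m-th order differences of the PMF. A small sketch, with the PMF represented as a list `g` of its leading entries (an assumption made purely for illustration):

```python
def mth_diff(g, m):
    """m-th order differences Δ^m G(i), i = m, ..., len(g) - 1, of a PMF list g."""
    d = list(g)
    for _ in range(m):
        d = [d[i] - d[i - 1] for i in range(1, len(d))]
    return d

def roughness(g, m, norm):
    """d_{G,m,1} (norm=1, LASSO-type) or d_{G,m,2} (norm=2, ridge-type)."""
    diffs = mth_diff(g, m)
    if norm == 1:
        return sum(abs(d) for d in diffs)
    return sum(d * d for d in diffs)
```

A perfectly flat PMF has roughness 0 under both measures, while a PMF with a gap (a zero entry between positive neighbors) incurs a large penalty, which is exactly the behavior the penalization exploits.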

Remark 1

A possible extension would be to allow for different penalization weights \((\eta _i)\) for the individual (higher-order) differences of the entries of the PMF. For instance, in the case of \(L_1\) penalization, the goal could be to maximize

$$\begin{aligned} \log ({\mathcal {L}}(\alpha ,G)) - \sum \limits _{i=m}^{\text {max}(x_1, \ldots , x_n)} \eta _i \, \mid \Delta ^m G(i)\mid ,\end{aligned}$$

analogously for the case of \(L_2\) penalization.

Figure 2 shows a first exemplary result on a sample of an INAR(1) process with \(n=25\) observations, order of difference \(m=1\) and smoothing parameter \(\eta =1\) roughly chosen by eye. In this example, the benefit of penalization already becomes clear: the penalized estimated innovation distributions are much closer to the true Poi(1) innovation distribution (which was truncated at value six for clarity) than the unpenalized estimated innovation distribution. Also, the difference between the \(L_1\) and the \(L_2\) penalization becomes visible. When using the \(L_2\) penalization, the distances between the values of the PMF become small; when using the \(L_1\) penalization, they are shrunk to zero.

Fig. 2

Barplots of the (estimated) innovation distributions for one realization in the four cases (no penalization, \(L_1\) penalization, \(L_2\) penalization, true distribution)

2.2 Selection of the penalization parameter

Now, we propose two approaches to determine, for a fixed roughness measure, the optimal smoothing/penalization parameter \(\eta\), which trades off the fit to the data against the smoothness assumption. For this purpose, as a first approach, we adapt the cross-validation procedure described in Adam et al. (2019) to our setting. To this end, we split the data set into s blocks \(F_i, \, i=1, \ldots , s\), of roughly equal size. In each fold i, \(F_{(-i)}\) denotes the in-sample data (the data without \(F_i\)) and \(F_i\) the out-of-sample data. This replicates the correct dependence structure except at the “glue points”, which has only a minor effect in practice when the data originate from an INAR model of small order. The greedy search algorithm is structured as follows:

Algorithm 1

 

(1) Choose an initial \(\eta ^{(0)}>0\) and set \(z=0\).

(2) For each fold i and for each value on a specified grid

    $$\begin{aligned} \{ \ldots , \eta ^{(z)} -2c, \eta ^{(z)} -c, \eta ^{(z)}, \eta ^{(z)} +c, \eta ^{(z)} +2c, \ldots \} \end{aligned}$$

    where \(c \in {\mathbb {R}}\) is a small constant, estimate the model with penalization on \(F_{(-i)}\) and compute the penalized log-likelihood on \(F_i\).

(3) Average the resulting log-likelihood values across all folds i and choose \(\eta ^{(z+1)}\) as the penalization parameter on the grid that yields the maximum value.

(4) Repeat steps (2) and (3) until \(\eta ^{(z+1)}=\eta ^{(z)}\) and define \(\eta ^{\text {opt}} := \eta ^{(z+1)}\).
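The logic of Algorithm 1 can be sketched as a generic greedy grid search. The cross-validated criterion is abstracted here into a user-supplied function `cv_score` (in the paper, the average penalized out-of-sample log-likelihood across folds); the toy objective below is only meant to exercise the iteration, not to reproduce the actual likelihood:

```python
def greedy_search(cv_score, eta0, c, max_iter=100):
    """Re-center the grid {eta - 2c, ..., eta + 2c} on its best point until
    the maximizer stops moving (steps (1)-(4) of the algorithm above)."""
    eta = eta0
    for _ in range(max_iter):
        grid = [max(eta + k * c, 0.0) for k in range(-2, 3)]  # keep eta >= 0
        best = max(grid, key=cv_score)
        if best == eta:        # current center already maximal: converged
            return eta
        eta = best
    return eta

# Toy criterion with maximum at eta = 0.7, purely to exercise the iteration
opt = greedy_search(lambda e: -(e - 0.7) ** 2, eta0=1.0, c=0.05)
```

Since the grid always contains the current center \(\eta ^{(z)}\), the averaged criterion is non-decreasing across iterations and the loop terminates once the center is the grid maximizer.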

Furthermore, to avoid a potentially non-optimal selection of the penalization parameter \(\eta\) caused by an inappropriate choice of the initial value \(\eta ^{(0)}\), we propose a second optimization algorithm. How we split the data in each fold \(j, \, j =1, \ldots , {\tilde{s}}\), into in-sample and out-of-sample data is specified later in Sect. 3.

Algorithm 2

 

(1) For each fold j and each value \(\eta\) on a specified grid \(\{0, {\tilde{c}}, 2{\tilde{c}}, 3{\tilde{c}}, \ldots , u \}\) on the interval [0, u] for an appropriate upper bound u, estimate the model with penalization on the in-sample data and compute the penalized log-likelihood on the out-of-sample data.

(2) Average the resulting log-likelihood values across all folds j.

(3) Fit a polynomial of order r to the curve resulting from plotting the average out-of-sample log-likelihood against the grid.

(4) Choose \(\eta ^{\text {opt}}\) as the value on the grid where the fitted polynomial takes its maximum value.
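Steps (3) and (4) of Algorithm 2 amount to a polynomial least-squares fit followed by a grid maximization. The sketch below is dependency-free and hypothetical in its details: the fit is solved via the normal equations with Gaussian elimination purely to keep the example self-contained; in practice one would call a standard polynomial-fitting routine:

```python
def polyfit(xs, ys, r):
    """Least-squares coefficients of a degree-r polynomial (normal equations)."""
    n = r + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))) / A[i][i]
    return coef  # coef[i] multiplies eta**i

def select_eta(grid, avg_loglik, r):
    """Fit a degree-r polynomial to the averaged scores, return the grid maximizer."""
    coef = polyfit(grid, avg_loglik, r)
    return max(grid, key=lambda x: sum(c * x ** i for i, c in enumerate(coef)))
```

Fitting a smooth polynomial before maximizing damps the Monte Carlo noise of the fold-averaged log-likelihoods, so the selected \(\eta\) is less sensitive to a single noisy grid point.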

3 Simulation study

We investigate the performance of the proposed procedure in a simulation study with \(K=500\) Monte Carlo samples of size \(n \in \{ 20,50,100,250,500,1000 \}\) generated from an INAR(1) process according to (1) for \(p=1\) with different coefficients \(\alpha \in \{0.2, 0.5, 0.8 \}\) and innovation distributions \(G \in \{ \text {Poi}(1), \, \text {NB}\left( 2, \frac{2}{3}\right) , \, \text {Geo}\left( \frac{1}{2}\right) , \text {ZIP}\left( \frac{1}{2}, 2\right) \}\), where ZIP denotes a zero-inflated Poisson distribution as in Jazi et al. (2012). The parameters of the negative binomial, geometric and zero-inflated Poisson distribution are chosen to have the same expected value as the \(\text {Poi}(1)\) distribution. However, contrary to the \(\text {Poi}(1)\) distribution, which is equidispersed (i.e. its variance equals its mean), they are overdispersed, i.e. their variances are larger than their means. Another difference between the considered innovation distributions is their (non-)smoothness, see also Fig. 12 in the appendix. The Poi(1), \(\text {NB}\left( 2, \frac{2}{3}\right)\) and \(\text {Geo}\left( \frac{1}{2}\right)\) distributions are rather smooth, but the \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) distribution, which has a pronounced zero probability, is not. The effect of this property on the roughness penalization is investigated in Subsect. 3.5. Moreover, in Subsect. 3.2, we also provide a small simulation setting for higher-order INAR processes and consider the case of an INAR(2) model. The implementation is straightforward but computationally much more demanding, so we restrict ourselves to a rather small setting there. To ensure the stationarity of the time series, we actually generate \(n+100\) observations and remove the first 100 observations. We consider first (\(m=1\)) and second (\(m=2\)) order differences in the penalization term (see Subsect. 3.4).
As initialization for the smoothing parameter, we set \(\eta ^{(0)}=1\) as in the example in Fig. 2 for the sample sizes \(n \in \{ 20,50,100,250 \}\) and, for reasons of computing time, \(\eta ^{(0)}=0.5\) for \(n \in \{ 500,1000\}\). For the grid around the smoothing parameter (see Algorithm 1), we choose \(c=0.05\), resulting in \(\{\eta ^{(z)}-0.1, \, \eta ^{(z)}-0.05, \, \eta ^{(z)}, \, \eta ^{(z)}+0.05, \, \eta ^{(z)}+0.1 \}\). Unless stated otherwise, we use \(\alpha =0.5\) as the true INAR(1) coefficient and Algorithm 1 with 10-fold cross-validation (\(s=10\)) as the optimization algorithm. The simulation study is implemented in the statistical programming language R 4.1.2 (R Core Team 2021).

3.1 Roughness penalty for smooth innovations distributions and first order differences

Figure 3 shows the \(L_2\) distances of the estimated innovation distributions to the true \(\text {Poi}(1)\) innovation distribution,

$$\begin{aligned} d({\hat{G}}, G)=\sum \limits _{i=0}^M ({\hat{G}}(i)-G(i))^2, \end{aligned}$$

for the different sample sizes and the respective estimation methods (unpenalized (up), \(L_1\) penalization and \(L_2\) penalization) for some sufficiently large M. We use \(M=70\) as an upper bound for the observations \(x_1, \ldots , x_n\), since beyond this value the corresponding probabilities of occurrence are negligibly small. When the sample size n is small, the penalized estimation of the innovation distribution provides a large benefit compared to the unpenalized estimation: the \(L_2\) distances of the penalized estimates to the true innovation distribution are much smaller than those of the unpenalized estimates. Furthermore, the \(L_2\) penalization performs better than the \(L_1\) penalization. In Table 3 in the appendix, we also report the variance, the bias and the MSE of the first five estimated entries of the PMF resulting from the different procedures for the different sample sizes n. We see that the penalized estimation reduces both the variance and the absolute bias, and consequently also the MSE, of the estimated innovation distribution, especially for small n. Figures 17 and 18 and Tables 6 and 7 in the appendix show the analogous results for a true \(\text {NB}\left( 2, \frac{2}{3}\right)\) and \(\text {Geo}\left( \frac{1}{2}\right)\) distribution, respectively. In general, regardless of the distribution and up to a sample size of \(n=100\), we see a clear improvement in estimation performance when using penalization. From a sample size of \(n=250\) on, this improvement is only marginal, with the different methods essentially coinciding for large n. In Fig. 13 and Table 4 in the appendix, we show the results for INAR coefficient \(\alpha =0.2\) and Poi(1) innovation distribution and, correspondingly, in Fig. 15 and Table 5 for \(\alpha =0.8\). In the latter case, the benefit of the penalized estimation compared to the unpenalized estimation is even larger than in the case \(\alpha =0.5\).
This is plausible because it is generally more difficult to estimate the innovation distribution for a larger value of \(\alpha\), as this leads to a larger observation mean while the innovation mean remains constant. Therefore, more entries of the PMF have to be estimated with the same amount of data. Conversely, for \(\alpha =0.2\), there are (by analogous arguments) fewer entries of the PMF to be estimated with the same amount of data, which simplifies the estimation of the PMF in general, so the benefit of penalization decreases. Altogether, we can conclude that the benefit of penalization is more pronounced for larger \(\alpha\), that is, for stronger serial dependence.

We obtain confirming conclusions when we consider the values of the optimal smoothing parameter \(\eta\), which approaches zero with increasing n; see Fig. 4 for the case of a true Poi(1) innovation distribution, Figs. 19 and 20 in the appendix for the cases of a true \(\text {NB}\left( 2, \frac{2}{3}\right)\) and \(\text {Geo}\left( \frac{1}{2}\right)\) innovation distribution, and Figs. 14 and 16 in the appendix for the case of a true Poi(1) innovation distribution with \(\alpha =0.2\) and \(\alpha =0.8\), respectively. Thus, for increasing n, the penalized and the unpenalized estimation coincide, as intuitively expected: for large n, there are enough observations to learn the smoothness of the innovation distribution from the data even without imposing smoothness through penalization.

Fig. 3

Boxplots of the \(L_2\) distances of the estimated innovation distribution to the true Poi(1) innovation distribution of an INAR(1) process for different sample sizes n. We report results for unpenalized (up), \(L_1\) and \(L_2\) penalized estimation

Fig. 4

Boxplots of the penalization parameter \(\eta\) selected by \(L_1\) penalization (upper panel) and \(L_2\) penalization (lower panel) for the different sample sizes n in the case of a true Poi(1) innovation distribution of an INAR(1) process

3.2 Higher-order INAR processes

To show that our proposed procedure is also applicable to higher-order INAR processes, we consider the case of a true INAR(2) process according to (1) for \(p=2\) with coefficients \(\alpha _1=0.3\), \(\alpha _2=0.2\) and \(G=\text {Poi}(1)\). Due to the high computing time of the semiparametric estimation, we only consider a small simulation setup with \(n=50\) observations and \(K=100\) Monte Carlo samples. We consider \(L_1\) and \(L_2\) penalization with first order differences and compare the performance with estimation without penalization. In Fig. 21 in the appendix, we see that also for higher-order INAR models, penalized estimation of the innovation distribution provides a clear benefit compared to unpenalized estimation. With penalization, we get closer to the true innovation distribution than without, and we are able to reduce the variance and the absolute bias, and consequently the MSE, of our estimation, see Table 1. Again, \(L_2\) penalization works best.

Table 1 Variance, bias and MSE of the first five estimated entries of the PMF for \(n=50\) in case of a true Poi(1) innovation distribution of an INAR(2) process. We report results for unpenalized (up), \(L_1\) and \(L_2\) penalized estimation

3.3 Alternative selection of the penalization parameter

To investigate whether the results depend on the chosen initial parameter, we now determine the optimal penalization parameter alternatively using Algorithm 2 with \(u=5,\, {\tilde{c}}=0.1\) and \(r=5\). In this context, we want to address a potential practical issue of Algorithm 1: the generation of the in-sample and out-of-sample data. For each of the 10 folds, 90% of the data becomes the in-sample data and the remaining 10% the out-of-sample data. For small n, 10% of the data comprises only very few observations. To avoid this, we now use an n-fold cross-validation (\({\tilde{s}}=n\)) for sample sizes \(n \in \{20,50\}\) with Algorithm 2, where, starting from each observation, the following 50% of the data serves as in-sample and the other 50% as out-of-sample data. When reaching the end of the time series, we wrap around to its beginning.

In Fig. 5, we see the results of this alternative procedure compared to the previous (iterative) procedure of Algorithm 1. It gives slightly better results than the iterative method, but overall the distances are very similar. The same can be concluded from Table 8: the alternative procedure leads to slightly lower MSE values, but altogether the values resemble each other. The 10-fold cross-validation thus also seems suitable, and the resulting optimal parameters of the two procedures are close to each other (see Fig. 6). In conclusion, determining the optimal parameter over a fixed grid as in Algorithm 2 tends to give slightly better results. However, the price to pay is a much higher computing time than for the iterative procedure. The iterative method requires a reasonably chosen starting value, but then delivers similarly good results in considerably less computing time. In addition, when using the alternative method, the question arises how to choose the upper limit u of the interval adequately. In the following, we continue to use the iterative method of Algorithm 1, but one should keep in mind that Algorithm 2 is also a practically useful procedure.

Fig. 5

Boxplots of the \(L_2\) distances of the estimated innovation distribution to the true Poi(1) innovation distribution of an INAR(1) process for the different sample sizes n. We report results for unpenalized (up) and \(L_2\) penalized estimation using either the iterated Algorithm 1 (A1) or the alternative Algorithm 2 (A2)

Fig. 6

Boxplots of the penalization parameter \(\eta\) selected by \(L_2\) penalization using Algorithm 1 (A1, upper panel) and Algorithm 2 (A2, lower panel) for the different sample sizes n in the case of a true Poi(1) innovation distribution of an INAR(1) process

3.4 Higher-order differences in penalization term

So far, we have only considered first order differences (\(m=1\)). Now we want to see if penalizing higher-order differences (e.g. \(m=2\)) can improve the performance of our penalized estimation method. Figure 7 and Table 10 in the appendix show, for the case of a true Poi(1) innovation distribution and \(L_2\) penalization, that the penalization of higher-order differences also performs better than the unpenalized estimation for small sample sizes, and that it comes close to the penalization of first order differences. Similar results are shown in Fig. 22 and Table 9, both in the appendix, where we see the results of first and second order differences for the \(L_1\) penalization. In the case of \(L_1\) penalization, we would prefer second order differences for small sample sizes. Overall, however, the \(L_2\) penalization of first-order differences performs best.

Fig. 7

Boxplots of the \(L_2\) distances of the estimated innovation distribution to the true Poi(1) innovation distribution of an INAR(1) process for the different sample sizes n. We report results for unpenalized (up) and \(L_2\) penalized estimation using either first order (diff1) or second order (diff2) differences

3.5 Non-smooth innovation distribution

Finally, we consider the case of \(\text {ZIP}\left( \frac{1}{2}, 2\right)\)-distributed innovations, see Fig. 8. The results are as expected: since the ZIP distribution is not smooth (see Fig. 12 in the appendix), the smoothness assumption, and hence the penalization, is not suitable. The boxplots reflect this: except for sample size \(n=20\), the penalized estimation procedure provides no benefit and for some n even leads to slightly higher \(L_2\) distances from the true \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) distribution than the unpenalized procedure. As we can see in Table 11, the penalized estimation leads to a higher absolute bias when estimating the first (non-smooth) entry, G(0), of the PMF. As the sample size n increases, the penalization has less impact, as there is enough data to detect the incorrect assumption, such that the unpenalized and the penalized procedures coincide.

For comparison, we now look at the results for a true \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) distribution when we exclude G(0) from the penalization, displayed in Fig. 9, i.e. when we consider

$$\begin{aligned} {\tilde{d}}_{G,m,1} = \sum \limits _{i=m+1}^{\text {max}(x_1, \ldots , x_n)} \mid \Delta ^m G(i) \mid \quad \text {and} \quad {\tilde{d}}_{G,m,2} = \sum \limits _{i=m+1}^{\text {max}(x_1, \ldots , x_n)} (\Delta ^m G(i))^2, \end{aligned}$$

instead of \(d_{G,m,1}\) and \(d_{G,m,2}\) defined in (3) and (4). The results confirm our expectation: by excluding the “non-smooth entry” G(0) of the PMF of the innovation distribution from penalization, the penalized estimation works well again and provides a benefit for small n. In this case, the penalized estimation now results in a lower absolute bias of the estimated PMF’s first entry compared to the unpenalized estimation (compare Table 12). However, this benefit is not as pronounced as in the cases of a true Poi(1), \(\text {NB}\left( 2, \frac{2}{3}\right)\) and \(\text {Geo}\left( \frac{1}{2}\right)\) innovation distribution. This can probably be explained by the fact that the \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) distribution has most of its mass at zero and the corresponding entry of the PMF, G(0), remains unaffected by the penalization. Consequently, the results of penalized and unpenalized estimation do not differ substantially from each other.
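The modified measures \({\tilde{d}}_{G,m,1}\) and \({\tilde{d}}_{G,m,2}\) simply drop the one m-th order difference that involves G(0); a minimal self-contained sketch, with the PMF again passed as an illustrative list `g`:

```python
def roughness_excl_zero(g, m, norm):
    """~d_{G,m,1/2}: m-th order differences of the PMF list g, summed from
    i = m + 1, i.e. omitting the single difference that involves G(0)."""
    d = list(g)
    for _ in range(m):
        d = [d[i] - d[i - 1] for i in range(1, len(d))]
    d = d[1:]  # drop Δ^m G(m), the only term touching G(0)
    return sum(abs(x) for x in d) if norm == 1 else sum(x * x for x in d)
```

A zero-inflated PMF with a spike at 0 but a flat tail then incurs no penalty, which is precisely the behavior desired in the ZIP setting.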

Fig. 8

Boxplots of the \(L_2\) distances of the estimated innovation distribution to the true \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) innovation distribution of an INAR(1) process for the different sample sizes n. We report results for unpenalized (up), \(L_1\) and \(L_2\) penalized estimation

Fig. 9

Boxplots of the \(L_2\) distances of the estimated innovation distribution to the true \(\text {ZIP}\left( \frac{1}{2}, 2\right)\) innovation distribution of an INAR(1) process for the different sample sizes n. We report results for unpenalized (up), \(L_1\) and \(L_2\) penalized estimation without smoothing of G(0) (nz)

In summary, if the smoothness assumption of the innovation distribution is correctly imposed, it provides a large benefit for small sample size n. This holds whether the true underlying distribution is equidispersed or overdispersed. The best results are obtained for \(L_2\) penalization and first-order differences.

3.6 Estimation of the INAR coefficient

A drawback of the penalized estimation is that the estimation of the INAR coefficient \(\alpha\) no longer works well for small sample sizes n, see Fig. 23 in the appendix. A strength of the semiparametric estimation approach of Drost et al. (2009) is the accurate joint estimation of the INAR coefficient and the innovation distribution, but this accuracy is lost when penalization is used for small n: the \(L_2\) distances of the penalized estimates of the INAR coefficient \(\alpha\) to the true value are higher than those of the unpenalized estimates. For increasing n, the estimation of \(\alpha\) improves, but since the benefit of the penalized estimation lies precisely in small n, this is of little consolation.

Instead, we can solve this problem by taking only the estimator of the innovation distribution from the penalized approach and estimating the INAR coefficient with the unpenalized (efficient) estimation approach of Drost et al. (2009). Indeed, Fig. 23 shows that it is preferable to combine the unpenalized estimation of the INAR coefficient \(\alpha\) with the penalized estimation of the innovation distribution G. The MSE results confirm that this combination outperforms all other estimation approaches under consideration.
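A minimal sketch of this combined estimator for the INAR(1) case could look as follows. The conditional-likelihood setup, the logit/softmax parameterizations and all helper names are our own assumptions for illustration, not the authors' implementation; the penalty is the \(L_2\) one on first-order differences with G(0) left unpenalized:

```python
import numpy as np
from math import comb
from scipy.optimize import minimize

def trans_pmf(y, alpha, g):
    # P(X_t = . | X_{t-1} = y): Bin(y, alpha)-thinning convolved with G
    thin = np.array([comb(y, k) * alpha**k * (1 - alpha)**(y - k)
                     for k in range(y + 1)])
    return np.convolve(thin, g)

def neg_pen_loglik(theta, x, lam):
    alpha = 1.0 / (1.0 + np.exp(-theta[0]))   # logit-parameterized alpha
    g = np.exp(theta[1:]); g = g / g.sum()    # softmax-parameterized G
    ll = 0.0
    for prev, cur in zip(x[:-1], x[1:]):
        p = trans_pmf(prev, alpha, g)
        ll += np.log(p[cur]) if cur < len(p) else -700.0
    pen = lam * np.sum(np.diff(g)[1:] ** 2)   # L2, first order, G(0) free
    return -ll + pen

def fit(x, support, lam):
    res = minimize(neg_pen_loglik, np.zeros(support + 2),
                   args=(np.asarray(x), lam), method="Nelder-Mead",
                   options={"maxiter": 5000})
    alpha = 1.0 / (1.0 + np.exp(-res.x[0]))
    g = np.exp(res.x[1:]); g = g / g.sum()
    return alpha, g

def fit_combined(x, support, lam):
    alpha_up, _ = fit(x, support, lam=0.0)    # unpenalized fit: keep alpha
    _, g_pen = fit(x, support, lam=lam)       # penalized fit: keep G
    return alpha_up, g_pen
```

The combination simply requires two optimizations: one unpenalized run, from which only \(\hat{\alpha}\) is retained, and one penalized run, from which only \(\hat{G}\) is retained.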

4 Real data example

For modeling intermittent demand, Syntetos and Boylan (2021) consider the equidispersed Poisson distribution on the one hand and, as the demand variability may be severe when demand is intermittent, overdispersed distributions from the Compound-Poisson family (such as the negative binomial distribution) on the other hand. All these parametric distributions are smooth. With our novel penalized semiparametric estimation approach, we obtain smooth distributions without parametric assumptions, and as our simulations showed, the penalization procedure works well for both equi- and overdispersed distributions. By contrast, Syntetos and Boylan (2021) criticize that an unpenalized nonparametric estimation approach such as the empirical distribution function (EDF) automatically assigns zero probability to demand values not observed in the past. Furthermore, they state that an EDF provides a perfect fit to the historical data, but does not ensure goodness of fit to the demand over the forecast horizon, especially with respect to higher percentiles. These drawbacks are avoided with our penalized estimation approach. Finally, historical demand time series are often rather short (see the demand count time series provided by Snyder (2002) as an example), such that smoothing approaches are particularly welcome. For these reasons, the forecasting of intermittent demand appears to be a promising application area for our proposed penalized semiparametric estimation procedure.

Therefore, we consider time series (\(n=51\)) of the monthly demand for different car spare parts offered by an Australian subsidiary of a Japanese car company from January 1998 to March 2002 (Snyder 2002). Figure 10 contains an exemplary time series of car part 2404. The observations vary between 0 and 5, and the up and down movements indicate a moderate autocorrelation level. After inspecting the corresponding (P)ACF, also included in Fig. 10, we conclude that an AR(1)-like model might be appropriate for describing the serial dependence of the time series. Moreover, \(L_2\) penalization with first-order differences leads to an estimated innovation distribution without any unnatural gaps, i.e. zero values, in the PMF.

Now consider the 1-step median prediction and the 90% quantile of the 1-step prediction of the demand for car spare part 2404. The latter serves here as a worst-case scenario for spare parts requirements. Therefore, we determine the median and the 90% quantile of the predictive distribution \(P(X_{t+1} = \cdot \mid X_t = y)\), where \(y \in \{0, \ldots , 10\}\). Based on the results of the simulation study in Subsect. 3.6, we use the penalized estimated innovation distribution and the unpenalized estimated INAR coefficient to determine the conditional predictive distribution. Table 2 shows that the penalized estimation tends to lead to higher predicted values (more conservative predictions). Consequently, without penalizing the innovation distribution, the predicted demand for spare parts may be too low, which can lead to a shortage of spare parts. Moreover, the penalization of the innovation distribution (especially for such short time series) can serve as a robustness analysis to identify possible uncertainties in the forecast at an early stage.
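These predictive quantities follow directly from the fitted model: the one-step conditional PMF of an INAR(1) process is the \(\text{Bin}(y, \alpha)\)-thinning distribution convolved with the innovation PMF. The sketch below (with illustrative values for \(\alpha\) and G, not the fitted values from the paper) computes the median and the 90% quantile of this distribution:

```python
from math import comb
from itertools import accumulate

def inar1_predictive_pmf(y, alpha, g):
    """P(X_{t+1} = . | X_t = y) for an INAR(1) process:
    survivors ~ Bin(y, alpha), convolved with the innovation PMF g."""
    thin = [comb(y, k) * alpha**k * (1 - alpha)**(y - k) for k in range(y + 1)]
    pmf = [0.0] * (y + len(g))
    for k, pk in enumerate(thin):          # convolution of thinning and innovations
        for j, gj in enumerate(g):
            pmf[k + j] += pk * gj
    return pmf

def pmf_quantile(pmf, q):
    """Smallest x with P(X <= x) >= q."""
    cdf = accumulate(pmf)
    return next(x for x, c in enumerate(cdf) if c >= q)
```

For example, with \(\alpha = 0.5\), an illustrative innovation PMF (0.5, 0.3, 0.2) and last observation y = 2, the median prediction is 2 and the 90% quantile is 3, showing how a heavier-tailed (e.g. penalized) innovation estimate directly raises the predicted quantiles.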

In addition, we consider car spare part 1971. Figure 11 again suggests an AR(1)-like model and a moderate autocorrelation level. The observations vary between 0 and 4 and there may be zero inflation in this time series. Therefore, in addition to the unpenalized and penalized estimates, we also consider the penalized estimate of the innovation distribution, where G(0) is not smoothed (see Subsect. 3.5). It becomes clear that this last estimation procedure yields more plausible results than when G(0) is smoothed. Again, the penalized estimation procedure yields a slightly smoother innovation distribution than the unpenalized estimation. In summary, if there is a reasonable suspicion of zero inflation, G(0) should not be smoothed.

Fig. 10

From left to right and top to bottom: Plot of time series of monthly demand for car spare part 2404, its corresponding ACF and PACF and the unpenalized and the penalized estimated innovation distribution

Fig. 11

Plot of time series of monthly demand for car spare part 1971, its corresponding ACF and PACF, the unpenalized and the penalized estimated innovation distribution and the penalized estimated innovation distribution excluding the first entry of the PMF (from left to right and from top to bottom)

Table 2 Unpenalized and penalized 1-step median prediction and 90% quantile of the 1-step prediction of the demand for car spare part 2404 when observing demand y

5 Conclusion

Although semiparametric estimation yields a decent fit in INAR models, its performance is often not convincing for small sample sizes. Therefore, we proposed a penalization approach that exploits a qualitative smoothness assumption fulfilled by commonly used innovation distributions. A simulation study showed that our penalization approach provides a large benefit in estimating the innovation distribution, especially for small sample sizes. Additionally, we showed that the combination of unpenalized estimation of the INAR coefficients and penalized estimation of the innovation distribution provides the best performance. Future research should investigate whether an additional penalization of the INAR coefficients may yield further benefits. Furthermore, as the penalization approach proved beneficial for forecasting, one may also think of applications in statistical process control, e.g. for the design of control charts relying on a fitted INAR(1) model. Another interesting issue for future research is the application of our proposed method to integer-valued autoregressive models on \({\mathbb {Z}}\), such as those proposed by Kim and Park (2008) or Liu et al. (2021).