A multiple inflated negative binomial hurdle regression model: analysis of the Italians' tourism behaviour during the Great Recession

We analyse tourism behaviour of Italian residents in the period covering the 2008 Great Recession. Using the Trips of Italian Residents in Italy and Abroad quarterly survey, carried out by the Italian National Institute of Statistics, we investigate whether and how the economic recession has affected the total number of overnight stays. The response variable is the result of a two-stage decision process: first we choose to take a holiday, then for how long. Moreover, since the number of overnight stays is typically concentrated on specific lengths (week-end, week, fortnight) we observe multiple peculiar spikes in its distribution. To take into account these two distinctive characteristics, we generalise the usual hurdle regression model by specifying a multiple inflated truncated negative binomial distribution for the positive responses. Results show that the economic recession impacted negatively on both components of the decision process and that, by controlling for the inflated nature of the response variable's distribution, the proposed formulation provides a better representation of the Italians' tourism behaviour in comparison with non-inflated hurdle models. Given this, we believe that our model can be a useful tool for policy makers who are trying to forecast the effects of new targeted policies to support tourism economy.


Introduction
During the years 2008-2013, consumption expenditures of Italian households was harshly hit by the Great Recession (ISTAT 2014) with a remarkable reduction of purchasing power (-10.4% between 2007 and 2013). In that period, Italian households showed a reduction in tourism expenditure and a change in travel behaviour as well. The expenditure share devoted to accommodation facilities passed from 2.8% in 2010 to 2.3% in 2013 and the annual decrease in the number of trips by resident was nearly -12% in 2010, -19% in 2013. Only in 2015, for the first time after seven years, there has been a remarkable increase (+13.5%).
Objective of our study is tourism behaviour of the Italian residents and, in particular, we analyse Italians' participation in tourism in the period covering the recent economic recession. Using data from the survey on Trips of Italian Residents in Italy and Abroad, carried out quarterly by the Italian National Institute of Statistics (ISTAT), we investigate whether the propensity in tourism participation (i.e., the probability of having at least one holiday trip with at least an overnight stay in a quarter) and the intensity of participation (i.e., the sum of the length of stay of all holiday trips taken in a quarter) have changed over the period of analysis.
The theoretical framework used for the joint analysis of these two aspects is the hurdle model, a modified count data model which allows to consider the response as result of a two-stage decision process: at first a person decides whether to take a holiday and then, conditionally to a positive decision, he decides the length of the holiday. In a general hurdle model, a binary model is used to represent the binary outcome of whether a count variable has a zero or a positive realisation and then the positive realisations are modelled by a truncated-at-zero count data model. Various specifications can be adopted for the truncated-at-zero model depending on the distribution of the positive realisations. Given their flexibility, the hurdle models have been widely used in several contexts of health and economic studies, and a few applications of this method can be found in tourism analysis as well (Hellström 2006, Bernini & Cracolici 2015, Boto-García et al. 2019. A noteworthy concern in the analysis of the number of overnight stays is the presence of multiple spikes in its distribution. That occurrence is due to the propensity to take a holiday in typical day blocks (e.g. week-end, one week, two weeks, etc.), which in turn produces a concentration of the total number of overnight stays on certain values, known as inflated values. Some authors have treated this problem by re-defining the response variable into two or more classes and then applying a logit or a multinomial model (Alegre & Pou 2006, Nicolau & Más 2004; others adopted a latent class approach (Alegre et al. 2011) or employed a quantile regression model (Salmasi et al. 2012).
We take a novel approach, not yet adopted in the context of tourism analysis: the truncated-at-zero model for the positive responses is specified as a multiple inflated truncated negative binomial model, that is a finite mixture of a zero-truncated negative binomial and a set of degenerate distributions on the inflated values, with the mixture probabilities modelled through a multinomial logit model. Even considering a wider literature, at the best of our knowledge the actual specification of a hurdle model with a multiple inflated distribution for the positive responses, even if theoretically feasible, has not been presented before.
Results show that the economic recession impacted negatively on both components of the decision process and that, by controlling for the inflated nature of the response variable distribution, the proposed formulation provides a better representation of the Italians' tourism behaviour in comparison with non-inflated hurdle models. In particular, by using a multiple inflated hurdle model we are able not only to identify the determinants of the phenomenon under study, but also to correctly fit the distribution of the total number of overnight stays, even in presence of extremely inflated values which are usually under-predicted by standard models. Given this characteristic, we believe that the use of multiple inflated hurdle models can produce results which could be useful for policy makers in evaluate how the Italians would react to the implementation of targeted tourism policies.
The paper is organized as follows. Section 2 describes the database and discusses the characteristics of the response variable. Section 3 presents the theoretical model used for the analysis and discusses its properties and usefulness for the aim of the study. Results of the empirical analysis are presented in Section 4, and the last section concludes with the discussion of the main findings.

Data
The analysis employs a pooled time series cross-section database of Italian residents in the period 2004-2013, which covers the last economic recession that has seriously affected Italian households. Data comes from the household survey on Trips and Holidays of Italian Residents in Italy and Abroad, which is the main statistical source of demand-side tourism data available in Italy. It is currently carried out by ISTAT for responding at the EU Reg.692/2011, and it collects information about domestic and outbound travels of the Italian residents.
From 1997 to 2013, it has been conducted quarterly on a national annual sample of about 14,000 households (about 3,500 per quarter), comprising an annual total of about 32,000 individuals. Each year, data are collected for the periods January-March, April-June, July-September and October-December. In each quarter and for each individual, information on travels with at least one overnight stay concluded during the quarter, made for any main purpose, are recorded. Tourism trips are classified into business and holiday trips.
In addition, socio-demographic characteristics of all household components are recorded: age, gender, region of residence, education level, marital status, occupational status and professional position. It should be noted that this information is collected for all individuals, regardless of their being traveller or not. Therefore the survey data allows to identify the share and characteristics of both tourism participants and non-participants. Unfortunately, these characteristics do not include any information about the individuals' economic status.
For the participants the survey offers also an in-depth insight about their tourism behaviour in terms of number of trips, nights spent and characteristics of the trip, but provides no information about tourism expenditure (although surveyed for the Tourism Satellite Account it is not provided for research purpose).
From 2014, the Trips and Holidays survey has become a focus included in the Households Budget Survey and deep changes have been introduced in every stage of the survey process. Therefore, since the two sources cannot be appropriately linked together, we stop our analysis at 2013. Moreover, given the adoption of the Euro currency occurred in 2002, we start the analysis from 2004.  As we are interested in studying the factors that may influence individual tourism behaviour, our unit of study is the individual. We limit the analysis to holiday trips and, since children's tourism choices are not individually made, we consider only persons at least 15 years old.
We define our study variable as the total number of overnight stays in a quarter, obtained by summing the length of stay of each holiday trip made by an individual in that quarter (the variable is set at zero for an individual who has not travelled). The variable's descriptive statistics presented in Table 1 show a prevalence of zero values: almost 80% of the sampled units are tourism non-participants. If we expand the data to the whole Italian population by using the expansion factor provided by ISTAT, we can estimate that only about the 24% of Italians (at least 15 years old) made on average at least a holiday trip in a quarter. Even in the summer quarter (July-September), this percentage reaches only 42%.
Considering the positive number of overnight stays, from Table 1 we can see that the variable is highly skewed and overdispersed (the variance is almost 14 times the mean). This is confirmed by Figure 1: when the number of overnight stays increases its frequency quickly decreases, but the distribution has a long tail of low-occurrence values. From Figure 1 we can observe multiple spikes in its distribution: the observed positive values of overnight stays are concentrated on specific values, like 2, 6, 7, 14, 15, 20 and 30 nights.
Finally, as expected, seasonality plays an important role in characterising tourism behaviour. The proportion of Italians who made on average at least a holiday trip in a quarter and the average number of overnight stays are significantly higher in the third quarter than in the other quarters. More important, as we can see in Table 2 and Figure 2, the distribution of the positive number of overnight stays is completely different in the third quarter from that of other quarters: it is more variable, with a longer tail and with a larger number of inflated values. In addition, there is a higher concentration on the inflated values.

Methodology
Since the response variable is discrete and non-negative, we refer to count data models. The most common models in literature are the Poisson and the negative binomial regression models. One of the basic assumptions of these models is that both zero and positive values of the response variable come from the same data generating process. However, as it frequently occurs when analysing socio-economic phenomena, our data do not adhere to this assumption. In fact, it makes sense to assume that a person firstly decides whether or not to take a holiday (i.e. whether or not to participate in tourism), and then, conditionally to a positive decision, he decides the number of overnight stays. In such a situation it seems opportune to firstly separate participants from non-participants, zeroes from non-zeroes, through a binary model and then to model the positive responses using a truncated-at-zero count data model.
This assumption is typical of the hurdle model, in which the two processes generating zeros and positive values are not constrained to be the same Cameron & Trivedi (2013). Firstly, a binomial probability governs the binary outcome of whether a count variate has a zero or a positive realisation and then, if the hurdle is crossed (i.e. the realisation is positive), the conditional distribution of the positives is governed by a truncated-at-zero count data model. Such a conditional setting enables the interpretation of covariate effects through event incidence and frequency in the respective logistic and truncated distribution components.
Formally, let y be a discrete non-negative response variable and let X and Z be two covariates matrices (that could coincide, at least partially, or be completely different), then a generic hurdle model for each individual i can be defined as: is the probability of observing a count of 0, usually estimated from a logit or probit model, and f 2 (j|z i ) is a truncatedat-zero count data density The choice of the model specification for f 2 is usually driven by data characteristics. In particular, one should take into account if the data are overdispersed (i.e., the variance exceeds the mean) and if there is an abnormally large number of observations concentrated on one or more values (i.e., the distribution is inflated). As discussed in the previous section, the positive number of overnight stays is both overdispersed and inflated, therefore the specification for f 2 need to reflect both these characteristics.
Models that handle a single inflated value, typically at zero, have been proposed since the early 1990s, starting with the zero-inflated Poisson (ZIP) regression model first presented by Lambert (1992). However, studies on the generalisation of single-inflated models to the situation of multiple inflated distributions are relatively recent and still in progress.
Here we move from the generalisation of the ZIP model to a multiple inflated Poisson model (MIP) suggested by Giles (2007) to allow for countinflation at multiple values. In dealing with multiple inflated count data the MIP model assumes a finite mixture model of a Poisson distribution and a set of degenerate distributions, one for each inflated value. In doing so, the MIP model assumes that overdispersion of data can only arise from splitting the data in more regimes. When this assumption does not hold and the overdispersion derives also from an heterogeneity component, it is opportune to generalise the MIP model into a multiple inflated negative binomial model (MINB), in analogy with what it is usually done when replacing a Poisson model with a negative binomial model. Finally, since we want to model truncated-at-zero count data, the negative binomial distribution will be replaced by its truncated counterpart, obtaining a multiple inflated truncated negative binomial model (MITNB).
Assuming that the positive count response has M − 1 inflated values, the MITNB distribution can be specified as: where M j=1 p ij = 1 and T N B(.) is the truncated negative binomial distribution. Note that the inflated values are not required to be consecutive in the model, even if they are denoted as 1, ..., (M − 1) for notational convenience.
Under this specification, f 2 becomes indicates the probability mass distribution of the negative binomial model with variance function λ i (1 + λ i /θ i ), denoted as NB2 model in Cameron & Trivedi (2013, p. 74); where Γ is the gamma function, λ i is the location parameter and θ i is the scale parameter (i.e. the inverse of the dispersion parameter).
Both λ i and θ i depend on covariates by the regression functions where Z 1 and Z 2 are subsets of the covariate matrix Z and β 1 and β 2 are the vectors of the corresponding regression parameters. Note that a smaller θ i corresponds to a larger overdispersion. The mixing probabilities p ij = Pr(y i = j) are modelled with a multinomial logit regression model (with reference category M ) ln Pr for j = 1, ..., (M − 1), where Z 3 is a subset of the covariate matrix Z, γ j is a vector of regression parameters specific to each M − 1 value. The multinomial logit model is an extremely flexible formulation, but requires the estimation of several parameters. If necessary one can replace it with other more parsimonious models, but these usually require additional assumptions on the parametric model formulation (for example, Su et al. (2013) use a cumulative logit model which relies on the parallel regression assumption).
Assuming, as usually done, that the error terms of the binary and the truncated model are independent, the likelihood function can be separated in two parts (one for each model component) and the two components f 1 and f 2 can be fitted separately through maximum likelihood estimation. Due to the model complexity, likelihood maximisation needs to be carried out by numeric optimisation techniques.
Once the MITNB hurdle model as been estimated it is possible to calculate the predicted values for the two components of the model. In particular, the predicted number of positive overnight stays can be computed aŝ where v j is the j-th inflated value andp ij ,λ i ,θ i are respectively the mixing probabilities, the location parameter and the scale parameter predicted for a generic observation i with covariates z i . The corresponding standard errors can then be computed by delta method.
Equation (10) highlights that, in order to accurately predict the positive values of y, it is essential to model not only the underline generating process (the TNB model) but also the inflated values.

Results
This section presents the results of the empirical analysis, divided in two parts. The first part focuses on the model specification and assessment, whereas the second part discusses the determinants of the two components (propensity and intensity) of tourism participation.

Model specification and assessment
The model includes a set of auxiliary variables that are collected by the Trips and Holidays survey for all sampled individuals (regardless of their being traveller or not). These are used as predictors of the total number of overnight stays in a quarter in the period 2004-2013. In detail, these variables can be classified into three categories: • Socio-demographic characteristics: gender, age (scaled; both in level and in a quadratic form), education level (academic degree vs. other levels), number of household members, presence of children up to 10 years old in the household.
• Economic characteristics: the economic-related variables available in the dataset are occupational status, professional position, and number of income recipients in the household. This last variable has been transformed into relative terms and included in the model as a categorical variable: first we calculate the proportion of household members at least 16 years old whose employment status is occupied or retired, then we factorise it into three categories ("no members", "at most %50 of members", "more than %50 of members"). The individual occupational status and the professional position have been combined into a single variable that distinguishes among "housewife/househusband", "student", "retired", "disabled", "managerial staff", "office worker", "manual worker", "self employed" and "professional". We are aware that the economic condition of the individuals may not be described entirely by these variables, but unfortunately this is the only information collected by the survey.
We refer to Table 9 in Appendix for the descriptive statistics of all the variables, their acronym, and definition. Both components of the hurdle model contain the above mentioned variables as main effects. That is, in our analysis the covariates matrix X of the logit model (which governs the binary outcome participation/not participation in tourism) and the covariates matrix Z of the truncated model (which governs the positive values of overnight stays) coincide. The regression model for the location parameter λ includes all the variables of matrix Z as well.
About the dispersion parameter θ, from Table 2 we have observed that the variability of the positive response is much higher in the third quarter, therefore to control for this heterogeneity we model θ as function of the third quarter (third vs. the others).
Analogously, from Figure 2 it is evident that the inflated values have different relevance depending on the quarter. Therefore, the mixing probabilities are estimated via a multinomial logit function of the third quarter.
The last setting required to complete the model specification is the identification of the inflated values, that is the values of overnight stays for which we observe an abnormal large frequency. In fact, the multiple inflated model (4) described in the previous section assumes that the number of inflated values (M − 1) is known, together with their values, therefore they need to be chosen before estimating the model. To this end, in order to identify the best model specification, we applied a two-step approach. First, we selected a list of plausible inflated values through visual inspection of the histogram  of the observed responses (Fig. 1). Then, we compared several hurdle model specifications, each characterised by a different set of inflated values, using goodness-of-fit criteria, like the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) (as suggested by Cai et al. (2018)). Table 3 describes some of the considered specifications. 1 In particular, the first model (Model 0) is the traditional negative binomial hurdle model without any inflated values, which has been used as benchmark model. Models 1, 2 and 3 are all multiple inflated negative binomial hurdle models: Model 1 includes only the most evident inflated values (see Figure 1), whereas Model 2 and Model 3 each adds more values to the previous model specification.
Model 4 is equivalent to Model 3, but the mixing probabilities are estimated as function of all quarters, therefore adding 32 additional parameters to the complete model. 2 Table 4 shows the corresponding model fit statistics. 3 It is clear that even by considering only few extremely evident inflations, as in Model 1, we obtain a much better fit to our data comparative to the benchmark hurdle model; and the fit improves again with the addition of other inflated values. Comparing Model 3 and Model 4 we can see that the inclusion of all quarters in the multinomial specification generates some additional gain (AIC is lower), but not as much as to be worthy of the additional complexity (BIC is higher). This confirm our choice to include only the third quarter in the multinomial model. Therefore, the final specification for the analysis is Model 3, which considers 16 inflated values: 2, 3,4,6,7,10,14,15,20,28,29,30,40,45,50, 60 nights.
In figures 3 and 4 we present the hanging rootogram plots for Model 0 and Model 3 respectively. As described by Kleiber & Zeileis (2016), the hanging rootogram is a graphic tool particularly useful for diagnosing issues such as overdispersion and multiple inflation in count data modelling. It displays predicted and observed distribution of the variable under study, showing how the model fits the data. Discrepancies are seen by comparison with the horizontal axis: if a bar doesn't reach the zero line then the model over-predicts a particular value, and if the bar exceeds the zero line it under-predicts it. The vertical axis is scaled to the square-root of the frequencies to draw more attention to differences in the tails of the distribution. The comparison between the two plots confirms that the proposed multiple inflated approach provides a better adaptation to the data since the model corrects most of the under-prediction of the inflated values that is displayed in figure 3.

Determinants of tourism behaviour
Maximum likelihood estimates of the multiple inflated hurdle model parameters are presented in the following tables: estimates for the logit regression are in Table 5, for the truncated negative binomial model are in Table 6, and for the multinomial regression are in 7.
Results show the importance of the socio-demographic variables as determinants of both the propensity and the intensity of tourism participation. Age has a discordant effect: older people tend to participate less in tourism, but when they do participate they tend to have longer holidays. Conversely, the effect of family composition is concordant in both components of the hurdle model: a larger family has a lower propension to travel and tend to spend fewer days on holiday, but having at least one young child increases both the odds of travelling and the number of overnight stays.
Consistently with the hypothesis that economic conditions matter, the obtained via numeric optimisation applying dual quasi-Newton algorithm.   Significance codes: * * * p < 0.001, * * p < 0.01, * p < 0.05, • p < 0.1 proportion of household's income recipients has a positive and highly significant effect on the decision to go on holiday. Moreover, estimates for the occupational status tell us that manual workers and unemployed persons are less likely to go on holiday, contrary to professionals, managerial staff and office workers who have a higher propensity to travel. But when on holiday, the occupational status acts primarily as a constraint on trips' duration: working individuals have less time to spend on holidays than students and retired people. The same can be said for the proportion of income recipients, since a higher proportion is associated with fewer days of holiday. The model presents a remarkable North-South divide in tourism participation: assuming that the other covariates remain constant, the odds for residents of insular and southern regions are about 50% lower than that of north-western residents. And the same North-South dualism can be observed in the number of nights spent on holiday. In this respect, one should also consider that northern regions have a more efficient transportation system Significance codes: * * * p < 0.001, * * p < 0.01, * p < 0.05, • p < 0.1 and a more favourable location as they are closer to foreign destinations that produce additional attractions for those Italian residents. On the other hand, since southern and insular regions have plenty of "in-house" leisure destinations, there could be a larger part of residents of these areas who prefer same-day trips, which are not registered in the dataset.
To understand the effect of the economic crisis, we compute the predictive margins of year and quarter for the two aspects of tourism participation, Pr ( Table 10 in Appendix. Graphs show that the reaction to the Great Recession starts in 2009 when, after a period of growth in participation (comparatively to 2004), the predicted probability of travelling begins to decrease; and the decline spikes in 2011 probably due to the heavy fiscal restrictive measures adopted by the Italian government in that year. In addition, from 2011 the tourism participation drops below the 2004's level and, after a year of stability, further decrease in 2013. This might indicate the presence of an inertia in reacting to the 2008 crisis: at first, households reacted to an increase in taxes by reducing savings to defend their living standards; after, considering the persistence of the crisis, households had to reduce consumption. Moreover, results reflect the general trend of a reduction in the average length of stay per trip which has been observed at the macro level (ISTAT 2014, Chp. 18). In fact the decrease seams to have started even before 2008, but the highest reduction can be identified in 2013 indicating that the 2011 downturn, more than that of 2008, strongly affected tourism behaviour about intensity.

Final remarks
Estimation results for the proposed multiple inflated hurdle regression model show that, in Italy, the Great Recession had a negative impact on both the propensity and the intensity of tourism participation. Moreover, estimates confirm common knowledge that seasonality is a universal factor in tourism and that socio-demographic and economic characteristics are relevant in determining individuals' tourism behaviour.
In assessing the effects of the Great Recession on tourism, we have to consider that nowadays tourism has become a "normal thing", a part of the lifestyle, quality of life and well being of an increasing number of people (Bargeman & van der Poel 2006, Dolnicar et al. 2012, Cracolici et al. 2013). Therefore, we should likely observe an inertia in tourism behaviour and a higher probability of a "slicing strategy" (e.g., cheaper holiday) rather than a pure "cutback strategy" (e.g., fewer trips, reduced length of stay) (Bronner & De Hoog 2012). The dataset used in our analysis doesn't include information about tourism expenditures, therefore we were not able to investigate whether Italians employs a "slicing strategy" in response to the economic crisis. Conversely, through the formulation of a hurdle model we studied which level of "cutback strategy" has been mostly implemented by the Italian citizens: (a) giving up holidays completely or (b) reducing in the number of overnight stays. Evidence shows that both strategies has been applied in the period of the Great Recession, but the more prominent reaction to the crisis seems to be the complete renounce to leisure trips: the probability of participation diminished between 2007 and 2013 by more that 30% in the off-peak seasons and by 25% in the summer.
Motivated by the analysis of the impact of the Great Recession on tourism behaviour, the paper proposes a general, novel approach for dealing with count variables whose distribution is inflated in multiple values. This feature can not be represented through the probability distribution models commonly used for count data, but needs to be properly addressed (alongside other data characteristics like zero-inflation and overdispersion) in order to avoid possible estimation biases and incorrect inference about the model parameters (Cai et al. 2018). Moreover, failing to control for the inflated nature of the distribution can limit the model's ability to produce reliable model based predictions.
We propose the use of a multiple inflated hurdle negative binomial model, with mixing probabilities modelled through a multinomial logit model, in comparison with the use of the well known hurdle negative binomial model. We show that, by controlling for the inflated nature of the response variable distribution, the proposed formulation provides a better representation of the Italians' tourism behaviour in comparison with non-inflated hurdle models. In particular, by using a multiple inflated hurdle model we are able not only to identify the determinants of the phenomenon under study, but also to correctly fit the distribution of the total number of overnight stays, even in presence of extremely inflated values which are usually under-predicted by standard models. Given this characteristic, we believe that multiple inflated hurdle models can be useful tools for decision makers who are trying to forecast future events or the consequences of some new targeted policies.
The proposed methodology assumes that the inflated values are known, or are exogenously selected by a double procedure of visual inspection and model comparison. Optimal selection of the mixture components is a controversial issue when using any mixture model, and further research should be devoted to investigate the possibility of including the identification of the inflated values directly in the model estimation process.   Significance codes: * * * p < 0.001, * * p < 0.01, * p < 0.05, • p < 0.1