1 Introduction

Modelling univariate count data by means of Poisson type regression models is nowadays a routine approach, and some competing model specifications have been proposed for bivariate count data as well (see, e.g. Lambert 1992; Kocherlakota and Kocherlakota 1992; Cameron and Trivedi 1998, 2005; Berkhout and Plug 2004; Famoye and Singh 2006; Lee et al. 2009; Winkelman 2008; Famoye 2010; Tsou 2016). It is worth mentioning that many empirical studies apply negative binomial regression because of the nature of count data. Multivariate count data occur in a wide range of applications, including accident analysis, sports statistics, economics, and many others (see e.g. Brijs et al. 2004, 2006; McHale and Scarf 2007; Ma et al. 2008; Baioa and Blangiardo 2010; Bermúdez and Karlis 2011; Shahtahmassebi and Moyeed 2016). Bayesian inference has sometimes been used in these applications. There are several approaches to the construction of the model for bivariate count data. The bivariate Poisson distribution based on the trivariate reduction method is restricted due to only positive correlation between two count variables (see Kocherlakota and Kocherlakota 1992; Brijs et al. 2006; Famoye 2010). Another approach is to model the joint distribution using copulas (see Van Ophem 1999; McHale and Scarf 2007); these models allow for more flexible specification of the dependence structure. Also, relatively flexible dependence structures appear in the models built upon the idea of the mixture of independent Poisson distributions (Aitchison and Ho 1989) or the conditional probabilities (Berkhout and Plug 2004). The models for multivariate count data are discussed in greater detail in, e.g. Cameron and Trivedi (1998) and Winkelman (2008).

In this paper we look at bivariate Poisson type regressions in the case when some observations follow a degenerate (i.e. univariate) distribution. Since a bivariate model for two non-degenerate count variables is the central part of our specification, we focus on some particular structure, following a simple and flexible modelling path, started by Berkhout and Plug (2004), which seems easy to generalize to dimensions higher than 2. However, our goal is to consider the possibility of degenerate bivariate observations, and not to propose one more new specification for related count variables.

In joint modelling of count variables we may face the situation when one of them is necessarily zero for many observed units. For example, if we analyze determinants of (and relation between) the numbers indicating how many times during a month people used public transport and how many times they used their own cars, then for any person without a car the number of using it is necessarily zero. It becomes crucial to examine the opportunities and consequences of inference (on determinants of using public transport and on the relationship between using public transport or private cars) on the basis of full data set in comparison to inference based on the data for car owners only. Using the latter means sample selection, which makes any generalization unjustified. In order to use all observations on two variables of interest, we propose a statistical model with switching between two specifications for count variables: a bivariate model and a univariate model. Switching is based on the third, zero–one variable (car ownership in the example above). Such approach enables to formulate testable hypotheses, e.g. that the mechanism generating values of the count variable which is always observed (never degenerate) is exactly the same in two groups of observed units.

The main part of the switching model, introduced in this paper, is a bivariate model of count variables representing the case where both variables are non-degenerate. We use the so-called ZIP–CP (zero inflated Poissonconditional Poisson) specification, proposed by Marzec and Osiewalski (2012); it is a bivariate Poisson type regression, more general than P–CP (Poissonconditional Poisson) model, introduced by Berkhout and Plug (2004). In the P–CP model, one of the variables is marginally Poisson and the other is conditionally Poisson. This model can be easily estimated and it allows correlation of any sign, but the sign of correlation between the two count variables depends on the sign of one parameter only and is independent of the explanatory variables of the model. In the ZIP–CP model of bivariate Poisson type regression, the marginal ZIP type distribution is used for the first variable (instead of the marginal Poisson distribution), which leads to the covariance sign dependent on the explanatory variables. The characteristics of the ZIP–CP model follow from the properties of the bivariate discrete ZIP–CP distribution, introduced and examined by Osiewalski (2012). The second part of the switching model proposed in this paper amounts to a univariate Poisson regression for the second variable—in case when the first variable is degenerate (with its full probability mass concentrated at zero). As it has been already mentioned, the third part is a dichotomous specification that describes switching between the bivariate (non-degenerate) and univariate (degenerate) case.

In the next section we present the probabilistic foundations of our switching model, i.e. the discrete distributions used to build the three parts of this model—in particular, the ZIP–CP distribution. Section 3 is devoted to our statistical model, the form of the likelihood function and the Bayesian analysis. Section 4 contains an empirical illustration, showing new results of joint modelling of the numbers of bank card and cash payments. In contrast to previous studies—see Polasik et al. (2012a) and Marzec and Osiewalski (2012)—we use all available data, for cardholders and non-cardholders. Our empirical example that refers to the research on non-cash transactions in Poland—see Polasik and Maciejewski (2009), Polasik (2015), Polasik et al. (2012b) and Goczek and Witkowski (2015, 2016)—serves as an illustration of modelling and inference problems with count variables, where one of them (the number of card payments) is degenerate for many observations (individuals without cards). In Sect. 5 concluding remarks are stated.

In the literature dealing with multivariate regression models we can find a variety of approaches to deal with dependent variables analyzed simultaneously of which at least one is partially observed. The practice of omitting data, also referred to as data selection, has been often used. For example, exploiting a dataset of over 11,000 payments, Bounie and Francois (2006) estimated the determinants of the probability of a transaction being paid by cash, check or bank card at the point of sale; a multinomial logit model was applied in their study. They excluded persons who did not hold bank cards or check accounts. The next simplification amounts to using variables which ignore the nature of the data. Kalckreuth et al. (2014) employed probit models for dependent variable. Although these practices seem useful, because they simplify the statistical modelling problem, they do not reflect the complexity of consumer behavior. From the viewpoint of statistical inference, such simplified approaches may be inappropriate, as we show in this paper. Stavins (2016) stated that “estimating consumers’ decisions to adopt and use payment instruments as independent events can lead to sample-selection problems”.

The aim of this paper is to present a new approach to payment behavior analysis by a tailor-made model, which is designed to cover all available observation units, e.g. consumers with or without cards. The advantage of our specification is that the proposed trivariate model can capture salient features of the original data set. This modeling approach is free from the sample selection error and it allows to test for possible misspecification due to sample selection.

2 Probabilistic foundations of the new statistical model

We consider the joint distribution of three random variables (Y1, Y2, Y3), where the third one is a zero–one variable, the second variable can take any non-negative integer value, and the first one is concentrated at zero when Y3 = 0 (Pr{Y1 = 0|Y3 = 0} = 1), and can take any non-negative integer value if Y3 = 1. Thus, when Y3 = 0, the conditional distribution of (Y1, Y2) is the same as the distribution of (0, Y2) and corresponds to the univariate distribution of Y2. Only when Y3 = 1 the distribution of the pair (Y1, Y2) is a bivariate distribution over the set of all pairs of non-negative integers; now we focus on its two special cases: P–CP (Berkhout and Plug 2004) and ZIP–CP (Osiewalski 2012). These distributions lead to particularly simple and useful bivariate Poisson type regression models. Other specifications impose restrictions on the correlation between two count variables or are more complicated from the statistical or numerical perspective.

If Y3 = 1, the probability distribution of (Y1, Y2) is as follows:

$$\begin{aligned} \Pr \{ Y_{1} &= i,\;Y_{2} = j|Y_{3} = 1\} = \Pr \{ Y_{1} = i|Y_{3} = 1\} \\&\Pr \{ Y_{2} = j|Y_{3} = 1,Y_{1} = i\} = g(i)\;h(j,i),\end{aligned} $$
(1)

where ij ∈ N ∪ {0}. If the distribution of Y1 is Poisson with mean (and variance) λ1 and the conditional distribution of Y2 given Y1 is Poisson with mean (and variance) λ2exp(αY1), i.e.

$$ g(i) = \exp ( - \lambda_{1} )(\lambda_{1} )^{i} /i!,\quad h(j,i) = \exp [ - \lambda_{2} \exp (\alpha \cdot i)](\lambda_{2} )^{j} \exp (\alpha \cdot i \cdot j)/j!, $$
(2)

then we have the bivariate P–CP distribution with the following moments (Berkhout and Plug 2004):

$$ E(Y_{2} |Y_{3} = 1) = \lambda_{2} \exp [\lambda_{1} (e^{\alpha } - 1)], $$
(3)
$$ E((Y_{2} )^{2} |Y_{3} = 1) = E(Y_{2} |Y_{3} = 1) + [E(Y_{2} |Y_{3} = 1)]^{2} \exp [\lambda_{1} (e^{\alpha } - 1)^{2} ], $$
(4)
$$ Var(Y_{2} |Y_{3} = 1) = E(Y_{2} |Y_{3} = 1) + [E(Y_{2} |Y_{3} = 1)]^{2} \{ \exp [\lambda_{1} (e^{\alpha } - 1)^{2} ] - 1\} , $$
(5)
$$ E(Y_{1} Y_{2} |Y_{3} = 1) = \lambda_{1} e^{\alpha } E(Y_{2} |Y_{3} = 1). $$
(6)

If α ≠ 0, then the variance (5) of Y2 is greater than its expectation (3). The dependence between the two variables leads to the inflated variance of Y2, which is usually observed in empirical count data. The Poisson distribution of Y1 does not have this crucial property. This is the first reason to generalize the bivariate P–CP distribution through replacing the marginal Poisson distribution of Y1 by a ZIP type distribution—in line of the approach presented by Lambert (1992), Cameron and Trivedi (1998, 2005) and Winkelman (2008). The second reason to generalize the P–CP model lies in its restrictive approach to the dependence between two count variables, as the sign of covariance between Y1 and Y2, i.e. of

$$\begin{aligned} &Cov(Y_{1} ,Y_{2} |Y_{3} = 1) = E(Y_{1} Y_{2} |Y_{3} = 1)\\ &\quad - E(Y_{1} |Y_{3} = 1)E(Y_{2} |Y_{3} = 1) = \lambda_{1} (e^{\alpha } - 1)E(Y_{2} |Y_{3} = 1),\end{aligned} $$
(7)

depends only on the sign of the real constant α, and not on λ1 or λ2, which are parameterized through explanatory variables in statistical applications of this probabilistic model.

The generalization proposed by Osiewalski (2012) allows for dependence of the sign of covariance on λ1 as well. This more general class of bivariate discrete distributions (denoted with a star) is characterized by the same conditional distribution of Y2 given Y1:

$$ {\Pr}^{*} \{ Y_{2} = j|Y_{3} = 1,Y_{1} = i\} = h(j,i) = \Pr \{ Y_{2} = j|Y_{3} = 1,Y_{1} = i\} $$
(8)

and by the ZIP-type distribution of Y1, with zero treated separately:

$$ {\Pr}^{*} \{ Y_{1} = i|Y_{3} = 1\} = g^{*} (i) = \left\{ {\begin{array}{*{20}l} \gamma \hfill & {for\;i = 0,} \hfill \\ {\frac{1 - \gamma }{1 - g(0)}g(i)} \hfill & {for\;i \in N,} \hfill \\ \end{array} } \right. $$
(9)

where γ belongs to the (0, 1) interval, and g and h are the same functions as in (2). If γ = g(0), then Pr*{Y1 = i|Y3 = 1} = g*(i) = g(i) = Pr{Y1 = i|Y3 = 1} and we are back in the P–CP case. If γ > g(0), then the distribution of Y1 is of the ZIP type, so we call the joint distribution ZIP–CP. However, note that the specification (9) is more general as it also allows γ < g(0).

The moments of the ZIP–CP distribution are related to the moments of the P–CP case through the following general formula (assuming 0m = 1 for m = 0):

$$\begin{aligned}& E^{*} (Y_{1}^{m} Y_{2}^{n} |Y_{3} = 1) = g_{0} [(1 - \gamma )E(Y_{1}^{m} Y_{2}^{n} |Y_{3} = 1) + (\gamma - g(0))\,0^{m}\\ &\quad E(Y_{2}^{n} |Y_{3} = 1,Y_{1} = 0)], \end{aligned}$$
(10)

where g0 = (1 − g(0))−1.

In particular:

$$ E^{*} (Y_{1} |Y_{3} = 1) = g_{0} (1 - \gamma )E(Y_{1} |Y_{3} = 1) = g_{0} (1 - \gamma )\lambda_{1} , $$
(11)
$$ E^{*} (Y_{1}^{2} |Y_{3} = 1) = g_{0} (1 - \gamma )E(Y_{1}^{2} |Y_{3} = 1) = g_{0} (1 - \gamma )\lambda_{1} (1 + \lambda_{1} ), $$
(12)
$$ E^{*} (Y_{2} |Y_{3} = 1) = g_{0} [(1 - \gamma )E(Y_{2} |Y_{3} = 1) + (\gamma - g(0))\lambda_{2} ], $$
(13)
$$ E^{*} (Y_{2}^{2} |Y_{3} = 1) = g_{0} [(1 - \gamma )E(Y_{2}^{2} |Y_{3} = 1) + (\gamma - g(0))\lambda_{2} (1 + \lambda_{2} )], $$
(14)
$$ E^{*} (Y_{1} Y_{2} |Y_{3} = 1) = g_{0} (1 - \gamma )E(Y_{1} Y_{2} |Y_{3} = 1) = g_{0} (1 - \gamma )\lambda_{1} e^{\alpha } E(Y_{2} |Y_{3} = 1), $$
(15)
$$ Var^{*} (Y_{1} |Y_{3} = 1) = \frac{1 - \gamma }{1 - g(0)}\lambda_{1} \left( {1 + \frac{\gamma - g(0)}{1 - g(0)}\lambda_{1} } \right), $$
(16)
$$\begin{aligned} Var^{*} (Y_{2} |Y_{3} = 1) =& \frac{1 - \gamma }{1 - g(0)}\left\{ Var(Y_{2} |Y_{3} = 1) + \frac{\gamma - g(0)}{1 - g(0)} [E(Y_{2} |Y_{3} = 1) - \lambda_{2} ]^{2} \right.\\&\left.+ \frac{\gamma - g(0)}{1 - \gamma }\lambda_{2} \right\}, \end{aligned}$$
(17)
$$ Cov^{*} (Y_{1} ,Y_{2} |Y_{3} = 1) = \frac{1 - \gamma }{1 - g(0)}\left\{ {Cov(Y_{1} ,Y_{2} |Y_{3} = 1) + \frac{\gamma - g(0)}{1 - g(0)}\lambda_{1} [E(Y_{2} ) - \lambda_{2} ]} \right\}, $$
(18)

which leads to the correlation coefficient of the form

$$ \begin{aligned} & Corr^{*} (Y_{1} ,Y_{2} |Y_{3} = 1) \\ & \quad = \frac{{Cov(Y_{1} ,Y_{2} |Y_{3} = 1) + \frac{\gamma - g(0)}{1 - g(0)}\lambda_{1} \left( {E(Y_{2} |Y_{3} = 1) - \lambda_{2} } \right)}}{{\sqrt {\lambda_{1} \left( {1 + \frac{\gamma - g(0)}{1 - g(0)}\lambda_{1} } \right)\left\{ {Var(Y_{2} |Y_{3} = 1) + \frac{\gamma - g(0)}{1 - g(0)}\left( {E(Y_{2} |Y_{3} = 1) - \lambda_{2} } \right)^{2} + \frac{\gamma - g(0)}{1 - \gamma }\lambda_{2} } \right\}} }}, \\ \end{aligned} $$
(19)

where E(Y2|Y3 = 1), Var(Y2|Y3 = 1) and Cov(Y1Y2|Y3 = 1) are the moments of the P–CP distribution in (3), (5) and (7). After simple manipulations we obtain

$$ \begin{aligned} Cov^{*} (Y_{1} ,Y_{2} |Y_{3} = 1) & = (1 - g(0))^{ - 2} (1 - \gamma )\lambda_{1} \left[ (1 - g(0))e^{\alpha } E(Y_{2} |Y_{3} = 1) \right.\\&\quad\left.- (1 - \gamma )E(Y_{2} |Y_{3} = 1) - (\gamma - g(0))\lambda_{2} \right] \\ & = (1 - e^{{ - \lambda_{1} }} )^{ - 2} (1 - \gamma )\lambda_{1} \lambda_{2} \left\{ \,\left[ {(1 - e^{{ - \lambda_{1} }} )e^{\alpha } - (1 - \gamma )} \right]\right.\\&\quad\left. \exp (\lambda_{1} (e^{\alpha } - 1)) - \gamma + e^{{ - \lambda_{1} }} \right\}. \\ \end{aligned} $$
(20)

Now it is clear that the variables (Y1, Y2) that follow the ZIP–CP distribution

  1. 1.

    are negatively correlated, if \( [(1 - e^{{ - \lambda_{1} }} )e^{\alpha } - (1 - \gamma )]\exp (\lambda_{1} (e^{\alpha } - 1)) < \gamma - e^{{ - \lambda_{1} }} \),

  2. 2.

    are positively correlated, if \( [(1 - e^{{ - \lambda_{1} }} )e^{\alpha } - (1 - \gamma )]\exp (\lambda_{1} (e^{\alpha } - 1)) > \gamma - e^{{ - \lambda_{1} }} \),

  3. 3.

    are uncorrelated, if \( [(1 - e^{{ - \lambda_{1} }} )e^{\alpha } - (1 - \gamma )]\exp (\lambda_{1} (e^{\alpha } - 1)) = \gamma - e^{{ - \lambda_{1} }} \).

When γ = g(0) = exp (− λ1), i.e. if Y1 is Poisson (under Y3 = 1), the complicated formulas (18) and (20) reduce to the much simpler form (7), where the sign of covariance depends only on the sign of α. In other cases, i.e. when Y1 is of ZIP type, the sign of covariance in (20) depends on the values of λ1 and α (not only on the sign of the latter constant). Obviously, the value of covariance in the ZIP–CP distribution (and not only its sign) as well as the value of the correlation coefficient (19) depend on all the constants appearing in the ZIP–CP probability function, i.e. on γ, λ1, λ2 and α.

Remind that increasing the probability of the zero value of Y1 (in comparison to the Poisson distribution with mean and variance λ1), that is assuming the ZIP type distribution with γ > g(0), leads to variance (16) greater than expectation (11). The ZIP–CP distribution class enables inflating variances of both count variables, although they are not symmetrically treated.

As yet our considerations has been focused on the conditional distribution of the pair (Y1, Y2) given Y3 = 1, that is on the complicated part of our trivariate structure. The distribution of Y2 given Y3 = 0 (and the only possible zero value of Y1) is specified in such a way as to make it easy to test in our statistical model whether the conditional distribution of Y2 given Y1 = 0 is the same in both situations: Y3 = 0 and Y3 = 1. Therefore we assume the Poisson distribution with the probability function

$$ \Pr \{ Y_{2} = j|Y_{3} = 0\} = \Pr \{ Y_{2} = j|Y_{3} = 0,Y_{1} = 0\} = h_{0} (j) = \exp ( - \lambda_{2,0} )\,(\lambda_{2,0} )^{j} /j!, $$
(21)

with λ2,0 possibly different from λ2.

Summing up all the assumptions we have already introduced, we propose the following joint distribution of three discrete variables:

$$ \Pr \{ Y_{1} = i,\;Y_{2} = j,Y_{3} = l\} = \left\{ {\begin{array}{*{20}l} {p\,g_{{}}^{*} (i)h(j,i),} \hfill & {i,j \in N \cup \{ 0\} ,\;l = 1,} \hfill \\ {(1 - p)\,h_{0} (j),\quad i = 0,} \hfill & {j \in N \cup \{ 0\} ,\;l = 0,} \hfill \\ {0,} \hfill & {i \in N,\;j \in N \cup \{ 0\} ,\;l = 0,} \hfill \\ \end{array} } \right. $$
(22)

where p = Pr{Y3 = 1}. The marginal distribution of the pair (Y1, Y2) is a particular mixture of the bivariate ZIP–CP distribution and the univariate Poisson distribution:

$$ \Pr \{ Y_{1} = i,Y_{2} = j\} = p\,g_{{}}^{*} (i)h(j,i) + (1 - p)I_{{\{ 0\} }} (i)h_{0} (j),\quad i,j \in N \cup \{ 0\} , $$
(23)

where IA(·) denotes the characteristic function of the set A; the moments can be written as:

$$ E(Y_{1}^{m} Y_{2}^{n} ) = p\,E^{*} (Y_{1}^{m} Y_{2}^{n} |Y_{3} = 1) + (1 - p)\,0^{m} E(Y_{2}^{n} |Y_{3} = 0,Y_{1} = 0), $$
(24)

with E*(Y m1 Y n2 |Y3 = 1) denoting the appropriate moment of the ZIP–CP distribution, see (10), and E(Y n2 |Y3 = 0, Y1 = 0) coming from the Poisson distribution with parameter λ2,0.

3 The Bayesian statistical model

Consider T trivariate observations (Y1tY2tY3t; t = 1,2, …, T), where Y3t are dichotomous. For Y3t = 1, pairs (Y1t, Y2t) have different ZIP–CP distributions, i.e.

$$ {\Pr}^{*} \{ Y_{1t} = i,\;Y_{2t} = j|Y_{3t} = 1\} = g_{t}^{*} (i)h_{t} (j,i)\quad (i,j \in N \cup \{ 0\} ), $$
(25)

where

$$ {\Pr}^{*} \{ Y_{1t} = i|Y_{3t} = 1\} = g_{t}^{*} (i) = \left\{ {\begin{array}{*{20}c} {\gamma_{t} } & {for\;i = 0,} \\ {\frac{{1 - \gamma_{t} }}{{1 - g_{t} (0)}}g_{t} (i)} & {for\;i \in N;\;g_{t} (i) = e^{{ - \lambda_{1t} }} \,(\lambda_{1t} )^{i} /i!,} \\ \end{array} } \right. $$
(26)
$$ {\Pr}^{*} \{ Y_{2t} = j|Y_{3t} = 1,Y_{1t} = i\} = h_{t} (j,i) = \exp [ - \lambda_{2t} e^{\alpha \cdot i} ](\lambda_{2t} )^{j} e^{\alpha \cdot i \cdot j} /j!, $$
(27)
$$ \lambda_{1t} = \exp (\varvec{x}_{t} {\varvec{\beta}}_{1} ),\quad \lambda_{2t} = \exp (\varvec{w}_{t} {\varvec{\beta}}_{2} ),\quad \gamma_{t} = \exp ( - e^{\delta } \lambda_{1t} ) = \exp ( - \exp (\delta + \varvec{x}_{t} {\varvec{\beta}}_{1} )), $$
(28)

xt and wt are row vectors consisting of values of explanatory variables that determine the probabilities of particular pairs of values of Y1t and Y2t. The role of the explanatory variables depends on the column vectors of parameters β1 and β2, but also on the dependence parameter α and the ZIP parameter δ, which governs the deviation of Pr*{Y1t = 0|Y3t = 1} from the value corresponding to the Poisson distribution. Now the moments of the distribution of (Y1t, Y2t) given Y3t, presented in the previous section, depend on the explanatory variables.

The specification based on (26) is known in the literature as the hurdle model (Cameron and Trivedi 2005, p. 680); Winkelman (2008) compares it to the original ZIP model. The hurdle model form of our ZIP type specification is very simple, thus making estimation and testing quite easy.

When Y3t = 0, the pairs (Y1t, Y2t) = (0, Y2t) have degenerate distributions, where for Y2t we assume Poisson distributions—as in (24); that is

$$\begin{aligned} \Pr \{ Y_{2t} = j|Y_{3t} = 0,Y_{1t} = 0\} &= h_{0,t} (j) = \exp [ - \lambda_{2t,0} ](\lambda_{2t,0} )^{j} /j!,\\&\quad \lambda_{2t,0} = \exp (\varvec{w}_{t}\varvec{\beta}_{2,0} ). \end{aligned}$$
(29)

If β2 = β2,0, then Pr *{Y2t = j|Y3t = 1, Y1t = 0} = Pr {Y2t = j|Y3t = 0, Y1t = 0} and the mechanism that generates values of Y2t given Y1t = 0 is exactly the same, no matter what the value of Y3t is. In order to test the hypothesis β2 = β2,0 we need a tri-variate statistical model. Under our assumptions, this model amounts to the following parametric class of distributions:

$$ \Pr \{ Y_{1t} = i,\;Y_{2t} = j,Y_{3t} = l;\;\,\theta \} = \left\{ {\begin{array}{*{20}l} {p_{t} \,g_{t}^{*} (i)\,h_{t} (j,i),} \hfill & {i,j \in N \cup \{ 0\} ,\;l = 1,} \hfill \\ {(1 - p_{t} )\,h_{0,t} (j),} \hfill & {i = 0,\;j \in N \cup \{ 0\} ,\;l = 0,} \hfill \\ {0,} \hfill & {i \in N,\;j \in N \cup \{ 0\} ,\;l = 0,} \hfill \\ \end{array} } \right. $$
(30)

where \( p_{t} = \Pr \{ Y_{3t} = 1\} = 1 - F( - \varvec{z}_{t}\varvec{\beta}_{3} ) \), zt is the row vector of explanatory variables and F is the distribution function representing the particular dichotomous model for Y3t. In the empirical section we use the logit model, i.e. we assume that F is the distribution function of the logistic distribution. Other models of the dichotomous variable Y3t are worth considering, especially the one based on the skewed Student t distribution, which Osiewalski and Marzec (2004a, b) introduced as a relatively general alternative for the logit and probit specifications. In our statistical model for the triple (Y1t, Y2t, Y3t) the parameter vector θ is a column grouping δ, α, β1, β2, β3 and β2,0. We assume that, for any θ, trivariate observations are stochastically independent.

When Y1t = y1t, Y2t = y2t and Y3t = y3t (t = 1,2, …, T) have been observed, the likelihood function takes the form

$$ \begin{aligned} L\left( {\varvec{\theta} ;y} \right) & = \left[ {\prod\limits_{{t:\,y_{3t} = 1,y_{1t} = 0}} \; \,\gamma_{t} \,h_{t} \left( {y_{2t} ,0} \right)} \right]\left[ {\prod\limits_{{t:\,y_{3t} = 1,y_{1t} > 0}} {\frac{{1 - \gamma_{t} }}{{1 - g_{t} (0)}}g_{t} \left( {y_{1t} } \right)\,h_{t} \left( {y_{2t} ,y_{1t} } \right)} } \right] \\ & \quad \left[ {\prod\limits_{{t:\,y_{3t} = 0,y_{1t} = 0}} \; h_{0,t} \left( {y_{2t} } \right)} \right]\left[ {\prod\limits_{{t:\,y_{3t} = 1}} {p_{t} } } \right]\left[ {\prod\limits_{{t:\,y_{3t} = 0}} {\;(1 - p_{t} )} } \right]\\& = L_{1} (\varvec{\beta}_{1} ,\varvec{\beta}_{2} ,\alpha ,\delta )\,L_{2} (\varvec{\beta}_{2,0} )\,L_{3} (\varvec{\beta}_{3} ), \\ \end{aligned} $$
(31)

where y denotes the (3 × T) matrix of the observed values of Y1t, Y2t and Y3t. The first two products in (31) correspond to the bivariate component of the mixture model and form the function L1 of δ, α, β1, β2; the third product in (31) corresponds to the univariate Poisson component and is the function L2 of β2,0; the fourth and fifth products in (31) correspond to the dichotomous switching variable and constitute the function L3 of β3. If there is no relation among these three groups of parameters, then inference on each of them can be conducted separately, using only the appropriate function Lr (r = 1, 2, 3) instead the full likelihood function. The situations of “no relations” or their presence can be precisely formalized within the Bayesian statistics, where a probability measure (prior distribution) on the parameter space is defined, prior independence between parameters can be formally stated and posterior independence can be considered. Here we focus on two situations: the case of prior independence among the three groups of parameters and the case of β2 = β2,0.

Under the separability of the likelihood function, obvious from (31), prior independence among (δ, α, β1, β2), β2,0 and β3 leads to their posterior independence, which means complete separability of inference on each group of parameters. In this case, using only observations with y3t = 1 for estimating (δ, α, β1, β2) is fully justified as well as using only observations with y3t = 0 for estimating β2,0 alone. Obviously, inference on such functions of θ that involve parameters from different groups, e.g. on Corr(Y1t, Y2t|θ)—the unconditional correlation coefficient between the first two elements of the triple (Y1t, Y2t, Y3t), must be based on the joint posterior density of θ, p(θ|y), which uses the full likelihood function and complete data. The joint posterior is needed if one wants to compare the unconditional correlation coefficient Corr(Y1t, Y2t|θ) and the conditional one, Corr(Y1t, Y2t|Y3t = 1, θ) = Corr*(Y1t, Y2t|Y3t = 1, δ, α, β1, β2), derived in the ZIP–CP model using formula (19).

In the case of β2 = β2,0, when (given Y1t = 0) Y2t is explained in exactly the same way no matter what Y3t is, L1 and L2 in the likelihood function cannot be considered separately as both depend on β2. In this case inference has to be based on all data, the full likelihood and the joint posterior. Making inferences with the use of the data with y3t = 1 only would mean sample selection error. Of course, testing β2 = β2,0 requires the general model, without this restriction.

Complete specification of our Bayesian statistical model [with the sampling distribution (30) that leads to the likelihood function (31)] requires the prior distribution of θ. Obviously, our prior choice is related to the model structure, not to the data that are analysed in the empirical part. We assume prior independence and the standard normal prior N(0, 1) for each parameter. Zero prior expectations mean that the simplest model (with no ZIP effect, no dependence and no explanatory variables) gets the highest prior chance, but unitary standard deviations ensure significant prior chances for specifications being far from the simplest one. It seems that such simple joint prior distribution introduces little initial information and guarantees easy Monte Carlo simulations of the posterior distribution. Obviously, sensitivity of inferences with respect to the form of the prior distribution is an empirical question, to be answered for the data at hand, but it is of greater importance mainly in small data-sets. According to basic Bayesian asymptotic results, under any regular prior, the posterior based on a sufficiently large number of observations can be approximated by an appropriate multivariate normal distribution centred at the maximum likelihood estimate. Thus, in empirical studies based on large data-sets, sensitivity with respect to the prior distribution becomes much less important.

In this study we implement the random-walk Metropolis–Hastings MCMC algorithm to simulate samples from the posterior distribution of θ (Gamerman 1998). This algorithm was started either at zero values of the parameters or at maximum likelihood estimates obtained by estimating each sub-model separately (due to separability of the likelihood function). It turned out that the selection of starting values was not important for convergence. We generated a candidate random variable from a multivariate Student distribution; preliminary runs were used to calibrate its precision matrix. The algorithm involved 1,000,000 cycles, and the acceptance rate was about 10%. Convergence of single chains from the MCMC sampler was confirmed by the graphical procedure proposed by Yu and Mykland (1998).

4 Joint modelling of the numbers of card and cash payments

In order to illustrate the empirical usefulness of the proposed statistical model, we use the data collected for the research that was financed by the National Bank of Poland and described by Polasik et al. (2012a) and Marzec et al. (2013). The data consist of the information whether person t is a cardholder (y3t) as well as the number of his/her cash payments (y2t) and card payments (y1t) within a month. T = 2518 persons were questioned in October or November 2010, or in January 2011. The fraction of cardholders was 47.3%.

Frequency distributions of the numbers of cash payments for y3t = 1 and y3t = 0 are presented in Table 1. For non-cardholders, the average number of cash payments during a month was 22.5 (with the empirical standard deviation 19.8); for cardholders, the average number of cash payments during a month was lower: 20.5 (with the empirical standard deviation 17.3). The value W2 = 10.9 of the modified test statistic of Anderson and Darling (1954) indicates dissimilarity of these two discrete distributions. For cardholders, the average number of card payments during a month is 5 (with standard deviation 6.7); the empirical correlation coefficient between y1t a y2t (given y3t = 1) is 0.008, which indicates no linear dependence.

Table 1 Frequency distributions of the numbers of cash payments y2t for cardholders (y3t = 1) and non-cardholders (y3t = 0).

The results obtained by Polasik et al. (2012a)—within the P–CP model on the basis of the data for 1190 cardholders—showed very small positive correlation between the numbers of cash and card payments. Marzec and Osiewalski (2012) confirmed this using the ZIP–CP model, indicating at the same time that the P–CP model is not a valid reduction of the more general ZIP–CP case, as both parameters α and δ are significantly different from zero. Note that univariate empirical distributions for cardholders only (i.e. with y3t = 1) require a bivariate model with inflated zeros for both count variables; the ZIP–CP specification meets this requirement, while the P–CP model does not. Moreover, for cardholders, formal Bayesian model comparison led to the conclusion that, in the ZIP–CP, model Y1t must represent the number of card payments and Y2t—cash payments (not vice versa); see Marzec and Osiewalski (2012). The necessity of establishing which count variable is the first one comes from the asymmetric structure of the bivariate model under consideration.

Now we present the results obtained for the full dataset, which includes non-cardholders. Similarly as Marzec and Osiewalski (2012), we have modelled raw data, without weights indicating the degree of representativeness of individual observations; such weights were used by Polasik et al. (2012a) and Marzec et al. (2013). The motivation to use weighted (adjusted) data amounts to adequately represent the population from which the sample has been drawn. Information about demographic characteristics, such as gender, age, marital status, and place of residence, are used to develop the weights. In this paper we model raw data, as we do not focus on representativeness issues.

The structure of our complete trivariate model—shown in Fig. 1—consists of two separate count variables models: for T1 = 1190 pairs (Y1t, Y2t) with Y3t = 1 and for T2 = 1328 variables Y2t if Y3t = 0, as well as of the specification for the dichotomous variable Y3t that links all T = T1 + T2 = 2518 observations. The same main characteristics of the questioned individuals are used as explanatory variables in all three parts of our joint model, that is xt = wt = zt.

Fig. 1
figure 1

Source: authors’ elaboration

The structure of the full trivatiate statistical model.

In Table 2 we present the typical values of our explanatory variables, i.e. the most frequent values for zero–one variables and the arithmetic means for other variables. It seems that the main determinants of having a bank card are: income, education, marital status and the access to Internet at home. The role of the access to Internet and marital status, as well as of the place of residence, is also suggested by the information presented in Table 3 (i.e. the fraction of ones in the case of dichotomous explanatory variables). We assume that the access to Internet is a proxy variable for consumer openness to technology adaptation.

Table 2 Typical (average or most frequent) values of explanatory variables.
Table 3 Fraction (%) of ones in the case of dichotomous explanatory variables.

In our empirical research we have used the statistical model presented in (30), together with the joint prior distribution, proposed in the previous section and assuming independence among all parameters of the trivariate model. Taking advantage of posterior independence, which results from the separability of the likelihood in (31) and prior independence, we have used three independent Metropolis–Hastings chains in order to simulate from the posterior distribution in each part of our model. That is, we have separately estimated (β1β2, α, δ) in the ZIP–CP model (M1), β2,0 in the Poisson model for the number of cash payments for non-cardholders (M2) and β3 in the logit model M3. The total number of parameters is 34.

In Table 4 we present the posterior means and standard deviation of all individual parameters; the results are printed in bold if the absolute value of the posterior mean is greater than two posterior standard deviations.

Table 4 Posterior means and standard deviations of the parameters of each part of the trivariate model.

Referring to the assumed prior distribution of parameters, we see that our N(0, 1) priors appear relatively vague in this application, because the posterior standard deviations are much lower than the prior standard deviations and almost all posterior means are in the interval (− 2, 2), and most of them in [− 1, 1]. We checked that our results are robust to changes in the prior distribution. On the other hand, in studies (like ours) where the number of observations is large, prior sensitivity becomes much less important.

For cardholders we see (in M1) that all seven explanatory variables that we have used are obviously important to explain the number of cash payments. But only the access to Internet, education and income significantly (and positively) affect the number of card payments. In the pure Poisson model for non-cardholders (M2), not all seven variables are important to explain the number of cash payments—gender, income and age are not. Our results show that a cardholder’s education, being in a marriage and the access to Internet have a negative effect on cash payments. Note, however, that the impact of these three variables on the number of cash payments is positive for non-cardholders. Living in a city will lead to more frequent use of cash as the payment method for both consumer types. Additionally, there is significant positive influence of age only on cash payments in the case of cardholders. In the logit model (M3), five variables (except for gender and age) are the determinants of possessing a bank card. We confirmed that being in a marriage, living in a city, having higher income, staying in education for a longer period and having the access to Internet increase the probability of having a bank card.

Let us stress the differences between posterior distributions of the parameters describing the number of cash payments in M1 and M2. In the case of four explanatory variables (marital status, income, education and the access to Internet), the signs of the posterior means are different. As the standard deviations of most of the parameters are small, we suspect that the equality β2 = β2,0 does not hold.

In order to verify the hypothesis β2 = β2,0 we use a Lindley-type Bayesian test (similar to the highest posterior density interval test, see Lindley 1965 p. 58; Zellner 1971, pp. 298–302). Let κ = β2 − β2,0; building upon the classical F or Chi squared tests, we consider the following quadratic form (see also Osiewalski and Steel 1993; Marzec and Osiewalski 2008):

$$ \tau = \tau \left( {\varvec{\kappa};\varvec{y}} \right) = \left( {\varvec{\kappa}- E\left( {\varvec{\kappa}|\varvec{y}} \right)} \right)^{\prime } \left( {V\left( {\varvec{\kappa}|\varvec{y}} \right)} \right)^{ - 1} \left( {\varvec{\kappa}- E\left( {\varvec{\kappa}|\varvec{y}} \right)} \right), $$
(32)

where E(κ|y) = E(β2|y)–E(β2,0|y) and V(κ|y) = V(β2|y) + V(β2,0|y); the sum of covariance matrices is a result of posterior independence between β2 and β2,0 in the general model (without any restrictions). Univariate variable τ is random as a function of both the observations and parameters of our Bayesian model. Inferences on τ are based on its posterior distribution with the density function p(τ|y). In our Lindley-type approach, testing the restriction κ = 0 amounts to checking whether the value τ(0; y) belongs to the region of the highest posterior density p(τ|y) and the close to one posterior probability mass. If so, we do not reject the hypothesis κ = 0 and go to the model based on this restriction, which unables the separate analysis of two subsamples (of cardholders and non-cardholders). If τ(0; y) is outside the highest posterior density region, then the equality κ = 0 is not supported by the data—so we reject it and stay with the results obtained in the general, unrestricted model.

In Fig. 2 we present the posterior density p(τ|y). The value τ(0; y) = 973.85 lies far in the tail of the posterior distribution of τ, so the equality β2 = β2,0 is strongly rejected. Thus, for our dataset, inferences (on individual parameters) based on the separability of the likelihood in (31) are free from any sample selection error.

Fig. 2
figure 2

Source: authors’ calculations

Posterior distribution of τ(κy).

Finally we present the posterior results for the conditional and unconditional correlation coefficients between the numbers of card and cash payments (Y1t, Y2t). Remind that the unconditional correlation coefficient Corr(Y1t, Y2t|θ) is a function of all parameters of the three sub-models, so its posterior distribution can be obtained only in the joint trivariate model, irrespectively of the outcome of the test we have considered above. In Table 5 we present main results. For all data (T = 2518) we have obtained the posterior distributions of the unconditional correlation coefficient Corr(Y1t, Y2t|θ) that were concentrated close to zero—but only on the positive side. The minimum posterior mean was 0.031, the maximum was 0.16 and the average posterior mean was 0.072 (always with a relatively small posterior standard deviation). It means that the unconditional correlation between the numbers of card and cash payments is very small, but positive. In the ZIP–CP model—for the cardholders only—the average posterior mean of the conditional correlation coefficient Corr(Y1t, Y2t|Y3t = 1, θ) was 0.073. The overall average estimate of the unconditional correlation coefficient is practically the same as the average estimate of the conditional correlation coefficient given y3t = 1, although the specific average estimates of Corr(Y1t, Y2t|θ) obtained for cardholders and non-cardholders are quite different.

Table 5 Posterior means of correlation coefficients between (Y1t, Y2t), averaged over observations.

5 Concluding remarks

The trivariate discrete distribution and Bayesian statistical model have been proposed in order to jointly model two count variables in the case where one of them can be degenerate. Our statistical model amounts to using a zero–one variable to switch between two separate models for count variables. The first model is bivariate and the second one is only univariate—but from the same class as the conditional part of the bivariate model. While the proposed modelling scheme is quite general, the choice of the sub-models (the building blocks of the trivariate structure) is rather specific and can be changed. Simplicity is the main criterion in choosing the ZIP–CP model (as the bivariate specification for count variables) and the logistic model (for the zero–one switching variable); both lead to a tractable trivariate model. Replacing the logistic part by a different dichotomous specification—e.g. based on a skewed Student t distribution and allowing for interactions of explanatory variables (see Osiewalski and Marzec 2004b)—is not difficult and may improve the data fit. However, replacing the ZIP–CP specification, which is the main part of our trivariate model, would be much more difficult. Using alternative structures for two related count variables is left for future research.

As far as the prior specification is concerned, our particular form of the prior distribution can easily be changed, but two crucial properties should be kept in mind. The separability of the likelihood function can be fully exploited only under prior independence of parameters describing sub-models, so their independence is a natural element of each prior specification. Also remind that particular, standard normal prior distribution (that we have assumed for each individual parameter) is not important if the number of observations is large—like in our empirical example. Obviously, small samples would require sensitivity analysis within a larger class of prior distributions (e.g. Student t with unknown degrees of freedom).

In the proposed Bayesian model one can easily use our Lindley-type test (the Bayesian counterpart of the F or Chi squared tests) in order to verify the fundamental restriction, which makes the parameters describing the non-degenerate count variable identical for both values of the switching zero–one variable. It would be interesting to use formal Bayesian model comparison (through Bayes factors and posterior model probabilities) for testing different specifications that could appear in future research. This would require an efficient estimator of the marginal data density value in each model. It seems that, in the case of the Markov Chain Monte Carlo simulations of the posterior distribution, the corrected arithmetic mean estimator proposed by Pajor (2017) is an appropriate tool.

Our trivariate model is constructed in such a way that separability of the likelihood function is preserved. Thus it is a useful tool to examine the consequences of sample selection caused by deleting all observations with only one non-degenerate count variable (i.e. deleting the whole subsample of non-cardholders). In our empirical example we have shown that inference on individual parameters is not affected by the sample selection error, since the restriction linking parameters of two sub-models is not supported by the data. We have also shown that any deeper inference on correlation between two count variables—that is, inference on the unconditional correlation coefficient as opposed to the conditional one—is possible only within our full trivariate specification.

Let us stress that the proposed trivariate model always enables making inference for all available data, without applying any preliminary tests. Instead, our model itself constitutes a useful testing framework, in particular for testing particular conditions that lead to sample selection errors. This is the main contribution of the paper.