Introduction

Count events occur frequently across disciplines. In demography, count data such as the number of children ever born, the number of deaths and the number of migrations have traditionally been modelled by Poisson regression [1]. An important assumption underlying the Poisson distribution is the equality of mean and variance, which often does not hold in practice. If this assumption is violated, the estimation method will produce biased estimates, inefficient standard errors, and misleading confidence intervals and p-values [2]. Because of this limitation, researchers have recommended the negative binomial distribution, which has an additional parameter that accounts for the over-dispersion commonly seen in count outcomes, thereby relaxing the constraint of equality of mean and variance [3].

Researchers have also argued that count outcomes are often characterized by a large number of zeros [4,5,6,7], which makes both the Poisson and negative binomial models inappropriate. Although the Poisson and negative binomial distributions allow for zero counts, data may contain more zero responses than either distribution predicts, a violation often referred to as the excess zero problem. Several studies have modelled fertility based on the distribution of the fertility pattern in different countries [3, 8,9,10,11,12,13,14] with a view to identifying factors influencing fertility. In Nigeria, the determinants of fertility have been examined using Poisson regression to account for the count nature of the variable [9, 11] and negative binomial regression to account for over-dispersion or heterogeneity [3, 8]. Aside from the limitations of the Poisson and negative binomial models for fertility data in Nigeria, the analysis is often conducted at the national level, thus neglecting the consequences of cultural diversity at the regional level.

Nigeria has six regions defined by sociocultural differences that have implications for fertility. Fertility varies strikingly across these regions, with the total fertility rate (TFR) ranging from 4.3 in South South to 6.7 in North West [15]. Nigeria is the most populous country in Africa, with a population of about 200 million, and the population of each of its six regions exceeds that of countries such as Togo, Republic of Benin, Liberia and Malawi [16]. Thus, modelling fertility data at the national level with a single model is likely to be fraught with hidden errors, because the number of zeros and the degree of skewness differ across regional data structures. Different models may therefore be suitable for fertility in different regions. The current study extends [7] by modelling fertility data in each region of Nigeria with six different distributions and evaluating the suitability of each model in each region.

Main Text

Methods

Data collection and utilization

The 2013 Nigeria Demographic and Health Survey (NDHS) dataset was used to fit the models. Data collection involved a multi-stage cluster sampling technique. Prior to the survey, Nigeria was demarcated into smaller units, known as enumeration areas (EAs) or clusters. This demarcation took state boundaries into consideration to prevent merging of clusters within states. The respondents were selected from each cluster based on rural–urban allocation of specific numbers of clusters in the country. The current study used the individual recode data, which contain information provided by women of childbearing age (15–49 years). Further information about the sampling strategy used for data collection can be found on the data originator’s website [15].

Data management

The outcome variable of interest was fertility, measured by the number of children ever born (CEB) and obtained from a total sample of 38,948 women. The data were weighted and adjusted for the clustering effect in the various count models, but were unweighted for the skewness test and descriptive summaries of children (Additional file 1). To examine the correlation between CEB and the background characteristics of women, a pairwise correlation test with Bonferroni correction [17] was conducted for each region. Twelve variables were used for the model fit: residence, women's educational level, religion, ethnicity, wealth index, contraceptive use, currently residing with partner, number of other wives, age at first sex, husband's educational level, women's working status and husband's/partner's age. All these independent variables were retained for North Central and North West. For South East, South South and South West, residing with partner, number of other wives and partner's education were removed; an additional variable, women's working status, was excluded for North East due to collinearity. All analyses were performed using Stata 15.0 at the 0.05 level of significance.
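The Bonferroni step in the screening above can be illustrated with a minimal Python sketch. It assumes one test per candidate covariate; the exact number of comparisons used in the study is not stated. The function name `bonferroni_adjust` and the p-values are hypothetical and are not NDHS results.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical unadjusted p-values for three candidate covariates (not NDHS results).
raw_p = {"education": 0.001, "residence": 0.020, "religion": 0.300}

adjusted = dict(zip(raw_p, bonferroni_adjust(list(raw_p.values()))))

# Covariates that remain significant at the 0.05 level after adjustment.
significant = [name for name, p in adjusted.items() if p < 0.05]
print(adjusted)
print(significant)
```

With three tests, a raw p-value of 0.020 no longer passes the 0.05 threshold after adjustment, which shows how the correction guards against spurious associations when many covariates are screened.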

Generalized linear models

Poisson model

The most common technique for modelling count data is Poisson regression. Its defining feature is the equality of mean and variance. Its probability mass function is given as:

$${\text{Pr}}\left( {{\text{Y}} = {{\text{y}}_{\text{i}}}{\text{|}}\mu } \right)= \frac{{{{\text{e}}^{{{ - }}\mu }}{\mu ^{{{\text{y}}_{\text{i}}}}}}}{{{{\text{y}}_{\text{i}}}{\text{!}}}};~{{\text{y}}_{\text{i}}}{\text{ = 0}},{\text{1}},{\text{2}}, \ldots$$
(1)

where \({\text{y}}_{\text{i}}\) denotes the count response, that is, the number of children ever born, and \(\mu\) is its mean [18, 19].
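As an illustration of Eq. (1) and of the equidispersion assumption, the following Python sketch evaluates the Poisson probability mass function and computes a simple variance-to-mean ratio. The function names and the toy counts are assumptions made for this example; they are not taken from the NDHS data.

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    """Poisson pmf of Eq. (1), evaluated on the log scale for numerical stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 under equidispersion."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

# Hypothetical CEB counts for illustration only (not NDHS data).
toy_ceb = [0, 0, 0, 1, 2, 2, 3, 5, 8, 9]
ratio = dispersion_ratio(toy_ceb)
print(f"variance-to-mean ratio: {ratio:.2f}")
```

A ratio well above 1, as in this toy sample, would point to over-dispersion and motivate the negative binomial alternative.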

Negative binomial model

The negative binomial (NB) distribution is a two-parameter distribution combining the Poisson distribution and the Gamma distribution (Gamma–Poisson mixture). It relaxes the assumption of equality of mean and variance, thus accounting for unobserved heterogeneity in count data [19,20,21,22]. Its probability mass function is given as:

$$Pr\left( {{\text{y}}_{\text{i}} {\text{|}} {{\mu }},\alpha } \right) = \frac{{\varGamma \left( {\alpha^{ - 1} + {\text{y}}_{\text{i}} } \right)}}{{\varGamma \left( {\alpha^{ - 1} } \right)\varGamma \left( {{\text{y}}_{\text{i}} + 1} \right) }} \left( {\frac{{\alpha^{ - 1} }}{{\alpha^{ - 1} + {{\mu }}}}} \right)^{{\alpha^{ - 1} }} \times \left( {\frac{{{\mu }}}{{{{\mu }} + \alpha^{ - 1} }}} \right)^{{{\text{y}}_{\text{i}}}} .$$
(2)

The mean and variance of the negative binomial distribution are E[y|µ, α] = µ and V[y|µ, α] = µ(1 + αµ), where α > 0 is the dispersion parameter and µ > 0. Special cases of the negative binomial include the Poisson (as α → 0) and the geometric (α = 1) [19].
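As a numerical check on Eq. (2), the following Python sketch evaluates the negative binomial probability mass function on the log scale and recovers the stated mean and variance by direct summation. The function name `nb_pmf` and the values of µ and α are illustrative assumptions, not estimates from the study.

```python
import math

def nb_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial pmf of Eq. (2), with mean mu > 0 and dispersion alpha > 0."""
    inv = 1.0 / alpha  # alpha^{-1}
    log_prob = (
        math.lgamma(inv + y) - math.lgamma(inv) - math.lgamma(y + 1)
        + inv * math.log(inv / (inv + mu))
        + y * math.log(mu / (mu + inv))
    )
    return math.exp(log_prob)

# Illustrative parameter values (assumed, not estimated from the NDHS data).
mu, alpha = 3.0, 0.5

# The tail beyond 500 is negligible for these parameter values.
probs = [nb_pmf(y, mu, alpha) for y in range(500)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

# Compare with mu and mu * (1 + alpha * mu).
print(round(mean, 6), round(var, 6))
```

Setting `alpha = 1.0` gives the geometric special case, for which the probability of a zero count is 1/(1 + µ).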

Zero-inflated models

In the zero-inflated Poisson (ZIP) model, the first process is a Poisson distribution that generates counts, some of which may be zero (sampling zeros), and the second process is governed by a binary distribution (logit or probit) that generates structural zeros [23]. For a count variable \(y_{i}\), the ZIP probability mass function has two components:

$$\Pr \left( y_{i} | \mu_{i} \right) = \left\{ \begin{array}{ll} p_{i} + \left( 1 - p_{i} \right)\exp \left( - \mu_{i} \right), & y_{i} = 0 \\ \left( 1 - p_{i} \right)\frac{\exp \left( - \mu_{i} \right)\mu_{i}^{y_{i}}}{y_{i}!}, & y_{i} \ge 1 \end{array} \right.,\quad 0 \le p_{i} \le 1$$
(3)

The outcome variable \(y_{i}\) is a non-negative integer, \(\mu_{i}\) is the expected Poisson count for the ith individual and \(p_{i}\) is the probability of an extra (structural) zero.

Like the ZIP, the zero-inflated negative binomial (ZINB) model accounts for excess zeros, and it additionally accounts for over-dispersion. For a dependent variable \(y_{i}\) with many zeros, the ZINB probability mass function is given as:

$$\Pr \left( y_{i} | \mu_{i}, \alpha \right) = \left\{ \begin{array}{ll} p_{i} + \left( 1 - p_{i} \right)\left( 1 + \alpha \mu_{i} \right)^{-\alpha^{-1}}, & y_{i} = 0 \\ \left( 1 - p_{i} \right)\frac{\Gamma \left( y_{i} + \frac{1}{\alpha} \right)\left( \alpha \mu_{i} \right)^{y_{i}}}{y_{i}!\,\Gamma \left( \frac{1}{\alpha} \right)\left( 1 + \alpha \mu_{i} \right)^{y_{i} + \frac{1}{\alpha}}}, & y_{i} > 0 \end{array} \right.,\quad 0 < p_{i} < 1$$
(4)

where α > 0 is the over-dispersion parameter; the ZINB reduces to the ZIP as α → 0 [22].
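The zero-inflated mass functions in Eqs. (3) and (4) can be sketched in Python as mixtures of a point mass at zero and a base count distribution. The function names and parameter values below are assumptions for illustration; in a fitted model, \(p_{i}\) and \(\mu_{i}\) would depend on covariates.

```python
import math

def poisson_pmf(y, mu):
    """Poisson pmf, as in Eq. (1)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def nb_pmf(y, mu, alpha):
    """Negative binomial pmf, as in Eq. (2), written in terms of (1 + alpha * mu)."""
    inv = 1.0 / alpha
    return math.exp(
        math.lgamma(inv + y) - math.lgamma(inv) - math.lgamma(y + 1)
        - inv * math.log(1.0 + alpha * mu)
        + y * (math.log(alpha * mu) - math.log(1.0 + alpha * mu))
    )

def zip_pmf(y, mu, p):
    """Zero-inflated Poisson: structural zero with probability p, otherwise Poisson."""
    base = (1.0 - p) * poisson_pmf(y, mu)
    return p + base if y == 0 else base

def zinb_pmf(y, mu, alpha, p):
    """Zero-inflated negative binomial: structural zero with probability p, otherwise NB."""
    base = (1.0 - p) * nb_pmf(y, mu, alpha)
    return p + base if y == 0 else base

# Illustrative parameter values (assumed, not estimated from the NDHS data).
mu, alpha, p = 3.0, 0.5, 0.2

zip_zero = zip_pmf(0, mu, p)
zinb_zero = zinb_pmf(0, mu, alpha, p)
zip_total = sum(zip_pmf(y, mu, p) for y in range(500))
zinb_total = sum(zinb_pmf(y, mu, alpha, p) for y in range(500))
print(zip_zero, zinb_zero, zip_total, zinb_total)
```

In both models the probability of a zero is the structural-zero probability plus the share of sampling zeros from the base distribution, and the probabilities over all counts sum to one.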

Hurdle models

In the hurdle Poisson (HP) model, the first part is the hurdle at zero, which accommodates fewer or more zeros than the Poisson distribution implies, and the second part governs the zero-truncated positive outcomes [2, 19, 23]. For a variable \(y_{i}\), the HP probability distribution is given as:

$$\Pr \left( {y_{i} = 0} \right) = 1 - p, \quad 0 \le p \le 1$$
$$\Pr \left( {Y = y_{i} } \right) = \frac{p}{1 - \exp \left( - \mu_{i} \right)}\frac{\exp \left( - \mu_{i} \right)\mu_{i}^{y_{i}}}{y_{i}!}, \quad \mu_{i} > 0;\quad y_{i} = 1,2, \ldots$$
(5)

where \(\mu_{i}\) is the mean of the Poisson model. When \(\left( {1 - p} \right) > { \exp }\left( { - \mu_{i} } \right)\), the data contain more zeros than the Poisson model implies.

The hurdle negative binomial (HNB) is used when the hurdle model is appropriate and the data exhibit over-dispersion [19, 24]. The HNB model is given as:

$$\Pr \left( {y = 0} \right) = 1 - p, \quad 0 \le p \le 1$$
$${ \Pr }\left( {\text{Y = y}} \right) = \frac{\text{p}}{{ 1- \left( {\frac{\text{r}}{{\mu {\text{ + r}}}}} \right)^{\text{r}} }}\frac{{\varGamma ( {\text{y + r)}}}}{{\varGamma \left( {\text{r}} \right){\text{y!}}}}\left[ {\frac{\mu }{{\mu {\text{ + r}}}}} \right]^{\text{y}} \left[ {\frac{\text{r}}{{\mu {\text{ + r}}}}} \right]^{\text{r}} ,\quad {\text{ r,}}\;\mu \;{ > }\; 0 ;\;{\text{y = 1,2}} \ldots$$
(6)

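The mean and variance of the underlying negative binomial distribution are µ and µ(1 + µ/r), respectively, where r corresponds to \(\alpha^{-1}\) in Eq. (2); the variance exceeds the mean for any finite r, and smaller values of r indicate greater over-dispersion [22].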
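The two-part structure of the hurdle models can be sketched in Python as follows. The sketch assumes that positive counts follow the base distribution rescaled by its zero-truncation factor, as in Eq. (6); the function names and parameter values are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """Poisson pmf, as in Eq. (1)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def nb_pmf(y, mu, r):
    """Negative binomial pmf in the (mu, r) parameterisation used in Eq. (6)."""
    return math.exp(
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + y * math.log(mu / (mu + r))
        + r * math.log(r / (mu + r))
    )

def hurdle_pmf(y, p, base_pmf):
    """Hurdle pmf: zero with probability 1 - p, otherwise a zero-truncated base count."""
    if y == 0:
        return 1.0 - p
    return p * base_pmf(y) / (1.0 - base_pmf(0))

# Illustrative parameter values (assumed, not estimated from the NDHS data).
mu, r, p = 3.0, 2.0, 0.7

def hp(y):
    """Hurdle Poisson pmf for the illustrative parameters."""
    return hurdle_pmf(y, p, lambda k: poisson_pmf(k, mu))

def hnb(y):
    """Hurdle negative binomial pmf for the illustrative parameters."""
    return hurdle_pmf(y, p, lambda k: nb_pmf(k, mu, r))

hp_total = sum(hp(y) for y in range(500))
hnb_total = sum(hnb(y) for y in range(500))
print(hp(0), hnb(0), hp_total, hnb_total)
```

Unlike the zero-inflated models, all zeros here come from the hurdle part, so the zero probability is exactly 1 - p regardless of the count distribution.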

Model assessment and evaluation

Model selection was based on the maximum likelihood estimates of the model parameters, using the log-likelihood and two information criteria (IC): the Akaike (AIC) and the Bayesian (BIC). A lower IC value implies a better-fitting model [25, 26]. A difference in IC greater than 10 implies that the model with the smaller IC is superior, a difference of 4 to 10 suggests moderate superiority of one model over the other, and a difference of less than 4 implies that the competing models are indistinguishable [26].
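The selection rule above can be sketched in Python using the standard definitions AIC = 2k − 2 log L and BIC = k log n − 2 log L, where k is the number of parameters and n the number of observations. The log-likelihoods and parameter counts below are hypothetical and are not taken from Table 2.

```python
import math

def aic(loglik: float, k: int) -> float:
    """Akaike information criterion for a model with k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

def compare_ic(ic_a: float, ic_b: float) -> str:
    """Classify the IC difference using the thresholds described in the text."""
    diff = abs(ic_a - ic_b)
    if diff < 4:
        return "indistinguishable"
    if diff <= 10:
        return "moderate"
    return "superior"

# Hypothetical fits for two competing models (not values from Table 2).
aic_a = aic(loglik=-8050.0, k=14)
aic_b = aic(loglik=-8040.0, k=16)
verdict = compare_ic(aic_a, aic_b)
print(aic_a, aic_b, verdict)
```

Because the BIC penalty grows with log n, the BIC can favour a more parsimonious model than the AIC in large samples.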

Results

Socio-economic and demographic characteristics of respondents

In Nigeria, 29.5% of women aged 15 to 49 years had no child; this percentage was highest in South South (42.4%) and lowest in North West (21.3%) (Fig. 1). The mean number of children ever born was highest in North West (3.89 ± 3.36) and lowest in South South (2.32 ± 2.58). As presented in Table 1, age at first sex was lower in the Northern regions, except North Central (18.06 ± 3.78), than in the Southern regions: South East (18.96 ± 4.35), South West (18.69 ± 3.6) and South South (17.27 ± 3.22). More women with no education were recorded in the Northern regions, and women in the Southern regions were in higher wealth quintiles than those in the Northern regions. About 16% of women in Nigeria used any method of contraception, and this varied across regions.

Table 1 Descriptive statistics of background characteristics by region
Fig. 1 Percentage distribution of zero and non-zero counts of children ever born by region (NDHS 2013)
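To illustrate why these zero percentages point to excess zeros, the following Python sketch compares the observed share of childless women with the share a Poisson distribution with the same mean would predict, exp(−mean). It uses the regional means and zero percentages quoted in the text and ignores covariates, so it is a rough check rather than part of the reported analysis; the function name is an assumption.

```python
import math

# Regional summaries quoted in the text: (mean CEB, observed % of women with no child).
regions = {"North West": (3.89, 21.3), "South South": (2.32, 42.4)}

def poisson_zero_percent(mean_count: float) -> float:
    """Percentage of zeros expected under a Poisson distribution with the given mean."""
    return 100.0 * math.exp(-mean_count)

for name, (mean_ceb, observed_zero) in regions.items():
    expected_zero = poisson_zero_percent(mean_ceb)
    print(f"{name}: observed {observed_zero:.1f}% zeros, "
          f"Poisson-expected {expected_zero:.1f}%")
```

In both regions the observed share of zeros is several times larger than the Poisson-expected share (roughly 2% for North West and 10% for South South).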

Model selection criteria for the fitted model

Model assessments for each region are presented in Table 2, based on the AIC and BIC values. The hurdle negative binomial model provided the best fit for North West (AIC = 45,421.19, BIC = 45,775.64) and South East (AIC = 13,767.37, BIC = 14,026.82), while the zero-inflated negative binomial provided a better fit for North East (AIC = 24,565.28, BIC = 24,828.33). In South South, the zero-inflated negative binomial had a moderate superiority over the hurdle negative binomial (AIC = 16,138.5, BIC = 16,411.23). For the South West region, both AIC and BIC suggest that the ZINB and ZIP are indistinguishable as best fit (\(ZINB \le ZIP < HNB \le HP < NB < Poisson\)), and no superiority exists between the zero-inflated models and their hurdle analogs. In all cases, the zero-modified models were better than the GLMs, except for North Central, where the BIC suggests that the NB is the best fit (\(NB < HNB < ZINB < HP < ZIP < Poisson\)), contrary to the AIC and the log-likelihood (\(HNB < ZINB < HP < ZIP < NB < Poisson\)). Similarly, the models that include an over-dispersion parameter were better than their counterparts that do not.

Table 2 Model assessment for alternative models

Discussion

This study examined the effectiveness of zero-augmented models compared to the standard Poisson and negative binomial models widely used for modelling fertility in Nigeria [3, 9, 11]. The current analysis was conducted separately in each of the six regions in Nigeria.

The results, using the AIC and BIC as model selection criteria, revealed that both the hurdle negative binomial and the zero-inflated negative binomial provide a better fit for fertility data with a large number of zeros and over-dispersion. Overall, the zero-augmented negative binomial models (HNB and ZINB) fitted better than their Poisson-based counterparts or, in rare cases, were indistinguishable from them. Consequently, accounting for both excess zeros and over-dispersion is recommended for fertility modelling, not only at the national level but also at regional levels. These findings are similar to those of other studies with similar data-generating mechanisms containing a large number of zeros [24, 27, 28]. Previous studies have noted that zero-inflated models are statistically appropriate in low-fertility population studies, especially when there are large numbers of women with no children [13, 29].

The best model for each region was used to identify the determinants of fertility peculiar to that region. For North Central, women with at least secondary education, partners with secondary education and women not working are factors driving low fertility. Secondary education, Igbo ethnicity and higher age at first sex are factors determining low fertility in the North East. Residing in rural areas, secondary education, tertiary education, poorer women compared to poor women, no other wives, higher age at first sex and women not working are factors determining low fertility in the North West. Urban residence, women not working and increasing women's educational level are factors responsible for low fertility in the South East. Increasing level of women's education, wealth index, higher age at first sex and women not working are drivers of low fertility in South South. Secondary and higher education, urban residence and women not working are factors contributing to low fertility in the South West (Additional file 2).

In conclusion, the assessment in this paper provides evidence that fertility count data, which are usually right-skewed with excess zeros, should be modelled using zero-augmented models with the negative binomial variant.

Limitation

Children ever born (CEB) was captured in the NDHS from the reported full birth history of women of reproductive age. There is a likelihood of gross under-reporting of CEB due to cultural beliefs and norms surrounding the reporting of the actual number of births.