Count-Based Regression Models

Advanced Statistics in Criminology and Criminal Justice

Abstract

Count-based data are common in criminological research, including outcomes such as crime counts for geographic areas or the number of rearrests over a given time period for a group of individuals. Counts are by nature discrete, non-negative whole numbers. When we want to model counts as a dependent variable in a regression model, ordinary least squares (OLS) regression is generally a poor choice. Count-based regression approaches such as Poisson, quasi-Poisson, and negative binomial models appropriately handle these characteristics of counts as a dependent variable and do so by using a log-link, thus modeling the log of the expected count. The difference between a Poisson model and both quasi-Poisson and negative binomial models is that the latter two adjust for over-dispersion in the count data. Count data in which the variance is greater than the mean are said to be over-dispersed. The quasi-Poisson model produces regression coefficients that are identical to those of a Poisson model but with standard errors that are adjusted for any observed over-dispersion. Negative binomial regression models adjust for over-dispersion differently, and the regression coefficients may differ from those of a Poisson model, but usually only slightly. Another complication with count data is the possibility that the distribution has an excess of zeros relative to what would be expected from a Poisson process. These are called zero-inflated distributions and can be modeled with zero-inflated versions of either the Poisson or negative binomial models. Historically, particularly in ecology where count data are also very common, using OLS regression on log-transformed counts was a common approach to handling this type of data (O’Hara and Kotze. Nature Proceedings 1(2):118-122, 2010). However, with the ready availability of statistical software able to perform the count-based regression models discussed in this chapter, there is little reason not to use these modeling methods. They better reflect the nature of the data and are less likely to violate key assumptions of the regression model.

Notes

  1. The data file we will use first represents a subset of the data from the National Youth Survey, Wave 1. The sample of 1,725 youth is representative of persons aged 11–17 years in the USA in 1976, when the first wave of data was collected. While these data may seem old, researchers continue to publish reports based on new findings and interpretations of these data. One of the apparent strengths of this study was its design: the youth were interviewed annually for 5 years from 1976 to 1980 and were then interviewed again in 1983 and 1987. The data file on our Website was constructed from the full data source available at the Inter-university Consortium for Political and Social Research, which is a national data archive. Data from studies funded by the National Institute of Justice (NIJ) are freely available to anyone with an Internet connection; go to http://www.icpsr.umich.edu/NACJD. For example, all seven waves of data from the National Youth Survey are available.

References

  • Agresti, A. (2003). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.

  • Berk, R., & MacDonald, J. M. (2008). Overdispersion and Poisson regression. Journal of Quantitative Criminology, 24(3), 269–284.

  • Gardner, W., Mulvey, E. P., & Shaw, E. C. (1995). Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin, 118(3), 392.

  • Greene, W. H. (2018). Econometric analysis (8th ed., p. 905). Chennai: Pearson Education India.

  • O’Hara, R., & Kotze, J. (2010). Do not log-transform count data. Nature Proceedings, 1(2), 118–122.

  • Osgood, D. W. (2000). Poisson-based regression analysis of aggregate crime rates. Journal of Quantitative Criminology, 16(1), 21–43.

  • Ver Hoef, J. M., & Boveng, P. L. (2007). Quasi-Poisson vs. negative binomial regression: How should we model overdispersed count data? Ecology, 88(11), 2766–2772.

Author information

Corresponding author

Correspondence to David Weisburd.

Appendices

Symbols and Formulas

IRR: Incident rate ratio

θ: Over-dispersion parameter

A single independent variable Poisson model in population form:

$$ \ln \left(\mathrm{Y}\right)={\beta}_0+{\beta}_1{x}_1 $$

A single independent variable Poisson model in sample form:

$$ \ln (y)={b}_0+{b}_1{x}_1 $$

A single independent variable Poisson model in exponentiated form:

$$ y={e}^{b_0+{b}_1{x}_1} $$

The incident rate ratio (IRR) of a regression coefficient (b):

$$ \mathrm{IRR}={e}^b $$
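
To see why exponentiating a coefficient yields a ratio, it can help to compare the predicted count at x1 + 1 with the predicted count at x1. The short derivation below is a sketch that uses only the exponentiated form of the single independent variable model given above; the notation y(x1) for "the predicted count at x1" is introduced here for illustration.

$$ \frac{y\left({x}_1+1\right)}{y\left({x}_1\right)}=\frac{e^{b_0+{b}_1\left({x}_1+1\right)}}{e^{b_0+{b}_1{x}_1}}={e}^{b_1}=\mathrm{IRR} $$

The intercept and the current value of x1 cancel, so the ratio is the same at every starting value of x1.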

Percent change in the dependent variable for a 1-unit change in an independent variable:

$$ \%\mathrm{change}=\left\{\begin{array}{ll}\left(1-{e}^b\right)\times 100=\left(1-\mathrm{IRR}\right)\times 100& \mathrm{if}\ b<0\ \mathrm{or}\ \mathrm{IRR}<1\\ {}\left({e}^b-1\right)\times 100=\left(\mathrm{IRR}-1\right)\times 100& \mathrm{if}\ b>0\ \mathrm{or}\ \mathrm{IRR}>1\ \end{array}\right. $$
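
The IRR and percent change formulas above translate directly into code. The following is a minimal Python sketch; the function names irr and percent_change and the example coefficients are illustrative assumptions, not part of the chapter.

```python
import math


def irr(b: float) -> float:
    """Incident rate ratio for a log-link coefficient b: IRR = e^b."""
    return math.exp(b)


def percent_change(b: float) -> float:
    """Size of the percent change in the count per 1-unit change in x.

    Follows the piecewise formula: (1 - IRR) * 100 when b < 0 (a decrease)
    and (IRR - 1) * 100 when b > 0 (an increase). The result is therefore
    never negative; the sign of b gives the direction of the change.
    """
    ratio = irr(b)
    if b < 0:
        return (1.0 - ratio) * 100.0
    return (ratio - 1.0) * 100.0


if __name__ == "__main__":
    for b in (-0.5, 0.5):  # hypothetical coefficients, not from the chapter
        direction = "decrease" if b < 0 else "increase"
        print(f"b = {b:+.2f}: IRR = {irr(b):.3f}, "
              f"{percent_change(b):.1f}% {direction}")
```

Note that coefficients of equal size and opposite sign do not give equal percent changes: with these hypothetical values, b = −0.5 corresponds to roughly a 39.3% decrease, while b = 0.5 corresponds to roughly a 64.9% increase.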

Count-based regression model with an offset where x2 is the natural log of exposure:

$$ \ln (y)={b}_0+{b}_1{x}_1+\mathrm{offset}\left({x}_2\right) $$
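
A brief derivation, assuming only the stated definition that x2 is the natural log of exposure, shows why an offset turns a model of counts into a model of rates. The offset enters the equation with its coefficient fixed at 1, so it can be moved to the left-hand side:

$$ \ln (y)={b}_0+{b}_1{x}_1+\ln \left(\mathrm{exposure}\right)\quad \Rightarrow \quad \ln \left(\frac{y}{\mathrm{exposure}}\right)={b}_0+{b}_1{x}_1 $$

Under this reading, the exponentiated coefficient describes the ratio of counts per unit of exposure rather than the ratio of raw counts.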

Over-dispersion parameter for a quasi-Poisson model:

$$ \theta =\frac{1}{n-k-1}\sum \frac{{\left({y}_i-{\hat{y}}_i\right)}^2}{{\hat{y}}_i} $$

Standard errors for a quasi-Poisson model adjusted for over-dispersion:

$$ s{e}_{quasi- Poisson}=s{e}_{Poisson}\sqrt{\theta } $$
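
The two quasi-Poisson formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the chapter's own procedure: the function names dispersion and quasi_poisson_se are hypothetical, and the observed counts and fitted values are made up for the example.

```python
import math
from typing import Sequence


def dispersion(y: Sequence[float], y_hat: Sequence[float], k: int) -> float:
    """Over-dispersion parameter from observed and fitted counts.

    theta = (1 / (n - k - 1)) * sum((y_i - y_hat_i)^2 / y_hat_i),
    where n is the number of cases and k the number of independent variables.
    """
    n = len(y)
    if n != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    if n - k - 1 <= 0:
        raise ValueError("need more cases than estimated coefficients")
    pearson = sum((obs - fit) ** 2 / fit for obs, fit in zip(y, y_hat))
    return pearson / (n - k - 1)


def quasi_poisson_se(se_poisson: float, theta: float) -> float:
    """Scale a Poisson standard error by the square root of theta."""
    return se_poisson * math.sqrt(theta)


if __name__ == "__main__":
    # Hypothetical observed counts and fitted means from a one-predictor model.
    y = [0, 2, 1, 5, 3, 9]
    y_hat = [1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
    theta = dispersion(y, y_hat, k=1)
    print(f"theta = {theta:.4f}")
    print(f"adjusted se = {quasi_poisson_se(0.10, theta):.4f}")
```

Because the adjustment multiplies every standard error by the same factor, the coefficients are unchanged and every z-value shrinks by the square root of θ whenever θ is greater than 1.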

Exercises

  6.1. Assume we have count data for two groups, shown below.

    Group 1: 6, 7, 7, 10, 6, 6, 8, 2, 6, 5

    Group 2: 5, 12, 7, 6, 5, 9, 6, 10, 5, 7

    Calculate the following:

    (a) The sum of the counts for each group.

    (b) The natural log of these sums.

    (c) The difference between the natural log of these sums.

    (d) The natural log of the ratio of these sums.

    (e) The exponent of the answer to (c) above.

    (f) The ratio of the mean for group 1 relative to the mean for group 2.

    (g) How do (e) and (f) compare?

    (h) The answer to (e) above is an IRR. Convert this to a percent change. How would you interpret this value?

  6.2. Below are the results from a Poisson regression model.

    Independent variable       b      se       z
    Intercept              −1.25    0.15   −8.33
    x1                     −0.13    0.06   −2.17
    x2                      0.46    0.08    5.75
    x3                      0.05    0.04    1.25

    (a) Assume that the quasi-Poisson over-dispersion parameter, θ, is 2.00. Adjust the standard errors using this value.

    (b) Compute over-dispersion adjusted z-values.

    (c) Assuming that the z critical value for a .05 significance level is 1.96, how do the results change after adjusting for over-dispersion?

    (d) Compute the percent change associated with a one-unit change in each of the independent variables.

Computer Exercises

The data file used to illustrate the application of the count-based regression models in this chapter can be found in either SPSS (nys_l.sav or nys_l_student.sav) or Stata (nys_l.dta) format. Alternatively, one of the files can be imported into R using the read.spss() or read.dta() functions from the foreign package. The commands illustrated below assume that you have opened one of these files in SPSS, Stata, or R. They can also be found in the sample syntax files in both SPSS (Chapter_5.sps) and Stata (Chapter_5.do) format.

SPSS

Poisson Regression

The GENLIN procedure is used to estimate generalized linear models in SPSS (version 15 and later), and it has the option of specifying a Poisson distribution. The structure of the syntax is as follows:

GENLIN Dep_var BY IV_Factors (ORDER=ASCENDING) WITH IV_scale
  /MODEL IV_Factors IV_scale
  /INTERCEPT=YES
  /OFFSET=offset_var
  /DISTRIBUTION=POISSON LINK=LOG
  /CRITERIA CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
  /PRINT FIT SUMMARY SOLUTION.

Note that the dependent variable is specified after the GENLIN command. The BY argument is used to specify categorical independent variables, while scale independent variables come after the WITH argument. An offset variable may be added with the OFFSET= argument.

Quasi-Poisson Regression

SPSS does not have the option to calculate quasi-Poisson regression.

Negative Binomial Regression

The GENLIN procedure can also be used to estimate a generalized linear model where the dependent variable has a negative binomial distribution. The structure of the syntax is as follows:

GENLIN Dep_var BY IV_Factors (ORDER=ASCENDING) WITH IV_scale
  /MODEL IV_Factors IV_scale
  /INTERCEPT=YES
  /OFFSET=offset_var
  /DISTRIBUTION=NEGBIN(1) LINK=LOG
  /CRITERIA CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
  /PRINT FIT SUMMARY SOLUTION.

The code is the same as a Poisson regression generalized linear model with the exception of the DISTRIBUTION= argument. Negative binomial regression is specified by NEGBIN(1) instead of POISSON.

Zero-Inflated Poisson/Negative Binomial Regression

SPSS does not offer regression models that accommodate a dependent variable with a zero-inflated distribution.

Stata

Poisson Regression

The glm command for generalized linear models can be used for Poisson regression models in Stata. It is specified with the family() argument, and the family type for a Poisson model is poisson (make sure to use a lowercase p). You must also specify the argument link(log) and have the option to add an offset variable with the offset() argument. The basic structure of a Poisson regression model using the glm command is as follows:

glm dep_var indep_vars, family(poisson) link(log) offset(offset_var_name)

Recall that you need to specify i. in front of any categorical variables (e.g., i.cat_indep_var1). Incident rate ratios (the exponentiated coefficients) can be obtained instead of the log-scale coefficients by adding the argument eform to the right of the comma, as follows:

glm dep_var indep_vars, family(poisson) link(log) eform

The predict command is for post-estimation and can be used to obtain predicted estimates. Simply execute your glm model, and then, use the predict command, along with the name of the new variable to be used to store the predicted estimates.

predict new_var_name

Quasi-Poisson Regression

Stata does not have the option to calculate quasi-Poisson regression.

Negative Binomial Regression

Negative binomial regression models are also conducted in Stata using the glm command, whereby the family(nbinomial) argument is specified. You must also specify the argument link(log), and you have the option to add an offset variable with the offset() argument.

glm dep_var indep_vars, family(nbinomial) link(log) offset(offset_var_name)

As with Poisson regression when using glm, incident rate ratios can be obtained instead of the log-scale coefficients by adding the argument eform to the right of the comma, as follows:

glm dep_var indep_vars, family(nbinomial) link(log) eform

The predict command is for post-estimation and can be used to obtain predicted estimates. Simply execute your glm model, and then, use the predict command, along with the name of the new variable to be used to store the predicted estimates.

predict new_var_name

Zero-Inflated Poisson/Negative Binomial Regression

Zero-inflated Poisson and negative binomial regression models can be estimated with maximum likelihood in Stata by, respectively, using the zip or zinb commands. You must also specify the inflate() argument, which is where you add the variable(s) that predict excess 0s. If the offset() argument is used, it is specified within the inflate() command. The structure of both commands is as follows:

zip dep_var indep_vars, inflate(var_pred_0s, offset(offset_var))
zinb dep_var indep_vars, inflate(var_pred_0s, offset(offset_var))

With both the zip and zinb commands, if you want incident rate ratios reported, add the argument irr to the right of the comma:

zinb dep_var indep_vars, irr inflate(var_pred_0s)

Additionally, if you would like to compare whether the zero-inflated Poisson is a better fit in comparison to the zero-inflated negative binomial, specify the argument zip to the right of the comma:

zinb dep_var indep_vars, zip inflate(var_pred_0s)

As with glm, the predict command is for post-estimation and can be used to obtain predicted estimates. Simply execute your zip or zinb model, and then use the predict command, along with the name of the new variable to be used to store the predicted estimates.

predict new_var_name

R

Poisson Regression

The glm() function for generalized linear models, which is in the stats package, can be used to fit many types of count regression models. It is specified with the family= argument, and the family type for a Poisson model is poisson (make sure to use a lowercase p). You can view the model output using the summary() function. The basic structure of a Poisson regression model using the glm() function is as follows:

model <- glm(dep_var ~ indep_var1 + indep_var2, data=dataset_name, family="poisson")
summary(model)

You may add ~1 instead of ~indep_var1+indep_var2 if you wish to run an intercept-only model. And an exposure variable can be incorporated into the model using the offset() function. However, recall that you also need to log it, which can be done with the log() function as follows:

glm(dep_var ~ indep_var1 + offset(log(offset_var)), data=dataset_name, family="poisson")

The predict() function can be used along with your glm-class object to obtain predicted estimates for each case, and you can assign the result to a new variable using the <- assignment operator. By default, predict() returns estimates on the log (link) scale; add type="response" to obtain predicted counts. The predict() call can then be nested within the aggregate() function to get the mean of the predicted estimates for each group. Note that the user must specify a formula within the aggregate() function in which the independent variable is the grouping variable.

predict(model, type="response")
df$pred_est_var <- predict(model, type="response")
aggregate(predict(model, type="response") ~ x, data=df, mean)

The coefficients from the model can be transformed into incident rate ratios by exponentiating the coefficients. To do this, use both the coef() and exp() functions on the glm-class object that is storing the model output (model in our case).

exp(coef(model))

Quasi-Poisson Regression

The specification of a quasi-Poisson regression model with the glm() function is similar to a standard Poisson regression model, but you will change the family= argument to quasipoisson. As above, glm() is in the stats package, which is included with base R and does not need to be installed separately. The basic structure of the syntax is as follows:

model <- glm(dep_var ~ indep_var1 + indep_var2, data=dataset_name, family="quasipoisson")
summary(model)

You may add ~1 instead of ~indep_var1+indep_var2 if you wish to run an intercept-only model. And an exposure variable can be incorporated into the model using the offset() function. However, recall that you also need to log it, which can be done with the log() function as follows:

glm(dep_var ~ indep_var1 + offset(log(offset_var)), data=dataset_name, family="quasipoisson")

The predict() function can be used along with your glm-class object to obtain predicted estimates for each case, and you can assign the result to a new variable using the <- assignment operator. By default, predict() returns estimates on the log (link) scale; add type="response" to obtain predicted counts. The predict() call can then be nested within the aggregate() function to get the mean of the predicted estimates for each group. Note that the user must specify a formula within the aggregate() function in which the independent variable is the grouping variable.

predict(model, type="response")
df$pred_est_var <- predict(model, type="response")
aggregate(predict(model, type="response") ~ x, data=df, mean)

The coefficients from the model can be transformed into incident rate ratios by exponentiating the coefficients. To do this, use both the coef() and exp() functions on the glm-class object that is storing the model output (model in our case).

exp(coef(model))

Negative Binomial Regression

Negative binomial regression models are set up a little differently from Poisson and quasi-Poisson models, as they use the function glm.nb(), which is from the MASS package. Notice that the family= argument is no longer needed.

model <- glm.nb(dep_var ~ indep_var1 + indep_var2, data=dataset_name)
summary(model)

You may add ~1 instead of ~indep_var1+indep_var2 if you wish to run an intercept-only model. You can still incorporate an exposure variable into the model using the offset() function, but remember that you will need to log it using the log() function as follows:

glm.nb(dep_var ~ indep_var1 + offset(log(offset_var)), data=dataset_name)

The predict() function can be used to obtain predicted estimates for each case, and you can assign the result to a new variable using the <- assignment operator. Note that even though we are using the function glm.nb(), it still produces a glm-class object. By default, predict() returns estimates on the log (link) scale; add type="response" to obtain predicted counts. The predict() call can then be nested within the aggregate() function to get the mean of the predicted estimates for each group. Note that the user must specify a formula within the aggregate() function in which the independent variable is the grouping variable.

predict(model, type="response")
df$pred_est_var <- predict(model, type="response")
aggregate(predict(model, type="response") ~ x, data=df, mean)

The coefficients from the model can be transformed into incident rate ratios by exponentiating the coefficients. To do this, use both the coef() and exp() functions on the glm-class object that is storing the model output (model in our case).

exp(coef(model))

Zero-Inflated Poisson/Negative Binomial Regression

Zero-inflated regression models can be estimated with maximum likelihood in R by using the zeroinfl() function in the pscl package. The distribution of the count component, whether Poisson or negative binomial, is specified with the dist= argument. The link= argument sets the link function for the zero-inflation component; "logit" is the default and is written out explicitly in the examples here. The zeroinfl() function is set up for zero-inflated Poisson regression models as follows:

zip <- zeroinfl(dep_var ~ indep_var1 + indep_var2, data=dataset_name, dist="poisson", link="logit")
summary(zip)

The zeroinfl() function is set up for zero-inflated negative binomial regression models as follows:

negbi <- zeroinfl(dep_var ~ indep_var1 + indep_var2, data=dataset_name, dist="negbin", link="logit")
summary(negbi)

You can then assess which model is a better fit with a likelihood ratio test by using the lrtest() function from the lmtest package as follows:

lrtest(zip, negbi)

As with the glm() function, you can incorporate an offset into the model using the offset() and log() functions:

zeroinfl(dep_var ~ indep_var1 + offset(log(offset_var)), data=dataset_name, dist="poisson", link="logit")

Problems

  1. Enter the data from Exercise 6.1 above into SPSS, Stata, or R. Analyze these data with the counts as the dependent variable and group as the independent variable. How do the values you computed as part of Exercise 6.1 compare with the regression coefficient from this analysis?

    Open the NYS Wave I data file (nys_l.sav, nys_l_student.sav, or nys_l.dta) to complete Problems 2–4. Make sure to select a variable from the dataset to be the exposure/offset variable when appropriate. When selecting independent variables, keep in mind the guidelines for selecting covariates that you learned in prior chapters (e.g., be mindful of multicollinearity).

  2. Compute a Poisson regression model using the number of nights a week the student reported studying on average (eve_stdy) as the dependent variable. Include their grade point average (gpa) and their perception of the importance of attending college (imp_coll) as independent variables.

    (a) Explain the results in plain English.

    (b) Based on the results, predict the number of nights the student spent studying, and then present the average number of nights the students spent studying for each grade point average.

  3. Compute a negative binomial regression model using the self-reported frequency with which the student hit his/her parent in the previous year (hit_prnt) as the dependent variable. Add the student’s perceived importance of attaining a good career (imp_gjob) and the perceived importance of having friends (imp_frns) as independent variables.

    (a) Explain the results in plain English.

    (b) Compare the negative binomial regression model results with a Poisson regression model, and then explain the differences. Which appears to be a better model fit?

  4. Compute both a zero-inflated Poisson and a zero-inflated negative binomial model using the self-reported frequency with which the student cheated on a school exam (cheat) as the dependent variable. Further, select five independent variables from the dataset that you suspect may affect cheating.

    (a) Compare the regression coefficients between the two models. Describe which model you think is a better fit.

Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Weisburd, D., Wilson, D.B., Wooditch, A., Britt, C. (2022). Count-Based Regression Models. In: Advanced Statistics in Criminology and Criminal Justice. Springer, Cham. https://doi.org/10.1007/978-3-030-67738-1_6

  • DOI: https://doi.org/10.1007/978-3-030-67738-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67737-4

  • Online ISBN: 978-3-030-67738-1

  • eBook Packages: Law and Criminology, Law and Criminology (R0)
