Modeling Overdispersion, Autocorrelation, and Zero-Inflated Count Data Via Generalized Additive Models and Bayesian Statistics in an Aphid Population Study

  • F. J. Carvalho
  • D. G. de Santana
  • M. V. Sampaio
Ecology, Behavior and Bionomics


Count variables are often positively skewed and may include many zero observations, which requires specific statistical approaches. Under these conditions, interpreting how changes in abiotic factors affect populations of insect crop pests can be difficult. The analysis is further complicated by possible temporal or spatial correlation, irregularly spaced data, heterogeneity over time, and zero inflation. Generalized additive models (GAM) are important tools for evaluating the effects of abiotic factors. Moreover, Markov chain Monte Carlo (MCMC) techniques can be used to fit a Bayesian GAM (BGAM) that contains a temporal correlation structure. We compared methods of modeling the effects of temperature, precipitation, and time on the Brevicoryne brassicae (L.) population in Uberlândia, Brazil. We applied the proposed BGAM to the data and compared it with GAMs with and without autocorrelation for time, using the statistical programming language R. Analysis of deviance identified significant effects of the smoothers for precipitation and time in the frequentist models. The BGAM solved the problem in the variance estimates for precipitation and temperature that arose in the previous models. Furthermore, trace and density plots of the population-level effects indicated good convergence for all parameters. The estimated smoothing curves showed a linear effect of increasing precipitation, with lower precipitation indicating no presence of the aphid. Average temperature did not affect aphid incidence. Autocorrelation was handled with ARMA structures, and the excess of zeros with zero-inflated models. The example of B. brassicae incidence shows how well abiotic (and biotic) factors can be modeled and analyzed using BGAM.
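
To make the idea of zero inflation concrete, the following is a minimal Python sketch (not the authors' R analysis) of a zero-inflated Poisson distribution, one common zero-inflated count model. The function names and the values of `lam` and `pi` are illustrative assumptions, not estimates from the study, and the abstract does not state which count family the authors used.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing count k under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson pmf.

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also produce zeros.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameter values chosen only for illustration.
lam, pi = 2.0, 0.3

# Probability of a zero count with and without zero inflation.
p0_poisson = poisson_pmf(0, lam)
p0_zip = zip_pmf(0, lam, pi)

# Closed-form mean and variance of the zero-inflated Poisson.
zip_mean = (1.0 - pi) * lam
zip_var = zip_mean * (1.0 + pi * lam)

print(f"P(0) Poisson: {p0_poisson:.4f}, P(0) zero-inflated: {p0_zip:.4f}")
print(f"Zero-inflated mean: {zip_mean:.2f}, variance: {zip_var:.2f}")
```

With these hypothetical values, the zero-inflated model assigns more probability to zero than the plain Poisson, and its variance exceeds its mean. These are the two features, excess zeros and overdispersion, that motivate the models compared in the study.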


Keywords: Regular time series event; ARMA structure; Markov chain Monte Carlo simulation; Abiotic factors; Brevicoryne brassicae


Authors’ Contributions

FJC, DGS, and MVS planned the experimental work and wrote the manuscript. FJC and DGS designed and conducted data analyses.

Supplementary material

ESM 1: 13744_2019_729_MOESM1_ESM.pdf (49 kb)



Copyright information

© Sociedade Entomológica do Brasil 2019

Authors and Affiliations

  1. Instituto de Ciências Agrárias, Univ. Federal de Uberlândia, Uberlândia, Brazil
