Efficient regression analyses with zero-augmented models based on ranking

  • Original Paper
  • Journal: Computational Statistics

Abstract

Several zero-augmented models exist for estimation involving outcomes with a large number of zeros. Two such models for handling count endpoints are the zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme to estimation with zero-inflated and hurdle regression models. We provide theoretical derivations showing the superiority of ERSS over simple random sampling (SRS) under these zero-augmented models. A simulation study compares the efficiency of ERSS with that of SRS, and we illustrate applications with real data sets.

Data availability

The data used in the simulation studies were generated randomly. The NHANES data used in the real-data illustrations were obtained from https://www.cdc.gov/nchs/nhanes/. The Twitter data and the code used to generate, import, and analyze the data are available at https://github.com/debkanda/ERSS-zero-models.

References

  • Al-Dlaigan Y, Shaw L, Smith A (2002) Is there a relationship between asthma and dental erosion? A case control study. Int J Pediatr Dent 12(3):189–200

  • Banda JM, Tekumalla R, Wang G et al (2021) A large-scale COVID-19 Twitter chatter dataset for open scientific research-an international collaboration. Epidemiologia 2(3):315–324

  • Bohn LL (1996) A review of nonparametric ranked-set sampling methodology. Commun Stat Theory Methods 25(11):2675–2685

  • Bohn LL, Wolfe DA (1992) Nonparametric two-sample procedures for ranked-set samples data. J Am Stat Assoc 87(418):552–561

  • Broniatowski DA, Paul MJ, Dredze M (2013) National and local influenza surveillance through Twitter: an analysis of the 2012–2013 influenza epidemic. PLoS ONE 8(12):e83672

  • Cameron AC, Trivedi PK (2013) Regression analysis of count data, vol 53. Cambridge University Press, Cambridge

  • Chen K, Duan Z, Yang S (2022) Twitter as research data: tools, costs, skill sets, and lessons learned. Polit Life Sci 41(1):114–130

  • Chen Z (2007) Ranked set sampling: its essence and some new applications. Environ Ecol Stat 14:355–363

  • Chen Z, Bai Z, Sinha BK (2004) Ranked set sampling: theory and applications, vol 176. Springer, Berlin

  • Cheung YB (2002) Zero-inflated models for regression analysis of count data: a study of growth and development. Stat Med 21(10):1461–1469

  • Chew C, Eysenbach G (2010) Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE 5(11):e14118

  • Collingwood L, Wilkerson J (2012) Tradeoffs in accuracy and efficiency in supervised learning methods. J Inf Technol Polit 9(3):298–318

  • Dell T, Clutter J (1972) Ranked set sampling theory with order statistics background. Biometrics pp 545–555

  • Dye B, Nowjack-Raymer R, Barker L et al (2008) Overview and quality assurance for the oral health component of the National Health and Nutrition Examination Survey (NHANES), 2003–04. J Public Health Dent 68(4):218–226

  • Frey J (2011) A note on ranked-set sampling using a covariate. J Stat Plan Inference 141(2):809–816

  • Fung ICH, Tse ZTH, Cheung CN et al (2014) Ebola and the social media. The Lancet 384(9961):2207

  • Goswami U, O’Toole S, Bernabé E (2021) Asthma, long-term asthma control medication and tooth wear in American adolescents and young adults. J Asthma 58(7):939–945

  • Hilbe JM (2011) Negative binomial regression. Cambridge University Press, Cambridge

  • Kelly M, Steele J, Nuttall N et al (2000) Adult dental health survey: oral health in the United Kingdom. The Stationery Office

  • Kim AE, Hansen HM, Murphy J et al (2013) Methodological considerations in analyzing Twitter data. J Natl Cancer Inst Monogr 47:140–146

  • Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34(1):1–14

  • Lehmann EL, Casella G (2006) Theory of point estimation. Springer, Berlin

  • Linder DF, Yin J, Rochani H et al (2018) Increased Fisher’s information for parameters of association in count regression via extreme ranks. Commun Stat Theory Methods 47(5):1181–1203

  • Lynne Stokes S (1977) Ranked set sampling with concomitant variables. Commun Stat Theory Methods 6(12):1207–1211

  • McIntyre G (1952) A method for unbiased selective sampling, using ranked sets. Aust J Agric Res 3(4):385–390

  • Moon C, Wang X, Lim J (2022) Empirical likelihood inference for area under the receiver operating characteristic curve using ranked set samples. Pharm Stat 21(6):1219–1245

  • Mullahy J (1986) Specification and testing of some modified count data models. J Econom 33(3):341–365

  • Patil GP, Sinha A, Taillie C (1994) Ranked set sampling. Handb Stat 12:167–200

  • Prieto VM, Matos S, Alvarez M et al (2014) Twitter: a good place to detect health conditions. PLoS ONE 9(1):e86191

  • Samawi HM, Al-Sagheer OA (2001) On the estimation of the distribution function using extreme and median ranked set sampling. Biom J 43(3):357–373

  • Samawi HM, Muttlak HA (1996) Estimation of ratio using rank set sampling. Biom J 38(6):753–764

  • Samawi HM, Ahmed MS, Abu-Dayyeh W (1996) Estimating the population mean using extreme ranked set sampling. Biom J 38(5):577–586

  • Samawi HM, Rochani H, Linder D et al (2017) More efficient logistic analysis using moving extreme ranked set sampling. J Appl Stat 44(4):753–766

  • Samawi HM et al (2002) On double extreme rank set sample with application to regression estimator. Metron-Int J Stat 60:50–63

  • See CT, Chen J (2008) Inequalities on the variances of convex functions of random variables. J Inequal Pure Appl Math 9(3):1–5

  • Takahasi K, Wakimoto K (1968) On unbiased estimates of the population mean based on the sample stratified by means of ordering. Ann Inst Stat Math 20(1):1–31

  • Thomas MS, Parolia A, Kundabala M et al (2010) Asthma and oral health: a review. Aust Dent J 55(2):128–133

  • Tomeny TS, Vargo CJ, El-Toukhy S (2017) Geographic and demographic correlates of autism-related anti-vaccine beliefs on Twitter, 2009–15. Soc Sci Med 191:168–175

  • Winkelmann R, Zimmermann KF (1995) Recent developments in count data modelling: theory and application. J Econ Surv 9(1):1–24

  • Yin J, Hao Y, Samawi H et al (2016) Rank-based kernel estimation of the area under the ROC curve. Stat Methodol 32:91–106

  • Zamanzade E, Wang X (2017) Estimation of population proportion for judgment post-stratification. Comput Stat Data Anal 112:257–269

  • Zamanzade E, Parvardeh A, Asadi M (2019) Estimation of mean residual life based on ranked set sampling. Comput Stat Data Anal 135:35–55

  • Zeileis A, Kleiber C, Jackman S (2008) Regression models for count data in R. J Stat Softw 27(8):1–25


Acknowledgements

We thank the editors and reviewers for their valuable time and for helpful comments that improved the content and clarity of this paper.

Author information

Corresponding author

Correspondence to Jingjing Yin.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Let \(Y = (y_{1}, \ldots , y_{n})'\) denote the vector of responses. The probability mass function of the zero-inflated Poisson (ZIP) distribution may be written as

$$\begin{aligned} f(y_i; \lambda _i, p_i)= {\left\{ \begin{array}{ll} p_i + (1-p_i) e^{-\lambda _i} &{}\text {if }y_i=0,\\ (1-p_i) \frac{e^{-\lambda _i}\lambda _i^{y_i}}{y_i!}&{}\text {if }y_i=1,2,3,\ldots ,\\ \end{array}\right. } \end{aligned}$$

where the parameters \({\varvec{p}} = (p_1, \ldots , p_n)\) and \(\varvec{\lambda } = (\lambda _1, \ldots , \lambda _n)\) are modeled via canonical-link GLMs as \(\text {logit}({\varvec{p}}) = {\varvec{G}}\gamma \) and \(\log (\varvec{\lambda }) = {\varvec{B}}\beta \), with \({\varvec{G}}\) and \({\varvec{B}}\) the design matrices.
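As an illustration of the two links, the sketch below evaluates the ZIP probability for a single observation. It is a minimal example under stated assumptions, not the authors' code: the function names `inv_logit` and `zip_pmf`, the covariate row, and the coefficient values are all hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def zip_pmf(y, g_row, b_row, gamma, beta):
    """ZIP probability of count y for one observation.

    g_row, b_row: rows G_i and B_i of the design matrices (lists of floats).
    gamma, beta:  coefficient vectors for the logit and log links.
    """
    p = inv_logit(sum(g * c for g, c in zip(g_row, gamma)))   # logit(p_i) = G_i gamma
    lam = math.exp(sum(b * c for b, c in zip(b_row, beta)))   # log(lambda_i) = B_i beta
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return p + (1.0 - p) * poisson if y == 0 else (1.0 - p) * poisson

# Hypothetical example: intercept plus one covariate in each linear predictor.
g_row, b_row = [1.0, 0.5], [1.0, 0.5]
gamma, beta = [-1.0, 0.4], [0.3, 0.6]
probs = [zip_pmf(y, g_row, b_row, gamma, beta) for y in range(60)]
print(probs[0], sum(probs))  # zero probability, and total mass over 0..59
```

The probability of a zero exceeds the plain Poisson value \(e^{-\lambda _i}\) whenever \(p_i > 0\), which is the zero inflation the model is designed to capture.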

As described in Lambert (1992), the ZIP model can be fit by maximum likelihood using the EM algorithm. The log-likelihood for the regression parameters \(\gamma \) and \(\beta \) based on all of the data is given by

$$\begin{aligned} \ln L(\gamma , \beta ; {\varvec{y}})&= \sum \limits _{i=1}^{n} \Bigg \{u_i \ln \bigg [e^{{\varvec{G}}_i\gamma } + \exp (-e^{{\varvec{B}}_i\beta })\bigg ] \\ {}&\quad + (1 - u_i) \big (y_i{\varvec{B}}_i\beta - e^{{\varvec{B}}_i\beta } \big ) \\ {}&\quad - \ln \big (1 + e^{{\varvec{G}}_i\gamma }\big ) - (1 - u_i) \ln (y_i!) \Bigg \}, \end{aligned}$$

where \(u_i = 1\) if \(y_i = 0\) and \(u_i = 0\) otherwise, and \({\varvec{G}}_i\) and \({\varvec{B}}_i\) denote the \(i\)th rows of \({\varvec{G}}\) and \({\varvec{B}}\).
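The log-likelihood above can be evaluated term by term. The following sketch assumes that \(u_i\) is the indicator of a zero count and that the design matrices are stored as lists of rows; the function name `zip_loglik` and the toy data are hypothetical.

```python
import math

def zip_loglik(y, G, B, gamma, beta):
    """Observed-data ZIP log-likelihood, summed term by term over observations."""
    total = 0.0
    for y_i, g_row, b_row in zip(y, G, B):
        eta_g = sum(g * c for g, c in zip(g_row, gamma))   # G_i gamma
        eta_b = sum(b * c for b, c in zip(b_row, beta))    # B_i beta
        u_i = 1.0 if y_i == 0 else 0.0                     # assumed: indicator of a zero count
        total += u_i * math.log(math.exp(eta_g) + math.exp(-math.exp(eta_b)))
        total += (1.0 - u_i) * (y_i * eta_b - math.exp(eta_b))
        total -= math.log(1.0 + math.exp(eta_g))
        total -= (1.0 - u_i) * math.lgamma(y_i + 1)        # ln(y_i!)
    return total

# Hypothetical toy data: intercept plus one covariate x in both design matrices.
y = [0, 0, 3, 1, 0, 2]
x = [0.1, -0.4, 1.2, 0.3, -1.0, 0.8]
G = B = [[1.0, x_i] for x_i in x]
print(zip_loglik(y, G, B, gamma=[-0.5, 0.2], beta=[0.4, 0.7]))
```

Using `math.lgamma(y_i + 1)` for \(\ln (y_i!)\) avoids forming large factorials for high counts.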

The EM algorithm is based on a latent indicator \(Z_i\), with \(Z_i = 1\) when \(Y_i\) comes from the perfect zero state and \(Z_i = 0\) when \(Y_i\) comes from the Poisson state. If \(Z_i\) were observed, the log-likelihood for the complete data \(({\varvec{y}}, {\varvec{z}})\) would be

$$\begin{aligned} \ln L_c(\gamma , \beta ; {\varvec{y}}, {\varvec{z}})&= \sum \limits _{i=1}^{n} \ln f(z_i|\gamma ) + \sum \limits _{i=1}^{n} \ln f(y_i|z_i, \beta ) \\ {}&= \sum \limits _{i=1}^{n} \Big (z_i{\varvec{G}}_i\gamma - \ln (1 + e^{{\varvec{G}}_i\gamma }) \Big ) + \sum \limits _{i=1}^{n} (1 - z_i)(y_i{\varvec{B}}_i\beta - e^{{\varvec{B}}_i\beta })\\&\quad - \sum \limits _{i=1}^{n} (1 - z_i) \ln (y_i!) \\ {}&= \ln L_c(\gamma ; {\varvec{y}}, {\varvec{z}}) +\ln L_c(\beta ; {\varvec{y}}, {\varvec{z}}) - \sum \limits _{i=1}^{n} (1 - z_i) \ln (y_i!). \end{aligned}$$

To implement the EM algorithm, the complete-data log-likelihood above is maximized iteratively by alternating between estimating \(Z_i\) by its expectation under the current estimates of \((\gamma , \beta )\) (E step) and then maximizing \(\ln L_c(\gamma , \beta ; {\varvec{y}}, {\varvec{z}})\) with \({\varvec{z}}\) fixed at that expectation (M step). In detail, the EM algorithm begins with starting values \((\gamma ^{(0)}, \beta ^{(0)})\) and proceeds iteratively. Iteration \((k + 1)\) requires the following steps.

E Step. Estimate \(Z_i\) by its posterior mean \(z_i^{(k)}\) under the current estimates \(\gamma ^{(k)}\) and \(\beta ^{(k)}\). This posterior mean is calculated as

$$\begin{aligned} z_i^{(k)}= {\left\{ \begin{array}{ll} \big [1 + \exp (-{\varvec{G}}_i\gamma ^{(k)} - e^{{\varvec{B}}_i\beta ^{(k)}})\big ]^{-1} &{}\text {if }y_i=0,\\ 0 &{}\text {if }y_i=1,2,3,\ldots \\ \end{array}\right. } \end{aligned}$$
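The E step translates directly into code. This is a minimal sketch with a hypothetical function name `e_step` and hypothetical inputs; rows of the design matrices are passed as lists.

```python
import math

def e_step(y, G, B, gamma, beta):
    """Posterior probability that each observation came from the perfect zero state."""
    z = []
    for y_i, g_row, b_row in zip(y, G, B):
        if y_i > 0:
            z.append(0.0)   # a positive count cannot come from the zero state
            continue
        eta_g = sum(g * c for g, c in zip(g_row, gamma))   # G_i gamma^(k)
        eta_b = sum(b * c for b, c in zip(b_row, beta))    # B_i beta^(k)
        z.append(1.0 / (1.0 + math.exp(-eta_g - math.exp(eta_b))))
    return z

# Hypothetical intercept-only example with gamma = 0 (p = 0.5) and beta = 0 (lambda = 1).
ones = [[1.0], [1.0], [1.0]]
print(e_step([0, 2, 0], ones, ones, gamma=[0.0], beta=[0.0]))
```

For a zero count the value equals \(p_i / \{p_i + (1-p_i)e^{-\lambda _i}\}\), the usual Bayes-rule form of the same expression.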

M Step for \(\gamma \). Find \(\gamma ^{(k+1)}\) by maximizing \(\ln L_c(\gamma ; {\varvec{y}}, {\varvec{z}}^{(k)})\). This can be accomplished by an unweighted binomial logistic regression of \({\varvec{z}}^{(k)}\) on the design matrix \({\varvec{G}}\), using a binomial denominator of one for each observation.

M Step for \(\beta \). Find \(\beta ^{(k+1)}\) by maximizing \(\ln L_c(\beta ; {\varvec{y}}, {\varvec{z}}^{(k)})\). This can be accomplished by a weighted Poisson log-linear regression of \({\varvec{y}}\) on \({\varvec{B}}\), with weights \(1 - z_i^{(k)}\).
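To make the iteration concrete, the sketch below runs the E and M steps for the intercept-only special case, in which both design matrices are a single column of ones. Under that assumption the two regressions have closed forms: the logistic fit of \(z\) gives \(p\) equal to the mean of \(z\), and the weighted Poisson fit gives \(\lambda \) equal to the weighted mean of \(y\). The function names and data are hypothetical; with covariates, each M step would instead call a regression routine.

```python
import math

def zip_loglik(y, p, lam):
    """Observed-data log-likelihood of an intercept-only ZIP model."""
    total = 0.0
    for y_i in y:
        if y_i == 0:
            total += math.log(p + (1.0 - p) * math.exp(-lam))
        else:
            total += math.log(1.0 - p) - lam + y_i * math.log(lam) - math.lgamma(y_i + 1)
    return total

def em_zip(y, p=0.5, lam=1.0, max_iter=500, tol=1e-10):
    """EM for an intercept-only ZIP model; returns (p, lam, log-likelihood trace)."""
    trace = [zip_loglik(y, p, lam)]
    for _ in range(max_iter):
        # E step: posterior mean of Z_i (zero for every positive count).
        z = [p / (p + (1.0 - p) * math.exp(-lam)) if y_i == 0 else 0.0 for y_i in y]
        # M step for gamma: an intercept-only logistic fit of z gives p = mean(z).
        p = sum(z) / len(y)
        # M step for beta: an intercept-only Poisson fit with weights 1 - z.
        w = [1.0 - z_i for z_i in z]
        lam = sum(w_i * y_i for w_i, y_i in zip(w, y)) / sum(w)
        trace.append(zip_loglik(y, p, lam))
        if trace[-1] - trace[-2] < tol:
            break
    return p, lam, trace

# Hypothetical counts with excess zeros.
y = [0] * 60 + [1] * 10 + [2] * 12 + [3] * 10 + [4] * 5 + [5] * 3
p_hat, lam_hat, trace = em_zip(y)
print(round(p_hat, 4), round(lam_hat, 4), len(trace) - 1)
```

The trace of log-likelihood values should never decrease from one iteration to the next, which is a convenient check on any EM implementation.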

The EM algorithm converges in this problem, and the estimates from the final iteration are the maximum likelihood estimates (MLEs). The MLEs \(({\hat{\gamma }}, {\hat{\beta }})\) are asymptotically Gaussian with covariance matrix equal to the inverse of the observed Fisher information matrix. Other issues, including the choice of starting values, the convergence of the EM algorithm, and the asymptotic distribution of \(({\hat{\gamma }}, {\hat{\beta }})\) in zero-inflated regression models, are discussed in Lambert (1992).
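As a rough illustration of how standard errors follow from the observed information, the sketch below fits an intercept-only ZIP model and inverts a numerically differentiated Hessian. The use of central differences, the step size `h`, the function names, and the data are all assumptions made for this example; the source does not state how the information matrix is computed.

```python
import math

def loglik(theta, y):
    """Intercept-only ZIP log-likelihood in theta = (gamma, beta)."""
    gamma, beta = theta
    lam = math.exp(beta)
    total = 0.0
    for y_i in y:
        if y_i == 0:
            total += math.log(math.exp(gamma) + math.exp(-lam))
        else:
            total += y_i * beta - lam - math.lgamma(y_i + 1)
        total -= math.log(1.0 + math.exp(gamma))
    return total

def fit_em(y, n_iter=2000):
    """Intercept-only EM with closed-form M steps; returns (gamma_hat, beta_hat)."""
    p, lam = 0.5, 1.0
    n0 = sum(1 for y_i in y if y_i == 0)
    for _ in range(n_iter):
        z0 = p / (p + (1.0 - p) * math.exp(-lam))   # E step for a zero count
        p = n0 * z0 / len(y)                        # M step for gamma
        lam = sum(y) / (len(y) - n0 * z0)           # M step for beta
    return math.log(p / (1.0 - p)), math.log(lam)

def observed_information(theta, y, h=1e-4):
    """Negative Hessian of the log-likelihood by central differences."""
    def shifted(i, j, si, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return loglik(t, y)
    info = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            d2 = (shifted(i, j, 1, 1) - shifted(i, j, 1, -1)
                  - shifted(i, j, -1, 1) + shifted(i, j, -1, -1)) / (4.0 * h * h)
            info[i][j] = -d2
    return info

# Hypothetical counts with excess zeros.
y = [0] * 60 + [1] * 10 + [2] * 12 + [3] * 10 + [4] * 5 + [5] * 3
theta_hat = fit_em(y)
info = observed_information(theta_hat, y)
det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
cov = [[info[1][1] / det, -info[0][1] / det],
       [-info[1][0] / det, info[0][0] / det]]   # inverse of the 2 x 2 information
se_gamma, se_beta = math.sqrt(cov[0][0]), math.sqrt(cov[1][1])
print(se_gamma, se_beta)
```

In a regression setting the same idea applies with a larger parameter vector, although an analytic Hessian or the output of a fitted model object would normally be preferred to finite differences.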

Appendix B: Tables

See Tables 5, 6, 7, 8, 9, 10, and 11.

Table 5 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 2
Table 6 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 2 including 10 additional predictors in the model
Table 7 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for hurdle negative binomial model of set size 3 including 5 additional predictors in the model
Table 8 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for zero-inflated Poisson model of set size 3
Table 9 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for zero-inflated Poisson model of set size 3 under the null setting \(\beta _h=0\)
Table 10 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 5
Table 11 Estimation of power of testing \(H_0\):\(\beta _h=0\) vs \(H_a\):\(\beta _h\ne 0\), bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 10 including 5 additional model predictors

See Figs. 8 and 9 and Table 12.

Fig. 8 Distribution of the number of surfaces with tooth wear

Fig. 9 Frequency distribution of the number of retweet counts

Table 12 Comparison of the computation time in seconds

About this article

Cite this article

Kanda, D., Yin, J., Zhang, X. et al. Efficient regression analyses with zero-augmented models based on ranking. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01503-3

  • DOI: https://doi.org/10.1007/s00180-024-01503-3
