Abstract
Several zero-augmented models exist for estimation involving outcomes with a large number of zeros. Two such models for handling count endpoints are the zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme to estimation with zero-inflated and hurdle regression models. We provide theoretical derivations showing the superiority of ERSS over simple random sampling (SRS) under these zero-augmented models. A simulation study is also conducted to compare the efficiency of ERSS with that of SRS, and lastly, we illustrate applications with real data sets.
Data availability
The data used in the simulation studies were generated randomly. The NHANES data used in the real data illustrations were imported from https://www.cdc.gov/nchs/nhanes/. The Twitter data and the code used to generate, import, and analyze the data can be found at https://github.com/debkanda/ERSS-zero-models.
References
Al-Dlaigan Y, Shaw L, Smith A (2002) Is there a relationship between asthma and dental erosion? A case control study. Int J Pediatr Dent 12(3):189–200
Banda JM, Tekumalla R, Wang G et al (2021) A large-scale COVID-19 Twitter chatter dataset for open scientific research: an international collaboration. Epidemiologia 2(3):315–324
Bohn LL (1996) A review of nonparametric ranked-set sampling methodology. Commun Stat Theory Methods 25(11):2675–2685
Bohn LL, Wolfe DA (1992) Nonparametric two-sample procedures for ranked-set samples data. J Am Stat Assoc 87(418):552–561
Broniatowski DA, Paul MJ, Dredze M (2013) National and local influenza surveillance through Twitter: an analysis of the 2012–2013 influenza epidemic. PLoS ONE 8(12):e83672
Cameron AC, Trivedi PK (2013) Regression analysis of count data, vol 53. Cambridge University Press, Cambridge
Chen K, Duan Z, Yang S (2022) Twitter as research data: tools, costs, skill sets, and lessons learned. Polit Life Sci 41(1):114–130
Chen Z (2007) Ranked set sampling: its essence and some new applications. Environ Ecol Stat 14:355–363
Chen Z, Bai Z, Sinha BK (2004) Ranked set sampling: theory and applications, vol 176. Springer, Berlin
Cheung YB (2002) Zero-inflated models for regression analysis of count data: a study of growth and development. Stat Med 21(10):1461–1469
Chew C, Eysenbach G (2010) Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE 5(11):e14118
Collingwood L, Wilkerson J (2012) Tradeoffs in accuracy and efficiency in supervised learning methods. J Inf Technol Polit 9(3):298–318
Dell T, Clutter J (1972) Ranked set sampling theory with order statistics background. Biometrics 28:545–555
Dye B, Nowjack-Raymer R, Barker L et al (2008) Overview and quality assurance for the oral health component of the national health and nutrition examination survey (NHANES), 2003–04. J Public Health Dent 68(4):218–226
Frey J (2011) A note on ranked-set sampling using a covariate. J Stat Plan Inference 141(2):809–816
Fung ICH, Tse ZTH, Cheung CN et al (2014) Ebola and the social media. The Lancet 384(9961):2207
Goswami U, O’Toole S, Bernabé E (2021) Asthma, long-term asthma control medication and tooth wear in American adolescents and young adults. J Asthma 58(7):939–945
Hilbe JM (2011) Negative binomial regression. Cambridge University Press, Cambridge
Kelly M, Steele J, Nuttall N et al (2000) Adult dental health survey: oral health in the United Kingdom. The Stationery Office
Kim AE, Hansen HM, Murphy J et al (2013) Methodological considerations in analyzing Twitter data. J Natl Cancer Inst Monogr 47:140–146
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34(1):1–14
Lehmann EL, Casella G (2006) Theory of point estimation. Springer, Berlin
Linder DF, Yin J, Rochani H et al (2018) Increased Fisher’s information for parameters of association in count regression via extreme ranks. Commun Stat Theory Methods 47(5):1181–1203
Lynne Stokes S (1977) Ranked set sampling with concomitant variables. Commun Stat Theory Methods 6(12):1207–1211
McIntyre G (1952) A method for unbiased selective sampling, using ranked sets. Aust J Agric Res 3(4):385–390
Moon C, Wang X, Lim J (2022) Empirical likelihood inference for area under the receiver operating characteristic curve using ranked set samples. Pharm Stat 21(6):1219–1245
Mullahy J (1986) Specification and testing of some modified count data models. J Econom 33(3):341–365
Patil GP, Sinha A, Taillie C (1994) Ranked set sampling. Handb Stat 12:167–200
Prieto VM, Matos S, Alvarez M et al (2014) Twitter: a good place to detect health conditions. PLoS ONE 9(1):e86191
Samawi HM, Al-Sagheer OA (2001) On the estimation of the distribution function using extreme and median ranked set sampling. Biom J 43(3):357–373
Samawi HM, Muttlak HA (1996) Estimation of ratio using rank set sampling. Biom J 38(6):753–764
Samawi HM, Ahmed MS, Abu-Dayyeh W (1996) Estimating the population mean using extreme ranked set sampling. Biom J 38(5):577–586
Samawi HM, Rochani H, Linder D et al (2017) More efficient logistic analysis using moving extreme ranked set sampling. J Appl Stat 44(4):753–766
Samawi HM et al (2002) On double extreme rank set sample with application to regression estimator. Metron-Int J Stat 60:50–63
See CT, Chen J (2008) Inequalities on the variances of convex functions of random variables. J Inequal Pure Appl Math 9(3):1–5
Takahasi K, Wakimoto K (1968) On unbiased estimates of the population mean based on the sample stratified by means of ordering. Ann Inst Stat Math 20(1):1–31
Thomas MS, Parolia A, Kundabala M et al (2010) Asthma and oral health: a review. Aust Dent J 55(2):128–133
Tomeny TS, Vargo CJ, El-Toukhy S (2017) Geographic and demographic correlates of autism-related anti-vaccine beliefs on Twitter, 2009–15. Soc Sci Med 191:168–175
Winkelmann R, Zimmermann KF (1995) Recent developments in count data modelling: theory and application. J Econ Surv 9(1):1–24
Yin J, Hao Y, Samawi H et al (2016) Rank-based kernel estimation of the area under the ROC curve. Stat Methodol 32:91–106
Zamanzade E, Wang X (2017) Estimation of population proportion for judgment post-stratification. Comput Stat Data Anal 112:257–269
Zamanzade E, Parvardeh A, Asadi M (2019) Estimation of mean residual life based on ranked set sampling. Comput Stat Data Anal 135:35–55
Zeileis A, Kleiber C, Jackman S (2008) Regression models for count data in R. J Stat Softw 27(8):1–25
Acknowledgements
We thank the editors and reviewers for their valuable time and for their helpful comments, which improved the content and clarity of this paper.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Appendices
Appendix A
Let \({\varvec{Y}} = (y_{1},..., y_{N})'\) denote the vector of responses. The probability function of a zero-inflated Poisson (ZIP) distribution may be written as:
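In the standard formulation of Lambert (1992), which the notation of this appendix follows, a response is a perfect zero with probability \(p_i\) and is otherwise Poisson with mean \(\lambda _i\). Under that assumption the probability function is:

```latex
P(Y_i = y_i) =
\begin{cases}
p_i + (1 - p_i)\, e^{-\lambda_i}, & y_i = 0, \\[4pt]
(1 - p_i)\, \dfrac{e^{-\lambda_i}\, \lambda_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots
\end{cases}
```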
Here, the parameters \({\varvec{p}} = (p_1,..., p_N)\) and \(\varvec{\lambda } = (\lambda _1,..., \lambda _N)\) are modeled via canonical link GLMs as \(logit({\varvec{p}}) = {\varvec{G}}\gamma \) and \(log(\varvec{\lambda }) = {\varvec{B}}\beta \), where \({\varvec{G}}\) and \({\varvec{B}}\) are design matrices.
As described in Lambert (1992), the ZIP model can be fit using maximum likelihood via the EM algorithm. The log likelihood for regression parameters \(\gamma \) and \(\beta \) based on all of the data is given by
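Substituting the logit and log links into the ZIP probability function gives the form reported by Lambert (1992). In the sketch below, \(G_i\) and \(B_i\) denote the \(i\)th rows of \({\varvec{G}}\) and \({\varvec{B}}\); this row notation is an assumption introduced here for readability.

```latex
L(\gamma, \beta; y)
  = \sum_{y_i = 0} \log\!\left( e^{G_i \gamma} + \exp\!\left(-e^{B_i \beta}\right) \right)
  + \sum_{y_i > 0} \left( y_i B_i \beta - e^{B_i \beta} \right)
  - \sum_{i=1}^{N} \log\!\left( 1 + e^{G_i \gamma} \right)
  - \sum_{y_i > 0} \log(y_i!)
```

The first sum is what makes direct maximization awkward: \(\gamma \) and \(\beta \) appear together inside a logarithm of a sum, which is the motivation for the latent-variable formulation.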
The EM algorithm is based on a latent variable \(Z_i\), where \(Z_i = 1\) when \(Y_i\) is from the perfect zero state and \(Z_i = 0\) when \(Y_i\) is from the Poisson state. To formulate the log-likelihood for the complete data \(({\varvec{y}}, {\varvec{z}})\), we have:
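Following Lambert (1992), and again writing \(G_i\) and \(B_i\) for the rows of the design matrices (an assumed notation), the complete-data log-likelihood separates into a part involving only \(\gamma \) and a part involving only \(\beta \):

```latex
L_c(\gamma, \beta; y, z)
  = \sum_{i=1}^{N} \left( z_i G_i \gamma - \log\!\left(1 + e^{G_i \gamma}\right) \right)
  + \sum_{i=1}^{N} (1 - z_i) \left( y_i B_i \beta - e^{B_i \beta} \right)
  - \sum_{i=1}^{N} (1 - z_i) \log(y_i!)
% = L_c(\gamma; y, z) + L_c(\beta; y, z) - \sum_i (1 - z_i) \log(y_i!)
```

Because the last sum does not depend on the parameters, \(L_c(\gamma ; y, z)\) and \(L_c(\beta ; y, z)\) can be maximized separately.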
To implement the EM algorithm, the log-likelihood above is maximized iteratively by alternating between estimating \(Z_i\) by its expectation under the current estimates of \((\gamma , \beta )\) (E step) and then maximizing \(L_c(\gamma , \beta ; y, z)\) with \(Z_i\) fixed at that expectation (M step). In detail, the EM algorithm begins with starting values \((\gamma ^{(0)}, \beta ^{(0)})\) and proceeds iteratively. Iteration \((k + 1)\) requires the following steps.
E Step. Estimate \(Z_i\) by its posterior mean \(Z_i^{(k)}\) under current estimates \(\gamma ^{(k)}\) and \(\beta ^{(k)}\). This posterior mean is calculated as
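In Lambert's (1992) formulation the posterior mean has a closed form, obtained by applying Bayes' rule to the two states. The row notation \(G_i\), \(B_i\) is an assumption introduced here.

```latex
Z_i^{(k)} = P\!\left(\text{zero state} \mid y_i, \gamma^{(k)}, \beta^{(k)}\right)
=
\begin{cases}
\left[ 1 + \exp\!\left( -G_i \gamma^{(k)} - \exp\!\left( B_i \beta^{(k)} \right) \right) \right]^{-1}, & y_i = 0, \\[4pt]
0, & y_i = 1, 2, \ldots
\end{cases}
```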
M Step for \(\gamma \). Find \(\gamma ^{(k+1)}\) by maximizing \(L_c(\gamma ;y,z^{(k)})\). This can be accomplished by performing an unweighted binomial logistic regression of \(z^{(k)}\) on design matrix \({\varvec{G}}\) using a binomial denominator of one for each observation.
M Step for \(\beta \). Find \(\beta ^{(k+1)}\) by maximizing \(L_c(\beta ;y,z^{(k)})\). This can be accomplished by a weighted Poisson log-linear regression of y on \({\varvec{B}}\), with weights \(1 - z^{(k)}\).
The EM algorithm converges in this problem, and the estimates from the final iteration are the maximum likelihood estimates (MLEs). The MLEs \(({\hat{\gamma }}, {\hat{\beta }})\) are asymptotically Gaussian with covariance matrix equal to the inverse of the observed Fisher information matrix. Other issues, including the choice of starting values, convergence of the EM algorithm, and the asymptotic distribution of \(({\hat{\gamma }}, {\hat{\beta }})\) in zero-inflated regression models, are discussed in Lambert (1992).
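The E and M steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the article: the function names (`fit_zip_em`, `simulate_zip`, `_newton_glm`), the Newton-Raphson inner solver, the starting values, and the stopping rule are all choices made here for the example, and only NumPy is used.

```python
import numpy as np

def _newton_glm(X, y, w, family, coef, n_iter=10):
    """Weighted canonical-link GLM fit by Newton-Raphson (hypothetical helper)."""
    for _ in range(n_iter):
        eta = np.clip(X @ coef, -30.0, 30.0)  # guard against overflow in exp
        if family == "binomial":
            mu = 1.0 / (1.0 + np.exp(-eta))
            var = mu * (1.0 - mu)
        else:  # "poisson"
            mu = np.exp(eta)
            var = mu
        grad = X.T @ (w * (y - mu))            # score vector
        hess = X.T @ (X * (w * var)[:, None])  # Fisher information
        coef = coef + np.linalg.solve(hess, grad)
    return coef

def fit_zip_em(y, G, B, n_em=500, tol=1e-8):
    """EM for ZIP regression with logit(p) = G gamma and log(lambda) = B beta."""
    y = np.asarray(y, dtype=float)
    gamma = np.zeros(G.shape[1])
    # Assumed starting value for beta: least squares on log(y + 0.5).
    beta = np.linalg.lstsq(B, np.log(y + 0.5), rcond=None)[0]
    for _ in range(n_em):
        # E step: posterior probability of the perfect zero state.
        z = np.where(
            y == 0,
            1.0 / (1.0 + np.exp(-G @ gamma - np.exp(B @ beta))),
            0.0,
        )
        # M step for gamma: unweighted logistic regression of z on G.
        gamma_new = _newton_glm(G, z, np.ones_like(y), "binomial", gamma)
        # M step for beta: Poisson regression of y on B with weights 1 - z.
        beta_new = _newton_glm(B, y, 1.0 - z, "poisson", beta)
        change = max(np.abs(gamma_new - gamma).max(),
                     np.abs(beta_new - beta).max())
        gamma, beta = gamma_new, beta_new
        if change < tol:
            break
    return gamma, beta

def simulate_zip(rng, n, gamma, beta):
    """Draw ZIP responses with one standard normal covariate plus intercept."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    p = 1.0 / (1.0 + np.exp(-X @ gamma))
    lam = np.exp(X @ beta)
    zero_state = rng.random(n) < p
    y = np.where(zero_state, 0, rng.poisson(lam))
    return y, X, X  # the same design matrix is used for G and B
```

Calling `fit_zip_em(*simulate_zip(np.random.default_rng(0), 5000, gamma, beta))` with chosen coefficient vectors returns estimates that can be compared with the values used to simulate. Standard errors are not computed in this sketch; they would require the observed information matrix of the full log-likelihood.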
Appendix B: Tables
Cite this article
Kanda, D., Yin, J., Zhang, X. et al. Efficient regression analyses with zero-augmented models based on ranking. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01503-3