Efficient regression analyses with zero-augmented models based on ranking

Kanda, Deborah; Yin, Jingjing; Zhang, Xinyan; Samawi, Hani

doi:10.1007/s00180-024-01503-3

Original Paper
Published: 14 May 2024

Computational Statistics

Deborah Kanda¹,
Jingjing Yin² (ORCID: orcid.org/0000-0003-4843-613X),
Xinyan Zhang³ &
Hani Samawi²

Abstract

Several zero-augmented models exist for estimation involving outcomes with a large number of zeros. Two such models for handling count endpoints are the zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme to estimation with zero-inflated and hurdle regression models. We provide theoretical derivations showing the superiority of ERSS over simple random sampling (SRS) under these zero-augmented models. A simulation study is also conducted to compare the efficiency of ERSS with that of SRS. Lastly, we illustrate applications with real data sets.

Data availability

The data used in the simulation studies were generated randomly. The NHANES data used in the real data illustrations were imported from https://www.cdc.gov/nchs/nhanes/. The Twitter data and the code used to generate, import, and analyze the data can be found at https://github.com/debkanda/ERSS-zero-models.

References

Al-Dlaigan Y, Shaw L, Smith A (2002) Is there a relationship between asthma and dental erosion? A case control study. Int J Pediatr Dent 12(3):189–200
Banda JM, Tekumalla R, Wang G et al (2021) A large-scale COVID-19 Twitter chatter dataset for open scientific research-an international collaboration. Epidemiologia 2(3):315–324
Bohn LL (1996) A review of nonparametric ranked-set sampling methodology. Commun Stat Theory Methods 25(11):2675–2685
Bohn LL, Wolfe DA (1992) Nonparametric two-sample procedures for ranked-set samples data. J Am Stat Assoc 87(418):552–561
Broniatowski DA, Paul MJ, Dredze M (2013) National and local influenza surveillance through Twitter: an analysis of the 2012–2013 influenza epidemic. PLoS ONE 8(12):e83672
Cameron AC, Trivedi PK (2013) Regression analysis of count data, vol 53. Cambridge University Press, Cambridge
Chen K, Duan Z, Yang S (2022) Twitter as research data: tools, costs, skill sets, and lessons learned. Polit Life Sci 41(1):114–130
Chen Z (2007) Ranked set sampling: its essence and some new applications. Environ Ecol Stat 14:355–363
Chen Z, Bai Z, Sinha BK (2004) Ranked set sampling: theory and applications, vol 176. Springer, Berlin
Cheung YB (2002) Zero-inflated models for regression analysis of count data: a study of growth and development. Stat Med 21(10):1461–1469
Chew C, Eysenbach G (2010) Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE 5(11):e14118
Collingwood L, Wilkerson J (2012) Tradeoffs in accuracy and efficiency in supervised learning methods. J Inf Technol Polit 9(3):298–318
Dell T, Clutter J (1972) Ranked set sampling theory with order statistics background. Biometrics 28(2):545–555
Dye B, Nowjack-Raymer R, Barker L et al (2008) Overview and quality assurance for the oral health component of the National Health and Nutrition Examination Survey (NHANES), 2003–04. J Public Health Dent 68(4):218–226
Frey J (2011) A note on ranked-set sampling using a covariate. J Stat Plan Inference 141(2):809–816
Fung ICH, Tse ZTH, Cheung CN et al (2014) Ebola and the social media. The Lancet 384(9961):2207
Goswami U, O’Toole S, Bernabé E (2021) Asthma, long-term asthma control medication and tooth wear in American adolescents and young adults. J Asthma 58(7):939–945
Hilbe JM (2011) Negative binomial regression. Cambridge University Press, Cambridge
Kelly M, Steele J, Nuttall N et al (2000) Adult dental health survey: oral health in the United Kingdom. The Stationery Office
Kim AE, Hansen HM, Murphy J et al (2013) Methodological considerations in analyzing Twitter data. J Natl Cancer Inst Monogr 47:140–146
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34(1):1–14
Lehmann EL, Casella G (2006) Theory of point estimation. Springer, Berlin
Linder DF, Yin J, Rochani H et al (2018) Increased Fisher’s information for parameters of association in count regression via extreme ranks. Commun Stat Theory Methods 47(5):1181–1203
Lynne Stokes S (1977) Ranked set sampling with concomitant variables. Commun Stat Theory Methods 6(12):1207–1211
McIntyre G (1952) A method for unbiased selective sampling, using ranked sets. Aust J Agric Res 3(4):385–390
Moon C, Wang X, Lim J (2022) Empirical likelihood inference for area under the receiver operating characteristic curve using ranked set samples. Pharm Stat 21(6):1219–1245
Mullahy J (1986) Specification and testing of some modified count data models. J Econom 33(3):341–365
Patil GP, Sinha A, Taillie C (1994) Ranked set sampling. Handb Stat 12:167–200
Prieto VM, Matos S, Alvarez M et al (2014) Twitter: a good place to detect health conditions. PLoS ONE 9(1):e86191
Samawi HM, Al-Sagheer OA (2001) On the estimation of the distribution function using extreme and median ranked set sampling. Biom J 43(3):357–373
Samawi HM, Muttlak HA (1996) Estimation of ratio using rank set sampling. Biom J 38(6):753–764
Samawi HM, Ahmed MS, Abu-Dayyeh W (1996) Estimating the population mean using extreme ranked set sampling. Biom J 38(5):577–586
Samawi HM, Rochani H, Linder D et al (2017) More efficient logistic analysis using moving extreme ranked set sampling. J Appl Stat 44(4):753–766
Samawi HM et al (2002) On double extreme rank set sample with application to regression estimator. Metron-Int J Stat 60:50–63
See CT, Chen J (2008) Inequalities on the variances of convex functions of random variables. J Inequal Pure Appl Math 9(3):1–5
Takahasi K, Wakimoto K (1968) On unbiased estimates of the population mean based on the sample stratified by means of ordering. Ann Inst Stat Math 20(1):1–31
Thomas MS, Parolia A, Kundabala M et al (2010) Asthma and oral health: a review. Aust Dent J 55(2):128–133
Tomeny TS, Vargo CJ, El-Toukhy S (2017) Geographic and demographic correlates of autism-related anti-vaccine beliefs on Twitter, 2009–15. Soc Sci Med 191:168–175
Winkelmann R, Zimmermann KF (1995) Recent developments in count data modelling: theory and application. J Econ Surv 9(1):1–24
Yin J, Hao Y, Samawi H et al (2016) Rank-based kernel estimation of the area under the ROC curve. Stat Methodol 32:91–106
Zamanzade E, Wang X (2017) Estimation of population proportion for judgment post-stratification. Comput Stat Data Anal 112:257–269
Zamanzade E, Parvardeh A, Asadi M (2019) Estimation of mean residual life based on ranked set sampling. Comput Stat Data Anal 135:35–55
Zeileis A, Kleiber C, Jackman S (2008) Regression models for count data in R. J Stat Softw 27(8):1–25

Acknowledgements

We thank the editors and reviewers for their valuable time and for helpful comments that improved the content and clarity of this paper.

Author information

Authors and Affiliations

Comprehensive Cancer Center, University of New Mexico, Albuquerque, NM, USA
Deborah Kanda
Department of Biostatistics, Epidemiology and Environmental Health Sciences, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, GA, USA
Jingjing Yin & Hani Samawi
Statistics Department, School of Data Science and Analytics, Kennesaw State University, Kennesaw, GA, USA
Xinyan Zhang

Corresponding author

Correspondence to Jingjing Yin.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Let ${\varvec{Y}} = (y_{1}, \ldots , y_{n})'$ be the vector of responses. The probability mass function of a zero-inflated Poisson (ZIP) distribution may be written as:

$$\begin{aligned} f(y_i; \lambda _i, p_i)= {\left\{ \begin{array}{ll} p_i + (1-p_i) e^{-\lambda _i} &{}\text {if }y_i=0\\ (1-p_i) \frac{e^{-\lambda _i}\lambda _i^{y_i}}{y_i!}&{}\text {if }y_i=1,2,3,\ldots ,\\ \end{array}\right. } \end{aligned}$$

where $p_i$ is the probability that observation $i$ comes from the perfect zero state and $\lambda _i$ is the Poisson mean. The parameters ${\varvec{p}} = (p_1, \ldots , p_n)$ and $\varvec{\lambda } = (\lambda _1, \ldots , \lambda _n)$ are modeled via canonical link GLMs as $\text {logit}({\varvec{p}}) = {\varvec{G}}\gamma $ and $\log (\varvec{\lambda }) = {\varvec{B}}\beta $, where ${\varvec{G}}$ and ${\varvec{B}}$ are design matrices.
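As a minimal sketch of the mass function above (not the authors' code), the following Python function evaluates the ZIP probability for a single observation from its two linear predictors. The function name `zip_pmf`, the argument names `g_lin` and `b_lin`, and the numeric values are illustrative assumptions.

```python
import math

def zip_pmf(y, g_lin, b_lin):
    """ZIP probability mass at count y.

    g_lin is the linear predictor G_i * gamma (logit scale, zero state);
    b_lin is the linear predictor B_i * beta (log scale, Poisson mean).
    """
    p = 1.0 / (1.0 + math.exp(-g_lin))   # inverse logit link
    lam = math.exp(b_lin)                # inverse log link
    # Poisson mass computed on the log scale for numerical stability.
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return p + (1.0 - p) * poisson   # the Poisson mass at 0 is exp(-lam)
    return (1.0 - p) * poisson

# Hypothetical linear predictors for one observation.
probs = [zip_pmf(y, g_lin=-0.5, b_lin=1.0) for y in range(100)]
total = sum(probs)  # should be numerically close to 1
```

Summing the mass over a long enough range of counts is a quick sanity check that the two branches together define a proper distribution.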

As described in Lambert (1992), the ZIP model can be fit using maximum likelihood via the EM algorithm. The log likelihood for regression parameters $\gamma $ and $\beta $ based on all of the data is given by

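$$\begin{aligned} \ln L(\gamma , \beta ; {\varvec{y}})&= \sum \limits _{i=1}^{n} \Bigg \{u_i \ln \bigg [e^{{\varvec{G}}_i\gamma } + \exp (-e^{{\varvec{B}}_i\beta })\bigg ] \\ {}&\quad + (1 - u_i) \big (y_i{\varvec{B}}_i\beta - e^{{\varvec{B}}_i\beta } \big ) \\ {}&\quad - \ln \big (1 + e^{{\varvec{G}}_i\gamma }\big ) - (1 - u_i) \ln (y_i!) \Bigg \}, \end{aligned}$$

where $u_i = 1$ if $y_i = 0$ and $u_i = 0$ otherwise, and ${\varvec{G}}_i$ and ${\varvec{B}}_i$ denote the $i$th rows of ${\varvec{G}}$ and ${\varvec{B}}$.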
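The log-likelihood above can be checked numerically. The sketch below reads $u_i$ as the indicator that $y_i = 0$ (an assumption, consistent with Lambert 1992) and compares the formula with a direct sum of log ZIP probabilities. The function names and the toy data are hypothetical.

```python
import math

def zip_loglik(y, g_lin, b_lin):
    """Observed-data ZIP log-likelihood in the u_i form (u_i = 1 if y_i == 0)."""
    total = 0.0
    for yi, gi, bi in zip(y, g_lin, b_lin):
        ui = 1 if yi == 0 else 0
        lam = math.exp(bi)
        total += ui * math.log(math.exp(gi) + math.exp(-lam))
        total += (1 - ui) * (yi * bi - lam)
        total -= math.log(1.0 + math.exp(gi))
        total -= (1 - ui) * math.lgamma(yi + 1)   # ln(y_i!)
    return total

def zip_logpmf(yi, gi, bi):
    """Direct log of the ZIP probability mass function, for comparison."""
    p = 1.0 / (1.0 + math.exp(-gi))
    lam = math.exp(bi)
    if yi == 0:
        return math.log(p + (1.0 - p) * math.exp(-lam))
    return math.log(1.0 - p) + yi * bi - lam - math.lgamma(yi + 1)

# Hypothetical toy responses and linear predictors G_i*gamma, B_i*beta.
y = [0, 0, 3, 1, 0, 5]
g_lin = [-0.2, 0.4, -1.0, 0.0, 0.3, -0.7]
b_lin = [0.5, 0.1, 1.2, 0.3, 0.8, 1.0]

formula = zip_loglik(y, g_lin, b_lin)
direct = sum(zip_logpmf(a, b, c) for a, b, c in zip(y, g_lin, b_lin))
```

The two quantities agree up to floating-point error because $\ln (1 - p_i) = -\ln (1 + e^{G_i\gamma })$ and $\ln [p_i + (1 - p_i)e^{-\lambda _i}] = \ln [e^{G_i\gamma } + e^{-\lambda _i}] - \ln (1 + e^{G_i\gamma })$.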

The EM algorithm is based on a latent indicator $Z_i$, where $Z_i = 1$ when $Y_i$ comes from the perfect zero state and $Z_i = 0$ when $Y_i$ comes from the Poisson state. The log-likelihood for the complete data $({\varvec{y}}, {\varvec{z}})$ is:

$$\begin{aligned} \ln L_c(\gamma , \beta ; {\varvec{y}}, {\varvec{z}})&= \sum \limits _{i=1}^{n} \ln f(z_i|\gamma ) + \sum \limits _{i=1}^{n} \ln f(y_i|z_i, \beta ) \\ {}&= \sum \limits _{i=1}^{n} \Big (z_i{\varvec{G}}_i\gamma - \ln (1 + e^{{\varvec{G}}_i\gamma }) \Big ) + \sum \limits _{i=1}^{n} (1 - z_i)(y_i{\varvec{B}}_i\beta - e^{{\varvec{B}}_i\beta })\\&\quad - \sum \limits _{i=1}^{n} (1 - z_i) \ln (y_i!) \\ {}&= \ln L_c(\gamma ; {\varvec{y}}, {\varvec{z}}) + \ln L_c(\beta ; {\varvec{y}}, {\varvec{z}}) - \sum \limits _{i=1}^{n} (1 - z_i) \ln (y_i!). \end{aligned}$$

To implement the EM algorithm, the log-likelihood above is maximized iteratively by alternating between estimating $Z_i$ by its expectation under the current estimates of $(\gamma , \beta )$ (E step) and maximizing the complete-data log-likelihood with $Z_i$ fixed at that expectation (M step). The algorithm begins with starting values $(\gamma ^{(0)}, \beta ^{(0)})$ and proceeds iteratively. Iteration $(k + 1)$ requires the following steps.

E Step. Estimate $Z_i$ by its posterior mean $z_i^{(k)}$ under the current estimates $\gamma ^{(k)}$ and $\beta ^{(k)}$. This posterior mean is calculated as

$$\begin{aligned} z_i^{(k)}= {\left\{ \begin{array}{ll} \left[ 1 + \exp (-{\varvec{G}}_i\gamma ^{(k)} - e^{{\varvec{B}}_i\beta ^{(k)}})\right] ^{-1} &{}\text {if }y_i=0\\ 0 &{}\text {if }y_i=1,2,3,\ldots \\ \end{array}\right. } \end{aligned}$$
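The zero-count case of this posterior mean follows from Bayes' rule. As a short worked derivation (dropping the iteration superscript and writing $p_i = e^{G_i\gamma }/(1 + e^{G_i\gamma })$ and $\lambda _i = e^{B_i\beta }$):

$$\begin{aligned} P(Z_i = 1 \mid y_i = 0)&= \frac{p_i}{p_i + (1 - p_i) e^{-\lambda _i}} \\&= \frac{e^{G_i\gamma }}{e^{G_i\gamma } + e^{-\lambda _i}} \\&= \left[ 1 + \exp (-G_i\gamma - e^{B_i\beta })\right] ^{-1}. \end{aligned}$$

For $y_i > 0$ the posterior mean is zero, because the perfect zero state can only produce a zero count.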

M Step for $\gamma $. Find $\gamma ^{(k+1)}$ by maximizing $\ln L_c(\gamma ; {\varvec{y}}, {\varvec{z}}^{(k)})$. This can be accomplished by performing an unweighted binomial logistic regression of ${\varvec{z}}^{(k)}$ on the design matrix ${\varvec{G}}$, using a binomial denominator of one for each observation.

M Step for $\beta $. Find $\beta ^{(k+1)}$ by maximizing $\ln L_c(\beta ; {\varvec{y}}, {\varvec{z}}^{(k)})$. This can be accomplished by a weighted Poisson log-linear regression of ${\varvec{y}}$ on ${\varvec{B}}$, with weights $1 - z_i^{(k)}$.
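The E and M steps above can be sketched in a few lines for the simplest special case, in which both design matrices are single columns of ones (intercept-only models). This is an illustrative assumption, not the authors' implementation: with intercepts only, the logistic M step reduces to the mean of the $z_i^{(k)}$ and the weighted Poisson M step reduces to a weighted mean of the counts, so no regression routine is needed. The function name, starting values, tolerance, and toy data are hypothetical, and the data are assumed to contain at least one zero and at least one positive count.

```python
import math

def zip_em_intercept_only(y, gamma=0.0, beta=0.0, tol=1e-10, max_iter=1000):
    """EM for a ZIP model whose design matrices G and B are columns of ones."""
    n = len(y)
    for _ in range(max_iter):
        lam = math.exp(beta)
        # E step: posterior mean of Z_i (zero whenever y_i > 0).
        z = [1.0 / (1.0 + math.exp(-gamma - lam)) if yi == 0 else 0.0
             for yi in y]
        # M step for gamma: intercept-only logistic fit of z gives p = mean(z).
        p = sum(z) / n
        new_gamma = math.log(p / (1.0 - p))
        # M step for beta: intercept-only Poisson fit with weights 1 - z.
        w = [1.0 - zi for zi in z]
        new_lam = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        new_beta = math.log(new_lam)
        converged = abs(new_gamma - gamma) + abs(new_beta - beta) < tol
        gamma, beta = new_gamma, new_beta
        if converged:
            break
    return gamma, beta

# Hypothetical toy counts: 50 zeros and 50 positive counts.
y = [0] * 50 + [1] * 10 + [2] * 15 + [3] * 15 + [4] * 7 + [5] * 3
gamma_hat, beta_hat = zip_em_intercept_only(y)
p_hat = 1.0 / (1.0 + math.exp(-gamma_hat))   # estimated zero-state probability
lam_hat = math.exp(beta_hat)                 # estimated Poisson mean
```

With covariates, the two closed-form updates would be replaced by a logistic regression of $z^{(k)}$ on ${\varvec{G}}$ and a weighted Poisson regression of $y$ on ${\varvec{B}}$, as described in the M steps.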

The EM algorithm converges in this problem, and the estimates from the final iteration are the maximum likelihood estimates (MLEs). The MLEs $({\hat{\gamma }}, {\hat{\beta }})$ are asymptotically Gaussian, with covariance matrix equal to the inverse of the observed Fisher information matrix. Other issues, including the choice of starting values, convergence of the EM algorithm, and the asymptotic distribution of $({\hat{\gamma }}, {\hat{\beta }})$ in zero-inflated regression models, are discussed in Lambert (1992).

Appendix B: Tables

See Tables 5, 6, 7, 8, 9, 10, and 11.

Table 5 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 2

Table 6 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 2 including 10 additional predictors in the model

Table 7 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for hurdle negative binomial model of set size 3 including 5 additional predictors in the model

Table 8 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for zero-inflated Poisson model of set size 3

Table 9 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for zero-inflated Poisson model of set size 3 under the null setting $\beta _h=0$

Table 10 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 5

Table 11 Estimation of power of testing $H_0$:$\beta _h=0$ vs $H_a$:$\beta _h\ne 0$, bias, MSE, coverage probability, and efficiency for hurdle Poisson model of set size 10 including 5 additional model predictors

See Figs. 8 and 9 and Table 12.

Table 12 Comparison of the computation time in seconds

About this article

Cite this article

Kanda, D., Yin, J., Zhang, X. et al. Efficient regression analyses with zero-augmented models based on ranking. Comput Stat (2024). https://doi.org/10.1007/s00180-024-01503-3

Received: 12 April 2023
Accepted: 23 April 2024
Published: 14 May 2024
DOI: https://doi.org/10.1007/s00180-024-01503-3
