1 Introduction

Coronavirus disease 2019 (COVID-19), initially called 2019-nCoV, is caused by a member of the coronavirus family of enveloped positive-strand RNA viruses. Viruses of this family infect several animal species and humans, causing respiratory tract infections as well as liver, neurological, and gastrointestinal problems, ranging from mild to lethal (Guan et al. 2003). Its initial source was identified in Wuhan city, Hubei province, China, in persons exposed to a seafood and wet animal wholesale market. The first case was detected in December 2019 (Municipal Health Commission et al. 2019), and the disease has since spread quickly worldwide.

COVID-19 is the third coronavirus disease to emerge in the human population in the past two decades, and it may prove to be a more severe infectious disease than its predecessors. Given its rapid spread and the growing number of cases, there is evidence that it is more contagious than the severe acute respiratory syndrome coronavirus (SARS-CoV) and the Middle East respiratory syndrome coronavirus (MERS-CoV), whose outbreaks occurred in 2002 and 2012, respectively (Huang et al. 2020; Munster et al. 2020). Moreover, because of its similarity to SARS-CoV, the causative virus is also named SARS-CoV-2.

In April 2020, owing to the large number of cases and deaths from the new coronavirus, New York City became the new epicenter of the disease in the United States of America (U.S.) (Radmanesh et al. 2020), after Italy. Since then, several other states have experienced a substantial increase in the number of cases and deaths. From January 20 to August 14, 2020, the total number of confirmed cases in the country passed five million, reaching 5,150,407. In the same period, 164,826 deaths were recorded (World Health Organization 2020). These numbers correspond to about 25% of all documented cases and 22% of all coronavirus deaths globally (World Health Organization 2020).

Some recent studies present statistical applications to pandemic data in the U.S. Bashir et al. (2020) analyzed the correlation between the virus and climate indicators in New York City. They identified that the temperature and air quality are significantly associated with the coronavirus pandemic. Regressive and autoregressive spatial models were examined by Mollalo et al. (2020) to explain variations of coronavirus in the whole country, considering several environmental, topographic, socioeconomic, behavioral, and demographic factors as predictor variables. Duhon et al. (2021) estimated the initial growth rate of COVID-19 for all countries of the world. They used a multiple linear regression model to study the association between the initial growth rate and non-pharmaceutical interventions, demographic, social, and climatic factors. Other similar studies can be found in Andersen (2020) and Zhang and Schwartz (2020).

Although several studies have addressed the pandemic, to the best of our knowledge, a regression analysis modeling the first-wave coronavirus mortality rate across the 50 U.S. states has not been conducted. Our goal is to analyze how health care resources and demographic, socioeconomic, and behavioral variables affected the first-wave COVID-19 mortality rate in the U.S., in order to identify which covariates have the most significant influence on the initial growth of mortality from this disease. This information can help improve decision-making in public health policy. Moreover, the findings can help in understanding potential future outbreaks in other countries of the world.

In this context, some regressions are fitted to the first-wave coronavirus mortality rates in the 50 American states to determine the demographic, socioeconomic, health care resource, and behavioral covariates that affect these rates. Since the response variable has a restricted domain, a new parametric regression is constructed to fit these data. The new regression, based on a transformation of the Burr XII (BXII) random variable, is compared to the Kumaraswamy (Kw) and unit-Weibull (UW) regressions, which are feasible alternatives for modeling the median of such data. The main advantage of the proposed regression is that it captures the effect of the covariate associated with health care resources and provides the best adequacy measures. Other similar quantile regressions and unit models recently proposed can be found in Gündüz and Korkmaz (2020), Korkmaz (2020a, 2020b), and Korkmaz et al. (2021).

The rest of the paper is structured as follows. A new regression to model the mortality rates in the American states is defined in Sect. 2, which also discusses the estimation of the parameters, a simulation study, and some goodness-of-fit measures to check the adequacy of the proposed regression. Section 3 presents some basic statistics of the data set, identifies the best regression to fit the mortality rates, and provides some useful findings. Finally, Sect. 4 offers some concluding remarks.

2 The proposed regression

This section introduces a new regression with broad applicability to coronavirus mortality rates. A particular feature of this approach is that it accommodates double-bounded variables in the unit interval with several types of asymmetry. The proposal is based on the transformation \(Z=1-\mathrm {e}^{-X}\), where X is a BXII random variable having cumulative distribution function (cdf) and probability density function (pdf)

$$\begin{aligned} F_X\left( x ; c,d \right) =1-\left( 1+x^{c}\right) ^{-d},\qquad x>0, \end{aligned}$$

and

$$\begin{aligned} f_X\left( x ; c,d \right) =c\,d\,x^{c-1}\left( 1+x^{c}\right) ^{-(d+1)}, \end{aligned}$$

respectively, where \(c>0\) and \(d>0\) are shape parameters. It is worth noting that Z can also be seen as a reflected transformation on W, \(Z=1-W\), where W is a random variable following a unit Burr XII (UBXII) distribution pioneered by Korkmaz and Chesneau (2021). Hence, the cdf and pdf of the reflected unit Burr XII (RUBXII) distribution can be expressed as (for \(z\in (0,1)\))

$$\begin{aligned} F_Z(z;c,d)=1-[1+\,\log ^c(1-z)^{-1}\,]\,^ {-d}, \end{aligned}$$
(1)

and

$$\begin{aligned} f_Z(z;c,d)=c\,d\, \frac{(1-z)^{-1}\log ^{c-1}(1-z)^{-1}}{[1+\log ^c(1-z)^{-1}]^{d+1}}, \end{aligned}$$
(2)

respectively. By inverting (1), the quantile function (qf) of Z is

$$\begin{aligned} Q_Z(u;c,d)=1-\mathrm {exp} \left\{ -[ (1-u)^{-1/d}-1 ]^{1/c} \right\} . \end{aligned}$$
(3)

Both the UBXII and RUBXII distributions are special cases of the unit extended Weibull family; see Guerra et al. (2020).
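As a quick numerical check of the three functions above, the following Python sketch codes the cdf, the density obtained from \(f_X\) by the change of variables \(x=\log (1-z)^{-1}\) (whose Jacobian is \(1/(1-z)\)), and the qf. The function names and the parameter values are illustrative assumptions, not part of the original analysis.

```python
import math

def rubxii_cdf(z, c, d):
    """cdf of Z = 1 - exp(-X), X ~ BXII(c, d), for 0 < z < 1 (Eq. (1))."""
    x = -math.log1p(-z)                      # x = log(1/(1-z)) > 0
    return 1.0 - (1.0 + x ** c) ** (-d)

def rubxii_pdf(z, c, d):
    """BXII density evaluated at x, times the Jacobian dx/dz = 1/(1-z)."""
    x = -math.log1p(-z)
    return c * d * x ** (c - 1) / ((1.0 - z) * (1.0 + x ** c) ** (d + 1))

def rubxii_qf(u, c, d):
    """Quantile function, the inverse of the cdf (Eq. (3))."""
    x = ((1.0 - u) ** (-1.0 / d) - 1.0) ** (1.0 / c)
    return 1.0 - math.exp(-x)

c, d = 2.0, 1.5                              # illustrative values only
z = 0.3
u = rubxii_cdf(z, c, d)
print(u, rubxii_qf(u, c, d))                 # the second value recovers z

# The density should match the numerical derivative of the cdf.
h = 1e-6
slope = (rubxii_cdf(z + h, c, d) - rubxii_cdf(z - h, c, d)) / (2 * h)
print(slope, rubxii_pdf(z, c, d))
```

The derivative check is a convenient way to confirm that the density is positive and carries the factor \((1-z)^{-1}\).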

To introduce a systematic component on a location parameter, the RUBXII distribution is re-parameterized in terms of its quantiles. Let \(q(\tau )=Q_Z(\tau ;c,d)\) be the \(\tau \)th quantile of Z. By evaluating Equation (3) at \(\tau \) and solving for d,

$$\begin{aligned} d=\log ( 1-\tau )^{-1}/\log \left\{ 1+\log ^c\,[1-q(\tau )]^{-1} \right\} . \end{aligned}$$
(4)
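For completeness, the inversion step can be spelled out as follows; this is a sketch that uses only Equation (3) evaluated at \(u=\tau \) and the definition of \(q(\tau )\):

$$\begin{aligned} q(\tau )&=1-\mathrm {exp} \left\{ -[ (1-\tau )^{-1/d}-1 ]^{1/c} \right\} \\ \Longleftrightarrow \quad \log ^c\,[1-q(\tau )]^{-1}&=(1-\tau )^{-1/d}-1 \\ \Longleftrightarrow \quad \log \left\{ 1+\log ^c\,[1-q(\tau )]^{-1} \right\}&=\frac{1}{d}\,\log ( 1-\tau )^{-1}. \end{aligned}$$

Solving the last line for d gives the expression in (4).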

Although the quantiles are functions of \(\tau \), \(q(\tau )\) is denoted just as q to simplify the notation. Then, by replacing (4) in Equations (1) and (2), the cdf and pdf of the RUBXII distribution expressed in terms of a quantile-based parameterization are (for \(z \in (0,1)\))

$$\begin{aligned} F_Z(z;q,c)=1-\left[ 1+\log ^c (1-z)^{-1}\right] ^{\frac{\log ( 1-\tau )}{\log \left[ 1+\log ^c\,(1-q)^{-1} \right] }}, \end{aligned}$$
(5)

and

$$\begin{aligned} f_Z(z; q,c)&=\frac{ \log (1-\tau )^{-c} \log ^{c -1} (1-z)^{-1}}{(1-z)\log \left[ 1+\log ^{c} (1-q)^{-1}\right] } \left[ 1+\log ^c (1-z)^{-1}\right] ^{\frac{\log (1-\tau )}{\log \left[ 1+\log ^c\,(1-q)^{-1} \right] }-1}, \end{aligned}$$
(6)

respectively, where \(c>0\) is a shape parameter and the quantile order \(\tau \in (0,1)\) is chosen by the researcher. Henceforth, let \(Z\sim \) RUBXII\((q,c)\) be a random variable having density (6).
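The quantile-based parameterization can be checked numerically. The Python sketch below codes Equations (5) and (6) and verifies that the cdf evaluated at q returns \(\tau \); the helper `_r` and the function names are assumptions introduced only for this illustration.

```python
import math

def _r(x, c):
    """Helper: log[1 + log^c(1/(1-x))], the term shared by Eqs. (5) and (6)."""
    return math.log1p((-math.log1p(-x)) ** c)

def rubxii_cdf_q(z, q, c, tau=0.5):
    """Eq. (5): cdf with q the tau-th quantile of Z."""
    k = math.log(1.0 - tau) / _r(q, c)       # exponent in Eq. (5), k = -d < 0
    return 1.0 - math.exp(k * _r(z, c))

def rubxii_pdf_q(z, q, c, tau=0.5):
    """Eq. (6): pdf with q the tau-th quantile of Z."""
    x = -math.log1p(-z)
    k = math.log(1.0 - tau) / _r(q, c)
    # log(1-tau)^{-c} / r(q) equals -c * k
    return -c * k * x ** (c - 1) / (1.0 - z) * math.exp((k - 1.0) * _r(z, c))

# By construction, F_Z(q; q, c) = tau for any admissible q, c and tau.
print(rubxii_cdf_q(0.2, q=0.2, c=1.8, tau=0.5))
print(rubxii_cdf_q(0.2, q=0.2, c=1.8, tau=0.25))
```

Any other quantile order can be used in the same way by changing `tau`.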

In some cases, median-based regressions are preferable to mean-based ones. The median is more robust than the mean to atypical observations and asymmetries in the data. Thus, when the data present these features, it is more suitable to consider the median rather than the mean as a measure of location (Pumi et al. 2020). In the coronavirus mortality rates application of Sect. 3, we consider \(\tau =0.5\), and therefore, \(q=q(0.5)\) is the median of Z.

Figure 1 displays plots of the RUBXII density with \(\tau =0.5\), which take the following forms: U, symmetric, right-skewed, increasing, and increasing-decreasing-increasing (tilde). Thus, the distribution is useful for modeling variables with different types of skewness and heavy tails. Moreover, it can assume shapes (such as the tilde shape) that the densities of classical regressions for unit data do not accommodate.

Fig. 1 Plots of the RUBXII density (\(\tau =0.5\))

Under the proposed re-parameterization, the qf of Z is

$$\begin{aligned} Q_Z (u)= 1-\mathrm {exp} \left\{ -\left[ (1-u)^{\log [1+\log ^c (1-q)^{-1}]/\log (1-\tau )}-1 \right] ^{1/c} \right\} . \end{aligned}$$
(7)

Since it has a closed form, Equation (7) is useful for generating observations from the RUBXII distribution by the inversion method. Thus, if U is a random variable having a standard uniform distribution, then \(Z=Q_Z(U)\) follows the RUBXII law.
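A minimal Python sketch of the inversion method is given below. It draws standard uniform variates and passes them through the qf in Equation (7); the function names, the seed, and the values of q and c are illustrative assumptions.

```python
import math
import random
import statistics

def rubxii_qf_q(u, q, c, tau=0.5):
    """Eq. (7): quantile function under the quantile parameterization."""
    r_q = math.log1p((-math.log1p(-q)) ** c)
    x = ((1.0 - u) ** (r_q / math.log(1.0 - tau)) - 1.0) ** (1.0 / c)
    return 1.0 - math.exp(-x)

def rubxii_sample(n, q, c, rng, tau=0.5):
    """Inversion method: Z = Q_Z(U) with U standard uniform."""
    return [rubxii_qf_q(rng.random(), q, c, tau) for _ in range(n)]

rng = random.Random(2020)                    # fixed seed for reproducibility
draws = rubxii_sample(20000, q=0.3, c=2.0, rng=rng)
print(statistics.median(draws))              # should be close to q = 0.3
```

With \(\tau =0.5\), the sample median of the draws should approach q as the number of draws grows.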

Let \({\varvec{z}}=(z_1,\dots , z_n)^\top \) be a vector of n independent observations of the variables \(Z_i\sim \) RUBXII\((q_i,c)\) (for \(i=1,\ldots ,n\)). The new regression is proposed assuming that the parameters \(q_i\) can be expressed as a function of covariates under the systematic component

$$\begin{aligned} g(q_i)=\eta _i=\sum _{j=1}^k {x_{ij}}\,\xi _j={\varvec{x}}_i^\top \,{\varvec{\xi }}, \end{aligned}$$
(8)

where \(g:(0,1)\rightarrow \mathbb {R}\) is a strictly monotonic and twice differentiable link function, \(\eta _i\) is the linear predictor, and \({\varvec{\xi }} = (\xi _1,\ldots ,\xi _k)^\top \) is the parameter vector associated with the covariates \(\varvec{x}_i^\top =(x_{i1},\ldots ,x_{ik})\). The quantities \(q_i\) can be obtained by inverting (8) as \(q_i=g^{-1}(\eta _i)\).

Several link functions can be chosen for \(g(\cdot )\) such as the logit, probit, and complementary log–log. In applications, the logit link function is generally considered due to the useful interpretation of the regression coefficients as an odds ratio. It is defined as \(g(p)=\log [p/(1-p)]\), and it is used in all fitted regressions here.

2.1 Estimation

The estimation of the parameters of the RUBXII regression is done by the maximum likelihood (ML) method. Let \(\varvec{\theta }=(\varvec{\xi }^\top ,c)^\top \) be the \((k+1)\)-dimensional parameter vector. The log-likelihood function based on a sample of n independent observations is

$$\begin{aligned} \ell ({\varvec{\theta }})\equiv \ell (\varvec{\xi },c)=\sum _{i=1}^n \ell _i (q_i,c), \end{aligned}$$
(9)

where \(q_i\) satisfies the systematic component (8) and \(\ell _i (q_i,c)\) is the logarithm of the density \(f_Z(z_i;q_i,c)\) given in Eq. (6). Thus,

$$\begin{aligned} \ell _i(q_i,c)&=\,\log (1-z_i)^{-1}-\log \,[\,r(q_i)\,] +\log \,[\,\log (1-\tau )^{-c}\,]\\&\quad +\log \,[\,\log ^{c-1}(1-z_i)^{-1}] +[\log (1-\tau )/r(q_i)-1]\,r(z_i),\nonumber \end{aligned}$$

where \(r(x)=\log \,[\,1+\log ^c(1-x)^{-1}]\).
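The log-likelihood can be coded directly from the expression above. The Python sketch below assumes the logit link and \(\tau =0.5\) by default; the function names, the design matrix, and the response values are hypothetical and serve only to show the call.

```python
import math

def r(x, c):
    """r(x) = log[1 + log^c(1/(1-x))]."""
    return math.log1p((-math.log1p(-x)) ** c)

def loglik_i(z, q, c, tau=0.5):
    """Contribution of one observation, term by term as in the text."""
    x = -math.log1p(-z)                              # log(1-z)^{-1}
    return (x
            - math.log(r(q, c))
            + math.log(-c * math.log(1.0 - tau))     # log[log(1-tau)^{-c}]
            + (c - 1.0) * math.log(x)                # log[log^{c-1}(1-z)^{-1}]
            + (math.log(1.0 - tau) / r(q, c) - 1.0) * r(z, c))

def loglik(xi, c, X, z, tau=0.5):
    """Eq. (9) with q_i obtained from the inverse logit of x_i' xi."""
    total = 0.0
    for x_row, z_i in zip(X, z):
        eta = sum(b * v for b, v in zip(xi, x_row))
        q_i = 1.0 / (1.0 + math.exp(-eta))
        total += loglik_i(z_i, q_i, c, tau)
    return total

# Hypothetical design matrix (intercept plus one covariate) and responses.
X = [[1.0, 0.5], [1.0, -1.0]]
z = [0.10, 0.05]
print(loglik([-2.0, 0.3], 1.5, X, z))
```

The negative of `loglik` is the objective that a numerical optimizer would minimize over \((\varvec{\xi }^\top ,c)^\top \), with c kept positive.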

The components of the score vector \(U({\varvec{\theta }})\), given in Appendix 1, are the partial derivatives of (9) with respect to each element of the parameter vector \({\varvec{\theta }}\). Setting these components to zero, \(U({\varvec{\theta }})={\varvec{0}}\), and solving the system simultaneously yields the maximum likelihood estimators (MLEs) \(\varvec{{{\hat{\theta }}}}=(\varvec{{{\hat{\xi }}}}^\top ,{{\hat{c}}})^\top \) of \(\varvec{\theta }\). However, the system of equations is non-linear and cannot be solved analytically. Therefore, the estimates must be obtained through numerical optimization algorithms, available in well-known software such as R (optim function), SAS (PROC NLMIXED), and Ox (MaxBFGS sub-routine).

2.2 Simulation study

Some Monte Carlo experiments are carried out to assess the performance of the MLEs in finite samples. Consider the systematic component for \(q_i\):

$$\begin{aligned} \log \left( \frac{q_i}{1-q_i}\right) =\eta _i=\xi _1+\xi _2 \,x_{i2}, \qquad i=1,\ldots ,n. \end{aligned}$$

Four scenarios with different simulation schemes, combining various values for the parameter vector \({\varvec{\theta }}=(\xi _1,\xi _2, c)^\top \), are considered. To evaluate the performance of the MLEs, for each scenario, the samples \(\left\{ (z_1,x_{12})\right. \), \(\left. \ldots ,(z_n,x_{n2})\right\} \) are simulated 10,000 times with \(n\in \{30,90,160,300\}\). The occurrences of the response \(Z_i\sim \) RUBXII\((q_i,c)\) are obtained by the inversion method through the qf in Equation (7). The covariate \(x_{i2}\) is generated from a uniform distribution on the interval \((-3,3)\) (scenarios 1 and 2), and a standard normal distribution (scenarios 3 and 4). The R programming language (R Core Team 2021) is used to perform the simulation study.
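The data-generating step of one Monte Carlo replicate can be sketched in Python as follows (the study itself uses R). The parameter values, the seed, and the function names are assumptions made for illustration; they are not the values of the four scenarios.

```python
import math
import random

def rubxii_qf_q(u, q, c, tau=0.5):
    """Quantile function in Eq. (7)."""
    r_q = math.log1p((-math.log1p(-q)) ** c)
    x = ((1.0 - u) ** (r_q / math.log(1.0 - tau)) - 1.0) ** (1.0 / c)
    return 1.0 - math.exp(-x)

def simulate_sample(n, xi1, xi2, c, rng, tau=0.5):
    """One replicate: list of (z_i, x_i2, q_i) under the logit link."""
    sample = []
    for _ in range(n):
        x2 = rng.uniform(-3.0, 3.0)                      # covariate as in scenarios 1 and 2
        q = 1.0 / (1.0 + math.exp(-(xi1 + xi2 * x2)))    # inverse logit of eta_i
        z = rubxii_qf_q(rng.random(), q, c, tau)         # inversion method
        sample.append((z, x2, q))
    return sample

rng = random.Random(1)                                   # fixed seed
sample = simulate_sample(5000, xi1=-2.0, xi2=0.3, c=2.0, rng=rng)

# Each z_i falls below its own q_i with probability tau = 0.5.
below = sum(z < q for z, _, q in sample) / len(sample)
print(below)
```

Replacing `rng.uniform(-3.0, 3.0)` with `rng.gauss(0.0, 1.0)` gives a standard normal covariate, as in scenarios 3 and 4. In a full experiment, each replicate would be followed by a maximum likelihood fit.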

The percentage relative bias (RB) and root mean squared error (RMSE) of the estimates in \(\varvec{\theta }\) are determined. Table 1 lists the results for these measures. Low RB values are noted even for small sample sizes. Considering all the scenarios and sample sizes, the RBs of the estimates of \(\xi _1\) and \(\xi _2\) are less than \(4\%\), and those of c are less than \(15\%\). On the other hand, the RMSE quickly goes to zero when n increases, thus in agreement with the asymptotic properties of the MLEs.
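The two performance measures can be computed as in the Python sketch below. The formulas are the usual definitions, assumed here because they are not written out in the text: RB is the mean estimate minus the true value, divided by the true value and multiplied by 100, and RMSE is the square root of the mean squared deviation from the true value. The estimates in the example are made up.

```python
import math
import statistics

def relative_bias_pct(estimates, true_value):
    """Percentage relative bias: 100 * (mean estimate - true) / true."""
    return 100.0 * (statistics.fmean(estimates) - true_value) / true_value

def rmse(estimates, true_value):
    """Root mean squared error around the true parameter value."""
    return math.sqrt(statistics.fmean([(e - true_value) ** 2 for e in estimates]))

# Made-up estimates from four replicates of a parameter whose true value is 2.
estimates = [1.9, 2.1, 2.2, 2.0]
print(relative_bias_pct(estimates, 2.0))     # about 2.5 (percent)
print(rmse(estimates, 2.0))                  # about 0.1225
```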

Table 1 Simulation results from the RUBXII regression

2.3 Regression model adequacy

In this section, some methods are presented to analyze whether a fitted regression is suitable for a data set. As goodness-of-fit measures of the RUBXII regression, the maximized log-likelihood value (LL), a normality test for the quantile residuals (Dunn and Smyth 1996), generalized pseudo-\(\hbox {R}^2\) (\(\hbox {R}_G^2\)), and a RESET-type test are considered. The same measures are adopted to compare the proposed regression with other suitable regressions for proportional data.

The quantile residuals for the RUBXII regression are

$$\begin{aligned} {\varvec{r}}=\Phi ^{-1}[F_Z({\varvec{z}};\hat{{\varvec{q}}},{\hat{c}})], \end{aligned}$$

where \(F_Z(\cdot )\) is the cdf of the RUBXII distribution given in Eq. (5) and \(\Phi ^{-1}(\cdot )\) is the qf of the standard normal distribution. If the fit is adequate, it is expected that the distribution of the quantile residuals is close to the standard normal. To check whether this assumption is satisfied, the well-known Shapiro–Wilk (SW) normality test can be performed.
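A minimal Python sketch of the quantile residuals is shown below, using the cdf in Eq. (5) and the standard normal qf from the standard library. The observations, fitted medians, and value of \({\hat{c}}\) are hypothetical.

```python
import math
from statistics import NormalDist

def rubxii_cdf_q(z, q, c, tau=0.5):
    """Eq. (5): RUBXII cdf under the quantile parameterization."""
    r_z = math.log1p((-math.log1p(-z)) ** c)
    r_q = math.log1p((-math.log1p(-q)) ** c)
    return 1.0 - math.exp(math.log(1.0 - tau) * r_z / r_q)

def quantile_residuals(z, q_hat, c_hat, tau=0.5):
    """r_i = Phi^{-1}[F_Z(z_i; q_hat_i, c_hat)]."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(rubxii_cdf_q(zi, qi, c_hat, tau))
            for zi, qi in zip(z, q_hat)]

# Hypothetical observations and fitted medians.
z = [0.010, 0.020, 0.035]
q_hat = [0.012, 0.020, 0.030]
res = quantile_residuals(z, q_hat, c_hat=1.5)
print(res)   # negative, approximately zero, positive
```

An observation equal to its fitted median has a residual of zero when \(\tau =0.5\). The residuals can then be passed to a normality test; in Python, `scipy.stats.shapiro` is one option if SciPy is available.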

The \(\hbox {R}_G^2\) is useful to assess the proportion of the response variable’s variation explained by the regression. It is defined by Nagelkerke (1991) as

$$\begin{aligned} R^2_G=1-\mathrm {exp} \left\{ -2/n\,[\ell (\hat{{\varvec{\theta }}})-\ell (\hat{{\varvec{\theta }}}_0)] \right\} , \end{aligned}$$

where \(\ell (\hat{{\varvec{\theta }}}_0)\) is the log-likelihood for the null model, i.e., modeling the response without covariates, and \(\ell (\hat{{\varvec{\theta }}})\) is the log-likelihood of the fitted regression. A regression with a higher value of \(R^2_G\) has greater explanatory power for the response variable’s variation.
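The formula is simple to evaluate once both maximized log-likelihoods are available. In the Python sketch below, the function name and the numerical inputs are made up for illustration and are not taken from the fitted regressions.

```python
import math

def pseudo_r2(loglik_full, loglik_null, n):
    """Generalized pseudo-R^2: 1 - exp{-(2/n) [l(theta_hat) - l(theta_hat_0)]}."""
    return 1.0 - math.exp(-2.0 / n * (loglik_full - loglik_null))

# Made-up log-likelihood values for a sample of size n = 20.
print(pseudo_r2(loglik_full=-5.0, loglik_null=-10.0, n=20))   # 1 - exp(-0.5)
```

The measure is zero when the covariates do not improve the log-likelihood and grows toward one as the improvement increases.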

A RESET-type test introduced by Pereira and Cribari-Neto (2014) can be adopted to detect possible specification errors in the regression. The null hypothesis of this test is that the regression is correctly specified. It may be conducted in the following way: (i) fit the regression and obtain the fitted values \(\hat{{\varvec{q}}}=({\hat{q}}_1,\ldots ,{\hat{q}}_n)^\top \) of \({\varvec{q}}=(q_1,\ldots ,q_n)^\top \) using (8); (ii) compute powers of second and third degrees of \(\hat{{\varvec{q}}}\), i.e., get \(\hat{{\varvec{q}}}^2=({\hat{q}}_1^2,\ldots ,{\hat{q}}_n^2)^\top \) and \(\hat{{\varvec{q}}}^3=({\hat{q}}_1^3,\ldots ,{\hat{q}}_n^3)^\top \); and (iii) using these powers as additional covariates, fit the augmented regression, and test their significance through the likelihood ratio (LR) test.

The LR statistic is \(\omega =2[\ell (\hat{{\varvec{\theta }}})-\ell (\tilde{{\varvec{\theta }}})]\), where \(\ell (\hat{{\varvec{\theta }}})\) and \(\ell (\tilde{{\varvec{\theta }}})\) are the unrestricted and restricted maximized log-likelihood functions, respectively. Under the null hypothesis, \(\omega \) converges in distribution to a chi-squared with \(\nu \) degrees of freedom, that is, \(\omega \xrightarrow []{D}\chi ^2_\nu \), where \(\nu \) is the number of added test variables (\(\nu =2\) in this case).
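For \(\nu =2\), the p-value of the test has a closed form because the survival function of the \(\chi ^2_2\) distribution is \(\mathrm {exp}(-\omega /2)\). The Python sketch below uses this fact; the function names and the log-likelihood values are hypothetical.

```python
import math

def lr_statistic(loglik_unrestricted, loglik_restricted):
    """omega = 2 [l(theta_hat) - l(theta_tilde)]."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def chi2_sf_2df(omega):
    """P(chi^2_2 > omega) = exp(-omega / 2); valid only for 2 degrees of freedom."""
    return math.exp(-omega / 2.0)

# Made-up maximized log-likelihoods of the augmented and original fits.
omega = lr_statistic(-100.0, -101.2)
print(omega, chi2_sf_2df(omega))
```

A p-value above the chosen significance level gives no evidence against the null hypothesis of correct specification. For other values of \(\nu \), a general chi-squared survival function would be needed.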

3 Results and discussion

By August 19, 2020, about eight months after the emergence of the coronavirus, the Centers for Disease Control and Prevention (CDC) had reported a total of 5,650,176 confirmed cases and 175,789 deaths in the U.S., which corresponds to a lethality of 3.1%. Also, the adoption of systematic non-pharmaceutical interventions seems to have decreased mortality. Thus, understanding the relationship of demographic, socioeconomic, health care resource, and behavioral variables with the mortality rate has become a crucial task. In this sense, this section presents an application of the RUBXII regression, alongside two other well-known regression models, relating the mortality rate to these possible predictor variables.

The information available on the disease is as abundant as it is scattered and unreliable. Therefore, before the analysis, data mining is carried out to construct the database described in Sect. 3.1. The regression models chosen in this study take into account an essential characteristic of the mortality rate: it belongs to the interval (0, 1).

3.1 Descriptive statistical analysis

The response variable is the COVID-19 death rate in the U.S. states. This rate is calculated for the 50 states from data made available by the CDC (Centers for Disease Control and Prevention 2020). For each state, we consider the total number of deaths per hundred people at 30, 90, and 180 days after the 10th detected case, so that comparisons refer to the same period of the epidemic. In this way, a panel with three observations for each state is structured.

For all states, the population density, Gini coefficient, hospital beds, smoking rate, poverty rate, and life expectancy are obtained from the following sources: World Population Review, Global Data Lab, World Atlas, Kaiser Family Foundation, Iowa Community Indicators Program of the Iowa State University, and County Health Rankings and Roadmaps. The response variable and covariates are defined below:

  1. MR: Mortality rate (response variable) (Centers for Disease Control and Prevention 2021).

  2. PD: Population density (p/\(\hbox {mi}^2\)) (data of 2020) (World Population Review 2020c).

  3. GINI: Gini coefficient (data of 2017) (World Atlas 2017).

  4. BEDS: Hospital beds per 100 thousand inhabitants (data of 2018) (Kaiser Family Foundation 2018).

  5. SR: Smoking rate by state (data of 2020) (World Population Review 2020b).

  6. PR: Poverty rate (data of 2020) (World Population Review 2020a).

  7. LE: Life expectancy (data of 2018) (County Health Rankings & Roadmaps 2018).

  8. \(\hbox {T}_{{90}}\): dummy that is equal to one if the response observation corresponds to the mortality rate 90 days after the 10th confirmed case, and zero otherwise.

  9. \(\hbox {T}_{{180}}\): dummy that is equal to one if the response observation corresponds to the mortality rate 180 days after the 10th confirmed case, and zero otherwise.

Table 2 gives some descriptive measures of these variables. The MR has a high coefficient of variation (CV) in all time periods, the highest being at 30 days with a CV of about \(126\%\). Also, in the three time periods (30, 90, and 180 days), the response presents positive skewness and a mean that is not close to the median, and at 30 and 90 days its kurtosis is greater than three, indicating a leptokurtic distribution. The GINI and LE covariates have the lowest variability, with CVs ranging between about \(2\%\) and \(4\%\). On the other hand, the PD covariate has the highest CV, about \(130\%\), and takes values over a wide range, since its minimum and maximum are around 1 p/\(\hbox {mi}^2\) (Alaska) and 1,215 p/\(\hbox {mi}^2\), respectively. The BEDS, SR, and PR covariates have similar CVs, varying from around \(21\%\) to \(28\%\). Moreover, their means are close to their medians, and their kurtosis is lower than three. Only the LE covariate has negative skewness.

Table 2 Descriptive statistics

Figure 2 displays the histogram of the MR and box plots of the three panel observations, i.e., the MR at 30, 90, and 180 days. The histogram and the three box plots agree with the figures in Table 2. The MR at 30, 90, and 180 days has a right-skewed distribution and presents some outliers. According to the box plots, the mortality rate clearly increased substantially at 90 and 180 days after the 10th recorded case.

Fig. 2 Histogram of the MR and box plots of the MR at 30, 90, and 180 days after the 10th confirmed case

3.1.1 Correlation analysis

Initially, we present some dispersion plots of the response variable against each covariate; see Fig. 4. There is no indication of a linear relationship among them. Then, Fig. 3 displays the correlation matrix of the current variables based on the Spearman method. To study the significance of these correlations, a Spearman correlation test, which is non-parametric, is carried out. The null hypothesis (\({\mathcal {H}}_0\)) of this test is that the population correlation coefficient between two variables is equal to zero, i.e., there is no correlation. Under \({\mathcal {H}}_0\), the test statistic approximately follows a Student’s t distribution with \((n-2)\) degrees of freedom, where n is the sample size. The p-values of the test are given in Table 3.
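The Spearman coefficient and its t-type statistic can be computed as in the Python sketch below. It assumes the usual form \(t=\rho \sqrt{(n-2)/(1-\rho ^2)}\), which is not written out in the text, and handles ties with average ranks. The function names and the data are made up for illustration.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)

def spearman_t(x, y):
    """Spearman's rho and the statistic t = rho * sqrt((n - 2) / (1 - rho^2))."""
    rho = pearson(average_ranks(x), average_ranks(y))
    n = len(x)
    if abs(rho) >= 1.0:
        return rho, math.copysign(math.inf, rho)
    return rho, rho * math.sqrt((n - 2) / (1.0 - rho ** 2))

# Made-up data with n = 5.
rho, t = spearman_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho, t)
```

The p-value follows from comparing t with a Student's t distribution with \(n-2\) degrees of freedom; `scipy.stats.spearmanr` returns both the coefficient and a p-value if SciPy is available.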

Fig. 3 Correlation matrix

In a first analysis, note that the response variable is positively correlated with PD, which shows the highest correlation with the MR among all covariates (see Fig. 3). Moreover, this correlation is significant; see Table 3. Hence, the MR tends to increase with PD. Indeed, according to Rocklöv and Sjödin (2020), the contact rate for COVID-19 is proportional to population density. Observe also that there is a statistically significant correlation between the MR and the Gini coefficient (Table 3). A similar finding was reported by Oronce et al. (2020).

Fig. 4 Dispersion plots

Table 3 p-values of the Spearman correlation test between all variables

3.2 Fitted regressions

In what follows, the relationship between the covariates and the MR is explored more deeply through regression analysis. The goodness-of-fit measures are investigated for the RUBXII regression defined in Sect. 2, together with two competing regressions, to study the effects of the covariates given in Sect. 3.1 on the median of the coronavirus mortality rate in the U.S. states. The well-known Kw regression (Mitnik and Baek 2013) and the UW quantile regression (Mazucheli et al. 2020) are considered for comparison purposes. The densities of the random component of each competing regression are given below.

Let Z be a random variable that follows a Kw distribution under the median-dispersion parameterization (Mitnik and Baek 2013), say \(Z\sim \) Kw\((q,c)\). Then its pdf is (for \(z\in (0,1)\))

$$\begin{aligned} f(z;q,c)&=\, \frac{\log 0.5}{c\log (1-q ^{1/c})}z^{1/c-1} (1-z^{1/c})^ {\log 0.5/\log (1-q^{1/c})-1}, \end{aligned}$$
(10)

where \(0<q<1\) is the median of Z and \(c>0\) is a dispersion parameter.

Recently, Mazucheli et al. (2020) proposed the UW quantile regression. Let \(Z\sim \) UW\((q,c)\) be a random variable having the UW law. Then its pdf is (for \(z\in (0,1)\))

$$\begin{aligned} f(z;q,c)=\frac{c}{z} \left( \frac{\log \tau }{\log q} \right) \left( \frac{\log z}{\log q} \right) ^{c-1} \tau ^{(\log z/\log q)^c}, \end{aligned}$$
(11)

where \(0<q<1\) is the \(\tau \)th quantile, c is a shape parameter, and \(\tau \in (0,1)\) is assumed known. Here, it will be considered that \(\tau =0.5\) to model the median of Z.

Table 4 gives the parameter estimates and associated p-values of the final RUBXII, Kw, and UW regressions fitted to the coronavirus death rates across the U.S. states. The significance of the estimates is adopted as the criterion for choosing the variables in the final fits. The PR and LE covariates were not significant at the usual significance levels (\(1\%,5\%\), and \(10\%\)) in any of the considered regressions. According to Table 4, when the RUBXII regression is fitted, most of the covariates are significant at the \(1\%\) level, the exception being BEDS, which is significant at \(10\%\). The other fitted regressions do not capture the effect of the BEDS covariate. Besides, the SR covariate is not statistically significant in the fitted UW regression.

Table 4 Fitted regressions for the median of the MR by COVID-19 in the U.S. states

The goodness-of-fit measures of the fitted regressions given in Table 4 are reported in Table 5. The RUBXII regression has the best adequacy measures. It presents the highest LL value, and the p-value of its SW test is above the usual nominal significance levels. Further, its \(\hbox {R}^2_G\) is the greatest, indicating that the fitted RUBXII regression explains \(76.57\%\) of the median response variability. The p-values of the SW test for the residuals of the Kw and UW regressions are lower than 0.05. Hence, we reject the null hypothesis that the residual distribution is normal at the \(5\%\) significance level. Therefore, these regressions are inadequate for the current data. The p-values of the RESET-type (RES) tests indicate that all fitted regressions are correctly specified at the usual significance levels. Thus, the results in Table 5 favor the RUBXII regression over the Kw and UW regressions, showing its superiority in terms of model fit and of the significance of the BEDS covariate for the COVID-19 mortality rates in the U.S. states.

Table 5 Goodness-of-fit measures for the final fitted regressions

Figure 5 displays a normal Q–Q plot of the quantile residuals of each fitted regression to assess whether they are normally distributed. The plots corroborate the results in Table 5, indicating that the residuals of the RUBXII regression are closer to a normal distribution, since the data points closely follow the straight red line. For the other regressions, mainly the UW regression, the Q–Q plots reveal a lack of fit of the residuals to the standard normal distribution.

Fig. 5 Normal Q–Q plot for the quantile residuals of the RUBXII, Kw, and UW fitted regressions

After the above analysis, there is evidence that the RUBXII regression provides a better fit quality. Therefore, from the estimates of the RUBXII regression parameters reported in Table 4, its regression equation can be expressed as

$$\begin{aligned} \log \,[{\hat{q}}_i/(1-{\hat{q}}_i)]&= -13.1507+ 0.0021\,\text {PD}_i + 12.6735\,\text {GINI}_i -0.1905\,\text {BEDS}_i\,\\&\quad + 8.8401\,\text {SR}_i +1.8856\,\text {T}_{90_i}+ 2.5751\,\text {T}_{180_i}.\nonumber \end{aligned}$$
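Because the link is the logit, each coefficient acts multiplicatively on the odds \({\hat{q}}_i/(1-{\hat{q}}_i)\) of the median mortality rate. The Python sketch below uses the estimates of the fitted equation above to compute these multipliers; the function names and the chosen increments are illustrative assumptions.

```python
import math

# Estimates taken from the fitted RUBXII regression equation.
coef = {"PD": 0.0021, "GINI": 12.6735, "BEDS": -0.1905,
        "SR": 8.8401, "T90": 1.8856, "T180": 2.5751}

def odds_multiplier(beta, delta=1.0):
    """Factor applied to q/(1-q) when a covariate increases by delta units."""
    return math.exp(beta * delta)

def inverse_logit(eta):
    """Median q implied by a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Odds at 90 and 180 days relative to 30 days, other covariates held fixed.
print(odds_multiplier(coef["T90"]), odds_multiplier(coef["T180"]))

# Effect of 100 additional people per square mile (illustrative increment).
print(odds_multiplier(coef["PD"], delta=100.0))
```

A multiplier above one indicates that the odds of the median increase with the covariate, and a multiplier below one (as for BEDS) indicates a decrease. For small medians, the odds are close to the median itself, so the multipliers approximate relative changes in the median.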

Based on the fitted RUBXII regression, some findings of the modeling mortality rate’s median by COVID-19 in the U.S. states are now presented.

  • The PD presents a p-value lower than 0.0001, and its associated estimate is positive, which indicates that the MR is higher in more densely populated states. Similarly, Wong and Li (2020) showed that population density is an effective predictor of cumulative infection cases in the U.S. at the county level. According to this study, low population density offers a strong protective effect against COVID-19 infection.

  • The Gini coefficient is significant at the \(1\%\) level, and its positive estimate means that the MR increases in states with a larger Gini coefficient. This finding corroborates the study of Oronce et al. (2020), who noted that states with higher income inequality had experienced a higher number of deaths from COVID-19.

  • The number of hospital beds is significant at the \(10\%\) level. As expected, the median of the mortality rate decreases when the number of hospital beds per 100 thousand inhabitants increases. According to Janke et al. (2021), U.S. geographic areas with fewer intensive care unit beds, nurses, and general medicine/surgical beds per COVID-19 case were statistically significantly associated with more deaths in April.

  • The SR is highly significant (p-value\( \, =0.0003\)). The median of the mortality rate increases as the SR grows, according to the positive sign of its estimate. This result is expected since the immune response of smokers is potentially reduced (Taghizadeh-Hesary and Akbari 2020).

  • The dummy variables related to the time 90 and 180 days after the 10th confirmed case are significant as expected. As indicated by the box plots in Fig. 2, the MR grows steadily during the considered periods.

4 Concluding remarks

COVID-19 is a pandemic that has spread across the United States of America (U.S.) since January 2020. This paper investigates how demographic, socioeconomic, health care resource, and behavioral variables are related to the COVID-19 mortality rate in the U.S. states. To that end, regressions that account for the double-bounded nature of the mortality rate are chosen. An alternative model, called the reflected unit Burr XII (RUBXII) regression, is introduced; it is a helpful tool for modeling bounded random variables in the interval (0, 1), such as rates, proportions, and indexes. This proposal is based on a new unit continuous distribution that arises from a transformation of a Burr XII distributed random variable. Further, a more general and useful quantile parameterization is introduced to define the quantile regression for unit data. The estimation of the parameters, a simulation study to evaluate the performance of the maximum likelihood estimators, and some adequacy measures to check whether the regression assumptions hold are discussed. After consolidating the data set on the mortality rates and other covariates for the U.S. states, a descriptive statistical analysis and regression modeling are carried out.

In this way, the new regression is compared with the Kumaraswamy and unit-Weibull regressions. The proposed regression is quite competitive with those regressions and provides the best fit according to some selection criteria. Thus, from the fitted RUBXII regression, it is possible to identify that the population density, Gini coefficient, hospital beds, and smoking rate are statistically significant in modeling the median of the COVID-19 mortality rate in the U.S. states. The findings of this paper may improve the understanding of the coronavirus in the U.S. and help the healthcare system better prepare for the advance of the pandemic or even respond to similar epidemics. Interested readers can access all computational codes at https://github.com/tatianefribeiro/RUBXII_Regression_COVID-19/tree/master. Given the potential of the RUBXII regression for analyzing coronavirus data, future research aims to fit this regression to the coronavirus mortality rates in other countries of the world.