The unifed distribution
Abstract
We introduce a new distribution with support on (0,1) called unifed. It can be used as the response distribution for a GLM and it is suitable for data aggregation. We make a comparison to the beta regression. A link to an R package for working with the unifed is provided.
Keywords
Exponential dispersion family, GLM, R, Beta regression
Abbreviations
EDF: Exponential dispersion family
GLM: Generalized linear model
mle: Maximum likelihood estimator
Introduction
We introduce the unifed distribution. It is a continuous distribution with support on the interval (0,1). It can be characterized as the only exponential dispersion family containing the uniform distribution. This makes it suitable for use as the response distribution of a Generalized Linear Model (GLM).
An R package (R Core Team 2017; Quijano Xacur 2019b) has been developed to work with this distribution. It is called unifed and contains functions for the density, distribution function, quantile function and random number generation. It also contains a family object that can be used with the glm function of R. Additionally, the package provides Stan (Stan Development Team 2018) code for performing Bayesian analysis with the unifed, including a function for fitting Bayesian unifed GLMs. Information about the package and how to install it can be found at https://gitlab.com/oquijano/unifed.
This is not the only model for performing regression on the unit interval. The beta regression (Ferrari and Cribari-Neto 2004) has existed for a while and provides more flexible shapes than the unifed GLM. One appealing property of the unifed GLM is that it is suitable for data reduction, while the beta regression is not. This is discussed in the “On the difficulties of data aggregation for the beta regression” section.
This paper is divided into 4 sections. In the “Exponential dispersion families and GLMs” section we review the definition and properties of exponential dispersion families and GLMs. The “The unifed distribution” section defines the unifed distribution. In the “An illustrative example” section we illustrate an application to an auto insurance claims example. The “Comparison between the unifed GLM and the beta regression” section reviews the beta regression and highlights its differences from the unifed GLM.
Exponential dispersion families and GLMs
where \(\dot {\kappa }=\kappa '\) and \(\ddot {\kappa }=\dot {\kappa }'\). (Eq. 2) allows us to relate the mean and the variance of any EDF. This motivates the following definitions (see (Jørgensen 1997) or (Jørgensen 1992)).
Definition 0.1.
Definition 0.2.
Definition 0.3.
Weights and data aggregation
GLMs
The population can be divided into different classes according to the values of the explanatory variables. Thus, given a sample, we can group together all the observations that share the same values of the explanatory variables and aggregate them using (Eq. 7). It is important to mention that with this grouping there is no loss of information for estimating the mean since \(\bar {Y}\/\) is a sufficient statistic for θ (but not for ϕ, thus some information is lost for the estimation of ϕ). In this sense we say that GLMs are suitable for data aggregation. At the end of “An illustrative example” section we illustrate this property with real data for a unifed GLM.
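The grouping step described above can be sketched in a few lines of Python. This is only an illustration: the function name `aggregate_by_class` and the record layout are hypothetical, and the aggregation rule is assumed to be the usual weighted mean \(\bar{Y}=\sum_i w_i Y_i/\sum_i w_i\) with total weight \(\sum_i w_i\), since (Eq. 7) is not reproduced here.

```python
from collections import defaultdict


def aggregate_by_class(records):
    """Collapse observations that share the same explanatory-variable values.

    records: iterable of (class_key, response, weight) tuples, where class_key
             is any hashable description of the class, e.g. ("F", 1).
    Returns a dict mapping class_key -> (weighted mean response, total weight).
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # class_key -> [sum of w*y, sum of w]
    for key, y, w in records:
        sums[key][0] += w * y
        sums[key][1] += w
    return {key: (wy / w, w) for key, (wy, w) in sums.items()}


# Toy data (made up for illustration): two observations in one class, one in another.
toy_records = [
    (("F", 1), 0.25, 1.0),
    (("F", 1), 0.75, 1.0),
    (("M", 2), 0.50, 1.0),
]
aggregated = aggregate_by_class(toy_records)
```

Each class is then represented by a single row (mean response and weight), which is the form in which the aggregated data would be passed to a weighted GLM fit.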
where \(\kappa(\theta)=(\kappa(\theta_{1}),\ldots,\kappa(\theta_{m}))^{T}\), \(W=\text{diag}(w_{1},\ldots,w_{m})\) with \(w_{i}\) being the sum of all the weights in the ith class, \(1=(1,\ldots,1)^{T}\) and \(A(y,\phi) = \prod _{i=1}^{m} a\big(y_{i}, \frac {w_{i}}{\phi }\big)\).
Ω^{m}={(μ_{1}⋯μ_{m})^{T}:μ_{1},…,μ_{m}∈Ω}. D is called the deviance of the model. Note that finding the maximum likelihood estimator of \( {\mathcal {B}}\) is equivalent to finding what value of \( {\mathcal {B}}\) minimizes the deviance. For further details about the use and properties of the deviance see (Jørgensen 1992).
The unifed distribution
The unifed family is the Exponential Dispersion Family (EDF) generated by the uniform distribution (see Chapters 2 and 3 of (Jørgensen 1997) to see how an EDF can be generated from a moment generating function). We created the R package unifed (see (Quijano Xacur 2019b)) that includes functions to work with the unifed. In this section we make references to some functions in the package and we use this font format for those references.
where h and κ are as in (Eq. 14) and (Eq. 15), respectively and \(x\in [0,1],\theta \in \mathbb {R}, \phi \in \left \{1,\frac {1}{2},\frac {1}{3},\ldots \right \}\). We denote the unifed distribution with canonical parameter θ and dispersion parameter ϕ with unifed(θ,ϕ).
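For the special case ϕ=1 the density is easy to evaluate directly. The following Python sketch is independent of the unifed R package and rests on two assumptions that are not spelled out in the text above: that κ is the cumulant function of the uniform distribution on (0,1), \(\kappa(\theta)=\log((e^{\theta}-1)/\theta)\) with \(\kappa(0)=0\), and that h equals 1 on the unit interval when ϕ=1. The function names are hypothetical.

```python
import math


def kappa(theta: float) -> float:
    """Cumulant function of the uniform(0, 1) law: log((e^theta - 1) / theta)."""
    if abs(theta) < 1e-8:
        return theta / 2.0  # first-order expansion around 0; kappa(0) = 0
    if theta > 0:
        # log((e^t - 1)/t) = t + log((1 - e^{-t})/t); avoids overflow of e^t
        return theta + math.log(-math.expm1(-theta) / theta)
    return math.log(math.expm1(theta) / theta)


def unifed_density(x: float, theta: float) -> float:
    """Density of unifed(theta, 1): exp(theta * x - kappa(theta)) on [0, 1]."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    return math.exp(theta * x - kappa(theta))


def integrate_density(theta: float, n: int = 20000) -> float:
    """Midpoint-rule check that the density integrates to one."""
    h = 1.0 / n
    return sum(unifed_density((i + 0.5) * h, theta) for i in range(n)) * h
```

With θ=0 the sketch reduces to the uniform density, and `integrate_density` can be used as a quick sanity check for other values of θ. Other values of ϕ require the function h, which is not covered by this sketch.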
Float overflow of the Irwin-Hall implementation
Code                  Result
dirwin.hall(35,50)    0.0674864
dirwin.hall(36,50)    13.12745
dirwin.hall(37,50)    45.44388
dirwin.hall(38,50)    37.44488
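The kind of numerical breakdown shown in the table can be reproduced with the textbook formula for the Irwin-Hall density, \(f(x;n)=\frac{1}{(n-1)!}\sum_{k=0}^{\lfloor x\rfloor}(-1)^{k}\binom{n}{k}(x-k)^{n-1}\). The Python sketch below evaluates this alternating sum once in double precision and once in exact rational arithmetic. Whether dirwin.hall evaluates the sum in exactly this way is an assumption, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial, floor


def irwin_hall_density_float(x: float, n: int) -> float:
    """Alternating sum evaluated in double precision (prone to cancellation)."""
    total = 0.0
    for k in range(floor(x) + 1):
        total += (-1) ** k * comb(n, k) * (x - k) ** (n - 1)
    return total / factorial(n - 1)


def irwin_hall_density_exact(x, n: int) -> Fraction:
    """Same sum in exact rational arithmetic, so no rounding error occurs."""
    x = Fraction(x)
    total = Fraction(0)
    for k in range(floor(x) + 1):
        total += (-1) ** k * comb(n, k) * (x - k) ** (n - 1)
    return total / factorial(n - 1)
```

Comparing `irwin_hall_density_float(35, 50)` with `float(irwin_hall_density_exact(35, 50))` illustrates the issue: the individual terms are many orders of magnitude larger than the final value, so rounding error in the floating-point sum can dominate the result. Exact arithmetic avoids this at the cost of speed.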
where \(\dot {\kappa }\) and \(\ddot {\kappa }\) are the first and second derivatives of κ, respectively. We have not been able to find an analytical expression for the inverse function \(\dot {\kappa }^{-1}\). Consequently, we have no analytical expressions for the variance function or the unit deviance of the unifed either. Nevertheless, the unifed package contains the function unifed.kappa.prime.inverse, which uses the Newton-Raphson method to compute the inverse of \(\dot {\kappa }\) numerically. This allows us to evaluate the variance function numerically through the relation \(\mathbf {V}(\mu) = \ddot {\kappa } (\dot {\kappa }^{-1}(\mu))\). This is implemented in the function unifed.varf.
The function unifed.unit.deviance computes the unit deviance using (Eq. 20). As mentioned in the “Exponential dispersion families and GLMs” section, the unit deviance can be used to reparametrize the distribution in terms of its mean and dispersion parameter. We denote with unifed^{∗}(μ,ϕ) the unifed distribution with mean μ and dispersion parameter ϕ and when ϕ=1, we write simply unifed^{∗}(μ).
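A minimal Python sketch of this numerical inversion is given below. It is not the package's implementation: it assumes \(\kappa(\theta)=\log((e^{\theta}-1)/\theta)\), from which \(\dot{\kappa}(\theta)=\frac{1}{1-e^{-\theta}}-\frac{1}{\theta}\) and \(\ddot{\kappa}(\theta)=\frac{1}{\theta^{2}}-\frac{1}{4\sinh^{2}(\theta/2)}\) follow by differentiation, and it uses short Taylor expansions near θ=0 to avoid cancellation. All function names, the starting point and the tolerances are illustrative choices.

```python
import math


def kappa_prime(theta: float) -> float:
    """First derivative of kappa; equals the mean of unifed(theta, 1)."""
    if abs(theta) < 1e-3:
        return 0.5 + theta / 12.0 - theta**3 / 720.0  # Taylor expansion at 0
    if theta > 0:
        return 1.0 / (-math.expm1(-theta)) - 1.0 / theta
    return math.exp(theta) / math.expm1(theta) - 1.0 / theta


def kappa_double_prime(theta: float) -> float:
    """Second derivative of kappa."""
    if abs(theta) < 1e-2:
        return 1.0 / 12.0 - theta**2 / 240.0 + theta**4 / 6048.0
    if abs(theta) > 700.0:
        return 1.0 / theta**2  # the sinh term is negligible and would overflow
    return 1.0 / theta**2 - 1.0 / (4.0 * math.sinh(theta / 2.0) ** 2)


def kappa_prime_inverse(mu: float, tol: float = 1e-10, max_iter: int = 200) -> float:
    """Solve kappa_prime(theta) = mu for theta by Newton-Raphson, starting at 0."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must lie in (0, 1)")
    theta = 0.0
    for _ in range(max_iter):
        step = (kappa_prime(theta) - mu) / kappa_double_prime(theta)
        theta -= step
        if abs(step) <= tol * (1.0 + abs(theta)):
            return theta
    raise RuntimeError("Newton-Raphson did not converge")


def variance_function(mu: float) -> float:
    """V(mu) = kappa''(kappa'^{-1}(mu)), evaluated numerically."""
    return kappa_double_prime(kappa_prime_inverse(mu))
```

As a sanity check, `variance_function(0.5)` corresponds to θ=0, that is, to the uniform distribution, whose variance is 1/12.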
Maximum likelihood estimation
where \(\bar {X}=\sum _{i=1}^{n}X_{i} / n\). The function unifed.mle in the unifed R package computes the mle using (Eq. 21). It is possible to use the unifed distribution as the response distribution of a GLM. In this case, ϕ must be fixed to one and the weight of each class is the number of observations in the class. The mle \(\hat {\mathcal {B}}\) of the regression coefficients can be found using iterative weighted least squares. Section 2.5 of (McCullagh and Nelder 1989) shows that this method works for any response distribution whose density can be expressed as (Eq. 8). Thus, the method also works for the unifed. The unifed R package (Quijano Xacur 2019b) provides the function unifed, which returns a family object that can be used inside the glm function.
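The estimation of θ from a sample can be illustrated with a small simulation. The Python sketch below assumes that, as for any EDF with known dispersion, the mle solves \(\dot{\kappa}(\hat{\theta})=\bar{X}\) (the exact form of (Eq. 21) is not reproduced here), takes \(\kappa(\theta)=\log((e^{\theta}-1)/\theta)\) and ϕ=1, and inverts \(\dot{\kappa}\) by bisection rather than by the Newton-Raphson method used in the package. Function names and the simulated data are hypothetical.

```python
import math
import random


def kappa_prime(theta: float) -> float:
    """Mean of unifed(theta, 1); increasing in theta with values in (0, 1)."""
    if abs(theta) < 1e-3:
        return 0.5 + theta / 12.0 - theta**3 / 720.0
    if theta > 0:
        return 1.0 / (-math.expm1(-theta)) - 1.0 / theta
    return math.exp(theta) / math.expm1(theta) - 1.0 / theta


def sample_unifed(theta: float, n: int, rng: random.Random) -> list:
    """Inverse-CDF sampling, using F(x) = (e^{theta x} - 1) / (e^theta - 1)."""
    if theta == 0.0:
        return [rng.random() for _ in range(n)]
    scale = math.expm1(theta)
    return [math.log1p(rng.random() * scale) / theta for _ in range(n)]


def mle_theta(sample, tol: float = 1e-10) -> float:
    """Solve kappa_prime(theta) = sample mean by bisection."""
    mean = sum(sample) / len(sample)
    if not 0.0 < mean < 1.0:
        raise ValueError("sample mean must lie in (0, 1)")
    lo, hi = -1.0, 1.0
    while kappa_prime(lo) > mean:  # widen the bracket to the left
        lo *= 2.0
    while kappa_prime(hi) < mean:  # widen the bracket to the right
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kappa_prime(mid) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(0)
simulated = sample_unifed(2.0, 20000, rng)
theta_hat = mle_theta(simulated)
```

With a sample of this size, `theta_hat` should be close to the value θ=2 used for the simulation. Note that the estimate depends on the data only through the sample mean.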
An illustrative example
Vehicle insurance variables
Variable name  Description
veh_value      vehicle value, in $10,000s
exposure       0-1
clm            occurrence of claim (0 = no, 1 = yes)
numclaims      number of claims
claimcst0      claim amount (0 if no claim)
veh_body       vehicle body, coded as
               BUS
               CONVT = convertible
               COUPE
               HBACK = hatchback
               HDTOP = hardtop
               MCARA = motorized caravan
               MIBUS = minibus
               PANVN = panel van
               RDSTR = roadster
               SEDAN
               STNWG = station wagon
               TRUCK
               UTE = utility
veh_age        age of vehicle: 1 (youngest), 2, 3, 4
gender         gender of driver: M, F
area           driver’s area of residence: A, B, C, D, E, F
agecat         driver’s age category: 1 (youngest), 2, 3, 4, 5, 6
We are interested in modeling the exposure, which is the proportion of the year during which the insurance policy is in force for a given client. We use gender, agecat, area and veh_age as the explanatory variables.
The R code used to obtain the results that follow can be found in (Quijano Xacur 2019a).
Table 3 Summary of Unifed GLM
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.3319    0.0197      16.84    0.0000 ***
genderM      0.0288    0.0090      3.20     0.0014 **
agecat2      0.0011    0.0184      0.06     0.9518
agecat3      0.0530    0.0178      2.97     0.0029 **
agecat4      0.0583    0.0178      3.28     0.0010 **
agecat5      0.1042    0.0189      5.51     0.0000 ***
agecat6      0.0692    0.0210      3.30     0.0010 ***
areaB        0.0239    0.0135      1.77     0.0761 .
areaC        0.0014    0.0121      0.11     0.9086
areaD        0.0053    0.0157      0.34     0.7337
areaE        0.0120    0.0175      0.68     0.4948
areaF        0.0879    0.0214      4.10     0.0000 ***
veh_age2     0.1708    0.0138      12.40    0.0000 ***
veh_age3     0.1613    0.0133      12.16    0.0000 ***
veh_age4     0.1549    0.0134      11.53    0.0000 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for unifed family taken to be 1)
Null deviance:      585.47 on 287 degrees of freedom
Residual deviance:  297.86 on 273 degrees of freedom
A χ^{2} test for goodness of fit is commonly used for GLMs. The null hypothesis is that the data are distributed according to the fitted GLM. Under the null hypothesis, the residual deviance reported at the bottom of Table 3 follows a χ^{2} distribution with 273 degrees of freedom. The p-value for this example is \(\mathbb {P}(\chi _{273}^{2}\ge 297.86)=0.14\). The caveat is that the χ^{2} distribution of the residual deviance is an asymptotic result that holds as the smallest weight among all classes goes to infinity (see (Jørgensen 1992, Section 3.6)). The smallest observed weight here is 4, which corresponds to the class with gender=F, agecat=6, area=F and veh_age=1. Therefore the χ^{2} test is not reliable for this example.
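The reported p-value can be checked without a statistics library. The Python sketch below uses the Wilson-Hilferty normal approximation to the χ^{2} distribution, which is a substitute for the exact χ^{2} tail probability and is accurate enough here because the number of degrees of freedom is large. The function name is hypothetical.

```python
import math


def chi2_sf_wilson_hilferty(x: float, df: int) -> float:
    """Approximate P(chi2_df >= x) via the Wilson-Hilferty cube-root transform.

    (X / df)^(1/3) is treated as normal with mean 1 - 2/(9 df)
    and variance 2/(9 df).
    """
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


# Residual deviance and degrees of freedom taken from the summary table.
p_value = chi2_sf_wilson_hilferty(297.86, 273)
```

The approximation gives a value of roughly 0.14, in line with the p-value quoted above.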
Verifying data aggregation:
Table 4 Summary of Unifed GLM without Data Aggregation
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0.3319    0.0197      16.84    0.0000 ***
genderM      0.0288    0.0090      3.20     0.0014 **
agecat2      0.0011    0.0184      0.06     0.9518
agecat3      0.0530    0.0178      2.97     0.0029 **
agecat4      0.0583    0.0178      3.28     0.0010 **
agecat5      0.1042    0.0189      5.51     0.0000 ***
agecat6      0.0692    0.0210      3.30     0.0010 ***
areaB        0.0239    0.0135      1.77     0.0761 .
areaC        0.0014    0.0121      0.11     0.9086
areaD        0.0053    0.0157      0.34     0.7337
areaE        0.0120    0.0175      0.68     0.4948
areaF        0.0879    0.0214      4.10     0.0000 ***
veh_age2     0.1708    0.0138      12.40    0.0000 ***
veh_age3     0.1613    0.0133      12.16    0.0000 ***
veh_age4     0.1549    0.0134      11.53    0.0000 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for unifed family taken to be 1)
Null deviance:      113445 on 67855 degrees of freedom
Residual deviance:  113158 on 67841 degrees of freedom
Comparing Tables 3 and 4, one can see that the estimated coefficients are the same in both cases. Thus, even though the deviances of the two models differ, they give the same mle for the coefficients. This is what we mean by suitability for data aggregation.
Comparison between the unifed GLM and the beta regression
The beta regression (Ferrari and Cribari-Neto 2004) is a versatile model for applications with a response variable on the unit interval. Moreover, the well-documented R package betareg (Cribari-Neto and Zeileis 2010) makes it a practical tool in many applications.
The beta regression
where \( {\mathcal {B}}\) and γ are regression coefficients.
These regression models offer great flexibility when the response variable lies in the interval (0,1), and both are implemented in the R package betareg (R Core Team 2017; Cribari-Neto and Zeileis 2010).
The beta distribution is not an EDF and therefore the beta regression is not a GLM. Nevertheless, the parametrization chosen by the authors of the model, along with (Eq. 23), gives it a similar look and feel.
On the difficulties of data aggregation for the beta regression

\(\bar {Y}\) is a sufficient statistic for μ

The distribution of \(\bar {Y}\) belongs to the same family as the Y_{i}’s in (Eq. 7).
The factorization theorem (see (Hogg et al. 2019, Chapter 7)) implies that \(T=\prod _{i=1}^{n} \frac {y_{i}}{1-y_{i}}\) is sufficient for μ. However, using T for data aggregation would require the distribution of T, which is not beta. In other words, a regression model whose response distribution is a family that includes the distribution of T for every n would need to be developed.
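The factorization behind this statement can be written out explicitly. The following is a sketch under the assumption that the beta density uses the mean-precision parametrization of (Ferrari and Cribari-Neto 2004), \(Y_{i}\sim \text{Beta}(\mu \phi,(1-\mu)\phi)\), with the precision ϕ treated as known; the density itself is not reproduced in the text above.

```latex
\[
\prod_{i=1}^{n} f(y_i;\mu,\phi)
= \left[\frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\right]^{n}
  \left(\prod_{i=1}^{n}\frac{y_i}{1-y_i}\right)^{\mu\phi}
  \prod_{i=1}^{n}\frac{(1-y_i)^{\phi-1}}{y_i}
\]
```

The first two factors involve the data only through T, and the last factor does not depend on μ, which is the form required by the factorization theorem.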
Differences between the unifed GLM and the beta regression
In cases where a beta regression and a unifed GLM give similarly good fits, the parsimony principle suggests choosing the unifed GLM, since it has one fewer parameter: the dispersion parameter is known for the unifed GLM.
From a numerical point of view, the unifed GLM has the advantage that it is possible to use (Eq. 7) for data reduction. This is a practical advantage when dealing with large datasets, especially if simulations of the response vector need to be performed.
Conclusion
This paper introduced a new distribution called unifed. It is the Exponential Dispersion Family generated by the uniform distribution. It allows fitting a GLM to responses on the unit interval (0,1). An R package for working with this distribution is provided.
We made a comparison to the beta regression, which is another regression model for responses on the unit interval. It provides more flexible shapes and therefore it can give better fit than a unifed GLM in many situations. In contrast, the unifed GLM is suitable for data aggregation which is a practical advantage when working with large datasets.
An application using publicly available data was presented.
Notes
Acknowledgements
Not applicable.
Authors’ contributions
All contributions were made by the author of the article, Oscar Alberto Quijano Xacur.
Funding
Not applicable.
Competing interests
The author declares that they have no competing interests.
References
Cribari-Neto, F., Zeileis, A.: Beta regression in R. J. Stat. Softw. 34(2), 1–24 (2010).
Dahl, D. B., Scott, D., Roosen, C., Magnusson, A., Swinton, J.: Xtable: Export Tables to LaTeX or HTML. R package version 1.8-3 (2018). https://CRAN.R-project.org/package=xtable. Accessed Mar 2019.
de Jong, P., Heller, G. Z.: Generalized Linear Models for Insurance Data. Cambridge University Press (2008). Companion website: http://www.acst.mq.edu.au/GLMsforInsuranceData. http://dx.doi.org/10.1017/CBO9780511755408.
Ferrari, S., Cribari-Neto, F.: Beta regression for modelling rates and proportions. J. Appl. Stat. 31(7), 799–815 (2004). https://doi.org/10.1080/0266476042000214501.
Hogg, R. V., McKean, J. W., Craig, A. T.: Introduction to Mathematical Statistics. 8th edn. Pearson, Boston (2019).
Johnson, N. L., Kotz, S., Balakrishnan, N.: Continuous Univariate Distributions, Vol. 2. Wiley & Sons, New York (1995).
Jørgensen, B.: The Theory of Exponential Dispersion Models and Analysis of Deviance. Instituto de Matemática Pura e Aplicada (IMPA), Brazil (1992).
Jørgensen, B.: The Theory of Dispersion Models. Chapman & Hall, London (1997).
Smyth, G. K., Verbyla, A. P.: Double generalized linear models: approximate REML and diagnostics. Proceedings of the 14th International Workshop on Statistical Modelling, 66–80 (1999). https://pdfs.semanticscholar.org/3fd5/fb7ee7e6991d0e6e2f50dacc80283a4701b1.pdf.
McCullagh, P., Nelder, J. A.: Generalized Linear Models. 2nd edn. Chapman and Hall, London New York (1989).
Quijano Xacur, O. A.: Beta Density Plot. Code Snippet (2019). https://gitlab.com/oquijano/unifed/snippets/1880287. Accessed Jul 2019.
Quijano Xacur, O. A.: Unifed Density Plot. Code Snippet (2018). https://gitlab.com/oquijano/unifed/snippets/1786224. Accessed Jul 2019.
Quijano Xacur, O. A.: Vehicle Insurance Example. Code Snippet (2019a). https://gitlab.com/oquijano/unifed/snippets/1786226. Accessed Jul 2019.
Quijano Xacur, O. A.: unifed. R package version 1.1.0 (2019b). https://CRAN.R-project.org/package=unifed. Accessed Jul 2019.
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2017). https://www.R-project.org/.
Simas, A. B., Barreto-Souza, W., Rocha, A. V.: Improved estimators for a general class of beta regression models. Comput. Stat. Data Anal. 54(2), 348–366 (2010). https://doi.org/10.1016/j.csda.2009.08.017.
Stan Development Team: RStan: the R interface to Stan. R package version 2.18.2 (2018). http://mc-stan.org/.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.