Normal-Power-Logistic Distribution: Properties and Application in Generalized Linear Model

The normal distribution has diverse applications in the literature, and the modified univariate normal-power distribution is a new distribution that is adequate for modelling bimodal data. Many data sets that would otherwise have been modelled by the normal distribution are not, because they are bimodal while the normal distribution is unimodal. In this paper, a new extension of the normal linear model, called the normal-Power generalized linear model and derived from the T-Power{Logistic} framework, is presented. Statistical properties of the distribution and the proposed model are derived, including the quantiles, median, mode, robust skewness, robust kurtosis and moments. The maximum likelihood method is used to estimate the unknown model parameters. Three real data sets are analyzed to demonstrate the flexibility and usefulness of the proposed model. The new model would be a useful alternative for skewed or bimodal response variables that are not well fitted by the normal linear model.


Copyright ©2021 by authors, all rights reserved. Authors agree that this article remains permanently open access under the terms of the Creative Commons Attribution License 4.0 International License.

Introduction
In probability and statistics, the power function and normal distributions are each very useful in their own applications, yet few authors have considered combining them. The normal distribution has a location parameter but no shape parameter, while the power function distribution has a shape parameter but no location parameter. Both are flexible, so combining them should produce a more flexible distribution. The power function distribution is the inverse of the Pareto distribution (Dallas 1976). It is a special model that is related to the uniform, Weibull and Kumaraswamy distributions, and it is considered one of the simplest and handiest lifetime distributions. Meniconi and Barry (1996) proposed the two-parameter power function distribution as a simple alternative to the exponential distribution for modelling failure data related to mortality rates and component failures. It is a special case of the beta distribution, and one may cite its importance in statistical tests such as the likelihood ratio test. The normal distribution, on the other hand, has been combined with other distributions to form more flexible ones, such as the exponentiated-Normal (Gupta et al. 1998), Beta-Normal (Eugene and Lee 2002), Gamma-Normal (GN) (Zografos and Balakrishnan 2009) and Kumaraswamy-Normal (Cordeiro and de Castro 2011) distributions. Estimation of the power function parameters has been carried out by various authors, such as Zaka and Akhter (2013).
Many classical distributions have been used extensively for modelling real data in many areas. However, in many situations there is a clear need for extended forms of these distributions to improve their flexibility and goodness of fit. For that reason, families of continuous distributions are developed by introducing one or more additional shape parameters to the baseline distribution, or by combining two or more distributions to produce new ones. Akarawak et al. (2013) described such new distributions as convoluted distributions. In recent years, some authors have developed frameworks for combining distributions to form new ones. A good example is the T-R{Y} framework (Aljarrah et al. 2014). Since then, many authors have used it to develop flexible lifetime distributions that are hazard-weighted functions of the baseline distributions. The Weibull-Normal distribution (Alzaatreh et al. 2014) was one of the first in which the normal distribution was combined with another distribution using the T-R{Y} framework. The Weibull power function distribution (Tahir et al. 2016) combines the power function and Weibull distributions, using the Weibull distribution as the baseline distribution.
The simplicity and usefulness of the power function distribution have compelled researchers to explore its further extensions, generalizations and applications in different areas of science (Arshad et al. 2020; Ekum et al. 2020b). Recently, the Gamma-Power{log-logistic} distribution was proposed by Ekum et al. (2021), who demonstrated its usefulness in modelling skewed data. None of these studies has combined the normal and power function distributions, especially with the power function distribution as the baseline, except the normal-power{logistic} distribution (NPLD) proposed in the work of Ekum et al. (2021). Moreover, many properties of the NPLD have not been defined and studied, and it has not been developed into a generalized linear model for predicting relationships in regression applications.
Predicting oil spillage is of major interest to researchers in geoscience and geological statistics. In Nigeria, oil spillage is a major problem that has devastated the ecosystem and biodiversity of the Niger Delta region. The quantity of oil spilled may be estimated using the estimated spill volume, which may in turn be determined by the duration of clean-up (Whanda et al. 2016; Deinkuro et al. 2021). Researchers may also want to know whether they can predict their ResearchGate score using their citations and research items. These are emerging issues of interest to researchers, especially those in academia (Jordan 2015; O'Brien 2019). Moreover, the COVID-19 mortality rate per population and its linear effect on the economic wellbeing of Nigerians is also worth studying, because GDP per capita can be affected by COVID-19 mortality. The COVID-19 factor is also an extra burden on the wellbeing of the people (Pak et al. 2020; Iluno et al. 2021).
In the literature, there are some modifications of the normal distribution that produce multimodality with a small number of parameters (Kundu 2017). The modification developed by Kundu (2017) is a bivariate family of distributions, while the one developed here is a univariate family. Moreover, Kundu (2017) did not extend the distribution to a generalized linear model. The motivation of this work is the modelling of dependent variables with bimodal features in regression. Other authors, such as Famoye et al. (2018) and Kundu (2017), have developed bimodal distributions, but none has extended them to regression modelling. Moreover, real-life quantities such as crude oil spill volume, the number of citations on ResearchGate and GDP per capita are variables whose maximum values can be estimated, so they are bounded below by zero (non-negative) and above by a real value rather than infinity. Thus, a distribution with bounded support $[0, \lambda]$ is necessary, where $\lambda > 0$ is a real upper bound (Ekum et al. 2020b).
Thus, the aim of this study is to adopt a novel univariate continuous probability distribution called the normal-power{logistic} distribution (NPLD), which was derived from the T-Power{logistic} family proposed and studied by Ekum et al. (2021), and to extend it into a generalized linear model in order to solve real regression problems where the dependent variables are bimodal and skewed with a known maximum value. The model has four parameters: two from the normal distribution and two from the power function distribution, one of which is a shape parameter and the other an upper bound parameter that controls the extremes of the distribution. The scope covers different characterizations, properties, the regression model, and parameter estimation of the NPLD model. The method of Maximum Likelihood Estimation (MLE) is used to estimate the model parameters. The importance of the new model is demonstrated empirically using three real-life datasets. The proposed model would be very useful in engineering, medicine, and other fields where the dependent variable to be predicted has bimodal features. It is expected to perform well when the normal distribution fails to fit the data of interest.

Materials and Methods
In this section, the theory and application of the proposed scheme are considered.

The Method of Generating the T-R{Y} Family of Distributions
The method of generating the T-R{Y} family of distributions is considered here. T-R{Y} is a general approach for defining $W[F(x)]$ (a non-decreasing differentiable function) using the quantile function of a random variable Y in the T-X framework. Let T, R and Y be three random variables with cdfs $F_T(x)$, $F_R(x)$ and $F_Y(x)$, and let $Q_T(p)$, $Q_R(p)$ and $Q_Y(p)$ be their corresponding quantile functions. It is assumed that T is supported on the interval (a, b) and Y is supported on the interval (c, d), where b > a and d > c.
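For reference, the T-R{Y} construction of Aljarrah et al. (2014) can be written compactly as in the first line below. The second line is a specialization under two assumptions that this section does not state explicitly: that Y is standard logistic, so that $Q_Y(p) = \ln[p/(1-p)]$, and that R is a power function variable with $F_R(x) = (x/\lambda)^k$ on $(0, \lambda)$. It is offered as a sketch of how the T-Power{logistic} family would arise, not as a quotation of Definition (1).

```latex
\begin{align}
F_X(x) &= \int_{a}^{Q_Y(F_R(x))} f_T(t)\,dt
        = F_T\!\left(Q_Y\!\left(F_R(x)\right)\right), \\
% Assumed specialization: standard logistic Y, power function R
F_X(x) &= F_T\!\left(\ln\frac{x^{k}}{\lambda^{k}-x^{k}}\right),
        \qquad 0 < x < \lambda .
\end{align}
```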

Important Operational Definition of Terms
The following definitions will be very useful in characterising the proposed model.
Journal of the Indian Society for Probability and Statistics (2023) 24:23-54
Definition 5: The cumulative hazard function of a distribution from the T-power{logistic} family is given, in terms of its cdf $F_X(x)$, by $H_X(x) = -\ln[1 - F_X(x)]$.
Definition 6: The reverse hazard function of a distribution from the T-power{logistic} family is given, in terms of its pdf $f_X(x)$ and cdf $F_X(x)$, by $r_X(x) = f_X(x)/F_X(x)$.
Definition 7: The quantile function of the T-power{logistic} family is the inverse function of its cdf, $Q_X(p) = F_X^{-1}(p)$.
The quantile function is used in the Monte Carlo method to simulate random variates of a distribution, and it is used to determine measures of partition. Several ways of approximating a quantile function that is not in closed form are available in the literature, of which quantile mechanics is one such approach (Akagbue et al. 2017).

Definition 8:
The T-power{logistic} family of distributions is derived from the T-R{Y} family proposed by Aljarrah et al. (2014) and Alzaatreh et al. (2014), which fixes the relationship among T, R, and Y.
Definition 9: Let R be a non-negative random variable with pdf $f_R(x)$, and let $E(R^k)$ denote the kth moment of R. In the resulting moment expression, $E(X^k)$ is the kth moment of the random variable X; $[1 - F_Y(\cdot)]$ is the survival function of the random variable Y; and the quantile values are those of the random variable T with respect to $f_T(x)$.

Normal-Power function {logistic} Model
The proposed model is a generalized linear model that takes the form
$$g(\mu_i) = \mathbf{x}_i^{\top}\mathbf{B},$$
where $g(\mu_i)$ is the link function and the right-hand side is the linear predictor. Six goodness-of-fit criteria are used to compare the flexibility of the proposed model with other known models: the log-likelihood (LogL), the Akaike Information Criterion (AIC), the Kolmogorov-Smirnov statistic (D), the Anderson-Darling statistic (A), the Cramér-von Mises statistic and the chi-square statistic ($\chi^2$). See Chen and Balakrishnan (1995) for detailed information on the Anderson-Darling and Cramér-von Mises statistics. For AIC, D, A, the Cramér-von Mises statistic and $\chi^2$, lower values indicate a better fit, whereas a larger log-likelihood does. Also, the coefficient of correlation between the observed dependent variable y and the predicted dependent variable $\hat{y}$ is used to show the relationship between them; a model performs well if this correlation coefficient is high. It is assumed that the dependent variable y has a normal-power distribution.
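Two of the criteria listed above have simple textbook forms that can be sketched directly. The function names `aic` and `ks_statistic` are illustrative choices, not taken from the paper, and the `cdf` argument stands for whichever fitted cdf is being assessed.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2p - 2 logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical cdf of `sample`
    and a fitted cdf given as a callable x -> F(x)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical cdf jumps from (i-1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

For example, `ks_statistic(data, fitted_cdf)` can be computed for each competing model and the smallest value taken as the best fit on that criterion.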

Cumulative Distribution and Probability Density Functions of NPLD
Recall the cdf of T-power{logistic} defined by Ekum et al. (2021) and given in Definition (1), in which $F_T[t]$ is the cdf of the random variable T. So, T can follow any known distribution.
If T follows a normal distribution with parameters $\mu$ and $\sigma$, then the pdf of T is given by
$$f_T(t) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(t-\mu)^2}{2\sigma^2}\right]$$
and the cdf of T is given by
$$F_T(t) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{t-\mu}{\sigma\sqrt{2}}\right)\right],$$
where the error function erf(.) is given by
$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-s^2}\,ds.$$
Putting the value of t into $F_T(t)$ gives the cdf of NPLD, $F_X(x)$. The corresponding pdf of NPLD, Equation (9), is obtained by taking the first derivative of $F_X(x)$ with respect to x. In it, $\mu$ is a location parameter, k is a shape parameter, $\sigma$ is a scale parameter, and $\lambda$ doubles as a scale and upper bound parameter. A random variable X follows a NPLD if it can be defined as $X \sim NPLD(\mu, \sigma, k, \lambda)$. Figure 1 is the pdf plot of NPLD, which shows that, for some parameter values, NPLD can be bimodal, skewed and of varying kurtosis.
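The cdf and pdf can be sketched numerically under one loudly stated assumption: that the argument passed to the normal cdf is $w = \ln[x^k/(\lambda^k - x^k)]$, which is what a standard logistic quantile applied to a power function cdf $(x/\lambda)^k$ would give. The function names below are illustrative, and the code is a sketch of that assumed form rather than a transcription of Equation (9).

```python
import math
from statistics import NormalDist


def npld_cdf(x, mu, sigma, k, lam):
    """Assumed NPLD cdf: Phi((w - mu) / sigma), w = ln[x^k / (lam^k - x^k)]."""
    if x <= 0.0:
        return 0.0
    if x >= lam:
        return 1.0
    u = (x / lam) ** k
    if u <= 0.0:   # underflow guard very close to 0
        return 0.0
    if u >= 1.0:   # rounding guard very close to lam
        return 1.0
    w = math.log(u) - math.log1p(-u)
    return NormalDist(mu, sigma).cdf(w)


def npld_pdf(x, mu, sigma, k, lam):
    """Assumed NPLD pdf: normal density of w times the Jacobian dw/dx."""
    if x <= 0.0 or x >= lam:
        return 0.0
    u = (x / lam) ** k
    if u <= 0.0 or u >= 1.0:
        return 0.0
    w = math.log(u) - math.log1p(-u)
    jacobian = k / (x * (1.0 - u))  # equals k lam^k / [x (lam^k - x^k)]
    return NormalDist(mu, sigma).pdf(w) * jacobian
```

Evaluating `npld_pdf` on a grid over $(0, \lambda)$ for several values of $\sigma$ is a quick way to look for the bimodal shapes mentioned in connection with Figure 1.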

Useful Transformation
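follows a normal distribution with parameters $\mu$ and $\sigma$, then the pdf of W is given by
$$f_W(w) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(w-\mu)^2}{2\sigma^2}\right], \quad -\infty < w < \infty.$$
Proof Recall the pdf of NPLD in (9). We want to show that the random variable W follows a normal distribution with parameters $\mu$ and $\sigma$.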

By change of variable, let w be the transformed variable defined in Lemma 2.1.
Differentiating w with respect to x and making dx the subject of the equation, then changing the support from that of x to that of w, it follows from the inverse transformation that the resulting density is the pdf of the normal distribution with parameters $\mu$ and $\sigma$. Equation (14) completes the proof. ◻ From Lemma 2.1, it follows that the pdf of NPLD with parameters $(\mu, \sigma, k, \lambda)$ is a proper pdf. No further proof is needed.
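The change of variable can be worked through explicitly under the assumption that the transformation is $w = \ln[x^k/(\lambda^k - x^k)]$; the specific form of w is not recoverable from the text here, so the lines below are a hedged reconstruction of the argument rather than the paper's Equations (10) to (14).

```latex
\begin{align}
w &= \ln\frac{x^{k}}{\lambda^{k}-x^{k}},
   \qquad 0 < x < \lambda, \quad -\infty < w < \infty, \\
\frac{dw}{dx} &= \frac{k}{x} + \frac{k x^{k-1}}{\lambda^{k}-x^{k}}
   = \frac{k\lambda^{k}}{x\left(\lambda^{k}-x^{k}\right)}, \\
f_X(x) &= f_W\bigl(w(x)\bigr)\,\frac{dw}{dx}
   = \frac{k\lambda^{k}}{\sigma\sqrt{2\pi}\,x\left(\lambda^{k}-x^{k}\right)}
     \exp\left[-\frac{1}{2\sigma^{2}}
     \left(\ln\frac{x^{k}}{\lambda^{k}-x^{k}}-\mu\right)^{2}\right], \\
\int_{0}^{\lambda} f_X(x)\,dx &= \int_{-\infty}^{\infty} f_W(w)\,dw = 1 .
\end{align}
```

Because w is strictly increasing in x on $(0, \lambda)$, the last line is what makes the density proper.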

Survival and Related Functions of NPLD
The survival function of NPLD is given by $S_X(x) = 1 - F_X(x)$. The hazard function of NPLD is given by $h_X(x) = f_X(x)/S_X(x)$. The cumulative hazard function of NPLD is given by $H_X(x) = -\ln S_X(x)$. The reverse hazard function of NPLD is given by $r_X(x) = f_X(x)/F_X(x)$.

Quantile Function
Theorem 2.2 Let X be a random variable that follows NPLD with cdf $F_X(x)$. Then the inverse function of the cdf, which is the quantile function, exists.
Proof Recall the cdf of NPLD. Solving $F_X(x) = p$ for x gives Equation (20), the inverse function of the cdf of X, which can be written as Equation (21), where $Q_X(p)$ is the quantile function of NPLD, $\Phi^{-1}(p)$ is the inverse function of the cdf of the standard normal distribution, and p is a uniformly generated probability value, that is, $P \sim U(0, 1)$. Thus, Equation (21) completes the proof. ◻
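A minimal sketch of the quantile function and of inverse-transform sampling follows. It assumes the cdf has the form $\Phi\{(\ln[x^k/(\lambda^k - x^k)] - \mu)/\sigma\}$, in which case solving for x gives $Q_X(p) = \lambda\,[1 + e^{-(\mu + \sigma\Phi^{-1}(p))}]^{-1/k}$. That closed form and the function names are assumptions made for illustration, not a transcription of Equation (21).

```python
import math
import random
from statistics import NormalDist


def _expit(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def npld_quantile(p, mu, sigma, k, lam):
    """Assumed NPLD quantile for 0 < p < 1."""
    z = NormalDist().inv_cdf(p)
    return lam * _expit(mu + sigma * z) ** (1.0 / k)


def npld_cdf(x, mu, sigma, k, lam):
    """Assumed NPLD cdf on 0 < x < lam, used here only for a round-trip check."""
    u = (x / lam) ** k
    return NormalDist(mu, sigma).cdf(math.log(u) - math.log1p(-u))


def npld_sample(n, mu, sigma, k, lam, seed=0):
    """Inverse-transform sampling: push uniform draws through the quantile."""
    rng = random.Random(seed)
    # Keep p strictly inside (0, 1) because inv_cdf rejects the endpoints.
    return [npld_quantile(rng.uniform(1e-12, 1.0 - 1e-12), mu, sigma, k, lam)
            for _ in range(n)]
```

A call such as `npld_sample(100, 0.0, 1.0, 1.0, 2.0)` returns 100 variates that all lie strictly between 0 and the upper bound 2.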

Measures of Partition
The quantile function can be used to derive all the measures of partition, such as the median, quartiles, octiles, deciles and percentiles. The median of NPLD is $Q_X(0.5)$, the 1st quartile of NPLD, which is the same as the 25th percentile, is $Q_X(0.25)$, and the 3rd quartile of NPLD, which is the same as the 75th percentile, is $Q_X(0.75)$.
Theorem 2.3 Let X be a random variable that follows NPLD with quantile function $Q_X(p)$. Then the skewness is robust, because it is a resistant measure that is not affected by the extreme-value parameter $\lambda$.
Proof Recall the median, 1st quartile ($Q_1$) and 3rd quartile ($Q_3$) of NPLD obtained from the quantile function.
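Quantile-based shape measures can be sketched as follows. The text does not say which robust measures are used, so this sketch assumes the common choices: Bowley's skewness, built from quartiles, and Moors' kurtosis, built from octiles. The quantile function is the assumed closed form $\lambda\,[1 + e^{-(\mu + \sigma\Phi^{-1}(p))}]^{-1/k}$, and all names are illustrative.

```python
import math
from statistics import NormalDist


def npld_quantile(p, mu, sigma, k, lam):
    """Assumed NPLD quantile for 0 < p < 1."""
    t = mu + sigma * NormalDist().inv_cdf(p)
    if t >= 0.0:
        logistic = 1.0 / (1.0 + math.exp(-t))
    else:
        logistic = math.exp(t) / (1.0 + math.exp(t))
    return lam * logistic ** (1.0 / k)


def bowley_skewness(q):
    """(Q3 + Q1 - 2 Q2) / (Q3 - Q1) for a quantile callable q(p)."""
    q1, q2, q3 = q(0.25), q(0.5), q(0.75)
    return (q3 + q1 - 2.0 * q2) / (q3 - q1)


def moors_kurtosis(q):
    """(E7 - E5 + E3 - E1) / (E6 - E2), where E_i = q(i / 8)."""
    e = [q(i / 8.0) for i in range(1, 8)]  # e[0] is E1, ..., e[6] is E7
    return ((e[6] - e[4]) + (e[2] - e[0])) / (e[5] - e[1])
```

Under the assumed quantile form, $\lambda$ is a common factor of every quantile, so it cancels from both ratios. That is consistent with the statement that the measure is unaffected by the upper bound.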
The mode can be derived by differentiating the pdf, equating the derivative to zero, and solving for x.
Using the product rule with the pdf written as a product of two factors u and v, differentiating u and v with respect to x, inserting (41), (42), (43) and (44) into (40) and equating to zero gives Equation (45). The solution to (45) is the mode of NPLD. Now, assuming that $\sigma = k = \lambda = 1$ and $\mu = 0$, (45) becomes Equation (46). It follows from (46) that the mode of NPLD is not unique and the distribution is possibly bimodal. The value of the shape parameter determines whether it is bimodal or multi-modal: if k = 1 it is bimodal, if k = 2 it will have 3 peaks, and if k = 3 it will have 4 peaks. However, some of these peaks might not be visible graphically because the polynomial equation can have repeated roots. The resulting equation for the mode is a polynomial of order k+1, as shown in equation (45). ◻

Series Expansion of NPLD
Theorem 2.6 Let X be a random variable that follows NPLD with parameters $\mu, \sigma, k, \lambda$. Then the pdf of X, $f_X(x)$, is a weighted pdf of the power function distribution with parameters k and $\lambda$, where $f_R(x)$ is the pdf of the power function distribution and $\Psi$ is the weight.
Proof Recall the pdf of NPLD given in (9). Inserting the series expansions (48-56), which include the binomial expansion $(a + y)^n = \sum_{m=0}^{n}\binom{n}{m}a^{n-m}y^{m}$, into the pdf of NPLD in (9) gives Equation (57).

Moment of NPLD
Let X be a continuous random variable with pdf $f_X(x)$ supported on $(0, \lambda)$. The rth moment is given by Equation (60),
$$E(X^r) = \int_0^{\lambda} x^r f_X(x)\,dx.$$
Inserting the series expansion form of the NPLD pdf into Equation (60) and simplifying gives Equation (65), which is the rth moment of NPLD.
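As a numerical cross-check on a series-based moment formula, the rth moment can also be computed from the quantile function, since $E(X^r) = \int_0^1 Q_X(p)^r\,dp$. The sketch below uses a midpoint rule and the assumed quantile form $\lambda\,[1 + e^{-(\mu + \sigma\Phi^{-1}(p))}]^{-1/k}$; it is not the paper's Equation (65), and the names are illustrative.

```python
import math
from statistics import NormalDist


def npld_quantile(p, mu, sigma, k, lam):
    """Assumed NPLD quantile for 0 < p < 1."""
    t = mu + sigma * NormalDist().inv_cdf(p)
    if t >= 0.0:
        logistic = 1.0 / (1.0 + math.exp(-t))
    else:
        logistic = math.exp(t) / (1.0 + math.exp(t))
    return lam * logistic ** (1.0 / k)


def npld_moment(r, mu, sigma, k, lam, n=20000):
    """Midpoint-rule approximation of E[X^r] = integral of Q(p)^r over (0, 1)."""
    total = 0.0
    for i in range(n):
        p = (i + 0.5) / n  # midpoints never touch 0 or 1
        total += npld_quantile(p, mu, sigma, k, lam) ** r
    return total / n
```

Because the support is bounded, every moment is finite and satisfies $0 < E(X^r) < \lambda^r$, which gives a simple sanity check on any computed value.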

The likelihood function of NPLD is given by the product of the pdf over the sample, $L = \prod_{i=1}^{n} f_X(x_i)$.
Taking the log gives the log-likelihood. The maximum likelihood estimates of the parameters of the NPLD are obtained by differentiating the log-likelihood partially with respect to $\mu$, $\sigma$ and k, equating the results to zero, and solving for each parameter. The equation obtained by setting the partial derivative with respect to k to zero is not in closed form, so the value of k is found using Newton's numerical procedure in R (R Development Core Team 2009). The parameter $\lambda$ cannot be estimated by the MLE method because it is an upper bound that depends on X; it is instead estimated from the data using equation (75), which involves a very small positive number less than 1 chosen by the user. The maximum likelihood estimators of $\mu$ and $\sigma$ are in closed form and always exist provided the values of k and $\lambda$ are known. To find the initial value of k for the numerical optimization, the random sample is first assumed to come from the power function distribution, for which the moment estimate of k is $\hat{k} = \bar{x}/(\lambda - \bar{x})$, $\bar{x} < \lambda$, where $\bar{x}$ is the sample mean estimated from the data (Ekum et al. 2020b).
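The estimation steps described above can be sketched as follows. Three assumptions are made and should be read as such: the upper bound is estimated as the sample maximum plus a small user-chosen constant (named `eps` here), the transformation to normality is $w = \ln[x^k/(\lambda^k - x^k)]$, and the closed-form estimators of $\mu$ and $\sigma$ are the mean and the divide-by-n standard deviation of w. The function names are illustrative.

```python
import math


def to_w(x, k, lam):
    """Assumed normalizing transformation w = ln[x^k / (lam^k - x^k)]."""
    u = (x / lam) ** k
    return math.log(u) - math.log1p(-u)


def estimate_lambda(xs, eps=1e-3):
    """Upper bound estimate: largest observation plus a small constant (assumed form)."""
    return max(xs) + eps


def initial_k(xs, lam):
    """Moment estimate of k from the power function distribution: xbar / (lam - xbar)."""
    xbar = sum(xs) / len(xs)
    return xbar / (lam - xbar)


def fit_mu_sigma(xs, k, lam):
    """Closed-form estimates of mu and sigma given k and lam (MLE divisor n)."""
    ws = [to_w(x, k, lam) for x in xs]
    n = len(ws)
    mu = sum(ws) / n
    sigma = math.sqrt(sum((w - mu) ** 2 for w in ws) / n)
    return mu, sigma
```

A typical sequence would be `lam = estimate_lambda(xs)`, then `k0 = initial_k(xs, lam)`, then `fit_mu_sigma(xs, k0, lam)`, with k refined numerically afterwards.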

Numerical Optimization of Parameter k
In a case where the estimate of k obtained by Newton approximation is not optimal, a new relationship is derived by the EM algorithm. Let $\Omega$ be the parameter space of NPLD. Substituting the pdfs of the normal distribution and NPLD into Equation (77) gives an update for k in which $\mu$, $\sigma$ and $\lambda$ are treated as known, such that $\hat{\mu} = \bar{x}$, $\hat{\sigma} = S$ and $\hat{\lambda} = \sup x = x_{(n)}$, where $\bar{x}$ and S are the sample mean and sample standard deviation of $\ln[x/(\lambda - x)]$. Note that $x_{(n)} - x > 0$, $\forall x \in X$. Here $k_1$ is the initial value of k suggested above, that is, $k_1 = \bar{x}/(\lambda - \bar{x})$, $\bar{x} < \lambda$, and the updated value is the new estimate of k, which is optimal. Once the optimal value of k is known, the values of $\mu$ and $\sigma$ can be estimated using equations (72) and (73) respectively.
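The EM update itself cannot be recovered from the text here, so the sketch below swaps in a different and simpler technique: a grid search over k of the profile log-likelihood, with $\mu$ and $\sigma$ replaced by their closed-form estimates at each candidate k. It assumes the density $f_X(x) = f_W(w)\,k/[x(1-u)]$ with $u = (x/\lambda)^k$ and $w = \ln[u/(1-u)]$; the grid and function names are arbitrary illustrative choices.

```python
import math


def profile_loglik(xs, k, lam):
    """Log-likelihood at (mu_hat(k), sigma_hat(k), k) under the assumed density."""
    n = len(xs)
    us = [(x / lam) ** k for x in xs]
    ws = [math.log(u) - math.log1p(-u) for u in us]
    mu = sum(ws) / n
    var = sum((w - mu) ** 2 for w in ws) / n
    # Gaussian part evaluated at its own MLEs: -(n/2) [ln(2 pi var) + 1].
    gauss = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    # Log-Jacobian of w(x): ln k - ln x - ln(1 - u), summed over the sample.
    log_jac = sum(math.log(k) - math.log(x) - math.log1p(-u)
                  for x, u in zip(xs, us))
    return gauss + log_jac


def grid_search_k(xs, lam, grid=None):
    """Return the grid value of k with the largest profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(1, 51)]  # 0.1, 0.2, ..., 5.0
    return max(grid, key=lambda k: profile_loglik(xs, k, lam))
```

The moment estimate $k_1$ could be used to centre a finer grid, or as the starting point for a derivative-based optimizer, once the coarse search has located the region of the maximum.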

Error Bound and Confidence Interval for NPLD
The error bound for estimating a generic parameter $\Theta$ of NPLD is expressed in terms of $\alpha$, the level of significance; $\Theta$, the parameter to be estimated; $Q^{*}_{p}$, the standard quantile function of NPLD with $p = 1 - \alpha$, $p \in [0, 1]$; and $S_{\Theta}$, the standard error of $\Theta$, that is, the square root of the variance of $\Theta$.
The standard quantile function of NPLD is derived from the quantile function of NPLD when $k = \sigma = 1$ and $\mu = 0$. In it, $\Phi^{-1}(p)$ is the inverse function of the cdf of the standard normal distribution, known as its quantile function, and p is a uniformly generated probability value. Note that $\lambda > 0$ is a regulator parameter in this case: its value is adjusted to determine how large the error bound should be. In this research, $\lambda$ is taken as 2 to accommodate the population parameter. So, the level of significance $\alpha$ and $\lambda$ are always chosen. The value of $\lambda$ can be 1, 2 or 3 depending on how large the error bound is required to be.
Thus, the $100(1 - \alpha)\%$ confidence interval for the parameter $\Theta$ is built around $\hat{\Theta}$, the point estimate of $\Theta$.
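One plausible reading of the error bound can be sketched in code. It rests on assumptions that the text does not confirm: that the standard quantile function is $Q^{*}_{p} = \lambda/[1 + e^{-\Phi^{-1}(p)}]$ (the assumed quantile form with $k = \sigma = 1$, $\mu = 0$), that the bound is the product $Q^{*}_{p}\,S_{\Theta}$, and that the interval is the point estimate plus or minus that bound. The function names are illustrative.

```python
import math
from statistics import NormalDist


def standard_quantile(p, lam=2.0):
    """Assumed standard NPLD quantile: lam / (1 + exp(-Phi^{-1}(p)))."""
    z = NormalDist().inv_cdf(p)
    return lam / (1.0 + math.exp(-z))


def error_bound(std_error, alpha=0.05, lam=2.0):
    """Assumed error bound: Q*_{1 - alpha} times the standard error."""
    return standard_quantile(1.0 - alpha, lam) * std_error


def confidence_interval(estimate, std_error, alpha=0.05, lam=2.0):
    """Assumed 100(1 - alpha)% interval: estimate +/- error bound."""
    e = error_bound(std_error, alpha, lam)
    return estimate - e, estimate + e
```

Raising `lam` from 1 to 3 widens the interval proportionally, which matches the description of $\lambda$ as a regulator of how large the error bound should be.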

Simulation Study of NPLD
The simulation study is presented to show the performance of the maximum likelihood estimators and their consistency. The procedure generates n uniformly distributed probabilities p and passes them through the quantile function of NPLD defined in equation (21) to obtain NPLD random variates, for sample sizes n = 50, 100, 200 and 300, each replicated 1000 times. The parameter values are set as $k = \mu = \sigma = 0.5$, $k = \mu = \sigma = 1$ and $k = \mu = \sigma = 2$, for a fixed $\lambda = 2$. The actual values, mean estimates, standard errors and 95% confidence intervals are presented in Tables 1, 2 and 3. These tables show that the standard error decreases as the sample size increases, which is consistent with the MLEs being consistent.
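A scaled-down version of this simulation can be sketched as follows. It assumes the quantile form $\lambda\,[1 + e^{-(\mu + \sigma\Phi^{-1}(p))}]^{-1/k}$ and the transformation $w = \ln[u/(1-u)]$ with $u = (x/\lambda)^k$, treats k and $\lambda$ as known, and tracks only the estimate of $\mu$. The replication count is left to the caller; a smaller number than the 1000 used in the study keeps the run short.

```python
import math
import random
from statistics import NormalDist


def simulate_mu_hat(n, reps, mu, sigma, k, lam, seed=0):
    """Return (mean, standard error) of mu_hat across `reps` simulated samples."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    estimates = []
    for _ in range(reps):
        ws = []
        for _ in range(n):
            p = rng.uniform(1e-12, 1.0 - 1e-12)       # uniform probability
            t = mu + sigma * std_normal.inv_cdf(p)
            x = lam * (1.0 / (1.0 + math.exp(-t))) ** (1.0 / k)  # assumed quantile
            u = (x / lam) ** k
            ws.append(math.log(u) - math.log1p(-u))   # back to the normal scale
        estimates.append(sum(ws) / n)                 # closed-form mu_hat
    mean = sum(estimates) / reps
    se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (reps - 1))
    return mean, se
```

Calling `simulate_mu_hat(50, 200, 1.0, 1.0, 1.0, 2.0)` and then the same with `n = 300` should show the standard error shrinking roughly in proportion to $1/\sqrt{n}$.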

Generalized Linear Regression Model for NPLD (NPGLM)
Assume that the dependent random variable Y of interest in the linear model follows a NPLD given the independent variable(s) X. The linear regression model is called the NPLD Generalized Linear Model (NPGLM). The linear model in matrix form is
$$Y = XB + e, \quad (82)$$
where Y is an n-dimensional vector called the dependent vector for all n observations; X is the set of k independent variables packed into an $n \times (k+1)$ matrix called the design matrix; B is a $(k+1)$-dimensional vector called the slope vector; and e is the error term packed into an n-dimensional vector called the error vector.

Conditions for NPGLM
The conditions for using the NPGLM to fit the model are:
• Y must be a continuous random variable.
• Y must be a positive real number, strictly greater than zero but strictly less than $\lambda$ (the upper bound for Y).
• Y must follow NPLD.
• NPLD must be a member of the exponential family.

Exponential Class of NPLD
An exponential family or class is a parametric set of probability distributions that has a certain form. This special form is chosen for mathematical convenience, based on some useful algebraic properties, as well as for generality (Akarawak et al. 2017). It is assumed that each component of Y follows a distribution in the exponential family of the form given in Equation (83), in which $a(\phi)$ is a function of a known parameter $\phi$ only, $b(\theta)$ is a function of a canonical parameter $\theta$, $c(T(y), \phi)$ is a function of y and $\phi$ only, and T(y) is a function of y known as the sufficient statistic for Y. Assume that Y is a random variable that follows NPLD, and recall the pdf of the NPLD with parameters $\mu, \sigma, k, \lambda$, where the parameter $\lambda$ is an upper bound. The pdf f(y) is not free from the parameter $\lambda$ and hence might be difficult to express as a member of the exponential family.
However, a simple transformation can be done with the data that follows a NPLD to a normal distribution as proved in Lemma (2.1).
Recall the transformed pdf (84). Taking the log of (84) gives (85), and taking the exponential of (85) gives (86). Comparing (86) with (83) identifies the components of the exponential family, where w is a function of y, k and $\lambda$. Since (86) can be written in the exponential class, the joint sufficient statistics can be derived directly from it. The joint sufficient statistics for $\mu$ and $\sigma$ are w and $w^2$ respectively. Thus, w and $w^2$ carry all the information concerning the parameters $\mu$ and $\sigma$ respectively.
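The log-then-exponentiate step can be written out for the normal density of the transformed variable w. The matching of terms to $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ below assumes the common GLM form $\exp\{[w\theta - b(\theta)]/a(\phi) + c(w, \phi)\}$, which may differ in layout from the paper's Equation (83).

```latex
\begin{align}
f_W(w) &= \frac{1}{\sigma\sqrt{2\pi}}
          \exp\left[-\frac{(w-\mu)^{2}}{2\sigma^{2}}\right] \\
       &= \exp\left\{\frac{w\mu - \tfrac{1}{2}\mu^{2}}{\sigma^{2}}
          - \frac{w^{2}}{2\sigma^{2}}
          - \frac{1}{2}\ln\left(2\pi\sigma^{2}\right)\right\}, \\
% Assumed identification with the GLM exponential-family form
\theta &= \mu, \qquad b(\theta) = \tfrac{1}{2}\theta^{2}, \qquad
a(\phi) = \phi = \sigma^{2}, \qquad
c(w,\phi) = -\frac{w^{2}}{2\phi} - \frac{1}{2}\ln(2\pi\phi).
\end{align}
```

The terms in w and $w^2$ are the only places where the data enter, which is why they serve as the joint sufficient statistics.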

Maximum Likelihood Estimation of the Parameters of NPLD Regression Model
The log-likelihood of the NPLD regression model is obtained from the pdf of NPLD together with the link function. The MLE of $b_j$ is then in closed form, where the value of $\lambda$ can be approximated from the data using the nth order statistic or simply $\hat{\lambda} = \max(y_i) + \mathrm{SE}(y)$ $\forall i$, where SE(y) is the standard error of y computed from the data. An approximation for k can also be derived from the data using $\hat{k} = \bar{y}/(\lambda - \bar{y})$, $\bar{y} < \lambda$, where $\bar{y}$ is the sample mean, following Ekum et al. (2020b).
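A closed-form fit can be sketched for a single predictor. It assumes the response is first mapped to the normal scale by $w = \ln[y^k/(\lambda^k - y^k)]$ and that the linear predictor models the mean of w, in which case the coefficient estimates reduce to ordinary least squares of w on x. That reading, the one-predictor restriction and the function names are all illustrative assumptions.

```python
import math


def to_w(y, k, lam):
    """Assumed normalizing transformation of the response."""
    u = (y / lam) ** k
    return math.log(u) - math.log1p(-u)


def fit_npglm_simple(xs, ys, k, lam):
    """Least-squares estimates (b0, b1) of w = b0 + b1 * x on the transformed scale."""
    ws = [to_w(y, k, lam) for y in ys]
    n = len(xs)
    xbar = sum(xs) / n
    wbar = sum(ws) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxw = sum((x - xbar) * (w - wbar) for x, w in zip(xs, ws))
    b1 = sxw / sxx
    b0 = wbar - b1 * xbar
    return b0, b1


def predict(x, b0, b1, k, lam):
    """Map the linear predictor back to the response scale (0, lam)."""
    eta = b0 + b1 * x
    return lam * (1.0 / (1.0 + math.exp(-eta))) ** (1.0 / k)
```

Predictions from `predict` always fall strictly between 0 and `lam`, which respects the bounded-support condition placed on Y.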

Application
In this section, applications to three real data sets are provided to illustrate the uses and importance of the NPLD. Three competing models are fitted to the data of interest: the NPLD, normal and gamma GLMs.

Application 1: Estimated Spill Volume (ESV) of Crude Oil in Nigeria
The data on the estimated spill volume (ESV) were collected from 7th January 2011 to 27th December 2019 from the Shell Nigeria website (www.shell.com.ng/sustainability/environment/oil-spills.html). Figure 2 shows that the oil spill data is bimodal with positive skewness (1.1302) and kurtosis (3.3977).

Fitting the Models to Oil Spill Data
The estimated spill volume of crude oil can be determined by the Duration of Cleanup (DOC). If the duration of clean-up is known, the spill volume can be estimated from an appropriate model. Thus, the dependent variable is ESV and the independent variable is the DOC. Table 4 shows the model parameters estimated, their standard errors and their corresponding P-values. Table 5 shows that the NPLD regression model outperforms the other regression models using all the selection criteria.

Application 2: Total Research Gate Score
Total Research Gate (TRG) score data is a cross-sectional data set collected from the Research Gate pages of 100 selected researchers in the field of Mathematical Science (Fig. 3). Figure 3 shows that the TRG score data is bimodal, with positive skewness of 0.1595 and kurtosis of 1.9747.

Fitting the Models to Research Gate Data
The TRG score can be predicted by citations and research items. If citations and research items increase, the TRG score will also increase. Thus, the dependent variable is the TRG score, while the independent variables are citations and research items. Table 6 shows the model parameters estimated using MLE, their standard errors and their corresponding P-values. The fitted NPLD regression model shows that the estimates of $\beta_0$ and $\beta_1$ are significant at the 5% level. This is also true for the gamma and normal regression models. Table 7 shows that the NPLD regression model outperforms the other regression models using all the goodness-of-fit criteria.

Application 3: Gross Domestics Product per Capita per COVID-19 Cases
The data used here are daily data collected from the World Health Organisation (WHO) from 1st June 2020 to 31st December 2020, spanning 214 daily observations, as used by Iluno et al. (2021). The independent variable is a measure of COVID-19, termed COVID-19 Mortality per 1 million persons in the population (CMP), while the dependent variable is the GDP per capita per COVID-19 cases (RGDPC). The CMP is a proxy to measure COVID-19 mortality, while RGDPC is a proxy to measure the economic wellbeing of a country. Figure 4 shows that the RGDPC data has a positive skewness of 2.317554 and kurtosis of 7.896267. This data is highly skewed and very peaked (leptokurtic).

Fitting the Models to COVID-19 Data
The RGDPC can be predicted by the CMP. If COVID-19 mortality per population is high, it can affect the GDP per capita of a country negatively. Thus, the dependent variable is RGDPC and the independent variable is the CMP. Three competing models are fitted to the RGDPC data. Table 8 shows the model parameters estimated, their standard errors and their corresponding P-values. Table 9 shows that the NPLD regression model outperforms the other regression models using all the selection criteria.

Conclusions
This study developed a novel NPLD model using the T-Power{logistic} family of distributions. The cdf, pdf, survival function, hazard rate, cumulative hazard function, reverse hazard function, useful transformation, quantile function, mode, robust skewness, robust kurtosis, series expansion and moments were derived. The maximum likelihood estimators of the parameters of the distribution and of its generalized regression model were also derived. The NPLD regression model was applied to three real-life data sets, namely the Estimated Spill Volume (ESV) of crude oil in the Niger Delta area of Nigeria, the Total Research Gate (TRG) score of selected researchers on Research Gate, and GDP per Capita per COVID-19 cases (RGDPC), and its performance compared favourably with the normal and gamma regression models. The goodness-of-fit statistics showed that the NPLD regression model outperforms the other regression models on all the selection criteria for the ESV, TRG score and RGDPC models. Hence, the NPLD regression model can be used effectively to analyze and model crude oil spill volume data, TRG score data, RGDPC data and other related data when the normal model is not a good fit.
This research therefore recommends that:
• The NPLD model should be used to estimate the spill volume of crude oil and the total Research Gate score.
• The convoluted distribution NPLD should be used when the normal distribution is not a good fit to emerging data of interest.
• Based on the applications, clean-up of spilled oil should be started immediately and completed in record time, because its duration can be used to estimate the spilled volume of crude oil.
• Researchers should increase the research items they upload to Research Gate and write quality papers to increase their citations, in order to increase their total Research Gate score.
• COVID-19 mortality should be reduced by providing medical response to infected individuals, because it can affect the economic wellbeing of the nation.