Introduction

Regression analysis is one of the most important statistical tools, with applications in many fields. Various types of regression models are available in the literature, including the linear regression model (LRM), non-linear models, the generalized linear model (GLM), and generalized additive models (GAM)1. The GLM, introduced by Nelder and Wedderburn in 1972, relaxes the assumptions of the LRM by accommodating non-normally distributed responses, addressing heteroscedasticity, and allowing non-linear associations with predictors2,3. The GLM takes many forms; one of these is the beta regression model (BRM), which models the dependency of a continuous random variable restricted to the standard unit interval on a set of independent variables in different fields. The authors of4 proposed the BRM to explain variation in dependent variables that behave as rates and proportions, taking values in the interval (0, 1). This model assumes that the response variable follows the beta distribution and can also accommodate asymmetry and heteroscedasticity1. Generally, the maximum likelihood estimator (MLE) is used to estimate the unknown regression coefficients of the BRM5,6. The GAM offers the analyst an outstanding regression tool for understanding the quantitative structure of data; an early monograph on generalized additive models is that of Hastie and Tibshirani (1990)7. GLMs and GAMs have become standard tools for analyzing the impact of covariates on possibly non-Gaussian response variables. The main difference between the GAM and the GLM is that the GAM permits the inclusion of nonlinear smooth functions in the model8. The smoothing parameter can be selected, among many other proposals, by minimizing the conditional Akaike information criterion (AIC)9. This version of the AIC for GAMs uses the log-likelihood evaluated at the penalized MLE, with the effective degrees of freedom computed as discussed in10.

Multicollinearity is a common issue in regression modeling: it indicates that there are strong associations among the explanatory variables. Many biased estimators have been introduced to combat multicollinearity in linear regression, such as the Stein estimator11, the principal component estimator12, the ridge regression estimator13, improved ridge estimators14, the contraction estimator15, the modified ridge regression estimator16, the Liu estimator17, the Liu-type estimator18, the restricted and unrestricted two-parameter estimators19, the (k-d) class estimator20, the mixed ridge estimator21 and the modified Liu-type estimator22. There are several ways to estimate the shrinkage parameter, such as ridge, Liu, and Liu-type estimation, which have become generally accepted and effective methodologies for addressing the multicollinearity problem in several regression models. The ridge estimator (RE) was proposed in13; its concept is to add a small positive amount (k) to the diagonal entries of the covariance matrix to improve the conditioning of this matrix, reduce the MSE, and obtain stable coefficients. Several attempts have been made to choose the best ridge parameter k, based on the work of23 and24. The impact of multicollinearity on the GLM is significant and enduring, and among the various GLMs the BRM is notably affected by multicollinearity5. The authors of25 and26 proposed ridge estimators for the BRM to remedy the instability of the traditional ML method and increase the efficiency of estimation, and a new modified ridge-type estimator for the BRM was proposed in1. This paper presents a comparative analysis of various statistical models, incorporating both real data and simulation studies, with a specific focus on evaluating these models using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Although there are many studies on high-dimensional regression27,28, this paper focuses on the evaluation of low-dimensional regression models. The paper is organized as follows: the beta regression model, the GAM, the GAM beta regression model, the ridge regression model, and the beta ridge regression model are presented; a numerical evaluation is then offered using a Monte Carlo simulation and an empirical data application; finally, conclusions are presented.

Methodology

Beta regression model

Beta regression is widely used in fields such as economics and medical research to assess the influence of selected independent variables on a non-normal dependent variable. In beta regression, the response variable is constrained to the interval (0, 1), covering quantities such as fractions and percentages. Classical regression models examine the relationship between chosen independent variables and a normally distributed dependent variable; this is not appropriate when the response variable does not follow the normal distribution, because it may give misleading estimates. The authors of4 developed the beta regression model by using a link function to connect the mean of the dependent variable to the linear predictor. The inverse of the precision parameter of this model is called the dispersion parameter, and it is assumed constant across observations, although the precision parameter might not be constant, consistent with the results of29,30.

Let y be a continuous random variable that has a beta distribution with a probability density function as follows:

$$\begin{aligned} f(y_{i};\mu ,\emptyset )&= \dfrac{\Gamma (\emptyset )}{\Gamma (\mu \emptyset ) \Gamma ((1-\mu )\emptyset )} y_{i}^{\mu \emptyset -1}(1-y_{i})^{(1-\mu )\emptyset -1},\quad 0<y_{i}<1;\; 0<\mu <1;\; \emptyset >0. \end{aligned}$$
(1)

where \(\Gamma\) is the gamma function and \(\emptyset\) is the precision parameter.

$$\begin{aligned} \emptyset =\dfrac{1-\sigma ^{2}}{\sigma ^{2}} \end{aligned}$$

The mean and variance of the beta probability distribution are:

$$\begin{aligned} E(y)&= \mu .\\ var(y)&= \mu (1-\mu )\sigma ^{2} \end{aligned}$$

By using the logit link function, the model allows \(\mu _{i}\) to depend on the covariates as follows:

$$\begin{aligned} g(\mu _{i})&= \log \left( \dfrac{\mu _{i}}{1-\mu _{i}}\right) =X_{i} ^\prime \beta =\eta _{i} \end{aligned}$$
(2)

The linear predictor maps into the mean of the beta distribution, which inherently models data within the open interval (0, 1). In scenarios where values exactly equal to 0 or 1 are observed, one can consider employing the zero- and/or one-inflated beta distribution proposed in31.
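As a hedged illustration of Eqs. (1)–(2), a beta regression with a logit link can be fitted in R, for example with the betareg package (an assumption; the paper does not name a specific package, and any equivalent routine could be used). The variables below are simulated and purely illustrative.

```r
# Minimal sketch of fitting a beta regression model (Eq. 2) by ML in R.
# Assumes the 'betareg' package; the response must lie strictly in (0, 1).
library(betareg)

set.seed(123)
n   <- 100
x1  <- rnorm(n)
x2  <- rnorm(n)
eta <- 0.5 + 0.8 * x1 - 0.4 * x2          # linear predictor X'beta
mu  <- 1 / (1 + exp(-eta))                # inverse logit link
phi <- 20                                 # precision parameter
y   <- rbeta(n, shape1 = mu * phi, shape2 = (1 - mu) * phi)

fit_beta <- betareg(y ~ x1 + x2, link = "logit")
summary(fit_beta)                         # ML estimates of beta and phi
```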

Generalized Additive Model (GAM)

The authors of32 introduced generalized additive models, which allow the dependence of the response variable on the predictors to be modelled in a flexible way using smooth functions, by defining the linear predictor:

$$\begin{aligned} \eta _{i}&= \beta _{0}+\sum _{j=1}^{p}f_{j}(x_{ij}). \end{aligned}$$
(3)

where, \(f_{j}(x_{ij})=\sum _{k=1}^{kj}\beta _{jk}(x_{ij})\) is the smoothing term from the \(j^{\text {th}}\) predictor with \(\{ \beta _{jk}( )\}_{k=1}^{kj}\), the asset of known basis functions associated with unknown parameter \(\beta _{jk}\).

Different smoothers can be defined by adopting different basis functions, such as penalized regression splines and cubic regression spline bases33.

Estimation

We can estimate the GAM model using restricted maximum likelihood (REML), which amounts to maximizing the penalized log likelihood:

$$\begin{aligned} L_{p}(\beta )&= L(\beta )-\dfrac{1}{2}\lambda \beta ^\prime S\beta \end{aligned}$$
(4)

where \(L(\beta )=\sum _{i=1}^{n}L(y_{i}\mid \beta )\) is the log-likelihood of the observed values \(y_{i}\) of the response variable, \(\lambda\) is a smoothing parameter, S is a known penalty matrix, and \(\lambda \beta ^\prime S\beta\) is the smoothing penalty.

REML has been presented as a convenient method for marginal likelihood estimation of \(\beta\) when the model contains Gaussian random effects, and it also leads to more stable estimates of \(\lambda\) with a much-reduced risk of under-smoothing10,34.
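As an illustration of Eqs. (3)–(4), a GAM with penalized regression spline smoothers can be estimated by REML in R, assuming the mgcv package (an assumption; the paper does not name its software beyond R). The data below are simulated for illustration only.

```r
# Minimal sketch of estimating a GAM (Eq. 3) by REML (Eq. 4).
# Assumes the 'mgcv' package; x1, x2 and y are simulated for illustration.
library(mgcv)

set.seed(123)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + 0.5 * x2 + rnorm(n, sd = 0.2)

# s() builds the basis expansion f_j(x) = sum_k beta_jk * b_jk(x);
# bs = "cr" requests a cubic regression spline basis, and
# method = "REML" selects the smoothing parameter lambda.
fit_gam <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr"), method = "REML")
summary(fit_gam)
fit_gam$sp          # estimated smoothing parameters
```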

GAM beta regression model

Let \(y_i\) represent the test positive rate (TPR), obtained by dividing the number of new positive cases \(P_i\) by the total number of tests \(T_i\) at time i, where i takes integer values from 1 to n over the period studied. As a proportion, the TPR is naturally bounded between 0 and 1. Several methods and models may be used to analyze variables expressed as proportions, but the beta regression model is perhaps the best known among them5,35.

The GAM beta regression model is constructed in the following steps; an illustrative R sketch is given after the list:

  1.

    Suppose that the response variable \(Y_i\) follows a beta distribution with mean \(\mu _i\)14,16.

    $$\begin{aligned} Y_i\sim \text {beta}(\mu _i,\emptyset ) \end{aligned}$$

    The mean and variance of this beta distribution are:

    $$\begin{aligned} E(y_i)=\mu _i, \text {and } V(y_i)=\dfrac{\mu _i(1-\mu _i)}{1+\emptyset } \end{aligned}$$


  2.

    In the second stage, we define the model’s systematic component by specifying the linear predictor \(\eta _i\) as:

    $$\begin{aligned} \eta _i=\beta 'x_i \end{aligned}$$

    where \(\beta\) is a \((p+1)\)-dimensional vector of regression parameters to be estimated, and \(x_i\) contains the intercept plus the vector of measured values of the p predictors. The predictor function \(\eta _i\) constitutes the systematic component of the model9,36.

  3.

    In the third stage, we establish the relationship between the predictor function \(\eta _i\) and the expected value of \(Y_i\), denoted \(\mu _i\). This relationship is specified through the link function, giving9,36:

    $$\begin{aligned} \mu _i=\dfrac{exp(\eta _i)}{1+exp(\eta _i)}=\dfrac{1}{1+exp(-\eta _i)} \end{aligned}$$
  4.

    The corresponding link function, as used in generalized linear models (GLM), is the logit9,36:

    $$\begin{aligned} logit(\mu _i)=\log \left( \dfrac{\mu _i}{1-\mu _i}\right) =\eta _i. \end{aligned}$$
  5.

    Generalized additive models provide flexibility in modeling the dependence of the response variable by defining the linear predictor as a smooth function of the predictors, as described in9.

    $$\begin{aligned} \eta _i=\beta _0+\sum _{j=1}^{p}f_j(x_{ij}) \end{aligned}$$

    The term \(f_j(x_{ij})=\sum _{k=1}^{K_j}\beta _{jk}b_{jk}(x_{ij})\) represents the smoothing function for the \(j^{\text {th}}\) predictor; each term in the sum is the product of an unknown coefficient \(\beta _{jk}\) and a known basis function \(b_{jk}(x_{ij})\).

  6.

    In estimating the generalized additive model (GAM), restricted maximum likelihood (REML) is used to maximize the penalized log-likelihood9. The penalized log-likelihood \(L_p(\beta )\) is defined as

    $$\begin{aligned} L_p(\beta )=L(\beta )-\dfrac{1}{2}\lambda \beta ' S \beta \end{aligned}$$

    where \(L(\beta )=\sum _{i=1}^{n}L(y_i\mid \beta )\) is the log-likelihood for the observed values \(y_i\) of the response variable, \(\lambda\) is the smoothing parameter, and S is the known penalty matrix. REML is used to maximize this penalized log-likelihood when estimating the GAM.

  7.

    Predictions can then be calculated as9:

    $$\begin{aligned} \hat{\mu }_{i}&={\hat{\beta }}_{0}+\sum _{k=1}^{K_{1}} {\hat{\beta }}_{1k}b_{1k}(x_{i1})+{\hat{\beta }}_{2}x_{i2} \end{aligned}$$
    (5)
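As referenced above, the following is a minimal R sketch of the GAM beta regression assembled from the steps above, assuming the mgcv package and its betar() family (an assumption; the paper does not name the fitting routine). The simulated rate y stands in for a proportion such as the TPR, and the variable names are illustrative.

```r
# Minimal sketch of a GAM beta regression: beta-distributed response,
# logit link, a smooth term for x1 plus a linear term for x2, REML estimation.
# Assumes the 'mgcv' package; the data are simulated for illustration.
library(mgcv)

set.seed(123)
n   <- 200
x1  <- runif(n)
x2  <- rnorm(n)
eta <- -0.5 + sin(2 * pi * x1) + 0.3 * x2   # eta_i = beta_0 + f(x_i1) + beta_2 * x_i2
mu  <- 1 / (1 + exp(-eta))                  # logit^{-1}(eta_i)
phi <- 15
y   <- rbeta(n, mu * phi, (1 - mu) * phi)   # y_i ~ beta(mu_i, phi)

fit_gam_beta <- gam(y ~ s(x1, bs = "cr") + x2,
                    family = betar(link = "logit"), method = "REML")
summary(fit_gam_beta)
head(fitted(fit_gam_beta))                  # predicted mu_i as in Eq. (5)
```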

Ridge regression model

One of the most widely used techniques for solving multicollinearity in multiple linear regression is ridge analysis. This method has found applications in various fields, including engineering, chemistry, and econometrics. Ridge regression (RR) modifies the Ordinary Least Squares (OLS) method to produce biased estimators of regression coefficients, thereby addressing issues related to multicollinearity. This approach is particularly valuable when OLS estimators exhibit significant variability. So, ridge analysis can improve the predictability and accuracy of a model13. Here, we describe the linear regression model37:

$$\begin{aligned} Y&= X\beta +\epsilon \end{aligned}$$
(6)

where Y is the \(n\times 1\) vector of the dependent variable, X is the \(n\times p\) matrix of predictor variables, \(\beta\) is the \(p\times 1\) vector of regression coefficients, and \(\epsilon\) is the \(n\times 1\) vector of error terms.

In the context of ridge regression:

  1.

    The ordinary least squares (OLS) estimator \({\hat{\beta }}\) of the model in Eq. (6) is calculated as follows:

    $$\begin{aligned} {\hat{\beta }}&=(S)^{-1} X'Y \end{aligned}$$
    (7)

    where \(S=X'X\) is the cross-product matrix of the design matrix X, and \({\hat{\beta }}\) is the vector of regression coefficients estimated by the ordinary least squares method.

  2.

    The ridge regression estimator, introduced by Hoerl and Kennard, is derived by minimizing the given objective function37

    $$\begin{aligned} (Y-X\beta )'(Y-X\beta )+k(\beta '\beta -c) \end{aligned}$$
    (8)

    where \((Y-X\beta )'(Y-X\beta )\) is the OLS objective, the sum of squared residuals, and \(k(\beta '\beta -c)\) is the penalty term, in which k is a constant, \(\beta\) is the vector of regression coefficients, and c is a predefined constant.

  3.

    We obtain the normal equations37

    $$\begin{aligned} (X'X+kI_p)\beta&=X'Y \end{aligned}$$
    (9)

    where \(X'X\) is the sum of squares and cross-products matrix, \(kI_p\) introduces the penalty term into the normal equations, and k is a constant.

  4.

    The ridge estimator is determined by solving the normal equations, resulting in \(({\hat{\beta }} (k))\) as shown in Eq. (10):

    $$\begin{aligned} {\hat{\beta }} (k)&= (S+kI_p)^{-1}X'Y=W(k){\hat{\beta }}. \end{aligned}$$
    (10)

    where \(S=X'X\), and \(W(k)=(I_{P}+kS^{-1}) ^{-1}\) is a matrix derived to simplify the computation.

  5.

    The parameter k is the biasing parameter in ridge regression; Eq. (11) provides a method for selecting it13.

    $$\begin{aligned} k&= p\sigma ^2/\beta '\beta . \end{aligned}$$
    (11)

    where p is the number of regression coefficients, \(\sigma ^2\) is an estimate of the error variance, and \(\beta '\beta\) is the sum of squared estimated coefficients.

    In canonical form, the ridge estimator, denoted \({\hat{\beta }}_{k}\), is given by38:

    $$\begin{aligned} {\hat{\beta }}_{k}&= (\Lambda +kI)^{-1} X'Y \end{aligned}$$
    (12)

    where \(\Lambda\) is a diagonal \(p\times p\) matrix. The efficiency of \({\hat{\beta }}_k\) is influenced by the selection of the ridge parameter k; the value of k is chosen to minimize the Mean Squared Error (MSE), evaluated as in Eq. (13)38:

    $$\begin{aligned} \text {MSE}_{\text {Ridge}}&= \sum _{i=1}^{p}\dfrac{\lambda _i\sigma ^2+k^{2}\beta _{i}^2}{(\lambda _i+k)^2} \end{aligned}$$
    (13)

    Here, the unbiased OLS estimates \({\hat{\sigma }}^2\) and \({\hat{\beta }}\) are used in place of \({\sigma ^2}\) and \({\beta }\); an illustrative R sketch of the ridge computation follows the list.
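The sketch below, in base R with simulated data, illustrates Eqs. (7), (10), and (11): the OLS fit, the Hoerl–Kennard choice of k, and the resulting ridge estimator. All variable names and coefficient values are illustrative assumptions.

```r
# Minimal sketch of ridge regression (Eqs. 6-11) using base R only.
# Predictors are standardized so that S = X'X is a correlation-type matrix.
set.seed(123)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)        # induce multicollinearity
y <- X %*% c(1, 0.5, -0.5, 0.2) + rnorm(n)

Xs <- scale(X)                                # centred and scaled predictors
ys <- scale(y, scale = FALSE)                 # centred response
S  <- crossprod(Xs)                           # S = X'X

beta_ols <- solve(S, crossprod(Xs, ys))       # Eq. (7): OLS estimator
sigma2   <- sum((ys - Xs %*% beta_ols)^2) / (n - p)

k <- p * sigma2 / sum(beta_ols^2)             # Eq. (11): Hoerl-Kennard rule
beta_ridge <- solve(S + k * diag(p), crossprod(Xs, ys))   # Eq. (10)

data.frame(OLS = as.vector(beta_ols), Ridge = as.vector(beta_ridge))
```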

Beta ridge regression model

The beta ridge regression (BRR) estimator is proposed as an alternative to the beta maximum likelihood estimator to mitigate the impact of multicollinearity in the beta regression model. This estimator is derived as follows5,13.

Assuming that \({\hat{\beta }}\) is an estimator of the vector \(\beta\), the weighted sum of squared errors (WSSE) is defined as5:

$$\begin{aligned} \Theta&= (y-X{\hat{\beta }})'(y-X{\hat{\beta }})=(y-X{\hat{\beta }}_{\text {ML}})'(y-X{\hat{\beta }}_{\text {ML}})+({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})'X'WX({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})\nonumber \\&=\Theta _{\text {min}}+\Theta ({\hat{\beta }}) \end{aligned}$$
(14)

where \(\Theta _{\text {min}}\) is the minimum value of the WSSE and \(\Theta ({\hat{\beta }})\ge 0\) is the increment by which the WSSE increases when \({\hat{\beta }}\) is substituted for \({\hat{\beta }}_{\text {ML}}\). The BRR estimator is obtained by minimizing the length of \({\hat{\beta }}\) subject to the restriction

\(({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})^\prime X^\prime WX({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})=\Theta _{0}\), following Hoerl and Kennard’s restriction13.

That is, following Hoerl and Kennard13, we minimize the Lagrangian

$$\begin{aligned} \varrho ={\hat{\beta }}^\prime {\hat{\beta }}+(1/k)\left( ({\hat{\beta }}-{\hat{\beta }}_{ML})^\prime X^\prime W X ({\hat{\beta }}-{\hat{\beta }}_{ML})-\Theta _{0}\right) \end{aligned}$$
(15)

where 1/k is the Lagrangian multiplier. Differentiating Eq. (15) with respect to \({\hat{\beta }}\) and setting the result equal to zero gives:

$$\begin{aligned} \frac{\partial \varrho }{\partial {\hat{\beta }}}&=2{\hat{\beta }}+\frac{2X^\prime W X({\hat{\beta }}-{\hat{\beta }}_{ML})}{k}=0. \end{aligned}$$

After simplification, we obtain the following BRR estimator:

$$\begin{aligned} {\hat{\beta }}_{BRR}&=(X^\prime WX+kI)^{-1}X^\prime W X{\hat{\beta }}_{ML} \end{aligned}$$
(16)

where I is the \(p\times p\) identity matrix and k is the shrinkage parameter.
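A hedged R sketch of Eq. (16) is given below. It fits the beta regression by ML (assuming the betareg package) and forms the weight matrix W from the standard Fisher-information weights of a logit-link beta regression; that weight expression, the package choice, and the fixed value of k are assumptions for illustration, not the paper's own implementation.

```r
# Minimal sketch of the beta ridge regression (BRR) estimator in Eq. (16).
# Assumes the 'betareg' package for the ML fit; the data are simulated.
library(betareg)

set.seed(123)
n  <- 150
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)       # collinear predictors
eta <- -0.3 + 0.6 * x1 + 0.4 * x2
mu  <- 1 / (1 + exp(-eta)); phi <- 15
y   <- rbeta(n, mu * phi, (1 - mu) * phi)

fit_ml  <- betareg(y ~ x1 + x2)
beta_ml <- coef(fit_ml, model = "mean")             # ML estimates of beta
phi_hat <- coef(fit_ml, model = "precision")
mu_hat  <- fitted(fit_ml)

# Assumed Fisher-information weights for the logit link:
# w_i = phi * {psi'(mu_i*phi) + psi'((1-mu_i)*phi)} * (mu_i*(1-mu_i))^2
w <- phi_hat * (trigamma(mu_hat * phi_hat) + trigamma((1 - mu_hat) * phi_hat)) *
     (mu_hat * (1 - mu_hat))^2
X   <- model.matrix(fit_ml)                          # includes the intercept
XWX <- t(X) %*% (w * X)

k <- 0.05                                            # illustrative shrinkage parameter
beta_brr <- solve(XWX + k * diag(ncol(X)), XWX %*% beta_ml)   # Eq. (16)
cbind(ML = beta_ml, BRR = as.vector(beta_brr))
```

Any of the k-selection rules discussed above could be substituted for the fixed value used here.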

Numerical analysis

This study relies on data extracted from the Breast Cancer Wisconsin Diagnostic dataset, obtained from the University of Wisconsin Hospitals Madison Breast Cancer Database39, covering the period from January 1989 to November 1991. The dataset comprises records from 569 breast cancer patients and was accessed through an open online repository hosted at https://www.kaggle.com/code/gpreda/breast-cancer-prediction-from-cytopathology-data. Our research aims to explore the relationship between 10 predictor variables and tumor progression in breast cancer patients.

Data set description

Breast cancer represents a significant health burden globally, standing as the most prevalent cancer among women and ranking as the second leading cause of cancer-related mortality in women. Characterized by aberrant cell growth in breast tissue, this disease poses substantial health risks. In our study, we selected the radius mean as the dependent variable for several reasons. While previous research predominantly focused on diagnosis and disease classification, our approach provides a novel perspective. By utilizing the diagnosis state to assess the extent of disease spread, as indicated by the radius mean variable, we delve into the progression of breast cancer based on diagnostic information. This unique method yields valuable insights into tumor behavior and disease severity. Utilizing ‘radius mean’ as a continuous variable enriches the analysis of tumor data, enabling the use of diverse statistical methods to uncover intricate patterns. This approach not only enhances the understanding of tumor impact on patient outcomes but also facilitates the discovery of new correlations and insights in breast cancer research40. The radius mean serves as the primary outcome variable in our analysis, representing the average distance from the cell center to the perimeter. Its importance lies in its association with tumor spread; as the cell radius increases, so does the surface area, indicating a more extensive tumor spread. Our investigation encompasses 10 predictor variables, including diagnosis, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These variables play crucial roles in elucidating various aspects of breast cancer progression. Detailed units of measurement for the features in the Breast Cancer Wisconsin (Diagnostic) Data Set are provided in Table 1.

Table 1 Description of the variables.
Table 2 Descriptive statistics of breast cancer variables.

Table 2 provides comprehensive descriptive statistics for the variables in the breast cancer dataset, including the number of observations (N) and the minimum, maximum, mean, and standard deviation of each feature. The main observations are as follows:

Texture, Perimeter, and Area: The mean texture value is 19.2896 (ranging from 9.71 to 39.28), the mean perimeter is 91.9690 pixels (ranging from 43.79 to 188.50), and the mean area is 654.8891 square pixels (ranging from 143.50 to 2501.00). Higher values for texture, perimeter, and area suggest greater variability, larger tumor sizes, and potentially more irregular tumor shapes, indicative of advanced breast cancer stages.

Smoothness and Compactness: The mean smoothness value is 0.0964 (ranging from 0.05 to 0.16), and the mean compactness is 0.1043 (ranging from 0.02 to 0.35). Lower smoothness values and higher compactness values suggest irregular and denser tumor structures, respectively, which may indicate more aggressive tumor growth patterns.

Concavity and concave points: The mean concavity value is 0.0888 (ranging from 0.00 to 0.43), and the mean number of concave points is 0.0489 (ranging from 0.00 to 0.20). Higher values for concavity and concave points indicate deeper and more pronounced concave regions in tumor contours, potentially reflecting aggressive tumor behavior.

Symmetry and fractal dimension: The mean symmetry value is 0.1812 (ranging from 0.11 to 0.30), and the mean fractal dimension is 0.0628 (ranging from 0.05 to 0.10). Deviations from symmetry in breast density and higher fractal dimension values suggest irregular and complex tumor shapes, respectively, which may be associated with aggressive tumor phenotypes and disease progression.

Table 3 Diagnosis.

In Table 3, the diagnosis frequencies indicate that 62.7% of cases are benign (B), while 37.3% are malignant (M). Understanding the distribution of malignant and benign cases is crucial for characterizing the dataset and identifying potential associations between diagnostic categories and clinical outcomes.

Table 4 The estimation of linear regression model parameters.

Table 4 shows the estimates of the linear regression coefficients. The model performance indicators include \(R^{2}=0.9994\), an F statistic of \(9.184\times 10^{4}\), and a p-value of less than 0.05. The information criteria are AIC = − 6383.895 and BIC = − 6331.769. These results collectively provide insights into the effectiveness and significance of the linear regression model in capturing the relationship between the predictor variables and the response variable.

To check for multicollinearity in the data, two methods are used. First, the correlation matrix of all explanatory variables41 is examined; Table 5 shows the correlation matrix. It is seen that there are correlations greater than 0.8 between Perimeter and Area, Texture and Concave Points, and Area and Compactness. Second, Variance Inflation Factor (VIF) values greater than 5 indicate multicollinearity42; high VIF values reflect strong correlation among the predictor variables, and the variables with high VIF are Perimeter, Area, Compactness, and Concave Points. The condition number \(CN=\sqrt{\lambda _{max}/\lambda _{min}}\) of the data is 166.861. The correlation matrix, the VIF values, and the CN all indicate the existence of a multicollinearity problem.
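As an illustration, diagnostics of this kind can be reproduced in R roughly as follows (a sketch with simulated data; the vif() function from the car package is an assumption, and the condition number is computed from the eigenvalues of the scaled cross-product matrix).

```r
# Minimal sketch of the multicollinearity diagnostics: correlation matrix,
# VIF, and condition number CN = sqrt(lambda_max / lambda_min).
# Assumes the 'car' package for vif(); the data are simulated for illustration.
library(car)

set.seed(123)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.05); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

round(cor(dat[, c("x1", "x2", "x3")]), 3)    # correlations > 0.8 flag a problem
vif(lm(y ~ x1 + x2 + x3, data = dat))        # VIF > 5 indicates strong collinearity

X  <- scale(as.matrix(dat[, c("x1", "x2", "x3")]))
ev <- eigen(crossprod(X))$values
sqrt(max(ev) / min(ev))                       # condition number CN
```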

Table 5 The correlation matrix and VIF.
Table 6 The estimation of beta regression model parameters.

Table 6 indicates that several variables (diagnosis, perimeter, area, smoothness, and compactness) have a significant impact on the response variable, while others (texture, concavity, symmetry, and fractal dimension) do not show statistical significance in this analysis. These results provide insights into the relationship between the predictor variables and the response variable in the context of breast cancer data.

Table 7 Parameters estimates of GAM regression model.

Table 7 presents the estimates of the GAM parameters and the variables most influential on the response variable. The variables associated with increased breast cancer risk according to these data are perimeter, area, smoothness, compactness, and concave points.

Table 8 Deviance residuals for different models.

Table 8 presents the deviance residuals for the beta and GAM regression models. The deviance residuals for the beta regression model range from − 6.4337 to 6.4568, whereas those for the GAM range from − 0.1704 to 0.243243. The GAM residuals are smaller in magnitude, meaning that the differences between the observed and fitted values are smaller under the GAM than under the beta regression model; hence the GAM fits these data better than the beta regression model.

Table 9 Parameters estimates of GAM beta regression model.

Table 9 displays the results of the GAM beta regression and indicates the variables that significantly impact the response variable: diagnosis, perimeter, area, smoothness, compactness, and concave points. Specifically, diagnosis, perimeter, area, and smoothness affect the response variable positively, implying that these variables increase the risk of breast cancer in the analyzed dataset. In contrast, compactness and concave points affect the response variable negatively and are associated with a decreased risk of breast cancer in these data. The variables texture, concavity, symmetry, and fractal dimension do not show significance; they have small effects on the response variable and are not associated with a significant change in breast cancer risk in this dataset.

Table 10 Parameters estimates of ridge regression model.

Table 10 shows the results of the ridge regression. The parameter estimates for the ridge regression model include the estimate of the standard deviation of the error term (SC). Diagnosis, perimeter, area, smoothness, compactness, concavity, and fractal dimension have a significant impact on the response variable. Perimeter has a relatively large impact, with an estimate of 0.9021, whereas texture has a very small impact, with an estimate of − 0.0009. The variables diagnosis, perimeter, smoothness, and fractal dimension have a positive impact on the response variable, suggesting that an increase in these variables is associated with an increased risk of breast cancer in these data. In contrast, area, compactness, and concavity affect the response variable negatively and are associated with a decreased risk of breast cancer in this dataset. The variables texture, concave points, and symmetry are not significant for breast cancer risk in this dataset.

Table 11 Parameters estimates of beta ridge regression Model.

Table 11 shows the results of the beta ridge regression. Diagnosis, perimeter, area, smoothness, compactness, and concave points have a significant impact on the response variable. Perimeter has a relatively large impact, with an estimate of 12.5396, whereas texture has a very small impact, with an estimate of − 0.0132. The variables diagnosis, perimeter, and smoothness have a positive impact on the response variable, suggesting that an increase in these variables is associated with an increased risk of breast cancer. In contrast, area, compactness, and concave points affect the response variable negatively and are associated with a decreased risk of breast cancer. The variables texture, concavity, symmetry, and fractal dimension are not significant for breast cancer risk.

Table 12 AIC and BIC for different models.

According to the model selection criteria in Table 12, the best-fitting model is the one with the lowest criterion value; hence the ridge regression model fits the data best according to AIC, and the beta regression model is the best model according to BIC.

Figure 1

The fitted values of estimated models.

Figure 1 illustrates the fitted values for the GAM, beta, GAM beta, ridge, and beta ridge models. These plots show that an increase in one variable corresponds to an increase in the other, providing evidence that these models effectively capture the relationships between the predictor variables and the radius mean.

Monte Carlo simulation study

In this section, we conduct a Monte Carlo simulation experiment to evaluate the performance of our proposed regression models across various conditions. The models under examination include the beta regression model, GAM regression model, GAM Beta regression model, Ridge regression model, and Beta Ridge regression model.

To generate synthetic data for our simulation, we used the multivariate normal and beta distributions through the mvrnorm and rbeta functions, respectively. Specifically, we generated four predictor variables following a multivariate normal distribution with a mean vector of zeros and a covariance matrix constructed from a correlation matrix and a diagonal scaling matrix (D). To ensure the stability and reliability of our results, we repeated the simulation 1000 times. We set the true mean parameter for the beta distribution to \(\mu\) = 3, with a dispersion parameter of \(\phi\) = 15. Given our focus on the effect of multicollinearity under different conditions, we varied the degree of correlation across \(\rho = (0.70, 0.80, 0.90)\) and the ridge constant k across (0, 0.01, 0.10). These settings allow us to assess the impact of multicollinearity on model performance across a range of scenarios, providing valuable insights into the robustness and applicability of the regression models.
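A hedged sketch of one replicate of the data-generating step, using MASS::mvrnorm and rbeta as described above, might look as follows; the specific coefficient values and the use of the inverse logit to keep the beta mean in (0, 1) are illustrative assumptions.

```r
# Minimal sketch of one replicate of the simulation's data-generating step.
# Assumes the 'MASS' package for mvrnorm(); coefficients are illustrative.
library(MASS)

set.seed(2024)
n     <- 100
rho   <- 0.90                                  # degree of correlation
p     <- 4
R     <- matrix(rho, p, p); diag(R) <- 1       # correlation matrix
D     <- diag(rep(1, p))                       # diagonal scaling matrix
Sigma <- D %*% R %*% D                         # covariance matrix

X    <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)   # correlated predictors
beta <- c(0.5, 0.5, 0.5, 0.5)                  # illustrative coefficients
eta  <- X %*% beta
mu   <- 1 / (1 + exp(-eta))                    # inverse logit keeps mu in (0, 1)
phi  <- 15                                     # dispersion parameter
y    <- rbeta(n, mu * phi, (1 - mu) * phi)     # beta-distributed response
```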

The simulated Akaike information criterion (AIC) and Bayesian information criterion (BIC) introduced by43 are criteria for judging the performance of models as follows:

$$\begin{aligned} \text {AIC}=-2 \ln (L_{fit})+2k,\qquad \text {BIC}=-2 \ln (L_{fit})+k \ln (n), \end{aligned}$$

where \(\ln (L_{fit})\) is the log-likelihood of the fitted model, k is the number of estimated parameters, and n is the number of observations44. All computations are performed using the R programming language.
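Following these definitions, AIC and BIC can be extracted for each fitted model with the generic AIC() and BIC() functions in R. A minimal sketch, assuming the betareg and mgcv packages and reusing the objects X, y, and n from the data-generation sketch above, is:

```r
# Minimal sketch: fit competing models to the simulated data and compare
# them by AIC and BIC. Assumes 'betareg' and 'mgcv'; X, y, n come from the
# data-generation sketch above, and the model formulas are illustrative.
library(betareg)
library(mgcv)

dat <- data.frame(y = as.vector(y), X)
colnames(dat) <- c("y", paste0("x", 1:4))

fit_beta     <- betareg(y ~ x1 + x2 + x3 + x4, data = dat)
fit_gam      <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4),
                    data = dat, method = "REML")
fit_gam_beta <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4),
                    family = betar(link = "logit"), data = dat, method = "REML")

sapply(list(Beta = fit_beta, GAM = fit_gam, GAM_Beta = fit_gam_beta),
       function(m) c(AIC = AIC(m), BIC = BIC(m)))
```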

Results and discussion

The results of the Monte Carlo simulations for different sample sizes n are presented in Tables 13, 14, 15, 16, 17, 18, 19, 20 and 21, respectively. From these tables, the factors affecting the performance of the estimators are the degree of correlation \(\rho\), the sample size n, and the ridge constant k. Generally, as the sample size increases, both AIC and BIC values are expected to decrease, reflecting the improved fit of the model due to the inclusion of more data points. This decrease is not indicative of a negative relationship between AIC and BIC but rather reflects their individual responses to increased sample size: AIC penalizes model complexity to a lesser extent than BIC, which is why they may decrease at different rates as the sample size grows. This trend indicates an improvement in the efficiency of all models with larger sample sizes. For all sample sizes, degrees of correlation, and ridge constants, the Ridge model has the lowest AIC values, indicating a better fit compared with the other models, followed by the Beta and GAM-Beta models, which have the next-lowest AIC values, suggesting a better fit for larger datasets. Introducing a ridge constant (0.01, 0.10) marginally affects the AIC and BIC values for the Ridge and Beta Ridge models, indicating the sensitivity of these models to the regularization strength. On the other hand, for all sample sizes, the Beta and GAM-Beta models have the lowest BIC values, suggesting they provide the best fit for larger sample sizes. The Beta Ridge model has the highest BIC and AIC values across all sample sizes, indicating a relatively poor fit compared with the other models.

Table 13 Simulation study results for \(\rho =0.7, k=0\).
Table 14 Simulation study results for \(\rho =0.7, k=0.01\).
Table 15 Simulation study results for \(\rho =0.7, k=0.1\).
Table 16 Simulation study results for \(\rho =0.8, k=0\).
Table 17 Simulation study results for \(\rho =0.8, k=0.01\).
Table 18 Simulation study results for \(\rho =0.8, k=0.1\).
Table 19 Simulation study results for \(\rho =0.9, k=0\).
Table 20 Simulation study results for \(\rho =0.9, k=0.01\).
Table 21 Simulation study results for \(\rho =0.9, k=0.1\).
Figure 2

Average values of AIC and BIC by the degree of correlation (\(\rho\)), the sample size (n), and the ridge constant (k) for all models.

From Fig. 2, we found that the average AIC values decreased as the sample size increased; moreover, the average values of both AIC and BIC show a decreasing pattern with increasing sample size, which is particularly evident for the beta regression model, as illustrated in Figs. 1 and 2. In Fig. 3, as the degree of correlation \((\rho )\) increases from 0.70 to 0.90 and as the ridge constant (k) changes, for all sample sizes, the performance of the Ridge and Beta Ridge models shows significant fluctuation in both AIC and BIC values, underscoring the sensitivity of these models to changes in correlation and regularization strength.

Figure 3

Average values of AIC and BIC by the degree of correlation (\(\rho\)) and the ridge constant (k) for the BRR and RR models.

In Fig. 3, across all values of \(\rho\) and k, the optimal model fit was consistently observed at a sample size of 200. Specifically, according to AIC the Ridge model demonstrated the best fit, while according to BIC the Beta model emerged as the top-performing model.

Figure 4

Average values of AIC and BIC by the degree of correlation (\(\rho\)) and the ridge constant (k) for all models.

In Fig. 4, as the sample size increases, as depicted in (h.1) for varying degrees of correlation (\(\rho\)) and in (h.2) for the ridge constants, there is a notable decrease in the average AIC values. Ridge regression consistently emerges as the best model in terms of AIC, whereas, when assessing the average BIC values, the optimal model is identified as beta ridge regression.

Conclusions

In this paper, we meticulously tailored a suite of models, including Generalized Additive Models (GAM), Beta regression, GAM Beta regression, Ridge regression, and Beta Ridge regression, to the intricacies of breast cancer data. Our analysis underscored a preference for the Akaike Information Criterion (AIC) in GAMs, attributed to its accommodation of the models’ complexity and flexibility, essential for capturing the multifaceted nature of the data10.

A thorough simulation study was conducted to empirically validate our models across varying sample sizes and correlation coefficients, enhancing the robustness of our findings. In analyzing data from 569 breast cancer patients, we discerned key independent variables that significantly influence breast cancer risk. The comparative analysis revealed that the Beta regression model outperformed others based on the Bayesian Information Criterion (BIC), while the Ridge regression model showed superiority according to the Akaike Information Criterion (AIC). These results mirror those obtained from our simulation study, indicating that the selection between Ridge and Beta regression models may depend on the preferred information criterion, especially in smaller sample sizes. However, as sample sizes increase, both models consistently demonstrate suitability across both AIC and BIC metrics. It is crucial to acknowledge that these conclusions are drawn within the confines of our study’s dataset and simulation parameters, necessitating caution when extrapolating to other contexts.

Our investigation rigorously assessed a suite of low-dimensional regression models, including Generalized Additive Models (GAM), Beta regression, GAM Beta regression, Ridge regression, and Beta Ridge regression. While acknowledging the extensive research on high-dimensional regression models27,28, our study is distinctively focused on low-dimensional contexts. Applied to authentic breast cancer data, the performance of these models was meticulously evaluated against a simulation study, ensuring a robust examination within the dataset’s dimensional constraints.