Introduction

Regression analysis is one of the most important statistical tools, with applications in many fields. Various types of regression models are available in the literature, including the linear regression model (LRM), non-linear models, the generalized linear model (GLM), and generalized additive models (GAM)1. The GLM, introduced by Nelder and Wedderburn in 1972, relaxes the assumptions of the LRM by accommodating non-normally distributed responses, addressing heteroscedasticity, and allowing non-linear associations with predictors2,3. The GLM takes many forms; one of these is the beta regression model (BRM), which models the dependency of a continuous random variable restricted to the standard unit interval on a set of independent variables in different fields. The authors of4 proposed the BRM to explain variation in dependent variables that behave as rates and proportions, taking values in the interval (0, 1). This model assumes that the response variable follows the beta distribution and can also accommodate asymmetry and heteroscedasticity1. Generally, the maximum likelihood estimator (MLE) is used to estimate the unknown regression coefficients of the BRM5,6. The GAM offers the analyst an outstanding regression tool for understanding the quantitative structure of data; an early monograph on generalized additive models is that of Hastie and Tibshirani (1990)7. GLMs and GAMs have become standard tools for analyzing the impact of covariates on possibly non-Gaussian response variables. The main difference between the GAM and the GLM is that the GAM permits the inclusion of nonlinear smooth functions in the model8. The smoothing parameter can be selected, among many other proposals, by minimizing the conditional Akaike information criterion (AIC)9. This version of the AIC for GAMs uses the log-likelihood evaluated at the penalized MLE, with the effective degrees of freedom computed as discussed in10.

Multicollinearity is a common issue in regression modeling: it indicates that there are strong associations among the explanatory variables. Many biased estimators have been introduced to combat multicollinearity in linear regression, such as the Stein estimator11, the principal component estimator12, the ridge regression estimator13, improved ridge estimators14, the contraction estimator15, the modified ridge regression estimator16, the Liu estimator17, the Liu-type estimator18, the restricted and unrestricted two-parameter estimators19, the (k-d) class estimator20, the mixed ridge estimator21 and the modified Liu-type estimator22. There are several ways to estimate the shrinkage parameter, such as ridge, Liu, and Liu-type estimation, which have become generally accepted and effective methodologies for addressing the multicollinearity problem in several regression models. The ridge estimator (RE) was proposed in13; its concept is to add a small positive amount (k) to the diagonal entries of the covariance matrix to improve the conditioning of this matrix, reduce the MSE, and obtain stable coefficients. Several attempts have been made to choose the best ridge parameter k, based on the work of23 and24. The impact of multicollinearity on the GLM is significant and enduring, and among the various GLMs the BRM is notably affected by multicollinearity5. The authors of25 and26 proposed ridge estimators for the BRM to remedy the instability of the traditional ML method and increase the efficiency of estimation, and a new modified ridge-type estimator for the BRM was proposed in1. This paper presents a comparative analysis of various statistical models, incorporating both real data and simulation studies, with a specific focus on evaluating these models using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Although there are many studies on high-dimensional regression27,28, this paper focuses on the evaluation of low-dimensional regression models. The paper is organized as follows: the beta regression model, the GAM, the GAM beta regression model, the ridge regression model, and the beta ridge regression model are presented; a numerical evaluation is then offered using a Monte Carlo simulation and an empirical data application; finally, conclusions are presented.

Methodology

Beta regression model

Beta regression is widely used in fields such as economics and medical research to assess the influence of selected independent variables on a non-normal dependent variable. In beta regression, the response variable is constrained to the interval (0, 1), covering quantities such as fractions and percentages. Classical regression models examine the relationship between chosen independent variables and a normally distributed dependent variable; this is not appropriate when the response variable does not follow the normal distribution, because it may give misleading estimates. The authors of4 developed the beta regression model by using a link function to connect the mean of the dependent variable to the linear predictor. The inverse of the precision parameter of this model is called the dispersion parameter, and it is assumed constant across observations, although the precision parameter might not be constant, consistent with the results of29,30.

Let y be a continuous random variable that has a beta distribution with a probability density function as follows:

$$\begin{aligned} f(y_{i};\mu ,\emptyset )&= \dfrac{\Gamma (\emptyset )}{\Gamma (\mu \emptyset ) \Gamma ((1-\mu )\emptyset )} y_{i}^{\mu \emptyset -1}(1-y_{i})^{(1-\mu )\emptyset -1},\quad 0<y_{i}<1;\; 0<\mu <1;\; \emptyset >0. \end{aligned}$$
(1)

where \(\Gamma\) is the gamma function and \(\emptyset\) is the precision parameter.

$$\begin{aligned} \emptyset =\dfrac{1-\sigma ^{2}}{\sigma ^{2}} \end{aligned}$$

The mean and variance of the beta probability distribution are:

$$\begin{aligned} E(y)&= \mu .\\ var(y)&= \mu (1-\mu )\sigma ^{2} \end{aligned}$$

By using the logit link function, the model allows \(\mu _{i}\) to depend on the covariates as follows:

$$\begin{aligned} g(\mu _{i})&= \log \left( \dfrac{\mu _{i}}{1-\mu _{i}}\right) =X_{i} ^\prime \beta =\eta _{i} \end{aligned}$$
(2)

The linear predictor maps into the mean of the beta distribution, which inherently models data within the open interval (0, 1). In scenarios where values exactly equal to 0 or 1 are observed, one can consider employing the zero- and/or one-inflated beta distribution proposed in31.
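As a hedged illustration of Eqs. (1)–(2), a beta regression with a logit link can be fitted in R, for example with the betareg package (an assumption; the paper does not name a specific package, and any equivalent routine could be used). The variables below are simulated and purely illustrative.

```r
# Minimal sketch of fitting a beta regression model (Eq. 2) by ML in R.
# Assumes the 'betareg' package; the response must lie strictly in (0, 1).
library(betareg)

set.seed(123)
n   <- 100
x1  <- rnorm(n)
x2  <- rnorm(n)
eta <- 0.5 + 0.8 * x1 - 0.4 * x2          # linear predictor X'beta
mu  <- 1 / (1 + exp(-eta))                # inverse logit link
phi <- 20                                 # precision parameter
y   <- rbeta(n, shape1 = mu * phi, shape2 = (1 - mu) * phi)

fit_beta <- betareg(y ~ x1 + x2, link = "logit")
summary(fit_beta)                         # ML estimates of beta and phi
```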

Generalized Additive Model (GAM)

The authors of32 introduced generalized additive models, which allow the dependence of the response variable on the predictors to be modelled in a flexible way using smooth functions, by defining the linear predictor:

$$\begin{aligned} \eta _{i}&= \beta _{0}+\sum _{j=1}^{p}f_{j}(x_{ij}). \end{aligned}$$
(3)

where, \(f_{j}(x_{ij})=\sum _{k=1}^{kj}\beta _{jk}(x_{ij})\) is the smoothing term from the \(j^{\text {th}}\) predictor with \(\{ \beta _{jk}( )\}_{k=1}^{kj}\), the asset of known basis functions associated with unknown parameter \(\beta _{jk}\).

Different smoothers can be defined by adopting different basis functions, such as penalized regression splines and cubic regression spline bases33.

Estimation

We can estimate the GAM model using restricted maximum likelihood (REML), which amounts to maximizing the penalized log likelihood:

$$\begin{aligned} L_{p}(\beta )&= L(\beta )-\dfrac{1}{2}\lambda \beta ^\prime S\beta \end{aligned}$$
(4)

where \(L(\beta )=\sum _{i=1}^{n}L(y_{i}\mid \beta )\) is the log-likelihood of the observed values \(y_{i}\) of the response variable, \(\lambda\) is a smoothing parameter, S is a known penalty matrix, and \(\lambda \beta ^\prime S\beta\) is the smoothing penalty.

REML has been presented as a convenient method for marginal likelihood estimation of \(\beta\) when the model contains Gaussian random effects, and it also leads to more stable estimates of \(\lambda\) with a much-reduced risk of under-smoothing10,34.
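As an illustration of Eqs. (3)–(4), a GAM with penalized regression spline smoothers can be estimated by REML in R, assuming the mgcv package (an assumption; the paper does not name its software beyond R). The data below are simulated for illustration only.

```r
# Minimal sketch of estimating a GAM (Eq. 3) by REML (Eq. 4).
# Assumes the 'mgcv' package; x1, x2 and y are simulated for illustration.
library(mgcv)

set.seed(123)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- sin(2 * pi * x1) + 0.5 * x2 + rnorm(n, sd = 0.2)

# s() builds the basis expansion f_j(x) = sum_k beta_jk * b_jk(x);
# bs = "cr" requests a cubic regression spline basis, and
# method = "REML" selects the smoothing parameter lambda.
fit_gam <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr"), method = "REML")
summary(fit_gam)
fit_gam$sp          # estimated smoothing parameters
```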

GAM beta regression model

Let \(y_i\) represent the test positive rate (TPR), obtained by dividing the number of new positive cases \(P_i\) by the total number of tests \(T_i\) at time i, where i takes integer values from 1 to n over the period studied. As a proportion, the TPR is naturally bounded between 0 and 1. Several methods and models may be used to analyze variables expressed as proportions, but the beta regression model is perhaps the best known among them5,35.

The GAM beta regression model is constructed in the following steps; an illustrative R sketch is given after the list:

  1.

    Suppose that the response variable \(Y_i\) follows a beta distribution with mean \(\mu _i\)14,16.

    $$\begin{aligned} Y_i\sim \text {beta}(\mu _i,\emptyset ) \end{aligned}$$

    The mean and variance of this beta distribution are:

    $$\begin{aligned} E(y_i)=\mu _i, \text {and } V(y_i)=\dfrac{\mu _i(1-\mu _i)}{1+\emptyset } \end{aligned}$$


  2.

    In the second stage, we define the model’s systematic component by specifying the linear predictor \(\eta _i\) as:

    $$\begin{aligned} \eta _i=\beta 'x_i \end{aligned}$$

    where \(\beta\) is a \((p+1)\)-dimensional vector of regression parameters to be estimated, and \(x_i\) contains the intercept plus the vector of measured values of the p predictors. The predictor function \(\eta _i\) constitutes the systematic component of the model9,36.

  3.

    In the third stage, we establish the relationship between the predictor function \(\eta _i\) and the expected value of \(Y_i\), denoted \(\mu _i\). This relationship is specified through the link function, giving9,36:

    $$\begin{aligned} \mu _i=\dfrac{exp(\eta _i)}{1+exp(\eta _i)}=\dfrac{1}{1+exp(-\eta _i)} \end{aligned}$$
  4.

    The corresponding link function, as used in generalized linear models (GLM), is the logit9,36:

    $$\begin{aligned} logit(\mu _i)=\log \left( \dfrac{\mu _i}{1-\mu _i}\right) =\eta _i. \end{aligned}$$
  5.

    Generalized additive models provide flexibility in modeling the dependence of the response variable by defining the linear predictor as a smooth function of the predictors, as described in9.

    $$\begin{aligned} \eta _i=\beta _0+\sum _{j=1}^{p}f_j(x_{ij}) \end{aligned}$$

    The term \(f_j(x_{ij})=\sum _{k=1}^{K_j}\beta _{jk}b_{jk}(x_{ij})\) represents the smoothing function for the \(j^{\text {th}}\) predictor; each term in the sum is the product of an unknown coefficient \(\beta _{jk}\) and a known basis function \(b_{jk}(x_{ij})\).

  6.

    In estimating the generalized additive model (GAM), restricted maximum likelihood (REML) is used to maximize the penalized log-likelihood9. The penalized log-likelihood \(L_p(\beta )\) is defined as

    $$\begin{aligned} L_p(\beta )=L(\beta )-\dfrac{1}{2}\lambda \beta ' S \beta \end{aligned}$$

    where \(L(\beta )=\sum _{i=1}^{n}L(y_i\mid \beta )\) is the log-likelihood for the observed values \(y_i\) of the response variable, \(\lambda\) is the smoothing parameter, and S is the known penalty matrix. REML is used to maximize this penalized log-likelihood when estimating the GAM.

  7.

    Predictions can then be calculated as9:

    $$\begin{aligned} \hat{\mu }_{i}&={\hat{\beta }}_{0}+\sum _{k=1}^{K_{1}} {\hat{\beta }}_{1k}b_{1k}(x_{i1})+{\hat{\beta }}_{2}x_{i2} \end{aligned}$$
    (5)
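As referenced above, the following is a minimal R sketch of the GAM beta regression assembled from the steps above, assuming the mgcv package and its betar() family (an assumption; the paper does not name the fitting routine). The simulated rate y stands in for a proportion such as the TPR, and the variable names are illustrative.

```r
# Minimal sketch of a GAM beta regression: beta-distributed response,
# logit link, a smooth term for x1 plus a linear term for x2, REML estimation.
# Assumes the 'mgcv' package; the data are simulated for illustration.
library(mgcv)

set.seed(123)
n   <- 200
x1  <- runif(n)
x2  <- rnorm(n)
eta <- -0.5 + sin(2 * pi * x1) + 0.3 * x2   # eta_i = beta_0 + f(x_i1) + beta_2 * x_i2
mu  <- 1 / (1 + exp(-eta))                  # logit^{-1}(eta_i)
phi <- 15
y   <- rbeta(n, mu * phi, (1 - mu) * phi)   # y_i ~ beta(mu_i, phi)

fit_gam_beta <- gam(y ~ s(x1, bs = "cr") + x2,
                    family = betar(link = "logit"), method = "REML")
summary(fit_gam_beta)
head(fitted(fit_gam_beta))                  # predicted mu_i as in Eq. (5)
```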

Ridge regression model

One of the most widely used techniques for solving multicollinearity in multiple linear regression is ridge analysis. This method has found applications in various fields, including engineering, chemistry, and econometrics. Ridge regression (RR) modifies the Ordinary Least Squares (OLS) method to produce biased estimators of regression coefficients, thereby addressing issues related to multicollinearity. This approach is particularly valuable when OLS estimators exhibit significant variability. So, ridge analysis can improve the predictability and accuracy of a model13. Here, we describe the linear regression model37:

$$\begin{aligned} Y&= X\beta +\epsilon \end{aligned}$$
(6)

where Y is the \(n\times 1\) vector of the dependent variable, X is the \(n\times p\) matrix of predictor variables, \(\beta\) is the \(p\times 1\) vector of regression coefficients, and \(\epsilon\) is the \(n\times 1\) vector of error terms.

In the context of ridge regression:

  1.

    The ordinary least squares (OLS) estimator \({\hat{\beta }}\) of the model in Eq. (6) is calculated as follows:

    $$\begin{aligned} {\hat{\beta }}&=(S)^{-1} X'Y \end{aligned}$$
    (7)

    where \(S=X'X\) is the cross-product matrix of the design matrix X, and \({\hat{\beta }}\) is the vector of regression coefficients estimated by the ordinary least squares method.

  2.

    The ridge regression estimator, introduced by Hoerl and Kennard, is derived by minimizing the given objective function37

    $$\begin{aligned} (Y-X\beta )'(Y-X\beta )+k(\beta '\beta -c) \end{aligned}$$
    (8)

    where \((Y-X\beta )'(Y-X\beta )\) is the OLS objective, the sum of squared residuals, and \(k(\beta '\beta -c)\) is the penalty term, in which k is a constant, \(\beta\) is the vector of regression coefficients, and c is a predefined constant.

  3.

    We obtain the normal equations37

    $$\begin{aligned} (X'X+kI_p)\beta&=X'Y \end{aligned}$$
    (9)

    where \(X'X\) is the sum of squares and cross-products matrix, \(kI_p\) introduces the penalty term into the normal equations, and k is a constant.

  4.

    The ridge estimator is determined by solving the normal equations, resulting in \(({\hat{\beta }} (k))\) as shown in Eq. (10):

    $$\begin{aligned} {\hat{\beta }} (k)&= (S+kI_p)^{-1}X'Y=W(k){\hat{\beta }}. \end{aligned}$$
    (10)

    where \(S=X'X\), and \(W(k)=(I_{P}+kS^{-1}) ^{-1}\) is a matrix derived to simplify the computation.

  5.

    The parameter k is the biasing parameter in ridge regression; Eq. (11) provides a method for selecting it13.

    $$\begin{aligned} k&= p\sigma ^2/\beta '\beta . \end{aligned}$$
    (11)

    where p is the number of regression coefficients, \(\sigma ^2\) is an estimate of the error variance, and \(\beta '\beta\) is the sum of squared estimated coefficients.

    In canonical form, the ridge estimator, denoted \({\hat{\beta }}_{k}\), is given by38:

    $$\begin{aligned} {\hat{\beta }}_{k}&= (\Lambda +kI)^{-1} X'Y \end{aligned}$$
    (12)

    where \(\Lambda\) is a diagonal \(p\times p\) matrix. The efficiency of \({\hat{\beta }}_k\) is influenced by the selection of the ridge parameter k; the value of k is chosen to minimize the Mean Squared Error (MSE), evaluated as in Eq. (13)38:

    $$\begin{aligned} \text {MSE}_{\text {Ridge}}&= \sum _{i=1}^{p}\dfrac{\lambda _i\sigma ^2+k^{2}\beta _{i}^2}{(\lambda _i+k)^2} \end{aligned}$$
    (13)

    Here, the unbiased OLS estimates \({\hat{\sigma }}^2\) and \({\hat{\beta }}\) are used in place of \({\sigma ^2}\) and \({\beta }\); an illustrative R sketch of the ridge computation follows the list.
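The sketch below, in base R with simulated data, illustrates Eqs. (7), (10), and (11): the OLS fit, the Hoerl–Kennard choice of k, and the resulting ridge estimator. All variable names and coefficient values are illustrative assumptions.

```r
# Minimal sketch of ridge regression (Eqs. 6-11) using base R only.
# Predictors are standardized so that S = X'X is a correlation-type matrix.
set.seed(123)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
X[, 2] <- X[, 1] + rnorm(n, sd = 0.1)        # induce multicollinearity
y <- X %*% c(1, 0.5, -0.5, 0.2) + rnorm(n)

Xs <- scale(X)                                # centred and scaled predictors
ys <- scale(y, scale = FALSE)                 # centred response
S  <- crossprod(Xs)                           # S = X'X

beta_ols <- solve(S, crossprod(Xs, ys))       # Eq. (7): OLS estimator
sigma2   <- sum((ys - Xs %*% beta_ols)^2) / (n - p)

k <- p * sigma2 / sum(beta_ols^2)             # Eq. (11): Hoerl-Kennard rule
beta_ridge <- solve(S + k * diag(p), crossprod(Xs, ys))   # Eq. (10)

data.frame(OLS = as.vector(beta_ols), Ridge = as.vector(beta_ridge))
```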

Beta ridge regression model

The beta ridge regression (BRR) estimator is proposed as an alternative to the beta maximum likelihood estimator to mitigate the impact of multicollinearity in the beta regression model. This estimator is derived as follows5,13.

Assuming that \({\hat{\beta }}\) is an estimator of the vector \(\beta\), the weighted sum of squared errors (WSSE) is defined as5:

$$\begin{aligned} \Theta&= (y-X{\hat{\beta }})'(y-X{\hat{\beta }})=(y-X{\hat{\beta }}_{\text {ML}})'(y-X{\hat{\beta }}_{\text {ML}})+({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})'X'WX({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})\nonumber \\&=\Theta _{\text {min}}+\Theta ({\hat{\beta }}) \end{aligned}$$
(14)

where \(\Theta _{\text {min}}\) is the minimum value of the WSSE and \(\Theta ({\hat{\beta }})\ge 0\) is the increment by which the WSSE increases when \({\hat{\beta }}\) is substituted for \({\hat{\beta }}_{\text {ML}}\). The BRR estimator is obtained by minimizing the length of \({\hat{\beta }}\) subject to the restriction

\(({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})^\prime X^\prime WX({\hat{\beta }}-{\hat{\beta }}_{\text {ML}})=\Theta _{0}\), following Hoerl and Kennard’s restriction13.

That is, following Hoerl and Kennard13, we minimize the Lagrangian

$$\begin{aligned} \varrho ={\hat{\beta }}^\prime {\hat{\beta }}+(1/k)\left( ({\hat{\beta }}-{\hat{\beta }}_{ML})^\prime X^\prime W X ({\hat{\beta }}-{\hat{\beta }}_{ML})-\Theta _{0}\right) \end{aligned}$$
(15)

where 1/k is the Lagrangian multiplier. Differentiating Eq. (15) with respect to \({\hat{\beta }}\) and setting the result equal to zero gives:

$$\begin{aligned} \frac{\partial \varrho }{\partial {\hat{\beta }}}&=2{\hat{\beta }}+\frac{2X^\prime W X({\hat{\beta }}-{\hat{\beta }}_{ML})}{k}=0. \end{aligned}$$

After simplification, we obtain the following BRR estimator:

$$\begin{aligned} {\hat{\beta }}_{BRR}&=(X^\prime WX+kI)^{-1}X^\prime W X{\hat{\beta }}_{ML} \end{aligned}$$
(16)

where I is the \(p\times p\) identity matrix and k is the shrinkage parameter.
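A hedged R sketch of Eq. (16) is given below. It fits the beta regression by ML (assuming the betareg package) and forms the weight matrix W from the standard Fisher-information weights of a logit-link beta regression; that weight expression, the package choice, and the fixed value of k are assumptions for illustration, not the paper's own implementation.

```r
# Minimal sketch of the beta ridge regression (BRR) estimator in Eq. (16).
# Assumes the 'betareg' package for the ML fit; the data are simulated.
library(betareg)

set.seed(123)
n  <- 150
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)       # collinear predictors
eta <- -0.3 + 0.6 * x1 + 0.4 * x2
mu  <- 1 / (1 + exp(-eta)); phi <- 15
y   <- rbeta(n, mu * phi, (1 - mu) * phi)

fit_ml  <- betareg(y ~ x1 + x2)
beta_ml <- coef(fit_ml, model = "mean")             # ML estimates of beta
phi_hat <- coef(fit_ml, model = "precision")
mu_hat  <- fitted(fit_ml)

# Assumed Fisher-information weights for the logit link:
# w_i = phi * {psi'(mu_i*phi) + psi'((1-mu_i)*phi)} * (mu_i*(1-mu_i))^2
w <- phi_hat * (trigamma(mu_hat * phi_hat) + trigamma((1 - mu_hat) * phi_hat)) *
     (mu_hat * (1 - mu_hat))^2
X   <- model.matrix(fit_ml)                          # includes the intercept
XWX <- t(X) %*% (w * X)

k <- 0.05                                            # illustrative shrinkage parameter
beta_brr <- solve(XWX + k * diag(ncol(X)), XWX %*% beta_ml)   # Eq. (16)
cbind(ML = beta_ml, BRR = as.vector(beta_brr))
```

Any of the k-selection rules discussed above could be substituted for the fixed value used here.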

Numerical analysis

This study relies on data extracted from the Breast Cancer Wisconsin Diagnostic dataset, obtained from the University of Wisconsin Hospitals Madison Breast Cancer Database39, covering the period from January 1989 to November 1991. The dataset comprises records from 569 breast cancer patients and was accessed through an open online repository hosted at https://www.kaggle.com/code/gpreda/breast-cancer-prediction-from-cytopathology-data. Our research aims to explore the relationship between 10 predictor variables and tumor progression in breast cancer patients.

Data set description

Breast cancer represents a significant health burden globally, standing as the most prevalent cancer among women and ranking as the second leading cause of cancer-related mortality in women. Characterized by aberrant cell growth in breast tissue, this disease poses substantial health risks. In our study, we selected the radius mean as the dependent variable for several reasons. While previous research predominantly focused on diagnosis and disease classification, our approach provides a novel perspective. By utilizing the diagnosis state to assess the extent of disease spread, as indicated by the radius mean variable, we delve into the progression of breast cancer based on diagnostic information. This unique method yields valuable insights into tumor behavior and disease severity. Utilizing ‘radius mean’ as a continuous variable enriches the analysis of tumor data, enabling the use of diverse statistical methods to uncover intricate patterns. This approach not only enhances the understanding of tumor impact on patient outcomes but also facilitates the discovery of new correlations and insights in breast cancer research40. The radius mean serves as the primary outcome variable in our analysis, representing the average distance from the cell center to the perimeter. Its importance lies in its association with tumor spread; as the cell radius increases, so does the surface area, indicating a more extensive tumor spread. Our investigation encompasses 10 predictor variables, including diagnosis, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These variables play crucial roles in elucidating various aspects of breast cancer progression. Detailed units of measurement for the features in the Breast Cancer Wisconsin (Diagnostic) Data Set are provided in Table 1.

Table 1 Description of the variables.
Table 2 Descriptive statistics of breast cancer variables.

Table 2 provides comprehensive descriptive statistics for the variables in the breast cancer dataset, including the number of observations (N) and the minimum, maximum, mean, and standard deviation of each feature. The main observations are as follows:

Texture, Perimeter, and Area: The mean texture value is 19.2896 (ranging from 9.71 to 39.28), the mean perimeter is 91.9690 pixels (ranging from 43.79 to 188.50), and the mean area is 654.8891 square pixels (ranging from 143.50 to 2501.00). Higher values for texture, perimeter, and area suggest greater variability, larger tumor sizes, and potentially more irregular tumor shapes, indicative of advanced breast cancer stages.

Smoothness and Compactness: The mean smoothness value is 0.0964 (ranging from 0.05 to 0.16), and the mean compactness is 0.1043 (ranging from 0.02 to 0.35). Lower smoothness values and higher compactness values suggest irregular and denser tumor structures, respectively, which may indicate more aggressive tumor growth patterns.

Concavity and concave points: The mean concavity value is 0.0888 (ranging from 0.00 to 0.43), and the mean number of concave points is 0.0489 (ranging from 0.00 to 0.20). Higher values for concavity and concave points indicate deeper and more pronounced concave regions in tumor contours, potentially reflecting aggressive tumor behavior.

Symmetry and fractal dimension: The mean symmetry value is 0.1812 (ranging from 0.11 to 0.30), and the mean fractal dimension is 0.0628 (ranging from 0.05 to 0.10). Deviations from symmetry in breast density and higher fractal dimension values suggest irregular and complex tumor shapes, respectively, which may be associated with aggressive tumor phenotypes and disease progression.

Table 3 Diagnosis.

In Table 3, the diagnosis frequencies indicate that 62.7% of cases are benign (B), while 37.3% are malignant (M). Understanding the distribution of malignant and benign cases is crucial for characterizing the dataset and identifying potential associations between diagnostic categories and clinical outcomes.

Table 4 The estimation of linear regression model parameters.

Table 4 shows the estimates of the linear regression coefficients. The model performance indicators include \(R^{2}=0.9994\), an F statistic of \(9.184\times 10^{4}\), and a p-value of less than 0.05. The information criteria are AIC = − 6383.895 and BIC = − 6331.769. These results collectively provide insights into the effectiveness and significance of the linear regression model in capturing the relationship between the predictor variables and the response variable.

To check for multicollinearity in the data, two methods are used. First, the correlation matrix of all explanatory variables41 is examined; Table 5 shows the correlation matrix. It is seen that there are correlations greater than 0.8 between Perimeter and Area, Texture and Concave Points, and Area and Compactness. Second, Variance Inflation Factor (VIF) values greater than 5 indicate multicollinearity42; high VIF values reflect strong correlation among the predictor variables, and the variables with high VIF are Perimeter, Area, Compactness, and Concave Points. The condition number \(CN=\sqrt{\lambda _{max}/\lambda _{min}}\) of the data is 166.861. The correlation matrix, the VIF values, and the CN all indicate the existence of a multicollinearity problem.
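As an illustration, diagnostics of this kind can be reproduced in R roughly as follows (a sketch with simulated data; the vif() function from the car package is an assumption, and the condition number is computed from the eigenvalues of the scaled cross-product matrix).

```r
# Minimal sketch of the multicollinearity diagnostics: correlation matrix,
# VIF, and condition number CN = sqrt(lambda_max / lambda_min).
# Assumes the 'car' package for vif(); the data are simulated for illustration.
library(car)

set.seed(123)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.05); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

round(cor(dat[, c("x1", "x2", "x3")]), 3)    # correlations > 0.8 flag a problem
vif(lm(y ~ x1 + x2 + x3, data = dat))        # VIF > 5 indicates strong collinearity

X  <- scale(as.matrix(dat[, c("x1", "x2", "x3")]))
ev <- eigen(crossprod(X))$values
sqrt(max(ev) / min(ev))                       # condition number CN
```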

Table 5 The correlation matrix and VIF.
Table 6 The estimation of beta regression model parameters.

Table 6 indicates that several variables (diagnosis, perimeter, area, smoothness, and compactness) have a significant impact on the response variable, while others (texture, concavity, symmetry, and fractal dimension) do not show statistical significance in this analysis. These results provide insights into the relationship between the predictor variables and the response variable in the context of breast cancer data.

Table 7 Parameters estimates of GAM regression model.

Table 7 presents the estimates of the GAM parameters and the variables most influential on the response variable. The variables associated with increased breast cancer risk according to these data are perimeter, area, smoothness, compactness, and concave points.

Table 8 Deviance residuals for different models.

Table 8 presents the deviance residuals for the beta and GAM regression models. The deviance residuals for the beta regression model range from − 6.4337 to 6.4568, whereas those for the GAM range from − 0.1704 to 0.243243. The GAM residuals are smaller in magnitude, meaning that the differences between the observed and fitted values are smaller under the GAM than under the beta regression model; hence the GAM fits these data better than the beta regression model.

Table 9 Parameters estimates of GAM beta regression model.

Table 9 displays the results of the GAM beta regression and indicates the variables that significantly impact the response variable: diagnosis, perimeter, area, smoothness, compactness, and concave points. Specifically, diagnosis, perimeter, area, and smoothness affect the response variable positively, implying that these variables increase the risk of breast cancer in the analyzed dataset. In contrast, compactness and concave points affect the response variable negatively and are associated with a decreased risk of breast cancer in these data. The variables texture, concavity, symmetry, and fractal dimension do not show significance; they have small effects on the response variable and are not associated with a significant change in breast cancer risk in this dataset.

Table 10 Parameters estimates of ridge regression model.

Table 10 shows the results of the ridge regression. The parameter estimates for the ridge regression model include the estimate of the standard deviation of the error term (SC). Diagnosis, perimeter, area, smoothness, compactness, concavity, and fractal dimension have a significant impact on the response variable. Perimeter has a relatively large impact, with an estimate of 0.9021, whereas texture has a very small impact, with an estimate of − 0.0009. The variables diagnosis, perimeter, smoothness, and fractal dimension have a positive impact on the response variable, suggesting that an increase in these variables is associated with an increased risk of breast cancer in these data. In contrast, area, compactness, and concavity affect the response variable negatively and are associated with a decreased risk of breast cancer in this dataset. The variables texture, concave points, and symmetry are not significant for breast cancer risk in this dataset.

Table 11 Parameters estimates of beta ridge regression Model.

Table 11 shows the results of the beta ridge regression. Diagnosis, perimeter, area, smoothness, compactness, and concave points have a significant impact on the response variable. Perimeter has a relatively large impact, with an estimate of 12.5396, whereas texture has a very small impact, with an estimate of − 0.0132. The variables diagnosis, perimeter, and smoothness have a positive impact on the response variable, suggesting that an increase in these variables is associated with an increased risk of breast cancer. In contrast, area, compactness, and concave points affect the response variable negatively and are associated with a decreased risk of breast cancer. The variables texture, concavity, symmetry, and fractal dimension are not significant for breast cancer risk.

Table 12 AIC and BIC for different models.

According to the model selection criteria in Table 12, the best-fitting model is the one with the lowest criterion value; hence the ridge regression model fits the data best according to AIC, and the beta regression model is the best model according to BIC.

Figure 1

The fitted values of estimated models.

Figure 1 illustrates the fitted values for the GAM, beta, GAM beta, ridge, and beta ridge models. These plots show that an increase in one variable corresponds to an increase in the other, providing evidence that these models effectively capture the relationships between the predictor variables and the radius mean.

Monte Carlo simulation study

In this section, we conduct a Monte Carlo simulation experiment to evaluate the performance of our proposed regression models across various conditions. The models under examination include the beta regression model, GAM regression model, GAM Beta regression model, Ridge regression model, and Beta Ridge regression model.

To generate synthetic data for our simulation, we used the multivariate normal and beta distributions through the mvrnorm and rbeta functions, respectively. Specifically, we generated four predictor variables following a multivariate normal distribution with a mean vector of zeros and a covariance matrix constructed from a correlation matrix and a diagonal scaling matrix (D). To ensure the stability and reliability of our results, we repeated the simulation 1000 times. We set the true mean parameter for the beta distribution to \(\mu\) = 3, with a dispersion parameter of \(\phi\) = 15. Given our focus on the effect of multicollinearity under different conditions, we varied the degree of correlation across \(\rho = (0.70, 0.80, 0.90)\) and the ridge constant k across (0, 0.01, 0.10). These settings allow us to assess the impact of multicollinearity on model performance across a range of scenarios, providing valuable insights into the robustness and applicability of the regression models.
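A hedged sketch of one replicate of the data-generating step, using MASS::mvrnorm and rbeta as described above, might look as follows; the specific coefficient values and the use of the inverse logit to keep the beta mean in (0, 1) are illustrative assumptions.

```r
# Minimal sketch of one replicate of the simulation's data-generating step.
# Assumes the 'MASS' package for mvrnorm(); coefficients are illustrative.
library(MASS)

set.seed(2024)
n     <- 100
rho   <- 0.90                                  # degree of correlation
p     <- 4
R     <- matrix(rho, p, p); diag(R) <- 1       # correlation matrix
D     <- diag(rep(1, p))                       # diagonal scaling matrix
Sigma <- D %*% R %*% D                         # covariance matrix

X    <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)   # correlated predictors
beta <- c(0.5, 0.5, 0.5, 0.5)                  # illustrative coefficients
eta  <- X %*% beta
mu   <- 1 / (1 + exp(-eta))                    # inverse logit keeps mu in (0, 1)
phi  <- 15                                     # dispersion parameter
y    <- rbeta(n, mu * phi, (1 - mu) * phi)     # beta-distributed response
```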

The simulated Akaike information criterion (AIC) and Bayesian information criterion (BIC) introduced by43 are criteria for judging the performance of models as follows:

$$\begin{aligned} \text {AIC}=-2 \ln (L_{fit})+2k,\qquad \text {BIC}=-2 \ln (L_{fit})+k \ln (n), \end{aligned}$$

where \(\ln (L_{fit})\) is the log-likelihood of the fitted model, k is the number of estimated parameters, and n is the number of observations44. All computations are performed using the R programming language.
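Following these definitions, AIC and BIC can be extracted for each fitted model with the generic AIC() and BIC() functions in R. A minimal sketch, assuming the betareg and mgcv packages and reusing the objects X, y, and n from the data-generation sketch above, is:

```r
# Minimal sketch: fit competing models to the simulated data and compare
# them by AIC and BIC. Assumes 'betareg' and 'mgcv'; X, y, n come from the
# data-generation sketch above, and the model formulas are illustrative.
library(betareg)
library(mgcv)

dat <- data.frame(y = as.vector(y), X)
colnames(dat) <- c("y", paste0("x", 1:4))

fit_beta     <- betareg(y ~ x1 + x2 + x3 + x4, data = dat)
fit_gam      <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4),
                    data = dat, method = "REML")
fit_gam_beta <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4),
                    family = betar(link = "logit"), data = dat, method = "REML")

sapply(list(Beta = fit_beta, GAM = fit_gam, GAM_Beta = fit_gam_beta),
       function(m) c(AIC = AIC(m), BIC = BIC(m)))
```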

Results and discussion

The results of the Monte Carlo simulations for different sample sizes n are presented in Tables 13, 14, 15, 16, 17, 18, 19, 20 and 21, respectively. From these tables, the factors affecting the performance of the estimators are the degree of correlation \(\rho\), the sample size n, and the ridge constant k. Generally, as the sample size increases, both AIC and BIC values are expected to decrease, reflecting the improved fit of the model due to the inclusion of more data points. This decrease is not indicative of a negative relationship between AIC and BIC but rather reflects their individual responses to increased sample size: AIC penalizes model complexity to a lesser extent than BIC, which is why they may decrease at different rates as the sample size grows. This trend indicates an improvement in the efficiency of all models with larger sample sizes. For all sample sizes, degrees of correlation, and ridge constants, the Ridge model has the lowest AIC values, indicating a better fit compared with the other models, followed by the Beta and GAM-Beta models, which have the next-lowest AIC values, suggesting a better fit for larger datasets. Introducing a ridge constant (0.01, 0.10) marginally affects the AIC and BIC values for the Ridge and Beta Ridge models, indicating the sensitivity of these models to the regularization strength. On the other hand, for all sample sizes, the Beta and GAM-Beta models have the lowest BIC values, suggesting they provide the best fit for larger sample sizes. The Beta Ridge model has the highest BIC and AIC values across all sample sizes, indicating a relatively poor fit compared with the other models.

Table 13 Simulation study results for \(\rho =0.7, k=0\).
Table 14 Simulation study results for \(\rho =0.7, k=0.01\).
Table 15 Simulation study results for \(\rho =0.7, k=0.1\).
Table 16 Simulation study results for \(\rho =0.8, k=0\).
Table 17 Simulation study results for \(\rho =0.8, k=0.01\).
Table 18 Simulation study results for \(\rho =0.8, k=0.1\).
Table 19 Simulation study results for \(\rho =0.9, k=0\).
Table 20 Simulation study results for \(\rho =0.9, k=0.01\).
Table 21 Simulation study results for \(\rho =0.9, k=0.1\).
Figure 2

Average values of AIC and BIC by the degree of correlation (\(\rho\)), the sample size (n), and the ridge constant (k) for all models.

From Fig. 2, we found that the average AIC values decreased as the sample size increased; moreover, the average values of both AIC and BIC show a decreasing pattern with increasing sample size, which is particularly evident for the beta regression model, as illustrated in Figs. 1 and 2. In Fig. 3, as the degree of correlation \((\rho )\) increases from 0.70 to 0.90 and as the ridge constant (k) changes, for all sample sizes, the performance of the Ridge and Beta Ridge models shows significant fluctuation in both AIC and BIC values, underscoring the sensitivity of these models to changes in correlation and regularization strength.

Figure 3

Average values of AIC and BIC by the degree of correlation (\(\rho\)) and the ridge constant (k) for the BRR and RR models.

In Fig. 3, across all values of \(\rho\) and k, the optimal model fit was consistently observed at a sample size of 200. Specifically, according to AIC the Ridge model demonstrated the best fit, while according to BIC the Beta model emerged as the top-performing model.

Figure 4

Average values of AIC and BIC by the degree of correlation (\(\rho\)) and the ridge constant (k) for all models.

In Fig. 4, as the sample size increases, as depicted in (h.1) for varying degrees of correlation (\(\rho\)) and in (h.2) for the ridge constants, there is a notable decrease in the average AIC values. Ridge regression consistently emerges as the best model in terms of AIC, whereas, when assessing the average BIC values, the optimal model is identified as beta ridge regression.

Conclusions

In this paper, we meticulously tailored a suite of models, including Generalized Additive Models (GAM), Beta regression, GAM Beta regression, Ridge regression, and Beta Ridge regression, to the intricacies of breast cancer data. Our analysis underscored a preference for the Akaike Information Criterion (AIC) in GAMs, attributed to its accommodation of the models’ complexity and flexibility, essential for capturing the multifaceted nature of the data10.

A thorough simulation study was conducted to empirically validate our models across varying sample sizes and correlation coefficients, enhancing the robustness of our findings. In analyzing data from 569 breast cancer patients, we discerned key independent variables that significantly influence breast cancer risk. The comparative analysis revealed that the Beta regression model outperformed others based on the Bayesian Information Criterion (BIC), while the Ridge regression model showed superiority according to the Akaike Information Criterion (AIC). These results mirror those obtained from our simulation study, indicating that the selection between Ridge and Beta regression models may depend on the preferred information criterion, especially in smaller sample sizes. However, as sample sizes increase, both models consistently demonstrate suitability across both AIC and BIC metrics. It is crucial to acknowledge that these conclusions are drawn within the confines of our study’s dataset and simulation parameters, necessitating caution when extrapolating to other contexts.

Our investigation rigorously assessed a suite of low-dimensional regression models, including Generalized Additive Models (GAM), Beta regression, GAM Beta regression, Ridge regression, and Beta Ridge regression. While acknowledging the extensive research on high-dimensional regression models27,28, our study is distinctively focused on low-dimensional contexts. Applied to authentic breast cancer data, the performance of these models was meticulously evaluated against a simulation study, ensuring a robust examination within the dataset’s dimensional constraints.