Introduction

The application of statistical data processing has grown during the last decade, moving from traditional methods to advanced ones, including machine learning and large-scale data analysis. The objective of statistical inference is to draw conclusions about a study population based on a sample of observations. Recently, subject-specific models have been developed by introducing random effects into various components of models [1]. Different study problems call for specific sampling techniques and a statistical model that describes the situation being analyzed.

Albatross Analytics is an open-source statistical and computational data analysis program built on the R environment, which derives from the S programming language. Albatross Analytics is currently developed as a project by the worldwide HGLM group. In particular, it provides a new, unified, state-of-the-art statistical package covering basic to advanced analyses, including various random-effect models (HGLMs, DHGLMs, MDHGLMs, and frailty models) whose implementations are generally difficult.

Meanwhile, the basis of Albatross Analytics in R is clear: R was first developed by Robert Gentleman and Ross Ihaka of the University of Auckland's Statistics Department in 1995 [2, 3]. Most of the functionality and capabilities of Albatross Analytics are obtained through add-on packages/libraries.

A library is a collection of commands or functions that perform specific analyses. For instance, the DHGLM package [4] implements double hierarchical generalized linear models, in which the mean, the dispersion parameters for the variance of the random effects, and the residual variance (overdispersion) can themselves be modeled as random-effect models; the MDHGLM package by Lee et al. [5,6,7] allows various models for multivariate response variables, where each response is assumed to follow a double hierarchical generalized linear model. See also further HGLM applications to machine learning [4], schizophrenic behavior data [8], variable selection methods [9], non-Gaussian factors [10], factor analysis for ordinal data [11], survival analysis [12], longitudinal outcomes and time-to-event data [13], and recent advanced topics [14,15,16,17].

The FRAILTYHL package fits semi-parametric frailty and competing-risk models using the h-likelihood. It allows log-normal or gamma distributions for the frailty and fits shared or multilevel frailty models for correlated survival data. Functions are provided to format and summarize the FRAILTYHL results [18]; the estimates of the fixed effects and frailty parameters and their standard errors are calculated. We illustrate the use of the package with two well-known data sets and compare the results with various alternative R procedures; see also applications to semi-competing risks data [19] and clustered survival data [20, 21]. This paper explains what Albatross Analytics is and how to use it in statistical and data science applications. Its advantage is that users can analyze and interpret data easily. Figure 1 shows the features of Albatross Analytics, including fundamental analysis, random-effect models, regression, survival analysis, and multiple-response analysis. This paper aims to demonstrate the application of Albatross Analytics to statistical analysis in broad areas. In short, we provide illustrative, hands-on examples of various applications, including HGLMs, DHGLMs, MDHGLMs, survival analysis, frailty models, support vector machines, and structural equation models.

Fig. 1 Features in Albatross Analytics

Illustrative examples

Data management

In today's world, data is the driving factor behind most organizations. As institutions collect ever more data, the need to manage its quality becomes more notable by the day. Data quality management is the set of measures applied by a technical team or a database management system to enable sound new knowledge [22,23,24]. These measures are carried out along the data management pathway, from data capture to execution, dissemination, and interpretation [24,25,26]. In line with this, data management is the process of processing, managing, and maintaining data quality [27, 28]. Effective data management can increase the efficiency of research work [26, 29]. Figure 2a describes the main features available in Albatross Analytics. In the import-data section, users can upload the data to be processed; the supported file formats are Excel and txt. For instance, Fig. 2b shows how to use the new-variable feature, merge datasets, and add new variables.

Fig. 2 Main features (A) and data management (B) in Albatross Analytics

Each expression or variable has a data type, such as numeric, integer, complex, logical, or character. In Albatross Analytics, data types are expressed as classes. A class is a combination of a data type and the operations that can be performed on that type. Albatross Analytics treats data as objects having attributes or properties, and these properties are defined by the data type.
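As a plain-R illustration of these data-management and data-type ideas (the data frames and file names below are invented purely for this example), data can be imported, extended with a new variable, merged, and inspected as follows:

```r
# In practice, data are imported from Excel or txt files, e.g.
#   clinical <- read.csv("clinical.txt", sep = "\t")
#   lab      <- readxl::read_excel("lab.xlsx")
# Two small data frames stand in for imported files here.
clinical <- data.frame(id = 1:4, age = c(34, 51, 42, 60),
                       sex = c("F", "M", "F", "M"))
lab <- data.frame(id = c(1, 2, 4), crp = c(3.1, 7.8, 2.4))

clinical$log_age <- log(clinical$age)        # create a new variable
combined <- merge(clinical, lab, by = "id")  # merge the two datasets

sapply(combined, class)  # data type (class) of each variable
str(combined)            # structure: class of every column plus a preview
```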

Basic analysis and GLMs

Descriptive statistics are used to identify the specific characteristics of the data. Figure 3 shows the basic analysis and regression features. Together with simple frequency distributions, descriptive statistics form the basis of almost all quantitative analyses; they present practical explanations understandably and allow enormous amounts of data to be interpreted in a structured way.
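A minimal base-R sketch of such a descriptive summary, using a built-in example dataset rather than the Albatross Analytics menus:

```r
# Descriptive summary of a numeric and a categorical variable
data(iris)                                # built-in example data

summary(iris$Sepal.Length)                # min, quartiles, mean, max
sd(iris$Sepal.Length)                     # standard deviation
table(iris$Species)                       # frequency distribution
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)  # group means
```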

Fig. 3 Basic analysis and regression

The t-test can be used to compare the means of two groups when the variable is measured on an interval scale. Sometimes we come across a study that aims to compare the mean of a sample with the mean of the entire population. Research designs like this are rare, but they can still provide valuable conclusions. Two kinds of tests can be used: the z-test and the t-test. The deciding condition is the population standard deviation: if it is known, the z-test is used, but this is very rarely, if ever, the case in practice. Therefore, the most frequently used test is the t-test, because it does not require knowing the standard deviation of the population under study.

Furthermore, the use of the t-test on two samples is divided into two types according to the characteristics of the samples. The first is the t-test for two independent samples, where the two samples come from two different groups that receive different treatments; the second is the paired-sample t-test, for two related measurements on the same group.
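The corresponding base-R calls are sketched below on simulated data; the numbers are invented purely for illustration.

```r
set.seed(1)

# One-sample t-test: compare a sample mean with a hypothesized population mean
x <- rnorm(30, mean = 5.2, sd = 1)
t.test(x, mu = 5)                      # H0: population mean equals 5

# Two-sample t-test for independent groups (Welch version, unequal variances)
g1 <- rnorm(30, mean = 5.0, sd = 1)
g2 <- rnorm(30, mean = 5.6, sd = 1)
t.test(g1, g2, var.equal = FALSE)

# Paired t-test for repeated measurements on the same subjects
before <- rnorm(20, mean = 100, sd = 10)
after  <- before + rnorm(20, mean = -3, sd = 5)
t.test(before, after, paired = TRUE)
```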

The analysis of variance (ANOVA) is fundamental in research; its purpose is to test the equality of several population means. One of the assumptions that must be met is that the population variances are equal, so this hypothesis needs to be tested. One-way ANOVA may be used if only one factor is involved. Two types of checks can be used in ANOVA: formal tests and visual checks, as sketched below.
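A base-R sketch of a one-way ANOVA with both a formal variance-homogeneity test and a visual check, on simulated data:

```r
set.seed(2)
dat <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  y     = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 11))
)

bartlett.test(y ~ group, data = dat)  # formal test of equal population variances

fit <- aov(y ~ group, data = dat)
summary(fit)                          # F-test of equality of the group means

plot(fit, which = 1)                  # residuals vs fitted values: no visible
                                      # pattern suggests homogeneous variance
```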

The visual check uses a model-checking plot: if the plot does not form a specific pattern, the homogeneity of variance is considered satisfied. Descriptive analysis tells us the characteristics of each variable. In addition, we may examine the relationship between variables, for either normal or non-normal data [30]. With a correlation test, we want to know whether the trends of two variables are similar: when the value of one variable increases, the value of the other variable tends to increase or decrease along with it [31].

One main factor determines which correlation test to use, namely the distribution of the data. If the data are normally distributed, we can use a parametric correlation test such as Pearson's correlation coefficient. If the distribution is not normal, we can use Kendall's rank correlation or Spearman's rank correlation, which are non-parametric correlation tests.
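The choice can be sketched in base R as follows (simulated data; in practice the normality check guides which method to report):

```r
set.seed(3)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)

shapiro.test(x)                      # normality check for each variable
shapiro.test(y)

cor.test(x, y, method = "pearson")   # parametric, for normal data
cor.test(x, y, method = "spearman")  # non-parametric alternatives
cor.test(x, y, method = "kendall")
```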

Regression analysis examines the causal relationship between variables, with one variable as the independent variable and another as the dependent variable. Numerous regression approaches, including Poisson regression, came into use during the 1970s. Linear regression and logistic regression are fitted by specific estimation algorithms that maximize the likelihood. Figure 4 shows that Albatross Analytics provides the linear model, the GLM logit model, the GLM probit model, the log-linear model, and joint GLMs.

Fig. 4 HGLM algorithm (A) and DHGLM algorithm (B)

GLMs describe a family of models where the response comes from the exponential family of distributions. Inference for these models, such as t-tests or F-tests, is based on maximum likelihood (ML). In the GLM family, an iterative weighted least squares (IWLS) algorithm can compute the ML estimates and their standard errors. Hence, the computational machinery developed for least-squares estimation of linear models can be used to fit GLMs, while the statistical inference is based on ML.
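Outside the Albatross Analytics menus, the same GLMs can be fitted in base R, where glm() runs the IWLS algorithm; the simulated data below are only for illustration.

```r
set.seed(4)
n <- 200
x <- rnorm(n)
y_count  <- rpois(n, lambda = exp(0.5 + 0.8 * x))              # count response
y_binary <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.2 * x)) # binary response

glm(y_count  ~ x, family = poisson(link = "log"))     # log-linear / Poisson GLM
glm(y_binary ~ x, family = binomial(link = "logit"))  # GLM logit model
glm(y_binary ~ x, family = binomial(link = "probit")) # GLM probit model
```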

Hands-on applications of Albatross Analytics

Hierarchical generalized linear models (HGLMs)

Albatross Analytics’ distinct advantage is its unified analysis of random effect models. Various random effect models can be represented as HGLMs and estimated by h-likelihood procedures [32]. HGLMs are defined as follows:

  1.

    Conditional on the random effects \(u\), the response \(y\) follows a GLM family, satisfying

    $$E\left( {y|u} \right) = \mu \,{\text{and}}\,var\left( {y|u} \right) = \phi V\left( \mu \right),$$

    for which the kernel of the log-likelihood is given by

    $$\sum \left\{ y\theta - b\left( \theta \right) \right\}/\phi ,$$

    where \(\theta =\theta (\mu )\) is the canonical parameter. The linear predictor takes the form in Eq. (1):

    $$\eta =g\left(\mu \right)=X\beta +Zv,$$
    (1)

    where \(v=v(u)\), for some monotone function \(v(\cdot )\) and the link function \(g\left(\mu \right)\).

  2.

    The random component \(u\) follows a distribution conjugate to a GLM family of distributions, with parameter \(\lambda\).

For inference in HGLMs, Lee and Nelder [32] proposed using the h-likelihood. The h-(log-)likelihood is defined in Eq. 2:

$$h=\mathrm{log}\,{f}_{\beta ,\phi }(y|v)+\mathrm{log}\,{f}_{\lambda }\left(v\right).$$
(2)

The GLM attributes of an HGLM are summarized in Fig. 4.

In Bissell’s fabric study, the response variable \(y\) is the number of faults in a bolt of fabric of length \(l\). Table 1 presents the results of the fabric study. Figure 6 illustrates the negative binomial model fitted via a Poisson-gamma HGLM with saturated random effects for the complete response. In addition, the model-checking plots are presented in Fig. 5.
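A minimal sketch of how this Poisson-gamma HGLM could be specified with the dhglm package [4], which implements these models in R, is given below. The data frame fabric, with columns y (fault count), x (log bolt length), and id (bolt index), is assumed for illustration, and the argument names should be checked against the package documentation.

```r
library(dhglm)

# Assumed data: y = number of faults, x = log(bolt length), id = bolt index
# Mean model: Poisson response, log link, saturated gamma random effect per bolt
model_mu  <- DHGLMMODELING(Model = "mean", Link = "log",
                           LinPred = y ~ x + (1 | id), RandDist = "gamma")
model_phi <- DHGLMMODELING(Model = "dispersion")   # constant dispersion

fit_fabric <- dhglmfit(RespDist = "poisson", DataMain = fabric,
                       MeanModel = model_mu, DispersionModel = model_phi)
```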

Table 1 Results for the fabric study
Fig. 5 Model checking plots of DHGLM for the mean (A) and the dispersion (B)

Double hierarchical generalized linear models (DHGLMs)

HGLMs can be extended by allowing additional random effects in their various components. Lee and Nelder [32] introduced the class of double HGLMs (DHGLMs), in which random effects can be specified in both the mean and the residual variance. Heteroscedasticity between clusters can be modeled by introducing random effects into the dispersion model, just as heterogeneity between clusters is modeled by random effects in the mean model. With DHGLMs, robust inference against outliers is possible by allowing heavy-tailed distributions. Many models can be unified and extended further by the use of DHGLMs, including models from finance such as autoregressive conditional heteroscedasticity (ARCH) models, generalized ARCH (GARCH) models, and stochastic volatility (SV) models. Models can be extended further by introducing random effects in the variance terms. Suppose that, conditional on the pair of random effects \((a, u)\), the response \(y\) satisfies

$$E\left(y|a, u\right)=\mu \, {\text{and}} \, var\left(y|a, u\right)=\phi V\left(\mu \right).$$

The critical extension is to introduce random effects into the component \(\phi\):

  1.

    Given \(u\), the linear predictor for \(\mu\) takes the HGLM form in Eq. 1, where \(g(\cdot )\) is the link function, \(X\) and \(Z\) are model matrices, \(v={g}_{M}\left(u\right)\) for some monotone function \({g}_{M}(\cdot )\) are the random effects, and \(\beta\) are the fixed effects. Moreover, the dispersion parameters \(\lambda\) for \(u\) have the GLM form in Eq. 3

    $${\xi }_{M}={h}_{M}\left(\lambda \right)={G}_{M}{\gamma }_{M},$$
    (3)

    where \({h}_{M}(\cdot )\) is the link function, \({G}_{M}\) is the model matrix, and \({\gamma }_{M}\) are the fixed effects.

  2.

    Given \(a\), the linear predictor for \(\phi\) takes the HGLM form as described in Eq. 4

    $$\xi =h\left(\phi \right)=G\gamma +Fb,$$
    (4)

where \(h(\cdot )\) is the link function, \(G\) and \(F\) are model matrices, \(b={g}_{D}\left(a\right)\) for some monotone function \({g}_{D}(\cdot )\) are the random effects, and \(\gamma\) are the fixed effects. Moreover, the dispersion parameters \(\alpha\) for \(a\) have the GLM form shown in Eq. 5,

$${\xi }_{D}={h}_{D}\left(\alpha \right)={G}_{D}{\gamma }_{D}$$
(5)

where \({h}_{D}(\cdot )\) is the link function, \({G}_{D}\) is the model matrix, and \({\gamma }_{D}\) are the fixed effects. Here, the labels \(M\) and \(D\) stand for mean and dispersion, respectively. The GLM attributes of a DHGLM are summarized in Fig. 4.

We now illustrate how to fit a DHGLM. Hudak [33] presented crack growth data, listed in Lu [34]. Each of 21 metallic specimens was subjected to 120,000 loading cycles, with the crack lengths recorded every 10,000 cycles. Let \({l}_{ij}\) be the crack length of the \(i\)-th specimen at the \(j\)-th observation and \({y}_{ij}={l}_{ij}-{l}_{ij-1}\) the corresponding increment of crack length (response variable) measured in inches, which is always positive. A detailed description of the model can be found in Table 2, and Fig. 5a and b present the model-checking plots for the mean and the dispersion, respectively [5]. Compared with an HGLM, a DHGLM gives model-checking plots for both the mean and the dispersion.
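A sketch of a corresponding dhglm call is given below. It follows our recollection of the crack-growth example in the dhglm package documentation; the data frame crack_growth and its column names (y, crack0, cycle, specimen) are assumptions to be checked against that documentation.

```r
library(dhglm)

# Assumed columns: y (crack-length increment), crack0 (initial crack length),
# cycle (loading cycle), specimen (metallic specimen index)
model_mu  <- DHGLMMODELING(Model = "mean", Link = "log",
                           LinPred = y ~ crack0 + (1 | specimen),
                           RandDist = "inverse-gamma")
model_phi <- DHGLMMODELING(Model = "dispersion", Link = "log",
                           LinPred = phi ~ cycle + (1 | specimen),
                           RandDist = "gaussian")

fit_crack <- dhglmfit(RespDist = "gamma", DataMain = crack_growth,
                      MeanModel = model_mu, DispersionModel = model_phi)
```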

Table 2 Results of DHGLM for crack growth data

Multivariate double hierarchical generalized linear models (MDHGLMs)

Using the h-likelihood, multivariate models are obtained directly by assuming correlations among the random effects in DHGLMs for different responses. The use of the h-likelihood means that the interlinked GLM fitting methods for HGLMs can easily be extended to fit multivariate DHGLMs (MDHGLMs). Moreover, the resulting algorithm is numerically efficient and gives statistically valid inferences. Here we present an example for MDHGLM; for more details, see [35]. Price et al. [36] presented data from a study on the developmental toxicity of ethylene glycol (EG) in mice. Table 3 summarizes the data on malformation (binary response) and fetal weight (continuous response) and shows clear dose-related trends in both responses.

Table 3 Descriptive statistics for the ethylene glycol (EG) data

To fit the EG data, the following bivariate HGLM is considered:

  1.

    \({y}_{1ij}|{w}_{i}\sim N\left({\mu }_{ij}, \phi \right), {\mu }_{ij}={x}_{1ij}{\beta }_{1}+{w}_{i}\),

  2.

    \({y}_{2ij}|{u}_{i}\sim Ber\left({p}_{ij}\right), logit({p}_{ij})={x}_{2ij}{\beta }_{2}+{u}_{i}\), and

  3.

    \({\left({w}_{i}, {u}_{i}\right)}^{T} \sim BVN\left(\mathbf{0},{\varvec{\Sigma}}\right), \; cor({w}_{i},{u}_{i})=\rho\).

Figure 6 shows the path diagram of the model for the EG data. Results for the malformation model are given in Table 4, with cAIC used to evaluate the models. Correspondingly, the results for the weight model are given in Table 5 and the estimated correlation in Table 6.
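We do not reproduce the mdhglm interface here; as a rough point of reference, the sketch below fits the two margins separately with lme4, ignoring the correlation between \(w_i\) and \(u_i\) that the MDHGLM estimates jointly. The data frame eg and its columns (weight, malformation, dose, litter) are hypothetical stand-ins for the EG data, simulated only so the code runs.

```r
library(lme4)

# Simulated stand-in for the EG data: 30 litters with 10 fetuses each
set.seed(7)
litter_eff <- rnorm(30, sd = 0.3)
eg <- data.frame(litter = factor(rep(1:30, each = 10)),
                 dose   = rep(c(0, 0.75, 1.5, 3), length.out = 300))
eg$weight       <- 1 - 0.10 * eg$dose + rep(litter_eff, each = 10) +
                   rnorm(300, sd = 0.1)
eg$malformation <- rbinom(300, 1, plogis(-2 + 0.8 * eg$dose +
                                         rep(litter_eff, each = 10)))

# Continuous margin: linear mixed model for fetal weight with a litter effect
fit_weight <- lmer(weight ~ dose + (1 | litter), data = eg)

# Binary margin: logistic GLMM for malformation with a litter effect
fit_malf <- glmer(malformation ~ dose + (1 | litter),
                  family = binomial(link = "logit"), data = eg)

# The MDHGLM additionally links the two litter-level random effects
# (w_i, u_i) through a bivariate normal distribution with correlation rho.
```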

Fig. 6 Path diagram of the MDHGLM for weight and malformation

Table 4 Results for malformation model
Table 5 Results for weight model
Table 6 Correlation for the EG data

Survival analysis

Albatross Analytics also provides features for survival analysis, shown in Fig. 7, for incomplete data caused by censoring in survival-time (time-to-event) data, including the Kaplan–Meier estimator, the Cox model, frailty models [7], and competing-risk models [19, 37]. For instance, the Kaplan–Meier curve describes the relationship between the estimated survival function at time t and the survival time: the vertical axis represents the estimated survival function, and the horizontal axis represents the survival time.

Fig. 7 Survival analysis features

Cox proportional hazards (PH) regression is used to describe the relationship between the hazard function of the survival time and independent variables that are considered to affect survival time. Cox regression is a common regression in survival analysis because it does not assume a particular statistical distribution for the survival time (the baseline hazard is left unspecified).

Cox’s PH model is widely used to analyze survival data. The method is attractive because of its semi-parametric nature, whereby the baseline hazard is non-parametric and the treatment effects are estimated parametrically. A partial likelihood has usually been used to accommodate this semi-parametric form. However, the model can also be fitted with Poisson GLM methods, although these are slow because of the many nuisance parameters induced by the non-parametric baseline hazard. Using h-likelihood theory, we can show that Poisson HGLM methodologies can be used for this kind of model as well; that said, this approach is again slow, since the number of nuisance parameters in the non-parametric baseline hazard grows with the number of events.
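For reference outside Albatross Analytics, the Kaplan–Meier estimator and the Cox PH model can be sketched with the survival package in R; the kidney data used in Example 1 below ship with that package, so we use them here for illustration.

```r
library(survival)

head(kidney)   # kidney infection data: time, status, age, sex, disease, frail

# Kaplan-Meier estimate of the survival function by sex (1 = male, 2 = female)
km <- survfit(Surv(time, status) ~ sex, data = kidney)
plot(km, xlab = "Time to infection", ylab = "Estimated survival function")

# Cox proportional hazards model: partial likelihood, baseline hazard unspecified
coxph(Surv(time, status) ~ age + sex, data = kidney)
```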

Example 1: incomplete data caused by censoring in survival data

In Fig. 7, we consider the analysis of incomplete data caused by censoring in survival data. Cox’s PH model is widely used to analyze such data. Frailty models with a non-parametric baseline hazard extend the PH model by allowing random effects in the hazards and have been widely adopted for the analysis of correlated or clustered survival data. Using h-likelihood theory, we can show that Poisson HGLM algorithms can be used to fit frailty models [12, 38,39,40,41,42,43].

The data consist of right-censored observations from \(q\) subjects, with \({n}_{i}\) observations each (\(i=1,\dots ,q\)), \(n={\Sigma }_{i}{n}_{i}\) the total sample size, \({T}_{ij}\) the survival time for the \(j\)-th observation of the \(i\)-th subject (\(j=1,\dots ,{n}_{i}\)), \({C}_{ij}\) the corresponding censoring time, \({y}_{ij}=\mathrm{min}\left\{{T}_{ij},{C}_{ij}\right\}\), \({\delta }_{ij}=I({T}_{ij}\le {C}_{ij})\), and \({u}_{i}\) the unobserved frailty for the \(i\)-th subject. The conditional hazard function of \({T}_{ij}\) given \({u}_{i}\) has the form in Eq. 6

$${\lambda }_{ij} \left(t|{u}_{i}\right)={\lambda }_{0} \left(t\right) \mathrm{exp} \left({x}_{ij}^{T}\beta \right) {u}_{i}$$
(6)

Here \({\lambda }_{0}\left(\cdot \right)\) is an unspecified baseline hazard function and \(\beta ={\left({\beta }_{1},\dots ,{\beta }_{p}\right)}^{T}\) is a vector of regression parameters for the fixed covariates \({x}_{ij}\). The term \({x}_{ij}^{T}\beta\) does not include an intercept because of identifiability. We assume that the frailties \({u}_{i}\) are i.i.d. random variables with a frailty parameter \(\alpha\); a gamma or log-normal distribution is often assumed for \({u}_{i}\), that is, a gamma frailty with \(E\left({u}_{i}\right)=1\) and \(\mathrm{var}\left({u}_{i}\right)=\alpha\), or a log-normal frailty with \({v}_{i}=\mathrm{log}\,{u}_{i}\sim N(0,\alpha )\). Multi-component frailty models can be expressed as in Eq. 7, with the linear predictor

$$\eta =X\beta +{Z}^{(1)}{v}^{(1)}+{Z}^{(2)}{v}^{(2)}+\dots +{Z}^{(k)}{v}^{(k)}$$
(7)

where \(X\) is the \(n\times p\) model matrix for \(\beta\), and \({Z}^{(r)}\) are \(n\times {q}_{r}\) model matrices corresponding to the frailties \({v}^{(r)}\). The components \({v}^{(r)}\) and \({v}^{(l)}\) are independent for \(r\ne l\). Also, \({Z}^{(r)}\) contains indicator values such that \({Z}_{st}^{(r)}=1\) if observation \(s\) is a member of subject \(t\) in the \(r\)-th frailty component, and 0 otherwise.

As illustrations, we present two examples below. Example 1 considers the dataset on the recurrence of infections in kidney patients using a portable dialysis machine. The data consist of the first and second recurrences of kidney infection in 38 patients. The catheter is removed if infection occurs and can also be removed for other reasons, which we regard as censoring (about 24%).

In Example 1, the variables consist of the patient identifier for the 38 patients (id), the time until infection since catheter insertion (time), a censoring indicator (1, infection; 0, censoring) (status), the age of the patient (age), the sex of the patient (1, male; 2, female) (sex), the disease type (GN, AN, PKD, other) (disease), and the estimated frailty (frail). The survival times (first and second infection times) for the same patient are likely to be correlated because of the shared frailty describing the common patient effect. We thus fit a log-normal frailty model with two covariates, sex and age, and with patient as the frailty. Figure 8 presents the Kaplan–Meier plot of the estimated survival probability by sex (sex1, male; sex2, female); it shows that the female group has overall higher survival (i.e., lower infection) probabilities than the male group. Table 7 summarizes the estimated results of the log-normal frailty model, and the estimated frailties are shown in Fig. 9. A sketch of the corresponding frailtyHL call is given below. For further discussion of survival analysis, see [18].
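A minimal sketch of the h-likelihood fit with the FRAILTYHL package [18] introduced above; the call mirrors our recollection of the package's own kidney example, so the argument names should be verified against its documentation.

```r
library(frailtyHL)
library(survival)   # provides the kidney data

# Log-normal frailty model with patient (id) as the frailty term, fitted by
# h-likelihood (mord, dord control the order of the likelihood adjustments)
kidney_ln <- frailtyHL(Surv(time, status) ~ sex + age + (1 | id),
                       data = kidney, RandDist = "Normal",
                       mord = 0, dord = 1)
```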

Fig. 8 Survival probability by sex

Table 7 Model description for log-normal frailty
Fig. 9 Estimated frailty in the kidney infection data

Example 2: placebo-controlled rIFN-g in the treatment of CGD

Example 2 consists of a placebo-controlled trial of rIFN-g in the treatment of CGD [44, 45]. One hundred twenty-eight patients from 13 centers were followed for about 1 year. The survival times are the recurrent infection times of each patient. Censoring occurred at the last observation for all patients except one, who experienced a serious infection on the date he left the study; about 63% of the data were censored. The recurrent infection times for a given patient are likely to be correlated, and each patient belongs to one of the 13 centers, so the correlation may be attributed to a patient effect and a center effect. The variables are the recurrent infection or censoring times of each patient (tstart–tstop), the patient identifier for the 128 patients (id), the 13 centers (center), rIFN-g or placebo (treat), a censoring indicator (1, infection observed; 0, censored) (status), the date of randomization (random), information about patients at study entry (sex, age, height, weight), the pattern of inheritance (inherit), use of steroids at study entry (1, yes; 0, no) (steroids), use of prophylactic antibiotics at study entry (1, yes; 0, no) (propylac), a categorization of the centers into four groups (hos.cat), and the observation number within subject (enum). We fit a multilevel log-normal frailty model with two frailties and a single covariate, treatment. Here, the two frailties are random center and patient terms, with the structure given in Eq. 8.

$$\begin{gathered} \eta = X\beta + Z^{1} v^{1} + Z^{2} v^{2} \hfill \\ v^{1} \sim N\left( {0, \alpha_{1} I_{q1} } \right) \hfill \\ v^{2} \sim N\left( {0, \alpha_{2} I_{q2} } \right) \hfill \\ \end{gathered}$$
(8)

Here \({v}^{1}\) is the center frailty and \({v}^{2}\) is the patient frailty. To test the need for a random component, i.e., \({\alpha }_{1}=0\) or \({\alpha }_{2}=0\), we use the deviance \(-\,2{p}_{\beta ,\mathrm{v}}\left({h}_{p}\right)\) and fit the following four models.

M1 Cox’s model without frailty \(({\alpha }_{1}=0,\,{\alpha }_{2}=0)\): \(-\,2{p}_{\beta ,\mathrm{v}}\left({h}_{p}\right)=707.48\)

M2 model without patient effect \(({\alpha }_{1}>0,\,{\alpha }_{2}=0)\): \(-\,2{p}_{\beta ,\mathrm{v}}\left({h}_{p}\right)=703.66\)

M3 model without center effect \(({\alpha }_{1}=0,\,{\alpha }_{2}>0)\): \(-\,2{p}_{\beta ,\mathrm{v}}\left({h}_{p}\right)=692.99\)

M4 multilevel model \(({\alpha }_{1}>0,\,{\alpha }_{2}>0)\): \(-\,2{p}_{\beta ,\mathrm{v}}\left({h}_{p}\right)=692.95\).

Table 8 presents the model descriptions. The deviance difference between M3 and M4 (692.99 − 692.95 = 0.04 < 2.71 = \({\chi }_{0.10}^{2}\left(1\right)\)) indicates the absence of the random center effect, whereas the deviance difference between M2 and M4 (10.71) shows the necessity of the random patient effect. In addition, the deviance difference between M1 and M3 (14.49) supports the random patient effect with or without random center effects. All three criteria (cAIC, mAIC, and rAIC) also select M3 among M1–M4. Figure 10 presents the estimated frailty effects for this study; a sketch of the corresponding frailtyHL call follows. The explanation of model evaluation with these three criteria can be found in the Appendix.
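A sketch of the multilevel fit (M4) with frailtyHL, together with the chi-square threshold quoted above, is given below; the cgd data ship with the survival package, and the call follows our recollection of the frailtyHL documentation, so it should be checked against that documentation.

```r
library(frailtyHL)
library(survival)   # provides the cgd data

# Multilevel log-normal frailty: random center and random patient effects (M4)
cgd_m4 <- frailtyHL(Surv(tstop - tstart, status) ~ treat + (1 | center) + (1 | id),
                    data = cgd, RandDist = "Normal", mord = 0, dord = 1)

# Critical value used in the deviance comparisons above
qchisq(0.90, df = 1)   # 2.7055, i.e. the 2.71 threshold quoted in the text
```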

Table 8 Model description for log-normal frailty
Fig. 10 Estimated frailty effects in the CGD recurrent infection data

Support vector machine using h-likelihood

The support vector machine (SVM) is a supervised learning method for classification and regression that uses non-linear boundaries in a feature space [4, 46,47,48,49]. We present an SVM based on the HGLM method [4], in which the match between the observed response and the model output is optimized. The model output is a feature or prognostic function, also referred to as a utility function; in medical research it is called the prognostic index or health function, defined in Eq. 9:

$$u\left(x\right)={w}^{T}\varphi (x)$$
(9)

Here \(u:{\mathbb{R}}^{d}\to {\mathbb{R}}\), \(w\) is a vector of \(d\) unknown parameters, and \(\varphi \left(x\right)\) is a transformation of the covariates \(x\). In a non-linear SVM, the transformation is handled via the kernel trick [50,51,52], which computes the scalar products through a kernel function. The SVM model is fitted subject to constraints that produce the correct margin; ranking errors are absorbed by the slack variables \({\xi }_{ij}\ge 0\). The formulation of the SVM model is described in Eq. 10:

$$\begin{gathered} \mathop {\min }\limits_{{w,\xi }} \frac{1}{2}w^{T} w + \frac{\gamma }{2}\sum\limits_{{i < j}} {v_{{ij}} } \xi _{{ij}} \hfill \\ {\text{constraint function}}\left\{ \begin{gathered} w^{T} \varphi \left( {x_{j} } \right) - w^{T} \varphi \left( {x_{i} } \right) \ge 1 - \xi _{{ij}} , \quad \forall i < j \hfill \\ \xi _{{ij}} \ge 0, \quad \forall i < j \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(10)

with a regularization parameter \(\gamma \ge 0\); \({v}_{ij}\) is an indicator of whether two subjects with observations \(i\) and \(j\) are comparable: it is 1 if \(i\) and \(j\) are comparable and 0 otherwise. In this paper, we use the dataset on the anatomy of an abdominal aortic aneurysm (AAA), aortic anatomy on endovascular aneurysm repair (EVAR); see [53]. The variables are as follows: Y = sex, X1 = age, X2 = aortic type (1, fusiform; 2, saccular), X3 = proximal neck length, X4 = proximal neck diameter, X5 = proximal neck angle, and X6 = maximum aneurysmal sac. For the simulation, the response variable is generated from a Bernoulli distribution with 500 observations, and in each scenario the data-generating process is repeated 100 times. The parameter values used are \(\gamma =0.7\) and cost = 8; verbose controls per-process runtime output. A generic SVM call using these cost and gamma values, for reference only, is sketched after the list below. The SVM parameter settings are as follows:

  • First simulation:

    Cluster method: “kmeans”, cost = 8, lambda = 1, centers = 2, verbose = 0.

  • Second simulation:

    Cluster method: “kernkmeans”, cost = 8, lambda = 1, centers = 2, verbose = 0.

  • Third simulation:

    Cluster.method = "kernkmeans", cost = 8, lambda = 1, centers = 3, verbose = 0.

  • Fourth simulation:

    Cluster.method = "kernkmeans", cost = 8, lambda = 1, centers = 4, verbose = 0.
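The sketch below is not the h-likelihood ensemble SVM used in these simulations; it only shows, for reference, how a standard radial-kernel SVM with the same cost and gamma values can be fitted with the e1071 package on simulated data with six predictors echoing X1–X6.

```r
library(e1071)

set.seed(6)
n <- 500
dat <- as.data.frame(matrix(rnorm(n * 6), ncol = 6,
                            dimnames = list(NULL, paste0("X", 1:6))))
dat$y <- factor(rbinom(n, 1, plogis(0.8 * dat$X1 - 0.5 * dat$X4)))  # Bernoulli response

fit_svm <- svm(y ~ ., data = dat, kernel = "radial", cost = 8, gamma = 0.7)
table(predicted = predict(fit_svm), observed = dat$y)   # confusion matrix
```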

There are two stages of model evaluation: the classification stage and the HGLM analysis stage. At the classification stage, the goodness of the model is evaluated by the AUC and by measures computed from the confusion matrix, with

$$Sensitivity=\frac{True\,Positive}{True\,Positive+False\,Negative}$$
$$Specificity=\frac{True\,Negative}{False\,Positive+True\,Negative}$$
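These measures can be computed directly from predicted and true labels, as in the generic R sketch below (the vectors truth and pred are simulated for illustration; the AUC additionally requires predicted scores, e.g., via the pROC package).

```r
confusion_metrics <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)   # true positives
  tn <- sum(pred == 0 & truth == 0)   # true negatives
  fp <- sum(pred == 1 & truth == 0)   # false positives
  fn <- sum(pred == 0 & truth == 1)   # false negatives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (fp + tn))
}

set.seed(5)
truth <- rbinom(100, 1, 0.5)
pred  <- ifelse(runif(100) < 0.8, truth, 1 - truth)   # noisy predictions
confusion_metrics(truth, pred)
```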

The simulation shows that HGLM performs better, with high sensitivity, because some of the data represent a binary case that SVM cannot handle; for more information on the construction of the h-likelihood SVM, see [4]. Table 9 shows that the use of ensemble SVM reduces the accuracy and the other measures. When mixture patterns exist in the predictors, ensemble SVM improves SVM performance in two scenarios. Ensemble SVM performed almost as well as logistic regression, except for sensitivity. There is a decrease in the performance of the ensemble SVM model under multicollinearity and linear combinations among the predictor variables. Meanwhile, HGLM still performs well, as shown in Fig. 11a and b.

Table 9 Model accuracy comparison
Fig. 11 AUC, accuracy, sensitivity, and specificity (A), and accuracy of all scenarios (B)

Using h-likelihood for structural equation models (HSEMs)

The structural equation model (SEM) is widely used in multidisciplinary fields [41]. Related work includes frequentist model averaging in structural equation modeling [42, 43], non-linear structural equation modeling for ordinal data [44], partial least squares [45, 46], and robust non-linear SEM with interactions between exogenous and endogenous latent variables [47]. With an example, we present an SEM method based on the h-likelihood, called “hsem” [52].

In an application, [48] uses a two-level dynamic SEM for longitudinal data in Mplus. In this paper, we explicitly discuss how to use the h-likelihood in SEM. The data set consists of 50 measurements on a regular time scale for each of 100 individuals. The response variable, the urge to smoke, is on a standardized scale, so that 0 corresponds to the average and the standard deviation is 1. Smokers can experience drastic mood changes, from happiness to sadness, which can indicate depressive characteristics; for those who are addicted, smoking can give a momentarily calm mind. The second model addresses whether the urge to smoke is predicted by the latent person-mean-centered depression and the latent person-mean-centered lag-1 urge to smoke. The model in Eq. 11 is given as follows:

$$\begin{aligned} urge_{ti} = \,\, & \beta_{0i} + \beta_{1i} Time_{ti} + e_{ti} , \\ \beta_{0i} = \, \,& \gamma_{00} + u_{0i} \\ \beta_{1i} = \, & \gamma_{10} + u_{1i} \\ {\varvec{u}}_{i} = \,\,& \left( {\begin{array}{*{20}c} {u_{0i} } \\ {u_{1i} } \\ \end{array} } \right)\sim MVN\left( {\left[ {\begin{array}{*{20}c} 0 \\ 0 \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} {\tau_{00} } & 0 \\ 0 & {\tau_{11} } \\ \end{array} } \right]} \right) \\ & e_{ti} \sim N\left( {0,\sigma^{2} } \right) \\ \end{aligned}$$
(11)

Figure 12 presents the path diagram obtained using hsem. The same standard progression path across all respondents is defined through the fixed effects, whereas the person-specific random effects capture each participant's deviation from the expected path. The path diagram represents the within-level and between-level models. As a further resource, we provide the R package hsem [54]; a rough lme4 sketch of the growth part of Eq. 11 is given below.
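We do not reproduce the hsem call here. As a rough point of reference, the random-intercept and random-slope growth part of Eq. 11 (with independent random effects, matching its diagonal covariance matrix) can be sketched with lme4; the long-format data frame smoke and its columns urge, time, and id are hypothetical names, simulated only so the code runs.

```r
library(lme4)

# Simulated stand-in for the smoking-urge data: 100 persons x 50 time points
set.seed(8)
smoke <- expand.grid(id = factor(1:100), time = 0:49)
b0 <- rnorm(100, 0, 0.5)      # person-specific intercept deviations u_0i
b1 <- rnorm(100, 0, 0.02)     # person-specific slope deviations u_1i
smoke$urge <- b0[as.integer(smoke$id)] +
              (0.01 + b1[as.integer(smoke$id)]) * smoke$time +
              rnorm(nrow(smoke), 0, 1)

# Random intercept and random slope for time, uncorrelated (diagonal covariance)
fit_growth <- lmer(urge ~ time + (1 | id) + (0 + time | id), data = smoke)
summary(fit_growth)
```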

Fig. 12 Path diagram

Short review of Albatross Analytics

This paper explains how Albatross Analytics can be used for multidisciplinary data processing. It offers model estimation, model-checking plots, and visualization features for interpreting results. Through data and R code, the examples further reveal the benefit of HGLM-class models for particular statistical cases. The h-likelihood approach is distinct from both the classical frequentist and the Bayesian framework, as it encompasses inference for both fixed and random unknowns. Its main benefit over classical frequentist approaches is that unobservable quantities, such as random effects, can be inferred, and hence predictions for future observations can be made. Once a statistical model has been selected for the research, the likelihood guides the inferential statistics.

In close connection with the development of the h-likelihood, a wide variety of likelihoods have been established. Most arise from the theory and computation of GLMs and GLMMs, e.g., quasi-likelihood and extended quasi-likelihood. Others, such as the joint likelihood, extended likelihood, and adjusted profile likelihood, show the linkage between conventional frequentist estimation and Bayesian inference. We regard the h-likelihood as the fundamental likelihood from which marginal and REML likelihoods, as well as predictive probabilities, are derived. Extended likelihood theory underlies the h-likelihood framework and shows how it relates to both classical and Bayesian probability.

Inference about random effects has many practical applications. A typical example is repeated observations of hospital admissions of patients, where the patients' survival is of interest. This might involve a survival study with patient-specific outcomes, and the variance of the estimates indicates the variability of the random effect.

In the first few examples, we demonstrate analyses using normal, log-normal, gamma, Poisson, and binomial HGLMs. Binary models are used for comparison across application areas, and the dhglm package is fast and yields consistent results. Descriptions using HGLMs, including structured dispersion, are given, along with models including correlated random effects and structural equation models.

The likelihood implies that probability models offer an effective way to interpret the data if the model is correct. It is also necessary to validate the model to verify the interpretation of the results; that said, it can be hard to ascertain all the model assumptions. In simulations using the h-likelihood, the normality assumption of SEM in binary GLMMs can give serious biases if the normal assumption on the random effects is incorrect.

Conclusion and future research

Likelihood inferences for specific models may be susceptible to outliers or contaminated data. If the data size is limited, we can review the data carefully to detect outliers, but for large-scale data it can be difficult to identify outliers or degraded records. A commonly cited drawback of the likelihood approach is that it is not robust to misspecification of the model distribution or to the presence of outliers or data degradation. It is advantageous to build models whose inferences are stable against such violations, which is feasible when the model encompasses a wide variety of distributions. We leave for future studies the combination of the h-likelihood with deep learning [39, 40, 55,56,57,58,59], and the use of this framework for spatial and remote sensing [60,61,62,63,64], hybrid forecasting [65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80], and more advanced disease-detection cases using image detection [81,82,83,84,85,86,87,88,89,90].