Background

According to the Biomarkers Consortium of the National Institutes of Health (NIH), biomarkers are parameters that are objectively measured and evaluated as indicators of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention. Biomarkers may be broadly classified according to their use, as indicated in Table 1 [1].

Table 1 Types of biomarkers

Different types of biomarkers require different characterizations

The process of biomarker development comprises five phases based on the Early Detection Research Network, with each phase building upon the results of the previous one [2]. These phases are arranged according to the strength of evidence, progressing from weaker to stronger. Statistical tests are conducted in each phase to determine significance; however, the tests discussed here are most critical in the initial phases. Figure 1 highlights an example of the use of different statistical tools in plasma biomarker research.

Fig. 1

Key statistical tools for plasma biomarkers as an example. This figure illustrates an example of the main statistical tools used in plasma biomarker research. The results obtained from qPCR are analyzed using statistical tools such as the ROC curve, a plot of the true positive rate against the false positive rate. The AUC is calculated with its corresponding p value and thus determines the marker's ability to discriminate between patients and controls. Sensitivity and specificity are calculated as well. A prediction model is also constructed to predict a variable from one or more other variables and to measure the influence of one or more variables on another. PPV, NPV, sensitivity, and specificity are also calculated from the regression model

These phases begin with discovery and progress through analytical validation, clinical validation or biological validation, clinical utility, and eventually, the final stage of associated implementation factors such as legal, ethical, and social ramifications as well as cost effectiveness.

A biomarker needs to meet a few fundamental requirements before it can enter the first (discovery) phase. It must be readily available, simple to prepare and store, and obtainable in sufficient quantities to meet its measurement requirements.

Analytical validation (the second phase) involves assessing the reproducibility of the biomarker measurements. Variables such as cut-off values, limits of detection, linearity, accuracy and precision, sensitivity and specificity, inter- and intra-assay coefficients of variation, and other factors are assessed at this stage.

The focus of the third stage, clinical validation, is the evaluation of performance indicators built on the thresholds established in the previous two phases. These indicators include likelihood and hazard ratios, the area under the receiver operating characteristic (ROC) curve (AUC), sensitivity and specificity, and positive and negative predictive values.

Any biomarker’s ultimate objective is to make it through the fourth and most difficult stage, clinical utility. The performance of a marker is finally decided at this stage because it will be the basis for further clinical decisions. Because of this, not all markers deemed trustworthy or accurate may be ultimately accepted [3]. Assessing what qualifies a biomarker as clinically helpful is crucial and starts with quantifying the diagnostic properties; therefore, the main factors and specifications that must be met for diagnostic tests to be considered of clinical interest are covered in the next section.

Biomarker discovery and development occur over various steps of qualification and validation that are supported by statistical elements that ensure the reproducibility and utility of the biomarker within its context of use [4, 5]. The initial stages of establishing a link between the biomarker and the disease outcome are supplemented by statistical tools that quantify the relationship, not only providing further degrees of evidence of its performance but also enabling the biomarker to be tailored to its intended use [6]. For example, diseases with low prevalence rates would benefit more from biomarkers having higher specificity rather than sensitivity [7]. Establishing preliminary characteristics of performance in the initial stages additionally helps guide future directions within the study [8].

Criteria for a useful diagnostic test

The traditional method of testing the usefulness or accuracy of a diagnostic test is to measure it against a reference diagnosis typically used in clinical settings.

Diagnostic tests are often binary in their conclusion: they either aim to confirm or exclude a diagnosis. While many statistical methods exist, certain measures of diagnostic accuracy are more commonly used than others to characterize biomarkers. Such measures include classification probabilities (true positive fraction or TPF/sensitivity, true negative fraction or TNF/specificity), predictive values (positive predictive value or PPV, negative predictive value or NPV), diagnostic odds ratios (DORs), likelihood ratios (LRs), ROC curves, and the Euclidean and Youden indices. While some measures are discriminative (for example, ROC curves), others are predictive (as is the case with logistic regression) in nature. Predictive measures are most helpful in determining the likelihood that a disease will afflict an individual, while discriminative measures are typically used simply to separate those with the disease from those without. While good discriminative performance is often more aligned with diagnostic biomarkers, predictive measures are helpful in quantifying the magnitude of the test’s result on the outcome. The ideal diagnostic biomarker would be one that discriminates perfectly, diagnosing every individual with the disease without any false diagnoses. However, it is often difficult to realize such an ideal for a variety of reasons. The choice of the acceptable degree of diagnostic uncertainty is then based on a variety of clinical factors, such as the nature of the disease, the cost of medical care, and the psychological effects of a missed diagnosis.

Sensitivity and specificity

Sensitivity (the test’s ability to truly detect all people with the disease, the true positives) and specificity (the test’s ability to correctly discount all people without the disease) are common metrics used to assess a diagnostic test. Although a test with both high sensitivity and specificity is desirable, trade-offs can be made depending on the intent of use, the setting, and the nature of the disease itself to prioritize one over the other. Sensitivity and specificity can be derived by simple equations from a confusion matrix (also known as a classification table), as demonstrated in Table 2. It includes all possible outcomes in a clinical setting: true positives are those correctly diagnosed with the disease, false positives are those diagnosed as having the disease without actually having it, false negatives are those misdiagnosed as healthy despite actually having the disease, and true negatives are those correctly diagnosed as not having the disease [9]. All the equations derived from the matrix are shown in Table 3.

Table 2 Confusion matrix or classification table
Table 3 Diagnostic equations derived from the confusion matrix
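
As a minimal illustration of the equations in Tables 2 and 3, the sketch below computes the main diagnostic metrics directly from confusion matrix counts; the counts themselves are hypothetical and chosen only for illustration.

```python
# Minimal sketch: diagnostic metrics from a confusion matrix.
# The counts are hypothetical and used only for illustration.
TP, FP, FN, TN = 80, 15, 20, 85  # true/false positives and negatives

sensitivity = TP / (TP + FN)                 # true positive fraction
specificity = TN / (TN + FP)                 # true negative fraction
ppv = TP / (TP + FP)                         # positive predictive value
npv = TN / (TN + FN)                         # negative predictive value
accuracy = (TP + TN) / (TP + FP + FN + TN)

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
print(f"PPV: {ppv:.2f}, NPV: {npv:.2f}, Accuracy: {accuracy:.2f}")
```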

The implications of false positives and negatives should be considered when designing the metrics and cut-offs of the diagnostic test. For example, a false negative means that a patient is misleadingly thought to be healthy until further symptoms develop or mortality occurs as a result of no treatment. Such a consequence is made worse in diseases where early diagnosis could result in treatment and full recovery, or at minimum a better prognosis. On the other hand, a false positive result would lead to unnecessary, if not harmful, medical interventions that may cause financial, psychological, and other avoidable harm to the individual.

Certain aspects of the disease are also critical in designing and evaluating diagnostic tests, particularly disease prevalence. Prevalence is defined as the fraction of the population under study that has the disease. Prevalence is an important characteristic to take into consideration, particularly for metrics of diagnostic accuracy such as predictive values [10].

The trade-offs among the measures of accuracy are therefore evaluated by assessing the relative risk of false positive or false negative results, while also taking into account the prevalence of the disease within the population itself.

PPV and NPV

The positive predictive value is the proportion of correctly predicted cases with the observed outcome versus the total number of cases predicted to have the outcome. The negative predictive value, on the other hand, is the proportion of correctly predicted cases lacking the observed outcome in comparison with the overall number of cases predicted as not having the outcome. PPV and NPV are functions of prevalence; in other words, to calculate the two values, the prevalence must be known. While PPV and NPV are metrics often used in diagnostic accuracy studies, any interpretation derived from them is not generalizable across studies, as they are greatly affected by prevalence: the interpretation applies only to the studied population.

In general, high specificity (the ability to correctly identify those without the disease, i.e., few false positives) tends to occur with a high PPV (the ratio of those truly diagnosed over all those diagnosed), because few individuals are falsely diagnosed.
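
To make the dependence on prevalence concrete, the sketch below recomputes PPV and NPV from a fixed (hypothetical) sensitivity and specificity at several assumed prevalence values, using the standard prevalence-based formulas.

```python
# Minimal sketch: PPV and NPV as functions of prevalence,
# holding sensitivity and specificity fixed (hypothetical values).
sens, spec = 0.90, 0.90

for prev in (0.01, 0.10, 0.50):
    ppv = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
    npv = (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
    print(f"Prevalence {prev:.0%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```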

ROC curve

The receiver operating characteristic curve is drawn by joining together a series of points obtained by plotting sensitivity (the true positive rate) against 1 − specificity (the false positive rate) at different cut-offs. The area under the generated curve is used to evaluate classification performance across all possible cut-offs of the biomarker.

There is no absolute consensus or calculation to derive what would be an acceptable AUC for a diagnostic biomarker, but generally speaking, most studies tend to follow the guideline values highlighted in Table 4 to interpret the value calculated from the plot [11, 12]: greater AUC values indicate better test performance, with values ranging from 0.5 (no diagnostic ability) to 1.0 (perfect diagnostic ability). The ROC curve is an important statistical technique for evaluating the performance of diagnostic medical tests, especially tests that aim to detect cancers early [13].

Table 4 General interpretation of AUC values

Another approach to interpretation is to take into consideration the clinical setting where the biomarker will be used in order to determine whether the given AUC has any meaningful significance.
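
A minimal sketch of how an ROC curve and its AUC can be obtained from continuous biomarker values is shown below, here using scikit-learn; the marker values and disease labels are simulated, so the resulting AUC is purely illustrative.

```python
# Minimal sketch: ROC curve and AUC for a continuous biomarker (simulated data).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Simulated marker values: cases tend to have higher values than controls.
controls = rng.normal(loc=1.0, scale=1.0, size=100)
cases = rng.normal(loc=2.0, scale=1.0, size=100)
marker = np.concatenate([controls, cases])
disease = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = patient, 0 = control

fpr, tpr, thresholds = roc_curve(disease, marker)  # 1 - specificity, sensitivity
auc = roc_auc_score(disease, marker)
print(f"AUC = {auc:.2f}")  # 0.5 = no discrimination, 1.0 = perfect discrimination
```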

Logistic regression

To fit models for the probability of disease as the outcome given marker values, logistic regression is used. Logistic regression, also known as the logistic model or the logit model, examines the relationship between one or more independent continuous variables and a dichotomous/binary dependent variable. These analyses create a model relating the outcome (the dependent variable) to the predictor variables (the independent variables). The probability of the occurrence of an outcome is estimated by fitting input data from epidemiological data (for example, patients and controls) to a logistic curve, where the predictive power is represented by the regression coefficients. Two types of models are used, depending on the number of possible categories of the dependent (outcome) variable: if there are two (dichotomous), binary logistic regression is utilized, and if there are more than two, multinomial logistic regression is used. Possible uses of logistic regression in the field of biomarker studies are highlighted in Table 5.

Table 5 Possible uses of logistic regression in diagnostic studies
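
As a minimal sketch of the procedure described above, the example below fits a binary logistic regression of disease status on a single biomarker; statsmodels is used here only as one possible tool, and the data and the coefficient values used to simulate them are arbitrary.

```python
# Minimal sketch: binary logistic regression of disease status on one biomarker
# (simulated data; statsmodels used as one possible tool).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
marker = rng.normal(size=n)
# Simulate a binary outcome whose log-odds depend linearly on the marker.
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * marker)))
disease = rng.binomial(1, p)

X = sm.add_constant(marker)              # intercept + marker value
model = sm.Logit(disease, X).fit(disp=0)

print(model.summary())                   # coefficients, Wald z, p values, CIs
print(f"Odds ratio for the marker = {np.exp(model.params[1]):.2f}")
```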

Feature selection is another aspect of logistic regression that may be beneficial in the early stages of biomarker discovery, especially in high-throughput techniques (for example, “-omics” methods involving DNA or RNA sequencing, or mass spectrometry) [14], where many potential candidates exist. Such methods help decrease the dimensionality of the data by removing redundant or irrelevant candidates to minimize complexity and further fine-tune the generated model to prevent overfitting [15]. This can be performed through several broad families of methods: filter, wrapper, and embedded methods. The methods are classified depending on whether or not a model needs to be generated through learning algorithms such as logistic regression in order to assess the features, with filter methods being the only one of the three that acts independently of the model [16]. Hybrid methods that combine two or all three exist as well [17]. An overview of each method’s strengths and weaknesses is highlighted in Table 6.

Table 6 The advantages and disadvantages of each feature selection method commonly used with learning algorithms
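
The sketch below shows, with scikit-learn on simulated data, one representative technique from each family in Table 6: an ANOVA F-test filter, recursive feature elimination as a wrapper, and L1-penalized logistic regression as an embedded method. The specific techniques and parameter values are illustrative choices only.

```python
# Minimal sketch: filter, wrapper, and embedded feature selection (simulated data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Simulated data: 200 samples, 50 candidate markers, only 5 truly informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Filter: rank features independently of any model (ANOVA F-test).
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination around a logistic regression model.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: the L1 penalty shrinks uninformative coefficients to exactly zero.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("Filter keeps:  ", np.flatnonzero(filter_sel.get_support()))
print("Wrapper keeps: ", np.flatnonzero(wrapper_sel.get_support()))
print("Embedded keeps:", np.flatnonzero(embedded.coef_[0] != 0))
```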

The evaluation of the logistic regression model includes multiple phases. The overall model is evaluated in terms of the relationship between all independent variables and the dependent variable. Then, the significance of each independent variable is determined by assessing its derived regression coefficient. Another phase includes assessing the model’s predictive accuracy/discriminating ability. The model must then be validated. The steps are outlined below:

  1.

    Evaluation of the overall model

    The overall fit of a model can be evaluated by comparing the predicted model to a null model (a model with no independent variable) when fitted to the input data. The model is said to be a better fit only if it exhibits improvement over the empty model [18], which is usually assessed through an Omnibus test or a Hosmer & Lemeshow test [11, 19].

  2.

    Predictive accuracy and discrimination of model

    Once the fitness of a model is evaluated, the accuracy is assessed. The accuracy can be determined from the sensitivity and specificity of the model, which are calculated using a confusion matrix. A cut-off is defined by the user (anywhere from 0 to 1), and all predicted probabilities above the cut-off are classified as positive [18].

  3.

    Statistical significance of regression coefficients of independent variable

    Is the predictive power of the independent variable significant enough? The relationship between the dependent and independent variable can be confirmed through statistical significance, which can be assessed by multiple tests such as the Wald statistic, the odds ratio, and the likelihood ratio test [18, 20].

  4.

    Validation of the model

    Once the model has been constructed, one final point must be assessed: whether the model developed with the independent/predictor variables can correctly predict the dependent/outcome variable in another subset of the population. There are two major methods of validation: external and internal. External validation is performed by testing the model on an entirely different dataset than the one used to build the model. Internal validation is performed using a similar subset of the population used to develop the model, if not the same.

    4.a Validation by frequentist approach

    The split-sample technique is performed by randomly splitting the dataset into training and validation sets. The disadvantages of such a method include the reduction of the dataset sample size used to develop the model, and the fact that different splits may produce different results. Cross-validation mimics the split-sample method of dividing the sample into training and validation sets, but extends it: it is a resampling technique in which development and testing are done in rounds.

    Another commonly used method is bootstrap validation. This type depends on test sets created by resampling the given data and is used to validate the model. In bootstrapping, the complete dataset is resampled several times with replacement, the statistics of interest are computed on each resample, and the results are then summarized across resamples. In logistic regression models developed on smaller samples, bootstrapping is commonly used to derive optimal estimates of internal validity [21]. A minimal sketch of cross-validation and bootstrap validation is given after this list.

    Biomarker studies that have been published with logistic regression often report either the coefficient of the logistic regression equation or the odds ratio (which is simply the exponent of the coefficient) [22], along with the confidence intervals (CI) or the significance (p value), to indicate the statistical significance of the associations established by these values between the predictor variable and the outcome variable.
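
As referenced in the validation step above, the sketch below illustrates two common internal validation strategies for a logistic regression biomarker model on simulated data: k-fold cross-validation of the AUC and a simple bootstrap of the AUC. The number of folds and resamples are arbitrary choices made for illustration.

```python
# Minimal sketch: internal validation of a logistic regression model by
# cross-validation and bootstrap resampling (simulated data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)
model = LogisticRegression(max_iter=1000)

# Cross-validation: repeated train/validation splits, AUC averaged over folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"Cross-validated AUC: {cv_auc.mean():.2f} (+/- {cv_auc.std():.2f})")

# Bootstrap: resample with replacement, refit, and evaluate AUC on the original data.
rng = np.random.default_rng(0)
boot_auc = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))   # indices sampled with replacement
    fit = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boot_auc.append(roc_auc_score(y, fit.predict_proba(X)[:, 1]))
print(f"Bootstrap AUC: mean {np.mean(boot_auc):.2f}, "
      f"2.5-97.5 percentile range {np.percentile(boot_auc, [2.5, 97.5]).round(2)}")
```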

Bayesian approach

The Bayesian approach is an alternative statistical framework that can substitute for conventional logistic regression. It is able to take current (prior) beliefs into consideration and to obtain a probability distribution for the quantity of interest. The following equation demonstrates Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

This approach depends mainly on the availability of prior probabilities before conducting the study, represented as P(A) (the probability of A occurring). P(B|A) is the probability of event B occurring given A, termed the likelihood. P(B) is the probability of B occurring, termed the evidence. Finally, from all this information, P(A|B), the posterior distribution, is computed; that is, the prior is converted to a posterior after taking the results of the experiment into account [23].
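
As a concrete illustration of the theorem in a diagnostic setting, the sketch below treats the disease prevalence as the prior P(A) and the test’s sensitivity and specificity as the likelihood terms, and computes the posterior probability of disease given a positive result; all numbers are hypothetical.

```python
# Minimal sketch: Bayes' theorem applied to the post-test probability of disease.
# Hypothetical prior (prevalence) and test characteristics.
prior = 0.02          # P(A): prevalence of the disease before testing
sensitivity = 0.90    # P(B|A): probability of a positive test given disease
specificity = 0.95    # so P(B|not A) = 1 - specificity = 0.05

# Evidence P(B): overall probability of a positive test result.
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior P(A|B): probability of disease given a positive test.
posterior = sensitivity * prior / p_positive
print(f"Post-test probability of disease: {posterior:.2f}")
```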

One main advantage of this approach is its ability to validate the model when the available data are limited. For instance, rare diseases can be a hurdle for any clinical study because of the small number of patients in the population [24, 25]. It also gives a range of certainty for or against a hypothesis rather than a single point estimate. However, it is still a more complex type of statistical analysis, and more advanced statistical software is needed to utilize this method.

One main disadvantage, on the other hand, is that priors can be subjective and may affect the posterior distribution. Moreover, priors are essential: the analysis is not possible without them.

Cut-off determination

In diagnostic studies, the test should yield binary outcomes (positive or negative). When a new biomarker is explored, the optimum cut-off to transform the continuous values into dichotomous ones is assessed through several metrics that often incorporate sensitivity and specificity [26]. A general outline of the most common calculations used for such assessments is detailed below.

Youden’s index

An optimum cut-off in the statistical sense is the one with the greatest possible difference between the true positive rate (i.e., sensitivity) and the false positive rate (i.e., 1 − specificity), that is, the cut-off that maximizes Youden’s index J = sensitivity + specificity − 1 [27].
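
A minimal sketch of selecting the cut-off that maximizes Youden’s index from an ROC curve is shown below, using scikit-learn on simulated marker values; the simulation parameters are arbitrary.

```python
# Minimal sketch: choosing an optimal cut-off by Youden's index (simulated data).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
marker = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
disease = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = patient, 0 = control

fpr, tpr, thresholds = roc_curve(disease, marker)
youden_j = tpr - fpr                        # sensitivity - (1 - specificity)
best = np.argmax(youden_j)
print(f"Optimal cut-off: {thresholds[best]:.2f} "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")
```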

Diagnostic odds ratios/DOR

The DOR of a test is the ratio of the odds of positivity in diseased subjects compared to the odds of positivity in healthy subjects. The ratio is derived from sensitivity and specificity and as a result, is not affected by the prevalence of the disease [28]. DOR can be calculated using the following equation:

$$\text{DOR} = \frac{\text{Sensitivity} \times \text{Specificity}}{\left(1 - \text{Specificity}\right)\left(1 - \text{Sensitivity}\right)} = \frac{\text{Sensitivity}/\left(1 - \text{Sensitivity}\right)}{\left(1 - \text{Specificity}\right)/\text{Specificity}}$$

where (1 − Specificity) corresponds to the false positive rate and (1 − Sensitivity) to the false negative rate.

Values higher than one generally indicate some degree of diagnostic usefulness [28], with increasing values indicating better performance. The DOR is commonly used as a measure of association in epidemiology; however, its discriminatory power is often called into question [29, 30]. Since an odds ratio is a single number, it does not account for the trade-off between accurately identifying cancer patients and mistakenly identifying otherwise healthy individuals, but it may be useful in characterizing population-level risks [29]. Hence, some studies discourage the use of the DOR when examining binary early detection biomarkers [31].

Likelihood ratios/LR

The likelihood ratio is defined as the ratio of the probability of a given test result in patients with the target disease to the probability of the same result in patients without the disease. The LR predicts how likely it is that a patient has the disease, using sensitivity and specificity (LR+ refers to positive test results, while LR− refers to negative test results).

They are calculated using the following equations:

$$\text{LR}+ = \frac{\text{Sensitivity}}{1 - \text{Specificity}}$$
$$\text{LR}- = \frac{1 - \text{Sensitivity}}{\text{Specificity}}$$
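
The short sketch below computes both likelihood ratios from a hypothetical sensitivity and specificity, and also shows, for comparison, that the DOR discussed above equals the ratio LR+/LR−.

```python
# Minimal sketch: likelihood ratios and DOR from sensitivity and specificity
# (hypothetical values for illustration).
sens, spec = 0.85, 0.90

lr_pos = sens / (1 - spec)     # LR+: how much a positive result raises the odds of disease
lr_neg = (1 - sens) / spec     # LR-: how much a negative result lowers the odds of disease
dor = lr_pos / lr_neg          # diagnostic odds ratio equals LR+ / LR-

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}, DOR = {dor:.1f}")
```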

Rough guidelines on how LR is generally interpreted in the literature [27] are highlighted in Table 7.

Table 7 General interpretations of LR values

Conclusion

The clinical field is still in immense need of the development of new biomarkers.

Biomarkers offer guidance for clinicians at the beginning of or throughout the clinical intervention itself. They may be screening, diagnostic, prognostic, predictive, monitoring, risk, or response biomarkers. Regardless of their specific use, studying biomarkers is closely tied to statistical analysis, which is often carried out by biostatisticians.

The hurdles encountered by clinical researchers in statistical analysis are often attributed to the lack of a comprehensive and straightforward guide outlining the essential steps, together with their corresponding definitions, calculation methods, and reasoning behind why and how each calculation is used. This review serves as a general guide for the main statistical analyses that are needed to develop and validate a biomarker study.