1 What Is the Difference Between a Retrospective and a Prospective Study?

A retrospective study is based on information collected in the past, for example about a disease that has already occurred, and aims to investigate its association with risk factors or exposures, for example the association between lung cancer and smoking. The most common example of a retrospective study is a case–control study; however, not all retrospective studies are case–control studies. Retrospective studies are relatively inexpensive, easy to perform, and require less time to draw inferences from the data. Their major disadvantage is that the investigator has to depend on information collected and maintained in the past by others, which may be incomplete or lacking in some aspects, such as confounding factors.

Unlike a retrospective study, which looks backwards, a prospective study looks forwards: it is conducted over a period of time in a cohort of subjects and studies the relationship of outcomes to specific exposures or risk factors. For example, a prospective study may examine the relationship between maternal anaemia and gestation, where the outcome is observed with respect to different levels of haemoglobin. Prospective studies are relatively expensive and time-consuming because the disease or outcome occurs after the beginning of the study, but they are considered better than retrospective studies in the hierarchy of evidence.

2 What Is a Cross-Sectional Study?

Unlike retrospective and prospective study designs, a cross-sectional study examines and collects data at a single point in time. A cross-sectional study provides the current status of a population, say the prevalence of a disease and its associated factors.

3 What Is a Longitudinal Study?

A cross-sectional study collects data on individuals at a single point in time, whereas a longitudinal study collects data on the same individuals at different points in time. For example, a cohort of individuals is compared over time for a specific outcome. The Framingham Heart Study is one of the most well-known longitudinal studies and has been conducted over the last 50 years [1].

4 What Is a Case–Control Study?

In a case–control study, subjects with and without a medical condition (disease) are investigated for exposure in the past. Cases are subjects who have a specific medical condition, say lung cancer, and controls are subjects without lung cancer. Case–control studies are usually, but not exclusively, retrospective. Cases and controls are individually checked for suspected risk factors or past exposures which are likely to cause the medical condition. The potential relationship between the suspected risk factor and the disease is generally examined by arranging the counts in a 2 × 2 table and computing the odds ratio. In case–control studies, the odds ratio (OR), not the relative risk (RR), is the valid measure of risk.
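As an illustration, the odds ratio can be computed directly from the four cell counts of a 2 × 2 table. The following base-R sketch uses hypothetical counts, not figures from the text:

```r
# Hypothetical 2x2 table from a case-control study (counts are illustrative only)
#                      cases   controls
# exposed (smokers)       60         30
# unexposed               40         70
a <- 60; b <- 30; c <- 40; d <- 70

odds_exposure_cases    <- a / c        # odds of exposure among cases
odds_exposure_controls <- b / d        # odds of exposure among controls

or <- (a * d) / (b * c)                # odds ratio (cross-product ratio)
or                                     # 3.5: exposure is associated with being a case
```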

5 What Is a Randomized Controlled Trial?

A randomized controlled trial (RCT) is a scientific experiment in which the investigator compares the responses to two or more treatments/drugs or interventions. Subjects are allocated randomly to the different groups to reduce bias. Those in the control group are assigned a placebo, no intervention, or the existing intervention, whereas the other groups, called test or experimental groups, are assigned the new intervention. Randomized trials can include more than one treatment group or even more than one control group.

6 What Is a Cluster Randomized Trial?

The cluster randomized trial is a relatively new study design. Cluster randomized trials are now commonly used to evaluate public health, health policy, and health system interventions. In cluster designs, clusters rather than individuals are randomly allocated to the groups, which has several advantages over individual randomization in terms of implementation cost and administrative convenience. Readers may refer to the paper by Moberg and Kramer [2] for a better understanding of these designs and their applicability.

7 What Is an Equivalence, Superiority, or a Noninferiority Trial?

A superiority trial aims to show that one treatment is superior to another in a randomized controlled trial. For example, we may want to know whether a new drug increases high-density lipoprotein (HDL) levels more than the existing one. Superiority is established if the response improves by at least a predetermined margin, called the superiority margin. The null and alternative hypotheses in a superiority trial are specified as follows:

We set the null hypothesis (H0) that the response of the new drug exceeds that of the existing one by no more than the margin (δ), and look for evidence against it in favour of the alternative hypothesis (H1).

$$ \mathrm{H}_0: \mu_1 - \mu_0 \le \delta \qquad\qquad \mathrm{H}_1: \mu_1 - \mu_0 > \delta $$

where δ ≥ 0 is the superiority margin.

An equivalence trial aims to show that two treatments are ‘not too different’ with respect to some characteristic, where ‘not too different’ means within a specified margin that is clinically unimportant. For example, if a difference of 2% or less in efficacy is irrelevant from the clinical perspective, an equivalence trial tests whether the true difference lies within a 2% margin, using a two-sided approach.

The hypotheses of an equivalence trial can be specified as follows:

$$ \mathrm{H}_0: |\mu_1 - \mu_0| \ge \delta \qquad\qquad \mathrm{H}_1: |\mu_1 - \mu_0| < \delta $$

where δ > 0 is the tolerance margin.

A noninferiority trial aims to show that the new regimen is ‘not (much) worse’ than the existing or standard regimen, i.e., not worse by more than a specified margin. Noninferiority trials are conducted when the new regimen has the potential to be at least as effective as the standard regimen while being cheaper or more convenient. The hypotheses of a noninferiority trial can be specified as follows:

$$ \mathrm{H}_0: \mu_1 - \mu_0 \le -\delta \qquad\qquad \mathrm{H}_1: \mu_1 - \mu_0 > -\delta $$

where δ ≥ 0 is the margin of clinical significance.
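In practice, noninferiority is often judged by checking whether the lower confidence limit of the difference in means lies above −δ. A minimal R sketch, assuming simulated responses and a hypothetical margin:

```r
set.seed(1)
new      <- rnorm(60, mean = 10.2, sd = 2)   # simulated responses, new regimen
standard <- rnorm(60, mean = 10.0, sd = 2)   # simulated responses, standard regimen
delta    <- 1.5                              # hypothetical noninferiority margin

# Lower limit of the two-sided 95% CI for mu_new - mu_standard
# (equivalent to a one-sided 97.5% lower confidence bound)
lower <- t.test(new, standard)$conf.int[1]

c(lower_limit = lower, noninferior = lower > -delta)
```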

8 Questions and Answers Related to Sampling Methods

8.1 What Are the Common Methods of Random Sampling?

There are a large number of methods in statistics that are used for selecting a random sample. The most common and widely used is simple random sampling. However, below we discuss some of the other methods as well.

Simple Random Sampling (SRS)

Simple random sampling is a method where each member of the population has an equal chance (probability) of being included in the sample. The sample is selected on the basis of random numbers. Simple random sampling appears easy but is sometimes difficult to implement, as it requires a sampling frame, i.e., a list of all the sampling units in the target population. Preparing such a list is an exhausting exercise, particularly when the population is large. Even if we are able to do so, SRS does not ensure that the different segments of the population are adequately represented in the sample.
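A minimal sketch of simple random sampling in R, assuming a hypothetical sampling frame of 500 patient IDs:

```r
set.seed(2024)                      # for reproducibility
frame <- 1:500                      # hypothetical sampling frame (patient IDs)

srs <- sample(frame, size = 40)     # SRS of 40 units without replacement
sort(srs)
```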

Stratified Random Sampling

In stratified random sampling, the population is first divided into different homogeneous strata based on some criteria, say age, income, sex, BMI, or disease followed by selecting a simple random sample from each stratum. Compared to SRS, stratified sampling gives a more adequate representation to one or more subgroups of interest.

Multistage Random Sampling

Multistage random sampling is the preferred design when the population is very large and spread over a wide area, say an entire state. Sampling is done in stages: a sample of districts (the primary stage units) is selected first, followed by blocks from the selected districts, villages from the selected blocks, and finally subjects from the selected villages. This design is widely followed in large-scale surveys and requires less effort in developing the sampling frame.

Cluster Random Sampling

A cluster is a group of subjects located in close proximity that is easy to administer for survey purposes. In cluster sampling, clusters are the primary units and sampling is done on the clusters. In stratified sampling, a random sample is drawn from every stratum, whereas in cluster sampling only the selected clusters are sampled. A major advantage of cluster sampling is that it reduces cost and administrative effort, whereas stratified sampling increases precision.

Systematic Random Sampling

Systematic random sampling selects the first element randomly, and the remaining units are then included automatically on the basis of the sampling interval (every kth unit). Systematic random sampling is easy to execute, particularly in a hospital: every kth patient visiting the clinic can be included in the sample once the first patient is chosen randomly. However, systematic random sampling can yield biased estimates, for example when the sampling frame has a periodic pattern.

In addition to the above commonly discussed methods of random sampling, there are others such as probability proportional to size sampling, area sampling, inverse sampling, sequential sampling, and nonrandom methods of sampling (purposive or convenience).

8.2 How to Perform Randomization in Real Research Settings?

Randomization is a prerequisite for conducting any experimental trial and helps in eliminating bias by randomly allocating subjects/units to the different interventions/stimuli. Randomization makes it likely that the two or more groups are at par in terms of unseen heterogeneity prior to the intervention. Normally, a randomized controlled trial (RCT) includes a group of subjects who receive no intervention or a placebo, called the control group; the other group, the ‘intervention’ or ‘test’ group, receives the treatment. If an experiment is to be performed on 50 subjects with 25 each in the intervention and control groups, we can assign subjects to either group with the help of random numbers. Readers may refer to the website www.randomization.com or use R software with a little coding to draw the random numbers. Following is the output of the website for allocating the 50 subjects to the two groups:

(Figure: randomization list generated by www.randomization.com)
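Equivalently, a minimal base-R sketch (not the website’s algorithm) that allocates 50 subjects to two equal groups of 25:

```r
set.seed(123)                               # for a reproducible allocation
n <- 50
group <- sample(rep(c("Intervention", "Control"), each = n / 2))
allocation <- data.frame(subject_id = 1:n, group)

head(allocation)
table(allocation$group)                     # 25 subjects per group
```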

8.3 What Are the Determinants of an Adequate Sample Size for a Research Study?

Determination of the sample size is a prerequisite for any research programme and depends primarily on the objectives of the study as well as on how those objectives are to be answered statistically. A study may have a single objective or multiple objectives, and an objective may be ‘descriptive’ or ‘analytical’ in nature. The determinants of the sample size mainly include the basic information available about the study, such as the mean, standard deviation, proportion, and effect size, in addition to general statistical quantities such as the confidence level, significance level (type-I error), and power of the study (1 minus the type-II error). This basic information is taken from related studies published in the past. For more details on how to calculate the sample size, please see the chapter on ‘How to calculate an adequate sample size’ in this book.

9 Questions and Answers Related to the Diagnostic Ability/Validity of a Test

9.1 What Is the Sensitivity and Specificity of a Test?

In medical diagnosis, the terms sensitivity and specificity are often used to assess the ability of an alternative test, compared against a gold standard, to identify positive and negative cases correctly. Sensitivity is the ability of a test to correctly identify positive cases among those who have the disease, whereas specificity is the ability to correctly identify negative cases among those who do not have the disease. Sensitivity is also called the true-positive rate and specificity the true-negative rate. For example, if a test correctly identifies 85 positive cases out of 100 cases who have the disease, its sensitivity is 85%; if it correctly identifies 95 negative cases out of 100 who do not have the disease, its specificity is 95%. Let us discuss the results of a gold standard test and an alternative test conducted on 700 subjects, arranged in the following table:

 

                     Gold standard test
Alternative test     Positive       Negative      Row total
Positive             400 (TP)       30 (FP)       430
Negative             100 (FN)       170 (TN)      270
Column total         500            200           700

In the above table, the gold standard test shows that the disease is present in 500 subjects and absent in 200 (see column totals). The alternative test, however, finds the disease present in 430 and absent in 270 subjects (see row totals). There is thus a disagreement between the two tests; had there been none, the alternative test would have been as accurate as the gold standard. Let us analyze the results of the alternative test more closely. Of the 500 subjects in whom the gold standard confirms the disease, the alternative test finds the disease in 400 and not in the remaining 100: these 400 subjects are true positives (TP) and the 100 are false negatives (FN). Similarly, of the 200 subjects identified as disease-free by the gold standard, the alternative test finds no disease in 170 and disease in the remaining 30: these 170 subjects are true negatives (TN) and the 30 are false positives (FP). If there were no false positives and no false negatives, the alternative test would be as good as the gold standard. Below we describe various diagnostic, predictive, and accuracy parameters of the alternative test.

Sensitivity (true-positive rate, TPR) = 100 × TP/(TP + FN) = 100 × 400/(400 + 100) = 80%. Sensitivity is also called recall in information retrieval.

False negative rate (FNR) = 100 − sensitivity = 100 − 80 = 20%

Specificity (true-negative rate, TNR) = 100 × TN/(TN + FP) = 100 × 170/(170 + 30) = 85%

False positive rate (FPR) = 100 − specificity = 100 − 85 = 15%

Positive predictive value (PPV) = 100 × TP/(TP + FP) = 100 × 400/(400 + 30) = 93%

Negative predictive value (NPV) = 100 × TN/(TN + FN) = 100 × 170/(170 + 100) = 63%
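These quantities can be reproduced directly from the four cell counts; a base-R sketch using the table above:

```r
# Cell counts from the 2x2 table above
TP <- 400; FP <- 30; FN <- 100; TN <- 170

sensitivity <- TP / (TP + FN)    # 0.80
specificity <- TN / (TN + FP)    # 0.85
ppv         <- TP / (TP + FP)    # ~0.93
npv         <- TN / (TN + FN)    # ~0.63

round(100 * c(sensitivity = sensitivity, specificity = specificity,
              PPV = ppv, NPV = npv,
              FNR = 1 - sensitivity, FPR = 1 - specificity), 1)
```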

9.2 What Are Positive and Negative Predictivities?

The sensitivity and specificity of a test are indicators of its validity and do not measure its diagnostic value, which is obtained in terms of predictivities. Let us analyze the results of the alternative test in the above table to understand the predictivities of the test. The test has labelled 430 subjects as positive; however, only 400 of these have been labelled correctly and 30 wrongly. Similarly, the test has labelled 270 subjects as negative, of whom 170 have been labelled correctly and 100 wrongly. Thus the alternative test labelled 93% of the test-positive cases correctly and 63% of the test-negative cases correctly. The predictivity of a test is its ability to correctly label the tested cases: positive predictivity is its ability to correctly label as ‘positive’ the subjects who test positive, and negative predictivity is its ability to correctly label as ‘negative’ the subjects who test negative. Predictivities are also called post-test probabilities and measure the utility of the test in correctly identifying or excluding the disease. The major drawback of predictivities is that they depend on the prevalence of the disease. If the disease prevalence or prior probability of disease is known, we should use the formulae for PPV and NPV that involve sensitivity, specificity, and prevalence. PPV increases with the prevalence of the disease: in the example above, if we increase the proportion of subjects having the disease, PPV will increase and NPV may fall, and if the disease is rare, positive predictivity will be very low. Sensitivity and specificity, however, do not depend upon the prevalence. Thus, predictivities should be calculated only for a study that includes the correct proportion of diseased and non-diseased subjects. It is appropriate to report predictivities along with sensitivity and specificity in a cross-sectional study, whereas for case–control studies only sensitivity and specificity should be reported, because the proportion of cases is fixed by design.

9.3 What Are Likelihood Ratio Tests?

Likelihood ratios are other measures that combine sensitivity and specificity for interpreting diagnostic tests; their advantage is that they do not depend upon the prevalence of the disease. The positive likelihood ratio (LR+) is the ratio of the true-positive rate (TPR) to the false-positive rate (FPR), whereas the negative likelihood ratio (LR−) is the ratio of the false-negative rate (FNR) to the true-negative rate (TNR). LR+ tells us how much a positive result raises the odds of having the disease: post-test odds = pre-test odds × LR+. When LR+ is 10, a positive result multiplies the odds of disease by 10; the higher the LR+, the higher the likelihood of disease after a positive test. LR− works in the reverse direction: an LR− of 0.1 means that a negative result reduces the odds of disease to one-tenth of the pre-test odds. If the pretest probability (prevalence) is known, say from earlier records or symptoms, a value of 10 or more for LR+ and 0.1 or less for LR− indicates that the test is extremely good. If a test has LR+ and LR− equal to 1, it has no diagnostic value.

$$ \mathrm{Likelihood\ ratio\ positive\ (LR+)} = \frac{\mathrm{Sensitivity}}{100-\mathrm{Specificity}} = \frac{\mathrm{TPR}}{\mathrm{FPR}} = \frac{80}{15} = 5.33 $$
$$ \mathrm{Likelihood\ ratio\ negative\ (LR-)} = \frac{100-\mathrm{Sensitivity}}{\mathrm{Specificity}} = \frac{\mathrm{FNR}}{\mathrm{TNR}} = \frac{20}{85} = 0.235 $$

9.4 What Is a Diagnostic Odds Ratio?

$$ \mathrm{Diagnostic\ odds\ ratio\ (DOR)} = \frac{\mathrm{LR}+}{\mathrm{LR}-} = \frac{5.33}{0.235} = 22.67 $$

A Diagnostic odds ratio (DOR) is the ratio of positive and negative likelihoods. DOR measures the effectiveness of a diagnostic test. DOR ranges from 0 to infinity. In a test with a 50% sensitivity and specificity, the DOR has a value of 1, with 90% it is 81 and with 99% it is 9801. A higher value of DOR is considered to give a better performance.

9.5 What Is the Accuracy of a Diagnostic Test?

The accuracy of the test is the ability of the test to correctly label the positive and negative cases.

Accuracy = (TP + TN)/N = (400 + 170)/700 = 81.4%, where N is the total number of subjects tested.

9.6 What Is a Receiver Operating Characteristic (ROC) Curve?

The ROC curve is a graphical plot of the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) for various threshold settings of the predicting variable (Fig. 11.1). ROC curves are frequently used in medical research and help in predicting a binary outcome, i.e., one having two states only, ‘Yes’ or ‘No’, ‘Positive’ or ‘Negative’. For example, we may be interested in predicting the coronary heart disease (CHD) outcome on the basis of a threshold level of LDL, or in predicting suicide on the basis of a threshold level of post-dexamethasone suppression test (DST) plasma cortisol. In both examples, the predictor variable is quantitative and the outcome is binary. The ROC curve is useful for locating an optimal threshold that best separates diseased from non-diseased subjects. A common choice of cut-off is the point where the sum of sensitivity and specificity is maximal, which lies towards the upper left corner of the curve. The area under the curve (AUC) is a measure of diagnostic accuracy or discriminative ability and is used for comparing two or more medical tests assessing the same outcome: a test with a higher AUC is considered better, and a test with an AUC of 1 is a perfect test. Below we give the ROC curves of three predictors, their AUCs, cut-off values, sensitivities, and specificities for predicting a binary outcome using SPSS. Predictor2 has the highest AUC among the three, i.e., 0.959, and is thus the strongest predictor, whereas predictor1 is the weakest (Table 11.1).

Fig. 11.1 ROC curve (example)

Table 11.1 AUC, cut-off value, sensitivity, and specificity of the three predictors
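The same kind of analysis can also be run in R; a minimal sketch using the widely used pROC package (one common option, with simulated data, so the numbers are purely illustrative):

```r
# install.packages("pROC")    # if not already installed
library(pROC)

set.seed(7)
outcome   <- rbinom(200, 1, 0.4)                                    # binary outcome (0/1)
predictor <- rnorm(200, mean = ifelse(outcome == 1, 6, 4), sd = 1.5)

roc_obj <- roc(outcome, predictor)
auc(roc_obj)                                # area under the curve
plot(roc_obj)                               # the ROC curve

# Cut-off where sensitivity + specificity is maximal (Youden index)
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```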

10 Questions and Answers Related to Relative Risk, Odds Ratio (OR), and Hazard Ratio

10.1 What Is Meant by the Relative Risk (RR), Odds Ratio (OR), and Hazard Ratio (HR)?

The medical literature commonly uses three measures for assessing risk, namely the relative risk (RR), odds ratio (OR), and hazard ratio (HR). All three look for the association of an event (e.g., occurrence of cancer, recurrence of a disease, survival/mortality) under two contrasting conditions, for example exposure versus non-exposure, treatment versus non-treatment, or surgery versus conservative treatment. All three provide some idea of the comparative risk under different conditions and are sometimes misunderstood and interchanged. They are known as relative measures of association. There are also absolute measures of association, such as the risk difference, rate difference, and number needed to treat [3]; these are less frequently used and have their own merits and demerits. Generally, there is confusion among researchers about which relative measure should be used.

The following points may help:

Odds ratio (OR) is a measure of association between an exposure and an outcome. It is simply the ratio of the odds under two conditions, i.e., the odds of having the disease under exposure and under non-exposure. The odds ratio is generally used for assessing risk in case–control studies. Below we discuss the risk of developing lung cancer among those exposed to smoking, with the help of a hypothetical case–control study. We calculate the odds ratio using the data in the following table, which includes 30 cases and 1170 controls drawn from two populations (Table 11.2).

Table 11.2 Observed counts and probabilities

                      Lung cancer     No lung cancer    Total
Exposure (smoking)    20 (0.10)       180 (0.90)        200
No exposure           10 (0.01)       990 (0.99)        1000
Total                 30              1170              1200

Here the odds of having lung cancer under exposure are 1:9 and under non-exposure 1:99. The odds ratio is the ratio of these two odds, i.e., (1/9) divided by (1/99), which equals 11. An odds ratio greater than 1 implies an increased occurrence of lung cancer among those who smoke compared to those who do not; in other words, smoking is associated with lung cancer. If the odds ratio is equal to 1, the exposure has no role in the development of lung cancer, and if it is less than 1, the exposure plays a protective role.

Relative risk (RR) is the ratio of two probabilities and is generally calculated when we study the outcome in two groups, for example, recurrence of cancer in two cohorts of patients where one cohort receives a new drug and the other an old drug. Unlike case–control studies, where we have two samples—one of ‘cases’ and the other of ‘controls’—in cohort studies the samples pertain to the two exposure groups and enable us to estimate the probability of the outcome in each group correctly. Thus, we should calculate the relative risk (RR) rather than the odds ratio whenever we can estimate the probability of the outcome in each group. The relative risk or risk ratio is usually calculated in prospective study designs, e.g., cohort studies or RCTs.

Let us calculate the relative risk for a hypothetical example of two groups of women, who drink and who do not drink, drawn randomly from two populations. We follow both groups for a period of 10 years to observe the occurrence of liver disease and assess the relative risk of developing the disease between the two groups. The relative risk is the probability (π1) of developing liver disease in women who drink divided by the probability (π2) of developing it in women who do not.

                Liver disease
Drinking        Yes             No              Total
Yes             80 (0.16)       420 (0.84)      500
No              10 (0.02)       490 (0.98)      500

$$ \mathrm{RR} = \pi_1/\pi_2 = 0.16/0.02 = 8 $$

Thus, the relative risk of developing liver disease among women who drink is 8 times that of women who do not. The following important points should be kept in mind while estimating and interpreting the odds ratio, relative risk, hazard ratio, and their effect sizes:

The odds ratio (OR) may or may not be equal to the relative risk (RR), depending upon the probabilities of the outcome in the two groups. For the case–control study discussed above, the OR is 11 and the RR would be 10. However, it is wrong to calculate the relative risk there, because we cannot estimate the probabilities of the outcome in the exposure and non-exposure groups: the samples were not drawn from the populations with and without the exposure. In the second example, by contrast, we can estimate the probabilities under exposure and non-exposure. The odds ratio and relative risk are related as shown below [4] (a numerical illustration follows these points):

  • The odds ratio is equal to the relative risk if the probabilities of the outcome are equal in the two groups (exposure and non-exposure).

  • The odds ratio is greater than the relative risk if the probability of the outcome in the exposure group is higher than that in the non-exposure group.

  • The odds ratio is lower than the relative risk if the probability of the outcome in the exposure group is lower than that in the non-exposure group.

  • OR and RR are bound to differ if the two probabilities differ. However, the odds ratio is close to the relative risk if the probabilities of the outcome are small [5].

The odds ratio (OR) does not change even if the ratio of the number of cases to controls changes; the relative risk (RR), however, does change [6]. For example, if one cell count in the above table increases from 180 to 360 and the other from 990 to 1980, the odds ratio remains the same, i.e., 11, but the RR changes from 10 to 10.5.

In an RCT or cohort study, we can calculate the odds ratio; however, the odds ratio only approximates the risk ratio if the outcome is rare or if the odds ratio is close to 1 [3].

Hazard ratio (HR) is the ratio of two hazard rates under two conditions, say treatment and control, and is useful when the risk varies with time. The term hazard ratio is often used interchangeably with relative risk; the major difference is that the hazard ratio is estimated in a time-to-event analysis. The relative risk reflects the cumulative risk over the entire study period with a defined endpoint and is usually calculated at the end of the study, whereas the hazard ratio gives the instantaneous risk at a point in time or over a time interval. The Cox regression model is generally used to estimate the hazard ratio and to draw time-to-event curves; Cox regression investigates the effect of several variables on the occurrence of an event over a period of time. If the hazard ratio is greater than 1, the predictor is associated with increased risk, and if it is less than 1, it is protective.
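Returning to the OR/RR comparison above, a base-R sketch using the drinking and liver-disease counts (a cohort, so both measures can legitimately be computed):

```r
# 2x2 cohort table: drinking and liver disease
#               disease   no disease
# drinkers           80          420
# non-drinkers       10          490
a <- 80; b <- 420; c <- 10; d <- 490

risk_exposed   <- a / (a + b)            # 0.16
risk_unexposed <- c / (c + d)            # 0.02

rr <- risk_exposed / risk_unexposed      # relative risk = 8
or <- (a * d) / (b * c)                  # odds ratio ~ 9.33

c(RR = rr, OR = or)   # OR exceeds RR here because the outcome is not rare
```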

The following table gives a classification based on effect sizes (estimates of risk parameters) suggested by Oliver and co-workers [7] for various risk measures (Table 11.3).

Table 11.3 Association measures, classification, and effect size

10.2 What Is the Kaplan–Meier Curve?

Kaplan–Meier (KM) curves are commonly used in clinical and basic research for estimating the probability of survival at different intervals of time (Fig. 11.2). They generally compare two cohorts, for example one following a new treatment and the other the existing regimen, and one can easily infer from the curves which treatment prolongs survival, or what the survival status is in the two groups. Here cohort B is doing better than cohort A in terms of survival: group B has a median survival time (50% survival) of 23 months, whereas group A has 8 months.

Fig. 11.2 Example of a Kaplan–Meier curve

Censored data can substantially affect the KM curve but have to be included when fitting the model. Censoring is common in survival analysis and is a form of missing data, arising either because the event of interest does not occur during the study period or because the subject leaves the study before the event occurs. Kaplan–Meier curves consider the impact of one factor at a time and ignore other confounding factors; here the predictor variable is categorical, such as surgery vs conservative treatment or new regimen vs existing regimen. Figure 11.2 gives Kaplan–Meier curves showing the survival probability (%) over time for the two groups.
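Kaplan–Meier estimates of this kind are commonly obtained in R with the survival package; a minimal sketch on simulated data (group names, event rates, and censoring are all illustrative):

```r
library(survival)

set.seed(11)
n <- 120
group  <- rep(c("A", "B"), each = n / 2)
time   <- rexp(n, rate = ifelse(group == "A", 0.10, 0.04))   # months to event
status <- rbinom(n, 1, 0.8)                                  # 1 = event, 0 = censored

fit <- survfit(Surv(time, status) ~ group)
summary(fit)$table                          # includes median survival per group
plot(fit, lty = 1:2, xlab = "Months", ylab = "Survival probability")
legend("topright", c("Group A", "Group B"), lty = 1:2)
```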

10.3 What Is the Cox Proportional Hazards Model?

The Cox proportional hazards model is essentially a regression model that assesses simultaneously the effect of several risk factors (predictors) on the survival time of patients. These predictors, also known as covariates or confounding factors, are not taken into consideration by Kaplan–Meier curves or the log-rank test. The effect of any factor (predictor) is estimated by the hazard ratio (HR), which is the exponential of its regression coefficient. If HR > 1, the factor increases the hazard and is a risk factor; if HR < 1, it reduces the hazard and is a good prognostic factor; and if HR = 1, the factor has no effect.
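A minimal Cox regression sketch with the survival package, again on simulated data (the covariates age and treatment are hypothetical):

```r
library(survival)

set.seed(21)
n <- 200
age       <- rnorm(n, 60, 10)
treatment <- rbinom(n, 1, 0.5)      # 1 = new treatment, 0 = control
time      <- rexp(n, rate = 0.05 * exp(0.03 * (age - 60) - 0.5 * treatment))
status    <- rbinom(n, 1, 0.85)     # 1 = event observed, 0 = censored

cox_fit <- coxph(Surv(time, status) ~ age + treatment)
summary(cox_fit)                    # exp(coef) column gives the hazard ratios
```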

11 Questions and Answers Related to Association, Agreement, Correlation, and Regression Analysis

11.1 What Is the Difference Between Association and Correlation?

In medical studies, the words ‘association’ and ‘correlation’ between two attributes/variables are frequently used and often interchanged. In ordinary language they may carry the same meaning, but statistically they do not. Association refers to a general relationship and is normally used for studying the relationship between two nominal/categorical/ordinal attributes, whereas correlation refers to a linear relationship between two quantitative attributes. Thus, when we want to ascertain whether CKD (chronic kidney disease) is linked to diabetes or hypertension, it is an ‘association’, as both attributes are nominally scored as ‘Yes’ or ‘No’. In another example, body weight and cholesterol level are both numeric, and the relationship has to be assessed through a correlation coefficient. The relationship between two quantitative variables can also be non-linear, such as curvilinear or exponential.

11.2 How Can I Test the Association Between Two Nominal Attributes?

The most commonly used test to study the association between two nominal attributes is the Pearson Chi-square statistic, which can be used for a 2 × 2 or higher classification (an m × n classification, with m categories for the first attribute and n for the other). The Statistical Package for the Social Sciences (SPSS) gives four tests under Chi-square in the Crosstabs module for testing the independence of nominal/ordinal attributes: the Pearson Chi-square, Chi-square with Yates’ continuity correction, likelihood ratio, and Fisher exact tests. Normally, there is confusion about which test should be used for drawing conclusions, particularly when different tests give contrasting results. The following points will help in taking a decision (an R sketch of these tests follows the list):

  1. If there is a 2 × 2 contingency table, i.e., two categories for each variable, the best choice is Fisher’s exact test. Compared to the Pearson Chi-square test, Fisher’s test is suited even for small samples.

  2. For a larger contingency table (for example, 2 × 3 or 3 × 4), use the Pearson Chi-square test if at least 80% of the cells have expected counts of more than 5; otherwise use the likelihood ratio Chi-square test.

  3. Prefer the Fisher exact test over Yates’ continuity correction, which is used for small sample sizes.
11.3 How Can I Assess the Strength of an Association?

The methods discussed above tell us whether or not there is a significant association between two categorical attributes: if the p-value is less than 0.05, we conclude that there is evidence of an association. However, the p-value does not tell us whether the association is ‘weak’ or ‘strong’. Cramer’s V is the most useful statistic for assessing the strength of an association when the Chi-square test is found to be significant. The phi (φ) statistic is used for assessing the strength of association in a 2 × 2 contingency table, whereas Cramer’s V is used for classifications larger than 2 × 2.
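Cramer’s V can be computed directly from the Chi-square statistic; a base-R sketch, reusing the hypothetical table tab from the previous example:

```r
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  k    <- min(dim(tab)) - 1
  as.numeric(sqrt(chi2 / (n * k)))
}

cramers_v(tab)    # for a 2x2 table this equals the absolute value of phi
```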

11.4 How Can I Study the Relationship Between Two Quantitative Attributes?

When we want to study the relationship between attributes measured on a continuous numeric scale, we cannot use the Chi-square statistic. For example, we may be interested in the relationship between the age and height of school-going children, between body weight and cholesterol level, or between maternal age and anxiety. For such attributes, we assess the relationship through the Pearson correlation coefficient, which ranges from −1 to +1: minus one (−1) is a perfect negative relationship and plus one (+1) a perfect positive relationship, while a coefficient of zero implies no linear relationship between the two attributes (it does not necessarily imply independence). One limitation of the Pearson correlation coefficient is that it assesses only the linear (straight-line) relationship between the two variables; it cannot adequately capture a non-linear relationship (curvilinear, exponential, or polynomial). Thus, it is always advisable to draw a scatter plot of the data points and visually explore the relationship before calculating the correlation coefficient. A further limitation is that the associated significance test assumes the two variables to be jointly normally distributed. The correlation coefficient is also severely affected by outliers in the data.

11.5 What Are Cohen’s Guidelines for Assessing the Strength of a Correlation?

Cohen’s 1992 guidelines [8] can be used to assess the strength of a linear relationship. If the effect size, i.e., correlation coefficient is from 0.1 to less than 0.3, it is categorized as ‘small’, 0.3 to less than 0.5 ‘medium’; and 0.5 and above ‘large’.

11.6 How Can I Study a Relationship When My Attributes Are Ordinal?

For ordinal attributes/variables such as pain score, satisfaction level, MELD score, or intelligence level, we cannot use the Pearson correlation coefficient which assumes the variables to be normally distributed. For such variables, we can either use Kendall’s Tau or Spearman’s Rank correlation. However, Spearman’s correlation coefficient is more widely used and is appropriate for ordinal as well as continuous variables. Similar to Pearson’s correlation coefficient, it ranges from −1 to +1. A typical example for a Spearman correlation coefficient could be to assess whether or not the income level is related to the educational level.

It is worth mentioning here that Pearson’s correlation coefficient is highly sensitive to outliers (extreme values) and may give unexpected results; therefore, outliers should be examined carefully before the coefficient is calculated. Spearman’s coefficient is more robust to outliers than Pearson’s.
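Both coefficients are available through cor() and cor.test() in base R; a sketch on simulated data (variable names and values are illustrative):

```r
set.seed(5)
weight      <- rnorm(100, mean = 70, sd = 12)               # body weight (kg)
cholesterol <- 150 + 0.8 * weight + rnorm(100, sd = 15)     # cholesterol (mg/dL)

plot(weight, cholesterol)                      # always inspect the scatter plot first
cor(weight, cholesterol)                       # Pearson correlation coefficient
cor.test(weight, cholesterol)                  # Pearson, with p-value and 95% CI
cor.test(weight, cholesterol, method = "spearman")   # Spearman rank correlation
```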

11.7 What Do We Mean by Agreement Between Observers, Raters, and Diagnostic Tests?

Agreement is the degree of concordance between two or more sets of assessments or measurements. For example, two or more observers, raters, or diagnostic tests assess a subject for a particular characteristic. They may or may not agree with their assessment. While making a diagnosis, a standard test may diagnose a ‘benign’ condition whereas an alternative test may diagnose ‘malignancy’. Agreement can be assessed for nominal, ordinal as well as numeric measurements.

11.8 How Can I Assess the Agreement for Binary, Nominal, and Ordinal Measurements?

Cohen’s kappa and Fleiss’ kappa coefficients are used for assessing the agreement between two (or more) tests/raters/observers. Cohen’s kappa (κ) can be used for assessing the agreement between two raters with two or more categories of observation. For example, a two-category observation can be either ‘Yes’ or ‘No’ for a particular characteristic, and a three-category observation, say regarding a diagnosis, can be ‘Benign’, ‘Doubtful’, or ‘Malignant’. Kappa is used mostly to assess how close the agreement is to 1 rather than how far it is from 0. The following table can be used for judging the strength of agreement:

Kappa        Strength of agreement
<0.3         Poor
0.3–0.5      Fair
0.5–0.7      Moderate
0.7–0.9      Good
>0.9         Excellent

Cohen’s kappa is used for studying the agreement between two raters, whereas Fleiss’ kappa is used when there are more than two raters. The Fleiss kappa, however, is a multi-rater generalization of Scott’s pi statistic.
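Cohen’s kappa for two raters can be computed from the agreement table with a few lines of base R; a sketch with hypothetical counts:

```r
# Agreement between two raters (hypothetical counts)
#                 Rater B: Yes   Rater B: No
# Rater A: Yes            40            10
# Rater A: No              5            45
tab <- matrix(c(40, 10,
                 5, 45), nrow = 2, byrow = TRUE)

n  <- sum(tab)
po <- sum(diag(tab)) / n                          # observed agreement
pe <- sum(rowSums(tab) * colSums(tab)) / n^2      # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
kappa                                             # 0.70 for these counts
```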

11.9 How Can I Assess Agreement When the Measurements Are Numeric?

Cohen’s and Fleiss’ kappa are used when the measurements are either ordinal or categorical/nominal. However, when the measurements are numeric, these methods cannot be used. For example, we may be interested in assessing the agreement between two different methods which measure the haemoglobin levels of a number of patients. For such numeric measurements, two methods are available for assessing agreement—the Intra-class correlation coefficient and the Bland–Altman plot. The intra-class correlation coefficient lies between 0 and 1. Zero indicates no agreement, whereas one indicates perfect agreement. The Bland–Altman plot provides a graphical display of agreement between the two methods or techniques with 95% limits of agreement.
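A Bland–Altman plot can be drawn in a few lines of base R; a sketch using simulated haemoglobin measurements from two hypothetical methods:

```r
set.seed(9)
method1 <- rnorm(60, mean = 12, sd = 1.5)               # haemoglobin, method 1 (g/dL)
method2 <- method1 + rnorm(60, mean = 0.2, sd = 0.5)    # haemoglobin, method 2 (g/dL)

avg   <- (method1 + method2) / 2
diffs <- method1 - method2
bias  <- mean(diffs)
loa   <- bias + c(-1.96, 1.96) * sd(diffs)              # 95% limits of agreement

plot(avg, diffs, xlab = "Mean of the two methods",
     ylab = "Difference (method 1 - method 2)")
abline(h = bias)                                        # mean difference (bias)
abline(h = loa, lty = 2)                                # limits of agreement
```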

11.10 What Are Regression Methods?

Regression methods are extensively used in medical research for predicting an outcome based on one or more independent variables, commonly known as predictors or regressors; the outcome variable is known as the dependent variable. For example, we may use a regression method for predicting the birth weight of full-term babies with the weights of their fathers and mothers as predictors. A regression model can be written as follows:

$$ \mathrm{Simple\ regression\ model:}\ \ Y = \beta_0 + \beta_1 X_1 $$
$$ \mathrm{Multiple\ regression\ model:}\ \ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n $$

where Y is the dependent variable, X1, X2, X3, … are the independent variables, and β0, β1, β2, β3, … are the regression coefficients. The sign and magnitude of a regression coefficient indicate the role of the corresponding variable in predicting the outcome.

Regression methods are generally used for making predictions as well as for understanding the relative contribution or role of various factors in predicting the response or outcome. If the regression model involves one dependent and only one independent variable, it is called simple regression; with one dependent and two or more independent variables, multiple regression. If the number of dependent variables is greater than one, the analysis falls in the domain of multivariate regression.
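A minimal multiple regression sketch in R using lm(), with simulated data for the birth-weight example (the coefficients and variable names are illustrative):

```r
set.seed(3)
n <- 150
mother_wt <- rnorm(n, 60, 8)           # mother's weight (kg)
father_wt <- rnorm(n, 75, 10)          # father's weight (kg)
birth_wt  <- 1.2 + 0.025 * mother_wt + 0.012 * father_wt + rnorm(n, sd = 0.35)

fit <- lm(birth_wt ~ mother_wt + father_wt)
summary(fit)                           # coefficients, p-values, and R-squared

# Predicted birth weight for a new pair of parental weights
predict(fit, newdata = data.frame(mother_wt = 65, father_wt = 80))
```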

11.11 How Can I Test the Performance of the Regression Model?

When we develop a regression model, we want to be sure that the model accurately predicts the outcome. We judge the performance or goodness of fit of a regression model on the basis of the squared multiple correlation coefficient (R², the coefficient of determination), which ranges from 0 to 1. A model is considered good when it contains a small number of regressors (independent variables) but still gives a sufficiently large value of R². When the number of predictors is large, algorithms available in statistical packages help in selecting the statistically significant variables; forward selection, backward elimination, and stepwise selection are the common ones. Most studies in medicine use linear regression models, but sometimes the nature of the relationship is such that linear models fail to explain the variation in the dependent variable and give a low value of R². In that case, curvilinear and non-linear regression models can be tried, and these are being increasingly used in the medical field. However, care should be taken that the regression model includes the relevant variables before resorting to non-linear models.

11.12 How Can I Assess the Relative Importance of Predictors in a Regression Model?

A multiple regression model may include a number of predictors and one may be interested to know the relative importance of these in predicting the outcome. For example, one may like to know the relative importance of these variables which have been measured in different units. Bodyweight is measured in kilograms or grams whereas body height in centimetres or metres. When our aim is to assess the relative importance of such independent variables, standardized coefficients should be used. Comparing unstandardized coefficients of independent variables that have different units of measurement is rather like comparing apples and oranges. Therefore, most statistical packages provide a table of regression results which includes unstandardized regression coefficients as well as standardized regression coefficients (Beta coefficients).

It is worth mentioning here that interpreting standardized coefficients for individual variables is cumbersome compared to unstandardized ones. The unstandardized coefficient of any predictor indicates the amount of change in the response for a unit change in that factor, holding other predictors constant. However, standardized coefficients (Beta) are interpreted as the standard deviation change in the response variable when the predictor is changed by one standard deviation, holding all other predictors constant.
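One simple way to obtain standardized (beta) coefficients in R is to standardize the variables with scale() before fitting; a sketch continuing the birth-weight example above:

```r
# Standardized (beta) coefficients: refit the model on z-scored variables
fit_std <- lm(scale(birth_wt) ~ scale(mother_wt) + scale(father_wt))
coef(fit_std)    # comparable across predictors measured in different units
```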

11.13 What Is a Serious Problem in Regression Analysis?

Multicollinearity is a problem in regression analysis that occurs when the predictor variables, which are assumed to be independent, are correlated among themselves. For example, when we want to predict body fat with the help of skinfold thickness at the triceps, mid-arm, and thigh, the three predictor variables are highly correlated among themselves. Such highly correlated variables inflate the standard errors of the regression coefficients, leading to unreliable and unstable estimates. Inflated standard errors, in turn, mean that the coefficients of some independent variables may be found statistically insignificant when, in reality, they are important. In the presence of multicollinearity, adding or deleting a predictor variable can change the regression coefficients dramatically. However, it should be noted that if our aim is only to make predictions, multicollinearity is not a serious concern.

11.14 How to Detect and Handle Multicollinearity?

The simplest way to detect multicollinearity is to calculate the pairwise correlation coefficients of all the predictor variables; pairs of predictors with very high correlation coefficients are likely to result in multicollinearity. Remove one of the correlated predictors and fit the regression model with the remaining predictors. After doing so, you may notice that some predictors which were earlier not significant have now become significant.

The more reliable method to detect collinear predictors is to examine the variance inflation factor (VIF) of each predictor. SPSS gives an option to calculate the VIF of each predictor; if it is greater than or equal to 5, the predictor is considered collinear and needs attention. There are several ways in the literature to handle multicollinearity; some of them are listed below (a VIF check in R is sketched after the list):

  • Increase the sample size; a larger sample may reduce collinearity that has arisen by chance.

  • Use stepwise regression analysis to choose the best subset, i.e., the one with the highest R-squared value.

  • Remove predictors with high variance inflation factors (VIF) and fit the regression model with the remaining predictors.

  • Modify the existing predictors using information from prior research. For example, in a regression model that uses the height and weight of subjects, which happen to be highly associated, we may combine the two into a single predictor such as BMI.
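VIFs are commonly obtained in R with the car package (one of several packages offering a vif() function); a sketch continuing the birth-weight regression above:

```r
# install.packages("car")    # if not already installed
library(car)

fit <- lm(birth_wt ~ mother_wt + father_wt)
vif(fit)    # values >= 5 (some authors use 10) suggest problematic collinearity
```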

11.15 What Is Logistic Regression?

Logistic regression analysis has been extensively applied in medical research for predicting a categorical outcome, for example predicting the outcome (survival/mortality) following GI surgery on the basis of pre-, intra-, and post-operative variables. Logistic regression studies the relationship between the outcome (dependent variable), which is a categorical/ordinal variable, and independent predictors, which may be a mixture of quantitative, categorical, and ordinal variables. The dependent variable of a binary logistic regression has two categories (yes or no, benign or malignant cancer, recurrence or non-recurrence of cancer, survival or mortality), whereas multinomial logistic regression has more than two (polytomous).

$$ \mathrm{Simple\ logistic\ regression\ equation:}\ \ \ln\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 $$
$$ \mathrm{Multiple\ logistic\ regression\ equation:}\ \ \ln\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n $$

The probability of a positive response is then

$$ \pi = \frac{1}{1+\exp\!\left[-\left(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n\right)\right]} $$

where ln(π/(1 − π)) is the logit of the probability π and lies between −∞ and +∞, π/(1 − π) represents the odds of a positive response in the subjects, and the βi are the logistic regression coefficients.

Logistic coefficients are similar to regression coefficients. Most software packages also give exponentiated coefficients along with the regression coefficients: the exponential of a logistic coefficient is the odds ratio, more specifically an adjusted odds ratio. If the odds ratio for a variable is statistically significantly greater than one, the variable is considered a risk factor; if it is less than one, a protective or prognostic factor; and if it equals one, the variable plays no role.
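A minimal binary logistic regression sketch in R using glm(), on simulated data (the predictors age and smoker are hypothetical):

```r
set.seed(17)
n <- 300
age     <- rnorm(n, 55, 10)
smoker  <- rbinom(n, 1, 0.35)
outcome <- rbinom(n, 1, plogis(-6 + 0.08 * age + 0.9 * smoker))   # 1 = event, 0 = no event

fit <- glm(outcome ~ age + smoker, family = binomial)
summary(fit)                          # logistic (logit-scale) coefficients

exp(coef(fit))                        # adjusted odds ratios
exp(confint.default(fit))             # Wald 95% confidence intervals on the OR scale
```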

11.16 How to Test the Adequacy of the Logistic Model?

A number of methods, such as pseudo R², generalized R², the likelihood ratio, and Wald statistics, are available to test the adequacy or goodness of fit of the logistic model. The log-likelihood statistic (−2lnL) is commonly used and has better appeal [1]: if the model is a perfect fit, −2lnL equals 0, and larger values indicate a poorer fit. The Nagelkerke pseudo R² resembles the ordinary R²; it ranges from 0 to 1, and the best-fitting model has a value of 1. Some software, such as SPSS, provides the Hosmer–Lemeshow test, which tests the null hypothesis that the model is an adequate fit. If p is found to be <0.05, there is evidence against the null hypothesis, i.e., the model shows a lack of fit.

12 Summing Up

We have discussed the essentials of most of the commonly used biostatistical tools in the form of questions and answers, and provided an overview of these tools with cautionary notes so that readers can be vigilant in applying them. Choosing an inappropriate tool may sometimes lead to severe consequences. We have also apprised readers of the most important but controversial inferential tool, the p-value, which is frequently misunderstood and misinterpreted.

Biostatistics includes an ocean of statistical tools and techniques; here we have discussed those most commonly used in day-to-day research. There are further techniques, especially suited to advanced medical researchers, that simultaneously consider several variables, such as multivariate multiple regression, path analysis, multivariate analysis of variance (MANOVA), discriminant functions, and factor and cluster analysis. Techniques are also available to study similarity and dissimilarity among DNA sequences and to better understand genetic variation at the DNA or protein level. It is difficult to cover all aspects of biostatistics in a single chapter; we hope its contents will enable students to apply appropriate statistical tools while pursuing their research protocols.