# Top ten errors of statistical analysis in observational studies for cancer research

## Abstract

Observational studies using registry data make it possible to compile quality information and can surpass clinical trials in some contexts. However, data heterogeneity, analytical complexity, and the diversity of aspects to be taken into account when interpreting results make it easy for mistakes to be made and call for mastery of statistical methodology. Some questionable research practices, including poor analytical data management, are responsible for the low reproducibility of some results; yet there is a paucity of information in the literature regarding the specific statistical pitfalls of cancer studies. This article seeks to expose ten common problematic situations in the analysis of cancer registries and to propose how to avoid or solve them: convenience analyses, dichotomization, stratification, regression to the mean, impact of sample size, competing risks, immortal time and survivor bias, management of missing values, and data dredging.

## Keywords

Cancer research · Error · Observational studies · Pitfalls · Registry · Statistical analysis

## Introduction

Registry-based multicenter studies make it possible to compile quality information with a degree of precision capable of rivaling that of clinical trials in certain specific situations [1]. Furthermore, there are scenarios in which it is not ethical to conduct interventional trials; consequently, decision making depends on the evidence obtained from epidemiological studies. As a result, a growing number of cancer studies, including subanalyses and translational analyses, are observational (17% according to clinicaltrials.gov). However, data heterogeneity, analytical complexity, as well as the diversity of aspects to be interpreted make it easy to make mistakes and call for mastery of statistical methodology. According to Ioannidis, a substantial part of the findings in the pharmaco-epidemiological literature is spurious: distortions of reality that reflect the prevailing cognitive bias in each field [2]. Nevertheless, to have enough sensitivity to distinguish the signals from the noise, scientific progress needs a low level of false discoveries, perhaps some 14% [3]. Of even greater concern is the use of questionable research practices in certain fields [4]. In one survey, 70% of the investigators confessed their failure to replicate third-party experiments; half of the time, they could not even replicate their own. One of the causes of irreproducibility is poor statistical design [5]. The literature contains any number of examples of biased analyses [6], the consequences of which include wasted time and resources, futile clinical trials, and the halting of active drug development.

The problem could be mitigated by implementing a policy of strict, peer-reviewed statistical design. In a review of the editorial process of the journal Lancet, only half of the articles eligible for publication after peer-review were deemed acceptable following statistical review [7]. However, another survey revealed that even in core journals, the probability of a formal methodological review is low [8]. Although attention to statistical quality has improved over time, these problems are still far from being resolved [9]. Furthermore, Wicherts et al. have pointed out that accessibility of raw data for reanalysis and critical interpretation is poor [10], which can be attributed to a lack of confidence on the authors’ part as to the robustness of their analyses. Despite the change in attitude toward data sharing, this policy is still far from becoming routine [11].

Against this background, our work seeks to expose the most common problematic situations we have encountered in working with observational cancer studies, as well as to create an awareness of their existence, in the form of a brief reference guide to improve the methodological quality of research projects. The aim has not been so much to provide a tutorial on the theory and practical handling of these techniques, which has been carried out previously by multiple authors in excellent reference texts [12, 13, 14], but to alert readers to the frequency and impact of errors and biased results regularly presented at congresses and in journals. Nothing could be further from our intention than to contribute to the myth of the inaccessibility of statistics; instead, we aspire to warn of the risks naive and heuristic approaches entail, as well as the need to resort to professional statistical advice if necessary.

## Method

Ranking of the ten most common, academically interesting, difficult to perceive, or potentially harmful mistakes in observational studies of cancer

Issue | Cause | Impact | Recommendation |
---|---|---|---|
#1, Not even wrong (because you shouldn’t have even tried it in the first place) | Superfluous, unjustified, not pre-specified, and not guided by any hypothesis | The models are not suited to the intended objectives; difficulty in generalizing results; waste of researchers’ time; potential risk to patients | Being able to generate a model does not mean that you should apply it. Abstain from performing unsubstantiated and unforeseen analyses. Explain to readers that the findings were incidental, and discuss their implications |
#2, Dichotomizing continuous variables is generally a bad idea | Dichotomization of continuous variables | Loss of information, reduction of statistical power and of correlations, decreased generalizability, serious mismatches, increased type I error, loss of non-linear relations, and problems interpreting extremes | Do not dichotomize continuous variables except: (1) analysis of extreme cases; (2) studies to verify the effect of already dichotomized variables; (3) when the underlying variable is truly categorical and the measurement observed has high reliability. Never derive an “optimal” cutoff and then evaluate the effect of the dichotomization on the same database |
#3, Simpson’s paradox | When conducting a pooled analysis of stratified data, the trend within each stratum may not coincide with the trend of the overall analysis. Beyond the arithmetic problem, the fundamental cause is the existence of uncontrolled confounding factors that affect the exposure and the study outcome differently in each stratum | A trend that appears in several groups of data may disappear or be inverted when these strata are aggregated in a pooled analysis; incorrect interpretation of the effect of an intervention | Conduct stratified analyses (Cochran–Mantel–Haenszel test, conditional logistic regression, etc.). Evaluate the heterogeneity of your data and the consistency of the measures between groups (test of homogeneity of the odds ratios, I² statistic, etc.) |
#4, Ignoring the regression toward the mean | Tendency of variables with extreme values the first time they are measured to approach the center of the distribution in subsequent measurements. This happens because the data have random errors, with oscillations around their mean | Unsubstantiated inferences about the effect of interventions; difficulty in distinguishing the effect of an intervention from normal variability | Randomization; use of negative controls. Selection of patients using the mean of several repeated measures. Use of the analysis of covariance (ANCOVA) to estimate the effect of interventions on a biomarker |
#5, Not accounting for how sample size affects variation | The extreme values of a variable are more likely in small than in large series. The dispersion of the distribution of sample means is greater in small series, causing estimations to be more inaccurate | False inferences in epidemiology, healthcare management, and observational studies that compare results across cohorts of different sample sizes; inadequate attribution of healthcare resources; incorrect interpretation of the uncertainty of clinical studies | Take sample size into account in the variability of the sample by means of the standard error equation: \( \sigma_{\bar{x}} = \sigma / \sqrt{n} \) |
#6, Competing risk bias | A competing event is one that prevents us from observing the primary event of interest | If these cases are censored in a classical survival analysis, the incidence of the principal event is overestimated, and biased or incorrect measurements of the predictors’ effects are obtained | Use the cumulative incidence function (CIF) instead of the Kaplan–Meier estimator, especially when competing events account for > 10% of the sample. In etiological studies, use the cause-specific hazard (CSH) function to model the hazard ratio. Fine and Gray’s competing risk regression (CRR) is a suitable method to analyze prognostic factors |
#7, Immortal time bias | Time-dependent exposures occur after the subject has initiated follow-up. The problem arises if these exposures are analyzed as if they were baseline variables. This entails the appearance of a period of immortal time, from the beginning of follow-up until exposure, during which the outcome never takes place | Observation of an apparent, but biased, survival benefit in favor of the incorrectly analyzed variable. One variant is the so-called time-window bias | Identify all the time-dependent variables and treat them as such in the survival analysis |
#8, Survivor bias | Some individuals die more quickly than others; consequently, not all have the same probability of being admitted to a study or of being asked about their exposures and risk factors (lesser tendency to participate due to early demise, or non-random dropouts over time) | Spurious results, such as concluding that an ineffective treatment is associated with increased survival | Suspect it and control it in the design: is it probable that there are individuals who, carrying an important part of the information, have fewer possibilities of being recruited? Compile demographic data and outcomes of excluded subjects, and be careful with eligibility criteria in case they establish a minimum survival. In the analytical phase, stratify the analyses on the basis of follow-up (the longer the follow-up, the greater the mortality). Repeat the analysis stratified by time, with negative controls. Depending on the case, use techniques to match individuals based on their propensity to be cases or controls (propensity score matching), or analyze interventions as time-dependent variables |
#9, Missing values | All databases contain missing values. Completely random loss is a strong premise that is rarely justified. By default, statistical software packages discard all the information from a participant with a single missing datum. The missingness of a variable can be associated with other observed variables, or with the unobserved information itself, so its effect cannot be ignored | Depending on the cause, missing data entail inefficiency or bias | Prevent and control the loss of data in the design. Document the cause of missingness to be able to make inferences about it. It is not wise to assume that, if a datum is not collected in a clinical history, the patient does not have it. Analyze the pattern of missing data, its cause, its relation with other variables, and its association with the study outcome. Imputing tends to be better than eliminating participants with missing data; multiple imputation is the most suitable method. Pre-specify the method of imputation in the study protocol |
#10, The infamous data dredging | The multiplicity of statistical tests can be inevitable, given the need to obtain the maximum information by means of multiple subanalyses and translational studies. The main risk of multiplicity is inflating the type I error. It is difficult to know whether an analysis was pre-specified in the protocol or not | Risk to the validity of the conclusions; false discoveries; extreme difficulty understanding the meaning of the analyses | Limit the number of analyses, with straightforward analytical designs that suit the data. Make corrections for multiple comparisons: control the family-wise error rate (FWER) by means of the Bonferroni or Holm–Bonferroni corrections, or adjust the false discovery rate (FDR) using the Benjamini–Hochberg or other methods |

## Results and discussion

### Just because it is possible in no way means that it should be done (not even wrong)

Conducting unjustified analyses that were not previously specified in the study protocol can cause a tremendous waste of time and resources [15, 16, 17, 18]. Thus, when large databases with diffuse objectives and many items are being managed, it is possible to fall into the temptation of using all that flow of information to predict endpoints that were never pre-specified, the “polar opposite” of what was originally intended.

At a time when a great many publications are being generated and when it is, paradoxically, more and more expensive to publish, it is striking that there are more than 1000 prognostic models for prostate cancer predicting all kinds of endpoints [19]. Satirically, Vickers and Cronin created a nomogram that predicts age, with good calibration and an area under the ROC curve (AUC) of 0.780 [19]. This just goes to show that ‘just because you can create a predictive model, it does not mean that you should’.

Another example is the analysis of a study on neutropenia that resulted in a thrombosis risk score [20]. Given that the prediction of thromboembolic disease was not the primary endpoint, it was not actively searched for, the incidence was lower than expected, and an incorrect attribution of categories cannot be ruled out. Even so, a score to predict thrombotic risk was derived and internally validated which, for lack of foresight, did not include all the candidate variables. Two variants of the model were later derived as secondary products of other studies [18]. Recently, as might be expected, these scores have been shown to have scant discriminatory capacity for thrombotic risk, with AUCs between 0.500 and 0.570 [18, 21]. However, before rigorous validations were available, they were already being used as a selection criterion in clinical trials of primary thromboprophylaxis, with the presumable waste of resources on far-from-conclusive studies.

### Dichotomizing continuous variables is generally a bad idea

Continuous variables are often dichotomized in observational studies, a practice with untoward consequences of which some investigators are unaware [22, 23]. These repercussions include reduced statistical power, equivalent to losing approximately one-third of the sample [23], loss of correlation between variables, and a decreased capacity to generalize results [24].

Consider an illustrative example (data in the table below): an investigator dichotomized both hemoglobin (cutoff 9 g/dL) and the fatigue score, and found no significant association between them (*P* value = 0.30). It was then suggested to the author that he not dichotomize, but he heeded the suggestion only in the case of the fatigue score. In that analysis, the mean fatigue score was 81 in subjects with normal hemoglobin and 65 in individuals with anemia (*P* value = 0.03), modest evidence against the null hypothesis. He was finally begged to use the original values of both variables. A positive regression was then detected (*P* value = 0.000117), strong evidence in favor of lower hemoglobin being accompanied by more fatigue (Fig. 1).

Relation between hemoglobin levels (g/dL) and fatigue scores

Hemoglobin level | Total | Fatigue score, high | % Fatigue, high | *P* value |
---|---|---|---|---|
≥ 9 g/dL | 8 | 4 | 50 | 0.60 |
< 9 g/dL | 8 | 6 | 75 | |

Another problem is the mismatch that occurs when, due to a small measurement error, the dichotomized variable changes category. The decrease in variability increases the type I error [25], since part of the association between a dependent variable *Y* and a predictor *X* can be due to the variability of the latter; if we eliminate that variability, there is an increased likelihood that a third variable spuriously accounts for *Y*. At other times, non-linear relations that may be important are disregarded. Categorization also generates problems of interpretation and attenuates the apparent impact of the variables. A particularly flawed practice is to use the data to find the “optimal” cutoff, since it gives rise to overfitting, can displace other predictors that are indeed important, and inflates the type I error [26].

The recommendation not to dichotomize has few exceptions. Nonetheless, one such exception applies when the underlying variable is truly categorical in nature and the measure observed is highly reliable, making mismatches exceedingly uncommon [27].
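The cost of a median split can be illustrated with a short simulation. The sketch below (Python with NumPy; the variable names and effect size are invented for illustration, not taken from any cited study) generates a continuous predictor linearly related to an outcome and shows that dichotomizing the predictor attenuates the observed correlation, in line with the classic result that a median split retains only about 80% of the correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Latent continuous predictor and outcome with a true linear relation
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

r_continuous = np.corrcoef(x, y)[0, 1]

# Median split discards all within-group variation in x
x_split = (x > np.median(x)).astype(float)
r_split = np.corrcoef(x_split, y)[0, 1]

print(f"r with continuous x:  {r_continuous:.3f}")
print(f"r after median split: {r_split:.3f}")   # roughly 80% of the continuous r
```

The attenuation translates directly into lost statistical power: detecting the weakened association requires a substantially larger sample.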

### The sum of local effects is not equivalent to the general effect

When performing observational studies, all confounding factors must be identified. If not controlled for, the trend within each stratum of a pooled, stratified data analysis may not coincide with that of the overall analysis. This phenomenon is called Simpson’s paradox.

Frequency of thromboembolic events at two hospitals based on use of low-molecular-weight heparin (LMWH) (simulated data)

Center | Total (no LMWH) | VTE | % VTE | Total (LMWH) | VTE | % VTE | RR |
---|---|---|---|---|---|---|---|
Hospital 1 | 1015 | 15 | 1.4 | 605 | 5 | 0.8 | 0.57 |
Hospital 2 | 225 | 25 | 11 | 1495 | 95 | 6 | 0.54 |
Total | 1240 | 40 | 3.2 | 2100 | 100 | 4.8 | 1.50 |

In addition to the arithmetic problem, the fundamental cause is that there is an uncontrolled confounding factor, which is the subjects’ different thrombotic risk and exposure to thromboprophylaxis at both centers. The Cochran–Mantel–Haenszel method must be applied to analyze these stratified categorical data, and in the event that other confounding covariates are to be controlled for, a logistic regression must be performed. This example illustrates why it is important to analyze heterogeneity and for measures to be consistent (e.g., test of homogeneity of odds ratios) [28].
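The reversal can be made explicit by recomputing the relative risks from the raw counts in the table. A minimal Python sketch (values recomputed from the counts may differ slightly from the rounded figures shown in the table):

```python
# 2x2 data per hospital: (VTE with LMWH, total LMWH, VTE without LMWH, total no LMWH)
strata = {
    "Hospital 1": (5, 605, 15, 1015),
    "Hospital 2": (95, 1495, 25, 225),
}

def rr(a, n1, c, n0):
    """Relative risk of VTE, LMWH vs no LMWH."""
    return (a / n1) / (c / n0)

for name, counts in strata.items():
    print(f"{name}: RR = {rr(*counts):.2f}")       # both strata: RR < 1

# Naive pooled analysis collapses the strata and reverses the direction
a, n1, c, n0 = (sum(s[i] for s in strata.values()) for i in range(4))
print(f"Pooled:     RR = {rr(a, n1, c, n0):.2f}")  # RR > 1: apparent harm

# Mantel-Haenszel combined RR keeps the stratification
num = sum(a_ * n0_ / (n1_ + n0_) for a_, n1_, c_, n0_ in strata.values())
den = sum(c_ * n1_ / (n1_ + n0_) for a_, n1_, c_, n0_ in strata.values())
print(f"Mantel-Haenszel RR = {num / den:.2f}")     # < 1, consistent with the strata
```

The Mantel–Haenszel estimate agrees with the stratum-specific relative risks, whereas the naive pooled estimate points in the opposite direction.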

### Ignoring the regression toward the mean

Ignoring the potential negative effects of the regression toward the mean (RTM) leads to unsubstantiated inferences regarding the effect of interventions [29, 30]. RTM is the tendency of a variable with an extreme value the first time it is measured to approach the center of the distribution in a second, subsequent measurement. This happens basically because all the data have random error: when we repeat the measurement in the same subject several times, it will adopt normally distributed values, with small fluctuations around the mean. Issues arise when we select patients for admission into a study based on a variable’s extreme value [e.g., a high prostate-specific antigen (PSA) value, a quality-of-life score, a high symptom burden on a scale, low hemoglobin, high cholesterol, a low z-score on densitometry] and later repeat the measurement. In the following measurement, the selected individuals will tend to exhibit values closer to the mean [31, 32]. The problem is attributing the improvement solely to the benefit of the intervention; the difficulty lies in distinguishing true changes due to the intervention from normal variation. All this causes a clear dilemma with surrogate biomarkers of response to therapies [33]. More than a few observational studies claim an impact of certain interventions on the levels of a given biomarker, interpreting it as an argument in favor of a biological effect. For example, Hamilton et al. observed that the use of statins was associated with lowered levels of PSA [34]. However, interpreting a value that is initially elevated and then declines is complex. In fact, a series has been reported in which prostate cancer was diagnosed on the basis of a stochastic elevation of PSA that tended to decrease spontaneously without any kind of treatment [35].

An interesting variant of RTM occurs when extreme responses to a first intervention are chosen to select patients for a second intervention [36]. This can give the false impression that the second therapy rescues individuals who were refractory to the first treatment.

Randomization, or the use of negative controls in observational studies, is deemed the most efficacious measure to control RTM. However, there are examples of randomized, placebo-controlled studies in which the parameters improve in all the treatment groups because of RTM [37].

Another option would be to select patients using the mean of several repeated measures. In the analytical phase, one recommended method is the analysis of covariance (ANCOVA), which makes it possible to estimate the effect of categorical variables (e.g., interventions) on a continuous dependent variable, controlling for the effect of the imbalance of baseline levels of the latter.
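A short simulation makes RTM tangible. In the sketch below (Python/NumPy; the biomarker scale and selection threshold are invented for illustration), two noisy measurements are taken of the same stable underlying level; recruiting only subjects with extreme first measurements makes their second measurement drift back toward the population mean with no intervention whatsoever:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

true = rng.normal(100, 10, n)          # stable underlying level (e.g., a biomarker)
m1 = true + rng.normal(0, 10, n)       # first measurement = truth + random error
m2 = true + rng.normal(0, 10, n)       # second measurement, no intervention at all

selected = m1 > np.quantile(m1, 0.90)  # recruit only subjects with extreme values

print(f"mean m1 (selected): {m1[selected].mean():.1f}")  # well above 100
print(f"mean m2 (selected): {m2[selected].mean():.1f}")  # drifts back toward 100
```

An uncontrolled before-after comparison in the selected group would credit this spontaneous drift to whatever intervention happened in between.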

### Not factoring in how sample size affects variation

We have a certain intuitive idea of how a small sample size entails a degree of uncertainty. For example, given a drug with a response rate of 25%, the probability of no patient responding in a cohort of ten would be 5.6%; rare, but plausible. In contrast, if the study had 20 more individuals (i.e., a size of 30), the probability of no response would fall to 0.02%, highly improbable. If these experiments involving large and small groups were repeated thousands of times, the dispersion of the distribution of sample means would be greater in the small series, and estimations would be less accurate in those groups. As a result, variables tend to adopt more extreme values in small groups. For instance, one epidemiological study found that the lowest incidences of renal cancer occurred in small rural towns, which was attributed to their healthier environment. However, the highest incidences were soon seen to occur in small rural towns as well. Obviously, both conclusions were incompatible: both simply reflected the greater variability of small samples [38].
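The probabilities quoted above follow directly from the binomial distribution, since the probability of observing zero responders is P(X = 0) = (1 − p)^n. A minimal check:

```python
p = 0.25                                    # true response rate
for n in (10, 30):
    p_zero = (1 - p) ** n                   # binomial: P(no responders) = (1 - p)^n
    print(f"n = {n:2d}: P(no responders) = {p_zero:.4%}")
# n = 10 gives about 5.6%; n = 30 gives about 0.02%, matching the text
```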

Suppose we want to infer the mean of a variable in the population, *μ*. To quantify the precision of that inference, the dispersion of this variable in the population (the ‘standard deviation of the population’, or *σ*) has to be ascertained. From it we obtain the ‘standard deviation of the sample means’ (or \( \sigma_{\bar{x}} \)), i.e., the dispersion that would be observed if millions of “n-sized” samples were extracted from that population, according to a formula that is key in all statistical inference, used to calculate both *P* values and the confidence intervals of tests [38]:

\( \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \)

The problem is that *σ* is practically never known. We can only access the ‘standard deviation’ of the sample we have studied, which we usually represent by “SD” or “*S*”. Our alternative then is to estimate *σ* with the sample *S*, in what has come to be known as the most dangerous equation in the world [38]:

\( SE_{\bar{x}} = \frac{S}{\sqrt{n}} \)

Since *σ*, which is a population parameter, is replaced by a value *S* that fluctuates randomly from one sample to another around *σ*, in half of the samples *S* will be greater than *σ*, but in the other half *S* will be less than *σ*; in the latter cases, a narrower interval will be obtained, which will have less confidence of containing the mean *μ*. This decrease is compensated for by replacing the value \( z_{\alpha/2} \) of the Normal distribution by a slightly higher value \( t_{n-1;\,\alpha/2} \), given by the Student–Fisher law depending on the number of degrees of freedom (df = *n* − 1) of the variance *S*², so that the 1 − α confidence interval of a mean *μ* is estimated from the formula:

\( \bar{x} \pm t_{n-1;\,\alpha/2} \cdot \frac{S}{\sqrt{n}} \)

If the estimation’s accuracy is a function of sample size, a difficulty arises when attempting to compare estimations obtained from series with different numbers of individuals, which is common in fields such as epidemiology or healthcare management. However, it can affect any question whose answer requires knowing the effect of sample size on the variability of the data [38].
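The compensation described above can be verified numerically. The sketch below (Python with SciPy; the sample mean and SD are arbitrary illustrative values) compares the half-widths of 95% confidence intervals built with *z* versus *t*, showing that the penalty is large for small samples and vanishes as *n* grows:

```python
from math import sqrt
from scipy import stats

xbar, s = 100.0, 12.0                      # sample mean and SD (illustrative)
z = stats.norm.ppf(0.975)                  # 1.96 for a 95% interval
for n in (5, 30, 300):
    se = s / sqrt(n)                       # SE estimated with S instead of sigma
    t = stats.t.ppf(0.975, df=n - 1)       # slightly > z, compensating for S ~ sigma
    print(f"n = {n:3d}: z half-width = {z * se:5.2f}, t half-width = {t * se:5.2f}")
```

At n = 5 the *t*-based interval is over 40% wider than the *z*-based one; at n = 300 the two are practically indistinguishable.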

### Competing risks

The competing risk bias is still present in many articles published in core journals, with 50% of researchers ignoring the issue [39]. This often happens in analyses of large registries with a long follow-up, in which patients die due to causes other than the study variable. One example would be estimating the risk of colon cancer relapse in elderly patients who underwent surgery and who generally die due to old age [40].

In oncology, we use the Kaplan–Meier method to estimate survival functions and Cox regression to estimate the effect of covariates on the hazard of an event occurring. One key concept is censoring patients who have not experienced the event by the end of the observation period. Another is non-informative censoring: the probability of the event is assumed to be similar for censored patients and for those who remain in the study. However, it is not uncommon for the risk of death due to another cause (a competing event) to surpass and prevent the primary event of interest from happening. Some authors censor those cases, and here is the problem: the Kaplan–Meier estimator assumes that the deceased patient could have continued to develop events, although we cannot observe them! [41] For example, in a study of thromboembolic disease, if a subject dies without thrombosis, the investigator could censor that case, but doing so violates the precept of non-informative censoring [42, 43, 44, 45]. Consequently, Kaplan–Meier curves overestimate the probability of the event, particularly when the competing events are not independent [46]. In fact, the estimations can be substantially biased when competing events exceed 10% of the sample [46]. Moreover, we can reach incorrect conclusions about the effects of variables or interventions. For instance, an apparent protective relationship between smoking and melanoma has been observed in some large cohort and case–control studies [47]. One plausible explanation is the competing risk bias, related to mortality from other smoking-related diseases.

In the presence of competing risks, other estimators (e.g., Aalen–Johansen) are needed to calculate the cumulative incidence function (CIF). Moreover, the simple relationship between the cumulative occurrence of the event of interest (CIF) and the hazard ratio is no longer maintained. Fine and Gray adapted the proportional hazards model, with their competing risk regression (CRR), to be applicable in this setting [40].
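The overestimation can be demonstrated with a simulation. The sketch below (Python/NumPy; the exponential event rates are arbitrary) hand-rolls both estimators for simulated data with a competing event: the naive 1 − Kaplan–Meier curve, which censors competing deaths, is compared with the Aalen–Johansen cumulative incidence function:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

t1 = rng.exponential(10.0, n)       # latent time to the event of interest
t2 = rng.exponential(10.0, n)       # latent time to the competing event (death)
time = np.minimum(t1, t2)
cause = np.where(t1 <= t2, 1, 2)    # 1 = event of interest, 2 = competing death

order = np.argsort(time)
time, cause = time[order], cause[order]
at_risk = np.arange(n, 0, -1)       # size of the risk set at each ordered time

# Naive approach: censor competing deaths and report 1 - Kaplan-Meier
one_minus_km = 1 - np.cumprod(1 - (cause == 1) / at_risk)

# Aalen-Johansen CIF: cause-1 hazard weighted by all-cause event-free survival
surv_all = np.cumprod(1 - 1 / at_risk)           # every subject has some event here
surv_lag = np.concatenate(([1.0], surv_all[:-1]))
cif = np.cumsum(surv_lag * (cause == 1) / at_risk)

k = np.searchsorted(time, 20.0)                  # index of the horizon t = 20
print(f"1 - KM at t = 20: {one_minus_km[k - 1]:.3f}")  # markedly overestimates
print(f"CIF at t = 20:    {cif[k - 1]:.3f}")           # true value is ~0.49 here
```

With these rates, the true cumulative incidence at t = 20 is 0.5·(1 − e⁻⁴) ≈ 0.49, while the naive 1 − KM estimate approaches 1 − e⁻² ≈ 0.86: the deceased patients are treated as if they could still develop the event.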

### Making individuals immortal

The immortal time bias, the fallacy that refuses to die, is present in hundreds of publications, strongly suggesting that most authors are unaware of this problem [48]. The immortal time bias is common in registries with prolonged follow-up that evaluate the effect of different treatments, attributing spurious advantages to treated individuals [49, 50]. According to some estimations, 8% of biomedical articles could contain this type of bias, which would affect the basic conclusions in up to 5% [6].

In time-to-event analyses, some exposures take place after the subject has initiated follow-up in a study; these are known as time-dependent variables (e.g., tumor response). The bias is introduced when these exposures are mistakenly analyzed as if they were fixed variables present at the outset. This commonplace error leads to the appearance of a period of immortal time, from the beginning of follow-up until the exposure, during which the outcome could not occur. Clearly, this does not mean that the subject is truly immortal; rather, to be classified as exposed to a time-dependent factor, the subject must necessarily have remained event-free until the exposure. This always introduces a biased survival advantage in favor of the variable being analyzed, generally with an impressive effect [51]. An extravagant example is the analysis that concluded that winning an Oscar reduced the risk of death by 28% [52].

In oncology, another example is a classic study about how the duration of adjuvant treatment influences the risk of breast cancer relapse [53]. It concluded that chemotherapy was only useful when more than 85% of the prescribed schedule was administered, but at least part of the effect attributed to the dosage could suffer from this immortal time bias: some women received longer treatments simply because they had not relapsed. It is easy to avoid this bias: identify the time-dependent variables and treat them as such. Since we cannot see into the future, bear in mind that there is no perfect option for representing time-dependent exposures in survival curves [54].
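The mechanism is easy to reproduce. In the simulation below (Python/NumPy; the survival and treatment-time distributions are invented for illustration), treatment has no effect whatsoever on survival, yet classifying as “treated” anyone alive when treatment starts creates a large spurious survival advantage:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

death = rng.exponential(12.0, n)     # survival time, same distribution for everyone
rx_time = rng.exponential(6.0, n)    # time at which treatment *would* be given

# Immortal time bias: "treated" = still alive when treatment is due
treated = rx_time < death

print(f"mean survival, 'treated':   {death[treated].mean():5.1f}")
print(f"mean survival, 'untreated': {death[~treated].mean():5.1f}")
# The "treated" appear to live far longer, although treatment does nothing:
# they had to survive through the immortal time before exposure
```

The correct analysis would treat exposure as a time-dependent variable, so that the pre-exposure person-time of the treated contributes to the unexposed group.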

### Survivor bias

A basic premise for a study’s validity is that all the individuals have the same probability of being studied or followed until the outcomes of interest occur. However, in the real world, not everyone has the same probability of being asked about their exposures and risk factors, given the lesser propensity to participate due to early death or non-random dropouts over time [55, 56]. Thus, it could be spuriously concluded that an ineffective treatment is associated with increased survival. For example, it is possible that an observational study incorrectly concludes that surgery is effective for a certain disease. However, if staying alive long enough is not factored in as a requisite for surgery, we could find ourselves with a survivor bias [57].

In oncology, this is commonplace in quality-of-life studies when the questionnaires are only returned by the patients with better health status, and in epidemiological studies that assess the effect of risk or protective factors for cancer, in which some subjects with the study variable are never recruited [58, 59]. It would also be common in cancer studies, where prevalent (rather than incident) cases are recruited that may have special characteristics that have made them survive. For instance, Nielsen et al. analyzed a Danish cancer registry and found an inverse relation between statin use and mortality in oncological patients [60]. They adjusted their estimation of cancer-related mortality for the competing risk of early cardiovascular death in individuals who already had a diagnosis of cancer. Nevertheless, the survivor bias could have still played a role if the statin-treated individuals had been more prone to early death due to prevalent cardiovascular disease before having cancer. Had this been the case, the survivors who later developed tumors would have inadvertently contributed to the better survival of this subgroup, accounting in part for the results.

To minimize the risk of survivor bias, we must ask ourselves if it is likely that our study includes individuals who, carrying an important part of the information we are interested in, have fewer possibilities of being recruited for systematic reasons, and we must try to control this bias in the design phase. This can be done by compiling demographic data and outcomes of the people who are excluded, adding controls, or being careful with eligibility criteria in case they set a minimum survival [61].

If not, this bias can also be suspected during the analysis phase. For example, van Rein et al. analyzed the effect of statins on safety in subjects who received anticoagulant therapy with vitamin K antagonists [55]. They found a protective association against major bleeding in individuals exposed to statins: odds ratio 0.56 (95% confidence interval 0.29–1.08). After adjusting for comorbidity and antiplatelet use, the magnitude of the protective effect did not increase; thus, they did not conclude that statins were protective, but rather that the results were biased. To prove it, they stratified the analysis on the basis of the time from admission into the study until major bleeding. Interestingly, the protective effect of statins was only manifest in individuals with longer follow-up (from which susceptible patients had already been withdrawn due to demise); in the subgroup with early follow-up, the statin was no longer a protective factor. Finally, a negative control (O group) was used, which exhibited no association with mortality or bleeding [55].

According to the case, other techniques may be needed, such as matching individuals on the basis of propensity to be cases or controls, and the analysis of interventions as time-dependent variables [57, 60, 62].

### Missing values

Managing ‘missing values’ is a pervasive problem in all registries, particularly retrospective ones, and has a tremendous capacity to cause mayhem that is difficult to detect, such as over- or underestimating parameters, and reducing the statistical power to find statistically significant differences [63]. Burton et al. showed that only 40% of the articles in high-impact journals reported how missing data were managed [64]. Although there is a growing awareness of this issue, the problem is presumably still not fully resolved [65]. The most widely used technique (complete case analysis or direct elimination of participants with missing data) tends to be inadequate [63, 64]. In fact, in the most common statistical software packages, if there is a missing variable in one row, the entire participant is excluded from the regression analysis, resulting in diminished statistical power and, sometimes, in biased estimations.

There is a taxonomy of missing data according to the cause: data may be “Missing Completely at Random” (MCAR), “Missing at Random” (MAR), or “Not Missing at Random” (NMAR). The most troublesome are NMAR data, i.e., those whose missingness depends on the unobserved value itself (e.g., dropouts due to toxicity or progression in longitudinal trials). In this case, erasing the participants with missing data both biases the sample and decreases statistical power [66].
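
A minimal simulation, under assumed parameters, of why NMAR data bias a complete-case analysis: when the probability of missingness grows with the (unobserved) value itself, the mean of the observed values is systematically too low.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 10.0
values = rng.normal(true_mean, 3.0, 10_000)

# "Not Missing at Random": the higher the value (e.g., the worse the
# toxicity), the more likely it is to go unrecorded
p_missing = 1 / (1 + np.exp(-(values - true_mean)))  # rises with the value
observed = values[rng.random(values.size) > p_missing]

# Complete-case mean is biased downward: the high values have vanished
print(round(values.mean(), 2), round(observed.mean(), 2))
```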

Simple imputation of missing data (e.g., with the median) is generally a suboptimal strategy, given that it artificially shrinks the standard error by ignoring the uncertainty inherent in the imputation process. The most effective method is multiple imputation. This technique generates several complete data sets (usually 5–10) by randomly drawing imputations from the associations observed among the non-missing values [67]. The multiple imputation model should contain all the variables contemplated in the epidemiological analysis, as well as auxiliary variables that may carry part of the missing information. This makes it possible to conduct the analysis separately in each of these data sets and then integrate all the estimations in a pooled analysis, with standard errors that reflect the added variability and, hence, the greater uncertainty. However, many investigators continue to resist imputation, because they are intuitively distrustful of assigning values that were not collected directly [63, 64]. The correct way of looking at it is that multiple imputation seeks to preserve the meaning of the data that are available [68].

Undoubtedly, the best approach is to control the loss of data during the design stage, with mitigation strategies and documentation of the cause of the “missingness”. In retrospective studies, medical records often lack key data, because physicians may not collect all of them in the course of routine care. However, it is not wise to force the datum by deducing that, if it was not recorded, the patient does not have it. Likewise, it is advisable to prespecify the imputation method in the study protocol.

### Data dredging

Multiple comparisons are inevitable, given that tumors are complex and there can be several legitimate endpoints to examine. Moreover, studies are increasingly expensive, and it is logical for authors to endeavor to glean as much information as possible, conducting multiple subanalyses and translational studies to clarify the different dimensions of the problem. This has an ever greater impact in areas such as genetics, where the expression of thousands of genes is explored; clinical registries, where the optimal number of variables and sub-objectives is debated; or research in imaging techniques, where millions of voxels are analyzed.

Although a fundamental concept in classical hypothesis testing is the significance level *α* the researcher is willing to accept (type I error), it does not suffice when performing multiple comparisons. The main problem of multiplicity is inflation of the type I error. The familywise error rate (FWER) is the probability of obtaining at least one false positive across the entire family of comparisons conducted. Precisely what we call a family of tests is the crux of the problem.

To illustrate, in a study with six dichotomous clinical variables, we can form 64 subgroups, giving rise to a FWER of 96%, with the possibility of finding a subset in which the effect is statistically significant, although the overall result is negative. The Bonferroni correction is used to control the FWER and consists of dividing the *α* level by the number of comparisons. In the previous case, if c comparisons have been made, the significance level in each test to maintain an overall *α* = 0.05 would be 0.05/c, which is often too conservative. An alternative that improves on the Bonferroni method is Holm’s procedure, which applies the Bonferroni correction in sequential, step-down fashion. Another approach is to control the false discovery rate (FDR), which represents the fraction of false positives among the statistically significant results. Researchers generally try to control this parameter by means of the Benjamini–Hochberg method. For example, Vasan et al. attempted to assess the effect of the ABO blood group on the risk of developing 45 different types of tumors, which implies 45 × 3 = 135 multiple comparisons, in a registry of 1.6 million blood donors [69]. What is most interesting is that for the 12 most statistically significant associations they detected, the FDR was > 25%, and only 9/12 results could be validated.
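
These calculations can be reproduced with `statsmodels` (the p-values below are illustrative, not taken from any cited study):

```python
# FWER for a family of independent tests, then three standard corrections
# applied to a set of illustrative p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

alpha = 0.05
m = 64
fwer = 1 - (1 - alpha) ** m  # probability of at least one false positive
print(f"FWER for {m} independent tests at alpha={alpha}: {fwer:.2f}")  # ≈ 0.96

pvals = np.array([0.001, 0.008, 0.020, 0.041, 0.048, 0.300])
for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method=method)
    print(f"{method}: {reject.sum()} of {len(pvals)} remain significant")
```

Note how the FDR-controlling Benjamini–Hochberg procedure (`fdr_bh`) retains more discoveries than the FWER-controlling Bonferroni and Holm procedures on the same p-values, at the price of tolerating a fixed proportion of false positives.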

The problem arises from trying to rescue negative studies by torturing the data until they “confess”, analyzing multiple secondary comparisons. Sometimes this takes the shape of abusive interpretation of subgroup analyses, which is commonplace in cancer publications, in an attempt to draw conclusions from them. However, the results of unplanned subgroup analyses can be useful as hypothesis generators, to be confirmed or refuted later in prospective studies. Other situations include the definition of multiple endpoints for a single study or the conduct of multiple interim analyses, which entails specific issues of interpretation [70]. In extreme cases, this can lead to so-called “fishing” for *P* values, or data dredging [71].

The existence of multiple hidden comparisons makes it impossible to measure the magnitude of the problem [72]: since authors often do not report the negative tests they have carried out, the effective *α* level may never be known [73].

One way of avoiding these problems is to publish the study protocol in advance, so that any deviation from or violation of the statistical plan must be justified. Even so, once the data are obtained, their structure may well call for tests that were not foreseen.

Furthermore, when reading an article, it is difficult to know whether an analysis was prespecified in the protocol or was the consequence of data dredging. Data dredging is therefore a very important problem to bear in mind, in light of its frequency, its ability to evade detection, and its capacity to trigger unnecessary clinical trials and provoke anxiety in the audience.

## Conclusion

We have reviewed ten common mistakes that have a great impact on observational cancer studies. All have been extensively described, criticized, and resolved in multiple articles on methodology. Some are not only very common in the current literature, but also very hard to detect, even for expert critical readers. Oftentimes, the consequences of these mistakes are disastrous for reaching valid conclusions from the studies. Researchers, editors, and reviewers of cancer journals must be aware of these problems to improve the quality of studies, data analyses, and dissemination of results, avoiding unnecessary risks for patients and a waste of time and resources for the rest of the oncology community.

## Notes

### Acknowledgements

Priscilla Chase Duran is acknowledged for editing the manuscript.

### Compliance with ethical standards

### Conflict of interest

None to declare. This is an academic study. No financial support has been received from external sources.

### Ethical statement

The study has been performed in accordance with the ethical standards of the Declaration of Helsinki and its later amendments.

### Informed consent

Not required.

## References

- 1. Garcia-Albeniz X, Chan JM, Paciorek AT, Logan RW, Kenfield SA, Cooperberg MR, et al. Immediate versus deferred initiation of androgen deprivation therapy in prostate cancer patients with PSA-only relapse. J Clin Oncol. 2014;32(15):817–24.
- 2. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124.
- 3. Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics. 2014;15(1):1–12.
- 4. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci. 2012;23(5):524–32.
- 5. Baker M. 1500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–4.
- 6. Suissa S. Immortal time bias in pharmaco-epidemiology. Am J Epidemiol. 2008;167(4):492–9.
- 7. Gore SM, Jones G, Thompson SG. The Lancet’s statistical review process: areas for improvement by authors. Lancet. 1992;340(8811):100–2.
- 8. Goodman SN, Altman DG, George SL. Statistical reviewing policies of medical journals. J Gen Intern Med. 1998;13(11):753–6.
- 9. Fernandes-Taylor S, Hyun JK, Reeder RN, Harris AHS. Common statistical and research design problems in manuscripts submitted to high-impact medical journals. BMC Res Notes. 2011;4(1):304.
- 10. Wicherts JM, Borsboom D, Kats J, Molenaar D. The poor availability of psychological research data for reanalysis. Am Psychol. 2006;61(7):726.
- 11. Vickers AJ. Sharing raw data from clinical trials: what progress since we first asked “Whose data set is it anyway?”. Trials. 2016;17(1):227.
- 12. Bland M. An introduction to medical statistics. 4th ed. Oxford: Oxford University Press; 2015.
- 13. Kirkwood BR, Sterne JAC. Essential medical statistics. Massachusetts: Wiley; 2010.
- 14. Petrie A, Sabin C. Medical statistics at a glance. 3rd ed. Chichester: Wiley; 2013.
- 15. Carmona-Bayonas A, Font C, Fonseca PJ, Fenoy F, Otero R, Beato C, et al. On the necessity of new decision-making methods for cancer-associated, symptomatic, pulmonary embolism. Thromb Res. 2016;143:76–85.
- 16. Carmona-Bayonas A, Fonseca PJ, Puig CF, Fenoy F, Candelera RO, Beato C, et al. Predicting serious complications in patients with cancer and pulmonary embolism using decision tree modeling: the EPIPHANY index. Br J Cancer. 2017;116(8):994–1001.
- 17. Fonseca PJ, Carmona-Bayonas A, García IM, Marcos R, Castañón E, Antonio M, et al. A nomogram for predicting complications in patients with solid tumours and seemingly stable febrile neutropenia. Br J Cancer. 2016;114:1191–8.
- 18. van Es N, Di Nisio M, Cesarman G, Kleinjan A, Otten H-M, Mahé I, et al. Comparison of risk prediction scores for venous thromboembolism in cancer patients: a prospective cohort study. Haematologica. 2017;102(9):1494–501.
- 19. Vickers AJ, Cronin AM. Everything you always wanted to know about evaluating prediction models (but were too afraid to ask). Urology. 2010;76(6):1298.
- 20. Khorana AA, Kuderer NM, Culakova E, Lyman GH, Francis CW. Development and validation of a predictive model for chemotherapy-associated thrombosis. Blood. 2008;111(10):4902–7.
- 21. Chaudhury A, Balakrishnan A, Thai C, Holmstrom B, Nanjappa S, Ma Z, et al. Validation of the Khorana score in a large cohort of cancer patients with venous thromboembolism. Blood. 2016;128(2):879.
- 22. Del Priore G, Zandieh P, Lee M-J. Treatment of continuous data as categoric variables in obstetrics and gynecology. Obstet Gynecol. 1997;89(3):351–4.
- 23. MacCallum RC, Zhang S, Preacher KJ, Rucker DD. On the practice of dichotomization of quantitative variables. Psychol Methods. 2002;7(1):19.
- 24. Ravichandran C, Fitzmaurice GM. To dichotomize or not to dichotomize? Nutrition. 2008;24(6):610–1.
- 25. Austin PC, Brunner LJ. Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Stat Med. 2004;23(7):1159–78.
- 26. Altman DG, Lausen B, Sauerbrei W, Schumacher M. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. JNCI J Natl Cancer Inst. 1994;86(11):829–35.
- 27. DeCoster J, Iselin A-MR, Gallucci M. A conceptual and empirical examination of justifications for dichotomization. Psychol Methods. 2009;14(4):349–66.
- 28. Jiménez-Fonseca P, Carmona-Bayonas A, Hernández R, Custodio A, Cano JM, Lacalle A, et al. Lauren subtypes of advanced gastric cancer influence survival and response to chemotherapy: real-world data from the AGAMENON National Cancer Registry. Br J Cancer. 2017;117(6):775–82.
- 29. George BJ, Beasley TM, Brown AW, Dawson J, Dimova R, Divers J, et al. Common scientific and statistical errors in obesity research. Obesity. 2016;24(4):781–90.
- 30. Morton V, Torgerson DJ. Effect of regression to the mean on decision making in health care. BMJ. 2003;326(7398):1083.
- 31. Tsuboi M, Ezaki K, Tobinai K, Ohashi Y, Saijo N. Weekly administration of epoetin beta for chemotherapy-induced anemia in cancer patients: results of a multicenter, phase III, randomized, double-blind, placebo-controlled study. Jpn J Clin Oncol. 2009;39(3):163–8.
- 32. Bland JM, Altman DG. Statistics notes: some examples of regression towards the mean. BMJ. 1994;309(6957):780.
- 33. Aronson JK. Biomarkers and surrogate endpoints. Br J Clin Pharmacol. 2005;59(5):491–4.
- 34. Hamilton RJ, Goldberg KC, Platz EA, Freedland SJ. The influence of statin medications on prostate-specific antigen levels. JNCI J Natl Cancer Inst. 2008;100(21):1511–8.
- 35. Miyamoto RK, Thompson IM. The reliability of digital rectal exam, PSA, repeat prostate biopsy, and endorectal MRI for following patients with clinically localized prostate cancer on active surveillance. J Urol. 2008;179(4):154.
- 36. Cummings SR, Palermo L, Browner W, Marcus R, Wallace R, Pearson J, et al. Monitoring osteoporosis therapy with bone densitometry: misleading changes and regression to the mean. JAMA. 2000;283(10):1318–21.
- 37. Vitolins MZ, Griffin L, Tomlinson WV, Vuky J, Adams PT, Moose D, et al. Randomized trial to assess the impact of venlafaxine and soy protein on hot flashes and quality of life in men with prostate cancer. J Clin Oncol. 2013;31(32):4092–8.
- 38. Wainer H. The most dangerous equation. Am Sci. 2007;95(3):249.
- 39. Koller MT, Raatz H, Steyerberg EW, Wolbers M. Competing risks and the clinical community: irrelevance or ignorance? Stat Med. 2012;31(11–12):1089–97.
- 40. Berry SD, Ngo L, Samelson EJ, Kiel DP. Competing risk of death: an important consideration in studies of older adults. J Am Geriatr Soc. 2010;58(4):783–7.
- 41. Pietersen E, Ignatius E, Streicher EM, Mastrapa B, Padanilam X, Pooran A, et al. Long-term outcomes of patients with extensively drug-resistant tuberculosis in South Africa: a cohort study. Lancet. 2014;383(9924):1230–9.
- 42. Ay C, Dunkler D, Simanek R, Thaler J, Koder S, Marosi C, et al. Prediction of venous thromboembolism in patients with cancer by measuring thrombin generation: results from the Vienna Cancer and Thrombosis Study. J Clin Oncol. 2011;29(15):2099–103.
- 43. Ay C, Vormittag R, Dunkler D, Simanek R, Chiriac A-L, Drach J, et al. D-dimer and prothrombin fragment 1 + 2 predict venous thromboembolism in patients with cancer: results from the Vienna Cancer and Thrombosis Study. J Clin Oncol. 2009;27(25):4124–9.
- 44. Campigotto F, Neuberg D, Zwicker JI. Biased estimation of thrombosis rates in cancer studies using the method of Kaplan and Meier. J Thromb Haemost. 2012;10(7):1449–51.
- 45. Brown JD, Adams VR, Moga DC. Impact of time-varying treatment exposures on the risk of venous thromboembolism in multiple myeloma. Healthcare. 2016;4(4):93.
- 46. Austin PC, Lee DS, Fine JP. Introduction to the analysis of survival data in the presence of competing risks. Circulation. 2016;133(6):601–9.
- 47. Thompson CA, Zhang Z-F, Arah OA. Competing risk bias to explain the inverse relationship between smoking and malignant melanoma. Eur J Epidemiol. 2013;28(7):557–67.
- 48. Stragliotto G, Rahbar A, Solberg NW, Lilja A, Taher C, Orrego A, et al. Effects of valganciclovir as an add-on therapy in patients with cytomegalovirus-positive glioblastoma: a randomized, double-blind, hypothesis-generating study. Int J Cancer. 2013;133(5):1204–13.
- 49. Park HS, Gross CP, Makarov DV, James BY. Immortal time bias: a frequently unrecognized threat to validity in the evaluation of postoperative radiotherapy. Int J Radiat Oncol Biol Phys. 2012;83(5):1365–73.
- 50. Parikh ND, Marshall VD, Singal AG, Nathan H, Lok AS, Balkrishnan R, et al. Survival and cost-effectiveness of sorafenib therapy in advanced hepatocellular carcinoma: an analysis of the SEER-Medicare database. Hepatology. 2017;65(1):122–33.
- 51. Suissa S. Immortal time bias in pharmacoepidemiology. Am J Epidemiol. 2007;167(4):492–9.
- 52. Redelmeier DA, Singh SM. Survival in Academy Award-winning actors and actresses. Ann Intern Med. 2001;134(10):955–62.
- 53. Bonadonna G, Valagussa P. Dose-response effect of adjuvant chemotherapy in breast cancer. N Engl J Med. 1981;304(1):10–5.
- 54. Simon R, Makuch RW. A non-parametric graphical representation of the relationship between survival and the occurrence of an event: application to responder versus non-responder bias. Stat Med. 1984;3(1):35–44.
- 55. van Rein N, Cannegieter SC, Rosendaal FR, Reitsma PH, Lijfering WM. Suspected survivor bias in case-control studies: stratify on survival time and use a negative control. J Clin Epidemiol. 2017;67(2):232–5.
- 56. Hu Z-H, Connett JE, Yuan J-M, Anderson KE. Role of survivor bias in pancreatic cancer case–control studies. Ann Epidemiol. 2016;26(1):50–6.
- 57. Sy RW, Bannon PG, Bayfield MS, Brown C, Kritharides L. Survivor treatment selection bias and outcomes research: a case study of surgery in infective endocarditis. Circ Cardiovasc Qual Outcomes. 2009;2(5):469–74.
- 58. Ho AM-H, Zamora JE, Holcomb JB, Ng CSH, Karmakar MK, Dion PW. The many faces of survivor bias in observational studies on trauma resuscitation requiring massive transfusion. Ann Emerg Med. 2017;66(1):45–8.
- 59. Brundage M, Osoba D, Bezjak A, Tu D, Palmer M, Pater J. Lessons learned in the assessment of health-related quality of life: selected examples from the National Cancer Institute of Canada Clinical Trials Group. J Clin Oncol. 2007;25(32):5078–81.
- 60. Nielsen SF, Nordestgaard BG, Bojesen SE. Statin use and reduced cancer-related mortality. N Engl J Med. 2012;367(19):1792–802.
- 61. Griffiths R, Mikhael J, Gleeson M, Danese M, Dreyling M. Addition of rituximab to chemotherapy alone as first-line therapy improves overall survival in elderly patients with mantle cell lymphoma. Blood. 2011;118(18):4808–16.
- 62. Austin PC, Mamdani MM, Van Walraven C, Tu JV. Quantifying the impact of survivor treatment bias in observational studies. J Eval Clin Pract. 2006;12(6):601–12.
- 63. Jeličić H, Phelps E, Lerner RM. Use of missing data methods in longitudinal studies: the persistence of bad practices in developmental psychology. Dev Psychol. 2009;45(4):1195–9.
- 64. Burton A, Altman DG. Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer. 2004;91(1):4–8.
- 65. Rombach I, Rivero-Arias O, Gray AM, Jenkinson C, Burke O. The current practice of handling and reporting missing outcome data in eight widely used PROMs in RCT publications: a review of the current literature. Qual Life Res. 2016;25(7):1613–23.
- 66. Raboud JM, Montaner JSG, Thorne A, Singer J, Schechter MT; Group CHIVTNAS. Impact of missing data due to dropouts on estimates of the treatment effect in a randomized trial of antiretroviral therapy for HIV-infected individuals. JAIDS J Acquir Immune Defic Syndr. 1996;12(1):46–55.
- 67. Rubin DB, Schenker N. Multiple imputation in healthcare databases: an overview and some applications. Stat Med. 1991;10(4):585–98.
- 68. Harrell F. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. 2nd ed. New York: Springer; 2015.
- 69. Vasan SK, Hwang J, Rostgaard K, Nyrén O, Ullum H, Pedersen OB, et al. ABO blood group and risk of cancer: a register-based cohort study of 1.6 million blood donors. Cancer Epidemiol. 2016;44:40–3.
- 70. Sen PK. Multiple comparisons in interim analysis. J Stat Plan Inference. 1999;82(1):5–23.
- 71. Smith GD, Ebrahim S. Data dredging, bias, or confounding: they can all get you into the BMJ and the Friday papers. BMJ. 2002;325(7378):1437.
- 72. Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J Am Stat Assoc. 1959;54(285):30–4.
- 73. Stacey AW, Pouly S, Czyz CN. An analysis of the use of multiple comparison corrections in ophthalmology research. Invest Ophthalmol Vis Sci. 2012;53(4):1830–4.