In clinical research, randomized controlled clinical trials are regarded as the gold standard for comparing different clinical interventions (CONSORT statement [1]). Conducting a clinical trial and acquiring patient data are cumbersome, time-consuming, and expensive. Furthermore, randomized controlled clinical trials are occasionally criticized as being too artificial and unrealistic and sometimes even for generating non-reproducible results [2, 3]. By contrast, in clinical routine, a vast amount of data is gathered under real-life conditions on a daily basis. Detailed information on baseline characteristics, treatments, exposures, and outcomes is assessed on an individual level. Providing a platform to collect, centralize, aggregate, and store these data naturally represents a great solution to the problems involved with data acquisition in clinical trials. For the treatment of pain, for example, the German Pain Practice Registry and the related online documentation service iDocLive® provide an excellent platform to centrally collect, store, and process daily routine data. These data can now be utilized and analyzed to derive valuable information.

Studies based on clinical routine data have several advantages. It is usually less difficult to realize higher sample sizes, enabling the investigation of smaller differences. Furthermore, these studies may include patients that are usually excluded from clinical trials, e.g., with a wider range of exposure levels [4]. Most of all, however, practitioners praise the time- and cost-effective approach of data acquisition as well as the strong connection to actual clinical practice.

While the time/cost argument is clearly important and always to be considered, it is merely connected to external factors and not to the actual research question at stake: discovering the true benefit of a (or several) clinical intervention(s). The argument concerning the connection to actual practice is also not directly applicable to the actual research question. Clearly, the intention and primary purpose of medical research is to help patients in practice. However, just because treatments have a specific reputation and are applied in a specific way in practice, does this mean it is correct and optimal to do so?

Unambiguously, the answer is no, and assuming otherwise would contradict clinical development. Let us imagine that, in clinical routine, drug A is used primarily on a slightly different subset of patients than drug B. Solely comparing results of those drug interventions yields a biased picture, and between-group results cannot be attributed to an actual treatment difference, but rather to heterogeneity of the treatment groups. Even if treatment groups are comparable at baseline, clinical routine might treat interventions differently with respect to concomitant medications, drug titrations, dose levels, or other aspects discussed later in this article. Again, results of a drug intervention cannot be attributed to the drug itself but rather to those other factors.

As a result of this lack of objectivity, comparability, and internal validity, randomizing subjects to treatments is indispensable to obtain valid statistical results. Randomization is not only one of the most important techniques to avoid bias in clinical trials [5] but it also provides a basis for quantitative evaluation of clinical trial data [6]. Thus experts in the field of clinical trial design increasingly stress the importance of randomization and blinding (e.g., [7, 8]).

This article will discuss methodological issues in evaluations based on clinical routine data, pointing out strengths and limitations. This article does not contain any new studies with human or animal subjects performed by the author. The “Results” section will first elaborate general aspects of clinical routine data, followed by an illustration of these strengths and limitations in a recent real-case example and results of a simulation study. Conclusions are drawn at the end of this article.


General Aspects of Clinical Routine Data

Centrally storing and merging data gathered during clinical routine overcomes many difficulties in acquiring clinical data. Processing these data yields a unique opportunity to obtain a comprehensive and aggregated insight into the current state of treatment and care of a specific subset of patients:

  • How many of those patients receive a specific treatment?

  • What dose levels are administered in practice?

  • What concomitant medications are used?

It also gives insight into the differences between daily life and clinical trials regarding how treatments are applied, what patient populations look like, etc. All this information is without doubt highly valuable. In statistical terms, all these investigations are descriptive in nature; thus, these questions can be answered with descriptive statistics. Inferential statistics (e.g., statistical hypothesis tests deriving estimates and p values), on the other hand, use (a small set of) observed data to infer properties of a larger population. Such generalized conclusions, e.g., on actual treatment differences, might be highly questionable if they are based on data gathered during clinical routine because of several methodological aspects. These aspects will be discussed in more detail in the remainder of this section.


In an ideal world, data on every single treatment would be assessable. A truly random sample of this population would yield a sound and valid basis for statistical inference about this population [6]. However, this situation is not available in practice and thus the existing subset of clinical routine data has to be critically investigated:

  • Is there a geographical bias in data collection?

  • Are younger or older patients more willing to consent to data storage?

  • Do participating physicians systematically prefer certain treatments?

  • Which data are collected for which patients?

Speaking in statistical terms (e.g., [9]), given a treatment this subset of patients can hardly ever form a representative subset of all patients of interest. Lacking the basic principle of the population model, there is no foundation for statistical inference and without proper randomization there is not even the possibility of an invoked population model [6]. Without a basis for statistical inference, results cannot be generalized to all patients of interest.

Data Collection

In clinical trials, data are collected in a prespecified, prospective, and homogeneous way. Clinical routine data unfortunately lack most of these attributes as a result of their biggest strength: clinical practice. Dose adjustments and prescriptions of concomitant medication, and in fact any behavior, interaction with and treatment of the patient, order and style of data assessment, etc., do not follow a prespecified protocol but are at the discretion of the physician. This not only results in high heterogeneity in the data but also opens the door to selection bias, detection bias, attrition bias, measurement bias, and many other sources of bias (e.g., [6, 7, 10, 11]). Retrospectively sampling any subset of available routine data does not overcome these issues, as heterogeneity and bias might already be part of the data. To illustrate this with a fictitious example, let us imagine two treatments:

  • Treatment A: Placebo.

  • Treatment B: An active pain medication proven to be slightly less efficacious, but much better tolerated, than morphine.

For sufficient pain relief, doctors are allowed to prescribe morphine as needed in both cases. Now suppose that, to achieve sufficient pain relief, treatment A (placebo) is always administered with a high dose of morphine as concomitant medication, while treatment B (the active and efficacious drug) is always administered without morphine as concomitant medication. Efficacy results comparing treatment A and treatment B might then erroneously show that treatment A (placebo) is superior to treatment B. Retrospectively taking any random subsample of patients treated this way will not prevent this false result, as heterogeneity and bias are already part of every assessment and thus included in all data.
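The fictitious example can be made concrete with a small sketch. All effect sizes below (0 for placebo, 3 for treatment B, 4 for high-dose morphine, unit noise) are hypothetical values chosen purely for illustration, as are the function and variable names; the point is only that a random subsample inherits the confounding already present in the full data.

```python
import random


def simulate_routine_data(n_per_arm, seed=0):
    """Simulate pain relief scores for the two arms of the fictitious example.

    All effect sizes are hypothetical illustration values, not estimates.
    """
    rng = random.Random(seed)
    placebo_effect = 0.0   # treatment A has no effect of its own
    drug_b_effect = 3.0    # treatment B: active drug, slightly weaker than morphine
    morphine_effect = 4.0  # high-dose morphine, given only alongside treatment A
    arm_a = [placebo_effect + morphine_effect + rng.gauss(0, 1)
             for _ in range(n_per_arm)]
    arm_b = [drug_b_effect + rng.gauss(0, 1) for _ in range(n_per_arm)]
    return arm_a, arm_b


def mean(values):
    return sum(values) / len(values)


arm_a, arm_b = simulate_routine_data(1000)
full_diff = mean(arm_a) - mean(arm_b)

# A retrospective random subsample is drawn from already confounded data.
sampler = random.Random(1)
sub_diff = mean(sampler.sample(arm_a, 100)) - mean(sampler.sample(arm_b, 100))

print(f"A minus B, all data:         {full_diff:.2f}")
print(f"A minus B, random subsample: {sub_diff:.2f}")
```

In both comparisons the placebo arm appears superior, because the observed difference reflects the morphine co-medication rather than the treatments themselves.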

As a consequence, the observed treatment difference cannot be attributed to the actual treatments but to a mixture of treatment difference, heterogeneity, and biases. An illustrative real-case example regarding systematically different use of concomitant medication in clinical routine data is given below.


Randomization is the most important design technique in clinical investigations, providing three important benefits [12]:

  (a) Randomization is used to balance not only known but also unknown and unobservable (latent) covariates [13]. Investigators often try to “demonstrate” baseline equality of covariates, even though it is from a statistical perspective illogical and the CONSORT Group explicitly discourages this behavior [14]. In addition, testing of baseline differences is only feasible for observable covariates but is impossible for unknown and latent covariates. As seen above, homogeneity between treatment groups is imperative for investigating true treatment difference.

  (b) Given the specific challenges in clinical research, it is mostly impossible to create a truly representative subset of patients of interest. Thus randomly assigning patients to treatment groups is imperative to provide a basis for statistical inference [6].

  (c) Furthermore, in combination with blinding (see below), randomization helps to avoid bias in clinical investigations [5].

Overall, true randomization is the only technique enabling one to investigate a clinical hypothesis with internal validity. It is important to note that retrospectively taking a random selection of available routine data is not in line with ICH E9 [5], does not provide a basis for statistical inference about all patients of interest, does not create homogeneity between treatment groups, and does not avoid bias.


Randomization helps to create homogeneity in the data and avoid bias. However, without blinding, randomization lacks the capability to prevent many types of bias [5]. Detection/ascertainment bias, attrition bias, performance bias, co-intervention bias, and observer bias [11] simply cannot be prevented by randomization without blinding. Even selection bias, commonly believed to be fully avoided by randomization, is a threat without blinding [15, 16, 17]. First- and second-order selection bias [18] can be prevented by randomization with allocation concealment. However, even in the case of proper randomization, without blinding (in this case masking past treatment assignments from the investigator), third-order selection bias can lead to substantial type I error rate elevation and thus to false test decisions [16, 17]. Thus, even in a randomized clinical trial without blinding, selection bias has to be considered and, if present, corrected for appropriately [10, 19, 20, 21].

Proper blinding, which prevents the forms of bias described above, refers to blinding the patient and the investigator. It is important to note that conducting a blinded analysis, i.e., blinding only the statistician during the analysis, does not prevent any of the above biases.

Random Selection of Existing Clinical Data

As described above, randomly selecting a subset of existing clinical routine data collected in the past does not prevent bias, does not increase homogeneity in known or latent covariates, and does not provide a basis for statistical inference. Even worse, considering publication bias, in a retrospective analysis, theoretically one could successively take multiple random selections until the desired result is available and only publish results on that specific “random” sample.

In sampling theory, there are further operative reasons for taking a random sample of a larger population [9]. Taking a subsample is always accompanied by a loss of information; however, it is often not possible, or at least extremely time-consuming and costly, to acquire data on all individuals of interest (e.g., election forecasts, opinion polls, etc.). Sometimes it is not even reasonable to assess all data (e.g., chemical investigation of the quality of a shipment of fruit). Those reasons all apply to the acquisition of information (data). When information (data) is already gathered and available, as for collected and stored clinical routine data, there is no operative reason to take a subsample. Analyzing a larger data set requires neither more time nor additional costs, but it does yield more information.

Use of Intention-to-Treat Principle (ITT)

There is a strict need for an ITT population for confirmatory pivotal clinical trials [5]. However, the ITT principle is unambiguously defined as including all randomized subjects in the analysis and thus inseparable from true randomization. Using the term ITT in a non-randomized investigation is inappropriate and misleading. Thus studies based on clinical routine data should abstain from referring to an ITT population. Clearly the label ITT has achieved a certain standing and importance; thus, trial design experts criticize this misuse of the ITT labeling in clinical investigations as “simply dishonest” [8].

Analyses of Clinical Routine Data in the Ueberall and Mueller-Schwefe Publication [22]

In the recent past, a study based on clinical routine data investigated the efficacy and tolerability balance of oxycodone/naloxone and tapentadol in chronic low back pain with a neuropathic component [22]. Even though the title and various parts of the Ueberall and Mueller-Schwefe [22] publication (UMS publication) repeatedly use the words “blinded”, “random”, and “prospective”, it is important to note that their investigation on clinical routine data is in fact not randomized, not blinded, and their analysis is retrospective. Consequently, all methodological issues described in Sect. 2 have to be considered carefully. Without internal validity and without a basis for statistical inference, the validity of generalized conclusions is highly questionable. While the remainder of this section is devoted to illustrating some of the aspects of Sect. 2 in more detail, further explicit insight into bias in the UMS publication can be found in a response to the UMS publication by the Cochrane Group [23].

Use of Laxatives

Various studies demonstrate the favorable tolerability profile of tapentadol prolonged release (PR) compared to other opioids (e.g., [24]). Comparison of tapentadol PR (TAP) to oxycodone/naloxone PR (OXN) in a prospective randomized clinical trial treating subjects homogeneously according to a prespecified protocol proved the tolerability of TAP to be superior to that of OXN [25]. In particular, TAP is associated with significantly lower incidences of constipation than OXN. In daily clinical routine, physicians treat patients according to their individual needs. This, however, might lead to systematic differences between two treatments in the use of concomitant medication. Regarding TAP and OXN, it might be expected that in daily clinical routine, physicians systematically rely more frequently on laxatives for patients treated with OXN compared to patients treated with TAP. The subset of daily clinical routine data assessed by iDocLive® and presented in the UMS publication actually demonstrates this behavior. Unfortunately the UMS publication erroneously claims that:

  • “The proportion of patients without using laxatives changed insignificantly from baseline to the study end for both treatments” and

  • “Analyses of the available patient information on the use of laxatives revealed a mixed, however, comparable, utilization pattern for both treatment groups evaluated”.

In the UMS publication, data were analyzed using very basic statistical hypothesis tests (Student’s/paired sample t test, χ² test). Unfortunately, the statistical testing procedure needed to correctly analyze the changes in laxative use, taking stochastic dependencies into account, is not among them. Given a baseline/end of treatment scenario, paired nominal data are present, where assessments of the same individual are stochastically dependent.

Based on information provided in the UMS publication, Tables 1 and 2 illustrate the concordant (main diagonal) and discordant (secondary diagonal) pairs of baseline/end of treatment laxative use for OXN and TAP, respectively. The discordant pairs separate perfectly in opposite directions: while the discordant pair in Table 1 (OXN) reflects perfect separation towards the additional need for laxatives at the end of treatment, the discordant pair in Table 2 (TAP) reflects perfect separation towards an improvement regarding laxative intake at the end of treatment.

Table 1 Laxative intake with oxycodone/naloxone PR
Table 2 Laxative intake with tapentadol PR

Analyzing the discordant pair in Table 1 with McNemar’s test for paired nominal data yields a significant difference (worsening) for OXN in laxative intake comparing baseline to the end of treatment (p = 0.001565/p = 0.004427, without/with Edwards’ continuity correction, respectively). Analyzing the discordant pair in Table 2 shows a clear trend of improvement from baseline to the end of treatment for TAP (p = 0.02535/p = 0.07364, without/with Edwards’ continuity correction, respectively).
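For readers who want to retrace this kind of calculation, the following is a minimal sketch of McNemar’s test with and without Edwards’ continuity correction. The discordant counts used in the demonstration (10 versus 0 and 5 versus 0) are an assumption: they are not quoted from Tables 1 and 2 but are counts that, under the perfect separation described above, happen to reproduce the reported p values. The function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def mcnemar_p(b, c, continuity=False):
    """p value of McNemar's test given the two discordant cell counts b and c."""
    diff = abs(b - c)
    if continuity:
        # Edwards' continuity correction: shrink the absolute difference by 1
        diff = max(diff - 1.0, 0.0)
    statistic = diff ** 2 / (b + c)
    return chi2_sf_1df(statistic)


# Assumed discordant counts (see lead-in), not taken from the tables themselves.
for label, b, c in (("OXN", 10, 0), ("TAP", 5, 0)):
    print(label,
          f"uncorrected p = {mcnemar_p(b, c):.6f},",
          f"corrected p = {mcnemar_p(b, c, continuity=True):.6f}")
```

Only the discordant pairs enter the statistic; the concordant pairs on the main diagonal carry no information about a change between baseline and end of treatment.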

Striving for a simple but direct comparison of OXN to TAP, taking the development of each individual patient into account, for each patient it could be assessed whether the use of laxatives improved, worsened, or did not change through the course of treatment (Table 3). Analysis of the data in Table 3 with Fisher’s exact test demonstrates a significant difference between OXN and TAP (p < 0.0001) in favor of TAP.

Table 3 Change in need for laxative in patients administered tapentadol PR and oxycodone/naloxone PR

Bearing in mind the general aspects of statistical analyses on clinical routine data in Sect. 2, one has to interpret these analyses as a mere description of the specific subset of daily routine data assessed by iDocLive®. However, the UMS publication chose a composite efficacy/tolerability endpoint for their inferential primary statistical analyses. This primary endpoint is heavily affected by constipation and laxative intake. Thus, this example demonstrates that, regarding their primary endpoint, the observed treatment difference cannot be attributed solely to the actual treatments. Tolerability of OXN was supported by systematic disproportionate use of concomitant medications, preventing a valid comparison between OXN and TAP.

Strengths of Clinical Routine Data

The above example demonstrates that an observed treatment difference often cannot be attributed to the actual treatments but to a mixture of treatment difference, heterogeneity, and biases. It also illustrates the unique strength and benefit of analyzing routine data, giving a detailed insight into the current treatment and care of patients in actual clinical practice.

The limitations on the population described in Sect. 2 still have to be considered; however, data can be utilized to give a detailed picture of patient treatment by the specific non-random subset of broad-specified pain specialists. As seen above, it reveals, for instance, that in actual clinical practice for the considered subset, laxatives are used more frequently with OXN than with TAP. Furthermore, the UMS publication reveals valuable descriptive information on baseline conditions and average doses that are administered in real life as opposed to clinical trials. Most information in the publication is about the small selection of routine data (261 of 579 “appropriate” patients). As discussed in Sect. 2, looking at the whole picture achieves an even broader insight without loss of information. This particularly applies to investigations into how many patients are treated with either (or even different) pain medication(s).

Simulation Study

A simulation study was conducted to investigate the potential impact of bias in studies based on clinical routine data. Results of this simulation study show not only a substantial type I error rate elevation but also illustrate that an inferior drug might actually demonstrate superiority with a high likelihood.

Statistical Model

In a parallel group design, clinical routine data on drug A are compared to clinical routine data on drug B. The variable of interest, i.e., the response variable, is continuous and assumed to be normally distributed, with homoscedasticity between treatment groups. The true and unbiased treatment effects of drug A and drug B are denoted by \(\mu_{\text{A}}\) and \(\mu_{\text{B}}\), respectively.

Clinical routine data are prone to all types of bias described in Sect. 2. Types and magnitude of bias in studies based on clinical routine data might be very different from study to study. Which types of bias are actually present in a particular study of this nature is usually difficult to detect and quantify. In statistical terms, next to the treatment effects \(\mu_{\text{A}}\)/\(\mu_{\text{B}}\), the magnitude of this “bias effect” always has to be related to the standard deviation, and the resulting quotient is denoted by γ. The literature varies considerably when it comes to the magnitude of these bias effects. Regarding selection bias, for example, Proschan [17] uses a bias effect of \(\gamma \in \{0.1, \ldots, 0.5\}\), Berger et al. [15] use \(\gamma \in \{0, 0.5, \ldots, 2\}\), and Follmann and Proschan [26] use γ = 1. In this simulation study, the bias effect does not necessarily reflect only a single type of bias (e.g., selection bias), but might also reflect the sum of different biases. For example, the Cochrane Group sees a recent real-case study based on clinical routine data, the UMS publication, to be “at serious risk of bias” and graded various types of bias as “moderate”, “problematic”, or of “serious concern” [23]. Nevertheless, to be conservative in this simulation study, light bias effects of \(\gamma \in \{0.1, 0.3, 0.5\}\) are investigated.

Studies based on clinical routine data often rely on data gathered at various sites and/or by various investigators. As a result of certain treatment characteristics or reputations, a systematic bias applying to most patients is not necessarily unlikely. However, given a multicenter setting and regarding various types of bias, different proportions of patients affected by bias are investigated. In this simulation study, for every patient a Bernoulli-distributed random variable \(B_{i} \sim \text{Ber}(p)\) will be used to determine whether patient i was affected by bias or not. The distribution parameter \(p \in [0, 1]\) will vary in steps of 0.1.

Finally, let \(\xi_{i}\) denote the treatment group of patient i, i.e., \(\xi_{i} = -1\) if patient i receives drug A and \(\xi_{i} = 1\) if patient i receives drug B. As a result of the lack of randomization, \(\xi_{i}\) is not a random variable, but rather a mere descriptor, which might be confounded by patient characteristics as well. However, this confounding would lead to an additional bias, which will be part of and thus covered by the bias effect γ. Let \(n_{\text{A}}\) denote the number of patients receiving drug A and \(n_{\text{B}}\) denote the number of patients receiving drug B.

Combining the above information and model assumptions in a joint model, one observes that the continuous response \(Y_{i}\) of patient i follows a normal distribution with variance \(\sigma^{2}\) and conditional expected value

$$E (Y_{i} | B_{i} ) = \frac{{1 - \xi_{i} }}{2}\mu_{\text{A}} + \frac{{1 + \xi_{i} }}{2}\mu_{\text{B}} - \xi_{i} B_{i} \gamma \sigma .$$

The conditional density function of the response \(Y_{i}\) of the ith patient is thus given by

$$f_{{Y_{i} |B_{i} }}^{\left( \theta \right)} \left( {y_{i} |b_{i} } \right) = \frac{1}{{\sqrt {2\pi \sigma^{2} } }}\exp \left( { - \frac{1}{{2\sigma^{2} }}\left[ {y_{i} - \left( { \frac{{1 - \xi_{i} }}{2}\mu_{\text{A}} + \frac{{1 + \xi_{i} }}{2}\mu_{\text{B}} - \xi_{i} b_{i} \gamma \sigma } \right)} \right]^{2} } \right),$$

with \(\theta := (\mu_{\text{A}}, \mu_{\text{B}}, \gamma, \sigma)^{T}\) and \(b_{i}\) the realization of \(B_{i}\).
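The model above can be sketched in a few lines of code. This is only an illustration of the conditional expected value and of drawing a single response; the function names and argument order are hypothetical, and Python is used here for convenience. Note that, with this sign convention, the bias term raises the expected response under drug A and lowers it under drug B.

```python
import random


def expected_response(xi, b, mu_a, mu_b, gamma, sigma):
    """Conditional expected value E(Y_i | B_i = b) for treatment indicator xi.

    xi = -1 selects drug A, xi = +1 selects drug B; b is 1 if the patient
    is affected by bias and 0 otherwise.
    """
    return (1 - xi) / 2 * mu_a + (1 + xi) / 2 * mu_b - xi * b * gamma * sigma


def draw_response(xi, p, mu_a, mu_b, gamma, sigma, rng):
    """Draw B_i ~ Ber(p), then Y_i ~ N(E(Y_i | B_i), sigma^2)."""
    b = 1 if rng.random() < p else 0
    return rng.gauss(expected_response(xi, b, mu_a, mu_b, gamma, sigma), sigma)


rng = random.Random(42)
# One biased-prone patient on each drug, no true treatment difference.
y_a = draw_response(-1, 0.5, 0.0, 0.0, 0.5, 1.0, rng)
y_b = draw_response(+1, 0.5, 0.0, 0.0, 0.5, 1.0, rng)
print(y_a, y_b)
```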

Part 1: No Treatment Difference

The first part of this simulation study investigates the case of no actual treatment difference, i.e., \(\mu_{\text{A}} = \mu_{\text{B}}\). Furthermore, σ = 1 is chosen and, as stated above, \(\gamma \in \{0.1, 0.3, 0.5\}\) and \(p \in \{0, 0.1, \ldots, 1\}\) are investigated. In each simulation run, a study is simulated with \(n_{\text{A}} = n_{\text{B}} = 130\) patients in each treatment group and responses according to Eq. (1). For each parameter combination (\(\gamma, p\)), 10,000 studies are simulated and the data are analyzed with a two-sample, two-sided t test at a significance level of α = 0.05. Since in part 1 of this simulation study there is actually no treatment difference, i.e., \(\mu_{\text{A}} = \mu_{\text{B}}\), without bias the number of significant study results should be approximately 5% of 10,000, i.e., approximately 500 simulated studies should falsely show a significant difference between drug A and drug B.

Simulations were carried out in R [27] (a language and environment for statistical computing, version 3.3.1) using RStudio [28] (Integrated Development for R, version 0.99.891). Results of part 1 of this simulation study are given in Table 4 and illustrated in Fig. 1.
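The original simulations were run in R; the following is an independent, minimal re-sketch of part 1 in Python using only the standard library, not the author's code. It assumes a pooled-variance two-sample t test and hard-codes an approximate critical value of 1.969 for 258 degrees of freedom; function names, the seed, and the default arguments are hypothetical.

```python
import math
import random

# Assumption: approximate 97.5% quantile of the t distribution with 258 df.
T_CRIT = 1.969


def simulate_group(n, xi, mu, gamma, sigma, p, rng):
    """Responses of n patients in one group; each is biased with probability p."""
    values = []
    for _ in range(n):
        b = 1 if rng.random() < p else 0
        values.append(rng.gauss(mu - xi * b * gamma * sigma, sigma))
    return values


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (group a minus group b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def type_one_error_rate(gamma, p, n_sims=10000, n=130, seed=1):
    """Share of simulated studies with a significant two-sided test, mu_A = mu_B."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        group_a = simulate_group(n, -1, 0.0, gamma, 1.0, p, rng)
        group_b = simulate_group(n, +1, 0.0, gamma, 1.0, p, rng)
        if abs(pooled_t(group_a, group_b)) > T_CRIT:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Small demonstration run; increase n_sims for more stable estimates.
    for p in (0.0, 0.4):
        print(f"gamma = 0.5, p = {p}: {type_one_error_rate(0.5, p, n_sims=1000):.3f}")
```

Because the estimates are Monte Carlo results, individual runs will differ slightly from tabulated values and from one seed to another.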

Table 4 Relative number of significant study results in the case of no treatment difference (type I error rate)
Fig. 1 Relative number of significant study results in the case of no treatment difference (type I error rate)

The results of this simulation study illustrate the severe impact of bias in a clinical investigation. If no patients are affected by bias (p = 0), the type I error rate is approximately 5% independent of the magnitude of the bias effect (i.e., for all three values of \(\gamma \in \left\{ {0.1, 0.3 , 0.5} \right\}\)). Thus the hypothesis test holds the significance level as it is supposed to. The significance level is indicated by the horizontal line in Fig. 1.

If few patients (proportion p ≤ 0.2) are affected by a very small bias effect (γ = 0.1), the type I error rate is somewhat acceptable (even though already elevated). In all other parameter combinations, there is a substantial to severe type I error rate elevation. Especially for \(\gamma \in \left\{ {0.3 , 0.5} \right\}\) the results are alarming. The statistical hypothesis test erroneously indicates a significant treatment difference most of the time, even though in reality there is none. For example, for γ = 0.5 and if 40% of patients are affected by bias, the wrong test decision in favor of drug A was made in almost 90% of the simulated studies. But even if there is only a very small (γ = 0.1) but systematic (proportion p ≥ 0.9) bias, the type I error rate is over 30% and thus unacceptably large.
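The order of magnitude of these error rates can be made plausible with a rough large-sample calculation. The following is only a sketch under a normal approximation and is not part of the simulation itself; the symbols \(\tau\), \(\Delta\), and \(\delta\), the standard normal distribution function \(\Phi\), and its quantile \(z_{1 - \alpha /2} \approx 1.96\) are introduced here for illustration. Averaging the conditional expected value over \(B_{i} \sim \text{Ber}(p)\) gives the marginal moments

$$E(Y_{i}) = \begin{cases} \mu_{\text{A}} + p\gamma \sigma , & \xi_{i} = -1 \\ \mu_{\text{B}} - p\gamma \sigma , & \xi_{i} = 1 \end{cases} \qquad \text{Var}(Y_{i}) = \sigma^{2} \left[ 1 + \gamma^{2} p(1 - p) \right] =: \tau^{2} ,$$

so that the expected difference in group means and its standardized version are

$$\Delta = \mu_{\text{A}} - \mu_{\text{B}} + 2p\gamma \sigma , \qquad \delta = \frac{\Delta }{\tau \sqrt {1/n_{\text{A}} + 1/n_{\text{B}} } } ,$$

and the probability of a significant two-sided test is approximately

$$\Phi \left( \delta - z_{1 - \alpha /2} \right) + \Phi \left( - \delta - z_{1 - \alpha /2} \right) .$$

For \(\mu_{\text{A}} = \mu_{\text{B}}\), σ = 1, γ = 0.5, p = 0.4, and \(n_{\text{A}} = n_{\text{B}} = 130\), this yields \(\Delta = 0.4\), \(\tau^{2} = 1.06\), \(\delta \approx 3.13\), and a rejection probability of roughly 0.88. For γ = 0.1 and p = 0.9, it yields \(\Delta = 0.18\), \(\delta \approx 1.45\), and a rejection probability of roughly 0.31. Both values are in line with the simulated rates quoted above.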

Part 2: Drug A Inferior to Drug B

The second part of this simulation study investigates the case that drug A is actually inferior to drug B, i.e., without loss of generality \(\mu_{\text{A}} < \mu_{\text{B}}\). Now the impact of bias on the test decision depends on the magnitude of the bias effect in comparison to the treatment effect size. If the true relative treatment difference is larger than the bias effect, the study will still be able to come to a correct conclusion with a likelihood naturally depending on the actual difference between treatment effect size and bias effect. However, if one drug is actually better than the other but the sum of biases surpasses the treatment difference, the test decision can easily swing in the opposite direction. The second part of the simulation study uses the same model assumptions as the first part, with the slight modification that drug A is actually inferior to drug B, i.e., \(\frac{{\mu_{\text{B}} - \mu_{\text{A}} }}{\sigma } = 0.3\) and the bias effect surpasses this effect size, i.e., γ = 0.5. Instead of a two-sided hypothesis test at the 5% significance level, a one-sided, two-sample t test is conducted at the 2.5% significance level. The null hypothesis states that drug A is inferior or equal to drug B, and the alternative hypothesizes that drug A is superior to drug B (the latter actually being incorrect).

Results of this part of the simulation study again demonstrate that bias can have a severe impact on the test decision. Even though drug A is actually inferior to drug B, a systematic bias (p ≥ 0.7) can easily reverse this trend. The proportion of wrong test decisions in favor of (the inferior) drug A is then above 80% (Table 5).
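Part 2 can be re-sketched in the same spirit. As before, this is an independent Python illustration rather than the author's R code; it assumes σ = 1, \(\mu_{\text{A}} = 0\), \(\mu_{\text{B}} = 0.3\), a pooled-variance t statistic, and a hard-coded approximate one-sided 2.5% critical value of 1.969 for 258 degrees of freedom. All names and defaults are hypothetical.

```python
import math
import random

# Assumption: approximate 97.5% quantile of the t distribution with 258 df.
T_CRIT = 1.969


def simulate_group(n, xi, mu, gamma, sigma, p, rng):
    """Responses of n patients in one group; each is biased with probability p."""
    return [rng.gauss(mu - xi * (1 if rng.random() < p else 0) * gamma * sigma, sigma)
            for _ in range(n)]


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (group a minus group b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return (ma - mb) / math.sqrt(ss / (na + nb - 2) * (1 / na + 1 / nb))


def wrong_superiority_rate(p, gamma=0.5, effect=0.3, n_sims=10000, n=130, seed=1):
    """Share of studies wrongly concluding that the inferior drug A is superior."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        group_a = simulate_group(n, -1, 0.0, gamma, 1.0, p, rng)
        group_b = simulate_group(n, +1, effect, gamma, 1.0, p, rng)
        # One-sided test: reject only if drug A looks better than drug B.
        if pooled_t(group_a, group_b) > T_CRIT:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Small demonstration run; increase n_sims for more stable estimates.
    for p in (0.0, 0.7):
        print(f"p = {p}: {wrong_superiority_rate(p, n_sims=1000):.3f}")
```

Without bias (p = 0) the one-sided test almost never favors drug A, whereas a systematic bias lets the inferior drug appear superior in most simulated studies.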

Table 5 Relative number of significant study results: wrong decision in favor of drug A even though drug B is superior


Retrospective as well as prospective clinical studies can be influenced by various types of heterogeneity and bias. As extensively discussed in the literature and guidelines as well as demonstrated in the simulation study above, bias can have a severe impact on study results. Between-group results cannot be attributed to an actual treatment difference, but rather to a mixture of treatment difference, heterogeneity, and bias. The last two might not only lead to a substantial type I error rate elevation but can actually also demonstrate superiority of an inferior treatment.

In a clinical trial, multiple means and instruments exist to prevent bias and heterogeneity. Most important are randomization and blinding [5]. Thus, the CONSORT Group rightly regards randomized controlled clinical trials as the gold standard for evaluating health-care interventions (CONSORT statement [1]). Clearly not every prospective clinical trial is randomized and blinded. Randomization might not always be possible for ethical or practical reasons, e.g., randomly assigning subjects to smoking or not smoking cigarettes. Other clinical trials, e.g., those comparing two different types of surgery, can be randomized but not double-blinded. Many prospective clinical trials have, however, at least the potential to be randomized and double-blinded. On the other hand, studies based on clinical routine data are per se not randomized and not blinded as it would contradict clinical practice. Furthermore, their analyses are usually retrospective. Taking a random selection of existing data and conducting a blinded analysis do not prevent any of the above discussed issues. Quite the contrary, taking a random selection of already existing data actually has unnecessary disadvantages compared to analyzing all data.

Without randomization, without internal validity, and without representing a random sample of the larger population of interest, a basis for statistical inference is not given. Thus there is usually no statistical foundation to generalize results and properties derived from clinical routine data to all patients of interest.

Unfortunately, the above arguments and basic concepts of probability theory are occasionally ignored by practitioners and non-statisticians, who are captivated by p values. This malpractice draws attention away from the great benefits clinical routine data actually have to offer. Descriptively summarizing clinical routine data yields a comprehensive and aggregated insight into the current state of treatment and care of a specific subset of patients. Hypotheses can be derived from this valuable insight, which can subsequently be investigated under circumstances enabling a valid comparison. Thus, like other observational studies, well-executed investigations based on clinical routine data are indispensable precursors of randomized controlled clinical trials.

In summary, prespecified, homogeneous, prospective, randomized, and blinded data acquisition is desired to obtain valid clinical trial results. By default, the acquisition of clinical routine data is usually lacking most of these attributes. However, correctly analyzing clinical routine data yields a valuable insight into the current state of treatment and care of a specific subset of patients. Fulfilling a merely descriptive purpose for a specific subset of patients, descriptive summary statistics are the method of choice. Confirmatory analyses using inferential statistics trying to generalize results to all patients of interest have to be considered carefully and are often inappropriate.