Clinicians have an insatiable drive for definitive answers regarding clinical judgments they make every day. They also hold deep convictions based upon experience and training, which can only be shaken (modified) by convincing data. The days of relying upon the “chart review” for definitive answers have passed us by. How, then, can we answer important clinical questions using current tools from the rapidly developing world of outcomes research? This requires the conversion of an interesting clinical observation into an outcomes research question with a testable hypothesis, followed by an outcomes analysis with a research team.

The purpose of this article is to describe a “protocol,” or pathway, to facilitate this process. Akin to the formal method we teach new physicians to conduct a History and Physical (H&P), a formal protocol such as described here will facilitate outcomes analyses. There are three main phases to the protocol: study design, data preparation, and data analysis, with multiple steps within each phase. The logic of the outcomes analysis process becomes clear if the steps proceed sequentially.

Study design phase

The most important, and arguably the most difficult, phase of a study is its design phase. In fact, most problems with research studies arise in the very first step in this phase—asking the research question. An improperly framed research question will create difficult problems throughout the following steps of the project. Note that both the design phase and the data preparation phase will comprise the Methods section of a manuscript.

An important issue in framing a research question is: Will it be a descriptive study or an analytical study? A descriptive study is often employed when the research question involves a rare or new occurrence, disease, or procedure, since there is little established knowledge about the topic. The hallmark of a descriptive study is questions that begin with “what,” “where,” “when,” “who,” or “how.” For example, “Who has the disease in question?” or “What are the common comorbidities of patients with the disease in question?” These are also known as “open-ended” questions, and statistical testing is not applicable since there is no a priori expectation of any particular answer. If the study is descriptive, a statistician will not be necessary.

In contrast, an analytical study requires “closed-ended” questions, usually beginning with a verb: “Is/Was…” or “Do/Does….” For example, “Does race or gender affect the mortality of patients with the disease in question?” These studies call for a yes/no answer, and statistical testing is applicable.

The difference between open-ended questions for an inquiry in its earliest stage versus closed-ended questions in later stages can be compared to gathering history and physical examination information from a patient. A patient interview begins with open-ended questions (“Tell me about your pain”), but then moves on to closed-ended questions as the matrix of information begins to create a picture of the clinical situation. As this process evolves, the clinician begins to formulate a differential diagnosis list (e.g., “Was it dull? Did you have a fever? Was it in the left lower quadrant?”).

Confusion arises when an attempt is made to compare two slices of the same population with one another. For example, if we want to know if there were more men or women who underwent cholecystectomy last year, the percentage of women versus men would be a descriptive study and P values would not be relevant. This may appear to be a comparative study, but in fact it is a descriptive study because both populations (men who underwent cholecystectomy and women who underwent cholecystectomy) are correlated and thus represent the same population. They are essentially flip sides of the same coin. Figure 1 may help to clarify. Note that in Fig. 1A there is really only one pie, even though we have divided that pie into multiple pieces (representing, e.g., male patients vs. female patients). However, both slices of that pie are calculated with the same denominator, i.e., patients who underwent cholecystectomy. Any comparative statistics about them would be descriptive and formal statistical testing would not be applicable.

Fig. 1 Conceptual illustration of the difference between a descriptive and an analytical analysis. (A) A descriptive study, where both ratios are calculated from the same denominator, so there is really only one study population; a P value is not applicable for comparing different parts of the same population (57% versus 43%). (B) An analytical study, where there are two different study populations (i.e., the 55% is calculated from a different denominator than the 57%); a P value is applicable for comparing parts of two populations (55% versus 57%).

To change the above question from a descriptive question to a “testable” question, we could, for example, ask whether the male-to-female ratio has changed between last year and the year before. Then one could calculate a P value to compare the two ratios. The P value in this instance would be interpreted as the probability of seeing a difference at least as large as the one observed if random chance alone were at work, i.e., if the ratio had not truly changed. For example, a P value of 0.05 indicates that under that status quo one would see results at least as extreme as those found only 5% of the time, a P value of 0.01 indicates only 1% of the time, and so on. Figure 1B shows that there are now two pies, so we can ask whether the proportion of one group is higher or lower in one pie than the proportion of that group in another pie and determine the P value.
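As a minimal sketch of the two-pie comparison, the following Python computes a two-sided P value for the difference between two proportions with a pooled two-proportion z test. The counts are invented for illustration (chosen to give 57% and 55%), the function name is hypothetical, and the z test is only one reasonable choice; it is not necessarily the test a statistician would select for a given study.

```python
import math


def two_proportion_p_value(events_a, total_a, events_b, total_b):
    """Two-sided P value for the null hypothesis that both proportions are equal."""
    p_a = events_a / total_a
    p_b = events_b / total_b
    # Pooled proportion under the null hypothesis of "no difference".
    pooled = (events_a + events_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: women among cholecystectomy patients in two different years.
p = two_proportion_p_value(570, 1000, 550, 1000)
print(f"P = {p:.3f}")
```

The sketch assumes both groups are large and that the pooled proportion is neither 0 nor 1; with small counts an exact test would be more appropriate.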

Step 1: define the population using inclusion and exclusion criteria

Defining the inclusion criteria for a study population is usually fairly intuitive, but there are some nuances to consider. For example, in examining the risk factors for patient safety events in trauma patients, it may be obvious that trauma patients should be the study population. However, how should a trauma patient be defined? Depending on the database (as described below), the definition of a trauma patient may be as simple as all patients in the database if a trauma registry is used. However, it may be more complicated if the database is a generic database such as an administrative database. In such a case, a set of diagnosis codes would be necessary to define trauma patients (for trauma, these are ICD-9 diagnosis codes in the range of 800-959).

However, not all “trauma” patients culled from an administrative database would be pertinent to answering the study question. This is where it becomes important to craft appropriate exclusion criteria. These exclusion criteria are usually related to the outcome variable or the independent variable (outcome variable and independent variable are defined in more detail below). For example, the risk factors for an event cannot be studied among patients who already have the condition. If the development of deep vein thrombosis (DVT) is to be studied, then patients who are admitted with a DVT would need to be excluded. Or, if the mortality rates of two treatment groups are to be compared, then patients who present with that “condition,” i.e., dead on arrival, are excluded. In addition, risk factors cannot be studied in a population that lacks variation in the independent variable to be tested. For example, in the examination of the effect of insurance upon hospital admission status, patients over age 65 would have to be excluded since they are all insured and there are no uninsured patients in that population. Importantly, patients may be excluded for a combination of reasons. For example, burn patients may be excluded from trauma populations because the predictors of outcomes in burn patients are different from those of most trauma patients [1].
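The inclusion and exclusion logic described above can be sketched in a few lines of Python. This is an illustration only: the record layout and the field names (`dx_codes`, `dvt_on_admission`) are hypothetical, the codes are assumed to be stored as strings with a decimal point, and the example uses the DVT scenario as its single exclusion criterion.

```python
def is_trauma_code(icd9_code):
    """True if an ICD-9 diagnosis code falls in the 800-959 range."""
    try:
        category = int(icd9_code.split(".")[0])
    except ValueError:
        return False  # non-numeric categories such as E or V codes
    return 800 <= category <= 959


def select_population(records):
    """Apply the inclusion criterion, then the exclusion criterion."""
    cohort = []
    for rec in records:
        if not any(is_trauma_code(code) for code in rec["dx_codes"]):
            continue  # inclusion: at least one trauma diagnosis code
        if rec["dvt_on_admission"]:
            continue  # exclusion: outcome already present at admission
        cohort.append(rec)
    return cohort


# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "dx_codes": ["805.4", "401.9"], "dvt_on_admission": False},
    {"id": 2, "dx_codes": ["540.9"], "dvt_on_admission": False},
    {"id": 3, "dx_codes": ["821.01"], "dvt_on_admission": True},
]
cohort = select_population(records)
print([rec["id"] for rec in cohort])
```

Keeping the code list and the exclusion rules in one small function makes it easier to report the exact population definition in the Methods section.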

The validity of a study depends in large part on how the study population is defined. Subtle differences in population definition can produce different results. For example, many administrative databases use the International Classification of Diseases, Ninth Revision (ICD-9) coding system to classify both diagnoses and procedures. Most physicians in the United States are more familiar with the American Medical Association’s Current Procedural Terminology (CPT) system for classifying procedures and think that the ICD-9 system pertains only to diagnosis codes; however, there are ICD-9 diagnosis codes as well as procedure codes. The difference between CPT and ICD systems makes it difficult to specifically identify certain procedures since there are more CPT codes than ICD-9 procedure codes, with multiple procedures often lumped into the same ICD-9 procedure code. Because of this incongruity, seemingly disparate study findings may simply be due to different ICD-9 procedure codes having been used in the inclusion criteria. To clarify such situations, the list of exact diagnosis and/or procedure codes used in the population definition should be included in the manuscript, either in the Methods section or as a list in the Appendix.

Step 2: define subsets

In outcomes research, an answer is often generated based upon large heterogeneous populations. Subset analysis can ask and answer questions about more homogeneous groups (minorities, elderly, geographic area) within the larger set. Thus, outcomes research is actually less effective in showing that a treatment works when using a large population (i.e., the efficacy issue), since it is difficult to control for all possible confounders retrospectively in a database [2]. Rather, the strength of outcomes research is in its generalizability (i.e., whether the treatment works in real-life situations and in every patient subpopulation). This has also been labeled the “effectiveness” issue and makes outcomes research an important tool for comparative effectiveness research. For example, if “A” works overall, does it also work in the elderly? Does it also work in minority populations? The latter is especially an important issue given the absence of data regarding minority populations in the literature [3, 4].

Step 3: define outcome variable(s)

This is perhaps the most important step in designing a research question. Unfortunately, it is often inadequately addressed, or missed entirely. It is quite typical for people to ask, “What are the outcomes for xyz patients?” However, such a question does not specify what the target outcome of interest actually is. An appropriate analytical research question requires the outcome to be specified up front: Is it mortality? Is it complications, or a specific set of complications? Complications as an outcome is a perfect example of why the outcome variable needs to be specified up front: If you define “complication” to include only two events, you will get a very different rate than if you included ten events. Also, certain outcomes such as wound infection or sepsis are notoriously difficult to define.

If the question was properly framed in the beginning, as a closed-ended question, then it usually becomes obvious what the outcome variable is. Once again, the importance of the initial framing of the question cannot be overstated.

As described below, a study should have as few outcomes as possible, so they must be chosen very carefully. Each outcome of interest will require a fairly detailed analysis on its own. Having multiple outcomes may make the manuscript confusing. For example, the contributors leading to DVT are likely to be different in the setting of sepsis versus wound dehiscence versus death. A study that attempts to examine all these different outcome variables is likely to be lengthy and difficult to digest.

Step 4: define the primary comparison to be made

This is a critical feature for any analytical study. In a descriptive study, there is no comparison: The prevalence of x and y and the average of z in that population are simply described. An analytical study, on the other hand, requires that some comparison be made. For example, the question, “What is the mortality of xyz procedures in elderly patients?” would be in a descriptive study where statistical testing would not be applicable. The question, “Are the elderly at elevated risk for mortality compared to younger patients following xyz procedure?” would be in a comparative study where a statistical comparison would be made between elderly patients and young patients. Specifying the comparison to be tested up front also helps to avoid Type I error; otherwise, the investigator runs the risk of trying additional analyses, which may lead to spurious findings.

Step 5: define covariates/confounders

There are usually many factors that can influence an outcome variable of interest: These are termed covariates and confounders. For example, in comparing mortality rates of patients, the influence of age, gender, race, socioeconomic status, location, etc., must be considered in addition to the primary comparison variable of interest. This highlights a fundamental difference between clinical trials methodology and outcomes research. Both are concerned with confounders, but each addresses them differently. Clinical trials methodology addresses the issue via randomization, creating an equal mix of all possible confounders in both comparison groups. Outcomes research, on the other hand, does not have this luxury, and so it needs to adjust for the influence of confounders statistically. This presents a problem, however, since you need to know that something is a confounder before you can add it to the analysis and adjust for it. For example, if hair color were a determinant of mortality, but we did not know this and thus it was not collected and added to the database, then we would not be able to adjust for it in the analysis. This is a major difference between outcomes research and clinical trials, which is why this step is critical for outcomes researchers. The strength of an outcomes study depends on how many covariates can be identified and adjusted for.

Data preparation phase

Once the research question is defined, the next step is to prepare the data for analysis. It is often surprising how long and challenging this “data preparation” or “data cleaning” step can be. It is rare for an investigator to move straight from the research question to an analysis without needing to deeply analyze and qualify the relevant data. In addition, it is important to take precaution at this phase to ensure patient confidentiality by not including patient identifiers in the analytical file to be created. This issue may be less relevant when analyzing administrative databases or population databases, but it may be overlooked when accessing institutional clinical databases.

Step 1: select the database(s)

The first step in the data preparation phase is to select the workhorse database. Depending upon the research question, either an administrative database or a clinical database needs to be chosen. The Agency for Healthcare Research and Quality (AHRQ) also offers a users’ guide to registries that can be used to evaluate patient outcomes [5]. An example of an administrative database would be the Nationwide Inpatient Sample (NIS) [6], which is effective in answering questions regarding the cost of care. Examples of clinical databases would be the National Surgical Quality Improvement Program (NSQIP) [7], the National Trauma Data Bank (NTDB) [8], and the Surveillance, Epidemiology, and End Results (SEER) database [9], all of which contain more detailed clinical data. More than one database may be suitable, or necessary, to address the question at hand.

Step 2: link databases

The data that are needed to answer the research question may reside in different databases, in which case the linking of these databases will be necessary. This will require some identifiers that are common to both databases; for example, when looking at hospital characteristics (teaching status, rural/urban location, volume) and their impact on patient outcomes, the patient database will need to be linked with the hospital database, probably via hospital identification numbers. For internal institutional databases, data are often scattered across multiple data sources (medical records, labs, radiology), and linking databases together with patient identifiers becomes necessary. In most cases, the need for identifiers to make this linkage will make it impossible for investigators to act without help, especially when dealing with population-level databases [10–12]. For example, since most population-level databases are de-identified, it is not possible to link the SEER database with NIS, which would be useful for answering questions about hospital care versus long-term outcomes. Fortunately, the federal government has recognized the need for such a linked database and has now released the SEER-Medicare database for this purpose.
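The hospital-level linkage described above can be illustrated with a short Python sketch. The field names (`hospital_id`, `teaching`, `location`) and the records are hypothetical; a real linkage would follow the identifiers documented in each database's data dictionary.

```python
def link_patients_to_hospitals(patients, hospitals):
    """Attach hospital characteristics to each patient record via a shared ID."""
    hospital_by_id = {h["hospital_id"]: h for h in hospitals}
    linked, unmatched = [], []
    for patient in patients:
        hospital = hospital_by_id.get(patient["hospital_id"])
        if hospital is None:
            unmatched.append(patient)  # keep track of records that fail to link
        else:
            linked.append({**patient,
                           "teaching": hospital["teaching"],
                           "location": hospital["location"]})
    return linked, unmatched


# Hypothetical data for illustration only.
hospitals = [
    {"hospital_id": "H1", "teaching": True, "location": "urban"},
    {"hospital_id": "H2", "teaching": False, "location": "rural"},
]
patients = [
    {"patient_key": 101, "hospital_id": "H1", "died": False},
    {"patient_key": 102, "hospital_id": "H2", "died": True},
    {"patient_key": 103, "hospital_id": "H9", "died": False},
]
linked, unmatched = link_patients_to_hospitals(patients, hospitals)
print(len(linked), "linked;", len(unmatched), "unmatched")
```

Returning the unmatched records separately is a deliberate choice in this sketch: the number of records lost in linkage is worth reporting, since a high failure rate can bias the study population.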

Step 3: select data elements

Selecting the data elements serves to match the research question to the available data elements. This involves looking up the reference manual, or “data dictionary,” for each database, and matching the research question elements to their corresponding database definitions. This may be challenging depending upon the clarity and rigor of the particular database. For example, there are three different variables for “stage of cancer” in the SEER database, all of which use different criteria.

Step 4: generate new data elements

This is perhaps the most time-intensive step of an outcomes study. It is common for the sought variables to not be defined in a way that immediately meets the need of the study at hand. As an example, the definitions used as elements for the “stage” variable in SEER may not fit the assumptions of the study at hand. It then becomes necessary to manually construct novel “stage” variables based upon information from a number of other variables such as “extent of disease” or “node involvement.” This can become even more difficult if the variable of interest is somewhat amorphous. For example, for any outcomes study, it is important to adjust for patient “comorbidity.” However, there is no standard definition of comorbidity that is universally accepted. To adjust for comorbidity would therefore involve literature research to identify possible methods to measure comorbidity (preferably multiple methods) and then manually construct that variable based on other information contained in the database about each patient. In this specific case of comorbidity, the Charlson Index [13, 14] or the Elixhauser Index [15] among others would be useful.

Analysis phase

Following the process described below will produce the Results section of the manuscript.

Step 1: univariate descriptive analysis

The univariate descriptive analysis describes the entire study population. It is called “univariate” analysis because the population is described one characteristic at a time: average age, proportion males, race, socioeconomic status, insurance status, location, and so on. This is important so that future readers can determine whether the study applies to their patients. Since this section is solely descriptive, no formal statistical testing is necessary or applicable. An example data table for a univariate analysis is presented in Table 1. A study that is only descriptive would likely end after this stage. Analytical studies will continue on through the next few steps.
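A univariate summary of this kind can be sketched as follows. The patient records and field names are hypothetical, and only a mean with standard deviation and simple proportions are shown; a real demographics table would typically include more characteristics.

```python
from collections import Counter
from statistics import mean, stdev


def describe(patients, categorical_fields=("sex", "insurance")):
    """Summarize the whole study population one characteristic at a time."""
    n = len(patients)
    ages = [p["age"] for p in patients]
    summary = {"n": n, "age_mean": mean(ages), "age_sd": stdev(ages)}
    for field in categorical_fields:
        counts = Counter(p[field] for p in patients)
        # Every proportion shares the same denominator: the full population.
        summary[field] = {level: count / n for level, count in counts.items()}
    return summary


# Hypothetical records for illustration only.
patients = [
    {"age": 30, "sex": "F", "insurance": "private"},
    {"age": 50, "sex": "M", "insurance": "uninsured"},
    {"age": 70, "sex": "F", "insurance": "public"},
    {"age": 50, "sex": "F", "insurance": "private"},
]
summary = describe(patients)
print(summary)
```

No P values are computed here, consistent with the descriptive nature of this step.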

Table 1 Example of a univariate/demographics table

Step 2: bivariate analysis

The purpose of bivariate analysis is to report the differences between the comparison groups one characteristic at a time. For example, in comparing elderly versus younger patients, the data table will be a two-column table, with one column for elderly and another column for younger patients, and one row for every additional characteristic to be compared. An example is presented in Table 2. At this juncture, the task of statistical testing becomes central. If the characteristic to be compared is a continuous variable (e.g., length of stay), then a t test (to compare means) or a Wilcoxon test (to compare medians) can be applied [16]. If the characteristic to be compared is a categorical variable (e.g., live or die), then a χ² test can be applied. If the outcome of interest is survival over time, as is common in cancer research, then a Kaplan–Meier analysis may be performed. The term “bivariate” analysis is used because for every statistical test performed, the relationship between two variables (i.e., between age and death rates, then between age and length of stay, and so on) is described statistically.
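For a categorical characteristic, the χ² test on a 2×2 table can be written out directly. The sketch below uses the Pearson χ² statistic without a continuity correction and invented counts; the function name is hypothetical, and in practice a statistical package would normally be used.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction.

    Returns (statistic, P value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = elderly / younger, columns = died / survived.
stat, p = chi_square_2x2(30, 170, 20, 280)
print(f"chi-square = {stat:.2f}, P = {p:.4f}")
```

This is an unadjusted comparison: it relates age group to mortality without accounting for any other characteristic. The formula also assumes that no row or column total is zero and that expected cell counts are not too small.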

Table 2 Example of a bivariate analysis data table, presenting unadjusted comparison

A clinical trial manuscript may end its Results section here. It will not be necessary for it to go on to the multivariable analysis, since the comparison groups in a clinical trial should be balanced in every way (if designed properly), and there is no reason to proceed further and adjust for confounders. However, for outcomes analysis, the statistical tests presented in bivariate analysis are referred to as “unadjusted” analyses since the comparisons are being made one variable at a time. This analysis will therefore not account for any confounders, which are always present in outcomes analysis.

Step 3: multivariable analysis

Multivariable analysis is the hallmark of outcomes research. In a nutshell, it allows investigators to compare disparate groups by mathematically adjusting for the differences (i.e., the confounders) between comparison groups so that they approach mathematical equivalence. Findings from multivariable analysis are also called “adjusted” results. An example is presented in Table 3. Multivariable analysis is performed with multiple logistic regression for a categorical outcome variable (e.g., live or die) or with multiple linear regression for a continuous outcome variable (e.g., length of stay). If the outcome of interest is survival over time, which is common in cancer research, then Cox proportional hazards analysis will be used.
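Regression models are best fitted with a statistical package, but the idea of “adjusting” for a confounder can be illustrated with a simpler stratified technique. The sketch below computes a crude odds ratio and a Mantel-Haenszel odds ratio adjusted for a single confounder. This is a substitute illustration, not the regression approach described above, and the counts are invented so that the confounder (here imagined as injury severity) fully explains the crude association.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for one 2x2 table.

    a = exposed with event, b = exposed without event,
    c = unexposed with event, d = unexposed without event.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Odds ratio adjusted for the stratifying variable (Mantel-Haenszel)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts, one (a, b, c, d) table per level of the confounder.
strata = [
    (10, 90, 40, 360),   # low severity
    (120, 280, 30, 70),  # high severity
]

# Collapsing the strata ignores the confounder and gives the unadjusted result.
crude = odds_ratio(*(sum(cell) for cell in zip(*strata)))
adjusted = mantel_haenszel_or(strata)
print(f"crude OR = {crude:.2f}, adjusted OR = {adjusted:.2f}")
```

In this constructed example the exposed group contains far more high-severity patients, so the unadjusted comparison suggests an effect that disappears once severity is taken into account. Stratification handles only a few confounders at a time, which is why regression is used when many covariates must be adjusted for simultaneously.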

Table 3 Example of a multivariable analysis data table, showing adjusted risks of outcome

The validity of the results from an outcomes analysis rests on the strength of this multivariable analysis, and more specifically on the number of confounders accounted for in this step. Therefore, it is important to list all the variables that are included in a multivariable analysis, discuss the rationale behind each of them, and then address in the “limitations” part of the Discussion any variable that could not be accounted for in the study. The existence of unknown confounders should also be acknowledged (unlike clinical trials, which theoretically control for both known and unknown confounders via their randomization process, the possibility exists in outcomes analysis that there may be confounders that the world does not yet know about).

Many outcomes analyses end here at the multivariable analysis step. However, to make a stronger case, subset analysis and sensitivity analysis should also be performed. These additional analyses will demonstrate appropriate rigor.

Step 4: subset analysis

The goal of subset analysis is to determine the generalizability of the findings. The idea is to repeat the analysis within every patient subgroup to determine whether the findings are qualitatively the same in all patients. These subset analyses will reduce the concern that the study's findings may be spurious.

This is especially important in outcomes research, where heterogeneous patients make up the study populations. For example, if A works overall, does it also work in the elderly? Does it also work in minority populations? The consistency of the findings across different patient subpopulations will not only make the case for generalizability of findings, but it will also address one of the fundamental limitations of outcomes research: its inability to adjust for all confounders. If a finding is consistent across all patient populations, then the unknown confounders are probably not an issue. The reasoning is that the prevalence of confounders probably differs among patient subgroups; if the results are nevertheless qualitatively consistent across these groups, these confounders are unlikely to alter the study findings. There will obviously be quantitative differences between different patient populations, so the effect will be stronger or weaker in different patient subgroups. The objective of subset analysis (and the next step, sensitivity analysis) is not to detect these minor differences, but rather to detect whether there are qualitative differences, i.e., whether the findings are reversed in any patient subgroup. A few sentences regarding the presence or absence of qualitative differences should suffice.
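The qualitative check described above, i.e., whether the direction of the finding is reversed in any subgroup, can be sketched as follows. For brevity this illustration uses an unadjusted risk difference and invented counts; in an actual study the adjusted model would be refitted within each subgroup. The function names are hypothetical.

```python
def risk_difference(events_a, n_a, events_b, n_b):
    """Event rate in group A minus event rate in group B."""
    return events_a / n_a - events_b / n_b


def subgroups_with_reversal(overall, subgroups):
    """Names of subgroups whose effect direction differs from the overall one."""
    overall_positive = risk_difference(*overall) > 0
    return [name for name, counts in subgroups.items()
            if (risk_difference(*counts) > 0) != overall_positive]


# Hypothetical counts: (events treated, n treated, events control, n control).
overall = (60, 400, 90, 400)
subgroups = {
    "elderly": (35, 150, 50, 150),
    "younger": (25, 250, 40, 250),
}
reversed_in = subgroups_with_reversal(overall, subgroups)
print(reversed_in or "no qualitative differences")
```

The magnitude of the effect differs between the two subgroups in this example, but only a change in sign is flagged, which mirrors the distinction between quantitative and qualitative differences.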

Step 5: sensitivity analysis

The objective of sensitivity analysis is to alter some key assumptions of the study to determine if those changes will affect the conclusion. If the answer is no, it will strengthen the case that the study is not affected by methodological problems. Since there is often no consensus on what is the “best” methodology, a study that goes ahead and uses multiple methodologies will eliminate any potential reviewer concern that one method is better than another. For example, if the data are adjusted for patient comorbidities with the Charlson Index, the analysis could be repeated with the Elixhauser Index to determine if the results change qualitatively. Another approach would be to adjust for patient confounders with regular multiple regression analysis, and then with propensity score analysis, and see if the findings are equivalent [17, 18].

Again, as in subgroup analysis, there will likely be some quantitative differences (e.g., the difference between groups may be a little more or a little less), but hopefully there will be no qualitative change in your conclusion. A few sentences regarding the presence or absence of qualitative differences in the Discussion section should suffice.

Conclusion

A methodical protocol such as the one described here can facilitate converting an interesting clinical question into an outcomes research question with a testable hypothesis.