Data Design Overview

Figure 1 provides the context for Data Design by depicting the process of problem solving via ML modeling. Elements in black outlines are naturally occurring phenomena (i.e., occurring outside the control of the problem-solving team); elements in green outlines are controllable by the analyst/scientist involved in the ML-based problem solving. ML-based problem solving involves two steps. The first is high-level data design, where one or more problem-solving data sets are constructed. These data contain the subjects and data elements relevant to the problem, and are either extracted from a repository of naturally-collected data (i.e., collected without the control of the ML-based problem-solving team—top branch in the figure) or collected specifically for the purpose of the modeling (lower branch). In the second step, modeling data sets are constructed, upon which machine learning algorithms can operate.

Fig. 1

Overview of problem solving using ML modeling

The goal of data design. An ML model can be viewed as a function that maps inputs to outputs. The inputs to the ML model are ML features, which exist in the ML model space. ML features do not always have a one-to-one correspondence to real-world entities or to the naturally-collected data elements. These entities and data elements exist in the real-world problem space. Similarly, the model output, which exists in the model space, needs to be mapped to the real-world entities in the problem space. The goal of data design is to create this mapping between the real-world problem space and the model space in such a manner that after mapping the real-world entities onto the ML model space, solving the problem in the ML model space, and finally mapping the ML solution from the model space back to the real world (problem space), we obtain a correct solution to the real-world problem.

Figure 2 depicts the context around data design from the perspective of the ML modeling process. It presents an overview of the modeling process and the iterative interaction of data design with modeling. The blue rectangle highlights the elements of the modeling steps that fall under data design, and these are the elements that are discussed in this chapter. Elements outside the blue box, namely data transformations, model fitting, evaluation, and iterative improvement of the model, are covered in other chapters (“The Development Process and Lifecycle of Clinical Grade and Other Safety and Performance-Sensitive AI/ML Models,” “Data Preparation, Transforms, Quality, and Management,” and “Evaluation”) of this book.

Fig. 2

Overview of the modeling process. Steps of the Data Design are highlighted inside the blue rectangle

Data design is the process of creating a formal specification of project goals and variables and establishing a bidirectional mapping between the real-world entities (and data elements) in the problem space and the ML features in the ML model space. ML then provides function approximation models of the real-world data-generating process that can be used, along with the aforementioned mapping, to solve problems in the real world by first solving them in the ML model space.

The first step in ML-based problem solving is to define the problem to be solved according to the following five elements:

  1. Outcome. What clinical outcomes are we considering? If we consider multiple outcomes, which is the primary outcome and which are secondary?

  2. Exposure/Intervention. Is there a particular exposure or intervention that we wish to estimate the effect of?

  3. Predictor variables (aka independent non-interventional variables). Which variables are relevant to this analysis and should be included? Which variables are confounders that we absolutely must include? Which variables must we omit?

  4. Target population. Which patients should the answer hold true for?

  5. Time frame. When should the answer hold true? When do we plan to intervene, apply the model, or use the knowledge that we gained from this analysis? How long does it take for the outcome to manifest itself?

The above five elements describe the real-world solutions we wish to obtain and the data elements needed to obtain them.

As part of the data design, we also need to consider potential data design biases. Data design biases are systematic errors in the choice of data sources, including variables and samples, as well as mismatches between the data and the modeling methods to be used.

Classical statistical study design vs. data design. In classical statistical study design the data representation for modeling is a matrix (the “design matrix” in statistical terminology), often a two-dimensional table of numbers. In ML however, we often deal with other data representations such as higher-dimensional matrices (tensors), graphs, sequences, text documents, images, combinations of the above, etc. The data design methods we describe here are generally applicable, although for simplicity we give examples based on flat matrix representations.

Defining the Problem

At the initiation of an ML project, usually only a rough clinical or health science question and the context of use of the results are known. The objective of this step is to refine the rough problem statement into a more precise, formalized and operationalizable format. By precise, we mean that the problem statement contains all and only the information we need to solve it; by formalized, we mean that the answer to the problem can be expressed as an estimand (a computable quantity); and by operationalizable, we mean that we can compute the answer in terms of the available data elements. In the following sections, we describe the setting, the five elements that make problem statements precise, and the most commonly used estimands. Afterwards, we describe common data design types to which we can map our problem, and in the last section of this chapter, we describe the inference process and explain what a “valid answer” means.

Setting

The first critical junction in defining the problem statement is to decide the operative setting. We consider three broad settings. The first is the clinical setting, where the problem concerns clinical decision making, including risk models, estimates of effects of exposures, targeting interventions, and timing of interventions. Such clinical models ideally will directly inform patient care or otherwise become part of health care delivery.

The second is the operational setting, where the results from the analysis are not directly used for treating patients, but rather for managing the system of health care.

The third is the health science research setting. ML models can be used for a broad array of research problems, including biomarker discovery, optimized treatment protocols based on biomarkers, discovery of biological causal pathways, clinical trials, etc. Translational research contexts bridge the health sciences with the health care problem-solving domains.

The setting in which the modeling results are used determines many attributes of the data design and modeling. For example, evaluation of health care-oriented ML models needs to take patient safety into account. Clearly, the direct risk of harm to patients is highest in the clinical setting and lowest in the research setting. The scope of populations involved in health care versus health science modeling can vary from very narrow to full-population studies. However, health care modeling is often restricted to specific health systems, with or without examination of translation across systems.

Setting refers to the context in which the modeling results will be used. We broadly distinguish between three settings: clinical, operational and research. Different settings impose different requirements on the steps of the modeling process.

Best Practice 7.1.1

The ML data design needs to take the operative setting of the ML models into account.

Elements of the Problem Statement

As we discussed earlier, the modeling project is typically motivated by a clinical question, an operational opportunity, or a research need. This initial motivation offers only a rough outline for the problem statement.

The five elements of a problem statement (Outcome, Exposure/Intervention, Predictor variables, Target population, Time frame) help make a rough problem statement more precise [1].

Example 7.2.1

As a hypothetical example, experts in a health system may believe that “starting diabetes treatment earlier could reduce major cardiac events”.

The rough problem statement is “Does starting diabetes treatment earlier reduce major cardiac events?”. This question is not precise: how much earlier should we start? It is not formalized: what metric (estimand) for outcomes should we compute? It is also not operationalizable: how do we define “diabetic” using the available data elements?

To make the question more precise, we need to define five elements. Not all elements are needed for all questions, but most questions need most of these elements.

Target Population

The target population is the set of patients to which the problem statement applies.

If we construct a clinical risk model, the target population consists of patients to whom we are going to apply the resulting model. All patients in the target population must be at risk of developing the outcome in the problem statement. If the problem concerns an interventional treatment, the target population is patients who are eligible for the treatment. If the problem concerns the study of a biological function, then the target population is the set of research subjects in which this biological function exists.

Example 7.2.1

The example problem statement is related to the patient population of the health system in question, so the target population is the patient population (1) served by the hypothetical health system (2) who would be considered for diabetes treatment, or who could conceivably benefit from earlier diabetes treatment. So the example question is further refined to: “Can earlier initiation of diabetes treatment in diabetic patients eligible for it reduce major cardiac events in this health system?”

Exposure/Intervention

Some studies are concerned with the effect of an intervention or of an exposure (defined below). Not all studies have an intervention of interest, but if there is one, we need to specify it. The intervention in our running example is the earlier initiation of diabetes treatment.

The intervention or exposure divides the population into two groups.

  • Patients receiving the therapeutic intervention are referred to as the treatment group, while the remaining (untreated) patients are referred to as the untreated group (also as controls if they are untreated and similar to the treated patients before treatment).

  • In case of exposure to a naturally-occurring factor, patients with the exposure are referred to as the exposed group and the remaining patients form the unexposed group (or controls if similar to the exposed group before exposure).

In non-designed data (e.g., collected from routine care records) therapeutic interventions may be considered as exposures. It is also common to collect data about interventions and multiple exposures and model them simultaneously. Note that classical study design does not distinguish between exposure and treatment, and refers to both as ‘exposure’.

Outcome

Not all analyses have a designated response variable (e.g., clinical outcome of interest). For example, finding comorbidities in older diabetic patients does not have a designated outcome of interest. However, the product of the analysis still needs to be specified. In this example, this product is the set of common comorbidities. Commonly, studies may also have multiple outcomes which are then categorized as primary and secondary.

The primary outcome(s) is (are) the main focus of interest; other outcomes are called secondary outcomes.

Example 7.2.1

In our running example, the main and only outcome is major cardiac events (MACE). Additional (secondary) outcomes could also be of interest, e.g., health care utilization or quality of life.

Patients with the outcome in question are referred to as cases, while patients without the outcome are referred to as controls.

Notice that the meaning of ‘control’ depends on the comparison being made: it can refer to two different groups, either those without the outcome or those without the intervention/exposure [2].

Time Period

Time period is the time frame encompassing the data to be modeled. Such time frames may concern, e.g., the time point at which the intervention is carried out (or a decision support model is used), or the time period during which the outcomes develop. There could also be a time period for collecting information before the intervention is applied.

Example 7.2.1

In our running example, for a retrospective analysis aiming to establish the effects of early diabetes treatment on MACE, we can use a design in which we collect historical patient data covering a 10-year time window starting 10 years before the analysis and ending at the time of analysis. MACE typically develops 5–10 years after the onset of diabetes, hence the choice of time window length. Note that there are alternative designs to accomplish this modeling that will be discussed later.

Predictor (Non-Outcome) Variables

Predictor variables are all the data (in addition to outcomes) that could possibly be relevant to our modeling task. Predictor variables include demographics (age, sex or gender as appropriate), risk factors, exposures, interventions, social behavioral data, images, genetic information, etc. At a minimum, we need to include exposure variables that are known to influence the outcome(s) (if there are any), the interventions of interest (if there are any), those variables that are suspected to influence the exposure and outcome(s) of interest (i.e., potential confounders), and any other variable that we wish to adjust for.

There are also variables we should generally not include. (1) In a regression or SEM model that estimates causal effects, the variables that must be excluded are (a) the causal descendants of the outcome and (b) variables on the causal chain between the exposure and the outcome. More details about causal modeling are provided in chapter “Foundations of Causal ML”. (2) In a predictive model, variables that are associated with the outcome and occur after it must be excluded if model application is desired before the outcome occurrence. More generally, variables that would be measured after the model application should not be included.

Common Metrics

There are a number of measures we use to estimate “risk” or “effect size” and they fall into three broad categories that are frequently confused. These are proportions, ratios and rates.

Proportion is the number of elements with a particular property divided by the total number of elements.

For example, the prevalence of diabetes in a community is a proportion, where the “particular property” is having diabetes and the “elements” are patients. Thus, the prevalence of diabetes is the number of patients with diabetes in this community divided by the total number of members of the community.

Ratio: the number of elements with a particular property divided by the number of elements without that property; equivalently, p / (1 − p), where p is the proportion of elements with the particular property.

Rate: the number of measured events over a unit time period.

All three types are fractions. Proportion and ratio have the same numerator, but differ in their denominators: for proportions, the elements in the numerator are included in the denominator, while for ratios, the elements in the numerator are excluded from the denominator. The proportion of diabetics in a population is the number of patients who are diabetic among all members of the population. Patients in the numerator (diabetic patients) are included in the denominator (all members). In contrast, the ratio of diabetic patients is the number of diabetic patients divided by the number of non-diabetic individuals. The numerator (diabetic patients) is excluded from the denominator (non-diabetic individuals). The range of a proportion is between 0 and 1, while the range of a ratio is between 0 and infinity.

Rate differs from the other two types in that it has a time component: it measures the number of events over a unit period of time. For example, the rate of developing diabetes in the US is 1.4 million people per year.

Commonly used metrics include the following (a brief computational illustration follows the list):

  • Prevalence: the proportion of patients with a disease in the population at a point in time. This is often expressed as a percentage (the number of diseased patients per 100 individuals); in epidemiology, prevalence is more commonly expressed as the number of diseased patients per 10,000 individuals.

  • Incidence: the number of patients newly diagnosed with a disease in the population over a unit period of time. This is a rate, often measured as the number of new cases per 10,000 individuals over a year.

  • Probability of an event.

  • Odds of an event. Odds is a ratio: probability of an event divided by one minus the probability of that event.

  • Risk: measures how likely it is for an undesirable event to happen. It can be measured as an actual probability, as hazard, log hazard, etc., within a time period.

  • Relative risk: the ratio of an observed quantity to an expected (baseline) quantity. These quantities can be numbers of events, proportions, probabilities, or rates (with the same time period for the observed and expected quantities).

  • Hazard and hazard ratio are formally defined in the “Time-to-Event Outcomes” section of the “An Appraisal and Operating Characteristics of Major ML Methods Applicable in Healthcare and Health Science” chapter.

  • Average Treatment Effect (ATE): average difference in therapeutic effect between comparable treated and untreated patients.

  • Average Treatment Effect in the Treated (ATT): average effect of the treatment among treated patients (the average difference in outcome for treated patients with the treatment versus without it).

  • Individual treatment effect (ITE): the effect of treatment in a patient (difference in likelihood of an outcome with and without treatment).
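
To make these definitions concrete, the following minimal Python sketch computes each of the basic metrics from hypothetical counts. All numbers and variable names are illustrative assumptions, not real data.

    # Minimal illustration of the common metrics (all counts hypothetical).
    n_population = 10_000      # members of the community
    n_diabetic = 1_200         # members with diabetes

    # Proportion: the numerator is included in the denominator (range 0..1).
    prevalence = n_diabetic / n_population                    # 0.12

    # Ratio: the numerator is excluded from the denominator (range 0..infinity);
    # equivalently p / (1 - p).
    ratio = n_diabetic / (n_population - n_diabetic)          # ~0.136
    assert abs(ratio - prevalence / (1 - prevalence)) < 1e-9

    # Rate: events per unit time (here, new diagnoses per person-year).
    new_cases, years = 140, 1
    incidence_rate = new_cases / (n_population * years)

    # Odds of an event with probability p.
    p = 0.12
    odds = p / (1 - p)

    # Relative risk: observed risk divided by expected (baseline) risk.
    risk_exposed, risk_unexposed = 0.20, 0.10
    relative_risk = risk_exposed / risk_unexposed             # 2.0

    # Average treatment effect (ATE): mean outcome difference between
    # comparable treated and untreated patients.
    mean_treated, mean_untreated = 0.08, 0.12
    ate = mean_treated - mean_untreated                       # -0.04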

Types of Modeling Problems

As we discussed in chapter “Artificial Intelligence (AI) and Machine Learning (ML) for Healthcare and Health Sciences: the Need for Best Practices Enabling Trust in AI and ML”, AI/ML have an extraordinary range of applications. Thankfully, the vast majority of problems fall into only a handful of common categories.

Because the majority of problems fall into a few types, and each type can be addressed by a small number of data designs, a few data design templates suffice for solving most problems encountered in health sciences and health care.

Below we list the most commonly-encountered and broadly applicable problem classes.

Diagnostic problem class. This is the problem of assigning patients to diagnostic categories. It is solved by developing diagnostic models, hand-crafted or machine-learnt. An important aspect of the diagnostic problem is that it is concerned with diseases (or clinical conditions) that currently exist in the patient [3].

Screening problem class. A special case of the diagnosis problem where disease has not manifested clinically and needs to be detected as early as possible. Different from the diagnostic tests, screening tests are applied to a much greater number of patients. Mammography is an example of a screening test, aimed at identifying patients with breast cancer. When the screening test reports a positive result, a more reliable diagnostic test is then used to determine whether the subject indeed has the disease. Compared to diagnostic tests, screening tests are often less expensive, less harmful (have fewer side effects or lower risk to health) and thus can be deployed at a larger scale than a diagnostic test. Screening tests often suffer from producing a large number of false positives which is a consequence of the low prevalence of the tested condition in the broad population where the test is applied. ML modeling can help with implementing high quality scalable screening tests (e.g., radiology ancillary finding interpretation).

Risk prediction. This problem class deals with forecasting patients’ future risk of developing a disease or a disease-related outcome. Risk models differ from screening or diagnostic models in two important ways. First, screening and diagnostic models are concerned with diseases that the patient has already developed or is in the process of developing, while risk models are concerned with events that may occur in the future. Second, diagnostic tests are often confined to diseases, while risk models can be used more broadly, such as to assess the risk of future re-hospitalization, ICU admission, prolonged illness, etc.

Prognostic problems. Prognosis usually refers to forecasting outcomes of interest in patients with a disease (or simply the likelihood of recovery from the disease) [3]. In molecular precision oncology, “predictive” (as opposed to “prognostic”) commonly refers to the likelihood of outcomes given a treatment. ML models can perform these forecasts (and in some cases, e.g., high-throughput-based molecular oncology, they provide the primary means for this task).

Estimating treatment efficacy. The goal of this class of problems is to estimate the effect of an intervention on an outcome. This outcome can be an event (e.g., a disease or re-hospitalization) or a quantity (e.g., systolic blood pressure reduction in hypertensive patients). While it is sufficient for the relationship between the predictor features and the outcome of interest to be associative (correlational) for diagnostic, screening, and risk models, treatment efficacy estimation needs to be based on causal relationships (especially when changes in treatment practices are contemplated).

Biomarker or risk factor discovery. Biomarkers are a very diverse group of entities that include: (1) any correlate of a phenotype or outcome of interest; (2) complex computable models (aka “signatures”) that can be used to diagnose, prognosticate or treat phenotypes/patients; (3) substitutes (proxy outcomes) for longer term outcomes in clinical trials; and (4) drug targets for new therapeutics. Accordingly, a variety of ML models can serve such discovery: for (1), predictive modeling; for (2), predictive, prognostic or causal ML modeling; for (3) and (4), causal ML modeling [4].

Operational problem solving. These models relate to the management and administration of health organizations, often involving resource planning and allocation. Examples include hospital bed allocation, ordering supplies, managing personnel, billing and reimbursement, etc.

Economic evaluations. These pertain to understanding the economic impact of clinical and administrative decisions. ML-enabled economic evaluations can be used to compare two treatments of the same disease and study their outcomes in light of their costs. They can also be used to inform resource allocation across multiple diseases by comparing health benefits of a new treatment of one disease versus the health benefits of a new treatment of a different disease. At an even higher level, they guide policy decisions about allocating funds to health care initiatives and prioritize investments [5].

Subpopulation discovery. This class of problems is closely related to (1) precision health, (2) poor average effects of treatments, and (3) population mixtures. With respect to (1), precision treatments target subgroups that will benefit from the treatment and protect subgroups that are more susceptible to toxic side effects. Regarding (2), RCTs often reveal that the average effect of a new treatment in the general population of patients is small; however, it is possible for the treatment to be optimally effective in a subgroup. In case (3), where an apparently homogeneous population in reality comprises subpopulations with different underlying disease mechanisms or physiology, it is often useful to de-mix these subpopulations and apply subpopulation-appropriate health interventions.

Inference: What Is a Valid Solution?

Effective ML-based problem-solving must ensure the validity of the produced solutions. In this section, we elaborate on solution validity. Every problem at hand concerns a particular patient population, the target population. The solution of the problem is valid if it holds true in the target population. It is very rare that we have data for the entire target population. Instead we often have access to a different population called the accessible population. Because it is impractical to collect data about all members in the accessible population, we draw a discovery sample from the accessible population (Fig. 3). Thus, ML models are derived from the discovery sample and not the target population and not even from the entirety of the accessible population. We must thus ensure that the solutions produced by the models generalize to the accessible population and then further to the target population [1].

Fig. 3

Inference. We estimate a quantity in the target population using a discovery sample

A valid solution is one that is true in the discovery sample and remains true in (i.e., generalizes to) the accessible and target populations.

Central to validity, therefore, is generalizability. The generalizability typically discussed in machine learning textbooks is internal validity, referring to the requirement that a solution that is valid (holds true) in the discovery sample must generalize to the accessible population. Generalizability further to the target population is a type of external validity. The goal of using design protocols and best practices is to help achieve both types of validity.

Let us illustrate this concept through an example: consider the hypothetical question of whether aggressive management of high cholesterol helps reduce heart disease incidents. If this is a research question, then we may wish the answer to generalize to a very broad patient population, such as all patients who have high cholesterol in the entire nation. The target population is all patients with high cholesterol in the nation (who have not had heart disease yet). Suppose we conduct this study in a consortium of academic health centers located in a handful of states. Our accessible population then consists of patients with high cholesterol who are typically seen at these specific academic centers and have health data recorded in the corresponding EHRs. Further assume that we have analytic access to a small portion of the health records (our discovery sample).

Assuming that the discovery sample is a random sample from the accessible population and that the resulting model is applied close to the time frame when the discovery sample was drawn, the results from the discovery sample will generalize to the accessible population as long as we do not have modeling issues such as overfitting (chapter “Overfitting, Underfitting and General Model Overconfidence and Under-Performance Pitfalls and Best Practices in Machine Learning and AI”). However, the accessible population (individuals seen at academic health centers) and the target population (the general USA population with high cholesterol) likely differ systematically, and full generalizability of the results from the accessible population to the target population is unlikely. To address this problem, the original accessible population should be constructed in a way that better represents the general population. If that is not feasible, then use of the models should be restricted to the accessible population only. Other, more sophisticated techniques may also correct the generalizability issues. The general rule of thumb is that an imperfect design will require immense modeling efforts to rectify its biases (and may still not achieve the desired generalizability).

Best Practice 7.2.1

Seek to ensure validity and generalizability with good data design first. Resort to analytic corrections of biases only to the extent that optimal design is not attainable.

Now let us consider the same question in a narrower context with some important biases present: does aggressive management of high cholesterol help reduce heart disease incidents in the patient population of a specific health system? The target population is patients in this singular health system with high cholesterol. As before, we may draw a random discovery sample of patients from the electronic health record (EHR) system to answer this question. Our accessible population appears to be the same as the target population. Assume further that not all patients are included in the EHR and that the patients in the EHR differ from the patients not in the EHR. For example, patients in the EHR can have lower or higher cholesterol levels than those not in the EHR. Also assume that patients not in the EHR predominantly receive their high cholesterol treatments in a different health system. Under these and many similar conditions, even when we build models for a single health system using data strictly from said health system, the target and accessible populations can differ. Such differences can impact generalizability and thus validity. A model that is valid in a data-rich environment could perform poorly in a data-poor environment when selection bias is present. Whether a model generalizes can be determined, of course, and a model can be made to generalize better through careful data design.

Pitfall 7.2.1

A random sample from a registry or an EHR is not necessarily a random sample from the patient population.

Best Practice 7.2.2

Ensure that the accessible population is representative of the target population.

Best Practice 7.2.3

Ensure that the discovery sample is representative of the accessible (and target) populations.

Data Design

In the previous section, we started with a rough question and transformed it into a problem statement. We specified the operative setting (clinical, operational, research) and made the problem statement more precise (than the rough question) by describing five major elements: (1) target population, (2) intervention, (3) outcome, (4) time period and (5) predictor variables. We also defined an estimand, a computable metric that solves the problem. In this section, we focus on the “how”: how the problem can be solved.

Data designs provide a framework within which different problem types can be carried out while ensuring validity.

In this book, we focus mostly on observational studies, where data is already generated and we have no influence over how patients are treated or how their data is captured in sources like the EHR, but we do have influence over how data is sampled from such sources and in some cases how data will be collected specifically for modeling. These foci reflect the majority of practical situations and are harder than having full control over data generation and collection.

In the following sections, we describe the data design hierarchy, which allows us to answer a few simple questions about exposure, outcome and temporal relationships and arrive at a data design. We describe the most common data designs in detail, providing concrete examples. After we describe the data designs, in Table 1, we relate them to the problem types from section “Types of Modeling Problems” and to the metrics from section “Common Metrics” thus connecting the data design process with problem solving requirements.

Table 1 Mapping of problem types to appropriate data designs, analytic methods and measures. “Any” data design means that the simplest applicable data design depends on whether the problem at hand is temporal or not, whether an outcome exists, and whether that outcome is rare

Data Design Hierarchy

Experimental Data Designs: these involve generation of data by manipulating some of the variables. From the perspective of the ML problem-solving team using the methods of this book, the team may have control over these interventions (we call this primary experimental data) or, alternatively, the team may be given access to data in which someone else conducted interventions and the team had no control over these interventions (we call this secondary experimental data).

Observational Data Designs: these involve data that was measured/generated by the ML problem-solving team (we call this primary observational data), or, alternatively, the team may be given access to data that someone else measured and the team had no control over such observations (we call this secondary observational data).

For example, if the problem-solving ML team is given access to previously conducted randomized clinical trials, the interventions had been assigned by the RCT experimenters; patients were randomized into groups receiving intervention (treated) or being controls (untreated). Similarly, if the problem-solving ML team is given access to EHR data, interventions occurred before the ML modeling starts and thus the team can only observe these interventions. However, in contrast to RCT data, the nature of interventions in EHR data does not involve randomization or controlled interventions.

Figure 4, adapted from [6], describes how several prototypical data designs ensue under the above conditions.

Fig. 4

Data design hierarchy

If an analysis starts with an exposure and follows subjects forward to determine the outcome, it is called a cohort design. Conversely, when the analysis starts with an outcome, identifying subjects with (and without) the outcome, and proceeds backwards in time to establish plausible causes, it is termed a case-control design. Finally, if the exposures and outcomes are determined at a single time point (cross section), it is called a cross-sectional design.

Cross-sectional design. The exposure and outcome are observed in the same cross section. With the existence of exposed and unexposed groups, the association between the exposure and the outcome can be measured, but since they are measured in the same cross section, the temporal relationship between them cannot be readily observed (though it may be inferable with causal ML; see chapter “Foundations of Causal ML”). The date of the cross section is the index date.

Best Practice 7.3.1

A cross-sectional data design suffices if one can solve the problem using the prevalence of the outcome in the exposed and unexposed groups.

Pitfall 7.3.1

A cross-sectional design is inappropriate if there is a time gap between the exposure and the outcome and this gap is important, or if the incidence of the outcome is of interest.

Cohort design. The logic of the analysis follows the passage of time: it progresses from the exposure towards the outcome. A cohort of patients with a particular exposure is identified, and a second comparison cohort without the exposure can also be observed (if needed by the problem at hand). These two cohorts are followed forward in time, determining their outcomes. The date of the exposure and the date when unexposed patients are included is the index date. The time period during which the cohort is followed is the follow-up period. Since patients did not have the outcome at the time of exposure, this design allows for measuring incidence rates and relative risks [7].
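
As a minimal illustration of this logic, the following Python sketch assumes a hypothetical per-patient cohort table (the column names are made up for illustration) with exposure status fixed at the index date and the outcome ascertained during follow-up; it estimates the incidence in each group and the relative risk.

    import pandas as pd

    # Hypothetical cohort: one row per patient, exposure fixed at the index
    # date, outcome ascertained during the follow-up period.
    cohort = pd.DataFrame({
        "exposed": [1, 1, 1, 0, 0, 0, 0, 0],
        "outcome": [1, 0, 1, 0, 0, 1, 0, 0],
    })

    # Incidence (risk) of the outcome within each group over follow-up.
    risk = cohort.groupby("exposed")["outcome"].mean()

    # Relative risk: incidence in the exposed over incidence in the unexposed.
    relative_risk = risk[1] / risk[0]
    print(f"risk exposed={risk[1]:.2f}, unexposed={risk[0]:.2f}, RR={relative_risk:.2f}")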

Pitfall 7.3.2

A cohort design falls short when the outcome is rare or if it takes a long time for the outcomes to develop.

Case-control design. Here we start by identifying a group of subjects with the outcome (cases) and another group of subjects without the outcome (controls). Then the analysis progresses backwards in time, looking for exposures in the records of the subjects. When a comparison group (unexposed patients) exists, associations can be measured. However, without knowledge of the prevalence of the outcome, incidence rates, and hence relative risks, cannot be computed in absolute terms. The critical part of this design is the choice of controls (or, equivalently, the availability of reliable ML algorithms that can analytically control for possible confounding) [8].

Relationship Between Problem Type and Data Design

Best Practice 7.3.2

Use the easiest/most economical data design that can solve the problem. The mapping of the problem to a problem type can help find the best design.

Refining the Discovery Sample—Use of Inclusion/Exclusion Criteria

The discovery sample is defined using a set of inclusion and exclusion criteria. As their names imply, inclusion criteria select patients to be included in the cohort and exclusion criteria exclude patients from the cohort.

The objective of the inclusion/exclusion criteria is to create a discovery sample that resembles the target population as closely as possible.

As we described in section “Inference: What Is a Valid Solution?,” modeling allows inferences in which knowledge/models generated in a discovery sample are used in the target population (after sufficient validation). The discovery sample comes from the accessible population, which can be different from the target population; even the discovery sample can be different (not a random sample) from the accessible population. Our goal with the inclusion/exclusion criteria is to make the discovery sample representative of the target population.

A common starting point is the description of the target population as an inclusion criterion. Next, we refine it by specifying additional inclusion or exclusion criteria.

Exclusion criteria describe patients in the accessible population who are either unlikely to contribute to the validity of the findings, or are excluded so that the accessible population resembles the target population better. Common reasons for exclusion include:

  1. Patients are not at risk of the outcome. For example, when studying the risk of falling in the elderly, patients who are bed-bound should be excluded because they have an artificially low risk of falling.

  2. A special case of not being at risk is having a pre-existing outcome when we try to measure the incidence rate of a phenomenon. Patients who already have the outcome at baseline, or are very likely to have the outcome (albeit undiagnosed) at baseline, are excluded.

  3. Patients who are very unlikely to receive the intervention. Patients who are not at risk of, or are at very low risk of, the outcome should be excluded from modeling that determines the effect of an intervention. The intervention in such patients (if ethical) would show little (if any) efficacy and these patients would very rarely (if ever) receive this intervention.

  4. Patients who have a different disease mechanism than the one under study.

  5. Patients having a pre-existing condition that precludes the results from holding true. For example, models in which heart rate is a risk factor may not hold true in patients with pacemakers (whose heart rate is artificially controlled).

  6. Insufficient data. Patients with insufficient data can be excluded; however, doing so can introduce biases.

  7. A special case of insufficient data is insufficient follow-up. Patients with insufficient follow-up can be excluded, but doing so may introduce biases.

We will discuss discovery sample construction further in the context of the data design types.
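
As a preview, here is a minimal Python sketch of applying inclusion/exclusion criteria to a hypothetical accessible-population table, patterned after the falling-risk example above; all column names and thresholds are illustrative assumptions.

    import pandas as pd

    # Hypothetical accessible population (column names are assumptions).
    patients = pd.DataFrame({
        "age":            [72, 80, 65, 90, 70],
        "bed_bound":      [False, True, False, False, False],
        "prior_outcome":  [False, False, True, False, False],
        "followup_years": [8.0, 9.5, 7.2, 0.5, 10.0],
    })

    # Inclusion: a description of the target population (here, elderly patients).
    sample = patients[patients["age"] >= 65]

    # Exclusions mirroring reasons 1, 2 and 7 above:
    sample = sample[~sample["bed_bound"]]            # not at risk of the outcome
    sample = sample[~sample["prior_outcome"]]        # pre-existing outcome
    sample = sample[sample["followup_years"] >= 7]   # insufficient follow-up

    discovery_sample = sample.reset_index(drop=True)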

Example of a Cohort Design

Cohort studies are arguably the most common data designs in clinical analytic models.

The defining characteristics of a cohort study are that (1) a cohort of patients is selected and (2) the cohort is followed forward in time to determine the outcomes.

In this exposition, we will focus on two kinds of cohort studies. The first does not have a specific intervention; we merely measure the association between baseline exposures and the incidence of an outcome and construct a risk model. The second has a specific intervention.

Cohort Studies without a Specific Intervention of Interest

Example 7.3.1

Let’s consider the development of a clinical risk model for assessing a patient’s 7-year risk of diabetes. Our hypothetical health system plans to apply this model every time a patient undergoes blood sugar testing. Patients who are found to have a high risk of diabetes are referred to a diabetes prevention clinic.

Study type. This is a risk model (Table 1) and a cohort study is appropriate. We are interested in measuring the 7-year risk of developing diabetes (incidence of diabetes), so hazard, hazard rate, probability, and possibly other measures are appropriate (Table 1).

Index date. To determine an index date, we take the model’s intended clinical application into account. When a new blood sugar test result becomes available, the model will be applied to determine whether the patient needs to be referred to a diabetes prevention clinic. Patients can have multiple blood glucose tests, so we can simply take a cross-section of the eligible population at a random time and use their most recent test result, provided it is sufficiently recent. The criterion for “sufficiently recent” is determined based on clinical knowledge; the objective is to ensure that if the blood glucose measurement had been taken exactly on the index date, we would not expect it to be meaningfully different from this “sufficiently recent” measurement.
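
A minimal sketch of this index-date logic follows, assuming a hypothetical long-format table of blood glucose tests and an arbitrary illustrative 180-day recency window (in practice the window must come from clinical knowledge).

    import pandas as pd

    # Hypothetical test table; names and the 180-day window are assumptions.
    tests = pd.DataFrame({
        "patient_id": [1, 1, 2, 3],
        "test_date":  pd.to_datetime(["2014-01-10", "2014-10-01",
                                      "2013-02-15", "2014-11-20"]),
        "glucose":    [98.0, 105.0, 110.0, 91.0],
    })

    index_date = pd.Timestamp("2015-01-01")  # randomly chosen cross-section date
    recency_window = pd.Timedelta(days=180)  # "sufficiently recent" criterion

    # Most recent test per patient at or before the index date...
    before = tests[tests["test_date"] <= index_date]
    latest = before.sort_values("test_date").groupby("patient_id").tail(1)

    # ...kept only if it falls within the recency window.
    eligible = latest[index_date - latest["test_date"] <= recency_window]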

Defining the target population. The target population consists of patients who are at risk of developing incident diabetes and have received blood glucose test(s) before the index date. Patients with pre-existing diabetes are no longer at risk of developing incident diabetes.

Inclusion/exclusion. The objective is to ensure that the discovery sample is not significantly different from the target population, which consists of patients who undergo blood sugar test(s). The discovery sample comes from the accessible population, which may be different from the target population. We can use inclusion/exclusion criteria to modify the discovery sample to better resemble the target population. Concerning the selection of the accessible population, we can use any convenient population, as long as it is not significantly different from the target population. For example, if we wish to apply the model to patients in 2020 and onward, then we could select patients (say) with blood sugar test(s) between 2005 and 2015, assuming that the criterion for ordering a blood glucose test has remained reasonably unchanged during this period (2005–2020). Conversely, we would not use patients only from (say) the obesity clinic as the accessible population, because obesity is a risk factor for diabetes and thus patients in the obesity clinic may have already progressed further towards diabetes than the general patient population with blood sugar measurements. We also have to be mindful to exclude patients who may have had their blood glucose taken for reasons other than suspected diabetes (e.g., application for life insurance)—the resulting model will not be applied to these patients.

Cohort Studies with an Intervention of Interest

Figure 5 shows an illustration of different types of cohort studies with an intervention of interest. The horizontal axis represents time and the dashed vertical line is the index date where we start following patients. Each dot represents a patient. Black dots are censored patients (dropped out without an event) and red dots are patients with an event. The location of the dot shows the time of the event or censoring.

Fig. 5

Illustration of cohort data designs. Panel A represents a cohort design with a single cross-section, where treatment status does not change after the index date. Panel B shows a cohort design with staggered exposure. The horizontal axis is time, black dots represent censored patients (dropped out without an event) and red dots represent patients who suffered events

The vertical axis denotes treatment status. Panel A shows a simple design, with a single cross section, where patients’ treatment status does not change after the index date. Panel B shows a staggered treatment design, where the treatment status of some patients changes over time. These two designs solve different problems.

Example 7.3.2

We suspect that the prolonged (at least 6 months) use of a hypothetical drug has an adverse effect that increases the risk of mortality in diabetic patients. We wish to measure the effect size.

Problem type. This is a treatment effect estimation problem (Table 1) and we are going to use a cohort design with an intervention of interest (the particular drug). Since no particular time frame is specified for the outcome, hazard (a time-to-event outcome) is appropriate for measuring the “risk of mortality”. (Had the problem specified 7-year mortality, we could have also used the probability of death within 7 years.)

Exposure. Exposed patients are those who have used this drug for at least 6 months. We have multiple options for unexposed patients. If the clinical question places an emphasis on prolonged use as opposed to short-term use, then an appropriate control (unexposed) group is patients who took the drug for less than 6 months. If the comparison really is about being exposed to the drug at all, then an appropriate comparison group is patients who never took this drug.

Index date. The question does not specify any timing information, so we can take a cross section of patients at a random date and use this date as the index date. Some of the patients in this cross section will have already taken the drug for 6 months, other patients will not have, and some patients will never take it.

Target population. Patients with diabetes who may be considered for intervention through this drug.

Inclusions/exclusions. Considerations for the construction of the analytic cohort are similar to those in the previous example. However, patients unexposed at the index date may later become exposed to the drug. In that case, such a patient can be excluded, or the patient can be included but censored at the time of exposure to the drug. Caution is required with the latter approach, because censoring can be informative.

Predictor variables and confounding. The outcome, mortality, depends on factors other than exposure to the drug. Some of these factors can also influence exposure to the drug, thus they can be confounders. These confounders, if known with certainty, must be accounted for, for example, by including them as covariates or by balancing them between exposed and unexposed patients using propensity matching. If confounders are not known but are measured, we can discover them using causal ML methods.
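
As one concrete, hedged illustration of balancing confounders, the sketch below estimates propensity scores with a logistic regression and performs greedy 1:1 nearest-neighbor matching on the score. The simulated data, variable names, and matching rule are all illustrative assumptions; a real analysis would also need caliper choices, balance diagnostics, and overlap checks.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Simulated data: 'age' and 'a1c' are measured confounders that influence
    # the probability of exposure.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"age": rng.normal(60, 10, 200),
                       "a1c": rng.normal(7.0, 1.0, 200)})
    df["exposed"] = (0.05 * df["age"] + 0.5 * df["a1c"]
                     + rng.normal(0, 1, 200)) > 7.0

    # 1) Propensity score: probability of exposure given the confounders.
    ps_model = LogisticRegression().fit(df[["age", "a1c"]], df["exposed"])
    df["ps"] = ps_model.predict_proba(df[["age", "a1c"]])[:, 1]

    # 2) Greedy 1:1 nearest-neighbor matching on the propensity score
    #    (assumes fewer exposed than unexposed patients).
    treated = df[df["exposed"]]
    controls = df[~df["exposed"]].copy()
    matches = []
    for i, row in treated.iterrows():
        j = (controls["ps"] - row["ps"]).abs().idxmin()
        matches.append((i, j))
        controls = controls.drop(j)  # match without replacement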

Predictor variables of mortality are evaluated at the index date. If a patient has taken the drug for 6 months before the index date, the drug may have adversely affected some of the predictor variables by the index date. If these predictor variables are on the causal chain between the drug and the outcome they must be excluded.

If some variables affect the outcome but are not on the causal chain between treatment and outcome, they should be included.

Determining the precise location/role of such variables requires extensive domain knowledge or application of causal ML techniques that can place variables in the correct causal role category.

This is merely an illustrative design. Many other (possibly more appropriate) designs are also possible.

Example 7.3.3

Let us estimate the effect of a hypothetical diabetes drug on reducing the risk of mortality. Let us assume that this drug can be prescribed to all diabetic patients.

Clinical use. We assume that this model will be applied to patients who are diabetic and are not yet on this drug (and possibly not on other drugs that target the same disease). Assume that every time such a patient interacts with the health system, the model will be applied. If the predicted risk exceeds a certain level, the patient is given the drug; otherwise, the patient will not receive the drug. This process repeats until the patient falls outside the inclusion criteria (e.g., gets this drug, gets a different drug for the same disease, is lost to follow-up, or dies).

Problem type. This is also a treatment effect estimation problem (Table 1) and we are going to use a cohort design with an intervention of interest (the particular drug). Similar to Example 7.3.2., since no particular time frame is specified for the outcome, hazard (time-to-event outcome) is appropriate for measuring the “risk of mortality”.

Exposure. Exposed patients are those taking the drug and unexposed patients are those who do not take any drugs for this disease.

Index date. For exposed patients, the index date is the date of their first prescription. There are multiple options for the choice of controls. First, we can take a sample of eligible patients who had not yet received the drug as of some date, using that date as their index date. This date can be chosen randomly or as the median of the exposed patients’ index dates. When a control patient receives any drug for the same disease (either the drug in question or a different drug for the same disease), the patient is censored.

Inclusion/exclusion. Considerations for the construction of the analytic cohort are similar to those in the previous example. If the drug is contraindicated for some patients for reasons unrelated to the outcome, these patients can be included (as untreated patients); if the drug is contraindicated because of a complication that increases mortality (the outcome), these patients should be excluded, as they would increase the risk of mortality in the untreated group.

Predictor variables. We must include all confounders and we can include variables that cause the treatment or the outcome. We should not include variables that are descendants of the outcome, nor variables that are on the causal chain between the treatment and the outcome.

Merits and Demerits of Cohort Studies

Best Practice 7.3.3

Cohort studies are straightforward and necessary if the time gap between the index date and the outcome cannot be ignored.

Best Practice 7.3.4

Cohort studies allow us to estimate the prevalence and/or incidence of outcomes, separately for exposed and unexposed patients (if an intervention is considered).

Being able to estimate the prevalence or incidence of outcomes, separately for exposed and/or unexposed patients, allows us to compute a multitude of measures, including relative risks and hazard ratios, as well as odds ratios.

Pitfall 7.3.2

Cohort studies are less practical if the outcome is rare or if the outcome takes a long time to develop.

The key challenges to a cohort design are rare outcomes and outcomes that develop over very long periods of time. EHR systems track a great number of patients, but patients can be lost to follow-up over time. When the time gap between the index date and the outcome is long, we may not find sufficiently many cases due to the follow-up constraint.

When outcomes are rare, a very large cohort may be necessary to contain a sufficiently high number of cases (patients with a confirmed positive outcome). When some of the important variables need to be collected, ascertained or processed in a way that increases cost, doing this for a very large cohort is not cost-effective. Additionally, if the outcome is very rare, a single cross section may not yield a sufficient number of possible cases.

Case-Control Design

Case/control studies start from the outcome. Cases (patients with the outcome) and controls (patients without the outcome) are identified and possibly sampled. Patient records for this sample are then examined moving backwards in time, recording the exposures (or treatments) the patients experienced.

Pitfall 7.3.3

When we sample cases or controls in a case/control design, we have to do so without any regard to the exposure. Whether a patient is selected into the cohort or not must be independent of the exposure(s)/treatment(s).

Best Practice 7.3.5

Whenever possible, the use of clinical trial data is recommended for case/control studies.

In observational data with complex causal structure and potentially unobserved confounders, we may inadvertently select patients based on criteria that are linked to the exposure/treatment. In clinical trial data, the exposure/treatment is not caused by any of the variables. Observe, however, that putative causes of the outcome (other than the randomized treatment) can be confounded in an RCT design (Fig. 6).

Fig. 6

Illustration of a case-control data design. The horizontal axis represents the passage of time and patients are listed vertically. Purple boxes are the exposures (or treatments) that some of the patients had. The design starts with the selection of a set of cases (red rectangle representing a number of patients) and a set of controls (black rectangle). Then patient history is examined retrospectively (backwards in time) looking for and recording exposures (or treatments)

Merits and Demerits

Best Practice 7.3.6

Case/control design is best suited when the outcome is rare but the exposures/treatments are relatively frequent.

Case-control designs have smaller sample requirements and, depending on the costs of collecting or measuring variables, they can substantially reduce the total data cost.

The key demerit lies in creating a discovery sample with an artificial distribution of cases and controls. When the data is sampled de novo for analysis (as opposed to pre-existing, e.g., in an EHR), this leads to difficulties in optimizing model use according to loss functions that are prior-dependent (because the natural prior is not known).

Pitfall 7.3.4

Case/control studies do not measure the prevalence of the outcome, which makes computing measures other than odds ratios non-trivial.

Case/control studies do not measure the prevalence of the outcome, thus we can only estimate odds ratios. Luckily, for rare outcomes, the odds ratios and relative risks are similar.
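
A quick numeric check of this approximation, using assumed risks:

    def odds(p):
        return p / (1 - p)

    # Rare outcome: the odds ratio closely tracks the relative risk.
    p_exposed, p_unexposed = 0.02, 0.01
    rr = p_exposed / p_unexposed                      # 2.00
    or_rare = odds(p_exposed) / odds(p_unexposed)     # ~2.02

    # Common outcome: the approximation breaks down.
    rr_common = 0.40 / 0.20                           # 2.00
    or_common = odds(0.40) / odds(0.20)               # ~2.67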

Nested Case-Control Design and Case-Cohort Design

These two designs attempt to merge the benefits of the cohort design with those of the case-control design. We conduct a regular cohort study but embed a case/control design within it. In the cohort, we select all cases (patients with the outcome) and match them to controls (patients without the outcome) based on their baseline characteristics. One benefit of this approach is that the exposure/treatment needs to be ascertained only for the cases and the selected controls, not for all patients; yet, because it is a cohort design, we can estimate the prevalence of the outcome [1, 9].

Nested case-control and case-cohort designs are both case/control designs nested within a cohort design. They differ in the way controls are selected. Nested case/control studies match cases with controls individually, while case-cohort studies simply take a random sample of the non-cases in the cohort as the control group.
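
The difference in control selection can be sketched in a few lines of Python; the table, the matching variable (age), and the sample sizes are illustrative assumptions.

    import pandas as pd

    # Hypothetical cohort with an outcome indicator and a baseline covariate.
    cohort = pd.DataFrame({
        "case": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        "age":  [70, 71, 55, 62, 69, 61, 75, 58, 63, 60],
    })
    cases = cohort[cohort["case"] == 1]
    noncases = cohort[cohort["case"] == 0]

    # Case-cohort: controls are a simple random sample of the non-cases.
    case_cohort_controls = noncases.sample(n=4, random_state=0)

    # Nested case-control: each case is individually matched to the nearest
    # non-case on baseline characteristics (here, age), without replacement.
    pool = noncases.copy()
    matched_controls = []
    for _, case in cases.iterrows():
        j = (pool["age"] - case["age"]).abs().idxmin()
        matched_controls.append(pool.loc[j])
        pool = pool.drop(j)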

The key limitation of these designs is the same as the case/control design: finding appropriate controls is error-prone and difficult to automate.

Biases

In statistics, bias refers to a systematic tendency which causes differences between results and facts [10, 11].

In the context of data design, biases are systematic deficiencies in the design that lead to errors in the resulting models and the corresponding problem-domain solutions.

A bias is systematic if it does not arise purely through randomness. Many different types of bias exist. For example, in the case of measurement bias, the systematic error is between the measured and the actual values of a variable; similarly, in the case of estimation bias, the systematic error is between the estimate and the true value. In the context of drawing a sample from a population, sampling bias [11] occurs when some characteristic in the sample is over- or underrepresented relative to the population.
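As a concrete illustration of estimation bias, consider the classic example of the maximum-likelihood variance estimator, which is systematically too small in small samples (a quick simulation sketch in Python; the numbers are illustrative, not from this chapter):

```python
import numpy as np

rng = np.random.default_rng(2)

# Many samples of size 5 from a distribution whose true variance is 1.0.
samples = rng.normal(loc=0.0, scale=1.0, size=(100_000, 5))

print(samples.var(axis=1, ddof=0).mean())  # ~0.80: biased, E = (n-1)/n * sigma^2
print(samples.var(axis=1, ddof=1).mean())  # ~1.00: the unbiased estimator
```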

Broader uses of the term bias are also common. For example, evaluation bias arises when the estimated performance of a model systematically differs from its actual performance, often due to a flawed evaluation metric or method. Social biases, such as racial bias, arise when social groups (e.g. races or ethnicities) systematically differ in some desirable or undesirable metric (e.g. employment, compensation, incarceration rate) in a way that cannot be attributed to a morally and scientifically justifiable reason. Analogously, health equity bias arises when health-related measures, such as mortality rate, access to health care, or health care utilization, differ systematically and unjustifiably across groups of patients. Analytic bias occurs when the inference about the relationship between the outcome and the exposure/treatment is incorrect: the estimate of the relationship from the sample systematically differs from the true relationship in the population [12].

A particularly bad case of analytic bias is when the direction of the relationship reverses, as can happen with Simpson’s paradox.

Even broader uses of the term exist, where bias refers to a systematic preference for one option over another. For example:

  • Inductive bias of a machine learning algorithm is the preference that the algorithm has for certain models over others. This is the only common use of the term bias that has a positive connotation.

  • Cognitive biases are systematic propensities of human decision making that lead to errors in decision making. Such cognitive biases may be the result of evolutionary adaptations that require rapid action based on heuristic or limited analytical reasoning.

In this section, we focus on biases that we encounter in data design. These are mainly estimation biases, including sampling and analytic biases, as well as Simpson’s paradox. Other kinds of biases are discussed in the appropriate chapters: inductive bias in the “Foundations and Properties of AI/ML Systems” chapter, measurement bias in the “Evaluation” chapter, and human biases in judgement and decision making in the “From ‘Human versus Machine’ to ‘Human with Machine’” chapter.

Commonly Encountered Biases in Data Design

Confounding bias. When the purpose of the analysis is to quantify the relationship between an exposure and an outcome, confounding, latent or observed, can distort this relationship. The confounding factor can take many forms, and some of the common forms have their own names. When the confounding factor is the missingness of data, we have information bias; when the factor is a latent factor causing certain patients in the target population to be excluded from the sample, we have selection bias; and when the factor is an indication for the exposure of interest (the treatment in this case) and independently causes the outcome, we have confounding by indication [13].

Selection bias occurs when a (typically latent) factor causes the exclusion from the discovery sample of patients who are part of the accessible population, or the exclusion from the accessible population of patients who are part of the target population. The analysis is valid (unbiased) if the accessible population is a random sample from the target population and the discovery sample is a random sample from the accessible population. Violation of this assumption can affect the external validity (the discovery sample or the accessible population is not a random sample from the target population) or the internal validity (the discovery sample is not a random sample from the accessible population) [11].

Selection bias can affect the analysis by creating an artifactual difference between cases and controls or between exposed and unexposed patients. In the former case, estimates of the prevalence of the outcome will be biased, and this bias will propagate to any metric, such as relative risk, that relies on prevalence; in the latter case, the estimate of the relationship between the exposure and the outcome will be biased.

Information bias. Information bias occurs when information is available differentially between cases and controls or between exposed and unexposed patients. The former biases prevalence estimates and the latter biases estimates of the association between the exposure and the outcome. Recall bias is a special case of information bias, where cases and controls recall (remember) exposures differently [11].

Confounding by indication. This is another special case of confounding bias, where a confounder is an indication for the treatment and can independently cause the outcome. For example, infection may cause fever (and thus the use of paracetamol), and it may also cause asthma later in life; it can therefore distort the association between paracetamol use and asthma [13, 14].

Ascertainment bias. Certain exposures cause the outcome to be ascertained with a higher probability. For example, eye problems may trigger a diabetes test (to check for diabetic retinopathy), leading to the discovery of diabetes that might remain undiscovered in patients without eye problems. In some cases, when the proportion of patients with and without the exposure is “unlucky”, Simpson’s paradox combines with ascertainment bias and reverses the direction of the association between the exposure and the outcome. This latter form of ascertainment bias is Berkson’s bias [15]. We discuss Simpson’s paradox later in the section “Simpson’s Paradox”.
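A small simulation can make ascertainment bias tangible (a sketch with invented probabilities). Here diabetes is, by construction, independent of eye problems, yet an association appears among diagnosed cases because eye problems trigger testing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

eye_problems = rng.random(n) < 0.10
diabetes = rng.random(n) < 0.08  # true status, independent of eye problems

# Ascertainment: eye problems trigger a diabetes test, so diabetes is far
# more likely to be *diagnosed* when eye problems are present.
tested = rng.random(n) < np.where(eye_problems, 0.90, 0.20)
diagnosed = diabetes & tested

def risk_ratio(exposed, outcome):
    return outcome[exposed].mean() / outcome[~exposed].mean()

print(risk_ratio(eye_problems, diabetes))   # ~1.0: no true association
print(risk_ratio(eye_problems, diagnosed))  # ~4.5: artifact of ascertainment
```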

Informed presence bias. Related to ascertainment bias: patients who have frequent encounters have more opportunities for problems to be documented; the same problems may remain undiscovered in patients with fewer encounters.

Non-contemporaneous control bias. If the comparison group is selected from a different time frame than the cases or the exposed group, it no longer represents a random sample from the accessible population (or the target population), leading to biases [16].

Reverse causation. Cause and effect may appear reversed in an association. For example, when a sedentary lifestyle is associated with an increased death rate, the sedentary lifestyle may not be the cause of death; on the contrary, a deadly disease may have caused both the sedentary lifestyle and, ultimately, the death [17].

Simpson’s Paradox

Consider two factors, A and B, both binary for simplicity. Further, consider that we are studying the probability of the outcome in the four groups defined by the two factors. Figure 7 illustrates the results: the four circles in the figure represent the four groups, and the radius of each circle is proportional to the number of subjects in that group. The horizontal layout of the circles relates to factor B: the two circles on the left lack factor B (i.e. B = 0) and those on the right have factor B (i.e. B = 1). The vertical position of the circles relates to factor A: the two circles on top have A = 1, while the bottom two circles have A = 0. The vertical axis represents the probability of the outcome. Clearly, factor A is associated with a higher probability of the outcome, as the two circles with A = 1 lie higher than the two with A = 0. The two X’s between the top and bottom circles represent the average probability in the sample when B = 0 (on the left) and when B = 1 (on the right).

Fig. 7

Illustration of Simpson’s paradox

Let us consider patients with A = 1. They are located in the top two circles. We can see that the probability of outcome is lower when B = 1. The 80 patients with A = 1 and B = 0 (top left circle) have an average probability of outcome of 0.7, which is higher than the average probability of 0.65 in the group of 120 patients with A = 1 and B = 1 (top right circle). Similarly, let us consider the remaining patients, those with A = 0 (the bottom two circles). The probability of outcome is again lower for patients with B = 1. The 120 patients with A = 0 and B = 0 (bottom left circle) have an average probability (for the outcome) of 0.2, which is higher than the average probability of 0.15 in the group of 80 patients with A = 0 and B = 1 (bottom right circle). Paradoxically, when we look at all patients with B = 0, they have an average probability of 0.4 (left X), which is lower(!) than the average probability of 0.45 (right X) for patients with B = 1.
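The arithmetic behind the figure is easy to verify. The short Python check below reproduces the reversal from the four group sizes and probabilities given above:

```python
# (A, B) -> (group size, Pr[outcome]), taken from Fig. 7.
groups = {
    (1, 0): (80, 0.70), (1, 1): (120, 0.65),
    (0, 0): (120, 0.20), (0, 1): (80, 0.15),
}

def marginal(b):
    """Average outcome probability over all patients with factor B = b."""
    cells = [(n, p) for (a, bb), (n, p) in groups.items() if bb == b]
    return sum(n * p for n, p in cells) / sum(n for n, _ in cells)

# Within each level of A, B = 1 has the LOWER probability of the outcome...
assert groups[(1, 1)][1] < groups[(1, 0)][1]  # 0.65 < 0.70
assert groups[(0, 1)][1] < groups[(0, 0)][1]  # 0.15 < 0.20
# ...yet marginally, B = 1 has the HIGHER probability: the paradox.
print(marginal(0), marginal(1))  # 0.4 and 0.45
```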

The interested reader is referred to [18] for more details and to [19] for a causal interpretation of Simpson’s paradox.

Best Practice 7.4.1

If you see an unexpected effect direction, always consider the possibility of Simpson’s paradox and Berkson’s bias.

Key Messages and Concepts Discussed in Chapter “Data Design”

Data design is an iterative process that transforms available data elements into an analytic matrix. The analytic matrix is a data set, on top of which ML algorithms can be directly executed. A successful data design allows for the ML algorithms to produce a valid solution to an ML problem.

A solution is valid if it holds true in the discovery sample that it was computed from and generalizes to the target population. The target population is the population, in which we want to use the modeling results (model, knowledge, etc).

Setting refers to the context in which the modeling result will be used. We broadly distinguish between three settings: clinical, operational, and research. The setting determines the requirements for the data design and, more broadly, for the entire analytic process.

The components of the data design include (1) defining an analytic question, (2) designing the analytic matrix (data design), (3) executing the modeling, (4) debugging and evaluating the model. The process is iterative.

The analytic question starts with a rough question that prompted the analytic project and is refined to be precise, formalized and operationalizable.

To make an analytic question precise, we commonly define five elements: (1) target population, (2) intervention, (3) outcome, (4) time period and (5) predictor variables.

The analytic question is formalized if its answer can be expressed as a computable estimand. The section “Common Metrics” lists several such estimands (metrics).

Most analytic questions map to a small number of study types, which are listed in the “Types of Modeling Problems” section. This allows us to develop templated solutions to these questions; these templates are the data designs.

Data designs form a hierarchy, and by answering some questions about the exposure, the outcome, and their temporal relationship, we arrive at a data design.

In this book, we focus on observational studies and consider three main data designs: (1) cross-sectional, (2) cohort, and (3) case/control. In addition, nested designs, such as a case/control design nested into a cohort study, or the case-cohort design, are also discussed.

Concrete data design examples are shown and the merits and demerits of the various data designs are elaborated on.

Biases are systematic differences between two quantities. Often, one of these quantities is an estimate (or a measurement) and the other is the true value. Biases that are common in data designs are presented. We discuss Simpson’s paradox.

Best Practice Recommendations and Pitfalls, Chapter “Data Design”

Best Practice 7.1.1. The ML data design needs to take the operative setting of the ML models into account.

Best Practice 7.2.1 Seek to ensure validity and generalizability with good data design first. Resort to analytic corrections of biases only to the extent that optimal design is not attainable.

Best Practice 7.2.2 Ensure that the accessible population is representative of the target population.

Best Practice 7.2.3 Ensure that the discovery sample is representative of the accessible (and target) populations.

Best Practice 7.3.1. A cross-sectional study suffices if you can answer the analytic question with the prevalence of the outcome in the exposed and unexposed groups.

Best Practice 7.3.2. Use the simplest data design that can answer your analytic question. The mapping of an analytic question to a study type can help find the simplest design.

Best Practice 7.3.3. Cohort studies are very straightforward and necessary if the time gap between the index date and the outcome cannot be ignored.

Best Practice 7.3.4. Cohort studies allow us to estimate the prevalence and/or incidence of outcomes, separately for exposed and unexposed patients (if an intervention is considered).

Best Practice 7.3.5. Whenever possible, the use of clinical trial data is recommended for case/control studies.

Best Practice 7.3.6. Case/control design is best suited when the outcome is rare, but the exposures/treatments are relatively frequent.

Best Practice 7.4.1. If you see an unexpected effect direction, always consider the possibility of Simpson’s paradox and Berkson’s bias.

Pitfall 7.2.1. A random sample from a registry or an EHR is not necessarily a random sample from the patient population.

Pitfall 7.3.1. A cross-sectional study is inappropriate if a time gap between the exposure and the outcome is expected and important or if incidence of the outcome is of interest.

Pitfall 7.3.2. Cohort studies may not be appropriate if the outcome is rare or if the outcome takes a long time to develop.

Pitfall 7.3.3. When we sample cases or controls in a case/control study, we have to do so without any regard to the exposure. Whether a patient is selected into the study or not must be independent of the exposure(s).

Pitfall 7.3.4. Case/control studies do not measure the prevalence of the outcome, which makes computing measures other than odds ratios non-trivial.

Classroom Assignments and Discussion Topics, Chapter “Data Design”

  1. Consider the problem of predicting the risk of a disease in a clinical setting and in a research setting. Can you think of some differences in the data design for these two settings in terms of (1) the purpose of the model, (2) data sources, (3) target population, (4) choice of modeling algorithm, and (5) evaluation?

  2. Consider the problem of predicting the probability of discharge from the hospital in 4 days for a patient, in two settings: clinical setting and operational setting. Can you think of clinical and operational questions that could require these probabilities? How would the resulting models be different?

  3. Suppose you have a discharge model that is being used in a clinical setting. For each patient, it predicts the probability of discharge in 10 days. You are asked to estimate the number of beds that will be available in 10 days. Would you use the clinical model or would you build a model specifically to estimate the number of available beds?

  4. You are tasked with predicting the 7-year risk of diabetes.

    (a) Is the outcome prevalence or incidence of diabetes?

    (b) Is there an exposure of interest? Comparison groups?

    (c) Is a cohort study an appropriate data design? Could you make a cross-sectional design work?

    (d) What is the eligibility criterion and what is the target population? (You need to make some assumptions.)

    (e) Suppose the model will be used in obese patients. How did the eligibility criterion (and the target population) change?

    (f) Is obesity an “exposure of interest” and would the study require a comparison between obese and non-obese patients?

    (g) If you use a cohort study, how would you construct your cohort? What would be the index date? Accessible population? Exclusion criteria? Please elaborate.

    (h) Suppose the model is used only once, precisely at the moment a patient first becomes obese. How would you change your design? (Hint: a cross-section at a random date will probably not yield a sufficient number of patients.)

    (i) Suppose now that your outcome is a very rare form of diabetes. Is a cohort study still appropriate?

  5. A hypothetical diabetes drug came to market 10 years ago. It is intended for diabetic patients to reduce the risk of major cardiac events. You are tasked to evaluate this drug in terms of its real-world effect (as seen in EHR data) on major cardiac events and mortality.

    (a) What type of problem is this?

    (b) What is the outcome?

    (c) Is there an exposure/intervention of interest?

    (d) What are the comparison groups? (You have to make assumptions. Discuss how these assumptions affect your answer.)

    (e) What are potential confounders? How would you adjust for confounding?

    The question is intentionally underspecified. Feel free to make several assumptions and discuss how different assumptions affect your design.

  6. You are building a model that advises clinicians on the expected effect of a hypothetical diabetes drug on major cardiac events (MACE). This model computes the risk of MACE and, if the predicted probability exceeds a certain (clinically determined) threshold, the patient is given the drug. The model is applied to diabetic patients at every encounter until the patient either receives a diabetes drug or achieves a normal blood sugar level.

    (a) Is there an exposure/intervention of interest?

    (b) What are the comparison groups (if any)?

    (c) What is the eligibility criterion and the target population?

    (d) Is a cohort study appropriate?

    (e) How would you determine the index date?

  7. What is the simplest data design to

    (a) determine which is more common: high blood pressure or high cholesterol?

    (b) determine whether the prevalence of major cardiac events is higher among patients with or without diabetes?

    (c) determine whether the incidence of major cardiac events is higher among patients with or without diabetes?

    (d) assess the effect of obesity on diabetes?

    (e) compare the effect of two diabetes drugs on major cardiac events?

    In your(!) opinion, when is a data design “simple”?

  8. Can you use a Cox model with a cross-sectional design?

  9. How do you handle confounders in a deep learning model?

  10. Consider a machine-learned diabetes risk calculator. This model would predict patients’ 7-year risk of diabetes based on some commonly available data elements in non-diabetic patients. It would be applied to patients with high blood pressure, high blood cholesterol, or obesity.

    (a) Can you think of a choice of accessible population that would cause selection bias?

    (b) Can you give an example of information bias in this context?

    (c) You find that heart disease is negatively associated with diabetes in this model. Can you explain this finding using ascertainment bias? Berkson’s bias?

  11. Consider the diabetes risk calculator from the previous question. Assume that this calculator now tries to predict the risk of a very rare form of diabetes.

    (a) Design a case/control study for developing this model. Who are the cases? Who are the controls? When do you evaluate the baseline predictor variables?

    (b) Can you give an example of non-contemporaneous control bias in this context?

  12. “I prefer chocolate over vanilla ice cream.” Can you rephrase this statement using the term “bias”? Can you explain how this is a bias?

  13. Can you explain why systematic racism is a bias, using the definition of bias?