
Overview

ML modeling requires sophisticated transformations of the data extracts or samples collected to serve the needs of an application according to the specifications of the data design (Chapter “Data Design”, Fig. 1). The current chapter is concerned with this lower-level mapping (also known as data transformation) of raw data elements into features suitable for ML modeling.

The raw data is transformed into the modeling dataset in two steps. In the first step, the raw data is transformed into the features that the data design utilizes, the data design features; in the second step, the data design features are further transformed into ML features that the learning algorithm can operate on.

The key difference between the two steps is that the first step is independent of the learning algorithm and simply implements the study design, while the second step transforms the features specifically for the learning algorithm.

Figure 1 presents an outline of this process. Progressing from left to right, we start with the raw data elements in one or more source data repositories. The block labelled “Prepare Data” represents the first step, where we transform the raw data elements into data design features as specified by the data design. The target population, outcomes and covariates of interest are expressed in terms of the data design features. Data for these features are stored in a data mart. The next large block, labelled “Feature Engineering”, represents the second step, in which data design features are refined to ensure they represent what the modeling team intended and ML features are created. Converting raw data into ML features is an iterative process. Both steps (Data Preparation and Feature Engineering) iterate internally until sufficient data quality is achieved. When data quality issues cannot be resolved, there may be a need to move backwards, stepping back from Feature Engineering to Data Preparation, or from Data Preparation to the source repositories to find new raw data elements. The block at the bottom of Fig. 1 is the “Data Ecosystem”, a collection of tools that supports the entirety of the data transformation process. The Data Ecosystem contains information about data mapping (from raw data to real-world entities), the tools necessary for the transformation, and other ancillary functions such as documentation and version control.

Fig. 1

Feature engineering architecture. Source data flows through Data Preparation (design features, data mart, data quality) and Feature Engineering (analytic matrix, data quality, model development, feature assessment), supported by the Data Ecosystem (common data model, software and tools)

The chapter starts by defining data and related concepts (section “Working with Data”). Section “Constructing the Design Features” describes the first step (Data Preparation) and section “Modeling Dataset and Feature Engineering” describes the second step (Feature Engineering). Due to its importance, data quality assurance receives a section of its own (section “Data Quality”) and section “Missing Data” discusses the missing data problem. Finally, in section “Data Processing Stacks” we review popular data processing ecosystems.

Working with Data

During the process of transforming raw data into an ML modeling dataset, we face a number of challenges. These challenges include:

  1. The data in the EHR (and possibly other systems) were designed with documentation of care in mind, and their design may not be optimal for our purpose.

  2. Data may come from different sources, and each source may use a different terminology to denote the same real-world entity (e.g. “Systolic blood pressure”, “Blood pressure, systolic”).

  3. Related to the different terminologies, measurement units can also differ (e.g. temperature can be measured in Fahrenheit versus Celsius).

  4. Data quality issues abound: formatting issues, out-of-bound measurements, invalid characters, etc.

Common to many of these challenges are differences in how the data in the data sources are mapped to real-world entities. In what follows, we describe terminology related to real-world entities (the Problem Space in Chapter “Data Design”).

A (biomedical) concept is a real-world clinical entity, such as a diagnosis, a medication, or a lab result.

“Blood pressure”, “Systolic Blood Pressure”, “Blood pressure medication”, “Diagnosis of high blood pressure” are all concepts, and so are “Patient”, “Diagnosis”, “Medication”, etc. However, the blood pressure reading of “140 mmHg” is not a concept; it is an observation for the concept “Blood Pressure”.

A vocabulary is the collection of all terms that a domain addresses.

This definition has two parts. First, there is a domain. A vocabulary for heart disease medications will contain all terms related to heart disease medications, but will mostly not contain terms for unrelated domains, such as cancer diagnostic procedures, or will contain them only at very high levels without any detail. Second, a vocabulary contains all terms related to a domain, including synonyms (e.g. “Systolic blood pressure”, “Blood pressure, systolic”).

A terminology is a standardized vocabulary that represents a system of concepts.

For example, the International Classification of Diseases, revision 10 (ICD-10) is a terminology that represents the system of all diseases. Within this terminology there is an entry, “I21 Acute Myocardial Infarction”, for heart attack.

Property: entries in a terminology are unique. Terminologies do not contain synonyms, so in ICD-10, for instance, there is no entry such as “Heart attack”. The lack of synonyms makes terms corresponding to a concept unique, and thus terminologies are used to assign unique IDs to concepts.

An ontology is a representation of a domain of concepts, comprising (1) concepts and their formal names (i.e. terminology), (2) attributes of the concepts, and (3) relationships among the concepts.

The types of relationships ontologies encode depend on the domain and the ontology. For example, the National Drug File Reference Terminology (NDF-RT) is an ontology for pharmaceuticals and it defines relationships such as “has_ingredient”, “has_MOA” (mechanism of action), “may_treat”, “may_prevent”, etc. Most ontologies, however, define the ISA (taxonomy or containment) relationship: e.g. A beta blocker ISA (“is an”) anti-hypertensive medication, which ISA (“is a”) cardiovascular medication.

Attributes of concepts in the NDF-RT are, for example, whether a drug is meant for human use, available dosages, available routes (e.g. oral, topical, injectable), etc.

In mapping the raw data to real-world entities, the smallest unit is a data element.

A data element represents a singular concept.

For example, a birthdate, lab test result or a patient’s name are data elements. Data elements are also called fields, columns or attributes.

Data elements can be brought together into a dataset to answer a specific analytic question.

A dataset is a collection of data elements that groups related concepts together.

For example, yesterday’s claims or a copy of an EHR are datasets. Datasets can have other containment structures such as tables, rows, databases, etc.

A modeling problem can utilize several datasets, each dataset representing some aspect of the modeling.

It is common practice for an organization to bring all the data it needs into a common location, called a data warehouse (or data repository) (Fig. 2).

Fig. 2

Architecture of a Research Data Warehouse (RDW). (a) incoming data flows from assay labs, plus data sources external to the institution; (b) data landing zone; (c) data input quality control processes; (d) terminology harmonization and standardization; (e) linkage to health record data and master subject index; (f) return of actionable results and generation of related alerts; (g) primary and secondary meta data and annotations; (h) secure transfer channels to supercomputing; (i) data portal with search, visualization, data extraction and analysis functions; (j) data flows to external and internal collaborating researchers; and (k) authentication, access control and audit services/layers

When an analytic project is carried out, data elements required by the analytic project are extracted and stored into a data mart.

A data mart is an extract of a subset of data from a data warehouse that is built for a specific project or for a group of related projects and contains only the data elements and patients/subjects of interest.

Datasets and data marts differ in two ways. First, a data warehouse (and to a certain degree a data mart as well) is created for a broad purpose, while a dataset is created for a specific purpose. For example, a data mart for all hospitalizations may exist. From this data mart, a dataset for studying care pattern variations after heart attack can be extracted. Naturally, only a fraction of hospitalizations and a fraction of data elements (procedures, treatments, lab results, diagnoses, etc.) are related to heart attack. Thus, the resulting dataset is both narrower in focus and smaller in size than the original data mart. Second, a data mart is typically standardized in terms of the terminologies it uses for various concepts. In contrast, a dataset might span multiple data warehouses and may therefore rely on different terminologies, or it may contain unstandardized data that do not conform to any particular terminology.

The linkage between the raw data and the real-world entities is provided by the data model.

The data model is an abstract model that defines data elements and standardizes how these data elements relate to each other and to real-world entities (clinical concepts, in our case).

Thus, a data model plays three roles. The first role is to catalog all data elements. If a domain for the data model is defined, the data model will contain all data elements in that domain. If no domain is specified, as could be the case for the main data repository, the implicit domain is all clinical affairs: the set of all data elements in the data model is the entirety of the clinical concepts that can be modeled. Naturally, the data model can be expanded if necessary; in fact, publicly available data models, such as the OMOP model (described later in section “Standardized Data Models and Ontologies for Health Care and the Health Sciences”), undergo regular reviews.

The second role is to relate data elements to each other. Later in this chapter (in section “Standardized Data Models and Ontologies for Health Care and the Health Sciences”), we describe the OMOP data model in more detail, but in the meantime consider the caricature example of “patients” who have “encounters” and during these encounters, they may receive diagnoses of “diseases”. Words in quotes are both data elements and clinical concepts. The data model will indicate relationships between “patients” and “encounters” and between “encounters” and “diseases”.

The third role of a data model is to establish a mapping between the data elements and the real-world clinical entities. Terminologies serve this role. For each data element for which terminologies exist, a terminology is selected. That data element is now standardized to a particular terminology. Continuing with the above caricature, we can assume that no terminology is needed for the data elements “patient” and “encounter”, but several terminologies exist for diseases. We can elect to use ICD-10 to describe diseases. Although “encounter” does not have a terminology on its own, it may have an attribute, say “encounter type”, which can still have a well-defined vocabulary (containing terms such as “emergent”, “urgent”, or “elective”). The data model can still specify that the “encounter type” attribute of the data element “encounter” must use this vocabulary.

Related to a data model is a data dictionary.

A data dictionary is an inventory of all data elements and their “meta-data”, which comprises (a) the type of a data element (e.g. date, string, integer, real-valued numeric), (b) expectations of the data (e.g. feasible values, range), (c) the terminology it is mapped to (if any), and (d) usually a brief narrative of what the data is supposed to represent.

A data dictionary is essentially documentation of parts of the data model, and it is also complementary to it. When a data element cannot be mapped to a terminology, the data dictionary will contain metadata about that data element. For example, to the best of our knowledge, no terminology exists for “birthdate” (because it would merely be a list of all dates plus a term conveying that the date is unknown). We still need to document the format of the date (e.g. yyyy-mm-dd), whether a date may be missing, a NULL value for missing dates if missingness is allowed (or the fact that missingness is not allowed), and possibly ranges of acceptable birthdates.
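
To make this concrete, the sketch below shows one way a data dictionary entry for “birthdate” could be represented and checked programmatically. It is a minimal illustration in Python; the field names (name, type, format, nullable, valid_range) and the example range are our own conventions, not part of any standard.

```python
from datetime import datetime

# Illustrative data dictionary entry for "birthdate"; field names are our own convention.
birthdate_entry = {
    "name": "birthdate",
    "type": "date",
    "format": "yyyy-mm-dd",
    "nullable": False,                             # missingness not allowed in this example
    "terminology": None,                           # no terminology exists for birthdates
    "valid_range": ("1900-01-01", "2025-12-31"),   # illustrative range of acceptable dates
    "description": "Patient date of birth as recorded at registration.",
}

def conforms(entry, value):
    """Return True if a raw value satisfies the dictionary's expectations."""
    if value is None:
        return entry["nullable"]
    try:
        d = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return False
    lo, hi = (datetime.strptime(x, "%Y-%m-%d").date() for x in entry["valid_range"])
    return lo <= d <= hi

print(conforms(birthdate_entry, "1987-04-12"))   # True
print(conforms(birthdate_entry, "12/04/1987"))   # False: wrong format
```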

Best Practice 8.2.1

It is a best practice to use a data dictionary to define each data element in the dataset that we are working with.

Every organization or even every project can define its own data model. However, having several, potentially incompatible data models hinders sharing data, expertise, and models across organizations or even across projects within the same organization. It also leads to waste, as the same terms need to be redefined for every project or organization.

A common data model is a standard data model to which a data warehouse and its derivative data marts conform. Common data models are used in projects and research networks to foster interoperability of healthcare data across organizations.

Finally, we can define data. The fundamental raw material for developing AI and machine learning models is data. There are several definitions for data [1], with a practical definition being that “data” is the lowest-level, initial ingredient on the path to information, knowledge and wisdom [2].

Data are what we are given at the start of our analytic modeling and represent observations about the real world.

In healthcare, data can come from electronic health records (EHRs), claims, research data registries, wearables, and patient-reported data. Not all information is available in electronic format; some information that might be useful must first be transformed into something a computer can work with. For example, paper forms, clinical expertise, or a patient’s genome must be processed into useful features.

For our purposes, we will use a practical definition of data as anything in electronic form that a computer can process.

Data model vs data design. Both the data model and the data design create a mapping from raw data to features used in modeling, but the data design uses higher-level semantics. While a data model is primarily concerned with mapping real-world entities into data elements, a data design is concerned with the role the data elements play in the analysis (outcome, exposure/intervention, predictor variable), the time-frame within which we consider the data element, the subjects/patients we need data from, etc.

Standardized Data Models and Ontologies for Health Care and the Health Sciences

As we discussed earlier, data models represent the linkage between the data in the data warehouse and the clinical context and are essential for the data to have meaning. Common data models differ from (regular) data models in that their definitions and specifications are shared across projects and possibly organizations. Several common data models exist within healthcare, including OMOP [3], i2b2 [4], HL7 FHIR [5], PCORnet [6], and others. Each EHR vendor has its own data model, and research collaboratives have also developed their own data models. This proliferation of data models means that many projects have to spend a substantial amount of time converting from different source data formats into whichever model was used by the study team. This is inefficient and may also cause semantic data translation errors when moving from one format to another. While it is not right for every situation, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is gaining momentum, and a number of research collaborations have decided to standardize on it, including the National Institutes of Health All of Us initiative [7], the National COVID Cohort Collaborative (N3C) [8] and the eMERGE network [9]. As more and more healthcare organizations develop and maintain an OMOP version of their healthcare data, they can more easily contribute to these research initiatives, and researchers can better trust that the data from different organizations mean the same thing.

Table 1 shows key standardized ontologies for harmonizing data across domains, research sites, groups, etc. Harmonization works by mapping data elements to standardized ontologies, thus ensuring (a) that the same variables and values are represented by common computer-readable codes, (b) accurate search for data elements, and (c) interoperability across groups and sites.

Table 1 Standardized ontologies for harmonizing data

There are hundreds of ontologies and terminologies that are used to represent healthcare data, and more are being added every year. Table 1 contains some of the most commonly used ontologies and terminologies in the health sciences, covering both molecular and cellular data and clinical data.

Even within a type of data (i.e. clinical vs. molecular), different domains use different terminologies.

For molecular and cellular data, these include the Ontology for Biomedical Investigations [10] for terminology related to assay results, the Gene Ontology (GO) [11, 12], VariO for genetic variants [13], metabolomics ontologies [14], and the Cell Ontology (CL) [15]. Commonly used clinical data terminologies include RxNorm [16] and NDF-RT [17] for medications; LOINC [18] for laboratory tests; ICD-10-CM [19] and DRG (specifically for reimbursement) [20] for diagnoses; and CPT [21] and ICD-10-PCS [19] for procedures.

For any research study, a significant amount of work is required in standardizing the data and ensuring terminologies are consistently and correctly applied within the dataset.

Several tools exist that help with this translation. The Unified Medical Language System (UMLS) offers its Metathesaurus as a way to cross-walk between equivalent concepts [22]. The Observational Health Data Sciences and Informatics (OHDSI) program [3] offers a single vocabulary service, named Athena, that provides translations between common healthcare vocabularies.

As healthcare knowledge continues to advance, new terms need to be created to represent concepts in the data. The common data model may also need to evolve to accommodate new information and new types of data such as genomic, image, and unstructured note data. The OMOP data model itself has gone through a number of versions and the most recent version (5.4) supports these new data types. Similarly, vocabularies also continue to evolve. As new concepts are created, they must be added to terminologies and be connected together across related terminologies. Both the Athena vocabularies and the UMLS Metathesaurus are continuously maintained to ensure new terms and linkages are up to date.

Constructing the Design Features

As we discussed earlier, the transformation from raw data to a modeling dataset consists of two main steps. First, the raw data elements in the data warehouse relevant to the analytic question are transformed into data design features; in the second step these design features are further transformed into ML features that the learning algorithm can operate on. In this section, we focus on the first step. Specifically, we discuss the following tasks: (1) extracting the data elements from the warehouse that are necessary for the study, (2) transforming them, in a manner that is agnostic of the learning algorithm, into the set of design features, and (3) using the inclusion/exclusion criteria from the study design to create a cohort of patients who are subjects in the study.

We defined concept in Section “Working with Data” as a real-world clinical entity, such as a diagnosis, a medication, or a lab result.

A feature represents a concept. A data design feature represents a concept in the data design; an ML feature (variable) represents a concept in the ML modeling dataset.

The difference between concept and feature is that features are those concepts that are referenced in the study design. There are numerous other concepts in the data model that are not referenced in the study design; these concepts do not need to be used in the analysis and will not be turned into features.

In this book, we refer to the concepts that appear in the data design as (data) design features.

Sometimes, for a concept in the study design, there exists a data element in the data mart that has a one-to-one correspondence. However, often, multiple data elements need to be combined to accurately represent a concept.

Example 8.3.1

Suppose we are interested in the side effects of a particular drug and we are constructing a cohort consisting of patients currently taking this particular drug. This drug is a concept; it is referenced by the data design (as an inclusion criterion) and thus needs to become a data design feature. Due to the terminology we use, this drug has a one-to-one correspondence with the data in the medications table. In this case, the design feature is the corresponding drug in the medications table.

Example 8.3.2

Consider the concept of a “diabetic patient”. This concept could be defined as a patient with a diagnosis code that indicates diabetes. Under this definition, the design feature would be the presence of any diabetes-related diagnosis code. Diabetes is known to be underdiagnosed, so using this definition could lead to false negatives. If the requirements of the modeling problem are sensitive to false negatives (and less sensitive to false positives), we may need to cast a wider net and define “diabetic patient” as any patient with any indication of diabetes. In this case, the design feature could be created using a simple algorithm which checks whether the patient has a diabetes-related diagnosis code, a (possibly preventive) prescription of anti-diabetic drugs, or a periodic measurement of blood glucose (indicating suspicion of diabetes). In both cases, the design feature requires transformations of the data elements in the dataset. In the first case, the design feature maps to a set of diagnosis codes (called a value set); in the second case, a simple algorithm (called a phenotyping algorithm) transforms the original data elements into the design feature. Also note that the amount of transformation the raw data needs to undergo depends on the requirements of the study.

The raw data may come from a variety of sources, but it was often collected for some purpose other than research. In some cases, such as randomized controlled trials, data may be collected specifically for a study and therefore already be perfectly tailored to what the modeling problem needs. However, in most cases, the raw data will need to be transformed into a form that is required to implement the data design. These transformed forms are given names such as features, variables, phenotypes, or predictors, and these terms are often used interchangeably. In the subsequent sections we discuss phenotypes and value sets as two prominent ways of transforming the raw data elements into design features.

As transformations occur, information about inputs, the transformations themselves, and their outputs should be documented as meta-data. Not only is this required to reproduce results, but it is also necessary to ensure that researchers who use the resulting output understand how and why the data was produced and that it can therefore be trusted.

Data provenance (or data lineage) is the meta-data that describes data transformations. A complete set of data lineage information helps researchers know that the data that they are using in their study or for their model are correct.

In the remainder of this chapter, we will describe specific ways to construct the design features and also discuss cohort construction.

Phenotypes

Simple algorithms, like the one in the diabetes example (Example 8.3.2), are referred to as phenotyping algorithms. In genetics, a phenotype is an individual’s observable traits [23]. We will use the term phenotype to refer to the traits of the individuals represented in our data.

A phenotype is an algorithm or expression that defines a feature and is specified at a sufficient level of detail to be computable.

When we are working with data transformations, we need to define the phenotypes computed from the source data in such a way that they can be consistently applied by anyone trying to reproduce the study. We will create these computable and reproducible phenotype definitions using concepts, value sets, cohorts, covariates and combinations of these that can be combined into complex expressions to define exactly the concept that we wish to represent. Some phenotypes are simple. For example, a patient who has had a heart attack will have a single diagnosis code that indicates they had that event as part of their record (for example, an ICD-10 code of I21). Other phenotypes may be more complex. For example, a patient will be considered a Type II diabetic patient if they have at least two diagnosis codes in their record within the prior 2 years (e.g. descendants of ICD-10 code E11), at least one diabetes medication prescribed (e.g. metformin) and at least one HbA1c measure above 6.5 in the past 2 years.
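
As an illustration of how such a definition can be made computable, the sketch below implements the Type II diabetes rule above with pandas. The table and column names (dx, rx, labs, date, icd10, drug_name, lab_type, value) are hypothetical, and the prefix match on “E11” is a simple stand-in for a proper lookup of descendant codes in the terminology.

```python
import pandas as pd

def has_t2dm_phenotype(dx, rx, labs, index_date, lookback_days=730):
    """Rough, illustrative implementation of the Type II diabetes phenotype above."""
    start = index_date - pd.Timedelta(days=lookback_days)

    recent_dx = dx[dx["date"].between(start, index_date) &
                   dx["icd10"].str.startswith("E11")]             # stand-in for descendants of E11
    recent_rx = rx[rx["date"].between(start, index_date) &
                   (rx["drug_name"].str.lower() == "metformin")]  # one example diabetes drug
    recent_a1c = labs[labs["date"].between(start, index_date) &
                      (labs["lab_type"] == "HbA1c") &
                      (labs["value"] > 6.5)]

    return (len(recent_dx) >= 2) and (len(recent_rx) >= 1) and (len(recent_a1c) >= 1)
```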

Value Sets

Value sets, sometimes called concept sets, are lists of concept codes that further define a phenotype of interest by enumerating all of the codes that comprise that phenotype.

Only in simple cases will a single concept code identify the concept that a researcher is interested in. For example, in the case of atrial fibrillation, a researcher may just be interested in chronic atrial fibrillation and therefore will use SNOMED concept code 4141360. More often, a list of the different concept IDs that describe the intended phenotype is required. The UMLS Value Set Authority Center (VSAC) [24] was created to allow organizations to develop and maintain lists of concept codes that define very specific phenotypes of interest. But finding all of the codes for the intended definition is still difficult. For example, within the VSAC, there are 62 different definitions of an atrial fibrillation concept set (Fig. 3).

Fig. 3

Examples of atrial fibrillation value sets. (Screenshot taken from https://vsac.nlm.nih.gov/valueset/expansions?pr=all)

It is still difficult for a researcher to find a value set that matches their intended definition of atrial fibrillation. Even though the VSAC was created to provide a mechanism to share concept sets, it is often hard to know what phenotype a particular set of concepts really refers to, and reusing these concept sets can be problematic. In many cases researchers still end up defining their own set of codes.

There are also common pre-defined value sets available for use. Although ICD codes form a taxonomy, it is imperfect. There are codes that describe multiple diseases (e.g. diabetic retinopathy, hypertensive kidney disease), yet they appear only once in the ICD hierarchy; and conversely, codes indicative of a certain disease can appear in different parts of the taxonomy (e.g. pre-existing diabetes in pregnancy does not appear under the diabetes codes). This motivated the creation of the Clinical Classifications Software Refined (CCSR) [25], which creates value sets of ICD-10-CM codes for different diseases, regardless of where the code resides in the ICD-10 hierarchy. Another example of value sets is the Elixhauser comorbidities. Suppose we wish to estimate the effect of an exposure (e.g. lead poisoning) on mortality. To elicit the effect of lead poisoning, we need to adjust for the patient’s general state of health, because other severe comorbidities the patient may have could also influence mortality risk. Elixhauser comorbidities are such severe comorbidities, and for each comorbidity, a value set of ICD-10 codes is predefined [26].
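
In simplified form, applying a value set amounts to checking whether a patient has any code on a predefined list. The sketch below flags patients using a tiny, made-up heart failure value set; real CCSR or Elixhauser value sets contain many more codes, and the column names are illustrative.

```python
import pandas as pd

# Tiny, illustrative stand-in for a heart failure value set (real value sets are much larger).
hf_value_set = {"I50.1", "I50.20", "I50.21", "I50.9"}

diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd10":      ["I50.9", "E11.9", "I21.4", "I50.21"],
})

# A patient is flagged when at least one of their diagnosis codes is in the value set.
diagnoses["in_hf_set"] = diagnoses["icd10"].isin(hf_value_set)
hf_patients = diagnoses.groupby("patient_id")["in_hf_set"].any()
print(hf_patients)
```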

Modeling Dataset and Feature Engineering

In the current section, we focus on the second step of transforming raw data into ML features, namely the transformation of the data design features into ML features. Recall that ML features are specific to the learning algorithm, allowing it to operate on them optimally. In this section we describe (1) the modeling dataset, (2) the transformation of the data design features into ML features, (3) feature selection, and (4) feature importance.

ML Modeling Dataset

Now that we have defined the features, variables, phenotypes and the discovery sample, we can put this information into a format that is more easily consumed by our analysis and modeling software.

An ML Modeling Dataset is a data structure that contains information about the cohort and is in a convenient input format for the ML algorithm we selected for use.

The classical modeling dataset, also known as design matrix or analytic matrix, is typically a two-dimensional structure with one row for each subject (or observation) and a column for each variable/feature. Modern ML methods, in addition to analytic matrices, can also utilize higher dimensional tables (tensors), as well as non-tabular data sets (graphs, sequences, images, text, etc.).

An ML modeling dataset can contain classical analytic tables, data sets in more modern data formats, or a combination of these.

In this section we will examine best practices for producing these data extracts that will be consumed by our ML software. We will first examine the different fact table formats, namely the “long” versus the “wide” data formats.

Classical Analytic Table Formats

Ultimately, data science projects need a way to represent all the information about each person that is part of each cohort in the study. Most data science projects use a simple spreadsheet-like fact table.

The format of the table is how the rows and columns of the table are laid out.

Typically, the table is arranged such that each row represents a person and the columns represent features (or facts) about that person.

In wide format, there is one row per patient/subject/sampling unit and facts about the subject are laid out in columns; in the long format, there is one fact/observation per row and each patient/subject/sampling unit can have multiple rows.

Example 8.4.1

Consider a cohort of patients. We collect facts about these patients and each lab result is a fact. In the wide format, we have one row per patient and the different lab tests are the columns. Not all patients have results for all tests; thus, some fields in a row will be empty. In the long format, the different lab results the patient has are the rows. The columns could be patient id, date, lab type, lab result.

Table normalization and pivoting. Suppose that in the domain of the data model, several different lab tests can be administered, including cholesterol tests (LDL, HDL, TG), a glucose test (FPG), kidney function tests (creatinine), inflammation markers, etc. We could store these results in a “wide” table, with columns patient id, result date, LDL, HDL, TG, FPG, etc. Given that the LOINC terminology for laboratory results identifies 56,000 different lab tests, this would result in a table with 56,000 columns, which is a problem by itself. Compounding this problem is the fact that for each patient at each encounter (which is a row in this wide table), only a few tests are administered, so the vast majority of the columns in each row would have missing values (NULL). Storing all these NULL values wastes space. To prevent this waste, the table is normalized into a “long” format, where the columns are patient id, result date, lab type, lab result and a row is generated only for those tests that were actually administered. This way no NULL values need to be recorded.

Most analytic software, however, expects unnormalized (“wide”) tables, where lab results that could have been administered (but perhaps were not for this patient) are columns. The operation that converts a normalized (long) table into an unnormalized (wide) table is called pivoting. The pivot operator takes a column (“lab type” in this example), creates new columns for (potentially user-selected) unique values of this column, and populates these new columns with values taken from another original column (“lab result”). When there is no data for a particular lab test on a particular date, the corresponding field is set to NULL.
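
As a concrete illustration, the short pandas sketch below pivots a long-format lab table into the wide format; the column names mirror the example above and the values are invented.

```python
import pandas as pd

labs_long = pd.DataFrame({
    "patient_id":  [1, 1, 2, 2, 2],
    "result_date": ["2024-01-05", "2024-01-05", "2024-01-07", "2024-01-07", "2024-01-07"],
    "lab_type":    ["LDL", "HDL", "LDL", "HDL", "FPG"],
    "lab_result":  [130.0, 45.0, 110.0, 50.0, 98.0],
})

# Pivot: each unique "lab_type" becomes a column; tests that were not
# administered for a given patient/date become NaN (the analogue of NULL).
labs_wide = (labs_long
             .pivot_table(index=["patient_id", "result_date"],
                          columns="lab_type", values="lab_result")
             .reset_index())
print(labs_wide)
```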

Laboratory results are not the only normalized columns. The same normalization takes place for other data elements, such as diagnoses, medications, etc.

Longitudinal data. Longitudinal observations arise in several settings, including routinely collected data in the healthcare setting and clinical trials. These two settings can have different characteristics affecting the format in which the data is stored.

In a clinical trial setting, repeated measures may be taken following a particular schedule. All trial subjects are administered the same battery of tests at the same time points (where time may be measured relative to entry into the trial, rather than calendar time). Suppose two lab tests, LDL and creatinine, are measured for each subject on days 1, 5, and 30; then the data may be stored in a “wide” format, having columns patient id, ldl_day1, ldl_day5, ldl_day30, creatinine_day1, creatinine_day5, creatinine_day30. In the longitudinal data analysis literature, this format is called the person-level format.

Alternatively, in a routine healthcare setting, the schedule at which lab results are collected varies from person to person. Attempting to store such data in person-level format would yield an excessive number of columns: there would be a column for every date when a lab test was administered (for any patient) and every lab test. Instead, a “longer” format, called the person-event format (also known as the counting process or lifetable format in the survival literature; see Chapter “Methods Summary”) is used. In the person-event format, there is a row for each patient at each time point when that patient has a lab test administered. Suppose we measure cholesterol (LDL, HDL, TG) and kidney function (creatinine); a table in the person-event format would have columns patient id, result date, LDL, HDL, TG, creatinine.

Note that the person-event format is not a normalized format. Although it has a row for every time point when a test was administered (to that person), not all lab tests may be administered at that time. Tests that were not administered would have NULL values.

Relationship between analytic software and analytic table format. Almost all analytic software expects unnormalized tables (i.e. wide format, person-event or person-level format). From the perspective of normalization, the analytic software determines the format of the analytic table. In contrast, in the case of longitudinal data, the analytic method (and software) is selected based on the data format. Data in person-level format versus person-event format yield different models with different interpretations.

Best Practice 8.4.1

Whether the data need to be pivoted or not depends on the software that will be used for modeling. Almost always, the software will expect pivoted (wide) data.

Best Practice 8.4.2

The choice between person-level and person-event format is driven by the analytic need and the analytic software (model) is chosen accordingly.

Feature Engineering

As we discussed earlier, data design features are transformed into ML features to better suit the ML algorithm. What we have not discussed earlier are the benefits of these transformations. These benefits include:

  1. Models can become more interpretable (e.g. transforming non-linear features, de-skewing distributions).

  2. Predictive performance can improve, as the transformation can reduce bias, variance and noise (e.g. scaling continuous variables, removing outliers, autoencoders).

  3. The transformation can combine existing features into features that are easier to use (e.g. a single variable for complex diabetes phenotypes).

  4. The transformation can summarize other features in a more meaningful way (e.g. summarizing comorbidities into a risk index or severity index).

  5. The transformation can make the modeling possible in the first place (e.g. using natural language processing to extract features from narrative text, dimensionality reduction).

  6. The transformation can also be specifically aimed at reducing the dimensionality of the problem (i.e. the number of variables).

Feature engineering is the process of transforming existing variables, or developing additional variables (features), from the source data in order to enable the use of a particular learning algorithm, improve its performance, or improve its interpretability.

Feature engineering also involves understanding how all of the features are related so that irrelevant and redundant features can be removed and the most important features are identified for subsequent analysis and model development.

In the following, we review several techniques for transforming the design features into ML features.

Scaling. Several techniques, including neural networks, perform best when variables are brought to the same scale. Scaling helps prevent variables that are vastly different numerically, such as creatinine levels (between 0 and 2) and weight in pounds (between 0 and 500), from overwhelming an analysis, although some analytical techniques can deal with large scale differences. Variables can be brought to the same (or similar) scale by (1) simply normalizing them to be between (say) 0 and 1 (and possibly trimming some outliers), or (2) standardizing the features (subtracting the mean and dividing by the standard deviation).
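
A minimal scikit-learn sketch of both options, using invented values for the two features mentioned above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales: creatinine (mg/dL) and weight (lb).
X = np.array([[0.8, 150.0],
              [1.4, 320.0],
              [2.0, 210.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each column rescaled to the [0, 1] range
X_std = StandardScaler().fit_transform(X)    # each column to mean 0, standard deviation 1
```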

Addressing Non-linearity. Some algorithms assume a linear relationship between a predictor variable and the outcome. Features that do not satisfy this assumption can be transformed. Common transformations include expanding the feature into a polynomial (adding the squared, cubed, and higher-order versions of the same feature), taking the square root or log of the feature, or, more broadly, transforming it through a non-linear function. Alternatively, continuous features can be binned. This involves splitting the variable range into a number of categories or bins. For example, “age” can be split into bins of “0–16”, “17–65” and “>65” in an analysis.
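
For example, a polynomial expansion and a log transform might look as follows (the variables and values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

age = np.array([[25.0], [47.0], [70.0]])

# Polynomial expansion: adds age^2 as an extra column (degree=3 would also add age^3).
age_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(age)

# Log transform to compress a right-skewed feature, such as a highly variable lab value.
lab_value = np.array([1.0, 8.0, 120.0])
lab_value_log = np.log1p(lab_value)   # log(1 + x) avoids problems at zero
```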

Indicator variables. Many algorithms (e.g. regression, SVM, neural networks, etc.) operate exclusively on numeric variables. Some implementations of these algorithms may accept categorical variables, but will internally encode them into numeric variables. The key to converting categorical variables into numeric ones is that binary variables are often implemented as integer 0 or 1 values, and thus appear numeric. Although they may not have the distributional characteristics that typical numeric variables have, from a technical perspective they support the numerical operators, such as addition and multiplication, that the fitting algorithms rely on.

One-hot encoding. This is the simplest way to convert categorical variables into indicator variables. Each level (unique value) of a categorical variable becomes its own new variable. For example, the source data may have a variable named “eye_color” that contains the items “blue”, “brown”, “green”, or “black”. Each of these items becomes an ML feature, which could be named, for instance, “eyes_blue”, “eyes_brown”, “eyes_green”, and “eyes_black” and each would have a value of 0 or 1.
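
With pandas, the eye-color example can be one-hot encoded in a single call (the column and prefix names are our own choices):

```python
import pandas as pd

eyes = pd.DataFrame({"eye_color": ["blue", "brown", "green", "black", "brown"]})

# One indicator column per level, each holding 0 or 1.
eyes_onehot = pd.get_dummies(eyes, columns=["eye_color"], prefix="eyes", dtype=int)
print(eyes_onehot.columns.tolist())
# ['eyes_black', 'eyes_blue', 'eyes_brown', 'eyes_green']
```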

Binning. Continuous variables can also be converted into a set of indicator variables through binning. We discussed binning as a way of handling non-linear effects, but it can also help interpretability. There are several strategies to binning, including equi-depth and equi-width binning. In case of equi-depth binning, the number of observations in all bins will be as close to equal as possible. In case of equi-width binning, the range of values covered by each bin will be roughly equal. For example, categorizing age as 20–40, 40–60, 60–80 is equi-width, because all three bins cover a range of 20 years of age.
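
The two strategies correspond, for instance, to pandas’ cut (explicit, equal-width bin edges) and qcut (quantile-based, equal-depth bins) functions; the ages are invented.

```python
import pandas as pd

age = pd.Series([22, 35, 41, 58, 63, 79, 85])

# Equi-width: each bin spans the same range of ages (here 20 years).
equi_width = pd.cut(age, bins=[20, 40, 60, 80, 100])

# Equi-depth: each bin holds roughly the same number of observations.
equi_depth = pd.qcut(age, q=3)
```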

Contrast coding. When (ordinal or nominal) categorical variables are transformed into a set of indicator variables chiefly for the purpose of improving interpretation, contrast coding is used. The name comes from the property that levels of the categorical variable are encoded into indicator variables in such a way that the effect of each level is contrasted against another level or a reference level. When used in a regression model, the use of contrast coding schemes also has implications on the meaning of the intercept of the model.

Treatment coding is the simplest contrast coding scheme. It considers one level as the reference level and encodes all other levels using one-hot encoding. In a regression model, the effect of each level on the outcome is measured relative to the reference level and the effect of the reference level on the outcome becomes part of the intercept. Other encodings, which offer different guarantees for the intercept and different interpretations for the effect sizes, include the Sum, Deviation, Helmert, Orthogonal Polynomial, Forward Difference, and Backward Difference codings (see [27,28,29,30] for details). Creating contrasts manually can capture complex semantic relationships among levels of a categorical variable. Some coding schemes are most meaningful for ordinal variables.

Example 8.4.2

Consider a blood pressure variable with levels “normal”, “pre-hypertensive”, “hypertensive, grade 1”, “hypertensive, grade 2”. This is an ordinal variable, where “normal” < “pre-hypertensive” < “hypertensive, grade 1” < “hypertensive, grade 2” in terms of blood pressure: patients whose blood pressure is “normal” have lower blood pressure than patients whose blood pressure is “pre-hypertensive”. If we encode this variable into four indicator variables, such as ‘Is blood pressure normal?’ or ‘Is blood pressure pre-hypertensive?’, the ordering among these four indicator variables will not be explicit to the learning algorithm. If, instead, we encode the variable with indicators such as ‘Is blood pressure at least “hypertensive, grade 2”?’, then it is explicit to the learning algorithm that “grade 2” should carry higher risk than “grade 1”, because every patient flagged for “grade 2” is also flagged for “grade 1” and has, in fact, exceeded it. If the learning algorithm is a linear regression algorithm, the interpretation of the coefficient of “grade 2” becomes the excess risk associated with having grade 2 hypertension relative to grade 1 hypertension.
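
A minimal sketch of this cumulative (“at least”) encoding in pandas, assuming the four levels above; the column naming scheme is our own.

```python
import pandas as pd

levels = ["normal", "pre-hypertensive", "hypertensive, grade 1", "hypertensive, grade 2"]
bp = pd.Series(["normal", "hypertensive, grade 2", "pre-hypertensive"])

# Map each level to its rank in the ordering, then build "at least this level" indicators.
rank = bp.map({lvl: i for i, lvl in enumerate(levels)})
encoded = pd.DataFrame({
    f"at_least_{lvl}": (rank >= i).astype(int)
    for i, lvl in enumerate(levels[1:], start=1)
})
print(encoded)
```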

Pitfall 8.4.1

If ordinal variables (or more generally, variables with semantic relationships among their levels) are encoded using one-hot-encoding, the semantic relationships will be hidden from the modeling algorithm.

Best Practice 8.4.3

If categorical variables with semantic relationships among their levels need to be converted into a set of indicator variables, design an encoding scheme (if possible) that makes this semantic relationship explicit to the learning algorithm.

Missingness indicators. We will discuss handling missing values later in Section “Missing Data”. When the missingness of data for a design feature carries information, a missingness indicator can be defined for each such feature. The missingness indicator takes the value 1 if the variable is missing and 0 if it is present. The indicator allows the learning algorithm to assign an effect size to the missingness of the feature.
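
For example, with pandas (the HbA1c column and fill strategy are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hba1c": [6.1, np.nan, 7.8, np.nan]})

# Indicator is 1 when the value is missing and 0 when it is present.
df["hba1c_missing"] = df["hba1c"].isna().astype(int)

# Optionally fill the original column so algorithms that cannot handle NaN can still use it.
df["hba1c"] = df["hba1c"].fillna(df["hba1c"].median())
```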

Combining features to reduce dimensionality. It is often desirable to combine existing features into new features to make a particular analysis or model easier to build or understand. This is called dimensionality reduction. Dimensionality reduction is discussed in Chapter “Methods Summary”.

Scores. A supervised way of combining features is to create scores. A score is often a weighted sum of several features, where the weights can be assigned fully manually [30], semi-manually by converting regression weights into (supposedly more easily applicable) integer weights [31], or fully computationally, where regression weights are used as is [32].
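
A toy sketch of a manually weighted score; the features and weights are invented for illustration and do not correspond to any published score.

```python
import pandas as pd

# Hypothetical integer weights in the style of a manually assigned clinical risk score.
weights = {"age_over_65": 2, "diabetes": 1, "heart_failure": 3}

patients = pd.DataFrame({
    "age_over_65":   [1, 0, 1],
    "diabetes":      [0, 1, 1],
    "heart_failure": [1, 0, 0],
})

# The score is the weighted sum of the indicator features.
patients["risk_score"] = sum(patients[col] * w for col, w in weights.items())
print(patients)
```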

Automated feature engineering. There are also methods to automatically engineer features by using software to create hundreds or thousands of features based on data elements. This saves the researcher from having to devise features that they think will be helpful for the model and instead generates a large set of potentially useful features. Feature selection techniques can then be used to find a subset of features for a more efficient but high performing model. Automated feature engineering algorithms apply mathematical functions to information about the relationship between data elements in a database or correlations between features to generate new features [33,34,35]. Automatic feature generation can save time and may even propose novel features, but a big issue is that the methods will generate a very large number of features, most of which will not be useful. Efficient feature selection algorithms must then whittle the set of features down to an optimal set. Domain experts hand crafting a set of features can usually do better than the automated algorithms [34].

Deep Feature Synthesis [34] is an automated feature generation method that operates on relational databases using the entity-relationship models and generates features that span multiple tables. The “depth” of the features is the number of entities that were used to create the features and is a user parameter.

Autoencoders aim to create new features, collectively called a data representation, such that the original data features can be optimally reconstructed from these new features. Autoencoders use deep learning to learn this representation; the objective is to minimize the so-called reconstruction error, which is the difference between the input data reconstructed from the new representation and the actual input data [36], given the architecture, including the size of the new representation (the number of new features). If the resulting representation is of lower dimensionality than the original data, then this is also a dimensionality reduction technique (see chapter “Methods Summary”). In Chapter 3, we also discussed transfer learning, where a pre-trained model is used (after removing its output layer) as a new data representation. This new data representation can be viewed as a set of automatically engineered new features.
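
A compact PyTorch sketch of the idea, with arbitrary layer sizes and random data standing in for the (scaled) design features:

```python
import torch
from torch import nn

n_features, n_latent = 20, 8   # illustrative sizes

encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(), nn.Linear(16, n_features))
autoencoder = nn.Sequential(encoder, decoder)

X = torch.randn(256, n_features)        # stand-in for the scaled design features
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                  # reconstruction error

for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(X), X)   # try to reconstruct the input from itself
    loss.backward()
    optimizer.step()

new_features = encoder(X).detach()      # the learned 8-dimensional representation
```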

There are many commercial tools that can automatically generate features as well as a few open source tools for Python (e.g. autofeat [37] and Featuretools [38]) and for R (e.g. tsfeaturex [39]).

Knowledge-based automated features. Another common way to generate features is to use external knowledge, for example a physical equation, to compute the features, or to train a model that learns these features and then apply that model to the training dataset to generate the new features. Some of the clinical scores, where the scores were computed on an external dataset unrelated to the training data, are special cases of this approach. For a more complete review of examples of knowledge-based features, the reader is referred to [40].

Data Quality

Data quality is the degree to which a dataset meets a user’s requirements [41] and is fit for the purpose for which it is intended [42]. Whether the data was collected for the purpose of the modeling task (primary use of the data) or for a different purpose, with the modeling task merely using the data as a convenience sample (secondary use of the data), data quality issues can exist and can impact the modeling and its results.

Consequences of data quality issues range from benign to catastrophic.

For example, from a modeling perspective, when a lab result is missing for a patient because of an accidental loss of data, this is typically benign: the issue is known and many modeling methods can correct for it. If a dataset has unknown biases, these biases can invalidate a model, unbeknownst to the modeling team, leading to potentially incorrect clinical predictions or findings.

When researchers try to make use of these data, for example in research studies and quality initiatives, the first thing they need to do is understand the quality of the information and what the data represent, and then mitigate any issues or gaps relative to what they require from the data.

While all modeling, even outside healthcare, can be impacted by data quality issues, such issues are exacerbated in healthcare due to (1) the lack of standards for describing or dealing with data quality issues within healthcare data, (2) the lack of a common EHR data model, (3) a preponderance of vocabularies and terminologies to represent data types, together with the complexity of healthcare data, and (4) the continuing evolution of additional medical concepts, procedures and conditions as new medical knowledge is discovered.

Pitfall 8.5.1

There is a fallacy that “Because of data quality issues in EHR data, they cannot be used for discovery”.

Best Practice 8.5.1

EHR data can be used for discovery, but be aware of the quality issues and select modeling methods that can correct for the relevant and potentially consequential issues.

Data Quality Standards

One major issue is that there are no standard ways of describing data quality. Until recently, there have not been standard ways of even consistently referring to different issues and aspects of data quality. A number of frameworks have tried to standardize the terms that are used to describe data quality issues, and a distillation of that work resulted in harmonized terminology for data quality [43]. The primary dimensions of data quality are Plausibility (are data believable?), Conformance (do data values conform to standards?) and Completeness (are there missing data?).

We can categorize the many different types of data quality issues that arise into groups. If we can identify data quality issues, then we can potentially apply mitigation strategies to deal with them. Table 2 lists the data quality dimensions with examples of the types of data quality issues that can arise.

Table 2 Examples of data quality issues

There is a growing body of work to develop frameworks of data quality rules that can be used to assess the quality of a specific dataset or cohort.

Assessing Data Quality

There is software that implements some of these data quality assessments. For example, the Data Quality Dashboard (DQD) [44], which is one of the data quality tools within the OHDSI toolset, assesses data quality along the dimensions of Completeness, Plausibility and Conformance. The DQD uses the harmonized data quality terminology and implements data quality rules to provide a system that can evaluate an OMOP-formatted dataset and provide a set of metrics that can be used to compare across datasets or track a single dataset across time (Fig. 4).

Fig. 4

Example of a data quality dashboard [44]

The DQD encapsulates expectations as data quality rules that can be executed against the data to produce reports showing the overall data quality of the dataset. Importantly, it provides a quantitative score for the dataset that can be tracked over time (to see if changes are improving data quality) and across datasets (to ensure that datasets across organizations are of similar quality). Figure 5 lists examples of the types of rules and metrics that are computed by the data quality dashboard.

Fig. 5

Examples of data quality rules implemented in a data quality dashboard [44]

While these quantitative data quality metrics are useful, they only assess overall data quality. One important exercise is for the researcher to explicitly define the expectations they have of the data that are relevant to their particular research question. Even if a dataset is of high quality in general, it may not be for certain research questions.
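
As a simple illustration of such explicit expectations, the sketch below encodes one check for each dimension (Completeness, Plausibility, Conformance) with pandas. These are ad hoc checks in the spirit of the DQD rules, not the DQD itself; the table and the expected value set for sex are invented.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "birthdate":  ["1950-02-11", "2135-01-01", None],   # one implausible, one missing
    "sex":        ["F", "M", "X"],
})

# Completeness: fraction of missing values per column.
completeness = patients.isna().mean()

# Plausibility: birthdates should not lie in the future.
birthdates = pd.to_datetime(patients["birthdate"], errors="coerce")
n_implausible = int((birthdates > pd.Timestamp.today()).sum())

# Conformance: sex should come from the expected value set.
n_nonconforming = int((~patients["sex"].isin({"F", "M", "U"})).sum())

print(completeness, n_implausible, n_nonconforming)
```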

Best Practice 8.5.2

Provide a minimal set of data quality metrics for Completeness, Plausibility and Conformance.

Missing Data

A data element is missing if no data value is stored for a variable (feature).

There are three general reasons why data may be missing. First, and arguably most common in healthcare, is structural missingness, where the data was not supposed to be collected in the first place. For example, we would not expect cholesterol data to be collected at an encounter for vaccination. The second general reason is inadvertent missingness: there was a reason to collect the data, but, for example, the test failed to yield a result. A third reason is data quality issues. Relevant data quality issues include implausible data elements (for example, BMI > 100) or values with formatting errors (Conformance issues).

Informative missingness. One important attribute of missingness is whether it is informative. Missingness is informative if the missingness of a data element is predictive of the outcome in question. One important consequence of informative missingness is that it can confound the relationship between the (observed) predictors and the outcome.

Types of missingness. Data can be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). If the probability that a particular piece of data is missing is independent of any other variables as well as the true (unobserved) value of the missing variable, then it is missing completely at random (MCAR). In this case, the reasons that the data are missing are not related to the missing data itself and every observation has an equal chance of being missing. This allows us some freedom to deal with that missing data. Unfortunately, most data is not missing completely at random.

If the probability that a particular piece of data is missing depends only on other observed variables and does not depend on the true (unobserved) missing value (given the other observed variables), then the data is missing at random (MAR).

In this case there is a relationship between one or more observed pieces of data that is associated with the missing data. We can use the observed data to help us adjust for the missing data. Unfortunately, there is not usually an easy way to confirm that the missing data is only a function of the observed data.

Finally, if the probability that a particular piece of data is missing depends on the unobserved data itself, then the data is missing not at random (MNAR). Missing not at random is the most complicated situation and there is no way to confirm if missing data is MNAR without knowing the values of the missing data (which, of course, are missing). One of the only ways to mitigate MNAR data is to try to identify and resolve reasons for the cause of the missing data (Table 3).

Table 3 Types of missing data with examples

Handling missingness. Missing and implausible data can be handled in a number of different ways [45]. Traditional methods of dealing with missing data have assumed the data are MAR or MCAR. But if data are MNAR, those methods will produce biased estimators [46].

Complete data analysis, listwise deletion. Complete data analysis (aka listwise deletion) removes all of the data for any patient that has missing values. This is easy to implement, but can result in biased estimates if the data are not truly MCAR or when the complete data is not a random sample from the analytic sample (or population). This approach also wastes data and reduces the population available for modeling.

Single imputation methods calculate a replacement value for the missing data. Single imputation produces a set of complete data, but it has a few disadvantages; most of the methods introduce bias, and only stochastic regression produces an unbiased dataset. Single imputation methods include using the mean (or median) of the complete data as a replacement for the missing values, hot-deck imputation (replacing missing values using values from “similar” patients), and last observation carried forward imputation (for longitudinal data, taking the last measured value and using it for subsequent missing values). Stochastic regression is the recommended single imputation method since it produces unbiased data for MAR and MCAR data. Stochastic regression involves developing a regression model to predict values for incomplete variables (using complete variables) and then adding a normally distributed residual term to the predicted values in order to add the proper variance into the imputed variables [45].

Multiple imputation methods replace the missing values in multiple copies of the original dataset, analyze each completed copy to produce model parameter estimates and standard errors, and then pool the estimates across copies to produce a final set of model parameters and error estimates. There are a number of ways that the missing values can be estimated and the model parameters pooled. R (mice) and Python (scikit-learn, IterativeImputer) packages implement these techniques.
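
The sketch below approximates multiple imputation with scikit-learn's IterativeImputer: with sample_posterior=True, repeated runs with different random seeds yield several completed copies, each copy is modeled, and the point estimates are averaged. The toy data, the number of copies and the simple pooling shown here are illustrative; for properly pooled standard errors (Rubin's rules), a dedicated package such as R's mice is preferable.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["age", "sbp", "bmi"])
y = (rng.random(n) < 0.3).astype(int)        # toy binary outcome
X.loc[rng.random(n) < 0.2, "sbp"] = np.nan   # toy missingness

coefs = []
for m in range(5):                           # 5 imputed copies
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_completed = imputer.fit_transform(X)   # one completed copy
    coefs.append(LogisticRegression().fit(X_completed, y).coef_[0])

pooled_coef = np.mean(coefs, axis=0)         # pooled point estimates
```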

Pattern-based modeling. When missingness follows only a few patterns, that is, when data are missing in only a few combinations of variables, an alternative to imputation is to build a separate model for each pattern. This method results in a collection of models (rather than a single model), one per missingness pattern, but it may produce unbiased estimates even for MNAR data.
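
A minimal sketch of pattern-based modeling follows: rows are grouped by their missingness pattern and a separate model is fit per pattern, using only the variables observed within that pattern. The toy data and the choice of logistic regression are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["age", "sbp", "ldl"])
df.loc[rng.random(n) < 0.3, "ldl"] = np.nan            # one common pattern: LDL missing
df["outcome"] = (rng.random(n) < 0.3).astype(int)      # toy binary outcome

predictors = ["age", "sbp", "ldl"]
patterns = df[predictors].isna().apply(tuple, axis=1)  # missingness pattern per row

models = {}
for pattern, group in df.groupby(patterns):
    # Use only the variables that are observed within this pattern.
    observed_cols = [c for c, is_missing in zip(predictors, pattern) if not is_missing]
    models[pattern] = LogisticRegression().fit(group[observed_cols], group["outcome"])
```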

Data Processing Stacks

In order to process and transform data, we need a set of tools. There are many data environments and software packages for working with data, including several excellent commercial environments as well as freely available open-source tools.

As organizations continue to create an exponentially expanding amount of new data, there has been more focus on inventorying, describing and governing that data. Processes that transform data should use FAIR principles for scientific data and software (Findable Accessible Interoperable and Reusable data and software) [47, 48] and TRUST data principles (Transparency, Responsibility, User focus, Sustainability, and Technology) [49].

A data processing stack is a group of related tools that are designed to work together well and which form an ecosystem for efficiently working with data.

The qualities that are required in an effective data processing stack are:

  1. 1.

    Support for a common data model (CDM), which is a data model that is common across projects and researchers, promoting the sharing of artifacts and expertise in a community of researchers

  2. 2.

    Efficient and flexible access to databases and data storage structures (SQL, csv, flat file, etc.)

  3. 3.

    Tools to document data transformation steps and track data lineage

  4. 4.

    Support for using a complete programming language (e.g. R, Python, Julia, etc.) not just a data access language (e.g. SQL)

  5. 5.

    Rich packages and libraries to help with data transformations, data quality assessment, visualizations, feature engineering and model performance assessment

  6. 6.

    Support for the FAIR and TRUST principles

Documenting Data Transformations

Best Practice 8.7.1

It is a best practice to document all data transformations so that the entire process can be reproduced from scratch if necessary.

The documentation tool should ideally allow mixing narrative text that describes each step of the data pipeline with the actual code that implements the transformations, so that everything is in one place. In addition to data artifacts, the pipeline will also produce data quality assessments, summary tables and visualizations of the data and metadata. The documentation environment should make it easy to keep all of these artifacts connected and easily viewable. Feature engineering is an iterative process and the data pipeline will change quite often, so the documentation and data pipelines should be under version control to track changes and allow rollback of undesired changes. The documentation should also be easy to share within the project and across projects, ideally in a way that allows collaborative editing.

If the project does not require a specific documentation tool, an excellent place to start is Jupyter notebooks. They serve both as documentation and as a development environment in which to write and execute code. Jupyter is an open-source web application that allows for creating and sharing documents that contain live code, equations, visualizations and narrative text [50]. Alternatives with similar capabilities include Zeppelin, R Studio, etc.

Common Data Processing Stacks

Structured Query Language (SQL) is a language for low-level data manipulation and transformation. Many SQL engines are designed to achieve very high performance when retrieving data elements from large data warehouses or data marts, but they lack features for sophisticated feature engineering, such as missing-value imputation.
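
A common division of labor, sketched below with a hypothetical SQLite data mart and made-up table and column names, is to let SQL handle retrieval and hand the result to a data-science environment for the remaining feature engineering:

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite data mart; table and column names are illustrative.
conn = sqlite3.connect("datamart.db")
labs = pd.read_sql_query(
    """
    SELECT patient_id, lab_code, result_value, result_date
    FROM lab_results
    WHERE lab_code IN ('2345-7', '4548-4')  -- e.g. glucose and HbA1c LOINC codes
    """,
    conn,
)

# Pivoting, imputation, encoding, etc. happen here rather than in SQL.
wide = labs.pivot_table(index="patient_id", columns="lab_code",
                        values="result_value", aggfunc="last")
```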

Spark is an open-source data science ecosystem, with implementations of the SQL standards, offering powerful data transformation and feature engineering capabilities. Emphasis is placed on computational scalability, optimization, and load-balancing across a potentially very large number of compute nodes. Spark also has ML modeling capabilities, and some Spark environments, such as Zeppelin, support visualization. Programming languages supported by Spark include Scala (its native language), Java, Python, R, etc.
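
A minimal PySpark sketch is shown below, assuming a running Spark environment; the file path, table name and lab code are hypothetical. It combines Spark SQL with a DataFrame transformation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Hypothetical parquet extract registered as a SQL view.
labs = spark.read.parquet("/data/datamart/lab_results")
labs.createOrReplaceTempView("lab_results")

fpg = spark.sql("""
    SELECT patient_id, MAX(result_value) AS max_fpg
    FROM lab_results
    WHERE lab_code = '1558-6'   -- e.g. fasting plasma glucose LOINC code
    GROUP BY patient_id
""")

# Distributed feature derivation on the aggregated result.
features = fpg.withColumn("elevated_fpg", (F.col("max_fpg") >= 100).cast("int"))
```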

R is an open-source, primarily statistical programming language. As such, it natively supports tasks related to feature engineering, statistical modeling, and visualization. It is highly extensible, and packages implement all of the popular ML algorithms. Because of its origins in statistical processing, its data extraction and transformation capabilities were designed for smaller, in-memory datasets; however, packages have been developed to interface with database engines. It is not the first-choice tool for data extraction.

Python is an open-source, interpreted, high-level, dynamically typed, general-purpose programming language created by Guido van Rossum in 1991. It has developed into one of the first-choice environments for data science activities, with extension packages covering an extremely broad range of areas including numerical computation, manipulation of tabular and graphical data, ML modeling, visualization, etc.

Matlab is a commercial language, originally designed for high-performance matrix manipulation and visualization. The major machine learning methods have been implemented for Matlab.

Table 4 summarizes how well each data stack addresses different tasks.

Table 4 Some common data processing stacks. Notation: ‘-’: not designed to perform this task; ‘O’: possible to perform the task but not recommended; ‘+’: good at performing the task; ‘++’: designed specifically for performing this task efficiently

Best Practices 8.7.2

The minimum set of tools to learn is a common data model, at least one of the data science programming languages, and data access using SQL.

Data Pipelines

A data pipeline is a set of software programs that convert the raw data in the data sources into a modeling dataset that ML algorithms can directly operate on.

A data pipeline differs from a data processing stack in that a data pipeline is project-specific while a data processing stack is general-purpose. Data processing stacks are implemented to be, after project-specific customization, broadly applicable to a wide range of projects. The data pipeline, on the other hand, is fully implemented by the modeling team and its main purpose is specifically to produce the modeling dataset. Data pipelines are often implemented using data processing stacks.

Data science projects usually have quite a few data transformation steps. In fact, data manipulation, transformation, quality assessment and standardization consume most of the effort in a typical data science project. It is critical that the data transformation steps be well defined, well documented, and reproducible. As we discussed earlier, all of the data transformation steps in a project should be documented. Most projects have many steps that must be executed in a specific order to reproduce all of the features used in the project.

Best Practice 8.7.3

Every project should create and maintain a data pipeline, which is a repeatable process that performs all of the steps required to transform data from the source data to the final analytic fact table that is used as input for the analytic and modeling part of the project.

The team should be able to recreate the modeling dataset from scratch using only the source data and the data pipeline. It is also critical to document metadata about when each transformation occurred, which source datasets it operated on, and the characteristics of the data at that time.
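
A minimal sketch of such a pipeline is shown below: each step is a documented function, and a small provenance record is written alongside the modeling dataset. The function names, file paths and metadata fields are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def extract(source_path: str) -> pd.DataFrame:
    """Step 1: read the raw extract from the source repository."""
    return pd.read_csv(source_path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Step 2: derive design features (illustrative transformation)."""
    raw["elevated_fpg"] = (raw["fpg"] >= 100).astype(int)
    return raw


def run_pipeline(source_path: str, output_dir: str) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    modeling_df = transform(extract(source_path))
    modeling_df.to_csv(out / "modeling_dataset.csv", index=False)

    # Provenance metadata: when the transformation ran and on which source data.
    metadata = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "source": source_path,
        "n_rows": len(modeling_df),
        "pipeline_version": "v1.2.0",  # e.g. a git tag of the pipeline code
    }
    (out / "provenance.json").write_text(json.dumps(metadata, indent=2))


# run_pipeline("raw/fpg_extract.csv", "data_mart/")
```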

We discussed data lineage, or data provenance (e.g. when a transformation happened and what the source input data were), in section “Constructing the Design Features”. Data lineage helps researchers understand how the data they are analyzing were created, helps other researchers and collaborators reproduce the steps with their own datasets, and helps all stakeholders trust that the data were transformed as expected and in a way that supports the analytic process. Commercial data science systems may have data lineage capabilities built into their products; researchers using open-source software usually have to assemble their own solutions. Recently, a number of open-source projects have emerged that help document data lineage (e.g. Amundsen, Datahub and Marquez).

Best Practice 8.7.4

Ensure that meta-data associated with phenotypes and variables contains enough information to allow for the re-creation of the phenotypes/variables from source data.

It is also important that, as the definitions of the phenotypes and variables change, the version of each definition that was used to compute a variable is recorded. Keeping these definitions under a version control system is also a best practice: you can check in the code from your Jupyter notebooks and Python or R programs in order to maintain a reproducible data pipeline.

A data pipeline is also a convenient place to insert data quality assessments. Information about the source data and its impact on the transformed data is important metadata that can help researchers debug analytic issues. Such tools record data quality metadata about the source data longitudinally, but they currently do not have very good mechanisms for assessing the data quality of transformed data elements, so researchers may have to develop their own assessments.

It is also useful to automate data pipelines so that they can be executed whenever the source data change. Because running an entire pipeline in full on a very large dataset can be expensive in time and money, it is useful to understand how source data updates affect downstream variable transformations. The pipeline can then execute only the parts that are needed to regenerate the affected downstream elements; source data that has not changed does not trigger recomputation of the elements that depend on it, which is more efficient.
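
One simple way to support such incremental execution, sketched below with hypothetical step and file names, is to hash each step's source inputs and re-run the step only when one of the hashes changes:

```python
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("pipeline_state.json")  # hashes recorded from previous runs


def file_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(step: str, inputs: list[str]) -> bool:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current = {p: file_hash(p) for p in inputs}
    if state.get(step) == current:
        return False  # inputs unchanged; downstream artifacts are still valid
    state[step] = current
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return True


# if needs_rerun("derive_fpg_features", ["raw/labs.csv"]):
#     ...  # re-execute only this step and its downstream dependents
```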

The result of the data pipeline should be the creation of your modeling dataset. Some data ecosystems have tools to support this process. For example, OHDSI has the Atlas tool, which takes a cohort definition as input (via the web UI or a JSON definition file) and creates an analytic cohort from the criteria. Atlas supports defining variables as single concepts or via value sets. At this time, Atlas does not support defining more complex phenotype expressions, which need to be handled outside of the tool using a programming language or SQL.

Key Concepts in Chapter “Data Preparation, Transforms, Quality, and Management”

Biomedical concept, vocabulary, terminology, ontology.

Data element, dataset, data warehouse, data mart.

Data model, data dictionary, common data model.

Data provenance (lineage).

Phenotyping algorithm, value set.

Feature, variable.

ML Modeling Dataset.

Classical analytic table format: wide and long tables.

Variable types: numeric, binary, categorical (nominal and ordinal).

Feature engineering.

Data quality, data quality standards.

Missing Data, informative missingness, Types of missingness (MCAR, MAR, MNAR).

Data processing stack, ecosystems, data pipelines.

Pitfalls and Best Practices in Chapter “Data Preparation, Transforms, Quality, and Management”

Pitfall 8.4.1. If ordinal variables (or more generally, variables with semantic relationships among their levels) are encoded using one-hot-encoding, the semantic relationships will be hidden from the modeling algorithm.

Pitfall 8.5.1. There is a fallacy that “Because of data quality issues in EHR data, they cannot be used for discovery”.

Best Practice 8.2.1. It is a best practice to use a data dictionary to define each data element in the dataset that we are working with.

Best Practice 8.4.1. Whether the data need to be pivoted or not depends on the software that will be used for modeling. Almost always, the software will expect pivoted (wide) data.

Best Practice 8.4.2. The choice between person-level and person-event format is driven by the analytic need and the analytic software (model) is chosen accordingly.

Best Practice 8.4.3. If categorical variables with semantic relationships among their levels need to be converted into a set of indicator variables, design an encoding scheme (if possible) that makes this semantic relationship explicit to the learning algorithm.

Best Practice 8.5.1. EHR data can be used for discovery, but be aware of the quality issues and select modeling methods that can correct for the relevant and potentially consequential issues.

Best Practice 8.5.2. Provide a minimal set of data quality metrics for Completeness, Plausibility and Conformance.

Best Practice 8.7.1. It is a best practice to document all data transformations so that the entire process can be reproduced from scratch if necessary.

Best Practices 8.7.2. The minimum set of tools to learn is a common data model, at least one of the data science programming languages, and data access using SQL.

Best Practice 8.7.3. Every project should create and maintain a data pipeline, which is a repeatable process that performs all of the steps required to transform data from the source data to the final analytic fact table that is used as input for the analytic and modeling part of the project.

Best Practice 8.7.4. Ensure that the meta-data associated with phenotypes and variables captures at least some basic aspects such as the date of the transformation and the version of the source data that was used to compute each variable.

Questions for Class Discussion and Assignments Chapter “Data Preparation, Transforms, Quality, and Management”

  1. 1.

    The International Classification of Diseases, Revision 10 (ICD-10) is often used for documenting the diagnosis of diseases in electronic health records. How many codes can you find that indicate type-2 diabetes mellitus among the ICD-10 codes?

    https://www.icd10data.com/ICD10CM/Codes/E00-E89/E08-E13

    (The above link takes you to the diabetes subtree of the ICD-10 hierarchy.)

    The Clinical Classifications Software (CCS) reorganizes ICD-10 codes into disease categories (with overlaps) so that the codes better represent clinically meaningful diseases. Look at the ICD-10 codes under the type-2 diabetes CCS category (END005). Did you find any ICD-10 codes that are not in the E11 branch?

    (CCS is available from https://www.hcup-us.ahrq.gov/toolssoftware/ccsr/dxccsr.jsp)

  2. 2.

    Look at the RxNorm or NDF-RT terminologies. Where would you find insulin?

  3. 3.

    Creatinine in the bloodstream is one of the indicators of kidney disease. Which creatinine codes from LOINC can you use? How do they differ? [Hint: Beware of creatinine in urine.]

  4. 4.

    We aim to build a diabetes risk prediction model for adult (age >18) patients with elevated fasting plasma glucose (FPG) levels (at least 100 mg/dL). Exclude patients with pre-existing diabetes (as indicated by the presence of a diabetes diagnosis code, FPG >125 mg/dL, or prescription of an anti-diabetic drug). As predictors, we use specific vital signs and lab results: systolic and diastolic blood pressure, pulse, body mass index (BMI), cholesterol levels (LDL, HDL, TG), and FPG.

    1. (a)

      What are your design features? (Hint: Do not forget the features in the inclusion and exclusion criteria and the outcome.)

    2. (b)

      Assuming your institution uses ICD-10 for documenting diagnoses, how would you define ‘pre-existing diabetes’? Hint: Use the term “value set”.

    3. (c)

      You can cross-check your value set with CCS or CCSR for ICD-10-CM. https://www.hcup-us.ahrq.gov

      Hint: Use CCS or CCSR for ICD-10-CM (the Clinical Modification, used for diagnoses; not ICD-10-PCS, the procedure coding system)

    4. (d)

      Based on the description in the question, how would you define the phenotyping algorithm for ‘pre-existing diabetes’? For this question, you may assume that a value set for “diabetes medication” already exists. You can also use the vital signs and lab test names, you do not need to look up their LOINC codes.

    5. (e)

      Define the phenotype for the outcome. How is it different from your answer to question (d)?

  5. 5.

    Suppose you implement your risk model using GLM.

    1. (a)

      What would be your ML features? [Hint: Is pre-existing diabetes going to be turned into ML features?]

    2. (b)

      Do you need to transform your lab results in any way?

    3. (c)

      Due to inclusion and exclusion criteria, FPG will be truncated to a range between 100 and 125 mg/dL. Will this cause any problems?

    4. (d)

      Assume that age below 45 has no effect, age between 45 and 65 has a moderate effect, and above 65 age has a very pronounced effect. How would you turn age into ML features?

    5. (e)

      Suppose we wish to add anti-hypertensive (blood pressure) medications as predictors. Further assume for simplicity that we only have three anti-hypertensive subclasses (ACE, ARB and BB). How would you encode these drugs into ML features? [Hints: Some patients do not take anti-hypertensive drugs.]

    6. (f)

      If I told you that ACE and ARB have roughly the same effect, but BB is more potent, would you change your encoding?

    7. (g)

      How would you change your encoding if I told you that BB can be added to an ACE/ARB regimen for even greater (i.e. interaction) effect? [Hint: there are two solutions; one of them does not require new ML features.]

  6. 6.

    Suppose you implement your risk model from Question 4 using a neural network. How would your answers to Question 5 change?

  7. 7.

    While implementing your model from Question 5 (using GLM), you find that some of the blood pressure measurements are missing. How do you handle the situation under the following conditions? Would you delete patients with missing blood pressure? Would you impute? Would you create a missingness indicator variable? Is the missingness informative?

    1. (a)

      All blood pressure measurements were collected, but a technician accidentally forgot to transfer some of them. Whether a blood pressure measurement was transferred or not is random.

    2. (b)

      Some blood pressure measurements are not collected because the patient is young, and BMI as well as all lab results indicate that the patient is healthy.

    3. (c)

      The missing blood pressure measurements are not collected because the patient is obviously healthy. For the same reason, other labs are missing, too.

  8. 8.

    Suppose observations for blood pressure are missing completely at random (a technician forgot to transfer them to you). Since they are missing completely at random, you decide to impute the population mean value.

    1. (a)

      Will this impact the predictions from your model?

    2. (b)

      Does this change the variance of the blood pressure in the sample? (Hint: use the formula for sample variance.)

    3. (c)

      Would this affect the estimated error of the predictions from your model? Would this affect the significance of the coefficient for blood pressure?

    4. (d)

      How can you compensate for this possible change in variance? (Hint: stochastic regression.)

  9. 9.

    Consider an ordinal categorical variable with four levels, A < B < C < D. How do I encode this variable into ML features to achieve the desired interpretation?

    1. (a)

      I wish to know the effect of B, C, D relative to A.

    2. (b)

      I wish to know the effect of B relative to A, the effect of C relative to B and the effect of D relative to C.

    3. (c)

      Some values are missing. I wish to know the effect of all levels and the effect of missingness relative to A. How do I define the ML features and the missingness indicator?

    4. (d)

      Some values are missing. I wish to know the effect of B relative to A, the effect of C relative to B, the effect of D relative to C, and the effect of missingness relative to D.

  10. 10.

    Consider an ordinal variable with three levels: A < B < C. Let α0, αB, αC denote the intercept and the coefficients quantifying the effects of B relative to A and C relative to A in a regression model. In an alternative encoding, let β0, βAB, βBC denote the intercept, the effect of B relative to A and the effect of C relative to B. Can you express αC in terms of the β’s? (Hint: You can consider a simple model with a single predictor, the variable in question. You can then express the prediction from both models in terms of their respective coefficients when the observation (for your only predictor) is A, B and C. Can the predictions differ between the two models?)

  11. 11.

    Consider a caricature problem of predicting mortality based on predictors related to blood pressure (systolic and diastolic blood pressure, presence of blood pressure medications) and cholesterol (LDL, HDL, TG, presence of cholesterol medications). Two models are considered. The first one, model A, uses all 7 predictors to predict mortality. The second, alternative model, is a stacked model, where the first layer consists of a blood-pressure score, which is implemented as the predictions from a logistic regression model that predicts mortality using the 3 blood pressure related variables; and a cholesterol score that is implemented as a logistic regression model predicting mortality using the 4 cholesterol-related variables. The second layer is also a logistic regression model that predicts mortality based on the blood pressure score and cholesterol score. Are these models (model A and the stacked model) different? Is one of the models more constrained? Can you think of situations when one model performs better than the other?

  12. 12.

    The Charlson comorbidity score [51] is a weighted sum of the presence of comorbidities that predicts 1-year mortality. It is frequently used to adjust for a patient’s state of health. What is the benefit from using such a score versus using the individual comorbidities? Is there a disadvantage?