To begin a data analysis, we must first extract, combine, organize, and clean the requisite data. As depicted in Fig. 1, these data preparation tasks account for a large part of the work analytics professionals do.

Fig. 1
A horizontal bar chart of how analytics professionals spend their time: cleaning and organizing data (60%), extracting data (20%), exploratory data analysis (10%), refining algorithms (5%), building training sets (4%), and other (6%).

How analytics professionals spend their time

Data Extraction

To properly and efficiently extract data—often through the use of SQL as covered in chapter “Introduction to SQL”—it is important to first understand some common ways in which data are stored and structured.

Data Architecture

Data are generally extracted directly from the source systems in which they are generated or from downstream repositories such as a data lake, data warehouse, or data mart.

Data Lake

A data lake stores myriad types of data—both structured and unstructured—in its native format until needed. Structured data refers to data that fits a predefined data model, such as hire dates formatted as MM/DD/YYYY or zip codes stored as a five-digit string. Unstructured data has no predefined data model, such as audio and video files, free-form performance review comments, emails, or digital exhaust from messaging tools; it is difficult to structure this type of data within a set of related tables.

Data are often stored in a data lake so that they are available when use cases are defined for them, at which time a data model is designed and developed outside the data lake to facilitate the requirements.

Data Warehouse

Data in a data warehouse (DW) are structured and organized into schemas of related tables based on business requirements and use cases. The main difference between a data lake and data warehouse is the type of data they are designed to store. A DW is designed to support analytics across large collections of data, such as transactional data (e.g., point-of-sale systems), point-in-time snapshots (e.g., month-end close reports), survey responses, spreadsheets, and more.

A DW can contain many different types of tables, but this chapter will focus on the two most common, known as Type 1 and Type 2 tables. These tables are sometimes referred to as slowly changing dimensions (SCD).

A Type 1 table is created on a regular cadence (usually daily or monthly) and contains no history—only current values. For example, a Type 1 table may contain the latest known attributes for each active and terminated worker such as job, location, and manager. Type 1 tables are sometimes archived and appended to prior snapshots with an effective date, and this design has utility when a view of the workforce is required as of a past date or when an analysis calls for querying across multiple point-in-time snapshots (e.g., computing trailing 12-month attrition rates using average monthly headcount).

It is important to note that leveraging snapshots for trending analyses has notable deficiencies: HCM systems process a large number of retroactive transactions, and these changes are not captured unless prior snapshots are updated. Some examples:

  • Org changes for which incorrect manager assignments are later identified and corrected

  • Back-dated compensation changes

  • Job profile attribute updates, resulting in incorrect values across snapshots prior to the update date

  • Edits to hire, promotion, transfer, and termination events after completing the business process in the system (e.g., delayed start date)

While a past date may be set as the effective date for these transactions, snapshots would incorrectly indicate that the changes took effect on the date they were entered into the system (i.e., when the value first changed across snapshots), resulting in misalignment with the system of record. This can result in inaccurate metrics related to headcount, hires, career moves, and terminations in a given period. This type of data leakage can quickly become a larger issue as the size and complexity of the workforce grows. Even a few discrepancies relative to what managers see in the source system can create mistrust in data solutions and hamstring progress up the analytics value chain.

A Type 2 table is a table in which a new record is inserted when a change occurs for one or more specified dimensions. Jobs, managers, and locations are examples of slowly changing dimensions. Unlike the Type 1 table, which contains only the latest information, the Type 2 table houses a start date and end date for each worker and dimension to facilitate reporting and analysis on changes to attribute values over time. This concept of storing attribute values for the period of time during which they were effective is known as effective dating, and the inclusion of effective-dated logic in queries is fundamental to how data are accurately extracted for a particular date of interest.

Figure 2 illustrates the design of a Type 2 SCD for an active worker’s job, manager, and location changes. As the data show, worker 123 was promoted from Data Analyst to Sr. Data Analyst 1.5 years after joining, began reporting to their original manager (456) after a short stint reporting to someone else (789), and has worked remotely throughout their entire tenure with the company.

Fig. 2
A Type 2 SCD table with six rows and the following columns: employee_id, attribute_name, attribute_value, start_date, and end_date.

Type 2 SCD

Note that rows where end_date = '12/31/9999' indicate current attributes for active workers. For inactive workers, end_date would be set to the worker’s termination date for rows that represent last known attributes.

Constructing a view of the workforce as of a particular effective date involves selecting rows where the effective date is on or after start_date and on or prior to end_date. In a SQL query, this logic can be specified in the WHERE clause, which defines the rows to search.
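
To make this concrete, a minimal base R sketch of the same filter is shown below; the type2_scd data frame and the date of interest are assumptions mirroring Fig. 2, and the equivalent SQL would place the same two conditions in the WHERE clause.

# Hypothetical date of interest
effective_date <- as.Date("2021-06-30")

# Select rows in effect on the date of interest; type2_scd mirrors Fig. 2 and
# stores dates as MM/DD/YYYY strings
as_of_view <- subset(type2_scd,
                     as.Date(start_date, format = "%m/%d/%Y") <= effective_date &
                     as.Date(end_date, format = "%m/%d/%Y") >= effective_date)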

To construct a view of the last known attributes for each worker in this table, we could select rows where the current date is on or after start_date and on or prior to end_date for active workers and then select the most recent rows (max end_date) for each inactive worker. However, using a Type 1 table simplifies this task since each dimension value is stored in a separate column and there is only one row per employee. Figure 3 shows how the current record for worker 123 would look in a Type 1 SCD.

Fig. 3
A Type 1 SCD table with one row and the following columns: employee_id, hire_date, termination_date, manager_id, job, and location.

Type 1 SCD

Data Mart

A data mart is a subset of a DW designed to easily deliver specific information to a certain set of users on a particular subject or for a well-defined use case. For example, in a people analytics context a diversity data mart could be developed to better isolate and secure restricted data such as gender and ethnicity. These data may be used to support diversity descriptives and trends for a limited audience approved by Data Privacy and Legal counsel based on legitimate business needs.

Database Normalization

Normalization is the process of partitioning data into multiple tables to reduce data redundancy and promote data integrity. Conversely, denormalization combines data into a single table to facilitate easier and faster data retrieval.

The tables used to explain SQL joins in chapter “Introduction to SQL” are examples of normalized data. Normalized tables introduce more complexity for analysts since data organized across tables need to be joined together to create a flat structure that is easier to work with for analytics. However, normalized tables have a key advantage in accounting for past-dated changes: the latest data are retrieved from the various tables when needed rather than from immutable snapshots that only reflect data as of the date and time they were created. For example, suppose a worker snapshot was created yesterday, and today a change is processed in the system to rename location id MA123 from Cambridge Office to Boston HQ with an effective date of yesterday. The static snapshot would still show Cambridge Office as the location, while querying the normalized tables would return the updated Boston HQ location name.

One way to address the shortcomings of snapshots without significant data engineering overhead is to perform destructive loads. A destructive load, sometimes referred to as a truncate and reload approach, involves destroying prior snapshots and rebuilding them for each effective date. For example, if policy dictates that retroactive changes cannot be processed with effective dates more than six months in the past, a destructive load could be performed for a rolling six months of snapshots to ensure they reflect any past-dated worker events and non-worker attribute changes (e.g., department, location, job, position).

Modern Data Infrastructure

Though a deep treatment of data architecture is beyond the scope of this book, it is important to acknowledge the significant advancements in infrastructure and computation—and the important implications for analytics—since SCD architectures were first introduced by Ralph Kimball decades ago. These developments have greatly improved the efficiency with which analysts can translate data into information and insight.

With modern cloud environments, the significant investment associated with humans designing, developing, and maintaining these complex architectures is often difficult to justify given how inexpensive storage and compute have become. Increasingly, the heavy computational tasks have migrated out of data pipelines and into the analytics layer wherein analytics teams have more flexibility and control. Today, daily snapshots containing all current and historical records can be copied and stored within partitioned DW tables for a negligible increase in storage costs, which greatly reduces data pipeline complexity and engineering support requirements. This changing dynamic has given rise to a new breed of data engineers focused on optimizing the heavy computation requirements of analytics teams.

Data Screening and Cleaning

Once data are extracted and organized in a flat data structure, the initial data review process can commence. This process is often referred to as exploratory data analysis (EDA). EDA involves investigating patterns, completeness, anomalies, and assumptions using summary statistics and graphical representations. An analytics ideal is unimpeachable data quality, which is to say that the rigor of upstream business processes and downstream data screening is such that stakeholders are confident enough to channel more energy toward acting on results than questioning their quality. Suspect data quality is often surfaced during the initial EDA step, which affords the opportunity to address issues early and avoid stakeholders discounting results during later phases of the project.

A handy function in base R for initial data screening is summary(). This function returns measures of central tendency (mean and median) and spread (min, max, and 1st/3rd quartiles) for each numeric variable.

The code below loads the library and data and summarizes the data frame.
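
A minimal sketch of these steps, assuming the employees data set ships with a peopleanalytics package (the package name is an assumption carried from chapter “Getting Started”):

# Load library (package name is an assumption)
library(peopleanalytics)

# Load data
data("employees")

# Summarize data frame
summary(employees)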

##   employee_id      active          stock_opt_lvl      trainings
##  Min.   :1001   Length:1470        Min.   :0.0000   Min.   :0.000
##  1st Qu.:1368   Class :character   1st Qu.:0.0000   1st Qu.:2.000
##  Median :1736   Mode  :character   Median :1.0000   Median :3.000
##  Mean   :1736                      Mean   :0.7939   Mean   :2.799
##  3rd Qu.:2103                      3rd Qu.:1.0000   3rd Qu.:3.000
##  Max.   :2470                      Max.   :3.0000   Max.   :6.000
##
##       age         commute_dist        ed_lvl        ed_field
##  Min.   :18.00   Min.   : 1.000   Min.   :1.000   Length:1470
##  1st Qu.:30.00   1st Qu.: 2.000   1st Qu.:2.000   Class :character
##  Median :36.00   Median : 7.000   Median :3.000   Mode  :character
##  Mean   :36.92   Mean   : 9.193   Mean   :2.913
##  3rd Qu.:43.00   3rd Qu.:14.000   3rd Qu.:4.000
##  Max.   :60.00   Max.   :29.000   Max.   :5.000
##
##     gender          marital_sts            dept             engagement
##  Length:1470        Length:1470        Length:1470        Min.   :1.00
##  Class :character   Class :character   Class :character   1st Qu.:2.00
##  Mode  :character   Mode  :character   Mode  :character   Median :3.00
##                                                           Mean   :2.73
##                                                           3rd Qu.:3.00
##                                                           Max.   :4.00
##
##     job_lvl       job_title           overtime         business_travel
##  Min.   :1.000   Length:1470        Length:1470        Length:1470
##  1st Qu.:1.000   Class :character   Class :character   Class :character
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character
##  Mean   :2.064
##  3rd Qu.:3.000
##  Max.   :5.000
##
##   hourly_rate       daily_comp     monthly_comp    annual_comp
##  Min.   : 30.00   Min.   :240.0   Min.   : 5200   Min.   : 62400
##  1st Qu.: 48.00   1st Qu.:384.0   1st Qu.: 8320   1st Qu.: 99840
##  Median : 66.00   Median :528.0   Median :11440   Median :137280
##  Mean   : 65.89   Mean   :527.1   Mean   :11421   Mean   :137054
##  3rd Qu.: 83.75   3rd Qu.:670.0   3rd Qu.:14517   3rd Qu.:174200
##  Max.   :100.00   Max.   :800.0   Max.   :17333   Max.   :208000
##
##    ytd_leads       ytd_sales       standard_hrs salary_hike_pct  perf_rating
##  Min.   :11.00   Min.   : 15496   Min.   :80    Min.   :11.00   Min.   :3.000
##  1st Qu.:45.00   1st Qu.: 56997   1st Qu.:80    1st Qu.:12.00   1st Qu.:3.000
##  Median :59.00   Median : 73505   Median :80    Median :14.00   Median :3.000
##  Mean   :55.84   Mean   : 77124   Mean   :80    Mean   :15.21   Mean   :3.154
##  3rd Qu.:65.00   3rd Qu.: 96002   3rd Qu.:80    3rd Qu.:18.00   3rd Qu.:3.000
##  Max.   :95.00   Max.   :281499   Max.   :80    Max.   :25.00   Max.   :4.000
##  NA's   :1061    NA's   :1061
##  prior_emplr_cnt    env_sat         job_sat         rel_sat
##  Min.   :0.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000
##  Median :2.000   Median :3.000   Median :3.000   Median :3.000
##  Mean   :2.693   Mean   :2.722   Mean   :2.729   Mean   :2.712
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :9.000   Max.   :4.000   Max.   :4.000   Max.   :4.000
##
##    wl_balance       work_exp       org_tenure       job_tenure
##  Min.   :1.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:2.000   1st Qu.: 6.00   1st Qu.: 3.000   1st Qu.: 2.000
##  Median :2.000   Median :10.00   Median : 5.000   Median : 3.000
##  Mean   :1.841   Mean   :11.28   Mean   : 7.032   Mean   : 4.229
##  3rd Qu.:2.000   3rd Qu.:15.00   3rd Qu.: 9.000   3rd Qu.: 7.000
##  Max.   :2.000   Max.   :40.00   Max.   :72.000   Max.   :18.000
##
##    last_promo       mgr_tenure     interview_rating
##  Min.   : 0.000   Min.   : 0.000   Min.   :2.000
##  1st Qu.: 0.000   1st Qu.: 2.000   1st Qu.:3.600
##  Median : 1.000   Median : 3.000   Median :4.100
##  Mean   : 2.188   Mean   : 4.123   Mean   :3.989
##  3rd Qu.: 3.000   3rd Qu.: 7.000   3rd Qu.:4.500
##  Max.   :15.000   Max.   :17.000   Max.   :5.000
##

Note that the NA's counts identify fields with missing values (ytd_leads and ytd_sales here). Also, by default employee_id is treated as an integer in R, which is why descriptive statistics appropriate for numeric data are provided. Despite containing only digits, employee_id should be treated as a character string since we will not perform any arithmetic operations on these ids.

Missingness

Before considering whether and how to handle missing data, it is important to distinguish between structural missingness and informative missingness (Kuhn and Johnson, 2013).

Structural missingness relates to data that are missing for a logical reason. For example, we would not expect a new joiner with a few days of tenure to have a performance score. Likewise, we would not expect an active employee who is not a rehire to have a termination date. Therefore, it would not make sense to define a value to address missing data in these cases.

Informative missingness relates to missing data that is informative regarding an outcome of interest. For example, in a survey context we may find a relationship between missing values on manager effectiveness questions and unfavorability on a psychological safety scale. This may indicate that employees who are fearful of retaliation are uncomfortable providing honest feedback about their managers, while employees who feel it is safe to speak up about issues are more comfortable responding in prosocial ways.

In some cases, we have the luxury of simply removing observations with missing values and using the remaining complete cases for analysis—assuming there are relatively few observations with missing values and no systematic missingness patterns that could bias analyses. However, since we are often working with wide data sets containing relatively few observations in a people analytics setting, this may not be feasible. As we will cover in later chapters, sample size considerations are fundamental to achieving adequate power in statistical testing, so case removal is generally only advisable with larger data sets.

Data imputation refers to the methods by which missing data are replaced with substituted values when case removal is not appropriate. The most common data imputation method is replacing missing values with a descriptive statistic such as the mean, median, or mode based on available data. For example, if most employees have an age in the system, the average, median, or most frequent age could be used in place of the missing ages. To be more precise, the average, median, or most frequent age of employees with similar values for variables believed to correlate with the missing variable may be used (e.g., similar years of experience, job, level). We would expect less variability in age within a well-defined segment relative to the entire employee population, so this would likely be a more accurate estimate of an individual’s actual age.

Let us evaluate the employees data frame for missing annual_comp values using the logical is.na() function and return values of variables relevant in determining one’s annual compensation. The subset() function can be used to select a subset of data from a data frame.

The code below stores the original annual_comp for a sample employee, creates an NA in lieu of annual_comp for illustrative purposes, and returns relevant employee characteristics where annual_comp is missing.
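
A sketch of these steps, assuming the employee shown in the output below (employee_id 2176); the selected columns are illustrative:

# Store original annual comp for sample employee
original_comp <- employees$annual_comp[employees$employee_id == 2176]

# Create an NA in lieu of annual comp for illustrative purposes
employees$annual_comp[employees$employee_id == 2176] <- NA

# Return relevant employee characteristics where annual comp is missing
subset(employees, is.na(annual_comp),
       select = c(employee_id, job_title, job_lvl))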

##      employee_id              job_title job_lvl
## 1176        2176 Manufacturing Director       2

Next, we will impute the average value of annual_comp based on employees with the same values for the relevant variables. The sapply() function can be used in conjunction with the mean() function to compute the average of the subsetted data frame. The sapply() function is a member of a broader family of apply() functions in R, and the s indicates that it simplifies the result where possible, which in this case is a single value (the mean).

The code below returns the average annual_comp for employees with similar characteristics (excluding employees with missing comp values), imputes the missing comp for the relevant segment, and displays the absolute difference between the original and imputed comp.
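
A sketch under the assumption that job title and job level are the relevant variables for this employee:

# Return average annual comp for employees with similar characteristics,
# excluding employees with missing comp values
avg_comp <- sapply(subset(employees,
                          job_title == "Manufacturing Director" &
                          job_lvl == 2 & !is.na(annual_comp),
                          select = annual_comp),
                   mean)

# Impute missing comp for the relevant segment
employees$annual_comp[is.na(employees$annual_comp)] <- avg_comp

# Display absolute difference between original and imputed comp
abs(original_comp - subset(employees, employee_id == 2176, select = annual_comp))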

##      annual_comp
## 1176        1169

While this approach demonstrates the mechanics of imputing a missing value on a case-by-case basis, a more scalable solution is needed for data with a large number of missing values across employees with different values of these variables. There are more sophisticated methods of data imputation that involve models to estimate missing values, such as linear regression, which will be introduced in chapter “Linear Regression”. Modeling techniques leverage a similar approach to the method outlined above in that the target values of cases with characteristics similar to those with missing values are used to aid estimation. Multiple imputation builds upon this approach by generating several imputed data sets and combining the results, with the goal of reflecting the uncertainty in the imputed values and minimizing the bias a single imputation could introduce.

Outliers

The treatment of outliers is one of the most enduring and pervasive methodological challenges in organizational research. A literature review by Aguinis et al. (2013) uncovered 14 unique definitions of outliers, 39 outlier identification techniques, and 20 different ways of addressing them. Appropriate methods for defining and addressing outliers are domain-specific, and there are many important considerations that should inform whether and how outliers should be handled.

The water crisis in Flint, Michigan is a tragic example of a series of statistical mishaps involving poor sampling methodology and outlier handling. As the story goes, Flint stopped paying the Detroit Water and Sewer Department to source water from Lake Huron and began sourcing it from the Flint River as a cost-cutting measure in April 2014 (Langkjær-Bain, 2017). Residents of Flint began showing signs of lead poisoning, and authorities denied residents’ claims that their tap water was to blame—despite some extreme cases in which the tap water was colored orange.

Water companies routinely add chemicals to water to prevent pipe corrosion, which can cause lead to seep into drinking water. In Flint’s hurry to switch water sources, they failed to address the fact that the Flint River is naturally high in chloride—a chemical that corrodes pipes. According to the Lead and Copper Rule (LCR) of 1991, lead concentration should not exceed 15 parts per billion (ppb) in more than 10% of homes tested—though no quantity of lead is considered safe to ingest. If the 90th percentile value for sampled homes is greater than 15 ppb, action is required.

Two initial samples of tap water were taken from a concerned resident’s home; one measured 104 ppb (6X higher than the LCR threshold) and the other measured 397 ppb (25X higher than the LCR threshold). Authorities dismissed these samples as outliers, citing old lead pipes in the resident’s home. Authorities collected samples of their own, and despite federal guidelines requiring n ≥ 100 samples, an underpowered analysis was performed using only 71 samples. Of the 71 samples, two with levels above the 15 ppb threshold were discarded, and the removal of these outliers resulted in aggregate lead levels falling beneath the action threshold.

In the end, the tenacity of the growing number of residents with health concerns resulted in new samples being analyzed by a team of researchers at Virginia Tech. Researchers found that the 90th percentile value among the sample of households—which included homes with non-lead pipes and water filtration systems—was 26.8 ppb and the highest individual sample was 158 ppb! The city switched back to the Lake Huron water source in October 2015 (18 months later), and a state of emergency was declared. The State of Michigan has brought numerous criminal charges against state and local officials which include misconduct in office, tampering with evidence, willful neglect of duty, and various counts of conspiracy. Residents also launched a series of class action lawsuits against the Governor (Langkjær-Bain, 2017).

This may seem like a dramatic appeal, but the importance of investigating outliers cannot be overstated. Simply discarding outliers may truly be a grave mistake! If outliers are attributable to measurement error, it may be appropriate to discard them. If outliers represent properly measured values, they should be investigated. As we will discuss further in chapter “Descriptive Statistics”, a common method of outlier detection is identifying values which fall outside the following interval:

$$\displaystyle \begin{aligned} I = [\,Q_1 - 1.5 \times IQR,\ Q_3 + 1.5 \times IQR\,] \end{aligned}$$
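
For instance, a base R sketch of this interval applied to annual compensation (the variable choice is illustrative):

# Quartiles and interquartile range for annual compensation
q1 <- quantile(employees$annual_comp, 0.25, na.rm = TRUE)
q3 <- quantile(employees$annual_comp, 0.75, na.rm = TRUE)
iqr <- IQR(employees$annual_comp, na.rm = TRUE)

# Flag rows falling outside the interval
outliers <- subset(employees,
                   annual_comp < q1 - 1.5 * iqr | annual_comp > q3 + 1.5 * iqr)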

Low Variability

Variables with low variability often do not provide sufficient information for identifying patterns in data. For example, if we are interested in using information on stock options to understand why employees vary in their levels of retention risk but find that the employee stock purchase plan (ESPP) terms are identical for nearly all employees, including a stock option variable in the analysis is unlikely to provide any meaningful signal.

When working with survey data, checking for straightlining should be an early data screening step. Straightlining refers to a constant response across all survey items, which may be evidence that the respondent lost motivation or was not attentive and thoughtful when taking the survey. Since straight-line responses may influence results, it is often best to discard these cases—especially when the sample size is adequately large for the planned analyses without them. If the same response is given for both positively and negatively worded versions of a question (e.g., comparing “I plan to be here in a year” to “I do not plan to be here in a year”), which we expect to be inversely related, this gives added support for discarding these responses.
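
As an illustration, the sketch below flags straight-line respondents; survey_items is an assumed data frame with one row per respondent and one column per survey item:

# TRUE where a respondent gave the same response to every item
straightliner <- apply(survey_items, 1, function(x) length(unique(x)) == 1)

# Number of straight-line respondents
sum(straightliner)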

Fields with low variability can be easily identified using descriptive statistics from the summary() function. If the Min and Max values are equal, there is no variability in the field’s values. Based on the following descriptives, we should remove standard_hrs from the data:

The code below returns descriptives to understand the distribution of standard hours.
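
A sketch of this step using the summary() function on the employees data frame assumed throughout this chapter:

# Return descriptives to understand the distribution of standard hours
summary(employees$standard_hrs)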

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      80      80      80      80      80      80

Given that the data dictionary in chapter “Getting Started” indicates performance ratings range from 1 to 4, the following descriptives should raise red flags:

The code below returns descriptives to understand the distribution of performance ratings.
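
A sketch of this step:

# Return descriptives to understand the distribution of performance ratings
summary(employees$perf_rating)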

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   3.000   3.000   3.154   3.000   4.000

Assuming not everyone in the company is a stellar performer (i.e., only Noteworthy and Exceptional ratings), we may be working with a partial data set that could bias analyses. This may be due to poor integrity of performance data in the source system or repository from which the data were pulled, or the query written to extract data from the source may be flawed.

Inconsistent Categories

Inconsistent categories impact aggregation and trending by categorical dimensions. It is often necessary to create mappings based on logical rules in order to standardize dimension values across time. In the case of reorgs, a department may be disbanded, renamed, or integrated into one or multiple other departments. Therefore, when working with historical data, records may contain legacy department names that do not align with the current organizational taxonomy. Mapping from former to current departments may require logic based on manager ids, divisions, job profiles, or other variables depending on the nature of reorgs over time.

Job architecture projects often introduce the need for mappings as well. Jobs and levels may completely change for all employees with a job architecture revamp, in which case trending along job and level dimensions (e.g., attrition by job or level over multiple years) is only possible with logic that clarifies how legacy jobs and levels map to those in the new career framework.

Changes to allowable values in source systems often result in inconsistent categorical data over time. For example, the education field may switch from a free-form text field in which employees can enter any value (e.g., B.S., B.A., BS, BA, Bachelor of Science, Bachelor of Arts, Bachelor’s, Bachelors, Bachelor’s Degree, Bachelor Degree, undergraduate degree, 4-year degree, four-year degree) to a standardized solution in which there is a clean and well-defined set of allowable values from which employees can choose (e.g., Bachelor’s Degree, Master’s Degree, Doctoral Degree). This warrants either a one-time historical cleanup upon implementing the allowable values or downstream logic to tidy up data for analytics. A best practice is to address data quality issues upstream (e.g., in the source system) to avoid duplicative data cleaning procedures across downstream applications.
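
As a simple illustration, legacy free-form education entries could be mapped to standardized values with pattern matching; the vector and patterns below are assumptions:

# Legacy free-form education entries (illustrative values)
edu_raw <- c("B.S.", "BA", "Bachelor of Science", "Bachelors", "4-year degree",
             "Master's", "MS", "PhD")

# Map legacy entries to a standardized set of allowable values
edu_clean <- ifelse(grepl("bachelor|^b\\.?[as]\\.?$|4-year|four-year|undergrad",
                          edu_raw, ignore.case = TRUE), "Bachelor's Degree",
             ifelse(grepl("master|^m\\.?s\\.?$|^m\\.?a\\.?$", edu_raw,
                          ignore.case = TRUE), "Master's Degree",
             ifelse(grepl("doctor|ph\\.?d", edu_raw, ignore.case = TRUE),
                    "Doctoral Degree", "Other")))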

Data Binning

Data binning refers to the process by which granular values are grouped into larger, higher-level categories. As a general rule, extremely granular categories should be avoided—especially when there is no theoretical basis for such categories facilitating a project’s objectives or deepening insights. Where the n-count is expected to be consistently low for a defined categorical bin, it is usually best to define a larger bin. For example, a variable measuring highest level of educational attainment that contains 9th, 10th, 11th, and 12th grade categories may be converted into higher-level “High School Not Completed” and “High School Completed” bins.

For modeling applications, it is important to let the algorithm determine the cutpoints for numeric data in relation to the outcome. For example, if organization tenure is measured in years, arbitrarily defining bin sizes of “Less Than 1 Year,” “1–5 Years,” and “More Than 5 Years” will likely result in information loss. Any variability within these bins that may be useful in explaining variance in the outcome would be lost with such wide bins. The machine learning (ML) models that will be covered in chapter “Predictive Modeling” are great for algorithmically determining cutpoints for binning numeric data across descriptive, diagnostic, and predictive projects alike.
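
For reference, manual binning can be performed with the cut() function; the sketch below uses the arbitrary tenure bins described above to illustrate the mechanics (and the attendant information loss):

# Bin organization tenure into arbitrary bands (cutpoints are illustrative)
employees$tenure_band <- cut(employees$org_tenure,
                             breaks = c(-Inf, 0, 5, Inf),
                             labels = c("Less Than 1 Year", "1-5 Years",
                                        "More Than 5 Years"))

# Distribution of employees across the new bins
table(employees$tenure_band)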

One-Hot Encoding

One-hot encoding, also known as dummy coding, involves transforming a categorical variable into numeric values on which statistical procedures can be performed. For EDA, this is not required, as counts and percent of total metrics can be calculated on these dimensions for descriptive purposes. However, for modeling applications, unordered categorical variables must be converted into k − 1 variables, where k is the number of categories, using binary (1/0) coding.

Understanding how categorical data are coded is critical to a correct interpretation of output. For example, if a remote work variable exists with “Remote” or “Non-Remote” values, we may code “Remote” values as 1 and “Non-Remote” values as 0. We could then evaluate the statistical relationship of this transformed categorical variable with other numeric variables.

If an unordered categorical variable has more than 2 values, we must create a separate 1/0 field for each value and omit one category for use as a reference group. As we will cover in chapter “Linear Regression”, one of several assumptions in linear regression is that independent variables are not collinear; that is, no independent variable is highly correlated with, or exactly determined by, the others. Without an omitted category, the full set of one-hot encoded fields is perfectly collinear: whenever the field representing category A is 1, the fields for the other categories must be 0, and whenever every other field is 0, the field for category A must be 1. As illustrated in Fig. 4, by omitting a category there will be cases in which all fields have a 0 value (i.e., rows where the value is the omitted category), which breaks this dependency and reduces the strength of the bivariate correlations.

Fig. 4
Two tables illustrating one-hot encoding: the first has columns ID and country; the second has columns ID, Canada, Mexico, and USA. The third row of the first table is labeled as the reference group.

One-hot encoding

For a categorical variable with only two values, the ifelse() function can be leveraged to assign values:

The code below returns the unique values of the gender field with the unique() function.
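
A sketch of this call:

# Return unique values of gender field
unique(employees$gender)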

## [1] "Female" "Male"

The code below one-hot encodes gender and previews the records.
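
A sketch, assuming "Female" is coded as 1 (the mapping is an arbitrary choice for illustration):

# Gender one-hot encoding
employees$gender_ohe <- ifelse(employees$gender == "Female", 1, 0)

# Preview records
head(employees[, c("employee_id", "gender_ohe")])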

##   employee_id gender_ohe
## 1        1001          1
## 2        1002          0
## 3        1003          0
## 4        1004          1
## 5        1005          0
## 6        1006          0

For variables with more than 2 unordered categories, we can leverage the model.matrix() function for one-hot encoding. Let us illustrate by encoding departments. As we can see, Human Resources is the smallest department (n = 63) in these data.

The code below returns counts by department.
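
A sketch of the counts, assuming the dplyr package for its count() function:

# Load library for data wrangling
library(dplyr)

# Return counts by department, sorted descending
count(employees, dept, sort = TRUE)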

##                     dept   n
## 1 Research & Development 961
## 2                  Sales 446
## 3        Human Resources  63

By default, the model.matrix() function will produce a matrix of 1/0 values for k − 1 categories. The first column in the matrix is an intercept column containing a value of 1 for each row to ensure linear independence, and the default behavior results in the first value of the factor being the omitted group. For more flexibility over which value is omitted, we can drop the intercept using -1 in the first argument passed to model.matrix() and then choose the reference group for the analysis in a subsequent step.

The code below one-hot encodes department and previews the data.
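
A sketch of this step; the formula drops the intercept so that a column is returned for each of the k departments:

# Department one-hot encoding without an intercept column
dept_ohe <- model.matrix(~ dept - 1, data = employees)

# Preview data
head(dept_ohe)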

##   deptHuman Resources deptResearch & Development deptSales
## 1                   0                          0         1
## 2                   0                          1         0
## 3                   0                          1         0
## 4                   0                          1         0
## 5                   0                          1         0
## 6                   0                          1         0

We will drop the department with the lowest n rather than the more arbitrary method based on the first value of the factor. Since departments are coded as either 1 or 0, we can use the colSums() function to sum each column and the which.min() function to identify which has the lowest sum (i.e., smallest department by employee count).

The code below drops the department with the lowest sum (i.e., lowest n-count) and previews the refined one-hot encoded data.
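
A sketch of this step, continuing with the dept_ohe matrix created above:

# Drop department with the lowest sum (i.e., lowest n-count)
dept_ohe <- dept_ohe[, -which.min(colSums(dept_ohe))]

# Preview refined one-hot encoded data
head(dept_ohe)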

##   deptResearch & Development deptSales
## 1                          0         1
## 2                          1         0
## 3                          1         0
## 4                          1         0
## 5                          1         0
## 6                          1         0

As expected, the Human Resources department was dropped via the n-count selection criterion. We can now integrate these one-hot encoded fields into the original data frame for analysis.

The code below combines the employees data frame with the matrix containing the one-hot encoded departments and drops the original department field.
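
A sketch of this final step:

# Combine employees and matrix containing one-hot encoded departments
employees <- cbind(employees, dept_ohe)

# Drop original department field
employees$dept <- NULL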

Feature Engineering

Level one people analytics tends to utilize only the delivered fields from the HCM system (e.g., location, job profile, org tenure), but a good next step is to derive smarter variables from these fields. These can then be used to cut data differently or as inputs in models. Below are some examples of how basic data available in the HCM system can be transformed into new variables that provide different information. This can be easily accomplished using the arithmetic functions we have covered; a brief sketch follows the list.

  • Number of jobs per unit of tenure (larger proportions tend to see greater career pathing)

  • Office/remote worker (binary variable dummy coded as 1/0)

  • Local/remote manager (binary variable dummy coded as 1/0)

  • Hire/Rehire (binary variable dummy coded as 1/0)

  • Hired/acquired (proxy for culture shock effects)

  • Gender isolation (ratio of employee’s gender to number of the same within immediate work group)

  • Generation isolation (comparison of age bracket to most frequent generational bracket within immediate work group)

  • Ethnic isolation (ratio of employee’s ethnicity to number of the same within immediate work group)

  • Difference between employee and manager age

  • Percentage change between last two performance appraisal scores (per competency and/or overall)

  • Team and department quit outbreak indicators (ratio of terms over x months relative to average headcount over x months)

  • Industry experience (binary or length in years)
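
As a brief illustration of the list above, a few of these derived variables are sketched below; mgr_age, num_jobs, and work_mode are hypothetical columns that are not part of the employees data used earlier in this chapter.

# Difference between employee and manager age (mgr_age is hypothetical)
employees$mgr_age_diff <- employees$mgr_age - employees$age

# Number of jobs per year of organization tenure (num_jobs is hypothetical)
employees$jobs_per_tenure <- employees$num_jobs / pmax(employees$org_tenure, 1)

# Office/remote worker dummy coded as 1/0 (work_mode is hypothetical)
employees$remote_flag <- ifelse(employees$work_mode == "Remote", 1, 0)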

Review Questions

  1. What are the differences between data lakes, data warehouses, and data marts?

  2. What is the difference between a Type 1 and Type 2 table in a DW?

  3. In what ways has modern cloud computing influenced data architecture?

  4. Why is it dangerous to address missing values without domain knowledge of how the data are generated?

  5. How can missing values be addressed when impacted records cannot be eliminated from a data set?

  6. When is one-hot encoding required for categorical variables?

  7. When one-hot encoding a categorical variable with more than two categories, why is an omitted category important?

  8. When binning numeric data, what are some considerations in determining the size of each bin?

  9. Why should variables with low to no variability be dropped?

  10. Where are validation rules ideally situated to limit downstream data cleaning tasks and ensure consistent categorical dimension values?