This chapter will cover variable types, measurement scales, errors, sampling methods, and scale reliability and validity.

Variable Types

The framing of variables in research hypotheses guides the treatment of each in our analyses. This section explores the function of independent, dependent, control, moderating, and mediating variables as well as the broader classification of these variables as either endogenous or exogenous.

Independent Variables (IV)

An Independent Variable (IV) is a variable which is assumed to have a direct effect on another variable. IVs are sometimes referred to as predictors, factors, features, antecedents, or explanatory variables, and these terms will be used interchangeably throughout this book.

Let us consider a scenario in which we wish to examine whether an inclusion training program given to a random sample of leaders has a positive effect on team-level belonging. In a true experimental design, participation in the inclusion training would be the only difference between the treatment (teams whose leaders participate in the training) and the control (teams whose leaders do not participate in the training). In this case, inclusion training participation is the IV (the variable we are manipulating).

IVs are also present in non-experimental designs. For example, we may survey employees and ask them to rate how inclusive their leader is and also provide self-reported belonging ratings. In this context, leader inclusiveness (rather than an inclusion training) is the IV. If we find that average team-level belonging scores tend to be higher when leader inclusiveness scores are higher, this may indicate that leader inclusion has some influence on team-level belonging. Of course, there could be alternative explanations for any observed differences in team-level belonging, which is why experimental designs tend to provide stronger evidence for an IV’s effect.

Dependent Variables (DV)

A Dependent Variable (DV) is a variable whose values are assumed to depend on, or be influenced by, the IV. DVs are also referred to as outcome, response, or criterion variables.

In our leader inclusion example, team-level belonging is the DV since this variable is assumed to depend on the level of team leaders’ inclusiveness. It is important to note that regardless of a study’s results, it is the positioning of the variables in the study’s hypotheses (rooted in theory) that determines the type of variable. If we hypothesize that leader inclusion training has a positive effect on team-level belonging, but the study finds no such effect, the inclusion intervention is still the IV and team-level belonging the DV.

Control Variables (CV)

A Control Variable (CV) is a variable that is held constant in research. The unchanging state of a CV allows us to understand the extent to which the IV has a unique and independent effect on the DV.

In an experimental context, control variables represent a researcher’s attempt to control for alternative explanations so that the IV’s main effect on the DV can be isolated. In our leader inclusion example, we would seek to determine whether the reason for any observed differences in team-level belonging can be attributed to leader inclusion training rather than other factors that theoretically may explain the differences. In this setting, we should ensure that the two groups are similar with respect to characteristics such as gender and ethnicity, since underrepresented groups (URGs) may have different experiences independent of any training programs. Gender and ethnicity would be CVs in this case.

While we control for the effects of alternative explanations by way of the research design in an experimental context, we will discuss ways to control for these statistically in chapter “Linear Regression” for correlational designs. CVs are equally important in experimental and non-experimental contexts.

Moderating Variables

A Moderating Variable influences the strength of the effect of an IV on a DV. The effect of a moderating variable is often referred to as an interaction effect or, in the context of a model, an interaction term.

Moderating variables may augment (strengthen) or attenuate (weaken) the effect one variable has on another. Interactions are widely understood in the context of medications; one drug may independently have a positive effect on one’s health but when combined with another medication, the interaction can behave very differently—and may even be lethal. In our inclusive leadership example, we may find that the strength of the training’s effect on team-level belonging varies based on a leader’s span of control (SoC). It stands to reason that leaders with a lower SoC may find it easier to cultivate a climate of inclusivity and belonging, while leaders with a higher SoC may have more scope/projects and team dynamics to manage and may find it more difficult to consistently apply strategies covered during the training.
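In R's model formula syntax, a moderation hypothesis is typically expressed as an interaction term. Below is a minimal sketch, assuming a hypothetical data frame df with belonging, training, and soc columns:

# belonging ~ training + soc + training:soc, written compactly with *
mod_model <- lm(belonging ~ training * soc, data = df)
summary(mod_model)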

Interactions between variables are vital to our understanding of nuance and complexity in the dynamics influencing outcomes in organizational settings. Chapter “Linear Regression” will cover how to test whether an interaction is meaningful or observed merely due to chance.

Mediating Variables

A Mediating Variable explains the how or why of an IV’s effect on a DV. It may be helpful to think of mediating variables as a “go-between”—a part of the causal pathway of an effect.

In the case of inclusive leadership training, effects on belonging are likely not the result of the training itself. More likely, the training raised awareness and helped leaders develop strategies to cultivate team inclusivity. One strategy endorsed during the training may be participative decision making, and the implementation of this strategy may explain why the training has a positive effect on team-level belonging. There may be multiple mediators of any observed effect of the intervention on team-level belonging depending on training outcomes.

Variables in these more complex relationship paths are sometimes characterized as having proximal or distal effects. Proximal effects are those which directly influence an outcome, such as the impact of participative decision making on team belonging. Distal effects are upstream effects that indirectly influence an outcome, such as inclusion training participation.

Mediating variables may fully or partially mediate the relationship between an IV and DV. Full mediation indicates that the mediator fully explains the effect; in other words, without the mediator in the model, there is no relationship between an IV and a DV. Partial mediation indicates that the mediator partially explains the effect; that is, there is still a relationship between an IV and DV without the mediator in the model. Partial mediation indicates that there are additional explanations for how or why observed effects exist, such as the implementation of additional team inclusion strategies influencing team belonging. In chapter “Linear Regression”, we will discuss how to test for both full and partial mediation.
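As a preview of that discussion, one common regression-based way to probe mediation is to compare models with and without the mediator. A minimal sketch, assuming a hypothetical data frame df with belonging, training, and pdm (participative decision making) columns:

# total effect of the IV on the DV
m_total  <- lm(belonging ~ training, data = df)
# effect of the IV on the mediator
m_med    <- lm(pdm ~ training, data = df)
# direct effect of the IV on the DV, controlling for the mediator;
# full mediation is suggested when training's coefficient shrinks to near
# zero once pdm is included, and partial mediation when it shrinks but
# remains meaningful
m_direct <- lm(belonging ~ training + pdm, data = df)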

Translating research hypotheses into conceptual models of hypothesized relationships is helpful in visually representing the function of each variable in the study. Figure 1 illustrates how each type of variable is depicted using our inclusive leadership example.

Fig. 1
A relationship model in which inclusion training connects to participative decision making, which further connects to team belonging. Gender and ethnicity also connect to team belonging. Span of control also relates to team belonging.

Conceptual model of hypothesized relationships among variables

Endogenous vs. Exogenous Variables

In the context of regression models, which will be covered in later chapters, variables are sometimes categorized as either endogenous or exogenous. Endogenous and exogenous variables are especially important in econometrics and economic modeling in which analysts seek to understand causal factors influencing outcomes.

An endogenous variable is a dependent variable whose values can be determined or explained based on the other variables in a statistical model. In other words, values of the dependent variable change predictably with values of other variables in the model. By contrast, an exogenous variable is an independent variable on which other variables in the model have no direct or systematic impact.

For example, if we are interested in studying the factors influencing year-to-date (YTD) sales among salespeople, we may consider an independent variable such as years of sales experience. If YTD sales is a function of one’s sales experience, YTD sales is an endogenous variable since its values can be explained—at least in part—by the values of the independent factor (sales experience). Since one’s experience in sales is independent of other variables in the model (i.e., YTD sales does not influence a person’s years of experience), experience is an exogenous variable.
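Expressed as a simple regression in R (a sketch, assuming a hypothetical salespeople data frame):

# ytd_sales is endogenous (explained by variables in the model);
# experience is exogenous (not influenced by other variables in the model)
sales_model <- lm(ytd_sales ~ experience, data = salespeople)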

Measurement Scales

Measurement scales are used to categorize and quantify variables. There are two major categorizations—discrete and continuous—and these, together with the research hypotheses, help determine appropriate types of analyses to perform.

Discrete Variables

Discrete variables are also known as categorical or qualitative variables. Categorical variables have a finite or countable number of values associated with them, and these can be further categorized as either nominal or ordinal.

Nominal

A nominal variable is one with two or more categories for which there is no intrinsic ordering to the categories. Examples of nominal variables include office locations, departments, and teams. A dichotomous variable is a specific type of nominal variable which has only two unordered categories. Examples of dichotomous variables include people leader vs. individual contributor, active vs. inactive status, and remote worker vs. non-remote worker.

Ordinal

An ordinal variable is like a nominal variable with one important difference: ordinal variables have ordered categories. Examples of ordinal variables include education levels, job levels, and survey variables measured on Likert-type scales.
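In R, both nominal and ordinal variables are represented as factors, with ordered = TRUE distinguishing the latter. A minimal sketch with hypothetical values:

# nominal: unordered categories
dept <- factor(c("Sales", "HR", "Engineering"))
# dichotomous: a nominal variable with exactly two categories
remote <- factor(c("Remote", "Non-Remote"))
# ordinal: ordered categories
job_level <- factor(c("Analyst", "Manager", "Director"),
                    levels = c("Analyst", "Manager", "Director"),
                    ordered = TRUE)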

Continuous Variables

Continuous variables are also known as quantitative variables. Continuous variables can assume any real value in some interval, and these can be further categorized as either interval or ratio variables.

Interval

Variables measured on an interval scale have a natural order and a quantifiable difference between values but no absolute zero value. Examples include SAT scores, IQ scores, and temperature measured in Fahrenheit or Celsius (but not Kelvin). In these examples, 0 is either not an option (i.e., SAT and IQ) or does not represent the absence of something (e.g., 0 degrees is a temperature).

Ratio

Variables measured on a ratio scale have the same properties as data measured on an interval scale with one important difference: ratio data have an absolute zero value. Examples include compensation, revenue, and sales; a zero in these contexts is possible and would indicate a true absence of something.

Sampling Methods

The goal of research is to understand a population based on data from a subset of population members. In practice, it is often not feasible to collect data from every member of a population, so we instead calculate sample statistics to estimate population parameters.

While the population represents the entire group of interest, the sampling frame represents the subset of the population to which the researcher has access. In an ideal setting, the population and sampling frame are the same, but they are often different in practice. For example, a professor may be interested in understanding student sentiment about a new school policy but only has access to collect data from students in the courses she teaches. In this case, the entire student body is the population but the students she has access to (those in the courses she teaches) represent the sampling frame. The sample is the subset of the sampling frame that ultimately participates in the research (e.g., those who complete a survey or participate in a focus group).

Sampling methods are categorized as either probability or non-probability, and this section will cover the types and implementations within each.

Probability Sampling

Probability sampling can help us gain insight into the probable. Probability sampling is intended to facilitate inferences since data collected through random selection methods are more likely to be representative of the population.

It is important to understand the centrality of randomness in probability sampling. Randomization protects against subjective biases, self-validates the data, and is the key ingredient that defines the representative means of extracting information (Kahneman, 2011). Sample data that are not representative of the population of interest can lead to anomalies—mere coincidences. While non-random data can be leveraged for directionally correct insights, randomness is required to make inferences about a broader population with a reasonable degree of confidence.

Let us consider an example from Kahneman (2011) in which six babies are born in sequence at a hospital. The gender of these babies is of course random and independent; the gender of one does not influence the gender of another. Consider the three possible sequences of girls (G) and boys (B) below:

  • BBBGGG

  • GGGGGG

  • BGBBGB

Though it may initially be counter-intuitive, since the events are independent and the outcomes (B and G) are (approximately) equally likely, any possible sequence of births is as likely as any other.
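We can compute the probability of any one specific sequence in R; since the six births are independent and each outcome is (approximately) equally likely, every sequence has probability (1/2)^6:

# probability of any one specific sequence of six births
0.5^6

## [1] 0.015625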

Sample size (colloquially referred to as the n-count) is another important factor in sampling as this can have a material influence on the representativeness of sample data—and consequently, the veracity of results and conclusions based on them. As we will explore further in chapter “Statistical Inference”, as the sample size increases, so too does our confidence that estimates based on sample data reflect population parameters.

To illustrate the effects of sample size, let us consider a hypothetical study in which the promotion rate in an organization is found to be lowest in divisions that are composed primarily of software engineers, low in diversity, small, and geographically dispersed. Which of these characteristics might offer an explanation? Let us assume that this study also found that the divisions with the highest promotion rates have identical characteristics: composed primarily of software engineers, low in diversity, small, and geographically dispersed. The key piece of information here is that the divisions are small.

Small samples yield extreme results more often than large samples. Small samples neither cause nor prevent outcomes; they merely allow the incidence of the outcome to be much higher (or much lower) than it is in the larger population (Kahneman, 2011).

Simple Random Sampling

Simple random sampling is a method in which each member of the population has the same probability of being selected for a sample. An example of simple random sampling is randomly selecting a specified number (or percent) of employees from the workforce to participate in a survey without regard for tenure, department, level, or other characteristics.

We can use the sample() function in R to randomly select from a vector of elements:

A snippet of an R program with commands to load library, load data, set seed for reproducible results, and sample 10 employees randomly.
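A minimal sketch of these steps, assuming a data frame named employees with an employee_id column, a hypothetical file path, and an arbitrary seed:

library(dplyr)                           # load library
employees <- read.csv("employees.csv")   # hypothetical file path
set.seed(1234)                           # arbitrary seed for reproducible results
# sample 10 employees randomly, without replacement
sample(employees$employee_id, size = 10, replace = F)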

##  [1] 2308 2018 2125 2004 1623 1905 1645 1934 1400 1900

The replace = F argument in the sample() function indicates that we want to sample without replacement (i.e., we do not want an employee to be selected twice in the sample). If we draw multiple names from a hat without replacement, we will not put names back into the hat once they are drawn; each has a chance of being selected only once.

While sampling with replacement would not make sense for applications such as pulse survey participation, as we would not want a given employee to take a survey multiple times, replace = T can be applied if the application requires it (e.g., a lottery in which employees can win more than once).

Stratified Random Sampling

Stratified random sampling is a sampling method in which the population is first divided into strata. Then, a simple random sample is taken from each stratum—a homogeneous subset of the population with similar characteristics with regard to the variable of interest. The combined results constitute the sample.

To ensure a sample does not overrepresent employees from a particular department, education level, tenure band, generational cohort, or another group deemed useful in explaining differences in response scores, researchers can randomly select members from each stratum in proportion to that stratum's share of the larger population. For example, if 30% of all employees are in the Engineering department, the researcher could randomly select enough employees from the Engineering department that they represent 30% of the sample.

Let us demonstrate stratified random sampling in R by sampling half (1/2) of the employees within each department. First, we can review counts of employees for each department using the group_by() and summarise() functions from the dplyr library.

A snippet of an R program with commands to load library and return employee counts by department.
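A minimal sketch of this step, assuming the same hypothetical employees data frame with a dept column:

library(dplyr)
# return employee counts by department
employees %>%
  group_by(dept) %>%
  summarise(n = n())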

## # A tibble: 3 x 2
##   dept                       n
##   <chr>                  <int>
## 1 Human Resources           63
## 2 Research & Development   961
## 3 Sales                    446

Next, we will randomly select half (1/2) of the employees within each department using the group_by() and sample_frac() functions from the dplyr library. We will store the selected employees’ records in a data frame and then query it to validate that the counts are roughly half the total count observed for each department.

A snippet of an R program with commands to obtain and store stratified random samples and return sample counts by the department.
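A minimal sketch of these steps; when preceded by group_by(), sample_frac() draws the specified fraction within each group:

set.seed(1234)                           # arbitrary seed
# randomly select half of the employees within each department
strat_sample <- employees %>%
  group_by(dept) %>%
  sample_frac(size = 1/2)

# validate sample counts by department
strat_sample %>%
  group_by(dept) %>%
  summarise(n = n())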

## # A tibble: 3 x 2
##   dept                       n
##   <chr>                  <int>
## 1 Human Resources           32
## 2 Research & Development   480
## 3 Sales                    223

Cluster Sampling

Cluster sampling is a sampling method often used in market research in which the population is first divided into clusters. Then, a simple random sample of clusters is taken. All the members of the selected clusters together constitute the sample. Unlike stratified random sampling, it is the clusters that are selected at random—not the individuals. It is assumed that each cluster by itself is representative of the population (i.e., each cluster is heterogeneous).

Employees may be partitioned into clusters based only on their geographic region, for example. Since there is not further partitioning on other variables, each cluster is expected to be heterogeneous on the basis of variables other than geographic region—unless geography is related to other variables (e.g., call center employees are all located in a company’s Pacific Northwest region). By selecting a random set of clusters, the combination of employees across the selected clusters is expected to be representative of the population.

Let us demonstrate how to implement cluster sampling in R:

A snippet of an R program with commands to randomly assign each employee to 1 of 10 clusters, randomly select 5 clusters, store cluster samples, and display dimensions of the cluster sample object.
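A minimal sketch of one way to implement these steps, assuming the hypothetical employees data frame:

set.seed(1234)                           # arbitrary seed
# randomly assign each employee to 1 of 10 clusters
employees$cluster <- sample(1:10, size = nrow(employees), replace = TRUE)
# randomly select 5 of the 10 clusters
selected_clusters <- sample(1:10, size = 5)
# store records for employees in the selected clusters
cluster_sample <- subset(employees, cluster %in% selected_clusters)
# display dimensions of the cluster sample object
dim(cluster_sample)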

## [1] 748  37

Systematic Sampling

Systematic sampling involves selecting sample members from a population according to a random starting point and a fixed, periodic interval known as a sampling interval. The sampling interval is computed by taking the population size and dividing it by the desired sample size. The resulting number is the interval at which population members are selected for the sample.

For example, if there are 10,000 employees and our desired sample size is 500, the sampling interval is 20. Therefore, we would select every 20th employee for our sample. It is important that the sequence does not represent a standardized pattern that would bias the data; this process needs to be random. For example, if the employee id generated by the HCM system increases with time, we would expect employees with longer tenure to have lower employee ids while new joiners would have higher employee ids. Ordering employees by employee id prior to selection could bias the sample on the basis of variables related to tenure (e.g., aggressive periods of hiring in a particular department).

Let us walk through the step-by-step process for implementing the systematic sampling procedure in R:

A snippet of an R program with commands to specify desired sample size, determine population size, compute sampling interval, randomly select a value between 1 and sampling interval, increment starting value by the sampling interval, and store systematic sample.
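A minimal sketch of these steps, assuming the hypothetical employees data frame:

# specify desired sample size
n <- 500
# determine population size
N <- nrow(employees)
# compute sampling interval
k <- floor(N / n)
# randomly select a starting value between 1 and the sampling interval
set.seed(1234)                           # arbitrary seed
start <- sample(1:k, size = 1)
# increment the starting value by the sampling interval and store the sample
sys_sample <- employees[seq(from = start, to = N, by = k), ]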

Non-Probability Sampling

Non-probability sampling can help us gain insight into the possible. The main difference between non-probability and probability sampling is that non-probability sampling does not involve random selection and probability sampling does. Therefore, we cannot make inferences based on data collected through non-probability sampling methods since the sample is unlikely to be representative of the population.

Convenience (Accidental) Sampling

Convenience sampling is the most common type of nonprobabilistic sampling. This sampling method involves sampling people who happen to be conveniently accessible around a specific location (physical or virtual).

If we were to study employee sentiment about new benefit plans by polling employees walking through the lobby of a particular office building one morning, this would represent convenience sampling. Aside from the risk of employees sharing socially desirable responses in such a setting and invalidating the results, a major shortcoming of this approach is that we are only capturing the sentiment of those who happen to walk into one particular building during one limited window of time. This would not capture the sentiment of those working remotely, working in another office location, on PTO, taking a sick day, attending an offsite conference or meeting, or stuck in traffic and running late.

Quota Sampling

Quota sampling is a nonprobabilistic sampling method in which researchers assign quotas to a group of people to create subgroups of individuals that reflect the characteristics of the population. This is nonprobabilistic since researchers choose the sample rather than randomly selecting it.

If the characteristics of the employee population are known, the researcher polling employees in the office lobby about benefit plans could collect some additional information (e.g., department, job, tenure) to achieve a commensurate proportion of each in the sample. If 30% of all employees are in the Engineering department, the researcher could assign a quota—such as 3 in every 10 participants—to choose a sample in which 30% of employees come from the Engineering department.

Purposive (Judgmental) Sampling

The main goal of purposive sampling is to construct a sample by focusing on characteristics of a population that are of interest to the researcher. Purposive sampling is often used in qualitative or mixed methods research contexts in which a smaller sample is sufficient. Since it is a nonprobabilistic sampling method, purposive sampling is highly prone to researcher bias.

For example, the People Team may be interested in understanding what is top-of-mind for employees in order to design a survey with relevant items. The team may choose people to participate in focus groups to surface qualitative themes—not for the purpose of generalizing findings but to guide survey item selection efforts.

Sampling and Nonsampling Error

Sampling and nonsampling errors are general categorizations of biases and error in research (Albright and Winston, 2016). It is important to understand these and proactively mitigate the risks they present to research integrity.

Sampling Error

Sampling error is the inevitable result of basing inferences on a random sample rather than the entire population. The two main contributors to sampling error are the size of the sample and variation in the underlying population. The risk of sampling error decreases as the sample size approaches the population size; however, it is usually not feasible to gain information from the entire population, so sampling error is generally a concern.

Selection Bias

Selection bias is the bias introduced by a non-random method of selecting data for analysis, which can systematically skew results in a particular direction. Selection bias may result in observed relationships or differences that are not due to true relationships or differences in the underlying populations, but to the way in which participants or data were selected for the research.

A type of selection bias that is especially important to consider in people analytics is survival bias: the logical error of focusing only on people who made it past some selection process while overlooking those who did not. For example, to gain an accurate understanding of the number of employees who survive to each tenure milestone (e.g., 1 year, 2 years, 3 years), we need to study both active and inactive people to avoid biased results. We may find a significant drop in the percent of active employees who survive from 3 to 4 years, for example, but without information on inactive employees we do not know if this is a function of hiring (i.e., relatively few hired more than 3 years ago) or due to a spike in attrition beyond 3 years of tenure.

The work of a mathematician named Abraham Wald during World War II is a classic example of survival bias. Wald was a member of the Statistical Research Group (SRG) at Columbia University that examined damage to returning aircraft (Wald, 1943). Rather than focusing on the areas with damage, Wald recommended a different way of looking at the data, suggesting that the reason certain parts of planes were not covered in bullet holes was because planes that were shot in these areas did not return. In other words, locations with bullet holes represented locations that could sustain damage and still return home. This insight led to armor being reinforced in areas with no bullet holes (Fig. 2).

Fig. 2
The top and side view of an aircraft with the bullet impact sites marked with dots. The bullet marks are concentrated on the wings and fuselage. A few dots are on the tail portion.

Hypothetical data for damaged portions of returning WWII planes. Image courtesy of Cameron Moll (2005)

Missing data can sometimes be more valuable than the data we have, and it is critical to promote a representative data-generating process to prevent biased selection and results.

Nonsampling Error

There are many types of nonsampling error that can invalidate results beyond the sampling procedure, and we will focus on several that are particularly germane to people analytics: nonresponse bias, nontruthful responses, measurement error, and voluntary response bias.

Nonresponse Bias

As discussed in the context of sampling error, we usually do not have access to information on entire populations of interest, so we must consider the possibility that those for whom we are missing data may have common qualities, perceptions, or opinions that differ from those for whom we do have data. This is known as nonresponse bias.

Surveys are a staple in the set of data sources leveraged for people analytics. While survey data provide unique attitudinal and perceptive signals that can help explain future behavior and events, surveys tend to be far more susceptible to nonsampling error than other data sources. If we administer an employee experience survey to the entire organization and receive a 60% response rate, the reality is that we do not know how the 40% of nonrespondents would have responded. It is possible that nonrespondents represent highly disengaged employees, in which case their responses may have materially influenced results and conclusions in an unfavorable direction. It is also possible that nonrespondents did not participate because they were busy, away on vacation, cynical to the confidentiality language in the communications, or any number of other reasons which may or may not have resulted in significantly different feedback relative to respondents.

As an aside, nonrespondents can actually function as an important variable in analyses. In some contexts, nonrespondents may be at greater risk of voluntary attrition than those who respond unfavorably on a stay intentions survey item such as, “I plan to be working here in one year.” Evaluating response rate distributions across departments, divisions, roles, and other dimensions may be an insightful endeavor.

Nonresponse bias is not limited to surveys. For example, self-reported demographics such as gender and ethnicity may not be disclosed by all employees in the HCM system. This can bias segmentations based on these categorical dimensions. While there are strategies to address this, such as visual ID or applying models trained to infer missing values (which may be necessary to fulfill EEOC reporting requirements), there may still be error in the imputed values.

Nontruthful Responses

While high response rates may reduce nonresponse bias, this is not always something to celebrate. Organizations that incentivize participation in surveys often do so at the risk of people responding in socially desirable ways and providing nontruthful responses to achieve some defined target. If an employee has an unhealthy relationship with his manager and does not trust that managers will not have access to individual-level responses, the employee may decide to indicate on the survey that everything is highly favorable to help the team win the month of casual days leadership promised. This can skew and invalidate results.

While survey participation should be strongly encouraged since higher response rates can mitigate the risk of certain types of bias and increase confidence that the survey is representative of the collective organization’s sentiments, incentivizing participation can be dangerous.

Measurement Error

Measurement error relates to errors stemming from confusing questions, survey fatigue, and low-quality scales used to measure multidimensional psychological constructs. The field of psychometrics is a vast scientific discipline concerned with the development of assessment tools, measurement instruments, and formalized models to understand latent psychological constructs such as engagement, belonging, purpose, and wellbeing using observable indicators.

The measurement method can affect observed data either by changing the underlying construct of interest or by distorting the measurement process without impacting the construct itself (Spector, 2006). Common method variance (CMV), also known as monomethod bias, refers to the widely held belief that relationships between variables measured using the same method are inflated. The idea that the measurement method itself introduces a degree of variance in measures has been cited in the organizational literature for decades, and it is raised almost exclusively when cross-sectional, self-reported surveys are utilized. Though the magnitude of CMV remains debated, there is general consensus that, where possible, it is preferable to leverage multiple measurement methods.

My doctoral dissertation research explored how implicit voice theories, which are deep-seated beliefs about the risks involved in speaking up to those higher in the organizational hierarchy (e.g., negative career consequences), influenced the extent to which individual contributors actually speak up in prosocial ways to their leaders (Starbuck, 2016). Since the individual contributors are best placed to provide information on the implicit beliefs they maintain, the IV was measured using cross-sectional self-reports. At the same time, I surveyed the immediate supervisor for each individual contributor and asked them to rate each of their direct reports using a leader-directed voice scale; these supervisor-reports of leader-directed voice were used as the DV in this study. To investigate CMV, which was a tangential interest to the primary research objective, I also included the leader-directed voice scale (using self-reported language) on the survey administered to individual contributors.

For the 1032 employees from whom I collected data in an investment firm context (individual contributors: n = 696; supervisors: n = 336), I was surprised to find only a weak correlation between self-reported voice and supervisor-reported voice (r = .26, p < .01). On average, self-reported voice was higher with less variation (x̄ = 5.91, s = 1.15) relative to supervisor-reported voice (x̄ = 5.69, s = 1.34). Interestingly, there was support for almost none of the hypothesized relationships when supervisor-reported voice was positioned as the DV, though most were supported when self-reported voice was substituted as the DV in post-hoc analyses.

Given the prevalence of monomethod self-reports in the social sciences, the influence of CMV is an important consideration.

Scale Reliability and Validity

While an exhaustive treatment of psychometrics is beyond the scope of this book, this section reviews reliability and validity, two broad sets of methods designed to increase the robustness of psychological instrumentation.

It may be helpful to consider a weight scale to understand differences between reliability and validity. A weight scale is designed to provide an accurate measurement of one’s weight, and we expect measurements to be consistently accurate over time. If a 150 lb. person steps onto a weight scale and receives a reading of 180 lbs., the scale is not valid as the person actually weighs 30 lbs. less than the reading. If the person steps onto the scale a second time moments later and receives a reading of 200 pounds, the scale is not reliable either (inconsistent measurements).

Figure 3 visually depicts differences between reliability and validity. As researchers, it is critical to measure what we intend to measure (validity) and do it with consistency (reliability). Survey items with poor psychometric properties can lead to invalid conclusions due to measurement error. Even slight adjustments to validated instrumentation—such as changing the number of scale anchors (e.g., increasing from a 5-point to 7-point Likert scale), tweaking item language, or modifying which items are included in a composite scale—generally warrant reliability and validity analyses.

Fig. 3
Three RAF roundel-style targets. "Reliable, not valid" has marks clustered on the outermost ring. "Not reliable, not valid" has a few marks spaced out across all parts. "Reliable and valid" has all marks concentrated within the center portion.

Visual depiction of reliability and validity

Reliability

Reliability describes the quality of measurement (i.e., the consistency or repeatability of measures). Types of reliability include:

  • Inter-Rater or Inter-Observer Reliability: the degree to which different raters/observers give consistent estimates of the same phenomenon.

  • Test-Retest Reliability: the consistency of a measure from one time to another.

  • Parallel-Forms Reliability: the consistency of the results of two tests constructed in the same way from the same content domain.

  • Internal Consistency Reliability: the consistency of results across items within a test (commonly assessed with Cronbach's alpha, as shown in the sketch below).
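Internal consistency is commonly quantified with Cronbach's alpha. A minimal sketch using the psych package, assuming a hypothetical data frame survey_items in which each column holds responses to one item of the same scale:

library(psych)
# estimate Cronbach's alpha (and related reliability statistics)
alpha(survey_items)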

Validity

Validity describes how well a concept was translated into a functioning and operating reality (operationalization). There are four main types of validity: (a) face validity, (b) content validity, (c) construct validity, and (d) criterion-related validity.

Face validity

Face validity is an assessment of how valid a measure appears on the surface. In other words, face validity represents whether the measurement approach on its face is a good translation of the construct. This is the least scientific method of validity and should never be accepted on its own merits.

Content Validity

Content validity is a somewhat subjective assessment of whether a measure covers the full content domain. Content validity relies on people’s perceptions to measure constructs that would otherwise be difficult to measure.

For example, a panel of experts may gather to discuss the various dimensions of a theoretical construct. The psychometrician may then use this information to develop survey items that tap these dimensions to achieve a comprehensive measure of the construct.

Construct Validity

In social science, constructs are often measured using a collection of related indicators that, together, cover the various dimensions of the theoretical idea. Constructs may manifest in a set of behaviors, which provide evidence for their existence. Construct validity represents the degree to which a collection of indicators and behaviors—the operationalization of the concept—truly represents theoretical constructs.

Psychological safety, a belief that a context is safe for interpersonal risk-taking (Edmondson, 1999), has no direct measure. However, there are indicators and behaviors that are helpful in understanding the extent to which an environment is psychologically safe. We may ask employees whether they are able to bring up problems to decision makers or whether it is safe to take risks on their team. Based on the theoretical conception of psychological safety, these would be helpful (though not collectively exhaustive) indicators of the construct in an organizational setting.

Construct validity can be partitioned into two types:

  • Convergent validity: the degree to which the operationalization is similar to (converges on) other operationalizations to which it theoretically should be similar.

  • Discriminant validity: the degree to which the operationalization is not similar to (diverges from) other operationalizations to which it theoretically should not be similar.

A nomological network is central to providing evidence for a measure’s construct validity. The nomological network is an idea developed by Cronbach and Meehl (1955) to represent the constructs of interest, their observable manifestations, as well as the interrelationships among them. The term “nomological” is derived from the Greek nomos, meaning “law”; therefore, the nomological network aims to make clear what a construct means so that laws (nomologicals) can be applied. Simply put, the nomological network attempts to link the conceptual and theoretical realm to the observable one to provide a practical methodology for assessing a measure’s construct validity.

If psychological safety theory suggests the construct should be positively associated with leader openness and negatively related to employee withholding (silence), we can use validated measures of openness and withholding to test for these theoretical relationships with psychological safety and substantiate construct validity.

Criterion-Related Validity

Criterion-related validity, sometimes referred to as instrumental validity, describes how well scores from one measure are adequate estimates of performance on an outcome measure (or criterion).

There are two types of criterion-related validity:

  • Predictive validity: the operationalization’s ability to predict something it should theoretically be able to predict.

  • Concurrent validity: the operationalization’s ability to distinguish between groups that it should theoretically be able to distinguish between.

If psychological safety should positively influence employee voice, there would be support for predictive validity if we find that employees who report more favorable perceptions of psychological safety are more willing to speak up. If we administer a new scale to measure psychological safety and at the same time (concurrently) administer an existing, validated measure of the same construct, highly correlated results would lend support for the new measure’s concurrent validity.
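A minimal sketch of that concurrent validity check, assuming a hypothetical data frame df containing scores on the new and established psychological safety measures:

# a strong positive correlation between the new scale and the validated
# scale, administered concurrently, supports concurrent validity
cor.test(df$psych_safety_new, df$psych_safety_established)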

For detailed instruction on the survey scale development process, see DeVellis (2012).

Review Questions

  1. What are the differences between parameters and statistics?

  2. How does a sampling frame differ from a sample?

  3. How does cluster sampling differ from stratified random sampling?

  4. What is the primary benefit of probabilistic sampling methods over nonprobabilistic sampling?

  5. Is nonresponse bias only applicable in the context of surveys?

  6. What type of variable influences the strength of the effect one variable has on another?

  7. 100 randomly selected employees in the Marketing department of an organization participated in a survey on career pathing for marketing professionals. What is the sample and what is the population in this case?

  8. How does the meaning of zero differ between interval and ratio-scaled variables?

  9. Can discrete variables have more than 2 values?

  10. What are some examples of nonprobabilistic sampling methods?