Introduction

Understanding an occupation (i.e., a person’s regular work or profession) is relevant to research questions across many disciplines, including psychology, neuroscience, economics, political science, education, sociology, health care, and ageing. These questions often require retrospective analyses relating occupational practice to other measures, such as health status, cognitive abilities, and sex or gender differences. By contrast, vocational psychologists use prospective analyses of personality traits, abilities, and interests to predict placement or select candidates (e.g., Hartman & Betz, 2007; Holland, 1997; Larson et al., 2002; Ralston et al., 2004), a procedure that has been criticized for its low empirical support (Fouad & Kozlowski, 2019; Savickas, 2001).

Tools that can convert occupations into quantified dimensions of mental traits and skills, as well as contextual factors (e.g., sex/gender and economic access), are required to test hypotheses regarding characteristics of individuals whose occupation is known. Researchers across disciplines have strived to develop an “occupational space” that can be linked to various research questions. For instance, cognitive neuropsychologists have identified key occupational attributes that relate to the location of atrophy in frontotemporal lobar degeneration in order to understand how long-term engagement in occupations is associated with individual differences in the emergence of neurodegeneration (e.g., Spreng et al., 2010). Meanwhile, cognitive neuroscientists have assessed how occupations are associated with structural brain health (e.g., Habeck et al., 2019), and gerontologists have identified specific physical traits of occupations that could predict healthy ageing (e.g., Burzynska et al., 2019). While there is some overlap in the cognitive factors identified by these different research teams (e.g., a factor relating to occupational complexity), the methods used to identify occupational factors differ across teams and disciplines, which makes the related findings challenging to compare.

There are several methodological barriers to developing continuous measures of occupation characteristics. Occupation is a categorical variable with a large number of titles (i.e., modalities), including multiple titles describing the same occupation (e.g., medical doctor vs. physician). Because occupations can be described with varying degrees of specificity (e.g., physician vs. cardiologist), each occupational title exists within a complex hierarchical structure. Additionally, each occupation requires a different mix of knowledge, qualifications, and abilities, and this makes it challenging to categorize occupations into subgroups.

The Dictionary of Occupational Titles (DOT), first created in 1930 by the United States Department of Labor (DOT: United States Department of Labor, 2006), is the earliest and best-known occupation taxonomic system. In 1995, the DOT was replaced by the Occupational Information Network (O*NET) database (US Department of Labor, 2019a), reducing the DOT’s 28,800 occupations to 966 occupations grouped using a conceptual framework called the Content Model (see Peterson et al., 2001). O*NET’s sampling methodology and occupational assignment improved upon the DOT, enhancing its application in the social sciences (Handel, 2016; Peterson et al., 2001). The O*NET Content Model identifies the most important types of information about occupations and integrates this information into a theoretical, empirically validated model. It includes both worker-oriented and job-oriented traits, and it describes and categorizes these distinguishing characteristics of occupations through a set of standardized, measurable variables that represent key features of occupations. Worker-oriented variables, including the specific knowledge, skills, and abilities associated with particular occupations, have long been used to explore how particular traits relate to different aspects of occupation at the individual level (e.g., Burrus et al., 2013).

Both DOT and O*NET data have also been subjected to dimension reduction techniques in order to extract components accounting for fundamental occupational traits, such as occupational complexity, people versus things, and physical demands (Clark, 2002; Hadden et al., 2004; Hanson et al., 1999; Levine, 2003; Shu et al., 1996). Similarly, clustering approaches have been applied to reduce the complexity of O*NET data (e.g., Nolan et al., 2011; Slaper, 2014). Although several studies have reduced O*NET data into accessible, sharable, and actionable data (Indiana Department of Workforce Development Research and Analysis Division & Indiana Business Research Center, 2011), these analyses have been tailored to select samples or occupational categories and are not flexible or modifiable to suit different research questions. A standardized, accessible, and flexible system for quantification of occupation characteristics is required to scale the application of the rich O*NET database across research domains.

We created a method for the derivation of quantified dimensional scores characterizing O*NET occupations for use by the research community. We applied principal component analysis (PCA; Abdi & Williams, 2010; Eckart & Young, 1936; Hotelling, 1933) to the ratings of knowledge, skills, and abilities associated with each occupation from O*NET to identify dimensions (also called components) that capture patterns in the data. We then used these components as the input to hierarchical cluster analyses (HCA; Bridges, 1966) to identify categorical clusters of occupations and their associated traits as indicated by the ratings. Visualizations of the occupational space were labeled by these clusters as well as by the overall required level of preparation (i.e., ‘Job Zone’, which quantifies the required level of education, related experience, and on-the-job training; see https://www.onetonline.org/help/online/zones). Follow-up PCAs with HCAs within different Job Zones were conducted to identify finer-grained occupational components unaccounted for by education and socioeconomic status. We also provide an illustrative example in which we use Visualization of Latent Components Assessed in O*Net Occupations (VOLCANO) to understand the link between cognition and occupation in a sample of healthy adults. All analyses, data, and clustering are available through the Open Science Framework (OSF), GitHub (https://github.com/juchiyu/OccupationPCAs), and an original Shiny app (i.e., an R-based interactive web app; the code and link are included in the same GitHub repository).

Method

Data collection and extraction

The data included in this study were obtained from a public database available from the United States Department of Labor Occupational Information Network (O*NET: US Department of Labor, 2019a). O*NET is a Standard Occupational Classification-based system that organizes work into 966 occupations and associated traits as determined from surveys of workers (US Department of Labor, 2019c).

In our analyses, each occupation is treated as an “observation” described by scores on different measures. For each occupation within the O*NET database, we extracted 120 trait variables (e.g., trunk strength ability, physics knowledge, and clerical skills) that were described with Likert-scaled importance ratings ranging from “not important” (1) to “extremely important” (5) concerning abilities (52 variables), knowledge (33 variables), and skills (35 variables). The current study did not include job-oriented traits (US Department of Labor, 2019b) because its key aim is to uncover the association between individual abilities (e.g., cognitive, behavioral, neural) and worker-level occupational attributes. These data (966 occupations described by 120 traits) were used as input for all subsequent analyses (see Table 1 for examples).

Table 1 Sample data of occupations and associated traits
Table 2 Descriptions of occupation cluster labels

Analyses

A general occupation PCA was applied to the 966 occupations described by 120 rated traits. The first component of this general occupation PCA was dominated by Job Zone level (i.e., a five-level grouping variable that reflects the preparation required; specifically, the amount of education, related experience, and on-the-job training needed to do the work). To identify components that explain variance over and above the effect of Job Zone, we conducted follow-up PCAs restricted to Job Zones 4 and 5 (391 occupations requiring extensive preparation; the Job Zones 4–5 PCA) and to Job Zones 1–3 (575 occupations requiring relatively less preparation; the Job Zones 1–3 PCA). Data were analyzed using R, version 4.1.1 (R Core Team, 2021), with packages ExPosition, version 2.8.23 (Beaton et al., 2014), stats, version 4.1.1 (R Core Team, 2021), ggplot2, version 3.3.5 (Wickham, 2016), ggdendro, version 0.1.22 (de Vries & Ripley, 2020), dendextend, version 1.15.1 (Galili, 2015), and tidyverse, version 1.3.1 (Wickham et al., 2019).

Principal component analysis (PCA)

We performed PCA on the preprocessed data, in which each trait was mean-centered across occupations (i.e., the mean of each trait was set to 0), using the ExPosition R package (Beaton et al., 2014). Scaling (a.k.a. normalization, such as using Z-scores) was not used because all traits were rated on the same scale (see Footnote 1). PCA creates a set of orthogonal (i.e., uncorrelated) variables called principal components (Abdi & Williams, 2010; Hotelling, 1933; Pearson, 1901). The scores of each principal component are called component scores and are obtained as a linear combination of the original variables (here, the traits) with coefficients, called loadings, that indicate the importance of each original variable in the combination. For each component, the amount of explained variance in the data is measured by the variance of its component scores, which is called the eigenvalue of this component. In PCA, the components are ordered (from the largest to the smallest) according to their eigenvalues.
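The centering-only PCA described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R/ExPosition code: the function name `pca_centered` and the toy table are hypothetical, and eigenvalues are computed here as the sample variance of the component scores (some software reports sums of squares instead, which leaves the proportions of explained variance unchanged).

```python
import numpy as np

def pca_centered(table):
    """PCA of a rows-by-columns table after column mean-centering (no scaling)."""
    X = np.asarray(table, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each trait across occupations
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # component scores, one row per occupation
    loadings = Vt.T                               # unit-norm loadings, one row per trait
    eigenvalues = s ** 2 / (X.shape[0] - 1)       # variance of each column of scores
    explained = eigenvalues / eigenvalues.sum()   # proportion of variance per component
    return scores, loadings, eigenvalues, explained

# Hypothetical toy table: 8 "occupations" rated on 4 "traits" (values 1 to 5).
rng = np.random.default_rng(0)
toy = rng.integers(1, 6, size=(8, 4))
scores, loadings, eigenvalues, explained = pca_centered(toy)
```

Because the singular values come back sorted, the components are already ordered from the largest to the smallest eigenvalue.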

In PCA, a component is reliable when it explains a significant amount of variance. This significance was evaluated by a permutation test obtained by (1) randomly permuting the data within each variable, (2) computing a PCA on the permuted data table, (3) repeating this process many (i.e., 1000) times, and (4) generating the probability distribution of each eigenvalue from the PCAs of these permuted datasets. From each distribution, the proportion of the permuted eigenvalues that are larger than the observed eigenvalue gives the probability associated with (i.e., the p value of) this eigenvalue (Abdi & Williams, 2010; Buja & Eyuboglu, 1992; Reddon, 1984).
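The four permutation steps listed above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' implementation: the helper names, the toy data, and the use of 200 rather than 1000 permutations (to keep the example fast) are all choices made for this sketch.

```python
import numpy as np

def eigenvalues_centered(X):
    """Eigenvalues of a centering-only PCA (variance of the component scores)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s ** 2 / (X.shape[0] - 1)

def permutation_pvalues(table, n_perm=1000, seed=0):
    """p value of each eigenvalue from independent within-column permutations."""
    rng = np.random.default_rng(seed)
    X = np.asarray(table, dtype=float)
    observed = eigenvalues_centered(X)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # (1) shuffle the rows of each variable independently
        permuted = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        # (2)-(4) recompute the PCA and count permuted eigenvalues above the observed ones
        exceed += eigenvalues_centered(permuted) > observed
    return observed, exceed / n_perm

# Hypothetical toy data: five variables sharing one strong latent dimension.
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 1))
toy = latent @ np.ones((1, 5)) + 0.5 * rng.normal(size=(60, 5))
observed, p_values = permutation_pvalues(toy, n_perm=200)
```

Shuffling each column separately destroys the correlations between variables while preserving each variable's own distribution, which is what makes the permuted eigenvalues a usable null reference.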

In PCA, the components are interpreted by inspecting, one component at a time, their patterns of component scores and loadings, and this process is facilitated by drawing maps (i.e., scatterplots) that plot the loadings of two components (typically Components 1 and 2) against each other. In these maps, the correlation between two traits is estimated by the angle that these two traits form with the origin of the map: Two traits with a small angle are positively correlated, two traits with a right angle are orthogonal (i.e., uncorrelated), and two traits with a large obtuse angle are negatively correlated. Here, loadings were scaled to have the same variance as the component scores and are therefore called “scaled loadings” (by contrast with the usual loadings that have unitary variance).
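The link between angles and correlations can be checked numerically. The sketch below is a hypothetical Python illustration (toy data, made-up variable names), assuming that scaled loadings are the unit-norm loadings multiplied by the singular values. Across all components, the cosine of the angle between two traits then equals their Pearson correlation exactly; on a two-component map it is only an approximation.

```python
import numpy as np

# Hypothetical toy data: 50 rows, 4 traits, with traits 0 and 1 made to correlate.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scaled_loadings = Vt.T * s          # rows: traits; columns: components

def cosine(a, b):
    """Cosine of the angle that two trait vectors form with the origin."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_all = cosine(scaled_loadings[0], scaled_loadings[1])          # all components
cos_map = cosine(scaled_loadings[0, :2], scaled_loadings[1, :2])  # 2-D map only
pearson_r = float(np.corrcoef(X[:, 0], X[:, 1])[0, 1])
```

Comparing `cos_map` with `pearson_r` gives a sense of how much of the relation between two traits is visible on a given two-component map.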

Finally, to reveal the structure of the occupations in the same component space, we used the component scores as coordinates to create scatterplots of the occupations. In these maps, the similarity between occupations is evaluated by their distances: Occupations near each other require similar skills, whereas occupations far from each other require distinct skills (note that this procedure differs from the plots of the traits whose interpretation relies on the angle between pairs of traits). We labeled these components according to how both traits and occupations are distributed. The labels were selected for the sake of efficient communication, recognizing that they do not necessarily reflect the nuances of each component.

Hierarchical cluster analysis (HCA)

Hierarchical cluster analyses (HCA; Bridges, 1966; see Footnote 2) were performed on both the occupations’ component scores and the traits’ scaled loadings for the three PCAs (i.e., general, Job Zones 1–3, Job Zones 4–5) using the significant components of each PCA. The HCAs were conducted using the R function hclust (R Core Team, 2021), and the clusters were extracted using Ward’s minimum variance method (Ward, 1963). Homogeneous clusters were defined such that the numbers of occupations or traits were roughly equivalent (Fig. 1; see the detailed list of occupations and traits in Supplementary S1, S3, and S4).
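To make Ward's criterion concrete, here is a naive Python sketch that repeatedly merges the two clusters whose union least increases the within-cluster sum of squares. It is not the authors' hclust call: the function name and toy coordinates are hypothetical, the search is deliberately brute-force, and R's hclust offers two Ward variants (ward.D and ward.D2) that the text above does not distinguish.

```python
import numpy as np

def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion; returns one label per row."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]   # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ia, ib = clusters[a], clusters[b]
                gap = points[ia].mean(axis=0) - points[ib].mean(axis=0)
                # increase in within-cluster sum of squares if a and b were merged
                cost = len(ia) * len(ib) / (len(ia) + len(ib)) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge b into a (b > a, so a stays valid)
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Hypothetical component scores for six occupations forming two obvious groups.
toy_scores = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                       [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels = ward_clusters(toy_scores, n_clusters=2)
```

The same routine applies to traits by passing their scaled loadings instead of the occupations' component scores.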

Fig. 1

Descriptive information of the General, the Job Zones 4–5, and the Job Zones 1–3 PCAs. Note. Descriptive information for the General occupation (A), Job Zones 4–5 (B) and Job Zones 1–3 (C) PCAs and HCAs. Top: Scree plots depicting the percentage of variance explained by each component. The green dashed lines indicate the Kaiser criterion, and the purple dots indicate significant components as determined by permutation tests. Bottom: Tree diagrams depicting the hierarchical structure of the clustering of occupations from the HCA (N = number of occupations). Clusters were colored using a gradient of the component scores / scaled loadings of the first components (i.e., yellow-green reflects positive-negative loadings on the first component for occupation; see Figs. 2, 3, 4, and 5 for color gradients). See Supplementary S1, S3, and S4 for the tree diagrams depicting the hierarchical structure of the clustering of traits. Details of abbreviations for the occupation clusters are listed in Table 2

Visualizations

Component scores and scaled loadings were grouped and labeled using clusters derived from the corresponding HCAs. The component scores are also grouped and labeled using the Job Zone variable from O*NET to assess the degree to which a given component corresponds to the overall level of education and preparation required. An interactive, publicly available (R-based) Shiny app (available at https://github.com/juchiyu/OccupationPCAs) is provided so that others can re-run analyses and plot the data, adjusting the parameters (e.g., the number of clusters, clustering method, occupation vs trait clustering) according to their needs.

Results

General occupation PCA

The general occupation PCA identified seven significant components, which, together, explained 77.24% of the variance. HCA identified 18 occupation clusters and nine trait clusters (Fig. 1A and Supplementary S1). The occupation spaces of the first three components are shown in Figs. 2 and 3. The first component, explaining 37% of the variance, differentiated labor-intensive occupations (e.g., stonemasons, logging equipment operators) from education-intensive occupations (e.g., industrial-organizational psychologists, political science teachers), with the scaled loadings of traits reflecting physical versus cognitive abilities (Fig. 2B and Supplementary S5, horizontal axes). Accordingly, this component corresponded to the five O*NET Job Zone groups (Fig. 2A; horizontal axis). Because this component was largely determined by the amount of preparation required (including education, related experience, and on-the-job training), we labeled it “Preparation.” This component likely reflects what has previously been labeled “occupational complexity” or “substantive complexity” (Crouter et al., 2006; Gadermann et al., 2014; Hadden et al., 2004; Smart et al., 2014).

Fig. 2

First two components (Preparation and STEM) from the General occupation PCA. Note. Panel A illustrates the degree to which each component corresponds to mean Job Zone ratings. Anchors indicate occupations (B) with the highest contributions (e.g., [Tree] Fallers and extent flexibility contributed strongly and positively to Preparation [Component 1]). Occupation component scores (B) are labeled by clusters derived from the HCA. The horizontal axis in all plots represents Preparation (Component 1), and the vertical axis represents STEM (Component 2)

Fig. 3

First and third components (Preparation and Health versus Computational Science) for the General occupation PCA. Note. Panel A illustrates the degree to which each component corresponds to mean Job Zone ratings. Anchors indicate occupations (B) with the highest contributions (e.g., [Tree] Fallers and extent flexibility contributed strongly and positively to Preparation [Component 1]). In all plots, the horizontal axis represents Preparation (Component 1), and the vertical axis represents Health versus Computational Science (Component 3)

The second component explained 19% of the variance and differentiated occupations in STEM (e.g., robotics, marine, and biomedical engineers) from occupations that are non-STEM (e.g., models, telemarketers, coatroom attendants), with the scaled loadings of traits reflecting engineering, technology, and natural sciences (Fig. 2B and Supplementary S5, vertical axes). The scaled loadings of traits on the low end of this component, such as fine arts and philosophy, were close to 0 and therefore did not contribute strongly to this component. We therefore simply called this component “STEM.”

The third component explained 9% of the variance and differentiated occupations in medicine, health science, and social science (e.g., nurse practitioners, clinical nurse specialists, police officers) from computer science and engineering (e.g., hardware engineers, aerospace engineers, mechanical drafters), with the scaled loadings of traits anchored by engineering and technology versus social sciences, humanities, and natural sciences (Fig. 3B and Supplementary S6, vertical axes). We labeled this component “Health versus Computational Science.”

Job Zones 4–5 PCA

The Job Zones 4–5 PCA identified six significant components, which, together, explained 70% of the total variance. HCA was conducted on the first four of these components (which, together, explained 60% of the variance) because the last two components did not yield interpretable clusters (Supplementary S2). This HCA identified 18 occupation clusters and 11 trait clusters (Fig. 1B and Supplementary S3). Figure 4 shows the occupation space of the first two components of the Job Zones 4–5 PCA. The preparation-level factor observed in the general occupation PCA no longer dominated the first component, likely because the Job Zones 4–5 PCA is restricted to occupations requiring a considerable to extensive level of preparation (see Fig. 4A). Job Zones 4–5 contain occupations that rely on cognitive abilities (e.g., coordinating, supervising, managing).

Fig. 4

First and second components (STEM versus Social Science and Humanities; Health versus Computational Science) for the Job Zones 4–5 PCA. Note. Panel A illustrates the degree to which each component corresponds to mean Job Zone ratings. Anchors indicate occupations (B) with the highest contributions (e.g., Engineers and Teachers contributed strongly to STEM versus Social Science and Humanities [Component 1]). Occupation component scores (B) are labeled by clusters derived from the HCA. In all plots, the horizontal axis represents STEM versus Social Science and Humanities (Component 1) and the vertical axis represents Health versus Computational Science (Component 2)

The first and second components of the Job Zones 4–5 PCA (Fig. 4) resembled the second and third components of the general occupation PCA (Figs. 2 and 3). The first component, explaining 27% of the total variance, differentiated STEM occupations (e.g., manufacturing, marine, and robotics engineers) from occupations in the humanities and social sciences, particularly teachers (e.g., English language and literature teachers, history teachers), with the scaled loadings of traits reflecting science and engineering versus social science, humanities, and communication (Fig. 4B and Supplementary S7, horizontal axes). Unlike in the general occupation PCA, where the social science, humanities, and communication traits have scaled loadings close to zero, these traits contributed strongly to the variance of this component. This difference is likely attributable to the greater influence of liberal arts education in this set of professions. We labeled this component “STEM versus Social Science and Humanities.”

The second component of Job Zones 4–5 PCA, explaining 16% of the variance, differentiated occupations in health sciences and medicine (e.g., surgeons, nurse practitioners, oral and maxillofacial surgeons) from occupations in computer sciences and business (e.g., data specialists, cost estimators, research analysts), with the scaled loadings of traits reflecting social science, humanities, and natural science versus science, engineering, and math (Fig. 4B and Supplementary S7, vertical axes). We labeled this component “Health versus Computational Science.”

Job Zones 1–3 PCA

Job Zones 1–3 PCA identified seven significant components, which, together, explained 75% of the total variance. All significant components were included in the HCA, which identified ten occupation clusters and nine trait clusters (Fig. 1C and Supplementary S4). Job Zones 1–3 typically contain occupations that are labor-intensive (e.g., manual labor, typing speed).

Figure 5 shows the occupation space of the first two components of this PCA. The first component, explaining 33% of the total variance, differentiated manual labor occupations (e.g., manufactured building installers, mechanics, millwrights) from office administrative occupations (e.g., telemarketers, clerks), with the scaled loadings of traits reflecting engineering, technology, operations and control, and physical strength as distinct from communication and humanities (i.e., customer-service-related skills; Fig. 5B and Supplementary S8, horizontal axes). This component was labeled “Manual Labor versus Office.”

Fig. 5

First and second components (Manual Labor versus Office; Technical) for the Job Zones 1–3 PCA. Note. Panel A illustrates the degree to which each component corresponds to mean Job Zone ratings. Anchors indicate occupations (B) with the highest contributions (e.g., Mechanical occupations contributed strongly to Manual Labor versus Office [Component 1]). Occupation component scores (B) are labeled by clusters derived from the HCA. In all plots, the horizontal axis represents Manual Labor versus Office (Component 1) and the vertical axis represents Technical (Component 2)

The second component, explaining 19% of the variance, differentiated occupations that require specialized and practical knowledge and skills (i.e., technical occupations, such as product managers or fire-fighting supervisors) from occupations that do not require specialized and practical knowledge and skills (e.g., pressers, graders, sorters, cleaners). The scaled loadings of traits reflect technical and scientific knowledge and communication (which requires practical training and textbook learning) as opposed to general physical abilities (which rely on broad motor capacity) (Fig. 5B and Supplementary S8, vertical axes). This component was labeled “Technical.”

Application: Relationship of verbal and non-verbal abilities to STEM occupations

To illustrate an application of VOLCANO to real-world data, we analyzed the relationship of verbal and non-verbal abilities to occupation in the Nathan Kline Institute-Rockland Sample (NKI-RS; Nooner et al., 2012), a data set that has the advantage of rich standardized testing in a large sample of adults with occupation coded. As a proof of principle, we tested the hypothesis that non-verbal abilities would be uniquely associated with STEM occupations. We first created groups according to cognitive performance (Verbal/Non-verbal IQ discrepancy). For descriptive purposes, we examined frequencies of individuals in each discrepancy group within occupational clusters from our hierarchical clustering algorithm. We then projected groups defined by cognitive abilities into the occupational space to assess their relationships to the components. As the NKI-RS data focus on cognitive performance, we restricted this analysis to the Job Zones 4–5 PCA, which includes occupations requiring more extensive training and preparation, especially with respect to cognitive skills. Indeed, the first two components of this space separate (1) STEM from Humanities and (2) Health from Computational Science. Finally, we extracted the occupation factor scores as continuous measures and related them to cognitive performance in a traditional regression analysis.

Methods

Because this analysis uses the Job Zones 4–5 PCA, only individuals who reported having an occupation that falls within Job Zones 4–5 were included in the analyses. A total of 470 NKI-RS participants were included (152 males and 318 females; Mage = 56.69 years, SDage = 15.45 years). The NKI-RS dataset is a community sample of participants across the lifespan from Rockland County, a suburban/rural county 20 miles northwest of New York City. This sample is intended to be a phenotypically rich neuroimaging sample, consisting of data obtained from representative individuals from the community rather than solely from university students, as is typical of neuroimaging datasets. Individuals with past or current reports of head injury, stroke, bipolar disorder, autism spectrum disorder, attention deficit hyperactivity disorder, Alzheimer’s disease, or epilepsy, or with a full-scale IQ < 70, were excluded from analyses. We also excluded individuals who did not speak English as their native language, because their verbal IQ scores would be artificially lowered. Only participants with both a free-entry occupation and scores on the Wechsler Abbreviated Scale of Intelligence – II (WASI-II) were included. Participants provided informed consent in accordance with the NKI-RS research ethics boards.

Participants manually entered their occupations as a free-entry item as part of the Hollingshead Four-Factor Index of Socioeconomic Status (Hollingshead, 1975). Free-text entries that did not correspond to an O*NET-listed occupation were re-coded by three independent raters who completed training on 100 representative occupations (the coding manual is available on GitHub: https://github.com/juchiyu/OccupationPCAs). Each rater converted 264 occupations, including 67 overlapping occupations. Inter-rater reliability for the coding of occupations that did not already correspond to an O*NET-listed occupation (based on row cluster classification, see above Methods) was acceptable (Fleiss’ kappa = .66; Cohen, 1960; Fleiss et al., 1969).

Participants completed the Wechsler Abbreviated Scale of Intelligence – 2nd edition (WASI-II) as part of a comprehensive test battery. The WASI-II is a brief intelligence test with excellent reliability and validity (Irby & Floyd, 2013), designed for individuals between the ages of 6 and 90 years. This instrument includes four subtests: Vocabulary, Similarities, Block Design, and Matrix Reasoning. A Verbal Comprehension Index (VCI) can be derived from the respective age-corrected standardized scores on the Vocabulary and Similarities subtests, and a Perceptual Reasoning Index (PRI) can be derived from the respective age-corrected standardized scores on the Matrix Reasoning and Block Design subtests. VCI scores reflect verbal abilities, including abstract verbal reasoning ability, semantic knowledge, and verbal comprehension and expression. PRI scores reflect non-verbal abilities, including visuospatial processing and abstract problem solving. These measures correspond to the full Wechsler Adult Intelligence Scale – IV VCI and PRI scores, with a mean of 100 and a standard deviation of 15. We predicted that these indices would be differentially associated with Component 1 (STEM versus Social Science and Humanities) identified in the Job Zones 4–5 PCA, with PRI associated with STEM occupations and VCI associated with social science and humanities occupations.

Participants were assigned to groups based on their VCI-PRI discrepancy scores, computed as the difference between VCI and PRI. Reliabilities of discrepancy scores in adults range from .82 to .89, a value considered large enough to justify hypothesis generation (Ryan & Gontkovsky, 2021). Individuals with a discrepancy of at least ten points between VCI and PRI were included in the projection. In this analysis, we only consider two groups from the distribution tails of the individuals: VCI+ and PRI+. The VCI+ group (N = 145) includes individuals with a VCI score greater than their PRI score by at least ten points, who were therefore considered to have relatively stronger verbal skills. Conversely, the PRI+ group (N = 240) includes individuals with a PRI score greater than their VCI score by at least ten points, who were therefore considered to have relatively stronger perceptual reasoning skills; the remaining 85 participants had difference scores of less than ten points. The VCI+ and PRI+ groups did not differ by age, t(193) = – 0.91, ns, or gender, χ2(1) = 0.73, ns. Participants were then colored based on group membership and projected into the Job Zones 4–5 occupation space using supplementary projections (a procedure also called out-of-sample projection; for details, see Abdi & Williams, 2010).
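The grouping rule and the idea of a supplementary projection can be sketched together. This is a generic Python illustration, not the authors' pipeline: the function names, the example scores, and the toy table standing in for the active occupations-by-traits data are all hypothetical. A supplementary row is centered with the means of the active data and then multiplied by the loadings, so it receives coordinates without influencing the components.

```python
import numpy as np

def discrepancy_group(vci, pri, threshold=10):
    """Label a participant by a VCI-PRI gap of at least `threshold` points."""
    diff = vci - pri
    if diff >= threshold:
        return "VCI+"
    if diff <= -threshold:
        return "PRI+"
    return "neither"

def project_supplementary(new_rows, active_mean, loadings):
    """Coordinates of rows that took no part in fitting the PCA."""
    return (np.asarray(new_rows, dtype=float) - active_mean) @ loadings

# Hypothetical active table (20 rows, 5 traits) and its centering-only PCA.
rng = np.random.default_rng(3)
active = rng.normal(size=(20, 5))
active_mean = active.mean(axis=0)
U, s, Vt = np.linalg.svd(active - active_mean, full_matrices=False)
loadings = Vt.T
active_scores = U * s

# Made-up (VCI, PRI) pairs for three illustrative participants.
groups = [discrepancy_group(v, p) for v, p in [(118, 101), (95, 109), (104, 100)]]

# Sanity check: projecting active rows as if they were supplementary
# reproduces their original component scores.
reprojected = project_supplementary(active[:3], active_mean, loadings)
```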

Results and discussion

Figure 6A shows the occupation clusters with the proportion of each group for the purpose of sample characterization. The VCI+ group—relative to the PRI+ group—contains a larger proportion of people with low scores on Component 1 (e.g., occupations in Social Sciences, Business and Government, Alternative Therapies). Conversely, the PRI+ group, relative to the VCI+ group, contains a larger proportion of individuals with high scores on Component 1 of the Job Zones 4–5 PCA (e.g., occupations in Science and Mathematics, Engineering, Computer, and Informatics).

Fig. 6

Projection of the Rockland data onto the space of the Job Zones 4–5 PCA. Note. Visualization of the association between the Job Zones 4–5 PCA and the Wechsler Abbreviated Scale of Intelligence (WASI-II), including the verbal comprehension index (VCI) and perceptual reasoning index (PRI), from the NKI-RS dataset. A shows the distribution of occupation clusters within the groups defined as VCI+ and PRI+. From the left, clusters range from low to high on Component 1 (see Figs. 1B and 4B for interpretation). As expected, the VCI+ group contains a larger proportion of people with occupations low on Component 1 (Social Sciences, Business and Government, Alternative Therapies) relative to the PRI+ group, whereas the PRI+ group contains more people with occupations high on Component 1 (Science and Mathematics, Engineering, Computer and Informatics). B illustrates the supplementary projection of the discrepancy groups (VCI+ denotes the group with VCI larger than PRI by at least 10 points, and PRI+ denotes the group with PRI larger than VCI by at least 10 points). The ellipses indicate 95% bootstrapped confidence intervals. C The PRI is positively correlated with Component 1 scores, indicating an association with STEM, whereas the VCI is negatively correlated with Component 1, indicating an association with social sciences and humanities. The two correlation coefficients are significantly different. D The difference between participants' VCI and PRI scores (i.e., VCI-PRI) is negatively correlated with Component 1, indicating that participants whose VCI exceeds their PRI by a larger margin are more likely to have occupations in Social Science and Humanities than in STEM. For positive WASI difference scores, a greater difference between VCI and PRI relates to having a lower score on Component 1 (i.e., an occupation in the humanities). Conversely, for negative WASI difference scores, a greater difference between VCI and PRI relates to having a higher score on Component 1 (i.e., an occupation in STEM)

The utility of VOLCANO for quantifying occupational data on a ratio scale is illustrated by Figs. 6B and 6C. In Fig. 6B, the participants are represented as their occupations in the Job Zones 4–5 PCA space. The results show that individuals with stronger PRI relative to VCI scores have occupations that are more positive (i.e., higher in STEM) on Component 1 of the Job Zones 4–5 PCA. As the 95% bootstrapped confidence intervals (ellipses) do not overlap, this difference is statistically significant. There was no difference on Component 2 (Health versus Computational Science).
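The group comparison described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: it computes a one-dimensional percentile bootstrap interval for the mean Component 1 score of each group, rather than the two-dimensional confidence ellipses shown in the figure. The function name `bootstrap_mean_ci` and all scores below are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = []
    for _ in range(n_boot):
        # Resample participants with replacement and store the resampled mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical Component 1 scores of each participant's occupation.
vci_plus = [-2.1, -0.8, -1.5, 0.3, -1.9, -0.4, -2.6, -1.1, 0.1, -1.7]
pri_plus = [1.4, 2.2, 0.6, 1.9, -0.2, 2.8, 1.1, 0.9, 2.4, 1.6]

vci_ci = bootstrap_mean_ci(vci_plus, seed=0)
pri_ci = bootstrap_mean_ci(pri_plus, seed=1)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = vci_ci[1] >= pri_ci[0] and pri_ci[1] >= vci_ci[0]
print("VCI+ 95% CI:", vci_ci)
print("PRI+ 95% CI:", pri_ci)
print("Intervals overlap:", intervals_overlap)
```

With real data, `vci_plus` and `pri_plus` would hold the component scores looked up for each participant's O*NET-coded occupation.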

Next, we leveraged the full range of VCI and PRI scores in a multiple linear regression analysis predicting individuals’ Component 1 scores (extracted from the Job Zones 4–5 PCA) from VCI, PRI, and gender. The regression was significant, F(3, 466) = 19.35, p < 0.001, with an R2 of .11. Within this model, there was a significant main effect of gender (β = – 1.53, SE = 0.30, t = – 5.06, p < 0.001), with males having jobs with higher scores on Component 1 (i.e., males were more likely than females to have jobs in a STEM discipline). There were also significant main effects of VCI and PRI scores (VCI: β = – 0.04, SE = 0.01, t = – 2.88, p < 0.005; PRI: β = 0.06, SE = 0.01, t = 5.45, p < 0.001). Notably, there was no significant interaction between gender and VCI or PRI. Additional results for this linear regression are presented in Table 3 and illustrated in Fig. 6C, where it can be seen that PRI was significantly positively related to Component 1 scores, r(468) = .211, p < 0.001, whereas VCI had a negative slope that was not significant, r(468) = – .01, ns. These two bivariate correlations differed significantly, as examined by the Williams-Hotelling test, t(467) = 4.93, p < 0.001. To further illustrate how verbal versus non-verbal cognitive abilities relate to Component 1, the correlation of the VCI minus PRI difference score with Component 1 was significant, r(468) = – .24, p < 0.001 (see Fig. 6D).
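For readers who want to reproduce this kind of check, the sketch below shows two standard formulas. The first is the overall F statistic implied by R2, the number of predictors, and the sample size. The second is the commonly used Williams t statistic for comparing two dependent correlations that share one variable, with df = n − 3. It is an assumption that this matches the exact variant the authors ran. The sample size n = 470 is inferred from the reported degrees of freedom, and the VCI-PRI correlation `r_vci_pri` is a hypothetical placeholder because it is not reported in the text.

```python
import math


def f_from_r2(r2, k, n):
    """Overall regression F statistic from R^2, k predictors, and n cases."""
    df1, df2 = k, n - k - 1
    return (r2 / df1) / ((1 - r2) / df2), (df1, df2)


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: rho12 == rho13, where both correlations share variable 1.

    r23 is the correlation between the two non-shared variables.
    Returns (t, df) with df = n - 3.
    """
    det_r = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det_r + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt(numerator / denominator), n - 3


n = 470            # inferred from r(468) and F(3, 466)
r_pri_c1 = 0.211   # reported PRI vs Component 1 correlation
r_vci_c1 = -0.01   # reported VCI vs Component 1 correlation
r_vci_pri = 0.5    # HYPOTHETICAL: not reported in the text

f_value, f_df = f_from_r2(0.11, k=3, n=n)
t_value, t_df = williams_t(r_pri_c1, r_vci_c1, r_vci_pri, n)
print(f"F{f_df} = {f_value:.2f}")
print(f"t({t_df}) = {t_value:.2f}")
```

Because R2 is reported to two decimals, the F value recomputed this way can differ slightly from the published one.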

Table 3 Multiple linear regression results

These results illustrate the utility of VOLCANO for characterizing occupation and testing predictions on real-world occupational data. The empirically-derived clusters are useful for sample characterization, but data analytic options are restricted to non-parametric statistics. Many studies of occupation use such measures (Zeman et al., 2020). The added value of VOLCANO is the derivation of component scores for use in more powerful parametric analyses. This exercise was intended as proof-of-principle rather than theory-testing, as it is generally accepted that spatial reasoning is important for STEM disciplines (e.g., Khine, 2017), whereas there was no a priori reason for this dissociation to be observed when contrasting Health versus Computational Science (Component 2). The relationship between specific verbal intellectual abilities and selection of occupations in the social sciences and humanities has received less attention, possibly owing to the heterogeneity of these occupations.

Discussion

We used multivariate methods (i.e., PCA and HCA) to convert the heterogeneous categorical variable “occupation” into a concise set of continuous variables (along with a constrained set of categorical groupings). Our VOLCANO Shiny app provides a platform for standardized, quantitative characterization of occupation, enabling a new level of data sharing and comparison across studies concerning the skills, abilities, and traits associated with specific occupations. In addition to making O*NET data accessible and shareable, the VOLCANO Shiny app makes the data flexible, supporting researchers’ ability to use O*NET data to address a range of research questions.

We implemented PCA on traits of occupations from O*NET and revealed three meaningful continuous components. These include (1) a component that reflects the education and preparation needed for specific occupations, (2) a STEM component that reflects the degree to which occupations are within a STEM discipline, and (3) a component that distinguishes STEM occupations between those in health science and those from scientific professions that require computational and mathematical thinking. These components are similar to those previously described for DOT and O*NET data (Hadden et al., 2004). However, we seek to transcend a static occupational space to create a flexible, accessible application that can accommodate the dynamic needs of researchers studying occupational traits across disciplines.

The general occupation space derived in the current study is useful for questions associated with occupations across all Job Zones. The inclusion of two additional occupation spaces (i.e., Job Zones 4–5 and Job Zones 1–3) is useful for questions tailored to specific groups, such as higher cognitive skills (e.g., working memory, mathematical reasoning) for Job Zones 4–5 space or physical skills (e.g., basal metabolic rate) for the Job Zones 1–3 space. Together, these three occupation spaces hold promise to uncover how occupation is associated with a wide set of health, psychological, and financial measures as well as overall standard of living.

The first and second components of the Job Zones 4–5 occupation space closely resembled the second and third components of the general occupation space (i.e., STEM component and Health versus Computational Science component). This finding was expected because the components of PCA are orthogonal (i.e., the second and third components in the general occupation PCA are uncorrelated with the first education-related component). More importantly, removing the education-related component allowed for a wider distribution of cognitive skills along Components 2 and 3, enabling finer-grained distinctions across occupations requiring higher education, that did not contribute to the corresponding component in the general occupation PCA. By contrast, the Job Zones 1–3 occupation space—which included labor-intensive occupations—has distinct components. The first component separated manual labor from office jobs, while the second component separated occupations that rely more on technical skills from those that rely less on such skills. It is worth noting that this pattern is related to a linear pattern of Job Zones 1–3 (see Fig. 5A) which indicates that the education and the preparation required for these occupations are closely related to their technicality. The HCA, performed on the scaled loadings from the PCAs, provided a set of data-driven categories that can be used to address research questions that require categorical clusters of occupational traits.

Notably, across PCAs the heterogeneity of one component is often decomposed by the subsequent component. For instance, the heterogeneous construct of an occupation in a STEM discipline is decomposed by the component that follows it: Component 3 of the general PCA and Component 2 of the Job Zones 4–5 PCA are both labeled “Health versus Computational Science.” Because PCA components are orthogonal (i.e., uncorrelated), these results show that the distinction between non-STEM and STEM (Health and Computational Science combined) is independent from the distinction within STEM (Health vs. Computational). Thus, our method does capture the complexity of professions such as medical doctors, who score high on health and low on computational science.

The occupation spaces can be exploited by projecting new observations onto the components using supplementary projections. For instance, groups with specific exposures, neurological characteristics, or cognitive traits can be projected into the occupational space to determine the association between these supplementary traits and occupation selection (provided they have been coded within the same O*NET job titles that we used to create the spaces). Alternatively, researchers can use the Shiny app to extract occupations, component scores, scaled loadings, and categorical clusters for use in classic univariate analyses as well as complex multilevel modelling, such as structural equation modelling (SEM).
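As a rough illustration of a supplementary projection, the sketch below applies the usual PCA recipe: preprocess the new observation with the means and standard deviations of the active (original) data, then multiply by the component loadings. It is a minimal sketch under stated assumptions, not the VOLCANO implementation. The function name `project_supplementary`, the three traits, and all numeric values are hypothetical, and the loadings are assumed to be stored as one row per trait and one column per component.

```python
def project_supplementary(row, means, sds, loadings):
    """Project one supplementary observation onto existing PCA components.

    row      : raw trait values of the new observation
    means    : per-trait means from the active data used to fit the PCA
    sds      : per-trait standard deviations from the active data
    loadings : traits x components matrix (list of rows)
    """
    # Preprocess the new observation exactly as the active data were preprocessed.
    z = [(x - m) / s for x, m, s in zip(row, means, sds)]
    n_components = len(loadings[0])
    # Component score c is the dot product of z with column c of the loadings.
    return [
        sum(z[j] * loadings[j][c] for j in range(len(z)))
        for c in range(n_components)
    ]


# Hypothetical preprocessing parameters and loadings for three traits.
means = [3.0, 2.5, 4.0]
sds = [1.0, 0.5, 2.0]
loadings = [
    [0.6, 0.8],
    [0.8, -0.6],
    [0.0, 0.0],
]

new_observation = [4.0, 2.0, 4.0]
scores = project_supplementary(new_observation, means, sds, loadings)
print("Supplementary component scores:", scores)
```

The key design point is that the supplementary observation does not influence the components; it is only located within a space that was already fixed by the active data.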

As a practical illustration of these methods, we assessed the relationships of verbal and non-verbal intellectual abilities to STEM versus social science / humanities occupations (Component 1 of Job Zones 4–5 PCA). Using both supplementary projections and extraction of component scores, we found that visuospatial and non-verbal analytical reasoning abilities were related to selection of STEM professions, as expected on the basis of prior research (Khine, 2017). Vocabulary and verbal reasoning were related to the practice of professions in the social sciences and humanities. Because these analyses were focused on the association between cognition and occupation, we restricted data to occupations that require extensive preparation and training and that are relatively more reliant on cognition (i.e., occupation space as defined by the Job Zones 4–5 PCA). A more extensive cognitive battery would be required to assess the association between other theoretically relevant skills and occupations across the full range of job zones. These findings were not intended to advance theory, but rather to provide proof-of-principle that the VOLCANO technique can be used to isolate specific occupational components in relation to external measures. Considering the NKI-RS sample, VOLCANO enables the incorporation of quantified occupational data into analyses with deep behavioral, mental health, and neuroimaging data included with that dataset. Moreover, VOLCANO standardization can facilitate linkages of findings across datasets containing O*NET-coded occupations.

We share our code on OSF and GitHub, and we provide a Shiny app to support researchers repurposing these data to address distinct research questions. With the publicly available code, the occupation space can be re-generated with an updated O*NET database. The Shiny app can be used to implement different clustering methods (either hierarchical clustering analysis or K-means), select different numbers of clusters, and generate distinct component spaces based on the inclusion of specific Job Zones (e.g., an occupational space within a single job zone). The Shiny app can also generate detailed lists of occupations and traits that comprise each component and cluster to help researchers identify key characteristics of the component and facilitate generating study-appropriate labels and names.

The following limitations should be considered by researchers using these methods. First, each occupation’s contribution within a component is a relative measure only interpretable within the full set of occupations included in the space. For instance, the same occupation would appear to be more physically demanding in the cognitive compared to the labor-intensive occupation space. Second, the occupational spaces are limited to those occupations listed in O*NET. While other out-of-sample occupations evaluated by the same set of traits could be projected onto the same space using supplementary projections (Abdi & Williams, 2010), such an exercise would assume that new trait ratings are comparable to older ones, an assumption that may be unjustified depending on the methods used to collect the ratings. Additionally, the definition of Job Zone is somewhat ambiguous because it reflects to varying degrees the education, training, experience, required knowledge, wage and salary level associated with particular jobs. That said, O*NET is the most comprehensive occupation dataset available. We acknowledge that routine updating of O*NET could change occupational spaces. We therefore provide the code required to generate occupational spaces using any occupation dataset, including future iterations of O*NET. Finally, we used consensus to derive labels for components and clusters in relation to the underlying constructs, but this labelling method is ultimately subjective.
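The first limitation can be made concrete with a single trait. The sketch below standardizes the same raw rating against two different reference sets. It is a deliberate simplification, since PCA scores combine many traits, and the function name `z_within` and all ratings are hypothetical.

```python
import statistics


def z_within(reference, value):
    """Standardize `value` relative to the mean and SD of a reference set."""
    mean = statistics.fmean(reference)
    sd = statistics.pstdev(reference)
    return (value - mean) / sd


# Hypothetical physical-demand ratings for the occupations defining each space.
cognitive_space = [1.0, 1.5, 2.0, 1.2, 3.0]
labor_space = [3.5, 4.0, 4.5, 3.0, 5.0]

raw_rating = 3.0  # the same raw rating, evaluated against each space

z_cognitive = z_within(cognitive_space, raw_rating)
z_labor = z_within(labor_space, raw_rating)
print(f"Relative to the cognitive space: {z_cognitive:+.2f}")
print(f"Relative to the labor-intensive space: {z_labor:+.2f}")
```

The sign of the standardized value flips between the two reference sets even though the raw rating is unchanged, which is why scores from different occupation spaces should not be compared directly.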

Research implementing component reduction methodologies to characterize occupation data dates back over 50 years, even before the emergence of comprehensive taxonomic systems, such as O*NET (e.g., Cole et al., 1971; Cunningham et al., 1983). Yet none of the prior attempts to reduce O*NET data has resulted in an accessible, shareable, and standardized system for the quantification and classification of occupation characteristics. The occupational space derived in the current study results in a set of components that converge with past research (e.g., Potter et al., 2008; Smyth et al., 2004; Spreng et al., 2010), but with additional intricacy, flexibility, and specificity. Moreover, this occupational space can be easily implemented, reproduced, updated, and adapted across disciplines and countries to enhance feasibility, consistency, and comparability across research settings. A key contribution of the current study is that it provides a flexible and easily accessible tool that will support and expedite future research on cognitive, neurological, and behavioral characteristics associated with occupations across a range of educational attainment levels. Additionally, findings from the current study provide conceptual and terminological guardrails for occupational traits and factors (i.e., components) that will improve replicability and communication of findings across disciplines. Finally, our findings and techniques support research questions that will deepen our understanding of occupation, which in turn holds promise to enhance individual quality of life, and global innovation and productivity.