Main text

Background

Nearly 20 years ago, the U.S. Institute of Medicine (IOM, now the Academies of Science) recognized that sex (biology) matters in determining health outcomes but also that gender (sociocultural behaviors and attitudes) interacts with sex to influence health and disease processes across the lifespan [1, 2]. In the past decade, both the Canadian Institutes of Health Research (2010) [3] and the European Commission (2014) [4] have endorsed integrating sex and gender (usually as male/female binaries) into health research, and the US National Institutes of Health (NIH) has mandated the inclusion of sex as a biological variable (SABV—2016) [5]. Yet still today, sex and gender are often inappropriately conflated in the biomedical literature [6], and gender is rarely considered largely because the field lacks quantitative tools for analyzing the influence of gender on health outcomes.

In this paper, we argue for Gender as a Sociocultural Variable (GASV) as a complement to SABV. Despite several efforts to examine gender and health [7, 8], the field lacks adequate tools to assess gender. To address this problem, we set out to develop a new instrument—the Stanford Gender-Related Variables for Health Research (GVHR).

Our interest was piqued by a 2007 study that reported that men with higher “femininity” scores had lower risk of coronary heart disease. No such relationship was observed among women [9]. Similarly, a 2015 study found that, independent of biological sex, young adults with a gender score more strongly associated with “feminine gender-related characteristics” were more likely to experience a recurrence of acute coronary syndrome (ACS)—regardless of whether they identified as a man or a woman (a non-binary option was not offered) [10, 11].

These innovative studies highlighted how challenging it is to account for both sex and gender in clinical research. Although these tools demonstrated that gender plays a key role as a determinant of health outcomes, its operational definition deserves further consideration. We were concerned that the conclusions were based on outdated gender identity constructs, such as the Bem Sex Role Inventory (BSRI, 1974) [12], developed using predominantly white, higher socioeconomic U.S. undergraduate student participants and based on outdated notions of masculinity and femininity or their cognates. We also saw that the GENESIS-PRAXY gender questionnaire used in the 2015 study was tested in a sample of 909 heart-disease patients and had not been cross-validated in broader patient, or non-patient, populations. We judged that despite its focus on gender, the GENESIS-PRAXY study used logistic regression with biological sex as the outcome to derive its final score of masculine and feminine characteristics. We also found that the internal consistency of the measure had not been explicitly tested.

More importantly, it is no longer sufficient to reduce gender to a two-dimensional spectrum stretching from “masculinity” to “femininity,” concepts that were historically construed as complementary oppositions [13], with a man being one thing and a woman being the opposite (for example, rational/emotional; public/private; mind/body). These concepts are too broad and imprecise to be useful in health research. Given that the goal is to provide physicians and policy makers with gender-related health interventions, measures of gender-related behaviors, such as caregiving or risk-taking, should be labeled as such.

To address these limitations, we develop a new gender-variables instrument. This step toward developing more comprehensive and precise survey-based measures of gender, in relation to health, builds on insights from gender theory to capture key aspects of three dimensions of gender that can be deployed quantitatively in diverse clinical research or large health surveys: (1) Gender Norms [14], (2) Gender-Related Traits [15, 16], and (3) Gender Relations [17] (see Tables S1 and S2), each of which may correlate with sex assigned at birth or self-reported gender identity, without a predetermined coding of that behavior as masculine or feminine (see Table 1). Thus, we adopt a multidimensional understanding that seeks to capture how intrapersonal, interpersonal, and institutional aspects of gender intersect to shape people’s health and illness [17].

Table 1 Three interrelated dimensions of gender

These three meta-categories are not mutually exclusive. Gender is multidimensional: any given individual may experience different configurations of gender norms, traits, and relations that cannot be subsumed into a “masculine” or “feminine” score or considered “fixed” [23].

Our new instrument represents a step toward developing more comprehensive and precise survey-based measures of gender in relation to health. Our questionnaire is designed to shed light on how specific gender-related behaviors and attitudes contribute to health and disease processes, irrespective of—or in addition to—biological sex and self-reported gender identity.

Methods

In a systematic review of the English-language literature from 1975 to 2015, we identified 74 eligible scales used in gender-related measures. From the 74 scales, we distilled 11 composite gender constructs and developed 44 items to measure gender, which we subjected to exploratory and confirmatory factor analysis (EFA and CFA) in three independent U.S. survey samples. This reduced the original 11 gender-related variables to 7 factors (caregiver strain, work strain, independence, risk-taking, emotional intelligence, social support, and discrimination) and the 44 survey items to 25. We then examined the relevance of the derived subscales in allowing for more precise analysis of variations in health-related quality of life, obesity, and risky health behaviors.

Literature search

We searched PsycINFO, PsycTESTS, and PubMed for all English-language studies using gender-related tests and scales, from 1975 through 2015, to identify existing questionnaires or scales and construct a comprehensive list of typical traits and/or characteristics used in gender-related measures in both psychology and medicine. We screened an initial sample of 2981 articles from PubMed, PsycTESTS, and PsycINFO from which 405 articles were deemed relevant for further interrogation, within which 127 unique gender-related tests and scales were identified. We also screened existing literature reviews published in books and found four additional scales. Altogether, 131 gender-related questionnaires were sorted into three overarching categories for analyzing gender norms, gender-related traits, and gender relations. Further, we checked citation frequencies for each scale to determine how often it has been used in the literature. All gender scales with at least 20 Google Scholar citations within the last 10 years were selected for further investigation, of which 74 scales met the criteria. All articles published from 2006 to 2015 were retained for further investigation irrespective of whether they were cited or not (the search methods, selection criteria, and review procedures are specified in SI text, Figure S1, and Tables S3-S4).

Several limitations made it impossible to simply plug these 74 scales into a new questionnaire. Very few existing scales (N = 8) focus on associations between gender and health [24,25,26,27,28]. The eligible gender-related scales are generally restricted to either men or women and assess either masculinity [29] or femininity [30, 31] or both as unipolar or bipolar constructs [12, 32,33,34,35,36,37,38] (e.g., hyper-masculinity or hyper-femininity) [26, 28, 39,40,41,42,43,44,45,46,47,48,49]. Further, most scales rely on “agree-disagree” ratings, making them susceptible to acquiescence bias [50].

Questionnaire development

In developing the questionnaire, we recognized the need to minimize the time burden of completing the questionnaire; therefore, we limited the initial item pool to 3–6 items per construct. To avoid acquiescence bias, we presented our items as construct-specific questions. The 74 scales guided our selection of core characteristics to be included among our gender-related measures.

  1. Gender Norms. Guided by three of the 74 eligible scales, respondents’ adherence to gender norms was measured by three composite constructs (caregiver strain, time use, and work strain) consisting of 16 items.

    • Caregiver strain captures perceived consequences of responsibility for unpaid, long-term caregiving to children, partners, friends, and elderly (excluding housework and caregiving occupations) [51, 52] and consists of three items, adapted from Graessel and colleagues [53], recorded on a five-point scale: emotional exhaustion, physical exhaustion, and worries about the future caused by caregiving for someone in need, such as a child, elder, partner, or disabled family member. Higher scores on these items indicate higher levels of caregiver strain.

    • Time use measures individual hours spent per day, recorded as open-ended numerical estimates of daily time-spent [54], in the following categories: paid work, household activities, eating and drinking, leisure and sport, caring for others, sleeping, and commuting. The items were adapted from the American Time Use Survey.

    • Work strain measures job strain and emotional job demands as six items, recorded on a five-point scale: work speed, work repetition, emotional job demands, physical job demands, perceived risk, and physical hazards at work. The first four items were adapted from Karasek and Theorell [55]; the last two were developed by the authors. Higher scores on each item indicate higher levels of work strain.

  2. Gender-Related Traits. Guided by 31 of the original 74 eligible scales, gender-related traits were measured as five composite constructs: competitive, risk-taking, independence, communal, and expressive, consisting of 16 items.

    • Competitive consists of two items, recorded on a five-point scale, asking respondents how often they find themselves competing with others in situations that do not call for competition and how competitive they are in general, compared to others. The first item is modified from Ryckman et al. [56], the second is developed by the authors. Higher scores on each item indicate higher levels of competitiveness.

    • Risk-taking focuses on physical and behavioral risks, measured by three items adapted from Dohmen et al. [57], recorded on a five-point scale: general risk-taking behavior, risk-taking when making financial decisions, and risk-taking with respect to recreational activities. Higher scores on each item indicate higher levels of risk-taking.

    • Independence is a personality trait characterized by a focus on the person as an individual, not as part of a community or group, which includes agency, self-confidence, self-determination, and decision-making ability, but not self-control or self-esteem [58]. Independence was based on three items, recorded on a five-point scale, adapted from Bakker et al. [59], Clark et al. [60], and Triandis et al. [61], asking respondents how important it is for them to be independent, how often they turn to others for help when in need, and how important it is for them to solve their problems independently. A higher score indicates a higher level of independence for all three items.

    • Communal is a trait characterized by a focus on the individual as part of a group or community, an orientation toward relationships and a concern for others’ needs and well-being [62]. We used four items, scored on a five-point scale, asking respondents how often they worry about what other people think about them, how often they take other people’s needs into account when making important decisions, how often friends talk to them about their problems, and how easy it is for them to spot when someone in a group is feeling uncomfortable. Item one was developed by the authors, item two was adapted from Clark et al. [60], and items three and four were adapted from Baron-Cohen and Wheelwright [63]. Higher scores indicate higher levels of communal orientation.

    • Expressive captures abilities to recognize and express emotions, such as sadness, anger, frustration, compassion, joy, or affection and includes aspects of emotional intelligence, i.e., individuals’ ability to recognize what they feel, manage those emotions, and use emotions in problem solving [64]. We used four items, recorded on a five-point scale, asking respondents how often they talk to friends about their problems, how easy it is for them to understand their own feelings, how easy it is for them to express what they are feeling, and how easy it is for them to ask other people for help when in need. Item one was developed by the authors, item two was adapted from Salovey and colleagues [65], item three was adapted from Gross and John [66], and item four was adapted from Clark and colleagues [60]. Higher scores indicate higher levels of expressivity.

  3. Gender Relations. Guided by 40 eligible scales identified in the literature search, gender relations were measured by three composite constructs: social support, discrimination, and quality of family relationships, consisting of 13 items and a single-item measure of personal income.

    • Social support captures perceived satisfaction with the type (physical, emotional, informational, and financial), the availability, and level of support a person might receive. Support may come from partners, relatives, friends, coworkers, health-care systems, the larger community, etc. Social support consists of four items recorded on a five-point scale measuring the availability and level of social support a person might receive. We asked respondents how often, within the past year, they had someone they could ask for advice, someone to show them love and affection, someone to help them with daily chores, and how often they felt lonely. Items one, two, and three were adapted from the ENRICHD Social Support Inventory [67], and item four was developed by the authors. Higher scores indicate higher levels of social support.

    • Discrimination refers to “systemic unfair treatment” and can occur at multiple levels. We measured micro or interpersonal discrimination. Specifically, we asked the respondents how often they had felt discriminated against because of their gender, in general, when getting hired, when at school, when receiving medical care, in other public settings, and in their family. The items were recorded on a five-point scale, with higher scores indicating more frequent experiences of gender discrimination.

    • Quality of family relationships measures experiences of harmony and conflict in familial settings, based on two self-developed items recorded on a five-point scale asking respondents how they would describe the quality of their relationship with close relatives (in the past year) and how often they had argued with close relatives (in the past year). Higher scores indicate higher levels of perceived quality in family relationships.

Initial testing of the questionnaire

Content validity of the first draft of the questionnaire was assessed by nine members of the author-group who had not taken part in the construction of the item-list, some with technical expertise in the construction of survey-questionnaires and some with expertise in gender-related aspects of health. Each coauthor was asked to rate each item with respect to its relevance in measuring a given variable construct. Ratings were made on a four-point scale (4 = very relevant, 3 = relevant but needs minor alteration, 2 = unable to assess relevance without major revision, 1 = not relevant). These coauthors were also invited to suggest additional items, if they considered the proposed item-list inadequate in capturing a given construct. To improve item quality, we conducted seven cognitive interviews with people recruited by posters in the local area who varied on demographic markers such as education-level, job, age, ethnicity, and gender. In the interviews, we used verbal probing techniques to identify questions that the interviewees found vague and unclear and to elicit how they arrived at answers to the questions. We did this by asking them to reflect on what the items meant to them, how they would rephrase the question in their own words, how they came up with their answers to the questions, and whether the questions were easy or hard to answer and why.

Survey participants

Participants were recruited from the USA through two online services and a health research registry: Prolific, Amazon Mechanical Turk, and the Stanford Research Registry, which consists of ~ 4000 former adult patients at the Stanford University Medical Center who have agreed to be contacted for participation in research studies. We used the web-based Qualtrics software to collect data from Prolific (sample 1) in August and September 2017, Mechanical Turk (sample 2) in December 2017 and January 2018, and the Stanford Research Registry in May and June 2017 (sample 3). Sample 1 consisted of 2051 respondents; 1992 completed the survey. Sample 2 consisted of 2135 respondents; 2043 completed the survey. Sample 3 consisted of 489 respondents; 452 completed the survey. Sample characteristics are presented in Table 2.

Table 2 Sample characteristics

Procedures

Self-rated health was assessed using the Health-Related Quality of Life Core Module (CDC HRQoL-4). This module consists of four items about perceived general health, recent physical health, recent mental health, and recent activity limitations. The ordinal question about perceived general health did not meet the assumption of proportionality of odds required for ordered logistic regressions. Hence, we dichotomized the item into (1) fair or poor and (2) good, very good, or excellent. We measured current smoking and current vaping by number of cigarettes smoked per day and number of times vaping per day. These variables were dichotomized in the analysis (not smoking = 0, smoking = 1; not vaping = 0, vaping = 1) due to a high frequency of zero values (> 75%). Binge drinking was measured by the frequency of consuming five or more drinks on one occasion for males and four or more drinks on one occasion for females (within the last 3 months) [68]. We followed standard procedure and recoded these items into a unisex dichotomous variable (binge drinking less than monthly = 0, binge drinking monthly, weekly, or daily/almost daily = 1). BMI was calculated based on self-reported height and weight and dichotomized for analysis to reflect under or normal weight (BMI < 25 = 0) and overweight or obese (BMI ≥ 25 = 1) (item phrasing and response options for the health questions are reported in Table S5). Specifications on the nine demographic covariates (including item phrasing and response options) used in the regressions are presented in Table S6.
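The dichotomization rules described above can be sketched in Python as follows. This is a minimal illustration, not the SPSS procedure used in the study: the function names, the metric units, the response labels, and the direction of the 0/1 coding for general health are assumptions made for the example (the actual item phrasing and response options are reported in Table S5). Only the cut-offs come from the text.

```python
"""Illustrative recoding of the health outcomes into binary variables.

All names, units, and response labels are hypothetical; only the
cut-offs (BMI >= 25, any daily use, monthly-or-more binge drinking,
fair/poor health) are taken from the text.
"""

# Assumed response labels for the binge-drinking frequency item.
BINGE_MONTHLY_OR_MORE = {"monthly", "weekly", "daily or almost daily"}

# Assumed response labels for the general health item.
FAIR_OR_POOR = {"fair", "poor"}


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index from self-reported weight and height (metric units assumed)."""
    return weight_kg / height_m ** 2


def overweight_or_obese(bmi_value: float) -> int:
    """0 = under or normal weight (BMI < 25), 1 = overweight or obese (BMI >= 25)."""
    return 1 if bmi_value >= 25 else 0


def any_daily_use(times_per_day: float) -> int:
    """0 = not smoking/vaping, 1 = smoking/vaping (any reported daily use)."""
    return 1 if times_per_day > 0 else 0


def binge_drinking(frequency: str) -> int:
    """0 = less than monthly, 1 = monthly, weekly, or daily/almost daily."""
    return 1 if frequency.lower() in BINGE_MONTHLY_OR_MORE else 0


def fair_or_poor_health(rating: str) -> int:
    """1 = fair or poor, 0 = good, very good, or excellent (coding direction assumed)."""
    return 1 if rating.lower() in FAIR_OR_POOR else 0


if __name__ == "__main__":
    # Made-up respondent: 70 kg, 1.75 m, non-smoker, binge drinking weekly.
    print(overweight_or_obese(bmi(70.0, 1.75)), any_daily_use(0), binge_drinking("weekly"))
```

Keeping each rule in its own small function makes the thresholds easy to audit against the text and to change if a study uses different cut-offs.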

Exploratory factor analysis

We opted to start our analysis with EFA, rather than CFA, because we considered EFA to be the most appropriate first step for a survey measure of this novelty. While the systematic review allowed us to distill core gender-related attitudes and behaviors, we were uncertain how many latent factors would emerge in the subsequent testing. Moreover, we were uncertain how several of the items would distribute across factors (e.g., the items for time-use and quality in family relationships) and we wanted to leave open the possibility that various items would cross-load onto factors other than their parent factor, potentially leading to fewer latent factors than initially expected. As described in the results section, this happened to be the case. In addition, since we had the opportunity and resources to collect three survey samples, a complementary approach combining EFA and CFA allowed us to benefit from advantages of each method.

The EFA was based on iterated principal axis factoring as the extraction method and Promax (oblique) rotation to allow for correlated factors. Exploratory factor analysis and statistical analyses were done in SPSS. In the EFA, we examined questions from all three gender categories in a common factor model with multiple factors. All respondents with missing data for relevant items were removed from sample 1 prior to the analysis. To allow for analysis of the largest possible sample, 10 items targeting caregivers and employees were recoded so that people not currently caring for someone in need or not currently employed (or employed in the past) were ascribed the value 1, which represents no strain due to caregiving or work (see Table S7).
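The recoding of caregiver and work items can be illustrated with a short Python sketch. The item names and the `applies` flag are hypothetical; the only element taken from the text is that respondents to whom a block of items does not apply receive the value 1, which represents no strain.

```python
from typing import Dict, List, Optional

NO_STRAIN = 1  # lowest scale value, used for respondents the block does not apply to


def recode_not_applicable(
    row: Dict[str, Optional[int]], items: List[str], applies: bool
) -> Dict[str, Optional[int]]:
    """Return a copy of `row` with `items` set to NO_STRAIN when the block does not apply.

    `applies` is True for current caregivers (or current/former employees),
    in which case the recorded answers are left untouched.
    """
    recoded = dict(row)
    if not applies:
        for item in items:
            recoded[item] = NO_STRAIN
    return recoded


if __name__ == "__main__":
    # Hypothetical item names; a respondent who is not a caregiver skipped these items.
    caregiver_items = ["care_emotional", "care_physical", "care_worry"]
    respondent = {"care_emotional": None, "care_physical": None, "care_worry": None}
    print(recode_not_applicable(respondent, caregiver_items, applies=False))
```

One consequence of this choice, which the sketch makes visible, is that non-caregivers and caregivers reporting no strain become indistinguishable on these items.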

We subjected the 44 questionnaire items to EFA in sample 1. Velicer’s minimum average partial test suggested a 7-factor structure [69] (Table S8; the scree plot, communalities, and unique variances are presented in Figure S2 and Table S9). The conceptual clarity of this solution also best resembled the thematic gender dimensions identified in the literature review. This solution retained 35 of the 44 items subjected to EFA and explained 51% of the variance in item scores. For purposes of interpretability, we excluded all items with loadings below 0.40.

Factor one in this solution includes six items addressing perceived discrimination. Factor two consists of seven items capturing daily time spent on work and work strain-related characteristics. Factor three encompasses four items concerning perceived strain and time-use related to caregiving. Factor four includes five items capturing competitive and risk-taking behavior. Factor five encompasses three items concerning perceived social support. Factor six includes six items capturing empathy and expressive behavior. Finally, factor seven includes four questions about independence. The distribution of items on factors was consistent across alternative factor rotation methods (Tables S10-S13).

Confirmatory factor analysis

The CFA was carried out in SPSS AMOS Graphics 26 and based on maximum likelihood estimations. We allowed the factors to be correlated. The likelihood ratio test (also known as the χ2 test) is highly sensitive to even small departures of the data from exact fit, especially in large-N samples. Moreover, χ2 values increase with sample size and the number of variables in the model [70]. Therefore, we followed Cheung and Rensvold [71] and Yuan and Bentler [72] and determined global model fit and invariance based on the approximate-fit statistics. Specifically, we used the Tucker-Lewis Index (TLI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR), relying on conventional fit criteria [73]. The 35 items and 7 factors retained in the EFA were submitted to CFA in samples 1, 2, and 3. Following Gerbing and Hamilton [74], we used CFA to refine the EFA solution identified in sample 1. Next, we cross-validated the outcomes of this CFA in samples 2 and 3. Configural invariance was examined separately in samples 2 and 3. We used multiple-groups CFA to assess metric and scalar invariance across samples 2 and 3. We removed all observations with missing data for one or several of the 35 items subjected to CFA in sample 1, and for the 25 items subjected to CFA in samples 2 and 3. This slightly reduced sample 1 from n = 2051 to n = 2009, sample 2 from n = 2135 to n = 2054, and sample 3 from n = 489 to n = 449. One item concerning perceived gender discrimination in education had high rates of missing data in sample 3 (N > 100). Hence, we restricted the cross-validation in sample 3, and the multiple-groups CFA in samples 2 and 3, to the remaining 24 variables. To avoid a Heywood estimate on the Social support factor, we followed Chen et al. [75] and constrained the error variance of one item (socsupchores) to 0.001 in samples 2 and 3.

The initial 35-item solution based on the EFA did not perform satisfactorily with respect to global model fit in sample 1 (χ2 = 5897.38, df = 539, p = 0.00, TLI = 0.81, RMSEA = 0.07, SRMR = 0.07). To obtain acceptable model fit, we examined the factor loadings, removing all items with loadings ≤ 0.5. The “trimmed” solution, consisting of 25 items, exhibited good fit to the data (χ2 = 1362.53, df = 254, p = 0.00, TLI = 0.95, RMSEA = 0.05, SRMR = 0.04) (Table S14). The cross-validation of this final model in samples 2 and 3, with no equality constraints, also exhibited reasonable fit to the data, indicating configural invariance (sample 2: χ2 = 1440.7, df = 255, p = 0.00, TLI = 0.94, RMSEA = 0.05, SRMR = 0.04; sample 3: χ2 = 496.5, df = 232, p = 0.00, TLI = 0.93, RMSEA = 0.05, SRMR = 0.05) (Table S15).
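As a rough plausibility check on the single-sample fit statistics above, RMSEA can be recomputed from χ2, the degrees of freedom, and the sample size. The sketch below uses one common single-group formula, sqrt(max(χ2 - df, 0) / (df · (N - 1))). Whether AMOS divides by N or N - 1 is not stated in the text, and taking N as the complete-case sample sizes (2009, 2054, and 449) is an assumption.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA for a single-group model.

    Uses sqrt(max(chi2 - df, 0) / (df * (n - 1))); some software uses n
    instead of n - 1, which changes the result only marginally at these
    sample sizes.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


if __name__ == "__main__":
    # chi2, df, and assumed complete-case N for each reported model.
    models = {
        "sample 1, initial 35 items": (5897.38, 539, 2009),
        "sample 1, trimmed 25 items": (1362.53, 254, 2009),
        "sample 2, trimmed": (1440.7, 255, 2054),
        "sample 3, trimmed": (496.5, 232, 449),
    }
    for label, (chi2, df, n) in models.items():
        print(f"{label}: RMSEA = {rmsea(chi2, df, n):.2f}")
```

Under these assumptions the recomputed values round to the RMSEA figures reported for the four single-sample models. The multiple-groups models involve a group-count adjustment and are not covered by this formula.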

A more restricted multiple-groups CFA with factor loadings assumed to be equal across samples 2 and 3 also supported metric invariance (χ2 = 1899.849, df = 481, p = 0.00, TLI = 0.94, RMSEA = 0.03, SRMR = 0.04) (Table S16). Further restrictions with both factor loadings and intercepts assumed to be equal across samples 2 and 3 also indicated reasonable fit compared to prior models, supporting scalar invariance (χ2 = 2167.051, df = 505, p = 0.00, TLI = 0.93, RMSEA = 0.04, SRMR = 0.04) (Table S17). As a sensitivity check, we also ran the multiple-groups CFA in samples 2 and 3 with the item on perceived gender discrimination in education included (sample 2, n = 2054; sample 3, n = 348) and obtained comparable model fit (metric invariance: χ2 = 2149.723, df = 528, p = 0.00, TLI = 0.93, RMSEA = 0.04, SRMR = 0.04; scalar invariance: χ2 = 2380.150, df = 553, p = 0.00, TLI = 0.93, RMSEA = 0.04, SRMR = 0.04) (Tables S18-S19).

Reliability

The reliabilities of the factors implied by the final CFA solution were assessed in all samples using Raykov’s ρ, which we computed using James Gaskin’s “Validity master tool” [76]. Following conventional criteria [77], we considered values > 0.60 desirable.
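For readers who want to check a reliability coefficient of this kind by hand, the sketch below computes a composite reliability from standardized factor loadings, assuming uncorrelated errors so that each item's error variance is 1 - λ². It is not the spreadsheet tool cited above, and the loadings in the usage line are made-up values rather than entries from Table 3.

```python
from typing import Sequence


def composite_reliability(loadings: Sequence[float]) -> float:
    """Composite reliability from standardized loadings.

    rho = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances),
    with each error variance taken as 1 - loading^2 (uncorrelated errors assumed).
    """
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1.0 - lam ** 2 for lam in loadings)
    return squared_sum / (squared_sum + error_variance)


if __name__ == "__main__":
    # Hypothetical three-item factor with equal standardized loadings.
    rho = composite_reliability([0.7, 0.7, 0.7])
    print(f"rho = {rho:.3f}; meets the > 0.60 criterion: {rho > 0.60}")
```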

Results

Table 3 reports the factor loadings, Raykov’s ρ, and item scoring for the trimmed CFA model in samples 1, 2, and 3, with seven variables. The 7 factors listed above represent our final gender-related variables. Our analyses yielded low to moderate inter-factor correlations (Table S20), which is not unusual in multidimensional gender measures [26, 78, 79]. We calculated mean-item subscale scores for each factor and used these as predictors in the regressions presented below. For subscales including continuous variables, mean-item scores were calculated based on standardized variables (z-scores). All variables are scored from lower to higher levels of the given constructs.
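The scoring step can be sketched as follows. This is an illustration under stated assumptions: item names are hypothetical, z-scores use the sample standard deviation (n - 1), and every respondent is assumed to have complete data on the subscale's items.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_scores(values: List[float]) -> List[float]:
    """Standardize one item across respondents (sample SD, an assumption)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def mean_item_scores(items: Dict[str, List[float]], standardize: bool = False) -> List[float]:
    """Mean-item subscale score per respondent.

    `items` maps an item name to its responses, one entry per respondent,
    in the same respondent order for every item. Set `standardize=True`
    for subscales that mix items measured on different metrics.
    """
    columns = [z_scores(col) if standardize else list(col) for col in items.values()]
    return [mean(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Hypothetical three-respondent example on a five-point scale.
    risk_taking = {
        "risk_general": [2, 4, 5],
        "risk_financial": [1, 3, 5],
        "risk_recreation": [3, 5, 5],
    }
    print(mean_item_scores(risk_taking))
```

Averaging rather than summing keeps each subscale on the metric of its items, which makes scores comparable across subscales with different numbers of items.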

Table 3 Factor loadings for the trimmed CFA models in Samples 1, 2 and 3 and Raykov’s ρ for each factor

Figure 1 displays the z-scores (averaged by group) for the 7 gender-related variables for respondents seeing themselves as men, women, and gender fluid/non-binary in sample one. The figure demonstrates the advantage of capturing specific gender-related behaviors and attitudes through multiple variables.

Fig. 1 Gender-related variables capturing specific behaviors and attitudes. The figure displays the z-scores for the seven gender-related variables for respondents seeing themselves as men (green), women (orange), and gender fluid/non-binary (grey) in sample 1 (N = 1893)

Associations with self-rated health and health-risk behaviors

Existing research shows notable sex differences in health-related quality of life, obesity, and risky health behaviors, such as smoking and heavy drinking [80,81,82]. Here, we examine the relevance of our gender-related variables in predicting self-rated health and health-risk behaviors.

Associations between our 7 gender-related variables, sex assigned at birth, self-reported gender identity and physical health, mental health, and activity limitations due to poor physical or mental health were analyzed in samples 1 and 2 using negative binomial regressions, with birth year, personal income, education level, race, and ethnicity as covariates. Only associations that are consistent across samples 1 and 2 are reported here. Measures of reported birth sex and gender identity were highly correlated. Therefore, we ran all models twice: once including birth sex and once including gender identity. Here, we report the outcomes of the models including birth sex as a key predictor (see Tables S21-S23 for specifications on the regression models including gender identity).

As displayed in Table 4, caregiver strain and discrimination were associated with lower physical health, mental health, and activity levels in both samples, whereas social support was associated with higher mental health and activity levels in both samples. The results for the remaining associations were inconclusive in one or both samples.

Table 4 Adjusted incidence rate ratios and 95% confidence intervals of associations with health-related measures in negative binomial regressions

Associations between our gender-related variables, sex assigned at birth, self-reported gender identity and general health status, smoking, vaping, binge drinking, and BMI were analyzed using logistic regressions in samples 1 and 2 (Table 5, adjusted for year of birth, personal income, education level, ethnicity, and race), and we ran separate models with sex and gender identity (see Tables S24-S28 for specifications on the regression models including gender identity). In both samples, caregiver strain, discrimination, and male birth sex were associated with fair or poor self-rated health, while risk-taking and social support predicted good, very good, or excellent self-rated health. Further, caregiver strain and work strain were associated with smoking, while discrimination was associated with vaping, and higher levels of risk-taking were associated with binge drinking. Caregiver strain, low levels of risk-taking, discrimination, and male birth sex were associated with overweight. The results for the remaining associations were inconclusive in one or both samples.

Table 5 Adjusted odds ratios and 95% CIs of associations with health-related measures in logistic regressions

Combining the data from samples 1 and 2 (and adjusting for sample in the regression models), all associations with self-rated health and health-risk behaviors reported above persisted at the 99.9% confidence level (Figs. 2 and 3), as did the following associations: emotional intelligence with smoking and male birth sex with vaping and binge drinking.
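Incidence rate ratios and odds ratios of the kind reported here are obtained by exponentiating regression coefficients. A minimal sketch of that step, assuming a Wald-type interval computed on the log scale (the text does not state how the intervals were derived) and using a made-up coefficient and standard error:

```python
import math
from statistics import NormalDist
from typing import Tuple


def ratio_with_ci(beta: float, se: float, level: float = 0.999) -> Tuple[float, float, float]:
    """Exponentiate a log-scale coefficient into a ratio with a Wald-type interval.

    Works for both odds ratios (logistic regression) and incidence rate
    ratios (negative binomial regression). Returns (ratio, lower, upper).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 3.29 for a 99.9% interval
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Hypothetical coefficient and standard error, not values from the study.
    ratio, lower, upper = ratio_with_ci(beta=0.30, se=0.05)
    persists = lower > 1.0 or upper < 1.0  # interval excludes the null value of 1
    print(f"ratio = {ratio:.2f}, 99.9% CI [{lower:.2f}, {upper:.2f}], persists: {persists}")
```

An association "persists" at a given confidence level in this sense when the interval excludes 1, which is why the wider 99.9% intervals are a stricter criterion than the 95% intervals in Tables 4 and 5.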

Fig. 2 Adjusted incidence rate ratios of associations with recent physical health, mental health, and activity limitations. This figure displays the outcomes of the negative binomial regressions predicting health outcomes in the combined sample (sample 1 + sample 2) (physical health, N = 3879; mental health, N = 3880; activity limitations, N = 3876). Error bars represent 99.9% confidence intervals. See Tables S37-S39 for model specifications

Fig. 3 Adjusted odds ratios of associations with health status, smoking, vaping, binge drinking, and BMI. This figure displays the outcomes of the binary logistic regressions predicting health outcomes in the combined sample (sample 1 + sample 2) (health status, N = 3894; smoking, N = 3892; vaping, N = 3891; binge drinking, N = 3887; BMI, N = 3851). Error bars represent 99.9% confidence intervals. See Tables S40-S44 for model specifications

Discussion

Following a comprehensive review of gender measures from 1975 to 2015, we have applied a rigorous process to identify key aspects of gender for the purpose of developing a new gender assessment tool for use in clinical and population research, including large-scale health surveys involving diverse Western populations. Through exploratory and confirmatory factor analyses, we reduced the original 44 survey items to 25 (Table S45) and the 11 original constructs to 7 gender-related variables: caregiver strain, work strain, independence, risk-taking, emotional intelligence, social support, and discrimination.

Each variable seeks to capture an important aspect of gender within the populations studied. Each variable measures an individual participant’s self-reported behavior or attitude for that characteristic and is designed to be scored individually, as a distinct human behavior or attribute. Behaviors are not coded “masculine” or “feminine,” and we recommend against consolidating the variable scores into these unipolar or bipolar indices. Studies that reduce gender-related variables to “femininity” or “masculinity” scores give little guidance for behavioral interventions. For example, if caregiver strain is found to be associated with higher risk of recurrence or death in patients with ACS, it should be reported as such. Subsuming gender-related factors into masculine or feminine indices will reduce, rather than improve, the precision and applicability of survey-based measures of health.
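The recommendation to score each variable on its own, rather than collapsing them into a single index, can be illustrated with a short sketch. The item identifiers, the item-to-variable mapping, the response values, and the use of an item mean are all hypothetical assumptions for illustration; the actual items are listed in Table S45.

```python
from statistics import mean

# Hypothetical mapping from variables to item identifiers (not the real items).
ITEM_MAP = {
    "caregiver_strain": ["cs1", "cs2", "cs3"],
    "work_strain": ["ws1", "ws2"],
    "risk_taking": ["rt1", "rt2", "rt3"],
}


def score_variables(responses, item_map=ITEM_MAP):
    """Return one score per gender-related variable, never a combined index.

    A variable is scored as the mean of its items; if any of its items is
    missing, the score is None instead of being imputed.
    """
    scores = {}
    for variable, items in item_map.items():
        values = [responses.get(item) for item in items]
        if any(value is None for value in values):
            scores[variable] = None
        else:
            scores[variable] = mean(values)
    return scores


# Hypothetical participant who skipped item "rt3".
participant = {"cs1": 4, "cs2": 5, "cs3": 3, "ws1": 2, "ws2": 2, "rt1": 1, "rt2": 2}
profile = score_variables(participant)
print(profile)
```

Keeping the output as a profile of separate scores means each variable can enter a regression model as its own predictor.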

Moreover, the regression analyses (Figs. 2 and 3) suggest that the gender-related variables pertaining to norms (caregiver strain and work strain) and relations (discrimination and social support) have stronger correlations with self-rated health measures than the variables pertaining to gender-related traits (with risk-taking as an important exception). This finding aligns with extant research suggesting that institutional and interpersonal aspects of gender may be more important than individual traits and characteristics in shaping health and disease processes [17].

Clinicians and public health researchers may employ the Stanford GVHR to gain a fuller understanding of gender differences in health outcomes than can be captured by simply asking people to self-identify their sex or gender. It might be used, for example, in chronic pain treatment or osteoporosis research, each of which has robust gender components [83]. Specifically, we recommend that researchers start by measuring all 7 variables alongside sex assigned at birth, self-reported gender identity, and other relevant factors, such as sexual orientation, ethnicity, age, income, and education, to specify which characteristics, traits, and behaviors may be predictive of the health issue in focus. In subsequent studies and patient-based outcome measures, the focus may be restricted to a subset of variables of documented relevance to a specific disease or health condition [84]. Our gender-related variables were developed to capture how gender norms, gender-related traits, and gender relations intersect to shape people’s health and disease (and vice versa). Researchers using our instrument are encouraged to be aware of broader institutional and cultural contexts that may influence patients’ gender roles and identities and health outcomes.

The Stanford GVHR is specific to the survey cohorts, time, place, and culture in which it was developed and tested. Because gender norms, traits, and relations vary across and within cultures and change over time [85], we recommend that variables be updated with each generation or even more frequently. Obvious limitations are that our variables were developed from English-language literature and validated in nonprobability samples of US populations, recruited online and through a research registry. In addition, the age composition of our samples (median ages range from 36 to 50 years) does not represent typical patient populations encountered in clinical practice. We strongly encourage more research to test the validity of the gender-related variables within and across age groups and cultures, as well as across a wide range of global settings.

An association of a gender-related variable with a health outcome does not imply causality. Our assessment was cross-sectional. Confounding or even reverse causality (health phenotypes affecting gender-related behaviors and attitudes) should be considered. Another limitation is the small number of items per construct, which might limit generalizability; at the same time, a smaller number of items likely increases the usefulness of the questionnaire to practitioners and researchers. An avenue for further research is the expansion of the number of items for each construct, especially emotional intelligence, which had less than ideal reliability across all three samples. In addition, a few of the initial gender-related constructs that were not retained in the exploratory factor analysis, such as competition and quality of family relationships, may have been deselected due to insufficient items in the study’s initial pool of attitudes and behaviors and should be reconsidered in future research.
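The trade-off between few items and reliability can be checked directly when new items are piloted. The sketch below computes Cronbach's alpha, one common internal-consistency coefficient; whether this is the coefficient behind the reliability figures discussed here is an assumption, and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one construct.

    item_scores -- list of items, each a list with one score per respondent
                   (all items must cover the same respondents, in order).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per respondent across the construct's items.
    totals = [sum(row) for row in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses of four participants to a two-item construct.
two_items = [[1, 2, 3, 4], [2, 1, 4, 3]]
alpha = cronbach_alpha(two_items)
print(f"alpha = {alpha:.2f}")
```

Alpha generally rises as correlated items are added, which is one reason a construct with very few items may show lower reliability.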

Perspectives and significance

This project represents an approach toward developing more comprehensive and precise survey-based measures of gender in relation to health. In the future, the proposed list of variables could be expanded to measure other gender-related factors and to place more emphasis on gender relations, e.g., by integrating factors such as decision-making power (including over household resources and health expenditures) and the distribution of domestic labor in families and among both same- and different-sex cohabiting or romantic partners. Indeed, many of the measures that have informed our work may be outdated (some date back to the 1970s and 1980s). It may be desirable to complement our proposed list of variables with more “timely” variables that better represent how specific patients and persons conceive of gender in 2020. It would also be interesting to explore associations between our gender-related variables and other health-related aspects such as health literacy, health-seeking behavior, and provider-patient interactions.

Conclusion

Our questionnaire is designed to shed light on how specific gender-related behaviors and attitudes contribute to health and disease processes, irrespective of—or in addition to—biological sex and self-reported gender identity. Use of these gender-related variables in experimental studies, such as clinical trials, may also help us understand if gender factors play an important role as treatment effect modifiers and would thus need to be further considered in treatment decision-making.