Conducting gender-based analysis of existing databases when self-reported gender data are unavailable: the GENDER Index in a working population

Objectives Growing attention has been given to considering sex and gender in health research. However, this remains a challenge in the context of retrospective studies where self-reported gender measures are often unavailable. This study aimed to create and validate a composite gender index using data from the Canadian Community Health Survey (CCHS). Methods According to scientific literature and expert opinion, the GENDER Index was built using several variables available in the CCHS and deemed to be gender-related (e.g., occupation, receiving child support, number of working hours). Among workers aged 18–50 years who had no missing data for our variables of interest (n = 29,470 participants), propensity scores were derived from a logistic regression model that included gender-related variables as covariates and where biological sex served as the dependent variable. Construct validity of propensity scores (GENDER Index scores) was then examined. Results When looking at the distribution of the GENDER Index scores in males and females, sex and GENDER Index scores appeared related but partly independent. Differences in the proportion of females appeared between groups categorized according to the GENDER Index score tertiles (p < 0.0001). Construct validity was also examined through associations between the GENDER Index scores and gender-related variables identified a priori such as choosing/avoiding certain foods because of weight concerns (p < 0.0001), caring for children as the most important thing contributing to stress (p = 0.0309), and ability to handle unexpected/difficult problems (p = 0.0375). Conclusion The GENDER Index could be useful to enhance the capacity of researchers using CCHS data to conduct gender-based analysis among populations of workers.


Introduction
Despite growing attention given to the importance of considering sex and gender in health research (Johnson et al. 2009; Day et al. 2017; McGregor et al. 2016; Pilote and Humphries 2014), these terms are still used inconsistently and interchangeably in the literature (Vissandjee et al. 2016; Boerner et al. 2018). Whereas sex refers to a set of biological attributes and is associated with physical and physiological features (CIHR 2018), gender can be defined as socially constructed roles, behaviours, expressions, and identities of girls, women, boys, men, and gender diverse people (CIHR 2018). Gender is an important construct to examine as it influences how people perceive themselves and each other, how they act and interact, and the distribution of power and resources in society (CIHR 2018).
Measurement of biological sex is relatively straightforward (male, female, intersex) and is usually included as a variable in clinical and epidemiological studies (Vissandjee et al. 2016). As for gender, some validated self-report indexes are available for the measurement of selected gender constructs in prospective studies (e.g., gender roles, identity, relations) (Nanda 2011; McHugh and Hanson Frieze 1997; Shulman et al. 2017; Kachel et al. 2016; Bem 1974). However, many large administrative databases or surveys do not include gender measures, mostly because their collection was not planned from the outset. The secondary analysis of such data sources is, nonetheless, indispensable to enriching our understanding of health trajectories, healthcare utilization, and real-world risks and benefits of drugs among large populations (Schneeweiss and Avorn 2005; Tamblyn et al. 1995; Bernatsky et al. 2013; Hashimoto et al. 2014).
Even if researchers have the opportunity to include various gender-related variables in multivariate modeling of various health outcomes (examples of gender-related variables include time spent on child care, occupation, number of working hours, types of leisure activities, stress (Bekker 2003)), the calculation of a single composite score is a statistically efficient option (Glynn et al. 2006). Various approaches have been proposed to derive composite gender indexes using existing data (Lippa and Connelly 1990; Pelletier et al. 2015; Smith and Koehoorn 2016; Canadian Institutes of Health Research 2017). For example, Smith and Koehoorn (2016) assigned a numerical value to each response category of four gender-related variables available in the Canadian Labour Force Survey (responsibility for caring for children, occupation, number of hours of work, and level of education). They then created a gender score by summing these variables (Smith and Koehoorn 2016). Although the proposed approach was simple and the resulting gender index showed face validity and sensitivity to change, the method was subjective since assumptions and categorizations were made about which answers were more feminine or more masculine. In contrast, other statistical approaches may be used to minimize researchers' subjectivity surrounding the processing of variables for the computation of a composite index. Using gender-related variables available in the GENESIS-PRAXY cardiovascular study, Pelletier et al. (2015) derived a gender score using a principal component analysis and a logistic regression model where sex served as the dependent variable for the calculation of a propensity score.
The Canadian Community Health Survey (CCHS) is a rich source of detailed self-reported information about the health status, health risk factors, and use of healthcare services among Canadians (Statistics Canada 2012), and its secondary analysis is of great value for research purposes (Sanmartin et al. 2016; Raina et al. 1999; Yergens et al. 2014). However, the CCHS does not contain questions about gender, thus limiting the usefulness of the survey data for researchers interested in the topic and its relation to the health of Canadians. Moreover, to the best of our knowledge, a composite gender index has not been derived using the CCHS data. The aim of this study was to create and validate a composite gender index, namely the GENDER Index, using selected variables available from the CCHS.

Data source
The current study was conducted using the TORSADE Cohort (TrajectOiRes SAnté - Données Enrichies), an infrastructure of the Quebec SUPPORT Unit (Support for People and Patient-Oriented Research and Trials). This database was created with the aim of better understanding healthcare trajectories associated with ambulatory care sensitive conditions. This cohort of 60,791 individuals living in the province of Quebec results from the linkage between data from Statistics Canada's CCHS (questionnaires 2007-2008, 2009-2010, and 2011-2012) and those of the administrative longitudinal databases (1996 to 2016) held by the Régie de l'assurance maladie du Québec (RAMQ). Authorization was granted by the Commission d'accès à l'information du Québec before data linkage, and approval was obtained from the relevant university Research Ethics Boards.
The CCHS collects data about the health of individuals of at least 12 years of age living in the ten Canadian provinces and the three territories (probability sampling) (Statistics Canada 2012). Not included are individuals living on Aboriginal reserves, full-time members of the Canadian Forces, institutionalized individuals, or persons living in the Quebec regions of Nunavik and Terres-Cries-de-la-Baie-James (altogether less than 3% of the Canadian population). CCHS response rates are high (69.8-78.9% depending on the cycle (Sanmartin et al. 2016)), response rates are similar in the province of Quebec vs the whole of Canada (Statistics Canada 2010a), and test-retest reliability of the answers to several questions has been well demonstrated (Raina et al. 1999). The TORSADE cohort contains data of all CCHS participants who agreed to share their data with Quebec's Statistics Institute and consented to data linkage (92.8% of CCHS participants) (Institut de la statistique du Québec 2018). In the 2007-2008, 2009-2010, and 2011-2012 CCHS questionnaires, biological sex was measured as a dichotomous variable (male vs female) without a "do not know" option.
For the following reasons, only the CCHS variables were considered for the creation of the GENDER Index: (1) the CCHS database is much richer than the Quebec administrative ones in terms of potentially gender-related socio-economic information, (2) the calendar date of the CCHS questionnaire is often defined as the index date in studies using the TORSADE Cohort, which makes it more logical to calculate gender scores at the date of completion of the questionnaire, and (3) Quebec administrative databases are not always available to researchers in other Canadian provinces who work with CCHS data.

Identification of gender-related variables
Potentially gender-related CCHS variables were screened based on the following: (1) the Multi-Facet Gender and Health Model (Bekker 2003), (2) the different gender constructs proposed by Johnson et al. (2009) (gender roles, gender identity, gender relations, and institutionalized gender), and (3) a review of variables considered in studies that derived composite gender indexes using other administrative/existing survey data (Lippa and Connelly 1990; Pelletier et al. 2015; Smith and Koehoorn 2016). Three members of the study team (one with expertise in the field of sex and gender, two in the field of epidemiology and biostatistics) discussed and reached a consensus about relevant CCHS variables. A very conservative approach was used at this point and all potentially relevant variables were considered (see Table 1). However, to be eligible, variables had to be measured in the three cycles of the CCHS (questionnaires 2007-2008, 2009-2010, and 2011-2012), be collected in the Canadian province of Quebec, and have ≤ 15% missing values (cut-off above which missing values can be considered problematic (Fox-Wasylyshyn and El-Masri 2005)). Although healthcare resources and medication use can be gender-related (Bekker 2003), they were not retained for the creation of the GENDER Index because such variables are expected to be important outcomes of future epidemiological and pharmacoepidemiological research projects conducted using the TORSADE Cohort or CCHS data.
The selection process led to a total of 19 candidate variables (Table 1). According to the literature, occupational characteristics are important gender-related variables to be considered in the creation of a gender index (Bekker 2003), and CCHS work-related variables are measured among participants aged 18-50 years. A back-and-forth process between our modelling and our results also suggested that occupational characteristics were among the most important variables for the creation of the GENDER Index. For these reasons, the current study was conducted in the sample of participants employed in the past 12 months and aged 18-50 years. Aboriginal status was not included in the GENDER Index because none of the participants reported being Aboriginal.

Creation of the GENDER Index
The GENDER Index was derived using a propensity scoring approach. This approach was inspired by the work of Pelletier et al. (2015) that was endorsed by the Canadian Institutes of Health Research (CIHR) in their online training modules on integrating sex and gender in health research (Canadian Institutes of Health Research 2017).
The GENDER Index composite scores were derived following these steps. First, collinearity was explored among all the candidate variables using variance inflation factors (VIF) (O'Brien 2007) and parametric or non-parametric independent samples tests (according to the type and distribution of variables). All VIF values were below the cut-offs suggested for detecting multicollinearity (VIF greater than 5 or 10 (Vatcheva et al. 2016)). Since none of the variables explained another variable entirely or almost entirely, no exclusions were applied at this point (Table 1). All candidate variables were then included as independent variables (covariates) in a multiple logistic regression model for which biological sex served as the dependent variable (female = 1, male = 0). In such a multiple regression model, a propensity score can be derived for each participant, which can be defined as the conditional probability for a participant to have the outcome of interest given their observed covariates. Propensity score values can be added to the dataset as a new variable by adding a simple output command when running SAS® proc logistic. In our study, the probability of each respondent being female given the estimates from the logit model was calculated, which formed the propensity score and was included as a new variable in the dataset (i.e., the GENDER Index score). Higher scores on the 0-100 GENDER Index can be interpreted as a higher level of characteristics associated with being female/having more feminine characteristics.
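The propensity scoring step can be illustrated with a minimal sketch. The study itself used SAS® proc logistic; the Python code below is only an illustration under stated assumptions. It uses synthetic data, two hypothetical dummy covariates (`healthcare_sector`, `construction_sector`) rather than the CCHS variables, a hand-written gradient ascent fit to avoid external dependencies, and it assumes that the 0-100 scale is obtained by multiplying the predicted probability by 100. Sampling weights are ignored.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(beta, row):
    """Intercept plus the weighted sum of one participant's covariates."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def fit_logistic(X, y, learning_rate=1.0, n_iter=500):
    """Fit a logistic regression by batch gradient ascent on the mean log-likelihood.

    Returns [intercept, coefficient_1, ..., coefficient_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, target in zip(X, y):
            error = target - sigmoid(linear_predictor(beta, row))
            grad[0] += error
            for j, x in enumerate(row):
                grad[j + 1] += error * x
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(beta, X):
    """Predicted probability of sex = female, rescaled to 0-100 (assumed scaling)."""
    return [100.0 * sigmoid(linear_predictor(beta, row)) for row in X]


# Synthetic data only: hypothetical covariates and made-up prevalences.
random.seed(0)
sex, X = [], []
for _ in range(400):
    female = 1 if random.random() < 0.5 else 0
    healthcare_sector = 1 if random.random() < (0.40 if female else 0.10) else 0
    construction_sector = 1 if random.random() < (0.05 if female else 0.35) else 0
    sex.append(female)
    X.append([healthcare_sector, construction_sector])

beta = fit_logistic(X, sex)          # sex is the dependent variable (female = 1, male = 0)
scores = propensity_scores(beta, X)  # one illustrative index score per participant
```

In this toy setting, participants sharing the same covariate pattern receive the same score regardless of their sex, which is why scores can vary within each sex group.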
It should be acknowledged from the outset that using biological sex as the dependent variable in our regression model can be criticized because it merges the related but different concepts of sex and gender (Johnson et al. 2009). However, previous authors showed that even if biological sex was used to create a gender score (Lippa and Connelly 1990; Pelletier et al. 2015), the two variables appeared as related but partly independent in the analysis (e.g., great variability of gender scores within each sex). Pelletier et al. (2015) also argued that defining gender-related variables as psychosocial variables that differ between males and females is concordant with the literature which often refers to gender as roles, attitudes, opportunities, and expectations held by males and females.

Validity analysis
In addition to the calculation of descriptive statistics to summarize respondents' characteristics, analyses were undertaken to explore the validity of the GENDER Index among the TORSADE Cohort. Face validity is the extent to which the items/components of an index look as though they are an adequate reflection of the construct to be measured (Mokkink et al. 2010). This property was examined by measuring the associations between each gender-related variable included in the GENDER Index and the gender score itself using univariate linear regression analyses. Construct validity can be defined as the extent to which the scores of an index are consistent with hypotheses (e.g., internal relationships, relationships with scores of other instruments, differences between relevant groups) based on the assumption that the index validly measures the construct under study (Mokkink et al. 2010). Construct validity was thus assessed by (1) comparing the distribution of GENDER Index scores between males and females using overlapping histograms, (2) comparing the proportion of females between groups categorized according to the GENDER Index score tertiles (division of the ordered scores distribution into three parts, each containing a third of the population), and (3) examining the associations between GENDER Index scores and presumed gender-related variables that were not included in the creation of the GENDER Index, using univariate linear regressions (i.e., choice or avoidance of certain foods because of body weight concerns, ability to handle unexpected and difficult problems, caring for children as the most important thing contributing to feelings of stress). These variables deemed to be gender-related were not included in the GENDER Index because they were not available for all CCHS cycles. Finally, in order to test the impact of various methodological approaches on the validity of the GENDER Index, sensitivity analyses were conducted by reducing the number of variables to be included in the multiple logistic regression model used to create the GENDER Index using a backward elimination technique until all remaining variables had p values < 0.05 (an approach used by Pelletier et al. 2015). Data analyses were performed using SAS® (version 9.4, Cary, NC, USA). Appropriate CCHS sampling weights and bootstrap variance estimation procedures were used (Statistics Canada 2012).
Table 1 notes: In the table, shaded cells indicate that the variables were excluded during the selection process either because they were not measured in the three CCHS cycles or had > 15% missing values. a The current study was conducted in the sample of participants employed in the past 12 months. b Work-related variables were measured among participants aged 18-50 years. In the CCHS, workers' industry classification and occupational classification are both measured (North American Industry Classification System (NAICS) and National Occupational Classification (NOC)). For example, a participant could work in an organization of the trades/construction sector without having an occupation in the field (e.g., nurse or occupational health professional working for a mining company; secretary or accountant working for a construction company). Also, a participant could work in an organization of the healthcare sector without having an occupation in the field of health (e.g., janitor or human resource professional working in a healthcare centre). Classifications were recoded and simplified for the purpose of the current study to reflect sectors where substantial sex differences exist (Institut de la statistique du Québec 2011).
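The tertile comparison and the univariate linear regressions used in the validity analysis can be sketched as follows. This is an unweighted illustration on made-up toy values, not a reproduction of the study's analysis: it ignores the CCHS sampling weights and bootstrap variance estimation, and the function names and the `weight_concern` indicator are hypothetical.

```python
def tertile_groups(scores):
    """Assign each score to tertile 1, 2, or 3 according to its rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = 1 + (3 * rank) // n
    return groups


def proportion_female_by_tertile(scores, sex):
    """Proportion of females (sex coded female = 1, male = 0) in each tertile."""
    groups = tertile_groups(scores)
    result = {}
    for tertile in (1, 2, 3):
        members = [s for g, s in zip(groups, sex) if g == tertile]
        result[tertile] = sum(members) / len(members)
    return result


def univariate_ols(x, y):
    """Slope and intercept of the least-squares regression of y on a single x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy values for illustration only.
toy_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90]
toy_sex = [0, 0, 0, 0, 1, 0, 1, 1, 1]
toy_weight_concern = [0, 0, 1, 0, 1, 0, 1, 1, 1]  # hypothetical yes/no indicator

proportions = proportion_female_by_tertile(toy_scores, toy_sex)
slope, intercept = univariate_ols(toy_weight_concern, toy_scores)
```

With a binary explanatory variable, the slope equals the difference in mean score between the two response groups, which is one way to read the regression coefficients reported for such variables.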

Results
Among the 60,791 individuals of the TORSADE Cohort, a total of 29,470 (48.24%) participants employed in the past 12 months and aged 18-50 years had no missing data for any of the variables included in the GENDER Index. Characteristics of the study sample are presented in Table 2.
The multiple logistic regression model used to create the propensity scores (GENDER Index scores) and all variables that were considered are presented in Table 3. The categorization of gender-related variables led to a total of 43 dummy variables included in the model (c = 0.796). Our sample size respects the recommended ratio of 10 events per independent variable (Harrell et al. 1996). Sensitivity analyses revealed that the number of variables to be included in the multiple logistic regression model was not affected by the backward elimination technique.
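For readers less familiar with the two quantities mentioned above, the following sketch shows how a c statistic (the proportion of event/non-event pairs in which the event has the higher predicted score) and an events-per-variable ratio can be computed. The toy scores and the counts passed to `events_per_variable` are hypothetical; only the number of dummy variables (43) and the 10:1 rule come from the text.

```python
def c_statistic(scores, outcome):
    """Concordance statistic: share of (event, non-event) pairs ranked correctly.

    Ties in the score count as half a concordant pair.
    """
    events = [s for s, y in zip(scores, outcome) if y == 1]
    non_events = [s for s, y in zip(scores, outcome) if y == 0]
    concordant = 0.0
    for e in events:
        for q in non_events:
            if e > q:
                concordant += 1.0
            elif e == q:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def events_per_variable(n_events, n_non_events, n_parameters):
    """Size of the smaller outcome group divided by the number of model parameters."""
    return min(n_events, n_non_events) / n_parameters


# Toy example: three of the four event/non-event pairs are ranked correctly.
toy_c = c_statistic([0.2, 0.4, 0.6, 0.8], [0, 1, 0, 1])

# With 43 dummy variables, a 10:1 rule implies at least this many
# participants in the smaller outcome group.
minimum_events = 10 * 43

# Hypothetical counts, chosen only to show the calculation.
toy_epv = events_per_variable(430, 1000, 43)
```

A c statistic of 0.5 corresponds to no discrimination between males and females and 1.0 to perfect separation; a value between the two is what leaves room for variability of scores within each sex.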

Face validity of the GENDER Index
Results of univariate linear regression analyses measuring the associations between each variable included in the GENDER Index and the gender score itself are presented in Table 4. Associations (p < 0.05) were found for all variables except for ownership of the household (owner vs tenant), supporting the extent to which variables used to create the GENDER Index were relevant to the gender score. The six variables with the highest regression coefficients (β) were as follows: (1) having an occupation in the field of trades, transport, and equipment operators, related occupations, or occupations unique to primary industry, (2) receiving child support as the main source of household income, (3) working in an organization of the healthcare or social assistance sector, (4) having an occupation in the field of health, social science, education, government service, or religion, (5) working in an organization of the construction or manufacturing sector, (6) number of working hours per week.

Construct validity of the GENDER Index
The distribution of GENDER Index scores in males and females is represented in Fig. 1. According to this visual representation, sex and GENDER Index scores appeared related but partly independent (e.g., incomplete histogram overlap, variability of gender scores within each sex group). Differences were also found in the proportion of females between groups categorized according to the GENDER Index scores tertiles (tertile 1: 14.90% vs tertile 2: 36.84% vs tertile 3: 48.26%, p value < 0.0001).
Regarding associations between GENDER Index scores and presumed gender-related variables identified a priori and not included in the GENDER Index, univariate linear regression models revealed that choosing or avoiding certain foods because of body weight concerns (β 0.046, p < 0.0001) and caring for children as the most important activity contributing to feelings of stress (β 0.048, p = 0.0309) were associated with higher GENDER Index scores (presumed to represent more feminine characteristics). A greater ability to handle unexpected and difficult problems (β excellent vs poor − 0.093, p = 0.0375) was associated with lower GENDER Index scores (more masculine characteristics).

Discussion
To our knowledge, this is the first study to derive a composite gender index using CCHS data. Validity of an index can be defined as the extent to which all of the accumulated evidence supports the intended interpretation of the scores for the intended purpose (Streiner and Kottner 2014; AERA/APA/NCME 2014). Our results thus suggest that the GENDER Index could be useful to enhance the capacity of researchers using CCHS data on workers to conduct gender-based analysis in the absence of self-reported gender measures.
The GENDER Index development was intended to maximize its face validity. Almost all variables included in the GENDER Index also appeared to be important when they were examined in relation to the total score. Variables most related to the total score (occupation, receiving child support as the main source of household income, and number of working hours per week) were consistent with variables retained by other authors when creating composite gender indexes (responsibility for caring for children, occupation, number of hours of work (Smith and Koehoorn 2016), and hours per week doing housework (Pelletier et al. 2015)). The known-groups and convergent validity analyses also provide arguments in support of the construct validity of the GENDER Index.
The GENDER Index is a multidimensional composite score and was not intended to represent only one gender construct. When looking at the variables available in the CCHS and included in the index, some characteristics such as childcare responsibilities and type of work can relate to gender roles (behavioural norms applied to men and women) (Johnson et al. 2009). Race and interactions within social units can interact with gender relationships (how individuals interact with and are treated by others based on their ascribed gender) (Johnson et al. 2009). We can therefore argue that considering variables such as race and sense of belonging to the local community in the creation of the GENDER Index expands its multidimensional nature. Aspects related to institutionalized gender (how power and influence are distributed differently among men and women) (Johnson et al. 2009) were also represented through the inclusion of variables such as race, education, job limitations (e.g., stress at work), and access to resources such as money or food. Since marital status can be related to opportunities afforded to the genders (e.g., job opportunities) (Nadler and Kufahl 2014) and stress can be related to gender roles or gender identities (Jones et al. 2016; Eisler et al. 1988), such variables were also relevant to our work. Gender is an important construct to enhance our understanding of health determinants, disease courses, and treatment outcomes. In fact, it can be associated with important aspects surrounding both communicable and chronic diseases, such as experience and expression of physical symptoms (e.g., pain (Boerner et al. 2018)), health behaviours (e.g., vaccination (Vamos et al. 2018), treatment adherence (Sajatovic et al. 2011), alcohol or drug use (Lye and Waldron 1998)), coping strategies (Spendelow et al. 2018), and expectations (Bekker 2003). Using their composite gender score, Pelletier et al. (2015) found that, independently from biological sex, gender was associated with cardiovascular risk factors such as hypertension, diabetes, family history, depressive symptoms, and anxious symptoms. The same team also found an association between gender scores and serious health outcomes such as recurrence of acute coronary syndrome (Pelletier et al. 2016).
When analyzing administrative databases or existing survey data, researchers have the possibility to identify various gender-related variables and include them in multiple regression modeling of various health outcomes. However, the use of a composite gender score offers advantages. Such scores can be used for adjustment in multiple regression models, matching, and subgroup stratification (using measures of position such as tertiles) in order to better control confounding variables in observational studies (Glynn et al. 2006). Compared with the use of a set of gender-related variables, they provide greater statistical power by reducing the number of covariates included in multiple regression models, offer the possibility to test interaction terms, and reduce multiple comparisons (Glynn et al. 2006; Song et al. 2013).
Table notes: Italicized p values indicate statistically significant associations (p < 0.05). a Higher scores on the 0-100 GENDER Index can be interpreted as a higher level of characteristics associated with being female/having more feminine characteristics.

Limitations
First, it was not possible to examine the validity of the GENDER Index by comparing it with an existing validated gender assessment instrument since the CCHS does not include such a tool. It is also important to underline that the validity of the index should be further investigated in different populations (e.g., validation subsample or more recent CCHS cycles). Another limitation of our study has to do with the generalizability of the GENDER Index to age groups not included in the current study. Because occupational characteristics were important gender-related variables to be considered in the creation of a gender index, the GENDER Index could only be calculated in workers. Although this aspect is a major threat to our study's external validity, the GENDER Index could be useful for many researchers (e.g., in the field of occupational health). Further studies should explore the validity of indexes that can be calculated without considering occupational characteristics.

Conclusions
This investigation provides a methodological example for researchers who wish to conduct gender-based analysis of existing databases when self-reported gender data are unavailable. Despite the limitations of our study, the results support the value of the GENDER Index as a new tool to enhance the capacity of researchers using CCHS data to conduct gender-based analysis among populations of workers.
Funding information
This study was supported by the following: (1) the Canadian Institutes of Health Research (CIHR) (Personalized Health Catalyst Grants - Development of predictive analytic models: #PCG155479) and (2) the Quebec SUPPORT Unit (Support for People and Patient-Oriented Research and Trials), an initiative funded by CIHR, Ministère de la santé et des services sociaux du Québec, and Fonds de recherche du Québec - Santé.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
AERA/APA/NCME (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research
Fig. 1 Distribution of GENDER Index scores in men and women. Higher scores on the 0-100 GENDER Index can be interpreted as a higher level of characteristics associated with being female/having more feminine characteristics (Created with Excel software)