Sex differences in personality scores on six scales: Many significant, but mostly small, differences

This study examined sex differences in domain and facet scores from six personality tests in various large adult samples. The aim was to document differences in large adult groups which might contribute new data to this highly contentious area. We reported on sex differences on the Myers-Briggs Type Indicator (MBTI); the Five Factor NEO-PI-R; the Hogan Personality Indicator (HPI); the Motives and Values Preferences Indicator (MVPI); the Hogan Development Survey (HDS) and the High Potential Trait Indicator (HPTI). Using multivariate ANOVAs we found that whilst there were many significant differences on these scores, which replicated other studies, the Cohen’s d statistic showed very few (3 out of 130) differences >.50. Results from each test were compared and contrasted, particularly where they are measuring the same trait construct. Implications and limitations for researchers interested in assessment and selection are discussed.


Introduction
It has been observed that researchers have to be courageous to investigate or write about sex differences (Furnham, 2017). Even the terminology is a sensitive issue: the terms "man" and "woman" are typically used in reference to gender, whereas the terms "male" and "female" are used in reference to sex. In this paper we shall examine sex differences and refer to male and female.
What is most surprising in this complex research area is the radically different conclusions that researchers and reviewers reach on exactly the same topic. Early intelligence researchers put considerable effort into ensuring that tests showed minimal sex differences (Mackintosh, 1986; Mackintosh, 2011), though personality researchers seemed less concerned with evidence of sex differences. However, over the last 20 years there have been a great number of studies of gender differences in personality worldwide (Del Giudice, 2009; Del Giudice et al., 2012; Schmitt, 2015; Schmitt et al., 2008; Weisberg et al., 2011). Some have concentrated on particular group differences using clinical populations, different age groups, and different cultures, or on whether scores change much over time (Furnham & Cheng, 2019). Few have been interested in devising valid tests that minimise sex differences; most have instead tried to establish the size, and more importantly the cause, of the differences that currently exist.
There have also been meta-analyses in the area, some done many years ago. Thus, using tests of an earlier era that are now less well known and less used, Feingold (1994) concluded that males were more assertive and had higher self-esteem than females, who scored higher than males in extraversion, anxiety, trust, and, especially, tender-mindedness. There were no sex differences in social anxiety, impulsiveness, activity, locus of control, and orderliness.
Some argue that even if there are small, actual, verifiable (not artefactual) sex differences, they should not be explored or explained because of the divisive personal and social effect that this can have on both sexes. Others believe there are important explicable reasons for sex differences which warrant scientific description and explanation (Buss, 1995; Eagly, 1995; Furnham, 2017).
Another curiosity in this highly disputed area is the apparent contradiction between popular and scientific writers (Gray, 1992; Pease & Pease, 2002). There are many popular books that portray a simple evolutionary perspective that describes, and even rejoices in, sex differences in almost all human behaviour, but particularly communication, relationships and work. These are contrasted with the measured and cautious academic books and papers that note how complex some of these seemingly simple questions are, and how all the answers require numerous qualifiers (Halpern et al., 2007).
Inevitably there are two strongly competing, opposite forces: those who stress the biology of difference and those who stress the sociology of similarity. The former often suggest that these differences are immutable, though it is accepted that all innate traits can be changed with experience. Whilst nearly everyone acknowledges that we are biopsychosocial beings there are those who see us more as BIOpsychosocial as opposed to biopsychoSOCIAL. This all concerns explaining how and why observed differences occur (Furnham, 2017).
At the heart of the issue is the quality and quantity of sex differences, their cause and consequence. Though the focus has always been on differences, the trend has been to talk of similarities which is what a great deal of the literature suggests. Indeed, it has been argued that the word difference is too easily confused with deficiency. The same is true of the words sex and gender: the former applying to biological distinctions and the latter sociological categories (Furnham, 2017).
As a result, there is a sophisticated and subtle nature-nurture debate that persists across many interrelated disciplines whose practitioners study human behaviour (Furnham & Kanazawa, 2020). Many academics who view gender as the product of socialization and cultural factors are split into opposing camps based on whether there are large or insignificant differences between women and men, or males and females (discussed by Buss & Schmitt, 2011).
There is an extensive and growing literature on how evolution creates systematic variation in personality (Nettle, 2006; Penke et al., 2007). Scholars attempt to explain how culture, biology, and evolution interact to collectively shape personality (Fischer, 2018). Evolutionary psychology posits various sources of sex differences, such as sexual selection (intersexual selection and intrasexual competition) and the theory of obligatory parental investment (Archer, 1996, 2009; Buss, 1995; Geary, 2010). Moreover, evolutionary psychologists attempt to describe and explain how evolutionary processes shape sex differences in personality and the specific reasons why we might expect, or not expect, to see sex differences in specific personality traits (Del Giudice et al., 2012; Lippa, 2010; Schmitt et al., 2008). There is also a salient literature on the proposed cultural origins of gender, more particularly the purported sociocultural factors that shape gender symmetry (Hyde, 2007) or differences (Eagly & Wood, 1999).
Behaviour geneticists decompose total variance in personality and other individual traits into three components: heritability (genes), shared environment (everything that happens within the family that makes siblings from one family similar to each other but different from those from other families), and unshared environment (everything that happens within and outside the family that makes siblings from one family different from each other) (Plomin et al., 2012). Behaviour geneticists contend that the rough rule of thumb when it comes to the determinants of adult personality and other traits is 50-0-50, that is, roughly 50% of the variance in personality, behaviour, and other traits is heritable (influenced by genes), roughly 0% is due to the shared environment (what happens within the family and is experienced similarly by all siblings), and roughly 50% is due to the non-shared environment (what happens inside and outside the family not shared by siblings). Tooby and Cosmides (2005) talk of the standard social science model (SSSM) of the brain. Adherents of the SSSM argue that the brain is a general-purpose device that is almost entirely shaped by culture and that individual differences are explained by social environment and learning (Vrabel & Zeigler-Hill, 2017). Tooby and Cosmides (2005) purport that their integrated model is superior to the SSSM because it integrates both culture and evolutionary biology. However, among evolutionarily-minded scholars, some believe that this distinction represents a false dichotomy (Richardson, 2007; Wallace, 2010). Suffice it to say this remains a highly contentious academic area of research.
Reviewers of this topic can be described as Maximizers vs Minimizers. Maximizers want to find and explain the (many large) differences between the sexes while the Minimizers want to emphasize how few real and meaningful differences there are (Furnham, 2017). Part of this debate can be seen in the interpretation of Cohen's d, which is an indicator of difference. Whilst there are conventions about how to label the difference as none, trivial, small, medium, large and very large, this is also contested. Most researchers quote Cohen, who suggested that d = 0.2 be considered a 'small' effect size, 0.5 a 'medium' effect size and 0.8 a 'large' effect size. This means that if two groups' means do not differ by 0.2 standard deviations or more, the difference is trivial, even if it is statistically significant, although these cut-off points have been disputed. However, there is an interesting literature which suggests that d differs according to a number of factors (subdiscipline, sample size) and that in some areas of research a d of .25 to .35 could be considered medium (Hemphill, 2003; Greenwald et al., 2015; Schäfer & Schwarz, 2019).
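The conventions described above can be illustrated with a minimal sketch. It assumes the common pooled-standard-deviation form of Cohen's d; the function names and the scores are hypothetical, invented purely for illustration, and are not the data or the exact computation used in this study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def cohen_label(d):
    """Label |d| with Cohen's conventional cut-offs (0.2, 0.5, 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "trivial"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical scale scores (assumption: not taken from any sample in this paper).
males = [52, 48, 55, 50, 47, 53, 49, 51]
females = [50, 47, 53, 49, 46, 52, 48, 50]

d = cohens_d(males, females)
print(f"d = {d:.2f} ({cohen_label(d)})")
```

The sign of d only records which group scored higher; the label depends on its absolute size. Under the alternative benchmarks cited above (Hemphill, 2003), the cut-offs in `cohen_label` would simply be lowered.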
To what extent do these results matter? In the applied context, it might be useful to contrast the approach taken with IQ tests, which were originally designed to eliminate sex differences as far as possible. This does not seem to have been done by personality test creators. Most researchers do not worry about sex differences on IQ tests because, in essence, there are none, but perhaps we should worry about them on personality tests, where they could clearly have an impact in selection contexts. It is particularly interesting when different personality tests essentially measure the same trait but yield markedly different d values.

This Study
This study is on sex differences in personality. We report data on six questionnaires, four of which are well known and have overlapping dimensions (like Conscientiousness). We were fortunate to have large data sets on each of these, though we are aware that there are other important and well-used personality tests, some of which measure other dimensions (e.g. HEXACO). We are also aware that one test we report on, namely the MBTI, has been heavily criticized academically, though it is still very frequently used in applied and consulting settings, and thus we examine it along with the others (Barbuto Jr, 1997; Furnham, 2018). We also examine the MVPI (see below), which strictly speaking measures motives and values rather than traits, but yields some interesting and important results.
We believe this study has various unique features. First, while it replicates many other studies, it does so in overpowered samples often comprising many thousands of adults. Second, it examines the differences in six different well-known tests, whereas previous studies nearly all examined only one test. This allows the possibility of looking at differences between tests that measure the same construct (e.g. NEO Neuroticism; HPI Adjustment; HPTI Adjustment). Third, for two of the well-known measures (NEO; HPI) we were able to examine differences at both domain and facet level. Fourth, for two questionnaires there were two large samples so that it was possible to examine replications. In all studies the respondents were adults whose first language was English.
Usually test manuals provide information on group differences such as ethnicity and gender. Sometimes these data are very out-of-date and restricted to one continent. Surprisingly, the N is also often modest. Moreover, it seems to be the case that test publishers are eager to show as few group differences as possible as this may influence potential buyers of the test (Furnham, 2018). For each of the tests used in this study the manuals were consulted to examine the data on sex differences. Each provides good evidence of the internal and test-retest reliability of the test scores. This led to the development of the hypotheses, though the major concern was the size of the differences.

Participants
There were seven different samples, most with over 1000 participants, used in this study. The focus was on sex differences and these are shown in each table. People ranged in age from 24 to 69 years, with the majority being in their late thirties. For each questionnaire there was no significant difference in age between males and females. In most of the samples the majority (over 50%) were graduates, and once again it was established that there was no difference in education level between males and females. Most were working adults in supervisory and management positions from a very wide range of organisations. We did not have data on the participants' socioeconomic status or their work history. Because the participants were nearly all at middle manager level in their organisations, there was a bias towards males, who were often twice as numerous as females (see study limitations). Participants self-identified as either male or female: there was very little missing data for this question.

Instruments
1. The Myers-Briggs Type Indicator, Form G (MBTI; Myers & McCaulley, 1985). This is a Jungian-based inventory composed of 94 forced-choice items that yield scores on each of the eight factors as well as the famous four dimensions: Introversion-Extraversion, Sensation-Intuition, Thinking-Feeling and Judging-Perceiving. Respondents are classified into one of 16 personality types based on the largest score obtained for each bipolar scale. The test provides linear scores on each dimension which are usually discussed in terms of types based on cut-off scores. The Myers-Briggs Type Indicator has been the focus of extensive research and substantial evidence has accumulated suggesting the inventory has satisfactory concurrent and predictive validity and reliability.
2. The NEO Personality Inventory Revised (NEO-PI-R) (Costa & McCrae, 1992).
6. The High Potential Trait Indicator (HPTI; ... Furnham, 2014). The HPTI is a measure of personality traits, specifically within a workplace context. It comprises six factors: Adjustment, Curiosity, Ambiguity Acceptance, Conscientiousness, Courage, and Competitiveness. The inventory is 78 items in length. Each trait is converted into a percentile rank based on the normal distribution of the sample. Various papers have been published using this measure (Furnham & Treglown, 2018).

Procedure
Participants were tested by three well-established British-based psychological consultancies over a period of 10 to 16 years, during which participants attended assessment centres and their data were logged. The same participants tended to complete the MBTI and NEO-PI-R, for which the data were obtained from one consultancy; the three Hogan instruments (HPI, HDS, MVPI), for which the data were obtained from a second consultancy; and the HPTI, for which the data were obtained from the third consultancy. Testing took place either in assessment centres or online as part of a recruitment or development process, and all participants were given full feedback on their test performance. They came from a wide range of organisations in the private and public sector. Participants agreed to take part in research and anonymised data were used in the analysis with their permission. Data sets were given to the authors for analysis with all tests already scored, which means we could not calculate alphas, though we have no reason to believe there were any problems with them (Hogan et al., 2007). Ethics permission was requested and received (CEHP: 2017; 514).

Results
Data were first screened for random responding, missing data, and other errors. In each analysis we started with MANOVAs, each of which was significant, followed by one-way ANOVAs. Bonferroni corrections (p < .01) were made, which meant a number of analyses (12 in all) ceased to be significant. In the interpretation we focused only on results where p < .001, though our primary focus was on the Cohen's d score. We assumed that d < .20 was a small difference and d < .50 a medium-sized difference.
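The two steps described above, a Bonferroni-adjusted significance threshold and the sorting of d values into size bands, can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually run: the family of tests behind the corrected threshold is not reported, so the sketch assumes a nominal alpha of .05 spread over five tests (which happens to give .01), and all p and d values are hypothetical.

```python
from collections import Counter


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def surviving(p_values, alpha=0.05):
    """Flag which p-values remain significant after the correction."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


def band(d):
    """Sort |d| into the three bands used when tallying results."""
    size = abs(d)
    if size < 0.20:
        return "< .20"
    if size <= 0.50:
        return ".20 to .50"
    return "> .50"


# Hypothetical p-values and effect sizes (assumption: invented for illustration).
p_values = [0.0004, 0.008, 0.012, 0.03, 0.2]
d_values = [0.05, 0.14, 0.22, 0.33, 0.48, 0.53]

print(surviving(p_values))
print(Counter(band(d) for d in d_values))
```

With large samples, many comparisons survive the corrected threshold even when their d falls in the lowest band, which is why the tallies by band are reported alongside the significance tests.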
1. MBTI Table 1 shows that males scored higher than females on all dimensions, particularly Thinking-Feeling, where the d was in the .20 < d < .50 range. Males scored higher on Sensing and higher on Judging, which is consistent with the literature. This confirms H1 and H2.
2. NEO-PI-R Table 2 shows that all Big Five factors showed significant sex differences. Females scored higher on four of the five traits, particularly Openness and Neuroticism, but lower on Conscientiousness. All but three of the facets revealed significant differences. With few exceptions the facets within each domain showed consistent differences. Exceptions were Assertiveness and Excitement Seeking in Extraversion, where males scored higher than females. Of the 30 d scores, 17 were < .20, 16 were .20 < d < .50 and only one > .50 (Feelings in the Openness factor). This confirms H3 and H4.
3. HPI Table 3 shows all seven domain factors were significant. Males scored significantly higher on Adjustment, Ambition, Sociability and Inquisitive, but lower on Interpersonal Sensitivity, Prudence, and Learning Approach. Once again, the facets within each domain tended to be consistent both in direction and significance. Of the 50 analyses, 32 showed d scores < .20, 17 were .20 < d < .50 and one > .50 (Curiosity in the Inquisitive factor). This confirms H5, H6 and H7.
4. MVPI Table 4 shows the results from the two different samples. The results were reasonably consistent. In both samples males scored significantly higher than females on Recognition, Power, Commerce and Science but lower on Hedonism, Altruism, Affiliation, and Aesthetics. Combining the two samples, 8 of the d's were < .20, 10 were .20 < d < .50 and two were > .50 (Commerce in Sample 1 and Science in Sample 2). This confirms H8 to H11.
5. HDS Table 5 also shows data from two different samples, which once again were reasonably consistent. In both samples females were significantly higher on Excitable (Borderline), Cautious and Dutiful, while males scored higher on Sceptical, Reserved, Bold, and Mischievous. Combining the two samples, of the 22 differences 16 were d < .20 and 6 were .20 < d < .50. This confirms H12 to H16.
6. HPTI Table 6 shows the results of gender difference tests for the six HPTI traits. Significant differences were noted for all six traits, with males scoring higher on Conscientiousness, Adjustment, Risk Approach, Ambiguity Acceptance, and Competitiveness, whereas female participants scored higher on Curiosity. Effect sizes revealed that only Risk Approach (d = .32) and Ambiguity Acceptance (d = .33) had small effect sizes, whilst the rest can be regarded as negligible (d < .20).

Discussion
The results of this study can be interpreted in various ways. A sex-difference maximiser would note that a cursory glance at the six tables shows that the vast majority of ANOVAs (over 80%) show significant sex differences, many at p < .001, illustrating the fundamental point that there are many and important sex differences in personality, using a variety of measures and assessed at both the domain and facet level. On the other hand, the minimiser might take comfort in the effect size data (Cohen's d) and note that there are very few large or even medium effect sizes, though this depends on how size is categorised. Nearly all the hypotheses based on the previous literature were confirmed.
Overall, the MBTI showed relatively small differences except in the Thinking-Feeling variable, which has been the topic of much debate. It has been suggested (and refuted) that this factor is essentially measuring Neuroticism, hence the higher score for females, which is consistent with the previous literature (Furnham, 2018).
On the NEO-PI-R, the biggest domain differences were for three traits where females scored higher than males. The most unusual finding was the big difference on Openness (which was also shown in the HPTI trait of Curiosity), where there is a limited literature and few speculations on sex differences. The smallest and fewest differences were on Conscientiousness and its facets. The facet analysis gave some indication of variability within domain, but there were few cases where the differences went in the opposite direction. Two exceptions were the facets of Assertiveness and Excitement Seeking in Extraversion where, as in many other studies, males scored higher than females. Interestingly, the highest d was for the Openness facet Feelings (d = .53), which reflects the finding in the MBTI (Furnham, 1996).
The results of the HPI confirm previous studies, with the biggest domain d's being for Adjustment, Ambition and Curiosity, with males scoring higher, and Interpersonal Sensitivity, with females scoring higher. Again, most of the facet scores went in the same direction though they did occasionally differ greatly in size: compare empathy and calmness in Adjustment.
The results of the replicated MVPI study showed two things: where there were significant differences the results went in the same direction, and the biggest differences lay in males' interest in power, business and science, values associated with entrepreneurship and work success (Furnham, 2018). Further, as in previous studies, females scored higher in Altruism and Aesthetics.
The findings from the HDS show similar outcomes in the two studies. When grouping the eleven traits into the recommended tripartite system the results are clear: females tend to have higher scores on those traits in the moving away from others category (Cautious but not Reserved) and the moving toward others category (Dutiful but not Diligent), while males score higher on traits in the moving against others category (especially Mischievous).
The final measure, the HPTI, showed two of its six scales with relatively large differences: males scored higher in Risk Approach and Ambiguity Acceptance, which has been shown many times before. Although there was a sex difference on Competitiveness, the size of this was modest.
One interesting comparison could be between the scores of different tests which essentially (claim to) measure the same construct. Thus, the sex difference d for Neuroticism in the NEO-PI-R was .35, Adjustment in the HPI was .30 and Adjustment in the HPTI was .14. Similarly, Conscientiousness in the NEO-PI-R was .12 and in the HPTI was .11, and Prudence .06. Equally, the sex difference d for Agreeableness in the NEO-PI-R was .32 and for Interpersonal Sensitivity in the HPI was .30. Therefore, the results seem to suggest similar sex differences on scales of different lengths and with different questions measuring the same phenomenon. There were, however, exceptions: females were more Extraverted and Open on the NEO-PI-R, but less Sociable and Curious on the HPI.
One interesting issue concerns revisiting each question and facet to determine whether there was any inherent sex bias in the question wording and whether, if these were removed, the overall d would decline. This is not an issue of attempting to deny or reduce differences that exist but rather of trying to reduce artefacts arising from question selection. Certainly, with changes in society, particularly with reference to sex and gender differences, questionnaire wording could cause both offence and differences in interpretation unless it is constantly updated.
Another issue to arise from this study is the great variability in the facet score items and labels that are essentially measuring the same dimension. Compare for instance the six Openness facets of the NEO-PI-R with six facets of the HPI. Given these labels it is expected that these two measures are relatively weakly correlated and measuring rather different factors.
Finally, accepting that there are some real, biologically based, stable sex differences, as opposed to socialised gender differences, in personality traits, the question arises as to why they occur. Results such as these cannot inform the nature-nurture debate, with (most) evolutionary psychologists offering a cohesive (and for some convincing) argument as to why there are replicable, consistent and cross-cultural findings. Minimizers who reject the "biology as destiny" approach attempt to explain all these differences in terms of primary and secondary socialisation (Buss, 1995). However, in a big review study Schmitt et al. (2017) concluded: "Social role theory appears inadequate for explaining some of the observed cultural variations in men's and women's personalities. Evolutionary theories regarding ecologically-evoked gender differences are described that may prove more useful in explaining global variation in human personality" (p. 45).
This study, like all others, has limitations. All participants were British adults taking part in a compulsory assessment centre. Though they might have been tempted by impression management, there is no reason to suspect that there were sex differences in this behaviour. The fact that males outnumbered females tended to reflect the profile of middle managers in those organisations, which covered all sectors, public and private. The sample was thus biased in terms of age, education and class, and the question remains whether a more representative sample of people from a wider age range and social class background would have shown more or fewer sex differences. Furthermore, nearly all participants were from Europe and the effects of culture were thus not explored. It could be that sex differences are smaller in more Western, individualistic, democratic, egalitarian, and higher gender-parity cultural contexts than in more traditional, developing countries.
It has been argued that personality changes over time and it may be that sex differences and similarities in personality are different for young, middle-aged and older participants (Roberts et al., 2001). Finally, there is always the possibility that there are sex differences in self-report behaviours and biases, such that females exhibit more humility and males more hubris, and that therefore some observed differences are due more to other factors and artefacts than to actual personality differences.
Funding
Open access funding provided by Norwegian Business School.

Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.