A literature search was performed in PsycInfo and Web of Science to identify international publications on the EQ or SQ. The search terms “EQ”, “Empathy Quotient”, “empathizing”, “SQ”, “Systemizing Quotient” and “systemizing” were used. Only studies written in the English language describing the EQ and/or SQ scores of a healthy sample were included. Studies that only included selected samples, such as patient samples or student samples of specific education types, were excluded. The psychometrics of the identified studies are quantitatively described in Tables 1 and 2, and are discussed in the introduction section. A synthesis is provided in the discussion section. Studies that included an ASD sample in addition to a healthy sample were qualitatively described in the introduction section as well.
Two groups of participants were recruited. Group 1 was a community sample of 685 adults (270 males, 415 females) aged 16–84 years, with a mean age of 33 (SD 14.5) years. The participants were recruited via the social networks of the researchers and of various psychology students who collaborated on the project, and received no reward for participation. They were contacted face-to-face, by e-mail or via social media with the request to participate, and could follow a link to the online survey. In the survey, the participants first read the informed consent form and consented to participation by continuing to complete the questionnaire. Subsequently, the participants answered several demographic questions, then completed the EQ and SQ-R, and finally a questionnaire on symptoms of Attention Deficit Hyperactivity Disorder (the latter questionnaire was not included in the analyses of the current study). The study of Group 1 was approved by the Ethical Committee of Psychology of the University of Groningen (ppo-011-221, ppo-012-115; test–retest reliability study: ppo-013-077).
Half of the participants were students at the time of the survey (45.4 % fulltime, 5.3 % part time) and the other half were non-students. Half of the participants had a degree in higher professional or academic education (50.6 %), 42.2 % had a diploma in senior secondary vocational/general education or pre-university education, 5.7 % in junior secondary/general education, and 1.2 % had finished primary school only (0.3 % had missing data). A quarter of the sample had a fulltime job, 40.0 % had a part time job, 34.7 % had no occupation at the time of the study, and 0.3 % had missing data. One third of the sample was single (32.1 %), one third was married (28.9 %) or had a registered partnership (2.0 %), and one third had a partner with whom they were living together (11.5 %) or living apart together (22.8 %); a minority was divorced (2.0 %) or widowed (0.6 %). In the sample, 5.1 % indicated that they had been diagnosed with a mental disorder and 2.8 % indicated using medication for mental complaints.
For the calculation of the test–retest reliability, 164 participants who had left their e-mail address for a follow-up study were asked to complete the EQ and SQ-R again. In total, 58 participants (22 males, 36 females) completed both questionnaires a second time, with an average time between test and retest of 15 months (range 6 to 20 months). Nineteen participants were students at the time of the study; of the 58 participants, 40 had a part-time or fulltime job and the remaining 18 had no job. More than half of the participants (n = 36) had a degree in higher professional or academic education. Three participants indicated that they had been diagnosed with a mental disorder and two used medication for mental complaints.
Group 2 consisted of 42 males with a formal clinical diagnosis in the autistic spectrum, aged 17–34 years with a mean age of 22 (SD 4.2) years. The participants were recruited from the outpatient clinic of Accare, i.e. the University Child and Adolescent Psychiatry Center, Groningen, and from the Autism Team of the Northern Netherlands, Jonx/Lentis, Groningen. All patients had been assessed for the presence of an ASD according to the DSM-IV criteria by at least one experienced clinician who was not involved in this study. Patients had to meet the criteria for Autistic Disorder, Asperger’s Syndrome or Pervasive Developmental Disorder Not Otherwise Specified. The assessment was performed by means of extensive (hetero)anamnestic interviews. For 12 cases the Autism Diagnostic Observation Schedule (ADOS) (Lord et al. 2000) had been administered during the clinical assessment, and all 12 cases obtained clinical scores on at least one of its subscales. More specifically, 7 patients scored above the clinical cut-off on the Communication Scale (2+), 10 patients scored above the clinical cut-off on the Social Interaction Scale (4+), and 3 patients scored in the clinical range on both scales. For the remaining 30 patients no gold standard diagnostic measure for autism was available at the time of data collection. These patients were described by means of the AQ and SRS-A (see “Materials” section). Moreover, patients were only included when the clinician judged them to have an intelligence level within the normal range (IQ ≥ 80). In case of doubt, patients completed a short version of the Groninger Intelligence Test (GIT) (Luteijn and Barelds 2004). The GIT was administered in 19 cases, and this subgroup scored in the range of 80–128 with an average IQ of 103 (SD 15.3).
The patients completed paper-and-pencil versions of the EQ and SQ-R as part of a (pilot for a) treatment study which was approved by the Medical Ethical Committee of the University Medical Center Groningen (METc 2010.133).
EQ (Group 1 and 2)
The original 40-item EQ plus 20 filler items (Baron-Cohen and Wheelwright 2004) was translated into Dutch by the author YG and the translation was checked by the author AdH (the Dutch EQ, scoring key, and norm table can be requested from the corresponding author). The EQ items are rated on a 4-point Likert-scale (strongly agree, slightly agree, slightly disagree, strongly disagree). The 20 filler items (2, 3, 5, 7, 9, 13, 16, 17, 20, 23, 24, 30, 31, 33, 40, 45, 47, 51, 53, 56) are not counted in the scoring. A three point scoring system was adopted from Baron-Cohen and Wheelwright (2004), discriminating ‘lacking’, ‘mildly’ and ‘strongly’ empathic behaviour. The 21 forward items (1, 6, 19, 22, 25, 26, 35, 36, 37, 38, 41, 42, 43, 44, 52, 54, 55, 57, 58, 59, 60) are scored 2 for ‘strongly agree’, 1 for ‘slightly agree’, and 0 for ‘strongly disagree’ and ‘slightly disagree’. The 19 reversed items (4, 8, 10, 11, 12, 14, 15, 18, 21, 27, 28, 29, 32, 34, 39, 46, 48, 49, 50) are scored 2 for ‘strongly disagree’ and 1 for ‘slightly disagree’, and 0 for ‘strongly agree’ and ‘slightly agree’. Previous factor-analytic studies distinguished three EQ subscales labelled ‘Cognitive Empathy’ (CE), ‘Emotional Empathy’ (EE), and ‘Social Skills’ (SS), either based on 28 items (Lawrence et al. 2004) or on 15 items (Muncer and Ling 2006). See Table 3 for an overview of the items belonging to each subscale.
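The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scoring key: the item numbers are taken from the text, but the numeric response coding (1 = strongly agree to 4 = strongly disagree) and the function name `score_eq` are assumptions made for the example.

```python
# Item numbers as listed in the text (1-based positions in the 60-item form).
EQ_FORWARD = {1, 6, 19, 22, 25, 26, 35, 36, 37, 38, 41, 42, 43, 44,
              52, 54, 55, 57, 58, 59, 60}
EQ_REVERSED = {4, 8, 10, 11, 12, 14, 15, 18, 21, 27, 28, 29, 32, 34,
               39, 46, 48, 49, 50}

# Assumed response coding (hypothetical):
# 1 = strongly agree, 2 = slightly agree, 3 = slightly disagree, 4 = strongly disagree.
FORWARD_POINTS = {1: 2, 2: 1, 3: 0, 4: 0}
REVERSED_POINTS = {1: 0, 2: 0, 3: 1, 4: 2}


def score_eq(responses):
    """Return the EQ total score (0-80) for a dict {item number: response code}.

    Items that are neither forward nor reversed are fillers and are ignored.
    """
    total = 0
    for item, answer in responses.items():
        if item in EQ_FORWARD:
            total += FORWARD_POINTS[answer]
        elif item in EQ_REVERSED:
            total += REVERSED_POINTS[answer]
    return total
```

A respondent who answers "strongly agree" to every item scores only on the forward items, which is why acquiescent responding cannot produce the maximum score of 80.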
SQ-R (Group 1 and 2)
The revised version of the SQ (SQ-R) (Wheelwright et al. 2006) was translated into Dutch by the author YG and the translation was checked by the author AdH (the Dutch SQ-R, scoring key, and norm table can be requested from the corresponding author). The 75 items are rated on a 4-point Likert-scale (strongly agree, slightly agree, slightly disagree, strongly disagree). A three point scoring system was adopted from Wheelwright et al. (2006), discriminating ‘lacking’, ‘mildly’ and ‘strongly’ systemizing behaviour. The 36 reversed items (3, 6, 8, 10, 15, 17, 22, 24, 26, 28, 31, 33, 34, 35, 37, 39, 40, 44, 45, 47, 48, 49, 51, 52, 54, 56, 57, 58, 59, 63, 64, 65, 67, 70, 71, 73) are scored 2 for ‘strongly disagree’ and 1 for ‘slightly disagree’, and 0 for ‘strongly agree’ and ‘slightly agree’. In contrast, the remaining 39 forward items are scored 2 for ‘strongly agree’ and 1 for ‘slightly agree’, and 0 for ‘strongly disagree’ and ‘slightly disagree’.
Brain Type: D (Group 1 and 2)
Based on the EQ and SQ-R scores, the participant’s brain type can be calculated. For this calculation, the EQ and SQ-R total scores were first standardized using the estimated population means of Group 1 (n = 685), with the formulas E = [EQ − M(EQ)]/(maximum possible score) and S = [SQ-R − M(SQ-R)]/(maximum possible score). A continuous measure for brain type was then calculated by the formula D = (S − E)/2 (see Wheelwright et al. 2006). A positive score on D indicates brain Type S, or Extreme Type S, a negative score indicates brain Type E, or Extreme Type E, and a score close to zero indicates brain Type B.
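As a worked illustration of the D calculation, the sketch below reads the standardization as subtracting the Group 1 mean and then dividing by the maximum possible score, following Wheelwright et al. (2006). The maximum scores follow from the scoring rules (40 scored EQ items and 75 SQ-R items at 2 points each). The means passed in the usage line are placeholders, not the sample means of this study, and the function name is hypothetical.

```python
EQ_MAX = 80     # 40 scored items x 2 points
SQR_MAX = 150   # 75 items x 2 points


def brain_type_d(eq, sqr, mean_eq, mean_sqr):
    """Continuous brain type measure D = (S - E) / 2.

    E and S are the mean-centred EQ and SQ-R totals, each divided by the
    maximum possible score of the respective scale.
    """
    e = (eq - mean_eq) / EQ_MAX
    s = (sqr - mean_sqr) / SQR_MAX
    return (s - e) / 2


# Placeholder means (hypothetical values chosen for easy arithmetic).
d = brain_type_d(eq=32, sqr=90, mean_eq=40, mean_sqr=75)
```

With these placeholder values the respondent lies 0.1 below the mean on E and 0.1 above the mean on S, so D is positive, pointing towards Type S.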
FQ (Group 1)
The Friendship Questionnaire (FQ) is a 35-item self-report questionnaire measuring a person’s enjoyment and importance of friendships and interest in other people (Baron-Cohen and Wheelwright 2003), translated by Uzieblo, De Corte, Crombez, and Buysse (unpublished). On each item, the participants have to decide which statement about friendships and social interactions is most applicable to them. On each item two, three, four or five statements are presented. For example, on item 1 the participant has to choose between the following statements: “I have one or two particular best friends”; “I have several friends who I would call best friends”; “I don’t have anybody who I would call a best friend”. Twenty-seven out of 35 items are included in the scoring with a maximum score of 5 per item, resulting in a maximum total score of 135. Approximately half of the items are reverse keyed items. The FQ was demonstrated to have high internal consistency (Cronbach’s α = 0.84) in a mixed ASD and healthy control sample, and good criterion validity as demonstrated by large sex differences (females scoring higher than males) and individuals with ASD scoring lower than healthy controls with large effect size (Baron-Cohen and Wheelwright 2003).
SRS-A (Group 2)
The Social Responsiveness Scale-Adults (SRS-A) (Noens et al. 2012) is a scale that can be used as both a screening test and as an aid to clinical diagnosis for ASD (Aldridge et al. 2012). It consists of 64 items covering the various dimensions of interpersonal behaviour, communication and repetitive/stereotypic behaviours that are typical for ASD, and is rated on a 4-point Likert scale (not true, sometimes true, often true, almost always true). For 20 patients the informant version of the SRS-A was available which was completed by a close relative or friend (mean score 77; SD 32; range 31–147). Of this subgroup, 13 patients scored above the clinical cut-off score of 61.
AQ (Group 2)
The Dutch translation of the Autism Spectrum Quotient (AQ) (Hoekstra et al. 2008) was administered to 40 patients in order to describe their self-experienced autistic traits. The AQ consists of 50 items assessing personal preferences and habits related to ASD, and is rated on a 4-point Likert scale (definitely agree, slightly agree, slightly disagree, definitely disagree). Half of the items are reverse keyed. The British scoring method with dichotomized answer categories (agree, disagree) was used (Baron-Cohen et al. 2001).
For 40 of the 42 patients an AQ was available and this group obtained a mean score of 25 (SD 7.7; range 9–40). Only 8 of them scored above the clinical cut-off score for the British population of 32+. Interestingly, 15 out of the 17 patients who scored beneath this AQ cut-off score and for whom also an ADOS or SRS-A score was available obtained a clinical score on the ADOS or SRS-A. This confirms that most of the patients in this study reporting subclinical AQ scores still show problems in the autistic spectrum according to their friends, families or professionals. In comparison to the other clinical ratings, low AQ scores may be a consequence of the patients’ impairment in self-reflection and awareness of their autistic traits which has previously been observed in patients with Asperger syndrome (Jackson et al. 2012). An underestimation of autistic traits has also been observed in children/adolescents with ASD, as compared to parent ratings (Johnson et al. 2009).
All analyses, except for the factor analyses, were performed using SPSS 20 (IBM Corp.). Although the SQ-R (75 items) and EQ (40 items) were completed by a large number of participants (n = 691), there were few missing values, which were due to participants occasionally omitting an answer to a question. Missing values were handled with an imputation model (including all variables of the respective scale) that was estimated by maximum likelihood (ML), yielding a singly imputed data set. One respondent did not complete the SQ-R and was therefore excluded. Five further participants were excluded because they were suspected of careless completion of the questionnaires: their answers to FQ items 30 and 34 were not credible. Because of an error in the online survey, some participants did not provide an answer to FQ item 30 (n = 161), item 34 (n = 155), and item 25 (n = 381). These items were also imputed using a model estimated by ML. Data of 685 respondents entered the final analysis.
Confirmatory Factor Analyses (CFA)
Factor analysis was performed in LISREL 8.8 (Jöreskog and Sörbom 2006) in order to determine whether the three-factorial structure of the EQ and the one-factorial structure of the SQ-R could be replicated in this Dutch sample. With regard to the EQ, separate confirmatory factor analyses (CFA) were performed on the 28-item version (Lawrence et al. 2004) and the 15-item version (Muncer and Ling 2006) in order to test whether the data fitted the previously proposed three-factor structure. The following models were tested: (1) a three-factor model competing with a one-factor model of the 28-item version, (2) a three-factor model competing with a one-factor model of the 15-item version, (3) a one-factor model of the 40-item version without competing models. The original 40 items of the EQ as well as the item distributions across the three factors (CE, EE, and SS) for both the 15-item version and the 28-item version are presented in Table 3. The three-factor models were only tested for the short versions, because no factor structure had previously been proposed for the EQ containing all original 40 items, and some of the items load strongly on social desirability (Berthoz et al. 2008; Dimitrijevic et al. 2012; Preti et al. 2011). The Diagonally Weighted Least Squares (DWLS) estimation method was applied for all CFAs because of the ordered-categorical response format. Scaling of latent variables was achieved by setting the factor variance to 1. The t-rule was applied for identification of latent variables (Bollen 1989). All analyses were carried out on the total sample of healthy participants (n = 685), which considerably exceeds the criterion of a minimum sample size of 200 respondents for CFA (Hinkin 1998).
The fit of each factor structure was evaluated using the following CFA statistics: the Chi-Square value with its corresponding p value, the normed Chi-Square (χ2/df), the Root Mean Square Error of Approximation (RMSEA) with its 90 % confidence interval (CI), the Standardized Root Mean Square Residual (SRMR) and the Comparative Fit Index (CFI). The Chi-Square value with its corresponding p value belongs to the class of absolute fit indices (Hu and Bentler 1999). A disadvantage of the Chi-Square statistic is that both deviations from normality and large sample sizes may result in model rejection (Hooper et al. 2008). Therefore, less weight was given to the Chi-Square test than to the descriptive measure of the normed Chi-Square (Wheaton et al. 1977). Recommended upper limits for an acceptable normed Chi-Square range from 5.0 down to 2.0, with values below 3.0 indicating a good fit (Hinkin 1998; Hooper et al. 2008). Furthermore, the RMSEA and its 90 % CI were calculated (Steiger 1990). An RMSEA of 0.07 is commonly taken as the upper limit for acceptable fit (Steiger 2007), and the upper limit of the CI of the RMSEA should be less than 0.08 (Hooper et al. 2008). The SRMR ranges from 0 to 1, with acceptable models obtaining values up to 0.08 (Hu and Bentler 1999). The CFI (Bentler 1990) is a revised version of the Normed Fit Index (NFI) that takes sample size into account. Recommended cut-offs for good model fit are a CFI of ≥0.90 or, more strictly, ≥0.95 (Hu and Bentler 1999). The goodness-of-fit statistics of each factor model were compared to the cut-offs and recommendations cited above.
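To make the relation between these indices concrete, the following sketch computes the normed Chi-Square, RMSEA and CFI from the Chi-Square values of a fitted model and of the baseline (independence) model, and compares them with the cut-offs cited above. It uses the common textbook formulas. LISREL's exact computations (for example the use of N versus N − 1, or corrections applied under DWLS) may differ, so this is an approximation for illustration only. The SRMR is omitted because it requires the residual correlation matrix. All function and argument names are hypothetical.

```python
import math


def fit_indices(chi2, df, n, chi2_base, df_base):
    """Normed chi-square, RMSEA and CFI from model and baseline chi-square values."""
    noncentrality = max(chi2 - df, 0.0)
    noncentrality_base = max(chi2_base - df_base, noncentrality, 0.0)
    rmsea = math.sqrt(noncentrality / (df * (n - 1)))
    cfi = 1.0 if noncentrality_base == 0 else 1.0 - noncentrality / noncentrality_base
    return {"normed_chi2": chi2 / df, "rmsea": rmsea, "cfi": cfi}


def meets_cutoffs(indices):
    """Compare the indices with the cut-offs cited in the text."""
    return {
        "normed_chi2": indices["normed_chi2"] < 3.0,   # good fit below 3.0
        "rmsea": indices["rmsea"] <= 0.07,             # upper limit 0.07
        "cfi": indices["cfi"] >= 0.90,                 # lenient CFI cut-off
    }


# Made-up input values, chosen only to give round results.
example = fit_indices(chi2=200.0, df=100, n=401, chi2_base=2100.0, df_base=100)
```

The example shows why the normed Chi-Square and RMSEA can look acceptable even when the Chi-Square test itself rejects the model: both rescale the excess of Chi-Square over its degrees of freedom, the RMSEA additionally by sample size.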
Exploratory Factor Analysis (EFA)
With regard to the SQ-R, exploratory factor analysis was conducted using principal axis factoring (PAF) with an oblique rotation method to test whether there are several statistically meaningful clusters of items representing psychologically meaningful concepts. Parallel analysis was performed to determine the number of factors to retain in PAF. In parallel analysis (PA-PAF), random data matrices of the same size as the actual data set were generated and eigenvalues were computed for the correlation matrices of each of the random data sets. Eigenvalues of the random data sets and of the actual data set were compared, and the number of factors to retain was determined by those factors whose eigenvalues in the actual data set exceeded the corresponding eigenvalues in the random data sets. A scree plot inspection was additionally performed to support the factor retention criterion in PA-PAF.
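The retention rule of parallel analysis can be sketched with NumPy as follows. Factors are retained as long as the eigenvalue from the actual data exceeds the corresponding reference eigenvalue from random data. Several details are assumptions not stated in the text: the random data are drawn from a normal distribution, the reference is the 95th percentile across random data sets (the mean is another common choice), and the reduced correlation matrix uses squared multiple correlations as initial communality estimates. Function names are hypothetical.

```python
import numpy as np


def reduced_eigenvalues(data):
    """Eigenvalues (descending) of the correlation matrix with squared multiple
    correlations on the diagonal, the starting point of principal axis factoring."""
    corr = np.corrcoef(data, rowvar=False)
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    reduced = corr.copy()
    np.fill_diagonal(reduced, smc)
    return np.linalg.eigvalsh(reduced)[::-1]


def count_retained(actual, reference):
    """Number of leading factors whose actual eigenvalue exceeds the reference."""
    exceeds = np.asarray(actual) > np.asarray(reference)
    return len(exceeds) if exceeds.all() else int(np.argmin(exceeds))


def parallel_analysis(data, n_sets=100, percentile=95, seed=0):
    """Return the number of factors to retain for an (n respondents x p items) array."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    random_eigs = np.array(
        [reduced_eigenvalues(rng.standard_normal((n, p))) for _ in range(n_sets)]
    )
    reference = np.percentile(random_eigs, percentile, axis=0)
    return count_retained(reduced_eigenvalues(data), reference)
```

Counting stops at the first factor that fails the comparison, so a later eigenvalue that happens to exceed its random counterpart does not add a factor.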
The internal consistency of the SQ-R, the EQ scale and subscales was estimated using Cronbach’s α. The test–retest reliability was computed for the SQ-R and the EQ scales by Pearson correlations. Furthermore, reliability of the continuous measure for brain type (D) was derived. As the index D is a difference score of standardized EQ and SQ-R scores, its reliability was estimated by taking the reliabilities of EQ and SQ-R into consideration and controlling for the observed score correlation between both measures (Kessler 1977; Linn and Slinde 1977; Rogosa and Willett 1983).
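The reliability of a difference score can be illustrated with the classical formula, which combines the reliabilities of the two component scores with their observed correlation. The sketch below assumes uncorrelated measurement errors and is a generic textbook version. Whether it matches the exact computation used for D in this study is an assumption. The function name and the numbers in the usage lines are made up.

```python
def difference_score_reliability(rel_x, rel_y, r_xy, sd_x=1.0, sd_y=1.0):
    """Reliability of the difference X - Y under classical test theory.

    rel_x, rel_y: reliabilities of the two component scores
    r_xy:         observed correlation between the components
    sd_x, sd_y:   observed standard deviations (equal by default)
    """
    covariance = r_xy * sd_x * sd_y
    true_variance = rel_x * sd_x**2 + rel_y * sd_y**2 - 2 * covariance
    observed_variance = sd_x**2 + sd_y**2 - 2 * covariance
    return true_variance / observed_variance


# Made-up inputs: the same component reliabilities with different correlations.
uncorrelated = difference_score_reliability(0.9, 0.8, 0.0)
positively_correlated = difference_score_reliability(0.9, 0.8, 0.5)
```

The comparison shows the general pattern: the more strongly the two component scores correlate positively, the less reliable their difference becomes, whereas a negative correlation works in favour of the difference score.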
For validation of the EQ and SQ-R, sex differences were tested on the EQ, SQ-R, and D. It was hypothesized that females obtain higher scores on the EQ and lower scores on the SQ-R and D, i.e. show a more empathizing brain type. Additionally, it was tested whether males with ASD, compared to control males, score lower on the EQ and higher on the SQ-R and D, i.e. show a more systemizing brain type. For this purpose, means and standard deviations of the EQ scales, SQ-R, and D (the continuous measure of brain type) were calculated for males and females of Group 1, and also for Group 2 (the males with ASD). In order to explore known-groups validity, independent-samples t-tests were used to test sex differences in Group 1 and differences between the ASD patients in Group 2 and the males of Group 1 on the EQ scales, the SQ-R, and D. Effect sizes (Cohen’s d) were calculated for all comparisons to indicate the magnitude of group differences. Further calculations were performed on the EQ version with the best psychometric properties (15-item, 28-item or 40-item version). To further explore criterion validity, correlational analyses between the SQ-R, the EQ scales, D, and the FQ (only in Group 1) or the AQ (only in Group 2) were performed in Groups 1 and 2, using Pearson’s correlation coefficients. The correlational analyses were run separately for the two groups in order to explore differential correlational patterns in the ASD group compared to the typical group, e.g. the strength of the trade-off between EQ and SQ, which has previously been suggested to be higher in people with ASD (Wheelwright et al. 2006).
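As an illustration of the effect size used for the group comparisons, the sketch below computes Cohen's d with a pooled standard deviation. The text does not state which variant of d was used, so the pooled-SD form is an assumption, and the scores in the usage line are made up.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)


# Made-up scores for two small groups.
d = cohens_d([2, 3, 4], [1, 2, 3])
```

The sign of d follows the order of the arguments, so a hypothesis such as "females score higher on the EQ" corresponds to a positive d when the female scores are passed first.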
A receiver operating characteristic (ROC) analysis was conducted in order to explore the accuracy of the SQ-R, EQ and D in detecting males with ASD relative to healthy males of Group 1. A ROC analysis distinguishes between true positive rates and true negative rates. Whereas the true positive rate (sensitivity) describes the proportion of individuals with autism who are correctly identified as having the condition, the true negative rate (specificity) describes the proportion of healthy individuals who are correctly identified as not having the condition. A ROC curve plots the sensitivity against 1 − specificity at each level of the scale under scrutiny (i.e. SQ-R, EQ or D) to predict the criterion (distinguishing healthy males from males with ASD). ROC analysis allows for determination of the overall accuracy of classification as measured by the area under the curve (AUC), as well as classification statistics to address specific goals, i.e. high sensitivity or high specificity.
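The quantities involved in a ROC analysis can be sketched in plain Python. The AUC is computed here as the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half, which is equivalent to the area under the empirical ROC curve. The sketch assumes that higher scores indicate the condition. For a scale on which lower scores are expected in ASD (such as the EQ), the scores would be negated first. Function names and data are hypothetical.

```python
def auc(cases, controls):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = sum((c > h) + 0.5 * (c == h) for c in cases for h in controls)
    return wins / (len(cases) * len(controls))


def sensitivity_specificity(cases, controls, cutoff):
    """Sensitivity and specificity when scores >= cutoff are classified as positive."""
    sensitivity = sum(c >= cutoff for c in cases) / len(cases)
    specificity = sum(h < cutoff for h in controls) / len(controls)
    return sensitivity, specificity


def roc_points(cases, controls):
    """(1 - specificity, sensitivity) pairs for every observed score used as cut-off."""
    points = []
    for cutoff in sorted(set(cases) | set(controls)):
        sens, spec = sensitivity_specificity(cases, controls, cutoff)
        points.append((1 - spec, sens))
    return points


# Made-up scores for three cases and three controls.
example_auc = auc([3, 4, 5], [1, 2, 3])
```

Lowering the cut-off moves along the curve towards higher sensitivity at the cost of specificity, which is how a cut-off can be tuned for screening (high sensitivity) or for confirmation (high specificity).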